How to remove '\xe2' from a list

2023-01-08 14:11

I am new to Python and am using it with NLTK in my project. After word-tokenizing the raw data obtained from a web page, I got a list containing '\xe2', '\xe3', '\x98', etc. However, I do not need these and want to delete them.

I simply tried

if '\x' in a

and

if a.startswith('\xe')

and both give me an error saying "invalid \x escape".

But when I try a regular expression

re.search('^\\x',a)

I get

Traceback (most recent call last):
  File "<pyshell#83>", line 1, in <module>
    print re.search('^\\x',a)
  File "C:\Python26\lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "C:\Python26\lib\re.py", line 245, in _compile
    raise error, v # invalid expression
error: bogus escape: '\\x'

Even re.search('^\\x',a) is not identifying it.

I am confused by this; even googling didn't help (I might be missing something). Please suggest any simple way to remove such strings from the list, and explain what was wrong with the above.

Thanks in advance!

You can use unicode(a, 'ascii', 'ignore') to remove all non-ASCII characters in the string at once.
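For readers on Python 3, where unicode() no longer exists, a rough equivalent is an ASCII encode with errors set to 'ignore', followed by a decode. This is only a sketch under that assumption; the helper name drop_non_ascii and the sample tokens are made up for illustration.

```python
def drop_non_ascii(text):
    # Encode to ASCII, silently skipping characters that do not fit,
    # then decode the remaining bytes back to a str.
    return text.encode("ascii", "ignore").decode("ascii")

# Hypothetical token list resembling the one in the question.
tokens = ["hello", "\xe2", "caf\xe9", "\x98", "world"]

# Clean each token, then discard tokens that end up empty.
cleaned = [t for t in (drop_non_ascii(tok) for tok in tokens) if t]
print(cleaned)  # ['hello', 'caf', 'world']
```

Note that this drops the accented letter from "café" entirely rather than replacing it with a plain "e".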

It helps here to understand the difference between a string literal and a string.

A string literal is a sequence of characters in your source code. When parsed and compiled by the Python interpreter, it produces a string, which is a sequence of characters in memory.

For example, the string literal "a" produces the string a.

String literals can take a number of forms. All of these produce the same string a:

"a"
'a'
r"a"
"""a"""
r'''a'''

Source code is traditionally ASCII-only, but we'd like it to contain string literals that can produce characters beyond ASCII. To do this, escapes can be used. For example, the string literal "\xe2" produces a single-character string whose character has the integer value E2 hexadecimal, or 226 decimal.

This explains the error about "\x" being an invalid escape: the parser is expecting you to specify the hexadecimal value of a character.
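The points above can be checked interactively. The following is a small sketch assuming Python 3, where an incomplete \x escape is reported as a SyntaxError when the source is compiled (Python 2 words the error differently, as in the question). The variable names are only for illustration.

```python
# Different literal forms, same one-character string.
assert "a" == 'a' == r"a" == """a""" == r'''a'''

# "\xe2" is a single character whose integer value is 0xE2 (226).
s = "\xe2"
print(len(s), ord(s))  # 1 226

# A raw string literal keeps the backslash, so this one has four characters.
raw = r"\xe2"
print(len(raw))  # 4

# A bare "\x" with no hex digits is rejected when the source is compiled.
try:
    compile(r"s = '\x'", "<demo>", "exec")
    bare_escape_ok = True
except SyntaxError as exc:
    bare_escape_ok = False
    print("SyntaxError:", exc)
```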

To detect if a string has any characters in a certain range, you can use a regex with a character class specifying the lower and upper bounds of the characters you don't want:

if re.search(r"[\x90-\xff]", a):
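Applied to the list from the question, that check might look like the sketch below. It assumes Python 3 and reuses the character range from the line above; the function name and the sample tokens are hypothetical, and the bounds should be adjusted to whatever range you actually want to exclude.

```python
import re

# Same character class as above: any character from \x90 through \xff.
UNWANTED = re.compile(r"[\x90-\xff]")

def drop_tokens_with_unwanted_chars(tokens):
    # Keep only the tokens in which the pattern finds nothing.
    return [tok for tok in tokens if not UNWANTED.search(tok)]

# Hypothetical token list resembling the one in the question.
tokens = ["hello", "\xe2", "\xe3", "\x98", "world"]
cleaned = drop_tokens_with_unwanted_chars(tokens)
print(cleaned)  # ['hello', 'world']
```

This removes a token if it contains any such character at all, which may be more aggressive than you want for words that merely include one accented letter.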

'\xe2' is one character. \x is an escape sequence that must be followed by a hex number and is used to specify a byte literally.
That means you have to specify the whole expression:

>>> s = '\xe2hello'
>>> s
'\xe2hello'
>>> s.replace('\xe2', '')
'hello'

More information can be found in the Python docs.
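To do the same for every item in a list and for several characters at once, one option is str.translate. This is a sketch assuming Python 3; the set of characters comes from the question, while the function name and sample tokens are invented for the example.

```python
# Characters to delete; taken from the question's examples.
UNWANTED = "\xe2\xe3\x98"

# With a third argument, str.maketrans maps each of those characters to None,
# which makes translate() delete them.
DELETE_TABLE = str.maketrans("", "", UNWANTED)

def strip_unwanted(tokens):
    stripped = (tok.translate(DELETE_TABLE) for tok in tokens)
    # Tokens made up only of unwanted characters become empty; drop them.
    return [tok for tok in stripped if tok]

# Hypothetical token list resembling the one in the question.
tokens = ["\xe2hello", "\xe2", "\xe3", "\x98", "world"]
print(strip_unwanted(tokens))  # ['hello', 'world']
```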

I see other answers have done a good job of explaining your confusion with respect to '\x'. However, while suggesting that you may not want to remove non-ASCII characters completely, they have not provided a specific way to normalize the text other than by removal.

If you want to obtain some "reasonably close ASCII character" (e.g., strip accents from letters but leave the underlying letter, etc.), this SO answer may help -- the code in the accepted answer, using only the standard Python library, is:

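import unicodedata

def strip_accents(s):
    # s must be a unicode string. NFD splits each accented letter into a
    # base letter plus combining marks; marks have category 'Mn'.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')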

Of course, you'll need to apply this function to each string item in the list you mention in the title, e.g.,

cleanedlist = [strip_accents(s) for s in mylist]

if all items in mylist are strings.
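Put together, and assuming Python 3 (where every str is already Unicode), a run over a small list might look like this. The contents of mylist are made up for the example.

```python
import unicodedata

def strip_accents(s):
    # NFD splits an accented letter into base letter + combining mark;
    # the marks have category 'Mn' and are filtered out.
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

# Hypothetical input list.
mylist = ["caf\xe9", "\xe2", "S\xe3o", "plain"]
cleanedlist = [strip_accents(s) for s in mylist]
print(cleanedlist)  # ['cafe', 'a', 'Sao', 'plain']
```

Unlike outright removal, the token '\xe2' survives here as a plain 'a'.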

Let's stand back and think about this a little bit ...

You're using nltk (natural language toolkit) to parse (presumably) natural language.

Your '\xe2' is highly likely to represent U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (â).
Your '\xe3' is highly likely to represent U+00E3 LATIN SMALL LETTER A WITH TILDE (ã).

They look like natural language letters to me. Are you SURE that you don't need them?
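One way to see what such values would stand for is to ask the unicodedata module for their names. The sketch below assumes Python 3 and, importantly, assumes the data is Latin-1 encoded text; if the page used a different encoding, the same bytes would mean something else. The helper name describe is invented for the example.

```python
import unicodedata

def describe(raw):
    # Assumption: the byte is Latin-1 (ISO 8859-1) encoded text.
    ch = raw.decode("latin-1")
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

for raw in (b"\xe2", b"\xe3"):
    print(repr(raw), "->", describe(raw))
```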

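If you only want to enter this pattern and avoid the error, you can try inserting a + between \ and x, like this:

re.search('\+x[0123456789abcdef]*',a)

Note that this compiles because \+ is a literal plus sign, so the pattern looks for the text +x followed by hex digits; it does not match the single character '\xe2'.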
