Removing right-to-left mark and other unicode characters from input in Python
I am writing a forum in Python. I want to strip input containing the right-to-left mark and things like that. Suggestions? Possibly a regular expression?
The OP, in a hard-to-read comment to another answer, has an example that appears to start like this:
comment = comment.encode('ascii', 'ignore')
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
That of course, with the two statements in this order, would be a different error (the first one tries to access comment but only the second one binds that name), but let's assume the two lines are interchanged, as follows:
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')
This, which would indeed cause the error the OP seems to have in that hard-to-read comment, is a problem for a different reason: comment is a byte string (no leading u before the opening quote), but .encode applies to a unicode string -- so Python 2 first tries to make a temporary unicode out of that bytestring with the default codec, ascii, and that of course fails with a UnicodeDecodeError because the string is full of non-ascii characters.
Inserting the leading u in that literal would work:
comment = u'\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')
(This of course leaves comment empty, since all of its characters are ignored.) Alternatively -- for example if the original byte string comes from some other source, not a literal:
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.decode('latin-1')
comment = comment.encode('ascii', 'ignore')
Here, the second statement explicitly builds the unicode with a codec that seems applicable to this example (just a guess, of course: you can't tell with certainty which codec is supposed to apply from just seeing a bare bytestring!-), then the third one, again, removes all non-ascii characters (and again leaves comment empty).
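For readers on Python 3, where str is already Unicode and bytes objects have no .encode method, the same decode-then-encode sequence might look like the sketch below. The name raw is hypothetical, and UTF-8 is an assumption: these six bytes happen to be the UTF-8 encoding of äöü, but the bytes alone do not prove which codec the sender used.

```python
# Python 3 sketch; `raw` is a hypothetical name for bytes from an external source.
raw = b'\xc3\xa4\xc3\xb6\xc3\xbc'

# Decode explicitly instead of relying on any implicit conversion.
as_utf8 = raw.decode('utf-8')      # three characters if the bytes are UTF-8
as_latin1 = raw.decode('latin-1')  # six characters if they are read as Latin-1

# Every decoded character is non-ASCII, so 'ignore' drops all of them.
stripped = as_utf8.encode('ascii', 'ignore').decode('ascii')
```

Decoding with the wrong codec does not raise an error here; it just yields different characters, which is why the codec has to be known or agreed on rather than guessed.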
If you simply want to restrict the characters to those of a certain character set, you could encode the string in that character set and just ignore encoding errors:
>>> uc = u'aäöüb'
>>> uc.encode('ascii', 'ignore')
'ab'
It's hard to guess the set of characters you want to remove from your Unicode strings. Could it be they are all the “Other, Format” characters? If yes, you can do:
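import unicodedata
# Python 2: filter() on a unicode string returns a unicode string.
# 'Cf' is the Unicode general category "Other, Format".
your_unicode_string = filter(
    lambda c: unicodedata.category(c) != 'Cf',
    your_unicode_string)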
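In Python 3, filter() returns an iterator rather than a string, so the same idea is usually written with str.join. The following is a minimal sketch under that assumption; strip_format_chars is a hypothetical helper name, not something from the answer above.

```python
import unicodedata

def strip_format_chars(text):
    """Return text without characters in Unicode category Cf (Other, Format)."""
    return ''.join(c for c in text if unicodedata.category(c) != 'Cf')

# U+200F (right-to-left mark) and U+200E (left-to-right mark) are both Cf.
cleaned = strip_format_chars('abc\u200fdef\u200e')
```

Note that category Cf also covers the zero-width joiner and non-joiner (U+200D, U+200C), which some scripts and emoji sequences rely on, so dropping the whole category may be more aggressive than a forum wants.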
"example".replace(u'\u200e', '')
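You can also remove specific characters by their code points with the .replace() method, as in the line above. Note that it strips U+200E, the left-to-right mark; the right-to-left mark from the question is U+200F.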
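Since the question asks about a regular expression, here is one possible sketch in Python 3 that removes several directional controls in a single pass instead of chaining .replace() calls. The set of code points is an assumption about what "things like that" means, and BIDI_CONTROLS and strip_bidi_controls are hypothetical names.

```python
import re

# Assumed set of directional controls: LRM and RLM (U+200E, U+200F),
# the embedding/override controls U+202A..U+202E, and the isolate
# controls U+2066..U+2069. Adjust the class to taste.
BIDI_CONTROLS = re.compile('[\u200e\u200f\u202a-\u202e\u2066-\u2069]')

def strip_bidi_controls(text):
    """Return text with the directional controls listed above removed."""
    return BIDI_CONTROLS.sub('', text)

cleaned_comment = strip_bidi_controls('a\u200fb\u202ec')
```

Unlike the category-based filter, this leaves every other character alone, including non-ASCII letters and the zero-width joiners.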