
Removing right-to-left mark and other unicode characters from input in Python

I am writing a forum in Python. I want to strip input containing the right-to-left mark and things like that. Suggestions? Possibly a regular expression?


In a hard-to-read comment on another answer, the OP gives an example that appears to start like this:

comment = comment.encode('ascii', 'ignore')
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'

With the two statements in this order you would of course get a different error (the first statement reads comment, but only the second one binds that name), so let's assume the two lines are swapped:

comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')

This does cause the error the OP seems to report in that hard-to-read comment, but for a less obvious reason: comment is a byte string (no leading u before the opening quote), whereas .encode applies to a unicode string. In Python 2, Python therefore first tries to build a temporary unicode object from the byte string with the default codec, ascii, and that fails with a UnicodeDecodeError because every byte in the string is non-ASCII.

Inserting the leading u in that literal would work:

comment = u'\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')

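(This of course leaves comment empty, since all of its characters are non-ASCII and get ignored. Note that the u literal holds six code points in the U+00A4 to U+00C3 range, one per \x escape, not three accented letters.) Alternatively, for example if the original byte string comes from some other source rather than a literal: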

comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.decode('latin-1')
comment = comment.encode('ascii', 'ignore')

Here, the second statement explicitly builds the unicode object with a codec chosen as a guess: you can't tell with certainty which codec is supposed to apply just from seeing a bare byte string. (These particular bytes also happen to be the UTF-8 encoding of äöü, so 'utf-8' may well be the intended codec.) The third statement then, again, removes all non-ASCII characters, and again leaves comment empty.
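As a rough sketch of the same decode-then-encode steps under Python 3 (an assumption; the snippets above are Python 2): there a byte string is a bytes object with no .encode method at all, so the decode step is always explicit. The helper name to_ascii and its default codec are made up for illustration.

```python
def to_ascii(raw, codec='utf-8'):
    """Decode raw bytes with the given codec, then drop non-ASCII characters.

    Hypothetical helper: the right codec depends on where the bytes come from.
    """
    text = raw.decode(codec)               # bytes -> str (Unicode text)
    return text.encode('ascii', 'ignore')  # str -> bytes, non-ASCII dropped


raw = b'abc \xc3\xa4\xc3\xb6\xc3\xbc def'

# The codec changes what the intermediate text is (3 letters vs 6 symbols) ...
as_utf8 = raw.decode('utf-8')
as_latin1 = raw.decode('latin-1')

# ... but every decoded character is non-ASCII either way, so both get dropped.
utf8_result = to_ascii(raw)
latin1_result = to_ascii(raw, 'latin-1')
```

The codec guess matters whenever you keep any non-ASCII characters; it only becomes irrelevant here because everything outside ASCII is thrown away.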


If you simply want to restrict the characters to those of a certain character set, you could encode the string in that character set and just ignore encoding errors:

>>> uc = u'aäöüb'
>>> uc.encode('ascii', 'ignore')
'ab'
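For comparison, a minimal sketch of the same idea under Python 3 (an assumption, since the session above is Python 2): .encode returns a bytes object there, so a decode step is needed to get text back. The function name restrict_to_charset is hypothetical.

```python
def restrict_to_charset(text, charset='ascii'):
    """Drop every character that cannot be encoded in the given charset."""
    # Round trip: the 'ignore' handler silently drops unencodable characters,
    # and decoding with the same charset turns the bytes back into str.
    return text.encode(charset, 'ignore').decode(charset)


ascii_only = restrict_to_charset('a\xe4\xf6\xfcb')            # a, a-umlaut, o-umlaut, u-umlaut, b
latin1_only = restrict_to_charset('a\xe4\u200fb', 'latin-1')  # includes a right-to-left mark
```

With 'latin-1' the accented letter survives while the right-to-left mark (U+200F) is dropped, since Latin-1 only covers U+0000 to U+00FF.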


It's hard to guess the set of characters you want to remove from your Unicode strings. Could it be they are all the “Other, Format” characters (Unicode general category Cf, which includes the right-to-left mark)? If so, you can do:

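import unicodedata

# Keep only the characters whose general category is not Cf ("Other, Format").
# join() over a generator returns a string on Python 2 and 3 alike,
# whereas filter() returns an iterator on Python 3.
your_unicode_string = u''.join(
    c for c in your_unicode_string
    if unicodedata.category(c) != 'Cf')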
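Before filtering on a whole category, it may help to check that the characters you care about really are in it. The following is a small sketch assuming Python 3; describe and strip_format_chars are hypothetical helper names, not part of unicodedata.

```python
import unicodedata


def describe(ch):
    """Return (code point label, Unicode name, general category) for one character."""
    return ('U+%04X' % ord(ch),
            unicodedata.name(ch, 'UNNAMED'),
            unicodedata.category(ch))


def strip_format_chars(text):
    """Drop every character whose general category is Cf ("Other, Format")."""
    return ''.join(c for c in text if unicodedata.category(c) != 'Cf')


rlm_info = describe('\u200f')   # right-to-left mark
lrm_info = describe('\u200e')   # left-to-right mark
letter_info = describe('a')     # an ordinary letter, for contrast

cleaned = strip_format_chars('ab\u200fcd\u200e')
```

One caveat on the design: Cf also contains the zero-width joiner (U+200D) and zero-width non-joiner (U+200C), which some scripts and emoji sequences rely on, so removing the whole category may be broader than you intend.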


"example".replace(u'\u200e', '')

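You can remove specific characters by code point with the .replace() method. Note that U+200E, used above, is the left-to-right mark; the right-to-left mark the question asks about is U+200F, so you would call .replace() once for each.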
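Since the question also asks about a regular expression, here is one possible sketch that removes several such characters in a single pass instead of chaining .replace() calls. It assumes Python 3, and the choice of characters (the marks U+200E and U+200F plus the explicit embedding, override and isolate controls) is illustrative rather than a definitive list; BIDI_CONTROLS and strip_bidi_controls are hypothetical names.

```python
import re

# Illustrative set: LRM and RLM (U+200E, U+200F), the embedding and override
# controls (U+202A to U+202E), and the isolate controls (U+2066 to U+2069).
BIDI_CONTROLS = re.compile('[\u200e\u200f\u202a-\u202e\u2066-\u2069]')


def strip_bidi_controls(text):
    """Remove the directional control characters matched by BIDI_CONTROLS."""
    return BIDI_CONTROLS.sub('', text)


cleaned = strip_bidi_controls('\u200fhello\u202e world\u2069')
unchanged = strip_bidi_controls('plain text')
```

An explicit list like this is narrower than filtering on a whole Unicode category, which may be preferable when only directional marks should go.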
