wsgi - processing unicode characters from post

2023-04-07 11:31

python 2.7

raw = '%C3%BE%C3%A6%C3%B0%C3%B6' #string from wsgi post_data
raw_uni = raw.replace('%', r'\x')
raw_uni # gives '\\xC3\\xBE\\xC3\\xA6\\xC3\\xB0\\xC3\\xB6'
print raw_uni #gives \xC3\xBE\xC3\xA6\xC3\xB0\xC3\xB6
uni = unicode(raw_uni, 'utf-8')
uni #gives u'\\xC3\\xBE\\xC3\\xA6\\xC3\\xB0\\xC3\\xB6'
print uni #gives \xC3\xBE\xC3\xA6\xC3\xB0\xC3\xB6

However if I change raw_uni to be:

raw_uni = '\xC3\xBE\xC3\xA6\xC3\xB0\xC3\xB6'

and now do:

uni = unicode(raw_uni, 'utf-8')
uni #gives u'\xfe\xe6\xf0\xf6'
print uni #gives þæðö

which is what I want.

How do I get rid of this extra '\' in raw_uni, or take advantage of the fact that it's only there in the repr version of the string? More to the point, why does unicode(raw_uni, 'utf-8') use the repr version of the string?

Thanks

You should be using urllib.unquote, not a manual replace:

>>> import urllib
>>> raw = '%C3%BE%C3%A6%C3%B0%C3%B6'
>>> urllib.unquote(raw)
'\xc3\xbe\xc3\xa6\xc3\xb0\xc3\xb6'
>>> unicode(urllib.unquote(raw), 'utf-8')
u'\xfe\xe6\xf0\xf6'
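The two steps above (percent-decode to bytes, then decode the bytes as UTF-8) can be wrapped in one small helper. This is only a sketch: the name decode_post_value is made up for illustration, and the try/except import is an assumption added so the same code also runs on Python 3, where the byte-returning function is urllib.parse.unquote_to_bytes.

```python
try:
    # Python 2: urllib.unquote returns a byte string for a byte-string input
    from urllib import unquote as unquote_to_bytes
except ImportError:
    # Python 3: the byte-returning equivalent lives in urllib.parse
    from urllib.parse import unquote_to_bytes


def decode_post_value(raw, encoding='utf-8'):
    """Hypothetical helper: percent-decode raw, then decode the bytes to text."""
    return unquote_to_bytes(raw).decode(encoding)


decoded = decode_post_value('%C3%BE%C3%A6%C3%B0%C3%B6')
print(decoded == u'\xfe\xe6\xf0\xf6')  # True
```

Note that unquote leaves '+' untouched. If the POST body is form-encoded, spaces arrive as '+', and parsing the whole body with parse_qs (urlparse.parse_qs on Python 2) is probably a better fit than unquoting individual values by hand.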

The underlying issue here is a misunderstanding of what hex escapes are. The repr of a non-printable character can be expressed as a hex escape: a single backslash, followed by an 'x', followed by two hex digits. This is also how you would type such a character into a string literal, but the result is still only a single character. Your replace line does not turn your original string into hex escapes; it just replaces each '%' with a literal backslash character followed by an 'x'. That also answers the last question: unicode(raw_uni, 'utf-8') is not using the repr of the string. The string really does contain a backslash before each 'x', and repr displays every literal backslash as '\\'.

Consider the following examples:

>>> len('\xC3')         # this is a hex escape, only one character
1
>>> len(r'\xC3')        # this is four characters, '\', 'x', 'C', '3'
4
>>> r'\xC3' == '\\xC3'  # a raw string keeps the backslash as a literal character
True
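The same check can be applied to the string from the question. The sketch below only counts characters: after the replace, every '%XX' triple has become four literal characters, so the string gets longer, not shorter.

```python
raw = '%C3%BE%C3%A6%C3%B0%C3%B6'
replaced = raw.replace('%', r'\x')

print(len(raw))       # 24: eight '%XX' groups of three characters each
print(len(replaced))  # 32: eight groups of '\', 'x' and two hex digits

# The result is exactly the raw-string literal, i.e. eight real backslashes.
print(replaced == r'\xC3\xBE\xC3\xA6\xC3\xB0\xC3\xB6')  # True
print(replaced.count('\\'))                             # 8
```

A string that actually held the eight UTF-8 bytes would have a length of 8.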

If for some reason you can't use urllib.unquote, the following should work:

import re
raw_uni = re.sub(r'%([0-9A-Fa-f]{2})', lambda m: chr(int(m.group(1), 16)), raw)
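For completeness, here is one way the regex fallback might look as a function that also performs the UTF-8 decode. It is a sketch under assumptions: percent_decode is a made-up name, and it swaps chr(int(..., 16)) for binascii.unhexlify on a bytes pattern, so that the same code yields bytes on both Python 2 and Python 3 (on Python 3, chr would produce text characters, not bytes).

```python
import binascii
import re

# Match '%' followed by exactly two hex digits, operating on bytes.
_PERCENT_ESCAPE = re.compile(br'%([0-9A-Fa-f]{2})')


def percent_decode(raw, encoding='utf-8'):
    """Hypothetical fallback: percent-decode raw without urllib, then decode."""
    if not isinstance(raw, bytes):
        # Percent-encoded data is plain ASCII, so this encode is lossless.
        raw = raw.encode('ascii')
    decoded_bytes = _PERCENT_ESCAPE.sub(
        lambda m: binascii.unhexlify(m.group(1)), raw)
    return decoded_bytes.decode(encoding)


print(percent_decode('%C3%BE%C3%A6%C3%B0%C3%B6') == u'\xfe\xe6\xf0\xf6')  # True
```

Restricting the pattern to hex digits means a stray '%' that is not followed by two hex digits is left in place and does not raise an error.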
