Python unicode string with UTF-8?

2023-01-05 15:44 Q&A author:

I'm getting back from a library what looks to be an incorrect unicode string:

>>> title
u'Sopet\xc3\xb3n'

Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?

The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?

I'm stumped on (a), as there's nothing wrong, encoding-wise, with that original string (i.e., both are valid characters in their own right, u'\xc3\xb3' == Ã³, but they're not what's supposed to be there).
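For illustration, here is a small sketch of how a string like this typically comes about: UTF-8 bytes get decoded as Latin-1, so each byte turns into its own code point. It is written in Python 3 syntax (the session above is Python 2), and the variable names are made up for the example. Whether the library actually does this is an assumption.

```python
# Python 3 sketch: reproduce the garbled string from the correct one.
correct = "Sopet\u00f3n"                  # 'Sopetón', 7 code points
utf8_bytes = correct.encode("utf-8")      # the ó becomes the two bytes C3 B3
garbled = utf8_bytes.decode("latin-1")    # each byte is mapped to one code point

print(ascii(garbled))                     # 'Sopet\xc3\xb3n'
print(len(correct), len(garbled))         # 7 8
```

The garbled value is one code point longer than the intended one, which is the only visible symptom, since U+00C3 and U+00B3 are perfectly valid characters.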

It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:

>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón

But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?

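a) Try encoding it as Latin-1 and then decoding the resulting bytes as UTF-8. If both steps succeed, the string was most likely UTF-8 data that had been decoded as Latin-1: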

>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
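The same round trip can be wrapped in a helper that covers both (a) and (b). This is only a sketch in Python 3 syntax, and `repair_utf8_mojibake` and `looks_like_utf8_mojibake` are hypothetical names, not part of any library. It assumes the damage came from decoding UTF-8 bytes as Latin-1, and it is a heuristic: a string that legitimately contains a sequence such as Ã³ would be "repaired" as well.

```python
def repair_utf8_mojibake(text):
    """Undo a UTF-8-decoded-as-Latin-1 mix-up, or return text unchanged."""
    try:
        # Latin-1 maps code points 0-255 straight back to the original bytes.
        raw = text.encode("latin-1")
        # If those bytes are not valid UTF-8, the string was probably fine.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def looks_like_utf8_mojibake(text):
    """Heuristic check: True when the round trip succeeds and changes the text."""
    return repair_utf8_mojibake(text) != text


print(looks_like_utf8_mojibake("Sopet\xc3\xb3n"))     # True
print(looks_like_utf8_mojibake("Sopet\xf3n"))         # False
print(ascii(repair_utf8_mojibake("Sopet\xc3\xb3n")))  # 'Sopet\xf3n'
```

A correctly decoded 'Sopet\xf3n' is left alone because the lone byte F3 is not valid UTF-8, so the decode step raises and the original is returned.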

You should use:

>>> title.encode('raw_unicode_escape')

Python2:

print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))

Python3:

print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
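As a rough comparison with the Latin-1 approach (Python 3 syntax, illustrative variable names): for strings whose code points are all below U+0100, 'raw_unicode_escape' and 'latin-1' produce the same bytes. They differ once a higher code point is present: 'raw_unicode_escape' writes a literal backslash escape, while 'latin-1' raises an error.

```python
garbled = "\xd0\xbf\xd1\x80\xd0\xb8"

# Identical bytes while every code point fits in one byte.
same = garbled.encode("raw_unicode_escape") == garbled.encode("latin-1")
print(same)                                                  # True
print(garbled.encode("raw_unicode_escape").decode("utf-8"))  # при

# A code point above U+00FF shows the difference.
euro = "\u20ac"
print(euro.encode("raw_unicode_escape"))                     # b'\\u20ac'
try:
    euro.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
print(latin1_failed)                                         # True
```

So 'latin-1' fails loudly on input that cannot be this kind of mix-up, whereas 'raw_unicode_escape' silently emits escape text. Which behaviour is preferable depends on the data.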
