How to convert \xXY encoded characters to UTF-8 in Python?

2023-02-05 06:28

I have a text which contains characters such as "\xaf", "\xbe", which, as I understand it from this question, are ASCII encoded characters.

I want to convert them in Python to their UTF-8 equivalents. The usual string.encode("utf-8") throws UnicodeDecodeError. Is there some better way, e.g., with the codecs standard library?

Sample 200 characters here.

Your file is already a UTF-8 encoded file.

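# saved encoding-sample to /tmp/encoding-sample
import codecs
import sys
import unicodedata as ud

# decode the file as UTF-8 while reading
fp = codecs.open("/tmp/encoding-sample", "r", "utf8")
data = fp.read()
fp.close()

# print every distinct character with its code point and Unicode name
chars = sorted(set(data))
for char in chars:
    try:
        charname = ud.name(char)
    except ValueError:
        # control characters have no name in the Unicode database
        charname = "<unknown>"
    sys.stdout.write("char U%04x %s\n" % (ord(char), charname))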

And manually filling in the unknown names:
char U000a LINE FEED
char U001e INFORMATION SEPARATOR TWO
char U001f INFORMATION SEPARATOR ONE
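The manual step can also be folded into the script. The following is only a sketch for Python 3: the `CONTROL_NAMES` table and the `describe` helper are hypothetical names introduced here, and the table holds just the three names listed above.

```python
import unicodedata

# Hypothetical fallback table, filled in from the three names listed above.
CONTROL_NAMES = {
    0x0A: "LINE FEED",
    0x1E: "INFORMATION SEPARATOR TWO",
    0x1F: "INFORMATION SEPARATOR ONE",
}


def describe(char):
    """Return a 'char Uxxxx NAME' line for a single character."""
    # unicodedata.name accepts a default value, which replaces the try/except.
    name = unicodedata.name(char, None)
    if name is None:
        # Fall back to the hand-made table, then to a placeholder.
        name = CONTROL_NAMES.get(ord(char), "<unknown>")
    return "char U%04x %s" % (ord(char), name)


for ch in sorted(set("a\n\x1e\x1f")):
    print(describe(ch))
```

Any control character missing from the table still comes out as "<unknown>", so the table would have to grow with the data.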

.encode is for converting a Unicode string (unicode in 2.x, str in 3.x) to a byte string (str in 2.x, bytes in 3.x).

In 2.x, it's legal to call .encode on a str object. Python implicitly decodes the string to Unicode first: s.encode(e) works as if you had written s.decode(sys.getdefaultencoding()).encode(e).
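The implicit decoding step can be written out by hand to see where the error comes from. This is a Python 3 sketch, not 2.x code: bytes have no .encode in 3.x, so only the hidden decode is shown, and "ascii" is hard-coded to stand in for the 2.x default.

```python
raw = b"\xaf \xbe"  # the bytes from the question

# The step Python 2 performs silently before encoding:
# decode the byte string with the "ascii" default.
try:
    raw.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed:", exc.reason)
```

This is why a call to .encode can end in a UnicodeDecodeError: it is the hidden decode that fails, before any encoding happens.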

The problem is that the default encoding is "ascii", and your string contains non-ASCII characters. You can solve this by explicitly specifying the correct encoding.

>>> '\xAF \xBE'.decode('ISO-8859-1').encode('UTF-8')
'\xc2\xaf \xc2\xbe'
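For Python 3, a roughly equivalent sketch starts from a bytes literal. As in the 2.x line above, ISO-8859-1 is an assumption about the source data, not something that can be read off the bytes themselves.

```python
raw = b"\xAF \xBE"

# bytes -> str, assuming the data is ISO-8859-1 (Latin-1)
text = raw.decode("ISO-8859-1")

# str -> bytes, encoded as UTF-8
utf8 = text.encode("UTF-8")

print(repr(text))
print(utf8)
```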

It's not ASCII (ASCII codes only go up to 127; \xaf is 175). You first need to find out the correct encoding, decode that, and then re-encode in UTF-8.

Could you provide an actual string sample? Then we can probably guess the current encoding.
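Without a sample, one rough way to narrow things down is to try a few candidate encodings and look at the results. This is only a sketch for Python 3: the `try_decodings` helper and its candidate list are assumptions made for illustration, and a decode that succeeds does not prove the encoding is right (ISO-8859-1, for instance, accepts any byte sequence).

```python
def try_decodings(raw, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding.
            results[name] = None
    return results


sample = b"\xaf \xbe"  # the bytes from the question
for name, text in try_decodings(sample).items():
    print(name, repr(text))
```

The final choice still has to be made by reading the decoded text and checking that it makes sense.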

