How to correct a misencoded string?

2022-12-19 14:20 Q&A author:

I used mutagen to read MP3 metadata. The ID3 tag is returned as a unicode string, but the underlying bytes are actually GBK encoded. How can I correct this in Python?

from mutagen.easyid3 import EasyID3

audio = EasyID3(name)
title = audio["title"][0]
print title
print repr(title)

This prints:

µ±Äã¹Âµ¥Äã»áÏëÆðË
u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'

but the title is really GBK-encoded Chinese text and should read:

当你孤单你会想起谁

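It looks like the string has been decoded to unicode using the wrong encoding (latin-1).

Latin-1 maps every byte 0x00-0xFF to the code point with the same number, so no information was lost. You need to encode the string back to a byte string with latin-1 and then decode those bytes to unicode using the correct encoding, GBK.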

title = u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'
print title.encode('latin-1').decode('gbk')
当你孤单你会想起谁
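To make the round trip easier to follow, here is a minimal Python 3 sketch of the same two steps with the intermediate bytes exposed. The variable names `garbled`, `raw`, and `fixed` are only illustrative, and the snippet assumes the tag really was decoded as Latin-1, as the `repr` output in the question suggests.

```python
# The mis-decoded title from the question: 18 code points, all below U+0100.
garbled = "\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad"

# Step 1: undo the wrong decode. Latin-1 maps code point U+00NN to byte 0xNN,
# so this recovers the original bytes stored in the tag.
raw = garbled.encode("latin-1")

# Step 2: decode those bytes with the encoding they were written in.
fixed = raw.decode("gbk")

print(len(garbled), len(raw), len(fixed))
print(fixed)
```

Each GBK character here occupies two bytes, which is why the 18 garbled characters collapse into 9 Chinese characters.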

It looks like the tag is being auto-decoded as Latin-1. To fix it:

>>> title = u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'
>>> print title.encode('latin1').decode('GBK')
当你孤单你会想起谁

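Tested in Python 2.x. The same encode/decode round trip should work in Python 3 as well, provided print is called as a function: print(title.encode('latin1').decode('GBK')).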
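If some titles in a collection are already correct, applying the round trip blindly raises an exception. A possible way to guard against that in Python 3 is sketched below; `fix_mojibake` is a hypothetical helper name, not part of mutagen, and the default encodings are assumptions taken from this question.

```python
def fix_mojibake(text, wrong="latin-1", right="gbk"):
    """Re-decode text that was decoded with `wrong` instead of `right`.

    Returns the input unchanged when the round trip is not possible,
    for example when the text already contains real Chinese characters.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in `wrong`, or the bytes are not valid `right`:
        # assume the string was not mis-decoded in this particular way.
        return text


garbled = "\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad"
print(fix_mojibake(garbled))
print(fix_mojibake("当你孤单你会想起谁"))  # already correct, returned as-is
```

The fallback is a heuristic: a genuine Latin-1 string whose bytes happen to form valid GBK would still be converted, so it is safest to apply this only to tags known to come from GBK-tagged files.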
