How to correct a misencoded string?
I used mutagen to read MP3 metadata. The ID3 tag is read in as unicode, but the underlying bytes are actually GBK encoded. How can I correct this in Python?
from mutagen.easyid3 import EasyID3
audio = EasyID3(name)
title = audio["title"][0]
print title
print repr(title)
produces
µ±Äã¹Âµ¥Äã»áÏëÆðË
u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'
but it should actually be this Chinese text, which was stored as GBK:
当你孤单你会想起谁
It looks like the string has been decoded to unicode using the wrong encoding (latin-1).
You need to encode it to a byte string and then decode it back to unicode using the correct encoding.
title = u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'
print title.encode('latin-1').decode('gbk')
当你孤单你会想起谁
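The same round trip can be wrapped in a small helper. This is only a sketch: the name fix_mojibake and its parameters are made up for illustration, and it assumes the text was wrongly decoded as latin-1 from GBK bytes, as in the question. If the round trip fails, the input is returned unchanged.

```python
# -*- coding: utf-8 -*-

def fix_mojibake(text, wrong='latin-1', right='gbk'):
    """Undo a wrong decode: re-encode with the codec that was wrongly
    applied, then decode with the codec the bytes were really in.
    (Hypothetical helper, not part of mutagen.)"""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the assumed mis-decoding; leave it alone.
        return text

# The misdecoded title from the question.
title = u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'
fixed = fix_mojibake(title)
```

Text that already contains real Chinese characters cannot be encoded as latin-1, so it passes through untouched.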
Looks like it's auto-decoding using latin1. To fix:
>>> title = u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'
>>> print title.encode('latin1').decode('GBK')
当你孤单你会想起谁
Tested in Python 2.x. The same encode/decode round trip should work in Python 3 as well, as long as print is called as a function.
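If every tag in the file is affected, the same fix can be applied to each value in turn. The sketch below assumes the tags behave like a mapping from key to a list of unicode strings (which is what audio["title"][0] in the question suggests). A plain dict stands in for the real tag object here, and redecode_tags is a made-up name.

```python
# -*- coding: utf-8 -*-

def redecode_tags(tags, wrong='latin-1', right='gbk'):
    """Return a new dict with every tag value re-decoded.
    Values that do not survive the round trip are kept as they were.
    (Hypothetical helper; 'tags' is assumed to map keys to lists of text.)"""
    fixed = {}
    for key, values in tags.items():
        new_values = []
        for value in values:
            try:
                new_values.append(value.encode(wrong).decode(right))
            except (UnicodeEncodeError, UnicodeDecodeError):
                new_values.append(value)
        fixed[key] = new_values
    return fixed

# Stand-in for the tag mapping; the title is the string from the question.
sample = {
    'title': [u'\xb5\xb1\xc4\xe3\xb9\xc2\xb5\xa5\xc4\xe3\xbb\xe1\xcf\xeb\xc6\xf0\xcb\xad'],
    'artist': [u'abc'],
}
result = redecode_tags(sample)
```

Building a new dict rather than modifying the tags in place keeps the original values around in case the GBK guess turns out to be wrong for some files.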