Detect charset and convert to utf-8 in Python? [duplicate]
Is there a universal method to detect a string's charset? I use IPTC tags, and their encoding is unknown. I need to detect it and then convert them to UTF-8.
Can anybody help?
You want to use chardet, an encoding detector.
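A minimal sketch of how that might look, assuming the input is a byte string and that chardet is installed. The function name to_utf8 and the detect and fallback parameters are made up for this illustration; they are not part of chardet. chardet.detect() returns a dict whose 'encoding' entry can be None when it has no guess, so the sketch falls back to Latin-1 in that case.

```python
def to_utf8(data, detect=None, fallback="latin-1"):
    """Return the byte string `data` re-encoded as UTF-8 bytes.

    `detect` is a hypothetical hook: any callable that takes bytes and
    returns a dict with an 'encoding' key, as chardet.detect does.
    """
    if detect is None:
        import chardet  # third-party package, imported only when needed
        detect = chardet.detect

    guess = detect(data)
    # 'encoding' may be None if the detector could not decide
    encoding = guess.get("encoding") or fallback

    # Undecodable bytes are replaced rather than raising an exception
    text = data.decode(encoding, errors="replace")
    return text.encode("utf-8")
```

Calling to_utf8(raw_bytes) uses chardet; passing detect=cchardet.detect should swap in the faster detector, since it returns a dict of the same shape. Detection is a statistical guess, so short inputs such as single IPTC fields can be misidentified.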
It's a bit late, but there is also another solution: try to use pyicu.
An example:
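import icu

def convert_encoding(data, new_coding='UTF-8'):
    # Ask ICU for its best guess at the encoding of the raw bytes
    coding = icu.CharsetDetector(data).detect().getName()
    if new_coding.upper() != coding.upper():
        # Decode with the detected encoding, then re-encode
        data = data.decode(coding).encode(new_coding)
    return data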
If you want to do it with cchardet, you can use this function.
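import cchardet

def convert_encoding(data, new_coding='UTF-8'):
    # detect() returns a dict; 'encoding' is None when detection fails
    encoding = cchardet.detect(data)['encoding']
    if encoding and new_coding.upper() != encoding.upper():
        # Decode with the detected encoding, then re-encode
        data = data.decode(encoding).encode(new_coding)
    return data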
There is another module called cchardet. It is said to be faster than chardet. Note that it requires Cython.