Detect charset and convert to utf-8 in Python? [duplicate]

This question already has answers here: How to determine the encoding of text (16 answers). Closed 5 years ago.

Is there any universal method to detect the charset of a string? I use IPTC tags whose encoding is unknown. I need to detect it and then convert them to UTF-8.

Can anybody help?


You want to use chardet, an encoding detector.
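As a rough sketch of how that might look: chardet.detect() takes bytes and returns a dict with 'encoding' and 'confidence' keys, and 'encoding' can be None when nothing matches. The helper name to_utf8, the detect parameter, and the latin-1 fallback below are assumptions made for illustration, not part of chardet.

```python
def to_utf8(data, detect=None, fallback="latin-1"):
    """Re-encode bytes of unknown charset as UTF-8 bytes (hypothetical helper)."""
    if detect is None:
        # Third-party dependency: pip install chardet
        import chardet
        detect = chardet.detect
    guess = detect(data)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    # Detection is a statistical guess and may fail; the fallback is an assumption.
    encoding = guess.get("encoding") or fallback
    # errors="replace" keeps the conversion from raising on a wrong guess.
    return data.decode(encoding, errors="replace").encode("utf-8")
```

Calling to_utf8(raw_tag_bytes) on each IPTC value would then give UTF-8 bytes. Short inputs give the detector little to work with, so it may be worth checking the 'confidence' value before trusting the result.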


It's a bit late, but there is also another solution: try to use pyicu.

An example:

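import icu  # provided by the PyICU package

def convert_encoding(data, new_coding='UTF-8'):
    # data must be bytes; ICU returns its best guess at the charset name
    coding = icu.CharsetDetector(data).detect().getName()
    if new_coding.upper() != coding.upper():
        # Python 3: decode the bytes, then re-encode (unicode() is Python 2 only)
        data = data.decode(coding).encode(new_coding)
    return data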


If you want to do it with cchardet, you can use this function.

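import cchardet

def convert_encoding(data, new_coding='UTF-8'):
    # data must be bytes; 'encoding' is None when detection fails
    encoding = cchardet.detect(data)['encoding']

    if encoding and new_coding.upper() != encoding.upper():
        # decode with the detected codec, then re-encode to the target
        data = data.decode(encoding).encode(new_coding)

    return data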


There is another module called cchardet.

It is said to be faster than chardet.

Note that it requires Cython to build.
