Splitting results from chardet output to collect encoding type

I am testing chardet in one of my scripts. I wanted to identify the encoding type of a result variable and chardet seems to do fine here.

So this is what I am doing:

myvar1  # gets its value from other functions

myvar2 = chardet.detect(myvar1)  # detect the encoding type of myvar1

Now when I do a print myvar2, I receive the output:

{'confidence': 1.0, 'encoding': 'ascii'}

Question 1: Can someone give me a pointer on how to collect only the encoding value out of this, i.e. ascii?

Edit: The scenario is as follows:

I am using unicode(myvar1) to write all input as unicode. But as soon as myvar1 gets a value like 0xab, unicode(myvar1) fails with the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position xxx: ordinal not in range(128)

Therefore, I am trying to:

  1. first identify the encoding type of the input which comes in myvar1,
  2. take the encoding type in myvar2,
  3. decode the input (myvar1) with this encoding (myvar2) using decode() [?]
  4. pass it on to unicode.

The input coming in is variable and not in my control.

I am sure there are other ways to do this, but I am new to this. And I am open to trying.

Any pointers, please.

Many Thanks.


print myvar2['encoding']

Now for the added related info: chardet only attempts to detect the encoding. It isn't 100% reliable and fails sometimes. However, it's the best you've got, since reliable encoding detection is impossible. Just provide a way for your users to specify the encoding if chardet fails for them.
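A minimal sketch of that advice, written for Python 3 (the question uses Python 2): the caller can pass an explicit encoding, and the detector's guess is used only when none is given and its confidence is high enough. The name to_text, the detect parameter, and the 0.5 threshold are illustrative assumptions, not part of chardet. detect is meant to be chardet.detect, which returns a dict with 'encoding' and 'confidence' keys as shown in the question.

```python
def to_text(raw, detect, encoding=None, min_confidence=0.5):
    """Decode bytes, preferring a caller-supplied encoding over a guess.

    detect is any callable that returns a dict with 'encoding' and
    'confidence' keys, for example chardet.detect.
    """
    if encoding is None:
        guess = detect(raw)
        guessed = guess.get('encoding')
        confidence = guess.get('confidence') or 0.0
        # The detector may report no encoding at all, or a weak guess.
        if guessed is None or confidence < min_confidence:
            raise ValueError('encoding could not be guessed; please specify one')
        encoding = guessed
    return raw.decode(encoding)
```

With chardet installed, a call might look like to_text(data, chardet.detect), or to_text(data, chardet.detect, encoding='cp1252') when the user knows the encoding. A wrong guess can still decode without error and produce the wrong characters, so the explicit override remains the reliable path.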

You can't read text if you don't know which encoding it uses. It is impossible, because the same byte sequence can mean different chars in different encodings. In other words, encodings are ambiguous. chardet is just a guess. It can and will fail in the wild. The best and only reliable way is to ask whoever generated the string which encoding was used in the first place.
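To make the ambiguity concrete, here is a small Python 3 illustration. The byte value 0xe9 and the two codecs are picked only as an example; both decodes succeed, and nothing in the byte itself says which result was intended.

```python
raw = b'\xe9'  # one byte, with no encoding information attached

as_latin1 = raw.decode('latin-1')  # LATIN SMALL LETTER E WITH ACUTE
as_cp1251 = raw.decode('cp1251')   # CYRILLIC SMALL LETTER SHORT I

# Same byte, two different characters, and no error in either case.
print(repr(as_latin1), repr(as_cp1251))
print(as_latin1 == as_cp1251)
```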


EDIT: for your scenario, the only way to stay sane is to ask whoever generated the string what's the encoding used. You said that

"The input coming in is variable and not in my control."

If that's true, then you can't correctly read the input. You can't read a text input from a bunch of bytes without knowing beforehand which encoding it used. It's impossible. By definition.

Please ask whoever is generating the bytestrings to provide you the encoding used to generate the bytestrings, together with the bytestrings themselves, so you can make sense of them. Without the encoding, a bytestring is just a chunk of bytes and you can't know which chars are there. It's like having a bunch of data but not knowing how to interpret them.

Where do those bytes come from? Why don't you have control over which encoding was used to generate the data? Does the data provider know that the data they're providing is useless since you can't correctly interpret it?

I will repeat once more to make it really clear: You can't correctly, reliably read a bunch of bytes as text without knowing the encoding used to generate the bytes. There's no way it will work reliably. You need some kind of agreement with the producer so you'll know the encoding.


Second problem: as the traceback says, aBuf is an int but it's expecting a string. You need to find out why.

uhhhh ... just worked it out; you are feeding it a single byte, expressed as an integer (0xab) instead of a string ('\xab'). In any case, chardet requires much more than 1 byte to be able to guess an encoding. Feeding any charset detector one byte is utterly pointless.
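The difference between the integer and the byte string can be checked before calling the detector. A small sketch in Python 3, where byte strings are bytes objects; the helper name require_bytes is made up for illustration and is not part of chardet.

```python
def require_bytes(value):
    """Return value unchanged if it is a byte string, else raise TypeError."""
    if not isinstance(value, (bytes, bytearray)):
        raise TypeError('expected a byte string, got %s' % type(value).__name__)
    return value


require_bytes(b'\xab')   # fine: a one-byte string
try:
    require_bytes(0xab)  # an int (171), not a byte string
except TypeError as exc:
    print(exc)
```

Passing the check only proves the type is right. A single byte is still far too little data for any detector to guess from.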
