Splitting results from chardet output to collect encoding type

2023-01-21 16:47 Q&A author:

I am testing chardet in one of my scripts. I wanted to identify the encoding type of a result variable and chardet seems to do fine here.

So this is what I am doing:

# myvar1 gets its value from other functions

myvar2 = chardet.detect(myvar1)  # detect the encoding type of myvar1

Now when I do a print myvar2, I receive the output:

{'confidence': 1.0, 'encoding': 'ascii'}

Question 1: Can someone give me a pointer on how to collect only the encoding value out of this, i.e. ascii?

Edit: The scenario is as follows:

I am using unicode(myvar1) to write all input as unicode. But as soon as myvar1 gets a value like 0xab, unicode(myvar1) fails with the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position xxx: ordinal not in range(128)

Therefore, I am trying to:

first identify the encoding type of the input which comes in myvar1,

take the encoding type in myvar2,

decode the input (myvar1) with this encoding (myvar2) using decode() [?]

pass it on to unicode.

The input coming in is variable and not in my control.

I am sure there are other ways to do this, but I am new to this. And I am open to trying.

Any pointer please.

Many Thanks.

print myvar2['encoding']

Now for the added related info: chardet is an attempt at detecting the encoding. It isn't 100% reliable and fails sometimes. However, it's the best you've got, since reliable encoding detection is impossible. Just provide a way for your users to specify the encoding if chardet fails for them.
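As a rough sketch of that advice (written in Python 3 syntax, unlike the Python 2 in the question): use an explicitly supplied encoding when there is one, and fall back to chardet's guess only otherwise. The names decode_input, user_encoding and fallback are hypothetical, made up for this illustration, and the detect parameter exists only so the function can be tried without chardet installed.

```python
def decode_input(raw, user_encoding=None, detect=None, fallback="utf-8"):
    """Decode raw bytes, preferring an encoding the caller states explicitly."""
    if user_encoding:
        # A stated encoding beats any guess; let decode errors surface loudly.
        return raw.decode(user_encoding)
    if detect is None:
        import chardet  # third-party package, imported only when needed
        detect = chardet.detect
    guess = detect(raw)  # a dict such as {'confidence': 1.0, 'encoding': 'ascii'}
    # chardet reports None as the encoding when it cannot decide.
    encoding = guess.get("encoding") or fallback
    # errors="replace" keeps a wrong guess from raising; bad bytes become U+FFFD.
    return raw.decode(encoding, errors="replace")
```

Calling decode_input(data, user_encoding="latin-1") skips detection entirely, while decode_input(data) relies on the guess and may silently produce wrong characters when the guess is wrong.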

You can't read a text whose encoding you don't know. It is impossible, because the same byte sequence can mean different characters in different encodings. In other words, encodings are ambiguous. chardet is just a guess. It can and will fail in the wild. The best and only reliable way is to ask whoever generated the string which encoding was used in the first place.
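A small illustration of that ambiguity, in Python 3 syntax: the same single byte decodes without error under two different encodings and yields two different characters. The byte 0xe9 and the two codecs are chosen only for the example.

```python
raw = b"\xe9"  # one byte, with no encoding information attached

as_latin1 = raw.decode("latin-1")  # U+00E9, LATIN SMALL LETTER E WITH ACUTE
as_cp1251 = raw.decode("cp1251")   # U+0439, CYRILLIC SMALL LETTER SHORT I

print(repr(as_latin1), repr(as_cp1251))
print(as_latin1 == as_cp1251)  # False
```

Both calls succeed, so nothing in the byte itself says which reading is the intended one.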

EDIT: for your scenario, the only way to stay sane is to ask whoever generated the string what's the encoding used. You said that

"The input coming in is variable and not in my control."

If that's true, then you can't correctly read the input. You can't read a text input from a bunch of bytes without knowing beforehand which encoding it used. It's impossible. By definition.

Please ask whoever is generating the bytestrings to provide you the encoding used to generate the bytestrings, together with the bytestrings themselves, so you can make sense of them. Without the encoding, a bytestring is just a chunk of bytes and you can't know which chars are there. It's like having a bunch of data but not knowing how to interpret them.

Where do those bytes come from? Why don't you have control over which encoding was used to generate the data? Does the data provider know that the data they're providing is useless since you can't correctly interpret it?

I will repeat once more to make it really clear: You can't correctly, reliably read a bunch of bytes as text without knowing the encoding used to generate the bytes. There's no way it will work reliably. You need some kind of agreement with the producer so you'll know the encoding.

Second problem: as the traceback says, aBuf is an int but it's expecting a string. You need to find out why.

uhhhh ... just worked it out; you are feeding it a single byte, expressed as an integer (0xab) instead of a string ('\xab'). In any case, chardet requires much more than 1 byte to be able to guess an encoding. Feeding any charset detector one byte is utterly pointless.
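To make the int-versus-string distinction concrete, here is a short sketch in Python 3 syntax, where byte data is a bytes object. In the Python 2 of the question the equivalent one-byte string would be '\xab', which chr(0xab) produces.

```python
value = 0xab               # an int (171), not byte data
as_bytes = bytes([value])  # a one-byte bytes object, b'\xab'

# Pitfall: bytes(value) without the list builds 171 zero bytes instead.
zeros = bytes(value)

print(type(value).__name__, type(as_bytes).__name__)
print(as_bytes == b"\xab", len(zeros))
```

Even in the right type, a single byte gives a detector almost nothing to work with, as noted above.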
