Importing file with unknown encoding from Python into MongoDB

2023-02-04 10:33 问答作者：

Working on importing a tab-delimited file over HTTP in Python.

Before inserting a row's data into MongoDB, I'm removing slashes, ticks and quotes from the string.

Whatever the encoding of the data is, MongoDB is throwing me the exception:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8

So in an endeavour to solve this problem, from the reading I've done I want to as quickly as I can, convert the row's data to Unicode using the unicode() function. In addition, I have tried calling the decode() function passing "unicode" as the first parameter but receive the error:

LookupError: unknown encoding: unicode

From there, I can make my string manipulations such as replacing the slashes, ticks, and quotes. Then before inserting the data into MongoDB, convert it to UTF-8 using the str.encode('utf-8') function.

Problem: When converting to Unicode, I am receiving the error

UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 1258: ordinal not in range(128)

With this error, I'm not exactly sure where to continue.

My question is this: How do I successfully import the data from a fil开发者_C百科e without knowing its encoding and successfully insert it into MongoDB, which requires UTF-8?

Thanks Much!

Try these in order:

(0) Check that your removal of the slashes/ticks/etc is not butchering the data. What's a tick? Please show your code. Please show a sample of the raw data ... use print repr(sample_raw data) and copy/paste the output into an edit of your question.

(1) There's an old maxim: "If the encoding of a file is unknown, or stated to be ISO-8859-1, it is cp1252" ... where are you getting it from? If it's coming from Western Europe, the Americas, or any English/French/Spanish-speaking country/territory elsewhere, and it's not valid UTF-8, then it's likely to be cp1252

[Edit 2] Your error byte 0x93 decodes to U+201C LEFT DOUBLE QUOTATION MARK for all encodings cp1250 to cp1258 inclusive ... what language is the text written in? [/Edit 2]

(2) Save the file (before tick removal), then open the file in your browser: Does it look sensible? What do you see when you click on View / Character Encoding?

(3) Try chardet

Edit with some more advice:

Once you know what the encoding is (let's assume it's cp1252):

(1) convert your input data to unicode: uc = raw_data.decode('cp1252')

(2) process the data (remove slashes/ticks/etc) as unicode: clean_uc = manipulate(uc)

(3) you need to output your data encoded as utf8: to_mongo = clean_uc.encode('utf8')

Note 1: Your error message says "can't decode byte 0x93 in position 1258" ... 1258 bytes is a rather long chunk of text; is this reasonable? Have you had a look at the data that it is complaining about? How? what did you see?

Note 2: Please consider reading the Python Unicode HOWTO and this article

继续阅读：character-encoding mongodb python

Importing file with unknown encoding from Python into MongoDB

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？