Why am I getting a UnicodeDecodeError in Python's JSON encoding?
I am using Solr 3.3 to index stuff from my database. I compose the JSON content in Python. I manage to upload 2126 records which add up to 523246 chars (approx 511kb). But when I try 2027 records, Python gives me the error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "D:\Technovia\db_indexer\solr_update.py", line 69, in upload_service_details
request_string.append(param_list)
File "C:\Python27\lib\json\__init__.py", line 238, in dumps
**kw).encode(obj)
File "C:\Python27\lib\json\encoder.py", line 203, in encode
chunks = list(chunks)
File "C:\Python27\lib\json\encoder.py", line 425, in _iterencode
for chunk in _iterenco开发者_JS百科de_list(o, _current_indent_level):
File "C:\Python27\lib\json\encoder.py", line 326, in _iterencode_list
for chunk in chunks:
File "C:\Python27\lib\json\encoder.py", line 384, in _iterencode_dict
yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 68: invalid start byte
Ouch. Is 512kb worth of bytes a fundamental limit? Is there any high-volume alternative to the existing JSON module?
Update: its a fault of some data as trying to encode *biz_list[2126:]* causes an immediate error. Here is the offending piece:
'2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'
How can I configure it so that it can be encodable into JSON?
Update 2: The answer worked as expected: the data came from a MySQL table encoded in "latin-1-swedish-ci". I saw significance in a random number. Sorry for spontaneously channeling the spirit of a headline writer when diagnosing the fault.
Simple, just don't use utf-8 encoding if your data is not in utf-8
>>> json.loads('["\x96"]')
....
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte
>>> json.loads('["\x96"]', encoding="latin-1")
[u'\x96']
json.loads
If
s
is astr
instance and is encoded with an ASCII based encoding other than utf-8 (e.g. latin-1) then an appropriateencoding
name must be specified. Encodings that are not ASCII based (such as UCS-2) are not allowed and should be decoded tounicode
first.
Edit: To get proper unicode value of "\x96" use "cp1252" as Eli Collins mentioned
>>> json.loads('["\x96"]', encoding="cp1252")
[u'\u2013']
精彩评论