Encoding error while deserializing a json object from Google
As an exercise I built a little script that queries the Google Suggest JSON API. The code is quite simple:
import json
import urllib

query = 'a'
url = "http://clients1.google.co.jp/complete/search?hl=ja&q=%s&json=t" % query
response = urllib.urlopen(url)
result = json.load(response)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: invalid start byte
If I try to read() the response object, this is what I get:
'["a",["amazon","ana","au","apple","adobe","alc","\x83A\x83}\x83]\x83\x93","\x83A\x83\x81\x83u\x83\x8d","\x83A\x83X\x83N\x83\x8b","\x83A\x83\x8b\x83N"],["","","","","","","","","",""]]'
So it seems that the error is raised when Python tries to decode the string. This only happens with google.co.jp and the Japanese language. I tried the same code with different countries/languages and I do not get the same issue: when I try to deserialize the object everything works OK.
- I checked the response headers and they always specify utf-8 as the response encoding.
- I checked the JSON string with an online parser (http://json.parser.online.fr/) and again all seems OK.
Any ideas on how to solve this problem? What makes the JSON load() function choke?
Thanks in advance.
The response headers (print response.headers) contain the following information:
Content-Type: text/javascript; charset=Shift_JIS
Note the charset.
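The charset can also be read programmatically rather than by eye. Below is a minimal sketch, assuming Python 3, where header objects are email.message.Message instances; the header is built by hand from the value shown above so the example runs without a network call.

```python
from email.message import Message

# Build a header container by hand so the example runs offline.
# Assumption: on Python 3, the headers of a live urllib response
# are an object of this same family and offer the same method.
headers = Message()
headers["Content-Type"] = "text/javascript; charset=Shift_JIS"

# get_content_charset() returns the charset parameter in lowercase,
# or the supplied fallback when the header does not name one.
charset = headers.get_content_charset("utf-8")
print(charset)
```

The returned name can be passed straight to bytes.decode(), since Python codec lookups are case-insensitive.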
If you specify this encoding in json.load it will work:
result = json.load(response, encoding='shift_jis')
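The call above is Python 2. On Python 3 the encoding keyword is gone (it was removed from json.loads in 3.9), so the equivalent is to decode the bytes yourself and hand text to json.loads. A sketch, using a shortened copy of the raw response from the question as sample data:

```python
import json

# Shortened copy of the raw bytes that response.read() returned in the question.
raw = b'["a",["amazon","\x83A\x83}\x83]\x83\x93"],["",""]]'

# Decode with the charset named in the Content-Type header, then parse the text.
text = raw.decode("shift_jis")
result = json.loads(text)

print(result[1][1])
```

Decoding first also matters because some Shift_JIS trail bytes (0x5D and 0x7D here) coincide with the ASCII characters ] and }, so the bytes should not be treated as ASCII-compatible JSON.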
Regardless of what the spec says, the string "\x83A\x83}\x83]\x83\x93" is not UTF-8.
At a guess, it is one of [ "cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213" ]; try decoding as one of these.
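One way to act on that guess is to try each candidate in turn and see which ones decode without error. A sketch, assuming Python 3 and using the first Japanese entry from the question as sample bytes; the helper name decodable_with is made up for this example.

```python
CANDIDATES = ["cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213"]

# First non-ASCII suggestion from the raw response in the question.
sample = b"\x83A\x83}\x83]\x83\x93"

def decodable_with(data, encodings):
    """Return {encoding: decoded text} for each encoding that decodes data cleanly."""
    decoded = {}
    for name in encodings:
        try:
            decoded[name] = data.decode(name)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    return decoded

matches = decodable_with(sample, CANDIDATES)
for name, text in matches.items():
    print(name, text)
```

A clean decode only shows that the bytes are valid in that encoding, not that it is the right one. When several candidates succeed, the charset in the Content-Type header is the better guide.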