Encoding error while deserializing a JSON object from Google

As an exercise I built a little script that queries the Google Suggest JSON API. The code is quite simple:

import json
import urllib

query = 'a'
url = "http://clients1.google.co.jp/complete/search?hl=ja&q=%s&json=t" % query
response = urllib.urlopen(url)
result = json.load(response)

The last line fails with:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: invalid start byte

If I read() the response object instead, this is what I get:

'["a",["amazon","ana","au","apple","adobe","alc","\x83A\x83}\x83]\x83\x93","\x83A\x83\x81\x83u\x83\x8d","\x83A\x83X\x83N\x83\x8b","\x83A\x83\x8b\x83N"],["","","","","","","","","",""]]'

So it seems that the error is raised when Python tries to decode the string. This only happens with google.co.jp and the Japanese language. I tried the same code with different countries/languages and I do not get the same issue: when I deserialize the object, everything works OK.

  • I checked the response headers and they always specify utf-8 as the response encoding.
  • I checked the JSON string with an online parser (http://json.parser.online.fr/) and again all seems OK.

Any ideas on how to solve this problem? What makes the JSON load() function choke?

Thanks in advance.


The response headers (print response.headers) contain the following information:

Content-Type: text/javascript; charset=Shift_JIS

Note the charset.

If you specify this encoding in json.load it will work:

result = json.load(response, encoding='shift_jis')
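The call above relies on the encoding argument of Python 2's json.load; recent Python 3 releases no longer accept it. As a rough sketch of the same idea for Python 3, one could read the charset out of the Content-Type header and decode the bytes before parsing. The helper names charset_from_content_type and load_json_bytes are made up for this illustration, and the UTF-8 fallback is an assumption, not something the server promises.

```python
import json
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset() or default


def load_json_bytes(raw, content_type):
    """Decode raw response bytes using the header's charset, then parse the JSON."""
    encoding = charset_from_content_type(content_type)
    return json.loads(raw.decode(encoding))


# Shortened sample of the body shown in the question (hypothetical subset).
raw = b'["a",["amazon","\x83A\x83}\x83]\x83\x93"]]'
result = load_json_bytes(raw, "text/javascript; charset=Shift_JIS")
print(result[1])
```

With a real response from urllib.request.urlopen, the raw bytes would come from response.read() and the header value from response.headers.get("Content-Type").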


Regardless of what the spec says, the string "\x83A\x83}\x83]\x83\x93" is not UTF-8.

At a guess, it is one of [ "cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213" ]; try decoding as one of these.
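One way to try that guess is to run the suspicious bytes through each candidate codec and see which ones decode without error. The sketch below uses Python 3 bytes literals; try_decodings is a hypothetical helper name, and a successful decode only shows that the bytes are valid in that codec, not that it is the encoding the server actually used.

```python
CANDIDATES = ["cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213"]


def try_decodings(raw, candidates=CANDIDATES):
    """Return {codec name: decoded text, or None if decoding failed}."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# The first non-ASCII suggestion from the response in the question.
sample = b"\x83A\x83}\x83]\x83\x93"

for codec, text in try_decodings(sample).items():
    print(codec, repr(text))

# For comparison: UTF-8 rejects the same bytes, matching the original error.
print(try_decodings(sample, ["utf-8"]))
```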
