Problem with Unicode decoding
This is odd. I am trying to read geographic lookup data from openstreetmap. The code that performs the query looks like this:
params = urllib.urlencode({'q': ",".join([e for e in full_address]), 'format': "json", "addressdetails" : "1"})
query = "http://nominatim.openstreetmap.org/search?%s" % params
print query
time.sleep(5)
response = json.loads(unicode(urllib.urlopen(query).read(), "UTF-8"), encoding="UTF-8")
print response
The query for Zürich is correctly URL-encoded from UTF-8 data. No surprises here.
http://nominatim.openstreetmap.org/search?q=Z%C3%BCrich%2CSWITZERLAND&addressdetails=1&format=json
When I print the response, the u with umlaut appears to be encoded as latin1 (0xFC):
[{u'display_name': u'Z\xfcrich, Bezirk Z\xfcrich, Z\xfcrich, Schweiz, Europe', u'place_id': 588094, u'lon': 8.540443
but that makes no sense, because openstreetmap returns the JSON data in UTF-8:
Connecting to nominatim.openstreetmap.org (nominatim.openstreetmap.org)|128.40.168.106|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Wed, 26 Jan 2011 13:48:33 GMT
Server: Apache/2.2.14 (Ubuntu)
Content-Location: search.php
Vary: negotiate
TCN: choice
X-Powered-By: PHP/5.3.2-1ubuntu4.7
Access-Control-Allow-Origin: *
Content-Length: 3342
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: application/json; charset=UTF-8
Length: 3342 (3.3K) [application/json]
which is also confirmed by the file contents. On top of that, I explicitly say that it's UTF-8 both when reading and when parsing the JSON.
What's going on here?
EDIT : apparently it's the json.loads that screws up somehow.
When I go and print the response, the u with umlaut is encoded latin1 (0xFC)
You are just misinterpreting the output. It's a unicode string (you can tell by the u prefix), so there is no encoding "attached" to it. The \xfc means the code point with number 0xFC, which happens to be the u with umlaut (see http://www.fileformat.info/info/unicode/char/fc/index.htm). It looks like latin1 because the numbering of the first 256 Unicode code points coincides with the latin1 encoding.
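To make the difference between a code point and an encoding concrete, here is a small sketch. It is written so that it should run under both Python 2 and Python 3 (the `__future__` import is my addition for that purpose), and the variable names are only illustrative.

```python
from __future__ import print_function, unicode_literals

# A text (unicode) string. "\xfc" is an escape for code point U+00FC,
# not a latin1 byte.
name = "Z\xfcrich"

code_point = ord(name[1])              # the number of the code point
utf8_bytes = name.encode("utf-8")      # two bytes for the u with umlaut
latin1_bytes = name.encode("latin-1")  # one byte, equal to the code point

print(hex(code_point))                 # 0xfc
print(len(utf8_bytes), len(latin1_bytes))
```

The escape in the repr and the latin1 byte only look alike because latin1 maps each of its 256 byte values to the code point with the same number.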
In short, you did everything right: you have a unicode object with the right content, and that object is agnostic to encodings. You choose the encoding when you send that content to some output, either by calling unicodestr.encode("utf-8") or by using codecs; see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
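As a rough illustration of picking the encoding only at output time, the sketch below writes the string to a temporary file. Note that it uses `io.open` rather than the `codecs` module mentioned above, simply because `io.open` behaves the same in Python 2.6+ and Python 3; the temporary file and the variable names are assumptions for the example.

```python
from __future__ import unicode_literals

import io
import os
import tempfile

text = "Z\xfcrich"  # text string, no encoding attached yet

# Create a scratch file (hypothetical location, removed at the end).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# The encoding is chosen here, at the point where text becomes bytes.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Reading the raw bytes shows what actually landed on disk.
with io.open(path, "rb") as handle:
    raw = handle.read()

# Decoding with the same encoding gives the original text back.
with io.open(path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()

os.remove(path)
```

Here `raw` holds the UTF-8 bytes, while `text` and `round_trip` are the same encoding-agnostic string.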
The output is fine. When you print on the console, Python encodes Unicode data only when it prints the string itself. If you print a list of unicode strings, each string is shown on the console as its repr():
>>> a=u'á'
>>> a
u'\xe1'
>>> print a
á
>>> [a]
[u'\xe1']
>>> print [a]
[u'\xe1']
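If the goal is readable console output for a whole list, one option is to build a single string from the elements instead of printing the container. The following is a minimal sketch with made-up sample data, written to run under both Python 2 and Python 3. Keep in mind that Python 3's repr() leaves printable non-ASCII characters as they are, so the escapes from the session above show up only under Python 2.

```python
from __future__ import print_function, unicode_literals

places = ["Z\xfcrich", "Gen\xe8ve"]  # sample data, not from the real response

# Printing a list shows the repr() of every element, so str(places)
# is what print(places) would display.
as_container = str(places)

# Joining the elements yields one text string, which print() then
# encodes for the console like any other string.
as_text = ", ".join(places)

print(as_text)
```

This assumes the console encoding can represent the characters; otherwise printing `as_text` may raise an encoding error.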