Decoding problem with urllib2 in Python
I'm trying to use urllib2 in Python 2.7 to fetch a page from the web. The page is encoded in Unicode (UTF-8) and contains Greek characters. When I fetch and print it with the code below, I get gibberish instead of the Greek characters.
import urllib2
print urllib2.urlopen("http://www.pamestihima.gr").read()
The result is the same in both NetBeans 6.9.1 and the Windows 7 command line.
I'm doing something wrong, but what?
Unicode is not UTF-8. UTF-8 is a string encoding, like ISO-8859-1, ASCII etc.
Always decode your data as soon as possible, to make real Unicode out of it ('somestring in utf8'.decode('utf-8') == u'somestring in utf8'). Unicode objects are written u'', not ''.
When you have data leaving your app, always encode it in the proper encoding. For web stuff this is mostly UTF-8. For console stuff this is whatever your console encoding is. On Windows this is not UTF-8 by default.
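A rough sketch of the decode-early, encode-late rule described above, not the only way to do it. The helper names decode_page and encode_for_console are made up for illustration, the byte string stands in for what urllib2.urlopen(...).read() returns, and the charset is assumed to be UTF-8 rather than read from the response headers.

```python
import sys


def decode_page(raw_bytes, charset="utf-8"):
    """Decode raw response bytes into a unicode string (hypothetical helper)."""
    return raw_bytes.decode(charset)


def encode_for_console(text, console_encoding=None):
    """Encode text for output (hypothetical helper).

    Falls back to sys.stdout.encoding, then to ASCII if that is unset
    (for example when output is piped). Characters the target encoding
    cannot represent are replaced with '?' instead of raising an error.
    """
    encoding = console_encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, "replace")


# Stand-in for the fetched page: the UTF-8 bytes of a six-letter Greek word.
raw = b"\xce\x95\xce\xbb\xce\xbb\xce\xac\xce\xb4\xce\xb1"

text = decode_page(raw)                  # unicode from here on
console_bytes = encode_for_console(text)  # bytes again, only at the output edge
```

With the original fetch, raw would be the result of urllib2.urlopen("http://www.pamestihima.gr").read(). On a console whose encoding has no Greek letters, the output degrades to question marks rather than failing with UnicodeEncodeError.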
It prints correctly for me, too.
Check the character encoding of the program in which you are viewing the HTML source code. For example, in a Linux terminal, you can find "Set Character Encoding" and make sure it is UTF-8.
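To see why the viewer's encoding matters, here is a small sketch that interprets the same bytes two ways. The sample bytes and the choice of Latin-1 as the "wrong" encoding are assumptions for illustration; your console or editor may use a different code page.

```python
# UTF-8 bytes of a six-letter Greek word (an arbitrary sample).
raw = b"\xce\x95\xce\xbb\xce\xbb\xce\xac\xce\xb4\xce\xb1"

# A viewer set to UTF-8 reads each two-byte sequence as one Greek letter.
correct = raw.decode("utf-8")

# A viewer set to Latin-1 reads every byte as its own character,
# which is the gibberish described in the question.
garbled = raw.decode("latin-1")

# Same bytes, different text: 6 characters versus 12.
lengths = (len(correct), len(garbled))
```

The bytes coming back from the server are fine in both cases. Only the interpretation differs, which is why the page can print correctly on one machine and as gibberish on another.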