
Decoding problem with urllib2 in Python

I'm trying to use urllib2 in Python 2.7 to fetch a page from the web. The page is encoded in Unicode (UTF-8) and contains Greek characters. When I fetch and print it with the code below, I get gibberish instead of the Greek characters.

import urllib2
print urllib2.urlopen("http://www.pamestihima.gr").read()

The result is the same in both NetBeans 6.9.1 and the Windows 7 command line.

I'm doing something wrong, but what?


  1. Unicode is not UTF-8. Unicode is a character set; UTF-8 is a character encoding (a way of turning characters into bytes), like ISO-8859-1, ASCII, etc.

  2. Always decode your data as soon as possible to make real Unicode out of it: 'somestring in utf-8'.decode('utf-8') == u'somestring in utf-8'. In Python 2, unicode objects are written u'', not ''.

  3. When you have data leaving your app, always encode it in the proper encoding. For Web stuff this is mostly UTF-8. For console stuff this is whatever your console encoding is. On Windows this is not UTF-8 by default.
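The decode-early, encode-late advice above can be sketched roughly as follows. This is only an illustration, not the definitive fix: the helper names `decode_page` and `encode_for_console` are made up for this example, the page is assumed to really be UTF-8 as the question states, and the fallback to "ascii" when the console encoding is unknown is an arbitrary choice.

```python
from __future__ import unicode_literals

import sys


def decode_page(raw_bytes, source_encoding="utf-8"):
    """Turn the bytes read from the network into a unicode object.

    source_encoding is an assumption: the question says the page is UTF-8.
    """
    return raw_bytes.decode(source_encoding)


def encode_for_console(text, console_encoding=None):
    """Encode unicode text for the console it is about to be written to.

    If no encoding is given, use what Python reports for stdout, falling
    back to ASCII (hypothetical default) when that is unknown. Characters
    the console encoding cannot represent are replaced with '?'.
    """
    if console_encoding is None:
        console_encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(console_encoding, "replace")


# The UTF-8 bytes for the first three lowercase Greek letters.
sample_bytes = b"\xce\xb1\xce\xb2\xce\xb3"
sample_text = decode_page(sample_bytes)
```

In Python 2.7 the question's one-liner would then become something like `print encode_for_console(decode_page(urllib2.urlopen(url).read()))`. If the console code page has no Greek letters at all, the output will be question marks rather than gibberish, which at least makes the cause visible.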


It prints correctly for me, too.

Check the character encoding of the program in which you are viewing the HTML source code. For example, in a Linux terminal, you can find "Set Character Encoding" and make sure it is UTF-8.
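To see which encoding Python itself thinks the console is using, a small check like the one below may help. It is a sketch only; `console_encodings` is a hypothetical helper name, and the values it returns depend entirely on the terminal or IDE the script runs in.

```python
import locale
import sys


def console_encodings():
    """Return the encodings Python reports for the current environment."""
    return {
        # Encoding of the stream that print writes to (may be None when piped).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding the locale settings suggest for text data.
        "preferred": locale.getpreferredencoding(),
    }
```

If the "stdout" value is not UTF-8 (for example a legacy Windows code page), printing raw UTF-8 bytes will show up as gibberish even though the fetched data is correct.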
