Why do I get garbled characters when opening a URL with urllib2?

Here's my code; you can also test it yourself. I always get garbled characters instead of the page source.

import urllib2

Header = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)"}

Req = urllib2.Request("http://rlslog.net", None, Header)

Response = urllib2.urlopen(Req)

Html = Response.read()

print Html[:1000]

Normally Html should contain the page source, but it ends up being a mass of garbled characters. Does anybody know why?

BTW: I'm on Python 2.7.


As Bruce already suggested, this looks like a compression problem. The server returns gzip-compressed content, but urllib2 does not decompress gzip automatically. As far as I know, the server is actually misbehaving here: it should only compress the content if the request carries an Accept-Encoding: gzip header (which you either set yourself, or which your client adds automatically if it supports gzip).

So either use a library that handles this automatically, such as httplib2 (which I tested against the page in question, and it works), or decompress the response yourself. See the answer to this SO question for how to do it; note that in that question the headers returned by the server are checked to see whether the content is gzip-compressed.
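As a rough illustration of the "decompress yourself" route, here is a minimal sketch. The helper names decode_body and fetch are made up for this example, and it assumes the server labels the body with a Content-Encoding: gzip header; it is not the code from the linked SO answer.

```python
import zlib

try:
    import urllib2  # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 fallback, only so the sketch runs there too


def decode_body(raw, content_encoding):
    """Return the decompressed body if the server declared gzip, else raw unchanged.

    decode_body is a hypothetical helper name, not part of urllib2.
    """
    if content_encoding and "gzip" in content_encoding.lower():
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw


def fetch(url, headers=None):
    """Fetch url and undo gzip compression when the response headers announce it."""
    request = urllib2.Request(url, None, headers or {})
    response = urllib2.urlopen(request)
    raw = response.read()
    return decode_body(raw, response.info().get("Content-Encoding"))
```

With the Header dict from the question, fetch("http://rlslog.net", Header)[:1000] should then give readable page source, provided the server really does send a gzip body with a matching Content-Encoding header.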


You are sending the User-Agent of a browser that supports on-the-fly compression. Are you sure the output is not gzip-compressed? Try running it through the zlib module and/or printing the response headers.
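One way to check this quickly is sketched below. It relies on the fact that every gzip stream starts with the two bytes 0x1f 0x8b; the names looks_gzipped and peek are invented for this example.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(raw):
    """True if the body starts with the gzip magic number."""
    return raw[:2] == GZIP_MAGIC


def peek(raw):
    """Return a readable body: gunzip it if it looks gzipped, else return it as-is."""
    if looks_gzipped(raw):
        # 16 + MAX_WBITS makes zlib accept the gzip header and trailer
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw
```

In the question's script you could print Response.info() to see whether a Content-Encoding: gzip header came back, and then print peek(Html)[:1000] instead of Html[:1000].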
