Getting unicode from a urllib request
I am running the following code trying to find particular information in some HTML. I am having an encoding/decoding problem, however, that I cannot resolve.
import urllib
req = urllib.urlopen('http://securities.stanford.edu/1046/AAI00_01/')
html = req.read()
type(html)
# <type 'str'>
html.upper().find('HTML')
# -1
print html[0:20]
# ??<HTML><HE
html[0:10]
# '\xff\xfe<\x00H\x00T\x00M\x00'
req.headers['content-type']
# 'text/html'
html = html.encode('utf-8')
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
What is the solution to this problem? All I need to do is scrape some information from the page using .find and regular expressions.
I am using Mac OS X and running Python 2.6.1 from within Terminal.
If you're trying to convert the str you have into a unicode object, you want to use html.decode, not html.encode.
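As a minimal sketch of the direction each call goes: decode turns bytes into text, and encode turns text back into bytes. The sample bytes below are made up for illustration and are not taken from the page in the question. In Python 2, calling encode on a str first decodes it implicitly with the ASCII codec, which is why the traceback in the question reports a UnicodeDecodeError even though encode was called.

```python
# Hypothetical sample: the UTF-8 bytes for "café" (not from the page in the question).
raw = b"caf\xc3\xa9"

# decode: bytes -> text (a unicode object in Python 2, a str in Python 3)
text = raw.decode("utf-8")

# encode: text -> bytes, the opposite direction
round_trip = text.encode("utf-8")

print(len(raw), len(text))  # 5 bytes, 4 characters
```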
Older, bad advice: Also, since you seem to have a BOM at the beginning there, you probably want to use 'utf_8_sig' as the encoding, which will strip the BOM on decode.
New, better advice: Actually, from seeing all those \x00's in the output along with the BOM, it looks more like the encoding is actually UTF-16, not UTF-8. The leading \xff\xfe is the little-endian UTF-16 BOM. So html.decode('utf-16') should be the way to go.
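If the encoding of a page isn't known in advance, one possible approach is to look at the leading BOM before decoding. The sketch below is only an illustration: decode_with_bom is a hypothetical helper name, the sample bytes are reconstructed from the html[0:10] output in the question, and it assumes only UTF-8 and UTF-16 BOMs need handling (UTF-32 is ignored).

```python
import codecs

def decode_with_bom(raw, default="utf-8"):
    """Hypothetical helper: choose a codec from a leading BOM, then decode."""
    if raw.startswith(codecs.BOM_UTF8):
        # utf_8_sig strips the UTF-8 BOM while decoding
        return raw.decode("utf_8_sig")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # the utf-16 codec reads the BOM to pick the byte order and drops it
        return raw.decode("utf-16")
    # no BOM found: fall back to an assumed default encoding
    return raw.decode(default)

# Sample bytes shaped like the start of the response in the question.
sample = b"\xff\xfe<\x00H\x00T\x00M\x00L\x00>\x00"
text = decode_with_bom(sample)
position = text.upper().find("HTML")
```

Once the bytes are decoded, .find and regular expressions operate on the text rather than on the interleaved \x00 bytes.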