Python HTMLParser: UnicodeDecodeError
I'm using HTMLParser to parse pages I pull down with urllib, and I'm running into UnicodeDecodeError exceptions when passing some of them to HTMLParser.
I tried using chardet to detect the encoding and then convert to ascii or utf-8 (the docs don't seem to say which it should be). Lossiness is acceptable, but while the decode/encode lines work just fine, I always get the error after self.feed().
The information is there if I just print it out.
from HTMLParser import HTMLParser
import urllib
import chardet


class search_youtube(HTMLParser):

    def __init__(self, search_terms):
        HTMLParser.__init__(self)
        self.track_ids = []
        for search in search_terms:
            self.__in_result = False
            search = urllib.quote_plus(search)
            query = 'http://youtube.com/results?search_query='
            page = urllib.urlopen(query + search).read()
            try:
                self.feed(page)
            except UnicodeDecodeError:
                encoding = chardet.detect(page)['encoding']
                if encoding != 'unicode':
                    page = page.decode(encoding)
                    page = page.encode('ascii', 'ignore')
                self.feed(page)
            print 'success'

searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids
Here's the output:
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    results = search_youtube(searches)
  File "test.py", line 23, in __init__
    self.feed(page)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.6/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
It is UTF-8, indeed. This works:
from HTMLParser import HTMLParser
import urllib


class search_youtube(HTMLParser):

    def __init__(self, search_terms):
        HTMLParser.__init__(self)
        self.track_ids = []
        for search in search_terms:
            self.__in_result = False
            search = urllib.quote_plus(search)
            query = 'http://youtube.com/results?search_query='
            connection = urllib.urlopen(query + search)
            # Read the charset from the Content-Type response header;
            # getparam() returns None if it is missing, so fall back to UTF-8.
            encoding = connection.headers.getparam('charset') or 'utf-8'
            # Decode the bytes to unicode before handing them to the parser.
            page = connection.read().decode(encoding)
            self.feed(page)
            print 'success'

searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids
You don't need chardet. YouTube are not morons; they actually send the correct encoding in the Content-Type header.
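The same idea can be pulled into a small helper that also copes with a missing or mislabelled charset. This is only a sketch: `charset_from_content_type` and `decode_page` are hypothetical names (not part of urllib or HTMLParser), the UTF-8 fallback is an assumption, and the code is written to run on both Python 2.6+ and Python 3.

```python
def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset parameter of a Content-Type value, or default."""
    # Skip the media type itself and look through the ';'-separated parameters.
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.lower() == 'charset' and value:
            return value.strip('"\'')
    return default


def decode_page(raw, content_type):
    """Decode raw response bytes to text before feeding them to a parser."""
    encoding = charset_from_content_type(content_type)
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back to lossy UTF-8 decoding
        # (assumed acceptable here, since the question says lossiness is fine).
        return raw.decode('utf-8', 'replace')
```

You would call `decode_page` with the bytes returned by `read()` and the value of the response's Content-Type header, then pass the result to `feed()`.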
What encoding does chardet say it is?
Please explain "The information is there if I just print it out": what is "it"? If you can read it and it makes sense when you print it to your console, then it must be in the usual/default encoding for your system; what is that? What operating system? What locale?
Can you give us a typical URL to make a query so that we can inspect for ourselves what you are seeing?
At one place in your code, you decode your output, then immediately smash it by using .encode('ascii', 'ignore'); why?
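To make the effect of that line concrete, here is a small sketch. The byte string is an illustrative value, not data taken from the page in question, and the snippet runs on both Python 2.6+ and Python 3.

```python
# UTF-8 bytes for an accented name (illustrative value only).
raw = b'T\xc3\xa9l\xc3\xa9popmusik'

# Decoding gives proper text: 12 characters, accents intact.
text = raw.decode('utf-8')

# Encoding back to ASCII with 'ignore' silently drops every non-ASCII
# character, so the accented letters are gone and the result is bytes again.
stripped = text.encode('ascii', 'ignore')
```

After this, `stripped` is a byte string with both accented letters removed, which is why the decode step buys nothing once the result is re-encoded this way.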