feedparser fails during script run, but can't reproduce in interactive python console
It fails with the following error when I run the script from Eclipse or in IPython:
'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)
I don't know why, but when I simply execute the feedparser.parse(url) statement in the interactive console using the same URL, no error is thrown. This is stumping me big time.
The code is as simple as:
    try:
        d = feedparser.parse(url)
    except Exception, e:
        logging.error('Error while retrieving feed.')
        logging.error(e)
        logging.error(formatExceptionInfo(None))
        logging.error(formatExceptionInfo1())
Here is the stack trace:
d = feedparser.parse(url)
File "C:\Python26\lib\site-packages\feedparser.py", line 2623, in parse
feedparser.feed(data)
File "C:\Python26\lib\site-packages\feedparser.py", line 1441, in feed
sgmllib.SGMLParser.feed(self, data)
File "C:\Python26\lib\sgmllib.py", line 104, in feed
self.goahead(0)
File "C:\Python26\lib\sgmllib.py", line 143, in goahead
k = self.parse_endtag(i)
File "C:\Python26\lib\sgmllib.py", line 320, in parse_endtag
self.finish_endtag(tag)
File "C:\Python26\lib\sgmllib.py", line 360, in finish_endtag
self.unknown_endtag(tag)
File "C:\Python26\lib\site-packages\feedparser.py", line 476, in unknown_endtag
method()
File "C:\Python26\lib\site-packages\feedparser.py", line 1318, in _end_content
value = self.popContent('content')
File "C:\Python26\lib\site-packages\feedparser.py", line 700, in popContent
value = self.pop(tag)
File "C:\Python26\lib\site-packages\feedparser.py", line 641, in pop
output = _resolveRelativeURIs(output, self.baseuri, self.encoding)
File "C:\Python26\lib\site-packages\feedparser.py", line 1594, in _resolveRelativeURIs
p.feed(htmlSource)
File "C:\Python26\lib\site-packages\feedparser.py", line 1441, in feed
sgmllib.SGMLParser.feed(self, data)
File "C:\Python26\lib\sgmllib.py", line 104, in feed
self.goahead(0)
File "C:\Python26\lib\sgmllib.py", line 138, in goahead
k = self.parse_starttag(i)
File "C:\Python26\lib\sgmllib.py", line 296, in parse_starttag
self.finish_starttag(tag, attrs)
File "C:\Python26\lib\sgmllib.py", line 338, in finish_starttag
self.unknown_starttag(tag, attrs)
File "C:\Python26\lib\site-packages\feedparser.py", line 1588, in unknown_starttag
attrs = [(key, ((tag, key) in self.relative_uris) and self.resolveURI(value) or value) for key, value in attrs]
File "C:\Python26\lib\site-packages\feedparser.py", line 1584, in resolveURI
return _urljoin(self.baseuri, uri)
File "C:\Python26\lib\site-packages\feedparser.py", line 286, in _urljoin
return urlparse.urljoin(base, uri)
File "C:\Python26\lib\urlparse.py", line 215, in urljoin
params, query, fragment))
File "C:\Python26\lib\urlparse.py", line 184, in urlunparse
return urlunsplit((scheme, netloc, url, query, fragment))
File "C:\Python26\lib\urlparse.py", line 192, in urlunsplit
url = scheme + ':' + url
File "C:\Python26\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
PARTIALLY SOLVED:
This is reproducible when the URL passed to feedparser.parse() is a unicode object. It won't reproduce when it's an ASCII (plain str) URL. And for the record, you need a feed that contains some non-ASCII ("high") characters. I am not sure why this is.
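One plausible explanation, offered as an assumption rather than a confirmed diagnosis: in Python 2, combining a unicode object with a byte str makes the interpreter decode the bytes with the ascii codec first. A unicode base URL joined with a raw link containing 0xe2 would then raise exactly this error. The sketch below writes that implicit decode out explicitly so it behaves the same on Python 2 and 3. The URL and the byte string are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: make Python 2's implicit "unicode + str" coercion explicit.
# The URL and fragment below are hypothetical, not taken from the real feed.

def join_like_python2(base, fragment):
    """Join a unicode base with raw bytes the way Python 2 would:
    by first decoding the bytes with the 'ascii' codec."""
    return base + fragment.decode('ascii')

base = u'http://example.invalid/feed/'   # unicode URL, as in the failing case
fragment = b'caf\xe2\x80\x99s.html'      # UTF-8 bytes; 0xe2 is not ASCII

try:
    join_like_python2(base, fragment)
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

print(error)
```

If this is indeed the cause, a workaround worth trying on Python 2 is to pass a byte string, for example feedparser.parse(url.encode('utf-8')). That would match the observation that a plain str URL does not fail.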
Looks like the URL that is giving you problems serves text in some encoding (such as latin-1, where 0xe2 would be "lowercase a with a circumflex", i.e. â) without a proper Content-Type header (it should have a charset= parameter in Content-Type: but doesn't). If that is the case, feedparser cannot guess the encoding, tries the default (ascii), and fails.
This part of feedparser's docs explains the issues in more detail.
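To check whether a server declares a charset at all, something like the following could help. It is only a sketch using the standard library's email.message.Message to parse a Content-Type value. The helper name charset_from_content_type and the sample header values are illustrative, not part of feedparser.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset()

# A header that declares its encoding, and one that does not.
declared = charset_from_content_type('application/atom+xml; charset=ISO-8859-1')
missing = charset_from_content_type('text/xml')
print(declared, missing)
```

A None result corresponds to the situation described above: the parser has to guess the encoding.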
Unfortunately there are no "magic bullets" to solve this general issue (due to bozos that break the XML rules). You could try catching this exception and, in the handler, reading the URL's contents separately (using urllib2) and decoding them with various possible encodings. When you finally get a usable unicode object this way, feed that to feedparser.parse (whose first arg can be a URL, a file stream, or a unicode string with the data).
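The "try various encodings" step might look roughly like this. It is a minimal sketch: the function name decode_with_fallback and the candidate list are assumptions, and latin-1 is placed last because it accepts any byte sequence and would otherwise hide the other candidates.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Try each candidate encoding in order.

    Returns a (text, encoding_name) pair for the first one that decodes
    without error; raises ValueError if none of them does.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')

# Valid UTF-8 is accepted by the first candidate.
print(decode_with_fallback(b'caf\xc3\xa9'))
# A lone 0xe2 is not valid UTF-8, so the next candidate is used.
print(decode_with_fallback(b'\xe2 la carte'))
```

The raw bytes would come from something like urllib2.urlopen(url).read() on Python 2, and the decoded text would then be handed to feedparser.parse. A successful decode does not prove the encoding is the right one, only that the bytes are valid in it.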
With reference to the OP's comment: Try any url literal, such as u'myfeed.blah/xml'. It should reproduce.
>>> from pprint import pprint as pp
>>> import feedparser
>>> d = feedparser.parse(u'myfeed.blah/xml')
>>> pp(d)
{'bozo': 1,
 'bozo_exception': SAXParseException('not well-formed (invalid token)',),
 'encoding': 'utf-8',
 'entries': [],
 'feed': {},
 'namespaces': {},
 'version': ''}
>>> d = feedparser.parse(u'http://myfeed.blah/xml')
>>> pp(d)
{'bozo': 1,
 'bozo_exception': URLError(gaierror(11001, 'getaddrinfo failed'),),
 'encoding': 'utf-8',
 'entries': [],
 'feed': {},
 'version': None}
>>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
>>> d['bozo']
0
>>> d['feed']['title']
u'Sample Feed'
>>> d = feedparser.parse(u"http://feedparser.org/docs/examples/atom10.xml")
>>> d['bozo']
0
>>> d['feed']['title']
u'Sample Feed'
>>>
Please stop thrashing about; provide a URL that actually causes the problem.