开发者

feedparser fails during script run, but can't reproduce in interactive python console

It's failing with this when I run eclipse or when I run my script in iPython:

'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128) 

I don't know why, but when I simply execute the feedparse.parse(url) statement using the same url, there is no error thrown. This is stumping me big time.

The code is as simple as:

      try:
           d = feedparser.parse(url)
      except Exception, e:
           logging.error('Error while retrieving feed.')
           logging.error(e)
           logging.error(formatExceptionInfo(None))
           logging.error(formatExceptionInfo1())

Here is the stack trace:

d = feedparser.parse(url)


 File "C:\Python26\lib\site-packages\feedparser.py", line 2623, in parse
    feedparser.feed(data)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(s开发者_如何学运维elf, data)
  File "C:\Python26\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Python26\lib\sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "C:\Python26\lib\sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "C:\Python26\lib\sgmllib.py", line 360, in finish_endtag
    self.unknown_endtag(tag)
  File "C:\Python26\lib\site-packages\feedparser.py", line 476, in unknown_endtag
    method()
  File "C:\Python26\lib\site-packages\feedparser.py", line 1318, in _end_content
    value = self.popContent('content')
  File "C:\Python26\lib\site-packages\feedparser.py", line 700, in popContent
    value = self.pop(tag)
  File "C:\Python26\lib\site-packages\feedparser.py", line 641, in pop
    output = _resolveRelativeURIs(output, self.baseuri, self.encoding)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1594, in _resolveRelativeURIs
    p.feed(htmlSource)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1441, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "C:\Python26\lib\sgmllib.py", line 104, in feed
    self.goahead(0)
  File "C:\Python26\lib\sgmllib.py", line 138, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\sgmllib.py", line 296, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "C:\Python26\lib\sgmllib.py", line 338, in finish_starttag
    self.unknown_starttag(tag, attrs)
  File "C:\Python26\lib\site-packages\feedparser.py", line 1588, in unknown_starttag
    attrs = [(key, ((tag, key) in self.relative_uris) and self.resolveURI(value) or value) for key, value in attrs]
  File "C:\Python26\lib\site-packages\feedparser.py", line 1584, in resolveURI
    return _urljoin(self.baseuri, uri)
  File "C:\Python26\lib\site-packages\feedparser.py", line 286, in _urljoin
    return urlparse.urljoin(base, uri)
  File "C:\Python26\lib\urlparse.py", line 215, in urljoin
    params, query, fragment))
  File "C:\Python26\lib\urlparse.py", line 184, in urlunparse
    return urlunsplit((scheme, netloc, url, query, fragment))
  File "C:\Python26\lib\urlparse.py", line 192, in urlunsplit
    url = scheme + ':' + url
  File "C:\Python26\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)

PARTIALLY SOLVED:

This is reproducable when the URL being passed to feedparser.parse() is unicode. It won't repro when it's an ascii URL. And for the record, you need a feed that has some high character unicode characters. I am not sure why this is.


Looks like the url that is giving you problem contains text with some encoding (such as latin-1, where 0xe2 would be "lowercase a with a circle on top" aka â) without a proper content-type header (it should have a charset= parameter in Content-Type: but doesn't).

If that is the case feedparser cannot guess the encoding, tries the default (ascii), and fails.

this part of feedparser's docs explains the issues in more detail.

Unfortunately there are no "magic bullets" to solve this general issue (due to bozos that break the XML rules). You could try catching this exception, and in the handler read the url's contents separately (use urllib2) and try decoding them with various possible encodings -- then when you finally get a usable unicode object this way, feed that to feedparser.parse (whose first arg can be a url, a file stream, or a unicode string with the data).


With reference to the OP's comment: Try any url literal, such as u'myfeed.blah/xml' It should reproduce.

>>> from pprint import pprint as pp
>>> import feedparser

>>> d = feedparser.parse(u'myfeed.blah/xml')
>>> pp(d)
{'bozo': 1,
 'bozo_exception': SAXParseException('not well-formed (invalid token)',),
 'encoding': 'utf-8',
 'entries': [],
 'feed': {},
 'namespaces': {},
 'version': ''}

>>> d = feedparser.parse(u'http://myfeed.blah/xml')
>>> pp(d)
{'bozo': 1,
 'bozo_exception': URLError(gaierror(11001, 'getaddrinfo failed'),),
 'encoding': 'utf-8',
 'entries': [],
 'feed': {},
 'version': None}

>>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
>>> d['bozo']
0
>>> d['feed']['title']
u'Sample Feed'

>>> d = feedparser.parse(u"http://feedparser.org/docs/examples/atom10.xml")
>>> d['bozo']
0
>>> d['feed']['title']
u'Sample Feed'
>>>

Please stop thrashing about; provide a URL that actually causes the problem.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜