BeautifulSoup is choking on jQuery script, any known workaround?
I'm giving BeautifulSoup an html document and simply by constructing a BeautifulSoup object instance with the full html, it seems to choke on the following line of a jQuery script that's embedded within the html:
var txt = "Logged in as: <a href=\"http://somedomain.com/the-blah/\">" + uname + "</a> <small>(<a href=\"http://somedomain.com/the-blah/\">The Blah</a> | <a href=\"http://somedomain.com/the-blah/?action=logout\">logout</a>)</small>";
The full stack trace for the error is the following:
/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, *args, **kwargs)
1497 kwargs['smartQuotesTo'] = self.HTML_ENTITIES
1498 kwargs['isHTML'] = True
-> 1499 BeautifulStoneSoup.__init__(self, *args, **kwargs)
1500
1501 SELF_CLOSING_TAGS = buildTagMap(None,
/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, markup, parseOnlyThese, fromEncoding, markupMassage, smartQuotesTo, convertEntities, selfClosingTags, isHTML, builder)
1228 self.markupMassage = markupMassage
1229 try:
-> 1230 self._feed(isHTML=isHTML)
1231 except StopParsing:
1232 pass
/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in _feed(self, inDocumentEncoding, isHTML)
1261 self.builder.reset()
1262
-> 1263 self.builder.feed(markup)
1264 # Close out any unfinished strings and close all the open tags.
1265 self.endData()
/usr/lib/python2.6/HTMLParser.pyc in feed(self, data)
106 """
107 self.rawdata = self.rawdata + data
--> 108 self.goahead(0)
109
110 def close(self):
/usr/lib/python2.6/HTMLParser.pyc in goahead(self, end)
146 if startswith('<', i):
147 if starttagopen.match(rawdata, i): # < + letter
--> 148 k = self.parse_starttag(i)
149 elif startswith("</", i):
150 开发者_Python百科 k = self.parse_endtag(i)
/usr/lib/python2.6/HTMLParser.pyc in parse_starttag(self, i)
227 def parse_starttag(self, i):
228 self.__starttag_text = None
--> 229 endpos = self.check_for_whole_start_tag(i)
230 if endpos < 0:
231 return endpos
/usr/lib/python2.6/HTMLParser.pyc in check_for_whole_start_tag(self, i)
302 return -1
303 self.updatepos(i, j)
--> 304 self.error("malformed start tag")
305 raise AssertionError("we should not get here!")
306
/usr/lib/python2.6/HTMLParser.pyc in error(self, message)
113
114 def error(self, message):
--> 115 raise HTMLParseError(message, self.getpos())
116
117 __starttag_text = None
HTMLParseError: malformed start tag, at line 193, column 110
From what I can glean it has something to do with the angle brackets being within quotes, it seems to be thrown off by this. What kind of work around is there, or is there another library that handles these edge cases better? Or alternatively, is there a way to tell it to ignore all javascript content?
the easiest way would probably be to delete all the scripts. see the section Removing Elements in the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html#Removing%20elements
精彩评论