BeautifulSoup is choking on jQuery script, any known workaround?

Posted: 2023-01-24 20:30

I'm passing an HTML document to BeautifulSoup, and merely constructing a BeautifulSoup instance from the full HTML fails. It seems to choke on the following line of a jQuery script that's embedded within the HTML:

        var txt = "Logged in as: <a href=\"http://somedomain.com/the-blah/\">" + uname + "</a> <small>(<a href=\"http://somedomain.com/the-blah/\">The Blah</a> | <a href=\"http://somedomain.com/the-blah/?action=logout\">logout</a>)</small>";

The full stack trace for the error is the following:

    /usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, *args, **kwargs)
       1497             kwargs['smartQuotesTo'] = self.HTML_ENTITIES
       1498         kwargs['isHTML'] = True
    -> 1499         BeautifulStoneSoup.__init__(self, *args, **kwargs)
       1500 
       1501     SELF_CLOSING_TAGS = buildTagMap(None,

    /usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, markup, parseOnlyThese, fromEncoding, markupMassage, smartQuotesTo, convertEntities, selfClosingTags, isHTML, builder)
       1228         self.markupMassage = markupMassage
       1229         try:
    -> 1230             self._feed(isHTML=isHTML)
       1231         except StopParsing:
       1232             pass

    /usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in _feed(self, inDocumentEncoding, isHTML)
       1261         self.builder.reset()
       1262 
    -> 1263         self.builder.feed(markup)
       1264         # Close out any unfinished strings and close all the open tags.
       1265         self.endData()

    /usr/lib/python2.6/HTMLParser.pyc in feed(self, data)
        106         """
        107         self.rawdata = self.rawdata + data
    --> 108         self.goahead(0)
        109 
        110     def close(self):

    /usr/lib/python2.6/HTMLParser.pyc in goahead(self, end)
        146             if startswith('<', i):
        147                 if starttagopen.match(rawdata, i): # < + letter
    --> 148                     k = self.parse_starttag(i)
        149                 elif startswith("</", i):
        150                     k = self.parse_endtag(i)

    /usr/lib/python2.6/HTMLParser.pyc in parse_starttag(self, i)
        227     def parse_starttag(self, i):
        228         self.__starttag_text = None
    --> 229         endpos = self.check_for_whole_start_tag(i)
        230         if endpos < 0:
        231             return endpos

    /usr/lib/python2.6/HTMLParser.pyc in check_for_whole_start_tag(self, i)
        302                 return -1
        303             self.updatepos(i, j)
    --> 304             self.error("malformed start tag")
        305         raise AssertionError("we should not get here!")
        306 

    /usr/lib/python2.6/HTMLParser.pyc in error(self, message)
        113 
        114     def error(self, message):
    --> 115         raise HTMLParseError(message, self.getpos())
        116 
        117     __starttag_text = None

    HTMLParseError: malformed start tag, at line 193, column 110

From what I can glean, it has something to do with the angle brackets appearing inside quoted strings, which seems to throw the parser off. Is there a workaround, or another library that handles these edge cases better? Alternatively, is there a way to tell it to ignore all JavaScript content?

The easiest way would probably be to delete all the scripts. See the section "Removing elements" in the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html#Removing%20elements
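
Since the traceback in the question shows the error being raised while the BeautifulSoup object is still being constructed, one possible variation on this idea is to strip the script blocks from the raw markup before it ever reaches the parser. The following is only a rough sketch of that approach, not something taken from the BeautifulSoup documentation: the `strip_scripts` helper and the sample markup are made up for illustration, and a regular expression is a blunt tool that may not cope with every page.

```python
import re

# Matches a whole <script ...>...</script> element, case-insensitively,
# including scripts that span several lines (re.DOTALL lets "." match newlines).
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Hypothetical helper: return html with every <script> element removed."""
    return SCRIPT_RE.sub("", html)


# Made-up sample page containing a line similar to the one in the question.
html = """<html><head>
<script type="text/javascript">
var txt = "Logged in as: <a href=\\"http://somedomain.com/the-blah/\\">" + uname + "</a>";
</script>
</head><body><p>Hello</p></body></html>"""

cleaned = strip_scripts(html)
print(cleaned)
```

The cleaned string can then be passed to the BeautifulSoup constructor in place of the original markup. Whether that is enough to avoid the HTMLParseError depends on the rest of the page, so treat it as a starting point.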
