Is it possible to hook up a more robust HTML parser to Python mechanize?

I am trying to parse and submit a form on a website using mechanize, but it appears that the built-in form parser cannot detect the form and its elements. I suspect that it is choking on poorly formed HTML, and I'd like to try pre-parsing it with a parser better designed to handle bad HTML (say lxml or BeautifulSoup) and then feeding the prettified, cleaned-up output to the form parser. I need mechanize not only for submitting the form but also for maintaining sessions (I'm working with this form from within a login session).

I'm not sure how to go about doing this, if it is indeed possible. I'm not that familiar with the details of the HTTP protocol or how to get the various parts to work together. Any pointers?


I had a problem where a field was missing from a form. I couldn't find any malformed HTML, but I figured bad markup was the cause, so I ran the page through BeautifulSoup's prettify function and it worked.

import mechanize
from BeautifulSoup import BeautifulSoup  # for bs4: from bs4 import BeautifulSoup

br = mechanize.Browser()
resp = br.open(url)
soup = BeautifulSoup(resp.get_data())
resp.set_data(soup.prettify())  # hand the cleaned-up markup back to mechanize
br.set_response(resp)

I'd love to know how to do this automatically.

Edit: I found out how to do this automatically:

class PrettifyHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = mechanize.response_seek_wrapper(response)
        # only use BeautifulSoup if response is html
        if 'html' in response.info().get('content-type', ''):
            soup = BeautifulSoup(response.get_data())
            response.set_data(soup.prettify())
        return response

    # also parse https in the same way
    https_response = http_response

br = mechanize.Browser()
br.add_handler(PrettifyHandler())

br will now use BeautifulSoup to parse every response whose content type (MIME type) contains html, e.g. text/html.
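
With the handler installed, the usual form workflow should work unchanged. Here's a minimal sketch continuing from the br above (the URL, form index, and field names are placeholders, not from the original question):

br.open('http://example.com/login')  # placeholder URL
br.select_form(nr=0)                 # or select_form(name=...) if the form is named
br['username'] = 'me'                # placeholder field names
br['password'] = 'secret'
response = br.submit()
print response.read()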


Reading from the big example on the first page of the mechanize website:

# Sometimes it's useful to process bad headers or bad HTML:
response = br.response()  # this is a copy of response
headers = response.info()  # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)

So it seems entirely possible to preprocess the response with another parser that regenerates well-formed HTML, then feed the result back to mechanize for further processing.
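
For instance, here is a sketch of that preprocessing step using lxml.html's forgiving parser for the round trip (assuming a url and a logged-in br as in the snippets above; the form index is illustrative):

import lxml.html

response = br.response()  # a copy of the current response
# parse the broken markup with lxml's lenient HTML parser,
# then serialize it back out as well-formed HTML
root = lxml.html.fromstring(response.get_data())
response.set_data(lxml.html.tostring(root))
br.set_response(response)
br.select_form(nr=0)  # the form should now be visible to mechanize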


What you're looking for can be done with lxml.etree, the xml.etree.ElementTree emulator (and replacement) provided by lxml:

First we take some malformed HTML:

% cat bad.html
<html>
<HEAD>
    <TITLE>this HTML is awful</title>
</head>
<body>
    <h1>THIS IS H1</H1>
    <A HREF=MYLINK.HTML>This is a link and it is awful</a>
    <img src=yay.gif>
</body>
</html>

(Observe the mixed case between the opening and closing tags, and the missing quotation marks.)

And then parse it:

>>> from lxml import etree
>>> bad = open('bad.html').read()
>>> html = etree.HTML(bad)
>>> print etree.tostring(html)
<html><head><title>this HTML is awful</title></head><body>
    <h1>THIS IS H1</h1>
    <a href="MYLINK.HTML">This is a link and it is awful</a>
    <img src="yay.gif"/></body></html>

Observe that the tag casing and quoting have been corrected for us.
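
One caveat if you plan to feed the result back to an HTML consumer like mechanize: by default etree.tostring() serializes in XML style (note the self-closing <img .../> above). Passing method='html' tells lxml to emit HTML serialization instead:

>>> print etree.tostring(html, method='html')
<html><head><title>this HTML is awful</title></head><body>
    <h1>THIS IS H1</h1>
    <a href="MYLINK.HTML">This is a link and it is awful</a>
    <img src="yay.gif"></body></html>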

If you were having problems parsing the HTML before, this might be the answer you're looking for. As for the details of HTTP, that is another matter entirely.
