How to handle nested form tags with lxml

I want to use lxml to scrape some HTML pages that contain nested form elements. Even BeautifulSoup chokes on these pages; the only parser I've found so far that can handle them is MinimalSoup, which has no knowledge of which tags can or cannot be nested.

Does lxml have any parsers that don't care about nested form tags? Any other suggestions?

If I have to I'll just continue using MinimalSoup.


How about lxml.etree.HTMLParser? That should work relatively well, right?

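from urllib.request import urlopen  # this was urllib2.urlopen in Python 2
import lxml.etree as etree

url = "http://example.com/"  # placeholder: replace with the page to scrape
page = urlopen(url)
parser = etree.HTMLParser()  # lxml's lenient HTML parser, built on libxml2
tree = etree.parse(page, parser)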

And you have your tree!
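
If you want to see what the parser does with nested forms before pointing it at a live page, here is a minimal sketch that feeds it a string instead. The SAMPLE markup and the helper names parse_html and input_names are made up for illustration. How the forms end up arranged in the tree (truly nested, or flattened into siblings) may depend on your libxml2 version, so the sketch prints what it finds rather than assuming a particular layout.

```python
from io import StringIO
import lxml.etree as etree

# Hypothetical sample markup with one form nested inside another.
SAMPLE = """
<html><body>
<form action="/outer"><input name="outer">
  <form action="/inner"><input name="inner"></form>
</form>
</body></html>
"""

def parse_html(markup):
    """Parse an HTML string with lxml's lenient HTMLParser."""
    parser = etree.HTMLParser()
    return etree.parse(StringIO(markup), parser)

def input_names(tree):
    """Return the name attribute of every <input>, in document order."""
    return tree.xpath("//input/@name")

tree = parse_html(SAMPLE)
# Inspect how the parser arranged the forms instead of assuming a layout.
print(len(tree.xpath("//form")), "form element(s) found")
print(input_names(tree))
```

If the fields you need show up in the input_names output, the parser is probably lenient enough for your pages; if not, sticking with MinimalSoup may still be the safer choice.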
