
How can I get the content of body element by using html5lib in Python?

How can I get the content of <body> element by using html5lib in Python?

Example input data: <html><head></head><body>xxx<b>yyy</b></hr></body></html>

Expected output: xxx<b>yyy</b></hr>

It should work even if the HTML is broken (unclosed tags, ...).


html5lib allows you to parse your documents using a variety of standard tree formats. You can do this using lxml, as I've done below, or you can follow the instructions in their user documentation to do it either with minidom, ElementTree or BeautifulSoup.

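import html5lib

with open("mydocument.html", "rb") as f:
    doc = html5lib.parse(f, treebuilder="lxml", namespaceHTMLElements=False)
# Note: findtext() returns only the text before the first child element of <body>.
content = doc.findtext("body", default=None)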
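Since the question asks for the markup inside <body> and not only its text, the element's children have to be serialized as well. The following is a minimal sketch using only the standard library's ElementTree; `inner_html` is a hypothetical helper name, and the sample document is a well-formed stand-in parsed with `ET.fromstring`. With html5lib installed, the root element could presumably come from `html5lib.parse(source, namespaceHTMLElements=False)` instead, assuming its default etree tree builder.

```python
import xml.etree.ElementTree as ET
from html import escape


def inner_html(element):
    """Serialize everything inside `element`, excluding its own tags."""
    # Leading text of the element, re-escaped so "&" and "<" stay valid markup.
    parts = [escape(element.text or "", quote=False)]
    for child in element:
        # tostring() also emits the child's tail text, so text that
        # follows a child element is not lost.
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


# Stand-in for a parsed document (assumption: html5lib would supply the
# root element in real use).
root = ET.fromstring("<html><head></head><body>xxx<b>yyy</b>zzz</body></html>")
body = root.find("body")
result = inner_html(body)
print(result)
```

Because the output is rebuilt from the parsed tree, markup that the parser dropped or repaired will not come back exactly as it appeared in the input.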

Response to comment

It is possible to achieve this without installing any external libs by using html5lib's own simpletree.py, but judging by the comment at the start of the file, I would guess this is not the recommended way...

# Really crappy basic implementation of a DOM-core like thing

If you still want to do this, however, you can parse the html document like so:

import html5lib

with open("mydocument.html", "rb") as f:
    doc = html5lib.parse(f)

and then find the element you're looking for by doing a breadth-first search of the child nodes in the document. The nodes are kept in an array named childNodes and each node has a name stored in the field name.
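The breadth-first search described above can be sketched as follows. The `Node` class here is a hypothetical stand-in that only mimics the `name` and `childNodes` fields mentioned in the answer, so the example runs without html5lib; `find_first` is likewise an illustrative name, not part of the library.

```python
from collections import deque


class Node:
    """Hypothetical stand-in for a tree node with `name` and `childNodes`."""

    def __init__(self, name, childNodes=None):
        self.name = name
        self.childNodes = childNodes or []


def find_first(root, name):
    """Breadth-first search for the first node whose name matches."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.name == name:
            return node
        # Visit this node's children after everything already queued.
        queue.extend(node.childNodes)
    return None


# Small hand-built tree shaped like a parsed HTML document.
body = Node("body", [Node("b")])
document = Node("#document", [Node("html", [Node("head"), body])])
found = find_first(document, "body")
print(found.name)
```

In real use, the document object returned by the parser would take the place of `document`, provided its nodes expose the same two fields.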
