
Is there a way to parse html with lxml, but manipulate it with minidom?

I have an application where I've been using html5lib to liberally parse HTML. I use the minidom interface, because I need a real DOM API and ElementTree is not appropriate for what I'm doing.

Here's how I do this:

import html5lib
# Build a minidom tree instead of the default ElementTree one.
parser = html5lib.XHTMLParser(tree=html5lib.treebuilders.getTreeBuilder('dom'))
dom = parser.parse(html)

However, parsing huge files is becoming a performance bottleneck, and lxml parsing is about 80 times faster than html5lib (I benchmarked it).
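For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. It is not the benchmark I actually ran: `time_parser`, `speedup`, and `sample` are hypothetical names, and the standard library's `minidom.parseString` stands in for the real parsers so the snippet runs without html5lib or lxml installed.

```python
import timeit
from xml.dom import minidom


def time_parser(parse, markup, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` runs of parse(markup)."""
    return min(timeit.repeat(lambda: parse(markup), number=1, repeat=repeat))


def speedup(baseline, candidate, markup, repeat=5):
    """Return how many times faster `candidate` is than `baseline` on `markup`."""
    return time_parser(baseline, markup, repeat) / time_parser(candidate, markup, repeat)


# Stand-in input: a well-formed document, since minidom is a strict XML parser.
sample = "<html><body>" + "<p>hello <b>world</b></p>" * 1000 + "</body></html>"

elapsed = time_parser(minidom.parseString, sample)
print(f"minidom: {elapsed:.4f} s")
```

To compare the real parsers, pass the html5lib `parser.parse` as `baseline` and `lxml.html.document_fromstring` as `candidate`, and use a representative large file as `markup`. Taking the minimum of several runs reduces noise from other processes.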

How do I parse with lxml or a similarly fast bad-html-tolerant library, and manipulate with a DOM-compatible API?


Think I found a solution:

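from xml.dom.pulldom import SAX2DOM
import lxml.html
import lxml.sax

def parse_lxml_dom(html):
    # Parse leniently with lxml, then replay the tree as SAX events
    # into a handler that builds a minidom document.
    tree = lxml.html.document_fromstring(html)
    handler = SAX2DOM()
    lxml.sax.saxify(tree, handler)
    return handler.document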

However, this is only about 7 times faster than html5lib. The saxify call takes quite a while.
