Is there a way to parse html with lxml, but manipulate it with minidom?
I have an application where I've been using html5lib to liberally parse HTML. I use the minidom interface, because I need a real DOM API and ElementTree is not appropriate for what I'm doing.
Here's how I do this:
import html5lib
parser = html5lib.XHTMLParser(tree=html5lib.treebuilders.getTreeBuilder('dom'))
dom = parser.parse(html)  # keep the resulting minidom document
However, parsing huge files is becoming a performance bottleneck, and lxml parsing is about 80 times faster than html5lib (I benchmarked it).
How do I parse with lxml or a similarly fast bad-html-tolerant library, and manipulate with a DOM-compatible API?
I think I found a solution:
from xml.dom.pulldom import SAX2DOM
import lxml.html
import lxml.sax
def parse_lxml_dom(html):
    # Parse leniently with lxml, then replay the tree as SAX events into a DOM builder.
    tree = lxml.html.document_fromstring(html)
    handler = SAX2DOM()
    lxml.sax.saxify(tree, handler)
    return handler.document
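By default SAX2DOM builds its document with minidom, so the usual DOM calls should apply to the result. Here is a minimal sketch of that; the helper name remove_elements is hypothetical, and the sample document comes from minidom.parseString only so the snippet runs without lxml installed. In practice the document would be whatever parse_lxml_dom(html) returns.

```python
from xml.dom import minidom


def remove_elements(document, tag_name):
    """Remove every element named tag_name from a minidom Document; return the count."""
    # Copy the node list before mutating the tree.
    targets = list(document.getElementsByTagName(tag_name))
    for node in targets:
        node.parentNode.removeChild(node)
    return len(targets)


# Stand-in document; in practice this would be parse_lxml_dom(html).
document = minidom.parseString(
    "<html><body><p>keep</p><script>drop()</script></body></html>"
)
removed = remove_elements(document, "script")
body = document.getElementsByTagName("body")[0]
print(removed, body.toxml())
```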
However, parse_lxml_dom is only about 7 times faster than html5lib. The saxify call takes quite a while.
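To see where the time goes, it may help to time each stage separately. This is a rough sketch using only the standard library; best_time is a hypothetical helper, and minidom.parseString is a placeholder for the callables you actually want to compare (for example lxml.html.document_fromstring on its own versus parse_lxml_dom, which would isolate the saxify cost).

```python
import time
from xml.dom import minidom


def best_time(func, arg, repeats=5):
    """Return the fastest wall-clock time, in seconds, of func(arg) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to background noise than the mean.
    return min(timings)


# Synthetic input; substitute a real (large) HTML file for meaningful numbers.
sample = "<html><body>" + "<p>x</p>" * 1000 + "</body></html>"
elapsed = best_time(minidom.parseString, sample)
print(f"minidom.parseString: {elapsed:.4f}s")
```

Dividing the best_time of one parser by that of another gives the speedup factor for a given input.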