Does lxml parse HTML contextually?
I'm using lxml to parse HTML:
>>> from lxml.html import fromstring, tostring
It parses trailing whitespace correctly in some cases:
>>> html = """<div>some <i>text</i> </div>"""
>>> html == tostring(fromstring(html))
True
But it seems to break when encountering unknown tags (such as the blah tag below).
>>> html = """<div>some <blah>text</blah> </div>"""
>>> html == tostring(fromstring(html))
False
How can I fix it to include trailing whitespace for all tags?
This appears to be due to the behavior of libxml2 (I've removed some error messages from the version below):
>>> import libxml2
>>> print(libxml2.htmlParseDoc("""<div>some <blah>text</blah> </div>""", "UTF-8"))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>some <blah>text</blah></div></body></html>
>>> print(libxml2.htmlParseDoc("""<div>some <i>text</i> </div>""", "UTF-8"))
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>some <i>text</i> </div></body></html>
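The same difference can be observed from lxml itself by looking at the `tail` of the inline element, which is where lxml stores text that follows a closing tag. This is only a minimal sketch, assuming Python 3 (where `tostring` returns bytes unless `encoding="unicode"` is passed); the helper name `trailing_tail` is hypothetical, and the result for the unknown tag may depend on the libxml2 version installed.

```python
from lxml.html import fromstring, tostring


def trailing_tail(html):
    """Return the tail text stored after the first child of the parsed root."""
    root = fromstring(html)
    return root[0].tail


known = """<div>some <i>text</i> </div>"""
unknown = """<div>some <blah>text</blah> </div>"""

# repr() makes a missing tail (None) distinguishable from a single space.
print(repr(trailing_tail(known)))
print(repr(trailing_tail(unknown)))

# Round-trip check that compares str with str on Python 3.
print(tostring(fromstring(known), encoding="unicode") == known)
```

If the second `repr` shows `None` rather than `' '`, the whitespace was dropped by the parser, not by the serializer.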
I am still looking for a workaround. libxml2's XML parser doesn't exhibit this behavior, but I expect it would cope much worse with broken HTML.
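To illustrate the point about the XML parser, here is a small sketch using `lxml.etree` (which wraps libxml2's XML parser) on the same fragment. It only works because this particular input happens to be well-formed XML; the second, deliberately broken fragment is made up to show the trade-off.

```python
from lxml import etree

html = """<div>some <blah>text</blah> </div>"""

# etree.fromstring uses the XML parser, which keeps whitespace by default.
root = etree.fromstring(html)
blah = root[0]

print(repr(blah.tail))
print(etree.tostring(root, encoding="unicode") == html)

# Malformed markup is rejected rather than repaired.
try:
    etree.fromstring("<div>some <blah>text </div>")
except etree.XMLSyntaxError as exc:
    print("XML parser rejected broken markup:", exc)
```

So this route is probably only viable when the input is known to be well-formed.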
You need to set a flag on the parser itself to control whitespace handling. I've done this when parsing XML, to remove blank text, like this:
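from lxml import etree

# remove_blank_text=True drops whitespace-only text between elements
parser = etree.XMLParser(remove_blank_text=True)
# `file` is the path to the XML document; etree.parse accepts a filename directly
data = etree.parse(file, parser)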
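For reference, a self-contained sketch of what this flag does, using an in-memory string instead of a file (the sample document is made up for illustration). Note that `remove_blank_text` discards whitespace-only text between elements, so by itself it does not preserve the trailing space from the question.

```python
from lxml import etree

xml = "<root>\n  <a>1</a>\n  <b>2</b>\n</root>"

# Default XML parser: whitespace between elements is kept.
default_root = etree.fromstring(xml)

# With remove_blank_text=True the whitespace-only text nodes are dropped.
parser = etree.XMLParser(remove_blank_text=True)
stripped_root = etree.fromstring(xml, parser)

print(etree.tostring(default_root, encoding="unicode"))
print(etree.tostring(stripped_root, encoding="unicode"))
```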