Python + lxml: how to find the namespace of a tag?
I am processing some HTML files with Python + lxml. Some of them have been edited with MS Word and contain <p> tags written as <o:p> </o:p>, for instance. IE and Firefox do not interpret these MS tags as real <p> tags and do not display line breaks before and after the <o:p> tags, and that is how the original editors formatted the files, e.g. no spacing around the nbsp's.
lxml, on the other hand, is tidy: after processing the HTML files, all the <o:p> tags have been changed to proper <p> tags. Unfortunately, after this tidying up, both browsers now display line breaks around all the nbsp's, which breaks the original formatting.
So my idea was to go through all those <o:p> tags and either remove them or append their .text attribute to the parent's .text attribute, i.e. remove the <o:p> tag markers.
from io import StringIO
import lxml.html

s = u'<p>somepara</p> <o:p>msoffice_para</o:p>'
parser = lxml.html.HTMLParser()
html = lxml.html.parse(StringIO(s), parser)
# Print every element that lxml reports as <p>.
for t in html.xpath("//p"):
    print("tag: " + t.tag + ", text: '" + t.text + "'")
The result is:
tag: p, text: 'somepara'
tag: p, text: 'msoffice_para'
So lxml removes the namespace prefix from the tag. Is there a way to know which <p> tag came from which namespace, so that I only remove the ones that were written as <o:p>?
Thanks.
From the HTML specs: "The HTML syntax does not support namespace declarations".
So I think lxml.html.HTMLParser removes/ignores the namespace.
However, BeautifulSoup parses HTML differently so I thought it might be worth a shot. If you also have BeautifulSoup installed, you can use the BeautifulSoup parser with lxml like this:
import io
import lxml.html.soupparser as soupparser

s = '<p>somepara</p> <o:p>msoffice_para</o:p>'
# BytesIO needs bytes, so encode the markup first.
html = soupparser.parse(io.BytesIO(s.encode('utf-8')))
BeautifulSoup does not remove the namespace, but neither does it recognize the namespace as such. Instead, it is just part of the name of the tag.
That is to say,
html.xpath('//o:p',namespaces={'o':'foo'})
does not work. But this workaround/hack
for t in html.xpath('//*[name()="o:p"]'):
    print("tag: " + t.tag + ", text: '" + t.text + "'")
yields
tag: o:p, text: 'msoffice_para'
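If the goal is only to drop the <o:p> markers while keeping their text, another option is to strip them from the raw markup before any parser sees it, which sidesteps the namespace question entirely. The following is a rough sketch of that idea, not something taken from lxml or BeautifulSoup: the helper name strip_office_p is hypothetical, and the regular expression assumes that no attribute value inside an <o:p> tag contains a literal '>'.

```python
import re

# Matches opening, closing, and self-closing <o:p> tags, case-insensitively.
# Assumption: no attribute value inside the tag contains a literal '>'.
_O_P_TAG = re.compile(r'</?o:p(?:\s[^>]*)?/?>', re.IGNORECASE)


def strip_office_p(markup):
    """Remove <o:p> tag markers from raw markup, keeping the text between them."""
    return _O_P_TAG.sub('', markup)


s = '<p>somepara</p> <o:p>msoffice_para</o:p>'
cleaned = strip_office_p(s)
print(cleaned)
```

The cleaned string can then be handed to lxml.html as usual, so only the real <p> tags remain as paragraphs.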
If the HTML is actually well-formed, you could use etree.XMLParser instead. Otherwise, try unutbu's answer.
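A minimal sketch of the XML parser route, assuming the input really is well-formed XML and declares the "o" prefix on its root element. The sample document below is made up for illustration; urn:schemas-microsoft-com:office:office is the URI that Word normally binds to that prefix. With the XML parser, lxml reports each tag as "{uri}localname", so the two kinds of <p> can be told apart, and etree.strip_tags can remove just the Office ones.

```python
from lxml import etree

# Namespace URI bound to the "o" prefix (assumed to be declared in the input).
OFFICE_NS = 'urn:schemas-microsoft-com:office:office'

# Made-up, well-formed sample document for illustration.
s = ('<html xmlns:o="%s"><body>'
     '<p>somepara</p> <o:p>msoffice_para</o:p>'
     '</body></html>' % OFFICE_NS)

root = etree.fromstring(s, etree.XMLParser())

# The XML parser keeps the namespace, so both kinds of <p> are distinguishable.
for t in root.iter('p', '{%s}p' % OFFICE_NS):
    q = etree.QName(t)
    print("namespace: %r, local name: %r, text: %r"
          % (q.namespace, q.localname, t.text))

# Drop only the <o:p> markers; their text is merged into the surrounding content.
etree.strip_tags(root, '{%s}p' % OFFICE_NS)
result = etree.tostring(root, encoding='unicode')
print(result)
```

This only helps when the files parse as XML at all; markup that is not well-formed will make etree.XMLParser raise an error, which is where the BeautifulSoup approach comes in.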