Python + lxml: how to find the namespace of a tag?

Posted: 2023-04-03 06:33

I am processing some HTML files with Python and lxml. Some of them have been edited with MS Word, so they contain <p> tags written as <o:p>&nbsp;</o:p>, for instance. IE and Firefox do not interpret these MS tags as real <p> tags and do not display line breaks before and after the <o:p> tags. The original editors formatted the files around that behavior, i.e. there are no spaces around the nbsp's.

lxml, on the other hand, tidies the markup: after processing the HTML files, all the <o:p> tags have been changed to proper <p> tags. Unfortunately, after this tidying up both browsers display line breaks around all the nbsp's, which breaks the original formatting.

So my idea was to go through all those <o:p> tags and either remove them or append their .text to the parent's .text, i.e. remove the <o:p> tag markers while keeping the content.

import lxml.html
from io import StringIO

s = '<p>somepara</p> <o:p>msoffice_para</o:p>'

# Parse the fragment with lxml's HTML parser
parser = lxml.html.HTMLParser()
html = lxml.html.parse(StringIO(s), parser)

# List every element that the parser reports as a <p>
for t in html.xpath("//p"):
    print("tag: " + t.tag + ",  text: '" + (t.text or "") + "'")

The result is:

tag: p,  text: 'somepara'
tag: p,  text: 'msoffice_para'

So, lxml removes the namespace prefix from the tag name. Is there a way to know which <p> tag comes from which namespace, so that I only remove the ones written as <o:p>?

Thanks.

From the HTML specs: "The HTML syntax does not support namespace declarations". So I think lxml.html.HTMLParser removes/ignores the namespace.

However, BeautifulSoup parses HTML differently so I thought it might be worth a shot. If you also have BeautifulSoup installed, you can use the BeautifulSoup parser with lxml like this:

import lxml.html.soupparser as soupparser
import io

s = '<p>somepara</p> <o:p>msoffice_para</o:p>'
# s is a text string, so wrap it in StringIO rather than BytesIO
html = soupparser.parse(io.StringIO(s))

BeautifulSoup does not remove the namespace, but neither does it recognize the namespace as such. Instead, it is just part of the name of the tag.

That is to say,

html.xpath('//o:p',namespaces={'o':'foo'})

does not work. But this workaround/hack

# Match on the literal tag name, prefix included
for t in html.xpath('//*[name()="o:p"]'):
    print("tag: " + t.tag + ",  text: '" + (t.text or "") + "'")

yields

tag: o:p,  text: 'msoffice_para'
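
Once the elements have been located, the goal from the question (removing the tag markers while keeping the text) can be sketched with the drop_tag() method that lxml.html elements provide. This is only a minimal illustration: a <span> stands in for the Word-specific element, because how <o:p> itself is parsed depends on the parser in use, and the unwrap() helper is a hypothetical name, not part of lxml.

```python
import lxml.html

def unwrap(root, xpath_expr):
    """Drop the tags of all elements matching xpath_expr, keeping their text.

    Hypothetical helper for illustration; relies on HtmlElement.drop_tag().
    """
    for el in root.xpath(xpath_expr):
        # drop_tag() merges the element's text, children and tail
        # into its parent (or its previous sibling's tail)
        el.drop_tag()
    return root

# <span> is a stand-in for the element whose markers should disappear
div = lxml.html.fromstring("<div>before <span>inner</span> after</div>")
unwrap(div, ".//span")
print(lxml.html.tostring(div, encoding="unicode"))
```

With the BeautifulSoup-based tree above, the same helper could presumably be called with the '//*[name()="o:p"]' expression instead of './/span'.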

If the html is actually well-formed, you could use the etree.XMLParser instead. Otherwise, try unutbu's answer.
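
As a rough sketch of the XML route: if the file is well-formed and declares the prefix, the XML parser keeps the namespace, and etree.QName exposes it per element. The sample document and the assumption that the prefix is bound to Word's usual URI (urn:schemas-microsoft-com:office:office) are illustrative, not taken from the question's files.

```python
from lxml import etree

# Assumption: the document is well-formed and binds "o" to Word's namespace URI
OFFICE_NS = "urn:schemas-microsoft-com:office:office"
doc = (
    '<html xmlns:o="%s"><body>'
    '<p>somepara</p> <o:p>msoffice_para</o:p>'
    '</body></html>' % OFFICE_NS
)
root = etree.fromstring(doc)

# Collect (namespace, text) for every element whose local name is "p"
found = [
    (etree.QName(el).namespace, el.text)
    for el in root.xpath("//*[local-name()='p']")
]
for namespace, text in found:
    print("namespace: %r, text: %r" % (namespace, text))

# Remove only the Word paragraphs' tag markers; their text is kept
etree.strip_tags(root, "{%s}p" % OFFICE_NS)
print(etree.tostring(root, encoding="unicode"))
```

Plain <p> elements report a namespace of None here, so the two kinds can be told apart before anything is stripped.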
