
Parse text of element with empty element inside

I'm trying to convert an XHTML document that uses lots of tables into a semantic XML document in Python using xml.etree. However, I'm having some trouble converting this XHTML

<TD>
  Textline1<BR/>
  Textline2<BR/>
  Textline3
</TD>

into something like this

<lines>
  <line>Textline1</line>
  <line>Textline2</line>
  <line>Textline3</line>
</lines>

The problem is that I don't know how to get the text after the BR elements.


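You need to use the .tail property of the <BR/> elements. ElementTree stores the text before an element's first child in that element's .text, and the text that follows an element in that element's .tail.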

import xml.etree.ElementTree as et

doc = """<TD>
  Textline1<BR/>
  Textline2<BR/>
  Textline3
</TD>
"""

e = et.fromstring(doc)

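# Collect the text before the first child (.text) and the text after each <BR/> (.tail).
items = []
for x in e.iter():  # getiterator() was removed in Python 3.9; iter() replaces it
    if x.text is not None:
        items.append(x.text.strip())
    if x.tail is not None:
        items.append(x.tail.strip())

doc2 = et.Element("lines")
for i in items:
    line = et.SubElement(doc2, "line")
    line.text = i

# encoding="unicode" makes tostring() return a str instead of bytes on Python 3
print(et.tostring(doc2, encoding="unicode"))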
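
If you don't need to tell .text and .tail apart, Element.itertext() may be a shorter route, since it yields every text fragment inside an element in document order. The following is only a sketch under that assumption; cell_to_lines is a made-up helper name, not part of xml.etree, and the input is the sample from the question.

```python
import xml.etree.ElementTree as ET


def cell_to_lines(cell):
    """Build a <lines> element from the text fragments inside `cell`.

    `cell_to_lines` is a hypothetical helper used for illustration only.
    """
    lines = ET.Element("lines")
    # itertext() yields the .text and .tail strings found inside `cell`,
    # in document order (the tail of `cell` itself is not included).
    for fragment in cell.itertext():
        fragment = fragment.strip()
        if fragment:  # skip whitespace-only fragments
            ET.SubElement(lines, "line").text = fragment
    return lines


td = ET.fromstring("<TD>\n  Textline1<BR/>\n  Textline2<BR/>\n  Textline3\n</TD>")
result = ET.tostring(cell_to_lines(td), encoding="unicode")
print(result)
```

Unlike the loop above, this skips empty fragments, so a trailing <BR/> with only whitespace after it would not produce an empty <line>.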


I don't think the tags being empty is your problem. xml.etree may not expect you to have child elements and bare text nodes mixed together.
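
For what it's worth, a quick check suggests xml.etree does keep mixed content; it just has no separate text-node objects. This is a small sketch using a compacted version of the sample from the question (whitespace removed so the values are easier to read):

```python
import xml.etree.ElementTree as ET

td = ET.fromstring("<TD>Textline1<BR/>Textline2<BR/>Textline3</TD>")

# Text before the first child is stored on the parent as .text
first = td.text
# Text after each child is stored on that child as .tail
tails = [child.tail for child in td]

print(first)
print(tails)
```

So the bare text is all there; it is split between the parent's .text and the children's .tail.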

BeautifulSoup is great for parsing XML or HTML that isn't well formatted:

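from bs4 import BeautifulSoup, NavigableString  # Beautiful Soup 4 (pip install beautifulsoup4)

with open('in.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
# Keep only the bare text nodes directly inside the <td>, skipping the <br> tags.
print("\n".join("<line>%s</line>" % node.strip() for node in soup.find('td').contents if isinstance(node, NavigableString)))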