Using lxml to find order of text and sub-elements

2023-01-07 07:31

Let's say I have the following HTML:

<div>
text1
<div>
  t1
</div>
text2
<div>
  t2
</div>
text3
</div>

I know how to get the text and sub-elements of the enclosing div using lxml.html. But is there a way to access both text and sub-elements iteratively, in a way that preserves order? In other words, I want to know where the "free text" of the div appears relative to the inner divs. I would like to be able to know that "text1" appears before the first inner div, that "text2" appears between the two inner divs, etc.

The ElementTree interface, which lxml also offers, supports that. For example, with the built-in ElementTree in Python 2.7:

>>> from xml.etree import ElementTree as et
>>> x='''<div>
... text1
... <div>
...   t1
... </div>
... text2
... <div>
...   t2
... </div>
... text3
... </div>'''
>>> t=et.fromstring(x)
>>> for el in t.iter():
...   print('%s: %r, %r' % (el.tag, el.text, el.tail))
... 
div: '\ntext1\n', None
div: '\n  t1\n', '\ntext2\n'
div: '\n  t2\n', '\ntext3\n'

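Depending on your version of lxml/ElementTree, you may need to spell the iterator method .getiterator() instead of .iter(). That applies only to old releases: .getiterator() was deprecated and then removed from xml.etree.ElementTree in Python 3.9, so current code should use .iter().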
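
Applied directly to the question, the text/tail model means the enclosing div's free text is its own .text followed by each child's .tail. Here is a minimal sketch, assuming only the direct children matter and that surrounding whitespace can be stripped. The helper name ordered_content is made up for illustration, and the standard-library parser stands in for lxml, which exposes the same .text and .tail attributes:

```python
from xml.etree import ElementTree as et


def ordered_content(parent):
    """Yield ('text', str) and ('element', Element) pairs for the
    direct content of parent, in document order (hypothetical helper)."""
    # Text between parent's start tag and its first child.
    if parent.text and parent.text.strip():
        yield 'text', parent.text.strip()
    for child in parent:
        yield 'element', child
        # Text between this child's end tag and the next sibling (or parent's end tag).
        if child.tail and child.tail.strip():
            yield 'text', child.tail.strip()


x = '''<div>
text1
<div>
  t1
</div>
text2
<div>
  t2
</div>
text3
</div>'''

for kind, value in ordered_content(et.fromstring(x)):
    print('%s: %s' % (kind, value if kind == 'text' else value.tag))
```

The loop alternates between text items and element items, so "text1" comes out before the first inner div and "text2" between the two.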

If you need a single generator that yields tags and texts in order, you could write, for example:

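def elements_and_texts(t):
    # t.iter() visits t and all of its descendants, parents before children.
    for el in t.iter():
        yield 'tag', el.tag
        # el.text: text between el's start tag and its first child (or end tag).
        if el.text is not None:
            yield 'text', el.text
        # el.tail: text following el's end tag, up to the next tag.
        if el.tail is not None:
            yield 'tail', el.tail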

This basically drops the None values and yields two-tuples whose first item is 'tag', 'text', or 'tail', to help you distinguish them. I imagine this is not your ideal format, but it should not be hard to mold it into something more to your liking ;-).
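
One caveat: the generator emits an element's tail right after its text, before any of that element's children. The order is therefore exact only when the elements carrying a tail are leaves, as in the sample HTML. For nested markup, a recursive variant along these lines should keep strict document order. This is a sketch, not a drop-in replacement: the name walk_in_order and the nested sample are made up for illustration, and the root's own tail is deliberately ignored.

```python
from xml.etree import ElementTree as et


def walk_in_order(el):
    """Yield ('tag' | 'text' | 'tail', value) pairs in strict document order."""
    yield 'tag', el.tag
    if el.text is not None:
        yield 'text', el.text
    for child in el:
        # Emit the child's whole subtree first...
        for item in walk_in_order(child):
            yield item
        # ...and only then the text that follows the child's end tag.
        if child.tail is not None:
            yield 'tail', child.tail


nested = et.fromstring('<div>a<p>b<i>c</i>d</p>e</div>')
for kind, value in walk_in_order(nested):
    print('%s: %r' % (kind, value))
```

In this sample the tail 'e' of the p element is produced last, after everything inside p, whereas a flat loop over .iter() would emit it before the i element.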
