Multiple tag names in lxml's iterparse?

2023-01-12 13:29

Is there a way to get multiple tag names from lxml's lxml.etree.iterparse? I have a file-like object with an expensive read operation and many tags, so getting all tags or doing two passes is suboptimal.

Edit: It would be something like Beautiful Soup's find(['tag-1', 'tag-2']), except as an argument to iterparse. Imagine parsing an HTML page for both <td> and <div> tags.

I know I'm late to the game, but maybe someone else needs help with the same issue. The tag argument of iterparse accepts a sequence of names, so this code will generate events for both Tag1 and Tag2 elements:

etree.iterparse(io.BytesIO(xml), events=('end',), tag=('Tag1', 'Tag2'))
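
For a runnable illustration, here is a minimal sketch of that call. The sample document and the helper name collect_tags are made up for the example; only the tag argument of lxml.etree.iterparse comes from the answer above.

```python
import io
from lxml import etree

# Made-up sample input; <Other> should be filtered out by the tag argument.
xml = b"<root><Tag1>a</Tag1><Other>x</Other><Tag2>b</Tag2><Tag1>c</Tag1></root>"

def collect_tags(data, names):
    """Hypothetical helper: return (tag, text) for every element whose tag is in names."""
    found = []
    for event, elem in etree.iterparse(io.BytesIO(data), events=("end",), tag=names):
        found.append((elem.tag, elem.text))
    return found

result = collect_tags(xml, ("Tag1", "Tag2"))
print(result)
```

The parser still reads every element; the tag filter only controls which ones produce events.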

I'm not 100% sure what you mean here by "getting all tags", but perhaps this is what you're looking for:

from lxml.etree import iterparse

for event, elem in iterparse(file_like_object):  # default events=('end',)
    if elem.tag in ('td', 'div'):
        # reached the end of an interesting tag
        print('found:', elem.tag)
        # possibly quit early to prevent further parsing
        if exit_condition:
            break

iterparse generates events on the fly during parsing, so you only read as much data as is required. However, you cannot skip reading elements during parsing, because the parser would not know how far to skip. In the code above, we simply ignore the tags we're not interested in.
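
To see the on-the-fly behaviour, the sketch below wraps the input in a hypothetical CountingReader that records how many bytes the parser requested, then stops at the first matching tag. The class, the generated document, and the variable names are assumptions for illustration; how much is read before the first event depends on the parser's internal chunk size.

```python
import io
from lxml.etree import iterparse

class CountingReader:
    """Hypothetical file-like wrapper that counts the bytes handed to the parser."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._buf.read(size)
        self.bytes_read += len(chunk)
        return chunk

# Made-up document: many rows, large enough to span several read() calls.
rows = b"".join(b"<tr><td>%d</td><div>x</div></tr>" % i for i in range(50000))
data = b"<table>" + rows + b"</table>"

reader = CountingReader(data)
first = None
for event, elem in iterparse(reader):
    if elem.tag in ("td", "div"):
        first = elem.tag
        break  # quit early; the rest of the input is never requested

print("first match:", first)
print("bytes read:", reader.bytes_read, "of", len(data))
```

With an expensive read operation, an early break like this is where the savings come from; filtering alone does not reduce the amount of data read.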

As you may already know, don't use XML parsers for HTML. Edit: it turns out that lxml does support HTML parsing, but you should check the docs to see to what extent.
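
As a sketch of the HTML case from the question, iterparse also takes an html=True keyword that switches to lxml's HTML parser, and it can be combined with tag. The sample markup and variable names below are invented for illustration; check the lxml documentation for how the HTML parser handles malformed input.

```python
import io
from lxml import etree

# Made-up HTML snippet; <p> should not produce an event.
html_doc = (b"<html><body><div>one</div>"
            b"<table><tr><td>two</td></tr></table>"
            b"<p>skip</p></body></html>")

matches = []
for event, elem in etree.iterparse(io.BytesIO(html_doc), events=("end",),
                                   tag=("td", "div"), html=True):
    matches.append((elem.tag, elem.text))

print(matches)
```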
