Multiple tag names in lxml's iterparse?
Is there a way to get multiple tag names from lxml's lxml.etree.iterparse? I have a file-like object with an expensive read operation and many tags, so getting all tags or doing two passes is suboptimal.
Edit: It would be something like Beautiful Soup's find(['tag-1', 'tag-2']), except as an argument to iterparse. Imagine parsing an HTML page for both <td> and <div> tags.
I know I'm late to the game, but maybe someone else needs help with the same issue.
This code will generate events for both Tag1 and Tag2 tags:
etree.iterparse(io.BytesIO(xml), events=('end',), tag=('Tag1', 'Tag2'))
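For anyone who wants to try this out, here is a minimal self-contained sketch of the same call. The sample document and the helper name matching_tags are made up for illustration; only the events and tag arguments come from the line above. As far as I understand it, the tag filter only restricts which events are reported, and the parser still reads every element.

```python
import io
from lxml import etree

# Hypothetical sample document, only here to make the example runnable.
xml = b"<root><Tag1>a</Tag1><Other>skip</Other><Tag2>b</Tag2><Tag1>c</Tag1></root>"

def matching_tags(data, names):
    """Collect (tag, text) pairs for elements whose tag is in `names`."""
    found = []
    # Passing a tuple as `tag` limits the reported events to those tag names.
    for event, elem in etree.iterparse(io.BytesIO(data), events=("end",), tag=names):
        found.append((elem.tag, elem.text))
    return found

result = matching_tags(xml, ("Tag1", "Tag2"))
print(result)
```

The Other element is still parsed, it just never shows up in the loop.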
I'm not 100% sure what you mean here by "getting all tags", but perhaps this is what you're looking for:
from lxml.etree import iterparse

for event, elem in iterparse(file_like_object):
    if elem.tag == 'td' or elem.tag == 'div':
        # reached the end of an interesting tag
        print('found:', elem.tag)
        # possibly quit early to prevent further parsing
        # (exit_condition is a placeholder for your own check)
        if exit_condition:
            break
iterparse generates events on the fly during parsing, so you only read as much data as is required. However, there's no way to skip reading elements during parsing, since the parser wouldn't know how far to skip. In the above, we just ignore tags that we're not interested in.
As you may already know: don't use XML parsers for HTML. Edit: it turns out that lxml supports HTML parsing, but you should check the docs to see to what extent.
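For the <td> and <div> case from the question, something along these lines might work. It is only a sketch: it assumes an lxml version whose iterparse accepts the html=True keyword together with a tuple for tag, and the HTML snippet is invented for the example.

```python
import io
from lxml import etree

# Hypothetical HTML input, only here to make the example runnable.
html = b"<html><body><div><table><tr><td>a</td><td>b</td></tr></table></div></body></html>"

tags = []
# html=True switches iterparse to lxml's HTML parser;
# tag=("td", "div") limits the reported end events to those two tag names.
for event, elem in etree.iterparse(io.BytesIO(html), events=("end",),
                                   tag=("td", "div"), html=True):
    tags.append(elem.tag)

print(tags)
```

Since these are 'end' events, a <div> is reported only after the <td> elements nested inside it have closed.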