Is the content between anchor tags (a) in HTML seen as a branch in lxml?
I am trying to extract some content from HTML documents. Some of the documents have a table of contents that very nicely indicates where in the document the content I want to strip out is located. That is, either the attribute value or the text_content of the tag is easily identifiable and points to what I need. For example, I might have two anchor tags in the TOC with the following values
key=href value=#listofplaces text_content=Places we have visited
key=href value=#transport text_content=Ways we have traveled
and then in the body of the document
key=name value=listofplaces text_content=''
then there are lots of html elements, some tables, maybe some div tags, some unknown number of elements followed by the next anchor
key=name value=transport text_content=''
I was planning on using the output from a function to identify the beginning and end of the section I want to copy from the document. That is, I was going to read the document and snip out the section between the anchor tags listofplaces and transport. Then I started thinking that lxml is so powerful that maybe the content I want is already a branch of some sort whose identity I just have not been able to figure out.
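No, there is not a single branch between siblings. The anchors are empty elements, and everything between them is simply a run of siblings under the same parent. However, you can iterate over their parent and extract the elements in between (this can be done in various ways, depending on how you already have handles for the anchor tags). Note the handling of text and tail to avoid losing data: in lxml, text that follows an element's closing tag is stored in that element's tail. Modifying example_doc to see the results may help you better understand this example code.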
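import lxml.etree

example_doc = """
<root>
<a name="listofplaces"/>
text
<sibling/>
<sibling/>
<a name="transport"/>
</root>
"""

root = lxml.etree.XML(example_doc)
new_root = lxml.etree.Element("root")

# One shared iterator over root's children, so the second loop
# resumes where the first one stopped.
it = iter(root)

# Skip ahead to the start anchor; the text right after it is its tail.
for e in it:
    if e.tag == "a" and e.get("name") == "listofplaces":
        new_root.text = e.tail
        break
else:
    assert False, "TODO: handle tag not found"

# Collect siblings until the end anchor is reached.
for e in it:
    if e.tag == "a" and e.get("name") == "transport":
        break
    # append() moves e (together with its tail) out of root.
    new_root.append(e)
else:
    assert False, "TODO: handle tag not found"

print(lxml.etree.tostring(new_root, encoding="unicode"))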
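
If you would rather leave the original tree untouched, the same idea can be wrapped in a function that copies the elements instead of moving them. This is only a sketch under a few assumptions: both anchors are direct children of the same parent, the names extract_between and is_anchor and the "section" wrapper element are made up for illustration, and only the ElementTree-compatible part of the API is used, so the import falls back to the standard library if lxml is not installed.

```python
import copy

try:
    from lxml import etree  # the library discussed in this answer
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree


def is_anchor(element, name):
    """Return True if element is <a name="..."> with the given name."""
    return element.tag == "a" and element.get("name") == name


def extract_between(parent, start_name, end_name):
    """Copy everything between two sibling anchors into a new element.

    Hypothetical helper: assumes both anchors are direct children of parent.
    """
    section = etree.Element("section")
    inside = False
    for child in parent:
        if not inside:
            if is_anchor(child, start_name):
                inside = True
                # Text directly after the start anchor lives in its tail.
                section.text = child.tail
            continue
        if is_anchor(child, end_name):
            return section
        clone = copy.deepcopy(child)
        clone.tail = child.tail  # keep the text that follows the element
        section.append(clone)
    raise ValueError("start or end anchor not found among the children")


doc = etree.fromstring(
    '<root><a name="listofplaces"/>text<sibling/><sibling/>'
    '<a name="transport"/><p>after</p></root>'
)
section = extract_between(doc, "listofplaces", "transport")
print(etree.tostring(section, encoding="unicode"))
```

Because the children are deep-copied, doc still contains all of its original elements afterwards, at the cost of duplicating the extracted section in memory.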