Parsing ODF in Python with lxml
I'm trying to parse the content.xml inside an ODF file. I've read the file into a string and I've got a tree object with lxml.etree:
tree = etree.XML(string)
But now I need to find every subelement that is text:a OR text:h. I've been told in a previous question that I could use XPath. I've tried but got stuck every single time. I can't even find one of those elements.
If I try:
elem = tree.xpath('//text:p')
I just get:
XPathEvalError: Undefined namespace prefix
So how do I get a list with BOTH of those subelements in the right order so I can iterate over them?
That's because text is a namespace prefix defined by ODF, and XPath has to be told which namespace URI it stands for. Try:
tree.xpath('//text:a | //text:h',
namespaces={'text': 'urn:oasis:names:tc:opendocument:xmlns:text:1.0'})
| is the XPath set union operator. See also the lxml docs.
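For a version you can run end to end, here is a minimal sketch. The sample document is a made-up stand-in for a real content.xml, and the names sample, ns, and found are only for illustration; the text namespace URI is the one from the answer, and the office one is assumed to be the standard ODF office namespace.

```python
from lxml import etree

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Hypothetical, heavily trimmed stand-in for a real content.xml.
sample = (
    '<office:document-content xmlns:office="%s" xmlns:text="%s">'
    "<office:body><office:text>"
    "<text:h>Intro</text:h>"
    "<text:p>See <text:a>a link</text:a>.</text:p>"
    "<text:h>Details</text:h>"
    "</office:text></office:body>"
    "</office:document-content>" % (OFFICE_NS, TEXT_NS)
)

tree = etree.XML(sample)

# Map the prefix used in the XPath expression to its namespace URI.
# The prefix on the left is arbitrary; only the URI has to match the document.
ns = {"text": TEXT_NS}

# One query, both element types.
elems = tree.xpath("//text:a | //text:h", namespaces=ns)

# Collect (local tag name, text) pairs for each match.
found = [(etree.QName(e).localname, e.text) for e in elems]

for name, text in found:
    print(name, text)
```

With lxml the matches should come back in document order, so you can iterate over elems directly. If you run several queries against the same tree, keeping the ns dict in one place saves repeating the URI.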