Python and libxml2: how to iterate over XML nodes with XPath
I have a problem retrieving information from an XML tree.
My XML has this shape:
<?xml version="1.0"?>
<records xmlns="http://www.mysyte.com/foo">
  <record>
    <id>first</id>
    <name>john</name>
    <papers>
      <paper>john_1</paper>
      <paper>john_2</paper>
    </papers>
  </record>
  <record>
    <id>second</id>
    <name>mike</name>
    <papers>
      <paper>mike_a</paper>
      <paper>mike_b</paper>
    </papers>
  </record>
  <record>
    <id>third</id>
    <name>albert</name>
    <papers>
      <paper>paper of al</paper>
      <paper>other paper</paper>
    </papers>
  </record>
</records>
What I want to do is extract a list of records like the following:
[{'code': 'first', 'name': 'john'},
{'code': 'second', 'name': 'mike'},
{'code': 'third', 'name': 'albert'}]
So far I have written this Python code:
import libxml2

ret_list = []
try:
    doc = libxml2.parseDoc(xml)
except (libxml2.parserError, TypeError):
    print("Problems loading XML")
ctxt = doc.xpathNewContext()
ctxt.xpathRegisterNs("pre", "http://www.mysyte.com/foo")
record_nodes = ctxt.xpathEval('/pre:records/pre:record')
for record_node in record_nodes:
    id = record_node.xpathEval('id')[0].content
    name = record_node.xpathEval('name')[0].content
    ret_list.append({'code': id, 'name': name})
My problem is that I get no results, and I have the impression that I'm doing something wrong with the XPath when I iterate over the nodes.
I also tried these XPath expressions for the id and the name:
/id
/name
/record/id
/record/name
/pre:id
/pre:name
and so on, but without any result (BTW, if I use the prefix in the sub-queries I get an error).
Any idea?
Here is a suggestion. Note the setContextNode() method:
import libxml2

xml = "test.xml"
doc = libxml2.parseFile(xml)
ctxt = doc.xpathNewContext()
ctxt.xpathRegisterNs("pre", "http://www.mysyte.com/foo")
ret_list = []
record_nodes = ctxt.xpathEval('/pre:records/pre:record')
for node in record_nodes:
    # Make the current record the context node, so the relative
    # expressions below are evaluated with the registered prefix.
    ctxt.setContextNode(node)
    _id = ctxt.xpathEval('pre:id')[0].content
    name = ctxt.xpathEval('pre:name')[0].content
    ret_list.append({'code': _id, 'name': name})
print(ret_list)
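The reason the unprefixed 'id' and 'name' queries in the question match nothing is that XPath 1.0 treats an unprefixed name as belonging to no namespace, while every element in this document sits in the default namespace http://www.mysyte.com/foo. A minimal sketch of that behaviour follows. It assumes the standard library's xml.etree.ElementTree instead of libxml2 (so it runs without extra packages), and the shortened sample document is only for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened version of the document from the question (illustrative only).
XML = """<?xml version="1.0"?>
<records xmlns="http://www.mysyte.com/foo">
  <record><id>first</id><name>john</name></record>
  <record><id>second</id><name>mike</name></record>
</records>"""

# Prefix-to-URI mapping; the prefix "pre" is arbitrary, only the URI matters.
NS = {"pre": "http://www.mysyte.com/foo"}

root = ET.fromstring(XML)
records = root.findall("pre:record", NS)

# An unprefixed name means "no namespace", so nothing matches.
unprefixed = records[0].find("id")

# With the prefix bound to the document's namespace, the element is found.
prefixed = records[0].find("pre:id", NS)

ret_list = [
    {
        "code": r.findtext("pre:id", namespaces=NS),
        "name": r.findtext("pre:name", namespaces=NS),
    }
    for r in records
]
print(unprefixed, prefixed.text, ret_list)
```

The same rule applies to the libxml2 code above: the relative expressions need the registered prefix, and they have to be evaluated through a context that knows about it.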
You can select all the elements you need with a single XPath expression:
/pre:records/pre:record/*[self::pre:id or self::pre:name]
Then just process the selected nodes in Python.
If it is possible to switch to lxml, here is one way it could be done:
import lxml.etree as le

# content is assumed to hold the XML document shown in the question
root = le.XML(content)
result = []
namespaces = {'pre': 'http://www.mysyte.com/foo'}
for record in root:
    id = record.xpath('pre:id', namespaces=namespaces)[0]
    name = record.xpath('pre:name', namespaces=namespaces)[0]
    result.append({'code': id.text, 'name': name.text})
print(result)
# [{'code': 'first', 'name': 'john'}, {'code': 'second', 'name': 'mike'}, {'code': 'third', 'name': 'albert'}]
Building off of Dimitre Novatchev's XPath expression, you could do this:
id_name_nodes = iter(ctxt.xpathEval('/pre:records/pre:record/*[self::pre:id or self::pre:name]'))
ret_list = []
# zip() over the same iterator twice consumes the nodes in (id, name) pairs
for id, name in zip(id_name_nodes, id_name_nodes):
    ret_list.append({'code': id.content, 'name': name.content})
print(ret_list)
This libxml2 code relies on every record having both an id and a name. If an id or name is missing, ret_list will pair the wrong id and name, failing silently. Under the same circumstances, the lxml code would raise an error.
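One way to avoid the silent mispairing is to look up both fields per record rather than pairing a flat node list. The sketch below is an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree rather than libxml2, the helper name records_to_dicts is hypothetical, and a missing field is reported as None:

```python
import xml.etree.ElementTree as ET

NS = {"pre": "http://www.mysyte.com/foo"}


def records_to_dicts(xml_text):
    """Return one dict per <record>; a missing id or name becomes None."""
    root = ET.fromstring(xml_text)
    out = []
    for record in root.findall("pre:record", NS):
        out.append({
            # findtext() returns the default when the child is absent,
            # so a gap stays inside the record it belongs to.
            "code": record.findtext("pre:id", default=None, namespaces=NS),
            "name": record.findtext("pre:name", default=None, namespaces=NS),
        })
    return out


# Illustrative input: the second record has no <id>.
SAMPLE = """<records xmlns="http://www.mysyte.com/foo">
  <record><id>first</id><name>john</name></record>
  <record><name>mike</name></record>
  <record><id>third</id><name>albert</name></record>
</records>"""

result = records_to_dicts(SAMPLE)
print(result)
```

Here the record without an id yields {'code': None, 'name': 'mike'} and the following records stay correctly paired.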
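libxslt lacks this kind of namespace support for some reason, but we can pre-parse the XML file, read the namespace declarations from it, and register them on the XPath context before evaluating the expression (the default namespace is registered under the prefix "default"):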
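import re
import libxml2

def xpath(xml, xpathexpression):
    # Read the raw text so the namespace declarations can be found with a regex.
    with open(xml) as f:
        fcontent = f.read()
    doc = libxml2.parseFile(xml)
    xp = doc.xpathNewContext()
    # Register every declared namespace; the default one gets the prefix "default".
    for nsdeclaration in re.findall(r'xmlns:*\w*="[^"]*"', fcontent):
        m = re.match(r'xmlns:(\w+)=.*', nsdeclaration)
        if m:
            ns = m.group(1)
        else:
            ns = "default"
        url = nsdeclaration[nsdeclaration.find('"')+1:nsdeclaration.rfind('"')]
        xp.xpathRegisterNs(ns, url)
    a = xp.xpathEval(xpathexpression)
    # Return the text of the first match, or an empty string if nothing matched.
    if len(a):
        return a[0].content
    return ""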
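The regex above only recognises declarations written with double quotes. As an alternative sketch, the declarations can be collected by an XML parser instead. This assumes the standard library's xml.etree.ElementTree.iterparse with the "start-ns" event. The helper name declared_namespaces and the second namespace URI in the sample are made up for illustration:

```python
import io
import xml.etree.ElementTree as ET


def declared_namespaces(source):
    """Collect prefix -> URI for every namespace declared in the document.

    source is a filename or a binary file object. The default namespace
    (empty prefix) is stored under "default", mirroring the function above.
    """
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        # Keep the first declaration seen for each prefix.
        namespaces.setdefault(prefix or "default", uri)
    return namespaces


# Illustrative document; the "x" namespace is hypothetical.
SAMPLE = b"""<?xml version="1.0"?>
<records xmlns="http://www.mysyte.com/foo" xmlns:x="http://example.com/x">
  <record><id>first</id></record>
</records>"""

found = declared_namespaces(io.BytesIO(SAMPLE))
print(found)
```

The resulting mapping could then be fed to xpathRegisterNs one entry at a time, in place of the regex loop.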