Python and libxml2: how to iterate over XML nodes with XPath

I have a problem retrieving information from an XML tree.

My XML has this shape:

<?xml version="1.0"?>
<records xmlns="http://www.mysyte.com/foo">
  <record>
    <id>first</id>
    <name>john</name>
    <papers>
      <paper>john_1</paper>
      <paper>john_2</paper>
    </papers>
  </record>
  <record>
    <id>second</id>
    <name>mike</name>
    <papers>
      <paper>mike_a</paper>
      <paper>mike_b</paper>
    </papers>
  </record>
  <record>
    <id>third</id>
    <name>albert</name>
    <papers>
      <paper>paper of al</paper>
      <paper>other paper</paper>
    </papers>
  </record>
</records>

What I want to do is extract a list of records like the following:

[{'code': 'first', 'name': 'john'}, 
 {'code': 'second', 'name': 'mike'}, 
 {'code': 'third', 'name': 'albert'}]

So far I have written this Python code:

try:
  doc = libxml2.parseDoc(xml)
except (libxml2.parserError, TypeError):
  print "Problems loading XML"

ctxt = doc.xpathNewContext()
ctxt.xpathRegisterNs("pre", "http://www.mysyte.com/foo")

record_nodes = ctxt.xpathEval('/pre:records/pre:record')

for record_node in record_nodes:
  id = record_node.xpathEval('id')[0].content
  name = record_node.xpathEval('name')[0].content
  ret_list.append({'code': id, 'name': name})

My problem is that I don't get any results, and I have the impression that I'm doing something wrong with the XPath when I iterate over the nodes.

I also tried these XPath expressions for the id and the name:

/id
/name
/record/id
/record/name
/pre:id
/pre:name

and so on, but without any result (BTW, if I use the prefix in the sub-queries I get an error).

Any idea?


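Here is a suggestion. Note the setContextNode() method, which makes each record the context node for the relative queries, and the pre: prefix on the child element names: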

import libxml2

xml = "test.xml"
doc = libxml2.parseFile(xml)

ctxt = doc.xpathNewContext()
ctxt.xpathRegisterNs("pre", "http://www.mysyte.com/foo")

ret_list = []
record_nodes = ctxt.xpathEval('/pre:records/pre:record')

for node in record_nodes:
    # Evaluate the relative expressions below against this record
    ctxt.setContextNode(node)
    _id = ctxt.xpathEval('pre:id')[0].content
    name = ctxt.xpathEval('pre:name')[0].content
    ret_list.append({'code': _id, 'name': name})

print(ret_list)
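
The unprefixed queries in the question return nothing because xmlns="http://www.mysyte.com/foo" is a default namespace: every child element (id, name, and so on) lives in that namespace too, while an unprefixed name in XPath 1.0 means "no namespace". The effect can be reproduced without libxml2. The following is only a minimal sketch that swaps in the standard library's xml.etree.ElementTree so it runs anywhere, using a shortened copy of the XML from the question; SAMPLE, NS and extract_records are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question (assumption: two records only).
SAMPLE = """<?xml version="1.0"?>
<records xmlns="http://www.mysyte.com/foo">
  <record><id>first</id><name>john</name></record>
  <record><id>second</id><name>mike</name></record>
</records>"""

# Prefix-to-URI mapping; the prefix "pre" is arbitrary, only the URI matters.
NS = {"pre": "http://www.mysyte.com/foo"}


def extract_records(xml_text):
    """Return one dict per <record>, querying children with the prefix."""
    root = ET.fromstring(xml_text)
    result = []
    for record in root.findall("pre:record", NS):
        result.append({
            "code": record.find("pre:id", NS).text,
            "name": record.find("pre:name", NS).text,
        })
    return result


root = ET.fromstring(SAMPLE)
first_record = root.find("pre:record", NS)

# Without a prefix the name is looked up in "no namespace" and is not found.
unprefixed = first_record.find("id")
# With the prefix the same child is found.
prefixed = first_record.find("pre:id", NS)

print(unprefixed)
print(prefixed.text)
print(extract_records(SAMPLE))
```

The same rule applies to the libxml2 code above: the child queries need the registered prefix, and they must be evaluated through a context that knows about that prefix.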


You can select all the elements you need with a single XPath expression:

/pre:records/pre:record/*[self::pre:id or self::pre:name]

Then just process the selected nodes in Python.


If it is possible to switch to lxml, here is one way it could be done:

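import lxml.etree as le

# content is assumed to hold the XML document from the question
root = le.XML(content)
result = []
namespaces = {'pre': 'http://www.mysyte.com/foo'}
for record in root:
    # Relative queries still need the prefix for the default namespace
    id = record.xpath('pre:id', namespaces=namespaces)[0]
    name = record.xpath('pre:name', namespaces=namespaces)[0]
    result.append({'code': id.text, 'name': name.text})
print(result)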
# [{'code': 'first', 'name': 'john'}, {'code': 'second', 'name': 'mike'}, {'code': 'third', 'name': 'albert'}]

Building off of Dimitre Novatchev's XPath expression, you could do this:

id_name_nodes = iter(ctxt.xpathEval('/pre:records/pre:record/*[self::pre:id or self::pre:name]'))

ret_list=[]
for id,name in zip(id_name_nodes,id_name_nodes):
    ret_list.append({'code':id.content,'name':name.content})
print(ret_list)

This libxml2 code relies on every record having both an id and a name. If an id or name is missing, ret_list will pair the wrong id and name and fail silently. Under the same circumstances, the lxml code would raise an IndexError.
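
To make the difference concrete, here is a rough sketch of both behaviours on a document where the second record has no name. It uses the standard library's xml.etree.ElementTree rather than libxml2 or lxml (a swapped-in library, chosen so the example is self-contained), and BROKEN, pair_flat and pair_per_record are hypothetical names. pair_flat mimics the zip() pairing over a flat node list; pair_per_record looks up the children inside each record, so a missing element shows up as None instead of shifting the pairs.

```python
import xml.etree.ElementTree as ET

NS = {"pre": "http://www.mysyte.com/foo"}

# Assumption for the demo: the second record has no <name>.
BROKEN = """<records xmlns="http://www.mysyte.com/foo">
  <record><id>first</id><name>john</name></record>
  <record><id>second</id></record>
  <record><id>third</id><name>albert</name></record>
</records>"""


def pair_flat(xml_text):
    """Flatten all id/name elements, then pair neighbours with zip()."""
    root = ET.fromstring(xml_text)
    wanted = {"{%s}id" % NS["pre"], "{%s}name" % NS["pre"]}
    flat = iter([el for el in root.iter() if el.tag in wanted])
    return [{"code": a.text, "name": b.text} for a, b in zip(flat, flat)]


def pair_per_record(xml_text):
    """Look up id and name inside each record; missing ones become None."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.findall("pre:record", NS):
        pairs.append({
            "code": record.findtext("pre:id", default=None, namespaces=NS),
            "name": record.findtext("pre:name", default=None, namespaces=NS),
        })
    return pairs


print(pair_flat(BROKEN))        # second record is paired with the third id
print(pair_per_record(BROKEN))  # second record gets name=None
```

Whether a missing element should become None, be skipped, or raise an error depends on the data; the point is only that per-record lookups keep each id with its own name.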


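For some reason the XPath context does not pick up the namespaces declared in the document on its own, but we can pre-read the namespace declarations from the XML file and register them on the context before evaluating the expression. A default namespace (one without a prefix) is registered under the prefix "default":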

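import re

import libxml2


def xpath(xml, xpathexpression):
    # Read the raw file so the xmlns declarations can be found with a regex
    with open(xml) as f:
        fcontent = f.read()

    doc = libxml2.parseFile(xml)
    xp = doc.xpathNewContext()
    for nsdeclaration in re.findall(r'xmlns:*\w*="[^"]*"', fcontent):
        m = re.match(r'xmlns:(\w+)=.*', nsdeclaration)
        if m:
            ns = m.group(1)
        else:
            # No prefix in the document: use "default" in XPath expressions
            ns = "default"
        url = nsdeclaration[nsdeclaration.find('"')+1:nsdeclaration.rfind('"')]
        xp.xpathRegisterNs(ns, url)
    # Return the text of the first matching node, or "" if nothing matches
    a = xp.xpathEval(xpathexpression)
    if len(a):
        return a[0].content
    return ""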
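
As an alternative to scanning the file with a regular expression, the declared namespaces can also be collected by an XML parser. The sketch below is an assumption-laden illustration, not part of the answer above: it uses the standard library's xml.etree.ElementTree.iterparse with the "start-ns" event, which reports each declaration as a (prefix, uri) pair, and a hypothetical helper name collect_namespaces. The resulting mapping could then be passed, entry by entry, to xpathRegisterNs as in the function above.

```python
import io
import xml.etree.ElementTree as ET


def collect_namespaces(source, default_prefix="default"):
    """Map every declared prefix to its URI.

    source is a filename or a file-like object. A default namespace has an
    empty prefix, so it is stored under default_prefix (same convention as
    the regex-based function above).
    """
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        namespaces[prefix or default_prefix] = uri
    return namespaces


# Small in-memory document; the second namespace URI is made up for the demo.
sample = io.BytesIO(
    b'<records xmlns="http://www.mysyte.com/foo" '
    b'xmlns:x="http://www.mysyte.com/bar"><x:record/></records>'
)
found = collect_namespaces(sample)
print(found)
```

Unlike the regex, this also handles declarations written with single quotes, at the cost of parsing the document twice.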