xml parse without the recursive search python
This is driving me mental, and I've probably been hacking away at it for to long so would appreciate some help to prevent lose of/restore my sanity! The food based xml is only an example of what I wish to achieve.
I have the following file which I am trying to put into a graph, so wheat and fruit are parents with a depth of 0. Indian is a child of wheat with a depth of 1 and so on and so on.
Each of the layers has some keywords. So what I want to get out is
layer, depth, parent, keywords
wheat, 1, ROOT, [bread, pita, narn, loaf]
indian, 2, wheat [chapati]
mumbai, 3, indian, puri
fruit, 1,ROOT, [apple, orange, pear, lemon]
This is a sample file -
<keywords>
<layer id="wheat">
<layer id="indian">
<keyword>chapati</keyword>
<layer id="mumbai">
<keyword>puri</keyword>
</layer>
</layer>
<keyword>bread</keyword>
<keyword>pita</keyword>
<keyword>narn</keyword>
<keyword>loaf</keyword>
</layer>
<layer id="fruit">
<keyword>apple</keyword>
<keyword>orange</keyword>
<keyword>pear</keyword>
<keyword>lemon</keyword>
</layer>
</keywords>
So this isnt a graph question, I can do that bit thats easy. What im struggling with is parsing the XML.
If I do a
xmldoc = minidom.parse(self.filename)
layers = xmldoc.getElementsByTagName('layer')
layers only returns all of the layer elements, which is to much and has not concept of depth/ hierachy as far as I can understand as it does a recursive search.
The following post is good, but doesnt provide the concepts I require. XML Parsing with Python and minidom. Can anyone help with how I might go about this? I can post my code but its so hacked tog开发者_JS百科ether/fundementally broken I don't think it would be use to man nor beast!
Cheers
Dave
Use lxml. In particular, XPath. You can get all layer
elements, regardless of level, through "//layer"
and the layer
with the id id
through "//layer[id='{}'][0]".format(id)
. The keyword
elements directly under an element (or several elements) by ".../keyword"
(where ...
is a query that yields the nodes whose descendants should be searched).
Getting the depth of a given node is not quite as trivial, but still easy. I didn't find an existing function (afaik, this is outside the domain of XPath - athough you can check for the depth in a query, you only return elements, i.e. you can return nodes with a specific depth but not the depth itself), so here's a hand-rolled one (no recursion, since it's not necessary - but in general, working with XML means working with recursion, like it or not!):
def depth(node):
depth = 0
while node.getparent() is not None:
node = node.getParent()
depth += 1
return depth
Something very similar is possible with DOM, if you should be foolish enough not to use the best Python XML library in existence ;)
Here's a solution with ElementTree:
from xml.etree import ElementTree as ET
from io import StringIO
from collections import defaultdict
data = '''\
<keywords>
<layer id="wheat">
<layer id="indian">
<keyword>chapati</keyword>
<layer id="mumbai">
<keyword>puri</keyword>
</layer>
</layer>
<keyword>bread</keyword>
<keyword>pita</keyword>
<keyword>narn</keyword>
<keyword>loaf</keyword>
</layer>
<layer id="fruit">
<keyword>apple</keyword>
<keyword>orange</keyword>
<keyword>pear</keyword>
<keyword>lemon</keyword>
</layer>
</keywords>
'''
path = ['ROOT'] # stack for layer names
items = defaultdict(list) # key=layer, value=list of items @ layer
f = StringIO(data)
for evt,e in ET.iterparse(f,('start','end')):
if evt == 'start':
if e.tag == 'layer':
path.append(e.attrib['id']) # new layer added to path
elif e.tag == 'keyword':
items[path[-1]].append(e.text) # add item to last layer in path
elif evt == 'end':
if e.tag == 'layer':
layer = path.pop()
parent = path[-1]
print layer,len(path),parent,items[layer]
Output
mumbai 3 indian ['puri']
indian 2 wheat ['chapati']
wheat 1 ROOT ['bread', 'pita', 'narn', 'loaf']
fruit 1 ROOT ['apple', 'orange', 'pear', 'lemon']
You can either recursively walk the DOM treje (see kelloti's answer) or determine the info from the found nodes:
xmldoc = minidom.parse(filename)
layers = xmldoc.getElementsByTagName("layer")
def _getText(node):
rc = []
for n in node.childNodes:
if n.nodeType == n.TEXT_NODE:
rc.append(n.data)
return ''.join(rc)
def _depth(n):
res = -1
while isinstance(n, minidom.Element):
n = n.parentNode
res += 1
return res
for l in layers:
keywords = [_getText(k) for k in l.childNodes
if k.nodeType == k.ELEMENT_NODE and k.tagName == 'keyword']
print("%s %s %s" % (l.getAttribute("id"), _depth(l), keywords))
Try iterating through all child nodes in a recursive function, checking each for tag name. i.e.
def findLayer(node):
for n in node.childNodes:
if n.localName == 'layer':
findLayer(n)
# do things here
Alternately, try using a different XML library like Amara or lxml that has XPath capabilities. With XPath you can have much more control for searching the DOM tree with very little code.
精彩评论