
Parsing XML in Python without the recursive search

This is driving me mental, and I've probably been hacking away at it for too long, so I would appreciate some help to prevent the loss of (or restore) my sanity! The food-based XML is only an example of what I wish to achieve.

I have the following file which I am trying to put into a graph, so wheat and fruit are top-level layers directly under the root, indian is a child of wheat one level deeper, and so on.

Each of the layers has some keywords. So what I want to get out is

layer, depth, parent, keywords
wheat, 1, ROOT, [bread, pita, narn, loaf]
indian, 2, wheat, [chapati]
mumbai, 3, indian, [puri]
fruit, 1, ROOT, [apple, orange, pear, lemon]

This is a sample file -

<keywords>
    <layer id="wheat">
        <layer id="indian">
            <keyword>chapati</keyword>
            <layer id="mumbai">
                <keyword>puri</keyword>
            </layer>
        </layer>
        <keyword>bread</keyword>
        <keyword>pita</keyword>
        <keyword>narn</keyword>
        <keyword>loaf</keyword>
    </layer>
    <layer id="fruit">
        <keyword>apple</keyword>
        <keyword>orange</keyword>
        <keyword>pear</keyword>
        <keyword>lemon</keyword>
    </layer>

</keywords>

So this isn't a graph question; I can do that bit, that's easy. What I'm struggling with is parsing the XML.

If I do a

xmldoc = minidom.parse(self.filename)

layers = xmldoc.getElementsByTagName('layer')

layers contains every layer element in the document, which is too much, and as far as I can understand it has no concept of depth or hierarchy because getElementsByTagName does a recursive search.

The following post is good, but doesn't provide the concepts I require: XML Parsing with Python and minidom. Can anyone help with how I might go about this? I can post my code, but it's so hacked together and fundamentally broken that I don't think it would be of use to man nor beast!

Cheers

Dave


Use lxml, in particular its XPath support. You can get all layer elements, regardless of level, through "//layer", and the layers with a given id through "//layer[@id='{}']".format(id), taking the first result. The keyword elements directly under an element (or several elements) are selected by ".../keyword" (where ... is a query that yields the nodes whose children should be searched).
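To make the path expressions above concrete, here is a minimal sketch. It assumes lxml may not be installed and swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath (hence the leading "." in each path). With lxml, the same queries would go through the element's xpath() method instead. The trimmed sample document and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file from the question.
XML = """<keywords>
    <layer id="wheat">
        <layer id="indian">
            <keyword>chapati</keyword>
            <layer id="mumbai">
                <keyword>puri</keyword>
            </layer>
        </layer>
        <keyword>bread</keyword>
        <keyword>pita</keyword>
    </layer>
    <layer id="fruit">
        <keyword>apple</keyword>
    </layer>
</keywords>"""

root = ET.fromstring(XML)

# Every layer element, at any depth, in document order.
all_layers = [layer.get("id") for layer in root.findall(".//layer")]

# The first layer with a given id; the @ marks an attribute test.
indian = root.find(".//layer[@id='indian']")

# Keywords directly under that layer only; nested layers are not searched.
direct_keywords = [k.text for k in indian.findall("./keyword")]

print(all_layers)
print(direct_keywords)
```

Note that "./keyword" stops at direct children, which is what keeps puri (a keyword of mumbai) out of the indian result.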

Getting the depth of a given node is not quite as trivial, but still easy. I didn't find an existing function. Afaik this is outside the domain of XPath: although you can test for the depth in a query, a query only returns elements, i.e. you can return the nodes at a specific depth but not the depth itself. So here's a hand-rolled one (no recursion, since it's not necessary, but in general working with XML means working with recursion, like it or not!):

def depth(node):
    # Count how many ancestors the node has; the root element has depth 0.
    d = 0
    while node.getparent() is not None:
        node = node.getparent()
        d += 1
    return d

Something very similar is possible with DOM, if you should be foolish enough not to use the best Python XML library in existence ;)


Here's a solution with ElementTree:

from xml.etree import ElementTree as ET
from io import StringIO
from collections import defaultdict

data = '''\
<keywords>
    <layer id="wheat">
        <layer id="indian">
            <keyword>chapati</keyword>
            <layer id="mumbai">
                <keyword>puri</keyword>
            </layer>
        </layer>
        <keyword>bread</keyword>
        <keyword>pita</keyword>
        <keyword>narn</keyword>
        <keyword>loaf</keyword>
    </layer>
    <layer id="fruit">
        <keyword>apple</keyword>
        <keyword>orange</keyword>
        <keyword>pear</keyword>
        <keyword>lemon</keyword>
    </layer>
</keywords>
'''

path = ['ROOT']  # stack for layer names
items = defaultdict(list)  # key=layer, value=list of items @ layer

f = StringIO(data)
for evt, e in ET.iterparse(f, ('start', 'end')):
    if evt == 'start':
        if e.tag == 'layer':
            path.append(e.attrib['id'])  # new layer added to path
    elif evt == 'end':
        if e.tag == 'keyword':
            # e.text is only guaranteed to be complete at the 'end' event
            items[path[-1]].append(e.text)  # add item to last layer in path
        elif e.tag == 'layer':
            layer = path.pop()
            parent = path[-1]
            print(layer, len(path), parent, items[layer])

Output

mumbai 3 indian ['puri']
indian 2 wheat ['chapati']
wheat 1 ROOT ['bread', 'pita', 'narn', 'loaf']
fruit 1 ROOT ['apple', 'orange', 'pear', 'lemon']


You can either recursively walk the DOM tree (see kelloti's answer) or determine the info from the found nodes:

from xml.dom import minidom

xmldoc = minidom.parse(filename)
layers = xmldoc.getElementsByTagName("layer")

def _getText(node):
    rc = []
    for n in node.childNodes:
        if n.nodeType == n.TEXT_NODE:
            rc.append(n.data)
    return ''.join(rc)

def _depth(n):
    res = -1
    while isinstance(n, minidom.Element):
        n = n.parentNode
        res += 1
    return res

for l in layers:
    keywords = [_getText(k) for k in l.childNodes
                if k.nodeType == k.ELEMENT_NODE and k.tagName == 'keyword']
    print("%s %s %s" % (l.getAttribute("id"), _depth(l), keywords))
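The loop above prints the id, depth and keywords but not the parent column the question asks for. As a rough sketch, the parent can be read off parentNode in the same style. The helper name parent_id and the "ROOT" default are assumptions chosen to match the desired output in the question, not part of the answer's code.

```python
from xml.dom import minidom

def parent_id(node, default="ROOT"):
    """Return the id of the enclosing <layer>, or default for a top-level layer."""
    parent = node.parentNode
    if (parent is not None
            and parent.nodeType == parent.ELEMENT_NODE
            and parent.tagName == "layer"):
        return parent.getAttribute("id")
    return default

# Small inline document standing in for the sample file.
doc = minidom.parseString(
    '<keywords><layer id="wheat"><layer id="indian"/></layer>'
    '<layer id="fruit"/></keywords>'
)
parents = {l.getAttribute("id"): parent_id(l)
           for l in doc.getElementsByTagName("layer")}
print(parents)
```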


Try iterating through all child nodes in a recursive function, checking the tag name of each, e.g.

def findLayer(node):
    for n in node.childNodes:
        if n.localName == 'layer':
            findLayer(n)
            # do things here
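One way the "do things here" part could be filled in, as a sketch only: pass the parent id and depth down through the recursion and collect one row per layer. The function name walk_layers, its parameters and the row layout are hypothetical, chosen to mirror the columns requested in the question. Depth starts at 1 for top-level layers to match that table.

```python
from xml.dom import minidom

def walk_layers(node, parent="ROOT", depth=1, rows=None):
    """Collect (layer, depth, parent, keywords) for each <layer> below node."""
    if rows is None:
        rows = []
    for child in node.childNodes:
        # Skip whitespace text nodes and anything that is not a <layer>.
        if child.nodeType != child.ELEMENT_NODE or child.tagName != "layer":
            continue
        layer_id = child.getAttribute("id")
        # Only direct <keyword> children belong to this layer.
        keywords = [
            "".join(t.data for t in k.childNodes if t.nodeType == t.TEXT_NODE)
            for k in child.childNodes
            if k.nodeType == k.ELEMENT_NODE and k.tagName == "keyword"
        ]
        rows.append((layer_id, depth, parent, keywords))
        # Recurse with this layer as the parent, one level deeper.
        walk_layers(child, layer_id, depth + 1, rows)
    return rows

doc = minidom.parseString("""<keywords>
    <layer id="wheat">
        <layer id="indian">
            <keyword>chapati</keyword>
            <layer id="mumbai"><keyword>puri</keyword></layer>
        </layer>
        <keyword>bread</keyword>
    </layer>
    <layer id="fruit"><keyword>apple</keyword></layer>
</keywords>""")

rows = walk_layers(doc.documentElement)
for row in rows:
    print(row)
```

Because the recursion only descends through direct children, each layer sees its own keywords and its own parent, which is exactly the hierarchy information getElementsByTagName loses.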

Alternatively, try using a different XML library with XPath capabilities, such as Amara or lxml. With XPath you have much more control over searching the DOM tree, with very little code.
