xml parse without the recursive search python

2023-02-08 13:12 Q&A author:

This is driving me mental, and I've probably been hacking away at it for too long, so I would appreciate some help to prevent the loss of (or restore) my sanity! The food-based XML is only an example of what I wish to achieve.

I have the following file which I am trying to put into a graph, so wheat and fruit are parents with a depth of 1. Indian is a child of wheat with a depth of 2, and so on.

Each of the layers has some keywords. So what I want to get out is

layer, depth, parent, keywords
wheat, 1, ROOT, [bread, pita, narn, loaf]
indian, 2, wheat, [chapati]
mumbai, 3, indian, [puri]
fruit, 1, ROOT, [apple, orange, pear, lemon]

This is a sample file -

<keywords>
    <layer id="wheat">
        <layer id="indian">
            <keyword>chapati</keyword>
            <layer id="mumbai">
                <keyword>puri</keyword>
            </layer>
        </layer>
        <keyword>bread</keyword>
        <keyword>pita</keyword>
        <keyword>narn</keyword>
        <keyword>loaf</keyword>
    </layer>
    <layer id="fruit">
        <keyword>apple</keyword>
        <keyword>orange</keyword>
        <keyword>pear</keyword>
        <keyword>lemon</keyword>
    </layer>

</keywords>

So this isn't a graph question; I can do that bit, that's easy. What I'm struggling with is parsing the XML.

If I do a

xmldoc = minidom.parse(self.filename)

layers = xmldoc.getElementsByTagName('layer')

layers returns all of the layer elements, which is too much and has no concept of depth or hierarchy, as far as I can understand, because it does a recursive search.

The following post is good, but doesn't provide the concepts I require: XML Parsing with Python and minidom. Can anyone help with how I might go about this? I can post my code, but it's so hacked together/fundamentally broken I don't think it would be of use to man nor beast!

Cheers

Dave

Use lxml, in particular its XPath support. You can get all layer elements, regardless of level, through "//layer", and the layer with a given id through "//layer[@id='{}']".format(id) (xpath() returns a list, so take element [0] of the result). The keyword elements directly under an element (or several elements) are selected by ".../keyword" (where ... is a query that yields the nodes whose children should be searched).
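As a rough illustration of those three queries, here is a minimal sketch. Note the swap: it uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset, rather than lxml, so that it runs without extra installs. The sample XML is a trimmed copy of the file from the question, and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file from the question.
XML = """<keywords>
    <layer id="wheat">
        <layer id="indian">
            <keyword>chapati</keyword>
        </layer>
        <keyword>bread</keyword>
        <keyword>pita</keyword>
    </layer>
    <layer id="fruit">
        <keyword>apple</keyword>
    </layer>
</keywords>"""

root = ET.fromstring(XML)

# Every layer element, at any depth, in document order.
all_ids = [layer.get("id") for layer in root.findall(".//layer")]

# The layer with a given id; find() returns the first match or None.
indian = root.find(".//layer[@id='indian']")

# Only the keyword elements that are direct children of one layer.
wheat = root.find(".//layer[@id='wheat']")
wheat_keywords = [k.text for k in wheat.findall("keyword")]

print(all_ids)         # ['wheat', 'indian', 'fruit']
print(wheat_keywords)  # ['bread', 'pita']
```

With lxml the same paths can be passed to xpath(), which also accepts the leading "//" form used above.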

Getting the depth of a given node is not quite as trivial, but still easy. I didn't find an existing function (afaik, this is outside the domain of XPath: although you can check for the depth in a query, you only return elements, i.e. you can return nodes with a specific depth but not the depth itself), so here's a hand-rolled one (no recursion, since it's not necessary, but in general, working with XML means working with recursion, like it or not!):

def depth(node):
    # Walk up via lxml's getparent() until the root element is reached.
    d = 0
    while node.getparent() is not None:
        node = node.getparent()
        d += 1
    return d
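If lxml is not available, the same upward walk can be approximated with the standard library. This is only a sketch under one assumption: ElementTree elements have no getparent(), so a child-to-parent map (the hypothetical name parent_of) is built first and the loop follows that instead.

```python
import xml.etree.ElementTree as ET

# Sample file from the question, keywords omitted for brevity.
XML = """<keywords>
    <layer id="wheat">
        <layer id="indian">
            <layer id="mumbai"/>
        </layer>
    </layer>
    <layer id="fruit"/>
</keywords>"""

root = ET.fromstring(XML)

# ElementTree elements do not know their parent, so record it up front.
parent_of = {child: parent for parent in root.iter() for child in parent}

def depth(node):
    """Count the ancestors between node and the root element."""
    d = 0
    while node in parent_of:
        node = parent_of[node]
        d += 1
    return d

depths = {layer.get("id"): depth(layer) for layer in root.iter("layer")}
print(depths)  # {'wheat': 1, 'indian': 2, 'mumbai': 3, 'fruit': 1}
```

The root element itself has depth 0 here, so the top-level layers come out at depth 1, matching the table in the question.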

Something very similar is possible with DOM, if you should be foolish enough not to use the best Python XML library in existence ;)

Here's a solution with ElementTree:

from xml.etree import ElementTree as ET
from io import StringIO
from collections import defaultdict

data = '''\
<keywords>
    <layer id="wheat">
        <layer id="indian">
            <keyword>chapati</keyword>
            <layer id="mumbai">
                <keyword>puri</keyword>
            </layer>
        </layer>
        <keyword>bread</keyword>
        <keyword>pita</keyword>
        <keyword>narn</keyword>
        <keyword>loaf</keyword>
    </layer>
    <layer id="fruit">
        <keyword>apple</keyword>
        <keyword>orange</keyword>
        <keyword>pear</keyword>
        <keyword>lemon</keyword>
    </layer>
</keywords>
'''

path = ['ROOT']  # stack for layer names
items = defaultdict(list)  # key=layer, value=list of items @ layer

f = StringIO(data)
for evt, e in ET.iterparse(f, ('start', 'end')):
    if evt == 'start':
        if e.tag == 'layer':
            path.append(e.attrib['id'])  # new layer added to path
    elif evt == 'end':
        if e.tag == 'keyword':
            # element text is only guaranteed to be complete at the 'end' event
            items[path[-1]].append(e.text)  # add item to last layer in path
        elif e.tag == 'layer':
            layer = path.pop()
            parent = path[-1]
            print(layer, len(path), parent, items[layer])

Output

mumbai 3 indian ['puri']
indian 2 wheat ['chapati']
wheat 1 ROOT ['bread', 'pita', 'narn', 'loaf']
fruit 1 ROOT ['apple', 'orange', 'pear', 'lemon']

You can either recursively walk the DOM tree (see kelloti's answer) or determine the info from the found nodes:

from xml.dom import minidom

xmldoc = minidom.parse(filename)  # filename: path to the XML file
layers = xmldoc.getElementsByTagName("layer")

def _getText(node):
    rc = []
    for n in node.childNodes:
        if n.nodeType == n.TEXT_NODE:
            rc.append(n.data)
    return ''.join(rc)

def _depth(n):
    res = -1
    while isinstance(n, minidom.Element):
        n = n.parentNode
        res += 1
    return res

for l in layers:
    keywords = [_getText(k) for k in l.childNodes
                if k.nodeType == k.ELEMENT_NODE and k.tagName == 'keyword']
    print("%s %s %s" % (l.getAttribute("id"), _depth(l), keywords))

Try iterating through all child nodes in a recursive function, checking each for its tag name, e.g.

def findLayer(node):
    for n in node.childNodes:
        if n.localName == 'layer':
            findLayer(n)
            # do things here
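To make the "do things here" step concrete, here is one possible sketch of a recursive minidom walk that carries the depth and parent id down with it. The function name walk, its parameters, and the tuple layout are assumptions chosen to match the table in the question, not part of the answer above.

```python
from xml.dom import minidom

XML = """<keywords>
    <layer id="wheat">
        <layer id="indian">
            <keyword>chapati</keyword>
            <layer id="mumbai">
                <keyword>puri</keyword>
            </layer>
        </layer>
        <keyword>bread</keyword>
        <keyword>pita</keyword>
        <keyword>narn</keyword>
        <keyword>loaf</keyword>
    </layer>
    <layer id="fruit">
        <keyword>apple</keyword>
        <keyword>orange</keyword>
        <keyword>pear</keyword>
        <keyword>lemon</keyword>
    </layer>
</keywords>"""

def walk(node, parent="ROOT", depth=1, rows=None):
    """Collect (layer, depth, parent, keywords) for each layer under node."""
    if rows is None:
        rows = []
    for child in node.childNodes:
        # Skip whitespace text nodes and anything that is not a <layer>.
        if child.nodeType != child.ELEMENT_NODE or child.tagName != "layer":
            continue
        # Only keywords that are direct children of this layer.
        keywords = [
            k.firstChild.data
            for k in child.childNodes
            if k.nodeType == k.ELEMENT_NODE
            and k.tagName == "keyword"
            and k.firstChild is not None
        ]
        layer_id = child.getAttribute("id")
        rows.append((layer_id, depth, parent, keywords))
        walk(child, layer_id, depth + 1, rows)  # descend into nested layers
    return rows

doc = minidom.parseString(XML)
rows = walk(doc.documentElement)
for row in rows:
    print(row)
```

Because only direct children are inspected at each level, the hierarchy that getElementsByTagName flattens away is preserved, and the rows come out parent-first.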

Alternately, try using a different XML library like Amara or lxml that has XPath capabilities. With XPath you can have much more control for searching the DOM tree with very little code.
