开发者

How to check if the two XML files are equivalent with Python?

How to check if two XML files are equivalent?

For example, the two XML files are the same even though the ordering is different. I need to check if the two XML files content the same textual info disregarding the order.

<a>
   <b>hello</b>
   <c><d>world</d></c>
</a>

<a>
   <c><d>world</d>开发者_StackOverflow社区</c>
   <b>hello</b>
</a>

Are there tools for this out there?


It all depends on your definition of "equivalent".

Assuming you really only care about the text nodes (for example: the d tags in your example do not even matter, you only care about the content word), you can just make a set of the text nodes of each document, and compare the sets. Using lxml, this could look like:

from lxml import etree

tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')

print set(tree1.getroot().itertext()) == set(tree2.getroot().itertext())

You might even want to ignore whitespace nodes, doing something like:

set(i for i in tree.getroot().itertext() if i.strip())

Note that using sets means you will NOT take into account how many times certain pieces of text occur in the document (this might be what you want, it might not). If the order is not important, but the number of times something occurs is, you could use a dictionary instead of a set, and keep track of the number of occurences (eg. with collections.defaultdict() or collections.Counter in python 2.7)

But if it is only the order of the direct child elements of the root element (in your case, children of the a element) that may be ignored, and everything inside those elements really counts, you would need another approach. You could for example do xml canonicalization on each child element to get a normalized version of each child (again, I don't know if this is normalized enough for your needs).

from lxml import etree

tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')

set1 = set(etree.tostring(i, method='c14n') for i in tree1.getroot())
set2 = set(etree.tostring(i, method='c14n') for i in tree2.getroot())

print set1 == set2

Note: to keep the example simpler, I've used the development version of lxml, in older versions, there is no method='c14n' for etree.tostring(), only a c14n() method on the ElementTree, that writes to a file-like object. So to get it working there, you'd have to copy each element to a tree of its own, and use a StringIO() object as a dummy file)

Also, this way of doing it is probably not recommended with very large files.

But again: a BIG WARNING: you really have to know what you need as "equivalent", and create your own solution based on that knowledge!


Ordering is important in XML, so the two files you provided are different. Normally you could normalize the XML and then simply compare the files as text, but if you want order-insensitive comparison, you will probably have to implement it yourself using one of the bazillion XML parsers out there (I would recommend lxml, by the way).


my solution is below. compare all attributes,tags iteration. Some code refered from : Testing Equivalence of xml.etree.ElementTree

import xml.etree.ElementTree as ET


def elements_equal(e1, e2):
    if e1.tag != e2.tag: 
        return False
    if e1.text != e2.text: 
        if  e1.text!=None and e2.text!=None :
            return False
    if e1.tail != e2.tail:
        if e1.tail!=None and e2.tail!=None:
            return False
    if e1.attrib != e2.attrib: 
        return False
    if len(e1) != len(e2): 
        return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))



def is_two_xml_equal(f1, f2):
    tree1 = ET.parse(f1)
    root1 = tree1.getroot()
    tree2 = ET.parse(f2)
    root2 = tree2.getroot()
    return elements_equal(root1,root3)

f1 = '2.xml'
f2 = '1.xml'
print(is_two_xml_equal(f1, f2))
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜