Python ElementTree duplicate checker
So I have to write a "duplicate checker" to compare two XMLs and see if they are the same (contain the same data). Because they come from the same class and are generated from an XSD, the structure and the order of the elements inside will most likely be the same.
The best way I can think of doing the duplicate check is to set up two dictionaries (dictLeft, dictRight), using xpath#value as the key and the number of times it occurs as the value. Something like this:
Left:
{ 'my/path/to/name#greg': 1, 'my/path/to/name#john': 2, 'my/path/to/car#toyota': 1}
Right:
{ 'my/path/to/name#greg': 1, 'my/path/to/name#bill': 1, 'my/path/to/car#toyota': 1}
Comparing these two dictionaries will give me a fairly accurate indication of whether or not these two XMLs are the same (there is the odd chance that I may get false results, but it is very remote).
Does anyone else have a better idea? Maybe a function in ElementTree that I do not know about?
EDIT: To better explain:
<root><person><name>Bob</name><surname>marley</surname></person></root>
and
<root><person><surname>marley</surname><name>Bob</name></person></root>
would be considered the same. I am ignoring attributes. The idea is to keep the code as simple as possible while not hampering performance too much.
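The path#value counting idea described above could be sketched roughly as follows. This is only an illustration: the function name `path_value_counts` is made up here, paths are built by joining tag names with `/`, and attributes are ignored as stated in the question.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def path_value_counts(xml_text):
    """Count 'path#value' keys for every element that carries text."""
    counts = Counter()

    def walk(element, prefix):
        # Build the path from the root down, e.g. 'root/person/name'.
        path = f"{prefix}/{element.tag}" if prefix else element.tag
        text = (element.text or "").strip()
        if text:
            counts[f"{path}#{text}"] += 1
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


left = "<root><person><name>Bob</name><surname>marley</surname></person></root>"
right = "<root><person><surname>marley</surname><name>Bob</name></person></root>"
print(path_value_counts(left) == path_value_counts(right))  # True
```

Because only path and value are counted, two documents in which separate person elements swap surnames would still compare equal, which is the kind of false result mentioned in the question.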
OK, so I had to make a decision and went with this:
foreach path in xpathlist
    find entries for path for both xml1 and xml2
    foreach entry in xmlentries1
        dict1[path#entry.value]++
    foreach entry in xmlentries2
        dict2[path#entry.value]++
if dict1 and dict2 are not equal
    return false
return true
I hope this makes sense. This allows me to test for specific/all xpaths. If someone has a better algorithm, I'm all ears :)
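A possible Python rendering of the pseudocode above, assuming the paths are written in the limited XPath syntax that ElementTree's `findall` accepts and are relative to the root element. The name `same_for_paths` is hypothetical, and a per-path `Counter` stands in for the two dictionaries; the result is the same as comparing the combined dictionaries at the end.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def same_for_paths(root1, root2, xpath_list):
    """Return True if both trees hold the same multiset of values per path."""
    for path in xpath_list:
        values1 = Counter((e.text or "").strip() for e in root1.findall(path))
        values2 = Counter((e.text or "").strip() for e in root2.findall(path))
        if values1 != values2:
            return False
    return True


xml1 = ET.fromstring("<root><person><name>greg</name><car>toyota</car></person></root>")
xml2 = ET.fromstring("<root><person><name>bill</name><car>toyota</car></person></root>")
print(same_for_paths(xml1, xml2, ["person/car"]))                 # True
print(same_for_paths(xml1, xml2, ["person/name", "person/car"]))  # False
```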
From your example, it seems like you should be able to use iterparse and collections.Counter to count the appearances of each element, using its tag, attributes and text as the key for the counter. Example:
from collections import Counter
from xml.etree import ElementTree  # cElementTree was removed in Python 3.9

your_xml = get_xml()  # placeholder: a filename or file object containing the XML
count = Counter()
for event, element in ElementTree.iterparse(your_xml):
    # Joining strings as the key for ease of debugging; strictly speaking,
    # one could use a tuple and save the str() on the attrib dict.
    # element.text is None for elements without text, hence the `or ""`.
    key = "".join((element.tag, str(element.attrib), element.text or ""))
    count[key] += 1  # count the full key, not just the tag
Alternatively, make count a normal dict and just compare the two dicts for equality (seems conceptually simpler to me).
If the two XMLs are generated by the same code and contain the same values (in the same order), then you could simply do a string comparison of the XML data.
If that works then it's probably the simplest solution possible, but there might be reasons why that won't work for you.
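One reason a raw string comparison can fail is purely cosmetic differences such as attribute order or quoting. As a hedged variation (not something the answer above proposes), the standard library's `xml.etree.ElementTree.canonicalize`, available from Python 3.8, can be applied before comparing. Element order still matters with this approach.

```python
import xml.etree.ElementTree as ET

# Same data, but different attribute order and quote style.
a = '<root><item b="2" a="1">x</item></root>'
b = "<root><item a='1' b='2'>x</item></root>"

print(a == b)                                    # False
print(ET.canonicalize(a) == ET.canonicalize(b))  # True
```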
This problem starts with defining what you mean by "the same".
For instance, a simple definition of equality, for XML elements, is that two XML elements are equal if:
- they're in the same namespace,
- they have the same tag name,
- they have the same set of attributes, with the same values,
- their respective lists of child nodes, excluding comments and processing instructions, and whitespace-only text nodes, contain the same values in the same order.
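A minimal sketch of the four rules above using ElementTree, assuming the documents are parsed with the default parser (which drops comments and processing instructions). The helper names `_text` and `elements_equal` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def _text(value):
    # None and whitespace-only text both count as "no text node".
    return value if value and value.strip() else ""


def elements_equal(e1, e2):
    """Strict, order-sensitive equality following the four rules above."""
    # ElementTree stores the namespace inside the tag as '{uri}name',
    # so one comparison covers both namespace and tag name.
    if e1.tag != e2.tag or e1.attrib != e2.attrib:
        return False
    if _text(e1.text) != _text(e2.text) or _text(e1.tail) != _text(e2.tail):
        return False
    if len(e1) != len(e2):
        return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))


a = ET.fromstring('<r><p x="1"><n>Bob</n><s>marley</s></p></r>')
b = ET.fromstring('<r>\n  <p x="1"><!-- note --><n>Bob</n><s>marley</s></p>\n</r>')
c = ET.fromstring('<r><p x="1"><s>marley</s><n>Bob</n></p></r>')
print(elements_equal(a, b))  # True: only whitespace and a comment differ
print(elements_equal(a, c))  # False: child order differs
```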
There are all kinds of reasons why this trivial definition might not suffice:
- you may want to ignore elements that aren't in namespaces you know about - i.e. you don't want your application's equality test to fail just because other applications are storing data in the XML
- child element ordering may not be significant or (worse) may be significant for some elements and not others
- comment, processing-instruction, and whitespace-only text nodes may be significant
- you may need to normalize whitespace (see the normalize-space() function in XSLT) in text nodes before comparing them
Once you've defined equality, implementing a method to test it is relatively straightforward. But you need to define equality first.
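As an illustration of how the definition drives the implementation, here is a sketch for one possible definition: child order is not significant and whitespace in text is normalized. The names `normalize_space` and `canonical_form` are hypothetical, the whitespace handling only approximates normalize-space(), and tail text (mixed content) is ignored.

```python
import xml.etree.ElementTree as ET


def normalize_space(text):
    """Collapse whitespace runs and trim, similar to normalize-space()."""
    return " ".join((text or "").split())


def canonical_form(element):
    """Build a hashable, order-insensitive representation of an element."""
    # Sorting the children's forms makes sibling order irrelevant.
    children = sorted(canonical_form(child) for child in element)
    return (
        element.tag,
        tuple(sorted(element.attrib.items())),
        normalize_space(element.text),
        tuple(children),
    )


left = ET.fromstring("<root><person><name>Bob</name><surname>marley</surname></person></root>")
right = ET.fromstring("<root><person><surname> marley </surname><name>Bob</name></person></root>")
print(canonical_form(left) == canonical_form(right))  # True
```

Since each element's children stay grouped under their parent, this kind of comparison also notices values that have moved between two sibling elements, which a flat path#value count would miss.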