Python ElementTree duplicate checker

2023-02-17 04:20

So I have to write a "duplicate checker" to compare two XMLs and see if they are the same (contain the same data). Because they come from the same class and are generated from an XSD, the structure and the order of the elements inside will most likely be the same.

The best way I can think of doing the duplicate check is to set up two dictionaries (dictLeft, dictRight), using xpath#value as the key and the number of times it occurs as the value. Something like this:

Left:

{ 'my/path/to/name#greg': 1, 'my/path/to/name#john': 2, 'my/path/to/car#toyota': 1}

Right:

{ 'my/path/to/name#greg': 1, 'my/path/to/name#bill': 1, 'my/path/to/car#toyota': 1}

Comparing these two dictionaries will give me a fairly accurate indication of whether these two XMLs are the same (there is the odd chance that I may get false results, but it is very remote).

Does anyone else have a better idea? Maybe a function in ElementTree that I do not know about?

EDIT: To better explain:

<root><person><name>Bob</name><surname>marley</surname></person></root>

and

<root><person><surname>marley</surname><name>Bob</name></person></root>

would be considered the same. I am ignoring attributes. The idea is to keep the code as simple as possible while not hampering performance too much.
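For what it's worth, here is a minimal sketch of the xpath#value counting idea, assuming Python 3 and the standard xml.etree.ElementTree module. The function name path_value_counts is made up for illustration, the two documents are the ones from the EDIT written with the person element closed so that they parse, and attributes are ignored as described.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_value_counts(xml_text):
    """Count 'path#value' keys for every element that has non-blank text.

    path_value_counts is a hypothetical helper name, not an ElementTree API.
    """
    counts = Counter()

    def walk(elem, parent_path):
        # Build a simple slash-separated path from the root down to this element.
        path = f"{parent_path}/{elem.tag}" if parent_path else elem.tag
        text = (elem.text or "").strip()
        if text:
            counts[f"{path}#{text}"] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


left = "<root><person><name>Bob</name><surname>marley</surname></person></root>"
right = "<root><person><surname>marley</surname><name>Bob</name></person></root>"

# Sibling order does not matter because only the counts are compared.
print(path_value_counts(left) == path_value_counts(right))  # True
```

Note that the keys do not record which values share a parent, so two people with swapped surnames would still compare equal. That is the "odd chance" of a false result mentioned above.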

OK, so I had to make a decision and went with this:

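from collections import Counter

def same_data(xml1, xml2, xpathlist):
    # same_data is just an illustrative name; xml1 and xml2 are parsed
    # ElementTree roots and xpathlist holds the paths to test
    for path in xpathlist:
        # count how often each path#value occurs in each document
        dict1 = Counter(f"{path}#{entry.text}" for entry in xml1.findall(path))
        dict2 = Counter(f"{path}#{entry.text}" for entry in xml2.findall(path))
        if dict1 != dict2:
            return False
    return True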

I hope this makes sense. This allows me to test for specific/all xpaths. If someone has a better algorithm, I'm all ears :)

From your example, it seems like you should be able to use iterparse together with collections.Counter, using each element's tag, attributes, and text as the key for the counter. Example:

# cElementTree was removed in Python 3.9; plain ElementTree uses the C accelerator when available
from xml.etree import ElementTree
from collections import Counter

your_xml = get_xml()  # placeholder: a filename or file object holding the XML
count = Counter()
parser = ElementTree.iterparse(your_xml)
for event, element in parser:
    # joining strings as the key for ease of debugging; strictly speaking,
    # one could use a tuple and save the str() on the attrib dict
    # element.text is None for empty elements, so fall back to ""
    key = "".join((element.tag, str(element.attrib), element.text or ""))
    count[key] += 1

Alternatively, make count a normal dict and just compare the two dicts for equality (seems conceptually simpler to me).

If two XMLs are generated from the same code and contain the same values (in the same order) then you could simply do a string comparison of the XML data.

If that works then it's probably the simplest solution possible, but there might be reasons why that won't work for you.
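One reason a raw string comparison can fail is formatting noise such as indentation or attribute order. A hedged sketch of a slightly more forgiving variant, assuming Python 3.8 or later, is to run both documents through xml.etree.ElementTree.canonicalize first and compare the results. The function name same_serialized is made up here, and this still treats a different element order as a difference.

```python
import xml.etree.ElementTree as ET


def same_serialized(xml_a, xml_b):
    """Compare two XML strings after C14N canonicalization.

    same_serialized is a hypothetical helper name. strip_text=True drops
    leading and trailing whitespace in text content, so indentation between
    elements does not count as a difference.
    """
    canon_a = ET.canonicalize(xml_a, strip_text=True)
    canon_b = ET.canonicalize(xml_b, strip_text=True)
    return canon_a == canon_b


compact = '<root><a x="1" y="2">1</a></root>'
pretty = '<root>\n  <a y="2" x="1">1</a>\n</root>'

# Indentation and attribute order are normalized away before comparing.
print(same_serialized(compact, pretty))
```

If the two generators can emit child elements in a different order, this approach is not enough on its own.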

This problem starts with defining what you mean by "the same".

For instance, a simple definition of equality, for XML elements, is that two XML elements are equal if:

they're in the same namespace,
they have the same tag name,
they have the same set of attributes, with the same values,
their respective lists of child nodes, excluding comments and processing instructions, and whitespace-only text nodes, contain the same values in the same order.
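As a rough illustration of the definition above, here is a sketch using the standard xml.etree.ElementTree module. It relies on two facts about that module: the namespace is part of the tag (as "{uri}name"), and the default parser skips comments and processing instructions. The names elements_equal and _significant are invented for this example.

```python
import xml.etree.ElementTree as ET


def _significant(text):
    # Treat missing and whitespace-only text as empty; keep other text as is.
    return text if text and text.strip() else ""


def elements_equal(a, b):
    """Recursive equality test following the simple definition (a sketch)."""
    if a.tag != b.tag:  # covers both namespace and tag name
        return False
    if a.attrib != b.attrib:  # same attribute names with the same values
        return False
    if _significant(a.text) != _significant(b.text):
        return False
    if _significant(a.tail) != _significant(b.tail):
        return False
    if len(a) != len(b):  # same number of child elements
        return False
    # Children must match pairwise, in the same order.
    return all(elements_equal(x, y) for x, y in zip(a, b))


one = ET.fromstring('<r xmlns="urn:example"><a x="1">t</a></r>')
two = ET.fromstring('<r xmlns="urn:example">\n  <a x="1">t</a>\n</r>')
print(elements_equal(one, two))
```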

There are all kinds of reasons why this trivial definition might not suffice:

you may want to ignore elements that aren't in namespaces you know about - i.e. you don't want your application's equality test to fail just because other applications are storing data in the XML
child element ordering may not be significant or (worse) may be significant for some elements and not others
comment, processing-instruction, and whitespace-only text nodes may be significant
you may need to normalize whitespace (see the normalize-space() function in XSLT) in text nodes before comparing them

Once you've defined equality, implementing a method to test it is relatively straightforward. But you need to define equality first.
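To show how those choices plug in, here is a small sketch where child ordering is a switch and text is whitespace-normalized before comparison. The names normalize_space and element_key are assumptions for this example; normalize_space only approximates the XSLT function of the same name, and tail text is ignored for brevity.

```python
import xml.etree.ElementTree as ET


def normalize_space(text):
    """Trim and collapse whitespace runs, roughly like XPath normalize-space()."""
    return " ".join((text or "").split())


def element_key(elem, ordered=True):
    """Build a nested tuple that is equal for 'equal' elements (a sketch).

    With ordered=False, sibling order is ignored by sorting the child keys.
    """
    children = [element_key(child, ordered) for child in elem]
    if not ordered:
        children.sort()
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        normalize_space(elem.text),
        tuple(children),
    )


a = ET.fromstring("<root><person><name> Bob </name><surname>marley</surname></person></root>")
b = ET.fromstring("<root><person><surname>marley</surname><name>Bob</name></person></root>")

print(element_key(a, ordered=False) == element_key(b, ordered=False))
print(element_key(a, ordered=True) == element_key(b, ordered=True))
```

Other decisions from the list, such as skipping elements in unknown namespaces, would be further filters applied while building the key.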
