开发者

Python, checksum of a dict

I'm thinking to create a checksum of a dict to know if it was modified or not For the moment i have that:

>>> import hashlib
>>> import pickle
>>> d = {'k': 'v', 'k2': 'v2'}
>>> z = pickle.dumps(d)
>>> hashlib.md5(z).hexdigest()
'8521955ed8c63c554744058c9888dc30'

Perhaps a better solution exists?

Note: I want t开发者_运维问答o create an unique id of a dict to create a good Etag.

EDIT: I can have abstract data in the dict.


Something like this:

reduce(lambda x,y : x^y, [hash(item) for item in d.items()])

Take the hash of each (key, value) tuple in the dict and XOR them alltogether.

@katrielalex If the dict contains unhashable items you could do this:

hash(str(d))

or maybe even better

hash(repr(d))


In Python 3, the hash function is initialized with a random number, which is different for each python session. If that is not acceptable for the intended application, use e.g. zlib.adler32 to build the checksum for a dict:

import zlib

d={'key1':'value1','key2':'value2'}
checksum=0
for item in d.items():
    c1 = 1
    for t in item:
        c1 = zlib.adler32(bytes(repr(t),'utf-8'), c1)
    checksum=checksum ^ c1

print(checksum)


I would recommend an approach very similar to the one your propose, but with some extra guarantees:

import hashlib, json
hashlib.md5(json.dumps(d, sort_keys=True, ensure_ascii=True).encode('utf-8')).hexdigest()
  • sort_keys=True: keep the same hash if the order of your keys changes
  • ensure_ascii=True: in case you have some non-ascii characters, to make sure the representation does not change

We use this for our ETag.


I don't know whether pickle guarantees you that the hash is serialized the same way every time.

If you only have dictionaries, I would go for o combination of calls to keys(), sorted(), build a string based on the sorted key/value pairs and compute the checksum on that


I think you may not realise some of the subtleties that go into this. The first problem is that the order that items appear in a dict is not defined by the implementation. This means that simply asking for str of a dict doesn't work, because you could have

str(d1) == "{'a':1, 'b':2}"
str(d2) == "{'b':2, 'a':1}"

and these will hash to different values. If you have only hashable items in the dict, you can hash them and then join up their hashes, as @Bart does or simply

hash(tuple(sorted(hash(x) for x in d.items())))

Note the sorted, because you have to ensure that the hashed tuple comes out in the same order irrespective of which order the items appear in the dict. If you have dicts in the dict, you could recurse this, but it will be complicated.

BUT it would be easy to break any implementation like this if you allow arbitrary data in the dictionary, since you can simply write an object with a broken __hash__ implementation and use that. And you can't use id, because then you might have equal items which compare different.

The moral of the story is that hashing dicts isn't supported in Python for a reason.


As you said, you wanted to generate an Etag based on the dictionary content, OrderedDict which preserves the order of the dictionary may be better candidate here. Just iterator through the key,value pairs and construct your Etag string.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜