Commutative, accumulator-based function for calculating a digest of multiple hashes
I'm writing something that summarizes the files in a file system by hashing a sample of their contents. It constructs a tree of directories and files. Each file entry has the has开发者_JAVA百科h of the file contents. For each directory entry, I want to store a hash of the contents of all files in the directory, including those in sub-directories - I'll call this the directory content hash.
The tricky thing about the directory content hash is that I want it to be independent of the structure of the directory. I.E. the hash should be the same if two directories contain the same files, but organized with a different sub-directories structure.
The only two methods I can think of are:
Calculate the MD5 of the concatenation of all file content hashes. In order to get the desired hash properties, I would have to list all of the files in the directory, sort them by their hash, concatenate the sorted hashes, and then run MD5 on the concatenation. This seems slower than I would like. I can do the sorting pretty efficiently by using merge sort while calculating directory content hashes throughout a tree, but I can't get around calculating a lot of MD5 hashes on large inputs.
Combine file content hashes using XOR. Each directory would only need to XOR the file content hashes and directory content hashes of its immediate children. This is very fast and simple, but not very collision resistant. It can't even tell the difference between a directory which contains 1 instance of a file, and a directory which contains three instances of the same file.
It would be nice if there is a function which can be used similar to the way XOR is used in method #2, but is more collision resistant. I think method #1 would be fast enough for this specific case, but in the interest of exploring-all-the-options/intellectual-curiosity/future-applications, I'd like to know whether there's a function that satisfies the description in the title (I have a vague memory of wanting a function like that several times in the past).
Thanks.
Order independent hashing of collections of hashes (is essentially what you're looking for, non?)
It sounds like any order independent operation (like addition or multiplication) will do the trick for you. Addition has the benefit of overflowing in a nice way. I don't recall if multiplication will work as well.
In short: add all of your values, ignoring the overflow, and you should get something useful. Any other similar function should do the trick if addition isn't sufficiently collision resistant.
As the count of items is important but the order isn't; just sort the list of hashes and then hash the list.
find . -print0 | xargs -0 sha1sum | cut -c -40 | sort | sha1sum
This would give the type of hash value which is invariant to the directory arrangement.
I found this article: https://kevinventullo.com/2018/12/24/hashing-unordered-sets-how-far-will-cleverness-take-you/
Like @Slartibartfast says, addition is what you want. The interesting thing from the article is that it proves that no matter what "commutative" operation you do there will always be problem elements. In the case of addition, the problem element is the item with a hash of 0.
While there are several documented approaches to defining a hash function for lists and other containers where iteration order is guaranteed, there seems to be less discussion around best practices for defining a hash function for unordered containers. One obvious approach is to simply sum {(+)} or xor {(\oplus)} the hashes of the individual elements of the container. A downside to these approaches is the existence of “problem elements” which hash to 0; when such elements are inserted into any container, that container’s hash will remain unchanged. One might suspect that this is due to the structured nature of addition or xor, and that a more clever choice of hash function on the unordered container could avoid this. In fact, at the end of the post, we’ll mathematically prove a proposition which roughly states that any general purpose method for hashing unordered containers, which can be incrementally updated based on the existing hash, is essentially equivalent to one of the more “obvious” choices in that it has the same algebraic structure, and in particular has the same “problem” elements.
If you have Google guava available, it provides a utility method, Hashing.combinedUnordered(), that does what you want. (Internally, this is implemented by adding all the hashes together.)
https://code.google.com/p/guava-libraries/wiki/HashingExplained
精彩评论