
Commutative, accumulator-based function for calculating a digest of multiple hashes

I'm writing something that summarizes the files in a file system by hashing a sample of their contents. It constructs a tree of directories and files. Each file entry has the hash of the file contents. For each directory entry, I want to store a hash of the contents of all files in the directory, including those in sub-directories - I'll call this the directory content hash.

The tricky thing about the directory content hash is that I want it to be independent of the structure of the directory. I.e., the hash should be the same if two directories contain the same files, but organized in a different sub-directory structure.

The only two methods I can think of are:

  1. Calculate the MD5 of the concatenation of all file content hashes. In order to get the desired hash properties, I would have to list all of the files in the directory, sort them by their hash, concatenate the sorted hashes, and then run MD5 on the concatenation. This seems slower than I would like. I can do the sorting pretty efficiently by using merge sort while calculating directory content hashes throughout a tree, but I can't get around calculating a lot of MD5 hashes on large inputs.

  2. Combine file content hashes using XOR. Each directory would only need to XOR the file content hashes and directory content hashes of its immediate children. This is very fast and simple, but not very collision resistant; it can't even tell the difference between a directory containing one instance of a file and a directory containing three instances of the same file. (Both methods are sketched in code below.)
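
For concreteness, here is a minimal Java sketch of both methods, assuming the per-file content hashes have already been computed as raw digest bytes. The class and method names, and the use of MD5 via java.security.MessageDigest, are illustrative choices rather than anything from the question:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class DirectoryContentHash {

        // Method #1: sort the file content hashes, concatenate them, and hash the result.
        static byte[] sortedConcatenationHash(List<byte[]> fileHashes) throws NoSuchAlgorithmException {
            List<byte[]> sorted = new ArrayList<>(fileHashes);
            sorted.sort(Comparator.comparing(DirectoryContentHash::toHex)); // any total order works
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (byte[] h : sorted) {
                md5.update(h);
            }
            return md5.digest();
        }

        // Method #2: XOR the file content hashes together; fast but weak
        // (duplicate files cancel out, so multiplicity is invisible).
        static byte[] xorHash(List<byte[]> fileHashes, int digestLength) {
            byte[] acc = new byte[digestLength];
            for (byte[] h : fileHashes) {
                for (int i = 0; i < digestLength; i++) {
                    acc[i] ^= h[i];
                }
            }
            return acc;
        }

        static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }
    }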

It would be nice if there were a function that could be used the way XOR is used in method #2, but with better collision resistance. I think method #1 would be fast enough for this specific case, but in the interest of exploring all the options / intellectual curiosity / future applications, I'd like to know whether there's a function that satisfies the description in the title (I have a vague memory of wanting a function like that several times in the past).

Thanks.


Order-independent hashing of collections of hashes is essentially what you're looking for, non?

It sounds like any order-independent operation (like addition or multiplication) will do the trick for you. Addition has the benefit of overflowing in a nice way. I don't recall whether multiplication works as well.

In short: add all of your values, ignoring the overflow, and you should get something useful. Any other similar function should do the trick if addition isn't sufficiently collision resistant.
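
For example, here is a minimal sketch of an additive accumulator in Java: treat each digest as an unsigned integer and sum modulo 2^128. The 128-bit width (matching an MD5 digest) and the BigInteger representation are assumptions for the illustration; any fixed width would do.

    import java.math.BigInteger;
    import java.util.List;

    public class AdditiveHashAccumulator {

        // 2^128, so the sum wraps around at the width of an MD5 digest.
        private static final BigInteger MODULUS = BigInteger.ONE.shiftLeft(128);

        // Interpret each 16-byte digest as an unsigned 128-bit integer and add them.
        // Addition is commutative and associative, so the result is independent of
        // file order and of how files are grouped into sub-directories.
        static byte[] combine(List<byte[]> hashes) {
            BigInteger sum = BigInteger.ZERO;
            for (byte[] h : hashes) {
                sum = sum.add(new BigInteger(1, h)).mod(MODULUS);
            }
            byte[] raw = sum.toByteArray();
            // Normalize to exactly 16 bytes (toByteArray may add a sign byte or drop leading zeros).
            byte[] out = new byte[16];
            int copy = Math.min(raw.length, 16);
            System.arraycopy(raw, raw.length - copy, out, 16 - copy, copy);
            return out;
        }
    }

Unlike XOR, adding the same hash twice changes the accumulator, so a directory holding one copy of a file and a directory holding three copies of it no longer collide by construction.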


Since the count of items matters but the order doesn't, just sort the list of hashes and then hash the sorted list:

find . -type f -print0 | xargs -0 sha1sum | cut -c -40 | sort | sha1sum

This gives a hash value that is invariant to the directory arrangement.


I found this article: https://kevinventullo.com/2018/12/24/hashing-unordered-sets-how-far-will-cleverness-take-you/

Like @Slartibartfast says, addition is what you want. The interesting thing from the article is that it proves that no matter what commutative operation you use, there will always be problem elements. In the case of addition, the problem element is any item with a hash of 0.

While there are several documented approaches to defining a hash function for lists and other containers where iteration order is guaranteed, there seems to be less discussion around best practices for defining a hash function for unordered containers. One obvious approach is to simply sum (+) or xor (⊕) the hashes of the individual elements of the container. A downside to these approaches is the existence of “problem elements” which hash to 0; when such elements are inserted into any container, that container’s hash will remain unchanged. One might suspect that this is due to the structured nature of addition or xor, and that a more clever choice of hash function on the unordered container could avoid this. In fact, at the end of the post, we’ll mathematically prove a proposition which roughly states that any general purpose method for hashing unordered containers, which can be incrementally updated based on the existing hash, is essentially equivalent to one of the more “obvious” choices in that it has the same algebraic structure, and in particular has the same “problem” elements.
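
As a tiny illustration of that "problem element" with an additive combiner (the specific hash values below are made up for the example):

    public class ProblemElementDemo {
        public static void main(String[] args) {
            long x = 0x9e3779b97f4a7c15L, y = 0x12345678L, z = 0L; // z is the "problem element"
            long withoutZ = x + y;       // combined hash of the set {x, y}
            long withZ = x + y + z;      // combined hash of the set {x, y, z}
            System.out.println(withoutZ == withZ); // true: adding z is invisible to the combined hash
        }
    }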


If you have Google Guava available, it provides a utility method, Hashing.combineUnordered(), that does what you want. (Internally, this is implemented by adding all the hashes together.)

https://code.google.com/p/guava-libraries/wiki/HashingExplained
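
A minimal usage sketch, assuming Guava is on the classpath; the choice of SHA-256 and the sample file contents are mine, not from the documentation:

    import com.google.common.hash.HashCode;
    import com.google.common.hash.HashFunction;
    import com.google.common.hash.Hashing;

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.List;

    public class UnorderedCombineExample {
        public static void main(String[] args) {
            HashFunction sha256 = Hashing.sha256();

            // Per-file content hashes (order deliberately different in the two lists).
            List<HashCode> oneOrder = Arrays.asList(
                    sha256.hashString("file-a contents", StandardCharsets.UTF_8),
                    sha256.hashString("file-b contents", StandardCharsets.UTF_8));
            List<HashCode> otherOrder = Arrays.asList(
                    sha256.hashString("file-b contents", StandardCharsets.UTF_8),
                    sha256.hashString("file-a contents", StandardCharsets.UTF_8));

            // combineUnordered is order-independent, so both "directories" get the same hash.
            HashCode a = Hashing.combineUnordered(oneOrder);
            HashCode b = Hashing.combineUnordered(otherOrder);
            System.out.println(a.equals(b)); // true
        }
    }

Note that all the HashCodes passed in need to have the same bit length, which is naturally the case when the same hash function is used for every file.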
