How to discover identical files without comparing them to each other?

2023-02-11 07:37

I am building a site where users can upload content. As always I aim for world dominance, so I would like to avoid storing the same file twice, for instance when a user uploads the same file two times (after renaming it, or simply having forgotten what she did in the past).

My current approach is to have the database that tracks each uploaded file store the following information about each file:

file size in bytes
MD5 sum of file contents
SHA1 sum of file contents

And then a unique index on those three columns. I use two hashes to minimize the risk of false positives.
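A minimal sketch of how such a signature could be computed in one pass over the file, assuming Python and its standard `hashlib` module. The function name `file_signature` and the chunk size are illustrative choices, not something from the question:

```python
import hashlib


def file_signature(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex, sha1_hex) for the file at `path`.

    Hypothetical helper: the name and the 1 MiB chunk size are assumptions.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    size = 0
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large upload is never held in memory whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            size += len(chunk)
            md5.update(chunk)
            sha1.update(chunk)
    return size, md5.hexdigest(), sha1.hexdigest()
```

Both digests are fed from the same chunks, so the file is read only once no matter how many hashes are stored.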

So, my question is really: what is the probability of two different ("real-world") files of the same size having identical MD5 and SHA1 hashes?

Or: Is there a smarter method of similar (un)complexity?

(I understand that the probability could depend on the file size).

Thanks!

The probability of two real-world files of the same size having the same SHA1 hash is zero for all practical purposes. Some weaknesses in SHA1 have been found, but creating a file from a SHA1 hash and a size (1) is incredibly expensive in terms of computing power and (2) produces either garbage or the original file.

Adding MD5 to the mix is total overkill. If you don't trust SHA-1, then a better option is to switch to SHA-2.

If you're really paranoid, compare the files that have identical (size, SHA1) signatures byte by byte. That comparison will, however, have to read both files entirely whenever they are equal.
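For the paranoid case, a byte-level check could look roughly like this. It is a sketch in Python, and `same_contents` is a hypothetical name. It stops at the first differing chunk, so only files that really are equal get read to the end:

```python
import os


def same_contents(path_a, path_b, chunk_size=1 << 20):
    """Return True if both regular files hold exactly the same bytes."""
    # Different sizes can never be equal, so skip reading in that case.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return False  # first difference found, stop early
            if not block_a:
                return True   # both files ended at the same point
```

Python's standard library also ships `filecmp.cmp(path_a, path_b, shallow=False)`, which performs a similar content comparison.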

I believe storing both MD5 and SHA1 hashes adds unnecessary complexity and is not good design. I would say storing the tuple (SHA1, file size) is more than good enough. Especially if you're starting a new community site, I'd safely use that solution and only create something more clever once it becomes a problem. As the saying goes, premature optimization is the root of all evil, and it's arguable whether adding a second hash would even count as "optimizing".
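As an illustration of the (SHA1, file size) approach, here is a small sketch using Python's built-in `sqlite3` module. The table name `uploads`, its columns, and the helper names are assumptions made for the example. The asker's real database and schema are not shown in the question:

```python
import sqlite3


def open_store(db_path=":memory:"):
    """Open a database with a hypothetical `uploads` table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS uploads ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " size INTEGER NOT NULL,"
        " sha1 TEXT NOT NULL,"
        " UNIQUE (sha1, size))"  # the unique index does the duplicate detection
    )
    return conn


def register_upload(conn, name, size, sha1_hex):
    """Return (row_id, is_new); is_new is False when the content is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO uploads (name, size, sha1) VALUES (?, ?, ?)",
                (name, size, sha1_hex),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Unique constraint hit: look up the row that already has this content.
        row = conn.execute(
            "SELECT id FROM uploads WHERE sha1 = ? AND size = ?",
            (sha1_hex, size),
        ).fetchone()
        return row[0], False
```

Letting the database enforce uniqueness, rather than checking first and inserting afterwards, avoids a race between two simultaneous uploads of the same file.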

edit: I did not quantify the odds of you getting an MD5+SHA1 collision. I'd say it's zero. By a crude, back-of-the-envelope calculation, the odds of two different files of arbitrary sizes having an identical (SHA1, MD5) tuple are 2^-288 (160 bits of SHA1 plus 128 bits of MD5), which is zero as far as I'm concerned. Requiring identical file sizes reduces that even further.
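The arithmetic behind that estimate can be checked with a few lines of Python. This sketch assumes the idealized model in which each hash behaves like an independent, uniformly random function, which is reasonable only for non-malicious "real-world" files. The function names are illustrative:

```python
from fractions import Fraction

MD5_BITS = 128
SHA1_BITS = 160


def pair_collision_probability(bits):
    """Chance that two given different files share a `bits`-bit digest."""
    return Fraction(1, 2 ** bits)


def any_collision_upper_bound(n_files, bits):
    """Union (birthday) bound: at most one chance per pair of files."""
    pairs = n_files * (n_files - 1) // 2
    return pairs * pair_collision_probability(bits)


# One specific pair, both hashes combined: 2^-288.
both = pair_collision_probability(MD5_BITS + SHA1_BITS)

# Any collision at all among a billion files, using SHA1 alone.
billion_sha1 = float(any_collision_upper_bound(10 ** 9, SHA1_BITS))
```

Under these assumptions, even a billion stored files with SHA1 alone give a bound on the order of 10^-31.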

You can use Broder's implementation of the Rabin fingerprinting algorithm. It is faster to compute than SHA1 and MD5, and it is proven to be collision resistant. However, it is not considered safe against malicious attacks: it is possible for someone to purposefully alter the file in question without changing the fingerprint itself. If you just want to check the similarity of files, it is a pretty good solution.

C# implementation, not tested:

http://www.developpez.net/forums/d863959/dotnet/general-dotnet/contribuez/algorithm-rabin-fingerprint/
