How should I be handling checksum collisions in my application?
I have a part of my application that stores files. Because we could potentially be adding many of the same file, I am first keeping a开发者_StackOverflow社区 hash of each file. If two files have the same hash, then we throw out one, and both "references" to that file point to the same physical file.
How much should I be worried about hash collisions?
In the case of a collision what should I do? The whole crux of my code so far depends on there not being two different files with the same hash. In the event of a collision right now, my app would throw out a legitmately different file and point to the file with the same hash.
Should I be using something other than MD5? Does SHA-1 have a better collision rate?
Unless you're in some really REALLY critical application, do not worry about hash collisions. They are so rare that many things assume they are not going to happen, and catastrophic things will happen to these things if that assumption ends up being false just once.
SHA1 has a larger output space than MD5 (and fewer attacks are known on it, too), so it's definitely not a worse choice. If you are afraid of someone actively colliding your hashes, perhaps a later variant of SHA, such as SHA-256, might be a good idea.
The chance of a collision between the hashes of any two randomly selected bitstreams is the inversely proportional to the number of distinct states that the hash represents. So a 64 bit hash encodes 2 ** 64
states and has a chance of 1 / (2**64)
of a collision for any pair of files. But you are really concerned with the chance of a collisions over a (large) set of files, so you need to do the "birthday paradox" calculation, plugging the probability of a pairwise collision and the expected number of files.
But I think that the bottom line is that throwing away a file without doing a comparison is an unsafe thing to do, even if the numbers say that the probability of a collision is small.
In the provided scenario you never have to worry. It is not possible for 2 different documents to have the same checksum unless they are the same. Imagine this:
var a = 1; var b = 2;
b + 3 = 5; // true yay! a + 3 != 5; // no collision possible as long as var a does not equal 2
var 'a' with any value other than 2 can never ever compute to 5 so no collision possible. Since you are using (or should be using) a 1 way checksum hashing algorithm the resulting hash will always be dependent on its inputs
Hash collisions happen when you're dealing with randomly generated hashes that due to their random unspecified inputs could collide though very unlikely.
Please note I am in no way inferring that one way hashing algorithms are accomplished through simple addition. I'm merely using addition as a simple example based on the simple notion that they both take a set of values and output a different set values based upon them.
精彩评论