
Can I use part of an MD5 hash for data identification?

I use MD5 hashes to identify files of unknown origin. There is no attacker here, so I don't care that MD5 has been broken and that collisions can be generated intentionally.

My problem is that I need to provide logging so that various problems are easier to diagnose. If I log every hash as a hex string, it is too long, inconvenient and ugly, so I'd like to shorten the hash string.

Now I know that just taking a small part of a GUID is a very bad idea: GUIDs are designed to be unique as a whole, but a part of one is not.

Is the same true for MD5? Can I take, say, the first 4 bytes of an MD5 hash and assume that the only consequence is a higher collision probability due to the reduced number of bytes compared to the original hash?


The short answer is yes, you can use the first 4 bytes as an id. Beware of the birthday paradox though:

http://en.wikipedia.org/wiki/Birthday_paradox

The risk of a collision increases rapidly as you add more files. Four bytes give only 2^32 (about 4.3 billion) possible ids, so with 50,000 files there's already a roughly 25% chance that at least two of them get the same id.

EDIT: OK, I just read the link to your other question, and with 100,000 files the chance of a collision is roughly 70%.
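
To check figures like these for your own file counts, here is a minimal sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)). It assumes the truncated hash values are uniformly distributed; the function name `collision_probability` is made up for this illustration.

```python
import math

def collision_probability(n_items, bits):
    """Approximate chance that at least two of n_items share an id
    when ids are uniform over 2**bits values (birthday approximation)."""
    space = 2 ** bits
    exponent = n_items * (n_items - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x
    return -math.expm1(-exponent)

if __name__ == "__main__":
    # 32 bits corresponds to the first 4 bytes of the MD5 digest
    for n in (1_000, 10_000, 50_000, 100_000):
        print(f"{n:>7} files: {collision_probability(n, 32):.1%}")
```

Passing a larger `bits` value (for example 64 for the first 8 bytes) shows how quickly the risk drops as you keep more of the hash.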


Here is a related question you may want to refer to:

What is the probability that the first 4 bytes of MD5 hash computed from file contents will collide?


Another way to shorten the hash is to convert it to an encoding more compact than hex, such as Base64 or some variant thereof.

Even if you're determined to take only 4 characters, 4 characters of Base64 give you more bits than 4 characters of hex: 24 bits (6 per character) instead of 16 bits (4 per character).
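
As a rough sketch of the difference, the following encodes the same 4-byte MD5 prefix both ways. The helper name `short_ids` and the choice of the URL-safe Base64 alphabet with padding stripped are assumptions made for this example, not something from the question.

```python
import base64
import hashlib

def short_ids(data, n_bytes=4):
    """Return (hex_id, b64_id) built from the first n_bytes of MD5(data)."""
    prefix = hashlib.md5(data).digest()[:n_bytes]
    hex_id = prefix.hex()  # 2 characters per byte
    # URL-safe alphabet avoids '/' and '+' in log lines; '=' padding is dropped
    b64_id = base64.urlsafe_b64encode(prefix).decode("ascii").rstrip("=")
    return hex_id, b64_id

if __name__ == "__main__":
    hex_id, b64_id = short_ids(b"example file contents")
    print(hex_id, len(hex_id))   # 8 characters for 4 bytes
    print(b64_id, len(b64_id))   # 6 characters for the same 4 bytes
```

Both strings carry the same 32 bits, so the collision probability is unchanged; Base64 just spends fewer characters on it.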
