
How much data (how many MB) can I uniquely identify using MD5?

I've got millions of data records that are each about 2MB in size. Each record is stored in a file, and there is a set of other data associated with that record (stored in a database).

When my program runs I'll be presented, in memory, with one of the data records and need to produce the associated data. To do this I'm imagining taking an MD5 of the in-memory record, then using this hash as a key into the database. The key will help me locate the other data.
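A minimal sketch of that lookup scheme, in Python. The names `md5_key`, `store`, `lookup` and the `associated` dict are hypothetical; the dict only stands in for whatever database table would actually hold the associated data.

```python
import hashlib

def md5_key(record: bytes) -> str:
    """Return the MD5 digest of a record as a 32-character hex string."""
    return hashlib.md5(record).hexdigest()

# Hypothetical stand-in for the database: maps MD5 key -> associated data.
associated = {}

def store(record: bytes, extra: dict) -> None:
    """Index the associated data under the record's MD5 key."""
    associated[md5_key(record)] = extra

def lookup(record: bytes):
    """Return the associated data for an in-memory record, or None if unknown."""
    return associated.get(md5_key(record))

# Usage: a 2MB record is hashed once on store and once on lookup.
record = b"\x01" * (2 * 1024 * 1024)
store(record, {"owner": "example"})
print(lookup(record))
```

The key is always 16 bytes (32 hex characters) regardless of how large the record is, so it fits comfortably in an indexed database column.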

What I need to know is whether an MD5 hash of the data contents is a suitable way to uniquely identify a 2MB piece of data. In other words, can I use an MD5 hash without worrying too much about collisions?

I realize there is a chance of collision. My concern is how likely a collision is across millions of 2MB data records. Is a collision a likely occurrence? How does it compare to hard disk failure or other computer failures? How much data can MD5 safely be used to identify? What about millions of GB-sized files?

I'm not worried about malice or data tampering. I've got protections in place such that I won't be receiving manipulated data.


This boils down to the so-called birthday paradox. That Wikipedia page has simplified formulas for estimating the collision probability. It will be a very small number.
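As a rough illustration, the standard birthday approximation p ≈ 1 - exp(-n(n-1)/2 / 2^bits) can be evaluated directly. This sketch assumes MD5 behaves like an ideal 128-bit random function on non-adversarial input, which matches the "no malice" condition in the question; the function name `collision_probability` is hypothetical. Under that assumption the estimate depends only on the number of records, not on whether each one is 2MB or several GB.

```python
import math

def collision_probability(n_items: int, hash_bits: int = 128) -> float:
    """Birthday approximation of the chance that any two of n_items
    share the same hash value, assuming uniformly random hashes."""
    pairs = n_items * (n_items - 1) / 2
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(-pairs / 2.0 ** hash_bits)

for n in (10**6, 10**9, 10**12):
    print(f"{n:.0e} records: p ~ {collision_probability(n):.3e}")
```

For a million records the estimate is on the order of 10^-27, and even a trillion records gives roughly 10^-15.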

The next question is how you deal with a collision probability of, say, 10^-12; see this very similar question.
