efficient hash function for uris
i 开发者_Python百科am looking for a hash function to build a (global) fixed size id for strings, most of them URIs.
it should be:
- fast
- low chance of collision
- ~ 64bit
- exploiting the structure of an uri if that is possible?
would http://murmurhash.googlepages.com/ be a good choice or is there anything better suited?
Try MD4. As far as cryptography is concerned, it is "broken", but since you do not have any security concern (you want a 64-bit output size, which is too small to yield any decent security against collisions), that should not be a problem. MD4 yields a 128-bit value, which you just have to truncate to the size you wish.
Cryptographic hash functions are designed for resilience to explicit attempts at building collisions. Conceivably, one can build a faster function by relaxing that condition (it is easier to beat random collisions than a determinate attacker). There are a few such functions, e.g. MurmurHash. However it may take a quite specific setup to actually notice the speed difference. With my home PC (a 2.4 GHz Core2), I can hash about 10 millions of short strings per second with MD4, using a single CPU core (I have four cores). For MurmurHash to be faster than MD4 in a non-negligible way, it would have to be used in a context involving at least one million hash invocations per second. That does not happen very often...
I'd wait a little longer for MurmurHash3 to be finalized, then use that. The 128-bit version should give you adequate collision protection against the birthday paradox.
精彩评论