64bit MessageDigest - store short texts as long
I want to represent short texts (ie word, couple of words) as a 64 bit hash (want to store them as longs)
MessageDigest.getInstance("MD5") returns 128bits.
Is there anything else I could use, could i just peel off half of it. I am not worried of someone trying to duplicate a hash, I would like to minimize开发者_运维技巧 the number of clashes (two different strings having the same hash)
MD5 (and SHA) hash "smear" the data in a uniform way across the hashed value so any 64 bits ypu choose out of the final value will be as sensitive to a change as any other 64 bits. Your only concern will be the increased probability of collisions.
You can just use any part of the MD5 hash.
We tried to fold 128-bit into 64-bit with various algorithms but the folding action didn't make any noticeable difference in hash distribution.
Why don't you just using hashCode() of String? We hashed 8 million Email addresses into 32-bit integer and there are actually more collisions with MD5 than String hashCode. You can run hashCode twice (forward and backward) and make it a 64-bit long.
You can take a sampling of 64-bits from the 128-bit hash. You cannot guarantee there will be no clashes - only a perfect hash will give you that, and there is no perfect hash for arbitrary length strings) but the chances of a clash will be very small.
As well as a sampling, you could derive the hash using a more complex function, such as XOR consecutive pairs of bits.
As a cryptographic hash (even one nowadays considered broken), MD5 has no significant correlation between input and output bits. That means, simply taking the first or last half will give you a perfectly well-distributed hash function. Anything else would never have been seriously considered as a cryptographic hash.
What about using some block cipher with 64bit block size ?
精彩评论