开发者

Analysis of the usage of prime numbers in hash functions

I was studying hash-based sort and I found that using prime numbers in a hash function is considered a good idea, because multiplying each character of the key by a prime number and adding the results up would produce a unique value (because primes are unique) and a prime number like 31 would produce better distribution of keys.

key(s)=s[0]*31(len–1)+s[1]*31(len–2)+ ... +s[len–1]

Sample code:

public int hashCode( ) 
{
    int h = hash;
    if (h == 0) 
    {
        for (int i = 0; i < chars.length; i++) 
        {
            h = MULT*h + chars[i];
        }
        hash = h;
    }
    return h;
}

I would like to understand why the use of even numbers for multiplying each character is a bad idea in the context of this explanation below (found on another forum; it sounds like a good explanation, but I'm failing to grasp it). If the reasoning below is not valid, I would appreciate a simpler explanation.

Suppose MULT were 26, and consider hashing a hundred-character string. How much influence does the string's first开发者_如何学编程 character have on the final value of 'h'? The first character's value will have been multiplied by MULT 99 times, so if the arithmetic were done in infinite precision the value would consist of some jumble of bits followed by 99 low-order zero bits -- each time you multiply by MULT you introduce another low-order zero, right? The computer's finite arithmetic just chops away all the excess high-order bits, so the first character's actual contribution to 'h' is ... precisely zero! The 'h' value depends only on the rightmost 32 string characters (assuming a 32-bit int), and even then things are not wonderful: the first of those final 32 bytes influences only the leftmost bit of `h' and has no effect on the remaining 31. Clearly, an even-valued MULT is a poor idea.


I think it's easier to see if you use 2 instead of 26. They both have the same effect on the lowest-order bit of h. Consider a 33 character string of some character c followed by 32 zero bytes (for illustrative purposes). Since the string isn't wholly null you'd hope the hash would be nonzero.

For the first character, your computed hash h is equal to c[0]. For the second character, you take h * 2 + c[1]. So now h is 2*c[0]. For the third character h is now h*2 + c[2] which works out to 4*c[0]. Repeat this 30 more times, and you can see that the multiplier uses more bits than are available in your destination, meaning effectively c[0] had no impact on the final hash at all.

The end math works out exactly the same with a different multiplier like 26, except that the intermediate hashes will modulo 2^32 every so often during the process. Since 26 is even it still adds one 0 bit to the low end each iteration.


This hash can be described like this (here ^ is exponentiation, not xor).

hash(string) = sum_over_i(s[i] * MULT^(strlen(s) - i - 1)) % (2^32).

Look at the contribution of the first character. It's

(s[0] * MULT^(strlen(s) - 1)) % (2^32).

If the string is long enough (strlen(s) > 32) then this is zero.


Other people have posted the answer -- if you use an even multiple, then only the last characters in the string matter for computing the hash, as the early character's influence will have shifted out of the register.

Now lets consider what happens when you use a multiplier like 31. Well, 31 is 32-1 or 2^5 - 1. So when you use that, your final hash value will be:

\sum{c_i 2^{5(len-i)} - \sum{c_i}

unfortunately stackoverflow doesn't understad TeX math notation, so the above is hard to understand, but its two summations over the characters in the string, where the first one shifts each character by 5 bits for each subsequent character in the string. So using a 32-bit machine, that will shift off the top for all except the last seven characters of the string.

The upshot of this is that using a multiplier of 31 means that while characters other than the last seven have an effect on the string, its completely independent of their order. If you take two strings that have the same last 7 characters, for which the other characters also the same but in a different order, you'll get the same hash for both. You'll also get the same hash for things like "az" and "by" other than in the last 7 chars.

So using a prime multiplier, while much better than an even multiplier, is still not very good. Better is to use a rotate instruction, which shifts the bits back into the bottom when they shift out the top. Something like:

public unisgned hashCode(string chars)
{
    unsigned h = 0;
    for (int i = 0; i < chars.length; i++) {
        h = (h<<5) + (h>>27);  // ROL by 5, assuming 32 bits here
        h += chars[i];
    }
    return h;
}

Of course, this depends on your compiler being smart enough to recognize the idiom for a rotate instruction and turn it into a single instruction for maximum efficiency.

This also still has the problem that swapping 32-character blocks in the string will give the same hash value, so its far from strong, but probably adequate for most non-cryptographic purposes


would produce a unique value

Stop right there. Hashes are not unique. A good hash algorithm will minimize collisions, but the pigeonhole principle assures us that perfectly avoiding collisions is not possible (for any datatype with non-trivial information content).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜