
Why is setting a hash table's length to a prime number a good practice?

I was going through Eric Lippert's latest blog post, Guidelines and rules for GetHashCode, when I hit this paragraph:

We could be even more clever here; just as a List resizes itself when it gets full, the bucket set could resize itself as well, to ensure that the average bucket length stays low. Also, for technical reasons it is often a good idea to make the bucket set length a prime number, rather than 100. There are plenty of improvements we could make to this hash table. But this quick sketch of a naive implementation of a hash table will do for now. I want to keep it simple.
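
For context, here is roughly the kind of naive bucket-set table I understand that paragraph to describe (a quick Java sketch of my own, not Eric's C# code; the class and names are mine):

```java
import java.util.LinkedList;

// A deliberately naive fixed-size chained hash table: an array of buckets,
// each bucket a linked list of the items whose hash lands there.
class NaiveHashSet<T> {
    private static final int BUCKET_COUNT = 101;   // a prime instead of 100
    private final LinkedList<T>[] buckets;

    @SuppressWarnings("unchecked")
    NaiveHashSet() {
        buckets = new LinkedList[BUCKET_COUNT];
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int bucketOf(T item) {
        // Math.floorMod avoids a negative index for negative hash codes.
        return Math.floorMod(item.hashCode(), BUCKET_COUNT);
    }

    void add(T item) {
        LinkedList<T> bucket = buckets[bucketOf(item)];
        if (!bucket.contains(item)) {
            bucket.add(item);
        }
    }

    boolean contains(T item) {
        return buckets[bucketOf(item)].contains(item);
    }
}
```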

So it looks like I'm missing something. Why is it good practice to set it to a prime number?


You can find people arguing for both ends of the spectrum. On one side, choosing a prime number for the size of the hash table reduces the chances of collisions, even if the hash function is not very effective at distributing its results. Note that if (to take the simplest case to argue about) a power-of-2 size is chosen, only the lower bits of the hash affect the bucket, whereas with a prime number most bits of the hash result are used.
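
To make that concrete (table sizes 16 and 17 are just arbitrary examples): hash codes that differ only above their low 4 bits all collapse into the same bucket when the size is 16, but spread out when the size is the prime 17.

```java
public class LowBitsDemo {
    public static void main(String[] args) {
        // Hash codes that share their low 4 bits and differ only above them.
        int[] hashes = {5, 5 + 16, 5 + 32, 5 + 48, 5 + 64};

        // Power-of-2 table: only the low 4 bits pick the bucket,
        // so every hash lands in bucket 5.
        for (int h : hashes) {
            System.out.println("size 16 -> bucket " + (h % 16));
        }

        // Prime table: the high bits still influence the result,
        // so the same hashes are spread over different buckets.
        for (int h : hashes) {
            System.out.println("size 17 -> bucket " + (h % 17));
        }
    }
}
```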

On the other hand, you can gain more by choosing a better hash function, or even by rehashing the result of the hash function with a few bit operations, and then using a power-of-2 table size to speed up the calculations.

As a real-life example, Java's hash tables were initially implemented using prime (or almost prime) sizes, but from Java 1.4 on the design was changed to use a power-of-two number of buckets, with a second fast hash function applied to the result of the initial hash. An interesting article commenting on that change can be found here.
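
The shape of that change can be sketched roughly like this (the mixing function below is an illustrative stand-in, not the JDK's actual supplemental hash): scramble the high bits down into the low bits, then pick the bucket with a cheap bit mask instead of a full modulo.

```java
public class MixThenMask {
    // Fold the upper bits of h down into the lower bits so that a bit mask
    // still "sees" the whole hash code. The shift amounts are illustrative
    // choices, not the ones the JDK actually uses.
    static int mix(int h) {
        h ^= (h >>> 16);
        h ^= (h >>> 8);
        return h;
    }

    // For a power-of-two size, (size - 1) is an all-ones bit mask,
    // so this is equivalent to a modulo but much cheaper.
    static int bucket(int hash, int powerOfTwoSize) {
        return mix(hash) & (powerOfTwoSize - 1);
    }

    public static void main(String[] args) {
        // These hash codes share their low 16 bits; masked directly,
        // they would all land in bucket 5 of a 16-bucket table.
        int[] hashes = {5, 5 + (1 << 16), 5 + (2 << 16), 5 + (3 << 16)};
        for (int h : hashes) {
            System.out.println("raw mask: " + (h & 15) + "   mixed: " + bucket(h, 16));
        }
    }
}
```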

So basically:

  • a prime number helps disperse the inputs across the different buckets even with a not-so-good hash function.

  • a similar effect can be achieved by post-processing the result of the hash function and using a power-of-2 size, where the cheaper modulo operation (a bit mask) compensates for the cost of the post-processing.


Because this produces a better hash function and reduces the number of possible collisions. This is explained in Choosing a good hashing function:

A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions, and the cost of resolving them.

The distribution needs to be uniform only for table sizes s that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of s, the hash function needs to be uniform only when s is a power of two. On the other hand, some hashing algorithms provide uniform hashes only when s is a prime number.
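
As a small aside on the "exact doubling" case (my own illustration, not from the quoted text): when a power-of-two table doubles, each entry either keeps its bucket index or moves to index + oldSize, because exactly one additional bit of the hash starts to matter.

```java
public class DoublingDemo {
    public static void main(String[] args) {
        int oldSize = 8;            // 2^3
        int newSize = oldSize * 2;  // 2^4

        for (int h : new int[]{3, 11, 19, 27, 35}) {
            int oldBucket = h & (oldSize - 1);
            int newBucket = h & (newSize - 1);
            // newBucket is always either oldBucket or oldBucket + oldSize,
            // depending on the single newly-examined bit of the hash.
            System.out.println(h + ": " + oldBucket + " -> " + newBucket);
        }
    }
}
```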


Say your bucket set length is a power of 2 - that makes the mod calculation quite fast. It also means that the bucket selection is determined solely by the bottom n bits of the hash code (where the length is 2^n), so the remaining 32 - n bits are effectively thrown away. It's like you're discarding useful bits of the hash code immediately.
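
A trivial check makes the "fast mod" point concrete: for a table of length 2^n, the bucket index is just the bottom n bits of the hash.

```java
public class MaskIsMod {
    public static void main(String[] args) {
        int length = 16;  // 2^4, so (length - 1) == 0b1111
        for (int h = 40; h < 80; h += 9) {
            // For a non-negative hash and a power-of-two length, the modulo
            // and the bit mask agree: only the bottom 4 bits survive.
            System.out.println(h + ": h % 16 = " + (h % length)
                    + ", h & 15 = " + (h & (length - 1)));
        }
    }
}
```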

Or as this blog post from 2006 puts it:

Suppose your hashCode function results in the following hashCodes among others {x, 2x, 3x, 4x, 5x, 6x, ...}; then all of these are going to be clustered into just m buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this.) Now you can do one of the following to avoid clustering:

...

Or simply make m equal to table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e. by making table_length coprime with x. And if x can be just about any number, then make sure that table_length is a prime number.
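
The clustering claim is easy to verify with a few lines (stride x = 4 and table lengths 16 and 17 are arbitrary choices): a 16-bucket table only ever uses 16 / gcd(16, 4) = 4 buckets for those keys, while a 17-bucket (prime) table ends up using all of them.

```java
import java.util.HashSet;

public class GcdClusteringDemo {
    // Count how many distinct buckets the hash codes x, 2x, 3x, ... occupy.
    static int bucketsUsed(int x, int tableLength, int keys) {
        HashSet<Integer> used = new HashSet<>();
        for (int k = 1; k <= keys; k++) {
            used.add((k * x) % tableLength);
        }
        return used.size();
    }

    public static void main(String[] args) {
        int x = 4;
        // Composite length sharing a factor with x: only 16/gcd(16, 4) = 4 buckets used.
        System.out.println("length 16: " + bucketsUsed(x, 16, 1000) + " buckets used");
        // Prime length, coprime with x: all 17 buckets end up used.
        System.out.println("length 17: " + bucketsUsed(x, 17, 1000) + " buckets used");
    }
}
```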
