
Why is setting a hash table's length to a prime number a good practice?

I was going through Eric Lippert's latest blog post, Guidelines and rules for GetHashCode, when I hit this paragraph:

We could be even more clever here; just as a List resizes itself when it gets full, the bucket set could resize itself as well, to ensure that the average bucket length stays low. Also, for technical reasons it is often a good idea to make the bucket set length a prime number, rather than 100. There are plenty of improvements we could make to this hash table. But this quick sketch of a naive implementation of a hash table will do for now. I want to keep it simple.
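
For context, here is roughly the kind of naive bucket-set table I understand that paragraph to describe (a quick Java sketch of my own, not Eric's C# code; the class and names are mine):

```java
import java.util.LinkedList;

// A deliberately naive fixed-size chained hash table: an array of buckets,
// each bucket a linked list of the items whose hash lands there.
class NaiveHashSet<T> {
    private static final int BUCKET_COUNT = 101;   // a prime instead of 100
    private final LinkedList<T>[] buckets;

    @SuppressWarnings("unchecked")
    NaiveHashSet() {
        buckets = new LinkedList[BUCKET_COUNT];
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int bucketOf(T item) {
        // Math.floorMod avoids a negative index for negative hash codes.
        return Math.floorMod(item.hashCode(), BUCKET_COUNT);
    }

    void add(T item) {
        LinkedList<T> bucket = buckets[bucketOf(item)];
        if (!bucket.contains(item)) {
            bucket.add(item);
        }
    }

    boolean contains(T item) {
        return buckets[bucketOf(item)].contains(item);
    }
}
```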

So it looks like I'm missing something. Why is it good practice to set it to a prime number?


You can find people arguing for both ends of the spectrum. On one side, choosing a prime number for the size of the hash table reduces the chances of collisions, even if the hash function is not very effective at distributing its results. Note that if (to take the simplest case to argue about) a power-of-2 size is chosen, only the lower bits of the hash affect the bucket, whereas with a prime number most bits of the hash result are used.
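
To make that concrete (table sizes 16 and 17 are just arbitrary examples): hash codes that differ only above their low 4 bits all collapse into the same bucket when the size is 16, but spread out when the size is the prime 17.

```java
public class LowBitsDemo {
    public static void main(String[] args) {
        // Hash codes that share their low 4 bits and differ only above them.
        int[] hashes = {5, 5 + 16, 5 + 32, 5 + 48, 5 + 64};

        // Power-of-2 table: only the low 4 bits pick the bucket,
        // so every hash lands in bucket 5.
        for (int h : hashes) {
            System.out.println("size 16 -> bucket " + (h % 16));
        }

        // Prime table: the high bits still influence the result,
        // so the same hashes are spread over different buckets.
        for (int h : hashes) {
            System.out.println("size 17 -> bucket " + (h % 17));
        }
    }
}
```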

On the other hand, you can gain more by choosing a better hash function, or even by rehashing the result of the hash function with a few bit operations, and then using a power-of-2 table size to speed up the calculations.

As a real-life example, Java's hash tables were initially implemented using prime (or almost prime) sizes, but from Java 1.4 on the design was changed to use a power-of-two number of buckets, with a second fast hash function applied to the result of the initial hash. An interesting article commenting on that change can be found here.
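
The shape of that change can be sketched roughly like this (the mixing function below is an illustrative stand-in, not the JDK's actual supplemental hash): scramble the high bits down into the low bits, then pick the bucket with a cheap bit mask instead of a full modulo.

```java
public class MixThenMask {
    // Fold the upper bits of h down into the lower bits so that a bit mask
    // still "sees" the whole hash code. The shift amounts are illustrative
    // choices, not the ones the JDK actually uses.
    static int mix(int h) {
        h ^= (h >>> 16);
        h ^= (h >>> 8);
        return h;
    }

    // For a power-of-two size, (size - 1) is an all-ones bit mask,
    // so this is equivalent to a modulo but much cheaper.
    static int bucket(int hash, int powerOfTwoSize) {
        return mix(hash) & (powerOfTwoSize - 1);
    }

    public static void main(String[] args) {
        // These hash codes share their low 16 bits; masked directly,
        // they would all land in bucket 5 of a 16-bucket table.
        int[] hashes = {5, 5 + (1 << 16), 5 + (2 << 16), 5 + (3 << 16)};
        for (int h : hashes) {
            System.out.println("raw mask: " + (h & 15) + "   mixed: " + bucket(h, 16));
        }
    }
}
```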

So basically:

  • a prime number helps disperse the inputs across the different buckets even with a not-so-good hash function.

  • a similar effect can be achieved by post-processing the result of the hash function and using a power-of-2 size, where the cheaper modulo operation (a bit mask) compensates for the cost of the post-processing.


Because this produces a better hash function and reduces the number of possible collisions. This is explained in Choosing a good hashing function:

A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions, and the cost of resolving them.

The distribution needs to be uniform only for table sizes s that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of s, the hash function needs to be uniform only when s is a power of two. On the other hand, some hashing algorithms provide uniform hashes only when s is a prime number.
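
As a small aside on the "exact doubling" case (my own illustration, not from the quoted text): when a power-of-two table doubles, each entry either keeps its bucket index or moves to index + oldSize, because exactly one additional bit of the hash starts to matter.

```java
public class DoublingDemo {
    public static void main(String[] args) {
        int oldSize = 8;            // 2^3
        int newSize = oldSize * 2;  // 2^4

        for (int h : new int[]{3, 11, 19, 27, 35}) {
            int oldBucket = h & (oldSize - 1);
            int newBucket = h & (newSize - 1);
            // newBucket is always either oldBucket or oldBucket + oldSize,
            // depending on the single newly-examined bit of the hash.
            System.out.println(h + ": " + oldBucket + " -> " + newBucket);
        }
    }
}
```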


Say your bucket set length is a power of 2 - that makes the mod calculation quite fast. It also means that the bucket selection is determined solely by the bottom n bits of the hash code (where the length is 2^n), so the remaining 32 - n bits are effectively thrown away. It's like you're discarding useful bits of the hash code immediately.
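
A trivial check makes the "fast mod" point concrete: for a table of length 2^n, the bucket index is just the bottom n bits of the hash.

```java
public class MaskIsMod {
    public static void main(String[] args) {
        int length = 16;  // 2^4, so (length - 1) == 0b1111
        for (int h = 40; h < 80; h += 9) {
            // For a non-negative hash and a power-of-two length, the modulo
            // and the bit mask agree: only the bottom 4 bits survive.
            System.out.println(h + ": h % 16 = " + (h % length)
                    + ", h & 15 = " + (h & (length - 1)));
        }
    }
}
```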

Or as this blog post from 2006 puts it:

Suppose your hashCode function results in the following hashCodes among others {x, 2x, 3x, 4x, 5x, 6x, ...}; then all of these are going to be clustered into just m buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this.) Now you can do one of the following to avoid clustering:

...

Or simply make m equal to table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e. by making table_length coprime with x. And if x can be just about any number, then make sure that table_length is a prime number.
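
The clustering claim is easy to verify with a few lines (stride x = 4 and table lengths 16 and 17 are arbitrary choices): a 16-bucket table only ever uses 16 / gcd(16, 4) = 4 buckets for those keys, while a 17-bucket (prime) table ends up using all of them.

```java
import java.util.HashSet;

public class GcdClusteringDemo {
    // Count how many distinct buckets the hash codes x, 2x, 3x, ... occupy.
    static int bucketsUsed(int x, int tableLength, int keys) {
        HashSet<Integer> used = new HashSet<>();
        for (int k = 1; k <= keys; k++) {
            used.add((k * x) % tableLength);
        }
        return used.size();
    }

    public static void main(String[] args) {
        int x = 4;
        // Composite length sharing a factor with x: only 16/gcd(16, 4) = 4 buckets used.
        System.out.println("length 16: " + bucketsUsed(x, 16, 1000) + " buckets used");
        // Prime length, coprime with x: all 17 buckets end up used.
        System.out.println("length 17: " + bucketsUsed(x, 17, 1000) + " buckets used");
    }
}
```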
