开发者

How likely is it to get a HashCode collision with this hashcode function?

How likely is it to get a HashCode collision with the function below in following scenarios.

  1. With random int values for key[0],key[1], key[2], key[3]
  2. With random key values with the following constraints
    • key[0] <1,000,000
    • key[1] <10,000
    • key[2] <1,000
    • key[3] <1,000

Assume we have 10 Million objects.

int[] key=new int[4];    
public override int GetHashCode()
{
    // Use large prime multiples to create a unique hash key
    // Create the hash offsets using a "even powers of 2 minus 1" method, which gives 
    // primes most of the time.  
    int hashKey = 0;
    hashKey += 2047 * key[0];
    hashKey += 8191 * key[1];
    hashKey += 32767 * key[2];
    hashKey += 131071 * key[3];
    return hashKey;
开发者_Python百科}


This is kind of a strange question. Let's start with the obvious errors in the code:

// Use large prime multiples to create a unique hash key     
// Create the hash offsets using a "even powers of 2 minus 1" method, which gives      
// primes most of the time.   

First off, those are all odd powers of two minus one; none of them are even powers of two minus one.

Second, of the four multipliers you've chosen as "large prime multiples", half of them are not prime. 2047 and 32767 are composite.

Third, if we "correct" -- and I use the word advisedly -- the statement to be "odd powers of 2 minus one which gives primes most of the time" that statement is absurdly wrong. A prime of that form is known as a Mersenne prime, and there are only 47 known Mersenne primes. I assure you that the density of Mersenne primes is considerably lower than one half. Put it this way: of the odd-power Mersenne numbers between 2^1-1 and 2^43112609−1, 46 of them are known to be prime numbers, which is about one in a half a million, not half.

Fourth, what do you imagine prime numbers have to do with anything? What mythological power do prime numbers have? Surely what matters is that the distribution of hash codes tends to not produce multiples of the hash table size. Since the hash table size is chosen to be a prime number, it seems like this is potentially exacerbating the problem.

Fifth, hash keys are not unique; your question is about when they collide, so clearly they cannot be unique.

Sixth, suppose your hash function had a perfectly random distribution across the space of 32 bit integers. By the birthday "paradox" you'd expect there to be a far greater than 99% chance of at least one collision when drawing ten million numbers at random from a 32 bit space. In fact, the expected number of collisions would be on the order of ten or twenty thousand. (We could work out the exact number of expected collisions, but who cares what it is exactly; it is in that order of magnitude.)

Is that too many collisions? It is going to be very difficult to do better than a random distribution. If you require fewer collisions than that, then you shouldn't be using a 32 bit hashing algorithm in the first place.

Seventh, who cares how many collisions a hash function has across its full range? Surely the practical question ought to really be "how does this hash perform with realistic data in a large table?" You, unlike us, can answer that question by trying it. If it meets your performance budget, great, worry about something else. If it doesn't, figure out why not before you start blaming the hash function.

I am very confused by this question and what you hope to gain from its answer. Can you explain?


I wrote a quick script to test this.

import random

def hash(key):
    hashKey = 0
    hashKey += 2047 * key[0]
    hashKey += 8191 * key[1]
    hashKey += 32767 * key[2]
    hashKey += 131071 * key[3]
    return hashKey

seen = set()
collisions = 0
for i in range(0,10000000):
    x = hash([random.randint(0,1000000), random.randint(0,10000), random.randint(0,1000), random.randint(0,1000)])
    if x in seen:
        collisions += 1
    else:
        seen.add(x)

print collisions

When I ran it, it told me I got 23735 collisions. I also tried it on one million elements, and I got 247 collisions. Both numbers are averages over 4 runs.


I was going to say you should use

int hashKey = key[0].GetHashCode();
hashKey ^= key[1].GetHashCode();
hashKey ^= key[2].GetHashCode();
hashKey ^= key[3].GetHashCode();

as it would give better results, but when I tested it I was thoroughly surprised. Posting the results anyway because as a scientist "results you did not expect are still results".

Collisions1 is your method, Collisions2 is my method, this is the results of 4 runs

Collisions1: 23744
Collisions2: 8996107

Collisions1: 23825
Collisions2: 8996215

Collisions1: 23771
Collisions2: 8996119

Collisions1: 24031
Collisions2: 8996157
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜