Why should the "prime-based" hashcode implementation be used instead of the "naive" one?
I have seen that a prime number implementation of the GetHashCode function is being recommend, for example here. However using the following code (in VB, sorry), it seems as if that implementation gives the same hash density as a "naive" xor implementation. If the density is the same, I would suppose there is the same probability of collision in both implementations. Am I missing anything on why is the prime approach preferred?
I am supossing that if the hash code is a byte I do not lose generality for the integer case.
Sub Main()
Dim XorHashes(255) As Integer
Dim PrimeHashes(255) As Integer
For i = 0 To 255
For j = 0 To 255
For k = 0 To 255
XorHashes(GetXorHash(i, j, 开发者_高级运维k)) += 1
PrimeHashes(GetPrimeHash(i, j, k)) += 1
Next
Next
Next
For i = 0 To 255
Console.WriteLine("{0}: {1}, {2}", i, XorHashes(i), PrimeHashes(i))
Next
Console.ReadKey()
End Sub
Public Function GetXorHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Byte
Return CByte((valueOne Xor valueTwo Xor valueThree) Mod 256)
End Function
Public Function GetPrimeHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Byte
Dim TempHash = 17
TempHash = 31 * TempHash + valueOne
TempHash = 31 * TempHash + valueTwo
TempHash = 31 * TempHash + valueThree
Return CByte(TempHash Mod 256)
End Function
The probability of collisions also depends on the expected distribution of the input data. In your example you assume input data that is uniformly distributed over the entire range. This is the ideal situation and it's no surprise that both algorithms perform well.
However, if you assume that the input data generally is similar in the high bits and differs mostly only in the low bits (note: a lot of real data is like this), the prime number method will spread this variation out over the whole hash whereas the XOR method will not - small changes in the low bits of two or more values can easily cancel each other out when XOR'ed. So the prime number method is less likely to collide in this case.
Also you should use 32-bit values for GetHashCode, not 8-bit values.
Truncating the hash is your problem here. The Xor method can only ever produce 256 distinct values. The Prime method can generate more than 750,000 distinct values, but you throw 749,744 of them away by using only the 8 low bits. And can thus never do a better job than Xor.
In your specific case, you can do much better. There are enough bits in an Integer to generate a unique hash with 16 million distinct values:
Public Shared Function GetGoodHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Integer
Return valueOne And 255 + (valueTwo And 255) << 8 + (valueThree And 255) << 16
End Function
The Xor method is okay when the input values are well distributed. A problem with the prime method is that it is easy to trigger an Overflow exception. That's difficult to deal with in VB.NET code, it doesn't have the equivalent of the C# unchecked keyword. You have to turn that off globally with Project + Properties, Compile tab, Advanced Compile Options, tick "Remove integer overflow checks". Avoid that by computing the hash as an Int64. Which makes it a bit expensive.
精彩评论