uniform 16bit hash function for strings
I have about 50,000 words that I want to map each of them to a 16bit number and I'm seeking for a hash function to run on j2me. To be more specific I'm looking for a hash function with below criteria:
- few (or no) collisions
- light CPU load
- I have all of the words now
- Avalanche effect is not important, since it's not about security. It' just a look-up table.
I've tested java St开发者_StackOverflowrign.hashCode(), murmur hash, jenkins one at a time and a few simple hand-made ones but all of them have at least 30% collisions.
The minimal perfect hashing seems to have heavy CPU load for a small mobile phone too.Can anybody help me with this?
note: As you know murmur algorithm needs a seed and different seeds have different uniformity. How can I find the seed with minimum collisions?
Thank you in advance
You could look into an old-fashioned CRC. They are very fast and reasonably collision-free. Just not quite at 16 bits, as this experiment seems to indicate. But nevertheless, you might give it a try, maybe it's good enough for your purposes.
This answer might be late, but for reference, MurmurHash 3 is sufficiently fast to satisfy your speed criteria. However due to the constraint you have imposed, collisions will be quite common, since 16 bits can represent a range of 65536 values, your 50000 words would create some collisions.
Solutions:
- use 20+ bits for the key (with 32 bits, there is one collision in a few million samples)
- write a test program to find a fit seed for 16 bits, here are some useful tools: http://code.google.com/p/smhasher/
Here is the function I use in C# to map file name to 16 bit number. In my tests it performed better than Pearson hashing.
public static unsafe int Get16BitHash(string str)
{
int hash = 0;
int len = str.Length;
fixed (char* ch = str)
{
for (int i = 0; i < len; i++)
{
hash = hash + ((hash) << 5) + *(ch + i) + ((*(ch + i)) << 7);
}
}
return ((hash) ^ (hash >> 16)) & 0xffff;
}
精彩评论