
How many string characters should I read to get a good hash?

Here is a little conundrum for you: if you use a hash algorithm like CRC-64, how many bytes of a string do you need to read to calculate a good hash? Let's say all your strings are at least 2 KB long; then it seems a waste of resources to use the whole string to calculate the hash, but just how many characters do you think is enough? Would just 8 ASCII characters be enough, since that equals 64 bits? Won't using more than 8 ASCII characters just be pointless? I want to know your thoughts on this.

Update: By a 'good hash' I mean the point where the likelihood of hash collisions cannot be reduced any further by using more bytes to calculate it.

If you use CRC-64 over 8 bytes or fewer, then there is no point in using CRC-64: just use the 8 bytes "as is". A CRC does not add any value unless the input is longer than the intended output.

As a general rule, if your hash function has an output of n bits, then collisions begin to appear once you have accumulated about 2^(n/2) strings. In other words, if you use 64 bits, collisions only become likely once you have accumulated on the order of 2^32 (about 4 billion) strings. If you get a 160-bit or larger output, then collisions are practically infeasible (you will encounter far fewer collisions than hardware failures such as the CPU catching fire). This assumes that the hash function is "perfect". If your hash function begins by selecting a few data bytes, then, necessarily, the bytes that you do not select cannot have any influence on the hash output, so you had better select the "good" bytes, and which bytes are good depends entirely on the kind of strings that you are hashing. There is no general rule here.
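To put numbers on the 2^(n/2) rule, here is a minimal sketch of the usual birthday approximation. It assumes the hash behaves like a uniformly random function, which is the "perfect" case described above; the function name collision_probability is made up for this illustration.

```python
import math

def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items uniformly random
    hash_bits-bit hashes coincide (birthday approximation)."""
    space = 2.0 ** hash_bits
    # 1 - exp(-k(k-1)/(2N)); expm1 keeps precision when the result is tiny.
    return -math.expm1(-num_items * (num_items - 1) / (2.0 * space))

if __name__ == "__main__":
    for count in (10**6, 10**9, 2**32):
        print(count, collision_probability(count, 64))
```

Under that assumption, 2^32 strings with a 64-bit hash give a collision chance of roughly 39%, while a million strings give a chance well below one in a million.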

My advice would be to first try using a generic hash function over the whole string; I usually recommend MD4. MD4 is a cryptographic hash function that has been utterly broken, but for a problem with no security involved, it is still very good at mixing data elements (cryptographically speaking, a CRC is far more broken than MD4). MD4 has been reported to actually be faster than CRC-32 on some platforms, so you could give it a shot. On a basic PC (my 2.4 GHz Core2), an MD4 implementation works at about 700 MBytes/s, so we are talking about 350,000 hashed 2 kB strings per second, which is not bad.
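If you want to measure this on your own machine, something like the following rough sketch may help. Note the substitution: it uses BLAKE2b from Python's hashlib, truncated to 64 bits, rather than MD4, because MD4 is not guaranteed to be available in every hashlib build. The names hash64 and strings_per_second are invented for this example, and the numbers you get will depend on your hardware and interpreter.

```python
import hashlib
import time

def hash64(data: bytes) -> int:
    """Hash the whole input and return a 64-bit integer.
    BLAKE2b is a stand-in here, not the MD4 mentioned above."""
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big")

def strings_per_second(size: int = 2048, rounds: int = 20000) -> float:
    """Rough throughput estimate for hashing size-byte inputs."""
    payload = bytes(range(256)) * (size // 256)
    start = time.perf_counter()
    for _ in range(rounds):
        hash64(payload)
    elapsed = time.perf_counter() - start
    return rounds / elapsed

if __name__ == "__main__":
    print(f"{strings_per_second():,.0f} strings of 2 kB hashed per second")
```

A loop like this includes Python call overhead, so treat the result as a lower bound on what a native implementation could do.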

What are the chances that the first 8 letters of two different strings are the same? Depending on what these strings are, it could be very high, in which case you'll definitely get hash collisions.
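A small sketch of that failure mode, assuming the strings are HTML-like documents that share a common opening. It uses zlib.crc32 as a stand-in because Python's standard library has no CRC-64; the sample documents and the helper names prefix_hash and full_hash are made up for illustration.

```python
import zlib

def prefix_hash(text: str, prefix_len: int = 8) -> int:
    """CRC-32 over only the first prefix_len bytes."""
    return zlib.crc32(text.encode("utf-8")[:prefix_len])

def full_hash(text: str) -> int:
    """CRC-32 over the whole string."""
    return zlib.crc32(text.encode("utf-8"))

# Hypothetical inputs: same first 8 bytes, different content later on.
docs = [
    "<!DOCTYPE html><html><head><title>Home</title></head></html>",
    "<!DOCTYPE html><html><head><title>Blog</title></head></html>",
    "<!DOCTYPE html><html><head><title>Help</title></head></html>",
]

distinct_prefix = len({prefix_hash(d) for d in docs})
distinct_full = len({full_hash(d) for d in docs})
print(distinct_prefix, distinct_full)
```

All three prefix hashes collapse to a single value, since the first 8 bytes are identical, while hashing the full strings keeps the three documents apart.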

Hash the whole thing. A few kilobytes is nothing. Unless you actually have a need to save nanoseconds in your program, not hashing the full strings would be premature optimization.






