开发者

Hash function w/ PHP integer overflow

I've looked through a lot of questions here on "PHP integer overflow" but I can't find anything that answers my specific question, so I hope I haven't missed an existing answer.

I want to use the djb2 hash function in PHP to hash keys into something like a shard identifier (domain index for SimpleDB). It overflows unsigned long ints, so I can't do it identically in straight PHP, because PHP's native ints are 32-bit signed.

So I tried PHP's bc and libgmp math extensions, which allow arbitrary length, and they get round the signedness/scale problem, but they make ints "too big" - i.e. they don't overflow.

Using GMP in particular works and seems to give consistent results, but is obviously an order of magnitude slower than in C (0m0.017s vs 0m0.002s). I don't know whether this is simply because it's PHP vs C, or whether it would be significantly faster in PHP if I could get it to overflow. I'd rath开发者_如何学Pythoner test and find out, but I can't see a way to make that happen.

So, is there any way to force a ULONG maximum in PHP? Would I maybe have to wrap the C function in a PHP extension? Or, given that I'm planning only to hash short-ish keys (probably 64 chars or less), would that provide seriously diminishing returns?


Why do you think those hash functions require more than 32 bits? long types are not guaranteed to be that size, they are just >= 32 bits. On my 32-bit platforms long is always 32 bits.

Those notes you linked to, I assume, were written at times when 64 bits were not so popular as nowadays, and also before the long long type (which is >= 64 bits even on 32-bit platforms) was introduced, so the author just used the biggest type that was available then.

That "djb2" hash is just another variation of the hash function almost analogous to the linear congruential generator and it has been known for a long time. Explicit modulo operation was replaced there with an overflow, which is effectively "modulo 2^(sizeof long)". That might be (although, not sure about that) good for performance if compiled as C, but it is likely not so good for the hash quality. And it doesn't make sense in PHP because the number will be promoted to a double and grow into infinity.

I would suggest using that hashing algorithm with the usual PHP integers, but improve the hashing by adding explicit modulo with divisor which is a prime number less than PHP_INT_MAX (have you checked that limit on 64-bit builds of PHP, by the way?). Maybe, the multiplier (33) will have to be changed to get better hash distribution with the strings that have to be hashed.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜