Can you explain theese disturbing anomalies of md5 and modulo?
Okay, the title is really very subjective. But thats just what the problem is to me.
The background is that I want to distribute hits of static web contents evenly about a defined number of caching servers. Also the delivery to clients should speed up because several domains are in use and requests are not blocking each other. I also don't need a classic load balancer but generate the right links right away in my html code.
I also want to ensure that the same url always gets served by the same server.
So I just defined a little function that returns the host to use by hashing the request url and calculates the modulo by the number of servers in use:
function pseudocode_statify($url) { // $url looks like /folder1/folder2/file.jpg
return 'http://' . md5($url) % $num_of_servers .'.mydomain.com' . $url;
}
I first had something like hex decoding and substring to prevent overflow in place, but found out that it just works fine the way above.
However my problem is that if I run the following test script:
for($i=0;$i<100000;$i++) {
$md5 = md5(uniqid($i).microtime().rand(1,999999999999));
$result[$md5%2]++;
}
I expected an even distribution. meaning that $result[0] would be near the value of $result[1];
This was not the case.
Okay, this is nothing special sofar. I would have just accepted the fact that md5 is not as evenly distributed as i thought and would have gone vor another hashing algorithm like sha1 or s开发者_如何学Goomething.
But I tried to reproduce the findings and found a pattern that I cannot explain.
The ratio was always about 2/1. In fact it was the ratio was always something like 1/2.16 to 1/2.17
Sample output of some runs of the script above:
output was generated by: echo "ratio: ".$result[0]/$result[1]."\n";
ratio: 2.1757121534504
ratio: 2.1729411578062
ratio: 2.1726559360393
ratio: 2.1676895664225
ratio: 2.1667416128848
ratio: 2.1667115284133
ratio: 2.1677791605385
ratio: 2.1658969579688
ratio: 2.1668508131769
ratio: 2.1689292821741
Now the weird thing was that the ratio of sums % 2 equaling 1 and sums % 2 equaling 0 sometimes alternated!
for($j = 0; $j<100;$j++) {
for($i=0;$i<100000;$i++) {
$md5 = md5(uniqid($i).microtime().rand(1,999999999999));
$result[$md5%2]++;
}
var_dump($result);
}
I ran the script from the command line two sperate times and aborted it after 3 runs and it produced theese two outputs:
joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice: Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice: Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice: Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
[0]=>
int(68223)
[1]=>
int(31777)
}
array(2) {
[0]=>
int(136384)
[1]=>
int(63616)
}
array(2) {
[0]=>
int(204498)
[1]=>
int(95502)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$ php test.php
PHP Notice: Undefined variable: result in /home/flimmit/httpdocs/test.php on line 6
PHP Notice: Undefined offset: 1 in /home/flimmit/httpdocs/test.php on line 6
PHP Notice: Undefined offset: 0 in /home/flimmit/httpdocs/test.php on line 6
array(2) {
[1]=>
int(31612)
[0]=>
int(68388)
}
array(2) {
[1]=>
int(63318)
[0]=>
int(136682)
}
array(2) {
[1]=>
int(94954)
[0]=>
int(205046)
}
^C
joe@joe-laptop:/home/flimmit/httpdocs$
As you can see in the first one the first entry of results is always higher, in the second one its the other way round. same script.
Strange thing is that i can ONLY reproduce this behaviour when i run the script several times.
I wrote this small script to reproduce the "swapping" and generate enough measure data:
for($j = 0; $j<100;$j++) {
for($i=0;$i<rand(1000,10000);$i++) {
$md5 = md5(uniqid($i).microtime().rand(1,99999999));
$result[$md5%2]++;
}
#var_dump($result);
echo "ratio: ".$result[0]/$result[1]." ".(($result[0]<$result[1]) ? "A":"B")."\n";
sleep(rand(2,5));
}
But here It only prints b, never A's. Which made me think there might be a semantic error in the script, but i didnt find any.
I am really stuck and this bothers me a lot.
So my questions:
Can you recommend any literature / weblinks were i could read about md5 a little bit deeper including distributions etc
Can you explain / reproduce the behaviour? Do I have an error here? (in fact thats very likely but i cant find it)
Can you recommend any other algorithm that would besuitable for my use case? It needs not be cryptographic or strong but fast, deterministic and evenly distributed.
The md5()
function returns a string, not an integer.
Which means that this string will be type-casted to an integer to do the modulo ; and as this string will contain characters in the 0-9A-F
range, casted to an integer, you have :
- 1 chance out of 16 of getting a 0
- 9 chances out of 16 of getting between 1 and 9
- 6 chances out of 16 of getting between A and F -- which will be casted to a 0
For example, this :
$a = md5('plop1');
var_dump($a, (int)$a);
$a = md5('plop2');
var_dump($a, (int)$a);
$a = md5('plop5');
var_dump($a, (int)$a);
Will get you the following output :
string 'ac4bf0e466417336599b72a8b2f595da' (length=32)
int 0
string 'ed91c463402dd797d0718350f5bd0acd' (length=32)
int 0
string '85782b3afb04072c1bf172a6a7e6bb5e' (length=32)
int 85782
I'll let you guess the possible impact this can have on the result of the modulo operator ;-)
for ($i = 0; $i < 10; $i++) {
srand(crc32('test_url1'));
echo rand().'<br />';
}
for ($i = 0; $i < 10; $i++) {
srand(crc32('test_url2'));
echo rand().'<br />';
}
add a range to the rand function and you have your server value.
"The background is that I want to distribute hits of static web contents evenly about a defined number of caching servers."
Many, many load balancers already do that. Squid, nginx, Varnish, HAProxy...
Please do not write your own load balancer in PHP. Please.
精彩评论