Why do digits 1, 2 and 3 appear so frequently using C rand() function?
What I am trying to do is to generate some random numbers (not necessarily single digit) like
29106
7438
5646
4487
9374
28671
92
13941
25226
10076
and then count the number of digits I get:
count[0] = 3 Percentage = 6.82
count[1] = 开发者_开发技巧 5 Percentage = 11.36
count[2] = 6 Percentage = 13.64
count[3] = 3 Percentage = 6.82
count[4] = 6 Percentage = 13.64
count[5] = 2 Percentage = 4.55
count[6] = 7 Percentage = 15.91
count[7] = 5 Percentage = 11.36
count[8] = 3 Percentage = 6.82
count[9] = 4 Percentage = 9.09
This is the code I am using:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int main() {
int i;
srand(time(NULL));
FILE* fp = fopen("random.txt", "w");
// for(i = 0; i < 10; i++)
for(i = 0; i < 1000000; i++)
fprintf(fp, "%d\n", rand());
fclose(fp);
int dummy;
long count[10] = {0,0,0,0,0,0,0,0,0,0};
fp = fopen("random.txt", "r");
while(!feof(fp)) {
fscanf(fp, "%1d", &dummy);
count[dummy]++;
}
fclose(fp);
long sum = 0;
for(i = 0; i < 10; i++)
sum += count[i];
for(i = 0; i < 10; i++)
printf("count[%d] = %7ld Percentage = %5.2f\n",
i, count[i], ((float)(100 * count[i])/sum));
}
If I generate a large number of random numbers (1000000), this is the result I get:
count[0] = 387432 Percentage = 8.31
count[1] = 728339 Percentage = 15.63
count[2] = 720880 Percentage = 15.47
count[3] = 475982 Percentage = 10.21
count[4] = 392678 Percentage = 8.43
count[5] = 392683 Percentage = 8.43
count[6] = 392456 Percentage = 8.42
count[7] = 391599 Percentage = 8.40
count[8] = 388795 Percentage = 8.34
count[9] = 389501 Percentage = 8.36
Notice that 1, 2 and 3 have too many hits. I have tried running this several times and each time I get very similar results.
I am trying to understand what could cause 1, 2 and 3 to appear much more frequently than any other digit.
Taking hint from what Matt Joiner and Pascal Cuoq pointed out,
I changed the code to use
for(i = 0; i < 1000000; i++)
fprintf(fp, "%04d\n", rand() % 10000);
// pretty prints 0
// generates numbers in range 0000 to 9999
and this is what I get (similar results on multiple runs):
count[0] = 422947 Percentage = 10.57
count[1] = 423222 Percentage = 10.58
count[2] = 414699 Percentage = 10.37
count[3] = 391604 Percentage = 9.79
count[4] = 392640 Percentage = 9.82
count[5] = 392928 Percentage = 9.82
count[6] = 392737 Percentage = 9.82
count[7] = 392634 Percentage = 9.82
count[8] = 388238 Percentage = 9.71
count[9] = 388352 Percentage = 9.71
What can be the reason that 0, 1 and 2 are favored?
Thanks everyone. Using
int rand2(){
int num = rand();
return (num > 30000? rand2():num);
}
fprintf(fp, "%04d\n", rand2() % 10000);
I get
count[0] = 399629 Percentage = 9.99
count[1] = 399897 Percentage = 10.00
count[2] = 400162 Percentage = 10.00
count[3] = 400412 Percentage = 10.01
count[4] = 399863 Percentage = 10.00
count[5] = 400756 Percentage = 10.02
count[6] = 399980 Percentage = 10.00
count[7] = 400055 Percentage = 10.00
count[8] = 399143 Percentage = 9.98
count[9] = 400104 Percentage = 10.00
rand()
generates a value from 0
to RAND_MAX
. RAND_MAX
is set to INT_MAX
on most platforms, which may be 32767
or 2147483647
.
For your example given above, it appears that RAND_MAX
is 32767
. This will place an unusually high frequency of 1
, 2
and 3
for the most significant digit for the values from 10000
to 32767
. You can observe that to a lesser degree, values up to 6
and 7
will also be slightly favored.
Regarding the edited question,
This is because the digits are still not uniformly distributed even if you % 10000
. Assume RAND_MAX == 32767
, and rand()
is perfectly uniform.
For every 10,000 numbers counting from 0, all of the digits will appear uniformly (4,000 each). However, 32,767 is not divisible by 10,000. Therefore, these 2,768 numbers will provide more leading 0, 1 and 2's to the final count.
The exact contribution from these 2,768 numbers are:
digits count
0 1857
1 1857
2 1625
3 857
4 857
5 857
6 855
7 815
8 746
9 746
adding 12,000 for the initial 30,000 numbers to the count, then divide by the total number of digits (4×32,768) should give you the expected distribution:
number probability (%)
0 10.5721
1 10.5721
2 10.3951
3 9.80911
4 9.80911
5 9.80911
6 9.80759
7 9.77707
8 9.72443
9 9.72443
which is close to what you get.
If you want to truly uniform digit distribution, you need to reject those 2,768 numbers:
int rand_4digits() {
const int RAND_MAX_4_DIGITS = RAND_MAX - RAND_MAX % 10000;
int res;
do {
res = rand();
} while (res >= RAND_MAX_4_DIGITS);
return res % 10000;
}
Looks like Benford's Law - see http://en.wikipedia.org/wiki/Benford%27s_law, or alternatively a not very good RNG.
That's because you generate numbers between 0
and RAND_MAX
. The generated numbers are evenly distributed (i.e. approx. same probability for each number), however, the digits 1,2,3 occur more often than others in this range. Try generating between 0
and 10
, where each digit occurs with the same probability and you'll get a nice distribution.
If I understand what the OP (person asking the question) wants, they want to make better random numbers.
rand() and random(), quite frankly, don't make very good random numbers; they both do poorly when tested against diehard and dieharder (two packages for testing the quality of random numbers).
The Mersenne twister is a popular random number generator which is good for pretty much everything except crypto-strong random numbers; it passes all of the diehard(er) tests with flying colors.
If one needs crypto-strong random numbers (numbers that can not be guessed, even if someone knows which particular crypto-strong algorithm is being used), there are a number of stream ciphers out there. The one I like to use is called RadioGatún[32], and here’s a compact C representation of it:
/*Placed in the public domain by Sam Trenholme*/
#include <stdint.h>
#include <stdio.h>
#define p uint32_t
#define f(a) for(c=0;c<a;c++)
#define n f(3){b[c*13]^=s[c];a[16+c]^=s[c];}k(a,b
k(p *a,p *b){p A[19],x,y,r,q[3],c,i;f(3){q[c]=b[c
*13+12];}for(i=12;i;i--){f(3){b[c*13+i]=b[c*13+i-
1];}}f(3){b[c*13]=q[c];}f(12){i=c+1+((c%3)*13);b[
i]^=a[c+1];}f(19){y=(c*7)%19;r=((c*c+c)/2)%32;x=a
[y]^(a[(y+1)%19]|(~a[(y+2)%19]));A[c]=(x>>r)|(x<<
(32-r));}f(19){a[c]=A[c]^A[(c+1)%19]^A[(c+4)%19];
}a[0]^=1;f(3){a[c+13]^=q[c];}}l(p *a,p *b,char *v
){p s[3],q,c,r,x,d=0;for(;;){f(3){s[c]=0;}for(r=0
;r<3;r++){for(q=0;q<4;q++){if(!(x=*v&255)){d=x=1;
}v++;s[r]|=x<<(q*8);if(d){n);return;}}}n);}}main(
int j,char **h){p a[39],b[39],c,e,g;if(j==2){f(39
){a[c]=b[c]=0;}l(a,b,h[1]);f(16){k(a,b);}f(4){k(a
,b);for(j=1;j<3;++j){g=a[j];for(e=4;e;e--){printf
("%02x",g&255);g>>=8;}}}printf("\n");}}
There are also a lot of other really good random number generators out there.
When you want to generate random value from range [0, x)
, instead of doing rand()%x
, you should apply formula x*((double)rand()/RAND_MAX)
, which will give you nicely distributed random values.
Say, RAND_MAX is equal to 15, so rand
will give you integers from 0 to 15. When you use modulo operator to get random numbers from [0, 10)
, values [0,5]
will have higher frequency than [6,9]
, because 3 == 3%10 == 13%10
.
精彩评论