How to generate correlated binary variables
I need to generate a series of N random binary variables with a given correlation function. Let x = {x_i} be a series of binary variables (each taking the value 0 or 1, with i running from 1 to N). The marginal probability is given by Pr(x_i = 1) = p, and the variables should be correlated in the following way:
Corr[x_i, x_j] = const × |i−j|^(−α) (for i != j)
where α is a positive number.
If it is easier, consider the correlation function:
Corr[x_i, x_j] = (|i−j|+1)^(−α)
The essential part is that I want to investigate the behavior when the correlation function goes like a power law (not exponentially, like α^|i−j|).
Is it possible to generate a series like this, preferably in Python?
Thanks for all your input. I found an answer to my question in the cute little article by Chul Gyu Park et al., so in case anyone runs into the same problem, look up:
"A Simple Method for Generating Correlated Binary Variates" (jstor.org/stable/2684925)
for a simple algorithm. The algorithm works if all the elements in the correlation matrix are positive, and for a general marginal distribution Pr(x_i = 1) = p_i.
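For readers who want something to experiment with, here is a rough Python sketch of a Poisson-sum construction of the kind that article describes. It is my own reading, not a transcription of the paper, so check it against the article before relying on it. The names decompose and sample are hypothetical. The idea: x_i = 1 exactly when a sum of shared Poisson components is zero, so only the "is this component zero" events need to be simulated. It assumes all correlations are non-negative, and the greedy split can fail for some correlation matrices.

```python
import math
import random

def decompose(p, rho, tol=1e-12):
    """Greedily split the rate matrix into (beta, index list) pairs.

    alpha[i][i] = -log(p_i); alpha[i][j] = log(1 + rho_ij*sqrt(q_i*q_j/(p_i*p_j))).
    Assumes rho is symmetric with non-negative entries.
    """
    n = len(p)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                a[i][j] = -math.log(p[i])
            else:
                odds = (1 - p[i]) * (1 - p[j]) / (p[i] * p[j])
                a[i][j] = math.log(1 + rho[i][j] * math.sqrt(odds))
    parts = []
    while True:
        # Smallest remaining positive entry (upper triangle, diagonal included).
        cands = [(a[i][j], i, j) for i in range(n) for j in range(i, n) if a[i][j] > tol]
        if not cands:
            return parts
        beta, r, s = min(cands)
        members = sorted({r, s})
        if any(a[k][k] <= tol for k in members):
            raise ValueError("decomposition failed for this correlation matrix")
        # Grow the set while every pairwise entry (and the diagonal) stays positive.
        for k in range(n):
            if k not in members and all(a[k][m] > tol for m in members + [k]):
                members.append(k)
        for i in members:
            for j in members:
                a[i][j] -= beta
        parts.append((beta, members))

def sample(parts, n, rng):
    """Draw one binary vector: x_i = 1 iff every component touching i is zero."""
    x = [1] * n
    for beta, members in parts:
        # A Poisson(beta) variable is nonzero with probability 1 - exp(-beta).
        if rng.random() >= math.exp(-beta):
            for i in members:
                x[i] = 0
    return x

# Power-law target from the question: Corr = (|i-j|+1)^(-alpha), with p = 0.5.
n, p, alpha = 4, 0.5, 1.0
rho = [[(abs(i - j) + 1) ** -alpha for j in range(n)] for i in range(n)]
parts = decompose([p] * n, rho)
rng = random.Random(0)
draws = [sample(parts, n, rng) for _ in range(100000)]
```

Estimating the pairwise correlations from draws is a quick way to see whether the construction hits the target for your N, p and α.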
You're describing a random process, and it looks like a tough one to me. If you dropped the binary (0,1) requirement and instead specified the expected value and variance, you could describe this as a white-noise generator feeding a 1-pole low-pass filter, which I think would give you the α^|i−j| characteristic.
This actually might meet the bar for mathoverflow.net, depending on how it is phrased. Let me try asking....
update: I did ask on mathoverflow.net for the α^|i−j| case. But perhaps there are some ideas there that can be adapted to your case.
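To make the filter idea concrete, here is a small Python sketch of white Gaussian noise passed through a one-pole low-pass filter. The output is real-valued, not binary, so it only illustrates the exponential case. The names ar1 and autocorr and the coefficient a (playing the role of α in α^|i−j|) are my own choices for illustration.

```python
import random

def ar1(n, a, rng):
    """One-pole low-pass filter on white noise: y[t] = a*y[t-1] + scale*e[t]."""
    scale = (1 - a * a) ** 0.5      # keeps the stationary variance at 1
    y = rng.gauss(0.0, 1.0)         # start in the stationary distribution
    out = []
    for _ in range(n):
        y = a * y + scale * rng.gauss(0.0, 1.0)
        out.append(y)
    return out

def autocorr(xs, lag):
    """Sample autocorrelation of xs at the given lag."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs)
    cov = sum((xs[t] - m) * (xs[t + lag] - m) for t in range(len(xs) - lag))
    return cov / var

series = ar1(200000, 0.8, random.Random(1))
for lag in (1, 2, 3):
    # Each estimate should sit close to 0.8 ** lag, i.e. exponential decay.
    print(lag, round(autocorr(series, lag), 3), round(0.8 ** lag, 3))
```

The decay is geometric in the lag, which is exactly why this route does not give the power law asked for in the question.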
A quick search at RSeek reveals that R has packages
- bindata
- binarySimCLF
to do this.
The brute force solution is to express the constraints of the problem as a linear program with 2^N variables pr(w), where w ranges over all binary strings of length N. First, the constraint that pr be a probability distribution:
for all w: 0 <= pr(w) <= 1
sum_w pr(w) = 1
Second, the constraint that the expectation of each variable be p:
for all i: sum_{w such that w[i] = 1} pr(w) = p
Third, the correlation constraints. The left-hand side is E[x_i x_j], which equals p^2 + p*(1 - p) * Corr[x_i, x_j]:
for all i < j: sum_{w such that w[i] = w[j] = 1} pr(w) = p^2 + p*(1 - p) * const * |j - i|^(-alpha)
This is very slow, but a cursory literature search turned up nothing better. If you decide to implement it, here are some LP solvers with Python bindings: http://wiki.python.org/moin/NumericAndScientific/Libraries
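As a rough illustration of the equality constraints above, here is a pure-Python sketch that enumerates the 2^N strings and builds one row per constraint. The function name lp_constraints and the corr callback are hypothetical. It also assumes the right-hand side of the pairwise constraint is E[x_i x_j] = p^2 + p*(1 - p)*Corr[x_i, x_j]. It only builds the system; solving it is left to whichever LP package you pick.

```python
from itertools import product

def lp_constraints(n, p, corr):
    """Return (strings, rows, rhs) so that sum_k rows[r][k] * pr[k] == rhs[r]."""
    strings = list(product((0, 1), repeat=n))   # all 2^n binary strings w
    rows, rhs = [], []
    # Normalisation: sum_w pr(w) = 1
    rows.append([1.0] * len(strings))
    rhs.append(1.0)
    # Marginals: sum over w with w[i] = 1 of pr(w) = p
    for i in range(n):
        rows.append([float(w[i]) for w in strings])
        rhs.append(p)
    # Pairwise moments: sum over w with w[i] = w[j] = 1 of pr(w) = E[x_i x_j]
    for i in range(n):
        for j in range(i + 1, n):
            rows.append([float(w[i] * w[j]) for w in strings])
            rhs.append(p * p + p * (1 - p) * corr(i, j))
    return strings, rows, rhs

# Power-law target with alpha = 1 for N = 4.
strings, rows, rhs = lp_constraints(4, 0.5, lambda i, j: (abs(i - j) + 1) ** -1.0)
print(len(strings), "variables,", len(rows), "equality rows")
```

With SciPy, for instance, rows and rhs can be passed as A_eq and b_eq to scipy.optimize.linprog with a zero objective and bounds of (0, 1) on every variable; any feasible point is then a valid joint distribution to sample from.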
Express the distribution x_i as a linear combination of some independent basis distributions f_j: x_i = a_i1*f_1 + a_i2*f_2 + ... . Let us constrain the f_j to be independent variables uniformly distributed in 0..1 or in {0,1} (discrete). Let us now express everything we know in matrix form:
Let X be the vector (x1, x2, .., xn)
Let A be the matrix (a_ij) of dimension (n,k) (n rows, k columns)
Let F be the vector (f1, f2, .., fk)
Let P be the vector (p1, p2, .., pn)
Let R be the matrix (E[x_i*x_j]) for i,j=1..n
Definition of the X distribution: X = A * F
Constraint on the mean of individual X variables: P = A * (1 ..k times.. 1)
Correlation constraint: A*A^T = 3R, or 2R in the discrete case, because
E[x_i*x_j] =
E[(a_i1*f_1 + a_i2*f_2 + ...)*(a_j1*f_1 + a_j2*f_2 + ...)] =
E[sum over p,q: a_ip*f_p*a_jq*f_q] = (taking E[f_p*f_q]=0 for p/=q, which strictly requires the f_p to have zero mean)
E[sum over p: a_ip*a_jp*f_p^2] =
sum over p: a_ip*a_jp*E[f_p^2] = (since E[f_p^2] = 1/3, or 1/2 in the discrete case)
sum over p: (1/3 or 1/2)*a_ip*a_jp
And the matrix whose (i,j) entry is the sum over p of a_ip*a_jp is precisely A*A^T.
Now you need to solve the two equations:
A*A^T = 3R (or 2R in the discrete case)
A*(1...1) = P
Solution of the first equation corresponds to finding the square root of the matrix 3R or 2R. See for example http://en.wikipedia.org/wiki/Cholesky_factorization and generally http://en.wikipedia.org/wiki/Square_root_of_a_matrix . Something also should be done about the second one :)
I ask any mathematicians around to correct me, because I may very well have mixed up A^T*A with A*A^T or done something even more wrong.
To generate a value of xi as a linear mixture of the basis distributions, use a two-step process: 1) use a uniform random variable to choose one of the basis distributions, weighted with corresponding probability, 2) generate a result using the chosen basis distribution.
Here's an intuitive / experimental approach that seems to work.
If b is a binary r.v., m is the mean of the binary r.v., c is the correlation you want, rand() generates a U(0,1) r.v., and d is the correlated binary r.v. you want:
d = if(rand() < c, b, if(rand() < m, 1, 0))
That is, if a uniform r.v. is less than the desired correlation, d = b. Otherwise d = another random binary number with mean m.
I ran this 1000 times for a column of 2000 binary r.v.s with m = .5, once with c = .4 and once with c = .5. The mean correlation was as specified, and the distribution appeared to be normal. For a correlation of 0.4, the std deviation of the correlation was 0.02.
Sorry - I can't prove that this works all the time, but you have to admit, it sure is easy.
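If you want to try the experiment yourself, here is a minimal Python sketch of the rule above. The names correlated_copy and pearson are my own. With probability c the new variable copies b, otherwise it is a fresh draw with mean m, which gives Corr(d, b) = c for a single pair.

```python
import random

def correlated_copy(b, m, c, rng):
    """Return b with probability c, otherwise a fresh Bernoulli(m) draw."""
    if rng.random() < c:
        return b
    return 1 if rng.random() < m else 0

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(2)
m, c = 0.5, 0.4
bs = [1 if rng.random() < m else 0 for _ in range(100000)]
ds = [correlated_copy(b, m, c, rng) for b in bs]
print(round(pearson(bs, ds), 3))   # should land close to c
```

Note that chaining this rule along a series (each x_i built from x_(i−1)) gives correlations that decay like c^|i−j|, so on its own it covers the exponential case, not the power law.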