
How to generate correlated binary variables

I need to generate a series of N random binary variables with a given correlation function. Let x = {x_i} be a series of binary variables (taking the value 0 or 1, with i running from 1 to N). The marginal probability is given by Pr(x_i = 1) = p, and the variables should be correlated in the following way:

Corr[x_i, x_j] = const × |i − j|^(−α)   (for i ≠ j)

where α is a positive number.

If it is easier, consider the correlation function:

Corr[x_i, x_j] = (|i − j| + 1)^(−α)

The essential part is that I want to investigate the behavior when the correlation function decays like a power law (not like α^|i − j|).

Is it possible to generate a series like this, preferably in Python?


Thanks for all your inputs. I found an answer to my question in the cute little article by Chul Gyu Park et al., so in case anyone runs into the same problem, look up:

"A Simple Method for Generating Correlated Binary Variates" (jstor.org/stable/2684925)

for a simple algorithm. The algorithm works if all the elements in the correlation matrix are positive, and it allows a general marginal distribution Pr(x_i = 1) = p_i.
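
For anyone who does not want to dig up the paper: here is a rough Python sketch of the Poisson construction as I understand it from the article (the function name and implementation details are mine, so treat it as an illustration rather than the paper's exact algorithm). Each x_i is the indicator that a sum of shared Poisson counts is zero, and the shared counts are what induce the (non-negative) correlations:

    import numpy as np

    def correlated_binary(p, R, size=1, rng=None):
        """Sketch (as I read the paper): x_i = 1{W_i = 0}, where W_i is a sum of
        independent Poisson counts, some of which are shared between pairs."""
        rng = np.random.default_rng(rng)
        p = np.asarray(p, dtype=float)
        q = 1.0 - p
        # pair rates: chosen so that E[x_i x_j] = p_i p_j + R_ij * sqrt(p_i q_i p_j q_j)
        a = np.log(1.0 + R * np.sqrt(np.outer(q, q) / np.outer(p, p)))
        np.fill_diagonal(a, 0.0)
        # individual rates make the marginals come out as p_i; they must be >= 0,
        # which is the feasibility condition on how strong the correlations can be
        a_ind = -np.log(p) - a.sum(axis=1)
        if np.any(a_ind < -1e-12):
            raise ValueError("correlations too strong for these marginals")
        out = np.empty((size, len(p)), dtype=int)
        for s in range(size):
            Z = np.triu(rng.poisson(a), k=1)     # one shared Poisson count per pair
            Z = Z + Z.T
            W = rng.poisson(np.clip(a_ind, 0.0, None)) + Z.sum(axis=1)
            out[s] = (W == 0).astype(int)
        return out

    # e.g. the (|i-j|+1)^-alpha form with alpha = 1 and a small constant:
    N = 20
    dist = np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    x = correlated_binary(np.full(N, 0.5), 0.1 * (dist + 1.0) ** -1.0, size=10000)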

j


You're describing a random process, and it looks like a tough one to me... If you eliminated the binary (0, 1) requirement and instead specified the expected value and variance, it would be possible to describe this as a white-noise generator feeding through a 1-pole low-pass filter, which I think would give you the α^|i − j| characteristic.
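
For illustration only (my own sketch, not something from the question): that exponential-decay case is just an AR(1) recursion, i.e. a one-pole low-pass filter driven by white noise:

    import numpy as np

    def ar1_series(n, rho, rng=None):
        # one-pole low-pass filter on white noise: Corr[y_i, y_j] = rho**|i - j|
        rng = np.random.default_rng(rng)
        y = np.empty(n)
        y[0] = rng.normal()
        for i in range(1, n):
            y[i] = rho * y[i - 1] + np.sqrt(1.0 - rho**2) * rng.normal()
        return y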

This actually might meet the bar for mathoverflow.net, depending on how it is phrased. Let me try asking....


Update: I did ask on mathoverflow.net about the α^|i − j| case. Perhaps there are some ideas there that can be adapted to your case.


A quick search at RSeek reveals that R has packages

  • bindata
  • binarySimCLF

to do this.


The brute force solution is to express the constraints of the problem as a linear program with 2^N variables pr(w), where w ranges over all binary strings of length N. First, the constraint that pr be a probability distribution:

for all w: 0 <= pr(w) <= 1
sum_w pr(w) = 1

Second, the constraint that the expectation of each variable be p:

for all i: sum_{w such that w[i] = 1} pr(w) = p

Third, the correlation constraints (written in terms of E[x_i x_j]):

for all i < j: sum_{w such that w[i] = w[j] = 1} pr(w) = p^2 + const * |j - i|^(-alpha) * p * (1 - p)

This is very slow, but a cursory literature search turned up nothing better. If you decide to implement it, here are some LP solvers with Python bindings: http://wiki.python.org/moin/NumericAndScientific/Libraries
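
If you want to try this at a toy size, here is a hedged sketch of the feasibility LP using scipy.optimize.linprog; the function name, the choice of const, and the use of SciPy rather than the solvers at that link are my own, and the 2^N variables make it impractical much beyond N around 15:

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def lp_distribution(N, p, alpha, const=0.1):
        words = list(itertools.product([0, 1], repeat=N))      # all 2^N outcomes
        A_eq, b_eq = [], []
        A_eq.append([1.0] * len(words)); b_eq.append(1.0)      # probabilities sum to 1
        for i in range(N):                                      # marginals: Pr(x_i = 1) = p
            A_eq.append([float(w[i]) for w in words]); b_eq.append(p)
        for i in range(N):                                      # pairwise constraints on E[x_i x_j]
            for j in range(i + 1, N):
                A_eq.append([float(w[i] * w[j]) for w in words])
                b_eq.append(p**2 + const * abs(i - j) ** (-alpha) * p * (1 - p))
        res = linprog(c=np.zeros(len(words)), A_eq=A_eq, b_eq=b_eq,
                      bounds=(0, 1), method="highs")
        return (words, res.x) if res.success else (words, None)

    words, pr = lp_distribution(N=4, p=0.5, alpha=1.0)          # pr[k] is Pr(words[k])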


Express each x_i as a linear combination of some independent basis distributions f_j: x_i = a_i1*f_1 + a_i2*f_2 + ... . Let us constrain the f_j to be independent variables uniformly distributed in 0..1, or in {0,1} (discrete). Let us now express everything we know in matrix form:

Let X be the vector (x_1, x_2, ..., x_n)
Let A be the matrix (a_ij) of dimension (n, k) (n rows, k columns)
Let F be the vector (f_1, f_2, ..., f_k)
Let P be the vector (p_1, p_2, ..., p_n)
Let R be the matrix (E[x_i x_j]) for i, j = 1..n
Definition of the X distribution: X = A * F
Constraint on the means of the individual X variables: P = A * (1 ..k times.. 1)
Correlation constraint: A*AT = 3R, or 2R in the discrete case, because E[x_i x_j] =
  E[(a_i1*f_1 + a_i2*f_2 + ...)*(a_j1*f_1 + a_j2*f_2 + ...)] =
  E[sum over p,q: a_ip*f_p*a_jq*f_q] = (since for p /= q, E[f_p*f_q] = 0)
  E[sum over p: a_ip*a_jp*f_p^2] =
  sum over p: a_ip*a_jp*E[f_p^2] = (since E[f_p^2] = 1/3, or 1/2 in the discrete case)
  sum over p: (1/3 or 1/2) * a_ip*a_jp
And the matrix consisting of those sums over p of a_ip*a_jp is precisely A*AT.

Now you need to solve the two equations:

A*AT      = 3R (or 2R in the discrete case)
A*(1...1) = P

Solving the first equation corresponds to finding a square root of the matrix 3R or 2R. See for example http://en.wikipedia.org/wiki/Cholesky_factorization and, more generally, http://en.wikipedia.org/wiki/Square_root_of_a_matrix . Something should also be done about the second one :)
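
As a minimal sketch of just that square-root step (my own illustration, assuming numpy and the 3R case; the example matrix is made up, and it does nothing about the mean constraint):

    import numpy as np

    # example E[x_i x_j] matrix (made-up values); it must be positive definite
    R = np.array([[0.50, 0.30, 0.27],
                  [0.30, 0.50, 0.30],
                  [0.27, 0.30, 0.50]])
    A = np.linalg.cholesky(3 * R)      # lower-triangular A with A @ A.T = 3R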

I ask the mathematicians around here to correct me, because I may very well have mixed up A*AT with AT*A or done something even more wrong.

To generate a value of x_i as a linear mixture of the basis distributions, use a two-step process: 1) use a uniform random variable to choose one of the basis distributions, weighted by its corresponding probability; 2) generate a result using the chosen basis distribution.


Here's an intuitive / experimental approach that seems to work.

If b is a binary r.v., m is its mean, c is the correlation you want, rand() generates a U(0,1) r.v., and d is the correlated binary r.v. you want:

d = if(rand() < c, b, if(rand() < m, 1, 0))

That is, if a uniform r.v. is less than the desired correlation, d = b; otherwise d is a fresh binary r.v. with the same mean m.

I ran this 1000 times for a column of 2000 binary r.v.s with m = 0.5 and with c = 0.4 and c = 0.5. The mean of the sample correlations was exactly as specified, and their distribution appeared to be normal. For a correlation of 0.4, the standard deviation of the sample correlation was 0.02.
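
A short Python translation of that rule (my own sketch), with a quick check of the sample correlation:

    import numpy as np

    rng = np.random.default_rng(0)
    m, c, n = 0.5, 0.4, 200_000

    b = (rng.random(n) < m).astype(int)          # the original binary r.v.
    fresh = (rng.random(n) < m).astype(int)      # an independent binary with the same mean
    d = np.where(rng.random(n) < c, b, fresh)    # copy b with probability c, else use fresh

    print(np.corrcoef(b, d)[0, 1])               # comes out close to c (about 0.4 here)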

Sorry - I can't prove that this works all the time, but you have to admit, it sure is easy.

