Calculating a Sample Covariance Matrix for Groups with plyr

2022-12-28 05:18 问答作者：

I'm going to use the sample code from http://gettinggeneticsdone.blogspot.com/2009/11/split-apply-and-combine-in-r-using-plyr.html开发者_开发知识库 for this example. So, first, let's copy their example data:

mydata=data.frame(X1=rnorm(30), X2=rnorm(30,5,2),
SNP1=c(rep("AA",10), rep("Aa",10), rep("aa",10)),
SNP2=c(rep("BB",10), rep("Bb",10), rep("bb",10)))

I am going to ignore SNP2 in this example and just pretend the values in SNP1 denote group membership. So then, I may want some summary statistics about each group in SNP1: "AA", "Aa", "aa".

Then if I want to calculate the means for each variable, it makes sense (modifying their code slightly) to use:

> ddply(mydata, c("SNP1"), function(df)
data.frame(meanX1=mean(df$X1), meanX2=mean(df$X2)))
  SNP1      meanX1   meanX2
1   aa  0.05178028 4.812302
2   Aa  0.30586206 4.820739
3   AA -0.26862500 4.856006

But what if I want the sample covariance matrix for each group? Ideally, I would like a 3D array, where the I have the covariance matrix for each group, and the third dimension denotes the corresponding group. I tried a modified version of the previous code and got the following results that have convinced me that I'm doing something wrong.

> daply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2)))
, ,  = 1


SNP1         1          2
  aa 1.4961210 -0.9496134
  Aa 0.8833190 -0.1640711
  AA 0.9942357 -0.9955837

, ,  = 2


SNP1          1        2
  aa -0.9496134 2.881515
  Aa -0.1640711 2.466105
  AA -0.9955837 4.938320

I was thinking that the dim() of the 3rd dimension would be 3, but instead, it is 2. Really this is a sliced up version of the covariance matrix for each group. If we manually compute the sample covariance matrix for aa, we get:

           [,1]       [,2]
[1,]  1.4961210 -0.9496134
[2,] -0.9496134  2.8815146

Using plyr, the following gives me what I want in list() form:

> dlply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2)))
$aa
           [,1]       [,2]
[1,]  1.4961210 -0.9496134
[2,] -0.9496134  2.8815146

$Aa
           [,1]       [,2]
[1,]  0.8833190 -0.1640711
[2,] -0.1640711  2.4661046

$AA
           [,1]       [,2]
[1,]  0.9942357 -0.9955837
[2,] -0.9955837  4.9383196

attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
  SNP1
1   aa
2   Aa
3   AA

But like I said earlier, I would really like this in a 3D array. Any thoughts on where I went wrong with daply() or suggestions? Of course, I could typecast the list from dlply() to a 3D array, but I'd rather not do this because I will be repeating this process many times in a simulation.

As a side note, I found one method (http://www.mail-archive.com/r-help@r-project.org/msg86328.html) that provides the sample covariance matrix for each group, but the outputted object is bloated.

Thanks in advance.

daply makes the splitting variable the first dimension in the array.

a <- daply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2)))
l <- dlply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2)))

This is so that a[1, , ] and l[[1]] correspond to the same output. As wkmor1 suggests, you can use aperm to rearrange the dimensions, but I'd like to know more about why the initial form doesn't suit your needs.

How bout...

aperm(daply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2))),perm=c(2,3,1))

'aperm' is to arrays as 't' is to matrices. The perm argument specifies the way the dims should change.

继续阅读：r

Calculating a Sample Covariance Matrix for Groups with plyr

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？