ddply returning too many results

For some reason I'm getting more results than I expected since upgrading to R 2.13.0 and plyr 1.5.1. For comparison, I also tried this on an older version of plyr (I'm not sure which version, unfortunately, as I've just overwritten it).

library(plyr)
dd <- data.frame(matrix(rnorm(216), 72, 3), c(rep("A", 24), rep("B", 24),
  rep("C", 24)), c(rep("J", 36), rep("K", 36)))
colnames(dd) <- c("v1", "v2", "v3", "dim1", "dim2")

results1 <- ddply(dd, c("dim1","dim2"), function(df) c(m1=mean(df$v1)) )
results2 <- ddply(dd, c("dim1","dim2"), function(df) { c(m1=mean(df$v1),
    m2=mean(df$v2)) } )
results3 <- ddply(dd, c("dim1","dim2"), function(df) { c(m1=mean(df$v1),
    m2=mean(df$v2), m3=mean(df$v3)) } )

I don't understand why results2 has twice as many rows as results1 and results3 has three times as many; the same four rows are just repeated two or three times.

I had a handy copy of R version 2.11.0 Patched (2010-05-01 r51907) with an old version of plyr. The results I was expecting, which that setup produces, were:

> results1
  dim1 dim2          m1
1    A    J  0.07312783
2    B    J -0.22428746
3    B    K -0.44205832
4    C    K  0.21421456
> results2
  dim1 dim2          m1         m2
1    A    J  0.07312783 -0.1130148
2    B    J -0.22428746  0.4394832
3    B    K -0.44205832 -0.1934018
4    C    K  0.21421456 -0.0178809
> results3
  dim1 dim2          m1         m2          m3
1    A    J  0.07312783 -0.1130148 -0.03175873
2    B    J -0.22428746  0.4394832  0.21581696
3    B    K -0.44205832 -0.1934018 -0.28313530
4    C    K  0.21421456 -0.0178809 -0.21948430

The results I get from R version 2.13.0 (2011-04-13):

> results1
  dim1 dim2         m1
1    A    J -0.2270726
2    B    J  0.5860493
3    B    K -0.5986129
4    C    K  0.3135809
> results2
  dim1 dim2         m1          m2
1    A    J -0.2270726 -0.19037813
2    B    J  0.5860493 -0.05385395
3    B    K -0.5986129  0.29404095
4    C    K  0.3135809 -0.26744010
5    A    J -0.2270726 -0.19037813
6    B    J  0.5860493 -0.05385395
7    B    K -0.5986129  0.29404095
8    C    K  0.3135809 -0.26744010
> results3
   dim1 dim2         m1          m2          m3
1     A    J -0.2270726 -0.19037813 -0.20448734
2     B    J  0.5860493 -0.05385395 -0.11190857
3     B    K -0.5986129  0.29404095 -0.27072101
4     C    K  0.3135809 -0.26744010 -0.03184949
5     A    J -0.2270726 -0.19037813 -0.20448734
6     B    J  0.5860493 -0.05385395 -0.11190857
7     B    K -0.5986129  0.29404095 -0.27072101
8     C    K  0.3135809 -0.26744010 -0.03184949
9     A    J -0.2270726 -0.19037813 -0.20448734
10    B    J  0.5860493 -0.05385395 -0.11190857
11    B    K -0.5986129  0.29404095 -0.27072101
12    C    K  0.3135809 -0.26744010 -0.03184949

Why has results2 got 8 rows instead of 4, and results3 got 12 rows instead of 4?

Thanks, Sean


This will be fixed shortly in plyr 1.5.2


It's the c() function inside your ddply() that's causing the problem.

Here are three alternative ways that you can write your statement for results3, progressively getting simpler:

  1. Use data.frame inside your function:

    ddply(dd, c("dim1","dim2"), function(df) {data.frame(m1=mean(df$v1), m2=mean(df$v2), m3=mean(df$v3)) } )

  2. Use summarise:

    ddply(dd, .(dim1, dim2), summarise, m1=mean(v1), m2=mean(v2), m3=mean(v3))

  3. Use numcolwise:

    ddply(dd, .(dim1, dim2), numcolwise(mean))

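In each case you get the four rows you would expect, one per dim1/dim2 combination (with numcolwise the columns keep their original names v1, v2, v3 rather than m1, m2, m3):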

  dim1 dim2          m1         m2          m3
1    A    J -0.04272659 -0.1468376  0.17902942
2    B    J -0.10133503 -0.1427358 -0.05241214
3    B    K  0.29698847 -0.0989732  0.14422812
4    C    K  0.04108324  0.2014864 -0.15893221
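
If you want to guard against this kind of row duplication after an upgrade, one option is to assert the row count. The following is only a minimal sketch: it rebuilds the example data, uses the data.frame variant from alternative 1, and checks that there is exactly one output row per group. The set.seed() call and the names n_groups and res are my own additions for illustration and are not part of the original question.

```r
library(plyr)

set.seed(1)  # assumption: seed added only so this sketch is reproducible

# Same shape as the data in the question: 72 rows, 3 numeric columns,
# and two grouping columns
dd <- data.frame(
  matrix(rnorm(216), 72, 3),
  rep(c("A", "B", "C"), each = 24),
  rep(c("J", "K"), each = 36)
)
colnames(dd) <- c("v1", "v2", "v3", "dim1", "dim2")

# Number of distinct (dim1, dim2) combinations actually present in the data
n_groups <- nrow(unique(dd[, c("dim1", "dim2")]))

# Alternative 1: return a one-row data.frame per group instead of c(...)
res <- ddply(dd, c("dim1", "dim2"), function(df) {
  data.frame(m1 = mean(df$v1), m2 = mean(df$v2), m3 = mean(df$v3))
})

# One output row per group, however many summary columns are computed
stopifnot(nrow(res) == n_groups)
```

If the stopifnot() call fails, the summary has been replicated in the way described in the question.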