ddply returning too many results
For some reason I'm getting more results than I expected since upgrading to R 2.13.0 and plyr 1.5.1. The same code behaved as expected on an older version of plyr (I'm not sure which version, unfortunately, as I've just overwritten it).
library(plyr)
# 72 rows: three numeric columns (v1..v3) plus two grouping columns
dd <- data.frame(matrix(rnorm(216), 72, 3),
                 c(rep("A", 24), rep("B", 24), rep("C", 24)),
                 c(rep("J", 36), rep("K", 36)))
colnames(dd) <- c("v1", "v2", "v3", "dim1", "dim2")
# One, two, then three summary values per group, each returned via c()
results1 <- ddply(dd, c("dim1", "dim2"), function(df) c(m1 = mean(df$v1)))
results2 <- ddply(dd, c("dim1", "dim2"), function(df) {
  c(m1 = mean(df$v1), m2 = mean(df$v2))
})
results3 <- ddply(dd, c("dim1", "dim2"), function(df) {
  c(m1 = mean(df$v1), m2 = mean(df$v2), m3 = mean(df$v3))
})
I don't understand why results2 has twice as many rows as results1, and results3 three times as many; the rows of results1 are simply repeated two or three times.
I had a handy copy of R version 2.11.0 Patched (2010-05-01 r51907) with an old version of plyr, and there I got the results I was expecting:
> results1
dim1 dim2 m1
1 A J 0.07312783
2 B J -0.22428746
3 B K -0.44205832
4 C K 0.21421456
> results2
dim1 dim2 m1 m2
1 A J 0.07312783 -0.1130148
2 B J -0.22428746 0.4394832
3 B K -0.44205832 -0.1934018
4 C K 0.21421456 -0.0178809
> results3
dim1 dim2 m1 m2 m3
1 A J 0.07312783 -0.1130148 -0.03175873
2 B J -0.22428746 0.4394832 0.21581696
3 B K -0.44205832 -0.1934018 -0.28313530
4 C K 0.21421456 -0.0178809 -0.21948430
The results I get from R version 2.13.0 (2011-04-13):
> results1
dim1 dim2 m1
1 A J -0.2270726
2 B J 0.5860493
3 B K -0.5986129
4 C K 0.3135809
> results2
dim1 dim2 m1 m2
1 A J -0.2270726 -0.19037813
2 B J 0.5860493 -0.05385395
3 B K -0.5986129 0.29404095
4 C K 0.3135809 -0.26744010
5 A J -0.2270726 -0.19037813
6 B J 0.5860493 -0.05385395
7 B K -0.5986129 0.29404095
8 C K 0.3135809 -0.26744010
> results3
dim1 dim2 m1 m2 m3
1 A J -0.2270726 -0.19037813 -0.20448734
2 B J 0.5860493 -0.05385395 -0.11190857
3 B K -0.5986129 0.29404095 -0.27072101
4 C K 0.3135809 -0.26744010 -0.03184949
5 A J -0.2270726 -0.19037813 -0.20448734
6 B J 0.5860493 -0.05385395 -0.11190857
7 B K -0.5986129 0.29404095 -0.27072101
8 C K 0.3135809 -0.26744010 -0.03184949
9 A J -0.2270726 -0.19037813 -0.20448734
10 B J 0.5860493 -0.05385395 -0.11190857
11 B K -0.5986129 0.29404095 -0.27072101
12 C K 0.3135809 -0.26744010 -0.03184949
Why does results2 have 8 rows instead of 4, and results3 12 rows instead of 4?
Thanks, Sean
This will be fixed shortly in plyr 1.5.2.
It's the c() function inside your ddply() that's causing the problem.
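To make the difference concrete, here is a minimal base-R sketch of what the per-group function hands back in each style. The data frame `piece` is a made-up stand-in for one group (it is not from the question), and this only shows the shape of the return value; it does not reproduce the duplicated rows, which depend on the plyr version.

```r
# Hypothetical stand-in for one group's rows (not from the question).
piece <- data.frame(v1 = c(1, 2, 3), v2 = c(4, 5, 6))

# Style used in the question: c() builds a named numeric vector, no dimensions.
as_vector <- c(m1 = mean(piece$v1), m2 = mean(piece$v2))

# Alternative: data.frame() builds an explicit one-row, two-column result.
as_frame <- data.frame(m1 = mean(piece$v1), m2 = mean(piece$v2))

is.null(dim(as_vector))  # TRUE: a plain vector has no row/column structure
dim(as_frame)            # 1 2: one row, two columns
```

With the data frame form, ddply does not have to guess how a vector of values maps onto rows and columns.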
Here are three alternative ways that you can write your statement for results3, progressively getting simpler:
Use data.frame inside your function:
ddply(dd, c("dim1","dim2"), function(df) {data.frame(m1=mean(df$v1), m2=mean(df$v2), m3=mean(df$v3)) } )
Use summarise:
ddply(dd, .(dim1, dim2), summarise, m1=mean(v1), m2=mean(v2), m3=mean(v3))
Use numcolwise:
ddply(dd, .(dim1, dim2), numcolwise(mean))
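In each case the results are what you would expect, one row per group (with numcolwise the summary columns keep their original names v1, v2, v3 rather than m1, m2, m3):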
dim1 dim2 m1 m2 m3
1 A J -0.04272659 -0.1468376 0.17902942
2 B J -0.10133503 -0.1427358 -0.05241214
3 B K 0.29698847 -0.0989732 0.14422812
4 C K 0.04108324 0.2014864 -0.15893221
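If you want to check this on your own installation, something along these lines should do. It is only a sketch: the `set.seed()` call, the `rep(..., each = )` shorthand for building the grouping columns, and the names `by_df`, `by_sum`, `by_num` and `n_groups` are my additions, not part of the question.

```r
library(plyr)

set.seed(1)  # assumption: fixed seed so repeated runs give the same numbers
dd <- data.frame(matrix(rnorm(216), 72, 3),
                 rep(c("A", "B", "C"), each = 24),
                 rep(c("J", "K"), each = 36))
colnames(dd) <- c("v1", "v2", "v3", "dim1", "dim2")

# The three alternatives from above
by_df <- ddply(dd, c("dim1", "dim2"), function(df) {
  data.frame(m1 = mean(df$v1), m2 = mean(df$v2), m3 = mean(df$v3))
})
by_sum <- ddply(dd, .(dim1, dim2), summarise,
                m1 = mean(v1), m2 = mean(v2), m3 = mean(v3))
by_num <- ddply(dd, .(dim1, dim2), numcolwise(mean))

# Number of distinct dim1/dim2 combinations actually present in the data
n_groups <- nrow(unique(dd[, c("dim1", "dim2")]))

# Each result should have exactly one row per group
sapply(list(by_df = by_df, by_sum = by_sum, by_num = by_num), nrow) == n_groups
```

Comparing against `n_groups` rather than a hard-coded 4 keeps the check valid if you change the grouping columns.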