How does ddply handle factors as "split" variables?

I have a data.frame with 20 columns. The first two are factors, and the rest are numeric. I'd like to use the first two columns as split variables and then apply the mean() to the remaining columns.

This seems like a quick and easy job for ddply(); however, the output data.frame is not what I am looking for. Here is a minimal example with just one column of data:

library(plyr)

Aa <- rep(c("A", "a"), each = 20)
Bb <- rep(c("B", "b", "B", "b"), each = 10)
x <- runif(40)
df1 <- data.frame(Aa, Bb, x)

ddply(df1, .(Aa, Bb), mean)

The output is:

  Aa Bb         x
1 NA NA 0.5193275
2 NA NA 0.4491907
3 NA NA 0.4848128
4 NA NA 0.4717899
Warning messages:
1: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA

The warning is repeated 8 times, presumably once for each call to mean(). I'm guessing this comes from trying to take the mean of a factor. I could write this as:
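
To see where the NAs come from, here is a minimal base-R sketch of what happens to one piece of the split. It assumes that ddply hands each piece to the function as a whole data.frame (split columns included), and that taking mean() column by column approximates what the R version in the question did for mean() on a data.frame. The seed and the names pieces, piece and col_means are illustrative only.

```r
# Rebuild the example data (seed added only for reproducibility).
set.seed(42)
df1 <- data.frame(
  Aa = rep(c("A", "a"), each = 20),
  Bb = rep(c("B", "b", "B", "b"), each = 10),
  x  = runif(40)
)

# Split by both grouping columns, roughly as ddply does before applying the function.
pieces <- split(df1, list(df1$Aa, df1$Bb))

# Each piece is a full data.frame that still contains Aa and Bb.
piece <- pieces[[1]]
str(piece)

# Taking the mean of every column gives NA for the non-numeric ones;
# the warnings are suppressed here, but they are the ones shown above.
col_means <- sapply(piece, function(col) suppressWarnings(mean(col)))
print(col_means)
```

Only x has a usable mean, so the function passed to ddply has to either pick out the numeric columns or be told which columns to summarise.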

ddply(df1, .(Aa, Bb), function(df1) mean(df1$x))

or

ddply(df1, .(Aa, Bb), summarize, x = mean(x))

both of which do work (not giving NAs), but I would rather avoid writing out 18 such x = mean(x) statements, one for each of my numeric columns.

Is there a general solution? I'm not wedded to ddply if there is a better answer elsewhere.


Since you are reducing the number of rows, you need to use summarise:

> ddply(df1, .(Aa, Bb), summarise, mean_x = mean(x))
  Aa Bb    mean_x
1  a  b 0.3790675
2  a  B 0.4242922
3  A  b 0.5622329
4  A  B 0.4574471

It's just as easy to use aggregate in this instance. Let's say you had two numeric variables, x and y:

> aggregate(df1[-(1:2)], df1[1:2], mean)
  Aa Bb         x         y
1  a  b 0.4249121 0.4639192
2  A  b 0.6127175 0.4639192
3  a  B 0.4522292 0.4826715
4  A  B 0.5201965 0.4826715
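
The same idea can be written with aggregate's formula interface, where the dot stands for every column not named on the right-hand side. This is a minimal sketch, assuming a second numeric column y (hypothetical, not part of the question's data) and a fixed seed; df2 and res are illustrative names.

```r
set.seed(42)
df2 <- data.frame(
  Aa = rep(c("A", "a"), each = 20),
  Bb = rep(c("B", "b", "B", "b"), each = 10),
  x  = runif(40),
  y  = runif(40)  # hypothetical second numeric column
)

# "." means "all other columns", so x and y are both averaged per group.
res <- aggregate(. ~ Aa + Bb, data = df2, FUN = mean)
print(res)
```

One difference to keep in mind: the formula method drops rows containing NA by default (na.action = na.omit), whereas the indexing form above passes them through to mean().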


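Negative indexing works inside ddply as well. Each piece is a data.frame, so drop the two split columns and take the column means (mean() on a data.frame returns NA in current R versions, hence colMeans()):

ddply(df1, .(Aa, Bb), function(d) colMeans(d[-(1:2)]))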
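
plyr also provides a helper, numcolwise(), which wraps a function so that it is applied only to the numeric columns of each piece. That avoids hard-coding column positions. A minimal sketch, assuming plyr is installed and again using a hypothetical second column y; the requireNamespace() guard is only there so the snippet does nothing when plyr is missing.

```r
if (requireNamespace("plyr", quietly = TRUE)) {
  library(plyr)

  set.seed(42)
  df2 <- data.frame(
    Aa = rep(c("A", "a"), each = 20),
    Bb = rep(c("B", "b", "B", "b"), each = 10),
    x  = runif(40),
    y  = runif(40)  # hypothetical second numeric column
  )

  # numcolwise(mean) builds a function that averages every numeric column
  # of the piece it receives and ignores the non-numeric ones.
  res <- ddply(df2, .(Aa, Bb), numcolwise(mean))
  print(res)
}
```

With 18 numeric columns the call stays the same; only the data changes.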