In R, how to use "aggregate" or "by" when not all combinations of factors are present?

Here is a small example to illustrate my data:

> df <- data.frame(subgroup=rep(paste("s",1:3, sep=""), times=3),
                   feature=c(rep("a",6), rep("b",3)),
                   var=rep(1:3, each=3),
                   data=c(rnorm(3,1), rnorm(3,2), rnorm(3,0)))
> df
  subgroup feature var        data
1       s1       a   1  1.53152620
2       s2       a   1  1.25476445
3       s3       a   1  1.04221040
4       s1       a   2  1.68913400
5       s2       a   2  1.48290273
6       s3       a   2  1.62871854
7       s1       b   3  0.05278296
8       s2       b   3 -0.66623654
9       s3       b   3 -1.40006454

I want the sum of the "data" column for each feature-var combination that is present in my dataset. More precisely, I want TRUE when the sum is greater than 3, and FALSE otherwise:

> result
  feature var   res
1       a   1  TRUE
2       a   2  TRUE
3       b   3 FALSE

I tried using "aggregate" and "by", but couldn't get either of them to produce this. Any ideas? Thanks in advance.


One approach is plyr's ddply, grouping on feature and var. By default ddply only visits the combinations that actually occur in the data, and summarize builds a new data.frame with one row per group holding the result of your rule. The comparison sum(data) > 3 already returns TRUE or FALSE, so no ifelse is needed.

library(plyr)
ddply(df, c("feature", "var"), summarize, res = sum(data) > 3)

Results in:

  feature var   res
1       a   1  TRUE
2       a   2  TRUE
3       b   3 FALSE

Another alternative is data.table, which is usually faster on large data sets. Naming the result inside list() gives the column the name res instead of the default V1:

library(data.table)
dt <- data.table(df)

dt[, list(res = sum(data) > 3), by = c("feature", "var")]

     feature var   res
[1,]       a   1  TRUE
[2,]       a   2  TRUE
[3,]       b   3 FALSE
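
Since the question asks about aggregate specifically, here is a minimal base R sketch of the same rule. The formula interface of aggregate only returns the feature/var combinations that occur in the data. The data values are copied from the data frame printed in the question (a fresh rnorm draw would give different numbers), and the rename of the result column to res is just my choice to match the desired output.

```r
# Rebuild the example with the values printed in the question,
# so the result does not depend on a random draw.
df <- data.frame(
  subgroup = rep(paste("s", 1:3, sep = ""), times = 3),
  feature  = c(rep("a", 6), rep("b", 3)),
  var      = rep(1:3, each = 3),
  data     = c(1.53152620, 1.25476445, 1.04221040,
               1.68913400, 1.48290273, 1.62871854,
               0.05278296, -0.66623654, -1.40006454)
)

# Sum "data" within each feature/var pair present in df and compare with 3.
result <- aggregate(data ~ feature + var, data = df,
                    FUN = function(x) sum(x) > 3)

# aggregate keeps the name of the summarised column; rename it to "res".
names(result)[names(result) == "data"] <- "res"
result
```

With these values the groups a/1 and a/2 sum to more than 3 and b/3 does not, which matches the results from the two approaches above.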