In R, how to use "aggregate" or "by" when not all combinations of factors are present?
Here is a small example to illustrate my data:
> df <- data.frame(subgroup=rep(paste("s",1:3, sep=""), times=3),
feature=c(rep("a",6), rep("b",3)),
var=rep(1:3, each=3),
data=c(rnorm(3,1), rnorm(3,2), rnorm(3,0)))
> df
subgroup feature var data
1 s1 a 1 1.53152620
2 s2 a 1 1.25476445
3 s3 a 1 1.04221040
4 s1 a 2 1.68913400
5 s2 a 2 1.48290273
6 s3 a 2 1.62871854
7 s1 b 3 0.05278296
8 s2 b 3 -0.66623654
9 s3 b 3 -1.40006454
I want to examine the sum of the "data" column for each feature-var combination that is present in my dataset. More precisely, I want to obtain TRUE when the sum is greater than 3, and FALSE otherwise:
> result
feature var res
1 a 1 TRUE
2 a 2 TRUE
3 b 3 FALSE
I tried using "aggregate" or "by", but can't make them fit my need. Any idea? Thanks in advance.
One approach is to use plyr's function ddply to group on feature and var. You can use the summarize function to create a new data.frame with a column that corresponds to the rule you developed.
library(plyr)
# sum(data) > 3 is already logical, so no ifelse() wrapper is needed
ddply(df, c("feature", "var"), summarize, res = sum(data) > 3)
Results in:
feature var res
1 a 1 TRUE
2 a 2 TRUE
3 b 3 FALSE
Another alternative is to use data.table, which is supposed to provide some performance benefits:
library(data.table)
dt <- data.table(df)
# the unnamed result column is called V1 by default
dt[, sum(data) > 3, by = c("feature", "var")]
feature var V1
[1,] a 1 TRUE
[2,] a 2 TRUE
[3,] b 3 FALSE
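Since the question asked about "aggregate" specifically, here is a minimal base R sketch of the same idea. It assumes the formula interface of aggregate, which only returns the feature/var combinations that actually occur in the data. The numbers in the data column are fixed values made up for illustration (the original used rnorm, so its values are not reproducible), and the name result is simply taken from the question.

```r
# Example data with fixed, made-up values in place of the rnorm() draws
df <- data.frame(
  subgroup = rep(paste0("s", 1:3), times = 3),
  feature  = c(rep("a", 6), rep("b", 3)),
  var      = rep(1:3, each = 3),
  data     = c(1.5, 1.3, 1.0, 1.7, 1.5, 1.6, 0.1, -0.7, -1.4)
)

# Sum `data` within each feature/var combination that occurs in df
result <- aggregate(data ~ feature + var, data = df, FUN = sum)

# Convert the sums into the TRUE/FALSE flag and drop the intermediate column
result$res <- result$data > 3
result$data <- NULL

result
```

With these illustrative values the group sums are 3.8, 4.8 and -2.0, so res comes out as TRUE, TRUE, FALSE, one row per combination present in df.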