Writing a function to analyze a subset within a dataframe
I am trying to write a function to aggregate or subset a data frame by a particular column, and then compute the proportion of values in another column that fall within certain ranges.
Specifically, the relevant parts of my data frame, allmutations, look like this:
gennumber sel
1 -0.00351647088810292
1 0.000728499401888683
1 0.0354633950503043
1 0.000209700229276244
2 6.42307549736376e-05
2 -0.0497259605114181
2 -0.000371856995145525
Within each generation (gennumber), I would like to count the proportion of values in “sel” that are greater than 0.001, between -0.001 and 0.001, and less than -0.001. Over the entire data set, I've just been doing this:
ben <- allmutations$sel > 0.001 #this is for all generations
bencount <- length(which(ben==TRUE))
totalmu <- length(ben) # length(ben) = total number of mutants
tot.pben <- bencount/totalmu #proportion
What is the best way to do that operation for each value in gennumber? Also, is there an easy way to get proportion of values in the range -0.001 < sel < 0.001? I couldn't figure out how to do it, so I “cheated” and took an absolute value of the column and just looked for values less than 0.001. I can't help but feel there must be a better way though.
Thanks for any help you can give, and please let me know if I can provide any clarification.
dput() of data:
structure(list(gennumber = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), sel = c(-0.00351647088810292,
0.000728499401888683, 0.0354633950503043, 0.000209700229276244,
6.42307549736376e-05, -0.0497259605114181, -0.000371856995145525
)), .Names = c("gennumber", "sel"), class = "data.frame", row.names = c(NA,
-7L))
For the first part, assuming your data are in dat, we first split the data by gennumber:
sdat <- with(dat, split(dat, gennumber))
then we write a custom function to do the comparison you want:
foo <- function(x, cutoff = 0.001) {
    # proportion of values in the second column (sel) above the cutoff
    sum(x[,2] > cutoff) / length(x[,2])
}
and sapply() it over the individual chunks of data in sdat:
sapply(sdat, foo)
Which gives:
> sapply(sdat, foo)
1 2
0.25 0.00
for this sample of data.
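As a possible shortcut, the mean of a logical vector is the proportion of TRUE values, so the split step can be skipped by grouping with tapply(). This is only a sketch on the sample data from the question; prop_ben is a name made up for illustration.

```r
# Sample data from the question's dput() output
dat <- data.frame(
  gennumber = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  sel = c(-0.00351647088810292, 0.000728499401888683, 0.0354633950503043,
          0.000209700229276244, 6.42307549736376e-05, -0.0497259605114181,
          -0.000371856995145525)
)

# mean() of a logical vector gives the proportion of TRUE values,
# and tapply() applies it within each level of gennumber
prop_ben <- with(dat, tapply(sel > 0.001, gennumber, mean))
prop_ben
```

For the sample data this agrees with sapply(sdat, foo): 0.25 for generation 1 and 0 for generation 2.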
For the second part, we can extend the above function foo() to accept an upper and lower limit and do the computation:
bar <- function(x, upr, lwr) {
    # proportion of values in the second column (sel) strictly between lwr and upr
    sum(lwr < x[,2] & x[,2] < upr) / length(x[,2])
}
Which gives (note how the extra arguments are passed in through sapply()):
> sapply(sdat, bar, lwr = -0.001, upr = 0.001)
1 2
0.5000000 0.6666667
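If all three proportions are wanted in one go, the two helpers could be folded into a single function. The following is only a sketch: sel_props is a hypothetical name, the below/within/above labels are invented for illustration, and it takes the sel vector directly rather than a data frame chunk.

```r
# Sample data from the question's dput() output
dat <- data.frame(
  gennumber = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  sel = c(-0.00351647088810292, 0.000728499401888683, 0.0354633950503043,
          0.000209700229276244, 6.42307549736376e-05, -0.0497259605114181,
          -0.000371856995145525)
)

# Hypothetical helper: proportions below, within and above +/- cutoff
sel_props <- function(x, cutoff = 0.001) {
  c(below  = mean(x < -cutoff),
    within = mean(x > -cutoff & x < cutoff),
    above  = mean(x > cutoff))
}

# One row per generation, one column per category
res <- t(sapply(split(dat$sel, dat$gennumber), sel_props))
res
```

Splitting only the sel vector avoids relying on column positions such as x[,2], which would silently break if the column order of the data frame changed.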
You can combine two logical tests with &, so to test -0.001 < sel < 0.001 you can write sel > -0.001 & sel < 0.001
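A small sketch of this on the sample data follows. It also checks that the absolute-value approach mentioned in the question selects the same rows, and shows one alternative, cut(), for binning all three ranges at once. The class labels are assumptions chosen for illustration, and note that cut() with its default right = TRUE puts a value of exactly 0.001 in the middle bin and exactly -0.001 in the lowest bin, which differs slightly from the strict inequalities.

```r
# Sample data from the question's dput() output
dat <- data.frame(
  gennumber = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  sel = c(-0.00351647088810292, 0.000728499401888683, 0.0354633950503043,
          0.000209700229276244, 6.42307549736376e-05, -0.0497259605114181,
          -0.000371856995145525)
)

# Two logical tests combined element-wise with &
in_range <- dat$sel > -0.001 & dat$sel < 0.001

# The abs() approach from the question gives the same logical vector
same_as_abs <- identical(in_range, abs(dat$sel) < 0.001)

# Alternative: bin every value once, then tabulate per generation
# (labels are illustrative, not from the original post)
cls <- cut(dat$sel, breaks = c(-Inf, -0.001, 0.001, Inf),
           labels = c("deleterious", "neutral", "beneficial"))
props <- prop.table(table(dat$gennumber, cls), margin = 1)
props
```

With margin = 1 each row of props sums to 1, so every row holds the three proportions for one generation.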
Here is a way using plyr:
dat <- read.table(tc <- textConnection("
gennumber sel
1 -0.00351647088810292
1 0.000728499401888683
1 0.0354633950503043
1 0.000209700229276244
2 6.42307549736376e-05
2 -0.0497259605114181
2 -0.000371856995145525"), header = TRUE); close(tc)
library("plyr")
ddply(dat,.(gennumber),summarize,
`sel < -0.001` = sum(sel < -0.001)/length(sel),
`-0.001 < sel < 0.001` = sum(sel > -0.001 & sel < 0.001)/length(sel),
`0.001 < sel` = sum(sel > 0.001)/length(sel))