Subsetting in R, joining and calculating multiple repetitions

Here is a sample:

> tmp
    label   value1  value2
1   aa_x_x  xx      xx
2   bc_x_x  xx      xx
3   aa_x_x  xx      xx
4   bc_x_x  xx      xx

How can I calculate the median for all repeated labels (or, more generally, some other statistic on the corresponding values in the other data frame columns), taking into account only the first two letters of the label (i.e. "aa_1_1" and "aa_s_3" count as the same label)? The list of labels is finite and available.

I have read about aggregate, %in%, subset and substr, but I have not managed to combine them into anything simple and useful.

Here is what I hope to get:

> tmp.result
    label   median1 some.calculation2
1   aa      xx      xx
2   bc      xx      xx
3   aa      xx      xx
4   bc      xx      xx

Thank you very much.


Have you tried making a new data frame (I'll call it tmp2) where tmp2$label <- substr(tmp$label, 1, 2)? From there you can, for example, use tapply(tmp2$value1, tmp2$label, mean) to get the mean of value1 for each two-letter label; replace mean with median to get the medians you asked for.
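A minimal base R sketch of this suggestion follows. The numbers in tmp are invented for illustration, because the question only shows "xx" placeholders; med1 and res are hypothetical names, and aggregate() is added here as one possible way to handle several columns at once.

```r
# Hypothetical sample data: the question shows only "xx" placeholders,
# so these numbers are invented for illustration.
tmp <- data.frame(
  label  = c("aa_1_1", "bc_2_1", "aa_s_3", "bc_x_x"),
  value1 = c(1, 10, 3, 20),
  value2 = c(2, 4, 6, 8),
  stringsAsFactors = FALSE
)

# Copy the data and keep only the first two characters of each label
tmp2 <- tmp
tmp2$label <- substr(tmp2$label, 1, 2)

# One statistic per label prefix; the result is named by prefix
med1 <- tapply(tmp2$value1, tmp2$label, median)

# Several columns at once: aggregate() returns one row per prefix
res <- aggregate(cbind(value1, value2) ~ label, data = tmp2, FUN = median)
```

Both calls collapse the data to one row (or element) per prefix. If a different function is needed for each column, calling tapply() once per column is probably the simplest base R route.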

An option using dplyr

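library(dplyr)
tmp %>%
   # Group by the label with everything from the first underscore removed
   group_by(label = sub('_.*$', '', label)) %>%
   # transmute() keeps one row per original row plus the grouping column
   transmute(median1 = median(value1), mean1 = mean(value2))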

Or data.table

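 library(data.table)
 # Shorten the labels in place, add the per-group columns by reference,
 # then keep only the label and the two new columns
 setDT(tmp)[, label := sub('_.*$', '', label)
    ][, c('median1', 'mean1') := list(median(value1), mean(value2)), by = label
    ][, .(label, median1, mean1)]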
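For comparison, a result with one row per original row, as in the tmp.result layout shown in the question, can also be sketched in base R with ave(). The data values are again invented, and key is a hypothetical helper name; this is one possible approach, not something proposed in the answers above.

```r
# Hypothetical sample data (invented values; the question shows only "xx")
tmp <- data.frame(
  label  = c("aa_1_1", "bc_2_1", "aa_s_3", "bc_x_x"),
  value1 = c(1, 10, 3, 20),
  value2 = c(2, 4, 6, 8),
  stringsAsFactors = FALSE
)

# Grouping key: the label with everything from the first underscore removed
key <- sub("_.*$", "", tmp$label)

# ave() returns a vector as long as its input, with each element replaced
# by the group statistic, so the original row count is preserved
tmp.result <- data.frame(
  label   = key,
  median1 = ave(tmp$value1, key, FUN = median),
  mean1   = ave(tmp$value2, key, FUN = mean),
  stringsAsFactors = FALSE
)
```

Because ave() keeps the input length, each group's statistic is repeated on every row belonging to that group.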