Substitute values with their mean in a data frame in R
I need to replace the values of the two replicates (A and B) in a data frame with their mean.
This is the data frame:
Sample.Name <- c("sample01","sample01","sample02","sample02","sample03","sample03")
Rep <- c("A", "B", "A", "B", "A", "B")
Rep <- as.factor(Rep)
joy <- sample(1000:50000000, size=120, replace=TRUE)
values <- matrix(joy, nrow=6, ncol=20)
df.data <- cbind.data.frame(Sample.Name, Rep, values)
names(df.data)[-c(1:2)] <- paste("V", 1:20, sep="")
And this is the loop I tried to write to replace the replicates with their mean:
Sample <- as.factor(Sample.Name)
livelli <- levels(Sample)
for (i in (1:(length(livelli)))){
estrai.replica <- which(df.data == livelli[i])
media.replica <- apply(values[estrai.replica,], 2, mean)
foo <- rbind(media.replica)
}
The main problems are:
- this way I only get the last row in my new data frame (foo), and
- I don't have the sample name in any column.
Do you have any suggestions?
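For reference, here is a minimal sketch of how the loop itself could be repaired, assuming the goal is one row of column means per sample. It compares against the Sample.Name column only (not the whole data frame) and collects the results in a list instead of overwriting foo on every pass. The reduced data frame df.small and the list medie are hypothetical names used only for this illustration.

```r
# Small deterministic stand-in for df.data (two value columns instead of 20)
df.small <- data.frame(
  Sample.Name = c("sample01", "sample01", "sample02", "sample02"),
  Rep = factor(c("A", "B", "A", "B")),
  V1 = c(10, 20, 30, 50),
  V2 = c(1, 3, 5, 7)
)

livelli <- unique(as.character(df.small$Sample.Name))
medie <- vector("list", length(livelli))  # one slot per sample

for (i in seq_along(livelli)) {
  # compare against the Sample.Name column only
  estrai.replica <- which(df.small$Sample.Name == livelli[i])
  # column means of the value columns for this sample
  medie[[i]] <- colMeans(df.small[estrai.replica, c("V1", "V2")])
}

# bind all rows at once and put the sample name back in the first column
foo <- data.frame(Sample.Name = livelli, do.call(rbind, medie))
print(foo)
```

Storing each result in medie[[i]] fixes the "only the last row" problem, and building foo from livelli restores the sample names.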
I think you want to aggregate your data frame. Try this:
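# Aggregate only the numeric columns; taking the mean of Sample.Name and Rep would give NA with warnings
aggregate(df.data[, -(1:2)], by=list(Sample.Name=df.data$Sample.Name), FUN=mean)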
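The same grouping can also be written with the formula interface of aggregate. The sketch below uses a small, hypothetical data frame (df.small) so the result is easy to check by hand; the Rep column is dropped first because it is not numeric.

```r
# Small deterministic stand-in for df.data (two value columns instead of 20)
df.small <- data.frame(
  Sample.Name = c("sample01", "sample01", "sample02", "sample02"),
  Rep = factor(c("A", "B", "A", "B")),
  V1 = c(10, 20, 30, 50),
  V2 = c(1, 3, 5, 7)
)

# "." on the left-hand side stands for every column not named on the right
res <- aggregate(. ~ Sample.Name,
                 data = df.small[, names(df.small) != "Rep"],
                 FUN = mean)
print(res)
```

With this form the grouping column keeps its own name (Sample.Name) in the result.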
Out of curiosity I tried a tapply-based solution.
# Not correct: lapply(df.data[-(1:3)], tapply, INDEX=df.data$Sample.Name, FUN=mean)
It just needed as.data.frame to "clean it up".
# Not correct: as.data.frame(lapply(df.data[-(1:3)], tapply, INDEX=df.data$Sample.Name, FUN=mean))
EDIT: Like @daroczig, I got an error complaining that the trim argument to mean.default is not of length 1. The likely cause is that the named FUN=mean is matched to lapply's own FUN argument, so tapply ends up being passed on to mean as trim. I tried adding further arguments for mean, but only when I also changed to the two-argument version of "[" did I succeed in satisfying the interpreter, and even then the function was not applied to the right groups. This version does work:
as.data.frame(lapply(df.data[, 3:22],
function(x) tapply(x, df.data$Sample.Name, FUN=mean)) )
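With this approach the sample names end up as row names of the result rather than in a column. A small sketch of moving them back into a column, again on a hypothetical reduced data frame (df.small):

```r
# Small deterministic stand-in for df.data (two value columns instead of 20)
df.small <- data.frame(
  Sample.Name = c("sample01", "sample01", "sample02", "sample02"),
  Rep = factor(c("A", "B", "A", "B")),
  V1 = c(10, 20, 30, 50),
  V2 = c(1, 3, 5, 7)
)

# per-column group means; the anonymous function keeps FUN=mean away from lapply
res <- as.data.frame(lapply(df.small[, c("V1", "V2")],
                            function(x) tapply(x, df.small$Sample.Name, FUN = mean)))

# the group labels are carried as row names; copy them into a regular column
res <- data.frame(Sample.Name = rownames(res), res, row.names = NULL)
print(res)
```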
A data.table solution for time and memory efficiency:
library(data.table)
DT <- as.data.table(df.data)
DT[,lapply(.SD, mean),by = Sample.Name, .SDcols = paste0('V',1:20)]
Note that .SD is the subset for each group and .SDcols defines the columns in .SD to evaluate lapply upon.
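To make the role of .SD and .SDcols concrete, here is a minimal sketch on a small, hypothetical table (DT.small) whose means can be checked by hand. It assumes the data.table package is installed; the vector cols is just an illustrative name.

```r
library(data.table)

# Small deterministic stand-in for DT (two value columns instead of 20)
DT.small <- data.table(
  Sample.Name = c("sample01", "sample01", "sample02", "sample02"),
  Rep = c("A", "B", "A", "B"),
  V1 = c(10, 20, 30, 50),
  V2 = c(1, 3, 5, 7)
)

cols <- c("V1", "V2")  # only these columns appear in .SD

# for each Sample.Name, .SD holds that group's rows of V1 and V2;
# lapply(.SD, mean) then returns one mean per column
res <- DT.small[, lapply(.SD, mean), by = Sample.Name, .SDcols = cols]
print(res)
```

Because Rep is not listed in .SDcols, it never reaches mean, so no non-numeric column has to be dropped beforehand.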