How can I identify and summarize sets of data from matching groups in a dataframe?
Here is an example dataframe:
set.seed(0)
x1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
x2 <- c(1, 1, 0, 0, 0, 1, 1, 1, 1)
x3 <- c(1, 1, 2, 2, 4, 1, 1, 2, 1)
n <- c(1, 1, 1, 5, 5, 1, 1, 1, 1)
y <- rnorm(9)
mydf <- data.frame(x1, x2, x3, n, y)
What I would like to do is
- identify rows with n=1 and which share identical values of (x1, x2, x3)
- return a single row for each subset with y = mean(y) and n = length(y)
- keep other 开发者_开发问答rows the same.
for example, the new dataframe would be
x1 <- c(1, 1, 1, 1, 2, 2)
x2 <- c(1, 0, 0, 0, 1, 1)
x3 <- c(1, 2, 2, 4, 1, 2)
n <- c(2, 1, 5, 5, 3, 1)
y <- c(mean(y[1:2]), y[3], y[4], y[5], mean(y[c(6:7,9)]), y[8])
newdf <- data.frame(x1, x2, x3, n, y)
I can figure this out with conditionals and loops, but I would prefer to learn more elegant way to do this.
By "identical values in other columns", I take it you mean that each subset is defined by the same value of x1
in each of the rows of the subset, not that x1
is equal to x2
. Thanks for the example to see what you meant.
library("plyr")
To get parts one and two
ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, n = length(y), y = mean(y))
This can be rbind
-ed with the part of mydf
where n!=1
to get what you said
rbind(
ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, n = length(y), y = mean(y)),
mydf[mydf$n!=1,]
)
This doesn't have the same order as you listed. If that is really important, you can add some auxiliary sorting variables.
mydf$order = seq(length=nrow(mydf))
newdf <- rbind(
ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise,
n = length(y), y = mean(y), order=min(order)),
mydf[mydf$n!=1,]
)
newdf <- newdf[order(newdf$order),]
newdf$order <- NULL
精彩评论