R equivalent of GROUP BY WITH CUBE
Some SQL databases support a WITH CUBE modifier to GROUP BY operations. Mine doesn't have this feature.
Basically if I have a dataset like:
+------+-----------+---------+---------+
| sum | source_id | type_id | variety |
+------+-----------+---------+---------+
| 491 | 1 | 1 | 1 |
| 2008 | 1 | 2 | 1 |
| 33 | 1 | 3 | 1 |
| 483 | 1 | 4 | 1 |
| 482 | 1 | 5 | 1 |
| 343 | 1 | 6 | 1 |
| 4979 | 4 | 5 | 1 |
| 303 | 5 | 1 | 1 |
| 443 | 5 | 1 | 2 |
| 1295 | 5 | 2 | 1 |
...
I want to import this into a data frame in R and generate the combined sum for all sub-combinations of (source_id, type_id, and variety). So, the combined sum where source_id=1, where source_id=1 and type_id=1, where source_id=1 and variety=1, where type_id=1 and variety=1, where type_id=1, where source_id=2, and so on.
How can I best accomplish this?
You can use ddply for this, calling it once for each possible combination of grouping columns, like this:
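library(plyr)  # provides ddply() and rbind.fill(); reshape is not needed
facs <- c("source_id", "type_id", "variety")
# every non-empty combination of the grouping columns (7 in total)
combs <- unlist(
  lapply(1:3, function(j) combn(facs, j, simplify = FALSE)),
  recursive = FALSE)
# one summary data frame per combination
datlist <- lapply(combs, function(j) ddply(Data, j, summarize, Sum = sum(Sum)))
# stack them; grouping columns absent from a summary are filled with NA
rbind.fill(datlist)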
Tested with:
Data <- data.frame(
Sum=rpois(10,5),
source_id=rep(1:2,each=5),
type_id=rep(1:5,each=2),
variety=rep(1:2,5)
)
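For anyone without plyr installed, here is a rough base R sketch of the same idea. It swaps ddply for aggregate() and rbind.fill for a plain rbind, so it is not the code from the answer above, just the same technique under those assumptions. The data values and the helper name cube_one are made up for illustration.

```r
# Small deterministic data set (hypothetical values, for illustration only)
Data <- data.frame(
  Sum       = c(1, 2, 3, 4, 5, 6, 7, 8),
  source_id = rep(1:2, each = 4),
  type_id   = rep(1:2, times = 4),
  variety   = rep(c(1, 1, 2, 2), times = 2)
)

facs <- c("source_id", "type_id", "variety")

# All non-empty subsets of the grouping columns (2^3 - 1 = 7 of them)
combs <- unlist(
  lapply(seq_along(facs), function(m) combn(facs, m, simplify = FALSE)),
  recursive = FALSE)

# Sum within one grouping; columns that were not grouped on become NA
# (cube_one is a hypothetical helper name)
cube_one <- function(vars) {
  out <- aggregate(Data["Sum"], by = Data[vars], FUN = sum)
  for (v in setdiff(facs, vars)) out[[v]] <- NA
  out[c(facs, "Sum")]
}

# Stack the seven summaries into one data frame
cube <- do.call(rbind, lapply(combs, cube_one))
```

An NA in a grouping column means "summed over all values of this column", which is how WITH CUBE reports its subtotal rows as well. Note that SQL's CUBE also emits a grand-total row; sum(Data$Sum) would supply that if it is needed.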
This should do it:
# generate dummy data
df = data.frame(
Sum = rnorm(10),
source_id = sample(10, 5, replace = T),
type_id = sample(10, 5, replace = T),
variety = sample(10, 5, replace = T)
)
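# grouping columns (everything except Sum)
index = names(df)[-1]
# each row is a 0/1 mask over the grouping columns; drop the all-zero row
temp = expand.grid(0:1, 0:1, 0:1)[-1,]
require(plyr)
# for each mask, sum within the selected columns, then stack the results
cubedf = adply(temp, 1, function(x)
  ddply(df, index[x == 1], summarize, SUM = sum(Sum)))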
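The expand.grid trick above may be easier to follow when the mask is turned back into column names. A minimal base R sketch, assuming the same three grouping columns as in the question (the names mask and groupings are mine):

```r
index <- c("source_id", "type_id", "variety")

# Every 0/1 pattern over the three columns; the first row is (0, 0, 0),
# which selects no column at all, so it is dropped
mask <- expand.grid(0:1, 0:1, 0:1)[-1, ]

# Translate each row of the mask into the column names it selects
groupings <- lapply(seq_len(nrow(mask)), function(i) {
  index[unlist(mask[i, ]) == 1]
})
```

Each element of groupings is one set of columns that ddply gets called with, so the seven rows of the mask correspond to the seven groupings of the cube.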
EDIT: ALTERNATE SOLUTION (using code borrowed from Joris)
library(plyr)
# list factor variables
index = names(df)[-1]
# generate all combinations of factor variables
combs = unlist(llply(1:3, combn, x = index, simplify = F), recursive = F)
# calculate sum for each combination
cubedf = ldply(combs, function(var)
ddply(df, var, summarize, SUM = sum(Sum)))
Joris's answer is right, but I must admit that it wasn't intuitive to me at first blush. Prior to reading his answer, I would have solved this with multiple ddply() steps. Something like this:
Data <- data.frame(
Sum=rpois(10,5),
source_id=rep(1:2,each=5),
type_id=rep(1:5,each=2),
variety=rep(1:2,5)
)
require(plyr)
myStuff1 <- ddply(Data, c("source_id" ), function(df) sum(df$Sum) )
myStuff2 <- ddply(Data, c("source_id", "type_id" ), function(df) sum(df$Sum) )
myStuff3 <- ddply(Data, c("source_id", "type_id", "variety"), function(df) sum(df$Sum) )
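Note that the three calls above cover only three of the seven groupings. Spelled out one step at a time, the remaining four might look like the sketch below. It uses base R's aggregate() with a formula rather than ddply, and a small made-up data set so the numbers are reproducible; the variable names are only for illustration.

```r
# Hypothetical deterministic data, same column layout as above
Data <- data.frame(
  Sum       = 1:8,
  source_id = rep(1:2, each = 4),
  type_id   = rep(1:2, times = 4),
  variety   = rep(c(1, 1, 2, 2), times = 2)
)

# The groupings not covered by the three ddply() calls
by_type           <- aggregate(Sum ~ type_id,             data = Data, FUN = sum)
by_variety        <- aggregate(Sum ~ variety,             data = Data, FUN = sum)
by_source_variety <- aggregate(Sum ~ source_id + variety, data = Data, FUN = sum)
by_type_variety   <- aggregate(Sum ~ type_id + variety,   data = Data, FUN = sum)

# WITH CUBE would also report the overall total
grand_total <- sum(Data$Sum)
```

Writing each grouping by hand is fine for three columns, but the number of calls doubles with every extra column, which is why the other answers generate the combinations programmatically.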