Proper/fastest way to reshape a data.table
I have a data table in R:
library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
x y v
[1,] 1 A 12
[2,] 1 B 62
[3,] 1 A 60
[4,] 1 B 61
[5,] 2 A 83
[6,] 2 B 97
[7,] 2 A 1
[8,] 2 B 22
[9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49
I can easily sum the variable v by the groups in the data.table:
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
x y SUM
[1,] 1 A 72
[2,] 1 B 123
[3,] 2 A 84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B 96
However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:
out <开发者_JAVA技巧;- reshape(out,direction='wide',idvar='x', timevar='y')
out
x SUM.A SUM.B
[1,] 1 72 123
[2,] 2 84 119
[3,] 3 162 96
Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?
The data.table package implements faster melt/dcast functions (in C). It also has additional features by allowing to melt and cast multiple columns. Please see the new Efficient reshaping using data.tables on Github.
melt/dcast functions for data.table have been available since v1.9.0 and the features include:
There is no need to load
reshape2package prior to casting. But if you want it loaded for other operations, please load it before loadingdata.table.dcastis also a S3 generic. No moredcast.data.table(). Just usedcast().melt:is capable of melting on columns of type 'list'.
gains
variable.factorandvalue.factorwhich by default areTRUEandFALSErespectively for compatibility withreshape2. This allows for directly controlling the output type ofvariableandvaluecolumns (as factors or not).melt.data.table'sna.rm = TRUEparameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.NEW:
meltcan accept a list formeasure.varsand columns specified in each element of the list will be combined together. This is faciliated further through the use ofpatterns(). See vignette or?melt.
dcast:accepts multiple
fun.aggregateand multiplevalue.var. See vignette or?dcast.use
rowid()function directly in formula to generate an id-column, which is sometimes required to identify the rows uniquely. See ?dcast.
Old benchmarks:
melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
Reminder of Cologne (Dec 2013) presentation slide 32 : Why not submit a dcast pull request to reshape2?
This feature is now implemented into data.table (from version 1.8.11 on), as can be seen in Zach's answer above.
I just saw this great chunk of code from Arun here on SO. So I guess there is a data.table solution. Applied to this problem:
library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=1e6),
y=c("A","B"),
v=sample(1:100,12))
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
# edit (mnel) to avoid setNames which creates a copy
# when calling `names<-` inside the function
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
})
x A B
1: 1 26499966 28166677
2: 2 26499978 28166673
3: 3 26500056 28166650
This gives the same results as DWin's approach:
tapply(DT$v,list(DT$x, DT$y), FUN=sum)
A B
1 26499966 28166677
2 26499978 28166673
3 26500056 28166650
Also, it is fast:
system.time({
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]})
## user system elapsed
## 0.64 0.05 0.70
system.time(tapply(DT$v,list(DT$x, DT$y), FUN=sum))
## user system elapsed
## 7.23 0.16 7.39
UPDATE
So that this solution also works for non-balanced data sets (i.e. some combinations do not exist), you have to enter those in the data table first:
library(data.table)
set.seed(1234)
DT <- data.table(x=c(rep(c(1,2,3),each=4),3,4), y=c("A","B"), v=sample(1:100,14))
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
setkey(out, x, y)
intDT <- expand.grid(unique(out[,x]), unique(out[,y]))
setnames(intDT, c("x", "y"))
out <- out[intDT]
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
Summary
Combining the comments with the above, here's the 1-line solution:
DT[, sum(v), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
setNames(as.list(V1), paste(y)), by = x]
It's also easy to modify this to have more than just the sum, e.g.:
DT[, list(sum(v), mean(v)), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
setNames(as.list(c(V1, V2)), c(paste0(y,".sum"), paste0(y,".mean"))), by = x]
# x A.sum B.sum A.mean B.mean
#1: 1 72 123 36.00000 61.5
#2: 2 84 119 42.00000 59.5
#3: 3 187 96 62.33333 48.0
#4: 4 NA 81 NA 81.0
Data.table objects inherit from 'data.frame' so you can just use tapply:
> tapply(DT$v,list(DT$x, DT$y), FUN=sum)
AA BB
a 72 123
b 84 119
c 162 96
You can use dcast from reshape2 library. Here is the code
# DUMMY DATA
library(data.table)
mydf = data.table(
x = rep(1:3, each = 4),
y = rep(c('A', 'B'), times = 2),
v = rpois(12, 30)
)
# USE RESHAPE2
library(reshape2)
dcast(mydf, x ~ y, fun = sum, value_var = "v")
NOTE: The tapply solution would be much faster.
加载中,请稍侯......
精彩评论