Fastest Tall-Wide pivoting in R
I am dealing with a simple table of the form
date variable value
1970-01-01 V1 0.434
1970-01-01 V2 12.12
1970-01-01 V3 921.1
1970-01-02 V1 -1.10
1970-01-03 V3 0.000
1970-01-03 V5 312e6
... ... ...
The pairs (date, variable) are unique. I would like to transform this table into a wide-form one.
date V1 V2 V3 V4 V5
1970-01-01 0.434 12.12 921.1 NA NA
1970-01-02 -1.10 NA NA NA NA
1970-01-03 NA NA 0.000 NA 312e6
And I would like to do it in the fastest possible way, since I have to repeat the operation many times over tables with 1e6 records. In native R, I believe that tapply(), reshape() and d*ply() are all dominated speed-wise by data.table. I would like to test the performance of the latter against a sqlite-based solution (or another DB). Has this been done before? Are there performance gains? And how does one convert tall-to-wide in sqlite when the number of "wide" fields (the date) is variable and not known in advance?
I use an approach that is based on what tapply does, but is about an order of magnitude faster (primarily because there is no per-cell function call).
Timings using tall from Prasad's post:
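pivot = function(col, row, value) {
  col = as.factor(col)
  row = as.factor(row)
  # Empty (all-NA) matrix: one row per level of row, one column per level of col
  mat = array(dim = c(nlevels(row), nlevels(col)), dimnames = list(levels(row), levels(col)))
  # R matrices are column-major, so cell (i, j) sits at linear index (j - 1) * nrow + i
  mat[(as.integer(col) - 1L) * nlevels(row) + as.integer(row)] = value
  mat
}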
> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
user system elapsed
11.31 0.03 11.36
> system.time( replicate(100, wide <- with(tall, pivot(tkr, dt, value))))
user system elapsed
0.9 0.0 0.9
Regarding possible issues with ordering, there shouldn't be any problem:
> a <- with(tall, pivot(tkr, dt, value))
> b <- with(tall[sample(nrow(tall)), ], pivot(tkr, dt, value))
> all.equal(a, b)
[1] TRUE
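To make the index arithmetic concrete, here is a small sketch that applies the pivot function above to the six rows shown in the question. The function is repeated only so that the block runs on its own; the names tall_q, wide and wide5 are introduced here purely for illustration.

```r
# Same function as above, repeated so this block is self-contained.
pivot = function(col, row, value) {
  col = as.factor(col)
  row = as.factor(row)
  mat = array(dim = c(nlevels(row), nlevels(col)),
              dimnames = list(levels(row), levels(col)))
  # Column-major linear index: (column - 1) * number_of_rows + row
  mat[(as.integer(col) - 1L) * nlevels(row) + as.integer(row)] = value
  mat
}

# The six tall rows from the question (illustrative data only).
tall_q = data.frame(
  date     = c("1970-01-01", "1970-01-01", "1970-01-01",
               "1970-01-02", "1970-01-03", "1970-01-03"),
  variable = c("V1", "V2", "V3", "V1", "V3", "V5"),
  value    = c(0.434, 12.12, 921.1, -1.10, 0.000, 312e6),
  stringsAsFactors = FALSE
)

# Only the keys that actually occur become columns (no V4 here).
wide = with(tall_q, pivot(variable, date, value))

# Declaring the levels up front forces an all-NA V4 column as well.
wide5 = with(tall_q,
             pivot(factor(variable, levels = paste0("V", 1:5)), date, value))
print(wide5)
```

Because as.factor() leaves an existing factor untouched, levels declared in advance survive inside pivot, which is what produces the empty V4 column in the second call.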
A few remarks. A couple of SO questions address how to do tall-to-wide pivoting in SQL(ite): here and here. I haven't looked at those too deeply, but my impression is that doing it in SQL is ugly, as in: your SQL query needs to explicitly mention all possible keys! (Someone please correct me if I'm wrong.) As for data.table, you can definitely do group-wise operations very fast, but I don't see how you can actually cast the result into a wide format.
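On the SQL side, one way around naming every key by hand is to read the distinct keys first and let R assemble the query text. The following is only a sketch under assumptions: it requires the DBI and RSQLite packages, uses an in-memory database, and the table name tall and the object names con, vars, cols and sql are made up for illustration. It has not been timed against the R approaches discussed here.

```r
library(DBI)  # assumes DBI and RSQLite are installed

# Illustrative data: the six tall rows from the question.
tall <- data.frame(
  date     = c("1970-01-01", "1970-01-01", "1970-01-01",
               "1970-01-02", "1970-01-03", "1970-01-03"),
  variable = c("V1", "V2", "V3", "V1", "V3", "V5"),
  value    = c(0.434, 12.12, 921.1, -1.10, 0.000, 312e6),
  stringsAsFactors = FALSE
)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "tall", tall)

# Step 1: discover the keys at run time instead of hard-coding them.
vars <- dbGetQuery(
  con, "SELECT DISTINCT variable FROM tall ORDER BY variable")$variable

# Step 2: build one "MAX(CASE WHEN ...)" expression per key.
# (date, variable) pairs are unique, so MAX just picks the single value.
cols <- sprintf("MAX(CASE WHEN variable = %s THEN value END) AS %s",
                as.character(dbQuoteString(con, vars)),
                as.character(dbQuoteIdentifier(con, vars)))

# Step 3: assemble and run the pivot query.
sql <- paste0("SELECT date, ", paste(cols, collapse = ", "),
              " FROM tall GROUP BY date ORDER BY date")
wide <- dbGetQuery(con, sql)
dbDisconnect(con)

print(wide)
```

The query still mentions every key explicitly; the difference is only that R writes it. Cells with no matching row come back as SQL NULL, which DBI returns as NA.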
If you want to do it purely in R, I think tapply is the speed champ here, much faster than acast from reshape2:
Create some tall data, with some holes in it just to make sure the code is doing the right thing:
tall <- data.frame( dt = rep(1:100, 100),
tkr = rep( paste('v',1:100,sep=''), each = 100),
value = rnorm(1e4)) [-(1:5), ]
> system.time( replicate(100, wide <- with(tall, tapply( value, list(dt,tkr), identity))))
user system elapsed
4.73 0.00 4.73
> system.time( replicate(100, wide <- acast( tall, tkr ~ dt)))
user system elapsed
7.93 0.03 7.98
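One caveat when reading these timings: tapply(value, list(dt, tkr), identity) puts dt in the rows and tkr in the columns, whereas the formula tkr ~ dt asks acast for the transposed layout, so the two calls do not return the same object. A minimal sketch of a consistency check follows. It assumes reshape2 is installed (the comparison is skipped otherwise), uses an arbitrary seed, and names value.var explicitly rather than letting acast guess it; the object names wide_tapply and wide_acast are illustrative.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Same construction as above: 100 x 100 grid with the first five rows removed.
tall <- data.frame(dt    = rep(1:100, 100),
                   tkr   = rep(paste('v', 1:100, sep = ''), each = 100),
                   value = rnorm(1e4))[-(1:5), ]

# Rows are dt, columns are tkr.
wide_tapply <- with(tall, tapply(value, list(dt, tkr), identity))

# The five removed (dt, tkr) pairs show up as NA cells.
print(sum(is.na(wide_tapply)))

if (requireNamespace("reshape2", quietly = TRUE)) {
  # dt ~ tkr matches the tapply orientation (tkr ~ dt would be the transpose).
  wide_acast <- reshape2::acast(tall, dt ~ tkr, value.var = "value")
  # Compare the cell values only, ignoring dimnames.
  print(all.equal(unname(wide_tapply), unname(wide_acast)))
}
```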