R: When using data.table how do I get columns of y when I do x[y]?
UPDATE: Old question ... it was resolved by data.table v1.5.3 in Feb 2011.
I am trying to use the data.table package and really like the speedups I am getting, but I am stumped by this error when I do x[y, <expr>], where x and y are data.tables with the same key and <expr> contains column names of both x and y:
require(data.table)
x <- data.table( foo = 1:5, a = 5:1 )
y <- data.table( foo = 1:5, boo = 10:14)
setkey(x, foo)
setkey(y, foo)
> x[y, foo*boo]
Error in eval(expr, envir, enclos) : object 'boo' not found
UPDATE... To clarify the functionality I am looking for in the above example: I need to do the equivalent of the following:
with(merge(x,y), foo*boo)
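For reference, the merge-based equivalent can be spelled out as a small self-contained sketch. It reuses the x and y defined above; the names merged and res_merge are introduced here purely for illustration.

```r
library(data.table)

x <- data.table(foo = 1:5, a = 5:1)
y <- data.table(foo = 1:5, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

# Join the two tables on their shared key column (foo), then evaluate
# the expression inside the merged result.
merged <- merge(x, y)
res_merge <- with(merged, foo * boo)
print(res_merge)
```

This is the result that x[y, foo*boo] is meant to reproduce without materialising the merged table first.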
However, according to the extract below from the data.table FAQ, this should have worked:
Finally, although it appears as though x[y] does not return the columns in y, you can actually use the columns from y in the j expression. This is what we mean by join inherited scope. Why not just return the union of all the columns from x and y and then run expressions on that? It boils down to efficiency of code and what is quicker to program. When you write x[y,foo*boo], data.table automatically inspects the j expression to see which columns it uses. It will only subset, or group, those columns only. Memory is only created for the columns the j uses. Let's say foo is in x, and boo is in y (along with 20 other columns in y). Isn't x[y,foo*boo] quicker to program and quicker to run than a merge step followed by another subset step?
I am aware of this question that addressed a similar issue, but it did not seem to have been resolved satisfactorily. Anyone know what I am missing or misunderstanding? Thanks.
UPDATE: I asked on the data-table help mailing list, and the package author (Matthew Dowle) replied that the FAQ quoted above is indeed wrong, so the syntax I am using will not currently work, i.e. I cannot refer to the y columns in the j (i.e. second) argument when I do x[y, ...].
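Since the first update notes that this was resolved in data.table v1.5.3, here is a minimal sketch of what the join-inherited-scope call looks like, assuming a reasonably recent data.table release. The i. prefix in the second call is the explicit way to refer to a column of y in current versions; the names res_join and res_pref are only illustrative.

```r
library(data.table)

x <- data.table(foo = 1:5, a = 5:1)
y <- data.table(foo = 1:5, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

# Join x to y on the key and evaluate j with columns from both tables.
res_join <- x[y, foo * boo]

# Same computation, with the y column referenced explicitly via i.
res_pref <- x[y, foo * i.boo]

print(res_join)
print(res_pref)
```

On versions older than v1.5.3 the first call would still be expected to fail with the "object 'boo' not found" error shown in the question.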
I am not sure I understand the problem well, and I have only just started reading the data.table docs, but if you would like to get the columns of y and also combine them with a column of x (here a), you might try something like:
> x[y,a*y]
foo boo
[1,] 5 50
[2,] 8 44
[3,] 9 36
[4,] 8 26
[5,] 5 14
Here, you get back the columns of y multiplied by the a column of x. If you want to get x's foo multiplied by y's boo, try:
> y[,x*boo]
foo a
[1,] 10 50
[2,] 22 44
[3,] 36 36
[4,] 52 26
[5,] 70 14
After editing: thank you @Prasad Chalasani for making the question clearer for me.
If a simple merge is preferred, then the following should work. I made up a slightly more complex data set to look at the behaviour in more depth:
x <- data.table( foo = 1:5, a=20:24, zoo = 5:1 )
y <- data.table( foo = 1:5, b=30:34, boo = 10:14)
setkey(x, foo)
setkey(y, foo)
So only one extra column was added to each data.table. Let us compare merge with doing it directly with data.tables:
> system.time(merge(x,y))
user system elapsed
0.027 0.000 0.023
> system.time(x[,list(y,x)])
user system elapsed
0.003 0.000 0.006
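In this single run on five rows, the latter looks a lot faster. The results are not identical (the latter has an extra foo.1 column), but they can be used in the same way. Note that x[,list(x,y)] simply places the columns of the two tables side by side, so it agrees with merge here only because x and y contain the same foo values in the same order: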
> merge(x,y)
foo a zoo b boo
[1,] 1 20 5 30 10
[2,] 2 21 4 31 11
[3,] 3 22 3 32 12
[4,] 4 23 2 33 13
[5,] 5 24 1 34 14
> x[,list(x,y)]
foo a zoo foo.1 b boo
[1,] 1 20 5 1 30 10
[2,] 2 21 4 2 31 11
[3,] 3 22 3 3 32 12
[4,] 4 23 2 4 33 13
[5,] 5 24 1 5 34 14
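As an aside to the output above, a keyed join with x[y] is another way to get the columns of y next to those of x without the duplicated foo.1 column. The following is only a sketch, assuming a recent data.table release; the names joined and merged are illustrative.

```r
library(data.table)

x <- data.table(foo = 1:5, a = 20:24, zoo = 5:1)
y <- data.table(foo = 1:5, b = 30:34, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

# Keyed join: for each row of y, look up the matching row of x by foo.
joined <- x[y]

# Merge-based result, for comparison.
merged <- merge(x, y)

print(names(joined))
print(names(merged))
```

Unlike binding the two tables side by side, the join matches rows by key, so it does not depend on the rows of x and y happening to be in the same order.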
So to get xy we might use xy <- x[,list(x,y)]. To compute a one-column data.table from xy$foo * xy$boo, the following might work:
> xy[,foo*boo]
[1] 10 22 36 52 70
Well, the result is not a data.table but a vector instead.
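If a one-column data.table is wanted rather than a vector, wrapping the j expression in list() is the usual way to get one. A minimal sketch follows; it builds xy with merge rather than x[,list(x,y)] so that it does not depend on the behaviour of one particular data.table version, and the column name foo_boo is made up for illustration.

```r
library(data.table)

x <- data.table(foo = 1:5, a = 20:24, zoo = 5:1)
y <- data.table(foo = 1:5, b = 30:34, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

xy <- merge(x, y)

# A bare expression in j returns a plain vector.
as_vector <- xy[, foo * boo]

# Wrapping the expression in list() returns a data.table with a named column.
as_table <- xy[, list(foo_boo = foo * boo)]

print(as_vector)
print(as_table)
```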
Update (29/03/2012): thanks to @David for pointing out that merge.data.table was used in the above examples.