How to rewrite a "sapply" command to increase performance?
I have a data.frame named "d" of ~1,300,000 lines and 4 columns and another data.frame named "gc" of ~12,000 lines and 2 columns (but see the smaller example below).
d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("a","b","c"), chr=c("c1","c2","c3") )
Here is how "d" looks like:
gene val ind exp
1 a 1.38711902 i1 e1
2 b -0.25578496 i1 e1
3 c 0.49331256 i1 e1
4 a -1.38015272 i1 e2
5 b 1.46779219 i1 e2
6 c -0.84946320 i1 e2
7 a 0.01188061 i2 e1
8 b -0.13225808 i2 e1
9 c 0.16508404 i2 e1
10 a 0.70949804 i2 e2
11 b -0.64950167 i2 e2
12 c 0.12472479 i2 e2
And here is "gc":
gene chr
1 a c1
2 b c2
3 c c3
I want to add a 5th column to "d" by incorporating data from "gc" that match with the 1st column of "d". For the moment I am using sapply.
d$chr <- sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
But on the real data, it takes a "very long" time (I am running the command with "system.time()" since more than 30 minutes and it's still not finished).
Do 开发者_JAVA技巧you have any idea of how I could rewrite this in a clever way? Or should I consider using plyr, maybe with the "parallel" option (I have four cores on my computer)? In such a case, what would be the best syntax?
Thanks in advance.
I think you can just use the factor as index:
gc[ d[,1], 2]
[1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3
does the same as:
sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
[1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3
But is much faster:
> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
user system elapsed
5.03 0.00 5.02
>
> system.time(replicate(1000,gc[ d[,1], 2]))
user system elapsed
0.12 0.00 0.13
Edit:
To expand a bit on my comment. The gc
dataframe requires one row for each level of gene
in the order of the levels for this to work:
d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("c","a","b"), chr=c("c1","c2","c3") )
gc[ d[,1], 2]
[1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3
sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
[1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3
But it is not hard to fix that:
levels(gc$gene) <- levels(d$gene) # Seems redundant as this is done right quite often automatically
gc <- gc[order(gc$gene),]
gc[ d[,1], 2]
[1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3
sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
[1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3
An alternative solution that does not beat Sasha's approach timing-wise, but is more generalizable and readable, is to simply merge
the two data frames:
d <- merge(d, gc)
I have a slower system, so here are my timings:
> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
user system elapsed
11.22 0.12 11.86
> system.time(replicate(1000,gc[ d[,1], 2]))
user system elapsed
0.34 0.00 0.35
> system.time(replicate(1000, merge(d, gc, by="gene")))
user system elapsed
3.35 0.02 3.40
The benefit is that you could have multiple keys, fine control over non-matching items, etc.
精彩评论