Run time - using apply functions
I have two apply functions computing the average and standard deviation across the first two dimensions of a large three-dimensional array (437216,8,3). It takes 16 minutes to complete on Rx32. It's the first of many large arrays in a database that we will be applying this script to on a regular basis. Any thoughts on how to speed up the run time?
That seems very slow. On my machine
set.seed(10)
x = array(rnorm(437216*8*3), dim = c(437216,8,3))
system.time(apply(x, 1, mean))
takes
user system elapsed
23.903 0.263 24.522
FWIW,
system.time(apply(x, 2, mean))
user system elapsed
0.546 0.274 0.841
system.time(apply(x, 3, mean))
user system elapsed
0.516 0.267 0.790
What is your sessionInfo()?
sessionInfo()
R version 2.11.1 (2010-05-31)
i386-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] cimis_0.1-3 RLastFM_0.1-4 RCurl_1.4-2 bitops_1.0-4.1 XML_3.1-0 lattice_0.18-8
loaded via a namespace (and not attached):
[1] grid_2.11.1 tools_2.11.1
My sessionInfo() is as follows:
sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] abind_1.1-0 RSQLite_0.9-1 DBI_0.2-5
The apply function is applied across both the first and second margins (1:2), and I believe that is what is causing it to run so long. I ran it on a better computer/system (listed above), which cut the run time somewhat (timings below), but it still seems to take longer than it should:
> system.time(apply(x,1:2,mean))
user system elapsed
311.56 0.30 311.88
> system.time(apply(x,1:2,sd))
user system elapsed
505.92 0.21 506.81
I'll look into converting it to a data.frame and unlisting it as in the second suggestion. Thanks for all the help!
EDIT: After seeing the code provided by the OP, the problem became clear. The trick is to convert the array to a data frame:
> x = array(rnorm(437216*8*3), dim = c(437216,8,3))
> system.time(apply(x,1:2,mean))
user system elapsed
107.06 0.18 107.34
# This is run on a new quadcore i7, so it's not a slow machine...
> Tmp <- data.frame(V1=as.vector(x[,,1]),
+ V2=as.vector(x[,,2]),
+ V3= as.vector(x[,,3]))
> system.time({
+ Means <- rowMeans(Tmp)
+ Sd <- sqrt(rowSums((Tmp-Means)^2)/(3-1))
+ })
user system elapsed
6.72 0.40 7.12
To get the results back into the correct matrix layout:
Means <- matrix(Means,ncol=8)
Sd <- matrix(Sd,ncol=8)
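Why the reshape above lines up with the original array: as.vector() flattens each slice column by column, and matrix() fills column by column as well, so the round trip restores the original layout. A minimal sketch on a toy array (the 4 x 2 x 3 size and the names v and m are only for illustration):

```r
# Toy array, small enough to inspect by eye (sizes are arbitrary)
x <- array(seq_len(4 * 2 * 3), dim = c(4, 2, 3))

# as.vector() flattens the first slice column by column
v <- as.vector(x[, , 1])

# matrix() refills column by column, restoring the 4 x 2 layout
m <- matrix(v, ncol = 2)

# Should be TRUE: the round trip reproduces the original slice
identical(m, x[, , 1])
```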
Proof of concept :
x = array(rnorm(10*8*3), dim = c(10,8,3))
m1 <- apply(x,1:2,mean)
sd1 <- apply(x,1:2,sd)
Tmp <- data.frame(V1=as.vector(x[,,1]),
V2=as.vector(x[,,2]),
V3= as.vector(x[,,3]))
m2 <- rowMeans(Tmp)
sd2 <- sqrt(rowSums((Tmp-m2)^2)/2)
m2 <-matrix(m2,ncol=8)
sd2 <- matrix(sd2,ncol=8)
> all.equal(m1,m2)
[1] TRUE
> all.equal(sd1,sd2)
[1] TRUE
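A possible variation, not timed here, is to skip the data frame and use the dims argument of rowMeans() and rowSums(), which treats the first two dimensions as "rows" and averages or sums over the third. This is only a sketch; the names m3, sd3 and n are made up for the example:

```r
set.seed(10)
x <- array(rnorm(10 * 8 * 3), dim = c(10, 8, 3))
n <- dim(x)[3]  # number of values per cell (3 here)

# Mean over the third dimension; the result is a 10 x 8 matrix
m3 <- rowMeans(x, dims = 2)

# as.vector(m3) is recycled along the third dimension, so each
# slice of x has the cell means subtracted from it
sd3 <- sqrt(rowSums((x - as.vector(m3))^2, dims = 2) / (n - 1))

# Both comparisons against apply() should return TRUE
all.equal(m3, apply(x, 1:2, mean))
all.equal(sd3, apply(x, 1:2, sd))
```

The results are already in matrix form, so the matrix(..., ncol=8) step is not needed with this version.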