
vectorization vs. parallelization in R

As a toy example, suppose that we have a function called 'my_func' (the code is below) that takes two parameters 'n' and 'p'. Our function, 'my_func', will generate a random matrix 'x' with 'n' rows and 'p' columns and do something expensive in both runtime and memory usage, such as computing the sum of the singular values of 'x'. (Of course, the function is a one-liner, but I am shooting for readability here.)

my_func <- function(n, p) {
  x <- replicate(p, rnorm(n))  # random n-by-p matrix of standard normal draws
  sum(svd(x)$d)                # the expensive part: sum of the singular values of x
}

If we wish to compute 'my_func' for several values of 'n', and for each value of 'n' several values of 'p', then vectorizing the function and applying it to every combination is straightforward:

n <- 10 * seq_len(5)
p <- 100 * seq_len(10)
grid <- expand.grid(n = n, p = p)
my_func <- Vectorize(my_func)
set.seed(42)
do.call(my_func, grid)
[1]   98.61785  195.50822  292.21575  376.79186  468.13570  145.18359
[7]  280.67456  421.03196  557.87138  687.75040  168.42994  340.42452
[13]  509.65528  683.69883  851.29063  199.08474  400.25584  595.18311
[19]  784.21508  982.34591  220.73215  448.23698  669.02622  895.34184
[25] 1105.48817  242.52422  487.56694  735.67588  976.93840 1203.25949

Notice that each call to 'my_func' can be painfully slow for large 'n' and 'p' (try n = 1000 and p = 2000 for starters).
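
For example, timing a single call at that size already makes the point (the exact number will of course depend on the machine):

system.time(my_func(1000, 2000))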

Now, in my actual application with a similarly constructed function, the number of rows in 'grid' is much larger than given here. Hence, I am trying to understand vectorizing in R a little better.

First question: In the above example, are the calls to 'my_func' performed sequentially, so that the memory used in one call is garbage collected before the next call? I use vectorization often but have never stopped to ask this question.

Second question: (This question may depend on the first.) Assuming that the number of calls is large enough and that 'my_func' is slow enough, is parallelization warranted here? I am presuming yes. My real question is: would parallelization still be warranted if 'my_func' instead had the same large matrix passed to it on each call? For the sake of argument, assume the matrix is called 'y', has 1000 rows and 5000 columns, and is calculated on the fly. Of course, passing the matrix 'y' to each of the parallel nodes will incur some lag.
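
For concreteness, the variant I have in mind looks roughly like this; the name 'my_func_y' and the column-slicing are just placeholders for the real computation:

set.seed(42)
y <- matrix(rnorm(1000 * 5000), nrow = 1000, ncol = 5000)  # built once, on the fly

# placeholder: every call works on (a slice of) the same matrix 'y'
# rather than generating its own random matrix
my_func_y <- function(p, y) {
  x <- y[, seq_len(p), drop = FALSE]
  sum(svd(x)$d)
}

res <- lapply(100 * seq_len(10), my_func_y, y = y)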

I understand that the answer to the second question may be "It depends on..." If that is the case, please let me know, and I will try to give more detail.

Also, I appreciate any advice, feedback, or OMFG WTF N00B YOU HAVEN'T SEEN THIS OTHER OBSCURE SOMEWHAT RELEVANT DISCUSSION??!!!111oneone1


The answer to the first question is pretty clearly yes: almost everything in R is by default serial. (A very few things internally start to use OpenMP, but R as an engine will likely remain single-threaded).
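
(If you want to convince yourself of the garbage-collection part, a quick sketch along these lines does the job; this is not from my actual session. Force a collection after each call and watch the reported memory stay roughly flat, which shows that each call's temporaries are unreachable once it returns.)

mem_used <- sapply(seq_len(nrow(grid)), function(i) {
  res <- my_func(grid[i, 1], grid[i, 2])
  sum(gc()[, 2])   # total Mb still in use after an explicit collection
})
range(mem_used)    # stays in a narrow band across the sequential calls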

So for the second question: yes, do try that. I don't use Vectorize() much, but I do like the *apply() family. Solve it with lapply(), then load the multicore package and use mclapply() to run it over as many cores as you have. Here is an example:

R> system.time(res <- lapply(1:nrow(grid), 
+                            function(i) my_func(grid[i,1],grid[i,2])))
   user  system elapsed 
  0.470   0.000   0.459 
R> system.time(res <- mclapply(1:nrow(grid), 
+                              function(i) my_func(grid[i,1], grid[i,2])))
   user  system elapsed 
  0.610   0.140   0.135 
R> 

Notice how elapsed time is now about 29% (= 0.135/0.459) of the original.

From here you can generalize further with parallel execution across several machines; the Task View on High-Performance Computing with R has further pointers. R 2.14.0, due October 31, will have a new package 'parallel' which combines parts of multicore and snow.
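
If you want to spread the work across machines, or just prefer the snow-style API that 'parallel' will also expose, a socket cluster looks roughly like this. Treat it as a sketch: the worker count and what you need to export depend on your setup.

library(snow)
cl <- makeCluster(4, type = "SOCK")       # four workers; these could be remote hosts
clusterExport(cl, c("my_func", "grid"))   # ship the function and the grid to the workers
res <- parLapply(cl, seq_len(nrow(grid)),
                 function(i) my_func(grid[i, 1], grid[i, 2]))
stopCluster(cl)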
