vectorization vs. parallelization in R
As a toy example, suppose that we have a function called 'my_func' (the code is below) that takes two parameters 'n' and 'p'. Our function, 'my_func', will generate a random matrix 'x' with 'n' rows and 'p' columns and do something expensive in both runtime and memory usage, such as computing the sum of the singular values of 'x'. (Of course, the function is a one-liner, but I am shooting for readability here.)
my_func <- function(n, p) {
  x <- replicate(p, rnorm(n))   # n-by-p matrix of standard normals
  sum(svd(x)$d)                 # sum of the singular values of x
}
If we wish to compute 'my_func' for several values of 'n' and, for each value of 'n', several values of 'p', then vectorizing the function and applying it to all of the combinations is straightforward:
n <- 10 * seq_len(5)
p <- 100 * seq_len(10)
grid <- expand.grid(n = n, p = p)
my_func <- Vectorize(my_func)
set.seed(42)
do.call(my_func, grid)
[1] 98.61785 195.50822 292.21575 376.79186 468.13570 145.18359
[7] 280.67456 421.03196 557.87138 687.75040 168.42994 340.42452
[13] 509.65528 683.69883 851.29063 199.08474 400.25584 595.18311
[19] 784.21508 982.34591 220.73215 448.23698 669.02622 895.34184
[25] 1105.48817 242.52422 487.56694 735.67588 976.93840 1203.25949
Notice that each call to 'my_func' can be painfully slow for large 'n' and 'p' (try n = 1000 and p = 2000 for starters).
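To get a feel for that, you can time a single call directly (the exact numbers will of course depend on your machine):

system.time(my_func(1000, 2000))   # one expensive call; timings will vary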
Now, in my actual application with a similarly constructed function, the number of rows in 'grid' is much larger than given here. Hence, I am trying to understand vectorizing in R a little better.
First question: In the above example, are the calls to 'my_func' performed sequentially, so that the memory used in one call is garbage collected before the next call? I use vectorization often but have never stopped to ask this question.
Second question: (This question may depend on the first) Assuming that the number of calls is large enough and that 'my_func' is slow enough, is parallelization warranted here? I am presuming yes. My real question is: is parallelization still warranted if instead 'my_func' had the same large matrix passed to it for each call? For the sake of argument, assume the matrix is called 'y', has 1000 rows and 5000 columns, and is calculated on the fly. Of course, passing the matrix 'y' to each of the parallel nodes will incur some lag.
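To make that concrete, here is a hypothetical sketch of what I have in mind (the name 'my_func2' and the subsetting of 'y' are just for illustration):

y <- replicate(5000, rnorm(1000))            # the same large matrix for every call
my_func2 <- function(n, p, y) {
  sum(svd(y[seq_len(n), seq_len(p)])$d)      # work on an n-by-p block of y
}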
I understand that the answer to the second question may be "It depends on..." If that is the case, please let me know, and I will try to give more detail.
Also, I appreciate any advice, feedback, or OMFG WTF N00B YOU HAVEN'T SEEN THIS OTHER OBSCURE SOMEWHAT RELEVANT DISCUSSION??!!!111oneone1
The answer to the first question is pretty clearly yes: almost everything in R is by default serial. (A very few things internally start to use OpenMP, but R as an engine will likely remain single-threaded).
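In particular, Vectorize() is just a thin wrapper around mapply(), so your do.call(my_func, grid) runs the calls one after another, roughly equivalent to this serial loop:

res <- mapply(my_func, grid$n, grid$p)   # one call per row of grid, in sequence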
So for the second question: Yes, do try that. I don't use Vectorize() much, but I do like the *apply() family. Solve it with lapply(), then load the multicore package and use mclapply() to run it over as many cores as you have. Here is an example:
R> system.time(res <- lapply(1:nrow(grid),
+ function(i) my_func(grid[i,1],grid[i,2])))
user system elapsed
0.470 0.000 0.459
R> system.time(res <- mclapply(1:nrow(grid),
+ function(i) my_func(grid[i,1], grid[i,2])))
user system elapsed
0.610 0.140 0.135
R>
Notice how elapsed time is now about 29% (= 0.135/0.459) of the original.
From here you can generalize further with parallel execution across several machines; the Task View on High-Performance Computing with R has further pointers. R 2.14.0, due October 31, will have a new package 'parallel' which combines parts of multicore and snow.
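As a minimal sketch of the multi-machine case with snow (the host names and cluster size are placeholders; the forthcoming parallel package offers essentially the same calls):

library(snow)
cl <- makeCluster(c("host1", "host2"), type = "SOCK")   # placeholder hosts
clusterExport(cl, c("my_func", "grid"))                 # ship function and grid to the workers
res <- parLapply(cl, seq_len(nrow(grid)),
                 function(i) my_func(grid[i, 1], grid[i, 2]))
stopCluster(cl)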