OpenMP: What is the benefit of nesting parallelizations?
From what I understand, #pragma omp parallel and its variations basically execute the following block in a number of concurrent threads, which corresponds to the number of CPUs. With nested parallelizations (parallel for within parallel for, parallel function within parallel function, etc.), what happens in the inner parallelization?
I'm new to OpenMP, and the case I have in mind is probably rather trivial - multiplying a vector with a matrix. This is done in two nested for loops. Assuming the number of CPUs is smaller than the number of elements in the vector, is there any benefit in trying to run the inner loop in parallel? Will the total number of threads be larger than the number of CPUs, or will the inner loop be executed sequentially?
(1) Nested parallelism in OpenMP: http://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html
You need to turn on nested parallelism by setting the OMP_NESTED environment variable or calling omp_set_nested, because many implementations turn this feature off by default, and some do not fully support nested parallelism at all. (OpenMP 5.0 deprecates both in favor of OMP_MAX_ACTIVE_LEVELS and omp_set_max_active_levels.) Once it is turned on, every parallel for you encounter creates a new team with the number of threads defined by OMP_NUM_THREADS. So, with 2-level parallelism, the total number of threads would be N^2, where N = OMP_NUM_THREADS.
Such nested parallelism causes oversubscription (i.e., the number of busy threads is greater than the number of cores), which may degrade the speedup. In an extreme case, where nested parallelism is invoked recursively, the thread count can balloon (e.g., thousands of threads), and the computer just wastes time on context switching. In such a case, you may let the runtime adjust the number of threads dynamically by calling omp_set_dynamic.
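To see how the thread count multiplies, here is a minimal sketch in C. The function name count_inner_threads is made up for illustration, and it assumes an OpenMP 3.0+ runtime for omp_set_max_active_levels. Each thread that reaches the innermost region increments a shared counter, so with both levels active the result can reach outer * inner. The runtime is free to hand out fewer threads, and without OpenMP the pragmas are ignored and the result is 1.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Illustrative helper (not from the original post): returns how many
 * threads executed the innermost parallel region. */
int count_inner_threads(int outer, int inner)
{
    int total = 0;
#ifdef _OPENMP
    /* Allow two active levels of parallelism (OpenMP 3.0+). */
    omp_set_max_active_levels(2);
#endif
    #pragma omp parallel num_threads(outer)
    {
        /* Each outer thread requests its own inner team. */
        #pragma omp parallel num_threads(inner)
        {
            #pragma omp atomic
            total++;
        }
    }
    return total;
}
```

Calling count_inner_threads(4, 4) on a 4-core machine would typically report up to 16 threads, which is the oversubscription described above.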
(2) An example of matrix-vector multiplication: the code would look like:
// Input: A(N by M), B(M by 1)
// Output: C(N by 1)
for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
        C[i] += A[i][j] * B[j];
In general, parallelizing an inner loop when the outer loop can be parallelized is a bad idea because of the overhead of forking and joining threads. (Although many OpenMP implementations pre-create threads, some work is still required to dispatch tasks to the threads and to execute the implicit barrier at the end of the parallel for.)
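As a rough sketch of the outer-loop approach, the version below parallelizes only the loop over rows. The function name matvec_outer and the flat row-major layout of A are assumptions made for this example, not something from the question. Each thread accumulates into a private sum, so no two threads write to the same C[i].

```c
#include <stddef.h>

/* Illustrative sketch: C = A * B with A stored row-major as n*m doubles.
 * Only the outer loop is parallel, so there is one fork/join in total. */
void matvec_outer(int n, int m, const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;               /* private to this iteration */
        for (int j = 0; j < m; ++j)
            sum += A[(size_t)i * m + j] * B[j];
        C[i] = sum;
    }
}
```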
Your concern is the case where N < # of CPUs. Yes, in this case the speedup from the outer loop alone is limited by N, and allowing nested parallelism should be beneficial.
However, the same code would then cause oversubscription once N is sufficiently large. I can think of the following solutions:
- Changing the loop structure so that only 1-level loop exists. (It looks doable)
- Specializing the code: if N is small, then do nested parallelism, otherwise don't do that.
- Nested parallelism with omp_set_dynamic. But please make sure you understand how omp_set_dynamic controls the number of threads and the activity of threads. Implementations may vary.
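The second option (specializing on N) could be sketched as below. This is a simplified variant rather than true nesting: when N is small it parallelizes only the inner loop with a reduction, which avoids relying on nested regions being enabled. The name matvec_specialized and the num_cores parameter are assumptions for illustration.

```c
#include <stddef.h>

/* Illustrative sketch: pick which loop to parallelize based on n.
 * A is row-major, n rows by m columns. */
void matvec_specialized(int n, int m, const double *A, const double *B,
                        double *C, int num_cores)
{
    if (n >= num_cores) {
        /* Enough rows to keep every core busy: parallelize the outer loop. */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int j = 0; j < m; ++j)
                sum += A[(size_t)i * m + j] * B[j];
            C[i] = sum;
        }
    } else {
        /* Few rows: keep the outer loop serial and split each dot product. */
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            #pragma omp parallel for reduction(+:sum)
            for (int j = 0; j < m; ++j)
                sum += A[(size_t)i * m + j] * B[j];
            C[i] = sum;
        }
    }
}
```

The small-N branch pays one fork/join per row, so it only makes sense when M is large enough to amortize that cost.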
For something like dense linear algebra, where all the potential parallelism is already laid bare in one place in nice wide for loops, you don't need nested parallelism. If you do want to protect against the case of having (say) really narrow matrices where the leading dimension might be smaller than the number of cores, then all you need is the collapse clause, which notionally flattens the multiple loops into one.
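A minimal sketch of the collapse approach follows. One caveat: once the two loops are flattened, different threads can update the same C[i], so this example adds an array-section reduction, which assumes OpenMP 4.5 or later. The function name matvec_collapse and the flat row-major layout are assumptions for illustration.

```c
#include <stddef.h>

/* Illustrative sketch: flatten the i and j loops into one iteration space.
 * reduction(+:C[:n]) gives each thread a private copy of C and sums the
 * copies at the end (array sections in reductions need OpenMP 4.5+). */
void matvec_collapse(int n, int m, const double *A, const double *B, double *C)
{
    if (n <= 0 || m <= 0)
        return;

    for (int i = 0; i < n; ++i)
        C[i] = 0.0;

    #pragma omp parallel for collapse(2) reduction(+:C[:n])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            C[i] += A[(size_t)i * m + j] * B[j];
}
```

The private copies cost extra memory per thread, so this fits best when N is small, which is exactly the narrow-matrix case.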
Nested parallelism is for those cases where the parallelism isn't all exposed at once -- say you want to do 2 simultaneous function evaluations, each of which could usefully utilize 4 cores, and you have an 8 core system. You call the function in a parallel section, and within the function definition there is an additional, say, parallel for.
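The two-function scenario might look like the sketch below. The workload (a sum of squares) and the names evaluate and evaluate_both are hypothetical stand-ins, and the 2 x 4 split assumes the 8-core system from the example. Nested regions are enabled with omp_set_max_active_levels, which assumes OpenMP 3.0 or later.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical workload: sum of squares, using up to 4 threads of its own. */
double evaluate(const double *x, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) num_threads(4)
    for (int i = 0; i < n; ++i)
        sum += x[i] * x[i];
    return sum;
}

/* Runs two evaluations at the same time, one per section. */
void evaluate_both(const double *x, const double *y, int n,
                   double *rx, double *ry)
{
#ifdef _OPENMP
    omp_set_max_active_levels(2);   /* let the inner parallel for be active */
#endif
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        *rx = evaluate(x, n);
        #pragma omp section
        *ry = evaluate(y, n);
    }
}
```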
At the outer level, use the num_threads(num_groups) clause to set the number of threads to use. If your outer loop has a count N, and the number of processors or cores is num_cores, use num_groups = min(N, num_cores). At the inner level, you need to set the number of sub-threads for each thread group so that the total number of sub-threads equals the number of cores. So if num_cores = 8 and N = 4, then num_groups = 4. At the inner level each group should use 2 threads (since 2+2+2+2 = 8), so use the num_threads(2) clause. You can collect the number of sub-threads into an array with one element per outer region thread (with num_groups elements).
This strategy keeps all of your cores busy without oversubscribing them. When N < num_cores some nested parallelisation occurs. When N >= num_cores the array of sub-thread counts contains all 1s and so the inner loop is effectively serial.
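A sketch of this strategy in C follows. The names plan_thread_groups and matvec_nested are made up for illustration, and handing any leftover cores to the first few groups is an assumption for the case where num_cores is not a multiple of num_groups. It also assumes the runtime honors the requested thread counts, which is not guaranteed.

```c
#include <stddef.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Fills sub[g] with the inner thread count for each outer thread g and
 * returns num_groups = min(n, num_cores). The counts sum to num_cores. */
int plan_thread_groups(int n, int num_cores, int *sub)
{
    int num_groups = n < num_cores ? n : num_cores;
    for (int g = 0; g < num_groups; ++g)
        sub[g] = num_cores / num_groups + (g < num_cores % num_groups ? 1 : 0);
    return num_groups;
}

/* Illustrative nested matrix-vector product; A is row-major, n by m. */
void matvec_nested(int n, int m, const double *A, const double *B,
                   double *C, int num_cores)
{
    if (n <= 0 || num_cores <= 0)
        return;
    int *sub = malloc((size_t)num_cores * sizeof *sub);
    if (!sub)
        return;                         /* sketch only: no error reporting */
    int num_groups = plan_thread_groups(n, num_cores, sub);
#ifdef _OPENMP
    omp_set_max_active_levels(2);       /* enable the inner level */
#endif
    #pragma omp parallel for num_threads(num_groups)
    for (int i = 0; i < n; ++i) {
        int g = 0;
#ifdef _OPENMP
        g = omp_get_thread_num();       /* index of this outer thread */
#endif
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum) num_threads(sub[g])
        for (int j = 0; j < m; ++j)
            sum += A[(size_t)i * m + j] * B[j];
        C[i] = sum;
    }
    free(sub);
}
```

With num_cores = 8 and N = 4 the plan is {2, 2, 2, 2}; with N = 16 it is eight 1s, so the inner loop runs on a single thread.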