CUDA __syncthreads() and recursion
I want to use __syncthreads() inside a recursive function like this:
__device__ void foo(int k) {
    if (some_condition) {
        for (int i = 0; i < 8; i++) {
            foo(i + k); // foo might take longer with some inputs
            __syncthreads();
        }
    }
}
How does __syncthreads() apply here? I know it only synchronizes threads within a block. As far as I understand it, it applies to all threads of the block regardless of the recursion depth? But what if I wanted this __syncthreads() to apply only at a certain recursion depth? Is that even possible? I could check the recursion depth, but I believe that won't work either.
Are there possible alternatives?
I've seen that there are three __syncthreads() extensions for CUDA devices of compute capability >= 2.0:
int __syncthreads_count(int predicate);
int __syncthreads_and(int predicate);
int __syncthreads_or(int predicate);
But I don't think they will help since they seem like an atomic counter.
As you know, __syncthreads() is only safe where all threads within a block reach the barrier. This means that if you are calling __syncthreads() from within a condition, the condition must evaluate to the same value on all threads within the block.
For __syncthreads() within recursion, this means that all threads within a block must execute the recursion to the same depth; otherwise, not all threads will reach the same barrier.
Are there possible alternatives?
Yes: don't use the recursion paradigm to express your function logic.
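As a rough illustration of a non-recursive formulation, here is a minimal sketch in which each loop iteration stands in for one level of the recursion. Threads may have different amounts of work, but whether to continue is decided by a block-wide vote with __syncthreads_or(), so every thread runs the same number of iterations and reaches every barrier. The names vote_loop, run_vote_loop, work and rounds are made up for this example, and it assumes a single block of at most 1024 threads.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: work[tid] is the number of steps thread tid needs.
// rounds[tid] records how many loop iterations that thread executed.
__global__ void vote_loop(const int *work, int *rounds)
{
    int tid = threadIdx.x;
    int left = work[tid];
    int done_rounds = 0;

    // __syncthreads_or() is itself a barrier. It returns non-zero in every
    // thread of the block if the predicate is non-zero in at least one
    // thread, so the loop condition is identical across the block.
    while (__syncthreads_or(left > 0)) {
        if (left > 0) {
            --left;          // per-thread (divergent) work would go here
        }
        ++done_rounds;       // executed by every thread in every round
    }
    rounds[tid] = done_rounds;
}

// Hypothetical host helper: launches one block with one thread per entry
// (assumes h_work.size() <= 1024) and returns the per-thread round counts.
std::vector<int> run_vote_loop(const std::vector<int> &h_work)
{
    const int n = static_cast<int>(h_work.size());
    int *d_work = nullptr;
    int *d_rounds = nullptr;
    cudaMalloc(&d_work, n * sizeof(int));
    cudaMalloc(&d_rounds, n * sizeof(int));
    cudaMemcpy(d_work, h_work.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    vote_loop<<<1, n>>>(d_work, d_rounds);

    std::vector<int> h_rounds(n);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_rounds.data(), d_rounds, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_work);
    cudaFree(d_rounds);
    return h_rounds;
}
```

Calling run_vote_loop({0, 3, 1, 5}) should give every thread the same round count, equal to the largest entry, since threads that finish early keep voting until the slowest one is done. Whether this maps onto a particular recursive algorithm depends on how its work can be split into levels.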
Of course, what you said about __syncthreads() is true: it only works for threads within a block, so you have no control over what is happening in other blocks. The best way to do a reduction is to first reduce the whole array, which generates an array with one element per block. Then, instead of copying that array back to the host, launch a second reduction with one block and as many threads as there were blocks in the first call, and afterwards copy the resulting array of size 1 from device to host. Make sure to call cudaThreadSynchronize() (deprecated in later CUDA versions in favor of cudaDeviceSynchronize()) between the two calls, because you can't run the second reduction until the first one has produced its results. This is a two-step reduction, but it works for me.
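A minimal sketch of the two-step reduction described above, assuming an integer sum. The names block_sum and two_pass_sum are invented for this example. It also assumes the block size is a power of two and that the number of blocks in the first launch is a power of two no larger than 1024, so that the second launch fits in one block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each block sums its chunk of `in` in shared memory
// and writes one partial sum to out[blockIdx.x].
// Assumes blockDim.x is a power of two.
__global__ void block_sum(const int *in, int *out, int n)
{
    extern __shared__ int s[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    s[tid] = (gid < n) ? in[gid] : 0;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s[tid] += s[tid + stride];
        }
        __syncthreads();   // outside the if, so every thread reaches it
    }
    if (tid == 0) {
        out[blockIdx.x] = s[0];
    }
}

// Hypothetical host helper implementing the two launches.
int two_pass_sum(const std::vector<int> &h_in)
{
    const int threads = 256;
    const int n = static_cast<int>(h_in.size());
    // Assumption: this comes out as a power of two <= 1024.
    const int blocks = (n + threads - 1) / threads;

    int *d_in = nullptr, *d_partial = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Step 1: one partial sum per block, kept on the device.
    block_sum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_partial, n);

    // Explicit sync as the answer suggests. Launches on the same stream
    // already run in order, so this is a conservative choice.
    cudaDeviceSynchronize();

    // Step 2: a single block with one thread per partial sum.
    block_sum<<<1, blocks, blocks * sizeof(int)>>>(d_partial, d_out, blocks);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

For example, two_pass_sum on a vector of 65536 ones uses 256 blocks in the first launch and 256 threads in the second. Inputs that break the power-of-two assumption would need padding or a different second-pass block size.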
cheers!!! saif