CUDA: more dimensions for a block or just one?
I need to take the square root of each element of a matrix (which is basically a vector of float values once in memory) using CUDA.
The matrix dimensions are not known a priori and may vary from 2 to 20,000.
I was wondering: I might use (as Jonathan suggested here) one block dimension like this:
int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
and check that thread_id is lower than rows*columns... that's pretty simple and straightforward.
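To make that concrete, here is a minimal sketch of the 1D approach (the kernel name, block size, and d_data pointer are illustrative, not from the original post; d_data is assumed to already hold the matrix on the device):

__global__ void sqrt_kernel(float *data, int n)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_id < n)                     // guard: the last block may be partially full
        data[thread_id] = sqrtf(data[thread_id]);
}

// Launch with enough blocks to cover all rows*columns elements:
int n = rows * columns;
int threads_per_block = 256;               // assumed block size
int blocks = (n + threads_per_block - 1) / threads_per_block;
sqrt_kernel<<<blocks, threads_per_block>>>(d_data, n);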
But is there any particular performance reason why I should use two (or even three) grid dimensions to perform such a calculation (keeping in mind that I have a matrix, after all) instead of just one?
I'm thinking of coalescing problems, such as making all threads read values sequentially.
The dimensions only exist for convenience; internally everything is linear, so there would be no advantage in terms of efficiency either way. Avoiding the (contrived) linear-index computation that a 2D grid would require, i.e. sticking with the 1D scheme you've shown, might even be slightly faster, but there would be no difference in how the threads' memory accesses coalesce.
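To illustrate the point, a hypothetical 2D-grid version (kernel name is illustrative) still has to fold its (x, y) coordinates back into the same linear offset:

__global__ void sqrt_kernel_2d(float *data, int rows, int columns)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < columns && y < rows)
        data[y * columns + x] = sqrtf(data[y * columns + x]);
}

Consecutive threadIdx.x values still map to consecutive addresses in both versions, so the coalescing behavior is identical.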