
CUDA memory allocation - is it efficient?

This is my code. I have a lot of threads, and those threads call this function many times. Inside this function I am creating an array. Is this an efficient implementation? If it is not, please suggest a more efficient one.

__device__ float calculate_minimum(float *arr)
{
    float vals[9];      // for each call to this function I am creating this array
                        // Is it efficient? Or how can I implement this efficiently?
                        // Do I need to deallocate the memory after using this array?
    for(int i = 0; i < 9; i++)
        vals[i] = ...;  // call some function and assign the values
    float min = findMin(vals);
    return min;
}


There is no "array creation" in that code. There is a statically declared array. Further, the standard CUDA compilation model will inline-expand __device__ functions, meaning that vals will be compiled into local memory, or, where possible, even into registers.

All of this happens at compile time, not run time.


Perhaps I am missing something, but from the code you have posted, you don't need the temporary array at all. Your code will be (a little) faster if you do something like this:

 #include "float.h" // for FLT_MAX

__device__ float calculate minimum(float *arr)
 {
     float minVal = FLT_MAX:
     for(int i=0;i<9;i++)
         thisVal = //call some function and assign the values
         minVal = min(thisVal,minVal);
     return minVal;
 }

Where an array is actually required, there is nothing wrong with declaring it in this way (as many others have said).
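For context, here is a minimal sketch of how the running-minimum version might be driven from a kernel over many threads. The kernel name minKernel and the idea that "some function" simply reads nine values from a global array are assumptions for illustration, not part of the original question:

    #include <float.h>  // for FLT_MAX

    __device__ float calculate_minimum(const float *arr)
    {
        float minVal = FLT_MAX;
        for(int i = 0; i < 9; i++)
        {
            float thisVal = arr[i];        // hypothetical stand-in for "call some function"
            minVal = min(thisVal, minVal);
        }
        return minVal;
    }

    // Each thread computes the minimum of its own group of 9 values.
    __global__ void minKernel(const float *data, float *result, int numGroups)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < numGroups)
            result[tid] = calculate_minimum(data + tid * 9);
    }

Because calculate_minimum is inlined into the kernel, minVal and thisVal end up as registers; nothing is allocated or freed per call.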


Regarding the "float vals[9]", this will be efficient in CUDA. For arrays that have small size, the compiler will almost surely allocate all the elements into registers directly. So "vals[0]" will be a register, "vals[1]" will be a register, etc.

If the compiler starts to run out of registers, or the array size is larger than around 16, then local memory is used. You don't have to worry about allocating or deallocating local memory; the compiler and driver do all that for you.
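To make the register/local-memory distinction concrete, here is a small sketch (the function names are made up for illustration). The compiler can keep a small array in registers only when every index is a compile-time constant, which is why fully unrolled loops help, while a run-time index typically pushes the array into local memory:

    __device__ float sum9(const float *in)
    {
        float vals[9];
        #pragma unroll                    // fully unrolled: every vals[i] has a
        for(int i = 0; i < 9; i++)        // constant index, so the elements can
            vals[i] = in[i];              // live in registers

        float s = 0.0f;
        #pragma unroll
        for(int i = 0; i < 9; i++)
            s += vals[i];
        return s;
    }

    __device__ float pick9(const float *in, int k)
    {
        float vals[9];
        for(int i = 0; i < 9; i++)
            vals[i] = in[i];
        return vals[k];                   // k is only known at run time, so the
                                          // array will usually end up in local memory
    }

You can check which case you hit by compiling with nvcc -Xptxas -v, which reports each kernel's register count and local-memory (stack frame / spill) usage.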

Devices of compute capability 2.0 and greater do have a call stack, to allow things like recursion. For example, you can set the stack size to 6 KB per thread with:

cudaStatus = cudaThreadSetLimit(cudaLimitStackSize, 1024*6);

Normally you won't need to touch the stack yourself. Even if you put big static arrays in your device functions, the compiler and driver will see what's there and make space for you.
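As a side note, cudaThreadSetLimit has since been deprecated in favour of cudaDeviceSetLimit, which takes the same arguments. A small host-side sketch that requests the 6 KB per-thread stack and reads back the granted value:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Request a 6 KB per-thread stack.
        cudaError_t cudaStatus = cudaDeviceSetLimit(cudaLimitStackSize, 1024 * 6);
        if (cudaStatus != cudaSuccess)
            printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(cudaStatus));

        // Read back what the driver actually granted.
        size_t stackSize = 0;
        cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
        printf("per-thread stack size: %zu bytes\n", stackSize);
        return 0;
    }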
