
CUDA memory allocation - is it efficient?

This is my code. I have a lot of threads, and those threads call this function many times. Inside this function I am creating an array. Is this an efficient implementation? If it is not, please suggest a more efficient one.

__device__ float calculate_minimum(float *arr)
{
    float vals[9];      // for each call to this function I am creating this array
                        // Is it efficient? Or how can I implement this efficiently?
                        // Do I need to deallocate the memory after using this array?
    for(int i = 0; i < 9; i++)
        vals[i] = ...;  // call some function and assign the values
    float min = findMin(vals);
    return min;
}


There is no "array creation" in that code. There is a statically declared array. Further, the standard CUDA compilation model will inline-expand __device__ functions, meaning that vals will be compiled into local memory, or, where possible, even into registers.

All of this happens at compile time, not run time.


Perhaps I am missing something, but from the code you have posted, you don't need the temporary array at all. Your code will be (a little) faster if you do something like this:

 #include "float.h" // for FLT_MAX

__device__ float calculate minimum(float *arr)
 {
     float minVal = FLT_MAX:
     for(int i=0;i<9;i++)
         thisVal = //call some function and assign the values
         minVal = min(thisVal,minVal);
     return minVal;
 }

Where an array is actually required, there is nothing wrong with declaring it in this way (as many others have said).
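For context, here is a minimal sketch of how the running-minimum version might be driven from a kernel over many threads. The kernel name minKernel and the idea that "some function" simply reads nine values from a global array are assumptions for illustration, not part of the original question:

    #include <float.h>  // for FLT_MAX

    __device__ float calculate_minimum(const float *arr)
    {
        float minVal = FLT_MAX;
        for(int i = 0; i < 9; i++)
        {
            float thisVal = arr[i];        // hypothetical stand-in for "call some function"
            minVal = min(thisVal, minVal);
        }
        return minVal;
    }

    // Each thread computes the minimum of its own group of 9 values.
    __global__ void minKernel(const float *data, float *result, int numGroups)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < numGroups)
            result[tid] = calculate_minimum(data + tid * 9);
    }

Because calculate_minimum is inlined into the kernel, minVal and thisVal end up as registers; nothing is allocated or freed per call.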


Regarding the "float vals[9]", this will be efficient in CUDA. For arrays that have small size, the compiler will almost surely allocate all the elements into registers directly. So "vals[0]" will be a register, "vals[1]" will be a register, etc.

If the compiler starts to run out of registers, or the array size is larger than around 16, then local memory is used. You don't have to worry about allocating or deallocating local memory; the compiler and driver do all that for you.
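To make the register/local-memory distinction concrete, here is a small sketch (the function names are made up for illustration). The compiler can keep a small array in registers only when every index is a compile-time constant, which is why fully unrolled loops help, while a run-time index typically pushes the array into local memory:

    __device__ float sum9(const float *in)
    {
        float vals[9];
        #pragma unroll                    // fully unrolled: every vals[i] has a
        for(int i = 0; i < 9; i++)        // constant index, so the elements can
            vals[i] = in[i];              // live in registers

        float s = 0.0f;
        #pragma unroll
        for(int i = 0; i < 9; i++)
            s += vals[i];
        return s;
    }

    __device__ float pick9(const float *in, int k)
    {
        float vals[9];
        for(int i = 0; i < 9; i++)
            vals[i] = in[i];
        return vals[k];                   // k is only known at run time, so the
                                          // array will usually end up in local memory
    }

You can check which case you hit by compiling with nvcc -Xptxas -v, which reports each kernel's register count and local-memory (stack frame / spill) usage.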

Devices of compute capability 2.0 and greater do have a call stack, to allow things like recursion. For example, you can set the stack size to 6 KB per thread with:

cudaStatus = cudaThreadSetLimit(cudaLimitStackSize, 1024*6);

Normally you won't need to touch the stack yourself. Even if you put big static arrays in your device functions, the compiler and driver will see what's there and make space for you.
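As a side note, cudaThreadSetLimit has since been deprecated in favour of cudaDeviceSetLimit, which takes the same arguments. A small host-side sketch that requests the 6 KB per-thread stack and reads back the granted value:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Request a 6 KB per-thread stack.
        cudaError_t cudaStatus = cudaDeviceSetLimit(cudaLimitStackSize, 1024 * 6);
        if (cudaStatus != cudaSuccess)
            printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(cudaStatus));

        // Read back what the driver actually granted.
        size_t stackSize = 0;
        cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
        printf("per-thread stack size: %zu bytes\n", stackSize);
        return 0;
    }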
