CUDA: Reasons for using preprocessing variables to specify the problem size
I'm coding CUDA in Matlab mex-Files. When you look at CUDA examples on the internet or even manuals from nvidia, you often see the use of preprocessing variables to specify the problem size, e.g. the vector length for a vector addition or something like this. I coded my program also like this: Preprocessing Variables for specifying the problem size. And I have to admit it: I like it since you can access those everywhere in your code, e.g. as limits in a loop or somethin开发者_如何学Cg like this, without having to explicitly pass them via argument to the function.
But I ran into the following problem: I wanted to bench the program for several different problem sizes and thus I need to compile the code everytime again by passing the preprocessing-variable to the compiler. It's not a problem, I already coded the benchmark and it works. But I just wonder afterwards now, why I chose this version and did not simply specify it by a user input on runtime. And thus I'm looking for reasons one might want to use preprocessing variables instead of simply passing the problem size to the program.
Thanks!
When you compile-in problem-size constants in the kernel, then the compiler can make certain classes of optimizations that it can't if the sizes are only known at runtime. Full loop unrolling is an obvious example.
In other cases, for instance shared memory array sizes, it is a lot clearer if the sizes are compiled-in; otherwise you have to pass in the total shared memory size at kernel launch time and break that memory up into the number of shared arrays you need. That works fine, but the code is much clearer if you can just have static declarations, for which you need the compile-time sizes.
The main reason is that in general the problem size will be intimately linked to the GPU architecture, e.g. number of threads per block, number of blocks, amount of shared memory per thread, number of registers per thread, etc. In general these numbers are all carefully hand tuned to get the maximum usage of available resources and you can't easily change the problem size dynamically while still maintaining optimum performance.
精彩评论