CUDA __umul24 function, useful or not?

Is it worth replacing all multiplications with the __umul24 function in a CUDA kernel? I have read differing and contradictory opinions, and I haven't yet been able to run a benchmark to figure it out.


Only on devices with an architecture prior to Fermi, i.e. with compute capability below 2.0, where the integer multiply unit is 24-bit.

On CUDA devices with compute capability >= 2.0, the architecture has native 32-bit integer multiply, and __umul24 will be slower instead of faster. The reason is that the 24-bit operation has to be emulated on the 32-bit hardware.

The question now is: is the speed gain worth the effort? Probably not.
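Since the asker mentions not having managed to benchmark this, here is a minimal timing sketch using CUDA events (kernel names and launch parameters are illustrative, not from the original answers). Note that __umul24 truncates its operands to their low 24 bits, so the two kernels only compute the same values while the running product stays below 2^24.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Time a long chain of dependent multiplies with the native 32-bit
// operator vs. __umul24. The store to out[] keeps the loop from
// being optimized away.
__global__ void MulNative(unsigned int *out, unsigned int seed, int iters)
{
    unsigned int x = seed + threadIdx.x;
    for (int i = 0; i < iters; ++i) x = x * 3u + 1u;
    out[threadIdx.x] = x;
}

__global__ void Mul24(unsigned int *out, unsigned int seed, int iters)
{
    unsigned int x = seed + threadIdx.x;
    for (int i = 0; i < iters; ++i) x = __umul24(x, 3u) + 1u;
    out[threadIdx.x] = x;
}

int main()
{
    unsigned int *d_out;
    cudaMalloc(&d_out, 256 * sizeof(unsigned int));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    cudaEventRecord(start);
    MulNative<<<64, 256>>>(d_out, 1u, 10000);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("native multiply: %.3f ms\n", ms);

    cudaEventRecord(start);
    Mul24<<<64, 256>>>(d_out, 1u, 10000);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("__umul24       : %.3f ms\n", ms);

    cudaFree(d_out);
    return 0;
}
```

Run it on both a pre-Fermi and a Fermi-or-later device to see the effect described above in both directions.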


Just wanted to chime in with a slightly different opinion than Ashwin/fabrizioM...

If you're just trying to teach yourself CUDA, their answer is probably more or less acceptable. But if you're actually trying to deploy a production-grade app in a commercial or research setting, that sort of attitude is generally not acceptable, unless you are absolutely sure that your end users' hardware (or your own, if you are the end user) is Fermi or later.

More likely, there are many users running CUDA on legacy machines who would benefit from using functionality appropriate to their compute capability. And it's not as hard as Ashwin/fabrizioM make it out to be.

e.g. in a code I'm working on, I'm using:

// Prior to Fermi, use __umul24; on Fermi and later,
// use the native 32-bit multiply.
__device__ inline unsigned int MultiplyFermi(unsigned int a, unsigned int b)
{ return a * b; }

__device__ inline unsigned int MultiplyAddFermi(unsigned int a, unsigned int b,
                                                unsigned int c)
{ return a * b + c; }

__device__ inline unsigned int MultiplyOld(unsigned int a, unsigned int b)
{ return __umul24(a, b); }

__device__ inline unsigned int MultiplyAddOld(unsigned int a, unsigned int b,
                                              unsigned int c)
{ return __umul24(a, b) + c; }

// Maximum Occupancy = 16384
// Query the device and record its compute capability. The struct is
// passed by reference so the result is visible to the caller.
void GetComputeCharacteristics(ComputeCapabilityLimits_t &MyCapability)
{
    cudaDeviceProp DeviceProperties;
    cudaGetDeviceProperties(&DeviceProperties, 0);
    MyCapability.ComputeCapability =
        double(DeviceProperties.major) + double(DeviceProperties.minor) * 0.1;
}

Now there IS a downside here. What is it?

Well, for any kernel in which you use a multiplication, you must maintain two different versions of the kernel.
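This downside can often be avoided, though this is a sketch of an alternative rather than what the answer above describes: branch at compile time on the __CUDA_ARCH__ macro inside a single device helper, so one kernel source serves all architectures. When nvcc compiles for several architectures (e.g. via multiple -gencode flags), each device binary gets the right branch with no runtime cost.

```cuda
// Sketch: one multiply helper for all architectures.
// __CUDA_ARCH__ is defined only during device-code compilation and
// encodes the target, e.g. 130 for compute capability 1.3, 200 for 2.0.
__device__ inline unsigned int MultiplyBest(unsigned int a, unsigned int b)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 200
    return __umul24(a, b);   // 24-bit multiply is faster pre-Fermi
#else
    return a * b;            // native 32-bit multiply on Fermi and later
#endif
}
```

Kernels then call MultiplyBest(a, b) unconditionally, and no duplicate kernel versions are needed.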

Is it worth it?

Well, consider: this is a trivial copy-and-paste job, and you gain efficiency, so yes, in my opinion it is worth it. After all, CUDA isn't the easiest form of programming conceptually (nor is any parallel programming). If performance is NOT critical, ask yourself: why are you using CUDA?

If performance is critical, it's negligent to code lazily and either abandon legacy devices or ship less-than-optimal execution, unless you're absolutely confident you can drop legacy support for your deployment (allowing optimal execution).

For most, it makes sense to provide legacy support, given that it's not that hard once you realize how to do it. Be aware that this means you will also need to update your code in order to adjust to changes in future architectures.

Generally, you should note the latest architecture your code was targeted at when it was written, and perhaps print some sort of warning to users if they have a compute capability beyond what your latest implementation is optimized for.
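That warning can be implemented in a few lines of host code. A minimal sketch, where the constant kLatestTunedCapability is a hypothetical name for whatever compute capability your code was last tuned against:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed: this build was last tuned for compute capability 2.0.
const int kLatestTunedCapability = 20;

// Warn at startup if device 0 is newer than the architecture
// this code was optimized for.
void WarnIfNewerThanTuned(void)
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    int cc = p.major * 10 + p.minor;
    if (cc > kLatestTunedCapability)
        std::printf("Warning: device compute capability %d.%d is newer than "
                    "the %d.%d this code was optimized for.\n",
                    p.major, p.minor,
                    kLatestTunedCapability / 10, kLatestTunedCapability % 10);
}
```

Call WarnIfNewerThanTuned() once at program start, alongside the capability query shown earlier.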
