Using CUDA optimization approaches for OpenCL

2023-03-04 12:10

The more I learn about OpenCL, the more it seems that the right optimization of your kernel is the key to success. Furthermore, I noticed that the kernels for both languages seem very similar.

So how sensible would it be to apply CUDA optimization strategies learned from books and tutorials to OpenCL kernels, considering that there is so much more (good) literature for CUDA than for OpenCL?

What is your opinion on that? What is your experience?

Thanks!

If you are working only with NVIDIA cards, you can use the same optimization approaches in both CUDA and OpenCL. One thing to keep in mind, though, is that OpenCL might have a longer start-up time than CUDA on NVIDIA cards (this was a while ago, when I was experimenting with both of them).

However, if you are going to work with different architectures, you will need to figure out a way to generalize your OpenCL program so that it performs well across multiple platforms, which is not possible with CUDA.

But a few basic optimization approaches remain the same. For example, the following holds on any platform:

Reads from and writes to aligned memory addresses perform better (and alignment is sometimes mandatory, as on platforms like the Cell processor).
You need to know and understand the limited resources of each platform (whether they are called constant memory, shared memory, local memory or cache).
You need to understand parallel programming, for example the trade-off between performance gains (launching more threads) and overhead costs (launching, communication and synchronization).

That last point is useful in all kinds of parallel programming (be it multi-core, many-core or grid computing).
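To make the alignment point a bit more concrete, here is a minimal host-side sketch in plain C (not OpenCL or CUDA API code). The 64-byte alignment and the helper names `is_aligned`, `round_up` and `alloc_aligned_floats` are assumptions chosen for illustration; the right alignment depends on your platform, so check your vendor's documentation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment for illustration only: 64 bytes, a common cache-line size. */
#define BUF_ALIGN 64

/* Returns 1 if p is aligned to `align` bytes; align must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Rounds n up to the next multiple of align (a power of two). */
static size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Allocates room for at least `count` floats in a BUF_ALIGN-aligned buffer.
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   hence the rounding. Release the buffer with free(). */
static float *alloc_aligned_floats(size_t count)
{
    return aligned_alloc(BUF_ALIGN, round_up(count * sizeof(float), BUF_ALIGN));
}
```

A caller would request a buffer with `alloc_aligned_floats(n)`, confirm it with `is_aligned(buf, BUF_ALIGN)`, and hand it to whatever API consumes it. Whether this actually helps on a given device is something to measure.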

While I'm still new to OpenCL (and have barely glanced at CUDA), optimization at the developer level can be summarized as structuring your code so that it matches the hardware's (and compiler's) preferred way of doing things.

On GPUs, this can be anything from ordering your data correctly to take advantage of cache coherency (GPUs LOVE to work with cached data, from the top all the way down to the individual cores; there are several levels of cache) to taking advantage of built-in operations like vector and matrix manipulation. I recently had to implement FDTD in OpenCL. By replacing the expanded dot/cross products in the popular implementations with matrix operations (which GPUs love!), reordering loops so that the X dimension (whose elements are stored sequentially) is handled in the innermost loop instead of the outermost, avoiding branching (which GPUs hate), and so on, I was able to increase speed by about 20%. Those optimizations should work in CUDA, OpenCL or even GPU assembly, and I would expect that to be true of all of the most effective GPU optimizations.
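The loop-reordering idea can be sketched in plain C on the CPU. This is not the FDTD code mentioned above, just a hypothetical forward difference along X; the names `idx` and `diff_x` and the memory layout (X varies fastest) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Flat index of (x, y, z) in a grid where x varies fastest in memory. */
static size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* Forward difference along x: out(x) = in(x+1) - in(x).
   The last x column of each row is set to 0.
   x is the innermost loop, so consecutive iterations touch
   consecutive memory addresses. */
static void diff_x(const float *in, float *out,
                   size_t nx, size_t ny, size_t nz)
{
    assert(nx > 0);
    for (size_t z = 0; z < nz; ++z) {
        for (size_t y = 0; y < ny; ++y) {
            const size_t row = idx(0, y, z, nx, ny);
            for (size_t x = 0; x + 1 < nx; ++x) {
                out[row + x] = in[row + x + 1] - in[row + x];
            }
            /* Boundary handled once per row, outside the inner loop,
               so the inner loop body contains no branch. */
            out[row + nx - 1] = 0.0f;
        }
    }
}
```

Swapping the loop order so that z is innermost would compute the same result but stride through memory by `nx * ny` elements per step, which is the access pattern the reordering is meant to avoid. How much this gains is application-dependent.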

Of course, most of this is application-dependent, so it may fall under the TIAS (try-it-and-see) category.

Here are a few links I found that look promising:

NVIDIA - Best Practices for OpenCL Programming

AMD - Porting CUDA to OpenCL

My research (and even NVIDIA's documentation) points to a nearly 1:1 correspondence between CUDA and OpenCL, so I would be very surprised if optimizations did not translate well between them. Most of what I have read focuses on cache coherency, avoiding branching, etc.
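As a rough illustration of that correspondence, here is how the usual 1-D global index lines up between the two, emulated in plain C. The `work_item` struct and `global_id` function are hypothetical stand-ins for this sketch, not part of either API, and it assumes no global work offset on the OpenCL side.

```c
#include <assert.h>
#include <stddef.h>

/* CUDA term        OpenCL term
   thread           work-item
   thread block     work-group
   shared memory    local memory
   threadIdx.x      get_local_id(0)
   blockIdx.x       get_group_id(0)
   blockDim.x       get_local_size(0)  */

/* Hypothetical struct standing in for one thread / work-item position (1-D). */
typedef struct {
    size_t group_id;   /* blockIdx.x  / get_group_id(0)   */
    size_t local_id;   /* threadIdx.x / get_local_id(0)   */
    size_t local_size; /* blockDim.x  / get_local_size(0) */
} work_item;

/* CUDA:   blockIdx.x * blockDim.x + threadIdx.x
   OpenCL: get_global_id(0), assuming no global work offset. */
static size_t global_id(work_item w)
{
    assert(w.local_id < w.local_size);
    return w.group_id * w.local_size + w.local_id;
}
```

For example, item 5 of group 2 with a group size of 64 maps to global index 2 * 64 + 5 = 133 in either model.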

Also, note that in the case of OpenCL, the actual compilation process is handled by the vendor (I believe it happens in the video driver), so it may be worthwhile to have a look at the driver documentation and OpenCL kits from your vendor (NVIDIA, ATI, Intel(?), etc).
