JNCI/JCOL kernel optimization

2023-03-31 07:45 问答作者：

I have a kernel running in open CL (via a jocl front end) that is running horrible slow compared to the other kernels, I'm trying to figure why and how to accelerate it. This kernel is very basic. it's sole job is to decimate the number of sample points we have. It copies every Nth point from the input array to a smaller output array to shrink our array size.

The kernel is passed a float specifying how many points to skip between 'good' points. So if it is passed 1.5 it will skip one point, ten two, then one etc to keep an average of every 1.5 points being skipped. The input array is already on the GPU (it was generated by an earlier kernel) and the output array will stay on the kernel so there is no expense to transfer data to or from the CPU.

This kernel is running 3-5 times slower then any of the other kernels; and as much as 20 times slower then some of the fast kernels. I realize that I'm suffering a penalty for not coalescing my array accesses; but I can't believe that it would cause me to run this horribly slow. After all every other kernel is touching every sample in the array, I would think touching ever X sample in the array, even if not coalesced, should be around the same speed at least of touching every sample in an array.

The original kernel actually decimated two arrays at once, for real and imaginary data. I tried splitting the kernel up into two kernel calls, one to decimate real and one to decimate imaginary data; but this didn't help at all. Likewise I tried 'unrolling' the kernel by having one thread be responsible for decimation of 3-4 points; but this didn't help any. Ive tried messing with the size of data passed into ea开发者_运维知识库ch kernel call (ie one kernel call on many thousands of data points, or a few kernel calls on a smaller number of data points) which has allowed me to tweak out small performance gains; but not to the order of magnitude I need for this kernel to be considered worth implementing on GPU.

just to give a sense of scale this kernel is taking 98 ms to run per iteration while the FFT takes only 32 ms for the same input array size and every other kernel is taking 5 or less ms. What else could cause such a simple kernel to run so absurdly slow compared to the rest of the kernels were running? Is it possible that I actually can't optimize this kernel sufficiently to warrant running it on the GPU. I don't need this kernel to run faster then CPU; just not quite as slow compared to CPU so I can keep all processing on the GPU.

it turns out the issue isn't with the kernel at all. Instead the problem is that when I try to release the buffer I was decimating it causes the entire program to stall while the kernel (and all other kernels in queue) complete. This appears to be functioning incorrectly, the clrelease should only decrement a counter so far as I understand, not block on the queue. However; the important point is that my kernel is running efficiently as it should be.

继续阅读：jocl opencl optimization

JNCI/JCOL kernel optimization

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？