Confused about profiling results from my OpenCL kernel (DSP Window Function)

Posted 2023-01-24 10:06

I completed a Window Function kernel in OpenCL. Basically a window function just applies a set of coefficients over another set of numbers piece by piece (Wikipedia explains it better). I was able to stuff the window coefficient float array in constant cache for most cases.

I expected my results from Compute Prof to show the host-to-device and device-to-host memory transfers taking more than 95% of the processing time. For nearly all of my cases they account for only 80%. I am writing and reading one 4.2 million float array to and from the board, and writing another float array that generally stays well below a million elements.

Does anything in the kernel look fishy? Any opinions on whether this is a problem that should run faster on a GPU than on a CPU in the first place? (I am still not 100% sure on this.) I am a little stunned that my gld_efficiency and gst_efficiency hover between 0.1 and 0.2. I wrote this kernel with G80 global memory coalescing in mind. My overall global memory throughput seems all right at 40 GB/s. The kernel is pretty simple and is posted below.

__kernel void window(__global float* inputArray, // first frame to ingest starts at 0.  Sized to nFramesToIngest*framesize samples
    __constant float* windowArray, // may already be partly filled
    int windowSize, // size of window frame, in floats
    int primitivesPerDataFrame, //amount of primitives in each frame of inputArray parameter
    int nInFramesThisCall, //each thread solves a frame, so this integer represent how many threads this kernel launches
    int isRealNumbers //0 for complex, non-zero for real 
)
{
    int gid = get_global_id(0) + get_global_size(0) * get_global_id(1);

    if (gid < nInFramesThisCall) // make sure we don't execute unnecessary threads
    {
        if (isRealNumbers)
        {
            for (int i = 0; i < primitivesPerDataFrame; i++)
            {
                int inputArrayIndex = (gid * primitivesPerDataFrame) + i;
                inputArray[inputArrayIndex] = inputArray[inputArrayIndex] * windowArray[i % windowSize];
            }
        }
        else // complex
        {
            for (int i = 0; i < primitivesPerDataFrame; i++)
            {
                int inputArrayIndex = (gid * primitivesPerDataFrame) + i;
                // % and / have equal precedence and associate left to right, so this is (i % windowSize) / 2
                inputArray[inputArrayIndex] = inputArray[inputArrayIndex] * windowArray[(i % windowSize) / 2];
            }
        }
    }
}

How many threads (the OpenCL term is work-items, by the way) are you using? You need at least something in the hundreds to load a big GPU efficiently.

You say you want to make use of coalesced memory access, but a load with an offset like

int inputArrayIndex = (gid*primitivesPerDataFrame)+i;

will not make this possible in most cases. NVIDIA's G80 has pretty severe restrictions when it comes to coalescing; see the "OpenCL Best Practices Guide" for more information. Basically, the work-items that issue a memory access together (on G80, a half-warp of 16 work-items) have to access consecutive elements of a 64- or 128-byte aligned block, in order, for the loads and stores to be coalesced.

Or to give you an example: if primitivesPerDataFrame is 16, loads and stores of a warp are done at offsets spaced 16 elements apart, making any efficient coalescing impossible.
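
To make the example above concrete, here is a small host-side C model of the two indexing schemes. It is only a sketch: the 16-work-item group and 64-byte segment size are simplifying assumptions about G80-class hardware, and the "one work-item per float" mapping (with the hypothetical names element_per_item_index and window_element_per_item) is one possible way to restructure the kernel, not something taken from the question. It covers only the real-valued case.

```c
#include <assert.h>
#include <stddef.h>

#define HALF_WARP 16      /* assumed number of work-items coalesced together */
#define SEGMENT_BYTES 64  /* assumed aligned block size for 4-byte accesses */

/* Index touched at loop step i by work-item gid in the original kernel
   (one work-item per frame). */
static size_t frame_per_item_index(size_t gid, size_t i, size_t primitivesPerDataFrame)
{
    return gid * primitivesPerDataFrame + i;
}

/* Hypothetical alternative: one work-item per float, so gid is the index. */
static size_t element_per_item_index(size_t gid)
{
    return gid;
}

/* Count the distinct 64-byte segments touched by the float indices idx[0..n-1]. */
static int count_segments(const size_t *idx, int n)
{
    int count = 0;
    for (int a = 0; a < n; a++) {
        size_t seg = idx[a] * sizeof(float) / SEGMENT_BYTES;
        int seen = 0;
        for (int b = 0; b < a; b++) {
            if (idx[b] * sizeof(float) / SEGMENT_BYTES == seg) { seen = 1; break; }
        }
        if (!seen) count++;
    }
    return count;
}

/* Segments touched by the first 16 work-items at loop step i, original mapping. */
int segments_frame_per_item(size_t primitivesPerDataFrame, size_t i)
{
    size_t idx[HALF_WARP];
    for (size_t gid = 0; gid < HALF_WARP; gid++)
        idx[gid] = frame_per_item_index(gid, i, primitivesPerDataFrame);
    return count_segments(idx, HALF_WARP);
}

/* Segments touched by the first 16 work-items, one-float-per-work-item mapping. */
int segments_element_per_item(void)
{
    size_t idx[HALF_WARP];
    for (size_t gid = 0; gid < HALF_WARP; gid++)
        idx[gid] = element_per_item_index(gid);
    return count_segments(idx, HALF_WARP);
}

/* CPU model of real-valued windowing with one work-item per float:
   each loop iteration stands in for one work-item. */
void window_element_per_item(float *inputArray, const float *windowArray,
                             size_t windowSize, size_t primitivesPerDataFrame,
                             size_t totalFloats)
{
    assert(windowSize > 0 && primitivesPerDataFrame > 0);
    for (size_t gid = 0; gid < totalFloats; gid++) {
        size_t i = gid % primitivesPerDataFrame; /* position inside the frame */
        inputArray[gid] *= windowArray[i % windowSize];
    }
}
```

With primitivesPerDataFrame set to 16, segments_frame_per_item(16, 0) counts 16 separate segments for a single loop step, while segments_element_per_item() counts 1. In this model, neighbouring work-items touching neighbouring floats is what keeps the access inside one aligned block.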
