Confused about profiling results from my OpenCL kernel (DSP Window Function)
I completed a Window Function kernel in OpenCL. Basically a window function just applies a set of coefficients over another set of numbers piece by piece (Wikipedia explains it better). I was able to stuff the window coefficient float array in constant cache for most cases.
I expected my results from Compute Prof to show that the host to device and device to host memory transfers would take more than 95% of the processing time. For nearly all of my cases it is only 80% of the processing time. I am writing and reading one 4.2 million float array to and from the board and writing another float array that generally stays well below a million.
Does anything in the kernel look fishy? Any opinions on if it is a problem that should run faster on a GPU than a CPU in the first place(I am still not 100% on this). I am a little stunned as to why my gld_efficiency an开发者_JAVA百科d gst_efficiency hover between 0.1 and 0.2. I made this kernel with G80 global memory coalescing in mind. My global memory overall throughput seems alright at 40gbs. The kernel is pretty simple and is posted below.
__kernel void window(__global float* inputArray, // first frame to ingest starts at 0. Sized to nFramesToIngest*framesize samples
__constant float* windowArray, // may already be partly filled
int windowSize, // size of window frame, in floats
int primitivesPerDataFrame, //amount of primitives in each frame of inputArray parameter
int nInFramesThisCall, //each thread solves a frame, so this integer represent how many threads this kernel launches
int isRealNumbers //0 for complex, non-zero for real
)
{
int gid = get_global_id(0) + get_global_size(0) * get_global_id(1);
if(gid < nInFramesThisCall) //make sure we don't execute unnecessary threads
{
if(isRealNumbers)
{
for(int i = 0; i < primitivesPerDataFrame; i++)
{
int inputArrayIndex = (gid*primitivesPerDataFrame)+i;
inputArray[inputArrayIndex] = inputArray[inputArrayIndex] * windowArray[i%windowSize];
}
}
else //complex
{
for(int i = 0; i < primitivesPerDataFrame; i++)
{
int inputArrayIndex = (gid*primitivesPerDataFrame)+i;
inputArray[inputArrayIndex] = inputArray[inputArrayIndex] * windowArray[i%windowSize/2];
}
}
}
}
How many threads (the OpenCL term is work-items, by the way) are you using? You need at least something in the hundreds to load a big GPU efficiently.
You say you want to make use of coalesced memory access, but a load with an offset like
int inputArrayIndex = (gid*primitivesPerDataFrame)+i;
will not make this possible in most cases. NVidia's G80 has pretty severe restrictions when it comes to coalescing, see the "OpenCL Best Practices Guide" for more information. Basically, work-items from one warp have to access elements of a 64 or 128 byte aligned block in a certain fashion at the same time to make loads and stores happen coalesced.
Or to give you an example: if primitivesPerDataFrame
is 16, loads and stores of a warp are done at offsets spaced 16 elements apart, making any efficient coalescing impossible.
精彩评论