
Problem with OpenCL kernel recompilation slowing down the program, and possible memory issues because of it

I'm fairly new to OpenCL and I'm running OS X 10.6 with the Nvidia 330 graphics card. I'm working on a cloth simulation in C++, for which I've managed to write a kernel that compiles and runs. The problem is that it runs slower than it did on the CPU without OpenCL. I believe the reason is that every time I call the update() method to do some calculations, I'm setting up the context and device and then recompiling the kernel from source.

To solve this, I tried encapsulating the various OpenCL types I need in the cloth simulation class so they can be stored there, and created an initCL() method to set them up. I then created a runCL() method to execute the kernel. Strangely, I only get a memory problem when the OpenCL code is split across these two methods; it works fine when initCL() and runCL() are combined into a single method, which is why I'm a little stuck.
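
For reference, the OpenCL state now lives in the class as members. A trimmed sketch of the declaration (only the CL-related members are shown; the names match the code below). Code:

#include <OpenCL/opencl.h> // Apple's OpenCL header on OS X 10.6

class VPESimulationCloth {
    // ... cloth state: nowPos, prevPos, rForce, mass, and so on ...
    cl_device_id     device;      // picked once in initCL()
    cl_context       context;     // created once in initCL()
    cl_command_queue cmd_queue;   // created once in initCL()
    cl_program       program[1];  // built once from clothSimKernel.cl
    cl_kernel        kernel[1];   // the "clothSimulation" kernel object
    cl_int           err;
    size_t           returned_size;
    size_t           buffer_size;

public:
    int initCL();   // one-time setup
    int runCL();    // per-update execution, called from update()
};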

The program compiles and runs, but I then get a SIGABRT or EXC_BAD_ACCESS at the point marked in the runCL() code. When I get a SIGABRT, the error is CL_INVALID_COMMAND_QUEUE, but I can't work out for the life of me why this only happens when the two methods are split up. Sometimes I get a SIGABRT when an assertion fails, which is to be expected, but other times I just get the bad-access error when trying to write to the buffer.

Also, if anyone can tell me a better/the right way to do this, or whether the JIT recompiling isn't actually what's slowing my code down, I'd be very grateful, because I've been staring at this for far too long!

Thanks,

Jon

The initialisation of the OpenCL variables. Code:

int VPESimulationCloth::initCL(){
    // Check for a CPU CL device first, as a fallback
    err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    assert(err == CL_SUCCESS);

    // Find the GPU CL device; this is what we really want.
    // If no CL-capable GPU device is available, fall back to the CPU.
    err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS) err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    assert(device);

    // Get some information about the returned device
    cl_char vendor_name[1024] = {0};
    cl_char device_name[1024] = {0};
    err = clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(vendor_name),
                          vendor_name, &returned_size);
    err |= clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(device_name),
                           device_name, &returned_size);
    assert(err == CL_SUCCESS);
    //printf("Connecting to %s %s...\n", vendor_name, device_name);

    // Now create a context to perform our calculation with the
    // specified device
    context = clCreateContext(0, 1, &device, NULL, NULL, &err);
    assert(err == CL_SUCCESS);

    // And also a command queue for the context
    cmd_queue = clCreateCommandQueue(context, device, 0, &err);
    assert(err == CL_SUCCESS);

    // Load the program source from disk
    // The kernel/program should be in the resource directory
    const char *filename = "clothSimKernel.cl";
    char *program_source = load_program_source(filename);

    program[0] = clCreateProgramWithSource(context, 1, (const char**)&program_source,
                                           NULL, &err);
    if (!program[0])
    {
        printf("Error: Failed to create compute program!\n");
        return EXIT_FAILURE;
    }
    assert(err == CL_SUCCESS);

    err = clBuildProgram(program[0], 0, NULL, NULL, NULL, NULL);
    if (err != CL_SUCCESS)
    {
        char build[2048];
        clGetProgramBuildInfo(program[0], device, CL_PROGRAM_BUILD_LOG, 2048, build, NULL);
        printf("Build Log:\n%s\n", build);
        if (err == CL_BUILD_PROGRAM_FAILURE) {
            printf("CL_BUILD_PROGRAM_FAILURE\n");
        }
        cout << getErrorDesc(err) << endl;
    }
    assert(err == CL_SUCCESS);
    //writeBinaries();

    // Now create the kernel "objects" that we want to use in the example file
    kernel[0] = clCreateKernel(program[0], "clothSimulation", &err);
    assert(err == CL_SUCCESS);

    return err;
}

The method that executes the kernel. Code:

int VPESimulationCloth::runCL(){

    // Find the GPU CL device; this is what we really want.
    // If no CL-capable GPU device is available, fall back to the CPU.
    // (NB: initCL() already stored the device, so this lookup is redundant here.)
    err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS) err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    assert(device);

    // Get some information about the returned device
    cl_char vendor_name[1024] = {0};
    cl_char device_name[1024] = {0};
    err = clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(vendor_name),
                          vendor_name, &returned_size);
    err |= clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(device_name),
                           device_name, &returned_size);
    assert(err == CL_SUCCESS);
    //printf("Connecting to %s %s...\n", vendor_name, device_name);

    //cmd_queue = clCreateCommandQueue(context, device, 0, NULL);

    // Memory allocation
    cl_mem nowPos_mem, prevPos_mem, rForce_mem, mass_mem, passive_mem,
           canMove_mem, numPart_mem, theForces_mem, numForces_mem, drag_mem,
           answerPos_mem;

    // Allocate memory on the device to hold our data and store the results into
    buffer_size = sizeof(float4) * numParts;

    // Input arrays
    //------------------------------------
    // This is where the error occurs
    nowPos_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, buffer_size, NULL, NULL);
    err = clEnqueueWriteBuffer(cmd_queue, nowPos_mem, CL_TRUE, 0, buffer_size,
                               (void*)nowPos, 0, NULL, NULL);
    if (err != CL_SUCCESS) {
        cout << getErrorDesc(err) << endl;
    }
    assert(err == CL_SUCCESS);
    //------------------------------------
    prevPos_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, buffer_size, NULL, NULL);
    err = clEnqueueWriteBuffer(cmd_queue, prevPos_mem, CL_TRUE, 0, buffer_size,
                               (void*)prevPos, 0, NULL, NULL);
    assert(err == CL_SUCCESS);
    rForce_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, buffer_size, NULL, NULL);
    err = clEnqueueWriteBuffer(cmd_queue, rForce_mem, CL_TRUE, 0, buffer_size,
                               (void*)rForce, 0, NULL, NULL);
    assert(err == CL_SUCCESS);
    mass_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, buffer_size, NULL, NULL);
    err = clEnqueueWriteBuffer(cmd_queue, mass_mem, CL_TRUE, 0, buffer_size,
                               (void*)mass, 0, NULL, NULL);
    assert(err == CL_SUCCESS);
    answerPos_mem = clCreateBuffer(context, CL_MEM_READ_WRITE, buffer_size, NULL, NULL);

    // uint buffers
    buffer_size = sizeof(uint) * numParts;
    passive_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, buffer_size, NULL, NULL);
    err = clEnqueueWriteBuffer(cmd_queue, passive_mem, CL_TRUE, 0, buffer_size,
                               (void*)passive, 0, NULL, NULL);
    assert(err == CL_SUCCESS);
    canMove_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, buffer_size, NULL, NULL);
    err = clEnqueueWriteBuffer(cmd_queue, canMove_mem, CL_TRUE, 0, buffer_size,
                               (void*)canMove, 0, NULL, NULL);
    assert(err == CL_SUCCESS);

    buffer_size = sizeof(float4) * numForces;
    theForces_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, buffer_size, NULL, NULL);
    err = clEnqueueWriteBuffer(cmd_queue, theForces_mem, CL_TRUE, 0, buffer_size,
                               (void*)theForces, 0, NULL, NULL);
    assert(err == CL_SUCCESS);

    // drag float
    buffer_size = sizeof(float);
    drag_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, buffer_size, NULL, NULL);
    err = clEnqueueWriteBuffer(cmd_queue, drag_mem, CL_TRUE, 0, buffer_size,
                               (void*)drag, 0, NULL, NULL);
    assert(err == CL_SUCCESS);

    // Now set up the arguments to our kernel
    // (NB: for the scalar arguments numParts and numForces, the size passed
    // should match the kernel argument's actual type, e.g. sizeof(cl_uint),
    // rather than sizeof(cl_mem).)
    err  = clSetKernelArg(kernel[0],  0, sizeof(cl_mem), &nowPos_mem);
    err |= clSetKernelArg(kernel[0],  1, sizeof(cl_mem), &prevPos_mem);
    err |= clSetKernelArg(kernel[0],  2, sizeof(cl_mem), &rForce_mem);
    err |= clSetKernelArg(kernel[0],  3, sizeof(cl_mem), &mass_mem);
    err |= clSetKernelArg(kernel[0],  4, sizeof(cl_mem), &passive_mem);
    err |= clSetKernelArg(kernel[0],  5, sizeof(cl_mem), &canMove_mem);
    err |= clSetKernelArg(kernel[0],  6, sizeof(cl_mem), &numParts);
    err |= clSetKernelArg(kernel[0],  7, sizeof(cl_mem), &theForces_mem);
    err |= clSetKernelArg(kernel[0],  8, sizeof(cl_mem), &numForces);
    err |= clSetKernelArg(kernel[0],  9, sizeof(cl_mem), &drag_mem);
    err |= clSetKernelArg(kernel[0], 10, sizeof(cl_mem), &answerPos_mem);
    if (err != CL_SUCCESS) {
        cout << getErrorDesc(err) << endl;
    }
    assert(err == CL_SUCCESS);

    // Run the calculation by enqueuing it and forcing the
    // command queue to complete the task
    size_t global_work_size = numParts;
    size_t local_work_size = global_work_size/8;
    err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL,
                                 &global_work_size, &local_work_size, 0, NULL, NULL);
    if (err != CL_SUCCESS) {
        cout << getErrorDesc(err) << endl;
    }
    assert(err == CL_SUCCESS);
    //clFinish(cmd_queue);

    // Once finished, read the results from the answer array back into the
    // results array (resetting buffer_size to the float4 array size first)
    buffer_size = sizeof(float4) * numParts;
    err = clEnqueueReadBuffer(cmd_queue, answerPos_mem, CL_TRUE, 0, buffer_size,
                              answerPos, 0, NULL, NULL);
    if (err != CL_SUCCESS) {
        cout << getErrorDesc(err) << endl;
    }

    // Release the cl_mem objects
    clReleaseMemObject(nowPos_mem);
    clReleaseMemObject(prevPos_mem);
    clReleaseMemObject(rForce_mem);
    clReleaseMemObject(mass_mem);
    clReleaseMemObject(passive_mem);
    clReleaseMemObject(canMove_mem);
    clReleaseMemObject(theForces_mem);
    clReleaseMemObject(drag_mem);
    clReleaseMemObject(answerPos_mem);
    // BUG (see the resolution below): these two calls destroy the command
    // queue and context that initCL() created, so the next call to runCL()
    // fails with CL_INVALID_COMMAND_QUEUE.
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);
    assert(err == CL_SUCCESS);
    return err;
}


Problem solved! At the bottom of the runCL() method I was "freeing" all my CL types; I thought I was only releasing some cl_mem objects, but on closer inspection I was also releasing the context and command queue. An obvious and annoying mistake, as always :).
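
For anyone else who hits this: the per-frame cleanup in runCL() should release only the cl_mem buffers, and the long-lived objects should be released once when the simulation is torn down. A sketch of the fix (assuming the class destructor is the right place for the one-time cleanup in your code). Code:

// In runCL(): per-frame cleanup releases the buffers only
clReleaseMemObject(nowPos_mem);
// ... release the other cl_mem objects as before ...
clReleaseMemObject(answerPos_mem);
// Do NOT release cmd_queue or context here.

// One-time cleanup, mirroring initCL()
VPESimulationCloth::~VPESimulationCloth(){
    clReleaseKernel(kernel[0]);
    clReleaseProgram(program[0]);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);
}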

Thanks to andrew.brownsword on the Khronos forums for spotting this one.


Well done for fixing the main issue.

Regarding performance, is numParts a large number? The global work size should be large, e.g. tens of thousands, to ensure that you saturate the device with work. Ideally the local work size (when linearized) would be a multiple of 32; the best value will depend on your kernel.

It is common to set the local work size to a constant, or to a value derived from the kernel (you can query information such as the maximum local work size), since numParts/8 could cause launch failures if it becomes too large (the limit depends on the specific kernel and the specific device); see the sketch below.
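
For example, you could query the largest work-group size the kernel supports on your device and derive the local size from that. A sketch, reusing kernel[0] and device from the code above (the starting width of 32 is just an illustrative choice). Code:

size_t max_wg = 0;
err = clGetKernelWorkGroupInfo(kernel[0], device, CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(max_wg), &max_wg, NULL);
assert(err == CL_SUCCESS);

// Choose a local size that respects the device/kernel limit and divides
// the global size evenly (required in OpenCL 1.x).
size_t local_work_size = 32;
while (local_work_size > max_wg || numParts % local_work_size != 0)
    local_work_size /= 2;   // fall back through 16, 8, ... down to 1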

