
Is it possible to span an OpenCL kernel so that it runs concurrently on the CPU and GPU?

Let's assume that I have a computer with a multicore processor and a GPU. I would like to write an OpenCL program that runs on all cores of the platform. Is this possible, or do I need to choose a single device on which to run the kernel?


In theory, yes: the CL API allows it. But the platform/implementation must support it, and I don't think most CL implementations do.

To do it, get the cl_device_id of the CPU device and of the GPU device, then create a context containing both devices with clCreateContext.


No, you can't automagically span a kernel across both the CPU and the GPU; it's either one or the other.

You could do it, but this involves manually creating and managing two command queues (one for each device).

See this thread: http://devforums.amd.com/devforum/messageview.cfm?catid=390&threadid=124591&messid=1072238&parentid=0&FTVAR_FORUMVIEWTMP=Single


One context can only be for one platform. If your multi-device code needs to work across platforms (for example, an Intel platform for CPU OpenCL and an NVIDIA platform for the GPU), then you need separate contexts.

However, if the GPU and CPU happen to be in the same platform, then yes, you can use one context.

If you are using multiple devices on the same platform (two identical GPUs, or two GPUs from the same manufacturer), then you can share the context, as long as they both come from a single clGetDeviceIDs call.

EDIT: I should add that a GPU+CPU context doesn't imply any automatically managed CPU+GPU execution. Typically, it is best practice to let the driver allocate a memory buffer that the GPU can access by DMA for maximum performance. When the CPU and GPU are in the same context, you can share those buffers across the two devices.

You still have to split the workload up yourself. My favorite load balancing technique uses events. Every n work items, attach an event object to a command (or enqueue a marker), and wait for the event that you set n work items ago (the prior one). If you didn't have to wait, increase n on that device; if you did have to wait, decrease n. This limits the queue depth, and n will hover around the ideal depth to keep the device busy. You need to do this anyway to avoid starving GUI rendering. Just keep n commands in each command queue (where the CPU and GPU each have their own n) and the work will divide perfectly.
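The adjustment rule for n can be sketched in plain C, independent of any OpenCL calls. This is only an illustration under stated assumptions: `queue_throttle`, `throttle_update`, and the min/max bounds are hypothetical names that do not come from the answer, and the `had_to_wait` flag is assumed to be computed by the caller (for example by checking the event's execution status before blocking on it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state: one instance per command queue. */
typedef struct {
    size_t depth;      /* current n: commands kept in flight on this device */
    size_t min_depth;  /* assumed lower bound, at least 1 */
    size_t max_depth;  /* assumed upper bound to cap queue growth */
} queue_throttle;

/*
 * Apply the rule from the answer after checking the event recorded n
 * commands ago:
 *   - it had already completed (no wait): the device ran dry, so raise n
 *   - we had to wait: the queue is deep enough or too deep, so lower n
 * Returns the updated depth.
 */
static size_t throttle_update(queue_throttle *t, bool had_to_wait)
{
    assert(t->min_depth >= 1 && t->min_depth <= t->max_depth);

    if (had_to_wait) {
        if (t->depth > t->min_depth)
            t->depth--;
    } else {
        if (t->depth < t->max_depth)
            t->depth++;
    }
    return t->depth;
}
```

With one `queue_throttle` per device, a faster device would tend to settle at a larger n and so receive more of the commands, which is the balancing effect described above.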


You cannot span a kernel across multiple devices. But if the code you are running does not depend on other results (e.g., processing 16 kB blocks of data that each need heavy processing), you can launch the same kernel on both the GPU and the CPU, putting some blocks on the GPU and some on the CPU.

That way it should boost performance.

You can do that by creating a context (cl_context) shared by the CPU and GPU, plus two command queues.
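As a rough sketch of the "some blocks on the GPU and some on the CPU" idea, the helper below divides a run of independent blocks into two contiguous ranges. Everything here is an assumption for illustration: `block_range`, `split_blocks`, and the fixed `gpu_percent` ratio are hypothetical and not part of OpenCL; a real program would pick the ratio by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a contiguous run of independent blocks. */
typedef struct {
    size_t first;  /* index of the first block in the range */
    size_t count;  /* number of blocks in the range */
} block_range;

/*
 * Give roughly gpu_percent of the blocks to the GPU (rounded down)
 * and the remainder to the CPU. The two ranges never overlap and
 * together cover every block exactly once.
 */
static void split_blocks(size_t total_blocks, unsigned gpu_percent,
                         block_range *gpu, block_range *cpu)
{
    assert(gpu_percent <= 100);

    /* Split the multiplication so it cannot overflow for large totals. */
    size_t gpu_count = (total_blocks / 100) * gpu_percent
                     + ((total_blocks % 100) * gpu_percent) / 100;

    gpu->first = 0;
    gpu->count = gpu_count;
    cpu->first = gpu_count;
    cpu->count = total_blocks - gpu_count;
}
```

Each range could then be turned into the offset and size of the work enqueued on that device's command queue, for instance through the global work offset and global work size arguments of clEnqueueNDRangeKernel.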

This is not applicable to all kernels. Sometimes the kernel code operates on all of the input data at once and cannot be separated into parts or chunks.
