How many tasks can be executed simultaneously on a GPU device?
I'm using OpenCL and have an ATI 4850 card. It reports:
- CL_DEVICE_MAX_COMPUTE_UNITS: 10
- CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
- CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
- CL_DEVICE_MAX_WORK_ITEM_SIZES: (256, 256, 256)
- CL_DEVICE_AVAILABLE: 1
- CL_DEVICE_NAME: ATI RV770
How many tasks can it execute simultaneously?
Is it CL_DEVICE_MAX_COMPUTE_UNITS * CL_DEVICE_MAX_WORK_ITEM_SIZES = 2560?
To be more specific: a single-core processor can execute only one task at any one moment, a dual-core processor can execute two, and so on. How many tasks can my GPU execute at one moment? Or, rephrased: how many processors does my GPU have?
The RV770 has 10 SIMD cores, each consisting of 16 shader cores, each consisting of 5 ALUs (VLIW5 architecture). That gives a total of 800 ALUs that can do parallel computations. I don't think there's a way to get all these numbers out of OpenCL; CL_DEVICE_MAX_COMPUTE_UNITS only gives you the 10. I'm also not sure what you would equate to a CPU core. Perhaps a shader core? You can read about VLIW on Wikipedia. It's an interesting design.
If you say a CPU core is only executing one "task" at any given time, even though it has multiple ALUs working in parallel, then I guess you can say the RV770 would be working on 160 tasks (10 SIMD cores times 16 shader cores). But given the differences in how different chips work, "core" and "task" become difficult to define. A CPU with hyperthreading can even execute two sets of code at the same time. With OpenCL I don't believe it is possible yet to execute more than one kernel at any given time, unless recent driver updates have changed that.
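To make the arithmetic above explicit, here is a small sketch in Python. The constant names are made up for illustration; the figures are the ones quoted in this answer, not values read back from OpenCL.

```python
# Figures quoted above for the RV770 (ATI 4850); names are illustrative only.
SIMD_CORES = 10             # same value as CL_DEVICE_MAX_COMPUTE_UNITS in the question
SHADER_CORES_PER_SIMD = 16
ALUS_PER_SHADER_CORE = 5    # VLIW5: five ALUs per shader core

# "Tasks" if one shader core is treated like one CPU core.
shader_cores = SIMD_CORES * SHADER_CORES_PER_SIMD

# Individual ALUs that can compute in parallel.
alus = shader_cores * ALUS_PER_SHADER_CORE

print("shader cores:", shader_cores)
print("ALUs:", alus)
```

Neither number matches the 2560 guessed in the question, which multiplies the compute-unit count by a work-group size limit rather than by any hardware unit count.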
Anyway, I think it is more important to present your work to the GPU in a way that gives the best performance. Unfortunately, as far as I know, there's no way to find the best work group size other than experimenting. One thing that helps: if the drivers support OpenCL 1.1, you can query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE (via clGetKernelWorkGroupInfo) and set your work group size to a multiple of that. Otherwise, going for a multiple of 64 is probably a safe bet.
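In practice the problem size is rarely an exact multiple of the work group size, so a common approach is to round the global size up and have the kernel ignore the extra work-items. A minimal sketch of the rounding step, in Python for brevity; `round_up_global_size` is a hypothetical helper, and the default of 64 is just the fallback suggested above:

```python
def round_up_global_size(n_items, multiple=64):
    """Return the smallest multiple of `multiple` that is >= n_items.

    `multiple` would normally be the work group size you settled on,
    e.g. a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.
    """
    if n_items < 0 or multiple <= 0:
        raise ValueError("n_items must be >= 0 and multiple must be > 0")
    # Integer ceiling division, then scale back up.
    return ((n_items + multiple - 1) // multiple) * multiple


# Example: 1000 items with a work group size of 64.
padded = round_up_global_size(1000, 64)
print(padded, "work-items,", padded - 1000, "of them padding")
```

The kernel then needs a bounds check (something like returning early when `get_global_id(0)` is not less than the real item count) so the padding work-items don't touch memory past the end of the buffers.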
GPU work ultimately gets scheduled as wavefronts (AMD's term) or warps (NVIDIA's term).
Using a GPU for both UI and compute effectively means using it for many programs at once without being aware of it: many for the GUI drawing, plus whatever compute kernels you are executing. Fast OpenCL clients are asynchronous and overlap multiple instances of work so that they aren't latency-bound. It is expected that you'll use multiple kernels in parallel.
There doesn't seem to be a "hard" limit, other than memory limiting the number of buffers you can use. When using the same GPU for UI and for compute, you must throttle your work. In my experience, issuing too much work will cause starvation of the GUI and/or your compute kernels. There doesn't seem to be any mechanism that guarantees you won't get starvation (long delays before a work item actually begins executing). Some work items may sit for a very long time (tens of seconds or more in bad cases) while the GPU processes other work items. I speculate that items are dispatched to pipelines based on data availability, and that little or nothing is there to prevent starvation of work items.
Limiting how far ahead work is enqueued greatly improves GUI responsiveness: it lets the GPU drain its work queue to empty, or nearly so, from time to time, which reduces the starvation delays for GUI drawing work items.
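One way to picture that throttling is a bounded window of outstanding batches: before submitting a new batch, wait for the oldest one if the window is full. The sketch below is plain Python with no OpenCL in it. `run_throttled`, `submit` and `wait` are hypothetical names; in a real client `submit` would enqueue a kernel and return its event, and `wait` would block on that event.

```python
from collections import deque


def run_throttled(work_items, submit, wait, max_in_flight=2):
    """Submit every item, keeping at most `max_in_flight` batches outstanding.

    Returns the largest number of batches that were ever outstanding.
    """
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    in_flight = deque()
    peak = 0
    for item in work_items:
        if len(in_flight) >= max_in_flight:
            wait(in_flight.popleft())    # block on the oldest batch first
        in_flight.append(submit(item))
        peak = max(peak, len(in_flight))
    while in_flight:                     # drain whatever is still outstanding
        wait(in_flight.popleft())
    return peak


# Stand-ins for the real enqueue/wait calls, just recording the order of events.
log = []


def fake_submit(item):
    log.append(("submit", item))
    return item                          # the "event" handle is the item itself


def fake_wait(handle):
    log.append(("wait", handle))


peak = run_throttled(range(5), fake_submit, fake_wait, max_in_flight=2)
print(peak)
print(log)
```

The right value for `max_in_flight` is something to find by experiment: large enough to keep the GPU busy, small enough that GUI drawing work doesn't queue up behind a long backlog.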