
producer/consumer model and concurrent kernels

I'm writing a CUDA program that can be interpreted as a producer/consumer model. There are two kernels: one produces data in device memory, and the other consumes the produced data. The number of consuming threads is set to a multiple of 32, the warp size, and each warp waits until 32 data items have been produced.

I've run into a problem. If the consumer kernel is launched after the producer, the program does not terminate. Sometimes the program runs indefinitely even when the consumer is launched first.

What I'm asking is: is there a good way to implement a producer/consumer model in CUDA? Can anybody give me a direction or a reference?

Here is the skeleton of my code:

**kernel1**:

while LOOP_COUNT
    compute something
    if SOME CONDITION
        atomically increment PRODUCE_COUNT          
        write data into DATA            
atomically increment PRODUCER_DONE

**kernel2**:
while FOREVER
    CURRENT=0
    if FINISHED CONDITION
        return
    if PRODUCER_DONE==TOTAL_PRODUCER && CONSUME_COUNT==PRODUCE_COUNT
        return
    if (MY_WARP+1)*32+(CONSUME_WARPS*32*CURRENT)-1 < PRODUCE_COUNT
        process the data
        if SOME CONDITION
            set FINISHED CONDITION true
        increment CURRENT
    else if PRODUCER_DONE==TOTAL_PRODUCER
        if CURRENT*32*CONSUME_WARPS+THREAD_INDEX < PRODUCE_COUNT
            process the data
            if SOME CONDITION
                set FINISHED CONDITION true
            increment CURRENT


Since you did not provide the actual code, it is hard to tell where the bug is. Usually the skeleton is correct, but the problem lies in the details.

One possible issue that I can think of:

By default, CUDA gives no guarantee that global memory writes by one kernel will be visible to another kernel, with the exception of atomic operations. It can then happen that your first kernel increments PRODUCER_DONE while there is still no data in DATA.

Fortunately, you are given the intrinsic function __threadfence(), which halts the execution of the current thread until its previous global memory writes are visible to the rest of the device. You should call it before atomically incrementing PRODUCER_DONE. Check out chapter B.5 of the CUDA C Programming Guide.
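A minimal sketch of how that might look on the producer side, reusing the names from your skeleton (the payload type, the loop bound and the condition are placeholders, not part of your original code):

```cuda
// Sketch only: DATA, PRODUCE_COUNT and PRODUCER_DONE mirror the names in
// the skeleton above; the int payload and the condition are placeholders.
__global__ void producer_kernel(int *DATA,
                                unsigned int *PRODUCE_COUNT,
                                unsigned int *PRODUCER_DONE,
                                int loop_count)
{
    for (int i = 0; i < loop_count; ++i) {
        int value = i;                                         // "compute something"
        if (value % 2 == 0) {                                  // "SOME CONDITION"
            unsigned int slot = atomicAdd(PRODUCE_COUNT, 1u);  // reserve a slot
            DATA[slot] = value;                                // write data into DATA
        }
    }
    // Make sure all the DATA writes above are visible to the rest of the
    // device before the consumer can observe the incremented PRODUCER_DONE.
    __threadfence();
    atomicAdd(PRODUCER_DONE, 1u);
}
```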

Another issue that may or may not appear:

From the point of view of kernel2, the compiler may deduce that PRODUCE_COUNT, once read, never changes. It may then optimise the code so that, once the value is loaded into a register, it keeps reusing it instead of querying global memory every time. Solution? Declare it volatile, or read the value with another atomic operation.
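Both workarounds, sketched on the consumer side (again the names follow the skeleton; which one fits better depends on the surrounding loop):

```cuda
// Sketch only: two ways for the consumer to force a fresh read of
// PRODUCE_COUNT from global memory on every poll.

// (a) Access the counter through a volatile pointer, so the compiler may
//     not cache the loaded value in a register across loop iterations.
__device__ unsigned int read_count_volatile(volatile unsigned int *PRODUCE_COUNT)
{
    return *PRODUCE_COUNT;
}

// (b) Read it through an atomic operation: atomicAdd with 0 returns the
//     current value and always goes through to memory.
__device__ unsigned int read_count_atomic(unsigned int *PRODUCE_COUNT)
{
    return atomicAdd(PRODUCE_COUNT, 0u);
}
```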

(Edit) Third issue:

I forgot about one more problem. On pre-Fermi cards (GeForce cards before the 400 series) you can run only a single kernel at a time. So, if you schedule the producer to run after the consumer, the system will wait for the consumer kernel to finish before the producer kernel starts executing. If you want both to run at the same time, put both into a single kernel and branch on some block index.
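A sketch of that single-kernel variant; the helper functions and the split index first_consumer_block are hypothetical stand-ins for the bodies of kernel1 and kernel2:

```cuda
// Sketch only: one grid, two roles. Blocks below the split index act as
// producers, the rest as consumers, so both roles run at the same time
// even on a device that serialises kernel launches.
__device__ void produce(int *DATA, unsigned int *PRODUCE_COUNT,
                        unsigned int *PRODUCER_DONE)
{
    /* body of kernel1 goes here */
}

__device__ void consume(int *DATA, unsigned int *PRODUCE_COUNT,
                        unsigned int *PRODUCER_DONE)
{
    /* body of kernel2 goes here */
}

__global__ void producer_consumer_kernel(int *DATA,
                                         unsigned int *PRODUCE_COUNT,
                                         unsigned int *PRODUCER_DONE,
                                         int first_consumer_block)
{
    if (blockIdx.x < first_consumer_block)
        produce(DATA, PRODUCE_COUNT, PRODUCER_DONE);
    else
        consume(DATA, PRODUCE_COUNT, PRODUCER_DONE);
}
```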
