
producer/consumer model and concurrent kernels

I'm writing a CUDA program that can be interpreted as a producer/consumer model. There are two kernels: one produces data in device memory, and the other consumes the produced data. The number of consuming threads is set to a multiple of 32, the warp size, and each warp waits until 32 data items have been produced.

I've run into a problem. If the consumer kernel is launched after the producer, the program does not terminate. Sometimes the program runs indefinitely even when the consumer is launched first.

What I'm asking is: is there a good way to implement a producer/consumer model in CUDA? Can anybody give me a direction or a reference?

Here is the skeleton of my code:

**kernel1**:

while LOOP_COUNT
    compute something
    if SOME CONDITION
        atomically increment PRODUCE_COUNT          
        write data into DATA            
atomically increment PRODUCER_DONE

**kernel2**:
while FOREVER
    CURRENT=0
    if FINISHED CONDITION
        return
    if PRODUCER_DONE==TOTAL_PRODUCER && CONSUME_COUNT==PRODUCE_COUNT
        return
    if (MY_WARP+1)*32+(CONSUME_WARPS*32*CURRENT)-1 < PRODUCE_COUNT
        process the data
        if SOME CONDITION
            set FINISHED CONDITION true
        increment CURRENT
    else if PRODUCER_DONE==TOTAL_PRODUCER
        if CURRENT*32*CONSUME_WARPS+THREAD_INDEX < PRODUCE_COUNT
            process the data
            if SOME CONDITION
                set FINISHED CONDITION true
            increment CURRENT


Since you did not provide the actual code, it is hard to tell where the bug is. Usually the skeleton is correct, but the problem lies in the details.

One possible issue that I can think of:

By default, CUDA gives no guarantee that global memory writes by one kernel will be visible to another kernel, with the exception of atomic operations. It can then happen that your first kernel increments PRODUCER_DONE while there is still no data in DATA.

Fortunately, you are given the intrinsic function __threadfence(), which halts the execution of the current thread until its previous global memory writes are visible to the rest of the device. You should call it before atomically incrementing PRODUCER_DONE. Check out chapter B.5 of the CUDA C Programming Guide.
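A minimal sketch of how that might look on the producer side, reusing the names from your skeleton (the payload type, the loop bound and the condition are placeholders, not part of your original code):

```cuda
// Sketch only: DATA, PRODUCE_COUNT and PRODUCER_DONE mirror the names in
// the skeleton above; the int payload and the condition are placeholders.
__global__ void producer_kernel(int *DATA,
                                unsigned int *PRODUCE_COUNT,
                                unsigned int *PRODUCER_DONE,
                                int loop_count)
{
    for (int i = 0; i < loop_count; ++i) {
        int value = i;                                         // "compute something"
        if (value % 2 == 0) {                                  // "SOME CONDITION"
            unsigned int slot = atomicAdd(PRODUCE_COUNT, 1u);  // reserve a slot
            DATA[slot] = value;                                // write data into DATA
        }
    }
    // Make sure all the DATA writes above are visible to the rest of the
    // device before the consumer can observe the incremented PRODUCER_DONE.
    __threadfence();
    atomicAdd(PRODUCER_DONE, 1u);
}
```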

Another issue that may or may not appear:

From the point of view of kernel2, the compiler may deduce that PRODUCE_COUNT, once read, never changes. It may then optimise the code so that, once the value is loaded into a register, it keeps reusing it instead of querying global memory every time. Solution? Declare it volatile, or read the value with another atomic operation.
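Both workarounds, sketched on the consumer side (again the names follow the skeleton; which one fits better depends on the surrounding loop):

```cuda
// Sketch only: two ways for the consumer to force a fresh read of
// PRODUCE_COUNT from global memory on every poll.

// (a) Access the counter through a volatile pointer, so the compiler may
//     not cache the loaded value in a register across loop iterations.
__device__ unsigned int read_count_volatile(volatile unsigned int *PRODUCE_COUNT)
{
    return *PRODUCE_COUNT;
}

// (b) Read it through an atomic operation: atomicAdd with 0 returns the
//     current value and always goes through to memory.
__device__ unsigned int read_count_atomic(unsigned int *PRODUCE_COUNT)
{
    return atomicAdd(PRODUCE_COUNT, 0u);
}
```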

(Edit) Third issue:

I forgot about one more problem. On pre-Fermi cards (GeForce cards before the 400 series) you can run only a single kernel at a time. So, if you schedule the producer to run after the consumer, the system will wait for the consumer kernel to finish before the producer kernel starts executing. If you want both to run at the same time, put both into a single kernel and branch on some block index.
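A sketch of that single-kernel variant; the helper functions and the split index first_consumer_block are hypothetical stand-ins for the bodies of kernel1 and kernel2:

```cuda
// Sketch only: one grid, two roles. Blocks below the split index act as
// producers, the rest as consumers, so both roles run at the same time
// even on a device that serialises kernel launches.
__device__ void produce(int *DATA, unsigned int *PRODUCE_COUNT,
                        unsigned int *PRODUCER_DONE)
{
    /* body of kernel1 goes here */
}

__device__ void consume(int *DATA, unsigned int *PRODUCE_COUNT,
                        unsigned int *PRODUCER_DONE)
{
    /* body of kernel2 goes here */
}

__global__ void producer_consumer_kernel(int *DATA,
                                         unsigned int *PRODUCE_COUNT,
                                         unsigned int *PRODUCER_DONE,
                                         int first_consumer_block)
{
    if (blockIdx.x < first_consumer_block)
        produce(DATA, PRODUCE_COUNT, PRODUCER_DONE);
    else
        consume(DATA, PRODUCE_COUNT, PRODUCER_DONE);
}
```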
