CUDA 4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour

2023-02-15 08:01 问答作者：

I wrote a code which uses many host (OpenMP) threads per one GPU. Each thread has its own CUDA stream to order it requests. It looks very similar to below code:

#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
    cudaStream_t stream;
    cudaStreamCreate(&s开发者_StackOverflow社区tream);

    while (hasJob()) {

        //... code to prepare job - dData, hData, dataSize etc

        cudaError_t streamStatus = cudaStreamQuery(stream);
        if (streamStatus == cudaSuccess) {
             cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
             doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
        else {
             CUDA_CHECK(streamStatus);
        }
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}

And everything were good till I got many small jobs. In that case, from time to time, cudaStreamQuery returns cudaErrorNotReady, which is for me unexpected because I use cudaStreamSynchronize. Till now I were thinking that cudaStreamQuery will always return cudaSuccess if it is called after cudaStreamSynchronize. Unfortunately it appeared that cudaStreamSynchronize may finish even when cudaStreamQuery still returns cudaErrorNotReady.

I changed the code into the following and everything works correctly.

#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid < STREAM_NUMBER; sid++) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    while (hasJob()) {

        //... code to prepare job - dData, hData, dataSize etc

        cudaError_t streamStatus;
        while ((streamStatus = cudaStreamQuery(stream)) == cudaErrorNotReady) {
             cudaStreamSynchronize();
        }
        if (streamStatus == cudaSuccess) {
             cudaMemcpyAsync(dData, hData, dataSize, cudaMemcpyHostToDevice, stream);
             doTheJob<<<gridDim, blockDim, smSize, stream>>>(dData, dataSize);
        else {
             CUDA_CHECK(streamStatus);
        }
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
}

So my question is.... is it a bug or a feature?

EDIT: it is similar to JAVA

synchronize {
    while(waitCondition) {
         wait();
    }
}

What is under

//... code to prepare job - dData, hData, dataSize etc

Do you have any functions of kind cudaMemcpyAsync there, or the only memory transfer is in the code you have shown? Those are asynchronous functions may exit early, even when the code is not at the destination yet. When that happens cudaStreamQuery will return cudaSuccess only when memory transfers succeed.

Also, does hasJob() uses any of the host-CUDA functions?

If I am not mistaken, in a single stream, it is not possible to execute both kernel and memory transfers. Therefore, calling cudaStreamQuery is necessary only when a kernel depends on the data transferred by a different stream.

Didn't notice it earlier: cudaStreamSynchronize() should take a parameter (stream). I am not sure which stream you are synchronising when parameter is ommited, could be that it defaults to stream 0.

继续阅读：cuda-streams openmp

CUDA 4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour

更多精彩内容

精彩评论

最新问答

产后抑郁症的症状有什么？

黑神话悟空黄风大圣是好人吗?？

喜境牛奶光美肤仪适合出差旅行使用吗？？

当贝f3放到4米之外会模糊吗？

永劫无间高价皮肤有哪些?？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？