stack overflow exception at program start (CUDA Monte Carlo Pi)

2023-03-09 16:59 问答作者：

My problem is that I am receiving a stack overflow exception at program start when the program first enters main. My program is a Parallel Monte Carlo Pi calculator using CUDA. When I try and debug the program in Visual Studio, the exception pops up before any breakpoint I can select. Any help is appreciated.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>

#define NUM_THREAD 512
#define NUM_BLOCK 65534

///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void reduce0(float *g_odata) {
extern __shared__ int sdata[];

// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_odata[i];
__syncthreads();

// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
    if (tid % (2*s) == 0) { // only threadIDs开发者_StackOverflow divisible by the step participate
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int  trials, curandState *states){
    extern __shared__ int sdata[];
//  unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int k, incircle;
    float x, y, z;
    incircle = 0;

    curand_init(1234, i, 0, &states[i]);

    for(k = 0; k < trials; k++){

    x = curand_uniform(&states[i]);
    y = curand_uniform(&states[i]);
    z = sqrt(x*x + y*y);
    if (z <= 1) incircle++;
    else{}
    }
    __syncthreads();
    g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {

    float* solution = (float*)calloc(100, sizeof(float));
    float *sumDev, sumHost[NUM_BLOCK*NUM_THREAD];
    int trials, total; 
    curandState *devStates;



    trials = 100;
    total = trials*NUM_THREAD*NUM_BLOCK;

    dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
    dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
    size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
    cudaMalloc((void **) &sumDev, size); // Allocate array on device
    cudaMalloc((void **) &devStates, size*sizeof(curandState));
    // Do calculation on device by calling CUDA kernel
    monteCarlo <<<dimGrid, dimBlock, size>>> (sumDev, trials, devStates);
        // call reduction function to sum
    reduce0 <<<dimGrid, dimBlock, size>>> (sumDev);
    // Retrieve result from device and store it in host array
    cudaMemcpy(sumHost, sumDev, size, cudaMemcpyDeviceToHost);

    *solution = 4*(sumHost[0]/total);
    printf("%.*f\n", 1000, *solution);
    free (solution);
    //*solution = NULL;
    return 0;
}

I would assume the problem is this:

float *sumDev, sumHost[NUM_BLOCK*NUM_THREAD];

for

#define NUM_THREAD 512
#define NUM_BLOCK 65534

That leaves you with a roughly 130Mb statically declared array. I doubt the compiler runtime library can deal with such a large static allocation, which is why you get an instant stack overflow. Replace it with a dynamic allocation and the stack overflow problem will go away. But then read Pavan's post carefully, because once you fix the stack overflow, the CUDA code itself also needs some redesign before it will work.

You are declaring the size of shared memory = size; like here

monteCarlo <<<dimGrid, dimBlock, size>>>

The value of size = 512 * 65534 * 4 = 2^9 * 2^16 * 2^2 = 2^27 (more than the maximum value of shared memory on any card I can think of).

But looking at your kernels, I think you want the shared memory to be equal to the number of threads you have.

So you either need to do

1)
this for launching your kernels

monteCarlo <<<dimGrid, dimBlock, (NUM_THREADS * sizeof(int))>>>

2)
Or use this for launching your kernels

monteCarlo <<<dimGrid, dimBlock>>>

And this to declare your shared memory inside your kernel.

__shared__ int sdata[NUM_THREADS]; // Note: no extern before __shared__

I personally prefer method two for these kinds of kernels because the shared memory is proportional to the number of threads, but the number of threads is known to be constant. It is also slightly faster.

EDIT

Apart from the forementioned problems I doubt that this might be causing problems too.

 cudaMalloc((void **) &devStates, size*sizeof(curandState));

Becuase size itself is this.

size = NUM_BLOCKS * NUM_THREADS * sizeof(float);

May be you wanted to do this instead ?

cudaMalloc((void **) &devStates, (NUM_BLOCKS *NUM_THREADS)*sizeof(curandState));

As for the actual stack overflow problem you may want to look at talonmies post.

继续阅读：exception montecarlo pi stack-overflow

stack overflow exception at program start (CUDA Monte Carlo Pi)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？