Convolution, array with filter, in CUDA

2023-01-18 15:59

I'm trying to convolve a 256x256 array of data with a 3x3 filter on a GPU, using shared memory. I understand that I need to break the array up into blocks and then apply the filter within each block. This ultimately means that the blocks will overlap along their edges, and that some padding is needed around the edges of the array, where there is no data, so that the filter works properly.

int grid = (256/(16+3-1))*(256/(16+3-1)), where 256 is the length or width of my array, 16 is the length or width of my block in shared memory, 3 is the length or width of my filter, and I subtract one to make it even.

int thread = (16+3-1)*(16+3-1)

Now I call my kernel as kernel<<<grid, thread>>>(output, input, 256), where input and output are arrays of size 256*256.
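A rough sketch of the launch geometry, assuming each block produces a 16x16 patch of output and carries a one-pixel halo on every side (so 18x18 threads per block). Under that assumption the grid is counted in 16-pixel steps, giving 256/16 = 16 blocks per side, whereas 256/(16+3-1) truncates to 14 in integer arithmetic. Because the kernel reads the .y components of the block and thread indices, the sizes are passed as 2-D dim3 values rather than single ints. IMAGE, TILE, FILTER and HALO_BLOCK are illustrative names, not part of the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative names for the sizes mentioned in the question (assumptions).
constexpr int IMAGE      = 256;                // image width and height
constexpr int TILE       = 16;                 // output pixels per block side
constexpr int FILTER     = 3;                  // filter width and height
constexpr int HALO_BLOCK = TILE + FILTER - 1;  // threads per block side, halo included

int main()
{
    // One block per TILE x TILE patch of output; round up so that a size
    // that is not a multiple of TILE is still fully covered.
    const int tilesPerSide = (IMAGE + TILE - 1) / TILE;

    dim3 grid(tilesPerSide, tilesPerSide);   // 2-D grid of blocks
    dim3 block(HALO_BLOCK, HALO_BLOCK);      // 2-D block of threads

    std::printf("grid  = %u x %u blocks\n", grid.x, grid.y);
    std::printf("block = %u x %u threads\n", block.x, block.y);
    return 0;
}
```

With these sizes the block holds 18*18 = 324 threads, which is within the 1024-threads-per-block limit of current GPUs.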

#define N 3   // filter width

__global__ void kernel(float *input, float *output, int size)
{
    __shared__ float tile[16+3-1][16+3-1];
    unsigned int bIdx = blockIdx.x;
    unsigned int bIdy = blockIdx.y;
    unsigned int tIdx = threadIdx.x;
    unsigned int tIdy = threadIdx.y;
    unsigned int width = 16+3-1-1;   // last column of the tile
    unsigned int height = 16+3-1-1;  // last row of the tile

    //i is for input
    unsigned int iX = bIdx * 3 + tIdx;
    unsigned int iY = bIdy * 3 + tIdy;

    if (tIdx == 0 || tIdx == width || tIdy == 0 || tIdy == height)
    {
        //this will pad the outside edges
        tile[tIdy][tIdx] = 0;
    }
    else
    {
        //this will fill in the tile with real data
        unsigned int iin = iY * size + iX;
        tile[tIdy][tIdx] = input[iin];
    }

    __syncthreads();

    //I believe the above is correct; below, where I do the convolution, I feel it is wrong
    float result = 0;
    for(int fX=-N/2; fX<=N/2; fX++){
        for(int fY=-N/2; fY<=N/2; fY++){
            if(iY+fX>=0 && iY+fX<size && iX+fY>=0 && iX+fY<size)
                result+=tile[tIdx+fX][tIdy+fY];
        }
    }
    output[iY*size+iX] = result/(3*3);
}

When I run the code with the convolution part enabled, I get a kernel error. Any insights or suggestions?

Check out the sobelFilter SDK sample.

It uses texture to deal with the edge cases, overfetches blocks slightly (but the texture cache makes that more efficient), and uses shared memory for the processing.

The subtle thing about the shared memory is that you get 4-way bank conflicts if you read adjacent bytes. One way to get around this, illustrated in the sobelFilter sample, is to unroll your loop 4x and access every fourth byte.
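For comparison with the texture-based sample, here is a minimal sketch of the pure shared-memory tiling the question describes. It is one possible arrangement under stated assumptions, not the SDK's code: each 18x18 block loads an 18x18 input tile (zero-padded outside the image), and only the inner 16x16 threads write output. The names TILE, N, R, BLOCK and boxFilter are made up for the example. Three things in the posted code look like likely sources of the kernel error and are handled differently here: the input offset is scaled by 3 rather than by the 16-pixel tile width, the loop reads tile[tIdx+fX][tIdy+fY], which steps outside the shared array for threads on the tile border, and the launch passes (output, input) to a kernel declared as (input, output).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative constants (assumptions); the question hard-codes 16 and 3.
constexpr int TILE  = 16;            // output pixels per block side
constexpr int N     = 3;             // filter width (odd)
constexpr int R     = N / 2;         // halo radius
constexpr int BLOCK = TILE + N - 1;  // threads per block side, halo included

__global__ void boxFilter(const float *input, float *output, int size)
{
    __shared__ float tile[BLOCK][BLOCK];

    const int tx = threadIdx.x, ty = threadIdx.y;
    // Input pixel this thread loads: the block's tile origin, shifted by the halo.
    const int iX = blockIdx.x * TILE + tx - R;
    const int iY = blockIdx.y * TILE + ty - R;

    // Zero-pad anything that falls outside the image.
    const bool inside = iX >= 0 && iX < size && iY >= 0 && iY < size;
    tile[ty][tx] = inside ? input[iY * size + iX] : 0.0f;

    __syncthreads();

    // Halo threads only load; interior threads each produce one output pixel.
    const bool interior = tx >= R && tx < BLOCK - R && ty >= R && ty < BLOCK - R;
    if (!interior || !inside) return;

    float sum = 0.0f;
    for (int fY = -R; fY <= R; ++fY)
        for (int fX = -R; fX <= R; ++fX)
            sum += tile[ty + fY][tx + fX];  // row index first, same as when storing

    output[iY * size + iX] = sum / (N * N);
}

int main()
{
    const int size = 256;
    std::vector<float> hIn(size * size, 1.0f), hOut(size * size, 0.0f);
    const size_t bytes = hIn.size() * sizeof(float);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(BLOCK, BLOCK);
    dim3 grid((size + TILE - 1) / TILE, (size + TILE - 1) / TILE);
    boxFilter<<<grid, block>>>(dIn, dOut, size);

    // Launches are asynchronous: check the launch, then wait for execution errors.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    // With an all-ones input, an interior pixel averages to 1 and a corner to 4/9.
    std::printf("center = %f, corner = %f\n", hOut[128 * size + 128], hOut[0]);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Checking cudaGetLastError and cudaDeviceSynchronize after the launch is what turns a generic "kernel error" into a specific message, which usually narrows an out-of-bounds access down quickly.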
