Comparing MATLAB vs CUDA correlation and reduction on a 2D array
I am trying to compare cross-correlation using FFT vs. using a sliding-window method.
My Matlab code is:
isize = 20;
n = 7;
for i = 1:n          %% 7x7 xcorr output
    for j = 1:n
        %% ref is a 26x26 (676-element) array, ffcorr1 is a 20x20 (400-element) array
        xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1, j:j+isize-1)));
    end
end
My attempt at a similar CUDA kernel:
__global__ void xc_corr(double* in_im, double* ref_im, int pix3, int isize, int n,
                        double* out1, double* temp1, double* sum_temp1)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    int q = 0;
    int i = 0;
    int j = 0;
    int summ = 0;

    for(i = 0; i < n; ++i)
    {
        for(j = 0; j < n; ++j)
        {
            summ = 0; //force update
            for(p = 0; p < pix1; ++p)
            {
                for(q = 0; q < pix1; ++q)
                {
                    temp1[((i*n+j)*pix1*pix1) + p*pix1 + q] = in_im[p*pix1+q] * ref_im[(p+i)*pix1 + (q+j)];
                    sum_temp1[((i*n+j)*pix1*pix1) + p*pix1 + q] += temp1[((i*n+j)*pix1*pix1) + p*pix1 + q];
                    out1[i*n+j] = sum_temp1[((i*n+j)*pix1*pix1) + p*pix1 + q];
                }
            }
        }
    }
}
I launch this kernel as:
int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round to max pix3 = 400
xc_corr <<< nblocks, blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);
cudaThreadSynchronize();
Somehow, when I diff the output files, I see that the CUDA kernel computes only the first 400 elements.
What is the correct way to write this kernel?
Also, what difference does it make if I declare i and j in my kernel as shown below?
int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;
int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round to max pix3 = 400
xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);
means that you are launching 64 threads per block, and just enough thread blocks to cover pix3 elements (the division rounds up). If pix3 is indeed 400, then you are processing 400 elements, because you launch 7 thread blocks of 64 threads each, i.e. 448 threads, of which 48 do nothing.
I'm not too sure what the problem is here.
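For what it's worth, one common way to restructure this is to give each thread exactly one output element of the n x n correlation map and guard against the leftover threads, so each output is written by exactly one thread. Below is a minimal sketch under that assumption; the names xc_corr_sketch, ref_w and out are mine, and I'm assuming ref_im is stored row-major with width isize+n-1 (26), as the MATLAB array sizes suggest:

// Sketch only, not a drop-in replacement: one thread per output element.
__global__ void xc_corr_sketch(const double* in_im, const double* ref_im,
                               int isize, int n, double* out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n * n) return;          // guard the threads beyond the n*n outputs

    int i = idx / n;                   // row offset of the window into ref_im
    int j = idx % n;                   // column offset of the window
    int ref_w = isize + n - 1;         // row stride of ref_im (26, not 20)

    double summ = 0.0;
    for (int p = 0; p < isize; ++p)
        for (int q = 0; q < isize; ++q)
            summ += in_im[p * isize + q] * ref_im[(p + i) * ref_w + (q + j)];

    out[idx] = summ;                   // one write per thread, no race
}

Launched with something like xc_corr_sketch<<<(n*n + 63)/64, 64>>>(...), your i/j/p/q loops collapse into per-thread work and the temp1/sum_temp1 scratch arrays are no longer needed.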
Also,
int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;
blocksize and nblocks are actually converted to dim3 values, so they each have (x,y,z) components. Calling the kernel with <<<7,64>>> is therefore equivalent to
dim3 nblocks(7,1,1);
dim3 blocksize(64,1,1);
kernel<<<nblocks, blocksize>>>();
Inside the kernel, blockIdx and threadIdx each have x, y, and z components, corresponding to the 3D grid of blocks and the 3D block of threads you are in. In your case, since you only launch in the x dimension, blockDim.y and gridDim.y are 1, so blockIdx.y and threadIdx.y are always 0 no matter what. Essentially, they're useless in those index expressions.
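If you actually want the y components to mean something, you have to launch a 2D grid explicitly. A small sketch for illustration only; width, height and kernel here are placeholders, not names from your code:

dim3 blocksize(16, 16);   // 256 threads per block, arranged 16 x 16
dim3 nblocks((width  + blocksize.x - 1) / blocksize.x,
             (height + blocksize.y - 1) / blocksize.y);
kernel<<<nblocks, blocksize>>>(/* ... */);

// Inside the kernel the usual 2D global index is then:
//   int col = blockIdx.x * blockDim.x + threadIdx.x;
//   int row = blockIdx.y * blockDim.y + threadIdx.y;
// Note it is blockDim (the block size) in the product, not something like
// threadIdx.x * threadIdx.y as in your declarations.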
Honestly, it sounds like you should reread the basics of CUDA in the programming guide, because there are quite a few fundamentals you seem to be missing. Explaining them all here wouldn't be economical, since it's all written down in documentation you can get from NVIDIA. And if you just want a faster FFT with CUDA, there are libraries (cuFFT, for example) on NVIDIA's CUDA Zone that you can download and install, which will do it for you if you don't care about learning CUDA.
Best of luck mate.
P.S. You don't need to call cudaThreadSynchronize() after each kernel ;)
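For instance, a blocking cudaMemcpy on the default stream already waits for all preceding work on that stream, so the pattern below is enough; the kernel name, pointers and size here are placeholders:

some_kernel<<<nblocks, blocksize>>>(/* ... */);
// No cudaThreadSynchronize() needed here: the blocking cudaMemcpy below waits
// for the kernel above to finish before copying the results back to the host.
cudaMemcpy(host_result, device_result, nbytes, cudaMemcpyDeviceToHost);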