开发者

CUDA Device To Device transfer expensive

I have written some code to try to swap quadrants of a 2D matrix for FFT purposes, that is stored in a flat array.

    int leftover = W-dcW;

    T *temp;
    T *topHalf;
cudaMalloc((void **)&temp, dcW * sizeof(T));

    //swap every row, left and right
    for(int i = 0; i < H; i++)
    {
        cudaMemcpy(temp, &data[i*W], dcW*sizeof(T),cudaMemcpyDeviceToDevice);
        cudaMemcpy(&data[i*W],&data[i*W+dcW], leftover*sizeof(T), cudaMemcpyDeviceToDevice);
        cudaMemcpy(&data[i*W+leftover], temp, dcW*sizeof(T), cudaMemcpyDeviceToDevice); 
    }

cudaMalloc((void **)&topHalf, dcH*W* sizeof(T));
    leftover = H-dcH;
    cudaMemcpy(topHalf, data, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(data, &data[dcH*W], leftover*W*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(&data[leftover*W], topHalf, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);

Notice that this code takes device pointers, and does DeviceToDevice transfers.

Why does this seem to run so slow? Can this be optimized somehow? I timed this compared to the same operation on host using regular memcpy 开发者_如何学运维and it was about 2x slower.

Any ideas?


I ended up writing a kernel to do the swaps. This was indeed faster than the Device to Device memcpy operations


Perhaps the following solution to perform the 2d fftshift in CUDA would be of interest:

#define IDX2R(i,j,N) (((i)*(N))+(j))

__global__ void fftshift_2D(double2 *data, int N1, int N2)
{
    int i = threadIdx.y + blockDim.y * blockIdx.y;
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if (i < N1 && j < N2) {
        double a = pow(-1.0, (i+j)&1);

        data[IDX2R(i,j,N2)].x *= a;
        data[IDX2R(i,j,N2)].y *= a;
    }
}

It consists in multiplying the matrix to be transformed by a chessboard of 1s and -1s which is equivalent to the multiplication by exp(-j*(n+m)*pi) and thus to shifts in both directions in the conjugate domain.

You have to call this kernel before and after the application of the CUFFT.

One pro is that memory movements/swapping are avoided.

IMPROVEMENT IN SPEED

Following the suggestion received at the NVIDIA Forum, improved speed can be achieved as by changing the instruction

double a = pow(-1.0,(i+j)&1);

to

double a = 1-2*((i+j)&1);

to avoid the use of the slow routine pow.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜