CUDA - memcpy2d - wrong pitch

I just started CUDA programming and was trying to execute the code shown below. The idea is to copy a two-dimensional array to the device, calculate the sum of all elements, and retrieve the sum afterwards. (I know that this algorithm is not parallelized; in fact it is doing more work than necessary. It is, however, just intended as practice for memcpy.)

#include<stdio.h>
#include<cuda.h>
#include <iostream>
#include <cutil_inline.h>

#define height 50
#define width 50

using namespace std;

// Device code
__global__ void kernel(float* devPtr, int pitch,int* sum)
{
    int tempsum = 0;
    for (int r = 0; r < height; ++r) {
        int* row = (int*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            int element = row[c];
            tempsum = tempsum + element;
        }
    }
    *sum = tempsum;
}

//Host Code
int main()
{

    int testarray[2][8] = {{4,4,4,4,4,4,4,4},{4,4,4,4,4,4,4,4}};
    int* sum = 0;
    int* sumhost = 0;
    sumhost = (int*)malloc(sizeof(int));

    cout << *sumhost << endl;

    float* devPtr;
    size_t pitch;
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(int), height);
    cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);

    cudaMalloc((void**)&sum, sizeof(int));
    kernel<<<1, 4>>>(devPtr, pitch, sum);
    cutilCheckMsg("kernel launch failure");
    cudaMemcpy(sumhost, sum, sizeof(int), cudaMemcpyDeviceToHost);

    cout << *sumhost << endl;

    return 0;
}

This code compiles just fine (on the 4.0 SDK release candidate). However, as soon as I try to execute it, I get

0
cpexample.cu(43) : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : invalid pitch argument.

Which is unfortunate, since I have no idea how to fix it ;-(. As far as I know, the pitch is an offset in memory that allows faster copying of data. However, such a pitch is only used in device memory, not in host memory, isn't it? Therefore the pitch of my host memory should be 0, shouldn't it?

Moreover, I would also like to ask two other questions:

  • If I declare a variable like int* sumhost (see above), where does this pointer point? At first to host memory, and after cudaMalloc to device memory?
  • cutilCheckMsg was very handy in this case. Are there similar functions for debugging I should know of?


In this line of your code:

cudaMemcpy2D(devPtr,pitch,testarray,0,8* sizeof(int),4,cudaMemcpyHostToDevice);

you're saying the source pitch of testarray is 0. But how can that be, when elements of a pitched allocation are addressed as T* elem = (T*)((char*)base_address + row * pitch) + column? If we substituted 0 for pitch in that formula, we would not get the right address for an element at some two-dimensional (x,y) offset. The rule for the pitch value is pitch = width + padding, where width is the size of one row in bytes. On the host, the padding is usually 0, but the width is not 0 unless there is nothing in your array. On the device there may be extra padding, which is why the pitch returned by cudaMallocPitch may be larger than the declared width of the array; in general pitch >= width. So even on the host side, the source pitch should be at least the size of one row in bytes, which for testarray is 8*sizeof(int). Finally, the height of your 2D array on the host is only 2 rows, not the 4 you pass to cudaMemcpy2D.
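
For illustration, here is a minimal, self-contained sketch of what the corrected copy might look like. It reuses the width/height defines and variable names from your code; only the source pitch and the row count passed to cudaMemcpy2D change, and the error string is printed directly instead of using cutilCheckMsg:

#include <cstdio>
#include <cuda_runtime.h>

#define height 50
#define width 50

int main()
{
    int testarray[2][8] = {{4,4,4,4,4,4,4,4},{4,4,4,4,4,4,4,4}};

    float* devPtr;
    size_t pitch;
    // Device rows may be padded, so cudaMallocPitch reports the actual row stride in pitch.
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(int), height);

    // Source pitch = bytes per host row (8 ints, no padding on the host);
    // height = 2, the number of rows testarray actually has.
    cudaError_t err = cudaMemcpy2D(devPtr, pitch,
                                   testarray, 8 * sizeof(int),
                                   8 * sizeof(int), 2,
                                   cudaMemcpyHostToDevice);
    printf("cudaMemcpy2D: %s\n", cudaGetErrorString(err));

    cudaFree(devPtr);
    return 0;
}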

As an answer to your question about what happens with allocated pointers: if you allocate a pointer with malloc(), the pointer is given an address that resides in host memory, so you can dereference it on the host side, but not on the device side. On the other hand, a pointer allocated with cudaMalloc() is given the address of memory residing on the device. If you dereference that pointer on the host, it is not pointing to allocated host memory, and unpredictable results will ensue. It is fine, however, to pass this pointer value to a kernel, because when it is dereferenced on the device side it points to memory the device can access.

The CUDA runtime keeps these two memory spaces separate and provides memory copy functions that move data back and forth between device and host, using the addresses held by these pointers as the source and/or destination of the copy, depending on the requested direction (host-to-device or device-to-host). Now, if you took the same int*, allocated it first with malloc() and then (after hopefully calling free() on it) with cudaMalloc(), the pointer would first hold a host address and then a device address. You would have to keep track of its state in order to avoid the unpredictable results of dereferencing a device address in host code or a host address in device code.
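
As a quick illustration of the two address spaces, here is a small, self-contained sketch; the variable names are my own, loosely mirroring sumhost and sum from your code:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // malloc() returns an address in host memory: safe to dereference in host code.
    int* sumhost = (int*)malloc(sizeof(int));
    *sumhost = 42;

    // cudaMalloc() stores a device address in sumdev: do not dereference it on the
    // host; only pass it to kernels or to the cudaMemcpy* functions.
    int* sumdev = NULL;
    cudaMalloc((void**)&sumdev, sizeof(int));

    // The runtime bridges the two address spaces; the last argument says which
    // pointer is the source and which is the destination.
    cudaMemcpy(sumdev, sumhost, sizeof(int), cudaMemcpyHostToDevice);

    *sumhost = 0;                                                     // clobber the host copy
    cudaMemcpy(sumhost, sumdev, sizeof(int), cudaMemcpyDeviceToHost); // fetch it back

    printf("%d\n", *sumhost);   // prints 42 again

    cudaFree(sumdev);
    free(sumhost);
    return 0;
}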
