Cuda cudaMemcpy and cudaMalloc
i always read that it is slow to allocate and transfer data form cpu to g开发者_StackOverflowpu. is this because cudaMalloc is slow? is it because cudaMemcpy is slow? or is it becuase both of them are slow?
It is mostly tied to 2 things, the first begin the speed of the PCIExpress bus between the card and the cpu. The other is tied to the way these functions operate. Now, I think the new CUDA 4 has better support for memory allocation (standard or pinned) and a way to access memory transparently across the bus.
Now, let's face it, at some point, you'll need to get data from point A to point B to compute something. Best way to handle is to either have a really large computation going on or use CUDA streams to overlap transfer and computation on the GPU.
In most applications, you should be doing cudaMalloc once at the beginning and then not call it any more. Thus, the bottleneck is really cudaMemcpy.
This is due to physical limitations. For a standard PCI-E 2.0 x16 link, you'll get 8GB/s theoretical but typically 5-6GB/s in practice. Compare this w/ even a mid range Fermi like the GTX460 which has 80+GB/s bandwidth on the device. You're in effect taking an order of magnitude hit in memory bandwidth, spiking your data transfer times accordingly.
GPGPUs are supposed to be supercomputers and I believe Seymour Cray (the supercomputer guy) said, "a supercomputer turns compute-bound problems into I/O bound problems". Thus, optimizing data transfers is everything.
In my personal experience, iterative algorithms are the ones that by far show the best improvements by porting to GPGPU (2-3 orders of magnitude) due to the fact that you can eliminate transfer time by keeping everything in-situ on the GPU.
精彩评论