Questions about CUDA 4.0 and the unified memory model
NVIDIA seems to be touting that CUDA 4.0 allows programmers to use a unified memory model between the CPU and GPU. This is not going to replace the need to manage memory manually on the GPU and CPU for best performance, but will it allow for easier implementations that can be tested, proven, and then optimised (by manually managing GPU and CPU memory)? I'd like to hear comments or opinions :)
From what I read, the important difference is that with 2 or more GPUs, you will be able to transfer memory from GPU1 to GPU2 without touching host RAM. You will also be able to control 2 GPUs with only one thread on the host.
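A rough sketch of what that looks like with the CUDA 4.0 peer-to-peer APIs (cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess, cudaMemcpyPeer), driven from a single host thread. The device IDs and buffer size are just placeholders:

```cpp
// Sketch: copy a buffer from GPU 0 to GPU 1 from one host thread.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 1 << 20;
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);   // can GPU 1 access GPU 0's memory?

    float *src = NULL, *dst = NULL;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);
    if (canAccess)
        cudaDeviceEnablePeerAccess(0, 0);        // let device 1 access device 0

    // Copy GPU 0 -> GPU 1 directly; with peer access enabled the driver
    // avoids staging the data through host RAM.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);

    printf("peer copy issued, peer access %s\n", canAccess ? "enabled" : "unavailable");
    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```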
Hmmm, that seems like big news! The Thrust library, built by NVIDIA's own engineers, already gives you some of this flavour: you can move data from RAM to the GPU's DRAM with a mere = sign (no need to call cudaMalloc, cudaMemcpy and the like). So Thrust makes CUDA C feel more like 'just C'.
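Not code from the answer itself, just a minimal Thrust sketch of that "= sign" style of transfer:

```cpp
// Host <-> device copies expressed as plain assignment via Thrust.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    thrust::host_vector<int> h(1000, 42);   // ordinary host (RAM) storage

    // Assignment copies host RAM to GPU DRAM; no explicit cudaMalloc/cudaMemcpy.
    thrust::device_vector<int> d = h;

    thrust::sort(d.begin(), d.end());       // runs on the GPU

    h = d;                                  // copy the result back the same way
    return 0;
}
```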
Maybe they'll integrate this into the CUDA API in the future. Note that behind the scenes the procedure will be the same (and will remain so), just hidden from the programmer for convenience. (I don't like that.)
Edit: CUDA 4.0 has been announced, and Thrust will be integrated with it.
The "unified" memory only refers to address space. Host and device pointers are allocated from the same 64-bit address space, so any given pointer range is unique across the process. As a result, CUDA can infer from the pointer which device a pointer range "belongs to."
It's important not to confuse address spaces with the ability to read/write those pointer ranges. The CPU will not be able to dereference device memory pointers. I believe that on unified-address-capable platforms, all host allocations will be mapped by default, though, so the GPU(s) will be able to dereference host allocations.
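A hedged sketch of that last point, assuming a UVA-capable platform: a pinned host buffer from cudaMallocHost can be handed straight to a kernel, which dereferences it over the bus (zero-copy). The kernel name and sizes are invented for the example:

```cpp
// Sketch: a kernel dereferencing a mapped host allocation under UVA.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void incrementAll(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;   // directly dereferences a host pointer
}

int main()
{
    const int n = 1024;
    int *h_data = NULL;
    cudaMallocHost(&h_data, n * sizeof(int));   // pinned; mapped by default under UVA
    for (int i = 0; i < n; ++i) h_data[i] = i;

    incrementAll<<<(n + 255) / 256, 256>>>(h_data, n);
    cudaDeviceSynchronize();

    printf("h_data[10] = %d\n", h_data[10]);    // expect 11
    cudaFreeHost(h_data);
    return 0;
}
```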
Note: the default (WDDM) driver model on Windows Vista/Windows 7 does not support this feature.