CUDA 4.0 Peer to Peer Access confusion
I have two questions related to CUDA 4.0 Peer access:
- Is there any way to copy data along a chain such as GPU#0 ---> GPU#1 ---> GPU#2 ---> GPU#3? At present my code works fine when I use just two GPUs at a time, but fails when I check peer access on a third GPU with cudaDeviceCanAccessPeer. That is, cudaDeviceCanAccessPeer(&flag_01, dev0, dev1) works on its own, but when I have two such calls, cudaDeviceCanAccessPeer(&flag_01, dev0, dev1) and cudaDeviceCanAccessPeer(&flag_12, dev1, dev2), the latter fails (0 is returned in flag_12).
- Would it work only for GPUs connected via a common PCIe bus, or does peer copy otherwise depend on the underlying PCIe interconnection? I do not understand PCIe, but nvidia-smi shows that the PCIe buses of the GPUs are 2, 3, 83 and 84.
The testbed is a dual-socket, 6-core Intel Westmere machine with 4 Nvidia Tesla C2050 GPUs.
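To see which pairs of GPUs report peer capability, it can help to query every ordered pair rather than only neighbouring ones. The following is a minimal sketch, assuming the CUDA 4.0 (or later) runtime API; it prints each device's PCI bus ID next to the result of cudaDeviceCanAccessPeer, so the output can be compared with the bus numbers shown by nvidia-smi. The output format is purely illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int src = 0; src < n; ++src) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, src);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
            return 1;
        }
        std::printf("GPU %d: %s (PCI bus %d)\n", src, prop.name, prop.pciBusID);

        // Query every other device; a 0 here means no direct peer access.
        for (int dst = 0; dst < n; ++dst) {
            if (dst == src) continue;
            int can = 0;
            err = cudaDeviceCanAccessPeer(&can, src, dst);
            if (err != cudaSuccess) {
                std::fprintf(stderr, "cudaDeviceCanAccessPeer: %s\n", cudaGetErrorString(err));
                return 1;
            }
            std::printf("  -> GPU %d: %s\n", dst, can ? "peer access" : "no peer access");
        }
    }
    return 0;
}
```

Note that a 0 written to the flag is not an API failure: the call itself returns cudaSuccess and simply reports that the pair cannot use direct peer access.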
EDIT: bandwidthTest results (HtoD and DtoH) and simpleP2P results between two GPUs (DtoD):

I suspect this is the problem. From an upcoming NVIDIA document:
NVIDIA GPUs are designed to take full advantage of the PCI-e Gen2 standard, including the Peer-to-Peer communication, but the IOH chipset does not support the full PCI-e Gen2 specification for P2P communication with other IOH chipsets.
The cudaPeerEnable() API call will return an error code if the application tries to establish a P2P relationship between two GPUs that would require P2P communication over QPI. The cudaMemcopy() function for P2P Direct Transfers automatically falls back to using a Device-to-Host-to-Device path, but there is no automatic fallback for P2P Direct Access (P2P load/store instructions in device code).
One known example system is the HP Z800 workstation with dual IOH chipsets which can run the simpleP2P example, but bandwidth is very low (100s of MB/s instead of several GB/s) because of the fallback path.
NVIDIA is investigating whether GPU P2P across QPI can be supported by adding functionality to future GPU architectures.
Reference: Intel® 5520 Chipset and Intel® 5500 Chipset Datasheet, Table 7-4: Inbound Memory Address Decoding: “The IOH does not support non-contiguous byte enables from PCI Express for remote peer-to-peer MMIO transactions. This is an additional restriction over the PCI Express standard requirements to prevent incompatibility with Intel QuickPath Interconnect”. -- http://www.intel.com/Assets/PDF/datasheet/321328.pdf
In general we advise building multi-GPU workstations and clusters that have all PCI-express slots intended for GPUs connected to a single IOH.
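The fallback behaviour described in the quoted passage suggests one way to handle the GPU#0 ---> GPU#1 ---> GPU#2 ---> GPU#3 chain from the question: enable peer access only for the hops that report it, and issue the copy with cudaMemcpyPeer for every hop. The sketch below rests on several assumptions. In the released runtime API the enable call is cudaDeviceEnablePeerAccess and the explicit peer copy is cudaMemcpyPeer. The code relies on cudaMemcpyPeer staging through host memory when direct access is unavailable, in line with the fallback the quoted passage describes. The CHECK macro, the 1 MiB buffer size and the 0x5A test pattern are hypothetical choices made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(e_));                     \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    int n = 0;
    CHECK(cudaGetDeviceCount(&n));
    if (n < 2) {
        std::printf("Need at least two GPUs, found %d\n", n);
        return 0;
    }

    const size_t bytes = 1 << 20;  // arbitrary 1 MiB test buffer
    std::vector<unsigned char*> buf(n, nullptr);
    for (int d = 0; d < n; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMalloc(reinterpret_cast<void**>(&buf[d]), bytes));
    }

    // Fill the buffer on GPU 0 with a recognisable byte pattern.
    CHECK(cudaSetDevice(0));
    CHECK(cudaMemset(buf[0], 0x5A, bytes));

    // Walk the chain 0 -> 1 -> ... -> n-1, one hop at a time.
    for (int src = 0; src + 1 < n; ++src) {
        const int dst = src + 1;
        int fwd = 0, bwd = 0;
        CHECK(cudaDeviceCanAccessPeer(&fwd, src, dst));
        CHECK(cudaDeviceCanAccessPeer(&bwd, dst, src));
        const bool direct = fwd && bwd;

        if (direct) {
            // Peer access is one-directional, so enable it both ways.
            CHECK(cudaSetDevice(src));
            CHECK(cudaDeviceEnablePeerAccess(dst, 0));
            CHECK(cudaSetDevice(dst));
            CHECK(cudaDeviceEnablePeerAccess(src, 0));
        }

        // Issued for every hop; assumed to be staged via the host
        // when the pair has no direct peer access.
        CHECK(cudaMemcpyPeer(buf[dst], dst, buf[src], src, bytes));
        std::printf("GPU %d -> GPU %d: %s\n", src, dst,
                    direct ? "peer access enabled" : "no peer access, fallback path");
    }

    // Copy the last buffer back to the host and verify the pattern.
    std::vector<unsigned char> host(bytes);
    CHECK(cudaSetDevice(n - 1));
    CHECK(cudaMemcpy(host.data(), buf[n - 1], bytes, cudaMemcpyDeviceToHost));
    bool ok = true;
    for (unsigned char b : host) ok = ok && (b == 0x5A);
    std::printf("Chain copy %s\n", ok ? "verified" : "FAILED");

    for (int d = 0; d < n; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaFree(buf[d]));
    }
    return ok ? 0 : 1;
}
```

On a dual-IOH machine like the one in the question, the hop that crosses between the two IOHs would be expected to take the slower fallback path. Kernels that dereference a peer pointer directly have no such fallback, so that kind of access has to stay within GPUs attached to the same IOH.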