How to structure data for optimal speed in a CUDA app
I am attempting to write a simple particle system that leverages CUDA to update the particle positions. Right now I am defining a particle as an object with a position defined by three float values and a velocity also defined by three float values. When updating the particles, I add a constant value to the Y component of the velocity to simulate gravity, then add the velocity to the current position to get the new position. In terms of memory management, is it better to maintain two separate arrays of floats to store the data, or to structure it in an object-oriented way? Something like this:
struct Vector
{
    float x, y, z;
};
struct Particle
{
    Vector position;
    Vector velocity;
};
It seems like the size of the data is the same with either method (4 bytes per float, 3 floats per Vector, 2 Vectors per Particle, totaling 24 bytes). It seems like the OO approach would allow more efficient data transfer between the CPU and GPU, because I could use a single memory copy statement instead of two (and in the long run more, as there are a few other bits of information about particles that will become relevant, like Age, Lifetime, Weight/Mass, Temperature, etc.). And then there's also the simple readability of the code and the ease of dealing with it, which also makes me inclined toward the OO approach. But the examples I have seen don't use structured data, so it makes me wonder if there's a reason.
So the question is which is better: individual arrays of data or structured objects?
It's common in data parallel programming to talk about "Struct of Arrays" (SOA) versus "Array of Structs" (AOS). An array of your Particle structs is AOS, and keeping separate arrays of floats is SOA. Many parallel programming paradigms, in particular SIMD-style paradigms, prefer SOA.
In GPU programming, SOA is typically preferred because it optimises accesses to global memory. You can view the recorded presentation on Advanced CUDA C from GTC last year for a detailed description of how the GPU accesses memory.
The main point is that memory transactions have a minimum size of 32 bytes and you want to maximise the efficiency of each transaction.
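To make "efficiency of each transaction" concrete, here is a rough host-side model. It is not a CUDA API and is not taken from the presentation. It assumes a warp of 32 threads, 32-byte transactions, a 32-byte-aligned base address, and one 4-byte float read per thread; `warpLoadEfficiency` and `strideBytes` are names made up for this sketch. It counts the distinct 32-byte segments a warp touches and compares the bytes fetched with the bytes the threads actually use.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper (not part of CUDA): fraction of fetched bytes that a
// warp actually uses when thread t reads one 4-byte float located at byte
// offset t * strideBytes.
// Assumptions: 32 threads per warp, 32-byte transactions, aligned base.
double warpLoadEfficiency(std::size_t strideBytes)
{
    const std::size_t warpSize = 32;
    const std::size_t segmentBytes = 32;
    const std::size_t wordBytes = 4;

    std::set<std::size_t> segments;  // distinct 32-byte segments touched
    for (std::size_t t = 0; t < warpSize; ++t) {
        const std::size_t firstByte = t * strideBytes;
        const std::size_t lastByte = firstByte + wordBytes - 1;
        segments.insert(firstByte / segmentBytes);
        segments.insert(lastByte / segmentBytes);
    }

    const double usefulBytes = static_cast<double>(warpSize * wordBytes);
    const double fetchedBytes =
        static_cast<double>(segments.size() * segmentBytes);
    return usefulBytes / fetchedBytes;
}
```

Under these assumptions a stride of 4 bytes (consecutive floats) gives 1.0, a stride of 12 bytes (one float out of a three-float struct) gives about 0.33, and a stride of 32 bytes or more gives 0.125, because every thread then pulls in a segment of its own. Real hardware adds caching and alignment effects that this model ignores.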
With AOS:
position[base + tid].x = position[base + tid].x + velocity[base + tid].x * dt;
// write to position.x:  every third address
// read from position.x: every third address
// read from velocity.x: every third address
With SOA:
position.x[base + tid] = position.x[base + tid] + velocity.x[base + tid] * dt;
// write to position.x:  consecutive addresses
// read from position.x: consecutive addresses
// read from velocity.x: consecutive addresses
With SOA, the threads read consecutive addresses, so every byte of each transaction is used and you get 100% efficiency. With AOS, each thread uses only 4 bytes out of every 12 fetched, which gives 33%. Note that on older GPUs (compute capability 1.0 and 1.1) the AOS situation is much worse (13% efficiency).
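As a fuller sketch of the SOA layout applied to the update described in the question (gravity added to the Y velocity, then velocity added to position), something like the following could work. The names `ParticlesSoA`, `updateParticle`, `updateAllOnHost`, `updateKernel` and `gravity` are all hypothetical, and scaling by `dt` follows the snippets above rather than the question. The `#ifndef __CUDACC__` guard is only there so the same source also builds with a plain C++ compiler for checking the arithmetic on the host.

```cpp
#include <cstddef>

#ifndef __CUDACC__
// Plain C++ build: make the CUDA qualifiers vanish.
#define __host__
#define __device__
#endif

// Hypothetical SOA view: one array per component, all of the same length.
struct ParticlesSoA {
    float* px; float* py; float* pz;  // positions
    float* vx; float* vy; float* vz;  // velocities
};

// Update particle i. On the GPU, thread i touches element i of each array,
// so neighbouring threads read and write neighbouring floats.
__host__ __device__ inline void updateParticle(const ParticlesSoA& p,
                                               std::size_t i,
                                               float gravity, float dt)
{
    p.vy[i] += gravity * dt;   // gravity acts on the Y velocity only
    p.px[i] += p.vx[i] * dt;
    p.py[i] += p.vy[i] * dt;
    p.pz[i] += p.vz[i] * dt;
}

// Host reference loop: i plays the role of base + tid.
inline void updateAllOnHost(const ParticlesSoA& p, std::size_t n,
                            float gravity, float dt)
{
    for (std::size_t i = 0; i < n; ++i) {
        updateParticle(p, i, gravity, dt);
    }
}

#ifdef __CUDACC__
// Device version: one thread per particle; the pointers must be device pointers.
__global__ void updateKernel(ParticlesSoA p, std::size_t n,
                             float gravity, float dt)
{
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) {
        updateParticle(p, i, gravity, dt);
    }
}
#endif
```

On the transfer concern raised in the question: the six pointers do not have to come from six allocations. They can point into one contiguous buffer, in which case a single copy moves everything.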
There is one other possibility: if the struct held two or four floats, each thread could load its whole element in a single access (float2 or float4), and you could read the AOS with 100% efficiency:
float4 lpos;
float4 lvel;
lpos = position[base + tid];
lvel = velocity[base + tid];
lpos.x += lvel.x * dt;
//...
position[base + tid] = lpos;
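Filled out for all three components, the float4 variant might look like the sketch below. It assumes `position` and `velocity` are separate arrays of `float4` with the `w` component left as padding; `stepParticle` and `stepKernel` are hypothetical names. The stand-in `float4` definition is used only when building without nvcc, so the arithmetic can be checked on the host; under nvcc the built-in type is used.

```cpp
#ifndef __CUDACC__
// Host-only stand-in for CUDA's built-in float4 (16 bytes, 16-byte aligned).
struct alignas(16) float4 { float x, y, z, w; };
#define __host__
#define __device__
#endif

// Advance particle i by one time step. Each thread loads its whole 16-byte
// element, works on the local copy, then stores the whole element back.
__host__ __device__ inline void stepParticle(float4* position,
                                             const float4* velocity,
                                             int i, float dt)
{
    float4 lpos = position[i];  // whole-element load
    float4 lvel = velocity[i];  // whole-element load
    lpos.x += lvel.x * dt;
    lpos.y += lvel.y * dt;
    lpos.z += lvel.z * dt;
    position[i] = lpos;         // whole-element store; w is written back unchanged
}

#ifdef __CUDACC__
// Device version: one thread per particle.
__global__ void stepKernel(float4* position, const float4* velocity,
                           int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        stepParticle(position, velocity, i, dt);
    }
}
#endif
```

The trade-off is that each element now occupies 16 bytes instead of 12, so a quarter of the traffic is padding unless `w` carries something useful, such as one of the extra per-particle values mentioned in the question.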
Again, check out the Advanced CUDA C presentation for the details.