CUDA: Architecture of the GTX460 and code-related separation of grid/block/thread

Hey there, so far what I understood is the following: The GTX460 1GB (GF104) has 2 GPCs with 4 SMs each, so 8 SMs in total, of which 1 is disabled, which means: 7 SMs ("Streaming Multiprocessors"). Each SM has 48 CUDA cores (is that what we call a thread? And can it be understood like ONE core of a CPU such as the Q9550 quad-core?), so in total 336 CUDA cores.

So what I don't understand right now: why this 'complicated' architecture, instead of simply saying, as with a CPU: 'Okay, the GPU has N cores, that's it!'?

Assuming I have a certain program which is separated into a grid of B blocks, each block consisting of T threads (so B*T threads in total), can I somehow tell whether ONE block is always assigned to ONE SM or not? Because if it were like that, it would make things harder for the coder, since he would have to know how many SMs there are to optimize the parallelization for each graphics card. E.g.: if I had a graphics card with 8 SMs and my program separated the data into a grid of only 1 block with N threads, I could only ever use one SM and would leave the rest of the card's resources unused!
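(Side note: I do know that the SM count at least doesn't have to be hard-coded; it can be queried at runtime. A minimal sketch using cudaGetDeviceProperties:)

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // properties of device 0
    printf("SMs: %d\n", prop.multiProcessorCount);   // e.g. 7 on a GTX460
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}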

Is there any way to use only some of the threads of the card when coding my program? I would really love to benchmark the speedup by running my program on 1..M threads in total, where M is the total number of CUDA cores (if that is equivalent to 'thread'), but how do I do this? Is it sufficient to code my program like this:

cudaKernel<<<1, 1>>>(...)

to

cudaKernel<<<1, M>>>(...)

and run it each time? The only problem I see here is the following: let's assume I have the simple vector addition example:

#define SIZE 10
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;   // one thread handles exactly one element
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}
int main()
{
    int N = SIZE;
    float A[SIZE], B[SIZE], C[SIZE];
    // Kernel invocation

    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    [...]   // device allocations etc. omitted here
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
}

If I now changed vecAdd<<<1, N>>> to vecAdd<<<1, 1>>>, the single thread wouldn't compute C as an N-size vector, because that one thread would only calculate the first value of A and B, and thus of C. How do I overcome this problem, then? Thanks a lot for clarifying in advance!! You will help me a lot!


For the most part, the answer to most of what you're asking about is no. Not just no, but heck no.

The general idea with most GPU programming is that it should be implicitly scalable -- i.e., your code automatically uses as many cores as it's given. In particular, when the GPU is being used for graphics processing, the cores get split between executing the three types of shaders (vertex, geometry, fragment/pixel). As such, the number of available cores can (and often will) vary dynamically, depending on the overall load.
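That implicit scalability is also the usual answer to your vecAdd question: instead of assuming one thread per element, write the kernel so each thread strides over the data. A minimal sketch of the common "grid-stride loop" pattern (note the extra N parameter, which your original kernel didn't have):

__global__ void vecAdd(float* A, float* B, float* C, int N)
{
    // Each thread starts at its global index and strides by the total
    // number of launched threads, so ANY <<<blocks, threads>>>
    // configuration covers all N elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += gridDim.x * blockDim.x)
    {
        A[i] = 0.0f;
        B[i] = (float)i;
        C[i] = A[i] + B[i];
    }
}

With this, vecAdd<<<1, 1>>>(devPtrA, devPtrB, devPtrC, N) still computes the full N-element result; it just does so with less parallelism.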

There are a few reasons the GPU is organized this way instead of like a typical multicore CPU. First, it's intended primarily for "embarrassingly parallel" problems -- obviously enough, its primary purpose is applying similar code to each of a large number of pixels. Although you're not using it in exactly the same way for CUDA code, that's still the basis of the hardware design. Second, as alluded to above, the cores are actually split between at least three different purposes, and can be split between more (e.g., a graphics application using some cores while your CUDA code uses others). Third, the extra abstraction helps keep your code immune to changes in the hardware. It lets you specify just the computation you need, and the runtime deals with scheduling that on the hardware as efficiently as it can.
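As for the benchmarking experiment from <<<1, 1>>> up to <<<1, M>>>: you can still do that, and CUDA events are the standard way to time kernels. A rough sketch, assuming the grid-stride vecAdd above and device pointers that have already been allocated:

float timeVecAdd(float* dA, float* dB, float* dC, int N, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    vecAdd<<<1, threads>>>(dA, dB, dC, N);   // vary 'threads' from 1..M
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

Just keep in mind that 'threads' per block is capped by the hardware (queryable via maxThreadsPerBlock), and that timings from such micro-benchmarks say little about how well a real launch configuration uses the machine.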
