
OpenCL code runs faster on MBP than on NVIDIA GTX 480

I have come across a strange problem. I'm implementing some linear algebra in OpenCL, only matrix multiplications so far, and have been testing it on my laptop. The kernel is really simple:

__kernel void matrix_mult(__global float* a,
                          __global float* b,
                          __global float* c,
                          const int N)
{
  int row = get_global_id(1);
  int col = get_global_id(0);
  float sum = 0.0f;
  for (int i = 0; i < N; i++) {
    sum += a[row*N+i] * b[i*N+col];
  }
  c[row*N+col] = sum;
}

I test the hardware by running the code 100 times like this:

  clock_t begin = clock();

  const unsigned int repeats = 100;
  for (unsigned int i = 0; i != repeats; i++) {
    runCL(a, b, results, N, N*N);
  }

  clock_t end = clock();

On my MBP the matrix multiplication takes about 1.2 ms for 512×512 matrices, while the same code takes about 3 ms on a GTX 480 Linux box. This bothers me, since I would expect the expensive GTX card to be at least a little faster than the laptop.

As far as I can see, either my code is 'wrong' or I'm timing it in some wrong way.

I tried using the event-based timing from the OpenCL spec, which gave somewhat more realistic results:

cl_event event = NULL;
err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL, global_work_size, NULL, 0, NULL, &event);
assert(err == CL_SUCCESS);

// Block until the kernel has finished, then read the device-side timestamps.
err = clWaitForEvents(1, &event);
assert(err == CL_SUCCESS);

cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,   sizeof(cl_ulong), &end,   NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
double executionTimeInMilliseconds = (end - start) * 1.0e-6; // timestamps are in nanoseconds
std::cout << "execution time in millis: " << executionTimeInMilliseconds << std::endl;
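For the timestamps above to be populated, the command queue has to be created with profiling enabled; a minimal sketch, assuming context and device are already set up elsewhere:

cl_int err;
// Profiling data is only recorded for queues created with CL_QUEUE_PROFILING_ENABLE.
cl_command_queue cmd_queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
assert(err == CL_SUCCESS);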

Now the GT 330M does the operation in 46 ms and the GTX 480 does it in 2.5 ms. That raises another really interesting question: with profiling turned on, the GT 330M becomes about 30 times slower, which sort of makes sense, but the GTX 480 keeps up the same performance. Can anyone explain why this is?


Regarding the timing of the original problem: what you're seeing here is that, with this naive code, the better specs of the GTX 480 are actually hurting you.

The code sample, a first pass at a matrix multiply, is completely dominated by memory bandwidth; each thread is accessing a different element of b which can't be coalesced because of the stride.

The GTX 480 has a 3x wider (384-bit) and roughly 2x faster (1840 MHz) memory bus than the GT 330M (128-bit, 800 MHz). Nominally, that gives a peak bandwidth advantage of 177.4 GB/s vs 25.6 GB/s, and since this code is memory-bandwidth dominated, you might think that would win. However, because of the non-coalesced reads and the wider memory bus, the b-array accesses only use 32 bits out of each 384-bit memory transaction, and on the 330M only 32 bits out of each 128-bit transaction. So the effective memory bandwidths for the b accesses are 14.8 GB/s and 6.4 GB/s respectively; now there's only a factor of ~2 difference in total memory bandwidth rather than ~7, and much of the advantage of the faster card is squandered. In addition, that memory bandwidth has to be shared by 10x as many cores, so the latency for each core to get its data and do the calculation is longer. I suspect that if you used larger matrices, you could hide more of that latency and get closer to the best-possible ~2x speedup rather than the 2.5x slowdown you're seeing.
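For reference, those effective-bandwidth figures are just the peak bandwidth scaled by the useful fraction of each transaction: 177.4 GB/s × 32/384 ≈ 14.8 GB/s for the GTX 480, and 25.6 GB/s × 32/128 = 6.4 GB/s for the GT 330M.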

The ultimate solution here is to use a more memory-friendly matrix multiplication algorithm as a benchmark.
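For example, a tiled version that stages blocks of a and b in local memory, so that neighbouring work-items read contiguous (coalescable) addresses, might look roughly like this. This is only a sketch: it assumes N is a multiple of the 16×16 tile size and that the kernel is launched with a local work size of {16, 16} (rather than the NULL local size used above).

#define TILE 16

__kernel void matrix_mult_tiled(__global const float* a,
                                __global const float* b,
                                __global float* c,
                                const int N)
{
  __local float a_tile[TILE][TILE];
  __local float b_tile[TILE][TILE];

  int row  = get_global_id(1);
  int col  = get_global_id(0);
  int lrow = get_local_id(1);
  int lcol = get_local_id(0);

  float sum = 0.0f;
  for (int t = 0; t < N; t += TILE) {
    // Each work-group loads one TILE x TILE block of a and of b;
    // consecutive work-items load consecutive addresses, so the reads coalesce.
    a_tile[lrow][lcol] = a[row * N + (t + lcol)];
    b_tile[lrow][lcol] = b[(t + lrow) * N + col];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int k = 0; k < TILE; k++)
      sum += a_tile[lrow][k] * b_tile[k][lcol];
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  c[row * N + col] = sum;
}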

The profiling results you're seeing, though, I have no idea about. Perhaps the 330M doesn't have as good hardware support for profiling, so things have to be implemented in software? Since the GTX numbers are about the same either way, I'd just use the simpler timing approach for now, which, since you're not using asynchronous kernels or transfers, should be fine.
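A minimal wall-clock sketch of that simpler approach, assuming cmd_queue is the queue runCL enqueues into; the clFinish just makes sure everything queued has actually completed before the clock is read:

clock_t begin = clock();

const unsigned int repeats = 100;
for (unsigned int i = 0; i != repeats; i++) {
  runCL(a, b, results, N, N*N);
}
// Enqueue calls return before the work is done; block until the queue drains.
clFinish(cmd_queue);

clock_t end = clock();
double seconds = (double)(end - begin) / CLOCKS_PER_SEC;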


I think you're pushing the limits of the timer resolution on the Nvidia card. Try clGetDeviceInfo() with CL_DEVICE_PROFILING_TIMER_RESOLUTION to check it. With times that tiny I wouldn't really conclude anything.
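Something like this would show the resolution, assuming device already holds the cl_device_id being benchmarked:

size_t resolution_ns = 0;
// Resolution of the device's profiling timer, in nanoseconds.
cl_int err = clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION,
                             sizeof(resolution_ns), &resolution_ns, NULL);
assert(err == CL_SUCCESS);
printf("profiling timer resolution: %zu ns\n", resolution_ns);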


A few ms could be the difference between the initialization routines for each code path, especially when the two test systems have different hardware. I recommend starting with a larger problem size that takes at least several seconds on both the laptop and the nVidia card.
