CUDA: What reasons could there be for nvcc taking several minutes to compile?
I have some CUDA code that nvcc (well, technically ptxas) likes to take upwards of 10 minutes to compile. While it isn't small (~5000 lines), it certainly isn't huge.
The delay seems to come and go between CUDA version updates, but previously it only took a minute or so instead of 10.
When I used the -v option, it seemed to get stuck after displaying the following:
ptxas --key="09ae2a85bb2d44b6" -arch=sm_13 "/tmp/tmpxft_00002ab1_00000000-2_trip3dgpu_kernel.ptx" -o "/tmp/tmpxft_00002ab1_00000000-9_trip3dgpu_kernel.sm_13.cubin"
The kernel does have a fairly large parameter list and a structure with a good number of pointers is passed around, but I do know that there was at least one point in time in which very nearly the exact same code compiled in only a couple seconds.
I am running 64 bit Ubuntu 9.04 if it helps.
Any ideas?
I had a similar problem: without optimization, compilation failed by running out of registers, and with optimizations it took nearly half an hour. My kernel had expressions like
t1itern[II(i,j)] = (1.0 - overr) * t1itero[II(i,j)] + overr * (rhs[IJ(i-1,j-1)].rhs1 - abiter[IJ(i-1,j-1)].as * t1itern[II(i,j - 1)] - abiter[IJ(i-1,j-1)].ase * t1itero[II(i + 1,j - 1)] - abiter[IJ(i-1,j-1)].ae * t1itern[II(i + 1,j)] - abiter[IJ(i-1,j-1)].ane * t1itero[II(i + 1,j + 1)] - abiter[IJ(i-1,j-1)].an * t1itern[II(i,j + 1)] - abiter[IJ(i-1,j-1)].anw * t1itero[II(i - 1,j + 1)] - abiter[IJ(i-1,j-1)].aw * t1itern[II(i - 1,j)] - abiter[IJ(i-1,j-1)].asw * t1itero[II(i - 1,j - 1)] - rhs[IJ(i-1,j-1)].aads * t2itern[II(i,j - 1)] - rhs[IJ(i-1,j-1)].aadn * t2itern[II(i,j + 1)] - rhs[IJ(i-1,j-1)].aade * t2itern[II(i + 1,j)] - rhs[IJ(i-1,j-1)].aadw * t2itern[II(i - 1,j)] - rhs[IJ(i-1,j-1)].aadc * t2itero[II(i,j)]) / abiter[IJ(i-1,j-1)].ac;
and when I rewrote them:
tt1 = lrhs.rhs1;
tt1 = tt1 - labiter.as * t1itern[II(i,j - 1)];
tt1 = tt1 - labiter.ase * t1itero[II(i + 1,j - 1)];
tt1 = tt1 - labiter.ae * t1itern[II(i + 1,j)];
//etc
it significantly reduced compilation time and register usage.
You should note that there is a limit on the size of the parameter list that can be passed to a function, currently 256 bytes (see section B.1.4 of the CUDA Programming Guide). Has the function changed at all?
There is also a limit of 2 million PTX instructions per kernel, but you shouldn't be approaching that ;-)
What version of the toolkit are you using? The 3.0 beta, which is a major update, is available if you are a registered developer. If you still have the problem, you should contact NVIDIA; they will of course need to be able to reproduce it.
Setting -maxrregcount 64 on the compile line helps, since it causes the register allocator to spill to local memory (lmem) earlier.