CUDA: What reasons could there be for nvcc taking several minutes to compile?
I have some CUDA code that nvcc (well, technically ptxas) likes to take upwards of 10 minutes to compile. While it isn't small (~5000 lines), it certainly isn't huge.
The delay seems to come and go between CUDA version updates, but previously it only took a minute or so instead of 10.
When I used the -v option, it seemed to get stuck after displaying the following:
ptxas --key="09ae2a85bb2d44b6" -arch=sm_13 "/tmp/tmpxft_00002ab1_00000000-2_trip3dgpu_kernel.ptx" -o "/tmp/tmpxft_00002ab1_00000000-9_trip3dgpu_kernel.sm_13.cubin"
The kernel does have a fairly large parameter list and a structure with a good number of pointers is passed around, but I do know that there was at least one point in time in which very nearly the exact same code compiled in only a couple seconds.
I am running 64 bit Ubuntu 9.04 if it helps.
Any ideas?
I had a similar problem: without optimization, compilation failed by running out of registers, and with optimizations it took nearly half an hour. My kernel had expressions like
t1itern[II(i,j)] = (1.0 - overr) * t1itero[II(i,j)] + overr * (rhs[IJ(i-1,j-1)].rhs1 - abiter[IJ(i-1,j-1)].as * t1itern[II(i,j - 1)] - abiter[IJ(i-1,j-1)].ase * t1itero[II(i + 1,j - 1)] - abiter[IJ(i-1,j-1)].ae * t1itern[II(i + 1,j)] - abiter[IJ(i-1,j-1)].ane * t1itero[II(i + 1,j + 1)] - abiter[IJ(i-1,j-1)].an * t1itern[II(i,j + 1)] - abiter[IJ(i-1,j-1)].anw * t1itero[II(i - 1,j + 1)] - abiter[IJ(i-1,j-1)].aw * t1itern[II(i - 1,j)] - abiter[IJ(i-1,j-1)].asw * t1itero[II(i - 1,j - 1)] - rhs[IJ(i-1,j-1)].aads * t2itern[II(i,j - 1)] - rhs[IJ(i-1,j-1)].aadn * t2itern[II(i,j + 1)] - rhs[IJ(i-1,j-1)].aade * t2itern[II(i + 1,j)] - rhs[IJ(i-1,j-1)].aadw * t2itern[II(i - 1,j)] - rhs[IJ(i-1,j-1)].aadc * t2itero[II(i,j)]) / abiter[IJ(i-1,j-1)].ac;
and when I rewrote them:
tt1 = lrhs.rhs1;
tt1 = tt1 - labiter.as * t1itern[II(i,j - 1)];
tt1 = tt1 - labiter.ase * t1itero[II(i + 1,j - 1)];
tt1 = tt1 - labiter.ae * t1itern[II(i + 1,j)];
//etc
it significantly reduced compilation time and register usage.
You should note that there is a limit on the size of the parameter list that can be passed to a function, currently 256 bytes (see section B.1.4 of the CUDA Programming Guide). Has the function changed at all?
There is also a limit of 2 million PTX instructions per kernel, but you shouldn't be approaching that ;-)
What version of the toolkit are you using? The 3.0 beta, which is a major update, is available if you are a registered developer. If you still have the problem, you should contact NVIDIA; they will of course need to be able to reproduce it.
Setting -maxrregcount 64 on the compile line helps, since it causes the register allocator to spill to local memory (lmem) earlier.