
CUDA timer question

Say I want to time fetching memory from device global memory:

cudaMemcpy(...cudaMemcpyHostToDevice);
cudaThreadSynchronize();
time1 ...

kernel_call();
cudaThreadSynchronize();
time2 ...

cudaMemcpy(...cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
time3 ...

I don't understand why time3 and time2 always give the same result. My kernel does take a long time to get the result ready for fetching, but shouldn't cudaThreadSynchronize() block until all operations before it are done, including the kernel call? Also, fetching from device memory to host memory should take a while too, at least a noticeable amount. Thanks.
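For completeness, the full pattern looks roughly like this (a stripped-down sketch; the array size, the dummy kernel, and the gettimeofday-based host timer are placeholders, not my real code):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cuda_runtime.h>

// Placeholder kernel: squares each element so the copy back has real data.
__global__ void kernel_call(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

// Host-side wall-clock timer in milliseconds.
static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
    const int n = 1 << 20;                 /* assumed problem size */
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    float *d;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    cudaMalloc((void **)&d, bytes);

    double t0 = now_ms();
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();
    double time1 = now_ms();

    kernel_call<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize();
    double time2 = now_ms();

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();
    double time3 = now_ms();

    printf("HtoD: %.3f ms, kernel: %.3f ms, DtoH: %.3f ms\n",
           time1 - t0, time2 - time1, time3 - time2);

    cudaFree(d);
    free(h);
    return 0;
}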


The best way to monitor the execution time is to use the CUDA_PROFILE_LOG=1 environment variable and to set the values timestamp, gpustarttimestamp, gpuendtimestamp in the CUDA_PROFILE_CONFIG file. After running your CUDA program with those environment variables, a local .cuda_log file should be created that lists the timing of memcopies and kernel executions at microsecond resolution. Clean and non-invasive.
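For example, something like this (a sketch from memory of the legacy command-line profiler, so the exact variable names and file paths may differ between toolkit versions):

export CUDA_PROFILE=1                       # enable the command-line profiler
export CUDA_PROFILE_CONFIG=./profile.cfg    # which timestamps/counters to record
export CUDA_PROFILE_LOG=./cuda_profile.log  # where the log is written

# profile.cfg contains one option per line:
timestamp
gpustarttimestamp
gpuendtimestamp

Then run the program as usual and look at the log file; each memcpy and kernel launch gets its own line with start and end timestamps.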


I don't know if that's the critical point here, but I noticed the following:

If you look through the NVIDIA code samples (I don't know where exactly), you will find something like a "warm-up" function, which is called before the critical kernel whose time is actually measured.

Why?

Because every time a program is executed, the NVIDIA driver dynamically optimizes the way it drives the GPU during the first access (in your case, before time1). There is a lot of overhead in this. That wasn't clear to me for a long time: when I did 10 runs, the first run was always so slow. Now I know why.

Solution: just use a dummy/warm-up function that accesses the GPU hardware before the real execution of your code begins.
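A minimal sketch of what such a warm-up could look like (the empty kernel and launch configuration are placeholders, not taken from any particular SDK sample):

#include <cuda_runtime.h>

// Empty kernel: launching it once forces CUDA context creation and driver
// setup to happen here instead of inside the timed section.
__global__ void warm_up_kernel(void) { }

void gpu_warm_up(void)
{
    warm_up_kernel<<<1, 1>>>();
    cudaThreadSynchronize();   // make sure the setup has really finished
}

Call gpu_warm_up() once near the start of main(), before the first timestamp is taken.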
