CUDA timer question
Say I want to time a memory fetch from device global memory:
cudaMemcpy(...cudaMemcpyHostToDevice);
cudaThreadSynchronize();
time1 ...
kernel_call();
cudaThreadSynchronize();
time2 ...
cudaMemcpy(...cudaMemcpyDeviceToHost);
cudaThreadSynchronize();
time3 ...
I don't understand why time3 and time2 always give the same result. My kernel takes a long time before the result is ready to fetch, but shouldn't cudaThreadSynchronize() block until every operation issued before it (including kernel_call) has finished? Also, copying from device memory back to host memory should take a while, at least a noticeable amount of time. Thanks.
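To make the pattern concrete, here is a minimal, self-contained version of it. The helper now_ms(), the kernel body, and the problem size N are illustrative assumptions, not from the original post:

// Minimal sketch of the timing pattern above (names are illustrative).
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double now_ms(void) {                 // wall-clock time in milliseconds
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

__global__ void kernel_call(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                 // stand-in for the real work
}

int main(void) {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    for (int i = 0; i < N; ++i) h[i] = 1.0f;
    cudaMalloc((void **)&d, bytes);

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();                 // cudaDeviceSynchronize() on newer toolkits
    double time1 = now_ms();

    kernel_call<<<(N + 255) / 256, 256>>>(d, N);
    cudaThreadSynchronize();                 // wait for the kernel to finish
    double time2 = now_ms();

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();                 // wait for the copy back to finish
    double time3 = now_ms();

    printf("kernel: %f ms, copy back: %f ms\n", time2 - time1, time3 - time2);
    cudaFree(d);
    free(h);
    return 0;
}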
The best way to monitor execution time is to set the CUDA_PROFILE_LOG=1 environment variable and list the options timestamp, gpustarttimestamp and gpuendtimestamp in the file pointed to by CUDA_PROFILE_CONFIG. After running your CUDA program with those environment variables set, a local .cuda_log file should be created, listing the timings of the memcpys and the kernel executions down to the microsecond. Clean and non-invasive.
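Roughly, the setup looks like this (file names here are illustrative, and depending on the toolkit version the enabling variable may be CUDA_PROFILE=1 or COMPUTE_PROFILE=1):

# enable the command-line profiler
export CUDA_PROFILE=1
export CUDA_PROFILE_LOG=cuda_profile.log    # output file (name is up to you)
export CUDA_PROFILE_CONFIG=profile.cfg      # config file listing the options below

# contents of profile.cfg, one option per line
timestamp
gpustarttimestamp
gpuendtimestamp

./your_cuda_app    # timings are written to the log file when the program exits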
I don't know if this is the critical point here, but I noticed the following:
If you look through the NVIDIA code samples (I don't remember exactly where), you will find something like a "warm-up" function that is called before the critical kernel whose runtime is actually measured.
Why?
Because the NVIDIA driver dynamically optimizes the way it drives the GPU during the first access (in your case, before time1) every time a program is executed, and that first access carries a lot of overhead. This wasn't clear to me for a long time: when I did 10 runs, the first run was always painfully slow. Now I know why.
Solution: just call a dummy/warm-up function that touches the GPU hardware before the timed part of your code begins.
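A minimal sketch of such a warm-up (the names warmup_kernel and gpu_warmup are made up for illustration):

// Touch the GPU once so driver/context initialization is paid before timing.
__global__ void warmup_kernel(void) { }

void gpu_warmup(void) {
    warmup_kernel<<<1, 1>>>();      // first launch triggers context/driver setup
    cudaThreadSynchronize();        // make sure that setup has actually completed
}

// call gpu_warmup() once before taking time1 in the code above

Another common trick for the same purpose is an early cudaFree(0) call, which forces context creation without launching a kernel.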