Is using cudaHostAlloc good for my case?
I have a kernel that is launched several times, until a solution is found. The solution will be found by at least one block.
Therefore, when a block finds the solution, it should inform the CPU, so the CPU can print the solution provided by this block. What I am currently doing is the following:

```cuda
__global__ void kernel(int *sol)
{
    // do some computations
    if (/* this block found a solution */)
        atomicExch(sol, blockIdx.x);  // record the winning block atomically
}
```
Now, on every call to the kernel, I copy sol back to host memory and check its value. If it is set to 3, for example, I know that block 3 found the solution, so I know where the solution starts and can copy it back to the host.
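For reference, the relaunch-and-check loop described above can be sketched as follows. The kernel name `find_solution`, the launch configuration, and the sentinel value `-1` are assumptions for illustration, not part of the original code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a winning block writes its index into *sol.
__global__ void find_solution(int *sol)
{
    // ... do some computations ...
    // if (found) atomicExch(sol, blockIdx.x);
}

int main()
{
    int *d_sol;
    int h_sol = -1;  // sentinel: no solution yet
    cudaMalloc(&d_sol, sizeof(int));
    cudaMemcpy(d_sol, &h_sol, sizeof(int), cudaMemcpyHostToDevice);

    do {
        find_solution<<<128, 256>>>(d_sol);
        // cudaMemcpy acts as an implicit barrier: it waits for the
        // kernel to finish before copying the flag back.
        cudaMemcpy(&h_sol, d_sol, sizeof(int), cudaMemcpyDeviceToHost);
    } while (h_sol == -1);

    printf("block %d found the solution\n", h_sol);
    cudaFree(d_sol);
    return 0;
}
```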
In this case, would using cudaHostAlloc be a better option? Moreover, would copying the value of a single integer on every kernel call slow down my program?

Issuing a copy from GPU to CPU and then waiting for its completion will slow your program down a bit. Note that whether you transfer 1 byte or 1 KB won't make much of a difference: in this case the problem is not bandwidth, but latency.
But launching a kernel consumes some time as well. If the "meat" of your algorithm is in the kernel itself, I wouldn't spend too much time worrying about that single, small transfer.
Do note that if you choose to use mapped memory instead of cudaMemcpy, you will need to put an explicit cudaDeviceSynchronize (or cudaThreadSynchronize with older CUDA) barrier before reading the status, as opposed to the implicit barrier you get with cudaMemcpy. Otherwise, your host code may go ahead and read an old value stored in your pinned memory before the kernel overwrites it.
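A minimal sketch of that mapped-memory variant, assuming a device that supports host-mapped pinned memory (check the `canMapHostMemory` device property); as above, `find_solution` and the sentinel `-1` are hypothetical:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void find_solution(int *sol)
{
    // ... do some computations ...
    // if (found) atomicExch(sol, blockIdx.x);
}

int main()
{
    // Required before any other CUDA call on older CUDA versions;
    // with UVA-capable devices pinned allocations are mapped anyway.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *h_sol, *d_sol;
    // Pinned host memory, mapped into the device address space.
    cudaHostAlloc(&h_sol, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_sol, h_sol, 0);
    *h_sol = -1;  // sentinel: no solution yet

    do {
        find_solution<<<128, 256>>>(d_sol);
        // No cudaMemcpy here, hence no implicit barrier: synchronize
        // explicitly, or the host may read a stale value from *h_sol.
        cudaDeviceSynchronize();
    } while (*h_sol == -1);

    printf("block %d found the solution\n", *h_sol);
    cudaFreeHost(h_sol);
    return 0;
}
```

The trade-off is exactly the one described above: you skip the copy's latency, but you must pay for the explicit synchronization instead.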