PyCUDA GPUArray slice-based operations
The PyCUDA documentation is a bit light on examples for those of us in the 'Non-Guru' class. I'm wondering what operations are available for slice-based work on gpuarrays, i.e. if I wanted to move this loop onto the GPU:
import numpy as np

K, N = 10, 100  # example sizes
m = np.random.random((K, N, N))
a = np.zeros_like(m)
b = np.random.random(N)  # example
for k in range(K):
    for x in range(N):
        for y in range(N):
            a[k, x, y] = m[k, x, y] * b[y]
The regular first-stop python reduction for this would be something like
for k in range(K):
    for x in range(N):
        a[k, x, :] = m[k, x, :] * b
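For reference, NumPy broadcasting collapses the whole thing into a single expression, since the trailing axis of b lines up with the last axis of m. That single-expression form is also exactly the per-element rule a GPU kernel would need to reproduce:

```python
import numpy as np

K, N = 4, 8  # small example sizes, chosen here just for illustration
m = np.random.random((K, N, N))
b = np.random.random(N)

# b broadcasts along the last axis of m: a[k, x, y] = m[k, x, y] * b[y]
a = m * b

# Verify against the explicit triple loop
a_loop = np.zeros_like(m)
for k in range(K):
    for x in range(N):
        for y in range(N):
            a_loop[k, x, y] = m[k, x, y] * b[y]
assert np.allclose(a, a_loop)
```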
But I can't see any simple way to do this with GPUArray other than writing a custom elementwise kernel, and even then this problem would require looping constructs in the kernel; at that level of complexity I'm probably better off just writing my own full-blown SourceModule kernel.
Can anyone clue me in?
That is probably best done with your own kernel. While PyCUDA's gpuarray class is a really convenient abstraction of GPU memory into something which can be used interchangeably with numpy arrays, there is no getting around the need to code for the GPU for anything outside of the canned linear algebra and parallel reduction operations.
That said, it is a pretty trivial little kernel to write. So trivial, in fact, that it would be memory-bandwidth bound; you might want to see whether you can "fuse" a few operations like this together to improve the ratio of FLOPs to memory transactions a bit.
If you need some help with the kernel, drop in a comment, and I can expand the answer to include a rough prototype.
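For what it's worth, here is a rough sketch of what such a kernel might look like. The name, signature, and launch layout are my own assumptions (one thread per element, with the K x N x N array flattened in C order, so the index into b is just the flat index modulo N):

```cuda
// Hypothetical kernel sketch, untested: scales each row of m by b elementwise.
// Assumes a, m are flattened K*N*N arrays in C order and b has length N.
__global__ void rowscale(double *a, const double *m, const double *b,
                         int K, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // flat element index
    int total = K * N * N;
    if (idx < total) {
        int y = idx % N;          // last-axis index selects the entry of b
        a[idx] = m[idx] * b[y];
    }
}
```

Because the array is flattened, no looping constructs are needed inside the kernel; each thread handles exactly one output element, which is about as memory-bandwidth-friendly as this operation gets.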
You can also use the memcpy_dtod() method together with the slicing functionality of gpuarrays. It's strange that normal assignment does not work. set() does not work because it assumes a host-to-device transfer (it uses memcpy_htod()).
# a and m are gpuarrays; b must also live on the device (e.g. gpuarray.to_gpu(b))
# so that m[k, x, :] * b is evaluated on the GPU before the device-to-device copy.
for k in range(K):
    for x in range(N):
        pycuda.driver.memcpy_dtod(a[k, x, :].gpudata,
                                  (m[k, x, :] * b).gpudata,
                                  a[k, x, :].nbytes)