I looked through the programming guide and the best practices guide, and they mention that global memory access takes 400-600 cycles. I did not see much on the other memory types, such as the texture cache and constant memory.
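For context, constant memory is one of those other spaces: it is declared with __constant__, written from the host with cudaMemcpyToSymbol, and served by an on-chip cache, so a warp in which all threads read the same address does not pay the full global-memory latency. A minimal sketch, where coeffs and scale_kernel are illustrative names (not from either guide) and the input data is left uninitialized because only the constant-memory mechanics matter here:

    #include <cuda_runtime.h>

    // Constant memory: read-only from kernels, cached on-chip.
    __constant__ float coeffs[4];

    // Every thread reads the same constant values, which the constant
    // cache can broadcast to the whole warp.
    __global__ void scale_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * coeffs[0] + coeffs[1];
    }

    int main()
    {
        const int n = 1024;
        float h_coeffs[4] = {2.0f, 1.0f, 0.0f, 0.0f};
        cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }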
I have the following matrix multiplication code, implemented with CUDA 3.2 and VS 2008, running on Windows Server 2008 R2 Enterprise with an NVIDIA GTX 480. The following code works fine.
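The code itself is cut off above; for reference, here is a hedged sketch of the kind of naive matrix multiplication kernel such questions usually involve (square N x N row-major matrices, one output element per thread; the names are illustrative, not the original poster's):

    // Naive matrix multiply: C = A * B, all N x N, row-major.
    // One thread computes one element of C.
    __global__ void matmul_naive(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }

    // Typical launch: 16x16 thread blocks tiling the whole output matrix.
    // dim3 block(16, 16);
    // dim3 grid((N + 15) / 16, (N + 15) / 16);
    // matmul_naive<<<grid, block>>>(d_A, d_B, d_C, N);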
It seems like 2 million floats should be no big deal, only 8 MB out of 1 GB of GPU RAM. I am able to allocate that much, and sometimes more, with no trouble. I get CL_OUT_OF_RESOURCES when
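One thing worth checking: on NVIDIA's OpenCL implementation, CL_OUT_OF_RESOURCES often surfaces on a later enqueue or read-back after a kernel has accessed memory out of bounds, rather than on the allocation itself, so the failing call is not always the real culprit. A minimal sketch of allocating the ~8 MB buffer with explicit error checking (alloc_checked is an illustrative helper; a valid context is assumed to exist already):

    #include <CL/cl.h>
    #include <cstdio>

    // Allocate 2 million floats (~8 MB) and check the error code.
    cl_mem alloc_checked(cl_context context)
    {
        const size_t n = 2 * 1000 * 1000;
        cl_int err = CL_SUCCESS;
        cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                    n * sizeof(float), NULL, &err);
        if (err != CL_SUCCESS) {
            fprintf(stderr, "clCreateBuffer failed: %d\n", err);
            return NULL;
        }
        return buf;
    }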
I would like to see an example of rendering with NVIDIA Cg to an offscreen frame buffer object. The computers I have access to have graphics cards but no monitors (or X server), so I want to do the rendering entirely offscreen.
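As a starting point, here is a hedged sketch of creating an offscreen framebuffer object with a texture color attachment. It assumes a current OpenGL context already exists; on a machine with no display, that context itself has to come from something like a pbuffer or EGL surface, which is the harder part and is not shown here:

    #include <GL/glew.h>

    // Build a 512x512 offscreen render target. Assumes a current GL context
    // and that GLEW (or equivalent) has loaded the FBO entry points.
    GLuint createOffscreenTarget(GLuint *colorTexOut)
    {
        GLuint fbo = 0, colorTex = 0;

        glGenTextures(1, &colorTex);
        glBindTexture(GL_TEXTURE_2D, colorTex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 512, 512, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, NULL);

        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                               GL_TEXTURE_2D, colorTex, 0);

        if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
            return 0;   // incomplete framebuffer, caller should handle

        // Bind the Cg program and draw as usual; the pixels land in colorTex
        // and can be read back with glReadPixels while the FBO is bound.
        *colorTexOut = colorTex;
        return fbo;
    }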
I've been working on an AES CUDA application and I have a kernel which performs ECB encryption on the GPU. To ensure the logic of the algorithm is not modified when running in parallel, I send the results back and compare them with a CPU reference implementation.
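A common way to do that check, sketched below under assumptions: copy the GPU ciphertext back to the host, run the same plaintext through a CPU reference, and compare the bytes. aes_encrypt_block_cpu and verify_gpu_ciphertext are illustrative names, not the poster's code, and the CPU reference is left as a declaration only:

    #include <cuda_runtime.h>
    #include <cstring>
    #include <vector>

    // Placeholder for a known-good software AES block encrypt (not shown).
    void aes_encrypt_block_cpu(const unsigned char in[16],
                               unsigned char out[16],
                               const unsigned char key[16]);

    bool verify_gpu_ciphertext(const unsigned char *d_ciphertext,   // device ptr
                               const unsigned char *h_plaintext,    // host ptr
                               const unsigned char key[16],
                               size_t nBlocks)
    {
        std::vector<unsigned char> gpu(nBlocks * 16), cpu(nBlocks * 16);

        // Bring the GPU result back to the host.
        cudaMemcpy(gpu.data(), d_ciphertext, gpu.size(), cudaMemcpyDeviceToHost);

        // ECB encrypts every 16-byte block independently, so the CPU
        // reference can just walk the blocks one by one.
        for (size_t b = 0; b < nBlocks; ++b)
            aes_encrypt_block_cpu(h_plaintext + 16 * b, cpu.data() + 16 * b, key);

        return std::memcmp(gpu.data(), cpu.data(), gpu.size()) == 0;
    }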
If my algorithm is bottlenecked by host-to-device and device-to-host memory transfers, is the only solution a different or revised algorithm?

There are a couple of things you can try to reduce the transfer overhead before resorting to a different algorithm.
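For example, pinned (page-locked) host memory speeds up the copies themselves, and asynchronous copies issued on CUDA streams let transfers overlap with kernel execution when the work is split into chunks. A minimal sketch, where process is just a stand-in kernel:

    #include <cuda_runtime.h>

    __global__ void process(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *h_data, *d_data;

        // Pinned host memory: faster copies and required for true async transfers.
        cudaMallocHost(&h_data, n * sizeof(float));
        cudaMalloc(&d_data, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Copy, compute, and copy back on one stream; with several streams and
        // chunked data, the copies of one chunk overlap the kernel of another.
        cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        process<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
        cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(d_data);
        cudaFreeHost(h_data);
        return 0;
    }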
I don't know whether this is the right forum, but anyway, here is the question. In one of our applications we display medical images, and on top of them an algorithm-generated bitmap. The real bitmap is a
In NVIDIA's Compute Profiler there is a column called "static private mem per work group", and its tooltip says "Size of statically allocated shared memory per block". My application shows that
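"Statically allocated" here refers to __shared__ arrays whose size is fixed at compile time, as opposed to dynamic shared memory whose size is supplied as the third kernel launch parameter. A small sketch of the two forms (kernel names are illustrative):

    // Static shared memory: size known at compile time; this is what the
    // profiler reports as statically allocated shared memory per block.
    __global__ void static_shared_kernel(const float *in, float *out)
    {
        __shared__ float tile[16][16];      // 16*16*4 = 1024 bytes, static
        int x = threadIdx.x, y = threadIdx.y;
        tile[y][x] = in[y * 16 + x];
        __syncthreads();
        out[y * 16 + x] = tile[x][y];
    }

    // Dynamic shared memory: size supplied at launch time via the third
    // <<<grid, block, sharedBytes>>> parameter, reported separately.
    __global__ void dynamic_shared_kernel(const float *in, float *out, int n)
    {
        extern __shared__ float buf[];
        int i = threadIdx.x;
        if (i < n) {
            buf[i] = in[i];
            __syncthreads();
            out[i] = buf[i];
        }
    }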
const char programSource[] = "__kernel void vecAdd(__global int *a, __global int *b, __global int *c)"
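The string literal is cut off above; a hedged completion with a typical vector-add body (the body is added here for illustration and may differ from the original):

    // The kernel source as a C string; the body is the usual element-wise add.
    const char programSource[] =
        "__kernel void vecAdd(__global int *a, __global int *b, __global int *c)\n"
        "{\n"
        "    int gid = get_global_id(0);\n"
        "    c[gid] = a[gid] + b[gid];\n"
        "}\n";

    // On the host the program is typically built with something like:
    //   const char *src = programSource;
    //   cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
    //   clBuildProgram(program, 0, NULL, NULL, NULL, NULL);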
One thing I haven't figured out, and Google isn't helping me with, is why it is possible to have bank conflicts with shared memory but not with global memory. Can there be bank conflicts with registers?
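For background: shared memory is split into banks (32 four-byte-wide banks on Fermi-class GPUs such as the GTX 480), and two threads of a warp touching different words in the same bank are serialized. Global memory accesses are instead grouped into coalesced transactions, so the bank notion does not apply, and registers are private to each thread, with any banking handled by the compiler rather than the programmer. A small sketch contrasting a conflict-prone and a conflict-free shared memory access pattern:

    __global__ void bank_conflict_demo(float *out)
    {
        __shared__ float data[32 * 32];
        int tid = threadIdx.x;

        // Fill the array so the reads below are defined.
        for (int i = tid; i < 32 * 32; i += blockDim.x)
            data[i] = (float)i;
        __syncthreads();

        // Stride-32 read: with 32 banks of 4-byte words, all 32 threads of a
        // warp hit the same bank, so the loads serialize (32-way conflict).
        float conflicted = data[tid * 32];

        // Stride-1 read: consecutive words sit in consecutive banks, so the
        // whole warp is serviced in one pass (no conflict).
        float conflict_free = data[tid];

        out[tid] = conflicted + conflict_free;
    }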