开发者

Can I call a "function-like macro" in a header file from a CUDA __global__ function?

This is part of my header file aes_locl.h:

.
.
# define SWAP(x) (_lrotl(开发者_StackOverflow中文版x, 8) & 0x00ff00ff | _lrotr(x, 8) & 0xff00ff00) 
# define GETU32(p) SWAP(*((u32 *)(p))) 
# define PUTU32(ct, st) { *((u32 *)(ct)) = SWAP((st)); } 
.
.

Now from the .cu file I have declared a __ global__ function and included the header file like this :

#include "aes_locl.h"
.....
__global__ void cudaEncryptKern(u32* _Te0, u32* _Te1, u32* _Te2, u32* _Te3, unsigned char* in, u32* rdk, unsigned long* length)
{
    u32 *rk = rdk;
    u32 s0, s1, s2, s3, t0, t1, t2, t3;

    s0 = GETU32(in + threadIdx.x*(i) ) ^ rk[0];
}

This leads me to the following error message:

error: calling a host function from a __ device__/__ global__ function is only allowed in device emulation mode

I have sample code where the programmer calls the macro exactly in that way.

Can I call it in this way, or is this not possible at all? If it is not, I will appreciate some hints of what would be the best approach to rewrite the macros and assign the desired value to S0.

thank you very much in advance!!!


I think the problem is not the macros themselves - the compilation process used by nvcc for CUDA code runs the C preprocessor in the usual way and so using header files in this way should be fine. I believe the problem is in your calls to _lrotl and _lrotr.

You ought to be able to check that that is indeed the problem by temporarily removing those calls.

You should check the CUDA programming guide to see what functionality you need to replace those calls to run on the GPU.


The hardware doesn't have a built-in rotate instruction, and so there is no intrinsic to expose it (you can't expose something that doesn't exist!).

It's fairly simple to implement with shifts and masks though, for example if x is 32-bits then to rotate left eight bits you can do:

((x << 8) | (x >> 24))

Where x << 8 will push everything left eight bits (i.e. discarding the leftmost eight bits), x >> 24 will push everything right twnty-four bits (i.e. discarding all but the leftmost eight bits), and bitwise ORing them together gives the result you need.

// # define SWAP(x) (_lrotl(x, 8) & 0x00ff00ff | _lrotr(x, 8) & 0xff00ff00)
# define SWAP(x) (((x << 8) | (x >> 24)) & 0x00ff00ff | ((x >> 8) | (x << 24)) & 0xff00ff00)

You could of course make this more efficient by recognising that the above is overkill:

# define SWAP(x) (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8))


The error says what the problem really is. You are calling a function/macro defined in another file (which belongs to the CPU code), from inside the CUDA function. This is impossible!

You cannot call CPU functions/macros/code from a GPU function.

You should put your definitions (does _lrotl() exist in CUDA?) inside the same file that will be compiled by nvcc.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜