Can I call a "function-like macro" in a header file from a CUDA global function?

2023-01-09 10:28 问答作者：

This is part of my header file aes_locl.h:

.
.
# define SWAP(x) (_lrotl(开发者_StackOverflow中文版x, 8) & 0x00ff00ff | _lrotr(x, 8) & 0xff00ff00) 
# define GETU32(p) SWAP(*((u32 *)(p))) 
# define PUTU32(ct, st) { *((u32 *)(ct)) = SWAP((st)); } 
.
.

Now from the .cu file I have declared a __ global__ function and included the header file like this :

#include "aes_locl.h"
.....
__global__ void cudaEncryptKern(u32* _Te0, u32* _Te1, u32* _Te2, u32* _Te3, unsigned char* in, u32* rdk, unsigned long* length)
{
    u32 *rk = rdk;
    u32 s0, s1, s2, s3, t0, t1, t2, t3;

    s0 = GETU32(in + threadIdx.x*(i) ) ^ rk[0];
}

This leads me to the following error message:

error: calling a host function from a __ device__/__ global__ function is only allowed in device emulation mode

I have sample code where the programmer calls the macro exactly in that way.

Can I call it in this way, or is this not possible at all? If it is not, I will appreciate some hints of what would be the best approach to rewrite the macros and assign the desired value to S0.

thank you very much in advance!!!

I think the problem is not the macros themselves - the compilation process used by nvcc for CUDA code runs the C preprocessor in the usual way and so using header files in this way should be fine. I believe the problem is in your calls to _lrotl and _lrotr.

You ought to be able to check that that is indeed the problem by temporarily removing those calls.

You should check the CUDA programming guide to see what functionality you need to replace those calls to run on the GPU.

The hardware doesn't have a built-in rotate instruction, and so there is no intrinsic to expose it (you can't expose something that doesn't exist!).

It's fairly simple to implement with shifts and masks though, for example if x is 32-bits then to rotate left eight bits you can do:

((x << 8) | (x >> 24))

Where x << 8 will push everything left eight bits (i.e. discarding the leftmost eight bits), x >> 24 will push everything right twnty-four bits (i.e. discarding all but the leftmost eight bits), and bitwise ORing them together gives the result you need.

// # define SWAP(x) (_lrotl(x, 8) & 0x00ff00ff | _lrotr(x, 8) & 0xff00ff00)
# define SWAP(x) (((x << 8) | (x >> 24)) & 0x00ff00ff | ((x >> 8) | (x << 24)) & 0xff00ff00)

You could of course make this more efficient by recognising that the above is overkill:

# define SWAP(x) (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8))

The error says what the problem really is. You are calling a function/macro defined in another file (which belongs to the CPU code), from inside the CUDA function. This is impossible!

You cannot call CPU functions/macros/code from a GPU function.

You should put your definitions (does _lrotl() exist in CUDA?) inside the same file that will be compiled by nvcc.

继续阅读：c gpgpu macros

Can I call a "function-like macro" in a header file from a CUDA global function?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？