Why is "a =(b>0)?1:0" better than "if-else" version in CUDA?
Could you 开发者_运维百科tell me why
a =(b>0)?1:0
is better than
if (b>0)a=1; else a =0;
version in CUDA? Please give details. Many thanks.
Yik
There was a time when the NVIDIA compiler used idiom testing to generate more efficient code for the ternary operator than if/then/else constructs. This is the results of a small test to see whether this is still the case:
__global__ void branchTest0(float *a, float *b, float *d)
{
unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
float aval = a[tidx], bval = b[tidx];
float z0 = (aval > bval) ? aval : bval;
d[tidx] = z0;
}
__global__ void branchTest1(float *a, float *b, float *d)
{
unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
float aval = a[tidx], bval = b[tidx];
float z0;
if (aval > bval) {
z0 = aval;
} else {
z0 = bval;
}
d[tidx] = z0;
}
Compiling these two kernels for compute capability 2.0 with the CUDA 4.0 release compiler, the comparison section produces this:
branchTest0:
max.f32 %f3, %f1, %f2;
and
branchTest1:
setp.gt.f32 %p1, %f1, %f2;
selp.f32 %f3, %f1, %f2, %p1;
The ternary operator gets compiled into a single floating point maximum instruction, whereas the if/then/else gets compiled into two instructions, a compare followed by a select. Both codes are conditionally executed - neither produces branching. The machine code emitted by the assembler for these is also different and closely replicates the PTX:
branchTest0:
/*0070*/ /*0x00201c00081e0000*/ FMNMX R0, R2, R0, !pt;
and
branchTest1:
/*0070*/ /*0x0021dc00220e0000*/ FSETP.GT.AND P0, pt, R2, R0, pt;
/*0078*/ /*0x00201c0420000000*/ SEL R0, R2, R0, P0;
So it would seem that, at least for Fermi GPUs with CUDA 4.0 with this sort of construct, the ternary operator does produce fewer instructions that an equivalent if/then/else. Whether there is a performance difference between them comes down to microbenchmarking data which I don't have.
In general, I would recommend to write CUDA code in a natural style, and let the compiler worry about local branching. Besides predication, the GPU hardware also implements "select" type instructions. Using talonmies's framework and sticking in the original poster's code, I find that the same machine code is produced for both versions with the CUDA 4.0 compiler for sm_20. I used -keep to retain intermediate files, and the cuobjdump utility to produce the disassembly. Both the ternary operator and the if-statement are translated into an FCMP instruction, which is a "select" instruction.
The sample case examined by talonmies is actually a special case. The compiler recognizes some common source code idioms, such as the particular ternary expression frequently used to express max() and min() operations, and generates code accordingly. The equivalent if-statement is not recognized as an idiom.
__global__ void branchTest0(float *bp, float *d)
{
unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
float b = bp[tidx];
float a = (b>0)?1:0;
d[tidx] = a;
}
__global__ void branchTest1(float *bp, float *d)
{
unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
float b = bp[tidx];
float a;
if (b>0)a=1; else a =0;
d[tidx] = a;
}
code for sm_20
Function : _Z11branchTest1PfS_
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x84009c042c000000*/ S2R R2, SR_Tid_X;
/*0010*/ /*0x94001c042c000000*/ S2R R0, SR_CTAid_X;
/*0018*/ /*0x10019de218000000*/ MOV32I R6, 0x4;
/*0020*/ /*0x20009ca320044000*/ IMAD R2, R0, c [0x0] [0x8], R2;
/*0028*/ /*0x1020dc435000c000*/ IMUL.U32.U32.HI R3, R2, 0x4;
/*0030*/ /*0x80211c03200d8000*/ IMAD.U32.U32 R4.CC, R2, R6, c [0x0] [0x20];
/*0038*/ /*0x90315c4348004000*/ IADD.X R5, R3, c [0x0] [0x24];
/*0040*/ /*0xa0209c03200d8000*/ IMAD.U32.U32 R2.CC, R2, R6, c [0x0] [0x28];
/*0048*/ /*0x00401c8584000000*/ LD.E R0, [R4];
/*0050*/ /*0xb030dc4348004000*/ IADD.X R3, R3, c [0x0] [0x2c];
/*0058*/ /*0x03f01c003d80cfe0*/ FCMP.LEU R0, RZ, 0x3f800, R0;
/*0060*/ /*0x00201c8594000000*/ ST.E [R2], R0;
/*0068*/ /*0x00001de780000000*/ EXIT;
....................................
Function : _Z11branchTest0PfS_
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x84009c042c000000*/ S2R R2, SR_Tid_X;
/*0010*/ /*0x94001c042c000000*/ S2R R0, SR_CTAid_X;
/*0018*/ /*0x10019de218000000*/ MOV32I R6, 0x4;
/*0020*/ /*0x20009ca320044000*/ IMAD R2, R0, c [0x0] [0x8], R2;
/*0028*/ /*0x1020dc435000c000*/ IMUL.U32.U32.HI R3, R2, 0x4;
/*0030*/ /*0x80211c03200d8000*/ IMAD.U32.U32 R4.CC, R2, R6, c [0x0] [0x20];
/*0038*/ /*0x90315c4348004000*/ IADD.X R5, R3, c [0x0] [0x24];
/*0040*/ /*0xa0209c03200d8000*/ IMAD.U32.U32 R2.CC, R2, R6, c [0x0] [0x28];
/*0048*/ /*0x00401c8584000000*/ LD.E R0, [R4];
/*0050*/ /*0xb030dc4348004000*/ IADD.X R3, R3, c [0x0] [0x2c];
/*0058*/ /*0x03f01c003d80cfe0*/ FCMP.LEU R0, RZ, 0x3f800, R0;
/*0060*/ /*0x00201c8594000000*/ ST.E [R2], R0;
/*0068*/ /*0x00001de780000000*/ EXIT;
....................................
In general you need to avoid branches in CUDA code, otherwise you may get warp divergence which can result in a big performance hit. if
/else
clauses will normally result in branches based on a test of an expression. One way of eliminating branches is to use an expression which can be implemented without branches if the compiler is smart enough - that way all the threads in a warp follow the same code path.
In both cases the compiler is going to try to do the same thing, it will aim to use predicated execution. You can find more information in the CUDA C Programming Guide (available via the website) and also on Wikipedia. Essentially for short branches such as this the hardware is able to emit instructions for both sides of the branch and use a predicate to indicate which threads should actually execute the instructions.
In other words, there would be minimal performance difference. With older compilers the tertiary operator sometimes helped, but nowadays they are equivalent.
Don't know for CUDA, but in C++ and C99, using the former you can initialize a const variable.
int const a = (b>0) ? 1 : 0;
Whereas with the latter, you cannot make your a
variable const as you have to declare it before the if
.
Note that it could be written even shorter:
int const a = (b>0);
And you could even remove the parenthesis ... but IMHO it does not improve reading.
I find it easier to read. It's immediately obvious that the purpose of the whole statement is to set the value of a
.
The intent is to assign a
to one of two values, and the ternary conditional operator syntax lets you have only one a =
in your statement.
I think a standard if/else all on one line is ugly (regardless of what it's used for).
精彩评论