if/else statement in SSE intrinsics
I am trying to optimize a small piece of code with SSE intrinsics (I am a complete beginner on the topic), but I am a little stuck on the use of conditionals.
My original code is:
unsigned long c;
unsigned long constant = 0x12345678;
unsigned long table[256];
int n, k;
for( n = 0; n < 256; n++ )
{
c = n;
for( k = 0; k < 8; k++ )
{
if( c & 1 ) c = constant ^ (c >> 1);
else c >>= 1;
}
table[n] = c;
}
The goal of this code is to compute a crc table (the constant can be any polynomial, it doesn't play a role here),
I suppose my optimized code would be something like:
__m128 x;
__m128 y;
__m128 *table;
x = _mm_set_ps(3, 2, 1, 0);
y = _mm_set_ps(3, 2, 1, 0);
//offset for incrementation
offset = _mm_set1_ps(4);
for( n = 0; n < 64; n++ )
{
y = x;
for( k = 0; k < 8; k++ )
{
//if do something with y
//else do something with y
}
table[n] = y;
x = _mm_add_epi32 (x, offset);
}
I have no idea how to go through the if-else statement, but I suspect there is a clever trick. Has anybody an idea on how to do that?
(Aside from this, my optimization is probabl开发者_如何学Goy quite poor - any advice or correction on it would be treated with the greatest sympathy)
You can get rid of the if/else entirely. Back in the days when I produced MMX assembly code, that was a common programming activity. Let me start with a series of transformations on the "false" statement:
c >>= 1;
c = c >> 1;
c = 0 ^ (c >> 1);
Why did I introduce the exclusive-or? Because exclusive-or is also found in the "true" statement:
c = constant ^ (c >> 1);
Note the similarity? In the "true" part, we xor with a constant, and in the false part, we xor with zero.
Now I'm going to show you a series of transformations on the entire if/else statement:
if (c & 1)
c = constant ^ (c >> 1); // same as before
else
c = 0 ^ (c >> 1); // just different layout
if (c & 1)
c = constant ^ (c >> 1);
else
c = (constant & 0) ^ (c >> 1); // 0 == x & 0
if (c & 1)
c = (constant & -1) ^ (c >> 1); // x == x & -1
else
c = (constant & 0) ^ (c >> 1);
Now the two branches only differ in the second argument to the binary-and, which can be calculated trivially from the condition itself, thus enabling us to get rid of the if/else:
c = (constant & -(c & 1)) ^ (c >> 1);
Disclaimer: This solution only works on a two's complement architecture where -1 means "all bits set".
The idea in SSE is to build both results and then blend the results together.
E.g. :
__m128i mask = ...; // some way to build mask[n] = 0x1
__m128i constant = ...;
__m128i tmp_c = _mm_xor_si128( _mm_srli_epis32( c, 1 ), constant );
__m128i tmp_c2 = _mm_srli_epis32( c, 1 );
__m128i v = _mm_cmpeq_epi32( c, mask );
tmp_c = _mm_and_epi32( tmp_c, mask );
tmp_c2 = _mm_andnot_si128( mask, tmp_c2 );
c = _mm_or_si128( tmp_c, tmp_c2 );
// or in sse4_1
c = _mm_blendv_epi8( tmp_c, tmp_c2, mask );
Note beside, this is not complete code, only to demonstrate the principle.
The first step in efficiently computing CRC is using a wider basic unit than the bit. See here for an example of how to do this byte per byte.
精彩评论