How to optimize a cycle?
I have the following bottleneck function.
typedef unsigned char byte;
void CompareArrays(const byte * p1S开发者_高级运维tart, const byte * p1End, const byte * p2, byte * p3)
{
const byte b1 = 128-30;
const byte b2 = 128+30;
for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) {
*p3 = (*p1 < *p2 ) ? b1 : b2;
}
}
I want to replace C++
code with SSE2 intinsic functions. I have tried _mm_cmpgt_epi8
but it used signed compare. I need unsigned compare.
Is there any trick (SSE, SSE2, SSSE3) to solve my problem?
Note: I do not want to use multi-threading in this case.
Instead of offsetting your signed values to make them unsigned, a slightly more efficient way would be to do the following:
- use
_mm_min_epu8
to get the unsigned min of p1, p2 - compare this min for equality with p2 using
_mm_cmpeq_epi8
- the resulting mask will now be 0x00 for elements where p1 < p2 and 0xff for elements where p1 >= p2
- you can now use this mask with
_mm_or_si128
and_mm_andc_si128
to select the appropriate b1/b2 values
Note that this is 4 instructions in total, compared with 5 using the offset + signed comparison approach.
You can subtract 127 from your numbers, and then use _mm_cmpgt_epi8
Yes, this can be done in SIMD, but it will take a few steps to make the mask.
Ruslik got it right, I think. You want to xor each component with 0x80 to flip the sense of the signed and unsigned comparison. _mm_xor_si128 (PXOR
) gets you that -- you'll need to create the mask as a static char array somewhere before loading it into a SIMD register. Then _mm_cmpgt_epi8 gets you a mask and you can use a bitwise AND (eg _mm_and_si128
) to perform a masked-move.
Yes, SSE will not work here. You can improve this code performance on multi-core computer by using OpenMP:
void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3) { const byte b1 = 128-30; const byte b2 = 128+30; int n = p1End - p1Start; #pragma omp parallel for for (int i = 0; i < n; ++p1, ++i) { p3[i] = (p1[i] < p2[i]) ? b1 : b2; } }
Unfortunately, many of the answers above are incorrect. Let's assume a 3-bit word:
unsigned: 4 5 6 7 0 1 2 3 == signed: -4 -3 -2 -1 0 1 2 3 (bits: 100 101 110 111 000 001 010 011)
The method by Paul R is incorrect. Suppose we want to know if 3 > 2. min(3,2) == 2, which suggests yes, so the method works here. Now suppose we want to know if if 7 > 2. The value 7 is -1 in signed representation, so min(-1,2) == -1, which suggests wrongly that 7 is not greater than 2 unsigned.
The method by Andrey is also incorrect. Suppose we want to know if 7 > 2, or a = 7, and b = 2. The value 7 is -1 in signed representation, so the first term (a > b) fails, and the method suggests that 7 is not greater than 2.
However, the method by BJobnh, as corrected by Alexey, is correct. Just subtract 2^(n-1) from the values, where n is the number of bits. In this case, we would subtract 4 to obtain new corresponding values:
old signed: -4 -3 -2 -1 0 1 2 3 => new signed: 0 1 2 3 -4 -3 -2 -1 == new unsigned 0 1 2 3 4 5 6 7.
In other words, unsigned_greater_than(a,b) is equivalent to signed_greater_than(a - 2^(n-1), b - 2^(n-1)).
use pcmpeqb and be the Power with you.
精彩评论