How to optimize a cycle?

2023-01-21 12:03 问答作者：

I have the following bottleneck function.

typedef unsigned char byte;
void CompareArrays(const byte * p1S开发者_高级运维tart, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;
     for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) {
        *p3 = (*p1 < *p2 ) ? b1 : b2;
    }
}

I want to replace C++ code with SSE2 intinsic functions. I have tried _mm_cmpgt_epi8 but it used signed compare. I need unsigned compare.

Is there any trick (SSE, SSE2, SSSE3) to solve my problem?

Note: I do not want to use multi-threading in this case.

Instead of offsetting your signed values to make them unsigned, a slightly more efficient way would be to do the following:

use _mm_min_epu8 to get the unsigned min of p1, p2
compare this min for equality with p2 using _mm_cmpeq_epi8
the resulting mask will now be 0x00 for elements where p1 < p2 and 0xff for elements where p1 >= p2
you can now use this mask with _mm_or_si128 and _mm_andc_si128 to select the appropriate b1/b2 values

Note that this is 4 instructions in total, compared with 5 using the offset + signed comparison approach.

You can subtract 127 from your numbers, and then use _mm_cmpgt_epi8

Yes, this can be done in SIMD, but it will take a few steps to make the mask.

Ruslik got it right, I think. You want to xor each component with 0x80 to flip the sense of the signed and unsigned comparison. _mm_xor_si128 (PXOR) gets you that -- you'll need to create the mask as a static char array somewhere before loading it into a SIMD register. Then _mm_cmpgt_epi8 gets you a mask and you can use a bitwise AND (eg _mm_and_si128) to perform a masked-move.

Yes, SSE will not work here. You can improve this code performance on multi-core computer by using OpenMP:

void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
     const byte b1 = 128-30;
     const byte b2 = 128+30;

     int n = p1End - p1Start;
     #pragma omp parallel for
     for (int i = 0; i < n; ++p1, ++i) 
     {
        p3[i] = (p1[i] < p2[i]) ? b1 : b2;
     }
}

Unfortunately, many of the answers above are incorrect. Let's assume a 3-bit word:

unsigned: 4 5 6 7 0 1 2 3 == signed: -4 -3 -2 -1 0 1 2 3 (bits: 100 101 110 111 000 001 010 011)

The method by Paul R is incorrect. Suppose we want to know if 3 > 2. min(3,2) == 2, which suggests yes, so the method works here. Now suppose we want to know if if 7 > 2. The value 7 is -1 in signed representation, so min(-1,2) == -1, which suggests wrongly that 7 is not greater than 2 unsigned.

The method by Andrey is also incorrect. Suppose we want to know if 7 > 2, or a = 7, and b = 2. The value 7 is -1 in signed representation, so the first term (a > b) fails, and the method suggests that 7 is not greater than 2.

However, the method by BJobnh, as corrected by Alexey, is correct. Just subtract 2^(n-1) from the values, where n is the number of bits. In this case, we would subtract 4 to obtain new corresponding values:

old signed: -4 -3 -2 -1 0 1 2 3 => new signed: 0 1 2 3 -4 -3 -2 -1 == new unsigned 0 1 2 3 4 5 6 7.

In other words, unsigned_greater_than(a,b) is equivalent to signed_greater_than(a - 2^(n-1), b - 2^(n-1)).

use pcmpeqb and be the Power with you.

继续阅读：assembly intrinsics optimization sse2

How to optimize a cycle?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？