Compute the absolute difference between unsigned integers using SSE

2023-01-09 13:34 问答作者：

In C is there a branch-less technique to compute the absolute difference between two unsigned ints? For e开发者_如何转开发xample given the variables a and b, I would like the value 2 for cases when a=3, b=5 or b=3, a=5. Ideally I would also like to be able to vectorize the computation using the SSE registers.

There are several ways to do it, I'll just mention one:

SSE4

Use PMINUD and PMAXUD to separate the larger value in register #1, and the smaller value in register #2.
Subtract them.

MMX/SSE2

Flip the sign bit of the two values because the next instruction only accepts signed integer comparison.
PCMPGTD. Use this result as a mask.
Compute the results of both (a-b) and (b-a)
Use POR ( PAND ( mask, a-b ), PANDN ( mask, b-a ) ) to select the correct value for the absolute difference.

From tommesani.com, one solution for this problem is to use saturating unsigned subtraction twice.

As the saturating subtraction never goes below 0, you compute: r1 = (a-b).saturating r2 = (b-a).saturating

If a is greater than b, r1 will contain the answer, and r2 will be 0, and vice-versa for b>a. ORing the two partial results together will yield the desired result.

According to the VTUNE users manual, PSUBUSB/PSUBUSW is available for 128-bit registers, so you should be able to get a ton of parallelization this way.

max(i,j) - min(i,j)
(i>j)*(i-j) + (j>i)*(j-i)

you can certainly use SSE registers, but compiler may do this for you anyways

SSE2:

Seems to be about the same speed as Phernost's second function. Sometimes GCC schedules it to be a full cycle faster, other times a little slower.

__m128i big = _mm_set_epi32( INT_MIN, INT_MIN, INT_MIN, INT_MIN );

a = _mm_add_epi32( a, big ); // re-center the variables: send 0 to INT_MIN,
b = _mm_add_epi32( b, big ); // INT_MAX to -1, etc.
__m128i diff = _mm_sub_epi32( a, b ); // get signed difference
__m128i mask = _mm_cmpgt_epi32( b, a ); // mask: need to negate difference?
mask = _mm_andnot_si128( big, mask ); // mask = 0x7ffff... if negating
diff = _mm_xor_si128( diff, mask ); // 1's complement except MSB
diff = _mm_sub_epi32( diff, mask ); // add 1 and restore MSB

SSSE3:

Ever so slightly faster than previous. There is a lot of variation depending on how things outside the loop are declared. (For example, making a and b volatile makes things faster! It appears to be a random effect on scheduling.) But this is consistently fastest by a cycle or so.

__m128i big = _mm_set_epi32( INT_MIN, INT_MIN, INT_MIN, INT_MIN );

a = _mm_add_epi32( a, big ); // re-center the variables: send 0 to INT_MIN,
b = _mm_add_epi32( b, big ); // INT_MAX to -1, etc.
__m128i diff = _mm_sub_epi32( b, a ); // get reverse signed difference
__m128i mask = _mm_cmpgt_epi32( b, a ); // mask: need to negate difference?
mask = _mm_xor_si128( mask, big ); // mask cannot be 0 for PSIGND insn
diff = _mm_sign_epi32( diff, mask ); // negate diff if needed

SSE4 (thx rwong):

Can't test this.

__m128i diff = _mm_sub_epi32( _mm_max_epu32( a, b ), _mm_min_epu32( a, b ) );

compute the difference and return the absolute value

__m128i diff = _mm_sub_epi32(a, b);  
__m128i mask = _mm_xor_si128(diff, a);
mask = _mm_xor_si128(mask, b);
mask = _mm_srai_epi32(mask, 31);
diff = _mm_xor_si128(diff, mask);  
mask = _mm_srli_epi32(mask, 31);  
diff = _mm_add_epi32(diff, mask);

This requires one less operation that using the signed compare op, and produces less register pressure.

Same amount of register pressure as before, 2 more ops, better branch and merging of dependency chains, instruction pairing for uops decoding, and separate unit utilization. Although this requires a load, which may be out of cache. I'm out of ideas after this one.

__m128i mask, diff;
diff = _mm_set1_epi32(-1<<31); // dependency branch after
a = _mm_add_epi32(a, diff); // arithmetic sign flip
b = _mm_xor_si128(b, diff); // bitwise sign flip parallel with 'add' unit
diff = _mm_xor_si128(a, b); // reduce uops, instruction already decoded
mask = _mm_cmpgt_epi32(b, a); // parallel with xor
mask = _mm_and_si128(mask, diff); // dependency merge, branch after
a = _mm_xor_si128(a, mask); // if 2 'bit' units in CPU, parallel with next
b = _mm_xor_si128(b, mask); // reduce uops, instruction already decoded
diff = _mm_sub_epi32(a, b); // result

After timing each version with 2 million iterations on a Core2Duo, differences are immeasurable. So pick whatever is easier to understand.

One or more of the below will likely result in branchless code, depending on the machine and compiler, since the conditional expressions are all very simple.

I haven't been through all the sse answers but possibly some of the below are represented in the vector code; certainly all the below are vectorizable (if you have the unsigned compare to begin with, or fake it by toggling the msb first.). I thought it would be helpful to have some practical scalar answers to the question.

unsigned udiff( unsigned a, unsigned b )
{
      unsigned result = a-b;   // ok if a<b;
      if(a <b ) result = -result; 
      return result;
}
unsigned udiff( unsigned a, unsigned b )
{
      unsigned n =(a<b)? (unsigned)-1 : 0u;
      unsigned result = a-b;
      return (result^n)-n; // 'result' if n = 0; '-result' if n = 0xFFFFFFFF
}


unsigned udiff( unsigned a, unsigned b )
{
      unsigned axb = a^b;
      if( a < b )  axb = 0;
      return (axb^b) - (axb^a);  // a-b, or b-a
}

This will work on x86_64 (or anything where 64-bit temps are basically free)

unsigned udiff( unsigned a, unsigned b )
{
      unsigned n= (unsigned)( 
         (long long)((unsigned long long)a - (unsigned long long)b)>>32 
                      ); // same n as 2nd example
      unsigned result = a-b;
      return (result^n)-n; // 'result' if n = 0; '-result' if n = 0xFFFFFFFF
}

Try this (assumes 2nd complements, which is OK judgning by the fact that you're asking for SSE):

int d = a-b;
int ad = ((d >> 30) | 1) * d;

Explanation: sign-bit (bit 31) gets propagated down to 1st bit. the | 1 part ensures that the multiplier is either 1 or -1. Multiplications are fast on modern CPUs.

Erm ... its pretty easy ...

int diff = abs( a - b );

Easily vectorisable (Using SSE3 as):

__m128i sseDiff = _mm_abs_epi32( _mm_sub_epi32( a, b ) );

a and b are unsigned integers. Consider a=0 and b=0xffffffff. The correct absolute difference is 0xffffffff, but your solution will give 1.

继续阅读：sse unsigned

Compute the absolute difference between unsigned integers using SSE

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？