128bit hash comparison with SSE
PS: The targeted machines are x86_64 servers with SSE2 instructions; I'm also interested in a NEON instruction for the same job.
There are no 128-bit integer comparison instructions in the SSE or NEON instruction sets.
SSE4.1 added vector 64-bit integer comparisons: PCMPEQQ and PCMPGTQ, but because of the way they are implemented it is not straightforward to piece two of them together into a 128-bit comparison.
The preferred way to accomplish a 128-bit comparison on x86_64 is to use a 64-bit comparison on the high word, then an additional 64-bit comparison on the low word only if the high words compare equal:
cmp {ahi}, {bhi}
jne 0f
cmp {alo}, {blo}
0: // flags are now set as though a comparison of unsigned 128-bit values
// was performed; signed comparisons are a bit different.
On ARM, the usual idiom is a sequence of conditional comparisons word-by-word to set the flags as needed.
Actually the 128 bit comparison of two values a
and b
is possible using SSE 4.1 with two instructions and a spare register set to zero before.
In x86 assembly, using legacy 128 bit SSE:
pxor %xmm2, %xmm2 # set xmm2 to zero. Should be moved out of the loop.
# compare %xmm0 to %xmm1 for equality
pxor %xmm0, %xmm1 # xmm1 is zero if both operands are equal
ptest %xmm2, %xmm1 # test not(xmm2) and xmm1. If any bit in xmm1 is set
jc equal # the carry flag is cleared.
not_equal:
...
equal:
Using intrinsics in C is preferred, as they will automatically benefit from AVX 3 operands syntax, which does actually save a considerable amount of SSE register moves.
static const __m128i zero = {0};
inline bool compare128(__m128i a, __m128i b) {
__m128i c = _mm_xor_si128(a, b);
return _mm_testc_si128(zero, c);
}
This compiles to something similar as above, especially the bool temporary gets folded and the carry flag is used directly.
PCMPGT will not compare the entire 128 bits, it will always work with smaller units and produce separate results. Additionally it works on signed values, which complicates things further.
If you are running in 64 bit mode, I think it would be fastest to use two native 64 bit subtracts or compares.
Not sure why you can't find the documentation, it is all in the intel instruction set reference.
精彩评论