SSE2 intrinsics: access memory directly
Many SSE instructions allow the source operand to be a 16-byte-aligned memory address, for example the various (un)pack instructions. PUNPCKLBW has the following signature:
PUNPCKLBW xmm1, xmm2/m128
Now this doesn't seem to be possible at all with intrinsics. It looks like it's mandatory to use _mm_load* intrinsics to read anything in memory. This is the intrinsic for PUNPCKLBW:
__m128i _mm_unpacklo_epi8 (__m128i a, __m128i b);
(As far as I know, the __m128i type always refers to an XMM register.)
Now, why is this? It's rather sad since I see some optimization potential by addressing memory directly...
The intrinsics correspond relatively directly to actual instructions, but compilers are not obligated to issue the corresponding instructions. Optimizing a load followed by an operation (even when written in intrinsics) into the memory form of the operation is a common optimization performed by all respectable compilers when it is advantageous to do so.
TLDR: write the load and the operation in intrinsics, and let the compiler optimize it.
Edit: trivial example:
#include <emmintrin.h>
__m128i foo(__m128i *addr) {
    __m128i a = _mm_load_si128(addr);
    __m128i b = _mm_load_si128(addr + 1);
    return _mm_unpacklo_epi8(a, b);
}
Compiling with gcc -Os -fomit-frame-pointer gives:
_foo:
    movdqa    (%rdi), %xmm0
    punpcklbw 16(%rdi), %xmm0
    retq
See? The optimizer will sort it out.
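To confirm that the result does not depend on whether the compiler folds the load, the same load-then-unpack pattern can be wrapped so its output is inspectable. This is only a sketch: the name interleave_low and the std::array return type are hypothetical and not part of the original answer, and it assumes an x86 target with SSE2 enabled.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the original answer): interleaves the low
// 8 bytes of two adjacent 16-byte blocks, exactly like foo() above.
// Precondition: src is 16-byte aligned and points to at least 32 bytes.
std::array<std::uint8_t, 16> interleave_low(const std::uint8_t* src) {
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    const __m128i* p = reinterpret_cast<const __m128i*>(src);

    // Written as explicit loads; the compiler is free to fold the second
    // load into the memory-operand form of punpcklbw.
    __m128i a = _mm_load_si128(p);
    __m128i b = _mm_load_si128(p + 1);
    __m128i r = _mm_unpacklo_epi8(a, b);

    // Unaligned store, so the output array needs no special alignment.
    std::array<std::uint8_t, 16> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), r);
    return out;
}
```

With an aligned 32-byte buffer holding the values 0 through 31, the returned bytes alternate between the two blocks: 0, 16, 1, 17, and so on up to 7, 23. The result is the same whichever instruction form the compiler picks.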
You can also pass values in memory straight to the intrinsic by dereferencing a __m128i pointer. For example:
__m128i *p=static_cast<__m128i *>(_aligned_malloc(8*4,16));
for(int i=0;i<32;++i)
    reinterpret_cast<unsigned char *>(p)[i]=static_cast<unsigned char>(i);
__m128i xyz=_mm_unpackhi_epi8(p[0],p[1]);
The interesting part of the result:
; __m128i xyz=_mm_unpackhi_epi8(p[0],p[1]);
0040BC1B 66 0F 6F 00 movdqa xmm0,xmmword ptr [eax]
0040BC1F 66 0F 6F 48 10 movdqa xmm1,xmmword ptr [eax+10h]
0040BC24 66 0F 68 C1 punpckhbw xmm0,xmm1
0040BC28 66 0F 7F 04 24 movdqa xmmword ptr [esp],xmm0
The compiler does not fold the second load into punpckhbw here (perhaps this form is faster, or different compiler options would change it), but the generated code works, and the C++ code states what it wants fairly directly.
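The snippet above relies on _aligned_malloc, which is specific to the Microsoft runtime. A portable variant of the same idea might look like the sketch below, which gets its alignment from alignas instead. The TwoVectors struct and the interleave_high function are hypothetical names introduced only for illustration, and an x86 target with SSE2 enabled is assumed.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

// Hypothetical buffer type: 32 bytes, aligned so that it can be viewed
// as two __m128i values.
struct alignas(16) TwoVectors {
    std::uint8_t bytes[32];
};

// Hypothetical helper: interleaves the high 8 bytes of the two vectors,
// passing the memory values straight to the intrinsic.
std::array<std::uint8_t, 16> interleave_high(const TwoVectors& buf) {
    const __m128i* p = reinterpret_cast<const __m128i*>(buf.bytes);

    // No explicit _mm_load_* call: dereferencing an aligned __m128i
    // pointer is enough, and the compiler chooses how to load it.
    __m128i r = _mm_unpackhi_epi8(p[0], p[1]);

    std::array<std::uint8_t, 16> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), r);
    return out;
}
```

Filled with the values 0 through 31 as in the example above, this yields 8, 24, 9, 25, and so on up to 15, 31. Whether the second operand ends up as a separate movdqa or as a memory operand of punpckhbw is up to the compiler and its optimization settings.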