Why does my data not seem to be aligned?
I'm trying to figure out how best to precompute some sine and cosine values, store them in aligned blocks, and then use them later for SSE calculations.
At the beginning of my program, I create an object with member:
static __m128 *m_sincos;
then I initialize that member in the constructor:
m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++)
m_sincos[t] = _mm_set_ps(cos(t), sin(t), sin(t), cos(t));
When I go to use m_sincos, I run into three problems:
-The data does not seem to be aligned
movaps xmm0, m_sincos[t] //crashes
movups xmm0, m_sincos[t] //does not crash
-The variables do not seem to be correct
movaps result, xmm0 // returns values that are not what is in m_sincos[t]
//Although, putting a watch on m_sincos[t] displays the correct values
-What really confuses me is that this makes everything work (but is too slow):
__m128 _sincos = m_sincos[t];
movaps xmm0, _sincos
movaps result, xmm0
m_sincos[t] is a C expression. Inside an assembly instruction, however (presumably an __asm block), it's interpreted as an x86 addressing mode, with a completely different result. For example, VS2008 SP1 compiles:
movaps xmm0, m_sincos[t]
into: (see the disassembly window when the app crashes in debug mode)
movaps xmm0, xmmword ptr [t]
That interpretation attempts to copy a 128-bit value stored at the address of the variable t into xmm0. t, however, is a 32-bit value at a likely unaligned address. Executing the instruction is likely to cause an alignment fault, and would get you incorrect results in the odd case where t's address is aligned.
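To see the alignment requirement in isolation, here is a minimal sketch that tests whether an address sits on a 16-byte boundary, which is what movaps needs. The helper names is_aligned16 and table_is_aligned are made up for illustration, and _mm_malloc/_mm_free are used as a stand-in for _aligned_malloc/_aligned_free on the assumption that your compiler's xmmintrin.h provides them.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper: true when p is on a 16-byte boundary (required by movaps).
static bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: allocates a 16-byte aligned table and checks that every
// element is aligned. Since sizeof(__m128) is 16, element t lives at base + 16*t.
static bool table_is_aligned(int bins) {
    assert(bins > 0);
    // Assumption: _mm_malloc/_mm_free stand in for _aligned_malloc/_aligned_free.
    __m128* table = static_cast<__m128*>(_mm_malloc(bins * sizeof(__m128), 16));
    if (!table) return false;
    bool ok = true;
    for (int t = 0; t < bins; ++t)
        ok = ok && is_aligned16(&table[t]);
    _mm_free(table);
    return ok;
}
```

If table_is_aligned returns true, the table itself is fine, and the crash has to come from the address the instruction actually uses rather than from the allocation.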
You could fix this by using an appropriate x86 addressing mode. Here's the slow but clear version:
__asm mov eax, m_sincos ; eax <- m_sincos
__asm mov ebx, dword ptr t
__asm shl ebx, 4 ; ebx <- t * 16 ; each array element is 16-bytes (128 bit) long
__asm movaps xmm0, xmmword ptr [eax+ebx] ; xmm0 <- m_sincos[t]
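For comparison, the same address computation can be written without __asm. This is only a sketch: load_entry is a hypothetical name, and it assumes the table pointer is 16-byte aligned. _mm_load_ps is the aligned-load intrinsic, which compilers typically emit as movaps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper mirroring the four assembly instructions above.
static __m128 load_entry(const __m128* base, int t) {
    // eax <- base, ebx <- t * 16 (each element is 16 bytes long)
    const char* addr = reinterpret_cast<const char*>(base)
                     + (static_cast<std::size_t>(t) << 4);
    // An aligned load requires a 16-byte aligned address.
    assert((reinterpret_cast<std::uintptr_t>(addr) & 15u) == 0);
    // xmm0 <- [eax+ebx]
    return _mm_load_ps(reinterpret_cast<const float*>(addr));
}
```

In ordinary code you would simply write base[t]; the explicit shift is spelled out here only to match the assembly line by line.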
Sidenote:
When I put this in a complete program, something odd occurs:
#include <math.h>
#include <malloc.h> // declares _aligned_malloc
#include <xmmintrin.h>
int main()
{
static __m128 *m_sincos;
int Bins = 4;
m_sincos = (__m128*) _aligned_malloc(Bins*sizeof(__m128), 16);
for (int t=0; t<Bins; t++) {
m_sincos[t] = _mm_set_ps(cos((float) t), sin((float) t), sin((float) t), cos((float) t));
__asm movaps xmm0, m_sincos[t];
__asm mov eax, m_sincos
__asm mov ebx, t
__asm shl ebx, 4
__asm movaps xmm0, [eax+ebx];
}
return 0;
}
When you run this, if you keep an eye on the registers window, you might notice something odd. Although the results are correct, xmm0 is getting the correct value before the movaps instruction is executed. How does that happen?
A look at the generated assembly code shows that _mm_set_ps() loads the sin/cos results into xmm0, then saves it to the memory address of m_sincos[t]. But the value remains there in xmm0 too. _mm_set_ps is an 'intrinsic', not a function call; it does not attempt to restore the values of registers it uses after it's done.
If there's a lesson to take from this, it might be that when using the SSE intrinsic functions, use them throughout, so the compiler can optimize things for you. Otherwise, if you're using inline assembly, use that throughout too.
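You should always prefer the intrinsics, or even just turn on the compiler's SSE support and leave the code generation to it, rather than hand-coding the assembly. One reason is that __asm is not portable to 64-bit code: MSVC does not support inline assembly when targeting x64.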
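As a rough illustration of the intrinsics-only approach, here is a sketch that builds the table from the question and reads an entry back with no inline assembly. The names make_sincos_table and read_entry are hypothetical, and _mm_malloc/_mm_free are assumed to be available as a stand-in for _aligned_malloc/_aligned_free.

```cpp
#include <cmath>
#include <xmmintrin.h>

// Hypothetical helper: allocates and fills the table as in the question.
// The caller releases it with _mm_free.
static __m128* make_sincos_table(int bins) {
    __m128* table = static_cast<__m128*>(_mm_malloc(bins * sizeof(__m128), 16));
    if (!table) return nullptr;
    for (int t = 0; t < bins; ++t) {
        const float c = std::cos(static_cast<float>(t));
        const float s = std::sin(static_cast<float>(t));
        // _mm_set_ps takes its arguments from the highest lane down to the
        // lowest, so the order in memory is {c, s, s, c}.
        table[t] = _mm_set_ps(c, s, s, c);
    }
    return table;
}

// Hypothetical helper: copies entry t into four floats for inspection.
// The compiler picks the registers and the load instruction for table[t].
static void read_entry(const __m128* table, int t, float out[4]) {
    _mm_storeu_ps(out, table[t]);
}
```

Because everything is expressed through intrinsics, the same source should build for both 32-bit and 64-bit targets, and the compiler is free to keep values in registers between the two steps.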