How can I use SSE (and SSE2, SSE3, etc.) extensions when building with Visual C++?
I'm working on a small optimisation of a basic dot product function, using SSE instructions in Visual Studio.
Here is my code (the calling convention is cdecl):
float SSEDP4(const vect & vec1, const vect & vec2)
{
__asm
{
// get addresses
mov ecx, dword ptr[vec1]
mov edx, dword ptr[vec2]
// get the first vector
movups xmm1, xmmword ptr[ecx]
// get the second vector (must use movups, because data is not assured to be aligned to 16 bytes => TODO align data)
movups xmm1, xmmword ptr[edx]
// OP by OP multiply with second vector (by address)
mulps xmm1, xmm2
// add everything with horizontal add func (SSE3)
haddps xmm1, xmm1
// is one addition enough ?
// try to extract, we'll see
pextrd eax, xmm1, 03h
}
}
vect is a simple struct that contains 4 single-precision floats, not aligned to 16 bytes (that is why I use movups and not movaps).
vec1 is initialized with (1.0, 1.2, 1.4, 1.0) and vec2 with (2.0, 1.8, 1.6, 1.0).
Everything compiles fine, but at execution I get 0 in both XMM registers, and so 0 as the result. While debugging, Visual Studio shows me 2 registers (MMX1 and MMX2, or sometimes MMX2 and MMX3), which are 64-bit registers, but no XMM registers, and everything is 0.
Does anyone have an idea of what's happening?
Thank you in advance :)
There are a couple of ways to get at SSE instructions on MSVC++:
- Compiler Intrinsics -> http://msdn.microsoft.com/en-us/library/t467de55.aspx
- External MASM file.
Inline assembly (as in your example code) is no longer a reasonable option because it only compiles when building for 32-bit x86 targets (e.g. building a 64-bit binary will fail).
Moreover, assembly blocks inhibit most optimizations. This is bad for you because even simple things like inlining won't happen for your function. Intrinsics work in a manner that does not defeat optimizers.
You compiled and ran correctly, so you are at least able to use SSE.
In order to view SSE registers in the Registers window, right click on the Registers window and select SSE. That should let you see the XMM registers.
You can also use @xmm<register><component> (e.g., @xmm00 to view xmm0[0]) in the watch window to look at individual components of the XMM registers.
Now, as for your actual problem, you are overwriting xmm1 with [edx] instead of stuffing that into xmm2.
Also, scalar floating point values are returned on the x87 stack in st(0). Instead of trying to remember how to do that, I simply store the result in a stack variable and let the compiler do it for me:
float SSEDP4(const vect & vec1, const vect & vec2)
{
float result;
__asm
{
// get addresses
mov ecx, dword ptr[vec1]
mov edx, dword ptr[vec2]
// get the first vector
movups xmm1, xmmword ptr[ecx]
// get the second vector (must use movups, because data is not assured to be aligned to 16 bytes => TODO align data)
movups xmm2, xmmword ptr[edx] // xmm2, not xmm1
// OP by OP multiply with second vector (by address)
mulps xmm1, xmm2
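// add everything with horizontal add func (SSE3)
// one haddps leaves (p0+p1, p2+p3, p0+p1, p2+p3), so a second one is needed for the full sum
haddps xmm1, xmm1
haddps xmm1, xmm1
// store the low float; movss (SSE) is used here because pextrd requires SSE4.1
movss dword ptr [result], xmm1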
}
return result;
}