Profiling SIMD Code
UPDATED - Check Below
Will keep this as short as possible. Happy to add any more details if required.
I have some sse code for normalising a vector. I'm using QueryPerformanceCounter() (wrapped in a helper struct) to measure performance.
If I measure like this
for( int j = 0; j < NUM_VECTORS; ++j )
{
Timer t(norm_sse);
NormaliseSSE( vectors_sse+j);
}
The results I get are often slower than just doing a standard normalise with 4 doubles representing a vector (testing in the same configuration).
for( int j = 0; j < NUM_VECTORS; ++j )
{
Timer t(norm_dbl);
NormaliseDBL( vectors_dbl+j);
}
However, timing just the entirety of the loop like this
{
Timer t(norm_sse);
for( int j = 0; j < NUM_VECTORS; ++j ){
NormaliseSSE( vectors_sse+j );
}
}
shows the SSE code to be an order of magnitude faster, but doesn't really affect the measurements for the double version. I've done a fair bit of experimentation and searching, and can't seem to find a reasonable answer as to why.
For example, I know there can be penalities when casting the results to float, but none of that is going on here.
Can anyone offer any insight? What is it about calling QueryPerformanceCounter between each normalise that slows the SIMD code down so much?
Thanks for reading :)
More details below:
- Both normalise methods are inlined (verified in disassembly)
- Running in release
- 32 bit compilation
Simple Vector struct
_declspec(align(16)) struct FVECTOR{
typedef float REAL;
union{
struct { REAL x, y, z, w; };
__m128 Vec;
};
};
Code to Normalise SSE:
__m128 Vec = _v->Vec;
__m128 sqr = _mm_mul_ps( Vec, Vec ); // Vec * Vec
__m128 yxwz = _mm_shuffle_ps( sqr, sqr , 0x4e );
__m128 addOne = _mm_add_ps( sqr, yxwz );
__m128 swapPairs = _mm_shuffle_ps( addOne, addOne , 0x11 );
__m128 addTwo = _mm_add_ps( addOne, swapPairs );
__m128 invSqrOne = _mm_rsqrt_ps( addTwo );
_v->Vec = _mm_mul_ps( invSqrOne, Vec );
Code to normalise doubles
double len_recip = 1./sqrt(v->x*v->x + v->y*v->y + v->z*v->z);
v->x *= len_recip;
v->y *= len_recip;
v->z *= len_recip;
Helper struct
struct Timer{
Timer( LARGE_INTEGER & a_Storage ): Storage( a_Storage ){
QueryPerformanceCounter( &PStart );
}
~Timer(){
LARGE_INTEGER PEnd;
QueryPerformanceCounter( &PEnd );
Storage.QuadPart += ( PEnd.QuadPart - PStart.QuadPart );
}
LARGE_INTEGER& Storage;
LARGE_INTEGER PStart;
};
Update So thanks to Johns comments, I think I've managed to confirm that it is QueryPerformanceCounter thats doing bad things to my simd code.
I added a new timer struct that uses RDTSC directly, and it seems to give results consistent to what I would expect. The result is still far slower than timing the entire loop, rather than each iteration separately, but I expect that that's because Getting the RDTSC involves flushing the instruction pipeline (Check http://www.strchr.com/performance_measurements_with_rdtsc for more info).
struct PreciseTimer{
PreciseTimer( LARGE_INTEGER& a_Storage ) : Storage(a_Storage){
StartVal.QuadPart = GetRDTSC();
}
~PreciseTimer(){
Storage.QuadPart += ( GetRDTSC() - StartVal.QuadPart );
}
unsigned __int64 inline GetRDTSC() {
开发者_如何学JAVA unsigned int lo, hi;
__asm {
; Flush the pipeline
xor eax, eax
CPUID
; Get RDTSC counter in edx:eax
RDTSC
mov DWORD PTR [hi], edx
mov DWORD PTR [lo], eax
}
return (unsigned __int64)(hi << 32 | lo);
}
LARGE_INTEGER StartVal;
LARGE_INTEGER& Storage;
};
When it's only the SSE code running the loop, the processor should be able to keep its pipelines full and executing a huge number of SIMD instructions per unit time. When you add the timer code within the loop, now there's a whole bunch of non-SIMD instructions, possibly less predictable, between each of the easy-to-optimize operations. It's likely that the QueryPerformanceCounter call is either expensive enough to make the data manipulation part insignificant, or the nature of the code it executes wreaks havoc with the processor's ability to keep executing instructions at the maximum rate (possibly due to cache evictions or branches that are not well-predicted).
You might try commenting out the actual calls to QPC in your Timer class and see how it performs--this may help you discover if it's the construction and destruction of the Timer objects that is the problem, or the QPC calls. Likewise, try just calling QPC directly in the loop instead of making a Timer and see how that compares.
QPC is a kernel function, and calling it causes a context switch, which is inherently far more expensive and destructive than any equivalent user-mode function call, and will definitely annihilate the processor's ability to process at it's normal speed. In addition to that, remember that QPC/QPF are abstractions and require their own processing- which likely involves the use of SSE itself.
精彩评论