Floating point comparison irreproducibility
My Ph.D. student and I have encountered a problem in a physics data analysis context that I could use some insight on. We have code that analyzes data from one of the LHC experiments and gives irreproducible results. In particular, the results of calculations obtained from the same binary, run on the same machine, can differ between successive executions. We are aware of the many different sources of irreproducibility, but have excluded the usual suspects.
We have tracked the problem down to irreproducibility of (double precision) floating point comparison operations when comparing two numbers that nominally have the same value. This can happen occasionally as a result of prior steps in the analysis. An example we just found tests whether a number is less than 0.3 (note that we NEVER test for equality between floating point values). It turns out that due to the geometry of the detector, it was possible for the calculation to occasionally produce a result that was exactly 0.3 (or its closest double precision representation).
We are well aware of the pitfalls in comparing floating point numbers and also of the potential for excess precision in the FPU to affect the results of the comparison. The question I would like to have answered is "why are the results irreproducible?" Is it because the FPU register load or other FPU instructions are not clearing the excess bits, and thus "leftover" bits from previous calculations are affecting the results? (This seems unlikely.) I saw a suggestion on another forum that context switches between processes or threads could also induce a change in floating point comparison results, because the contents of the FPU are stored on the stack and thus truncated. Any comments on these or other possible explanations would be appreciated.
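To make the failure mode concrete, here is a minimal sketch (hypothetical code, not our analysis) of how an x87 comparison can see extra bits that the stored copy of the same value has lost. Whether it actually fires depends on compiler, flags, and the value involved; compile for x87 math (e.g. g++ -m32 -mfpmath=387 -O2) to have a chance of seeing it, while with SSE math both tests always agree:

#include <cstdio>
#include <cstdlib>

int main( int argc, char** argv ) {
    if( argc < 2 ) return 1;
    double den = atof( argv[1] );   // runtime value, so nothing is constant-folded

    double stored = 1.0 / den;      // rounded to 64 bits when stored
    // The recomputed quotient may be compared straight out of an 80-bit
    // register, with extra bits below the double's last place:
    if( 1.0 / den < stored )
        puts( "excess-precision quotient compared below its stored copy" );
    else
        puts( "both sides saw the same 64-bit value" );
    return 0;
}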
At a guess, what's happening is that your computations are normally being carried out to a few extra bits of precision inside the FPU, and only rounded at specific points (e.g., when you assign a result to a variable).
When there's a context switch, however, the state of the FPU has to be saved and restored -- and there's at least a fair chance that those extra bits are not being saved and restored with it. When that happens, it probably wouldn't cause a major change by itself, but if (for example) you later subtract off a fixed amount from each and multiply what's left, the difference gets multiplied as well.
To be clear: I doubt that "left over" bits would be the culprit. Rather, it would be loss of extra bits causing rounding at slightly different points in the computation.
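As a toy illustration of that amplification (plain doubles standing in for the 80-bit case, values chosen only for the example): a one-ulp difference between two nominally equal values, the size of gap a dropped excess bit would leave, becomes visible once a common offset is subtracted and the remainder scaled:

#include <cstdio>
#include <cmath>

int main() {
    double a = 0.3;                        // nearest double to 0.3
    double b = std::nextafter( a, 1.0 );   // one ulp above it

    // Subtract a common offset, then scale: cancellation magnifies the gap.
    printf( "%.17g\n", (a - 0.2999999) * 1e9 );
    printf( "%.17g\n", (b - 0.2999999) * 1e9 );
}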
What platform?
Most FPUs can internally store more precision than the IEEE double representation, to avoid rounding error in intermediate results. There is often a compiler switch to trade speed for accuracy - see http://msdn.microsoft.com/en-us/library/e7s85ffb(VS.80).aspx
Is the program multi-threaded?
If yes, I would suspect a race condition.
If not, program execution is deterministic. The most probable reason for getting different results given the same inputs is undefined behaviour, i.e., a bug in your program. Reading an uninitialized variable, following a stale pointer, overwriting the lowest bits of some FP number on the stack, etc. The possibilities are endless. If you're running this on Linux, try running it under valgrind and see if it uncovers some bugs.
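For instance, a sketch of the uninitialized-variable case (hypothetical names, not your code), which valgrind reports as a use of an uninitialised value:

#include <cstdio>

// 'cut' is only assigned on one path, so the function returns whatever
// happens to be in that stack slot, and the same binary can behave
// differently from run to run.
double cut_value( bool calibrated ) {
    double cut;            // bug: uninitialized
    if( calibrated )
        cut = 0.3;
    return cut;            // undefined behaviour when calibrated == false
}

int main() {
    printf( "%g\n", cut_value( false ) );   // valgrind flags this read
}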
BTW, how did you narrow down the problem to FP comparison?
(Long shot: failing hardware? E.g., failing RAM chip might cause data to be read differently on different occasions. Though, that'd probably crash the OS rather quickly.)
Any other explanation is implausible -- bugs in the OS or the HW would not have gone undiscovered for long.
I made this:
#include <stdio.h>
#include <stdlib.h>

typedef long double ldbl;

ldbl x[1<<20];   // 1M long doubles, all filled with the same value

void hexdump( void* p, int N ) {
    for( int i=0; i<N; i++ ) printf( "%02X", ((unsigned char*)p)[i] );
}

int main( int argc, char** argv ) {
    printf( "sizeof(long double)=%i\n", (int)sizeof(ldbl) );
    if( argc<2 ) return 1;
    size_t i;
    ldbl a = ldbl(1)/atoi(argv[1]);   // per-instance "seed" from the command line
    for( i=0; i<sizeof(x)/sizeof(x[0]); i++ ) x[i]=a;
    // Compare the array against the register copy forever; any mismatch
    // (i.e. any precision lost across a context switch) gets dumped in hex.
    while(1) {
        for( i=0; i<sizeof(x)/sizeof(x[0]); i++ ) if( x[i]!=a ) {
            hexdump( &a, sizeof(a) );
            printf( " " );
            hexdump( &x[i], sizeof(x[i]) );
            printf( "\n" );
        }
    }
}
and compiled it with the Intel compiler using /Qlong_double, which produced this:
;;; for( i=0; i<sizeof(x)/sizeof(x[0]); i++ ) if( x[i]!=a ) {
xor ebx, ebx ;25.10
; LOE ebx f1
.B1.9: ; Preds .B1.19 .B1.8
mov esi, ebx ;25.47
shl esi, 4 ;25.47
fld TBYTE PTR [?x@@3PA_TA+esi] ;25.51
fucomp ;25.57
fnstsw ax ;25.57
sahf ;25.57
jp .B1.10 ; Prob 0% ;25.57
je .B1.19 ; Prob 79% ;25.57
[...]
.B1.19: ; Preds .B1.18 .B1.9
inc ebx ;25.41
cmp ebx, 1048576 ;25.17
jb .B1.9 ; Prob 82% ;25.17
and started 10 instances with different "seeds". As you can see, it compares the 10-byte long doubles in memory with the one on the FPU stack, so if the OS didn't preserve full precision we'd surely see an error. And well, they're still running without detecting anything... which is not really surprising, because x86 has instructions to save/restore the whole FPU state at once, and in any case an OS that didn't preserve full precision would be completely broken.
So either it's some unique OS/CPU/compiler combination, or the differing comparison results appear only after changing something in the program and recompiling it, or it's a bug in the program, e.g. a buffer overrun.
The CPU's FPU can store floating point values internally at higher precision than double or float. These values have to be converted whenever the registers are stored somewhere else, including when register contents are spilled out to memory (this I know for a fact), and a context switch or OS interrupt on that core sounds like another easy source. Of course, the timing of OS interrupts or context switches, or the spilling of non-hot registers, is completely unpredictable and uncontrollable by the application.
Of course, this depends on platform, but your description sounds like you run on a modern desktop or server (so x86).
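One quick, portable way to see what your platform does (a sketch using only the standard <cfloat> macros): FLT_EVAL_METHOD is 2 where everything is evaluated in long double (classic x87) and 0 where doubles stay doubles (SSE):

#include <cstdio>
#include <cfloat>

int main() {
    printf( "FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD );
    printf( "DBL_MANT_DIG    = %d\n", DBL_MANT_DIG );    // 53 bits for a double
    printf( "LDBL_MANT_DIG   = %d\n", LDBL_MANT_DIG );   // 64 bits on x87
}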
I'll just merge some of the comments from David Rodriguez and Bo Persson and make a wild guess.
Could it be task switching while using SSE3 instructions? Based on this Intel article on using SSE3 instructions, the instructions to preserve register state, FSAVE and FRSTOR, have been replaced by FXSAVE and FXRSTOR, which should handle the full length of the FPU registers.
On an x64 machine, I suppose that the "incorrect" instruction could be contained in some external compiled library.
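If you want to test this guess directly, here is a sketch (glibc-specific, assuming x86 Linux; fpu_control.h is not portable) that reads the x87 control word and reports the precision-control field, so you can check whether some library call has narrowed it:

#include <cstdio>
#include <fpu_control.h>   // glibc-specific

int main() {
    fpu_control_t cw;
    _FPU_GETCW( cw );                     // read the x87 control word
    switch( cw & _FPU_EXTENDED ) {        // mask covers the PC field
        case _FPU_EXTENDED: puts( "x87 precision: 64-bit (extended)" ); break;
        case _FPU_DOUBLE:   puts( "x87 precision: 53-bit (double)" );   break;
        case _FPU_SINGLE:   puts( "x87 precision: 24-bit (single)" );   break;
        default:            puts( "x87 precision: reserved setting" );  break;
    }
}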
You are certainly hitting GCC Bug #323, which, as others point out, is due to the excess precision of the FPU.
Solutions:
- Use SSE (or AVX, it's 2016...) to perform your computations.
- Use the -ffloat-store compile switch. From the GCC docs:
Do not store floating-point variables in registers, and inhibit other options that might change whether a floating-point value is taken from a register or memory.
This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
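A minimal sketch of the pattern those docs ask for (hypothetical helper, not code from the question): built with g++ -ffloat-store, the intermediate is forced through a named variable and rounded to a genuine 64-bit double before the cut is applied, so the test sees the same value on every run:

#include <cstdio>

// With -ffloat-store, assigning to 'ratio' forces the quotient out of the
// FPU register and rounds it to a real double before the comparison.
bool passes_cut( double numer, double denom, double cut ) {
    double ratio = numer / denom;   // stored, hence rounded to 64 bits
    return ratio < cut;             // always compares the rounded value
}

int main() {
    printf( "%d\n", passes_cut( 3.0, 10.0, 0.3 ) );   // a 0.3 cut, as in the question
}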