开发者

Extended (80-bit) double floating point in x87, not SSE2 - we don't miss it?

I was reading today about researchers discovering that NVidia's Phys-X libraries use x87 FP vs. SSE2. Obviously this will be suboptimal for parallel datasets where speed trumps precision. However, the article author goes on to quote:

Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD deprecated x87 since the K8 in 2003, as x86-64 is defined with SSE2 support; VIA’s C7 has supported SSE2 since 2005. In 64-bit versions of Windows, x87 is deprecated for user-m开发者_如何学Pythonode, and prohibited entirely in kernel-mode. Pretty much everyone in the industry has recommended SSE over x87 since 2005 and there are no reasons to use x87, unless software has to run on an embedded Pentium or 486.

I wondered about this. I know that x87 uses 80-bit extended doubles internally to compute values, and SSE2 doesn't. Does this not matter to anyone? It seems surprising to me. I know when I do computations on points, lines and polygons in a plane, values can be surprisingly wrong when doing subtractions, and areas can collapse and lines alias one another due to lack of precision. Using 80-bit values vs. 64-bit values could help, I would imagine.

Is this incorrect? If not, what can we use to perform extended double FP operations if x87 is phased out?


The biggest problem with x87 is basically that all register operations are done in 80 bits, whereas most of the time people only use 64 bit floats (i.e. double-precision floats). What happens is, you load a 64 bit float into the x87 stack, and it gets converted to 80 bits. You do some operations on it in 80 bits, then store it back into memory, converting it into 64 bits. You will get a different result than if you had done all the operations with just 64 bits, and with an optimizing compiler it can be very unpredictable how many conversions a value might go through, so it's hard to verify that you're getting the "correct" answer when doing regression tests.

The other problem, which only matters from the point of view of someone writing assembly (or indirectly writing assembly, in the case of someone writing a code generator for a compiler), is that the x87 uses a register stack, whereas SSE uses individually accessible registers. With x87 you have a bunch of extra instructions to manipulate the stack, and I imagine Intel and AMD would rather make their processors run fast with SSE code than trying to make those extra stack-manipulation x87 instructions run fast.

BTW if you are having problems with inaccuracy, you will want to take a look at the article "What every programmer should know about floating-point arithmetic", and then maybe use an arbitrary precision math library (e.g. GMP) instead.


To make proper use of extended-precision math, it's necessary that a language support a type which can be used to store the result of intermediate computations, and can be substituted for the expressions yielding those results. Thus, given:

void print_dist_squared(double x1, double y1, double x2, double y2)
{
  printf("%12.6f", (x2-x1)*(x2-x1)+(y2-y1)*(y2-y1));
}

there should be some type that could be used to capture and replace the common sub-expressions x2-x1 and y2-y1, allowing the code to be rewritten as:

void print_dist_squared(double x1, double y1, double x2, double y2)
{
  some_type dx = x2-x1;
  some_type dy = y2-y1;
  printf("%12.6f", dx*dx + dy*dy);
}

without altering the semantics of the program. Unfortunately, ANSI C failed to specify any type which could be used for some_type on platforms which perform extended-precision calculations, and it became far more common to blame Intel for the existence of extended-precision types than to blame ANSI's botched support.

In fact, extended-precision types have just as much value on platforms without floating-point units as they do on x87 processors, since on such processors a computation like x+y+z would entail the following steps:

  1. Unpack the mantissa, exponent, and possibly sign of x into separate registers (exponent and sign can often "double-bunk")
  2. Unpack y likewise.
  3. Right-shift the mantissa of the value with the lower exponent, if any, and then add or subtract the values.
  4. In case x and y had different signs, left-shift the mantissa until the leftmost bit is 1 and adjust the exponent appropriately.
  5. Pack the exponent and mantissa back into double format.
  6. Unpack the that temporary result.
  7. Unpack z.
  8. Right-shift the mantissa of the value with the lower exponent, if any, and then add or subtract the values.
  9. In case the earlier result and z had different signs, left-shift the mantissa until the leftmost bit is 1 and adjust the exponent appropriately.
  10. Pack the exponent and mantissa back into double format.

Using an extended-precision type will allow steps 4, 5, and 6 to be eliminated. Since a 53-bit mantissa is too large to fit in less than four 16-bit registers or two 32-bit registers, performing an addition with a 64-bit mantissa isn't any slower than using a 53-bit mantissa, so using extended-precision math offers faster computation with no downside in a language which supports a proper type to hold temporary results. There is no reason to fault Intel for providing an FPU which could perform floating-point math in the fashion that was also the most efficient method on non-FPU chips.


The other answer seems to suggest that using 80-bit precision is a bad idea, but it isn't. It performs a sometimes vital role in keeping imprecision at bay, see e.g. the writings of W. Kahan.

Always use 80-bit intermediate arithmetic if you can get away with it speed-wise. If that means you have to use x87 maths, well, do so. Support for it is ubiquitous and as long as people keep doing the right thing, it will remain ubiquitous.


Double precission is 11 bits less than f80 (about 2.5 nibbles/digits), for many app (mostly games) it wouldn't hurt. But you will need all the accuracy available for say, space program or medical app.

It's a bit misleading when some say that f80 (and discouraged by it) operating on stack. FPU registers and operations similar to stack operation, maybe that what makes people confused. It actually memory based (load/store), not stack per-se, compared to, for instance, calling convention like cdecl stdcall which do actually passing parameters via stack. and nothing wrong with that.

The big advantage of SSE actually is serialize operation, 2, 4, 8 values at once, with many varian operations. Yes you can directly transfer to register, but you will transfer that values to memory anyway at the end.

The big disadvantage of f80 is, its odd 10 byte long, it disrupt alignment. you'd have to align them 16 for faster access. but not really practicable for array.

You still have to use fpu for trigonometric and other trancedental math operations. For asm, there's many f80 tricks that really fun and useful.

For games and regular simple app (nearly all), you can just used double without getting someone died. But for a few serious, math or scientific app you just can't ditch f80.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜