Question regarding IEEE 754, 64 bits double?

Please take a look at the following content:

[Image: formula for the value of an IEEE 754 64-bit double, highlighted in red]

I understand how to convert a double to binary according to IEEE 754, but I don't understand what the formula is used for.

Can anyone give me an example of when we would use the above formula, please?

Thanks a lot.


The formula that is highlighted in red can be used to calculate the real number that a 64-bit value represents when treated as an IEEE 754 double. It's only useful if you want to manually calculate the conversion from binary to the base-10 real number that it represents, such as when verifying the correctness of a C library's implementation of printf.

For example, using the formula on 0x3fd5555555555555, x is found to be exactly 0.333333333333333314829616256247390992939472198486328125. That is the real number that 0x3fd5555555555555 represents.

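#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  /* Reinterpret a 64-bit pattern as a double. Type punning through a
     union is allowed in C. */
  union {
    double d;
    unsigned long long ull;
  } u;

  u.ull = 0x3fd5555555555555ULL;  /* ULL: the constant needs 64 bits */
  printf("%.55f\n", u.d);         /* 55 decimal places print the value in full */

  return EXIT_SUCCESS;
}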

http://codepad.org/kSithgZQ
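
The union trick lets the hardware do the decoding. To see the formula itself at work, here is a minimal sketch that evaluates it by hand. It assumes the highlighted formula is the usual one for normal doubles, x = (-1)^s × (1 + f/2^52) × 2^(e-1023). The names s, e, f, scale_by_pow2 and decode_double are illustrative labels, not taken from the question.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply x by 2^n with repeated doublings and halvings; every step is
   exact as long as the result stays in the normal range. */
static double scale_by_pow2(double x, int n)
{
    for (; n > 0; n--) x *= 2.0;
    for (; n < 0; n++) x *= 0.5;
    return x;
}

/* Evaluate (-1)^s * (1 + f / 2^52) * 2^(e - 1023) for a normal double. */
double decode_double(uint64_t bits)
{
    uint64_t s = bits >> 63;                        /* 1 sign bit */
    uint64_t e = (bits >> 52) & 0x7ff;              /* 11 exponent bits, biased by 1023 */
    uint64_t f = bits & UINT64_C(0xfffffffffffff);  /* 52 fraction bits */

    /* Assumption of this sketch: zero, subnormals, infinities and NaNs
       (e == 0 or e == 0x7ff) follow different rules and are rejected. */
    assert(e != 0 && e != 0x7ff);

    double significand = 1.0 + (double)f / 4503599627370496.0;  /* divisor is 2^52 */
    double magnitude = scale_by_pow2(significand, (int)e - 1023);
    return s ? -magnitude : magnitude;
}
```

Calling decode_double(0x3fd5555555555555ULL) should give the same double that the union example prints. The standard ldexp function from <math.h> could replace scale_by_pow2; the loop is used here only to keep the arithmetic visible.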

EDIT: As Olof commented, an IEEE 754 double exactly represents the value x in the equation, but not all real numbers are exactly representable. In fact, only a finite number of reals such as 0.5, 0.125, and 0.333333333333333314829616256247390992939472198486328125 are exactly representable, while the vast majority (uncountably many) including 1/3, 0.1, 0.4, and π are not.

The key to knowing whether a real is exactly representable as an IEEE 754 double is to calculate the real number's binary representation and write it in normalized scientific notation (e.g. b1.001×2^-1 for 0.5625). If the number of binary digits to the right of the binary point, excluding trailing zeroes, is less than or equal to 52 and the exponent is between -1022 and +1023, inclusive, then the number is exactly representable as a normal double. (Subnormal doubles extend the range down to 2^-1074, but with fewer significant digits.)
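
For a fraction p/q of two integers, the rule above can be checked mechanically. The following is a rough sketch under stated assumptions: p and q are non-negative 64-bit integers (so the exponent always falls well inside the allowed range and only the digit count needs testing), and the function names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Greatest common divisor by Euclid's algorithm. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* True if the non-negative rational p/q is exactly representable as a double. */
bool is_exactly_representable(uint64_t p, uint64_t q)
{
    assert(q != 0);
    if (p == 0)
        return true;

    uint64_t g = gcd_u64(p, q);
    p /= g;
    q /= g;

    /* The binary expansion terminates only if the reduced denominator
       is a power of two. */
    if ((q & (q - 1)) != 0)
        return false;

    /* Strip trailing zero bits; what is left is the digit string of
       "1.xxx" without the binary point. */
    while ((p & 1) == 0)
        p >>= 1;

    /* One leading 1 plus at most 52 fraction digits: at most 53 bits. */
    return p < (UINT64_C(1) << 53);
}
```

With this sketch, is_exactly_representable(1, 64) and is_exactly_representable(9, 16) are true, while is_exactly_representable(1, 10) and is_exactly_representable(1, 3) are false.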

Let's go through a couple of examples. Note that it helps to have an arbitrary-precision calculator on hand. I will use ARIBAS.

  1. The number 1/64 is 0.015625 in decimal. To calculate its binary representation, we can use ARIBAS' decode_float function:

    ==> set_floatprec(double_float).
    -: 64
    
    ==> 1/64.
    -: 0.0156250000000000000
    
    ==> set_printbase(2).
    -: 0y10
    
    ==> decode_float(1/64).
    -: (0y10000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000, 
    -0y1000101)
    
    ==> set_printbase(10).
    -: 10
    
    ==> -0y1000101.
    -: -69
    

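    decode_float returns a 64-bit mantissa, here 2^63, and an exponent of -69, and 2^63 × 2^-69 = 2^-6. Thus 1/64 = b0.000001, or b1.0×2^-6 in scientific notation.

    1/64 is exactly representable: there are no nonzero digits after the binary point and the exponent -6 is in range.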

  2. The number 1/10 is 0.1 in decimal. To calculate its binary representation:

    ==> set_printbase(2).
    -: 0y10
    
    ==> decode_float(1/10).
    -: (0y11001100_11001100_11001100_11001100_11001100_11001100_11001100_11001100, 
    -0y1000011)
    
    ==> set_printbase(10).
    -: 10
    
    ==> -0y1000011.
    -: -67
    

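    So 1/10 = 0.1 = b0.0001100 1100 1100..., where the block 1100 repeats forever, or b1.100 1100 1100...×2^-4 in scientific notation.

    1/10 is not exactly representable, because its binary expansion never terminates and so needs more than 52 digits after the binary point.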
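
Both examples can also be cross-checked in C without ARIBAS by looking at the bits the compiler actually stores. This is only a sketch; the helper name double_bits is hypothetical, and it assumes the platform's double is the IEEE 754 64-bit format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits wide");

/* Return the raw 64-bit pattern that stores x. */
uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* well-defined way to reinterpret bytes */
    return bits;
}
```

For 1/64, double_bits(0.015625) is 0x3f90000000000000: the 52 fraction bits are all zero, so nothing was lost. For 1/10, double_bits(0.1) is 0x3fb999999999999a: the fraction field is the repeating 1001 pattern cut off after 52 bits and rounded up in the last place, which is why the stored value is only close to 0.1.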


The formula converts the binary representation into a number.

You only need it if you are implementing a floating-point unit.
