开发者

exact representation of floating points in c

void main()
{
    float a = 0.7;

    if (a < 0.7)
        printf("c");
    else
        printf("c++");
} 

In the above question for 0.7, "c" will be printed, but for 0.8, "c++" wil be printed. Why?

And how is any flo开发者_如何转开发at represented in binary form?

At some places, it is mentioned that internally 0.7 will be stored as 0.699997, but 0.8 as 0.8000011. Why so?


basically with float you get 32 bits that encode

VALUE   = SIGN * MANTISSA * 2 ^ (128 - EXPONENT)
32-bits = 1-bit  23-bits               8-bits

and that is stored as

MSB                    LSB
[SIGN][EXPONENT][MANTISSA]

since you only get 23 bits, that's the amount of "precision" you can store. If you are trying to represent a fraction that is irrational (or repeating) in base 2, the sequence of bits will be "rounded off" at the 23rd bit.

0.7 base 10 is 7 / 10 which in binary is 0b111 / 0b1010 you get:

0.1011001100110011001100110011001100110011001100110011... etc

Since this repeats, in fixed precision there is no way to exactly represent it. The same goes for 0.8 which in binary is:

0.1100110011001100110011001100110011001100110011001101... etc

To see what the fixed precision value of these numbers is you have to "cut them off" at the number of bits you and do the math. The only trick is you the leading 1 is implied and not stored so you technically get an extra bit of precision. Because of rounding, the last bit will be a 1 or a 0 depending on the value of the truncated bit.

So the value of 0.7 is effectively 11,744,051 / 2^24 (no rounding effect) = 0.699999988 and the value of 0.8 is effectively 13,421,773 / 2^24 (rounded up) = 0.800000012.

That's all there is to it :)


A good reference for this is What Every Computer Scientist Should Know About Floating-Point Arithmetic. You can use higher precision types (e.g. double) or a Binary Coded Decimal (BCD) library to achieve better floating point precision if you need it.


The internal representation is IEE754.

You can also use this calculator to convert decimal to float, I hope this helps to understand the format.


floats will be stored as described in IEEE 754: 1 bit for sign, 8 for a biased exponent, and the rest storing the fractional part.

Think of numbers representable as floats as points on the number line, some distance apart; frequently, decimal fractions will fall in between these points, and the nearest representation will be used; this leads to the counterintuitive results you describe.

"What every computer scientist should know about floating point arithmetic" should answer all your questions in detail.


If you want to know how float/double is presented in C(and almost all languages), please refert to Standard for Floating-Point Arithmetic (IEEE 754) http://en.wikipedia.org/wiki/IEEE_754-2008

Using single-precision floats as an example, here is the bit layout:  
seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm    meaning  
31                              0    bit #  
s = sign bit, e = exponent, m = mantissa


Another good resource to see how floating point numbers are stored as binary in computers is Wikipedia's page on IEEE-754.


Floating point numbers in C/C++ are represented in IEEE-754 standard format. There are many articles on the internet, that describe in much better detail than I can here, how exactly a floating point is represented in binary. A simple search for IEEE-754 should illuminate the mystery.


0.7 is a numeric literal; it's value is the mathematical real number 0.7, rounded to the nearest double value.

After initialising float a = 0.7, the value of a is 0.7 rounded to float, that is the real number 0.7, rounded to the nearest double value, rounded to the nearest float value. Except by a huge coincidence, you wouldn't expect a to be equal to 0.7.

"if (a < 0.7)" compares 0.7 rounded to double then to float with the number 0.7 rounded to double. It seems that in the case of 0.7, the rounding produced a smaller number. And in the same experiment with 0.8, rounding 0.8 to float will produce a larger number than 0.8.


Floating point comparisons are not reliable, whatever you do. You should use threshold tolerant comparison/ epsilon comparison of floating points.

Try IEEE-754 Floating-Point Conversion and see what you get. :)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜