Exact representation of floating-point numbers in C

2023-01-27 16:14

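#include <stdio.h>

int main(void)
{
    float a = 0.7;          /* the double constant 0.7 is rounded to float */

    if (a < 0.7)            /* a is converted back to double for the comparison */
        printf("c");
    else
        printf("c++");
    return 0;
}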

In the code above, "c" is printed for 0.7, but if 0.7 is replaced with 0.8, "c++" is printed. Why?

And how is any float represented in binary form?

At some places, it is mentioned that internally 0.7 will be stored as 0.699997, but 0.8 as 0.8000011. Why so?

Basically, with float you get 32 bits that encode (for normal numbers)

VALUE   = (-1)^SIGN * 1.MANTISSA * 2^(EXPONENT - 127)
32 bits = 1 bit (SIGN) + 8 bits (EXPONENT) + 23 bits (MANTISSA)

and that is stored as

MSB                    LSB
[SIGN][EXPONENT][MANTISSA]

Since you only get 23 stored bits, that is the amount of "precision" you can keep. If you try to represent a fraction whose base-2 expansion never terminates (it repeats forever), the sequence of bits is rounded off at the last stored bit.

0.7 in base 10 is 7 / 10, which in binary is 0b111 / 0b1010. Doing that division, you get:

0.1011001100110011001100110011001100110011001100110011... etc

Since this repeats, in fixed precision there is no way to exactly represent it. The same goes for 0.8 which in binary is:

0.1100110011001100110011001100110011001100110011001100... etc

To see what the fixed-precision value of these numbers is, you have to "cut them off" at the number of bits you have and do the math. The only trick is that the leading 1 is implied and not stored, so you technically get an extra bit of precision (24 significant bits in total). Because of rounding, the last bit will be a 1 or a 0 depending on the value of the truncated bits.

So the value of 0.7 is effectively 11,744,051 / 2^24 (the discarded bits round down) = 0.699999988, and the value of 0.8 is effectively 13,421,773 / 2^24 (rounded up) = 0.800000012.

That's all there is to it :)

A good reference for this is What Every Computer Scientist Should Know About Floating-Point Arithmetic. You can use higher precision types (e.g. double) or a Binary Coded Decimal (BCD) library to achieve better floating point precision if you need it.

The internal representation is IEEE 754.

You can also use this calculator to convert decimal to float; I hope this helps you understand the format.

Floats are stored as described in IEEE 754: 1 bit for the sign, 8 for a biased exponent, and the remaining 23 for the fractional part.

Think of numbers representable as floats as points on the number line, some distance apart; frequently, decimal fractions will fall in between these points, and the nearest representation will be used; this leads to the counterintuitive results you describe.

"What every computer scientist should know about floating point arithmetic" should answer all your questions in detail.

If you want to know how float/double is represented in C (and almost all other languages), please refer to the Standard for Floating-Point Arithmetic (IEEE 754): http://en.wikipedia.org/wiki/IEEE_754-2008

Using single-precision floats as an example, here is the bit layout:  
seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm    meaning  
31                              0    bit #  
s = sign bit, e = exponent, m = mantissa

Another good resource to see how floating point numbers are stored as binary in computers is Wikipedia's page on IEEE-754.

Floating point numbers in C/C++ are represented in IEEE-754 standard format on virtually all current implementations (the language standards themselves do not require it). There are many articles on the internet that describe, in much better detail than I can here, how exactly a floating point number is represented in binary. A simple search for IEEE-754 should illuminate the mystery.

0.7 is a numeric literal; its value is the mathematical real number 0.7, rounded to the nearest double value.

After initialising float a = 0.7, the value of a is 0.7 rounded to float, that is the real number 0.7, rounded to the nearest double value, rounded to the nearest float value. Except by a huge coincidence, you wouldn't expect a to be equal to 0.7.

"if (a < 0.7)" compares 0.7 rounded to double then to float with the number 0.7 rounded to double. It seems that in the case of 0.7, the rounding produced a smaller number. And in the same experiment with 0.8, rounding 0.8 to float will produce a larger number than 0.8.

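Exact comparisons between floating point values are unreliable whenever the operands have been rounded differently, as they are here. Either compare like with like (for example a < 0.7f, so that both sides are rounded to float) or use a threshold-tolerant (epsilon) comparison.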

Try IEEE-754 Floating-Point Conversion and see what you get. :)
