
How to sum a large number of float numbers?

I built a parallel sum code to sum a large number of float numbers, and found that when the count of numbers is bigger than 100000000 the result goes wrong. Then I built a serial version to compare, and the serial code also gets the wrong number. Does anybody know why? Thanks!

My simple code is as follows.

The result is "1.67772e+007". It should be 1e+008.

#include <iostream>
#include <vector>
#include <ctime>
#include "tbb/task_scheduler_init.h"

using namespace std;
using namespace tbb;

// parallelSum is my parallel reduction over the vector (definition omitted here)
float parallelSum(vector<float>* pvec);

int main()
{
    size_t N = 100000000;
    cout << "n is : " << N << endl;
    clock_t start = clock();
    task_scheduler_init init;

    vector<float> myvec;
    for (size_t i = 0; i < N; i++)
        myvec.push_back(1.0f);
    vector<float>* pvec = &myvec;

    float mysum = parallelSum(pvec);
    cout << " the p sum is: " << mysum << endl;

    clock_t finish = clock();
    cout << "Time Used  = " << (finish - start) / CLOCKS_PER_SEC << endl;

    mysum = 0;
    for (size_t i = 0; i < N; i++)
        mysum += myvec[i];
    cout << " the s sum is: " << mysum << endl;
    return 0;
}


Your problem is due to the limited available precision of floating point numbers.

While

1.0f + 1.0f == 2.0f, 

you will find that

16777216.0f + 1.0f == 16777216.0f

The extra 1.0f is just thrown away since 16777217 cannot be represented in float format.

Looking at your result – 1.67772e+007 – it's clear that this is exactly what has happened.
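If you want to see this for yourself, here is a minimal check (my own snippet, not part of the original answer) that the extra 1.0f really is discarded:

#include <iostream>

int main() {
    float big = 16777216.0f;             // 2^24: the last point where float can still step by 1
    std::cout << std::boolalpha
              << (big + 1.0f == big)     // prints "true": the added 1.0f is rounded away
              << std::endl;
    return 0;
}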

Your expected result, 100000000.0, is quite a lot (about 6x) bigger than 16777216.0f, but once the sum reaches a total of 16777216.0f it stays there for the remaining 83222784 additions of 1.0f.

Solution: Try using double instead of float, which goes up to 9007199254740992.0 before hitting this problem.

Why?

In single precision floating point there are only 24 bits of precision available, and 2^24 is 16777216. There is no way to encode 16777217 in 24 bits, so it simply stays at 16777216, on the reasoning that it's close enough to the real answer. The real problem arises when you add lots of very small numbers to a big number, where the sum of the small numbers is significant relative to the big one, but individually they are not.

Note that 16777216.0f is not the largest number that can be represented in float, but just represents the limit of precision. For example, 16777216.0f x 2^4 + 2^4 => 16777216.0f x 2^4

double has 53 bits of precision, so can encode up to 2^53, or 9007199254740992.0 before hitting the point where adding 1.0d fails.
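As a quick sanity check (this little driver is mine, not part of the original answer), the same 100000000 additions stay exact in double, because 1e+008 is far below 2^53:

#include <iostream>

int main() {
    double sum = 0.0;
    // Every intermediate value is an integer far below 2^53, so no rounding occurs.
    for (int i = 0; i < 100000000; ++i)
        sum += 1.0;
    std::cout << sum << std::endl;       // prints 1e+08
    return 0;
}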


This issue also represents another hazard for parallelising floating point operations: floating point addition is not associative. In other words, your sequential algorithm:

Sum(A) = (...((((A1 + A2) + A3) + A4) ... A10000)

May produce a different result from the parallelised version:

Sum(A) = (...((((A1 + A2) + A3) + A4) ... A1000)
       + (...((((A1001 + A1002) + A1003) + A1004) ... A2000)
       + (...((((A2001 + A2002) + A2003) + A2004) ... A3000)
       ...
       + (...((((A9001 + A9002) + A9003) + A9004) ... A10000)

since any given step may lose precision to a different degree.

This does not mean that either is more right, but that you may get unexpected results.
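A tiny illustration of this (my own example, assuming IEEE 754 round-to-nearest float arithmetic):

#include <iostream>

int main() {
    float a = 16777216.0f, b = 1.0f, c = 1.0f;
    // Grouping changes the result: the left-to-right sum loses both 1.0f terms,
    // while grouping b and c first keeps them.
    std::cout << ((a + b) + c) - a << std::endl;   // prints 0
    std::cout << (a + (b + c)) - a << std::endl;   // prints 2
    return 0;
}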


If you really have to use float, try the following:

  • sort your numbers from most negative to most positive (an O(N log N) step)
  • add each pair in turn: B1 := A1 + A2, B2 := A3 + A4, B3 := A5 + A6; this produces a new list B, half the length of the initial one
  • repeat this procedure on B to get C, C to get D, etc.
  • stop when there is only one number left.

Note that this changes your algorithmic complexity from an O(N) operation to an O(N log N) operation, but it's rather more likely to produce the correct number. It's quite parallelisable. You may be able to merge the sort and sum operations if you are clever about it.
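A rough sequential sketch of this pair-and-halve reduction (my own code; the sort is assumed to have been done beforehand, and the function name is made up):

#include <cstddef>
#include <vector>

// Combine adjacent pairs of A into a list B of half the length,
// then repeat until a single value remains.
float pairwise_sum(std::vector<float> a) {
    if (a.empty()) return 0.0f;
    while (a.size() > 1) {
        std::vector<float> b;
        b.reserve((a.size() + 1) / 2);
        for (std::size_t i = 0; i + 1 < a.size(); i += 2)
            b.push_back(a[i] + a[i + 1]);   // B1 := A1 + A2, B2 := A3 + A4, ...
        if (a.size() % 2 == 1)
            b.push_back(a.back());          // carry an odd leftover element forward
        a.swap(b);
    }
    return a[0];
}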


Use the Kahan summation algorithm

  • It has the same algorithmic complexity as a naive summation
  • It will greatly increase the accuracy of a summation, without requiring you to switch data types to double.

I tested summing 100000000 floats of 1.0f using std::accumulate; the result was 1.67772e+007. However, using the Kahan summation implementation below, the result was 1e+008:

template <class Iter>
float kahan_summation(Iter begin, Iter end) {
  float result = 0.f;
  float c = 0.f;                  // running compensation for lost low-order bits
  for(;begin != end; ++begin) {
    float y = *begin - c;         // subtract the error carried over from the previous step
    float t = result + y;         // add; low-order bits of y may be lost here
    c = (t - result) - y;         // recover what was lost; becomes the next correction
    result = t;
  }
  return result;
}
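For example, calling it on the vector from the question (this driver is mine, not part of the original answer):

#include <iostream>
#include <vector>

int main() {
    std::vector<float> myvec(100000000, 1.0f);
    // Prints 1e+008 (or 1e+08, depending on the compiler's exponent formatting)
    std::cout << kahan_summation(myvec.begin(), myvec.end()) << std::endl;
    return 0;
}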

You could of course switch all of the floats to doubles, and the Kahan summation algorithm would then give even higher precision than a naive summation using doubles.


IEEE754 inaccuracies. Try GMP.


If the numbers you are summing vary greatly in magnitude, you can sort them first and add them starting with the smallest numbers. This will ensure that their contributions still count and are not lost to the precision problem (which also exists with double; it just takes longer to manifest there).

Otherwise you can build a number of partial sums, each of roughly the same magnitude, and add them together later; this should also ensure that you're getting a correct result.
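One rough sketch of that idea (entirely my own interpretation; the function name and the choice of bucketing by binary exponent are made up for illustration):

#include <cmath>
#include <map>
#include <vector>

// Group values by binary exponent so each partial sum only mixes numbers
// of comparable magnitude, then combine the partial sums smallest-first.
float bucketed_sum(const std::vector<float>& values) {
    std::map<int, float> buckets;            // exponent -> partial sum
    for (float x : values) {
        int exp = 0;
        std::frexp(x, &exp);                 // extract the binary exponent of x
        buckets[exp] += x;
    }
    float total = 0.0f;
    for (const auto& kv : buckets)           // std::map iterates in ascending exponent order
        total += kv.second;
    return total;
}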

Or look for a nice library for arbitrary-precision math :)


Alex Brown's answer is excellent. For a full explanation of floating point problems, try this.

