开发者

Compute weighted averages for large numbers

I'm trying to get the weighted average of a few numbers. Basically I have:

Price    - 134.42
Quantity - 15236545

There can be as few as one or two or as many as fifty or sixty pairs of prices and quantities. I need to figure out the weighted average of the price. Basically, the weighted average should give very little weight to pairs like

Price    - 100000000.00
Quantity - 3

and more to the pair above.

The formula I currently have is:

((price)(quantity) + (price)(quantity) + ...)/totalQuantity

So far I have this done:

        double optimalPrice = 0;
        int totalQuantity = 0;
        double rolling = 0;
        System.out.println(rolling);

        Iterator it = orders.entrySet().iterator();
        while(it.hasNext()) {
            Syste开发者_如何转开发m.out.println("inside");
            Map.Entry order = (Map.Entry)it.next();
            double price = (Double)order.getKey();
            int quantity = (Integer)order.getValue();
            System.out.println(price + " " + quantity);

            rolling += price * quantity;
            totalQuantity += quantity;
            System.out.println(rolling);
        }
        System.out.println(rolling);
        return rolling/totalQuantity;

The problem is I very quickly max out the "rolling" variable.

How can I actually get my weighted average?


A double can hold a pretty large number (about 1.7 x 10^308, according the docs), but you probably shouldn't use it for values where exact precision is required (such as monetary values).

Check out the BigDecimal class instead. This question on SO talks about it in more detail.


One solution is to use java.math.BigInteger for both rolling and totalQuantity, and only divide them at the end. This has a better numeric stability, as you only have a single floating-point division at the end and everything else is integer operations.

BigInteger is basically unbounded so you shouldn't run into any overflows.

EDIT: Sorry, only upon re-reading I've noticed your price is a double anyway. Maybe it's worth circumventing this by multiplying it with 100 and then converting to BigInteger - since I see in your example it has precisely 2 digits right of the decimal point - and then divide it by 100 at the end, although it's a bit of a hack.


For maximum flexibility, use BigDecimal for rolling, and BigInteger for totalQuantity. After dividing (note, you have it backwards; it should be rolling / totalQuantity), you can either return a BigDecimal, or use doubleValue at a loss of precision.


At any given point, you have recorded both the total value ax + by + cz + ... = pq and the total weight a + b + c + ... = p. Knowing both then gives you the average value pq/p = q. The problem is that pq and p are large sums that overflow, even though you just want the moderately sized q.

The next step adds, for example, a weight of r and a value s. You want to find the new sum (pq + rs) / (p + r) by using only the value of q, which can only happen if p and pq somehow "annihilate" by being in the numerator and denominator of the same fraction. That's impossible, as I'll show.

The value that you need to add in this iteration is, naturally,

(pq + rs) / (p + r) - q

Which can't be simplified to a point where p*q and p disappear. You can also find

(pq + rs) / q(p + r)

the factor by which you'd multiply q in order to get the next average; but again, pq and p remain. So there's no clever solution.

Others have mentioned arbitrary-precision variables, and that's a good solution here. The size of p and pq grow linearly with the number of entries, and the memory usage and calculation speed of integers/floats grows logarithmically with the size of the values. So performance is O(log(n)) unlike the disaster that it would if p were somehow the multiple of many numbers.


First, I don't see how you could be "maxing out" the rolling variable. As @Ash points out, it can represent values up to about 1.7 x 10^308. The only possibility I can think of is that you have some bad values in your input. (Perhaps the real problem is that you are losing precision ...)

Second, your use of a Map as to represent orders is strange and probably broken. The way you are currently using it, you cannot represent orders involving two or more items with the same price.


Your final result is an just a weighted average of precises, so presumably you don't need to follow the rules used when calculating account balances etc. If I am correct about the above, then you don't need to use BigDecimal, double will suffice.

The problem of overflow can be solved by storing a "running average" and updating it with each new entry. Namely, let

a_n = (sum_{i=1}^n x_i * w_i) / (sum_{i=1}^n w_i)

for n = 1, ..., N. You start with a_n = x_n and then add

d_n := a_{n+1} - a_n

to it. The formula for d_n is

d_n = (x_{n+1} - w_{n+1}*a_n) / W_{n+1}

where W_n := sum_{i=1}^n w_n. You need to keep track of W_n, but this problem can be solved by storing it as double (it will be OK as we're only interested in the average). You can also normalize the weights, if you know that all your weights are multiples of 1000, just divide them by 1000.

To get additional accuracy, you can use compensated summation.

Preemptive explanation: it is OK to use floating point arithmetic here. double has relative precision of 2E-16. The OP is averaging positive numbers, so there will be no cancellation error. What the proponents of arbitrary precision arithmetic don't tell you is that, leaving aside rounding rules, in the cases when it does give you lots of additional precision over IEEE754 floating point arithmetic, this will come at significant memory and performance cost. Floating point arithmetic was designed by very smart people (Prof. Kahan, among others), and if there was a way of cheaply increasing arithmetic precision over what is offered by floating point, they'd do it.

Disclaimer: if your weights are completely crazy (one is 1, another is 10000000), then I am not 100% sure if you will get satisfying accuracy, but you can test it on some example when you know what the answer should be.


Do two loops: compute totalQuantity first in the first loop. Then in the second loop accumulate price * (quantity / totalQuantity).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜