Compute weighted averages for large numbers
I'm trying to get the weighted average of a few numbers. Basically I have:
Price - 134.42
Quantity - 15236545
There can be as few as one or two pairs of prices and quantities, or as many as fifty or sixty. I need to figure out the weighted average of the price. Basically, the weighted average should give very little weight to pairs like
Price - 100000000.00
Quantity - 3
and more to the pair above.
The formula I currently have is:
(price1 * quantity1 + price2 * quantity2 + ...) / totalQuantity
So far I have this done:
double optimalPrice = 0;
int totalQuantity = 0;
double rolling = 0;
System.out.println(rolling);

Iterator it = orders.entrySet().iterator();
while (it.hasNext()) {
    System.out.println("inside");
    Map.Entry order = (Map.Entry) it.next();
    double price = (Double) order.getKey();
    int quantity = (Integer) order.getValue();
    System.out.println(price + " " + quantity);
    rolling += price * quantity;
    totalQuantity += quantity;
    System.out.println(rolling);
}
System.out.println(rolling);
return rolling / totalQuantity;
The problem is I very quickly max out the "rolling" variable.
How can I actually get my weighted average?
A double can hold a pretty large number (about 1.7 x 10^308, according to the docs), but you probably shouldn't use it for values where exact precision is required (such as monetary values).
Check out the BigDecimal class instead. This question on SO talks about it in more detail.
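As a rough illustration, here is a minimal sketch of the accumulation done with BigDecimal, assuming orders is the same Map<Double, Integer> used in the question (the method name is mine):

import java.math.BigDecimal;
import java.math.MathContext;
import java.util.Map;

// Sketch only: accumulate the price*quantity sum as a BigDecimal so nothing is
// lost to double rounding; divide once at the end.
static BigDecimal weightedAverage(Map<Double, Integer> orders) {
    BigDecimal rolling = BigDecimal.ZERO;
    long totalQuantity = 0;
    for (Map.Entry<Double, Integer> order : orders.entrySet()) {
        BigDecimal price = BigDecimal.valueOf(order.getKey());
        rolling = rolling.add(price.multiply(BigDecimal.valueOf(order.getValue())));
        totalQuantity += order.getValue();
    }
    // MathContext.DECIMAL64 keeps divide() from throwing on non-terminating expansions.
    return rolling.divide(BigDecimal.valueOf(totalQuantity), MathContext.DECIMAL64);
}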
One solution is to use java.math.BigInteger for both rolling and totalQuantity, and only divide them at the end. This has better numeric stability, as you only have a single floating-point division at the end and everything else is integer operations. BigInteger is basically unbounded, so you shouldn't run into any overflows.
EDIT: Sorry, only upon re-reading did I notice that your price is a double anyway. Maybe it's worth circumventing this by multiplying the price by 100 and converting it to BigInteger (since, in your example, it has precisely 2 digits right of the decimal point), then dividing by 100 at the end, although it's a bit of a hack.
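A rough sketch of that hack, assuming every price has at most two decimal places as in the example (the method name and the Math.round call are mine):

import java.math.BigInteger;
import java.util.Map;

// Sketch: treat prices as integer cents so both sums are exact BigIntegers,
// then do the floating-point division only at the very end.
static double weightedAverage(Map<Double, Integer> orders) {
    BigInteger rolling = BigInteger.ZERO;       // sum of priceInCents * quantity
    BigInteger totalQuantity = BigInteger.ZERO; // sum of quantities
    for (Map.Entry<Double, Integer> order : orders.entrySet()) {
        long priceInCents = Math.round(order.getKey() * 100); // assumes 2 decimal places
        BigInteger quantity = BigInteger.valueOf(order.getValue());
        rolling = rolling.add(BigInteger.valueOf(priceInCents).multiply(quantity));
        totalQuantity = totalQuantity.add(quantity);
    }
    // Divide, then undo the x100 scaling to get back to a price.
    return rolling.doubleValue() / totalQuantity.doubleValue() / 100.0;
}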
For maximum flexibility, use BigDecimal for rolling and BigInteger for totalQuantity. After dividing (make sure the division is rolling / totalQuantity, not the other way around), you can either return a BigDecimal or use doubleValue at a loss of precision.
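A minimal sketch of that combination (the class and method names are placeholders of mine):

import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

class WeightedAverageSketch {
    // rolling accumulated as a BigDecimal, totalQuantity as a BigInteger;
    // a single division at the very end.
    static double average(BigDecimal rolling, BigInteger totalQuantity) {
        BigDecimal exact = rolling.divide(new BigDecimal(totalQuantity), MathContext.DECIMAL64);
        return exact.doubleValue(); // or return "exact" itself to keep full precision
    }
}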
At any given point, you have recorded both the total value ax + by + cz + ... = pq and the total weight a + b + c + ... = p. Knowing both then gives you the average value pq/p = q. The problem is that pq and p are large sums that overflow, even though you just want the moderately sized q.
The next step adds, for example, a weight of r and a value s. You want to find the new average (pq + rs) / (p + r) by using only the value of q, which can only happen if p and pq somehow "annihilate" by being in the numerator and denominator of the same fraction. That's impossible, as I'll show.
The value that you need to add in this iteration is, naturally,
(pq + rs) / (p + r) - q
which can't be simplified to a point where pq and p disappear. You can also find
(pq + rs) / (q(p + r))
the factor by which you'd multiply q in order to get the next average; but again, pq and p remain. So there's no clever solution.
Others have mentioned arbitrary-precision variables, and that's a good solution here. The sizes of p and pq grow linearly with the number of entries, and the memory usage and calculation speed of integers/floats grow logarithmically with the size of the values. So performance is O(log(n)), unlike the disaster it would be if p were somehow the product of many numbers.
First, I don't see how you could be "maxing out" the rolling variable. As @Ash points out, it can represent values up to about 1.7 x 10^308. The only possibility I can think of is that you have some bad values in your input. (Perhaps the real problem is that you are losing precision ...)
Second, your use of a Map to represent orders is strange and probably broken. The way you are currently using it, you cannot represent orders involving two or more items with the same price.
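One way around that limitation is to keep a list of order objects rather than a map keyed by price. A sketch with a hypothetical Order class (not code from the question):

import java.util.List;

// Hypothetical Order class: two orders with the same price can coexist,
// which a Map<Double, Integer> keyed by price cannot represent.
class Order {
    final double price;
    final int quantity;

    Order(double price, int quantity) {
        this.price = price;
        this.quantity = quantity;
    }

    static double weightedAverage(List<Order> orders) {
        double rolling = 0;
        long totalQuantity = 0;
        for (Order order : orders) {
            rolling += order.price * order.quantity;
            totalQuantity += order.quantity;
        }
        return rolling / totalQuantity;
    }
}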
Your final result is just a weighted average of prices, so presumably you don't need to follow the rules used when calculating account balances etc. If I am correct about the above, then you don't need to use BigDecimal; double will suffice.
The problem of overflow can be solved by storing a "running average" and updating it with each new entry. Namely, let
a_n = (sum_{i=1}^n x_i * w_i) / (sum_{i=1}^n w_i)
for n = 1, ..., N. You start with a_1 = x_1 and then add the increment
d_n := a_{n+1} - a_n
at each step. The formula for d_n is
d_n = w_{n+1} * (x_{n+1} - a_n) / W_{n+1}
where W_n := sum_{i=1}^n w_i. You do need to keep track of W_n, but that is fine stored as a double (we are only interested in the average, so a small relative error in W_n is harmless). You can also normalize the weights: if you know that all your weights are multiples of 1000, just divide them by 1000.
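A sketch of that update rule in code (the method and variable names are mine):

// Running weighted average: a_n is updated in place, so no price*quantity sum
// is ever stored. The total weight W_n is kept as a double.
static double runningWeightedAverage(double[] prices, long[] quantities) {
    double average = 0;     // a_n
    double totalWeight = 0; // W_n
    for (int i = 0; i < prices.length; i++) {
        totalWeight += quantities[i];
        // d_n = w_{n+1} * (x_{n+1} - a_n) / W_{n+1}
        average += quantities[i] * (prices[i] - average) / totalWeight;
    }
    return average;
}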
To get additional accuracy, you can use compensated summation.
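For example, here is a sketch that applies Kahan (compensated) summation to the accumulation of the increments d_n; the names are again mine:

// Kahan-compensated accumulation of the running-average increments d_n.
// The compensation term recovers low-order bits lost when adding a small
// increment to the running average.
static double compensatedRunningAverage(double[] prices, long[] quantities) {
    double average = 0;      // a_n
    double compensation = 0; // bits lost in previous additions
    double totalWeight = 0;  // W_n, kept as a double
    for (int i = 0; i < prices.length; i++) {
        totalWeight += quantities[i];
        double d = quantities[i] * (prices[i] - average) / totalWeight; // d_n
        double y = d - compensation;
        double t = average + y;
        compensation = (t - average) - y;
        average = t;
    }
    return average;
}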
Preemptive explanation: it is OK to use floating-point arithmetic here. double has a relative precision of about 2E-16. The OP is averaging positive numbers, so there will be no cancellation error. What the proponents of arbitrary-precision arithmetic don't tell you is that, leaving aside rounding rules, in the cases where it does give you lots of additional precision over IEEE 754 floating-point arithmetic, this comes at a significant memory and performance cost. Floating-point arithmetic was designed by very smart people (Prof. Kahan, among others), and if there were a way of cheaply increasing arithmetic precision over what floating point offers, they'd have done it.
Disclaimer: if your weights are completely crazy (one is 1, another is 10000000), then I am not 100% sure you will get satisfying accuracy, but you can test it on an example where you know what the answer should be.
Do two loops: compute totalQuantity first in the first loop. Then in the second loop accumulate price * (quantity / totalQuantity).
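A sketch of that two-pass approach (note the cast to double, which avoids integer division; the method name is mine):

import java.util.Map;

// Two passes: each term quantity / totalQuantity is at most 1, so the
// accumulator never grows much beyond the magnitude of the prices themselves.
static double weightedAverageTwoPass(Map<Double, Integer> orders) {
    long totalQuantity = 0;
    for (int quantity : orders.values()) {            // first pass: total quantity
        totalQuantity += quantity;
    }
    double average = 0;
    for (Map.Entry<Double, Integer> order : orders.entrySet()) { // second pass
        average += order.getKey() * (order.getValue() / (double) totalQuantity);
    }
    return average;
}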