Estimating/forecasting download completion time
We've all poked fun at the 'X minutes remaining' dialog which seems to be too simplistic, but how can we improve it?
Effectively, the input is the set of download speeds up to the current time, and we need to use this to estimate the completion time, perhaps with an indication of certainty, like '20-25 mins remaining' using some Y% confidence interval.
Code that did this could be put in a little library and used in projects all over, so is it really that difficult? How would you do it? What weighting would you give to previous download speeds?
Or is there some open source code already out there?
Edit: Summarising:
- Improve estimated completion time via better algo/filter etc.
- Provide interval instead of single time ('1h45-2h30 mins'开发者_StackOverflow社区), or just limit the precision ('about 2 hours').
- Indicate when progress has stalled - although if progress consistently stalls and then continues, we should be able to deal with that. Perhaps 'about 2 hours, currently stalled'
More generally, I think you are looking for a way to give an instant mesure of the transfer speed, which is generally obtained by an average over a small period.
The problem is generally that in order to be reactive, the period is usually extremely small, which leads to the yoyo effect.
I would propose a very simple scheme, let's model it.
Think of a curve speed (y) over time (x).
the Instant Speed, is no more than reading y for the current x (x0).
the Average Speed, is no more than
Integral(f(x), x in [x0-T,x0]) / T
the scheme I propose is to apply a filter, to give more weight to the last moments, while still taking into account the past moments.
It can be easily implement as g(x,x0,T) = 2 * (x - x0) + 2T
which is a simple triangle of surface T.
And now you can compute Integral(f(x)*g(x,x0,T), x in [x0-T,x0]) / T
, which should work because both functions are always positive.
Of course you could have a different g
as long as it's always positive in the given interval and that its integral on the interval is T (so that its own average is exactly 1).
The advantage of this method is that because you give more weight to immediate events, you can remain pretty reactive even if you consider larger time intervals (so that the average is more precise, and less susceptible to hiccups).
Also, what I have rarely seen but think would provide more precise estimates would be to correlate the time used for computing the average to the estimated remaining time:
- if I download a 5ko file, it's going to be loaded in an instant, no need to estimate
- if I download a 15 Mo file, it's going to take between 2 minutes roughly, so I would like estimates say... every 5 seconds ?
- if I download a 1.5 Go file, it's going to take... well around 200 minutes (with the same speed)... which is to say 3h20m... perhaps that an estimates every minute would be sufficient ?
So, the longer the download is going to take, the less reactive I need to be, and the more I can average out. In general, I would say that a window could cover 2% of the total time (perhaps except for the few first estimates, because people appreciate immediate feedback). Also, indicating progress by whole % at a time is sufficient. If the task is long, I was prepared to wait anyway.
I wonder, would a state estimation technique produce good results here? Something like a Kalman Filter?
Basically you predict the future by looking at your current model, and change the model at each time step to reflect the changes to the real world. I think this kind of technique is used for estimating the time left on your laptop battery, which can also vary according to use, age of battery, etc'.
see http://en.wikipedia.org/wiki/Kalman_filter for a more in depth description of the algorithm.
The filter also gives a variance measure, which could be used to indicate your confidence of the estimate (allthough, as was mentioned by other answers, it might not be the best idea to show this to the end user)
Does anyone know if this is actually used somewhere for download (or file copy) estimation?
Don't confuse your users by providing more information than they need. I'm thinking of the confidence interval. Skip it.
Internet download times are highly variable. The microwave interferes with WiFi. Usage varies by time of day, day of week, holidays, and releases of new exciting games. The server may be heavily loaded right now. If you carry your laptop to cafe, the results will be different than at home. So, you probably can't rely on historical data to predict the future of download speeds.
If you can't accurately estimate the time remaining, then don't lie to your user by offering such an estimate.
If you know how much data must be downloaded, you can provide % completed progress.
If you don't know at all, provide a "heartbeat" - a piece of moving UI that shows the user that things are working, even through you don't know how long remains.
Improving the estimated time itself: Intuitively, I would guess that the speed of the net connection is a series of random values around some temporary mean speed - things tick along at one speed, then suddenly slow or speed up.
One option, then, could be to weight the previous set of speeds by some exponential, so that the most recent values get the strongest weighting. That way, as the previous mean speed moves further into the past, its effect on the current mean reduces.
However, if the speed randomly fluctuates, it might be worth flattening the top of the exponential (e.g. by using a Gaussian filter), to avoid too much fluctuation.
So in sum, I'm thinking of measuring the standard deviation (perhaps limited to the last N minutes) and using that to generate a Gaussian filter which is applied to the inputs, and then limiting the quoted precision using the standard deviation.
How, though, would you limit the standard deviation calculation to the last N minutes? How do you know how long to use?
Alternatively, there are pattern recognition possibilities to detect if we've hit a stable speed.
I've considered this off and on, myself. I the answer starts with being conservative when computing the current (and thus, future) transfer rate, and includes averaging over longer periods, to get more stable estimates. Perhaps low-pass filtering the time that is displayed, so that one doesn't get jumps between 2 minutes and 2 days.
I don't think a confidence interval is going to be helpful. Most people wouldn't be able to interpret it, and it would just be displaying more stuff that is a guess.
精彩评论