Maths! Approximating the mean, without storing the whole data set
Obvious (but expensive) solution:
I would like to store the rating of a track (1-10) in a table like this:
TrackID
Vote
And then a simple
SELECT AVG(Vote) FROM `table` WHERE `TrackID` = some_val
to calculate the average.
However, I am worried about scalability on this, especially as it needs to be recalculated each time.
Proposed, but possibly stupid, solution:
TrackID
Rating
NumberOfVotes
Every time someone votes, the Rating is updated with

new_rating = ((old_rating * NumberOfVotes) + vote) / (NumberOfVotes + 1)

and stored as the TrackID's new Rating value. Now whenever the Rating is wanted, it's a simple lookup, not a calculation.
Clearly, this does not calculate the mean. I've tried a few small data sets, and it approximates the mean. I believe it might converge as the data set increases? But I'm worried that it might diverge!
What do you guys think? Thanks!
Assuming you had infinite numeric precision, that calculation does update the mean correctly. In practice, you're probably using integer types, so it won't be exact.
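To see both points at once, here is a small sketch (hypothetical vote values, not from the question) that maintains the mean with the incremental formula in doubles and compares it against the exact mean from the full sum. With floating point, the two agree to within rounding error:

```java
public class RunningMeanDemo {
    public static void main(String[] args) {
        int[] votes = {7, 3, 9, 10, 1, 5, 8};

        double incremental = 0.0; // mean maintained via the update formula
        long total = 0;           // exact cumulative sum, for comparison

        for (int i = 0; i < votes.length; i++) {
            // new_rating = ((old_rating * NumberOfVotes) + vote) / (NumberOfVotes + 1)
            incremental = (incremental * i + votes[i]) / (i + 1);
            total += votes[i];
        }

        double exact = (double) total / votes.length;
        System.out.println(incremental); // agrees with exact to within floating-point error
        System.out.println(exact);
    }
}
```

Run the same loop with integer variables and the truncation at each division step is where the drift the question worries about comes from.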
How about storing the cumulative vote total and the number of votes (i.e. total = total + vote, numVotes = numVotes + 1)? That way, you can get the exact mean by dividing one by the other.
This approach will only break if you get so many votes that you overflow the range of the data type you're using. So use a big data type (32-bit ought to be enough, unless you're expecting ~4 billion votes)!
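A minimal sketch of that idea (hypothetical numbers, assuming votes are capped at 10): keep the running total and the count, and divide only when the mean is requested. A 64-bit sum also leaves enormous headroom even at billions of votes:

```java
public class SumAndCountDemo {
    public static void main(String[] args) {
        long total = 0;    // cumulative sum of all votes
        long numVotes = 0; // number of votes cast

        int[] votes = {7, 3, 9, 10, 1};
        for (int v : votes) {
            total += v;
            numVotes++;
        }

        double mean = (double) total / numVotes; // exact: 30 / 5 = 6.0
        System.out.println(mean);

        // Headroom check: 4 billion votes of 10 each still fits easily in a long
        long worstCaseSum = 4_000_000_000L * 10L;
        System.out.println(worstCaseSum < Long.MAX_VALUE);
    }
}
```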
Store TrackId, RatingsSum, NumberOfVotes in your table.
Every time someone votes,
- NumberOfVotes = NumberOfVotes + 1
- RatingsSum = RatingsSum + [rating supplied by user]
Then when selecting
SELECT TrackId, RatingsSum / NumberOfVotes FROM ...
Your solution is completely legitimate, and differs from a value calculated over the full source set only by roughly a few times the floating-point precision.
You can certainly calculate a running mean and standard deviation without having all the points in hand. You merely need to accumulate the sum, sum of squares, and number of points.
It's not an approximation; the mean and standard deviation are exact.
Here's a Java class that demonstrates. You can adapt to your SQL solution as needed:
package statistics;

public class StatsUtils
{
    private double sum;
    private double sumOfSquares;
    private long numPoints;

    public StatsUtils()
    {
        this.init();
    }

    private void init()
    {
        this.sum = 0.0;
        this.sumOfSquares = 0.0;
        this.numPoints = 0L;
    }

    public void addValue(double value)
    {
        // Check for overflow in either number of points or sum of squares; reset if overflow is detected
        if ((this.numPoints == Long.MAX_VALUE) || (this.sumOfSquares > (Double.MAX_VALUE - value*value)))
        {
            this.init();
        }
        this.sum += value;
        this.sumOfSquares += value*value;
        ++this.numPoints;
    }

    public double getMean()
    {
        double mean = 0.0;
        if (this.numPoints > 0)
        {
            mean = this.sum/this.numPoints;
        }
        return mean;
    }

    public double getStandardDeviation()
    {
        double standardDeviation = 0.0;
        if (this.numPoints > 1)
        {
            standardDeviation = Math.sqrt((this.sumOfSquares - this.sum*this.sum/this.numPoints)/(this.numPoints - 1L));
        }
        return standardDeviation;
    }

    public long getNumPoints() { return this.numPoints; }
}
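A quick standalone sketch of the same accumulation (hypothetical values), showing the mean and sample standard deviation falling out of just sum, sum of squares, and count:

```java
public class RunningStatsDemo {
    public static void main(String[] args) {
        double[] values = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};

        double sum = 0.0, sumOfSquares = 0.0;
        long n = 0;
        for (double v : values) { // same accumulation as StatsUtils.addValue
            sum += v;
            sumOfSquares += v * v;
            n++;
        }

        double mean = sum / n; // 40 / 8 = 5.0
        double stdDev = Math.sqrt((sumOfSquares - sum * sum / n) / (n - 1));

        System.out.println(mean);
        System.out.println(stdDev); // sample standard deviation, sqrt(32/7)
    }
}
```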
Small improvement on your solution. You have the table:
TrackID
SumOfVotes
NumberOfVotes
When someone votes,
NumberOfVotes = NumberOfVotes + 1
SumOfVotes = SumOfVotes + ThisVote
and to see the average you only then do a division:
SELECT TrackID, (SumOfVotes/NumberOfVotes) AS Rating FROM `table`
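One caveat worth adding (an illustration, not from the original answer): if both columns are integer types, some SQL dialects truncate the division (MySQL's `/` returns a decimal, but e.g. SQL Server's does not), so you may need to cast one operand. The same pitfall in Java:

```java
public class IntegerDivisionDemo {
    public static void main(String[] args) {
        int sumOfVotes = 43;
        int numberOfVotes = 7;

        System.out.println(sumOfVotes / numberOfVotes);          // integer division truncates: 6
        System.out.println((double) sumOfVotes / numberOfVotes); // true mean, about 6.14
    }
}
```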
I would add that the original (obvious but expensive) solution is only expensive compared to the provided solution when calculating the average. It is cheaper when a vote is added, deleted or changed. I guess that the original table

TrackID
Vote
VoterID

would still need to be used in the provided solution to keep track of the vote (rating) of every voter. So, two tables have to be updated for every change in this table (insert, delete or vote update).
In other words, the original solution may be the best way to go.