Maths! Approximating the mean, without storing the whole data set

2023-02-03 13:17 问答作者：

Obvious (but expensive) solution:

I would like to store rating of a track (1-10) in a table like this:

TrackID
Vote

And then a simple

开发者_StackOverflow中文版SELECT AVERAGE(Vote) FROM `table` where `TrackID` = some_val

to calculate the average.

However, I am worried about scalability on this, especially as it needs to be recalculated each time.

Proposed, but possibly stupid, solution:

TrackID
Rating
NumberOfVotes

Every time someone votes, the Rating is updated with

new_rating = ((old_rating * NumberOfVotes) + vote) / (NumberOfVotes + 1)

and stored as the TrackID's new Rating value. Now whenever the Rating is wanted, it's a simple lookup, not a calculation.

Clearly, this does not calculate the mean. I've tried a few small data sets, and it approximates the mean. I believe it might converge as the data set increases? But I'm worried that it might diverge!

What do you guys think? Thanks!

Assuming you had infinite numeric precision, that calculation does update the mean correctly. In practice, you're probably using integer types, so it won't be exact.

How about storing the cumulative vote count, and the number of votes? (i.e. total=total+vote, numVotes=numVotes+1). That way, you can get the exact mean by dividing one by the other.

This approach will only break if you get so many votes that you overflow the range of the data type you're using. So use a big data type (32-bit ought to be enough, unless you're expecting ~4 billion votes)!

Store TrackId, RatingSum, NumberOfVotes in your table.

Every time someone votes,

NumberOfVotes = NumberOfVotes + 1
RatingsSum = RatingsSum + [rating supplied by user]

Then when selecting

SELECT TrackId, RatingsSum / NumberOfVotes FROM ...

Your solution is completely legit. and differes only by roughly a few times the floating point precision from a value calculated from the full source set.

You can certainly calculate a running mean and standard deviation without having all the points in hand. You merely need to accumulate the sum, sum of squares, and number of points.

It's not an approximation; the mean and standard deviation are exact.

Here's a Java class that demonstrates. You can adapt to your SQL solution as needed:

package statistics;

public class StatsUtils
{
    private double sum;
    private double sumOfSquares;
    private long numPoints;

    public StatsUtils()
    {
        this.init();
    }

    private void init()
    {
        this.sum = 0.0;
        this.sumOfSquares = 0.0;
        this.numPoints = 0L;
    }

    public void addValue(double value)
    {
        // Check for overflow in either number of points or sum of squares; reset if overflow is detected
        if ((this.numPoints == Long.MAX_VALUE) || (this.sumOfSquares > (Double.MAX_VALUE-value*value)))
        {
            this.init();
        }

        this.sum += value;
        this.sumOfSquares += value*value;
        ++this.numPoints;
    }

    public double getMean()
    {
        double mean = 0.0;

        if (this.numPoints > 0)
        {
            mean = this.sum/this.numPoints;
        }

        return mean;
    }

    public double getStandardDeviation()
    {
        double standardDeviation = 0.0;

        if (this.numPoints > 1)
        {
            standardDeviation = Math.sqrt((this.sumOfSquares - this.sum*this.sum/this.numPoints)/(this.numPoints-1L));
        }

        return standardDeviation;
    }

    public long getNumPoints() { return this.numPoints; }
}

Small improvement on your solution. You have the table:

TrackID
SumOfVotes
NumberOfVotes

When someone votes,

NumberOfVotes = NumberOfVotes + 1
SumOfVotes = SumOfVotes + ThisVote

and to see the average you only then do a division:

SELECT TrackID, (SumOfVotes/NumberOfVotes) AS Rating FROM `table`

I would add that the original (obvious and expensive) solution is only expensive compared to the provied solution when calculating the average. It is cheaper when a vote is added, deleted or changed. I guess that the original table

TrackID
Vote
VoterID

would still need to be used in the provided solution to keep track of the vote (rating) of every voter. SO, two tables have to be updated for every change in this table (insert, delete or Vote update).

In other words, the original solution may be the best way to go.

继续阅读：average math sql

Maths! Approximating the mean, without storing the whole data set

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？