Calculate distribution of values in Rails/MySQL
What's the best way to calculate a distribution of values across rows in MYSQL (or possibly some calculated value in ruby).
Example would be the length of messages sent through the site. So we can look through all messages and see this one was 100 characters long, 1000 etc.
Then for a given message I want to come back with a %, like "this message falls into the 90 percentile for length" compared to other messages.
Caveats
- these aren't necessarily linear distributions so simple averages wouldn't be sufficient here
- distributions should be cached somehow for fast lookup, may need to pull in dozens of these distributions for a si开发者_如何学JAVAngle page load
Maybe some stats package/gem in ruby or any good examples of this out there? Thanks!
You could calculate the standard deviation for your dataset, and then given your datapoint, tell how many standard deviations away you are from the mean.
That way you could say 95 percentile, etc.
Check out:
- StatSample: http://ruby-statsample.rubyforge.org/
- Statistics2: https://github.com/abscondment/statistics2
Normal Distribution: http://www.regentsprep.org/Regents/math/algtrig/ATS2/NormalLesson.htm
The problem with calculating the standard deviation and mean of your data and doing an inverse normal lookup is that you are making a VERY restrictive assumption. What if your data is not normal? There are many situations where this is a pretty bad assumption.
You ideally want to answer this question non-parametrically, that is, without having to depend on some distributional assumptions. One way to do this is to use a MySQL view:
http://www.mysqltutorial.org/mysql-views-tutorial.aspx
Assuming that you want to calculate percentiles for message length of each row in the table, you can try creating a MySQL view that counts, for row i, the number of other message lengths that are less than or equal to the message length for row i. Since the table is updated once for each insertion, lookups will be fast.
If the MySQL view is being too slow on calculation, you can also try inserting the following logic into a write on your data access layer:
- Create your own "# message lengths less than or equal to me" column in the table.
- On a new row insertion, populate the "# message lengths less than or equal to me" column for this row using the straightforward SQL count expression as appropriate.
- Also on a new row insertion, iterate through all of the other rows in this table and increment this "# message lengths less than or equal to me" for all rows where the message length is greater than the new row's message length.
Because your problem is nicely linear in the ranking, we can get away with reducing the O(n^2) complexity to O(n) by using the latter approach. Hope this helps.
Also, if you are interested in doing some other manipulation than just percentiles, check out this blog for information on normalizing, scaling, ranking, etc. It includes some good graphics on what is happening to your data, although it is written for c#:
http://www.redowlconsulting.com/Blog/post/2011/07/28/StatisticalTricksForLists.aspx
精彩评论