Calculate distribution of values in Rails/MySQL

2023-03-24 02:37 问答作者：

What's the best way to calculate a distribution of values across rows in MYSQL (or possibly some calculated value in ruby).

Example would be the length of messages sent through the site. So we can look through all messages and see this one was 100 characters long, 1000 etc.

Then for a given message I want to come back with a %, like "this message falls into the 90 percentile for length" compared to other messages.

Caveats

these aren't necessarily linear distributions so simple averages wouldn't be sufficient here
distributions should be cached somehow for fast lookup, may need to pull in dozens of these distributions for a si开发者_如何学JAVAngle page load

Maybe some stats package/gem in ruby or any good examples of this out there? Thanks!

You could calculate the standard deviation for your dataset, and then given your datapoint, tell how many standard deviations away you are from the mean.

That way you could say 95 percentile, etc.

Check out:

StatSample: http://ruby-statsample.rubyforge.org/
Statistics2: https://github.com/abscondment/statistics2

Normal Distribution: http://www.regentsprep.org/Regents/math/algtrig/ATS2/NormalLesson.htm

The problem with calculating the standard deviation and mean of your data and doing an inverse normal lookup is that you are making a VERY restrictive assumption. What if your data is not normal? There are many situations where this is a pretty bad assumption.

You ideally want to answer this question non-parametrically, that is, without having to depend on some distributional assumptions. One way to do this is to use a MySQL view:

http://www.mysqltutorial.org/mysql-views-tutorial.aspx

Assuming that you want to calculate percentiles for message length of each row in the table, you can try creating a MySQL view that counts, for row i, the number of other message lengths that are less than or equal to the message length for row i. Since the table is updated once for each insertion, lookups will be fast.

If the MySQL view is being too slow on calculation, you can also try inserting the following logic into a write on your data access layer:

Create your own "# message lengths less than or equal to me" column in the table.
On a new row insertion, populate the "# message lengths less than or equal to me" column for this row using the straightforward SQL count expression as appropriate.
Also on a new row insertion, iterate through all of the other rows in this table and increment this "# message lengths less than or equal to me" for all rows where the message length is greater than the new row's message length.

Because your problem is nicely linear in the ranking, we can get away with reducing the O(n^2) complexity to O(n) by using the latter approach. Hope this helps.

Also, if you are interested in doing some other manipulation than just percentiles, check out this blog for information on normalizing, scaling, ranking, etc. It includes some good graphics on what is happening to your data, although it is written for c#:

http://www.redowlconsulting.com/Blog/post/2011/07/28/StatisticalTricksForLists.aspx

继续阅读：ruby-on-rails statistics

Calculate distribution of values in Rails/MySQL

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？