determining histogram bin size
I'm looking to create a histogram in SQL (which in itself isn't too tricky), but what I'm looking for is a way of splitting the bins so that each bin / band has the same proportion of the data included within.
For example if I have the sample data (the value column) and I want to divide it into 5 bins, I know that I can work out the number of bins by doing something like
(MAX(Value) - MIN(Value)) / numberofsteps
Will give the groups we see in the band 1 column.
However what I want is for the bands to be calculated so that each band accounts for (100 / n) % of the total where n is the number of bands (so in this case each of the 5 bands would represent 20% of the total data) - which is what is shown in the band 2 column
Value band 1 band 2
1 | 1 to 2 | 0 to 1
1 | 1 to 2 | 0 to 1
1 | 1 to 2 | 0 to 1
1 | 1 to 2 | 0 to 1
2 | 1 to 2 | 2 to 3
2 | 1 to 2 | 2 to 3
3 | 1 to 2 | 2 to 3
3 | 1 to 2 | 2 to 3
4 | 3 to 4 | 4 to 6
4 | 3 to 4 | 4 to 6
5 | 5 to 6 | 4 to 6
6 | 5 to 6 | 4 to 6
7 | 7 to 8 | 7 to 8
8 | 7 to 8 | 7 to 8
8 | 7 to 8 | 7 to 8
8 | 7 to 8 | 7 to 8
9 | 9 to 10 | 9 to 10
10 | 9 to 10 | 9 to 10
10 | 9 to 10 | 9 to 10
10 | 9 to 10 | 9 to 10
Is there a way to do this in SQL (i'm using SQL server 2005 if that helps), possibly without creating a UDF and having it so that I can easily alter the number o开发者_如何学Pythonf bins would be great (if that's not asking the impossible!)
Thanks
To divide into bins you can use the ntile function.
with Vals AS
(
SELECT 1 AS value UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 8 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10 UNION ALL SELECT 10 UNION ALL SELECT 10
), TiledVals AS
(
SELECT value, NTILE(5) OVER (ORDER BY value) AS BinNumber
FROM Vals
)
SELECT value, BinNumber,
Min(value) OVER (PARTITION BY BinNumber) As StartBin,
MAX(value) OVER (PARTITION BY BinNumber) As EndBin
FROM TiledVals
Gives
value BinNumber StartBin EndBin
----------- -------------------- ----------- -----------
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
2 2 2 3
2 2 2 3
3 2 2 3
3 2 2 3
4 3 4 6
4 3 4 6
5 3 4 6
6 3 4 6
7 4 7 8
8 4 7 8
8 4 7 8
8 4 7 8
9 5 9 10
10 5 9 10
10 5 9 10
10 5 9 10
精彩评论