How to write the quantile aggregate function?

2023-04-10 14:28

I have the following table:

CREATE TABLE #TEMP (ColA VARCHAR(MAX), ColB VARCHAR(MAX), Date date, Value int)

INSERT INTO #TEMP VALUES('A','B','7/1/2010','11143274')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','13303527')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','17344238')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','13236525')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','10825232')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','13567253')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','10726342')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','11605647')

INSERT INTO #TEMP VALUES('A','B','7/2/2010','13236525')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','10825232')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','13567253')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','10726342')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','11605647')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','17344238')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','17344238')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','17344238')

SELECT * FROM #TEMP

DROP TABLE #TEMP

In R (a statistical software), to calculate the 95th percentile value of the last column, I am doing something like this:

ddply(data, c("ColA", "ColB", "Date"), summarize, Value95=quantile(Value, 0.95))

and the output is the following:

A B 2010-07-01 16022293
A B 2010-07-02 17344238

All this does is perform a GROUP BY on ColA, ColB and Date and apply the quantile function as an aggregate. So far so good, but I should have a way to do this in SQL Server, because this is an aggregate operation that can be done cleanly in SQL, and when the data is on the order of millions, I really want to do this in SQL rather than in statistical software.

My problem is I am not able to find a good way to write the quantile function itself. I tried using NTILE but it does not make sense using NTILE(100) when the number of rows under a particular GROUP BY is less than 100. Is there a good way to do this?

UPDATE: Some more output from R if it helps:

> quantile(c(1,2,3,4,5,5), 0.95)
95% 
  5 
> quantile(c(1,2,3,4,5,5), 0.0)
0% 
 1 
> quantile(c(1,2,3,4,5,5), 1.0)
100% 
   5 
> quantile(c(1,2,3,4,5,5), 0.5) # MEDIAN
50% 
3.5

"when the data is on the order of millions, I really want to do this in SQL rather than in statistical software."

Have you tried the data.table package in R? See this article comparing ddply to data.table.

Here is how I would do that (the code is a little bit messy):

CREATE TABLE #TEMP (ColA VARCHAR(MAX), ColB VARCHAR(MAX), Date date, Value int)

INSERT INTO #TEMP VALUES('A','B','7/1/2010','11143274')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','13303527')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','17344238')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','13236525')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','10825232')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','13567253')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','10726342')
INSERT INTO #TEMP VALUES('A','B','7/1/2010','11605647')

INSERT INTO #TEMP VALUES('A','B','7/2/2010','13236525')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','10825232')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','13567253')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','10726342')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','11605647')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','17344238')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','17344238')
INSERT INTO #TEMP VALUES('A','B','7/2/2010','17344238')

INSERT INTO #TEMP VALUES('A','c','7/2/2010','1')
INSERT INTO #TEMP VALUES('A','c','7/2/2010','2')
INSERT INTO #TEMP VALUES('A','c','7/2/2010','3')
INSERT INTO #TEMP VALUES('A','c','7/2/2010','4')
INSERT INTO #TEMP VALUES('A','c','7/2/2010','5')
INSERT INTO #TEMP VALUES('A','c','7/2/2010','5')


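-- Quantile to compute, from 0 to 1 (for example, 0.95 for the 95th percentile)
declare @perc decimal(6,5)
set @perc = 1.0

-- Average the one or two ranked values picked per group
select cola, colb,date, sum(value)/convert(decimal,count(value)) from (

select 
   row_number() OVER(partition by x.cola, x.colb, x.date order by x.value) as id,
   x.*,
   -- zz = n * p. If zz is whole, j = zz and k = zz + 1, so the two values are averaged;
   -- otherwise j = k = the next rank above zz (SAS method 5 / R type 2)
   case when (y.zz - convert(int, y.zz)) = 0 then convert(int, y.zz) else convert(int, y.zz) + 1 end as j,
   convert(int, y.zz) + 1 as k,
   y.zz
from 
#temp x join 
(
   -- n * p for each group
   SELECT 
      cola, 
      colb, 
      date, 
      count(*)*@perc zz 
   FROM 
      #TEMP  
   group by 
      cola, 
      colb, 
      date
)y on x.cola = y.cola and x.colb = y.colb and x.date = y.date

)xxx where id = j or id = k
group by cola, colb, date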

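There are several methods for calculating a quantile. I used SAS method 5 (type 2 in R), which averages the two neighbouring values when n*p is a whole number. R's quantile() defaults to type 7 (linear interpolation), so the results can differ from the R output in the question.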
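
For comparison with the R output in the question, which comes from quantile()'s default method (type 7, linear interpolation), here is a minimal sketch of that method in T-SQL. It assumes the #TEMP table above and SQL Server 2005 or later (for ROW_NUMBER and COUNT(*) OVER). The alias ranked and the column names id, h and Value95 are only illustrative, and the query is untested on large data.

```sql
-- Sketch: linear-interpolation quantile (R type 7) per group.
-- Assumes the #TEMP table defined above; names id, h, ranked, Value95 are illustrative.
DECLARE @perc decimal(6,5)
SET @perc = 0.95

SELECT ColA, ColB, Date,
       -- x[lo] + frac * (x[hi] - x[lo]); when h is whole, lo = hi and MIN = MAX
       MIN(Value) + (MAX(h) - FLOOR(MAX(h))) * (MAX(Value) - MIN(Value)) AS Value95
FROM (
    SELECT ColA, ColB, Date, Value,
           -- 1-based rank of each value within its group
           ROW_NUMBER() OVER (PARTITION BY ColA, ColB, Date ORDER BY Value) AS id,
           -- h = 1 + (n - 1) * p is the fractional rank of the quantile
           1 + (COUNT(*) OVER (PARTITION BY ColA, ColB, Date) - 1) * @perc AS h
    FROM #TEMP
) ranked
-- keep only the one or two rows that bracket the fractional rank
WHERE id = FLOOR(h) OR id = CEILING(h)
GROUP BY ColA, ColB, Date
```

With @perc = 0.95 and the sample rows above, this should give about 16022293.25 for A/B on 2010-07-01, 17344238 for A/B on 2010-07-02 and 5 for A/c, in line with the R output. On SQL Server 2012 or later, the built-in PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY Value) OVER (PARTITION BY ColA, ColB, Date) applies the same interpolation, but as a window function it returns one row per input row, so it needs a DISTINCT to collapse each group.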
