Is there a way to calculate correlation in TSQL using OVER clauses instead of CTEs?

Let's say you have a table with the columns Date, GroupID, X, and Y.

CREATE TABLE #sample
  (
     [Date]  DATETIME,
     GroupID INT,
     X       FLOAT,
     Y       FLOAT
  )

DECLARE @date DATETIME = getdate()

INSERT INTO #sample VALUES(@date, 1, 1,3)
INSERT INTO #sample VALUES(DATEADD(d, 1, @date), 1, 1,1)
INSERT INTO #sample VALUES(DATEADD(d, 2, @date), 1, 4,2)
INSERT INTO #sample VALUES(DATEADD(d, 3, @date), 1, 3,3)
INSERT INTO #sample VALUES(DATEADD(d, 4, @date), 1, 6,4)
INSERT INTO #sample VALUES(DATEADD(d, 5, @date), 1, 7,5)
INSERT INTO #sample VALUES(DATEADD(d, 6, @date), 1, 1,6)

and you want to calculate the correlation of X and Y for each group. Currently I use CTEs which get a little messy:

;WITH DataAvgStd
     AS (SELECT GroupID,
                AVG(X)   AS XAvg,
                AVG(Y)   AS YAvg,
                STDEV(X) AS XStdev,
                STDEV(Y) AS YSTDev,
                COUNT(*) AS SampleSize
         FROM   #sample
         GROUP  BY GroupID),
     ExpectedVal
     AS (SELECT s.GroupID,
                SUM(( X - XAvg ) * ( Y - YAvg )) AS ExpectedValue
         FROM   #sample s
                JOIN DataAvgStd das
                  ON s.GroupID = das.GroupID
         GROUP  BY s.GroupID)
SELECT das.GroupID,
       ev.ExpectedValue / ( das.SampleSize - 1 ) / ( das.XStdev * das.YSTDev ) AS Correlation
FROM   DataAvgStd das
       JOIN ExpectedVal ev
         ON das.GroupID = ev.GroupID

DROP TABLE #sample  

It seems like there should be a way to use OVER and PARTITION to do this in one fell swoop without any subqueries. Ideally TSQL would have a function so you could write:

SELECT GroupID, CORR(X, Y) OVER(PARTITION BY GroupID)
FROM #sample
GROUP BY GroupID


Using this formula for correlation, you cannot avoid all the nested queries even if you use OVER(). The problem is that you cannot freely combine GROUP BY and OVER in the same query, and you cannot nest aggregate functions, e.g. SUM(X - AVG(X)). So in the best case, given your data, you will still need to keep at least the WITH.
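
For instance, a direct attempt to nest the aggregates, which would otherwise collapse the two CTEs into one query, fails with "Cannot perform an aggregate function on an expression containing an aggregate or a subquery":

-- Fails: aggregates cannot be nested inside other aggregates
SELECT GroupID,
       SUM(( X - AVG(X) ) * ( Y - AVG(Y) ))
FROM   #sample
GROUP  BY GroupID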

Your code will look something like this:

;WITH DataAvgStd
     AS (SELECT GroupID,
                STDEV(X) OVER (PARTITION BY GroupID) AS XStdev,
                STDEV(Y) OVER (PARTITION BY GroupID) AS YSTDev,
                COUNT(*) OVER (PARTITION BY GroupID) AS SampleSize,
                ( X - AVG(X) OVER (PARTITION BY GroupID) ) * ( Y - AVG(Y) OVER (PARTITION BY GroupID) ) AS ExpectedValue
         FROM   #sample)
SELECT DISTINCT GroupID,
       SUM(ExpectedValue) OVER (PARTITION BY GroupID) / ( SampleSize - 1 ) / ( XStdev * YSTDev ) AS Correlation
FROM   DataAvgStd

An alternative is to use an equivalent formula for the correlation, as Wikipedia describes.
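
For reference, that formula for $n$ sample pairs is

$$ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left( \sum x_i \right)^2} \, \sqrt{n \sum y_i^2 - \left( \sum y_i \right)^2}} $$

and it maps term for term onto plain SUM and COUNT aggregates.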

This can be written as

SELECT GroupID,
       Correlation = (COUNT(*) * SUM(X * Y) - SUM(X) * SUM(Y)) /
                     (SQRT(COUNT(*) * SUM(X * X) - SUM(X) * SUM(X))
                      * SQRT(COUNT(*) * SUM(Y * Y) - SUM(Y) * SUM(Y)))
FROM   #sample s
GROUP  BY GroupID;
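
Since that version needs only plain SUMs and COUNTs, it can also be phrased with window functions alone and no CTE at all. A sketch, assuming the #sample table from the question (the window expressions repeat because T-SQL before SQL Server 2022 has no named WINDOW clause, and DISTINCT collapses the identical per-row results to one row per group):

-- Single-pass, CTE-free version using only OVER(PARTITION BY ...)
SELECT DISTINCT GroupID,
       (COUNT(*) OVER (PARTITION BY GroupID) * SUM(X * Y) OVER (PARTITION BY GroupID)
          - SUM(X) OVER (PARTITION BY GroupID) * SUM(Y) OVER (PARTITION BY GroupID))
       / (SQRT(COUNT(*) OVER (PARTITION BY GroupID) * SUM(X * X) OVER (PARTITION BY GroupID)
               - SQUARE(SUM(X) OVER (PARTITION BY GroupID)))
          * SQRT(COUNT(*) OVER (PARTITION BY GroupID) * SUM(Y * Y) OVER (PARTITION BY GroupID)
                 - SQUARE(SUM(Y) OVER (PARTITION BY GroupID)))) AS Correlation
FROM   #sample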


A Single-Pass Solution for Both Calcs:

There are two flavors of the Pearson correlation coefficient, one for a Sample and one for an entire Population. These are simple, single-pass and, I believe, correct formulas for both:

-- Methods for calculating the two Pearson correlation coefficients
SELECT  
        -- For Population
        (avg(x * y) - avg(x) * avg(y)) / 
        (sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y))) 
        AS correlation_coefficient_population,
        -- For Sample
        (count(*) * sum(x * y) - sum(x) * sum(y)) / 
        (sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y))) 
        AS correlation_coefficient_sample
    FROM (
        -- The following generates a table of sample data containing two columns with a luke-warm and tweakable correlation 
        -- y = x for 0 thru 99, y = x - 100 for 100 thru 199, etc.  Execute it as a stand-alone to see for yourself
        -- x and y are CAST as DECIMAL to avoid integer math, you should definitely do the same
        -- Try TOP 100 or less for full correlation (y = x for all cases), TOP 200 for a PCC of 0.5, TOP 300 for one near 0.33, etc.
        -- The superfluous "+ 0" is where you could apply various offsets to see that they have no effect on the results
        SELECT TOP 200
                CAST(ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 + 0 AS DECIMAL) AS x, 
                CAST((ROW_NUMBER() OVER (ORDER BY [object_id]) - 1) % 100 AS DECIMAL) AS y 
            FROM sys.all_objects
    ) AS a

As I noted in the comments, you can try the example with TOP 100 or less for full correlation (y = x for all cases); TOP 200 yields correlations very near 0.5; TOP 300, around 0.33; etc. There is a place ("+ 0") to add an offset if you like; spoiler alert, it has no effect. Make sure you CAST your values as DECIMAL - integer math can significantly impact these calcs.


SQL gets a bit funny about nesting aggregates or windowing functions, hence the need for the CTEs or derived tables.

If it must be implemented on the DB server and you are looking for something more readable than the CTEs, your only option is to roll your own aggregate with the CLR.

There is a good tutorial on building a similar CLR aggregate here: http://www.sqlservercentral.com/articles/SQLCLR/71942/
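
For illustration, once such an aggregate is compiled and deployed, the T-SQL side might look roughly like this; the assembly name CorrAssembly and class name Corr are hypothetical placeholders, and two-argument CLR aggregates require SQL Server 2008 or later:

-- Hypothetical registration of a two-argument CLR correlation aggregate;
-- CorrAssembly/Corr are placeholder names for your own compiled assembly and class
CREATE AGGREGATE dbo.CORR (@x FLOAT, @y FLOAT)
RETURNS FLOAT
EXTERNAL NAME CorrAssembly.Corr;
GO

-- The query from the question then becomes the one-liner you wanted
SELECT GroupID, dbo.CORR(X, Y) AS Correlation
FROM   #sample
GROUP  BY GroupID;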
