T-SQL for a normalized average

2023-03-22 12:42 问答作者：

I'm looking for a way to calculate a useful average for a given set of values which may contain huge spikes. (e.g 21, 54, 34, 14, 20, 300, 23 or 1, 1, 1, 1, 200, 1, 100) the s开发者_运维技巧pikes can throw things off when using the standard average calculation.

I looked into using the median but this doesn't really give the desired result.

I would like to implement this in T-SQL

Any ideas?

This way you can take away the highest and the lowest 25 % before calculating the result.

declare @t table (col1 int)

insert @t
select 21 union all 
select 54 union all 
select 34 union all 
select 14 union all 
select 20 union all 
select 300 union all 
select 23 union all 
select 1 union all 
select 1 union all 
select 1 union all 
select 1 union all 
select 200 union all 
select 1 union all 
select 100


select avg(col1) from (
select top 67 percent col1 from (
select top 75 percent col1 from @t order by col1
) a order by col1 desc) b

Use median filter:

SELECT  AVG(value)
FROM    (
        SELECT  TOP 1 value AS median
        FROM    (
                SELECT  TOP 50 PERCENT value
                FROM    mytable
                ORDER BY
                        value
                ) q
        ORDER BY
                value DESC
        ) q
JOIN    mytable m
ON      ABS(LOG10(value) - LOG10(median)) <= @filter_level

Create GROUP BY by logarithmic rule (for example difference between number not exceeds 10 times or any other base of log)
Create filtering (using HAVING) by non-representative groups (for example less than 3)

The danger in doing this is that you can't be certain that all those spikes are insignificant and worth discarding. One person's noise is another person's Black Swan.

If you're worried about large values skewing your view of the data needlessly, you'd be better off using a measure like median that's less sensitive to outliers. It's harder to calculate than mean, but it'll give you a measure of centrality that's not swayed as much by spikes.

You may consider using a windowing function like OVER / PARTITION BY. This will allow you to fine-tune exclusions within specific groups of rows (like by name, date, or hour). In this example, I borrow the rows from the example t-clausen.dk and expand by adding a name so we can demonstrate windowing.

-- Set boundaries, like the TOP PERCENT used in the afore mentioned example
DECLARE @UBOUND FLOAT, @LBOUND FLOAT

SET @UBOUND = 0.8 --(80%)
SET @LBOUND = 0.2 --(20%)

--Build a CTE table 
;WITH tb_example AS (
select [Val]=21,[fname]='Bill' union all 
select 54,'Tom' union all
select 34,'Tom' union all 
select 14,'Bill' union all 
select 20,'Bill' union all 
select 300,'Tom' union all 
select 23,'Bill' union all 
select 1,'Tom' union all 
select 1,'Tom' union all 
select 1,'Bill' union all 
select 1,'Tom' union all 
select 200,'Bill' union all 
select 1,'Tom' union all 
select 12,'Tom' union all 
select 8,'Tom' union all 
select 11,'Bill' union all 
select 100,'Bill'
)

--Outer query applies criteria of your choice to remove spikes
SELECT fname,AVG(Val) FROM (
-- Inner query applies windowed aggregate values for outer query processing
    SELECT * 
    ,ROW_NUMBER()  OVER (PARTITION BY fname order by Val) RowNum 
    ,COUNT(*) OVER (PARTITION BY fname) RowCnt
    ,MAX(Val) OVER (PARTITION BY fname) MaxVal
    ,MIN(Val) OVER (PARTITION BY fname) MinVal
    FROM tb_example
) TB
WHERE 
-- You can use the bounds to eliminate the top and bottom 20%
 RowNum BETWEEN (RowCnt*@LBOUND) and (RowCnt*@UBOUND) -- Limits window
-- Or you may chose to simply eliminate the Max and MIN values
 OR (Val > MinVal AND Val < MaxVal) -- Removes Lowest and Highest values
GROUP BY fname

In this case, I use both criteria and AVG the val by fname. But the sky is the limit with how you chose to mitigate spikes with this technique.

继续阅读：algorithm data-structures sql-server tsql

T-SQL for a normalized average

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？