开发者

T-SQL for a normalized average

I'm looking for a way to calculate a useful average for a given set of values which may contain huge spikes. (e.g 21, 54, 34, 14, 20, 300, 23 or 1, 1, 1, 1, 200, 1, 100) the s开发者_运维技巧pikes can throw things off when using the standard average calculation.

I looked into using the median but this doesn't really give the desired result.

I would like to implement this in T-SQL

Any ideas?


This way you can take away the highest and the lowest 25 % before calculating the result.

declare @t table (col1 int)

insert @t
select 21 union all 
select 54 union all 
select 34 union all 
select 14 union all 
select 20 union all 
select 300 union all 
select 23 union all 
select 1 union all 
select 1 union all 
select 1 union all 
select 1 union all 
select 200 union all 
select 1 union all 
select 100


select avg(col1) from (
select top 67 percent col1 from (
select top 75 percent col1 from @t order by col1
) a order by col1 desc) b


Use median filter:

SELECT  AVG(value)
FROM    (
        SELECT  TOP 1 value AS median
        FROM    (
                SELECT  TOP 50 PERCENT value
                FROM    mytable
                ORDER BY
                        value
                ) q
        ORDER BY
                value DESC
        ) q
JOIN    mytable m
ON      ABS(LOG10(value) - LOG10(median)) <= @filter_level


  1. Create GROUP BY by logarithmic rule (for example difference between number not exceeds 10 times or any other base of log)
  2. Create filtering (using HAVING) by non-representative groups (for example less than 3)


The danger in doing this is that you can't be certain that all those spikes are insignificant and worth discarding. One person's noise is another person's Black Swan.

If you're worried about large values skewing your view of the data needlessly, you'd be better off using a measure like median that's less sensitive to outliers. It's harder to calculate than mean, but it'll give you a measure of centrality that's not swayed as much by spikes.


You may consider using a windowing function like OVER / PARTITION BY. This will allow you to fine-tune exclusions within specific groups of rows (like by name, date, or hour). In this example, I borrow the rows from the example t-clausen.dk and expand by adding a name so we can demonstrate windowing.

-- Set boundaries, like the TOP PERCENT used in the afore mentioned example
DECLARE @UBOUND FLOAT, @LBOUND FLOAT

SET @UBOUND = 0.8 --(80%) SET @LBOUND = 0.2 --(20%)

--Build a CTE table ;WITH tb_example AS ( select [Val]=21,[fname]='Bill' union all select 54,'Tom' union all select 34,'Tom' union all select 14,'Bill' union all select 20,'Bill' union all select 300,'Tom' union all select 23,'Bill' union all select 1,'Tom' union all select 1,'Tom' union all select 1,'Bill' union all select 1,'Tom' union all select 200,'Bill' union all select 1,'Tom' union all select 12,'Tom' union all select 8,'Tom' union all select 11,'Bill' union all select 100,'Bill' )

--Outer query applies criteria of your choice to remove spikes SELECT fname,AVG(Val) FROM ( -- Inner query applies windowed aggregate values for outer query processing SELECT * ,ROW_NUMBER() OVER (PARTITION BY fname order by Val) RowNum ,COUNT(*) OVER (PARTITION BY fname) RowCnt ,MAX(Val) OVER (PARTITION BY fname) MaxVal ,MIN(Val) OVER (PARTITION BY fname) MinVal FROM tb_example ) TB WHERE -- You can use the bounds to eliminate the top and bottom 20% RowNum BETWEEN (RowCnt*@LBOUND) and (RowCnt*@UBOUND) -- Limits window -- Or you may chose to simply eliminate the Max and MIN values OR (Val > MinVal AND Val < MaxVal) -- Removes Lowest and Highest values GROUP BY fname

In this case, I use both criteria and AVG the val by fname. But the sky is the limit with how you chose to mitigate spikes with this technique.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜