T-SQL for a normalized average
I'm looking for a way to calculate a useful average for a given set of values which may contain huge spikes. (e.g 21, 54, 34, 14, 20, 300, 23 or 1, 1, 1, 1, 200, 1, 100) the s开发者_运维技巧pikes can throw things off when using the standard average calculation.
I looked into using the median but this doesn't really give the desired result.
I would like to implement this in T-SQL
Any ideas?
This way you can take away the highest and the lowest 25 % before calculating the result.
declare @t table (col1 int)
insert @t
select 21 union all
select 54 union all
select 34 union all
select 14 union all
select 20 union all
select 300 union all
select 23 union all
select 1 union all
select 1 union all
select 1 union all
select 1 union all
select 200 union all
select 1 union all
select 100
select avg(col1) from (
select top 67 percent col1 from (
select top 75 percent col1 from @t order by col1
) a order by col1 desc) b
Use median filter:
SELECT AVG(value)
FROM (
SELECT TOP 1 value AS median
FROM (
SELECT TOP 50 PERCENT value
FROM mytable
ORDER BY
value
) q
ORDER BY
value DESC
) q
JOIN mytable m
ON ABS(LOG10(value) - LOG10(median)) <= @filter_level
- Create
GROUP BY
by logarithmic rule (for example difference between number not exceeds 10 times or any other base of log) - Create filtering (using
HAVING
) by non-representative groups (for example less than 3)
The danger in doing this is that you can't be certain that all those spikes are insignificant and worth discarding. One person's noise is another person's Black Swan.
If you're worried about large values skewing your view of the data needlessly, you'd be better off using a measure like median that's less sensitive to outliers. It's harder to calculate than mean, but it'll give you a measure of centrality that's not swayed as much by spikes.
You may consider using a windowing function like OVER / PARTITION BY. This will allow you to fine-tune exclusions within specific groups of rows (like by name, date, or hour). In this example, I borrow the rows from the example t-clausen.dk and expand by adding a name so we can demonstrate windowing.
-- Set boundaries, like the TOP PERCENT used in the afore mentioned example
DECLARE @UBOUND FLOAT, @LBOUND FLOAT
SET @UBOUND = 0.8 --(80%) SET @LBOUND = 0.2 --(20%)
--Build a CTE table ;WITH tb_example AS ( select [Val]=21,[fname]='Bill' union all select 54,'Tom' union all select 34,'Tom' union all select 14,'Bill' union all select 20,'Bill' union all select 300,'Tom' union all select 23,'Bill' union all select 1,'Tom' union all select 1,'Tom' union all select 1,'Bill' union all select 1,'Tom' union all select 200,'Bill' union all select 1,'Tom' union all select 12,'Tom' union all select 8,'Tom' union all select 11,'Bill' union all select 100,'Bill' )
--Outer query applies criteria of your choice to remove spikes SELECT fname,AVG(Val) FROM ( -- Inner query applies windowed aggregate values for outer query processing SELECT * ,ROW_NUMBER() OVER (PARTITION BY fname order by Val) RowNum ,COUNT(*) OVER (PARTITION BY fname) RowCnt ,MAX(Val) OVER (PARTITION BY fname) MaxVal ,MIN(Val) OVER (PARTITION BY fname) MinVal FROM tb_example ) TB WHERE -- You can use the bounds to eliminate the top and bottom 20% RowNum BETWEEN (RowCnt*@LBOUND) and (RowCnt*@UBOUND) -- Limits window -- Or you may chose to simply eliminate the Max and MIN values OR (Val > MinVal AND Val < MaxVal) -- Removes Lowest and Highest values GROUP BY fname
In this case, I use both criteria and AVG the val by fname. But the sky is the limit with how you chose to mitigate spikes with this technique.
精彩评论