开发者

Building Stored Procedure to group data into ranges with roughly equal results in each bucket

I am trying to build one procedure to take a large amount of data and create 5 range buckets to display the data. the buckets ranges will have to be set according to the result开发者_高级运维s.

Here is my existing SP

GO
/****** Object:  StoredProcedure [dbo].[sp_GetRangeCounts]    Script Date: 03/28/2010 19:50:45 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

ALTER PROCEDURE [dbo].[sp_GetRangeCounts]
    @idMenu int

AS

declare 
@myMin decimal(19,2),
@myMax decimal(19,2),
@myDif decimal(19,2),
@range1 decimal(19,2),
@range2 decimal(19,2), 
@range3 decimal(19,2),
@range4 decimal(19,2),
@range5 decimal(19,2),
@range6 decimal(19,2)

SELECT @myMin=Min(modelpropvalue), @myMax=Max(modelpropvalue)
FROM xmodelpropertyvalues where modelPropUnitDescriptionID=@idMenu 

set @myDif=(@myMax-@myMin)/5
set @range1=@myMin
set @range2=@myMin+@myDif
set @range3=@range2+@myDif
set @range4=@range3+@myDif
set @range5=@range4+@myDif
set @range6=@range5+@myDif

select @myMin,@myMax,@myDif,@range1,@range2,@range3,@range4,@range5,@range6

select t.range as myRange, count(*) as myCount
from (
  select case  
    when modelpropvalue between @range1 and @range2 then 'range1'
    when modelpropvalue between @range2 and @range3 then 'range2'
    when modelpropvalue between @range3 and @range4 then 'range3'
    when modelpropvalue between @range4 and @range5 then 'range4'
    when modelpropvalue between @range5 and @range6 then 'range5'
     end as range
  from xmodelpropertyvalues where modelpropunitDescriptionID=@idmenu) t
group by t.range order by t.range

This calculates the min and max value from my table, works out the difference between the two and creates 5 buckets. The problem is that if there are a small amount of very high (or very low) values then the buckets will appear very distorted - as in these results...

range1  2806
range2  296
range3  75
range5  1

Basically I want to rebuild the SP so it creates buckets with equal amounts of results in each. I have played around with some of the following approaches without quite nailing it...

SELECT modelpropvalue, NTILE(5) OVER (ORDER BY modelpropvalue) FROM xmodelpropertyvalues - this creates a new column with either 1,2,3,4 or 5 in it

 ROW_NUMBER()OVER (ORDER BY modelpropvalue) between @range1 and @range2
 ROW_NUMBER()OVER (ORDER BY modelpropvalue) between @range2 and @range3
  • or maybe i could allocate every record a row number then divide into ranges from this?


You can use the ranking function ntile to split a result set in equal parts. This example creates a table with values 1...100, and splits them in 5 ranges:

set nocount on

declare @t table (value int)
declare @i int

set @i = 0
while @i < 100
    begin
    insert into @t (value) values (@i)
    set @i = @i + 1
    end

select 
    NTILE(5) over (order by value) as range
,   value
from @t

By using ntile in a subquery, you can do groups and aggregate math on the ranges. For example, to print the sum of numbers 1..19, 20..39, 40..59, etc.

select range, SUM(value)
from (
    select 
        NTILE(5) over (order by value) as range
    ,   value
    from @t
) sub
group by range


You can use top 20% to get the first fifth of the records, that will get you the end of the first range:

select @range1 = max(modelpropvalue)
from (
  select top 20% modelpropvalue
  from xmodelpropertyvalues where modelPropUnitDescriptionID = @idMenu
  order by modelpropvalue
) x

Then you can use that value to exclude the first range, and use top 25% to get the next range:

select @range2 = max(modelpropvalue)
from (
  select top 25% modelpropvalue
  from xmodelpropertyvalues where modelPropUnitDescriptionID = @idMenu
  where modelpropvalue > @range1
  order by modelpropvalue
) x

And so on with 33.3333% and 50% for the third and fourth ranges.

Note that to get the correct count, you should not use between. As both the start and end values are inclusive, you will count the edge values for both the range where it ends and the range where it starts.

select t.range as myRange, count(*) as myCount
from (
  select case  
    when modelpropvalue <= @range1 then 'range1'
    when modelpropvalue <= @range2 then 'range2'
    when modelpropvalue <= @range3 then 'range3'
    when modelpropvalue <= @range4 then 'range4'
    else 'range5'
    end as range
  from xmodelpropertyvalues where modelpropunitDescriptionID=@idmenu) t
group by t.range order by t.range

(Well, actually the case would protect you in this case as it would pick the first match, but if you would count them individually you would count some records twice. To get a single range you would exclude the first value and include the second: where modelpropvalue > @range1 and modelpropvalue <= @range2.)


OK, I got this working using both approaches but the problem seems to be with my data. In certain areas I have huge amounts of records with the same value on the field I am querying (it is the weight of trucks and one particular model is extremely popular), so there is really no way to divide this up evenly!

I have decided to go back to the original stored procedure, which calculated the ranges simply by dividing min and max value by 5, as it ran much quicker. However, as this SP is run up to 8 times in the page (for weight, fuel capacity, engine size, etc.), I could do with speeding it up a bit, as it takes about 0.5 secs to run. Here is the full SP - any ideas on how I can optimize its speed would be greatly appreciated...

ALTER PROCEDURE [dbo].[sp_GetRangeCounts]
    @idMenu int,
    @myFilters varchar(5000),
    @myStore int,
    @myLabel varchar(50) OUTPUT,
    @myUnit varchar(50) OUTPUT,
    @range1 int OUTPUT,
    @range2 int OUTPUT, 
    @range3 int OUTPUT, 
    @range4 int OUTPUT,
    @range5 int OUTPUT,
    @range6 int OUTPUT,
    @range1count int OUTPUT,
    @range2count int OUTPUT,
    @range3count int OUTPUT,
    @range4count int OUTPUT,
    @range5count int OUTPUT
AS

declare 
@myMin int,
@myMax int,
@myDif int

declare @myInfoTable table(
myMin integer,
myMax integer,
myLabel varchar(50),
myUnit varchar(50)
)

insert @myInfoTable (myMin,myMax,myLabel,myUnit) exec('SELECT Min(ConvertedValue) as myMin, Max(ConvertedValue) as myMax,unitDescriptionTrans as myLabel,unitUnitTransDescription as myUnit
FROM LenPropValsView where UnitDescriptionID='+@idMenu+' and xStoreID='+@myStore+@myFilters+' group by unitdescriptionTrans,unitUnitTransDescription')

select @myMin=myMin-1 from @myInfoTable
select @myMax=myMax+1 from @myInfoTable
select @myLabel=myLabel from @myInfoTable
select @myUnit=myUnit from @myInfoTable

set @myDif=(@myMax-@myMin)/5
set @range1=@myMin
set @range2=@myMin+@myDif
set @range3=@range2+@myDif
set @range4=@range3+@myDif
set @range5=@range4+@myDif
set @range6=@myMax

select @myLabel,@myUnit,@myMin,@myMax,@myDif,@range1,@range2,@range3,@range4,@range5,@range6

declare @myData table(
    myRange varchar(50),
    myCount integer
    )

insert @myData(myRange,myCount)
 exec ('select t.range as myRange, count(*) as myCount
from (
  select case  
    when ConvertedValue <='+@range2+' then ''range1''
    when ConvertedValue >'+@range2+' and ConvertedValue<='+@range3+' then ''range2''
    when ConvertedValue >'+@range3+' and ConvertedValue<='+@range4+' then ''range3''
    when ConvertedValue >'+@range4+' and ConvertedValue<='+@range5+' then ''range4''
    else ''range5''
     end as range
  from LenPropValsView where unitDescriptionID='+@idmenu+' and xStoreID='+@myStore+@myFilters+') t
group by t.range order by t.range')

select @range1count=myCount from @myData where myRange='range1'
select @range2count=myCount from @myData where myRange='range2'
select @range3count=myCount from @myData where myRange='range3'
select @range4count=myCount from @myData where myRange='range4'
select @range5count=myCount from @myData where myRange='range5'

select @range1count,@range2count,@range3count,@range4count,@range5count
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜