开发者

Merging date intervals in SQL Server

I have the following data:

StartDate   |  EndDate
-------------------------
1982.03.02  |  1982.09.30 
1982.10.01  |  1985.01.17 
1985.06.26  |  1985.07.26 
1985.07.30  |  1991.12.31 
1992.01.01  |  1995.12.31 
1996.01.01  |  2004.05.31 
2004.06.05  |  2006.01.31 
开发者_如何学C2006.02.01  |  2011.05.20              

I need to merge any intervals that are adjacent (both start and the end date are included in the intervals, so an interval ending on 2003.05.06 is adjacent with an interval starting on 2003.05.07), so in this case, the resulting set should be:

StartDate   |  EndDate
-------------------------
1982.03.02  |  1985.01.17 
1985.06.26  |  1985.07.26 
1985.07.30  |  2004.05.31 
2004.06.05  |  2011.05.20              

For me, the obvious way to do this is to iterate the set with a cursor, and construct a result set row-by-row. However, this functionality will be within code that could potentially be called thousands of times in a day, on a server under heavy load, so I'd prefer not having any performance issues. Any data set is small (20 rows tops), and the data range is large, so any solution that generates all the dates in a range is unfeasible.

Is there a better way I'm not seeing?


Initialization code (from Damien's answer):

CREATE TABLE Periods (
    StartDate datetime NOT NULL CONSTRAINT PK_Periods PRIMARY KEY CLUSTERED,
    EndDate datetime NOT NULL
)

INSERT INTO Periods(StartDate,EndDate)
SELECT '19820302', '19820930'
UNION ALL SELECT '19821001', '19850117'
UNION ALL SELECT '19850626', '19850726'
UNION ALL SELECT '19850730', '19911231'
UNION ALL SELECT '19920101', '19951231'
UNION ALL SELECT '19960101', '20040531'
UNION ALL SELECT '20040605', '20060131'
UNION ALL SELECT '20060201', '20110520'


It takes longer for me to set up the sample data than to write the query - it would be better if you posted questions that include CREATE TABLE and INSERT/SELECT statements. I don't know what your table is called, I've called mine Periods:

create table Periods (
    StartDate date not null,
    EndDate date not null
)
go
insert into Periods(StartDate,EndDate)
select '19820302','19820930' union all
select '19821001','19850117' union all
select '19850626','19850726' union all
select '19850730','19911231' union all
select '19920101','19951231' union all
select '19960101','20040531' union all
select '20040605','20060131' union all
select '20060201','20110520'
go
; with MergedPeriods as (
    Select p1.StartDate, p1.EndDate
    from
        Periods p1
            left join
        Periods p2
            on
                p1.StartDate = DATEADD(day,1,p2.EndDate)
    where
        p2.StartDate is null
    union all
    select p1.StartDate,p2.EndDate
    from
        MergedPeriods p1
            inner join
        Periods p2
            on
                p1.EndDate = DATEADD(day,-1,p2.StartDate)
)
select StartDate,MAX(EndDate) as EndDate
from MergedPeriods group by StartDate

Result:

StartDate   EndDate
1982-03-02  1985-01-17
1985-06-26  1985-07-26
1985-07-30  2004-05-31
2004-06-05  2011-05-20


Here's a query that performs best of all submissions so far, with only two table accesses in the execution plan (instead of three or more). All queries are of course helped by indexes. Please note that the execution plan rates this query as more expensive, but the actual Reads & CPU are significantly better. Estimated costs in execution plans are not the same as actual performance.

WITH Grps AS (
   SELECT
      (Row_Number() OVER (ORDER BY P1.StartDate) - 1) / 2 Grp,
      P1.StartDate,
      P1.EndDate
   FROM
      Periods P1
      CROSS JOIN (SELECT -1 UNION ALL SELECT 1) D (Dir)
      LEFT JOIN Periods P2 ON
         DateAdd(Day, D.Dir, P1.StartDate) = P2.EndDate
         OR DateAdd(Day, D.Dir, P1.EndDate) = P2.StartDate
   WHERE
      (Dir = -1 AND P2.EndDate IS NULL)
      OR (Dir = 1 AND P2.StartDate IS NULL)
)
SELECT
   Min(StartDate) StartDate,
   Max(EndDate) EndDate
FROM Grps
GROUP BY Grp;

One more thing I think worth mentioning is that querying your date period table would all around in most cases be simpler and better performing if you used exclusive end dates (aka "open" end dates) instead of closed ones:

StartDate   | EndDate     | EndDate
(Inclusive) | (Inclusive) | (Exclusive)
---------------------------------------
1982.03.02  | 1982.09.30  | 1982.10.01
1982.10.01  | 1985.01.17  | 1985.01.18

Using exclusive end dates is (in my opinion) best practice most of the time because it allows you to change the data type of the date column or to change the resolution of the date, without affecting any queries, code, or other logic. For example, if your dates needed to be to the nearest 12 hours instead of 24 hours, you'd have major work to get that accomplished, whereas if you used exclusive end dates not a single thing would have to change!

If you were using exclusive end dates, my query would look like this:

WITH Grps AS (
   SELECT
      (Row_Number() OVER (ORDER BY P1.StartDate) - 1) / 2 Grp,
      P1.StartDate,
      P1.EndDate
   FROM
      Periods P1
      CROSS JOIN (SELECT 1 UNION ALL SELECT 2) X (Which)
      LEFT JOIN Periods P2 ON
         (X.Which = 1 AND P1.StartDate = P2.EndDate)
         OR (X.Which = 2 AND P1.EndDate = P2.StartDate)
   WHERE
      P2.EndDate IS NULL
      OR P2.StartDate IS NULL
)
SELECT
   Min(StartDate) StartDate,
   Max(EndDate) EndDate
FROM Grps
GROUP BY Grp;

Notice there's no DateAdd or DateDiff now, with hardcoded values of "1 Day" that would have to change if you for example switched to 12-hour periods.

Update

Here's an updated query that incorporates things I've learned in the last almost 5 years. This query now has no joins at all, and though it does have 3 sort operations in it which could be performance problems, I think this query will compete reasonably well, and in the absence of indexes will probably beat all others hands down.

WITH Groups AS (
   SELECT Grp = Row_Number() OVER (ORDER BY StartDate) / 2, *
   FROM
      #Periods
      (VALUES (0), (0)) X (Dup)
), Ranges AS (
   SELECT StartDate = Max(StartDate), EndDate = Min(EndDate)
   FROM Groups
   GROUP BY Grp
   HAVING Max(StartDate) <> DateAdd(day, 1, Min(EndDate))
), ReGroups AS (
   SELECT
      Grp = Row_Number() OVER (ORDER BY StartDate) / 2,
      StartDate,
      EndDate
   FROM
      Ranges
      CROSS JOIN (VALUES (0), (0)) X (Dup)
)
SELECT
   StartDate = Min(StartDate),
   EndDate = Max(EndDate)
FROM ReGroups
GROUP BY Grp
HAVING Count(*) = 2
;

And here's yet another version using windowing functions (kind of what the previous query is simulating):

WITH LeadLag AS (
   SELECT
      PrevEndDate = Coalesce(Lag(EndDate) OVER (ORDER BY StartDate), '00010101'),
      NextStartDate = Coalesce(Lead(StartDate) OVER (ORDER BY StartDate), '99991231'),
      *
   FROM #Periods
), Dates AS (
   SELECT
      X.*
   FROM
      LeadLag
      CROSS APPLY (
         SELECT
            StartDate = CASE WHEN DateAdd(day, 1, PrevEndDate) <> StartDate THEN StartDate ELSE NULL END,
            EndDate = CASE WHEN DateAdd(day, 1, EndDate) <> NextStartDate THEN EndDate ELSE NULL END
      ) X
   WHERE
      X.StartDate IS NOT NULL
      OR X.EndDate IS NOT NULL
), Final AS (
   SELECT
      StartDate,
      EndDate = Min(EndDate) OVER (ORDER BY EndDate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
   FROM Dates
)
SELECT *
FROM Final
WHERE StartDate IS NOT NULL
;


You could look up the heads: rows that start a period. Then search for the last end date before the next head in a subquery:

; with heads as
        (
        select  StartDate
        ,       EndDate
        ,       row_number() over (order by StartDate) as rn
        from    @YourTable h
        where   not exists
                (
                select  *
                from    @YourTable next
                where   next.EndDate = dateadd(day, -1, h.StartDate)
                )
        )
select  heads.StartDate
,       (
        select  top 1 EndDate
        from    @YourTable
        where   EndDate < COALESCE(
                (
                select  StartDate
                from    heads h2
                where   heads.rn + 1 = h2.rn
                ), '9999-01-01')
        order by
                EndDate desc
        ) as EndDate
from    heads

Example at ODATA.


Hmmm... I know you said

any solution that generates all the dates in a range is unfeasible.

But for some reason I just wanted to show how that would be done. I don't mean to waste your time.

First, create a numbers table if you don't already have one.

CREATE TABLE Numbers (
   Num int NOT NULL CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED
)
INSERT Numbers VALUES (0)
WHILE @@RowCount < 65536
   INSERT Numbers SELECT Num FROM Numbers + (SELECT Max(Num) FROM Numbers) + 1

Then group some islands!

WITH Dts AS (
   SELECT
      DateAdd(Day, Num, StartDate) Dt,
      DateAdd(
         Day,
         -DENSE_RANK() OVER (ORDER BY StartDate, Num),
         DateAdd(Day, Num, StartDate)
      ) Grp
   FROM
      Periods P
      INNER JOIN Numbers N ON DateDiff(Day, P.StartDate, P.EndDate) >= N.Num
)
SELECT Min(Dt) StartDate, Max(Dt) EndDate
FROM Dts
GROUP BY Grp
ORDER BY StartDate

If you are using SQL 2000 this won't work, so please let me know and I'll come up with another solution for you.


Here's a very similar thread for PostgreSQL:

PostgreSQL matching interval between start and end time against timestamp

I'm only mildly familiar with T-SQL, so I'm not entirely sure the takeaway applies to you, but the general idea is to additionally store an indexable geometry type with a GIST (or R-tree) index, and to query against it. This will make the queries very fast.

(example segment code below is from peufeu's reply, and applies to date ranges too):

CREATE TABLE segments( start INTEGER NOT NULL, stop INTEGER NOT NULL, range_box BOX NOT NULL );
INSERT INTO segments SELECT n,n+1,BOX(POINT(n,-1),POINT(n+1,1)) FROM generate_series( 1, 1000000 ) n;
CREATE INDEX segments_box ON segments USING gist( range_box );
CREATE INDEX segments_start ON segments(start);
CREATE INDEX segments_stop ON segments(stop);

EXPLAIN ANALYZE SELECT * FROM segments WHERE 300000 BETWEEN start AND stop;
 Index Scan using segments_start on segments  (cost=0.00..12959.24 rows=209597 width=72) (actual time=91.990..91.990 rows=2 loops=1)
   Index Cond: (300000 >= start)
   Filter: (300000 <= stop)
 Total runtime: 92.023 ms

EXPLAIN ANALYZE SELECT * FROM segments WHERE range_box && '(300000,0,300000,0)'::BOX;
 Bitmap Heap Scan on segments  (cost=283.49..9740.27 rows=5000 width=72) (actual time=0.036..0.037 rows=2 loops=1)
   Recheck Cond: (range_box && '(300000,0),(300000,0)'::box)
   ->  Bitmap Index Scan on segments_box  (cost=0.00..282.24 rows=5000 width=0) (actual time=0.032..0.032 rows=2 loops=1)
         Index Cond: (range_box && '(300000,0),(300000,0)'::box)
 Total runtime: 0.064 ms

Again, the above is PostgreSQL specific, but it might be worth looking if the needed types/operator/indexes in T-SQL exist as well.


Old thread, but if anyone is looking for an implementation of doing this in PostGIS, here is an example:

-- Create the data:
drop table if exists periods;
create temporary table periods as
select '19820302'::date as StartDate,'19820930'::date as EndDate union all
select '19821001'::date,'19850117'::date union all
select '19850626'::date,'19850726'::date union all
select '19850730'::date,'19911231'::date union all
select '19920101'::date,'19951231'::date union all
select '19960101'::date,'20040531'::date union all
select '20040605'::date,'20060131'::date union all
select '20060201'::date,'20110520'::date;

-- Run with PostGIS
-- Convert all intervals to lines, and then do point intersection.
select 
  '1970-01-01'::date+st_x(st_astext(st_pointn(line,1)))::int4 as start, 
  '1970-01-01'::date+st_x(st_astext(st_pointn(line,st_numpoints(line))))::int4-1 as end 
from 
(select (st_dump(st_linemerge(st_union(the_geom)))).geom as line from 
(select st_makeline(st_makepoint(startdate-'1970-01-01',0),
        st_makepoint(enddate-'1970-01-01'+1,0)) as the_geom from periods)t 
)x;  

-- Result
start       |  end
-------------------------
1982-03-02  |  1985-01-17 
1985-06-26  |  1985-07-26 
1985-07-30  |  2004-05-31 
2004-06-05  |  2011-05-20  


alter table MergedPeriods (
   StartDate date not null,
EndDate date not null
)
go
insert into MergedPeriods(StartDate,EndDate)
select '20130210','20130215' union all
select '20130216','20130228' union all
select '20130302','20130312' union all
select '20130317','20130325' union all
select '20130326','20130405' union all
select '20130406','20130411' union all
select '20130502','20130610' 
go
; with MergedPeriods as (
    Select p1.StartDate, p1.EndDate
    from
        [test].[dbo].[Periods] p1
            left join
        [test].[dbo].[Periods] p2
            on
                p1.StartDate = DATEADD(day,1,p2.EndDate)
    where

       p2.StartDate is null
    union all
    select p1.StartDate,p2.EndDate
    from
        MergedPeriods p1
            inner join
        [test].[dbo].[Periods] p2
            on
                p1.EndDate = DATEADD(day,-1,p2.StartDate)



)

select MIN(StartDate),MAX(EndDate) as EndDate
from MergedPeriods group by StartDate
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜