Need to "merge" blocks of records in query into one, depending on other query columns
I have the following db scheme.
CREATE TABLE Twix.dbo.Sensors
(Id int PRIMARY KEY NOT NULL IDENTITY(1,1),
Location nvarchar(260) NOT NULL);
CREATE TABLE Twix.dbo.Visitors
(Id int PRIMARY KEY NOT NULL IDENTITY(1,1),
Name nvarchar(260) NOT NULL);
CREATE TABLE Twix.dbo.Visits(Id int PRIMARY KEY NOT NULL IDENTITY(1,1),
SensorId int CONSTRAINT FK__Visits__Sensor__239E FOREIGN KEY REFERENCES Sensors(Id),
VisitorId int CONSTRAINT FK__Visits__Visitors__4DCF FOREIGN KEY REFERENCES Visitors(Id),
InTime datetime NOT NULL,
OutTime datetime NOT NULL);
InTime-OutTime periods in Visits table could be intersected. I.e. parts of real visits. For example,
SELECT VisitorId, InTime, OutTime FROM Visits ORDER BY VisitorId, InTime
1 2011-02-09 15:26:59.173 2011-02-09 15:29:22.097
1 2011-02-09 15:28:40.530 2011-02-09 15:29:52.737
1 2011-02-09 15:30:10.577 2011-02-09 16:17:58.967
1 2011-02-09 19:09:23.057 2011-02-09 19:10:57.493
1 2011-开发者_运维百科02-09 19:12:16.087 2011-02-09 19:13:27.493
1 2011-02-14 15:08:46.333 2011-02-14 15:26:42.433
2 2011-02-09 17:55:42.390 2011-02-09 18:52:03.780
2 2011-02-09 18:10:56.727 2011-02-09 18:11:57.493
2 2011-02-09 21:47:15.650 2011-02-09 21:48:38.783
2 2011-02-09 21:50:18.337 2011-02-09 21:55:26.777
3 2011-02-09 17:12:58.103 2011-02-09 19:51:59.697
3 2011-02-09 22:27:52.073 2011-02-09 23:03:24.753
3 2011-02-09 23:02:51.177 2011-02-10 09:51:14.890
3 2011-02-14 15:27:42.270 2011-02-14 15:42:31.107
3 2011-02-14 15:43:37.320 2011-02-14 18:45:26.163
4 2011-02-09 21:07:51.030 2011-02-09 21:51:02.880
4 2011-02-09 22:42:52.660 2011-02-09 23:21:13.830
4 2011-02-09 23:23:08.563 2011-02-09 23:35:12.847
4 2011-02-09 23:36:05.120 2011-02-09 23:59:02.813
4 2011-02-10 05:58:44.103 2011-02-10 05:59:55.867
4 2011-02-12 08:29:36.620 2011-02-12 09:51:18.510
4 2011-02-12 13:13:42.650 2011-02-12 14:06:01.473
5 2011-02-10 06:48:52.717 2011-02-10 07:37:04.870
5 2011-02-10 06:50:31.067 2011-02-10 06:52:20.877
5 2011-02-10 06:52:36.273 2011-02-10 06:53:36.523
5 2011-02-10 06:59:11.790 2011-02-10 07:00:34.867
5 2011-02-10 08:36:39.563 2011-02-10 08:46:14.760
5 2011-02-10 12:47:05.567 2011-02-10 12:48:05.860
5 2011-02-10 12:49:19.590 2011-02-10 13:09:27.880
5 2011-02-10 12:49:25.733 2011-02-10 12:59:59.883
5 2011-02-10 12:55:23.460 2011-02-10 12:56:23.507
I need to get "completed visits", i.e. merge all visits for concrete vistor, if visits are intersecting, or time difference between them less than 10 min (i.e. "InTime of visit2" - "OutTime of visit1" < 10 min).
In C# it looks like:
private IEnumerable Merge(IEnumerable visits, uint holeInterval)
{
var vlist = new LinkedList<Visit>(visits.OrderBy(o => o.InTime));
var result = new List<Visit>();
while (vlist.Count > 1)
{
Visit a = vlist.First.Value; vlist.RemoveFirst();
Visit b = vlist.First.Value; vlist.RemoveFirst();
var r = Visit.Merge(a, b, holeInterval); // Merges two visits
if (r != null) { vlist.AddFirst(r); }
else { result.Add(a); vlist.AddFirst(b); }
}
result.Add(vlist.First.Value);
return result;
}
public IEnumerable<Visit> ListCompleteVisits()
{
var result = new List<Visit>();
var queryResult = from visits in this.repository.ListVisits()
group visits by visits.Visitor.Id into vgroup
select Merge(vgroup, this.holeInterval);
foreach (var v in queryResult)
{
result.AddRange(v);
}
return result;
}
I tried, and get the following:
DECLARE @holeInterval int
SET @holeInterval = 10
SELECT t.RowNumber, t.VisitorId, t.InTime, t.OutTime, t.BInMinusAOut,
(SELECT MIN(InTime) FROM Visits AS D
WHERE VisitorId NOT IN (select VisitorId from Visits group by VisitorId having COUNT(*) = 1)
AND D.InTime <= t.InTime and D.VisitorId = t.VisitorId
AND t.RowNumber > /*here should be max rownumber greater than holeinterval*/
) AS MinInTime
FROM
(SELECT
ROW_NUMBER() OVER(ORDER BY VisitorId, InTime ASC) AS RowNumber,
VisitorId, InTime, OutTime,
DATEDIFF(MI,InTime,
(SELECT MIN(InTime) FROM Visits AS B
WHERE B.InTime > A.InTime and A.VisitorId = B. VisitorId
)) AS BInMinusAIn,
DATEDIFF(MI,OutTime,
(SELECT MIN(InTime) FROM Visits AS B
WHERE B.InTime > A.InTime and A.VisitorId = B. VisitorId
))) AS BInMinusAOut
FROM Visits AS A
WHERE VisitorId NOT IN (select VisitorId from Visits group by VisitorId having COUNT(*) = 1)
) t
/*WHERE t.BInMinusAOut > @holeInterval OR t.BInMinusAOut IS NULL*/
ORDER BY VisitorId, InTime
1 1 2011-02-09 15:26:59.173 2011-02-09 15:28:22.097 0
2 1 2011-02-09 15:28:40.530 2011-02-09 15:29:52.737 1
3 1 2011-02-09 15:30:10.577 2011-02-09 16:17:58.967 9
4 1 2011-02-09 16:26:44.810 2011-02-09 16:51:46.423 20
5 1 2011-02-09 17:11:57.633 2011-02-09 17:13:20.680 2
6 1 2011-02-09 17:15:35.727 2011-02-09 17:18:48.493 -2
7 1 2011-02-09 17:16:12.230 2011-02-09 17:42:47.867 3
8 1 2011-02-09 17:45:43.793 2011-02-09 17:52:10.860 3
9 1 2011-02-09 17:55:31.127 2011-02-09 20:13:22.743 -109
10 1 2011-02-09 18:24:00.427 2011-02-09 18:32:12.033 2
11 1 2011-02-09 18:34:15.877 2011-02-09 18:37:19.770 2
12 1 2011-02-09 18:39:46.440 2011-02-09 18:48:16.800 2
13 1 2011-02-09 18:50:59.270 2011-02-09 20:03:47.550 -54
14 1 2011-02-09 19:09:23.057 2011-02-09 19:10:57.493 2
15 1 2011-02-09 19:12:16.087 2011-02-09 19:13:27.493 48
Now I need to "merge" 1st-4th into 2011-02-09 15:26:59.173 - 2011-02-09 16:51:46.423 5th-15h into 2011-02-09 17:11:57.633 - 2011-02-09 20:13:22.743 This means, that I have to take min InTime in rows, which is between current and the last, where MinInTime > @holeInterval, and max OutTime for those range too.
I.e. result:
1 2011-02-09 15:26:59.173 2011-02-09 16:51:46.423
1 2011-02-09 17:11:57.633 2011-02-09 20:13:22.743
Thank's.
Don't try to do everything in one big select statement.
Write a stored procedure that creates a temporary table by using Select INTO #temp
syntax. Then modify/append the temporary table. Then finally, select from the temp table as an output. The performance will be good and you can move on.
The query was written with Common Table Expressions to make it a bit clearer what is happening, but you can just UNION ALL
the 2 queries for exactly the same result.
What is happening in the query:
OverlappingVisits
the table is joined by expanding the InTime and OutTime by 10 minutes. This will select all the overlapping Vists from the data set. Then the CASE
statement is used to choose the lowest value from the 2 overlapping timespans as InTime and the highest value for the OutTime. Since we still get multiple results by doing that a groupby with min/max is done to filter out the minimum and maximum timestamps of the complete resultset.
NonOverlappingVisits
Similar to the overlapping visits the timespans are expanded by 10 minutes on either side to find the overlapping parts and matched with WHERE NOT EXISTS
.
Declare @holeinterval int
SET @holeinterval = 10
;WITH OverlappingVisits (VisitorId, InTime, OutTime)
AS (select v1.VisitorId
, InTime = MIN(CASE WHEN v1.InTime < v2.InTime THEN v1.InTime ELSE v2.InTime END)
, OutTime = MAX(CASE WHEN v1.OutTime < v2.OutTime THEN v2.OutTime ELSE v1.OutTime END)
FROM Visits v1
INNER JOIN Visits v2
ON v1.VisitorId = v2.VisitorId
AND (v1.InTime BETWEEN dateadd(minute, -@holeinterval, v2.InTime) AND dateadd(minute, @holeinterval, v2.OutTime)
OR v1.OutTime BETWEEN dateadd(minute, @holeinterval, v2.InTime) AND dateadd(minute, @holeinterval, v2.OutTime))
AND v1.Id <> v2.Id
GROUP BY v1.VisitorId),
NonOverlappingVisits (VisitorId, InTime, OutTime)
AS (
SELECT v1.VisitorId
, v1.InTime
, v1.OutTime
FROM Visits v1
WHERE NOT EXISTS(SELECT *
FROM Visits v2
WHERE v1.VisitorId = v2.VisitorId
AND (v1.InTime BETWEEN dateadd(minute, -@holeinterval, v2.InTime) AND dateadd(minute, @holeinterval, v2.OutTime)
OR v1.OutTime BETWEEN dateadd(minute, @holeinterval, v2.InTime) AND dateadd(minute, @holeinterval, v2.OutTime))
AND v1.Id <> v2.Id))
SELECT *
FROM OverlappingVisits
UNION ALL
SELECT *
FROM NonOverlappingVisits
ORDER BY VisitorId, InTime, OutTime
;WITH
Visits_tuned AS (
/* adding some helper columns */
SELECT
VisitorId,
InTime,
OutTime,
OutTimeDelayed = DATEADD(minute, 10, OutTime),
rownum = ROW_NUMBER() OVER (PARTITION BY VisitorId
ORDER BY InTime, OutTime),
VisitId = ROW_NUMBER() OVER (ORDER BY VisitorId, InTime, OutTime)
FROM Visits
),
Visits_starts AS (
/* spotting the starting points of the 'merged' visits */
SELECT DISTINCT
v1.VisitId,
IsStart = 1 - CASE WHEN v2.VisitId IS NULL THEN 0 ELSE 1 END
FROM Visits_tuned v1
LEFT JOIN Visits_tuned v2
ON v1.InTime BETWEEN v2.InTime AND v2.OutTimeDelayed
AND v1.VisitorId = v2.VisitorId AND v1.rownum <> v2.rownum
),
Visits_rec AS (
/* basically, selecting the original data, but with
InTime values replaced by the starting InTimes */
SELECT
VisitId,
VisitorId,
InTime,
OutTime
FROM Visits_tuned
WHERE VisitId = 1
UNION ALL
SELECT
v.VisitId,
v.VisitorId,
IntTime = CASE
WHEN v.VisitorId = r.VisitorId AND
s.IsStart = 0 AND r.InTime < v.InTime
THEN r.InTime
ELSE v.InTime
END,
v.OutTime
FROM Visits_tuned v
INNER JOIN Visits_rec r ON v.VisitId = r.VisitId + 1
INNER JOIN Visits_starts s ON v.VisitId = s.VisitId
)
/* main select; just grouping by visitor and in-time */
SELECT
VisitorId,
InTime,
OutTime = MAX(OutTime)
FROM Visits_rec
GROUP BY VisitorId, InTime
ORDER BY 1, 2
精彩评论