Identify a specific sequence of records in a table
Assume a table with the fields TransactionId, ItemId, Code, EffectiveDate, and CreateDate.
+---------------+--------+------+------------------+------------------+ | TransactionId | ItemId | Code | EffectiveDate | CreateDate | +---------------+--------+------+------------------+------------------+ | 1| 1| 8| 12/2/2009 1:13 PM| 12/2/2009 1:13 PM| +---------------+--------+------+------------------+------------------+ | 4| 1| 51|12/2/2009 11:08 AM| 12/3/2009 9:01 AM| +---------------+--------+------+------------------+------------------+ | 2| 1| 14|12/2/2009 11:09 AM|12/2/2009 11:09 AM| +---------------+--------+------+------------------+------------------+ | 3| 1| 61| 12/3/2009 8:33 AM| 12/3/2009 8:33 AM| +---------------+--------+------+------------------+------------------+ | 5| 1| 28| 12/3/2009 9:33 AM| 12/3/2009 9:33 AM| +---------------+--------+------+------------------+------------------+ | 6| 1| 9| 12/3/2009 1:58 PM| 12/3/2009 1:58 PM| +---------------+--------+------+------------------+------------------+
I need to get the set of records where the sequence 51, 61, 9 occurs for a given ItemId, sorted by EffectiveDate. There could be other records with other codes in between these records.
In this case, I would return TransactionId's 4, 3, and 6, as shown below.
+---------------+--------+------+------------------+------------------+ | TransactionId | ItemId | Code | EffectiveDate | CreateDate | +---------------+--------+------+------------------+------------------+ | 4| 1| 51|12/2/2009 11:08 AM| 12/3/2009 9:01 AM| +---------------+--------+------+------------------+------------------+ | 3| 1| 61| 12/3/2009 8:33 AM| 12/3/2009 8:33 AM| +---------------+--------+------+------------------+------------------+ | 6| 1| 9| 12/3/2009 1:58 PM| 12/3/2009 1:58 PM| +---------------+--------+------+------------------+------------------+
Note that:
- This isn't the only sequence I'll need to identify, but it illustrates the problem.
- Records can be inserted into the table out of order; that is, the 61 record could have开发者_StackOverflow社区 been inserted first, followed by the 51, and then the 9. You can see this in the example where for the code 51 record the CreateDate is later than the EffectiveDate.
- The order of the sequence matters. So, the sequence 61, 9, 51 would not return any records, but 51, 61, 9 would.
A DB approach is ideal if it's simple (i.e. no cursors or overly complicated stored procedure), but a code approach could also work, although it would result in a significant amount of data transfer out of the DB.
The environment is SQL Server 2005 and C#/.NET 3.5.
Actually, you could get a couple of fairly simple solutions leveraging ranking/windowing functions and/or CTEs and recursive CTEs.
Create a procedure that accepts a character-based comma-separated list of Code values you are looking for in the sequence you want them in - use any of a dozen possible ways to split this list into a table/set that is made up of the sequence and Code value, resulting in a table with a structure like this:
declare @sequence table (sequence int not null, Code int not null);
Once you have this, it's simply a matter of sequencing the source set based on joining the sequenced table to the source table on the same Code values for a given ItemId - once you have the source set filtered and sequenced, you can simply join again based on the matching sequence values - this is sounding complex, but in reality it would be a single query like this:
with srcData as (
select row_number() over(order by t.EffectiveDate) as rn,
t.TransactionId, t.ItemId, t.Code, t.EffectiveDate, t.CreateDate
from #TableName t
join @sequence s
on t.Code = s.Code
where t.ItemId = @item_id
)
select d.TransactionId, d.ItemId, d.Code, d.EffectiveDate, d.CreateDate
from srcData d
join @sequence s
on d.rn = s.sequence
and d.Code = s.Code
order by d.rn;
This alone won't guarantee that you get a result-set that is identical to what you are looking for, but staging the data into a temp table and adding a few simple checks around the code would do the trick (for example, add a checksum validation and a sum of the code values)
declare @tempData table (rn int, TransactionId smallint, ItemId smallint, Code smallint, EffectiveDate datetime, CreateDate datetime);
with srcData as (
select row_number() over(order by t.EffectiveDate) as rn,
t.TransactionId, t.ItemId, t.Code, t.EffectiveDate, t.CreateDate
from #TableName t
join @sequence s
on t.Code = s.Code
where t.ItemId = @item_id
)
insert @tempData
(rn, TransactionId, ItemId, Code, EffectiveDate, CreateDate)
select d.rn, d.TransactionId, d.ItemId, d.Code, d.EffectiveDate, d.CreateDate
from srcData d
join @sequence s
on d.rn = s.sequence
and d.Code = s.Code;
-- Verify we have matching hash/sums
if
(
( (select sum(Code) from @sequence) = (select sum(Code) from @tempData) )
and
( (select checksum_agg(checksum(sequence, Code)) from @sequence) = (select checksum_agg(checksum(rn, Code)) from @tempData) )
)
begin;
-- Match - return the resultset
select d.TransactionId, d.ItemId, d.Code, d.EffectiveDate, d.CreateDate
from @tempData d
order by d.rn;
end;
If you want to do it all inline, you could use a different approach leveraging CTEs and recursion to perform a running sum/total and OrdPath-like comparison as well (though you'd still need to parse the sequence character data out into a dataset)
-- Sequence data with running total
with sequenceWithRunningTotal as
(
-- Anchor
select s.sequence, s.Code, s.Code as runningTotal, cast(s.Code as varchar(8000)) as pth,
sum(s.Code) over(partition by 1) as sumCode
from @sequence s
where s.sequence = 1
-- Recurse
union all
select s.sequence, s.Code, b.runningTotal + s.Code as runningTotal,
b.pth + '.' + cast(s.Code as varchar(8000)) as pth,
b.sumCode as sumCode
from @sequence s
join sequenceWithRunningTotal b
on s.sequence = b.sequence + 1
),
-- Source data with sequence value
srcData as
(
select row_number() over(order by t.EffectiveDate) as rn,
t.TransactionId, t.ItemId, t.Code, t.EffectiveDate, t.CreateDate,
sum(t.Code) over(partition by 1) as sumCode
from #TableName t
join @sequence s
on t.Code = s.Code
where t.ItemId = @item_id
),
-- Source data with running sum
sourceWithRunningSum as
(
-- Anchor
select t.rn, t.TransactionId, t.ItemId, t.Code, t.EffectiveDate, t.CreateDate,
t.Code as runningTotal, cast(t.Code as varchar(8000)) as pth,
t.sumCode
from srcData t
where t.rn = 1
-- Recurse
union all
select t.rn, t.TransactionId, t.ItemId, t.Code, t.EffectiveDate, t.CreateDate,
s.runningTotal + t.Code as runningTotal,
s.pth + '.' + cast(t.Code as varchar(8000)) as pth,
t.sumCode
from srcData t
join sourceWithRunningSum s
on t.rn = s.rn + 1
)
select d.TransactionId, d.ItemId, d.Code, d.EffectiveDate, d.CreateDate
from sourceWithRunningSum d
join sequenceWithRunningTotal s
on d.rn = s.sequence
and d.Code = s.Code
and d.runningTotal = s.runningTotal
and d.pth = s.pth
and d.sumCode = s.sumCode
order by d.rn;
A DB approach is ideal if it's simple (i.e. no cursors or overly complicated stored procedure)
I don't believe a pure DB approach ("pure" meaning only using SQL SELECT) is practical because the type of SQL I envision would require very convoluted self-joins, field concatenation, MAX() functions, etc. This type of SQL might be a fun academic answer to a puzzle in Joe Celko's "SQL for Smarties" book but I don't think that's appropriate for production code.
I think the realistic approach is to write some kind of loop that keeps track of state. Your problem in the general sense is very similar to writing code for stateful inspection of TCPIP packets for spam filtering or scanning credit-card transactions for fraudulent patterns. All these problems share similar characteristics: the actions you take on the current row(record) depends on what records you saw previously (the context)... and that aspect requires holding state variables.
If you want to avoid round-tripping the data for analysis, it looks like Transact-SQL is the best way for performance. Or use hosted CLR to take advantage of C# syntax while still keeping the processing within the database engine.
This is just off the top of my head and is untested, so it may need some tweaking:
SELECT DISTINCT
T.TransactionID,
T.ItemID,
T.Code,
T.EffectiveDate,
T.CreateDate
FROM
My_Table T
INNER JOIN (
SELECT
T1.TransactionID,
T2.TransactionID,
T3.TransactionID
FROM
My_Table T1
INNER JOIN My_Table T2 ON
T2.ItemID = T1.ItemID AND
T2.Code = 61 AND
T2.EffectiveDate > T1.EffectiveDate
INNER JOIN My_Table T3 ON
T3.ItemID = T1.ItemID AND
T3.Code = 9 AND
T3.EffectiveDate > T2.EffectiveDate
WHERE
T1.Code = 51
) SQ ON
SQ.TransactionID = T1.TransactionID OR
SQ.TransactionID = T2.TransactionID OR
SQ.TransactionID = T3.TransactionID
精彩评论