开发者

Identify a specific sequence of records in a table

Assume a table with the fields TransactionId, ItemId, Code, EffectiveDate, and CreateDate.

+---------------+--------+------+------------------+------------------+
| TransactionId | ItemId | Code |   EffectiveDate  |     CreateDate   |
+---------------+--------+------+------------------+------------------+
|              1|       1|     8| 12/2/2009 1:13 PM| 12/2/2009 1:13 PM|
+---------------+--------+------+------------------+------------------+
|              4|       1|    51|12/2/2009 11:08 AM| 12/3/2009 9:01 AM|
+---------------+--------+------+------------------+------------------+
|              2|       1|    14|12/2/2009 11:09 AM|12/2/2009 11:09 AM|
+---------------+--------+------+------------------+------------------+
|              3|       1|    61| 12/3/2009 8:33 AM| 12/3/2009 8:33 AM|
+---------------+--------+------+------------------+------------------+
|              5|       1|    28| 12/3/2009 9:33 AM| 12/3/2009 9:33 AM|
+---------------+--------+------+------------------+------------------+
|              6|       1|     9| 12/3/2009 1:58 PM| 12/3/2009 1:58 PM|
+---------------+--------+------+------------------+------------------+

I need to get the set of records where the sequence 51, 61, 9 occurs for a given ItemId, sorted by EffectiveDate. There could be other records with other codes in between these records.

In this case, I would return TransactionId's 4, 3, and 6, as shown below.

+---------------+--------+------+------------------+------------------+
| TransactionId | ItemId | Code |   EffectiveDate  |     CreateDate   |
+---------------+--------+------+------------------+------------------+
|              4|       1|    51|12/2/2009 11:08 AM| 12/3/2009 9:01 AM|
+---------------+--------+------+------------------+------------------+
|              3|       1|    61| 12/3/2009 8:33 AM| 12/3/2009 8:33 AM|
+---------------+--------+------+------------------+------------------+
|              6|       1|     9| 12/3/2009 1:58 PM| 12/3/2009 1:58 PM|
+---------------+--------+------+------------------+------------------+

Note that:

  • This isn't the only sequence I'll need to identify, but it illustrates the problem.
  • Records can be inserted into the table out of order; that is, the 61 record could have开发者_StackOverflow社区 been inserted first, followed by the 51, and then the 9. You can see this in the example where for the code 51 record the CreateDate is later than the EffectiveDate.
  • The order of the sequence matters. So, the sequence 61, 9, 51 would not return any records, but 51, 61, 9 would.

A DB approach is ideal if it's simple (i.e. no cursors or overly complicated stored procedure), but a code approach could also work, although it would result in a significant amount of data transfer out of the DB.

The environment is SQL Server 2005 and C#/.NET 3.5.


Actually, you could get a couple of fairly simple solutions leveraging ranking/windowing functions and/or CTEs and recursive CTEs.

Create a procedure that accepts a character-based comma-separated list of Code values you are looking for in the sequence you want them in - use any of a dozen possible ways to split this list into a table/set that is made up of the sequence and Code value, resulting in a table with a structure like this:

declare @sequence table (sequence int not null, Code int not null);

Once you have this, it's simply a matter of sequencing the source set based on joining the sequenced table to the source table on the same Code values for a given ItemId - once you have the source set filtered and sequenced, you can simply join again based on the matching sequence values - this is sounding complex, but in reality it would be a single query like this:

with srcData as (
    select  row_number() over(order by t.EffectiveDate) as rn,
            t.TransactionId, t.ItemId, t.Code, t.EffectiveDate, t.CreateDate
    from    #TableName t
    join    @sequence s
    on      t.Code = s.Code
    where   t.ItemId = @item_id
)
select  d.TransactionId, d.ItemId, d.Code, d.EffectiveDate, d.CreateDate
from    srcData d
join    @sequence s
on      d.rn = s.sequence
and     d.Code = s.Code
order by d.rn;

This alone won't guarantee that you get a result-set that is identical to what you are looking for, but staging the data into a temp table and adding a few simple checks around the code would do the trick (for example, add a checksum validation and a sum of the code values)

declare @tempData table (rn int, TransactionId smallint, ItemId smallint, Code smallint, EffectiveDate datetime, CreateDate datetime);

with srcData as (
    select  row_number() over(order by t.EffectiveDate) as rn,
            t.TransactionId, t.ItemId, t.Code, t.EffectiveDate, t.CreateDate
    from    #TableName t
    join    @sequence s
    on      t.Code = s.Code
    where   t.ItemId = @item_id
)
insert  @tempData
        (rn, TransactionId, ItemId, Code, EffectiveDate, CreateDate)
select  d.rn, d.TransactionId, d.ItemId, d.Code, d.EffectiveDate, d.CreateDate
from    srcData d
join    @sequence s
on      d.rn = s.sequence
and     d.Code = s.Code;

-- Verify we have matching hash/sums    
if
(
    ( (select sum(Code) from @sequence) = (select sum(Code) from @tempData) )
    and
    ( (select checksum_agg(checksum(sequence, Code)) from @sequence) = (select checksum_agg(checksum(rn, Code)) from @tempData) )
)
begin;
    -- Match - return the resultset
    select  d.TransactionId, d.ItemId, d.Code, d.EffectiveDate, d.CreateDate
    from    @tempData d
    order by d.rn;

end;

If you want to do it all inline, you could use a different approach leveraging CTEs and recursion to perform a running sum/total and OrdPath-like comparison as well (though you'd still need to parse the sequence character data out into a dataset)

-- Sequence data with running total
with sequenceWithRunningTotal as
(
    -- Anchor
    select  s.sequence, s.Code, s.Code as runningTotal, cast(s.Code as varchar(8000)) as pth,
            sum(s.Code) over(partition by 1) as sumCode
    from    @sequence s
    where   s.sequence = 1
    -- Recurse
    union all
    select  s.sequence, s.Code, b.runningTotal + s.Code as runningTotal,
            b.pth + '.' + cast(s.Code as varchar(8000)) as pth,
            b.sumCode as sumCode
    from    @sequence s
    join    sequenceWithRunningTotal b
    on      s.sequence = b.sequence + 1
),
-- Source data with sequence value
srcData as 
(
    select  row_number() over(order by t.EffectiveDate) as rn,
            t.TransactionId, t.ItemId, t.Code, t.EffectiveDate, t.CreateDate,
            sum(t.Code) over(partition by 1) as sumCode
    from    #TableName t
    join    @sequence s
    on      t.Code = s.Code
    where   t.ItemId = @item_id
),
-- Source data with running sum
sourceWithRunningSum as
(
    -- Anchor
    select  t.rn, t.TransactionId, t.ItemId, t.Code, t.EffectiveDate, t.CreateDate,
            t.Code as runningTotal, cast(t.Code as varchar(8000)) as pth,
            t.sumCode
    from    srcData t
    where   t.rn = 1
    -- Recurse
    union all
    select  t.rn, t.TransactionId, t.ItemId, t.Code, t.EffectiveDate, t.CreateDate,
            s.runningTotal + t.Code as runningTotal,
            s.pth + '.' + cast(t.Code as varchar(8000)) as pth,
            t.sumCode
    from    srcData t
    join    sourceWithRunningSum s
    on      t.rn  = s.rn + 1
)
select  d.TransactionId, d.ItemId, d.Code, d.EffectiveDate, d.CreateDate
from    sourceWithRunningSum d
join    sequenceWithRunningTotal s
on      d.rn = s.sequence
and     d.Code = s.Code
and     d.runningTotal = s.runningTotal
and     d.pth = s.pth
and     d.sumCode = s.sumCode
order by d.rn;


A DB approach is ideal if it's simple (i.e. no cursors or overly complicated stored procedure)

I don't believe a pure DB approach ("pure" meaning only using SQL SELECT) is practical because the type of SQL I envision would require very convoluted self-joins, field concatenation, MAX() functions, etc. This type of SQL might be a fun academic answer to a puzzle in Joe Celko's "SQL for Smarties" book but I don't think that's appropriate for production code.

I think the realistic approach is to write some kind of loop that keeps track of state. Your problem in the general sense is very similar to writing code for stateful inspection of TCPIP packets for spam filtering or scanning credit-card transactions for fraudulent patterns. All these problems share similar characteristics: the actions you take on the current row(record) depends on what records you saw previously (the context)... and that aspect requires holding state variables.

If you want to avoid round-tripping the data for analysis, it looks like Transact-SQL is the best way for performance. Or use hosted CLR to take advantage of C# syntax while still keeping the processing within the database engine.


This is just off the top of my head and is untested, so it may need some tweaking:

SELECT DISTINCT
     T.TransactionID,
     T.ItemID,
     T.Code,
     T.EffectiveDate,
     T.CreateDate
FROM
     My_Table T
INNER JOIN (
     SELECT
          T1.TransactionID,
          T2.TransactionID,
          T3.TransactionID
     FROM
          My_Table T1
     INNER JOIN My_Table T2 ON
          T2.ItemID = T1.ItemID AND
          T2.Code = 61 AND
          T2.EffectiveDate > T1.EffectiveDate
     INNER JOIN My_Table T3 ON
          T3.ItemID = T1.ItemID AND
          T3.Code = 9 AND
          T3.EffectiveDate > T2.EffectiveDate
     WHERE
          T1.Code = 51
     ) SQ ON
     SQ.TransactionID = T1.TransactionID OR
     SQ.TransactionID = T2.TransactionID OR
     SQ.TransactionID = T3.TransactionID
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜