How can I efficiently compute the MAX of one column, ordered by another column?

2023-01-05 05:22 问答作者：

I have a table schema similar to the following (simplified):

CREATE TABLE Transactions
(
    TransactionID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    CustomerID int NOT NULL,  -- Foreign key, not shown
    TransactionDate datetime NOT NULL,
    ...
)

CREATE INDEX IX_Transactions_Customer_Date
ON Transactions (CustomerID, TransactionDate)

To give a bit of background here, this transaction table is actually consolidating several different types of transactions from another vendor's database (we'll call it an ETL process), and I therefore don't have a great deal of control over the order in which they get inserted. Even if I did, transactions may be backdated, so the important thing to note here is that the maximum TransactionID for any given customer is not necessarily the most recent transaction.

In fact, the most recent transaction is a combination of the date and the ID. Dates are not unique - the vendor often truncates the time of day - so to get the most recent transaction, I have to first find the most recent date, and then find the most recent ID for that date.

I know that I can do this with a windowing query (ROW_NUMBER() OVER (PARTITION BY TransactionDate DESC, TransactionID DESC)), but this requires a 开发者_开发技巧full index scan and a very expensive sort, and thus fails miserably in terms of efficiency. It's also pretty awkward to keep writing all the time.

Slightly more efficient is using two CTEs or nested subqueries, one to find the MAX(TransactionDate) per CustomerID, and another to find the MAX(TransactionID). Again, it works, but requires a second aggregate and join, which is slightly better than the ROW_NUMBER() query but still rather painful performance-wise.

I've also considered using a CLR User-Defined Aggregate and will fall back on that if necessary, but I'd prefer to find a pure SQL solution if possible to simplify the deployment (there's no need for SQL-CLR anywhere else in this project).

So the question, specifically is:

Is it possible to write a query that will return the newest TransactionID per CustomerID, defined as the maximum TransactionID for the most recent TransactionDate, and achieve a plan equivalent in performance to an ordinary MAX/GROUP BY query?

(In other words, the only significant steps in the plan should be an index scan and stream aggregate. Multiple scans, sorts, joins, etc. are likely to be too slow.)

The most useful index might be:

CustomerID, TransactionDate desc, TransactionId desc

Then you could try a query like this:

select  a.CustomerID
,       b.TransactionID
from    (
        select  distinct
                CustomerID
        from    YourTable
        ) a
cross apply   
        (
        select  top 1
                TransactionID
        from    YourTable
        where   CustomerID = a.CustomerID
        order by
                TransactionDate desc,
                TransactionId desc
        ) b

How about something like this where you force the optimizer to calculate the derived table first. In my tests, this was less expensive than the two Max comparisons.

Select T.CustomerId, T.TransactionDate, Max(TransactionId)
From Transactions As T
    Join    (
            Select T1.CustomerID, Max(T1.TransactionDate) As MaxDate
            From Transactions As T1
            Group By T1.CustomerId
            ) As Z
        On Z.CustomerId = T.CustomerId
            And Z.MaxDate = T.TransactionDate
Group By T.CustomerId, T.TransactionDate

Disclaimer: Thinking out loud :)

Could you have an indexed, computed column that combines the TransactionDate and TransactionID columns into a form that means finding the latest transaction is just a case of finding the MAX of that single field?

This one seemed to have good performance statistics:

SELECT
    T1.customer_id,
    MAX(T1.transaction_id) AS transaction_id
FROM
    dbo.Transactions T1
INNER JOIN
(
    SELECT
        T2.customer_id,
        MAX(T2.transaction_date) AS max_dt
    FROM
        dbo.Transactions T2
    GROUP BY
        T2.customer_id
) SQ1 ON
    SQ1.customer_id = T1.customer_id AND
    T1.transaction_date = SQ1.max_dt
GROUP BY
    T1.customer_id

I think I actually figured it out. @Ada had the right idea and I had the same idea myself, but was stuck on how to form a single composite ID and avoid the extra join.

Since both dates and (positive) integers are byte-ordered, they can not only be concatenated into a BLOB for aggregation but also separated after the aggregate is done.

This feels a little unholy, but it seems to do the trick:

SELECT
    CustomerID,
    CAST(SUBSTRING(MAX(
        CAST(TransactionDate AS binary(8)) + 
        CAST(TransactionID AS binary(4))),
      9, 4) AS int) AS TransactionID
FROM Transactions
GROUP BY CustomerID

That gives me a single index scan and stream aggregate. No need for any additional indexes either, it performs the same as just doing MAX(TransactionID) - which makes sense, obviously, since all of the concatenation is happening inside the aggregate itself.

继续阅读：aggregate composite-key sql-server sql-server-2008

How can I efficiently compute the MAX of one column, ordered by another column?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

Best solution for private video database [closed]

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML