How can I efficiently compute the MAX of one column, ordered by another column?
I have a table schema similar to the following (simplified):
CREATE TABLE Transactions
(
TransactionID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
CustomerID int NOT NULL, -- Foreign key, not shown
TransactionDate datetime NOT NULL,
...
)
CREATE INDEX IX_Transactions_Customer_Date
ON Transactions (CustomerID, TransactionDate)
To give a bit of background here, this transaction table is actually consolidating several different types of transactions from another vendor's database (we'll call it an ETL process), and I therefore don't have a great deal of control over the order in which they get inserted. Even if I did, transactions may be backdated, so the important thing to note here is that the maximum TransactionID
for any given customer
is not necessarily the most recent transaction.
In fact, the most recent transaction is a combination of the date and the ID. Dates are not unique - the vendor often truncates the time of day - so to get the most recent transaction, I have to first find the most recent date, and then find the most recent ID for that date.
I know that I can do this with a windowing query (ROW_NUMBER() OVER (PARTITION BY TransactionDate DESC, TransactionID DESC)
), but this requires a 开发者_开发技巧full index scan and a very expensive sort, and thus fails miserably in terms of efficiency. It's also pretty awkward to keep writing all the time.
Slightly more efficient is using two CTEs or nested subqueries, one to find the MAX(TransactionDate)
per CustomerID
, and another to find the MAX(TransactionID)
. Again, it works, but requires a second aggregate and join, which is slightly better than the ROW_NUMBER()
query but still rather painful performance-wise.
I've also considered using a CLR User-Defined Aggregate and will fall back on that if necessary, but I'd prefer to find a pure SQL solution if possible to simplify the deployment (there's no need for SQL-CLR anywhere else in this project).
So the question, specifically is:
Is it possible to write a query that will return the newest TransactionID
per CustomerID
, defined as the maximum TransactionID
for the most recent TransactionDate
, and achieve a plan equivalent in performance to an ordinary MAX
/GROUP BY
query?
(In other words, the only significant steps in the plan should be an index scan and stream aggregate. Multiple scans, sorts, joins, etc. are likely to be too slow.)
The most useful index might be:
CustomerID, TransactionDate desc, TransactionId desc
Then you could try a query like this:
select a.CustomerID
, b.TransactionID
from (
select distinct
CustomerID
from YourTable
) a
cross apply
(
select top 1
TransactionID
from YourTable
where CustomerID = a.CustomerID
order by
TransactionDate desc,
TransactionId desc
) b
How about something like this where you force the optimizer to calculate the derived table first. In my tests, this was less expensive than the two Max comparisons.
Select T.CustomerId, T.TransactionDate, Max(TransactionId)
From Transactions As T
Join (
Select T1.CustomerID, Max(T1.TransactionDate) As MaxDate
From Transactions As T1
Group By T1.CustomerId
) As Z
On Z.CustomerId = T.CustomerId
And Z.MaxDate = T.TransactionDate
Group By T.CustomerId, T.TransactionDate
Disclaimer: Thinking out loud :)
Could you have an indexed, computed column that combines the TransactionDate and TransactionID columns into a form that means finding the latest transaction is just a case of finding the MAX of that single field?
This one seemed to have good performance statistics:
SELECT
T1.customer_id,
MAX(T1.transaction_id) AS transaction_id
FROM
dbo.Transactions T1
INNER JOIN
(
SELECT
T2.customer_id,
MAX(T2.transaction_date) AS max_dt
FROM
dbo.Transactions T2
GROUP BY
T2.customer_id
) SQ1 ON
SQ1.customer_id = T1.customer_id AND
T1.transaction_date = SQ1.max_dt
GROUP BY
T1.customer_id
I think I actually figured it out. @Ada had the right idea and I had the same idea myself, but was stuck on how to form a single composite ID and avoid the extra join.
Since both dates and (positive) integers are byte-ordered, they can not only be concatenated into a BLOB for aggregation but also separated after the aggregate is done.
This feels a little unholy, but it seems to do the trick:
SELECT
CustomerID,
CAST(SUBSTRING(MAX(
CAST(TransactionDate AS binary(8)) +
CAST(TransactionID AS binary(4))),
9, 4) AS int) AS TransactionID
FROM Transactions
GROUP BY CustomerID
That gives me a single index scan and stream aggregate. No need for any additional indexes either, it performs the same as just doing MAX(TransactionID)
- which makes sense, obviously, since all of the concatenation is happening inside the aggregate itself.
精彩评论