
How to avoid a SELECT N+1-like problem in this scenario?

Scenario: I need to display the average of the last 20 reported values, and I need to do this for all users. I am using SQL Server 2005 Express; that's the lowest version of the database server I need to support.

The way I am doing this now is: one query to fetch all users, then one query per user to fetch that user's last 20 reported values. While I can't actually do the average in SQL for business reasons, let's assume for the time being that I can.

With that assumption, in my head the SQL does an ORDER BY on the date, limits the result to 20 rows per user, and finally groups by user ID. Unfortunately, there doesn't seem to be any way to express a per-user limit like that directly in SQL.

Is there any way to avoid the N+1 queries?

Edit1:

ericb's answer gets the job done. I will, however, wait for some time before marking it as the answer, for two reasons:

  1. I would like to know if there are any performance penalties to this method. The reports table will contain tens of thousands of rows per user, though I only need to average the latest 20.
  2. I am tempted to modify the question (i.e. remove the assumption) so that it reflects my actual business requirement. I am hoping that maybe even that could be solved in SQL alone.

Same question, but with the assumption removed:

The average needs to be taken over the 20 most recent continuous reports. Meaning: suppose the latest 20 rows (in descending order) contain 15 rows (rows 20 to 6) with times from 2:25 PM to 2:40 PM, and rows 5 to 1 have times from 2:43 PM to 2:48 PM. The most recent continuous dataset is rows 5 to 1, so the average needs to be computed over those 5 rows only. It's not like the data will come in batches, so the numbers 15 and 5 could just as easily have been 10 and 10, or 3 and 5 and 12, or even all 20 continuous. (For simplicity's sake, I had assumed that the latest 20 would all be continuous.)

What do you guys think? Can it be done in SQL, or is this something best handled in C#?

Edit 2: I was thinking about it. In C# I would start with the most recent date, subtract 1 minute, and check whether the next most recent date matches this value. If it does, add it to a list and repeat. Looking at these steps, I can't imagine how it would be possible to replicate something like this in SQL. In fact, I am still not sure what the C# equivalent of ericb's answer would be. Which leads me to wonder: how does one think in SQL?
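(For illustration, a minimal sketch of that minute-by-minute walk expressed directly in SQL, assuming the Reports table from the first answer below and strictly minute-granularity data. A recursive CTE, available from SQL Server 2005 onwards, can perform the step-by-step walk for every user at once:)

WITH walk AS
(
    -- Anchor: each user's most recent report
    SELECT  R.UserId, R.Report, R.CreatedOn, 1 AS Step
    FROM    Reports R
    WHERE   R.CreatedOn = (SELECT MAX(R2.CreatedOn)
                           FROM   Reports R2
                           WHERE  R2.UserId = R.UserId)

    UNION ALL

    -- Recursive step: continue only while a row exists exactly one
    -- minute earlier, and stop after 20 steps per user
    SELECT  R.UserId, R.Report, R.CreatedOn, W.Step + 1
    FROM    Reports R
            INNER JOIN walk W
            ON R.UserId = W.UserId
            AND R.CreatedOn = DATEADD(minute, -1, W.CreatedOn)
    WHERE   W.Step < 20
)
SELECT   UserId, AVG(Report) AS Report_Avg
FROM     walk
GROUP BY UserId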


Hopefully I'm interpreting this right. I'm assuming a very basic table setup:

CREATE TABLE Reports
(
    UserId INT,
    Report INT,
    CreatedOn DATETIME  
)

CREATE TABLE Users
(
    UserId INT
)
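For reference, a few hypothetical sample rows (my own test data, not from the question) that the query below can be run against:

INSERT INTO Users (UserId) VALUES (1)

INSERT INTO Reports (UserId, Report, CreatedOn) VALUES (1, 10, '2011-01-01 14:25')
INSERT INTO Reports (UserId, Report, CreatedOn) VALUES (1, 20, '2011-01-01 14:26')
INSERT INTO Reports (UserId, Report, CreatedOn) VALUES (1, 30, '2011-01-01 14:27')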


SELECT  x.UserId, AVG(x.Report) as Report_Avg
FROM
        (
        -- Number each user's reports from newest (1) to oldest
        SELECT  R.Report, U.UserId, ROW_NUMBER() OVER (PARTITION BY U.UserId ORDER BY R.CreatedOn DESC) as RowNum
        FROM    Reports R
                INNER JOIN Users U
                ON R.UserId = U.UserId
        ) x
WHERE   x.RowNum <= 20   -- keep only each user's 20 most recent reports
GROUP BY x.UserId

My code uses the PARTITION BY and ROW_NUMBER() syntax, which should be part of ANSI SQL.


Based on your changes, you could try something like this...

NB: This is based on the assumption that all data is minute-by-minute, and no timestamps will get repeated. If this assumption is false, I'd recommend posting your actual data structure and describing the exact behaviour of the data that can be entered into it.

WITH
  mostRecentData AS
(
  SELECT
    userID,
    MAX(TimeStamp) AS TimeStamp
  FROM
    yourData
  GROUP BY
    userID
)
,
orderedData AS
(
  SELECT
    [reportData].*,
    DATEDIFF(minute, [reportData].TimeStamp, [mostRecentData].TimeStamp) AS offset,
    ROW_NUMBER() OVER (PARTITION BY [reportData].UserID ORDER BY [reportData].TimeStamp DESC) AS sequenceID
  FROM
    yourData                AS [reportData]
  INNER JOIN
    [mostRecentData]
ON [reportData].userID = [mostRecentData].userID
)

SELECT
  UserID,
  AVG(someField)
FROM
  orderedData
WHERE
  sequenceID <= 20             -- At most the 20 most recent values
  AND sequenceID - offset = 1  -- Only Consecutive entries from the latest entry
GROUP BY
  UserID

Assuming you have appropriate indexes, the sequenceID <= 20 filter will resolve quickly, ensuring you don't end up scanning every record for every user.

The sequenceID - offset calculation, however, won't make use of indexes, and so will be evaluated for each of those 20 records. But that's not a big overhead, really.
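For instance, an index along these lines (a sketch; yourData is the placeholder table name used above) would support the per-user, newest-first access pattern:

-- assumption: yourData stands in for your actual report table
CREATE INDEX IX_yourData_userID_TimeStamp
ON yourData (userID, TimeStamp DESC)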

Example Data showing that sequenceID - offset = 1 really does get the most recent consecutive set of data...

TimeStamp  |  Row_Number()  |  Offset  |  Row_Number() - Offset

  12:24            1             0                1
  12:23            2             1                1
  12:22            3             2                1
  12:20            4             4                0
  12:19            5             5                0
  12:17            6             7               -1


Probably a bad idea, but maybe this will put you on the right track?

select u.id, v.value
from users u
left outer join 
  (
    -- note: top 20 here applies across all listed users combined,
    -- not 20 per user (SQL Server has no limit keyword)
    select top 20 user_id, value
    from reported_values
    where user_id in (1,2,3)
    order by created_at desc
  ) as v
  on u.id = v.user_id
where u.id in (1,2,3)
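For what it's worth, a per-user variant of the same idea (my sketch, not part of the suggestion above): SQL Server 2005's cross apply can apply the top 20 to each user individually:

-- sketch: top 20 rows per user via cross apply (SQL Server 2005 and later);
-- table and column names are the assumed ones from the query above
select u.id, v.value
from users u
cross apply
  (
    select top 20 r.value
    from reported_values r
    where r.user_id = u.id
    order by r.created_at desc
  ) as v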


First, if you know the rate of reported values, or at least a minimum rate of reported values, you can find an earliest date and filter by date first. That should improve performance by reducing the number of rows you query, as long as you are indexing on the date column.

Next, you can group by username and use the sum() function to aggregate per user. That saves you N-1 queries and obviates the first one, meaning: 1 query.

Example:

select username, sum(value), count(value) as numvals from table where date > [calculated earliest date/time] group by username

With the count in there, you can do two things.

  1. If you only need a 'close enough' value, you can simply divide the sum(value) by count(value) and get a close average.
  2. If you can afford a couple of iterations, you could add a 'having count(value) = 20' clause (the numvals alias can't be referenced in HAVING) and vary the date until you get all of the users; see the sketch after this list.
    • This is a bit fuzzier than the limit approach, but avoids sorting. It makes sense if you have a good idea what date to filter on to get only 20 values.
    • If I had to choose between sorting and calculating an average for saving memory, I/O cycles, and CPU, I would pick average every time and twice on Sunday.
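A sketch of that second option (table and column names follow the example above; @earliest is a hypothetical stand-in for the calculated cutoff):

declare @earliest datetime
set @earliest = '2011-01-01 14:00'  -- hypothetical calculated earliest date/time

select   username, sum(value) / count(value) as avg_value  -- 'close enough' average
from     [table]
where    [date] > @earliest
group by username
having   count(value) = 20  -- must repeat the aggregate; the alias won't work here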

Alternatively, you can remove the two aggregates and the group by clause, sort first by username and then by date, and just select username and value. Then do the count (most recent 20) filtering outside of the DB when you do your average calculation.

select username, value from table order by username, date

The cost of my suggestions is that unless your users report values at the same rate, a limit doesn't work, because it would be limiting across all of the users combined. However, if the number of queries is the primary issue, I think these would address it.

Caveat: I'm not a DB guy, so the syntax above may be horrific, and my ideas may be due to brain damage. However, I suggest benchmarking to be sure.
