How can I improve my select query for storing large versioned data sets?

At work, we build large multi-page web applications, consisting mostly of radio buttons and check boxes. The primary purpose of each application is to gather data, but as users return to a page they have previously visited, we report back to them their previous responses. Worst-case scenario, we might have up to 900 distinct variables and around 1.5 million users.

For several reasons, it makes sense to use an insert-only approach to storing the data (as opposed to update-in-place) so that we can capture historical data about repeated interactions with variables. The net result is that we might have several responses per user per variable.

Our table to collect the responses looks something like this:

CREATE TABLE [dbo].[results](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [userid] [int] NULL,
    [variable] [varchar](8) NULL,
    [value] [tinyint] NULL,
    [submitted] [smalldatetime] NULL)

Where id serves as the primary key.

Virtually every request results in a series of insert statements (one per variable submitted), and then we run a select to produce previous responses for the next page (something like this):

SELECT t.id, t.variable, t.value
    FROM results t WITH (NOLOCK)
    WHERE t.userid = '2111846' AND 
    (t.variable='internat' OR t.variable='veteran' OR t.variable='athlete') AND
    t.id IN (SELECT MAX(id) AS id
        FROM results WITH (NOLOCK)
        WHERE userid = '2111846' AND (variable='internat' OR variable='veteran' OR variable='athlete')
        GROUP BY variable)

Which, in this case, would return the most recent responses for the variables "internat", "veteran", and "athlete" for user 2111846.

We have followed the advice of the database tuning tools in indexing the tables, and against our data, this is the best-performing version of the select query that we have been able to come up with. Even so, there seems to be significant performance degradation as the table approaches 1 million records (and we might have about 150x that). We have a fairly elegant solution in place for sharding the data across multiple tables, which has been working quite well, but I am open to any advice about how I might construct a better version of the select query. We use this structure frequently for storing lots of independent data points, and we like the benefits it provides.

So the question is, how can I improve the performance of the select query? I assume the nested select statement is a bad idea, but I have yet to find an alternative that performs as well.

Thanks in advance.

NB: Since we emphasize creating over reading in this case, and since we never update in place, there doesn't seem to be any penalty (and some advantage) to using the NOLOCK hint in this case.


Don't use NOLOCK hints. Use snapshot isolation instead: simply set READ_COMMITTED_SNAPSHOT ON at the database level.
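
A minimal sketch of that switch, assuming a hypothetical database name (substitute your own), is:

ALTER DATABASE ResponsesDb            -- hypothetical database name
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;          -- kicks out other connections so the option can take effect

With this on, readers see the last committed version of each row instead of blocking on writers (or dirty-reading past them with NOLOCK), so your select stays consistent without touching the insert path.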

Also, you say you want 'the most recent', but you select the MAX by id, which is an IDENTITY column. Shouldn't you use a datetime for 'recent'? An identity only gives you insert order as 'recent', which may have no meaning for the domain model.

To return the data you want, the best way is to create an appropriate clustered index:

CREATE TABLE [dbo].[results](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [userid] [int] NULL,
    [variable] [varchar](8) NULL,
    [value] [tinyint] NULL,
    [submitted] [smalldatetime] NULL);

create clustered index cdxResults on results ([userid], variable, id DESC);

If you have a table of possible variable types, the query can be highly optimized by doing a TOP(1) seek per variable inside an APPLY operator:

SELECT t.id, t.variable, t.value
FROM ( 
  SELECT 'internat' as variable
  UNION ALL SELECT 'veteran'
  UNION ALL SELECT 'athlete'
) as v
CROSS APPLY (
   SELECT TOP(1) id, variable, value
   FROM results 
   WHERE userid = @userid
   AND variable = v.variable
   ORDER BY id DESC
) as t

This query runs three seeks into the clustered index, one per variable, so the response time stays effectively constant regardless of the size of the data, i.e. roughly the same whether the table holds 100 rows, 1 million, or 1 billion.
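
For reference, a sketch of the same APPLY against a real lookup table rather than the derived UNION ALL; the variable_types table and its name column are hypothetical stand-ins for whatever you actually have:

SELECT t.id, t.variable, t.value
FROM variable_types AS v              -- hypothetical lookup table of all variables
CROSS APPLY (
   SELECT TOP(1) id, variable, value
   FROM results
   WHERE userid = @userid
   AND variable = v.name              -- v.name assumed varchar(8), matching results.variable
   ORDER BY id DESC                   -- or ORDER BY submitted DESC if 'recent' should mean the timestamp
) as t
WHERE v.name IN ('internat', 'veteran', 'athlete')

Each APPLY branch still resolves to a single seek on the (userid, variable, id DESC) clustered index, so the plan shape is the same as with the hard-coded derived table.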
