How can I improve my select query for storing large versioned data sets?

At work, we build large multi-page web applications, consisting mostly of radio buttons and check boxes. The primary purpose of each application is to gather data, but as users return to a page they have previously visited, we report back to them their previous responses. Worst-case scenario, we might have up to 900 distinct variables and around 1.5 million users.

For several reasons, it makes sense to use an insert-only approach to storing the data (as opposed to update-in-place) so that we can capture historical data about repeated interactions with variables. The net result is that we might have several responses per user per variable.

Our table to collect the responses looks something like this:

CREATE TABLE [dbo].[results](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [userid] [int] NULL,
    [variable] [varchar](8) NULL,
    [value] [tinyint] NULL,
    [submitted] [smalldatetime] NULL)

Where id serves as the primary key.

Virtually every request results in a series of insert statements (one per variable submitted), and then we run a select to produce previous responses for the next page (something like this):

SELECT t.id, t.variable, t.value
    FROM results t WITH (NOLOCK)
    WHERE t.userid = '2111846' AND 
    (t.variable='internat' OR t.variable='veteran' OR t.variable='athlete') AND
    t.id IN (SELECT MAX(id) AS id
        FROM results WITH (NOLOCK)
        WHERE userid = '2111846' AND (variable='internat' OR variable='veteran' OR variable='athlete')
        GROUP BY variable)

Which, in this case, would return the most recent responses for the variables "internat", "veteran", and "athlete" for user 2111846.

We have followed the advice of the database tuning tools in indexing the tables, and against our data, this is the best-performing version of the select query that we have been able to come up with. Even so, there seems to be significant performance degradation as the table approaches 1 million records (and we might have about 150x that). We have a fairly elegant solution in place for sharding the data across multiple tables, which has been working quite well, but I am open to any advice about how I might construct a better version of the select query. We use this structure frequently for storing lots of independent data points, and we like the benefits it provides.

So the question is, how can I improve the performance of the select query? I assume the nested select statement is a bad idea, but I have yet to find an alternative that performs as well.

Thanks in advance.

NB: Since we emphasize creating over reading in this case, and since we never update in place, there doesn't seem to be any penalty (and some advantage) to using the NOLOCK hint in this case.


Don't use NOLOCK hints. Use snapshot isolation instead: simply set READ_COMMITTED_SNAPSHOT ON at the database level.
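
A minimal sketch of that switch, assuming a hypothetical database name (substitute your own), is:

ALTER DATABASE ResponsesDb            -- hypothetical database name
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;          -- kicks out other connections so the option can take effect

With this on, readers see the last committed version of each row instead of blocking on writers (or dirty-reading past them with NOLOCK), so your select stays consistent without touching the insert path.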

Also, you say you want 'the most recent', but you select the MAX by id, which is an IDENTITY column. Shouldn't you use a datetime for 'recent'? An identity only gives you insert order as 'recent', which may have no meaning for the domain model.

To return the data you want, the best way is to create an appropriate clustered index:

CREATE TABLE [dbo].[results](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [userid] [int] NULL,
    [variable] [varchar](8) NULL,
    [value] [tinyint] NULL,
    [submitted] [smalldatetime] NULL);

create clustered index cdxResults on results ([userid], variable, id DESC);

If you have a table of possible variable types, the query can be highly optimized by doing a TOP(1) seek per variable inside an APPLY operator:

SELECT t.id, t.variable, t.value
FROM ( 
  SELECT 'internat' as variable
  UNION ALL SELECT 'veteran'
  UNION ALL SELECT 'athlete'
) as v
CROSS APPLY (
   SELECT TOP(1) id, variable, value
   FROM results 
   WHERE userid = @userid
   AND variable = v.variable
   ORDER BY id DESC
) as t

This query runs three seeks into the clustered index, one per variable, so the response time stays effectively constant regardless of the size of the data, i.e. roughly the same whether the table holds 100 rows, 1 million, or 1 billion.
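
For reference, a sketch of the same APPLY against a real lookup table rather than the derived UNION ALL; the variable_types table and its name column are hypothetical stand-ins for whatever you actually have:

SELECT t.id, t.variable, t.value
FROM variable_types AS v              -- hypothetical lookup table of all variables
CROSS APPLY (
   SELECT TOP(1) id, variable, value
   FROM results
   WHERE userid = @userid
   AND variable = v.name              -- v.name assumed varchar(8), matching results.variable
   ORDER BY id DESC                   -- or ORDER BY submitted DESC if 'recent' should mean the timestamp
) as t
WHERE v.name IN ('internat', 'veteran', 'athlete')

Each APPLY branch still resolves to a single seek on the (userid, variable, id DESC) clustered index, so the plan shape is the same as with the hard-coded derived table.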
