Indexing simple query in a huge database

2023-02-08 03:47 问答作者：

I got a table contains nearly 850,000,000 rows.

The table has the following fields:

[ID] [bigint] IDENTITY(1,1) NOT NULL,
[D1] [int] NOT NULL,
[D2] [int] NOT NULL,
[D3] [int] NOT NULL,
[D4] [int] NOT NULL,
[D5] [int] NOT NULL,
[D6] [int] NOT NULL,
[D7] [int] NOT NULL,
[D8] [int] NOT NULL,
[D9] [int] NOT NULL,
[A] [int] NOT NULL,
[Hb] [bit] NOT NULL,

All my queries for this table are quite the same -

Select [D1-D9], [A] Where [Hb] = 0 AND [D1] <> x AND [D2] <> y AND [D3] = z,

etc....

Each query will ALWAYS query ALL [D1-D9] fields and always ask for [Hb] = 0

Example for a query:

SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A] 
  from [myTable] 
 WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2 
   AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0

How should I index this table for the fastest开发者_运维知识库 results?

Thanks a lot

The best you can do is have your index do equality checks first followed by a residual non-equality lookup. That is, = before <>.

Rearranging the WHERE clause:

WHERE
--Equality
D4 = 8 AND D5 = 2 AND D6 = 5 AND D7 = 5 AND D8 = 3 AND D9 = 4 AND A = 0 
--in the middle    
AND Hb = 0
--Non-Equality
D1 <> 8 AND D2 <> 2 AND D3 <> 5

So, first draft is this:

CREATE .. INDEX ... ON (D4, D5, D6, D7, D8, D9, A, Hb, D1, D2, D3)

The order of D4 to D9 should be based on selectivity. Highest numbers first. Hb should always go last in the equality columns because it's bit

SELECT
   COUNT(DISTINCT D4) AS D4COunt,
   COUNT(DISTINCT D5) AS D5COunt,
   COUNT(DISTINCT D6) AS D6COunt,
   COUNT(DISTINCT D7) AS D7COunt,
   COUNT(DISTINCT D8) AS D8COunt,
   COUNT(DISTINCT D9) AS D9COunt,
   COUNT(DISTINCT A) AS ACOunt
FROM
    Mytable

Finally, this can be clustered or non-clustered. If you have no other indexes or no FKs then I'd consider make this the clustered PK. Otherwise, just create a clustered surrogate key and make this index NONCLUSTERED

Edit:

An article that (hopefully) explains why column order matters for mulitple column indexes: Craig Freedman's Seek Predicates. And his Scans and Seeks too

Edit2:

I asked if the = before <> are on the same columns: it appeared "yes". OP's comment to this answer says "no" so everything I've said here is pointless

The answer from Damien_The_Unbeliever's suggested index intersections to try and get around this the equality/nonequality mix.

You may find (if the individual equality/inequality tests are different for the ten columns in each query) that the best you can do is build a narrow index on each column individually, and then hope that the optimizer will apply index intersection, where it will use the indexes on each column where it makes sense to do so.

Extending @gbn's answer.

For a table of this size, you definitely need an index which would cover all columns selected.

However, for each column you should decide whether you want it to be a key column or an included column in the index.

To do this, run this query:

SELECT  SUM(CASE D1 WHEN 8 THEN 0 ELSE 1 END) / COUNT(*) AS D1Card,
        SUM(CASE D2 WHEN 2 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D2) AS D2Card,
        SUM(CASE D3 WHEN 5 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D3) AS D3Card,
        SUM(CASE d4 WHEN 8 THEN 1 ELSE 0 END) / COUNT(DISTINCT D4) AS D4Card,
        SUM(CASE d5 WHEN 2 THEN 1 ELSE 0 END) / COUNT(DISTINCT D5) AS D5Card,
        SUM(CASE d6 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D6) AS D6Card,
        SUM(CASE d7 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D7) AS D7Card,
        SUM(CASE d8 WHEN 3 THEN 1 ELSE 0 END) / COUNT(DISTINCT D8) AS D8Card,
        SUM(CASE d9 WHEN 4 THEN 1 ELSE 0 END) / COUNT(DISTINCT D9) AS D9Card,
        SUM(CASE a WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT A) AS ACard,
        SUM(CASE Hb WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT Hb) AS HbCard
FROM    Mytable

You should create a list of the least selective columns (those with the highest values of *Card) which (together) comprise more than 25% of your records.

Say, the selectivity chart on the columns looks like this:

Column  Selectivity  Cumulative selectivity
D4      0.96         0.96
D8      0.87         0.84
D9      0.85         0.70
D7      0.72         0.51
D6      0.65         0.33 -- here
D5      0.20         0.07
A       0.02         0.00
Hb      0.01         0.00

This means that the conditions on columns d4, d8, d9, d7, d6 together match about 33% of your records.

In this case, there is no need to use them as key columns. You should create an index on the other, selective, columns and include the non-selective ones into the index.

CREATE INDEX ix_mytable_filter ON (Hb, A, D5) INLCUDE (D1, D2, D3, D4, D6, D7, D8, D9)

The columns with the non-equality filter always go to the INCLUDE section.

Note that it will only improve the current query, with the given values of the filters. If your filters are arbitrary, you would need to use all equality filtered columns as the keys of the index.

It may also be case that the conditions like [D1] <> 8 involve magic numbers, and there are few records for which this condition holds.

In this case, you can add a computed column into your table's definition:

ALTER TABLE mytable ADD d1_ne_8 AS CASE D1 WHEN 8 THEN 0 ELSE 1 END

and add this expression to the index (with regard to the rules above).

If you do this, you will have to use d1_ne_8 = 1 instead of d1 <> 8.

First of all queries with conditions like "where X <> 8" may make any index useless (it may depend on your actual DB engine.

It is safer to change this from

SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A] 
  from [myTable] 
 WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2 
   AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0

to something more like this:

SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A] 
  from [myTable] 
 WHERE ([D1] < 8 or [D1] > 8) 
       AND ([D2] < 2 or [D2] > 2) 
       AND ([D3] < 5 or [D3] > 5) 
       AND [D4] = 8 AND [D5] = 2  AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 
       AND [D9] = 4 AND [A] = 0 AND [Hb] = 0

If your algorithm is deterministic (i.e. A = f(d1, d2, d3...d9)) then your D columns combined with Hb constitute a key. Try creating a clustered composite index on all D columns and Hb, partitioning on Hb for a little speed boost. You may also consider dropping the ID field.

EDIT: Just realised I'd missed the <> conditions. As others have mentioned this makes things much harder. What you really want to use here is a bitmap index, but AFAIK SQL Server doesn't have them. You will probably need to rely on individual column indexes playing nicely together.

Basically, you should create compound index started by column with equality check. So, in your case it is naturally to use [Hb] as the first component since you stated that [Hb] will be checked with equality. The next elements of the index are [D*], followed by [A]

create index IXC_MyTable1 on Mytable(Hb, D1, D2, D3, D4, D5, D6, D7, D8, D9, A)

On the second thought, you can employ a partial indexing and let the db do a fast index scan (CMIIW) to the table to check other values. In this case, you should include Id as the last item of the index. For example:

create index IXC_MyTable__D123 on Mytable(Hb, D1, D2, D3, Id)
create index IXC_MyTable__D456 on Mytable(Hb, D4, D5, D6, Id)
create index IXC_MyTable__D789 on Mytable(Hb, D7, D8, D9, Id)

The query will use IXC_MyTable__D123 index when it uses equality checks on Hb, D1, D2, and D3; and so on.

继续阅读：data-structures database-design indexing sql sql-server

Indexing simple query in a huge database

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？