Indexing simple query in a huge database
I got a table contains nearly 850,000,000 rows.
The table has the following fields:
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[D1] [int] NOT NULL,
[D2] [int] NOT NULL,
[D3] [int] NOT NULL,
[D4] [int] NOT NULL,
[D5] [int] NOT NULL,
[D6] [int] NOT NULL,
[D7] [int] NOT NULL,
[D8] [int] NOT NULL,
[D9] [int] NOT NULL,
[A] [int] NOT NULL,
[Hb] [bit] NOT NULL,
All my queries for this table are quite the same -
Select [D1-D9], [A] Where [Hb] = 0 AND [D1] <> x AND [D2] <> y AND [D3] = z,
etc....
Each query will ALWAYS query ALL [D1-D9] fields and always ask for [Hb] = 0
Example for a query:
SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A]
from [myTable]
WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2
AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0
How should I index this table for the fastest开发者_运维知识库 results?
Thanks a lot
The best you can do is have your index do equality checks first followed by a residual non-equality lookup. That is, =
before <>
.
Rearranging the WHERE clause:
WHERE
--Equality
D4 = 8 AND D5 = 2 AND D6 = 5 AND D7 = 5 AND D8 = 3 AND D9 = 4 AND A = 0
--in the middle
AND Hb = 0
--Non-Equality
D1 <> 8 AND D2 <> 2 AND D3 <> 5
So, first draft is this:
CREATE .. INDEX ... ON (D4, D5, D6, D7, D8, D9, A, Hb, D1, D2, D3)
The order of D4
to D9
should be based on selectivity. Highest numbers first. Hb
should always go last in the equality columns because it's bit
SELECT
COUNT(DISTINCT D4) AS D4COunt,
COUNT(DISTINCT D5) AS D5COunt,
COUNT(DISTINCT D6) AS D6COunt,
COUNT(DISTINCT D7) AS D7COunt,
COUNT(DISTINCT D8) AS D8COunt,
COUNT(DISTINCT D9) AS D9COunt,
COUNT(DISTINCT A) AS ACOunt
FROM
Mytable
Finally, this can be clustered or non-clustered. If you have no other indexes or no FKs then I'd consider make this the clustered PK. Otherwise, just create a clustered surrogate key and make this index NONCLUSTERED
Edit:
An article that (hopefully) explains why column order matters for mulitple column indexes: Craig Freedman's Seek Predicates. And his Scans and Seeks too
Edit2:
I asked if the =
before <>
are on the same columns: it appeared "yes".
OP's comment to this answer says "no" so everything I've said here is pointless
The answer from Damien_The_Unbeliever's suggested index intersections to try and get around this the equality/nonequality mix.
You may find (if the individual equality/inequality tests are different for the ten columns in each query) that the best you can do is build a narrow index on each column individually, and then hope that the optimizer will apply index intersection, where it will use the indexes on each column where it makes sense to do so.
Extending @gbn's answer.
For a table of this size, you definitely need an index which would cover all columns selected.
However, for each column you should decide whether you want it to be a key column or an included column in the index.
To do this, run this query:
SELECT SUM(CASE D1 WHEN 8 THEN 0 ELSE 1 END) / COUNT(*) AS D1Card,
SUM(CASE D2 WHEN 2 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D2) AS D2Card,
SUM(CASE D3 WHEN 5 THEN 0 ELSE 1 END) / COUNT(*) / COUNT(DISTINCT D3) AS D3Card,
SUM(CASE d4 WHEN 8 THEN 1 ELSE 0 END) / COUNT(DISTINCT D4) AS D4Card,
SUM(CASE d5 WHEN 2 THEN 1 ELSE 0 END) / COUNT(DISTINCT D5) AS D5Card,
SUM(CASE d6 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D6) AS D6Card,
SUM(CASE d7 WHEN 5 THEN 1 ELSE 0 END) / COUNT(DISTINCT D7) AS D7Card,
SUM(CASE d8 WHEN 3 THEN 1 ELSE 0 END) / COUNT(DISTINCT D8) AS D8Card,
SUM(CASE d9 WHEN 4 THEN 1 ELSE 0 END) / COUNT(DISTINCT D9) AS D9Card,
SUM(CASE a WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT A) AS ACard,
SUM(CASE Hb WHEN 0 THEN 1 ELSE 0 END) / COUNT(DISTINCT Hb) AS HbCard
FROM Mytable
You should create a list of the least selective columns (those with the highest values of *Card
) which (together) comprise more than 25%
of your records.
Say, the selectivity chart on the columns looks like this:
Column Selectivity Cumulative selectivity
D4 0.96 0.96
D8 0.87 0.84
D9 0.85 0.70
D7 0.72 0.51
D6 0.65 0.33 -- here
D5 0.20 0.07
A 0.02 0.00
Hb 0.01 0.00
This means that the conditions on columns d4, d8, d9, d7, d6
together match about 33%
of your records.
In this case, there is no need to use them as key columns. You should create an index on the other, selective, columns and include the non-selective ones into the index.
CREATE INDEX ix_mytable_filter ON (Hb, A, D5) INLCUDE (D1, D2, D3, D4, D6, D7, D8, D9)
The columns with the non-equality filter always go to the INCLUDE
section.
Note that it will only improve the current query, with the given values of the filters. If your filters are arbitrary, you would need to use all equality filtered columns as the keys of the index.
It may also be case that the conditions like [D1] <> 8
involve magic numbers, and there are few records for which this condition holds.
In this case, you can add a computed column into your table's definition:
ALTER TABLE mytable ADD d1_ne_8 AS CASE D1 WHEN 8 THEN 0 ELSE 1 END
and add this expression to the index (with regard to the rules above).
If you do this, you will have to use d1_ne_8 = 1
instead of d1 <> 8
.
First of all queries with conditions like "where X <> 8" may make any index useless (it may depend on your actual DB engine.
It is safer to change this from
SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A]
from [myTable]
WHERE [D1] <> 8 AND [D2] <> 2 AND [D3] <> 5 AND [D4] = 8 AND [D5] = 2
AND [D6] = 5 AND [D7] = 5 AND [D8] = 3 AND [D9] = 4 AND [A] = 0 AND [Hb] = 0
to something more like this:
SELECT [D1], [D2], [D3], [D4], [D5], [D6],[D7], [D8],[D9], [A]
from [myTable]
WHERE ([D1] < 8 or [D1] > 8)
AND ([D2] < 2 or [D2] > 2)
AND ([D3] < 5 or [D3] > 5)
AND [D4] = 8 AND [D5] = 2 AND [D6] = 5 AND [D7] = 5 AND [D8] = 3
AND [D9] = 4 AND [A] = 0 AND [Hb] = 0
If your algorithm is deterministic (i.e. A = f(d1, d2, d3...d9)) then your D columns combined with Hb constitute a key. Try creating a clustered composite index on all D columns and Hb, partitioning on Hb for a little speed boost. You may also consider dropping the ID field.
EDIT: Just realised I'd missed the <> conditions. As others have mentioned this makes things much harder. What you really want to use here is a bitmap index, but AFAIK SQL Server doesn't have them. You will probably need to rely on individual column indexes playing nicely together.
Basically, you should create compound index started by column with equality check. So, in your case it is naturally to use [Hb] as the first component since you stated that [Hb] will be checked with equality. The next elements of the index are [D*], followed by [A]
create index IXC_MyTable1 on Mytable(Hb, D1, D2, D3, D4, D5, D6, D7, D8, D9, A)
On the second thought, you can employ a partial indexing and let the db do a fast index scan (CMIIW) to the table to check other values. In this case, you should include Id as the last item of the index. For example:
create index IXC_MyTable__D123 on Mytable(Hb, D1, D2, D3, Id)
create index IXC_MyTable__D456 on Mytable(Hb, D4, D5, D6, Id)
create index IXC_MyTable__D789 on Mytable(Hb, D7, D8, D9, Id)
The query will use IXC_MyTable__D123 index when it uses equality checks on Hb, D1, D2, and D3; and so on.
精彩评论