SQL indexes for "not equal" searches
The SQL index allows to find quickly a string which matches my query. Now, I have to search in a big table the strings which do not match. Of course, the normal index does not help and I have to do a slow sequential scan:
essais=> \d phone_idx
Index "public.phone_idx"
Column | Type
--------+------
phone | text
btree, for table "public.phonespersons"
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone = '+33 1234567';
QUERY PLAN
-------------------------------------------------------------------------------
Index Scan using phone_idx on phonespersons (cost=0.00开发者_如何转开发..8.41 rows=1 width=4)
Index Cond: (phone = '+33 1234567'::text)
(2 rows)
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone != '+33 1234567';
QUERY PLAN
----------------------------------------------------------------------
Seq Scan on phonespersons (cost=0.00..18621.00 rows=999999 width=4)
Filter: (phone <> '+33 1234567'::text)
(2 rows)
I understand (see Mark Byers' very good explanations) that PostgreSQL can decide not to use an index when it sees that a sequential scan would be faster (for instance if almost all the tuples match). But, here, "not equal" searches are really slower.
Any way to make these "is not equal to" searches faster?
Here is another example, to address Mark Byers' excellent remarks. The index is used for the '=' query (which returns the vast majority of tuples) but not for the '!=' query:
essais=> \d tld_idx
Index "public.tld_idx"
Column | Type
-----------------+------
pg_expression_1 | text
btree, for table "public.emailspersons"
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) = 'fr';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using tld_idx on emailspersons (cost=0.25..4010.79 rows=97033 width=4) (actual time=0.137..261.123 rows=97110 loops=1)
Index Cond: (tld(email) = 'fr'::text)
Total runtime: 444.800 ms
(3 rows)
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) != 'fr';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on emailspersons (cost=0.00..27129.00 rows=2967 width=4) (actual time=1.004..1031.224 rows=2890 loops=1)
Filter: (tld(email) <> 'fr'::text)
Total runtime: 1037.278 ms
(3 rows)
DBMS is PostgreSQL 8.3 (but I can upgrade to 8.4).
Possibly it would help to write:
SELECT person FROM PhonesPersons WHERE phone < '+33 1234567'
UNION ALL
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
or simply
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
OR phone < '+33 1234567'
PostgreSQL should be able to determine that the selectivity of the range operation is very high and to consider using an index for it.
I don't think it can use an index directly to satisfy a not-equals predicate, although it would be nice if it could try re-writing the not-equals as above (if it helps) during planning. If it works, suggest it to the developers ;)
Rationale: searching an index for all values not equal to a certain one requires scanning the full index. By contrast, searching for all elements less than a certain key means finding the greatest non-matching item in the tree and scanning backwards. Similarly, searching for all elements greater than a certain key in the opposite direction. These operations are easy to fulfill using b-tree structures. Also, the statistics that PostgreSQL collects should be able to point out that "+33 1234567" is a known frequent value: by removing the frequency of those and nulls from 1, we have the proportion of rows left to select: the histogram bounds will indicate whether those are skewed to one side or not. But if the exclusion of nulls and that frequent value pushes the proportion of rows remaining low enough (Istr about 20%), an index scan should be appropriate. Check the stats for the column in pg_stats to see what proportion it's actually calculated.
Update: I tried this on a local table with a vaguely similar distribution, and both forms of the above produced something other than a plain seq scan. The latter (using "OR") was a bitmap scan that may actually devolve to just being a seq scan if the bias towards your common value is particularly extreme... although the planner can see that, I don't think it will automatically rewrite to an "Append(Index Scan,Index Scan)" internally. Turning "enable_bitmapscan" off just made it revert to a seq scan.
PS: indexing a text column and using the inequality operators can be an issue, if your database location is not C. You may need to add an extra index that uses text_pattern_ops or varchar_pattern_ops; this is similar to the problem of indexing for column LIKE 'prefix%'
predicates.
Alternative: you could create a partial index:
CREATE INDEX PhonesPersonsOthers ON PhonesPersons(phone) WHERE phone <> '+33 1234567'
this will make the <>
-using select statement just scan through that partial index: since it excludes most of the entries in the table, it should be small.
The database is able use the index for this query, but it chooses not to because it would be slower. Update: This is not quite right: you have to rewrite the query slightly. See Araqnid's answer.
Your where clause selects almost all rows in your table (rows = 999999). The database can see that a table scan would be faster in this case and therefore ignores the index. It is faster because the column person
is not in your index so it would have to make two lookups for each row, once in the index to check the WHERE clause, and then again in the main table to fetch the column person
.
If you had a different type of data where there were most values were foo
and just a few were bar
and you said WHERE col <> 'foo'
then it probably would use the index.
Any way to make these "is not equal to" searches faster?
Any query that selects almost 1 million rows is going to be slow. Try adding a limit clause.
精彩评论