Tuning subquery in postgres
I have discov开发者_开发知识库ered some suspect data in a database. I am attempting to determine if a certain field, lastname, is correct. I have come up with the following query in postgres:
SELECT members."memberID",
members.lastname
FROM members
WHERE members."memberID" NOT IN (SELECT members."memberID"
FROM members
WHERE members.lastname ~* '[a-zA-z]+([-][a-zA-Z]+)*');
The subquery currently matches against normal names and names with a hypen. The parent query should display the members who don't match that pattern. Currently the query takes an incredible amount of time to run (i've never seen it complete). I am not sure why it takes so long or how to improve it.
NOT EXISTS
SELECT m."memberID",
m.lastname
FROM MEMBERS m
WHERE NOT EXISTS (SELECT NULL
FROM MEMBERS b
WHERE b.lastname ~* '[a-zA-z]+([-][a-zA-Z]+)*'
AND b."memberID" = m."memberID");
LEFT JOIN / IS NULL
SELECT m."memberID",
m.lastname
FROM MEMBERS m
LEFT JOIN MEMBERS b ON b."memberID" = m."memberID"
AND b.lastname ~* '[a-zA-z]+([-][a-zA-Z]+)*'
WHERE b."memberID" IS NULL
Summary
Quote:
PostgreSQL treats
LEFT JOIN
andNOT EXISTS
equally, using same execution plan for both of them (namely a Hash Anti Join for the example above).As for
NOT IN
, which is semantically different since its logic is trivalent and it can return NULL, PostgreSQL tries to take this into account and limits itself to using a filter against a subplan (a hashed subplan for a hashable resultset like in example above).Since it need to search the hash table for each missing value twice (first time to find the value, second time to find a NULL), this method is a little less efficient.
A plain subplan, which the optimizer can resort to any time it decides the list will not fit into the memory, is very inefficient and the queries that have possibility of using it should be avoided like a plague.
That’s why in PostgreSQL 8.4 one should always use
LEFT JOIN / IS NULL
orNOT EXISTS
rather thanNOT IN
to find the missing values.
Addendum
But as Andrew Lazarus points out, if there are no duplicates of memberid in the MEMBERS
table, the query only needs to be:
SELECT m."memberID",
m.lastname
FROM MEMBERS m
WHERE b.lastname ~* '[a-zA-z]+([-][a-zA-Z]+)*'
I like OMG Ponies answer, but if memberID
is unique (i.e., PK), you can just drop the subquery altogether.
SELECT members."memberID",
members.lastname
FROM members
WHERE members.lastname !~ '[a-zA-Z]+([-][a-zA-Z]+)*';
(I deleted the case-insensitive operator since the regexp covers both cases.)
精彩评论