开发者

Tuning subquery in postgres

I have discov开发者_开发知识库ered some suspect data in a database. I am attempting to determine if a certain field, lastname, is correct. I have come up with the following query in postgres:

SELECT members."memberID", 
       members.lastname 
  FROM members 
 WHERE members."memberID" NOT IN (SELECT members."memberID" 
                                    FROM members 
                                   WHERE members.lastname ~* '[a-zA-z]+([-][a-zA-Z]+)*');

The subquery currently matches against normal names and names with a hypen. The parent query should display the members who don't match that pattern. Currently the query takes an incredible amount of time to run (i've never seen it complete). I am not sure why it takes so long or how to improve it.


NOT EXISTS

SELECT m."memberID", 
       m.lastname 
  FROM MEMBERS m 
 WHERE NOT EXISTS (SELECT NULL
                     FROM MEMBERS b
                    WHERE b.lastname ~* '[a-zA-z]+([-][a-zA-Z]+)*'
                      AND b."memberID" = m."memberID");

LEFT JOIN / IS NULL

   SELECT m."memberID", 
          m.lastname 
     FROM MEMBERS m 
LEFT JOIN MEMBERS b ON b."memberID" = m."memberID"
                   AND b.lastname ~* '[a-zA-z]+([-][a-zA-Z]+)*'
    WHERE b."memberID" IS NULL

Summary

Quote:

PostgreSQL treats LEFT JOIN and NOT EXISTS equally, using same execution plan for both of them (namely a Hash Anti Join for the example above).

As for NOT IN, which is semantically different since its logic is trivalent and it can return NULL, PostgreSQL tries to take this into account and limits itself to using a filter against a subplan (a hashed subplan for a hashable resultset like in example above).

Since it need to search the hash table for each missing value twice (first time to find the value, second time to find a NULL), this method is a little less efficient.

A plain subplan, which the optimizer can resort to any time it decides the list will not fit into the memory, is very inefficient and the queries that have possibility of using it should be avoided like a plague.

That’s why in PostgreSQL 8.4 one should always use LEFT JOIN / IS NULL or NOT EXISTS rather than NOT IN to find the missing values.

Addendum

But as Andrew Lazarus points out, if there are no duplicates of memberid in the MEMBERS table, the query only needs to be:

SELECT m."memberID", 
       m.lastname 
  FROM MEMBERS m 
 WHERE b.lastname ~* '[a-zA-z]+([-][a-zA-Z]+)*'


I like OMG Ponies answer, but if memberID is unique (i.e., PK), you can just drop the subquery altogether.

SELECT members."memberID", 
       members.lastname 
  FROM members 
 WHERE members.lastname !~ '[a-zA-Z]+([-][a-zA-Z]+)*';

(I deleted the case-insensitive operator since the regexp covers both cases.)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜