Why does the SQL cost explode with a simple OR?
I have the following statement to find unambiguous names in my data (~1 Million entries):
select Prename, Surname from person p1
where Prename is not null and Surname is not null
and not exists (
select * from person p2 where (p1.Surname = p2.Surname OR p1.Surname = p2.Altname)
and p2.Prename LIKE CONCAT(CONCAT('%', p1.Prename), '%') and p2.id <> p1.id
) and inv_date IS NULL
Oracle shows a huge cost of 1477315000 and execution does not end after 5 minutes. Simply splitting the OR into its own NOT EXISTS clause boosts performance to 0.5 s and the cost to 45000:
select Prename, Surname from person p1
where Prename is not null and Surname is not null
and not exists (
select * from person p2 where p1.Surname = p2.Surname and
p2.Prename LIKE CONCAT(CONCAT('%', p1.Prename), '%') and p2.id <> p1.id
) and not exists (
select * from person p2 where p1.Surname = p2.Altname and
p2.Prename LIKE CONCAT(CONCAT('%', p1.Prename), '%') and p2.id <> p1.id
) and inv_date IS NULL
My question is not how to tune this to the optimum, as it is only a seldom-executed query, and I know the CONCAT defeats any index, but I just wonder where this high cost comes from. Both queries seem semantically equivalent to me.
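The claimed equivalence can be checked on paper. Since an EXISTS over a disjunction holds exactly when EXISTS holds for at least one disjunct, De Morgan gives:

    NOT EXISTS (x : A(x) OR B(x))
      = NOT ( EXISTS (x : A(x)) OR EXISTS (x : B(x)) )
      = NOT EXISTS (x : A(x)) AND NOT EXISTS (x : B(x))

With A as the Surname match and B as the Altname match (each conjoined with the shared Prename and id predicates), this is precisely the rewrite from the first query to the second, so the two queries do return the same rows.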
The answer is in the EXPLAIN PLAN for your queries. They may be semantically equivalent, but the execution plans behind the scenes are vastly different.
EXISTS operates differently from a JOIN, and essentially your OR filter is what ties the two table instances together.
No JOIN occurs in the second query, as each subquery only correlates on a single equality.
The results of your two queries may be semantically equivalent, but the execution is not operationally equivalent. Your second example never uses OR to combine predicates; every predicate is combined with AND.
The performance is better because predicates combined with AND can be skipped as soon as one of them evaluates to false, whereas ORed predicates have to keep being evaluated until one of them evaluates to true. More significantly, an OR inside a correlated NOT EXISTS tends to prevent the optimizer from unnesting the subquery into an efficient anti-join, so it falls back to re-executing the subquery for every outer row; once the OR is split, each NOT EXISTS can be unnested on its own, which is what the huge difference in the cost figures reflects.
I would consider testing the query rewritten as below. Do a direct self-join from one table alias to the other on the criteria that qualify what IS considered a match, then, in the WHERE clause, throw a row out if it did come up with a match:

select
      p1.Prename,
      p1.Surname
   from
      person p1
         left join person p2
            on  p1.ID <> p2.ID
            and (  p1.Surname = p2.Surname
                or p1.SurName = p2.AltName )
            and p2.PreName like concat( concat( '%', p1.Prename ), '%' )
   where
          p1.PreName is not null
      and p1.SurName is not null
      and p1.Inv_date is null
      and p2.id is null

Note that this HAS to be a LEFT OUTER join. The join prequalifies the records that DO have a similar name via the self-join, while a name that does NOT have a similar name survives the join anyway, with every p2 column coming back NULL. Now the WHERE clause kicks in: a person on the left that HAS a person on the right is an ambiguous name, and the predicate "p2.ID IS NULL" throws exactly those matched rows out. What is left are the unmatched names, the same result as NOT EXISTS; this is the classic anti-join pattern. (With a plain inner join, p2.id could never be NULL, so the query would return no rows at all.)
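The three formulations can be sanity-checked against each other on toy data. A minimal sketch using Python's sqlite3 standard library as a stand-in for Oracle; the sample rows are made up for illustration, and '%' || p1.Prename || '%' replaces the nested CONCAT (the || operator is valid in Oracle too):

```python
# Check that the OR version, the split NOT EXISTS version, and the
# LEFT JOIN anti-join version all return the same "unambiguous" names.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table person (
        id integer primary key,
        Prename text, Surname text, Altname text, inv_date text
    );
    insert into person values
        (1, 'Anna', 'Smith',  null,    null),
        (2, 'Bob',  'Jones',  null,    null),
        (3, 'Bob',  'Jones',  null,    null),   -- ambiguous with id 2
        (4, 'Carl', 'Miller', 'Jones', null);
""")

q_or = """
    select Prename, Surname from person p1
    where Prename is not null and Surname is not null
    and not exists (
        select * from person p2
        where (p1.Surname = p2.Surname or p1.Surname = p2.Altname)
        and p2.Prename like '%' || p1.Prename || '%' and p2.id <> p1.id
    ) and inv_date is null
    order by Prename"""

q_split = """
    select Prename, Surname from person p1
    where Prename is not null and Surname is not null
    and not exists (
        select * from person p2 where p1.Surname = p2.Surname
        and p2.Prename like '%' || p1.Prename || '%' and p2.id <> p1.id
    ) and not exists (
        select * from person p2 where p1.Surname = p2.Altname
        and p2.Prename like '%' || p1.Prename || '%' and p2.id <> p1.id
    ) and inv_date is null
    order by Prename"""

q_anti = """
    select p1.Prename, p1.Surname from person p1
    left join person p2
        on p1.id <> p2.id
        and (p1.Surname = p2.Surname or p1.Surname = p2.Altname)
        and p2.Prename like '%' || p1.Prename || '%'
    where p1.Prename is not null and p1.Surname is not null
    and p1.inv_date is null and p2.id is null
    order by p1.Prename"""

r_or, r_split, r_anti = (conn.execute(q).fetchall()
                         for q in (q_or, q_split, q_anti))
print(r_or)                       # the unambiguous names
assert r_or == r_split == r_anti  # all three formulations agree
```

Here the two 'Bob Jones' rows knock each other out, while 'Carl Miller' survives because his Altname only matters when someone else's Surname is 'Jones' AND their Prename contains 'Carl'. This only demonstrates that the results agree; the cost difference the question asks about is Oracle-specific and has to be observed in Oracle's own EXPLAIN PLAN.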