Why is my SQL 'NOT IN' clause producing different results from 'NOT EXISTS'
I have two SQL queries producing different results when I would expect them to produce the same result. I am trying to find the number of events that do not have a corresponding location. All locations have an event but events can also link to non-location records.
The following query produces a count of 16244, the correct value.
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE NOT EXISTS
(SELECT * FROM locations AS l WHERE l.event_id = e.event_id)
The following query produces a count of 0.
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE e.event_id NOT IN (SELECT l.event_id FROM locations AS l)
The following SQL does so开发者_如何学JAVAme summaries of the data set
SELECT 'Event Count',
COUNT(DISTINCT event_id)
FROM events
UNION ALL
SELECT 'Locations Count',
COUNT(DISTINCT event_id)
FROM locations
UNION ALL
SELECT 'Event+Location Count',
COUNT(DISTINCT l.event_id)
FROM locations AS l JOIN events AS e ON l.event_Id = e.event_id
And returns the following results
Event Count 139599 Locations Count 123355 Event+Location Count 123355
Can anyone shed any light on why the 2 initial queries do not produce the same figure.
You have a NULL in the subquery SELECT l.event_id FROM locations AS l
so NOT IN will always evaluate to unknown and return 0 results
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE e.event_id NOT IN (SELECT l.event_id FROM locations AS l)
The reason for this behaviour can be seen from the below example.
'x' NOT IN (NULL,'a','b')
≡ 'x' <> NULL and 'x' <> 'a' and 'x' <> 'b'
≡ Unknown and True and True
≡ Unknown
The NOT IN
form works differently for NULLs. The presence of a single NULL
will cause the entire statement to fail, thus returning no results.
So you have at least one event_id
in locations
that is NULL
.
Also, your query might be better written as a join:
SELECT
COUNT(DISTINCT e.event_id)
FROM
events AS e
LEFT JOIN locations AS l ON e.event_id = l.event_id
WHERE
l.event_id IS NULL
[UPDATE: apparently, the NOT EXISTS
version is faster.]
In and Exists are processed very very differently.
Select * from T1 where x in ( select y from T2 )
is typically processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table -- typically.
As opposed to
select * from t1 where exists ( select null from t2 where y = x )
That is processed more like:
for x in ( select * from t1 )
loop
if ( exists ( select null from t2 where y = x.x )
then
OUTPUT THE RECORD
end if
end loop
It always results in a full scan of T1 whereas the first query can make use of an index on T1(x).
精彩评论