Remove all the elements that occur in one list from another
Let's say I have two lists, l1 and l2. I want to perform l1 - l2, which returns all elements of l1 not in l2.
I can think of a naive loop approach to doing this, but that is going to be really inefficient. What is a pythonic and efficient way of doing this?
As an example, if I have l1 = [1,2,6,8] and l2 = [2,3,5,8], l1 - l2 should return [1,6].
Python has a language feature called list comprehensions that is perfectly suited to making this sort of thing extremely easy. The following statement does exactly what you want and stores the result in l3:
l3 = [x for x in l1 if x not in l2]
l3 will contain [1, 6].
One way is to use sets:
>>> set([1,2,6,8]) - set([2,3,5,8])
set([1, 6])
Note, however, that sets do not preserve the order of elements, and cause any duplicated elements to be removed. The elements also need to be hashable. If these restrictions are tolerable, this may often be the simplest and highest performance option.
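To make those caveats concrete, here is a small sketch comparing the set difference with the list comprehension. The sample lists are made up for illustration (unordered, with a duplicate), not taken from the question:

```python
l1 = [8, 1, 6, 1, 2]   # illustrative data: unordered, with a duplicate 1
l2 = [2, 3, 5, 8]

# List comprehension: keeps the original order and the duplicates.
kept = [x for x in l1 if x not in l2]
print(kept)  # [1, 6, 1]

# Set difference: duplicates collapse and the order is not guaranteed,
# so the result is sorted here only to get a stable printout.
diff = set(l1) - set(l2)
print(sorted(diff))  # [1, 6]

# Unhashable elements (such as lists) cannot be put into a set at all.
try:
    set([[1, 2], [3]])
    unhashable_ok = True
except TypeError:
    unhashable_ok = False
print(unhashable_ok)  # False
```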
Performance Comparisons
The following compares the performance of all the answers mentioned here on Python 3.9.1 and Python 2.7.16.
Python 3.9.1
Answers are listed in order of performance, fastest first:
- Arkku's set difference using the subtraction "-" operator (91.3 nsec per loop)
$ python3 -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1 - l2"
5000000 loops, best of 5: 91.3 nsec per loop
- Moinuddin Quadri's set().difference() (133 nsec per loop)
$ python3 -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1.difference(l2)"
2000000 loops, best of 5: 133 nsec per loop
- Moinuddin Quadri's list comprehension with set-based lookup (366 nsec per loop)
$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "[x for x in l1 if x not in l2]"
1000000 loops, best of 5: 366 nsec per loop
- Donut's list comprehension on a plain list (489 nsec per loop)
$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "[x for x in l1 if x not in l2]"
500000 loops, best of 5: 489 nsec per loop
- Daniel Pryden's generator expression with set-based lookup, converted to list (583 nsec per loop). The result is explicitly converted to list to get the final object as a list, as requested by the OP. If the generator expression is replaced with a list comprehension, it becomes the same as Moinuddin Quadri's list comprehension with set-based lookup.
$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(x for x in l1 if x not in l2)"
500000 loops, best of 5: 583 nsec per loop
- Moinuddin Quadri's filter(), explicitly converted to list because in Python 3.x it returns an iterator (681 nsec per loop)
$ python3 -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filter(lambda x: x not in l2, l1))"
500000 loops, best of 5: 681 nsec per loop
- Akshay Hazari's combination of functools.reduce and filter (3.36 usec per loop). The result is explicitly converted to list because filter returns an iterator in Python 3.x. Also, reduce has to be imported from functools in Python 3.x.
$ python3 -m timeit "from functools import reduce; l1 = [1,2,6,8]; l2 = [2,3,5,8];" "list(reduce(lambda x,y : filter(lambda z: z!=y,x) ,l1,l2))"
100000 loops, best of 5: 3.36 usec per loop
Python 2.7.16
Answers are listed in order of performance, fastest first:
- Arkku's set difference using the subtraction "-" operator (0.0783 usec per loop)
$ python -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1 - l2"
10000000 loops, best of 3: 0.0783 usec per loop
- Moinuddin Quadri's set().difference() (0.117 usec per loop)
$ python -m timeit -s "l1 = set([1,2,6,8]); l2 = set([2,3,5,8]);" "l1.difference(l2)"
10000000 loops, best of 3: 0.117 usec per loop
- Moinuddin Quadri's list comprehension with set-based lookup (0.246 usec per loop)
$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "[x for x in l1 if x not in l2]"
1000000 loops, best of 3: 0.246 usec per loop
- Donut's list comprehension on a plain list (0.372 usec per loop)
$ python -m timeit -s "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "[x for x in l1 if x not in l2]"
1000000 loops, best of 3: 0.372 usec per loop
- Moinuddin Quadri's filter() (0.593 usec per loop)
$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "filter(lambda x: x not in l2, l1)"
1000000 loops, best of 3: 0.593 usec per loop
- Daniel Pryden's generator expression with set-based lookup, converted to list (0.964 usec per loop). The result is explicitly converted to list to get the final object as a list, as requested by the OP. If the generator expression is replaced with a list comprehension, it becomes the same as Moinuddin Quadri's list comprehension with set-based lookup.
$ python -m timeit -s "l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(x for x in l1 if x not in l2)"
1000000 loops, best of 3: 0.964 usec per loop
- Akshay Hazari's combination of functools.reduce and filter (2.78 usec per loop)
$ python -m timeit "l1 = [1,2,6,8]; l2 = [2,3,5,8];" "reduce(lambda x,y : filter(lambda z: z!=y,x) ,l1,l2)"
100000 loops, best of 3: 2.78 usec per loop
Expanding on Donut's answer and the other answers here, you can get even better results by using a generator expression instead of a list comprehension, and by using a set data structure (since the in operator is O(n) on a list but O(1) on average on a set).
So here's a function that would work for you:
def filter_list(full_list, excludes):
    s = set(excludes)
    return (x for x in full_list if x not in s)
The result will be an iterable that will lazily fetch the filtered list. If you need a real list object (e.g. if you need to do a len() on the result), then you can easily build a list like so:
filtered_list = list(filter_list(full_list, excludes))
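To illustrate what "lazily" buys you, here is a minimal sketch that feeds the same filter_list function an endless iterator. The use of itertools.count and islice is my own choice for the demonstration; an eager list comprehension would never finish on this input:

```python
from itertools import count, islice

def filter_list(full_list, excludes):
    s = set(excludes)  # built once, when the function is called
    return (x for x in full_list if x not in s)

# count() yields 0, 1, 2, ... forever; only the items we ask for are computed.
lazy = filter_list(count(), [1, 3])
first_five = list(islice(lazy, 5))
print(first_five)  # [0, 2, 4, 5, 6]
```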
Use the Python set type. That would be the most Pythonic. :)
Also, since it's native, it should be the most optimized method too.
See:
http://docs.python.org/library/stdtypes.html#set
http://docs.python.org/library/sets.htm (for older python)
# Using Python 2.7 set literal format.
# Otherwise, use: l1 = set([1,2,6,8])
#
l1 = {1,2,6,8}
l2 = {2,3,5,8}
l3 = l1 - l2
Use a set comprehension {x for x in l2} or set(l2) to build a set, then use a list comprehension to get the list:
l2set = set(l2)
l3 = [x for x in l1 if x not in l2set]
benchmark test code:
import time
l1 = list(range(1000*10 * 3))
l2 = list(range(1000*10 * 2))
l2set = {x for x in l2}
tic = time.time()
l3 = [x for x in l1 if x not in l2set]
toc = time.time()
diffset = toc-tic
print(diffset)
tic = time.time()
l3 = [x for x in l1 if x not in l2]
toc = time.time()
difflist = toc-tic
print(difflist)
print("speedup %fx"%(difflist/diffset))
benchmark test result:
0.0015058517456054688
3.968189239501953
speedup 2635.179227x
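If you want to repeat this kind of measurement, the timeit module is usually more reliable than subtracting time.time() values, since it repeats the statement and lets you take the best run. The following is only a sketch: the list sizes are deliberately smaller than above so it finishes quickly, and the helper names with_set and with_list are made up for this example. Actual numbers will vary by machine.

```python
import timeit

l1 = list(range(3000))   # smaller than the benchmark above so it runs quickly
l2 = list(range(2000))
l2set = set(l2)

def with_set():
    return [x for x in l1 if x not in l2set]

def with_list():
    return [x for x in l1 if x not in l2]

# Both variants must agree before their timings are worth comparing.
assert with_set() == with_list()

# Take the best of 3 repeats, each running the function 3 times.
t_set = min(timeit.repeat(with_set, number=3, repeat=3))
t_list = min(timeit.repeat(with_list, number=3, repeat=3))
print("set lookup:  %.6f s" % t_set)
print("list lookup: %.6f s" % t_list)
```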
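Alternate solution (on Python 3, reduce has to be imported from functools, and filter returns an iterator, so the result is wrapped in list()):
from functools import reduce
list(reduce(lambda x, y: filter(lambda z: z != y, x), [2,3,5,8], [1,2,6,8]))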
Using set.difference():
You can use set.difference() to get a new set with the elements of the set that are not in the others, i.e. set(A).difference(B) will return a set with the items present in A but not in B. For example:
>>> set([1,2,6,8]).difference([2,3,5,8])
{1, 6}
It is a functional approach to get the set difference mentioned in Arkku's answer (which uses the arithmetic subtraction - operator for set difference).
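One practical difference between the two spellings, shown as a small sketch: difference() accepts any iterable (and several at once), while the - operator needs a set on both sides. The variable names d1, d2 and operator_accepts_list are just for this illustration.

```python
l1 = [1, 2, 6, 8]
l2 = [2, 3, 5, 8]

# difference() accepts any iterable, and more than one at a time.
d1 = set(l1).difference(l2)
d2 = set(l1).difference(l2, [6])
print(sorted(d1))  # [1, 6]
print(sorted(d2))  # [1]

# The - operator only works when both operands are sets.
try:
    set(l1) - l2
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False
print(operator_accepts_list)  # False
```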
Since sets are unordered, you'll lose the ordering of elements from the initial list. (Continue reading the next section if you want to maintain the ordering of elements.)
Using list comprehension with set-based lookup
If you want to maintain the ordering from the initial list, then Donut's list comprehension based answer will do the trick. However, you can get better performance from the accepted answer by using a set internally for checking whether an element is present in the other list. For example:
l1, l2 = [1,2,6,8], [2,3,5,8]
s2 = set(l2) # Type-cast `l2` to `set`
l3 = [x for x in l1 if x not in s2]
# ^ Doing membership checking on `set` s2
If you are interested in knowing why membership checking is faster in a set than in a list, please read this: What makes sets faster than lists?
Using filter() and a lambda expression
Here's another alternative using filter() with a lambda expression. Adding it here just for reference, but it is not performance efficient:
>>> l1 = [1,2,6,8]
>>> l2 = set([2,3,5,8])
# v `filter` returns an iterator object. Here I'm type-casting
# v it to `list` in order to display the resultant value
>>> list(filter(lambda x: x not in l2, l1))
[1, 6]
Using filterfalse without a lambda expression
When using functions like filter, or filterfalse and similar from itertools, you can usually save performance by avoiding lambda expressions and using already existing functions. Instances of list and set define a __contains__ method to use for containment checks. The in operator calls this method under the hood, so x in l2 can be replaced by l2.__contains__(x). Usually this replacement is not really prettier, but in this specific case it allows us to gain better performance than using a lambda expression, when used in combination with filterfalse:
>>> from itertools import filterfalse
>>> l1 = [1, 2, 6, 8]
>>> l2 = [2, 3, 5, 8]
>>> list(filterfalse(l2.__contains__, l1))
[1, 6]
filterfalse creates an iterator yielding all elements for which l2.__contains__ returns false.
Sets have a faster implementation of __contains__, so even better is:
>>> from itertools import filterfalse
>>> l1 = [1, 2, 6, 8]
>>> l2 = set([2, 3, 5, 8])
>>> list(filterfalse(l2.__contains__, l1))
[1, 6]
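As a quick sanity check of the claim that the in operator and __contains__ are interchangeable here, the following sketch compares the lambda form and the bound-method form side by side. The names s2, with_lambda and with_method are only for this illustration.

```python
from itertools import filterfalse

l1 = [1, 2, 6, 8]
s2 = {2, 3, 5, 8}

# `x in s2` and `s2.__contains__(x)` give the same answer for every element.
assert all((x in s2) == s2.__contains__(x) for x in l1)

# Passing the bound method avoids one Python-level function call per element.
with_lambda = list(filterfalse(lambda x: x in s2, l1))
with_method = list(filterfalse(s2.__contains__, l1))
print(with_lambda, with_method)  # [1, 6] [1, 6]
```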
Performance
Using list:
$ python3 -m timeit -s "from itertools import filterfalse; l1 = [1,2,6,8]; l2 = [2,3,5,8];" "list(filterfalse(l2.__contains__, l1))"
500000 loops, best of 5: 522 nsec per loop
Using set:
$ python3 -m timeit -s "from itertools import filterfalse; l1 = [1,2,6,8]; l2 = set([2,3,5,8]);" "list(filterfalse(l2.__contains__, l1))"
1000000 loops, best of 5: 359 nsec per loop
The set approach is the best if you want that behavior. If you do not want to remove every instance of an element of l1 that appears only once in l2, the set operations will give incorrect results. Suppose you have repeated elements in l1, and possibly in l2 as well, and want an actual difference of the two lists, l1 - l2, that removes one occurrence per match while maintaining the order of the remaining elements:
l1 = [1, 2, 3, 4, 5, 5, 6, 5, 5, 2]
l2 = [1, 2, 2, 5]
for item in l2:
    if item in l1:
        l1.remove(item)  # removes only the first matching occurrence
print(l1) # [3, 4, 5, 6, 5, 5]
- Be careful: this will be significantly slower than the set operations, so use it only if your use case needs it.
- If you don't want to modify the original list, create a copy of the list first.
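If the slowdown matters, one possible alternative (a sketch, not part of the answer above) is to count the pending removals with collections.Counter and make a single pass over l1. The function name list_difference is hypothetical, and the elements are assumed to be hashable. It does not modify the input lists.

```python
from collections import Counter

def list_difference(l1, l2):
    """Drop one occurrence from l1 per occurrence in l2, keeping order."""
    to_remove = Counter(l2)
    result = []
    for item in l1:
        if to_remove[item] > 0:
            to_remove[item] -= 1   # consume one pending removal
        else:
            result.append(item)
    return result

l1 = [1, 2, 3, 4, 5, 5, 6, 5, 5, 2]
l2 = [1, 2, 2, 5]
print(list_difference(l1, l2))  # [3, 4, 5, 6, 5, 5]
```

Like list.remove(), this drops the earliest occurrences first, so it gives the same result as the loop above on this input.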
Sets versus list comprehension benchmark on Python 3.8
(adding up to Moinuddin Quadri's benchmarks)
tl;dr: Use Arkku's set solution; it's even faster than the comparison above suggests!
Checking existing files against a list
In my example I found it to be 40 times (!) faster to use Arkku's set solution than the pythonic list comprehension for a real world application of checking existing filenames against a list.
List comprehension:
%%time
import glob
import os
existing = [int(os.path.basename(x).split(".")[0]) for x in glob.glob("*.txt")]
wanted = list(range(1, 100000))
[i for i in wanted if i not in existing]
Wall time: 28.2 s
Sets
%%time
import glob
import os
existing = [int(os.path.basename(x).split(".")[0]) for x in glob.glob("*.txt")]
wanted = list(range(1, 100000))
set(wanted) - set(existing)
Wall time: 689 ms
Try this:
l1=[1,2,6,8]
l2=[2,3,5,8]
r=[]
for x in l1:
    if x in l2:
        continue
    r.append(x)
print(r)
Maintaining order by exploiting the ordered property of dicts (Python 3.7+)
Note: the reference implementation of dicts in Python 3.6 maintains keys in their insertion order, but this is not guaranteed by the specification. For 3.7 and up, this guarantee was added.
The keys of a dict function as a sort of set; duplicates are implicitly filtered out, and lookup is efficient due to hashing. Therefore, we can implement a "set difference" by building a dict using l1 as keys, and then removing any keys that appear in l2. This maintains order and uses a fast algorithm, but incurs a fair amount of constant-factor overhead.
d = dict.fromkeys(l1)
for i in l2:
    try:
        del d[i]
    except KeyError:
        pass
l3 = list(d.keys())
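The same idea can be wrapped in a function, sketched here with dict.pop(key, None) in place of the try/except (it does the same thing but ignores missing keys). The function name ordered_unique_difference is made up for this example. The second call shows that duplicates in l1 are collapsed while first-seen order is kept.

```python
def ordered_unique_difference(l1, l2):
    d = dict.fromkeys(l1)   # keys keep insertion order (Python 3.7+)
    for i in l2:
        d.pop(i, None)      # like `del d[i]`, but ignores missing keys
    return list(d)

print(ordered_unique_difference([1, 2, 6, 8], [2, 3, 5, 8]))  # [1, 6]
print(ordered_unique_difference([3, 1, 3, 2], [2]))           # [3, 1]
```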