More efficient way to remove items from large data sets
I have two large lists:

    a = [['abcdefghijklmno', 'foo', 'bar'], … ]
    b = [['abcdefghij12345', 'foo', 'bar'], … ]

I'm interested in all members of a which don't have a corresponding entry in b, and vice versa, based on comparing a[n][0] and b[n][0] for all n in a and b. I create two sets of these sublist items, which allows me to do set_a.difference(set_b), and vice versa, which is very fast. But creating two lists based on the remaining items in a and b is (perhaps obviously) slower:
    def remaining(ls, y, z):
        return [i for i in ls if i[0] in y.difference(z)]
where ls is either a or b, and y and z are the two sets detailed above. Is there any point in rethinking the structure of a and b to speed this up (e.g. using dicts with the a[0] and b[0] values as the keys)?
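This is roughly the dict-based layout I have in mind (dict_a and dict_b are illustrative names; a and b are the lists above): one lookup table per list, then a plain key test per row.

    # Index each list by its first element (one pass per list).
    dict_a = {row[0]: row for row in a}
    dict_b = {row[0]: row for row in b}

    # Rows whose key appears in one dict but not the other.
    only_in_a = [row for key, row in dict_a.items() if key not in dict_b]
    only_in_b = [row for key, row in dict_b.items() if key not in dict_a]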
I suspect that your test in the list comprehension is calling y.difference for each element. Try this:
    def remaining(ls, y, z):
        # Compute the set difference once, instead of once per element.
        diff = y.difference(z)
        return filter(lambda i: i[0] in diff, ls)
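A quick usage sketch, assuming set_a and set_b are built from the first element of each sublist as in the question (the list() call only matters on Python 3, where filter returns an iterator):

    set_a = set(row[0] for row in a)
    set_b = set(row[0] for row in b)

    only_in_a = list(remaining(a, set_a, set_b))  # rows of a with no match in b
    only_in_b = list(remaining(b, set_b, set_a))  # rows of b with no match in a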
At the very least, def remaining(ls, y, z): should be rewritten as def remaining(ls, common_set): so the difference set is computed once and passed in.
Consider this idea: wrap ['abcdefghijklmno', 'foo', 'bar'] in an object (probably with __slots__) and define its __hash__ using only the 'abcdefghijklmno' value. After that you will be able to do set(a) - set(b) and solve your task directly.
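A minimal sketch of that idea (the Row class name is illustrative; I'm assuming equality, like the hash, should look only at the first field):

    class Row(object):
        """Wraps one sublist; hashing and equality use only the first element."""
        __slots__ = ('data',)

        def __init__(self, data):
            self.data = data

        def __hash__(self):
            return hash(self.data[0])

        def __eq__(self, other):
            return self.data[0] == other.data[0]

        def __ne__(self, other):
            return not self.__eq__(other)

    # a and b are the lists from the question.
    a_rows = set(Row(item) for item in a)
    b_rows = set(Row(item) for item in b)

    only_in_a = [row.data for row in a_rows - b_rows]  # sublists of a with no match in b
    only_in_b = [row.data for row in b_rows - a_rows]  # sublists of b with no match in a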