More efficient way to remove items from large data sets
I have two large lists:
a = [['abcdefghijklmno', 'foo', 'bar'], … ]
b = [['abcdefghij12345', 'foo', 'bar'], … ]
I'm interested in all members of a which don't have a corresponding entry in b, and vice versa, based on comparing a[n][0] and b[n][0] for all n in a and b. I create two sets of these sublist items, which allows me to do set_a.difference(set_b), and vice versa, which is very fast. But creating two lists based on the remaining items in a and b is (perhaps obviously) slower:
def remaining(ls, y, z):
    return [i for i in ls if i[0] in y.difference(z)]
where ls is either a or b, and y and z are the two sets described above. Is there any point in rethinking the structure of a and b to speed this up (e.g. using dicts with the a[0] and b[0] values as the keys)?
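For illustration, here is a rough sketch of what the dict-keyed idea might look like (the layout and variable names are my own guesses, and it assumes Python 3 dict key views, which support set operations):

    # Key each sublist by its first element; dict key views behave like sets,
    # so the two differences can be computed directly and looked up in O(1).
    a = [['abcdefghijklmno', 'foo', 'bar']]
    b = [['abcdefghij12345', 'foo', 'bar']]

    dict_a = {row[0]: row for row in a}
    dict_b = {row[0]: row for row in b}

    # Rows whose key appears in a but not in b, and vice versa.
    only_in_a = [dict_a[k] for k in dict_a.keys() - dict_b.keys()]
    only_in_b = [dict_b[k] for k in dict_b.keys() - dict_a.keys()]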
I suspect that your test in the list comprehension is calling y.difference for each element. Try this:
def remaining(ls, y, z):
    diff = y.difference(z)
    return filter(lambda i: i[0] in diff, ls)
At the very least, def remaining(ls, y, z) should be rewritten as def remaining(ls, common_set), so the difference is computed once and passed in rather than rebuilt inside the function.
Consider this idea: wrap ['abcdefghijklmno', 'foo', 'bar'] in an object (probably with __slots__) and define its __hash__ using only the 'abcdefghijklmno' value. After that you will be able to do set(a) - set(b) and solve your task.
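A minimal sketch of that wrapper idea (the class name and fields are assumptions; __eq__ is added alongside __hash__ because set membership needs both to agree):

    class Record:
        # Wrap one sublist; hash and compare only on the first field so that
        # set operations treat rows with the same key as equal.
        __slots__ = ('key', 'rest')

        def __init__(self, row):
            self.key = row[0]
            self.rest = row[1:]

        def __hash__(self):
            return hash(self.key)

        def __eq__(self, other):
            return self.key == other.key

    # Rows unique to a, compared only by their first element.
    wrapped_a = {Record(row) for row in a}
    wrapped_b = {Record(row) for row in b}
    only_in_a = wrapped_a - wrapped_b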