
More efficient way to remove items from large data sets

I have two large lists:

a = [['abcdefghijklmno', 'foo', 'bar'], … ]
b = [['abcdefghij12345', 'foo', 'bar'], … ]

I'm interested in all members of a that don't have a corresponding entry in b, and vice versa, where correspondence means matching first elements (a[n][0] and b[m][0]). I create two sets of these first elements, which allows me to do set_a.difference(set_b), and vice versa, which is very fast. But creating the two lists of remaining items in a and b is (perhaps obviously) slower:

def remaining(ls, y, z):
    return [i for i in ls if i[0] in y.difference(z)]

where ls is either a or b, and y and z are the two sets detailed above. Is there any point in rethinking the structure of a and b to speed this up (e.g. using dicts with the a[0] and b[0] values as the keys)?
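
For illustration, here is a minimal sketch of the dict idea (assuming the first element is unique within each list, and Python 3, where dict key views support set operations; the data below are stand-ins):

# Hypothetical stand-ins for the real data
a = [['abcdefghijklmno', 'foo', 'bar']]
b = [['abcdefghij12345', 'foo', 'bar']]

dict_a = {row[0]: row for row in a}   # key each row by its first element
dict_b = {row[0]: row for row in b}

# dict key views behave like sets, so the difference is computed directly
only_in_a = [dict_a[k] for k in dict_a.keys() - dict_b.keys()]
only_in_b = [dict_b[k] for k in dict_b.keys() - dict_a.keys()]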


I suspect that your test in the list comprehension is calling y.difference for each element. Try this:

def remaining(ls, y, z):
    diff = y.difference(z)
    return filter(lambda i: i[0] in diff, ls)
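
An equivalent list-comprehension form, shown only to emphasise that the difference is computed once before the loop (note that in Python 3, filter returns an iterator, so wrap it in list() if you need a list):

def remaining(ls, y, z):
    diff = y.difference(z)                    # computed once, not per element
    return [i for i in ls if i[0] in diff]    # cheap set membership test per row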


At the very least, def remaining(ls, y, z): should be rewritten as def remaining(ls, common_set): so the difference is computed once by the caller and simply passed in.

Consider this idea: wrap ['abcdefghijklmno', 'foo', 'bar'] in an object (probably with __slots__) and define its __hash__ (and a matching __eq__) using only the 'abcdefghijklmno' value. After that you will be able to do set(a) - set(b) and get your task solved.
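
A rough sketch of that wrapper, assuming the rows are the three-element lists from the question; note that __eq__ has to agree with __hash__ for set operations to behave correctly:

class Row(object):
    __slots__ = ('key', 'data')        # no per-instance __dict__

    def __init__(self, data):
        self.key = data[0]             # e.g. 'abcdefghijklmno'
        self.data = data

    def __hash__(self):
        return hash(self.key)          # hash only on the first element

    def __eq__(self, other):
        return self.key == other.key   # equality must agree with __hash__

only_in_a = set(map(Row, a)) - set(map(Row, b))
remaining_a = [r.data for r in only_in_a]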
