selecting the earliest entry in a list that also satisfies other criteria
Say I have a list L1
, and the entries in L1
have 4 parts and are formatted like this cat1, cat2, date, ID
. The list is sorted, so alphabetically by the cat1
entries, then alphabetically by the cat2
entries, then by the earliest date
. I want a subset of this list that contains the entry with earliest date for each cat1, cat2
pair. This is the code that I have that does this already:
L1=[A, X, 2008-06-01, 1858
A, X, 2008-12-05, 1905
B, X, 2001-08-08, 1149
B, Y, 2006-03-05, 1638
B, Y, 2009-06-09, 1950
C, X, 2005-12-01, 1611
C, X, 2006-08-08, 1689
C, X, 2006-11-22, 1712
C, X, 2008-04-22, 1842
C, Y, 2008-12-05, 1816
C, Y, 2008-12-05, 1821
C, Y, 2008-12-05, 1882
C, Z, 2008-12-05, 1905
C, Z, 2009-06-01, 1935
C, Z, 2009-06-09, 1950
D, X, 2009-11-06, 1989
D, Y, 2008-12-05, 1905
D, Z, 2008-12-05, 1905
D, Z, 2008-12-05, 1905
E, X, 2008-12-05, 1905
E, Z, 2008-12-05, 1905
F, Y, 2008-12-05, 1905
G, X, 2008-12-05, 1905
G, Z, 2007-12-01, 1807]
L2=[j.next() for i, j in itertools.groupby(L1, lambda x: x.split(",", 2)[:2])]
L2=[A, X, 2008-06-01, 1858
B, X, 2001-08-08, 1149
B, Y, 2006-03-05, 1638
C, X, 2005-12-01, 1611
C, Y, 2008-12-05, 1816
C, Z, 2008-12-05, 1905
D, X, 2009-11-06, 1989
D, Y, 2008-12-05, 1905
D, Z, 2008-12-05, 1905
E, X, 2008-12-05, 1905
E, Z, 2008-12-05, 1905
F, Y, 2008-12-05, 1905
G, X, 2008-12-05, 1905
G, Z, 2007-12-01, 1807]
The trick now is that I want the earliest entry for each cat1, cat2
pair where the ID
is in a list of values in <=3 keys in dict1
AND dict2
. In other words, once the earliest entry for the cat1, cat2
pair is found is should be tested in each dict1
and dict2
and if the ID
is found to be contained in the value list for 4+ keys of either dictionary, it should go to the next earliest entry for that cat1, cat2
pair, and in order to add the entry to L2
its ID
should be in 3 or less keys in both dict1
and dict2
. I'm not quite sure how to go about this... maybe use re.search or something?
dict1[key]=[ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID]
dict2[key]=[ID,ID,ID,ID,ID,ID,ID,ID,ID,ID开发者_如何转开发,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID]
so instead of L2 having just the earliest entry per cat1, cat2
pair, it will contain the earliest entry where the ID
from that entry is not among the ID
list in 4+ keys in both dict1
AND dict2
.
If dict1
and dict2
's value-lists are not too big, you could generate the set of all valid IDs first, then filter L1 to contain only those tuples (X,Y,date,ID) whose IDs are in the set of value IDs:
import collections
def valid_ids(*dcts):
valid=collections.defaultdict(int)
for dct in dcts:
for key,value in dct.iteritems():
valid[value]+=1
return set(value for value,count in valid.iteritems() if count<=3)
ids=valid_ids(dict1,dict2)
L1_filtered=[text.split(',') for text in L1 if text.split(',')[-1].strip() in ids]
L2 = [j.next() for i, j in itertools.groupby(L1_filtered, lambda x: x.split(",", 2)[:2])]
Note that if dict1
and dict2
have value-lists with a huge number of IDs, then the above method is not ideal, because you will waste a lot of time determining the set of value ids while in forming L2
you may only need a little bit of this data.
Using Hugh Bothwell's idea, if dict1
and dict2
have large value-lists, then it might pay to just check if particular IDs are valid as needed :
def is_valid(ID,*dcts):
return sum(1 for dct in dcts
for key,value in dct.iteritems()
if ID in value) <= 3
L2=[]
for key, group in itertools.groupby(L1, lambda x: x.split(",", 2)[:2]):
for text in group:
X,Y,date,ID = text.split(',')
X = X.strip()
Y = Y.strip()
date = date.strip()
ID = ID.strip()
if is_valid(ID,dict1,dict2):
L2.append(X,Y,date,ID)
break
else:
# There is no valid ID for this group!
continue
Note that if you use the first method, with valid_ids
, you loop through the dicts just once. If you use the second method, you loop through the dicts at least once for every group (unique X
and Y
pairs), and possibly multiple times for each group.
My guess is that the first method will be faster for most data sets, but profiling both methods with your real data is probably the safest way to tell.
I think you need something like
L2 = []
for xy,rem in itertools.groupby(L1, lambda x: x.split(",", 2)[:2]):
for s in rem:
date,id = s.split(",")
if TEST_ID(id):
L2.append(s)
break
精彩评论