selecting the earliest entry in a list that also satisfies other criteria

2023-02-10 12:13 问答作者：

Say I have a list L1, and the entries in L1 have 4 parts and are formatted like this cat1, cat2, date, ID. The list is sorted, so alphabetically by the cat1 entries, then alphabetically by the cat2 entries, then by the earliest date. I want a subset of this list that contains the entry with earliest date for each cat1, cat2 pair. This is the code that I have that does this already:

L1=[A, X, 2008-06-01, 1858
A, X, 2008-12-05, 1905
B, X, 2001-08-08, 1149
B, Y, 2006-03-05, 1638
B, Y, 2009-06-09, 1950
C, X, 2005-12-01, 1611
C, X, 2006-08-08, 1689
C, X, 2006-11-22, 1712
C, X, 2008-04-22, 1842
C, Y, 2008-12-05, 1816
C, Y, 2008-12-05, 1821
C, Y, 2008-12-05, 1882
C, Z, 2008-12-05, 1905
C, Z, 2009-06-01, 1935
C, Z, 2009-06-09, 1950
D, X, 2009-11-06, 1989
D, Y, 2008-12-05, 1905
D, Z, 2008-12-05, 1905
D, Z, 2008-12-05, 1905
E, X, 2008-12-05, 1905
E, Z, 2008-12-05, 1905
F, Y, 2008-12-05, 1905
G, X, 2008-12-05, 1905
G, Z, 2007-12-01, 1807]

L2=[j.next() for i, j in itertools.groupby(L1, lambda x: x.split(",", 2)[:2])]

L2=[A, X, 2008-06-01, 1858
B, X, 2001-08-08, 1149
B, Y, 2006-03-05, 1638
C, X, 2005-12-01, 1611
C, Y, 2008-12-05, 1816
C, Z, 2008-12-05, 1905
D, X, 2009-11-06, 1989
D, Y, 2008-12-05, 1905
D, Z, 2008-12-05, 1905
E, X, 2008-12-05, 1905
E, Z, 2008-12-05, 1905
F, Y, 2008-12-05, 1905
G, X, 2008-12-05, 1905
G, Z, 2007-12-01, 1807]

The trick now is that I want the earliest entry for each cat1, cat2 pair where the ID is in a list of values in <=3 keys in dict1 AND dict2. In other words, once the earliest entry for the cat1, cat2 pair is found is should be tested in each dict1 and dict2 and if the ID is found to be contained in the value list for 4+ keys of either dictionary, it should go to the next earliest entry for that cat1, cat2 pair, and in order to add the entry to L2 its ID should be in 3 or less keys in both dict1 and dict2. I'm not quite sure how to go about this... maybe use re.search or something?

dict1[key]=[ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID]    
dict2[key]=[ID,ID,ID,ID,ID,ID,ID,ID,ID,ID开发者_如何转开发,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID]

so instead of L2 having just the earliest entry per cat1, cat2 pair, it will contain the earliest entry where the ID from that entry is not among the ID list in 4+ keys in both dict1 AND dict2.

If dict1 and dict2's value-lists are not too big, you could generate the set of all valid IDs first, then filter L1 to contain only those tuples (X,Y,date,ID) whose IDs are in the set of value IDs:

import collections
def valid_ids(*dcts):
    valid=collections.defaultdict(int)
    for dct in dcts:
        for key,value in dct.iteritems():
            valid[value]+=1
    return set(value for value,count in valid.iteritems() if count<=3)

ids=valid_ids(dict1,dict2)

L1_filtered=[text.split(',') for text in L1 if text.split(',')[-1].strip() in ids]
L2 = [j.next() for i, j in itertools.groupby(L1_filtered, lambda x: x.split(",", 2)[:2])]

Note that if dict1 and dict2 have value-lists with a huge number of IDs, then the above method is not ideal, because you will waste a lot of time determining the set of value ids while in forming L2 you may only need a little bit of this data.

Using Hugh Bothwell's idea, if dict1 and dict2 have large value-lists, then it might pay to just check if particular IDs are valid as needed :

def is_valid(ID,*dcts):    
    return sum(1 for dct in dcts
               for key,value in dct.iteritems()
               if ID in value) <= 3       

L2=[]
for key, group in itertools.groupby(L1, lambda x: x.split(",", 2)[:2]):
    for text in group:
        X,Y,date,ID = text.split(',')
        X = X.strip()
        Y = Y.strip()
        date = date.strip()
        ID = ID.strip()
        if is_valid(ID,dict1,dict2):
            L2.append(X,Y,date,ID)
            break
    else:
        # There is no valid ID for this group!
        continue

Note that if you use the first method, with valid_ids, you loop through the dicts just once. If you use the second method, you loop through the dicts at least once for every group (unique X and Y pairs), and possibly multiple times for each group.

My guess is that the first method will be faster for most data sets, but profiling both methods with your real data is probably the safest way to tell.

I think you need something like

L2 = []
for xy,rem in itertools.groupby(L1, lambda x: x.split(",", 2)[:2]):
    for s in rem:
        date,id = s.split(",")
        if TEST_ID(id):
            L2.append(s)
            break

继续阅读：list python

selecting the earliest entry in a list that also satisfies other criteria

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？