开发者

Unique list of dicts based on keys

I have a list of dics:

     data = {}
     data['ke开发者_如何学Pythony'] = pointer_key
     data['timestamp'] = timestamp
     data['action'] = action
     data['type'] = type
     data['id'] = id

     list = [data1, data2, data3, ... ]

How can I ensure that for each data item in the list, only one such element exists for each "key"? If there are two keys as seen below, the most recent timestamp would win:

    list = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo',id:245},
            {'key':2,'timestamp':2345678901,'action':'like','type':'photo',id:252},
            {'key':1,'timestamp':3456789012,'action':'like','type':'photo',id:212}]

    unique(list)

    list = [{'key':2,'timestamp':2345678901,'action':'like','type':'photo',id:252},
            {'key':1,'timestamp':3456789012,'action':'like','type':'photo',id:212}]

Thanks.


Here's my solution:

def uniq(list_dicts):
    return [dict(p) for p in set(tuple(i.items()) 
        for i in list_dicts)]

hope it will help somebody.


I needed this, but didn't like any of the answers here. So I made this simple and performant version.

def list_of_seq_unique_by_key(seq, key):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if x[key] not in seen and not seen_add(x[key])]

# Usage
# If you want most recent timestamp to win, just sort by timestamp first
list = sorted(list, key=lambda k: k['timestamp'], reverse=True)
# Remove everything with a duplicate value for key 'key'
list = list_of_seq_unique_by_key(list, 'key')


I think that you mean is that every 'key' field should be unique over all the datas.

Well, lets start with what you probably should do: Use a database, they love to solve these problem.

You can do the job by hand too, for example:

def unique_keys( items):
    seen = set()
    for item in items:
        key = item['key']
        if key not in seen:
             seen.add(key)
             yield item
        else:
             # its a duplicate key, do what?
             pass # drops it

print list(unique_keys(data_list))

Or maybe you want a data structure that stores existing keys and prevents you from creating new datas for keys that already exist ... ?


To clarify, you have multiple dictionaries, but you want a unique data['key']? E.g., if data1['key'] = 'hello' you want to make sure that data2['key'] = 'hello' isn't allowed? Do you want it just raise an error? This is a way to validate that its fine. (Also its not good to name your list 'list' as list is a datatype in python)

datalist = [datadict1, datadict2, datadict3]
big_key_list = []
for datadict in datalist:
    curkey = datadict.get('key')
    if curkey not in big_key_list:
        big_key_list.append(curkey)
    else:
        raise Exception("Key %s in two data dicts" % curkey)

Now a better way of doing this would be to create a new class inheriting from dict that contains subdictionaries, but doesn't allow multiple keys to have the same value. This way errors get thrown upon assignment rather than you can just check if things are fine (and not know what to do if things aren't fine, other than raise an error).

EDIT: Actually, looking at what you likely want to do, you have the data setup incorrectly. I say this as it seems you want to have a separate dictionary for each entry. This is almost certainly an inelegant way of doing it.

First create a class:

class MyDataObject(object):
    def __init__(self, **kwargs):
        for k,v in kwargs:
            self.__dict__[k] = v

or if they always will have all 4 fixed parameters:

class MyDataObject(object):
    def __init__(self, timestamp, action, obj_type, obj_id):
        self.timestamp = timestamp
        self.action = action
        self.type = obj_type
        self.id = obj_id

Then just define your datatypes.

data = {}
data['key1'] = MyDataObject(timestamp='some timestamp', action='some action', type='some type', id = 1234)
data['key2'] = MyDataObject(timestamp='some timestamp2', action='some action2', type='some type2', id = 1235)

You would access your data like:

data['key1'].timestamp # returns 'some timestamp'
data['key2'].action # returns 'some action2'

or you can even access using dict() (e.g., this is helpful if you have a variable x='action' and you want to access it).

data['key1'].__dict__('action') # returns 'some action'
data['key2'].__dict__('timestamp') # returns 'some timestamp2'

Now you just have a dictionary of objects, where the key is unique and the data associated with the key is kept as one object (of type MyDataObject).


You could also use a dictionary of lists, with each list position representing a specific value.

data = {}
data[pointer_key] = [timestamp, action, type, id]
if new_pointer_key in data:
    if this_timestamp > data[new_pointer_key][0]:   ## first element of list=timestamp
        data[new_pointer_key] = [new_timestamp,  new_action, new_type, new_id] 


When you do things like these, it's usually a good sign that there's a mistake in design somewhere.
But it can be done:

from operator import itemgetter

def unique(list_of_dicts):
    _sorted = sorted(list_of_dicts, key=itemgetter('timestamp'), reverse=True)
    known_keys = set()
    result = []
    for d in _sorted:
        key = d['key']
        if key in known_keys: continue
        known_keys.add(key)
        result.append(d)
    return result

Output (note: it changes ordering):

[{'action': 'like', 'timestamp': 3456789012, 'type': 'photo', 'id': 212, 'key': 1},
{'action': 'like', 'timestamp': 2345678901, 'type': 'photo', 'id': 252, 'key': 2}]

And now that keys are unique (with recent timestamps kept, as desired), it's good idea to convert it to something that reflects your data better, as suggested by jimbob:

class MyDataObject(object):
    def __init__(self, timestamp, action, obj_type, obj_id):
        self.timestamp = timestamp
        self.action = action
        self.type = obj_type
        self.id = obj_id

data = {}
for action in unique(_list):
    key = action['key']
    data[key] = MyDataObject(action['timestamp'], action['action'],
        action['type'], action['id'])


>>> def unique(l):
...     return {k['key']:k for k in l}.values()
...
>>> print(unique([ {'key':1,'timestamp':1234567890,'action':'like','type':'photo',id:245},
...                {'key':2,'timestamp':2345678901,'action':'like','type':'photo',id:252},
...                {'key':1,'timestamp':3456789012,'action':'like','type':'photo',id:212} ]))
dict_values([{<built-in function id>: 212, 'type': 'photo', 'key': 1, 'timestamp': 3456789012, 'action': 'like'}, {<built-in function id>: 252, 'type': 'photo', 'key': 2, 'timestamp': 2345678901, 'action': 'like'}])


You don't need to. By definition a dict can only have one entry for a given key.


>>> d = {'a': 1, 'b': 2, 'a': 3}
>>> d
{'a': 3, 'b': 2}

So in a dict, there is uniqueness of key.

Update: (On the basis of your comment)

In case you are looking for one key, multiple values, you subclass dict like:

>>> class custom_dict(dict):
      def __setitem__(self, key, value):
        self.setdefault(key, []).append(value)

>>> m = custom_dict()
>>> m['key'] = 1
>>> m['key'] = 2
>>> m
{'key': [1, 2]}

That should do it.


The groupby function from itertools might be useful here:

def unique(items, key, order=None):
    sort_func = (lambda v: (key(v), order(v))) if order else key
    groups = itertools.groupby(sorted(items, key=sort_func), key)
    return [group.next() for unused_key, group in groups]

or

def unique(items, key, order=None):
    groups = itertools.groupby(sorted(items, key=key), key)
    return [max(group, key=order) for unused_key, group in groups]

It groups together items that appear the same based upon an optional key. Using it on data sorted by the same qualifier will put them into groups. Taking the first element will make them unique. To allow your 'sorted by timestamp' option, we can sort by key as well as timestamp, and then group only by key. Then you could use it as below:

data = [{'key':1, 'timestamp':1234567890, 'action':'like', 'type':'photo', 'id':245},
        {'key':2, 'timestamp':2345678901, 'action':'like', 'type':'photo', 'id':252},
        {'key':1, 'timestamp':3456789012, 'action':'like', 'type':'photo', 'id':212}]

# unique(data)
key = lambda d: d['key']  # Group by key
order = lambda d: -d['timestamp']  # Sort by descending order timestamp
data = unique(data, key, order_func=order)

data == [{'key':1, 'timestamp':3456789012, 'action':'like', 'type':'photo', 'id':212},
         {'key':2, 'timestamp':2345678901, 'action':'like', 'type':'photo', 'id':252}]

We force the key to be first in the sort function to ensure we group correctly, regardless of ordering.

This solution changes the order of your items, though it does have the advantage of inoffensive storage and time complexity.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜