Sorting sub-lists into new sub-lists based on common first items
I have a large number of two-membered sub-lists that are members of a list called mylist:
mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 34439],
          ['XC013', 86701]]
I want to sort mylist into new sub-lists based on whether the sub-lists contain the same string as their first item. For example, this is what I am looking for my code to output:
newlist = [[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
           [['XC013', 99834], ['XC013', 86701]],
           [['VD126', 18884]]]
Here is how I was trying to code this:
mylist = sorted(mylist)
newlist = []
for sublist in mylist:
    id = sublist[0]
    if id == next.id:
        newlist.append(id)
print newlist
I was also trying to understand whether itertools.groupby() is the correct tool for this problem. Can someone help me with it?
You were right about this being a job for groupby:
from itertools import groupby
from operator import itemgetter
mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]
print([list(value) for key, value in groupby(sorted(mylist), key=itemgetter(0))])
This will give you a list-of-lists, grouped by the first item in the sublist.
[[['AB001', 4439], ['AB001', 22100], ['AB001', 32935]],
[['VD126', 18884]],
[['XC013', 86701], ['XC013', 99834]]]
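Note that groupby() only merges runs of consecutive equal keys, which is why the input is sorted first. A minimal sketch illustrating the difference, reusing mylist from above:

from itertools import groupby
from operator import itemgetter

# without sorting, equal keys that are not adjacent land in separate groups
print([k for k, _ in groupby(mylist, key=itemgetter(0))])
# ['AB001', 'XC013', 'VD126', 'AB001', 'XC013']

# after sorting, each key forms exactly one run, hence one group
print([k for k, _ in groupby(sorted(mylist), key=itemgetter(0))])
# ['AB001', 'VD126', 'XC013']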
collections.defaultdict
An itertools.groupby solution will incur O(n log n) cost since the input must be sorted first. You can use a defaultdict of lists for a guaranteed O(n) solution:
from collections import defaultdict

dd = defaultdict(list)
for item in mylist:
    # missing keys are created with an empty list automatically
    dd[item[0]].append(item)

res = list(dd.values())
print(res)
[[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
[['XC013', 99834], ['XC013', 86701]],
[['VD126', 18884]]]
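Note that since Python 3.7, regular dicts (and hence defaultdict) are guaranteed to preserve insertion order, so the groups come out in the order their keys are first encountered, which here matches OP's desired output without any sorting.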
There are a number of alternatives to solve this problem:
def regroup_by_di(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        if key_value not in result:
            result[key_value] = []
        result[key_value].append(item)
    return result

import collections

def regroup_by_dd(items, key=None):
    result = collections.defaultdict(list)
    callable_key = callable(key)
    for item in items:
        result[key(item) if callable_key else item].append(item)
    return dict(result)  # to be in line with other solutions

def regroup_by_sd(items, key=None):
    result = {}
    callable_key = callable(key)
    for item in items:
        key_value = key(item) if callable_key else item
        result.setdefault(key_value, []).append(item)
    return result

import itertools

def regroup_by_it(items, key=None):
    seq = sorted(items, key=key)
    result = {
        key_value: list(group)
        for key_value, group in itertools.groupby(seq, key)}
    return result

def group_by(seq, key=None):
    items = iter(seq)
    try:
        item = next(items)
    except StopIteration:
        return
    else:
        callable_key = callable(key)
        last = key(item) if callable_key else item
        i = j = 0
        for i, item in enumerate(items, 1):
            current = key(item) if callable_key else item
            if last != current:
                yield last, seq[j:i]
                last = current
                j = i
        if i >= j:
            yield last, seq[j:i + 1]

def regroup_by_gb(items, key=None):
    return dict(group_by(sorted(items, key=key), key))
These can be divided into two categories:

- loop through the input creating a dict-like structure (regroup_by_di(), regroup_by_dd(), regroup_by_sd())
- sort the input and then use a uniq-like function, e.g. itertools.groupby() (regroup_by_it(), regroup_by_gb())
The first class of approaches has O(n) computational complexity, while the second class has O(n log n).
All of the proposed approaches require specifying a key.
For OP's problem, operator.itemgetter(0) or lambda x: x[0] would work. Additionally, to get OP's desired result one should take only the list(dict.values()), e.g.:
from operator import itemgetter
mylist = [['AB001', 22100],
          ['AB001', 32935],
          ['XC013', 99834],
          ['VD126', 18884],
          ['AB001', 4439],
          ['XC013', 86701]]
print(list(regroup_by_di(mylist, key=itemgetter(0)).values()))
# [[['AB001', 22100], ['AB001', 32935], ['AB001', 4439]], [['XC013', 99834], ['XC013', 86701]], [['VD126', 18884]]]
The timings come out faster for all dict-based (first-category) solutions and slower for all groupby-based (second-category) solutions.
Within the dict-based solutions, performance depends slightly on the "collision rate", i.e. how often an incoming item creates a new group.
For higher collision rates regroup_by_di() may be the fastest, while for lower collision rates regroup_by_dd() may be the fastest.
The benchmarks come out as follows:
- 0.1% collision rate (approx. 1000 elements per group)
- 10% collision rate (approx. 10 elements per group)
- 50% collision rate (approx. 2 elements per group)
- 100% collision rate (approx. 1 element per group)
(More details available here.)
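As a rough illustration of how such timings can be reproduced, here is a minimal timeit sketch. It assumes the regroup_by_* functions above are defined in the current module and that mylist is a suitably large input (a handful of elements would mostly measure call overhead):

import timeit
from operator import itemgetter

# compare the dict-based and the sort+groupby-based implementations on the same input
for fn in (regroup_by_di, regroup_by_dd, regroup_by_sd, regroup_by_it, regroup_by_gb):
    elapsed = timeit.timeit(lambda: fn(mylist, key=itemgetter(0)), number=10_000)
    print(f"{fn.__name__}: {elapsed:.3f}s")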
Without importing any packages:

- Build a dictionary and then get the values to a list.
- Use .get to determine whether a key exists; it returns a specified value, None in this case, when the key is nonexistent. dict.get defaults to None, so this method never raises a KeyError.
  - If None is a legitimate value in the dictionary, change the default value returned by .get, e.g. test.get(t[0], 'something here').
- Because the goal is to group sub-lists on common first items, add index 0 as the key, and then add the list, t, as the dict value.
test = dict()
for t in mylist:
    if test.get(t[0]) is None:
        # first occurrence of this key: start a new group
        test[t[0]] = [t]
    else:
        test[t[0]].append(t)

final = list(test.values())

Printing final gives:
[[['AB001', 22100], ['AB001', 32935], ['AB001', 34439]],
[['XC013', 99834], ['XC013', 86701]],
[['VD126', 18884]]]
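The if/else branching can also be condensed with dict.setdefault (mirroring regroup_by_sd() above); a minimal equivalent sketch:

test = {}
for t in mylist:
    # setdefault stores an empty list the first time a key is seen,
    # then returns the stored list so the current item can be appended
    test.setdefault(t[0], []).append(t)

final = list(test.values())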