Python group by array a, and summarize array b - Performance
Given two unordered arrays a and b of the same length:
a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]
I'd like to group by the elements in a:
aResult = [7,3,5]
summing the corresponding elements in b (the example is used to summarize a probability density function):
bResult = [0.2 + 0.1 + 0.2, 0.1, 0.3 + 0.1] = [0.5, 0.1, 0.4]
Alternatively, with random a and b in Python:
import numpy as np
a = np.random.randint(1,10,10000)
b = np.array([1./len(a)]*len(a))
I have two approaches, which are surely far from the lower performance bound. Approach 1 (at least nice and short): Time: 0.769315958023
def approach_1(a, b):
    aResult = np.unique(a)
    bResult = [sum(b[i == a]) for i in aResult]
    return aResult, bResult
Approach 2 (itertools.groupby, horribly slow) Time: 4.65299129486
from itertools import groupby

def approach_2(a, b):
    # pair up the values, sort by key, then group with itertools.groupby
    tmp = [(a[i], b[i]) for i in range(len(a))]
    tmp2 = np.array(tmp, dtype=[('a', float), ('b', float)])
    tmp2 = np.sort(tmp2, order='a')
    bResult = []
    aResult = []
    for key, group in groupby(tmp2, lambda x: x[0]):
        aResult.append(key)
        bResult.append(sum([i[1] for i in group]))
    return aResult, bResult
Update: Approach 3, by Pablo. Time: 1.0265750885
from collections import defaultdict

def approach_Pablo(a, b):
    # accumulate the b value for each key found in a
    pdf = defaultdict(int)
    for x, y in zip(a, b):
        pdf[x] += y
    return list(pdf.keys()), list(pdf.values())
Update: Approach 4, by Unutbu. Time: 0.184849023819 [WINNER SO FAR, but only works when a holds integers]
def unique_Unutbu(a, b):
    # np.bincount sums the weights in b for each integer value in a
    x = np.bincount(a, weights=b)
    aResult = np.unique(a)
    bResult = x[aResult]
    return aResult, bResult
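For reference, a rough timing harness along these lines; the random setup is taken from above, but the use of timeit and the repetition count are my own assumption about how the timings were produced, not the exact benchmark used:
import numpy as np
from timeit import timeit

a = np.random.randint(1, 10, 10000)
b = np.array([1. / len(a)] * len(a))

# assumes the four functions above are already defined in the session
for fn in (approach_1, approach_2, approach_Pablo, unique_Unutbu):
    # time 10 repetitions of each candidate on the same data
    print(fn.__name__, timeit(lambda: fn(a, b), number=10))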
Maybe someone will find a smarter solution to this problem than mine :)
Here's an approach similar to @unutbu's:
import numpy as np
def f(a, b):
    result_a, inv_ndx = np.unique(a, return_inverse=True)
    result_b = np.bincount(inv_ndx, weights=b)
    return result_a, result_b
It allows a non-integer dtype for the a array, and it allows large values in a. It returns the elements of a in sorted order; if desired, the original order is easy to restore using the return_index argument of np.unique().
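For illustration, a small usage sketch of f above with a float-valued a (my own example data, not from the answer):
import numpy as np

a = np.array([7.5, 3.2, 5.1, 7.5, 5.1, 7.5])
b = np.array([0.2, 0.1, 0.3, 0.1, 0.1, 0.2])

keys, sums = f(a, b)   # f as defined above
print(keys)  # [ 3.2  5.1  7.5]
print(sums)  # [ 0.1  0.4  0.5]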
Its performance worsens as the number of unique elements in a rises; it is about 4 times slower than @unutbu's version on the data from your question.
I've made a performance comparison with three additional methods. The leaders are: for integer arrays, a hash-based implementation in Cython; for double arrays (input size 10000), a sort-based implementation, also in Cython.
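The Cython code itself isn't reproduced here; as a rough illustration of the sort-based idea in plain NumPy (my own sketch, not the benchmarked implementation):
import numpy as np

def groupby_sum_sorted(a, b):
    # sort both arrays by the keys in a
    order = np.argsort(a, kind='mergesort')
    a_sorted = a[order]
    b_sorted = b[order]
    # start index of each run of equal keys
    starts = np.concatenate(([0], np.flatnonzero(a_sorted[1:] != a_sorted[:-1]) + 1))
    # sum each run of weights with np.add.reduceat
    return a_sorted[starts], np.add.reduceat(b_sorted, starts)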
If a is composed of ints < 2**31-1 (that is, if a has values that can fit in dtype int32), then you could use np.bincount with weights:
import numpy as np
a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]
x=np.bincount(a,weights=b)
print(x)
# [ 0. 0. 0. 0.1 0. 0.4 0. 0.5]
print(x[[7,3,5]])
# [ 0.5 0.1 0.4]
np.unique(a) returns [3 5 7], so the result appears in a different order:
print(x[np.unique(a)])
# [ 0.1 0.4 0.5]
One potential problem with using np.bincount is that it returns an array whose length is one more than the maximum value in a. If a contains even one element with a value near 2**31-1, then bincount would have to allocate an array of size 8*(2**31-1) bytes (or 16GiB).
So np.bincount might be the fastest solution for arrays a which have big length but not big values. For arrays a which have small length (and big or small values), using a collections.defaultdict would probably be faster.
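A quick back-of-the-envelope check of that allocation (my own snippet; with weights, bincount produces a float64 array, i.e. 8 bytes per slot):
import numpy as np

a = np.array([1, 2, 2**31 - 2])  # one very large key value
# bincount must allocate max(a) + 1 float64 slots
needed_bytes = (int(a.max()) + 1) * 8
print(needed_bytes / 2**30, "GiB")  # ~16 GiB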
Edit: See J.F. Sebastian's solution for a way around the integer-values-only restriction and big-values problem.
How about this approach:
from collections import defaultdict
pdf = defaultdict(int)
a = [7,3,5,7,5,7]
b = [0.2,0.1,0.3,0.1,0.1,0.2]
for x,y in zip(a,b):
pdf[x] += y
You only iterate over each element once and use a dictionary for fast lookup. If you really want two separate arrays as results in the end you can ask for them:
aResult = pdf.keys()
bResult = pdf.values()
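Note that on Python 3, keys() and values() return views rather than lists; if you want NumPy arrays instead (my own adaptation, not part of the original answer), you can convert them:
import numpy as np

# pdf is the defaultdict built in the loop above
aResult = np.fromiter(pdf.keys(), dtype=float)
bResult = np.fromiter(pdf.values(), dtype=float)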