Using numpy.bincount with array weights
I would like to use bincount
to sum arrays, however it supports only doubles.
For example this works:
np.bincount([1, 1, 0],weights=np.array([1, 2, 4]))
Out: array([ 4., 3.])
However I would like to use a dimension 2 array as:
np.bincount([1, 1, 0],weights=np.array([[1,1], [2,2], [4,4]]))
ValueError: object too deep for desired array
The desired output is:
Out: array([[ 4., 4.],[3., 3.]])
Better explanation after comments:
I want to sum开发者_开发问答 each row of the array into the corresponding index.
With a loop it would be:
Bin=np.zeros(2,2)
for i in [1,1,0]:
Bin[i]+=a[i]
a is previous 3x2 matrix Is there an efficient way to obtain this result?
As per the numpy documentation:
numpy.bincount(x, weights=None, minlength=None)
weights : array_like, optional; Weights, array of the same shape as x.
So you can't use bincount
directly in this fashion unless you alter x
somehow.
Edit: So I came up with a slightly tricky way of doing this, but no guarantees about the performance when you go to large arrays. Basically I'm going to leverage how scipy sparse matrices handle repeated entries at the same indices (they sum them):
from scipy.sparse import *
a = np.array([[1,1], [2,2], [4,4]])
ii = np.array([1, 1, 0])
ares = a.reshape((-1,),order='F')
# ares == array([1, 2, 4, 1, 2, 4])
col = np.tile(ii,(a.shape[1],))
# col == np.array([1, 1, 0, 1, 1, 0])
row = np.tile([0,1],(a.shape[0],1)).reshape((-1,),order='F')
# row == np.array([0,0,0,1,1,1])
g = coo_matrix((ares,(col,row)),shape=(2,2))
print g.todense()
Now you're going to have to generalize this to your precise data. The basic idea is that you want to map each data point to the correct element of your results array and then let the sparse array handle summing the duplicate entries.
Otherwise, I would look into using Cython if you are forced to use looping to solve this.
Edit 2: For kicks, I timed two different methods:
import numpy as np
from scipy.sparse import *
def method1():
return np.array([np.bincount(ii, r) for r in a.T]).T
def method2():
ares = a.reshape((-1,),order='F')
col = np.tile(ii,(a.shape[1],))
row = np.tile(np.arange(a.shape[1]),(a.shape[0],1)).reshape((-1,),order='F')
return coo_matrix((ares,(col,row)),shape=(np.unique(ii).size,a.shape[1])).todense()
if __name__ == '__main__':
from timeit import Timer
a = np.random.randint(0,1000,(1000000,3))
ii = np.random.randint(0,10,(a.shape[0],))
N = 100
t1 = Timer("method1()", "from __main__ import method1")
t2 = Timer("method2()", "from __main__ import method2")
print 't2/t1: %f' % (t2.timeit(N)/t1.timeit(N))
On my machine, method2
is about 3-5x slower than method1
depending on the shape of the inputs so looping isn't necessarily a bad option.
You should use scipy csr
matrix to represent the indices, and then a dot product with your data. On my laptop it's 14x faster than @JoshAdel's method1
and 54x faster than @JoshAdel's method2
for big matrices.
def method1():
return np.array([np.bincount(ii, r) for r in a.T]).T
def method2():
ares = a.reshape((-1,),order='F')
col = np.tile(ii,(a.shape[1],))
row = np.tile(np.arange(a.shape[1]),(a.shape[0],1)).reshape((-1,),order='F')
return coo_matrix((ares,(col,row)),shape=(ii.max()+1,a.shape[1])).todense()
def method3():
csr = csr_matrix((np.ones(ii.shape[0]), (ii, np.arange(ii.shape[0]))))
return csr*a
Let's generate random data and time them:
n = 1<<18
d = 512
ii = np.random.randint(low=1, high=1<<10, size=n)
a = np.random.randn((n, d))
%timeit method1()
# 1 loop, best of 3: 3.13 s per loop
%timeit method2()
# 1 loop, best of 3: 11.7 s per loop
%timeit method3()
# 1 loop, best of 3: 216 ms per loop
# sanity checks:
assert (method1() == method2()).all()
assert (method1() == method3()).all()
note: I replaced np.unique(ii).size
in method2
by ii.max()+1
I think this is a simpler version for the thing you asked, based on bincount. In essence you break your weights by column in order to have the same size as your initial array and then stack the different columns together.
a = np.array([1, 1, 0])
b = np.array([[1,1], [2,2], [4,4]])
uni = np.unique(a)
a_x = np.bincount(a, weights=b[:,0], minlength=len(uni))
a_y = np.bincount(a, weights=b[:,1], minlength=len(uni))
final = np.column_stack((a_x.T, a_y.T))
# final = np.array([[ 4., 4.],[3., 3.]])
精彩评论