Using numpy.bincount with array weights

2023-02-14 20:25 问答作者：

I would like to use bincount to sum arrays, however it supports only doubles. For example this works:

np.bincount([1, 1, 0],weights=np.array([1, 2, 4]))
Out: array([ 4.,  3.])

However I would like to use a dimension 2 array as:

np.bincount([1, 1, 0],weights=np.array([[1,1], [2,2], [4,4]]))
ValueError: object too deep for desired array

The desired output is:

Out: array([[ 4.,  4.],[3., 3.]])

Better explanation after comments:

I want to sum开发者_开发问答 each row of the array into the corresponding index.

With a loop it would be:

Bin=np.zeros(2,2)
for i in [1,1,0]:
    Bin[i]+=a[i]

a is previous 3x2 matrix Is there an efficient way to obtain this result?

As per the numpy documentation:

numpy.bincount(x, weights=None, minlength=None)

weights : array_like, optional; Weights, array of the same shape as x.

So you can't use bincount directly in this fashion unless you alter x somehow.

Edit: So I came up with a slightly tricky way of doing this, but no guarantees about the performance when you go to large arrays. Basically I'm going to leverage how scipy sparse matrices handle repeated entries at the same indices (they sum them):

 from scipy.sparse import *
 a = np.array([[1,1], [2,2], [4,4]])
 ii = np.array([1, 1, 0])

 ares = a.reshape((-1,),order='F')
 # ares == array([1, 2, 4, 1, 2, 4])

 col = np.tile(ii,(a.shape[1],))
 # col == np.array([1, 1, 0, 1, 1, 0])

 row = np.tile([0,1],(a.shape[0],1)).reshape((-1,),order='F') 
 # row == np.array([0,0,0,1,1,1]) 

 g = coo_matrix((ares,(col,row)),shape=(2,2))
 print g.todense()

Now you're going to have to generalize this to your precise data. The basic idea is that you want to map each data point to the correct element of your results array and then let the sparse array handle summing the duplicate entries.

Otherwise, I would look into using Cython if you are forced to use looping to solve this.

Edit 2: For kicks, I timed two different methods:

import numpy as np
from scipy.sparse import *

def method1():
    return np.array([np.bincount(ii, r) for r in a.T]).T

def method2():
    ares = a.reshape((-1,),order='F')
    col = np.tile(ii,(a.shape[1],))
    row = np.tile(np.arange(a.shape[1]),(a.shape[0],1)).reshape((-1,),order='F') 

    return coo_matrix((ares,(col,row)),shape=(np.unique(ii).size,a.shape[1])).todense()

if __name__ == '__main__':
    from timeit import Timer

    a = np.random.randint(0,1000,(1000000,3))
    ii = np.random.randint(0,10,(a.shape[0],))

    N = 100
    t1 = Timer("method1()", "from __main__ import method1")
    t2 = Timer("method2()", "from __main__ import method2")
    print 't2/t1: %f' % (t2.timeit(N)/t1.timeit(N))

On my machine, method2 is about 3-5x slower than method1 depending on the shape of the inputs so looping isn't necessarily a bad option.

You should use scipy csr matrix to represent the indices, and then a dot product with your data. On my laptop it's 14x faster than @JoshAdel's method1 and 54x faster than @JoshAdel's method2 for big matrices.

def method1():
    return np.array([np.bincount(ii, r) for r in a.T]).T

def method2():
    ares = a.reshape((-1,),order='F')
    col = np.tile(ii,(a.shape[1],))
    row = np.tile(np.arange(a.shape[1]),(a.shape[0],1)).reshape((-1,),order='F') 

    return coo_matrix((ares,(col,row)),shape=(ii.max()+1,a.shape[1])).todense()

def method3():
    csr = csr_matrix((np.ones(ii.shape[0]), (ii, np.arange(ii.shape[0]))))
    return csr*a

Let's generate random data and time them:

n = 1<<18
d = 512
ii = np.random.randint(low=1, high=1<<10, size=n)
a = np.random.randn((n, d))

%timeit method1()
# 1 loop, best of 3: 3.13 s per loop

%timeit method2()
# 1 loop, best of 3: 11.7 s per loop

%timeit method3()
# 1 loop, best of 3: 216 ms per loop

# sanity checks:
assert (method1() == method2()).all()
assert (method1() == method3()).all()

note: I replaced np.unique(ii).size in method2 by ii.max()+1

I think this is a simpler version for the thing you asked, based on bincount. In essence you break your weights by column in order to have the same size as your initial array and then stack the different columns together.

a = np.array([1, 1, 0])
b = np.array([[1,1], [2,2], [4,4]])
uni = np.unique(a)

a_x = np.bincount(a,  weights=b[:,0], minlength=len(uni))
a_y = np.bincount(a,  weights=b[:,1], minlength=len(uni))

final = np.column_stack((a_x.T, a_y.T))
# final = np.array([[ 4.,  4.],[3., 3.]])

继续阅读：numpy python

Using numpy.bincount with array weights

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？