
Most efficient way to sum huge 2D NumPy array, grouped by ID column?

I have a massive data array (500k rows) that looks like:

id  value  score
1   20     20
1   10     30
1   15     0
2   12     4
2   3      8
2   56     9
3   6      18
...

As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.

I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id

With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.

table_trunc = table[(table == id).any(1)]
score       = sum(table_trunc[:,2])

Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?


You can use np.bincount():

import numpy as np

ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]

print(np.bincount(ids, weights=data))

The output is [ 0. 50. 21. 18.]: index i holds the sum for id == i, so the sum for id == 1 is 50, for id == 2 it is 21, and for id == 3 it is 18 (id 0 never occurs, hence the leading 0).
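Applied to the question's 2D table, the same idea looks like this (a minimal sketch; table below is a small stand-in for the real 500k-row array, with the id in column 0 and the score in column 2):

import numpy as np

# small stand-in for the 500k-row table from the question:
# column 0 = id, column 1 = value, column 2 = score
table = np.array([[1, 20, 20],
                  [1, 10, 30],
                  [1, 15,  0],
                  [2, 12,  4],
                  [2,  3,  8],
                  [2, 56,  9],
                  [3,  6, 18]])

# bincount expects non-negative integer ids; index i of the result
# holds the total score for id i
sums = np.bincount(table[:, 0], weights=table[:, 2])
print(sums)  # [ 0. 50. 21. 18.]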


You can use a for loop and numba

import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    # one integer bin per possible id (0 .. k-1)
    bins = np.arange(k) * 0
    for i in range(len(b)):
        # add this row's weight to its id's bin
        bins[b[i]] += w[i]
    return bins

Using @HYRY's variables

ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]

Then:

wbcnt(ids, data, 4)

array([ 0, 50, 21, 18])

Timing

%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)

1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
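Note that newer numba releases warn about passing plain Python lists ("reflected lists") into @njit functions; converting the inputs to NumPy arrays first avoids that, e.g.:

import numpy as np

ids_arr = np.array(ids)
data_arr = np.array(data)
wbcnt(ids_arr, data_arr, 4)  # array([ 0, 50, 21, 18])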


I noticed the numpy tag, but in case you don't mind using pandas (or if you read these data in with that module), this task becomes a one-liner:

import pandas as pd

df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})

So your dataframe would look like this:

  id  score
0   1     20
1   1     30
2   1      0
3   2      4
4   2      8
5   2      9
6   3     18

Now you can use the functions groupby() and sum():

df.groupby(['id'], sort=False).sum()

which gives you the desired output:

    score
id       
1      50
2      21
3      18

By default, groupby sorts the group keys; I pass the flag sort=False to skip that step, which might improve speed for huge dataframes.
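If the data starts out as the NumPy array from the question rather than a DataFrame, wrapping it first is straightforward (a sketch; table is a hypothetical stand-in for the real array):

import numpy as np
import pandas as pd

# hypothetical stand-in for the question's 500k-row array
table = np.array([[1, 20, 20], [1, 10, 30], [1, 15, 0],
                  [2, 12, 4], [2, 3, 8], [2, 56, 9], [3, 6, 18]])

df = pd.DataFrame(table, columns=['id', 'value', 'score'])
print(df.groupby('id', sort=False)['score'].sum())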


You can try using boolean operations:

import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])

[((ids == i) * data).sum() for i in np.unique(ids)]

This may be a bit more efficient than using np.any, but it will clearly run into trouble if you have a very large number of unique ids along with a large overall data table, since it scans the whole array once per unique id.


If you're only looking for a sum, you probably want to go with bincount. If you also need other grouping operations like product, mean, std, etc., have a look at https://github.com/ml31415/numpy-groupies . It provides some of the fastest python/numpy grouping operations around; see the speed comparison there.

Your sum operation there would look like:

import numpy_groupies as npg

res = npg.aggregate(id, score)
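A minimal runnable sketch with the sample ids/scores from the earlier answers, assuming the package is imported as numpy_groupies (its func argument defaults to a sum and accepts other reductions such as 'mean'):

import numpy as np
import numpy_groupies as npg

ids = np.array([1, 1, 1, 2, 2, 2, 3])
score = np.array([20, 30, 0, 4, 8, 9, 18])

print(npg.aggregate(ids, score, func='sum'))   # sums per id: 0, 50, 21, 18
print(npg.aggregate(ids, score, func='mean'))  # means per id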


The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:

import numpy_indexed as npi
npi.group_by(id).sum(score)
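A self-contained sketch with the same sample arrays, assuming numpy_indexed is installed; group_by(...).sum(...) returns the unique keys together with the per-group sums:

import numpy as np
import numpy_indexed as npi

ids = np.array([1, 1, 1, 2, 2, 2, 3])
score = np.array([20, 30, 0, 4, 8, 9, 18])

unique_ids, sums = npi.group_by(ids).sum(score)
print(unique_ids)  # [1 2 3]
print(sums)        # [50 21 18]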


Another option is itertools.groupby: you can group on the ID and then iterate over the grouped data.

(The data must be sorted by the grouping key, in this case the ID.)

>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
        for y in i:
            if isinstance(y, int):
                print(y)
            else:
                for p in y:
                    print('-', p)

Output:

1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
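To actually get the per-id sums with this approach (again, the data must already be sorted by ID), a minimal sketch:

import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]

# sum the score column (index 2) within each id group
sums = {key: sum(row[2] for row in group)
        for key, group in itertools.groupby(data, lambda x: x[0])}
print(sums)  # {1: 50, 2: 4}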
