
Numpy: is it possible to use numpy and ndarray to replace the for loop in this code snippet?

I am looking for a smarter and better solution.

I want to apply different scaling factors to a number field based on the label content. Hopefully the following code can illustrate what I am trying to achieve:

PS = [('A', 'LABEL1', 20),
('B', 'LABEL2', 15),
('C', 'LABEL3', 120),
('D', 'LABEL1', 3),]

FACTOR = [('LABEL1', 0.1), ('LABEL2', 0.5), ('LABEL3', 10)]

d_factor = dict(FACTOR)

for p in PS:
    # scale the numeric field by the factor registered for this row's label
    newp = (p[0], p[1], p[2] * d_factor[p[1]])
    print(newp)

It is a very trivial operation, but I need to perform it on a dataset of at least one million rows.

So, of course, the faster the better.

The factors will be known in advance, and there will be no more than 20 to 30 of them.

  1. Is there any matrix or linalg trick we can use?

  2. Can an ndarray accept a text value in a cell?


If you want to mix data types you are going to want structured arrays.

If you want the index of matching values in a lookup array, use searchsorted.

Your example goes like this:

>>> import numpy as np
>>> PS = np.array([
    ('A', 'LABEL1', 20),
    ('B', 'LABEL2', 15),
    ('C', 'LABEL3', 120),
    ('D', 'LABEL1', 3),], dtype=('a1,a6,i4'))
>>> FACTOR = np.array([
    ('LABEL1', 0.1), 
    ('LABEL2', 0.5), 
    ('LABEL3', 10)],dtype=('a6,f4'))

Your structured arrays:

>>> PS
array([('A', 'LABEL1', 20), ('B', 'LABEL2', 15), ('C', 'LABEL3', 120),
       ('D', 'LABEL1', 3)], 
      dtype=[('f0', '|S1'), ('f1', '|S6'), ('f2', '<i4')])
>>> FACTOR
array([('LABEL1', 0.10000000149011612), ('LABEL2', 0.5), ('LABEL3', 10.0)], 
      dtype=[('f0', '|S6'), ('f1', '<f4')])

And you can access individual fields like this (or you can give them names; see the docs):

>>> FACTOR['f0']
array(['LABEL1', 'LABEL2', 'LABEL3'], 
      dtype='|S6')
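
As a minimal sketch of the "give them names" option mentioned above: the field names `name`, `label` and `value` are illustrative choices, not something from the original post, and the `U` (unicode string) type is used here on the assumption that you are on Python 3.

```python
import numpy as np

# Hypothetical field names chosen for illustration only.
ps_dtype = np.dtype([('name', 'U1'), ('label', 'U6'), ('value', 'i4')])

PS = np.array([
    ('A', 'LABEL1', 20),
    ('B', 'LABEL2', 15),
    ('C', 'LABEL3', 120),
    ('D', 'LABEL1', 3),
], dtype=ps_dtype)

# Each field is now reachable by its name instead of 'f0', 'f1', 'f2'.
labels = PS['label']
values = PS['value']
```

Named fields make the later lookup code easier to read, at no extra cost.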

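How to perform the lookup of FACTOR on PS (FACTOR must be sorted by label, and every label in PS must be present in FACTOR, because searchsorted returns insertion points rather than exact matches):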

>>> idx = np.searchsorted(FACTOR['f0'], PS['f1'])
>>> idx
array([0, 1, 2, 0])
>>> FACTOR['f1'][idx]
array([  0.1,   0.5,  10. ,   0.1], dtype=float32)
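
Since searchsorted only returns insertion points, it can be worth checking that every label was really found. The following is a sketch under the assumption that the labels and factors live in two plain arrays (the variable names are made up for this example); it also sorts the lookup table first in case it is not already sorted.

```python
import numpy as np

# Lookup table, deliberately given out of order to show the sorting step.
labels = np.array(['LABEL3', 'LABEL1', 'LABEL2'])
factors = np.array([10.0, 0.1, 0.5])

# searchsorted needs a sorted array, so sort labels and factors together.
order = np.argsort(labels)
labels = labels[order]
factors = factors[order]

wanted = np.array(['LABEL1', 'LABEL2', 'LABEL3', 'LABEL1'])
idx = np.searchsorted(labels, wanted)

# A missing label gives an insertion point, possibly len(labels),
# so clip before indexing and compare against the requested labels.
idx_safe = np.clip(idx, 0, len(labels) - 1)
found = labels[idx_safe] == wanted
if not found.all():
    raise KeyError('unknown labels: %s' % wanted[~found])

looked_up = factors[idx_safe]
```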

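Now simply create a new array and multiply. Note that f2 is an integer field, so the scaled values are truncated (7.5 becomes 7 and 0.3 becomes 0), and recent NumPy versions may refuse this in-place float-to-integer cast altogether; declare f2 as a float (e.g. 'f4') if you need the fractional part: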

>>> newp = PS.copy()
>>> newp['f2'] *= FACTOR['f1'][idx]
>>> newp
array([('A', 'LABEL1', 2), ('B', 'LABEL2', 7), ('C', 'LABEL3', 1200),
       ('D', 'LABEL1', 0)], 
      dtype=[('f0', '|S1'), ('f1', '|S6'), ('f2', '<i4')])
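
If the fractional part matters, one option is to store the result in a float field. This is only a sketch of that variation, not the original answer's code: it assumes Python 3 (hence the `U` string types) and builds the output array field by field.

```python
import numpy as np

PS = np.array([
    ('A', 'LABEL1', 20),
    ('B', 'LABEL2', 15),
    ('C', 'LABEL3', 120),
    ('D', 'LABEL1', 3),
], dtype='U1,U6,i4')

FACTOR = np.array([
    ('LABEL1', 0.1),
    ('LABEL2', 0.5),
    ('LABEL3', 10),
], dtype='U6,f8')

# FACTOR is already sorted by label, as searchsorted requires.
idx = np.searchsorted(FACTOR['f0'], PS['f1'])

# Same layout as PS, but with a float64 value field so nothing is truncated.
newp = np.empty(len(PS), dtype=[('f0', 'U1'), ('f1', 'U6'), ('f2', 'f8')])
newp['f0'] = PS['f0']
newp['f1'] = PS['f1']
newp['f2'] = PS['f2'] * FACTOR['f1'][idx]  # roughly 2.0, 7.5, 1200.0, 0.3
```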


If you compare a numpy array with a value, you get a boolean mask marking the matching positions. You can use that mask to operate on all matching elements at once. This probably isn't the fastest approach, but it is simple and clear. If PS needs to have the structure you show, you can use a custom dtype to hold the three columns in one array.

import numpy as np

col1 = np.array(['a', 'b', 'c', 'd'])
col2 = np.array(['1', '2', '3', '1'])      # labels
col3 = np.array([20., 15., 120., 3.])      # values to scale

factors = {'1': 0.1, '2': 0.5, '3': 10}

# one vectorized pass per label; col2 == label is a boolean mask
for label, fac in factors.items():
    col3[col2 == label] *= fac

print(col3)
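
With only 20 to 30 factors the loop above is cheap, but the per-label passes can also be avoided. The sketch below uses a different technique from the answer above, `np.unique` with `return_inverse=True`, to map every row to its factor in one indexing step; the variable names are illustrative.

```python
import numpy as np

col2 = np.array(['1', '2', '3', '1'])      # labels
col3 = np.array([20., 15., 120., 3.])      # values to scale
factors = {'1': 0.1, '2': 0.5, '3': 10}

# uniq holds the distinct labels; inverse maps each row to its slot in uniq.
uniq, inverse = np.unique(col2, return_inverse=True)

# One factor per distinct label, in the same order as uniq.
per_label = np.array([factors[str(label)] for label in uniq])

# Fancy indexing expands the per-label factors to one factor per row.
scaled = col3 * per_label[inverse]
```

A label that is missing from `factors` raises a `KeyError` here, which is arguably safer than silently leaving the value unscaled.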


I don't think numpy can help you with that. BTW, it is ndarray, not nparray...

Maybe you could do it with a generator. See http://www.dabeaz.com/generators/index.html
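
A rough sketch of what the generator idea might look like, using the data from the question. The function name `scale_rows` is made up for this example. A generator does not make the per-row arithmetic any faster; it only avoids building the whole result list in memory.

```python
PS = [('A', 'LABEL1', 20),
      ('B', 'LABEL2', 15),
      ('C', 'LABEL3', 120),
      ('D', 'LABEL1', 3)]

FACTOR = [('LABEL1', 0.1), ('LABEL2', 0.5), ('LABEL3', 10)]


def scale_rows(rows, factor_by_label):
    """Yield each row with its numeric field scaled by its label's factor."""
    for name, label, value in rows:
        yield (name, label, value * factor_by_label[label])


# Rows are produced lazily, one at a time, as the loop consumes them.
scaled_rows = []
for row in scale_rows(PS, dict(FACTOR)):
    scaled_rows.append(row)
```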
