Fast replacement of values in a numpy array
I have a very large numpy array (containing up to a million elements) like the one below:
[0,1,6,5,1,2,7,6,2,3,8,7,3,4,9,8,5,6,11,10,6,7,12,11,7,
8,13,12,8,9,14,13,10,11,16,15,11,12,17,16,12,13,18,17,13,
14,19,18,15,16,21,20,16,17,22,21,17,18,23,22,18,19,24,23]
and a small dictionary map for replacing some of the elements in the above array
{4: 0, 9: 5, 14: 10, 开发者_运维百科19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
I would like to replace some of the elements according to the map above. The numpy array is really large, and only a small subset of the elements (occurring as keys in the dictionary) will be replaced with the corresponding values. What is the fastest way to do this?
I believe there's even more efficient method, but for now, try
from numpy import copy
newArray = copy(theArray)
for k, v in d.iteritems(): newArray[theArray==k] = v
Microbenchmark and test for correctness:
#!/usr/bin/env python2.7
from numpy import copy, random, arange
random.seed(0)
data = random.randint(30, size=10**5)
d = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
dk = d.keys()
dv = d.values()
def f1(a, d):
b = copy(a)
for k, v in d.iteritems():
b[a==k] = v
return b
def f2(a, d):
for i in xrange(len(a)):
a[i] = d.get(a[i], a[i])
return a
def f3(a, dk, dv):
mp = arange(0, max(a)+1)
mp[dk] = dv
return mp[a]
a = copy(data)
res = f2(a, d)
assert (f1(data, d) == res).all()
assert (f3(data, dk, dv) == res).all()
Result:
$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f1(data,d)'
100 loops, best of 3: 6.15 msec per loop
$ python2.7 -m timeit -s 'from w import f1,f3,data,d,dk,dv' 'f3(data,dk,dv)'
100 loops, best of 3: 19.6 msec per loop
Assuming the values are between 0 and some maximum integer, one could implement a fast replace by using the numpy-array as int->int
dict, like below
mp = numpy.arange(0,max(data)+1)
mp[replace.keys()] = replace.values()
data = mp[data]
where first
data = [ 0 1 6 5 1 2 7 6 2 3 8 7 3 4 9 8 5 6 11 10 6 7 12 11 7
8 13 12 8 9 14 13 10 11 16 15 11 12 17 16 12 13 18 17 13 14 19 18 15 16
21 20 16 17 22 21 17 18 23 22 18 19 24 23]
and replacing with
replace = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
we obtain
data = [ 0 1 6 5 1 2 7 6 2 3 8 7 3 0 5 8 5 6 11 10 6 7 12 11 7
8 13 12 8 5 10 13 10 11 16 15 11 12 17 16 12 13 18 17 13 10 15 18 15 16
1 0 16 17 2 1 17 18 3 2 18 15 0 3]
I benchmarked some solutions, and the result is without appeal :
import timeit
import numpy as np
array = 2 * np.round(np.random.uniform(0,10000,300000)).astype(int)
from_values = np.unique(array) # pair values from 0 to 2000
to_values = np.arange(from_values.size) # all values from 0 to 1000
d = dict(zip(from_values, to_values))
def method_for_loop():
out = array.copy()
for from_value, to_value in zip(from_values, to_values) :
out[out == from_value] = to_value
print('Check method_for_loop :', np.all(out == array/2)) # Just checking
print('Time method_for_loop :', timeit.timeit(method_for_loop, number = 1))
def method_list_comprehension():
out = [d[i] for i in array]
print('Check method_list_comprehension :', np.all(out == array/2)) # Just checking
print('Time method_list_comprehension :', timeit.timeit(method_list_comprehension, number = 1))
def method_bruteforce():
idx = np.nonzero(from_values == array[:,None])[1]
out = to_values[idx]
print('Check method_bruteforce :', np.all(out == array/2)) # Just checking
print('Time method_bruteforce :', timeit.timeit(method_bruteforce, number = 1))
def method_searchsort():
sort_idx = np.argsort(from_values)
idx = np.searchsorted(from_values,array,sorter = sort_idx)
out = to_values[sort_idx][idx]
print('Check method_searchsort :', np.all(out == array/2)) # Just checking
print('Time method_searchsort :', timeit.timeit(method_searchsort, number = 1))
And I got the following results :
Check method_for_loop : True
Time method_for_loop : 2.6411612760275602
Check method_list_comprehension : True
Time method_list_comprehension : 0.07994363596662879
Check method_bruteforce : True
Time method_bruteforce : 11.960559037979692
Check method_searchsort : True
Time method_searchsort : 0.03770717792212963
The "searchsort" method is almost a hundred times faster than the "for" loop, and about 3600 times faster than the numpy bruteforce method. The list comprehension method is also a very good trade-off between code simplicity and speed.
Another more general way to achieve this is function vectorization:
import numpy as np
data = np.array([0, 1, 6, 5, 1, 2, 7, 6, 2, 3, 8, 7, 3, 4, 9, 8, 5, 6, 11, 10, 6, 7, 12, 11, 7, 8, 13, 12, 8, 9, 14, 13, 10, 11, 16, 15, 11, 12, 17, 16, 12, 13, 18, 17, 13, 14, 19, 18, 15, 16, 21, 20, 16, 17, 22, 21, 17, 18, 23, 22, 18, 19, 24, 23])
mapper_dict = {4: 0, 9: 5, 14: 10, 19: 15, 20: 0, 21: 1, 22: 2, 23: 3, 24: 0}
def mp(entry):
return mapper_dict[entry] if entry in mapper_dict else entry
mp = np.vectorize(mp)
print mp(data)
No solution was posted still without a python loop on the array (except Celil's one, which however assume numbers are "small"), so here is an alternative:
def replace(arr, rep_dict):
"""Assumes all elements of "arr" are keys of rep_dict"""
# Removing the explicit "list" breaks python3
rep_keys, rep_vals = array(list(zip(*sorted(rep_dict.items()))))
idces = digitize(arr, rep_keys, right=True)
# Notice rep_keys[digitize(arr, rep_keys, right=True)] == arr
return rep_vals[idces]
the way "idces" is created comes from here.
The numpy_indexed package (disclaimer: I am its author) provides an elegant and efficient vectorized solution to this type of problem:
import numpy_indexed as npi
remapped_array = npi.remap(theArray, list(dict.keys()), list(dict.values()))
The method implemented is similar to the searchsorted based approach mentioned by Jean Lescut, but even more general. For instance, the items of the array do not need to be ints, but can be any type, even nd-subarrays themselves; yet it should achieve the same kind of performance.
A fully vectorized solution using np.in1d
and np.searchsorted
:
replace = numpy.array([list(replace.keys()), list(replace.values())]) # Create 2D replacement matrix
mask = numpy.in1d(data, replace[0, :]) # Find elements that need replacement
data[mask] = replace[1, numpy.searchsorted(replace[0, :], data[mask])] # Replace elements
for i in xrange(len(the_array)):
the_array[i] = the_dict.get(the_array[i], the_array[i])
Well, you need to make one pass through theArray
, and for each element replace it if it is in the dictionary.
for i in xrange( len( theArray ) ):
if foo[ i ] in dict:
foo[ i ] = dict[ foo[ i ] ]
Pythonic way without the need for data to be integer, can be even strings:
from scipy.stats import rankdata
import numpy as np
data = np.random.rand(100000)
replace = {data[0]: 1, data[5]: 8, data[8]: 10}
arr = np.vstack((replace.keys(), replace.values())).transpose()
arr = arr[arr[:,1].argsort()]
unique = np.unique(data)
mp = np.vstack((unique, unique)).transpose()
mp[np.in1d(mp[:,0], arr),1] = arr[:,1]
data = mp[rankdata(data, 'dense')-1][:,1]
精彩评论