Optimal (broadcasted) matrix division in numpy. Avoiding temporary arrays or not?
Numpy allows matrices of different sizes to be added/multiplied/divided provided certain broadcasting rules are followed. Also, creation of temporary arrays is a major speed impediment to numpy.
The following timit results surprise me...what is going on?
In [41]: def f_no_dot(mat,arr):
....: return mat/arr
In [42]: def f_dot(mat,arr):
....: denominator = scipy.dot(arr, scipy.ones((1,2)))
....: return mat/denominator
In [43]: mat = scipy.rand(360000,2)
In [44]: arr = scipy.rand(360000,1)
In [45]: timeit temp = f_no_dot(mat,arr)
10 loops, best of 3: 44.7 ms per loop
In [46]: timeit temp = f_dot(mat,arr)
100 loops, best of 3: 10.1 ms per loop
I thought that f_dot would be slower since it had to create the temporary array denominator, and I assumed 开发者_JAVA技巧that this step was skipped by f_no_dot. I should note that these times scale linearly (with array size, up to length 1 billion) for f_no_dot, and slightly worse than linear for f_dot.
I thought that f_dot would be slower since it had to create the temporary array denominator, and I assumed that this step was skipped by f_no_dot.
For what it's worth, creating the temporary array is skipped, which is why f_no_dot
is slower (but uses less memory).
Element-wise operations on arrays of the same size are faster, because numpy doesn't have to worry about the striding (dimensions, size, etc) of the arrays.
Operations that use broadcasting will generally be a bit slower than operations that don't have to.
If you have the memory to spare, creating a temporary copy can give you a speedup, but will use more memory.
For example, comparing these three functions:
import numpy as np
import timeit
def f_no_dot(x, y):
return x / y
def f_dot(x, y):
denom = np.dot(y, np.ones((1,2)))
return x / denom
def f_in_place(x, y):
x /= y
return x
num = 3600000
x = np.ones((num, 2))
y = np.ones((num, 1))
for func in ['f_dot', 'f_no_dot', 'f_in_place']:
t = timeit.timeit('%s(x,y)' % func, number=100,
setup='from __main__ import x,y,f_dot, f_no_dot, f_in_place')
print func, 'time...'
print t / 100.0
This yields similar timings to your results:
f_dot time...
0.184361531734
f_no_dot time...
0.619203259945
f_in_place time...
0.585789341927
However, if we compare the memory usage, things become a bit clearer...
The combined size of our x
and y
arrays is about 27.5 + 55 MB, or 82 MB (for 64-bit ints). There's an additional ~11 MB of overhead in import numpy, etc.
Returning x / y
as a new array (i.e. not doing x /= y
) will require another 55 MB array.
100 runs of f_dot
:
We're creating a temporary array here, so we'd expect to see 11 + 82 + 55 + 55 MB or ~203 MB of memory usage. And, that's what we see...
100 runs of f_no_dot
:
If no temporary array is created, we'd expect a peak memory usage of 11 + 82 + 55 MB, or 148 MB...
So, x / y
is not creating an additional num x 2
temporary array to do the division.
Thus, the division takes a quite a bit longer than it would if it were operating on two arrays of the same size.
100 runs of f_in_place
:
If we can modify x
in-place, we can save even more memory, if that's the main concern.
Basically, numpy tries to conserve memory at the expense of speed, in some cases.
What you are seeing is most likely iteration overhead over the small (2,)
dimension. Numpy (versions < 1.6) dealt efficiently only with operations involving contiguous arrays (of the same shape). The effect goes away as the size of the last dimension increases.
To see the effect of contiguity:
The situation is however much improved in Numpy >= 1.6.0:
In [1]: import numpy
In [2]: numpy.__version__
Out[2]: '1.5.1'
In [3]: arr_cont1 = numpy.random.rand(360000, 2)
In [4]: arr_cont2 = numpy.random.rand(360000, 2)
In [5]: arr_noncont = numpy.random.rand(360000, 4)[:,::2]
In [6]: arr_bcast = numpy.random.rand(360000, 1)
In [7]: %timeit arr_cont1 / arr_cont2
100 loops, best of 3: 5.75 ms per loop
In [8]: %timeit arr_noncont / arr_cont2
10 loops, best of 3: 54.4 ms per loop
In [9]: %timeit arr_bcast / arr_cont2
10 loops, best of 3: 55.2 ms per loop
(All the timings above are probably only accurate to 1 ms.)
In [1]: import numpy
In [2]: numpy.__version__
Out[2]: '1.6.1'
In [3]: arr_cont1 = numpy.random.rand(360000, 2)
In [4]: arr_cont2 = numpy.random.rand(360000, 2)
In [5]: arr_noncont = numpy.random.rand(360000, 4)[:,::2]
In [6]: arr_bcast = numpy.random.rand(360000, 1)
In [7]: %timeit arr_cont1 / arr_cont2
100 loops, best of 3: 5.37 ms per loop
In [8]: %timeit arr_noncont / arr_cont2
100 loops, best of 3: 6.12 ms per loop
In [9]: %timeit arr_bcast / arr_cont2
100 loops, best of 3: 7.81 ms per loop
Note also that temporaries are not that expensive:
In [82]: %timeit arr_cont1.copy()
1000 loops, best of 3: 778 us per loop
EDIT: Note above that also arr_noncont
is sort of contiguous with stride of 2*itemsize
, so that the inner loop can be unraveled --- Numpy can make it about as fast as the contiguous array. With broadcasting (or with a really noncontiguous array such as numpy.random.rand(360000*2, 2)[::2,:]
, the inner loop cannot be unraveled, and these cases are slightly slower. Improving that would still be possible if Numpy emitted tailored machine code on the fly for each loop, but it doesn't do that (at least yet :)
精彩评论