Summing with a for loop faster than with reduce?
I wanted to see how much faster reduce was than using a for loop for simple numerical operations. Here's what I found (using the standard timeit library):
In [54]: print(setup)
from operator import add, iadd
r = range(100)
In [55]: print(stmt1)
c = 0
for i in r:
    c += i
In [56]: timeit(stmt1, setup)
Out[56]: 8.948904991149902
In [58]: print(stmt3)
reduce(add, r)
In [59]: timeit(stmt3, setup)
Out[59]: 13.316915035247803
Looking a little more:
In [68]: timeit("1+2", setup)
Out[68]: 0.04145693778991699
In [69]: timeit("add(1,2)", setup)
Out[69]: 0.22807812690734863
What's going on here? Obviously, reduce does loop faster than for, but the function call seems to dominate. Shouldn't the reduce version run almost entirely in C? Using iadd(c, i) in the for loop version makes it run in ~24 seconds. Why would using operator.add be so much slower than +? I was under the impression that + and operator.add run the same C code (I checked to make sure operator.add wasn't just calling + in Python or anything).
BTW, just using sum runs in ~2.3 seconds.
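For reference, that last measurement was presumably something along these lines (the exact statement isn't shown):
timeit("sum(r)", setup)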
In [70]: print(sys.version)
2.7.1 (r271:86882M, Nov 30 2010, 09:39:13)
[GCC 4.0.1 (Apple Inc. build 5494)]
The reduce(add, r) must invoke the add() function 99 times (once per combining step), so the overhead of the function calls adds up -- reduce uses PyEval_CallObject to invoke add on each iteration:
for (;;) {
    ...
    if (result == NULL)
        result = op2;
    else {
        /* here it is creating a tuple to pass the previous result and
           the next value from range(100) into func add(): */
        PyTuple_SetItem(args, 0, result);
        PyTuple_SetItem(args, 1, op2);
        if ((result = PyEval_CallObject(func, args)) == NULL)
            goto Fail;
    }
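To make the overhead concrete, here is a rough pure-Python sketch of what reduce(add, r) does (my_reduce is a hypothetical name; the real C implementation also handles an optional initializer and empty sequences):

from operator import add

def my_reduce(func, iterable):
    it = iter(iterable)
    result = next(it)                # the first element seeds the accumulator
    for item in it:
        result = func(result, item)  # one Python-level call per element
    return result

print(my_reduce(add, range(100)))    # 4950, after 99 calls to add

Every pass through that loop pays the full price of a Python function call, which is exactly what the timings in the question are measuring.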
Updated: Response to question in comments.
When you type 1 + 2 in Python source code, the bytecode compiler constant-folds it at compile time and replaces the expression with 3:
import byteplay
f1 = lambda: 1 + 2
c1 = byteplay.Code.from_code(f1.func_code)
print c1.code
1 1 LOAD_CONST 3
2 RETURN_VALUE
If you add two variables, a + b, the compiler generates bytecode which loads the two variables and performs a BINARY_ADD, which is far faster than calling a function to perform the addition:
f2 = lambda a, b: a + b
c2 = byteplay.Code.from_code(f2.func_code)
print c2.code
1 1 LOAD_FAST a
2 LOAD_FAST b
3 BINARY_ADD
4 RETURN_VALUE
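For comparison, a sketch using the standard library's dis module (so no byteplay dependency) shows what the function-call version compiles to; the output below is from CPython 2.7 and may differ slightly on other versions. The CALL_FUNCTION opcode is where the argument packing and frame setup happen, and that is the overhead dominating the timings above:

import dis
from operator import add

f3 = lambda a, b: add(a, b)
dis.dis(f3)
#   1           0 LOAD_GLOBAL              0 (add)
#               3 LOAD_FAST                0 (a)
#               6 LOAD_FAST                1 (b)
#               9 CALL_FUNCTION            2
#              12 RETURN_VALUE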
It could be the overhead of copying args and return values (i.e. "add(1, 2)"), as opposed to simply operating on numeric literals.
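A minimal way to measure that call overhead directly, using only the standard library (the exact numbers will vary by machine):

from timeit import timeit

# inline + compiles to a single BINARY_ADD opcode
print(timeit("c = a + b", setup="a, b = 1, 2"))
# the call version pays for argument packing and frame setup every time
print(timeit("c = add(a, b)", setup="from operator import add; a, b = 1, 2"))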
edit: Switching to zeros(width) instead of array([0] * width) closes the gap big time -- array([0] * width) first builds a 3-million-element Python list and then converts it, and that conversion was dominating the loop version's runtime.
from functools import reduce
from numpy import array, arange, zeros
from time import time

def add(x, y):
    return x + y

def sum_columns(x):
    if x.any():
        width = len(x[0])
        total = zeros(width)
        for row in x:
            total += array(row)
        return total

l = arange(3000000)
l = array([l, l, l])

start = time()
print(reduce(add, l))
print('Reduce took {}'.format(time() - start))

start = time()
print(sum_columns(l))
print('For loop took {}'.format(time() - start))
This gets you down to almost no difference at all:
Reduce took 0.03230619430541992
For loop took 0.058577775955200195
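(For completeness: NumPy can do this whole reduction in C with the array's own sum method, avoiding the Python-level loop entirely; a one-line sketch:)

print(l.sum(axis=0))  # column sums, computed entirely in C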
old: If reduce is used for adding NumPy arrays together elementwise, it can be faster than a for loop.
from functools import reduce
from numpy import array, arange
from time import time

def add(x, y):
    return x + y

def sum_columns(x):
    if x.any():
        width = len(x[0])
        total = array([0] * width)
        for row in x:
            total += array(row)
        return total

l = arange(3000000)
l = array([l, l, l])

start = time()
print(reduce(add, l))
print('Reduce took {}'.format(time() - start))

start = time()
print(sum_columns(l))
print('For loop took {}'.format(time() - start))
With the result of
[ 0 3 6 ..., 8999991 8999994 8999997]
Reduce took 0.024930953979492188
[ 0 3 6 ..., 8999991 8999994 8999997]
For loop took 0.3731539249420166