Efficient operator+
I have to compute large sums of 3d vectors, and a comparison of using a vector class with overloaded operator+ and operator* versus summing up separate components shows a performance difference of about a factor of three. I now assume the difference must be due to the construction of objects in the overloaded operators.
How can one avoid the construction and improve performance?
I'm especially puzzled because the following is, afaik, basically the standard way to do it, and I would expect the compiler to optimize this. In real life, the sums are not going to be done within a loop but in quite large expressions (several tens of MBs in total per executable) summing up different vectors; this is why operator+ is used below.
class Vector
{
double x,y,z;
...
Vector&
Vector::operator+=(const Vector &v)
{
x += v.x;
y += v.y;
z += v.z;
return *this;
}
Vector
Vector::operator+(const Vector &v)
{
return Vector(*this) += v; // bad: construction and copy(?)
}
...
}
// comparison
double xx[N], yy[N], zz[N];
Vector vec[N];
// assume xx, yy, zz and vec are properly initialized
Vector sum(0,0,0);
for(int i = 0; i < N; ++i)
{
sum = sum + vec[i];
}
// this is a factor 3 faster than the above loop
double sumxx = 0;
double sumyy = 0;
double sumzz = 0;
for(int i = 0; i < N; ++i)
{
sumxx = sumxx + xx[i];
sumyy = sumyy + yy[i];
sumzz = sumzz + zz[i];
}
Any help is greatly appreciated.
EDIT: Thank you all for your great input, I have the performance now at the same level. @Dima's and especially @Xeo's answer did the trick. I wish I could mark more than one answer "accepted". I'll test some of the other suggestions too.
This article has a really good argumentation on how to optimize operators such as +, -, *, /.
Implement the operator+ as a free function like this in terms of operator+=:
Vector operator+(Vector lhs, Vector const& rhs){
return lhs += rhs;
}
Notice how the lhs Vector is a copy and not a reference. This allows the compiler to make optimizations such as copy elision.
The general rule that article conveys: if you need a copy, make it in the parameters, so the compiler can optimize. The article doesn't use this example, but rather the operator= for the copy-and-swap idiom.
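A minimal sketch of the by-value idea, assuming a bare-bones Vector with public members (the constructor and the demoByValuePlus helper are made up for illustration and are not from the question):

```cpp
#include <cassert>

// Simplified stand-in for the question's Vector (assumption: public members).
struct Vector {
    double x, y, z;
    Vector(double x_, double y_, double z_) : x(x_), y(y_), z(z_) {}

    Vector& operator+=(const Vector& v) {
        x += v.x;
        y += v.y;
        z += v.z;
        return *this;
    }
};

// lhs is taken by value: the one copy that is needed is made at the call
// site, where the compiler may elide it if the argument is a temporary.
inline Vector operator+(Vector lhs, const Vector& rhs) {
    lhs += rhs;
    return lhs;  // returned by name, so the parameter may be moved
}

// Hypothetical usage helper.
inline void demoByValuePlus() {
    Vector a(1, 2, 3), b(4, 5, 6), c(7, 8, 9);
    Vector s = a + b + c;  // (a + b) is a temporary passed as lhs
    assert(s.x == 12.0 && s.y == 15.0 && s.z == 18.0);
    assert(a.x == 1.0);    // operands are left untouched
}
```

Writing `lhs += rhs; return lhs;` instead of `return lhs += rhs;` is a small variation: the one-liner returns through a reference and therefore copy-constructs the result, while returning the parameter by name allows a move. For a class holding three doubles the two should cost about the same, so check the generated code before relying on either.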
Why not replace
sum = sum + vec[i];
with
sum += vec[i];
... that should eliminate two calls to the copy constructor and one call to the assignment operator for each iteration.
But as always, profile and know where the expense is coming from instead of guessing.
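One way to check the copy counts rather than guess is to instrument the class. The sketch below assumes a hypothetical CountedVector with the same operator shapes as in the question (operator+ is additionally marked const); the counters and helper names are invented for illustration:

```cpp
#include <cassert>

// Hypothetical instrumented stand-in for the question's Vector class.
struct CountedVector {
    double x, y, z;
    static int copies;       // number of copy-constructor calls
    static int assignments;  // number of copy-assignment calls

    CountedVector(double x_, double y_, double z_) : x(x_), y(y_), z(z_) {}
    CountedVector(const CountedVector& o) : x(o.x), y(o.y), z(o.z) { ++copies; }
    CountedVector& operator=(const CountedVector& o) {
        x = o.x; y = o.y; z = o.z;
        ++assignments;
        return *this;
    }
    CountedVector& operator+=(const CountedVector& v) {
        x += v.x; y += v.y; z += v.z;
        return *this;
    }
    // Same shape as the operator+ in the question.
    CountedVector operator+(const CountedVector& v) const {
        return CountedVector(*this) += v;
    }
};
int CountedVector::copies = 0;
int CountedVector::assignments = 0;

inline void resetCounters() {
    CountedVector::copies = 0;
    CountedVector::assignments = 0;
}

// Hypothetical usage helper comparing the two spellings.
inline void demoCounts() {
    CountedVector sum(0, 0, 0), v(1, 2, 3);

    resetCounters();
    sum = sum + v;   // copy of *this, copy out of the reference, assignment
    assert(CountedVector::copies == 2);
    assert(CountedVector::assignments == 1);

    resetCounters();
    sum += v;        // works in place
    assert(CountedVector::copies == 0);
    assert(CountedVector::assignments == 0);
}
```

The counts only say how many special member functions are called in the source-level sense. With inlining, an optimizing compiler may still generate the same machine code for both forms, so timing or disassembly remains the final word.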
You might be interested in expression templates.
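A very reduced sketch of the expression-template idea, under the assumption that only left-nested sums such as a + b + c are needed. Vec3, VecSum and demoExpressionTemplate are hypothetical names, and real libraries are considerably more elaborate:

```cpp
#include <cassert>
#include <cstddef>

// Node representing "l + r". It only stores references and computes a
// component when asked, so no intermediate vector is ever built.
template <typename L, typename R>
struct VecSum {
    const L& l;
    const R& r;
    VecSum(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec3 {
    double c[3];
    Vec3(double x, double y, double z) { c[0] = x; c[1] = y; c[2] = z; }
    double operator[](std::size_t i) const { return c[i]; }

    // Assigning from a sum expression evaluates it one component at a time.
    template <typename L, typename R>
    Vec3& operator=(const VecSum<L, R>& e) {
        for (std::size_t i = 0; i < 3; ++i) c[i] = e[i];
        return *this;
    }
};

// vector + vector builds a node instead of a new Vec3.
inline VecSum<Vec3, Vec3> operator+(const Vec3& a, const Vec3& b) {
    return VecSum<Vec3, Vec3>(a, b);
}

// (expression) + vector extends the tree; this covers left-nested chains.
template <typename L, typename R>
VecSum<VecSum<L, R>, Vec3> operator+(const VecSum<L, R>& a, const Vec3& b) {
    return VecSum<VecSum<L, R>, Vec3>(a, b);
}

// Hypothetical usage helper.
inline void demoExpressionTemplate() {
    Vec3 a(1, 2, 3), b(4, 5, 6), c(7, 8, 9), sum(0, 0, 0);
    sum = a + b + c;  // evaluated in a single pass inside operator=
    assert(sum[0] == 12.0 && sum[1] == 15.0 && sum[2] == 18.0);
}
```

The nodes hold references to temporaries, so an expression must be assigned within the same statement; storing `a + b` in a variable and using it later would leave dangling references.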
I implemented most of the optimizations being proposed here and compared it with the performance of a function call like
void Vector::isSumOf(Vector v1, Vector v2)
{
x = v1.x + v2.x;
...
}
Repeatedly executing the same loop, with a few billion vector summations for every method in alternating order, did not result in the promised gains.
In the case of the member function posted by bbtrb, this method took 50% more time than the isSumOf() function call.
The free, non-member operator+ (Xeo) needed up to double the time (100% more) of the isSumOf() function.
(gcc 4.6.3 -O3)
I am aware that this was not representative testing, but since I could not reproduce any performance gains by using operators at all, I suggest avoiding them if possible.
Usually, operator + looks like:
return Vector (x + v.x, y + v.y, z + v.z);
with a suitably defined constructor. This allows the compiler to do return value optimisation.
But if you're compiling for IA32, then SIMD would be worth considering, along with changes to the algorithms to take advantage of the SIMD nature. Other processors may have SIMD style instructions.
I think the difference in performance is caused by the compiler optimization here. Adding up elements of arrays in a loop can be vectorized by the compiler. Modern CPUs have instructions for adding multiple numbers in a single clock tick, such as SSE, SSE2, etc. This seems to be a likely explanation for the factor of 3 difference that you are seeing.
In other words, adding corresponding elements of two arrays in a loop may generally be faster than adding corresponding members of a class. If you represent the vector as an array inside your class, rather than x, y, and z, you may get the same speedup for your overloaded operators.
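A sketch of what the array-based representation could look like, assuming accessors replace the public x, y, z members (ArrayVector and demoArrayVector are hypothetical names). Whether the compiler actually vectorizes the loop depends on the compiler and flags, so this is something to verify in the disassembly:

```cpp
#include <cassert>

// Hypothetical variant of the Vector class that stores its components in an
// array, so that operator+= becomes a small loop over contiguous doubles.
struct ArrayVector {
    double c[3];

    ArrayVector(double x, double y, double z) { c[0] = x; c[1] = y; c[2] = z; }

    double x() const { return c[0]; }
    double y() const { return c[1]; }
    double z() const { return c[2]; }

    ArrayVector& operator+=(const ArrayVector& v) {
        for (int i = 0; i < 3; ++i) c[i] += v.c[i];
        return *this;
    }
};

inline ArrayVector operator+(ArrayVector lhs, const ArrayVector& rhs) {
    lhs += rhs;
    return lhs;
}

// Hypothetical usage helper.
inline void demoArrayVector() {
    ArrayVector a(1, 2, 3), b(10, 20, 30);
    ArrayVector s = a + b;
    assert(s.x() == 11.0 && s.y() == 22.0 && s.z() == 33.0);
}
```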
Are the implementations of your Vector operator functions directly in the header file, or are they in a separate cpp file? In the header file they would typically be inlined in an optimized build. But if they are compiled in a different translation unit, then they often won't be (depending on your build settings). If the functions aren't inlined, then the compiler won't be able to do the type of optimization you are looking for.
In cases like these, have a look at the disassembly. Even if you don't know much about assembly code it's usually pretty easy to figure out what's different in simple cases like these.
Actually if you look at any real matrix code the operator+ and the operator+= don't do that.
Because of the copying involved they introduce a pseudo object into the expression and only do the real work when the assignment is executed. Using lazy evaluation like this also allows NULL operations to be removed during expression evaluation:
class Matrix;
class MatrixOp
{
public: virtual void DoOperation(Matrix& resultInHere) = 0;
};
class Matrix
{
public:
void operator=(MatrixOp* op)
{
// No copying has been done.
// You have built an operation tree.
// Now you are going to evaluate the expression and put the
// result into *this
op->DoOperation(*this);
}
MatrixOp* operator+(Matrix& rhs) { return new MatrixOpPlus(*this,rhs);}
MatrixOp* operator+(MatrixOp* rhs){ return new MatrixOpPlus(*this,rhs);}
// etc
};
Of course this is a lot more complex than I have portrayed here in this simplified example. But if you use a library that has been designed for matrix operations then it will have already been done for you.
Your Vector implementation:
Implement the operator+() like this:
Vector
Vector::operator+(const Vector &v)
{
return Vector(x + v.x, y + v.y, z + v.z);
}
and add the inline keyword in your class definition (this avoids the stack pushes and pops of the return address and method arguments for each method call, if the compiler finds it useful).
Then add this constructor:
Vector::Vector(const double &x, const double &y, const double &z)
: x(x), y(y), z(z)
{
}
which lets you construct a new vector very efficiently (like you would do in my operator+() suggestion)!
In the code using your Vector:
You did:
for(int i = 0; i < N; ++i)
{
sum = sum + vec[i];
}
Unroll this kind of loop! Doing only one operation (as it would be optimized to use the SSE2/3 extensions or something similar) per iteration in a very large loop is very inefficient. You should rather do something like this:
//Unrolled loop:
for(int i = 0; i <= N - 10; i += 10)
{
sum = sum + vec[i]
+ vec[i+1]
+ vec[i+2]
+ vec[i+3]
+ vec[i+4]
+ vec[i+5]
+ vec[i+6]
+ vec[i+7]
+ vec[i+8]
+ vec[i+9];
}
//Doing the "rest":
for(int i = (N / 10) * 10; i < N; ++i)
{
sum = sum + vec[i];
}
(Note that this code is untested and may contain an off-by-one error or similar...)
Note that you are comparing different things, because the data is not laid out in the same way in memory. With the Vector array the coordinates are interleaved "x1,y1,z1,x2,y2,z2,...", while with the double arrays you have "x1,x2,...,y1,y2,...,z1,z2,...". I suppose this could have an impact on compiler optimizations or on how the caching handles it.
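To make the two layouts concrete, here is a small sketch that sums the same data stored both ways. Point3, SeparateArrays and the function names are hypothetical; the sketch only shows the memory layouts and makes no claim about which one is faster on a given machine:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved layout: x1,y1,z1,x2,y2,z2,...
struct Point3 {
    double x, y, z;
};

// Separate layout: x1,x2,... then y1,y2,... then z1,z2,...
struct SeparateArrays {
    std::vector<double> xs, ys, zs;
};

inline Point3 sumInterleaved(const std::vector<Point3>& v) {
    Point3 s = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < v.size(); ++i) {
        s.x += v[i].x;
        s.y += v[i].y;
        s.z += v[i].z;
    }
    return s;
}

inline Point3 sumSeparate(const SeparateArrays& a) {
    Point3 s = {0.0, 0.0, 0.0};
    // Each array is walked contiguously (assumes xs, ys, zs have equal size).
    for (std::size_t i = 0; i < a.xs.size(); ++i) {
        s.x += a.xs[i];
        s.y += a.ys[i];
        s.z += a.zs[i];
    }
    return s;
}

// Hypothetical usage helper: both layouts must give the same sums.
inline void demoLayouts() {
    std::vector<Point3> interleaved;
    SeparateArrays separate;
    for (int i = 1; i <= 4; ++i) {
        Point3 p = {1.0 * i, 2.0 * i, 3.0 * i};
        interleaved.push_back(p);
        separate.xs.push_back(p.x);
        separate.ys.push_back(p.y);
        separate.zs.push_back(p.z);
    }
    Point3 a = sumInterleaved(interleaved);
    Point3 b = sumSeparate(separate);
    assert(a.x == b.x && a.y == b.y && a.z == b.z);
}
```

For a fair benchmark, both versions should read the same layout, or the layout difference should at least be measured separately from the operator overhead.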