Trying to reduce the speed overhead of an almost-but-not-quite-int number class
I have implemented a C++ class which behaves very similarly to the standard int
type. The difference is that it has an additional concept of "epsilon" which represents some tiny value that is much less than 1, but greater than 0. One way to think of it is as a very wide fixed point number with 32 MSBs (the integer parts), 32 LSBs (the epsilon parts) and a huge sea of zeros in between. (Note: A big difference between this class and normal fixed point numbers is that there are two signs, not one: "value" and "epsilon" can be negative independently of each other, whereas for fixed point, there is one sign for the entire number.)
The following class works, but introduces a ~2x speed penalty in the overall program. (The program includes code that has nothing to do with this class, so the actual speed penalty of this class is probably much greater than 2x.) I can't paste the code that is using this class, but I can say the following:
+, -, +=, <, >
and >=
are the only heavily used operators. Use of setEpsilon()
and getInt()
is extremely rare. *
is also rare, and does not even need to consider the epsilon values at all.
Here is the class:
#include <limits>
struct int32Uepsilon {
typedef int32Uepsilon Self;
int32Uepsilon () { _value = 0;
_eps = 0; }
int32Uepsilon (const int &i) { _value = i;
_eps = 0; }
void setEpsilon() { _eps = 1; }
Self operator+(const Self &rhs) const { Self result = *this;
result._value += rhs._value;
result._eps += rhs._eps;
return result; }
Self operator-(const Self &rhs) const { Self result = *this;
result._value -= rhs._value;
result._eps -= rhs._eps;
return result; }
Self operator-( ) const { Self result = *this;
result._value = -result._value;
result._eps = -result._eps;
return result; }
Self operator*(const Self &rhs) const { return this->getInt() * rhs.getInt(); } // XXX: discards epsilon
bool operator<(const Self &rhs) const { return (_value < rhs._value) ||
(_value == rhs._value && _eps < rhs._eps); }
bool operator>(const Self &rhs) const { return (_value > rhs._value) ||
(_value == rhs._value && _eps > rh开发者_开发知识库s._eps); }
bool operator>=(const Self &rhs) const { return (_value >= rhs._value) ||
(_value == rhs._value && _eps >= rhs._eps); }
Self &operator+=(const Self &rhs) { this->_value += rhs._value;
this->_eps += rhs._eps;
return *this; }
Self &operator-=(const Self &rhs) { this->_value -= rhs._value;
this->_eps -= rhs._eps;
return *this; }
int getInt() const { return(_value); }
private:
int _value;
int _eps;
};
namespace std {
template<>
struct numeric_limits<int32Uepsilon> {
static const bool is_signed = true;
static int max() { return 2147483647; }
}
};
The code above works, but it is quite slow. Does anyone have any ideas on how to improve performance? There are a few hints/details I can give that might be helpful:
- 32 bits are definitely insufficient to hold both _value and _eps. In practice, up to 24 ~ 28 bits of _value are used and up to 20 bits of _eps are used.
- I could not measure a significant performance difference between using
int32_t
andint64_t
, so memory overhead itself is probably not the problem here. - Saturating addition/subtraction on _eps would be cool, but isn't really necessary.
- Note that the signs of _value and _eps are not necessarily the same! This broke my first attempt at speeding this class up.
- Inline assembly is no problem, so long as it works with GCC on a Core i7 system running Linux!
One thing to try is the canonical practice of defining e.g. operator+
in terms of operator+=
:
Self operator+(const Self &rhs) const { return Self(*this) += rhs; }
This facilitates the return-value optimization, which eliminates the copy constructor that would otherwise be needed for return-by-value.
Also, it reduces code maintenance!
2x speed penalty does not seem unreasonable since all operations are done twice.
You might use MMX/SSE2 instructions to pack value and epsilon in two registers and perform the two operations in parallel only once. Alternatively, on 64-bit architecture you can pack the two values in a single int64, as in: [32 bits of value][12 zeros][20 bits of eps]
. Comparisons would work automatically with a single operations, addition and subtraction would need to mask out carry over from eps into padding zeros. There's no obstacle to using MMX for addition and subtraction (masking out happens automatically then) and ordinary integer comparison for comparisons.
BTW , your operator-= seems to be buggy: in this->_eps -= rhs._eps
, this->eps
can become negative. Shouldn't you then adjust both eps and decrement the value? What is the overflow behavior of eps? Does it ever carry over into value?
My (partial) solution is using one integer operation, instead of two in your solution as pointed out by Oli Charlesworth in the comment.
Here is how you can do: use int64_t
to store both _eps
and _value
. In my example below, _value
is represent by bit0-to-bit31
and _eps
is represented by bit32-to-bit63
.
struct int32Uepsilon
{
typedef int32Uepsilon Self;
int64_t value;
int32Uepsilon () { value = 0 }
int32Uepsilon (const int i) { value = i; }
void setEpsilon()
{
//equivent to _eps = 1
value = ((int64_t)1 << 32) + (value & 0xFFFFFFFF);
}
Self operator+(const Self &rhs) const
{
Self result = *this;
//this adds lower 32 bits to lower 32 bits, upper 32 bits to upper 32 bits!
result.value += rhs.value;
return result;
}
//....
int getValue() { return value & 0xFFFFFFFF; }
int getEpsilon() { return value >> 32; }
};
If there is no overflow, then +
can be done efficiently and reliably. This is a just a start. Try thinking if other operations can be done reliably using some bit-operations.
A simple demonstration of addition. Please read the comment
int main()
{
int64_t x = ((int64_t)2 << 32) + 4; //eps = 2, value = 4
int64_t y = ((int64_t)65 << 32) +7897; //eps = 65, value = 7897
int64_t z = x + y ; //in z, eps = (2+65) = 67, value = (4 + 7897) = 7901
cout << (x >> 32) << ", " << (x & 0xFFFFFFFF) << endl;
cout << (y >> 32) << ", " << (y & 0xFFFFFFFF) << endl;
cout << (z >> 32) << ", " << (z & 0xFFFFFFFF) << endl;
return 0;
}
Output:
2, 4
65, 7897
67, 7901
as expected.
Demo at Ideone: http://www.ideone.com/GjSnJ
Instead of x + y
, you can use x | y
which is even faster operation.
Thanks for the input, everyone. In the end, I used inline assembly to implement +, - and +=. Amazingly, GCC could not vectorize these tiny functions by itself, though using the __builtin_ia32_paddd()
builtin helped somewhat. Eventually, the overall program overhead (compared to using bare int
) was reduced from 100% to 50%. Looking at the generated assembly, the result seemed to be optimal, at least where I looked. As far as I could tell from a quick read of the Intel manuals, there is nothing to help with the comparison operations. (There are vector comparison instructions, but none of them are helpful here.)
精彩评论