Efficiency of bitwise XOR in C++ in comparison to more readable methods
I've recently been writing some code for a research project where efficiency is very important. I've been considering scrapping some of the regular ways I do things and using bitwise XORs instead. What I'm wondering is whether this will make a difference (if I'm performing this operation, say, several million times) or whether it's all the same after I use -O3 in g++.
The two examples that come to mind:
I had an instance where (I'm working with purely positive ints) I needed to change n to n-1 if n was odd, or n to n+1 if n was even. I figured I had a few options:
if (n % 2)   // or (n % 2 == 0) and flip the order
    n = n - 1;
else
    n = n + 1;
or
n = n - 2 * (n % 2) + 1; // This of course was silly, but was the only non-bitwise one-liner I could come up with
Finally:
n=n^1;
All of the methods clearly do the same thing, but my feeling was that the third one would be the most efficient.
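A quick sanity check that the three forms really agree (a throwaway test; every name in it is illustrative):

#include <assert.h>
int main(void) {
    for (unsigned n = 1; n < 1000; n++) {
        unsigned a = (n % 2) ? n - 1 : n + 1;   // the if/else version
        unsigned b = n - 2 * (n % 2) + 1;       // the arithmetic one-liner
        unsigned c = n ^ 1;                     // the XOR version
        assert(a == b && b == c);
    }
    return 0;
}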
The next example is on a more general note. Say I'm comparing two positive integers; will one of these perform better than the others, or will the difference really not be noticeable, even if I perform this operation several million times?
if (n_1 == n_2)
if (!(n_1 ^ n_2))
if (n_1 ^ n_2) {} else { /* do work here */ }
Will the compiler just do the same operation in all of these instances? I'm just curious if there is an instance when I should use bitwise operations and not trust the compiler to do the work for me.
Fixed: incorrect statement of problem.
It's easy enough to check: just fire up your disassembler. Take a look:
f.c:
unsigned int f1(unsigned int n)
{
    n ^= 1;
    return n;
}

unsigned int f2(unsigned int n)
{
    if (n % 2)
        n = n - 1;
    else
        n = n + 1;
    return n;
}
Build and disassemble:
$ cc -O3 -c f.c
$ otool -tV f.o
f.o:
(__TEXT,__text) section
_f1:
00 pushq %rbp
01 movq %rsp,%rbp
04 xorl $0x01,%edi
07 movl %edi,%eax
09 leave
0a ret
0b nopl _f1(%rax,%rax)
_f2:
10 pushq %rbp
11 movq %rsp,%rbp
14 leal 0xff(%rdi),%eax
17 leal 0x01(%rdi),%edx
1a andl $0x01,%edi
1d cmovel %edx,%eax
20 leave
21 ret
It looks like f1() is a bit shorter; whether or not that matters in reality is up to some benchmarking.
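If you're not on a Mac, objdump gives the same view on Linux (standard GNU binutils usage, assuming that toolchain):

$ gcc -O3 -c f.c
$ objdump -d f.o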
The question originally said: "I needed to change n to n-1 if n was even or n to n+1 if n was odd."
In that case, regardless of efficiency, n = n ^ 1 is wrong.
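For completeness, a branch-free one-liner that does match that original (even/odd flipped) wording would be something like this sketch:

n = n + 2 * (n & 1) - 1;  // even: n-1, odd: n+1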
For your second case, == will be just as efficient as (if not more efficient than) any of the others.
In general, when it comes to optimization, you should benchmark it yourself. If a potential optimization is not worth benchmarking, it's not really worth making.
I kind of disagree with most of the answers here, which is why I still find myself replying to a question from 2010 :-)
XOR is, practically speaking, the fastest operation a CPU can possibly do, and the good part is that all CPUs support it. The reason for this is quite simple: an XOR gate can be created with only 4 NAND gates or 5 NOR gates, which means it's easy to create using the fabric of your silicon. Unsurprisingly, all CPUs that I know of can execute your XOR operation in 1 clock tick (or even less).
If you need to XOR multiple items in an array, modern x64 CPUs also support XORing multiple items at once, e.g. via the SIMD instructions on Intel.
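For example, with SSE2 intrinsics you can XOR 16 bytes per instruction. A minimal sketch (the function name and loop structure are mine; it assumes len is a multiple of 4, and a real version would handle the tail):

#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>

void xor_all(unsigned int *data, size_t len, unsigned int key)
{
    __m128i k = _mm_set1_epi32((int)key);                  // broadcast key into 4 lanes
    for (size_t i = 0; i < len; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)&data[i]);  // load 4 ints
        v = _mm_xor_si128(v, k);                           // 4 XORs in one instruction
        _mm_storeu_si128((__m128i *)&data[i], v);          // store them back
    }
}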
The alternative solution you propose uses an if-then-else. True, most compilers are able to figure this easy case out... but why take any chances, and what's the consequence?
The consequence of your compiler not figuring it out is branch prediction errors. A single branch prediction failure can easily take 17 clock ticks. If you take one look at the execution speeds of processor instructions, you'll find that branches are quite bad for your performance, especially when dealing with random data.
Note that this also means that if you construct your test incorrectly, the data will mess up your performance measurements.
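To illustrate the pitfall, consider a loop like this sketch (the names are mine): fed sorted or uniform data the branch predicts almost perfectly, while shuffled data makes it mispredict about half the time, so the same code can benchmark very differently depending on the input.

#include <stddef.h>

unsigned count_big(const unsigned *v, size_t n, unsigned limit)
{
    unsigned count = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= limit)   // predictable on sorted input, random otherwise
            count++;
    return count;
}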
So to conclude: first think, then program, then profile - not the other way around. And use XOR.
About the only way to know for sure is to test. I'd have to agree that it would take a fairly clever compiler to produce output as efficient for:

if (n % 2)   // or (n % 2 == 0) and flip the order
    n = n - 1;
else
    n = n + 1;

as it could for n ^= 1;, but I haven't checked anything similar recently enough to say with any certainty.
As for your second question, I doubt it makes any difference -- an equality comparison is going to end up fast for any of these methods. If you want speed, the main thing to do is avoid having a branch involved at all -- e.g. something like:
if (a == b)
    c += d;
can be written as: c += d * (a == b);
Looking at the assembly language, the second will often look a bit messy (with ugly cruft to get the result of the comparison from the flags into a normal register), but it still often performs better by avoiding any branches.
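On x86-64 that "messy" version typically looks something like this (a hand-written sketch in the same AT&T syntax as the listing above, not actual compiler output; a, b, c, d arrive in %edi, %esi, %edx, %ecx under the SysV convention):

cmpl   %esi, %edi        # compare a and b
sete   %al               # %al = 1 if equal, else 0
movzbl %al, %eax         # widen the flag byte to a full register
imull  %ecx, %eax        # %eax = d * (a == b)
addl   %edx, %eax        # add c; result in %eax, no branch anywhere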
Edit: At least the compilers I have handy (gcc & MSVC) do not generate a cmov for the if, but they do generate a sete for the * (a == b). I expanded the code to something testable.
Edit2: Since Potatoswatter brought up another possibility using bitwise AND instead of multiplication, I decided to test that along with the others. Here's the code with that added:
#include <time.h>
#include <iostream>
#include <stdlib.h>

int addif1(int a, int b, int c, int d) {
    if (a == b)
        c += d;
    return c;
}

int addif2(int a, int b, int c, int d) {
    return c += d * (a == b);
}

int addif3(int a, int b, int c, int d) {
    return c += d & -(a == b);
}

int main() {
    const int iterations = 50000;
    int x = rand();
    unsigned tot1 = 0;
    unsigned tot2 = 0;
    unsigned tot3 = 0;

    clock_t start1 = clock();
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < iterations; j++)
            tot1 += addif1(i, j, i, x);
    }
    clock_t stop1 = clock();

    clock_t start2 = clock();
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < iterations; j++)
            tot2 += addif2(i, j, i, x);
    }
    clock_t stop2 = clock();

    clock_t start3 = clock();
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < iterations; j++)
            tot3 += addif3(i, j, i, x);
    }
    clock_t stop3 = clock();

    std::cout << "Ignore: " << tot1 << "\n";
    std::cout << "Ignore: " << tot2 << "\n";
    std::cout << "Ignore: " << tot3 << "\n";
    std::cout << "addif1: " << stop1 - start1 << "\n";
    std::cout << "addif2: " << stop2 - start2 << "\n";
    std::cout << "addif3: " << stop3 - start3 << "\n";
    return 0;
}
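To reproduce, something like this should work (the file name and flags are my assumption; the original invocations weren't given):

$ g++ -O2 addif.cpp -o addif
$ ./addif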
Now the really interesting part is the results for the third version. For MS VC++, we get roughly what most of us would probably expect:
Ignore: 2682925904
Ignore: 2682925904
Ignore: 2682925904
addif1: 4814
addif2: 3504
addif3: 3021
Using the & instead of the * gives a definite improvement -- almost as much of an improvement as * gives over if. With gcc the result is quite a bit different though:
Ignore: 2680875904
Ignore: 2680875904
Ignore: 2680875904
addif1: 2901
addif2: 2886
addif3: 7675
In this case, the code using if is much closer to the speed of the code using *, but the code using & is slower than either one -- a lot slower! In case anybody cares, I found this surprising enough that I re-compiled a couple of times with different flags, re-ran a few times with each, and so on, and the result was entirely consistent -- the code using & was consistently, considerably slower.
The poor result with the third version of the code compiled with gcc gets us back to what I said to start with [and ends this edit]:
As I said to start with, "the only way to know for sure is to test" -- but at least in this limited testing, the multiplication consistently beats the if. There may be some combination of compiler, compiler flags, CPU, data pattern, iteration count, etc., that favors the if over the multiplication -- there's no question that the difference is small enough that a test going the other direction is entirely believable. Nonetheless, I believe it's a technique worth knowing; for mainstream compilers and CPUs, it seems reasonably effective (though it's certainly more helpful with MSVC than with gcc).

[resumption of edit2:] The result with gcc using & demonstrates the degree to which 1) micro-optimizations can be (and often are) compiler specific, and 2) how different real-life results can be from expectations.
Is n ^= 1 faster than if (n % 2) --n; else ++n;? Yes. I wouldn't expect a compiler to optimize that. Since the bitwise operation is so much more succinct, it might be worthwhile to familiarize yourself with XOR and maybe add a comment on that line of code.
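Something along these lines (the comment wording is just a suggestion):

n ^= 1;  // flip the low bit: odd n becomes n-1, even n becomes n+1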
If it's really critical to the functionality of your program, it could also be considered a portability issue: if you test on your compiler and it's fast, you would likely be in for a surprise when trying on another compiler. Usually this isn't an issue for algebraic optimizations.
Is x ^ y faster than x == y? No. Doing things in roundabout ways is generally not good.
A good compiler will optimize n % 2, but you can always check the assembly produced to see. If you see divides, start optimizing it yourself, because divide is about as slow as it gets.
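For unsigned operands, % 2 and & 1 are the same operation, and a good compiler emits a single AND for both (a quick sketch; the function names are mine):

unsigned is_odd_mod(unsigned n) { return n % 2; }  // typically compiles to a single "and $1", no divide
unsigned is_odd_and(unsigned n) { return n & 1; }  // same machine code

For signed operands the compiler has to account for negatives (in C++, -3 % 2 is -1), so it emits a couple of extra instructions, but still no actual division.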
You should trust your compiler. gcc/g++ is the product of years of development and it's capable of doing any optimizations you're probably thinking of doing. And it's likely that if you start playing around, you'll tamper with its efforts to optimize your code.
n ^= 1 and n1 == n2 are probably the best you can do, but really, if you are after maximum efficiency, don't eyeball the code looking for little things like that.
Here's an example of how to really tune for performance.
Don't expect low level optimizations to help much until sampling has proven they are where you should focus.
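On Linux, for example, perf is one common way to do that sampling (shown as an assumption about your platform, not the only tool):

$ perf record ./myprog
$ perf report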