Where is the virtual function call overhead?

I'm trying to benchmark the difference between a function pointer call and a virtual function call. To do this, I have written two pieces of code that do the same mathematical computation over an array. One variant uses an array of pointers to functions and calls those in a loop. The other variant uses an array of pointers to a base class and calls its virtual function, which is overridden in the derived classes to do exactly the same thing as the functions in the first variant. Then I print the time elapsed and use a simple shell script to run the benchmark many times and compute the average run time.

Here is the code:

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>

using namespace std;

// Difference (A - B) in nanoseconds; the LL suffix keeps the multiplication in 64 bits.
long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
    return ((timeA_p->tv_sec * 1000000000LL) + timeA_p->tv_nsec) -
           ((timeB_p->tv_sec * 1000000000LL) + timeB_p->tv_nsec);
}

void function_not( double *d ) {
    *d = sin(*d);
}

void function_and( double *d ) {
    *d = cos(*d);
}

void function_or( double *d ) {
    *d = tan(*d);
}

void function_xor( double *d ) {
    *d = sqrt(*d);
}

void ( * const function_table[4] )( double* ) = { &function_not, &function_and, &function_or, &function_xor };

int main(void)
{
    srand(time(0));
    void ( * index_array[100000] )( double * );
    double array[100000];
    for ( long int i = 0; i < 100000; ++i ) {
        index_array[i] = function_table[ rand() % 4 ];
        array[i] = ( double )( rand() / 1000 );
    }

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for ( long int i = 0; i < 100000; ++i ) {
        index_array[i]( &array[i] );
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    unsigned long long time_elapsed = timespecDiff(&end, &start);
    cout << time_elapsed / 1000000000.0 << endl;
}

and here is the virtual function variant:

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>

using namespace std;

// Same helper as in the first program: difference (A - B) in nanoseconds, in 64 bits.
long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
    return ((timeA_p->tv_sec * 1000000000LL) + timeA_p->tv_nsec) -
           ((timeB_p->tv_sec * 1000000000LL) + timeB_p->tv_nsec);
}

class A {
public:
    virtual void calculate( double *i ) = 0;
};

class A1 : public A {
public:
    void calculate( double *i ) {
        *i = sin(*i);
    }
};

class A2 : public A {
public:
    void calculate( double *i ) {
        *i = cos(*i);
    }
};

class A3 : public A {
public:
    void calculate( double *i ) {
        *i = tan(*i);
    }
};

class A4 : public A {
public:
    void calculate( double *i ) {
        *i = sqrt(*i);
    }
};

int main(void)
{
    srand(time(0));
    A *base[100000];
    double array[100000];
    for ( long int i = 0; i < 100000; ++i ) {
        array[i] = ( double )( rand() / 1000 );
        switch ( rand() % 4 ) {
        case 0:
            base[i] = new A1();
            break;
        case 1:
            base[i] = new A2();
            break;
        case 2:
            base[i] = new A3();
            break;
        case 3:
            base[i] = new A4();
            break;
        }
    }

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for ( int i = 0; i < 100000; ++i ) {
        base[i]->calculate( &array[i] );
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    unsigned long long time_elapsed = timespecDiff(&end, &start);
    cout << time_elapsed / 1000000000.0 << endl;
}

My system is Linux, Fedora 13, with gcc 4.4.2. The code is compiled with g++ -O3. The first one is test1, the second is test2.

Now I see this in console:

[Ignat@localhost circuit_testing]$ ./test1 && ./test1
0.0153142
0.0153166

Well, more or less the same, I think. And then, this:

[Ignat@localhost circuit_testing]$ ./test2 && ./test2 
0.01531
0.0152476

Where is the 25% difference which should be visible? How can the first executable be even slower than the second one?

I'm asking this because I'm doing a project which involves calling a lot of small functions in a row like this in order to compute the values of an array, and the code I've inherited does a very complex manipulation to avoid the virtual function call overhead. Now where is this famous call overhead?


In both cases you are calling functions indirectly. In one case through your table of function pointers, and in the other through the compiler's array of function pointers (the vtable). Not surprisingly, two similar operations give you similar timing results.
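
To see what that means, here is a rough hand-rolled sketch of what a vtable-based call boils down to (the exact object layout is ABI-specific, so treat this as an illustration, not the real generated code): every object carries a hidden pointer to a per-class table of function pointers, and the call site loads that pointer, loads the slot, and calls indirectly, much like indexing your own function table.

#include <cmath>

// Illustrative stand-in for the compiler's vtable mechanism.
struct FakeVTable {
    void (*calculate)(void *self, double *d);   // one slot per virtual function
};

struct FakeObject {
    const FakeVTable *vptr;                     // the hidden "vptr" member
};

static void calc_sin(void *, double *d) { *d = std::sin(*d); }

static const FakeVTable sin_vtable = { &calc_sin };

static void virtual_style_call(FakeObject *obj, double *d) {
    obj->vptr->calculate(obj, d);               // load vptr, load slot, indirect call
}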


Virtual functions may be slower than regular functions, but that's mostly down to things like inlining: a direct call to a known function can be inlined, while a call through a function table can't be, so the lookup cost is pretty much the same in both of your tests. Looking up through your own lookup table is of course going to cost about the same as looking up through the compiler's lookup table (the vtable).
Edit: Or your own table may even be slower, because the compiler knows a lot more than you do about things like the processor cache.
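
As a toy illustration (hypothetical functions, and of course subject to what the optimizer actually decides), a direct call to a small function can disappear entirely, while a call whose target is only known at run time has to stay an indirect call:

// The compiler sees the body of a directly called function and can inline it;
// a call through a function pointer with an unknown target usually cannot be inlined.
inline void scale(double *d) { *d *= 2.0; }

void direct(double *d) {
    scale(d);                        // typically inlined: reduces to a single multiply
}

void indirect(void (*op)(double *), double *d) {
    op(d);                           // target unknown at compile time: real indirect call
}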


I think you are seeing the difference, but it's only the call overhead itself that differs. Branch misprediction, memory access and the trig functions are the same in both cases. Compared to those, it's just not that big a deal, though the function pointer case was definitely a bit quicker when I tried it.

If this is representative of your larger program, this is a good demonstration that this type of micro-optimization is sometimes just a drop in the ocean, and at worst futile. But leaving that aside, for a clearer test the functions should perform some simpler operation that is different for each function:

void function_not( double *d ) {
    *d = 1.0;
}

void function_and( double *d ) {
    *d = 2.0;
}

And so on, and similarly for the virtual functions.

(Each function should do something different, so that they don't get elided and all end up with the same address; that would make the branch prediction work unrealistically well.)
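
For reference, here is a sketch of how the virtual versions could be simplified in the same way, mirroring the class hierarchy from the question:

class A {
public:
    virtual void calculate( double *i ) = 0;
};

class A1 : public A {
public:
    void calculate( double *i ) { *i = 1.0; }
};

class A2 : public A {
public:
    void calculate( double *i ) { *i = 2.0; }
};

// ...and likewise for A3 and A4, each storing a different constant.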

With these changes, the results are a bit different. Best of 4 runs in each case. (Not very scientific, but the numbers are broadly similar for larger numbers of runs.) All timings are in cycles, running on my laptop. The code was compiled with VC++ (only the timing code was changed), but gcc implements virtual function calls in the same way, so the relative timings should be broadly similar even with a different OS/x86 CPU/compiler.

Function pointers: 2,052,770

Virtuals: 3,598,039

That difference seems a bit excessive! Sure enough, the two bits of code aren't quite the same in terms of their memory access behaviour. The second one should have a table of 4 A *s, used to fill in base, rather than newing up a fresh object for each entry. Both examples will then have similar behaviour (1 cache miss per N entries) when fetching the pointer to jump through. For example:

A *tbl[4] = { new A1, new A2, new A3, new A4 };
for ( long int i = 0; i < 100000; ++i ) {
    array[i] = ( double )( rand() / 1000 );
    base[i] = tbl[ rand() % 4 ];
}

With this in place, still using the simplified functions:

Virtuals (as suggested here): 2,487,699

So there's 20%, best case. Close enough?

So perhaps your colleague was right to at least consider this, but I suspect that in any realistic program the call overhead won't be enough of a bottleneck to be worth jumping through hoops over.


Nowadays, on most systems, memory access is the primary bottleneck, not the CPU. In many cases there is little significant difference between virtual and non-virtual functions - they usually represent a very small portion of execution time. (Sorry, I don't have reported figures to back this up, just empirical data.)

If you want the best performance, you will get more bang for your buck by looking into how to parallelize the computation to take advantage of multiple cores/processing units, rather than worrying about the micro-details of virtual vs non-virtual functions.
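
For example, if the per-element work is independent, as it is here, even a single OpenMP pragma spreads the loop across cores (a hypothetical sketch; compile with -fopenmp, which gcc 4.4 supports):

#include <cmath>

// Each element is computed independently, so the loop can be split across
// cores; OpenMP handles the thread creation and work division.
void process_parallel(double *data, long n) {
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        data[i] = std::sin(data[i]);   // stand-in for the per-element work
}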


Many people fall into the habit of doing things just because they are thought to be "faster". It's all relative.

If I'm going to take a 100-mile drive from my home, I have to start by driving around the block. I can drive around the block to the right, or to the left. One of those will be "faster". But will it matter? Of course not.

In this case, the functions that you call are in turn calling math functions.

If you pause the program under the IDE or GDB, I suspect you will find that nearly every time you pause it it will be in those math library routines (or it should be!), and dereferencing an additional pointer to get there (assuming it doesn't bust a cache) should be lost in the noise.

Added: Here is a favorite video: Harry Porter's relay computer. As that thing laboriously clacks away adding numbers and stepping its program counter, I find it helpful to be mindful that that's what all computers are doing, just on a different scale of time and complexity. In your case, think about an algorithm to do sin, cos, tan, or sqrt. Inside, it is clacking away doing those things, and only incidentally following addresses or messing with a really slow memory to get there.


And finally: the function pointer approach has turned out to be the fastest one, which was what I'd expected from the very beginning.
