How can I test the performance of a C function?
Are there some good ways to know how a function performs in C? I would like to, for example, compare my own function to a library function.
You need high-resolution timers.
On Linux, gettimeofday() is a decent choice; it gives you microsecond resolution. On Windows, QueryPerformanceCounter() is typical. Make sure you run your function many times to get stable readings.
Quick sample, for Linux:
#include <stdio.h>
#include <sys/time.h>

struct timeval t0, t1;
unsigned int i;

gettimeofday(&t0, NULL);
for (i = 0; i < 100000; i++)
    function_to_measure();   /* the function under test */
gettimeofday(&t1, NULL);
printf("Did %u calls in %.2g seconds\n", i,
       t1.tv_sec - t0.tv_sec + 1E-6 * (t1.tv_usec - t0.tv_usec));
You would of course adjust the count (100,000) to match the performance of the function. It's best if the function really takes a while to run; otherwise, the loop and/or the function-call overhead might dominate.
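For the Windows side mentioned above, a roughly equivalent sketch using QueryPerformanceCounter() and QueryPerformanceFrequency() might look like this (function_to_measure() is again the function under test):

#include <stdio.h>
#include <windows.h>

LARGE_INTEGER freq, start, stop;
unsigned int i;

QueryPerformanceFrequency(&freq);            /* counter ticks per second */
QueryPerformanceCounter(&start);
for (i = 0; i < 100000; i++)
    function_to_measure();                   /* the function under test */
QueryPerformanceCounter(&stop);
printf("Did %u calls in %.2g seconds\n", i,
       (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart);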
The open-source Callgrind profiler (for Linux) is a really awesome way to measure performance. Coupled with KCacheGrind, you get really great visualizations of where your time is spent.
Callgrind is part of Valgrind.
- Art
Hello, I will give you an example and explain it:
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start_clk = clock();

    /*
     * put any code here
     */

    printf("Processor time used by program: %Lg sec.\n",
           (clock() - start_clk) / (long double) CLOCKS_PER_SEC);
    return 0;
}
Output: Processor time used by program: ... sec. (the exact value depends on the code you put in place of the comment)
time.h:
declares clock_t, which is an arithmetic time value (you can do math on it, as I do in the example). Basically, put any code where the comment is.
CLOCKS_PER_SEC is a macro declared in time.h; use it as the denominator to convert the value into seconds.
It is essential to cast to long double for two reasons:
- we don't know what type clock_t actually is, but we want to print it (casting to long double lets us use the %Lg conversion in printf).
- long double is a very precise type which can represent really small values.
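To address the original question of comparing your own function to a library one, you can apply the same clock() pattern to each candidate. Here is a minimal sketch, assuming a hypothetical hand-rolled my_strlen() compared against the library's strlen() (keep in mind that an optimizing compiler may fold strlen() of a string literal at compile time, so real comparisons need some care):

#include <stdio.h>
#include <string.h>
#include <time.h>

/* hypothetical hand-rolled replacement for strlen() */
static size_t my_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

int main(void)
{
    const char *text = "a reasonably long test string for the benchmark";
    volatile size_t sink = 0;   /* keeps the compiler from optimizing the calls away */
    long i;
    clock_t t;

    t = clock();
    for (i = 0; i < 10000000L; i++)
        sink += my_strlen(text);
    printf("my_strlen: %Lg sec.\n", (clock() - t) / (long double) CLOCKS_PER_SEC);

    t = clock();
    for (i = 0; i < 10000000L; i++)
        sink += strlen(text);
    printf("strlen:    %Lg sec.\n", (clock() - t) / (long double) CLOCKS_PER_SEC);

    return 0;
}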
Run it (them) several million times (each) and measure the time it takes.
The one that completes faster performs better.
gprof can help :)
Here's the result of gprof when I run a program of mine for 10 seconds (function names changed)
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 60.29      8.68     8.68 115471546     0.00     0.00  workalot
 39.22     14.32     5.64        46   122.70   311.32  work_b
  0.49     14.39     0.07                              inlined
  0.07     14.40     0.01        46     0.22     0.22  work_c
  0.00     14.40     0.00       460     0.00     0.00  find_minimum
  0.00     14.40     0.00       460     0.00     0.00  feedback
  0.00     14.40     0.00        46     0.00     0.00  work_a
Fred, I notice that you said in a comment that you're on OS X. The best way to get very accurate timings of small-scale functions on OS X is with the mach_absolute_time() function. You can use it as follows:
#include <mach/mach_time.h>
#include <stdint.h>
int loopCount;
int iterations = 1000000;   /* however many calls you want to average over */
uint64_t startTime = mach_absolute_time();

for (loopCount = 0; loopCount < iterations; ++loopCount) {
    functionBeingTimed();
}

uint64_t endTime = mach_absolute_time();
double averageTime = (double)(endTime - startTime) / iterations;
This gives you the average timing across iterations calls to the function. This can be affected somewhat by effects outside of your process on the system; thus, you may instead want to take the fastest time:
#include <mach/mach_time.h>
#include <stdint.h>
int loopCount;
int iterations = 1000000;   /* however many calls you want to sample */
double bestTime = __builtin_inf();

for (loopCount = 0; loopCount < iterations; ++loopCount) {
    uint64_t startTime = mach_absolute_time();
    functionBeingTimed();
    uint64_t endTime = mach_absolute_time();
    /* assign, don't redeclare: update the bestTime declared above */
    bestTime = __builtin_fmin(bestTime, (double)(endTime - startTime));
}
This can have its own problems, especially if the function being timed is very very fast. You need to think about what you are really trying to measure and pick an approach that is scientifically justified (good experimental design is hard). I often use a hybrid between these two approaches as a first attempt at measuring a novel task (a minimum of averages over many calls).
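As a rough illustration of that hybrid approach, still assuming the functionBeingTimed() and count placeholders from the snippets above, you can take the minimum over several batches, where each batch reports the average over many calls:

#include <mach/mach_time.h>
#include <stdint.h>

const int batches = 16;            /* number of samples to take the minimum over */
const int callsPerBatch = 10000;   /* each sample averages this many calls */
double bestAverage = __builtin_inf();
int batch, call;

for (batch = 0; batch < batches; ++batch) {
    uint64_t startTime = mach_absolute_time();
    for (call = 0; call < callsPerBatch; ++call) {
        functionBeingTimed();
    }
    uint64_t endTime = mach_absolute_time();
    double average = (double)(endTime - startTime) / callsPerBatch;
    bestAverage = __builtin_fmin(bestAverage, average);
}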
Note also that in the code samples above, the timings are in "mach time units". If you just want to compare algorithms, this is usually fine. For some other purposes, you may want to convert them to nanoseconds or cycles. To do this, you can use the following functions:
#include <mach/mach_time.h>
#include <sys/sysctl.h>
#include <stdint.h>
double ticksToNanoseconds(double ticks) {
static double nanosecondsPerTick = 0.0;
// The first time the function is called
// ask the system how to convert mach
// time units to nanoseconds
if (0.0 == nanosecondsPerTick) {
mach_timebase_info_data_t timebase;
// to be completely pedantic, check the return code of this call:
mach_timebase_info(&timebase);
nanosecondsPerTick = (double)timebase.numer / timebase.denom;
}
return ticks * nanosecondsPerTick;
}
double nanosecondsToCycles(double nanoseconds) {
static double cyclesPerNanosecond = 0.0;
// The first time the function is called
// ask the system what the CPU frequency is
if (0.0 == cyclesPerNanosecond) {
uint64_t freq;
size_t freqSize = sizeof(freq);
// Again, check the return code for correctness =)
sysctlbyname("hw.cpufrequency", &freq, &freqSize, NULL, 0L );
cyclesPerNanosecond = (double)freq * 1e-9;
}
return nanoseconds * cyclesPerNanosecond;
}
Be aware that the conversion to nanoseconds will always be sound, but the conversion to cycles can go awry in various ways, because modern CPUs do not run at one fixed speed. Nonetheless, it generally works pretty well.
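For example, assuming bestTime holds a result in mach time units from one of the loops above, you might report it like this:

#include <stdio.h>

/* bestTime is a result in mach time units, e.g. from the fastest-time loop above */
double ns = ticksToNanoseconds(bestTime);
double cycles = nanosecondsToCycles(ns);
printf("fastest call: %.1f ns (~%.0f cycles)\n", ns, cycles);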
All of these other answers are using some variant of gettimeofday() for timing. This is pretty crude, since you usually need to run the kernel many times to get reproducible results. Putting it in a tight loop changes the state of both the code and data caches, so these results may not be indicative of the real performance.
A much better alternative is to actually use the CPU cycle counter. On x86, you can do this with the rdtsc
instruction. This is from x264:
static inline uint32_t read_time(void)
{
uint32_t a = 0;
#if defined(__GNUC__) && (defined(ARCH_X86) || defined(ARCH_X86_64))
asm volatile( "rdtsc" :"=a"(a) ::"edx" );
#elif defined(ARCH_PPC)
asm volatile( "mftb %0" : "=r" (a) );
#elif defined(ARCH_ARM) // ARMv7 only
asm volatile( "mrc p15, 0, %0, c9, c13, 0" : "=r"(a) );
#endif
return a;
}
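A minimal way to use it, with my_function() standing in for the code under test (note that rdtsc is not a serializing instruction, so a careful microbenchmark would also need to keep the CPU from reordering work around it):

uint32_t t0, t1;

t0 = read_time();
my_function();                 /* stand-in for the code under test */
t1 = read_time();
printf("elapsed: %u ticks\n", t1 - t0);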
For more on profiling using various hardware counters, see PAPI. For some purposes, simulators (like Callgrind) and interrupt-based profilers (OProfile) are useful.
Store the system time before you enter the function. Store the system time after you return from the function. Take the difference and compare the two implementations.
Check out HighResTimer for a high-performance timer.
You'll probably find that storing the time before/after isn't accurate enough and will probably result in 0 unless you have a longer-running function.
Check out RDTSC, but it is better to do it as described below (a sketch follows the list).
0 - Call the system's Sleep or Yield function so that when it returns, you have a fresh timeslice
1 - RDTSC
2 - Call your function
3 - RDTSC
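A rough sketch of those steps on a POSIX system with GCC or Clang, using the compilers' __rdtsc() intrinsic and sched_yield(); my_function() here is just a stand-in for whatever you want to measure:

#include <sched.h>       /* sched_yield() */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

static void my_function(void)   /* replace with the function being measured */
{
    volatile int x = 0;
    int i;
    for (i = 0; i < 1000; i++)
        x += i;
}

int main(void)
{
    uint64_t before, after;

    sched_yield();        /* step 0: start on a fresh timeslice */
    before = __rdtsc();   /* step 1 */
    my_function();        /* step 2 */
    after  = __rdtsc();   /* step 3 */

    printf("elapsed: %llu cycles\n", (unsigned long long)(after - before));
    return 0;
}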
If your function is a long-running one, you have to use some sort of profiling tool like gprof (it is very easy to use) or Intel's VTune (which I have not used for a long time). After seeing Art's answer, I changed my mind from gprof to Callgrind. In the past I used only the Memcheck tool of Valgrind, and it was a magnificent tool. I have not used Callgrind before, but I am sure it is better than gprof...
Store a timestamp before entering the function
Store a timestamp after exiting the function
Compare the timestamps
Make sure to use a significant sample size, as the timer resolution might skew your results. This is especially true for short-duration functions. Use high-resolution timers (microsecond resolution is available on most platforms); a sketch using POSIX clock_gettime() follows.
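For instance, on POSIX systems clock_gettime() with CLOCK_MONOTONIC gives you nanosecond-resolution timestamps. A minimal sketch (function_under_test() is a placeholder for whatever you are measuring; on older glibc you may need to link with -lrt):

#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#include <stdio.h>
#include <time.h>

static void function_under_test(void)   /* replace with the code you want to measure */
{
    volatile double x = 0.0;
    int i;
    for (i = 0; i < 100; i++)
        x += i * 0.5;
}

int main(void)
{
    struct timespec before, after;
    double elapsed;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &before);
    for (i = 0; i < 1000000; i++)
        function_under_test();
    clock_gettime(CLOCK_MONOTONIC, &after);

    elapsed = (after.tv_sec - before.tv_sec)
            + 1e-9 * (after.tv_nsec - before.tv_nsec);
    printf("1000000 calls took %.6f seconds\n", elapsed);
    return 0;
}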
As the simplest and most portable approach, you can use the standard function time(), which returns the current number of seconds since the Epoch.
#include <stdio.h>
#include <time.h>

time_t starttime, endtime;
int i;

starttime = time(NULL);
for (i = 0; i < 1000000; i++)
{
    testfunc();
}
endtime = time(NULL);

printf("Time in seconds is %d\n", (int)(endtime - starttime));
Adjust the number of iterations to your needs. If one function call needs 5 seconds then you need a laaarge cup of coffee for 1000000 iterations... When the difference is less than 1 second, even for a large number, you should 1) ask yourself if it matters and if yes, 2) check if your favorite compiler already has builtin profiling functions.