开发者

Performance of qsort vs std::sort?

According to Scott Meyers, in his Effective STL book - item 46. He claimed that std::sort is about 670% faster than std::qsort due to the fact开发者_如何转开发 of inline. I tested myself, and I saw that qsort is faster :( ! Could anyone help me to explain this strange behavior?

#include <iostream>
#include <vector>
#include <algorithm>

#include <cstdlib>
#include <ctime>
#include <cstdio>

const size_t LARGE_SIZE = 100000;

struct rnd {
    int operator()() {
        return rand() % LARGE_SIZE;
    }
};

int comp( const void* a, const void* b ) {
    return ( *( int* )a - *( int* )b );
}

int main() {
    int ary[LARGE_SIZE];
    int ary_copy[LARGE_SIZE];
    // generate random data
    std::generate( ary, ary + LARGE_SIZE, rnd() );
    std::copy( ary, ary + LARGE_SIZE, ary_copy );
    // get time
    std::time_t start = std::clock();
    // perform quick sort C using function pointer
    std::qsort( ary, LARGE_SIZE, sizeof( int ), comp );
    std::cout << "C quick-sort time elapsed: " << static_cast<double>( clock() - start ) / CLOCKS_PER_SEC << "\n";
    // get time again
    start = std::clock();
    // perform quick sort C++ using function object
    std::sort( ary_copy, ary_copy + LARGE_SIZE );
    std::cout << "C++ quick-sort time elapsed: " << static_cast<double>( clock() - start ) / CLOCKS_PER_SEC << "\n";
}

This is my result:

C quick-sort time elapsed: 0.061
C++ quick-sort time elapsed: 0.086
Press any key to continue . . .

Update

Effective STL 3rd Edition ( 2001 )

Chapter 7 Programming with STL

Item 46: Consider function objects instead of functions as algorithm parameters.


std::clock() is not a viable timing clock. You should use a platform-specific higher resolution timer, like the Windows High Performance Timer. More than that, the way that you call clock() is that first, text is output to the console, which is included in the time. This definitely invalidates the test. In addition, make sure that you compiled with all optimizations.

Finally, I copied and pasted your code, and got 0.016 for qsort and 0.008 for std::sort.


I am surprised that no one mentions caches.

In your code, you start by touching ary and ary_copy so they are resident in the cache at the time of qsort. During qsort, ary_copy might get evicted. At the time of std::sort, the elements would have to be fetched from memory or a larger (read slower) cache level. This will of course depend on your cache sizes.

Try to reverse the test, i.e., start by running std::sort.

As some people have pointed out; making the array larger will make the test more fair. The reason is that a large array is less likely to fit in cache.


The two sorting algorithms, without optimizations enabled, should have comparable performance. The reason that the C++ sort tends to appreciably beat qsort is that the compiler can inline the comparisons being made, since the compiler has type information about what function is being used to perform the comparison. Did you run these tests with optimization enabled? If not, try turning it on and running this test again.


Another reason that qsort may perform much better than expected is that newer compilers can inline and optimize through the function pointer.

If the C header defines an inline implementation of qsort instead of implementing it inside of a library and the compiler supports indirect function inlining, then qsort can be just as fast as std::sort.


On my machine adding some meat (making the array 10 million elements and moving it in the data section) and compiling with

g++ -Wall -O2 -osortspeed sortspeed.cpp

I get as result

C quick-sort time elapsed: 3.48
C++ quick-sort time elapsed: 1.26

Be also careful about modern "green" CPUs that may be configured to run at a variable speed depending on the load of the system. When benchmarking, this kind of behavior can drive you crazy (on my machine I've a small script that fixes CPU clock that I use when making speed tests).


Writing accurate benchmarks is difficult, so let's get Nonius to do it for us! Let's test qsort, std::sort with no inlining, and std::sort with inlining (the default) on a vector of a million random integers.

// sort.cpp
#define NONIUS_RUNNER
#include <nonius.h++>
#include <random>
#include <algorithm>

// qsort
int comp(const void* a, const void* b) {
    const int arg1 = *static_cast<const int*>(a);
    const int arg2 = *static_cast<const int*>(b);

    // we can't simply return a - b, because that might under/overflow
    return (arg1 > arg2) - (arg1 < arg2);
}

// std::sort with no inlining
struct compare_noinline {
    __attribute__((noinline)) bool operator()(const int a, const int b) {
        return a < b;
    }
};

// std::sort with inlining
struct compare {
    // the compiler will automatically inline this
    bool operator()(const int a, const int b) {
        return a < b;
    }
};

std::vector<int> gen_random_vector(const size_t size) {

    std::random_device seed;
    std::default_random_engine engine{seed()};
    std::uniform_int_distribution<int> dist{std::numeric_limits<int>::min(), std::numeric_limits<int>::max()};

    std::vector<int> vec;
    for (size_t i = 0; i < size; i += 1) {
        const int rand_int = dist(engine);
        vec.push_back(rand_int);
    }

    return vec;
}

// generate a vector of a million random integers
constexpr size_t size = 1'000'000;
static const std::vector<int> rand_vec = gen_random_vector(size);

NONIUS_BENCHMARK("qsort", [](nonius::chronometer meter) {

    // Nonius does multiple runs of the benchmark, and each one needs a new
    // copy of the original vector, otherwise we'd just be sorting the same
    // one over and over
    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::qsort(current_vec.data(), current_vec.size(), sizeof(int), comp);

        return current_vec;
    });
});

NONIUS_BENCHMARK("std::sort noinline", [](nonius::chronometer meter) {

    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::sort(current_vec.begin(), current_vec.end(), compare_noinline{});

        return current_vec;

    });
});

NONIUS_BENCHMARK("std::sort inline", [](nonius::chronometer meter) {

    const size_t runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs};
    std::fill(vectors.begin(), vectors.end(), rand_vec);

    meter.measure([&](const size_t run) {

        std::vector<int>& current_vec = vectors[run];

        std::sort(current_vec.begin(), current_vec.end(), compare{});

        return current_vec;

    });
});

Compiling with Apple Clang 7.3.0,

$ clang++ -std=c++14 -stdlib=libc++ -O3 -march=native sort.cpp -o sort
$ ./sort

and running it on my 1.7 GHz i5 Macbook Air, we get

qsort                211 ms +/- 6 ms
std::sort noinline   127 ms +/- 5 ms
std::sort inline      87 ms +/- 4 ms

So std::sort with no inlining is about 1.7x faster than qsort (perhaps due to different sorting algorithms), and inlining bumps that up to about 2.4x faster. Certainly an impressive speedup, but much less than 670%.


Why there's no one mentions that extra memory fetch in the C standard library's compare function?

In the C standard library, void* is used to hold all types of member data, which means that when the member data is actually accessed, the void* pointer must be performed once additional dereference.

struct list {
        void *p_data; // deference this pointer when access real data
        struct list *prev;
        struct list *next;
};

However, in STL, with the help of template's code generation capabilities, member data saved with void* in the C standard library can be directly placed inside the type, avoiding additional dereferences during access.

template <typename T>
class list {
public:
        T data; // access data directly
        list *prev;
        list *next;
};

So, in theory, std::sort is faster than qsort.


I am not sure with 670% faster. It must have been a specific dataset tailored to show the speed of std::sort. In general, std::sort is indeed faster than qsort because of a couple of these things:

  1. qsort operates on void*, which first requires a dereference, and second requires the size of the data type to perform the swaps. Therefore, the swap operation of qsort is done every byte. Look at qsort implementation and notice its SWAP macro is a loop. Here's the video by Jon Bentley explaining the timing differences (starts at 45m): https://www.youtube.com/watch?v=aMnn0Jq0J-E&t=2700s

  2. The inline may make it speed up a bit, but that's micro-optimization, not the major contributor.

  3. std::sort is actually a hybrid algorithm called Introsort. C qsort is a pure quicksort implementation. Given a dataset that's terrible for quicksort, std::sort changes to heapsort instead. So if you create a bad input for qsort, it will be unbearably slow.

  4. The profiling code above is insufficient. Input size should be increased. 100K is hardly enough. Increase it to 1M or 10M, then repeat the sorting multiple times then take the average or median. If necessary, compile them into a separate binary and run them separately.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜