Writing functional code in C++ III – Performance of different allocation schemes - Luca Bolognese

Writing functional code in C++ III – Performance of different allocation schemes

Luca -

☕ 6 min. read

Now we know how to rep­re­sent records and we know how to op­er­ate on them us­ing a nice F# like syn­tax. But how do we store our record in a data struc­ture in the first place?

Code for this post is here. Thanks to Andy Sawyer and Steve Bower for re­view­ing this.

As it is of­ten the case, C++ gives you many op­tions that are not avail­able in stan­dard func­tional lan­guages. A mixed bless­ing of some sort.

  1. You can store them by value or by pointer
  2. If you store them by pointer, you can use nor­mal point­ers or smart point­ers
  3. If you store them by smart pointer, you can use dif­fer­ent ones

Well, stor­ing them by value is the sim­plest op­tion, but it has two short­com­ings:

  1. You pay the price of a full record copy when­ever the ob­ject need to be copied (i.e. when you add/​re­move records to/​from a vec­tor and when­ever the vec­tor needs to re­size it­self)
  2. If you store a record in two con­tain­ers, you get two dif­fer­ent copies of it. If you mod­ify one, the other won’t be mod­i­fied

The sec­ond is­sue is not a prob­lem in our sce­nario as records, by de­f­i­n­i­tion, are im­mutable and struc­turally equiv­a­lent (aka equal­ity for them does­n’t mean pointer equal­ity). The first is­sue is more tricky. Our records get a copy con­struc­tor by de­fault which does a memcpy’ of the ob­ject. That should be fast. But fast enough?

To an­swer this ques­tion (and more), I em­barked in a lit­tle per­for­mance test that en­com­passes most ways to store a pod type in an vec­tor like data struc­ture. For all my test I used this struct:

struct Record {
    int Id;
    char k1[2];
    char k2[2];
    char k3[2];
    char k4[2];
    char k5[2];
    char k6[2];
    char mem[bigBlock];
    void Lock() {}
    void Unlock() {}
};

By re­com­pil­ing with dif­fer­ent val­ues for bigBlock’ I can test what hap­pens when the record size grows. In all tests a record is ini­tial­ized with this func­tion:

void record_init(Record& r, int i) {
    r.Id = i;
    strcpy(r.k1, "0");
    strcpy(r.k2, "0");
    strcpy(r.k3, "0");
    strcpy(r.k4, "0");
    strcpy(r.k5, "0");
    strcpy(r.k6, "0");
    memset(r.mem, '-', bigBlock);
}

The tests are spe­cific to the sce­nario I care about: per­form­ing func­tional-like op­er­a­tions on a Range:

struct to_int {
    typedef int result_type;
    int operator() (const Record& r) const {
        return r.Id;
    };
};
function<bool (const Record&)> filter_f = [](const Record& r) {return r.Id < filterNo;};
template <class Range>
int accumulate_filter(const Range& r) {
    return boost::accumulate(
        r | filtered(filter_f) | transformed(to_int()),
        0,
        plus<int>());
}

The us­age of a func­tion and  a func­tor is a bit odd. I don’t re­call why I did it that way, but it does­n’t mat­ter as it is the same for each test. What the test does is just fil­ter­ing a bunch of record, trans­form­ing them (map) to ints and sum these ints.

How many rep­e­ti­tions of each test are used, how big is the record, how many records the Range con­tains is spec­i­fied in these con­stants:

const int repetitions = 1000;
const int bigBlock = 1000;
const int howMany = 1000;

Time is kept us­ing this clock that wraps boost::chrono:

typedef boost::chrono::process_cpu_clock the_clock;
struct timer {
    timer(): clock_(the_clock::now()) {}
    the_clock::times elapsed() {
        return (the_clock::now() - clock_).count();
    }
    the_clock::time_point clock_;
};

I have tested the fol­low­ing con­fig­u­ra­tions. Adding records by value:

int normal() {
    vector<Record> v;
    for (int i = 0; i < howMany; ++i) {
        Record r;
        record_init(r, i);
        v.push_back(r);
    }
    return accumulate_filter(v);
}

I then tested adding records us­ing a pod_vec­tor. This is a data struc­ture de­scribed here and in Imperfect C++”. It is a vec­tor that uses as an au­to_buffer as the un­der­ly­ing stor­age. An au­to_buffer is a class that uses stack mem­ory if it needs more than a cer­tain num­ber of bytes spec­i­fied at com­pile time, oth­er­wise it uses heap mem­ory. It works well if you know at com­pile time that most al­lo­ca­tions for some­thing take at most N bytes, but some might need more. This sit­u­a­tion is sur­pris­ingly fre­quent. Unfortunately, your ob­jects need to be POD to use the pod_vec­tor.

I tested it in the case where the stack space is big enough (podVector<howMany*2>) and when it is not (podVector<howMany/10>).

int podVector() {
    pod_vector<Record, size> v;
    for (int i = 0; i < howMany; ++i) {
        Record r;
        record_init(r, i);
        v.push_back(r);
    }
    return accumulate_filter(v);
}

I then tested just al­lo­cat­ing the mem­ory, with­out free­ing it and us­ing boost::pool in it’s local’ form, which means that mem­ory is freed when it goes out of scope (stack-like). The main draw­back of the lat­ter is that you can­not re­turn the con­tainer/​range.

int boostallocator(WhichOne which) {
    boost::pool<> p(sizeof(Record));
    vector<Record*> v;
    Record* r;
    for (int i = 0; i < howMany; ++i) {
        if(which == Boost)
            r = (Record*)p.malloc(); // memory freed at function exit
        else
            r = new Record; // oops, memory leak
        record_init(*r, i);
        v.push_back(r);
    }
    return accumulate_filter(v | indirected);
}

Indirected is needed be­cause we are not talk­ing about point­ers. I also tested var­i­ous vari­a­tions of shared point­ers. First a nor­mal shared_ptr, then a shared_ptr ini­tial­ized with the boost::sin­gle­ton_pool and fi­nally a pointer ini­tial­ized with make_shared.

int pointers(WhichOne which) {
    vector<shared_ptr<Record>> v;
    for (int i = 0; i < howMany; ++i) {
        shared_ptr<Record> r;
        if(which == Normal)
            r = shared_ptr<Record>(new Record);
        else if(which == Boost)
            r = shared_ptr<Record>((Record*)record_pool::malloc(), [](void* p) {record_pool::free(p);});
        else if(which == Make)
            r = make_shared<Record>();
        else throw "This kind of pointer doesn't exist";
        record_init(*r, i);
        v.push_back(r);
    }
    return accumulate_filter(v | indirected);
}

Finally, I used a Loki smart pointer. This is a very el­e­gantly de­signed smart point­ers from  Modern C++ de­sign. You pass as tem­plate pa­ra­me­ters the poli­cies you want your smart pointer to have (aka how it should be­have). I tested it like so:

typedef Loki::SmartPtr<Record,
                 Loki::RefCounted,
                 Loki::DisallowConversion,
                 Loki::AssertCheck,
                 Loki::DefaultSPStorage,
                 LOKI_DEFAULT_CONSTNESS> RecordPtr;

And us­ing the fol­low­ing, slightly more con­vo­luted code:

int lokipointers(WhichOne) {
    vector<RecordPtr> v;
    for (int i = 0; i < howMany; ++i) {
        RecordPtr r = RecordPtr(new Record());
        record_init(*r, i);
        v.push_back(r);
    }
    auto ret = accumulate_filter(v | transformed(RecordPtr::GetPointer) | indirected);
    return ret;
}

The re­sult of my tests are in this table and at this link. I used VS10 and gcc 4.6.2 on a Windows 7 box. The first num­ber in the (S, N) pair at the top of each col­umn rep­re­sents  the size of the record, the sec­ond one rep­re­sents the num­ber of ob­jects in the vec­tor. To ob­tain rea­son­able num­bers, the tests have been run with a dif­fer­ent num­ber of it­er­a­tion for each pair, which means that you can com­pare the re­sults ver­ti­cally, but not hor­i­zon­tally.

Results

There are a lot of in­ter­est­ing things in this table. Here are my take­aways. They are ob­vi­ously spe­cific to this sin­gle sce­nario. You might have dif­fer­ent sce­nar­ios or dif­fer­ent take­aways:

  • Stack al­lo­ca­tion is not too bad for small ob­jects, es­pe­cially for gcc
  • Pod_vector is a good over­all per­former (even if you don’t get the size right), it is un­for­tu­nate that just works with POD types
  • GCC seems to be lag­ging on the shared_ptr front. Also msvc im­ple­men­ta­tion of the make_shared op­ti­miza­tion gives a vis­i­ble ad­van­tage
  • There is not much ad­van­tage in us­ing the shared pointer with the boost pooled al­lo­ca­tor
  • If you can use the boost lo­cal pool al­lo­ca­tor, you should. It is faster than stack al­lo­ca­tion (in this sce­nario). Remember that the mem­ory is re­claimed when you exit the func­tion …

So here you have it. A small de­tour on the per­for­mance of dif­fer­ent schemes for al­lo­cat­ing pod types. As it is of­ten the case in C++, it de­pends …

BTW: Andy Sawyer told me about his rough al­go­rithm to de­cide which STL con­tainer to use. I re­pro­duce it here:

BEGIN

A de­ci­sion tree for se­lect­ing a se­quence con­tainer:

  • I’m in a rush and don’t want to read the rest: use std::deque

  • Do we know ahead of time ex­actly how many el­e­ments will be needed (and will they all fit on our stack!) - If so, use std::ar­ray.

  • Do we need con­stant-time ran­dom ac­cess? (Note that we of­ten think we do, but ac­tu­ally don’t - YAGNI) If so, then we can elim­i­nate std::list/​std::for­ward_list as can­di­dates.

  • Do we need bidi­rec­tional it­er­a­tion? (Again, we need this less of­ten that we think we do). If so, that elim­i­nates std::for­ward_list

  • Will there be a large num­ber of in-the-mid­dle in­ser­tion/​re­moval?  (and as­sum­ing we haven’t al­ready elim­i­nated them) then std::list/​std::for­ward_list (especially when the con­tained type is ex­pen­sive to move/​copy).  In ex­treme cases, this may be a strong enough re­quire­ment to over­ride other con­straints (e.g. the need for ran­dom ac­cess). On the other hand, it may be vi­able to re­duce the cost of move/​copy of large con­tained types by us­ing con­tain­ers of (smart) point­ers.

  • Do we need the con­tents as a con­tigu­ous ar­ray? use std::vec­tor (and call re­serve  if we can) - some­times, any­way; I’ve seen cases where it’s faster to build my col­lec­tion as a std::deque, then trans­form to std::vec­tor later on.)

  • Use std::deque.

Or, to put it an­other way, use std::deque  un­less there’s a good rea­son not to”.  That’s prob­a­bly an overly-strong state­ment, but it’s a rea­son­able start­ing point.

Naturally, every case will be dif­fer­ent - and no set of rules of thumb’ is go­ing to work every time (this is one of the things which dis­tin­guish rules of thumb from laws of physics). Commonly, we don’t know ahead of time which of those cri­te­ria is go­ing to have the most sig­nif­i­cant im­pact on what we’re do­ing; when I find my­self in that sit­u­a­tion, I’ll use std::deque; once I have a firmer idea of the cri­te­ria, I’ll go back and re­visit that choice (which — quite of­ten — is as sim­ple as chang­ing a type­def). And of course - as with all optimisations’ - mea­sure, mea­sure and mea­sure again!

END

Next time, we’ll go back to more tra­di­tional func­tional pro­gram­ming top­ics.

0 Webmentions

These are webmentions via the IndieWeb and webmention.io.