A fast and thread-safe pool allocator for Qt - Part 2

In part 1 of this blog series, we developed a pool allocator that is optimized for small allocations. We confirmed that we allocate many small objects in Qt, for instance QEvent or QObject instances, and that a specialized allocator might be useful for application developers as well. So far, our solution allocates complete pages of memory as needed, and hands out memory chunks of a fixed size that is specified at compile time through a template parameter. It supports different threading models, with different tradeoffs regarding performance, memory efficiency, and concurrency. The results were very promising, beating the general-purpose allocators by a factor of 3-10 in our multi-threaded benchmarks.

However, with an allocator that can only handle one chunk-size, and never returns memory back to the operating system, we still have a way to go before we can really support our QEvent and QObject use cases within Qt. We can't just make our library waste and hog memory, or require application developers to reimplement operator new/delete to be able to allocate instances of their larger subclasses!

But before we think about adding more complexity, we really need to think about testing, so that's what this blog post is all about.

What do we want to test?

Testing an allocator only using black-box testing strategies is very limiting; we can't be confident that the allocator is implemented correctly if we can only look at the pointers we get out of it. We really want to know that the allocator is in the state we expect it to be. Ideally, we can do this without having to write test code that more or less duplicates what the allocator is already doing; it would mostly just verify that we are capable of copy/pasting code. Also, if we were to access internal data structures, then we would have to lock those during multi-threaded testing, which we definitely do not want to do.

But even with a flawless implementation of an allocator, user-code will also have bugs. Developers should get predictable errors if their code performs out-of-bounds reads or writes, accesses deleted objects, has double-free bugs, or leaks memory. Such accesses to memory would usually cause segmentation violation errors with a useful stack trace. But we operate only in memory that belongs to the process (unless we happen to hit the edge of a memory page), and the operating system doesn't watch over what we do with that memory. Also, using custom allocators introduces a whole new type of error, which is allocator mismatch. All allocators use data structures and memory layouts that are optimized for their efficient operation, so throwing an arbitrary address at an allocator will result in undefined behavior if that address didn't come from that same allocator in the first place.

And lastly, our allocator is to some degree configurable, and designed for rather specific situations. We want to collect data about how the allocator is used under realistic conditions so that we can tune the configurations and validate that this is the right kind of allocator in the first place.

Until we know that the allocator works correctly, we can leave the discovery of user-code bugs aside. Here are some things we want to verify to be confident about our implementation:

  • when we allocate as many chunks as fit on a page, and then free the first chunk we allocated, then the next allocation shouldn't allocate an additional page
  • when we allocate more chunks than fit on a page, then the allocator should allocate a new page
  • in a multi-threaded allocator, arenas that are in use don't get deleted if the respective thread terminates, but are marked as stale
  • in a multi-threaded allocator, a new thread will re-use an arena previously marked as stale
  • in a shared-pool allocator, all threads use the same arena

Testing with Policy

We want to make our allocator testable, without making it slower. An allocator that has to check environment variables or other runtime configurations as part of its code paths in order to produce observable data will have to do some additional work, even if it's just checking whether a function pointer is nullptr before calling it. We want to avoid such additional work; not only does it slow things down, it could also hide subtle race conditions we might have in our multi-threading code.

Fortunately we are using C++, and we are implementing our allocator as a template class. This allows us to use policy-based design, where we declare an interface through which our allocator can report events of significance that happen within its implementation:


enum class AnalyzerEvent {
    Allocate,
    Free,
    ArenaCreate,
    ArenaDestroy,
    PageAllocate,
    PageFree,
    MemoryLeaked
};

// the no-op policy
struct NoAnalyzer
{
    void reportEvent(AnalyzerEvent, void *) {}
};

template<size_t ChunkSize, ThreadModel Threading = MultiThreaded,
         typename Analyzer = NoAnalyzer>
class PoolAllocator : public Analyzer
{
    // ...
    struct Arena
    {
        Arena(PoolAllocator *allocator)
            : m_allocator(allocator)
        {
            m_allocator->reportEvent(AnalyzerEvent::ArenaCreate, this);
        }
        // ...
        bool grow()
        {
            void *ptr = ::mmap(NULL, sizeof(Page), PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            m_allocator->reportEvent(AnalyzerEvent::PageAllocate, ptr);
            // ...
        }
        PoolAllocator *m_allocator;
    };
public:
    // ...
    void *allocate(size_t size)
    {
        reportEvent(AnalyzerEvent::Allocate, &size);
        // ...
    }
    void free(void *ptr)
    {
        reportEvent(AnalyzerEvent::Free, ptr);
        // ...
    }
};

With this pattern, the compiler will by default optimize the call to the empty function away. And since empty base optimization applies, we don't increase the instance size of PoolAllocator. We don't even need a virtual function table! We do need to pass a pointer to our allocator into the Arena instances, where some events will be reported; but we will see later that this pointer has other practical applications, so we pay those 8 bytes per thread.

If an analyzer class is provided, user-code can access its data members directly via the PoolAllocator instance. We need to be careful not to declare any members that conflict with existing members of PoolAllocator, and we need to make the analyzer as thread-safe as the PoolAllocator instance is configured to be. This will typically introduce a mutex, so using an analyzer will have some effects on the behavior and performance of the allocator, especially under highly concurrent circumstances.


struct UsageAnalyzer
{
    QBasicMutex m_mutex;
    int pageCount = 0;

    void reportEvent(AnalyzerEvent t, void *p)
    {
        std::lock_guard lock(m_mutex);
        switch (t) {
        case AnalyzerEvent::PageAllocate:
            ++pageCount;
            break;
        default:
            break;
        }
    }
};

With such an analyzer we can now write test cases that confirm the correct behavior of the allocator, for example to test the first two items from the list above:


void tst_PoolAllocator::test()
{
    PoolAllocator<16, SingleThreaded, UsageAnalyzer> allocator;
    const size_t numChunks = allocator.chunksPerPage();

    std::vector<void *> objects(numChunks);
    QCOMPARE(allocator.pageCount, 0);
    for (size_t i = 0; i < numChunks; ++i)
        objects[i] = allocator.allocate(16);
    QCOMPARE(allocator.pageCount, 1);
    allocator.free(objects[0]);
    objects[0] = allocator.allocate(16);
    QCOMPARE(allocator.pageCount, 1);

    void *object = allocator.allocate(16);
    QCOMPARE(allocator.pageCount, 2);

    allocator.free(object);
    for (size_t i = 0; i < numChunks; ++i)
        allocator.free(objects[i]);
}

We can expand the analyzer class to keep track of the general usage of the allocator in our application, and we can have memory leaks reported to us when the allocator goes out of scope (see the leaky test case in the code review topic). What we can't do efficiently yet is associate observable meta data with individual allocations.

Meta data using a header

Our allocator doesn't require any header with information about the allocation, which gives us - if we ignore the few pointers the allocator itself carries around - 99.8% memory utilization: each 4K page contains one pointer to the arena it belongs to, so we have 4088 bytes for chunks. That's great, but sometimes it's practical to be able to associate some meta data with every allocation. For instance, we might want to confirm that the allocations using our pool allocator are indeed short-lived. A use-case where such meta-data could even be part of a production system would be a transparently implemented reference counter or garbage collection.

The data structure for our Node type has so far been a union: for free nodes, a pointer to the previous node on the free stack; for allocated nodes, an array of bytes:


union Node
{
    Node *previous;
    char chunk[ChunkSize];
};

For the default cases where we don't need a header, we want to stick to that layout. If we do want the header, then we need to add another data member with the necessary space. Since our pool allocator is a template, we can use meta-programming techniques:


struct NoHeader {};

template<size_t ChunkSize, ThreadModel Threading = MultiThreaded,
         typename Header = NoHeader, typename Analyzer = NoAnalyzer>
class PoolAllocator : public Analyzer
{
    static constexpr size_t HeaderSize = std::is_empty<Header>::value
                                       ? 0 : sizeof(Header);
    struct NodeDataWithHeader
    {
        char header[HeaderSize];
        char chunk[ChunkSize];
    };
    struct NodeDataWithoutHeader
    {
        char chunk[ChunkSize];
    };
    using NodeData = typename std::conditional<HeaderSize == 0,
                     NodeDataWithoutHeader, NodeDataWithHeader>::type;
    union Node
    {
        Node *previous;
        NodeData data;
        void initHeader()
        {
            if constexpr (HeaderSize > 0
                      && std::is_default_constructible<Header>::value
                      && !std::is_trivially_default_constructible<Header>::value) {
                (void)new (data.header) Header();
            }
        }
        void destructHeader()
        {
            if constexpr (HeaderSize > 0
                      && std::is_destructible<Header>::value
                      && !std::is_trivially_destructible<Header>::value) {
                reinterpret_cast<Header*>(data.header)->~Header();
            }
        }
        static Node *fromMemory(void *ptr)
        {
            return reinterpret_cast<Node*>(static_cast<char*>(ptr) - HeaderSize);
        }
    };
    // ...

C++ does not allow zero-sized arrays, and the standard requires that each member of a struct has a unique address, so even an empty member would consume at least one byte (sizeof returns 1 for an empty struct). To not waste that byte, we select the appropriate struct at compile time, depending on whether the header type is empty or not. The usage of C++17's if constexpr statement prevents compile time errors when we don't have a header, or if the header doesn't require construction or destruction. Helper functions hide the details, so allocate() and free() don't become a lot more complicated:


    void *allocate(size_t size)
    {
        Q_ASSERT(size <= ChunkSize);
        Node *node = arena()->pop();
        if (!node)
            return nullptr;
        node->initHeader();
        return node->data.chunk;
    }

    void free(void *ptr)
    {
        Node *node = Node::fromMemory(ptr);
        node->destructHeader();
        Arena *arena = Page::fromNode(node)->arena;
        arena->push(node);
    }

Since user code will want to do something with the header, the allocator class can give us access to the header for a pointer:


    static Header *header(void *ptr)
    {
        if constexpr (HeaderSize == 0)
            return nullptr;
        return reinterpret_cast<Header*>(Node::fromMemory(ptr)->data.header);
    }
};

Analyzing

We can now do some interesting things with headers and analyzers. Since we build an allocator optimized for small, short-lived objects, let's confirm that this is in fact how our code allocates objects:


struct TimingHeader
{
    QElapsedTimer timer;
    TimingHeader()
    {
        timer.start();
    }
};

struct UsageAnalyzer
{
    UsageAnalyzer(...)
    {}

    QBasicMutex m_mutex;
    QMap<size_t, int> m_sizeHistogram;
    QMap<qint64, int> m_timeHistogram;
    int m_pageCount = 0;

    void reportEvent(AnalyzerEvent t, void *p);
};

// the biggest event we have seen in our last test was 112 bytes
using EventAllocator = PoolAllocator<112, MultiThreaded, TimingHeader, UsageAnalyzer>;
Q_GLOBAL_STATIC(EventAllocator, eventAllocator);

void UsageAnalyzer::reportEvent(AnalyzerEvent t, void *p)
{
    std::lock_guard lock(m_mutex);
    switch (t) {
    case AnalyzerEvent::Allocate:
        ++m_sizeHistogram[*static_cast<size_t*>(p)];
        break;
    case AnalyzerEvent::Free:
        ++m_timeHistogram[EventAllocator::header(p)->timer.elapsed()];
        break;
    case AnalyzerEvent::PageAllocate:
        ++m_pageCount;
        break;
    default:
        break;
    }
}

void *QEvent::operator new(std::size_t size) noexcept
{
    qDebug() << "Usage:" << eventAllocator()->m_sizeHistogram;
    return eventAllocator()->allocate(size);
}

void QEvent::operator delete(void *ptr) noexcept
{
    qDebug() << "Timing:" << eventAllocator()->m_timeHistogram;
    qDebug() << "Pages:" << eventAllocator()->m_pageCount;
    eventAllocator()->free(ptr);
}

Using this in Qt's qcoreevent.cpp and running any Qt example will indeed confirm that QEvent allocations are both small and very short-lived, with very few exceptions.


Usage: QMap((24, 2367)(32, 26)(112, 557))
Timing: QMap((0, 44)(1, 10)(2, 91)(3, 116)(4, 330)(5, 524)(6, 546)(7, 397)(8, 338)(9, 219)(10, 135)(11, 41)(12, 21)(13, 14)(14, 7)(15, 6)(30, 12)(31, 5)(32, 17)(33, 3)(35, 7)(36, 10)(37, 3)(41, 11)(44, 5)(45, 7)(46, 2)(47, 4)(49, 1)(63, 2)(90, 9)(95, 5)(96, 3)(215, 1)(19800, 2)(19802, 3))
Pages: 4

We can also see that we peak at 4 pages allocated for QEvent instances. 16K is definitely more than we would like, so we need to optimize this... in the next blog post.

User-Code Debugging

Now that we can verify that our allocator is implemented correctly, and that we are using the right allocator in the first place, we can focus on making it easy for users to discover bugs in their code.

For starters, we can add a few trip-wires in debug builds. The allocator being a template helps again, as only the code using it needs to be built in debug mode. For example, we can add a simple double-free detection by traversing the stack and checking if the incoming node is already in it:


void push(Node *node)
{
#ifdef QT_DEBUG
    Node *doubleFreeTest = stack;
    while (doubleFreeTest) {
        if (doubleFreeTest == node)
            qFatal("Double free detected for %p", node->chunk);
        doubleFreeTest = doubleFreeTest->previous;
    }
#endif
    node->previous = stack;
    stack = node;
}

We can also fill freed chunk memory with garbage using memset to make it likely that using such memory will trigger bugs and undefined behaviors that are easily noticed and identified when debugging. Doing that only for debug builds doesn't add any overhead for code running in production.

The architecture of our allocator also allows us to efficiently protect against allocator mismatches. Since we can easily map from user address to page, from there to arena, and from there - thanks to the allocator pointer we needed to add to be able to call our policy interface - to the allocator, we can implement a check:


    bool owns(void *memory) const
    {
        Page *page = Page::fromNode(Node::fromMemory(memory));
        return page->arena->m_allocator == this;
    }
    void free(void *ptr)
    {
        if (!owns(ptr))
            qFatal("Allocator mismatch for %p", ptr);

        Node *node = Node::fromMemory(ptr);
        node->destructHeader();
        Arena *arena = Page::fromNode(node)->arena;
        arena->push(node);
    }

This would help if we have multiple pool allocators for different sizes or types. Of course, if free() or owns() is called with a completely unknown pointer, we are more likely to simply crash, and the crash itself points at the bug.

Support for Address Sanitizer

A very good and popular aid for memory debugging is the Address Sanitizer tool. Compilers that support it provide APIs that allow user code like ours to poison and unpoison memory addresses and regions. There are also APIs to test whether memory addresses and regions are poisoned, which we can then use to verify that we poison and unpoison regions correctly. Using ASan APIs implicitly also improves the testing of the allocator code itself, since we will touch poisoned addresses if e.g. our pointer arithmetic is incorrect.


In the code review topic, ASan support is added as a separate commit. Let me know how this works for you!

Now we have put a fair bit of effort into making the allocator itself, and the code using it, testable. In the next post, we can go ahead with implementing some of the missing, and more complex, functionality: support for address alignment, detecting and freeing empty pages, destructive resets, and allocator selection.

