A never-ending struggle…

Published on Wednesday October 22, 2008 by Bradley T. Hughes

For the past two weeks, I've been wrestling with a fix for task 208487. At first glance, it seems pretty far-fetched (at least it did to me): an application that has 2 billion running timers and runs out of timer ids. But then I went and looked at the code (that I wrote!) and the real problem slapped me in the face: any application can only ever start 2 billion timers in total (which is different from having 2 billion concurrent timers). We never recycle timer ids. That means that the following code will quit working after 24.85 days: consuming one id per millisecond, the roughly 2^31 available ids last about 2,147,483 seconds, which is 24.85 days. If we change the timeout to zero instead, the runtime drops from days to hours.

#include <QObject>
#include <QTimer>

class Object : public QObject
{
    Q_OBJECT
public:
    Object() : QObject()
    { QTimer::singleShot(1, this, SLOT(timeout())); }

public slots:
    void timeout()
    { QTimer::singleShot(1, this, SLOT(timeout())); }
};

After quickly rejecting the question "Should I fix this?" (of course I should!), I promptly got my hands dirty looking for a solution. Fixing this proved to be harder than I originally anticipated...

First attempt: a bit array

My original idea was to use a bit array to keep track of used and free timer ids. I started off using QBitArray, but then switched to a QVector<int> instead. The reason was that I could cache the id most likely to be returned next (i.e. the best guess), and if the best guess turned out not to be free the next time we asked for an id, it was quick and easy to look at the word containing the cached id to see if any ids close to the best guess were free. Failing that, I would search the whole vector for the first free bit. If everything was in use, we grew the vector and started handing out bits from the newly added space. Releasing an id back to the pool only involved toggling a single bit. I was pretty happy with this approach after I committed it. A rough sketch of the idea follows below.
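
To make the idea concrete, here's a minimal sketch of that scheme. This is my own illustration, not the patch I committed: the class and method names are made up, and I use QVector<quint32> rather than QVector<int> purely for cleaner bit manipulation.

#include <QVector>

// One bit per timer id, stored in 32-bit words, with a cached "best guess"
// for the next id to hand out.
class TimerIdBitmap
{
public:
    TimerIdBitmap() : bestGuess(0) { words.append(0u); }

    int allocateId()
    {
        // fast path: look in the word containing the best guess
        int id = freeBitInWord(bestGuess / 32);
        if (id < 0) {
            // slow path: scan the whole vector for the first free bit
            for (int w = 0; w < words.size() && id < 0; ++w)
                id = freeBitInWord(w);
        }
        if (id < 0) {
            // everything is in use: grow and take the first bit of the new word
            id = words.size() * 32;
            words.append(0u);
        }
        words[id / 32] |= (1u << (id % 32));
        bestGuess = id + 1;
        return id;
    }

    void releaseId(int id)
    {
        // releasing an id is just clearing one bit
        words[id / 32] &= ~(1u << (id % 32));
        bestGuess = id;
    }

private:
    int freeBitInWord(int w) const
    {
        if (w < 0 || w >= words.size())
            return -1;
        for (int bit = 0; bit < 32; ++bit) {
            if (!(words.at(w) & (1u << bit)))
                return w * 32 + bit;
        }
        return -1;
    }

    QVector<quint32> words;
    int bestGuess;
};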

Oops #1

But then one of my colleagues quickly pointed out a fatal flaw: timer ids need to be unique across the entire application, not just unique per thread. Since I had used one vector per thread, this meant that two objects in different threads could get the same timer id. But what happens when those objects were moved to the same thread? Oops! I quickly had to git revert ... my change.

Oops #2

Another colleague questioned the implementation itself. Previously, we used an atomic serial number to "generate" timer ids, meaning id allocation was O(1); now it was O(N) (where N is the number of running timers). We don't like regressions, regardless of whether it's a behavioral or a performance regression. He proposed a free-list implementation that would keep the O(1) performance. A vector is initialized with vector[i] = i + 1 for 0 <= i < size, and we keep the index of the next free id. Initializing and resizing the vector is straightforward:

int *vector = 0;
int size = 0;
int free = 0;

void resize(int newSize)
{
    int *p = (int *) qMalloc(newSize * sizeof(int));
    if (vector)
        memcpy(p, vector, size * sizeof(int));
    // each new entry points at the id after it
    for (int i = size; i != newSize; ++i)
        p[i] = i + 1;
    size = newSize;
    qFree(vector);
    vector = p;
}

Allocating a new id is extremely efficient now, since we always know the next free id. Before returning this id, we only have to update the next free id based on what's stored in the vector:

int allocateId()
{
    if (free == size)
        resize(size ? 2 * size : 4);

    int id = free;
    free = vector[id];
    return id;
}

Releasing is just as efficient, since it does the inverse:

void releaseId(int id)
{
    vector[id] = free;
    free = id;
}
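
To see how the chaining works, here is a quick trace of the free-list (assuming allocateId() triggers an initial resize(4); the concrete ids are just for illustration):

// vector = [1, 2, 3, 4], free = 0
// allocateId()  -> returns 0, free becomes vector[0] == 1
// allocateId()  -> returns 1, free becomes vector[1] == 2
// releaseId(0)  -> vector[0] = 2 (the old free), free becomes 0
// allocateId()  -> returns 0 again, free becomes vector[0] == 2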

This is much more efficient than the approach I used originally, so I wanted to try it out...

Second attempt: a locked free-list

I started off by just using the above as-is and throwing a global QMutex into the allocateId() and releaseId() functions, roughly as sketched below. It worked, but I've grown allergic to global locks in Qt. Besides, by the time I got to this point, it was Thursday evening, and tomorrow was Creative Friday (yeh!). So I decided to use my Creative Friday to see if a lock-free implementation of the free-list was possible...
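
A sketch from memory of roughly what that looked like (not the actual commit): the free-list globals and resize() from above, with every entry point guarded by one global mutex whose name here is just for illustration.

#include <QMutex>
#include <QMutexLocker>

static QMutex globalMutex;

int allocateId()
{
    QMutexLocker locker(&globalMutex);   // serialize all id allocation
    if (free == size)
        resize(size ? 2 * size : 4);

    int id = free;
    free = vector[id];
    return id;
}

void releaseId(int id)
{
    QMutexLocker locker(&globalMutex);   // serialize all id release
    vector[id] = free;
    free = id;
}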

Third attempt: a "lock-free" free-list

I started off by incrementally changing the locked implementation into a lock-free one. First on the plan was making the free variable atomic with QAtomicInt (and renaming the variable to nextFreeTimerId). The naive implementation ended up looking like this:

int allocateId()
{
    // QMutexLocker locker(&globalMutex);

    // ### FIXME
    if (int(nextFreeTimerId) == size)
        resize(size ? 2 * size : 4);

    int at = int(nextFreeTimerId);
    int newValue = vector[at];
    return nextFreeTimerId.fetchAndStoreRelaxed(newValue);
}

void releaseId(int id)
{
    // QMutexLocker locker(&globalMutex);
    vector[id] = nextFreeTimerId.fetchAndStoreRelaxed(id);
}

So, we know that we need to fix the resize() function to do something different. I started off by having a bucket array of 1024 buckets with each bucket containing 4096 entries. Each bucket would be allocated on demand using code like this:

...
int timerId = int(nextFreeTimerId);

// which bucket are we looking in?
int bucket = timerId / BucketSize;
int at = timerId % BucketSize;
volatile int *b = timerIds[bucket];

if (!b) {
    // allocate a new bucket
    b = new int[BucketSize];
    for (int i = 0; i != BucketSize; ++i)
        b[i] = (bucket * BucketSize) + i + 1;

    if (!timerIds[bucket].testAndSetOrdered(0, b)) {
        // another thread won the race to allocate the bucket
        delete [] b;
        b = timerIds[bucket];
    }
}
...

So, that removes the need for resize() completely. However, if you look at the rest of allocateId() after the resize() call, we can see a race: nextFreeTimerId is read and used to index into the vector, but that read and the later fetch-and-store are not one atomic step. releaseId()'s fetch-and-store could change nextFreeTimerId between allocateId()'s read and its fetch-and-store. Ouch.

There's a second, harder-to-see race in releaseId(). The vector is updated after the fetch-and-store, which means that allocateId() can read a stale value from the vector (even though nextFreeTimerId itself is correct) just before its own fetch-and-store. Double ouch!
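
Here's one possible interleaving of the naive code that shows the second race (the concrete values are hypothetical):

// initially: nextFreeTimerId == 7, and vector[3] still holds a stale value
//
// thread B, releaseId(3):  fetchAndStoreRelaxed(3) returns 7,
//                          nextFreeTimerId is now 3,
//                          ...but vector[3] = 7 has not been written yet
// thread A, allocateId():  reads nextFreeTimerId == 3,
//                          reads vector[3] -> the stale value, not 7,
//                          fetchAndStoreRelaxed(stale) returns 3
//
// thread A correctly gets id 3, but nextFreeTimerId now points at the stale
// value: id 7 has leaked out of the free list, and a later allocation can
// hand out an id that is still in use.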

To solve the races, we'll have to introduce some test-and-set loops:

int allocateId()
{
    int timerId;
    do {
        timerId = int(nextFreeTimerId);

        // which bucket are we looking in?
        ... (see above)

    } while (!nextFreeTimerId.testAndSetRelease(timerId, b[at]));

    return timerId;
}

void releaseId(int timerId)
{
    int bucket = timerId / BucketSize;
    int at = timerId % BucketSize;
    volatile int *b = timerIds[bucket];

    int freeId;
    do {
        freeId = int(nextFreeTimerId);
        b[at] = freeId;
    } while (!nextFreeTimerId.testAndSetRelease(freeId, timerId));
}

But for some reason, I was getting duplicates from allocateId(), even though I really shouldn't have been. I didn't understand what was going wrong, but by the time I got to this point, Creative Friday was at an end, brain-fry was setting in, and the weekend beckoned. I made one last-ditch attempt to "fix" the lock-free free-list implementation by adding a spin-lock before heading out the door:

int allocateId()
{
    int timerId;
    do {
        // spin while releaseId() holds the "lock" (nextFreeTimerId == -1)
        do { timerId = int(nextFreeTimerId); } while (timerId == -1);

        ...
}

void releaseId(int timerId)
{
    ...

    int freeId;
    do {
        // "lock" the free list by swapping -1 in, spinning if someone else beat us to it
        do { freeId = nextFreeTimerId.fetchAndStoreAcquire(-1); } while (freeId == -1);
        b[at] = freeId;
    } while (!nextFreeTimerId.testAndSetRelease(-1, timerId));
}

It worked! I committed and fled my office to begin my weekend activities.

Finally: really lock-free

Monday morning, I read through lengthy comments on my commit from my ever-vigilant colleagues. One pointed out the space wasted by using so many buckets. Another pointed out an ABA problem (aha! so that's why I was seeing duplicates before!) along with a patch to "solve" it and remove the spin-lock.

The solution to the ABA problem was to use 7 of the top 8 bits of the timer id as a serial number (we can't use all 8, since timer ids need to be positive in Qt). The serial number is incremented (and eventually wraps) on every atomic test-and-set. This doesn't eliminate the ABA problem completely (according to the theory), but in practice it works very well. The sketch just below shows how a serial number can be packed into an id.
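
The helper name matches what the final code below calls, but treat the exact constants here as my illustration of the scheme rather than a quote of the committed source:

enum {
    TimerIdMask        = 0x00ffffff, // low 24 bits: the actual timer id
    TimerSerialMask    = 0x7f000000, // 7 of the top 8 bits: the serial number
    TimerSerialCounter = 0x01000000  // adding this bumps the serial by one
};

static inline int prepareNewValueWithSerialNumber(int oldId, int newId)
{
    // keep newId's low 24 bits, and take oldId's serial number plus one;
    // masking keeps the result positive and lets the serial wrap around
    return (newId & TimerIdMask) | ((oldId + TimerSerialCounter) & TimerSerialMask);
}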

After reading a paper on lock-free vectors, I got the idea to use buckets that grow exponentially in size. Instead of a large number of fixed-size buckets, we can use a small number of buckets and waste less space on static data. I ended up using 8 buckets where the size of bucket B is 8^B (with B counting from 1). So bucket 1 holds 8 entries, bucket 2 holds 64, bucket 3 holds 512, and so on; the last bucket would be 8^8, or 16777216 entries. That just happens to match the timer id space we are using: since 8 bits are reserved for the serial number, the effective timer id space is 2^24 (16777216). What a coincidence! (Note: to avoid wasted space, the last bucket is not actually 8^8 entries, but (8^8) - 2396744 entries, where 2396744 is the sum of the sizes of the first 7 buckets.) The bookkeeping that falls out of this scheme is sketched below.
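
The helper names here match what the final code below uses, and the table values just spell out the 8^B sizes described above; it's my sketch of the bookkeeping, not necessarily line-for-line what was committed:

enum { NumberOfBuckets = 8 };

// bucket i holds 8^(i+1) entries, except the last one, which is truncated
// so the total stays within the 2^24 timer id space
static const int BucketSize[NumberOfBuckets] =
    { 8, 64, 512, 4096, 32768, 262144, 2097152, 16777216 - 2396744 };
static const int BucketOffset[NumberOfBuckets] =
    { 0, 8, 72, 584, 4680, 37448, 299592, 2396744 };

static int bucketOffset(int timerId)
{
    // find the bucket whose id range contains timerId
    for (int i = 0; i < NumberOfBuckets; ++i) {
        if (timerId < BucketOffset[i] + BucketSize[i])
            return i;
    }
    return -1; // out of timer ids
}

static int bucketIndex(int bucket, int timerId)
{
    return timerId - BucketOffset[bucket];
}

static int *allocateBucket(int bucket)
{
    // each slot points at the next id in the free list, exactly like the
    // single-vector version earlier in this post
    const int size = BucketSize[bucket], offset = BucketOffset[bucket];
    int *b = new int[size];
    for (int i = 0; i != size; ++i)
        b[i] = offset + i + 1;
    return b;
}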

So, the final solution that I committed to Qt today uses a lock-free implementation of the original free-list proposal, built on top of the bucket array with exponential bucket sizes, and uses the top 8 bits of the timer id as a serial number to address the ABA problem. This solves the problem of running out of timer ids without restricting the number of concurrent, running timers an application can have. (16777216 concurrent, running timers is a lot of timers! If that isn't enough for all the Qt applications out there, I'm going to retire and become a sheep farmer somewhere up in the Norwegian mountains.)

Here are the most interesting parts of what we ended up with:

int QAbstractEventDispatcherPrivate::allocateTimerId()
{
    int timerId, newTimerId;
    do {
        timerId = nextFreeTimerId;

        // which bucket are we looking in?
        int which = timerId & 0x00ffffff;
        int bucket = bucketOffset(which);
        int at = bucketIndex(bucket, which);
        volatile int *b = timerIds[bucket];

        if (!b) {
            // allocate a new bucket
            b = allocateBucket(bucket);
            if (!timerIds[bucket].testAndSetRelease(0, b)) {
                // another thread won the race to allocate the bucket
                delete [] b;
                b = timerIds[bucket];
            }
        }

        newTimerId = b[at];
    } while (!nextFreeTimerId.testAndSetRelease(timerId, newTimerId));

    return timerId;
}

void QAbstractEventDispatcherPrivate::releaseTimerId(int timerId)
{
    int which = timerId & 0x00ffffff;
    int bucket = bucketOffset(which);
    int at = bucketIndex(bucket, which);
    volatile int *b = timerIds[bucket];

    int freeId, newTimerId;
    do {
        freeId = nextFreeTimerId;
        b[at] = freeId & 0x00ffffff;

        newTimerId = prepareNewValueWithSerialNumber(freeId, timerId);
    } while (!nextFreeTimerId.testAndSetRelease(freeId, newTimerId));
}

The rest of the code will show up in tomorrow's snapshot. You can browse the changes to src/corelib/kernel/qabstracteventdispatcher.cpp in our public git repo (but remember, the change won't show up until tomorrow).

The End

This post is a bit long, but as I mentioned before, I have been thinking about this for the past two weeks. Writing lock-free code is not easy; anyone who has tried would agree. Without the help of my colleagues, I probably would have spent much more time on it. As far as I know, this is the first lock-free data structure to go into Qt, and I wanted to make sure it was done right. I love thinking about these kinds of problems. I hope I get to do more of it. :)
