Atomic reference counting - is it worth it?

So, I just returned from DevDays2006 in Munich, and I figured I would finish this blog posting, since I mentioned in my presentations. :)

Starting with Qt 4.0, all implicitly shared classes in Qt use atomic reference counting to ensure the integrity of the data being shared. This removes a burden from the application programmer: the need to force deep-copies when passing data to another thread. The end result is that my life gets a whole lot easier (no more support questions from people who forgot to make a deep-copy) and your life gets a whole lot easier (since you don't have to worry about making deep-copies). Great! Wonderful!! Right?

We certainly think so. However, some people disagree. Recently, one of my colleagues posted a link[1] to our internal mailing list, which brings the whole idea of atomic reference counting under question. So, I started off by writing a few benchmarks of my own, taking some of the tests used by the articles accompanying the first[2][3][4] to give them a try.

What I found was of no surprise (to me, at least). Some of the points made in the article are valid, which also shows up in the benchmark results. The main one being that atomic reference counting is considerably slower than normal integer reference counting. This is to be expected; since we constantly have to read/set the reference count in memory, we can only go as fast as the memory bus lets us. Download the benchmark[5] and try it yourself. I have implemented 4 string classes for this benchmark. SimpleString and SimpleString2 are not shared at all (the latter uses an exponential growth strategy). SharedString and AtomicString use the shared_null optimzation that QString and QByteArray use, exponential growth strategy, and the ByteRef indirection in operator[] described below. The benchmark has one command-line option, -comparison, which will include std::string and QByteArray in the benchmark.

Even with the extra cost of the atomic increments and decrements, the results show that implicit sharing using atomic reference counting is more efficient than not sharing at all. Why? Because we have eliminated one of the main arguments against the copy-on-write mechanism; the square-bracket operator not knowing if it needs to detach or not. "Why use copy-on-write if we always have to make a copy? It would be better to always make the copy."

Take a look at the SharedString and AtomicString classes (they are the two that use implicit sharing). Instead of being pessimistic and always detaching in operator[], we instead return a ByteRef class which has a reference to the original string and the index passed to the operator. This class only has two operators itself: a cast operator that is const, and an assignment operator. If the programmer doesn't modify the value returned by operator[], then we don't detach. Clever, isn't it? :) Instead of arguing against implicit sharing because we assumed operator[] has to detach, we actually solved the problem. Qt has used this in QString and QByteArray for a long time, since the Qt 3.x days. Combine this with the shared_null optimzation, which SharedString and AtomicString do in the benchmark, and you end up with a string that is more memory and cpu efficient.

After my first Qt in Depth presentation in Munich, a few people grabbed me, asking about how atomic reference counting performed. One guy in particular asked "Do you have a benchmark?" My response was: "Not only do I have a benchmark, I have printouts of the benchmark!" (I had hoped to finished this posting while at DevDays, but I didn't have the time.) I'm not going to mention names, but this particular fellow has previously posted to qt-interest about his dislike for the implicit sharing used throughout Qt. (I will let the person in question decide if he wants to reveal his identity. I hope he does, just to prove that I'm not making this up.) After going through the results with him and the others, he became a believer. "I love numbers. You've convinced me.", he said. :)

As I mentioned, take a look at the benchmark. I've included results from 3 different machines: rayon - a dual-core opteron running Linux, stri - a Pentium M also running Linux, and error - a dual-core opteron running Windows.


References:
  1. http://www.gotw.ca/publications/optimizations.htm
  2. http://www.gotw.ca/gotw/043.htm
  3. http://www.gotw.ca/gotw/044.htm
  4. http://www.gotw.ca/gotw/045.htm
  5. http://chaos.troll.no/~bhughes/implicitsharing.tar.gz - Run with -comparison to include std::string and QByteArray in the results.

Blog Topics:

Comments