A quick explanation of the benchmark

I just realized that I didn't actually explain how to interpret the results of the benchmark I mentioned in my previous post. So, here it is...

When you run the benchmark (don't forget -comparison if you want std::string and QByteArray in the results), the first line you'll see is this:


Time to run = 1000 ms
--------------------------------------------------

This tells us how long each test function runs. Next, you'll see output like this:


AtomicString copyConstruction:
length 10, thrA: 40201 per-ms, thrB: 44077 per-ms, thrC: 40000 per-ms
length 100, thrA: 35429 per-ms, thrB: 40167 per-ms, thrC: 40033 per-ms
length 1000, thrA: 40167 per-ms, thrB: 40033 per-ms, thrC: 43915 per-ms

The first line tells us which class and which functionality we are testing. In this case, we're testing AtomicString copy construction. The following 3 lines tell us the length of the string we are using, followed by the results of the run. We use 3 threads (except with SharedString, which is not reentrant): thrA runs by itself, then thrB and thrC run concurrently. This gives us an idea of whether or not the test function scales. Larger numbers are better. In this case, we can make roughly 40,000 copies of an AtomicString per millisecond, and that rate holds when two threads run at once, so it scales to multiple CPUs (the test machine is a dual-core Opteron).

Also, all the results for similar test functions are grouped together, so it's easy to see how AtomicString compares with, e.g., SimpleString2 or std::string.

There are several test functions:

  • copyConstruction - copy construct an instance of the class with the given length.
  • appendSingleCharacter - append characters to an instance of the class. The string is truncated when length reaches the given length.
  • nonMutatingAccess - use operator[] to fetch a value from the string and add it to a volatile integer.
  • nonMutatingAccessAfterCopy - same as above, except that a copy is made before using operator[].
  • nonMutatingAccessOnCopy - same as above, except that operator[] is called on the copy (not the original).
  • mutatingCopy1 - 1/3 of copies are const, 1/3 are non-mutating (operator[]), 1/3 are modified once (append a single character).
  • mutatingCopy2 - 1/2 of copies are const, 1/4 are non-mutating (operator[] used 3 times), 1/4 are modified (append a single character 3 times).
  • functionWithArgument - call a function with an instance of the class with the given length as the only argument (argument is passed by value).
  • functionWithTemporaryArgument - same as above, except that the argument is a temporary copy.
  • functionReturningCopy - an instance of the class with the given length is returned by value.

Look at testfunctions.h for the code for each test. Each test is implemented as a template function, which is called by the test harness in main.cpp. In all cases, each test function is working on its own instance of the class, never on the same instance, which would require locking, defeating the purpose of the benchmark :)

The results of the benchmark show that in almost all cases, SharedString and AtomicString are more efficient than SimpleString and SimpleString2. I really only care about AtomicString, because SharedString cannot be used in a threaded program (the reference count on the shared_null quickly gets corrupted). In many cases, AtomicString is faster than std::string, which is surprising, since AtomicString hasn't been optimized beyond the normal patterns used in the Qt library (shared_null, ByteRef returned from operator[], exponential growth strategy). The cases where AtomicString is slower than SimpleString and SimpleString2 are typically for length 10 strings, but AtomicString quickly recovers for longer strings (mutatingCopy1 and mutatingCopy2).
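In case "atomically reference counted" is unfamiliar, here is a minimal sketch of the idea using std::atomic. It is not the AtomicString from the benchmark: the class name SketchString is invented, it stores its characters in a std::string for brevity, and it leaves out the shared_null, the ByteRef proxy and the growth strategy mentioned above.

```cpp
#include <atomic>
#include <cstddef>
#include <string>

class SketchString {
    struct Data {
        std::atomic<int> ref; // number of SketchString objects sharing this block
        std::string chars;
        explicit Data(const std::string &c) : ref(1), chars(c) {}
    };
    Data *d;

    void release()
    {
        // fetch_sub returns the previous value; 1 means we were the last owner.
        if (d->ref.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete d;
    }

    void detach()
    {
        // Copy on write: only pay for a deep copy if the data is shared.
        if (d->ref.load(std::memory_order_acquire) != 1) {
            Data *copy = new Data(d->chars);
            release();
            d = copy;
        }
    }

public:
    explicit SketchString(const std::string &s = std::string()) : d(new Data(s)) {}

    // Copying only bumps the counter; no characters are copied.
    SketchString(const SketchString &other) : d(other.d)
    {
        d->ref.fetch_add(1, std::memory_order_relaxed);
    }

    SketchString &operator=(const SketchString &other)
    {
        other.d->ref.fetch_add(1, std::memory_order_relaxed);
        release();
        d = other.d;
        return *this;
    }

    ~SketchString() { release(); }

    char operator[](std::size_t i) const { return d->chars[i]; }
    std::size_t size() const { return d->chars.size(); }
    void append(char c) { detach(); d->chars.push_back(c); }

    // Only here so the sharing can be observed from outside.
    bool sharesDataWith(const SketchString &o) const { return d == o.d; }
};
```

Copies made in different threads can safely share one data block because the counter is only ever changed with atomic operations; a single instance still must not be modified from two threads at once, which matches the way the benchmark gives each thread its own instance.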

I have included results for a few test machines in the benchmark as well, in the results/ subdirectory:

  • results/results-rayon.txt - My dual-core AMD Opteron 165 @ 1.8GHz running Kubuntu Dapper Drake (amd64), built with GCC 4.0.3
  • results/results-stri.txt - My Pentium M @ 1.73GHz running Kubuntu Dapper Drake (i386), built with the Intel C++ Compiler 9.1.043
  • results/results-error.txt - My colleague's dual-core AMD Opteron 165 @ 1.8GHz running Windows XP (i386), built with MSVC2005

My conclusion: Atomically reference counting all implicitly shared classes is the best thing to do in Qt. You will see benefits for both threaded and non-threaded programs.

