The Wide Finder Project is an informal parallel programming competition where the task is to compute web site statistics from a 218-million-line access log. Each entry will be benchmarked on a Sun T2000 with support for 32 hardware threads, giving lots of opportunities for parallel processing.
What makes this really interesting is that the project is not only about performance, but also about writing code that scales to many CPU cores with as little extra programmer effort as possible. Some results are already in, with OCaml currently in the lead performance-wise.
Eager to bring the performance lead back to C++ where it belongs, I started out writing my own implementation using QtConcurrent and the other Qt APIs. Briefly explained, the code uses QtConcurrent::mappedReduced to spread the work across threads, and then calls QByteArray::split() twice, once to break the input into lines and once to break each line into words. The current version computes the number of hits for each page.
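To make the structure concrete, here is a minimal sketch of the same map/reduce shape using only the standard library, with std::async standing in for QtConcurrent::mappedReduced. This is not the actual Qt code from this post: the names countHits, mergeHits and countHitsParallel are made up for illustration, and the assumption that the page is the 7th space-separated field (as in Common Log Format) is mine.

```cpp
#include <future>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using HitCounts = std::map<std::string, int>;

// Map step: count hits per page in one chunk of log lines.
// Assumption: the requested page is field 6 (0-based) when a line is
// split on whitespace, as in Common Log Format.
HitCounts countHits(const std::string &chunk)
{
    HitCounts hits;
    std::istringstream lines(chunk);
    std::string line;
    while (std::getline(lines, line)) {            // first split: lines
        std::istringstream words(line);
        std::string word;
        for (int field = 0; words >> word; ++field) {  // second split: words
            if (field == 6) {
                ++hits[word];
                break;
            }
        }
    }
    return hits;
}

// Reduce step: fold one partial result into the running total.
void mergeHits(HitCounts &total, const HitCounts &partial)
{
    for (const auto &entry : partial)
        total[entry.first] += entry.second;
}

// Runs the map step on every chunk in its own task, then reduces the
// partial results sequentially on the calling thread.
HitCounts countHitsParallel(const std::vector<std::string> &chunks)
{
    std::vector<std::future<HitCounts>> pending;
    for (const std::string &chunk : chunks)
        pending.push_back(std::async(std::launch::async,
                                     [&chunk] { return countHits(chunk); }));

    HitCounts total;
    for (auto &result : pending)
        mergeHits(total, result.get());
    return total;
}
```

Calling countHitsParallel with one chunk per thread gives roughly the same division of labour as the Qt version: each task builds its own table, and only the merge touches shared state.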
Results: (parsing 100K lines on an 8-core 2.8 GHz Mac Pro)
1 Thread: real 0m2.283s user 0m1.141s sys 0m0.249s
2 Threads: real 0m1.446s user 0m1.853s sys 0m0.271s
1.6X speedup.. not too bad.
4 Threads: real 0m3.186s user 0m10.643s sys 0m0.407s
1 second slower than the single-threaded version.. this does not bode well.
8 Threads: real 0m7.000s user 0m46.922s sys 0m0.724s
Seven seconds! We get a nice linear scaling of the run-time as we increase the number of threads, but unfortunately in the wrong direction. The program is spending a lot of user time doing something though, so let's run it through Shark and see what's going on:
80% in a spin-lock used by malloc/free. But who is calling malloc that much?
Aha.. QByteArray::split(). While being a very convenient API, split() was clearly not designed for heavy parsing like this. Still, I'd like a less catastrophic impact on the run-time when adding threads, even if the program really is calling malloc/free too often. Let's try with the ptmalloc memory allocator instead:
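For comparison, one way to avoid the per-token allocations altogether is to hand out views into the existing buffer instead of building a list of new byte arrays. The sketch below does that with C++17's std::string_view. It is an illustration only, not what the code in this post does: nthField is a hypothetical helper, and it splits on single spaces the way split(' ') would.

```cpp
#include <cstddef>
#include <string_view>

// Returns the n-th (0-based) space-separated field of `line` as a view into
// the original buffer, or an empty view if the line has fewer fields.
// Nothing is copied and nothing is allocated on the heap.
std::string_view nthField(std::string_view line, std::size_t n)
{
    std::size_t start = 0;
    for (;;) {
        const std::size_t end = line.find(' ', start);
        if (n == 0) {
            // Last field runs to the end of the line when no space follows.
            const std::size_t length =
                (end == std::string_view::npos) ? std::string_view::npos
                                                : end - start;
            return line.substr(start, length);
        }
        if (end == std::string_view::npos)
            return {};          // ran out of fields before reaching n
        start = end + 1;        // skip past the separator
        --n;
    }
}
```

The hit table itself still allocates when a new page is inserted, so this only removes the malloc/free traffic caused by splitting, which is the part Shark pointed at.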
8 Threads: real 0m0.908s user 0m3.784s sys 0m0.533s
ptmalloc is used in GNU/Linux through the GNU C library and scales much better on multicore systems. The program itself still does not scale beyond 4 threads, but it does not get significantly worse either when adding threads. I guess it's debatable whether or not this qualifies as a bug in the Darwin memory allocator, but at least ptmalloc shows that it is possible to do better.
That's all for now :) For the next installment I'll try to get better scaling, at the expense of increasing the developer effort.