Improving the rendering performance with more SIMD

With the last two versions of Qt, we consistently improved performance. Qt 4.5 introduced pluggable graphics systems and numerous rendering optimizations. Qt 4.6 brought optimizations all over the place, and the performance on embedded improved continuously with each patch release.

A problem with increasing the speed all the time is that we run short of ways to improve in the next iterations. We have to look for new areas of improvement, and once again we are making Qt 4.7 faster than its predecessors.

Single instruction, multiple data

One of the ways we made Qt 4.7 faster than its predecessors is by using the processor more effectively. Modern processors can execute one instruction on multiple pieces of data at a time. This is called single instruction, multiple data: SIMD. In particular, recent x86 processors have the SSE extensions, while ARM Cortex processors have Neon.

The principle is simple. Let's look at a case where the same simple operation is applied to multiple data:

quint32 a[256];
quint32 b[256];
quint32 c[256];
// [...]

for (int i = 0; i < 256; ++i) {
    c[i] = a[i] + b[i];
}

On processors supporting SIMD, this code can be improved by applying each instruction to multiple data. For example, with SSE2, the following code loads 4 values at a time from each array, applies the + operation, and stores the 4 results in c:

quint32 a[256];
quint32 b[256];
quint32 c[256];
// [...]

for (int i = 0; i < 256; i += 4) {
    __m128i vectorA = _mm_loadu_si128((__m128i*)&a[i]);
    __m128i vectorB = _mm_loadu_si128((__m128i*)&b[i]);
    __m128i vectorC = _mm_add_epi32(vectorA, vectorB);
    _mm_storeu_si128((__m128i*)&c[i], vectorC);
}

The code above uses intrinsics, declared in <emmintrin.h>, which the compiler replaces with SSE2 instructions.

This example is so simple the compiler can optimize it automatically when passed the right options. But in most real cases, the change is not that obvious, and the algorithm needs to be slightly modified to work with vectors.
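
As a small illustration of the kind of modification involved, here is a sketch of the same addition for an array whose length is not a multiple of 4. The function name addArrays is hypothetical and is not part of Qt; the sketch only assumes that the compiler defines __SSE2__ when SSE2 is enabled, and falls back to the plain loop otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Hypothetical helper (not a Qt function): c[i] = a[i] + b[i] for any count.
void addArrays(const std::uint32_t *a, const std::uint32_t *b,
               std::uint32_t *c, std::size_t count)
{
    assert(count == 0 || (a && b && c));
    std::size_t i = 0;
#if defined(__SSE2__)
    // Vector body: process 4 values per iteration while at least 4 remain.
    for (; i + 4 <= count; i += 4) {
        const __m128i vectorA = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        const __m128i vectorB = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(c + i),
                         _mm_add_epi32(vectorA, vectorB));
    }
#endif
    // Scalar epilogue: the remaining 0 to 3 values (or everything without SSE2).
    for (; i < count; ++i)
        c[i] = a[i] + b[i];
}
```

The vector loop and the scalar epilogue compute the same result, so the caller does not need to know which path was taken.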

Qt has used SIMD for a long time, with MMX and 3DNow! for example. In Qt 4.7, we extended our usage of SSE on x86, and of Neon on ARM Cortex processors. By using SIMD in more places, we've gained between 2 and 4 times the speed in some use cases.

Improving raster

In Qt 4.7, lots of rendering primitives have been reimplemented using SSE and Neon. This affects the raster graphics system in a very positive way.

The functions rewritten for SIMD are generally 2 to 4 times faster than the generic implementation. Microbenchmarks can be misleading, so to measure the impact on a realistic use case, I've used the WebKit benchmark suite.

In the "scrolling" test, we load the top 50 most visited web pages and scroll them up and down. For this test I get the following improvement compared to Qt 4.6 compiled without any SIMD:
[Figure: Performance improvement of Qt 4.7]

The tests have been run with the same version of QtWebKit (WebKit trunk) in all cases to remove the influence of the improvements done in the engine.

Compiling with SIMD

You do not have to do anything special to benefit from these improvements. When you build Qt, the configure script detects which features are supported by the compiler. You can see which extensions are supported in the summary printed on the command line.

Supporting the CPU extensions at compile time does not mean they will be used. When an application starts, Qt detects what is available, and sets up the fastest functions available for the current processor.
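
The idea of choosing the implementation at startup can be sketched as follows. This is not Qt's actual code: the names AddFunction, addGeneric, addSse2, cpuHasSse2 and selectAddFunction are all hypothetical, and the detection relies on the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports. In a real build, the SIMD function would typically live in a separate source file compiled with the extension enabled; here everything sits in one file for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

typedef void (*AddFunction)(const std::uint32_t *, const std::uint32_t *,
                            std::uint32_t *, std::size_t);

// Generic implementation, always available.
static void addGeneric(const std::uint32_t *a, const std::uint32_t *b,
                       std::uint32_t *c, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        c[i] = a[i] + b[i];
}

#if defined(__SSE2__)
// SSE2 implementation, only built when the compiler supports it.
static void addSse2(const std::uint32_t *a, const std::uint32_t *b,
                    std::uint32_t *c, std::size_t count)
{
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(c + i), _mm_add_epi32(va, vb));
    }
    addGeneric(a + i, b + i, c + i, count - i); // leftover values
}
#endif

// Hypothetical run-time check: does the current processor have SSE2?
static bool cpuHasSse2()
{
#if defined(__SSE2__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false; // unknown compiler or CPU: stay on the generic path
#endif
}

// Called once at startup; returns the fastest function usable on this CPU.
AddFunction selectAddFunction()
{
#if defined(__SSE2__)
    if (cpuHasSse2())
        return addSse2;
#endif
    return addGeneric;
}
```

After startup, callers only go through the returned function pointer, so the detection cost is paid once.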

With more SSE, we have more code sensitive to alignment. Unfortunately, some compilers have bugs regarding the alignment of vectors. Having a recent compiler is a good idea to get the best performance, while avoiding crashes.
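
To illustrate what sensitivity to alignment means in practice, here is a sketch of a 32-bit fill that uses the aligned store _mm_store_si128, which requires a 16-byte aligned address. The names fill32 and isAligned16 are hypothetical and this is not the Qt implementation; the sketch assumes dest is at least 4-byte aligned, as any valid quint32 pointer would be.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2 intrinsics
#endif

// True if the address is a multiple of 16 bytes.
inline bool isAligned16(const void *p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15) == 0;
}

// Hypothetical helper (not a Qt function): set count values to 'value'.
void fill32(std::uint32_t *dest, std::uint32_t value, std::size_t count)
{
    std::size_t i = 0;
#if defined(__SSE2__)
    // Scalar prologue: advance until dest + i sits on a 16-byte boundary.
    for (; i < count && !isAligned16(dest + i); ++i)
        dest[i] = value;

    // Vector body: aligned stores of 4 values at a time.
    const __m128i vectorValue = _mm_set1_epi32(static_cast<int>(value));
    for (; i + 4 <= count; i += 4) {
        assert(isAligned16(dest + i)); // an unaligned address would crash here
        _mm_store_si128(reinterpret_cast<__m128i *>(dest + i), vectorValue);
    }
#endif
    // Scalar epilogue: whatever is left (or everything without SSE2).
    for (; i < count; ++i)
        dest[i] = value;
}
```

The scalar prologue is what makes the aligned store safe for any destination; a vector variable that the compiler itself places at a misaligned address has no such protection, which is why compiler alignment bugs lead to crashes.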


We are not done with improvements just yet. The most common functions have been optimized, but lots of less common paths can also be improved. For the last month, every week I think I am almost done, and Andreas pokes me with a new interesting use case. Those improvements are making their way to the 4.7 branch, and you can already expect 4.7.1 to be a little faster than the upcoming Qt 4.7.0.
