Improving the rendering performance with more SIMD

August 24, 2010 by Benjamin

With the last two versions of Qt, we consistently improved performance. Qt 4.5 introduced pluggable graphics systems and numerous rendering optimizations. Qt 4.6 brought optimizations all over the place, and the performance on embedded improved continuously with each patch release.

The problem with making Qt faster in every release is that we run short of ways to improve the next one. We have to look for new areas of improvement, and once again Qt 4.7 is faster than its predecessors.

Single instruction, multiple data

One of the ways we made Qt 4.7 faster than its predecessor is by using the processor more effectively. Modern processors can execute a single instruction on several pieces of data at once. This is called single instruction, multiple data: SIMD. In particular, recent x86 processors have the SSE extensions, while ARM Cortex processors have Neon.

The principle is simple. Consider a case where the same simple operation is applied to many values:

quint32 a[256];
quint32 b[256];
quint32 c[256];
// [...]

for (int i = 0; i < 256; ++i) {
    c[i] = a[i] + b[i];
}

On processors supporting SIMD, this code can be sped up by applying each instruction to several values at once. For example, with SSE2, the following code loads four values at a time, applies the + operation, and stores the result in c:

#include <emmintrin.h> // SSE2 intrinsics

quint32 a[256];
quint32 b[256];
quint32 c[256];
// [...]

for (int i = 0; i < 256; i += 4) {
    // Load four 32-bit values from each input (unaligned loads).
    __m128i vectorA = _mm_loadu_si128((__m128i*)&a[i]);
    __m128i vectorB = _mm_loadu_si128((__m128i*)&b[i]);
    // Add the four pairs with a single instruction.
    __m128i vectorC = _mm_add_epi32(vectorA, vectorB);
    // Store the four sums into c.
    _mm_storeu_si128((__m128i*)&c[i], vectorC);
}

The code above contains intrinsics, which the compiler replaces with SSE2 instructions.

This example is so simple the compiler can optimize it automatically when passed the right options. But in most real cases, the change is not that obvious, and the algorithm needs to be slightly modified to work with vectors.
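One small example of such a modification: the loop above only works because 256 is a multiple of four. The sketch below is illustrative and not taken from Qt. It uses standard fixed-width types instead of quint32, and the name addArrays and the SKETCH_HAS_SSE2 macro are assumptions made for this example. It processes four values at a time and finishes the remaining zero to three values with a scalar loop.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE2 intrinsics
#define SKETCH_HAS_SSE2 1 // hypothetical macro, used only in this sketch
#endif

// Hypothetical helper: c[i] = a[i] + b[i] for any count,
// not only for multiples of four.
void addArrays(const std::uint32_t *a, const std::uint32_t *b,
               std::uint32_t *c, std::size_t count)
{
    std::size_t i = 0;
#ifdef SKETCH_HAS_SSE2
    // Vector part: four 32-bit values per iteration.
    for (; i + 4 <= count; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(c + i),
                         _mm_add_epi32(va, vb));
    }
#endif
    // Scalar epilogue: the values left over after the vector part
    // (or every value when SSE2 is not available at compile time).
    for (; i < count; ++i)
        c[i] = a[i] + b[i];
}
```

The vector and scalar parts produce the same results, so callers do not need to know which one handled a given element.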

Qt has used SIMD for a long time, with MMX and 3DNow! for example. In Qt 4.7, we extended our use of SSE on x86, and of Neon on ARM Cortex processors. By using SIMD in more places, we have gained between 2 and 4 times the speed in some use cases.

Improving raster

In Qt 4.7, lots of rendering primitives have been reimplemented using SSE and Neon. This affects the raster graphics system in a very positive way.
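To give an idea of what a vectorized rendering primitive can look like, here is a minimal sketch of a 50% blend of two ARGB32 pixel buffers. It is not Qt's implementation: the function names averagePixel and averageBlend and the SKETCH_HAS_SSE2 macro are assumptions made for this example. It relies on the SSE2 intrinsic _mm_avg_epu8, which computes the rounded average of sixteen unsigned bytes at once, that is, all four channels of four pixels.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE2 intrinsics
#define SKETCH_HAS_SSE2 1 // hypothetical macro, used only in this sketch
#endif

// Scalar reference: rounded per-channel average of two 32-bit pixels.
static inline std::uint32_t averagePixel(std::uint32_t x, std::uint32_t y)
{
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const std::uint32_t cx = (x >> shift) & 0xff;
        const std::uint32_t cy = (y >> shift) & 0xff;
        result |= ((cx + cy + 1) >> 1) << shift;
    }
    return result;
}

// Hypothetical primitive: dest = average(dest, src), channel by channel.
void averageBlend(std::uint32_t *dest, const std::uint32_t *src,
                  std::size_t count)
{
    std::size_t i = 0;
#ifdef SKETCH_HAS_SSE2
    // Four pixels (sixteen channels) per iteration.
    for (; i + 4 <= count; i += 4) {
        __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i *>(dest + i));
        __m128i s = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dest + i),
                         _mm_avg_epu8(d, s));
    }
#endif
    // Remaining pixels, or all of them without SSE2.
    for (; i < count; ++i)
        dest[i] = averagePixel(dest[i], src[i]);
}
```

The scalar function touches one channel at a time, while the vector loop handles sixteen channels per instruction, which is where this kind of primitive gets its speed.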

The functions rewritten for SIMD are generally 2 to 4 times faster than the generic implementation. Microbenchmarks can be misleading, so to measure the impact on a realistic use case, I've used the WebKit benchmark suite.

On the "scrolling" test, we load the top 50 most visited web pages and scroll them up and down. For this test I get the following improvement compared to Qt 4.6 compiled without any SIMD:

The tests have been run with the same version of QtWebKit (WebKit trunk) in all cases to remove the influence of the improvements done in the engine.

Compiling with SIMD

You do not have to do anything special to enjoy these improvements in Qt. When you build Qt, the configure script detects which features are supported by the compiler. You can see which extensions are supported in the summary printed on the command line.

Supporting the CPU extensions at compile time does not mean they will be used. When an application starts, Qt detects what is available, and sets up the fastest functions available for the current processor.
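The general idea of picking a function at startup can be sketched as follows. This is an illustration only and not Qt's actual detection code: the names FillFunc, fillGeneric, fillSse2, cpuHasSse2 and selectFill are hypothetical, and the runtime check assumes the GCC/Clang builtin __builtin_cpu_supports on x86. With other compilers, the sketch simply falls back to what is known at compile time.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE2 intrinsics
#define SKETCH_HAS_SSE2 1 // hypothetical macro, used only in this sketch
#endif

// Signature shared by every implementation of the primitive.
using FillFunc = void (*)(std::uint32_t *dest, std::uint32_t color,
                          std::size_t count);

// Generic implementation, works on every processor.
static void fillGeneric(std::uint32_t *dest, std::uint32_t color,
                        std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        dest[i] = color;
}

#ifdef SKETCH_HAS_SSE2
// SSE2 implementation: writes four pixels per store.
static void fillSse2(std::uint32_t *dest, std::uint32_t color,
                     std::size_t count)
{
    const __m128i v = _mm_set1_epi32(static_cast<int>(color));
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dest + i), v);
    for (; i < count; ++i)
        dest[i] = color;
}
#endif

// Runtime check. Assumption: GCC or Clang on x86; elsewhere we only
// know what the compiler told us at build time.
static bool cpuHasSse2()
{
#if defined(SKETCH_HAS_SSE2) && (defined(__GNUC__) || defined(__clang__)) \
    && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#elif defined(SKETCH_HAS_SSE2)
    return true; // SSE2 is part of the x86-64 baseline
#else
    return false;
#endif
}

// Called once at startup; the result is kept in a function pointer.
FillFunc selectFill()
{
#ifdef SKETCH_HAS_SSE2
    if (cpuHasSse2())
        return fillSse2;
#endif
    return fillGeneric;
}
```

An application would call selectFill() once, store the returned pointer, and go through it for every fill, so the cost of detection is paid only at startup.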

With more SSE, we have more code that is sensitive to alignment. Unfortunately, some compilers have bugs regarding the alignment of vectors. Using a recent compiler is a good idea, both to get the best performance and to avoid crashes.
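To illustrate why alignment matters: the aligned load _mm_load_si128 requires its address to be a multiple of 16 bytes, and faults otherwise. A common pattern, sketched below under the assumption of SSE2, is to process values one by one until the pointer is aligned, then switch to aligned vector loads. The names isAligned16 and sumValues and the SKETCH_HAS_SSE2 macro are made up for this example and do not come from Qt.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE2 intrinsics
#define SKETCH_HAS_SSE2 1 // hypothetical macro, used only in this sketch
#endif

// True when the pointer sits on a 16-byte boundary.
static inline bool isAligned16(const void *p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: sum of all values, wrapping around on overflow.
std::uint32_t sumValues(const std::uint32_t *data, std::size_t count)
{
    std::size_t i = 0;
    std::uint32_t total = 0;
#ifdef SKETCH_HAS_SSE2
    // Scalar prologue: advance until data + i is 16-byte aligned.
    for (; i < count && !isAligned16(data + i); ++i)
        total += data[i];

    // Aligned vector loop: four values per aligned load.
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= count; i += 4) {
        acc = _mm_add_epi32(
            acc, _mm_load_si128(reinterpret_cast<const __m128i *>(data + i)));
    }

    // Horizontal sum of the four lanes through an aligned temporary.
    alignas(16) std::uint32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i *>(lanes), acc);
    total += lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    // Scalar epilogue: whatever is left.
    for (; i < count; ++i)
        total += data[i];
    return total;
}
```

The alignas(16) on the temporary is exactly the kind of request a compiler has to honor correctly: if the array ended up misaligned, the aligned store would crash.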

Future

We are not done with improvements just yet. The most common functions have been optimized, but lots of less common paths can also be improved. For the last month, every week I think I am almost done, and Andreas pokes me with a new interesting use case. Those improvements are making their way to the 4.7 branch, and you can already expect 4.7.1 to be a little faster than the upcoming Qt 4.7.0.
