Compiling QML to C++: A 4x speedup

As you may know, you can compile your QML code to C++ these days. There are several reasons to do this. One of them is that it leads to better-structured code by forcing you to declare the types you're using. The most important one is that the resulting program will run faster.

In my previous posts I've been rather cautious about the actual performance numbers. This is for a reason. The Qt Quick Compiler cannot translate any old JavaScript you throw at it, and depending on the exact characteristics of your code, the resulting speedup varies greatly. We're constantly working on increasing the Qt Quick Compiler's coverage of the QML language, but there is still a long way to go.

However, today I'll go out on a limb and show you a piece of code that gets 4 times faster by compiling it to C++. Consider the following little QML program:

import QtQml

QtObject {
    id: root

    enum Parameters {
        Length = 1024,
        Iterations = 32768,

        Category0 = 0xf0f,
        Category1 = 0xf0f0,
        Category2 = 0xf0f0f,
        Maximum   = 0xf0f0f0,
        Mask      = 0xabcdef
    }

    function randomNumber() : int {
        return (Math.random() * Categorizer.Maximum);
    }

    property var numbers: {
        var result = [];
        for (var i = 0; i < Categorizer.Length; ++i)
            result[i] = randomNumber();
        return result;
    }

    function sum() : list<double> {
        var numbers = root.numbers;

        var cat1Sum = 0;
        var cat2Sum = 0;
        var cat3Sum = 0;
        var huge = 0;
        for (var i = 0; i < Categorizer.Iterations; ++i) {
            for (var j = 0; j < Categorizer.Length; ++j) {
                var num = numbers[j] & Categorizer.Mask;
                if (num < Categorizer.Category0)
                    cat1Sum += num;
                else if (num < Categorizer.Category1)
                    cat2Sum += num;
                else if (num < Categorizer.Category2)
                    cat3Sum += num;
                else
                    huge += num;
            }
        }

        return [cat1Sum, cat2Sum, cat3Sum, huge];
    }

    Component.onCompleted: {
        console.log("start")
        var result = sum();
        console.log("done");

        console.log("< " + Categorizer.Category0 + ":", result[0])
        console.log("< " + Categorizer.Category1 + ":", result[1])
        console.log("< " + Categorizer.Category2 + ":", result[2])
        console.log("huge:", result[3]);
    }
}

It generates some random numbers and iterates over them repeatedly, masking them with a bitmask and adding them up into 4 sums, depending on their size. While you should not write your business logic in JavaScript, some helper function like this might be employed to position visual elements relative to each other, and it might even be a bottleneck. The outer iteration is just there so that we can talk about seconds rather than milliseconds.

Save this program as a file called "Categorizer.qml" and run it with the "qml" utility, measuring the time between the "start" and "done" output. You can use the underappreciated QT_MESSAGE_PATTERN environment variable to have it print the timestamps for each message. For example:

QT_MESSAGE_PATTERN="%{time process}: %{message}" /path/to/qml Categorizer.qml

On my machine I get something like this:

0.000: start
1.842: done
1.842: < 3855: 808091648
1.842: < 61680: 29789028352
1.842: < 986895: 3433703112704
1.842: huge: 170690851176448

You need at least Qt 6.2 to run this example. The result will likely be very similar for any version of Qt that can run it, even the latest 6.5 snapshots. By using the "qml" utility you disable most benefits of the Qt Quick Compiler. The code is not compiled ahead of time, no C++ code is generated, and no byte code is compiled into your (non-existent) application. If you run it more than once, subsequent invocations can still use cache files to avoid re-compiling the document to byte code, but the compilation is not the dominant factor here.

Those numbers are rather underwhelming when you think about what the program actually does. Now let's see how it behaves when you compile it to C++. Create an example project with Qt Creator, using the Qt 6.2+ template, and add the Categorizer.qml file to it. Then have your main.cpp load Categorizer.qml rather than main.qml. Compile your application in release mode; this makes sure the QML file is compiled to C++. Now run the result. Tada: the timings are the same, even with the latest 6.5 snapshot.

"OK, what's the point?" you may ask. Well, let's try something else. Add a type to the property declaration for "numbers":

property list<double> numbers: { ... }

You may notice that you now need Qt 6.4 to run this. Earlier versions of Qt will consider it a syntax error. The resulting performance when running it with 6.4, or with 6.5 without compiling to C++, is actually worse than before. I see you are getting really impatient with me, but please, let's try just one last thing: use the new code with your example project and Qt 6.5 beta1. I get:

0.000: start
0.392: done
0.392: < 3855: 607322112
0.392: < 61680: 27637481472
0.392: < 986895: 3592250556416
0.392: huge: 181336245927936

You can call me Watman now. And you can leave it at that if you like. I won't sing the NaNNaNNaN tune for you in that case. If you want to make really sure that it's the compilation to C++ that causes the speedup you can define the QML_DISABLE_DISK_CACHE environment variable:

QML_DISABLE_DISK_CACHE=1 ./my_example

This prevents the QML engine from using any byte code or native code generated for QML documents at compile time. It will then re-compile the source code at run time and interpret or JIT it.

But let me sing some NaNNaNNaN now.

Let's talk about JavaScript

If we look at this little "sum()" function through a pure JavaScript lens, we see that it retrieves a thing called "numbers" from another thing called "root". Now this "numbers" could be an array of numbers. Or it could be a URL object, a String, a dictionary keyed by combinations of "幽", "霊", "文", and "字", or just about anything else.

Then we iterate over a range of numbers. We can be pretty sure that i starts out as an integer. However, we again know nothing about the boundary condition. We simply retrieve something called "Iterations" from something called "Categorizer", and both are outside the scope of this function.

Then we retrieve something from our "numbers". This may again be just about anything, including undefined if that "j" does not exist in "numbers". The masking operation, though, is a neat little trick. The ECMAScript standard mandates that whatever you throw into an '&' operator, what you get out is an integer. Now, ECMAScript tries hard not to know about integers, but luckily it cannot completely hide them. Therefore, we could know something about the "+=" operations. Those are all numbers, after all. The worst thing that can happen there is that they overflow from integer into a real number (pun intended, and don't ask).
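
To make the claim about '&' concrete, here is a minimal C++ sketch of the ToInt32 coercion that ECMAScript applies to both operands of a bitwise operator. The names toInt32 and bitAnd are made up for this illustration, and this is not the code the QML engine uses. It assumes the operand has already been converted to a number, so undefined shows up as NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch of ECMAScript's ToInt32 for a value that was already converted
// to a number. Hypothetical helper, not taken from the QML engine.
std::int32_t toInt32(double number)
{
    // NaN (which is what undefined converts to) and the infinities become 0.
    if (!std::isfinite(number))
        return 0;

    const double twoTo32 = 4294967296.0;

    // Drop the fractional part, then wrap into the range [0, 2^32).
    double wrapped = std::fmod(std::trunc(number), twoTo32);
    if (wrapped < 0)
        wrapped += twoTo32;
    assert(wrapped >= 0 && wrapped < twoTo32);

    // Values in the upper half of that range represent negative integers.
    if (wrapped >= 2147483648.0)
        wrapped -= twoTo32;
    return static_cast<std::int32_t>(wrapped);
}

// Whatever the operands were, the result of '&' is a 32-bit integer.
std::int32_t bitAnd(double lhs, double rhs)
{
    return toInt32(lhs) & toInt32(rhs);
}
```

With this, bitAnd applied to NaN and any mask yields 0, which is why "numbers[j] & Categorizer.Mask" is an integer even if "numbers[j]" turns out to be undefined.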

So, for the inner part of the loop we could generate some fairly efficient code even without knowing anything else about the rest of the file. For the rest, we have to always assume that we might have to sing the Watman tune, and generate code generic enough to do just that.

The JavaScript interpreter (and JIT) in QML, specifically, is not even capable of optimizing the inner part of the loop, though. It doesn't do any type inference, as it's optimized for simplicity and compilation speed. It does cache the lookups for the enumerations, which makes them faster on subsequent iterations, but even the cached lookups require some function calls.

It's important to note that other JIT compilers, for example the one used in V8, can do interesting heuristic optimizations on such code. By observing the occurrence of types at run time, specialized code can be generated for types that occur frequently, with deoptimization steps for unexpected type mismatches. This comes at a cost of memory and code complexity, though.

So much about JavaScript.

Let's talk about QML

If we look at the whole file as a QML document, we know quite a bit more. First, the previously opaque values called "Categorizer.Iterations", "Categorizer.Length", etc., turn out to be enumeration values, which by definition can be expressed as integers. At compile time, we even know their values and we know they cannot change. Furthermore, we know what "root" is. "root" is an ID. Elements referenced by ID, in contrast to properties, cannot change. Here's the catch, though: as long as the "numbers" property is just a "var", we still don't know anything about it. The Qt Quick Compiler refuses to generate code that sings the Watman tune. Therefore, it rejects the whole "sum()" function in this case and we fall back to interpretation or JIT compiling.

If, however, the type for the "numbers" property is given, the Qt Quick Compiler as shipped with the latest 6.5 snapshot generates efficient C++ code for the "sum()" function. It knows that an indexed lookup in "numbers" can only ever produce a number or undefined. It might produce undefined because you may have replaced the value of the "numbers" property with a shorter list.

With that information alone the Qt Quick Compiler could generate code for the comparisons and the "+=" operations, using QJSPrimitiveValue. The result would be somewhat slower, but still faster than interpretation. Using the "&" operator, we further constrain the type, so that the Qt Quick Compiler can generate straight C++ arithmetic using int and double. It also just pastes the numeric values of the enum entries into the generated code, saving us the lookups we need to do in JavaScript.
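
The difference between those two levels of type knowledge can be sketched in plain C++. This is only an illustration: MaybeNumber is a made-up stand-in for "a number or undefined" and says nothing about how QJSPrimitiveValue is actually implemented, and the function names are invented as well. The point is that every operation on such a value has to deal with undefined first, while a value known to be an integer allows plain arithmetic.

```cpp
#include <limits>
#include <optional>

// Hypothetical stand-in for "a number or undefined".
// An empty optional plays the role of undefined.
using MaybeNumber = std::optional<double>;

// JavaScript's '+' on such operands: undefined converts to NaN.
double addGeneric(const MaybeNumber &lhs, const MaybeNumber &rhs)
{
    const double nan = std::numeric_limits<double>::quiet_NaN();
    return lhs.value_or(nan) + rhs.value_or(nan);
}

// JavaScript's '<': any comparison involving undefined is false.
bool lessThanGeneric(const MaybeNumber &lhs, const MaybeNumber &rhs)
{
    if (!lhs || !rhs)
        return false;
    return *lhs < *rhs;
}

// Once '&' has produced an integer, no checks are needed anymore.
double addKnown(double sum, int num)
{
    return sum + num;
}
```

The generic variants carry a check per operation, which is the kind of overhead the "&" operator lets the compiler drop here.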

The generated C++ code looks like this:

// var num = numbers[j] & Categorizer.Mask;
// generate_MoveReg
r17_1 = r14_1;
// generate_LoadReg
r2_3 = r12_1;
// generate_LoadElement
if (!QJSNumberCoercion::isInteger(r2_3))
    r2_5 = QJSPrimitiveValue();
else if (r2_3 >= 0 && r2_3 < r17_1.size())
    r2_5 = QJSPrimitiveValue(r17_1.at(r2_3));
else
    r2_5 = QJSPrimitiveValue();
// generate_StoreReg
r18_1 = r2_5;
// generate_GetLookup
r2_6 = 11259375;
// generate_BitAnd
r2_3 = double((r18_1.toInteger() & r2_6));
// generate_StoreReg
r13_1 = r2_3;
// if (num < Categorizer.Category0)
// generate_StoreReg
r17_2 = r2_3;
// generate_GetLookup
{
int retrieved;
retrieved = 3855;
r2_3 = double(retrieved);
}
// generate_CmpLt
r2_4 = r17_2 < r2_3;
// generate_JumpFalse
if (!r2_4) {
    goto label_4;
}
;
// cat1Sum += num;
// generate_MoveReg
r18_2 = r7_1;
// generate_LoadReg
r2_3 = r13_1;
// generate_Add
r2_3 = (r18_2 + r2_3);
// generate_StoreReg
r7_1 = r2_3;
// generate_Jump
{
    goto label_5;
}
[...]

There are a lot of unnecessary renames in there, but a C++ compiler worth its salt should be able to eliminate them. So, this is almost as fast as we can get with the inner loop without violating the ECMAScript standard. We might still perform a range analysis on the loop counters to find out that they can never be anything but integers, but so far we don't.
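
For comparison, this is roughly what the function boils down to once the renames are gone, written by hand in plain C++. It is a sketch under several assumptions, not compiler output: std::vector<double> stands in for the list<double> property, the name sumCategories and its parameters are invented for this example, the enum values are pasted in as constants, and the conversion to int assumes the stored values are finite and fit into an int, as they do in Categorizer.qml.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hand-written equivalent of sum() (illustrative, not actual compiler output).
std::array<double, 4> sumCategories(const std::vector<double> &numbers,
                                    int length, int iterations)
{
    // Enum values from Categorizer.qml, pasted in as constants.
    constexpr int Category0 = 0xf0f;
    constexpr int Category1 = 0xf0f0;
    constexpr int Category2 = 0xf0f0f;
    constexpr int Mask = 0xabcdef;

    double cat1Sum = 0, cat2Sum = 0, cat3Sum = 0, huge = 0;
    for (int i = 0; i < iterations; ++i) {
        for (int j = 0; j < length; ++j) {
            // An out-of-range index yields undefined in JavaScript,
            // and undefined & Mask is 0.
            const bool inRange = static_cast<std::size_t>(j) < numbers.size();
            // Assumes finite values within int range, as in Categorizer.qml.
            const int num = inRange ? (static_cast<int>(numbers[j]) & Mask) : 0;

            if (num < Category0)
                cat1Sum += num;
            else if (num < Category1)
                cat2Sum += num;
            else if (num < Category2)
                cat3Sum += num;
            else
                huge += num;
        }
    }
    return {cat1Sum, cat2Sum, cat3Sum, huge};
}
```

The sums stay doubles, matching the list<double> return type of sum(), so they can grow past the integer range just like in the JavaScript version.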

Let's talk about the future

Clearly, the above example is carefully tailored for maximum W{at|ow} effect. However, this is where QML is going. As the Qt Quick Compiler's language support improves, such examples will become more common.

