CPU usage improvements in Qt 3D

Many improvements have been made to Qt 3D since the release of Qt 5.6, our previous long-term-support (LTS) release. Engineers from KDAB and The Qt Company have been working hard to bring new features to Qt 5.9 LTS, many of which are listed in the What's new in Qt 3D with Qt 5.9 blog post by Sean Harmer from KDAB. While more features are still on the way (such as a Vulkan backend), the focus in the latest releases has shifted towards performance and stability. The performance has in fact improved greatly since Qt 5.6, especially for complex scenes and scenes with large frame graphs.

Scenes with many viewports typically result in large frame graphs because each viewport corresponds to a leaf node. If you are not familiar with the frame graph concept in Qt 3D and how powerful it is, you should check out Paul Lemire's blog post at kdab.com. The screenshot below is from one of our internal benchmarks; a fairly simple (and colorful) scene with 28 viewports:

Qt 3D benchmark with many viewports

The CPU usage in this benchmark has gone significantly down from Qt 5.6.2 to Qt 5.9.2, and the Qt Company is working together with the engineers in KDAB on a series of changes that we expect will make the CPU usage drop even further in Qt 5.11:

viewport-benchmark-results

Many of the performance improvements have been driven forward by the port of the Qt 3D Studio runtime to be based on Qt 3D. Although the runtime is planned for a release next year, we are already now adding performance improvements to the current Qt 5.9.x LTS series. Here are some benchmark results from our internal Qt3D Studio examples:

3dstudio-benchmark-results

Performance improvements are added in many parts of Qt 3D. We have for instance also added support for efficient file formats, such as glTF2. In this post, we will go into details about some of the changes we are making to reduce the CPU usage, while in a later post, we will discuss reductions in memory consumption.

Improving the job dependency solver

One of the performance improvements we have made was to the Qt 3D job dependency solver. Qt 3D divides work that needs to be done each frame into separate, smaller jobs that can be run in parallel. The jobs are part of the flexible backend/frontend architecture of Qt 3D, which separates the frontend on the main thread from the backend, which consists of aspects that perform the rendering, input handling, and animation (more on this in the Qt 3D Overview docs).

The backend runs jobs from the different aspects on a thread pool, and each job can define dependencies on other jobs that should run before it. These dependencies need to be resolved efficiently, because the jobs often change from one frame to another. While this is simple when the number of jobs is small, it becomes increasingly time-consuming for complex scenes with large frame graphs.

By profiling our examples with Callgrind, we found performance bottlenecks in certain parts of the job dependency solver. Specifically, a large QVector of all dependencies would be resized whenever jobs were finished and corresponding dependencies could be removed from the list. This drastically reduced the performance.

We started working on a solution where we would get rid of the QVector altogether and store two lists on the jobs instead: One list of who the job has dependencies on and one list of who depends on it.

class AspectTaskRunnable {
    // ... other definitions
    QVector m_dependencies;
    QVector m_dependers;
};

With this solution, when a job finished, it could go through its list of m_dependers and remove itself from the depender's list of m_dependencies. If their list was now empty, that job could also be started. However, we now had many small QVectors that were resized all the time. Although better than resizing a big QVector, it was still not optimal.

Finally, we realized that because the dependencies cannot change while jobs are running, there was no need to track both who depends on a job and who the job is dependent on. For each job, it is enough to know which jobs depend on it, and only the number of jobs it is dependent on.

class AspectTaskRunnable {
    // ... other definitions
    int m_dependencyCount = 0;
    QVector<AspectTaskRunnable*> m_dependers;
};

Whenever a job is finished, we go through the list of jobs depending on it, and subtract their dependency count by one. The latter code looks something like this (shamelessly simplified for readability):

void QThreadPooler::taskFinished(AspectTaskRunnable *job)
{
    const auto &dependers = job->m_dependers;
    for (auto &depender : dependers) {
        depender->m_dependencyCount--;
        if (depender->m_dependencyCount == 0) {
            m_threadPool.start(depender);
        }
    }
}

By implementing this change, the job dependency solver became an insignificant contributor to the CPU usage, and we could start focusing on other bottlenecks.

Improving the performance of QThreadPool

Other parts of Qt are also benefiting from optimization opportunities found in our benchmarks. For instance, Qt 3D uses QThreadPool from Qt Core to automatically manage the jobs and allocate them to different threads. However, similarly to the previous case, QThreadPool used to store jobs in a QVector which would be resized whenever a job was finished. This is not a big issue for use cases with few jobs, but suddenly became a bottleneck for complex Qt 3D scenes with a large number of jobs.

We decided to change the implementation in QThreadPool to use larger "queue pages" and put pointers to these pages in a QVector. In each page, we keep track of the index of the first job in the queue and the index of the last job in the queue:

class QueuePage {
    enum {
        MaxPageSize = 256;
    };

// ... helper functions, etc.

int m_firstIndex = 0; int m_lastIndex = -1; QRunnable *m_entries[MaxPageSize]; };

Now, all we have to do is increment the first index whenever a job is finished and increment the last index when a job is added. If there is no more room in the page, we allocate a new one. It's a simple and low-level implementation, but it's also way more efficient.

Caching the results of specific jobs

Next, we found that certain jobs stood out as very CPU-demanding. Some of these jobs, such as the QMaterialParameterGathererJob, were doing a lot of work each frame even though the results from previous frames were the same. This was a clear opportunity for caching the results to improve the performance. First, let us have a look at what QMaterialParameterGathererJob does.

In Qt 3D, you can override the values of every parameter defined in a QRenderPass by setting it on the QTechnique, QEffect or QMaterial which is using that render pass. Each parameter is in turn used to define a uniform value in the final shader program. This code shows a QML example where the parameter "color" is set on all levels:

Material {
    parameters: [
        Parameter { name: "color"; value: "red"}
    ]
    effect: Effect {
        parameters: [
            Parameter { name: "color"; value: "blue"}
        ]
        techniques: Technique {
              // ... graphics API filter, filter keys, etc.

parameters: [ Parameter { name: "color"; value: "green"} ] renderPasses: RenderPass { parameters: [ Parameter { name: "color"; value: "purple"} ] shaderProgram: ShaderProgram { // vertex shader code, etc.

fragmentShaderCode: " #version 130 uniform vec4 color; out vec4 fragColor; void main() { fragColor = color; } " } } } } }

To figure out the final value of a parameter used in the shader program, QMaterialParameterGathererJob goes through all the materials in a scene and finds the corresponding effects, techniques, and render passes. Then, by prioritizing parameters set in QMaterial over QEffect, QTechnique, and QRenderPass, we determine the final value of the parameter.
In this case, the value is "red", because the QMaterial parameters have the highest priority.

Gathering all parameters is quite time-consuming in large scenes with many materials, and showed up as a bottleneck for some of our Qt 3D Studio examples. We therefore decided to cache the parameter values found by the QMaterialParameterGathererJob, but quickly understood that the cache would always be invalidated if the values changed every frame. This is a common case, especially if parameters are animated. Instead, we decided to cache pointers to QParameter objects rather than their values. The values are then stored outside of the cache, and are fetched only when needed. Caching the results resulted in a huge performance boost in scenes with many parameters, as we only had to run the job when there are larger changes in the scene, such as when materials are added.

We have worked with many cases like this, where we have taken some of our larger examples, profiled them, identified bottlenecks in specific jobs, and worked to find ways to improve the performance or cache the results. Thankfully, the job-based system in Qt 3D makes it easy to optimize or cache specific jobs independently, so you can expect more improvements like this to land in the upcoming Qt 3D releases.


Blog Topics:

Comments