Qt Graphics and Performance - The Raster Engine
Today's topic is the raster engine, Qt's software rasterizer. It's the reference implementation and the only paint engine that implements all possible feature combinations.
The story of Qt's software engine started around December 2004, if my memory serves me. My colleague Trond and I had been working for a while on the new painting architecture for Qt 4, codenamed "Arthur". Trond had been working on the X11 and OpenGL 1.x engines and I was focusing on the combined Win32 GDI/GDI+ engine along with QPainter and surrounding APIs. We had introduced a few new features, such as antialiasing, alpha transparency for QColor, full world transformation support and linear gradients. As few of these new features were supported by GDI, it meant that using any of these features implied switching to GDI+, which at the time was insanely slow, at least on all the machines we had in the Oslo office back then. Actually, enabling the GDI advanced graphics mode to do transformations was also not very fast.
Then we came upon this toolkit called Anti-Grain Geometry (AGG) which did everything in software, in plain C++, and we were just amazed at what it could do. Our immediate reaction was to curl up on the floor in agony, thinking that we were going about this all wrong. Using these native APIs was not helping us at all. In fact, it was preventing us from getting the feature set we wanted with acceptable performance. Once we settled down again, our first idea was to try to implement a custom AGG paint engine which would just delegate all drawing into the AGG pipeline. But alas, the template nature of the AGG API combined with the extremely generic QPainter API bloated up into a pipeline that didn't perform nearly as well as the demos we had seen.
So we took our Christmas vacation and started over in January of 2005. Still quite depressed over having a new feature set that didn't perform, combined with being limited to a minimal subset of the native APIs, I went to Matthias and Lars and asked if I could get three weeks of time to hack together a software-only paint engine as a proof of concept. I got an "OK" and spent the following weeks implementing software pixmap transformation, bilinear filtering and clipping support in the crudest possible way. Three weeks later I had a running software paint engine and quite proudly announced that I was "just about done". I've reconstructed an image of how I remember it:
The system clipping was all over the place, bitmap patterns were broken, but perhaps worst of all, all text was rendered using QPainterPaths, and all drawing was antialiased. Despite it not looking 100% good, the performance of the various features was pretty OK. It was agreed that this was a good start, but that it needed a bit more work. And so started the sprint for the Qt 4.0 beta a few months later.
The initial version that was released with Qt 4.0 worked quite well in terms of features, but in hindsight the performance was far from what our users demanded from Qt. As a result, we harvested a lot of criticism over the first year of Qt 4.0. Since then, we've done a lot, and I mean a LOT, and my gut feeling is that it is the engine that performs the best for average Qt usage, so I think we made a good choice back then in dropping GDI and GDI+. And, as I outlined in my previous post, we are toying with making raster the default across all desktop systems for the sake of speed and consistency.
The overall structure of the engine is that all drawing is decomposed into horizontal bands with a coverage value, called spans. Many spans will together form the "mask" for a shape and each pixel that is inside the mask is filled using a span function.
The image highlights one scanline of a polygon which is filled with a linear gradient. There are 4 spans, one which fades in the opacity of the polygon and two which fade out the opacity of the gradient. For each pixel in the polygon, the gradient function is called and we write the pixel to the destination, possibly alpha blending it, if the coverage value is other than full opacity or if the pixel we got from the gradient function contains alpha.
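To make the span idea a bit more concrete, here is a minimal sketch in plain C++. It is not the engine's actual code: the `Span` layout, `fillSpans`, `rampSource` and the single 8-bit channel are all simplifications invented for illustration. Each span carries a coverage value, a per-pixel source function stands in for the gradient function, and the result is blended into the destination whenever the coverage is less than full.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical span record: a horizontal run of pixels on one scanline
// sharing a single coverage value (0 = outside, 255 = fully inside).
struct Span {
    int x;            // first pixel of the run
    int len;          // number of pixels in the run
    int y;            // scanline index
    uint8_t coverage; // 0..255
};

// Rounded integer version of (a * b) / 255 for a, b in 0..255.
inline uint8_t mul255(unsigned a, unsigned b) {
    unsigned t = a * b + 128;
    return static_cast<uint8_t>((t + (t >> 8)) >> 8);
}

// Stand-in for a gradient function: a horizontal intensity ramp.
inline uint8_t rampSource(int x, int /*y*/) {
    int v = x * 32;
    return static_cast<uint8_t>(v > 255 ? 255 : v);
}

using PixelSource = uint8_t (*)(int x, int y);

// Fill every span: ask the source for a pixel, then either write it
// directly (full coverage) or blend it with the destination.
void fillSpans(std::vector<uint8_t>& dest, int width,
               const std::vector<Span>& spans, PixelSource source) {
    for (const Span& s : spans) {
        assert(s.x >= 0 && s.len >= 0 && s.x + s.len <= width);
        assert(s.y >= 0 &&
               static_cast<std::size_t>(s.y + 1) * width <= dest.size());
        uint8_t* row = dest.data() + static_cast<std::size_t>(s.y) * width;
        for (int i = 0; i < s.len; ++i) {
            uint8_t src = source(s.x + i, s.y);
            uint8_t& d = row[s.x + i];
            if (s.coverage == 255)
                d = src;
            else
                d = static_cast<uint8_t>(mul255(src, s.coverage) +
                                         mul255(d, 255 - s.coverage));
        }
    }
}
```

A real span function would work on full ARGB pixels and also take the alpha of the source pixel into account; the sketch only shows how coverage drives the blend.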
Clipping also uses the same mechanism. The span function for clipping takes the incoming spans, intersects them with the set of spans that defines the clip and calls the actual filling span function.
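The intersection step can be sketched roughly as follows. This is an illustrative guess at the shape of such a routine, not the engine's implementation: `Span` and `intersectSpans` are hypothetical names, both lists are assumed to be sorted by scanline and then by x, and multiplying the two coverage values is my own assumption about how partially covered clip edges could be combined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical span record, one run of pixels on one scanline.
struct Span {
    int x;
    int len;
    int y;
    uint8_t coverage; // 0..255
};

// Intersect the spans of a shape with the spans of a clip region.
// Assumes both lists are sorted by (y, x) and that spans within one
// list do not overlap.
std::vector<Span> intersectSpans(const std::vector<Span>& shape,
                                 const std::vector<Span>& clip) {
    std::vector<Span> out;
    std::size_t i = 0, j = 0;
    while (i < shape.size() && j < clip.size()) {
        const Span& a = shape[i];
        const Span& b = clip[j];
        assert(a.len > 0 && b.len > 0);
        if (a.y != b.y) {
            // Skip ahead in whichever list is on the earlier scanline.
            if (a.y < b.y) ++i; else ++j;
            continue;
        }
        int left = std::max(a.x, b.x);
        int right = std::min(a.x + a.len, b.x + b.len);
        if (left < right) {
            // Combine coverages; a fully covered clip span leaves the
            // shape's coverage unchanged.
            uint8_t cov = static_cast<uint8_t>(
                (a.coverage * b.coverage + 127) / 255);
            out.push_back({left, right - left, a.y, cov});
        }
        // Advance the span that ends first.
        if (a.x + a.len < b.x + b.len) ++i; else ++j;
    }
    return out;
}
```

The resulting list would then be handed to the filling span function exactly as if it had come straight from the scan converter.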
All operations follow this pattern. When a drawRect call comes in, we generate a list of spans for each scan line and set up a span function according to the current brush. A pixmap is similar: we create a list of spans and use a pixmap span function. A polygon is passed to a scan converter which produces a span list, etc. We have two scan converters, one for antialiased and one for aliased drawing. The antialiased one is pretty much a fork of FreeType's grayraster.c, with some minor tweaks; I think we needed to add support for odd-even fills, for instance. Text is also converted into spans.
Lines, Polylines and Path Strokes
These primitives are passed to a separate processor called a stroker. The stroker creates a new path that visually matches the fillable shape that the outline represents. There is a public API for this too, in QPainterPathStroker. This fillable shape is then passed to one of the scan converters, which in turn converts the shape into spans. For dashed outlines, the same process happens, and the resulting fillable shape is a path with a potentially very large number of subpaths. Naturally, such a path is costly to scan convert, which is part of the reason why we explicitly do not put dashed lines on the list of high-performance features. In fact, in many cases, line dashing is one of the slowest operations available in the raster engine, so use it with extreme caution.
A hacky alternative which performs much better is to set a 2x2 black/white or black/transparent pixmap brush and draw the stroke using a pen with that brush. It's a bit more to set up, but if that's what it takes to get it running fast, then that's what it takes.
A setTransform call, or any other state change on QPainter, will result in a different set of span functions being set up. Each brush (or fill type, if you like, since pens on this level are essentially just fills too) has a special span function associated with it, and we also pass per-brush span data. For solid color fills the span data contains the color; for transformed pixmap drawing it contains the inverse matrix, a source pixel pointer, bytes per line and other required information. For clips it contains the span function to call after the spans have been clipped. The thing to notice about state changes is that each time you switch from one brush to another or from one transformation to another, these structures need to be updated. Up to Qt 4.4, this was in many cases a noticeable performance problem, accounting for as much as 10-15% in profilers when rendering graphics view scenes, but since 4.5 the impact of this is minimal.
Well, perhaps not minimal compared to drawing a 2 pixel long line, but minimal compared to filling a 64x64 rectangle. The point is that though the raster engine is probably the engine that handles state changes best of all our engines, there are some use cases where the cost still shows up, so state changes should still be minimized.
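As a rough picture of why a state change has a cost, consider the sketch below. It is a toy model, not Qt's internal layout: `SpanData`, `Engine`, `fillSolid`, `fillTexture` and `setupCount` are all invented names. The point is only that every brush change has to rebuild the per-brush data and pick a new span function before the next fill can run.

```cpp
#include <cassert>
#include <cstdint>

struct SpanData;
using SpanFunc = void (*)(const SpanData&, uint32_t* dest, int len);

// Hypothetical per-brush data handed to the span function.
struct SpanData {
    SpanFunc func = nullptr;
    uint32_t solidColor = 0;           // used by solid fills
    const uint32_t* texture = nullptr; // used by texture fills
    int textureWidth = 0;
};

void fillSolid(const SpanData& d, uint32_t* dest, int len) {
    for (int i = 0; i < len; ++i)
        dest[i] = d.solidColor;
}

void fillTexture(const SpanData& d, uint32_t* dest, int len) {
    for (int i = 0; i < len; ++i)
        dest[i] = d.texture[i % d.textureWidth]; // tile horizontally
}

// Toy engine: each brush change rebuilds the span data.
struct Engine {
    SpanData data;
    int setupCount = 0; // how many times state had to be rebuilt

    void setSolidBrush(uint32_t color) {
        data = SpanData{};
        data.func = fillSolid;
        data.solidColor = color;
        ++setupCount;
    }

    void setTextureBrush(const uint32_t* pixels, int width) {
        assert(pixels && width > 0);
        data = SpanData{};
        data.func = fillTexture;
        data.texture = pixels;
        data.textureWidth = width;
        ++setupCount;
    }

    void fill(uint32_t* dest, int len) const {
        assert(data.func);
        data.func(data, dest, len);
    }
};
```

Drawing a thousand small rectangles with one brush pays the setup once; alternating between two brushes pays it a thousand times, which is where the overhead starts to compete with the actual pixel work.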
The task of the span functions is to generate a pixel and combine it with the destination according to the current state of the painter. Though the raster engine supports rendering to any of our image formats except 8-bit indexed, it will internally do all rendering in ARGB32_Premultiplied. Premultiplied alpha has the benefit that we don't have to multiply the alpha into the color channels, and it saves us a division in the blending. The reason for doing all rendering in one format is that the alternative simply doesn't scale. Just think of the number of composition modes multiplied by the number of image formats a source image can have, multiplied by the number of formats the destination can have. To support all combinations we have a generic approach where, for each span, we:
- Get the source pixels, e.g. from a gradient, pixmap, image or solid color, and convert them to ARGB32_Premultiplied.
- Get the destination pixels and convert them to ARGB32_Premultiplied.
- Blend the source into the destination using the current composition mode.
- Convert the result to the destination format and write it back.
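The four steps above can be sketched for one concrete pairing: a non-premultiplied ARGB32 source drawn onto an RGB16 (565) destination with source-over composition. All function names here are hypothetical and the code is a simplified illustration, not what lives in the engine; it is meant to show why the generic path does so much work per pixel.

```cpp
#include <cassert>
#include <cstdint>

// Rounded integer version of (a * b) / 255 for a, b in 0..255.
inline uint32_t mul255(uint32_t a, uint32_t b) {
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

// Step 1 helper: ARGB32 -> ARGB32 premultiplied.
inline uint32_t premultiply(uint32_t argb) {
    uint32_t a = argb >> 24;
    uint32_t r = mul255((argb >> 16) & 0xff, a);
    uint32_t g = mul255((argb >> 8) & 0xff, a);
    uint32_t b = mul255(argb & 0xff, a);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

// Step 2 helper: RGB565 -> opaque ARGB32 premultiplied.
inline uint32_t rgb16ToArgb32(uint16_t p) {
    uint32_t r = (p >> 11) & 0x1f, g = (p >> 5) & 0x3f, b = p & 0x1f;
    r = (r << 3) | (r >> 2); // replicate high bits into the low bits
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return 0xff000000u | (r << 16) | (g << 8) | b;
}

// Step 3: source-over on premultiplied pixels, per channel:
// result = src + dst * (255 - srcAlpha) / 255. No division by alpha needed.
inline uint32_t sourceOver(uint32_t src, uint32_t dst) {
    uint32_t inv = 255 - (src >> 24);
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c = ((src >> shift) & 0xff) + mul255((dst >> shift) & 0xff, inv);
        result |= (c > 255 ? 255u : c) << shift;
    }
    return result;
}

// Step 4 helper: ARGB32 premultiplied (opaque) -> RGB565.
inline uint16_t argb32ToRgb16(uint32_t p) {
    return static_cast<uint16_t>(((p >> 8) & 0xf800) |
                                 ((p >> 5) & 0x07e0) |
                                 ((p >> 3) & 0x001f));
}

// The generic per-span loop: fetch, convert, blend, convert back, store.
void blendSpanGeneric(const uint32_t* src, uint16_t* dst, int len) {
    assert(src && dst && len >= 0);
    for (int i = 0; i < len; ++i) {
        uint32_t s = premultiply(src[i]);   // 1. source -> premultiplied
        uint32_t d = rgb16ToArgb32(dst[i]); // 2. destination -> premultiplied
        uint32_t r = sourceOver(s, d);      // 3. blend
        dst[i] = argb32ToRgb16(r);          // 4. convert back and store
    }
}
```

Every pixel goes through two format conversions on top of the blend itself, which is exactly the kind of cost a dedicated function for one source and destination pair can avoid.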
This may seem like a lot of work, so luckily the story doesn't end there.
Special casing and Optimizations
As I outlined in the QPainter documentation patch that I added recently, which was the start of this blog series, it's all about defining which scenarios we want to be fast and which scenarios we just need working. Over the years since the initial release of the raster engine in the summer of 2005, we've added tons of special cases to support what we experience as the functions that are called the most and which have the most impact.
QPainter::drawRect. In 4.4 both of these implied a state change. Actually, fillRect implied two state changes because it set the brush to what was passed to fillRect and then set it back to what the painter state was. In 4.5, as part of this Falcon project, we introduced a new internal QPaintEngine subclass which supports a state-less fillRect with a color. This matches how applications normally use the painter anyway.
- ARGB32_Premultiplied on ARGB32_Premultiplied
- ARGB32_Premultiplied on RGB32
- ARGB32_Premultiplied on RGB16
- ARGB8565_Premultiplied on RGB16
- RGB32 on RGB32
- RGB16 on RGB16
I think that was all of them.
That's a lot of details, but it gives an idea of what to consider when you write code for this engine. If all you are drawing is 1024x1024 pixmaps, then none of these things matter because all the time is spent in the span function that does pixmap blending anyway. But the second you have more content, such as several lines or several polygons which are smaller in size, these things are critical to achieving good performance.
The overall performance of the engine, when used according to how it's outlined above, can be thought of as:
Overhead + O(pixelsTouched * memoryAndBusCapacity)
There is nothing scientific about that formula, but when you're hitting the optimal path, all time should be spent in one of the many for loops inside qdrawhelper_xxx.cpp or, even better, qblendfunctions.cpp. These loops will spend all their time on per-pixel processing. If these functions could be made faster by doing the algorithms slightly differently, then great, but if you see in your profiling that all time is spent in, for instance, qt_blend_argb32_on_argb32, then that means you told us to blend alpha pixmaps together and we're doing that as fast as we can, and you have zero loss between your app and actual processing. If all time is spent processing pixels, then that is a good thing. The overhead here is the time spent in state changes, function call overhead, and similar.
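One way to read the rough formula is as a fixed cost per drawing call plus a cost per pixel touched. The sketch below plays with that reading. The `CostModel` struct and any numbers fed into it are made up purely for illustration and are not measurements of the raster engine.

```cpp
#include <cassert>

// Hypothetical cost model: time per call = fixed overhead + per-pixel work.
struct CostModel {
    double overheadNs; // state changes, function calls (assumed value)
    double nsPerPixel; // memory/bus bound per-pixel cost (assumed value)
};

// Estimated number of draw calls per second for a width x height primitive.
inline double opsPerSecond(const CostModel& m, int width, int height) {
    assert(width > 0 && height > 0);
    double nsPerOp = m.overheadNs +
                     m.nsPerPixel * static_cast<double>(width) * height;
    return 1e9 / nsPerOp;
}
```

With an assumed overhead of 100 ns and 1 ns per pixel, going from 128x128 to 256x256 cuts the estimate by a factor of about 3.98, close to the expected quarter, while going from 4x4 to 8x8 only cuts it by about 1.4 because the fixed overhead dominates. That is the regime where minimizing state changes pays off.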
I got some feedback on one of the previous blog posts that a few bar charts would be nice, so I'll post some numbers on what kind of throughput is possible with the raster paint engine. I've timed it on both my Windows desktop machine and on my N900 to get a comparison. The operations range from several million per second to only a few hundred, so the scale is logarithmic. Keep that in mind as you look at them.
As you can see, the fill rate is more or less tied to the number of pixels involved. Some operations take a little bit longer, for instance drawPixmap with scaling is somewhat slower than drawPixmap without, but you can see that the rough formula I gave above holds quite often. Double the size of the primitive in each direction and you get one quarter of the performance. It was also not my intention to trick you by using different numbers for drawPixmap; it's just how the test was set up.
If you compare the three rectangle drawing versions, you see that they differ when the rectangles are small. At 4x4, drawRect without brush change is fastest at around 7.4 Mops/sec, followed by fillRect at ~6.1 Mops/sec and then drawRect with brush change at 1.8 Mops/sec. At 128x128 there is just a little difference between them, which is what I was getting at with the state changes above. It is possible to do them, and if you're drawing semi-large areas it doesn't matter, but if you're plotting pixels, doing loads of small lines here and there or particle effects with 8x8 pixmaps, then you want to do that in a tight loop with nothing else happening.
You can also see that the speed of non-smooth scaling is holding its own vs non-scaled pixmap drawing.
Finally, if you compare the N900 to the desktop Windows machine, you see that despite the Windows machine only having a 4 times faster processor, the N900's speed is often around 10 times worse. Why? Because the CPU isn't the only limitation; bus/memory capacity is also a limiting factor, and to be honest it's not a fair comparison...
I hope you enjoyed this post and more will come in 2010.