QStrings and Unicode -- optimising QString::fromUtf8

Last time, I attacked the problem of the conversion from Latin 1 (ISO-8859-1) to UTF-16 in QStrings. Most character literals in C++ source code are actually ASCII, which is a subset of Latin 1, which in turn is the easiest encoding to convert to UTF-16. So the performance of that operation is quite important to us.

But I was left wondering: this is 2011, why are we still restricting ourselves to ASCII? I mean, even if you're just writing your non-translated messages in English, you sometimes need some non-ASCII codepoints, like the "micro" sign (µ), the degree sign (°), the copyright sign (©) or even the Euro currency sign (€). I specifically added the Euro to the list because, unlike the others, it's not part of Latin 1, so you need to use another encoding to represent it. Besides, this is 2011: the de facto encoding for text interchange is UTF-8.
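
To make that concrete, here's a small illustrative snippet (my own example, not from the Qt sources); the byte values are the standard Latin 1 and UTF-8 encodings of those characters:

    #include <QString>

    // µ, ° and © all have single-byte Latin 1 code points, so fromLatin1 works:
    QString micro = QString::fromLatin1("\xB5");       // U+00B5 MICRO SIGN

    // The Euro sign (U+20AC) has no Latin 1 code point; in UTF-8 it is the
    // three-byte sequence 0xE2 0x82 0xAC, so it needs fromUtf8:
    QString euro = QString::fromUtf8("\xE2\x82\xAC");  // U+20AC EURO SIGN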

So I asked myself the question: what if we changed the default codec used by QString from Latin 1 to UTF-8? Clearly, the concern on my mind was that the conversion from UTF-8 is much more complex than Latin 1, so we could have a performance hit. We'd lose all the SIMD work Benjamin and I did to improve the conversion.

Having spent a good amount of time optimising the fromLatin1 operation, I decided to look into the UTF-8 one. The first thing I realised when looking at the QString::fromUtf8 code is that it calls out to the same code that QTextCodec uses. So it carries a few things that are only needed for QTextCodec and that could slow down the conversion: it detects the presence of a UTF-8 BOM, it tries to restore a conversion state at the beginning and save it at the end, and it has an option to replace bad input with NULs instead of U+FFFD.

So the first thing I did was to make the code stateless. I did that by removing code. As you'll see in the graphs below, somehow this made the code worse! So I decided to rewrite the code instead, based on the experience I got from the fromLatin1 work.

I also took the opportunity to optimise the code for ASCII. If this becomes the default codec for QString, we should expect that most UTF-8 strings passed in are also ASCII strings, so that's the common case to optimise for. You'll see in the graphs below that I ran the fromLatin1 conversions (the best results from the last blog) on the same data, so I can compare how much worse the UTF-8 conversion is than the Latin 1 one.
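
To give an idea of the shape of that rewrite, here's a simplified sketch of a stateless, ASCII-optimised UTF-8 to UTF-16 converter (my own reconstruction, not the actual Qt code; the name sketchFromUtf8 is made up, and most validation of malformed sequences is omitted):

    #include <QString>

    QString sketchFromUtf8(const char *chars, int len)
    {
        QString result;
        result.resize(len);       // UTF-16 never needs more code units than UTF-8 has bytes
        ushort *const dstBegin = reinterpret_cast<ushort *>(result.data());
        ushort *dst = dstBegin;
        const uchar *src = reinterpret_cast<const uchar *>(chars);
        const uchar *end = src + len;

        while (src < end) {
            uchar b = *src++;
            if (b < 0x80) {                                     // ASCII: just zero-extend
                *dst++ = b;
            } else if ((b & 0xe0) == 0xc0 && src < end) {       // 2-byte sequence
                *dst++ = ((b & 0x1f) << 6) | (*src++ & 0x3f);
            } else if ((b & 0xf0) == 0xe0 && end - src >= 2) {  // 3-byte sequence
                uint ucs = uint(b & 0x0f) << 12;
                ucs |= uint(*src++ & 0x3f) << 6;
                ucs |= uint(*src++ & 0x3f);
                *dst++ = ucs;
            } else if ((b & 0xf8) == 0xf0 && end - src >= 3) {  // 4-byte sequence
                uint ucs = uint(b & 0x07) << 18;
                ucs |= uint(*src++ & 0x3f) << 12;
                ucs |= uint(*src++ & 0x3f) << 6;
                ucs |= uint(*src++ & 0x3f);
                ucs -= 0x10000;                                 // outside the BMP: surrogate pair
                *dst++ = 0xd800 + (ucs >> 10);
                *dst++ = 0xdc00 + (ucs & 0x3ff);
            } else {
                *dst++ = 0xfffd;                                // malformed input: U+FFFD
            }
        }
        result.truncate(int(dst - dstBegin));
        return result;
    }

The point of the structure is that the very first branch, the one taken for every ASCII byte, stays as cheap as possible.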

The next step, after writing the generic C++ code, was to try and optimise with SIMD instructions. So I copied the code I had for fromLatin1; I simply had to detect whether any of the bytes being converted had its highest bit set. With SSE2, this was easy, as there's an instruction that takes the highest bit of each byte in the SSE register and saves them as a mask in a regular CPU register. The algorithm is thus:

  1. Load 16 bytes from memory
  2. Get the highest bits from those bytes and check if any was set
  3. If any was set, identify which byte was set

Identifying which bit was set is easy, using the "Bit Scan Forward" instruction. The result of the bit scan is also the index of the first non-ASCII byte in the loaded data. That way, we keep the SSE2 loop for the common case of ASCII and just introduce the "move mask + bit scan" pair in the middle to detect non-ASCII input.

Another thing I did: even if a non-ASCII byte is detected, I still store the 16 bytes (after unpacking) in which that non-ASCII character was found. That means we temporarily have an invalid conversion in the output, but at least the ASCII part of those bytes has already been saved to memory.
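
Putting the three steps and the unconditional store together, the inner loop looks roughly like this (an SSE2 sketch of my own, not the actual Qt code; convertAsciiSse2 is a made-up name, and both the leftover tail and the first non-ASCII byte are left to the generic decoder):

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <QtGlobal>      // uchar, ushort

    // Returns the number of leading input bytes handled as ASCII; the caller
    // resumes the generic UTF-8 decoder at that index.
    static int convertAsciiSse2(const uchar *src, int len, ushort *dst)
    {
        int i = 0;
        for ( ; i + 16 <= len; i += 16) {
            // 1. load 16 bytes from memory
            __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));

            // unpack both halves against zero to get 16 UTF-16 code units and
            // store them unconditionally, so the ASCII prefix is already in
            // memory even if this chunk turns out to contain non-ASCII bytes
            const __m128i zero = _mm_setzero_si128();
            _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i),
                             _mm_unpacklo_epi8(chunk, zero));
            _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i + 8),
                             _mm_unpackhi_epi8(chunk, zero));

            // 2. PMOVMSKB: the highest bit of each byte, as a 16-bit mask
            int mask = _mm_movemask_epi8(chunk);
            if (mask)
                // 3. bit scan forward (here via the GCC builtin): the index
                // of the first non-ASCII byte in this chunk
                return i + __builtin_ctz(mask);
        }
        return i;
    }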

Then I tried to do the same with Neon on ARM, but I ran into two issues. One, which is easy to solve, is that ARM doesn't have a "Bit Scan Forward" (find lowest set bit) instruction, but it does have a "Count Leading Zeroes" instruction, which finds the highest set bit. We solve that by first using the instruction that reverses the bits in the register (RBIT) and then counting the leading zeroes.

The bigger problem is that Neon has no equivalent to SSE2's "move mask" instruction for extracting the highest bit of each byte in the loaded register. I searched the Internet for algorithms to compensate and found one, involving one AND operation and three parallel adds, but its performance wasn't great. Instead, I solved the problem by using the VTST ("vector test bits") instruction to test for the highest bit being set and then spilling the doubleword Neon register onto two ARM 32-bit registers. After that, it's regular code to detect a set high bit and do the UTF-8 conversion.
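
Here's roughly what that detection looks like with Neon intrinsics (again my own reconstruction, not the actual Qt code; convertAsciiNeon is a made-up name, and on ARM the compiler expands __builtin_ctz into the bit-reverse plus Count Leading Zeroes pair described above):

    #include <arm_neon.h>
    #include <QtGlobal>      // uchar, ushort, quint32

    // Returns the number of leading input bytes handled as ASCII,
    // processing one doubleword (8 bytes) per iteration.
    static int convertAsciiNeon(const uchar *src, int len, ushort *dst)
    {
        const uint8x8_t highBit = vdup_n_u8(0x80);
        int i = 0;
        for ( ; i + 8 <= len; i += 8) {
            uint8x8_t chunk = vld1_u8(src + i);

            // widen to 16-bit and store unconditionally, as in the SSE2 version
            vst1q_u16(dst + i, vmovl_u8(chunk));

            // VTST: each byte of 'mask' becomes 0xFF if its high bit was set
            uint8x8_t mask = vtst_u8(chunk, highBit);

            // spill the doubleword Neon register onto two 32-bit values
            uint32x2_t mask32 = vreinterpret_u32_u8(mask);
            quint32 lo = vget_lane_u32(mask32, 0);   // bytes 0-3
            quint32 hi = vget_lane_u32(mask32, 1);   // bytes 4-7
            if (lo)
                return i + __builtin_ctz(lo) / 8;    // first non-ASCII byte
            if (hi)
                return i + 4 + __builtin_ctz(hi) / 8;
        }
        return i;
    }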

Here's the result, in a graph:

[Graph: conversion benchmark results for the UTF-8 and Latin 1 code paths, normalised to the Qt 4.7 baseline on each platform]

Like last time, the first column is the baseline and is pegged to 100%, since the hardware being tested is quite different. The second column is the Qt 4.7 code with the portions that did the stateful activities removed. So, like I said, somehow removing code made it slower...

The third column is my rewrite of the UTF-8 conversion in pure, cross-platform C++ code, trying to optimise for ASCII. It's not a great improvement on ARM (less than 5%), but on Atom the code is over 50% faster[*] than before. The fourth column is the SIMD-optimised code using the best technique from the last blog, but with UTF-8 support; the fifth is the same, but without checking for malformed UTF-8 input.

Like I said before, the question I was trying to answer was what the impact would be if we changed the default codec from Latin 1 to UTF-8. To answer it, the final three columns are the benchmarking of the fromLatin1 functions with the same data, compared to the same baseline. My current best code for UTF-8 is between 30 and 50% worse than the best code for Latin 1.

In other words, either we don't change the default, or we keep the QLatin1String class for performance-sensitive code. All of Qt already uses it anyway...

[*] Remember that 50% faster means 1.5 times the throughput, so the operation takes 1/1.5 ≈ 66.7% of the time, that is, 33.3% less time.

