QStrings and Unicode -- optimising QString::fromUtf8

Last time, I attacked the problem of the conversion from Latin 1 (ISO-8859-1) to UTF-16 in QStrings. Most character literals in C++ source code are actually ASCII, which is a subset of Latin 1, which in turn is the easiest encoding to convert to UTF-16. So the performance of that operation is quite important to us.

But I was left wondering: this is 2011, why are we still restricting ourselves to ASCII? I mean, even if you're just writing your non-translated messages in English, you sometimes need some non-ASCII codepoints, like the "micro" sign (µ), the degree sign (°), the copyright sign (©) or even the Euro currency sign (€). I specifically added the Euro to the list because, unlike the others, it's not part of Latin 1, so you need to use another encoding to represent it. Besides, this is 2011: the de facto encoding for text interchange is UTF-8.
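
To make that concrete, here's a small illustrative snippet (my own example, not from the Qt sources); the byte values are the standard Latin 1 and UTF-8 encodings of those characters:

    #include <QString>

    // µ, ° and © all have single-byte Latin 1 code points, so fromLatin1 works:
    QString micro = QString::fromLatin1("\xB5");       // U+00B5 MICRO SIGN

    // The Euro sign (U+20AC) has no Latin 1 code point; in UTF-8 it is the
    // three-byte sequence 0xE2 0x82 0xAC, so it needs fromUtf8:
    QString euro = QString::fromUtf8("\xE2\x82\xAC");  // U+20AC EURO SIGN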

So I asked myself the question: what if we changed the default codec used by QString from Latin 1 to UTF-8? Clearly, the concern on my mind was that the conversion from UTF-8 is much more complex than Latin 1, so we could have a performance hit. We'd lose all the SIMD work Benjamin and I did to improve the conversion.

Having spent a good amount of time optimising the fromLatin1 operation, I decided to look into the UTF-8 one. The first thing I realised when looking at the QString::fromUtf8 code is that it calls out to the same code that QTextCodec uses. So it carries a few things that are only needed for QTextCodec and that could slow down the conversion: it detects the presence of a UTF-8 BOM, it tries to restore a conversion state at the beginning and save it at the end, and it has an option to replace bad input with NULs instead of U+FFFD.

So the first thing I did was to make the code stateless. I did that by removing code. As you'll see in the graphs below, somehow this made the code worse! So I decided to rewrite the code instead, based on the experience I got from the fromLatin1 work.

I also took the opportunity to optimise the code for ASCII. If this becomes the default codec for QString, we should expect that most UTF-8 strings passed in are also ASCII strings, so that's the common case to optimise for. You'll see in the graphs below that I ran the fromLatin1 conversions (the best results from the last blog) on the same data, so I can compare how much worse the UTF-8 conversion is than the Latin 1 one.
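
To give an idea of the shape of that rewrite, here's a simplified sketch of a stateless, ASCII-optimised UTF-8 to UTF-16 converter (my own reconstruction, not the actual Qt code; the name sketchFromUtf8 is made up, and most validation of malformed sequences is omitted):

    #include <QString>

    QString sketchFromUtf8(const char *chars, int len)
    {
        QString result;
        result.resize(len);       // UTF-16 never needs more code units than UTF-8 has bytes
        ushort *const dstBegin = reinterpret_cast<ushort *>(result.data());
        ushort *dst = dstBegin;
        const uchar *src = reinterpret_cast<const uchar *>(chars);
        const uchar *end = src + len;

        while (src < end) {
            uchar b = *src++;
            if (b < 0x80) {                                     // ASCII: just zero-extend
                *dst++ = b;
            } else if ((b & 0xe0) == 0xc0 && src < end) {       // 2-byte sequence
                *dst++ = ((b & 0x1f) << 6) | (*src++ & 0x3f);
            } else if ((b & 0xf0) == 0xe0 && end - src >= 2) {  // 3-byte sequence
                uint ucs = uint(b & 0x0f) << 12;
                ucs |= uint(*src++ & 0x3f) << 6;
                ucs |= uint(*src++ & 0x3f);
                *dst++ = ucs;
            } else if ((b & 0xf8) == 0xf0 && end - src >= 3) {  // 4-byte sequence
                uint ucs = uint(b & 0x07) << 18;
                ucs |= uint(*src++ & 0x3f) << 12;
                ucs |= uint(*src++ & 0x3f) << 6;
                ucs |= uint(*src++ & 0x3f);
                ucs -= 0x10000;                                 // outside the BMP: surrogate pair
                *dst++ = 0xd800 + (ucs >> 10);
                *dst++ = 0xdc00 + (ucs & 0x3ff);
            } else {
                *dst++ = 0xfffd;                                // malformed input: U+FFFD
            }
        }
        result.truncate(int(dst - dstBegin));
        return result;
    }

The point of the structure is that the very first branch, the one taken for every ASCII byte, stays as cheap as possible.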

The next step, after writing the generic C++ code, was to try and optimise with SIMD instructions. So I copied the code I had for fromLatin1; I simply had to detect whether any of the bytes being converted had its highest bit set. With SSE2, this was easy, as there's an instruction that takes the highest bit of each byte in the SSE register and saves them as a mask in a regular CPU register. The algorithm is thus:

  1. Load 16 bytes from memory
  2. Get the highest bits from those bytes and check if any was set
  3. If any was set, identify which byte was set

Identifying which bit was set is easy, using the "Bit Scan Forward" instruction. The result of the bit scan is also the index of the first non-ASCII byte in the loaded data. That way, we keep the SSE2 loop for the common case of ASCII and just introduce the "move mask + bit scan" pair in the middle to detect non-ASCII input.

Another thing I did: even if a non-ASCII byte is detected, I still store the 16 bytes (after unpacking) in which that non-ASCII character was found. That means we temporarily have an invalid conversion in the output, but at least the ASCII part of those bytes has already been saved to memory.
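
Putting the three steps and the unconditional store together, the inner loop looks roughly like this (an SSE2 sketch of my own, not the actual Qt code; convertAsciiSse2 is a made-up name, and both the leftover tail and the first non-ASCII byte are left to the generic decoder):

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <QtGlobal>      // uchar, ushort

    // Returns the number of leading input bytes handled as ASCII; the caller
    // resumes the generic UTF-8 decoder at that index.
    static int convertAsciiSse2(const uchar *src, int len, ushort *dst)
    {
        int i = 0;
        for ( ; i + 16 <= len; i += 16) {
            // 1. load 16 bytes from memory
            __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + i));

            // unpack both halves against zero to get 16 UTF-16 code units and
            // store them unconditionally, so the ASCII prefix is already in
            // memory even if this chunk turns out to contain non-ASCII bytes
            const __m128i zero = _mm_setzero_si128();
            _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i),
                             _mm_unpacklo_epi8(chunk, zero));
            _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + i + 8),
                             _mm_unpackhi_epi8(chunk, zero));

            // 2. PMOVMSKB: the highest bit of each byte, as a 16-bit mask
            int mask = _mm_movemask_epi8(chunk);
            if (mask)
                // 3. bit scan forward (here via the GCC builtin): the index
                // of the first non-ASCII byte in this chunk
                return i + __builtin_ctz(mask);
        }
        return i;
    }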

Then I tried to do the same with Neon on ARM, but I ran into two issues. One, which is easy to solve, is that ARM doesn't have a "Bit Scan Forward" (find lowest set bit) instruction, but it does have a "Count Leading Zeroes" instruction, which finds the highest set bit. We solve that by first using the instruction that reverses the bits in the register (RBIT) and then counting the leading zeroes.

The bigger problem is that Neon has no equivalent to SSE2's "move mask" instruction for extracting the highest bit of each byte in the loaded register. I searched the Internet for algorithms to compensate and found one, involving one AND operation and three parallel adds, but its performance wasn't great. Instead, I solved the problem by using the VTST ("vector test bits") instruction to test for the highest bit being set and then spilling the doubleword Neon register onto two ARM 32-bit registers. After that, it's regular code to detect a set high bit and do the UTF-8 conversion.
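
Here's roughly what that detection looks like with Neon intrinsics (again my own reconstruction, not the actual Qt code; convertAsciiNeon is a made-up name, and on ARM the compiler expands __builtin_ctz into the bit-reverse plus Count Leading Zeroes pair described above):

    #include <arm_neon.h>
    #include <QtGlobal>      // uchar, ushort, quint32

    // Returns the number of leading input bytes handled as ASCII,
    // processing one doubleword (8 bytes) per iteration.
    static int convertAsciiNeon(const uchar *src, int len, ushort *dst)
    {
        const uint8x8_t highBit = vdup_n_u8(0x80);
        int i = 0;
        for ( ; i + 8 <= len; i += 8) {
            uint8x8_t chunk = vld1_u8(src + i);

            // widen to 16-bit and store unconditionally, as in the SSE2 version
            vst1q_u16(dst + i, vmovl_u8(chunk));

            // VTST: each byte of 'mask' becomes 0xFF if its high bit was set
            uint8x8_t mask = vtst_u8(chunk, highBit);

            // spill the doubleword Neon register onto two 32-bit values
            uint32x2_t mask32 = vreinterpret_u32_u8(mask);
            quint32 lo = vget_lane_u32(mask32, 0);   // bytes 0-3
            quint32 hi = vget_lane_u32(mask32, 1);   // bytes 4-7
            if (lo)
                return i + __builtin_ctz(lo) / 8;    // first non-ASCII byte
            if (hi)
                return i + 4 + __builtin_ctz(hi) / 8;
        }
        return i;
    }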

Here's the result, in a graph:

[Graph: conversion benchmark results for the UTF-8 and Latin 1 code paths, normalised to the Qt 4.7 baseline on each platform]

Like last time, the first column is the baseline and is pegged to 100%, since the hardware being tested is quite different. The second column is the Qt 4.7 code with the portions that did the stateful activities removed. So, like I said, somehow removing code made it slower...

The third column is my rewrite of the UTF-8 conversion in pure, cross-platform C++ code, trying to optimise for ASCII. It's not a great improvement on ARM (less than 5%), but on Atom the code is over 50% faster[*] than before. The fourth column is the SIMD-optimised code using the best technique from the last blog, but with UTF-8 support; the fifth is the same, but without checking for malformed UTF-8 input.

Like I said before, the question I was trying to answer was what the impact would be if we changed the default codec from Latin 1 to UTF-8. To answer it, the final three columns are the benchmarking of the fromLatin1 functions with the same data, compared to the same baseline. My current best code for UTF-8 is between 30 and 50% worse than the best code for Latin 1.

In other words, either we don't change the default, or we keep the QLatin1String class for performance-sensitive code. All of Qt already uses it anyway...

[*] Remember that 50% faster means 1.5 times the throughput, so the operation takes 1/1.5 ≈ 66.7% of the time, that is, 33.3% less time.

