Some recent changes to choice of translation

April 25, 2025 by Eddy

Since 6.7 there has been a flurry of activity in how Qt chooses, and helps its users choose, suitable localisation and internationalisation. That flurry seems to have settled down now, in 6.9, so it's time to give a summary of what's changed and why. The story actually starts in 6.4 with a fix for QTBUG-102796, to make the ordering of entries in uiLanguages() consistent between the system locale and those based on data from the Unicode Consortium's Common Locale Data Repository (CLDR). But first, let's get …

A little context

When an application has a choice of resources to use, to tune the application to suit the user's needs, one category of choice is known as localisation and internationalisation – or L10n and I18n, since even anglophones don't agree on how to spell them, but several languages have 12-letter words for the former and 20-letter words for the latter, in each case agreeing at the start and end. The idea is to adapt (where relevant) to what languages the user understands, what scripts they can read and what conventions they use for writing various things (for example, amounts of money or the numeric forms of dates). The first two of these are obviously enough known as language and script; the rest is assumed to depend on those and where the user lives, which is identified in terms of a territory – in most cases a country but, the world being how it is, there are complications. The combination of language, script and territory is known as a locale. In Qt, L10n and I18n are principally taken care of by QLocale and QTranslator, although there are some other places that get involved.

QLocale knows how to query the operating system to find what the user's settings say about their L10n preferences. It's also what applications consult to get things like lists of available L10n choices, such as the application might offer in a dialog to let the user set a situation-specific L10n, taking the place of those user settings. Either way, a QLocale instance has the information to help select suitable L10n and I18n for other features of the application. In particular, one really big part of this, which in Qt is treated as I18n, is how text written by programmers gets translated into text to be read by users. That's taken care of by QTranslator, in cooperation with Qt Linguist and related tools.

QTranslator gets to select among the available translations that are installed for an application, to pick one suitable for the user. Its source of truth for that is QLocale::uiLanguages() – which should really be called uiLocales(), since it returns a list of locale identifiers, not of languages. The idea is that it picks, from the available translations, one matching as early an entry in uiLanguages() as it can find.

All the recent changes have been driven by trying to make that process more robust and reliable in the face of the diverse user configurations that may be out there. In the long term, my hope is that we can implement a QLocaleSelector (see QTBUG-112765) that can handle this more gracefully but, for now, improvements in this area have taken the form of refinements to what uiLanguages() returns and how QTranslator uses it. There are other parts of Qt that use uiLanguages() in similar ways, and we may well review how they do so, now that the dust has settled, but they don't always have the same priorities as translation. For example, text-to-speech requires selection of a locale-appropriate voice, but might not care about the script aspect of a locale.

Locale identifiers

The entries in the list of strings returned by uiLanguages() are identifiers for locales, made up of so-called subtags, joined together by separators (I'll be using dashes, but underscores are also commonly used). Each subtag identifies a language, script or territory; they usually appear in this order. (In general, subtags can represent other things, but Qt only recognises these three.) Thus, for example, en-Latn-US identifies English as it is spoken in the USA and written in the Latin script (by which is meant the common script of most European languages, on whose unaccented forms the US-ASCII character repertoire is based). An identifier can't have two subtags of the same kind, but it can leave out script and/or territory; and the special language und (for undefined) is used as placeholder for language when it is not specified. So en is a generic locale using the English language, and und-AU is a generic locale for Australia.
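To make the structure of these identifiers concrete, here is a minimal Python sketch of splitting one into its subtags. It is an illustration only, not how QLocale parses identifiers: the function name parse_locale_id is made up for this example, it classifies subtags purely by their shape (four letters for a script, two letters or three digits for a territory), and it insists on the usual language, script, territory order.

```python
import re

def parse_locale_id(identifier):
    """Split a locale identifier into (language, script, territory).

    An omitted script or territory comes back as ""; "und" is kept as the
    placeholder for an unspecified language.
    """
    subtags = re.split(r"[-_]", identifier)  # accept either separator
    if not re.fullmatch(r"[A-Za-z]{2,3}", subtags[0]):
        raise ValueError(f"no language subtag at the start of {identifier!r}")
    language, script, territory = subtags[0].lower(), "", ""
    for subtag in subtags[1:]:
        if re.fullmatch(r"[A-Za-z]{4}", subtag) and not script and not territory:
            script = subtag.title()        # scripts are written like "Latn"
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtag) and not territory:
            territory = subtag.upper()     # territories like "US" or "419"
        else:
            # Repeated kinds, out-of-order subtags and anything else land here.
            raise ValueError(f"unexpected subtag {subtag!r} in {identifier!r}")
    return language, script, territory
```

With this, parse_locale_id("en-Latn-US") gives ("en", "Latn", "US"), parse_locale_id("zh_TW") gives ("zh", "", "TW") and parse_locale_id("und-AU") gives ("und", "", "AU"), while an identifier with two territories, such as en-US-GB, is rejected.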

Aside from the system locale, QLocale gets all its data about L10n from the Unicode Consortium's Common Locale Data Repository (CLDR). This comes with a set of likely subtag rules that are used to fill in the blanks when a locale is incompletely specified. If two locale identifiers, when filled in according to these rules, give the same full form, QLocale treats them as equivalent. Since they're equivalent under likely sub-tag rules, I'll call this likely-equivalence.

Thus, for example, the rule da ⇒ da-Latn-DK says that if all you know about the user's preference is that they speak Danish, you're most likely best off using the Latin script and the ways of using Danish that are usual in Denmark. In most cases, the territory implied by a given language is the one the language is named after, with two notable exceptions: English and Portuguese don't map to England and Portugal. Instead, the rules en ⇒ en-Latn-US and pt ⇒ pt-Latn-BR map them to the USA and Brazil, due to there being more speakers of those languages in their former colonies than in the land of origin. Except that there isn't actually an en ⇒ en-Latn-US rule – because this equivalence is implied by the following.

The final fallback of the likely subtag rules, after 790 others (at CLDR v46.1), is und ⇒ en-Latn-US. This says that if you don't know what else to do, try the USAish form of English written in the Latin script. For the incomplete locale descriptions that don't have a matching likely subtag rule, there are rules for how to pick a likely subtag rule to apply. For example, if all you know is the user is Australian, expressed as und-AU, for which there is no rule, you set aside the only thing you knew, AU, apply the und rule above and then restore the thing you knew, replacing the US territory part of en-Latn-US with AU to get en-Latn-AU. Given und-Latn-AU, the same rule would imply en-Latn-AU. Likewise, plain en gets augmented by taking its remaining subtags from the und rule, which makes it equivalent to en-Latn-US despite the lack of an overt rule saying so, as I mentioned above.

The rules then allow one to take a partially-specified locale, such as en-AU, and infer the parts omitted. Conversely, given a fully-specified locale, the same rules say which parts of it one can omit and still imply the same thing. For example, en-Latn-AU can be pruned to en-AU, since that implies the same full form. However, en ⇒ en-Latn-US, which differs from what en-AU implies, so we can't prune en-AU down to en. While starting with only AU does (see above) imply en-Latn-AU, the same as en-AU implies, we have to express what we started with as und-AU, which isn't a pruning of en-AU, so en-AU is minimal.
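The filling-in and pruning described above can be sketched in a few lines of Python. This is a simplified illustration, not CLDR's full algorithm or QLocale's code: the LIKELY table is a hand-picked handful of the real rules, the lookup order is an assumption that merely reproduces the examples in this post, and the names fill_in, likely_equivalent and minimal_form are invented here. Because the table is so small, the results are only meaningful for the languages it lists (plus English, via the und fallback).

```python
# A tiny subset of CLDR's likely subtag rules; the real table has hundreds.
LIKELY = {
    "da": "da-Latn-DK",
    "pt": "pt-Latn-BR",
    "zh": "zh-Hans-CN",
    "und": "en-Latn-US",  # the final fallback
}

def split(identifier):
    """Return (language, script, territory), using "" for omitted parts."""
    parts = identifier.split("-")
    script = next((p for p in parts[1:] if len(p) == 4), "")
    territory = next((p for p in parts[1:] if len(p) != 4), "")
    return parts[0], script, territory

def join(*subtags):
    """Glue the non-empty subtags back together with dashes."""
    return "-".join(s for s in subtags if s)

def fill_in(identifier):
    """Fill in the blanks of a partially specified locale identifier."""
    language, script, territory = split(identifier)
    # Try the most specific key first, setting aside more of what we know
    # each time; "und" always matches, so the search cannot fail.
    keys = (join(language, script, territory), join(language, territory),
            join(language, script), language, "und")
    rule = split(next(LIKELY[k] for k in keys if k in LIKELY))
    # Restore whatever the caller did specify over what the rule suggests.
    return join(rule[0] if language == "und" else language,
                script or rule[1], territory or rule[2])

def likely_equivalent(a, b):
    """Two identifiers are likely-equivalent if they fill in to the same form."""
    return fill_in(a) == fill_in(b)

def minimal_form(identifier):
    """Shortest pruning of the filled-in form that still implies it."""
    full = fill_in(identifier)
    language, script, territory = split(full)
    for candidate in (language, join(language, territory), join(language, script)):
        if fill_in(candidate) == full:
            return candidate
    return full
```

Under these assumptions, fill_in("und-AU") and fill_in("en-AU") both give en-Latn-AU, likely_equivalent("en", "en-AU") is false because plain en fills in to en-Latn-US, and minimal_form("en-Latn-AU") is en-AU.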

How Qt uses that

Until 6.9, QLocale::uiLanguages() started with the identifier of the locale it's given – or, in the case of the system locale, potentially a sequence of identifiers indicating what the user has said they can understand – and expanded each entry by adding some forms likely-equivalent to it. Until 5.14 (and LTS 5.12.6) that expansion was only applied to CLDR-derived entries; if the system locale gave a list, that was used without change. Initially, I'd handled the addition of likely-equivalents for the system locale via a QLocale instance constructed from each string the system gave us, forgetting that this could coerce what the user had asked for to the closest match for which Qt has CLDR-derived locale data.

From then to 6.4, the sort order for the system locale didn't match that of CLDR-derived locales (QTBUG-102796). Since then (aside from some quirks where the initial entry might appear earlier), the entries resulting from expansion of a single entry appear with the more specific (with more subtags) before the less specific (with fewer). In 6.5 I also fixed my mistake of sending system locale entries via QLocale (see above) and added the system locale's own identifier to the list, if the system query hadn't included it (or an equivalent).

In 6.7 I sorted out some complications to how QMimeType used uiLanguages() (when selecting how to describe a file-type, typically identified by the file extension, in a way the user will understand) and added a separator parameter to uiLanguages() to make that a little simpler (although I had to fix a mistake in that later). In response to what I describe below, I've been able to further simplify the QMimeType code more recently.

The seed of change

One problem with only including likely-equivalent entries is that the list for en-AU includes en-Latn-AU but not en. If an application has an en translation but neither en-AU nor en-Latn-AU, it's still fairly sensible for it to select the en it has. In the case of English, this works fine (although, as we'll see, life isn't so easy for some other locales). So QTranslator was taking each entry it gets from uiLanguages() and, after searching for a resource matching it, checking for matches to truncations of it, dropping the last subtag each time, before moving on to the next entry in uiLanguages(). That way, it found en as a truncation of en-AU and all was well – until 6.4, when I made the ordering consistent, and the system locale started delivering en-Latn-AU before en-AU.

That led to QTranslator truncating en-Latn-AU via en-Latn to en before it got to en-AU, with the result that the user who'd configured en-AU got lumbered with the en translation before the code noticed en-AU was available (and more appropriate). Which is where my story begins, with

QTBUG-121418: QTranslator loads zh instead of zh_TW translation
QTBUG-124898: (The en-AU case above, somewhat disguised).

Technically, the problem was there previously: the change in 6.4 just made it more visible. If a user had configured en-AU, en-GB as their system configuration, this would previously have expanded to en-AU, en-Latn-AU, en-GB, en-Latn-GB and, in the presence of only en-GB and en, they'd have been landed with the latter (which isn't even equivalent to any of their given choices, as it's en-Latn-US) despite the former being an exact match for one of their choices. But now that we could see the bug, we set out to fix it.
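The truncating search described above is easy to model. The following Python sketch is only a model of that behaviour, not QTranslator's actual code; the names truncations and old_style_pick are invented for the illustration, and a translation is represented simply by the identifier in its name.

```python
def truncations(identifier):
    """The identifier itself, then each form got by dropping its last subtag."""
    parts = identifier.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]

def old_style_pick(ui_languages, available):
    """Try each entry and all its truncations before moving to the next entry."""
    for entry in ui_languages:
        for candidate in truncations(entry):
            if candidate in available:
                return candidate
    return None
```

With translations available for en and en-AU, old_style_pick(["en-AU", "en-Latn-AU"], {"en", "en-AU"}) finds en-AU, but with the order that 6.4 introduced, old_style_pick(["en-Latn-AU", "en-AU"], {"en", "en-AU"}) settles for en. The older problem shows up as old_style_pick(["en-AU", "en-Latn-AU", "en-GB", "en-Latn-GB"], {"en-GB", "en"}) returning en rather than en-GB.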

Recent events

An initial attempt to fix QTBUG-124898 had QTranslator – instead of truncating each entry in the uiLanguages() list as it iterated over that list – actually build the expanded list, with all truncations inserted into it, and sort its entries by specificity. This failed to take into account what happens when uiLanguages() starts with more than one entry to expand with likely-equivalent companions. For example, let's see what happens for the case I considered in the last paragraph:

uiLanguages() starts with en-AU, en-GB and expands it to en-Latn-AU, en-AU, en-Latn-GB, en-GB; then QTranslator
Adds truncations to it: en-Latn-AU, en-Latn, en, en-AU, en, en-Latn-GB, en-Latn, en, en-GB, en and
Sorts by specificity: en-Latn-AU, en-Latn-GB, en-Latn, en-AU, en-GB, en (I've eliminated duplicates, just for clarity).

Notice that this has put en-Latn-GB before en-AU, reversing the order of the entries they came from in the list we started with. That works out worse when there's a mix of languages.
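The reordering is easy to reproduce. Here is a small Python sketch of the "add truncations, then sort by specificity" idea, assuming a stable sort that counts subtags; it models the approach of the reverted change rather than quoting it, and expand_and_sort is a name made up for this example.

```python
def truncations(identifier):
    """The identifier itself, then each form got by dropping its last subtag."""
    parts = identifier.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]

def expand_and_sort(ui_languages):
    """Insert every truncation, sort most specific first, drop duplicates."""
    expanded = []
    for entry in ui_languages:
        expanded.extend(truncations(entry))
    # sorted() is stable, so entries with equally many subtags keep their
    # relative order; more dashes means more subtags, so those come first.
    ordered = sorted(expanded, key=lambda identifier: -identifier.count("-"))
    return list(dict.fromkeys(ordered))  # keep the first copy of each entry
```

Calling expand_and_sort(["en-Latn-AU", "en-AU", "en-Latn-GB", "en-GB"]) gives a list in which en-Latn-GB comes ahead of en-AU, even though en-AU came from the user's first choice and en-Latn-GB from their second.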

In particular, QTBUG-129434 had a mix that included English and Traditional Chinese, zh-Hant. Since plain zh is likely-equivalent to zh-Hans, Simplified Chinese, an actual zh-Hant translation has this more specific form for its translation's name, whereas the app's translators hadn't needed to distinguish the various forms of English so just used plain en for it. The en translation thus wasn't found because en was now later in the list, even though the English entry in the system configuration was earlier. None of the versions of English before zh-Hant matched an available translation file, so zh-Hant was picked. Thankfully this was found before the mistake could be released and was duly fixed by a timely revert.

At this point I got to study the problem and concluded that the real problem is that QTranslator doesn't know about likely-equivalence, so isn't in a position to understand the ordering of uiLanguages(). While some results of truncation shall be likely-equivalent, others shall not (for example, en-AU isn't equivalent to its truncation en, since this is equivalent to en-Latn-US). Since truncation has to be done at some point, the answer is for it to be done by the part of the system that actually does understand likely-equivalence, namely QLocale. It also became clear that we needed to be more careful to include all likely-equivalents of a given entry alongside it. Previously, uiLanguages() just ensured the final list contained the result of filling in all likely subtags and the minimal likely-equivalent; this meant, for example, that a user configuring just plain en (which is a minimal form, so didn't get that addition) got en-Latn-US added to it but didn't get en-US or en-Latn added – both are now included.

One other thing came to light in this: the prior attempt at a fix had been done without knowledge that uiLanguages() might contain entries from quite distinct languages. This was why that attempt had failed; and, since that possibility wasn't explicit in the documentation of uiLanguages(), Volker added a paragraph about it. Then we set about ensuring the truncations got added in the right place.

Matching script

One might reasonably wonder why uiLanguages() wasn't simply including the truncations already. After all, for many languages, the minimal form is all the translators ever bother with, and it works well enough for most users of that language. We've already seen one case where it's not as simple as that, with zh-Hant being widely used, while zh ⇒ zh-Hans means that it's not likely-equivalent to its truncation. I'm not sure how mutually intelligible the simplified and traditional forms of the script are, or what proportion of traditional readers are familiar enough to cope with simplified, but this illustrates the problem: namely, that a language may exist in several scripts. This is no problem for code selecting a voice to use for text-to-speech rendering, but it matters for written translations.

I don't have an exhaustive list of examples where one language is written in different scripts by different populations, much less an exhaustive knowledge of which of those cases present a concrete problem of mutual intelligibility, but one theme in the cases where it arises is that the populations using distinct scripts for the same language are, in several cases, on opposite sides of some political or cultural conflict. Consequently, giving a user a translation in the other side's script runs a risk of causing distress or offence – or even getting them into trouble, if an unenlightened boss catches them reading enemy texts – quite apart from the risk that they simply can't read it. Given that folk tend to feel particularly strongly about conflicts with those from whom they least differ, this is another good reason to take care to not inflict such problems when we can avoid it.

So while we've now decided to include non-equivalent truncations in uiLanguages(), we need to take care that all reasonable options that are equivalent to what the user has configured get tried before any non-equivalent truncations. After some experimentation and feedback from users who'd reported related issues, I settled on a compromise for cases where a truncation does use the same script as the entry it truncates, but isn't equivalent. For that case, I opted to include the truncation just after the last block of likely-equivalent entries of which one truncated to it.

If that rule is a bit hard to understand, consider a user who's configured en-GB, en-NL, nl-NL (imagine a Brit living in the Netherlands). Adding likely-equivalents expands that to en-Latn-GB, en-GB, en-Latn-NL, en-NL, nl-Latn-NL, nl-NL, nl-Latn, nl; this includes nl-Latn and nl because they are likely-equivalent to nl-NL but leaves out en and en-Latn because they aren't likely-equivalent to en-GB or en-NL. If we stuck all non-equivalent truncations at the end, this would put en after nl so the user would get their UI in Dutch instead of English, even though they can read plain en (which is in the script they're used to) just fine. So this rule says to put these English truncations after the last block of English entries in our list, leading to en-Latn-GB, en-GB, en-Latn-NL, en-NL, en-Latn, en, nl-Latn-NL, nl-NL, nl-Latn, nl, which ensures an en translation is selected, when available, in preference to a nl one.

In contrast, the Punjabi language is written in the Arabic script in Pakistan but in Gurmukhi in India. A Punjabi from an Arabic-writing background might not know the Gurmukhi script at all (and vice versa). If such a user lives in England they might well have a system configuration selecting pa-PK, en-GB. Adding likely-equivalents then expands this to pa-Arab-PK, pa-PK, pa-Arab, en-Latn-GB, en-GB. Since pa-Arab is likely-equivalent to pa-PK, it is included – but the likely subtag rule pa ⇒ pa-Guru-IN makes plain pa distinct, so it is left out. Furthermore, since the script implied by pa is Guru, not matching the Arab implied by pa-PK, it gets shunted to the end of the list when we're adding truncations. In contrast, as before, en-Latn and en (though not likely-equivalent to it) do match the script implied by en-GB, so are still added to the end of its block, resulting in pa-Arab-PK, pa-PK, pa-Arab, en-Latn-GB, en-GB, en-Latn, en, pa. If there's no pa-PK or equivalent translation, but there are pa and en translations, this gets the user en, which we're sure they can read (as its script matches what they asked for), in preference to pa, even though that's their preferred language, because the available translation for it is in a script they may be entirely unable to read.
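Both worked examples can be reproduced with a short Python model of the ordering rule. To be clear about what is assumed: this is a sketch of the rule as described in this post, not QLocale's implementation; the LIKELY table holds only the few likely subtag rules the examples need, the lookup order in fill_in is a simplification, and every name here (fill_in, ui_languages_sketch and so on) is invented for the illustration.

```python
# Just enough likely subtag rules for the two examples; the real table is far larger.
LIKELY = {
    "nl": "nl-Latn-NL",
    "pa": "pa-Guru-IN",
    "pa-PK": "pa-Arab-PK",
    "pa-Arab": "pa-Arab-PK",
    "und": "en-Latn-US",
}

def split(identifier):
    """Return (language, script, territory), using "" for omitted parts."""
    parts = identifier.split("-")
    script = next((p for p in parts[1:] if len(p) == 4), "")
    territory = next((p for p in parts[1:] if len(p) != 4), "")
    return parts[0], script, territory

def join(*subtags):
    return "-".join(s for s in subtags if s)

def fill_in(identifier):
    """Fill in the blanks, trying the most specific matching rule first."""
    language, script, territory = split(identifier)
    keys = (join(language, script, territory), join(language, territory),
            join(language, script), language, "und")
    rule = split(next(LIKELY[k] for k in keys if k in LIKELY))
    return join(rule[0] if language == "und" else language,
                script or rule[1], territory or rule[2])

def ui_languages_sketch(configured):
    # One block of likely-equivalents per configured entry, most specific first.
    blocks = []
    for identifier in configured:
        full = fill_in(identifier)
        language, script, territory = split(full)
        forms = (full, join(language, territory), join(language, script), language)
        blocks.append((script, [f for f in forms if fill_in(f) == full]))
    present = {entry for _, block in blocks for entry in block}
    last_block = {}  # same-script truncation -> index of the last block yielding it
    tail = []        # truncations that imply a different script go to the very end
    for index, (script, block) in enumerate(blocks):
        for entry in block:
            parts = entry.split("-")
            for n in range(len(parts) - 1, 0, -1):
                cut = "-".join(parts[:n])
                if cut in present:
                    continue  # already listed as a likely-equivalent
                if split(fill_in(cut))[1] == script:
                    last_block[cut] = index
                elif cut not in tail:
                    tail.append(cut)
    result = []
    for index, (_, block) in enumerate(blocks):
        result += block
        result += [cut for cut, last in last_block.items() if last == index]
    result += tail
    return list(dict.fromkeys(result))  # drop any repeats, keeping the first
```

Under these assumptions, ui_languages_sketch(["en-GB", "en-NL", "nl-NL"]) puts en-Latn and en after the en-NL block and before the Dutch entries, while ui_languages_sketch(["pa-PK", "en-GB"]) leaves plain pa until the very end, after en.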

Getting it right

So now we knew what we wanted, I just had to adapt the code to actually do that. This turned out to be quite tricky, but judiciously writing test-cases helped navigate to a final working solution, while making the code as straightforward as all these complications permit. (In fact, writing this prompted me to check one case I'd forgotten and thereby find a bug that I've now fixed in the course of writing this.) I hadn't, in any case, spotted all the details discussed above until I got to see how well the first few changes worked out and discuss the behaviour with others.

The primary change was to add truncated entries to uiLanguages(). That let us play with the result, discover quirks and corner-cases and work out what to do differently. That went to 6.9 and simplified QTranslator, while 6.8 got a reworking of its QTranslator code to do roughly the same thing. I then did some sorting out of fine details in (what has since become) 6.9.

The addition of truncated entries left uiLanguages() somewhat complicated so I reworked it to be a bit more straightforward. Adding some more test-cases then let me (finally) close QTBUG-121418.

At this point I recognised the need to be more systematic about adding equivalent entries. That let me understand the ordering better and adapt the ordering to put each same-script truncation at the end of the last block of equivalents that gave rise to it (albeit with the mistake I mentioned above, whose fix I'm now seeing get integrated). After that I saw how to make the insertion of equivalents more systematic.

So where does that leave us?

Hopefully, as ever with Qt, everything should Just Work – as well as it did before, and maybe a bit better. You may, however, be able to simplify code using QLocale::uiLanguages(), while also making it work faster and better, if you were previously working round any of these complications:

If you've got any code that truncates entries from uiLanguages() for similar reasons to why QTranslator used to, you no longer (from 6.9; and mostly 6.8.3, too) need to do that.
If your code checks for a resource matching the name of the locale whose uiLanguages() you're also scanning for matches, you should now (since 6.5) be fine just checking for matches in uiLanguages(), as the locale's name should be in there, too (along with some likely-equivalents and truncated forms).
Or, really, if you'd found it necessary to kludge something using uiLanguages() to avoid odd corner cases where it didn't reliably Do The Right Thing™, try dropping the kludges and seeing whether it now Does The Right Thing after all. I'd love to hear stories of that, if you have any to share, whether you end up having to keep the kludges or are finally glad you can get rid of them.
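As a rough picture of what the simplified code can look like, here is a Python sketch of a consumer that just scans the list in order, with no truncation of its own. It does not call Qt: the two lists are copied from the examples earlier in this post, translations are represented by bare identifiers, and pick_translation is a name invented for the illustration.

```python
def pick_translation(ui_languages, available):
    """Return the first entry that has a translation, trusting the list's order."""
    return next((entry for entry in ui_languages if entry in available), None)

# The list for a user who configured en-GB, en-NL, nl-NL.
brit_in_nl = ["en-Latn-GB", "en-GB", "en-Latn-NL", "en-NL", "en-Latn", "en",
              "nl-Latn-NL", "nl-NL", "nl-Latn", "nl"]

# The list for a user who configured pa-PK, en-GB.
punjabi_in_gb = ["pa-Arab-PK", "pa-PK", "pa-Arab",
                 "en-Latn-GB", "en-GB", "en-Latn", "en", "pa"]
```

Here pick_translation(brit_in_nl, {"en", "nl"}) and pick_translation(punjabi_in_gb, {"pa", "en"}) both select en, while pick_translation(punjabi_in_gb, {"pa"}) still falls back to pa when nothing better is available.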

Thanks to all the good folks who contributed by telling us what was wrong before, that I hope we've now sorted out fully – and, as ever, if you find behaviour that looks wrong, or can think of ways Qt might behave better, feel free to let us know through any of the usual channels, to help us make Qt better with every release.

Blog Topics:

Dev Loop
