Teaching QLocale more about number formats

QLocale looks after all localisation (or L10n) within Qt. Qt 6 has swept away a few last fragments of L10n built into other classes, which now consistently use the C locale (and advise you to use QLocale if you need L10n); it has also seen some significant improvements to how QLocale does its job, particularly in relation to numeric texts and surrogate pairs.

Background

QLocale packages and provides access to a data-set provided by the Unicode Consortium: the Common Locale Data Repository (CLDR). This is published as a suite of XML files and we have Python scripts that digest these and turn them into the configuration data for QLocale. Those scripts have had a fairly major overhaul this year, which should mostly be invisible to users of Qt, although it'll hopefully make the process of taking in updates to CLDR a little more robust and easier for other developers to do. The visible exception is a set of corrections, which the overhaul made possible, to infelicities in how we extract the data from CLDR's XML. Where possible those fixes are shared with Qt 5.15.

In QString and friends we represent text using UTF-16; the size() of a string reported by these types is the number of UTF-16 code units. However, Unicode is bigger than fits into 16 bits, so there are code points (roughly what one normally thinks of as characters in written text, although there are further complications) that can't be represented as a single UTF-16 code unit; these are represented as surrogate pairs, in which two UTF-16 code units encode a single code point. When these are needed, the UTF-16-size reported by QString, in code units, is greater than the code-point size, which a native reader of the text is more likely to consider to be its length. In particular, a single character may need to be represented by a QString of length two.
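To make the difference between the two lengths concrete, here is a minimal Python sketch (not Qt code). Python's len() counts code points, so the helper utf16_length(), a name invented for this illustration, encodes to UTF-16 to count code units the way QString's size() does. The sample character is CHAKMA DIGIT ZERO (U+11136), which lies outside the BMP.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, the unit in which QString reports size()."""
    # Each UTF-16 code unit occupies two bytes in the encoded form.
    return len(text.encode("utf-16-le")) // 2


ascii_zero = "\u0030"        # DIGIT ZERO, inside the BMP
chakma_zero = "\U00011136"   # CHAKMA DIGIT ZERO, outside the BMP

print(len(ascii_zero), utf16_length(ascii_zero))    # prints "1 1"
print(len(chakma_zero), utf16_length(chakma_zero))  # prints "1 2"
```

The second line of output is the case described above: one character as a reader sees it, but a string of length two in UTF-16.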

Characters in Unicode's Basic Multilingual Plane (BMP) can be represented by single UTF-16 code units but the rest need surrogate pairs; and CLDR does contain some locales that use characters outside the BMP. Normally this isn't an issue, since QString can handle surrogate pairs just fine.

Surrogate pairs

Unfortunately, some of QLocale's methods for single character values – relevant to number and list formats – returned QChar, which is a single UTF-16 code unit.

This meant we couldn't support the Chakma number system, whose digits need surrogate pairs, so we had to leave it, and the locales using it, out of Qt 5. However, the transition to Qt 6 allows a change to the single character APIs, so they now return QString instead of QChar. For most locales, the string returned shall have length 1 but locales that need to use a surrogate pair to represent a character can now do so. More recently, CLDR has added locale data for the Fulah language in the Adlam script, whose numbering system also has a surrogate pair for its zero; so this change also made it possible to take these new locales in, when they were added.

That, of course, forced an overhaul of QLocale's code to parse and format numbers (and, internally, lists and quoting): the UTF-16 indexing into strings can no longer be relied on to correspond simply to the character-position indexing. That provoked some fairly extensive clean-up of the number-formatting code, primarily to avoid duplication of code that's now trickier, along with a general clean-up of surrogate-handling.
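As an illustration of why code-unit indexing no longer matches character positions, the following Python sketch walks a sequence of UTF-16 code units and combines each surrogate pair into a single code point. It is not QLocale's implementation; the function names utf16_units() and code_points() are invented for this example.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_points(units):
    """Yield code points, combining each high/low surrogate pair into one."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit < 0xDC00
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] < 0xE000:
            low = units[i + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2  # a surrogate pair consumes two code units
        else:
            yield unit
            i += 1


units = utf16_units("1\U00011136")            # ASCII one, then Chakma zero
print([hex(u) for u in units])                # three code units
print([hex(c) for c in code_points(units)])   # two code points
```

Any code that steps through digits one index at a time has to make the same two-unit jump, which is why the number-handling code needed an overhaul rather than a local patch.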

Suzhou's non-consecutive digits

In related work, though not involving surrogates, a contributor reported that QLocale was producing gibberish in the Suzhou numbering system. This number system isn't normally used by any locale in CLDR, but it's a traditional Chinese system (whose 1 through 9 appear, for example, on Mahjong tiles) that some operating systems allow a user to configure in their system settings.

As it happens, the Unicode Consortium departed from its usual practice of making the digits of a number system consecutive code points: for Suzhou, one through nine are indeed consecutive, but zero isn't immediately before them. Since QLocale only records the zero digit for each locale, this required some minor refinements in code that previously took the usual practice for granted.
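A minimal sketch of the kind of special case this needs, in Python rather than Qt's C++. It assumes the Suzhou zero is U+3007 (IDEOGRAPHIC NUMBER ZERO) and that one through nine are U+3021 to U+3029; the function name digit_char() is invented for this example and the real QLocale code may be organised differently.

```python
SUZHOU_ZERO = 0x3007  # IDEOGRAPHIC NUMBER ZERO
SUZHOU_ONE = 0x3021   # first of nine consecutive Suzhou numerals


def digit_char(zero, value):
    """Return the character for digit `value`, given a locale's zero digit."""
    if not 0 <= value <= 9:
        raise ValueError("digit value must be in 0..9")
    if zero == SUZHOU_ZERO and value > 0:
        # Suzhou's one through nine do not follow directly after its zero.
        return chr(SUZHOU_ONE + value - 1)
    # Usual practice: the ten digits are consecutive code points.
    return chr(zero + value)


print(digit_char(0x30, 7))                    # prints "7"
print(hex(ord(digit_char(SUZHOU_ZERO, 1))))   # prints "0x3021"
```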

Digit grouping

One of the areas the surrogate clean-up particularly impacted was the grouping of digits – sometimes called thousands grouping by those whose locales group digits in threes – within the whole-number part of a number.

The code previously relied on simply inserting commas at every third position before the fractional-part separator (or decimal point, although it's only a dot, or point, in some locales). This was, in any case, the wrong thing to do: some locales have different sized groups and even some that do group in threes omit the first separator if it would leave a small group. For example, Spanish leaves 1000 ungrouped, only adding a group separator when there are at least two digits before it, as at 10.000; and various locales from the Indian sub-continent have three digits in the right-most group but two digits in each group to the left of that. We included a fix for (only) India's grouping in 5.15, but the general case remained to be addressed.

So, as well as supporting non-BMP digits, QLocale now groups digits in numbers according to the correct rules for each locale. The Python scripts now extract CLDR's data saying what to do, which gets packed into one more byte of data per locale.
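The grouping rules can be sketched in a few lines of Python. This is an illustration under assumptions, not QLocale's code: the three parameters (size of the right-most group, size of each further group, and the minimum number of digits required in front of the first separator) are names chosen for this example, and how they are packed into the one byte per locale is not shown.

```python
def group_digits(digits, separator=",", first=3, higher=3, least=1):
    """Insert group separators into the whole-number part of a number.

    first:  size of the right-most group
    higher: size of each group to the left of that
    least:  minimum digits required before the first separator
    """
    if len(digits) < first + least:
        return digits  # too short to group at all
    groups = [digits[-first:]]
    rest = digits[:-first]
    while len(rest) > higher:
        groups.append(rest[-higher:])
        rest = rest[:-higher]
    groups.append(rest)
    return separator.join(reversed(groups))


print(group_digits("1234567"))                      # prints "1,234,567"
print(group_digits("1234567", higher=2))            # prints "12,34,567"
print(group_digits("1000", separator=".", least=2)) # prints "1000"
print(group_digits("10000", separator=".", least=2))# prints "10.000"
```

The second call shows the Indian-style layout and the last two show a locale that leaves four-digit numbers ungrouped.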

Extracting the right data

As noted above, we've included in Qt 5.15 some of the improvements in the scripts to extract data from CLDR's XML; so in this section I'll just briefly describe some improvements that aren't strictly new in Qt 6, although they did come out of the work to prepare for it.

The heart of the improvements was a restructuring of the (Python) scripts around an improved scanner for the parsed XML's DOM tree, that selects the correct entries from the diverse XML files provided by CLDR, in a schema called LDML. This has to take account of somewhat complex rules by which data specific to a locale is supplemented by data inherited from a chain of parent locales, while various look-ups may also be re-routed by an alias mechanism, some data may be filtered based on a draft attribute and the XPath-based selection of nodes has to take account of attributes classed by LDML as distinguished. The prior implementation (understandably enough, given how complex the rules are) got some of this tangled, leading it to get the wrong data in some cases, notably including some of the data we're newly including in Qt 6.

The replacement uses a more coherent object-oriented design, with tasks carefully separated between helper classes (CLDR access, reading CLDR, writing and reading our intermediate XML format, updating source-code and time-zone data) to ensure we get the right data. In the process, I removed some old hacks to maintain compatibility with antique LDML versions.

Currency formats were particularly affected by this. In many cases the old script was using a currency format specific to a number system other than the one the locale actually used. It did so because it found an over-ride for that other number system in the locale itself, which it should have ignored because a distinguished attribute, the number system, did not match; the locale in fact inherits, from a parent locale and without over-riding, its currency format for the number system it does use. We decided – given that fixing this led to changing many currency formats in any case – to prefer accounting versions of currency formats, where available; so rather more locales now have distinct formats for positive and negative amounts of currencies. There were also various list formats that were likewise affected by correctly attending to distinguished attributes.
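The currency-format mix-up can be reproduced with a toy model in Python. Everything here is hypothetical: the locale names, the parent chain, the patterns and the function currency_format() are invented for illustration and are neither CLDR data nor the real scripts. The sketch only shows how ignoring a distinguished attribute picks up an over-ride meant for a different number system, while checking it falls through to the inherited entry.

```python
# Hypothetical parent chain: xx_YY inherits from xx, which inherits from root.
PARENT = {"xx_YY": "xx", "xx": "root", "root": None}

# Hypothetical entries: (number system attribute, currency pattern).
CURRENCY_FORMATS = {
    "xx_YY": [("arab", "pattern-for-arab")],
    "xx": [("latn", "pattern-for-latn")],
    "root": [("latn", "root-pattern")],
}


def currency_format(locale, number_system, check_attribute=True):
    """Walk the parent chain for the first acceptable currency format."""
    while locale is not None:
        for system, pattern in CURRENCY_FORMATS.get(locale, []):
            # The number system is a distinguished attribute: an entry for a
            # different number system is a different entry, not a match.
            if not check_attribute or system == number_system:
                return pattern
        locale = PARENT[locale]
    return None


print(currency_format("xx_YY", "latn"))                         # inherited
print(currency_format("xx_YY", "latn", check_attribute=False))  # wrong entry
```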

Other Improvements

In the course of all this, we have made many fixes and improvements. Here are some of the smaller ones:

  • QLocale now respects the case of the exponent separator provided by CLDR; so, for instance, many locales now represent a million, in floating-point 'e' format, as 1E+06 instead of 1e+06.
  • For the floating-point 'g'-format, the transition from 'f'-format to 'e'-format is now done as documented and intended.
  • In common with many string-related classes, raw Unicode data is now represented using C++'s char16_t and char32_t types.
  • Pervasive support for QStringView parameters.
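For readers unfamiliar with the 'e' and 'g' formats in the first two items, here is how Python's printf-style formatting behaves. This shows the C convention (switch to the 'e' form when the exponent is below -4 or not below the precision); treating that as the rule QLocale documents is an assumption of this sketch, and the helper g_style() is invented for the example.

```python
def g_style(value, precision=6):
    """Report which style C's %g convention picks for value: 'e' or 'f'."""
    # Format in 'e' style first to learn the decimal exponent after rounding.
    exponent = int(("%.*e" % (precision - 1, value)).split("e")[1])
    return "e" if exponent < -4 or exponent >= precision else "f"


print("%g" % 100000.0, g_style(100000.0))     # prints "100000 f"
print("%g" % 1000000.0, g_style(1000000.0))   # prints "1e+06 e"
print("%G" % 1000000.0)                       # prints "1E+06"
```

The last line shows the upper-case exponent separator that many locales now get from CLDR.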

We have also made extensive internal improvements and simplified various things.

Recent work

In particular, while writing this post, I finished up some work to fix how QLocale selects a suitable locale, from those at its disposal, given only a sub-set of language, script and country, or given a combination for which we have no data. So Qt 6.0.0 should get something closer to what you intended, when what you ask for is either unsupported or insufficiently specific. Some of those fixes are also included in Qt 5.15.2.

This recent work also included a purge of obsolete language names – those for which CLDR provides no data, along with a few long-deprecated aliases – and various Language, Script and Country names (now converted to recognised aliases) are updated to better align with CLDR.

Data size

Adding support for new locales (593 with CLDR v36, in Qt 5.14.0 through 5.15.1; 601 with v37 in 5.15.2; 615 with v37 on dev; 618 with v38) and adding more data to each locale naturally increases the size of the data tables compiled into QLocale, which encode the CLDR data; indeed, there's now a whole new single character table (albeit small). In the course of all the work above, I also reworked the way we package that data, to reduce its size.

In particular, much of the data held for each locale is stored in supporting tables (of date, time, list and currency formats, for example) with the data for each locale recording an offset into the table at which its entry starts, along with the length of that entry.

  • Various tables contained duplicated data and shorter entries that were substrings of longer entries; having the user of the substring reference its appearance within the longer string reduced duplication and made the tables a little smaller.
  • Almost all fields have sizes that fit within an 8-bit value, so storing these sizes in 16-bit values was wasteful, although the start-offsets (which do need to be 16-bit) and sizes (now mostly 8-bit) had to be separated within the data-structure to benefit from this, without padding gaps. Given how many of these fields there are, in each of the many supported locales, these 8-bit savings add up. Each calendar went from 30 to 26 bytes per locale; the main locale table went from 132 (131 +pad) to 124 (123 +pad) bytes per locale.
  • In the course of researching details for this blog post, I've also noticed a nice simplification that shall reduce the size of the table of display names of currencies from a bit over 35 KiB to less than 11 KiB (and make the look-up of currency names slightly faster). That's yet to be put into practice, so isn't included in the graph below.
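The sub-string sharing in the first item can be sketched as follows. This is a simplified Python illustration, not the actual generator script: the class name StringTable and the sample format strings are chosen for this example, and the real tables hold UTF-16 data rather than Python strings.

```python
class StringTable:
    """Concatenated string data; each entry is an (offset, length) pair."""

    def __init__(self):
        self.text = ""

    def add(self, entry):
        # Reuse an existing occurrence, even inside a longer entry.
        offset = self.text.find(entry)
        if offset < 0:
            offset = len(self.text)
            self.text += entry
        return offset, len(entry)

    def lookup(self, offset, length):
        return self.text[offset:offset + length]


table = StringTable()
print(table.add("dd/MM/yyyy"))  # prints "(0, 10)"
print(table.add("MM/yyyy"))     # prints "(3, 7)": shared with the first entry
print(len(table.text))          # prints "10"
```

In this simple form the saving depends on insertion order: a short entry added before the longer string that contains it is not shared, so reordering the entries changes how much duplication is avoided.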

(The one table whose entry sizes didn't all fit into an 8-bit value is month-name; there, the table contains the concatenation of all month names, with semicolon separators, and one locale's month-names for one calendar added up to 264 bytes.)
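To see why the 16-bit offsets and 8-bit sizes had to be separated, the sketch below computes the size of a record under natural alignment, where each field is aligned to a multiple of its own size and the record is padded to a multiple of its largest field. That alignment rule is an assumption about typical C++ compilers, and struct_size() is a name invented for this example.

```python
def struct_size(field_sizes):
    """Size in bytes of a record with naturally aligned fields, in order."""
    offset = 0
    for size in field_sizes:
        offset += -offset % size  # padding so the field is aligned
        offset += size
    alignment = max(field_sizes)
    return offset + (-offset % alignment)  # trailing padding


# Three (offset, size) pairs, laid out three ways:
print(struct_size([2, 2, 2, 2, 2, 2]))  # prints "12": 16-bit sizes
print(struct_size([2, 1, 2, 1, 2, 1]))  # prints "12": 8-bit sizes, interleaved
print(struct_size([2, 2, 2, 1, 1, 1]))  # prints "10": 8-bit sizes, separated
```

Interleaving the narrower sizes with the offsets saves nothing, because padding fills every gap; only grouping the 8-bit fields together realises the saving.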

Here's how the total size and the tables that make it up have varied in the course of this work. (The graphs are an SVG which contains the raw data, for anyone interested in the exact numbers.) See below for details, also linked from various lines, curves and texts:

[Figure: three graphs of data size against commits 1 to 11. Small tables (size in bytes): List, Date, Time, Byte, AM, PM, Single, Currency Symbol, Currency Format, Language Code, Language Name, Language Index, Script Code, Script Name, Script Index, Country Code, Country Name, Country Index, Locale index. Large tables (size in KiB): Locale, Likely sub-tags, Roman, Hijri, Jalali, Month Indices, Day Name, Currency Name, Endonym. Total (size in KiB).]

Commits (positions across each graph):

  1. Release 5.14.0 used as base-line (there's no change in 5.14.2) as some of the present work was included in 5.15
  2. Take correct account of LDML's distinguished attributes.
  3. 5.15.0 included the change to use of accountancy formats.
  4. 5.15.2 included CLDR v37, adding one new language and eight new locales. The next few columns, up to dev's v37 update, pre-date its release chronologically while only making it into Qt 6. The dev update to v37 was thus able to include more new locales. Later releases of 5.15 (LTS) shall use CLDR v38.
  5. Consolidating data tables to avoid duplication of sub-strings. This change, and the next few, happened on dev before the distinguished attributes fix and accountancy formats were merged up from 5.15.
  6. Use of 8-bit sizes for most fields, along with separation of 16-bit indices from those 8-bit sizes, to avoid padding, saving 20 bytes per locale. Some fields and data tables were reordered in the process, leading to some minor shrinkage of tables.
  7. Added support for surrogate pairs as single character data, in a new table.
  8. Correct digit-grouping in number formats, adding one byte per locale. Merged in the data selection fixes included in 5.15.0.
  9. CLDR v37 added one new language and twenty new locales (on dev, which can represent all the number systems the new locales use). The new locales brought in two new number systems, adding to the single-character table.
  10. Purge unsupported Language enum members (and some archaic aliases) and restore alphabetic order in all three enums. This changes the order in which entries are added to various data tables, which changes how much duplication is avoided by the consolidation mentioned above.
  11. Update to CLDR v38. This added three new locales and changed details in some existing locales.

Tables (lines in the graphs):

  • String tables. The main locale table references strings in each of these tables by start offset and length. Shrunk by deduplication, may grow when new locales are added.
    • List format: indicates the separator between entries in a list and any prefix or suffix to be included
    • Date format: how to combine year, month and day
    • Time format: how to combine hour, minute, second and possibly milliseconds
    • AM: a marker to distinguish a time before noon from one after
    • PM: a marker to distinguish a time after noon from one before
    • Byte: how to display data-sizes, e.g. of files. At v38, CLDR has set out to improve its description of quantities involving units, which has expanded this table
    • Single character: punctuation and zero digits, used in number and quote formatting; some need a surrogate pair
    • Currency symbol: similar to the single-character table
    • Currency format: how to combine the currency symbol with a number; may have distinct forms for positive and negative amounts
    • Day names, in long, short and narrow forms
    • Currency names: how the currency is described in text
    • Endonym: each locale's names for its language and country (contrast with the English names, below)
  • A string table (as above) of month names and an index table referencing it, for each of the (thus far) supported base calendars. Month names come in various flavours (long, short, narrow; with plain and stand-alone versions) and each base calendar has its own set. The Month Indices are the same size (one full set of indices per locale) so this size gets just one curve in the graph, but its size contributes ×3 to the total data size. (That 3 may increase as folk contribute more calendar implementations; and can be decreased by turning off features.)
    • Roman: data is shared between Gregorian, Julian and Milankovic calendars; many locales have distinct names, making this the single biggest table
    • Hijri: the base used by our implementation of the Islamic Civil Calendar, ready for sharing with implementations of other variants of the Islamic calendar
    • Jalali: a Persian traditional calendar, so most locales share the traditional Farsi versions of the names
  • For each of Language, Script and Country (territory in CLDR): their ISO Codes, English names and an index table for the latter. The codes are fixed width (width four for script; the other two have width three, with a '\0' as third byte for two-character codes) but the name lengths vary; the index table says where each name starts in it.
  • Locale index: entries in the main table are sorted to make those with the same language consecutive; this table is indexed by language, mapping it to the first locale using that language. Aside from a terminating zero entry, it's thus the same size as the index table for English names of languages.
  • Likely sub-tag mapping table: this is used to select the right locale to use either when language, script and country haven't all been specified or when there's no data for those selected and we must fall back on a most suitable approximation.
  • The main data table, with an entry per locale, which references each of the data tables above (except the month names, looked after by each calendar) and carries a few data fields of its own (such as those describing the proper placement of separators for digit grouping).
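The locale index can be illustrated with a small Python model. The numeric language ids and the row layout are hypothetical, and build_locale_index() and locales_for() are names invented for this sketch; it only shows how sorting the main table by language lets one small table map each language to its first row.

```python
def build_locale_index(row_languages, language_count):
    """Map each language id to the first main-table row that uses it.

    row_languages: language id of each main-table row, sorted by language.
    Languages with no rows keep the value 0; a terminating 0 is appended.
    """
    index = [0] * language_count
    for row, language in enumerate(row_languages):
        if row == 0 or row_languages[row - 1] != language:
            index[language] = row
    index.append(0)  # terminating zero entry
    return index


def locales_for(language, row_languages, index):
    """Return the main-table rows for a language: a consecutive run."""
    row = index[language]
    rows = []
    while row < len(row_languages) and row_languages[row] == language:
        rows.append(row)
        row += 1
    return rows


rows = [1, 1, 2, 4, 4, 4]              # toy main table: six locales
index = build_locale_index(rows, 5)    # language ids 0 to 4
print(index)                           # prints "[0, 0, 2, 0, 3, 0]"
print(locales_for(4, rows, index))     # prints "[3, 4, 5]"
```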

The end result is that the locale data in Qt 6 will actually be smaller than in 5.14 (by almost 8%, with the currency name saving I'm about to make), despite significant additions to the data provided and the addition of over twenty locales, many of which we previously couldn't support.

