String collation with locales

As Carlos mentioned a few months ago, we in the Qt Earth Team have been working on improving internationalization and localization support in Qt. So far we have implemented the following features:

  • Taking the LC_MESSAGES, LC_TIME, etc. environment variables into account on Linux/Unix platforms;
  • Writing script support in QLocale;
  • Some currency support (feedback suggests it might not be enough, and we might want to extend that API before the release);
  • Returning a list of languages from the OS for application translations; first day of the week; weekdays; quoting strings; joining lists of strings; etc.

Another item on our to-do list was looking into collation support (QTBUG-17104). For those who are blessed by not knowing what collation is (it is quite a complex topic), collation is another word for ordering strings. For example, when sorting the two strings "coté" and "côte" according to the Canadian French locale (fr-CA), the latter comes first, while with French as used in France (fr-FR) it is the former that comes first. Currently Qt has exactly one function that does localized string comparison, QString::localeAwareCompare, and that function has two issues. First, it only compares strings according to the system locale; you cannot specify which locale's rules to use. Second, it uses "default" rules, so there is no way to configure it to do numeric comparison (e.g. for "file10" and "file2"). Hence I've made a draft implementation of a QtCollator class that wraps the ICU library and provides a Qt-style API for collation.

class QtCollator
{
public:
    enum Strength {
        PrimaryStrength = 1,
        BaseLetterStrength = PrimaryStrength,

        SecondaryStrength = 2,
        AccentsStrength = SecondaryStrength,

        TertiaryStrength = 3,
        CaseStrength = TertiaryStrength,

        QuaternaryStrength = 4,
        PunctuationStrength = QuaternaryStrength,

        IdenticalStrength = 5,
        CodepointStrength = IdenticalStrength
    };

    enum Option {
        PreferUpperCase = 0x01,
        PreferLowerCase = 0x02,
        FrenchCollation = 0x04,
        DisableNormalization = 0x08,
        IgnorePunctuation = 0x10,
        ExtraCaseLevel = 0x20,
        HiraganaQuaternaryMode = 0x40,
        NumericMode = 0x80
    };
    Q_DECLARE_FLAGS(Options, Option)

    QtCollator(const QLocale &locale = QLocale());
    QtCollator(const QtCollator &);
    ~QtCollator();
    QtCollator &operator=(const QtCollator &);

    void setLocale(const QLocale &locale);
    QLocale locale() const;

    void setStrength(Strength);
    Strength strength() const;

    void setOptions(Options);
    Options options() const;

    enum CasePreference {
        IgnoreCase = 0x0,
        UpperCase = 0x1,
        LowerCase = 0x2
    };

    bool isCaseSensitive() const;
    CasePreference casePreference() const;
    void setCasePreference(CasePreference c);

    void setNumericMode(bool on);
    bool numericMode() const;

    int compare(const QString &s1, const QString &s2) const;
    int compare(const QStringRef &s1, const QStringRef &s2) const;
    bool operator()(const QString &s1, const QString &s2) const
    { return compare(s1, s2) < 0; }

    QByteArray sortKey(const QString &string) const;
};
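Because the class provides operator(), a collator object can be passed directly to a sorting algorithm as the comparison functor. The sketch below shows that usage pattern with a stand-in class, so that it compiles without Qt or ICU. ToyCollator is hypothetical and not part of the draft: it works on std::string, compares raw bytes instead of applying locale rules, and only imitates the setNumericMode(), compare() and operator() members to show what numeric mode is expected to change for names like "file10" and "file2".

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in mirroring part of the draft interface. It compares
// raw bytes, not locale-aware collation elements, so it is illustration only.
class ToyCollator
{
public:
    void setNumericMode(bool on) { m_numeric = on; }
    bool numericMode() const { return m_numeric; }

    // Returns -1, 0 or 1, like the compare() of the draft class.
    int compare(const std::string &s1, const std::string &s2) const
    {
        std::size_t i = 0, j = 0;
        while (i < s1.size() && j < s2.size()) {
            if (m_numeric && isDigit(s1[i]) && isDigit(s2[j])) {
                // Compare whole digit runs by value: with leading zeros
                // dropped, the longer run is the larger number.
                const std::string n1 = takeNumber(s1, i);
                const std::string n2 = takeNumber(s2, j);
                if (n1.size() != n2.size())
                    return n1.size() < n2.size() ? -1 : 1;
                if (int c = n1.compare(n2))
                    return c < 0 ? -1 : 1;
                continue;
            }
            if (s1[i] != s2[j]) {
                return static_cast<unsigned char>(s1[i])
                     < static_cast<unsigned char>(s2[j]) ? -1 : 1;
            }
            ++i;
            ++j;
        }
        if (i == s1.size() && j == s2.size())
            return 0;
        return i == s1.size() ? -1 : 1; // the shorter string sorts first
    }

    // Makes the object usable as a comparator, as in the draft class.
    bool operator()(const std::string &s1, const std::string &s2) const
    { return compare(s1, s2) < 0; }

private:
    static bool isDigit(char c)
    { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

    // Consumes one run of digits starting at pos and returns it
    // without leading zeros.
    static std::string takeNumber(const std::string &s, std::size_t &pos)
    {
        while (pos < s.size() && s[pos] == '0')
            ++pos;
        const std::size_t start = pos;
        while (pos < s.size() && isDigit(s[pos]))
            ++pos;
        return s.substr(start, pos - start);
    }

    bool m_numeric = false;
};
```

With this stand-in, calling std::sort(files.begin(), files.end(), collator) on "file10", "file2", "file1" gives file1, file10, file2 by default, and file1, file2, file10 once collator.setNumericMode(true) has been called. The real QtCollator would be passed to the sorting algorithm in the same way, with the ordering decided by ICU.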

A simple benchmark has shown that QtCollator (i.e. the ICU implementation of the collation algorithm) is 30 times faster than QString::localeAwareCompare on Linux (and hence than strcoll()), and 5 times faster than localeAwareCompare on Windows (i.e. than CompareString).

At the moment the code lives in a separate research repository and doesn't depend on anything inside Qt; it just wraps the ICU library and can be used by third-party applications as an add-on. The fact that it relies on a third-party library that might or might not be available on the platform makes it hard to include in Qt, especially considering that the features ICU provides somewhat overlap with what Qt already provides. An alternative might be to implement the Unicode Collation Algorithm ourselves and use CLDR data directly (which we already do for our QLocale data).
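To give a feel for what a hand-written implementation would have to handle, here is a toy sketch of the multi-level comparison behind the "coté" / "côte" example: base letters are compared first, and accents only break ties, scanned either forwards or backwards. Everything in it is an assumption made for illustration. The function names toyWeights and toyCompare are made up, the weight table covers just two accented letters, and real weights would come from CLDR data rather than a switch statement.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>

// Illustrative weights: (base letter, accent weight). This is not CLDR data,
// only enough of a table to cover the example strings.
static std::pair<char32_t, int> toyWeights(char32_t c)
{
    switch (c) {
    case U'\u00e9': return {U'e', 1}; // e with acute accent
    case U'\u00f4': return {U'o', 1}; // o with circumflex
    default:        return {c, 0};    // treated as unaccented
    }
}

// Returns -1, 0 or 1. Level 1 compares base letters; level 2 compares accent
// weights, from the end of the string when backwardAccents is set (the
// Canadian French rule) and from the start otherwise.
int toyCompare(const std::u32string &a, const std::u32string &b,
               bool backwardAccents)
{
    std::u32string baseA, baseB;
    std::string accentsA, accentsB;
    for (char32_t c : a) {
        const auto w = toyWeights(c);
        baseA += w.first;
        accentsA += static_cast<char>('0' + w.second);
    }
    for (char32_t c : b) {
        const auto w = toyWeights(c);
        baseB += w.first;
        accentsB += static_cast<char>('0' + w.second);
    }

    // Level 1: a difference in base letters decides immediately.
    if (baseA != baseB)
        return baseA < baseB ? -1 : 1;

    // Level 2: accents break the tie, in the requested direction.
    if (backwardAccents) {
        std::reverse(accentsA.begin(), accentsA.end());
        std::reverse(accentsB.begin(), accentsB.end());
    }
    const int c = accentsA.compare(accentsB);
    return c < 0 ? -1 : (c > 0 ? 1 : 0);
}
```

Calling toyCompare(U"cot\u00e9", U"c\u00f4te", false) returns a negative value, so "coté" sorts first as in fr-FR, while passing true for backwardAccents returns a positive value, so "côte" sorts first as in fr-CA. The full algorithm adds further levels (case, punctuation, code point), which is roughly what the Strength values in the class above select between.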

I don't have an answer yet as to what exactly is going to happen to collation in Qt, but I hope to raise this at the Qt Contributors Summit, and we shall see what plan we can come up with.

Source code:

Example application: