TextToSpeech improvements for Qt 6.6

When we announced the Qt 6 port of Qt Speech for Qt 6.4, one of the comments pointed out that the module would be more valuable if applications could access the generated speech audio data. Qt 6.6 introduces exactly that, plus a few more features and API improvements.

Synthesizing audio data from text

The QTextToSpeech C++ class has learned how to generate the speech audio as PCM data. In addition to using QTextToSpeech::say(QString), which simply plays the generated audio, applications can call one of the new QTextToSpeech::synthesize() overloads. These overloads take the input text as well as a slot, i.e. a functor, lambda, free function, or member function pointer (with a context object if needed). That slot then gets called whenever a chunk of PCM data is available from the backend, with a QAudioBuffer from Qt Multimedia (or, slightly more efficiently, a QAudioFormat and a QByteArray) describing the format and containing the actual data. Applications can then post-process the PCM data, write it to a file, or cache it for repeated playback using Qt Multimedia.
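
If you want to try this out, here is a minimal sketch that writes the generated PCM data to a file, using the (QAudioFormat, QByteArray) flavor of the slot; the file name and the helper function are made up for illustration:

#include <QtTextToSpeech/QTextToSpeech>
#include <QAudioFormat>
#include <QFile>
#include <memory>

void saveUtterance(QTextToSpeech *tts)
{
    // Collect the raw PCM bytes in a file; "utterance.pcm" is just a placeholder.
    auto pcmFile = std::make_shared<QFile>(QStringLiteral("utterance.pcm"));
    pcmFile->open(QIODevice::WriteOnly);

    tts->synthesize(QStringLiteral("Hello from Qt 6.6"),
                    [pcmFile](const QAudioFormat &format, const QByteArray &bytes) {
        // Called for each chunk of PCM data the backend produces. The format
        // describes sample rate, sample format, and channel count, which is
        // needed again to play the data back with Qt Multimedia.
        Q_UNUSED(format);
        pcmFile->write(bytes);
    });
}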

Better process control

With Qt 6.6, applications will have better control over the flow of the speech generation. The new QTextToSpeech::enqueue function adds an utterance to an ongoing text-to-speech process, and the new aboutToSynthesize signal is emitted before each of the enqueued utterances gets passed to the backend. This allows applications to make modifications to speech attributes, such as voice or pitch, for each utterance in the queue. And while speech audio is being played, QTextToSpeech can now emit the sayingWord signal for each word as it gets spoken, allowing applications to follow the progress and perhaps give visual cues to the user.
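
A rough sketch of what this could look like in code; the utterance texts and the pitch value are just placeholders, and the lambdas deliberately ignore the signals' parameters:

#include <QtTextToSpeech/QTextToSpeech>

void speakTwoParts(QTextToSpeech *tts)
{
    QObject::connect(tts, &QTextToSpeech::aboutToSynthesize, tts, [tts] {
        // Emitted before each queued utterance is passed to the backend;
        // a good place to adjust voice, pitch, or rate for that utterance.
        tts->setPitch(tts->pitch() > 0 ? 0.0 : 0.5);
    });

    QObject::connect(tts, &QTextToSpeech::sayingWord, tts, [] {
        // Emitted for each word as it is spoken; the signal's parameters
        // locate the word in the text, e.g. to highlight it in the UI.
    });

    tts->say(QStringLiteral("The first utterance starts the speech."));
    tts->enqueue(QStringLiteral("The second utterance is queued behind it."));
}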

Selecting voices made easy

We made it easier for applications to select a voice for the text-to-speech synthesis. This has been difficult until now, as applications had to first set the correct locale on the QTextToSpeech object, and then pick one of the voices from the list of availableVoices. With Qt 6.6, it becomes easy to find a suitable voice matching a combination of criteria:

const auto frenchWomen = textToSpeech->findVoices(QLocale::French,
                                                  QVoice::Female,
                                                  QVoice::Adult);
const auto norwegians = textToSpeech->findVoices(QLocale::Norway);

Note how the criteria can include a single attribute of a locale (e.g. just "French" as the language, or "Norway" as the territory; a QLocale object always has both defined). This way, your application doesn't have to worry about the optimal territory or dialect. To be fair, one shouldn't ask a Nynorsk voice to pronounce a Bokmål text; but if your system only happens to support one of the official Norwegian languages, then using that one will still be an improvement over the English voice of, say, your navigation system trying to pronounce my old street address in "Banksjef Frølichs Gate".

With the exception of QTextToSpeech::synthesize (where the code that processes the raw PCM bytes should be written in C++ anyway), all new capabilities are also available from QML. For example, a voice can be selected through the VoiceSelector attached properties:

TextToSpeech {
    id: femaleEnglishVoice
    VoiceSelector.gender: Voice.Female
    VoiceSelector.language: Qt.locale("en")
}

This will implicitly select the first matching voice, or otherwise leave the voice unchanged.

What's left?

The last significant feature on my Qt TextToSpeech backlog is support for the Speech Synthesis Markup Language, or SSML for short. A work-in-progress implementation is available on gerrit code review, and what I learned from that experiment is that each backend supports a different subset of SSML. Also, the data we get from backends for the new sayingWord signal are indices into the actual text being spoken, not into the XML string. This might be ok, but the feature needs some more thinking; we don't want an XML string that works well on one platform to break the output completely on a different platform (but should we then remove XML elements that we know to be currently unsupported?).

Not all new features are available with all backends. In particular, synthesising to PCM data and emitting word-by-word progress require support from the backend. The new QTextToSpeech::engineCapabilities API reports which features are implemented by the backend, and we have updated the backend documentation with the relevant details. Applications can now check at runtime which features they can use, but it would of course be best if everything just worked everywhere. Most importantly, it would be great if we could synthesise speech PCM data with the speech-dispatcher engine as well. Contributions welcome, although last time I checked, this required some work on speech-dispatcher itself (at least on its documentation).
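
For example, a runtime check could look roughly like this; the helper function is made up, and the Capability flag name should be verified against the documentation of your Qt version:

#include <QtTextToSpeech/QTextToSpeech>

bool canSynthesizeToPcm(const QTextToSpeech &tts)
{
    // engineCapabilities() reports what the loaded backend implements;
    // only call synthesize() if the corresponding capability is present.
    return tts.engineCapabilities().testFlag(QTextToSpeech::Capability::Synthesize);
}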

As for speech support as a whole: the qtspeech repository now covers the direction from text to speech; some research has been done and proof-of-concept implementations for speech recognition are available on gerrit code review. We'd be very interested to learn more about your use-cases for such a module.

And apropos of contributions: around the Qt 6.6 feature freeze we held a public API review of Qt TextToSpeech, and I'd like to thank Marc, Fabian, and Philippe for taking the time to go through the changes, provide their feedback, and generally help with improving this module!

