Low-level audio processing with QtMultimedia

One of the new features introduced in Qt 4.6 is the QtMultimedia module. The 'big picture' view of QtMultimedia has been presented in an earlier post to this blog, and has been recently updated. Here I want to take a closer look at the low-level audio APIs in particular, to discuss the types of applications for which they may be useful.

In a following post, I'll illustrate this by describing a new demo application which has been added to Qt. To whet your appetite, here's a picture of it:

[Figure: screenshots of the spectrum analyser demo app running on Symbian and Windows]

Anatomy of an audio stack

One way to explain the intention of the QtMultimedia audio APIs is to take a step back and look at what happens inside an audio playback software stack. For now, let's think about an archetypal stack, rather than the software which is running on any particular platform. While the implementations vary considerably between platforms, the concepts are broadly similar, at least for the purposes of this discussion.

When the user hits the 'play' button on a music track, the following may be among the operations which take place under the covers:

  • Acquisition of hardware resources required by the use case, e.g. output devices (speaker / headphones), coprocessors used for decoding or effects processing, etc.
    • This is particularly important on embedded devices, where:
      • Resources can be highly constrained, e.g. the device may only have sufficient processing power to decode one MP3 stream at a time
      • There may be a requirement to ensure that multimedia use cases don't interfere with other aspects of the device, for example the fact that music is being played should not prevent the ringtone from being played when a mobile phone receives an incoming call
  • Reading the clip contents either from the file system, or by streaming from the network
  • Decryption of DRM-protected content
  • Parsing the clip's container format
    • Extracting metadata such as artist, track name, format etc
    • Extracting the encoded audio bytestream
  • Decoding the bytestream to generate a raw PCM stream
  • Applying effects
  • Mixing with other audio streams which are being played concurrently
  • Routing to the correct output device
  • Audio rendering
    • Digital to analogue conversion (DAC) - converting the PCM bytestream into a varying voltage signal
    • Amplification
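
To make the 'applying effects' and 'mixing' steps above a little more concrete, here is a minimal sketch of how they might look for 16-bit PCM data. It uses only the C++ standard library; the function names (saturate, applyGain, mix) are hypothetical and are not part of any Qt API. A real audio stack would typically do this in optimised or dedicated hardware code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp an intermediate value to the signed 16-bit sample range, so that
// loud passages clip instead of wrapping around.
inline std::int16_t saturate(double value)
{
    return static_cast<std::int16_t>(std::max(-32768.0, std::min(32767.0, value)));
}

// A very simple 'effect': scale every sample by a linear gain factor.
std::vector<std::int16_t> applyGain(const std::vector<std::int16_t>& in, double gain)
{
    assert(gain >= 0.0);
    std::vector<std::int16_t> out;
    out.reserve(in.size());
    for (std::int16_t sample : in)
        out.push_back(saturate(std::round(sample * gain)));
    return out;
}

// Mix two PCM streams sample by sample. The shorter stream is treated as
// silence once it runs out, so the result is as long as the longer input.
std::vector<std::int16_t> mix(const std::vector<std::int16_t>& a,
                              const std::vector<std::int16_t>& b)
{
    std::vector<std::int16_t> out(std::max(a.size(), b.size()));
    for (std::size_t i = 0; i < out.size(); ++i) {
        const std::int32_t sa = i < a.size() ? a[i] : 0;
        const std::int32_t sb = i < b.size() ? b[i] : 0;
        out[i] = saturate(static_cast<double>(sa + sb));
    }
    return out;
}
```

For example, mixing the streams {1000, 30000} and {2000, 30000, -5} yields {3000, 32767, -5}: the second sample is clipped at the top of the 16-bit range.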

The following picture tries to show how the components which perform these operations may be related in our imaginary audio stack. The exact arrangement may vary considerably between implementations. In some cases, these differences are simply the result of differing philosophies or approaches to the design of the audio stack. In others, the configuration of audio components may be dictated by hardware constraints. For example, on many embedded devices, audio processing may be performed by a dedicated coprocessor. The physical output connection of this processor may constrain what processing can happen downstream of it - for example, if the MP3 codec runs on a processor whose output is wired directly to the DAC, then no effects can be inserted into the PCM part of the audio path.

[Figure: anatomy of an audio stack]

The boxes represent components of the native audio stack. The red bars represent APIs which expose the functionality of the native stack, at different levels of abstraction.

The 'high level API' deals only with control, not data. This is to say that no buffers of audio data - be it MP3, PCM or any other format - pass between the client and the stack via this interface. Instead, the client describes the audio data which it wishes to process in the form of a descriptor such as a filename or URL. The processing itself is controlled via high-level commands such as play / record, pause, stop, and seek. On top of these commands, there may be another layer which provides features such as playlist management.

Parameters of the processing may be exposed to the client: these will almost certainly include volume / gain; more advanced parameters such as balance, equalizer, and control over audio effects may also be available. In addition, the API - or perhaps a companion API at a similar level of the stack - may allow the client control over the audio routing, by providing information about which audio input / output devices are currently available, and allowing the client to select which of them is used for a given playback / recording session.

In contrast with the above, the 'low level API' deals directly with the content of the audio stream. Buffers of audio data are exchanged between the client and the lower levels of the audio stack. The data formats which can be used at this level may vary depending on the platform: most, if not all, audio stacks will allow the client to play or record PCM streams, while support for processing streams of compressed data may or may not be provided.

Because this API is dealing directly with the data stream rather than with an abstract clip descriptor, some control commands - notably seek - do not make sense at this level. Others, such as pause, still have a place: although the client is providing or consuming data via the API, it is not typically directly connected to the audio hardware itself. There must usually be a level of buffering between the two in order to ensure that, should the client temporarily stop processing (for example due to its thread yielding to, or being pre-empted by, another one), the audio hardware can continue to read data from, or write data into, memory.
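
The buffering described above can be pictured as a FIFO sitting between the client and the hardware. The following is a minimal, single-threaded sketch for the playback direction, using only the C++ standard library. PcmRingBuffer is a hypothetical class invented for illustration; a real audio stack would need thread-safe (often lock-free) access and would live much closer to the driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-capacity FIFO standing in for the buffer between an
// audio client (the writer) and the audio hardware (the reader).
class PcmRingBuffer
{
public:
    explicit PcmRingBuffer(std::size_t capacity)
        : m_data(capacity), m_read(0), m_size(0)
    {
        assert(capacity > 0);
    }

    // Client side: queue as many samples as fit. Returns the number accepted,
    // which is less than 'count' if the buffer fills up.
    std::size_t write(const std::int16_t* samples, std::size_t count)
    {
        std::size_t written = 0;
        while (written < count && m_size < m_data.size()) {
            m_data[(m_read + m_size) % m_data.size()] = samples[written++];
            ++m_size;
        }
        return written;
    }

    // Hardware side: always produces 'count' samples, because the DAC cannot
    // wait. If the client has fallen behind (an underrun), the remainder is
    // padded with silence. Returns the number of real samples delivered.
    std::size_t read(std::int16_t* out, std::size_t count)
    {
        std::size_t delivered = 0;
        for (std::size_t i = 0; i < count; ++i) {
            if (m_size > 0) {
                out[i] = m_data[m_read];
                m_read = (m_read + 1) % m_data.size();
                --m_size;
                ++delivered;
            } else {
                out[i] = 0; // underrun: play silence
            }
        }
        return delivered;
    }

    // Number of samples currently queued.
    std::size_t available() const { return m_size; }

private:
    std::vector<std::int16_t> m_data;
    std::size_t m_read; // index of the oldest queued sample
    std::size_t m_size; // number of queued samples
};
```

As long as the client refills the buffer before available() drops to zero, a temporary stall on the client side is inaudible. The price is latency: the larger the buffer, the longer a newly written sample waits before it is played.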

The set of audio parameters which are available to the client may be restricted in comparison with those provided by the higher-level API - volume / gain may well be the only parameter which this interface exposes. Similarly, the client may be given less control over audio routing than the higher-level API affords. This would be the case, for example, if the low-level API represented a specific physical device, while the high-level API represented the audio subsystem as a whole.

This description of the high level API should sound familiar to those who have used Qt's Phonon API (at least if we only think about audio playback - Phonon does not support recording). The functional scope of the high level API may, however, go significantly beyond that of Phonon, as discussed previously in Justin's post.

The low level API, on the other hand, corresponds to QtMultimedia audio. Before looking at the latter in more detail, it's worth emphasising one point regarding the relationship between Phonon and QtMultimedia: current Phonon backends do not use QtMultimedia. The implementations of these two APIs are currently completely separate - at least down as far as the native API level.

Looking forward, the QtMobility project is delivering a suite of high-level multimedia APIs. These provide a similar level of abstraction to Phonon, but include features which Phonon lacks, and afford additional flexibility. For a recent update on the status and availability of these APIs, see this post.

The QtMultimedia audio APIs

So, having looked at audio APIs from an abstract standpoint, let's look at the QtMultimedia audio APIs themselves. These consist of the following four classes:

  • QAudioDeviceInfo
    • Represents an audio device such as a loudspeaker, headset or microphone. Describes its capabilities, in terms of which audio formats the device is able to process. A static function, availableDevices(), allows the client to query the set of audio devices present in the system.
  • QAudioInput
    • Allows the client to receive data, in a specified audio format, from a specified audio input device. Data is transferred via a QIODevice object, with QAudioInput offering two modes of operation:
      • "pull mode": the client provides a QIODevice by calling void start(QIODevice*). No further intervention is required from the client in order to manage the data flow.
      • "push mode": the QAudioInput object provides a QIODevice via QIODevice* start(). The client must then listen for the readyRead() signal, and then read() the new data.

    The client can control some aspects of the latency* - i.e. the amount of time between audio being sampled by the hardware and the corresponding data arriving at the application - by calling setBufferSize(). Supported buffer sizes may vary from platform to platform, but most allow sub-10ms latencies at all supported formats.

    The processedUSecs() function allows the client to determine how much data has been captured by the audio device. At any given time, the difference between this and the amount of data which has been received via the QIODevice indicates the amount of latency.

    * Note however that the other source of latency - the time taken for the audio device to prepare to capture data - is outside the control of the client. This initialization happens asynchronously following a call to start(), and its completion is indicated by a stateChanged(QAudio::State) signal.

  • QAudioOutput
    • Corresponding interface for audio output, which provides a similar pair of "pull" / "push" overloads of start().
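
The relationship between buffer size, audio format and latency mentioned above comes down to simple arithmetic. Here is a sketch of that arithmetic using only the C++ standard library. PcmFormat and the helper functions are hypothetical names chosen for illustration and are not part of QtMultimedia; in a real application the format parameters would come from the audio format in use, and the elapsed time from processedUSecs().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the parameters of a linear PCM format.
struct PcmFormat
{
    int sampleRateHz;  // e.g. 44100
    int channels;      // e.g. 2 for stereo
    int bitsPerSample; // e.g. 16
};

// Data rate of the stream in bytes per second.
inline std::int64_t bytesPerSecond(const PcmFormat& f)
{
    assert(f.sampleRateHz > 0 && f.channels > 0);
    assert(f.bitsPerSample > 0 && f.bitsPerSample % 8 == 0);
    return static_cast<std::int64_t>(f.sampleRateHz) * f.channels * (f.bitsPerSample / 8);
}

// Duration, in microseconds, of 'bytes' of audio in the given format.
inline std::int64_t bytesToMicroseconds(const PcmFormat& f, std::int64_t bytes)
{
    return bytes * 1000000 / bytesPerSecond(f);
}

// Largest buffer size, in whole frames, which holds no more than 'targetUs'
// microseconds of audio. This is the kind of value a client might pass to a
// setBufferSize()-style call.
inline std::int64_t bufferBytesForLatency(const PcmFormat& f, std::int64_t targetUs)
{
    const std::int64_t frames = static_cast<std::int64_t>(f.sampleRateHz) * targetUs / 1000000;
    return frames * f.channels * (f.bitsPerSample / 8);
}

// Audio which the device has captured but the client has not yet received,
// expressed as a duration in microseconds.
inline std::int64_t pendingLatencyUs(const PcmFormat& f, std::int64_t processedUs,
                                     std::int64_t bytesReceived)
{
    return processedUs - bytesToMicroseconds(f, bytesReceived);
}
```

For 44.1 kHz, 16-bit stereo audio the data rate is 176400 bytes per second, so a 10 ms buffer corresponds to 441 frames, or 1764 bytes.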

Work in progress

It's worth saying at this point that there are low-level audio use cases which QAudioInput and QAudioOutput don't cover. Or to put it another way, some of the functionality towards the bottom part of the diagram above is not currently exposed by QtMultimedia. This missing functionality includes the following:

  • Ability to resume following pre-emption
    • While QAudioOutput and QAudioInput allow the client to suspend and resume processing, on some occasions suspension may be caused by events elsewhere in the system rather than being requested by the client.

    On some platforms - particularly in the embedded space - the concept of audio pre-emption is important. For example, on a mobile phone, music playback may need to be terminated by the system when a call is received, so that the ringtone can be played. Once the call has ended (or has been rejected by the user), music playback can be resumed; whether this happens automatically, or requires the user to resume playback, depends on the device in question.

    For a QAudioOutput or QAudioInput client which is pre-empted in this way, we need (a) to be able to tell the client that it has been pre-empted, and (b) a way to notify the client when the audio resources which it needs have become available once again, so that it can either auto-resume, or just re-enable the 'play' button in its UI.

  • Notification of changes in audio device availability
    • Although QtMultimedia allows the client to query the list of available devices via QAudioDeviceInfo::availableDevices(), there is no signal via which the client can be notified when this list changes. This could happen for example when the user plugs in or disconnects a headset or an HDMI cable. When this happens, the platform may decide that the default audio device has changed, and automatically re-route. Because the application is not notified when this happens, it cannot decide whether this is actually the behaviour which it wants: for example, the application developer may wish to pause audio playback when a headset is disconnected, rather than have its audio automatically re-routed to a loudspeaker.
  • Seamless re-routing
    • This is related to the previous point: specifically, what the client can do when notified that an audio device has become (un)available. While the client can select which device to use by providing a QAudioDeviceInfo to the QAudioInput or QAudioOutput constructor, there is no way to change the device during recording / playback. The only way to re-route is therefore to tear down the audio session by destroying the QAudioOutput object, and then create a new one with the desired device. The problem with this is that the audio subsystem to which QAudioOutput provides access may be buffering audio data close to the hardware. Tear-down and re-creation therefore at best causes a gap in playback while the system re-buffers, and may also involve some audio data being lost altogether.
  • Volume
    • Finally, QAudioOutput and QAudioInput do not provide any way to query or set the volume / gain.
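
In the absence of a change notification, one conceivable workaround (an assumption on my part, not something QtMultimedia provides) is for the application to poll the device list periodically and compare successive snapshots. The sketch below shows only the comparison step, using the C++ standard library; devices are represented by plain name strings, and DeviceChanges and diffDevices are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Result of comparing two snapshots of the device list.
struct DeviceChanges
{
    std::vector<std::string> added;   // present now, absent before
    std::vector<std::string> removed; // present before, absent now
};

// Sort a list of names and drop duplicates, as required by set_difference.
inline void normalise(std::vector<std::string>& names)
{
    std::sort(names.begin(), names.end());
    names.erase(std::unique(names.begin(), names.end()), names.end());
}

// Compare the device names seen at the previous poll with those seen now.
DeviceChanges diffDevices(std::vector<std::string> before, std::vector<std::string> after)
{
    normalise(before);
    normalise(after);

    DeviceChanges changes;
    std::set_difference(after.begin(), after.end(), before.begin(), before.end(),
                        std::back_inserter(changes.added));
    std::set_difference(before.begin(), before.end(), after.begin(), after.end(),
                        std::back_inserter(changes.removed));

    // Sanity check: we can never report more changes than there are devices.
    assert(changes.added.size() <= after.size());
    assert(changes.removed.size() <= before.size());
    return changes;
}
```

An application could then, for instance, pause playback when the device it is currently using appears in the removed list. Polling only tells the application that a change has happened at some point since the last poll, so it is no substitute for a proper signal.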

Plans for adding volume control to QtMultimedia are under way; discussions around the other topics are ongoing. As always, we'd value feedback on these or any other aspects of Qt - please get involved by commenting on this post or via the #qt-labs IRC channel.

When you should go low

So, given the choice between the high-level Phonon API and the low-level QtMultimedia API, what considerations can help you decide which one to use?

Well, let's start with some easy wins - there are some use cases when Phonon is clearly the right choice:

  • Development of a music player, when the Phonon backends on your targeted platforms already provide:
    • Codecs for most or all of the audio formats your users are likely to want to play
    • Support for most or all of the protocols via which your users want to stream music

In these cases, there is no reason to go to the extra effort which using QAudioOutput would require - the Phonon backend is already doing the heavy lifting for you.

Another use case for which Phonon is probably the way to go is when your application needs to deal with DRM data. This is because using QAudioOutput would likely require the application to handle plaintext (i.e. decrypted) data itself, and may therefore impose some limits on how the application can be deployed.

Update 18/05/10: Phonon does not currently support playback of DRM content, and there is no plan to add this.

Conversely, if the application needs to record, rather than just play audio, then Phonon clearly isn't suitable since it doesn't support audio capture.

On the other hand, if the application has any of the following characteristics, QAudioOutput may be the better choice:

  • Specific latency requirements
  • Need to access raw audio data directly

Applications which may have such requirements include:

  • Streaming applications such as VoIP or video telephony endpoints
  • Streaming music applications, when Phonon doesn't offer the required protocol / codec support
  • Real-time audio analysis applications, such as instrument tuners
  • Applications which synthesize their own audio streams, such as musical instrument simulators
  • Those which need to play sounds at precisely defined moments, such as games
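
As an illustration of the 'synthesize their own audio streams' case, here is a minimal sketch of a function which generates a mono, 16-bit PCM sine tone. It uses only the C++ standard library, and generateSine is a hypothetical name. In a real application, a buffer like this would be fed to the low-level audio output, for example by writing it into the QIODevice used for playback.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Generate 'frameCount' samples of a mono sine tone as signed 16-bit PCM.
// 'amplitude' is a fraction of full scale, between 0.0 and 1.0.
std::vector<std::int16_t> generateSine(double frequencyHz, int sampleRateHz,
                                       std::size_t frameCount, double amplitude = 0.8)
{
    assert(sampleRateHz > 0);
    // Frequencies at or above half the sample rate cannot be represented.
    assert(frequencyHz >= 0.0 && frequencyHz < sampleRateHz / 2.0);
    assert(amplitude >= 0.0 && amplitude <= 1.0);

    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<std::int16_t> samples(frameCount);
    for (std::size_t n = 0; n < frameCount; ++n) {
        const double phase = twoPi * frequencyHz * static_cast<double>(n) / sampleRateHz;
        samples[n] = static_cast<std::int16_t>(std::lround(amplitude * 32767.0 * std::sin(phase)));
    }
    return samples;
}
```

For example, generateSine(440.0, 44100, 44100) produces one second of the note A4 at 80% of full scale.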

A problem which presents itself when looking at the above list is that taking the decision to use QAudioOutput leaves a lot of work to be done in the application itself. Imagine that the reason Phonon cannot be used is that, although the application needs to stream audio via a protocol which the Phonon backend supports (say, RTSP), the stream is encoded using a proprietary codec which is not available to the Phonon backend. In this case, the application needs to manage the RTSP stream itself (either by using its own streaming engine, or by using native platform streaming APIs), decode the stream using the proprietary codec, and then pass the decoded audio data to QAudioOutput.

The root of this problem is that the abstractions offered by both Phonon and QtMultimedia correspond to a large chunk of audio stack, rather than allowing access to individual components. In the case of Phonon, the entire stack is lumped together and abstracted by a single API. QtMultimedia breaks this down a bit, but still groups the codecs and the final output stage (routing and the DAC) together. There aren't yet any Qt abstractions for individual components such as codecs, meaning that application developers who wish to address those components directly must do so via platform-specific APIs.

A demo is worth a thousand words

... but I've already written twice that much, so it's probably time for a break. In a following post, we'll look at putting QtMultimedia to work in that demo application.
