W3C XML Schema validation with Qt

As one can see here, my contributions to KDE stopped from October 2008 to January 2009. That's not because I joined another open source project or because my natural instincts forced me to hibernate; no, there are two other reasons:

  • I started an internship at Trolltech aka Qt Software as part of my studies
  • The internship was located in Oslo, so there was a lot for me to explore in the evenings and during the weekends ;)

Today I want to talk about the project I worked on here at Qt Software during the last 4 months.
Qt provides really nice support for handling XML documents, with DOM and SAX APIs in the QtXml module and
QXmlStreamReader as an alternative approach to parsing XML. Although there are still many hand-written
implementations that load XML documents from the local hard disk or via the network and iterate over the
nodes to extract single parts of the XML tree, QtXmlPatterns provides a fast implementation
of the XQuery and XPath specifications, which allows you to do all the loading and extracting of XML data with only
5 lines of code. In Qt 4.5 the XML support has been extended by an XSLT implementation, which allows you to easily
convert documents from one XML dialect into another or to generate source code from an XML description, as the Akonadi
project does for its database access classes.

However, all these technologies expect the XML input documents to be well-formed and valid. Here, well-formed
means that the document conforms to the XML specification (correct tag syntax, properly nested tags etc.). Valid means that
the elements have only the attributes allowed by the XML dialect the document is an instance of, and that
the elements appear in the correct order. While well-formedness is enforced by all basic XML parsers (in our case
QXmlStreamReader), validity depends on the validation language one wants to use. The most common languages are

  • W3C XML Schema
  • RelaxNG
  • Schematron

W3C XML Schema is, as the name suggests, defined by the W3C and is part of nearly every other XML technology specification released by them,
so it is well accepted and used by many software systems out there.
RelaxNG takes a similar approach to validating XML documents as XML Schema, but it has a much simpler syntax.
Schematron follows an imperative concept of validation. Instead of describing the complete structure of a valid document,
it defines assertions that must be true, otherwise the document is invalid.
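
To make the difference between well-formed and valid from above a bit more tangible, here is a minimal sketch in plain C++ (no Qt involved). It is not how QtXmlPatterns or any real validator works internally: the `contact` dialect, the `Tag` struct and the functions `tokenize`, `isWellFormed` and `isValidContact` are all made up for illustration, and the toy parser only understands plain tags without attributes, comments or self-closing elements.

```cpp
#include <cassert>
#include <cstddef>
#include <stack>
#include <string>
#include <vector>

// Hypothetical toy model of a tag: <name> (opening) or </name> (closing).
struct Tag {
    std::string name;
    bool closing;
};

// Extract the tags of a simplified XML document (assumption: no attributes,
// comments, CDATA or self-closing tags). Returns false on a broken tag.
bool tokenize(const std::string &xml, std::vector<Tag> &tags)
{
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        const std::size_t end = xml.find('>', pos);
        if (end == std::string::npos)
            return false; // '<' without matching '>'
        std::string name = xml.substr(pos + 1, end - pos - 1);
        const bool closing = !name.empty() && name[0] == '/';
        if (closing)
            name.erase(0, 1);
        if (name.empty())
            return false; // "<>" or "</>"
        tags.push_back(Tag{name, closing});
        pos = end + 1;
    }
    return true;
}

// Toy well-formedness check: tags are properly nested and there is
// exactly one root element.
bool isWellFormed(const std::string &xml)
{
    std::vector<Tag> tags;
    if (!tokenize(xml, tags))
        return false;

    std::stack<std::string> open;
    int roots = 0;
    for (const Tag &tag : tags) {
        if (!tag.closing) {
            if (open.empty() && ++roots > 1)
                return false; // second root element
            open.push(tag.name);
        } else {
            if (open.empty() || open.top() != tag.name)
                return false; // closing tag does not match the open one
            open.pop();
        }
    }
    return open.empty() && roots == 1;
}

// Toy validity check for the hypothetical "contact" dialect: the root must
// be <contact> with exactly the children <name> and <email>, in that order.
bool isValidContact(const std::string &xml)
{
    if (!isWellFormed(xml))
        return false;

    std::vector<Tag> tags;
    tokenize(xml, tags);

    const std::vector<std::string> expected = {
        "contact", "name", "/name", "email", "/email", "/contact"
    };
    if (tags.size() != expected.size())
        return false;
    for (std::size_t i = 0; i < tags.size(); ++i) {
        const std::string actual = (tags[i].closing ? "/" : "") + tags[i].name;
        if (actual != expected[i])
            return false;
    }
    return true;
}
```

A document with `<email>` before `<name>` passes `isWellFormed()` but fails `isValidContact()`, which is exactly the gap a schema language fills: the nesting rules come from XML itself, while the rules about which elements may appear where come from the schema.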

So we can see that it is worth having an implementation of W3C XML Schema to complete the QtXmlPatterns module. Unfortunately,
Qt 4.4 and 4.5 are missing support for that... and that is the point where we finally come back to my internship project ;)

The goal of the project was to evaluate the existing XML schema validation C/C++ implementations (API-wise) and come
up with a nice API (and of course implementation) based on Qt and integrated into QtXmlPatterns.

To make a long story short... yes, we have a working implementation now!

And now the longer version of that story:
The implementation passes around 99% of the tests of the W3C XML Schema Test Suite. That looks pretty good at first glance, but I have to admit that I disabled some of the tests: for example, all the tests that are marked as invalid in the bug tracking system, and all tests that currently do not pass because of memory/processor resource limitations...
So I guess the real number of passed tests is around 98%, still acceptable IMHO ;)

So how can a developer make use of the schema validation? In the current version, we only support checking
whether a schema document is valid and whether an instance document is valid according to a given schema. The validation
is not integrated into QXmlStreamReader yet, so the developer has to do the check manually, as in the
following code:

    #include <QtCore/QDebug>
    #include <QtCore/QUrl>
    #include <QtXmlPatterns/QXmlSchema>
    #include <QtXmlPatterns/QXmlSchemaValidator>

    // Load and compile the schema first.
    QXmlSchema schema;
    schema.load( QUrl("file:///home/user/myschema.xsd") );

    if ( schema.isValid() ) {
        // Validate the instance document against the schema.
        QXmlSchemaValidator validator( schema );
        if ( validator.validate( QUrl("file:///home/user/instance.xml") ) ) {
            qDebug() << "instance is valid";
        } else {
            qDebug() << "instance is invalid";
        }
    } else {
        qDebug() << "schema is invalid";
    }

Of course it is also possible to retrieve an error message that explains why the schema or instance is invalid; information
about that can be found in the API documentation.

For those of you who prefer fancy, colored screenshots, here come some pictures of the example application:

Valid document instance

Invalid document instance

When choosing the invalid instance document, the application points out the invalid XML construct inside the
document.

We plan to publish the Qt branch with the schema validation support as a separate branch in the Qt Labs git
repository. Unfortunately you have to check out and compile the complete Qt, as the schema support also patches
components outside QtXmlPatterns (namely QRegExp), but as soon as we merge it upstream, that shouldn't be a
problem any longer.

So will schema validation support make it into Qt 4.6? Maybe... I hope so, although there is still some stuff
to do: ironing out all the rough edges (usability-wise) and integrating it cleanly into the rest of Qt. However,
my time here in Oslo ends soon and I have to go back to university. But I'm quite sure that Frans (my technical
mentor and the other 50% of the XML team ;) ) will continue to keep track of it and help to make it really rock!

