Qt 4.6 with XML Schema support

Now it is official: last week my XML Schema branch on Gitorious was merged into Qt master (the upcoming Qt 4.6) by the brave Trolls Peter and Thiago.
Since February, when the internship ended and I had to go back to university, most of the changes to the branch have been code cleanups, API improvements and additional documentation. So no new features so far; however, there is a long TODO list on my desk that will get some attention as soon as time permits ;)

So what is all this XML Schema thingy about? Why should I use it in my applications? As you can see in the CWE/SANS Top 25 Most Dangerous Programming Errors, first place in the category Insecure Interaction Between Components is held by Improper Input Validation. Hands up: who of us checks all the input data that the user enters into our application? While it is fairly easy to check input typed into a QLineEdit, QTextEdit or QSpinBox with a QValidator or some hand-written code, validating input data that comes from files or over the network is more complicated. This kind of data has a special format (e.g. a CSV file, an ODF document or an image) that has to be parsed and interpreted to make sure that no bogus data is pushed into the application. For these common formats, however, there are libraries that do the checks and validation for us and return an error in case of a violation.

Over the last few years the XML format has become more and more popular, and the possibility to define your own XML dialects (e.g. DocBook, MathML, QtUI) has led to broad adoption in the software world. If you want to parse an XML document you normally use an XML parser like QXmlStreamReader or QDomDocument, which do all the low-level parsing for you and report an error if the input document is not well-formed as defined in the XML standard. Well-formed basically means that tag names, attribute names and attribute values contain only allowed characters and that tags are nested correctly inside each other. However, well-formedness doesn't say anything about which tags are allowed inside the document (e.g. <html>, <resource> etc.) or which tags may contain which other tags or attributes. So the classic XML parsers can't help you make sure that your input data is correct, and you will find yourself writing code like the following

if (element.tagName() != "resource")
    return error;


when parsing XML documents in your application.

Since this problem is as old as XML itself, there have been many attempts to provide an easy way of validating XML documents against specific constraints. The document type definition (DTD) is an integral part of the XML standard; however, its ability to define the structure of an XML document is quite limited. Other attempts are XML Schema and RELAX NG, where the former is the official validation language of the W3C and the latter the result of a counter-movement to XML Schema, which many day-to-day XML users regard as too complex and difficult to learn.

Regardless of their complexity, all XML validation languages have one thing in common: they describe, in an abstract way, what a valid XML document should look like. In other words, they define a grammar for a language. This grammar (a DTD or XML Schema file) is then used to validate an XML instance document and decide whether it is correct or not.

This functionality is now available through the new QXmlSchema and QXmlSchemaValidator classes in Qt 4.6. So instead of checking the individual tag names manually, just create an XML Schema definition of the format you want to parse, load this definition into a QXmlSchema object and then validate your input data against it via QXmlSchemaValidator. The validator will then tell you whether the input data is valid and can be processed further, or whether processing should stop because the data is bogus. That takes the burden of error/validity checking off you and reduces the code size as well... however, you still have to iterate over the individual XML elements and parse the data into a C++ object representation to work with it further. Can't we simplify that somehow?

To cite a famous quote: "Yes, we can!" ;)

The grammar of the XML document that is used for validation contains all the information we need to generate the following C++ code:

  1. C++ classes for every type defined in the XML schema
  2. C++ code for parsing (and validating) an XML document and filling the C++ objects
  3. C++ code for writing out C++ objects to XML documents

So instead of defining your C++ data objects first and trying to fill them with XML data in a second step, you could write an XML Schema definition first, which describes your XML input data, and then have the matching C++ data objects and the parsing and serializing code generated automatically. That sounds really great (how much time have you wasted parsing XML documents into C++ structures manually?) and indeed it is! ;) 'So where is the code?' you might ask... Unfortunately no code has been written yet that could do that; however, with the XML Schema definition available as an internal object representation, the first major step is already done. Creating C++ code from it is just a question of diligence and time. Let's see what the future will bring!

