DOM Blues

Did you know that there are characters you cannot store in an XML document in any way? You cannot store #x7 as plain character data:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
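To make that production a bit more tangible, here is a minimal sketch of the same ranges as a C++ predicate. The name isXmlChar is made up for this post and is not part of Qt or libxml; it simply mirrors production [2] of XML 1.0.

```cpp
#include <cstdint>

// Hypothetical helper: true if the code point matches
// production [2] Char of XML 1.0.
constexpr bool isXmlChar(std::uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

With this, isXmlChar(0x7) is false, while tab (0x9) and ordinary letters pass.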

You cannot store it as a character reference:

"Characters referred to using character references MUST match the production for Char."

You cannot even store it in a CDATA section:

[20] CData ::= (Char* - (Char* ']]>' Char*))

Well, this is actually a lie. Character data is defined as:

[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

i.e. any sequence of characters that does not contain '<', '&' or "]]>". What does "any sequence" mean?

"Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. ... Consequently, XML processors MUST accept any character in the range specified for Char."

So while not explicitly illegal, #x7 is not guaranteed to be accepted by XML parsers. And, indeed, libxml's xmllint dies on it with an "internal error" (erm...?).

What all this means is that if you blindly store QStrings in QDomText nodes, chances are you will end up with a document that's a bit... weird. It's not non-well-formed, because it does not violate any well-formedness constraints. But it's not guaranteed to be readable to XML parsers either.

So what can you do? Well, here's what I'm thinking about: a static function


which allows the programmer to specify one of three actions to take:

  • Do nothing, risking a weird XML document.
  • Silently drop invalid characters.
  • Return null nodes from the factory functions in QDomDocument.

The first action is obviously bad. The second action is also bad - if there is a #x7 in the data, it's probably there for a reason and it's likely to be missed. The third action is just ugly, since Qt doesn't throw exceptions and you end up having to check for a null node every time you use a factory function.
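For illustration, the three actions could be sketched in plain C++ roughly like this. Everything here is an assumption made for the sake of the example: InvalidCharPolicy, applyPolicy and isXmlChar are hypothetical names, std::u32string stands in for QString, and an empty std::optional stands in for a null node. It is not how QDomDocument works internally.

```cpp
#include <optional>
#include <string>

// Hypothetical policy enum, one value per action in the list above.
enum class InvalidCharPolicy { Accept, Drop, Reject };

// Hypothetical helper: true if cp matches production [2] Char of XML 1.0.
constexpr bool isXmlChar(char32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Applies the chosen policy to the text. An empty optional plays the
// role of the null node returned by a factory function.
std::optional<std::u32string> applyPolicy(const std::u32string &text,
                                          InvalidCharPolicy policy)
{
    if (policy == InvalidCharPolicy::Accept)
        return text; // do nothing, weird characters and all

    std::u32string result;
    for (char32_t cp : text) {
        if (isXmlChar(cp))
            result += cp;
        else if (policy == InvalidCharPolicy::Reject)
            return std::nullopt; // refuse to create the node
        // Drop: the invalid character is silently skipped
    }
    return result;
}
```

Note how the Reject case forces every caller to check the result before using it, which is exactly the ugliness complained about above.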

Go figure. :p
