A ramble on characters in XML documents
By sandoz on Jun 05, 2006
When I was a rookie XML 1.0 user I was not aware that there were restrictions on the characters allowed in element/attribute names, text content and attribute values. It caused some mild eyebrow raising when I found out and looked more closely at the W3C XML 1.0 Recommendation! XML's foundation is Unicode characters, right? So why the subset?
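Take for example the specified character range of a character that is allowed as part of text content (the Char production of XML 1.0):
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]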
So that means no 'control' characters like 'NULL' or 'BELL'; below the space character only tab, line feed and carriage return are allowed (see here for a good description of the issues, which may explain why control characters were disallowed in XML 1.0). A character code of '0' is not allowed as part of the text content of an XML document. I think this makes sense from the perspective of C/C++, since '0' is used as the terminator for strings and allowing '0' would cause all sorts of issues.
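To make the restriction concrete, here is a minimal sketch in Python that checks code points against the XML 1.0 character ranges for text content. The function names `is_xml10_char` and `first_illegal_char` are my own, made up for illustration; they are not part of any library.

```python
def is_xml10_char(code_point: int) -> bool:
    """Return True if code_point falls in the XML 1.0 Char ranges."""
    return (
        code_point in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF          # space up to just before the surrogates
        or 0xE000 <= code_point <= 0xFFFD        # private use up to U+FFFD
        or 0x10000 <= code_point <= 0x10FFFF     # supplementary planes
    )


def first_illegal_char(text: str) -> int:
    """Return the index of the first character XML 1.0 disallows, or -1 if none."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return index
    return -1
```

With this, `is_xml10_char(0)` and `is_xml10_char(7)` (NULL and BELL) are both false, while tab (`0x9`) passes.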
So the following XML document is not well-formed:
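As a hedged illustration, a document with a BELL character (code 7) in its text content should be rejected, and so should one that tries to sneak it in as the character reference `&#7;`. The sketch below assumes Python's standard `xml.etree.ElementTree` parser; the helper `is_well_formed` and the sample documents are my own.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A literal BELL character in text content.
bell_document = "<doc>ring \x07 ring</doc>"
# The same character written as a character reference.
bell_reference = "<doc>ring &#7; ring</doc>"
# A control-free document for comparison.
plain_document = "<doc>ring ring</doc>"

for name, document in [
    ("literal BELL", bell_document),
    ("BELL reference", bell_reference),
    ("plain", plain_document),
]:
    print(name, "->", "well-formed" if is_well_formed(document) else "not well-formed")
```

Only the plain document should come out as well-formed; the other two should fail to parse.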
For more information on this I highly recommend looking at Tim Bray's most excellent annotated XML 1.0.
In XML 1.1 the 'NULL' character code is still disallowed. I think XML 1.1 is an improvement on XML 1.0, since XML 1.0 restricted the character codes that were allowed in element/attribute names. Now more languages can be used in element/attribute names.
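To make the 'NULL is still disallowed' point concrete, here is a minimal sketch of the XML 1.1 character ranges as I read the XML 1.1 Recommendation. The range starts at code 1 rather than 0, and the 'restricted' control characters are, as far as I understand it, only meant to appear as character references rather than literally. The function names are hypothetical, chosen for illustration.

```python
def is_xml11_char(code_point: int) -> bool:
    """Return True if code_point falls in the XML 1.1 Char ranges (0 is excluded)."""
    return (
        0x1 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def is_xml11_restricted_char(code_point: int) -> bool:
    """Return True for XML 1.1 RestrictedChar: control characters that are
    allowed in XML 1.1 but, per my reading, only via character references."""
    return (
        0x1 <= code_point <= 0x8
        or 0xB <= code_point <= 0xC
        or 0xE <= code_point <= 0x1F
        or 0x7F <= code_point <= 0x84
        or 0x86 <= code_point <= 0x9F
    )
```

So BELL (code 7) becomes a legal, if restricted, character in XML 1.1, while `is_xml11_char(0)` stays false.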