A ramble on characters in XML documents


When I was a rookie XML 1.0 user I was not aware that there were restrictions on the characters allowed in element/attribute names, text content and attribute values. It caused some mild eyebrow raising when I found out and looked more closely at the W3C XML 1.0 Recommendation! XML's foundation is Unicode characters, right? So why the subset?

Take, for example, the production that specifies which characters are allowed as part of text content:

Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

So that means no 'control' characters like 'NULL' or 'BELL'; of the characters below #x20, only tab (#x9), line feed (#xA) and carriage return (#xD) are allowed (see here for a good description of the issues, which may explain why control characters were disallowed in XML 1.0). A character code of 0 is not allowed as part of the text content of an XML document. I think this makes sense from the perspective of C/C++, since 0 is used as the terminator for strings, and allowing it would cause all sorts of issues.
Note: bnux, an interesting binary encoding of the XML infoset, cleverly takes advantage of this fact to terminate a sequence of UTF-8 encoded characters using a '0' instead of length prefixing.
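
As a rough sketch, the XML 1.0 Char production above can be written as a predicate on code points. The function names is_xml10_char and first_illegal_char are made up for this illustration and do not come from any library:

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # skips surrogates, #xFFFE and #xFFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def first_illegal_char(text: str) -> int:
    """Return the index of the first character Char rejects, or -1 if none."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return index
    return -1
```

A check like first_illegal_char("ab\x07c") points at the 'BELL' character, while a string containing only tabs and ordinary text passes.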

So the following XML document is not well-formed:

<element>&#x00;</element>
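
One way to see this in practice is to hand the document to an XML 1.0 parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed is just an assumption for this example:

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        # Raised for any well-formedness error, including a character
        # reference to a code point outside the Char production.
        return False
    return True


rejected = is_well_formed("<element>&#x00;</element>")   # reference to NULL
accepted = is_well_formed("<element>&#x9;</element>")    # reference to tab
```

Here rejected should come out False and accepted True, since a tab is one of the few characters below #x20 that Char permits.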

For more information on this I highly recommend looking at Tim Bray's most excellent annotated XML 1.0.

Having said that, the W3C XML 1.1 Recommendation opened the door to 'control' characters! The Char production is now:

Char   ::=   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
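
To make the difference concrete, here is a small sketch that encodes both Char productions and lists the code points that XML 1.1 accepts but XML 1.0 does not. The names is_xml10_char, is_xml11_char and newly_allowed are invented for this illustration:

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """XML 1.1 Char production."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Code points that became legal in XML 1.1: #x1-#x8, #xB-#xC and #xE-#x1F.
newly_allowed = [
    cp for cp in range(0x110000)
    if is_xml11_char(cp) and not is_xml10_char(cp)
]
```

Note that XML 1.1 also defines a RestrictedChar production, so these newly allowed control characters may only appear as character references such as &#x7; and not as literal characters in the document.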

The 'NULL' character code is still disallowed. I think XML 1.1 is an improvement on XML 1.0, since XML 1.0 restricted the character codes that were allowed in element/attribute names. Now more languages can be used for element/attribute names.

Note: The subtle differences between XML 1.0 and XML 1.1 can cause some interesting edge-case issues. Take, for example, the W3C xml:id Recommendation. The value of an xml:id attribute is specified to be an NCName. If you need to implement xml:id correctly on top of an existing parser that does not support it, for example as a SAX XMLFilter, then that filter will need to check the XML version of the document and perform the NCName character validation that matches that version. This is the case with Norm's nifty xml:id implementation at java.net. Obviously the ideal solution would be for parsers to implement xml:id themselves.
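
The version-dependent check described in the note might look roughly like the sketch below. It is written in Python rather than as a Java SAX filter, and it is not how Norm's implementation works. The names is_ncname_xml11, check_xml_id and xml10_check are hypothetical. The ranges are the NameStartChar and NameChar productions from XML 1.1 with the colon left out, as NCName requires. The much longer name tables from Appendix B of the original XML 1.0 Recommendation are not reproduced here and are assumed to be supplied by the caller:

```python
# NameStartChar ranges from XML 1.1, without ':' (an NCName has no colons).
_START_RANGES = [
    (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional ranges allowed after the first character (NameChar).
_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(cp, ranges):
    return any(low <= cp <= high for low, high in ranges)


def is_ncname_xml11(value: str) -> bool:
    """Return True if value is an NCName under the XML 1.1 name rules."""
    if not value or not _in_ranges(ord(value[0]), _START_RANGES):
        return False
    return all(
        _in_ranges(ord(ch), _START_RANGES) or _in_ranges(ord(ch), _EXTRA_RANGES)
        for ch in value[1:]
    )


def check_xml_id(value: str, xml_version: str, xml10_check) -> bool:
    """Validate an xml:id value with the rules matching the document's version.

    xml10_check is a caller-supplied callable (hypothetical) that implements
    the XML 1.0 name tables, which this sketch does not include.
    """
    if xml_version == "1.1":
        return is_ncname_xml11(value)
    return xml10_check(value)
```

A filter built this way would read the version from the XML declaration once and then route every xml:id value through check_xml_id.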
