XML - Dissected
By Ramkumar Menon-Oracle on Jul 06, 2007
XML is a language used to describe a class of data objects called XML documents.
XML documents have both a physical structure and a logical structure. Here is an illustration of the physical structure of an XML document.
As indicated in the diagram, the physical layout of an XML document consists of one or more entities. Each document should have a document entity. An entity may refer to other entities that may result in inclusion of those entities in the document.
A data object is an XML document if it is "well-formed". There are several constraints that the data object needs to adhere to in order to be well-formed. These are discussed in detail in the XML 1.0 specification. At the minimum, the document should hold one or more elements. The document should have exactly one root element, no part of which appears in the content of any other element. Each element in the document that is not the root should have a parent element.
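As a rough illustration of these constraints, the following Python sketch relies on the standard library's xml.etree.ElementTree parser, which rejects input that is not well-formed. The helper name is_well_formed and the three sample documents are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Exactly one root element; every other element has a parent.
good = "<order><shipTo/><billTo/></order>"
# Two top-level elements, so there is no single root.
two_roots = "<order/><invoice/>"
# Start tags and end tags are not properly nested.
bad_nesting = "<order><shipTo></order></shipTo>"

for sample in (good, two_roots, bad_nesting):
    print(sample, "->", is_well_formed(sample))
```

Only the first sample is accepted. The check says nothing about validity against a DTD, only about well-formedness.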
Parsed Entities are those entities that contain character sequences. The characters may either be markup or character data.
Characters could be one of
a) Tab
b) Carriage Return
c) Line feed
d) Unicode characters
e) ISO/IEC 10646 characters
Of these, (a), (b), (c) and the space character (#x20) are termed whitespace characters.
Markup allows description of the logical structure and the storage of an XML document.
Markup takes one of the following forms.
a) Start tag
b) End tag
c) Empty element tag
d) Entity references
e) Character references
f) Comments
g) CDATA section delimiters
h) Document type declarations
i) Processing Instructions
j) XML declarations
k) Text declarations
l) Whitespace outside of the document element.
All characters that are not markup form the character data of an XML document. The ampersand (&) and the left angle bracket (<) are not allowed in character data in their literal form, except when used within a comment, a PI or a CDATA section.
At all other places, these characters have to be appropriately escaped using numeric character references or the predefined entity references [e.g. &amp; and &lt;]. Entities and entity references are described later in this article.
Similarly, an apostrophe or a quotation mark has to be escaped (&apos; or &quot;) when it appears inside an attribute value that is delimited by that same character. Escaping them within element content is permitted but not required.
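The escaping rules can be tried out with the helpers in Python's standard xml.sax.saxutils module. This is only a sketch; the sample strings are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# escape() replaces &, < and > so the text is safe as element content.
escaped_text = escape("Tom & Jerry <cartoon>")

# quoteattr() picks a delimiter and escapes only what that delimiter requires.
single_quoted = quoteattr('5" floppy')        # contains only a quotation mark
double_quoted = quoteattr('it\'s a "test"')   # contains both kinds of quote

print(escaped_text)
print(single_quoted)
print(double_quoted)
```

Note how quoteattr switches to apostrophe delimiters when the value contains only quotation marks, and falls back to &quot; when both kinds of quote are present.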
XML documents should begin with an XML declaration that indicates the version of XML in use.
<?xml version = '1.0' encoding = 'UTF-8'?>
The above declaration instructs the processor to interpret the document according to XML 1.0 specification, and informs it about the character encoding that is in use within the document.
A markup declaration may be an element type declaration, an attribute list declaration, an entity declaration, a notation declaration, a PI or a comment.
<!ELEMENT orderDate EMPTY> is an element type declaration. Such declarations can be defined locally within the XML document as a part of the document type declaration, or within an external subset (a DTD).
<!ATTLIST order Id CDATA "100-001"> is a declaration of an attribute named "Id" for an element named "order". The declaration also indicates that the attribute possesses a default value "100-001".
Entity Declarations and References
Do not confuse this with the "Entity" referred to in figure 1. Those entities are just logical representations of the actual physical storage of the XML document.
Entities are used for three main purposes.
a) To act as a macro: define once and use at multiple places.
b) To refer to external resources from within the document, for instance an XML file, a JPG image or a Word document.
c) To refer to arbitrary Unicode characters within your XML document.
(a) are termed internal entities.
(b) are termed external entities.
(c) are termed character entities.
An Entity has to be declared before it can be used.
The declaration for each of these categories of entities is illustrated below.
1. Internal Entity declaration
<!ENTITY orgName "Oracle Corporation">
This means that "orgName" is an entity that aliases the string "Oracle Corporation".
This entity can be referenced within the XML document by prefixing the entity name with a "&" and suffixing it with a semicolon.
. . . .
2. External Entity declaration
External entities can be declared to refer to XML text or to binary content.
Entities that contain XML text are declared as
<!ENTITY entityName SYSTEM "system-identifier">
or, when a public identifier is also supplied, as
<!ENTITY entityName PUBLIC "public-identifier" "system-identifier">
For example, you could include a product catalog XML into your XML document using the following entity declaration.
<!ENTITY catalog SYSTEM "http://www.oracle.com/prod/catalog/products.xml">
This kind of external entity is termed a Parsed Entity.
For declaring entities containing non-XML data, the declaration should name a NOTATION that indicates the actual format of the entity.
<!NOTATION JPG SYSTEM "Joint Photographic Experts Group">
<!ENTITY picture SYSTEM "http://www.photos.com/myphotos/me.jpg" NDATA JPG>
This kind of external entity is termed an Unparsed Entity.
3. Character Entities
The XML 1.0 specification defines five predefined character entities, namely lt, gt, amp, quot and apos.
These stand for the characters "<", ">", "&", the quotation mark (") and the apostrophe (') respectively.
These entities, and for that matter, any character entities, need not be declared separately in your XML document. The following example illustrates references to these entities in an XML document.
<description>This is a &lt;tag&gt; &amp; &apos; is an apostrophe</description>
To refer to an arbitrary Unicode character in your document, you just need to use a character reference that carries the hexadecimal (&#xNNNN;) or decimal (&#NNNN;) code point of the character.
<chineseText>&#x6C34; is a Chinese character</chineseText>
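A small Python sketch, using the standard xml.etree.ElementTree parser, shows what an application sees after the predefined entities and character references have been resolved. The variable names are assumptions for this example; the code point U+6C34 (decimal 27700) is the character used in the example above.

```python
import xml.etree.ElementTree as ET

# The five predefined entities need no declaration.
description = ET.fromstring(
    "<description>This is a &lt;tag&gt; &amp; &apos; is an apostrophe</description>"
).text

# The same character written as a hexadecimal and as a decimal reference.
water = ET.fromstring("<chineseText>&#x6C34;&#27700;</chineseText>").text

print(description)
print(water)
```

The parser hands back plain characters, so the application never sees the references themselves.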
Expanded and Unexpanded Entities
Entity expansion refers to the dereferencing of the entity value when a reference to it is encountered in the document.
When entity declarations in a document contain entity references, those entity references are not expanded until the entity reference to the containing entity is encountered.
<!ENTITY ramkumar "&r;kumar">
<!ENTITY r "ram">
Here "ramkumar" is an unexpanded entity. It contains the entity reference "r". This reference is not expanded until &ramkumar; is encountered in the XML document.
On the other hand, Character entities are expanded as soon as they are encountered.
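The deferred expansion can be observed with a parser. The sketch below declares the two entities in an internal subset and parses a document with Python's xml.etree.ElementTree; the wrapping element named "name" is an assumption added so that the declarations have a document to live in.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE name [
  <!ENTITY ramkumar "&r;kumar">
  <!ENTITY r "ram">
]>
<name>&ramkumar;</name>"""

# "r" is declared after "ramkumar", yet the document parses: the reference
# &r; is only expanded when &ramkumar; is itself expanded in the content.
expanded = ET.fromstring(doc).text
print(expanded)
```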
A Processing instruction allows XML documents to provide instructions for applications that intend to process the document. A processing instruction is not a part of the character data of an XML document.
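The most familiar construct that uses PI syntax is the XML declaration, although strictly speaking the XML 1.0 specification treats it as a separate construct and reserves the target name "xml" for it.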
<?xml version = '1.0' encoding = 'UTF-8'?>
The above declaration instructs the processor to interpret the document according to the XML 1.0 specification, and informs it about the character encoding in use within the document.
You can find other uses of processing instructions in the XSL Maps that you generate using Oracle JDeveloper. By default, such a map contains a processing instruction for the Mapper tool that identifies the source and target XSDs for the XSL Map.
<!-- SPECIFICATION OF MAP SOURCES AND TARGETS, DO NOT MODIFY. -->
<rootElement name="inputRoot" namespace="http://xmlns.oracle.com/input"/>
<rootElement name="outputRoot" namespace="http://xmlns.oracle.com/output"/>
<!-- GENERATED BY ORACLE XSL MAPPER 10.1.3.1.0(build 061009.0802) AT [TUE JUN 26 11:27:36 PDT 2007]. -->
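To see how an application gets hold of processing instructions, here is a minimal Python sketch using the standard xml.dom.minidom module. The PI target "my-app" and its data are invented for this example and are not the instructions generated by JDeveloper.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<?my-app render="compact"?>'
    '<root/>'
)

# Collect (target, data) for every PI that is a child of the document node.
pis = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(pis)
```

Only the hypothetical my-app instruction is reported; this parser does not surface the XML declaration as a processing instruction node.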
Document Type Declaration
The document type declaration declares, or points to, the markup declarations that define a grammar for the XML document. This grammar is termed the "document type definition" or DTD.
The DTD can be local or external [or both together] to the XML document.
<!DOCTYPE purchaseOrder SYSTEM "po.dtd">
. . . .
indicates that the markup declarations for the document, whose root element is purchaseOrder, are defined in the DTD at the URI "po.dtd".
<!DOCTYPE purchaseOrder [
<!ELEMENT purchaseOrder (shipTo|billTo|customer)>
. . . . .
. . . .
contains local markup declarations.
<!DOCTYPE purchaseOrder SYSTEM "po.dtd"[
<!ELEMENT purchaseOrder (shipTo|billTo|customer)>
. . . . .
. . . .
contains both internal and external markup declarations. For instance, one of the elements within the purchaseOrder could be defined in the dtd "po.dtd" [termed as the external subset], whereas others, for instance, "shipTo" and "billTo" could be declared in the internal subset, i.e. local to the document.
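The three forms can be told apart programmatically. The following sketch uses the doctype callback of Python's standard xml.parsers.expat module; the helper name read_doctype and the shortened sample documents are assumptions for this example. With its default settings this parser does not fetch po.dtd, so no such file needs to exist.

```python
import xml.parsers.expat


def read_doctype(text):
    """Return (root name, system identifier, has internal subset) or None."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["doctype"] = (name, system_id, bool(has_internal_subset))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(text, True)
    return found.get("doctype")


external_only = '<!DOCTYPE purchaseOrder SYSTEM "po.dtd"><purchaseOrder/>'
internal_only = ('<!DOCTYPE purchaseOrder ['
                 '<!ELEMENT purchaseOrder EMPTY>'
                 ']><purchaseOrder/>')
both = ('<!DOCTYPE purchaseOrder SYSTEM "po.dtd" ['
        '<!ELEMENT shipTo (#PCDATA)>'
        ']><purchaseOrder/>')

for sample in (external_only, internal_only, both):
    print(read_doctype(sample))
```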
Standalone Document declaration
The document can be declared to be a standalone XML document if it does not depend on any external subset [an external DTD] to resolve the declarations for any of the objects defined within the XML document. In other words, the processor of the XML document need not worry about processing any external subsets to process the XML document.
The standalone document declaration is made as a part of the XML declaration, with a value of either "yes" or "no".
<?xml version="1.0" standalone="yes|no"?>
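A parser reports the standalone setting to the application. The sketch below reads it through the XML declaration callback of Python's standard xml.parsers.expat module, which reports 1 for "yes", 0 for "no" and -1 when the clause is omitted. The helper name read_standalone is an assumption for this example.

```python
import xml.parsers.expat


def read_standalone(text):
    """Return 1 for standalone="yes", 0 for "no", -1 if the clause is absent."""
    found = {}

    def on_xml_decl(version, encoding, standalone):
        found["standalone"] = standalone

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(text, True)
    return found.get("standalone")


print(read_standalone('<?xml version="1.0" standalone="yes"?><order/>'))
print(read_standalone('<?xml version="1.0" standalone="no"?><order/>'))
print(read_standalone('<?xml version="1.0"?><order/>'))
```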
The XML specification lists four cases where the processor needs to look up declarations in an external subset.
a) Attributes may have default values
Consider an XML document that contains an element named shipInfo. The shipInfo element is defined to possess an attribute named orderDate.
<?xml version="1.0" standalone="no"?>
. . .
If you take a close look at the XML above, the shipInfo element does not possess such an attribute, even though the declaration says so. That's fine too, since attributes need not occur in the document, even if the elements are defined to possess one or more. But what if the attribute declaration defined a default value for orderDate? In this case, the parser needs to look up the definition of the attribute to obtain the default value while processing the document. If the attribute declaration and its default are defined in an external subset, then the standalone attribute value should indicate "no" to serve as an instruction to the processor.
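Attribute defaulting is easy to observe when the declaration sits in the internal subset, which is the only place Python's standard xml.etree.ElementTree parser looks by default. In the sketch below the default value "2007-07-06" is a made-up value for illustration; the element and attribute names follow the example in the text.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE shipInfo [
  <!ATTLIST shipInfo orderDate CDATA "2007-07-06">
]>
<shipInfo/>"""

# The start tag carries no attributes, yet the parser supplies the default.
order_date = ET.fromstring(doc).get("orderDate")
print(order_date)
```

Had the same ATTLIST lived only in an external subset that the parser does not read, the attribute would simply be missing, which is why such a document should not be declared standalone.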
b) Resolve Entity references in an external subset
The XML document could be referring to entities that had been declared in the external subset.
c) Perform Attribute value Normalization
Attribute value normalization may require the resolution of an entity declared in an external subset. Normalization is a process that the XML processor applies to an attribute value before the value is validated or passed to the application. This process involves the following steps.
i) Perform end-of-line handling, which replaces each carriage return and immediately following line feed with a single line feed character.
ii) Perform in-place replacement of all character references.
iii) Perform resolution of all entity references in the attribute value. The entity itself could be defined in the internal or external subset.
iv) Replace each space, line feed, carriage return and tab character that appears literally in the value with a single space character.
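The effect of these steps can be observed with Python's standard xml.etree.ElementTree parser. The following is a minimal sketch; the element, the attribute names and the entity "city" are invented for this example.

```python
import xml.etree.ElementTree as ET

doc = (
    '<!DOCTYPE order [<!ENTITY city "Redwood Shores">]>'
    '<order note="one\ttwo\nthree" place="&city;" ref="a&#10;b"/>'
)
order = ET.fromstring(doc)

note = order.get("note")    # literal tab and line feed each become a space
place = order.get("place")  # the entity reference is resolved
ref = order.get("ref")      # a character reference to a line feed survives

print(repr(note), repr(place), repr(ref))
```

The last attribute shows a useful consequence of the ordering: a line feed written as a character reference is kept, while a literal one is replaced.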
d) Elements with Element-only content
This case applies if the XML document contains element types with element-only content, i.e. the element contains only child elements, optionally separated by whitespace characters.
This has an interesting significance.
You may notice whitespace between the end tags and start tags of elements in an XML document. The element could have been defined to be either
<!ELEMENT order (#PCDATA|shipTo|billTo|customer)*>
or
<!ELEMENT order (shipTo|billTo|customer)*>
In the former case (mixed content), the whitespace should not be ignored; it should be passed as is to the processor. In the latter case (element-only content), it should be ignored.
To indicate that the processor needs to consider the whitespaces, the document modeler could indicate that the document is not standalone, by adding the declaration on the XML declaration PI. Further, this is only necessary if the element declaration is in an external subset.
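A non-validating parser that has no declaration to consult cannot tell the two cases apart, so it hands the whitespace to the application. The sketch below shows this with Python's standard xml.etree.ElementTree; the sample document is an assumption for this example.

```python
import xml.etree.ElementTree as ET

order = ET.fromstring("<order>\n  <shipTo/>\n  <billTo/>\n</order>")

# Whitespace before the first child is kept as the parent's text.
leading = order.text
# Whitespace after a child's end tag is kept as that child's tail.
after_ship_to = order[0].tail
after_bill_to = order[1].tail

print(repr(leading), repr(after_ship_to), repr(after_bill_to))
```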
XML Information Set
An XML Information Set, also termed the Infoset, is used to describe well-formed XML documents. It does so by defining a logical data model for an XML document. The logical data model is achieved through an abstract data set containing information items that describe different portions of the XML document.
An Infoset is usually obtained by parsing an XML document according to the rules of the specification. On the flip side, XML can be considered a serialized form of the logical structure defined by the Infoset, but XML 1.0 syntax may not be the only possible serialized representation of the Infoset. In summary, an Infoset separates the logical model from the serialization syntax. For instance, an Infoset could be serialized into a binary XML format that has a different syntax from a serialized XML 1.0 document.
Although building an Infoset requires the XML to be well formed, it may not necessarily be a valid XML document. For instance, the XML may possess broken external entity references, or possess undeclared elements or attributes. None of these prevent the XML document from possessing an Infoset.
For example, the following XML document has an Infoset, even though it is not a valid document.
<price>This is not a number</price>
But the following is not a well-formed XML document, and hence cannot possess an infoset.
Similarly, an XML document that is not namespace-well-formed cannot possess an Infoset. The notion of namespace well-formedness is described in detail within the W3C recommended specification, Namespaces in XML 1.0 [Second Edition].
<this is not a valid node name>foobar</this is not a valid node name>
is not a namespace well-formed document, since the node name does not conform to the NCName production.
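Another common way to violate namespace well-formedness is to use a prefix that has not been bound to a namespace. The sketch below, using Python's namespace-aware xml.etree.ElementTree parser, illustrates that case; the prefix "po" and the URI http://example.com/po are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the namespace-aware parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


unbound = "<po:order/>"
bound = '<po:order xmlns:po="http://example.com/po"/>'

print(parses(unbound), parses(bound))
print(ET.fromstring(bound).tag)
```

Once the prefix is bound, the parser exposes the element by its namespace name and local name rather than by its prefix.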
XML Information Sets are typically used within other specifications that need to refer to information in an XML document. For this purpose, the Infoset provides a consistent set of definitions for different information that can appear in the document.
An information item is an abstract description of a specific item in an XML document.
An Infoset may contain up to eleven kinds of information items, as explained below.
At the minimum, the Infoset contains the Document Information Item that contains exactly one Element Information item.
The eleven kinds of information items are listed below.
a) The Document Information Item
b) Element Information Items
c) Attribute Information Items
d) Processing Instruction Information Items
e) Unexpanded Entity Reference Information Items
f) Character Information Items
g) Comment Information Items
h) The Document Type Declaration Information Item
i) Unparsed Entity Information Item
j) Notation Information Items
k) Namespace Information Items
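DOM nodes are not Infoset information items, but they correspond closely enough to give a feel for the list above. The following Python sketch walks a small document with the standard xml.dom.minidom module; the sample document and the labels are assumptions for this example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?render compact?>'
    '<!-- a comment -->'
    '<order id="100-001">text</order>'
)

# Classify the children of the document node.
kinds = []
for node in doc.childNodes:
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        kinds.append("processing instruction")
    elif node.nodeType == node.COMMENT_NODE:
        kinds.append("comment")
    elif node.nodeType == node.ELEMENT_NODE:
        kinds.append("element")

root = doc.documentElement
attribute_names = [
    root.attributes.item(i).name for i in range(root.attributes.length)
]
character_data = root.firstChild.data

print(kinds, attribute_names, character_data)
```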
Each information item has a set of properties. An illustration of the properties of the Document Information Item and the Element Information Item is given below. For a complete list of properties, refer to the XML Information Set specification.
Properties of the Document Information Item
1) Children: an ordered list of child information items. The list contains exactly one element information item termed the "root". It could also contain exactly one Doctype declaration information item, and one or more comment information items. The list of children does not include the PIs and comments within inline DTDs.
2) Document element: an element information item, commonly referred to as the root.
3) Notations
4) Unparsed entities
5) Base URI
6) Character encoding scheme
7) Standalone
8) Version
9) All declarations processed
Properties of the Element Information Item
1) Namespace name
2) Local name
3) Prefix
4) Children
5) Attributes
6) Namespace attributes
7) In-scope namespaces
8) Base URI
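Several of these properties have direct counterparts in the DOM. As a hedged sketch, the snippet below reads them from an element with Python's standard xml.dom.minidom module; the prefix "po" and the namespace URI http://example.com/po are assumptions for this example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<po:order xmlns:po="http://example.com/po" id="100-001"/>'
)
root = doc.documentElement

# DOM attributes that roughly mirror the Infoset properties of the same name.
props = {
    "namespace name": root.namespaceURI,
    "local name": root.localName,
    "prefix": root.prefix,
}
print(props)
```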
All of the properties should have been clear from the earlier sections, except for "Base URI". The latter is explained below.
The notion of base URIs originated from HTML, where designers use the HTML <base> tag to indicate the base URI to be used while resolving any other relative URIs used in the web page, rather than using the current document's URI.
For instance, a web page at "http://publicdocs.com/rammenon/index.html" may have the following content.
. . . .
The specification of the base URI indicates that the hyperlink "book-summary.html" must be resolved as "http://www.ramsdocs.com/book-summary.html", rather than "http://publicdocs.com/rammenon/book-summary.html".
In the case of XML documents, the linking of documents could be performed through linking languages like XLink (the XML Linking Language). The semantics of defining base URIs for portions of XML documents are explained in the XML Base specification.
For example, consider the XML document po.xml at http://www.info.com/po.xml.
<purchaseOrder xmlns="www.allInfo.com" xml:base="http://www.po.org/oracle/" xmlns:xlink="http://www.w3.org/1999/xlink">
. . . .
In this case, info.xml is obtained and linked from "http://www.po.org/oracle/info.xml" as opposed to "http://www.info.com/info.xml".
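The resolution rule is the same for HTML and for XML Base, and can be tried with urljoin from Python's standard urllib.parse module. In this sketch the base URI "http://www.ramsdocs.com/" is an assumption inferred from the resolved link quoted above; the other URIs come from the examples in the text.

```python
from urllib.parse import urljoin

# Without a base declaration, the document's own URI is the base.
from_page = urljoin("http://publicdocs.com/rammenon/index.html", "book-summary.html")
# With an explicit base URI (assumed here), that URI is used instead.
from_base = urljoin("http://www.ramsdocs.com/", "book-summary.html")

# The trailing slash matters: without it the last path segment is replaced.
with_slash = urljoin("http://www.po.org/oracle/", "info.xml")
without_slash = urljoin("http://www.po.org/oracle", "info.xml")

for uri in (from_page, from_base, with_slash, without_slash):
    print(uri)
```

The last two lines show why a base URI that names a directory should end with a slash.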
Several Information Items in the infoset have a base URI or a declaration base URI property. All of these are processed according to the XML Base specification.
That explains the concepts for a base URI.
Coming back to the main point, the Infoset is used by other W3C Specifications to define their respective languages.
For instance, the WSDL 2.0 language is based on the XML Infoset. This is how the WSDL 2.0 specification uses the Infoset to describe a WSDL description.
The description element information item has the following Infoset properties:
a) A [local name] of description.
b) A [namespace name] of "http://www.w3.org/ns/wsdl".
c) One or more attribute information items amongst its [attributes] as follows:
i. A REQUIRED targetNamespace attribute information item as described below in the targetNamespace attribute information item section.
ii. Zero or more namespace qualified attribute information items whose [namespace name] is NOT "http://www.w3.org/ns/wsdl".
d) Zero or more element information items amongst its [children], in order as follows:
i. Zero or more documentation element information items
ii. Zero or more element information items from among the following, in any order:
1. Zero or more include element information items
2. Zero or more import element information items
3. Zero or more namespace-qualified element information items whose [namespace name] is NOT "http://www.w3.org/ns/wsdl".
e) An OPTIONAL types element information item.
f) Zero or more element information items from among the following, in any order:
i. interface element information items
ii. binding element information items
iii. service element information items
iv. Zero or more namespace-qualified element information items whose [namespace name] is NOT "http://www.w3.org/ns/wsdl".
As you can see, it describes the logical structure of a WSDL 2.0 compliant document through a set of element and attribute information items, and their properties.
All the element and attribute information items referred to in the definition of the description element information item are described in a similar fashion within the specification.
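Because the definition is phrased in terms of Infoset properties, part of it translates directly into a check on a parsed document. The Python sketch below tests only the [local name], the [namespace name] and the presence of the targetNamespace attribute; it does not validate against the WSDL 2.0 schema, and the function name and sample documents are assumptions for this example.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://www.w3.org/ns/wsdl"


def looks_like_wsdl20_description(text):
    """Check the root's namespace name, local name and required attribute."""
    root = ET.fromstring(text)
    has_expected_name = root.tag == "{%s}description" % WSDL_NS
    has_target_namespace = "targetNamespace" in root.attrib
    return has_expected_name and has_target_namespace


candidate = (
    '<description xmlns="http://www.w3.org/ns/wsdl" '
    'targetNamespace="http://example.com/orders"/>'
)
wrong_namespace = '<description xmlns="http://example.com/other" targetNamespace="x"/>'
missing_attribute = '<description xmlns="http://www.w3.org/ns/wsdl"/>'

for sample in (candidate, wrong_namespace, missing_attribute):
    print(looks_like_wsdl20_description(sample))
```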
The Document conformance section of the Specification states as follows.
"An element information item (as defined in [XML Information Set]) whose namespace name is "http://www.w3.org/ns/wsdl" and whose local part is description conforms to this specification if it is valid according to the XML Schema for that element as defined by this specification (http://www.w3.org/2007/06/wsdl/wsdl20.xsd) and additionally adheres to all the constraints contained in this specification and conforms to the specifications of any extensions contained in it. Such a conformant element information item constitutes a WSDL 2.0 document."