Wednesday Aug 09, 2006

All XML roads lead to RDF

Where I prove that thinking of the web as a database does not mean thinking of it in terms of an xml database. Rather we will have to think in terms of graphs.

XML applied to Documents

XML stands for eXtensible Markup Language. The most successful markup language is and remains html and its cleaned up xhtml successor. So let me start with this.

Xhtml is designed to mark up strings such as the following "Still pond A frog jumps in Plop!" with presentation information. In a wysiwyg editor I would do this by highlighting parts of the text using my mouse, then setting a property for that part of the text by clicking some button, such as bold, available to me from some menu. XML allows me to save this information in the document by giving me a way to create arbitrary parenthesizing properties. So for example using the xhtml markup language, the following ≺blockquote≻≺font face="papyrus,helvetica"≻Still pond≺br/≻A frog jumps in≺br/≻Plop!≺br/≻≺/font≻≺/blockquote≻ should display in a compliant browser as

Still pond
A frog jumps in
Plop!

XML applied to Data

Take some information written out one way or another. This will usually be in some tabular format as in a receipt, a spreadsheet or a database table. Take this receipt I have in front of me for example. The text on it is "1 Fresh Orange Juice 1,95 1 Café au Lait 1,20 1 Sparkling Water 1,40". This is not very easy to parse for a human, let alone a machine. But extrapolating from the experience with html, I can use xml to mark the text up like this:

  ≺item≻≺description≻1 Fresh Orange Juice≺/description≻≺price currency="euro"≻1,95≺/price≻≺/item≻
  ≺item≻≺description≻1 Café au Lait≺/description≻≺price currency="euro"≻1,20≺/price≻≺/item≻
  ≺item≻≺description≻1 Sparkling Water≺/description≻≺price currency="euro"≻1,40≺/price≻≺/item≻

This is now much more accessible to computer automation, as the machine can deduce from the tags what interpretation to give the enclosed string. And so was born the great enthusiasm for xml formats and web services.

World Wide Data

It is clear that by following the above procedure we can create machine readable documents for every type of data, stored in every kind of database available worldwide. We just need to mark it up. Wait. We need to do more. We need to agree on a vocabulary and a tree-like way of displaying it, since xml forms a markup tree. Let us assume for this article that the naming problem has somehow been solved, and let's look more closely at the data format problem.

Say I want to describe a house, then I will want to have an xml format something like this


but if I want to describe a person I would of course describe them like this


In the first case the person object is part of the house document, whereas in the second case the house information is part of the person document. Both are equally valid ways of doing this. This is not an isolated case. It will happen whenever we wish to describe some object. No object has priority over any other. Here is another example. We may want to describe a book like this:

   ≺author≻Ken Wilber≺/author≻
   ≺title≻Sex, Ecology, Spirituality≺/title≻

but of course if I had a CV/resumé of Ken Wilber then my xml would be like this

  ≺name≻Ken Wilber≺/name≻

Again, there is no natural way of putting things. In one case the ≺book≻ element is the root of the tree, in another it is an element of the tree. It follows that every type of object will require its own type of document to describe it. This would not be a problem if the world were composed just of turtles. But it isn't: there are an infinite number of types of things in our very rich world. Furthermore, what is of interest in each type of document depends completely on the context. In one type of document we may be more interested in the friends a person has, in another in his medical history, in yet another in his academic achievements, etc. So there is not even one objective way to describe anything! If we were to create a tree-structured document to describe every type of thing we are interested in, we would therefore also need to create an uncountable number of document formats, one for every different way we wanted to describe each class of objects.


This is summarized simply by saying "The World is a Graph". The world can just be described holistically as consisting of objects and relations between those objects. Take any object in the world, you will be able to reach any other object by following relations stemming from it. Make that type of object the root of your graph, and you have an xml format.
So the problem is not so much that it is not possible to describe each subgraph we find using XML. One can! The problem emerges rather when considering the tools required to query and understand these documents. It is clear from the arguments above that when thinking web wide, one has to give up the idea that information will reach one in a limited number of hierarchically structured ways. As a result tools such as XQuery, which are designed to query documents at the xml structure level, are not suited to querying information across documents, since the tree structure of the xml documents will get in the way of the description of the graph that the world is and that the documents are attempting to describe. XQuery people know this, which is why they don't like RDF. But it is not RDF that is the problem. It is reality that is the problem. And that is a lot more difficult to change.
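
To see how one graph gives rise to several tree formats, here is a minimal sketch in Python, using only the standard library. Everything in it is made up for illustration: the node identifiers house1 and person1, the relation names owner, owns, address and name, the literal values, and the helper to_tree are assumptions of mine and not part of any real vocabulary. The sketch unfolds the same three facts into a tree twice, once rooted at the house and once rooted at the person.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-graph: (subject, relation, object) triples.
TRIPLES = [
    ("house1", "owner", "person1"),
    ("house1", "address", "1 Pond Lane"),
    ("person1", "name", "Jane Doe"),
]

# Hypothetical element name for each node in the graph.
TYPES = {"house1": "house", "person1": "person"}

# Hypothetical inverse names, used when an edge is walked backwards.
INVERSE = {"owner": "owns"}


def to_tree(root, seen=None):
    """Unfold the graph into an XML tree rooted at the node `root`."""
    seen = (seen or set()) | {root}
    elem = ET.Element(TYPES[root])
    for s, rel, o in TRIPLES:
        if s == root and o not in TYPES:
            # Literal value: becomes a text element.
            ET.SubElement(elem, rel).text = o
        elif s == root and o not in seen:
            # Forward edge to a node not yet on this path.
            ET.SubElement(elem, rel).append(to_tree(o, seen))
        elif o == root and s not in seen:
            # Backward edge: the same fact, read from the other end.
            ET.SubElement(elem, INVERSE[rel]).append(to_tree(s, seen))
    return elem


house_doc = ET.tostring(to_tree("house1"), encoding="unicode")
person_doc = ET.tostring(to_tree("person1"), encoding="unicode")
print(house_doc)
print(person_doc)
```

The two printed documents carry exactly the same facts, yet they have different root elements and different nesting, so a query written against the structure of one will not work against the other.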

To repeat, if RDF had never been invented, your database of documents would end up containing an infinitely large number of different types of xml documents to describe the infinite types of objects out there, each of course requiring its own specific interpretation (since XML does not come with a semantics). And so you may as well start off using RDF, since that is where you will end up anyway.

The world is an interconnected graph of things. RDF allows one to describe the world as a graph. SPARQL is the query language to query such a graph. Use the tools that fit the world!
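
To give a feel for what querying a graph means, here is a rough sketch in plain Python of triple pattern matching, which is the basic idea behind a SPARQL graph pattern. It is not SPARQL and it uses no RDF library. The node names book1 and person1, the relation names and the function match are my own assumptions for illustration; only the author name and the title come from the book example above.

```python
# Hypothetical graph of (subject, relation, object) triples.
TRIPLES = [
    ("book1", "author", "person1"),
    ("book1", "title", "Sex, Ecology, Spirituality"),
    ("person1", "name", "Ken Wilber"),
]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all the patterns.

    A pattern is a triple whose terms are either constants or
    variables, a variable being any string that starts with "?".
    """
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # All three terms matched: carry on with the remaining patterns.
            yield from match(rest, triples, new)


# "Which titles were written by someone named Ken Wilber?"
query = [
    ("?p", "name", "Ken Wilber"),
    ("?b", "author", "?p"),
    ("?b", "title", "?t"),
]
titles = [b["?t"] for b in match(query, TRIPLES)]
print(titles)
```

Note that the query never mentions a root element or a nesting depth. It only follows relations, so it gives the same answer whichever document the triples originally came from.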


  • This is not to say that rdf/xml is perfect. I myself believe it is a really good first attempt at trying to do something very ambitious. Sadly it was done a little too early. Something better will certainly come along. In the meantime it is good enough for nearly everything anyone may want to do with it when wishing to send data out on the web.
  • Having many XML documents is not a problem for the Semantic Web, since it is easy to convert each of the formats to rdf using GRDDL.


