how binary relations beat tuples

Last week I was handed a puzzle by Francois Bry: "Why does RDF limit itself to binary relations? Why this deliberate lack of expressivity?".

Logical Equivalence Reply

My initial answer was that all tuples could be reduced to binary relations. So take a simple table like this:

User IDnameaddressbirthdaycoursehomepage
1234Henry Story21 rue Saint Honoré
Fontainebleau
France
29 Julyphilosophyhttp://bblfish.net/
1235Danny AyersLoc. Mozzanella, 7
Castiglione di Garfagnana
Lucca
Italy
14 Jansemwebhttp://dannyayers.com

The first row in the above column can be expressed as a set of binary relations as shown in this graph:

The same can clearly be done for the second row.

Since the two models express equivalent information I would opt aesthetically for the graph over the tuples, since it requires less primitives, which tends to make things simpler and clearer. Perhaps that can already be seen in the way the above table is screaming out for refactoring: a person may easily have more than one homepage. Adding a new homepage relation is easy, doing this in a table is a lot less so.

But this line of argument will not convince a battle worn database administrator. Both systems do the same thing. One is widely deployed, the other not. So that is the end of the conversation. Furthermore it seems clear that retrieving a row in a table is quick and easy. If you need chunks of information to be together that beats the join that seems to be required in the graph version above. Pragmatics beats aesthetics hands down it seems.

Global Distributed Open Data

The database engineer might have won the battle, but he will not win the war [1]. Wars are fought at a much higher level, on a global scale. The problem the Semantic Web is attacking is global data, not local data. On the Semantic Web, the web is the database and data is distributed and linked together. On the Semantic Web use case the data won't all be managed in one database by a few resource constrained superusers but distributed in different places and managed by the stake holder of that information. In our example we can imagine three stake holders of different pieces of information: Danny Ayers for his personal information, Me for mine, and the university for its course information. This information will then be available as resources on the web, returning different representations, which in one way or another may encode graphs such as the ones below. Note that duplication of information is a good thing in a distributed network.

By working with the most simple binary relations, it is easy to cut information up down to their most atomic unit, publish them anywhere on the web, distributing the responsibility to different owners. This atomic nature of relations also makes it easy to merge information again. Doing this with tuples would be unnecessarily complex. Binary relations are a consequence of taking the open world assumption seriously in a global space. By using Universal Resource Identifiers (URIs), it is possible for different documents to co-refer to the same entitities, and to link together entities in a global manner.

The Verbosity critique

Another line of attack similar to the first could be that rdf is just too verbose. Imagine the relation children which would relate a person to a list of their children. If one sticks just with binary relations this is going to be very awkward to write out. In a graph it would look like this.

image of a simple list as a graph

Which in Turtle would give something like this:

:Adam :children 
     [ a rdf:List;
       rdf:first :joe;
       rdf:rest [ a rdf:List;
            rdf:first :jane;
            rdf:rest rdf:nil ];
     ] .

which clearly is a bit unnecessarily verbose. But that is not really a problem. One can, and Turtle has, developed a notation for writing out lists. So that one can write much more simply:

:Adam :children ( :joe :jane ) .

This is clearly much easier to read and write than the previous way (not to speak about the equivalent in rdf/xml). RDF is a structure developed at the semantic level. Different notations can be developed to express the same content. The reason it works is because it uses URIs to name things.

Efficiency Considerations

So what about the implementation question: with tables oft accessed data is closely gathered together. This it seems to me is an implementation issue. One can easily imagine RDF databases that would optimize the layout in memory of their data at run time in a Just in Time manner, depending on the queries received. Just as the Java JIT mechanism ends up in a overwhelming number of cases to be faster than hand crafted C, because the JIT can take advantage of local factors such as the memory available on the machine, the type of cpu, and other issues, which a statically compiled C binary cannot do. So in the case of the list structure shown above there is no reason why the database could not just place the :joe and jane in an array of pointers.

In any case, if one wants distributed decentralised data, there is no other way to do it. Pragamatism does have the last word.

Notes

  1. Don't take the battle/war analogy too far please. Both DB technologies and Semantic Web ones can easily work together as demonstrated by tools such as D2RQ.
Comments:

Some jumbled related thoughts here, that I've been meaning to write down forever:

The open-world, 'stuff may sometimes be missing' nature of Web data may give us a bias towards binary relations as a fundamental unit of information.

In RDF, we always know that all relations have an arity of two. Even if the things being inter-related aren't necessarily known to us (ie. the bNode mechanism).

I can say livesWith(bnode1,bnode2)

If we allowed more than two, we'd need to decide whether each property had a fixed arity, or was flexible. Eg:

livesWith(bnode1,bnode2, foo,bar)

livesWith(bnode1,bnode2,foo,bar,fee)

but what if we didn't know the pice of information 'bar'?

Perhaps:

livesWith(bnode1,bnode2,foo, nil, fee)

What if more slots were allowed?

livesWith(bnode1,bnode2, 2002, foo, nil, fee) ...

...hard to keep track of which slot is which. Yet fixing 'livesWith' to always have a fixed arity (say of 7, or 8) seems unsemwebbishly rigid and out of keeping with the grassroots nature of deployments such as FOAF. In general, the RDF and FOAF design strips power from the hands of schema designers, and puts it into the hands of information publishers. What are the properties of a Person? Don't look to the FOAF schema for a final answer; look to the Web, and see what things people have claimed about things of that type.

Maybe we could have named slots, so we can keep track of what 'foo' is supposed to mean,

livedWith(bnode,bnode2,startDate=2008, occupancy=partTime,location=123,321, foo=bar, ...) ...

...but if we do that, we start reinventing RDF inside itself.

Complexity is a lump in the carpet.

RDF brutally atomises data into a series 3-part claims; yes the information could be more intuitively couched if it wasn't forced into binary form, ... but it would be less well packed for lossy, collaborative and loosly coupled Web sharing, merging and aggregation.

That said, I think the W3C RIF group are making an n-ary system with some RDF compatibility, so worth keeping an eye on that.

Posted by Dan Brickley on March 20, 2008 at 08:28 AM CET #

Henry,

It would be nice if you could tack a URI on to each matchstick-man so that HTTP facilitated remote referencing comes out stronger in your cool illustration. Of course you don't want the fully qualified URI, but a #me or #this will do :-)

Kingsley

Posted by Kingsley Idehen on March 20, 2008 at 10:10 AM CET #

Good point Kingsley. I have just updated the large figure in the "Global Open Distributed Data" slide to make it clearer how deep URIs and URLs in particular reach in the semantic web.

Posted by Henry Story on March 20, 2008 at 10:44 AM CET #

I had a related discussion many years back (I think RDF had just been made a Candidate Recommendation), not about tuples in general, but ntuples for particular values of n>3 (The distinction being that n could be a fixed number). For instance, one discussant argued that quadruples were particularly useful, since they would allow for named graphs of triples, or triples with provenance, or contexts, or something.

Another discussant (Eric Prud'hommeaux, in fact) pointed out that many so-called triple stores actually are quad stores, for exactly this reason. But he went on to point out that that implementation convenience didn't change the nature of the standard (nor the compliance of these tools to it); you can implement /more/ than the standard, if you want to provide extra capabilities; as long as you read and write the standard serializations, and (nowadays) respond correctly to standard query languages, etc..

Posted by Dean Allemang on March 20, 2008 at 04:36 PM CET #

Kind of a strawman database design you have there.

That table is obviously hideously denormalized / bad design. But it nicely highlights a major issue. The person to address relation changes with time. Binary relations force you into a pretty ugly hole here - you can time-slice yourself or you can create a bNode for each habitation process or do a number of other things, each with awful consequences.

Of course, this doesn't matter if the SW/GGG/LinkedDataSphere/whatever it is this week is only concerned with a goldfish view of what the current state of the world is. But if the SW is to move beyond gimmicky toy applications and make some headway against the "battle worn database administrator"s it will need to address this issue. Tossing out RDF and allowing n-ary predicates would be a good start.

...and turtle or no turtle, rdf lists and seqs are kind of horrible for modeling. For one thing your instances will be kind of opaque to OWL reasoning. In this case what's wrong with a simple has_child binary relation?

Posted by Chris Mungall on March 20, 2008 at 10:24 PM CET #

Hi Chris,

I know the initial table is not a very well designed data structure. Not much in my argument rests on that design I think. I just wanted to show how:

1. that table can easily be turned into a graph: so there is a simple equivalence between both. This would tend to be a point in favor of tabular databases, since they are well established.

2. That graphs are much more easy to cut up into arbitrary components, since the binary relational model is atomic. That is what the Global Distributed Data section is about. Each one of the 3 graphs shows the information in the original table cut up into different graphs distributed on the web.

Now it is true here that the above table would probably be cut up into 2 or more tables in a better designed database. Nevertheless those three tables would presumably have to be in the same database (cluster) managed by the same entity. The identifiers would be local to the database usually. Nothing stops DB architects using URIs as identifiers, but SQL the language and the backing DBs are not really designed with that in mind. SQL DBs mostly work on a closed world assumption, where global identifiers don't add much value.

In an open world it is much more difficult to work out what the arity of a tuple should be. How many properties should a person have? How big should the table be? Why make this decision ahead of time? If you are a DB architect, you make it by consulting your clients and use cases and then you refactor [1], if you can. In an open world you cannot and do not want to know all the use cases for the information you are generating. So we don't start by building UFOs [2], just by stating the information we know and we care about. By the way, this is how detectives work. They don't start with all the information in the puzzle. Slowly they gather information, incomplete information, and put the pieces together... By stating this information in a simple binary relational system we optimize our and other's ability to mesh information. That is why binary relations are optimal.

Concerning temporal reasoning I wrote a little post on this subject "Keeping track of contexts in Life and on the Web" [3], where I show a couple of ways to build a temporal database in RDF, and how to move from a view of things as they are now to how they were or will be. That discussion brings out a point mentioned by Dean Allemang on quads. To work on the Semantic web in an open world one needs quad stores such as Sesame 2.0 [4] to keep track of context. Now the interesting things about this is that it is a move that needs to be made for all information. It is not a problem we have on a case by case basis, such as when one asks what type of tuple may be needed to best represent a person. This is a logically dictated move, that was discovered and rediscovered in philosophy and logic [5] all through the 20th century. It is what we need to be able to make statements such as "Jane believes Tim kissed Emma", to distinguish what one person thinks from what another thinks. Luckily the simple triple structure of RDF makes it difficult for people to publish what they believe other people believe. Temporal reasoning, let alone counterfactual reasoning, or propositional attitude logic is not easy, be it in tuple databases or on the semantic web. Well in fact on the semantic web, the notion of graphs does give us the tools to do this cleanly. And query languages such as SPARQL do allow that capability to be put to very interesting uses [6].

So perhaps I would add that SQL is missing an essential element that SPARQL has: namely the notion of graphs of information, and being able to query over graphs. So tables in contemporary databases are both an extra unnecessary structure, they hide the deep need for the concept of graphs, making them even less useful for writing good meshup applications. [7]

Henry

[1] See Martin Fowler's book "Refactoring Databases"
[2] http://blogs.sun.com/bblfish/entry/ufo_s_seen_growing_on
[3] http://blogs.sun.com/bblfish/entry/it_s_all_about_context
[4] http://openrdf.org/
[5] See my explanation of possible worlds theory in "The Fifth Dimension"
http://blogs.sun.com/bblfish/entry/the_fifth_dimension
[6] http://blogs.sun.com/bblfish/entry/opening_sesame_with_networked_graphs
[7] http://blogs.sun.com/bblfish/entry/beatnik_change_your_mind

Posted by Henry Story on March 21, 2008 at 03:46 AM CET #

Chris Mungall wrote:
> ..and turtle or no turtle, rdf lists and seqs are kind of horrible for modeling. For
> one thing your instances will be kind of opaque to OWL reasoning. In this case
> what's wrong with a simple has_child binary relation?

You are completely right Chris. I would use the binary has_child relation, which is much better for merging data. The article I wrote above was quite long and I could not think of a good relation to a list at the time of writing. I was just trying to make a point as to the difference between syntax and semantics here.

Thinking about lists it is interesting to note that at that level RDF is based on an even more basic data structure than Lisp. Lists are themselves just composed of relations. So the Lisp enthusiasts, who love it for its simplicity, should see here how this takes those intuitions even further.

Posted by Henry Story on March 21, 2008 at 06:12 AM CET #

Hi Henry

Thanks for your response. Some comments

"SQL DBs mostly work on a closed world assumption, where global identifiers don't add much value"

SQL DBs do indeed have a CW semantics. I'm not sure it follows that global IDs don't add value. It doesn't hold for biological databases for example.

"In an open world you cannot and do not want to know all the use cases for the information you are generating"

I think you may be using closed-world and open-world in a different sense from the logical one. Open world doesn't mean you don't know the use cases. You can know the use case precisely and still require an OWA for reasoning. Conversely you can have very open ended use cases and still use the CWA.

I'll buy that there is some loose correlation more as a result of the different cultural underpinnings.

"In an open world it is much more difficult to work out what the arity of a tuple should be"

Again, this is a function of open-ended use cases, not CWA vs OWA. I would contend that it is not difficult to work out the arity of a relation in an open-ended use case situation, and this is purely a function of good modeling. Normalized design will lead to low arities (you have a good description of normalized design above). However, plenty of relations, such as for example "least common ancestor" are naturally arity>2. Of course you can use the n-ary relations pattern here, but I can't believe anyone would believe this is a truly satisfactory solution.

"Luckily the simple triple structure of RDF makes it difficult for people to publish what they believe other people believe"

Is this a typo? Seems like a massive flaw in RDF, rendering it useless for science[\*]. Your contexts example uses n3, in which it \*is\* simple to make statements about belief. You also claim quad stores such as Sesame are required. This means that you need some extension of RDF such as TriX to fully represent what is in a quadstore (such that it can be communicated in a single document). Seems like a recipe for mass confusion, not global interoperation.

[\*] (in fact it _is_ easy to make statements about statements in RDF, if we allow reification, but everyone is desperately trying to deprecate reification despite the fact that NGs do not meet really basic requirements such as the ability to make statements about statements in a single document)

Posted by guest on March 21, 2008 at 10:45 AM CET #

Thanks Chris(?) for rectifying some of my muddled thinking/explanation here.

1. Contexts
----------

RDF and N3 can deal with graphs and so non extensional contexts. It is just more difficult to do this in RDF/XML, and I was thinking this might have been a good thing to get people going. Writing reasoners that deal with simple factual statements is one level easier than dealing with the indirection introduced by belief contexts. We will be able to get a lot of very interesting applications going just with pure factual statements, such as

sun:BlackBox iso:weight "10000"\^\^iso:kg .

Just making statements on price, weight, creation, etc. etc. in an extensible machine readable manner brings us very far.

When one starts learning logic, one first starts with extensional logic where all sentences retain their truth value when co-referring terms are exchanged. So if "Superman flies" is true, then since "Superman = Clark Kent", it follows that "Clark Kent flies". Once the logic of these sentences is understood, one can then consider non extensional contexts such as "Laura Lane believes Superman flies". It does not follow here that "Laura Lane believes Clark Kent flies". To be able to work on those one needs context aware logics such as that introduced by Guha in the thesis I refer to in footnote [3] in my previous reply, and which SPARQL queries make possible.

2. Open World/Closed world
------------------------

Yes, arguing from the Open World to questions of arity was perhaps a bit muddled. You are right, the open world assumption plays a part perhaps more as it forces one into the state of mind of thinking of merging data right from the beginning.

3. Arity of Tuples
---------------

When I was speaking of the difficulty of working out the arity of tuples, I was thinking for example of what the relations on a Person should be. That cannot it seems to me, be known in advance. My guess is that databases are full of such objects.

Your example "least common ancestor" can be mathematically defined ( is it an analytic relation? ) and so it can be decided in advance what its arity should be. I suppose the well known addition operator is in that class of relations too. Perhaps in the case of + one could have a relation from a list to a sum.

( 5 8 13 ) plus 26 .

It would be interesting to have a good list of such examples of relations where arities can be known in advance and to see if there is a good way of modelling them in RDF. It would then be possible to asses what the pros and cons of this are...

Thanks for the feedback.

Posted by Henry Story on March 21, 2008 at 12:33 PM CET #

Mr. Story,

To address your main point directly, i never understood why a "central authority" could not be set up to sell "connections" that is a repository of 1-1 channels which "you can leave at any time."

hth,
colm.

Posted by Colm Kennedy on April 02, 2008 at 09:26 AM CEST #

Maybe could you think of the following experience.

Assume we would only have binary verbs in English (or French, or German for that matter). Instead of saying for example: "I like to exchange ideas with Henry Story", one would have to say: "I like to exchange http://example.com/#ideas"; and "http://example.com/#ideas with HenryStory". Instead of URI, one could of course any possible form of anaphora that man kind developped in the past.

As you put it, this would be aesthetical. I admit it.

The reason for not restricting Knowledge Representation to binary verbs/relations, however, is that evolution has resulted in a mankind with natural languages departing from this aesthetical simplicty.

Natural language, and how people think, is not aesthetically simple. It might be complicated, tricky, and in some cases, enlightnening because of this complexity. Higher mathematics expressed in the simple formalisms of first-order binary logic would not be understandable - like mathematical proofs delivered by old-age theorem provers are not.

Mankind needs higher-level was of expression. n-ary relationships with n > 2 belong to that. They are needed for an intuitive and user-friendly knowledge representation.

Do you see the point?

Posted by François Bry on April 04, 2008 at 01:43 AM CEST #

Post a Comment:
Comments are closed for this entry.
About

bblfish

Search

Archives
« April 2014
MonTueWedThuFriSatSun
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
    
       
Today