Sunday Nov 30, 2008

variation on @timoreilly: hyperdata is the new intel outside

Context: Tim O'Reilly said "Data is the new Intel Inside".

Recently in a post "Why I love Twitter":

What's different, of course, is that Twitter isn't just a protocol. It's also a database. And that's the old secret of Web 2.0, Data is the Intel Inside. That means that they can let go of controlling the interface. The more other people build on Twitter, the better their position becomes.

The meme was launched in the well-known "What is Web 2.0" paper, in the section entitled "Data is the next Intel Inside":

Applications are increasingly data-driven. Therefore: For competitive advantage, seek to own a unique, hard-to-recreate source of data.

Most of the data is outside your database. It can only be that way: the world is huge, and you are just one small link in the human chain. Linking that data is knowledge and value creation. Hyperdata is the foundation of Web 3.0.

Thursday Sep 04, 2008

Building Secure, Open and Distributed Social Network Applications

Current social networks don't allow you to have friends outside their network. When on Facebook, you can't point to your friend on LinkedIn. They are data silos. This audio-enhanced slide show explains how a distributed, decentralized social network is being built, how it works, and how to make it secure using the foaf+ssl protocol (a list of pointers on the esw wiki).

It is licenced under a CC Attribution ShareAlike Licence.
My voice is a bit odd on the first slide, but it gets better I think as I go along.

Building Secure Open & Distributed Social Networks (Viewing this slide show requires a Flash plugin. Sorry, I only remembered this limitation after having put it online. If you know of a good Java substitute, let me know. The other solution would have been to use Slidy. PDF and annotated OpenDocument Format versions of this presentation are available below. Why is this text visible in Firefox even when the plugin works?)

This is the presentation I gave at JavaOne 2008 and at numerous other venues in the past four months.

The slidecast works a lot better as a presentation format than my previous semantic web video, RDF: Connecting Software and People, which I published as an H.264 video a couple of years ago, and which takes close to 64MB of disk space. The problem with that format is that it is not easy to skip through the slides to the ones that interest you, or to go back and listen to a passage carefully again. Or at least it feels very clunky. My mp3 sound file only takes 17MB of space in comparison, and the graphics are of much better quality in this slide show.

It is hosted by the excellent SlideShare service, which translated my OpenOffice odp document once it was cleaned up a little: I had to make sure it had no pointers to local files remaining accessible from the Edit>Links menu, which otherwise choked their service. I used the Audacity sound editor to create the mp3 file, which I then placed on my bblfish.net server. Syncing the sound and the slides was then very easy using SlideShare's SlideCast application. I found that the quality of the slides was a lot better once I had created an account on their servers. The only thing missing is a button, in addition to the forward and backward buttons, that would show the text of the audio for people with hearing problems: something equivalent to the Notes view in OpenOffice.

You can download the OpenOffice Presentation which contains my notes for each slide and the PDF created from it too. These are all published under a Creative Commons Attribution, Share Alike license. If you would like some of the base material for the slides, please contact me. If you would like to present them in my absence feel free to.

Thursday Mar 20, 2008

how binary relations beat tuples

Last week I was handed a puzzle by Francois Bry: "Why does RDF limit itself to binary relations? Why this deliberate lack of expressivity?".

Logical Equivalence Reply

My initial answer was that all tuples could be reduced to binary relations. So take a simple table like this:

User ID | name        | address                                                     | birthday | course     | homepage
1234    | Henry Story | 21 rue Saint Honoré, Fontainebleau, France                  | 29 July  | philosophy | http://bblfish.net/
1235    | Danny Ayers | Loc. Mozzanella, 7, Castiglione di Garfagnana, Lucca, Italy | 14 Jan   | semweb     | http://dannyayers.com

The first row in the above table can be expressed as a set of binary relations as shown in this graph:

The same can clearly be done for the second row.

Since the two models express equivalent information, I would opt aesthetically for the graph over the tuples, since it requires fewer primitives, which tends to make things simpler and clearer. Perhaps that can already be seen in the way the above table is screaming out for refactoring: a person may easily have more than one homepage. Adding a new homepage relation to the graph is easy; doing this in a table is a lot less so.
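To make the equivalence concrete, here is a minimal sketch in Python of how one row of the table breaks down into binary relations. The `EX` namespace, the subject URI and the second homepage are hypothetical names invented for the illustration; only the column names and values come from the table.

```python
# Hypothetical vocabulary and subject URI, invented for this illustration.
EX = "http://example.org/schema#"
HENRY = "http://example.org/people/1234"


def row_to_triples(subject, row):
    """Turn one table row (column -> value) into a set of binary relations."""
    return {(subject, EX + column, value) for column, value in row.items()}


row = {
    "name": "Henry Story",
    "address": "21 rue Saint Honoré, Fontainebleau, France",
    "birthday": "29 July",
    "course": "philosophy",
    "homepage": "http://bblfish.net/",
}

graph = row_to_triples(HENRY, row)

# A second homepage is just one more relation; no schema change is needed.
# The URL below is a made-up placeholder.
graph.add((HENRY, EX + "homepage", "http://example.org/henry/"))

homepages = sorted(o for s, p, o in graph if p == EX + "homepage")
```

In the table, the same change would mean either a second homepage column or a separate homepage table with a join.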

But this line of argument will not convince a battle-worn database administrator. Both systems do the same thing. One is widely deployed, the other not. So that is the end of the conversation. Furthermore, it seems clear that retrieving a row in a table is quick and easy. If you need chunks of information to be together, that beats the join that seems to be required in the graph version above. Pragmatics beats aesthetics hands down, it seems.

Global Distributed Open Data

The database engineer might have won the battle, but he will not win the war [1]. Wars are fought at a much higher level, on a global scale. The problem the Semantic Web is attacking is global data, not local data. On the Semantic Web, the web is the database, and data is distributed and linked together. In the Semantic Web use case, the data won't all be managed in one database by a few resource-constrained superusers, but distributed in different places and managed by the stakeholders of that information. In our example we can imagine three stakeholders of different pieces of information: Danny Ayers for his personal information, me for mine, and the university for its course information. This information will then be available as resources on the web, returning different representations, which in one way or another may encode graphs such as the ones below. Note that duplication of information is a good thing in a distributed network.

By working with the simplest binary relations, it is easy to cut information down to its most atomic units and publish them anywhere on the web, distributing the responsibility to different owners. This atomic nature of relations also makes it easy to merge information again. Doing this with tuples would be unnecessarily complex. Binary relations are a consequence of taking the open world assumption seriously in a global space. By using Uniform Resource Identifiers (URIs), it is possible for different documents to co-refer to the same entities, and to link together entities in a global manner.
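A minimal sketch of the merge, again in Python and with made-up URIs: each stakeholder publishes its own small set of relations, and merging is nothing more than set union. All the identifiers below are hypothetical; only the three stakeholders come from the example above.

```python
# Hypothetical URIs, invented for this illustration.
EX = "http://example.org/schema#"
HENRY = "http://example.org/people/henry"
DANNY = "http://example.org/people/danny"
PHILOSOPHY = "http://example.org/courses/philosophy"
SEMWEB = "http://example.org/courses/semweb"

# Each stakeholder publishes only what it is responsible for.
henry_doc = {
    (HENRY, EX + "name", "Henry Story"),
    (HENRY, EX + "course", PHILOSOPHY),
}
danny_doc = {
    (DANNY, EX + "name", "Danny Ayers"),
    (DANNY, EX + "course", SEMWEB),
}
university_doc = {
    (PHILOSOPHY, EX + "title", "philosophy"),
    (SEMWEB, EX + "title", "semweb"),
    (HENRY, EX + "course", PHILOSOPHY),  # duplicates a statement in henry_doc
}

# Merging is set union: shared URIs link the documents together,
# and the duplicated statement simply collapses into one.
merged = henry_doc | danny_doc | university_doc


def about(graph, subject):
    """Return every relation in the graph whose subject is the given URI."""
    return {triple for triple in graph if triple[0] == subject}
```

Because every statement is an independent (subject, relation, object) triple, no schema negotiation between the three publishers is needed before merging.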

The Verbosity critique

Another line of attack, similar to the first, could be that RDF is just too verbose. Imagine the relation children, which would relate a person to a list of their children. If one sticks just with binary relations, this is going to be very awkward to write out. In a graph it would look like this.

image of a simple list as a graph

Which in Turtle would give something like this:

:Adam :children 
     [ a rdf:List;
       rdf:first :joe;
       rdf:rest [ a rdf:List;
            rdf:first :jane;
            rdf:rest rdf:nil ];
     ] .

which clearly is unnecessarily verbose. But that is not really a problem. One can develop, and Turtle has developed, a notation for writing out lists, so that one can write much more simply:

:Adam :children ( :joe :jane ) .

This is clearly much easier to read and write than the previous way (not to speak of the equivalent in RDF/XML). RDF is a structure developed at the semantic level. Different notations can be developed to express the same content. It works because it uses URIs to name things.
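The short list notation is only syntax: a parser can expand it mechanically into the same rdf:first / rdf:rest relations. Here is a minimal Python sketch of that expansion. The blank node labels (`_:b0`, `_:b1`) are an arbitrary naming choice of this sketch, and the optional `a rdf:List` type statements from the long form are left out for brevity.

```python
from itertools import count

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def list_to_triples(items, fresh=None):
    """Expand a sequence into rdf:first / rdf:rest triples.

    Returns (head, triples), where head is the node standing for the whole list.
    """
    if fresh is None:
        fresh = count()  # source of blank node numbers
    triples = []
    head = RDF + "nil"  # the empty list terminates the chain
    # Build from the tail so that each cell can point at the rest of the list.
    for item in reversed(items):
        node = f"_:b{next(fresh)}"
        triples.append((node, RDF + "first", item))
        triples.append((node, RDF + "rest", head))
        head = node
    return head, triples


head, cells = list_to_triples([":joe", ":jane"])
statement = (":Adam", ":children", head)
```

Two list items produce four binary relations plus the `:children` statement itself, which is the content of the longer Turtle form above.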

Efficiency Considerations

So what about the implementation question: with tables, oft-accessed data is closely gathered together. This, it seems to me, is an implementation issue. One can easily imagine RDF databases that would optimize the layout of their data in memory at run time in a Just in Time manner, depending on the queries received. In the same way, the Java JIT mechanism can in many cases end up faster than hand-crafted C, because the JIT can take advantage of local factors such as the memory available on the machine, the type of CPU, and other issues, which a statically compiled C binary cannot do. So in the case of the list structure shown above, there is no reason why the database could not just place :joe and :jane in an array of pointers.
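As a rough illustration of that last point, the following Python sketch walks an rdf:first / rdf:rest chain once and stores the members in a plain array. It is not how any particular RDF store works; the triples and blank node labels are made up for the example.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def collect_list(triples, head):
    """Follow rdf:first / rdf:rest from head and return the members as a list."""
    first = {s: o for s, p, o in triples if p == RDF + "first"}
    rest = {s: o for s, p, o in triples if p == RDF + "rest"}
    items = []
    node = head
    while node != RDF + "nil":
        items.append(first[node])
        node = rest[node]
    return items


# The list ( :joe :jane ) written out as binary relations.
triples = [
    ("_:b1", RDF + "first", ":joe"),
    ("_:b1", RDF + "rest", "_:b0"),
    ("_:b0", RDF + "first", ":jane"),
    ("_:b0", RDF + "rest", RDF + "nil"),
]

# A store could cache this array and answer list queries without any joins.
children = collect_list(triples, "_:b1")
```

The graph stays the logical model, while the array is only a physical layout that the store is free to choose.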

In any case, if one wants distributed decentralised data, there is no other way to do it. Pragmatism does have the last word.

Notes

  1. Don't take the battle/war analogy too far please. Both DB technologies and Semantic Web ones can easily work together as demonstrated by tools such as D2RQ.

Wednesday Mar 19, 2008

Semantic Web for the Working Ontologist

I am really excited to see that Dean Allemang and Jim Hendler's book "Semantic Web for the Working Ontologist" is now available for pre-order on Amazon's web site. When I met Dean at Jazoon 2007 he let me have a peek at an early copy of this book [1]: it was exactly what I had been waiting a long time for. It is a very easy introduction to the Semantic Web and reasoning that does not start with the unnecessarily complex RDF/XML [2] but with the one-cannot-be-simpler triple structure of RDF, and through a series of practical examples brings the reader step by step to a full view of all of the tools in the Semantic Web stack, without a hitch, without a problem, fluidly. I was really impressed.

Getting going in the Semantic Web is going to be a lot easier when this book is out. It should remove the serious problem current students are facing of having to find a way through a huge number of excellent but detailed specs, some of which are no longer relevant. One does not learn Java by reading the Java Virtual Machine specification or even the Java Language Specification. Those are excellent tools to use once one has read one of the many excellent introductory books, such as the unavoidable Java Tutorial or Bruce Eckel's Thinking in Java.

Dean Allemang and Jim Hendler's book is going to play the same role for the Semantic Web: helping to get millions of people introduced to what has to be the most revolutionary development in computer science since the development of the web itself. Go and pre-order it. I am going to do this right now.

Notes

  1. The draft I looked at 9 months ago had introductions to N-Triples, Turtle, OWL explained via rules, SPARQL, some simple well-known ontologies such as SKOS and FOAF, and a lot more.
  2. The W3C has recently published a new RDF Primer in Turtle in recognition of the difficulty of getting going when the first step requires understanding RDF/XML.
  3. Dean gave a talk at JavaOne that is now available online, which goes over the first chapters of the book. While you are waiting for the book, you can learn a lot by following this.

Friday Aug 17, 2007

Open Data: Information wants to be linked

With over 2 billion relations from the great web community data projects such as Wikipedia, Project Gutenberg, MusicBrainz, and many more... the Linking Open Data initiative is tying together a vast pool of quality machine-readable information on which one can run any of the over 500 Semantic Web tools. As the value of linked information increases much faster than that of the networks described by Metcalfe's Law, the value of this must be tremendous.

With data browsing interfaces such as Tabulator, one has a very simple RESTful, Resource Oriented Architecture API to work with. With various SPARQL endpoints available or to be built, one can treat that information like a hugely powerful database.

Forget Web APIs: long live linked data!

Some of the projects listed are:
