Saturday May 26, 2007

Answers to "Duck Typing Done Right"

I woke up this morning to a large number of comments on my previous post, "Duck Typing Done Right". It would be confusing to answer them all in the comments section there, so I have gathered my responses here.

I realize the material covered here is very new to many people. Luckily it is very easy to understand. For a quick introduction see my short Video introduction to the Semantic Web.

I should also mention that RDF is a declarative framework, so its relationship to method-based Duck Typing is not a direct one. Nevertheless, there is a lot to learn from the simplicity of the RDF framework.

On the reference of Strings

Kevin asks why the URI "http://a.com/Duck" is less ambiguous than the string "Duck". In one respect Kevin is completely correct: in RDF they are both equally precise. But what they refer to is quite different from what one might expect. The string "Duck" refers to the string "Duck". A URI, on the other hand, refers to the resource it identifies; URI stands for Uniform (originally Universal) Resource Identifier, after all. The URI "http://a.com/Duck", as defined in my previous post, refers to the set of ducks. How do you know? Well, you should be able to GET <http://a.com/Duck> and receive a human-readable or machine-readable representation of it, selectable via content negotiation. This won't work for the simple examples in my previous post, as they were just quick examples I hacked together by way of illustration. But try GETting <http://xmlns.com/foaf/0.1/knows> for a real working example. See my longer post on this issue, "GET my meaning?".
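As a rough sketch of what such a GET with content negotiation might look like from Python's standard library: the helper name build_rdf_request and the choice of application/rdf+xml as the requested media type are my own assumptions for illustration, and the request is only prepared here, not sent.

```python
import urllib.request

FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

def build_rdf_request(uri, media_type="application/rdf+xml"):
    """Prepare (but do not send) a GET request for a machine-readable representation.

    The helper name and the default media type are illustrative assumptions.
    """
    request = urllib.request.Request(uri, method="GET")
    # Content negotiation: the Accept header says which representation we prefer.
    request.add_header("Accept", media_type)
    return request

request = build_rdf_request(FOAF_KNOWS)
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Handing the prepared request to urllib.request.urlopen would perform the actual fetch. What comes back depends on the server, so I make no claim about the response here.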

Think about the web. Every day you type URLs into a web browser and you get the page you want. When you type "http://google.com/" you don't sometimes get <http://altavista.com>. The web works as well as it does because URLs identify things uniquely. Everyone can mint their own if they own some section of the namespace, and PUT the meaning for that resource at that resource's location.

On Ambiguity and Vagueness

Phil Daws is correct to point out that URIs don't remove all fuzziness or vagueness. We can have fuzzy or vague concepts, and that is a good thing. foaf:knows (<http://xmlns.com/foaf/0.1/knows>), whilst unambiguous, is quite a fuzzily defined relation. If you click on its URL this is what you will get:

We take a broad view of 'knows', but do require some form of reciprocated interaction (ie. stalkers need not apply). Since social attitudes and conventions on this topic vary greatly between communities, counties and cultures, it is not appropriate for FOAF to be overly-specific here.

If someone foaf:knows a person, it would be usual for the relation to be reciprocated. However this doesn't mean that there is any obligation for either party to publish FOAF describing this relationship. A foaf:knows relationship does not imply friendship, endorsement, or that a face-to-face meeting has taken place: phone, fax, email, and smoke signals are all perfectly acceptable ways of communicating with people you know.

You probably know hundreds of people, yet might only list a few in your public FOAF file. That's OK. Or you might list them all. It is perfectly fine to have a FOAF file and not list anyone else in it at all. This illustrates the Semantic Web principle of partial description: RDF documents rarely describe the entire picture. There is always more to be said, more information living elsewhere in the Web (or in our heads...).

Since foaf:knows is vague by design, it may be suprising that it has uses. Typically these involve combining other RDF properties. For example, an application might look at properties of each foaf:weblog that was foaf:made by someone you "foaf:knows". Or check the newsfeed of the online photo archive for each of these people, to show you recent photos taken by people you know.

For more information on this see my post "Fuzzy thinking in Berkeley"

On UFOs

Paddy worries that this requires a Universal Class Hierarchy. No worries there. The Semantic Web is designed to work in a distributed way. People can grow their vocabularies, just as we have all grown the web by each publishing our own files on it. The Semantic Web is about linked data. It does not require UFOs (Unified Foundational Ontologies) to get going, and it may never need them at all, though I suspect that having one could be very helpful. See my longer post "UFO's seen growing on the Web".

Relations are first class objects

Paddy and Jon Olson were misled by my use of classes into thinking that RDF ties relations/properties to classes. It doesn't. Relations in RDF are first class citizens, as you may see in the Dublin Core Metadata Initiative, which defines a set of very simple and very general relations to describe resources on the web, such as dc:creator, dcterms:created, etc. I think we need a :sparql relation that would relate anything to an authoritative SPARQL endpoint, for example. There is clearly no need to constrain the domain of such a relation in any way.

Scalability and efficiency

Jon Olson agrees with me that duck typing is good enough for some very large and very good software projects. One of my favorite Semantic Web tools, for example, is cwm, which is written in Python. When I say that Duck Typing as implemented in those languages does not scale, I mean really big scale, like, you know, the web. URIs are what have allowed the web to scale to the whole planet, and what will allow it to scale to structured data way beyond what we may even be comfortable imagining right now. This is not over-engineered at all, as Eric Biesterfeld fears. In fact it works because it gets the key elements right. And they are very simple, as I demonstrated in my recent JavaOne BOF. The key concepts are:
  • URIs refer to resources,
  • resources return representations,
  • to describe something on the web one needs to
    • refer to the thing one wishes to describe, and that requires a URI,
    • specify the property relation one wishes to attribute to it (and that also requires a URI),
    • and finally specify the value of that property.
That's it.
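
To make those three parts concrete, here is a minimal Python sketch of a single statement. The Triple type and the URI http://a.com/d1 for one particular duck are hypothetical names I introduce only for illustration; the swimming property is the one from my previous post.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One statement: the thing described, the property, and its value."""
    subject: str    # URI of the thing being described
    predicate: str  # URI of the property relation
    value: str      # the value of the property: a literal or another URI

statement = Triple(
    subject="http://a.com/d1",            # hypothetical URI for one particular duck
    predicate="http://a.com/swimming",    # the relation from the previous post
    value="2007-05-25T16:43:02",          # the time at which it was swimming
)

print(statement)
```

Nothing more than these three slots is needed to say something about anything that has a URI.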

Semantics

An anonymous writer mentions the "ugliness" of the syntax. This is not a problem. The Semantic Web is about semantics (see the illustration in this post). It defines the relationship of a string to what it names. It does not require a specific syntax. If you don't like the RDF/XML syntax, which most people think is overly complicated, then use the N3 syntax, or come up with something better.
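
A small Python sketch may help show that the statement stays the same while the syntax changes. It writes one statement first as an N-Triples line and then as RDF/XML. The function names and the subject URI http://a.com/d1 are assumptions of mine, and the string templates only cover this one simple case; a real serializer has to deal with escaping and much more.

```python
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# One statement, held independently of any syntax.
subject = "http://a.com/d1"                       # hypothetical URI for one duck
prop_ns, prop_name = "http://a.com/", "quacking"  # the property, split for XML use
literal = "2007-05-25T16:43:02"

def to_ntriples():
    """Write the statement as a single N-Triples line."""
    return f'<{subject}> <{prop_ns}{prop_name}> "{literal}"^^<{XSD_DATETIME}> .'

def to_rdfxml():
    """Write the same statement as a small RDF/XML document."""
    return (
        f'<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:a="{prop_ns}">\n'
        f'  <rdf:Description rdf:about="{subject}">\n'
        f'    <a:{prop_name} rdf:datatype="{XSD_DATETIME}">{literal}</a:{prop_name}>\n'
        f'  </rdf:Description>\n'
        f'</rdf:RDF>'
    )

print(to_ntriples())
print(to_rdfxml())
```

Both outputs carry the same subject, property and value; only the surface form differs.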

On Other Languages

As mentioned above, there need not be one syntax for RDF. Of course it helps communication if we agree on something, and currently, for better or for worse, that is RDF/XML.

But that does not mean that procedural languages cannot play well with it. They can, since what is important is not the syntax but the semantics, and those are very well defined.

There are a number of very useful bindings in pretty much every language: from Franz Lisp, to the Redland library for C, Python, Perl and Ruby, to Prolog bindings, and many Java libraries such as Sesame and Jena. Way too many to list here. For a very comprehensive overview see Mike Bergman's full survey of Semantic tools.

Note

I have received a huge number of hits from reddit. Way over 500. If the post is still on the top page when you read this, take the time to vote for it :-)

Friday May 25, 2007

Duck Typing done right

Dynamic languages such as Python, Ruby and Groovy make a big deal of their flexibility. You can add new methods to classes, extend them, etc. at run time, and do all kinds of funky stuff. You can even treat an object as being of a certain type by looking at its methods. This is called Duck Typing: "If it quacks like a duck and swims like a duck then it's a duck", goes the well known saying. The main criticism of Duck Typing has been that what is gained in flexibility is lost in precision: it may be good for small projects, but it does not scale. I want to show here both that the criticism is correct, and how to overcome it.

Let us look at Duck Typing a little more closely. If something is a bird that quacks like a duck and swims like a duck, then why not indeed treat it like a duck? Well, one reason that occurs immediately is that in nature there are always weird exceptions. It may be difficult to see the survival advantage of looking like a duck, as opposed to, say, looking like a lion, but one should never be surprised at the surprising nature of nature.
Anyway, that's not the type of problem people working with duck typing ever have. How come? Well, it's simple: they usually limit the interactions of their objects to a certain context, where the objects being dealt with are such that if any one of them quacks like a duck, then it is a duck. And here, in essence, we have the reason for the criticism: in order for duck typing to work, one has to limit the context, that is, limit the objects manipulated by the program in such a way that the duck typing falls out right. Enlarge the context, and at some point you will find objects that don't fit the presuppositions of your code. So, for simple semantic reasons, those programs won't scale. The more the code is mixed and meshed with other code, the more likely it is that an exception will turn up. The context in which the duck typing works is a hidden assumption, usually held in the heads of the small group of developers working on the code.

A slightly different way of coming to the same conclusion is to realize that these programming languages don't really analyze the sound of quacking ducks. Nor do they look at objects and try to classify the way they swim. What they do is look at the names of the methods attached to an object, and then do a simple string comparison. If an object has a swim method, they will assume that swim stands for the same type of thing that ducks do. Now of course it is well established that natural language is ambiguous and hence very context dependent. The method names gain their meaning from their association with English words, which are ambiguous. There may for example be a method named swim, where those letters stand for the acronym "See What I Mean". That method may return a link to some page on the web that describes the subject of the method in more detail, and have no relation to water activities. Calling that method in the expectation of some paddling will lead to some unexpected results.
But once more, this is not a problem duck typing programs usually have. Programmers developing in those languages will be careful to limit the execution of the program to objects where swim stands for the thing ducks do. But it does not take much for that presupposition to fail. Extend the context somewhat by loading some foreign code, and at some point these presuppositions will break down and nasty, difficult-to-locate bugs will surface. Once again, the criticism that duck typing is not scalable is perfectly valid.
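
Here is a minimal Python sketch of that failure. The classes Duck and HelpTopic, the function go_for_a_swim, and the example.org link are all hypothetical names made up for this illustration; the point is only that the check looks at the method name and nothing else.

```python
class Duck:
    def swim(self):
        return "paddling across the pond"

class HelpTopic:
    """Stands in for a class from some foreign library (hypothetical)."""
    def swim(self):
        # Here "swim" abbreviates "See What I Mean": it returns a link, not a swim.
        return "http://example.org/help/topic"

def go_for_a_swim(thing):
    # Duck typing: the only check is that a method with the right *name* exists.
    if hasattr(thing, "swim"):
        return thing.swim()
    raise TypeError("this thing cannot swim")

results = [go_for_a_swim(Duck()), go_for_a_swim(HelpTopic())]
print(results)
```

Both calls succeed without any error, which is exactly the problem: the second result is a link to a web page, and the bug only shows up later, wherever the caller tries to treat that value as a swim.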

So what is the solution? Well, it requires one very simple step: one has to use identifiers that are context free. If you can use identifiers for swimming that are universal, then they will always mean the same thing, and so the problem of ambiguity will never surface. Universal identifiers? Oh yes, we have those: they are called URIs.
Here is an example. Let us

  • name the class of ducks
    <http://a.com/Duck> a owl:Class;
             rdfs:subClassOf <http://a.com/Bird>;
             rdfs:comment "The class of ducks, those living things that waddle around in ponds" .
    
  • name the relation <http://a.com/swimming> which relates a thing to the time it is swimming
     <http://a.com/swimming> a owl:DatatypeProperty;
                             rdfs:domain <http://a.com/Animal> ;
                             rdfs:range xsd:dateTime .
     
  • name the relation <http://a.com/quacking> which relates a thing to the time it is quacking (like a duck)
     <http://a.com/quacking> a owl:DatatypeProperty;
                             rdfs:domain <http://a.com/Duck> ;
                             rdfs:range xsd:dateTime .
    
  • state that a duck is an animal
     <http://a.com/Duck> rdfs:subClassOf <http://a.com/Animal> .
    
Now if you ever see the relation
:d1  <http://a.com/quacking> "2007-05-25T16:43:02"^^xsd:dateTime .

then you know that :d1 is a duck (or that the relation is false, but that is another matter), and this will be true whatever context you find the relation in. You know this because the URL http://a.com/quacking always refers to the same relation, and that relation was defined as linking ducks to times.
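
The reasoning step used here can be sketched in a few lines of Python. This is a toy, not a real RDFS reasoner: it applies only the rdfs:domain and rdfs:subClassOf rules, it writes rdf:type, rdfs:domain and rdfs:subClassOf as plain prefixed strings for brevity, and the function name infer_types is my own invention for this illustration.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_SUBCLASS = "rdfs:subClassOf"
A = "http://a.com/"

# The schema statements from the example above, as (subject, property, value).
schema = {
    (A + "quacking", RDFS_DOMAIN, A + "Duck"),
    (A + "swimming", RDFS_DOMAIN, A + "Animal"),
    (A + "Duck", RDFS_SUBCLASS, A + "Bird"),
    (A + "Duck", RDFS_SUBCLASS, A + "Animal"),
}

def infer_types(triples):
    """Apply the domain and subclass rules until no new statement appears."""
    known = set(triples)
    while True:
        new = set()
        for s, p, o in known:
            # Domain rule: using property p puts the subject in p's domain.
            for p2, rel, cls in known:
                if rel == RDFS_DOMAIN and p2 == p:
                    new.add((s, RDF_TYPE, cls))
            # Subclass rule: a member of a class is a member of its superclasses.
            if p == RDF_TYPE:
                for sub, rel, sup in known:
                    if rel == RDFS_SUBCLASS and sub == o:
                        new.add((s, RDF_TYPE, sup))
        if new <= known:
            return known
        known |= new

facts = schema | {(":d1", A + "quacking", "2007-05-25T16:43:02")}
types_of_d1 = sorted(o for s, p, o in infer_types(facts) if s == ":d1" and p == RDF_TYPE)
print(types_of_d1)
```

From the single quacking statement the sketch concludes that :d1 is a Duck, and from there that it is also a Bird and an Animal, without anything in the code mentioning ducks by name.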
Furthermore, notice how you may conclude many more things from the above statement. Perhaps you have an ontology of animals written in OWL that states that ducks are among those animals that always have two parents. Given that, you would be able to conclude that :d1 has two parents, even if you don't know who they are. Animals are physical beings, you may discover by clicking on the http://a.com/Animal URL, and in particular physical things that always have a location. It would therefore be quite correct to query for the location of :d1...
You can get to know a lot of things from just one simple statement. In fact, with the Semantic Web, what that single statement tells you gets richer and richer the more you know. The wider the context of your knowledge, the more you know when someone tells you something, since you can use inferencing to deduce all the things you have not been told. The more things you know, the easier it is to make inferences (see Metcalfe's law).

In conclusion, duck typing is done right on the semantic web. You don't have to know everything about something to work with what you have, and the more you know the more you can do with the information given to you. You can have duck typing and scale.

About

bblfish
