SPARQLing AltaVista: the meaning of forms

Did you know that AltaVista has a SPARQL endpoint? And that all of its results are served up that way? No? Well take the Red Pill and I will show you how deep the rabbit hole goes...

Take the query for the three words "matrix rabbit hole". Go to AltaVista and enter those words into the search box. Press "Find" and you will end up at the result page http://www.altavista.com/web/results?itag=ody&q=rabbit+hole+matrix&kgs=1&kls=0. This page lists the following information for each result:

  • The title of the page
  • The link to the page
  • An extract of the page containing the relevant words
  • A link to more results from that particular web site
In other words it is just the result of the following SPARQL query [1]:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX eg: <http://altavista.eg/ont#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
CONSTRUCT {
     ?page dc:title ?title;
           eg:summary ?summary;
           eg:moreResults ?more .
} WHERE {
     ?page dc:title ?title;
           eg:content ?content;
           eg:summary ?summary;
           eg:moreResults ?more .
     ?content pf:textMatch "+matrix +rabbit +hole" .
}
LIMIT 10
OFFSET 0

The AltaVista engineers - and I know them well having worked there for 5 years - of course understand User Interface issues very well, and so they don't return the default XML result format. They pass all their results first through a clever and optimised XSLT transform, that gives you the page that you now see in your browser.

In order to do this, the AltaVista engineers developed - and I can now speak openly about this - a clever mapping between HTML forms and SPARQL queries. Sadly it is so long since I worked there that my memory is a little dim on exactly how they did this. So please forgive my mistakes. But I am sure we can work this out together.

HTML forms consist essentially of a number of key-value pairs for which the end user is asked to provide the values, the URL of a form processing agent, and an action button for the user to press once he has answered the questions asked of him. Given that, the trick is just to create a simple SPARQL template language, so that one can relate a form processing agent to one or more SPARQL query templates. What does a SPARQL query template look like? Well it is really very similar to a SPARQL query, except that it has ??vars which need to be replaced by values from the form. So the SPARQL template associated with the front page form could be expressed like this:

@prefix fm: <http://altavista.eg/ont#> .

<http://www.altavista.com/web/results>  a fm:FormHandler;
    fm:template """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX eg: <http://altavista.eg/ont#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
CONSTRUCT {
     ?page dc:title ?title;
           eg:summary ?summary;
           eg:moreResults ?more .
} WHERE {
     ?page dc:title ?title;
           eg:content ?content;
           eg:summary ?summary;
           eg:moreResults ?more .
     ?content pf:textMatch ??(\\" ??q \\")
}
LIMIT ???lmt
OFFSET ??( ??stq * ???lmt )
""" .

So when AltaVista receives a form submission, the AV front end decodes it and replaces all the ??key patterns above with the value of key as passed by the form. It then replaces all the ???keyparam patterns with default values set by the user and available from his session state. Finally the ??( ... ) operations are executed. These can take the form of a multiplication or a string concatenation as shown above, or other operations such as dealing with defaults. Having done that you end up with the previous SPARQL query, which is ready to be sent to the back end engines. The back end engines have a more powerful language to play with, allowing AltaVista to propose new paying services to large customers such as Yahoo [2].
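
The three substitution steps can be sketched in a few lines of Python. This is only an illustration of the idea, not anything AltaVista ran: the `expand` function, the shortened `TEMPLATE`, the default of 0 for `stq` and the session value of 10 for `lmt` are all assumptions made for the example. The sketch also simply wraps the search words in quotes rather than adding the `+` prefixes shown in the earlier query, and it does no escaping of user input, which a real front end would have to do.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Shortened excerpt of the template above, after Turtle unescaping.
TEMPLATE = r'''?content pf:textMatch ??(\" ??q \") .
LIMIT ???lmt
OFFSET ??( ??stq * ???lmt )'''


def expand(template, form, session):
    """Fill a ??/??? template from form values and session defaults."""
    # Step 1: ??key -> form value (the lookbehind skips ???key patterns).
    text = re.sub(r"(?<!\?)\?\?(\w+)", lambda m: form[m.group(1)], template)
    # Step 2: ???key -> per-user default taken from the session state.
    text = re.sub(r"\?\?\?(\w+)", lambda m: session[m.group(1)], text)

    # Step 3: evaluate each ??( ... ) operation.
    def evaluate(m):
        expr = m.group(1).strip()
        product = re.fullmatch(r"(\d+)\s*\*\s*(\d+)", expr)
        if product:
            # Multiplication of two integers, e.g. page number * page size.
            return str(int(product.group(1)) * int(product.group(2)))
        # Otherwise: string concatenation, where \" stands for a quote.
        return '"' + expr.replace('\\"', "").strip() + '"'

    return re.sub(r"\?\?\(([^()]*)\)", evaluate, text)


url = ("http://www.altavista.com/web/results"
       "?itag=ody&q=rabbit+hole+matrix&kgs=1&kls=0")
form = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
form.setdefault("stq", "0")  # assumed default: first page of results
session = {"lmt": "10"}      # assumed per-user page size
result = expand(TEMPLATE, form, session)
print(result)
```

With the URL from the beginning of this post, the sketch prints a text match on "rabbit hole matrix" followed by LIMIT 10 and OFFSET 0.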

This has a few advantages. It reduces the size of the forms POSTed to AltaVista: a SPARQL query would take a lot more space, which would end up adding a fraction of a second to the processing time and so make it ever so much more painfully obvious that the redirect they are doing to av.rds.yahoo.com is destroying their performance. Accepting raw SPARQL would also give end users more freedom than is needed: it is a good policy decision to reduce the uses of a tool, when one aims to shrink one's market.

This mapping could be extremely useful in a number of other ways.
For one it would help make it clear to machines what the meaning of a form is. Forms are questions asked of an agent. The meaning of the question is usually obvious to a human end user who speaks the language of the web page that is being shown. But for a machine to do the same, it helps to map the form to a semantically defined query, which can be reasoned with. In this case the answers given by the human are used to construct a question that is sent to the server. In other cases the form is asking the user for his desires, and using these to construct an action. There is still some interesting work to be done in mapping the different uses of forms to RDF, but I think this does bring a key element into play.
Having a machine understandable version of a form means that a robot can start putting on his RDF glasses and see things right. All that would be needed would be to link the form handler to an XSLT that could transform the resulting HTML to the SPARQL result format, and each of the thousands of existing web forms suddenly becomes transparent to the world of machine agents. [3]
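
To make the idea of mapping a result page back to query results a little more concrete, here is a rough sketch. It swaps the XSLT for Python's standard `html.parser`, and targets the JSON serialisation of SPARQL query results instead of the XML one. The markup in `SAMPLE_HTML`, its class names, and the `ResultScraper` and `to_sparql_results` names are all invented for the example; the real AltaVista page is not reproduced here.

```python
import json
from html.parser import HTMLParser

# Hypothetical result markup, made up for this sketch.
SAMPLE_HTML = """
<div class="result">
  <a class="title" href="http://example.org/matrix">Down the rabbit hole</a>
  <span class="summary">How deep the rabbit hole goes</span>
</div>
"""


class ResultScraper(HTMLParser):
    """Collects one binding per <div class="result"> block."""

    def __init__(self):
        super().__init__()
        self.bindings = []
        self._field = None  # variable waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css = attrs.get("class")
        if tag == "div" and css == "result":
            self.bindings.append({})
        elif tag == "a" and css == "title" and self.bindings:
            # The link target becomes the ?page URI.
            self.bindings[-1]["page"] = {"type": "uri",
                                         "value": attrs.get("href", "")}
            self._field = "title"
        elif tag == "span" and css == "summary" and self.bindings:
            self._field = "summary"

    def handle_data(self, data):
        if self._field and self.bindings:
            self.bindings[-1][self._field] = {"type": "literal",
                                              "value": data.strip()}
            self._field = None


def to_sparql_results(html):
    """Return the scraped page as a SPARQL-results-style dictionary."""
    scraper = ResultScraper()
    scraper.feed(html)
    scraper.close()
    return {"head": {"vars": ["page", "title", "summary"]},
            "results": {"bindings": scraper.bindings}}


print(json.dumps(to_sparql_results(SAMPLE_HTML), indent=2))
```

A scraper like this is tied to one page layout, which is exactly the fragility that note [3] points at.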
It could also help reduce the work of defining new protocols. A good part of OpenID Attribute Exchange, for example, is just a complex specification for a limited SPARQL template, if you put your RDF glasses on. [4]

With time you get to see the real structure of the world. As that happens the questions you start asking become a lot more interesting.

Notes

  1. The pf:textMatch relation is defined by LARQ, the Jena Lucene-ARQ free text indexing extension for SPARQL, and it makes a lot of sense.
    The namespace is sadly not dereferenceable, i.e. clicking on http://jena.hpl.hp.com/ARQ/property does not give you the definition of the relation. It would be nice if it did.
    Note also how in SPARQL you can have literals as subjects. This is explained in section 12.1.4 of the current SPARQL query language specification.
    Thanks to Andy Seaborne for the link.
  2. I do know of course that AltaVista is part of Yahoo! now. And by the way all of the above is meant to be taken with a pinch of salt, or rather of red pills.
  3. This is called screen scraping, and is of course more work for the consumer. It is nicer when the information provider has a strong interest in providing a stable format.
  4. A large part of the spec is a duplication of the work that should be done by HTTP verbs such as GET, PUT, POST and DELETE. Using the Atom Protocol to publish a foaf file would deal with a large part of the spec in a few lines. Well that's what it looks like to me after studying the spec for a few hours only, and so I may have missed something.
Comments:

Henry,

Where is the actual SPARQL Endpoint? The place where I can send SPARQL directly via the SPARQL Protocol etc..

Nice post as per usual :-)

Kingsley

Posted by Kingsley Idehen on August 22, 2007 at 10:16 AM CEST #

Kingsley wrote:
> Where is the actual SPARQL Endpoint?

They never made it publicly available ;-) But you can ask Dave Beckett (http://purl.org/net/dajobe/) who is working at Yahoo. I think he is working on making it public, though he is very secretive, so he might not admit it... :-)

Posted by Henry Story on August 22, 2007 at 10:51 AM CEST #

Hello Henry,

This all sounds like great stuff! And I echo Kingsley with asking for the SPARQL endpoint. If this is made public I can see Altavista gaining some nice momentum from the Linking Open Data people and the semweb people in general.

Cheers,

David

Posted by David Peterson on August 23, 2007 at 12:01 AM CEST #

I am not going to be very optimistic that AltaVista will be the first to make a SPARQL endpoint available. They have some major problems currently to deal with, such as forcing all results to go through a slow redirect. If they can't deal with that, then my guess is that they are stuck somewhere pretty bad.

More important is that:

- It probably would not be too expensive to turn search engines into SPARQL endpoints. More problematic for them is that it makes some very powerful queries available that they may not want to respond to. The forms interface reduces the number of queries that can be asked. They could of course just refuse to reply to queries that take more than a certain time to compute...
- You can make similar types of searches with Lucene and the Jena SPARQL plug in
- You could describe a lot of forms as SPARQL query templates, which means that you can build your programs from this perspective.
- Search engines are query end points that put a lot of energy into bullet proofing themselves against attack, with algorithms to return results as quickly as possible. It would be good if the SPARQL endpoints in Jena, Sesame, Mulgara... were bullet proofed in a similar way.
- This ties in with getting a better understanding of what forms are. I describe a way of using SPARQL queries as a kind of RDF form here:
http://blogs.sun.com/bblfish/entry/restful_semantic_web_services

I am more interested in tying these two worlds together - web 1.0 and web 3.0 - and provoking a shift in perspective.

Posted by Henry Story on August 23, 2007 at 04:34 AM CEST #

I'm surprised you didn't mention my position on SPARQL and its relationship to RDF Forms, which basically talks about this same "Facade" style, but from the POV of why the difference matters (a key advantage that you didn't mention above).

http://www.markbaker.ca/blog/2006/08/09/sparql-useful-but-not-a-game-changer/

Posted by Mark Baker on August 28, 2007 at 11:36 PM CEST #

a regex power showoff?

Posted by Pippo Baudo on October 03, 2007 at 05:59 AM CEST #
