Wednesday Aug 22, 2007

SPARQLing AltaVista: the meaning of forms

Did you know that AltaVista has a SPARQL endpoint? And that all of its results are served up that way? No? Well take the Red Pill and I will show you how deep the rabbit hole goes...

Take the query for the three words "matrix rabbit hole". Go to AltaVista and enter those words into the search box. Press "Find" and you will end up at the result page http://www.altavista.com/web/results?itag=ody&q=rabbit+hole+matrix&kgs=1&kls=0. This page lists the following information for each result:

  • The title of the page
  • The link to the page
  • An extract of the page containing the relevant words
  • A link to more results from that particular web site
In other words it is just the result of the following SPARQL query [1]:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PRFIX eg: <http://altavista.eg/ont>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
CONSTRUCT {
     ?page dc:title ?title;
           eg:summary ?summary;
           eg:moreResults ?more .
} WHERE {
     ?page dc:title ?title;
           eg:content ?content;
           eg:summary ?summary;
           eg:moreResults ?more .
     ?content pf:textMatch "+matrix +rabbit +hole" .
}
LIMIT 10
OFFSET 0

The AltaVista engineers - and I know them well having worked there for 5 years - of course understand User Interface issues very well, and so they don't return the default XML result format. They pass all their results first through a clever and optimised XSLT transform, that gives you the page that you now see in your browser.

In order to do this, the AltaVista engineers developed - and I can now speak openly about this - a clever mapping between html forms and SPARQL queries. Sadly it is such a long time ago that I worked there now, that my memory is a little dim on the exact manner in which they did this. So please forgive my mistakes. But I am sure we can work this out together.
Html Forms consist essentially of a number of key-value pairs where the end user is asked to provide the values, a form processing agent URL, and an action button if the user answers the question asked of him. Given that, the trick is just to create a simple SPARQL template language, so that one can relate a form processing agent to one or more SPARQL query templates. What does a SPARQL query template look like? Well it is really very similar to a SPARQL query, except that it has ??vars which need to be replaced by values from the form. So the SPARQL template associated with the front page form could be expressed like this:

@prefix fm: <http://altavisat.eg/ont#> .

<http://www.altavista.com/web/results>  a fm:FormHandler;
    fm:template """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PRFIX eg: <http://altavista.eg/ont>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
CONSTRUCT {
     ?page dc:title ?title;
           eg:summary ?summary;
           eg:moreResults ?more .
} WHERE {
     ?page dc:title ?title;
           eg:content ?content;
           eg:summary ?summary;
           eg:moreResults ?more .
     ?content pf:textMatch ??(\\" ??q \\")
}
LIMIT ???lmt
OFFSET ??( ??stq \* ???lmt )
""" .

So when AltaVista receives a form submission, the AV front end decode it and replaces all the ??key patterns above with the value of key as passed by the form. It then replaces all the ???keyparam patterns with default values set by the user and available from his session state. Finally the ??( ... ) operations are exectued. This can be in the form of a multiplication or a string concatenation as shown above, or other operations such as dealing with defaults. Having done that you end up with the previous SPARQL query, which is ready to be sent to the back end engines. The back engines have a more powerful language to play with, allowing AltaVista to propose new paying services to large customers such as Yahoo [2].

This has a few advantages. It reduces the size of the forms POSTED to AltaVista: a SPARQL query would take a lot more space, which would end up taking a fraction of a second off the processing time and so make it ever so much more painfully obvious that the redirect they are doing to av.rds.yahoo.com is destroying their performance. It would also give end users more freedom than is needed: It is a good policy decision to reduce the uses of a tool, when one aims to shrink one's market.

This mapping could be extreemly useful in a number of other ways.
For one it would help make it clear to machines what the meaning of a form is. Forms are questions asked to an agent. The meaning of the question is usually obvious to a human end user who speaks the language of the web page that is being shown. But for a machine to do the same, it helps to map the form to a semanticall defined query, which can be reasoned with. In this case the answers given by the human is used to construct a question that is sent to the server. In other cases the form is asking the user for his desired, and using this to construct an action. There is some interesting work in mapping different uses of forms to rdf still, but I think this does bring a key element into play.
Having a machine understandable version of a form means that a robot can start putting his rdf glasses and see things right. All that would be needed would be to link the form handler to an XSLT that could transform the resulting html to the SPARQL result format, and each of the thousands of existing web forms suddenly become transparent to the world of machine agents. [3]
It could also help reduce the work of defining new Protocols. The good part of OpenId Attribute Exchange for example is just a complex specification for a limited SPARQL template, if you put your rdf glasses on.[4]

With time you get to see the real structure of the world. As that happens the questions you start asking become a lot more interesting.

Notes

  1. The pf:textMatch relation is defined by the LARQ, the Jena Lucene-ARQ free text indexing for SPARQL extension, and it makes a lot of sense.
    The namesppace is sadly not dereferenceable. IE. Clicking on http://jena.hpl.hp.com/ARQ/property does not give you the definition of the relation. It would be nice if it did.
    Note also how in SPARQL you can have literals as subjects. This is explained in section 12.1.4 of the current SPARQL query language specification.
    Thanks to Andy Seaborne for the link.
  2. I do know of course that AltaVista is part of Yahoo! now. And by the way all of the above is meant to be taken with a pinch of salt red pills.
  3. This is called screen scraping, and is of course more work for the consumer. It is nicer when the information provider has a strong interest in providing a stable format.
  4. A large part of the spec is a duplication of the work that should be done by HTTP verbs such as GET, PUT, POST and DELETE. Using the Atom Protocol to publish a foaf file would deal with a large part of the spec in a few lines. Well that's what it looks like to me after studying the spec for a few hours only, and so I may have missed something.
About

bblfish

Search

Archives
« April 2014
MonTueWedThuFriSatSun
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
    
       
Today