Friday Mar 20, 2009

Updates to Minion's query API

I've actually started to convert our AURA project code to use the new Minion query API, and as a result I decided that having a single Relation class for all of the relational operators really makes code that uses the query API look ugly.

I've just checked in a change to the query API that provides individual classes for the relational operators. These are all simple specializations of the Relation class, but it makes the code using the API a lot prettier. For example, this:

        Element query = 
	   new And(new Relation("aura-type",
                                Relation.Operator.EQUALS,
                                "artist"),
                   new Relation("aura-name", 
		                Relation.Operator.SUBSTRING,
        			artistName));

becomes this:

        Element query = 
	   new And(new Equals("aura-type", "artist"),
                   new Substring("aura-name", artistName));

Which is a lot neater, in my opinion. Don't ever forget that your API is the user interface for your library!

While debugging some problems yesterday I discovered the need to print out the queries, so while I was in there, I added toString methods that produce a nice string version of a query expressed using the API. The above query will be converted to the string:

(And (= aura-type artist) ( aura-name coldplay))

(assuming that artistName is "coldplay", obviously!) And yes, that is an s-expression. Lisp programmers represent! (OK, OK, it's not really an s-expression (spaces in the field values will screw things up), but it's close, and the parentheses delineate the expressions nicely.)

The changes to the code have been committed and we should have new javadoc up later on today.

Thursday Mar 12, 2009

A query API for Minion

Minion actually offers three different query languages (and is designed so that others can be added pretty easily), but it can be difficult to use the query languages when a program (as opposed to a person) is writing the queries.

For example, lets say that you're providing a search engine for musical artists, and you want your users to be able to search by artist name. The first thing that you'll end up doing is quoting the value provided by the user by doing something like:

   String query = String.format("artist-name = \\"%s\\"", name);
   engine.search(query);
because you know that artist names can have spaces in them.

This will work for a while, but you will soon find that there's an amazing amount of stuff that occurs in the names of musical artists. We have a crawl of information about 50,000 musical artists (more on that in the days to come, I hope.) If we sort the list of artists by name, here's the first 15:

!!!
!Action Pact!
!DISTAIN
!DelaDap
"Brother" Jack McDuff
"Little" Louie Vega
"Weird Al" Yankovic
#9 Dream
#Poundsign#
$wingin' Utter$
'Til Tuesday
't Hof van Commerce
(Love) Tattoo
(The Sounds Of) Kaleidoscope
(Young) Pioneers

Let's say that our user has typed in "Weird Al" Yankovic. If we're not careful about escaping the quotes, we're either going to have a parse exception or get weird results because the query parser will interpret the query differently than the user intended. At this point you begin to have a dawning sense of horror that you're going to spend the rest of your time figuring out what the set of weird characters is (usually when something breaks) and then dealing with escaping them on a case-by-case basis.

To completely sidestep this problem, we've added a programmatic query API to Minion. The idea is pretty simple: you build a Java structure that describes the query and then give it to the search engine to evaluate. Our search for an artist name above becomes:

        Element e = new StringRelation("aura-name", Relation.Operator.Equals, name);
        engine.search(e);
No need to worry about escaping things, because we're just passing around Java strings.

You can start out with creating a query element for a term and searching for just that term:

        Term j = new Term("java");
        ResultSet rs = engine.search(j);

As with the query languages, the term will search for morphological variations by default. We can specify how we want a term to be treated during the query by specifying modifiers:

        Term j2 = new Term("Java", EnumSet.of(Term.Modifier.CASE, Term.Modifier.MORPH));
        rs = engine.search(j2);
which will search for the term Java (and morphological variations) only in that exact case. You can combine terms using boolean operators:
        Term l = new Term("language");
        Term s = new Term("specification");
        And a = new And(j, l, s);
        rs = engine.search(a);
        rs = engine.search(new Or(j, l, s));

Note that there are varargs constructors for And and Or so you don't need to clutter up your code with lots of anonymous arrays.

You can combine boolean operators and terms in the natural way:

        And ao = new And(j, new Or(l, s));

Using the query API also means that you don't have to worry about what the precedence of the operators is in the query language (e.g., "What does java <and> language <or> specification do?").

As indicated above, you can use the query API for relational querying too:

        Relation sr = new Relation("subject", Relation.Operator.SUBSTRING,
                "java");
        Relation dr = new Relation("date", Relation.Operator.LESS_THAN,
                "03/12/2009");
and, of course, you can combine relational and boolean queries:
        And subj = new And(j, l, s, sr);

You can restrict elements to be searched for in a particular field or fields in the following way:

        Term j = new Term("java");
        j.addField("subject");
you can also use a set of fields, if you prefer. When more than one field is specified, a hit in any of the fields means that the document will be in the result set.

Finally, you can indicate that the results generated by a particular element of the query should be treated in a strict boolean fashion. This means that a document is either in the set or it isn't; there are no scores associated with the documents. When combining a set of documents that were treated in a strict boolean fashion with a set where the documents are scored, the strict boolean result set will not affect the scores in the combined set (although they may affect the presence or absence of certain documents!)

This is useful to do when you want to make sure that a result set contains documents that have a particular property, but you don't want that property to mess up the scores. For example:

        Term j = new Term("java");
        Term l = new Term("language");
        l.setStrict(true);
        And a = new And(j, l);
will return the documents that have java and language in them, but the scores for those documents won't be influenced by the score of language.

At the moment, the API only allows for boolean, relational, and range queries, but in the near future we'll be adding in proximity operators as well as document similarity operators.

The only thing that I don't really like about this new API is that it doesn't do type-checking as you build the query. For example, if you're building a relation like:

       Relation dr = new Relation("date", Relation.Operator.LESS_THAN,
                "03/12/2009");
we can't ensure that date is a saved field, that it's a date saved field, and that that provided value is a date. This is because the queries are built without considering any particular engine, so we can't tell what fields are defined or what their types are. Currently this is checked before the query is evaluated in the search method, and a QueryException is raised if there is a problem.

On a stylistic note, we decided to have a single Relation class rather than breaking out operators for each relation type. I still can't decide whether that was a good idea or not. Helpful suggestions welcome in the comments!

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today