Minion and Lucene: Finding Variants

Back in the good old days, most search engines stemmed the terms being indexed.

The idea was that removing the suffixes on a word would save space (since you need to store fewer terms in the dictionary and store fewer postings), and it would allow the users to type in any variant of a particular term. The engine would stem the query terms before looking them up in the dictionary, resulting in the engine returning the documents for all variants of the term.

The problem with this approach is that it makes it impossible to search for a variant in exactly the way specified by the user. So, for example, you couldn't search for the surname woods without also getting hits for the singular wood.
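To make the problem concrete, here is a minimal sketch of index-time stemming. The `toy_stem` function is a hypothetical stand-in that only strips a trailing "s"; it is not the stemmer used by any particular engine, and the two documents are made up for illustration.

```python
from collections import defaultdict


def toy_stem(term):
    """Hypothetical suffix stripper: removes a trailing 's' from longer terms."""
    if term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term


def build_stemmed_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return dict(index)


def search(index, query_term):
    """Stem the query term the same way the index was stemmed, then look it up."""
    return index.get(toy_stem(query_term.lower()), set())


# Made-up documents for illustration.
docs = {
    "d1": "tiger woods won again",
    "d2": "a table made of wood",
}
index = build_stemmed_index(docs)
hits = search(index, "woods")
```

Because only the stem `wood` is stored, the query for `woods` returns both documents, and there is no way to ask the index for the surface form alone.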

By default, both Minion and Lucene store the word forms encountered in the documents in the index, rather than storing (for example) stemmed forms. The difference between the engines is that Minion provides for searching across term variants at query time. By default, Minion searches for all known morphological variations of the query terms. We generate the variations using a lightweight morphological framework that uses a set of rules similar to the set used by stemmers. The interesting thing about this is that the lightweight morphology is generative, so that given a term we can produce a set of terms that we should try to look up in the dictionary.

The lightweight morphology tends to overgenerate, but it overgenerates in a lot of the same ways that people tend to. The good thing is that if it generates something that's not really an English word (e.g., I've seen it generate happiless from happiness) then the dictionary lookup will fail and it won't impact the query results.
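The generate-then-filter idea can be sketched roughly as follows. The rules in `generate_variants` are invented for illustration and are not Minion's actual rule set; the point is only that overgenerated forms drop out when they are checked against the index dictionary.

```python
def generate_variants(term):
    """Hypothetical, deliberately overgenerating rules (not Minion's rules)."""
    variants = {term}
    # Inflectional suffixes, applied blindly.
    for suffix in ("s", "es", "ed", "ing"):
        variants.add(term + suffix)
    # Strip a plural-looking "s".
    if term.endswith("s") and len(term) > 3:
        variants.add(term[:-1])
    # A derivational swap that can produce non-words: -ness -> -less.
    if term.endswith("ness"):
        variants.add(term[:-4] + "less")
    return variants


def known_variants(term, dictionary):
    """Keep only the generated forms that actually occur in the index dictionary."""
    return generate_variants(term) & dictionary


# Made-up dictionary of terms that occur in the indexed documents.
dictionary = {"dog", "dogs", "happiness"}

generated = generate_variants("happiness")
dog_forms = known_variants("dog", dictionary)
happy_forms = known_variants("happiness", dictionary)
```

Here `generated` contains the non-word `happiless`, but since it is not in the dictionary it never reaches the query: `happy_forms` holds only `happiness`, while `dog_forms` holds both `dog` and `dogs`.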

We currently have lightweight morphological analyzers for English, Spanish, and German (and one for French that we haven't integrated yet!).

This behavior can be modified with the use of the EXACT query operator. Additionally, Minion provides a language-independent stemmer that can be selected at query time using the STEM operator.

By default, Lucene only searches for the form provided in the query, so, for example, a query for dog will exclude documents that only include the plural form dogs. Lucene can be configured to stem the terms as they are added to the index and then stem the query terms, but this leads to the problem described above.

A solution to this is to use the lightweight morphological analyzer to generate the variants and then modify a query to look for any of the variants. In fact, we did this in some of our evaluations of Lucene.
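The post doesn't say exactly how the queries were modified in those evaluations, so the following is only one possible sketch: each query term is rewritten as a parenthesized OR group in Lucene's classic query syntax. The `toy_variants_for` lookup table is a hypothetical stand-in for the morphological analyzer, and the sketch does not escape query-syntax special characters.

```python
def expand_term(term, variants_for):
    """Rewrite one term as an OR group over the term and its variants."""
    variants = sorted(variants_for(term) | {term})
    if len(variants) == 1:
        return variants[0]
    return "(" + " OR ".join(variants) + ")"


def expand_query(query, variants_for):
    """Expand every whitespace-separated term and require all of the groups."""
    terms = query.lower().split()
    return " AND ".join(expand_term(t, variants_for) for t in terms)


# Hypothetical stand-in for a morphological analyzer.
TOY_VARIANTS = {
    "dog": {"dogs"},
    "walk": {"walks", "walked", "walking"},
}


def toy_variants_for(term):
    return set(TOY_VARIANTS.get(term, set()))


expanded = expand_query("dog walk", toy_variants_for)
# expanded == "(dog OR dogs) AND (walk OR walked OR walking OR walks)"
```

The resulting string could then be handed to a query parser over an unstemmed index, so the index still supports exact-form searches when no expansion is applied.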


Please stop describing behavior in Lucene as "default" unless you do it in the context of referring to a specific class (by name) in which there is behavior that can be modified through a constructor option or "setter" method call.

"Lucene" has no default when it comes to stemming (or case sensitivity, for that matter). Lucene as a Java library comes with a variety of tools for achieving a variety of goals -- you misrepresent it by describing it as having any particular "default" behavior when it comes to how text is indexed. For better or for worse, Lucene requires application developers to be experts in their domain and decide how their text should be dealt with.

Posted by guest on May 04, 2008 at 07:30 PM EDT #

There is another solution to this problem. Index and search for both the morphological root and the original term and let the tf-idf put the exact matches on top.

Posted by Rob Young on May 06, 2008 at 03:15 AM EDT #

Indexing both forms is definitely one way to do it. It's kind of the way that we handle the case variations. Having said that, for variations that are fairly common (e.g., plurals) you might want to combine the term stats so that tf-idf is more representative.

Posted by Stephen Green on May 06, 2008 at 04:15 AM EDT #


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

