Minion and Lucene: Finding Variants
By searchguy on May 03, 2008
The idea was that removing the suffixes on a word would save space (since you need to store fewer terms in the dictionary and store fewer postings), and it would allow the users to type in any variant of a particular term. The engine would stem the query terms before looking them up in the dictionary, resulting in the engine returning the documents for all variants of the term.
The problem with this approach is that it makes it impossible to search for a variant in exactly the way specified by the user. So, for example, you couldn't search for the surname woods without also getting hits for the singular wood.
By default, both Minion and Lucene store the word forms encountered in the documents in the index, rather than storing (for example) stemmed forms. The difference between the engines is that Minion provides for searching across term variants at query time. By default, Minion
searches for all known morphological variations of the query terms. We generate the variations using a lightweight morphological framework that uses a set of rules similar to the set used by stemmers. The interesting thing about this is that the lightweight morphology is generative, so that given a term we can produce a set of terms that we should try to lookup in the dictionary.
The lightweight morphology tends to overgenerate, but it overgenerates in a lot of the same ways that people tend to. The good thing is that if it generates something that's not really an English word (e.g., I've seen it generate happiless from happiness) then the dictionary lookup will fail and it won't impact the query results.
We currently have lightweight morphological analyzers for English, Spanish, and German (and one for French that we haven't integrated yet!)
This behavior can be modified with the use of the
EXACT query operator. Additionally, Minion provides a language-independent stemmer that can be selected at query time using the
By default, Lucene only searches for the form provided in the query, so, for example, a query for dog will exclude documents that only include the plural form dogs. Lucene can be configured to stem the terms as they are added to the index and then stem the query terms, but this leads to the problem described above.
A solution to this is to use the lightweight morphological analyzer to generate the variants and then modify a query to look for any of the variants. In fact, we did this in some of our evaluations of Lucene.