By searchguy on Mar 03, 2009
Matt Chaput commented the other day about an aspect of the dictionaries, and in passing mentioned that his Python search engine (Whoosh! Good name, dude!) uses the lightweight morphology from Minion (ported to Python, obviously.)
Otis followed up asking where he could get more information, and I thought I would post an entry rather than burying it in the comments.
The lightweight morphology component (colloquially known as "LiteMorph") was originally developed by Bill Woods as an alternative to the stemmers that are traditionally used in search engines.
The basic idea behind LiteMorph is to take a word and replace known suffixes with other suffixes that are considered plausible in the morphology of the particular language of interest (there are LiteMorphs for English, German and Spanish). Here's an example of a LiteMorph rule for English:
.aeiouy o u s + -> ly,ness
This rule means: for a word that ends with a vowel followed by the letters ous, you can generate variations by adding the letters ly and ness to the word. So, a word like joyous will generate the variations joyously and joyousness.
Note that these rules were developed long before regular expressions were available in Java, so they use an invented syntax. Someone who wanted to port them to use regular expressions would be warmly welcomed into the Minion family!
This generativeness (there's a morph for you!) is a very nice property. For example, given the term build, the English version of LiteMorph generates the following list of variations:
build, builder, builders, building, buildings, builds, built
This is nice because this is the list of terms that we need to look up in the dicitionary to get the variations for build.
Each LiteMorph contains an exception list, so we can handle things like the irregular verb run:
ran, run, runner, runners, running, runnings, runs
The variations that are generated can be a bit weird, though. Here are the generated variants for happiness:
happied, happier, happiers, happies, happiest, happiful, happiless, happily, happiment, happiments, happiness, happinesses, happy, happying, happyings
The good news is that although these seem weird, we're looking them up in the dictionary so really weird variations (e.g., happiless, which I totally think should be a word) just won't appear and so they won't affect the search. On the other hand, variations that a stemmer would never get very well might appear and the LiteMorph will catch them.
I had q quick look around for papers explaining the syntax of the LiteMorph rules and came up with a paper for TaPor 2004 that was written by a grad student who implemented a French LiteMorph.