LiteMorph: A lightweight morphology

Matt Chaput commented the other day about an aspect of the dictionaries, and in passing mentioned that his Python search engine (Whoosh! Good name, dude!) uses the lightweight morphology from Minion (ported to Python, obviously.)

Otis followed up asking where he could get more information, and I thought I would post an entry rather than burying it in the comments.

The lightweight morphology component (colloquially known as "LiteMorph") was originally developed by Bill Woods as an alternative to the stemmers that are traditionally used in search engines.

The basic idea behind LiteMorph is to take a word and replace known suffixes with other suffixes that are considered plausible in the morphology of the particular language of interest (there are LiteMorphs for English, German and Spanish). Here's an example of a LiteMorph rule for English:


.aeiouy o u s + -> ly,ness

This rule means: for a word that ends with a vowel followed by the letters ous, you can generate variations by adding the letters ly and ness to the word. So, a word like joyous will generate the variations joyously and joyousness.

Note that these rules were developed long before regular expressions were available in Java, so they use an invented syntax. Someone who wanted to port them to use regular expressions would be warmly welcomed into the Minion family!

This generativeness (there's a morph for you!) is a very nice property. For example, given the term build, the English version of LiteMorph generates the following list of variations:


build, builder, builders, building, buildings, builds, built

This is nice because this is the list of terms that we need to look up in the dicitionary to get the variations for build.

Each LiteMorph contains an exception list, so we can handle things like the irregular verb run:


ran, run, runner, runners, running, runnings, runs

The variations that are generated can be a bit weird, though. Here are the generated variants for happiness:


happied, happier, happiers, happies, happiest, happiful, happiless, happily, happiment, happiments, happiness, happinesses, happy, happying, happyings

The good news is that although these seem weird, we're looking them up in the dictionary so really weird variations (e.g., happiless, which I totally think should be a word) just won't appear and so they won't affect the search. On the other hand, variations that a stemmer would never get very well might appear and the LiteMorph will catch them.

I had q quick look around for papers explaining the syntax of the LiteMorph rules and came up with a paper for TaPor 2004 that was written by a grad student who implemented a French LiteMorph.

Comments:

Thanks for the information. So instead of index-time term trimming this does query-time term-based query expansion. This could be a small performance hit, since rewrites need to happen, dictionary lookups need to happen, and then more terms need to be looked up in the inverted index. Have you measured how much this costs Minion by any chance?

You said: "On the other hand, variations that a stemmer would never get very well might appear and the LiteMorph will catch them.". Do you have some good examples?

Are there any cases where LiteMorph just doesn't do well (or does worse than, say Porter or Kstem)?

Thanks.

Posted by Otis Gospodnetic on March 04, 2009 at 03:00 AM EST #

Otis, you understand correctly: this is query-time expansion. I haven't directly measured the cost of using expansion, but it should be relatively easy for me to do so. I had a couple of dictionary things that I wanted to get timings for, so I'll toss this on the pile. My suspicion: the cost is not in the dictionary lookups but rather in processing the postings.

Bill had a bunch of examples of things that LiteMorph would get that Porter would miss, I'll see if I can dig them up.

Of course, my "don't throw information away" philosophy leads me to think that stemming your index terms without storing the unstemmed version means that there are queries that you can't pose, and there's no easy way that I know to go from a stemmed version to the unstemmed variants (I suppose you could store them.)

In Lucene, if one's using stemming does one typically put stemmed versions in one field and unstemmed in another?

There are lots of cases where LiteMorph does weird things. It's a bit too clever-clever about Latin endings, so, for example, we always got "javae" as an expansion for "java" (at least until I modified the exception list :-)

A configurable exception mechanism for LiteMorph would be nice come to think of it...

Posted by Stephen Green on March 04, 2009 at 03:58 AM EST #

Post a Comment:
Comments are closed for this entry.
About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today