Tuesday Mar 03, 2009

LiteMorph: A lightweight morphology

Matt Chaput commented the other day about an aspect of the dictionaries, and in passing mentioned that his Python search engine (Whoosh! Good name, dude!) uses the lightweight morphology from Minion (ported to Python, obviously.)

Otis followed up asking where he could get more information, and I thought I would post an entry rather than burying it in the comments.

The lightweight morphology component (colloquially known as "LiteMorph") was originally developed by Bill Woods as an alternative to the stemmers that are traditionally used in search engines.

The basic idea behind LiteMorph is to take a word and replace known suffixes with other suffixes that are considered plausible in the morphology of the particular language of interest (there are LiteMorphs for English, German and Spanish). Here's an example of a LiteMorph rule for English:


.aeiouy o u s + -> ly,ness

This rule means: for a word that ends with a vowel followed by the letters ous, you can generate variations by adding the letters ly and ness to the word. So, a word like joyous will generate the variations joyously and joyousness.

Note that these rules were developed long before regular expressions were available in Java, so they use an invented syntax. Someone who wanted to port them to use regular expressions would be warmly welcomed into the Minion family!

This generativeness (there's a morph for you!) is a very nice property. For example, given the term build, the English version of LiteMorph generates the following list of variations:


build, builder, builders, building, buildings, builds, built

This is nice because this is the list of terms that we need to look up in the dicitionary to get the variations for build.

Each LiteMorph contains an exception list, so we can handle things like the irregular verb run:


ran, run, runner, runners, running, runnings, runs

The variations that are generated can be a bit weird, though. Here are the generated variants for happiness:


happied, happier, happiers, happies, happiest, happiful, happiless, happily, happiment, happiments, happiness, happinesses, happy, happying, happyings

The good news is that although these seem weird, we're looking them up in the dictionary so really weird variations (e.g., happiless, which I totally think should be a word) just won't appear and so they won't affect the search. On the other hand, variations that a stemmer would never get very well might appear and the LiteMorph will catch them.

I had q quick look around for papers explaining the syntax of the LiteMorph rules and came up with a paper for TaPor 2004 that was written by a grad student who implemented a French LiteMorph.

Saturday May 03, 2008

Minion and Lucene: Finding Variants

Back in the good old days, most search engines stemmed the terms being indexed.

The idea was that removing the suffixes on a word would save space (since you need to store fewer terms in the dictionary and store fewer postings), and it would allow the users to type in any variant of a particular term. The engine would stem the query terms before looking them up in the dictionary, resulting in the engine returning the documents for all variants of the term.

The problem with this approach is that it makes it impossible to search for a variant in exactly the way specified by the user. So, for example, you couldn't search for the surname woods without also getting hits for the singular wood.

By default, both Minion and Lucene store the word forms encountered in the documents in the index, rather than storing (for example) stemmed forms. The difference between the engines is that Minion provides for searching across term variants at query time. By default, Minion
searches for all known morphological variations of the query terms. We generate the variations using a lightweight morphological framework that uses a set of rules similar to the set used by stemmers. The interesting thing about this is that the lightweight morphology is generative, so that given a term we can produce a set of terms that we should try to lookup in the dictionary.

The lightweight morphology tends to overgenerate, but it overgenerates in a lot of the same ways that people tend to. The good thing is that if it generates something that's not really an English word (e.g., I've seen it generate happiless from happiness) then the dictionary lookup will fail and it won't impact the query results.

We currently have lightweight morphological analyzers for English, Spanish, and German (and one for French that we haven't integrated yet!)

This behavior can be modified with the use of the EXACT query operator. Additionally, Minion provides a language-independent stemmer that can be selected at query time using the STEM operator.

By default, Lucene only searches for the form provided in the query, so, for example, a query for dog will exclude documents that only include the plural form dogs. Lucene can be configured to stem the terms as they are added to the index and then stem the query terms, but this leads to the problem described above.

A solution to this is to use the lightweight morphological analyzer to generate the variants and then modify a query to look for any of the variants. In fact, we did this in some of our evaluations of Lucene.

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today