Monday Mar 16, 2009

Lightweight Morphology vs. Stemming: What's the Difference?

Otis asked whether there were any obvious cases where lightweight morphology does better than stemming, or where stemming does better than morphology. When I ran the performance numbers for a stemmed index versus an unstemmed one the other day, there were some instances where the numbers of returned documents were wildly different. I wrote a quick script to find the terms that had more than a given percentage difference in the number of documents returned for their queries, which would show me the queries where the differences were the largest.
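
The script was nothing fancy. Here's a minimal sketch of the idea in Java, assuming the per-term document counts have already been dumped to a tab-separated file of term, unstemmed count, and stemmed count (the file format and the threshold handling are my assumptions, not the original script):

import java.io.BufferedReader;
import java.io.FileReader;

// Hypothetical sketch: reads "term<TAB>unstemmedDocs<TAB>stemmedDocs" lines and
// prints the terms whose document counts differ by more than a given percentage.
public class CountDiff {
    public static void main(String[] args) throws Exception {
        double threshold = Double.parseDouble(args[1]); // e.g., 25 for 25%
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            String[] f = line.split("\t");
            double u = Double.parseDouble(f[1]);
            double s = Double.parseDouble(f[2]);
            double diff = 100 * Math.abs(u - s) / Math.max(u, s);
            if (diff > threshold) {
                System.out.println(f[0] + " " + u + " " + s + " " + diff + "%");
            }
        }
        in.close();
    }
}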

Let's look at a continuum here, starting with LiteMorph generating way more documents than stemming. Here's an interesting one:

Unstemmed Term    Stemmed Term    Unstemmed Docs    Stemmed Docs
knowledge         knowledg                500886           23574

Yow. The problem here is that it looks like we're over-generating for knowledge. From the kitchen sink, we can use the :morph command to show the variants for a term that actually occur in an index:

> :morph knowledge
 knowledge (22136)
 know (430909)
 known (74908)
 knew (4929)
 knows (17479)
 knowing (6967)

Here we're using the merged version of the unstemmed index. The numbers don't necessarily add up because multiple variants might occur in a single document. Since know is an irregular verb, it's in the exception table for LiteMorph, so this would be an easy fix (and I'll probably make it.)

A little further along the continuum we see an interesting set:

Unstemmed Term    Stemmed Term    Unstemmed Docs    Stemmed Docs
writes            write                   488448           94183
writing           write                   488448           94183
written           written                 488448           31512
writer            writer                  488448           15938

Judging by the numbers, I'd say that LiteMorph considers those terms to be equivalent, while the stemmer only considers writes and writing to be equivalent. Let's consult the sink:

> :morph writer
 writer (12729)
 write (65901)
 writers (3994)
 writes (17922)
 wrote (412434)
 writeable (615)
 written (31511)
 writings (84)
 writing (22414)

IMHO, this one's a draw. I would say that written should be in the equivalence set for writes, but that writer (i.e., someone who writes) is a bit of a tougher sell. The big miss here for the stemmer is that it didn't get the past tense wrote (why so many wrotes? Don't forget that this is from email archives!) This example is also in the exception table for LiteMorph, since write is an irregular verb.

This pattern shows up for other words in our test set like caller and submitter, which are not in the exception list, so we'd probably need to fix this one by modifying the LiteMorph rules and the exception table. If we decided to fix it, that is.

There are some clear cases where irregular verbs get missed by the stemmer, like the following:

Unstemmed Term    Stemmed Term    Unstemmed Docs    Stemmed Docs
caught            caught                   33674            9311

Perhaps my Porter stemmer is out of date? I suppose I should try this with the stemmer(s) in Lucene.

There are a lot of cases where the difference between the stemmed and unstemmed indices is only a few documents:

Unstemmed Term    Stemmed Term    Unstemmed Docs    Stemmed Docs
copied            copi                    193667          193638
copies            copi                    193667          193638
copying           copi                    193667          193638

In this case, LiteMorph includes copier and copiers, while the stemmer leaves them out.

At the far end of the spectrum are the terms for which LiteMorph produces far fewer documents than stemming. Here are a few interesting ones:

Unstemmed Term    Stemmed Term    Unstemmed Docs    Stemmed Docs
engine            engin                    22488          392368
engineer          engin                   377348          392368
engineering       engin                   377350          392368
engineers         engin                   377348          392368

The stemmer has conflated all of these words into the stem engine, while the LiteMorph engine considers them to be two separate clusters:

> :morph engine
 engine (21082)
 engined (2)
 enginers (4)
 engines (2438)
 enginer (53)
> :morph engineer
 engineer (268417)
 engineeing (13)
 engineering (119954)
 enginee (26)
 engineered (547)
 engineers (28743)
 enginees (1)
 engineerings (68)

There are also cases where I think stemming makes a conflation that is incorrect. For example:

Unstemmed Term    Stemmed Term    Unstemmed Docs    Stemmed Docs
inform            inform                   25020          412651
informed          inform                   25118          412651
information       inform                  396257          412651

I don't think the conflation of informed and information is very good, and LiteMorph doesn't make it.

All in all, for the 1806 terms that we tested, there were 820 terms whose results were within 100 documents of each other, 300 terms where LiteMorph produced more results, and 686 terms where stemming produced more results.

It's not clear to me that there's a firm conclusion to be drawn here, except that which one is better probably depends on what you're trying to do. Certainly, stemming the index will give you better performance on searching at the expense of not being able to search for exact forms (unless you index the root forms). LiteMorph allows you to generate variants, but there's clearly some pretty weird stuff in there.

Friday Mar 13, 2009

Lightweight Morphology vs. Stemming

Otis asked about the cost incurred doing LiteMorph expansion of terms during searches. I haven't really looked at this before, since we don't stem indices by default, but it's a fair question.

I happen to have a few million email messages lying around, so I indexed around 1.7 million of them twice: once with our standard index configuration and once with a stemmed index. Here's what our standard index looks like at the end of the run, as reported by our QueryTest kitchen-sink tester program:

18 active partitions: 1007 (763425) 1266 (195146) 1521 (200812) 
 1797 (194458) 2083 (185667) 2143 (35463) 2203 (37921) 2254 (42842) 
 2307 (34676) 2316 (6611) 2327 (8591) 2334 (6901) 2344 (7566) 
 2345 (438) 2346 (1447) 2347 (554) 2348 (763) 2349 (3) 
 Sorting specification is: -score
 Partitions have:
  1723284 documents
  2337983867 tokens
  10212437 terms

There are 18 partitions in this index (that's pretty ripe for a merge, actually.) The numbers in the parentheses are the number of documents in each partition. You can see the number of documents, tokens, and terms in the entire index.

Here's what the stemmed index looks like:

16 active partitions: 881 (1248212) 1086 (238562) 1130 (49578) 1176 (50489) 
 1211 (48669) 1246 (52899) 1252 (9199) 1258 (10005) 1264 (8902) 1265 (2001) 
 1266 (38) 1267 (1963) 1268 (2001) 1269 (763) 1270 (2) 1271 (1) 
 Sorting specification is: -score
 Partitions have:
  1723284 documents
  2337983867 tokens
  6970295 terms

I wrote a quick program to select some random words from the main dictionary of the largest partition in an index. In this case, that was partition 1007, whose main dictionary contained nearly 5.5 million words.

Because we want to test on "interesting" words, I restricted the search to words that are longer than 5 alphabetic characters and that occur in more than 0.5% of the documents in that partition (3817 documents, in this case). This resulted in 1806 terms (Zipf's Law in action, I guess!).
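
The selection filter itself is trivial; here's a sketch of the predicate (the class and method names are hypothetical, and the code that walks the dictionary is elided):

// Hypothetical sketch of the term-selection filter: keep terms longer than 5
// alphabetic characters that occur in more than 0.5% of the partition's documents
// (3817 of 763425 in this case).
public class TermFilter {
    public static boolean interesting(String term, int docFreq, int partitionDocs) {
        if (term.length() <= 5 || docFreq <= 0.005 * partitionDocs) {
            return false;
        }
        for (int i = 0; i < term.length(); i++) {
            if (!Character.isLetter(term.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}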

Using our new query API, I wrote a program that takes each of the words and runs it as a single term query. The program takes a switch on the command line to indicate whether the terms should be stemmed or morphed.
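
The program isn't much more than a loop over the terms. Here's a rough sketch of its shape; the SearchEngineFactory and ResultSet calls are my approximation of the Minion query API rather than the actual program, and the stemmed/morphed switch is left out:

import java.io.BufferedReader;
import java.io.FileReader;
import com.sun.labs.minion.ResultSet;
import com.sun.labs.minion.SearchEngine;
import com.sun.labs.minion.SearchEngineFactory;

// Rough sketch only: the factory and search calls approximate the Minion query
// API; check the real signatures before using this.
public class TermTimer {
    public static void main(String[] args) throws Exception {
        SearchEngine engine = SearchEngineFactory.getSearchEngine(args[0]);
        BufferedReader terms = new BufferedReader(new FileReader(args[1]));
        long total = 0;
        int n = 0;
        for (String term = terms.readLine(); term != null; term = terms.readLine()) {
            long start = System.currentTimeMillis();
            ResultSet rs = engine.search(term); // single-term query, morphed by default
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(term + " " + rs.size() + " " + elapsed + "ms");
            total += elapsed;
            n++;
        }
        System.out.println(n + " queries in " + total + "ms, " + ((double) total / n) + "ms per query");
        terms.close();
        engine.close();
    }
}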

I'm running Solaris 10 on the test machine. psrinfo -v on this machine says:

Status of virtual processor 0 as of: 03/11/2009 13:27:56
  on-line since 01/04/2009 22:05:36.
  The i386 processor operates at 2200 MHz,
        and has an i387 compatible floating point processor.
...
Status of virtual processor 7 as of: 03/11/2009 13:27:56
  on-line since 01/04/2009 22:05:54.
  The i386 processor operates at 2200 MHz,
        and has an i387 compatible floating point processor.

It's a four-processor box where each processor has two cores (it's a v40z). The box has 32GB of RAM, the Java version is 1.6.0_06, and I'm running with -Xmx1g (the actual process size doesn't get much above 300MB, though).

The collections are stored on a ZFS file system that's on a disk array attached to the box via Fibre Channel. This is fairly low-performing storage (after we got it, we were told that it was meant for near-line backup. Sigh.) Here's the pool:

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
files                  2.27T    415G   1.86T    17%  ONLINE     -

And the status:

  pool: files
 state: ONLINE
 scrub: none requested
config:

	NAME                                         STATE     READ WRITE CKSUM
	files                                        ONLINE       0     0     0
	  raidz2                                     ONLINE       0     0     0
	    c0t600C0FF00000000009234951BE0FE300d0s2  ONLINE       0     0     0
	    c0t600C0FF00000000009234968DA6E9000d0s2  ONLINE       0     0     0
	    c0t600C0FF00000000009234973FFDC2800d0s2  ONLINE       0     0     0
	    c0t600C0FF000000000092349113B66D600d0s2  ONLINE       0     0     0
	    c0t600C0FF000000000092349239DC55200d0s2  ONLINE       0     0     0

I'm pretty sure each of those vdevs is built out of multiple disks in the actual array. (Ask me about this if you think it matters, and I can find out.)

While preparing this blog entry, I ran the query program a number of times in order to get the output that I wanted, so these indices were probably warm in the disk cache. Anyways, here are the basic results:

Index        Total Query Time (ms)    Avg. Query Time (ms)
Unstemmed                 67232.25                   37.23
Stemmed                   34732.71                   19.23

The queries using lightweight expansion take about 1.9 times as long as the queries using stemming.

I was curious if the number of partitions in the index was affecting how long the queries were taking to run, so I generated versions of the indices where all of the partitions had been merged into a single partition. The results for this merged index were:

Index                 Total Query Time (ms)    Avg. Query Time (ms)
Unstemmed (merged)                 59615.21                   33.01
Stemmed (merged)                   30880.89                   17.10

This is a speedup of around 12% in both cases, and the ratio between the unstemmed and stemmed times stays about the same.

So, where's the extra time going, you ask? One of the things that we've added to Minion recently is query statistics processing. The engine keeps track of the number of dictionary lookups, dictionary cache hits, how many postings lists are loaded, how long it takes to load them, and so on. The stuff we're collecting is in the QueryStats class, and at any time you can ask the search engine for the current set of query statistics. Here are the stats for the merged version of the unstemmed index:

1806 queries in 59615.21ms, 33.01ms per query
Dictionary activity:
 34952 lookups in 1391.21ms (2.33% of total), 0.04ms per lookup
 Cache Hits:                           208
 Cache Misses:                       34744
Postings Activity
 9650 postings lists read in 2180.81ms (3.66% of total), 0.23ms per read
 Average Postings size:               67.7KB
 Term Cache Hits:                        0
 Term Cache Misses:                      0
 Generating term cache entries:        0.0ms (0.00% of total)
 Postings iteration:               25670.3ms (43.06% of total)
Query Activity
 Union processing:                 25671.7ms (43.06% of total)
 Intersect processing:                 0.0ms (0.00% of total)
 Sorting postings:                 22929.8ms (38.46% of total)
 Normalizing scores:                3189.2ms (5.35% of total)

And here are the stats for the merged, stemmed index:

1806 queries in 30880.89ms, 17.10ms per query
Dictionary activity:
 1806 lookups in 340.58ms (1.10% of total), 0.19ms per lookup
 Cache Hits:                           209
 Cache Misses:                        1597
Postings Activity
 1806 postings lists read in 2063.93ms (6.68% of total), 1.14ms per read
 Average Postings size:              320.2KB
 Term Cache Hits:                        0
 Term Cache Misses:                      0
 Generating term cache entries:        0.0ms (0.00% of total)
 Postings iteration:               23008.0ms (74.51% of total)
Query Activity
 Union processing:                 23009.4ms (74.51% of total)
 Intersect processing:                 0.0ms (0.00% of total)
 Sorting postings:                     0.0ms (0.00% of total)
 Normalizing scores:                3143.3ms (10.18% of total)

Note that there's some overlap in the categories and we don't count everything, so the numbers don't add up to 100%.

The big difference between the two is the "Sorting postings" time. This is the time spent integrating the postings for the morphological variants. When we're processing the postings for the variants, experience has shown that it's faster to list out all of the documents and then sort them, combining the scores for documents that contain more than one variant, than it is to merge them as we process each variant. Of course, in the stemmed case we only ever iterate through one postings list, so the sort isn't necessary.
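
To make that concrete, here's a hedged sketch of the list-then-sort-and-combine strategy using a plain list of hits; the real code works against postings iterators, and the types here are hypothetical:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the "list out, then sort and combine" strategy for variant postings.
// Each variant contributes (docID, score) hits; after sorting by docID, runs of
// equal docIDs are collapsed into a single scored document.
public class CombineHits {
    static class Hit implements Comparable<Hit> {
        int doc;
        float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
        public int compareTo(Hit o) { return doc - o.doc; }
    }

    static List<Hit> combine(List<Hit> hits) {
        Collections.sort(hits);
        List<Hit> out = new ArrayList<Hit>();
        for (Hit h : hits) {
            Hit last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && last.doc == h.doc) {
                last.score += h.score; // same document seen via another variant
            } else {
                out.add(new Hit(h.doc, h.score));
            }
        }
        return out;
    }
}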

Tuesday Mar 03, 2009

LiteMorph: A lightweight morphology

Matt Chaput commented the other day about an aspect of the dictionaries, and in passing mentioned that his Python search engine (Whoosh! Good name, dude!) uses the lightweight morphology from Minion (ported to Python, obviously.)

Otis followed up asking where he could get more information, and I thought I would post an entry rather than burying it in the comments.

The lightweight morphology component (colloquially known as "LiteMorph") was originally developed by Bill Woods as an alternative to the stemmers that are traditionally used in search engines.

The basic idea behind LiteMorph is to take a word and replace known suffixes with other suffixes that are considered plausible in the morphology of the particular language of interest (there are LiteMorphs for English, German and Spanish). Here's an example of a LiteMorph rule for English:


.aeiouy o u s + -> ly,ness

This rule means: for a word that ends with a vowel followed by the letters ous, you can generate variations by adding the letters ly and ness to the word. So, a word like joyous will generate the variations joyously and joyousness.

Note that these rules were developed long before regular expressions were available in Java, so they use an invented syntax. Someone who wanted to port them to use regular expressions would be warmly welcomed into the Minion family!
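
As a hint of what such a port might look like, here's a sketch of the rule above expressed with java.util.regex. The class and method names are mine, not Minion's:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical regex version of ".aeiouy o u s + -> ly,ness": if the word ends
// in a vowel followed by "ous", generate variants by appending "ly" and "ness".
public class OusRule {
    private static final Pattern ENDS_VOWEL_OUS = Pattern.compile(".*[aeiouy]ous$");

    public static List<String> apply(String word) {
        List<String> variants = new ArrayList<String>();
        if (ENDS_VOWEL_OUS.matcher(word).matches()) {
            variants.add(word + "ly");
            variants.add(word + "ness");
        }
        return variants;
    }

    public static void main(String[] args) {
        System.out.println(apply("joyous")); // [joyously, joyousness]
    }
}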

This generativeness (there's a morph for you!) is a very nice property. For example, given the term build, the English version of LiteMorph generates the following list of variations:


build, builder, builders, building, buildings, builds, built

This is nice because this is exactly the list of terms that we need to look up in the dictionary to get the variations for build.
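
In code, generating and using that list looks roughly like this. I'm assuming the LiteMorph_en class and its variantsOf method as they appear in the Minion sources, so check the exact names and package against the version you're using:

import java.util.Set;
import com.sun.labs.minion.lexmorph.LiteMorph;
import com.sun.labs.minion.lexmorph.LiteMorph_en;

// Assumed API: LiteMorph_en.getInstance() and variantsOf() as in the Minion
// sources; the package and method names may differ in your version.
public class Variants {
    public static void main(String[] args) {
        LiteMorph morph = LiteMorph_en.getInstance();
        Set<String> variants = morph.variantsOf("build");
        System.out.println(variants); // build, builder, builders, building, buildings, builds, built
    }
}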

Each LiteMorph contains an exception list, so we can handle things like the irregular verb run:


ran, run, runner, runners, running, runnings, runs

The variations that are generated can be a bit weird, though. Here are the generated variants for happiness:


happied, happier, happiers, happies, happiest, happiful, happiless, happily, happiment, happiments, happiness, happinesses, happy, happying, happyings

The good news is that although these seem weird, we're looking them up in the dictionary, so really weird variations (e.g., happiless, which I totally think should be a word) just won't appear, and so they won't affect the search. On the other hand, variations that a stemmer would never get might very well appear, and LiteMorph will catch them.

I had a quick look around for papers explaining the syntax of the LiteMorph rules and came up with a paper for TaPor 2004 that was written by a grad student who implemented a French LiteMorph.

Saturday May 03, 2008

Minion and Lucene: Finding Variants

Back in the good old days, most search engines stemmed the terms being indexed.

The idea was that removing the suffixes on a word would save space (since you need to store fewer terms in the dictionary and store fewer postings), and it would allow the users to type in any variant of a particular term. The engine would stem the query terms before looking them up in the dictionary, resulting in the engine returning the documents for all variants of the term.

The problem with this approach is that it makes it impossible to search for a variant in exactly the way specified by the user. So, for example, you couldn't search for the surname woods without also getting hits for the singular wood.

By default, both Minion and Lucene store the word forms encountered in the documents in the index, rather than storing (for example) stemmed forms. The difference between the engines is that Minion provides for searching across term variants at query time. By default, Minion searches for all known morphological variations of the query terms. We generate the variations using a lightweight morphological framework that uses a set of rules similar to the set used by stemmers. The interesting thing about this is that the lightweight morphology is generative, so that given a term we can produce a set of terms that we should try to look up in the dictionary.

The lightweight morphology tends to overgenerate, but it overgenerates in a lot of the same ways that people tend to. The good thing is that if it generates something that's not really an English word (e.g., I've seen it generate happiless from happiness) then the dictionary lookup will fail and it won't impact the query results.

We currently have lightweight morphological analyzers for English, Spanish, and German (and one for French that we haven't integrated yet!)

This behavior can be modified with the use of the EXACT query operator. Additionally, Minion provides a language-independent stemmer that can be selected at query time using the STEM operator.

By default, Lucene only searches for the form provided in the query, so, for example, a query for dog will exclude documents that only include the plural form dogs. Lucene can be configured to stem the terms as they are added to the index and then stem the query terms, but this leads to the problem described above.
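
For reference, configuring Lucene to stem is just a matter of putting a stemming filter into the analysis chain. Here's a minimal sketch against the Lucene 2.x API of the day; using the same analyzer at index and query time gives the conflation (and the exact-form problem) described above:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.TokenStream;

// Minimal stemming analyzer: lowercase tokens, then Porter-stem them.
public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }
}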

A solution to this is to use the lightweight morphological analyzer to generate the variants and then modify a query to look for any of the variants. In fact, we did this in some of our evaluations of Lucene.
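
The query-side version of that is also small: OR together whatever variants the morphology produces. This sketch just takes the variant set as an argument, since the expansion itself isn't a Lucene call:

import java.util.Set;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch: build an OR query over the morphological variants of a term. The
// variants set would come from the lightweight morphology, not from Lucene.
public class VariantQuery {
    public static Query forTerm(String field, Set<String> variants) {
        BooleanQuery q = new BooleanQuery();
        for (String v : variants) {
            q.add(new TermQuery(new Term(field, v)), BooleanClause.Occur.SHOULD);
        }
        return q;
    }
}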

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.
