Lightweight Morphology vs. Stemming: What's the Difference?

Otis asked whether there were any obvious cases where lightweight morphology does better than stemming, or where stemming does better than morphology. When running the performance numbers for a stemmed index versus an unstemmed one the other day, there were some instances where the number of returned documents were wildly different. I wrote a quick script to find the terms that had more than a given percentage difference in the number of documents returned for the queries, which would show me the queries where the differences were the largest.

Let's look at a continuum here, starting with LiteMorph generating way more documents than stemming. Here's an interesting one:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs
knowledge knowledg 50088623574

Yow. The problem here is that it looks like we're over-generating for knowledge. From the kitchen sink, we can use the :morph command to show the variants for a term that actually occur in an index:

> :morph knowledge
 knowledge (22136)
 know (430909)
 known (74908)
 knew (4929)
 knows (17479)
 knowing (6967)

Here we're using the merged version of the unstemmed index. The numbers don't necessarily add up because multiple variants might occur in a single document. Since know is an irregular verb, it's in the exception table for LiteMorph, so this would be an easy fix (and I'll probably make it.)

A little further along the continuum we see an interesting set:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs
writeswrite48844894183
writingwrite48844894183
writtenwritten48844831512
writerwriter48844815938

Judging by the numbers, I'd say that litemorph considers those terms to be equivalent, while the stemmer only considers writes and writing to be equivalent. Let's consult the sink:

> :morph writer
 writer (12729)
 write (65901)
 writers (3994)
 writes (17922)
 wrote (412434)
 writeable (615)
 written (31511)
 writings (84)
 writing (22414)

IMHO, this one's a draw. I would say that written should be in the equivalence set for writes, but that writer (i.e, someone who writes) is a bit of a tougher sell. The big miss here for the stemmer is that it didn't get the past tense wrote (why so many wrotes? Don't forget that this is from email archives!) This example is also the exception table for LiteMorph, since write is an irregular verb.

This pattern shows up for other words in our test set like caller and submitter, which are not in the exception list, so we'd probably need to fix this one by modifying the LiteMorph rules and the exception table. If we decided to fix it, that is.

There are some clear cases where irregular verbs get missed by the stemmer, like the following:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs
caughtcaught336749311

Perhaps my Porter stemmer is out of date? I suppose I should try this with the stemmer(s) in Lucene.

There are a lot of cases where the difference between the stemmed and unstemmed indices is only a few documents:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs
copiedcopi193667193638
copiescopi193667193638
copyingcopi193667193638

LiteMorph allows and the stemmer leaves out copier and copiers in this case.

At the far end of the spectrum are the terms for which LiteMorph produces far fewer documents than stemming. Here are a few interesting ones:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs
engineengin22488392368
engineerengin377348392368
engineeringengin377350392368
engineersengin377348392368

The stemmer has conflated all of these words into the stem engine, while the LiteMorph engine considers them to be two separate clusters:

> :morph engine
 engine (21082)
 engined (2)
 enginers (4)
 engines (2438)
 enginer (53)
> :morph engineer
 engineer (268417)
 engineeing (13)
 engineering (119954)
 enginee (26)
 engineered (547)
 engineers (28743)
 enginees (1)
 engineerings (68)

There are also cases where I think stemming makes a conflation that is incorrect. For example:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs
informinform25020412651
informedinform25118412651
informationinform396257412651

I don't think the conflation of informed and information is very good, and LiteMorph doesn't make it.

All in all, for the 1806 terms that we tested, there were 820 terms whose results were within 100 documents of each other, 300 terms where LiteMorph produced more results, and 686 terms where stemming produced more results.

It's not clear to me that there's a firm conclusion to be drawn here, except that which one is better probably depends on what you're trying to do. Certainly, stemming the index will give you better performance on searching at the expense of not being able to search for exact forms (unless you index the root forms). LiteMorph allows you to generate variants, but there's clearly some pretty weird stuff in there.

Comments:

[Trackback] Bookmarked your post over at Blog Bookmarker.com!

Posted by weird on March 16, 2009 at 04:17 PM EDT #

Hi Stephen,

Minion looks very interesting and promising.

I see it has "Draft" status on the download page ( "Engine source 08-26-2008" )

How would you describe the current versions of Minion that are in svn repositiry? Are they Alpha, Beta? Or may be there is a release candidate?

Regards, Sergey

Posted by Sergey on March 16, 2009 at 11:16 PM EDT #

Post a Comment:
Comments are closed for this entry.
About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today