Lightweight Morphology vs. Stemming

Otis asked about the cost incurred doing LiteMorph expansion of terms during searches. I haven't really looked at this before, since we don't stem indices by default, but it's a fair question.

I happen to have a few million email messages lying around, so I ran up an index of around 1.7 million of them with our standard index and a stemmed index. Here's what our standard index looks like at the end of the run, as reported by our QueryTest kitchen-sink tester program:

18 active partitions: 1007 (763425) 1266 (195146) 1521 (200812) 
 1797 (194458) 2083 (185667) 2143 (35463) 2203 (37921) 2254 (42842) 
 2307 (34676) 2316 (6611) 2327 (8591) 2334 (6901) 2344 (7566) 
 2345 (438) 2346 (1447) 2347 (554) 2348 (763) 2349 (3) 
 Sorting specification is: -score
 Partitions have:
  1723284 documents
  2337983867 tokens
  10212437 terms

There are 18 partitions in this index (that's pretty ripe for a merge, actually.) The numbers in the parentheses are the number of documents in each partition. You can see the number of documents, tokens, and terms in the entire index.

Here's what the stemmed index looks like:

16 active partitions: 881 (1248212) 1086 (238562) 1130 (49578) 1176 (50489) 
 1211 (48669) 1246 (52899) 1252 (9199) 1258 (10005) 1264 (8902) 1265 (2001) 
 1266 (38) 1267 (1963) 1268 (2001) 1269 (763) 1270 (2) 1271 (1) 
 Sorting specification is: -score
 Partitions have:
  1723284 documents
  2337983867 tokens
  6970295 terms

I wrote a quick program to select some random words from the main dictionary of the largest partition in an index. In this case, that was partition 1007, whose main dictionary contained nearly 5.5 million words.

Because we want to test on "interesting" words, I restricted the search to words that are longer than 5 alphabetic characters and that occur in more than 0.5% of the documents in that partition (3817 documents, in this case). This resulted in 1806 terms (Zipf's Law in action, I guess!).

Using our new query API, I wrote a program that takes each of the words and runs it as a single term query. The program takes a switch on the command line to indicate whether the terms should be stemmed or morphed.

I'm running Solaris 10 on the test machine. psrinfo -v on this machine says:

Status of virtual processor 0 as of: 03/11/2009 13:27:56
  on-line since 01/04/2009 22:05:36.
  The i386 processor operates at 2200 MHz,
        and has an i387 compatible floating point processor.
Status of virtual processor 7 as of: 03/11/2009 13:27:56
  on-line since 01/04/2009 22:05:54.
  The i386 processor operates at 2200 MHz,
        and has an i387 compatible floating point processor.

It's a four processor box, where each processor has two cores (it's a v40z.) The box has 32GB of RAM, the java is version 1.6.0_06, and I'm running with -Xmx1g (the actual process size doesn't get much above 300MB, though.)

The collections are stored on a ZFS file system that's on a disk array attached to the box via fiber channel. This is fairly low-performing storage (after we got it we were told that it was meant for near line backup. Sigh.). Here's the pool:

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
files                  2.27T    415G   1.86T    17%  ONLINE     -
And the status:
  pool: files
 state: ONLINE
 scrub: none requested

	NAME                                         STATE     READ WRITE CKSUM
	files                                        ONLINE       0     0     0
	  raidz2                                     ONLINE       0     0     0
	    c0t600C0FF00000000009234951BE0FE300d0s2  ONLINE       0     0     0
	    c0t600C0FF00000000009234968DA6E9000d0s2  ONLINE       0     0     0
	    c0t600C0FF00000000009234973FFDC2800d0s2  ONLINE       0     0     0
	    c0t600C0FF000000000092349113B66D600d0s2  ONLINE       0     0     0
	    c0t600C0FF000000000092349239DC55200d0s2  ONLINE       0     0     0

I'm pretty sure each of those vdevs is built out of multiple disks in the actual array. (Ask me about this if you think it matters, and I can find out.)

While preparing this blog entry, I ran the query program a number of times in order to get the output that I wanted, so these indices were probably warm in the disk cache. Anyways, here's the basic results:

IndexTotal Query Time (ms)Avg. Query Time (ms)

The queries using lightweight expansion take about 1.9 times as long as the queries using stemming.

I was curious if the number of partitions in the index was affecting how long the queries were taking to run, so I generated versions of the indices where all of the partitions had been merged into a single partition. The results for this merged index were:

IndexTotal Query Time (ms)Avg. Query Time (ms)
Unstemmed (merged)59615.2133.01
Stemmed (merged) 30880.8917.01

This is a speedup of around 12% in both cases, but about the same ratio for the times.

So, where's the extra time going, you ask? One of the things that we've added to Minion recently is query statistics processing. The engine keeps track of the number of dictionary lookups, dictionary cache hits, how many postings lists are loaded, how long it takes to load them, an so on. The stuff we're collecting is in the QueryStats class, and at any time you can ask the search engine for the current set of query statistics. Here's the stats for the merged index for the unstemmed index:

1806 queries in 59615.21ms, 33.01ms per query
Dictionary activity:
 34952 lookups in 1391.21ms (2.33% of total), 0.04ms per lookup
 Cache Hits:                           208
 Cache Misses:                       34744
Postings Activity
 9650 postings lists read in 2180.81ms (3.66% of total), 0.23ms per read
 Average Postings size:               67.7KB
 Term Cache Hits:                        0
 Term Cache Misses:                      0
 Generating term cache entries:        0.0ms (0.00% of total)
 Postings iteration:               25670.3ms (43.06% of total)
Query Activity
 Union processing:                 25671.7ms (43.06% of total)
 Intersect processing:                 0.0ms (0.00% of total)
 Sorting postings:                 22929.8ms (38.46% of total)
 Normalizing scores:                3189.2ms (5.35% of total)
and here's the stats for the merged, stemmed index:
1806 queries in 30880.89ms, 17.10ms per query
Dictionary activity:
 1806 lookups in 340.58ms (1.10% of total), 0.19ms per lookup
 Cache Hits:                           209
 Cache Misses:                        1597
Postings Activity
 1806 postings lists read in 2063.93ms (6.68% of total), 1.14ms per read
 Average Postings size:              320.2KB
 Term Cache Hits:                        0
 Term Cache Misses:                      0
 Generating term cache entries:        0.0ms (0.00% of total)
 Postings iteration:               23008.0ms (74.51% of total)
Query Activity
 Union processing:                 23009.4ms (74.51% of total)
 Intersect processing:                 0.0ms (0.00% of total)
 Sorting postings:                     0.0ms (0.00% of total)
 Normalizing scores:                3143.3ms (10.18% of total)

Note that there's some overlap in the categories and we don't count everything, so the numbers don't add up to 100%.

The big difference between these two is the "Sorting Postings" time. This is the time that's spent integrating the postings for the morphological variants. When we're processing the postings for the variants, experience has shown us that it's faster to list out all the documents and then sort them to combine the scores from documents with more than one variant than it is to merge them as we're processing the variants. Of course, in the stemmed case, we only ever iterate through one postings list, so the sort isn't necessary.


Post a Comment:
Comments are closed for this entry.

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.


« July 2016