Lightweight Morphology vs. Stemming
By searchguy on Mar 13, 2009
Otis asked about the cost incurred doing LiteMorph expansion of terms during searches. I haven't really looked at this before, since we don't stem indices by default, but it's a fair question.
I happen to have a few million email messages lying around, so I built two indices of around 1.7 million of them: one with our standard configuration and one stemmed. Here's what our standard index looks like at the end of the run, as reported by our QueryTest kitchen-sink tester program:
```
18 active partitions: 1007 (763425) 1266 (195146) 1521 (200812) 1797 (194458)
  2083 (185667) 2143 (35463) 2203 (37921) 2254 (42842) 2307 (34676) 2316 (6611)
  2327 (8591) 2334 (6901) 2344 (7566) 2345 (438) 2346 (1447) 2347 (554)
  2348 (763) 2349 (3)
Sorting specification is: -score
Partitions have:
 1723284 documents
 2337983867 tokens
 10212437 terms
```
There are 18 partitions in this index (that's pretty ripe for a merge, actually.) The numbers in the parentheses are the number of documents in each partition. You can see the number of documents, tokens, and terms in the entire index.
Here's what the stemmed index looks like:
```
16 active partitions: 881 (1248212) 1086 (238562) 1130 (49578) 1176 (50489)
  1211 (48669) 1246 (52899) 1252 (9199) 1258 (10005) 1264 (8902) 1265 (2001)
  1266 (38) 1267 (1963) 1268 (2001) 1269 (763) 1270 (2) 1271 (1)
Sorting specification is: -score
Partitions have:
 1723284 documents
 2337983867 tokens
 6970295 terms
```
I wrote a quick program to select some random words from the main dictionary of the largest partition in an index. In this case, that was partition 1007, whose main dictionary contained nearly 5.5 million words.
Because we want to test on "interesting" words, I restricted the search to words that are longer than 5 alphabetic characters and that occur in more than 0.5% of the documents in that partition (3817 documents, in this case). This resulted in 1806 terms (Zipf's Law in action, I guess!).
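The selection logic amounts to a simple filter over the dictionary entries followed by a shuffle. Here's a minimal sketch in Java; DictEntry, pickTerms, and the shape of the dictionary are stand-ins for illustration, not Minion's actual classes:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TermPicker {

    /** Stand-in for a main dictionary entry: a term and its document frequency. */
    record DictEntry(String name, int docFreq) {}

    /**
     * Keep "interesting" words: longer than 5 characters, purely alphabetic,
     * and occurring in more than 0.5% of the partition's documents. The
     * shuffle makes it easy to take a random sample off the front.
     */
    static List<String> pickTerms(List<DictEntry> dictionary, int partitionDocs, long seed) {
        List<String> candidates = new ArrayList<>();
        for (DictEntry e : dictionary) {
            if (e.name().length() > 5
                    && e.name().chars().allMatch(Character::isLetter)
                    && e.docFreq() > partitionDocs * 0.005) {
                candidates.add(e.name());
            }
        }
        Collections.shuffle(candidates, new Random(seed));
        return candidates;
    }
}
```

With a partition of 763425 documents, the 0.5% cutoff works out to "more than 3817 documents", matching the numbers above.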
Using our new query API, I wrote a program that takes each of the words and runs it as a single term query. The program takes a switch on the command line to indicate whether the terms should be stemmed or morphed.
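The timing side of that program is straightforward. Here's a hedged sketch of the harness; the runQuery function is just a placeholder for the real query API call, whose actual signature isn't shown here:

```java
import java.util.List;
import java.util.function.Function;

public class QueryTimer {

    /**
     * Run each term as a single-term query and return the total wall-clock
     * time in milliseconds. The Function stands in for whichever engine call
     * actually executes the query and returns a hit count.
     */
    static long timeQueries(List<String> terms, Function<String, Integer> runQuery) {
        long start = System.nanoTime();
        for (String term : terms) {
            runQuery.apply(term);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Dividing the returned total by the number of terms gives the average per-query time reported below.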
I'm running Solaris 10 on the test machine.
psrinfo -v on this machine says:
```
Status of virtual processor 0 as of: 03/11/2009 13:27:56
  on-line since 01/04/2009 22:05:36.
  The i386 processor operates at 2200 MHz,
        and has an i387 compatible floating point processor.
...
Status of virtual processor 7 as of: 03/11/2009 13:27:56
  on-line since 01/04/2009 22:05:54.
  The i386 processor operates at 2200 MHz,
        and has an i387 compatible floating point processor.
```
It's a four-processor box where each processor has two cores (it's a v40z). The box has 32GB of RAM, the Java version is 1.6.0_06, and I'm running with -Xmx1g (though the actual process size doesn't get much above 300MB).
The collections are stored on a ZFS file system on a disk array attached to the box via Fibre Channel. This is fairly low-performance storage (after we got it, we were told it was meant for near-line backup. Sigh.) Here's the pool:
```
NAME    SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
files  2.27T   415G  1.86T   17%  ONLINE  -
```
And the status:
```
  pool: files
 state: ONLINE
 scrub: none requested
config:

        NAME                                        STATE   READ WRITE CKSUM
        files                                       ONLINE     0     0     0
          raidz2                                    ONLINE     0     0     0
            c0t600C0FF00000000009234951BE0FE300d0s2 ONLINE     0     0     0
            c0t600C0FF00000000009234968DA6E9000d0s2 ONLINE     0     0     0
            c0t600C0FF00000000009234973FFDC2800d0s2 ONLINE     0     0     0
            c0t600C0FF000000000092349113B66D600d0s2 ONLINE     0     0     0
            c0t600C0FF000000000092349239DC55200d0s2 ONLINE     0     0     0
```
I'm pretty sure each of those vdevs is built out of multiple disks in the actual array. (Ask me about this if you think it matters, and I can find out.)
While preparing this blog entry, I ran the query program a number of times to get the output that I wanted, so these indices were probably warm in the disk cache. Anyway, here are the basic results:
| Index | Total Query Time (ms) | Avg. Query Time (ms) |
| --- | --- | --- |
The queries using lightweight expansion take about 1.9 times as long as the queries using stemming.
I was curious if the number of partitions in the index was affecting how long the queries were taking to run, so I generated versions of the indices where all of the partitions had been merged into a single partition. The results for this merged index were:
| Index | Total Query Time (ms) | Avg. Query Time (ms) |
| --- | --- | --- |
| Morph | 59615.21 | 33.01 |
| Stem | 30880.89 | 17.10 |
Merging buys a speedup of around 12% in both cases, but the ratio between the two stays about the same: the morph queries still take roughly 1.9 times as long as the stemmed ones (59615.21ms / 30880.89ms ≈ 1.93).
So, where's the extra time going, you ask? One of the things that we've added to Minion recently is query statistics processing. The engine keeps track of the number of dictionary lookups, dictionary cache hits, how many postings lists are loaded, how long it takes to load them, and so on. The stuff we're collecting is in the QueryStats class, and at any time you can ask the search engine for the current set of query statistics. Here are the stats for the merged, unstemmed index:
```
1806 queries in 59615.21ms, 33.01ms per query
Dictionary activity:
 34952 lookups in 1391.21ms (2.33% of total), 0.04ms per lookup
 Cache Hits:   208
 Cache Misses: 34744
Postings Activity
 9650 postings lists read in 2180.81ms (3.66% of total), 0.23ms per read
 Average Postings size: 67.7KB
 Term Cache Hits:   0
 Term Cache Misses: 0
 Generating term cache entries: 0.0ms (0.00% of total)
 Postings iteration: 25670.3ms (43.06% of total)
Query Activity
 Union processing: 25671.7ms (43.06% of total)
 Intersect processing: 0.0ms (0.00% of total)
 Sorting postings: 22929.8ms (38.46% of total)
 Normalizing scores: 3189.2ms (5.35% of total)
```
And here are the stats for the merged, stemmed index:
```
1806 queries in 30880.89ms, 17.10ms per query
Dictionary activity:
 1806 lookups in 340.58ms (1.10% of total), 0.19ms per lookup
 Cache Hits:   209
 Cache Misses: 1597
Postings Activity
 1806 postings lists read in 2063.93ms (6.68% of total), 1.14ms per read
 Average Postings size: 320.2KB
 Term Cache Hits:   0
 Term Cache Misses: 0
 Generating term cache entries: 0.0ms (0.00% of total)
 Postings iteration: 23008.0ms (74.51% of total)
Query Activity
 Union processing: 23009.4ms (74.51% of total)
 Intersect processing: 0.0ms (0.00% of total)
 Sorting postings: 0.0ms (0.00% of total)
 Normalizing scores: 3143.3ms (10.18% of total)
```
Note that there's some overlap in the categories (the postings iteration time, for instance, appears to be counted inside union processing, which is why both show 43.06%) and we don't count everything, so the numbers don't add up to 100%.
The big difference between the two is the "Sorting postings" time, which is the time spent integrating the postings for the morphological variants. Experience has shown us that it's faster to list out all of the documents from all of the variants and then sort that list, combining the scores for documents that contain more than one variant, than it is to merge the postings as we process each variant. In the stemmed case, of course, we only ever iterate through one postings list, so no sort is necessary.
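That sort-then-combine strategy can be sketched like this. This is a toy model over (docId, score) pairs, not Minion's actual postings types:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class VariantCombiner {

    /** Stand-in for one posting: a document ID and the score for this term. */
    record Posting(int docId, float score) {}

    /**
     * Dump every posting from every variant into one flat list, sort it by
     * document ID, then collapse runs of the same document by summing their
     * scores, so a document matching several variants gets one combined entry.
     */
    static List<Posting> combine(List<List<Posting>> variantPostings) {
        List<Posting> all = new ArrayList<>();
        for (List<Posting> plist : variantPostings) {
            all.addAll(plist);
        }
        all.sort(Comparator.comparingInt(Posting::docId));

        List<Posting> combined = new ArrayList<>();
        for (Posting p : all) {
            int last = combined.size() - 1;
            if (last >= 0 && combined.get(last).docId() == p.docId()) {
                combined.set(last,
                        new Posting(p.docId(), combined.get(last).score() + p.score()));
            } else {
                combined.add(p);
            }
        }
        return combined;
    }
}
```

The alternative would be a k-way merge across the variant postings lists as they're processed; the flat sort presumably wins here because it walks memory sequentially, which lines up with the experience described above.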