Minion and Lucene: Performance
By searchguy on May 09, 2008
We did some performance comparisons a while ago, and the results probably deserve a full post. Part of the problem with performance comparisons is that it's hard to set up identical conditions for both systems.
I think it would be a good idea for all of the open-source engines to get together, find a nice open document collection (the Apache mailing list archives and their associated searches?), and build a good set of regression tests and some pooled relevance judgments so that we can test retrieval performance without having to rely on the TREC data.
At any rate, we used Lucene 2.0 and a build of Minion to write indexers for email messages, and used those to build indexes of 1.1 million messages each. We then ran a bunch of queries against each index. The messages and queries were taken from an archive of about 5 million messages that we host internally. We ran the queries serially and in parallel with up to 32 threads on a couple of Sun x64 and Niagara boxes.
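For the curious, the Lucene half of the indexing setup boils down to something like the sketch below. This is a minimal, hypothetical reconstruction rather than our actual code: the field names, paths, and message contents are made up, but the API is the Lucene 2.0-era one we were using.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class MailIndexer {
    public static void main(String[] args) throws Exception {
        // Lucene 2.0-era API: the boolean argument creates a fresh index.
        IndexWriter writer =
            new IndexWriter("/tmp/mail-index", new StandardAnalyzer(), true);

        // The real test looped over 1.1 million parsed messages; one
        // hypothetical message here shows the shape of the loop body.
        Document doc = new Document();
        doc.add(new Field("subject", "Re: performance comparisons",
                          Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("body", "We ran the queries serially and in parallel.",
                          Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();  // merge segments down before querying
        writer.close();
    }
}
```

The Minion side does the analogous document-at-a-time work through Minion's own API.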
The practical upshot was that the Minion indexer was substantially faster than the Lucene indexer (at the cost of using more memory), and query times were comparable under similar conditions. Of course, indexing speed probably isn't that big a deal in most situations.
Our "standard" config Lucene did a lot better than our default Minion config, which is totally understandable, since Minion was doing a lot more work (e.g., doing query term expansion, processing the case sensitive postings and searching all the fields). When we feed Lucene our expanded terms, and both engines are doing straightforward boolean AND, the difference in query times is about 10% in Lucene's favor. This sounds like a lot, but we're talking about 35ms vs 39ms.
I have no doubt that we could go in and massage the query evaluator (or come up with a more efficient postings format, or ...) to get that 10% back, and I'm sure that the Lucene folks could then go in and get another 10%: lather, rinse, repeat.
With respect to global IDF, we store a separate dictionary containing global term statistics, so when we're doing weighting, we're using the statistics for the collection as a whole. We actually try to be fairly careful about the term statistics, especially for things like document categorization.
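As a sketch of why this matters: if each index partition computed IDF from its own local document counts, the same term would get different weights in different partitions, and scores wouldn't be comparable. Using the global statistics keeps the weights consistent. The numbers and the plain TF×IDF formula below are illustrative only, not Minion's actual weighting function.

```java
public class GlobalWeight {
    // Generic log-scaled IDF; Minion's real weighting function may differ.
    static double idf(long nDocs, long docFreq) {
        return Math.log((double) nDocs / (double) docFreq);
    }

    public static void main(String[] args) {
        // Hypothetical statistics pulled from the global dictionary.
        long collectionDocs = 5000000L;  // the whole collection...
        long termDocFreq = 12000L;       // ...not one partition's counts
        int termFreq = 3;                // occurrences in a single document
        double weight = termFreq * idf(collectionDocs, termDocFreq);
        System.out.printf("weight = %.3f%n", weight);
    }
}
```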
As for distributed search, Minion is like Lucene in that it's a simple indexing and retrieval library. We're working on distributing the search as part of Aura, but I can't really comment on the performance there yet as it's still pretty early days. I will say that it was pretty pleasingly fast on a 7 million document index distributed into 16 pieces — fast enough that I thought it was broken when we ran the first queries. I'll let you know how it goes when we have 100 million or a billion docs :-)
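For a rough idea of the shape of that distributed search, the basic pattern is scatter-gather: fan the query out to all of the partitions in parallel, then merge the per-partition top hits by score. (This is where the global term statistics above earn their keep, since scores coming back from different partitions have to be comparable.) The sketch below is hypothetical and much simpler than what Aura actually does; the Partition interface and Hit class are stand-ins.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScatterGather {

    static class Hit {
        final String docId;
        final float score;
        Hit(String docId, float score) { this.docId = docId; this.score = score; }
    }

    /** Stand-in for one piece of the distributed index. */
    interface Partition {
        List<Hit> search(String query, int n) throws Exception;
    }

    static List<Hit> search(List<Partition> parts, final String query, final int n)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parts.size());
        List<Future<List<Hit>>> futures = new ArrayList<Future<List<Hit>>>();

        // Scatter: each partition evaluates the query independently.
        for (final Partition p : parts) {
            futures.add(pool.submit(new Callable<List<Hit>>() {
                public List<Hit> call() throws Exception {
                    return p.search(query, n);
                }
            }));
        }

        // Gather: merge the per-partition top-n lists and re-sort by score.
        List<Hit> merged = new ArrayList<Hit>();
        for (Future<List<Hit>> f : futures) {
            merged.addAll(f.get());
        }
        pool.shutdown();

        Collections.sort(merged, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score);
            }
        });
        return merged.subList(0, Math.min(n, merged.size()));
    }

    public static void main(String[] args) throws Exception {
        // Sixteen fake partitions, each returning one canned hit.
        List<Partition> parts = new ArrayList<Partition>();
        for (int i = 0; i < 16; i++) {
            final int id = i;
            parts.add(new Partition() {
                public List<Hit> search(String q, int n) {
                    return Collections.singletonList(
                        new Hit("doc-" + id, (float) Math.random()));
                }
            });
        }
        for (Hit h : search(parts, "aura", 5)) {
            System.out.println(h.docId + "\t" + h.score);
        }
    }
}
```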