Minion and Lucene: Performance

Otis asked about Minion performance and how it compares to Lucene.

We did some performance comparisons a while ago, and they probably deserve a full post of their own. Part of the problem with performance comparisons is that it's hard to get the same conditions for both systems.

I think it would be a good idea for all of the open source engines to get together, find a nice open document collection (the Apache mailing list archives and their associated searches, perhaps?), and build a nice set of regression tests and some pooled relevance judgments so that we can test retrieval performance without having to rely on the TREC data.

At any rate, we used Lucene 2.0 and a build of Minion to write an email-message indexer for each engine. We used these to build indexes of 1.1 million email messages and then ran a bunch of queries against each index. The messages and queries were taken from an archive of about 5 million messages that we provide internally. We ran the queries serially and in parallel, with up to 32 threads, on a couple of Sun x64 and Niagara boxes.
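
To give a sense of what the Lucene side looked like, here's a from-memory sketch against the Lucene 2.0-era API. The MailMessage reader and the field names are illustrative stand-ins, not our actual harness:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class MailIndexer {
        public static void main(String[] args) throws Exception {
            // Create a fresh index in args[0], using the stock analyzer.
            IndexWriter writer = new IndexWriter(args[0], new StandardAnalyzer(), true);

            // MailMessage is a hypothetical reader for the mail archive.
            for (MailMessage msg : MailMessage.readArchive(args[1])) {
                Document doc = new Document();
                // Store subject and date for display; tokenize subject and body for search.
                doc.add(new Field("subject", msg.getSubject(), Field.Store.YES, Field.Index.TOKENIZED));
                doc.add(new Field("date", msg.getDate(), Field.Store.YES, Field.Index.UN_TOKENIZED));
                doc.add(new Field("body", msg.getBody(), Field.Store.NO, Field.Index.TOKENIZED));
                writer.addDocument(doc);
            }

            writer.optimize(); // merge segments; the usual last step in Lucene 2.x
            writer.close();
        }
    }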

The practical upshot was that the Minion indexer was substantially faster than the Lucene indexer (at the cost of using more memory), and query times were comparable under similar conditions. Of course, indexing speed probably isn't that big a deal in most situations.

Lucene with our "standard" config did a lot better than Minion with its default config, which is totally understandable, since Minion was doing a lot more work (e.g., expanding the query terms, processing the case-sensitive postings, and searching all the fields). When we fed Lucene our expanded terms, so that both engines were doing a straightforward boolean AND, the difference in query times was about 10% in Lucene's favor. That sounds like a lot, but we're talking about 35ms vs. 39ms.
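
For those runs, the Lucene query was just a conjunction over the expanded terms. One plausible shape for that, again as a sketch (expandedQueryTerms and indexDir are assumed to exist; the expansion itself has already happened):

    import java.util.List;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    // AND across the original query terms; OR within each term's expansions
    // (case variants, morphology), so both engines evaluate the same terms.
    BooleanQuery query = new BooleanQuery();
    for (List<String> variants : expandedQueryTerms) {
        BooleanQuery oneTerm = new BooleanQuery();
        for (String v : variants) {
            oneTerm.add(new TermQuery(new Term("body", v)), BooleanClause.Occur.SHOULD);
        }
        query.add(oneTerm, BooleanClause.Occur.MUST);
    }

    IndexSearcher searcher = new IndexSearcher(indexDir);
    Hits hits = searcher.search(query); // Hits was the 2.x result API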

I have no doubt that we could go in and massage the query evaluator (or come up with a more efficient postings format, or ...) to get that 10% back, and I'm sure that the Lucene folks could then go in and get another 10%. Lather, rinse, repeat.

With respect to global IDF, we store a separate dictionary containing global term statistics, so when we're doing weighting, we're using the statistics for the collection as a whole. We actually try to be fairly careful about the term statistics, especially for things like document categorization.
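
To illustrate the idea with a toy (this is not Minion's actual code; GlobalTermStats is a hypothetical stand-in for that separate dictionary): the document frequency that goes into the IDF comes from the collection-wide dictionary, not from whichever partition happens to hold the document, so weights are comparable everywhere.

    // A toy TF-IDF weighter that uses collection-wide statistics.
    public class GlobalWeighter {

        // Hypothetical stand-in for the separate dictionary of global term stats.
        public interface GlobalTermStats {
            long totalDocs();           // documents in the whole collection
            long docFreq(String term);  // docs containing the term, collection-wide
        }

        private final GlobalTermStats stats;

        public GlobalWeighter(GlobalTermStats stats) {
            this.stats = stats;
        }

        // Weight one term occurrence. Because df and N are global, a score
        // computed against one partition is comparable to one from another.
        public double weight(String term, int termFreqInDoc) {
            double idf = Math.log((double) stats.totalDocs() / (stats.docFreq(term) + 1)) + 1.0;
            return termFreqInDoc * idf;
        }
    }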

As for distributed search, Minion is like Lucene in that it's a simple indexing and retrieval library. We're working on distributing the search as part of Aura, but I can't really comment on the performance there yet, as it's still pretty early days. I will say that it was pleasingly fast on a 7 million document index distributed into 16 pieces (fast enough that I thought it was broken when we ran the first queries). I'll let you know how it goes when we have 100 million or a billion docs :-)
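
The basic shape of that kind of distributed evaluation is ordinary scatter-gather: send the query to all the pieces in parallel and merge the top hits, which is exactly where the global term statistics above earn their keep. A rough sketch, with hypothetical SearchPartition and Result types (real code also needs timeouts and failure handling):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Scatter-gather over the pieces of a distributed index.
    public class DistributedSearcher {

        public interface Result extends Comparable<Result> { }

        public interface SearchPartition {
            List<Result> search(String query) throws Exception;
        }

        private final List<SearchPartition> partitions;
        private final ExecutorService pool = Executors.newCachedThreadPool();

        public DistributedSearcher(List<SearchPartition> partitions) {
            this.partitions = partitions;
        }

        public List<Result> search(final String query, int n) throws Exception {
            // Scatter: run the query against every partition in parallel.
            List<Future<List<Result>>> futures = new ArrayList<Future<List<Result>>>();
            for (final SearchPartition p : partitions) {
                futures.add(pool.submit(new Callable<List<Result>>() {
                    public List<Result> call() throws Exception {
                        return p.search(query);
                    }
                }));
            }

            // Gather: merge the per-partition hits and keep the global top n.
            // Scores are only comparable because the weights use global stats.
            List<Result> merged = new ArrayList<Result>();
            for (Future<List<Result>> f : futures) {
                merged.addAll(f.get());
            }
            Collections.sort(merged); // Result compares by descending score
            return merged.subList(0, Math.min(n, merged.size()));
        }
    }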

Comments:

When it comes to indexing speed, the current version of Lucene (2.3.x) should be much faster than 2.0 (for query performance of common queries, there's probably not much of a difference).

Posted by Daniel Naber on May 10, 2008 at 04:02 AM EDT #

What Daniel said :)
Do you recall what parameters you used for each during indexing? That's the key, of course.

Posted by Otis Gospodnetic on May 10, 2008 at 06:54 AM EDT #

I figured that things would have changed, since the tests were more than a year ago. I'll see if I can track down the source code for the indexer (it's in a CVS repository somewhere around here...)

Honestly, the last time I worried about indexing speed was sometime late in 2001, so I don't really think this is a problem for either engine.

If I recall correctly, we didn't really do any tuning for Lucene (i.e., we used the basic Analyzer, Tokenizer, etc.), and we did pretty much what you would expect for email (e.g., stored the subject and the message date). For Minion we used the default configuration.

Posted by Stephen Green on May 10, 2008 at 02:35 PM EDT #

svn co lucene/java/trunk/contrib/benchmark and have a look. Perhaps Minion can develop something similar. Then it would be really easy to compare, and anyone else could independently benchmark, test on their own hardware, etc.

Posted by Otis Gospodnetic on May 10, 2008 at 02:53 PM EDT #

OK, I'll put that on the list of things to do.

Posted by Stephen Green on May 11, 2008 at 02:20 PM EDT #

Hi Stephen
I have contacted you via LinkedIn regarding Minion. We would be interested in looking into possible cooperation around Minion, which would also serve as a real (and very demanding) test bed for Minion's performance and features (distributed search and more).
Hoping to hear from you regarding this.

Posted by Ron on May 12, 2008 at 12:30 AM EDT #

Hi Stephen,
When do you plan to run performance tests on 100 million+ documents?

Posted by Wojtek on May 21, 2008 at 01:16 PM EDT #

Wojtek,

We'll probably be getting close to that this summer as Project Aura comes on line.

Posted by Stephen Green on May 21, 2008 at 02:45 PM EDT #

Stephen, could you please describe again what Project Aura is exactly, and tell us whether it's something that Sun will be releasing to the OSS world like you did with Minion, or whether Aura will be commercial and proprietary?

Posted by Otis Gospodnetic on May 21, 2008 at 03:30 PM EDT #

Aura is a new project centered around recommendation, with the aim of building a Web-scale recommender system that offers a hybrid of collaborative filtering and content-based recommenders.

Paul's been blogging about Aura more than I have. You can read posts about Aura here:

http://blogs.sun.com/plamere/category/recommendation

Aura will be OSS. I'm going through the open source process here at Sun right now and we're hoping to have our first release around September, if we can work our interns hard enough :-)

Posted by Stephen Green on May 22, 2008 at 01:18 AM EDT #
