The rest of the story...

Paul blogged today about using the search engine to implement a persistent set of strings. He called it abusing the search engine, but it was so simple to do that it seems like more of a use than an abuse, IMHO.

One of Minion's strengths is that it offers a fairly small "public" API that is supposed to offer all of the functionality that you need for indexing and searching documents. Paul's persistent set uses an interface called SimpleIndexer that, as the name suggests, provides a simple way to index documents.

Recall that a document in Minion is just a bunch of fields, so to index using a SimpleIndexer you just do something like:


SearchEngine e = SearchEngineFactory.getSearchEngine(indexDir);
SimpleIndexer si = e.getSimpleIndexer();

to get a search engine and its simple indexer. Then, for each document you want to index, you can say:


si.startDocument(key);
si.addField(field1, value1);
si.addField(field2, value2);
...
si.endDocument();

When you're done, you need to tell the engine that you're finished with the simple indexer so that any information that has accumulated in memory can be flushed to disk:


si.finish();

Don't worry about "indexing too much" with a simple indexer. The engine will flush data to the disk when the heap starts to fill. Also, don't forget to close the engine when you're done with all your indexing:


e.close();

As it stands right now, if you forget to call finish, some of the data that you've indexed might be discarded. This is the kind of infelicity that I'm hoping to fix over the next little while. Yesterday Paul was complaining about having to remember to close the engine, so we'll probably make that a little easier to deal with as well.
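
Until then, one way to make sure finish and close always get called is to wrap the indexing in try/finally. Here's a minimal sketch of that pattern. Note that Indexer, Engine and RecordingEngine are hypothetical stand-ins that I've made up so the example runs on its own (they model only the methods used in this post, not Minion's real types), and the tab-separated "key, then text" line format and the "text" field name are assumptions for illustration too:

```java
import java.util.ArrayList;
import java.util.List;

class IndexingLifecycle {

    // Hypothetical stand-ins: only the methods that this post calls are modeled.
    interface Indexer {
        void startDocument(String key);
        void addField(String name, String value);
        void endDocument();
        void finish();
    }

    interface Engine {
        Indexer getSimpleIndexer();
        void close();
    }

    /** In-memory fake that records the calls it receives, in order. */
    static class RecordingEngine implements Engine, Indexer {
        final List<String> calls = new ArrayList<>();

        public Indexer getSimpleIndexer() { calls.add("getSimpleIndexer"); return this; }
        public void startDocument(String key) { calls.add("start:" + key); }
        public void addField(String name, String value) { calls.add("field:" + name + "=" + value); }
        public void endDocument() { calls.add("end"); }
        public void finish() { calls.add("finish"); }
        public void close() { calls.add("close"); }
    }

    /**
     * Indexes one document per line, where each line is "key<TAB>text".
     * finish() and close() run even if a line is malformed.
     */
    static void indexLines(Engine e, List<String> lines) {
        try {
            Indexer si = e.getSimpleIndexer();
            try {
                for (String line : lines) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length < 2) {
                        throw new IllegalArgumentException("expected key<TAB>text: " + line);
                    }
                    si.startDocument(parts[0]);
                    si.addField("text", parts[1]);
                    si.endDocument();
                }
            } finally {
                si.finish();   // flush whatever has accumulated in memory
            }
        } finally {
            e.close();         // always release the engine
        }
    }

    public static void main(String[] args) {
        RecordingEngine engine = new RecordingEngine();
        indexLines(engine, List.of("doc1\thello", "doc2\tworld"));
        engine.calls.forEach(System.out::println);
    }
}
```

The nesting is deliberate: the inner finally flushes the indexer before the outer finally closes the engine, which is the same order as the straight-line calls above.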

Generally speaking, when Paul complains about the engine I listen. His (constructive!) criticism is the reason that we have SimpleIndexer in the first place.

There are a couple of other ways of indexing documents, but the SimpleIndexer is a remarkably powerful way to index a whole host of things (blogs, email, databases, etc.).

Comments:

I notice that you create a document by having a stateful SimpleIndexer. The Lucene way of constructing a document and then adding it to the index feels much cleaner to me.

What is your motivation for having a SimpleIndexer that "remembers" that you are currently editing a document?

Posted by Jens on April 23, 2008 at 03:38 AM EDT #

In a lot of respects, constructing a document and then indexing it is cleaner, which is why Minion offers a way to do that too: you can pass a java.util.Map to the search engine and tell it to index it. I'll get to that in a future post, I expect.

Paul really was the motivation for the simple indexer: he had a data source that he was reading line-by-line and he just wanted to add the data as he was going rather than collecting up a map and indexing that.

I now find that I'm using SimpleIndexer most of the time myself, even though I initially resisted implementing it!

Posted by Stephen Green on April 23, 2008 at 03:46 AM EDT #

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.
