Tuesday Mar 10, 2009

Dictionaries in Minion: Merging

The last major function that we're going to investigate for dictionaries is merging them.

When indexing in Minion, we try to index as much data as possible in memory before dumping the data to the disk. Each chunk of data that we dump to the disk is called a partition. We usually call a partition that was dumped directly by the indexer a "first-level partition."

A partition is composed of a number of files, and the contents of these files are mostly dictionaries and their associated postings. As indexing proceeds and we dump more first-level partitions, the PartitionManager for the index will collect a number of first-level partitions of a given size (in terms of the number of documents that contain) and merge them together into a new partition. As you can imagine, we usually call the partition resulting from a merge of first-level partitions a "second-level partition". The PartitionManager will eventually merge second-level partitions into third-level ones, and so on.

Because of our EIDAP philosophy, all of these partition merges boil down to merging a number of dictionaries. So, for example, we build the main dictionary for a merged partition by merging together the main dictionaries from the partitions that we're merging.

The main dictionary merge method is pretty straightforward, given how central it is to how indexing works in Minion.

Let's walk through a merge for the main dictionaries for a number of partitions. The first thing that you will notice is that the method signature is pretty complicated. This is a bit of an annoyance, but the merge method has to deal with a number of different situations. Let's look at the parameters for the method:

IndexEntry entryFactory
This is an empty entry of the same type as is in the dictionaries we'll be merging (and, therefore, in the dictionary that we'll be creating. We'll use it to make new entries for the dictionaries.

NameEncoder encoder
An encoder for the names in the merged dictionary. In the case of the main dictionary for the partitions, this will be a StringNameHandler that will handle the 3-in-4 front coding of the terms in the dictionary.

PartitionStats partStats
A set of simple statistics about the partition that we may be collecting while merging the terms.

DiskDictionary[] dicts
The actual dictionaries to merge.

EntryMapper[] mappers
A set of mappers for the entries. These can be used (for example) to renumber entries in the dictionary during the merge.

int[] starts
The starting document IDs for the data from each of the partitions. During indexing, the document IDs are assigned per partition (i.e., the first document ID in a partition is always 1), so during merging we need to remap the document IDs from the merged partitions.

int[][] postIDMaps
A set of old-to-new mappings for the IDs encoded in the postings associated with this dictionary.

RandomAccessFile mDictFile
The file where the merged dictionary will be (eventually) written.

PostingsOutput[] postOut
The file(s) where the merged postings will be written

boolean appendPostings
In most cases, when merging dictionaries, the postings associated with the entries in the dictionary can simply be appended onto one another to generate the postings for the merged dictionaries. There are a few dictionaries where this is not the case.

The first thing that we do for the merge is to build a heap of iterators, one for each of the dictionaries that we're merging (note that DiskDictionary implements Iterable — we'll talk more about dictionary iterators later.) The heap is organized by the names of the entries currently at the head of the iterator. While there are still iterators on the heap, the merge loop proceeds as follows:

  1. peek at the top element of the heap, and make a new entry for the merged dictionary using the name of the top element. We set the ID of the new dictionary entry in sequence or using the provided mappers.

  2. While the element at the top of the heap has the same name as the current element:

    1. Remove the top element from the heap.

    2. Read the postings associated with the element, and append them onto the postings that we're building for the new entry.

    3. Keep track of the mapping for the ID of this entry in the dictionary from which it was drawn and in the new dictionary.

    4. Advance the iterator that was at the top of the heap. If there is another element on that iterator, put the iterator back on the heap.

    5. peek at the element at the top of the heap.

  3. Write the merged postings to the postings file

  4. Write the new, merged entry to the dictionary using a DictionaryWriter

See, simple, right? There's actually quite a bit of complexity hidden in some of those steps (most notably, appending or merging postings — we'll talk more about that later too), but the basics actually are pretty straightforward.

Monday Mar 09, 2009

Dictionaries in Minion: Searching

In the entry on dumping, we described the on-disk representation for a dictionary. The representation is pretty straightforward:

  • A dictionary header. This contains information about the number of entries in the dictionary as well as information about the sizes of the following buffers.
  • An optional buffer. If the dictionary required the storing of an ID to position map in the dictionary (because the IDs of the entries in the dictionary were not remapped when the dictionary
    was dumped), then this buffer will occur in the on-disk representation. The entry at position i will encode the position of the entry with ID i in the sorted dictionary.

  • A buffer. This buffer contains the names of the entries in the dictionary using a standard 3-in-4 front-coding. The size of a block of terms is variable.
  • A buffer. This buffer contains the positions in the preceding buffer where each block of four terms starts. Each position is encoded using four bytes.

  • A buffer. This buffer contains the encoded information from the entries in the dictionary. The information and how it's encoded is particular to the entry type, but most types encode information about the position of the postings list in an associated postings file, the size of the postings list, and the number of postings in the postings list. The size of the information for an entry is variable.

  • A buffer. This buffer contains the position in the preceding buffer of the information for each entry. Each position is encoded using four bytes.

Because we can't guarantee that we can load very large dictionaries into memory at query time (and because we have a lot of dictionaries), we use file-backed buffers that keep an in-memory cache of a chunk of the buffer while the rest of the buffer stays on disk. When we open a dictionary, we just instantiate these buffers and have them backed by
the buffers on disk.

The main dictionary operation during querying is looking up entries by name: we look up words from the document in the main dictionary, we look up field values in the dictionaries for the saved fields, and so on. Keep in mind that at query time (unlike index time) a dictionary may be accessed by multiple threads, so we need to make sure that our operations are thread safe.

Let's have a look at how this is done.

We begin by taking duplicates of our file-backed buffers that contain the dictionary data. This is a relatively low-cost operation and it means that we don't have to worry about the buffers' in-memory caches thrashing as other threads try to find entries at the same time that we are.

Each dictionary maintains an LRA cache of entries that have been looked up, so we check that cache to see if the entry is already available. If it is, we return a copy of the cached entry and we're done. The size of this cache is configurable, and it can be set up to (eventually) cache all of the entries.

If the entry is not in the cache, then we need to see whether an entry with the given name occurs in the dictionary. We do this by running a binary search on the encoded names buffer. Because we use a 3-in-4 front coding, the search actually proceeds against the uncompressed names at the start of each block of four names.

Note that as we're running the binary search we also cache the uncompressed names at the positions we're testing. This means that over time, the top few levels of the binary search tree will be cached and so we won't have to keep decoding those names every time we search for something by name. This lets us get down to a pretty small portion of the dictionary before we ever have to actually decode any

At the end of our binary search, we might have gotten lucky and have the name be one of our uncompressed names, but usually the binary search narrows things down to a block of names that may contain the name we're interested in. Once we find a block of names, we can iterate through it to see if it contains the name we're looking for.

If we don't find the name at this point, we can return null because we're sure the dictionary doesn't contain an entry with this name. If we do find the name, then we know the position in the dictionary for the entry with this name.

Once we have the position, we can create an entry based on the encoded data for this entry.

On the way out, we cache the entry that we found for this name.

An additional filigree on this: we use the same lookup code to find matches for an initial substring of a name, which we can use when looking for stemmed variations of a term.

Thursday Mar 05, 2009

Inside Lucene's Indexing Code

For search engine innards fans, Mark Miller is exploring lucene’s indexing code over at the Lucid Imagination Blog.

It looks like pretty interesting stuff, and a good way for me to find out more about how Lucene works.

Dictionaries in Minion: Dumping a Dictionary

Once the indexer decides that it's time to dump a partition (because
memory's getting full or because it was told to by a call to SearchEngine.flush),
the partition dump basically reduces into dumping a bunch of
dictionaries and their associated postings data.

The MemoryDictionary.dump
method is responsible for dumping the dictionary. The first step in a
dictionary dump is to instantiate a DictionaryWriter.
This is a helper class that does most of the actual work of encoding
and writing the dictionary. We use a helper class because a lot of
the work done during the dump of a dictionary is also done when
merging dictionaries (more on that in an upcoming post.)

Once we're ready to start writing the dictionary, the entries in the
dictionary are sorted by their names. We sort the entries by name so
that we can use binary search to find entries at query time. We could
avoid the sorting step by keeping the entries in a
SortedMap at indexing time, but it turns out to be faster
to sort the entries when we're ready to dump them than to keep them
sorted while we're indexing them.

Although the dictionary assigned integer IDs to the entries during
indexing, at dump time we can take the opportunity to renumber the
entries so that the ID for an entry corresponds to its position in the
sorted list. If we do this, looking up a dictionary entry by ID can
be done very quickly at query time.

We do this renumbering for dictionaries like the main dictionary for a
partition (the one containing the terms from the documents), but we
skip it for the terms in the document dictionary. This renumbering
business is one of the places where the EIDAP philosophy gets a bit
burdensome. When you're sorting the entries in the document
dictionary, you don't want to renumber the entries, because the IDs
that were assigned to these entries have been encoded into the
postings lists for the terms in the main dictionary and we don't want
to have to re-encode all those postings.

In the cases where we don't renumber the entries, we need to keep a
map from the ID for an entry to the position in the dictionary, so
that we can continue to do fast lookups by ID at query time.

When we do renumber the entries, we can optionally keep a map from old
entry IDs to new entry IDs. The ID map from one dictionary can be
useful when dumping another dictionary's entries.

Once we've figured out the order that the entries will be dumped in
and what the entries' final IDs will be, each entry is written out to
the on-disk representation, via the DictionaryWriter.
The on-disk representation for the dictionary is a DictionaryHeader,
followed by four buffers.

The first buffer contains a standard 3-in-4 front coding of the names
of the entries. The basic idea is that we group the names of the
terms into groups of 4. The first name in a block is uncompressed,
the remaining names in the block have any initial substring shared with
the previous name removed and replaced with the length of the shared
initial substring.

Here's an example from an index I had lying around. A four word block
is composed of the words: evergreen, evergrey,
everlast, everlasting. These will get compressed
into a block of names that looks like:


where the numbers are actually bytes representing the initial
shared substring length. Thus 37 characters are compressed into 23. The
encoding of names is handled by classes that implement the NameEncoder
interface. The appropriate name encoder for a dictionary is passed
into the dump method for a dictionary.

The second buffer encodes the positions of the start of each of the
blocks of four terms in the dictionary. Because we need to be able to
quickly find the start of a block, these positions are encoded using
a fixed number of bytes.

The third buffer encodes the rest of the information contained in a
dictionary entry. This is stuff like the number of occurrences of the
term and the offset in the postings file where the postings associated
with the term can be found. Each entry knows how to encode this
information onto a buffer, and the encoding is done by the IndexEntry.encodePostingsInfo

The fourth buffer encodes the position in the third buffer of the
entry information for each entry. Again, we need to be able to find
the information for an entry quickly when we're fetching entries at
query time, so the offset is encoded using a fixed-length encoding.

As we're encoding the information for the entries, we also dump the
collected postings information to the appropriate postings files,
keeping track of the offsets where the postings are dumped and how
many bytes are in the postings.

Because we use the DictionaryWriter for merging
dictionaries too, we can't assume that the dictionary representation
can be stored completely in memory. Believe me, life would be easier
if we could, but this really does get to be a problem with large
indices. To work around this, when we're encoding the dictionary, we
use buffers that have an in-memory cache and are backed by a temporary
file. In the case of small dictionaries, this works out to be pretty
much a wash, because we only ever use the in-memory cache and never
have to write any of the data to the file.

The last stage of dumping the dictionaries is to write the dictionary
header and then transfer the encoded buffers into their final
destination file. We make sure this is done efficiently using NIO
channel transfer operations.

Wednesday Mar 04, 2009

Absolutely nothing to do with search engines, but still kind of nerdy

This is just plain funny.

Also, I'm heading to a midnight showing of Watchmen on Thursday/Friday. To say that I'm stoked about this movie would be a severe understatement, so I expect I'm setting myself up for a huge disappoinment. Still, I won't be dressing up to go to see the movie.

Tuesday Mar 03, 2009

LiteMorph: A lightweight morphology

Matt Chaput commented the other day about an aspect of the dictionaries, and in passing mentioned that his Python search engine (Whoosh! Good name, dude!) uses the lightweight morphology from Minion (ported to Python, obviously.)

Otis followed up asking where he could get more information, and I thought I would post an entry rather than burying it in the comments.

The lightweight morphology component (colloquially known as "LiteMorph") was originally developed by Bill Woods as an alternative to the stemmers that are traditionally used in search engines.

The basic idea behind LiteMorph is to take a word and replace known suffixes with other suffixes that are considered plausible in the morphology of the particular language of interest (there are LiteMorphs for English, German and Spanish). Here's an example of a LiteMorph rule for English:

.aeiouy o u s + -> ly,ness

This rule means: for a word that ends with a vowel followed by the letters ous, you can generate variations by adding the letters ly and ness to the word. So, a word like joyous will generate the variations joyously and joyousness.

Note that these rules were developed long before regular expressions were available in Java, so they use an invented syntax. Someone who wanted to port them to use regular expressions would be warmly welcomed into the Minion family!

This generativeness (there's a morph for you!) is a very nice property. For example, given the term build, the English version of LiteMorph generates the following list of variations:

build, builder, builders, building, buildings, builds, built

This is nice because this is the list of terms that we need to look up in the dicitionary to get the variations for build.

Each LiteMorph contains an exception list, so we can handle things like the irregular verb run:

ran, run, runner, runners, running, runnings, runs

The variations that are generated can be a bit weird, though. Here are the generated variants for happiness:

happied, happier, happiers, happies, happiest, happiful, happiless, happily, happiment, happiments, happiness, happinesses, happy, happying, happyings

The good news is that although these seem weird, we're looking them up in the dictionary so really weird variations (e.g., happiless, which I totally think should be a word) just won't appear and so they won't affect the search. On the other hand, variations that a stemmer would never get very well might appear and the LiteMorph will catch them.

I had q quick look around for papers explaining the syntax of the LiteMorph rules and came up with a paper for TaPor 2004 that was written by a grad student who implemented a French LiteMorph.

Monday Mar 02, 2009

Dictionaries in Minion: Indexing

Following on with our discussion of dictionaries, lets look at how
dictionaries get used during indexing. The MemoryDictionary
class is an implemenation of the Dictionary
interface that is used during indexing.

The basic functionality for storing dictionary entries while indexing
is handled by a HashMap<Object,Entry>. Note that
the system is structured so that we never have to worry about a
MemoryDictionary being accessed by multiple threads, so
we don't need to worry about concurrency for this map.

We create a MemoryDictionary by calling the constructor,
passing in the Class for the entries that will be stored
in the dictionary. The astute reader will realize that this is an
opportunity to use generics, and genericizing the dictionaries is on
our list of things to do.

Once we've got a dictionary, we want to add some entries to it. Code
to do this will typically look something like the following:

IndexEntry mde = (IndexEntry) mainDict.get(name);
if(mde == null) {
mde = mainDict.newEntry(name);
mainDict.put(name, mde);

The code tries to retrieve an entry with the given name from the
mainDict dictionary. If there is no such entry, then one
is generated by the dictionary (which is why it needs the entry class
above.) This entry is then put into the dictionary with the given

The MemoryDictionary.newEntry
method is responsible for constructing an entry with the given name
and assigning that entry a numeric ID.

The newEntry method is actually kind of complicated,
because it's responsible for handling the difference between cased and
uncased entries. If the dictionary is using cased entries, the
newEntry method will make sure that the cased entry
points to the uncased entry, so that postings can be added to both
when necessary without having to do multiple dictionary lookups when
processing a token or field value.

This is pretty much all of the action that a dictionary sees during
indexing: creating entries and fetching existing entries so that
postings can be added to them (I'll be covering the postings stuff

Up next: how a dictionary gets dumped (don't worry, it has a happy ending!)

Friday Feb 27, 2009

Results Filtering in Minion

One of the capabilities that we've recently added to Minion is results filtering.

Let's say that you run a search to find all of the documents that contain the word dog. When you get back the ResultSetfrom the search, you want to show the user the top 10 results. But let's also say that you only want to show the top 10 results where the breed field is german shepherd. Now, you could have added a clause to the query that would have added this restriction to the search, but in an interactive system, it would be nice if you could change the breed restriction without having to re-run the query every time.

This is where results filtering comes in. Rather than re-run the search, you can build a results filter that you pass to the ResultSet.getResults method.

The ResultsFilter interface describes the methods that a results filter needs to implement. The main method is ResultsFilter.filter, which takes an accessor for the result currently under consideration. If the implementation of this method returns true then the result currently under consideration can be returned in the list of results. If this method returns false then the result won't be returned.

The ResultAccessor interface gives you access to the saved field values for the result currently under consideration. In addition, you can fetch the score for the result or the key for the result.

All of this raises the question: when does a result come under consideration? A simple implementation of results filtering would run the filter against all of the documents that satisfied a query. This approach would require fetching field values for a lot of results that would never make it into the top n results.

The approach that Minion takes is the following: while we're building the heap of the top n results so far, if we decide that the search result that we're looking at should replace the top element of the heap, then we run the results filter on that result. If the results filter says that we should add the result, then we replace the top of the heap.

We've been using results filters a lot in building recommenders in the AURA Project and they've turned out to be pretty ferociously useful.

One final word: Minion doesn't actually do anything with the ResultsFilter.getPassed or ResultsFilter.getTested methods: they're there for your benefit, so you can feel free to have them just return 0 if you don't want to keep track of that information. Those methods are there because our first runs with filters turned on were so fast that I thought the filters weren't getting called!

Thursday Feb 26, 2009

Dictionaries in Minion: Dictionary Entries

Dictionaries are a map from entry names to entries, so let's take a second to look at what the entries are before we dive into how the dictionaries use them. For those who want to follow along in the code, all of the code pertaining to the entry types is in the com.sun.labs.minion.indexer.entry package.

The basic interface for the entry types is Entry, which describes the capabilties that all entries need to have. There are methods to get and set the name or the integer ID of the entry, methods to get the partition or the dictionary that the entry belongs to, as well as methods to get occurrence statistics for the entry.

Note that the names for the entries are only specified as Object. We expect names to be instances of String, Long, Double, or Date. These types are currently sufficient for all of our uses (including terms in documents, values in saved fields, bigrams, and so on.) The code used in the dictionaries is such that anything that implements Comparable could be used as the name for an entry, since we need to be able to sort dictionaries by name before dumping them. To-do list item: this dependency should be explicit!

The Entry interface has two sub-interfaces IndexEntry and QueryEntry. These names aren't the greatest, but classes implementing IndexEntry are meant to be used during indexing (i.e., when generating dictionary entries and their associated postings) and classes implementing QueryEntry are meant to be used during query operations (i.e., when looking up entries and processing their postings). Even though most of the entry types implement both these interfaces, we make this distinction because it makes it clear in the code how we're going to use a particular entry.

The BaseEntry abstract class provides an implementation for some of the parts of IndexEntry and QueryEntry. This entry is, unsurprisingly, the basis for pretty much all of the rest of the entries.

In particular, there is the SinglePostingsEntry class, which is another abstract base class for all the entry types that store a single postings list. There are concrete subclasses of SinglePostingsEntry for entries that store postings with IDs only, with IDs and frequencies, and for entries that store everything (IDs, frequencies, field information, and word positions).

Another important abstract class is CasedEntry that defines the basis for all of the entry types that store cased postings lists. For example, if an engine is configured to provide case-sensitive searching, then we will store the postings for each term twice: once in a postings list that ignores the case in which the term was encountered, and once in a postings list that preserves the case in which the term was encountered. Why do it this way? Because in the default case (I want to search for a term without regard to case) we only need to lookup one term and then read and process one postings list.

So who's responsible for generating these entries? The dictionaries are. At indexing time, the dictionary is used to generate an entry for a new term. At query time, the entries will be constructed from the dictionary data that was dumped during indexing.

Wednesday Feb 25, 2009

Dictionaries in Minion: An Introduction

As I've mentioned previously, our design philosophy for Minion is that everything is a dictionary and postings.

To recap, in Minion, a dictionary is a map from entry names to entries. The entry name may be a term from a document, a value from a saved field, or a bigram that will be used for doing wildcard matching. The entry pointed to by the name contains information about that name, including frequency of occurrence, the dictionary that the entry was drawn from, and (most importantly) a pointer to the postings associated with that entry.

There are actually two kinds of dictionaries in Minion. The first kind is a MemoryDictionary, which is used to hold a dictionary at indexing time. The second kind is a DiskDictionary, which is used to hold a dictionary at query time. Both of these dictionary types are implementations of the Dictionary interface that describes the capabilities that all dictionaries share.

Over the next few posts I'll be describing the entries that you'll find in dictionaries, and how the MemoryDictionary and DiskDictionary classes are used in Minion.

Posts in this series:

Monday Feb 23, 2009

Indexing an SQL database (or part of one)

Via Glen Newton, LuSql, a "command line Java application that can be used to build a Lucene index from an arbitrary SQL query of a JDBC-accessible SQL database".

I've often thought about writing something similar for Minion, but Glen appears to have done a remarkably thorough job for Lucene.  A 40 page user manual? That's impressive. It's Apache Licensed, so I suppose I could port it to use Minion as the indexing engine...

Sunday Feb 22, 2009

Bits, and the buffering thereof

I'm going to try to get back to blogging about the innards of Minion. I'm going to start at the bottom, but I'll probably jump around a little bit depending on what takes my fancy.

Way down at the bottom of Minion are bit buffers. Most of the higher-level file formats in Minion are built up from a number of bit buffers. A dictionary, for example, consists of a header and then four buffers containing the data for the dictionary.

The buffers that Minion uses are inspired by the buffer types in the java.nio package, although they do not extend those types. The main issues with these buffers are that we can't use them to encode an integer using a word aligned compression method and they won't grow on demand.

We divide buffers into two interfaces: one for buffers that you write to and one for buffers that you read from. We make this distinction because it allows us to be clear in the code when a buffer should be used for reading and when it should be used for writing. In reality, our buffers implement both of these interfaces.

Readable and writeable buffers have a few things in common: you can set the position, figure out how many bytes are remaining to be read or written, or figure out how much data the buffer can hold. The com.sun.labs.minion.util.buffer.WriteableBuffer interface defines a number of methods for encoding data onto the buffer, and the com.sun.labs.minion.util.buffer.ReadableBuffer provides the corresponding methods for decoding data.

There are methods for encoding and decoding the standard Java numeric types using their usual number of bytes, as well as methods for encoding integers using a fixed number of bytes (so, for example, you can encode numbers that you're sure aren't going to be bigger than 2\^7 using a single byte.) There are methods for encoding and decoding java.lang.Strings using a UTF-8 encoding that doesn't encode the length of the string as an unsigned short (you learn a lot looking at the source code of the JRE.)

The main compression routine in the buffer classes encodes integers using a word-aligned compression method. This is essentially the VInt format that Lucene uses. Using a word-aligned compression method is much more efficient (from a CPU usage standpoint) than using more sophisticated compression methods like gamma coding. For more information about these methods, you should have a look at Managing Gigabytes, or have a look at Zobel and Moffat's Computing Surveys article Inverted files for text search engines.

Minion Buffers can be written to or read from java.nio buffers, Data(Input|Output)s, java.nio.channel.(Writeable|Readable)ByteChannels, or other Minion buffers, which is how they eventually make it out to files.

Minion contains a number of implementations for these buffers which differ mainly in the backing store used to store the encoded data. There are implementations backed by arrays and by NIO buffers for in-memory operation and implementations that are backed by files with an in-memory buffer, for those occasions when one doesn't want to store 100MB of encoded dictionary entries in main memory.

Thursday Feb 19, 2009

Paul's Mantra

Paul has left Sun for The Echo Nest and he appears to be having a good time so far (Toto's "Rosanna" notwithstanding.) I've had a great time working with Paul over the past few years, and I think I've now internalized one of Paul's mantras: "What's the simplest thing that could possibly work?"

This is a good way to get over the hump on a new project when you want to get something started and you're not sure what the best way to do it would be, but you're pretty sure that you know one way that might work (probably). It's surprising how far you can get using this technique.

Today's example of this comes via Good Math, Bad Math, wherein Mark discovers that a gap buffer implementation for a text editor works quite well enough that it's not necessary to use a rope, even though a rope has better theoretical properties.

The annoyance for me in this one is that Mark seems astonished that a Java implementation of the gap buffer could be fast enough to obviate the need for a rope. When is this "Java is Slow!!1!" meme going to die?

Also annoying: I now have Rosanna stuck in my head.

Friday Jun 13, 2008

Highlighting search results in Minion

I've just posted a new piece of Minion documentation about how search results highlighting works.

It's kind of complicated, but then again getting the highlighting that you want is kind of complicated. The short version is: if you have a set of query terms and a document that you want to highlight that contains (some of) those terms, then:

  1. Tell the passage retrieval API what fields you want to highlight and how to treat the passages in that field.

  2. Use the passage retrieval algorithm to find a set of passages.

  3. Pull out the highlighted passages and display theme.

Using the passage retrieval algorithm to find a set of passages has some handy side effects like it easily handles things like finding morphological variations of the query terms.

A major improvement for this version over previous versions, is that the process of figuring out how to build a passage of a particular size (e.g., you want to display a 500 character passage from the body of an email message) is a lot more robust.

Wednesday May 21, 2008

Open Source TREC: TRECmentum!

Great minds think alike1, I suppose. Grant Ingersoll, a core committer for Lucene posted last week about open source search engine relevance, proposing a TREC-like evaluation for open source engines. His much more comprehensive post explains a lot of what I mentioned in passing in a post a little while ago.

Getting the TREC data is definitely a barrier to entry to someone who wants to just try the TREC ad-hoc queries to see how their favorite engine holds up against the standard TREC evaluation measures. At Sun, we actually participated in TREC 1.5 times and we've intended to compete other times, so we have all of the data but that doesn't actually do anyone else any good.

It also does take a fair amount of hardware to run a TREC system through its paces. In 2000, we borrowed about 20 Ultrasparc machines to distribute indexing and retrieval for the question answering task. Of course, we messed up by deciding to compete the week before the evaluation data was going out!

Grant mentions a number of collections that we could use for the evaluation. I think we should collect up as many mail archives as we could get our hands on as well (I, for example, could see about getting the OpenSolaris mailing lists and associated queries) since that data tends to have an interesting structure and it could lead (eventually) to tasks like topic detection and tracking and social network analysis. I'd even have a go at seeing if we could host the evaluation collections somewhere around here, if that was helpful.

I guess what I'm saying is "sign me up". I think this could be a great benefit to all of the open source engine communities. TREC certainly was to the academic and commercial communities.

Update: Paul reminds me in an email that we have a blog crawler and can pretty easily generate several million documents. We wouldn't have queries, though.

1. But as my grandfather always used to say: "Great minds think alike, but fools seldom differ."


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.


« July 2016