Wednesday Jun 17, 2009

Crikey! Fix a bug and look what you get!

Our intern arrived at the beginning of June and promptly began putting bugs into my pristine and utterly bug free search engine code.

One of the bugs he planted/discovered was that the recent change to use a BST for part of the dictionary lookup resulted in a BST search that was a bit, shall we say, overzealous in searching the tree. I fixed the bug and committed the change. Today I had a bit of time, so I decided to re-run the multi threading tests to see how the change affected our dictionary lookup times.

I was expecting the times to be worse, because I thought that perhaps we were not finding the right entries (but the dictionary code checks for that), but here's what I found:

# of threadsTotal Number of LookupsAverage Lookup Time (ms)

That's a huge speedup at the low end: somewhere between 3 to 5 times! Something weird happens in the middle where the results are worse, but then at the top end is settles down to be more than twice as fast. Also notice that I added a result for 200 threads (hey, why not, right?)

Here's the graph:

I have to tell you: it's pretty cool to see a machine with 256 hardware threads looking like this in prstat:

  1529 stgreen   100M   86M cpu169  30    0   4:00:48  71% java/223

Onward to query performance...

Thursday May 28, 2009

Scalin' Dictionaries 2: Electric BST

I've been working on the scalability properties of the dictionaries again. The last time, thanks to Jeff, we managed to get the dictionary to scale pretty well up to 32 threads and acceptably up to 64 threads by removing the synchronization bottlenecks on allocating lookup states during dictionary lookups and by using positional reads on the files containing the dictionaries.

Once we'd done this, I went back to collect/analyze and had a look at where the synchronization time was being spent during a run of our dictionary test program with 128 threads. I was surprised to see that a lot of synchronization time was being spent adding elements to the caches that we're mainitaining. This was a bit weird, because there's no explicit synchronization in the java.util.LinkedHashMap that we were using for the cache. I suspected that we were hitting an allocation problem with the objects that it allocated to put on the linked list that preserves the insertion order.

Aside from the synchronization problems, the main problem with the caches, in particular the entry-by-name cache, is that we're not getting that many hits on the cache during querying. In some of our query tests, the cache hit rate is about 10%, which means we're doing a lot of synchronization for not a lot of benefit.

So, I did away with both the entry-by-name cache and the name-by-position cache that we were using. The name-by-position cache actually was used: over time it was building up the top few layers of the binary search tree for the names in the dictionary. Unfortunately, this useful property was overwhelmed by the need to synchronize on the cache to fetch out the entry name at a given position while binary searching.

So I decided to add an honest-to-goodness binary search tree to the dictionary. This tree is populated when the dictionary is loaded, and since it never changes, it's completely thread safe. Because we took out the other caches and because the BST is so simple, we can afford to devote more entries to the BST. Every level that we add to the tree may save us a read from the underlying file, which is good, because the fastest read is the one that you don't have to make.

Each node in the BST stores the name at that position in the dictionary and the upper and lower extents of the search, so that once we've traversed the BST we know which chunk of the dictionary entries to continue the search in.

Here's the results for the new dictionary test, where we use a 1024 node BST (i.e., a tree 9 levels deep), along with a graph (note the log scale!) comparing this one to the old one:

# of threadsTotal Number of LookupsAverage Lookup Time (ms)

That gives us a speedup of about 1.8 times at 64 and 128 threads, and a slightly smaller speedup of 1.6 times with smaller numbers of threads.

The last time I did this, I got a comment asking why we were even hitting the disk when we had so much memory on this box. That's a good question, but I'm pretty sure that we are hitting the disk. If we copy the index into a RAM disk and run the dictionary test, here are the numbers and the corresponding graph:

# of threadsTotal Number of LookupsAverage Lookup Time (ms)

So, yeah, I'd say we're hitting the disk. That 9.67ms per lookup with 128 threads is pretty nice. That's about 123 times faster than original code was doing with a disk-based lookup.

While I was debugging my BST, I went ahead and modified the DictionaryFactory so that when you open a dictionary that is smaller than the specified cache size, we just return you a CachedDiskDictionary that loads the dictionary into a hash table when the dictionary is opened, since it all would have been cached eventually anyways.

Tuesday May 19, 2009

Scaling a dictionary

Another post about Minion's dictionaries today. We recently got hold of a really big box: it has 256 hardware threads and 256GB of RAM. This lead us to ask the question: How does Minion scale on this kind of hardware. Our initial experiments running queries on a couple of million documents with a varying number of threads (powers of 2 up to 128) showed us that as we increased the number of threads we were spending more and more time doing dictionary lookups.

Because of our EIDAP philosophy, we need to be sure that our dictionaries have good performance especially the multi-threaded case. We've tried out things on 4 or 8 processor machines, but nothing like the new beast. Although I'm writing about it, Jeff did all of the hard work here. The Sun Studio collect/analyze tools turned out to be exceedingly useful for doing this analysis.

We built a test program that selects a number of terms from a dictionary on-disk and then looks them up in the dictionary. A lot. For the runs that we'll be describing, we selected 10,000 terms. This list of terms is handed out to a number of threads. Each thread shuffles its list and then looks up the terms from its list in the dictionary until a time limit (300 seconds by default) passes.

Here's the state of affairs before we started:

Number of threadsTotal Number of LookupsAverage Lookup Time (ms)

Oh, dear. Not very good: we're pretty close to doubling the time when we double the number of threads, which is kind of the opposite of the definition of scalability. These times are fairly acceptable when we're doing querying with a small number of threads, because they're swamped by all of the other work that we're doing, like uncompressing postings. Once we get up to larger numbers of threads (around 16 or 32), the dictionary lookup time starts to dominate the times for the other work.

We started out by inspecting the code for doing a get from the dictionary. I described the way that it worked in a previous post, but the basic idea is that we do a binary search to find the term. We have an LRA cache for entries in the dictionary indexed by name that is meant to speed up access for commonly used terms. We also have an LRA cache for entries indexed by their position in the dictionary that is meant to speed up the binary search. Since dictionary access is multithreaded, we need to synchronize the cache accesses.

This was the only synchronization that was happening in the dictionary's get method, so we figured that was what was causing the scalability bottleneck. Unfortunately, a quick change to reduce the amount of synchronization by removing the entry-by-position cache didn't make any difference!

This is where collect/analyze comes in. It turns out that it can do a pretty good job of giving you visibility into where your Java code is spending its synchronization time, but it also shows you what's happening underneath Java as well. Jeff ran up the tools on our big box and I have to say that we were surprised at what he found.

The first step of a dictionary fetch is to create a lookup state that contains copies of the file-backed buffers containing the dictionary's data. Although we provided for a way to re-use a lookup state, the test program was generating a new lookup state for every dictionary lookup, which meant that it was duplicating the file-backed buffers for every lookup. The tools showed us two synchronization bottlenecks in the buffer code: the first was that we were using a non-static logger, and getting the logger caused synchronization to happen. The second was that we were blocking when allocating the byte arrays that we used to buffer the on-disk dictionary data.

We were surprised that the allocator wasn't doing very well in a multithreaded environment, and it turns out that there are (at least) two multithreaded allocators that you can use in Solaris. Unfortunately, Java isn't linked against these libraries, so using them would require LD_PRELOAD tricks when running Minion. We've always tried to avoid having to have a quadruple bucky Java invocation, and we didn't want to start now.

The answer was to use thread-local storage to store a per-thread lookup state. When a thread does its first dictionary lookup the lookup state gets created and then that lookup state is used for all future lookups. Don't worry: Jeff was careful to make sure that the lookup states associated with unused threads will get garbage collected.

Once we had that working, we re-ran our tests and got better results, but still not great. So, back to collect/analyze. Now we were seeing a bottleneck when reading data from the file. This turned out to be synchronization on the RandomAccessFile in the FileReadableBuffer. In order to read a chunk of data from the dictionary, we need to seek to a particular position in the file and then read the data. Of course, this needs to be atomic!

An NIO FileChannel offers a positional read method that does the seek-and-read without requiring synchronization (this may not be the case on some OSes, so caveat optimizer!) Our final modification was therefore to introduce a new file-backed buffer implementation, NIOFileReadableBuffer that uses a FileChannel and an NIO buffer to store the data.

We added a configuration option to the dictionaries so that we could select one or the other of the file-backed buffer implementations and then re-ran our tests. Here's the results after this change, along with a nice graph.

# of threadsTotal Number of LookupsAverage Lookup Time (ms)Speedup

Clearly, this is a keeper. At 32 threads we're doing a lot better than we were at 4 threads and almost better than we were doing at 2 threads! We start to see the times doubling again as we get to 64 and 128 threads.

Because of the nature of the test, we're not hitting the dictionary cache very much, so I expect we're starting to run into some contention for the disk here (the index is residing on a ZFS pool that's on disks that are in the box). Of course, I thought I knew what the problem was at the beginning, so back to collect/analyze we go!

Tuesday Mar 10, 2009

Dictionaries in Minion: Merging

The last major function that we're going to investigate for dictionaries is merging them.

When indexing in Minion, we try to index as much data as possible in memory before dumping the data to the disk. Each chunk of data that we dump to the disk is called a partition. We usually call a partition that was dumped directly by the indexer a "first-level partition."

A partition is composed of a number of files, and the contents of these files are mostly dictionaries and their associated postings. As indexing proceeds and we dump more first-level partitions, the PartitionManager for the index will collect a number of first-level partitions of a given size (in terms of the number of documents that contain) and merge them together into a new partition. As you can imagine, we usually call the partition resulting from a merge of first-level partitions a "second-level partition". The PartitionManager will eventually merge second-level partitions into third-level ones, and so on.

Because of our EIDAP philosophy, all of these partition merges boil down to merging a number of dictionaries. So, for example, we build the main dictionary for a merged partition by merging together the main dictionaries from the partitions that we're merging.

The main dictionary merge method is pretty straightforward, given how central it is to how indexing works in Minion.

Let's walk through a merge for the main dictionaries for a number of partitions. The first thing that you will notice is that the method signature is pretty complicated. This is a bit of an annoyance, but the merge method has to deal with a number of different situations. Let's look at the parameters for the method:

IndexEntry entryFactory
This is an empty entry of the same type as is in the dictionaries we'll be merging (and, therefore, in the dictionary that we'll be creating. We'll use it to make new entries for the dictionaries.

NameEncoder encoder
An encoder for the names in the merged dictionary. In the case of the main dictionary for the partitions, this will be a StringNameHandler that will handle the 3-in-4 front coding of the terms in the dictionary.

PartitionStats partStats
A set of simple statistics about the partition that we may be collecting while merging the terms.

DiskDictionary[] dicts
The actual dictionaries to merge.

EntryMapper[] mappers
A set of mappers for the entries. These can be used (for example) to renumber entries in the dictionary during the merge.

int[] starts
The starting document IDs for the data from each of the partitions. During indexing, the document IDs are assigned per partition (i.e., the first document ID in a partition is always 1), so during merging we need to remap the document IDs from the merged partitions.

int[][] postIDMaps
A set of old-to-new mappings for the IDs encoded in the postings associated with this dictionary.

RandomAccessFile mDictFile
The file where the merged dictionary will be (eventually) written.

PostingsOutput[] postOut
The file(s) where the merged postings will be written

boolean appendPostings
In most cases, when merging dictionaries, the postings associated with the entries in the dictionary can simply be appended onto one another to generate the postings for the merged dictionaries. There are a few dictionaries where this is not the case.

The first thing that we do for the merge is to build a heap of iterators, one for each of the dictionaries that we're merging (note that DiskDictionary implements Iterable — we'll talk more about dictionary iterators later.) The heap is organized by the names of the entries currently at the head of the iterator. While there are still iterators on the heap, the merge loop proceeds as follows:

  1. peek at the top element of the heap, and make a new entry for the merged dictionary using the name of the top element. We set the ID of the new dictionary entry in sequence or using the provided mappers.

  2. While the element at the top of the heap has the same name as the current element:

    1. Remove the top element from the heap.

    2. Read the postings associated with the element, and append them onto the postings that we're building for the new entry.

    3. Keep track of the mapping for the ID of this entry in the dictionary from which it was drawn and in the new dictionary.

    4. Advance the iterator that was at the top of the heap. If there is another element on that iterator, put the iterator back on the heap.

    5. peek at the element at the top of the heap.

  3. Write the merged postings to the postings file

  4. Write the new, merged entry to the dictionary using a DictionaryWriter

See, simple, right? There's actually quite a bit of complexity hidden in some of those steps (most notably, appending or merging postings — we'll talk more about that later too), but the basics actually are pretty straightforward.

Thursday Mar 05, 2009

Dictionaries in Minion: Dumping a Dictionary

Once the indexer decides that it's time to dump a partition (because
memory's getting full or because it was told to by a call to SearchEngine.flush),
the partition dump basically reduces into dumping a bunch of
dictionaries and their associated postings data.

The MemoryDictionary.dump
method is responsible for dumping the dictionary. The first step in a
dictionary dump is to instantiate a DictionaryWriter.
This is a helper class that does most of the actual work of encoding
and writing the dictionary. We use a helper class because a lot of
the work done during the dump of a dictionary is also done when
merging dictionaries (more on that in an upcoming post.)

Once we're ready to start writing the dictionary, the entries in the
dictionary are sorted by their names. We sort the entries by name so
that we can use binary search to find entries at query time. We could
avoid the sorting step by keeping the entries in a
SortedMap at indexing time, but it turns out to be faster
to sort the entries when we're ready to dump them than to keep them
sorted while we're indexing them.

Although the dictionary assigned integer IDs to the entries during
indexing, at dump time we can take the opportunity to renumber the
entries so that the ID for an entry corresponds to its position in the
sorted list. If we do this, looking up a dictionary entry by ID can
be done very quickly at query time.

We do this renumbering for dictionaries like the main dictionary for a
partition (the one containing the terms from the documents), but we
skip it for the terms in the document dictionary. This renumbering
business is one of the places where the EIDAP philosophy gets a bit
burdensome. When you're sorting the entries in the document
dictionary, you don't want to renumber the entries, because the IDs
that were assigned to these entries have been encoded into the
postings lists for the terms in the main dictionary and we don't want
to have to re-encode all those postings.

In the cases where we don't renumber the entries, we need to keep a
map from the ID for an entry to the position in the dictionary, so
that we can continue to do fast lookups by ID at query time.

When we do renumber the entries, we can optionally keep a map from old
entry IDs to new entry IDs. The ID map from one dictionary can be
useful when dumping another dictionary's entries.

Once we've figured out the order that the entries will be dumped in
and what the entries' final IDs will be, each entry is written out to
the on-disk representation, via the DictionaryWriter.
The on-disk representation for the dictionary is a DictionaryHeader,
followed by four buffers.

The first buffer contains a standard 3-in-4 front coding of the names
of the entries. The basic idea is that we group the names of the
terms into groups of 4. The first name in a block is uncompressed,
the remaining names in the block have any initial substring shared with
the previous name removed and replaced with the length of the shared
initial substring.

Here's an example from an index I had lying around. A four word block
is composed of the words: evergreen, evergrey,
everlast, everlasting. These will get compressed
into a block of names that looks like:


where the numbers are actually bytes representing the initial
shared substring length. Thus 37 characters are compressed into 23. The
encoding of names is handled by classes that implement the NameEncoder
interface. The appropriate name encoder for a dictionary is passed
into the dump method for a dictionary.

The second buffer encodes the positions of the start of each of the
blocks of four terms in the dictionary. Because we need to be able to
quickly find the start of a block, these positions are encoded using
a fixed number of bytes.

The third buffer encodes the rest of the information contained in a
dictionary entry. This is stuff like the number of occurrences of the
term and the offset in the postings file where the postings associated
with the term can be found. Each entry knows how to encode this
information onto a buffer, and the encoding is done by the IndexEntry.encodePostingsInfo

The fourth buffer encodes the position in the third buffer of the
entry information for each entry. Again, we need to be able to find
the information for an entry quickly when we're fetching entries at
query time, so the offset is encoded using a fixed-length encoding.

As we're encoding the information for the entries, we also dump the
collected postings information to the appropriate postings files,
keeping track of the offsets where the postings are dumped and how
many bytes are in the postings.

Because we use the DictionaryWriter for merging
dictionaries too, we can't assume that the dictionary representation
can be stored completely in memory. Believe me, life would be easier
if we could, but this really does get to be a problem with large
indices. To work around this, when we're encoding the dictionary, we
use buffers that have an in-memory cache and are backed by a temporary
file. In the case of small dictionaries, this works out to be pretty
much a wash, because we only ever use the in-memory cache and never
have to write any of the data to the file.

The last stage of dumping the dictionaries is to write the dictionary
header and then transfer the encoded buffers into their final
destination file. We make sure this is done efficiently using NIO
channel transfer operations.

Monday Mar 02, 2009

Dictionaries in Minion: Indexing

Following on with our discussion of dictionaries, lets look at how
dictionaries get used during indexing. The MemoryDictionary
class is an implemenation of the Dictionary
interface that is used during indexing.

The basic functionality for storing dictionary entries while indexing
is handled by a HashMap<Object,Entry>. Note that
the system is structured so that we never have to worry about a
MemoryDictionary being accessed by multiple threads, so
we don't need to worry about concurrency for this map.

We create a MemoryDictionary by calling the constructor,
passing in the Class for the entries that will be stored
in the dictionary. The astute reader will realize that this is an
opportunity to use generics, and genericizing the dictionaries is on
our list of things to do.

Once we've got a dictionary, we want to add some entries to it. Code
to do this will typically look something like the following:

IndexEntry mde = (IndexEntry) mainDict.get(name);
if(mde == null) {
mde = mainDict.newEntry(name);
mainDict.put(name, mde);

The code tries to retrieve an entry with the given name from the
mainDict dictionary. If there is no such entry, then one
is generated by the dictionary (which is why it needs the entry class
above.) This entry is then put into the dictionary with the given

The MemoryDictionary.newEntry
method is responsible for constructing an entry with the given name
and assigning that entry a numeric ID.

The newEntry method is actually kind of complicated,
because it's responsible for handling the difference between cased and
uncased entries. If the dictionary is using cased entries, the
newEntry method will make sure that the cased entry
points to the uncased entry, so that postings can be added to both
when necessary without having to do multiple dictionary lookups when
processing a token or field value.

This is pretty much all of the action that a dictionary sees during
indexing: creating entries and fetching existing entries so that
postings can be added to them (I'll be covering the postings stuff

Up next: how a dictionary gets dumped (don't worry, it has a happy ending!)

Wednesday Feb 25, 2009

Dictionaries in Minion: An Introduction

As I've mentioned previously, our design philosophy for Minion is that everything is a dictionary and postings.

To recap, in Minion, a dictionary is a map from entry names to entries. The entry name may be a term from a document, a value from a saved field, or a bigram that will be used for doing wildcard matching. The entry pointed to by the name contains information about that name, including frequency of occurrence, the dictionary that the entry was drawn from, and (most importantly) a pointer to the postings associated with that entry.

There are actually two kinds of dictionaries in Minion. The first kind is a MemoryDictionary, which is used to hold a dictionary at indexing time. The second kind is a DiskDictionary, which is used to hold a dictionary at query time. Both of these dictionary types are implementations of the Dictionary interface that describes the capabilities that all dictionaries share.

Over the next few posts I'll be describing the entries that you'll find in dictionaries, and how the MemoryDictionary and DiskDictionary classes are used in Minion.

Posts in this series:


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.


« April 2014