By searchguy on May 19, 2009
Another post about Minion's dictionaries today. We recently got hold of a really big box: it has 256 hardware threads and 256GB of RAM. This lead us to ask the question: How does Minion scale on this kind of hardware. Our initial experiments running queries on a couple of million documents with a varying number of threads (powers of 2 up to 128) showed us that as we increased the number of threads we were spending more and more time doing dictionary lookups.
Because of our EIDAP philosophy, we need to be sure that our dictionaries have good performance especially the multi-threaded case. We've tried out things on 4 or 8 processor machines, but nothing like the new beast. Although I'm writing about it, Jeff did all of the hard work here. The Sun Studio collect/analyze tools turned out to be exceedingly useful for doing this analysis.
We built a test program that selects a number of terms from a dictionary on-disk and then looks them up in the dictionary. A lot. For the runs that we'll be describing, we selected 10,000 terms. This list of terms is handed out to a number of threads. Each thread shuffles its list and then looks up the terms from its list in the dictionary until a time limit (300 seconds by default) passes.
Here's the state of affairs before we started:
|Number of threads||Total Number of Lookups||Average Lookup Time (ms)|
Oh, dear. Not very good: we're pretty close to doubling the time when we double the number of threads, which is kind of the opposite of the definition of scalability. These times are fairly acceptable when we're doing querying with a small number of threads, because they're swamped by all of the other work that we're doing, like uncompressing postings. Once we get up to larger numbers of threads (around 16 or 32), the dictionary lookup time starts to dominate the times for the other work.
We started out by inspecting the code for doing a get from the dictionary. I described the way that it worked in a previous post, but the basic idea is that we do a binary search to find the term. We have an LRA cache for entries in the dictionary indexed by name that is meant to speed up access for commonly used terms. We also have an LRA cache for entries indexed by their position in the dictionary that is meant to speed up the binary search. Since dictionary access is multithreaded, we need to synchronize the cache accesses.
This was the only synchronization that was happening in the
get method, so we figured that was what was
causing the scalability bottleneck. Unfortunately, a quick change to
reduce the amount of synchronization by removing the entry-by-position
cache didn't make any difference!
This is where collect/analyze comes in. It turns out that it can do a pretty good job of giving you visibility into where your Java code is spending its synchronization time, but it also shows you what's happening underneath Java as well. Jeff ran up the tools on our big box and I have to say that we were surprised at what he found.
The first step of a dictionary fetch is to create a lookup state
that contains copies of the file-backed buffers containing the
dictionary's data. Although we provided for a way to re-use a lookup
state, the test program was generating a new lookup state for every
dictionary lookup, which meant that it was duplicating the file-backed
buffers for every lookup. The tools showed us two synchronization
bottlenecks in the buffer
code: the first was that we were using a non-
logger, and getting the logger caused synchronization to happen. The
second was that we were blocking when allocating the byte arrays that
we used to buffer the on-disk dictionary data.
We were surprised that the allocator wasn't doing very well in a
multithreaded environment, and it turns out that there are (at least)
multithreaded allocators that you can use in Solaris.
Unfortunately, Java isn't linked against these libraries, so using
them would require
LD_PRELOAD tricks when running
Minion. We've always tried to avoid having to have a quadruple
bucky Java invocation, and we didn't want to start now.
The answer was to use thread-local storage to store a per-thread lookup state. When a thread does its first dictionary lookup the lookup state gets created and then that lookup state is used for all future lookups. Don't worry: Jeff was careful to make sure that the lookup states associated with unused threads will get garbage collected.
Once we had that working, we re-ran our tests and got better
results, but still not great. So, back to collect/analyze. Now we
were seeing a bottleneck when reading data from the file. This turned
out to be synchronization on the
RandomAccessFile in the
In order to read a chunk of data from the dictionary, we need to seek
to a particular position in the file and then read the data. Of
course, this needs to be atomic!
FileChannel offers a positional
read method that does the seek-and-read without
requiring synchronization (this may not
be the case on some OSes, so caveat optimizer!) Our final
modification was therefore to introduce a new file-backed buffer
that uses a
FileChannel and an NIO buffer to store the data.
We added a configuration option to the dictionaries so that we could select one or the other of the file-backed buffer implementations and then re-ran our tests. Here's the results after this change, along with a nice graph.
|# of threads||Total Number of Lookups||Average Lookup Time (ms)||Speedup|
Clearly, this is a keeper. At 32 threads we're doing a lot better than we were at 4 threads and almost better than we were doing at 2 threads! We start to see the times doubling again as we get to 64 and 128 threads.
Because of the nature of the test, we're not hitting the dictionary cache very much, so I expect we're starting to run into some contention for the disk here (the index is residing on a ZFS pool that's on disks that are in the box). Of course, I thought I knew what the problem was at the beginning, so back to collect/analyze we go!