Another post about Minion's dictionaries today. We recently got
hold of a really big
box: it has 256 hardware threads and 256GB of RAM. This led us
to ask the question: how does Minion scale on this kind of hardware?
Our initial experiments running queries on a couple of million
documents with a varying number of threads (powers of 2 up to 128)
showed us that as we increased the number of threads we were spending
more and more time doing dictionary lookups.
Because of our EIDAP
philosophy, we need to be sure that our dictionaries have good
performance, especially in the multi-threaded case. We've tried out
things on 4 or 8 processor machines, but nothing like the new beast.
Although I'm writing about it, Jeff did all
of the hard work here. The Sun Studio collect/analyze
tools turned out to be exceedingly useful for doing this analysis.
We built a
test program that selects a number of terms from a dictionary
on-disk and then looks them up in the dictionary. A lot. For the
runs that we'll be describing, we selected 10,000 terms. This list of
terms is handed out to a number of threads. Each thread shuffles its
list and then looks up the terms from its list in the dictionary until
a time limit (300 seconds by default) passes.
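A rough sketch of what that harness looks like, with a plain `Map` standing in for the on-disk dictionary; all of the class and method names here are ours for illustration, not Minion's:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the benchmark shape described above: each thread
// shuffles its own copy of the term list and looks terms up until a time
// limit passes.  A Map stands in for the real on-disk dictionary.
public class LookupBench {
    public static long run(List<String> terms, Map<String, Integer> dict,
                           int nThreads, long millis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        AtomicLong total = new AtomicLong();
        long deadline = System.currentTimeMillis() + millis;
        for (int i = 0; i < nThreads; i++) {
            pool.submit(() -> {
                // Each thread shuffles its own copy of the term list...
                List<String> mine = new ArrayList<>(terms);
                Collections.shuffle(mine);
                // ...and looks the terms up until the deadline passes.
                while (System.currentTimeMillis() < deadline) {
                    for (String t : mine) {
                        dict.get(t);
                        total.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(millis + 1000, TimeUnit.MILLISECONDS);
        return total.get(); // total lookups across all threads
    }
}
```

Dividing the elapsed time by that total gives the average lookup times reported below.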
Here's the state of affairs before we started:
| Number of threads | Total Number of Lookups | Average Lookup Time (ms) |
Oh, dear. Not very good: we're pretty close to doubling the time
when we double the number of threads, which is kind of the opposite of
the definition of scalability. These times are fairly acceptable when
we're doing querying with a small number of threads, because they're
swamped by all of the other work that we're doing, like uncompressing
postings. Once we get up to larger numbers of threads (around 16 or
32), the dictionary lookup time starts to dominate the times for the queries.
We started out by inspecting the
code for doing a get from the dictionary. I described the way
that it worked in a
previous post, but the basic idea is that we do a binary search to
find the term. We have an LRA cache for entries in the dictionary
indexed by name that is meant to speed up access for commonly used
terms. We also have an LRA cache for entries indexed by their
position in the dictionary that is meant to speed up the binary
search. Since dictionary access is multithreaded, we need to
synchronize the cache accesses.
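For illustration, here's a minimal sketch of the kind of synchronized least-recently-accessed cache involved, built on `java.util.LinkedHashMap`'s access-order mode; the names are ours, and Minion's real cache implementation may differ:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a synchronized least-recently-accessed cache.
// The class and method names are illustrative, not Minion's own.
public class LruCache<K, V> {
    private final int maxSize;
    private final Map<K, V> map;

    public LruCache(int maxSize) {
        this.maxSize = maxSize;
        // accessOrder=true makes iteration order least-recently-accessed
        // first, so the eldest entry is the eviction candidate.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > LruCache.this.maxSize;
            }
        };
    }

    // Every lookup from every thread funnels through this one lock:
    // exactly the kind of contention point described above.
    public synchronized V get(K key) {
        return map.get(key);
    }

    public synchronized void put(K key, V value) {
        map.put(key, value);
    }
}
```

With dozens of threads hammering `get`, that single monitor becomes the serialization point, which is why it was the obvious first suspect.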
This was the only synchronization that was happening in the
get method, so we figured that was what was
causing the scalability bottleneck. Unfortunately, a quick change to
reduce the amount of synchronization by removing the entry-by-position
cache didn't make any difference!
This is where collect/analyze comes in. It turns out that it can
do a pretty good job of giving you visibility into where your Java
code is spending its synchronization time, but it also shows you
what's happening underneath Java as well. Jeff ran up the tools on
our big box and I have to say that we were surprised at what he found.
The first step of a dictionary fetch is to create a lookup state
that contains copies of the file-backed buffers containing the
dictionary's data. Although we provided for a way to re-use a lookup
state, the test program was generating a new lookup state for every
dictionary lookup, which meant that it was duplicating the file-backed
buffers for every lookup. The tools showed us two synchronization
bottlenecks in the buffer
code: the first was that we were using a non-static
logger, and getting the logger caused synchronization to happen. The
second was that we were blocking when allocating the byte arrays that
we used to buffer the on-disk dictionary data.
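The usual remedy for the logger half of this is to fetch the logger once and cache it in a static field, rather than calling `Logger.getLogger` (which involves synchronization inside the logging framework) on every buffer operation. A hypothetical sketch, with names of our own invention:

```java
import java.util.logging.Logger;

// Illustrative sketch: cache the logger in a static field so the hot
// path never calls Logger.getLogger, which synchronizes internally.
public class BufferishExample {
    private static final Logger logger =
            Logger.getLogger(BufferishExample.class.getName());

    public void read() {
        // No getLogger call on the hot path; just use the cached instance.
        logger.fine("reading buffer data");
    }

    public String loggerName() {
        return logger.getName();
    }
}
```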
We were surprised that the allocator wasn't doing very well in a
multithreaded environment, and it turns out that there are (at least)
a couple of multithreaded allocators that you can use on Solaris.
Unfortunately, Java isn't linked against these libraries, so using
them would require
LD_PRELOAD tricks when running
Minion. We've always tried to avoid having to have a quadruple
bucky Java invocation, and we didn't want to start now.
The answer was to use thread-local storage to store a per-thread
lookup state. When a thread does its first dictionary lookup the
lookup state gets created and then that lookup state is used for all
future lookups. Don't worry: Jeff was careful to make sure that
the lookup states associated with unused threads will get garbage collected.
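Here's a minimal sketch of the thread-local pattern, with a stand-in `LookupState` class; Minion's real lookup state holds the per-thread copies of the file-backed buffers, but a counter serves to show the mechanics:

```java
// Sketch of the thread-local approach: each thread lazily creates its
// own lookup state on first use and reuses it for all later lookups.
// LookupState here is a stand-in for Minion's real class.
public class DictSketch {
    static class LookupState {
        // The real state would hold per-thread copies of the dictionary's
        // file-backed buffers; a lookup counter stands in for that here.
        int lookups;
    }

    // One LookupState per thread, created on that thread's first lookup.
    private final ThreadLocal<LookupState> state =
            ThreadLocal.withInitial(LookupState::new);

    public int lookup(String term) {
        LookupState s = state.get(); // no synchronization needed
        s.lookups++;
        return s.lookups;
    }
}
```

Because each thread only ever touches its own state, the per-lookup buffer duplication (and its allocation and logging costs) disappears from the hot path. When a thread dies, its `ThreadLocal` value becomes unreachable and can be collected.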
Once we had that working, we re-ran our tests and got better
results, but still not great. So, back to collect/analyze. Now we
were seeing a bottleneck when reading data from the file. This turned
out to be synchronization on the
RandomAccessFile in the file-backed buffers.
In order to read a chunk of data from the dictionary, we need to seek
to a particular position in the file and then read the data. Of
course, this needs to be atomic!
FileChannel offers a positional
read method that does the seek-and-read without
requiring synchronization (this may not
be the case on some OSes, so caveat optimizer!). Our final
modification was therefore to introduce a new file-backed buffer
that uses a
FileChannel and an NIO buffer to store the data.
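A sketch of the positional-read idiom: `FileChannel.read(ByteBuffer, long)` does the seek-and-read as a single call against an absolute offset, so concurrent readers never contend on a shared file position the way they would with `RandomAccessFile`'s `seek()` followed by `read()`. The class and method names are ours:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative positional read: the channel's own position is never
// touched, so multiple threads can read concurrently without locking.
public class PositionalRead {
    public static byte[] readAt(FileChannel chan, long pos, int len)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            // Read relative to an absolute position in the file.
            int n = chan.read(buf, pos + buf.position());
            if (n < 0) {
                break; // hit end of file
            }
        }
        return buf.array();
    }
}
```

For example, opening the dictionary file with `FileChannel.open(path, StandardOpenOption.READ)` and calling `readAt` from many threads at once needs no external synchronization.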
We added a configuration option to the dictionaries so that we could
select one or the other of the file-backed buffer implementations and
then re-ran our tests. Here's the results after this change, along
with a nice graph.
| # of threads | Total Number of Lookups | Average Lookup Time (ms) | Speedup |
Clearly, this is a keeper. At 32 threads we're doing a lot better
than we were at 4 threads and almost as well as we were doing at 2
threads! We start to see the times doubling again as we get to 64 and
128 threads.
Because of the nature of the test, we're not hitting the
dictionary cache very much, so I expect we're starting to run into
some contention for the disk here (the index is residing on a ZFS pool
that's on disks that are in the box). Of course, I thought I knew
what the problem was at the beginning, so back to collect/analyze we go!