Wednesday Jun 17, 2009

Crikey! Fix a bug and look what you get!

Our intern arrived at the beginning of June and promptly began putting bugs into my pristine and utterly bug free search engine code.

One of the bugs he planted/discovered was that the recent change to use a BST for part of the dictionary lookup resulted in a BST search that was a bit, shall we say, overzealous in searching the tree. I fixed the bug and committed the change. Today I had a bit of time, so I decided to re-run the multi threading tests to see how the change affected our dictionary lookup times.

I was expecting the times to be worse, because I thought that perhaps we were not finding the right entries (but the dictionary code checks for that), but here's what I found:

# of threadsTotal Number of LookupsAverage Lookup Time (ms)

That's a huge speedup at the low end: somewhere between 3 to 5 times! Something weird happens in the middle where the results are worse, but then at the top end is settles down to be more than twice as fast. Also notice that I added a result for 200 threads (hey, why not, right?)

Here's the graph:

I have to tell you: it's pretty cool to see a machine with 256 hardware threads looking like this in prstat:

  1529 stgreen   100M   86M cpu169  30    0   4:00:48  71% java/223

Onward to query performance...

Thursday May 28, 2009

Scalin' Dictionaries 2: Electric BST

I've been working on the scalability properties of the dictionaries again. The last time, thanks to Jeff, we managed to get the dictionary to scale pretty well up to 32 threads and acceptably up to 64 threads by removing the synchronization bottlenecks on allocating lookup states during dictionary lookups and by using positional reads on the files containing the dictionaries.

Once we'd done this, I went back to collect/analyze and had a look at where the synchronization time was being spent during a run of our dictionary test program with 128 threads. I was surprised to see that a lot of synchronization time was being spent adding elements to the caches that we're mainitaining. This was a bit weird, because there's no explicit synchronization in the java.util.LinkedHashMap that we were using for the cache. I suspected that we were hitting an allocation problem with the objects that it allocated to put on the linked list that preserves the insertion order.

Aside from the synchronization problems, the main problem with the caches, in particular the entry-by-name cache, is that we're not getting that many hits on the cache during querying. In some of our query tests, the cache hit rate is about 10%, which means we're doing a lot of synchronization for not a lot of benefit.

So, I did away with both the entry-by-name cache and the name-by-position cache that we were using. The name-by-position cache actually was used: over time it was building up the top few layers of the binary search tree for the names in the dictionary. Unfortunately, this useful property was overwhelmed by the need to synchronize on the cache to fetch out the entry name at a given position while binary searching.

So I decided to add an honest-to-goodness binary search tree to the dictionary. This tree is populated when the dictionary is loaded, and since it never changes, it's completely thread safe. Because we took out the other caches and because the BST is so simple, we can afford to devote more entries to the BST. Every level that we add to the tree may save us a read from the underlying file, which is good, because the fastest read is the one that you don't have to make.

Each node in the BST stores the name at that position in the dictionary and the upper and lower extents of the search, so that once we've traversed the BST we know which chunk of the dictionary entries to continue the search in.

Here's the results for the new dictionary test, where we use a 1024 node BST (i.e., a tree 9 levels deep), along with a graph (note the log scale!) comparing this one to the old one:

# of threadsTotal Number of LookupsAverage Lookup Time (ms)

That gives us a speedup of about 1.8 times at 64 and 128 threads, and a slightly smaller speedup of 1.6 times with smaller numbers of threads.

The last time I did this, I got a comment asking why we were even hitting the disk when we had so much memory on this box. That's a good question, but I'm pretty sure that we are hitting the disk. If we copy the index into a RAM disk and run the dictionary test, here are the numbers and the corresponding graph:

# of threadsTotal Number of LookupsAverage Lookup Time (ms)

So, yeah, I'd say we're hitting the disk. That 9.67ms per lookup with 128 threads is pretty nice. That's about 123 times faster than original code was doing with a disk-based lookup.

While I was debugging my BST, I went ahead and modified the DictionaryFactory so that when you open a dictionary that is smaller than the specified cache size, we just return you a CachedDiskDictionary that loads the dictionary into a hash table when the dictionary is opened, since it all would have been cached eventually anyways.

Tuesday May 19, 2009

Scaling a dictionary

Another post about Minion's dictionaries today. We recently got hold of a really big box: it has 256 hardware threads and 256GB of RAM. This lead us to ask the question: How does Minion scale on this kind of hardware. Our initial experiments running queries on a couple of million documents with a varying number of threads (powers of 2 up to 128) showed us that as we increased the number of threads we were spending more and more time doing dictionary lookups.

Because of our EIDAP philosophy, we need to be sure that our dictionaries have good performance especially the multi-threaded case. We've tried out things on 4 or 8 processor machines, but nothing like the new beast. Although I'm writing about it, Jeff did all of the hard work here. The Sun Studio collect/analyze tools turned out to be exceedingly useful for doing this analysis.

We built a test program that selects a number of terms from a dictionary on-disk and then looks them up in the dictionary. A lot. For the runs that we'll be describing, we selected 10,000 terms. This list of terms is handed out to a number of threads. Each thread shuffles its list and then looks up the terms from its list in the dictionary until a time limit (300 seconds by default) passes.

Here's the state of affairs before we started:

Number of threadsTotal Number of LookupsAverage Lookup Time (ms)

Oh, dear. Not very good: we're pretty close to doubling the time when we double the number of threads, which is kind of the opposite of the definition of scalability. These times are fairly acceptable when we're doing querying with a small number of threads, because they're swamped by all of the other work that we're doing, like uncompressing postings. Once we get up to larger numbers of threads (around 16 or 32), the dictionary lookup time starts to dominate the times for the other work.

We started out by inspecting the code for doing a get from the dictionary. I described the way that it worked in a previous post, but the basic idea is that we do a binary search to find the term. We have an LRA cache for entries in the dictionary indexed by name that is meant to speed up access for commonly used terms. We also have an LRA cache for entries indexed by their position in the dictionary that is meant to speed up the binary search. Since dictionary access is multithreaded, we need to synchronize the cache accesses.

This was the only synchronization that was happening in the dictionary's get method, so we figured that was what was causing the scalability bottleneck. Unfortunately, a quick change to reduce the amount of synchronization by removing the entry-by-position cache didn't make any difference!

This is where collect/analyze comes in. It turns out that it can do a pretty good job of giving you visibility into where your Java code is spending its synchronization time, but it also shows you what's happening underneath Java as well. Jeff ran up the tools on our big box and I have to say that we were surprised at what he found.

The first step of a dictionary fetch is to create a lookup state that contains copies of the file-backed buffers containing the dictionary's data. Although we provided for a way to re-use a lookup state, the test program was generating a new lookup state for every dictionary lookup, which meant that it was duplicating the file-backed buffers for every lookup. The tools showed us two synchronization bottlenecks in the buffer code: the first was that we were using a non-static logger, and getting the logger caused synchronization to happen. The second was that we were blocking when allocating the byte arrays that we used to buffer the on-disk dictionary data.

We were surprised that the allocator wasn't doing very well in a multithreaded environment, and it turns out that there are (at least) two multithreaded allocators that you can use in Solaris. Unfortunately, Java isn't linked against these libraries, so using them would require LD_PRELOAD tricks when running Minion. We've always tried to avoid having to have a quadruple bucky Java invocation, and we didn't want to start now.

The answer was to use thread-local storage to store a per-thread lookup state. When a thread does its first dictionary lookup the lookup state gets created and then that lookup state is used for all future lookups. Don't worry: Jeff was careful to make sure that the lookup states associated with unused threads will get garbage collected.

Once we had that working, we re-ran our tests and got better results, but still not great. So, back to collect/analyze. Now we were seeing a bottleneck when reading data from the file. This turned out to be synchronization on the RandomAccessFile in the FileReadableBuffer. In order to read a chunk of data from the dictionary, we need to seek to a particular position in the file and then read the data. Of course, this needs to be atomic!

An NIO FileChannel offers a positional read method that does the seek-and-read without requiring synchronization (this may not be the case on some OSes, so caveat optimizer!) Our final modification was therefore to introduce a new file-backed buffer implementation, NIOFileReadableBuffer that uses a FileChannel and an NIO buffer to store the data.

We added a configuration option to the dictionaries so that we could select one or the other of the file-backed buffer implementations and then re-ran our tests. Here's the results after this change, along with a nice graph.

# of threadsTotal Number of LookupsAverage Lookup Time (ms)Speedup

Clearly, this is a keeper. At 32 threads we're doing a lot better than we were at 4 threads and almost better than we were doing at 2 threads! We start to see the times doubling again as we get to 64 and 128 threads.

Because of the nature of the test, we're not hitting the dictionary cache very much, so I expect we're starting to run into some contention for the disk here (the index is residing on a ZFS pool that's on disks that are in the box). Of course, I thought I knew what the problem was at the beginning, so back to collect/analyze we go!

Friday May 01, 2009

Running Mahout on Elastic MapReduce

Here in the Labs we have a Big Data reading group. The idea is that we get together once a week and discuss a paper of interest. We've covered a lot of the famous ones, like the initial papers for GFS and MapReduce. A couple of weeks ago, I volunteered to tackle the paper from Stanford that lays out methods for running a number of standard machine learning techniques in a MapReduce framework.

The Apache Mahout project was started to build the algorithms described in the paper on the Hadoop MapReduce framework (the original paper describes running the algorithms on multicore processors.) They've also brought in the Taste Collaborative Filtering framework, which is interesting to us as recommendation folks. As it turns out, they had just released Mahout 0.1. around the time we were going to read the paper.

Coincidentally, Amazon had just announced their Elastic MapReduce (EMR) service that lets you run a MapReduce job on EC2 instances, so I decided to see what it would take to get Mahout running on EMR.

I didn't manage to get it running in time for the reading group, but one Mahout issue and a few "Oh, that's the way it works"es later, I had it running.

Apparently I'm the first person to have run Mahout on Elastic MapReduce, which just shows, as my father used to say, that brute force has an elegance all its own.

If you're interested the details are on the Mahout wiki.

Tuesday Apr 21, 2009

Distributed Key Stores: Roll your own!

Leonard Lin has some interesting notes on distributed key stores. He implemented a distributed key store for a client based on Tokyo Cabinet and a consistent hashing scheme to distribute data.

The really interesting part of this though, is that he had a look at a lot of the available options (admittedly on a pretty tight schedule) and his conclusion is that, given the maturity of the options available you could probably write your own in a day.

I'm pretty interested in retrying some of this evaluation with our own data. I'm not sure how the numbers will compare given that we're doing inverted indexing on the text in the items that we're storing, but it will be interesting to find out!

Wednesday Apr 08, 2009

My god, it's full of sshes

Here's a handy hint for controlling a load of EC2 instances (or some other griddy type of thing) from a single terminal in MacOS.

Check out the awesome screenshot:

Of course, if your data store is running on 128 instances, you're going to have teeny-tiny terminal windows...

Thursday Apr 02, 2009

The new machine

Doop-de-do, logging in to the new machine.

[stgreen@hogwarts 13:38:21 ~]$ 

Huh. I wonder how many processors this machine has.

[stgreen@hogwarts 13:38:22 ~]$ /usr/sbin/psrinfo -v
Status of virtual processor 0 as of: 04/02/2009 13:38:26
  on-line since 03/31/2009 14:56:19.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 04/02/2009 13:38:26
  on-line since 03/31/2009 14:56:22.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 2 as of: 04/02/2009 13:38:26
  on-line since 03/31/2009 14:56:22.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.

[...Many lines deleted...]

Status of virtual processor 253 as of: 04/02/2009 13:38:27
  on-line since 03/31/2009 14:56:23.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 254 as of: 04/02/2009 13:38:27
  on-line since 03/31/2009 14:56:23.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 255 as of: 04/02/2009 13:38:27
  on-line since 03/31/2009 14:56:23.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.

OK. That's a lot of processors there. I wonder how much memory it has.

[stgreen@hogwarts 13:38:27 ~]$ /usr/sbin/prtconf | head -2
System Configuration:  Sun Microsystems  sun4v
Memory size: 261856 Megabytes

Uh, wow. That's a lot of RAM. I wonder how much disk space.

[stgreen@hogwarts 13:43:49 ~]$ df -hl 
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c0t0d0s0       15G    10G   4.6G    69%    /
swap                   161G   414M   161G     1%    /tmp
swap                   161G    56K   161G     1%    /var/run
scratch                134G    60G    74G    45%    /scratch

Heh. So I guess at start up we should just cache the disk then, right?

Anyone have any single instance, multi-threaded search scalability tests they want me to try?

Thursday Mar 26, 2009

Where does cloud computing happen?

Well, Penny Arcade pretty much nails cloud computing. It happens in the sky!.

OnLive does sound good, but I'll be interested to see how bad the input lag is. If it works, can I finally get Verizon to give me a SunRay desktop at home, so that I don't have to dick around with Windows anymore?

And while we're at it: Violence Fight!

Monday Mar 23, 2009

Why search can be weird and frustrating (and fun!)

From CNet, we see that Craigslist bests MySpace as top search term. The article contains a graph from hitwise showing the relative popularity of searches for craigslist, myspace, and facebook.

This is weird to me, because these are the actual domain names of the things that the users are (in all likelihood) searching for. Typing them into the location bar in Firefox takes you right to the appropriate place in each case. But a search engine is how you find things right?

It seems pretty obvious to me that anyone searching for one of these is unlikely to want to find all the occurrences of the term craigslist and it seems like the obvious thing to do with such a query is to just have the search engine issue a redirect to the appropriate resource. Of course, Google lives by showing you ads, so they're not going to do that, but they should (and do!) have the site come up at the top of the listings.

Not to give away too much, but if you look at the top searches for Sun's intranet, you see the same phenomenon: a number of the top searches in any given month are the actual host names for internal tools (the one you use to book vacation, the one you use to reserve a conference room, etc.) as well as for things like java.

Properly speaking, this is a search problem, but not a search engine problem, per se. A search engine can easily find all the places where a given word (or domain name) occurs, but it takes someone who understands the user community of the search engine and the purposes to which a search interface is going to be put to make sure that useful things happen.

It seems to me that if you're going to offer a search interface to a wide community, you need to be sure that you're watching your query logs and noticing things like the fact that people are search for the internal tools and then making sure that queries for those things are being handled in the appropriate way.

For example, behind the firewall, a search for a particular Web-based application should simply re-direct to that application (perhaps with a delay for the one-in-a-million person who actually wanted to search for that term.) A search for a person's name (or a close spelling) should generate a results page with just their information from the corporate directory.

In my experience, however, search seems to start cycling around how much stuff is being crawled and what document formats are supported for indexing, rather than on how it can be made truly useful.

At least it means there's lots of low-hanging fruit for a dedicated search guy!

Friday Mar 20, 2009

Updates to Minion's query API

I've actually started to convert our AURA project code to use the new Minion query API, and as a result I decided that having a single Relation class for all of the relational operators really makes code that uses the query API look ugly.

I've just checked in a change to the query API that provides individual classes for the relational operators. These are all simple specializations of the Relation class, but it makes the code using the API a lot prettier. For example, this:

        Element query = 
	   new And(new Relation("aura-type",
                   new Relation("aura-name", 

becomes this:

        Element query = 
	   new And(new Equals("aura-type", "artist"),
                   new Substring("aura-name", artistName));

Which is a lot neater, in my opinion. Don't ever forget that your API is the user interface for your library!

While debugging some problems yesterday I discovered the need to print out the queries, so while I was in there, I added toString methods that produce a nice string version of a query expressed using the API. The above query will be converted to the string:

(And (= aura-type artist) ( aura-name coldplay))

(assuming that artistName is "coldplay", obviously!) And yes, that is an s-expression. Lisp programmers represent! (OK, OK, it's not really an s-expression (spaces in the field values will screw things up), but it's close, and the parentheses delineate the expressions nicely.)

The changes to the code have been committed and we should have new javadoc up later on today.

Wednesday Mar 18, 2009

Clsutering in Minion

Mostly just a note to myself here, but this post on speeding up K-means clustering should come in handy when I get back to Minion's clustering code.

I think we might already be covered, because we're usually clustering document vectors, and those are represented sparsely by nature. The memory locality is something we'd need to keep in mind, though.

They're getting a minute per epoch on a "few hundred thousand" text messages. I'll have to see what Minion's clustering performance would be like on that size of data. In our internal use of the clustering we have good interactive-time (i.e., you can run clustering as part of an async HTTP request in a Web app) clustering performance up to about a thousand (short) docs using K-means.

Monday Mar 16, 2009

Lightweight Morphology vs. Stemming: What's the Difference?

Otis asked whether there were any obvious cases where lightweight morphology does better than stemming, or where stemming does better than morphology. When running the performance numbers for a stemmed index versus an unstemmed one the other day, there were some instances where the number of returned documents were wildly different. I wrote a quick script to find the terms that had more than a given percentage difference in the number of documents returned for the queries, which would show me the queries where the differences were the largest.

Let's look at a continuum here, starting with LiteMorph generating way more documents than stemming. Here's an interesting one:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs
knowledge knowledg 50088623574

Yow. The problem here is that it looks like we're over-generating for knowledge. From the kitchen sink, we can use the :morph command to show the variants for a term that actually occur in an index:

> :morph knowledge
 knowledge (22136)
 know (430909)
 known (74908)
 knew (4929)
 knows (17479)
 knowing (6967)

Here we're using the merged version of the unstemmed index. The numbers don't necessarily add up because multiple variants might occur in a single document. Since know is an irregular verb, it's in the exception table for LiteMorph, so this would be an easy fix (and I'll probably make it.)

A little further along the continuum we see an interesting set:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs

Judging by the numbers, I'd say that litemorph considers those terms to be equivalent, while the stemmer only considers writes and writing to be equivalent. Let's consult the sink:

> :morph writer
 writer (12729)
 write (65901)
 writers (3994)
 writes (17922)
 wrote (412434)
 writeable (615)
 written (31511)
 writings (84)
 writing (22414)

IMHO, this one's a draw. I would say that written should be in the equivalence set for writes, but that writer (i.e, someone who writes) is a bit of a tougher sell. The big miss here for the stemmer is that it didn't get the past tense wrote (why so many wrotes? Don't forget that this is from email archives!) This example is also the exception table for LiteMorph, since write is an irregular verb.

This pattern shows up for other words in our test set like caller and submitter, which are not in the exception list, so we'd probably need to fix this one by modifying the LiteMorph rules and the exception table. If we decided to fix it, that is.

There are some clear cases where irregular verbs get missed by the stemmer, like the following:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs

Perhaps my Porter stemmer is out of date? I suppose I should try this with the stemmer(s) in Lucene.

There are a lot of cases where the difference between the stemmed and unstemmed indices is only a few documents:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs

LiteMorph allows and the stemmer leaves out copier and copiers in this case.

At the far end of the spectrum are the terms for which LiteMorph produces far fewer documents than stemming. Here are a few interesting ones:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs

The stemmer has conflated all of these words into the stem engine, while the LiteMorph engine considers them to be two separate clusters:

> :morph engine
 engine (21082)
 engined (2)
 enginers (4)
 engines (2438)
 enginer (53)
> :morph engineer
 engineer (268417)
 engineeing (13)
 engineering (119954)
 enginee (26)
 engineered (547)
 engineers (28743)
 enginees (1)
 engineerings (68)

There are also cases where I think stemming makes a conflation that is incorrect. For example:

Unstemmed TermStemmed TermUnstemmed DocsStemmed Docs

I don't think the conflation of informed and information is very good, and LiteMorph doesn't make it.

All in all, for the 1806 terms that we tested, there were 820 terms whose results were within 100 documents of each other, 300 terms where LiteMorph produced more results, and 686 terms where stemming produced more results.

It's not clear to me that there's a firm conclusion to be drawn here, except that which one is better probably depends on what you're trying to do. Certainly, stemming the index will give you better performance on searching at the expense of not being able to search for exact forms (unless you index the root forms). LiteMorph allows you to generate variants, but there's clearly some pretty weird stuff in there.

Friday Mar 13, 2009

Lightweight Morphology vs. Stemming

Otis asked about the cost incurred doing LiteMorph expansion of terms during searches. I haven't really looked at this before, since we don't stem indices by default, but it's a fair question.

I happen to have a few million email messages lying around, so I ran up an index of around 1.7 million of them with our standard index and a stemmed index. Here's what our standard index looks like at the end of the run, as reported by our QueryTest kitchen-sink tester program:

18 active partitions: 1007 (763425) 1266 (195146) 1521 (200812) 
 1797 (194458) 2083 (185667) 2143 (35463) 2203 (37921) 2254 (42842) 
 2307 (34676) 2316 (6611) 2327 (8591) 2334 (6901) 2344 (7566) 
 2345 (438) 2346 (1447) 2347 (554) 2348 (763) 2349 (3) 
 Sorting specification is: -score
 Partitions have:
  1723284 documents
  2337983867 tokens
  10212437 terms

There are 18 partitions in this index (that's pretty ripe for a merge, actually.) The numbers in the parentheses are the number of documents in each partition. You can see the number of documents, tokens, and terms in the entire index.

Here's what the stemmed index looks like:

16 active partitions: 881 (1248212) 1086 (238562) 1130 (49578) 1176 (50489) 
 1211 (48669) 1246 (52899) 1252 (9199) 1258 (10005) 1264 (8902) 1265 (2001) 
 1266 (38) 1267 (1963) 1268 (2001) 1269 (763) 1270 (2) 1271 (1) 
 Sorting specification is: -score
 Partitions have:
  1723284 documents
  2337983867 tokens
  6970295 terms

I wrote a quick program to select some random words from the main dictionary of the largest partition in an index. In this case, that was partition 1007, whose main dictionary contained nearly 5.5 million words.

Because we want to test on "interesting" words, I restricted the search to words that are longer than 5 alphabetic characters and that occur in more than 0.5% of the documents in that partition (3817 documents, in this case). This resulted in 1806 terms (Zipf's Law in action, I guess!).

Using our new query API, I wrote a program that takes each of the words and runs it as a single term query. The program takes a switch on the command line to indicate whether the terms should be stemmed or morphed.

I'm running Solaris 10 on the test machine. psrinfo -v on this machine says:

Status of virtual processor 0 as of: 03/11/2009 13:27:56
  on-line since 01/04/2009 22:05:36.
  The i386 processor operates at 2200 MHz,
        and has an i387 compatible floating point processor.
Status of virtual processor 7 as of: 03/11/2009 13:27:56
  on-line since 01/04/2009 22:05:54.
  The i386 processor operates at 2200 MHz,
        and has an i387 compatible floating point processor.

It's a four processor box, where each processor has two cores (it's a v40z.) The box has 32GB of RAM, the java is version 1.6.0_06, and I'm running with -Xmx1g (the actual process size doesn't get much above 300MB, though.)

The collections are stored on a ZFS file system that's on a disk array attached to the box via fiber channel. This is fairly low-performing storage (after we got it we were told that it was meant for near line backup. Sigh.). Here's the pool:

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
files                  2.27T    415G   1.86T    17%  ONLINE     -
And the status:
  pool: files
 state: ONLINE
 scrub: none requested

	NAME                                         STATE     READ WRITE CKSUM
	files                                        ONLINE       0     0     0
	  raidz2                                     ONLINE       0     0     0
	    c0t600C0FF00000000009234951BE0FE300d0s2  ONLINE       0     0     0
	    c0t600C0FF00000000009234968DA6E9000d0s2  ONLINE       0     0     0
	    c0t600C0FF00000000009234973FFDC2800d0s2  ONLINE       0     0     0
	    c0t600C0FF000000000092349113B66D600d0s2  ONLINE       0     0     0
	    c0t600C0FF000000000092349239DC55200d0s2  ONLINE       0     0     0

I'm pretty sure each of those vdevs is built out of multiple disks in the actual array. (Ask me about this if you think it matters, and I can find out.)

While preparing this blog entry, I ran the query program a number of times in order to get the output that I wanted, so these indices were probably warm in the disk cache. Anyways, here's the basic results:

IndexTotal Query Time (ms)Avg. Query Time (ms)

The queries using lightweight expansion take about 1.9 times as long as the queries using stemming.

I was curious if the number of partitions in the index was affecting how long the queries were taking to run, so I generated versions of the indices where all of the partitions had been merged into a single partition. The results for this merged index were:

IndexTotal Query Time (ms)Avg. Query Time (ms)
Unstemmed (merged)59615.2133.01
Stemmed (merged) 30880.8917.01

This is a speedup of around 12% in both cases, but about the same ratio for the times.

So, where's the extra time going, you ask? One of the things that we've added to Minion recently is query statistics processing. The engine keeps track of the number of dictionary lookups, dictionary cache hits, how many postings lists are loaded, how long it takes to load them, an so on. The stuff we're collecting is in the QueryStats class, and at any time you can ask the search engine for the current set of query statistics. Here's the stats for the merged index for the unstemmed index:

1806 queries in 59615.21ms, 33.01ms per query
Dictionary activity:
 34952 lookups in 1391.21ms (2.33% of total), 0.04ms per lookup
 Cache Hits:                           208
 Cache Misses:                       34744
Postings Activity
 9650 postings lists read in 2180.81ms (3.66% of total), 0.23ms per read
 Average Postings size:               67.7KB
 Term Cache Hits:                        0
 Term Cache Misses:                      0
 Generating term cache entries:        0.0ms (0.00% of total)
 Postings iteration:               25670.3ms (43.06% of total)
Query Activity
 Union processing:                 25671.7ms (43.06% of total)
 Intersect processing:                 0.0ms (0.00% of total)
 Sorting postings:                 22929.8ms (38.46% of total)
 Normalizing scores:                3189.2ms (5.35% of total)
and here's the stats for the merged, stemmed index:
1806 queries in 30880.89ms, 17.10ms per query
Dictionary activity:
 1806 lookups in 340.58ms (1.10% of total), 0.19ms per lookup
 Cache Hits:                           209
 Cache Misses:                        1597
Postings Activity
 1806 postings lists read in 2063.93ms (6.68% of total), 1.14ms per read
 Average Postings size:              320.2KB
 Term Cache Hits:                        0
 Term Cache Misses:                      0
 Generating term cache entries:        0.0ms (0.00% of total)
 Postings iteration:               23008.0ms (74.51% of total)
Query Activity
 Union processing:                 23009.4ms (74.51% of total)
 Intersect processing:                 0.0ms (0.00% of total)
 Sorting postings:                     0.0ms (0.00% of total)
 Normalizing scores:                3143.3ms (10.18% of total)

Note that there's some overlap in the categories and we don't count everything, so the numbers don't add up to 100%.

The big difference between these two is the "Sorting Postings" time. This is the time that's spent integrating the postings for the morphological variants. When we're processing the postings for the variants, experience has shown us that it's faster to list out all the documents and then sort them to combine the scores from documents with more than one variant than it is to merge them as we're processing the variants. Of course, in the stemmed case, we only ever iterate through one postings list, so the sort isn't necessary.

Thursday Mar 12, 2009

A query API for Minion

Minion actually offers three different query languages (and is designed so that others can be added pretty easily), but it can be difficult to use the query languages when a program (as opposed to a person) is writing the queries.

For example, lets say that you're providing a search engine for musical artists, and you want your users to be able to search by artist name. The first thing that you'll end up doing is quoting the value provided by the user by doing something like:

   String query = String.format("artist-name = \\"%s\\"", name);;
because you know that artist names can have spaces in them.

This will work for a while, but you will soon find that there's an amazing amount of stuff that occurs in the names of musical artists. We have a crawl of information about 50,000 musical artists (more on that in the days to come, I hope.) If we sort the list of artists by name, here's the first 15:

!Action Pact!
"Brother" Jack McDuff
"Little" Louie Vega
"Weird Al" Yankovic
#9 Dream
$wingin' Utter$
'Til Tuesday
't Hof van Commerce
(Love) Tattoo
(The Sounds Of) Kaleidoscope
(Young) Pioneers

Let's say that our user has typed in "Weird Al" Yankovic. If we're not careful about escaping the quotes, we're either going to have a parse exception or get weird results because the query parser will interpret the query differently than the user intended. At this point you begin to have a dawning sense of horror that you're going to spend the rest of your time figuring out what the set of weird characters is (usually when something breaks) and then dealing with escaping them on a case-by-case basis.

To completely sidestep this problem, we've added a programmatic query API to Minion. The idea is pretty simple: you build a Java structure that describes the query and then give it to the search engine to evaluate. Our search for an artist name above becomes:

        Element e = new StringRelation("aura-name", Relation.Operator.Equals, name);;
No need to worry about escaping things, because we're just passing around Java strings.

You can start out with creating a query element for a term and searching for just that term:

        Term j = new Term("java");
        ResultSet rs =;

As with the query languages, the term will search for morphological variations by default. We can specify how we want a term to be treated during the query by specifying modifiers:

        Term j2 = new Term("Java", EnumSet.of(Term.Modifier.CASE, Term.Modifier.MORPH));
        rs =;
which will search for the term Java (and morphological variations) only in that exact case. You can combine terms using boolean operators:
        Term l = new Term("language");
        Term s = new Term("specification");
        And a = new And(j, l, s);
        rs =;
        rs = Or(j, l, s));

Note that there are varargs constructors for And and Or so you don't need to clutter up your code with lots of anonymous arrays.

You can combine boolean operators and terms in the natural way:

        And ao = new And(j, new Or(l, s));

Using the query API also means that you don't have to worry about what the precedence of the operators is in the query language (e.g., "What does java <and> language <or> specification do?").

As indicated above, you can use the query API for relational querying too:

        Relation sr = new Relation("subject", Relation.Operator.SUBSTRING,
        Relation dr = new Relation("date", Relation.Operator.LESS_THAN,
and, of course, you can combine relational and boolean queries:
        And subj = new And(j, l, s, sr);

You can restrict elements to be searched for in a particular field or fields in the following way:

        Term j = new Term("java");
you can also use a set of fields, if you prefer. When more than one field is specified, a hit in any of the fields means that the document will be in the result set.

Finally, you can indicate that the results generated by a particular element of the query should be treated in a strict boolean fashion. This means that a document is either in the set or it isn't; there are no scores associated with the documents. When combining a set of documents that were treated in a strict boolean fashion with a set where the documents are scored, the strict boolean result set will not affect the scores in the combined set (although they may affect the presence or absence of certain documents!)

This is useful to do when you want to make sure that a result set contains documents that have a particular property, but you don't want that property to mess up the scores. For example:

        Term j = new Term("java");
        Term l = new Term("language");
        And a = new And(j, l);
will return the documents that have java and language in them, but the scores for those documents won't be influenced by the score of language.

At the moment, the API only allows for boolean, relational, and range queries, but in the near future we'll be adding in proximity operators as well as document similarity operators.

The only thing that I don't really like about this new API is that it doesn't do type-checking as you build the query. For example, if you're building a relation like:

       Relation dr = new Relation("date", Relation.Operator.LESS_THAN,
we can't ensure that date is a saved field, that it's a date saved field, and that that provided value is a date. This is because the queries are built without considering any particular engine, so we can't tell what fields are defined or what their types are. Currently this is checked before the query is evaluated in the search method, and a QueryException is raised if there is a problem.

On a stylistic note, we decided to have a single Relation class rather than breaking out operators for each relation type. I still can't decide whether that was a good idea or not. Helpful suggestions welcome in the comments!

Wednesday Mar 11, 2009

Google's Postings Format

Jeff's Search Engine Caffe points to Google's Jeff Dean's slides from the WSDM 2009 Keynote.

It's a pretty interesting look at how Google has grown over the years. Of particular interest to me as a search guy is the description of some of their index formats. On slide 47 there's a pretty detailed description of the postings lists format that they were using before they went to an all-in-memory index. They were using variable length encodings (like Gamma and Huffman encoding) for some of the data, in order to get the most compact on-disk representation that could still be decoded quickly.

This was what the initial versions of the Minion engine did as well.

On slide 54, he starts to talk about the formats that they considered when they went to an all-in-memory index. Given that they could hold the whole thing in memory, they needed to use a format that could be decoded as quickly as possible, which led them to use byte-aligned variable length encodings.

They started with what they call a Varint encoding, which is the same encoding that Minion and Lucene use. The following slides detail a number of variations of this encoding scheme, culminating in slide 63, where they describe encoding groups of four numbers together and pulling out two-bit prefixes into a leading byte, in order to reduce the amount of masking and shifting necessary to decode an integer.

He says the Varint encoding could be decoded at the rate of 180M integers per second, while their "group varint" approach could decode at the rate of 400M integers per second. He points out that this means that you can only encode 30 bits per integer, but that's not much of a restriction for typical postings data (document ID deltas and positions).

I think I might have to give this one a try myself, in my copious free time.


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.


« June 2016