Thursday Mar 26, 2009

Where does cloud computing happen?

Well, Penny Arcade pretty much nails cloud computing. It happens in the sky!

OnLive does sound good, but I'll be interested to see how bad the input lag is. If it works, can I finally get Verizon to give me a SunRay desktop at home, so that I don't have to dick around with Windows anymore?

And while we're at it: Violence Fight!

Monday Mar 23, 2009

Why search can be weird and frustrating (and fun!)

From CNet, we see that Craigslist bests MySpace as top search term. The article contains a graph from Hitwise showing the relative popularity of searches for craigslist, myspace, and facebook.

This is weird to me, because these are the actual domain names of the things that the users are (in all likelihood) searching for. Typing them into the location bar in Firefox takes you right to the appropriate place in each case. But a search engine is how you find things, right?

It seems pretty obvious to me that anyone searching for one of these is unlikely to want to find all the occurrences of the term craigslist, and the obvious thing to do with such a query is to just have the search engine issue a redirect to the appropriate resource. Of course, Google lives by showing you ads, so they're not going to do that, but they should (and do!) have the site come up at the top of the listings.

Not to give away too much, but if you look at the top searches for Sun's intranet, you see the same phenomenon: a number of the top searches in any given month are the actual host names for internal tools (the one you use to book vacation, the one you use to reserve a conference room, etc.) as well as for things like java.

Properly speaking, this is a search problem, but not a search engine problem, per se. A search engine can easily find all the places where a given word (or domain name) occurs, but it takes someone who understands the user community of the search engine and the purposes to which a search interface is going to be put to make sure that useful things happen.

It seems to me that if you're going to offer a search interface to a wide community, you need to be sure that you're watching your query logs and noticing things like the fact that people are searching for the internal tools, and then making sure that queries for those things are being handled in the appropriate way.

For example, behind the firewall, a search for a particular Web-based application should simply re-direct to that application (perhaps with a delay for the one-in-a-million person who actually wanted to search for that term.) A search for a person's name (or a close spelling) should generate a results page with just their information from the corporate directory.
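Concretely, the kind of thing I have in mind is just a check in front of the engine, fed by whatever shows up in the query logs. Here's a minimal sketch; the class, the query strings, and the URLs are all made up:

    import java.util.HashMap;
    import java.util.Map;

    // A hypothetical front-end hook that runs before a query ever reaches the
    // search engine: navigational queries that show up in the query logs get
    // redirected straight to the tool they name.  The class, tool names, and
    // URLs here are all made up.
    public class NavigationalQueries {

        private final Map<String, String> redirects = new HashMap<String, String>();

        public NavigationalQueries() {
            // Entries like these would be maintained by whoever is watching the
            // query logs.
            redirects.put("vacation", "http://vacation-tool.example.com/");
            redirects.put("conference rooms", "http://room-booking.example.com/");
        }

        /**
         * Returns the URL to redirect to, or null if the query should be passed
         * along to the search engine as usual.
         */
        public String redirectFor(String query) {
            return redirects.get(query.trim().toLowerCase());
        }
    }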

In my experience, however, discussions of search tend to cycle around how much stuff is being crawled and what document formats are supported for indexing, rather than around how search can be made truly useful.

At least it means there's lots of low-hanging fruit for a dedicated search guy!

Friday Mar 20, 2009

Updates to Minion's query API

I've actually started to convert our AURA project code to use the new Minion query API, and as a result I decided that having a single Relation class for all of the relational operators really makes code that uses the query API look ugly.

I've just checked in a change to the query API that provides individual classes for the relational operators. These are all simple specializations of the Relation class, but it makes the code using the API a lot prettier. For example, this:

        Element query =
            new And(new Relation("aura-type",
                                 Relation.Operator.EQUALS,
                                 "artist"),
                    new Relation("aura-name",
                                 Relation.Operator.SUBSTRING,
                                 artistName));

becomes this:

        Element query =
            new And(new Equals("aura-type", "artist"),
                    new Substring("aura-name", artistName));

Which is a lot neater, in my opinion. Don't ever forget that your API is the user interface for your library!
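In case you're wondering, there's nothing magical about the new classes; a specialization like Equals only needs a constructor that fills in the operator. Roughly like this (a sketch, not necessarily the exact Minion source):

    // Each of these would live in its own file, of course.
    public class Equals extends Relation {
        public Equals(String field, String value) {
            super(field, Relation.Operator.EQUALS, value);
        }
    }

    public class Substring extends Relation {
        public Substring(String field, String value) {
            super(field, Relation.Operator.SUBSTRING, value);
        }
    }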

While debugging some problems yesterday I discovered the need to print out the queries, so while I was in there, I added toString methods that produce a nice string version of a query expressed using the API. The above query will be converted to the string:

(And (= aura-type artist) ( aura-name coldplay))

(assuming that artistName is "coldplay", obviously!) And yes, that is an s-expression. Lisp programmers represent! (OK, OK, it's not really an s-expression (spaces in the field values will screw things up), but it's close, and the parentheses delineate the expressions nicely.)

The changes to the code have been committed and we should have new javadoc up later on today.

Wednesday Mar 18, 2009

Clustering in Minion

Mostly just a note to myself here, but this post on speeding up K-means clustering should come in handy when I get back to Minion's clustering code.

I think we might already be covered, because we're usually clustering document vectors, and those are represented sparsely by nature. The memory locality is something we'd need to keep in mind, though.

They're getting a minute per epoch on a "few hundred thousand" text messages. I'll have to see what Minion's clustering performance would be like on that size of data. In our internal use of clustering, we get good interactive-time performance (i.e., you can run clustering as part of an async HTTP request in a Web app) up to about a thousand (short) docs using K-means.
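Since I mentioned sparse representation, here's a sketch of the kind of assignment step I mean; this is an illustration of the idea, not Minion's clustering code:

    import java.util.Map;

    // The assignment step for K-means over sparsely-represented documents: a
    // document is a map from term ID to weight, and we only touch its non-zero
    // entries when computing distances to the (dense) centroids.
    public class SparseAssignSketch {

        /**
         * Returns the index of the centroid closest to the document, using
         * squared Euclidean distance: ||d - c||^2 = ||d||^2 - 2(d . c) + ||c||^2.
         */
        public static int assign(Map<Integer, Double> doc,
                                 double[][] centroids,
                                 double[] centroidNormSquared) {
            double docNormSquared = 0;
            for (double w : doc.values()) {
                docNormSquared += w * w;
            }

            int best = -1;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double dot = 0;
                for (Map.Entry<Integer, Double> e : doc.entrySet()) {
                    dot += e.getValue() * centroids[c][e.getKey()];
                }
                double dist = docNormSquared - 2 * dot + centroidNormSquared[c];
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            return best;
        }
    }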

Monday Mar 16, 2009

Lightweight Morphology vs. Stemming: What's the Difference?

Otis asked whether there were any obvious cases where lightweight morphology does better than stemming, or where stemming does better than morphology. When I was running the performance numbers for a stemmed index versus an unstemmed one the other day, there were some instances where the numbers of returned documents were wildly different. I wrote a quick script to find the terms that had more than a given percentage difference in the number of documents returned for the queries, which would show me the queries where the differences were the largest.
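The script itself was nothing fancy; given the per-term document counts from the two runs, something along these lines (a sketch, with made-up names) does the job:

    import java.util.Map;

    // Given term -> document count maps from the two runs, print the terms
    // whose counts differ by more than pct percent of the smaller count.
    public class FindBigDifferences {

        public static void report(Map<String, Integer> unstemmedCounts,
                                  Map<String, Integer> stemmedCounts,
                                  double pct) {
            for (Map.Entry<String, Integer> e : unstemmedCounts.entrySet()) {
                Integer stemmed = stemmedCounts.get(e.getKey());
                if (stemmed == null) {
                    continue;
                }
                int unstemmed = e.getValue();
                int smaller = Math.min(unstemmed, stemmed);
                int diff = Math.abs(unstemmed - stemmed);
                if (smaller > 0 && 100.0 * diff / smaller > pct) {
                    System.out.printf("%s %d %d%n", e.getKey(), unstemmed, stemmed);
                }
            }
        }
    }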

Let's look at a continuum here, starting with LiteMorph generating way more documents than stemming. Here's an interesting one:

Unstemmed Term   Stemmed Term   Unstemmed Docs   Stemmed Docs
knowledge        knowledg       500886           23574

Yow. The problem here is that it looks like we're over-generating for knowledge. From the kitchen sink, we can use the :morph command to show the variants for a term that actually occur in an index:

> :morph knowledge
 knowledge (22136)
 know (430909)
 known (74908)
 knew (4929)
 knows (17479)
 knowing (6967)

Here we're using the merged version of the unstemmed index. The numbers don't necessarily add up because multiple variants might occur in a single document. Since know is an irregular verb, it's in the exception table for LiteMorph, so this would be an easy fix (and I'll probably make it.)

A little further along the continuum we see an interesting set:

Unstemmed Term   Stemmed Term   Unstemmed Docs   Stemmed Docs
writes           write          488448           94183
writing          write          488448           94183
written          written        488448           31512
writer           writer         488448           15938

Judging by the numbers, I'd say that LiteMorph considers those terms to be equivalent, while the stemmer only considers writes and writing to be equivalent. Let's consult the sink:

> :morph writer
 writer (12729)
 write (65901)
 writers (3994)
 writes (17922)
 wrote (412434)
 writeable (615)
 written (31511)
 writings (84)
 writing (22414)

IMHO, this one's a draw. I would say that written should be in the equivalence set for writes, but that writer (i.e., someone who writes) is a bit of a tougher sell. The big miss here for the stemmer is that it didn't get the past tense wrote (why so many wrotes? Don't forget that this is from email archives!) This example is also in the exception table for LiteMorph, since write is an irregular verb.

This pattern shows up for other words in our test set like caller and submitter, which are not in the exception list, so we'd probably need to fix this one by modifying the LiteMorph rules and the exception table. If we decided to fix it, that is.

There are some clear cases where irregular verbs get missed by the stemmer, like the following:

Unstemmed Term   Stemmed Term   Unstemmed Docs   Stemmed Docs
caught           caught         33674            9311

Perhaps my Porter stemmer is out of date? I suppose I should try this with the stemmer(s) in Lucene.

There are a lot of cases where the difference between the stemmed and unstemmed indices is only a few documents:

Unstemmed Term   Stemmed Term   Unstemmed Docs   Stemmed Docs
copied           copi           193667           193638
copies           copi           193667           193638
copying          copi           193667           193638

In this case, LiteMorph includes copier and copiers, while the stemmer leaves them out.

At the far end of the spectrum are the terms for which LiteMorph produces far fewer documents than stemming. Here are a few interesting ones:

Unstemmed Term   Stemmed Term   Unstemmed Docs   Stemmed Docs
engine           engin          22488            392368
engineer         engin          377348           392368
engineering      engin          377350           392368
engineers        engin          377348           392368

The stemmer has conflated all of these words into the stem engine, while the LiteMorph engine considers them to be two separate clusters:

> :morph engine
 engine (21082)
 engined (2)
 enginers (4)
 engines (2438)
 enginer (53)
> :morph engineer
 engineer (268417)
 engineeing (13)
 engineering (119954)
 enginee (26)
 engineered (547)
 engineers (28743)
 enginees (1)
 engineerings (68)

There are also cases where I think stemming makes a conflation that is incorrect. For example:

Unstemmed Term   Stemmed Term   Unstemmed Docs   Stemmed Docs
inform           inform         25020            412651
informed         inform         25118            412651
information      inform         396257           412651

I don't think the conflation of informed and information is very good, and LiteMorph doesn't make it.

All in all, for the 1806 terms that we tested, there were 820 terms whose results were within 100 documents of each other, 300 terms where LiteMorph produced more results, and 686 terms where stemming produced more results.

It's not clear to me that there's a firm conclusion to be drawn here, except that which one is better probably depends on what you're trying to do. Certainly, stemming the index will give you better performance on searching at the expense of not being able to search for exact forms (unless you index the root forms). LiteMorph allows you to generate variants, but there's clearly some pretty weird stuff in there.

Friday Mar 13, 2009

Lightweight Morphology vs. Stemming

Otis asked about the cost incurred doing LiteMorph expansion of terms during searches. I haven't really looked at this before, since we don't stem indices by default, but it's a fair question.

I happen to have a few million email messages lying around, so I indexed around 1.7 million of them twice: once with our standard index and once with a stemmed index. Here's what our standard index looks like at the end of the run, as reported by our QueryTest kitchen-sink tester program:

18 active partitions: 1007 (763425) 1266 (195146) 1521 (200812) 
 1797 (194458) 2083 (185667) 2143 (35463) 2203 (37921) 2254 (42842) 
 2307 (34676) 2316 (6611) 2327 (8591) 2334 (6901) 2344 (7566) 
 2345 (438) 2346 (1447) 2347 (554) 2348 (763) 2349 (3) 
 Sorting specification is: -score
 Partitions have:
  1723284 documents
  2337983867 tokens
  10212437 terms

There are 18 partitions in this index (that's pretty ripe for a merge, actually.) The numbers in the parentheses are the number of documents in each partition. You can see the number of documents, tokens, and terms in the entire index.

Here's what the stemmed index looks like:

16 active partitions: 881 (1248212) 1086 (238562) 1130 (49578) 1176 (50489) 
 1211 (48669) 1246 (52899) 1252 (9199) 1258 (10005) 1264 (8902) 1265 (2001) 
 1266 (38) 1267 (1963) 1268 (2001) 1269 (763) 1270 (2) 1271 (1) 
 Sorting specification is: -score
 Partitions have:
  1723284 documents
  2337983867 tokens
  6970295 terms

I wrote a quick program to select some random words from the main dictionary of the largest partition in an index. In this case, that was partition 1007, whose main dictionary contained nearly 5.5 million words.

Because we want to test on "interesting" words, I restricted the search to words that are longer than 5 alphabetic characters and that occur in more than 0.5% of the documents in that partition (3817 documents, in this case). This resulted in 1806 terms (Zipf's Law in action, I guess!).

Using our new query API, I wrote a program that takes each of the words and runs it as a single term query. The program takes a switch on the command line to indicate whether the terms should be stemmed or morphed.
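The timing program is about as simple as you'd expect. Here's a sketch; the Minion imports are elided, and I'm assuming a STEM modifier alongside the MORPH modifier for illustration, so the details may not match the actual program:

    import java.util.EnumSet;
    import java.util.List;

    // Minion imports elided; SearchEngine, ResultSet, and Term are the query
    // API classes discussed in the previous entry.  The STEM modifier here is
    // an assumption on my part; the actual modifier names may differ.
    public class TermTimer {

        public static void time(SearchEngine engine, List<String> words,
                                boolean stem) throws Exception {
            long start = System.currentTimeMillis();
            for (String w : words) {
                Term t = new Term(w, EnumSet.of(
                        stem ? Term.Modifier.STEM : Term.Modifier.MORPH));
                ResultSet rs = engine.search(t);
                rs.size();  // force the query to be evaluated
            }
            long total = System.currentTimeMillis() - start;
            System.out.printf("%d queries in %dms, %.2fms per query%n",
                              words.size(), total, (double) total / words.size());
        }
    }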

I'm running Solaris 10 on the test machine. psrinfo -v on this machine says:

Status of virtual processor 0 as of: 03/11/2009 13:27:56
  on-line since 01/04/2009 22:05:36.
  The i386 processor operates at 2200 MHz,
        and has an i387 compatible floating point processor.
...
Status of virtual processor 7 as of: 03/11/2009 13:27:56
  on-line since 01/04/2009 22:05:54.
  The i386 processor operates at 2200 MHz,
        and has an i387 compatible floating point processor.

It's a four-processor box, where each processor has two cores (it's a v40z.) The box has 32GB of RAM, the Java version is 1.6.0_06, and I'm running with -Xmx1g (the actual process size doesn't get much above 300MB, though.)

The collections are stored on a ZFS file system that's on a disk array attached to the box via fiber channel. This is fairly low-performing storage (after we got it we were told that it was meant for near line backup. Sigh.). Here's the pool:

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
files                  2.27T    415G   1.86T    17%  ONLINE     -

And the status:

  pool: files
 state: ONLINE
 scrub: none requested
config:

	NAME                                         STATE     READ WRITE CKSUM
	files                                        ONLINE       0     0     0
	  raidz2                                     ONLINE       0     0     0
	    c0t600C0FF00000000009234951BE0FE300d0s2  ONLINE       0     0     0
	    c0t600C0FF00000000009234968DA6E9000d0s2  ONLINE       0     0     0
	    c0t600C0FF00000000009234973FFDC2800d0s2  ONLINE       0     0     0
	    c0t600C0FF000000000092349113B66D600d0s2  ONLINE       0     0     0
	    c0t600C0FF000000000092349239DC55200d0s2  ONLINE       0     0     0

I'm pretty sure each of those vdevs is built out of multiple disks in the actual array. (Ask me about this if you think it matters, and I can find out.)

While preparing this blog entry, I ran the query program a number of times in order to get the output that I wanted, so these indices were probably warm in the disk cache. Anyway, here are the basic results:

Index       Total Query Time (ms)   Avg. Query Time (ms)
Unstemmed   67232.25                37.23
Stemmed     34732.71                19.23

The queries using lightweight expansion take about 1.9 times as long as the queries using stemming.

I was curious whether the number of partitions in the index was affecting how long the queries were taking to run, so I generated versions of the indices where all of the partitions had been merged into a single partition. The results for this merged index were:

Index                Total Query Time (ms)   Avg. Query Time (ms)
Unstemmed (merged)   59615.21                33.01
Stemmed (merged)     30880.89                17.10

This is a speedup of around 12% in both cases, with about the same ratio between the unstemmed and stemmed times.

So, where's the extra time going, you ask? One of the things that we've added to Minion recently is query statistics processing. The engine keeps track of the number of dictionary lookups, dictionary cache hits, how many postings lists are loaded, how long it takes to load them, and so on. The stuff we're collecting is in the QueryStats class, and at any time you can ask the search engine for the current set of query statistics. Here are the stats for the merged version of the unstemmed index:

1806 queries in 59615.21ms, 33.01ms per query
Dictionary activity:
 34952 lookups in 1391.21ms (2.33% of total), 0.04ms per lookup
 Cache Hits:                           208
 Cache Misses:                       34744
Postings Activity
 9650 postings lists read in 2180.81ms (3.66% of total), 0.23ms per read
 Average Postings size:               67.7KB
 Term Cache Hits:                        0
 Term Cache Misses:                      0
 Generating term cache entries:        0.0ms (0.00% of total)
 Postings iteration:               25670.3ms (43.06% of total)
Query Activity
 Union processing:                 25671.7ms (43.06% of total)
 Intersect processing:                 0.0ms (0.00% of total)
 Sorting postings:                 22929.8ms (38.46% of total)
 Normalizing scores:                3189.2ms (5.35% of total)

And here are the stats for the merged, stemmed index:

1806 queries in 30880.89ms, 17.10ms per query
Dictionary activity:
 1806 lookups in 340.58ms (1.10% of total), 0.19ms per lookup
 Cache Hits:                           209
 Cache Misses:                        1597
Postings Activity
 1806 postings lists read in 2063.93ms (6.68% of total), 1.14ms per read
 Average Postings size:              320.2KB
 Term Cache Hits:                        0
 Term Cache Misses:                      0
 Generating term cache entries:        0.0ms (0.00% of total)
 Postings iteration:               23008.0ms (74.51% of total)
Query Activity
 Union processing:                 23009.4ms (74.51% of total)
 Intersect processing:                 0.0ms (0.00% of total)
 Sorting postings:                     0.0ms (0.00% of total)
 Normalizing scores:                3143.3ms (10.18% of total)

Note that there's some overlap in the categories and we don't count everything, so the numbers don't add up to 100%.

The big difference between these two is the "Sorting Postings" time. This is the time that's spent integrating the postings for the morphological variants. When we're processing the postings for the variants, experience has shown us that it's faster to list out all the documents and then sort them to combine the scores from documents with more than one variant than it is to merge them as we're processing the variants. Of course, in the stemmed case, we only ever iterate through one postings list, so the sort isn't necessary.
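If that's hard to picture, here's a rough sketch of the list-then-sort approach for combining the postings of a set of variants; it's not the actual Minion code, and the postings are reduced to simple document/score pairs:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    // Combine the postings for a term's morphological variants by listing
    // everything out and sorting, rather than merging the lists as we iterate.
    public class CombineVariantsSketch {

        static class Posting {
            final int doc;
            final float score;

            Posting(int doc, float score) {
                this.doc = doc;
                this.score = score;
            }
        }

        static List<Posting> combine(List<List<Posting>> variantPostings) {
            // Dump all of the postings for all of the variants into one list.
            List<Posting> all = new ArrayList<Posting>();
            for (List<Posting> postings : variantPostings) {
                all.addAll(postings);
            }

            // Sort by document ID, then sum the scores for documents that
            // contained more than one variant.
            Collections.sort(all, new Comparator<Posting>() {
                public int compare(Posting a, Posting b) {
                    return a.doc < b.doc ? -1 : (a.doc > b.doc ? 1 : 0);
                }
            });

            List<Posting> combined = new ArrayList<Posting>();
            int i = 0;
            while (i < all.size()) {
                int doc = all.get(i).doc;
                float score = 0;
                while (i < all.size() && all.get(i).doc == doc) {
                    score += all.get(i).score;
                    i++;
                }
                combined.add(new Posting(doc, score));
            }
            return combined;
        }
    }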

Thursday Mar 12, 2009

A query API for Minion

Minion actually offers three different query languages (and is designed so that others can be added pretty easily), but it can be difficult to use the query languages when a program (as opposed to a person) is writing the queries.

For example, let's say that you're providing a search engine for musical artists, and you want your users to be able to search by artist name. The first thing that you'll end up doing is quoting the value provided by the user by doing something like:

   String query = String.format("artist-name = \"%s\"", name);
   engine.search(query);
because you know that artist names can have spaces in them.

This will work for a while, but you will soon find that there's an amazing amount of stuff that occurs in the names of musical artists. We have a crawl of information about 50,000 musical artists (more on that in the days to come, I hope.) If we sort the list of artists by name, here are the first 15:

!!!
!Action Pact!
!DISTAIN
!DelaDap
"Brother" Jack McDuff
"Little" Louie Vega
"Weird Al" Yankovic
#9 Dream
#Poundsign#
$wingin' Utter$
'Til Tuesday
't Hof van Commerce
(Love) Tattoo
(The Sounds Of) Kaleidoscope
(Young) Pioneers

Let's say that our user has typed in "Weird Al" Yankovic. If we're not careful about escaping the quotes, we're either going to have a parse exception or get weird results because the query parser will interpret the query differently than the user intended. At this point you begin to have a dawning sense of horror that you're going to spend the rest of your time figuring out what the set of weird characters is (usually when something breaks) and then dealing with escaping them on a case-by-case basis.

To completely sidestep this problem, we've added a programmatic query API to Minion. The idea is pretty simple: you build a Java structure that describes the query and then give it to the search engine to evaluate. Our search for an artist name above becomes:

        Element e = new Relation("aura-name", Relation.Operator.EQUALS, name);
        engine.search(e);
No need to worry about escaping things, because we're just passing around Java strings.

You can start out with creating a query element for a term and searching for just that term:

        Term j = new Term("java");
        ResultSet rs = engine.search(j);

As with the query languages, the term will search for morphological variations by default. We can specify how we want a term to be treated during the query by specifying modifiers:

        Term j2 = new Term("Java", EnumSet.of(Term.Modifier.CASE, Term.Modifier.MORPH));
        rs = engine.search(j2);
which will search for the term Java (and morphological variations) only in that exact case. You can combine terms using boolean operators:
        Term l = new Term("language");
        Term s = new Term("specification");
        And a = new And(j, l, s);
        rs = engine.search(a);
        rs = engine.search(new Or(j, l, s));

Note that there are varargs constructors for And and Or so you don't need to clutter up your code with lots of anonymous arrays.

You can combine boolean operators and terms in the natural way:

        And ao = new And(j, new Or(l, s));

Using the query API also means that you don't have to worry about what the precedence of the operators is in the query language (e.g., "What does java <and> language <or> specification do?").

As indicated above, you can use the query API for relational querying too:

        Relation sr = new Relation("subject", Relation.Operator.SUBSTRING,
                "java");
        Relation dr = new Relation("date", Relation.Operator.LESS_THAN,
                "03/12/2009");
and, of course, you can combine relational and boolean queries:
        And subj = new And(j, l, s, sr);

You can restrict elements to be searched for in a particular field or fields in the following way:

        Term j = new Term("java");
        j.addField("subject");
You can also use a set of fields, if you prefer. When more than one field is specified, a hit in any of the fields means that the document will be in the result set.

Finally, you can indicate that the results generated by a particular element of the query should be treated in a strict boolean fashion. This means that a document is either in the set or it isn't; there are no scores associated with the documents. When combining a set of documents that were treated in a strict boolean fashion with a set where the documents are scored, the strict boolean result set will not affect the scores in the combined set (although they may affect the presence or absence of certain documents!)

This is useful to do when you want to make sure that a result set contains documents that have a particular property, but you don't want that property to mess up the scores. For example:

        Term j = new Term("java");
        Term l = new Term("language");
        l.setStrict(true);
        And a = new And(j, l);
will return the documents that have java and language in them, but the scores for those documents won't be influenced by the score of language.

At the moment, the API only allows for boolean, relational, and range queries, but in the near future we'll be adding in proximity operators as well as document similarity operators.

The only thing that I don't really like about this new API is that it doesn't do type-checking as you build the query. For example, if you're building a relation like:

       Relation dr = new Relation("date", Relation.Operator.LESS_THAN,
                "03/12/2009");
we can't ensure that date is a saved field, that it's a date saved field, and that the provided value is a date. This is because the queries are built without considering any particular engine, so we can't tell what fields are defined or what their types are. Currently this is checked before the query is evaluated in the search method, and a QueryException is raised if there is a problem.

On a stylistic note, we decided to have a single Relation class rather than breaking out operators for each relation type. I still can't decide whether that was a good idea or not. Helpful suggestions welcome in the comments!

Wednesday Mar 11, 2009

Google's Postings Format

Jeff's Search Engine Caffe points to the slides from the WSDM 2009 keynote given by Google's Jeff Dean.

It's a pretty interesting look at how Google has grown over the years. Of particular interest to me as a search guy is the description of some of their index formats. On slide 47 there's a pretty detailed description of the postings lists format that they were using before they went to an all-in-memory index. They were using variable length encodings (like Gamma and Huffman encoding) for some of the data, in order to get the most compact on-disk representation that could still be decoded quickly.

This was what the initial versions of the Minion engine did as well.

On slide 54, he starts to talk about the formats that they considered when they went to an all-in-memory index. Given that they could hold the whole thing in memory, they needed to use a format that could be decoded as quickly as possible, which led them to use byte-aligned variable length encodings.

They started with what they call a Varint encoding, which is the same encoding that Minion and Lucene use. The following slides detail a number of variations of this encoding scheme, culminating in slide 63, where they describe encoding groups of four numbers together and pulling out two-bit prefixes into a leading byte, in order to reduce the amount of masking and shifting necessary to decode an integer.

He says the Varint encoding could be decoded at the rate of 180M integers per second, while their "group varint" approach could decode at the rate of 400M integers per second. He points out that this means that you can only encode 30 bits per integer, but that's not much of a restriction for typical postings data (document ID deltas and positions).
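To make the idea concrete, here's roughly what decoding one group might look like. This is my own sketch from the description in the slides; the byte order and the exact tag layout are guesses on my part:

    // A sketch of "group varint" decoding: a tag byte holds four 2-bit fields,
    // each giving the number of bytes (minus one) used by one of the next four
    // integers.
    public class GroupVarintSketch {

        /**
         * Decodes four integers starting at data[pos], writing them into
         * out[0..3], and returns the offset of the first byte after the group.
         */
        public static int decodeGroup(byte[] data, int pos, int[] out) {
            int tag = data[pos++] & 0xff;
            for (int i = 0; i < 4; i++) {
                int len = ((tag >>> (2 * i)) & 0x3) + 1;  // 1 to 4 bytes
                int value = 0;
                for (int j = 0; j < len; j++) {
                    value |= (data[pos++] & 0xff) << (8 * j);  // little-endian
                }
                out[i] = value;
            }
            return pos;
        }
    }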

I think I might have to give this one a try myself, in my copious free time.

Tuesday Mar 10, 2009

New look

A new look for the blog. This is the new "Cloud" theme for blogs.sun.com. Restful, isn't it?

I realize it may seem a bit trendy (and bandwagon-jump-ony), but on the AURA project, we've been building a system that's meant to run in the cloud. I'll probably be blogging about this more in the future.

Plus, I must really mean it if I switched my theme from "Green", right?

Dictionaries in Minion: Merging

The last major function that we're going to investigate for dictionaries is merging them.

When indexing in Minion, we try to index as much data as possible in memory before dumping the data to the disk. Each chunk of data that we dump to the disk is called a partition. We usually call a partition that was dumped directly by the indexer a "first-level partition."

A partition is composed of a number of files, and the contents of these files are mostly dictionaries and their associated postings. As indexing proceeds and we dump more first-level partitions, the PartitionManager for the index will collect a number of first-level partitions of a given size (in terms of the number of documents that they contain) and merge them together into a new partition. As you can imagine, we usually call the partition resulting from a merge of first-level partitions a "second-level partition". The PartitionManager will eventually merge second-level partitions into third-level ones, and so on.

Because of our EIDAP philosophy, all of these partition merges boil down to merging a number of dictionaries. So, for example, we build the main dictionary for a merged partition by merging together the main dictionaries from the partitions that we're merging.

The main dictionary merge method is pretty straightforward, given how central it is to how indexing works in Minion.

Let's walk through a merge for the main dictionaries for a number of partitions. The first thing that you will notice is that the method signature is pretty complicated. This is a bit of an annoyance, but the merge method has to deal with a number of different situations. Let's look at the parameters for the method:


IndexEntry entryFactory
This is an empty entry of the same type as the entries in the dictionaries we'll be merging (and, therefore, in the dictionary that we'll be creating). We'll use it to make new entries for the merged dictionary.

NameEncoder encoder
An encoder for the names in the merged dictionary. In the case of the main dictionary for the partitions, this will be a StringNameHandler that will handle the 3-in-4 front coding of the terms in the dictionary.

PartitionStats partStats
A set of simple statistics about the partition that we may be collecting while merging the terms.

DiskDictionary[] dicts
The actual dictionaries to merge.

EntryMapper[] mappers
A set of mappers for the entries. These can be used (for example) to renumber entries in the dictionary during the merge.

int[] starts
The starting document IDs for the data from each of the partitions. During indexing, the document IDs are assigned per partition (i.e., the first document ID in a partition is always 1), so during merging we need to remap the document IDs from the merged partitions.

int[][] postIDMaps
A set of old-to-new mappings for the IDs encoded in the postings associated with this dictionary.

RandomAccessFile mDictFile
The file where the merged dictionary will be (eventually) written.

PostingsOutput[] postOut
The file(s) where the merged postings will be written.

boolean appendPostings
In most cases, when merging dictionaries, the postings associated with the entries in the dictionary can simply be appended onto one another to generate the postings for the merged dictionaries. There are a few dictionaries where this is not the case.

The first thing that we do for the merge is to build a heap of iterators, one for each of the dictionaries that we're merging (note that DiskDictionary implements Iterable — we'll talk more about dictionary iterators later.) The heap is organized by the names of the entries currently at the head of the iterator. While there are still iterators on the heap, the merge loop proceeds as follows:


  1. Peek at the top element of the heap, and make a new entry for the merged dictionary using the name of the top element. We set the ID of the new dictionary entry in sequence or using the provided mappers.

  2. While the element at the top of the heap has the same name as the current element:

    1. Remove the top element from the heap.

    2. Read the postings associated with the element, and append them onto the postings that we're building for the new entry.

    3. Keep track of the mapping for the ID of this entry in the dictionary from which it was drawn and in the new dictionary.

    4. Advance the iterator that was at the top of the heap. If there is another element on that iterator, put the iterator back on the heap.

    5. Peek at the element at the top of the heap.


  3. Write the merged postings to the postings file.

  4. Write the new, merged entry to the dictionary using a DictionaryWriter.

See, simple, right? There's actually quite a bit of complexity hidden in some of those steps (most notably, appending or merging postings — we'll talk more about that later too), but the basics actually are pretty straightforward.
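If you like to see the shape of that loop in code, here's a stripped-down skeleton of a heap-based merge of sorted name iterators. It only collects the merged names; the real method is also appending postings, remapping IDs, and writing each merged entry as it goes:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    // A heap of iterators, ordered by the name currently at the head of each.
    public class HeapMergeSketch {

        static class HeapElement implements Comparable<HeapElement> {
            final Iterator<String> it;
            String current;

            HeapElement(Iterator<String> it) {
                this.it = it;
                this.current = it.next();
            }

            boolean advance() {
                if (it.hasNext()) {
                    current = it.next();
                    return true;
                }
                return false;
            }

            public int compareTo(HeapElement o) {
                return current.compareTo(o.current);
            }
        }

        static List<String> merge(List<Iterator<String>> dicts) {
            PriorityQueue<HeapElement> heap = new PriorityQueue<HeapElement>();
            for (Iterator<String> it : dicts) {
                if (it.hasNext()) {
                    heap.offer(new HeapElement(it));
                }
            }

            List<String> merged = new ArrayList<String>();
            while (!heap.isEmpty()) {
                // Peek at the top of the heap and start a new merged entry.
                String name = heap.peek().current;
                merged.add(name);

                // Pull every iterator whose current element has the same name;
                // this is where the postings for each one would be appended.
                while (!heap.isEmpty() && heap.peek().current.equals(name)) {
                    HeapElement top = heap.poll();
                    if (top.advance()) {
                        heap.offer(top);
                    }
                }
            }
            return merged;
        }
    }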

Monday Mar 09, 2009

Dictionaries in Minion: Searching

In the entry on dumping, we described the on-disk representation for a dictionary. The representation is pretty straightforward:


  • A dictionary header. This contains information about the number of entries in the dictionary as well as information about the sizes of the following buffers.
  • An optional buffer. If the dictionary required the storing of an ID-to-position map (because the IDs of the entries in the dictionary were not remapped when the dictionary was dumped), then this buffer will occur in the on-disk representation. The entry at position i will encode the position of the entry with ID i in the sorted dictionary.

  • A buffer. This buffer contains the names of the entries in the dictionary using a standard 3-in-4 front-coding. The size of a block of terms is variable.
  • A buffer. This buffer contains the positions in the preceding buffer where each block of four terms starts. Each position is encoded using four bytes.

  • A buffer. This buffer contains the encoded information from the entries in the dictionary. The information and how it's encoded is particular to the entry type, but most types encode information about the position of the postings list in an associated postings file, the size of the postings list, and the number of postings in the postings list. The size of the information for an entry is variable.

  • A buffer. This buffer contains the position in the preceding buffer of the information for each entry. Each position is encoded using four bytes.

Because we can't guarantee that we can load very large dictionaries into memory at query time (and because we have a lot of dictionaries), we use file-backed buffers that keep an in-memory cache of a chunk of the buffer while the rest of the buffer stays on disk. When we open a dictionary, we just instantiate these buffers and have them backed by the buffers on disk.

The main dictionary operation during querying is looking up entries by name: we look up words from the document in the main dictionary, we look up field values in the dictionaries for the saved fields, and so on. Keep in mind that at query time (unlike index time) a dictionary may be accessed by multiple threads, so we need to make sure that our operations are thread safe.

Let's have a look at how this is done.

We begin by taking duplicates of our file-backed buffers that contain the dictionary data. This is a relatively low-cost operation and it means that we don't have to worry about the buffers' in-memory caches thrashing as other threads try to find entries at the same time that we are.

Each dictionary maintains an LRA cache of entries that have been looked up, so we check that cache to see if the entry is already available. If it is, we return a copy of the cached entry and we're done. The size of this cache is configurable, and it can be set up to (eventually) cache all of the entries.

If the entry is not in the cache, then we need to see whether an entry with the given name occurs in the dictionary. We do this by running a binary search on the encoded names buffer. Because we use a 3-in-4 front coding, the search actually proceeds against the uncompressed names at the start of each block of four names.

Note that as we're running the binary search we also cache the uncompressed names at the positions we're testing. This means that over time, the top few levels of the binary search tree will be cached and so we won't have to keep decoding those names every time we search for something by name. This lets us get down to a pretty small portion of the dictionary before we ever have to actually decode any names.

At the end of our binary search, we might have gotten lucky and have the name be one of our uncompressed names, but usually the binary search narrows things down to a block of names that may contain the name we're interested in. Once we find a block of names, we can iterate through it to see if it contains the name we're looking for.

If we don't find the name at this point, we can return null because we're sure the dictionary doesn't contain an entry with this name. If we do find the name, then we know the position in the dictionary for the entry with this name.
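Reduced to its essentials, the binary search over the uncompressed block-leading names looks something like this; it's a sketch with the file-backed buffers, the name cache, and the scan of the block itself all stripped out, and the array stands in for the real name buffer:

    // Find the block of four names that could contain a given name.
    public class DictLookupSketch {

        /**
         * Returns the index of the block that could contain name, or -1 if no
         * block can (i.e., name sorts before the first block's leader).
         */
        static int findBlock(String[] blockLeaders, String name) {
            int lo = 0;
            int hi = blockLeaders.length - 1;
            int candidate = -1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                int cmp = blockLeaders[mid].compareTo(name);
                if (cmp == 0) {
                    return mid;          // got lucky: the leader is the name
                } else if (cmp < 0) {
                    candidate = mid;     // name, if present, is in this block or later
                    lo = mid + 1;
                } else {
                    hi = mid - 1;
                }
            }
            return candidate;
        }
    }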

Once we have the position, we can create an entry based on the encoded data for this entry.

On the way out, we cache the entry that we found for this name.

An additional filigree on this: we use the same lookup code to find matches for an initial substring of a name, which we can use when looking for stemmed variations of a term.

Thursday Mar 05, 2009

Inside Lucene's Indexing Code

For search engine innards fans, Mark Miller is exploring Lucene's indexing code over at the Lucid Imagination Blog.

It looks like pretty interesting stuff, and a good way for me to find out more about how Lucene works.

Dictionaries in Minion: Dumping a Dictionary

Once the indexer decides that it's time to dump a partition (because memory's getting full or because it was told to by a call to SearchEngine.flush), the partition dump basically reduces into dumping a bunch of dictionaries and their associated postings data.

The MemoryDictionary.dump method is responsible for dumping the dictionary. The first step in a dictionary dump is to instantiate a DictionaryWriter. This is a helper class that does most of the actual work of encoding and writing the dictionary. We use a helper class because a lot of the work done during the dump of a dictionary is also done when merging dictionaries (more on that in an upcoming post.)

Once we're ready to start writing the dictionary, the entries in the dictionary are sorted by their names. We sort the entries by name so that we can use binary search to find entries at query time. We could avoid the sorting step by keeping the entries in a SortedMap at indexing time, but it turns out to be faster to sort the entries when we're ready to dump them than to keep them sorted while we're indexing them.

Although the dictionary assigned integer IDs to the entries during indexing, at dump time we can take the opportunity to renumber the entries so that the ID for an entry corresponds to its position in the sorted list. If we do this, looking up a dictionary entry by ID can be done very quickly at query time.

We do this renumbering for dictionaries like the main dictionary for a partition (the one containing the terms from the documents), but we skip it for the terms in the document dictionary. This renumbering business is one of the places where the EIDAP philosophy gets a bit burdensome. When you're sorting the entries in the document dictionary, you don't want to renumber the entries, because the IDs that were assigned to these entries have been encoded into the postings lists for the terms in the main dictionary and we don't want to have to re-encode all those postings.

In the cases where we don't renumber the entries, we need to keep a map from the ID for an entry to the position in the dictionary, so that we can continue to do fast lookups by ID at query time.

When we do renumber the entries, we can optionally keep a map from old entry IDs to new entry IDs. The ID map from one dictionary can be useful when dumping another dictionary's entries.

Once we've figured out the order that the entries will be dumped in and what the entries' final IDs will be, each entry is written out to the on-disk representation, via the DictionaryWriter. The on-disk representation for the dictionary is a DictionaryHeader, followed by four buffers.

The first buffer contains a standard 3-in-4 front coding of the names of the entries. The basic idea is that we group the names of the terms into groups of 4. The first name in a block is uncompressed; the remaining names in the block have any initial substring shared with the previous name removed and replaced with the length of the shared initial substring.

Here's an example from an index I had lying around. A four-word block is composed of the words: evergreen, evergrey, everlast, everlasting. These will get compressed into a block of names that looks like:


0evergreen4grey4last8ing

where the numbers are actually bytes representing the initial shared substring length. Thus 37 characters are compressed into 23. The encoding of names is handled by classes that implement the NameEncoder interface. The appropriate name encoder for a dictionary is passed into the dump method for a dictionary.
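Decoding a block is just the reverse: start from the uncompressed first name and paste the shared prefix onto each remainder. Here's a toy version using the evergreen block above (this is an illustration, not the actual NameEncoder code, which works on Minion's buffers rather than strings):

    import java.util.ArrayList;
    import java.util.List;

    // A toy decoder for a 3-in-4 front-coded block: each name is a shared-prefix
    // length plus the remaining characters, relative to the previous name.
    public class FrontCodeDecodeSketch {

        static List<String> decodeBlock(int[] sharedLengths, String[] remainders) {
            List<String> names = new ArrayList<String>();
            String prev = "";
            for (int i = 0; i < sharedLengths.length; i++) {
                String name = prev.substring(0, sharedLengths[i]) + remainders[i];
                names.add(name);
                prev = name;
            }
            return names;
        }

        public static void main(String[] args) {
            // The evergreen block above: 0evergreen4grey4last8ing
            System.out.println(decodeBlock(
                    new int[] {0, 4, 4, 8},
                    new String[] {"evergreen", "grey", "last", "ing"}));
            // Prints [evergreen, evergrey, everlast, everlasting]
        }
    }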

The second buffer encodes the positions of the start of each of the blocks of four terms in the dictionary. Because we need to be able to quickly find the start of a block, these positions are encoded using a fixed number of bytes.

The third buffer encodes the rest of the information contained in a dictionary entry. This is stuff like the number of occurrences of the term and the offset in the postings file where the postings associated with the term can be found. Each entry knows how to encode this information onto a buffer, and the encoding is done by the IndexEntry.encodePostingsInfo method.

The fourth buffer encodes the position in the third buffer of the entry information for each entry. Again, we need to be able to find the information for an entry quickly when we're fetching entries at query time, so the offset is encoded using a fixed-length encoding.

As we're encoding the information for the entries, we also dump the collected postings information to the appropriate postings files, keeping track of the offsets where the postings are dumped and how many bytes are in the postings.

Because we use the DictionaryWriter for merging dictionaries too, we can't assume that the dictionary representation can be stored completely in memory. Believe me, life would be easier if we could, but this really does get to be a problem with large indices. To work around this, when we're encoding the dictionary, we use buffers that have an in-memory cache and are backed by a temporary file. In the case of small dictionaries, this works out to be pretty much a wash, because we only ever use the in-memory cache and never have to write any of the data to the file.

The last stage of dumping the dictionaries is to write the dictionary header and then transfer the encoded buffers into their final destination file. We make sure this is done efficiently using NIO channel transfer operations.

Wednesday Mar 04, 2009

Look what's back in NetBeans!

I was having a quick look at the new NetBeans 6.7M2 this morning, and look what I found in the plugins list:

W00t! Jackpot is back!

Man, I could have totally used this a couple of weeks ago. I was switching from using custom logging in Minion to using java.util.logging. Jackpot would have relieved me of the need for a very evil Perl script (is there another kind? (I kid, I kid!)) and a lot of hand editing.

Extra awesomeness: Prolog editing mode! Now I can dust off the family tree NLP system that I wrote for CS240 (using Waterloo Prolog on an IBM mainframe, no less) in 1987!

Absolutely nothing to do with search engines, but still kind of nerdy

This is just plain funny.

Also, I'm heading to a midnight showing of Watchmen on Thursday/Friday. To say that I'm stoked about this movie would be a severe understatement, so I expect I'm setting myself up for a huge disappointment. Still, I won't be dressing up to go to see the movie.

About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.
