Dictionaries in Minion: Searching

In the entry on dumping, we described the on-disk representation for a dictionary. The representation is pretty straightforward:

  • A dictionary header. This contains information about the number of entries in the dictionary as well as information about the sizes of the following buffers.
  • An optional buffer. If the dictionary needs to store an ID-to-position map (because the IDs of the entries were not remapped when the dictionary was dumped), then this buffer appears in the on-disk representation. The entry at position i encodes the position in the sorted dictionary of the entry with ID i.

  • A buffer. This buffer contains the names of the entries in the dictionary using a standard 3-in-4 front-coding. The size of a block of terms is variable.
  • A buffer. This buffer contains the positions in the preceding buffer where each block of four terms starts. Each position is encoded using four bytes.

  • A buffer. This buffer contains the encoded information from the entries in the dictionary. The information and how it's encoded is particular to the entry type, but most types encode information about the position of the postings list in an associated postings file, the size of the postings list, and the number of postings in the postings list. The size of the information for an entry is variable.

  • A buffer. This buffer contains the position in the preceding buffer of the information for each entry. Each position is encoded using four bytes.
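The 3-in-4 front coding used for the names buffer can be sketched in a few lines. This is an illustrative model, not Minion's actual encoder (and it writes strings rather than packed bytes): the first name in each block of four is stored whole, and each of the other three stores only the length of the prefix it shares with the previous name, plus the differing suffix.

```java
import java.util.ArrayList;
import java.util.List;

/** A sketch of 3-in-4 front coding. Class and method names are
 *  illustrative, not Minion's own. */
public class FrontCoding {

    /** Encode sorted names into blocks of four. Each output entry is
     *  either a full name (a block head) or "prefixLen|suffix". */
    public static List<String> encode(List<String> sorted) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            String name = sorted.get(i);
            if (i % 4 == 0) {
                out.add(name);              // block head: uncompressed
            } else {
                String prev = sorted.get(i - 1);
                int p = sharedPrefix(prev, name);
                out.add(p + "|" + name.substring(p));
            }
        }
        return out;
    }

    /** Decode one block of up to four encoded entries. */
    public static List<String> decodeBlock(List<String> block) {
        List<String> out = new ArrayList<>();
        String prev = null;
        for (String e : block) {
            if (prev == null) {
                prev = e;                   // head is stored whole
            } else {
                int bar = e.indexOf('|');
                int p = Integer.parseInt(e.substring(0, bar));
                prev = prev.substring(0, p) + e.substring(bar + 1);
            }
            out.add(prev);
        }
        return out;
    }

    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}
```

Because the names are sorted, adjacent names tend to share long prefixes, which is where the compression comes from; the price is that reading a name in the middle of a block means decoding the names before it in that block.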

Because we can't guarantee that we can load very large dictionaries into memory at query time (and because we have a lot of dictionaries), we use file-backed buffers that keep an in-memory cache of a chunk of the buffer while the rest of the buffer stays on disk. When we open a dictionary, we just instantiate these buffers and have them backed by
the buffers on disk.
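The idea behind a file-backed buffer can be sketched as follows. This is a simplified model, not Minion's actual buffer implementation: it keeps one in-memory window over a region of the file and refills that window whenever a read falls outside it. It also shows the duplication trick described below, where a duplicate shares the underlying file channel but gets its own window.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** A sketch of a file-backed buffer: a small in-memory window over a
 *  region of a file, refilled on demand. Names are illustrative. */
public class FileBackedBuffer {

    private final FileChannel chan;   // shared; positional reads only
    private final long start;         // offset of this buffer in the file
    private final long size;          // logical size of this buffer
    private final ByteBuffer window;  // in-memory cache of one chunk
    private long windowStart = -1;    // buffer-relative offset of the window

    public FileBackedBuffer(FileChannel chan, long start, long size,
                            int windowSize) {
        this.chan = chan;
        this.start = start;
        this.size = size;
        this.window = ByteBuffer.allocate(windowSize);
    }

    /** Read one byte at a logical position, refilling the window on a miss. */
    public byte get(long pos) throws IOException {
        if (windowStart < 0 || pos < windowStart
                || pos >= windowStart + window.limit()) {
            window.clear();
            chan.read(window, start + pos);   // positional read of one chunk
            window.flip();
            windowStart = pos;
        }
        return window.get((int) (pos - windowStart));
    }

    /** A duplicate shares the channel but gets its own window, so two
     *  threads reading different regions don't thrash a shared cache. */
    public FileBackedBuffer duplicate() {
        return new FileBackedBuffer(chan, start, size, window.capacity());
    }
}
```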

The main dictionary operation during querying is looking up entries by name: we look up words from the document in the main dictionary, we look up field values in the dictionaries for the saved fields, and so on. Keep in mind that at query time (unlike index time) a dictionary may be accessed by multiple threads, so we need to make sure that our operations are thread safe.

Let's have a look at how this is done.

We begin by taking duplicates of our file-backed buffers that contain the dictionary data. This is a relatively low-cost operation and it means that we don't have to worry about the buffers' in-memory caches thrashing as other threads try to find entries at the same time that we are.

Each dictionary maintains an LRA (least-recently-accessed) cache of entries that have been looked up, so we check that cache to see if the entry is already available. If it is, we return a copy of the cached entry and we're done. The size of this cache is configurable, and it can be set up to (eventually) cache all of the entries.
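A cache with this eviction policy can be sketched directly on top of LinkedHashMap's access-order mode; Minion's actual cache class differs, but the behavior is the same in spirit. The capacity is the configurable size mentioned above, and making it large enough amounts to eventually caching every entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A sketch of a least-recently-accessed entry cache, built on
 *  LinkedHashMap. Illustrative, not Minion's implementation. */
public class EntryCache<K, V> extends LinkedHashMap<K, V> {

    private final int capacity;

    public EntryCache(int capacity) {
        super(16, 0.75f, true);   // true = order by access, not insertion
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently accessed
    }
}
```

Note that a shared cache like this still needs external synchronization when the dictionary is hit by multiple query threads.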

If the entry is not in the cache, then we need to see whether an entry with the given name occurs in the dictionary. We do this by running a binary search on the encoded names buffer. Because we use a 3-in-4 front coding, the search actually proceeds against the uncompressed names at the start of each block of four names.

Note that as we're running the binary search we also cache the uncompressed names at the positions we're testing. This means that over time, the top few levels of the binary search tree will be cached and so we won't have to keep decoding those names every time we search for something by name. This lets us get down to a pretty small portion of the dictionary before we ever have to actually decode any names.

At the end of our binary search, we might get lucky and find that the name is one of the uncompressed block heads, but usually the binary search narrows things down to a block of names that may contain the name we're interested in. Once we've found that block, we iterate through it to see whether it contains the name we're looking for.

If we don't find the name at this point, we can return null because we're sure the dictionary doesn't contain an entry with this name. If we do find the name, then we know the position in the dictionary for the entry with this name.
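The two-stage lookup described above can be sketched as follows. For simplicity this models the dictionary as a sorted in-memory list of names; the real code decodes block heads and block members out of the front-coded names buffer instead, and caches the heads it touches. The names here are illustrative.

```java
import java.util.List;

/** A sketch of the lookup: binary search over the uncompressed name at
 *  the head of each four-entry block, then a linear scan of the block. */
public class DictLookup {

    /** Returns the position of name in the sorted list, or -1. */
    public static int find(List<String> names, String name) {
        int nBlocks = (names.size() + 3) / 4;
        int lo = 0, hi = nBlocks - 1, block = -1;
        // Stage 1: binary search on block heads (positions 0, 4, 8, ...).
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = name.compareTo(names.get(mid * 4));
            if (cmp == 0) {
                return mid * 4;        // got lucky: name is a block head
            } else if (cmp < 0) {
                hi = mid - 1;
            } else {
                block = mid;           // name, if present, is in this block
                lo = mid + 1;
            }
        }
        if (block < 0) {
            return -1;                 // smaller than every block head
        }
        // Stage 2: scan the (at most) three remaining names in the block.
        int end = Math.min(block * 4 + 4, names.size());
        for (int i = block * 4 + 1; i < end; i++) {
            if (names.get(i).equals(name)) {
                return i;
            }
        }
        return -1;                     // definitely not in the dictionary
    }
}
```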

Once we have the position, we can fetch the entry's encoded information from the information buffer and construct an entry from it.
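With the four-byte offset buffers described earlier, turning a position into the entry's encoded information is two reads: the offsets buffer gives where the entry's information starts, and the next entry's offset (or the end of the buffer) gives where it ends. A sketch, using plain ByteBuffers in place of the file-backed buffers and illustrative names:

```java
import java.nio.ByteBuffer;

/** A sketch of fetching an entry's encoded info: the offsets buffer
 *  holds one four-byte position per entry, indexing into the info
 *  buffer. Layout details here are illustrative. */
public class EntryInfo {

    /** Slice out the encoded info for the entry at position p. */
    public static byte[] infoFor(ByteBuffer offsets, ByteBuffer info,
                                 int p, int nEntries) {
        int start = offsets.getInt(p * 4);
        int end = (p + 1 < nEntries) ? offsets.getInt((p + 1) * 4)
                                     : info.limit();
        byte[] out = new byte[end - start];
        for (int i = 0; i < out.length; i++) {
            out[i] = info.get(start + i);   // absolute, position-free reads
        }
        return out;                         // decoded per entry type
    }
}
```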

On the way out, we cache the entry that we found for this name.

An additional filigree on this: we use the same lookup code to find matches for an initial substring of a name, which we can use when looking for stemmed variations of a term.



This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.