Bits, and the buffering thereof

I'm going to try to get back to blogging about the innards of Minion. I'm going to start at the bottom, but I'll probably jump around a little bit depending on what takes my fancy.

Way down at the bottom of Minion are bit buffers. Most of the higher-level file formats in Minion are built up from a number of bit buffers. A dictionary, for example, consists of a header and then four buffers containing the data for the dictionary.

The buffers that Minion uses are inspired by the buffer types in the java.nio package, although they do not extend those types. The main issues with these buffers are that we can't use them to encode an integer using a word aligned compression method and they won't grow on demand.

We divide buffers into two interfaces: one for buffers that you write to and one for buffers that you read from. We make this distinction because it allows us to be clear in the code when a buffer should be used for reading and when it should be used for writing. In reality, our buffers implement both of these interfaces.

Readable and writeable buffers have a few things in common: you can set the position, figure out how many bytes are remaining to be read or written, or figure out how much data the buffer can hold. The com.sun.labs.minion.util.buffer.WriteableBuffer interface defines a number of methods for encoding data onto the buffer, and the com.sun.labs.minion.util.buffer.ReadableBuffer provides the corresponding methods for decoding data.

There are methods for encoding and decoding the standard Java numeric types using their usual number of bytes, as well as methods for encoding integers using a fixed number of bytes (so, for example, you can encode numbers that you're sure aren't going to be bigger than 2\^7 using a single byte.) There are methods for encoding and decoding java.lang.Strings using a UTF-8 encoding that doesn't encode the length of the string as an unsigned short (you learn a lot looking at the source code of the JRE.)

The main compression routine in the buffer classes encodes integers using a word-aligned compression method. This is essentially the VInt format that Lucene uses. Using a word-aligned compression method is much more efficient (from a CPU usage standpoint) than using more sophisticated compression methods like gamma coding. For more information about these methods, you should have a look at Managing Gigabytes, or have a look at Zobel and Moffat's Computing Surveys article Inverted files for text search engines.

Minion Buffers can be written to or read from java.nio buffers, Data(Input|Output)s, java.nio.channel.(Writeable|Readable)ByteChannels, or other Minion buffers, which is how they eventually make it out to files.

Minion contains a number of implementations for these buffers which differ mainly in the backing store used to store the encoded data. There are implementations backed by arrays and by NIO buffers for in-memory operation and implementations that are backed by files with an in-memory buffer, for those occasions when one doesn't want to store 100MB of encoded dictionary entries in main memory.

Comments:

Post a Comment:
Comments are closed for this entry.
About

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into recommendation and other technologies that can use a good text search engine. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today