Tuesday May 18, 2010

Pattern for Defining Fields in Minion

While writing various applications that use Minion, I've settled into the habit of using an enum to declare (and define) the fields I'm using in the search engine. It is a convenient and concise way to combine the declaration of the fields' attributes with a mechanism for always getting the field name right. It is pretty simple and works well. I create an enum with one value per field, using the syntax that lets you pass arguments to a constructor to set the field's attributes. I can then use an accessor method of the enum to get a particular value's field name whenever I need to refer to the field. As a bonus, I can throw in a defineFields method that simply iterates over all the values in the enum, defining each one in the search engine with the attributes specified. See the example below.

    import java.util.EnumSet;

    // Minion classes; the package name com.sun.labs.minion is assumed here.
    import com.sun.labs.minion.FieldInfo;
    import com.sun.labs.minion.SearchEngine;
    import com.sun.labs.minion.SearchEngineException;

    /**
     * An enumeration of the fields available in the index. These values should
     * always be used when referencing fields.
     */
    public enum IndexFields {
        /** Email address is saved and searchable */
        EMAIL("email",
              FieldInfo.Type.STRING,
              EnumSet.of(FieldInfo.Attribute.INDEXED,
                         FieldInfo.Attribute.TOKENIZED,
                         FieldInfo.Attribute.SAVED)),

        /** ID is saved only */
        PERSON_ID("person-id",
                  FieldInfo.Type.STRING,
                  EnumSet.of(FieldInfo.Attribute.SAVED)),

        /**
         * Tags are saved and searchable, but not broken into tokens.
         * They are vectored for use in document similarity
         */
        TAG("tag",
            FieldInfo.Type.STRING,
            EnumSet.of(FieldInfo.Attribute.INDEXED,
                       FieldInfo.Attribute.VECTORED,
                       FieldInfo.Attribute.SAVED)),

        /** Bio is the "body" of the document, indexed, tokenized, and vectored */
        BIO("bio",
            FieldInfo.Type.NONE,
            EnumSet.of(FieldInfo.Attribute.INDEXED,
                       FieldInfo.Attribute.TOKENIZED,
                       FieldInfo.Attribute.VECTORED));

        /*
         * Each enumerated value will have these three fields
         */
        private final String fieldName;
        private final EnumSet<FieldInfo.Attribute> attrs;
        private final FieldInfo.Type type;

        /**
         * The constructor to create the instances defined
         * above
         */
        IndexFields(String fieldName,
                    FieldInfo.Type type,
                    EnumSet<FieldInfo.Attribute> attrs) {
            this.fieldName = fieldName;
            this.attrs = attrs;
            this.type = type;
        }

        /*
         * Public methods to get the field properties:
         */
        public String getFieldName() {
            return fieldName;
        }

        public EnumSet<FieldInfo.Attribute> getAttributes() {
            return attrs;
        }

        public FieldInfo.Type getType() {
            return type;
        }

        @Override
        public String toString() {
            return fieldName;
        }

        /**
         * Defines the fields enumerated in this enum in
         * the provided search engine
         */
        public static void defineFields(SearchEngine engine)
                throws SearchEngineException {
            for (IndexFields i : IndexFields.values()) {
                engine.defineField(new FieldInfo(i.getFieldName(),
                                                 i.getAttributes(),
                                                 i.getType()));
            }
        }
    }

Monday Nov 02, 2009

AURA Items and Java Compressor Streams

At the replicant level of The AURA Project (the base level that actually holds the data), we store our Items in a Berkeley DB / Java Edition database. Items have a number of fixed fields that we can retrieve them by, such as unique key, name, creation time, and type. Each type of item may also have its own particular data, which the Data Store itself treats as opaque. For example, a musical artist will have reviews, bios, links to videos, etc.; a user will have a name, email address, gender, and so on. To keep things simple in our current implementation, we store all of the type-specific data in a blob: we serialize the data to a byte array, then store the byte array in the database.
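As a rough illustration of the blob idea, here is a minimal sketch that turns type-specific data into a byte array and back. It assumes standard Java serialization and a HashMap of strings as the type-specific data; the post does not say which serialization mechanism AURA actually uses, and the ItemBlob class and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

/** Hypothetical helper: converts type-specific item data to and from a blob. */
public class ItemBlob {

    /** Serializes the data to a byte array that can be stored as an opaque blob. */
    public static byte[] toBlob(HashMap<String, String> data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(data);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds the data from a blob produced by toBlob. */
    @SuppressWarnings("unchecked")
    public static HashMap<String, String> fromBlob(byte[] blob)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return (HashMap<String, String>) in.readObject();
        }
    }
}
```

The data store only ever sees the byte array, so each item type is free to put whatever it likes inside.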

We run on a very fast network, so we weren't initially concerned with the size of these blobs, but as they got bigger we started to hit a bug that seems to be caused by some combination of the network cards we're using, the switch between them, and possibly the Ethernet driver as well. This bug caused intermittent retransmits with 400ms backoffs on larger objects. Obviously, this isn't good when we're trying to keep our total request time below 500ms (leaving only 100ms to actually do anything). We weren't in a position to track down the delay (the equipment wasn't ours to tamper with), so we tried to mitigate it by decreasing the size of the items. The simplest way to do this (that didn't involve storing the blobs somewhere else or breaking them up) was to compress the data. This is all a long-winded way of saying that I had cause to evaluate a bunch of the compression stream implementations that I found scattered around. Surprisingly, I couldn't find a comparison like this already out there on the interwebs.

I wrote a simple test harness that reads a whole bunch of .class files (as an extremely rough approximation of what live instance data would look like) and compresses and decompresses them and records the size and times for each test. Below are the results of reading around 800 class files and compressing them, checking the compressed size, then decompressing them. The first item, "None", is just straight I/O without any compression.
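A harness along these lines might look like the sketch below. It is not the original test code: it uses only the JDK's GZIP streams, reads and writes in 4K chunks, and substitutes synthetic, repetitive text for the .class files so that it stays self-contained. The CompressionTiming class and its method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical timing harness for one compression stream implementation. */
public class CompressionTiming {
    private static final int CHUNK = 4096;

    /** Compresses the input with GZIP, writing it in 4K chunks. */
    public static byte[] compress(byte[] input) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            for (int off = 0; off < input.length; off += CHUNK) {
                out.write(input, off, Math.min(CHUNK, input.length - off));
            }
        }
        return sink.toByteArray();
    }

    /** Decompresses GZIP data, reading it in 4K chunks. */
    public static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[CHUNK];
            int n;
            while ((n = in.read(buf)) != -1) {
                sink.write(buf, 0, n);
            }
        }
        return sink.toByteArray();
    }

    /** Synthetic stand-in for the class files used in the real test. */
    public static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("public final class Item").append(i)
              .append(" { int field").append(i).append("; }\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = sampleData();
        long t0 = System.nanoTime();
        byte[] packed = compress(sample);
        long t1 = System.nanoTime();
        byte[] unpacked = decompress(packed);
        long t2 = System.nanoTime();
        System.out.printf("in=%d out=%d comp=%.2fms decomp=%.2fms roundTrip=%b%n",
                sample.length, packed.length,
                (t1 - t0) / 1e6, (t2 - t1) / 1e6,
                java.util.Arrays.equals(sample, unpacked));
    }
}
```

A single pass like this is dominated by JIT warm-up, so for numbers worth comparing the measured section would need to run many times over many inputs.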

Compressor    Comp Time    Comp Size    Decomp Time

The ZLIB compressors come from the JDK's Deflater zlib implementation with the "BEST_SPEED", default, and "BEST_COMPRESSION" options. GZIP and Zip are from the JDK as well. BZip2 is from the Apache Commons Compress project, and HadoopGZIP is from the Hadoop project, using its Java implementation.
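For reference, the three ZLIB variants can be set up with the JDK alone, as in the sketch below. The level constants are real fields of java.util.zip.Deflater; the ZlibLevels class, and the assumption that the tests wrapped a Deflater in a DeflaterOutputStream this way, are mine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper showing zlib compression at a chosen Deflater level. */
public class ZlibLevels {

    /**
     * Compresses the input at the given level, e.g. Deflater.BEST_SPEED,
     * Deflater.DEFAULT_COMPRESSION, or Deflater.BEST_COMPRESSION.
     */
    public static byte[] deflate(byte[] input, int level) throws IOException {
        Deflater deflater = new Deflater(level);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink, deflater)) {
            out.write(input);
        } finally {
            // A caller-supplied Deflater is not released by the stream's close().
            deflater.end();
        }
        return sink.toByteArray();
    }

    /** Decompresses zlib data produced at any level. */
    public static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sink.write(buf, 0, n);
            }
        }
        return sink.toByteArray();
    }
}
```

The level only affects the compressing side; the same inflate call reads data written at any of the three levels.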

Looking at the results, I was initially excited by the ZLIB-Fast option, but while its compression time is quite good for not that much loss in file size, the decompression time leaves a little to be desired. Since, generally speaking, items get written infrequently in our system and read quite frequently, the decompression time is the more important of the two (decompression is done at the client or, in the case of web apps, in the web server). ZLIB-Small did much better at decompressing, but the cost of compressing was fairly high. GZIP does pretty well in compression time, size, and decompression time. Zip speeds up compression a bit but takes a lot longer to decompress, and BZip2 (as expected) trades time for tighter compression. I was under the impression that Hadoop's GZIP would come out the same as the JDK's (in fact, I thought it was using it), but the numbers are consistently different.

I'm looking for something that helps to reduce the size of the data and doesn't take too much time. So all told, GZIP seems to be the clear winner here. Note that these times are only useful for relative comparison. I'm getting the data from many different files (which are cached) and I'm reading/writing in 4K chunks for my test. I may well do better with a different chunk size, but I suspect that the relative numbers will come out about the same.


Jeff Alexander is a member of the Information Retrieval and Machine Learning group in Oracle Labs.

