At the replicant level of The AURA Project (the base level that actually holds the data), we store our Items in a Berkeley DB / Java Edition
have a number of fixed fields that we can retrieve them by such as unique key, name, creation time, and type. Each type of item may have its own particular data though that the Data Store itself considers to be opaque. For example: a musical artist will have reviews, bios, links to videos, etc; a user will have a name, email address, gender and so on. To keep things simple in our current implementation, we store all of the type-specific data in a blob. We'll serialize the data to a byte array, then store the byte array in the database.
We run on a very fast network where we weren't initially concerned with the size of these blobs, but as they got bigger we started to hit a bug that seems to be caused by some combination of the network cards we're using, the switch between them, and possibly the ethernet driver as well. This bug was causing us to see some intermittent retransmits with 400ms backoffs on larger objects. Obviously, this isn't good when we're trying to keep our total request time below 500ms (leaving only 100ms to actually do anything). We weren't in a position to track down the delay (the equipment wasn't ours to tamper with), so we tried to mitigate it by decreasing the size of the items. The simplest way to do this (that didn't involve actually storing the blobs somewhere else or breaking them up) was to compress the data. This is all a long winded way of saying that I had cause to evaluate a bunch of the compression stream implementations that I found scattered around. Surprisingly, I couldn't find anything like this already out there on the interwebs.
I wrote a simple test harness that reads a whole bunch of .class files (as an extremely rough approximation of what live instance data would look like) and compresses and decompresses them and records the size and times for each test. Below are the results of reading around 800 class files and compressing them, checking the compressed size, then decompressing them. The first item, "None", is just straight I/O without any compression.
|Compressor||Comp Time||Comp Size||Decomp Time|
The ZLIB compressors come from the JDK's Deflator
zlib implementation with "BEST_SPEED", default, and "BEST_COMPRESSION" options. GZIP and Zip are from the JDK as well. BZip2 is from the Apache Commons Compress
project and HadoopGZIP is from the Hadoop
project using the Java implementation.
Looking at the results, I was initially excited by the ZLIB-Fast option, but while its compression time is quite good for not that much loss in file size, the decompression time leaves a little to be desired. Since, generally speaking, items get written infrequently in our system and read quite frequently, the decompression time (which is done at the client or in the case of web apps, in the web server) is the more important of the two. ZLIB-Small did much better with decompressing, but the cost of compressing was fairly high. GZIP does pretty well in compression time, size, and decompression time. Zip speeds compression a bit but took a lot longer to decompress, and BZip2 (as expected) trades off time for tighter compression. I was under the impression that Hadoop's GZIP would come out the same as JDK's (in fact, I thought it was using it) but the numbers are consistently different.
I'm looking for something that helps to reduce the size of the data and doesn't take too much time. So all told, GZIP seems to be the clear winner here. Note that these times are only useful for relative comparison. I'm getting the data from many different files (which are cached) and I'm reading/writing in 4K chunks for my test. I may well do better with a different chunk size, but I suspect that the relative numbers will come out about the same.