AURA Items and Java Compressor Streams
By user12610620 on Nov 02, 2009
We run on a very fast network, so we weren't initially concerned with the size of these item blobs. But as they got bigger, we started to hit a bug that seems to be caused by some combination of the network cards we're using, the switch between them, and possibly the Ethernet driver as well. The bug was causing us to see intermittent retransmits with 400ms backoffs on larger objects. Obviously, that's not good when we're trying to keep our total request time below 500ms (leaving only 100ms to actually do anything). We weren't in a position to track down the delay ourselves (the equipment wasn't ours to tamper with), so we tried to mitigate it by decreasing the size of the items. The simplest way to do that (short of storing the blobs somewhere else or breaking them up) was to compress the data. This is all a long-winded way of saying that I had cause to evaluate a bunch of the compression stream implementations I found scattered around. Surprisingly, I couldn't find a comparison like this already out there on the interwebs.
I wrote a simple test harness that reads a whole bunch of .class files (as an extremely rough approximation of what live instance data would look like), compresses and decompresses them, and records the size and times for each test; a rough sketch of the harness follows the results table. Below are the results of reading around 800 class files, compressing them, checking the compressed size, then decompressing them. The first row, "None", is just straight I/O without any compression.
| Compressor | Compression Time | Compressed Size | Decompression Time |
| --- | --- | --- | --- |
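To give a sense of what the harness actually does, here's a rough sketch of the per-compressor loop. This is a simplification, with GZIP standing in for whichever compressor is under test, and the class and method names are mine, not the actual harness code:

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

public class CompressionBench {

    static final int CHUNK = 4096; // read/write in 4K chunks, as in the test

    public static void main(String[] args) throws IOException {
        long compNanos = 0, decompNanos = 0, rawBytes = 0, compBytes = 0;

        // Walk a directory of .class files as a stand-in for item data.
        try (DirectoryStream<Path> classFiles =
                Files.newDirectoryStream(Paths.get(args[0]), "*.class")) {
            for (Path p : classFiles) {
                byte[] raw = Files.readAllBytes(p);
                rawBytes += raw.length;

                // Compress, timing the whole streaming run.
                ByteArrayOutputStream compressed = new ByteArrayOutputStream();
                long t0 = System.nanoTime();
                try (OutputStream out = new GZIPOutputStream(compressed)) {
                    copy(new ByteArrayInputStream(raw), out);
                }
                compNanos += System.nanoTime() - t0;
                compBytes += compressed.size();

                // Decompress and time it the same way.
                t0 = System.nanoTime();
                try (InputStream in = new GZIPInputStream(
                        new ByteArrayInputStream(compressed.toByteArray()))) {
                    copy(in, new ByteArrayOutputStream());
                }
                decompNanos += System.nanoTime() - t0;
            }
        }

        System.out.printf("raw=%dB comp=%dB compTime=%dms decompTime=%dms%n",
                rawBytes, compBytes,
                compNanos / 1_000_000, decompNanos / 1_000_000);
    }

    // Pump a stream through in fixed-size chunks.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[CHUNK];
        for (int n; (n = in.read(buf)) > 0; ) {
            out.write(buf, 0, n);
        }
    }
}
```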
The ZLIB compressors come from the JDK's Deflater zlib implementation with the BEST_SPEED, default, and BEST_COMPRESSION levels. GZIP and Zip are from the JDK as well. BZip2 is from the Apache Commons Compress project, and HadoopGZIP is from the Hadoop project, using the Java implementation.
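Concretely, the three ZLIB variants are just the JDK's DeflaterOutputStream constructed with a Deflater at different levels. Something along these lines (a simplified sketch, not the exact wrapper classes from the test):

```java
import java.io.*;
import java.util.zip.*;

public class ZlibLevels {

    // Wrap a sink in a raw zlib stream at the given compression level.
    static OutputStream zlib(OutputStream sink, int level) {
        return new DeflaterOutputStream(sink, new Deflater(level));
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "some serialized item data".getBytes("UTF-8");

        // The three ZLIB rows in the table correspond to Deflater levels.
        int[] levels = {
            Deflater.BEST_SPEED,          // ZLIB-Fast
            Deflater.DEFAULT_COMPRESSION, // ZLIB
            Deflater.BEST_COMPRESSION     // ZLIB-Small
        };
        for (int level : levels) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (OutputStream out = zlib(buf, level)) {
                out.write(data);
            }
            System.out.println("level " + level + " -> " + buf.size() + " bytes");
        }
    }
}
```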
Looking at the results, I was initially excited by the ZLIB-Fast option: its compression time is quite good for not much loss in compressed size, but its decompression time leaves a little to be desired. Since, generally speaking, items get written infrequently in our system and read quite frequently, decompression time (which is paid at the client or, in the case of web apps, in the web server) is the more important of the two. ZLIB-Small did much better at decompressing, but the cost of compressing was fairly high. GZIP does pretty well on compression time, size, and decompression time. Zip speeds up compression a bit but takes a lot longer to decompress, and BZip2 (as expected) trades time for tighter compression. I was under the impression that Hadoop's GZIP would come out the same as the JDK's (in fact, I thought it was using it), but the numbers are consistently different.
I'm looking for something that reduces the size of the data without taking too much time, so all told, GZIP seems to be the clear winner here. Note that these times are only useful for relative comparison: I'm getting the data from many different files (which are cached) and I'm reading/writing in 4K chunks for my test. I may well do better with a different chunk size, but I suspect that the relative numbers would come out about the same.
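For completeness, here's roughly what wrapping an item blob with the JDK's GZIP streams looks like. This is a simplified sketch; ItemCodec and the method names are mine, not the actual AURA code:

```java
import java.io.*;
import java.util.zip.*;

// Hypothetical helper, not the actual AURA code.
public class ItemCodec {

    // Compress a serialized item blob before it goes over the wire.
    public static byte[] gzip(byte[] item) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(buf)) {
            out.write(item);
        }
        return buf.toByteArray();
    }

    // Decompress at the client (or in the web server, for web apps).
    public static byte[] gunzip(byte[] blob) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(blob))) {
            byte[] chunk = new byte[4096]; // same 4K chunking as the test
            for (int n; (n = in.read(chunk)) > 0; ) {
                buf.write(chunk, 0, n);
            }
        }
        return buf.toByteArray();
    }
}
```

The nice thing about doing this at the stream level is that the stored format stays opaque to the data store; only the code that serializes and deserializes items needs to know the blobs are gzipped.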