
November 2008 Archives

November 30, 2008

Berkeley DB Java Edition Group Commit vs ext3

In JE, when the durability setting dictates force-writing to disk (for example, a commitSync() call), JE calls fsync(). On modern (non-SSD) disk drives an fsync() takes on the order of several milliseconds, a long time compared with a write() call, which only moves data into the operating system's file cache. Presently, JE's group commit code ensures that when concurrent application threads require fsync() calls and an fsync() would otherwise block, those fsync()s are executed in batches. For example, if a thread T1 requires an fsync() and JE determines that no fsync() is in progress, it executes that fsync() immediately. If, while T1 is executing the fsync(), other threads require an fsync(), JE blocks those threads until T1 finishes. When T1's fsync() completes, a single new fsync() executes on behalf of all the blocked threads. In this way, JE achieves group commit.
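The batching described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not JE's actual code: the class name GroupCommitSketch, its fields, and the Runnable standing in for the real fsync() are all hypothetical.

```java
/** Hypothetical sketch of fsync batching; this is not JE's implementation. */
class GroupCommitSketch {
    private final Runnable fsync;        // stand-in for the real fsync() on the log file
    private final Object lock = new Object();
    private boolean inProgress = false;  // true while some thread is inside fsync()
    private long started = 0;            // number of fsyncs that have begun
    private long completed = 0;          // number of the last fsync that finished successfully

    GroupCommitSketch(Runnable fsync) {
        this.fsync = fsync;
    }

    /** Returns once an fsync that began after this call has completed. */
    void sync() throws InterruptedException {
        final long mine;
        synchronized (lock) {
            // Only an fsync that starts from now on is guaranteed to cover
            // the data this thread wrote before calling sync().
            mine = started + 1;
            while (inProgress && completed < mine) {
                lock.wait();             // blocked behind the fsync in progress
            }
            if (completed >= mine) {
                return;                  // another thread's fsync covered this request
            }
            inProgress = true;           // this thread runs the fsync for the whole batch
            started++;
        }
        boolean ok = false;
        try {
            fsync.run();                 // the lock is not held during the slow call
            ok = true;
        } finally {
            synchronized (lock) {
                if (ok) {
                    completed = started;
                }
                inProgress = false;
                lock.notifyAll();        // wake the batch that queued up meanwhile
            }
        }
    }
}
```

With this sketch, if T1 is inside fsync() when T2 and T3 call sync(), both wait; once T1 finishes, one of them issues a second fsync() and the other returns when it completes, so three requests cost two fsync() calls.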

The existing JE group commit code assumes that the underlying platform (the OS and file system combination) allows IO operations like seek(), read() and write() to execute concurrently with an fsync() call (on the same file, but using different file descriptors). On Solaris and Windows this is true. Hence, on these platforms, a thread which is performing an fsync() does not block another thread performing a concurrent write(). But on several Linux file systems, ext3 in particular, an exclusive mutex on the inode is grabbed during any IO operation, so a write() call on a file (inode) is blocked by an fsync() operation on the same file (inode). This negates any performance improvement which might be achieved by group commit.

Internally, I have finished improving group commit so that it batches write() and fsync() calls, rather than just fsync() calls. Just before a write() call is executed, JE checks whether an fsync() is in progress and, if so, queues the write in a (new) Write Queue. Once the fsync() completes, all pending writes in the Write Queue are executed.
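A minimal sketch of that idea follows. Everything in it is an assumption made for illustration: the WriteQueueSketch class, the LogFile interface, and in particular the overflow policy (a writer that does not fit in the queue simply waits for the fsync() to finish). It also assumes only one fsync() runs at a time and does not address ordering between a waiting writer and later writes that still fit in the queue.

```java
import java.util.ArrayDeque;

/** Hypothetical sketch of a write queue; names and overflow policy are assumptions. */
class WriteQueueSketch {
    /** Stand-in for the log file. A real one would throw IOException. */
    interface LogFile {
        void write(byte[] data);
        void fsync();
    }

    private final LogFile file;
    private final int capacityBytes;
    private final Object lock = new Object();
    private final ArrayDeque<byte[]> queue = new ArrayDeque<>();
    private int queuedBytes = 0;
    private boolean fsyncInProgress = false;

    WriteQueueSketch(LogFile file, int capacityBytes) {
        this.file = file;
        this.capacityBytes = capacityBytes;
    }

    /** Writes immediately, or queues the data if an fsync is in progress. */
    void write(byte[] data) throws InterruptedException {
        synchronized (lock) {
            if (fsyncInProgress && queuedBytes + data.length <= capacityBytes) {
                queue.add(data.clone()); // defer until the fsync completes
                queuedBytes += data.length;
                return;
            }
            // Assumed overflow policy: a write that does not fit waits for the fsync.
            while (fsyncInProgress) {
                lock.wait();
            }
            file.write(data);
        }
    }

    /** Assumes callers never run two fsyncs at once (group commit would ensure that). */
    void fsync() {
        synchronized (lock) {
            fsyncInProgress = true;
        }
        try {
            file.fsync();                // lock not held, so writers can enqueue meanwhile
        } finally {
            synchronized (lock) {
                byte[] data;
                while ((data = queue.poll()) != null) {
                    file.write(data);    // execute the pending writes in arrival order
                }
                queuedBytes = 0;
                fsyncInProgress = false;
                lock.notifyAll();
            }
        }
    }

    int queuedBytes() {
        synchronized (lock) {
            return queuedBytes;
        }
    }
}
```

The point of the design is that a writer arriving during an fsync() returns almost immediately instead of stalling on the inode mutex; the cost is the memory held by the queue until the fsync() completes.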

I have enabled this behavior by default, but it may be disabled by setting the je.log.useWriteQueue configuration parameter to false. The size of the Write Queue (i.e., the amount of data it can hold until any currently-executing IO operations complete) can be controlled with the je.log.writeQueueSize parameter. The default for je.log.writeQueueSize is 1MB, with a minimum value of 4KB and a maximum value of 32MB. The Write Queue does not use cache space controlled by the je.maxMemory parameter. For future reference, this is internal SR [#16440].
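As a rough illustration of the two parameters, the helper below builds a java.util.Properties object holding them and rejects sizes outside the range given above. The class and method names are hypothetical, the sketch assumes 1KB = 1024 bytes and that the size is written as a plain byte count, and it deliberately leaves out how the properties reach JE; consult the JE documentation for your release on that point.

```java
import java.util.Properties;

/** Hypothetical helper; only the two parameter names come from the post. */
class WriteQueueConfigSketch {
    // Bounds from the post, assuming 1KB = 1024 bytes.
    static final long MIN_BYTES = 4L * 1024;
    static final long MAX_BYTES = 32L * 1024 * 1024;
    static final long DEFAULT_BYTES = 1024L * 1024;

    /** Builds the Write Queue settings, validating the size against the stated range. */
    static Properties writeQueueProps(boolean useWriteQueue, long sizeBytes) {
        if (sizeBytes < MIN_BYTES || sizeBytes > MAX_BYTES) {
            throw new IllegalArgumentException(
                "je.log.writeQueueSize must be between " + MIN_BYTES
                    + " and " + MAX_BYTES + " bytes, got " + sizeBytes);
        }
        Properties props = new Properties();
        props.setProperty("je.log.useWriteQueue", Boolean.toString(useWriteQueue));
        // Assumption: the size is expressed as a plain number of bytes.
        props.setProperty("je.log.writeQueueSize", Long.toString(sizeBytes));
        return props;
    }
}
```

For example, writeQueueProps(false, WriteQueueConfigSketch.DEFAULT_BYTES) describes an environment with the Write Queue turned off, which may be worth trying on platforms where write() and fsync() already run concurrently.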

About November 2008

This page contains all entries posted to Charles Lamb's Blog in November 2008. They are listed from oldest to newest.
