In JE, when the durability setting requires forcing writes to disk (for example, on a commitSync() call), JE calls fsync(). On modern (non-SSD) disk drives an fsync() takes on the order of several milliseconds, a relatively long time compared with a write() call, which only moves data into the operating system's file cache. At present, JE's group commit code ensures that concurrent application threads requiring fsync() calls have those calls executed in batches whenever an fsync() would otherwise block. For example, if a thread T1 requires an fsync() and JE determines that no fsync() is in progress, T1 executes its fsync() immediately. If other threads require an fsync() while T1's is executing, JE blocks those threads until T1 finishes. When T1's fsync() completes, a new fsync() executes on behalf of all the blocked threads. In this way, JE achieves group commit.
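The fsync() batching described above can be sketched roughly as follows. This is only an illustration of the leader/follower idea, not JE's actual code: the class FsyncGroupCommit, its fields, and the Runnable standing in for the real fsync() call are all hypothetical names invented for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of fsync-only group commit; not JE's actual code. */
class FsyncGroupCommit {
    private final Runnable fsync;       // stand-in for the real fsync() call
    private boolean inProgress = false; // true while some thread is inside fsync()
    private long started = 0;           // fsyncs started so far
    private long completed = 0;         // fsyncs that finished successfully

    FsyncGroupCommit(Runnable fsync) { this.fsync = fsync; }

    /** Returns once an fsync that began after this call has completed. */
    void sync() {
        long target;
        synchronized (this) {
            // Only an fsync that starts from now on is guaranteed to cover our data.
            target = started + 1;
        }
        while (true) {
            synchronized (this) {
                while (inProgress) {
                    try {
                        wait(); // block until the running fsync finishes
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted waiting for fsync", e);
                    }
                }
                if (completed >= target) {
                    return; // another thread's fsync already covered this request
                }
                inProgress = true; // become the leader for the next batch
                started++;
            }
            boolean ok = false;
            try {
                fsync.run(); // executed outside the lock so others can line up
                ok = true;
            } finally {
                synchronized (this) {
                    inProgress = false;
                    if (ok) completed = started;
                    notifyAll(); // wake every thread that blocked behind this fsync
                }
            }
        }
    }

    /** Demo: four threads request a sync while a first fsync is stalled. */
    static int demo() {
        AtomicInteger calls = new AtomicInteger();
        CountDownLatch firstEntered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        FsyncGroupCommit gc = new FsyncGroupCommit(() -> {
            if (calls.incrementAndGet() == 1) { // stall only the first fsync
                firstEntered.countDown();
                awaitQuietly(release);
            }
        });
        Thread t1 = new Thread(gc::sync);
        t1.start();
        awaitQuietly(firstEntered);
        Thread[] followers = new Thread[4];
        for (int i = 0; i < followers.length; i++) {
            followers[i] = new Thread(gc::sync);
            followers[i].start();
        }
        for (Thread f : followers) { // wait until all four are blocked behind T1
            while (f.getState() != Thread.State.WAITING) Thread.yield();
        }
        release.countDown();
        try {
            t1.join();
            for (Thread f : followers) f.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return calls.get();
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("fsync calls for 5 commits: " + demo());
    }
}
```

In the demo, five threads each ask for a sync but the stand-in fsync runs only twice: once for T1 and once for the four threads that blocked behind it.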
The existing JE group commit code assumes that the underlying platform (the OS and file system combination) allows IO operations such as seek(), read() and write() to execute concurrently with an fsync() call (on the same file, but using different file descriptors). This is true on Solaris and Windows, so on those platforms a thread performing an fsync() does not block another thread performing a concurrent write(). On several Linux file systems, however, and ext3 in particular, an exclusive mutex on the inode is held during any IO operation, so a write() call on a file (inode) is blocked by an fsync() operation on the same file (inode). This negates any performance improvement that group commit might otherwise achieve.
Internally, I have finished improving group commit so that it batches write() and fsync() calls, rather than just fsync() calls. Just before a write() call is executed, JE checks whether an fsync() is in progress and, if so, queues the write in a (new) Write Queue. Once the fsync() completes, all pending writes in the Write Queue are executed.
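As a rough illustration of the Write Queue idea, here is a minimal sketch. It is not JE's implementation: the class WriteQueueSketch, the in-memory "file", and the event trace are hypothetical, and the behavior when the queue is full (the writer simply waits for the fsync to finish) is an assumption made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a write queue; names and overflow policy are assumptions. */
class WriteQueueSketch {
    private final int capacityBytes; // plays the role of a queue size limit
    private final ArrayDeque<byte[]> queue = new ArrayDeque<>();
    private int queuedBytes = 0;
    private boolean fsyncInProgress = false;
    private final ByteArrayOutputStream file = new ByteArrayOutputStream(); // stand-in for the log file
    final List<String> events = new ArrayList<>(); // trace used by the demo

    WriteQueueSketch(int capacityBytes) { this.capacityBytes = capacityBytes; }

    /** Writes immediately, or queues the data if an fsync is running. */
    synchronized void write(byte[] data) {
        while (fsyncInProgress) {
            if (queuedBytes + data.length <= capacityBytes) {
                queue.add(data);
                queuedBytes += data.length;
                events.add("queued " + data.length);
                return; // the caller does not wait for the fsync
            }
            waitForFsync(); // assumption: block when the queue has no room
        }
        doWrite(data);
    }

    /** Runs the supplied fsync, then executes every write queued meanwhile. */
    void fsync(Runnable realFsync) {
        synchronized (this) {
            while (fsyncInProgress) waitForFsync(); // one fsync at a time
            fsyncInProgress = true;
        }
        try {
            realFsync.run(); // lock not held, so write() can queue concurrently
        } finally {
            synchronized (this) {
                events.add("fsync returned");
                byte[] pending;
                while ((pending = queue.poll()) != null) doWrite(pending);
                queuedBytes = 0;
                fsyncInProgress = false;
                notifyAll();
            }
        }
    }

    private void doWrite(byte[] data) { // caller holds the monitor
        file.write(data, 0, data.length);
        events.add("write " + data.length);
    }

    private void waitForFsync() { // caller holds the monitor
        try {
            wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for fsync", e);
        }
    }

    /** Demo: one direct write, then a write that arrives during an fsync. */
    static List<String> demo() {
        WriteQueueSketch log = new WriteQueueSketch(1024);
        log.write(new byte[3]); // no fsync running, so this is written directly
        log.fsync(() -> {
            // Simulate another thread writing while the fsync is still running.
            Thread writer = new Thread(() -> log.write(new byte[2]));
            writer.start();
            try {
                writer.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        });
        synchronized (log) {
            return new ArrayList<>(log.events);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The trace from the demo shows the second write being queued rather than blocked, and then executed only after the fsync has returned.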
I have enabled this behavior by default, but it may be disabled by setting the je.log.useWriteQueue configuration parameter to false. The size of the Write Queue (i.e., the amount of data it can hold until any currently executing IO operations complete) is controlled by the je.log.writeQueueSize parameter. The default for je.log.writeQueueSize is 1MB, with a minimum value of 4KB and a maximum value of 32MB. The Write Queue does not use cache space controlled by the je.maxMemory parameter. For future reference, this is internal SR [#16440].
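For example, the two parameters might be set in a je.properties file in the environment home directory, along these lines. The 4MB figure is just an illustrative choice within the stated 4KB to 32MB range, and the value is assumed here to be given in bytes.

```properties
# je.properties (hypothetical example settings)

# Option 1: turn the Write Queue off entirely.
# je.log.useWriteQueue=false

# Option 2: leave it enabled and raise its size from the 1MB default to 4MB.
je.log.writeQueueSize=4194304
```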
Comments (2)
Interesting post, Charles. Thanks.
Have you done any work to characterize the degree of write and commit concurrency that benefits from this approach? It seems to me that you'd need three committers in the window of a single fsync(), but only one write/commit overlap, to win. Is that right?
Posted by Mike Olson | December 1, 2008 12:27 PM
Hi Mike,
Yeah, that's probably a good estimate. The tests I ran were with five concurrent writer threads (all doing commit+sync). I saw about a 2.5x improvement. I also ran this on platforms which don't block write() calls during fsync() and saw an improvement over there too. I believe the speedups were 2.5x to 3.5x (roughly).
-cwl
Posted by Charles Lamb | December 1, 2008 1:42 PM