Berkeley DB Java Edition Group Commit vs ext3

In JE, when the durability setting dictates force-writing to disk (for example, a commitSync() call), JE calls fsync(). On modern (non-SSD) disk drives an fsync() takes on the order of several milliseconds, a relatively long time compared to a write() call, which only moves data to the operating system's file cache. Presently, JE's group commit code ensures that concurrent application threads requiring fsync() calls have those calls executed in batches whenever an fsync() would otherwise block. For example, if a thread T1 requires an fsync(), and JE determines that no fsync() is in progress, it executes that fsync() immediately. If, while T1 is executing the fsync(), other threads require an fsync(), JE blocks those threads until T1 finishes. When T1's fsync() completes, a single new fsync() executes on behalf of all the blocked threads. In this way, JE achieves group commit.
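The T1 scenario above can be sketched in a few lines of Java. This is only an illustration of the idea, not JE's actual code: the class name GroupCommitSketch, its fields, and the Runnable standing in for the real fsync() are all hypothetical.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only: these are not JE's real classes or field names. */
final class GroupCommitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition syncDone = lock.newCondition();
    private final Runnable fsyncAction;   // hypothetical stand-in for fsync()
    private boolean syncInProgress = false;
    private long startedSyncs = 0;        // fsyncs that have begun
    private long completedSyncs = 0;      // highest fsync that finished OK
    private int blockedThreads = 0;       // threads waiting on a running fsync

    GroupCommitSketch(Runnable fsyncAction) {
        this.fsyncAction = fsyncAction;
    }

    /** Returns once an fsync that started after this call has completed. */
    void commitSync() {
        lock.lock();
        try {
            // An fsync already in flight may have started before our data
            // was written, so we need the next one to finish.
            final long target = startedSyncs + 1;
            while (completedSyncs < target) {
                if (syncInProgress) {
                    blockedThreads++;
                    try {
                        syncDone.awaitUninterruptibly();
                    } finally {
                        blockedThreads--;
                    }
                } else {
                    runSyncAsLeader();
                }
            }
        } finally {
            lock.unlock();
        }
    }

    /** Called with the lock held; releases it while the fsync runs. */
    private void runSyncAsLeader() {
        syncInProgress = true;
        final long mine = ++startedSyncs;
        boolean succeeded = false;
        lock.unlock();
        try {
            fsyncAction.run();
            succeeded = true;
        } finally {
            lock.lock();
            syncInProgress = false;
            if (succeeded) {
                completedSyncs = mine;
            }
            syncDone.signalAll();   // wake every thread blocked on this fsync
        }
    }

    /** Diagnostic: how many threads are currently blocked behind an fsync. */
    int blockedThreads() {
        lock.lock();
        try {
            return blockedThreads;
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch, if four threads call commitSync() while a first thread's fsync is still running, one of the four performs a single follow-up fsync and the other three return as soon as it completes, so five commits cost two fsyncs.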

The existing JE group commit code assumes that the underlying platform (OS + file system combination) allows IO operations like seek(), read() and write() to execute concurrently with an fsync() call (on the same file, but using different file descriptors). On Solaris and Windows this is true. Hence, on these platforms, a thread which is performing an fsync() does not block another thread performing a concurrent write(). But on several Linux file systems, ext3 in particular, an exclusive mutex on the inode is held during any IO operation, so a write() call on a file (inode) is blocked by an fsync() operation on the same file (inode). This negates any performance improvement which might be achieved by group commit.

Internally, I have finished improving the group commit code so that it batches write() and fsync() calls, rather than just fsync() calls. Just before a write() call is executed, JE checks whether an fsync() is in progress and, if so, queues the write() call in a (new) Write Queue. Once the fsync() completes, all pending writes in the Write Queue are executed.
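A rough sketch of the Write Queue idea follows. It is not JE's implementation: the class WriteQueueSketch, the Consumer and Runnable that stand in for write() and fsync(), and the choice to make a writer wait when the queue is full are all assumptions made for illustration. The sketch also ignores file offsets and simply runs one fsync at a time.

```java
import java.util.ArrayDeque;
import java.util.function.Consumer;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only: these are not JE's real classes or field names. */
final class WriteQueueSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition fsyncFinished = lock.newCondition();
    private final ArrayDeque<byte[]> queue = new ArrayDeque<>();
    private final Consumer<byte[]> writeAction;  // hypothetical stand-in for write()
    private final Runnable fsyncAction;          // hypothetical stand-in for fsync()
    private final int capacityBytes;
    private int queuedBytes = 0;
    private boolean fsyncInProgress = false;

    WriteQueueSketch(Consumer<byte[]> writeAction, Runnable fsyncAction,
                     int capacityBytes) {
        this.writeAction = writeAction;
        this.fsyncAction = fsyncAction;
        this.capacityBytes = capacityBytes;
    }

    /** Writes immediately, or queues the data if an fsync is running. */
    void write(byte[] data) {
        lock.lock();
        try {
            // Assumption: when the queue is full, the writer simply waits
            // for the running fsync to finish.
            while (fsyncInProgress
                    && queuedBytes + data.length > capacityBytes) {
                fsyncFinished.awaitUninterruptibly();
            }
            if (fsyncInProgress) {
                queue.add(data);
                queuedBytes += data.length;
            } else {
                writeAction.accept(data);
            }
        } finally {
            lock.unlock();
        }
    }

    /** Runs one fsync, then executes every write queued while it ran. */
    void fsync() {
        lock.lock();
        try {
            while (fsyncInProgress) {
                fsyncFinished.awaitUninterruptibly();
            }
            fsyncInProgress = true;
        } finally {
            lock.unlock();
        }
        try {
            fsyncAction.run();   // runs without the lock held
        } finally {
            lock.lock();
            try {
                for (byte[] data; (data = queue.poll()) != null; ) {
                    writeAction.accept(data);   // drain in arrival order
                }
                queuedBytes = 0;
                fsyncInProgress = false;
                fsyncFinished.signalAll();
            } finally {
                lock.unlock();
            }
        }
    }
}
```

The point of the design is that a writer arriving during an fsync returns right away instead of stalling on the inode mutex; its data reaches the file only after the fsync completes.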

I have enabled this behavior by default but it may be disabled by setting the je.log.useWriteQueue configuration parameter to false. The size of the Write Queue (i.e. the amount of data it can queue until any currently-executing IO operations complete) can be controlled with the je.log.writeQueueSize parameter. The default for je.log.writeQueueSize is 1MB with a minimum value of 4KB and a maximum value of 32MB. The Write Queue does not use cache space controlled by the je.maxMemory parameter. For future reference, this is internal SR [#16440].
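As a small illustration of these two parameters, the helper below builds a java.util.Properties object holding them and rejects sizes outside the 4KB to 32MB range stated above. The class WriteQueueSettings is hypothetical, and the sketch assumes the size is given as a plain number of bytes. How the properties are handed to JE (for example through a je.properties file in the environment home directory) is outside the scope of this sketch.

```java
import java.util.Properties;

/** Hypothetical helper; only the two parameter names come from this post. */
final class WriteQueueSettings {
    static final int MIN_QUEUE_BYTES = 4 * 1024;           // 4KB minimum
    static final int MAX_QUEUE_BYTES = 32 * 1024 * 1024;   // 32MB maximum

    /** Builds the two Write Queue settings, validating the size range. */
    static Properties build(boolean useWriteQueue, int queueBytes) {
        if (queueBytes < MIN_QUEUE_BYTES || queueBytes > MAX_QUEUE_BYTES) {
            throw new IllegalArgumentException(
                "je.log.writeQueueSize out of range: " + queueBytes);
        }
        Properties props = new Properties();
        props.setProperty("je.log.useWriteQueue",
                          Boolean.toString(useWriteQueue));
        // Assumption: the size is expressed as a plain byte count.
        props.setProperty("je.log.writeQueueSize",
                          Integer.toString(queueBytes));
        return props;
    }
}
```

For example, WriteQueueSettings.build(true, 2 * 1024 * 1024) yields a 2MB queue, while passing false for the first argument produces the setting that disables the Write Queue.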

Comments:

Interesting post, Charles. Thanks.

Have you done any work to characterize the degree of write and commit concurrency that benefits from this approach? It seems to me that you'd need three committers in the window of a single fsync(), but only one write/commit overlap, to win. Is that right?

Posted by Mike Olson on December 01, 2008 at 04:27 AM EST #

Hi Mike, Yeah, that's probably a good estimate. The tests I ran were with five concurrent writer threads (all doing commit+sync). I saw about a 2.5x improvement. I also ran this on platforms which don't block write() calls during fsync() and saw an improvement over there too. I believe the speedups were 2.5x to 3.5x (roughly). -cwl

Posted by Charles Lamb on December 01, 2008 at 05:42 AM EST #
