Wednesday Nov 16, 2005

ZFS: The Lumberjack

I'm a lumberjack and I'm ok,
I sleep all night and I work all day.

... (Monty Python)

Well, just recently I've been working much of the night as well! For many years now I've been working on logging: first improving UFS logging (technically journalling) performance, and more recently the ZFS Intent Log. Hence "The Lumberjack".

So, you may well ask, what is this ZFS Intent Log? ZFS is always consistent on disk due to its transaction model. Unix system calls can be considered as transactions, which are aggregated into a transaction group for performance and committed together periodically. Either everything commits or nothing does. That is, if the power goes out, the transactions in the pool are never partial. This commit happens fairly infrequently, typically with a few seconds between each transaction group commit.

Some applications, such as databases, need assurance that, say, the data they wrote or the mkdir they just executed is on stable storage, and so they request synchronous semantics such as O_DSYNC (when opening a file), or execute fsync(fd) after a series of changes to a file descriptor. Obviously, waiting seconds for the transaction group to commit before returning from the system call is not a high performance solution. Thus the ZFS Intent Log (ZIL) was born.
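From the application side, the two requests mentioned above look roughly like this. This is a minimal sketch in Python rather than anything taken from ZFS itself; the helper names (write_then_fsync, write_dsync, demo) are made up for illustration, and O_DSYNC is looked up defensively because not every platform defines it.

```python
import os
import tempfile


def _write_all(fd, data):
    # os.write() may write fewer bytes than requested, so loop until done.
    offset = 0
    while offset < len(data):
        offset += os.write(fd, data[offset:])


def write_then_fsync(path, data):
    """Write normally, then ask for everything on this descriptor to be stable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, data)
        os.fsync(fd)  # returns only once the file system reports the data stable
    finally:
        os.close(fd)


def write_dsync(path, data):
    """Open with O_DSYNC so each write() is synchronous for the file data."""
    dsync = getattr(os, "O_DSYNC", 0)  # assumption: fall back to fsync if missing
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | dsync, 0o644)
    try:
        _write_all(fd, data)
        if not dsync:
            os.fsync(fd)
    finally:
        os.close(fd)


def demo():
    """Write a file both ways in a scratch directory and read the results back."""
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "record.dat")
        write_then_fsync(path, b"first")
        with open(path, "rb") as f:
            first = f.read()
        write_dsync(path, b"second")
        with open(path, "rb") as f:
            second = f.read()
    return first, second
```

Either way, the call does not return until the file system says the data is safe, which is exactly the wait the ZIL is there to shorten.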

The ZIL gathers system call transactions together in memory. If the system call needs to be pushed out immediately (e.g. O_DSYNC), or an fsync() later arrives, then the in-memory list is pushed out to a per-filesystem on-disk log. In the event of a crash or power failure, the log is examined and those system calls that did not get committed are replayed.
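To make that flow concrete, here is a toy model in Python. It is only a sketch of the idea, not how ZFS implements it: the ToyPool class, its fields, and the use of a dict for the pool state are all assumptions made up for illustration, and real replay is a good deal more involved than applying records directly.

```python
class ToyPool:
    """Hypothetical model of one file system: pool state plus an intent log."""

    def __init__(self):
        self.on_disk = {}     # state as of the last transaction group commit
        self.open_txg = []    # transactions waiting for the next commit (memory)
        self.zil_memory = []  # intent log records not yet pushed out (memory)
        self.zil_disk = []    # intent log records on stable storage

    def syscall(self, key, value, sync=False):
        """Record a change; push the log out at once if it is synchronous."""
        record = (key, value)
        self.open_txg.append(record)
        self.zil_memory.append(record)
        if sync:
            self.fsync()

    def fsync(self):
        """Push the in-memory records out to the on-disk log."""
        self.zil_disk.extend(self.zil_memory)
        self.zil_memory.clear()

    def commit_txg(self):
        """Commit the whole group at once, then release its log records."""
        for key, value in self.open_txg:
            self.on_disk[key] = value
        self.open_txg.clear()
        self.zil_memory.clear()
        self.zil_disk.clear()

    def crash_and_replay(self):
        """Lose everything held in memory, then replay the on-disk log."""
        self.open_txg.clear()
        self.zil_memory.clear()
        for key, value in self.zil_disk:
            self.on_disk[key] = value
        self.zil_disk.clear()
```

In this model a change that was followed by fsync() survives a crash, while one that only ever lived in memory is lost, which is all a caller that did not ask for synchronous semantics was promised anyway.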

The on-disk log consists of a header and log blocks. Inside the log blocks are records containing the system call transactions. Log blocks are chained together dynamically. There is no fixed location for the blocks; they are just grabbed and released as needed from the pool. This dynamic allocation makes it easy to change the log block size according to demand. When a transaction group commits, all log blocks containing those transactions are released. On replay we determine the validity of the next block in the chain using a strong checksum and further checking of the log record fields. Here's a diagram of the on-disk format:

[Diagram: ZIL on-disk format (image not available)]
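Separately from the diagram, the walk along the chain of log blocks can be sketched in code. This is a toy in Python and not the real on-disk format: the block layout (a JSON body with a sequence number, records, and a next pointer), the use of SHA-256 as the strong checksum, and the names make_block and walk_chain are all assumptions for illustration. The sequence number check stands in for the further checking of log record fields.

```python
import hashlib
import json


def make_block(seq, records, next_id):
    """Build a toy log block: a serialized body plus a checksum over it."""
    body = json.dumps(
        {"seq": seq, "records": records, "next": next_id}, sort_keys=True
    ).encode()
    return {"body": body, "checksum": hashlib.sha256(body).hexdigest()}


def walk_chain(pool, first_id):
    """Collect records along the chain, stopping at the first invalid block.

    `pool` maps arbitrary block ids to blocks, since the blocks have no
    fixed location.
    """
    records = []
    block_id, expected_seq = first_id, 1
    while block_id in pool:
        block = pool[block_id]
        # A checksum mismatch means the block was never fully written
        # or was damaged, so the chain ends here.
        if hashlib.sha256(block["body"]).hexdigest() != block["checksum"]:
            break
        body = json.loads(block["body"])
        # A wrong sequence number means a stale block from an older chain.
        if body["seq"] != expected_seq:
            break
        records.extend(body["records"])
        block_id, expected_seq = body["next"], expected_seq + 1
    return records
```

Replay in this model simply stops at the first block that is missing, fails its checksum, or carries the wrong sequence number, and everything collected before that point is trusted.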

So that's the ZIL in a nutshell. Its design was guided by previous years of frustration with UFS logging. We knew it had to be very fast, as synchronous semantics are critical to database performance. We certainly learnt from previous mistakes.

There's a whole bunch more intricate details I could talk about, but they can wait for another blog entry. However, you can also get details about the ZIL in Neel's blog. There's also more work to do. Still, it's very functional now and bloody fast!


About

perrin
