ZFS: The Lumberjack

I'm a lumberjack and I'm ok,
I sleep all night and I work all day.

... (Monty Python)

Well, just recently I've been working much of the night as well! For many years now I've been working on logging: first improving UFS logging (technically journalling) performance, and more recently the ZFS Intent Log. Hence "The Lumberjack".

So, you may well ask, what is this ZFS Intent Log? ZFS is always consistent on disk due to its transaction model. Unix system calls can be considered as transactions, which are aggregated into a transaction group for performance and committed together periodically. Either everything commits or nothing does. That is, if the power goes out, the transactions in the pool are never partial. This commit happens fairly infrequently: typically there are a few seconds between each transaction group commit.

Some applications, such as databases, need assurance that, say, the data they wrote or the mkdir they just executed is on stable storage, and so they request synchronous semantics such as O_DSYNC (when opening a file), or execute fsync(fd) after a series of changes to a file descriptor. Obviously, waiting seconds for the transaction group to commit before returning from the system call is not a high-performance solution. Thus the ZFS Intent Log (ZIL) was born.
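To make the two ways of asking for synchronous semantics concrete, here is a minimal sketch from the application side. The helper names `write_sync_per_call` and `write_then_fsync` are made up for illustration, and the fallback to `O_SYNC` where `O_DSYNC` is not defined is my assumption about a reasonable substitute, not something ZFS requires.

```python
import os

def write_sync_per_call(path, data):
    """Open with O_DSYNC so each write() returns only once the data is stable."""
    # Not every platform defines O_DSYNC; fall back to O_SYNC, or to no flag at all.
    sync_flag = getattr(os, "O_DSYNC", None) or getattr(os, "O_SYNC", 0)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | sync_flag, 0o600)
    try:
        os.write(fd, data)  # small buffer, so a short write is not handled here
    finally:
        os.close(fd)

def write_then_fsync(path, chunks):
    """Make a series of ordinary writes, then force them all out with one fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for chunk in chunks:
            os.write(fd, chunk)  # may return before the data is on stable storage
        os.fsync(fd)             # returns only once the earlier writes are stable
    finally:
        os.close(fd)
```

The first form pays the synchronous cost on every write; the second batches several changes and pays it once, which is the pattern a database commit typically follows.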

The ZIL gathers the transactions of system calls together in memory. If a system call needs to be pushed out immediately (e.g. O_DSYNC), or an fsync() arrives later, then the in-memory list is pushed out to a per-filesystem on-disk log. In the event of a crash or power failure, the log is examined and those system calls that did not get committed are replayed.
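As a rough illustration of that flow, here is a toy model in Python. It is not ZFS code: the class `ToyIntentLog`, its method names, and the use of plain lists for "memory" and "disk" are all assumptions made for the sketch.

```python
class ToyIntentLog:
    """Toy model of one filesystem's intent log; not how ZFS implements it."""

    def __init__(self):
        self.in_memory = []     # records gathered but not yet pushed to the log
        self.on_disk_log = []   # records that would survive a crash
        self.committed_txg = 0  # last transaction group committed to the pool

    def record(self, txg, op, sync=False):
        """Remember a system call; push immediately if it is synchronous."""
        self.in_memory.append((txg, op))
        if sync:
            self.fsync()

    def fsync(self):
        """Push the whole in-memory list out to the on-disk log."""
        self.on_disk_log.extend(self.in_memory)
        self.in_memory.clear()

    def commit_txg(self, txg):
        """The pool now holds everything up to txg, so those records are dropped."""
        self.committed_txg = txg
        self.in_memory = [r for r in self.in_memory if r[0] > txg]
        self.on_disk_log = [r for r in self.on_disk_log if r[0] > txg]

    def crash_and_replay(self):
        """Lose memory, then return the logged calls that never got committed."""
        self.in_memory.clear()
        return [op for txg, op in self.on_disk_log if txg > self.committed_txg]


log = ToyIntentLog()
log.record(1, "mkdir a")
log.record(1, "write f1", sync=True)  # O_DSYNC-style: pushed out straight away
log.commit_txg(1)                     # transaction group 1 reaches the pool
log.record(2, "write f2")
log.fsync()                           # later fsync() pushes the list out
log.record(2, "write f3")             # never synced, so only in memory
replayed = log.crash_and_replay()     # only "write f2" needs replaying
```

Note that "write f3" is lost in the crash, which is fine: the application never asked for it to be synchronous, so no promise was broken.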

The on-disk log consists of a header and log blocks. Inside the log blocks are records containing the system call transactions. Log blocks are chained together dynamically. There is no fixed location for the blocks; they are just grabbed and released as needed from the pool. This dynamic allocation makes it easy to change the log block size according to demand. When a transaction group commits, all log blocks containing those transactions are released. On replay we determine the validity of the next block in the chain using a strong checksum and further checking of the log record fields. Here's a diagram of the on-disk format:
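Alongside the diagram, the replay walk over the chain can be sketched in code. This is a toy under stated assumptions: the block layout (a dict with `seq`, `records`, `next` and `cksum`), the use of SHA-256, and the sequence-number check standing in for "further checking" are all my own choices for illustration, not the real on-disk format.

```python
import hashlib

def block_checksum(seq, records, next_addr):
    """Checksum over everything in the block (SHA-256 is an assumption here)."""
    data = repr((seq, records, next_addr)).encode()
    return hashlib.sha256(data).hexdigest()

def make_block(seq, records, next_addr):
    """Build a log block that carries its own checksum and a pointer onwards."""
    records = list(records)
    return {"seq": seq, "records": records, "next": next_addr,
            "cksum": block_checksum(seq, records, next_addr)}

def replay(header, pool):
    """Follow the chain from the header, stopping at the first invalid block."""
    found = []
    addr, expected_seq = header["first"], 1
    while addr is not None and addr in pool:
        blk = pool[addr]
        valid = blk["cksum"] == block_checksum(blk["seq"], blk["records"], blk["next"])
        if not valid or blk["seq"] != expected_seq:
            break  # end of the valid chain: stale, torn or never-written block
        found.extend(blk["records"])
        addr, expected_seq = blk["next"], expected_seq + 1
    return found

# Blocks live at arbitrary pool addresses; only the pointers give them an order.
pool = {
    907: make_block(1, ["mkdir a"], 112),
    112: make_block(2, ["write f"], 530),
    530: make_block(3, ["rename f g"], None),
}
header = {"first": 907}
intact = replay(header, pool)

pool[112]["records"] = ["garbage"]  # simulate a block that was only partly written
after_damage = replay(header, pool)
```

Because the chain is validated block by block, replay needs no separate record of where the log ends: the first block that fails its check marks the end.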

So that's the ZIL in a nutshell. Its design was guided by previous years of frustration with UFS logging. We knew it had to be very fast, as synchronous semantics are critical to database performance. We certainly learnt from previous mistakes.

There are a whole bunch more intricate details I could talk about, but they can wait for another blog entry. You can also get details about the ZIL in Neel's blog. There's also more work to do. Still, it's very functional now and bloody fast!

Comments:

[Trackback] I have a work laptop, which I therefore use every day and on which I do all my work. For a while now I've been worried about backups. I run Ubuntu, and so I don't have nice things like zfs at my disposal; otherwise I would, for example, zfs send k...

Posted by Logic blog on October 09, 2007 at 06:24 AM PDT #

Hi Neil,
Good description of ZFS intent log. I do have a couple of questions here.
- What is the usual size of a transaction group?
- If the transaction group is bigger than a disk sector, how do you guarantee atomicity while writing/replaying the whole group?
- Since the intent log is part of file system, is copy-on-write done for these writes as well?
- What is the logical representation of intent log on the file system? Is it a file?
- Now that Sun has released Amber Road which probably uses SSDs for intent log, are these devices shared by various file systems?
Thanks,
Anand

Posted by Anand Vidwansa on November 18, 2008 at 11:38 PM PST #

This isn't a real Lumberjacks Blog :( Jk Any response yet whether the devices are shared by various file systems or what?

Thanks

Posted by Dan the Logger on February 20, 2009 at 12:33 AM PST #
