The ZFS Intent Log

A quick guide to the ZFS Intent Log (ZIL)

I am not a ZFS developer. However, I am interested in ZFS performance and intrigued by ZFS logging. I figure a good way to learn about something is to blog about it ;-). What follows are my notes as I made my way through the ZIL.


Most modern file systems include a logging feature to ensure faster writes and shorter crash-recovery (fsck) times. UFS has supported logging since Solaris 2.7 and enables it by default on Solaris 10. Our internal tests have shown that logging file systems perform as well as (and sometimes better than) non-logging file systems.

Logging in ZFS is implemented by the ZFS Intent Log (ZIL) module, which lives in the zil.c file. Here is a brief walk-through of the logging implementation in ZFS. All of this knowledge can be found in the zil.[c|h] files in the ZFS source code. I also recommend you check out Neil's blog -- he is one of the ZFS developers who works on the ZIL.

All file-system-related system calls are logged as transaction records by the ZIL. These transaction records contain sufficient information to replay them in the event of a system crash.

ZFS operations are always part of a DMU (Data Management Unit) transaction. When a DMU transaction is opened, a ZIL transaction is opened as well. The ZIL transaction is associated with the DMU transaction and, in most cases, discarded when the DMU transaction commits. These transactions accumulate in memory until an fsync or O_DSYNC write occurs, at which point they are committed to stable storage. Once a DMU transaction commits, its ZIL transactions are discarded (from memory or stable storage).

The ZIL consists of a zil header, zil blocks, and a zil trailer. The zil header points to a list of records. Each log record is a variable-sized structure whose format depends on the transaction type: a common structure of type lr_t followed by fields that are specific to each transaction type. Log records can reside either in memory or on disk; the on-disk format is described in zil.h. ZIL records are written to disk in variable-sized blocks. The minimum block size is defined as ZIL_MIN_BLKSZ and is currently 4096 (4K) bytes. The maximum block size is defined as ZIL_MAX_BLKSZ, which is equal to SPA_MAXBLOCKSIZE (128KB). The zil block size written to disk is either the size of all outstanding zil blocks (capped at ZIL_MAX_BLKSZ) or, if there are no outstanding ZIL transactions, the size of the last zil block that was committed.

ZIL and write(2)
The ZIL behaves differently depending on the size of the write. For small writes, the data is stored as part of the log record itself. For writes larger than zfs_immediate_write_sz (64KB), the ZIL does not store a copy of the data; instead it syncs the data to disk and stores only a pointer to the synced data in the log record. We can examine the write(2) system call on ZFS using DTrace.

230  -> zfs_write                                         21684
230    -> zfs_prefault_write                              28005
230    <- zfs_prefault_write                              35446
230    -> zfs_time_stamper                                69932
230      -> zfs_time_stamper_locked                       72893
230      <- zfs_time_stamper_locked                       74813
230    <- zfs_time_stamper                                76893
230    -> zfs_log_write                                   81054
230    <- zfs_log_write                                   89855
230  <- zfs_write                                         96257
230  <= write

As you can see, there is a log entry associated with every write(2) call. If the file was opened with the O_DSYNC flag, writes must be synchronous, and the ZIL has to commit the zil transaction to stable storage before returning. For non-synchronous writes, the ZIL holds the transaction in memory until the DMU transaction commits, or until an fsync or O_DSYNC write forces a commit.

zil.c walk-through

Several zil functions operate on zil records. What follows is a very brief description of each.

  • zil_create() creates a DMU transaction, allocates the first log block, and commits it.
  • zil_itx_create() creates a new zil (intent log) transaction.
  • zil_itx_assign() associates an intent log transaction with a DMU transaction.
  • zil_itx_clean() cleans up all in-memory log transactions. Clearing in-memory zil transactions implies that they are not flushed to disk. zil_itx_clean() is called via the zil_clean() function, which dispatches a work request to a dispatch thread.
  • zil_commit() commits zil transactions to stable storage.
  • zil_sync() cleans up (deletes) ZIL transactions once the DMU transactions they are assigned to have committed to disk (possibly as a result of an fsync). It is mostly called from txg_sync_thread every txg_time (5 seconds).


ZFS Mount

At file system mount time, ZFS checks whether there is an intent log. If there is, the system must have crashed (the ZIL is deleted at umount(2) time). The intent log is converted to a replay log and replayed to bring the file system back to a stable state. If both a replay log and an intent log are present, the system crashed while replaying the replay log; in that case it is safe to ignore/delete the replay log and replay the intent log.

ZIL Tunables
I am almost tempted to mention some tunables here, but the truth is that ZFS is intended not to require any tuning. ZFS should (and will) perform optimally "out of the box". You might find some switches in the code, but they are only for internal development and will be yanked out soon!

ZIL Performance
As you must have figured out by now, the ZIL is critical to the performance of synchronous writes. A common application that issues synchronous writes is a database, which means all of those writes run at the speed of the ZIL. The ZIL is already quite optimized, and I am sure ongoing efforts will optimize this code path even further. As Neil mentions, using NVRAM/solid state disks for the log would make it scream! I also recommend that you check out Roch's work on ZFS performance for details of other performance studies in progress.

DTrace scripts for use with the ZIL

  • To see ZIL activity
    • dtrace -n 'zil_*:entry{@[probefunc] = count();}' -n 'zio_free_blk:entry{@[probefunc] = count();}'
  • To see the block size of log writes
    • dtrace -n '::zil_lwb_commit:entry{@ = quantize(((lwb_t *)args[2])->lwb_sz);}'


Congratulations to the ZFS team for delivering such a world-class product. You folks rock!



Hello, could you briefly describe the difference between writing to the intent log and writing to the ZFS file system itself? As I understand it, ZFS writes to the ZIL (on disk) instead of to the file system (also on disk), so I think the objective is to avoid disk seeks, and maybe other things that I cannot see. Thanks very much for this great article!

Posted by Marcelo Leal on July 30, 2007 at 11:25 PM PDT #

Hi. If I understand the behaviour of ZFS correctly (please correct me if I'm wrong!), when an O_DSYNC write or fsync() occurs the log is written to stable storage. In the case of a typical Oracle 8K write, the data is written to stable storage as part of the log record. This means that the data must first be read from disk and then written back again when the next DMU commit occurs. In this scenario, should the ZIL log reads be fulfilled from the ARC cache in most cases?

Posted by Duncan Rutland on May 20, 2008 at 06:22 PM PDT #

The zil reads are fulfilled from the ARC in all cases. If you are concerned about the "double" write, you should note that multiple O_DSYNC writes could be part of one log write, and that the sync later generates mostly sequential writes. The current pain point [that is being solved] is the contention around zil_commit when multiple threads are doing O_DSYNC writes.
"6699227 zil train mode" will help us there.

Posted by realneel on May 20, 2008 at 09:40 PM PDT #

Our Oracle database on ZFS has been very slow from time to time. Oracle opens its datafiles with O_DSYNC, which I don't think I can disable, and I don't know how. Benchmark testing shows that O_DSYNC 8K block writes are much slower than OS-buffered writes without O_DSYNC. Would you please advise tuning parameters or methods?

Setting set zfs:zfs_nocacheflush = 1 in /etc/system improves things a little, but not very much.

Posted by James on September 11, 2008 at 03:18 PM PDT #

Unfortunately, this and other articles (such as one which incorrectly states 'Note, Oracle does not require separate certification for Operating System features such as ZFS, so don't worry about it and go looking on metalink.') fail to mention that ZFS is NOT certified or supported with Oracle databases or Oracle Application Server. It says so on Metalink (notes 403202.1 and 730691.1).

Posted by Steve on September 15, 2008 at 09:29 PM PDT #

Oracle is nowhere mentioned in this blog post. What's your point?

Posted by neel on September 16, 2008 at 12:48 AM PDT #

Our issue is that O_DSYNC 8K block writes, which Oracle issues, are very slow on ZFS. Neelakanth Nadgir's article explains the ZIL, which I think could be the cause of the slow writes. I was wondering whether using a separate log would speed things up.

Posted by James on September 16, 2008 at 10:32 AM PDT #

James: What version of Solaris are you using? There have been ongoing improvements in O_DSYNC write latency. If you can use it, I would highly recommend the separate intent log feature of ZFS. This will isolate the log writes and give you the lowest latency. Be aware that you can currently only add a slog device; removing it is hard!
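For reference, a separate intent log is attached with zpool add; the pool and device names below are illustrative, substitute your own:

```shell
# Attach a dedicated log (slog) device to an existing pool.
zpool add tank log c4t0d0

# Verify that the log vdev shows up under the pool.
zpool status tank
```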

Posted by Neelakanth Nadgir on September 16, 2008 at 10:44 AM PDT #
