Today's post is a bit long, but I promise the whole thing is worth
reading even if you're only here for the ZFS parts. Also,
apologies in advance for the wacky fonts - the defaults don't seem to
work quite right.
FREENIX is coming - submit! I'm especially encouraging all of
you userland/application types to submit papers. FREENIX is intended
to publish work on open source projects that isn't being published
anywhere else, and while most kernel programmers think that the most
trivial snippet of code is worth publication, userland programmers can
easily write wildly popular million-lines-of-code systems with nary a
thought about publishing. The deadline for submissions is October 22.
Cool systems paper: WAFL
The cool systems paper for today is File System Design
for an NFS File Server Appliance, by Dave Hitz, et al. This paper
describes the WAFL file system (WAFL stands for Write Anywhere File
Layout), used internally by NetApp filers. In my opinion, this paper
describes the most significant improvement in file system design since
the original FFS in 1984. The basic idea behind WAFL is that all
data is part of a tree of blocks, each pointing to the block below
it. All updates to the file system are copy-on-write - each block is
written to a new location when it is modified. This allows easy
transactional updates to the file system.
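To make the copy-on-write idea concrete, here's a toy sketch in Python - my own illustration, not WAFL (or ZFS) code. Changing a leaf block allocates a fresh copy of the leaf and of every block on the path up to the root; the old tree stays intact, and the change only becomes visible when the single root pointer is switched, which is what makes the update transactional.

```python
# Toy model of a copy-on-write block tree (illustration only, not WAFL/ZFS code).
# Blocks are never overwritten in place; modifying a leaf allocates new copies
# of the leaf and of every indirect block on the path to the root.  A crash
# before the root pointer is switched leaves the old, consistent tree intact.

class Block:
    def __init__(self, data=None, children=None):
        self.data = data            # payload for data (leaf) blocks
        self.children = children    # child Blocks for indirect blocks

def cow_write(root, path, new_data):
    """Return a new root reflecting the write; the old tree is untouched."""
    if not path:                              # reached the leaf: write a new copy
        return Block(data=new_data)
    index = path[0]
    new_children = list(root.children)        # copy the indirect block
    new_children[index] = cow_write(root.children[index], path[1:], new_data)
    return Block(children=new_children)

# A tiny two-level tree: root -> [leaf 0, leaf 1]
old_root = Block(children=[Block(data="A"), Block(data="B")])

# The "transaction": build a new tree with leaf 1 changed, then commit by
# switching the root pointer in one step.
new_root = cow_write(old_root, [1], "B'")

print(old_root.children[1].data)   # B   - old state, still fully consistent
print(new_root.children[1].data)   # B'  - new state after the "commit"
```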
The WAFL paper is a prime example of the kind of paper I'd like to see
published more often (N.B. It was published in the Winter 1994 USENIX
conference). From an academic standpoint, the paper is unacceptable
due to style and format. From the standpoint of great new ideas, full
implementation, and advancing the state of the art in practical file
system design and implementation, it's a gem. Unfortunately, some
people conclude that because NetApp filers use NVRAM to get acceptable
(nay, excellent!) performance while holding to the NFS standard, the
design ideas behind WAFL aren't useful for general purpose UNIX file
systems. I say they're wrong - but read the paper and form your own
opinion. The ZFS team thinks that a copy-on-write, transactionally
updated general purpose UNIX file system is not only feasible but an
excellent idea - which is why we wrote ZFS.
Matt Ahrens, one of the
primary architects and implementors of ZFS, has stepped up to the plate
and written about ZFS on his own blog. Read Matt's
blog entry introducing ZFS for a gentle introduction. Today's
cool systems paper will also help you understand ZFS, since the
basic philosophy behind some parts of WAFL and ZFS is similar. I'll
add to what Matt has written and answer some of the most common
questions people have asked me about ZFS.
Q. ZFS is just another dumb local file system. What is new about that?
A lot of things! I'll try to hit the high points.
- No more partitions. Many ZFS file systems share a common
storage pool. The analogy is that ZFS makes the
sysadmin-to-storage relationship like the programmer-to-memory
relationship - you don't care where the storage is coming from, you
just want storage. (Obviously, some administrators do care - and ZFS
allows you to create many different storage pools if you need
different kinds of storage for different file systems.) The
administrator may add and remove disks from the storage pool with
simple one-line commands, without worrying about which file systems
are using the disks. The era of manual grow-and-shrink of file
systems is over; ZFS automatically grows and shrinks the space used by
a file system in the storage pool.
- The on-disk state is always valid. No more fsck, no need to
replay a journal. We use a copy-on-write, transactional update system
to correctly and safely update data on disk. There is no window where
the on-disk state can be corrupted by a power cycle or system panic.
This means that you are much less likely to lose data than on a file
system that uses fsck or journal replay to repair on-disk data
corruption after a crash. Yes, supposedly fsck-free file systems
already exist - but then explain to me all the time I've spent waiting
for fsck to finish on ext3 or logging UFS - and the resulting file
system corruption.
- All data is checksummed. Even the checksums are checksummed!
Between our checksums and copy-on-write, transactional updates, the
only way the on-disk data can be accidentally corrupted without being
detected is a collision in our 64-bit checksums (or a software bug -
always a possibility, in any file system). With most other file
systems, the on-disk data can be silently corrupted by any number of
causes - hardware failure, disk driver bugs, a panic at the wrong
time, a direct overwrite of the disk by the administrator... (A toy
sketch of how checksummed block pointers catch this appears at the
end of this answer.)
These are only the top three features of ZFS. ZFS has a million nifty
little features - compression, self-healing data, multiple (and
automatically selected) block sizes, unlimited constant-time snapshots
- but these are the biggies.
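Since the checksumming is the feature people are most skeptical about, here's a toy sketch of the idea - my own illustration, not the real ZFS on-disk format, and SHA-256 is standing in for whatever checksum function ZFS actually uses. Every block pointer stores the checksum of the block it points to, and indirect blocks (which are just arrays of pointers) get their own checksummed pointers one level up, so the checksums really are checksummed.

```python
# Toy model of checksummed block pointers (illustration only; the real ZFS
# on-disk format and checksum functions differ - SHA-256 is just a stand-in).
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class BlockPtr:
    """A pointer that carries the checksum of the block it references."""
    def __init__(self, block: bytes):
        self.block = block
        self.cksum = checksum(block)

    def read(self) -> bytes:
        if checksum(self.block) != self.cksum:
            raise IOError("checksum mismatch: silent corruption detected")
        return self.block

# Data blocks, each referenced through a checksummed pointer.
ptrs = [BlockPtr(b"data block 0"), BlockPtr(b"data block 1")]

# An indirect block is just the pointers (checksums included) laid out as
# bytes; it gets its own checksummed pointer, so checksums are checksummed.
indirect_block = "".join(p.cksum for p in ptrs).encode()
root_ptr = BlockPtr(indirect_block)

root_ptr.read()                    # verifies the indirect block
ptrs[0].block = b"flipped bits!"   # simulate silent corruption on disk
try:
    ptrs[0].read()                 # caught here instead of returning garbage
except IOError as err:
    print(err)
```

With the classic design, that last read would have quietly handed garbage back to the application.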
Q. Why isn't ZFS a clustered/multi-node/distributed file
system? Isn't the local file system problem solved?
Speaking from around 8 years of system administration experience, I
can say that the local file system problem has most emphatically not
been solved! Whenever someone asks me this question, I have to wonder
if they ever ran out of space on a partition (especially frustrating
on a disk with a lot of free space in other partitions), damaged their
root file system beyond repair by tripping on the power cable,
attempted to use any volume manager at all, spent a weekend (only a
weekend if you are lucky) upgrading disks on a file server, tried to
grow or shrink a file system, typed the wrong thing in /etc/fstab, ran
into silent data corruption, or waited for fsck to finish on their
supposedly journaling (and fsck-free) file system. Between me and two
or three of my closest friends (none of whom are even sysadmins), we
have run into all of these problems within the last year, on state of
the art file systems - ext3, VxFS, logging UFS, you name it. As far
as I can tell, most people have simply become accustomed to the
inordinate amount of pain involved in administering file systems.
We're here to say that file systems don't have to be complex, fragile,
labor-intensive, and frustrating to use. Creating a decent local file
system turned out to be more than a big enough problem to solve all
by itself; we'll leave designing a distributed file system for another day.
Q. What is ZFS's performance? Can I see some ZFS benchmarks?
ZFS is still under development, and the benchmark numbers change from
day to day. Any benchmark results published now would be only a random
snapshot in time of a wildly varying function and not particularly
useful for deciding whether to use the released product. However,
we can tell you that performance is secondary only to correctness for
the development team and we are evaluating ZFS in comparison with many
different file systems on Solaris and Linux. We can also tell you
about some of the architectural features of ZFS that will help make
ZFS performance scream.
- Every file system can use the I/O bandwidth of every disk in the
storage pool - and does, automatically. We call this dynamic
striping. Unlike static striping, it requires no extra configuration
or stripe rebuilding or the like - ZFS simply allocates a block for a
particular file system from whichever disk seems best (based on
available space, outstanding I/Os, or what have you). A toy sketch of
this idea appears at the end of this answer.
- Copy-on-write of all blocks allows much smarter write allocation
and aggregation. For example, random writes to a database can be
aggregated into one sequential write.
- Automatic selection of block size on a per-file basis allows us
to choose the most efficient I/O size while making reasonable use of
disk space.
- Integration of the volume manager and file system improves
performance by giving more information than "write this block here and
that block there" - the classic block device interface. Greg Ganger
has a tech
report discussing the problems of the narrow block device
interface at the boundary between the storage device and the operating
system; existing volume managers only compound the problem by
introducing a second, identical block device interface where the
information gets lost all over again. In ZFS, the "volume manager" -
really, the storage pool manager - has all the information about,
e.g., dependencies between writes, and uses that information to
allocate and schedule I/O much more intelligently.
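As promised, here's a toy sketch of dynamic striping - my own simplification, not the actual ZFS allocator, whose policy and inputs are certainly more sophisticated. The point is only that each block allocation independently picks whichever disk currently looks best, so the writes of every file system in the pool spread across all the disks with no stripe map to configure or rebuild.

```python
# Toy sketch of dynamic striping (my simplification, not the ZFS allocator).
# Every block allocation independently picks the disk that currently looks
# best - here scored by queue depth, then free space - so writes from any
# file system spread across all disks in the pool with no fixed stripe map
# to configure or rebuild.

class Disk:
    def __init__(self, name, free_blocks):
        self.name = name
        self.free_blocks = free_blocks
        self.outstanding_ios = 0

def allocate_block(pool):
    # Prefer disks with fewer queued I/Os, then more free space.
    best = max(pool, key=lambda d: (-d.outstanding_ios, d.free_blocks))
    best.free_blocks -= 1
    best.outstanding_ios += 1
    return best.name

pool = [Disk("disk0", 1000), Disk("disk1", 800), Disk("disk2", 1200)]

# Six block allocations land wherever looks best at that moment.
print([allocate_block(pool) for _ in range(6)])
# ['disk2', 'disk0', 'disk1', 'disk2', 'disk0', 'disk1']
```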
Q. Isn't copy-on-write of every block awfully expensive? The
changes in block pointers will ripple up through the indirect blocks,
causing many blocks to be rewritten when you change just one byte of data.
If we only wrote out one set of changes at a time, it would be very
slow! Instead, we aggregate many writes together, and then write out
the changes to disk together (and very carefully allocate and schedule
them). This way the cost of rewriting indirect blocks will be
amortized over many writes.
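To put some made-up numbers on the amortization argument (every figure below is an assumption for illustration, not measured ZFS behavior): suppose each data block sits under three levels of indirect blocks, and a single batch writes out 1024 dirty data blocks that between them touch only about 40 distinct indirect blocks.

```python
# Back-of-the-envelope arithmetic; every number here is an assumption for
# illustration, not a measurement of ZFS.
depth = 3                # indirect blocks above each data block (assumed)
writes = 1024            # dirty data blocks written in one batch (assumed)
shared_indirects = 40    # distinct indirect blocks the batch rewrites (assumed)

unbatched = writes * (1 + depth)       # each write pays for its whole path
batched = writes + shared_indirects    # the batch shares the indirect rewrites

print(unbatched)           # 4096 blocks written
print(batched)             # 1064 blocks written
print(batched / writes)    # ~1.04 blocks per user write - the ripple is amortized
```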
Feel free to ask more questions in the comments; I'll do my best to
answer them.