Dynamics of ZFS
ZFS has a number of identified components that govern its
performance. We review the major ones here.
A volume manager is a layer of software that groups a set of block
devices in order to implement some form of data protection
and/or aggregation of devices, exporting the collection as a
storage volume that behaves as a simple block device.
A filesystem is a layer that manages such a block device using a
subset of system memory in order to provide filesystem operations
(including POSIX semantics) to applications and to provide a
hierarchical namespace for storage: files. Applications issue
reads and writes to the filesystem, and the filesystem issues input
and output (I/O) operations to the storage/block device.
ZFS implements both of these functions at once. It thus typically manages
sets of block devices (leaf vdevs), possibly grouping them into
protected devices (RAID-Z or N-way mirrors) and aggregating those
top-level vdevs into a pool. Top-level vdevs can be added to a pool
at any time. Objects stored in a pool are
dynamically striped across the available vdevs.
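As a rough illustration of dynamic striping, here is a toy Python sketch (a hypothetical model, not ZFS code, and real ZFS biases allocation by free space rather than using strict round-robin): each new block lands on some top-level vdev, and a vdev added later is used immediately for new writes.

```python
# Toy sketch of dynamic striping across top-level vdevs: each new
# block goes to the next vdev in round-robin order, so a newly added
# top-level vdev starts receiving blocks right away.
class Pool:
    def __init__(self, vdevs):
        self.vdevs = list(vdevs)      # top-level vdevs (e.g. mirrors)
        self.next = 0

    def add_vdev(self, vdev):
        # Top-level vdevs can be added to a pool at any time.
        self.vdevs.append(vdev)

    def allocate(self, block):
        # Dynamically stripe: pick a vdev for this block.
        vdev = self.vdevs[self.next % len(self.vdevs)]
        self.next += 1
        vdev.append(block)

pool = Pool([[], []])            # pool with two top-level vdevs
for b in range(4):
    pool.allocate(b)
pool.add_vdev([])                # grow the pool online
for b in range(4, 10):
    pool.allocate(b)             # all three vdevs now receive blocks
```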
Along with pools, ZFS manages a number of very lightweight
filesystem objects. A ZFS filesystem is basically just a set of properties
associated with a given mount point. The properties of a filesystem
include a quota (maximum size) and a reservation
(guaranteed size) as well as, for example, whether or not to
compress file data when storing blocks. The filesystem is
characterized as lightweight because it is not statically associated
with any physical disk blocks, and any of its settable properties can
simply be changed dynamically.
recordsize is one of those properties of a given ZFS filesystem
instance. ZFS files smaller than the recordsize are stored using
a single filesystem block (FSB) of variable length, in multiples of a
disk sector (512 bytes). Larger files are stored using multiple FSBs,
each of recordsize bytes, with a default value of 128K.
The FSB is the basic file unit managed by ZFS and the unit to which a checksum
is applied. After a file grows to be larger than the recordsize (and
so is stored using multiple FSBs), changing the filesystem's
recordsize property will not affect that file; a copy
of the file, however, will inherit the tuned recordsize value. An FSB can
be mirrored onto a vdev or spread across a RAID-Z device.
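The sizing rule above can be sketched in a few lines of Python (a model of the rule as described, not ZFS code; it ignores compression and metadata):

```python
SECTOR = 512

def fsb_layout(file_size, recordsize=128 * 1024):
    """Sizes of the filesystem blocks (FSBs) backing a file.

    A file no larger than the recordsize uses a single
    variable-length FSB rounded up to a multiple of the 512-byte
    sector; larger files use multiple FSBs of recordsize bytes each.
    """
    if file_size <= recordsize:
        sectors = max(1, -(-file_size // SECTOR))   # ceiling division
        return [sectors * SECTOR]
    nblocks = -(-file_size // recordsize)
    return [recordsize] * nblocks

fsb_layout(1000)         # [1024]: one small variable-length block
fsb_layout(300 * 1024)   # three full 128K blocks
```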
recordsize is currently the only performance tunable of ZFS. The
default recordsize may lead to early storage saturation: for many
small updates (much smaller than 128K) to large files (bigger than 128K), the
default value can put extra strain on the physical storage or on the
data channel (such as a fibre channel link) connecting it to the host.
For those loads, if one notices a saturated I/O channel, then tuning
the recordsize to smaller values should be investigated.
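A back-of-the-envelope model shows why. Assuming an aligned partial-record update against an uncached record costs one whole-record read plus one whole-record write (a simplification that ignores caching and metadata), the channel traffic inflates as follows:

```python
import math

def channel_bytes(update_size, recordsize):
    # Records touched by one aligned update (sketch only).
    records = math.ceil(update_size / recordsize)
    # Each touched record is read (if uncached) and rewritten whole.
    return 2 * records * recordsize

# An 8K update against the default 128K recordsize moves 256K over
# the channel (32x inflation); with an 8K recordsize it moves 16K.
inflation_default = channel_bytes(8192, 128 * 1024) // 8192
inflation_tuned = channel_bytes(8192, 8 * 1024) // 8192
```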
The basic mode of operation for write operations that do not require
synchronous semantics (no O_DSYNC, fsync(), etc.) is that ZFS
absorbs the operation in a per-host system cache called the Adaptive
Replacement Cache (ARC). Since there is only one host system memory
but potentially multiple ZFS pools,
cached data from all pools is handled by a single ARC.
Every file modification (e.g. a write) is associated with a certain
transaction group (TXG). At a regular interval (default txg_time =
5 seconds) each TXG shuts down, and the pool then issues a sync
operation for that group. A TXG may also be shut down when the ARC
indicates that too much dirty memory is currently being
cached. As a TXG closes, a new one immediately opens, and file
modifications then associate with the new active TXG.
If the active TXG shuts down while a previous one is still in the
process of syncing data to storage, then applications will be
throttled until the running sync completes. In this situation we
are syncing a TXG, while TXG + 1 is closed (due to memory
limitations or the 5-second clock) and is waiting to sync itself;
applications are throttled waiting to write to TXG + 2. It takes
sustained saturation of the storage or a memory constraint
to throttle applications this way.
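The three-stage pipeline described above can be modelled with a small state machine (a toy sketch with hypothetical names, not ZFS code):

```python
# Toy model of the TXG pipeline: at most one TXG syncs to disk while
# the next closed TXG waits behind it; closing a third TXG while both
# slots are occupied throttles the writers.
class TxgPipeline:
    def __init__(self):
        self.syncing = None      # TXG currently being written to disk
        self.quiesced = None     # closed TXG waiting to sync
        self.open = 1            # TXG accepting new writes

    def close_open_txg(self):
        """Triggered by the 5s clock or ARC dirty-memory pressure."""
        if self.syncing is not None and self.quiesced is not None:
            return "writers throttled"       # pipeline full
        if self.syncing is None:
            self.syncing = self.open         # start syncing at once
        else:
            self.quiesced = self.open        # wait behind running sync
        self.open += 1
        return "ok"

    def sync_done(self):
        self.syncing, self.quiesced = self.quiesced, None

p = TxgPipeline()
events = [p.close_open_txg(),    # TXG 1 starts syncing
          p.close_open_txg(),    # TXG 2 queued behind it
          p.close_open_txg()]    # pipeline full: throttle
p.sync_done()                    # TXG 1 committed; TXG 2 now syncing
events.append(p.close_open_txg())
# events == ['ok', 'ok', 'writers throttled', 'ok']
```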
A sync of the storage pool involves sending all level-0 data
blocks to disk; when that is done, all level-1 indirect blocks, and so on, until
eventually all blocks representing the new state of the filesystem
have been committed. At that point the ueberblock is updated to point
to the new consistent state of the storage pool.
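The bottom-up commit order can be expressed directly (a sketch of the ordering rule only, with made-up block names):

```python
# Sync order sketch: level-0 data blocks first, then each level of
# indirect blocks, and the ueberblock only after everything else.
def sync_order(levels):
    """levels maps level number -> list of dirty blocks at that level."""
    order = []
    for level in sorted(levels):        # level 0 first
        order.extend(levels[level])
    order.append("ueberblock")          # committed last
    return order

order = sync_order({0: ["d0", "d1"], 1: ["i0"], 2: ["root"]})
# ['d0', 'd1', 'i0', 'root', 'ueberblock']
```

Because the ueberblock is written only after every block below it, the on-disk state is always either the old consistent tree or the new one.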
ZFS Intent Log (ZIL)
For file modifications that come with an immediate data integrity
constraint (O_DSYNC, fsync(), etc.), ZFS manages a per-filesystem
intent log, or ZIL. The ZIL marks each FS operation (say, a
write) with a log sequence number. When a synchronous command is
requested for an operation (such as an fsync), the ZIL outputs
blocks up to that sequence number. When the ZIL is in the process of
committing data, further commit operations wait for the
previous ones to complete. This allows the ZIL to aggregate
multiple small transactions into larger ones, thus performing the
commits using fewer, larger I/Os.
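The aggregation effect can be sketched as follows (a hypothetical model, not the ZIL implementation): one log flush covers every record up to the highest pending sequence number, so commits that were waiting behind it complete for free.

```python
# ZIL aggregation sketch: a single log I/O flushes all records up to
# the newest pending sequence number, satisfying many commits at once.
class Zil:
    def __init__(self):
        self.next_seq = 0
        self.flushed_seq = -1
        self.io_count = 0

    def log(self, op):
        self.next_seq += 1
        return self.next_seq - 1       # sequence number for this op

    def commit(self, seq):
        if seq > self.flushed_seq:
            self.io_count += 1                   # one larger log I/O
            self.flushed_seq = self.next_seq - 1  # covers everything

zil = Zil()
seqs = [zil.log(("write", i)) for i in range(100)]
for s in seqs:
    zil.commit(s)
# zil.io_count == 1: a single log I/O satisfied all 100 commits
```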
The ZIL works by issuing all the required I/Os and then flushing
the write caches, if those are enabled. This use of the disk write
cache does not artificially improve a disk's commit latency, because
ZFS ensures that data is physically committed to storage before returning. However,
the write cache allows a disk to hold multiple concurrent I/O
transactions, and this acts as a good substitute for drives that do
not implement tagged queuing.
The current state of the ZIL is such that if there is a lot of
pending data in a filesystem (written to the FS, but not yet output to
disk) and a process issues an fsync() for one of its files, then all
pending operations will have to be sent to disk before the
synchronous command can complete. This can lead to unexpected
performance characteristics. Code is under review.
I/O Scheduler and Priorities
ZFS keeps track of pending I/Os but issues only a certain number
(35 by default) to the disk controllers. This allows the
controllers to operate efficiently while never overflowing their
queues. By limiting the I/O queue size, service times of individual
disks are kept to reasonable values. When one I/O completes, the
I/O scheduler decides the next most important one to issue.
The priority scheme is time-based; so, for instance, an input I/O
servicing a read call will be prioritized over any regular output
I/O issued in the last ~0.5 seconds.
The fact that ZFS limits each leaf device's I/O queue to 35 is
one of the reasons why a zpool should be built using
vdevs that are individual disks, or at least volumes that map to a
small number of disks. Otherwise this self-imposed limit
could become an artificial performance throttle.
When a read cannot be serviced from the ARC cache, ZFS issues a
'prioritized' I/O for the data. So even if the storage is handling
a heavy output load, there are only 35 I/Os outstanding, all
with reasonable service times. As soon as one of the 35 I/Os completes,
the I/O scheduler issues the read I/O to the controller. This
ensures good service times for read operations in general.
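The time-based priority idea can be sketched with a priority queue (a toy model with hypothetical names: reads get a priority "deadline" 0.5 seconds earlier than writes submitted at the same time, so a waiting read jumps ahead of recently issued writes, while old writes keep their place):

```python
import heapq
import itertools

MAX_PENDING = 35        # per-device issue limit (the ZFS default)

class Scheduler:
    def __init__(self):
        self.queue = []                 # (deadline, tiebreak, io)
        self.counter = itertools.count()
        self.outstanding = 0

    def submit(self, io, now, is_read):
        # Reads are treated as if issued ~0.5s earlier than writes.
        deadline = now - 0.5 if is_read else now
        heapq.heappush(self.queue, (deadline, next(self.counter), io))

    def issue_next(self):
        if self.outstanding >= MAX_PENDING or not self.queue:
            return None                 # controller queue is full
        self.outstanding += 1
        return heapq.heappop(self.queue)[2]

    def complete(self):
        self.outstanding -= 1

s = Scheduler()
for i in range(40):                         # a heavy output backlog
    s.submit(("write", i), now=i / 100, is_read=False)
s.submit(("read", 0), now=0.41, is_read=True)
first = s.issue_next()   # the read beats writes from the last ~0.5s
```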
But to avoid starvation, when there is a long-standing backlog of
output I/Os, those eventually regain priority over the input
I/Os. ZIL synchronous I/Os have the same priority as synchronous reads.
The prefetch code, which allows ZFS to detect sequential or strided access to
a file and issue I/Os ahead of phase, is currently under review. To
quote the developer, "ZFS prefetching needs some love."
ZFS never overwrites live data on disk and always outputs full records
validated by a checksum. So in order to partially overwrite a
file record, ZFS first has to have the corresponding data in
memory. If the data is not yet cached, ZFS issues an input
I/O before allowing the write(2) to partially modify the file
record. With the data now in cache, further writes can target the
cached blocks. On output, ZFS checksums data before sending it to disk.
For a full record overwrite, the input phase is not necessary.
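This read-modify-write rule can be sketched as follows (a hypothetical model, not ZFS code): the first uncached partial write to a record costs one input I/O, later writes to the same record are free, and a full-record overwrite never needs the read.

```python
RECORDSIZE = 128 * 1024

# Sketch of partial-record updates under copy-on-write: overwriting
# part of a record requires the whole record in memory first.
class RecordCache:
    def __init__(self):
        self.cache = {}
        self.reads_issued = 0

    def _load(self, rec):
        if rec not in self.cache:
            self.reads_issued += 1              # input I/O from disk
            self.cache[rec] = bytearray(RECORDSIZE)

    def write(self, rec, offset, data, full_record=False):
        if full_record:
            self.cache[rec] = bytearray(data)   # no input phase
            return
        self._load(rec)                         # read-modify-write
        self.cache[rec][offset:offset + len(data)] = data

fs = RecordCache()
fs.write(0, 100, b"x" * 10)           # uncached partial write: 1 read
fs.write(0, 200, b"y" * 10)           # record now cached: no read
fs.write(1, 0, b"z" * RECORDSIZE, full_record=True)   # no read at all
```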
Simple write calls (without O_DSYNC) are normally absorbed by the ARC
cache and so proceed very quickly. A sustained dd(1)-like load can
quickly overrun a large amount of system memory and cause transaction
groups to eventually throttle all applications for long periods
(tens of seconds). This is probably what underlies the
notion that ZFS needs more RAM (it does not). Write throttling code
is under review.
Soft Track Buffer
Input I/O is serious business. While a filesystem can decide
where to write data out on disk, inputs are requested by
applications. This means a head seek to the location of the
data is necessary. The time to issue a small read is totally dominated by
this seek. So ZFS takes the stance that it might as well
amortize those operations, and so, for uncached reads,
ZFS normally issues a fairly large input I/O (64K by
default). This helps loads that input data using an
access pattern similar to the output phase. The data goes into a
per-device cache holding 20MB.
This cache can be invaluable in reducing the I/Os necessary to read in
data. But just as with the recordsize, if the inflated I/Os cause
storage channel saturation, the Soft Track Buffer can act as a throttle.
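A minimal sketch of the idea (hypothetical model, not ZFS code): an uncached small read fetches a whole chunk-aligned 64K input into a small per-device LRU buffer, so neighbouring reads hit without touching the disk. Reads are assumed to fall within one chunk.

```python
from collections import OrderedDict

CHUNK = 64 * 1024              # default inflated read size
CAPACITY = 20 * 1024 * 1024    # per-device soft track buffer

class SoftTrackBuffer:
    def __init__(self):
        self.buf = OrderedDict()       # chunk-aligned offset -> data
        self.disk_reads = 0

    def read(self, offset, size):
        chunk = (offset // CHUNK) * CHUNK
        if chunk not in self.buf:
            self.disk_reads += 1               # one inflated input I/O
            self.buf[chunk] = b"\0" * CHUNK    # stand-in for disk data
            if len(self.buf) * CHUNK > CAPACITY:
                self.buf.popitem(last=False)   # evict oldest chunk
        self.buf.move_to_end(chunk)            # LRU bookkeeping
        return self.buf[chunk][offset - chunk:offset - chunk + size]

stb = SoftTrackBuffer()
first = stb.read(0, 2048)       # miss: one 64K disk read
second = stb.read(4096, 2048)   # hit: same chunk, no disk read
```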
The ARC Cache
The most interesting caching occurs at the ARC layer. The ARC manages
the memory used by blocks from all pools (each pool servicing
many filesystems). ARC stands for Adaptive Replacement Cache
and is inspired by a paper by
Megiddo and Modha presented at FAST '03.
The ARC manages its data keeping a notion of Most Frequently Used (MFU)
and Most Recently Used (MRU) blocks, balancing intelligently between the two.
One of its very interesting properties is that a large scan of a
file will not destroy most of the cached data.
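A much-simplified sketch conveys the scan resistance (the real ARC from Megiddo and Modha also keeps "ghost" lists and adapts the MRU/MFU split; this toy model does neither): blocks seen once live on an MRU list, blocks seen again are promoted to an MFU list, so a one-pass scan can only churn the MRU side.

```python
from collections import OrderedDict

class MiniArc:
    def __init__(self, capacity):
        self.capacity = capacity
        self.mru = OrderedDict()       # blocks seen exactly once
        self.mfu = OrderedDict()       # blocks seen more than once

    def access(self, block):
        if block in self.mfu:
            self.mfu.move_to_end(block)
        elif block in self.mru:
            del self.mru[block]
            self.mfu[block] = True             # promote on second hit
            if len(self.mfu) > self.capacity // 2:
                self.mfu.popitem(last=False)
        else:
            self.mru[block] = True
            if len(self.mru) > self.capacity // 2:
                self.mru.popitem(last=False)   # scans churn only MRU

    def cached(self, block):
        return block in self.mru or block in self.mfu

arc = MiniArc(capacity=8)
for b in ("a", "b", "a", "b"):
    arc.access(b)                  # 'a' and 'b' become frequently used
for b in range(1000):
    arc.access(("scan", b))        # large one-pass scan of a file
hot_survived = arc.cached("a") and arc.cached("b")   # scan didn't evict them
```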
On a system with free memory, the ARC will grow as it starts to cache
data. Under memory pressure the ARC will return some of its memory
to the kernel until the low-memory condition is relieved.
We note that while ZFS has behaved rather well under 'normal' memory
pressure, it does not appear to behave satisfactorily under a swap
shortage. The memory usage pattern of ZFS is very different from that of other
filesystems such as UFS, and so it exposes VM-layer issues in a number of
corner cases. For instance, a number of kernel
operations fail with ENOMEM without even attempting a reclaim
operation. If they did, ZFS would respond by releasing
some of its own buffers, allowing the initial operation to then succeed.
The fact that ZFS caches data in the kernel address space does mean that
the kernel size will be bigger than with traditional filesystems. For
heavy-duty usage it is recommended to use a 64-bit kernel, i.e. any
SPARC system or an AMD system configured in 64-bit mode. Some systems
that have managed in the past to run without any swap configured
should probably start to configure some.
The behavior of the ARC in response to memory pressure is under review.
A recent enhancement to ZFS has improved its CPU efficiency by a large
factor. We don't expect ZFS to deviate much from other filesystems in
terms of cycles per operation. ZFS checksums all disk blocks, but this
has not proven costly at all in terms of CPU consumption.
ZFS can be configured to compress on-disk blocks. We do expect to
see some extra CPU consumption from that compression. While it is
possible that compression could lead to some performance gain due to
the reduced I/O load, the emphasis of compression should be on saving
on-disk space, not on performance.
About Your Test?
This is what I know about the ZFS performance model today. My
comparison of different types of modelled workloads made last fall already had ZFS
ahead on many of them; we have improved the
biggest issues highlighted then, and there are further performance
improvements in the pipeline (based on the history of UFS, we know this will never
end). Best practices are being spelled out.
You can contribute by comparing your actual usage and workload pattern
with the simulated workloads. But nothing will beat having
reports from real workloads at this stage; your results are
therefore of great interest to us.
And watch this space for updates...