The Dynamics of ZFS
ZFS has a number of identified components that governs its performance. We review the major ones here.
A volume manager is a layer of software that groups a set of block devices in order to implement some form of data protection and/or aggregation of devices exporting the collection as a storage volumes that behaves as a simple block device.
A filesystem is a layer that will manage such a block device using a subset of system memory in order to provide Filesystem operations (including Posix semantics) to applications and provide a hierarchical namespace for storage - files. Applications issue reads and writes to the Filesystem and the Filesystem issues Input and Output (I/O) operations to the storage/block device.
ZFS implements those 2 functions at once. It thus typically manages sets of block devices (leaf vdev), possibly grouping them into protected devices (RAID-Z or N-way mirror) and aggregating those top level vdevs into pool. Top level vdevs can be added to a pool at any time. Objects that are stored onto a pool will be dynamically striped onto the available vdevs.
Associated with pools, ZFS manages a number of very lightweight filesystem objects. A ZFS filesystem is basically just a set of properties associated with a given mount point. Properties of a filesystem includes the quota (maximum size) and reservation (guaranteed size) as well as, for example, whether or not to compress file data when storing blocks. The filesystem is characterized as lightweight because it does not statically associate with any physical disk blocks and any of its settable properties can be simply changed dynamically.
The recordsize is one of those properties of a given ZFS filesystem instance. ZFS files smaller than the recordsize are stored using a single filesystem block (FSB) of variable length in multiple of a disk sector (512 Bytes). Larger files are stored using multiple FSB, each of recordsize bytes, with default value of 128K.
The FSB is the basic file unit managed by ZFS and to which a checksum is applied. After a file grows to be larger than the recordsize (and gets to be stored with multiple FSB) changing the Filesystem's recordsize property will not impact the file in question. A copy of the file will inherit the tuned recordsize value. A FSB can be mirrored onto a vdev or spread to a RAID-Z device.
The recordsize is currently the only performance tunable of ZFS. The default recordsize may lead to early storage saturation: For many small updates (much smaller than 128K) to large files (bigger than 128K) the default value can cause an extra strain on the physical storage or on the data channel (such as a fiber channel) linking it to the host. For those loads, If one notices a saturated I/O channel then tuning the recordsize to smaller values should be investigated.
FACE="Times, serif">Transaction Groups
The basic mode of operation for writes operations that do not require synchronous semantics (no O_DSYNC, fsync(), etc), is that ZFS will absorb the operation in a per host system cache called Adaptive Replacement Cache (ARC). Since there is only one host system memory but potentially multiple ZFS pools, cached data from all pools is handled by a unique ARC.
Each file modification (e.g. a write) is associated with a certain transaction group (TXG). At regular interval (default of txg_time = 5 seconds) each TXG will shut down and the pool will issue a sync operation for that group. A TXG may also be shut down when the ARC indicates that there is too much dirty memory currently being cached. As a TXG closes, a new one immediately opens and file modifications then associate with the new active TXG.
If the active TXG shuts down while a previous one is still in the process of syncing data to the storage, then applications will be throttled until the running sync completes. In this situation where are sinking a TXG, while TXG + 1 is closed due to memory limitations or the 5 second clock and is waiting to sync itself; applications are throttled waiting to write to TXG + 2. We need sustained saturation of the storage or a memory constraint in order to throttle applications.
A sync of the Storage Pool will involve sending all level 0 data blocks to disk, when done, all level 1 indirect blocks, etc. until eventually all blocks representing the new state of the filesystem have been committed. At that point we update the ueberblock to point to the new consistent state of the storage pool.
FACE="Times, serif">ZFS Intent Log (ZIL)
For file modification that come with some immediate data integrity constraint (O_DSYNC, fsync etc.) ZFS manages a per-filesystem intent log or ZIL. The ZIL marks each FS operation (say a write) with a log sequence number. When a synchronous command is requested for the operation (such as an fsync), the ZIL will output blocks up to the sequence number. When the ZIL is in process of committing data, further commit operations will wait for the previous ones to complete. This allows the ZIL to aggregate multiple small transactions into larger ones thus performing commits using fewer larger I/Os.
The ZIL works by issuing all the required I/Os and then flushing the write caches if those are enabled. This use of disk write cache does not artificially improve a disk's commit latency because ZFS insures that data is physically committed to storage before returning. However the write cache allows a disk to hold multiple concurrent I/O transactions and this acts as a good substitute for drives that do not implement tag queues.
CAVEAT: The current state of the ZIL is such that if there is a lot of pending data in a Filesystem (written to the FS, not yet output to disk) and a process issues an fsync() for one of it's files, then all pending operations will have to be sent to disk before the synchronous command can complete. This can lead to unexpected performance characteristics. Code is under review.
FACE="Times, serif">I/O Scheduler and Priorities
ZFS keeps track of pending I/Os but only issues to disk controllers a certain number (35 by default). This allows the controllers to operate efficiently while never overflowing their queues. By limiting the I/O queue size, service times of individual disks are kept to reasonable values. When one I/O completes, the I/O scheduler then decides the next most important one to issue. The priority scheme is timed based; so for instance an Input I/O to service a read calls will be prioritize over any regular Output I/O issued in the last ~ 0.5 seconds.
The fact that ZFS will limit each leaf devices I/O queue to 35, is one of the reasons that suggests that zpool should be built using vdevs that are individual disks or at least volumes that map to small number of disks. Otherwise this self imposed limits could become an artificial performance throttle.
FACE="Times, serif">Read Syscalls
However to avoid starvation, when there is a long-standing backlog of Output I/Os then eventually those regain priority over the Input I/O. ZIL synchronous I/Os are of the same priority to synchronous reads.
The prefetch code allowing ZFS to detect sequential or strided access to a file and issue I/O ahead of phase is currently under review. To quote the developer "ZFS prefetching needs some love".
FACE="Times, serif">Write Syscalls
ZFS never overwrites live data on-disk and will always output full records validated by a checksum. So in order to partially overwrite a file record, ZFS first has to have the corresponding data in memory. If the data is not yet cached, ZFS will issue an input I/O before allowing the write(2) to partially modify the file record. With the data now in cache, more writes can target the blocks. On output ZFS will checksum data before sending to disk. For full record overwrite the input phase is not necessary.
CAVEAT: Simple write calls (not O_DSYNC) are normally absorbed by the ARC cache and so proceed very quickly. Such a sustained dd(1)-like load can quickly overrun a large amount of system memory and cause transaction groups to eventually throttle all applications for large amount of time (10s of seconds). This is probably what underwrites the notion that ZFS needs more RAM (it does not). Write throttling code is under review.
FACE="Times, serif">Soft Track Buffer
An input I/O is serious business. While a Filesystem can decide where to write stuff out on disk, the Inputs are requested by applications. This means a necessary head seek to the location of the data. The time to issue a small read will be totally dominated by this seek. So ZFS takes the stance that it might as well amortize those operations and so, for uncached reads, ZFS normally will issue a fairly large Input I/O (64K by default). This will help loads that input data using similar access pattern to the output phase. The data goes into a per device cache holding 20MB.
This cache can be invaluable in reducing the I/Os necessary to read-in data. But just like the recordsize, if the inflated I/O cause a storage channel saturation the Soft Track Buffer can act as a performance throttle.
FACE="Times, serif">The ARC Cache
The most interesting caching occurs at the ARC layer. The ARC manages the memory used by blocks from all pools (each pool servicing many filesystems). ARC stands for Adaptive Replacement Cache and is inspired by a paper of Megiddo/Modha presented at FAST'03 Usenix conference.
That ARC manages it's data keeping a notion of Most Frequently Used (MFU) and Most Recently Use (MRU) balancing intelligently between the two. One of it's very interesting properties is that a large scan of a file will not destroy most of the cached data.
On a system with Free Memory, the ARC will grow as it starts to cache data. Under memory pressure the ARC will return some of it's memory to the kernel until low memory conditions are relieved.
We note that while ZFS has behaved rather well under 'normal' memory pressure, it does not appear to behave satisfactorily under swap shortage. The memory usage pattern of ZFS is very different to other filesystems such as UFS and so exposes VM layer issues in a number of corner cases. For instance, a number of kernel operations fails with ENOMEM not even attempting a reclaim operation. If they did, then ZFS would be responding by releasing some of it's own buffers allowing the initial operation to then succeed.
The fact that ZFS caches data in the kernel address space does mean that the kernel size will be bigger than when using traditional filesystems. For heavy duty usage it is recommended to use a 64-bit kernel i.e. any Sparc system or an AMD configured in 64-bit mode. Some systems that have managed in the past to run without any swap configured should probably start to configure some.
The behavior of the ARC in response to memory pressure is under review.
FACE="Times, serif">CPU Consumption
Recent enhancement to ZFS has improved it's CPU efficiency by a large factor. We don't expect to deviate from other filesystems much in terms of cycles per operations. ZFS checksums all disk blocks but this has not proven to be costly at all in terms of CPU consumption.
ZFS can be configured to compress on-disk blocks. We do expect to see some extra CPU consumption from that compression. While it is possible that compression could lead to some performance gain due to reduced I/O load, the emphasis of compression should be to save on-disk space not performance.
What About Your Test ?
This is what I know about the ZFS performance model today. My performance comparison on different types of modelled workloads made last fall already had ZFS ahead on many of them; we have improved the biggest issues highlighted then and there are further performance improvements in the pipeline (based on UFS, we know this will never end). Best Practices are being spelled out.
You can contribute by comparing your actual usage and workload pattern with the simulated workloads. But nothing will beat having reports from real workloads at this stage; Your results are therefore of great interest to us. And watch this space for updates...