NFS and ZFS, a fine combination
By User13278091-Oracle on janv. 08, 2007
No doubt there is still a lot to learn about ZFS as an NFS server and this will not delve deeply into that topic. What I'd like to dispel here is the notion that ZFS can cause some NFS workloads to exhibit pathological performance characteristics.
Since there have been a few perceived 'sightings' of such slowdowns, a little clarification is in order. Large reported slowdowns would typically be reported when looking at a single threaded load, probably doing small file creation such as 'tar xf many_small_files.tar'.
For instance, I've run a small such test over a 72G SAS drive.
tmpfs : 0.077 sec ufs : 0.25 sec zfs : 0.12 sec nfs/zfs : 12 sec
There are a few things to observe here. Local filesystem services have a huge advantage for this type of load: in the absence of specific request by the application (e.g. tar), local filesystems can lose your data and noone will complain. This is data loss, not data corruption, and this generally accepted data loss will occur in the event of a system crash. The argument being that if you need a higher level of integrity, you need to program it in applications either using O_DSYNC, fsync etc. Many applications are not that critical and avoid such burden.
NFS and COMMIT
On the other hand, the nature of the NFS protocol is such that the client _must_ at some specific point request to the server to place previously sent data onto stable storage. This is done through an NFSv3 or NFSv4 COMMIT operation. The COMMIT operation is a contract between clients and servers that allows the client to forget about its previous historical interaction with the file. In the event of a server crash/reboot, the client is guaranteed that previously commited data will be returned by the server. Operations since the last COMMIT can be replayed after a server crash in a way that insures a coherent view between everybody involved.
But this all topples over if the COMMIT contract is not honored. If a local filesystem does not properly commit data when requested to do so, there is no more guarantee that the client's view of files will be what it would otherwise normally expect. Despite the fact that the client has completed the 'tar x' with no errors, it can happen that some of the files are missing in full or in parts.
With local filesystems, a system crash is plainly obvious to users and requires applications to be restarted. With NFS, a server crash in not obvious to users of the service (the only sign being a lengthy pause), and applications are not notified. The fact that files or parts of files may go missing in the absence of errors can be considered as plain corruption of the client's side view.
When the underlying filesystem serving NFS ignores COMMIT request, or when the storage subsystem acknowledge I/O before they reach stable storage, what is potential data loss on the server, becomes corruption of the client's point of view.
It turns out that in NFSv3/NFSv4 the client will request a COMMIT on close; Moreover, the NFS server itself is required to commit on meta-data operations; for NFSv3 that is on :
SETATTR, CREATE, MKDIR, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME, LINK
and a COMMIT maybe required on the containing directory.
Let's imagine we find a way to run our load at 1 COMMIT (on close) per extracted files. The COMMIT means the client must wait for at least a full I/O latency and since 'tar x' processes the tar file from a single thread, that implies that we can run our workload at the maximum rate (assuming infinitely fast networking) of one extracted file per I/O latency or about 200 extracted files per second (on modern disks). If the files to be extracted are 1K in average size, the tar x will proceed at a pace of 200K/sec. If we are required to issue 2 COMMIT operations per extracted files (for instance due to a server-side COMMIT on file create), that would further halves that throughput number.
However, If we had lots of threads extracting individual files concurrently the performance would scale up nicely with the number of threads.
But tar is single threaded, so what is actually going on here ? The need to COMMIT frequently means that our thread must frequently pause for a full server side I/O latency. Because our single threaded tar is blocked, nothing is able to process the rest of our workload. If we allow the server to ignore COMMIT operations, then NFS responses will we sent earlier allowing the single thread to proceed down the tar file at greater speed. One must realise that the extra performance is obtained at the risk of causing corruption from the client's point of view in the event of a crash.
Whether or not the client or the server needs to COMMIT as often as it does is a separate issue. The existence of other clients that would be accessing the files needs to be considered in that discussion. The point being made here is that this issue is not particular to ZFS, nor does ZFS necessarily exacerbate the problem. The performance of single threaded writes to NFS will be throttled as a result of the NFS-imposed COMMIT semantics.
ZFS Relevant Controls
ZFS has two controls that come into this picture. The disk write caches and the zil_disable tunable. ZFS is designed to work correctly whether or not the disk write caches are enabled. This is acheived through explicit cache flush requests, which are generated (for example) in response to an NFS COMMIT. Enabling the write caches is then a performance consideration, and can offer performance gains for some workloads. This is not the same with UFS which is not aware of the existence of a disk write cache and is not designed to operate with such cache enabled. Running UFS on a disk with write cache enabled can lead to corruption of the client's view in the event of a system crash.
ZFS also has the zil_disable control. ZFS is not designed to operate with zil_disable set to 1. Setting this variable (before mounting a ZFS filesystem) means that O_DSYNC writes, fsync as well as NFS COMMIT operations are all ignored! We note that, even without a ZIL, ZFS will always maintain a coherent local view of the on-disk state. But by ignoring NFS COMMIT operations, it will cause the client's view to become corrupted (as defined above).
Comparison with UFS
In the original complaint, there was no comparison between a semantically correct NFS service delivered by ZFS to another similar NFS service delivered by another filesystem. Let's gather some more data:
Local and memory based filesystems : tmpfs : 0.077 sec ufs : 0.25 sec zfs : 0.12 sec NFS service with risk of corruption of client's side view : nfs/ufs : 7 sec (write cache enable) nfs/zfs : 4.2 sec (write cache enable,zil_disable=1) nfs/zfs : 4.7 sec (write cache disable,zil_disable=1) Semantically correct NFS service : nfs/ufs : 17 sec (write cache disable) nfs/zfs : 12 sec (write cache disable,zil_disable=0) nfs/zfs : 7 sec (write cache enable,zil_disable=0)
We note that with most filesystems we can easily produce an improper NFS service by enabling the disk write caches. In this case, a server-side filesystem may think it has commited data to stable storage but the presence of an enabled disk write cache causes this assumption to be false. With ZFS, enabling the write caches is not sufficient to produce an improper service.
Disabling the ZIL (setting zil_disable to 1 using mdb and then mounting the filesystem) is one way to generate an improper NFS service. With the ZIL disabled, commit request are ignored with potential client's view corruption.
An different topic is about running ZFS on intelligent storage arrays. One known pathology is that some arrays will _honor_ the ZFS request to flush the write caches despite the fact that their caches are qualified as stable storage. In this case, NFS performance will be much much worst than otherwise expected. On this topic and ways to workaround this specific issue, see Jason's .Plan: Shenanigans with ZFS.
In many common circumstances, ZFS offers a fine NFS service that complies with all NFS semantics even with write caches enabled. If another filesystem appears much faster, I suggest first making sure that this other filesystem complies in the same way.
This is not to say that ZFS performance cannot be perfected as clearly it can. The performance of ZFS is still evolving quite rapidly. In many situations, ZFS provides the highest throughput of any filesystem. In others, ZFS performance is highly competitive with other filesystems. In some cases, ZFS can be slower than other filesystems -- while in all cases providing end-to-end data integrity, ease of use and integrated services such as compression, snapshots etc.
See Also Eric's fine entry on zil_disable