By User13278091-Oracle on nov. 04, 2008
I would certainly like to see every performance issue studied with a scientific approach. OpenSolaris and Dtrace are just incredible enablers when trying to reach root cause and finding those causes is really the best way to work toward delivering improved performance. More generally tough, people use common wisdom or possible faulty assumption to match their symptoms with that of other similar reported problems. And, as human nature has it, we'll easily blame the component we're least familiar with for problems. So we often end up with a lot of report of ZFS performance that once, drilled down, become either totally unrelated to ZFS (say HW problems) , or misconfiguration, departure from Best Practices or, at times, unrealistic expectations.
That does not mean, there are no issues. But it's important that users can more easily identify known issues, schedule for fixes, workarounds etc. So anyone deploying ZFS should really be familiar with those 2 sites : ZFS Best Practices and Evil Tuning Guide
That said, what are real commonly encountered performance problems I've seen and where do we stand ?
Writes overunning memory
That is a real problem that was fixed last March and is integrated in the Solaris U6 release. Running out of memory causes many different types of complaints and erratic system behavior. This can happen anytime a lot of data is created and streamed at rate greater than that which can be set into the pool. Solaris U6 will be an important shift for customers running into this issue. ZFS will still try to use memory to cache your data (a good thing) but the competition this creates for memory resources will be much reduced. The way ZFS is designed to deal with this contention (ARC shrinking) will need a new evaluation from the community. The lack of throttling was a great impairement to the ability of the ARC to give back memory under pressure. In the mean time lots of people are capping their arc size with success as per the Evil Tuning guide.
For more on this topic check out : The new ZFS write throttle
Cache flushes on SAN storage
This is a common issue we hit in the entreprise. Although it will cause ZFS to be totally underwhelming in terms of performance, it's interestingly not a sign of any defect in ZFS. Sadly this touches customers that are the most performance minded. The issue is somewhat related to ZFS and somewhat to the Storage. As is well documented elsewhere, ZFS will, at critical times, issue "cache flush" request to the storage elements on which is it layered. This is to take into account the fact that storage can be layered on top of _volatile_ caches that do need to be set on stable storage for ZFS to reach it's consistency points. Entreprise Storage Arrays do not use _volatile_ caches to store data and so should ignore the request from ZFS to "flush caches". The problem is that some arrays don't. This misunderstanding between ZFS and Storage Arrays leads to underwhelming performance. Fortunately we have an easy workaround that can be used to quickly identify if this is indeed the problem : setting zfs_nocacheflush (see evil tuning guide). The best workaround here is to configure the storage with the setting to indeed ignore "cache flush". And we also have the option of tuning sd.conf on a per array basis. Refer again to the evil tuning guide for more detailed information.
NFS slow over ZFS (Not True)
This is just not generally true and often a side effect of the previous Cache flush problem. People have used storage arrays to accelerate NFS for long time but failed to see the expected gains with ZFS. Many sighting of NFS problems are traced to this.
Other sightings involve common disks with volatile caches. Here the performance delta observed are rooted in the stronger semantics that ZFS offer to this operational model. See NFS and ZFS for a more detailed description of the issue.
While I don't consider ZFS as generally slow serving NFS, we did identify in recent months a condition that effects high thread count of synchronous writes (such as a DB). This issue is fixed in the Solaris 10 Update 6 (CR 6683293).
I would encourage you to be familiar to where we stand regarding ZFS and NFS because, I know of no big gapping ZFS over NFS problems (if there were one, I think I would know). People just need to be aware that NFS is a protocol need some type of accelaration (such as NVRAM) in order to deliver a user experience close to what a direct attach filesystem provides.
ZIL is a problem (Not True)
There is a wide perception that the ZIL is the source of performance problems. This is just a naive interpretation of the facts. The ZIL serves a very fundamental component of the filesystem and does that admirably well. Disabling the synchronous semantics of a filesystem will necessarely lead to higher performance in a way that is totally misleading to the outside observer. So while we are looking at further zil improvements for large scale problems, the ZIL is just not today the source of common problems. So please don't disable this unless you know what you're getting into.
Random read from Raid-Z
Raid-Z is a great technology that allows to store blocks on top of common JBOD storage without being subject to raid-5 write hole corruption (see : http://blogs.sun.com/bonwick/entry/raid_z). However the performance characteristics of raid-z departs significantly from raid-5 as to surprise first time users. Raid-Z as currently implemented spreads blocks to the full width of the raid group and creates extra IOPS during random reading. At lower loads, the latency of operations is not impacted but sustained random read loads can suffer. However, workloads that end up with frequent cache hits will not be subject to the same penalty as workloads that access vast amount of data more uniformly. This is where one truly needs to say, "it depends".
Interestingly, the same problem does not affect Raid-Z streaming performance and won't affect workloads that commonly benefit from caching. That said both random and streaming performance are perfectible and we are looking at a number different ways to improve on this situation. To better understand Raid-Z, see one of my very first ZFS entry on this topic : Raid-Z
CPU consumption, scalability and benchmarking
This is an area we will need to make more studies. With todays very capable multicore systems, there are many workloads that won't suffer from the CPU consumptions of ZFS. Most systems do not run at 100% cpu bound (being more generally constrained by disk, networks or application scalability) and the user visible latency of operations are not strongly impacted by extra cycles spent in say the ZFS checksumming.
However, this view breaks down when it comes to system benchmarking. Many benchmarks I encounter (the most crafted ones to boot) end up as host CPU efficiency benchmarks : How many Operations can I do on this system given large amount of disk and network resources while preserving some level X of response time. The answer to this question is purely the reverse of the cycles spent per operation.
This concern is more relevant when the CPU cycles spent in managing direct attach storage and filesystem is in direct competition with cycles spent in the application. This is also why database benchmarking is often associated with using raw device, a fact must less encountered in common deployment.
Root causing scalability limits and efficiency problems is just part of the never ending performance optimisation of filesystems.
Directio has been a great enabler of database performance in other filesystems. The problem for me is that Direct I/O is a group of improvements each with their own contribution to the end result. Some want the concurrent writes, some wants to avoid a copy, some wants to avoid double caching, some don't know but see performance gains when turned on (some also see a degradation). I note that concurrent writes has never been a problem in ZFS and that the extra copy used when managing a cache is generally cheap considering common DB rates of access. Acheiving greater CPU efficiency is certainly a valid goal and we need to look into what is impacting this in common DB workloads. In the mean time, ZFS in OpenSolaris got a new feature to manage the cachebility of Data in the ZFS ARC. The per filesystem "primarycache" property will allow users to decide if blocks should actually linger in the ARC cache or just be transient. This will allow DB deployed on ZFS to avoid any form of double caching that might have occured in the past.
ZFS Performance is and will be a moving target for some time in the future. Solaris 10 Update 6 with a new write throttle, will be a significant change and then Opensolaris offers additional advantages. But generally just be skeptical of any performance issue that is not root caused: the problem might not be where you expect it