Monday Nov. 10, 2008

Using ZFS as a Network Attached Storage Controller and the Value of Solid State Devices



Sun is launching today the Sun Storage 7000 line, systems that use ZFS as the integrated volume manager and filesystem, combined with both read- and write-optimized SSDs. What is this Hybrid Storage Pool, and why is it a good performance architecture for storage?

A write-optimized SSD is a device custom designed to accelerate the ZFS intent log (ZIL). The ZIL is the part of ZFS that handles synchronous operations: it acknowledges such writes to applications quickly while guaranteeing their persistence in case of an outage. Data stored in the ZIL is also kept in memory until ZFS issues the next transaction group (every few seconds).

The ZIL is what stores data urgently (when an application is waiting), but the transaction group (TXG) is what stores data permanently. The ZIL's on-disk blocks are only ever re-read after a failure such as a power outage. So the SSDs used to accelerate the ZIL are write-optimized: they need to accept data at low latency on writes; read performance is unimportant.
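
On a generic ZFS system, the equivalent of the appliance's write-optimized SSD is a separate intent-log ("log") device in the pool. Here is a minimal sketch, assuming bits recent enough to support log devices; the device names are hypothetical, and the appliance itself configures this through its own management interface:

	# Pool of regular disks plus a write-optimized SSD dedicated to the ZIL
	zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 log c5t0d0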

The TXG commit is asynchronous to applications: apps are generally not waiting for transaction groups to commit. The exception is when data is generated at a rate that exceeds the TXG sync rate for a sustained period; in that case we become throttled by the pool's throughput. In a NAS server this will rarely happen, since network connectivity, even at GB/s rates, delivers much less than what the storage is capable of, so the imbalance does not arise.

The important point is that in a NAS server the controller is also running a file-level protocol (NFS or CIFS), so it knows whether each requested write is synchronous or not. It can therefore use the accelerated path (the SSD) only for the component of the workload that needs it. Less competition for these devices means we can deliver both high throughput and low latency together in the same consolidated server.

But here is where it gets nifty. At times, a NAS server might receive a huge synchronous request. We've observed this, for instance, when fsflush runs on clients and turns many non-synchronous writes into one massive synchronous request. One way to reduce this effect is to lengthen the fsflush cycle (autoup, to say 600). This is commonly done to reduce the CPU usage of fsflush, but it is also welcome when clients interact with NAS storage. We can also disable page flushing entirely by setting dopageflush to 0. But that is a client-side issue; from the server's perspective, a NAS still needs to handle large commit requests.
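
For reference, a hedged sketch of those client-side settings; the values are illustrative, the lines go in /etc/system on the NFS client, and a reboot is needed for them to take effect:

	# Lengthen the fsflush cycle to 600 seconds (illustrative value)
	echo 'set autoup = 600' >> /etc/system
	# Or disable page flushing entirely
	echo 'set dopageflush = 0' >> /etc/system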

When subject to such a workload, say a 1 GB commit, ZFS, being fully aware of the situation, can decide to bypass the SSD device and issue requests straight to the disk-based pool blocks. It does so for two reasons. One is that the pool of disks in its entirety has more throughput capability than the few write-optimized SSDs, so the request is serviced faster. More importantly, the value of the SSD lies in latency reduction; keeping the SSDs available to service many low-latency synchronous writes is what matters here. Another way to say this is that large writes are generally well served by regular disk operations (they are throughput bound), whereas small synchronous writes (latency bound) can and will get help from the SSDs.

Caches at work

On the read path we also have custom-designed, read-optimized SSDs fitted to these OpenStorage platforms. At Sun, we believe that many workloads lend themselves naturally to caching technologies. In a consolidated storage solution, we can offer up to 128 GB of primary, memory-based caching and approximately 500 GB of SSD-based caching.

We also recognized that the latency gap between a memory-cached response and a disk response was just too steep. By inserting a layer of SSD between memory and disk, we get an intermediate step that provides lower-latency access than disk to a working set that is now many times larger than memory.
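
On plain OpenSolaris bits with L2ARC support, a read-optimized SSD is added to a pool as a "cache" device; a minimal sketch with a hypothetical device name:

	# Add a read-optimized SSD as a second-level (L2ARC) cache device
	zpool add tank cache c5t1d0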

It's important to understand how and when these read-optimized SSDs will work. The first thing to recognize is that the SSDs have to be primed with data: they feed off data being evicted from the primary caches, so their effect will not be seen immediately at the start of a benchmark. Second, the value of read-optimized SSDs truly lies in low-latency responses to small requests, meaning requests on the order of 8K in size. Such requests occur either when dealing with small files (~8K) or when dealing with larger files accessed by a fixed-record application, typically a database. For those applications it is customary to set the filesystem recordsize, and doing so allows these new SSDs to become more effective.

Our read-optimized SSDs can service up to 3000 read IOPS (see Brendan's work on the L2ARC), which is close to or better than what a JBOD of 24 7200 RPM disks can do. But the key point is that the low-latency response means an SSD can do so using far fewer threads than would be necessary to reach the same level on a JBOD. Brendan demonstrated that the response time of these devices can be 20 times faster than disks, and 8 to 10 times faster from the client's perspective. So once data is installed in the SSDs, users see their requests serviced much faster, which also means we are less likely to be subject to queuing delays.

The use of read-optimized SSDs is configurable in the Appliance. Users should learn to identify the parts of their datasets that end up gated by lightly threaded read response time. For those workloads, enabling the secondary cache is the way to deliver the value of the read-optimized SSDs. For those filesystems, if the workload consists of small files (around 8K) there is no need to tune anything; however, for large files accessed in small chunks, setting the filesystem recordsize to 8K is likely to produce the best response times.
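
Outside the Appliance interface, the same knobs exist as per-filesystem ZFS properties on bits recent enough to carry them; the dataset names below are hypothetical:

	# Let this filesystem's data populate the read-optimized SSDs
	zfs set secondarycache=all tank/home
	# Large files accessed in small chunks: match the record size to the
	# request size (affects newly written files only)
	zfs set recordsize=8k tank/db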

Another benefit of these SSDs is in $/IOPS. Some workloads are simply IOPS hungry while not necessarily consuming huge amounts of data. SSD technology offers great advantages in this space, where a single SSD can deliver the IOPS of a full JBOD at a fraction of the cost. So for workloads that are modestly sized but IOPS hungry, a test drive of the SSDs will be very interesting.

It's also important to recognize that these systems are used in consolidation scenarios. Some applications will be sped up by the read- or write-optimized SSDs, or by the large memory-based caches, while other consolidated workloads exercise other components.

There is another interesting implication of using SSDs in the storage with regard to clustering. The read-optimized SSDs, acting as caching layers, never contain critical data. This means those SSDs can go into the disk slots of the head nodes, since there is no data to fail over. Write-optimized SSDs, on the other hand, store data associated with critical synchronous writes; but since those are located in dual-ported backend enclosures, not in the head nodes, the storage head nodes do not have to exchange any user-level data during clustered operation.

So by using ZFS and read- and write-optimized SSDs, we can deliver low-latency writes for the applications that rely on them, and good throughput for both synchronous and non-synchronous cases, using cost-effective SATA drives. Similarly on the read side, the large primary and secondary caches deliver high IOPS at low latency (even if the workload is not highly threaded), and they do so using more cost- and energy-efficient SATA drives.

Our architecture allows us to take advantage of the latency accelerators while never being gated by them.

Tuesday Nov. 04, 2008

People ask: where are we with ZFS performance?

The standard answer to any computer performance question is almost always "it depends," which is semantically equivalent to "I don't know." The better answer is to state the dependencies.


I would certainly like to see every performance issue studied with a scientific approach. OpenSolaris and DTrace are incredible enablers when trying to reach root cause, and finding those causes is really the best way to work toward delivering improved performance. More generally, though, people use common wisdom or possibly faulty assumptions to match their symptoms with those of other reported problems. And, human nature being what it is, we easily blame the component we're least familiar with. So we often end up with a lot of reports of ZFS performance problems that, once drilled down, turn out to be totally unrelated to ZFS (say, hardware problems), or misconfigurations, departures from Best Practices or, at times, unrealistic expectations.


That does not mean there are no issues. But it's important that users can more easily identify known issues, scheduled fixes, workarounds, etc. So anyone deploying ZFS should really be familiar with these two sites: the ZFS Best Practices Guide and the Evil Tuning Guide.


That said, what are the real, commonly encountered performance problems I've seen, and where do we stand?


Writes overrunning memory


This is a real problem that was fixed last March; the fix is integrated in the Solaris 10 U6 release. Running out of memory causes many different types of complaints and erratic system behavior. It can happen any time a lot of data is created and streamed at a rate greater than what can be written into the pool. Solaris 10 U6 will be an important shift for customers running into this issue. ZFS will still try to use memory to cache your data (a good thing), but the competition this creates for memory resources will be much reduced. The way ZFS is designed to deal with this contention (ARC shrinking) will need a new evaluation from the community; the lack of write throttling was a great impairment to the ARC's ability to give back memory under pressure. In the meantime lots of people are capping their ARC size with success, as per the Evil Tuning Guide.
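
For reference, a sketch of that ARC cap as described in the Evil Tuning Guide; the 4 GB value is purely illustrative and the setting takes effect at the next boot:

	# Cap the ZFS ARC at 4 GB (illustrative value)
	echo 'set zfs:zfs_arc_max = 0x100000000' >> /etc/system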


For more on this topic check out: The new ZFS write throttle


Cache flushes on SAN storage


This is a common issue we hit in the enterprise. Although it will cause ZFS to be totally underwhelming in terms of performance, it's interestingly not a sign of any defect in ZFS. Sadly it touches the customers that are the most performance minded. The issue is somewhat related to ZFS and somewhat to the storage. As is well documented elsewhere, ZFS will, at critical times, issue "cache flush" requests to the storage elements on which it is layered. This takes into account the fact that storage can be layered on top of _volatile_ caches whose contents do need to be put on stable storage for ZFS to reach its consistency points. Enterprise storage arrays do not use volatile caches to store data and so should ignore the request from ZFS to "flush caches". The problem is that some arrays don't. This misunderstanding between ZFS and storage arrays leads to underwhelming performance. Fortunately we have an easy workaround that can be used to quickly identify whether this is indeed the problem: setting zfs_nocacheflush (see the Evil Tuning Guide). The best fix is to configure the storage array itself to ignore the "cache flush" requests; we also have the option of tuning sd.conf on a per-array basis. Refer again to the Evil Tuning Guide for more detailed information.
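
As a quick diagnostic, a hedged sketch of that zfs_nocacheflush test, assuming bits recent enough to carry the tunable; use it only to confirm the diagnosis, then fix the array or sd.conf configuration instead:

	# Diagnostic only: stop ZFS from issuing cache-flush requests
	echo 'set zfs:zfs_nocacheflush = 1' >> /etc/system   # takes effect at next boot
	echo zfs_nocacheflush/W0t1 | mdb -kw                 # live setting, reverts at reboot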


NFS slow over ZFS (Not True)


This is just not generally true and is often a side effect of the previous cache-flush problem. People have used storage arrays to accelerate NFS for a long time but failed to see the expected gains with ZFS; many sightings of NFS problems are traced to this.


Other sightings involve common disks with volatile write caches. Here the performance deltas observed are rooted in the stronger semantics that ZFS offers in this operational model. See NFS and ZFS for a more detailed description of the issue.


While I don't consider ZFS generally slow at serving NFS, we did identify in recent months a condition that affects high thread counts of synchronous writes (such as a database). This issue is fixed in Solaris 10 Update 6 (CR 6683293).


I would encourage you to become familiar with where we stand regarding ZFS and NFS, because I know of no big gaping ZFS-over-NFS problems (if there were one, I think I would know). People just need to be aware that NFS is a protocol that needs some type of acceleration (such as NVRAM) in order to deliver a user experience close to what a direct-attached filesystem provides.


ZIL is a problem (Not True)


There is a wide perception that the ZIL is the source of performance problems. This is just a naive interpretation of the facts. The ZIL serves a fundamental function of the filesystem and does it admirably well. Disabling the synchronous semantics of a filesystem will necessarily lead to higher performance, in a way that is totally misleading to the outside observer. So while we are looking at further ZIL improvements for large-scale problems, the ZIL is just not, today, the source of common problems. Please don't disable it unless you know what you're getting into.


Random read from Raid-Z


RAID-Z is a great technology that allows blocks to be stored on top of common JBOD storage without being subject to the RAID-5 write-hole corruption (see: http://blogs.sun.com/bonwick/entry/raid_z). However, the performance characteristics of RAID-Z depart significantly enough from RAID-5 to surprise first-time users. RAID-Z as currently implemented spreads each block across the full width of the RAID group, which creates extra IOPS during random reads. At low loads the latency of operations is not impacted, but sustained random-read loads can suffer. Workloads that end up with frequent cache hits will not be subject to the same penalty as workloads that access vast amounts of data more uniformly. This is where one truly needs to say, "it depends".


Interestingly, the same problem does not affect RAID-Z streaming performance, and it won't affect workloads that commonly benefit from caching. That said, both random and streaming performance can be improved, and we are looking at a number of different ways to do so. To better understand RAID-Z, see one of my very first ZFS entries on this topic: Raid-Z


CPU consumption, scalability and benchmarking


This is an area where we need to do more studies. With today's very capable multicore systems, many workloads won't suffer from the CPU consumption of ZFS. Most systems do not run 100% CPU bound (being more generally constrained by disks, networks or application scalability), and the user-visible latency of operations is not strongly impacted by extra cycles spent in, say, ZFS checksumming.


However, this view breaks down when it comes to system benchmarking. Many benchmarks I encounter (the most carefully crafted ones, to boot) end up as host CPU efficiency benchmarks: how many operations can I run on this system, given large amounts of disk and network resources, while preserving some level X of response time? The answer to this question is simply the inverse of the cycles spent per operation.


This concern is more relevant when the CPU cycles spent managing direct-attached storage and the filesystem are in direct competition with the cycles spent in the application. This is also why database benchmarking is often associated with raw devices, something much less common in real deployments.


Root-causing scalability limits and efficiency problems is just part of the never-ending performance optimization of filesystems.


Direct I/O


Direct I/O has been a great enabler of database performance on other filesystems. The problem for me is that Direct I/O is really a group of improvements, each with its own contribution to the end result. Some users want the concurrent writes, some want to avoid a copy, some want to avoid double caching, and some don't know but see performance gains when it is turned on (some also see a degradation). I note that concurrent writes have never been a problem in ZFS, and that the extra copy used when managing a cache is generally cheap considering common database rates of access. Achieving greater CPU efficiency is certainly a valid goal, and we need to look into what is impacting it in common database workloads. In the meantime, ZFS in OpenSolaris gained a new feature to manage the cacheability of data in the ZFS ARC: the per-filesystem "primarycache" property lets users decide whether blocks should linger in the ARC or just be transient. This allows databases deployed on ZFS to avoid any form of double caching that might have occurred in the past.
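
A hedged example of that property for a database filesystem (the dataset name is hypothetical):

	# Keep only metadata in the ARC for this filesystem and let the
	# database cache its own table data
	zfs set primarycache=metadata tank/oradata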


ZFS performance is, and will remain, a moving target for some time. Solaris 10 Update 6, with its new write throttle, will be a significant change, and OpenSolaris offers additional advantages. But in general, be skeptical of any performance issue that has not been root-caused: the problem might not be where you expect it.


Wednesday May 14, 2008

The new ZFS write throttle

A very significant improvement is coming soon to ZFS: a change that will increase the general quality of service it delivers. Interestingly, it's a change that might also slow down your microbenchmark, but it's nevertheless a change you should be eager for.


Write throttling


For a filesystem, write throttling designates the act of blocking applications for some amount of time, as short as possible, while waiting for the proper conditions to allow their write system calls to succeed. Write throttling is normally required because applications can write to memory (dirty memory pages) at a rate significantly faster than the kernel can flush the data to disk. Many workloads dirty memory pages by writing to the filesystem page cache at near memory-copy speed, possibly using multiple threads issuing high rates of filesystem writes. Concurrently, the filesystem is doing its best to drain all that data to the disk subsystem.


Given these constraints, the time to empty the filesystem cache to disk can be longer than the time applications take to dirty that cache. Even with storage backed by fast NVRAM, under sustained load the NVRAM eventually fills up to the point where it must wait for slow disk I/Os to make room for more incoming data.


When data is committed to a filesystem in bursts, it can be quite desirable to push the data at memory speed and then drain the cache to disk during the lulls between bursts. But when data is generated at a sustained high rate, lack of throttling leads to total memory depletion. At some point we therefore need to match the application's data rate to that of the I/O subsystem. This is the primary goal of write throttling.


A secondary goal of write throttling is to limit massive data loss. When applications do not manage I/O synchronization (i.e. don't use O_DSYNC or fsync), data ends up cached in the filesystem, and the contract is that there is no guarantee the data will still be there after a system crash. So even though the filesystem cannot be blamed for such data loss, it is still nice if it helps limit the size of such losses.


Case in point: UFS write throttling


For instance, UFS uses the fsflush daemon to try to keep data exposed for no more than 30 seconds (the default value of autoup). UFS also keeps track of the amount of I/O outstanding for each file; once too much I/O is pending, UFS throttles writers for that file. This is controlled through ufs_HW and ufs_LW, whose values were commonly tuned (a bad sign). Eventually the old default values were updated and seem to work nicely today. UFS write throttling thus operates on a per-file basis. While there are some merits to this approach, it can be defeated, since it does not manage the imbalance between memory and disks at the system level.


ZFS's previous write throttling


ZFS is designed around the concept of transaction groups (txg). Normally, every 5 seconds the _open_ txg moves to the quiesced state; from there the quiesced txg moves to the syncing state, which sends the dirty data to the I/O subsystem. For each pool, there is at most one txg in each of the three states: open, quiescing, syncing. Write throttling used to occur when the 5-second txg clock fired while the syncing txg had not yet completed: the open group would wait on the quiesced one, which waits on the syncing one. Application writers (in the write system call) would block, possibly for a few seconds, waiting for a txg to open. In other words, if a txg took more than 5 seconds to sync to disk, we would globally block writers, matching their speed to that of the I/O. But if a workload had bursty write behavior that could be synced within the allotted 5 seconds, applications would never be throttled.
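
To see how long each txg sync actually takes on a given system, one can time the sync phase; a minimal DTrace sketch, assuming fbt access to the spa_sync routine:

	# Distribution of transaction group sync times, in milliseconds
	dtrace -n '
	    fbt::spa_sync:entry { self->ts = timestamp; }
	    fbt::spa_sync:return /self->ts/ {
	        @["txg sync time (ms)"] = quantize((timestamp - self->ts) / 1000000);
	        self->ts = 0;
	    }'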


The Issue


But ZFS did not sufficiently control the amount of data that could get into an open txg. As long as the ARC was no more than half dirty, ZFS would accept more data. For a machine with a lot of memory, or one with weak storage, this was likely to cause long txg sync times. The downsides were many:


	- If we did end up throttled, long sync times meant the system
	behavior would be sluggish for seconds at a time.

	- Long txg sync times also impacted the granularity at which
	we could generate snapshots.

	- We ended up with lots of pending data in the cache, all of
	which could be lost in the event of a crash.

	- The ZFS I/O scheduler, which prioritizes operations, was also
	negatively impacted.

	- By not throttling, sequential writes to large files could
	displace a very large number of smaller objects from the ARC;
	refilling that data later meant a very large number of disk
	I/Os. Not throttling can paradoxically end up being very
	costly for performance.

	- The previous code could also, at times, issue no I/Os to disk
	for seconds even though the workload was critically dependent
	on storage speed.

	- And foremost, lack of throttling depleted memory and prevented
	ZFS from reacting to memory pressure.

That ZFS is considered a memory hog is most likely the result of the previous throttling code. Once a proper solution is in place, it will be interesting to see if we behave better on that front.


The Solutions


The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory.


To avoid the system-wide, seconds-long throttle effect, the new code detects when we are dangerously close to that situation (7/8th of the limit) and inserts a 1-tick delay for applications issuing writes. This prevents a write-intensive thread from hogging the available space and starving out other threads. This delay should also generally prevent the system-wide throttle.


So the new steady-state behavior for write-intensive workloads is that, starting with an empty TXG, all threads are allowed to dirty memory at full speed until a first threshold of bytes in the TXG is reached. From that point, every write system call is delayed by 1 tick, significantly slowing down the pace of writes. If the previous TXG completes its I/Os, the current TXG is allowed to resume at full speed. But in the unlikely event that a workload, despite the per-write 1-tick delay, manages to fill the TXG up to the full threshold, we are forced to throttle all writes in order to let the storage catch up.


This should make the system much better behaved, and generally more performant, under sustained write stress.


If you own an unlucky workload that ends up slowed by the additional throttling, do consider the other benefits you get from the new code. If they do not compensate for the loss, get in touch and tell us what your needs are on that front.


Monday Jan. 08, 2007

NFS and ZFS, a fine combination


No doubt there is still a lot to learn about ZFS as an NFS server, and this entry will not delve deeply into that topic. What I'd like to dispel here is the notion that ZFS causes some NFS workloads to exhibit pathological performance characteristics.


The Sightings


Since there have been a few perceived 'sightings' of such slowdowns, a little clarification is in order. Large slowdowns are typically reported when looking at a single-threaded load, usually one doing small-file creation such as 'tar xf many_small_files.tar'.


For instance, I've run one such small test over a 72 GB SAS drive.


	tmpfs   :  0.077 sec
	ufs     :  0.25  sec
	zfs     :  0.12  sec
	nfs/zfs : 12     sec


There are a few things to observe here. Local filesystems have a huge advantage for this type of load: in the absence of a specific request by the application (e.g. tar), local filesystems can lose your data and no one will complain. This is data loss, not data corruption, and this generally accepted data loss will occur in the event of a system crash. The argument is that if you need a higher level of integrity, you need to program it into the application using O_DSYNC, fsync, etc. Many applications are not that critical and avoid such burdens.


NFS and COMMIT


On the other hand, the nature of the NFS protocol is such that the client _must_, at specific points, request that the server place previously sent data onto stable storage. This is done through an NFSv3 or NFSv4 COMMIT operation. The COMMIT operation is a contract between client and server that allows the client to forget about its previous interactions with the file. In the event of a server crash or reboot, the client is guaranteed that previously committed data will be returned by the server; operations since the last COMMIT can be replayed after a server crash in a way that ensures a coherent view for everybody involved.


But this all topples over if the COMMIT contract is not honored. If the local filesystem does not properly commit data when requested to do so, there is no longer any guarantee that the client's view of the files will be what it would normally expect. Even though the client has completed the 'tar x' with no errors, some of the files can be missing, in full or in part.


With local filesystems, a system crash is plainly obvious to users and requires applications to be restarted. With NFS, a server crash is not obvious to users of the service (the only sign being a lengthy pause), and applications are not notified. The fact that files or parts of files may go missing in the absence of errors can be considered plain corruption of the client-side view.


When the underlying filesystem serving NFS ignores COMMIT requests, or when the storage subsystem acknowledges I/Os before they reach stable storage, what is merely potential data loss on the server becomes corruption from the client's point of view.


It turns out that in NFSv3/NFSv4 the client will request a COMMIT on close. Moreover, the NFS server itself is required to commit on metadata operations; for NFSv3 that is on:


	SETATTR, CREATE, MKDIR, SYMLINK, MKNOD, 
	REMOVE, RMDIR, RENAME, LINK


and a COMMIT may be required on the containing directory.


Expected Performance


Let's imagine we find a way to run our load at one COMMIT (on close) per extracted file. The COMMIT means the client must wait for at least one full I/O latency, and since 'tar x' processes the tar file from a single thread, we can run our workload at a maximum rate (assuming infinitely fast networking) of one extracted file per I/O latency, or about 200 extracted files per second on modern disks. If the files to be extracted are 1K in size on average, the 'tar x' proceeds at a pace of 200K/sec. If we are required to issue two COMMIT operations per extracted file (for instance due to a server-side COMMIT on file create), that further halves the throughput.


However, if we had lots of threads extracting individual files concurrently, the performance would scale up nicely with the number of threads.


But tar is single-threaded, so what is actually going on here? The need to COMMIT frequently means that our thread must frequently pause for a full server-side I/O latency. Because our single-threaded tar is blocked, nothing else is able to process the rest of our workload. If we allow the server to ignore COMMIT operations, the NFS responses are sent earlier, allowing the single thread to proceed down the tar file at greater speed. One must realize that the extra performance is obtained at the risk of causing corruption of the client's view in the event of a crash.


Whether the client or the server needs to COMMIT as often as it does is a separate issue; the existence of other clients that may be accessing the same files needs to be considered in that discussion. The point being made here is that this issue is not particular to ZFS, nor does ZFS necessarily exacerbate the problem. The performance of single-threaded writes over NFS is throttled as a result of the NFS-imposed COMMIT semantics.


ZFS Relevant Controls


ZFS has two controls that come into this picture: the disk write caches and the zil_disable tunable. ZFS is designed to work correctly whether or not the disk write caches are enabled. This is achieved through explicit cache-flush requests, which are generated (for example) in response to an NFS COMMIT. Enabling the write caches is then a performance consideration, and can offer performance gains for some workloads. This is not the case with UFS, which is not aware of the existence of a disk write cache and is not designed to operate with such a cache enabled. Running UFS on a disk with the write cache enabled can lead to corruption of the client's view in the event of a system crash.


ZFS also has the zil_disable control. ZFS is not designed to operate with zil_disable set to 1. Setting this variable (before mounting a ZFS filesystem) means that O_DSYNC writes, fsync calls, as well as NFS COMMIT operations, are all ignored! We note that, even without a ZIL, ZFS will always maintain a coherent local view of the on-disk state. But ignoring NFS COMMIT operations causes the client's view to become corrupted (as defined above).


Comparison with UFS


In the original complaint there was no comparison between a semantically correct NFS service delivered by ZFS and a similar NFS service delivered by another filesystem. Let's gather some more data:


Local and memory-based filesystems:

	tmpfs   :  0.077 sec
	ufs     :  0.25  sec
	zfs     :  0.12  sec

NFS service with risk of corruption of the client-side view:

	nfs/ufs :  7     sec (write cache enabled)
	nfs/zfs :  4.2   sec (write cache enabled, zil_disable=1)
	nfs/zfs :  4.7   sec (write cache disabled, zil_disable=1)

Semantically correct NFS service:

	nfs/ufs : 17     sec (write cache disabled)
	nfs/zfs : 12     sec (write cache disabled, zil_disable=0)
	nfs/zfs :  7     sec (write cache enabled, zil_disable=0)


We note that with most filesystems we can easily produce an improper NFS service by enabling the disk write caches. In this case, a server-side filesystem may think it has committed data to stable storage, but the presence of an enabled disk write cache makes that assumption false. With ZFS, enabling the write caches is not sufficient to produce an improper service.


Disabling the ZIL (setting zil_disable to 1 with mdb and then mounting the filesystem) is one way to generate an improper NFS service. With the ZIL disabled, COMMIT requests are ignored, with potential corruption of the client's view.
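
For completeness, a hedged sketch of that recipe; again, this trades correctness for speed and is not an appropriate setting for an NFS server holding data you care about:

	# Experiments only: ignore ZIL (and thus COMMIT) semantics
	echo zil_disable/W0t1 | mdb -kw       # then mount the filesystem under test
	echo zil_disable/W0t0 | mdb -kw       # restore the default afterwards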


Intelligent Storage


A different topic is running ZFS on intelligent storage arrays. One known pathology is that some arrays will _honor_ the ZFS request to flush the write caches despite the fact that their caches qualify as stable storage. In this case, NFS performance will be much, much worse than otherwise expected. On this topic, and for ways to work around this specific issue, see Jason's .Plan: Shenanigans with ZFS.


Conclusion


In many common circumstances, ZFS offers a fine NFS service that complies with all NFS semantics even with write caches enabled. If another filesystem appears much faster, I suggest first making sure that this other filesystem complies in the same way.


This is not to say that ZFS performance cannot be improved; clearly it can. The performance of ZFS is still evolving quite rapidly. In many situations, ZFS provides the highest throughput of any filesystem. In others, ZFS performance is highly competitive with other filesystems. In some cases, ZFS can be slower than other filesystems -- while in all cases providing end-to-end data integrity, ease of use and integrated services such as compression and snapshots.


See also Eric's fine entry on zil_disable.

Friday Sept. 22, 2006

ZFS and OLTP

ZFS and Databases


Given that we now have a good understanding of the internal dynamics of ZFS, I figured it was time to tackle the next hurdle: running a database management system (DBMS). I know very little myself about DBMSes, so I teamed up with people who have tons of experience with them, my colleagues from Performance Engineering (PAE), Neelakanth (Neel) Nagdir and Sriram Gummuluru, with occasional words of wisdom from Jim Mauro as well.


Note that UFS (with DIO) has been heavily tuned over the years to provide very good support for DBMS. We are just beginning to explore the tweaks and tunings necessary to achieve comparable performance from ZFS in this specialized domain.


We knew that running a DBMS would be a challenge, since a database tickles filesystems in ways that are quite different from other types of loads. We had two goals. Primarily, we needed to understand how ZFS performs in a DB environment and in what specific areas it needs to improve. Secondly, we figured that whatever came out of the work could be used as blog material as well as best-practice recommendations. You're reading the blog material now; also watch this space for Best Practices updates.



Note that it was not a goal of this exercise to generate data for a world record press-release. (There is always a metric where this can be achieved.)


Workload


The workload we use in PAE to characterize DBMSes is called OLTP/Net. This benchmark was developed inside Sun for the purpose of engineering performance into DBMSes. Modeled on common transaction-processing benchmarks, it is OLTP-like but with a higher network-to-disk ratio, which makes it more representative of real-world applications. Quoting from Neel's prose:


        "OLTP/Net, the New-Order transaction involves multi-hops as it
        performs Item validation, and inserts a single item per hop as
        opposed to block updates "


I hope that means something to you; Neel will be blogging about it on his own if you need more info.


Reference Point


The reference performance point for this work is UFS (VxFS would also be an interesting data point, but I'm not tasked with improving that metric). For DB loads we know that UFS directio (DIO) provides a significant performance boost, so that is our target as well.


Platform & Configuration


Our platform was a Niagara T2000 (8 cores at 1.2 GHz, 4 hardware threads or strands per core) with 130 36 GB disks attached in JBOD fashion. Each disk was partitioned into 2 equal slices, with half of the surface given to a Solaris Volume Manager (SVM) volume onto which UFS was built, and the other half given to the ZFS pool.


The benchmark was designed to not fully saturate either the CPU or the disks. While we know that performance varies between inner & outer disk surface, we don't expect the effect to be large enough to require attention here.


Write Cache Enabled (WCE)


ZFS is designed to work safely whether or not a disk's write cache is enabled (WCE). This stays true if ZFS is operating on a disk slice. However, when given a full disk, ZFS turns _ON_ the write cache as part of the import sequence; it won't enable the write cache when given only a slice. So, to be fair to ZFS's capabilities, we manually turned on WCE when running our tests over ZFS.
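
A sketch of the difference, with hypothetical device names; the write cache can also be toggled manually, for instance through format(1M) in expert mode:

	# Whole disk: ZFS labels the disk itself and enables its write cache
	zpool create datapool c2t0d0
	# Slice only: ZFS leaves the write-cache setting untouched
	zpool create datapool c2t0d0s0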


UFS is not designed to work with WCE and will put data at risk if it is set, so we turned it off for the UFS runs. We had to toggle the setting this way to get around the fact that we did not have enough disks to dedicate a full set to each filesystem. The performance we measured is therefore what would be expected when giving full disks to either filesystem. We note that, for the FC devices we used, WCE did not provide ZFS a significant performance boost on this setup.


No Redundancy


For this initial effort we did not configure any form of redundancy for either filesystem. ZFS RAID-Z does not really have an equivalent in SVM/UFS, so we settled on a simple stripe. We could eventually configure software mirroring on both filesystems, but we don't expect that to change our conclusions; still, it will be interesting follow-up work.


DBMS logging


Another thing we already know is that the DBMS log writer's latency is critical to OLTP performance. So, in order to improve that metric, it's good practice to set aside a number of disks for the DBMS logs. With this in hand, we ran our benchmark and got our target performance number (in relative terms, higher is better):



        UFS/DIO/SVM :           42.5
        Separate Data/log volumes


Recordsize


OK, so now we're ready. We load up Solaris 10 Update 2 (S10U2/ZFS), build a log pool and a data pool and get going. Note that log writers generate a pattern of sequential I/O of varying sizes; that should map quite well onto ZFS out of the box. But for the DBMS data pool, we expect a very random pattern of reads and writes to DB records. A well-known ZFS best practice when servicing fixed-record accesses is to match the ZFS recordsize property to that of the application. We note that UFS, by chance or by design, also works (at least on SPARC) with 8K records.
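
A sketch of that best practice with hypothetical pool names; the property must be set before the database files are created, since it only applies to newly written files:

	# Match the data pool's record size to the 8K database block size;
	# the log pool keeps the default recordsize
	zfs set recordsize=8k datapool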


2nd run ZFS/S10U2


So for a fair comparison, we set the recordsize to 8K for the data pool and run our OLTP/Net and....gasp!:



        ZFS/S10U2       :       11.0
        Data pool (8K record on FS)
        Log pool (no tuning)


So that's no good and we have our work cut out for us.


The role of Prefetch in this result


To some extent we already knew of a subsystem that commonly misbehaves (and which is being fixed as we speak): the vdev-level prefetch code (which I also refer to as the software track buffer). In this code, whenever ZFS issues a small read I/O to a device, it will by default fetch quite a sizable chunk of data (64K) at the physical location being read. In itself, this should not increase the I/O latency, which is dominated by the head seek, and since the data is stored in a small fixed-size buffer we don't expect it to eat up too much memory either. However, in a heavy-duty environment like we have here, every extra byte that moves up or down the data channel occupies valuable space. Moreover, for a large DB, we really don't expect the speculatively read data to be used very much. So for our next attempt we'll tune the prefetch buffer down to 8K.


And the role of the vq_max_pending parameter


But we don't expect this to be quite sufficient here. My DBMS-savvy friends tell me that the I/O latency of reads was quite large in our runs. Now, ZFS prioritizes reads over writes, so we thought we should be OK. However, during a pool transaction group sync, ZFS will issue quite a number of concurrent writes to each device; this is governed by the vq_max_pending parameter, which defaults to 35. Clearly, during this phase, reads, even if prioritized, take somewhat longer to complete.


3rd run, ZFS/S10U2 - tuned


So I wrote up a script to tune those two ZFS knobs. We could then run with a vdev prefetch buffer of 8K and a vq_max_pending of 10. This boosted our performance almost 2X:



        ZFS/S10U2       :       22.0
        Data pool (8K record on FS)
        Log pool (no tuning)
        vq_max_pending : 10
        vdev prefetch : 8K


But not quite satisfying yet.


ZFS/S10U2 known bug


We know something else about ZFS. In the last few builds before S10U2, a little bug made its way into the code base. The effect of this bug was that for a full-record rewrite, ZFS would read in the old block even though its data is not needed at all. That shouldn't be too bad; perfectly aligned block rewrites of uncached data are not that common... except for databases. Bummer.


So S10U2 is plagued with this issue, which affects DB performance with no workaround. Our next step was therefore to move on to the latest ZFS bits.


4th run ZFS/Build 44


Build 44 of our next Solaris version has long since had this particular issue fixed. There we topped our past performance with:




        ZFS/B44         :       33.0
        Data pool (8K record on FS)
        Log pool (no tuning)
        vq_max_pending : 10
        vdev prefetch : 8K


Compare this to umpteen years of super-tuned UFS:



        UFS/DIO/SVM : 42.5
        Separate Data/log volumes


Summary


I think at this stage of ZFS, the results are neither great nor bad. We have achieved:



        UFS/DIO   : 100 %
        UFS       : xx   no directio (to be updated)
        ZFS Best  : 75%  best tuned config with latest bits. 
        ZFS S10U2 : 50%  best tuned config.
        ZFS S10U2 : 25%  simple tuning.


To achieve acceptable performance levels:


The latest ZFS code base: ZFS improves fast these days, and we will need to keep tracking releases for a little while. The current OpenSolaris release, as well as the upcoming Solaris 10 Update 3 (this fall), should perform on these tests as well as the Build 44 results shown here.


One data pool and one log pool: it is common practice to partition hardware resources when we want proper isolation. Going forward, I think we will eventually get to the point where this is not necessary, but it seems an acceptable constraint for now. A tuned vdev prefetch: the code is being worked on, and I expect that in the near future this will not be necessary.


A tuned vq_max_pending: that may take a little longer. In a DB workload, latency is key and throughput secondary. There are a number of ideas that need to be tested which will help ZFS improve both average latency and latency fluctuations. This will help both intent log (O_DSYNC write) latency and reads.


Parting Words


As those improvements come out, they may well allow ZFS to catch or surpass our best UFS numbers. When you match that kind of performance with all the usability and data-integrity features of ZFS, that's a proposition that becomes hard to pass up.

Wednesday May 31, 2006

WHEN TO (AND NOT TO) USE RAID-Z



RAID-Z is the technology  used by ZFS  to implement a data-protection  scheme
which is less  costly  than  mirroring  in  terms  of  block
overhead.

Here,  I'd  like  to go  over,    from a theoretical standpoint,   the
performance implication of using RAID-Z.   The goal of this technology
is to allow a storage subsystem to be able  to deliver the stored data
in  the face of one  or more disk   failures.  This is accomplished by
joining  multiple disks into  a  N-way RAID-Z  group. Multiple  RAID-Z
groups can be dynamically striped to form a larger storage pool.

To store file data onto  a RAID-Z group, ZFS  will spread a filesystem
(FS) block onto the N devices that make up the  group.  So for each FS
block,  (N - 1) devices  will  hold file  data  and 1 device will hold
parity  information.   This information  would eventually   be used to
reconstruct (or  resilver) data in the face  of any device failure. We
thus  have 1 / N  of the available disk  blocks that are used to store
the parity  information.   A 10-disk  RAID-Z group  has 9/10th of  the
blocks effectively available to applications.

A common alternative for data protection, is  the use of mirroring. In
this technology, a filesystem block is  stored onto 2 (or more) mirror
copies.  Here again,  the system will  survive single disk failure (or
more with N-way mirroring).  So 2-way mirror actually delivers similar
data-protection at   the expense of   providing applications access to
only one half of the disk blocks.

Now  let's look at this from  the performance angle in particular that
of  delivered filesystem  blocks  per second  (FSBPS).  A N-way RAID-Z
group achieves its protection by spreading a ZFS block onto the N
underlying devices.  That means  that a single  ZFS block I/O must  be
converted to N device I/Os. To be more precise, in order to access a
ZFS block, we need N device I/Os for output and (N - 1) device I/Os for
input, as the parity data need not generally be read in.

Now after a request for a  ZFS block has been spread  this way, the IO
scheduling code will take control of all the device I/Os that need to
be  issued.  At this  stage,  the ZFS  code  is capable of aggregating
adjacent  physical   I/Os  into   fewer ones.     Because of  the  ZFS
Copy-On-Write (COW) design, we   actually do expect this  reduction in
number of device level I/Os to work extremely well  for just about any
write intensive workloads.  We also expect  it to help streaming input
loads significantly.  The situation of random inputs is one that needs
special attention when considering RAID-Z.

Effectively,  as  a first approximation,  an  N-disk RAID-Z group will
behave as   a single   device in  terms  of  delivered    random input
IOPS. Thus  a 10-disk group of devices  each capable of 200-IOPS, will
globally act as a 200-IOPS capable RAID-Z group.  This is the price to
pay to achieve proper data  protection without  the 2X block  overhead
associated with mirroring.

With 2-way mirroring, each FS block output must  be sent to 2 devices.
Half of the available IOPS  are thus lost  to mirroring.  However, for
Inputs each side of a mirror can service read calls independently from
one another  since each  side   holds the full information.    Given a
proper software implementation that balances  the inputs between sides
of a mirror, the  FS blocks delivered by a  mirrored group is actually
no less than what a simple non-protected RAID-0 stripe would give.

So looking at a random-access input load, in terms of the number of FS
blocks per second (FSBPS): given N devices to be grouped either in
RAID-Z, 2-way mirrored or simply striped (a.k.a. RAID-0, no data
protection!), the equation would be (where dev represents the capacity,
in terms of blocks or of IOPS, of a single device):

					Random
		Blocks Available	FS Blocks / sec
		----------------	--------------
RAID-Z		(N - 1) * dev		1 * dev
Mirror		(N / 2) * dev		N * dev
Stripe		N * dev			N * dev


Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and
look at different possible configurations. In the table below the
configuration labeled:
	
	"Z 5 x (19+1)"

refers to a dynamic stripe of 5 RAID-Z groups, each group made of 20
disks (19 data disks + 1 parity). M refers to a 2-way mirror and S to a
simple dynamic stripe.


						Random
	 Config		Blocks Available	FS Blocks /sec
	 ------------	----------------	--------- 
	 Z 1  x (99+1) 	9900 GB		  	  200	  
	 Z 2  x (49+1)	9800 GB		  	  400	  
	 Z 5  x (19+1)	9500 GB			 1000	  
	 Z 10 x (9+1)	9000 GB			 2000	  
	 Z 20 x (4+1)	8000 GB			 4000	  
	 Z 33 x (2+1)	6600 GB			 6600	  

	 M  2 x (50) 	5000 GB			20000	  
	 S  1 x (100)   10000 GB		20000	  


So RAID-Z gives you at most 2X the number of blocks that mirroring
provides but hits you with far fewer delivered IOPS. That means
that, as the number of  devices in a  group N increases, the  expected
gain over mirroring (disk blocks)  is bounded (to  at most 2X) but the
expected cost  in IOPS is not  bounded (cost in  the range of [N/2, N]
fewer IOPS).  

Note  that for wide RAID-Z configurations,  ZFS takes into account the
sector size of devices (typically 512 bytes) and dynamically adjusts
the effective number of columns in a stripe.  So even if you request a
99+1  configuration, the actual data  will probably be  stored on much
fewer data columns than that.   Hopefully this article will contribute
to steering deployments away from those types of configuration.

In conclusion, when preserving IOPS capacity is important, the size of
RAID-Z groups    should be restrained  to smaller   sizes and one must
accept some level of disk block overhead.

When performance matters most, mirroring should be highly favored.  If
mirroring  is considered too   costly but performance  is nevertheless
required, one could proceed like this:

	Given N devices each capable of X IOPS.

	Given a target of delivered  Y FS blocks per second
	for the storage pool.

	Build your storage using dynamically  striped RAID-Z groups of
	(Y / X) devices.

For instance: 

	Given 50 devices each capable of 200 IOPS.

	Given a target of delivered 1000 FS blocks per second
	for the storage pool.

	Build your storage using dynamically striped RAID-Z groups of
	(1000 / 200) = 5 devices.

In that system we then would have  20% block overhead lost to maintain
RAID-Z level parity.
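
A hedged zpool sketch of that recipe, shown with two 5-disk groups
(extend the pattern to 10 groups for the 50-disk example above); device
names are hypothetical:

	# Dynamic stripe of RAID-Z groups, 5 disks per group
	zpool create tank \
	    raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
	    raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0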

RAID-Z is a great  technology not only  when disk blocks are your most
precious resources but also  when your available  IOPS far exceed your
expected needs.  But beware  that if you  get your hands on fewer very
large  disks, the IOPS capacity  can  easily become your most precious
resource. Under those conditions, mirroring should be strongly favored
or alternatively a  dynamic stripe of RAID-Z  groups each made up of a
small number of devices.


Tuesday May 16, 2006

128K Suffices

I argue that a 128K I/O size is sufficient to extract the most out of a disk, given enough concurrent I/Os.

Wednesday Nov. 16, 2005

ZFS to UFS Performance Comparison on Day 1


With special thanks to Chaoyue Xiong for her help in this work.
        
In this paper I'd like to review the performance data we have gathered
comparing this initial release of ZFS (Nov 16 2005) with the Solaris
legacy, optimized-beyond-reason UFS filesystem. The data we will be
reviewing is based on 14 unit tests that were designed to stress
specific usage patterns of filesystem operations. Working with these
well-contained usage scenarios greatly facilitates subsequent
performance engineering analysis.

Our focus was to make a fair head-to-head comparison between UFS and
ZFS, not to try to produce the biggest, meanest marketing numbers.
Since ZFS  is also a Volume   Manager, we actually  compared  ZFS to a
UFS/SVM combination.  In cases  where ZFS underperforms UFS, we wanted
to figure out why and how to improve ZFS.

We are currently focusing on data-intensive operations. Metadata-intensive
tests are being developed and we will report on those in a later study.

Looking ahead to our results, we find that of our 12 filesystem unit
tests that were successfully run:

  •     ZFS outpaces UFS in 6 tests by a mean factor of 3.4
  •     UFS outpaces ZFS in 4 tests by a mean factor of 3.0
  •     ZFS equals UFS in 2 tests.


In this paper, we will take a closer look at the tests where UFS
is ahead and make proposals toward improving those numbers.


THE SYSTEM UNDER TEST

Our testbed is a hefty V890 with 8 x 1200 MHz US-IV CPUs (16 cores). At
this point we are not yet monitoring the CPU utilization of the
different tests, although we plan to do so in the future. The storage
is an insanely large 300-disk array; the disks were rather old
technology, small and slow 9 GB disks. None of the tests currently
stress the array very much, and the idea was mostly to take the
storage configuration out of the equation. Working with old-technology
disks, the absolute throughput numbers are not necessarily of
interest; they are presented in an appendix.

Every disk in our configuration is partitioned into 2 slices, and a
simple SVM or zpool striped volume is made across all spindles. We
then build a filesystem on top of the volume. All commands are run
with default parameters. Both filesystems are mounted and we can run
our test suite on either one.

Every test is rerun multiple times in succession; the tests are
defined and developed to avoid variability between runs. Some of
the current test definitions require that file data not be present in
the filesystem cache. Since we currently do not have a convenient way
to control this for ZFS, the results for those tests are omitted from
this report.


THE FILESYSTEM UNIT TESTS


Here is the definition of the 14 data-intensive tests we have
currently identified. Note that we are very open to new test
definitions; if you know of a data-intensive application that uses a
filesystem in a very different pattern (and there must be tons of
them), we would dearly like to hear from you.


Test 1


This is the simplest way  to create a  file; we open/creat a file then
issue 1MB writes until the filesize reaches 128 MB; we then close the file.

Test 2


In this test, we also create a new file, although here we work with a
file opened with the O_DSYNC flag. We work with 128K write system
calls. This maps to some database file creation schemes.

Test 3


This test is  also relative to file creation  but with writes that are
much smaller and of varying sizes. In this test, we create a 50MB file
using writes of size picked randomly between [1K,8K]. The file is opened
with default flags (no O_*SYNC), but every 10 MB of written data we
issue an fsync() call for the  whole file. This form  of access can be
used for log files that have data integrity requirements.

Test 4


Moving now to a read test; we read a  1 GB file (assumed in cache) with
32K read system calls. This is a rather simple test to keep everybody
honest.

Test 5


This is the same test as Test 4, but the file is assumed not to be
present in the filesystem cache. We currently have no such control for
ZFS, so we will not report performance numbers for this test. This is
a basic streaming-read sequence that should test the readahead
capability of a filesystem.

Test 6


Our previous write tests used allocating writes. In this test we
verify the ability of a filesystem to rewrite over an existing file.
We look at 32K writes to a file opened with O_DSYNC.

Test 7


Here we also test the ability to rewrite existing files. The sizes are
randomly picked in the [1K,8K] range. No special control over data
integrity (no O_*SYNC, no fsync()).

Test 8


In this test we create a very large file (10 GB) with 1MB writes
followed by 2 full-pass sequential reads. This test is still evolving,
but we want to verify the ability of the filesystem to work with files
whose size is close to, or larger than, available free memory.

Test 9


In this test, we issue 8K writes at random 8K aligned offsets in a 1 GB
file. When 128 MB of data is written we issue an fsync().

Test 10


Here,  we issue  2K writes at   random (unaligned) offsets  to  a file
opened O_DSYNC.


Test 11


Same test   as 10 but using 4   cooperating threads all  working  on a
single file.

Test 12


Here we attempt to  simulate a mixed  read/write pattern. Working with
an existing file, we loop through  a pattern of 3  reads at 3 randomly
selected 8K aligned  offsets followed by an  8K write to the last read
block.
 

Test 13

In this test we issue 2K pread() calls (at random unaligned
offsets). The file is asserted not to be in the cache. Since we
currently have no such control, we won't report data for this test.

Test 14

We have 4 cooperating threads (working on a single file) issuing 2K
pread() calls at random unaligned offsets. The file is present in the
cache.




THE RESULTS

We have a common testing framework to generate the performance data.
Each test is written as a simple C program, and the framework is
responsible for creating threads and files, timing the runs and
reporting. We are currently discussing merging this test framework
with the Filebench suite. We regret that we cannot easily share the
test code; however, the above descriptions should be sufficiently
precise to allow someone to reproduce our data. In my mind a simple
10 to 20 disk array and any small server should be enough to generate
similar numbers. If anyone finds very different results, I would be
very interested in knowing about it.

Our framework reports all timing results as a throughput
measure. Absolute throughput values are highly test-case dependent:
a 2K O_DSYNC write will not have the same throughput as a 1MB cached
read, and some tests would be better described in terms of operations
per second. However, since our focus is a relative ZFS to UFS/SVM
comparison, we will focus here on the delta in throughput between the
2 filesystems (for the curious, the full throughput data is posted in
the appendix).


Drumroll....


Task ID   Description                                              Winning FS / Performance Delta

1         open() and allocation of a 128.00 MB file with           ZFS / 3.4X
          write(1024K) then close().

2         open(O_DSYNC) and allocation of a 5.00 MB file with      ZFS / 5.3X
          write(128K) then close().

3         open() and allocation of a 50.00 MB file with write()    UFS / 1.8X
          of size picked uniformly in [1K,8K], issuing fsync()
          every 10.00 MB.

4         Sequential read(32K) of a 1024.00 MB file, cached.       ZFS / 1.1X

5         Sequential read(32K) of a 1024.00 MB file, uncached.     no data

6         Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC,     ZFS / 2.6X
          uncached.

7         Sequential rewrite() of a 1000.00 MB cached file, size   UFS / 1.3X
          picked uniformly in the [1K,8K] range, then close().

8         Create a file of size 1/2 of freemem using write(1MB)    ZFS / 2.3X
          followed by 2 full-pass sequential read(1MB). No
          special cache manipulation.

9         128.00 MB worth of random 8K-aligned write to a          UFS / 2.3X
          1024.00 MB file; followed by fsync(); cached.

10        1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC,    draw (UFS == ZFS)
          random offset, cached.

11        1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC,    ZFS / 5.8X
          random offset, uncached. 4 cooperating threads each
          writing 1 MB.

12        128.00 MB worth of 8K aligned read&write to 1024.00 MB   draw (UFS == ZFS)
          file, pattern of 3 X read, then write to last read
          page, random offset, cached.

13        5.00 MB worth of pread(2K) per thread within a shared    no data
          1024.00 MB file, random offset, uncached.

14        5.00 MB worth of pread(2K) per thread within a shared    UFS / 6.9X
          1024.00 MB file, random offset, cached, 4 threads.

As stated in the abstract

  •     ZFS outpaces UFS in 6 tests by a mean factor of 3.4
  •     UFS outpaces ZFS in 4 tests by a mean factor of 3.0
  •     ZFS equals UFS in 2 tests.

The performance differences can be sizable; let's have a closer look
at some of them.





PERFORMANCE DEBRIEF

Let's look at each test to try to understand the cause of the
performance differences.

Test 1 (ZFS 3.4X)

     open() and allocation  of a
    128.00 MB file with
    write(1024K) then close().                

This test is not fully analyzed. We note that in this situation UFS
will regularly kick off some I/O from the context of the write system
call. This occurs whenever a cluster of writes (typically of size 128K
or 1MB) has completed. The initiation of I/O by UFS slows down the
process. On the other hand, ZFS can zoom through the test at a rate
much closer to a memcopy. The ZFS I/Os to disk are actually generated
internally by the ZFS transaction group mechanism: every few seconds a
transaction group comes along and flushes the dirty data to disk, and
this occurs without throttling the test.

Test 2 (ZFS 5.3X)

     open(O_DSYNC) and
    allocation of a
    5.00 MB file with
    write(128K) then close().              

Here ZFS shows an even bigger advantage. Because of its design and
complexity, UFS is actually somewhat limited in its capacity to
write-allocate files in O_DSYNC mode. Every new UFS write requires some
disk block allocation, which must occur one block at a time when
O_DSYNC is set. ZFS can easily outperform UFS on this test.

Test 3 (UFS 1.8X)

     open()  and allocation of a
    50.00 MB file with write() of
    size picked uniformly  in
    [1K,8K] issuing fsync()                          
    every 10.00 MB


Here ZFS pays for the advantage it had in Test 1. In this test, we
issue very many writes to a file; those are cached as the process races
along. When the fsync() hits (every 10 MB of outstanding data per the
test definition) the FS must now guarantee that all the data has
reached stable storage. Since UFS kicks off I/O more regularly, when
the fsync() hits UFS has a smaller amount of data left to sync up. What
saves the day for ZFS is that, for that leftover data, UFS slows down
to a crawl. ZFS, on the other hand, has accumulated a large amount of
data in the cache when the fsync() hits. Fortunately ZFS is able to
issue much larger I/Os to disk and catches up some of the lag that has
built up. But the final result shows that UFS wins the horse race (at
least in this specific test); the details of the test will influence
the final result here.

However, the ZFS team is working on ways to make fsync() much better.
We actually have 2 possible avenues of improvement. We can borrow from
the UFS behavior and kick off some I/Os when too much outstanding data
is cached. UFS does this at a very regular interval, which does not
look right either, but clearly, if a file has many MB of outstanding
dirty data, sending them off to disk might be beneficial. On the other
hand, keeping the data in cache is interesting when the pattern of
writing is such that the same file offsets are written and rewritten
over and over again. Sending the data to disk is wasteful if the data
is subsequently rewritten shortly after. Basically the FS must place a
bet on whether a future fsync() will occur before a new write to the
block. We cannot win this bet on all tests all the time.

Given that fsync() performance is important, I would like to see us
asynchronously kick off I/O when we reach many MB of outstanding data
to a file. This is nevertheless debatable.

Even if we don't do this, we have another area of improvement that the
ZFS team is looking into. When the fsync() finally hits the fan, even
with a lot of outstanding data, the current implementation does not
issue disk I/Os very efficiently. The proper way to do this is to
kick off all required I/Os and then wait for them all to complete.
Currently, in the intricacies of the code, some I/Os are issued and
waited upon one after the other. This is not yet optimal, but we
certainly should see improvements coming in the future and I truly
expect ZFS fsync() performance to be ahead all the time.
   
Test 4 (ZFS 1.1X)

     Sequential read(32K) of a 1024.00
    MB file, cached.

A rather simple test, mostly close to memcopy speed between the
filesystem cache and the user buffer. The contest is almost a wash,
with ZFS slightly on top. Not yet analyzed.


Test 5 (N/A)

     Sequential read(32K) of a 1024.00
    MB file, uncached.

No results due to the lack of control over ZFS file-level caching.


Test 6 (ZFS 2.6X)

      Sequential rewrite(32K) of a
    10.00   MB  file,  O_DSYNC,
    uncached                      

Due to its write-anywhere (WAFL-like) layout, a ZFS rewrite is not
very different from an initial write, and ZFS seems to perform very
well on this test. Presumably UFS performance is hindered by the need
to synchronize the cached data. Result not yet analyzed.

Test 7 (UFS 1.3X)

     Sequential rewrite() of a 1000.00
    MB cached file, size picked
    uniformly    in the [1K,8K]                  
    range, then close().


In this test we are not timing any of the disk I/O. This is merely a
test about unrolling the filesystem code for 1K to 8K cached writes.
The UFS codepath wins in simplicity and years of performance tuning.
The ZFS codepath here somewhat suffers from its youth. Understandably,
the current ZFS implementation is very well layered and we can easily
imagine that the locking strategies of the different layers are
independent of one another. We have found (thanks, DTrace) that a small
ZFS cached write uses about 3 times as many lock acquisitions as an
equivalent UFS call. Mutex rationalization within or between layers
certainly seems to be an area of potential improvement for ZFS that
would help this particular test. We also realised that the very clean
and layered code implementation causes the callstack to take very many
elevator rides up and down between layers. On a SPARC CPU, going 6 or 7
layers deep in the callstack causes a spill/fill trap, plus one
additional trap for every additional floor travelled. Fortunately there
are very many areas where ZFS will be able to merge different functions
into a single one, or possibly exploit the technique of tail calls, to
regain some of the lost performance. All in all, we find that the
performance difference is small enough not to be worrisome at this
point, especially in view of the possible improvements we have already
identified.


Test 8 (ZFS 2.3X)

      create  a file   of size 1/2  of
    freemem   using  write(1MB)
    followed  by 2    full-pass
    sequential   read(1MB).  No              
    special cache manipulation.

This test needs to be analyzed further. We note that UFS will
proactively free-behind read blocks. While this is a very responsible
use of memory (give it back after use), it potentially impacts the UFS
re-read performance. While we're happy to see ZFS performance on top,
some investigation is warranted to make sure that ZFS does not
overconsume memory in some situations.

Test 9 (UFS 2.3X)

       128.00  MB  worth  of random  8
    K-aligned write       to  a
    1024.00  MB  file; followed                  
    by fsync(); cached.


In this test we expect a rationale similar to that of Test 3 to take
effect. The same cure should also apply.

Test 10 (draw)

      1.00  MB worth of   2K write to
    100.00   MB file,  O_DSYNC,
    random offset, cached.

Both FS must issue and wait  for a 2K I/O on  each write. They both do
this as efficiently as possible.


Test 11 (ZFS 5.8X)

     1.00  MB worth  of  2K write  to
    100.00 MB    file, O_DSYNC,
    random offset, uncached.  4
    cooperating  threads   each             
    writing 1 MB

This test is similar to the previous one except for the 4 cooperating
threads. ZFS being on top highlights a key feature of ZFS: the lack of
a single-writer lock. UFS can only allow a single writing thread per
file. The only exception is when directio is enabled, and then only
under rather restrictive conditions. UFS with directio allows
concurrent writers with the implied restriction that it does not honor
full POSIX semantics regarding write atomicity. ZFS, out of the box,
allows concurrent writers without requiring any special setup and
without giving up full POSIX semantics. All great news for simplicity
of deployment and great database performance.

Test 12 (draw)

    128.00 MB  worth of  8K aligned
    read&write   to  1024.00 MB
    file, pattern  of 3 X read,
    then write to   last   read
    page,     random    offset,
    cached.

Both filesystems perform appropriately. This test still requires analysis.

Test 13 (N/A)

      5.00  MB worth of pread(2K) per       
    thread   within   a  shared
    1024.00  MB    file, random
    offset, uncached

No results due to the lack of control over ZFS file-level caching.

Test 14 (UFS 6.9X)

     5.00 MB  worth of  pread(2K) per   
    thread within a shared
    1024.00 MB file, random                         
    offset, cached 4 threads.

This test inexplicably shows UFS on top. The UFS code can perform
rather well given that the FS cache is stored in the page cache, and
servicing such requests from the cache can be made very scalable. We
are just starting our analysis of the ZFS performance characteristics
for this test. We have identified a serialization construct in the
buffer management code: reclaiming the buffers into which to put the
cached data is acting as a serial throttle. This is truly the only test
where the ZFS performance disappoints, although there is no doubt that
we will find a cure for this implementation issue.




THE TAKEAWAY


ZFS is on top in very many of our tests, often by a significant
factor. Where UFS is ahead, we have a clear view of how to improve the
ZFS implementation. The case of shared readers of a single file will
be the test that requires special attention.

Given the youth of the ZFS implementation, the performance outline
presented in this paper shows that the ZFS design decisions are
validated from a performance perspective.


FUTURE DIRECTIONS

Clearly, we should now expand the unit test coverage. We would like
to study more metadata-intensive workloads. We also would like to see
how ZFS features such as compression and RAID-Z perform. Other
interesting studies could focus on CPU consumption and memory
efficiency. We also need to find a solution for running the existing
unit tests that require the files not to be cached in the filesystem.







APPENDIX/ THROUGHPUT MEASURE

Here are the raw throughput measures for each of the 14 unit tests.

Task ID   Description                                              ZFS latest+nv25 (MB/s)   UFS+nv25 (MB/s)

1         open() and allocation of a 128.00 MB file with                486.01572             145.94098
          write(1024K) then close().  -- ZFS 3.4X

2         open(O_DSYNC) and allocation of a 5.00 MB file with             4.5637                0.86565
          write(128K) then close().  -- ZFS 5.3X

3         open() and allocation of a 50.00 MB file with write()          27.3327               50.09027
          of size picked uniformly in [1K,8K], issuing fsync()
          every 10.00 MB.  -- UFS 1.8X

4         Sequential read(32K) of a 1024.00 MB file, cached.            674.77396             612.92737
          -- ZFS 1.1X

5         Sequential read(32K) of a 1024.00 MB file, uncached.         1756.57637              17.53705
          -- not comparable (no control over ZFS caching)

6         Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC,            2.20641               0.85497
          uncached.  -- ZFS 2.6X

7         Sequential rewrite() of a 1000.00 MB cached file, size         204.31557             257.22829
          picked uniformly in the [1K,8K] range, then close().
          -- UFS 1.3X

8         Create a file of size 1/2 of freemem using write(1MB)          698.18182             298.25243
          followed by 2 full-pass sequential read(1MB). No
          special cache manipulation.  -- ZFS 2.3X

9         128.00 MB worth of random 8K-aligned write to a                 42.75208             100.35258
          1024.00 MB file; followed by fsync(); cached.  -- UFS 2.3X

10        1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC,            0.117925              0.116375
          random offset, cached.  -- draw (UFS == ZFS)

11        1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC,            0.42673               0.07391
          random offset, uncached. 4 cooperating threads each
          writing 1 MB.  -- ZFS 5.8X

12        128.00 MB worth of 8K aligned read&write to 1024.00 MB          264.84151             266.78044
          file, pattern of 3 X read, then write to last read
          page, random offset, cached.  -- draw (UFS == ZFS)

13        5.00 MB worth of pread(2K) per thread within a shared            75.98432              0.11684
          1024.00 MB file, random offset, uncached.
          -- not comparable (no control over ZFS caching)

14        5.00 MB worth of pread(2K) per thread within a shared            56.38486             386.70305
          1024.00 MB file, random offset, cached, 4 threads.
          -- UFS 6.9X
