Monday, Dec. 15, 2008

Decoding Bonnie++



I've been studying the popular Bonnie++ load generator to see whether it is a suitable benchmark for network attached storage such as the Sun Storage 7000 line. At this stage I've looked at single-client runs, and Bonnie++ does not appear to be an appropriate tool in this environment because, as we'll see here, many of the tests stress either the networking environment or the strength of the client-side CPU rather than the server.

The first interesting thing to note is that Bonnie++ works on a data set that is double the client's memory. This does address some of the client-side caching concerns one could otherwise have. In a NAS environment, though, the amount of memory present on the server is not considered by a default Bonnie++ run. My client had 4GB, so the working set was 8GB, while the server had 128GB of memory. The Bonnie++ output looks like :
  Writing with putc()...done
  Writing intelligently...done
  Rewriting...done
  Reading with getc()...done
  Reading intelligently...done
  start 'em...done...done...done...
  Create files in sequential order...done.
  Stat files in sequential order...done.
  Delete files in sequential order...done.
  Create files in random order...done.
  Stat files in random order...done.
  Delete files in random order...done.

  Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
		      -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
  Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
  v2c01            8G 81160  92 109588  38 89987  67 69763  88 113613  36  2636  67
		      ------Sequential Create------ --------Random Create--------
		      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
		files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
		   16   687  10 +++++ +++  1517   9   647  10 +++++ +++  1569   8
  v2c01,8G,81160,92,109588,38,89987,67,69763,88,113613,36,2635.7,67,16,687,10,+++++,+++,1517,9,647,10,+++++,+++,1569,8

Method

I have used a combination of Solaris truss(1), reading the Bonnie++ code, looking at AmberRoad's Analytics data, as well as a custom Bonnie++ D-script in order to understand how each test triggers system calls on the client and how those translate into load on the NAS server. In the D-script, I characterise the system calls by their average elapsed time as well as by the time spent waiting for a response from the NAS server. The time spent waiting is the operational latency one should be interested in when characterising a NAS, while the additional time relates to the client's CPU strength along with the client NFS implementation. Here is what I found while trying to explain how each test performs.

Writing with putc()

Easy enough: this test creates a file using single-character putc() stdio library calls.

This test is clearly a client CPU test, with most of the time spent in user space running putc(). Every 8192 putc() calls, the stdio library issues a write(2) system call. That system call is still a client CPU test since the data is absorbed by the client's cache. What we test here is the client's single-CPU performance and the client NFS implementation. On a 2-CPU / 4GB V20z running Solaris, we observed on the server, using Analytics, a network transfer rate of 87 MB/sec.
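
For illustration, here is a minimal sketch of what the putc() phase boils down to; this is not Bonnie++'s actual source, and the mount point and file size are made up for the example. Almost all of the work is a user-space library call, with an occasional write(2) when the stdio buffer fills.

  /* Hypothetical putc()-style writer: one stdio call per byte, with the
   * stdio library flushing roughly 8K at a time via write(2) to the NFS
   * mounted file.  Path and size are illustrative. */
  #include <stdio.h>

  int
  main(void)
  {
      FILE *f = fopen("/mnt/nas/Bonnie.file", "w");  /* assumed NFS mount */
      long long size = 8LL * 1024 * 1024 * 1024;     /* ~2x client RAM */
      long long i;

      if (f == NULL) {
          perror("fopen");
          return (1);
      }
      for (i = 0; i < size; i++)
          putc((int)(i & 0x7f), f);   /* client CPU bound loop */
      fclose(f);                      /* flushes the last stdio buffer */
      return (0);
  }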

Results : 87 MB/sec of writes. Limited by single CPU speed.



Writing intelligently...done

This one is more clever: it writes the file using sequential 8K write(2) system calls.

In this test the CPU load is much relieved. The application issues 8K write system calls to the client NFS, and the data is absorbed by memory on the client. With an OpenSolaris client, no over-the-wire request is sent for such an 8K write. However, after 4 such 8K writes we reach the natural 32K chunk size advertised by the server, and that causes the client to asynchronously issue a write request to the server. The asynchronous nature means the application does not wait for the response and the test keeps going on CPU. The process races ahead generating more 8K writes and 32K asynchronous NFS requests. If we generate such requests at a greater rate than responses come back, we consume all allocated asynchronous threads; on Solaris this maps to nfs4_max_threads (8) threads. When all 8 asynchronous threads are waiting for a response, the application finally blocks, waiting for a previously issued request to get a response and free an async thread.

Since generating 8K writes into the client cache is faster than the network connection between the client and the server, we eventually reach this point. The steady state of this test is that Bonnie++ is waiting for data to transfer to the server. This happens at the speed of a single NFS connection, which for us saturated the 1Gbps link we had. We observed 113MB/sec, which is network line rate once protocol overheads are considered.
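
As a sanity check on that figure, here is a back-of-the-envelope line-rate calculation for a 1Gbps link with standard 1500-byte frames; the overhead numbers are textbook Ethernet/IP/TCP values, not measurements from this setup.

  /* Rough TCP payload capacity of a 1Gbps link at the default 1500-byte MTU. */
  #include <stdio.h>

  int
  main(void)
  {
      double link_bps   = 1e9;            /* 1 Gbps link */
      double wire_bytes = 1500 + 38;      /* frame + preamble, IFG, header, CRC */
      double payload    = 1500 - 20 - 20; /* minus IP and TCP headers */
      double tcp_MBps   = (link_bps / 8.0) * (payload / wire_bytes) / 1e6;

      /* Prints ~119 MB/sec; NFS/RPC headers shave off a little more,
       * so the observed 113 MB/sec is essentially line rate. */
      printf("max TCP payload ~ %.0f MB/sec\n", tcp_MBps);
      return (0);
  }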

To get more throughput on this test, one could use jumbo frame Ethernet instead of the default 1500-byte frame size, as this would reduce the protocol overhead slightly. One could also configure the server and client to use 10Gbps Ethernet links.

One could also use LACP link aggregation of 1Gbps network ports to increase throughput. LACP increases the aggregate throughput of multiple network connections but not that of a single socket. By default a Solaris client establishes a single connection (clnt_max_conns = 1) to a server (one connection per target IP). So using multiple aggregated links _and_ tuning clnt_max_conns could yield extra throughput here.

Staying with a single connection, one could instead use a faster network link between client and server to reach additional throughput.

More commonly, we expect to saturate the client's 1Gbps connectivity here, which is not much of a stress for a Sun Storage 7000 server.

Results : 113 MB/sec of writes. Network limited.



Rewriting...done

This one gets a little more interesting. It reads 8K, lseeks back to the start of the block, overwrites the 8K with new data, and loops.
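
For reference, the rewrite phase boils down to the pattern below; this is an illustrative sketch based on the description above, not Bonnie++'s actual code, and the path handling is simplified.

  /* Illustrative rewrite loop: read an 8K chunk, seek back, overwrite it.
   * Over NFS the lseek() is purely local; every wire write carries its
   * own target offset. */
  #include <fcntl.h>
  #include <unistd.h>

  #define CHUNK 8192

  static void
  rewrite_file(const char *path, long long nchunks)
  {
      char buf[CHUNK];
      int fd = open(path, O_RDWR);
      long long i;

      if (fd < 0)
          return;
      for (i = 0; i < nchunks; i++) {
          if (read(fd, buf, CHUNK) != CHUNK)
              break;
          buf[0] ^= 0xff;                        /* dirty the block */
          lseek(fd, -(off_t)CHUNK, SEEK_CUR);    /* back to block start */
          if (write(fd, buf, CHUNK) != CHUNK)
              break;
      }
      close(fd);
  }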

So here we read, lseek back, and overwrite. For the NFS protocol, lseek is a no-op since every over-the-wire write is tagged with its target offset. In this test we are effectively stream reading the file from the server and stream writing it back. The stream write behavior is much like the previous test: we never need to block the process unless we consume all 8 asynchronous threads. Similarly, the 8K sequential reads are recognised by our client NFS as streaming access, which deploys asynchronous readahead requests; we use 4 (nfs4_nra) requests for 32K blocks ahead of the point currently being read. What we observed here was that of 88 seconds of elapsed time, 15 were spent in writes and 20 in reads, but only a small portion of that was spent waiting for responses; it was mostly CPU time spent interacting with the client NFS. This implies that readahead and asynchronous writeback were behaving well, without becoming bottlenecks. The Bonnie++ process accounted for 50 of the 88 seconds, and a big chunk of that, 27 seconds, was spent waiting off-CPU. I struggle somewhat with this interpretation, but I do know from the Analytics data on the server that the network was seeing 100 MB/sec of data flowing in each direction, which must also be close to network saturation. The wait time attributed to Bonnie++ in this test seems to be related to kernel preemption; as Bonnie++ comes out of its system calls we see such events in dtrace:

              unix`swtch+0x17f
              unix`preempt+0xda
              genunix`post_syscall+0x59e
              genunix`syscall_exit+0x59
              unix`0xfffffffffb800f06
            17570


This must be to service kernel threads of higher priority, likely the asynchronous threads spawned by the reads and writes.

This test is thus a stress test of bidirectional 32K data transfers. Just like the previous test, improving the numbers would require improving the network throughput between the client and the server; it could then potentially also benefit from faster and more client CPUs.

Results : 100MB/sec in each direction, network limited.



Reading with getc()...done

Reads the file one character at a time.

Back to a test of the client CPU, much like the first one. We see that the readaheads are working great since little time is spent waiting (0.4 of 114 seconds). Given that this test does 1 million reads in 114 seconds, the average latency evaluates to 114 usec per call.

Results : 73MB/sec, single CPU limited on the client.



Reading intelligently...done start 'em...done...done...done...

Reads with 8k system calls, sequential.

This test seems to use the 3 spawned Bonnie++ processes to read the files. The reads are 8K in size, and we needed 1M of them to read our 8GB working set. We observed with Analytics that no I/O was done on the server, since it had 128GB of cache available. The network, on the other hand, is saturated at 118 MB/sec.

The D-script shows that the 1M read calls collectively spend 64 seconds waiting (most of that on the NFS response). That implies a 64 usec read response time for this sequential workload.
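
The per-operation figures quoted in this post are simple divisions of aggregate wait or elapsed time by call count; the small helper below reproduces them (the inputs are the numbers measured above).

  /* Average per-call latency from aggregate seconds and call count. */
  #include <stdio.h>

  static double
  usec_per_call(double total_sec, double calls)
  {
      return (total_sec * 1e6 / calls);
  }

  int
  main(void)
  {
      /* 8GB working set in 8K chunks => ~1 million read(2) calls. */
      printf("getc() test : %.0f usec elapsed/call\n", usec_per_call(114, 1e6));
      printf("8K read test: %.0f usec waited/call\n", usec_per_call(64, 1e6));
      return (0);
  }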

Results : 118MB/sec, limited by Network environment.



start 'em...done...done...done...

Here it seems that Bonnie++ starts 3 helper processes used to read the files in the "Reading intelligently" test.

Create files in sequential order...done.

Here we see 16K files being created (with creat(2)) then closed.

This test creates and closes 16K files and took 22 seconds in our environment. 19 seconds were spent in the creates, 17.5 of them waiting for responses. That means roughly a 1ms response time per file create. The test appears single threaded. Using Analytics we observe 13500 NFS ops per second to handle those file creates. We do see some activity on the write-biased SSD, although a very modest 2.64 MB/sec. Given that the test is single threaded, we can't tell whether this metric is representative of the NAS server's capability. More likely it is representative of the single-thread capability of the whole environment: client CPU, client NFS implementation, client network driver and configuration, network environment including switches, and the NAS server.
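
As an illustration, the create phase is essentially the loop below (the directory name and count are made up); each iteration pays a full client-to-server round trip for the creat(2), which is where the ~1ms per file goes.

  /* Illustrative file-create loop: creat(2) then close(2), 16K times.
   * Each creat() over NFS is a synchronous round trip to the server. */
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <stdio.h>

  static void
  create_files(const char *dir, int nfiles)
  {
      char path[256];
      int i, fd;

      for (i = 0; i < nfiles; i++) {
          (void) snprintf(path, sizeof (path), "%s/f%06d", dir, i);
          fd = creat(path, 0644);
          if (fd >= 0)
              (void) close(fd);
      }
  }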

Results : 744 file creates per second per thread. Limited by operational latency.

Here is the Analytics view captured for this test and the following 5 tests.



Stat files in sequential order...done.

This test was too elusive to measure, possibly working against cached stat information.

Delete files in sequential order...done.

Here we unlink(2) the 16K files.

Here we call the unlink system call for each of the 16K files. The run takes 10.294 seconds, showing 1591 unlinks per second. Each call goes off-CPU, waiting about 600 usec for a server response.

Much like the create file test above, while this gives information about the single-threaded unlink time in this environment, it's obviously not representative of the server's capabilities.

Results : 1591 unlinks per second per thread. Limited by operational latency.

Create files in random order...done.

We recreate 16K files, closing each one but also running a stat() system call on each.

Stat files in random order...done.

Elusive as above.

Delete files in random order...done.

We remove the 16K files.

I could not discern in the "random order" tests any meaningful differences from the sequential order ones.

Analytics screenshot of Bonnie++ run

Here is the full screenshot from Analytics, including disk and CPU data.



The takeaway here is that a single instance of Bonnie++ does not generally stress a Sun Storage 7000 NAS server, but it will stress the client CPU and 1Gbps network connectivity. There is no multi-client support in Bonnie++ (that I could find).

One can certainly start multiple clients simultaneously, but since the different tests would not be synchronized, the combined Bonnie++ output would be very questionable. Bonnie++ does have a multi-instance synchronisation mode, but it is based on semaphores and only works when all instances run within the same OS environment.

So in a multi-client test, only the total elapsed time would be of interest, and that would be dominated by streaming performance, as each client reads and writes its working set 3 times over the wire. File create and unlink times would also contribute to the total elapsed time of such a test.

For a single-node, multi-instance Bonnie++ run, one would need a large client with at least 16 x 2GHz CPUs and about 10Gbps worth of network capability in order to properly test one Sun Storage 7410 server. Otherwise, Bonnie++ is more likely to show client and network limits, not server ones. As for unlink capabilities, the topic is a complex and important one that certainly cannot be captured with simple commands. The interaction with snapshots and the I/O load generated on the server during large unlink storms need to be studied carefully in order to understand the competitive merits of different solutions.

In summary, here is what governs the performance of the individual Bonnie++ tests :
    Writing with putc()... 87 MB/sec. Limited by client's single CPU speed
    Writing intelligently... 113 MB/sec. Limited by network conditions
    Rewriting... 100 MB/sec. Limited by network conditions
    Reading with getc()... 73 MB/sec. Limited by client's single CPU speed
    Reading intelligently... 118 MB/sec. Limited by network conditions
    start 'em...done...done...done...
    Create files in sequential order... 744 creates/s. Limited by operational latency
    Stat files in sequential order... not observable
    Delete files in sequential order... 1591 unlinks/s. Limited by operational latency
    Create files in random order... same as sequential
    Stat files in random order... same as sequential
    Delete files in random order... same as sequential


So Bonnie++ won't tell you much about our server's capabilities. Unfortunately, the clustered mode of Bonnie++ won't coordinate multiple client systems and so cannot be used to stress a server. Bonnie++ could be used to stress a NAS server from a single large multi-core client with very strong networking capabilities, but in the end I don't expect to learn much about our servers over and above what is already known. For that, please check out our links here :

  • Low Level Performance of Sun Storage
  • Analyzing the Sun Storage 7000
  • Designing Performance Metrics...
  • Sun Storage 7xxx Performance Invariants


  • Here is the bonnie.d d-script used and the output generated bonnie.out.

    Monday, Nov. 10, 2008

    Blogfest : Performance and the Hybrid Storage Pool

    Today Sun is announcing a new line of Unified Storage designed by a core of the most brilliant engineers. For starters, Mike Shapiro provides a great introduction to this product, the new economics behind it and the killer App in Sun Storage 7000.

    The killer App is of course Bryan Cantrill's brainchild, the already famous Analytics. As a performance engineer, it's been a great thrill to give this tool an early test drive. Working a full ocean (the Atlantic) plus a continent (the USA) away from my system running Analytics, I was skeptical at first that I would be visualizing in real time all that information: the NFS/CIFS ops, the disk ops, the CPU load and network throughput, per client, per disk, per file. ARE YOU CRAZY! All that information available IN REAL TIME; I just have to say a big thank you to the team that made it possible. I can't wait to see our customers put this to productive use.

    Also check out Adam Leventhal's great description of the HSP, the Hybrid Storage Pool, and read my own perspective on this topic: ZFS as a Network Attach Storage Controller.

    Lest we forget the immense contribution of the boundless energy bubble that is Brendan Gregg, the man who brought the DTraceToolkit to the semi-geek; he must be jumping with excitement as we now see the power of DTrace delivered to each and every system administrator. He talks here about the Status Dashboard. And Brendan's contribution does not stop there: he is also the parent of that wonderful component of the HSP known as the L2ARC, which is how the Readzillas become activated. See his own previous work on the L2ARC along with Jing Zhang's more recent studies. Quality assurance people don't often get into the spotlight, but check out Tim Foster's post on how he tortured the zpool code adding and removing L2ARC devices from pools.

    For myself, it's been very exciting to see performance improvement ideas turned into product improvements from week to week. Those interested should read how our group influenced the product that is shipping today; see Alan Chiu's and my own Delivering Performance Improvements.

    Such a product has a strong price/performance appeal, and given that we fundamentally did not think there were public benchmarks that captured our value proposition, we had to come up with a third-millennium, participative way to talk about performance. Check out how we designed our Metrics, or maybe go straight to our numbers obtained by Amitabha Banerjee: a concise entry backed by an immense, intense and careful data gathering effort over the last few weeks. bmseer is putting his own light on the low level data (data to be updated with numbers from a grander config).

    I've also posted here a few performance guiding lights to be used when thinking about this product; I call them Performance Invariants. Further numbers can be found here about raid rebuild times.

    On the application side, we have the great work of Sean (Hsianglung Wu) and Arini Balakrishnan showing how a 7210 can deliver more than 5000 concurrent video streams at an aggregate of, you're kidding: WOW ZA 750MB/sec. More details on how this was achieved in cdnperf.

    Jignesh Shah shows step-by-step instructions for setting up PostgreSQL over iSCSI.

    See our Vice President, Solaris Data, Availability, Scalability & HPC Bob Porras trying to tame this beast into a nutshell and pointing out code bits reminding everyone of the value of the OpenStorage proposition.

    See also what bmseer has to say on Web 2.0 Consolidation and get from Marcus Heckel a walkthrough of setting up Olio Web 2.0 kit with nice Analytics performance screenshots. Also get the ISV reaction (a bit later) from Georg Edelmann. Ryan Pratt reports on Windows Server 2003 WHQL certification of the Sun Storage 7000 line.

    And this just in : Data about what to expect from a Database perspective.

    We can talk all we want about performance, but as Josh Simons points out, these babies are available to you for your own try and buy. Or check out how you could be running the appliance within the next hour, really: Sun Storage 7000 in VMware.

    It seems I am in competition with another, less verbose, aggregator. Finally, capture the whole stream of related postings to Sun Storage 7000.

    Delivering Performance Improvements to Sun Storage 7000


    I describe here the effort I spearheaded to study the performance characteristics of the OpenStorage platform and the ways in which our team of engineers delivered real out-of-the-box improvements to the product that is shipping today.

    One of the joys of working on the OpenStorage NAS appliance was that solutions we found to performance issues could be immediately turned into changes to the appliance without further process.

    The first big wins

    We initially stumbled on two major issues, one for NFS synchronous writes and one for the CIFS protocol in general. The NFS problem was a subtle one involving the distinction between O_SYNC and O_DSYNC writes in the ZFS intent log, and it was impacting our threaded synchronous write tests by up to a 20X factor. Fortunately I had a history of studying that part of the code and could quickly identify the problem and suggest a fix. This was tracked as 6683293: concurrent O_DSYNC writes to a fileset can be much improved over NFS.
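
    For readers unfamiliar with the distinction, the sketch below shows the two flavours of synchronous writes an application can request; O_DSYNC requires only the data (plus the metadata needed to retrieve it) to be stable before write(2) returns, while O_SYNC also forces ancillary file attributes such as timestamps to be committed, which is strictly more work per write. The paths are made up and this is not code from the appliance.

        /* O_DSYNC vs O_SYNC: both are synchronous, but O_SYNC also requires
         * file attribute updates (e.g. timestamps) to reach stable storage. */
        #include <fcntl.h>
        #include <unistd.h>

        int
        main(void)
        {
            char buf[8192] = { 0 };
            int fd_dsync = open("/pool/fs/data", O_WRONLY | O_CREAT | O_DSYNC, 0644);
            int fd_sync  = open("/pool/fs/log",  O_WRONLY | O_CREAT | O_SYNC,  0644);

            if (fd_dsync >= 0)
                (void) write(fd_dsync, buf, sizeof (buf));  /* data integrity only */
            if (fd_sync >= 0)
                (void) write(fd_sync, buf, sizeof (buf));   /* data + file attributes */

            if (fd_dsync >= 0)
                (void) close(fd_dsync);
            if (fd_sync >= 0)
                (void) close(fd_sync);
            return (0);
        }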

    The following week, turning to CIFS studies, we saw a great scalability limitation in the code. Here again I was fortunate to be the first one to hit it. The problem was that, to manage a CIFS request, the kernel code was using simple kernel allocations sized to accommodate the largest possible request. Such large allocations and deallocations cause what is known as a storm of TLB shootdown cross-calls, limiting scalability.

    Incredibly though, after implementing the trivial fix, I found that the rest of the CIFS server was beautifully scalable code with no other barriers. So with one quick and simple fix (using kmem caches) I could demonstrate a great scalability improvement for CIFS. This was tracked as 6686647: smbsrv scalability impacted by memory
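
    The general shape of such a fix, sketched below with the Solaris kernel kmem interfaces, is to create one object cache up front and recycle fixed-size buffers from it instead of allocating and freeing a worst-case sized buffer on every request. The names (req_cache, MAX_REQ_SIZE) are illustrative; this is not the actual smbsrv code.

        /* Recycle request buffers from a kmem cache rather than allocating
         * and freeing a worst-case sized buffer per request, which churns
         * kernel memory and triggers TLB shootdown cross-calls. */
        #include <sys/types.h>
        #include <sys/kmem.h>

        #define MAX_REQ_SIZE    (64 * 1024)     /* assumed worst-case request */

        static kmem_cache_t *req_cache;

        void
        req_cache_init(void)
        {
            req_cache = kmem_cache_create("req_cache", MAX_REQ_SIZE, 0,
                NULL, NULL, NULL, NULL, NULL, 0);
        }

        void *
        req_alloc(void)
        {
            return (kmem_cache_alloc(req_cache, KM_SLEEP));
        }

        void
        req_free(void *buf)
        {
            kmem_cache_free(req_cache, buf);
        }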

    Since those two protocol problems were identified early on, I must say that no serious protocol performance problems have come up since. While we can always find incremental improvements to any given test, our current implementation has held up to our testing so far.

    In the next phase of the project, we did a lot of work on improving network efficiency at high data rates. In order to deliver the throughput that the server is capable of, we must use 10Gbps network interfaces, and the ones available on the NAS platforms are based on the Neptune networking interface running the nxge driver.

    Network Setup

    I collaborated on this with Alan Chiu, who already knew a lot about this network card and its driver tunables, so we could quickly hash out the issues. We had to decide on a proper out-of-the-box setup involving:
        - how many MSI-X interrupts to use
        - whether to use networking soft rings or not
        - what bcopy threshold to use in the driver, as opposed to
          binding DMA
        - whether or not to use the new Large Segment Offload (LSO)
          technique for transmits
    
    We knew basically where we wanted to go here. We wanted many interrupts on the receive side so as not to overload any single CPU, and we wanted to avoid the use of layered soft rings, which reduce efficiency. We wanted a low bcopy threshold so that DMA binding would be used more frequently, as the default value was too high for this x64-based platform. And LSO was providing a nice boost to efficiency. That got us to a proper efficiency level.

    However we noticed that under stress and with a high number of connections our efficiency would drop by 2 or 3X. After much head scratching we rooted this in the use of too many TX DMA channels. It turns out that with this driver and architecture, using a few channels leads to more stickiness in the scheduling and much, much greater efficiency. We settled on 2 TX rings as a good compromise. That got us to a level of 8-10 CPU cycles per byte transferred in network code (more on Performance Invariants).

    Interrupt Blanking

    Studying an open-source alternative controller, we also found that on 1 of 14 metrics we were slower. That was rooted in the interrupt blanking parameter that NICs use to gain efficiency. What we found here was that by reducing our blanking to a small value we could leapfrog the competition (from 2X worse to 2X better) on this test while preserving our general network efficiency. We were then on par or better for every one of the 14 tests.

    Media Streaming

    When we ran thousands of 1 Mb/s media streams from our systems we quickly found that the file-level software prefetching was hurting us. So we initially disabled that code in our lab to run our media studies, but at the end of the project we had to find an out-of-the-box setup that could preserve our media results without impairing maximum read streaming. At some point we realized that we were hitting 6469558: ZFS prefetch needs to be more aware of memory pressure. It turns out that the zfetch code is internally set up to manage 8 concurrent streams per file and can read ahead up to 256 blocks or records, in this case 128K each. So when we realized that with thousands of streams we could read ahead ourselves out of memory, we knew what we needed to do. We settled on 2 streams per file, reading ahead up to 16 blocks, and that seems quite sufficient to retain our media serving throughput while keeping some prefetching capability. I note also that the NFS client code will itself recognize streaming and issue its own readahead; the backend code is then reading ahead of the client's readahead requests, so we were somewhat getting ahead of ourselves here. Read more about it @ cdnperf

    To slog or not to slog

    One of the innovative aspects of this OpenStorage server is the use of read- and write-optimized solid state devices; see for instance The Value of Solid State Devices.

    Those SSDs are beautiful devices designed to help latency, not throughput. A massive commit is actually better handled by regular storage than by the SSD. It turns out that it was dead easy to instruct the ZIL to recognize massive commits and divert its block allocation strategy away from the SSD toward the common pool of disks. We see two benefits here: the massive commits will be sped up (preventing the SSD from becoming the bottleneck), but more importantly the SSD will remain available as a low latency device to handle workloads that rely on low latency synchronous operations. One should note that the ZIL is a per-filesystem construct, so while one filesystem might be working on a large commit, another filesystem in the same pool might still be running a series of small transactions and benefit from the write-optimized SSD.

    In a similar way, when we first tested the read-optimized SSDs, we quickly saw that streamed data would install itself in this caching layer and could slow down processing later. Again, the beauty of working on an appliance and closely with its developers meant that by the following build those problems had been solved.

    Transaction Group Time

    ZFS operates by issuing regular transaction groups, in which modifications since the last transaction group are recorded on disk and the ueberblock is updated. This used to happen at a 5-second interval, but with the recent improvements to the write throttling code it became a 30-second interval (on light workloads) that aims to generate no more than 5 seconds of I/O per transaction group. Using 5 seconds of I/O per txg maximizes the ratio of data to metadata in each txg, delivering more application throughput. However, these Storage 7000 servers will typically have lots of I/O capability on the storage side, and the data/metadata ratio is not as much of a concern as it is for a small JBOD configuration. What we found was that we could reduce the target from 5 seconds of I/O down to 1 while still preserving good throughput, and the smaller value smoothed out operation.

    IT JUST WORKS

    Well, that is certainly the goal. In my group, we spent the last year performance testing these OpenStorage systems, finding and fixing bugs, suggesting code improvements, and looking for better compromises for common tunables. At this point, we're happy with the state of the systems, particularly for mirrored configurations with write-optimized SSD accelerators. Our code is based on a recent OpenSolaris build (from August) that already has a lot of improvements over Solaris 10, particularly for ZFS, to which we've added specific improvements relevant to NAS storage. We think these systems will at times deliver great performance (see Amitabha's results) but will almost always shine in the price/performance categories.

    Sun Storage 7000 Performance invariants



    I see many reports about running campaigns of tests measuring performance over a test matrix. One problem with this approach is of course the matrix: it is never big enough for the consumer of the information ("can you run this instead?").

    A more useful approach is to think in terms of performance invariants. We all know that a 7.2K RPM disk drive can do 150-200 IOPS as an invariant, and that disks have throughput limits such as 80MB/sec. Thinking in terms of those invariants helps in extrapolating performance data (with caution), and observing a breakdown in an invariant is often a sign that something else needs to be root caused.

    So, using 11 metrics and our performance engineering effort, what can our guiding invariants be? Bear in mind that these are expected to be rough estimates. For real measured numbers check out Amitabha Banerjee's excellent post on Analyzing the Sun Storage 7000.

    Streaming : 1 GB/s on server and 110 MB/sec on client

    For read streaming, we're observing that 1GB/s is roughly our guiding number. This can be achieved with a fairly small number of clients and threads, but it will be easier to reach if the data is prestaged in the server caches. A client running a 1Gbe network card is able to extract 110 MB/sec rather easily. Read streaming will be easier to achieve with the larger 128K records, probably due to the lower CPU demand. While our results are with regular 1500-byte Ethernet frames, using jumbo frames will make this limit easier to reach or even break. For a mirrored pool, data needs to be sent twice to the storage, and we see a reduction of about 50% for write streaming workloads.

    Random Read I/Os per second : 150 random read IOPS per mirrored disks

    This is probably a good guiding light as well. When going to disks, that is a reasonable expectation. But caching can radically change this: since we can configure up to 128GB of host RAM and 4 times that much in secondary caches, there are opportunities to break this barrier. Still, when going to spindles it needs to be kept in mind. We also know that RAID-Z spreads each record across all disks of a group, so the 150 IOPS limit basically applies per RAID-Z group. Do plan to have many groups to service random reads.
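
    A back-of-the-envelope application of this invariant is sketched below; the disk count and RAID-Z group width are arbitrary example values, and cache hits in RAM or the read-optimized SSDs can push delivered IOPS well beyond these spindle-bound numbers.

        /* Rough random-read IOPS expectation when requests miss the caches. */
        #include <stdio.h>

        int
        main(void)
        {
            int disks = 40;             /* example data-disk count */
            int iops_per_disk = 150;    /* 7.2K RPM invariant */
            int raidz_width = 5;        /* example RAID-Z group width */

            /* Mirrors: each disk can service an independent random read. */
            printf("mirrored pool: ~%d IOPS\n", disks * iops_per_disk);

            /* RAID-Z: a block spans the whole group, so roughly one
             * random read per group at a time. */
            printf("raid-z pool  : ~%d IOPS\n",
                (disks / raidz_width) * iops_per_disk);
            return (0);
        }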

    Random Read I/Os per second using SSDs : 3100 Read IOPS per Read Optimized SSD

    In some instances, data evicted from main memory will be kept in the secondary cache. Small files and filesystems with a tuned recordsize are good target workloads for this. Those read-optimized SSDs can restitute this data at a rate of 3100 IOPS (see the L2ARC). More importantly, they can do so at much reduced latency, meaning that lightly threaded workloads will be able to achieve high throughput.

    Synchronous writes per second : 5000-9000 Synchronous write per Write Optimized SSD

    Synchronous writes can be generated by O_DSYNC writes (a database) or simply as part of the NFS protocol (such as the tar extract open, write, close workload). Those will reach the NAS server and be coalesced in a single transaction with the separate intent log. The SSDs are great latency accelerators but are still devices with a maximum throughput of around 110 MB/sec. However, our code actually detects when the SSD devices become the bottleneck and will divert some of the I/O requests to the main storage pool. The net of all this is a complex equation, but we've easily observed 5000-8000 synchronous writes per second per SSD with up to 3 devices (or 6 in mirrored pairs). Using a smaller working set, which creates less competition for CPU resources, we've even observed 48K synchronous writes per second.



    Cycles per Bytes : 30-40 cycles per byte for NFS and CIFS

    Once we include the full NFS or CIFS protocol, the efficiency was observed to be in the 30-40 cycles per byte range (8 to 10 of those coming from the pure network component at the regular 1500-byte MTU). More studies are required to figure out the extent to which this is valid, but it's an interesting way to look at the problem. Having to run disk I/O, versus being serviced directly from cached data, is expected to add another 10-20 cycles per byte. Obviously for metadata tests, in which a small number of bytes is transferred per operation, we probably need to come up with a cycles/MetaOps invariant, but that is still TBD.
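
    This invariant translates directly into how much CPU a given throughput costs, on the client as well as on the server. A hedged back-of-the-envelope follows; the 1 GB/sec target and 2GHz core speed are just example figures.

        /* CPU needed to sustain a throughput at a given cycles/byte cost. */
        #include <stdio.h>

        int
        main(void)
        {
            double throughput = 1e9;      /* 1 GB/sec target */
            double cyc_per_byte = 35;     /* mid-range of 30-40 for NFS/CIFS */
            double core_hz = 2e9;         /* 2GHz cores */

            /* Prints ~17.5: consistent with the earlier estimate that a
             * client needs on the order of 16 x 2GHz CPUs to drive 10Gbps. */
            printf("~%.1f cores worth of 2GHz CPU\n",
                throughput * cyc_per_byte / core_hz);
            return (0);
        }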

    Single Client NFS throughput : 1 TCP Window per round trip latency.

    This is one fundamental rule of network throughput, but it's a good occasion to refresh it in everyone's mind. Clients, at least Solaris clients, will establish a single TCP connection to a server. On that connection there can be a large number of unrelated requests, as NFS is a very scalable protocol. However, a single connection will transport data at a maximum speed of a "socket buffer" divided by the round-trip latency. Since today's network speeds, particularly in wide area networks, have grown somewhat faster than the default socket buffers, we can see such things become performance bottlenecks. Given that I work in Europe but my test systems are often located in California, I might be a little more sensitive than most to this fact. So one important change we made early on in this project was to simply bump up the default socket buffers in the 7000 line to 1MB. For read throughput under similar conditions, we can only advise you to do the same to your client infrastructure.
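
    The rule is simply throughput <= window / round-trip time; the quick calculation below shows why the 1MB buffer matters. The RTT values and the 64KB "small default" are example figures, not measurements of any particular client.

        /* Single-connection throughput ceiling: socket buffer / RTT. */
        #include <stdio.h>

        static double
        mb_per_sec(double window_bytes, double rtt_ms)
        {
            return (window_bytes / (rtt_ms / 1000.0) / 1e6);
        }

        int
        main(void)
        {
            printf("64KB window, 10ms WAN RTT : %8.1f MB/sec\n",
                mb_per_sec(64 * 1024, 10));
            printf("1MB  window, 10ms WAN RTT : %8.1f MB/sec\n",
                mb_per_sec(1024 * 1024, 10));
            printf("1MB  window, 0.2ms LAN RTT: %8.1f MB/sec\n",
                mb_per_sec(1024 * 1024, 0.2));
            return (0);
        }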

    Using ZFS as a Network Attach Controller and the Value of Solid State Devices



    So Sun is coming out today with a line of Sun Storage 7000 systems that have ZFS as the integrated volume and filesystem manager, using both read- and write-optimized SSDs. What is this Hybrid Storage Pool, and why is it a good performance architecture for storage?

    A write-optimized SSD is a device custom designed for the purpose of accelerating operations of the ZFS intent log (ZIL). The ZIL is the part of ZFS that manages the important synchronous operations, guaranteeing that such writes are acknowledged quickly to applications while guaranteeing persistence in case of an outage. Data stored in the ZIL is also kept in memory until ZFS issues the next transaction group (every few seconds).

    The ZIL is what stores data urgently (when an application is waiting), while the TXG is what stores data permanently. The ZIL's on-disk blocks are only ever re-read after a failure such as a power outage. So the SSDs that are used to accelerate the ZIL are write-optimized: they need to handle data at low latency on writes; reads are unimportant.

    The TXG is an operation that is asynchronous to applications: apps are generally not waiting for transaction groups to commit. The exception is when data is generated at a rate that exceeds the TXG rate for a sustained period of time; in this case we become throttled by the pool throughput. In NAS storage this will rarely happen, since network connectivity, even at GB/s, is still much less than what the storage is capable of, so we do not generate the imbalance.

    The important thing now is that in a NAS server, the controller is also running a file-level protocol (NFS or CIFS) and so knows about the nature (synchronous or not) of the requested writes. As such it can use the accelerated path (the SSD) only for the components of the workloads that need it. Less competition for these devices means we can deliver both high throughput and low latency together in the same consolidated server.

    But here is where it gets nifty. At times, a NAS server might receive a huge synchronous request. We've observed this, for instance, due to fsflush running on clients, which can turn non-synchronous writes into a massive synchronous one. I note that a way to reduce this effect is to tune up the fsflush interval (to say 600). This is commonly done to reduce the CPU usage of fsflush, but it will also be welcome when clients interact with NAS storage. We can also disable page flushing entirely by setting dopageflush to 0. But that is a client issue; from the perspective of the server, we still need, as a NAS, to manage large commit requests.

    When subject to such a workload, say a 1GB commit, ZFS, being fully aware of the situation, can decide to bypass the SSD device and issue the requests straight to the disk-based pool blocks. It does so for two reasons. One is that the pool of disks in its entirety has more throughput capability than the few write-optimized SSDs, so we will service the request faster. But more importantly, the value of the SSD is in its latency reduction aspect; leaving the SSDs available to service many low latency synchronous writes is what is considered valuable here. Another way to say this is that large writes are generally well served by regular disk operations (they are throughput bound), whereas small synchronous writes (latency bound) can and will get help from the SSDs.

    Caches at work

    On the read path we also have custom designed read-optimized SSDs to fit into these OpenStorage platforms. At Sun, we just believe that many workloads will naturally lend themselves to caching technologies. In a consolidated storage solution, we can offer up to 128GB of primary memory-based caching and approximately 500GB of SSD-based caching.

    We also recognized that the latency delta between a memory-cached response and a disk response was just too steep. By inserting a layer of SSDs between memory and disk, we have an intermediate step providing lower latency access than disk to a working set that is now many times greater than memory.

    It's important here to understand how and when these read-optimized SSDs will work. The first thing to recognize is that the SSDs have to be primed with data: they feed off data being evicted from the primary caches, so their effect will not be seen immediately at the start of a benchmark. Second, one of the values of a read-optimized SSD is truly in low latency responses to small requests. Small requests here means things on the order of 8K in size. Such requests will occur either when dealing with small files (~8K) or when dealing with larger files accessed by a fixed-record application, typically a database. For those applications it is customary to set the recordsize, and this will allow these new SSDs to become more effective.

    Our read-optimized SSD can service up to 3000 read IOPS (see Brendan's work on the L2ARC), which is close to or better than what a 24 x 7.2K RPM disk JBOD can do. But the key point is that the low latency response means it can do so using many fewer threads than would be necessary to reach the same level on a JBOD. Brendan demonstrated here that the response time of these devices can be 20 times faster than disks, and 8 to 10 times faster from the client's perspective. So once data is installed in the SSD, users will see their requests serviced much faster, which means we are less likely to be subject to queuing delays.

    The use of read-optimized SSDs is configurable in the appliance. Users should learn to identify the parts of their datasets that end up gated by lightly threaded read response time. For those workloads, enabling the secondary cache is the way to deliver the value of the read-optimized SSD. For those filesystems, if the workload contains small files (such as 8K) there is no need to tune anything; however, for large files accessed in small chunks, setting the filesystem recordsize to 8K is likely to produce the best response time.

    Another benefit of these SSDs is in the $/IOPS case. Some workloads are just IOPS hungry while not necessarily huge block consumers. The SSD technology offers great advantages in this space, where a single SSD can deliver the IOPS of a full JBOD at a fraction of the cost. So with workloads that are more modestly sized but IOPS hungry, a test drive of the SSDs will be very interesting.

    It's also important to recognize that these systems are used in consolidation scenarios. It can be that some applications will be sped up by the read- or write-optimized SSDs or by the large memory-based caches, while other consolidated workloads exercise other components.

    There is another interesting implication of using SSDs in the storage with regard to clustering. The read-optimized SSDs, acting as caching layers, never contain critical data. This means those SSDs can go into the disk slots of the head nodes, since there is no data to be failed over. On the other hand, write-optimized SSDs store data associated with the critical synchronous writes. But since those are located in dual-ported backend enclosures, not the head nodes, it implies that, during clustered operations, storage head nodes do not have to exchange any user-level data.

    So by using ZFS and read- and write-optimized SSDs, we can deliver low latency writes for applications that rely on them, and good throughput for the synchronous and non-synchronous cases, using cost-effective SATA drives. Similarly, on the read side, the large amount of primary and secondary cache enables delivering high IOPS at low latency (even if the workload is not highly threaded), and it can do so using the more cost- and energy-efficient SATA drives.

    Our architecture allows us to take advantage of the latency accelerators while never being gated by them.

    Designing Performance Metrics for Sun Storage 7000

    One of the necessary checkpoints before launching a product is to be able to assess its performance. With Sun Storage 7xxx we had a challenge in that the only NFS benchmark of notoriety was SPEC SFS. Now, this benchmark has its supporters, and some customers might be attached to it, but it's important to understand what a benchmark actually says.

    The SFS benchmark is a lot about "cache busting" the server. This is interesting, but at Sun we think that caches are actually helpful in real scenarios. Data goes in cycles in which it becomes hot at times; retaining that data in cache layers allows much lower latency access and much better human interaction with storage engines. Being a cache busting benchmark, SFS numbers end up as a measure of the number of disk rotations attached to the NAS server, so a good SFS result requires hundreds or thousands of expensive, energy hungry 15K RPM spindles. To deliver good IOPS, layers of caching are more important to the end user experience and the cost efficiency of the solution.

    So we needed another way to talk about performance. Benchmarks tend to test the system in peculiar ways that do not necessarily reflect the workloads each customer is actually facing. There are very many workload generators for I/O, but one interesting one that is open source and extensible is Filebench, available in source.

    So we used Filebench to gather basic performance information about our system, with the hope that customers will then use Filebench to generate profiles that map to their own workloads. That way, different storage options can be tested on hopefully more meaningful tests than benchmarks.

    Another challenge is that a NAS server interacts with client systems that themselves keep a cache of the data. Given that we wanted to understand the back-end storage, we had to set up the tests to avoid client-side caching as much as possible. So, for instance, between the phase of file creation and the phase of actually running the tests, we needed to clear the client caches and at times the server caches as well. These possibilities are not readily accessible with the simplest load generators, and we had to do this in rather ad hoc fashion. One validation of our runs was to ensure that the amount of data transferred over the wire, observed with Analytics, was compatible with the aggregate throughput measured at the clients.

    Still another challenge was that we needed to test a storage system designed to interact with a large number of clients. Again, load generators are not readily set up to coordinate multiple clients and gather global metrics. During the course of the effort Filebench did come up with a clustered mode of operation, but we were too far engaged in our path to take advantage of it.

    This coordination of clients is important because the performance information we want to report is the one actually delivered to the clients. Each client will report its own value for a given test, and our tool will sum up the numbers; but such a sum is only valid inasmuch as the tests ran on the clients in the same timeframe. The possibility of skew between tests is something that needs to be monitored by the person running the investigation.

    One way we increased this coordination was by dividing our tests into 2 categories: those that required precreated files, and those that created files during the timed portion of the runs. If not handled properly, file creation would actually cause important skew in the results. The option we pursued was to have a pre-creation phase of files that was done once. From that point, our full set of metrics could be run and repeated many times with much less human monitoring, leading to better reproducibility of results.

    Another goal of this effort was to be able to run our standard set of metrics in a relatively short time, say less than 1 hour. In the end we got that to about 30 minutes per run to gather 10 metrics. Having a short run time is important because there are lots of possible ways such tests can be misrun; having someone watch over the runs is critical to the value of the output and to its reproducibility. So after having run the pre-creation of files offline, one could run many repeated instances of the tests, validating the runs with Analytics and, through general observation of the system, gaining some insight into the meaning of the output.

    At this point we were ready to define our metrics.

    Obviously we needed streaming reads and writes. We needed random reads. We needed small synchronous writes, important to database workloads and to the NFS protocol. Finally, small file creation and stat operations completed the mix. For random reading we also needed to distinguish between operating from disks and from storage-side caches, an important aspect of our architecture.

    Another thing on my mind was that this is not a benchmark. That means we would not be trying to fine-tune the metrics in order to find out just exactly what optimal number of threads and request size leads to the best possible performance from the server. This is not the way your workload is set up. Your number of running client threads is not elastic at will; your workload is what it is (threading included); the question is how fast it is being serviced.

    So we defined precise per-client workloads with a preset number of threads running the operations. We came up with this set just as an illustration of what representative loads could be:
        1- 1 thread streaming reads from 20G uncached set, 30 sec. 
        2- 1 thread streaming reads from same set, 30 sec.
        3- 20 threads streaming reads from 20G uncached set, 30 sec.
        4- 10 threads streaming reads from same set, 30 sec.
        5- 20 threads 8K random read from 20G uncached set, 30 sec.
        6- 128 threads 8K random read from same set, 30 sec.
        7- 1 thread streaming write, 120 sec
        8- 20 threads streaming write, 120 sec
        9- 128 threads 8K synchronous writes to 20G set, 120 sec
        10- 20 threads metadata (fstat) IOPS from pool of 400k files, 120 sec
        11- 8 threads 8K file create IOPS, 120 sec. 
    


    For each of the 11 metrics, we can propose a mapping to relevant industries:
         1- Backups, Database restoration (source), DataMining , HPC
         2- Financial/Risk Analysis, Video editing, HPC
         3- Media Streaming, HPC
         4- Video Editing
         5- DB consolidation, Mailserver, generic fileserving, Software development.
         6- DB consolidation, Mailserver, generic fileserving, Software development.
         7- User data Restore (destination)
         8- Financial/Risk Analysis, backup server
         9- Database/OLTP
         10- Web 2.0, Mailserver/Mailstore, Software Development
         11- Web 2.0, Mailserver/Mailstore, Software Development 
    


    We managed to get all these tests running except the fstat test (test 10), due to a technicality in Filebench. Filebench insisted on creating the files up front, and this test required thousands of them; moreover, Filebench used a method that ended up single threaded to do so, and in the end the stat information was mostly cached on the client. While we could have plowed through some of these issues, their conjunction made us put the fstat test aside for now.

    Concerning thread counts, we figured that the single stream read test was at times critical (for administrative purposes) and an interesting measure of latency. Tests 1 and 2 were defined this way, with test 1 starting with cold client and server caches and test 2 continuing the run after having cleared the client cache (but not the server's), thus showing the boost from server-side caching. Tests 3 and 4 are similarly defined with more threads involved, for instance to mimic a media server. Tests 5 and 6 did random read tests, again with test 5 starting with a cold server cache and test 6 continuing with some of the data precached by test 5. Here we did have to deal with client caches, trying to ensure that we don't hit the client cache too much as the run progresses. Tests 7 and 8 showcased streaming writes for single and 20 streams (per client). Reproducibility of tests 7 and 8 is more difficult, we believe, because of the client-side fsflush issue; we found that we could get more stable results by tuning fsflush on the clients. Test 9 is the all-important synchronous write case (for instance a database). This test truly showcases the benefit of our write-side SSDs and also shows why tuning the recordsize to match ZFS records with DB accesses is important. Test 10 was inoperative as mentioned above, and test 11, file create, completes the set.

    Given that these were predefined test definitions, we're very happy to see that our numbers actually came out really well, particularly for the mirrored configs with write-optimized SSDs. See for instance the results obtained by Amitabha Banerjee.

    I should add that these can now be used to give ballpark estimates of the capability of the servers. They were not designed to deliver the topmost numbers from any one config. The variability of the runs is at times greater than we'd wish, so your mileage will vary. Using Analytics to observe the running system can be quite informative and a nice way to actually demo that capability. So use the output with caution and use your own judgment when it comes to performance issues.

    Tuesday, Nov. 04, 2008

    People ask: where are we with ZFS performance ?

    The standard answer to any computer performance question is almost always : "it depends" which is semantically equivalent to "I don't know". The better answer is to state the dependencies.


    I would certainly like to see every performance issue studied with a scientific approach. OpenSolaris and DTrace are just incredible enablers when trying to reach root cause, and finding those causes is really the best way to work toward delivering improved performance. More generally though, people use common wisdom or possibly faulty assumptions to match their symptoms with those of other similar reported problems. And, as human nature has it, we'll easily blame the component we're least familiar with for problems. So we often end up with a lot of reports of ZFS performance that, once drilled down, turn out to be either totally unrelated to ZFS (say HW problems), or misconfigurations, departures from best practices or, at times, unrealistic expectations.


    That does not mean there are no issues. But it's important that users can more easily identify known issues, schedules for fixes, workarounds, etc. So anyone deploying ZFS should really be familiar with those 2 sites: the ZFS Best Practices and Evil Tuning guides.


    That said, what are the real, commonly encountered performance problems I've seen, and where do we stand?


    Writes overrunning memory


    That is a real problem that was fixed last March and is integrated in the Solaris U6 release. Running out of memory causes many different types of complaints and erratic system behavior. This can happen any time a lot of data is created and streamed at a rate greater than that which can be set into the pool. Solaris U6 will be an important shift for customers running into this issue. ZFS will still try to use memory to cache your data (a good thing), but the competition this creates for memory resources will be much reduced. The way ZFS is designed to deal with this contention (ARC shrinking) will need a new evaluation from the community; the lack of throttling was a great impairment to the ability of the ARC to give back memory under pressure. In the meantime lots of people are capping their ARC size with success, as per the Evil Tuning Guide.


    For more on this topic check out : The new ZFS write throttle


    Cache flushes on SAN storage


    This is a common issue we hit in the enterprise. Although it will cause ZFS to be totally underwhelming in terms of performance, it's interestingly not a sign of any defect in ZFS. Sadly this touches the customers that are the most performance minded. The issue is somewhat related to ZFS and somewhat to the storage. As is well documented elsewhere, ZFS will, at critical times, issue "cache flush" requests to the storage elements on which it is layered. This is to take into account the fact that storage can be layered on top of _volatile_ caches that do need to be set on stable storage for ZFS to reach its consistency points. Enterprise storage arrays do not use _volatile_ caches to store data and so should ignore the request from ZFS to "flush caches". The problem is that some arrays don't. This misunderstanding between ZFS and storage arrays leads to underwhelming performance. Fortunately we have an easy workaround that can be used to quickly identify whether this is indeed the problem: setting zfs_nocacheflush (see the Evil Tuning Guide). The best fix is to configure the storage to indeed ignore "cache flush" requests, and we also have the option of tuning sd.conf on a per-array basis. Refer again to the Evil Tuning Guide for more detailed information.


    NFS slow over ZFS (Not True)


    This is just not generally true, and it is often a side effect of the previous cache flush problem. People have used storage arrays to accelerate NFS for a long time but failed to see the expected gains with ZFS. Many sightings of NFS problems are traced to this.


    Other sightings involve common disks with volatile caches. Here the performance deltas observed are rooted in the stronger semantics that ZFS offers under this operational model. See NFS and ZFS for a more detailed description of the issue.


    While I don't consider ZFS generally slow at serving NFS, we did identify in recent months a condition that affects high thread counts of synchronous writes (such as a DB). This issue is fixed in Solaris 10 Update 6 (CR 6683293).


    I would encourage you to be familiar with where we stand regarding ZFS and NFS, because I know of no big gaping ZFS-over-NFS problems (if there were one, I think I would know). People just need to be aware that NFS is a protocol that needs some type of acceleration (such as NVRAM) in order to deliver a user experience close to what a direct attached filesystem provides.


    ZIL is a problem (Not True)


    There is a wide perception that the ZIL is the source of performance problems. This is just a naive interpretation of the facts. The ZIL serves a very fundamental purpose in the filesystem and does so admirably well. Disabling the synchronous semantics of a filesystem will necessarily lead to higher performance in a way that is totally misleading to the outside observer. So while we are looking at further ZIL improvements for large scale problems, the ZIL is just not today the source of common problems. So please don't disable it unless you know what you're getting into.


    Random read from Raid-Z


    Raid-Z is a great technology that allows blocks to be stored on top of common JBOD storage without being subject to the raid-5 write hole corruption (see: http://blogs.sun.com/bonwick/entry/raid_z). However, the performance characteristics of raid-z depart significantly enough from raid-5 to surprise first time users. Raid-Z as currently implemented spreads each block to the full width of the raid group and thus creates extra IOPS during random reading. At lower loads, the latency of operations is not impacted, but sustained random read loads can suffer. However, workloads that end up with frequent cache hits will not be subject to the same penalty as workloads that access vast amounts of data more uniformly. This is where one truly needs to say, "it depends".


    Interestingly, the same problem does not affect Raid-Z streaming performance and won't affect workloads that commonly benefit from caching. That said, both random and streaming performance are perfectible, and we are looking at a number of different ways to improve the situation. To better understand Raid-Z, see one of my very first ZFS entries on this topic: Raid-Z


    CPU consumption, scalability and benchmarking


    This is an area where we will need to do more study. With today's very capable multicore systems, there are many workloads that won't suffer from the CPU consumption of ZFS. Most systems are not 100% CPU bound (being more generally constrained by disks, networks or application scalability) and the user visible latency of operations is not strongly impacted by the extra cycles spent in, say, ZFS checksumming.


    However, this view breaks down when it comes to system benchmarking. Many benchmarks I encounter (the most carefully crafted ones to boot) end up as host CPU efficiency benchmarks: how many operations can I do on this system, given large amounts of disk and network resources, while preserving some level X of response time? The answer to this question is essentially the inverse of the cycles spent per operation.


    This concern is more relevant when the CPU cycles spent managing direct attach storage and the filesystem are in direct competition with cycles spent in the application. This is also why database benchmarking is often associated with using raw devices, a practice much less encountered in common deployments.


    Root causing scalability limits and efficiency problems is just part of the never ending performance optimisation of filesystems.


    Direct I/O


    Directio has been a great enabler of database performance in other filesystems. The problem for me is that Direct I/O is a group of improvements, each with its own contribution to the end result. Some want the concurrent writes, some want to avoid a copy, some want to avoid double caching, some don't know but see performance gains when it is turned on (some also see a degradation). I note that concurrent writes have never been a problem in ZFS and that the extra copy used when managing a cache is generally cheap considering common DB rates of access. Achieving greater CPU efficiency is certainly a valid goal and we need to look into what is impacting this in common DB workloads. In the meantime, ZFS in OpenSolaris got a new feature to manage the cacheability of data in the ZFS ARC. The per filesystem "primarycache" property will allow users to decide if blocks should actually linger in the ARC cache or just be transient. This will allow DBs deployed on ZFS to avoid any form of double caching that might have occurred in the past.
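
    For illustration, a minimal sketch of the property in action (the pool and filesystem names are hypothetical):

    	# keep only metadata in the ARC for this filesystem's blocks
    	zfs set primarycache=metadata tank/db
    	zfs get primarycache tank/db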


    ZFS performance is and will be a moving target for some time to come. Solaris 10 Update 6, with a new write throttle, will be a significant change, and then OpenSolaris offers additional advantages. But generally, just be skeptical of any performance issue that is not root caused: the problem might not be where you expect it.


    Wednesday May 14, 2008

    The new ZFS write throttle

    A very significant improvement is coming soon to ZFS. A change that will increase the general quality of service delivered by ZFS. Interestingly it's a change that might also slow down your microbenchmark but nevertheless it's a change you should be eager for.


    Write throttling


    For a filesystem, write throttling designates the act of blocking applications for some amount of time, as short as possible, while waiting for the proper conditions to allow their write system calls to succeed. Write throttling is normally required because applications can write to memory (dirty memory pages) at a rate significantly faster than the kernel can flush the data to disk. Many workloads dirty memory pages by writing to the filesystem page cache at near memory copy speed, possibly using multiple threads issuing high rates of filesystem writes. Concurrently, the filesystem is doing its best to drain all that data to the disk subsystem.


    Given the constraints, the time to empty the filesystem cache to disk can be longer than the time required for applications to dirty the cache. Even if one considers storage with fast NVRAM, under sustained load, that NVRAM will fill up to a point where it needs to wait for a slow disk I/O to make room for more data to get in.


    When committing data to a filesystem in bursts, it can be quite desirable to push the data at memory speed and then drain the cache to disk during the lapses between bursts. But when data is generated at a sustained high rate, lack of throttling leads to total memory depletion. We thus need at some point to try and match the application data rate with that of the I/O subsystem. This is the primary goal of write throttling.
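
    A hypothetical example of such a sustained writer, the kind of load the throttle exists for (the path is made up):

    	# dirties the cache far faster than most disk subsystems can drain it
    	dd if=/dev/zero of=/tank/fs/bigfile bs=1024k count=100000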


    A secondary goal of write throttling is to prevent massive data loss. When applications do not manage I/O synchronization (i.e. don't use O_DSYNC or fsync), data ends up cached in the filesystem and the contract is that there is no guarantee the data will still be there if a system crash were to occur. So even if the filesystem cannot be blamed for such data loss, it is still a nice feature to help prevent such massive losses.


    Case in point : UFS Write throttling


    For instance, UFS would use the fsflush daemon to try to keep data exposed for no more than 30 seconds (the default value of autoup). Also, UFS would keep track of the amount of I/O outstanding for each file. Once too much I/O was pending, UFS would throttle writers for that file. This was controlled through ufs_HW and ufs_LW, and their values were commonly tuned (a bad sign). Eventually the old default values were updated and seem to work nicely today. UFS write throttling thus operates on a per file basis. While there are some merits to this approach, it can be defeated, as it does not manage the imbalance between memory and disks at a system level.


    ZFS Previous write throttling


    ZFS is designed around the concept of transaction groups (txg). Normally, every 5 seconds an _open_ txg goes to the quiesced state. From that state, the quiesced txg will go to the syncing state, which sends dirty data to the I/O subsystem. For each pool, there is at most one txg in each of the 3 states: open, quiescing, syncing. Write throttling used to occur when the 5 second txg clock fired while the syncing txg had not yet completed: the open group would wait on the quiesced one, which waited on the syncing one. Application writers (write system call) would block, possibly for a few seconds, waiting for a txg to open. In other words, if a txg took more than 5 seconds to sync to disk, we would globally block writers, thus matching their speed with that of the I/O. But if a workload had a bursty write behavior that could be synced during the allotted 5 seconds, applications would never be throttled.


    The Issue


    But ZFS did not sufficiently control the amount of data that could get into an open txg. As long as the ARC cache was no more than half dirty, ZFS would accept data. For a large memory machine, or one with weak storage, this was likely to cause long txg sync times. The downsides were many:


    	- If we did end up throttled, long sync times meant the system
    	behavior would be sluggish for seconds at a time.

    	- Long txg sync times also meant that the granularity at which
    	we could generate snapshots was impacted.

    	- We ended up with lots of pending data in the cache, all of
    	which could be lost in the event of a crash.

    	- The ZFS I/O scheduler, which prioritizes operations, was also
    	negatively impacted.

    	- By not throttling, we had the possibility that sequential
    	writes to large files could displace a very large number of
    	smaller objects from the ARC. Refilling that data later meant a
    	very large number of disk I/Os. Not throttling can
    	paradoxically end up being very costly for performance.

    	- The previous code could also, at times, not issue I/Os to
    	disk for seconds even though the workload was critically
    	dependent on storage speed.

    	- And foremost, lack of throttling depleted memory and prevented
    	ZFS from reacting to memory pressure.
    
    That ZFS is considered a memory hog is most likely the result of the previous throttling code. Once a proper solution is in place, it will be interesting to see if we behave better on that front.


    The Solutions


    The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory.


    And to avoid the system wide and seconds long throttle effect, the new code will detect when we are dangerously close to that situation (7/8th of the limit) and will insert 1 tick delays for applications issuing writes. This prevents a write intensive thread from hogging the available space starving out other threads. This delay should also generally prevent the system wide throttle.
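
    As a rough worked example of those limits (the memory size is arbitrary, chosen only to make the fractions concrete):

    	physical memory                      : 32 GB
    	per-TXG clamp (1/8th of memory)      :  4 GB
    	delay threshold (7/8th of the clamp) :  3.5 GB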


    So the new steady state behavior of write intensive workloads is that, starting with an empty TXG, all threads will be allowed to dirty memory at full speed until a first threshold of bytes in the TXG is reached. At that point, every write system call will be delayed by 1 tick, thus significantly slowing down the pace of writes. If the previous TXG completes its I/Os, then the current TXG will be allowed to resume at full speed. But in the unlikely event that a workload, despite the per-write 1-tick delay, manages to fill the TXG up to the full threshold, we will be forced to throttle all writes in order to allow the storage to catch up.


    It should make the system much better behaved and generally more performant under sustained write stress.


    If you are the owner of an unlucky workload that ends up slowed down by the new throttling, do consider the other benefits that you get from the new code. If those do not compensate for the loss, get in touch and tell us what your needs are on that front.


    Monday January 08, 2007

    NFS and ZFS, a fine combination


    No doubt there is still a lot to learn about ZFS as an NFS server, and this entry will not delve deeply into that topic. What I'd like to dispel here is the notion that ZFS can cause some NFS workloads to exhibit pathological performance characteristics.


    The Sightings


    Since there have been a few perceived 'sightings' of such slowdowns, a little clarification is in order. Large slowdowns would typically be reported when looking at a single threaded load, probably doing small file creation such as 'tar xf many_small_files.tar'.


    For instance, I've run a small such test over a 72G SAS drive.


    	tmpfs   :  0.077 sec
    	ufs     :  0.25  sec
    	zfs     :  0.12  sec
    	nfs/zfs : 12     sec
    


    There are a few things to observe here. Local filesystem services have a huge advantage for this type of load: in the absence of a specific request by the application (e.g. tar), local filesystems can lose your data and no one will complain. This is data loss, not data corruption, and this generally accepted data loss will occur in the event of a system crash. The argument being that if you need a higher level of integrity, you need to program it into applications using O_DSYNC, fsync, etc. Many applications are not that critical and avoid such burden.


    NFS and COMMIT


    On the other hand, the nature of the NFS protocol is such that the client _must_ at some specific point request that the server place previously sent data onto stable storage. This is done through an NFSv3 or NFSv4 COMMIT operation. The COMMIT operation is a contract between clients and servers that allows the client to forget about its previous historical interaction with the file. In the event of a server crash/reboot, the client is guaranteed that previously committed data will be returned by the server. Operations since the last COMMIT can be replayed after a server crash in a way that ensures a coherent view between everybody involved.


    But this all topples over if the COMMIT contract is not honored. If a local filesystem does not properly commit data when requested to do so, there is no longer any guarantee that the client's view of files will be what it would otherwise normally expect. Despite the fact that the client has completed the 'tar x' with no errors, it can happen that some of the files are missing in full or in part.


    With local filesystems, a system crash is plainly obvious to users and requires applications to be restarted. With NFS, a server crash is not obvious to users of the service (the only sign being a lengthy pause), and applications are not notified. The fact that files or parts of files may go missing in the absence of errors can be considered plain corruption of the client's side view.


    When the underlying filesystem serving NFS ignores COMMIT requests, or when the storage subsystem acknowledges I/Os before they reach stable storage, what is potential data loss on the server becomes corruption of the client's point of view.


    It turns out that in NFSv3/NFSv4 the client will request a COMMIT on close. Moreover, the NFS server itself is required to commit on metadata operations; for NFSv3 that is on:


    	SETATTR, CREATE, MKDIR, SYMLINK, MKNOD, 
    	REMOVE, RMDIR, RENAME, LINK
    


    and a COMMIT may be required on the containing directory.


    Expected Performance


    Let's imagine we find a way to run our load at 1 COMMIT (on close) per extracted file. The COMMIT means the client must wait for at least a full I/O latency, and since 'tar x' processes the tar file from a single thread, that implies that we can run our workload at a maximum rate (assuming infinitely fast networking) of one extracted file per I/O latency, or about 200 extracted files per second on modern disks. If the files to be extracted are 1K in average size, the tar x will proceed at a pace of 200K/sec. If we are required to issue 2 COMMIT operations per extracted file (for instance due to a server-side COMMIT on file create), that further halves the throughput number.
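
    Spelling out the arithmetic (the ~5 ms latency is simply what the 200 files/sec figure above implies for a modern disk):

    	I/O latency          : ~5 ms
    	1 COMMIT per file    : ~200 files/sec  ->  ~200K/sec at 1K per file
    	2 COMMITs per file   : ~100 files/sec  ->  ~100K/sec at 1K per file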


    However, if we had lots of threads extracting individual files concurrently, the performance would scale up nicely with the number of threads.


    But tar is single threaded, so what is actually going on here? The need to COMMIT frequently means that our thread must frequently pause for a full server side I/O latency. Because our single threaded tar is blocked, nothing is able to process the rest of our workload. If we allow the server to ignore COMMIT operations, then NFS responses will be sent earlier, allowing the single thread to proceed down the tar file at greater speed. One must realise that the extra performance is obtained at the risk of causing corruption from the client's point of view in the event of a crash.


    Whether or not the client or the server needs to COMMIT as often as it does is a separate issue. The existence of other clients that would be accessing the files needs to be considered in that discussion. The point being made here is that this issue is not particular to ZFS, nor does ZFS necessarily exacerbate the problem. The performance of single threaded writes to NFS will be throttled as a result of the NFS-imposed COMMIT semantics.


    ZFS Relevant Controls


    ZFS has two controls that come into this picture: the disk write caches and the zil_disable tunable. ZFS is designed to work correctly whether or not the disk write caches are enabled. This is achieved through explicit cache flush requests, which are generated (for example) in response to an NFS COMMIT. Enabling the write caches is then a performance consideration, and can offer performance gains for some workloads. This is not the same with UFS, which is not aware of the existence of a disk write cache and is not designed to operate with such a cache enabled. Running UFS on a disk with the write cache enabled can lead to corruption of the client's view in the event of a system crash.


    ZFS also has the zil_disable control. ZFS is not designed to operate with zil_disable set to 1. Setting this variable (before mounting a ZFS filesystem) means that O_DSYNC writes, fsync as well as NFS COMMIT operations are all ignored! We note that, even without a ZIL, ZFS will always maintain a coherent local view of the on-disk state. But by ignoring NFS COMMIT operations, it will cause the client's view to become corrupted (as defined above).


    Comparison with UFS


    In the original complaint, there was no comparison between a semantically correct NFS service delivered by ZFS to another similar NFS service delivered by another filesystem. Let's gather some more data:


    Local and memory based filesystems :
    
    	tmpfs   :  0.077 sec
    	ufs     :  0.25  sec
    	zfs     :  0.12  sec
    
    NFS service with risk of corruption of client's side view :
    
    	nfs/ufs :  7     sec (write cache enable)
    	nfs/zfs :  4.2   sec (write cache enable,zil_disable=1)
    	nfs/zfs :  4.7   sec (write cache disable,zil_disable=1)
    
    Semantically correct NFS service :
    
    	nfs/ufs : 17     sec (write cache disable)
    	nfs/zfs : 12     sec (write cache disable,zil_disable=0)
    	nfs/zfs :  7     sec (write cache enable,zil_disable=0)
    


    We note that with most filesystems we can easily produce an improper NFS service by enabling the disk write caches. In this case, a server-side filesystem may think it has committed data to stable storage, but the presence of an enabled disk write cache makes that assumption false. With ZFS, enabling the write caches is not sufficient to produce an improper service.


    Disabling the ZIL (setting zil_disable to 1 using mdb and then mounting the filesystem) is one way to generate an improper NFS service. With the ZIL disabled, commit requests are ignored, with potential corruption of the client's view.
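
    For completeness, a sketch of the manipulation being described (again, an experiment, not something to run on a production server):

    	# ignore synchronous semantics from now on -- improper NFS service
    	echo zil_disable/W0t1 | mdb -kw
    	# (then mount the ZFS filesystem under test)

    	# restore correct behavior
    	echo zil_disable/W0t0 | mdb -kw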


    Intelligent Storage


    A different topic is running ZFS on intelligent storage arrays. One known pathology is that some arrays will _honor_ the ZFS request to flush the write caches despite the fact that their caches are qualified as stable storage. In this case, NFS performance will be much, much worse than otherwise expected. On this topic and ways to work around this specific issue, see Jason's .Plan: Shenanigans with ZFS.


    Conclusion


    In many common circumstances, ZFS offers a fine NFS service that complies with all NFS semantics even with write caches enabled. If another filesystem appears much faster, I suggest first making sure that this other filesystem complies in the same way.


    This is not to say that ZFS performance cannot be perfected as clearly it can. The performance of ZFS is still evolving quite rapidly. In many situations, ZFS provides the highest throughput of any filesystem. In others, ZFS performance is highly competitive with other filesystems. In some cases, ZFS can be slower than other filesystems -- while in all cases providing end-to-end data integrity, ease of use and integrated services such as compression, snapshots etc.


    See Also Eric's fine entry on zil_disable

    Friday September 22, 2006

    ZFS and OLTP

    ZFS and Databases


    Given that we had started to build enough understanding of the internal dynamics of ZFS, I figured it was time to tackle the next hurdle: running a database management system (DBMS). Now, I know very little myself about DBMS, so I teamed up with people that have tons of experience with it, my colleagues from Performance Engineering (PAE), Neelakanth (Neel) Nagdir and Sriram Gummuluru, getting occasional words of wisdom from Jim Mauro as well.


    Note that UFS (with DIO) has been heavily tuned over the years to provide very good support for DBMS. We are just beginning to explore the tweaks and tunings necessary to achieve comparable performance from ZFS in this specialized domain.


    We knew that running a DBMS would be a challenge since a database tickles filesystems in ways that are quite different from other types of loads. We had 2 goals. Primarily, we needed to understand how ZFS performs in a DB environment and in what specific areas it needs to improve. Secondly, we figured that whatever came out of the work could be used as blog material, as well as best practice recommendations. You're reading the blog material now; also watch this space for Best Practice updates.



    Note that it was not a goal of this exercise to generate data for a world record press-release. (There is always a metric where this can be achieved.)


    Workload


    The workload we use in PAE to characterize DBMSes is called OLTP/Net. This benchmark was developed inside Sun for the purpose of engineering performance into DBMS. Modeled on common transaction processing benchmarks, it is OLTP-like but with a higher network-to-disk ratio. This makes it more representative of real world applications. Quoting from Neel's prose:


            "OLTP/Net, the New-Order transaction involves multi-hops as it
            performs Item validation, and inserts a single item per hop as
            opposed to block updates "
    


    I hope that means something to you; Neel will be blogging on his own, if you need more info.


    Reference Point


    The reference performance point for this work would be UFS (with VxFS being also an interesting data point, but I'm not tasked with improving that metric). For DB loads we know that UFS directio (DIO) provides a significant performance boost and that would be our target as well.


    Platform & Configuration


    Our platform was a Niagara T2000 (8 cores @ 1.2Ghz, 4 HW threads or strands per core) with 130 36GB disks attached in JBOD fashion. Each disk was partitioned into 2 equal slices, with half of the surface given to a Solaris Volume Manager (SVM) volume onto which UFS would be built, and the other half given to a ZFS pool.


    The benchmark was designed to not fully saturate either the CPU or the disks. While we know that performance varies between inner & outer disk surface, we don't expect the effect to be large enough to require attention here.


    Write Cache Enabled (WCE)


    ZFS is designed to work safely, whether or not a disk write-cache is enabled (WCE). This stays true if ZFS is operating on a disk slice. However, when given a full disk, ZFS will turn _ON_ the write cache as part of the import sequence. That is, it won't enable write cache when given only a slice. So, to be fair to ZFS capabilities we manually turned ON WCE when running our test over ZFS.
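
    A quick sketch of the distinction (device names are hypothetical):

    	# whole disk : ZFS enables the disk write cache itself
    	zpool create dbpool c1t0d0

    	# slice only : the write cache is left as it was found
    	zpool create dbpool c1t0d0s0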


    UFS is not designed to work with WCE and will put data at risk if WCE is set, so we needed to turn it off for the UFS runs. We had to toggle the setting between runs because we did not have enough disks to give each filesystem its own. The performance we measured is therefore what would be expected when giving full disks to either filesystem. We note that, for the FC devices we used, WCE does not provide ZFS a significant performance boost on this setup.


    No Redundancy


    For this initial effort we also did not configure any form of redundancy for either filesystem. ZFS RAID-Z does not really have an equivalent feature in UFS/SVM, so we settled on a simple stripe. We could eventually configure software mirroring on both filesystems, but we don't expect that would change our conclusions. Still, this will be interesting follow-up work.


    DBMS logging


    Another thing we know already is that a DBMS's log writer latency is critical to OLTP performance. So in order to improve on that metric, it's good practice to set aside a number of disks for the DBMS' logs. With this in hand, we managed to run our benchmark and get our target performance number (in relative terms, higher is better):


    
            UFS/DIO/SVM :           42.5
            Separate Data/log volumes
    


    Recordsize


    OK, so now we're ready. We load up Solaris 10 Update 2 (S10U2/ZFS), build a log pool and a data pool and get going. Note that log writers actually generate a pattern of sequential I/O of varying sizes. That should map quite well to ZFS out of the box. But for the DBMS' data pool, we expect a very random pattern of reads and writes to DB records. A commonly known ZFS best practice when servicing fixed record access is to match the ZFS recordsize property to that of the application. We note that UFS, by chance or by design, also works (at least on sparc) using 8K records.
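
    As a sketch of that setup (pool names, device names and the 8K DB block size are illustrative, not the exact configuration used):

    	zpool create logpool c2t0d0s0 c2t1d0s0
    	zpool create datapool c3t0d0s0 c3t1d0s0 c3t2d0s0 c3t3d0s0

    	# match the DB block size before creating the data files
    	zfs set recordsize=8k datapool
    	zfs get recordsize datapool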


    2nd run ZFS/S10U2


    So for a fair comparison, we set the recordsize to 8K for the data pool and run our OLTP/Net and....gasp!:


    
            ZFS/S10U2       :       11.0
            Data pool (8K record on FS)
            Log pool (no tuning)
    


    So that's no good and we have our work cut out for us.


    The role of Prefetch in this result


    To some extent we already knew of a subsystem that commonly misbehaves (and which is being fixed as we speak): the vdev level prefetch code (which I also refer to as the software track buffer). In this code, whenever ZFS issues a small read I/O to a device, it will, by default, go and fetch quite a sizable chunk of data (64K) located at the physical location being read. In itself, this should not increase the I/O latency, which is dominated by the head seek, and since the data is stored in a small fixed-size buffer, we don't expect this to eat up too much memory either. However, in a heavy-duty environment like we have here, every extra byte that moves up or down the data channel occupies valuable space. Moreover, for a large DB, we really don't expect the speculatively read data to be used very much. So for our next attempt we'll tune down the prefetch buffer to 8K.


    And the role of the vq_max_pending parameter


    But we don't expect this to be quite sufficient here. My DBMS savvy friends would tell me that the I/O latency of reads was quite large in our runs. Now, ZFS prioritizes reads over writes, so we thought we should be ok. However, during a pool transaction group sync, ZFS will issue quite a number of concurrent writes to each device; this is the vq_max_pending parameter, which defaults to 35. Clearly, during this phase, a read, even if prioritized, will take somewhat longer to complete.
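
    As a rough sketch of how those two knobs can be lowered with mdb, assuming the zfs_vdev_max_pending and zfs_vdev_cache_bshift kernel variables documented for later builds (the script actually used may well have poked different structures):

    	echo zfs_vdev_max_pending/W0t10 | mdb -kw    # 35 -> 10 outstanding I/Os per vdev
    	echo zfs_vdev_cache_bshift/W0t13 | mdb -kw   # vdev prefetch reads of 2^13 = 8K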


    3rd run, ZFS/S10U2 - tuned


    So I wrote up a script to tune those 2 ZFS knobs. We could then run with a vdev prefetch buffer of 8K and a vq_max_pending of 10. This boosted our performance almost 2X:


    
            ZFS/S10U2       :       22.0
            Data pool (8K record on FS)
            Log pool (no tuning)
            vq_max_pending : 10
            vdev prefetch : 8K
    


    But not quite satisfying yet.


    ZFS/S10U2 known bug


    We know something else about ZFS. In the last few builds before S10U2, a little bug made its way into the code base. The effect of this bug was that for a full record rewrite, ZFS would actually read in the old block even though the data was not needed at all. Shouldn't be too bad; perfectly aligned block rewrites of uncached data are not that common.... except for databases, bummer.


    So S10U2 is plagued with this issue affecting DB performance with no workaround. So our next step was to move on to ZFS latest bits.


    4th run ZFS/Build 44


    Build 44 of our next Solaris version has long had this particular issue fixed. There we topped our past performance with:


    
    
            ZFS/B44         :       33.0
            Data pool (8K record on FS)
            Log pool (no tuning)
            vq_max_pending : 10
            vdev prefetch : 8K
    


    Compare that to umpteen years of super-tuned UFS:


    
            UFS/DIO/SVM : 42.5
            Separate Data/log volumes
    


    Summary


    I think at this stage of ZFS, the results are neither great nor bad. We have achieved:


    
            UFS/DIO   : 100 %
            UFS       : xx   no directio (to be updated)
            ZFS Best  : 75%  best tuned config with latest bits. 
            ZFS S10U2 : 50%  best tuned config.
            ZFS S10U2 : 25%  simple tuning.
    


    To achieve acceptable performance levels:


    The latest ZFS code base. ZFS improves fast these days. We will need to keep tracking releases for a little while. The current OpenSolaris release, as well as the upcoming Solaris 10 Update 3 (this fall), should perform on these tests as well as the Build 44 results shown here.


    1 data pool and 1 log pool: it is common practice to partition HW resources when we want proper isolation. Going forward, I think we will eventually get to the point where this is not necessary, but it seems an acceptable constraint for now. Tuned vdev prefetch: the code is being worked on. I expect that in the near future this will not be necessary.


    Tuned vq_max_pending: that may take a little longer. In a DB workload, latency is key and throughput secondary. There are a number of ideas that need to be tested which will help ZFS improve both average latency and latency fluctuations. This will help both the intent log (O_DSYNC write) latency and reads.


    Parting Words


    As those improvement come out, they may well allow ZFS to catch or surpass our best UFS numbers. When you match that kind of performance with all the usability and data integrity features of ZFS, that's a proposition that becomes hard to pass up.

    Tuesday September 19, 2006

    Tuning the knobs

    A script is provided to tune some ZFS knobs.

    Wednesday July 12, 2006

    ZFS and Directio

    In view of the great performance gains that UFS gets out of the 'Directio' (DIO) feature, it is interesting to ask ourselves where exactly those gains come from, and whether ZFS can be tweaked to benefit from them in the same way.


    UFS Directio


    UFS Directio is actually a set of things bundled together that improves the performance of very specific workloads, most notably databases. Directio is a performance hint to the filesystem; apart from relaxing POSIX requirements, it does not carry any change in filesystem semantics. The users of directio assert the condition at the level of a full filesystem or of individual files, and the filesystem code is given extra freedom to run, or not, the tuned DIO codepath.
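
    For reference, the two ways of asserting it (the device path is hypothetical; the per-file form is the directio(3C) advisory call made from the application):

    	# filesystem wide, at mount time
    	mount -F ufs -o forcedirectio /dev/dsk/c1t0d0s6 /db

    	# per file, from the application :
    	#   directio(fd, DIRECTIO_ON);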


    What does that tuned code path get us? A few things:

    	- Output goes directly from the application buffer to disk,
    	  bypassing the filesystem core memory cache.

    	- The FS is no longer constrained to strictly obey the POSIX
    	  write ordering. The FS is thus able to allow multiple threads
    	  to concurrently issue I/Os to a single file.

    	- On input, UFS DIO refrains from doing any form of readahead.
    
    
    In a sense, by taking out the middleman (the filesystem cache), UFS/DIO causes files to behave a lot like a raw device. Application reads and writes map one to one onto individual I/Os.


    People often consider that the great gains DIO provides come from avoiding the CPU cost of the copy into system caches, and from avoiding the double buffering, once in the DB, once in the FS, that one gets in the non-directio case.


    I would argue that while the CPU cost associated with a copy certainly does exist, the copy will run very quickly compared to the time the ensuing I/O takes. So the impact of the copy would only appear on systems that have their CPUs quite saturated, notably for industry standard benchmarks. However real systems, which are more likely to be I/O constrained than CPU constrained, should not pay a huge toll for this effect.


    As for double buffering, I note that databases (or applications in general) are normally set up to consume a given amount of memory, and the FS operates using the remaining portion. Filesystems cache data in memory for lack of a better use of that memory, and they give up their hold whenever necessary. So the data is not really double buffered; rather, 'free' memory keeps a hold on recently issued I/O. Buffering data in 2 locations does not look like a performance issue to me.


    Anything for ZFS ?


    So what does that leave us with? Why is DIO so good? This tells me that we gain a lot from those 2 mantras:

    		don't do any more I/O than requested

    		allow multiple concurrent I/Os to a file

    I note that UFS readahead is particularly bad for certain usages; when UFS sees access to 2 consecutive pages, it will read a full cluster, and those are typically 1MB in size today. So avoiding UFS readahead has probably contributed greatly to the success of DIO. As for ZFS, there are 2 levels of readahead (a.k.a. prefetching): one file based and one device based. Both are being reworked at this stage. I note that the file-based readahead code has not behaved and will not behave like UFS's. On the other hand, device level prefetching is probably being overly aggressive for DB type loads and should be avoided. While I have not given up hope that this can be managed automatically, watch this space for tuning scripts to control the device prefetching behavior.


    DIO for input does not otherwise appear to be an interesting proposition: if the data is cached, I don't really see the gain in bypassing the cache (apart from slowing down the reads).


    As for writes, ZFS, out of the box, does not suffer from the single writer lock that UFS needs to implement the posix ordering rules. The transaction groups (TXG) are sufficient for that purpose (see The Dynamics of ZFS).



    This leaves us with the amount of I/O needed by the 2 filesystems when running many concurrent O_DSYNC writers issuing small writes to random file offsets.


    UFS actually handles this load by overwriting the data in its preallocated disk locations. Every 8K page is associated with a set place on the storage, and a write to that location means a disk head movement and an 8K output I/O. This load should scale well with the number of disks in the storage and the 'random' IOPS capability of each drive. If a drive handles 150 random IOPS, then we can handle about 1MB/s of output per drive.


    Now ZFS will behave quite differently. ZFS does not have preallocation of file blocks and will not, ever, overwrite live data. The handling of the O_DSYNC writes in ZFS will occur in 2 stages.


    The 2 stages of ZFS


    First, at the ZFS Intent Log (ZIL) level, we need to I/O the data in order to release the application blocked in a write call. Here the ZIL has the ability to aggregate data from multiple writes and issue fewer, larger I/Os than UFS would. Given the ZFS strategy of block allocation, we also expect those I/Os to be able to stream to the disk at high speed. We don't expect to be restrained by the random IOPS capabilities of disks, but more by their streaming performance.


    Next, at the TXG level, we clean up the state of the filesystem, and here again the block allocation should allow a high rate of data transfer. At this stage there are 2 things we have to care about.


    With the current state of things, we will probably see the data sent to disk twice, once to the ZIL and once to the pool. While this appears suboptimal at first, the aggregation and streaming characteristics of ZFS make the current situation probably already better than what UFS can achieve. We're also looking to see if we can avoid the 2 copies while preserving the full streaming performance characteristics.


    For pool level I/O we must take care not to inflate the amount of data sent to disk, which could eventually cause early storage saturation. ZFS works out of the box with 128K records for large files. However, for DB workloads, we expect this to be tuned such that the ZFS recordsize matches the DB block size. We also expect the DB blocksize to be at least 8K in size. Matching the ZFS recordsize to the DB block size is a recommendation in line with what UFS DIO has taught us: don't do any more I/O than necessary.


    Note also that with ZFS, because we don't overwrite live data, every block output needs to bubble up into metadata block updates, etc. So there are some extra I/Os that ZFS has to do, and depending on the exact test conditions, the gains of ZFS can be offset by the extra metadata I/Os.


    ZFS Performance and DB


    Despite all the advantages of ZFS, the reason that performance data has been hard to come by is that we have to clear the road and bypass the few side issues that currently affect performance on large DB loads. At this stage, we do have to spend some time and apply magic recipes to get ZFS performance on databases to behave the way it's intended to.


    But when the dust settles, we should be right up there in terms of performance compared to UFS/DIO, and improvement ideas are still plentiful; if you have some more, I'm interested....

    Wednesday June 21, 2006

    The Dynamics of ZFS



    ZFS has a number of identified components that govern its performance. We review the major ones here.

    Introducing ZFS


    A volume manager is a layer of software that groups a set of block devices in order to implement some form of data protection and/or aggregation of devices, exporting the collection as a storage volume that behaves as a simple block device.


    A filesystem is a layer that will manage such a block device using a subset of system memory in order to provide Filesystem operations (including Posix semantics) to applications and provide a hierarchical namespace for storage - files. Applications issue reads and writes to the Filesystem and the Filesystem issues Input and Output (I/O) operations to the storage/block device.


    ZFS implements those 2 functions at once. It thus typically manages sets of block devices (leaf vdevs), possibly grouping them into protected devices (RAID-Z or N-way mirror) and aggregating those top level vdevs into a pool. Top level vdevs can be added to a pool at any time. Objects that are stored in a pool will be dynamically striped onto the available vdevs.


    Associated with pools, ZFS manages a number of very lightweight filesystem objects. A ZFS filesystem is basically just a set of properties associated with a given mount point. The properties of a filesystem include the quota (maximum size) and reservation (guaranteed size) as well as, for example, whether or not to compress file data when storing blocks. The filesystem is characterized as lightweight because it does not statically associate with any physical disk blocks, and any of its settable properties can simply be changed dynamically.


    Recordsize


    The recordsize is one of those properties of a given ZFS filesystem instance. ZFS files smaller than the recordsize are stored using a single filesystem block (FSB) of variable length, in multiples of a disk sector (512 bytes). Larger files are stored using multiple FSBs, each of recordsize bytes, with a default value of 128K.


    The FSB is the basic file unit managed by ZFS and the unit to which a checksum is applied. After a file grows to be larger than the recordsize (and gets to be stored with multiple FSBs), changing the filesystem's recordsize property will not impact the file in question; a copy of the file will inherit the tuned recordsize value. An FSB can be mirrored onto a vdev or spread onto a RAID-Z device.


    The recordsize is currently the only performance tunable of ZFS. The default recordsize may lead to early storage saturation: for many small updates (much smaller than 128K) to large files (bigger than 128K), the default value can cause extra strain on the physical storage or on the data channel (such as a fibre channel) linking it to the host. For those loads, if one notices a saturated I/O channel, then tuning the recordsize to smaller values should be investigated.


    Transaction Groups


    The basic mode of operation for write operations that do not require synchronous semantics (no O_DSYNC, fsync(), etc.) is that ZFS will absorb the operation in a per host system cache called the Adaptive Replacement Cache (ARC). Since there is only one host system memory but potentially multiple ZFS pools, cached data from all pools is handled by a single ARC.


    Each file modification (e.g. a write) is associated with a certain transaction group (TXG). At a regular interval (default of txg_time = 5 seconds) each TXG will shut down and the pool will issue a sync operation for that group. A TXG may also be shut down when the ARC indicates that there is too much dirty memory currently being cached. As a TXG closes, a new one immediately opens and file modifications then associate with the new active TXG.


    If the active TXG shuts down while a previous one is still in the process of syncing data to the storage, then applications will be throttled until the running sync completes. In this situation we are syncing a TXG while TXG + 1 is closed (due to memory limitations or the 5 second clock) and waiting to sync itself; applications are throttled waiting to write to TXG + 2. We need sustained saturation of the storage, or a memory constraint, in order to throttle applications.


    A sync of the storage pool will involve sending all level 0 data blocks to disk; when done, all level 1 indirect blocks, etc., until eventually all blocks representing the new state of the filesystem have been committed. At that point we update the uberblock to point to the new consistent state of the storage pool.


    ZFS Intent Log (ZIL)


    For file modifications that come with some immediate data integrity constraint (O_DSYNC, fsync, etc.), ZFS manages a per-filesystem intent log or ZIL. The ZIL marks each FS operation (say a write) with a log sequence number. When a synchronous command is requested for the operation (such as an fsync), the ZIL will output blocks up to the sequence number. When the ZIL is in the process of committing data, further commit operations will wait for the previous ones to complete. This allows the ZIL to aggregate multiple small transactions into larger ones, thus performing commits using fewer, larger I/Os.


    The ZIL works by issuing all the required I/Os and then flushing the write caches if those are enabled. This use of the disk write cache does not artificially improve a disk's commit latency, because ZFS ensures that data is physically committed to storage before returning. However, the write cache allows a disk to hold multiple concurrent I/O transactions, and this acts as a good substitute for drives that do not implement tagged queuing.


    CAVEAT: The current state of the ZIL is such that if there is a lot of pending data in a filesystem (written to the FS, not yet output to disk) and a process issues an fsync() for one of its files, then all pending operations will have to be sent to disk before the synchronous command can complete. This can lead to unexpected performance characteristics. Code is under review.


    I/O Scheduler and Priorities


    ZFS keeps track of pending I/Os but only issues a certain number of them (35 by default) to the disk controllers. This allows the controllers to operate efficiently while never overflowing their queues. By limiting the I/O queue size, the service times of individual disks are kept to reasonable values. When one I/O completes, the I/O scheduler decides which is the next most important one to issue. The priority scheme is time based; so, for instance, an input I/O servicing a read call will be prioritized over any regular output I/O issued in the last ~0.5 seconds.


    The fact that ZFS will limit each leaf device's I/O queue to 35 is one of the reasons suggesting that zpools should be built using vdevs that are individual disks, or at least volumes that map to a small number of disks. Otherwise this self-imposed limit could become an artificial performance throttle.


    Read Syscalls


    If a read cannot be serviced from the ARC cache, ZFS will issue a 'prioritized' I/O for the data. So even if the storage is handling a heavy output load, there are only 35 I/Os outstanding, all with reasonable service times. As soon as one of the 35 I/Os completes, the I/O scheduler will issue the read I/O to the controller. This ensures good service times for read operations in general.


    However, to avoid starvation, when there is a long-standing backlog of output I/Os, those eventually regain priority over the input I/Os. ZIL synchronous I/Os have the same priority as synchronous reads.


    Prefetch


    The prefetch code allowing ZFS to detect sequential or strided access to a file and issue I/O ahead of phase is currently under review. To quote the developer "ZFS prefetching needs some love".


    Write Syscalls


    ZFS never overwrites live data on-disk and will always output full records validated by a checksum. So in order to partially overwrite a file record, ZFS first has to have the corresponding data in memory. If the data is not yet cached, ZFS will issue an input I/O before allowing the write(2) to partially modify the file record. With the data now in cache, more writes can target the blocks. On output, ZFS will checksum the data before sending it to disk. For full record overwrites, the input phase is not necessary.


    CAVEAT: Simple write calls (not O_DSYNC) are normally absorbed by the ARC cache and so proceed very quickly. But a sustained dd(1)-like load can quickly overrun a large amount of system memory and cause transaction groups to eventually throttle all applications for large amounts of time (tens of seconds). This is probably what underwrites the notion that ZFS needs more RAM (it does not). Write throttling code is under review.


    Soft Track Buffer


    An input I/O is serious business. While a filesystem can decide where to write stuff out on disk, the inputs are requested by applications. This means a necessary head seek to the location of the data. The time to issue a small read will be totally dominated by this seek. So ZFS takes the stance that it might as well amortize those operations, and so, for uncached reads, ZFS normally will issue a fairly large input I/O (64K by default). This will help loads that input data using an access pattern similar to the output phase. The data goes into a per device cache holding 20MB.


    This cache can be invaluable in reducing the I/Os necessary to read in data. But just like the recordsize, if the inflated I/Os cause storage channel saturation, the soft track buffer can act as a performance throttle.


    The ARC Cache


    The most interesting caching occurs at the ARC layer. The ARC manages the memory used by blocks from all pools (each pool servicing many filesystems). ARC stands for Adaptive Replacement Cache and is inspired by a paper of Megiddo/Modha presented at the FAST'03 USENIX conference.


    The ARC manages its data keeping a notion of Most Frequently Used (MFU) and Most Recently Used (MRU) blocks, balancing intelligently between the two. One of its very interesting properties is that a large scan of a file will not destroy most of the cached data.


    On a system with free memory, the ARC will grow as it starts to cache data. Under memory pressure, the ARC will return some of its memory to the kernel until low memory conditions are relieved.


    We note that while ZFS has behaved rather well under 'normal' memory pressure, it does not appear to behave satisfactorily under swap shortage. The memory usage pattern of ZFS is very different from that of other filesystems such as UFS, and so exposes VM layer issues in a number of corner cases. For instance, a number of kernel operations fail with ENOMEM without even attempting a reclaim operation. If they did, ZFS would respond by releasing some of its own buffers, allowing the initial operation to then succeed.


    The fact that ZFS caches data in the kernel address space does mean that the kernel size will be bigger than when using traditional filesystems. For heavy duty usage it is recommended to use a 64-bit kernel i.e. any Sparc system or an AMD configured in 64-bit mode. Some systems that have managed in the past to run without any swap configured should probably start to configure some.


    The behavior of the ARC in response to memory pressure is under review.


    CPU Consumption


    Recent enhancements to ZFS have improved its CPU efficiency by a large factor. We don't expect to deviate from other filesystems much in terms of cycles per operation. ZFS checksums all disk blocks, but this has not proven to be costly at all in terms of CPU consumption.


    ZFS can be configured to compress on-disk blocks. We do expect to see some extra CPU consumption from that compression. While it is possible that compression could lead to some performance gain due to reduced I/O load, the emphasis of compression should be to save on-disk space not performance.


    What About Your Test ?


    This is what I know about the ZFS performance model today. My performance comparison on different types of modelled workloads made last fall already had ZFS ahead on many of them; we have improved the biggest issues highlighted then, and there are further performance improvements in the pipeline (based on the UFS experience, we know this will never end). Best practices are being spelled out.
    You can contribute by comparing your actual usage and workload pattern with the simulated workloads. But nothing will beat having reports from real workloads at this stage; your results are therefore of great interest to us. And watch this space for updates...


    Wednesday June 07, 2006

    Tuning ZFS recordsize

    One important performance parameter of ZFS is the recordsize, which governs the size of filesystem blocks for large files. This is the unit that ZFS validates through checksums. Filesystem blocks are dynamically striped onto the pooled storage, on a block to virtual device (vdev) basis.

    It is expected that for some loads, tuning the recordsize will be required. Note that, in traditional filesystems, such a tunable would govern the behavior of all of the underlying storage. With ZFS, tuning this parameter only affects the tuned filesystem instance; it will apply to newly created files. The tuning is achieved using

    zfs set recordsize=64k mypool/myfs

    In ZFS, all files are stored either as a single block of varying size (up to the recordsize) or using multiple recordsize blocks. Once a file grows to multiple blocks, its blocksize is definitively set to the FS recordsize in effect at the time.

    Some more experience will be required with the recordsize tuning. Here are some elements to guide along the way.

    If one considers the input of an FS block, typically in response to an application read, the size of the I/O in question will basically not impact the latency by much. So, as a first approximation, the recordsize does not matter (I'll come back to that) for read-type workloads.

    For FS block outputs, those that are governed by the recordsize, actually occur mostly asynchronously with the application; and since applications are not commonly held up by those outputs, the delivered throughput is, as for read-type loads, not impacted by the recordsize.

    So the first approximation is that the recordsize does not impact performance much. To service loads that are transient in nature, with short I/O bursts (< 5 seconds), we do not expect recordsize tuning to be necessary. The same can be said for sequential type loads.

    So what about the second approximation? A problem that can occur with using an inflated recordsize (128K) compared to application read/write sizes is early storage saturation. If an application requests 64K of data, then providing a 128K record doesn't change the latency that the application sees much. However, if the extra data is discarded from the cache before ever being read, the data channel was occupied for no good reason. If a limiting factor of the storage is, for instance, a 100MB/sec channel, I can handle roughly 800 128K records per second on that channel. If I halve the recordsize, that should double the number of small records I can input.

    On small record output loads, the system memory creates a buffer that defers the direct impact on applications. For output, if the storage is saturated this way for tens of seconds, ZFS will eventually throttle applications. This means that, in the end, when the recordsize leads to sustained storage overload on output, there will be an impact as well.

    There is another aspect to the recordsize. A partial write to an uncached FS block (a write syscall of size smaller than the recordsize) will have to first input the corresponding data. Conversely, when individual writes cover full filesystem recordsize blocks, those writes can be handled without the need to input the associated FS blocks. Other considerations (metadata overhead, caching) dictate, however, that the recordsize not be reduced below a certain point (16K to 64K; do send in your experience).

    So, one piece of advice is to keep an eye on the channel throughput and tune the recordsize for random access workloads that saturate the storage. Sequential type workloads should work quite well with the current default recordsize. If the applications' read/write sizes can be increased, that should also be considered. For non-cached workloads that overwrite file data in small aligned chunks, matching the recordsize with the write access size may bring some performance gains.



    Tuesday June 06, 2006

    DOES ZFS REALLY USE MORE RAM ?



    I'll touch 3 aspects of that question here :

    - reported freemem

    - syscall writes to mmap pages

    - application write throttling

    Reported freemem will be lower when running with ZFS than with, say, UFS. The UFS page cache is counted as free memory. ZFS will return its 'cache' only when memory is needed. So you will operate with lower freemem, but you won't normally suffer from this.

    It's been wrongly feared that this mode of operation puts us back to the days of Solaris 2.6 and 7, where we saw a roller coaster effect on freemem leading to sub-par application performance. We actually DO NOT have this problem with ZFS. The old problem came about because the memory reaper could not distinguish between a useful application page and a UFS cached page. That was bad. ZFS frees up its cache in a way that does not cause this problem.

    ZFS is designed to release some of its memory when kernel modules exert back pressure on the kmem subsystem. Some kernel code that did not properly exert that pressure was recently fixed (short description here: 4034947).

    There is one peculiar workload that does lead ZFS to consume more memory: writing (using syscalls) to pages that are also mmaped. ZFS does not use the regular paging system to manage data that passes through read and write syscalls. However, mmaped I/O, which is closely tied to the Virtual Memory subsystem, still goes through the regular paging code. So syscall writing to mmaped pages means we will keep 2 copies of the associated data, at least until we manage to get the data to disk. We don't expect that type of load to commonly use large amounts of RAM.

    Finally, one area where ZFS will behave quite differently from UFS is in throttling writers. With UFS, up until not long ago, we throttled a process trying to write to a file as soon as that file had 0.5 MB of pending I/O associated with it. This limit has recently been upped to 16 MB. The gain of such throttling is that we prevent an application working on a single file from consuming an inordinate amount of system memory. The downside is that we possibly throttle an application unnecessarily when memory is plentiful.

    ZFS will not throttle individual apps like this. The scheme is mutualized between all writers: when the global load of application data overflows the I/O subsystem for 5 to 10 seconds, then we throttle the applications, allowing the I/O to catch up. Applications thus have a lot more RAM to play with before being throttled.

    This is probably what's behind the notion that ZFS likes more RAM. By and large, to cache some data, ZFS just needs the same amount of RAM as any other filesystem. But currently, ZFS lets applications run a lot more decoupled from the I/O subsystem. This can speed up some loads by a very large factor but, at times, will appear as extra memory consumption.
