Friday, Oct. 30, 2015

Block Picking

In the final article of this series, I walk through the fascinating topic of block picking. As is well known, ZFS is an allocate-on-write storage system: every time it updates the on-disk structure, it writes the new data wherever it chooses. Within reasonable bounds, ZFS also controls the timing of when to write data out to devices (outside of ZIL blocks, whose timing is governed by applications). The current bounds are set about 5 seconds apart, and when that clock ticks, we bundle up all recent changes into a transaction group (TXG) handled by spa_sync.

Armed with this herd of pending data blocks, ZFS issues a highly concurrent workload dedicated to running all the CPU-intensive tasks. For any individual I/O, after going through compression, encryption and checksumming, we move on to the allocation task and finally to device-level I/O scheduling.

The first task of allocation is selecting a device in the pool. Then, within that device, we select a sub-region called a metaslab and finally, within that metaslab, a block where the data is stored. Our guiding principles through the process are to
  • Foster write I/O aggregation
  • Ensure devices are used efficiently
  • Avoid fragmenting the on-disk space
  • Limit the core memory required
  • Serve concurrent allocations quickly
  • Do all this with as little CPU as possible
Let's see how ZFS solves this tough equation.

Device Selection

When a TXG is triggered, we want devices to receive a set of I/Os such that the I/Os:
  • Aggregate
  • Stream on the media
I/O aggregation occurs downstream when 2 I/Os, both ready to issue, have just been allocated on the same device in adjacent locations. The per-device I/O pipeline is then able to merge both of them (N of them, actually) into a single device I/O. This is an incredible feature as it applies to items that have no relationship to each other, other than their allocation proximity, as first described in Need Inodes ?.
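To make the merging idea concrete, here is a minimal sketch (illustrative Python, not ZFS source; the function name is my own) of how I/Os that happen to be adjacent on a device can collapse into fewer, larger device I/Os:

```python
# Sketch (not ZFS source): merge pending I/Os that were allocated
# adjacently into fewer, larger device I/Os. Each I/O is (offset, size).

def aggregate(ios):
    """Return merged (offset, size) extents from a list of pending I/Os."""
    merged = []
    for off, size in sorted(ios):
        if merged and merged[-1][0] + merged[-1][1] == off:
            # Adjacent to the previous extent: grow it instead of
            # issuing a separate device I/O.
            prev_off, prev_size = merged[-1]
            merged[-1] = (prev_off, prev_size + size)
        else:
            merged.append((off, size))
    return merged

# Three unrelated 8K blocks allocated back-to-back become one 24K I/O.
print(aggregate([(0, 8192), (8192, 8192), (16384, 8192), (65536, 8192)]))
# -> [(0, 24576), (65536, 8192)]
```

The blocks being merged need no relationship to each other; allocation proximity alone is enough, which is exactly why the allocator works so hard to pack allocations together.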

Streaming to the media is a close cousin of aggregation. I/Os aggregate when they are adjacent, but even when they are not, we still want to avoid seeking to a faraway area of the disk on every I/O. A disk seek is, after all, an eternity. So, while laying data onto the media, we like to keep logical block addresses (LBA) as packed as possible, leading to streaming efficiency. Similarly for SSDs, doing so avoids fragmenting the internal space mapping done by the Flash Translation Layer (FTL). Even logical volume software welcomes this model.

ZFS allocates from a device until about 1MB of data is handled. The first 1MB of blocks that reach the allocation stage (after the CPU-heavy transformations) are directed to a first device. The following 1MB of blocks move on to the next one, and so on. After iterating round-robin through every device, we reach a state where every device in the pool is busy with I/O. At the same time, other blocks are being processed through the CPU-intensive stages and are building up an I/O backlog for every device. At that moment, even if CPUs are heavily used, the system is issuing I/O to all devices and keeping them 100% busy.
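The round-robin rotation can be sketched as follows (illustrative Python; the class and constant names are my own, not ZFS identifiers):

```python
# Sketch: direct allocations to one device until roughly 1MB has been
# handled, then rotate to the next device so every vdev stays busy.

ROTOR_LIMIT = 1 << 20  # ~1MB per device before rotating (hypothetical name)

class Rotor:
    def __init__(self, ndevices):
        self.ndevices = ndevices
        self.current = 0
        self.allocated = 0

    def pick_device(self, size):
        if self.allocated >= ROTOR_LIMIT:
            # This device has received its ~1MB share; move on.
            self.current = (self.current + 1) % self.ndevices
            self.allocated = 0
        self.allocated += size
        return self.current

r = Rotor(ndevices=4)
devs = [r.pick_device(128 * 1024) for _ in range(24)]  # 24 x 128K = 3MB
print(devs)  # first 8 allocations on device 0, next 8 on device 1, ...
```

Each device ends up with a ~1MB batch of adjacent allocations, which is what gives the downstream pipeline its aggregation opportunities.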

Metaslab Selection

Once we have directed 1MB of allocation to a specific device, we still want the I/Os to target a localized subarea of the device. A metaslab has an in-core structure representing the free space of a portion of a device (very roughly 1%). At any one time, ZFS only allocates from a single metaslab of a device, ensuring dense packing of all writes and therefore avoiding long disk head excursions. The other benefit is that, for the purpose of allocation, we strictly only need to keep in core a structure representing the free space for that subarea. This is the active metaslab. During a given TXG, we thus only write to the active metaslab of every device.

Block Picking

And now, going for the kill. We have the device and the metaslab subarea within it to service our allocation of size X. We finally have to choose a specific device location to fit this allocation. A simple strategy would be to allocate all blocks back-to-back regardless of size. That would lead to maximum aggregation, but we must be considerate of space fragmentation implications.

Blocks we are allocating together at this instant may well be freed later on different schedules from each other. Frankly, we have no way to know when a block will be freed, since that is entirely driven by the workload. In ZFS, our bias is to consider that blocks of similar size have a better chance of having similar life expectancy. We exploit this by maintaining a set of separate pointers within our metaslab and allocating blocks of similar size near each other.
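One minimal way to picture those separate per-size pointers (a toy sketch in Python; the region layout and names are my invention, not the ZFS on-disk format):

```python
# Sketch: one allocation cursor per power-of-two size class, so blocks
# of similar size (and, we bet, similar life expectancy) land together.

REGION = 1 << 30  # hypothetical: carve the metaslab into per-size regions

class Metaslab:
    def __init__(self):
        self.cursors = {}  # size class -> next free offset in its region

    def alloc(self, size):
        klass = size.bit_length()        # power-of-two size class
        base = klass * REGION            # where this class allocates
        off = self.cursors.get(klass, base)
        self.cursors[klass] = off + size # bump the class cursor
        return off

m = Metaslab()
a = m.alloc(4096)    # 4K blocks pack back-to-back...
b = m.alloc(4096)
c = m.alloc(131072)  # ...far away from where 128K blocks pack
```

Same-sized blocks end up adjacent (so they aggregate, and later tend to free together), while different sizes never interleave.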

Historical Behavior

When ZFS first came of age, it had 2 strategies for picking blocks. The regular one led to good performance through aggregation, while the other was aimed at defragmenting and led to terrible performance. We would switch strategies when a metaslab started to have less than 30% free space within it. Customers voiced their discontent loudly. The 30% parameter was later reduced to 4%, but that didn't reduce the complaints1.

The other problem we had in the past was that when we needed to switch to a metaslab whose structure was not yet in core, we would block all allocating threads while waiting on the in-core loading of the metaslab data. If that took too much time, we could leave devices idling in the process.

Finally, we would switch only when an allocation could not be satisfied, usually because a request was larger than the largest free block available. Forced to switch, we would then select the metaslab with the best2 free space. This meant that we would keep using metaslabs past their prime capacity to foster aggregated and streaming I/Os.

Today is a Better World

Fast-forward to today: we have evolved this whole process in very significant ways.

The thread-blocking problem was simple enough: while we load a metaslab, we now quickly direct allocating threads to other devices. We therefore keep feeding the other devices more smoothly.

But the most important advance is that we no longer use an allocator that switches strategy based on available space in the metaslab. Allocations of a given size are serviced from a chunk chosen such that the I/Os aggregate 16-fold: 4K allocations tend to consume 64K chunks, while 128K allocations look for 2MB chunks. Allocations of different sizes do not compete for the same-sized chunks. Finally, as the maximum size available in a metaslab is reduced, we simply and gradually scale down our potential for aggregation from that metaslab.
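A sketch of that 16-fold rule (illustrative Python; the function and constant names are mine, and real ZFS tracks free space in far richer structures):

```python
# Sketch: serve an allocation of size X from a free chunk of about
# 16*X so that subsequent same-size allocations can aggregate; scale
# down gracefully when no such chunk remains.

AGGR_FACTOR = 16

def pick_chunk(free_chunks, size):
    """free_chunks: list of (offset, length) free extents."""
    want = size * AGGR_FACTOR
    # Prefer a chunk big enough for 16-fold aggregation...
    for off, length in free_chunks:
        if length >= want:
            return off
    # ...otherwise gradually scale down: take the biggest chunk that
    # still fits the allocation at all.
    best = max((c for c in free_chunks if c[1] >= size),
               key=lambda c: c[1], default=None)
    return best[0] if best else None

chunks = [(0, 32768), (100000, 2 << 20)]
print(pick_chunk(chunks, 4096))    # 4K wants a 64K chunk -> 100000
print(pick_chunk(chunks, 131072))  # 128K wants a 2MB chunk -> 100000
```

Note how the 4K request skips the 32K chunk even though it would fit: reserving roomy chunks for each size class is what keeps aggregation potential high.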

Alongside this change, we decided to switch away from a metaslab as soon as it starts to show signs of fatigue. As long as a metaslab is able to serve blocks of approximately 1MB, we keep allocating from it. But as soon as its biggest free block drops below this threshold, we go and pick the metaslab with the best free space.
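The switch decision itself is simple enough to sketch (illustrative Python; field names and the weighting are my simplification of what ZFS tracks per metaslab):

```python
# Sketch: keep allocating from the active metaslab only while it can
# still serve ~1MB blocks; otherwise switch to the best-weighted one.

SWITCH_THRESHOLD = 1 << 20  # ~1MB

def choose_metaslab(active, candidates):
    """active/candidates: dicts with 'largest_free' and 'weight'."""
    if active['largest_free'] >= SWITCH_THRESHOLD:
        return active  # no signs of fatigue yet
    # Fatigued: pick the candidate with the best (weighted) free space.
    return max(candidates, key=lambda m: m['weight'])

healthy = {'largest_free': 2 << 20, 'weight': 5}
tired = {'largest_free': 512 * 1024, 'weight': 5}
others = [{'largest_free': 4 << 20, 'weight': 9},
          {'largest_free': 3 << 20, 'weight': 7}]
```

Here "weight" stands in for the desirability metric mentioned in footnote 2, which boosts low-address metaslabs rather than simply counting free bytes.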

Finally, to account for more frequent switches, we also decided to unload metaslabs less aggressively than before. This policy allows us to reuse a metaslab without incurring the cost of loading it, which comes with both a CPU and an I/O cost.


With these changes in, we have an allocator that fosters aggregation very effectively and leads to performance that degrades gracefully as the pool fills up. This allocator has served us well over the many years it's been in place.

ZFS gives you great performance by handling writes in a way that streams the data to devices. This is effective and delivers maximum performance as long as there is large sequential free space on the devices. For users, the real gauge is the average I/O size for writes: if, for the same workload mix, the write size starts to creep down, then it's time to consider adding storage space.

The End

1 Some still remember this 30% factor and use it as a rule of thumb not to exceed, even though we have not used this allocator in years; tuning metaslab_df_free_pct has no effect on our systems.

2 I say best free space and not most free space since we actually boost the desirability of metaslabs with low addresses. For physical disks, outer tracks fly faster under a disk head, and that translates into more throughput. Even for SSDs we see a benefit: favoring one side of the address range means we reuse freed space more aggressively. Overwriting an SSD LBA range means that the flash cells holding the overwritten data can be recycled quickly by the FTL, which greatly simplifies its operation.

Wednesday, Jan. 21, 2015

Sequential Resilvering

In the initial days of ZFS, some pointed out that ZFS resilvering was metadata driven and therefore super fast: after all, we only had to resilver data that was in use, whereas traditional storage has to resilver the entire disk even if there is no actual data stored. And indeed, on newly created pools, ZFS resilvering was super fast.

But of course, storage pools rarely stay empty. So what happened when pools grew to store large quantities of data? Well, we basically had to resilver most blocks present on a failed disk. So the advantage of only resilvering what is actually present is not much of an advantage, in real life, for ZFS.

And while ZFS-based storage grew in importance, so did disk sizes. The disk capacities that people put in production are growing very fast, showing the appetite of customers to store vast quantities of data. This is happening despite the fact that those disks are not delivering significantly more IOPS than their ancestors. As time goes by, in a trend that has lasted forever, we have fewer and fewer IOPS available to service a given unit of data. Here, ZFSSA storage arrays with TB-class caches are certainly helping the trend: disk IOPS don't matter as much as before because all of the hot data is cached inside ZFS. So customers gladly trade off IOPS for capacity, given that ZFSSA delivers tons of cached IOPS and ultra-cheap GBs of storage.

And then comes resilvering...

So when a disk goes bad, one has to resilver all of the data on it. It is assured, at that point, that we will be accessing all of the data from the surviving disks in the raid group, and that this is not a highly cached set. And here was the rub with old-style ZFS resilvering: the metadata-driven algorithm was actually generating small random IOPS. The old algorithm would walk through all of the blocks file by file, snapshot by snapshot. When it found an element to resilver, it would issue the IOPS necessary for that operation. Because of the nature of ZFS, the population of those blocks didn't lead to a sequential workload on the resilvering disks.

So in a worst-case scenario, we would have to issue small random reads covering 100% of what was stored on the failed disk, and small random writes to the new disk coming in as a replacement. From big disks and very low IOPS ratings come ugly resilvering times. That effect was also compounded by a voluntary design balance strongly biased toward protecting the application load. The compounded effect was month-long resilvering.

The Solution

To solve this, we designed a subtly modified version of resilvering. We split the algorithm into two phases: the populating phase and the iterating phase. The populating phase is mostly unchanged from the previous algorithm except that, when encountering a block to resilver, instead of issuing the small random IOPS, we append it to a new on-disk log. After having iterated through all of the metadata and discovered all of the elements that need to be resilvered, we can sort these blocks by physical disk offset and issue the I/O in ascending order. This in turn allows the ZIO subsystem to aggregate adjacent I/Os more efficiently, leading to fewer, larger I/Os issued to the disk. And by virtue of issuing I/Os in physical order, it allows the disk to serve them at its streaming limit (say 100MB/sec) rather than being IOPS limited (say 200 IOPS).
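The iterating phase can be sketched in a few lines (illustrative Python; function names are mine, and the real implementation logs to disk rather than to a list):

```python
# Sketch of the two-phase idea: the populate phase logs blocks to
# resilver; the iterate phase sorts them by physical offset and issues
# ascending I/O, letting adjacent extents merge into larger ones.

def iterate_phase(log):
    """log: (offset, size) pairs discovered during the populate phase."""
    issued = []
    for off, size in sorted(log):
        if issued and issued[-1][0] + issued[-1][1] >= off:
            # Adjacent or overlapping: extend the previous I/O.
            p_off, p_size = issued[-1]
            issued[-1] = (p_off, max(p_off + p_size, off + size) - p_off)
        else:
            issued.append((off, size))
    return issued

# Blocks discovered file by file, in no particular disk order:
log = [(4096, 4096), (0, 4096), (1 << 20, 8192), (8192, 4096)]
print(iterate_phase(log))  # -> [(0, 12288), (1048576, 8192)]
```

Four scattered discoveries become two ordered I/Os, which is the difference between a seek-bound and a streaming-bound resilver.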

So we now have a strategy that allows us to resilver nearly as fast as the disk hardware physically allows. With that newly acquired capability comes the requirement to service application load with limited impact from resilvering. We therefore have mechanisms to limit resilvering load in the presence of application load. Our stated goal is to run through resilvering at 1TB/day (1TB of data reconstructed on the replacing drive) even in the face of an active workload.

As disks are getting bigger and bigger, all storage vendors will see increasing resilvering times. The good news is that, since Solaris 11.2 and ZFSSA since 2013.1.2, ZFS is now able to run resilvering with much of the same disk throughput limits as the rest of non-ZFS based storage.

The sequential resilvering performance on a RAIDZ pool is particularly noticeable to this happy Solaris 11.2 customer: "It is really good to see the new feature work so well in practice."

Monday, Oct. 03, 2011

Fast, Safe, Cheap : Pick 3

Today, we're making performance headlines with Oracle's ZFS Storage Appliance.

SPC-1 : Twice the performance of NetApp at the same latency; Half the $/IOPS;

I'm proud to say that, yours truly, along with a lot of great teammates in Oracle, is not totally foreign to this milestone.

We are announcing that Oracle's 7420C cluster achieved 137000 SPC-1 IOPS with an average latency of less than 10 ms. That is double the result of NetApp's 3270A while delivering the same latency. Compared to the NetApp 3270 result, this is a 2.5x improvement in $/SPC-1-IOPS ($2.99/IOPS vs $7.48/IOPS). We're also showing that when the ZFS Storage Appliance runs at the rate posted by the 3270A (68034 SPC-1 IOPS), our latency of 3.26ms is almost 3X lower than theirs (9.16ms). Moreover, our result was obtained with 23700 GB of user-level capacity (internally mirrored) at $17.3/GB, while NetApp, even using a space-saving raid scheme, can only deliver $23.5/GB. This is the price per GB of application data actually used in the benchmark. On top of that, the 7420C still had 40% of space headroom, whereas the 3270A was left with only 10% of free blocks.

These great results were at least partly made possible by the availability of 15K RPM Hard Disk Drives (HDD). Those are great for running the most demanding databases because they combine a large IOPS capability with generally smaller capacities. That ratio of IOPS/GB makes them ideal for storing the high-intensity databases modeled by SPC-1. On top of that, this concerted engineering effort led to improved software, and not just for systems running on 15K RPM drives. We actually used this benchmark to seek out ways to increase the quality of our products. During the preparation runs, after an initial diagnosis of an issue, we were committed to finding solutions that did not target the idiosyncrasies of SPC-1 but were based on sound design decisions. So instead of changing the default value of some internal parameter to a new static default, we actually changed the way the parameter worked so that our storage systems of all types and sizes would benefit.

So not only did we get a great SPC-1 result, but all existing customers will benefit from this effort even if they operate outside of the intense conditions created by the benchmark.

So what is SPC-1? It is one of the few benchmarks that counts for storage. It is maintained by the Storage Performance Council (SPC). SPC-1 simulates multiple databases running on a centralized storage system or storage cluster. And even though SPC-1 is a block-based benchmark, within the ZFS Storage Appliance a block-based FC or iSCSI volume is handled very much the same way as a large file subject to synchronous operations. By combining modern network technologies (InfiniBand or 10GbE Ethernet), the CPU power packed in the 7420C storage controllers, and Oracle's custom dNFS technology for databases, one can truly achieve very high database transaction rates on top of the more manageable and flexible file-based protocols.

The benchmark defines three Application Storage Units (ASU): ASU1 with a heavy 8KB-block read/write component, ASU2 with a much lighter 8KB-block read/write component, and ASU3 which is subject to hundreds of write streams. As such, it is not too far from a simulation of running hundreds of Oracle databases on a single system: ASU1 and ASU2 for datafiles and ASU3 for redo log storage.

The total size of the ASUs is constrained such that all of the stored data (including mirror protection and disks used as spares) must exceed 55% of all configured storage. The benchmark team is then free to decide how much total storage to configure. From that figure, 10% is given to ASU3 (redo log space) and the rest is divided equally between the heavily used ASU1 and the lightly used ASU2.
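The sizing arithmetic above can be sketched as follows (my own simplification: I assume 2-way mirroring and ignore spares, so treat the numbers as illustrative only):

```python
# Sketch of the ASU sizing rule: stored data (including mirror copies)
# must exceed 55% of configured storage; ASU3 gets 10% of the user
# space, ASU1 and ASU2 split the rest equally.

def asu_sizes(configured_gb, utilization=0.55, mirror_ways=2):
    usable = configured_gb * utilization / mirror_ways  # user-level capacity
    asu3 = usable * 0.10                # redo log space
    asu1 = asu2 = (usable - asu3) / 2   # datafile spaces
    return asu1, asu2, asu3

a1, a2, a3 = asu_sizes(1000)  # e.g. 1000 GB configured
print(a1, a2, a3)  # -> 123.75 123.75 27.5
```

So for every TB configured, roughly 275 GB of user data is exercised by the benchmark under these assumptions, which is what makes the workload hard to cache.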

The benchmark team also has to select the SPC-1 IOPS throughput level it wishes to run. This is not a light decision, given that you want to balance high IOPS, low latency, and $/user GB.

Once the target IOPS rate is selected, there are multiple criteria to pass a successful audit; one of the most critical is that you have to run at the specified IOPS rate for a full 8 hours. Note that the previous version of the benchmark specification, used by NetApp, called for a 4-hour run. During that 8-hour run delivering a solid 137000 SPC-1 IOPS, the average latency must be less than 30ms (we did much better than that).

After this brutal 8-hour run, the benchmark then enters another critical phase: the workload is restarted (using a new randomly selected working set) and performance is measured for a 10-minute period. It is this 10-minute period that decides the official latency of the run.

When everything is said and done, you pull the trigger, go to sleep, and wake up to the result. As you can guess, we were ecstatic that morning. Before that glorious day, for lack of a stronger word, a lot of hard work had gone into the extensive preparation runs. With little time, and normally not all of the hardware, one goes through a series of runs at incremental loads, making educated guesses as to how to improve the result. As you get more hardware, you scale up the result, tweaking things more or less until the final hour.

SPC-1, with its requirement of less than 45% unused space, is designed to trigger many disk-level random read IOPS. Despite the inherently random pattern of the workload, we saw that our extensive caching architecture was as helpful for this benchmark as it is in real production workloads. While a 15K RPM HDD normally levels off at slightly above 300 random IOPS, our 7420C, as a whole, could deliver almost 500 user-level SPC-1 IOPS per HDD.

In the end, one of the most satisfying aspects was to see that the data being managed by ZFS was stored rock solid on disk, properly checksummed; all data could be snapshotted and compressed on demand, while delivering impressively steady performance.

2X the absolute performance, 2.5X cheaper per SPC-1 IOPS, almost 3X lower latency, 30% cheaper per user GB with room to grow... So, if you have a storage decision coming and you need FAST, SAFE, CHEAP: pick 3, and take a fresh look at the ZFS Storage Appliance.

SPC-1, SPC-1 IOPS, and $/SPC-1 IOPS are registered trademarks of the Storage Performance Council (SPC). More info: Sun ZFS Storage 7420 Appliance.

Oracle Sun ZFS Storage Appliance 7420 result as of October 3, 2011; NetApp FAS3270A result as of October 3, 2011.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Wednesday, May 26, 2010

Let's talk about LUN Alignment for 3 minutes

Recall that I had LUN alignment on my mind a few weeks ago. There is nothing special about the ZFS Storage Appliance over any other storage here: pay attention to how you partition your LUNs, as it can have a great impact on performance. Right, Roch?

Friday, June 19, 2009

ZFS and OpenStorage things you might have missed

Here are a few things that caught my attention.

First off, a great post showing the scalability of a 7410 with SAS Grid Computing all the way to 900MB/sec+ of throughput through a single IP interface.

SAS Grid and OpenStorage 7410

You'd think a CPU benchmark would not be sped up by filesystem considerations, but think again as you read this detailed study:

ZFS accelerated SPEC CPU

Also keep an eye on MySQL best practices from Neel and his cool mysql/innodb tools: MySQL InnoDB best practices, Inniostat & MySQL Truss

It's quite nice to see all the engineering effort really coming together now. The ZFS we have today has made incredible strides in the last year.

Tuesday, March 03, 2009

Performance of the Hybrid Storage Pool

I hope to keep your attention for 30 minutes of impromptu conversation about OpenStorage performance. At the end of the piece, I show off my geeky capture/replay technology.

Monday, Nov. 10, 2008

Sun Storage 7000 Performance invariants

I see many reports about running campaigns of tests measuring performance over a test matrix. One problem with this approach is, of course, the matrix: it is never big enough for the consumer of the information ("can you run this instead?").

A more useful approach is to think in terms of performance invariants. We all know that a 7.2K RPM disk drive can do 150-200 IOPS as an invariant, and disks will have throughput limits such as 80MB/sec. Thinking in terms of those invariants helps in extrapolating performance data (with caution), and observing a breakdown in an invariant is often a sign that something else needs to be root-caused.

So, using 11 metrics and our performance engineering effort, what can be our guiding invariants? Bear in mind that these are expected to be rough estimates. For real measured numbers, check out Amitabha Banerjee's excellent post on Analyzing the Sun Storage 7000.

Streaming : 1 GB/s on server and 110 MB/sec on client

For read streaming, we observe that 1GB/s is roughly our guiding number. This can be achieved with a fairly small number of clients and threads, but will be easier to reach if the data is prestaged in the server caches. A client running a typical 1Gbe network card is able to extract 110 MB/sec rather easily. Read streaming is easier to achieve with the larger 128K records, probably due to the lower CPU demand. While our results are with regular 1500-byte Ethernet frames, using jumbo frames will make this limit easier to reach or even break. For a mirrored pool, data needs to be sent twice to the storage, and we see a reduction of about 50% for write streaming workloads.

Random Read I/Os per second : 150 random read IOPS per mirrored disks

This is probably a good guiding light as well. When going to disks, that is a reasonable expectation, but caching can radically change it. Since we can configure up to 128GB of host RAM and 4 times that much secondary cache, there are opportunities to break this barrier. But when going to spindles, it needs to be kept under consideration. We also know that raid-z spreads records across all disks in a group, so the 150 IOPS limit basically applies to each raid-z group. Do plan to have many groups to service random reads.
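The invariant turns into a quick back-of-envelope calculator (my own sketch, using the ~150 IOPS figure from above and ignoring caching entirely):

```python
# Back-of-envelope: ~150 random read IOPS per mirrored disk, and
# ~150 per raid-z *group*, since raid-z spreads each record across
# the whole group. Cache hits are deliberately ignored here.

DISK_RANDOM_IOPS = 150

def pool_random_read_iops(groups, disks_per_group, raidz=True):
    if raidz:
        # One logical read busies the whole group.
        return DISK_RANDOM_IOPS * groups
    # Mirrors can serve independent reads from every disk.
    return DISK_RANDOM_IOPS * groups * disks_per_group

print(pool_random_read_iops(8, 5, raidz=True))   # 8 raid-z groups -> 1200
print(pool_random_read_iops(8, 2, raidz=False))  # 8 mirror pairs  -> 2400
```

The comparison makes the advice concrete: for uncached random reads, many narrow groups beat a few wide ones.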

Random Read I/Os per second using SSDs : 3100 Read IOPS per Read Optimized SSD

In some instances, data evicted from main memory will be kept in secondary caches. Small files and tuned-recordsize filesystems are good target workloads for this. Those read-optimized SSDs can serve this data back at a rate of 3100 IOPS (from the L2 ARC). More importantly, they can do so at much reduced latency, meaning that even lightly threaded workloads will be able to achieve high throughput.

Synchronous writes per second : 5000-9000 Synchronous write per Write Optimized SSD

Synchronous writes can be generated by an O_DSYNC write (databases) or as part of the NFS protocol (such as the tar extract open/write/close workload). Those reach the NAS server and are coalesced in a single transaction with the separate intent log. The SSD devices are great latency accelerators but are still devices with a max throughput of around 110 MB/sec. However, our code actually detects when the SSD devices become the bottleneck and diverts some of the I/O requests to the main storage pool. The net of all this is a complex equation, but we've easily observed 5000-8000 synchronous writes per second per SSD, up to 3 devices (or 6 in mirrored pairs). Using a smaller working set, which creates less competition for CPU resources, we've even observed 48K synchronous writes per second.

Cycles per Bytes : 30-40 cycles per byte for NFS and CIFS

Once we include the full NFS or CIFS protocol, the efficiency was observed to be in the 30-40 cycles per byte range (8 to 10 of those coming from the pure network component at the regular 1500-byte MTU). More study is required to figure out the extent to which this is valid, but it's an interesting way to look at the problem. Having to run disk I/O, versus being serviced directly from cached data, is expected to exert an additional 10-20 cycles per byte. Obviously, for metadata tests in which a small number of bytes is transferred per operation, we probably need to come up with a cycles/MetaOps invariant, but that is still TBD.

Single Client NFS throughput : 1 TCP Window per round trip latency.

This is one fundamental rule of network throughput, but it's a good occasion to refresh it in everyone's mind. Clients, at least Solaris clients, establish a single TCP connection to a server. That connection can carry a large number of unrelated requests, as NFS is a very scalable protocol. However, a single connection will transport data at a maximum rate of one socket buffer per round-trip latency. Since today's network speeds, particularly in wide area networks, have grown somewhat faster than default socket buffers, such things can become a performance bottleneck. Given that I work in Europe but my test systems are often located in California, I might be a little more sensitive than most to this fact. So one important change we made early in this project was simply to bump up the default socket buffers in the 7000 line to 1MB. For read throughput under similar conditions, we can only advise you to do the same for your client infrastructure.
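The rule above, in numbers (the 64K default and 150ms RTT are illustrative values, not measurements from the 7000 line):

```python
# One socket buffer per round trip: the hard ceiling on a single
# TCP connection's throughput.

def max_throughput(window_bytes, rtt_seconds):
    return window_bytes / rtt_seconds  # bytes per second

# A 64K buffer across a ~150ms Europe-to-California link:
print(max_throughput(64 * 1024, 0.150) / 1e6)  # ~0.44 MB/s -- painful
# With the buffer bumped to 1MB:
print(max_throughput(1 << 20, 0.150) / 1e6)    # ~7 MB/s per connection
```

The same 1MB buffer on a sub-millisecond LAN ceases to be the bottleneck entirely, which is why the problem mostly bites over wide area links.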

Tuesday, Sept. 19, 2006

Tuning the knobs

A script is provided to tune some ZFS knobs.

Wednesday, July 12, 2006

ZFS and Directio


In view of the great performance gains that UFS gets out of its 'Directio' (DIO) feature, it is interesting to ask where exactly those gains come from and whether ZFS can be tweaked to benefit from them in the same way.

UFS Directio

UFS Directio is actually a set of things bundled together that improves the performance of very specific workloads, most notably databases. Directio is a performance hint to the filesystem and, apart from relaxing POSIX requirements, does not carry any change in filesystem semantics. Users of directio assert the condition at the whole-filesystem or individual-file level, and the filesystem code is given extra freedom to run, or not, the tuned DIO codepath.

What does that tuned code path get us? A few things:

	- output goes directly from the application buffer to disk,
	  bypassing the filesystem's core memory cache.

	- the FS is no longer constrained to strictly obey the POSIX
	  write ordering. The FS is thus able to allow multiple threads
	  to concurrently issue I/Os to a single file.

	- on input, UFS DIO refrains from doing any form of readahead.

In a sense, by taking out the middleman (the filesystem cache), UFS/DIO causes files to behave a lot like a raw device. Application reads and writes map one to one onto individual I/Os.

People often consider that the great gains DIO provides come from avoiding the CPU cost of the copy into system caches, and from avoiding the double buffering, once in the DB and once in the FS, that one gets in the non-directio case.

I would argue that while the CPU cost associated with a copy certainly exists, the copy runs very quickly compared to the time the ensuing I/O takes. So the impact of the copy only appears on systems that have their CPUs quite saturated, notably in industry-standard benchmarks. Real systems, which are more likely to be I/O constrained than CPU constrained, should not pay a huge toll for this effect.

As for double buffering, I note that databases (or applications in general) are normally set up to consume a given amount of memory, and the FS operates using the remaining portion. Filesystems cache data in memory for lack of a better use of that memory, and they give up their hold whenever necessary. So the data is not really double buffered; rather, 'free' memory keeps a hold on recently issued I/O. Buffering data in 2 locations does not look like a performance issue to me.

Anything for ZFS ?

So what does that leave us with? Why is DIO so good? This tells me that we gain a lot from these 2 mantras:

		don't do any more I/O than requested

		allow multiple concurrent I/Os to a file.
I note that UFS readahead is particularly bad for certain usages: when UFS sees access to 2 consecutive pages, it reads a full cluster, and those are typically 1MB in size today. So avoiding UFS readahead has probably contributed greatly to the success of DIO. As for ZFS, there are 2 levels of readahead (a.k.a. prefetching): one file based and one device based. Both are being reworked at this stage. I note that the file-based readahead code has not behaved, and will not behave, like UFS. On the other hand, device-level prefetching is probably overly aggressive for DB-type loads and should be avoided. While I have not given up hope that this can be managed automatically, watch this space for tuning scripts to control the device prefetching behavior.

DIO for input does not otherwise appear to be an interesting proposition: if the data is cached, I don't really see the gain in bypassing the cache (apart from slowing down the reads).

As for writes, ZFS, out of the box, does not suffer from the single writer lock that UFS needs to implement the posix ordering rules. The transaction groups (TXG) are sufficient for that purpose (see The Dynamics of ZFS).

This brings us to the amount of I/O needed by the 2 filesystems when running many concurrent O_DSYNC writers issuing small writes to random file offsets.

UFS actually handles this load by overwriting the data in its preallocated disk locations. Every 8K page is associated with a set place on the storage, and a write to that location means a disk head movement and an 8K output I/O. This load should scale with the number of disks in the storage and the random IOPS capability of each drive. If a drive handles 150 random IOPS, then we can handle about 1MB/s of output per drive.
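That per-drive arithmetic, written out (using the 150 IOPS and 8K figures from the paragraph above; the function name is mine):

```python
# UFS O_DSYNC random-overwrite throughput: each write is one random
# 8K I/O, so throughput is bounded by per-drive random IOPS.

def ufs_odsync_mb_per_sec(drives, iops_per_drive=150, block=8 * 1024):
    return drives * iops_per_drive * block / 1e6

print(ufs_odsync_mb_per_sec(1))   # ~1.2 MB/s per drive
print(ufs_odsync_mb_per_sec(10))  # ~12 MB/s across ten drives
```

That ceiling is what the ZIL's aggregation and streaming allocation, described next, get to beat.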

Now, ZFS behaves quite differently. ZFS does not preallocate file blocks and will not, ever, overwrite live data. The handling of O_DSYNC writes in ZFS occurs in 2 stages.

The 2 stages of ZFS

First, at the ZFS Intent Log (ZIL) level, we need to issue the data to disk in order to release the application blocked in its write call. Here the ZIL has the ability to aggregate data from multiple writes and issue fewer, larger I/Os than UFS would. Given the ZFS strategy of block allocation, we also expect those I/Os to stream to the disk at high speed. We don't expect to be restrained by the random IOPS capabilities of disks but rather by their streaming performance.

Next at the TXG level, we clean up the state of the filesystem and here again the block allocation should allow high rate of data transfer. At this stage there are 2 things we have to care about.

With the current state of things, we will probably see the data sent to disk twice, once to the ZIL and once to the pool. While this appears suboptimal at first, the aggregation and streaming characteristics of ZFS probably already make the current situation better than what UFS can achieve. We're also looking at whether we can avoid the 2 copies while preserving the full streaming performance characteristics.

For pool-level I/O we must take care not to inflate the amount of data sent to disk, which could eventually cause early storage saturation. ZFS works out of the box with 128K records for large files. However, for DB workloads we expect this to be tuned such that the ZFS recordsize matches the DB block size, and we expect the DB block size to be at least 8K. Matching the ZFS recordsize to the DB block size is a recommendation in line with what UFS DIO has taught us: don't do any more I/O than necessary.

Note also that, because ZFS does not overwrite live data, every block output bubbles up into metadata block updates and so on. So there is some extra I/O that ZFS has to do, and depending on the exact test conditions the gains of ZFS can be offset by the extra metadata I/Os.

ZFS Performance and DB

Despite all the advantages of ZFS, the reason that performance data has been hard to come by is that we have to clear the road and bypass the few side issues that currently affect performance on large DB loads. At this stage we do have to spend some time and apply magic recipes to get ZFS database performance to behave the way it's intended to.

But when the dust settles, we should be right up there in performance compared to UFS/DIO, and there are still plenty of improvement ideas; if you have some more, I'm interested...

Wednesday, June 21, 2006

The Dynamics of ZFS


ZFS has a number of identified components that govern its performance. We review the major ones here.

Introducing ZFS

A volume manager is a layer of software that groups a set of block devices in order to implement some form of data protection and/or aggregation of devices, exporting the collection as a storage volume that behaves as a simple block device.

A filesystem is a layer that manages such a block device, using a subset of system memory, in order to provide filesystem operations (including POSIX semantics) to applications and a hierarchical namespace for storage: files. Applications issue reads and writes to the filesystem, and the filesystem issues input and output (I/O) operations to the storage/block device.

ZFS implements those two functions at once. It thus typically manages sets of block devices (leaf vdevs), possibly grouping them into protected devices (RAID-Z or N-way mirrors) and aggregating those top-level vdevs into a pool. Top-level vdevs can be added to a pool at any time. Objects stored in a pool are dynamically striped onto the available vdevs.

Associated with pools, ZFS manages a number of very lightweight filesystem objects. A ZFS filesystem is basically just a set of properties associated with a given mount point. Properties of a filesystem include the quota (maximum size) and reservation (guaranteed size) as well as, for example, whether or not to compress file data when storing blocks. The filesystem is characterized as lightweight because it is not statically associated with any physical disk blocks, and any of its settable properties can simply be changed dynamically.


The recordsize is one of those properties of a given ZFS filesystem instance. ZFS files smaller than the recordsize are stored using a single filesystem block (FSB) of variable length, in multiples of a disk sector (512 bytes). Larger files are stored using multiple FSBs, each of recordsize bytes, with a default value of 128K.

The FSB is the basic file unit managed by ZFS and the unit to which a checksum is applied. After a file grows to be larger than the recordsize (and gets stored as multiple FSBs), changing the filesystem's recordsize property will not impact the file in question; a copy of the file will inherit the tuned recordsize value. An FSB can be mirrored onto a vdev or spread across a RAID-Z device.

The recordsize is currently the only performance tunable of ZFS. The default recordsize may lead to early storage saturation: for many small updates (much smaller than 128K) to large files (bigger than 128K), the default value can put extra strain on the physical storage or on the data channel (such as a fibre channel) linking it to the host. For those loads, if one notices a saturated I/O channel, then tuning the recordsize to smaller values should be investigated.

Transaction Groups

The basic mode of operation for write operations that do not require synchronous semantics (no O_DSYNC, fsync(), etc.) is that ZFS absorbs the operation in a per-host system cache called the Adaptive Replacement Cache (ARC). Since there is only one host system memory but potentially multiple ZFS pools, cached data from all pools is handled by a single ARC.

Each file modification (e.g. a write) is associated with a certain transaction group (TXG). At regular intervals (default of txg_time = 5 seconds) each TXG shuts down and the pool issues a sync operation for that group. A TXG may also be shut down when the ARC indicates that there is too much dirty memory currently being cached. As a TXG closes, a new one immediately opens and file modifications then associate with the new active TXG.

If the active TXG shuts down while a previous one is still in the process of syncing data to the storage, then applications will be throttled until the running sync completes. In this situation we are syncing a TXG while TXG + 1 is closed (due to memory limitations or the 5-second clock) and waiting to sync itself; applications are throttled waiting to write to TXG + 2. So it takes sustained saturation of the storage, or a memory constraint, to throttle applications.

A sync of the storage pool involves sending all level-0 data blocks to disk; when done, all level-1 indirect blocks, and so on, until eventually all blocks representing the new state of the filesystem have been committed. At that point we update the uberblock to point to the new consistent state of the storage pool.
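That bottom-up ordering can be sketched as follows (a deliberately simplified model; real pool sync involves many more block types, and the function and names here are mine):

```python
# Simplified TXG sync ordering: data blocks (level 0) first, then each
# level of indirect blocks, and the uberblock update last, so the
# on-disk state only ever points at blocks that are already written.

def sync_txg(blocks_by_level, write):
    """blocks_by_level: {0: [data blocks], 1: [indirect blocks], ...}
    write: callback taking (level, block)."""
    for level in sorted(blocks_by_level):
        for block in blocks_by_level[level]:
            write(level, block)
    # Only after every block is on stable storage do we flip the
    # uberblock, atomically switching to the new consistent state.
    write("uberblock", None)

order = []
sync_txg({0: ["d1", "d2"], 1: ["i1"]},
         lambda lvl, blk: order.append((lvl, blk)))
print(order[-1][0])   # the uberblock write comes last
```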

ZFS Intent Log (ZIL)

For file modifications that come with an immediate data integrity constraint (O_DSYNC, fsync, etc.) ZFS manages a per-filesystem intent log, or ZIL. The ZIL marks each FS operation (say a write) with a log sequence number. When a synchronous command is requested for an operation (such as an fsync), the ZIL outputs blocks up to that sequence number. When the ZIL is in the process of committing data, further commit operations wait for the previous ones to complete. This allows the ZIL to aggregate multiple small transactions into larger ones, performing commits using fewer, larger I/Os.
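The batching idea can be sketched like this (a toy model, not the actual ZIL code; class and field names are mine): a commit up to a sequence number flushes every pending record at or below it, so concurrent synchronous writers end up sharing one larger I/O.

```python
# Minimal sketch of ZIL-style commit aggregation: log records carry
# increasing sequence numbers; a commit writes out everything pending
# up to the requested sequence number as a single batch.

class IntentLog:
    def __init__(self):
        self.next_seq = 0
        self.pending = []     # (seq, payload) not yet on stable storage
        self.io_count = 0     # number of device I/Os issued so far

    def log_write(self, payload):
        self.next_seq += 1
        self.pending.append((self.next_seq, payload))
        return self.next_seq

    def commit(self, seq):
        """Flush all records with sequence <= seq in one batch."""
        batch = [r for r in self.pending if r[0] <= seq]
        if batch:
            self.pending = [r for r in self.pending if r[0] > seq]
            self.io_count += 1   # one aggregated I/O, not len(batch)

zil = IntentLog()
seqs = [zil.log_write(b"data%d" % i) for i in range(10)]
zil.commit(seqs[-1])   # ten writes committed with a single I/O
print(zil.io_count)    # 1
```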

The ZIL works by issuing all the required I/Os and then flushing the write caches, if those are enabled. This use of the disk write cache does not artificially improve a disk's commit latency, because ZFS ensures that data is physically committed to storage before returning. However, the write cache allows a disk to hold multiple concurrent I/O transactions, which acts as a good substitute for drives that do not implement tagged queueing.

CAVEAT: The current state of the ZIL is such that if there is a lot of pending data in a filesystem (written to the FS but not yet output to disk) and a process issues an fsync() for one of its files, then all pending operations have to be sent to disk before the synchronous command can complete. This can lead to unexpected performance characteristics. Code is under review.

I/O Scheduler and Priorities

ZFS keeps track of pending I/Os but only issues a certain number (35 by default) to the disk controllers. This allows the controllers to operate efficiently while never overflowing their queues. By limiting the I/O queue size, service times of individual disks are kept to reasonable values. When one I/O completes, the I/O scheduler decides the next most important one to issue. The priority scheme is time-based; for instance, an input I/O servicing a read call will be prioritized over any regular output I/O issued in the last ~0.5 seconds.
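A minimal sketch of that policy follows; the 35-entry limit and the ~0.5 s figure come from the text, but the structure (a deadline heap with a read boost) is purely illustrative, not the actual ZFS scheduler:

```python
import heapq
import time

MAX_INFLIGHT = 35    # per-device issue limit from the text
READ_BOOST = 0.5     # reads beat writes issued in the last ~0.5 s

class Scheduler:
    def __init__(self):
        self.queue = []      # heap of (effective deadline, seq, kind)
        self.inflight = 0
        self.seq = 0         # tie-breaker preserving submission order

    def submit(self, kind, now=None):
        now = time.monotonic() if now is None else now
        # Reads get an earlier effective deadline, so they jump ahead
        # of writes issued within the last READ_BOOST seconds.
        deadline = now - READ_BOOST if kind == "read" else now
        self.seq += 1
        heapq.heappush(self.queue, (deadline, self.seq, kind))

    def issue_next(self):
        """Issue the most urgent pending I/O if the queue has room."""
        if self.inflight < MAX_INFLIGHT and self.queue:
            self.inflight += 1
            return heapq.heappop(self.queue)[2]
        return None

s = Scheduler()
s.submit("write", now=100.0)
s.submit("read", now=100.2)   # arrives later, but jumps the queue
print(s.issue_next())         # read
```

Note that a write older than 0.5 s regains priority over a newer read, matching the anti-starvation behavior described below.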

The fact that ZFS limits each leaf device's I/O queue to 35 is one of the reasons to build zpools using vdevs that are individual disks, or at least volumes that map to a small number of disks. Otherwise this self-imposed limit could become an artificial performance throttle.

Read Syscalls

If a read cannot be serviced from the ARC cache, ZFS issues a prioritized I/O for the data. So even if the storage is handling a heavy output load, there are only 35 I/Os outstanding, all with reasonable service times. As soon as one of the 35 I/Os completes, the I/O scheduler issues the read I/O to the controller. This ensures good service times for read operations in general.

However, to avoid starvation, when there is a long-standing backlog of output I/Os those eventually regain priority over the input I/Os. ZIL synchronous I/Os have the same priority as synchronous reads.


The prefetch code, which allows ZFS to detect sequential or strided access to a file and issue I/O ahead of the application, is currently under review. To quote the developer: "ZFS prefetching needs some love".

Write Syscalls

ZFS never overwrites live data on-disk and will always output full records validated by a checksum. So in order to partially overwrite a file record, ZFS first has to have the corresponding data in memory. If the data is not yet cached, ZFS will issue an input I/O before allowing the write(2) to partially modify the file record. With the data now in cache, more writes can target the blocks. On output ZFS will checksum data before sending to disk. For full record overwrite the input phase is not necessary.
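The read-modify-write rule above can be sketched as follows (a toy model with invented names; the cache stands in for the ARC and the dict for the on-disk state):

```python
RECORDSIZE = 128 * 1024   # default record size from the text

class RecordStore:
    """Toy model of record updates: a partial write to an uncached
    record must input the record first; a full-record overwrite or a
    write to a cached record needs no input I/O."""

    def __init__(self, disk):
        self.disk = disk        # record number -> bytes ("on disk")
        self.cache = {}         # record number -> bytearray (cached)
        self.input_ios = 0      # input I/Os forced by partial writes

    def write(self, recno, offset, data):
        full_overwrite = (offset == 0 and len(data) == RECORDSIZE)
        if recno not in self.cache and not full_overwrite:
            # Partial update of an uncached record: read it in first.
            self.input_ios += 1
            self.cache[recno] = bytearray(
                self.disk.get(recno, b"\0" * RECORDSIZE))
        elif recno not in self.cache:
            self.cache[recno] = bytearray(RECORDSIZE)
        self.cache[recno][offset:offset + len(data)] = data

store = RecordStore(disk={0: b"x" * RECORDSIZE})
store.write(0, 0, b"y" * RECORDSIZE)   # full record: no input needed
store.write(1, 100, b"z" * 512)        # partial, uncached: one input I/O
store.write(1, 700, b"z" * 512)        # now cached: no further input
print(store.input_ios)                  # 1
```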

CAVEAT: Simple write calls (not O_DSYNC) are normally absorbed by the ARC cache and so proceed very quickly. A sustained dd(1)-like load can quickly overrun a large amount of system memory and cause transaction groups to eventually throttle all applications for a long time (tens of seconds). This is probably what underwrites the notion that ZFS needs more RAM (it does not). Write throttling code is under review.

Soft Track Buffer

An input I/O is serious business. While a filesystem can decide where to write stuff out on disk, inputs are requested by applications. This means a necessary head seek to the location of the data, and the time to issue a small read is totally dominated by this seek. So ZFS takes the stance that it might as well amortize those operations: for uncached reads, ZFS normally issues a fairly large input I/O (64K by default). This helps loads that input data using access patterns similar to the output phase. The data goes into a per-device cache holding 20 MB.

This cache can be invaluable in reducing the I/Os necessary to read in data. But just like the recordsize, if the inflated I/Os cause storage channel saturation, the Soft Track Buffer can act as a performance throttle.

The ARC Cache

The most interesting caching occurs at the ARC layer. The ARC manages the memory used by blocks from all pools (each pool servicing many filesystems). ARC stands for Adaptive Replacement Cache, and it is inspired by a paper by Megiddo and Modha presented at the FAST '03 USENIX conference.

The ARC manages its data keeping a notion of Most Frequently Used (MFU) and Most Recently Used (MRU), balancing intelligently between the two. One of its very interesting properties is that a large scan of a file will not destroy most of the cached data.

On a system with free memory, the ARC grows as it starts to cache data. Under memory pressure the ARC returns some of its memory to the kernel until the low-memory condition is relieved.

We note that while ZFS has behaved rather well under 'normal' memory pressure, it does not appear to behave satisfactorily under swap shortage. The memory usage pattern of ZFS is very different from that of other filesystems such as UFS, and so it exposes VM-layer issues in a number of corner cases. For instance, a number of kernel operations fail with ENOMEM without even attempting a reclaim operation. If they did, ZFS would respond by releasing some of its own buffers, allowing the initial operation to then succeed.

The fact that ZFS caches data in the kernel address space does mean that the kernel size will be bigger than with traditional filesystems. For heavy-duty usage it is recommended to use a 64-bit kernel, i.e. any SPARC system or an AMD system configured in 64-bit mode. Some systems that have managed in the past to run without any swap configured should probably start to configure some.

The behavior of the ARC in response to memory pressure is under review.

CPU Consumption

Recent enhancements to ZFS have improved its CPU efficiency by a large factor. We don't expect to deviate much from other filesystems in terms of cycles per operation. ZFS checksums all disk blocks, but this has not proven costly at all in terms of CPU consumption.

ZFS can be configured to compress on-disk blocks. We do expect to see some extra CPU consumption from that compression. While it is possible that compression could lead to some performance gain due to the reduced I/O load, the emphasis of compression should be to save on-disk space, not to gain performance.

What About Your Test ?

This is what I know about the ZFS performance model today. My performance comparison on different types of modelled workloads last fall already had ZFS ahead on many of them; we have improved the biggest issues highlighted then, and there are further performance improvements in the pipeline (based on the UFS experience, we know this will never end). Best practices are being spelled out.
You can contribute by comparing your actual usage and workload pattern with the simulated workloads. But nothing will beat having reports from real workloads at this stage; Your results are therefore of great interest to us. And watch this space for updates...

Wednesday, June 07, 2006

Tuning ZFS recordsize

One important performance parameter of ZFS is the recordsize, which governs the size of filesystem blocks for large files. This is the unit that ZFS validates through checksums. Filesystem blocks are dynamically striped onto the pooled storage on a block-to-virtual-device (vdev) basis.

It is expected that for some loads, tuning the recordsize will be required. Note that in traditional filesystems such a tunable would govern the behavior of all of the underlying storage. With ZFS, tuning this parameter only affects the tuned filesystem instance, and it applies only to newly created files. The tuning is achieved using

zfs set recordsize=64k mypool/myfs

In ZFS all files are stored either as a single block of varying size (up to the recordsize) or using multiple recordsize blocks. Once a file grows to multiple blocks, its blocksize is definitively set to the FS recordsize at that time.

Some more experience will be required with the recordsize tuning. Here are some elements to guide along the way.

If one considers the input of an FS block, typically in response to an application read, the size of the I/O in question will not fundamentally change the latency by much. So, as a first approximation, the recordsize does not matter (I'll come back to that) for read-type workloads.

For FS block outputs, those governed by the recordsize, the I/Os actually occur mostly asynchronously with the application; and since applications are not commonly held up by those outputs, the delivered throughput is, as for read-type loads, not impacted by the recordsize.

So the first approximation is that the recordsize does not impact performance much. To service loads that are transient in nature with short I/O bursts (< 5 seconds), we do not expect recordsize tuning to be necessary. The same can be said for sequential-type loads.

So what about the second approximation? A problem that can occur with an inflated recordsize (128K) compared to application read/write sizes is early storage saturation. If an application requests 64K of data, then providing a 128K record doesn't change the latency the application sees much. However, if the extra data is discarded from the cache before ever being read, the data channel was occupied for no good reason. If the limiting factor of the storage is, for instance, a 100 MB/s channel, I can handle roughly 800 128K records per second on that channel. If I halve the recordsize, that should double the number of small records I can input.
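The channel arithmetic works out as follows (a trivial helper of my own, using 1 MB = 1024 KB):

```python
def records_per_sec(channel_mb_per_s, recordsize_kb):
    """How many records per second a channel of the given bandwidth
    can carry, assuming the channel is the only bottleneck."""
    return channel_mb_per_s * 1024 // recordsize_kb

# A 100 MB/s channel moves about 800 x 128K records per second;
# halving the recordsize doubles the record rate it can carry.
print(records_per_sec(100, 128))   # 800
print(records_per_sec(100, 64))    # 1600
```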

On small-record output loads, the system memory creates a buffer that defers the direct impact on applications. For output, if the storage is saturated this way for tens of seconds, ZFS will eventually throttle applications. This means that, in the end, when the recordsize leads to sustained storage overload on output, there will be an impact as well.

There is another aspect to the recordsize. A partial write to an uncached FS block (a write syscall of size smaller than the recordsize) will first have to input the corresponding data. Conversely, when individual writes cover full recordsize blocks, those writes can be handled without the need to input the associated FS blocks. Other considerations (metadata overhead, caching) dictate, however, that the recordsize not be reduced below a certain point (16K to 64K; do send in your experience).

So, one piece of advice is to keep an eye on the channel throughput and tune the recordsize for random-access workloads that saturate the storage. Sequential-type workloads should work quite well with the current default recordsize. If the applications' read/write sizes can be increased, that should also be considered. For non-cached workloads that overwrite file data in small aligned chunks, matching the recordsize to the write access size may bring some performance gains.

Tuesday, June 06, 2006



I'll touch on three aspects of that question here:

- reported freemem

- syscall writes to mmap pages

- application write throttling

Reported freemem will be lower when running with ZFS than with, say, UFS. The UFS page cache is counted as freemem. ZFS returns its 'cache' only when memory is needed. So you will operate with lower freemem but won't normally suffer from this.

It has been wrongly feared that this mode of operation puts us back to the days of Solaris 2.6 and 7, where we saw a roller-coaster effect on freemem leading to sub-par application performance. We actually DO NOT have this problem with ZFS. The old problem came about because the memory reaper could not distinguish between a useful application page and a UFS cached page. That was bad. ZFS frees up its cache in a way that does not cause this problem.

ZFS is designed to release some of its memory when kernel modules exert back pressure on the kmem subsystem. Some kernel code that did not properly exert that pressure was recently fixed (short description here: 4034947).

There is one peculiar workload that does lead ZFS to consume more memory: writing (using syscalls) to pages that are also mmaped. ZFS does not use the regular paging system to manage data that passes through read and write syscalls. However, mmaped I/O, which is closely tied to the virtual memory subsystem, still goes through the regular paging code. So syscall writing to mmaped pages means we keep two copies of the associated data, at least until we manage to get the data to disk. We don't expect that type of load to commonly use a large amount of RAM.

Finally, one area where ZFS behaves quite differently from UFS is in throttling writers. With UFS, until not long ago, we throttled a process trying to write to a file as soon as that file had 0.5 MB of pending I/O associated with it. This limit has recently been upped to 16 MB. The gain of such throttling is that we prevent an application working on a single file from consuming an inordinate amount of system memory. The downside is that we possibly throttle an application unnecessarily when memory is plentiful.

ZFS will not throttle individual apps like this. The scheme is mutualized between all writers: when the global load of application data overflows the I/O subsystem for 5 to 10 seconds, we throttle the applications, allowing the I/O to catch up. Applications thus have a lot more RAM to play with before being throttled.

This is probably what's behind the notion that ZFS likes more RAM. By and large, to cache some data, ZFS just needs the same amount of RAM as any other filesystem. But currently, ZFS lets applications run a lot more decoupled from the I/O subsystem. This can speed up some loads by a very large factor but, at times, will appear as extra memory consumption.

Thursday, December 15, 2005

Beware of the Performance of RW Locks

In my naive little mind an RW lock represents a performant, scalable construct inasmuch as WRITERS do not hold the lock for a significant amount of time. One figures that the lock would be held for short WRITER periods followed by concurrent execution of RW_READERS.

What I recently found out is quite probably well known to seasoned kernel engineers, but it was new to me. So I figured it could be of interest to others.


So reader/writer (RW) locks can be used in kernel and user-level code to allow multiple READERS of, for instance, a data structure to access the structure while allowing only a single WRITER at a time within the bounds of the rwlock.

An RW lock (rwlock(9F), rwlock(3THR)) is more complex than a simple mutex, so acquiring such a lock is more expensive. This means that if the expected hold time of a lock is quite small (say, to update or read one or two fields of a structure), then regular mutexes can usually do the job very well. A common programming mistake is to expect faster execution from RW locks in those cases.

However, when READ hold times need to be fairly long, RW locks represent an alternative construct. With those locks we expect to have multiple READERS executing concurrently, leading to performant code that scales to large numbers of threads. As I said, if WRITERS are just quick updates to the structure, we can naively believe that our code will scale well.

Not So

Let's see how it goes. A WRITER cannot enter the protected code while READERS are executing it. The WRITER must then wait at the door until the READERS release their hold. If the implementation of RW locks didn't pay attention, there would be cases in which at least one READER is always present within the protected code and WRITERS would be starved of access. To prevent such starvation, an RW lock must block READERS as soon as a WRITER has requested access. But no matter: our WRITERS will quickly update the structure and we will get concurrent execution most of the time. Won't we?

Well, not quite. As just stated, an RW lock will block readers as soon as a WRITER has hit the door. This means that the construct does not allow parallel execution at that point. Moreover, the WRITER stays at the door while the current READERS are executing. So the construct is fully serializing from the time a WRITER hits the door until all current READERS are done, followed by the WRITER's own hold time.

For Instance:

	- a RW_READER gets in and will keep a long time. ---|
	- a RW_WRITER hits the lock; is put on hold.        |
	- other RW_READERS now also block.                  |
	.... time passes			            |
	- the long RW_READER releases	   <----------------|
	- the RW_WRITER gets the lock; work; releases
	- other RW_READER now work concurrently.

Pretty obvious once you think about it. So to assess the capacity of an RW lock to allow parallel execution, one must consider the average hold time as a READER but also the frequency of access as a WRITER. The construct becomes efficient and scalable to N threads if and only if:

(avg interval between writers) >> (N * avg read hold time).


In the end, from a performance point of view, RW locks should be used only when the average hold time is significant enough to justify this more complex type of lock: for instance, calling a function of unknown latency or issuing an I/O while holding the lock are good candidates. But the construct will scale to N threads if and only if WRITERS are very infrequent.
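The scalability condition above can be turned into a quick back-of-the-envelope check (the function and the "much larger" factor of 10 are my own illustration of the rule of thumb, not a precise model):

```python
def rw_lock_scales(writer_interval, n_readers, read_hold_time, factor=10):
    """True when writers are rare enough that readers mostly run in
    parallel: the average interval between WRITERS must be much larger
    (here: `factor` times) than N * the average READER hold time."""
    return writer_interval > factor * n_readers * read_hold_time

# 100 ms between writers, 8 readers holding 0.1 ms each: scales fine.
print(rw_lock_scales(0.100, 8, 0.0001))   # True
# Same readers, but a writer every 5 ms: the lock serializes.
print(rw_lock_scales(0.005, 8, 0.0001))   # False
```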


Tuesday, December 06, 2005

Showcasing UltraSPARC T1 with Directory Server's searches

So my friend and Sun Directory Server (DS) developer Gilles Bellaton recently got his hands on an early-access Niagara (UltraSPARC T1) system, something akin to a Sun Fire T2000.

The chip in the system had only 7 active cores and thus 28 hardware threads (a.k.a. strands), but we wanted to check how well it would perform on DS. The results here are a little anecdotal: we just ran a few quick tests with the aim of showcasing Niagara, but nevertheless the results were beyond expectations.

If you consider the throughput-engine architecture that Niagara provides, we can expect it to perform well on highly multithreaded loads such as a directory search test. Since we had limited disk space on the system, the slapd instance was created on /tmp. We realize that this is not at all a proper deployment condition; however, the nature of the test is such that we would expect the system to operate mostly from memory (database fully cached). The only data that would need to go to disk in a real deployment would be the access log, and that is typically not a throughput-limiting subsystem.

So we can prudently expect that a real on-disk deployment of a read-mostly workload, in which the DB can be fully cached, might perform close to our findings. This showcase test is a base search over a tiny 1000-entry database using a 50-thread slapd. Slapd was not tuned in any way before the test. For simplicity, the client was run on the same system as the server. This means that, on the one hand, the client consumes some CPU away from the server, but on the other, it reduces the need to run the network adapter driver code. All in all, this was not designed as a realistic DS test, but only to see, in a few hours of access time to the system, whether DS was running acceptably well on this cool new hardware.

The results were obtained with Gilles' workspace of the DS 6.0 optimized build of August 29th, 2005. The number of CPUs was adjusted by creating processor sets (psrset).

Number of Strands                    Search/sec			Ratio
1                                     920			 1    X
3 (1 core; 3 str/core)               2260			 2.45 X
4 (1 core; 4 str/core)               2650			 2.88 X
4 (4 cores; 1 str/core)              4100			 4.45 X
14 (7 cores; 2 str/core)            12500			13.59 X
21 (7 cores; 3 str/core)            16100			17.5  X
28 (7 cores; 4 str/core)            18200			19.8  X

Those are pretty good scaling numbers straight out of the box. While other, more realistic investigations will be produced, this test at least showed us early on that Niagara-based systems do not suffer from any flagrant deficiency when running DS searches.


Monday, June 13, 2005

Bonjour Monde

That's "Hello World" in French, but one wouldn't say it that way anyway. Maybe one would say "Bonjour tout le monde", meaning "Hello all", which you might say, for example, when entering a room full of people (especially if, like me, you don't care much about greeting everyone individually). So that's your first hint: I'm a geeky sociopath more likely to communicate through a weblog than in real life. The next hint is that I master the French language, as you might expect from someone who lives in France. However, I've lived in France for only about 15 years, which should allow you to guess that I was not born in France (French law prohibits child labor). And I've been working for Sun since 1997. The reason I master the French language is probably that both my parents spoke no other language. That was in Quebec, a part of Canada filled with people who speak English with a funny semi-French accent. So bear with me; my writing will also have this accent. In summary: Canadian, lives in France, has worked for Sun for 8 years. I do performance engineering, which to me means: I take a performance number, I explain why it is what it is, and I propose what needs to be done to improve it. Welcome to my blog. And your name is?


