Wednesday June 10, 2015

Zero Copy I/O Aggregation

One of my favorite features of ZFS is the I/O aggregation done in the final stage of issuing I/Os to devices. In this article, I explain in more detail what this feature is and how we recently improved it with a new zero copy feature.

It is well known that ZFS is a Copy-on-Write storage technology. That doesn't mean that we constantly copy data from disk to disk. More to the point, it means that when data is modified we store that data in a fresh on-disk location of our own choosing. This is primarily done for data integrity purposes and is managed by the ZFS transaction group (TXG) mechanism that runs every few seconds. But an important side benefit of this freedom given to ZFS is that I/Os, even unrelated I/Os, can be allocated in physical proximity to one another. Cleverly scheduling those I/Os to disk then makes it possible to detect contiguous I/Os and issue a few large ones rather than many small ones.

One consequence of I/O aggregation is that the final I/O sizes used by ZFS during a TXG, as observed by ZFSSA Analytics or iostat(1), depend more on the availability of contiguous on-disk free space than they do on the individual application write(2) sizes. To a new ZFS user or storage administrator, it can certainly be baffling that hundreds of independent 8K writes can end up being serviced by a single disk I/O.

The timeline of an asynchronous write is described like this:

  • Application issues a write(2) of N bytes to a file stored using ZFS records of size R. Initially the data is stored in the ARC cache.

  • ZFS notes the M dirty blocks needing to be issued in the next TXG as follows:
    • If R=128K, a small write(2), say of 10 bytes, means 1 dirty block (of 128K)
    • If R=8K, a single 128K write(2) implies 16 dirty blocks (of 8K)

  • Within the next few seconds multiple dirty blocks get associated with the upcoming TXG.

  • The TXG starts. ZFS gathers all of the dirty blocks and starts I/Os [1].

    • Individual blocks get checksummed and, as necessary, compressed and encrypted. Then and only then, knowing the compressed size and the actual data that needs to be stored on disk, a device is selected and an allocation takes place,

    • The allocation engine finds a chunk in proximity to recent allocations (a future topic of its own),

    • The I/O is maintained by ZFS using 2 structures, one ordered by priority and another ordered by device offset.

  • As soon as there is at least one I/O in these structures, the device level ZIO pipeline gets to work. When a slot is available, the highest priority I/O for that device is selected to be issued.
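The dirty-block accounting in the second step of this timeline can be sketched in a few lines. This is only an illustration: the function name `dirty_blocks` is made up, and it assumes the file is already laid out in full records of size R.

```python
def dirty_blocks(offset: int, length: int, recordsize: int) -> int:
    """Count the records touched by a write of `length` bytes at `offset`."""
    if length <= 0:
        return 0
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return last - first + 1

K = 1024
print(dirty_blocks(0, 10, 128 * K))     # 1
print(dirty_blocks(0, 128 * K, 8 * K))  # 16
```

Note that a write straddling a record boundary dirties both records, however small the write is.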

And here is where the magic occurs. With this highest priority I/O in hand, the ZIO pipeline doesn't just issue that I/O to the device. It first checks for other I/Os which could be physically adjacent to this one. It gathers all such I/Os together until hitting our upper limit for disk I/O size. Because of the way this process works, if there are contiguous chunks of free space available on the disk, we're nearly guaranteed that ZFS finds pending I/Os that are adjacent and can be aggregated.
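To make the selection step concrete, here is a toy model of it. It is a sketch under assumptions, not the ZFS code: the `PendingIO` type, the `pick_and_aggregate` function, the "lower value is more urgent" priority convention and the two limit parameters are all invented for illustration, and the two ordered structures are simulated by sorting a plain list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PendingIO:
    offset: int    # device offset in bytes
    size: int      # length in bytes
    priority: int  # assumption: lower value means more urgent

def pick_and_aggregate(pending, max_bytes, max_ios):
    """Pick the most urgent I/O, then absorb physically adjacent ones."""
    if not pending:
        return []
    by_offset = sorted(pending, key=lambda io: io.offset)
    lead = min(pending, key=lambda io: (io.priority, io.offset))
    lo = hi = by_offset.index(lead)
    total = lead.size
    # Extend toward lower offsets while the neighbour ends where we start.
    while lo > 0 and hi - lo + 1 < max_ios:
        prev = by_offset[lo - 1]
        if prev.offset + prev.size != by_offset[lo].offset:
            break
        if total + prev.size > max_bytes:
            break
        lo -= 1
        total += prev.size
    # Extend toward higher offsets the same way.
    while hi + 1 < len(by_offset) and hi - lo + 1 < max_ios:
        nxt = by_offset[hi + 1]
        if by_offset[hi].offset + by_offset[hi].size != nxt.offset:
            break
        if total + nxt.size > max_bytes:
            break
        hi += 1
        total += nxt.size
    return by_offset[lo:hi + 1]

K = 1024
pending = [
    PendingIO(0 * K, 8 * K, 5),
    PendingIO(8 * K, 8 * K, 5),
    PendingIO(16 * K, 8 * K, 1),    # most urgent
    PendingIO(24 * K, 8 * K, 5),
    PendingIO(1024 * K, 8 * K, 5),  # not adjacent to the others
]
group = pick_and_aggregate(pending, max_bytes=1024 * K, max_ios=64)
print(len(group), sum(io.size for io in group) // K)  # 4 32
```

The urgent 8K I/O drags its three lower-priority neighbours along, so the device sees one 32K I/O instead of four 8K ones.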

This also explains why one sees regular bursts of large I/Os whose sizes are mostly unrelated to the sizes of writes issued by the applications. And I emphasize that this is totally unrelated to the random or sequential nature of the application workload. Of course, for hard disk drives (HDDs), managing writes this way is very efficient. Therefore, those HDDs are less busy and stay available to service the incoming I/Os that applications are waiting on.

And this brings us to the topic du jour. Until recently, there was a cost to doing this aggregation in the form of a memory copy. We would take the buffers coming from the ZIO pipeline (after compression and encryption) and copy them to a newly allocated aggregated buffer. Thanks to a new Solaris mvector feature, we can now run the ZIO aggregation pipeline without incurring this copy. That, in turn, allows us to boost the maximum aggregation size from 128K up to 1MB for extra efficiency. The aggregation code also limits itself to aggregating 64 buffers together. When working with 8K blocks we can see up to 512K I/Os during a TXG, and 1MB I/Os with bigger blocks.
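The 512K and 1MB figures follow from combining the two limits. As a quick check (the constant and function names are illustrative, and blocks are assumed to be uniformly sized and uncompressed):

```python
K = 1024
MAX_AGG_BYTES = 1024 * K  # 1MB size limit quoted above
MAX_AGG_BUFFERS = 64      # buffer-count limit quoted above

def largest_aggregate(block_size: int) -> int:
    """Largest aggregated I/O, in bytes, for uniformly sized blocks."""
    by_count = MAX_AGG_BUFFERS * block_size
    by_bytes = (MAX_AGG_BYTES // block_size) * block_size
    return min(by_count, by_bytes)

print(largest_aggregate(8 * K) // K)    # 512
print(largest_aggregate(128 * K) // K)  # 1024
```

With 8K blocks the buffer count is the binding limit; from 16K blocks upward the 1MB size limit takes over.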

Now, a word about the ZIL. In this article, I focus on the I/Os issued by the TXG, which happens every 5 seconds. In between TXGs, if disk writes are observed, those would have to come from the ZIL. The ZIL also does its own grouping of write requests that hit a given dataset (share, zvol or filesystem). Then, once the ZIL gets to issue an I/O, it uses the same I/O pipeline as just described. Since ZIL I/Os are of high priority, they tend to issue straight away. And because they issue quickly, there are generally not a lot of them around for aggregation. So it is common to have the ZIL I/Os not aggregate much, if at all. However, under a heavy synchronous write load, when the underlying device becomes saturated, a queue of ZIL I/Os forms and they become subject to ZIO level aggregation.

When observing the I/Os issued to a pool with iostat it's nice to keep all this in mind: synchronous writes don't really show up with their own size. The ZIL issues I/O for a set of synchronous writes that may further aggregate under heavy load. Then, with a 5 second regularity, the pool issues I/O for every modified block, usually with large I/Os whose size is unrelated to the application I/O size.

It's a really efficient way to do all this, but it does take some getting used to.

[1] Application write size is not considered during a TXG.

Tuesday Apr. 28, 2015

It is the Dawning of the Age of the L2ARC

One of the most exciting things that have gone into ZFS in recent history has been the overhaul of the L2ARC code. We fundamentally changed the L2ARC such that it would do the following:

  • reduce its own memory footprint,
  • be able to survive reboots,
  • be managed using a better eviction policy,
  • be compressed on SSD,
  • and finally allow feeding at much greater rates than ever achieved before.
Let's review these elements, one by one.

Reduced Footprint

We already saw in this ReARC article that we dropped the amount of core header information from 170 bytes to 80 bytes. This means we can track more than twice as much L2ARC data as before using a given memory footprint. In the past, the L2ARC had trouble building up in size due to its feeding algorithm, but we'll see below that the new code allows us to grow the L2ARC and use up available SSD space in its entirety. So much so that initial testing revealed a problem: for small memory configs with large SSDs, the L2ARC headers could actually end up filling most of the ARC cache, and that didn't deliver good performance. So, we had to put in place a memory guard for L2 headers, which is currently set to 30% of the ARC. As the ARC grows and shrinks, so does the maximum space dedicated to tracking the L2ARC. So, on a system with 1TB of ARC cache, up to 300GB could, if necessary, be devoted to tracking the L2ARC. With 80-byte headers, this means we could track a whopping 30TB of data assuming an 8K blocksize. If you use a 32K blocksize, currently the largest block we allow in the L2ARC, then that grows up to 120TB of SSD-based auto-tiered L2ARC. Of course, if you have a small L2ARC, the tracking footprint of the in-core metadata is smaller.
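The sizing arithmetic above is easy to reproduce. The sketch below uses only the figures quoted in the text (80-byte headers, a 30% guard); the names are illustrative, and it assumes decimal terabytes for the ARC and power-of-two block sizes.

```python
TB = 10**12
K = 1024
HEADER_BYTES = 80        # in-core header per L2ARC block
L2_HEADER_PERCENT = 30   # memory guard, as a share of the ARC

def trackable_l2arc_bytes(arc_bytes: int, block_bytes: int) -> int:
    """L2ARC data that fits under the header guard for a given ARC size."""
    header_budget = arc_bytes * L2_HEADER_PERCENT // 100
    return (header_budget // HEADER_BYTES) * block_bytes

for block in (8 * K, 32 * K):
    tracked = trackable_l2arc_bytes(1 * TB, block)
    print(block // K, "K blocks:", tracked / TB, "TB")
```

For a 1TB ARC this gives a little over 30TB with 8K blocks and a little over 120TB with 32K blocks, in line with the rounded figures quoted above.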

Persistent Across Reboot

With that much tracked L2ARC space, you would hate to see it washed away on a reboot as the previous code did. Not so anymore: the new L2ARC has an on-disk format that allows it to be reconstructed when a pool is imported. That new format tracks the device space in 8MB segments in which each ZFS block (DVA for the ZFS geeks) consumes 40 bytes of on-SSD space. So, reusing the example of an L2ARC made up of only 8K-sized blocks, each 8MB segment could store about 1000 of those blocks, consuming just 40K of on-SSD metadata. The key thing here is that to rebuild the in-core L2ARC space after a reboot, you only need to read back 40K, from the SSD itself, in order to discover and start tracking 8MB worth of data. We found that we could start tracking many TBs of L2ARC within minutes after a reboot. Moreover, we made sure that as segment headers were read in, they would immediately be made available to the system and start to generate L2ARC hits, even before the L2ARC was done importing every segment. I should mention that this L2ARC import is done asynchronously with respect to the pool import and is designed to not slow down pool import or concurrent workloads. Finally, that initial L2ARC import mechanism was made scalable with many import threads per L2ARC device.
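The rebuild cost can be estimated from the two numbers given above (8MB segments, 40 bytes per block). This is back-of-the-envelope arithmetic with illustrative names, assuming the per-block entries are the only metadata that must be read back:

```python
K = 1024
SEGMENT_BYTES = 8 * K * K  # 8MB segment
ENTRY_BYTES = 40           # on-SSD metadata per block

def segment_metadata(block_bytes: int):
    """Blocks per segment and the metadata bytes needed to describe them."""
    blocks = SEGMENT_BYTES // block_bytes
    return blocks, blocks * ENTRY_BYTES

def rebuild_read_bytes(l2arc_bytes: int, block_bytes: int) -> int:
    """Metadata to read back in order to start tracking a whole L2ARC."""
    segments = l2arc_bytes // SEGMENT_BYTES
    return segments * segment_metadata(block_bytes)[1]

blocks, meta = segment_metadata(8 * K)
print(blocks, meta // K)                         # 1024 40
print(rebuild_read_bytes(K**4, 8 * K) // K**3)   # 5
```

In other words, roughly 5GB of reads is enough to start tracking 1TB (2^40 bytes) of 8K blocks, which is why many terabytes can come back within minutes.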

Better Eviction

One of the benefits of using an L2ARC segment architecture is that we can now weigh segments individually and use the least valued segment as the eviction candidate. The previous L2ARC would actually manage L2ARC space by using a ring buffer architecture: first-in, first-out. It's not a terrible solution for an L2ARC, but the new code allows us to work on a weight function to optimise the eviction policy. The current algorithm puts segments that are hit (an L2ARC cache hit) at the top of the list, such that a segment with no hits gets evicted sooner.
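A toy version of that policy shows how it differs from a ring buffer. The `SegmentList` class is hypothetical, and placing new segments at the top of the list is an assumption made for the sketch; the text only specifies that hit segments move to the top.

```python
from collections import OrderedDict

class SegmentList:
    """Eviction order for L2ARC segments: front of the dict is evicted first."""

    def __init__(self):
        self._segments = OrderedDict()  # segment id -> hit count

    def add(self, seg_id):
        self._segments[seg_id] = 0      # assumption: new segments start on top

    def hit(self, seg_id):
        self._segments[seg_id] += 1
        self._segments.move_to_end(seg_id)  # a hit moves the segment to the top

    def evict(self):
        seg_id, _hits = self._segments.popitem(last=False)
        return seg_id

segs = SegmentList()
for name in ("a", "b", "c"):
    segs.add(name)
segs.hit("a")
print(segs.evict())  # b
```

A first-in, first-out ring would have evicted segment `a` here, despite it being the only one generating hits.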

Compressed on SSD

Another great new feature delivered is the addition of compressed L2ARC data. The new L2ARC stores data in SSDs the same way it is stored on disk. Compressed datasets are captured in the L2ARC in compressed format which provides additional virtual capacity. We often see a 2:1 compression ratio for databases and that is becoming more and more the standard way to deploy our servers. Compressed data now uses less SSD real estate in the L2ARC: a 1TB device holds 2TB of data if the data compresses 2:1. This benefit helps absorb the extra cost of flash based storage. For the security minded readers, be reassured that the data stored in the persistent L2ARC is stored using the encrypted format.

Scalable Feeding

There is a lot to like about what I just described but what gets me the most excited is the new feeding algorithm. The old one was suboptimal in many ways. It didn't feed well, disrupted the primary ARC, had self-imposed obsolete limits and didn't scale with the number of L2ARC devices. All gone.

Before I dig in, it should be noted that a common misconception about L2ARC feeding is assuming that the process handles data as it gets evicted from L1. In fact the two processes, feeding and evicting, are separate operations, and it is sometimes necessary under memory pressure to evict a block before being able to install it in the L2ARC. The new code is much, much better at avoiding such events; it does so by keeping its feed point well ahead of the ARC tail. Under many conditions, when data is evicted from the primary ARC, it is after the L2ARC has processed it.

The old code also had some self-imposed throughput limit, which meant that N L2ARC devices in one pool would not be fed at proper throughput. Given the strength of the new feeding algorithm, we were able to remove such limits, and feeding now scales with the number of L2ARC devices in use. We also removed an obsolete constraint in which read I/Os would not be sent to devices as they were fed.

With these in place, if you have enough L2ARC bandwidth in the devices, then there are few constraints in the feeder to prevent actually capturing 100% of eligible L2ARC data [1]. And capturing 100% of data is the key to actually delivering a high L2ARC hit rate in the future. By hitting in L2, of course you delight end users waiting for such reads. More importantly, an L2ARC hit is a disk read I/O that doesn't have to be done. Moreover, that saved HDD read is a random read, one that would have led to a disk seek, the real weakness of HDDs. Therefore, we reduce utilization of the HDDs, which is of paramount importance when some unusual job mix arrives and causes those HDDs to become the resource gating performance: a.k.a. crunch time. With a large L2ARC hit count, you get out of this crunch time quicker and restore a proper level of service to your users.


The L2ARC eligibility rules were impacted by the compression feature. The max blocksize considered for eligibility was unchanged at 32K, but the check is now done on compressed size if compression is enabled. As before, the idea behind an upper limit on eligible size is two-fold. First, for larger blocks, the latency advantage of flash over spinning media is reduced. Second, the SSD will eventually fill up with data. At that point, any block we insert in the L2ARC requires an equivalent amount of eviction. A single large block can thus cause eviction of a large number of small blocks. Without an upper cap on block size, we can face a situation of inserting a large block for a small gain with a large potential downside if many small evicted blocks become the subject of future hits. To paraphrase Yogi Berra: "Caching decisions are hard." [2]

The second important eligibility criterion is that blocks must not have been read through prefetching. The idea is fairly simple. Prefetching applies to sequential workloads, and for such workloads flash storage offers little advantage over HDDs. This means that data that comes in through ZFS level prefetching is not eligible for L2ARC.

These criteria leave 2 pitfalls to avoid during an L2ARC demo: first, configuring all datasets with a 128K recordsize, and second, trying to prime the L2ARC using dd-like sequential workloads. Both of those are, by design, workloads that bypass the L2ARC. The L2ARC is designed to help you with disk-crunching real workloads, which are those that access small blocks of data in random order.
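The two eligibility rules, and why both demo pitfalls miss the L2ARC, can be written down as a small predicate. This is a sketch of the rules as described in this article, not the actual ZFS check; the function and parameter names are invented.

```python
K = 1024
L2ARC_MAX_ELIGIBLE = 32 * K  # upper limit quoted above

def l2arc_eligible(logical_size, compressed_size=None, prefetched=False):
    """Apply the size and prefetch rules described in the text."""
    if prefetched:
        return False  # blocks read through prefetch are never eligible
    # With compression enabled, the check uses the compressed size.
    size = logical_size if compressed_size is None else compressed_size
    return size <= L2ARC_MAX_ELIGIBLE

print(l2arc_eligible(8 * K))                            # True
print(l2arc_eligible(128 * K))                          # False
print(l2arc_eligible(64 * K, compressed_size=30 * K))   # True
print(l2arc_eligible(8 * K, prefetched=True))           # False
```

The second and fourth calls correspond to the two pitfalls: uncompressed 128K records, and blocks brought in by a dd-like sequential read.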

Conclusion: A Better HSP

In this context, the Hybrid Storage Pool (HSP) model refers to our ZFSSA architecture where data is managed in 3 tiers:

  1. a high capacity TB scale super fast RAM cache;
  2. a PB scale pool of hard disks with RAID protection;
  3. a channel of SSD-based cache devices that automatically capture an interesting subset of the data.
And since the data is captured in the L2ARC device only after it has been stored in the main storage pool, those L2ARC SSDs do not need to be managed by RAID protection. A single copy of the data is kept in the L2ARC knowing that if any L2ARC device disappears, data is guaranteed to be present in the main pool. Compared to a mirrored all-flash storage solution, this ZFSSA auto-tiering HSP means that you get 2X the bang for your SSD dollar by avoiding mirroring of SSDs and with ZFS compression that becomes easily 4X or more. This great performance comes along with the simplicity of storing all of your data, hot, warm or cold, into this incredibly versatile high performance and cost effective ZFS based storage pool.

[1] It should be noted that ZFSSA tracks L2ARC eviction as "Cache: ARC evicted bytes per second broken down by L2ARC state", with subcategories of "cached," "uncached ineligible," and "uncached eligible." Having this last one at 0 implies a perfect L2ARC capture.

[2] For non-Americans, this famous baseball coach is quoted as having said, "It's tough to make predictions, especially about the future."

Friday Feb. 20, 2015

ZIL Pipelining

The third topic on my list of improvements since 2010 is ZIL pipelining:
		Allow the ZIL to carve up smaller units of
		work for better pipelining and higher log device
		utilisation.

So let's remind ourselves of a few things about the ZIL and why it's so critical to ZFS. The ZIL stands for ZFS Intent Log and exists in order to speed up synchronous operations such as an O_DSYNC write or fsync(3C) calls. Since most database operations involve synchronous writes, it's easy to understand that having good ZIL performance is critical in many environments.

It is well understood that a ZFS pool updates its global on-disk state at a set interval (5 seconds these days). The ZIL is actually what keeps information in between those transaction groups (TXGs). The ZIL records what is committed to stable storage from a user's point of view. Basically, the last committed TXG plus a replay of the ZIL is the valid storage state from a user's perspective.

The on-disk ZIL is a linked list of records which is actually only useful in the event of a power outage or system crash. As part of a pool import, the on-disk ZIL is read and operations replayed such that the ZFS pool contains the exact information that had been committed before the disruption.
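The "last committed TXG plus replay of the ZIL" idea can be modelled very simply. The record format below, tuples of `(op, name, value)` with two operation types, is purely hypothetical and only serves to show the replay order; real ZIL records are far richer.

```python
def replay(last_txg_state: dict, zil_records: list) -> dict:
    """Rebuild the user-visible state: last committed TXG plus ZIL replay."""
    state = dict(last_txg_state)  # start from the last committed TXG
    for op, name, value in zil_records:  # records are replayed in log order
        if op == "write":
            state[name] = value
        elif op == "remove":
            state.pop(name, None)
        else:
            raise ValueError(f"unknown ZIL record type: {op}")
    return state

on_disk = {"a.txt": "v1"}
log = [
    ("write", "a.txt", "v2"),
    ("write", "b.txt", "v1"),
    ("remove", "a.txt", None),
]
print(replay(on_disk, log))  # {'b.txt': 'v1'}
```

If the system never crashes, the log is never read: the next TXG simply makes the same changes part of the pool's on-disk state.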

While we often think of the ZIL as its on-disk representation (its committed state), the ZIL is also an in-memory representation of every POSIX operation that needs to modify data. For example, a file creation, even if that is an asynchronous operation, needs to be tracked by the ZIL. This is because any asynchronous operation may at any point in time need to be committed to disk; this is often due to an fsync(3C) call. At that moment, every pending operation on a given file needs to be packaged up and committed to the on-disk ZIL.

Where is the on-disk ZIL stored?

Well, that's also more complex than it sounds. ZFS manages devices specifically geared to store ZIL blocks; those separate log devices, or slogs, are very often flash SSDs. However, the ZIL is not constrained to only using blocks from slog devices; it can store data on main (non-slog) pool devices. When storing ZIL information in the non-slog pool devices, the ZIL has a choice of recording data inside ZIL blocks or recording full file records inside pool blocks and storing a reference to them inside the ZIL. This last method for storing ZIL blocks has the benefit of offloading work from the upcoming TXG sync, at the expense of higher latency since the ZIL I/Os are being sent to rotating disks. This mode is the one used with logbias=throughput. More on that below.

Net net: the ZIL records data in stable storage in a linked list, and user applications have synchronization points at which they choose to wait on the ZIL to complete its operation.

When things are not stressed, operations show up at the ZIL, wait a little bit while the ZIL does its work, and are then released. Latency of the ZIL is then coherent with the underlying device used to capture the information. In this rosy picture we would not have done this train project.

At times though, the system can get stressed. The older mode of operation of the ZIL was to issue a ZIL transaction (implemented by ZFS function zil_commit_writer) and while that was going on, build up the next ZIL transaction with everything that showed up at the door. Under stress when a first operation would be serviced with a high latency, the next transaction would accumulate many operations, growing in size thus leading to a longer latency transaction and this would spiral out of control. The system would automatically divide into 2 ad-hoc sets of users; a set of operations which would commit together as a group, while all other threads in the system would form the next ZIL transaction and vice-versa.

This led to bursty activity on the ZIL devices, which meant that, at times, they would go unused even though they were the critical resource. This 'convoy' effect also meant disruption of servers because when those large ZIL transactions did complete, 100s or 1000s of user threads might see their synchronous operation complete and all would end up flagged as 'runnable' at the same time. Often those would want to consume the same resource, run on the same CPU, or use the same lock, etc. This led to thundering herds, a source of system inefficiency.

Thanks to the ZIL train project, we now have the ability to break down convoys into smaller units and dispatch them into smaller ZIL level transactions which are then pipelined through the entire data center.

With logbias set to throughput, the new code attempts to group ZIL transactions in sets of approximately 40 operations, which is a compromise between efficient use of the ZIL and reduction of the convoy effect. For other types of synchronous operations, we group them into sets representing about 32K of data to sync. That means that a single sufficiently large operation may run by itself, but more threads will group together if their individual commit sizes are small.
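As an illustration of these two grouping rules, here is a toy carving function. The thresholds come from the text, but the exact cut logic, the function name `carve_train` and the representation of operations as a list of byte sizes are assumptions made for the sketch.

```python
K = 1024

def carve_train(op_sizes, throughput_mode, max_ops=40, max_bytes=32 * K):
    """Split queued operations (sizes in bytes) into smaller ZIL transactions."""
    groups, current, current_bytes = [], [], 0
    for size in op_sizes:
        if throughput_mode:
            # Throughput mode: cut by operation count.
            full = len(current) >= max_ops
        else:
            # Latency mode: cut by amount of data to sync.
            full = bool(current) and current_bytes + size > max_bytes
        if full:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

print([len(g) for g in carve_train([512] * 100, throughput_mode=True)])    # [40, 40, 20]
print([len(g) for g in carve_train([8 * K] * 10, throughput_mode=False)])  # [4, 4, 2]
print(carve_train([128 * K, 512, 512], throughput_mode=False))
```

The last call shows the case mentioned above: the single 128K operation travels alone, while the two small ones share the next transaction.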

The ZIL train is thus expected to handle bursts of synchronous activity with a lot less stress on the system.


As we just saw the ZIL provides 2 modes of operation. The throughput mode and the default latency mode. The throughput mode is named as such not so much because it favors throughput but more so because it doesn't care too much about individual operation latency. The implied corollary of throughput friendly workloads is that they are very highly concurrent (100s or 1000s of independent operations) and therefore are able to get to high throughput even when served at high latency. The goal of providing a ZIL throughput mode is to actually free up slog devices from having to handle such highly concurrent workloads and allow those slog devices to concentrate on serving other low-concurrency, but highly sensitive to latency operations.

For Oracle DB, we therefore recommend the use of logbias set to throughput for DB files which are subject to highly concurrent DB writer operations while we recommend the use of the default latency mode for handling other latency sensitive files such as the redo log. This separation is particularly important when redo log latency is very critical and when the slog device is itself subject to stress.

When using Oracle 12c with dNFS and OISP, this best practice is automatically put into place. In addition to proper logbias handling, DB data files are created with a ZFS recordsize matching the established best practice: ZFS recordsize matching DB blocksize for datafiles; ZFS recordsize of 128K for redo log.

When setting up a DB, with or without OISP, there is one thing that storage administrators must enforce: they must segregate redo log files into their own filesystems (also known as shares or datasets). The reason for this is that the ZIL is a single linked list of transactions maintained by each filesystem (other filesystems run their own ZIL independently). And while the ZIL train allows for multiple transactions to be in flight concurrently, there is a strong requirement for completion of the transactions and notification of waiters to be handled in order. If one were to mix data files and redo log files in the same ZIL, then some redo transactions would be linked behind some DB writer transactions. Those critical redo transactions, committing in latency mode to a slog device, would see their I/O complete quickly (100us timescale) but nevertheless have to wait for an antecedent DB writer transaction committing in throughput mode to a regular spinning disk device (ms timescale). In order to avoid this situation, one must ensure that redo log files are stored in their own shares.
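The cost of in-order notification is easy to see with a small model. The function below is an illustration only: transactions are `(issue_time, io_latency)` pairs in microseconds, listed in ZIL order, and the latencies are round numbers picked to match the timescales mentioned above.

```python
def notify_times(transactions):
    """Release time of each transaction's waiters under in-order notification.

    `transactions` is a list of (issue_time, io_latency) pairs, in ZIL order.
    """
    released = []
    previous = 0
    for issue, latency in transactions:
        done = issue + latency
        # Waiters cannot be released before every earlier transaction is done.
        previous = max(previous, done)
        released.append(previous)
    return released

DB_WRITER = (0, 5000)  # throughput mode to spinning disk: ms timescale
REDO = (0, 100)        # latency mode to a slog device: 100us timescale

print(notify_times([DB_WRITER, REDO]))  # shared filesystem: [5000, 5000]
print(notify_times([REDO]))             # redo log in its own share: [100]
```

In the shared case the redo I/O itself still completes after 100us, but its waiters are only released once the DB writer transaction ahead of it has finished.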

Let me stop here, I have a train to catch...

Tuesday Dec. 02, 2014


The initial topic from my list is reARC. This is a major rearchitecture of the code that manages the ZFS in-memory cache along with its interface to the DMU. The ARC is of course a key enabler of ZFS high performance. As the scale of systems grows in memory size, CPU count and frequency, some major changes were required to the ARC to keep up with the pace. reARC is such a major body of work that I can only talk about a few aspects of it here.

In this article, I describe how the reARC project had an impact on at least these 7 important aspects of its operation:
  • Managing metadata
  • Handling ARC accesses to cloned buffers
  • Scalability of cached and uncached IOPS
  • Steadier ARC size under steady state workloads
  • Improved robustness for a more reliable code
  • Reduction of the L2ARC memory footprint
  • Finally, a solution to the long standing issue of I/O priority inversion
The diversity of topics covered serves as a great illustration of the incredible work handled by the ARC and a testament to the importance of ARC operations to all other ZFS subsystems. I'm truly amazed at how a single project was able to deliver all this goodness in one swoop.

No Meta Limits

Previously, the ARC claimed to use a two-state model:
  • "most recently used" (MRU)
  • "most frequently used" (MFU)
But it further subdivided these states into data and metadata lists.

That model, using 4 main memory lists, created a problem for ZFS. The ARC algorithm gave us only 1 target size for each of the 2 MRU and MFU states. The fact that we had 2 lists (data and metadata) but only 1 target size for the aggregate meant that when we needed to adjust the list down, we just didn't have the necessary information to perform the shrink. This led to the presence of an ugly tunable, arc_meta_limit, which was impossible to set properly and was a source of problems for customers.

This problem raises an interesting point and a pet peeve of mine. Many people I've interacted with over the years defended the position that metadata was worth special protection in a cache. After all, metadata is necessary to get to data, so it has intrinsically higher value and should be kept around more. The argument is certainly sensible on the surface, but I was on the fence about it.

ZFS manages every access through a least recently used scheme (LRU). New access to some block, data or metadata, puts that block back to the head of the LRU list, very much protected from eviction, which happens at the tail of the list.

When considering special protection for metadata, I've always stumbled on this question:

If some buffer, be it data or metadata, has not seen any accesses for a sufficient amount of time, such that the block is now at the tail of an eviction list, what is the argument that says that I should protect that block based on its state?
I came up blank on that question. If it hasn't been used, it can be evicted, period. Furthermore, even after taking this stance, I was made aware of an interesting fact about ZFS. Indirect blocks, the blocks that hold a set of block pointers to the actual data, are non-evictable inasmuch as any of the block pointers they reference are currently in the ARC. In other words, if some data is in cache, its metadata is also in the cache and, furthermore, is non-evictable. This fact really reinforced my position that in our LRU cache handling, metadata doesn't need special protection from eviction.
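The combination argued for here, a single LRU that treats data and metadata alike plus indirect blocks that stay put while any block they point to is cached, can be shown with a toy cache. The `LruCache` class and its bookkeeping are invented for illustration and do not reflect the actual ARC structures.

```python
from collections import OrderedDict

class LruCache:
    """One LRU list for data and metadata; front of the dict is the tail."""

    def __init__(self):
        self._lru = OrderedDict()  # block id -> parent indirect block (or None)
        self._children = {}        # indirect block id -> cached children count

    def access(self, block, parent=None):
        if block in self._lru:
            self._lru.move_to_end(block)  # back to the head of the list
            return
        self._lru[block] = parent
        if parent is not None:
            self._children[parent] = self._children.get(parent, 0) + 1

    def evict(self):
        # Walk from the tail, skipping blocks that still have cached children.
        for block in list(self._lru):
            if self._children.get(block, 0) == 0:
                parent = self._lru.pop(block)
                if parent is not None:
                    self._children[parent] -= 1
                return block
        return None

cache = LruCache()
cache.access("ind1")               # an indirect block
cache.access("d1", parent="ind1")  # two data blocks it points to
cache.access("d2", parent="ind1")
print([cache.evict() for _ in range(4)])  # ['d1', 'd2', 'ind1', None]
```

Although `ind1` is the least recently used entry, it only becomes evictable once both data blocks are gone, without any metadata-specific limit or separate list.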

And so, the reARC project actually took the same path. No more separation of data and metadata and no more special protection. This improvement led to fewer lists to manage and simpler code, such as shorter lock hold times for eviction. If you are tuning arc_meta_limit for legacy reasons, I advise you to try without this special tuning. It might be hurting you today and should be considered obsolete.

Single Copy Arc: Dedup of Memory

Yet another truly amazing capability of ZFS is its infinite snapshot capabilities. There are just no limits, other than hardware, to the number of (software) snapshots that you can have.

What is magical here is not so much that ZFS can manage a large number of snapshots, but that it can do so without reference counting the blocks that are referenced through a snapshot. You might need to read that sentence again ... and check the blog entry.

Now fast forward to today where there is something new for the ARC. While we've always had the ability to read a block referenced from the N-different snapshots (or clones), the old ARC actually had to manage separate in-memory copies of each block. If the accesses were all reads, we'd needlessly instantiate the same data multiple times in memory.

With the reARC project and the new DMU to ARC interfaces, we don't have to keep multiple data copies. Multiple clones of the same data share the same buffers for read accesses and new copies are only created for a write access. It has not escaped our notice that this N-way pairing has immense consequences for virtualization technologies. The use of ZFS clones (or writable snapshots) is just a great way to deploy a large number of virtual machines. ZFS has always been able to store N clone copies with zero incremental storage costs. But reARC is taking this one step further. As VMs are used, the in-memory caches that are used to manage multiple VMs no longer need to inflate, allowing the space savings to be used to cache other data. This improvement allows Oracle to boast the amazing technology demonstration of booting 16000 VMs simultaneously.
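The sharing rule described here (one buffer for all readers, a private copy only on write) can be sketched as follows. Everything in this example is hypothetical: the `SingleCopyCache` class, the idea of keying shared buffers by block address, and the `read_from_disk` callback merely illustrate the behaviour and are not the DMU/ARC interface.

```python
class SingleCopyCache:
    """Clones share one cached buffer per block until one of them writes."""

    def __init__(self, read_from_disk):
        self._read_from_disk = read_from_disk
        self._shared = {}   # block address -> bytes shared by every reader
        self._private = {}  # (clone, block address) -> bytearray of one clone

    def read(self, clone, addr):
        key = (clone, addr)
        if key in self._private:
            return bytes(self._private[key])
        if addr not in self._shared:
            self._shared[addr] = self._read_from_disk(addr)
        return self._shared[addr]

    def write(self, clone, addr, offset, data):
        key = (clone, addr)
        if key not in self._private:
            # First write by this clone: only now is a new copy created.
            self._private[key] = bytearray(self.read(clone, addr))
        self._private[key][offset:offset + len(data)] = data

disk = {7: b"base"}
loads = []

def read_from_disk(addr):
    loads.append(addr)
    return disk[addr]

cache = SingleCopyCache(read_from_disk)
same = cache.read("vm1", 7) is cache.read("vm2", 7)
print(same, len(loads))  # True 1
cache.write("vm2", 7, 0, b"B")
print(cache.read("vm1", 7), cache.read("vm2", 7))  # b'base' b'Base'
```

However many clones read block 7, memory holds a single copy of it; only the clone that writes pays for its own buffer.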

Improved Scalability of Cached and Uncached OPs

The entire MRU/MFU list insert and eviction processes have been redesigned. One of the main functions of the ARC is to keep track of accesses, such that most recently used data is moved to the head of the list and the least recently used buffers make their way towards the tail and are eventually evicted. The new design allows for eviction to be performed using a separate set of locks from the set that is used for insertion, thus delivering greater scalability. Moreover, through a very clever algorithm, we're able to move buffers from the middle of a list to the head without acquiring the eviction lock.

These changes were very important in removing long pauses in ARC operations that hampered the previous implementation. Finally, the main hash table was modified to use more locks placed on separate cache lines, improving the scalability of the ARC operations. This led to a boost in the cached and uncached maximum IOPS capabilities of the ARC.

Steadier Size, Smaller Shrinks

The growth and shrink model of the ARC was also revisited. The new model grows the ARC less aggressively when approaching memory pressure and instead recycles buffers earlier on. This recycling leads to a steadier ARC size and fewer disruptive shrink cycles. If the changing environment nevertheless requires the ARC to shrink, the amount by which we shrink each time is reduced to make each shrink cycle less stressful. Along with the reorganization of the ARC list locking, this has led to a much steadier, more dependable ARC at high loads.

ARC Access Hardening

A new ARC reference mechanism was created that allows the DMU to signify read or write intent to the ARC. This, in turn, enables more checks to be performed by the code, therefore catching bugs earlier in the process. A better separation of function between the DMU and the ARC is critical for ZFS robustness or hardening. In the new reARC mode of operation, the ARC now actually has the freedom to relocate kernel buffers in memory in between DMU accesses to a cached buffer. This new feature proves invaluable as we scale to large memory systems.

L2ARC Memory Footprint Reduction

Historically, buffers were tracked in the L2ARC (the SSD based secondary ARC) using the same structure that was used by the main primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut down this amount by more than 2X to a bare minimum that now only requires about 80 bytes of metadata per L2 buffer. With the arrival of larger SSDs for L2ARC and a better feeding algorithm, this reduced L2ARC footprint is a very significant change for the Hybrid Storage Pool (HSP) storage model.

I/O Priority Inversion

One nagging behavior of the old ARC and ZIO pipeline was the so-called I/O priority inversion. This behavior was present mostly for prefetching I/Os, which were handled by the ZIO pipeline at a lower priority than, for example, a regular read issued by an application. Before reARC, the behavior was that after an I/O prefetch was issued, a subsequent read of the data that arrived while the I/O prefetch was still pending would block waiting on the low priority I/O prefetch completion.

While it sounds simple enough to just boost the priority of the in-flight I/O prefetch, the ARC/ZIO code was structured in such a way that this turned out to be much trickier than it sounds. In the end, the reARC project and subsequent I/O restructuring changes put us on the right path regarding this particular quirk. Fixing the I/O priority inversion meant that fairness between different types of I/O was restored.


The key points that we saw in reARC are as follows:
  1. Metadata doesn't need special protection from eviction; arc_meta_limit has become an obsolete tunable.
  2. Multiple clones of the same data share the same buffers for great performance in a virtualization environment.
  3. We boosted ARC scalability for cached and uncached IOPs.
  4. The ARC size is now steadier and more dependable.
  5. Protection from creeping memory bugs is better.
  6. L2ARC uses a smaller footprint.
  7. I/Os are handled with more fairness in the presence of prefetches.
All of these improvements are available to customers of Oracle's ZFS Storage Appliances in any AK-2013 releases and recent Solaris 11 releases. And this is just topic number one. Stay tuned as we go about describing further improvements we're making to ZFS.

ZFS Performance boosts since 2010

Well, look who's back! After years of relative silence, I'd like to put my blogging hat back on and update my patient readership about the significant ZFS technological improvements that have been integrated since Sun and ZFS became Oracle brands. Since there is so much to cover, I tee up this series of articles with a short description of 9 major performance topics that have evolved significantly in the last few years. Later, I will describe each topic in more detail in individual blog entries. Of course, these selected advancements represent nowhere near an exhaustive list. There have been over 650 changes to the ZFS code in the last 4 years. My personal performance bias has selected the topics that I know best. The designated topics are:
  1. reARC
     Scales the ZFS cache to TB class machines and CPU counts in thousands.
  2. Sequential Resilvering
     Converts a random workload to a sequential one.
  3. ZIL Pipelining
     Allows the ZIL to carve up smaller units of work for better pipelining and higher log device utilisation.
  4. It is the dawning of the age of the L2ARC
     Not only did we make the L2ARC persistent on reboot, we made the feeding process so much more efficient we had to slow it down.
  5. Zero Copy I/O Aggregation
     A new tool delivered by the Virtual Memory team allows the already incredible ZFS I/O aggregation feature to actually do its thing using one less copy.
  6. Scalable Reader/Writer locks
     Reader/Writer locks, used extensively by ZFS and Solaris, had their scalability greatly improved on large systems.
  7. New thread Scheduling class
     ZFS transaction groups are now managed by a new type of taskqs which behave better managing bursts of cpu activity.
  8. Concurrent Metaslab Syncing
     The task of syncing metaslabs is now handled with more concurrency, boosting ZFS write throughput capabilities.
  9. Block Picking
     The task of choosing blocks for allocations has been enhanced in a number of ways, allowing us to work more efficiently at a much higher pool capacity percentage.
There you have it. I'm looking forward to reinvigorating my blog so stay tuned.

mardi févr. 28, 2012

Sun ZFS Storage Appliance : can do blocks, can do files too!

Last October, we demonstrated storage leadership in block protocols with our stellar SPC-1 result showcasing our top of the line Sun ZFS Storage 7420.

As a benchmark, SPC-1's profile is close to what a fixed block size DB would actually be doing. See Fast Safe Cheap : Pick 3 for more details on that result. Here, for an encore, we're showing how the ZFS Storage Appliance can perform in a totally different environment: generic NFS file serving.

We're announcing that the Sun ZFS Storage 7320 reached 134,140 SPECsfs2008_nfs.v3 Ops/sec with a 1.51 ms ORT running the SPEC SFS 2008 benchmark.

Does price performance matter? It does, doesn't it? See what Darius has to say about how we compare to NetApp: Oracle posts Spec SFS.

This is one step further in the direction of bringing our customers true high performance unified storage capable of handling blocks and files on the same physical media. It's worth noting that provisioning of space between the different protocols is entirely software based and fully dynamic, that every stored element is fully checksummed, that all stored data can be compressed with a number of different algorithms (including gzip), and that both filesystems and block based LUNs can be snapshotted and cloned at their own granularity. All these manageability features are available to you in this high performance storage package.

Way to go ZFS !

SPEC and SPECsfs are registered trademarks of Standard Performance Evaluation Corporation (SPEC). Results as of February 22, 2012, for more information see

lundi oct. 03, 2011

ZFS Storage Appliance at OOW

At Oracle OpenWorld this week in San Francisco, the ZFS Storage Appliance booth is located in Moscone South, Center - SC-139. I'll be spending time there Tuesday and Wednesday afternoon, hoping to hear from both existing and prospective customers.

jeudi mars 11, 2010

Dedup Performance Considerations

One of the major milestones for the ZFS Storage Appliance with 2010/Q1 is the ability to dedup data on disk. The open question is then: what performance characteristics should we expect to see from dedup? As Jeff says, this is the ultimate gaming ground for benchmarks. But let's have a look at the fundamentals.

ZFS Dedup Basics

Dedup code is, simplistically, a large hash table (the DDT). It uses a 256 bit (32 Bytes) checksum along with other metadata to identify data content. On a hash match, we only need to increase a reference count instead of writing out duplicate data. The dedup code is integrated in the I/O pipeline and is done on the fly as part of the ZFS transaction group (see Dynamics of ZFS, The New ZFS Write Throttle). A ZFS zpool typically holds a number of datasets: either block level LUNs, which are based on ZVOLs, or NFS and CIFS file shares, which are based on ZFS filesystems. So while the dedup table is a construct associated with an individual zpool, enabling the deduplication feature is something controlled at the dataset level. Enabling the dedup feature on a dataset has no impact on existing data, which stays outside of the dedup table. However, any new data stored in the dataset will then be subject to the dedup code. To actually have existing data become part of the dedup table, one can run a variant of "zfs send | zfs recv" on the datasets.

Dedup works at the ZFS block or record level. For an iSCSI or FC LUN, i.e. objects backed by ZVOL datasets, the default blocksize is 8K. For filesystems (NFS, CIFS or Direct Attach ZFS), objects smaller than 128K (the default recordsize) are stored as a single ZFS block, while objects bigger than the default recordsize are stored as multiple records. Each record is the unit which can end up deduplicated in the DDT. Whole files which are duplicated in many filesystem instances are expected to dedup perfectly. For example, whole DBs copied from a master file are expected to fall in this category. Similarly for LUNs, virtual desktop users which were created from the same virtual desktop master image are also expected to dedup perfectly.

An interesting topic for dedup concerns streams of bytes such as a tar file. For ZFS, a tar file is actually a sequence of ZFS records with no identified file boundaries. Therefore, identical objects (files captured by tar) present in 2 tar-like byte streams might not dedup well unless the objects actually start on the same alignment within the byte stream. A better dedup ratio would be obtained by expanding the byte stream into its constituent file objects within ZFS. If possible, the tools creating the byte stream would be well advised to start new objects on identified boundaries such as 8K.

Another interesting topic is backups of active databases. Since databases often interact with their constituent files using an identified block size, it is rather important for deduplication effectiveness that the backup target be set up with a block size that matches the source DB block size. Using a larger block on the deduplication target has the undesirable consequence that modifications to small blocks of the source database will cause the large blocks containing them in the backup target to appear unique, so they somewhat artificially fail to dedup. By using an 8K block size in the dedup target dataset instead of 128K, one could conceivably see up to a 10X better deduplication ratio.
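The block size effect is easy to reproduce with a toy model. The sketch below is not ZFS code: it simply chops two "backups" of a synthetic database into fixed-size blocks and counts unique SHA-256 digests. The database size and the pattern of modified blocks (one 8K block changed in every 128K span, the worst case for a 128K target) are assumptions chosen for illustration.

```python
import hashlib

DB_BLOCK = 8 * 1024  # block size used by the (synthetic) source database


def make_block(tag: int) -> bytes:
    """Build an 8K block whose content is unique to `tag`."""
    return tag.to_bytes(8, "big") * (DB_BLOCK // 8)


def make_db(n_blocks: int) -> bytes:
    """Synthetic database file made of n distinct 8K blocks."""
    return b"".join(make_block(i) for i in range(n_blocks))


def modify_blocks(db: bytes, indices) -> bytes:
    """Return a copy of db with the given 8K blocks overwritten by new content."""
    out = bytearray(db)
    n_blocks = len(db) // DB_BLOCK
    for i in indices:
        out[i * DB_BLOCK:(i + 1) * DB_BLOCK] = make_block(n_blocks + i)
    return bytes(out)


def unique_blocks(streams, block_size: int):
    """Return (total, unique) counts of fixed-size blocks across the streams."""
    seen = set()
    total = 0
    for data in streams:
        for off in range(0, len(data), block_size):
            seen.add(hashlib.sha256(data[off:off + block_size]).digest())
            total += 1
    return total, len(seen)


if __name__ == "__main__":
    backup1 = make_db(256)                               # 2 MiB database
    backup2 = modify_blocks(backup1, range(0, 256, 16))  # 1 change per 128K
    for size in (8 * 1024, 128 * 1024):
        total, unique = unique_blocks([backup1, backup2], size)
        print(f"{size // 1024:>3}K target: dedup ratio {total / unique:.2f}")
```

With an 8K target only the 16 modified blocks are new, so the second backup dedups almost entirely; with a 128K target every record contains one modified block and nothing dedups at all.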

Performance Model and I/O Pipeline Differences

What is the effect of dedup on performance? First, when dedup is enabled, the checksum used by ZFS to validate the disk I/O is changed to the cryptographically strong SHA256. Darren Moffat shows in his blog that SHA256 actually runs at more than 128 MB/sec on a modern CPU. This means that less than 1 ms is consumed to checksum a 128K unit and less than 64 usec for an 8K unit. This cost is only incurred when actually reading or writing data to disk, an operation that is expected to take 5-10 ms; therefore the checksum generation or validation is not a source of concern.
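The arithmetic behind those two bounds is a simple division. The short sketch below assumes the 128 MB/sec figure quoted above (taken here as 128 * 2^20 bytes per second) and nothing else.

```python
MB = 1 << 20
KB = 1 << 10


def checksum_time_us(block_bytes: int, throughput_bytes_per_sec: float) -> float:
    """Time to checksum one block, in microseconds."""
    return block_bytes / throughput_bytes_per_sec * 1e6


if __name__ == "__main__":
    sha256_rate = 128 * MB  # lower bound on SHA256 throughput quoted in the text
    for size in (128 * KB, 8 * KB):
        t = checksum_time_us(size, sha256_rate)
        print(f"{size // KB:>3}K block: {t:7.1f} usec")
```

Both results sit well below the 5-10 ms expected for the disk I/O itself.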

For the read code path, very little modification should be observed. The fact that a read happens to hit a block which is part of the dedup table is not relevant to the main code path. The biggest effect will be that we use a stronger checksum function invoked after a read I/O: at most an extra 1 ms is added to a 128K disk I/O. However, if a subsequent read is for a duplicate block which happens to be in the pool ARC cache, then instead of having to wait for a full disk I/O, only a much faster copy of the duplicate block will be necessary. Each filesystem can then work independently on its copy of the data in the ARC cache, as is the case without deduplication.

Synchronous writes are also unaffected in their interaction with the ZIL. The blocks written in the ZIL have a very short lifespan and are not subject to deduplication. Therefore the path of synchronous writes is mostly unaffected unless the pool itself ends up not being able to absorb the sustained rate of incoming changes for 10s of seconds. Similarly for asynchronous writes, which interact with the ARC caches, the dedup code has no effect unless the pool's transaction group itself becomes the limiting factor.

So the effect of dedup will take place during the pool transaction group updates. This is where we take all modifications that occurred in the last few seconds and atomically commit a large transaction group (TXG). While a TXG is running, applications are not directly affected except possibly by the competition for CPU cycles. They mostly continue to read from disk, do synchronous writes to the ZIL, and do asynchronous writes to memory. The biggest effect will come if the incoming flow of work exceeds the capabilities of the TXG to commit data to disk. Then eventually the reads and writes will be held up by the necessary write throttling code, which prevents ZFS from consuming all of memory.

Looking into the ZFS TXG, we have 2 operations of interest: the creation of a new data block and the simple removal (free) of a previously used block. Because ZFS operates under a copy-on-write (COW) model, any modification to an existing block actually represents both a new data block creation and a free of a previously used block (unless a snapshot was taken, in which case there is no free). For file shares, this concerns existing file rewrites; for block LUNs (FC and iSCSI), this concerns most writes except the initial one (the very first write to a logical block address or LBA actually allocates the initial data; subsequent writes to the same LBA are handled using COW).

For the creation of a new application data block, ZFS runs the checksum of the block, as it does normally, and then looks up the dedup table for a match based on that checksum and a few other bits of information. On a dedup table hit, only a reference count needs to be increased, and such changes to the dedup table will be stored on disk before the TXG completes. Many DDT entries are grouped in a disk block and compression is involved. A big win occurs when many entries in a block are subject to a write match during one TXG: a single 16K I/O can then replace 10s of larger I/Os.

As for free operations, the internals of ZFS actually hold the referencing block pointer, which contains the checksum of the block being freed. Therefore there is no need to read the data being freed nor to recompute its checksum. ZFS, with checksum in hand, looks up the entry in the dedup table and decrements the reference counter. If the counter is non-zero then nothing more is necessary (just the dedup table sync). If the freed block ends up without any reference then it will be freed.
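The write and free paths described above boil down to reference counting keyed by checksum. Here is a minimal toy model of that idea. It is a sketch only: the class and method names are made up for illustration, the table lives in a Python dict rather than on disk, and the "few other bits of information" used in the real lookup are ignored.

```python
import hashlib


class ToyDedupTable:
    """Toy in-memory dedup table: checksum -> reference count (hypothetical)."""

    def __init__(self):
        self.refcount = {}
        self.blocks_written = 0  # data blocks that really had to go to disk
        self.blocks_freed = 0    # data blocks whose space was really released

    def write(self, data: bytes) -> bytes:
        """Store a block; return its checksum, as a block pointer would hold it."""
        key = hashlib.sha256(data).digest()
        if key in self.refcount:
            self.refcount[key] += 1      # table hit: only bump the count
        else:
            self.refcount[key] = 1       # table miss: a real data write
            self.blocks_written += 1
        return key

    def free(self, key: bytes) -> None:
        """Free a block using the checksum kept in its block pointer."""
        self.refcount[key] -= 1
        if self.refcount[key] == 0:      # last reference gone: release space
            del self.refcount[key]
            self.blocks_freed += 1


if __name__ == "__main__":
    ddt = ToyDedupTable()
    block = b"x" * 8192
    pointers = [ddt.write(block) for _ in range(3)]   # 3 logical copies
    print("data writes:", ddt.blocks_written)         # only one physical write
    for bp in pointers:
        ddt.free(bp)                                  # no data read needed
    print("blocks freed:", ddt.blocks_freed)
```

Note that `free` never touches the data itself, which mirrors the point made above: the checksum is already at hand in the block pointer.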

The dedup table itself is an object managed by ZFS at the pool level. The table is considered metadata and its elements will be stored in the ARC cache. Up to 25% of memory (zfs_arc_meta_limit) can be used to store metadata. When the dedup table actually fits in memory, enabling dedup is expected to have a rather small effect on performance. But when the table is many times greater than the allotted memory, the lookups necessary to complete the TXG can cause write throttling to be invoked earlier than for the same workload running without dedup. If using an L2ARC, the DDT entries are prime candidates for the secondary cache. Note that independent of the size of the dedup table, read intensive workloads in highly duplicated environments are expected to be serviced using fewer IOPS at lower latency than without dedup. Also note that whole filesystem removal or large file truncation are operations that can free up large quantities of data at once. When the dedup table exceeds the allotted memory, those operations, which are more complex with deduplication, can impact the amount of data going into every TXG and the write throttling behavior.

So how large is the dedup table ?

The command zdb -DD on a pool shows the size of DDT entries. In one of my experiments it reported about 200 Bytes of core memory per table entry. If each unique object is associated with 200 Bytes of memory, then 32GB of RAM could reference 20TB of unique data stored in 128K records, or more than 1TB of unique data in 8K records. So if there is a need to store more unique data than what these ratios provide, strongly consider allocating some large read optimized SSDs to hold the DDT. The DDT lookups are small random I/Os which are handled very well by current generation SSDs.
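The sizing rule above can be checked with a few lines. This sketch assumes the roughly 200 Bytes per entry observed in the experiment and that the whole 32GB is available to the table, which is an optimistic simplification.

```python
GIB = 1 << 30
TIB = 1 << 40
KB = 1 << 10


def unique_data_covered(ram_bytes: int, entry_bytes: int, record_bytes: int) -> int:
    """Bytes of unique data whose DDT entries fit in ram_bytes of memory."""
    entries = ram_bytes // entry_bytes
    return entries * record_bytes


if __name__ == "__main__":
    ram = 32 * GIB
    entry = 200  # Bytes per DDT entry, as reported in the experiment above
    for record in (128 * KB, 8 * KB):
        covered = unique_data_covered(ram, entry, record)
        print(f"{record // KB:>3}K records: {covered / TIB:.2f} TiB of unique data")
```

The 16X gap between the two results is simply the ratio of the record sizes: smaller records mean more entries per TB of data.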

The first motivation to enable dedup is actually when dealing with duplicate data to begin with. If possible, procedures that generate duplication could be reconsidered. The use of ZFS Clones is actually a much better way to generate logically duplicate data for multiple users in a way that does not require a dedup hash table.

But when the operating conditions do not allow the use of ZFS Clones and data is highly duplicated, then the ZFS deduplication capability is a great way to reduce the volume of stored data.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.


Proper Alignment for extra performance

Because of disk partitioning software on your storage clients (keywords : EFI, VTOC, fdisk, DiskPart,...) or a mismatch between storage configuration and application request pattern, you could be suffering a 2-4X performance degradation....

Many I/O performance problems I see end up being the result of a mismatch between request sizes or their alignment and the natural block size of the underlying storage. While raw disk storage works using a 512 Byte sector and performs at the same level independent of the starting offset of I/O requests, this is not the case for more sophisticated storage, which will tend to use larger block units. Some SSDs today support 512B aligned requests but will work much better if you give them 4K aligned requests, as described in Aligning on 4K boundaries Flash and Sizes. The Sun Oracle 7000 Unified Storage line supports different sizes of blocks between 4K and 128K (it can actually go lower but I would not recommend that in general). Having proper alignment between the application's view, the initiator partitioning and the backing volume can have a great impact on the end performance delivered to applications.

When is alignment most important ?

Alignment problems are most likely to have an impact with
  • running a DB on file shares or block volumes
  • write streaming to block volumes (backups)
Also impacted at a lesser level :
  • large file rewrites on CIFS or NFS shares
In each case, adjusting the recordsize to match the workload and ensuring that partitions are aligned on a block boundary could have an important effect on your performance.

Let's review the different cases.

Case 1: running a Database (DB) on file shares or block volumes

Here the DB is a block oriented application. General ZFS Best Practices warrant that the storage use a record size equal to the DB natural block size. At the logical level, the DB is issuing I/Os which are aligned on block boundaries. When using file semantics (NFS or CIFS), the alignment is guaranteed to be observed all the way to the backend storage. But when using block device semantics, the alignment of requests on the initiator is not guaranteed to be the same as the alignment on the target side. Misalignment of the LUN will cause two pathologies. First, an application block read will straddle 2 storage blocks, creating storage IOPS inflation (more backend reads than application reads). But a more drastic effect will be seen for block writes which, when aligned, could be serviced by a single write I/O. Those will now require a Read-Modify-Write (R-M-W) of 2 adjacent storage blocks. This type of I/O inflation leads to additional storage load and degrades performance during high demand.

To avoid such I/O inflation, ensure that the backing store uses a block size (LUN volblocksize or share recordsize) compatible with the DB block size. If using a file share such as NFS, ensure that the filesystem client passes I/O requests directly to the NFS server using a mount option such as directio, or use Oracle's dNFS client (note that with the directio mount option, for memory management reasons independent of alignment concerns, the server will behave better when the client specifies rsize,wsize options not exceeding 128K). To avoid LUN misalignment, prefer the use of full LUNs as opposed to sliced partitions. If disk slices must be used, prefer a partitioning scheme in which one can control the sector offset of individual partitions, such as EFI labels. In that case, start partitions on a sector boundary which aligns with the volume's blocksize. For instance, an initial block for a partition which is a multiple of 16 * 512B sectors will align on an 8K boundary, the default LUN blocksize.
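The alignment rule is just modular arithmetic on the partition's starting sector. The sketch below checks a start sector against a volume blocksize and counts how many backing blocks one application I/O ends up touching. The helper names are made up for this example, and the start sectors used (63 and 64) are only illustrative values.

```python
SECTOR = 512  # bytes per sector, as in the text


def is_aligned(start_sector: int, volblocksize: int) -> bool:
    """True if the partition starts on a backing-store block boundary."""
    return (start_sector * SECTOR) % volblocksize == 0


def storage_blocks_touched(start_sector: int, offset: int, length: int,
                           volblocksize: int) -> int:
    """Backing blocks covered by an I/O of `length` bytes at `offset`
    (both relative to the start of the partition)."""
    first = (start_sector * SECTOR + offset) // volblocksize
    last = (start_sector * SECTOR + offset + length - 1) // volblocksize
    return last - first + 1


if __name__ == "__main__":
    volblocksize = 8 * 1024
    for start in (63, 64):  # example start sectors (assumed values)
        touched = storage_blocks_touched(start, 0, 8 * 1024, volblocksize)
        print(f"start sector {start}: aligned={is_aligned(start, volblocksize)}, "
              f"an 8K I/O touches {touched} storage block(s)")
```

A result of 2 blocks for an 8K write is exactly the R-M-W situation described in Case 1: neither block is fully overwritten, so both must be read first.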

Case 2: write streaming to block volumes (backups)

The other important case to pay attention to is stream writing to a raw block device. Block devices by default commit each write to stable storage. This path is often optimized through the use of acceleration devices such as write optimized SSDs. Misalignment of the LUNs due to partitioning software implies that application writes, which could otherwise be committed to SSD at low latency, will instead be delayed by disk reads caught in R-M-W. Because the writes are synchronous in nature, the application running on the initiator will thus be considerably delayed by disk reads. Here again, one must ensure that partitions created on the client system are aligned with the volume's blocksize, which typically defaults to 8K. For pure streaming workloads, a large blocksize, up to the maximum of 128K, can lead to greater streaming performance. One must take good care that the block size used for a LUN does not exceed the application write sizes to raw volumes, or risk being hit by the R-M-W penalty.

Case 3: large file rewrites on CIFS or NFS shares

For file shares, large streaming writes will be of 2 types: they will either be the more common file creation (write allocation) or they will correspond to a streaming overwrite of an existing file. The more common write allocation would not greatly suffer from misalignment since there is no pre-existing data to be read and modified. But the less common streaming rewrite of files can definitely be impacted by misalignment and R-M-W cycles. Fortunately, file protocols are not subject to LUN misalignment, so one must only take care that the write sizes reaching the storage are a multiple of the recordsize used to create the file share in the storage. The Solaris NFS client often issues 32K writes for streaming applications, while CIFS has been observed to use 64K from clients. If streaming asynchronous rewrite of existing files is an important component of your I/O workload (a rare set of conditions), it might well be that setting the share's blocksize accordingly will provide a boost to delivered performance.

In summary

The problem with alignment is most generally seen with fixed record oriented applications (such as Oracle Database or Microsoft Exchange) with random access patterns and synchronous I/O semantics. It can be caused by partitioning software (fdisk, diskpart) which creates disk partitions not aligned with the storage blocks. It can also be caused, to a lesser extent, by streaming file overwrites when the application write size does not match the file share's blocksize. The Sun Storage 7000 line offers great flexibility in selecting different blocksizes for different uses within a single pool of storage. However, it has no control over the offset that could be selected during disk partitioning of block devices on client systems. Care must be taken when partitioning disks to avoid misalignment and degraded performance. Using full LUNs is preferred.


Doubling Exchange Performance

2010/Q1 delivers 200% more Exchange performance and 50% extra transaction processing

One of the great advances present in the ZFS Appliance 2010/Q1 software update relates to the block allocation strategy. It's been one of the most complex performance investigations I've ever had to deal with, because of the very strong impact that the previous history of block allocation had on future performance. It was a maddening experience littered with dead end leads. During that whole time it was very hard to make sense of the data and segregate what was due to a problem in block allocation from other causes that lead customers to report performance issues.

Executive Summary

A series of changes to the ZFS metaslab code led to 50% improved OLTP performance and 70% reduced variability from run to run. We also saw a full 200% improvement on MS Exchange performance from these changes.

Excruciating Details for aspiring developer "Abandon hope all ye who enter here"

At some point we started to look at random synchronous file rewrite (a la DB writer) and it seemed clear that the performance was not what we expected for this workload. Basically, independent DB block synchronous writes were not aggregating into larger I/Os in the vdev queue. We could not truly assert a point where a regression had set in, so rather than treat this as a performance regression, we just decided to study what we had in 2009/Q3 and see how we could make our current scheme work better. And that led us on the path of the metaslab allocator:

As Jeff explains, when a piece of data needs to be stored on disk, ZFS will first select a top level vdev (a raid-z group, a mirrored set, a single disk) for it. Within that top level vdev, a metaslab (slab for short) will be chosen and within the slab a block of Data Virtual Address (DVA) space will be selected. We didn't have a problem with the vdev selection process, but we thought we had an issue with the block allocator. What we were seeing was that for random file rewrite the aggregation factor was large (say 8 blocks or more) when performance was good but dropped to 1 or 2 when performance was low. So we tried to see if we could do a better job at selecting blocks that would lead to better I/O aggregation down the pipeline. We kept looking at the effect of block allocation, but it turned out the source of the problem was in the slab selection process.

So a slab is a portion of DVA space within a metaslab group (aka a top level vdev). We currently divide vdev space into approximately 200 slabs (see vdev_metaslab_set_size). Slabs can be either loaded in memory or not. When loaded, the associated spacemaps are active, meaning we can allocate space from them. When slabs are not loaded, we can't allocate space but we can still free space from them (ZFS being copy-on-write or COW, a block rewrite frees up the old space). In this case we just log to disk the freed range information. As loading and unloading spacemaps is not cheap, we make sure to minimize such operations.

So each slab is weighted according to a few criteria, and the slab with the highest weight is selected for allocation on a vdev. The first criterion for slab selection is to reuse the same one as the last one used: basically, don't change a winner. We refer to this as the PRIMARY slab. The second criterion for slab selection is the amount of free space: the more the better. However, lower LBAs (logical block addresses), which map to outer cylinders, will generally give better performance. So we weight lower LBAs more than higher ones at equivalent free space. Finally, a slab that has already been used in the past, even if currently unloaded, is preferred to opening up a fresh new slab. This is the SMO bonus (because primed slabs have a Space Map Object associated). We do want to favor previously used slabs in order to limit the span of head seeks: we only move inwards when outer space is filled up.
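The weighting criteria above can be pictured with a toy scoring function. To be clear, this is an illustration and not the actual metaslab code: the `Slab` fields, the multipliers and the way the LBA bonus is computed are all assumptions invented for this sketch. Only the ordering of the criteria (primary slab, free space, low LBA, SMO bonus) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Slab:
    index: int        # position in the vdev; 0 is the lowest LBA (outer edge)
    free_bytes: int   # free space left in the slab
    has_smo: bool     # already used in the past (has a Space Map Object)
    is_primary: bool  # the slab used for the previous allocation


def slab_weight(slab: Slab, slab_count: int) -> int:
    """Toy weight: more free space, lower LBA, SMO and primary all score higher."""
    weight = slab.free_bytes
    # Assumed LBA bonus: up to +100% of free space at the lowest LBA.
    weight += slab.free_bytes * (slab_count - slab.index) // slab_count
    if slab.has_smo:
        weight *= 2   # assumed SMO bonus factor
    if slab.is_primary:
        weight *= 4   # assumed "don't change a winner" factor
    return weight


def pick_slab(slabs):
    """Select the highest weighted slab for the next allocation."""
    return max(slabs, key=lambda s: slab_weight(s, len(slabs)))


if __name__ == "__main__":
    slabs = [
        Slab(index=0, free_bytes=100, has_smo=False, is_primary=False),
        Slab(index=1, free_bytes=100, has_smo=True, is_primary=False),
    ]
    print("selected slab:", pick_slab(slabs).index)
```

In this example the previously used slab wins despite its higher LBA, which is the SMO bonus at work.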

The purpose of the slabs is to service a block allocation, say for a 128K record. So when a request comes in, the highest weighted slab is chosen and we ask it for a block of the proper size using an AVL tree of free/allocated space. There was a problem we had to deal with in previous releases which occurred when such an allocation failed because of free space fragmentation. The AVL tree was then not able to find a span of the requested size and was consuming CPU only to figure out that there was no free block present to satisfy the allocation. When space was really tight in a pool, we walked every slab before deciding that the allocation needed to be split into small chunks and a gang block (a block of blocks) created. So the spacemaps were augmented with another structure that allowed ZFS to immediately know how large an allocation could be serviced in a slab (the so-called picker private tree, organized by size of free space).

At that point we had 2 ways to select a block: either find one in sequence with the previous allocation (first fit) or use one that exactly fills in a hole in the allocated space, the so-called best fit allocator. We also decided then to switch from first fit to best fit as a slab became 70% full. The problem that this created, we now realize, is that while it helped the compactness of the on-disk layout, it created a headache for writes. Each new allocation got a tailored-fit disk area, and this led to much less write aggregation than expected. We would see that write workloads to a slab slowed down as it transitioned to 70% full (note this occurred when a slab was 70% full, not the full vdev nor the pool). Eventually, the degraded slab became fully used and the workload would transition to a different slab with better performance characteristics. Performance could then fluctuate from one hour to the next.

So to solve this problem, what went into the 2010/Q1 software release is multifold. The most important thing is that we increased the threshold at which we switch from 'first fit' (go fast) to 'best fit' (pack tight) from 70% full to 96% full. With TB drives, each slab is at least 5GB and 4% is still 200MB: plenty of space, and no need to do anything radical before that. This gave us the biggest bang. Second, instead of trying to reuse the same primary slab until it failed an allocation, we decided to stop giving the primary slab this preferential treatment as soon as the biggest allocation that could be satisfied by a slab was down to 128K (metaslab_df_alloc_threshold). At that point we were ready to switch to another slab that had more free space. We also decided to reduce the SMO bonus. Before, a slab that was 50% empty was preferred over slabs that had never been used. In order to foster more write aggregation, we reduced the threshold to 33% empty. This means that a random write workload now spreads to more slabs, each of which has a larger amount of free space, leading to more write aggregation. Finally, we also saw that slab loading was contributing to lower performance and implemented a slab prefetch mechanism to reduce the down time associated with that operation.
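The first fit versus best fit trade-off described above is easy to see on a toy free list. The sketch below is not the ZFS allocator: the free list is a plain Python list of (start, length) pairs instead of an AVL tree, and all names are invented for the example. It shows why first fit hands out adjacent addresses (which can be aggregated into one large I/O) while best fit scatters writes into tight holes.

```python
def first_fit(free, size):
    """Index of the first free segment, in address order, that fits."""
    for i, (_start, length) in enumerate(free):
        if length >= size:
            return i
    return None


def best_fit(free, size):
    """Index of the smallest free segment that fits (lowest address on ties)."""
    candidates = [(length, i) for i, (_start, length) in enumerate(free)
                  if length >= size]
    return min(candidates)[1] if candidates else None


def pick_policy(used_fraction, threshold):
    """First fit (go fast) below the threshold, best fit (pack tight) above."""
    return best_fit if used_fraction >= threshold else first_fit


def allocate(free, size, policy):
    """Carve `size` units out of the free list; return the start address."""
    i = policy(free, size)
    if i is None:
        raise MemoryError("no free segment large enough")
    start, length = free[i]
    if length == size:
        del free[i]
    else:
        free[i] = (start + size, length - size)
    return start


if __name__ == "__main__":
    layout = [(0, 64), (100, 8), (200, 8), (300, 8)]  # (start, length) in K
    for policy in (first_fit, best_fit):
        free = list(layout)
        starts = [allocate(free, 8, policy) for _ in range(4)]
        print(f"{policy.__name__}: {starts}")
```

With first fit the four 8K allocations land back to back and can be issued as a single 32K write; with best fit they land in three separate holes first. Raising the threshold passed to `pick_policy` from 0.70 to 0.96 keeps a slab in the first fit regime for much longer.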

The conjunction of all these changes led to 50% improved OLTP performance and 70% reduced variability from run to run (see David Lutz's post on OLTP performance). We also saw a full 200% improvement on MS Exchange performance from these changes.


jeudi oct. 08, 2009

CMT, NFS and 10 Gbe

Now that we have Gigabytes/sec class Network Attached OpenStorage and highly threaded CMT servers to attach from, you would figure that just connecting the two would be enough to open the pipes for immediate performance. Well ... almost.

Our OpenStorage systems can deliver great performance but we often find limitations on the client side. Now that NAS servers can deliver so much power, their NAS clients can themselves be powerful servers trying to deliver GB/sec class services to the internet.

CMT servers are great throughput engines for that; however, they only deliver the goods when the whole stack is threaded. So in a recent engagement, my colleague David Lutz found that we needed one tuning at each of 4 levels in Solaris: IP, TCP, RPC and NFS.


ip_soft_rings_cnt requires tuning up to Solaris 10 update 7. The default value of 2 is not enough to sustain the high throughput in a CMT environment. A value of 16 proved beneficial.

In /etc/system :
   * To drive 10Gbe in CMT in Solaris 10 update 7 : see
   set ip_soft_rings_cnt=16

The receive socket buffer size is critical to TCP connection performance. The buffer is not preallocated and memory is only used if and when the application is not reading the data it has requested. The default of 48K is from the age of 10MB/s Network cards and 1GB/sec systems. Having a larger value allows the peer to not throttle its flow pending the returning TCP ACK. This is especially critical in high latency environments, urban area networks or other long fat networks, but it's also critical in the datacenter to reach a reasonable portion of the 10Gbe available in today's NICs. It turns out that NFS connections inherit the TCP default for the system, and so it's interesting to run with a value between 400K and 1MB :

	ndd -set /dev/tcp tcp_recv_hiwat 400000
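To see why a 48K buffer falls short, one can apply the usual bandwidth-delay reasoning: a single connection cannot have more than one receive window of data in flight per round trip. The sketch below is a simplified model under that assumption, and the 0.5 ms round trip time is an assumed value for illustration, not a measurement.

```python
def max_throughput_bytes_per_sec(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound for one TCP connection: one window per round trip."""
    return window_bytes / rtt_seconds


def window_needed(bandwidth_bits_per_sec: float, rtt_seconds: float) -> float:
    """Receive window (bytes) needed to keep a link of that bandwidth full."""
    return bandwidth_bits_per_sec / 8 * rtt_seconds


if __name__ == "__main__":
    rtt = 0.0005  # assumed 0.5 ms round trip time
    for window in (48 * 1024, 400_000, 1_000_000):
        rate = max_throughput_bytes_per_sec(window, rtt)
        print(f"window {window:>9} bytes: at most {rate / 1e6:7.1f} MB/s")
    print(f"window to fill 10Gbe: {window_needed(10e9, rtt):,.0f} bytes")
```

Under that assumed round trip time, the window needed to fill a 10Gbe link lands inside the 400K to 1MB range suggested above, while the 48K default caps a connection at a small fraction of the link.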

But even with this, a single TCP connection is not enough to extract the most out of 10Gbe on CMT. And the Solaris RPC client will establish a single connection to any of the servers it connects to. The code underneath is highly threaded but did suffer from a few bugs when trying to tune that number of connections, notably 6696163 and 6817942, both of which are fixed in S10 update 8.

With that release, it becomes interesting to tune the number of RPC connections, for instance to 8.

In /etc/system :
   * To drive 10Gbe in CMT in Solaris 10 update 8 : see
   set clnt_max_conns=8

And finally, above the RPC layer, NFS implements a pool of threads per mount point to service asynchronous requests. These will be mostly used in streaming workloads (readahead and writebehind) while other synchronous requests will be issued within the context of the application thread. The default number of asynchronous request threads is likely to limit performance in some streaming scenarios. So I would experiment with

In /etc/system :
   * To drive 10Gbe in CMT in Solaris 10 update 7 : see
   set nfs3_max_threads=32
   set nfs4_max_threads=32

As usual, YMMV: use these settings with the usual circumspection. Remember that tuning is evil, but it's better to know about these factors than to be in the dark and stuck with lower than expected performance.

Thursday Sept. 17, 2009

iSCSI unleashed

One of the much anticipated features of the 2009.Q3 release of the fishworks OS is a complete rewrite of the iSCSI target implementation known as Common Multiprotocol SCSI Target or COMSTAR. The new target code is an in-kernel implementation that replaces what was previously known as the iSCSI target daemon, a user-level implementation of iSCSI.

Should we expect huge performance gains from this change ? You Bet !

But like most performance questions, the answer is often : it depends. The measured performance of a given test is gated by the weakest link triggered. iSCSI is just one component among many others that can end up gating performance. If the daemon was not a limiting factor, then that's your answer.

The target daemon was a userland implementation of iSCSI : some daemon threads would read data from a storage pool and write data to a socket or vice versa. Moving to a kernel implementation opens up options to bypass at least one of the copies, and that is being considered as a future option. But extra copies, while undesirable, do not necessarily contribute to small packet latency or large request throughput. For small packet requests, the copy is small change compared to the request handling. For large request throughput, the important thing is that the data path establishes a pipelined flow in order to keep every component busy at all times.

But the way threads interact with one another can be a much greater factor in delivered performance. And there lies the problem. The old target daemon suffered from one major flaw : each and every iSCSI request would require multiple trips through a single queue (shared between all luns), and that queue was being read and written by 2 specific threads. Those 2 threads would end up fighting for the same locks. This was compounded by the fact that user level threads can be put to sleep when they fail to acquire a mutex, and that going to sleep is a costly operation for a user level thread, implying a system call and all the accounting that goes with that.

So while the iSCSI target daemon gave reasonable service for large requests, it was much less scalable in terms of the number of IOPS that could be served and the CPU efficiency with which it could do that. IOPS being, of course, a critical metric for block protocols.

As an illustration of that, with 10 client initiators and 10 threads per initiator (so 100 outstanding requests) doing 8K cache-hit reads, we observed

Old Target Daemon    Comstar     Improvement
31K IOPS             85K IOPS    2.7X

Moreover, the target daemon was consuming 7.6 CPUs to service those 31K IOPS while comstar could handle 2.7X more IOPS consuming only 10 CPUs, a 2X improvement in IOPS per CPU efficiency.
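
For those who want to check the arithmetic, here is a small sketch that derives the two ratios from the figures quoted above. The variable names are mine; only the four input numbers come from the measurements.

```python
# Figures quoted in the text: IOPS served and CPUs consumed by each target.
old_iops, old_cpus = 31_000, 7.6   # old userland target daemon
new_iops, new_cpus = 85_000, 10.0  # comstar

old_eff = old_iops / old_cpus      # IOPS delivered per CPU, old daemon
new_eff = new_iops / new_cpus      # IOPS delivered per CPU, comstar

throughput_gain = new_iops / old_iops
efficiency_gain = new_eff / old_eff

print(f"IOPS per CPU: {old_eff:.0f} (old) vs {new_eff:.0f} (comstar)")
print(f"throughput gain: {throughput_gain:.1f}X, efficiency gain: {efficiency_gain:.1f}X")
```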

On the write side, with a disk pool that had 2 striped write optimised SSDs, comstar gave us 50% more throughput (130 MB/sec vs 88 MB/sec) and 60% more CPU efficiency.


During our testing we noted a few interesting contributors to delivered performance. The first is the setting of the iSCSI immediatedata parameter (see iscsiadm(1M)). On the write path, that parameter causes the iSCSI initiator to send up to 8K of data along with the initial request packet. While this is a good idea, we found that for certain write sizes it would trigger a condition in the ZIL that caused ZFS to issue more data than necessary through the logzillas. The problem is well understood and remediation is underway, and we expect to get to a situation in which keeping the default value of immediatedata=yes is best. But as of today, for those attempting world record data transfer speeds through logzillas, setting immediatedata=no and using a 32K or 64K write size might yield positive results depending on your client OS.

Interrupt Blanking

Interested in low latency request response ? Interestingly, a chunk of that response time is lost in the obscure settings of network card drivers. Network cards will often delay pending interrupts in the hope of coalescing more packets into a single interrupt. The extra efficiency often results in more throughput at high data rates, at the expense of small packet latency. For 8K requests we managed to get 15% more single threaded IOPS by tweaking one such client side parameter. Historically such tuning has always been hidden in the bowels of each driver and specific to every client OS, so that's too broad a topic to cover here. But for Solaris clients, the Crossbow framework is aiming, among other things, to make the latency vs throughput decision much more adaptive to operating conditions, relaxing the need for per workload tunings.

WCE Settings

Another important parameter to consider for comstar is the 'write cache enable' bit. By default, every write request to an iSCSI lun needs to be committed to stable storage, as this is what is expected by most consumers of block storage. That means that each individual write request to a disk based storage pool will take minimally a disk rotation, or 5ms to 8ms, to commit. This is also why a write optimised SSD is quite critical to many iSCSI workloads, often yielding 10X performance improvements. Without such an SSD, iSCSI performance will appear quite lackluster, particularly for lightly threaded workloads, which are more affected by latency characteristics.
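
To put numbers on why lightly threaded workloads suffer, here is a small sketch. It assumes a single thread that issues one write at a time and waits for each commit; the helper names are hypothetical, and the 7200 RPM figure is simply the rotation speed of the SATA drives used in these systems.

```python
def rotation_ms(rpm: int) -> float:
    """Time for one full platter rotation, in milliseconds."""
    return 60_000 / rpm

def serial_writes_per_sec(commit_latency_ms: float) -> float:
    """Ceiling for ONE thread issuing writes back to back (assumed model)."""
    return 1000 / commit_latency_ms

print(f"7200 RPM rotation: {rotation_ms(7200):.2f} ms")
for latency_ms in (5.0, 8.0):  # the 5ms to 8ms commit range quoted above
    rate = serial_writes_per_sec(latency_ms)
    print(f"{latency_ms:.0f} ms per commit: at most {rate:.0f} writes/sec per thread")
```

With commits in the 5ms to 8ms range, a single thread cannot do better than a couple hundred writes per second no matter how many disks sit in the pool, which is where a low latency log device changes the picture.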

One could then feel justified in setting the write cache enable bit on some luns in order to recoup some spunk in their engine. One piece of good news here is that in the new 2009.Q3 release of fishworks the setting is now persistent across reboots and reconnection events, fixing a nasty condition of 2009.Q2. However, one should be very careful with this setting, as the end consumer of block storage (Exchange, NTFS, Oracle,...) is quite probably operating under an unexpected set of conditions. This setting can lead to application corruption in case of an outage (there is no risk for the storage's internal state).

There is one exception to this caveat and it is ZFS itself. ZFS is designed to operate safely and correctly on top of devices that have their write cache enabled. That is because ZFS will flush write caches whenever application semantics or its own internal consistency require it. So a ZPOOL created on top of iSCSI luns would be well justified in setting the WCE on the lun to boost performance.

Synchronous write bias

Finally, as described in my blog entry about Synchronous write bias, we now have the option to bypass the write optimised SSDs for a lun if the workload it receives is less sensitive to latency. This would be the case for a highly threaded workload doing large data transfers. Experimenting with this new property is warranted at this point.

Synchronous write bias property

With the 2009.Q3 release of fishworks, along with a new iSCSI implementation, we're coming up with a very significant new feature for managing the performance of Oracle databases : the new dataset Synchronous write bias property, or logbias for short. In a nutshell, this property takes the default value of Latency, signifying that the storage should handle synchronous writes with urgency, the historical default handling. See Brendan's comprehensive blog entry on the Separate Intent Log and synchronous writes. However, for datasets holding Oracle datafiles, the logbias property can be set to Throughput, signifying that the storage should avoid using log device acceleration and instead try to optimize the workload's throughput and efficiency. We definitely expect to see a good boost to Oracle performance from this feature for many types of workloads and configs: those that generate 10s of MB/sec of DB writer traffic and have no more than 1 logzilla per tray/JBOD.

The property is set in the Share Properties just above database recordsize. You might need to unset the Inherit from project checkbox in order to modify the settings on a particular share.

The logbias property addresses a peculiar aspect of Oracle workloads : namely that DB writers issue a large number of concurrent synchronous writes to Oracle datafiles, writes which individually are not particularly urgent. In contrast to other types of synchronous write workloads, the important metric for DB writers is not individual latency. The important metric is that the storage keeps up with the throughput demand in order to have database buffers always available for recycling. This is unlike redo log writes, which are critically sensitive to latency as they are holding up individual transactions and thus users.

ZFS and the ZIL

A little background : with ZFS, synchronous writes are managed by the ZFS Intent Log (ZIL). Because synchronous writes are typically holding up applications, it's important to handle those writes with some level of urgency, and the ZIL does an admirable job at that.

In the Openstorage hybrid storage pool, the ZIL itself is sped up using low latency write-optimized SSD devices : the logzillas. Those devices are used to commit a copy of the in-memory ZIL transaction and retain the data until an upcoming transaction group commits the in-memory state to the on-disk pooled storage (Dynamics of ZFS, The New ZFS write throttle).

So while the ZIL speeds up synchronous writes, logzillas speed up the ZIL. Now SSDs can serve IOPS at a blazing 100μs but also have their own throughput limits: currently around 110MB/sec per device. At that throughput, committing, for example, 40K of data will need minimally 360μs. The more data we can divert away from log devices, the lower the latency of those devices will be.
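
The 360μs figure is just transfer time at the device's throughput limit. A minimal sketch of that calculation, assuming 40K means 40,000 bytes and 1 MB means 10**6 bytes (with binary units the result is a little higher); the function name is mine:

```python
def transfer_time_us(nbytes: int, mb_per_sec: float) -> float:
    """Minimum time to push nbytes through a device, in microseconds."""
    bytes_per_sec = mb_per_sec * 1e6
    return nbytes / bytes_per_sec * 1e6

LOGZILLA_MB_PER_SEC = 110  # per-device throughput limit quoted above

commit_us = transfer_time_us(40_000, LOGZILLA_MB_PER_SEC)
print(f"40K commit at 110MB/sec: at least {commit_us:.0f} microseconds")
```

That is several times the 100μs the device needs for a small operation, so every large commit pushed through a logzilla delays the small latency sensitive ones queued behind it.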

It's interesting to note that other types of raid controllers are hostage to their NVRAM and require, for consistency, that data be committed through some form of acceleration in order to avoid the Raid Write Hole (Bonwick on Raid-Z). ZFS, however, does not require that data pass through its SSD commit accelerator, and it can manage consistency of commits using either disks or SSDs.

Synchronous write bias : Throughput

With this newfound ability of storage administrators to signify to ZFS that some datasets will be subject to highly threaded synchronous writes for which global throughput is more critical than individual write latency, we can enable a different handling mode. By setting Logbias=Throughput ZFS is able to divert writes away from the Logzillas which are then preserved for servicing low latency sensitive operations (e.g. redo log operations).

  • A setting of Synchronous write bias : Throughput for a dataset allows synchronous writes to files in other datasets to have lower latency access to SSD log devices.

But that's not all. Data flowing through a logbias=Throughput dataset is still served by the ZIL. It turns out that the ZIL has different internal options for committing transactions, one of which is tagged WR_INDIRECT. A WR_INDIRECT commit issues an I/O for the modified file record and records a pointer to it in the ZIL chain (see WR_INDIRECT in zil.c, zvol.c, zfs_log.c).

A ZIL transaction of type WR_INDIRECT might use more disk I/Os and incur slightly higher latency immediately, but needs fewer I/Os and fewer total bytes during the upcoming transaction group update. Up to this point, the heuristics that lead to using WR_INDIRECT transactions were not triggered by DB writer workloads. But armed with the knowledge that comes with the new logbias property, we're now less concerned about the slight latency increase that WR_INDIRECT can have. So, for efficiency considerations, logbias=Throughput datasets are now set to use this mode, leading to a more leveled latency distribution of transactions.

  • Synchronous write bias : Throughput is a dataset mode that reduces the number of I/Os that need to be issued on behalf of this dataset during the regular transaction group updates, leading to more leveled response times.

A reminder that this kind of improvement can sometimes go unnoticed in sustained benchmarks if the downstream transaction group destage is not given enough resources. Make sure you have enough spindles (or total disk KRPM) to sustain the level of performance you need. A pool with 2 logzillas and a single JBOD might have enough SSD throughput to absorb DB writer workloads without adversely affecting redo log latency, and so would not benefit from the special logbias setting; however, with 1 logzilla per JBOD the situation might be reversed.

While the DB Record Size property is inherited by files in a dataset and is immutable, the logbias setting is totally dynamic and can be toggled on the fly during operations. For instance, during database creation or some lightly threaded write operations to Datafiles, it's expected that logbias=Latency should perform better.

Logbias deployments for Oracle

As of the 2009.Q3 release of fishworks, the current wisdom around deploying Oracle DB on an Openstorage system with SSD acceleration is to segregate, at the filesystem/dataset level but within a single storage pool, Oracle datafiles, index files and redo log files. Having each type of file in a different dataset allows better observability into each one using the great analytics tool. But also, each dataset can then be tuned independently to deliver the most stable performance characteristics.

The most important parameter to consider is the ZFS internal recordsize used to manage the files. For Oracle datafiles, the established ZFS Best Practice is to match the recordsize and the DB block size. For redo log files, using the default 128K records means that fewer file updates will be straddling multiple filesystem records. With 128K records we expect to have fewer transactions needing to wait for redo log input I/Os, leading to a more leveled latency distribution for transactions. As for index files, using smaller blocks of 8K offers better cacheability for both the primary and secondary caches (only cache what is used from indexes), but using larger blocks offers better index-scan performance. Experimenting is in order, depending on your use case, but an intermediate block size of maybe 32K might also be considered for mixed usage scenarios.
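
To illustrate the straddling argument for redo logs, here is a toy model that counts how many back-to-back log writes touch more than one filesystem record. The 12K write size and the count of 128 writes are hypothetical values chosen for the example, not measurements of a real redo log, and the function name is mine.

```python
def straddling_writes(record_size: int, write_size: int, count: int) -> int:
    """Count sequential writes that touch more than one filesystem record."""
    straddles = 0
    for i in range(count):
        start = i * write_size
        end = start + write_size - 1  # offset of the last byte written
        if start // record_size != end // record_size:
            straddles += 1
    return straddles

KB = 1024
WRITE_SIZE = 12 * KB  # HYPOTHETICAL redo write size, for illustration
COUNT = 128           # 128 back-to-back writes, 1.5MB of log

small = straddling_writes(8 * KB, WRITE_SIZE, COUNT)
large = straddling_writes(128 * KB, WRITE_SIZE, COUNT)

print(f"8K records:   {small} of {COUNT} writes straddle a record boundary")
print(f"128K records: {large} of {COUNT} writes straddle a record boundary")
```

In this model every write straddles a boundary with 8K records, while only a handful do with 128K records. A write that only partially covers a record may need that record read in first, which is the redo log input I/O we want to avoid.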

For Oracle datafiles specifically, using the new setting of Synchronous write bias : Throughput has potential to deliver more stable performance in general and higher performance for redo log sensitive workloads.

Dataset      Recordsize        Logbias
Datafiles    8K                Throughput
Redo Logs    128K (default)    Latency (default)
Index        8K-32K?           Latency (default)

Following these guidelines yielded a 40% boost in our transaction processing testing, in which we had 1 logzilla for a 40 disk pool.

Thursday June 11, 2009

Compared Performance of Sun 7000 Unified Storage Array Line

The Sun Storage 7410 Unified Storage Array provides high performance for NAS environments. Sun's product can be used for a wide variety of applications. The Sun Storage 7410 Unified Storage Array with a _single_ 10 GbE connection delivers the line speed of 10 GbE.

  • The Sun Storage 7410 Unified Storage Array delivers 1 GB/sec throughput performance.
  • The Sun Storage 7310 Unified Storage Array delivers over 500 MB/sec on streaming writes for backups and imaging applications.
  • The Sun Storage 7410 Unified Storage Array delivers over 22000 8K synchronous writes per second, combining great DB performance and the ease of deployment of Network Attached Storage while delivering the economic benefits of inexpensive SATA disks.
  • The Sun Storage 7410 Unified Storage Array delivers over 36000 random 8K reads per second from a 400GB working set for great Mail application responsiveness. This corresponds to an enterprise of 100000 people with every employee accessing new data every 3.6 seconds, consolidated on a single server.

All those numbers characterise a single head of a 7410 clusterable technology. The 7000 clustering technology stores all data in dual attached disk trays and no state is shared between cluster heads (see Sun 7000 Storage clusters). This means that an active-active cluster of 2 healthy 7410 will deliver 2X the performance posted here.

Also note that the performance posted here represents what is achieved under a very tightly defined, constrained workload (see Designing 11 Storage metric), and it does not represent the performance limits of the systems. This is testing 1 x 10 GbE port only; each product can have 2 or 4 10 GbE ports, and by running load across multiple ports the server can deliver even higher performance. Achieving maximum performance is a separate exercise done extremely well by my friend Brendan :

Measurement Method

To measure our performance we used the open source Filebench tool accessible from SourceForge (Filebench on ...). Measuring the performance of NAS storage is not an easy task. One has to deal with the client side cache, which needs to be bypassed, the synchronisation of multiple clients, and the presence of client side page flushing daemons, which can turn asynchronous workloads into synchronous ones. Because our Storage 7000 line can have such large caches (up to 128GB of RAM and more than 500GB of secondary caches) and we wanted to test disk responses, we needed to find backdoor ways to flush those caches on the servers. Read Amithaba's Filebench Kit entry on the topic, in which he posts a link to the toolkit used to produce the numbers.

We recently released our first major software update, 2009.Q2, and along with that a new lower cost clusterable 96 TB storage system, the 7310.

We report here the numbers of a 7310 with the latest software release, compared to those previously obtained for the 7410, 7210 and 7110 systems, each attached to an 18 to 20 client pool over a single 10Gbe interface with regular ethernet frames (1500 Bytes). By the way, looking at Brendan's results above, I encourage you to upgrade to Jumbo Frames ethernet for even more performance, and note that our servers can drive two 10Gbe ports at line speed.

Tested Systems and Metrics

The tested setups are :
        Sun Storage 7410, 4 x quad core: 16 cores @ 2.3 Ghz AMD.
        128GB of host memory.
        1 dual port 10Gbe Network Atlas Card. NXGE driver. 1500 MTU
        Streaming Tests:
        2 x J4400 JBOD,  44 x 500GB SATA drives 7.2K RPM, Mirrored pool, 
        3 Write optimized 18GB SSD, 2 Read Optimized 100GB SSD.
        IOPS tests:
        12 x J4400 JBOD, 280 x 500GB SATA drives 7.2K RPM, Mirrored pool,
        272 Data drives + 8 spares.
        8-Mirrored Write Optimised 18GB SSD, 6 Read Optimized 100GB SSD.
        FW OS : ak/generic@2008.11.20,1-0

        Sun Storage 7310,2 x quad core: 8 cores @ 2.3 Ghz AMD.
        32GB of host memory.
        1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
        4 x J4400 JBOD for a total 92 SATA drives  7.2K RPM
        43 mirrored pairs
        4 Write Optimised 18GB SSD, 2 Read Optimized 100GB SSD.
        FW OS : Q2 2009.,1-1.15

        Sun Storage 7210, 2 x quad core: 8 cores @ 2.3 Ghz AMD
        32 GB of host memory.
        1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
        44  x 500 GB SATA drives  7.2K RPM, Mirrored pool,
        2 Write Optimised 18 GB SSD.
        FW OS : ak/generic@2008.11.20,1-0

        Sun Storage 7110, 2 x quad core opteron: 8 cores @ 2.3 Ghz AMD
        8 GB of host memory.
        1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
        12 x 146 GB SAS drives, 10K RPM, in 3+1 Raid-Z pool.
        FW OS : ak/generic@2008.11.20,1-0

The newly released 7310 was tested with the most recent software revision, and that certainly is giving the 7310 an edge over its peers. The 7410, on the other hand, was measured here managing a much larger contingent of storage, including mirrored Logzillas and 3 times as many JBODs, and that is expected to account for some of the performance delta being observed.

    Metrics                                                Short Name
    1 thread per client streaming cached reads             Stream Read light
    1 thread per client streaming cold cache reads         Cold Stream Read light
    10 threads per client streaming cached reads           Stream Read
    20 threads per client streaming cold cache reads       Cold Stream Read
    1 thread per client streaming write                    Stream Write light
    20 threads per client streaming write                  Stream Write
    128 threads per client 8k synchronous writes           Sync write
    128 threads per client 8k random read                  Random Read
    20 threads per client 8k random read on cold caches    Cold Random Read
    8 threads per client 8k small file create IOPS         Filecreate

There are 6 read tests, 2 write tests and 1 synchronous write test which overwrites its data files as a database would. A final filecreate test completes the metrics. Tests execute against a 20GB working set _per client_ times 18 to 20 clients. There are 4 sets used in total, running over independent shares, for a total of 80GB per client. So before actual runs are taken, we create all working sets, or 1.6 TB of precreated data. Then before each run, we clear all caches on the clients and server.
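
The sizing works out as follows, taking the upper bound of 20 clients. This is only a sketch of the arithmetic with variable names of my choosing and decimal units (1 TB = 1000 GB).

```python
CLIENTS = 20        # upper bound of the 18 to 20 client pool
SET_SIZE_GB = 20    # working set per client, per share
SETS = 4            # independent shares, one working set each

per_client_gb = SET_SIZE_GB * SETS            # data precreated for each client
total_tb = per_client_gb * CLIENTS / 1000     # all clients together
per_test_gb = SET_SIZE_GB * CLIENTS           # what a single test runs against

print(f"{per_client_gb} GB per client, {total_tb} TB precreated in total")
print(f"{per_test_gb} GB working set addressed by one test")
```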

In each of the 3 groups of 2 read tests, the first one (cold) benefits from no caching at all, and the throughput delivered to the client over the network is observed to come from disk. The test runs for N seconds, priming data in the storage caches. A second run (non-cold) is then started after clearing the client side caches. Those tests will see 100% of the data delivered over the network link, but not all of it is coming off the disks. Streaming tests will race through the cached data and then finish off reading from disks. The random read test can also benefit from increasingly cached responses as the test progresses. The exact caching characteristics of the 7000 line will depend on a large number of parameters, including your application access pattern. Numbers here reflect the performance of fully randomized tests over 20GB per client x 20 clients, or a 400GB working set. Upcoming studies will include more data (showing even higher performance) for workloads with higher cache hit ratios than those used here.

In a Storage 7000 server, disks are grouped together in one pool and then individual Shares are created. Each share has access to all disk resources, subject to any quota (a maximum) and reservation (a minimum) that might be set. One important setup parameter associated with each share is the DB record size. It is generally better for IOPS tests to use 8K records and for streaming tests to use 128K records. The recordsize can be dynamically set based on expected usage.

The tests shown here were obtained with NFSv4 the default for Solaris clients (NFSv3 is expected to come out slightly better). The clients were running Solaris 10, with tuned tcp_recv_hiwat of 400K and dopageflush=0 to prevent buffered writes from being converted into synchronous writes.

Compared Results of the 7000 Storage Line

    NFSv4 Test               7410 Head       7310 Head       7210 Head       7110 Head
                             Mirrored Pool   Mirrored Pool   Mirrored Pool   3+1 Raid-Z
    Cold Stream Read light   915 MB/sec      685 MB/sec      719 MB/sec      378 MB/sec
    Stream Read light        1074 MB/sec     751 MB/sec      894 MB/sec      416 MB/sec
    Cold Stream Read         959 MB/sec      598 MB/sec      752 MB/sec      329 MB/sec
    Stream Read              1030 MB/sec     620 MB/sec      792 MB/sec      386 MB/sec
    Stream Write light       480 MB/sec      507 MB/sec      490 MB/sec      226 MB/sec
    Stream Write             447 MB/sec      526 MB/sec      481 MB/sec      224 MB/sec
    Sync write               22383 IOPS      8527 IOPS       10184 IOPS      1179 IOPS
    Filecreate               5162 IOPS       4909 IOPS       4613 IOPS       162 IOPS
    Cold Random Read         28559 IOPS      5686 IOPS       4006 IOPS       1043 IOPS
    Random Read              36478 IOPS      7107 IOPS       4584 IOPS       1486 IOPS

    Per Spindle IOPS         272 Spindles    86 Spindles     44 Spindles     12 Spindles
    Cold Random Read         104 IOPS        76 IOPS         91 IOPS         86 IOPS
    Random Read              134 IOPS        94 IOPS         104 IOPS        123 IOPS


The data shows that the entire Sun Storage 7000 line is a throughput workhorse, delivering 10 Gbps level NAS services per cluster head node, using a single network interface and a single IP address for easy integration into your existing network.

As with other storage technologies, write streaming performance requires more involvement from the storage controller, and this leads to about 50% less write throughput compared to read throughput.

The use of write optimized SSDs in the 7410, 7310 and 7210 also gives this storage very high synchronous write capabilities. This is one of the most interesting results as it maps to database performance. The ability to sustain 24000 O_DSYNC writes at 192MB/sec of synchronized user data using only 48 inexpensive SATA disks and 3 write optimized SSDs is one of the many great performance characteristics of this novel storage system.

Random Read tests generally map directly to individual disk capabilities and are a measure of total disk rotations. The cold runs show that all our platforms are delivering data at the expected 100 IOPS per spindle for those SATA disks. Recall that our offering is based on the economical, energy efficient 7.2K RPM disk technology. For cold random reads, a mirrored pair of 7.2K RPM disks offers the same total disk rotation (and IOPS) as expensive and power hungry 15K RPM disks but in a much more economical package.

Moreover, the difference between the warm and cold random read runs shows that the Hybrid Storage Pool (HSP) is providing a 30% boost even on this workload, which randomly addresses a 400GB working set on 128GB of controller cache. The effective boost from the HSP can be much greater depending on the cacheability of workloads.
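
As a sanity check on that figure, here is the calculation for the 7410 using the Random Read and Cold Random Read rows of the results table. The variable names are mine; the inputs are the measured IOPS and the 272 data spindles of the 7410 IOPS configuration.

```python
# 7410 figures from the results table.
cold_iops = 28_559   # Cold Random Read
warm_iops = 36_478   # Random Read (caches primed)
spindles = 272       # data drives in the 7410 IOPS configuration

boost = warm_iops / cold_iops - 1          # fractional gain from caching
cold_per_spindle = cold_iops / spindles
warm_per_spindle = warm_iops / spindles

print(f"warm over cold: {boost:.1%}")
print(f"per spindle: {cold_per_spindle:.1f} cold, {warm_per_spindle:.1f} warm")
```

The ratio comes out just under 28%, which I round to the 30% mentioned above.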

If we consider an organisation in which the average mail message is 8K in size, our results show that we could consolidate 100000 employees on a single 7410 storage system where each employee is accessing new data every 3.6 seconds with 70ms response time.
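
The employee count follows from dividing the population by the access interval and comparing the result with the measured random read rates. In this sketch, the comparison against the cold figure is my own reading of where the 3.6 seconds comes from, so treat it as an assumption.

```python
EMPLOYEES = 100_000
ACCESS_INTERVAL_SEC = 3.6      # each employee reads one new 8K message this often

required_iops = EMPLOYEES / ACCESS_INTERVAL_SEC

# Measured 7410 random read rates from the results table.
COLD_RANDOM_READ = 28_559
WARM_RANDOM_READ = 36_478

print(f"required: {required_iops:.0f} random 8K reads/sec")
print(f"fits within cold rate: {required_iops <= COLD_RANDOM_READ}")
print(f"fits within warm rate: {required_iops <= WARM_RANDOM_READ}")
```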

Messaging systems are also big consumers of file creations, and I've shown in the past how efficient ZFS can be at creating small files (Need Inodes ?). For the NFS protocol, file creation is a straining workload, but the 7000 storage line comes out not too bad with more than 5000 filecreates per second per storage controller.


Performance can never be summarised with a few numbers and we have just begun to scratch the surface here. The numbers presented here, along with the disruptive pricing of the Hybrid Storage Pool, will, I hope, go a long way to show the incredible power of the Open Storage architecture being proposed. And keep in mind that this performance is achievable using less expensive, less power hungry SATA drives and that all the data services offered by our Sun Storage 7000 servers (NFS, CIFS, iSCSI, ftp, HTTP etc.) are available at 0 additional software cost to you.

Disclosure Statement: Sun Microsystems generated results using filebench. Results reported 11/10/08 and 26/05/2009. Analysis done on June 6 2009.

Wednesday May 27, 2009

Free Beer and Free Deep Dive

For those lucky enough to be in the Bay area next week, I just heard there will be free unlimited beer and free access to the technical deep dives at the Community One event at the Moscone Center and Intercontinental Hotel nearby (and I'm only lying about one of the free thingy). The program looks great with lots of star speakers on both days, so while free is cool, don't overlook the June 1st program as well. Lucky you.


