vendredi oct. 30, 2015

Block Picking

In the final article of this series, I walk through the fascinating topic of block picking. As is well known ZFS is an allocate on write storage system and every time it updates the on-disk structure, it writes the new data wherever it chooses. Within reasonable bounds ZFS also control the timing of when to write data out to devices (outside of ZIL blocks which is governed by applications). The current bounds are set about 5 seconds apart and when that clock ticks, we bundle up all recent changes in a transaction group (TXG) handled by spa_sync.

Armed with this herd of pending data blocks, ZFS issues a highly concurrent workload dedicated to running all CPU intensive tasks. For any individual I/O, after going through compression, encryption and checksumming, we move on to the allocation task and finally device level I/O scheduling.

The first task of allocation involves, selecting a device in a pool. Then, within that device, selecting a sub-region called a metaslab and finally, within that metaslab, selecting a block where the data is stored. Our guiding principles through the process is to
  • Foster write I/O aggregation
  • Ensure devices are used efficiently
  • Avoid fragmenting the on-disk space
  • Limit the core memory required
  • Serve concurrent allocations quickly
  • Do all this with as little CPU resources as possible
Let's see how ZFS solves this tough equation.

Device Selection

When a TXG is triggered, we want devices to receive a set of I/Os such that the I/Os;
  • Aggregate
  • Stream on the media
I/O aggregation occurs downstream when 2 I/Os, which are ready to issue, have just been allocated on the same device in adjacent locations. The per device I/O pipeline is then able to merge both of them (N of them actually) into a single device I/O. This is an incredible feature as it applies to items that have no relationship to each other, other than their allocation proximity as first described in Need Inodes ?.

Streaming the media is a close cousin to aggregation. I/Os aggregate when they are adjacent, but if they are not, we still want to avoid seeking to a far away area of a disk every I/O. A disk seek is, after all, an eternity. So, while setting data onto the media, we like to keep logical block addresses (LBA) as packed as possible leading to streaming efficiency. Similarly for SSDs, doing so, avoids fragmenting the internal space mapping done by the Flash Translation Layer (FTL). Even logical volumes software welcome this model.

ZFS allocates from a device until about 1MB of data is handled. The first 1MB of blocks that reaches the allocation stage (after CPU heavy transformations) are directed to a first device. The following 1MB of blocks move on to the next one, and so on. After iterating round robin through every device, we are now in a state where every device in a pool is busy with I/O. At the same time other blocks are being processed through the CPU intensive stages and are now building up an I/O backlog onto every device. At that moment, even if CPUs are heavily used, the system is actually issuing I/O to all devices that are 100% busy.

Metaslab Selection

Once we have directed 1MB of allocation to a specific device, we still want the I/Os to target a localized subarea of the device. A metaslab has an in-core structure representing free space of a portion of a device (very roughly 1%). At any one time, ZFS only allocates from a single metaslab of a device insuring dense packing of all writes and therefore avoiding long disk head excursions. The other benefit is that for the purpose of allocation, we strictly only need to keep in-core a structure representing the free space for that subarea. This is the active metaslab. During a given TXG we thus only write to the active metaslab of every device.

Block Picking

And now, going for the kill. We have the device and the metaslab subarea within it to service our allocation of size X. We finally have to choose a specific device location to fit this allocation. A simple strategy would be to allocate all blocks back-to-back regardless of size. That would lead to maximum aggregation, but we must be considerate of space fragmentation implications.

Blocks we are allocating together at this instant, may well be freed later on a different schedule from each other. Frankly we have no way to know when a block free occurs since that is entirely driven by the workload. In ZFS, our bias is to consider that blocks that have similar size have a better chance of having similar life expectancy. We exploit this by maintaining a set of separate pointers within our metaslab and allocate blocks of similar size near each other.

Historical Behavior

When ZFS first came of age, it had 2 strategies to pick blocks. The regular one lead to good performance through aggregation while another strategy was aimed at defragmenting and lead to terrible performance. We would switch strategy when a metaslab started to have less than 30% free space within it. Customers voiced their discontent loudly. The 30% parameter was later reduced to 4% but that didn't reduce the complaints1.

The other problem we had in the past, was that when we needed to switch a metaslab whose structure was not yet in core, we would block all allocating threads waiting on the in-core loading of the metaslab data. If that took too much time, we could leave devices idling in the process.

Finally, we would switch only when an allocation could not be satisfied usually because a request was larger than the largest free block available. Forced to switch, we would then select the metaslab that had the best2 free space. This meant that we would keep using metaslabs past their prime capacity to foster aggregated and streaming I/Os.

Today is a Better World

Fast forward to today, we have evolved this whole process in very significant ways.

The thread blocking problem was simple enough; when we do in fact load a metaslab we quickly direct allocating threads to other devices. We therefore keep feeding the other devices more smoothly.

But the most important advance is that we don't use an allocator that switches strategy based on available space in the metaslab. Allocations of a given size are serviced from a chunk chosen such that the I/Os aggregate 16-fold : so 4K allocations tend to consume 64K chunks, while 128K allocations looks for 2MB chunks. Allocations of different sizes do not compete for the same sized chunks. Finally, as the maximum size available in a metaslab is reduced, we simply and gradually scale down our potential for aggregation from this metaslab.

Alongside this change, we decided to actually switch away from a metaslab as soon as it started to show signs of fatigue. As long as a metaslab is able to serve blocks of approximately 1MB we keep allocating from it. But as soon as it's biggest block size drops below this threshold, we go and pick the metaslab with the best free space.

Finally to account for more frequent switches, we also decided to unload metaslabs less aggressively than before. This policy allows us to reuse a metaslab without incurring the cost of loading since that comes with both a CPU and I/O cost.


With these changes in, we have an allocator that fosters aggregation very effectively and leads to performance that degrades gracefully as the pool fills up. This allocator has served us well over the many years it's been in place.

ZFS gives you great performance by handling writes in a way that streams the data to devices. This is effective and delivers maximum performance as long as there is large sequential free space on devices. For users, the real judge is to monitor the average I/O size for writes and if, for the same workload mix, the write size starts to creep down, then it's time to consider adding additional storage space.

Le Fin

1 Some still remember this 30% factor and use that as a rule of thumb to not exceed even though we have not used this allocator in years; tuning metaslab_df_free_pct has no effect on our systems.

2 I say best free space and not most free space since we actually boost the desirability of metaslabs with low addresses. For physical disks, outer tracks fly faster under a disk head and that translate into more throughput. Even for SSDs we see a benefit : Favoring one side of the addresses, means we reuse freed space more aggressively. The overwrites of an SSD LBA range means that the flash cells holding the overwritten data can be recycled quickly by the FTL, which simplifies greatly its operation.

mardi août 04, 2015

Concurrent Metaslab Syncing

As hinted in my previous article, spa_sync() is the function that runs whenever a pool needs to update it's internal state. That thread is the master of ceremony for the whole TXG syncing process. As such it is the most visible of thread. At the same time, it's the thread we actually want to see idling. The spa_sync thread is setup to generate work for taskqs and then wait for the work to happen. That's why we often see spa_sync waiting in zio_wait or taskq_wait. This is what we expect that thread to be doing.

Let's dig into this process a bit more. While we do expect spa_sync to mostly be waiting, it is not the only thing that it does. Before it waits, it has to farm out work to those taskqs. Every TXG, spa_sync wakes up and starts to create work for the zio taskq threads. Those threads immediately pick up the initial tasks posted by spa_sync and just as quickly generate load for pool devices. Our goal is just to keep taskqs and more importantly device fed with work.

And so, we have this single spa_sync thread, quickly posting work to zio taskqs and threads working on checksum computation and other CPU intensive tasks. This model ensures that the disk queues are non-empty for the duration of the data update portion of a TXG.

In practice, that single spa_sync thread is able to generate the tasks to service the most demanding environment. When we hit some form of pool saturation, we typically see spa_sync waiting on a zio and that is just the expected sign that something at the I/O level below ZFS is the current limiting factor.

But, not too long ago, there was a grain of sand in this beautiful clockwork. After spa_sync was all done with ... well waiting... it had a final step to run before updating the uberblock. It would walk through all the devices and process all the space map updates, keeping track of all the allocs and frees. In many cases, this was a quick on-CPU operation done by the spa_sync thread. But when dealing with a large amount of deletion it could show up as significant. It was definitely something that spa_sync was tackling itself as opposed to farming out to workers.

A project was spawned to fix this and during the evaluation the ZFS engineer figured out that a lot of the work could be handled in the earlier stages of the zio processing, further reducing the amount of work we could have to wait on in the later stages of spa_sync.

This fix was a very important step in making sure that the critical thread running spa_sync spends most of it's time ...waiting.

jeudi juil. 30, 2015

System Duty Cycle Scheduling Class

It's well known that ZFS uses a bulk update model to maintain the consistency of information stored on disk. This is referred to as a transaction group (TXG) update or internally as a spa_sync(), which is the name of the function that orchestrates this task. This task ultimately updates the uberblock between consistent ZFS states.

Today these tasks are expected to run on a 5-second schedule with some leeway. Internally, ZFS builds up the data structures such that when a new TXG is ready to be issued it can do so in the most efficient way possible. That method turned out to be a mixed blessing.

The story is that when ZFS is ready, it uses zio taskqs to execute all of the heavy lifting, CPU intensive jobs necessary to complete the TXG. This process includes the checksumming of every modified block and possibly compressing and encrypting them. It also does on-disk allocation and issues I/O to the disk drivers. This means there is a lot of CPU intensive work to do when a TXG is ready to go. The zio subsystem was crafted in such a way that when this activity does show up, the taskqs that manage the work never need to context switch out. The taskq threads can run on CPU for seconds on end. That created a new headache for the Solaris scheduler.

Things would not have been so bad if ZFS was the only service being provided. But our systems, of course, deliver a variety of services and non-ZFS clients were being short changed by the scheduler. It turns out that before this use case, most kernel threads had short spans of execution. Therefore kernel threads were never made preemptable and nothing would prevent them from continuous execution (seconds is same as infinity for a computer). With ZFS, we now had a new type of kernel thread, one that frequently consumed significant amounts of CPU time.

A team of Solaris engineers went on to design a new scheduling class specifically targeting this kind of bulk processing. Putting the zio taskqs in this class allowed those threads to become preemptable when they used too much CPU. We also changed our model such that we limited the number of CPUs dedicated to these intensive taskqs. Today, each pool may use at most 50% of nCPUS to run these tasks. This is managed by kernel parameter zio_taskq_batch_pct which was reduced from 100% to 50%.

Using these 2 features we are now much better equipped to allow the TXG to proceed at top speeds without starving application from CPU access and in the end, running applications is all that matters.

vendredi juil. 24, 2015

Scalable Reader/Writer Locks

ZFS is designed as a highly scalable storage pool kernel module.

Behind that simple idea are a lot of subsystems, internal to ZFS, which are cleverly designed to deliver high performance for the most demanding environments. But as computer systems grow in size and as demand for performance follows that growth, we are bound to hit scalability limits (at some point) that we had not anticipated at first.

ZFS easily scales in capacity by aggregating 100s of hard disks into a single administration domain. From that single pool, 100s or even 1000s of filesystems can be trivially created for a variety of purposes. But then people got crazy (rightly so) and we started to see performance tests running on a single filesystem. That scenario raised an interesting scalability limit for us...something had to be done.

Filesystems are kernel objects that get mounted once at some point (often at boot). Then, they are used over and over again, millions even billions of times. To simplify each read/write system call uses the filesystem object for a few milliseconds. And then, days or weeks later, a system administrator wants this filesystem unmounted and that's that. Filesystem modules, ZFS or other, need to manage this dance in which the kernel object representing a mount point is in-use for the duration of a system call and so must prevent that object from disappearing. Only when there are no more system calls using a mountpoint, can a request to unmount be processed. This is implemented simply using a basic reader/writer lock, rwlock(3C) : A read or write system call acquires a read lock on the filesystem object and holds it for the duration of the call, while a umount(2) acquires a write lock on the object.

For many years, individual filesystems from a ZFS pool were protected by a standard Solaris rwlock. And while this could handle 100s of thousands or read/write calls per second through a single filesystem eventually people wanted more.

Rather than depart from the basic kernel rwlock, the Solaris team decided to tackle the scalability of the rwlock code itself. By taking advantage of visibility into a system's architecture, Solaris is able to use multiple counters in a way that scales with the system's size. A small system can use a simple counter to track readers while a large system can use multiple counters each stored on separate cache lines for better scaling. As a bonus they were able to deliver this feature without changing the rwlock function signature. For ZFS code, just simple rwlock initialization change was needed to open up the benefit of this scalable rwlock.

We also found that, in addition to protecting the filesystem object itself, another structure called a ZAP object used to manage directories was also hitting the rwlock scalability limit and that was changed too.

Since the new locks have been put into action, they have delivered scalable performance into single filesystems that is absolutely superb. While the French explorer Jean-Louis Etienne claims that "On se repousse pas ses limites, on les decouvre:" From the comfort of my air-conditioned office, I conclude that we are pushing the limits out of harm's way.

mercredi juin 10, 2015

Zero Copy I/O Aggregation

One of my favorite feature of ZFS is the I/O aggregation done in the final stage of issuing I/Os to devices. In this article, I explain in more detail what this feature is and how we recently improved it with a new zero copy feature.

It is well known that ZFS is a Copy-on-Write storage technology. That doesn't meant that we constantly copy data from disk to disk. More to the point it means that when data is modified we store that data in a fresh on-disk location of our own choosing. This is primarily done for data integrity purposes and is managed by the ZFS transaction group (TXG) mechanism that runs every few seconds. But an important side benefit of this freedom given to ZFS is that I/Os, even unrelated I/Os, can be allocated in physical proximity to one another. Cleverly scheduling those I/Os to disk then makes it possible to detect contiguous I/Os and issue few large ones rather than many small ones.

One consequence of I/O aggregation is that the final I/O sizes used by ZFS during a TXG, as observed by ZFSSA Analytics or iostat(1), depend more on the availability of contiguous on-disk free space than it does on the individual application write(2) sizes. To a new ZFS user or storage administrator, it can certainly be really baffling that 100s of independent 8K writes can end up being serviced by a single disk I/O.

The timeline of an asynchronous write is described like this:

  • Application issues a write(2) of N byte to a file stored using ZFS records of size R. Initially the data is stored in the ARC cache.

  • ZFS notes the M dirty blocks needing to be issued in the next TXG as follows:
    • If R=128K, a small write(2) say of 10Bytes here means 1 dirty block (of 128K)
    • If R=8K, a single 128K write(2) implies 16 dirty blocks (of 8K)

  • Within the next few seconds multiple dirty blocks get associated with the upcoming TXG.

  • The TXG starts. ZFS gathers all of the dirty blocks and starts I/Os1.

    • Individual blocks get checksummed and, as necessary, compressed and encrypted. Then and only then, knowing the compressed size and the actual data that needs to be stored on disk, a device is selected and an allocation takes place,

    • The allocation engine finds a chunk in proximity to recent allocations (a future topic of its own),

    • The I/O is maintained by ZFS using 2 structures, one ordered by priority and another ordered by device offset.

  • As soon as there is at least one I/O in these structures, the device level ZIO pipeline gets to work. When a slot is available, the highest priority I/O for that device is selected to be issued.

And here is where the magic occurs. With this highest priority I/O in hand, the ZIO pipeline doesn't just issue that I/O to the device. It first checks for other I/Os which could be physically adjacent to this one. It gathers all such I/Os together until hitting our upper limit for disk I/O size. Because of the way this process works, if there are contiguous chunks of free space available on the disk, we're nearly guaranteed that ZFS finds pending I/Os that are adjacent and can be aggregated.

This also explains why one sees regular bursts of large I/Os whose sizes are mostly unrelated to the sizes of writes issued by the applications. And I emphasize that this is totally unrelated to the random or sequential nature of the application workload. Of course, for hard disk drives (HDDs), managing writes this way is very efficient. Therefore, those HDDs are less busy and stay available to service the incoming I/Os that applications are waiting on.

And this bring us to the topic du jour. Up to recently, there was a cost to doing this aggregation in the form of a memory copy. We would take the buffers coming from the ZIO pipeline (after compression and encryption) and copy them to a newly allocated aggregated buffer. Thanks to a new Solaris mvector feature, we can now run the ZIO aggregation pipeline without incurring this copy. That, in turns, allows us to boost the maximum aggregation size from 128K up to 1MB for extra efficiency. The aggregation code also limits itself to aggregating 64 buffers together. When working with 8K blocks we can see up to 512K I/O during a TXG and 1MB I/O with bigger blocks.

Now, a word about the ZIL. In this article, I focus on the I/Os issued by the TXG which happens every 5 seconds. In between TXG, if disk writes are observed, those would have to come from the ZIL. The ZIL also does it's own grouping of write requests that hit a given dataset (share, zvol or filesystem). Then, once the ZIL gets to issue an I/O, it uses the same I/O pipeline as just described. Since ZIL I/Os are of high priority, they tend to issue straight away. And because they issue quickly, there is generally not a lot of them around for aggregation. So it is common to have the ZIL I/Os not aggregate much if at all. However, under a heavy synchronous write load, when the underlying device becomes saturated, a queue of ZIL I/Os forms and they become subject to ZIO level aggregation.

When observing the I/Os issued to a pool with iostat it's nice to keep all this in mind: synchronous writes don't really show up with their own size. The ZIL issues I/O for a set of synchronous writes that may further aggregate under heavy load. Then, with a 5 second regularity, the pool issues I/O for every modified block, usually with large I/Os whose size is unrelated to the application I/O size.

It's a really efficient way to do all this, but it does require some time getting used to it.
1 Application write size is not considered during a TXG.

mardi avr. 28, 2015

It is the Dawning of the Age of the L2ARC

One of the most exciting things that have gone into ZFS in recent history has been the overhaul of the L2ARC code. We fundamentaly changed the L2ARC such that it would do the following:

  • reduce its own memory footprint,
  • be able to survive reboots,
  • be managed using a better eviction policy,
  • be compressed on SSD,
  • and finally allow feeding at much greater rates then ever achieved before.
Let's review these elements, one by one.

Reduced Footprint

We already saw in this ReARC article that we dropped the amount of core header information from 170 bytes to 80 bytes. This means we can track more than twice as much L2ARC data as before using a given memory footprint. In the past, the L2ARC had trouble building up in size due to its feeding algorithm, but we'll see below that the new code allows us to grow the L2ARC and use up available SSD space in its entirety. So much so that initial testing revealed a problem: For small memory configs with large SSDs, the L2ARC headers could actually end up filling most of the ARC cache and that didn't deliver good performance. So, we had to put in place a memory guard for L2 headers which is currently set to 30% of the ARC. As the ARC grows and shrinks so does the maximum space dedicated to tracking the L2ARC. So, a system with 1TB of ARC cache, then up to 300GB if necessary could be devoted to tracking the L2ARC. With the 80 bytes headers, this means we could track a whopping 30TB of data assuming 8K blocksize. If you use 32K blocksize, currently the largest blocks we allow in L2ARC, then that grows up to 120TB of SSD based auto-tiered L2ARC. Of course, if you have a small L2ARC the tracking footprint of the in-core metadata is smaller.

Persistent Across Reboot

With that much tracked L2ARC space, you would hate to see it washed away on a reboot as the previous code did. Not so anymore, the new L2ARC has an on-disk format that allows it to be reconstructed when a pool is imported. That new format tracks the device space in 8MB segments for which each ZFS blocks (DVAs for the ZFS geeks) consumes 40 bytes of on-SSD space. So reusing the example of an L2ARC made up of only 8K-sized blocks, each 8MB segments could store about 1000 of those blocks consuming just 40K of on-SSD metadata. The key thing here is that to rebuild the in-core L2ARC space after a reboot, you only need to read back 40K, from the SSD itself, in order to discover and start tracking 8MB worth of data. We found that we could start tracking many TBs of L2ARC within minutes after a reboot. Moreover we made sure that as segment headers were read in, they would immediately be made available to the system and start to generate L2ARC hits, even before the L2ARC was done importing every segments. I should mention that this L2ARC import is done asynchronously with respect to the pool import and is designed to not slow down pool import or concurrent workloads. Finally that initial L2ARC import mechanism was made scalable with many import threads per L2ARC device.

Better Eviction

One of the benefits of using an L2ARC segment architecture is that we can now weigh them individually and use the least valued segment as eviction candidate. The previous L2ARC would actually manage L2ARC space by using a ring buffer architecture: first-in first-out. It's not a terrible solution for an L2ARC but the new code allows us to work on a weight function to optimise eviction policy. The current algorithm puts segments that are hit, an L2ARC cache hit, at the top of the list such that a segment with no hits gets evicted sooner.

Compressed on SSD

Another great new feature delivered is the addition of compressed L2ARC data. The new L2ARC stores data in SSDs the same way it is stored on disk. Compressed datasets are captured in the L2ARC in compressed format which provides additional virtual capacity. We often see a 2:1 compression ratio for databases and that is becoming more and more the standard way to deploy our servers. Compressed data now uses less SSD real estate in the L2ARC: a 1TB device holds 2TB of data if the data compresses 2:1. This benefit helps absorb the extra cost of flash based storage. For the security minded readers, be reassured that the data stored in the persistent L2ARC is stored using the encrypted format.

Scalable Feeding

There is a lot to like about what I just described but what gets me the most excited is the new feeding algorithm. The old one was suboptimal in many ways. It didn't feed well, disrupted the primary ARC, had self-imposed obsolete limits and didn't scale with the number of L2ARC devices. All gone.

Before I dig in, it should be noted that a common misconception about L2ARC feeding is assuming that the process handles data as it gets evicted from L1. In fact the two processes, feeding and evicting, are separate operations and it is sometimes necessary under memory pressure to evict a block before being able to install it in the L2ARC. The new code is much much better at avoiding such events; it does so by keeping it's feed point well ahead of the ARC tail. Under many conditions, when data is evicted from primary ARC it is after the L2ARC has processed it.

The old code also had some self-imposed throughput limit that meant that N x L2ARC devices in one pool, would not be fed at proper throughput. Given the strength of the new feeding algorithm we were able to remove such limits and now feeding scales with number of L2ARC devices in use. We also removed an obsolete constraint in which read I/Os would not be sent to devices as they were fed.

With these in place, if you have enough L2ARC bandwidth in the devices, then there are few constraints in the feeder to prevent actually capturing 100% of eligible L2ARC data1. And capturing 100% of data is the key to actually delivering a high L2ARC hit rate in the future. By hitting in L2, of course you delight end users waiting for such reads. More importantly, an L2ARC hit is a disk read I/O that doesn't have to be done. Moreover, that saved HDD read is a random read, one that would have lead to a disk seek, the real weakness of HDDs. Therefore, we reduce utilization of the HDDs, which is of paramount importance when some unusual job mix arrives and causes those HDDs become the resource gating performance: A.K.A crunch time. With a large L2ARC hit count, you get out of this crunch time quicker and restore proper level of service to your users.


The L2ARC Eligibility rules were impacted by the compression feature. The max blocksize considered for eligibility was unchanged at 32K but the check is now done on compressed size if compression is enabled. As before, the idea behind an upper limit on eligible size is two-fold, first for larger blocks, the latency advantage of flash over spinning media is reduced. The second aspect of this is that the SSD will eventually fill up with data. At that point, any block we insert in the L2ARC requires an equivalent amount of eviction. A single large block can thus cause eviction of a large number of small blocks. Without an upper cap on block size, we can face a situation of inserting a large block for a small gain with a large potential downside if many small evicted blocks become the subject of future hits. To paraphrase Yogi Berra: "Caching decisions are hard."2.

The second important eligibility criteria is that blocks must not have been read through prefetching. The idea is fairly simple. Prefetching applies to sequential workloads and for such workloads, flash storage offers little advantage over HDDs. This means that data that comes in through ZFS level prefetching is not eligible for L2ARC.

These criteria leave 2 pitfalls to avoid during an L2ARC demo, first configuring all datasets with 128K recordsize and second trying to prime the L2ARC using dd-like sequential workloads. Both of those are by design workloads that bypasse the L2ARC. The L2ARC is designed to help you with disk crunching real workloads, which are those that access small blocks of data in random order.

Conclusion : A Better HSP

In this context, the Hybrid Storage Pool (HSP) model refers to our ZFSSA architecture where data is managed in 3 tiers:

  1. a high capacity TB scale super fast RAM cache;
  2. a PB scale pool of hard disks with RAID protection;
  3. a channel of SSD base cache devices that automatically capture an interesting subset of the data.
And since the data is captured in the L2ARC device only after it has been stored in the main storage pool, those L2ARC SSDs do not need to be managed by RAID protection. A single copy of the data is kept in the L2ARC knowing that if any L2ARC device disappears, data is guaranteed to be present in the main pool. Compared to a mirrored all-flash storage solution, this ZFSSA auto-tiering HSP means that you get 2X the bang for your SSD dollar by avoiding mirroring of SSDs and with ZFS compression that becomes easily 4X or more. This great performance comes along with the simplicity of storing all of your data, hot, warm or cold, into this incredibly versatile high performance and cost effective ZFS based storage pool.

1It should be noted that ZFSSA tracks L2ARC eviction as "Cache: ARC evicted bytes per second broken down by L2ARC state", with subcategories of "cached," "uncached ineligible," and "uncached eligible." Having this last one at 0 implies a perfect L2ARC capture.

2For non-americans, this famous baseball coach is quoted to have said, "It's tough to make predictions, especially about the future."

vendredi févr. 20, 2015

ZIL Pipelining

The third topic on my list of improvements since 2010 is ZIL pipelining :
		Allow the ZIL to carve up smaller units of
		work for better pipelining and higher log device 
So let's remind ourselves of a few things about the ZIL and why it's so critical to ZFS. The ZIL stands for ZFS Intent Log and exists in order to speed up synchronous operations such as an O_DSYNC write or fsync(3C) calls. Since most Database operation involve synchronous writes it's easy to understand that having good ZIL performance is critical in many environments.

It is well understood that a ZFS pool updates it's global on-disk state at a set interval (5 seconds these days). The ZIL is actually what keeps information in between those transaction group (TXG). The ZIL records what is committed to stable storage from a users point of view. Basically the last committed TXG + replay of the ZIL is the valid storage state from a users perspective.

The on-disk ZIL is a linked list of records which is actually only useful in the event of a power outage or system crash. As part of a pool import, the on-disk ZIL is read and operations replayed such that the ZFS pool contains the exact information that had been committed before the disruption.

While we often think of the ZIL as it's on-disk representation (it's committed state), the ZIL is also an in-memory representation of every posix operation that needs to modify data. For example, a file creation even if that is an asynchronous operation needs to be tracked by the ZIL. This is because any asynchronous operation, may at any point in time require to be committed to disk; this is often due to an fsync(3C) call. At that moment, every pending operation on a given file needs to be packaged up and committed to the on-disk ZIL.

Where is the on-disk ZIL stored ?

Well that's also more complex than it sound. ZFS manages devices specifically geared to store ZIL blocks; those separate slog devices or slogs are very often flash SSD. However the ZIL is not constrained to only using blocks from slog devices; it can store data on main (non-slog) pool devices. When storing ZIL information into the non-slog pool devices, the ZIL has a choice of recording data inside zil blocks or recording full file records inside pool blocks and storing a reference to it inside the ZIL. This last method for storing ZIL blocks has the benefit of offloading work from the upcoming TXG sync at the expense of higher latency since the ZIL I/Os are being sent to rotating disks. This mode is the one used with logbias=throughput. More on that below.

Net net: the ZIL records data in stable storage in a link list and user applications have synchronization point in which they choose to wait on the ZIL to complete it's operation.

When things are not stressed, operations show up at the ZIL, wait a little bit while the ZIL does it's work, and are then released. Latency of the ZIL is then coherent with the underlying device used to capture the information. In this rosy picture we would not have done this train project.

At times though, the system can get stressed. The older mode of operation of the ZIL was to issue a ZIL transaction (implemented by ZFS function zil_commit_writer) and while that was going on, build up the next ZIL transaction with everything that showed up at the door. Under stress when a first operation would be serviced with a high latency, the next transaction would accumulate many operations, growing in size thus leading to a longer latency transaction and this would spiral out of control. The system would automatically divide into 2 ad-hoc sets of users; a set of operations which would commit together as a group, while all other threads in the system would form the next ZIL transaction and vice-versa.

This leads to bursty activity on the ZIL devices, which meant that, at times, they would go unused even though they were the critical resource. This 'convoy' effect also meant disruption of servers because when those large ZIL transaction do complete, 100s or 1000s of user threads might see their synchronous operation complete and all would end up flagged as 'runnable' at the same time. Often those would want to consume the same resource, run on the same CPU, of use the same lock etc. This led to thundering herds, a source of system inefficiency.

Thanks to the ZIL train project, we now have the ability to break down convoys into smaller units and dispatch them into smaller ZIL level transactions which are then pipelined through the entire data center.

With logbias set to throughput, the new code is attempting to group ZIL transactions in sets of approximately 40 operations which is a compromise between efficient use of ZIL and reduction of the convoy effect. For other types of synchronous operations we group them into sets representing about ~32K of data to sync. That means that a single sufficiently large operation may run by itself but more threads will group together if their individual commit size are small.

The ZIL train is thus expected to handle burst of synchronous activity with a lot less stress on the system.


As we just saw the ZIL provides 2 modes of operation. The throughput mode and the default latency mode. The throughput mode is named as such not so much because it favors throughput but more so because it doesn't care too much about individual operation latency. The implied corollary of throughput friendly workloads is that they are very highly concurrent (100s or 1000s of independent operations) and therefore are able to get to high throughput even when served at high latency. The goal of providing a ZIL throughput mode is to actually free up slog devices from having to handle such highly concurrent workloads and allow those slog devices to concentrate on serving other low-concurrency, but highly sensitive to latency operations.

For Oracle DB, we therefore recommend the use of logbias set to throughput for DB files which are subject to highly concurrent DB writer operations while we recommend the use of the default latency mode for handling other latency sensitive files such as the redo log. This separation is particularly important when redo log latency is very critical and when the slog device is itself subject to stress.

When using Oracle 12c with dnfs and OISP, this best practice is automatically put into place. In addition to proper logbias handling, DB data files are created with a ZFS recordsize matching the established best practice : ZFS recordsize matching DB blocksize for datafiles; ZFS recordsize of 128K for redo log.

When setting up a DB, with or without OISP, there is one thing that Storage Administrators must enforce : they must segregate redo log files into their own filesystems (also known as shares or datasets). The reason for this is that the ZIL is a single linked list of transactions maintained by each filesystem (other filesystems run their own ZIL independently). And while the ZIL train allows for multiple transaction to be in flight concurrently, there is a strong requirement for completion of the transaction and notification of waiters to be handled in order. If one were to mix data files and redo log files in the same ZIL, then some redo transaction would be linked behind some DB writer transactions. Those critical redo transaction committing in latency mode to a slog device would see their I/O complete quickly (100us timescale) but nevertheless have to wait for an antecedent DB writer transaction committing in throughput mode to regular spinning disk device (ms timescale). In order to avoid this situation, one must ensure that redo log files are stored in their own shares.

Let me stop here, I have a train to catch...

mercredi janv. 21, 2015

Sequential Resilvering

In the initial days of ZFS some pointed out that ZFS resilvering was metadata driven and was therefore super fast : after all we only had to resilver data that was in-use compared to traditional storage that has to resilver entire disk even if there is no actual data stored. And indeed on newly created pools ZFS was super fast for resilvering.

But of course storage pools rarely stay empty. So what happened when pools grew to store large quantities of data ? Well we basically had to resilver most blocks present on a failed disk. So the advantage of only resilvering what is actually present is not much of a advantage, in real life, for ZFS.

And while ZFS based storage grew in importance, so did disk sizes. The disk sizes that people put in production are growing very fast showing the appetite of customers to store vast quantities of data. This is happening despite the fact that those disks are not delivering significantly more IOPS than their ancestors. As time goes by, a trend that has lasted forever, we have fewer and fewer IOPS available to service a given unit of data. Here ZFSSA storage arrays with TB class caches are certainly helping the trend. Disk IOPS don't matter as much as before because all of the hot data is cached inside ZFS. So customers gladly tradeoff IOPS for capacity given that ZFSSA deliver tons of cached IOPS and ultra cheap GB of storage.

And then comes resilvering...

So when a disk goes bad, one has to resilver all of the data on it. It is assured at that point that we will be accessing all of the data from surviving disks in the raid group and that this is not a highly cached set. And here was the rub with old style ZFS resilvering : the metadata driven algorithm was actually generating small random IOPS. The old algorithm was actually going through all of the blocks file by file, snapshot by snapshot. When it found an element to resilver, it would issue the IOPS necessary for that operation. Because of the nature of ZFS, the populating of those blocks didn't lead to a sequential workload on the resilvering disks.

So in a worst case scenario, we would have to issue small random IOPS covering 100% of what was stored on the failed disk and issue small random writes to the new disk coming in as a replacement. With big disks and very low IOPS rating comes ugly resilvering times. That effect was also compounded by a voluntary design balance that was strongly biased to protect application load. The compounded effect was month long resilvering.

The Solution

To solve this, we designed a subtly modified version of resilvering. We split the algorithm in two phases. The populating phase and the iterating phase. The populating phase is mostly unchanged over the previous algorithm except that, when encountering a block to resilver, instead of issuing the small random IOPS, we generate a new on disk log of them. After having iterated through all of the metadata and discovered all of the elements that need to be resilvered we now can sort these blocks by physical disk offset and issue the I/O in ascending order. This in turn allows the ZIO subsystem to aggregate adjacent I/O more efficiently leading to fewer larger I/Os issued to the disk. And by virtue of issuing I/Os in physical order it allows the disk to serve these IOPS at the streaming limit of the disks (say 100MB/sec) rather than being IOPS limited (say 200 IOPS).

So we hold a strategy that allows us to resilver nearly as fast as physically possible by the given disk hardware. With that newly acquired capability of ZFS, comes the requirement to service application load with a limited impact from resilvering. We therefore have some mechanism to limit resilvering load in the presence of application load. Our stated goal is to be able to run through resilvering at 1TB/day (1TB of data reconstructed on the replacing drive) even in the face of an active workload.

As disks are getting bigger and bigger, all storage vendors will see increasing resilvering times. The good news is that, since Solaris 11.2 and ZFSSA since 2013.1.2, ZFS is now able to run resilvering with much of the same disk throughput limits as the rest of non-ZFS based storage.

The sequential resilvering performance on a RAIDZ pool is particularly noticeable to this happy Solaris 11.2 customer saying It is really good to see the new feature work so well in practice.

mardi déc. 02, 2014


The initial topic from my list is reARC. This is a major rearchitecture of the code that manages ZFS in-memory cache along with its interface to the DMU. The ARC is of course a key enabler of ZFS high performance. As the scale of systems grow in memory size, CPU count and frequency, some major changes were required to the ARC to keep up with the pace. reARC is such a major body of work, I can only talk about of few aspects of the Wonders of ZFS Storage here.

In this article, I describe how the reARC project had impact on at least these 7 important aspects of it's operation:
  • Managing metadata
  • Handling ARC accesses to cloned buffers
  • Scalability of cached and uncached IOPS
  • Steadier ARC size under steady state workloads
  • Improved robustness for a more reliable code
  • Reduction of the L2ARC memory footprint
  • Finally, a solution to the long standing issue of I/O priority inversion
The diversity of topics covered serves as a great illustration of the incredible work handled by the ARC and a testament to the importance of ARC operations to all other ZFS subsystems. I'm truly amazed at how a single project was able to deliver all this goodness in one swoop.

No Meta Limits

Previously, the ARC claimed to use a two-state model:
  • "most recently used" (MRU)
  • "most frequently used" (MFU)
But it further subdivided these states into data and metadata lists.

That model, using 4 main memory lists, created a problem for ZFS. The ARC algorithm gave us only 1 target size for each of the 2 MRU and MFU states. The fact that we had 2 lists (data and metadata) but only 1 target size for the aggregate meant that when we needed to adjust the list down, we just didn't have the necessary information to perform the shrink. This lead to the presence of an ugly tunable arc_meta_limit, which was impossible to set properly and was a source of problems for customers.

This problem raises an interesting point and a pet peeve of mine. Many people I've interacted with over the years defended the position that metadata was worth special protection in a cache. After all, metadata is necessary to get to data, so it has intrinsically higher value and should be kept around more. The argument is certainly sensible on the surface, but I was on the fence about it.

ZFS manages every access through a least recently used scheme (LRU). New access to some block, data or metadata, puts that block back to the head of the LRU list, very much protected from eviction, which happens at the tail of the list.

When considering special protection for metadata, I've always stumbled on this question:

If some buffer, be it data or metadata, has not seen any accesses for sufficient amount of time, such that the block is now the tail of an eviction list, what is the argument that says that I should protect that block based on it's state ?
I came up blank on that question. If it hasn't been used, it can be evicted, period. Furthermore, even after taking this stance, I was made aware of an interesting fact about ZFS. Indirect blocks, the blocks that hold a set of block pointers to the actual data are non_evictable inasmuch as any of the block pointers they reference are currently in the ARC. In other words, if some data is in cache, it's metadata is also in the cache and furthermore, is non-evictable. This fact really reinforced my position that in our LRU cache handling, metadata doesn't need special protection from eviction.

And so, the reARC project actually took the same path. No more separation of data and metadata and no more special protection. This improvement led to fewer lists to manage and simpler code, such as shorter lock hold times for eviction. If you are tuning arc_meta_limit for legacy reasons, I advise you to try without this special tuning. It might be hurting you today and should be considered obsolete.

Single Copy Arc: Dedup of Memory

Yet another truly amazing capability of ZFS is it's infinite snapshot capabilities. There are just no limits, other than hardware, to the number of (software) snapshots that you can have.

What is magical here is not so much that ZFS can manage a large number of snapshots, but that it can do so without reference counting the blocks that are referenced through a snapshot. You might need to read that sentence again ... and check the blog entry.

Now fast forward to today where there is something new for the ARC. While we've always had the ability to read a block referenced from the N-different snapshots (or clones), the old ARC actually had to manage separate in-memory copies of each block. If the accesses were all reads, we'd needlessly instantiate the same data multiple times in memory.

With the reARC project and the new DMU to ARC interfaces, we don't have to keep multiple data copies. Multiple clones of the same data share the same buffers for read accesses and new copies are only created for a write access. It has not escaped our notice that this N-way pairing has immense consequences for virtualization technologies. The use of ZFS clones (or writable snapshots) is just a great way to deploy a large number of virtual machines. ZFS has always been able to store N clone copies with zero incremental storage costs. But reARC is taking this one step further. As VMs are used, the in-memory caches that are used to manage multiple VMs no longer need to inflate, allowing the space savings to be used to cache other data. This improvement allows Oracle to boast the amazing technology demonstration of booting 16000 VMs simultaneously.

Improved Scalability of Cached and Uncached OPs

The entire MRU/MFU list insert and eviction processes have been redesigned. One of the main functions of the ARC is to keep track of accesses, such that most recently used data is moved to the head of the list and the least recently used buffers make their way towards the tail, and are eventually evicted. The new design allows for eviction to be performed using a separate set of locks from the set that is used for insertion. Thus, delivering greater scalability. Moreover, through a very clever algorithm, we're able to move buffers from the middle of a list to the head without acquiring the eviction lock.

These changes were very important in removing long pauses in ARC operations that hampered the previous implementation. Finally, the main hash table was modified to use more locks placed on separate cache lines improving the scalability of the ARC operations. This lead to a boost in the cached and uncached maximum IOPs capabilities of the ARC.

Steadier Size, Smaller Shrinks

The growth and shrink model of the ARC was also revisited. The new model grows the ARC less aggressively when approaching memory pressure and instead recycles buffers earlier on. This recycling leads to a steadier ARC size and fewer disruptive shrink cycles. If the changing environment nevertheless requires the ARC to shrink, the amount by which we do shrink each time is reduced to make it less of a stress for each shrink cycle. Along with the reorganization of the ARC list locking, this has lead to a much steadier, dependable ARC at high loads.

ARC Access Hardening

A new ARC reference mechanism was created that allows the DMU to signify read or write intent to the ARC. This, in turn, enables more checks to be performed by the code. Therefore, catching bugs earlier in the process. A better separation of function between the DMU and the ARC is critical for ZFS robustness or hardening. In the new reARC mode of operation, the ARC now actually has the freedom relocate kernel buffers in memory in between DMU accesses to a cached buffer. This new feature proves invaluable as we scale to large memory systems.

L2ARC Memory Footprint Reduction

Historically, buffers were tracked in the L2ARC (the SSD based secondary ARC) using the same structure that was used by the main primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut down this amount by more than 2X to a bare minimum that now only requires about 80 bytes of metadata per L2 buffers. With the arrival of larger SSDs for L2ARC and a better feeding algorithm, this reduced L2ARC footprint is a very significant change for the Hybrid Storage Pool (HSP) storage model.

I/O Priority Inversion

One nagging behavior of the old ARC and ZIO pipeline was the so-called I/O priority inversion. This behavior was present mostly for prefetching I/Os, which was handled by the ZIO pipeline at a lower priority operation than, for example, a regular read issued by an application. Before reARC, the behavior was that after an I/O prefetch was issued, a subsequent read of the data that arrived while the I/O prefetch was still pending, would block waiting on the low priority I/O prefetch completion.

While it sounds simple enough to just boost the priority of the in-flight I/O prefetch, ARC/ZIO code was structured in such a way that this turned out to be much trickier than it sounds. In the end, the reARC project and subsequent I/O restructuring changes, put us on the right path regarding this particular quirkiness. Fixing the I/O priority inversion meant that fairness between different types of I/O was restored.


The key points that we saw in reARC are as follows:
  1. Metadata doesn't need special protection from eviction, arc_meta_limit has become an obsolete tunable.

  2. Multiple clones of the same data share the same buffers for great performance in a virtualization environment.
  3. We boosted ARC scalability for cached and uncached IOPs.
  4. The ARC size is now steadier and more dependable.
  5. Protection from creeping memory bugs is better.
  6. L2ARC uses a smaller footprint.
  7. I/Os are handled with more fairness in the presence of prefetches.
All of these improvements are available to customers of Oracle's ZFS Storage Appliances in any AK-2013 releases and recent Solaris 11 releases. And this is just topic number one. Stay tuned as we go about describing further improvements we're making to ZFS.

ZFS Performance boosts since 2010

Well, look who's back! After years of relative silence, I'd like to put back on my blogging hat and update my patient readership about the significant ZFS technological improvements that have integrated since Sun and ZFS became Oracle brands. Since there is so much to cover, I tee up this series of article with a short description of 9 major performance topics that have evolved significantly in the last years. Later, I will describe each topic in more details in individual blog entries. Of course, these selected advancements represents nowhere near an exhaustive list. There has been over 650 changes to the ZFS code in the last 4 years. My personal performance bias has selected topics that I know best. The designated topics are:
  1. reARC

  2. Scales the ZFS cache to TB class machines and CPU counts in thousands.
  3. Sequential Resilvering

  4. Converts a random workload to a sequential one.
  5. ZIL Pipelining

  6. Allows the ZIL to carve up smaller units of work for better pipelining and higher log device utilisation.
  7. It is the dawning of the age of the L2ARC

  8. Not only did we make the L2ARC persistent on reboot, we made the feeding process so much more efficient we had to slow it down.
  9. Zero Copy I/O Aggregation

  10. A new tool delivered by the Virtual Memory team allows the already incredible ZFS I/O aggregation feature to actually do its thing using one less copy.
  11. Scalable Reader/Writer locks

  12. Reader/Writer locks, used extensively by ZFS and Solaris, had their scalability greatly improved on on large systems.
  13. New thread Scheduling class

  14. ZFS transaction groups are now managed by a new type of taskqs which behave better managing bursts of cpu activity.
  15. Concurrent Metaslab Syncing

  16. The task of syncing metaslabs is now handled with more concurrency, boosting ZFS write throughput capabilities.
  17. Block Picking

  18. The task of choosing blocks for allocations has been enhanced in a number of ways, allowing us to work more efficiently at a much higher pool capacity percentage.
There you have it. I'm looking forward to reinvigorating my blog so stay tuned.

mardi févr. 28, 2012

Sun ZFS Storage Appliance : can do blocks, can do files too!

Last October, we demonstrated storage leadership in block protocols with our stellar SPC-1 result showcasing our top of the line Sun ZFS Storage 7420.

As a benchmark SPC-1's profile is close to what a fixed block size DB would actually be doing. See Fast Safe Cheap : Pick 3 for more details on that result. Here, for an encore, we're showing today how the ZFS Storage appliance can perform in a totally different environment : generic NFS file serving.

We're announcing that the Sun ZFS Storage 7320's reached 134,140 SPECsfs2008_nfs.v3 Ops/sec ! with 1.51 ms ORT running SPEC SFS 2008 benchmark.

Does price performance matters ? It does, doesn't it, See what Darius has to say about how we compare to Netapp : Oracle posts Spec SFS.

This is one step further in the direction of bringing to our customer true high performance unified storage capable of handling blocks and files on the same physical media. It's worth noting that provisioning of space between the different protocols is entirely software based and fully dynamic, that every stored element fully checksummed, that all stored data can be compressed with a number of different algorithms (including gzip), and that both filesystems and block based luns can be snapshot and cloned at their own granularity. All these manageability features available to you in this high performance storage package.

Way to go ZFS !

SPEC and SPECsfs are registered trademarks of Standard Performance Evaluation Corporation (SPEC). Results as of February 22, 2012, for more information see

lundi oct. 03, 2011

Fast, Safe, Cheap : Pick 3

Today, we're making performance headlines with Oracle's ZFS Storage Appliance.

SPC-1 : Twice the performance of NetApp at the same latency; Half the $/IOPS;

I'm proud to say that, yours truly, along with a lot of great teammates in Oracle, is not totally foreign to this milestone.

We are announcing that Oracle's 7420C cluster acheived 137000 SPC-1 IOPS with an average latency of less than 10 ms. That is double the results of NetApp's 3270A while delivering the same latency. As compared to the NetApp 3270 result, this is a 2.5x improvement in $/SPC-1-IOPS (2.99$/IOPS vs $7.48/IOPS). We're also showing that when the ZFS Storage Appliance runs at the rate posted by the 3270A (68034 SPC-1 IOPS), our latency of 3.26ms is almost 3X lower than theirs (9.16ms). Moreover, our result was obtained with 23700 GB of user level capacity (internally mirrored) for 17.3 $/GB while NetApp's , even using a space saving raid scheme, can only deliver 23.5$/GB. This is the price per GB of application data actually used in the benchmark. On top of that the 7420C still had 40% of space headroom whereas the 3270A was left with only 10% of free blocks.

These great results were at least partly made possible with the availability of 15K RPM Hard Disk Drives (HDD). Those are great to run the most demanding databases because they combine a large IOPS capability and are generally of smaller capacity. The ratio of IOPS/GB makes them ideal to store high intensity database modeled by SPC-1. On top of that, this concerted engineering effort lead to improved software not just for those running on 15K RPM. We actually used this benchmark to seek out how to increase the quality of our products. The preparation runs, after an initial diagnostic of some issue, we were attached to finding solutions that where not targeting the idiosyncrasies of SPC-1 but based on sound design decision. So instead of changing the default value of some internal parameter to a new static default, we actually changed the way the parameter worked so that our storage systems or all types and sizes would benefit.

So not only are we getting a great SPC-1 results, but all existing customers will benefit from this effect even if they are operating outside of the intense conditions created by the benchmark.

So what is SPC-1 ? It is one of the few benchmarks which counts for storage. It is maintained by Storage Performance Council (SPC). SPC-1 simulates multiple databases running on a centralized storage or storage cluster. But even if SPC-1 is a block based benchmark, within the ZFS Storage appliance, a block based FC or iSCSI volume is handled very much the same way as would be a large file subject to synchronous operation. And by Combining modern network technologies (Infiniband or 10Gbe Ethernet), the CPU power packed in the 7420C storage controllers and Oracle's custom dNFS technology for databases, one can truly acheive very high database transaction rates on top of the more manageable and flexible file based protocols.

The benchmarks defines three Application Storage Unit (ASU): ASU1 with a heavy 8KB block read/write component, ASU2 with a much lighter 8KB block read/write component, and ASU3 which is subject to hundreds of write streams. As such it's is not too far from a simulation of running hundreds of Oracle database onto a single system : ASU1 and ASU2 for datafiles and ASU3 for redolog storage.

The total size of the ASUs is constrained such that all of the stored data (including mirror protection and disk used for spares) must exceed 55% of all configured storage. The benchmark team is then free to decide how much total storage to configure. From that figure, 10% is given to ASU3 (redo log space) and the rest divided equally between heavily ASU1 and lightly used ASU2.

The benchmark team also has to select the SPC-1 IOPS throughput level it wishes to run. This is not a light decision given you want to balance high IOPS; low latency and $/user GB.

Once the target IOPS rate is selected, there are multiple criteria needed to pass a successful audit; one of the most critical is that you have to run at the specified IOPS rate for a whole 8 hour. Note that the previous specifications of the benchmark used by NetApp called for an 4 hour run. During that 8 hour run delivering a solid 137000 SPC-1 IOPS, the avg latency of must be less than 30ms (we did much better than that).

After this brutal 8 hour run, the benchmark then enters another critical phase: the workload is restarted (using a new randomly selected working set) and performance is measured for a 10 minute period. It is this 10 minute period that decides the official latency of the run.

When everything is said and done, you press the trigger; go to sleep and wake up to the result. As you could guess we were ecstatic that morning. Before that glorious day, for lack of a stronger word, a lot of hard work had been done during the extensive preparation runs. With little time, and normally not all of the hardware, one runs through series of run at incremental loads, making educated guesses as to how to improve the result. As you get more hardware you scale up the result tweaking things more or less until the final hour.

SPC-1, with it's requirement of less than 45% of unused space, is designed to trigger many disk level random read IOPS. Despite this inherent random pattern of the workload, we saw that our extensive caching architecture was as helpful for this benchmark as it is in real production workloads. While the 15K RPM HDDs normally levels off with random operation at a rate slightly above 300 IOPS, our 7420C, as a whole, could deliver almost 500 user-level SPC-1 IOPS per HDDs.

In the end one of the most satisfying aspect was to see that the data being managed by ZFS was stored rock solid on disk, properly checksummed, all data could be snapshot, compressed on demand, and delivering an impressively steady performance.

2X the absolute performance, 2.5X cheaper per SPC-1 IOPS, almost 3X lower latency, 30% cheaper per user GB with room to grow... So, If you have a storage decision coming and you need, FAST, SAFE, CHEAP : pick 3, take a fresh look at the ZFS Storage appliance.

SPC-1, SPC-1 IOPS, $/SPC-1 IOPS reg tm of Storage Performance Council (SPC). More info Sun ZFS Storage 7420 Appliance and

Oracle Sun ZFS Storage Appliance 7420 _ _As of October 3, 2011 Netapp FAS3270A _ _As of October 3, 2011

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

mercredi mai 26, 2010

Let's talk about Lun Alignment for 3 minutes

Recall that I had Lun alignment on my mind a few weeks ago. Nothing special about the ZFS storage appliance over any other storage. Pay attention to how you partition your luns, it can have a great impact on performance. Right Roch ? :

jeudi mars 11, 2010

Dedup Performance Considerations

One of the major milestones for ZFS Storage appliance with 2010/Q1 is the ability to dedup data on disk. The open question is then : What performance characteristics are we expected to see from Dedup ? As Jeff says, this is the ultimate gaming ground for benchmarks. But lets have a look at the fundamentals.

ZFS Dedup Basics

Dedup code is simplistically a large hash table (the DDT). It uses a 256 bit (32 Bytes) checksum along with other metata data to identify data content. On a hash match, we only need to increase a reference count, instead of writing out duplicate data. The dedup code is integrated in the I/O pipeline and is done on the fly as part of the ZFS transaction group (see Dynamics of ZFS, The New ZFS Write Throttle ). A ZFS zpool typically holds a number of datasets : either block level LUNS which are based on ZVOL or NFS and CIFS File Shares based on ZFS filesystems. So while the dedup table is a construct associated with individual zpool, enabling of the deduplication feature is something controlled at the dataset level. Enabling of the dedup feature on a dataset, has no impact on existing data which stay outside of the dedup table. However any new data stored in the dataset will then be subject to the dedup code. To actually have existing data become part of the dedup table one can run a variant of "zfs send | zfs recv" on the datasets.

Dedup works on a ZFS block or record level. For a iSCSI or FC LUN, i.e. objects backed by ZVOL datasets, the default blocksize is 8K. For filesystems (NFS, CIFS or Direct Attach ZFS), object smaller than 128K (the default recordsize) are stored as a single ZFS block while objects bigger than the default recordsize are stored as multiple records Each record is the unit which can end up deduplicated in the DDT. Whole Files which are duplicated in many filesystems instances are expected to dedup perfectly. For example, whole DB copied from a master file are expected to falls in this category. Similarly for LUNS, virtual desktop users which were created from the same virtual desktop master image are also expected to dedup perfectly.

An interesting topic for dedup concerns streams of bytes such as a tar file. For ZFS, a tar file is actually a sequence of ZFS records with no identified file boundaries. Therefore, identical objects (files captured by tar) present in 2 tar-like byte streams might not dedup well unless the objects actually start on the same alignment within the byte stream. A better dedup ratio would be obtained by expanding the byte stream into it's constituent file objects within ZFS. If possible, the tools creating the byte stream would be well advised to start new objects on identified boundaries such as 8K.

Another interesting topic is backups of active Databases. Since database often interact with their constituent files with an identified block size, it is rather important for the deduplication effectiveness that the backup target be setup with a block size that matches the source DB block size. Using a larger block on the deduplication target has the undesirable consequence that modifications to small blocks of the source database will cause those large blocks in the backup target to appear unique and not dedup somewhat artificially. By using an 8K block size in the dedup target dataset instead of 128K, one could conceivably see up to a 10X better deduplication ratio.

Performance Model and I/O Pipeline Differences

What is the effect on performance of Dedup ? First when dedup is enabled, the checksum used by ZFS to validate the disk I/O is changed to the cryptographically strong SHA256. Darren Moffat shows in his blog that SHA256 actually runs at more than 128 MB/sec on a modern cpu. This means that less than 1 ms is consumed to checksum a 128K and less than 64 usec for an 8K unit. This cost is online incurred when actually reading or writing data to disk, an operation that is expected to take 5-10 ms; therefore the checksum generation or validation is not a source of concern.

For the read code path, very little modification should be observed. The fact that a reads happens to hit a block which is part of the dedup table is not relevant to the main code path. The biggest effect will be that we use a stronger checksum function invoked after a read I/O : at most an extra 1 ms is added to a 128K disk I/O. However if a subsequent read is for a duplicate block which happens to be in the pool ARC cache, then instead of having to wait for a full disk I/O, only a much faster copy of duplicate block will be necessary. Each filesystem can then work independently on their copy of the data in the ARC cache as is the case without deduplication. Synchronous writes are also unaffected in their interaction with the ZIL. The blocks written in the ZIL have a very short lifespan and are not subject to deduplication. Therefore the path of synchronous writes is mostly unaffected unless the pool itself ends up not being able to absorb the sustained rate of incoming changes for 10s of seconds. Similarly for asynchronous writes which interact with the ARC caches, dedup code has no affect unless the pool's transaction group itself becomes the limiting factor. So the effect of dedup will take place during the pool transaction group updates. Here is where we take all modifications that occurred in the last few seconds and atomically commit a large transaction group (TXG). While a TXG is running, applications are not directly affected except possibly for the competition for CPU cycles. They mostly continue to read from disk and do synchronous write to the zil, and asynchronous writes to memory. The biggest effect will come if the incoming flow of work exceed the capabilities of the TXG to commit data to disk. Then eventually the reads and write will be held up by the necessary write (Throttling) code preventing ZFS from consuming up all of memory .

Looking into the ZFS TXG, we have 2 operations of interest, the creation of a new data block and the simple removal (free) of a previously used block. ZFS operating under a copy on write (COW) model, any modification to an existing block actually represents both a new data block creation and a free to a previously used block (unless a snapshot was taken in which case there is no free). For file shares, this concerns existing file rewrites; for block luns (FC and iSCSI), this concerns most writes except the initial one (very first write to a logical block address or LBA actually allocates the initial data; subsequent writes to the same LBA are handled using COW). For the creation of a new application data block, ZFS will then run the checksum of the block, as it does normally and then lookup in the dedup table for a match based on that checksum and a few other bits of information. On a dedup table hit, only a reference count needs to be increased and such changes to the dedup table will be stored on disk before the TXG completes. Many DDT entries are grouped in a disk block and compression is involved. A big win occurs when many entries in a block are subject to a write match during one TXG. Then a single 1 x 16K I/O can then replace 10s of larger IOPS. As for free operations, the internals of ZFS actually holds the referencing block pointer which contains the checksum of the block being freed. Therefore there is no need to read nor recompute the checksum of the data being freed. ZFS, with checksum in hand, looks up the entry in dedup table and decrement the reference counter. If the counter is non zero then nothing more is necessary (just the dedup table sync). If the freed block ends up without any reference then it will be freed.

The DEDUP table itself an an object managed by ZFS at the pool level. The table is considered metadata and it's elements will be stored in the ARC cache. Up to 25% of memory (zfs_arc_meta_limit) can be used to store metadata. When the dedup table actually fits in memory, then enabling dedup is expected to have a rather small effect on performance. But when the table is many time greater than allotted memory, then the lookups necessary to complete the TXG can cause write throttling to be invoked earlier than the same workload running without dedup. If using an L2ARC, the DDT table represents prime objects to use the secondary cache. Note that independent of the size of the dedup table, read intensive workloads in highly duplicated environment, are expected to be serviced using fewer IOPS at lower latency than without dedup. Also note that whole filesystem removal or large file truncation are operation that can free up large quantity of data at once and when the dedup table exceeds allotted memory then those operation, which are more complex with deduplication, can then impact the amount of data going into every TXG and the write throttling behavior.

So how large is the dedup table ?

The command zdb -DD on a pool shows the size of DDT entries. In one of my experiment it reported about 200 Bytes of core memory for table entries. If each unique object is associated with 200 Bytes of memory then that means that 32GB of ram could reference 20TB of unique data stored in 128K records or more than 1TB of unique data in 8K records. So if there is a need to store more unique data than what these ratio provide, strongly consider allocating some large read optimized SSD to hold the DDT. The DDT lookups are small random IOs which are handled very well by current generation SSDs.

The first motivation to enable dedup is actually when dealing with duplicate data to begin with. If possible procedures that generate duplication could be reconsidered. The use of ZFS Clones is actually a much better way to generate logically duplicate data for multiple users in a way that does not require a dedup hash table.

But when the operating conditions does not allow the use of ZFS Clones and data is highly duplicated, then the ZFS deduplication capability is a great way to reduce the volume of stored data.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Referenced Links :

Proper Alignment for extra performance

Because of disk parititioning software on your storage clients (keywords : EFI, VTOC, fdisk, DiskPart,...) or a mismatch between storage configuration and application request pattern, you could be suffering a 2-4X performance degradation....

Many I/O performance problem I see end up being the result of a mismatch in request sizes or it's alignment versus the natural block size of the underlying storage. While raw disk storage works using a 512 Byte sector and performs at the same level independent of the starting offset of I/O requests this is not the case for more sophisticated storage which will tend to use larger block units. Some SSDs today support 512B aligned requests but will work much better if you give them 4K aligned requests as described in Aligning on 4K boundaries Flash and Sizes. The Sun Oracle 7000 Unified Storage line supports different sizes of blocks between 4K and 128K (it can actually go lower but I would not recommend that in general). Having proper alignment between the application's view, the initiator partitioning and the backing volume can have great impact on the end performance delivered to applications.

When is alignment most important ?

Alignment problems are most likely to have an impact with
  • running a DB on file shares or block volumes
  • write streaming to block volumes (backups)
Also impacted at a lesser level :
  • large file rewrites on CIFS or NFS shares
In each case adjusting the recordsize to match the workload and insuring that partitions are aligned on a block boundary could have important effect on your performance.

Let's review the different cases.

Case 1: running a Database (DB) on file shares or block volumes

Here the DB is a block oriented application. General ZFS Best Practices warrant that the storage use a record size equal to the DB natural block size. At the logical level, the DB is issuing I/O which are aligned on block boundaries. When using file semantics (NFS or CIFS), then the alignment is guaranteed to be observed all the way to the backend storage. But when using block device semantics, the alignments of requests on the initiator is not guaranteed to be the same as the alignement on the target side. Misalignment of the LUN will cause two pathologies. First, an application block read will straddle 2 storage blocks creating storage IOPS inflation (more backend reads than application reads). But a more drastic effect will be seen for block writes which, when aligned, could be serviced by a single write I/O. Those will now require a Read-Modify-Write (R-W-M) of 2 adjacent storage blocks. Such type of I/O inflation leads to additional storage load and degrade performance during high demand.

To avoid such I/O inflation, insure that the backing store uses a block size (LUN volblocksize or Share recordsize) compatible with the DB block size. If using a file share such as NFS, insure that the filesystem client passes I/O requests directly to the NFS server using a mount option such as directio or use Oracle's dNFS client (Note that with directio mount option, memory management considerations independent of alignment concerns, the server will behave better when the client specifies rsize,wsize options not exceeding 128K). To avoid such LUN misalignment, prefer the use full LUNS as opposed to sliced partition. If disk slices must be used, prefer partitioning scheme in which one can control the sector offset of individual partitions such as EFI labels. In that case start partitions on a sector boundary which aligns with the volume's blocksize. For instance a initial block for a parition which is a multiple of 16 \* 512B sectors will align on an 8K boundary, the default lun blocksize.

Case 2: write streaming to block volumes (backups)

The other important case to pay attention to is stream writing to a raw block device. Block devices by default commit each write to stable storage. This path is often optimized through the use of acceleration devices such as write optimized SSD. Misalignement of the LUNS due to partitioning software imply that application writes, which could otherwise be committed to SSD at low latency, will instead be delayed by disk reads caught in R-M-W. Because the writes are synchronous in nature, the application running on the initiator will thus be considerably delayed by disk reads. Here again one must insure that partitions created on the client system are aligned with the volumes blocksize which typically default to 8K. For pure streaming workloads large blocksize up to the maximum 128K can lead to greater streaming performance. One must take good care that the block size used for a LUNS should not exceed the application writes sizes to raw volumes or risk being hit by the R-M-W penalty.

Case 3: large file rewrites on CIFS or NFS shares

For file shares, large streaming write will be of 2 types : they will either be the more common file creation (write allocation) or they will correspond to streaming overwrite to existing file. The more common write allocation would not greatly suffer from misalignment since there is no pre-existing data to be read and modified. But for the less common streaming rewrite to files, one can definitely be impacted by misalignment and R-M-W cycles. Fortunately file protocols are not subject to LUN misalignment so one must only take care that the write sizes reaching the storage be multiple of the recordsize used to create the file share in the storage. The solaris NFS clients often issues 32K write size for streaming application while CIFS has been observed to use 64K from clients. If existing streaming asynchronous file rewrite is an important component of your I/O workloads (a rare set of conditions), it might well be that setting the LUN blocksize accordingly will provide a boost to delivered performance.

In summary

The problem with alignment is more generally seen with fixed record oriented application (as for Oracle Database or Microsoft Exchange) with random access pattern and synchronous I/O semantics. It can be caused by partitioning software (fdisk, diskpart) which create disk partitions not aligned with the storage blocks. It can also be caused to a lesser extent by streaming file overwrite when the application write size does not match the file share's blocksize. The Sun Storage 7000 line offers great flexibility in selecting different blocksizes for different use within a single pool of storage. However it has no control on the offset that could be selected during disk partitioning of block devices on client systems. Care must be taken when partitioning disks to avoid misalignment and degraded performance. Using full LUNs is preferred.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Referenced Links :




« février 2017

No bookmarks in folder