Wednesday June 10, 2015

Zero Copy I/O Aggregation

One of my favorite features of ZFS is the I/O aggregation done in the final stage of issuing I/Os to devices. In this article, I explain in more detail what this feature is and how we recently improved it with a new zero copy feature.

It is well known that ZFS is a Copy-on-Write storage technology. That doesn't mean that we constantly copy data from disk to disk. More to the point, it means that when data is modified we store that data in a fresh on-disk location of our own choosing. This is primarily done for data integrity purposes and is managed by the ZFS transaction group (TXG) mechanism that runs every few seconds. But an important side benefit of this freedom given to ZFS is that I/Os, even unrelated I/Os, can be allocated in physical proximity to one another. Cleverly scheduling those I/Os to disk then makes it possible to detect contiguous I/Os and issue a few large ones rather than many small ones.

One consequence of I/O aggregation is that the final I/O sizes used by ZFS during a TXG, as observed by ZFSSA Analytics or iostat(1), depend more on the availability of contiguous on-disk free space than on the individual application write(2) sizes. To a new ZFS user or storage administrator, it can certainly be baffling that 100s of independent 8K writes can end up being serviced by a single disk I/O.

The timeline of an asynchronous write is described like this:

  • Application issues a write(2) of N bytes to a file stored using ZFS records of size R. Initially the data is stored in the ARC cache.

  • ZFS notes the M dirty blocks needing to be issued in the next TXG as follows:
    • If R=128K, a small write(2), say of 10 bytes, means 1 dirty block (of 128K)
    • If R=8K, a single 128K write(2) implies 16 dirty blocks (of 8K)

  • Within the next few seconds multiple dirty blocks get associated with the upcoming TXG.

  • The TXG starts. ZFS gathers all of the dirty blocks and starts I/Os [1].

    • Individual blocks get checksummed and, as necessary, compressed and encrypted. Then and only then, knowing the compressed size and the actual data that needs to be stored on disk, a device is selected and an allocation takes place,

    • The allocation engine finds a chunk in proximity to recent allocations (a future topic of its own),

    • The I/O is maintained by ZFS using 2 structures, one ordered by priority and another ordered by device offset.

  • As soon as there is at least one I/O in these structures, the device level ZIO pipeline gets to work. When a slot is available, the highest priority I/O for that device is selected to be issued.
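The dirty-block counts quoted in the timeline can be checked with a few lines of Python. This is only a sketch: the function name and the explicit file offset are assumptions added for illustration, and it simply counts how many records of size R a write touches.

```python
K = 1024

def dirty_blocks(offset: int, nbytes: int, recordsize: int) -> int:
    """Count the records of size `recordsize` touched by a write of
    `nbytes` bytes starting at byte `offset` in the file."""
    if nbytes <= 0:
        return 0
    first = offset // recordsize
    last = (offset + nbytes - 1) // recordsize
    return last - first + 1

print(dirty_blocks(0, 10, 128 * K))      # R=128K, 10-byte write
print(dirty_blocks(0, 128 * K, 8 * K))   # R=8K, aligned 128K write
```

The two calls print 1 and 16, matching the examples in the list. An unaligned write can touch one more record than its size alone suggests.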

And here is where the magic occurs. With this highest priority I/O in hand, the ZIO pipeline doesn't just issue that I/O to the device. It first checks for other I/Os which could be physically adjacent to this one. It gathers all such I/Os together until hitting our upper limit for disk I/O size. Because of the way this process works, if there are contiguous chunks of free space available on the disk, we're nearly guaranteed that ZFS finds pending I/Os that are adjacent and can be aggregated.

This also explains why one sees regular bursts of large I/Os whose sizes are mostly unrelated to the sizes of writes issued by the applications. And I emphasize that this is totally unrelated to the random or sequential nature of the application workload. Of course, for hard disk drives (HDDs), managing writes this way is very efficient. Therefore, those HDDs are less busy and stay available to service the incoming I/Os that applications are waiting on.

And this brings us to the topic du jour. Until recently, there was a cost to doing this aggregation in the form of a memory copy. We would take the buffers coming from the ZIO pipeline (after compression and encryption) and copy them to a newly allocated aggregated buffer. Thanks to a new Solaris mvector feature, we can now run the ZIO aggregation pipeline without incurring this copy. That, in turn, allows us to boost the maximum aggregation size from 128K up to 1MB for extra efficiency. The aggregation code also limits itself to aggregating 64 buffers together. When working with 8K blocks we can see up to 512K I/Os during a TXG (64 x 8K), and 1MB I/Os with bigger blocks.
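To make the two limits concrete, here is a minimal Python sketch of offset-based aggregation. It is a simplification and not the ZIO code: it sorts all pending I/Os by device offset and merges neighbors, whereas the real pipeline starts from the highest priority I/O and looks for adjacent ones. The class and function names are illustrative; only the 1MB and 64-buffer limits come from the text.

```python
from dataclasses import dataclass

MAX_AGG_BYTES = 1024 * 1024   # 1MB aggregation cap (from the text)
MAX_AGG_BUFFERS = 64          # buffer-count cap (from the text)

@dataclass(frozen=True)
class PendingIO:
    offset: int   # device offset in bytes
    size: int     # I/O size in bytes

def aggregate(pending):
    """Merge physically adjacent I/Os.

    Returns a list of (offset, size, nbuffers) tuples, one per I/O that
    would actually be issued to the device.
    """
    issued = []
    for io in sorted(pending, key=lambda p: p.offset):
        if issued:
            off, size, nbuf = issued[-1]
            adjacent = off + size == io.offset
            fits = nbuf < MAX_AGG_BUFFERS and size + io.size <= MAX_AGG_BYTES
            if adjacent and fits:
                issued[-1] = (off, size + io.size, nbuf + 1)
                continue
        issued.append((io.offset, io.size, 1))
    return issued

K = 1024
small = [PendingIO(i * 8 * K, 8 * K) for i in range(100)]     # 100 contiguous 8K blocks
large = [PendingIO(i * 128 * K, 128 * K) for i in range(16)]  # 16 contiguous 128K blocks
print([size // K for _, size, _ in aggregate(small)])
print([size // K for _, size, _ in aggregate(large)])
```

With 8K blocks the buffer-count limit is hit first, so the largest I/O is 512K; with 128K blocks the size limit is hit first, at 1MB.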

Now, a word about the ZIL. In this article, I focus on the I/Os issued by the TXG, which happens every 5 seconds. In between TXGs, if disk writes are observed, those would have to come from the ZIL. The ZIL also does its own grouping of write requests that hit a given dataset (share, zvol or filesystem). Then, once the ZIL gets to issue an I/O, it uses the same I/O pipeline as just described. Since ZIL I/Os are of high priority, they tend to issue straight away. And because they issue quickly, there are generally not a lot of them around for aggregation. So it is common to have the ZIL I/Os not aggregate much, if at all. However, under a heavy synchronous write load, when the underlying device becomes saturated, a queue of ZIL I/Os forms and they become subject to ZIO level aggregation.

When observing the I/Os issued to a pool with iostat it's nice to keep all this in mind: synchronous writes don't really show up with their own size. The ZIL issues I/O for a set of synchronous writes that may further aggregate under heavy load. Then, with a 5 second regularity, the pool issues I/O for every modified block, usually with large I/Os whose size is unrelated to the application I/O size.

It's a really efficient way to do all this, but it does take some getting used to.

[1] Application write size is not considered during a TXG.

Tuesday April 28, 2015

It is the Dawning of the Age of the L2ARC

One of the most exciting things that have gone into ZFS in recent history has been the overhaul of the L2ARC code. We fundamentally changed the L2ARC such that it would do the following:

  • reduce its own memory footprint,
  • be able to survive reboots,
  • be managed using a better eviction policy,
  • be compressed on SSD,
  • and finally allow feeding at much greater rates than ever achieved before.
Let's review these elements, one by one.

Reduced Footprint

We already saw in this ReARC article that we dropped the amount of core header information from 170 bytes to 80 bytes. This means we can track more than twice as much L2ARC data as before using a given memory footprint. In the past, the L2ARC had trouble building up in size due to its feeding algorithm, but we'll see below that the new code allows us to grow the L2ARC and use up available SSD space in its entirety. So much so that initial testing revealed a problem: for small memory configs with large SSDs, the L2ARC headers could actually end up filling most of the ARC cache and that didn't deliver good performance. So, we had to put in place a memory guard for L2 headers which is currently set to 30% of the ARC. As the ARC grows and shrinks so does the maximum space dedicated to tracking the L2ARC. So, on a system with 1TB of ARC cache, up to 300GB could, if necessary, be devoted to tracking the L2ARC. With the 80-byte headers, this means we could track a whopping 30TB of data assuming an 8K blocksize. If you use a 32K blocksize, currently the largest block size we allow in L2ARC, then that grows up to 120TB of SSD based auto-tiered L2ARC. Of course, if you have a small L2ARC the tracking footprint of the in-core metadata is smaller.
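The arithmetic behind those figures can be sketched in a few lines of Python. The function and constant names are illustrative; the 80-byte header and the 30% guard are the values quoted above. Because 8K is taken here as 8192 bytes rather than a round 8000, the results land slightly above the rounded 30TB and 120TB.

```python
K = 1024
TB = 1024 ** 4

HEADER_BYTES = 80        # in-core header per tracked L2ARC block (from the text)
L2_HEADER_PERCENT = 30   # share of the ARC allowed for L2 headers (from the text)

def trackable_l2arc(arc_bytes: int, block_bytes: int) -> int:
    """Bytes of L2ARC data that can be tracked when the header budget is full."""
    header_budget = arc_bytes * L2_HEADER_PERCENT // 100
    headers = header_budget // HEADER_BYTES
    return headers * block_bytes

for block in (8 * K, 32 * K):
    tracked_tb = trackable_l2arc(1 * TB, block) / TB
    print(f"{block // K}K blocks: about {tracked_tb:.1f} TB trackable")
```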

Persistent Across Reboot

With that much tracked L2ARC space, you would hate to see it washed away on a reboot as the previous code did. Not so anymore: the new L2ARC has an on-disk format that allows it to be reconstructed when a pool is imported. That new format tracks the device space in 8MB segments in which each ZFS block (DVA for the ZFS geeks) consumes 40 bytes of on-SSD space. So reusing the example of an L2ARC made up of only 8K-sized blocks, each 8MB segment could store about 1000 of those blocks, consuming just 40K of on-SSD metadata. The key thing here is that to rebuild the in-core L2ARC space after a reboot, you only need to read back 40K, from the SSD itself, in order to discover and start tracking 8MB worth of data. We found that we could start tracking many TBs of L2ARC within minutes after a reboot. Moreover, we made sure that as segment headers were read in, they would immediately be made available to the system and start to generate L2ARC hits, even before the L2ARC was done importing every segment. I should mention that this L2ARC import is done asynchronously with respect to the pool import and is designed to not slow down pool import or concurrent workloads. Finally, that initial L2ARC import mechanism was made scalable with many import threads per L2ARC device.
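A quick Python sketch of that rebuild cost follows. The names are illustrative, and the model is deliberately simplified: it assumes every segment is filled with blocks of a single size and ignores the space the metadata itself occupies. Only the 8MB segment size and the 40 bytes per block come from the text.

```python
K = 1024
MB = 1024 * K
TB = 1024 ** 4

SEGMENT_BYTES = 8 * MB   # L2ARC segment size (from the text)
ENTRY_BYTES = 40         # on-SSD metadata per tracked block (from the text)

def segment_metadata_bytes(block_bytes: int) -> int:
    """Metadata needed to describe one segment full of same-sized blocks."""
    return (SEGMENT_BYTES // block_bytes) * ENTRY_BYTES

def rebuild_read_bytes(l2arc_bytes: int, block_bytes: int) -> int:
    """Metadata that must be read back to start tracking the whole L2ARC."""
    segments = l2arc_bytes // SEGMENT_BYTES
    return segments * segment_metadata_bytes(block_bytes)

print(segment_metadata_bytes(8 * K) // K, "K of metadata per 8MB segment")
print(rebuild_read_bytes(1 * TB, 8 * K) / 1024 ** 3, "GB read per TB of 8K blocks")
```

Under these assumptions the metadata is roughly 0.5% of the data it describes, which is why many terabytes can be rediscovered quickly.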

Better Eviction

One of the benefits of using an L2ARC segment architecture is that we can now weigh segments individually and use the least valued segment as the eviction candidate. The previous L2ARC would actually manage L2ARC space by using a ring buffer architecture: first-in first-out. It's not a terrible solution for an L2ARC but the new code allows us to work on a weight function to optimise eviction policy. The current algorithm puts segments that are hit, an L2ARC cache hit, at the top of the list such that a segment with no hits gets evicted sooner.
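The difference from a pure ring buffer can be illustrated with a small Python sketch. This is not the L2ARC weight function, just a stand-in that moves a segment to the top of an ordered list on every hit; the class and method names are hypothetical.

```python
from collections import OrderedDict

class SegmentList:
    """Ordered list of segments; the first entry is the next eviction candidate."""

    def __init__(self):
        self._hits = OrderedDict()   # segment id -> hit count

    def add(self, seg_id):
        self._hits[seg_id] = 0       # new segments start at the top

    def hit(self, seg_id):
        self._hits[seg_id] += 1
        self._hits.move_to_end(seg_id)   # a hit sends the segment back to the top

    def evict(self):
        seg_id, _ = self._hits.popitem(last=False)   # take from the bottom
        return seg_id

segments = SegmentList()
for name in ("a", "b", "c"):
    segments.add(name)
segments.hit("a")   # "a" is the oldest segment, but it was just hit
print([segments.evict() for _ in range(3)])
```

A first-in first-out ring buffer would evict "a" first; here the segments with no hits ("b", then "c") go before it.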

Compressed on SSD

Another great new feature delivered is the addition of compressed L2ARC data. The new L2ARC stores data in SSDs the same way it is stored on disk. Compressed datasets are captured in the L2ARC in compressed format which provides additional virtual capacity. We often see a 2:1 compression ratio for databases and that is becoming more and more the standard way to deploy our servers. Compressed data now uses less SSD real estate in the L2ARC: a 1TB device holds 2TB of data if the data compresses 2:1. This benefit helps absorb the extra cost of flash based storage. For the security minded readers, be reassured that the data stored in the persistent L2ARC is stored using the encrypted format.

Scalable Feeding

There is a lot to like about what I just described but what gets me the most excited is the new feeding algorithm. The old one was suboptimal in many ways. It didn't feed well, disrupted the primary ARC, had self-imposed obsolete limits and didn't scale with the number of L2ARC devices. All gone.

Before I dig in, it should be noted that a common misconception about L2ARC feeding is assuming that the process handles data as it gets evicted from L1. In fact the two processes, feeding and evicting, are separate operations and it is sometimes necessary under memory pressure to evict a block before being able to install it in the L2ARC. The new code is much, much better at avoiding such events; it does so by keeping its feed point well ahead of the ARC tail. Under many conditions, when data is evicted from primary ARC it is after the L2ARC has processed it.

The old code also had some self-imposed throughput limit that meant that N x L2ARC devices in one pool, would not be fed at proper throughput. Given the strength of the new feeding algorithm we were able to remove such limits and now feeding scales with number of L2ARC devices in use. We also removed an obsolete constraint in which read I/Os would not be sent to devices as they were fed.

With these in place, if you have enough L2ARC bandwidth in the devices, then there are few constraints in the feeder to prevent actually capturing 100% of eligible L2ARC data [1]. And capturing 100% of data is the key to actually delivering a high L2ARC hit rate in the future. By hitting in L2, of course you delight end users waiting for such reads. More importantly, an L2ARC hit is a disk read I/O that doesn't have to be done. Moreover, that saved HDD read is a random read, one that would have led to a disk seek, the real weakness of HDDs. Therefore, we reduce utilization of the HDDs, which is of paramount importance when some unusual job mix arrives and causes those HDDs to become the resource gating performance: A.K.A. crunch time. With a large L2ARC hit count, you get out of this crunch time quicker and restore proper level of service to your users.


The L2ARC eligibility rules were impacted by the compression feature. The max blocksize considered for eligibility was unchanged at 32K but the check is now done on compressed size if compression is enabled. As before, the idea behind an upper limit on eligible size is two-fold. First, for larger blocks, the latency advantage of flash over spinning media is reduced. Second, the SSD will eventually fill up with data. At that point, any block we insert in the L2ARC requires an equivalent amount of eviction. A single large block can thus cause eviction of a large number of small blocks. Without an upper cap on block size, we can face a situation of inserting a large block for a small gain with a large potential downside if many small evicted blocks become the subject of future hits. To paraphrase Yogi Berra: "Caching decisions are hard." [2]

The second important eligibility criterion is that blocks must not have been read through prefetching. The idea is fairly simple. Prefetching applies to sequential workloads and for such workloads, flash storage offers little advantage over HDDs. This means that data that comes in through ZFS level prefetching is not eligible for L2ARC.

These criteria leave 2 pitfalls to avoid during an L2ARC demo: first, configuring all datasets with 128K recordsize, and second, trying to prime the L2ARC using dd-like sequential workloads. Both of those are, by design, workloads that bypass the L2ARC. The L2ARC is designed to help you with disk crunching real workloads, which are those that access small blocks of data in random order.
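The two eligibility criteria can be written down as a small predicate. This Python sketch only encodes the rules as stated above (32K cap checked on compressed size when compression is enabled, and no prefetched blocks); the function and parameter names are hypothetical and other conditions the real code may check are not modeled.

```python
K = 1024
MAX_ELIGIBLE_BYTES = 32 * K   # upper limit on eligible block size (from the text)

def l2arc_eligible(logical_size, compressed_size=None, prefetched=False):
    """Return True if a block may be fed to the L2ARC under the two rules above.

    `compressed_size` is None when compression is not enabled for the block.
    """
    if prefetched:
        return False              # came in through ZFS level prefetching
    size = logical_size if compressed_size is None else compressed_size
    return size <= MAX_ELIGIBLE_BYTES

print(l2arc_eligible(8 * K))                             # small random-read block
print(l2arc_eligible(128 * K))                           # 128K record, uncompressed
print(l2arc_eligible(64 * K, compressed_size=30 * K))    # compresses under the cap
print(l2arc_eligible(8 * K, prefetched=True))            # dd-like sequential read
```

The four calls print True, False, True and False, which lines up with the two demo pitfalls described above.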

Conclusion : A Better HSP

In this context, the Hybrid Storage Pool (HSP) model refers to our ZFSSA architecture where data is managed in 3 tiers:

  1. a high capacity TB scale super fast RAM cache;
  2. a PB scale pool of hard disks with RAID protection;
  3. a channel of SSD-based cache devices that automatically capture an interesting subset of the data.
And since the data is captured in the L2ARC device only after it has been stored in the main storage pool, those L2ARC SSDs do not need to be managed by RAID protection. A single copy of the data is kept in the L2ARC knowing that if any L2ARC device disappears, data is guaranteed to be present in the main pool. Compared to a mirrored all-flash storage solution, this ZFSSA auto-tiering HSP means that you get 2X the bang for your SSD dollar by avoiding mirroring of SSDs and with ZFS compression that becomes easily 4X or more. This great performance comes along with the simplicity of storing all of your data, hot, warm or cold, into this incredibly versatile high performance and cost effective ZFS based storage pool.

[1] It should be noted that ZFSSA tracks L2ARC eviction as "Cache: ARC evicted bytes per second broken down by L2ARC state", with subcategories of "cached," "uncached ineligible," and "uncached eligible." Having this last one at 0 implies a perfect L2ARC capture.

[2] For non-Americans, this famous baseball coach is quoted as having said, "It's tough to make predictions, especially about the future."

Friday February 20, 2015

ZIL Pipelining

The third topic on my list of improvements since 2010 is ZIL pipelining :
		Allow the ZIL to carve up smaller units of
		work for better pipelining and higher log device 
So let's remind ourselves of a few things about the ZIL and why it's so critical to ZFS. The ZIL stands for ZFS Intent Log and exists in order to speed up synchronous operations such as an O_DSYNC write or fsync(3C) calls. Since most database operations involve synchronous writes it's easy to understand that having good ZIL performance is critical in many environments.

It is well understood that a ZFS pool updates its global on-disk state at a set interval (5 seconds these days). The ZIL is actually what keeps information in between those transaction groups (TXGs). The ZIL records what is committed to stable storage from a user's point of view. Basically the last committed TXG + replay of the ZIL is the valid storage state from a user's perspective.

The on-disk ZIL is a linked list of records which is actually only useful in the event of a power outage or system crash. As part of a pool import, the on-disk ZIL is read and operations replayed such that the ZFS pool contains the exact information that had been committed before the disruption.

While we often think of the ZIL as its on-disk representation (its committed state), the ZIL is also an in-memory representation of every POSIX operation that needs to modify data. For example, a file creation, even if that is an asynchronous operation, needs to be tracked by the ZIL. This is because any asynchronous operation may, at any point in time, need to be committed to disk; this is often due to an fsync(3C) call. At that moment, every pending operation on a given file needs to be packaged up and committed to the on-disk ZIL.

Where is the on-disk ZIL stored ?

Well that's also more complex than it sounds. ZFS manages devices specifically geared to store ZIL blocks; those separate slog devices or slogs are very often flash SSDs. However the ZIL is not constrained to only using blocks from slog devices; it can store data on main (non-slog) pool devices. When storing ZIL information into the non-slog pool devices, the ZIL has a choice of recording data inside zil blocks or recording full file records inside pool blocks and storing a reference to it inside the ZIL. This last method for storing ZIL blocks has the benefit of offloading work from the upcoming TXG sync at the expense of higher latency since the ZIL I/Os are being sent to rotating disks. This mode is the one used with logbias=throughput. More on that below.

Net net: the ZIL records data in stable storage in a linked list and user applications have synchronization points at which they choose to wait on the ZIL to complete its operation.

When things are not stressed, operations show up at the ZIL, wait a little bit while the ZIL does its work, and are then released. Latency of the ZIL is then coherent with the underlying device used to capture the information. In this rosy picture we would not have done this train project.

At times though, the system can get stressed. The older mode of operation of the ZIL was to issue a ZIL transaction (implemented by ZFS function zil_commit_writer) and while that was going on, build up the next ZIL transaction with everything that showed up at the door. Under stress when a first operation would be serviced with a high latency, the next transaction would accumulate many operations, growing in size thus leading to a longer latency transaction and this would spiral out of control. The system would automatically divide into 2 ad-hoc sets of users; a set of operations which would commit together as a group, while all other threads in the system would form the next ZIL transaction and vice-versa.

This led to bursty activity on the ZIL devices, which meant that, at times, they would go unused even though they were the critical resource. This 'convoy' effect also meant disruption of servers because when those large ZIL transactions do complete, 100s or 1000s of user threads might see their synchronous operation complete and all would end up flagged as 'runnable' at the same time. Often those would want to consume the same resource, run on the same CPU, or use the same lock etc. This led to thundering herds, a source of system inefficiency.

Thanks to the ZIL train project, we now have the ability to break down convoys into smaller units and dispatch them into smaller ZIL level transactions which are then pipelined through the entire data center.

With logbias set to throughput, the new code attempts to group ZIL transactions in sets of approximately 40 operations, which is a compromise between efficient use of the ZIL and reduction of the convoy effect. For other types of synchronous operations we group them into sets representing about 32K of data to sync. That means that a single sufficiently large operation may run by itself but more threads will group together if their individual commit sizes are small.
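As a rough illustration of that grouping, here is a Python sketch that carves a queue of pending operations into smaller units. The function name and the exact cut-off rule are assumptions; only the two targets (about 40 operations in throughput mode, about 32K of data otherwise) come from the text.

```python
K = 1024

def carve_trains(op_sizes, throughput_mode, max_ops=40, max_bytes=32 * K):
    """Split pending operations (given by their sizes in bytes) into groups.

    Throughput mode groups by operation count; otherwise groups are closed
    once adding the next operation would exceed `max_bytes`.
    """
    trains, current, current_bytes = [], [], 0
    for size in op_sizes:
        if throughput_mode:
            full = len(current) >= max_ops
        else:
            # A non-empty group is closed when the next op would overflow it,
            # so a single large operation ends up running by itself.
            full = bool(current) and current_bytes + size > max_bytes
        if full:
            trains.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        trains.append(current)
    return trains

print([len(t) for t in carve_trains([8 * K] * 100, throughput_mode=True)])
print([len(t) for t in carve_trains([8 * K] * 10, throughput_mode=False)])
print([len(t) for t in carve_trains([128 * K, 1 * K, 1 * K], throughput_mode=False)])
```

The outputs are [40, 40, 20], [4, 4, 2] and [1, 2]: many small commits share a group, while the 128K operation travels alone.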

The ZIL train is thus expected to handle bursts of synchronous activity with a lot less stress on the system.


As we just saw, the ZIL provides 2 modes of operation: the throughput mode and the default latency mode. The throughput mode is named as such not so much because it favors throughput but more so because it doesn't care too much about individual operation latency. The implied corollary of throughput friendly workloads is that they are very highly concurrent (100s or 1000s of independent operations) and therefore are able to get to high throughput even when served at high latency. The goal of providing a ZIL throughput mode is to actually free up slog devices from having to handle such highly concurrent workloads and allow those slog devices to concentrate on serving other low-concurrency, but highly sensitive to latency operations.

For Oracle DB, we therefore recommend the use of logbias set to throughput for DB files which are subject to highly concurrent DB writer operations while we recommend the use of the default latency mode for handling other latency sensitive files such as the redo log. This separation is particularly important when redo log latency is very critical and when the slog device is itself subject to stress.

When using Oracle 12c with dnfs and OISP, this best practice is automatically put into place. In addition to proper logbias handling, DB data files are created with a ZFS recordsize matching the established best practice : ZFS recordsize matching DB blocksize for datafiles; ZFS recordsize of 128K for redo log.

When setting up a DB, with or without OISP, there is one thing that Storage Administrators must enforce : they must segregate redo log files into their own filesystems (also known as shares or datasets). The reason for this is that the ZIL is a single linked list of transactions maintained by each filesystem (other filesystems run their own ZIL independently). And while the ZIL train allows for multiple transactions to be in flight concurrently, there is a strong requirement for completion of the transactions and notification of waiters to be handled in order. If one were to mix data files and redo log files in the same ZIL, then some redo transactions would be linked behind some DB writer transactions. Those critical redo transactions committing in latency mode to a slog device would see their I/O complete quickly (100us timescale) but nevertheless have to wait for an antecedent DB writer transaction committing in throughput mode to a regular spinning disk device (ms timescale). In order to avoid this situation, one must ensure that redo log files are stored in their own shares.
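The effect of in-order notification can be shown with a tiny Python model. It assumes all transactions in a ZIL are issued at time 0 and that a waiter is released only once its own transaction and every earlier one in the same ZIL have completed; the function name and the 5 ms and 100 us figures are illustrative.

```python
from itertools import accumulate

def notification_times(io_done_us):
    """Given I/O completion times (in microseconds) of the transactions of one
    ZIL, in list order, return when each transaction's waiters are notified."""
    # In-order notification means each entry waits for the slowest predecessor.
    return list(accumulate(io_done_us, max))

# Shared filesystem: a DB writer transaction (5 ms to spinning disk) is linked
# ahead of a redo transaction (100 us to the slog).
print(notification_times([5000, 100]))   # [5000, 5000]

# Redo log in its own filesystem: its ZIL only holds redo transactions.
print(notification_times([100]))         # [100]
```

In the shared case the redo commit is held for the full 5 ms even though its own I/O finished in 100 us.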

Let me stop here, I have a train to catch...

Wednesday January 21, 2015

Sequential Resilvering

In the initial days of ZFS some pointed out that ZFS resilvering was metadata driven and was therefore super fast : after all we only had to resilver data that was in-use compared to traditional storage that has to resilver an entire disk even if there is no actual data stored. And indeed on newly created pools ZFS was super fast for resilvering.

But of course storage pools rarely stay empty. So what happened when pools grew to store large quantities of data ? Well we basically had to resilver most blocks present on a failed disk. So the advantage of only resilvering what is actually present is not much of an advantage, in real life, for ZFS.

And while ZFS based storage grew in importance, so did disk sizes. The disk sizes that people put in production are growing very fast showing the appetite of customers to store vast quantities of data. This is happening despite the fact that those disks are not delivering significantly more IOPS than their ancestors. As time goes by, a trend that has lasted forever, we have fewer and fewer IOPS available to service a given unit of data. Here ZFSSA storage arrays with TB class caches are certainly helping the trend. Disk IOPS don't matter as much as before because all of the hot data is cached inside ZFS. So customers gladly trade off IOPS for capacity given that ZFSSA delivers tons of cached IOPS and ultra cheap GB of storage.

And then comes resilvering...

So when a disk goes bad, one has to resilver all of the data on it. It is assured at that point that we will be accessing all of the data from surviving disks in the raid group and that this is not a highly cached set. And here was the rub with old style ZFS resilvering : the metadata driven algorithm was actually generating small random IOPS. The old algorithm was actually going through all of the blocks file by file, snapshot by snapshot. When it found an element to resilver, it would issue the IOPS necessary for that operation. Because of the nature of ZFS, the populating of those blocks didn't lead to a sequential workload on the resilvering disks.

So in a worst case scenario, we would have to issue small random IOPS covering 100% of what was stored on the failed disk and issue small random writes to the new disk coming in as a replacement. With big disks and very low IOPS rating comes ugly resilvering times. That effect was also compounded by a voluntary design balance that was strongly biased to protect application load. The compounded effect was month long resilvering.

The Solution

To solve this, we designed a subtly modified version of resilvering. We split the algorithm into two phases: the populating phase and the iterating phase. The populating phase is mostly unchanged over the previous algorithm except that, when encountering a block to resilver, instead of issuing the small random IOPS, we generate a new on-disk log of them. After having iterated through all of the metadata and discovered all of the elements that need to be resilvered, we can now sort these blocks by physical disk offset and issue the I/O in ascending order. This in turn allows the ZIO subsystem to aggregate adjacent I/O more efficiently, leading to fewer, larger I/Os issued to the disk. And by virtue of issuing I/Os in physical order, it allows the disk to serve these IOPS at the streaming limit of the disks (say 100MB/sec) rather than being IOPS limited (say 200 IOPS).
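A back-of-the-envelope Python sketch shows why the ordering matters so much. The 200 IOPS and 100MB/sec figures are the ones quoted above; the 4TB of data and the 8K block size are assumptions chosen only for illustration, and the function names are hypothetical.

```python
DAY = 86400.0

def issue_order(log):
    """Second phase: sort the logged (offset, size) entries by disk offset."""
    return sorted(log)

def random_resilver_seconds(data_bytes, block_bytes, iops):
    """Old style: one small random I/O per block, limited by disk IOPS."""
    return data_bytes / block_bytes / iops

def sequential_resilver_seconds(data_bytes, bytes_per_sec):
    """New style: sorted I/O served at the disk's streaming rate."""
    return data_bytes / bytes_per_sec

DATA = 4 * 10 ** 12   # assumed: 4TB of data to reconstruct
BLOCK = 8 * 1024      # assumed: 8K blocks

print(issue_order([(4096, 8192), (0, 4096), (12288, 8192)]))
print(f"random:     {random_resilver_seconds(DATA, BLOCK, 200) / DAY:.1f} days")
print(f"sequential: {sequential_resilver_seconds(DATA, 100 * 10 ** 6) / 3600:.1f} hours")
```

Under these assumptions the IOPS-bound case takes about four weeks against roughly half a day at streaming rate, before any throttling in favor of application load.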

So we hold a strategy that allows us to resilver nearly as fast as physically possible by the given disk hardware. With that newly acquired capability of ZFS, comes the requirement to service application load with a limited impact from resilvering. We therefore have some mechanism to limit resilvering load in the presence of application load. Our stated goal is to be able to run through resilvering at 1TB/day (1TB of data reconstructed on the replacing drive) even in the face of an active workload.

As disks are getting bigger and bigger, all storage vendors will see increasing resilvering times. The good news is that, since Solaris 11.2 and ZFSSA since 2013.1.2, ZFS is now able to run resilvering with much of the same disk throughput limits as the rest of non-ZFS based storage.

The sequential resilvering performance on a RAIDZ pool is particularly noticeable to this happy Solaris 11.2 customer, who said: "It is really good to see the new feature work so well in practice."

Tuesday December 02, 2014


The initial topic from my list is reARC. This is a major rearchitecture of the code that manages the ZFS in-memory cache along with its interface to the DMU. The ARC is of course a key enabler of ZFS high performance. As the scale of systems grows in memory size, CPU count and frequency, some major changes were required to the ARC to keep up with the pace. reARC is such a major body of work, I can only talk about a few aspects of the Wonders of ZFS Storage here.

In this article, I describe how the reARC project had an impact on at least these 7 important aspects of its operation:
  • Managing metadata
  • Handling ARC accesses to cloned buffers
  • Scalability of cached and uncached IOPS
  • Steadier ARC size under steady state workloads
  • Improved robustness for a more reliable code
  • Reduction of the L2ARC memory footprint
  • Finally, a solution to the long standing issue of I/O priority inversion
The diversity of topics covered serves as a great illustration of the incredible work handled by the ARC and a testament to the importance of ARC operations to all other ZFS subsystems. I'm truly amazed at how a single project was able to deliver all this goodness in one swoop.

No Meta Limits

Previously, the ARC claimed to use a two-state model:
  • "most recently used" (MRU)
  • "most frequently used" (MFU)
But it further subdivided these states into data and metadata lists.

That model, using 4 main memory lists, created a problem for ZFS. The ARC algorithm gave us only 1 target size for each of the 2 MRU and MFU states. The fact that we had 2 lists (data and metadata) but only 1 target size for the aggregate meant that when we needed to adjust the list down, we just didn't have the necessary information to perform the shrink. This led to the presence of an ugly tunable arc_meta_limit, which was impossible to set properly and was a source of problems for customers.

This problem raises an interesting point and a pet peeve of mine. Many people I've interacted with over the years defended the position that metadata was worth special protection in a cache. After all, metadata is necessary to get to data, so it has intrinsically higher value and should be kept around more. The argument is certainly sensible on the surface, but I was on the fence about it.

ZFS manages every access through a least recently used scheme (LRU). New access to some block, data or metadata, puts that block back to the head of the LRU list, very much protected from eviction, which happens at the tail of the list.

When considering special protection for metadata, I've always stumbled on this question:

If some buffer, be it data or metadata, has not seen any accesses for a sufficient amount of time, such that the block is now at the tail of an eviction list, what is the argument that says that I should protect that block based on its state ?

I came up blank on that question. If it hasn't been used, it can be evicted, period. Furthermore, even after taking this stance, I was made aware of an interesting fact about ZFS. Indirect blocks, the blocks that hold a set of block pointers to the actual data, are non-evictable inasmuch as any of the block pointers they reference are currently in the ARC. In other words, if some data is in cache, its metadata is also in the cache and furthermore, is non-evictable. This fact really reinforced my position that in our LRU cache handling, metadata doesn't need special protection from eviction.

And so, the reARC project actually took the same path. No more separation of data and metadata and no more special protection. This improvement led to fewer lists to manage and simpler code, such as shorter lock hold times for eviction. If you are tuning arc_meta_limit for legacy reasons, I advise you to try without this special tuning. It might be hurting you today and should be considered obsolete.

Single Copy Arc: Dedup of Memory

Yet another truly amazing capability of ZFS is its infinite snapshot capabilities. There are just no limits, other than hardware, to the number of (software) snapshots that you can have.

What is magical here is not so much that ZFS can manage a large number of snapshots, but that it can do so without reference counting the blocks that are referenced through a snapshot. You might need to read that sentence again ... and check the blog entry.

Now fast forward to today where there is something new for the ARC. While we've always had the ability to read a block referenced from the N-different snapshots (or clones), the old ARC actually had to manage separate in-memory copies of each block. If the accesses were all reads, we'd needlessly instantiate the same data multiple times in memory.

With the reARC project and the new DMU to ARC interfaces, we don't have to keep multiple data copies. Multiple clones of the same data share the same buffers for read accesses and new copies are only created for a write access. It has not escaped our notice that this N-way pairing has immense consequences for virtualization technologies. The use of ZFS clones (or writable snapshots) is just a great way to deploy a large number of virtual machines. ZFS has always been able to store N clone copies with zero incremental storage costs. But reARC is taking this one step further. As VMs are used, the in-memory caches that are used to manage multiple VMs no longer need to inflate, allowing the space savings to be used to cache other data. This improvement allows Oracle to boast the amazing technology demonstration of booting 16000 VMs simultaneously.

Improved Scalability of Cached and Uncached OPs

The entire MRU/MFU list insert and eviction processes have been redesigned. One of the main functions of the ARC is to keep track of accesses, such that most recently used data is moved to the head of the list and the least recently used buffers make their way towards the tail, and are eventually evicted. The new design allows for eviction to be performed using a separate set of locks from the set that is used for insertion, thus delivering greater scalability. Moreover, through a very clever algorithm, we're able to move buffers from the middle of a list to the head without acquiring the eviction lock.

These changes were very important in removing long pauses in ARC operations that hampered the previous implementation. Finally, the main hash table was modified to use more locks placed on separate cache lines, improving the scalability of the ARC operations. This led to a boost in the maximum cached and uncached IOPs capabilities of the ARC.

Steadier Size, Smaller Shrinks

The growth and shrink model of the ARC was also revisited. The new model grows the ARC less aggressively when approaching memory pressure and instead recycles buffers earlier on. This recycling leads to a steadier ARC size and fewer disruptive shrink cycles. If the changing environment nevertheless requires the ARC to shrink, the amount by which we shrink each time is reduced to make each shrink cycle less stressful. Along with the reorganization of the ARC list locking, this has led to a much steadier, more dependable ARC at high loads.

ARC Access Hardening

A new ARC reference mechanism was created that allows the DMU to signify read or write intent to the ARC. This, in turn, enables more checks to be performed by the code, therefore catching bugs earlier in the process. A better separation of function between the DMU and the ARC is critical for ZFS robustness or hardening. In the new reARC mode of operation, the ARC now actually has the freedom to relocate kernel buffers in memory in between DMU accesses to a cached buffer. This new feature proves invaluable as we scale to large memory systems.

L2ARC Memory Footprint Reduction

Historically, buffers were tracked in the L2ARC (the SSD based secondary ARC) using the same structure that was used by the primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut down this amount by more than 2X to a bare minimum that now only requires about 80 bytes of metadata per L2 buffer. With the arrival of larger SSDs for L2ARC and a better feeding algorithm, this reduced L2ARC footprint is a very significant change for the Hybrid Storage Pool (HSP) storage model.
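To get a feel for what the per-buffer saving means, here is a rough back-of-the-envelope sketch. The 170 and 80 byte figures come from the paragraph above; the 1 TiB device size and the 8 KiB buffer size are assumptions picked purely for illustration, and the helper name is made up.

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def l2arc_header_memory(l2arc_bytes, buffer_bytes, header_bytes):
    """RAM (in bytes) needed to track every buffer held in an L2ARC device."""
    buffers = l2arc_bytes // buffer_bytes
    return buffers * header_bytes


# Assumed example: 1 TiB of L2ARC completely filled with 8 KiB buffers.
old_footprint = l2arc_header_memory(TIB, 8 * KIB, 170)  # pre-reARC header size
new_footprint = l2arc_header_memory(TIB, 8 * KIB, 80)   # reARC header size

print(f"old: {old_footprint / GIB:.2f} GiB of RAM")
print(f"new: {new_footprint / GIB:.2f} GiB of RAM")
```

Under these assumptions the tracking overhead drops from a bit over 21 GiB to 10 GiB of RAM, which is memory handed back to the primary cache.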

I/O Priority Inversion

One nagging behavior of the old ARC and ZIO pipeline was the so-called I/O priority inversion. This behavior was present mostly for prefetching I/Os, which were handled by the ZIO pipeline at a lower priority than, for example, a regular read issued by an application. Before reARC, if a read of the data arrived while the I/O prefetch for it was still pending, that read would block waiting on the completion of the low priority I/O prefetch.

While it sounds simple enough to just boost the priority of the in-flight I/O prefetch, the ARC/ZIO code was structured in such a way that this turned out to be much trickier than it sounds. In the end, the reARC project and subsequent I/O restructuring changes put us on the right path regarding this particular quirk. Fixing the I/O priority inversion meant that fairness between different types of I/O was restored.


The key points that we saw in reARC are as follows:
  1. Metadata doesn't need special protection from eviction; arc_meta_limit has become an obsolete tunable.
  2. Multiple clones of the same data share the same buffers for great performance in a virtualization environment.
  3. We boosted ARC scalability for cached and uncached IOPs.
  4. The ARC size is now steadier and more dependable.
  5. Protection from creeping memory bugs is better.
  6. L2ARC uses a smaller footprint.
  7. I/Os are handled with more fairness in the presence of prefetches.
All of these improvements are available to customers of Oracle's ZFS Storage Appliances in any AK-2013 releases and recent Solaris 11 releases. And this is just topic number one. Stay tuned as we go about describing further improvements we're making to ZFS.

ZFS Performance boosts since 2010

Well, look who's back! After years of relative silence, I'd like to put my blogging hat back on and update my patient readership about the significant ZFS technological improvements that have integrated since Sun and ZFS became Oracle brands. Since there is so much to cover, I tee up this series of articles with a short description of 9 major performance topics that have evolved significantly in the last years. Later, I will describe each topic in more detail in individual blog entries. Of course, these selected advancements represent nowhere near an exhaustive list. There have been over 650 changes to the ZFS code in the last 4 years. My personal performance bias has selected topics that I know best. The designated topics are:
  1. reARC: Scales the ZFS cache to TB class machines and CPU counts in thousands.
  2. Sequential Resilvering: Converts a random workload to a sequential one.
  3. ZIL Pipelining: Allows the ZIL to carve up smaller units of work for better pipelining and higher log device utilisation.
  4. It is the dawning of the age of the L2ARC: Not only did we make the L2ARC persistent on reboot, we made the feeding process so much more efficient we had to slow it down.
  5. Zero Copy I/O Aggregation: A new tool delivered by the Virtual Memory team allows the already incredible ZFS I/O aggregation feature to actually do its thing using one less copy.
  6. Scalable Reader/Writer locks: Reader/Writer locks, used extensively by ZFS and Solaris, had their scalability greatly improved on large systems.
  7. New thread Scheduling class: ZFS transaction groups are now managed by a new type of taskqs which behave better managing bursts of cpu activity.
  8. Concurrent Metaslab Syncing: The task of syncing metaslabs is now handled with more concurrency, boosting ZFS write throughput capabilities.
  9. Block Picking: The task of choosing blocks for allocations has been enhanced in a number of ways, allowing us to work more efficiently at a much higher pool capacity percentage.
There you have it. I'm looking forward to reinvigorating my blog so stay tuned.

Tuesday Feb 28, 2012

Sun ZFS Storage Appliance : can do blocks, can do files too!

Last October, we demonstrated storage leadership in block protocols with our stellar SPC-1 result showcasing our top of the line Sun ZFS Storage 7420.

As a benchmark SPC-1's profile is close to what a fixed block size DB would actually be doing. See Fast Safe Cheap : Pick 3 for more details on that result. Here, for an encore, we're showing today how the ZFS Storage appliance can perform in a totally different environment : generic NFS file serving.

We're announcing that the Sun ZFS Storage 7320 reached 134,140 SPECsfs2008_nfs.v3 Ops/sec with 1.51 ms ORT (overall response time) running the SPEC SFS 2008 benchmark.

Does price performance matter ? It does, doesn't it ? See what Darius has to say about how we compare to NetApp : Oracle posts Spec SFS.

This is one step further in the direction of bringing to our customers true high performance unified storage capable of handling blocks and files on the same physical media. It's worth noting that provisioning of space between the different protocols is entirely software based and fully dynamic, that every stored element is fully checksummed, that all stored data can be compressed with a number of different algorithms (including gzip), and that both filesystems and block based luns can be snapshot and cloned at their own granularity. All these manageability features are available to you in this high performance storage package.

Way to go ZFS !

SPEC and SPECsfs are registered trademarks of Standard Performance Evaluation Corporation (SPEC). Results as of February 22, 2012, for more information see

Monday Oct 03, 2011

Fast, Safe, Cheap : Pick 3

Today, we're making performance headlines with Oracle's ZFS Storage Appliance.

SPC-1 : Twice the performance of NetApp at the same latency; Half the $/IOPS;

I'm proud to say that yours truly, along with a lot of great teammates at Oracle, is not totally foreign to this milestone.

We are announcing that Oracle's 7420C cluster achieved 137000 SPC-1 IOPS with an average latency of less than 10 ms. That is double the result of NetApp's 3270A while delivering the same latency. As compared to the NetApp 3270 result, this is a 2.5x improvement in $/SPC-1-IOPS ($2.99/IOPS vs $7.48/IOPS). We're also showing that when the ZFS Storage Appliance runs at the rate posted by the 3270A (68034 SPC-1 IOPS), our latency of 3.26ms is almost 3X lower than theirs (9.16ms). Moreover, our result was obtained with 23700 GB of user level capacity (internally mirrored) for $17.3/GB while NetApp's, even using a space saving raid scheme, can only deliver $23.5/GB. This is the price per GB of application data actually used in the benchmark. On top of that, the 7420C still had 40% of space headroom whereas the 3270A was left with only 10% of free blocks.

These great results were at least partly made possible by the availability of 15K RPM Hard Disk Drives (HDD). Those are great for running the most demanding databases because they combine a large IOPS capability with a generally smaller capacity. Their IOPS/GB ratio makes them ideal for storing the high intensity databases modeled by SPC-1. On top of that, this concerted engineering effort led to improved software, and not just for those running on 15K RPM drives. We actually used this benchmark to seek out how to increase the quality of our products. During the preparation runs, after an initial diagnosis of some issue, we were committed to finding solutions that were not targeting the idiosyncrasies of SPC-1 but were based on sound design decisions. So instead of changing the default value of some internal parameter to a new static default, we actually changed the way the parameter worked so that our storage systems of all types and sizes would benefit.

So not only are we getting a great SPC-1 result, but all existing customers will benefit from this effort even if they are operating outside of the intense conditions created by the benchmark.
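The ratios quoted above can be re-derived directly from the published figures. The small sketch below only uses numbers stated in this post; the variable names are mine.

```python
# Figures as quoted in the post (SPC-1 results as of October 3, 2011).
oracle_iops, netapp_iops = 137000, 68034
oracle_usd_per_iops, netapp_usd_per_iops = 2.99, 7.48
oracle_latency_ms, netapp_latency_ms = 3.26, 9.16   # both measured at 68034 SPC-1 IOPS
oracle_usd_per_gb, netapp_usd_per_gb = 17.3, 23.5

throughput_ratio = oracle_iops / netapp_iops
price_ratio = netapp_usd_per_iops / oracle_usd_per_iops
latency_ratio = netapp_latency_ms / oracle_latency_ms
gb_saving = 1 - oracle_usd_per_gb / netapp_usd_per_gb

print(f"throughput: {throughput_ratio:.2f}x")
print(f"$/SPC-1 IOPS: {price_ratio:.2f}x cheaper")
print(f"latency: {latency_ratio:.2f}x lower")
print(f"$/GB: {gb_saving:.0%} cheaper")
```

These work out to roughly 2.0x the throughput, 2.5x lower cost per SPC-1 IOPS, 2.8x lower latency and about 26% lower cost per user GB.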

So what is SPC-1 ? It is one of the few benchmarks which counts for storage. It is maintained by the Storage Performance Council (SPC). SPC-1 simulates multiple databases running on a centralized storage or storage cluster. But even if SPC-1 is a block based benchmark, within the ZFS Storage appliance, a block based FC or iSCSI volume is handled very much the same way as would be a large file subject to synchronous operation. And by combining modern network technologies (Infiniband or 10GbE Ethernet), the CPU power packed in the 7420C storage controllers and Oracle's custom dNFS technology for databases, one can truly achieve very high database transaction rates on top of the more manageable and flexible file based protocols.

The benchmark defines three Application Storage Units (ASU): ASU1 with a heavy 8KB block read/write component, ASU2 with a much lighter 8KB block read/write component, and ASU3 which is subject to hundreds of write streams. As such it is not too far from a simulation of running hundreds of Oracle databases on a single system : ASU1 and ASU2 for datafiles and ASU3 for redolog storage.

The total size of the ASUs is constrained such that all of the stored data (including mirror protection and disk used for spares) must exceed 55% of all configured storage. The benchmark team is then free to decide how much total storage to configure. From that figure, 10% is given to ASU3 (redo log space) and the rest is divided equally between the heavily used ASU1 and the lightly used ASU2.

The benchmark team also has to select the SPC-1 IOPS throughput level it wishes to run. This is not a light decision given that you want to balance high IOPS, low latency and $/user GB.

Once the target IOPS rate is selected, there are multiple criteria needed to pass a successful audit; one of the most critical is that you have to run at the specified IOPS rate for a whole 8 hours. Note that the previous specification of the benchmark used by NetApp called for a 4 hour run. During that 8 hour run delivering a solid 137000 SPC-1 IOPS, the average latency must be less than 30ms (we did much better than that).
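As a quick illustration of the split just described, here is a minimal sketch. The function name and the 1000 GB total are invented for the example; only the 10% / equal-halves rule comes from the text.

```python
def asu_split(total_asu_gb):
    """Split total ASU capacity: 10% to ASU3, the rest shared equally by ASU1 and ASU2."""
    asu3 = total_asu_gb * 10 // 100
    asu1 = asu2 = (total_asu_gb - asu3) // 2
    return asu1, asu2, asu3


# Hypothetical 1000 GB of total ASU capacity.
asu1, asu2, asu3 = asu_split(1000)
print(asu1, asu2, asu3)  # 450 450 100
```

In other words the layout is always 45% / 45% / 10%, whatever total the benchmark team decides to configure.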

After this brutal 8 hour run, the benchmark then enters another critical phase: the workload is restarted (using a new randomly selected working set) and performance is measured for a 10 minute period. It is this 10 minute period that decides the official latency of the run.

When everything is said and done, you pull the trigger, go to sleep and wake up to the result. As you could guess we were ecstatic that morning. Before that glorious day, for lack of a stronger word, a lot of hard work had been done during the extensive preparation runs. With little time, and normally not all of the hardware, one goes through a series of runs at incremental loads, making educated guesses as to how to improve the result. As you get more hardware you scale up the result, tweaking things more or less until the final hour.

SPC-1, with its requirement of less than 45% of unused space, is designed to trigger many disk level random read IOPS. Despite this inherent random pattern of the workload, we saw that our extensive caching architecture was as helpful for this benchmark as it is in real production workloads. While a 15K RPM HDD normally levels off with random operations at a rate slightly above 300 IOPS, our 7420C, as a whole, could deliver almost 500 user-level SPC-1 IOPS per HDD.

In the end one of the most satisfying aspects was to see that the data being managed by ZFS was stored rock solid on disk and properly checksummed, that all data could be snapshot and compressed on demand, and that the system delivered impressively steady performance.

2X the absolute performance, 2.5X cheaper per SPC-1 IOPS, almost 3X lower latency, 30% cheaper per user GB with room to grow... So, if you have a storage decision coming and you need FAST, SAFE, CHEAP : pick 3, take a fresh look at the ZFS Storage appliance.

SPC-1, SPC-1 IOPS, $/SPC-1 IOPS reg tm of Storage Performance Council (SPC). More info Sun ZFS Storage 7420 Appliance and

Oracle Sun ZFS Storage Appliance 7420, as of October 3, 2011; NetApp FAS3270A, as of October 3, 2011.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Wednesday May 26, 2010

Let's talk about Lun Alignment for 3 minutes

Recall that I had Lun alignment on my mind a few weeks ago. Nothing special about the ZFS storage appliance over any other storage. Pay attention to how you partition your luns, it can have a great impact on performance. Right Roch ? :

Thursday Mar 11, 2010

Dedup Performance Considerations

One of the major milestones for the ZFS Storage appliance with 2010/Q1 is the ability to dedup data on disk. The open question is then : What performance characteristics can we expect to see from Dedup ? As Jeff says, this is the ultimate gaming ground for benchmarks. But let's have a look at the fundamentals.

ZFS Dedup Basics

Dedup code is simplistically a large hash table (the DDT). It uses a 256 bit (32 Bytes) checksum along with other metadata to identify data content. On a hash match, we only need to increase a reference count, instead of writing out duplicate data. The dedup code is integrated in the I/O pipeline and is done on the fly as part of the ZFS transaction group (see Dynamics of ZFS, The New ZFS Write Throttle ). A ZFS zpool typically holds a number of datasets : either block level LUNs which are based on ZVOLs, or NFS and CIFS file shares based on ZFS filesystems. So while the dedup table is a construct associated with an individual zpool, enabling the deduplication feature is something controlled at the dataset level. Enabling the dedup feature on a dataset has no impact on existing data, which stays outside of the dedup table. However any new data stored in the dataset will then be subject to the dedup code. To actually have existing data become part of the dedup table one can run a variant of "zfs send | zfs recv" on the datasets.

Dedup works on a ZFS block or record level. For an iSCSI or FC LUN, i.e. objects backed by ZVOL datasets, the default blocksize is 8K. For filesystems (NFS, CIFS or Direct Attach ZFS), objects smaller than 128K (the default recordsize) are stored as a single ZFS block while objects bigger than the default recordsize are stored as multiple records. Each record is the unit which can end up deduplicated in the DDT. Whole files which are duplicated in many filesystem instances are expected to dedup perfectly. For example, whole DBs copied from a master file are expected to fall in this category. Similarly for LUNs, virtual desktop users which were created from the same virtual desktop master image are also expected to dedup perfectly.

An interesting topic for dedup concerns streams of bytes such as a tar file. For ZFS, a tar file is actually a sequence of ZFS records with no identified file boundaries. Therefore, identical objects (files captured by tar) present in 2 tar-like byte streams might not dedup well unless the objects actually start on the same alignment within the byte stream. A better dedup ratio would be obtained by expanding the byte stream into its constituent file objects within ZFS. If possible, the tools creating the byte stream would be well advised to start new objects on identified boundaries such as 8K.

Another interesting topic is backups of active databases. Since databases often interact with their constituent files with an identified block size, it is rather important for deduplication effectiveness that the backup target be set up with a block size that matches the source DB block size. Using a larger block on the deduplication target has the undesirable consequence that modifications to small blocks of the source database will cause those large blocks in the backup target to appear unique and, somewhat artificially, not dedup. By using an 8K block size in the dedup target dataset instead of 128K, one could conceivably see up to a 10X better deduplication ratio.
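The alignment effect on byte streams is easy to reproduce outside of ZFS. The sketch below is only a toy model: it cuts a byte stream into fixed 8K blocks and compares SHA-256 checksums, which mimics block-level dedup but is not the ZFS code. The payload, the headers and the helper names are all made up for illustration.

```python
import hashlib
import random

BLOCK = 8 * 1024  # 8K block size, as in the text


def block_checksums(stream, block=BLOCK):
    """Checksum every fixed-size block of a byte stream."""
    return [hashlib.sha256(stream[i:i + block]).digest()
            for i in range(0, len(stream), block)]


def shared_blocks(a, b):
    """Number of distinct blocks that two streams have in common."""
    return len(set(block_checksums(a)) & set(block_checksums(b)))


# The same 128K "file", embedded in three different byte streams.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK))

stream_a = b"A" * BLOCK + payload               # file starts on an 8K boundary
aligned_b = b"B" * (2 * BLOCK) + payload        # other header, still 8K aligned
shifted_b = b"B" * (2 * BLOCK + 512) + payload  # file shifted by 512 bytes

print(shared_blocks(stream_a, aligned_b))  # 16
print(shared_blocks(stream_a, shifted_b))  # 0
```

When the embedded file starts on an 8K boundary in both streams, all 16 of its blocks match; a 512 byte shift is enough to make every block look unique.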

Performance Model and I/O Pipeline Differences

What is the effect on performance of Dedup ? First, when dedup is enabled, the checksum used by ZFS to validate the disk I/O is changed to the cryptographically strong SHA256. Darren Moffat shows in his blog that SHA256 actually runs at more than 128 MB/sec on a modern cpu. This means that less than 1 ms is consumed to checksum a 128K unit and less than 64 usec for an 8K unit. This cost is only incurred when actually reading or writing data to disk, an operation that is expected to take 5-10 ms; therefore the checksum generation or validation is not a source of concern.

For the read code path, very little modification should be observed. The fact that a read happens to hit a block which is part of the dedup table is not relevant to the main code path. The biggest effect will be that we use a stronger checksum function invoked after a read I/O : at most an extra 1 ms is added to a 128K disk I/O. However if a subsequent read is for a duplicate block which happens to be in the pool ARC cache, then instead of having to wait for a full disk I/O, only a much faster copy of the duplicate block will be necessary. Each filesystem can then work independently on its copy of the data in the ARC cache as is the case without deduplication.

Synchronous writes are also unaffected in their interaction with the ZIL. The blocks written in the ZIL have a very short lifespan and are not subject to deduplication. Therefore the path of synchronous writes is mostly unaffected unless the pool itself ends up not being able to absorb the sustained rate of incoming changes for 10s of seconds. Similarly for asynchronous writes, which interact with the ARC cache, the dedup code has no effect unless the pool's transaction group itself becomes the limiting factor.

So the effect of dedup will take place during the pool transaction group updates. Here is where we take all modifications that occurred in the last few seconds and atomically commit a large transaction group (TXG). While a TXG is running, applications are not directly affected except possibly for the competition for CPU cycles. They mostly continue to read from disk, do synchronous writes to the ZIL, and do asynchronous writes to memory. The biggest effect will come if the incoming flow of work exceeds the capabilities of the TXG to commit data to disk. Then eventually the reads and writes will be held up by the necessary write throttling code, which prevents ZFS from consuming all of memory.

Looking into the ZFS TXG, we have 2 operations of interest : the creation of a new data block and the simple removal (free) of a previously used block. Since ZFS operates under a copy on write (COW) model, any modification to an existing block actually represents both a new data block creation and a free of a previously used block (unless a snapshot was taken, in which case there is no free). For file shares, this concerns existing file rewrites; for block luns (FC and iSCSI), this concerns most writes except the initial one (the very first write to a logical block address or LBA actually allocates the initial data; subsequent writes to the same LBA are handled using COW).

For the creation of a new application data block, ZFS will run the checksum of the block, as it does normally, and then look up the dedup table for a match based on that checksum and a few other bits of information. On a dedup table hit, only a reference count needs to be increased, and such changes to the dedup table will be stored on disk before the TXG completes. Many DDT entries are grouped in a disk block and compression is involved. A big win occurs when many entries in a block are subject to a write match during one TXG. A single 16K I/O can then replace 10s of larger I/Os.

As for free operations, the internals of ZFS actually hold the referencing block pointer, which contains the checksum of the block being freed. Therefore there is no need to read the data being freed nor to recompute its checksum. ZFS, with checksum in hand, looks up the entry in the dedup table and decrements the reference counter. If the counter is non zero then nothing more is necessary (just the dedup table sync). If the freed block ends up without any reference then it will be freed.

The dedup table itself is an object managed by ZFS at the pool level. The table is considered metadata and its elements will be stored in the ARC cache. Up to 25% of memory (zfs_arc_meta_limit) can be used to store metadata. When the dedup table actually fits in memory, then enabling dedup is expected to have a rather small effect on performance. But when the table is many times greater than the allotted memory, then the lookups necessary to complete the TXG can cause write throttling to be invoked earlier than for the same workload running without dedup. If using an L2ARC, the DDT represents prime objects to use the secondary cache. Note that independent of the size of the dedup table, read intensive workloads in a highly duplicated environment are expected to be serviced using fewer IOPS at lower latency than without dedup. Also note that whole filesystem removal or large file truncation are operations that can free up large quantities of data at once, and when the dedup table exceeds the allotted memory then those operations, which are more complex with deduplication, can impact the amount of data going into every TXG and the write throttling behavior.
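The checksum cost quoted above is simple arithmetic on the 128 MB/sec figure. The sketch below assumes that MB and K are the binary units (1 MB = 1024 * 1024 bytes), which is the reading that reproduces the numbers in the text; the function name is mine.

```python
def checksum_time_us(block_bytes, mb_per_sec=128):
    """Microseconds needed to checksum one block at the given SHA256 throughput."""
    bytes_per_sec = mb_per_sec * 1024 * 1024
    return block_bytes / bytes_per_sec * 1e6


print(f"128K block: {checksum_time_us(128 * 1024):.0f} usec")
print(f"  8K block: {checksum_time_us(8 * 1024):.0f} usec")
```

This gives just under 1 ms for a 128K record and about 61 usec for an 8K block, both small next to a 5-10 ms disk I/O.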
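The write and free paths described above boil down to reference counting keyed by checksum. Here is a deliberately simplified sketch of that idea; the class and method names are hypothetical, and it ignores everything the real DDT deals with (on-disk layout, compression, the extra matching bits, TXG syncing).

```python
import hashlib


class DedupTable:
    """Toy dedup table: checksum -> reference count."""

    def __init__(self):
        self.refcount = {}
        self.disk_writes = 0  # data blocks actually written out

    def write_block(self, data):
        """Create a block; only unique content costs a data write."""
        key = hashlib.sha256(data).digest()
        if key in self.refcount:
            self.refcount[key] += 1   # table hit: just bump the count
        else:
            self.refcount[key] = 1
            self.disk_writes += 1     # table miss: new data goes to disk
        return key                    # plays the role of the block pointer checksum

    def free_block(self, key):
        """Free using the checksum kept in the block pointer (no data read).

        Returns True when the last reference is gone and the block is released.
        """
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            return True
        return False


ddt = DedupTable()
keys = [ddt.write_block(b"x" * 8192) for _ in range(3)]
print(ddt.disk_writes)                        # 1
print([ddt.free_block(k) for k in keys])      # [False, False, True]
```

Three writes of identical content cost one data write, and the on-disk block is only released when the third reference is freed.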

So how large is the dedup table ?

The command zdb -DD on a pool shows the size of DDT entries. In one of my experiments it reported about 200 Bytes of core memory for table entries. If each unique object is associated with 200 Bytes of memory, then 32GB of RAM could reference 20TB of unique data stored in 128K records or more than 1TB of unique data in 8K records. So if there is a need to store more unique data than what these ratios provide, strongly consider allocating some large read optimized SSD to hold the DDT. The DDT lookups are small random I/Os which are handled very well by current generation SSDs.
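The sizing rule of thumb can be checked with a few lines of arithmetic. This sketch takes the 200 Bytes per entry from the experiment above and, as a simplification, assumes that all of the RAM is available to hold DDT entries; the function name is mine.

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def unique_data_referenced(ram_bytes, record_bytes, entry_bytes=200):
    """Bytes of unique data whose DDT entries fit in the given amount of RAM."""
    entries = ram_bytes // entry_bytes
    return entries * record_bytes


for record in (128 * KIB, 8 * KIB):
    data = unique_data_referenced(32 * GIB, record)
    print(f"{record // KIB:>3}K records: {data / TIB:.2f} TiB of unique data")
```

With 32GB of RAM this comes out at about 20.5 TiB for 128K records and about 1.3 TiB for 8K records, in line with the figures above.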

The first motivation to enable dedup is actually when dealing with duplicate data to begin with. If possible, procedures that generate duplication could be reconsidered. The use of ZFS Clones is actually a much better way to generate logically duplicate data for multiple users in a way that does not require a dedup hash table.

But when the operating conditions do not allow the use of ZFS Clones and data is highly duplicated, then the ZFS deduplication capability is a great way to reduce the volume of stored data.


Proper Alignment for extra performance

Because of disk partitioning software on your storage clients (keywords : EFI, VTOC, fdisk, DiskPart,...) or a mismatch between storage configuration and application request pattern, you could be suffering a 2-4X performance degradation....

Many I/O performance problems I see end up being the result of a mismatch in request size or alignment versus the natural block size of the underlying storage. While raw disk storage works using a 512 Byte sector and performs at the same level independent of the starting offset of I/O requests, this is not the case for more sophisticated storage, which will tend to use larger block units. Some SSDs today support 512B aligned requests but will work much better if you give them 4K aligned requests as described in Aligning on 4K boundaries Flash and Sizes. The Sun Oracle 7000 Unified Storage line supports different sizes of blocks between 4K and 128K (it can actually go lower but I would not recommend that in general). Having proper alignment between the application's view, the initiator partitioning and the backing volume can have a great impact on the end performance delivered to applications.

When is alignment most important ?

Alignment problems are most likely to have an impact with
  • running a DB on file shares or block volumes
  • write streaming to block volumes (backups)
Also impacted at a lesser level :
  • large file rewrites on CIFS or NFS shares
In each case, adjusting the recordsize to match the workload and ensuring that partitions are aligned on a block boundary could have an important effect on your performance.

Let's review the different cases.

Case 1: running a Database (DB) on file shares or block volumes

Here the DB is a block oriented application. General ZFS Best Practices warrant that the storage use a record size equal to the DB natural block size. At the logical level, the DB is issuing I/Os which are aligned on block boundaries. When using file semantics (NFS or CIFS), the alignment is guaranteed to be observed all the way to the backend storage. But when using block device semantics, the alignment of requests on the initiator is not guaranteed to be the same as the alignment on the target side. Misalignment of the LUN will cause two pathologies. First, an application block read will straddle 2 storage blocks, creating storage IOPS inflation (more backend reads than application reads). But a more drastic effect will be seen for block writes which, when aligned, could be serviced by a single write I/O. Those will now require a Read-Modify-Write (R-M-W) of 2 adjacent storage blocks. This type of I/O inflation leads to additional storage load and degrades performance during high demand.

To avoid such I/O inflation, ensure that the backing store uses a block size (LUN volblocksize or share recordsize) compatible with the DB block size. If using a file share such as NFS, ensure that the filesystem client passes I/O requests directly to the NFS server using a mount option such as directio, or use Oracle's dNFS client. (Note that with the directio mount option, for memory management considerations independent of alignment concerns, the server will behave better when the client specifies rsize,wsize options not exceeding 128K.) To avoid LUN misalignment, prefer the use of full LUNs as opposed to sliced partitions. If disk slices must be used, prefer a partitioning scheme in which one can control the sector offset of individual partitions, such as EFI labels. In that case start partitions on a sector boundary which aligns with the volume's blocksize. For instance, an initial block for a partition which is a multiple of 16 * 512B sectors will align on an 8K boundary, the default LUN blocksize.
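The straddling effect is easy to quantify. The sketch below counts how many storage blocks a single request touches; the helper name is mine, and the 63 sector partition offset is just a commonly seen legacy fdisk default used as an illustration.

```python
def storage_blocks_touched(offset, length, block_size):
    """Number of storage blocks covered by a request of `length` bytes at `offset`."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


BLOCK = 8 * 1024  # DB block size and LUN volblocksize both 8K

# Aligned: an 8K DB block maps onto exactly one storage block.
print(storage_blocks_touched(0, BLOCK, BLOCK))         # 1

# Partition starting at sector 63 (63 * 512 bytes): every 8K DB block
# now straddles two storage blocks.
print(storage_blocks_touched(63 * 512, BLOCK, BLOCK))  # 2
```

A read of such a misaligned block costs two backend reads, and a write turns into a read-modify-write of both storage blocks.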
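The "multiple of 16 sectors" rule can be turned into a small check to run against a partition table before creating slices. This is only a sketch: the function names are hypothetical, and the start sector 34 is used as an example of a start that is not 8K aligned.

```python
def is_aligned(start_sector, volblocksize=8192, sector_size=512):
    """True if a partition starting at `start_sector` begins on a volume block boundary."""
    return (start_sector * sector_size) % volblocksize == 0


def next_aligned_sector(start_sector, volblocksize=8192, sector_size=512):
    """Smallest sector >= start_sector that falls on a volume block boundary."""
    sectors_per_block = volblocksize // sector_size
    return -(-start_sector // sectors_per_block) * sectors_per_block  # ceiling division


print(is_aligned(34))           # False: 34 * 512 is not a multiple of 8K
print(next_aligned_sector(34))  # 48, i.e. 3 * 16 sectors
print(is_aligned(48))           # True
```

For a LUN created with a larger volblocksize, pass that value instead of the 8K default so that the required sector multiple grows accordingly.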

Case 2: write streaming to block volumes (backups)

The other important case to pay attention to is stream writing to a raw block device. Block devices by default commit each write to stable storage. This path is often optimized through the use of acceleration devices such as write optimized SSDs. Misalignment of the LUNs due to partitioning software implies that application writes, which could otherwise be committed to SSD at low latency, will instead be delayed by disk reads caught in R-M-W. Because the writes are synchronous in nature, the application running on the initiator will thus be considerably delayed by disk reads. Here again one must ensure that partitions created on the client system are aligned with the volume's blocksize, which typically defaults to 8K. For pure streaming workloads, a large blocksize up to the maximum 128K can lead to greater streaming performance. One must take good care that the block size used for a LUN does not exceed the application write sizes to raw volumes, or risk being hit by the R-M-W penalty.

Case 3: large file rewrites on CIFS or NFS shares

For file shares, large streaming writes will be of 2 types : they will either be the more common file creation (write allocation) or they will correspond to a streaming overwrite of an existing file. The more common write allocation would not greatly suffer from misalignment since there is no pre-existing data to be read and modified. But for the less common streaming rewrite of files, one can definitely be impacted by misalignment and R-M-W cycles. Fortunately file protocols are not subject to LUN misalignment, so one must only take care that the write sizes reaching the storage are a multiple of the recordsize used to create the file share in the storage. The Solaris NFS clients often issue 32K write sizes for streaming applications while CIFS has been observed to use 64K from clients. If streaming asynchronous rewrite of existing files is an important component of your I/O workloads (a rare set of conditions), it might well be that setting the share's recordsize accordingly will provide a boost to delivered performance.

In summary

The problem with alignment is most generally seen with fixed-record-oriented applications (such as Oracle Database or Microsoft Exchange) with random access patterns and synchronous I/O semantics. It can be caused by partitioning software (fdisk, diskpart) which creates disk partitions not aligned with the storage blocks. It can also be caused, to a lesser extent, by streaming file overwrites when the application write size does not match the file share's blocksize. The Sun Storage 7000 line offers great flexibility in selecting different blocksizes for different uses within a single pool of storage. However it has no control over the offset selected during disk partitioning of block devices on client systems. Care must be taken when partitioning disks to avoid misalignment and degraded performance. Using full LUNs is preferred.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Doubling Exchange Performance

2010/Q1 delivers 200% more Exchange performance and 50% extra transaction processing

One of the great advances present in the ZFS Appliance 2010/Q1 software update relates to the block allocation strategy. It has been one of the most complex performance investigations I've ever had to deal with, because of the very strong impact the previous history of block allocation had on future performance. It was a maddening experience littered with dead-end leads. During that whole time it was very hard to make sense of the data and to separate what was due to a problem in block allocation from the other causes that lead customers to report performance issues.

Executive Summary

A series of changes to the ZFS metaslab code led to 50% improved OLTP performance and 70% reduced variability from run to run. We also saw a full 200% improvement in MS Exchange performance from these changes.

Excruciating Details for aspiring developer "Abandon hope all ye who enter here"

At some point we started to look at random synchronous file rewrite (a la DB writer) and it seemed clear that the performance was not what we expected for this workload. Basically, independent DB block synchronous writes were not aggregating into larger I/Os in the vdev queue. We could not truly identify a point where a regression had set in, so rather than treat this as a performance regression, we just decided to study what we had in 2009/Q3 and see how we could make our current scheme work better. And that led us onto the path of the metaslab allocator:

As Jeff explains, when a piece of data needs to be stored on disk, ZFS will first select a top level vdev (a raid-z group, a mirrored set, a single disk) for it. Within that top level vdev, a metaslab (slab for short) will be chosen, and within the slab a block of Data Virtual Address (DVA) space will be selected. We didn't have a problem with the vdev selection process, but we thought we had an issue with the block allocator. What we were seeing was that for random file rewrite the aggregation factor was large (say 8 blocks or more) when performance was good but dropped to 1 or 2 when performance was low. So we tried to see if we could do a better job at selecting blocks that would lead to better I/O aggregation down the pipeline. We kept looking at the effect of block allocation, but it turned out the source of the problem was in the slab selection process.

So a slab is a portion of DVA space within a metaslab group (aka a top level vdev). We currently divide VDEV space into approximately 200 slabs (see vdev_metaslab_set_size). Slabs can be either loaded in memory or not. When loaded, the associated spacemaps are active, meaning we can allocate space from them. When slabs are not loaded, we can't allocate space but we can still free space from them (ZFS being copy-on-write or COW, a block rewrite frees up the old space). In this case we just log the freed range information to disk. As loading and unloading spacemaps is not cheap, we make sure to minimize such operations.

So each slab is weighted according to a few criteria, and the slab with the highest weight is selected for allocation on a vdev. The first criterion for slab selection is to reuse the same one as the last one used: basically don't change a winner. We refer to this as the PRIMARY slab. The second criterion for slab selection is the amount of free space. The more the better. However, lower LBAs (logical block addresses), which map to outer cylinders, will generally give better performance. So we weight lower LBAs more than higher ones at equivalent free space. Finally, a slab that has already been used in the past, even if currently unloaded, is preferred to opening up a fresh new slab. This is the SMO bonus (because primed slabs have a Space Map Object associated). We do want to favor previously used slabs in order to limit the span of head seeks: we only move inwards when outer space is filled up.

The purpose of the slabs is to service a block allocation, say for a 128K record. So when a request comes in, the highest weighted slab is chosen and we ask it for a block of the proper size using an AVL tree of free/allocated space. There was a problem we had to deal with in previous releases which occurred when such an allocation failed because of free space fragmentation. The AVL tree was then not able to find a span of the requested size and was consuming CPU only to figure out there was no free block present to satisfy the allocation. When space was really tight in a pool, we walked every slab before deciding that the allocation needed to be split into small chunks and a gang block (a block of blocks) created. So the spacemaps were augmented with another structure that allowed ZFS to immediately know how large an allocation could be serviced in a slab (the so-called picker private tree, organized by size of free space).
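The value of a size-ordered index can be sketched in a few lines. This is a toy model, not the ZFS data structure: it uses sorted Python lists in place of AVL trees, and the class and method names are hypothetical.

```python
import bisect


class SlabFreeSpace:
    """Toy model of a slab's free segments, indexed by offset and also by size."""

    def __init__(self, segments):
        segments = list(segments)  # (offset, size) free ranges
        self.by_offset = sorted(segments)
        # Second index ordered by size, playing the role of the size-ordered tree.
        self.by_size = sorted((size, offset) for offset, size in segments)

    def largest_free(self):
        """Largest allocation this slab can service, without walking every segment."""
        return self.by_size[-1][0] if self.by_size else 0

    def can_satisfy(self, size):
        return size <= self.largest_free()

    def smallest_fitting(self, size):
        """Smallest free segment of at least `size` bytes, or None."""
        i = bisect.bisect_left(self.by_size, (size, -1))
        if i == len(self.by_size):
            return None
        seg_size, offset = self.by_size[i]
        return (offset, seg_size)


slab = SlabFreeSpace([(0, 8192), (65536, 131072), (300000, 16384)])
print(slab.can_satisfy(128 * 1024), slab.can_satisfy(256 * 1024))
```

The point of the second index is that a fragmented slab can reject a 256K request immediately instead of burning CPU searching the offset-ordered structure for a span that is not there.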

At that point we had 2 ways to select a block: either find one in sequence with the previous allocation (first fit) or use one that exactly fills a hole in the allocated space, the so-called best fit allocator. We also decided then to switch from first fit to best fit as a slab became 70% full. The problem that this created, we now realize, is that while it helped the compactness of the on-disk layout, it created a headache for writes. Each new allocation got a tailored-fit disk area, and this led to much less write aggregation than expected. We would see that write workloads to a slab slowed down as it transitioned to 70% full (note this occurred when a slab was 70% full, not the full vdev nor the pool). Eventually, the degraded slab became fully used and allocation would transition to a different slab with better performance characteristics. Performance could then fluctuate from one hour to the next.

So to solve this problem, what went into the 2010/Q1 software release is multifold. The most important thing is: we increased the threshold at which we switch from 'first fit' (go fast) to 'best fit' (pack tight) from 70% full to 96% full. With TB drives, each slab is at least 5GB, and 4% of that is still 200MB: plenty of space, and no need to do anything radical before that. This gave us the biggest bang. Second, instead of trying to reuse the same primary slab until it failed an allocation, we decided to stop giving the primary slab this preferential treatment as soon as the biggest allocation that could be satisfied by the slab was down to 128K (metaslab_df_alloc_threshold). At that point we were ready to switch to another slab that had more free space. We also decided to reduce the SMO bonus. Before, a slab that was 50% empty was preferred over slabs that had never been used. In order to foster more write aggregation, we reduced the threshold to 33% empty. This means that a random write workload now spreads to more slabs, where each one will have a larger amount of free space, leading to more write aggregation. Finally, we also saw that slab loading was contributing to lower performance and implemented a slab prefetch mechanism to reduce the down time associated with that operation.
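The first-fit to best-fit switch can be sketched as follows. This is a simplified illustration under stated assumptions, not the metaslab code: free space is a plain list of (offset, length) tuples, `cursor` stands for the end of the previous allocation, and the function and constant names are hypothetical.

```python
FIRST_FIT_LIMIT = 0.96  # fraction of the slab in use; the text says this was 0.70 before 2010/Q1


def pick_segment(free_segments, size, used_fraction, cursor=0, limit=FIRST_FIT_LIMIT):
    """Return the (offset, length) free segment to allocate from, or None."""
    fitting = [seg for seg in free_segments if seg[1] >= size]
    if not fitting:
        return None
    if used_fraction < limit:
        # First fit ("go fast"): next large-enough segment at or after the
        # previous allocation, wrapping to the lowest offset if none is found.
        after = [seg for seg in fitting if seg[0] >= cursor]
        return min(after or fitting)
    # Best fit ("pack tight"): smallest segment that can hold the request.
    return min(fitting, key=lambda seg: (seg[1], seg[0]))


segments = [(0, 8192), (100000, 1 << 20), (5000000, 16384)]
old = pick_segment(segments, 8192, used_fraction=0.80, cursor=50000, limit=0.70)
new = pick_segment(segments, 8192, used_fraction=0.80, cursor=50000)
print(old, new)
```

For a slab that is 80% full, the old 70% limit plugs the isolated 8K hole, while the 96% limit keeps allocating in sequence inside the large free region, which is what lets neighbouring writes aggregate.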

The conjunction of all these changes led to 50% improved OLTP performance and 70% reduced variability from run to run (see David Lutz's post on OLTP performance). We also saw a full 200% improvement in MS Exchange performance from these changes.

Thursday Sept. 17, 2009

iSCSI unleashed

One of the much anticipated features of the 2009.Q3 release of the fishworks OS is a complete rewrite of the iSCSI target implementation, known as Common Multiprotocol SCSI Target or COMSTAR. The new target code is an in-kernel implementation that replaces what was previously known as the iSCSI target daemon, a user-level implementation of iSCSI.

Should we expect huge performance gains from this change ? You Bet !

But as with most performance questions, the answer is often: it depends. The measured performance of a given test is gated by the weakest link triggered. iSCSI is just one component among many others that can end up gating performance. If the daemon was not a limiting factor, then that's your answer.

The target daemon was a userland implementation of iSCSI: some daemon threads would read data from a storage pool and write data to a socket, or vice versa. Moving to a kernel implementation opens up options to bypass at least one of the copies, and that is being considered as a future option. But extra copies, while undesirable, do not necessarily contribute much to small packet latency or large request throughput. For small packet requests, the copy is small change compared to the request handling. For large request throughput, the important thing is that the data path establishes a pipelined flow in order to keep every component busy at all times.

But the way threads interact with one another can be a much greater factor in delivered performance. And there lies the problem. The old target daemon suffered from one major flaw in that each and every iSCSI request would require multiple trips through a single queue (shared between all LUNs), and that queue was being read and written by 2 specific threads. Those 2 threads would end up fighting for the same locks. This was compounded by the fact that user-level threads can be put to sleep when they fail to acquire a mutex, and that going to sleep is a costly operation for a user-level thread, implying a system call and all the accounting that goes with that.

So while the iSCSI target daemon gave reasonable service for large requests, it was much less scalable in terms of the number of IOPS that could be served and the CPU efficiency with which it could do that, IOPS being of course a critical metric for block protocols.

As an illustration of that, with 10 client initiators and 10 threads per initiator (so 100 outstanding requests) doing 8K cache-hit reads, we observed:

Old Target Daemon    Comstar     Improvement
31K IOPS             85K IOPS    2.7X

Moreover the target daemon was consuming 7.6 CPUs to service those 31K IOPS, while comstar could handle 2.7X more IOPS consuming only 10 CPUs, a 2X improvement in IOPS-per-CPU efficiency.
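The efficiency claim follows directly from the figures quoted above; a quick check, using only those numbers (the variable names are mine):

```python
# Figures quoted in the text: IOPS served and CPUs consumed.
old_iops, old_cpus = 31_000, 7.6   # old target daemon
new_iops, new_cpus = 85_000, 10.0  # comstar

old_eff = old_iops / old_cpus  # IOPS per CPU, old daemon
new_eff = new_iops / new_cpus  # IOPS per CPU, comstar

throughput_gain = new_iops / old_iops
efficiency_gain = new_eff / old_eff

print(f"throughput gain: {throughput_gain:.1f}X")
print(f"IOPS per CPU: {old_eff:.0f} -> {new_eff:.0f} ({efficiency_gain:.1f}X)")
```

So the old daemon delivered roughly 4K IOPS per CPU against 8.5K for comstar, which is where the "2X" comes from.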

On the write side, with a disk pool that had 2 striped write-optimised SSDs, comstar gave us 50% more throughput (130 MB/sec vs 88 MB/sec) and 60% more CPU efficiency.


During our testing we noted a few interesting contributors to delivered performance. The first is the setting of the iSCSI immediatedata parameter (see iscsiadm(1M)). On the write path, that parameter causes the iSCSI initiator to send up to 8K of data along with the initial request packet. While this is a good idea, we found that for certain sizes of writes it would trigger a condition in the ZIL that caused ZFS to issue more data than necessary through the logzillas. The problem is well understood and remediation is underway; we expect to get to a situation in which keeping the default value of immediatedata=yes is best. But as of today, for those attempting world record data transfer speeds through logzillas, setting immediatedata=no and using a 32K or 64K write size might yield positive results, depending on your client OS.

Interrupt Blanking

Interested in low latency request response? Interestingly, a chunk of that response time is lost in the obscure settings of network card drivers. Network cards will often delay pending interrupts in the hope of coalescing more packets into a single interrupt. The extra efficiency often results in more throughput at high data rates at the expense of small packet latency. For 8K requests we managed to get 15% more single threaded IOPS by tweaking one such client side parameter. Historically such tuning has always been hidden in the bowels of each driver and specific to every client OS, so that's too broad a topic to cover here. But for Solaris clients, the Crossbow framework is aiming, among other things, to make the latency vs throughput decision much more adaptive to operating conditions, relaxing the need for per-workload tuning.

WCE Settings

Another important parameter to consider for comstar is the 'write cache enable' bit. By default every write request to an iSCSI LUN needs to be committed to stable storage, as this is what is expected by most consumers of block storage. That means that each individual write request to a disk based storage pool will take minimally a disk rotation, or 5ms to 8ms, to commit. This is also why a write-optimised SSD is quite critical to many iSCSI workloads, often yielding 10X performance improvements. Without such an SSD, iSCSI performance will appear quite lackluster, particularly for lightly threaded workloads, which are more affected by latency characteristics.
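The rotation figure is simple arithmetic. The sketch below assumes, as a simplification, that a commit costs exactly one full platter rotation and ignores seek and transfer time; the helper names are mine.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000 / rpm


def max_serial_commits_per_sec(rpm):
    """Upper bound on back-to-back commits from a single thread, one rotation each."""
    return rpm / 60


for rpm in (7_200, 10_000, 15_000):
    print(f"{rpm} RPM: {rotation_ms(rpm):.1f} ms per rotation, "
          f"at most {max_serial_commits_per_sec(rpm):.0f} serial commits/sec")
```

Under that assumption a single-threaded writer on 7.2K RPM disks tops out near 120 synchronous writes per second, which is why lightly threaded workloads feel the absence of a log SSD so strongly.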

One could then feel justified in setting the write cache enable bit on some LUNs in order to recoup some spunk in their engine. One piece of good news here is that in the new 2009.Q3 release of fishworks the setting is now persistent across reboots and reconnection events, fixing a nasty condition of 2009.Q2. However one should be very careful with this setting, as the end consumer of block storage (Exchange, NTFS, Oracle, ...) is quite probably then operating under an unexpected set of conditions. This setting can lead to application corruption in case of an outage (there is no risk for the storage's internal state).

There is one exception to this caveat and it is ZFS itself. ZFS is designed to operate safely and correctly on top of devices that have their write cache enabled. That is because ZFS will flush write caches whenever application semantics or its own internal consistency require it. So a zpool created on top of iSCSI LUNs would be well justified in setting WCE on the LUNs to boost performance.

Synchronous write bias

Finally, as described in my blog entry about Synchronous write bias, we now have the option to bypass the write-optimised SSDs for a LUN if the workload it receives is less sensitive to latency. This would be the case for a highly threaded workload doing large data transfers. Experimenting with this new property is warranted at this point.

Synchronous write bias property

With the 2009.Q3 release of fishworks, along with a new iSCSI implementation, we're coming up with a very significant new feature for managing the performance of Oracle databases: the new dataset Synchronous write bias property, or logbias for short. In a nutshell, this property takes the default value of Latency, signifying that the storage should handle synchronous writes with urgency, the historical default handling. See Brendan's comprehensive blog entry on the Separate Intent Log and synchronous writes. However, for datasets holding Oracle datafiles, the logbias property can be set to Throughput, signifying that the storage should avoid using log device acceleration and instead try to optimize the workload's throughput and efficiency. We definitely expect to see a good boost to Oracle performance from this feature for many types of workloads and configs, namely workloads that generate 10s of MB/sec of DB writer traffic and have no more than 1 logzilla per tray/JBOD.

The property is set in the Share Properties, just above database recordsize. You might need to unset the Inherit from project checkbox in order to modify the setting on a particular share.

The logbias property addresses a peculiar aspect of Oracle workloads: namely that DB writers issue a large number of concurrent synchronous writes to Oracle datafiles, writes which individually are not particularly urgent. In contrast to other types of synchronous write workloads, the more important metric for DB writers is not individual latency. The important metric is that the storage keeps up with the throughput demand in order to have database buffers always available for recycling. This is unlike redo log writes, which are critically sensitive to latency as they are holding up individual transactions and thus users.

ZFS and the ZIL

A little background: with ZFS, synchronous writes are managed by the ZFS Intent Log (ZIL). Because synchronous writes are typically holding up applications, it's important to handle those writes with some level of urgency, and the ZIL does an admirable job at that.

In the Openstorage hybrid storage pool the ZIL itself is sped up using low latency write-optimized SSD devices: the logzillas. Those devices are used to commit a copy of the in-memory ZIL transaction and retain the data until an upcoming transaction group commits the in-memory state to the on-disk pooled storage (Dynamics of ZFS, The New ZFS write throttle).

So while the ZIL speeds up synchronous writes, logzillas speed up the ZIL. Now SSDs can serve IOPS at a blazing 100μs but also have their own throughput limits: currently around 110MB/sec per device. At that throughput, committing, for example, 40K of data will need minimally 360μs. The more data we can divert away from log devices, the lower the latency response of those devices will be.
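The 360μs figure is just payload size divided by device throughput. A small sketch, assuming decimal units (40K taken as 40,000 bytes, 1 MB as 1,000,000 bytes); the function name is mine:

```python
def transfer_us(nbytes, mb_per_sec):
    """Minimum time, in microseconds, to push `nbytes` through a device at `mb_per_sec`."""
    bytes_per_sec = mb_per_sec * 1_000_000
    return nbytes / bytes_per_sec * 1_000_000


t_40k = transfer_us(40_000, 110)  # the example from the text
t_8k = transfer_us(8_000, 110)    # a single 8K DB block, for comparison
print(f"40K: {t_40k:.0f} us, 8K: {t_8k:.0f} us")
```

A 40K commit therefore spends several times the device's raw 100μs service time just moving bytes, which is why diverting bulk DB writer traffic away from the logzillas helps the latency of everything else that uses them.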

It's interesting to note that other types of raid controllers will be hostage of their NVRAM and require, for consistency, that data be committed through some form of acceleration in order to avoid the Raid Write Hole (Bonwick on Raid-Z). ZFS, however, does not require that data passes through its SSD commit accelerator and it can manage consistency of commits either using disk or using SSDs.

Synchronous write bias : Throughput

With this newfound ability of storage administrators to signify to ZFS that some datasets will be subject to highly threaded synchronous writes for which global throughput is more critical than individual write latency, we can enable a different handling mode. By setting Logbias=Throughput ZFS is able to divert writes away from the Logzillas which are then preserved for servicing low latency sensitive operations (e.g. redo log operations).

  • A setting of Synchronous write bias : Throughput for a dataset allows synchronous writes to files in other datasets to have lower latency access to SSD log devices.

But that's not all. Data flowing through a logbias=Throughput dataset is still served by the ZIL. It turns out that the ZIL has different internal options in the way it can commit transactions, one of which is tagged WR_INDIRECT. WR_INDIRECT commits issue an I/O for the modified file record and record a pointer to it in the ZIL chain (see WR_INDIRECT in zil.c, zvol.c, zfs_log.c).

ZIL transactions of type WR_INDIRECT might use more disk I/Os and incur slightly higher latency immediately, but lead to fewer I/Os and fewer total bytes during the upcoming transaction group update. Up to this point, the heuristics that lead to using WR_INDIRECT transactions were not triggered by DB writer workloads. But armed with the knowledge that comes with the new logbias property, we're now less concerned about the slight latency increase that WR_INDIRECT can have. So for efficiency reasons the logbias=Throughput datasets are now set to use this mode, leading to more leveled latency distributions for transactions.

  • Synchronous write bias : Throughput is a dataset mode that reduces the number of I/Os that need to be issued on behalf of this dataset during the regular transaction group updates, leading to more leveled response times.

A reminder that this kind of improvement can sometimes go unnoticed in sustained benchmarks if the downstream transaction group destage is not given enough resources. Make sure you have enough spindles (or total disk KRPM) to sustain the level of performance you need. A pool with 2 logzillas and a single JBOD might have enough SSD throughput to absorb DB writer workloads without adversely affecting redo log latency, and so would not benefit from the special logbias setting; however, with 1 logzilla per JBOD the situation might be reversed.

While the DB Record Size property is inherited by files in a dataset and is immutable, the logbias setting is totally dynamic and can be toggled on the fly during operations. For instance, during database creation or some lightly threaded write operations to Datafiles, it's expected that logbias=Latency should perform better.

Logbias deployments for Oracle

As of the 2009.Q3 release of fishworks, the current wisdom around deploying Oracle DB on an Openstorage system with SSD acceleration is to segregate Oracle datafiles, index files and redo log files at the filesystem/dataset level, but within a single storage pool. Having each type of file in a different dataset allows better observability into each one using the great analytics tool. But also, each dataset can then be tuned independently to deliver the most stable performance characteristics. The most important parameter to consider is the ZFS internal recordsize used to manage the files. For Oracle datafiles the established practice (ZFS Best Practice) is to match the recordsize and the DB block size. For redo log files, using the default 128K records means that fewer file updates will be straddling multiple filesystem records. With 128K records we expect to have fewer transactions needing to wait for redo log input I/Os, leading to a more leveled latency distribution for transactions. As for index files, using smaller blocks of 8K offers better cacheability for both the primary and secondary caches (only cache what is used from indexes), but using larger blocks offers better index-scan performance. Experimenting is in order, depending on your use case, but an intermediate block size of maybe 32K might also be considered for mixed usage scenarios.

For Oracle datafiles specifically, using the new setting of Synchronous write bias : Throughput has potential to deliver more stable performance in general and higher performance for redo log sensitive workloads.

Dataset      Recordsize        Logbias
Datafiles    8K                Throughput
Redo Logs    128K (default)    Latency (default)
Index        8K-32K?           Latency (default)

Following these guidelines yielded a 40% boost in our transaction processing testing, in which we had 1 logzilla for a 40 disk pool.

Thursday June 11, 2009

Compared Performance of Sun 7000 Unified Storage Array Line

The Sun Storage 7410 Unified Storage Array provides high-performance for NAS environments. Sun's product can be used on a wide variety of applications. The Sun Storage 7410 Unified Storage Array with a _single_ 10 GbE connection delivers linespeed of the 10 GbE.

  • The Sun Storage 7410 Unified Storage Array delivers 1 GB/sec throughput performance.
  • The Sun Storage 7310 Unified Storage Array delivers over 500 MB/sec on streaming writes for backups and imaging applications.
  • The Sun Storage 7410 Unified Storage Array delivers over 22000 8K synchronous writes per second, combining great DB performance and the ease of deployment of Network Attached Storage while delivering the economic benefits of inexpensive SATA disks.
  • The Sun Storage 7410 Unified Storage Array delivers over 36000 random 8K reads per second from a 400GB working set for great Mail application responsiveness. This corresponds to an enterprise of 100000 people, with every employee accessing new data every 3.6 seconds, consolidated on a single server.

All those numbers characterise a single head of a 7410 clusterable technology. The 7000 clustering technology stores all data in dual attached disk trays and no state is shared between cluster heads (see Sun 7000 Storage clusters). This means that an active-active cluster of 2 healthy 7410 will deliver 2X the performance posted here.

Also note that the performance posted here represents what is achieved under a very tightly constrained workload (see Designing 11 Storage metric) and does not represent the performance limits of the systems. This is testing 1 x 10 GbE port only; each product can have 2 or 4 10 GbE ports, and by running load across multiple ports the server can deliver even higher performance. Achieving maximum performance is a separate exercise, done extremely well by my friend Brendan.

Measurement Method

To measure our performance we used the open source Filebench tool accessible from SourceForge. Measuring the performance of NAS storage is not an easy task. One has to deal with the client side cache which needs to be bypassed, the synchronisation of multiple clients, and the presence of client side page flushing daemons which can turn asynchronous workloads into synchronous ones. Because our Storage 7000 line can have such large caches (up to 128GB of RAM and more than 500GB of secondary caches) and we wanted to test disk responses, we needed to find backdoor ways to flush those caches on the servers. Read Amithaba's Filebench Kit entry on the topic, in which he posts a link to the toolkit used to produce the numbers.

We recently released our first major software update, 2009.Q2, and along with that a new lower cost clusterable 96 TB storage system, the 7310.

We report here the compared numbers of a 7310 with the latest software release to those previously obtained for the 7410, 7210 and 7110 systems each attached to an 18 to 20 client pool over a single 10Gbe interface with the regular frame ethernet (1500 Bytes). By the way, looking at brendan's results above, I encourage you to upgrade to use Jumbo Frames ethernet for even more performance and note that our servers can drive two 10Gbe at line speed.

Tested Systems and Metrics

The tested setups are:
        Sun Storage 7410, 4 x quad core: 16 cores @ 2.3 Ghz AMD.
        128GB of host memory.
        1 dual port 10Gbe Network Atlas Card. NXGE driver. 1500 MTU
        Streaming Tests:
        2 x J4400 JBOD,  44 x 500GB SATA drives 7.2K RPM, Mirrored pool, 
        3 Write optimized 18GB SSD, 2 Read Optimized 100GB SSD.
        IOPS tests:
        12 x J4400 JBOD, 280 x 500GB SATA drives 7.2K RPM, Mirrored pool,
        272 Data drives + 8 spares.
        8-Mirrored Write Optimised 18GB SSD, 6 Read Optimized 100GB SSD.
        FW OS : ak/generic@2008.11.20,1-0

        Sun Storage 7310,2 x quad core: 8 cores @ 2.3 Ghz AMD.
        32GB of host memory.
        1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
        4 x J4400 JBOD for a total 92 SATA drives  7.2K RPM
        43 mirrored pairs
        4 Write Optimised 18GB SSD, 2 Read Optimized 100GB SSD.
        FW OS : Q2 2009.,1-1.15

        Sun Storage 7210, 2 x quad core: 8 cores @ 2.3 Ghz AMD
        32 GB of host memory.
        1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
        44  x 500 GB SATA drives  7.2K RPM, Mirrored pool,
        2 Write Optimised 18 GB SSD.
        FW OS : ak/generic@2008.11.20,1-0

        Sun Storage 7110, 2 x quad core opteron: 8 cores @ 2.3 Ghz AMD
        8 GB of host memory.
        1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
        12 x 146 GB SAS drives, 10K RPM, in 3+1 Raid-Z pool.
        FW OS : ak/generic@2008.11.20,1-0

The newly released 7310 was tested with the most recent software revision, and that certainly gives the 7310 an edge over its peers. The 7410, on the other hand, was measured here managing a much larger contingent of storage, including mirrored logzillas and 3 times as many JBODs, and that is expected to account for some of the performance delta being observed.

    Metrics                                                Short Name
    1 thread per client streaming cached reads             Stream Read light
    1 thread per client streaming cold cache reads         Cold Stream Read light
    10 threads per client streaming cached reads           Stream Read
    20 threads per client streaming cold cache reads       Cold Stream Read
    1 thread per client streaming write                    Stream Write light
    20 threads per client streaming write                  Stream Write
    128 threads per client 8k synchronous writes           Sync write
    128 threads per client 8k random read                  Random Read
    20 threads per client 8k random read on cold caches    Cold Random Read
    8 threads per client 8k small file create IOPS         Filecreate

There are 6 read tests, 2 write tests and 1 synchronous write test which overwrites its data files as a database would. A final filecreate test completes the metrics. Tests execute against a 20GB working set _per client_, times 18 to 20 clients. There are 4 sets used in total, running over independent shares, for a total of 80GB per client. So before actual runs are taken, we create all working sets, or 1.6 TB of precreated data. Then before each run, we clear all caches on the clients and server.
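For readers who want to check the data volumes, here is the arithmetic spelled out, taking the 20-client end of the "18 to 20 clients" range (the variable names are mine):

```python
set_size_gb = 20     # working set per client, per test set
sets_per_client = 4  # independent shares used in total
clients = 20         # upper end of the 18-to-20 client pool

per_client_gb = set_size_gb * sets_per_client  # data precreated for each client
total_gb = per_client_gb * clients             # data precreated overall
one_set_total_gb = set_size_gb * clients       # working set touched by a single test

print(per_client_gb, total_gb / 1000, one_set_total_gb)
```

This gives 80GB per client and 1.6 TB in total, while any one test runs against a 400GB working set across the 20 clients.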

In each of the 3 groups of 2 read tests, the first one benefits from no caching at all, and the throughput delivered to the client over the network is observed to come from disk. The test runs for N seconds, priming data in the storage caches. A second run (non-cold) is then started after clearing the client side caches. Those tests will see 100% of the data delivered over the network link, but not all of it is coming off the disks. Streaming tests will race through the cached data and then finish off reading from disks. The random read test can also benefit from increasingly cached responses as the test progresses. The exact caching characteristics of a 7000 line system will depend on a large number of parameters, including your application access pattern. Numbers here reflect the performance of fully randomized tests over 20GB per client x 20 clients, or a 400GB working set. Upcoming studies will include more data (showing even higher performance) for workloads with a higher cache hit ratio than those used here.

In a Storage 7000 server, disks are grouped together in one pool and then individual shares are created. Each share has access to all disk resources, subject to any quota (a maximum) and reservation (a minimum) that might be set. One important setup parameter associated with each share is the DB record size. It is generally better for IOPS tests to use 8K records and for streaming tests to use 128K records. The recordsize can be dynamically set based on expected usage.

The tests shown here were obtained with NFSv4 the default for Solaris clients (NFSv3 is expected to come out slightly better). The clients were running Solaris 10, with tuned tcp_recv_hiwat of 400K and dopageflush=0 to prevent buffered writes from being converted into synchronous writes.

Compared Results of the 7000 Storage Line

    NFSv4 Test                7410 Head        7310 Head        7210 Head        7110 Head
                              Mirrored Pool    Mirrored Pool    Mirrored Pool    3+1 Raid-Z
    Cold Stream Read light    915 MB/sec       685 MB/sec       719 MB/sec       378 MB/sec
    Stream Read light         1074 MB/sec      751 MB/sec       894 MB/sec       416 MB/sec
    Cold Stream Read          959 MB/sec       598 MB/sec       752 MB/sec       329 MB/sec
    Stream Read               1030 MB/sec      620 MB/sec       792 MB/sec       386 MB/sec
    Stream Write light        480 MB/sec       507 MB/sec       490 MB/sec       226 MB/sec
    Stream Write              447 MB/sec       526 MB/sec       481 MB/sec       224 MB/sec
    Sync write                22383 IOPS       8527 IOPS        10184 IOPS       1179 IOPS
    Filecreate                5162 IOPS        4909 IOPS        4613 IOPS        162 IOPS
    Cold Random Read          28559 IOPS       5686 IOPS        4006 IOPS        1043 IOPS
    Random Read               36478 IOPS       7107 IOPS        4584 IOPS        1486 IOPS

    Per Spindle IOPS          272 Spindles     86 Spindles      44 Spindles      12 Spindles
    Cold Random Read          104 IOPS         76 IOPS          91 IOPS          86 IOPS
    Random Read               134 IOPS         94 IOPS          104 IOPS         123 IOPS


The data shows that the entire Sun Storage 7000 line is made up of throughput workhorses, delivering 10 Gbps-level NAS services per cluster head node, using a single network interface and a single IP address for easy integration into your existing network.

As with other storage technologies, streaming write performance requires more involvement from the storage controller, and this leads to about 50% less write throughput compared to read throughput.

The use of write-optimized SSDs in the 7410, 7310 and 7210 also gives this storage very high synchronous write capabilities. This is one of the most interesting results, as it maps to database performance. The ability to sustain 24000 O_DSYNC writes at 192MB/sec of synchronized user data using only 48 inexpensive SATA disks and 3 write-optimized SSDs is one of the many great performance characteristics of this novel storage system.
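As a rough sanity check of those figures, the user-data bandwidth is simply the write rate multiplied by the write size. The sketch below assumes 8K writes counted in decimal units (1 MB = 1000 KB); the write size is not stated in this paragraph and is an assumption, picked because it matches the 8K record size recommended for IOPS tests.

```python
# Sanity check: synchronous write rate x write size = user-data bandwidth.
# ASSUMPTION: each O_DSYNC write carries 8 KB, and 1 MB = 1000 KB.
SYNC_WRITES_PER_SEC = 24000   # figure quoted in the text
WRITE_SIZE_KB = 8             # assumed, not stated in the text


def sync_bandwidth_mb(writes_per_sec, write_size_kb):
    """Return user-data bandwidth in MB/sec (decimal megabytes)."""
    return writes_per_sec * write_size_kb / 1000


bandwidth = sync_bandwidth_mb(SYNC_WRITES_PER_SEC, WRITE_SIZE_KB)
print(f"{bandwidth:.0f} MB/sec")
```

Under that assumption the result lines up with the 192MB/sec quoted above.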

Random read tests generally map directly to individual disk capabilities and are a measure of total disk rotations. The cold runs show that all our platforms deliver data at the expected 100 IOPS per spindle for those SATA disks. Recall that our offering is based on economical, energy-efficient 7.2K RPM disk technology. For cold random reads, a mirrored pair of 2 x 7.2K RPM disks offers about the same total disk rotation (and IOPS) as expensive and power-hungry 15K RPM disks, but in a much more economical package.
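The rotation argument can be checked with a few lines of arithmetic. This is only an illustration of the comparison made in the text; the helper name `rotations_per_second` is hypothetical, and the model ignores seek time and everything else that affects real IOPS.

```python
# Compare the total rotation budget of a mirrored pair of 7.2K RPM disks
# with that of a single 15K RPM disk. Rotations only; seek time is ignored.

def rotations_per_second(rpm, disks=1):
    """Total platter rotations per second across `disks` drives."""
    return rpm * disks / 60


mirrored_pair = rotations_per_second(7200, disks=2)
single_15k = rotations_per_second(15000)

print(f"2 x 7.2K RPM: {mirrored_pair:.0f} rotations/sec")
print(f"1 x 15K RPM:  {single_15k:.0f} rotations/sec")
print(f"ratio: {mirrored_pair / single_15k:.2f}")
```

The mirrored pair comes within a few percent of the 15K RPM disk, which is the sense in which the two are comparable for cold random reads.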

Moreover, the difference between the warm and cold random read runs shows that the Hybrid Storage Pool (HSP) provides a boost of about 30%, even on this workload, which randomly addresses a 400GB working set on 128GB of controller cache. The effective boost from the HSP can be much greater depending on the cacheability of the workload.
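For readers who want to reproduce the boost figure, here is a small sketch using the 7410 random read rows from the results table. Treating the 7410 as the system being discussed is an assumption based on the 128GB cache size mentioned in the text.

```python
# Warm vs. cold random read on the 7410, using the values from the table.
COLD_IOPS = 28559        # "Cold Random Read", 7410 column
WARM_IOPS = 36478        # "Random Read", 7410 column
CACHE_GB = 128           # controller cache quoted in the text
WORKING_SET_GB = 400     # 20GB per client x 20 clients

boost_pct = (WARM_IOPS - COLD_IOPS) / COLD_IOPS * 100
cache_fraction = CACHE_GB / WORKING_SET_GB

print(f"boost from caching: {boost_pct:.1f}%")
print(f"cache covers {cache_fraction:.0%} of the working set")
```

The computed boost is a little under 30%, obtained with a cache that holds roughly a third of the working set.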

If we consider an organisation in which the average mail message is 8K in size, our results show that we could consolidate 100000 employees on a single 7410, with each employee accessing new data every 3.6 seconds at a 70ms response time.
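The consolidation estimate follows from dividing the number of employees by the interval between accesses and comparing the result with the measured cold random read rate. The sketch below assumes that each access is a single 8K read that misses the cache; that mapping is an assumption made for illustration.

```python
# How many uncached 8K reads per second would 100000 employees generate?
# ASSUMPTION: one access = one cold 8K random read.
EMPLOYEES = 100000
SECONDS_BETWEEN_ACCESSES = 3.6
MEASURED_COLD_IOPS = 28559   # 7410 "Cold Random Read" from the table

required_iops = EMPLOYEES / SECONDS_BETWEEN_ACCESSES
headroom = MEASURED_COLD_IOPS - required_iops

print(f"required: {required_iops:.0f} IOPS")
print(f"measured: {MEASURED_COLD_IOPS} IOPS")
print(f"headroom: {headroom:.0f} IOPS")
```

The required rate sits just below the measured cold random read rate, which is where the 3.6 second figure comes from.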

Messaging systems are also big consumers of file creations. I've shown in the past how efficient ZFS can be at creating small files (Need Inodes ?). For the NFS protocol, file creation is a straining workload, but the 7000 storage line comes out not too bad, with more than 5000 filecreates per second per storage controller.


Performance can never be summarised with a few numbers, and we have just begun to scratch the surface here. The numbers presented here, along with the disruptive pricing of the Hybrid Storage Pool, will, I hope, go a long way to show the incredible power of the Open Storage architecture being proposed. And keep in mind that this performance is achievable using less expensive, less power-hungry SATA drives, and that every data service (NFS, CIFS, iSCSI, FTP, HTTP, etc.) offered by our Sun Storage 7000 servers is available at no additional software cost to you.

Disclosure Statement: Sun Microsystems generated results using filebench. Results reported 11/10/08 and 26/05/2009. Analysis done on June 6, 2009.


