mercredi janv. 21, 2015

Sequential Resilvering

In the initial days of ZFS some pointed out that ZFS resilvering was metadata driven and was therefore super fast : after all we only had to resilver data that was in-use compared to traditional storage that has to resilver entire disk even if there is no actual data stored. And indeed on newly created pools ZFS was super fast for resilvering.

But of course storage pools rarely stay empty. So what happened when pools grew to store large quantities of data ? Well we basically had to resilver most blocks present on a failed disk. So the advantage of only resilvering what is actually present is not much of a advantage, in real life, for ZFS.

And while ZFS based storage grew in importance, so did disk sizes. The disk sizes that people put in production are growing very fast showing the appetite of customers to store vast quantities of data. This is happening despite the fact that those disks are not delivering significantly more IOPS than their ancestors. As time goes by, a trend that has lasted forever, we have fewer and fewer IOPS available to service a given unit of data. Here ZFSSA storage arrays with TB class caches are certainly helping the trend. Disk IOPS don't matter as much as before because all of the hot data is cached inside ZFS. So customers gladly tradeoff IOPS for capacity given that ZFSSA deliver tons of cached IOPS and ultra cheap GB of storage.

And then comes resilvering...

So when a disk goes bad, one has to resilver all of the data on it. It is assured at that point that we will be accessing all of the data from surviving disks in the raid group and that this is not a highly cached set. And here was the rub with old style ZFS resilvering : the metadata driven algorithm was actually generating small random IOPS. The old algorithm was actually going through all of the blocks file by file, snapshot by snapshot. When it found an element to resilver, it would issue the IOPS necessary for that operation. Because of the nature of ZFS, the populating of those blocks didn't lead to a sequential workload on the resilvering disks.

So in a worst case scenario, we would have to issue small random IOPS covering 100% of what was stored on the failed disk and issue small random writes to the new disk coming in as a replacement. With big disks and very low IOPS rating comes ugly resilvering times. That effect was also compounded by a voluntary design balance that was strongly biased to protect application load. The compounded effect was month long resilvering.

The Solution

To solve this, we designed a subtly modified version of resilvering. We split the algorithm in two phases. The populating phase and the iterating phase. The populating phase is mostly unchanged over the previous algorithm except that, when encountering a block to resilver, instead of issuing the small random IOPS, we generate a new on disk log of them. After having iterated through all of the metadata and discovered all of the elements that need to be resilvered we now can sort these blocks by physical disk offset and issue the I/O in ascending order. This in turn allows the ZIO subsystem to aggregate adjacent I/O more efficiently leading to fewer larger I/Os issued to the disk. And by virtue of issuing I/Os in physical order it allows the disk to serve these IOPS at the streaming limit of the disks (say 100MB/sec) rather than being IOPS limited (say 200 IOPS).

So we hold a strategy that allows us to resilver nearly as fast as physically possible by the given disk hardware. With that newly acquired capability of ZFS, comes the requirement to service application load with a limited impact from resilvering. We therefore have some mechanism to limit resilvering load in the presence of application load. Our stated goal is to be able to run through resilvering at 1TB/day (1TB of data reconstructed on the replacing drive) even in the face of an active workload.

As disks are getting bigger and bigger, all storage vendors will see increasing resilvering times. The good news is that, since Solaris 11.2 and ZFSSA since 2013.1.2, ZFS is now able to run resilvering with much of the same disk throughput limits as the rest of non-ZFS based storage.

The sequential resilvering performance on a RAIDZ pool is particularly noticeable to this happy Solaris 11.2 customer saying It is really good to see the new feature work so well in practice.

mardi déc. 02, 2014


The initial topic from my list is reARC. This is a major rearchitecture of the code that manages ZFS in-memory cache along with its interface to the DMU. The ARC is of course a key enabler of ZFS high performance. As the scale of systems grow in memory size, CPU count and frequency, some major changes were required to the ARC to keep up with the pace. reARC is such a major body of work, I can only talk about of few aspects of the Wonders of ZFS Storage here.

In this article, I describe how the reARC project had impact on at least these 7 important aspects of it's operation:
  • Managing metadata
  • Handling ARC accesses to cloned buffers
  • Scalability of cached and uncached IOPS
  • Steadier ARC size under steady state workloads
  • Improved robustness for a more reliable code
  • Reduction of the L2ARC memory footprint
  • Finally, a solution to the long standing issue of I/O priority inversion
The diversity of topics covered serves as a great illustration of the incredible work handled by the ARC and a testament to the importance of ARC operations to all other ZFS subsystems. I'm truly amazed at how a single project was able to deliver all this goodness in one swoop.

No Meta Limits

Previously, the ARC claimed to use a two-state model:
  • "most recently used" (MRU)
  • "most frequently used" (MFU)
But it further subdivided these states into data and metadata lists.

That model, using 4 main memory lists, created a problem for ZFS. The ARC algorithm gave us only 1 target size for each of the 2 MRU and MFU states. The fact that we had 2 lists (data and metadata) but only 1 target size for the aggregate meant that when we needed to adjust the list down, we just didn't have the necessary information to perform the shrink. This lead to the presence of an ugly tunable arc_meta_limit, which was impossible to set properly and was a source of problems for customers.

This problem raises an interesting point and a pet peeve of mine. Many people I've interacted with over the years defended the position that metadata was worth special protection in a cache. After all, metadata is necessary to get to data, so it has intrinsically higher value and should be kept around more. The argument is certainly sensible on the surface, but I was on the fence about it.

ZFS manages every access through a least recently used scheme (LRU). New access to some block, data or metadata, puts that block back to the head of the LRU list, very much protected from eviction, which happens at the tail of the list.

When considering special protection for metadata, I've always stumbled on this question:

If some buffer, be it data or metadata, has not seen any accesses for sufficient amount of time, such that the block is now the tail of an eviction list, what is the argument that says that I should protect that block based on it's state ?
I came up blank on that question. If it hasn't been used, it can be evicted, period. Furthermore, even after taking this stance, I was made aware of an interesting fact about ZFS. Indirect blocks, the blocks that hold a set of block pointers to the actual data are non_evictable inasmuch as any of the block pointers they reference are currently in the ARC. In other words, if some data is in cache, it's metadata is also in the cache and furthermore, is non-evictable. This fact really reinforced my position that in our LRU cache handling, metadata doesn't need special protection from eviction.

And so, the reARC project actually took the same path. No more separation of data and metadata and no more special protection. This improvement led to fewer lists to manage and simpler code, such as shorter lock hold times for eviction. If you are tuning arc_meta_limit for legacy reasons, I advise you to try without this special tuning. It might be hurting you today and should be considered obsolete.

Single Copy Arc: Dedup of Memory

Yet another truly amazing capability of ZFS is it's infinite snapshot capabilities. There are just no limits, other than hardware, to the number of (software) snapshots that you can have.

What is magical here is not so much that ZFS can manage a large number of snapshots, but that it can do so without reference counting the blocks that are referenced through a snapshot. You might need to read that sentence again ... and check the blog entry.

Now fast forward to today where there is something new for the ARC. While we've always had the ability to read a block referenced from the N-different snapshots (or clones), the old ARC actually had to manage separate in-memory copies of each block. If the accesses were all reads, we'd needlessly instantiate the same data multiple times in memory.

With the reARC project and the new DMU to ARC interfaces, we don't have to keep multiple data copies. Multiple clones of the same data share the same buffers for read accesses and new copies are only created for a write access. It has not escaped our notice that this N-way pairing has immense consequences for virtualization technologies. The use of ZFS clones (or writable snapshots) is just a great way to deploy a large number of virtual machines. ZFS has always been able to store N clone copies with zero incremental storage costs. But reARC is taking this one step further. As VMs are used, the in-memory caches that are used to manage multiple VMs no longer need to inflate, allowing the space savings to be used to cache other data. This improvement allows Oracle to boast the amazing technology demonstration of booting 16000 VMs simultaneously.

Improved Scalability of Cached and Uncached OPs

The entire MRU/MFU list insert and eviction processes have been redesigned. One of the main functions of the ARC is to keep track of accesses, such that most recently used data is moved to the head of the list and the least recently used buffers make their way towards the tail, and are eventually evicted. The new design allows for eviction to be performed using a separate set of locks from the set that is used for insertion. Thus, delivering greater scalability. Moreover, through a very clever algorithm, we're able to move buffers from the middle of a list to the head without acquiring the eviction lock.

These changes were very important in removing long pauses in ARC operations that hampered the previous implementation. Finally, the main hash table was modified to use more locks placed on separate cache lines improving the scalability of the ARC operations. This lead to a boost in the cached and uncached maximum IOPs capabilities of the ARC.

Steadier Size, Smaller Shrinks

The growth and shrink model of the ARC was also revisited. The new model grows the ARC less aggressively when approaching memory pressure and instead recycles buffers earlier on. This recycling leads to a steadier ARC size and fewer disruptive shrink cycles. If the changing environment nevertheless requires the ARC to shrink, the amount by which we do shrink each time is reduced to make it less of a stress for each shrink cycle. Along with the reorganization of the ARC list locking, this has lead to a much steadier, dependable ARC at high loads.

ARC Access Hardening

A new ARC reference mechanism was created that allows the DMU to signify read or write intent to the ARC. This, in turn, enables more checks to be performed by the code. Therefore, catching bugs earlier in the process. A better separation of function between the DMU and the ARC is critical for ZFS robustness or hardening. In the new reARC mode of operation, the ARC now actually has the freedom relocate kernel buffers in memory in between DMU accesses to a cached buffer. This new feature proves invaluable as we scale to large memory systems.

L2ARC Memory Footprint Reduction

Historically, buffers were tracked in the L2ARC (the SSD based secondary ARC) using the same structure that was used by the main primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut down this amount by more than 2X to a bare minimum that now only requires about 80 bytes of metadata per L2 buffers. With the arrival of larger SSDs for L2ARC and a better feeding algorithm, this reduced L2ARC footprint is a very significant change for the Hybrid Storage Pool (HSP) storage model.

I/O Priority Inversion

One nagging behavior of the old ARC and ZIO pipeline was the so-called I/O priority inversion. This behavior was present mostly for prefetching I/Os, which was handled by the ZIO pipeline at a lower priority operation than, for example, a regular read issued by an application. Before reARC, the behavior was that after an I/O prefetch was issued, a subsequent read of the data that arrived while the I/O prefetch was still pending, would block waiting on the low priority I/O prefetch completion.

While it sounds simple enough to just boost the priority of the in-flight I/O prefetch, ARC/ZIO code was structured in such a way that this turned out to be much trickier than it sounds. In the end, the reARC project and subsequent I/O restructuring changes, put us on the right path regarding this particular quirkiness. Fixing the I/O priority inversion meant that fairness between different types of I/O was restored.


The key points that we saw in reARC are as follows:
  1. Metadata doesn't need special protection from eviction, arc_meta_limit has become an obsolete tunable.

  2. Multiple clones of the same data share the same buffers for great performance in a virtualization environment.
  3. We boosted ARC scalability for cached and uncached IOPs.
  4. The ARC size is now steadier and more dependable.
  5. Protection from creeping memory bugs is better.
  6. L2ARC uses a smaller footprint.
  7. I/Os are handled with more fairness in the presence of prefetches.
All of these improvements are available to customers of Oracle's ZFS Storage Appliances in any AK-2013 releases and recent Solaris 11 releases. And this is just topic number one. Stay tuned as we go about describing further improvements we're making to ZFS.

ZFS Performance boosts since 2010

Well, look who's back! After years of relative silence, I'd like to put back on my blogging hat and update my patient readership about the significant ZFS technological improvements that have integrated since Sun and ZFS became Oracle brands. Since there is so much to cover, I tee up this series of article with a short description of 9 major performance topics that have evolved significantly in the last years. Later, I will describe each topic in more details in individual blog entries. Of course, these selected advancements represents nowhere near an exhaustive list. There has been over 650 changes to the ZFS code in the last 4 years. My personal performance bias has selected topics that I know best. The designated topics are:
  1. reARC

  2. Scales the ZFS cache to TB class machines and CPU counts in thousands.
  3. Sequential Resilvering

  4. Converts a random workload to a sequential one.
  5. ZIL Pipelining

  6. Allows the ZIL to carve up smaller units of work for better pipelining and higher log device utilisation.
  7. It is the dawning of the age of the L2ARC

  8. Not only did we make the L2ARC persistent on reboot, we made the feeding process so much more efficient we had to slow it down.
  9. Zero Copy I/O Aggregation

  10. A new tool delivered by the Virtual Memory team allows the already incredible ZFS I/O aggregation feature to actually do its thing using one less copy.
  11. Scalable Reader/Writer locks

  12. Reader/Writer locks, used extensively by ZFS and Solaris, had their scalability greatly improved on on large systems.
  13. New thread Scheduling class

  14. ZFS transaction groups are now managed by a new type of taskqs which behave better managing bursts of cpu activity.
  15. Concurrent Metaslab Syncing

  16. The task of syncing metaslabs is now handled with more concurrency, boosting ZFS write throughput capabilities.
  17. Block Picking

  18. The task of choosing blocks for allocations has been enhanced in a number of ways, allowing us to work more efficiently at a much higher pool capacity percentage.
There you have it. I'm looking forward to reinvigorating my blog so stay tuned.

mardi févr. 28, 2012

Sun ZFS Storage Appliance : can do blocks, can do files too!

Last October, we demonstrated storage leadership in block protocols with our stellar SPC-1 result showcasing our top of the line Sun ZFS Storage 7420.

As a benchmark SPC-1's profile is close to what a fixed block size DB would actually be doing. See Fast Safe Cheap : Pick 3 for more details on that result. Here, for an encore, we're showing today how the ZFS Storage appliance can perform in a totally different environment : generic NFS file serving.

We're announcing that the Sun ZFS Storage 7320's reached 134,140 SPECsfs2008_nfs.v3 Ops/sec ! with 1.51 ms ORT running SPEC SFS 2008 benchmark.

Does price performance matters ? It does, doesn't it, See what Darius has to say about how we compare to Netapp : Oracle posts Spec SFS.

This is one step further in the direction of bringing to our customer true high performance unified storage capable of handling blocks and files on the same physical media. It's worth noting that provisioning of space between the different protocols is entirely software based and fully dynamic, that every stored element fully checksummed, that all stored data can be compressed with a number of different algorithms (including gzip), and that both filesystems and block based luns can be snapshot and cloned at their own granularity. All these manageability features available to you in this high performance storage package.

Way to go ZFS !

SPEC and SPECsfs are registered trademarks of Standard Performance Evaluation Corporation (SPEC). Results as of February 22, 2012, for more information see

lundi oct. 03, 2011

Fast, Safe, Cheap : Pick 3

Today, we're making performance headlines with Oracle's ZFS Storage Appliance.

SPC-1 : Twice the performance of NetApp at the same latency; Half the $/IOPS;

I'm proud to say that, yours truly, along with a lot of great teammates in Oracle, is not totally foreign to this milestone.

We are announcing that Oracle's 7420C cluster acheived 137000 SPC-1 IOPS with an average latency of less than 10 ms. That is double the results of NetApp's 3270A while delivering the same latency. As compared to the NetApp 3270 result, this is a 2.5x improvement in $/SPC-1-IOPS (2.99$/IOPS vs $7.48/IOPS). We're also showing that when the ZFS Storage Appliance runs at the rate posted by the 3270A (68034 SPC-1 IOPS), our latency of 3.26ms is almost 3X lower than theirs (9.16ms). Moreover, our result was obtained with 23700 GB of user level capacity (internally mirrored) for 17.3 $/GB while NetApp's , even using a space saving raid scheme, can only deliver 23.5$/GB. This is the price per GB of application data actually used in the benchmark. On top of that the 7420C still had 40% of space headroom whereas the 3270A was left with only 10% of free blocks.

These great results were at least partly made possible with the availability of 15K RPM Hard Disk Drives (HDD). Those are great to run the most demanding databases because they combine a large IOPS capability and are generally of smaller capacity. The ratio of IOPS/GB makes them ideal to store high intensity database modeled by SPC-1. On top of that, this concerted engineering effort lead to improved software not just for those running on 15K RPM. We actually used this benchmark to seek out how to increase the quality of our products. The preparation runs, after an initial diagnostic of some issue, we were attached to finding solutions that where not targeting the idiosyncrasies of SPC-1 but based on sound design decision. So instead of changing the default value of some internal parameter to a new static default, we actually changed the way the parameter worked so that our storage systems or all types and sizes would benefit.

So not only are we getting a great SPC-1 results, but all existing customers will benefit from this effect even if they are operating outside of the intense conditions created by the benchmark.

So what is SPC-1 ? It is one of the few benchmarks which counts for storage. It is maintained by Storage Performance Council (SPC). SPC-1 simulates multiple databases running on a centralized storage or storage cluster. But even if SPC-1 is a block based benchmark, within the ZFS Storage appliance, a block based FC or iSCSI volume is handled very much the same way as would be a large file subject to synchronous operation. And by Combining modern network technologies (Infiniband or 10Gbe Ethernet), the CPU power packed in the 7420C storage controllers and Oracle's custom dNFS technology for databases, one can truly acheive very high database transaction rates on top of the more manageable and flexible file based protocols.

The benchmarks defines three Application Storage Unit (ASU): ASU1 with a heavy 8KB block read/write component, ASU2 with a much lighter 8KB block read/write component, and ASU3 which is subject to hundreds of write streams. As such it's is not too far from a simulation of running hundreds of Oracle database onto a single system : ASU1 and ASU2 for datafiles and ASU3 for redolog storage.

The total size of the ASUs is constrained such that all of the stored data (including mirror protection and disk used for spares) must exceed 55% of all configured storage. The benchmark team is then free to decide how much total storage to configure. From that figure, 10% is given to ASU3 (redo log space) and the rest divided equally between heavily ASU1 and lightly used ASU2.

The benchmark team also has to select the SPC-1 IOPS throughput level it wishes to run. This is not a light decision given you want to balance high IOPS; low latency and $/user GB.

Once the target IOPS rate is selected, there are multiple criteria needed to pass a successful audit; one of the most critical is that you have to run at the specified IOPS rate for a whole 8 hour. Note that the previous specifications of the benchmark used by NetApp called for an 4 hour run. During that 8 hour run delivering a solid 137000 SPC-1 IOPS, the avg latency of must be less than 30ms (we did much better than that).

After this brutal 8 hour run, the benchmark then enters another critical phase: the workload is restarted (using a new randomly selected working set) and performance is measured for a 10 minute period. It is this 10 minute period that decides the official latency of the run.

When everything is said and done, you press the trigger; go to sleep and wake up to the result. As you could guess we were ecstatic that morning. Before that glorious day, for lack of a stronger word, a lot of hard work had been done during the extensive preparation runs. With little time, and normally not all of the hardware, one runs through series of run at incremental loads, making educated guesses as to how to improve the result. As you get more hardware you scale up the result tweaking things more or less until the final hour.

SPC-1, with it's requirement of less than 45% of unused space, is designed to trigger many disk level random read IOPS. Despite this inherent random pattern of the workload, we saw that our extensive caching architecture was as helpful for this benchmark as it is in real production workloads. While the 15K RPM HDDs normally levels off with random operation at a rate slightly above 300 IOPS, our 7420C, as a whole, could deliver almost 500 user-level SPC-1 IOPS per HDDs.

In the end one of the most satisfying aspect was to see that the data being managed by ZFS was stored rock solid on disk, properly checksummed, all data could be snapshot, compressed on demand, and delivering an impressively steady performance.

2X the absolute performance, 2.5X cheaper per SPC-1 IOPS, almost 3X lower latency, 30% cheaper per user GB with room to grow... So, If you have a storage decision coming and you need, FAST, SAFE, CHEAP : pick 3, take a fresh look at the ZFS Storage appliance.

SPC-1, SPC-1 IOPS, $/SPC-1 IOPS reg tm of Storage Performance Council (SPC). More info Sun ZFS Storage 7420 Appliance and

Oracle Sun ZFS Storage Appliance 7420 _ _As of October 3, 2011 Netapp FAS3270A _ _As of October 3, 2011

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

mercredi mai 26, 2010

Let's talk about Lun Alignment for 3 minutes

Recall that I had Lun alignment on my mind a few weeks ago. Nothing special about the ZFS storage appliance over any other storage. Pay attention to how you partition your luns, it can have a great impact on performance. Right Roch ? :

jeudi juin 11, 2009

Compared Performance of Sun 7000 Unified Storage Array Line

The Sun Storage 7410 Unified Storage Array provides high-performance for NAS environments. Sun's product can be used on a wide variety of applications. The Sun Storage 7410 Unified Storage Array with a _single_ 10 GbE connection delivers linespeed of the 10 GbE.

  • The Sun Storage 7410 Unified Storage Array delivers 1 GB/sec throughput performance.
  • The Sun Storage 7310 Unified Storage Array delivers over 500 MB/sec on streaming writes for backups and imaging applications.
  • The Sun Storage 7410 Unified Storage Array delivers over 22000 of 8K synchronous writes per second combining great DB performance and ease of deployment of Network Attached Storage while delivering the economics benefits of inexpensice SATA disks.
  • The Sun Storage 7410 Unified Storage Array delivers over 36000 of random 8K reads per second from a 400GB working set for great Mail application responsiveness. This corresponds to an entreprise of 100000 people with every employee accessing new data every 3.6 second consolidated on a single server.

All those numbers characterise a single head of a 7410 clusterable technology. The 7000 clustering technology stores all data in dual attached disk trays and no state is shared between cluster heads (see Sun 7000 Storage clusters). This means that an active-active cluster of 2 healthy 7410 will deliver 2X the performance posted here.

Also note that the performance posted here represent what is acheived under a very tightly defined constrained workload (see Designing 11 Storage metric) and those do not represent the performance limits of the systems. This is testing 1 x 10 GbE port only; each product can have 2 or 4 10 GbE ports, and by running load across multiple ports the server can deliver even higher performance. Achieving maximum performance is a separate exercise done extremely well by my friend Brendan :

Measurement Method

To measure our performance we used the open source Filebench tool accessible from SourceForge (Filebench on Measuring performance of a NAS storage is not an easy task. One has to deal with the client side cache which needs to be bypassed, the synchronisation of multiple clients, the presence of client side page flushing deamons which can turn asynchronous workloads into synchronous ones. Because our Storage 7000 line can have such large caches (up to 128GB of ram and more than 500GB of secondary caches) and we wanted to test disk responses, we needed to find a backdoor ways to flush those caches on the servers. Read Amithaba Filebench Kit entry on the topic in which he posts a link to the toolkit used to produce the numbers.

We recently released our first major software update 2000.Q2 and along with that a new lower cost clusterable 96 TB Storage, the 7310.

We report here the compared numbers of a 7310 with the latest software release to those previously obtained for the 7410, 7210 and 7110 systems each attached to an 18 to 20 client pool over a single 10Gbe interface with the regular frame ethernet (1500 Bytes). By the way, looking at brendan's results above, I encourage you to upgrade to use Jumbo Frames ethernet for even more performance and note that our servers can drive two 10Gbe at line speed.

Tested Systems and Metrics

The tested setup are :
        Sun Storage 7410, 4 x quad core: 16 cores @ 2.3 Ghz AMD.
        128GB of host memory.
        1 dual port 10Gbe Network Atlas Card. NXGE driver. 1500 MTU
        Streaming Tests:
        2 x J4400 JBOD,  44 x 500GB SATA drives 7.2K RPM, Mirrored pool, 
        3 Write optimized 18GB SSD, 2 Read Optimized 100GB SSD.
        IOPS tests:
        12 x J4400 JBOD, 280 x 500GB SATA drives 7.2K RPM, Mirrored pool,
        272 Data drives + 8 spares.
        8-Mirrored Write Optimised 18GB SSD, 6 Read Optimized 100GB SSD.
        FW OS : ak/generic@2008.11.20,1-0

        Sun Storage 7310,2 x quad core: 8 cores @ 2.3 Ghz AMD.
        32GB of host memory.
        1 dual port 10Gbe Network Atlas  Atlas Card (1 port used). NXGE driver. 1500 MTU
        4 x J4400 JBOD for a total 92 SATA drives  7.2K RPM
        43 mirrored pairs
        4 Write Optimised 18GB SSD, 2 Read Optimized 100GB SSD.
        FW OS : Q2 2009.,1-1.15

        Sun Storage 7210, 2 x quad core: 8 cores @ 2.3 Ghz AMD
        32 GB of host memory.
        1 dual port 10Gbe Network Atlas Atlas Card (1 port used). NXGE driver. 1500 MTU
        44  x 500 GB SATA drives  7.2K RPM, Mirrored pool,
        2 Write Optimised 18 GB SSD.
        FW OS : ak/generic@2008.11.20,1-0

        Sun Storage 7110, 2 x quad core opteron: 8 cores @ 2.3 Ghz AMD
        8 GB of host memory.
        1 dual port 10Gbe Network Atlas Atlas Card (1 port used). NXGE driver. 1500 MTU
        12 x 146 GB SAS drives, 10K RPM, in 3+1 Raid-Z pool.
        FW OS : ak/generic@2008.11.20,1-0

The newly released 7310 was tested with the most recent software revision and that certainly is giving the 7310 an edge over it's peers. The 7410 on the other hand was measured here managing a much large contingent of storage, including mirrored Logzillas and 3 times as many JBODs and that is expected to account for some of the performance delta being observed.

    Metrics Short Name
    1 thread per client streaming cached reads Stream Read light
    1 thread per client streaming cold cache reads Cold Stream Read light
    10 threads per client streaming cached reads Stream Read
    20 threads per client streaming cold cached reads Cold Stream Read
    1 thread per client streaming write Stream Write light
    20 threads per client streaming write Stream Write
    128 threads per client 8k synchronous writes Sync write
    128 threads per client 8k random read Random Read
    20 threads per client 8k random read on cold caches Cold Random Read
    8 threads per client 8k small file create IOPS Filecreate

There are 6 read tests, 2 writes test and 1 synchronous write test which overwrites it's data files as a database would. A final filecreate test complete the metrics. Test executes against 20GB working set _per client_ times 18 to 20 clients. There are 4 sets used in total running over independent shares for a total of 80GB per client. So before actual runs at taken, we create all working sets or 1.6 TB of precreated data. Then before each run, we clear all caches on the clients and server.

In each of the 3 groups of 2 read tests, the first one benefits from no caching at all and the throughput delivered to the client over the network is observed to come from disk. The test runs for N seconds priming data in the Storage caches. A second run (non-cold) is then started after clearing the client side caches. Those test will see the 100% of the data delivered over the network link but not all of it is coming off the disks. Streaming test will race through the cached data and then finish off reading from disks. The random read test can also benefit from increasing cached responses as the test progresses. The exact caching characteristic of a 7000 lines will depend on a large number of parameters including your application access pattern. Numbers here reflect the performance of fully randomized test over 20GB per client x 20 clients or a 400GB working set. Upcoming studies will include more data (showing even higher performance) for workloads with higher cache hit ratio than those used here.

In a Storage 7000 server, disks are grouped together in one pool and then individual Shares are creates. Each share has access to all disk resource subject to quota (a minimum) and reservation (a maximum) that might be set. One important setup parameter associated with each share is the DB record size. It is generally better for IOPS test to use 8K records and for streaming test to use 128K records. The recordsize can be dynamically set based on expected usage.

The tests shown here were obtained with NFSv4 the default for Solaris clients (NFSv3 is expected to come out slightly better). The clients were running Solaris 10, with tuned tcp_recv_hiwat of 400K and dopageflush=0 to prevent buffered writes from being converted into synchronous writes.

Compared Results of the 7000 Storage Line

    NFSv4 Test 7410 Head
    Mirrored Pool
    7310 Head
    Mirrored Pool
    7210 Head
    Mirrored Pool
    7110 Head
    3+1 Raid-Z
    Cold Stream Read light 915 MB/sec 685 MB/sec 719 MB/sec 378 MB/sec
    Stream Read light 1074 MB/sec 751 MB/sec 894 MB/sec 416 MB/sec
    Cold Stream Read 959 MB/sec 598 MB/sec 752 MB/sec 329 MB/sec
    Stream Read 1030 MB/sec 620 MB/sec 792 MB/sec 386 MB/sec
    Stream Write light 480 MB/sec 507 MB/sec 490 MB/sec 226 MB/sec
    Stream Write 447 MB/sec 526 MB/sec 481 MB/sec 224 MB/sec
    Sync write 22383 IOPS 8527 IOPS 10184 IOPS 1179 IOPS
    Filecreate 5162 IOPS 4909 IOPS 4613 IOPS 162 IOPS
    Cold Random Read 28559 IOPS 5686 IOPS 4006 IOPS 1043 IOPS
    Random Read 36478 IOPS 7107 IOPS 4584 IOPS 1486 IOPS
    Per Spindle IOPS 272 Spindles 86 Spindles 44 Spindles 12 Spindles
    Cold Random Read 104 IOPS 76 IOPS 91 IOPS 86 IOPS
    Random Read 134 IOPS 94 IOPS 104 IOPS 123 IOPS


The data shows that the entire Sun Storage 7000 line are throughput workhorse delivering 10 Gbps level NAS services per cluster head nodes, using a single Network Interface and single IP address for easy integration into your existing network.

As with other storage technology write streaming performance require more involvement from the storage controller and this leads to about 50% less write throughput compared to read throughput.

The use of write optimized SSD in the 7410, 7310 and 7220 also give this storage very high synchronous write capabilities. This is one of the most interesting result as it maps to database performance. The ability to sustain 24000 O_DSYNC writes at 192MB/sec of synchronized user data using only 48 inexpensive sata disks and 3 write optimized SSD is one of the many great performance characteristics of this novel storage system.

Random Read test generally map directly to individual disk capabilities and is a measure of total disk rotations. The cold runs shows that all our platforms are delivering data at the expected 100 IOPS per spindle for those SATA disks. Recall that our offering is based on the economical energy efficient 7.2 RPM disk technology. For cold random reads, a mirrored pair of 2 x 7.2K RPM offers the same total disk rotation (and IOPS) as expensive and power hungry 15 K RPM disks but in a much more economical package.

Moreover the difference between the warm and cold random read runs is showing that the Hybrid Storage Pool (HSP) is providing a 30% boost even on this workload that addresses randomly 400GB working set on 128GB of controller cache. The effective boost from the HSP can be much greater depending on the cacheability of workloads.

If we consider an organisation in which the avg mail message is 8K in size, our results show that we could consolidate 100000 employees on a single 7410 storage where each employee is accessing new data every 3.6 seconds with 70ms response time.

Messaging system are also big consumer of file creations, I've shown in the past how efficient ZFS can be at creating small files (Need Inodes ?). For the NFS protocol, file creation is a straining workload but the 7000 storage line comes out not too bad with more than 5000 filecreates per second per storage controller.


Performance Can never be summerised with a few numbers and we have just begun to scratch the surface here. The numbers presented here along with the disruptive pricing of the Hybrid Storage Pool will, I hope, go a long way to show the incredible power of the Open Storage architecture being proposed. And keep in mind that this performance is achievable using less expensive, less power hungry SATA drives and that every data services : NFS, CIFS, iSCSI, ftp, HTTP etc. offered by our Sun Storage 7000 servers are available at 0 additional software cost to you.

Disclosure Statement: Sun Microsystem generated results using filebench. Results reported 11/10/08 and 26/05/2009 Analysis done on June 6 2009.

lundi nov. 10, 2008

Sun Storage 7000 Performance invariants

I see many reports about running campains of test measuring performance over a test matrix. One problem with this approach is of course the Matrix. That matrix never big enough for the consumer of the information ("can you run this instead ?").

A more useful approach is to think in terms of performance invariants. We all know that 7.2K RPM disk drive can do 150-200 IOPS as an invariant and disks will have throughput limit such as 80MB/sec. Thinking in terms of those invariant helps in extrapolating performance data (with caution) and observing breakdowns in invariant is often a sign that something else needs to be root caused.

So using 11 metrics and our Performance engineering effort what can be our guiding invariants ? Bearing in mind that it is expected that those are rough estimate. For real measured numbers check out Amitabha Banerjee's excellent post on Analyzing the Sun Storage 7000.

Streaming : 1 GB/s on server and 110 MB/sec on client

For read Streaming wise, we're observing that 1GB/s is somewhat our guiding number for read streaming . This can be acheived with fairly small number of client and threads but will be easier to reach if the data is prestaged in server caches. A client normally running 1Gbe network cards is able to extract 110 MB/sec rather easily. Read streaming will be easier to acheived with the larger 128K records probably due to the lesser CPU demand. While our results are with regular 1500 Bytes ethernet frames, using jumbo frame will also make this limit easier to reach or even break. For a mirrored pool, data needs to be sent twice to the storage and we see a reduction of about 50% for write streaming workloads.

Random Read I/Os per second : 150 random read IOPS per mirrored disks

This is probably a good guiding light also. When going to disks that will be a reasonable expectation. But here caching can radically change this. Since we can configure up to 128GB of host ram and 4 times that much of secondary caches, there are opportunity to break this barrier. But when going to spindles that needs to be kept under consideration. We also know that Raid-z spreads records to all disks. So the 150 IOPS limit basically applies to raid-z groups. Do plan to have many groups to service random reads.

Random Read I/Os per second using SSDs : 3100 Read IOPS per Read Optimized SSD

In some instances, data after eviction from main memory will be kept in secondary caches. Small files and tuned recordsize filesystem are good target workload for this. Those read-optimized SSD can restitute this data at a rate of 3100 IOPS L2 ARC). More importantly so it can do so at much reduced latency meaning that lightly threaded workloads will be able to acheive high throughput.

Synchronous writes per second : 5000-9000 Synchronous write per Write Optimized SSD

Synchronous writes can be generated by a O_DSYNC write (database) or just as part of the NFS protocol (such as the tar extract : open,write,close workloads). Those will reach the NAS server and be coalesced in a single transaction with the separate intent log. Those SSD devices are great latency accelerators but are still devices with a max throughput of around 110 MB/sec. However our code actually detects when the SSD devices become the bottleneck and will divert some of the I/O request to use the main storage pool. The net of all this is a complex equation but we've observed easily 5000-8000 synchronous writes per SSD up to 3 devices (or 6 in mirrored pairs). Using smaller working set which creates less competition for CPU resources we've even observed 48K synchronous writes per second.

Cycles per Bytes : 30-40 cycles per byte for NFS and CIFS

Once we include the full NFS or CIFS protocol, the efficiency was observed to be in the 30-40 cycles per byte (8 to 10 of those coming from the pure network component at regulat 1500 bytes MTU). More studies are required to figure out the extent to which this is valid but it's an interesting way to look at the problem. Having to run disk I/O vs being serviced directly from cached data is expected to exert an additional 10-20 cycles per byte. Obviously for metadata test in which small amount of byte is transfered per operation, we probably need to come up with a cycles/MetaOps invariant but that is still TBD.

Single Client NFS throughput : 1 TCP Window per round trip latency.

This is one fundamental rule of network throughput but it's a good occasion to refresh this in everyones mind. Clients, at least solaris clients, will establish a single TCP connection to a server. On that connection there can be a large number of unreleated requests as NFS is a very scalable protocol. However, a single connection will transport data at a maximum speed of a "socket buffer" divided by the round trip latency. Since today's network speed, particularly in wide area networks have grown somewhat faster than default socket buffers we can see such things becoming performance bottleneck. Now given that I work in Europe but my tests systems are often located in california, I might be a little more sensitive than most to this fact. So one important change we did early on, in this project was to simply bump up the default socket buffers in the 7000 line to 1MB. However for read throughput under similar conditions, we can only advise you to do the same to your client infrastructure.



« mars 2015

No bookmarks in folder