Tuesday Mar 10, 2009

SSDs for HSPs

We're announcing a couple of new things in the flash SSD space. First, support the Intel X25-E SSD in a bunch of our servers. This can be used to create a Hybrid Storage Pool like in the Sun Storage 7000 series, or as just a little flash for high performance / low power / tough environmentals.

Second, we're introducing a new open standard with the Open Flash Module. This creates a new form factor for SSDs bringing flash even closer to the CPU for higher performance and tighter system integration. SSDs in HDD form factors were a reasonable idea to gain market acceptance in much the same way as you first listened to your iPod over your car stereo with that weird tape adapter. Now the iPod is a first class citizen in many cars and, with the Open Flash Module, flash has found a native interface and form factor. This is a building block that we're very excited about, and it was designed specifically for use with ZFS and the Hybrid Storage Pool. Stay tuned: these flash miniDIMMs as they're called will be showing up in some interesting places soon enough. Speaking personally, this represents an exciting collaboration of hardware and software, and it's gratifying to see Sun showing real leadership around flash through innovation.

Saturday Mar 07, 2009

Presentation: Hybrid Storage Pools and SSDs

Today at The First Workshop on Integrating Solid-state Memory into the Storage Hierarchy (WISH 2009) I gave a short talk about our experience integrating flash into the storage hierarchy and the interaction with SSDs. In the talk I discussed the recent history of flash SSDs as well as some key areas for future improvements. You can download it here. The workshop was terrific with some great conversations about the state of solid state storage and its future directions; thank you to the organizers and participants.

Friday Mar 06, 2009

Fishworks VM: the 7000 series on your laptop

In May of 2007 I was lined up to give my first customer presentation of what would become the Sun Storage 7000 series. I inherited a well-worn slide deck describing the product, but we had seen the reactions of prospective customers who saw the software live and had a chance to interact with features such as Analytics; no slides would elicit that kind of response. So with some tinkering, I hacked up our installer and shoe-horned the prototype software into a virtual machine. The live demonstration was a hit despite some rocky software interactions.

As the months passed, our software became increasingly aware of our hardware platforms; the patches I had used for the virtual machine version fell into disrepair. Racing toward the product launch, neither I nor anyone else in the Fishworks group had the time to nurse it back to health. I found myself using months old software for a customer demo — a useful tool, but embarrassing given the advances we had made. We knew that the VM was going to be great for presentations, and we had talked about releasing a version to the general public, but that, we thought, was something that we could sort out after the product launch.

In the brief calm after the frenetic months finishing the product and just a few days before the launch in Las Vegas, our EVP of storage, John Fowler, paid a visit to the Fishworks office. When we mentioned the VM version, his eyes lit up at the thought of how it would help storage professionals. Great news, but we realized that the next few days had just become much busier.

Creating the VM version was a total barn-raising. Rather than a one-off with sharp edges, adequate for a canned demo, we wanted to hand a product to users that would simulate exactly a Sun Storage 7000 series box. In about three days, everyone in the group pitched in to build what was essentially a brand new product and platform complete with a hardware view conjured from bits of our actual appliances.

After a frenetic weekend in November, the Sun Unified Storage Simulator was ready in time for the launch. You can download it here for VMware. We had prepared versions for VirtualBox as well as VMware, preferring VirtualBox since it's a Sun product; along the way we found some usability issues with the VirtualBox version — we were pushing both products beyond their design center and VMware handled it better. Rest assured that we're working to resolve those issues and we'll release the simulator for VirtualBox just as soon as it's ready. Note that we didn't limit the functionality at all; what you see is exactly what you'll get with an actual 7000 series box (though the 7000 series will deliver much better performance than a laptop). Analytics, replication, compression, CIFS, iSCSI are all there; give it a try and see what you think.

Monday Mar 02, 2009

More from the storage anarchist

In my last blog post I responded to Barry Burke author of the Storage Anarchist blog. I was under the perhaps naive impression that Barry was an independent voice in the blogosphere. In fact, he's merely Storage Anarchist by night; by day he's the mild-mannered chief strategy officer for EMC's Symmetrix Products Group — a fact notable for its absence from Barry's blog. In my post, I observed that Barry had apparently picked his horse in the flash race and Chris Caldwell commented that "it would appear that not only has he chosen his horse, but that he's planted squarely on its back wearing an EMC jersey." Indeed.

While looking for some mention of his employment with EMC, I found this petard from Barry Burke chief strategy officer for EMC's Symmetrix Products Group:

And [the "enterprise" differentiation] does matter – recall this video of a Fishworks JBOD suffering a 100x impact on response times just because the guy yells at a drive. You wouldn't expect that to happen with an enterprise class disk drive, and with enterprise-class drives in an enterprise-class array, it won't.

Barry, we wondered the same thing so we got some time on what you'd consider an enterprise-class disk drive in an enterprise-class array from an enterprise-class vendor. The results were nearly identical (of course, measuring latency on other enterprise-class solutions isn't nearly as easy). It turns out drives don't like being shouted at (it's shock, not the traditional RV drives compensate for). That enterprise-class rig was not an EMC Symmetrix though I'd salivate over the opportunity to shout at one.

Thursday Feb 26, 2009

Dancing with the Anarchist

Barry Burke, the Storage Anarchist, has written an interesting roundup ("don't miss the amazing vendor flash dance") covering the flash strategies of some players in the server and storage spaces. Sun's position on flash comes out a bit mangled, but Barry can certainly be forgiven for missing the mark since Sun hasn't always communicated its position well. Allow me to clarify our version of the flash dance.

Barry's conclusion that Sun sees flash as well-suited for the server isn't wrong — of course it's harder to drive high IOPS and low latency outside a single box. However we've also proven not only that we see a big role for flash in storage, but that we're innovating in that realm with the Hybrid Storage Pool (HSP) an architecture that seamlessly integrates flash into the storage hierarchy. Rather than a Ron Popeil-esque sales pitch, let me take you through the genesis of the HSP.

The HSP is something we started to develop a bit over two years ago. By January of 2007, we had identified that a ZFS intent-log device using flash would greatly improve the performance of the nascent Sun Storage 7000 series in a way that was simpler and more efficient that some other options. We started getting our first flash SSD samples in February of that year. With SSDs on the brain, we started contemplating other uses and soon came up with the idea of using flash as a secondary caching tier between the DRAM cache (the ZFS ARC) and disk. We dubbed this the L2ARC.

At that time we knew that we'd be using mostly 7200 RPM disks in the 7000 series. Our primary goal with flash was to greatly improve the performance of synchronous writes and we addressed this with the flash log device that we call Logzilla. With the L2ARC we solved the other side of the performance equation by improving read IOPS by leaps and bounds over what hard drives of any rotational speed could provide. By August of 2007, Brendan had put together the initial implementation of the L2ARC, and, combined with some early SSD samples — Readzillas — our initial enthusiasm was borne out. Yes, it's a caching tier so some workloads will do better than others, but customers have been very pleased with their results.

These two distinct uses of flash comprise the Hybrid Storage Pool. In April 2008 we gave our first public talk about the HSP at the IDF in Shanghai, and a year and a bit after Brendan's proof of concept we shipped the 7410 with Logzilla and Readzilla. It's important to note that this system achieves remarkable price/performance through its marriage of commodity disks with flash. Brendan has done a terrific job of demonstrating the performance enabled by the HSP on that system.

While we were finishing the product, the WSJ reported that EMC was starting to use flash drives into their products. I was somewhat deflated initially until it became clear that EMC's solution didn't integrate flash into the storage hierarchy nearly as seamlessly or elegantly as we had with the HSP; instead they had merely replaced their fastest, most expensive drives with faster and even more expensive SSDs. I'll disagree with the Storage Anarchist's conclusion: EMC did not start the flash revolution nor are they leading the way (though I don't doubt they are, as Barry writes, "Taking Our Passion, And Making It Happen"). EMC though has done a great service to the industry by extolling the virtues of SSDs and, presumably, to EMC customers by providing a faster tier for HSM.

In the same article, Barry alludes to some of the problems with EMC's approach using SSDs from STEC:

STEC rates their ZeusIOPS drives at something north of 50,000 read IOPS each, but as I have explained before, this is a misleading number because it’s for 512-byte blocks, read-only, without the overhead of RAID protection. A more realistic expectation is that the drives will deliver somewhere around 5-6000 4K IOPS (4K is a more typical I/O block size).
The Hybrid Storage Pool avoids the bottlenecks associated with a tier 0 approach, drives much higher IOPS, scales, and makes highly efficient economical use of the resources from flash to DRAM and disk. Further, I think we'll be able to debunk this notion that the enterprise needs its own class of flash devices by architecting commodity flash to build an enterprise solution. There are a lot of horses in this race; Barry has clearly already picked his, but the rest of you may want survey the field.

Monday Feb 23, 2009

HSP talk at the OpenSolaris Storage Summit

The organizers of the OpenSolaris Storage Summit asked me to give a presentation about Hybrid Storage Pools and ZFS. You can download the presentation titled ZFS, Cache, and Flash. In it, I talk about flash as a new caching tier in the storage hierarchy, some of the innovations in ZFS to enable the HSP, and an aside into the how we implement an HSP in the Sun Storage 7410.

Thursday Feb 19, 2009

Flash workshop at ASPLOS

Before this year's ASPLOS conference, I'll be speaking at the First Workshop on Integrating Solid-state Memory into the Storage Hierarchy (WISH2009). It looks like a great program with some terrific papers on how to use flash effectively and how to combine various solid state technologies to complement conventional storage.

I'll be talking about the work we've done at Sun on the Hybrid Storage Pool. In addition I'll discuss some of the new opportunities that flash and other solid state technologies create. The workshop takes place in Washington D.C. on March 7th. Hope to see you there.

In semi-related news, along with Eric and Mike I'll be speaking at the OpenSolaris Storage Summit in San Francisco this coming Monday the 23rd.

Update March 7, 2009: I've subsequently posted the slides I used for the WISH 2009 and OpenSolaris Storage Summit 2009 talks.

Monday Dec 01, 2008

Casting the shadow of the Hybrid Storage Pool

The debate, calmly waged, on the best use of flash in the enterprise can be summarized as whether flash should be a replacement for disk, acting as primary storage, or it should be regarded as a new, and complementary tier in the storage hierarchy, acting as a massive read cache. The market leaders in storage have weighed in the issue, and have declared incontrovertibly that, yes, both are the right answer, but there's some bias underlying that equanimity. Chuck Hollis, EMC's Global Marketing CTO, writes, that "flash as cache will eventually become less interesting as part of the overall discussion... Flash as storage? Well, that's going to be really interesting." Standing boldly with a foot in each camp, Dave Hitz, founder and EVP at Netapp, thinks that "Flash is too expensive to replace disk right away, so first we'll see a new generation of storage systems that combine the two: flash for performance and disk for capacity." So what are these guys really talking about, what does the landscape look like, and where does Sun fit in all this?

Flash as primary storage (a.k.a. tier 0)

Integrating flash efficiently into a storage system isn't obvious; the simplest way is as a direct replacement for disks. This is why most of the flash we use today in enterprise systems comes in units that look and act just like hard drives: SSDs are designed to be drop in replacements. Now, a flash SSD is quite different than a hard drive — rather than a servo spinning platters while a head chatters back and forth, an SSD has floating gates arranged in blocks... actually it's probably simpler to list what they have in common, and that's just the form factor and interface (SATA, SAS, FC). Hard drives have all kind of properties that don't make sense in the world of SSDs (e.g. I've seen an SSD that reports it's RPM telemetry as 1), and SSDs have their own quirks with no direct analog (read/write asymmetry, limited write cycles, etc). SSD venders, however, manage to pound these round pegs into their square holes, and produce something that can stand in for an existing hard drive. Array vendors are all too happy to attain buzzword compliance by stuffing these SSDs into their products.

The trouble with HSM is the burden of the M.

Storage vendors already know how to deal with a caste system for disks: they striate them in layers with fast, expensive 15K RPM disks as tier 1, and slower, cheaper disks filling out the chain down to tape. What to do with these faster, more expensive disks? Tier-0 of course! An astute Netapp blogger asks, "when the industry comes up with something even faster... are we going to have tier -1" — great question. What's wrong with that approach? Nothing. It works; it's simple; and we (the computing industry) basically know how to manage a bunch of tiers of storage with something called hierarchical storage management. The trouble with HSM is the burden of the M. This solution kicks the problem down the road, leaving administrators to figure out where to put data, what applications should have priority, and when to migrate data.

Flash as a cache

The other school of thought around flash is to use it not as a replacement for hard drives, but rather as a massive cache for reading frequently accessed data. As I wrote back in June for CACM, "this new flash tier can be thought of as a radical form of hierarchical storage management (HSM) without the need for explicit management. Tersely, HSM without the M. This idea forms a major component of what we at Sun are calling the Hybrid Storage Pool (HSP), a mechanism for integrating flash with disk and DRAM to form a new, and — I argue — superior storage solution.

Let's set aside the specifics of how we implement the HSP in ZFS — you can read about that elsewhere. Rather, I'll compare the use of flash as a cache to flash as a replacement for disk independent of any specific solution.

The case for cache

It's easy to see why using flash as primary storage is attractive. Flash is faster than the fastest disks by at least a factor of 10 for writes and a factor of 100 for reads measured in IOPS. Replacing disks with flash though isn't without nuance; there are several inhibitors, primary among them is cost. The cost of flash continues to drop, but it's still much more expensive than cheap disks, and will continue to be for quite awhile. With flash as primary storage, you still need data redundancy — SSDs can and do fail — and while we could use RAID with single- or double-device redundancy, that would cleave the available IOPS by a factor of the stripe width. The reason to migrate to flash is for performance so it wouldn't make much sense to hang a the majority of that performance back with RAID. The remaining option, therefore, is to mirror SSDs whereby the already high cost is doubled.

It's hard to argue with results, all-flash solutions do rip. If money were no object that may well be the best solution (but if cost truly wasn't a factor, everyone would strap batteries to DRAM and call it a day).

Can flash as a cache do better? Say we need to store a 50TB of data. With an all-flash pool, we'll need to buy SSDs that can hold roughly 100TB of data if we want to mirror for optimal performance, and maybe 60TB if we're willing to accept a far more modest performance improvement over conventional hard drives. Since we're already resigned to cutting a pretty hefty check, we have quite a bit of money to play with to design a hybrid solution. If we were to provision our system with 50TB of flash and 60TB of hard drives we'd have enough cache to retain every byte of active data in flash while the disks provide the necessary redundancy. As writes come in the filesystem would populate the flash while it writes data persistently to disk. The performance of this system would be epsilon away from the mirrored flash solution as read requests would only go to disk in the case of faults from the flash devices. Note that we never rely on correctness from the flash; it's the hard drives that provide reliability.

The performance of this system would be epsilon away from the mirrored flash solution...

The hybrid solution is cheaper, and it's also far more flexible. If a smaller working set accounted for a disproportionally large number of reads, the total IOPS capacity of the all-flash solution could be underused. With flash as a cache, data could be migrated to dynamically distribute load, and additional cache could be used to enhance the performance of the working set. It would be possible to use some of the same techniques with an all-flash storage pool, but it could be tricky. The luxury of a cache is that the looser contraints allow for more aggressive data manipulation.

Building on the idea of concentrating the use of flash for hot data, it's easy to see how flash as a cache can improve performance even without every byte present in the cache. Most data doesn't require 50μs random access latency over the entire dataset, users would see a significant performance improvement with just the active subset in a flash cache. Of course, this means that software needs to be able to anticipate what data is in use which probably inspired this comment from Chuck Hollis: "cache is cache — we all know what it can and can't do." That may be so, but comparing an ocean of flash for primary storage to a thimbleful of cache reflects fairly obtuse thinking. Caching algorithms will always be imperfect, but the massive scale to which we can grow a flash cache radically alters the landscape.

Even when a working set is too large to be cached, it's possible for a hybrid solution to pay huge dividends. Over at Facebook, Jason Sobel (a colleague of mine in college) produced an interesting presentation on their use of storage (take a look at Jason's penultimate slide for his take on SSDs). Their datasets are so vast and sporadically accessed that the latency of actually loading a picture, say, off of hard drives isn't actually the biggest concern, rather it's the time it takes to read the indirect blocks, the metadata. At facebook, they've taken great pains to reduce the number of dependent disk accesses from fifteen down to about three. In a case such as theirs, it would never be economical store or cache the full dataset on flash and the working set is similarly too large as data access can be quite unpredictable. It could, however, be possible to cache all of their metadata in flash. This would reduce the latency to an infrequently accessed image by nearly a factor of three. Today in ZFS this is a manual setting per-filesystem, but it would be possible to evolve a caching algorithm to detect a condition where this was the right policy and make the adjustment dynamically.

Using flash as a cache offers the potential to do better, and to make more efficient and more economical use of flash. Sun, and the industry as a whole have only just started to build the software designed to realize that potential.

Putting products before words

At Sun, we've just released our first line of products that offer complete flash integration with the Hybrid Storage Pool; you can read about that in my blog post on the occassion of our product launch. On the eve of that launch, Netapp announced their own offering: a flash-laden PCI card that plays much the same part as their DRAM-based Performance Acceleration Module (PAM). This will apparently be available sometime in 2009. EMC offers a tier 0 solution that employs very fast and very expensive flash SSDs.

What we have in ZFS today isn't perfect. Indeed, the Hybrid Storage Pool casts the state of the art forward, and we'll be catching up with solutions to the hard questions it raises for at least a few years. Only then will we realize the full potential of flash as a cache. What we have today though integrates flash in a way that changes the landscape of storage economics and delivers cost efficiencies that haven't been seen before. If the drives manufacturers don't already, it can't be long until they hear the death knell for 15K RPM drives loud and clear. Perhaps it's cynical or solipsistic to conclude that the timing of Dave Hitz's and Chuck Hollis' blogs were designed to coincide with the release of our new product and perhaps take some of the wind out of our sails, but I will — as the commenters on Dave's Blog have — take it as a sign that we're on the right track. For the moment, I'll put my faith in this bit of marketing material enigmatically referenced in a number of Netapp blogs on the subject of flash:

In today's competitive environment, bringing a product or service to market faster than the competition can make a significant difference. Releasing a product to market in a shorter time can give you first-mover advantage and result in larger market share and higher revenues.

Wednesday Nov 19, 2008

Sun Storage 7410 space calculator

The Sun Storage 7410 is our expandable storage appliance that can be hooked up to anywhere from one and twelve JBODs with 24 1TB disks. With all those disks we provide the several different options for how to arrange them into your storage pool: double-parity RAID-Z, wide-strip double-parity RAID-Z, mirror, striped, and single-parity RAID-Z with narrow stripes. Each of these options has a different mix of availability, performance, and capacity that are described both in the UI and in the installation documentation. With the wide array of supported configurations, it can be hard to know how much usable space each will support.

To address this, I wrote a python script that presents a hypothetical hardware configuration to an appliance and reports back the available options. We use the logic on the appliance itself to ensure that the results are completely accurate as the same algorithms would be applied as when then the physical pallet of hardware shows up. This, of course, requires you to have an appliance available to query — fortunately, you can run a virtual instance of the appliance on your laptop.

You can download the sizecalc.py here; you'll need python installed on the system where you run it. Note that the script uses XML-RPC to interact with the appliance, and consequently it relies on unstable interfaces that are subject to change. Others are welcome to interact with the appliance at the XML-RPC layer, but note that it's unstable and unsupported. If you're interested in scripting the appliance, take a look at Bryan's recent post. Feel free to post comments here if you have questions, but there's no support for the script, implied, explicit, unofficial or otherwise.

Running the script by itself produces a usage help message:

$ ./sizecalc.py
usage: ./sizecalc.py [ -h <half jbod count> ] <appliance name or address>
    <root password> <jbod count>
Remember that you need a Sun Storage 7000 appliance (even a virtual one) to execute the capacity calculation. In this case, I'll specify a physical appliance running in our lab, and I'll start with a single JBOD (note that I've redacted the root password, but of course you'll need to type in the actual root password for your appliance):
$ ./sizecalc.py catfish \*\*\*\*\* 1
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      11       2            22                18
raidz2 wide    False      23       1            23                21
mirror         False       2       2            22                11
stripe         False       0       0            24                24
raidz1         False       4       4            20                15
Note that with only one JBOD no configurations support NSPF (No Single Point of Failure) since that one JBOD is always a single point of failure. If we go up to three JBODs, we'll see that we have a few more options:
$ ./sizecalc.py catfish \*\*\*\*\* 3
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      13       7            65                55
raidz2          True       6       6            66                44
raidz2 wide    False      34       4            68                64
raidz2 wide     True       6       6            66                44
mirror         False       2       4            68                34
mirror          True       2       4            68                34
stripe         False       0       0            72                72
raidz1         False       4       4            68                51
In this case we have to give up a bunch of capacity in order to attain NSPF. Now let's look at the largest configuration we support today with twelve JBODs:
$ ./sizecalc.py catfish \*\*\*\*\* 12
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      14       8           280               240
raidz2          True      14       8           280               240
raidz2 wide    False      47       6           282               270
raidz2 wide     True      20       8           280               252
mirror         False       2       4           284               142
mirror          True       2       4           284               142
stripe         False       0       0           288               288
raidz1         False       4       4           284               213
raidz1          True       4       4           284               213

The size calculator also allows you to model a system with Logzilla devices, write-optimized flash devices that form a key part of the Hybrid Storage Pool. After you specify the number of JBODs in the configuration, you can include a list of how many Logzillas are in each JBOD. For example, the following invocation models twelve JBODs with four Logzillas in the first 2 JBODs:

$ ./sizecalc.py catfish \*\*\*\*\* 12 4 4
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      13       7           273               231
raidz2          True      13       7           273               231
raidz2 wide    False      55       5           275               265
raidz2 wide     True      23       4           276               252
mirror         False       2       4           276               138
mirror          True       2       4           276               138
stripe         False       0       0           280               280
raidz1         False       4       4           276               207
raidz1          True       4       4           276               207

A very common area of confusion has been how to size Sun Storage 7410 systems, and the relationship between the physical storage and the delivered capacity. I hope that this little tool will help to answer those questions. A side benefit should be still more interest in the virtual version of the appliance — a subject I've been meaning to post about so stay tuned.

Update December 14, 2008: A couple of folks requested that the script allow for modeling half-JBOD allocations because the 7410 allows you to split JBODs between heads in a cluster. To accommodate this, I've added a -h option that takes as its parameter the number of half JBODs. For example:

$ ./sizecalc.py -h 12 \*\*\*\*\* 0
type            NSPF   width  spares   data drives     capacity (TB)
raidz2         False      14       4           140               120
raidz2          True      14       4           140               120
raidz2 wide    False      35       4           140               132
raidz2 wide     True      20       4           140               126
mirror         False       2       4           140                70
mirror          True       2       4           140                70
stripe         False       0       0           144               144
raidz1         False       4       4           140               105
raidz1          True       4       4           140               105

Update February 4, 2009: Ryan Matthews and I collaborated on a new version of the size calculator that now lists the raw space available in TB (decimal as quoted by drive manufacturers for example) as well as the usable space in TiB (binary as reported by many system tools). The latter also takes account of the sliver (1/64th) reserved by ZFS:

$ ./sizecalc.py \*\*\*\*\* 12
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
raidz2       False     14      8          280         240.00         214.87
raidz2        True     14      8          280         240.00         214.87
raidz2 wide  False     47      6          282         270.00         241.73
raidz2 wide   True     20      8          280         252.00         225.61
mirror       False      2      4          284         142.00         127.13
mirror        True      2      4          284         142.00         127.13
stripe       False      0      0          288         288.00         257.84
raidz1       False      4      4          284         213.00         190.70
raidz1        True      4      4          284         213.00         190.70

Update June 17, 2009: Ryan Matthews with help from has again revised the size calculator to model both adding expansion JBODs and to account for the now expandable Sun Storage 7210. Take a look at Ryan's post for usage information. Here's an example of the output:

$ ./sizecalc.py \*\*\* 1 h1 add 1 h add 1 
Sun Storage 7000 Size Calculator Version 2009.Q2
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      5           42          21.00          18.80
raidz1       False      4     11           36          27.00          24.17
raidz2       False  10-11      4           43          35.00          31.33
raidz2 wide  False  10-23      3           44          38.00          34.02
stripe       False      0      0           47          47.00          42.08

Update September 16, 2009: Ryan Matthews updated the size calculator for the 2009.Q3 release. The update includes the new triple-parity RAID wide stripe and three-way mirror profiles:

$ ./sizecalc.py boga \*\*\* 4
Sun Storage 7000 Size Calculator Version 2009.Q3
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      4           92          46.00          41.18
mirror        True      2      4           92          46.00          41.18
mirror3      False      3      6           90          30.00          26.86
mirror3       True      3      6           90          30.00          26.86
raidz1       False      4      4           92          69.00          61.77
raidz1        True      4      4           92          69.00          61.77
raidz2       False     13      5           91          77.00          68.94
raidz2        True      8      8           88          66.00          59.09
raidz2 wide  False     46      4           92          88.00          78.78
raidz2 wide   True      8      8           88          66.00          59.09
raidz3 wide  False     46      4           92          86.00          76.99
raidz3 wide   True     11      8           88          64.00          57.30
stripe       False      0      0           96          96.00          85.95

\*\* As of 2009.Q3, the raidz2 wide profile has been deprecated.
\*\* New configurations should use the raidz3 wide profile.

Sunday Nov 09, 2008

Hybrid Storage Pools in the 7410

The Sun Storage 7000 Series launches today, and with it Sun has the world's first complete product that seamlessly adds flash into the storage hierarchy in what we call the Hybrid Storage Pool. The HSP represents a departure from convention, and a new way of thinking designing a storage system. I've written before about the principles of the HSP, but now that it has been formally announced I can focus on the specifics of the Sun Storage 7000 Series and how it implements the HSP.

Sun Storage 7410: The Cadillac of HSPs

The best example of the HSP in the 7000 Series is the 7410. This product combines a head unit (or two for high availability) with as many as 12 J4400 JBODs. By itself, this is a pretty vanilla box: big, economical, 7200 RPM drives don't win any races, and the maximum of 128GB of DRAM is certainly a lot, but some workloads will be too big to fit in that cache. With flash, however, this box turns into quite the speed demon.


The write performance of 7200 RPM drive isn't terrific. The appalling thing is that the next best solution — 15K RPM drives — aren't really that much better: a factor of two or three at best. To blow the doors off, the Sun Storage 7410 allows up to four write-optimized flash drives per JBOD each of which is capable of handling 10,000 writes per second. We call this flash device Logzilla.

Logzilla is a flash-based SSD that contains a pretty big DRAM cache backed by a supercapacitor so that the cache can effectively be treated as nonvolatile. We use Logzilla as a ZFS intent log device so that synchronous writes are directed to Logzilla and clients incur only that 100μs latency. This may sound a lot like how NVRAM is used to accelerate storage devices, and it is, but there are some important advantages of Logzilla. The first is capacity: most NVRAM maxes out at 4GB. That might seem like enough, but I've talked to enough customers to realize that it really isn't and that performance cliff is an awful long way down. Logzilla is an 18GB device which is big enough to hold the necessary data while ZFS syncs it out to disk even running full tilt. The second problem with NVRAM scalability: once you've stretched your NVRAM to its limit there's not much you can do. If your system supports it (and most don't) you can add another PCI card, but those slots tend to be valuable resources for NICs and HBAs, and even then there's necessarily a pretty small number to which you could conceivably scale. Logzilla is an SSD sitting in a SAS JBOD so it's easy to plug more devices into ZFS and use them as a growing pool of intent log devices.


The standard practice in storage systems is to use the available DRAM as a read cache for data that is likely to be frequently accessed, and the 7000 Series does the same. In fact, it can do quite a better job of it because, unlike most storage systems which stop at 64GB of cache, the 7410 has up to 256GB of DRAM to use as a read cache. As I mentioned before, that's still not going to be enough to cache the entire working set for a lot of use cases. This is where we at Fishworks came up with the innovative solution of using flash as a massive read cache. The 7410 can accomodate up to six 100GB, read-optimized, flash SSDs; accordingly, we call this device Readzilla.

With Readzilla, a maximum 7410 configuration can have 256GB of DRAM providing sub-μs latency to cached data and 600GB worth of Readzilla servicing read requests in around 50-100μs. Forgive me for stating the obvious: that's 856GB of cache &mdash. That may not suffice to cache all workloads, but it's certainly getting there. As with Logzilla, a wonderful property of Readzilla is its scalability. You can change the number of Readzilla devices to match your workload. Further, you can choose the right combination of DRAM and Readzilla to provide the requisite service times with the appopriate cost and power use. Readzilla is cheaper and less power-hungry than DRAM so applications that don't need the blazing speed of DRAM can prefer the more economical flash cache. It's a flexible solution that can be adapted to specific needs.

Putting It All Together

We started with DRAM and 7200 RPM disks, and by adding Logzilla and Readzilla the Sun Storage 7410 also has great write and read IOPS. Further, you can design the specific system you need with just the right balance of write IOPS, read IOPS, throughput, capacity, power-use, and cost. Once you have a system, the Hybrid Storage Pool lets you solve problems with targeted solutions. Need capacity? Add disk. Out of read IOPS? Toss in another Readzilla or two. Write bogging down? Another Logzilla will net you another 10,000 write IOPS. In the old model, of course, all problems were simple because the solution was always the same: buy more fast drives. The HSP in the 7410 lets you address the specific problem you're having without paying for a solution to three other problems that you don't have.

Of course, this means that administrators need to better understand the performance limiters, and fortunately the Sun Storage 7000 Series has a great answer to that in Analytics. Pop over to Bryan's blog where he talks all about that feature of the Fishworks software stack and how to use it to find performance problems on the 7000 Series. If you want to read more details about Hybrid Storage Pools and how exactly all this works, take a look my article on the subject in CACM, as well as this post about the L2ARC (the magic behind using Readzilla) and a nice marketing pitch on HSPs.

Monday Oct 20, 2008

Hybrid Storage Pool goes glossy

I've written about Hybrid Storage Pools (HSPs) here several times as well as in an article that appeared in the ACM's Queue and CACM publications. Now the folks in Sun marketing on the occasion of our joint SSD announcement with Intel have distilled that down to a four page glossy, and they've done a terrific job. I suggest taking a look.

The concept behind the HSP is a simple one: combine disk, flash, and DRAM into a single coherent and seamless data store that makes optimal use of each component and its economic niche. The mechanics of how this happens required innovation from the Fishworks and ZFS groups to integrate flash as a new tier in storage hierarchy for use in our forthcoming line of storage products. The impact of the HSP is pure economics: it delivers superior capacity and performance for a lower cost and smaller power footprint. That's the marketing pitch; if you want to wade into the details, check out the links above.

Saturday Oct 04, 2008

Apple updates DTrace... again

Back in January, I ranted about Apple's ham-handed breakage in their DTrace port. After some injured feelings and teary embraces, Apple cleaned things up a bit, but some nagging issues remained as I wrote:

For the Apple folks: I'd argue that revealing the name of otherwise untraceable processes is no more transparent than what Activity Monitor provides — could I have that please?

It would be very un-Apple to — you know — communicate future development plans, but in 10.5.5, DTrace has seen another improvement. Previously when using DTrace to observe the system at large, iTunes and other paranoid apps would be hidden; now they're showing up on the radar:

# dtrace -n 'profile-1999{ @[execname] = count(); }'
dtrace: description 'profile-1999' matched 1 probe

  loginwindow                                                       2
  fseventsd                                                         3
  kdcmond                                                           5
  socketfilterfw                                                    5
  distnoted                                                         7
  mds                                                               8
  dtrace                                                           12
  punchin-helper                                                   12
  Dock                                                             20
  Mail                                                             25
  Terminal                                                         26
  SystemUIServer                                                   28
  Finder                                                           42
  Activity Monito                                                  49
  pmTool                                                           67
  WindowServer                                                    184
  iTunes                                                         1482
  kernel_task                                                    4030

And of course, you can use generally available probes to observe only those touchy apps with a predicate:

# dtrace -n 'syscall:::entry/execname == "iTunes"/{ @[probefunc] = count(); }'
dtrace: description 'syscall:::entry' matched 427 probes

  pwrite                                                           13
  read                                                             13
  stat64                                                           13
  open_nocancel                                                    14
  getuid                                                           22
  getdirentries                                                    26
  pread                                                            29
  stat                                                             32
  gettimeofday                                                     34
  open                                                             36
  close                                                            37
  geteuid                                                          65
  getattrlist                                                     199
  munmap                                                          328
  mmap                                                            338

Predictably, the details of iTunes are still obscured:

# dtrace -n pid42896:::entry
dtrace: error on enabled probe ID 225607 (ID 69364: pid42896:libSystem.B.dylib:pthread_mutex_unlock:entry):
  invalid user access in action #1
dtrace: error on enabled probe ID 225546 (ID 69425: pid42896:libSystem.B.dylib:spin_lock:entry):
  invalid user access in action #1
dtrace: 1005103 drops on CPU 1

... which is fine by me; I've got code of my own I should be investigating. While I'm loath to point it out, an astute reader and savvy DTrace user will note that Apple may have left the door open an inch wider than they had anticipated. Anyone care to post some D code that makes use of that inch? I'll post an update as a comment in a week or two if no one sees it.

Update: There were some good ideas in the comments. Here's the start of a script that can let you follow the flow of control of a thread in an "untraceable" process:

#!/usr/sbin/dtrace -s

#pragma D option quiet

	trace("this program is already traceable\\n");

/self->level < 0 || self->level > 40/
	self->level = 0;

	this->p = ((dtrace_state_t \*)arg0)->dts_ecbs[arg1 - 1]->dte_probe;
	this->mod = this->p->dtpr_mod;
	this->func = this->p->dtpr_func;
	this->entry = ("entry" == stringof(this->p->dtpr_name));

	printf("%\*s-> %s:%s\\n", self->level \* 2, "",
	    stringof(this->mod), stringof(this->func));

	printf("%\*s<- %s:%s\\n", self->level \* 2, "",
	    stringof(this->mod), stringof(this->func));

Monday Aug 11, 2008

A glimpse into Netapp's flash future

The latest edition of Communications of the ACM includes a panel discussion between "seven world-class storage experts". The primary topic was flash memory and how it impacts the world of storage. The most interesting comment came from Steve Kleiman, Senior Vice President and Chief Scientist at Netapp:

My theory is that whether it’s flash, phase-change memory, or something else, there is a new place in the memory hierarchy. There was a big blank space for decades that is now filled and a lot of things that need to be rethought. There are many implications to this, and we’re just beginning to see the tip of the iceberg.

The statement itself isn't earth-shattering — it would be immodest to say so as I reached the same conclusion in my own CACM article last month — with price trends and performance characteristics, it's obvious that flash has become relevant; those running the numbers as Steve Kleiman has will come to the same conclusion about how it might integrate into a system. What's interesting is that the person at Netapp "responsible for setting future technology directions for the company" has thrown his weight behind the idea. I look forward to seeing how this is manifested in Netapp's future offerings. Will it look something like the Hybrid Storage Pool (HSP) that we've developed with ZFS? Or might it integrate flash more explicitly into the virtual memory system in ONTAP, Netapp's embedded operating system? Soon enough we should start seeing products in the market that validate our expectations for flash and its impact to enterprise storage.

Wednesday Jul 23, 2008

Hybrid Storage Pools: The L2ARC

I've written recently about the hybrid storage pool (HSP), using ZFS to augment the conventional storage stack with flash memory. The resulting system improve performance, cost, density, capacity, power dissipation — pretty much evey axis of importance.

An important component of the HSP is something called the second level adaptive replacement cache (L2ARC). This allows ZFS to use flash as a caching tier that falls between RAM and disk in the storage hierarchy, and permits huge working sets to be serviced with latencies under 100us. My colleague, Brendan Gregg, implemented the L2ARC, and has written a great summary of how the L2ARC works and some concrete results. Using the L2ARC, Brendan was able to achieve a 730% performance improvement over 7200RPM drives. Compare that with 15K RPM drives which will improve performance by at most 100-200%, while costing more, using more power, and delivering less total capacity than Brendan's configuration. Score one for the hybrid storage pool!

Tuesday Jul 01, 2008

Hybrid Storage Pools in CACM

As I mentioned in my previous post, I wrote an article about the hybrid storage pool (HSP); that article appears in the recently released July issue of Communications of the ACM. You can find it here. In the article, I talk about a novel way of augmenting the traditional storage stack with flash memory as a new level in the hierarchy between DRAM and disk, as well as the ways in which we've adapted ZFS and optimized it for use with flash.

So what's the impact of the HSP? Very simply, the article demonstrates that, considering the axes of cost, throughput, capacity, IOPS and power-efficiency, HSPs can match and exceed what's possible with either drives or flash alone. Further, an HSP can be built or modified to address specific goals independently. For example, it's common to use 15K RPM drives to get high IOPS; unfortunately, they're expensive, power-hungry, and offer only a modest improvement. It's possible to build an HSP that can match the necessary IOPS count at a much lower cost both in terms of the initial investment and the power and cooling costs. As another example, people are starting to consider all-flash solutions to get very high IOPS with low power consumption. Using flash as primary storage means that some capacity will be lost to redundancy. An HSP can provide the same IOPS, but use conventional disks to provide redundancy yielding a significantly lower cost.

My hope — perhaps risibly naive — is that HSPs will mean the eventual death of the 15K RPM drive. If it also puts to bed the notion of flash as general purpose mass storage, well, I'd be happy to see that as well.


Adam Leventhal, Fishworks engineer


« July 2016