
Casting the shadow of the Hybrid Storage Pool

Guest Author

The debate, calmly waged, on the best use of flash in the enterprise can be
summarized as whether flash should be a replacement for disk, acting as
primary storage, or whether it should be regarded as a new and complementary tier
in the storage hierarchy, acting as a massive read cache. The market leaders in
storage have weighed in on the issue and declared incontrovertibly that,
yes, both are the right answer, but there's some bias underlying that
equanimity.
Chuck Hollis, EMC's Global Marketing CTO, writes that "flash as cache will
eventually become less interesting as part of the overall discussion... Flash
as storage? Well, that's going to be really interesting."
Standing boldly with a foot in each camp, Dave Hitz, founder and EVP at Netapp,
thinks that "Flash is too expensive to replace disk right away, so first we'll
see a new generation of storage systems that combine the two: flash for
performance and disk for capacity."
So what are these guys really talking about, what does the landscape look like,
and where does Sun fit in all this?

Flash as primary storage (a.k.a. tier 0)

Integrating flash efficiently into a storage system isn't obvious; the simplest
way is as a direct replacement for disks. This is why most of the flash we use
today in enterprise systems comes in units that look and act just like hard
drives: SSDs are designed to be drop-in replacements. Now, a flash SSD is
quite different from a hard drive — rather than a servo spinning
platters while a head chatters back and forth, an SSD has floating gates
arranged in blocks... actually it's probably simpler to list what they have
in common, and that's just the form factor and interface (SATA, SAS, FC).
Hard drives have all kinds of properties that don't make sense in the world of
SSDs (e.g. I've seen an SSD that reports its RPM telemetry as 1),
and SSDs have their own quirks with no direct analog (read/write asymmetry,
limited write cycles, etc.). SSD vendors, however, manage to pound these round
pegs into their square holes and produce something that can stand in for an
existing hard drive. Array vendors are all too happy to attain buzzword
compliance by stuffing these SSDs into their products.

The trouble with HSM is the burden of the M.

Storage vendors already know how to deal with a caste system for disks: they
stratify them into tiers with fast, expensive 15K RPM disks as tier 1, and
slower, cheaper disks filling out the chain down to tape. So what to do with
these even faster, even more expensive flash devices? Tier 0, of course! An
astute Netapp blogger asks
(http://blogs.netapp.com/shadeofblue/2008/11/both-disk-and-c.html), "when
the industry comes up with something even faster... are we going to have
tier -1?" — great question.
What's wrong with that approach? Nothing. It works; it's simple; and we (the
computing industry) basically know how to manage a bunch of tiers of storage
with something called hierarchical storage management
(http://en.wikipedia.org/wiki/Hierarchical_storage_management).
The trouble with HSM is the burden of the M. This solution kicks the problem
down the road, leaving administrators to figure out where to put data, which
applications should have priority, and when to migrate data.

Flash as a cache

The other school of thought around flash is to use it not as a replacement
for hard drives, but rather as a massive cache for reading frequently accessed
data. As I wrote back in June for CACM, "this new flash tier can be thought of
as a radical form of hierarchical storage management (HSM) without the need
for explicit management." Tersely, HSM without the M. This idea forms a major
component of what we at Sun are calling the Hybrid Storage Pool (HSP), a
mechanism for integrating flash with disk and DRAM to form a new, and —
I argue — superior storage solution.

Let's set aside the specifics of how we implement the HSP in ZFS — you can
read about that elsewhere
(http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=553).
Rather, I'll compare the use of flash as a cache to flash as a replacement
for disk, independent of any specific solution.

The case for cache

It's easy to see why using flash as primary storage is attractive. Flash is
faster than the fastest disks by at least a factor of 10 for writes and a
factor of 100 for reads, measured in IOPS.
Replacing disks with flash, though, isn't without nuance;
there are several inhibitors, chief among them cost. The cost of flash
continues to drop, but it's still much more expensive than cheap disks, and
will continue to be for quite a while. With flash as primary storage, you
still need data redundancy — SSDs can and do fail — and while we could use
RAID with single- or double-device redundancy, that would cleave the available
IOPS by a factor of the stripe width. The reason to migrate to flash is
performance, so it wouldn't make much sense to hand the majority of that
performance back with RAID.
The remaining option, therefore, is to mirror SSDs, doubling the already high
cost.
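
To make the redundancy trade-off concrete, here's a minimal back-of-the-envelope sketch in Python. The per-device IOPS figure and the device count are hypothetical placeholders rather than measurements; the point is simply that a parity RAID stripe delivers roughly one device's worth of random IOPS, while mirroring preserves IOPS at the cost of buying twice the capacity.

    # Rough model of how a redundancy scheme affects the random IOPS
    # delivered by a group of SSDs. All figures are hypothetical placeholders.

    DEV_IOPS = 5000       # assumed random IOPS of a single SSD (placeholder)
    N_DEVICES = 8         # devices in the group (placeholder)

    raw_iops = N_DEVICES * DEV_IOPS

    # Parity RAID: each logical I/O touches the whole stripe, so the group
    # behaves roughly like a single device.
    parity_stripe_iops = DEV_IOPS

    # Mirroring: each pair holds two copies, so usable capacity is halved,
    # but random IOPS scale with the number of pairs.
    mirror_iops = (N_DEVICES // 2) * DEV_IOPS

    print(f"raw aggregate:      {raw_iops:>7,} IOPS")
    print(f"parity RAID stripe: {parity_stripe_iops:>7,} IOPS")
    print(f"mirrored pairs:     {mirror_iops:>7,} IOPS (at 2x capacity cost)")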

It's hard to argue with results: all-flash solutions do rip. If money were
no object, that might well be the best solution (but if cost truly weren't a
factor, everyone would strap batteries to DRAM and call it a day).

Can flash as a cache do better? Say we need to store 50TB of data. With an
all-flash pool, we'll need to buy SSDs that can hold roughly 100TB of data if
we want to mirror for optimal performance, and maybe 60TB if we're willing to
accept a far more modest performance improvement over conventional hard
drives. Since we're already resigned to cutting a pretty hefty check, we have
quite a bit of money to play with to design a hybrid solution.
If we were to provision our system with 50TB of flash and 60TB of hard drives,
we'd have enough cache to retain every byte of active data in flash while the
disks provide the necessary redundancy. As writes come in, the filesystem
would populate the flash while writing the data persistently to disk. The
performance of this system would be epsilon away from the mirrored flash
solution, as read requests would only go to disk in the case of faults from
the flash devices. Note that we never rely on the flash for correctness; it's
the hard drives that provide reliability.

The performance of this system would be epsilon away from the mirrored flash solution...
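
As a rough illustration of the economics in that 50TB scenario, here's a hedged sketch with entirely hypothetical prices (the flash and disk dollars-per-GB figures are placeholders, not vendor quotes); it simply compares the raw capacity each approach has to buy.

    # Hypothetical cost comparison for the 50TB example above.
    # Prices are placeholders for illustration only.

    TB = 1000              # GB per TB, keeping the arithmetic simple

    FLASH_PER_GB = 10.00   # assumed cost of enterprise flash, $/GB (placeholder)
    DISK_PER_GB = 0.50     # assumed cost of capacity disk, $/GB (placeholder)

    # All-flash, mirrored for performance: ~100TB of raw flash for 50TB of data.
    all_flash_mirrored = 100 * TB * FLASH_PER_GB

    # Hybrid: 50TB of flash as cache plus ~60TB of disk for persistence and
    # redundancy.
    hybrid = 50 * TB * FLASH_PER_GB + 60 * TB * DISK_PER_GB

    print(f"all-flash (mirrored):  ${all_flash_mirrored:,.0f}")
    print(f"hybrid (cache + disk): ${hybrid:,.0f}")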

The hybrid solution is cheaper, and it's also far more flexible. If a smaller
working set accounted for a disproportionately large number of reads, the total
IOPS capacity of the all-flash solution could be underused. With flash as a
cache, data could be migrated dynamically to distribute load, and additional
cache could be used to enhance the performance of the working set. It would be
possible to use some of the same techniques with an all-flash storage pool, but
it could be tricky. The luxury of a cache is that its looser constraints allow
for more aggressive data manipulation.

Building on the idea of concentrating the use of flash on hot data,
it's easy to see how flash as a cache can improve
performance even without every byte present in the cache. Most data doesn't
require 50μs random access latency over the entire dataset; users would see a
significant performance improvement with just the active subset in a flash
cache. Of course, this means that software needs to be able to anticipate what
data is in use, which probably inspired this comment from Chuck Hollis: "cache
is cache — we all know what it can and can't do." That may be so, but comparing
an ocean of flash for primary storage to a thimbleful of cache reflects fairly
obtuse thinking. Caching algorithms will always be imperfect, but the massive
scale to which we can grow a flash cache radically alters the landscape.

Even when a working set is too large to be cached, it's possible for a hybrid
solution to pay huge dividends.
Over at Facebook, Jason Sobel
(a colleague of mine in college)
produced an interesting
presentation
on their use of storage (take a look at Jason's penultimate slide for his take
on SSDs).
Their datasets are so vast and sporadically accessed that the latency of
actually loading a picture, say, off of hard drives isn't the biggest
concern; rather, it's the time it takes to read the indirect blocks, the
metadata. At Facebook, they've taken great pains to reduce the number of
dependent disk accesses from fifteen down to about three.
In a case such as theirs, it would never be economical to store or cache the
full dataset on flash, and the working set is similarly too large, since data
access can be quite unpredictable.
It could, however, be possible to cache all of their metadata in flash.
This would reduce the latency of reaching an infrequently accessed image by
nearly a factor of three. Today in ZFS this is a manual, per-filesystem
setting, but it would be possible to evolve a caching algorithm to detect a
condition where this was the right policy and make the adjustment dynamically.
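
To see where that factor of three comes from, here's a simple latency model; the access times are assumptions chosen for the sketch, not measurements from any real system. If fetching one cold image requires roughly three dependent accesses and the metadata lookups are served from flash, only the final data read still pays the disk penalty.

    # Illustrative latency model for a cold image fetch. The access times
    # below are assumptions for this sketch, not measured figures.

    DISK_ACCESS_MS = 10.0    # assumed random read latency of a hard drive
    FLASH_ACCESS_MS = 0.1    # assumed random read latency of flash

    DEPENDENT_ACCESSES = 3   # metadata lookups plus the final data read

    # Everything on disk: all three dependent accesses pay the disk penalty.
    all_disk_ms = DEPENDENT_ACCESSES * DISK_ACCESS_MS

    # Metadata cached in flash: only the final data read goes to disk.
    metadata_in_flash_ms = (DEPENDENT_ACCESSES - 1) * FLASH_ACCESS_MS + DISK_ACCESS_MS

    print(f"all on disk:       {all_disk_ms:.1f} ms")
    print(f"metadata in flash: {metadata_in_flash_ms:.1f} ms "
          f"(~{all_disk_ms / metadata_in_flash_ms:.1f}x faster)")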

Using flash as a cache offers the potential to do better, and to
make more efficient and more economical use of flash. Sun, and the industry
as a whole, have only just started to build the software designed to realize
that potential.

Putting products before words

At Sun, we've just released our first line of products that offer complete
flash integration with the Hybrid Storage Pool; you can read about that in
my blog post on the occasion of our product launch. On the eve
of that launch, Netapp announced their own offering: a flash-laden PCI card
that plays much the same part as their DRAM-based Performance Acceleration
Module (PAM). This will apparently be available sometime in 2009.
EMC offers a tier 0 solution that employs very fast and very expensive flash
SSDs.

What we have in ZFS today isn't perfect.
Indeed, the Hybrid Storage Pool casts the state of the art forward, and we'll be
catching up with solutions to the hard questions it raises for at least a few
years. Only then will we realize the full potential of flash as a cache.
What we have today, though, integrates flash in a way that changes the landscape
of storage economics and delivers cost efficiencies that haven't been seen
before. If the drive manufacturers don't hear it already, it can't be long
until they hear the death knell for 15K RPM drives loud and clear.
Perhaps it's cynical or solipsistic to conclude that the timing of Dave
Hitz's and Chuck Hollis' blog posts was designed to coincide with the release
of our new product and perhaps take some of the wind out of our sails,
but I will — as the commenters on Dave's blog have
(http://blogs.netapp.com/dave/2008/11/disk-is-the-new.html#comment-138918166)
— take it as a sign that we're on the right track. For the moment, I'll put my
faith in this bit of marketing material
(http://www.netapp.com/us/company/leadership/data-center-transformation/dct-flash.html),
enigmatically referenced in a number of Netapp blogs on the subject of flash:

In today's competitive environment, bringing a product or service to market
faster than the competition can make a significant difference. Releasing a
product to market in a shorter time can give you first-mover advantage and
result in larger market share and higher revenues.


Comments (8)
  • saf Wednesday, December 3, 2008

    Nice article.

    I'm wondering what has been changed in ZFS to adapt it to SSDs?

    And when will Sun sell servers with SSDs?


  • Adam Leventhal Thursday, December 4, 2008

    @saf To build the hybrid storage pool, we added the L2ARC to ZFS. We also use the ability to have separate intent-log devices, but that was already part of ZFS. We already sell the Sun Storage 7000 series with flash drives; I don't know when we'll be selling SSDs in general-purpose servers.


  • Ts Friday, December 5, 2008

    Well, whether or not people should use flash ssd should depend on how fast the data gets updated. Too fast, and flash will fail prematurely, driving up the cost. If data is written once and never updated, then even mlc ssd devices with 10k cycle endurance can be used. Things like historical tick data are perfect for mlc flash. Things like 2k sized images that get deleted once in a while should be cached in l2arc with slc ssd. For things that update too fast, you have to use either memcache or pure brute force 15k sas drives.

    What is interesting is that with intel ssd on the market now, other samsung mlcs are going for 150 dollars per 64gb of mlc flash. At that price zfs should try to use it as a massive l2arc anyway.


  • Adam Leventhal Tuesday, December 9, 2008

    @Ts I think you're still thinking in terms of flash as primary storage. With the Hybrid Storage Pool, you can easily limit the rate at which you write to flash. Further, the primary ARC serves data under heavy churn -- transient data is unlikely to survive long enough to ever be written to flash. Again, this is like HSM without the M.


  • Storage Guy Thursday, December 11, 2008

    These days most NAS devices are being built as N-way clusters, like Isilon, ONTAP-GX, and HP StorageWorks. These generally require an N-way clustered file system as well.

    When will the Storage 7000 series from Sun be N-way clustered? Is it because ZFS is not a clustered file system?


  • Adam Leventhal Thursday, December 11, 2008

    @Storage Guy I'm not sure; you might find a blog post about clustering and ask there...


  • Dave Nicholson Thursday, December 18, 2008

    Where do you see flash as an "extension of system memory" in servers and RDMA (remote direct memory access) fitting into your vision of the future? Flash is obviously a game-changing technology that makes all of the assumptions upon which current storage systems are based a bit out of whack. Leveraging ZFS and creating a Hybrid Pool doesn't change the plumbing between the server and the storage. Nor does it change the nature of the applications that all of this machinery is designed to support. If every node in a grid of servers had TBs of on-board flash at its disposal, the future for storage vendors may include monster farms of "50TB SATA-V" drives instead of the sexy flash they crave.

    On another note, delivering flash as part of a legacy architecture in the form of stuffing it into a disk can is simply a method of accomplishing the first mover advantage you reference above. No one sees it as the optimal use of flash. Least of all the folks who currently do the stuffing. :-)

    Great discussion.


  • TS Thursday, December 18, 2008

    "Further, the primary ARC serves data under heavy churn -- transient data is unlikely to survive long enough to ever be written to flash. Again, this is like HSM without the M."

    I know. But you do have to consider the cases where transient data doesn't fit in memory. We are at an age where data is so massive, we need more memory than maxing out the 64 DIMM slots in the servers with the biggest DIMMs you can find, at exponential price steps. Take cases where in a web 2.0 site, when a user logs in, you have to show on the web site that they are logged in and a timestamp. That is one small transient IO of a few bytes at most. Right now it is stored in a cluster of memcache boxes (if you wanted to reduce database IO write load). In other words, it is not persistent. You can have redundant memcache boxes, but you can't preserve that transient data if power suddenly goes out, which means you can only store transient data in memory if you can afford to lose it.

    There are use cases where the churn rate is extremely high (real time stock ticks for example) where right now, we use a delayed non-persistent model (in-memory cache), which is incorrect from a database ACID perspective (Durability perspective). You cannot assume that ZFS can handle all of the high churn rate data in the ARC from a practical perspective.

    On the other hand, for slow churn rate data such as profile pictures on dating sites (where once uploaded they will hardly be updated), using non-RAIDed MLC flash as L2ARC devices makes perfect sense. If they break, you can still fetch the data from RAID10 hard drives below the L2ARC.

    I don't ever think that even SLC SSDs can be used as primary storage. Reason being that you need minimum 2x for RAID. RAID1 or RAID10, which was really good for hard drives, cannot be used for flash because the data are mirrored, which means the SSDs with similar write cycles will fail at about the same time, even with a really good wear-leveling algorithm. RAID5/6 has a problem where one IO is spread across multiple SSDs, thus there is a write IOPS amplification factor, which effectively reduces SSD life cycle by the number of drives in the array.

    We will see. ZFS kicks ass.

