Sunday Jul 25, 2010

Good-bye, Sun

In February 1996, I came out to Sun Microsystems to interview for a job knowing only two things: that I wanted to do operating systems kernel development -- and that I didn't particularly want to work for Sun. I was right on the first count, but knew I was wrong on the second just moments into my first conversation with Jeff. He was emphatic that I should join him in forging the future, sharing both my enthusiasm for what was possible and my disdain for the broken, busted and boogered-up. Fourteen years later, I don't for a moment regret my decision to join Jeff and Sun: we fostered an environment where the OS was viewed not as a regrettable drag on progress, but rather as a nexus of innovation -- incubating technologies that today make a real difference in people's lives.

In 2006, itching to try something new, Mike and I talked the company into taking the risk of allowing several of us to start Fishworks. That Sun supported our endeavor so enthusiastically was the company at its finest: empowering engineers to tackle hard problems, and inspiring them to bring innovative solutions to market. And with the budding success of the 7000 Series, I would like to believe that we made good on the company's faith in us -- and more generally on its belief in innovation as a differentiator.

Now the time has come for me to venture again into something new -- but this time it is to be beyond the company's walls. This is obviously with mixed emotion; while I am excited about the future, it is very difficult for me personally to leave a company in which I have had such close relationships with so many. One of Sun's greatest strengths was that we technologists were never discouraged from interacting directly and candidly with our customers and users, and many of our most important innovations came from these relationships. This symbiosis was critically important at several junctures of my own career, and I owe many of you a profound debt of gratitude -- both for your counsel over the years, and for your willingness to bet your own business and livelihood on the technologies that I helped develop. You, like us, are innovators who love nothing more than great technology, and your steadfast faith in us means more to me than I can express; thank you.

As for my virtual address, it too is changing. This post will be my last at blogs.sun.com; in the future, you can find my blog at its new (permanent) home: http://dtrace.org/blogs/bmc (where comments on this entry will be open). As for e-mail, you can find me at the first letter of my first name concatenated with my last name at acm.org.

Thank you again for everything; take care -- and stay in touch!

Wednesday Mar 10, 2010

Turning the corner

It's a little hard to believe that it's been only fifteen months since we shipped our first product. It's been a hell of a ride; there is nothing as exhilarating -- or as exhausting -- as having a newly developed product that is both intricate and wildly popular. Especially in the domain of enterprise storage -- where perfection is not just the standard but (entirely reasonably) the expectation -- this makes for some seriously spiked punch.

For my own part, I have had my head down for the last six months as the Technical Lead for our latest software release, 2010.Q1, which is publicly available as of today. In my experience, I have found that in software (if not in life), one may only ever pick two of quality, features and schedule -- and for 2010.Q1, we very much picked quality and features. (As for schedule, let it be only said that this release was once known as "2009.Q4"...)

2010.Q1 Quality

You don't often see enterprise storage vendors touting quality improvements for a very simple reason: if the product was perfect when you sold it to me, why are you talking about how much you've improved it? So I'm going to break a little bit with established tradition and acknowledge that the product has not been perfect, though not without good reason. With our initial development of the product, we were pushing many new technologies very aggressively: not only did we seek to build enterprise-grade storage on commodity components (a deceptively daunting challenge in its own right), we were also building on entirely new elements like flash -- and then topped it all off with an ambitious, from-scratch management stack. What were we possibly thinking by making so many bets at once? We made these bets not out of recklessness, but rather because they were essential elements of our Big Bet: that customers were sick of paying monopoly rents for enterprise storage, and that we could deliver a quantum leap in price-performance. (And if nothing else, let it be said that we got that one very, very right -- seemingly too right, at times.) As for the specific technology bets, some have proven to be unblemished winners, while others have been more of a struggle. Sometimes the struggle was because the problem was hard, sometimes it was because the software was immature, and sometimes it was because a component that was assumed to have known failure modes had several (or many) unanticipated (or byzantine) failure modes. And in the worst cases, of course, it was all three...

I'm pleased to report that in 2010.Q1, we turned the corner on all fronts: in addition to just fixing a boatload of bugs in key areas like clustering and networking, we engaged in fundamental work like Dave's rearchitecture of remote replication, adapted to new device failure modes as with Greg's rearchitecture around resilience to HBA logic failure, and -- perhaps most importantly -- integrated critical firmware upgrades to each of the essential components of the I/O path (HBAs, SIM cards and disks). Also in 2010.Q1, we changed the way that we run the evaluation of the software, opening the door to many in our rapidly growing customer base. As a result, this release is already running on more customer production systems than any of its predecessors were at the time that they shipped -- and on many more eval and production machines within our own walls.

2010.Q1 Features

But as important as quality is to this release, it's not the full story: the release is also packed with major features like deduplication, iSER/SRP support, Kerberized NFS support and Fibre Channel support. Of these, the last is of particular interest to me because, in addition to my role as the Technical Lead for 2010.Q1, I was also responsible for the integration of FC support into the product. There was a lot of hard work here, but much of it was borne by John Forte and his COMSTAR team, who did a terrific job not only on the SCSI Target Management facility (STMF) but also on the base ALUA support necessary to allow proper FC operation in a cluster. As for my role, it was fun to cut the code to make all of this stuff work. Thanks to some great design work by Todd Patrick, along with some helpful feedback from field-facing colleagues like Ryan Matthews, I think we came up with a clean, functional interface. And working closely with both John and our test team, we have developed a rock-solid FC product. But of course (and as one might imagine), for me personally, the really gratifying bit was adding FC support to analytics. With just a pinch of DTrace and a bit of glue code, we now have visibility into FC operations by LUN, by project, by target, by initiator, by operation, by SCSI command, by size, by offset and by latency -- and by any combination thereof.

As I was developing FC analytics, I would use as my source of load a silly disk benchmark I wrote back in the day when Adam and I were evaluating SSDs. Here, for example, is that benchmark running against a LUN that I named "thicktail-bench":


The initiator here is the machine "thicktail"; it's interesting to break down by initiator and see the paths by which thicktail is accessing the LUN:


(These names are human readable because I have added aliases for each of thicktail's two HBA ports. Had I not added those aliases, we would see WWNs here.) The above shows us that thicktail is accessing the LUN through both of its paths, which is what we would expect (but good to visually confirm). Let's see how it's accessing the LUN in terms of operations:


Nothing too surprising here -- this is the write phase of the benchmark and we have no log devices on this system, so we fully expect this. But let's break down by offset:


The first time I saw this, I was surprised. Not because of what it shows -- I wrote this benchmark, and I know what it does -- but rather because it was so eye-popping to really see its behavior for the first time. In particular, this captures an odd phase I added to this benchmark: it does random writes across an increasingly large range. I did this because we had discovered that some SSDs did fine when the writes were confined to a small logical region, but broke down -- badly -- when the writes were over a larger region. And no, I don't know why this was the case (presumably the firmware was in fragmented/wear-leveling/cache-busting hell); all I know is that we rejected any further exploration once the writes to the SSD were of a higher latency than that of my first hard drive: the IBM PC XT's 10 MB ST-412, which had roughly 95 ms writes! (We felt that expecting an SSD to have better write latency than a hard drive from the first Reagan Administration was tough but fair...)
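
This isn't the actual benchmark, but the phase in question reduces to a minimal sketch like the following (the block size, window sizes and iteration counts here are arbitrary): random, synchronous writes confined to a window that doubles each phase, with the average latency reported per phase so that any degradation as the logical range grows is plain to see.

/*
 * A sketch (not the original benchmark) of the phase described above:
 * random writes over an increasingly large range.  Point it at a raw
 * device or a large preallocated file; all sizes here are arbitrary.
 */
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define BLOCKSIZE        4096
#define WRITES_PER_PHASE 10000

int
main(int argc, char **argv)
{
        static char buf[BLOCKSIZE];
        off_t window = 64LL * 1024 * 1024;              /* 64MB to start */
        off_t maxsize = 16LL * 1024 * 1024 * 1024;      /* grow to 16GB */
        struct timespec ts, te;
        int fd, i;

        if (argc != 2 || (fd = open(argv[1], O_WRONLY | O_DSYNC)) < 0) {
                fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
                return (1);
        }

        for (; window <= maxsize; window *= 2) {
                (void) clock_gettime(CLOCK_MONOTONIC, &ts);

                for (i = 0; i < WRITES_PER_PHASE; i++) {
                        /* random block-aligned offset within the window */
                        off_t off = (rand() % (window / BLOCKSIZE)) * BLOCKSIZE;

                        if (pwrite(fd, buf, BLOCKSIZE, off) != BLOCKSIZE) {
                                perror("pwrite");
                                return (1);
                        }
                }

                (void) clock_gettime(CLOCK_MONOTONIC, &te);

                printf("%5lld MB window: %.1f us avg write latency\n",
                    (long long)(window / (1024 * 1024)),
                    ((te.tv_sec - ts.tv_sec) * 1e9 +
                    (te.tv_nsec - ts.tv_nsec)) / 1e3 / WRITES_PER_PHASE);
        }

        return (0);
}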

What now?

As part of our ongoing maturity as a product, we have developed a new role here at Fishworks: starting in 2010.Q1, the Technical Lead for the release will, as the release ships, transition to become the full-time Support Lead for that release in the field. This means many things for the way we support the product, but for our customers, it means that if and when you do have an issue on 2010.Q1, you should know that the buck on your support call will ultimately stop with me. We are establishing an unprecedented level of engineering integration with our support teams, and we believe that it will show in the support experience. So welcome to 2010.Q1 -- and happy upgrading!

Thursday Nov 26, 2009

John Birrell

It is with a heavy heart that I announce that we in the DTrace community have lost one of our own: the indomitable John Birrell, who ported DTrace to FreeBSD, suffered a stroke and passed away on Friday, November 20, 2009.

We on Team DTrace knew John to be a remarkably talented and determined software engineer. As those who have attempted ports can attest, DTrace passes through rough country, and a port to a foreign system is a significant undertaking that requires mastery of both DTrace and (particularly) the target system. And in being the first to attempt a port, John's challenge was that much greater -- and his success in the endeavor a tribute to both his ability and (especially) his tenacity. For example, in performing the port, John decided that DTrace's dependency on the cyclic subsystem was such that it, too, needed to be ported. He didn't need to do this (and indeed, other ports have decided that an arbitrary resolution profile provider is not worth the significant trouble), but that he undertook this additional technical challenge anyway -- even when any victory would remain hidden to all but the most expert eye -- says a lot about John as both an engineer and a man. Later, when the port ran into some frustrating licensing issues, John once again did not give up. Rather, he backed up, and found a path forward that would satisfy all parties -- even though it required significant technical reworking on his part. I have long believed that the mark of a great engineer is not how frequently they get knocked down, but rather how quickly they get back up -- and in this regard, John was indisputably a giant.

John, you will be missed -- not only by the FreeBSD community upon which you made an indelible mark, but by those of us in the DTrace community who only had the opportunity to work with you more recently. And while your legacy might remain anonymous to the future generations that will benefit from the fruits of your long labor, we will always know that it never would have happened without you. Thank you, and farewell.

(Those who wish to memorialize John may want to do as I did and make a donation in his memory to the FreeBSD Foundation.)

Thursday May 14, 2009

Queue, CACM, and the rebirth of the ACM

As I have mentioned before (if in passing), I sit on the Editorial Advisory Board of ACM Queue, ACM's flagship publication for practitioners. In the past year, Queue has undergone a significant transformation, and now finds itself at the vanguard of a much broader shift within the ACM -- one that I confess to once thinking impossible.

My story with respect to the ACM is like that of many practitioners, I suspect: I first became aware of the organization as an undergraduate computer science student, when it appeared to me as the embodiment of academic computer science. This perception was cemented by its flagship publication, Communications of the ACM, a magazine which, to a budding software engineer longing for the world beyond academia, seemed to be adrift in dreamy abstraction. So when I decided at the end of my undergraduate career to practice my craft professionally, I didn't for a moment consider joining the ACM: it clearly had no interest in the practitioner, and I had no interest in it.

Several years into my career, my colleague David Brown mentioned that he was serving on the Editorial Board of a new ACM publication aimed at the practitioner, dubbed ACM Queue. The idea of the ACM focussing on the practitioner brought to mind a piece of Sun engineering lore from the old Mountain View days. Sometime in the early 1990s, the campus engaged itself in a water fight that pitted one building against the next. The researchers from the Sun Labs building built an elaborate catapult to launch water-filled missiles at their adversaries, while the gritty kernel engineers in legendary MTV05 assembled surgical tubing into simple but devastatingly effective three-person water balloon slingshots. As one might guess, the Labs folks never got their catapult to work -- and the engineers doused them with volley after volley of water balloons. So when David first mentioned that the ACM was aiming a publication at the practitioner, my mental image was of lab-coated ACM theoreticians, soddenly tinkering with an overcomplicated contraption. I chuckled to myself at this picture, wished David good luck on what I was sure was going to be a fruitless endeavor, and didn't think any more of it.

Several months after it launched, I happened to come across an issue of the new ACM Queue. With skepticism, I read a few of the articles. I found them to be surprisingly useful -- almost embarrassingly so. I sheepishly subscribed, and I found that even the articles that I disagreed with -- like this interview with an apparently insane Alan Kay -- were more thought-provoking than enraging. And in soliciting articles on sordid topics like fault management from engineers like my long-time co-conspirator Mike Shapiro, the publication proved itself to be interested in both abstract principles and their practical application. So when David asked me to be a guest expert for their issue on system performance, I readily accepted. I put together an issue that I remain proud of today, with articles from Bart Smaalders on performance anti-patterns, Phil Beevers on development methodologies for high-performance software, me on DTrace -- and topped off with an interview between Kirk McKusick and Jarod Jenson that, among its many lessons, warns us of the subtle perils of Java's notifyAll.

Two years later, I was honored to be asked to join Queue's Editorial Advisory Board, where my eyes were opened to a larger shift within the ACM: the organization -- led by both its executive leadership in CEO John White and COO Pat Ryan and its past and present elected ACM leadership like Steve Bourne, Dave Patterson, Stu Feldman and Wendy Hall -- was earnestly and deliberately seeking to bring the practitioner into the ACM fold. And I quickly learned that I was not alone in my undergraduate dismissal of Communications of the ACM: CACM was broadly viewed within the ACM as being woefully out of touch with academic and practitioner alike, with one past president confessing that he himself couldn't stomach reading it -- even when his name was on the masthead. There was an active reform movement within the ACM to return the publication to its storied past, and this trajectory intersected with the now-proven success of Queue: it was decided that the in-print vehicle for Queue would shift to become the Practice section of a new, revitalized CACM. I was elated by this change, for it meant that our superlative practitioner-authored content would at last enter the walled garden of the larger academic community. And for practitioners, a newly relevant CACM would also serve to expose us to a much broader swathe of computer science.

After much preparation, the new CACM launched in July 2008. Nearly a year later, I think it can safely be called a success. To wit, I point to two specific (if personal) examples from that first issue alone: thanks to the new CACM, my colleague Adam Leventhal's work on flash memory and our integration of it in ZFS found a much broader readership than it would have otherwise -- and Adam was recently invited to join an otherwise academic symposium on flash. And thanks to the new CACM, I -- and thousands of other practitioners -- were treated to David Shaw's incredible Anton, the kind of work that gives engineers an optimistic excitement uniquely induced by such moon shots. By bringing together the academic and the practitioner, the new CACM is effecting a new ACM.

So, to my fellow practitioners: I strongly encourage you to join me as a member of the ACM. While CACM is clearly a compelling and tangible benefit, it is not the only reason to join the ACM. As professionals, I believe that we have a responsibility to our craft: to learn from our peers, to offer whatever we might have to teach, and to generally leave the profession better than we found it. In other professions -- in law, in medicine, and in more traditional engineering domains -- this professional responsibility is indoctrinated to the point of expectation. But our discipline perhaps shows its youth in our ignorance of this kind of professional service. To be fair, this cannot be laid entirely at the practitioner's feet: the organizations that have existed for computer scientists have simply not been interested in attracting, cultivating, or retaining the practitioner. But with the shift within the ACM embodied by the new CACM, this is changing. The ACM now aspires to be the organization that represents all computer scientists -- not just those who teach students, perform research and write papers, but also those of us who cut code, deliver product and deploy systems for a living. Joining the ACM helps it make good on this aspiration; we practitioners cannot effect this essential change from outside its membership. And we must not stop at membership: if there is an article that you might like to write for the broader ACM audience, or an article that you'd like to see written, or a suggestion you might have for a CTO roundtable or a practitioner you think should be interviewed, or, for that matter, any other change that you might like to see in the ACM to further appeal to the practitioner, do not stay silent; the ACM has given us practitioners a new voice -- but it is only good if we use it!

Thursday Feb 19, 2009

Moore's Outlaws

My blog post eulogizing SPEC SFS has elicited quite a bit of reaction, much of it from researchers and industry observers who have drawn similar conclusions. While these responses were very positive, my polemic garnered a different reaction from SPEC SFS stalwart NetApp, where, in his response defending SPEC SFS, my former colleague Mike Eisler concocted this Alice-in-Wonderland defense of the lack of a pricing disclosure in the benchmark:

Like many industries, few storage companies have fixed pricing. As much as heads of sales departments would prefer to charge the same highest price to every customer, it isn't going to happen. Storage is a buyers' market. And for storage devices that serve NFS and now CIFS, the easily accessible numbers on spec.org are yet another tool for buyers. I just don't understand why a storage vendor would advocate removing that tool.

Mike's argument -- and I'm still not sure that I'm parsing it correctly -- appears to be that the infamously opaque pricing in the storage business somehow helps customers because they don't have to pay a single "highest price"! That is, that the lack of transparent pricing somehow reflects the "buyers' market" in storage. If that is indeed Mike's argument, someone should let the buyers know how great they have it -- those silly buyers don't seem to realize that the endless haggling over software licensing and support contracts is for them!

And if that argument isn't contorted enough for you, Mike takes a second tack:

In storage, the cost of the components to build the device falls continuously. Just as our customers have a buyers' market, we storage vendors are buyers of components from our suppliers and also enjoy a buyers' market. Re-submitting numbers after a hunk of sheet metal declines in price is silly.

His ludicrous "sheet metal" example aside (what enterprise storage product contains more than a few hundred bucks of sheet metal?), Mike's argument appears to be that technology curves like Moore's Law and Kryder's Law lead to enterprise storage prices that are falling with such alarming speed that they're wrong by the time they are so much as written down! If it needs to be said, this argument is absurd on many levels. First, the increases in transistor density and areal storage density tend to result in more computing bandwidth and more storage capacity per dollar, not lower absolute prices. (After all, your laptop is three orders of magnitude more powerful than a personal computer circa 1980 -- but it's certainly not a thousandth of the price.)

Second, has anyone ever accused the enterprise storage vendors of dropping their prices in pace with these laws -- or even abiding by them in the first place? The last time I checked, the single-core Mobile Celeron that NetApp currently ships in their FAS2020 and FAS2050 -- a CPU with a criminally small 256K of L2 cache -- is best described as a Moore's Outlaw: a CPU that, even when it first shipped six (six!) years ago, was off the curve. (A single-core CPU with 256K of L2 cache was abiding by Moore's Law circa 1997.) Though it's no wonder that NetApp sees plummeting component costs when they're able to source their CPUs by dumpster diving...

Getting back to SPEC SFS: even if the storage vendors were consistently reflecting technology improvements, SPEC SFS is (as I discussed) a drive latency benchmark that doesn't realize the economics of these curves anyway; drives are not rotating any faster year-over-year, having leveled out at 15K RPM some years ago due to some nasty physical constraints (like, the sound barrier). So there's no real reason to believe that the 2,016 15K RPM drives used in NetApp's opulent 1,032,461 op submission are any cheaper today than when this configuration was first submitted three years ago. Yes, those same drives would likely have more capacity (being 146GB or 300GB and not the 72GB in the submission), but recall that these drives are being short-stroked to begin with -- so insofar as additional capacity is being used at all by the benchmark, it will only be used to assure even less head movement!

Finally, even if Mike were correct that technology advances result in ever falling absolute prices, it still should not prohibit price disclosures. We all understand that prices reflect a moment in time, and if natural inflation does not dissuade us from price disclosures, nor should any technology-induced deflation.

So to be clear: SPEC SFS needs pricing disclosures. TPC has them, SPC has them -- and SFS needs them if the benchmark has any aspiration to enduring relevance. While SPEC SFS's flaws run deeper than the missing price disclosure, the disclosure would at least keep the more egregious misbehaviors in check -- and it would also (I believe) show storage buyers the degree to which the systems measured by SPEC SFS do not in fact correspond to the systems that they purchase and deploy.

One final note: in his blog entry, Mike claims that "SPEC SFS performance is the minimum bar for entry into the NAS business." If he genuinely believes this, Mike may want to write a letter to the editors of InfoWorld: in their recent review of our Sun Storage 7210, they had the audacity to ignore the lack of SPEC SFS results for the appliance, instead running their own benchmarks. Their rating for the product's performance? 10 out of 10. What heresy!

Sunday Feb 01, 2009

Eulogy for a benchmark

I come to bury SPEC SFS, not to praise it.

When we at Fishworks set out, our goal was to build a product that would disrupt the enterprise NAS market with revolutionary price/performance. Based on the economics of Sun's server business, it was easy to know that we would deliver on the price half of that promise, but the performance half promised to be more complicated: while price is concrete and absolute, the notion of performance fluctuates with environment, workload and expectations. To cut through these factors, computing systems have long had their performance quantified with benchmarks that hold environment and workload constant, and as we began to learn about NAS benchmarks, one in particular loomed large among the vendors: SPEC's system file server benchmark, SFS. Curiously, the benchmark didn't come up much in conversations with customers, who seemed to prefer talking about raw capabilities like maximum delivered read bandwidth, maximum delivered write bandwidth, maximum synchronous write IOPS (I/O operations per second) or maximum random read IOPS. But it was clear that the entrenched NAS vendors took SPEC SFS very seriously (indeed, to the point that they seemed to use no other metric to describe the performance of the system), and upstarts seeking to challenge them seemed to take it even more seriously, so we naturally assumed that we too should use SPEC SFS as the canonical metric of our system...

But as we explored SPEC SFS -- as we looked at the workload that it measures, examined its run rules, studied our rivals' submissions and then contrasted that to what we saw in the market -- an ugly truth emerged: whatever connection to reality it might have once had, SPEC SFS has long since become completely divorced from the way systems are actually used. And worse than simply being outdated or irrelevant, SPEC SFS is so thoroughly misguided as to implicitly encourage vendors to build the wrong systems -- ones that are increasingly outrageous and uneconomic. Quite the opposite of being beneficial to customers in evaluating systems, SPEC SFS has decayed to the point that it is serving the opposite ends: by rewarding the wrong engineering decisions, punishing the right ones and eliminating price from the discussion, SPEC SFS has actually led to lower performing, more expensive systems! And amazingly, in the update to SPEC SFS -- SPEC SFS 2008 -- the benchmark's flaws have not only gone unaddressed, they have metastasized. The result is such a deformed monstrosity that -- like the index case of some horrific new pathogen -- its only remaining utility lies on the autopsy table: by dissecting SPEC SFS and understanding how it has failed, we can seek to understand deeper truths about benchmarks and their failure modes.

Before taking the scalpel to SPEC SFS, it is worth considering system benchmarks in the abstract. The simplest system benchmarks are microbenchmarks that measure a small, well-defined operation in the system. Their simplicity is their great strength: because they boil the system down to its most fundamental primitives, the results can serve as a truth that transcends the benchmark. That is, if a microbenchmark measures a NAS box to provide 1.04 GB/sec read bandwidth from disk, then that number can be considered and understood outside of the benchmark itself. The simplicity of microbenchmarks conveys other advantages as well: microbenchmarks are often highly portable, easily reproducible, straightforward to execute, etc.
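
To make this concrete, here is a minimal sketch of such a microbenchmark -- not any vendor's actual tool -- that sequentially reads a file or device in 1MB chunks and reports the delivered bandwidth. Its simplicity is both its strength and its limitation: it captures exactly one primitive of the system and nothing more.

/*
 * A minimal read-bandwidth microbenchmark: sequentially read the named
 * file or device in 1MB chunks and report delivered bandwidth.  A real
 * test would also vary I/O size, defeat caches, add threads, etc.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define CHUNK   (1024 * 1024)

int
main(int argc, char **argv)
{
        char *buf = malloc(CHUNK);
        struct timespec start, end;
        long long total = 0;
        ssize_t n;
        int fd;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
                fprintf(stderr, "usage: %s <file-or-device>\n", argv[0]);
                return (1);
        }

        (void) clock_gettime(CLOCK_MONOTONIC, &start);

        while ((n = read(fd, buf, CHUNK)) > 0)
                total += n;

        (void) clock_gettime(CLOCK_MONOTONIC, &end);

        double secs = (end.tv_sec - start.tv_sec) +
            (end.tv_nsec - start.tv_nsec) / 1e9;

        printf("%.2f GB/sec (%lld bytes in %.2f seconds)\n",
            total / secs / 1e9, total, secs);

        return (0);
}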

Unfortunately, systems themselves are rarely as simple as their atoms, and microbenchmarks are unable to capture the complex interactions of a deployed system. More subtly, microbenchmarks can also lead to the wrong conclusions (or, worse, the wrong engineering decisions) by giving excessive weight to infrequent operations. In his excellent article on performance anti-patterns, my colleague Bart Smaalders discussed this problem with respect to the getpid system call. Because measuring getpid has been the canonical way to measure system call performance, some operating systems have "improved" system call performance by turning getpid into a library call. This effort is misguided, as are any decisions based on the results of measuring it: as Bart pointed out, no real application calls getpid frequently enough for it to matter in terms of delivered performance.
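
For reference, the canonical test amounts to little more than the following sketch: getpid() in a tight loop, timed. Whether the resulting number reflects a true system call or (on systems that have moved getpid() into libc) a cached library call, it says essentially nothing about delivered application performance -- which is exactly the point.

/*
 * The canonical -- and misleading -- system call microbenchmark:
 * getpid() in a tight loop.  No real application calls getpid()
 * at anything like this rate.
 */
#include <stdio.h>
#include <unistd.h>
#include <time.h>

int
main(void)
{
        struct timespec start, end;
        const long iters = 10000000;
        long i;

        (void) clock_gettime(CLOCK_MONOTONIC, &start);

        for (i = 0; i < iters; i++)
                (void) getpid();

        (void) clock_gettime(CLOCK_MONOTONIC, &end);

        printf("%.1f ns per getpid() call\n",
            ((end.tv_sec - start.tv_sec) * 1e9 +
            (end.tv_nsec - start.tv_nsec)) / iters);

        return (0);
}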

Making benchmarks representative of actual loads is a more complicated undertaking, with any approach stricken by potentially serious failings. The most straightforward approach is taken by application benchmarks, which run an actual (if simplified) application on the system, and measure its performance. This approach has the obvious advantage of measuring actual, useful work -- or at least one definition of it. This means, too, that system effects are being taken into consideration, and that one can have confidence that more than a mere back eddy of the system is being measured. But an equally obvious drawback to this approach is that it is only measuring one application -- an application which may not be at all representative of a deployed system. Moreover, because the application itself is often simplified, application benchmarks can still exhibit the microbenchmark's failings of oversimplification. From the perspective of storage systems, application benchmarks have a more serious problem: because application benchmarks require a complete, functional system, they make it difficult to understand and quantify merely the storage component. From the application's perspective, the system is opaque; who is to know if, say, an impressive TPC result is due to the storage system rather than more mundane factors like, say, database tuning?

Synthetic benchmarks address this failing by taking the hybrid approach of deconstructing application-level behavior into microbenchmark-level operations that they then run in a mix that matches actual use. Ideally, synthetic benchmarks combine the best of both variants: they offer the simplicity and reproducibility of the microbenchmarks, but the real-world applicability of the application-level benchmarks. But beneath this promise of synthetic benchmarks lurks an opposite peril: if not executed properly, synthetic benchmarks can embody the worst properties of both benchmark variants. That is, if a synthetic benchmark combines microbenchmark-level operations in a way that does not in fact correspond to higher level behavior, it has all of the complexity, specificity and opacity of the worst application-level benchmarks -- with the utter inapplicability to actual systems exhibited by the worst microbenchmarks.

As one might perhaps imagine from the foreshadowing, SPEC SFS is a synthetic benchmark: it combines NFS operations in an operation mix designed to embody "typical" NFS load. SPEC SFS has evolved over more than a decade, having started life as NFSSTONE and then morphing into NHFSSTONE (ca. 1992) and then LADDIS (a consortium of Legato, Auspex, DEC, Data General, Interphase and Sun) before becoming a part of SPEC. (As an aside, "LADDIS" is clearly BUNCH-like in being a portent of a slow and miserable death -- may Sun break the curse!) Here is the NFS operation mix for SPEC SFS over its lifetime:

NFS operation   SFS 1.1 (LADDIS)   SFS 2.0/3.0 (NFSv2)   SFS 2.0/3.0 (NFSv3)   SFS 2008
LOOKUP          34%                36%                   27%                   24%
READ            22%                14%                   18%                   18%
WRITE           15%                7%                    9%                    10%
GETATTR         13%                26%                   11%                   26%
READLINK        8%                 7%                    7%                    1%
READDIR         3%                 6%                    2%                    1%
CREATE          2%                 1%                    1%                    1%
REMOVE          1%                 1%                    1%                    1%
FSSTAT          1%                 1%                    1%                    1%
SETATTR         -                  -                     1%                    4%
READDIRPLUS     -                  -                     9%                    2%
ACCESS          -                  -                     7%                    11%
COMMIT          -                  -                     5%                    N/A


The first thing to note is that the workload hasn't changed very much over the years: it started off being 58% metadata read operations (LOOKUP, GETATTR, READLINK, READDIR, READDIRPLUS, ACCESS), 22% read operations and 15% write operations, and it's now 65% metadata read operations, 18% read operations and 10% write operations. So where did that original workload come from? From an unpublished study at Sun conducted in 1986! (I recently interviewed a prospective engineer who was not yet born when this data was gathered -- and I've always thought it wise to be wary of data older than oneself.) The updates to the operation mix are nearly as dubious: according to David Robinson's thorough paper on the motivation for SFS 2.0, the operation mix for SFS 3.0 was updated based on a survey of 750 Auspex servers running NFSv2 -- which even at the time of that paper's publication in 1999 must have elicited some cocked eyebrows about the relevance of workloads on such clunkers. And what of the most recent update? The 2008 reaffirmation of the decades-old workload is, according to SPEC, "based on recent data collected by SFS committee members from thousands of real NFS servers operating at customer sites." SPEC leaves unspoken the uncanny coincidence that the "recent data" pointed to an identical read/write mix as that survey of those now-extinct Auspex dinosaurs a decade ago -- plus ça change, apparently!

Okay, so perhaps the operation mix is paleolithic. Does that make it invalid? Not necessarily, but this particular operation mix does appear to be something of a living fossil: it is biased heavily towards reads, with a mere 15% of operations being writes (and a third of these being metadata writes). While I don't doubt that this is an accurate snapshot of NAS during the Reagan Administration, the world has changed quite a bit since then. Namely, DRAM sizes have grown by nearly five orders of magnitude (!), and client caching has grown along with it -- both in the form of traditional NFS client caching, and in higher-level caching technologies like memcached or (at a larger scale) content distribution networks. This caching serves to satisfy reads before they ever make it to the NAS head, which can leave the NAS head with those operations that cannot be cached, worked around or generally ameliorated -- which is to say, writes.

If the workload mix is dated because it does not express the rise of DRAM as cache, one might think that this would also shine through in the results, with systems increasingly using DRAM cache to achieve a high SPEC SFS result. But this has not in fact transpired, and the reason it hasn't brings us to the first fatal flaw of SPEC SFS: instead of making the working set a parameter of the benchmark -- and having a result be not a single number but rather a graph of results given different working set sizes -- the working set size is dictated by the desired number of operations per second. In particular, in running SPEC SFS 3.0, one is required to have ten megabytes of underlying filesystem for every operation per second. (Of this, 10% is utilized as the working set of the benchmark.) This "scaling rule" is a grievous error, for it diminishes the importance of cache as load climbs: in order to achieve higher operations per second, one must have larger working sets -- even though there is absolutely no reason to believe that such a simple, linear relationship exists in actual workloads. (Indeed, in my experience, if anything the opposite may be true: those who are operation intensive seem to have smaller working sets, not larger ones -- and those with larger amounts of data tend to focus on bandwidth more than operations per second.)

Interestingly, when this scaling rule was established, it was done so with some misgivings. According to David Robinson's paper (emphasis added):

From the large file set created, a smaller working set is chosen for actual operations. In SFS 1.0 the working set size was 20% of file set size or 1 MB per op/sec. With the doubling of the file set size in SFS 2.0, the working set was cut in half to 10% to maintain the same working set size. Although the amount of disk storage grows at a rapid rate, the amount of that storage actually being accessed grows at a much slower rate. [ ... ] A 10% working set size may still be too large. Further research in this area is needed.

David, at least, seems to have been aware that this scaling rule was specious even a decade ago. But if the scaling rule was suspect in the mid-1990s, it has become absurd since. To see why, take, for example, NetApp's reasonably recent result of 137,306 operations per second. Getting to this number requires 10 MB per op/sec, or about 1.3 TB. Now, 10% of this -- or about 130GB -- will be accessed over the course of the benchmark. The problem is that from the perspective of caching, the only hope here is to cache metadata, as the data itself exceeds the size of cache and the data access pattern is essentially random. With the cache effectively useless, the engineering problem is no longer designing intelligent caching architectures, but rather designing a system that can quickly serve data from disk. Solving the former requires creativity, trade-offs and balance -- but solving the latter just requires brute force: fast drives and more of 'em. And in this NetApp submission, the force is particularly shock-and-awe: not just 15K RPM drives, but a whopping 224 144GB 15K RPM drives -- delivering 32TB of raw capacity for a mere 1.3TB filesystem. Why would anyone overprovision storage by a factor of 20? The answer is that with the filesystem presumably designed to allocate from outer tracks before inner ones, allocating only 5% of available capacity guarantees that all data will live on those fastest, outer tracks. This practice -- so-called short-stroking -- means both faster transfers and minimal head movement, guaranteeing that any I/O operation can be satisfied in just the rotational latency of a 15K RPM drive.

Short-stroking 224 15K RPM drives is the equivalent of fueling a dragster with nitromethane -- it is top performance at a price so high as to be useless off the dragstrip. It's a safe bet that if one actually had this problem -- if one wished to build a system to optimize for random reads within a 130GB working set over a total data set of 1.3TB -- one would never consider such a costly solution. How, then, would one solve this particular problem? Putting the entire data set on flash would certainly become tempting: an all flash-based solution is both faster and cheaper than the fleet of nitro-belching 15K RPM drives. But if this is so, does it mean that the future of SFS is to be flash-based configurations vying for king of an increasingly insignificant hill? It might have been so were it not for the revisions in SPEC SFS 2008: the scaling rule has gone from absurd to laughably ludicrous, as what used to be 10MB per op/sec is now 120MB per op/sec. And as if this recklessness were not enough, the working set ratio has additionally been increased from 10% to 30% of total storage. One can only guess what inspired this descent into madness, but the result is certainly insane: to achieve this same 137,306 ops will require a 17TB filesystem -- of which an eye-watering 5TB will be hot! This is nearly a 40X increase in working set size, without (as far as I can tell) any supporting data. At best, David's warning that the scaling rule may have been excessive has been roundly ignored; at worst, the vendors have deliberately calculated how to adjust the problem posed by the benchmark such that thousands of 15K RPM drives remain the only possible solution, even in light of new technologies like flash. But it's hard to know for sure which case SPEC has fallen into: the decision to both increase the scaling rule and increase the working set ratio is so terrible that incompetence becomes indistinguishable from malice.
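
For those who want to check the arithmetic, the scaling rules reduce to a few multiplications; here is a trivial calculation for the 137,306 op/sec example, using only the figures quoted above:

/*
 * Back-of-the-envelope check of the SPEC SFS scaling rules discussed
 * above, using the 137,306 op/sec submission as the example.
 */
#include <stdio.h>

int
main(void)
{
        double ops = 137306;

        double sfs3_fs = ops * 10 / 1e6;        /* 10 MB per op/sec, in TB */
        double sfs3_ws = sfs3_fs * 0.10;        /* 10% working set */

        double sfs08_fs = ops * 120 / 1e6;      /* 120 MB per op/sec, in TB */
        double sfs08_ws = sfs08_fs * 0.30;      /* 30% working set */

        printf("SFS 3.0:  %4.1f TB filesystem, %4.2f TB hot\n",
            sfs3_fs, sfs3_ws);
        printf("SFS 2008: %4.1f TB filesystem, %4.2f TB hot\n",
            sfs08_fs, sfs08_ws);
        printf("working set growth: %.0fX\n", sfs08_ws / sfs3_ws);

        return (0);
}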

Be it due to incompetence or malice, SPEC's descent into a disk benchmark while masquerading as a system benchmark does worse than simply mislead the customer, it actively encourages the wrong engineering decisions. In particular, as long as SPEC SFS is thought to be the canonical metric of NFS performance, there is little incentive to add cache to NAS heads. (If SPEC SFS isn't going to use it, why bother?) The engineering decisions made by the NAS market leaders reflect this thinking, as they continue to peddle grossly undersized DRAM configurations -- like NetApp's top-of-the-line FAS6080 and its meager maximum of 32GB of DRAM per head! (By contrast, our Sun Storage 7410 has up to 128GB of DRAM -- and for a fraction of the price, I hasten to add.) And it is of no surprise that none of the entrenched players conceived of the hybrid storage pool; SPEC SFS does little to reward cache, so why focus on it? (Aside from the fact that it delivers much faster systems, of course!)

While SPEC SFS is hampered by its ancient workload and made ridiculous by its scaling rule, there is a deeper and more pernicious flaw in SPEC SFS: there is no pricing disclosure. This flaw is egregious, unconscionable and inexcusable: as the late, great Jim Gray made clear in his classic 1985 Datamation paper, one cannot consider performance in a vacuum -- when purchasing a system, performance must be considered relative to price. Gray tells us how the database community came to understand this: in 1973, a bank received two bids for a new transaction system. One was for $5M from a mini-computer vendor (e.g. DEC with its PDP-11), the other for $25M from a traditional mainframe vendor (presumably IBM). The solutions offered identical performance; the fact that there was a 5X difference in price (and therefore price/performance) "crystallized" (in Gray's words) the importance of price in benchmarking -- and Gray's paper in turn enshrined price as an essential metric of a database system. (Those interested in the details of the origins of Gray's iconoclastic Datamation paper and the long shadow that it has cast are encouraged to read David DeWitt and Charles Levine's excellent retrospective on Gray's work in database performance.) Today, the TPC benchmarks that Gray inspired have pricing at their heart: each submission is required to have a full disclosure report (FDR) that must include the price of the system and everything that that price includes, including part numbers and per-part pricing. Moreover, the system must be orderable: customers must be able to call up the vendor and demand the specified config at the specified price. This is a beautiful thing, because TPC allows for competition not just on performance ("TpmC" in TPC parlance) but also price/performance ($/TpmC). And indeed, in the 1990s, this is exactly what happened as low $/TpmC submissions from the likes of SQLServer running on Dell put competitive pressure on vendors like Sun to focus on price/performance -- with customers being the clear winners in the contest.

By contrast, SPEC SFS's absence of a pricing disclosure forbids competitors from competing on price/performance, instead encouraging absolute performance at any cost. This was taken to the logical extreme with NetApp and their preposterous 1,032,461 result -- which took but 2,016 short-stroked 15K RPM drives! Steven Schwartz took NetApp to task for the exorbitance of this configuration, pointing out that NetApp's configuration was two to four times more expensive on a per-op basis than competitive results in his blog entry aptly titled "Benchmarks - Lies and the Lying Liars Who Tell Them."

But are lower results any less outrageous? Take again that NetApp config. We don't know how much that 3170 and its 224 15K RPM drives will cost you because NetApp isn't forced to disclose it, but suffice it to say that it's quite a bit -- almost certainly seven figures undiscounted. But for the sake of argument, let's assume that you get a steep discount and you somehow get this clustered, racked-out config to price out at $500K. Even then, given the meager 1.3TB delivered for purposes of the benchmark, this system costs an eye-watering $384/GB -- which is about 8X more expensive than DRAM! So even in the unlikely event that your workload and working set match SPEC SFS, you would still be better off blowing your wallet on a big honkin' RAM disk than buying the benchmarked configuration. And this embodies the essence of the failings of SPEC SFS: the (mis)design of the benchmark demands economic insanity -- but the lack of pricing disclosure conceals that insanity from the casual observer. The lesson of SPEC SFS is therefore manifold: be skeptical of a system benchmark that is synthetic, be suspicious of a system benchmark that lacks a price disclosure -- and be damning when they are one and the same.

With the SPEC SFS carcass dismembered and dispensed with, where does this leave Fishworks and our promise to deliver revolutionary price/performance? After considering SPEC SFS (and rejecting it, obviously), we came to believe that the storage benchmark well was so poisoned that the best way to demonstrate the performance of the product would be simple microbenchmarks that customers could run themselves -- which had the added advantage of being closer to the raw capabilities that customers wanted to talk about anyway. In this spirit, see, for example, Brendan's blog entry on the 7410's performance limits. Or, if you're more interested in latency than bandwidth, check out his screenshots of the L2ARC in action. Most importantly: don't take our word for it -- get one yourself, run it with your workload, and then use our built-in analytics to understand not just how the system runs, but also why. We have, after all, designed our systems to be run, not just to be sold...

Wednesday Jan 21, 2009

The Hunter becomes the Hunted

I recently came into a copy of Dave Hitz's new book How to Castrate a Bull. A full review is to come, but I couldn't wait to serve up one delicious bit of irony. Among the book's many unintentionally fascinating artifacts is NetApp's original business plan, dated January 16, 1992. In that plan, NetApp's proposed differentiators are high availability, easy administration, high performance and low price -- differentiators that are eerily mirrored by Fishworks' proposed differentiators nearly fourteen years later. But the irony goes non-linear when Hitz discusses the "Competition" in that original business plan:

Sun Microsystems is the main supplier of NFS file servers. Sun sells over 2/3 of all NFS file servers. Our initial product will be positioned to cost significantly less than Sun's lower-end server, with performance comparable to their high-end servers.

It is unlikely that Sun will be able to produce a server that performs as well, or costs as little, for several reasons:

  • Sun's server hardware is inherently more expensive because it has lower production volumes than our components [...]

  • The culture among software engineers at Sun places little value on performance.

  • The structure of Sun -- with SunSoft doing NFS and UNIX, and SMCC [Sun Microsystems Computer Corporation] doing hardware -- makes it difficult for Sun to produce products that provide creative software-hardware system solutions.

  • Sun's distribution costs will likely remain high due to the level of technical support required to install and manage a Sun server.

While I had known that NetApp targeted Sun in its early days, I had no idea how explicit that attack had been. Now, it must be said that Hitz was right about Sun on all counts -- and that NetApp thoroughly disrupted Sun with its products, ultimately coming to dominate the NAS market itself. But it is stunning the degree to which NetApp's own business plan -- nearly verbatim -- is being used against it, not least by the very company that NetApp originally disrupted. (See Slide 5 of the Fishworks elevator pitch -- and use your imagination.) Indeed, like NetApp in the 1990s, the Sun Storage 7000 Series is not disruptive by accident, and as I elaborate on in this presentation, we are very deliberately positioning the product to best harness the economic winds blowing so strongly in its favor.

NetApp's success with their original business plan and our nascent success with Fishworks point to the most important lesson that the history of technology has to teach: economics always wins -- a product or a technology or a company ultimately cannot prop up unsustainable economics. Perhaps unlike Hitz, however, I had to learn that lesson the hard way: in the post-bubble meltdown that brought Sun within an inch of its life. But then again, perhaps Hitz has yet to have his final lesson on the subject...

Wednesday Dec 31, 2008

Catching disk latency in the act

Today, Brendan made a very interesting discovery about the potential sources of disk latency in the datacenter. Here's a video we made of Brendan explaining (and demonstrating) his discovery:



This may seem silly, but it's not farfetched: Brendan actually made this discovery while exploring drive latency that he had seen in a lab machine due to a missing screw on a drive bracket. (!) Brendan has more details on the discovery, demonstrating how he used the Fishworks analytics to understand and visualize it.

If this has piqued your curiosity about the nature of disk mechanics, I encourage you to read Jon Elerath's excellent ACM Queue article, Hard disk drives: the good, the bad and the ugly! As Jon notes, noise is a known cause of what is called a non-repeatable runout (NRRO) -- though it's unclear if Brendan's shouting is exactly the kind of noise-induced NRRO that Jon had in mind...

Sunday Nov 16, 2008

On Modalities and Misadventures

Part of the design center of Fishworks is to develop powerful infrastructure that is also easy to use, with the browser as the vector for that usability. So it's been incredibly vindicating to hear some of the initial reaction to our first product, which uses words like "truly breathtaking" to describe the interface.

But as much as we believe in the browser as system interface, we also recognize that it cannot be the only modality for interacting with the system: there remain too many occasions when one needs the lightweight precision of a command line interface. These occasions may be due to usability concerns (starting a browser can be an unnecessary and unwelcome burden for a simple configuration change), but they may also arise because there is no alternative: one cannot use a browser to troubleshoot a machine that cannot communicate over the network or to automate interaction with the system. For various reasons, I came to be the one working on this problem at Fishworks -- and my experiences (and misadventures) in solving it present several interesting object lessons in software engineering.

Before I get to those, a brief aside about our architecture: as will come as no surprise to anyone who has used our appliance and is familiar with browser-based technologies, our interface is AJAX-based. In developing an AJAX-based application, one needs to select a protocol for client-server communication, and for a variety of reasons, we selected XML-RPC -- a simple XML-based remote procedure call protocol. XML-RPC has ample client support (client libraries are readily available for JavaScript, Perl, Python, etc.), and it was a snap to write a C-based XML-RPC server. This allowed us to cleanly separate our server logic (the "controller" in MVC parlance) from our client (the "view"), and (more importantly) it allowed us to easily develop an automated test suite to test server-side logic. Now, (very) early in our development I had written a simple Perl script -- which I called "aksh", the appliance kit shell -- to allow us to manually test the server side. It provided a simple, captive, command-line interface with features like command history, and it gave developers (that is to say, us) the ability to manually tickle their server-side logic without having to write client-side code.
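
For those unfamiliar with XML-RPC, its appeal is its simplicity: a method call is just an HTTP POST of a small XML document, and the response is another small XML document. The following C sketch (using libcurl) is purely illustrative -- the method name, parameter and endpoint URL are invented for the example and are not the appliance's actual interface:

/*
 * Illustrative only: what an XML-RPC call looks like on the wire.  The
 * method name ("example.getProperty"), its parameter and the endpoint
 * URL are hypothetical, not the appliance's actual interface.
 */
#include <stdio.h>
#include <curl/curl.h>

int
main(void)
{
        const char *request =
            "<?xml version=\"1.0\"?>\n"
            "<methodCall>\n"
            " <methodName>example.getProperty</methodName>\n"
            " <params>\n"
            "  <param><value><string>space_available</string></value></param>\n"
            " </params>\n"
            "</methodCall>\n";
        struct curl_slist *hdrs = NULL;
        CURL *curl;
        CURLcode res;

        curl_global_init(CURL_GLOBAL_ALL);

        if ((curl = curl_easy_init()) == NULL)
                return (1);

        hdrs = curl_slist_append(hdrs, "Content-Type: text/xml");
        curl_easy_setopt(curl, CURLOPT_URL, "https://myappliance/RPC2");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, request);

        /*
         * The XML <methodResponse> is written to stdout by libcurl's
         * default write callback.
         */
        if ((res = curl_easy_perform(curl)) != CURLE_OK)
                fprintf(stderr, "%s\n", curl_easy_strerror(res));

        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
        curl_global_cleanup();

        return (res == CURLE_OK ? 0 : 1);
}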

In part because I had written this primordial shell, the task of writing the command line interface fell to me. And when I first approached the problem it seemed natural to simply extend that little Perl script into something more elaborate. That is, I didn't stop to think if Perl was still the right tool for what was a much larger job. In not stopping to reconsider, I was committing a classic software engineering blunder: the problem had changed (and in particular, it had grown quite a bit larger than I understood it to be), but I was still thinking in terms of my existing (outmoded) solution.

As I progressed through the project -- as my shell surpassed 1,000 lines and made its way towards 10,000 -- I was learning of a painful truth about Perl that many others had discovered before me: that it is an undesigned dung heap of a language entirely inappropriate for software in-the-large. As a coping mechanism, I began to vent my frustrations at the language with comments in the source, like this vitriolic gem around exceptions:


eval {
        #
        # We need to install our own private __WARN__ handler that
        # calls die() to be able to catch any non-numeric exception
        # from the below coercion without inducing a tedious error
        # message as a side-effect. And has it been said recently that
        # Perl is a trash heap of a language? Indeed, it reminds one
        # of a reeking metropolis like Lagos or Nairobi: having long
        # since outgrown its original design (such as there was ever
        # any design to begin with), it is now teeming but crippled --
        # sewage and vigilantes run amok. And yet, still the masses
        # come. But not because it is Utopia. No, they come only
        # because this dystopia is marginally better than the only
        # alternatives that they know...
        #
        local $SIG{'__WARN__'} = sub { die(); };
        $val = substr($value, 0, length($value) - 1) + 0.0;
};

(In an attempt to prevent roaming gangs of glue-huffing Perl-coding teenagers from staging raids on my comments section: I don't doubt that there's a better way to do what I was trying to achieve above. But I would counter that there's also a way to live like a king in Lagos or Nairobi -- that doesn't make them tourist destinations.)

Most disconcertingly, the further I got into the project, the more the language became an impediment -- exactly the wrong trajectory for an evolving software system. And so as I wrote more and more code -- and wrestled more and more with the ill-suited environment -- the feeling haunting me became unmistakable: this is the wrong path. There's no worse feeling for a software engineer: knowing that you have made what is likely the wrong decision, but feeling that you can't revisit that decision because of the time already spent down the wrong path. And so, further down the wrong path you go...

Meanwhile, as I was paying the price of my hasty decision, Eric -- always looking for a way to better test our code -- was experimenting with writing a test harness in which he embedded SpiderMonkey and emulated a DOM layer. These experiments were a complete success: Eric found that embedding SpiderMonkey into a C program was a snap, and the end result allowed us to get automated test coverage over client JavaScript code that previously had to be tested by hand.

Given both Eric's results and my increasing frustrations with Perl, an answer was becoming clear: I needed to rewrite the appliance shell as a JavaScript/C hybrid, with the bulk of the logic living in JavaScript and system interaction bits living in C. This would allow our two interface modalities (the shell and the web) to commonalize logic, and it would eradicate a complicated and burdensome language from our product. While this seemed like the right direction, I was wary of making another hasty decision. So I started down the new path by writing a library in C that could translate JavaScript objects into an XML-RPC request (and the response back into JavaScript objects). My thinking here was that if the JavaScript approach turned out to be the wrong approach for the shell, we could still use the library in Eric's new testing harness to allow a wider range of testing. As an aside, this is a software engineering technique that I have learned over the years: when faced with a decision, determine if there are elements that are common to both paths, and implement them first, thereby deferring the decision. In my experience, making the decision after having tackled some of its elements greatly informs the decision -- and because the work done was common, no time (or less time) was lost.

In this case, I had the XML-RPC client library rock-solid after about a week of work. The decision could be deferred no longer: it was time to rewrite Perl functionality in JavaScript -- time that would indeed be wasted if the JavaScript path was a dead-end. So I decided that I would give myself a week. If, after that week, it wasn't working out, at least I would know why, and I would be able to return to the squalor of Perl with fewer doubts.

As it turns out, after that week, it was clear that the JavaScript/C hybrid was the much better approach -- Perl's death warrant had been signed. And here we come to another lesson that I learned that merits an aside: in the development of DTrace, one regret was that we did not start the development of our test suite earlier. We didn't make that mistake with Fishworks: the first test was created just minutes after the first line of code. Now that I needed to rewrite the shell, this approach paid an unexpected dividend: because I had written many tests for the old, Perl-based shell, I had a ready-made test suite for my new work. Therefore, contrary to the impressions of some about test-driven development, the presence of tests actually accelerated the development of the new shell tremendously. And of course, once I integrated the new shell, I could say with confidence that it did not contain regressions over the old shell. (Indeed, the only user-visible change was that it was faster. Much, much, much faster.)

While it was frustrating to think of the lost time (it ultimately took me six weeks to get the new JavaScript-based shell back to where the old Perl-based shell had been), it was a great relief to know that we had put the right architecture in place. And as often happens when the right software architecture is in place, the further I went down the path of the JavaScript/C hybrid, the more often I had the experience of new, interesting functionality simply falling out. In particular, it became clear that I could easily add a second JavaScript instance to the shell to allow for a scripting environment. This allows users to build full, programmatic flow control into their automation infrastructure without ever having to "screen scrape" output. For example, here's a script to display the used and available space in each share on the appliance:


script
run('shares');
projects = list();

printf('%-40s %-10s %-10s\n', 'SHARE', 'USED', 'AVAILABLE');

for (i = 0; i < projects.length; i++) {
        run('select ' + projects[i]);
        shares = list();

        for (j = 0; j < shares.length; j++) {
                run('select ' + shares[j]);

                share = projects[i] + '/' + shares[j];
                used = run('get space_data').split(/\s+/)[3];
                avail = run('get space_available').split(/\s+/)[3];

                printf('%-40s %-10s %-10s\n', share, used, avail);
                run('cd ..');
        }

        run('cd ..');
}

If you saved the above to a file named "space.aksh", you could run it this way:


% ssh root@myappliance < space.aksh
Password:
SHARE                                    USED       AVAILABLE
admin/accounts                           18K        248G
admin/exports                            18K        248G
admin/primary                            18K        248G
admin/traffic                            18K        248G
admin/workflow                           18K        248G
aleventhal/hw_eng                        18K        248G
bcantrill/analytx                        1.00G      248G
bgregg/dashbd                            18K        248G
bgregg/filesys01                         25.5K      100G
bpijewski/access_ctrl                    18K        248G
...

(You can also upload SSH keys to the appliance if you do not wish to be prompted for the password.)

As always, don't take our word for it -- download the appliance and check it out yourself! And if you have the appliance (virtual or otherwise), click on "HELP" and then type "scripting" into the search box to get full documentation on the appliance scripting environment!

Sunday Nov 09, 2008

Fishworks: Now it can be told

In October 2005, longtime partner-in-crime Mike Shapiro and I were taking stock. Along with Adam Leventhal, we had just finished DTrace -- and Mike had finished up another substantial body of work in FMA -- and we were beginning to wonder what was next. As we looked at Solaris 10, we saw an incredible building block -- the best, we felt, ever made, with revolutionary technologies like ZFS, DTrace, FMA, SMF and so on. But we also saw something lacking: despite being a great foundation, the truth was that the technology wasn't being used in many essential tasks in information infrastructure, from routing packets to storing blocks to making files available over the network. This last one especially grated: despite having invented network attached storage with NFS in 1983, and despite having the necessary components to efficiently serve files built into the system, and despite having exciting hardware like Thumper, and despite having absolutely killer technologies like ZFS and DTrace, Sun had no share -- none -- of the NAS market.

As we reflected on why this was so -- why, despite having so many of the necessary parts Sun had not been able to put together a compelling integrated product -- we realized that part of the problem was organizational: if we wanted to go solve this problem, it was clear that we could not do it from the confines of a software organization. With this in mind, we requested a meeting with Greg Papadopoulos, Sun's CTO, to brainstorm. Greg quickly agreed to a meeting, and Mike and I went to his office to chat. We described the problem that we wanted to solve: integrate Sun's industry-leading components together and build on them to develop a killer NAS box -- one with differentiators only made possible by our technology. Greg listened intently as we made our pitch, and then something unexpected happened -- something that tells you a lot about Sun: Greg rose from his chair and exclaimed, "let's do it!" Mike and I were caught a bit flat-footed; we had expected a much safer, more traditional answer -- like "let's commission a task force!" or something -- and instead here was Greg jumping out in front: "Get me a presentation that gives some of the detail of what you want to do, and I'll talk to Jonathan and Scott about it!"

Back in the hallway, Mike and I looked at each other, still somewhat in disbelief that Greg had been not just receptive, but so explicitly encouraging. Mike said to me exactly what I was thinking: "Well, I guess we're doing this!"

With that, Mike and I pulled into a nearby conference room, and we sat down with a new focus. This was neither academic exercise nor idle chatter over drinks -- we now needed to think about what specifically separated our building blocks from a NAS appliance. With that, we started writing missing technologies on the whiteboard, which soon became crowded with things like browser-based management, clustering, e-mail alerts, reports, integrated fault management, seamless upgrades and rollbacks, and so on. When the whiteboard was full and we took a look at all of it, the light went on: virtually none of this stuff was specific to NAS. At that instant, we realized that the NAS problem was but one example of a larger problem, and that the infrastructure to build fully-integrated, special-purpose systems was itself general-purpose across those special purposes!

We had a clear picture of what we wanted to go do. We put our thoughts into a presentation that we entitled "A Problem, An Opportunity & An Idea" (of which I have made available a redacted version) and sent that to Greg. A week or so later, we had a con-call with Greg, in which he gave us the news from Scott and Jonathan: they bought it. It was time to put together a business plan, build our team and get going.

Now Mike and I went into overdrive. First, we needed a name. I don't know how long he had been thinking about it, or how it struck him, but Mike said that he was thinking of the name "Fishworks", it not only being a distinct name that paid homage to a storied engineering tradition (and with an oblique Simpsons reference to boot), but one that also embedded an apt acronym: "FISH", Mike explained, stood for "fully-integrated software and hardware" -- which is exactly what we wanted to go build. I agreed that it captured us perfectly -- and Fishworks was born.

We built our team -- including Adam, Eric and Keith -- and on February 15, 2006, we got to work. Over the next two and a half years, we went through many changes: our team grew to include Brendan, Greg, Cindi, Bill, Dave and Todd; our technological vision expanded as we saw the exciting potential of the flash revolution; and our product scope was broadened through hundreds of conversations with potential customers. But through these changes our fundamental vision remained intact: that we would build a general purpose appliance kit -- and that we would use it to build a world-beating NAS appliance. Today, at long last, the first harvest from this long labor is available: the Sun Storage 7110, Sun Storage 7210 and Sun Storage 7410.



It is deeply satisfying to see these products come to market, especially because the differentiators that we so boldly predicted to Sun's executives so long ago have not only come to fruition, they are also delivering on our promise to set the product apart in the marketplace. Of these, I am especially proud of our DTrace-based appliance analytics. With analytics, we sought to harness the great power of DTrace: its ability to answer ad hoc questions that are phrased in terms of the system's abstractions instead of its implementation. We saw an acute need for this in network storage, where even market-leading products cannot answer the most basic of questions: "what am I serving and to whom?" The key, of course, was to capture the strength of DTrace visually -- and the trick was to give up enough of the arbitrary latitude of DTrace to allow for strictly visual interactions without giving up so much as to unnecessarily limit the power of the facility.

I believe that the result -- which you can sample in this screenshot -- does more than simply strike the balance: we have come up with ways to visualize and interact with data that actually function as a force multiplier for the underlying instrumentation technology. So not only does analytics bring the power of DTrace to a much broader spectrum of technologists, it also -- thanks to the wonders of the visual cortex -- has much greater utility than just DTrace alone. (Or, as one hardened veteran of command line interfaces put it to me, "this is the one GUI that I actually want to use!")

There is much to describe about analytics, and for those interested in a reasonably detailed guided tour of the facility, check out this presentation on analytics that I will be giving later this week at Sun's Customer Engineering Conference in Las Vegas. While the screenshots in that presentation are illustrative, the power of analytics (like DTrace before it) is in actually seeing it for yourself, in real-time. You can get a flavor for that in this video, in which Mike and I demonstrate and discuss analytics. (That video is part of a larger Inside Fishworks series that takes you through many elements of our team and the product.) While the video is great, it still can't compare to seeing analytics in your own environment -- and for that, you should contact your Sun rep or Sun reseller and arrange to test drive an appliance yourself. Or if you're the impatient show-me-now kind, download this VMware image that contains a full, working Sun Storage 7000 appliance, with 16 (virtual) disks. Configure the virtual appliance, add a few shares, access them via CIFS, WebDAV, NFS, whatever, and bust out some analytics!

Monday Nov 03, 2008

Concurrency's Shysters

For as long as I've been in computing, the subject of concurrency has always induced a kind of thinking man's hysteria. When I was coming up, the name of the apocalypse was symmetric multiprocessing -- and its arrival was to be the Day of Reckoning for software. There seemed to be no end of doomsayers, even among those who putatively had the best understanding of concurrency. (Of note was a famous software engineer who -- despite substantial experience in SMP systems at several different computer companies -- confidently asserted to me in 1995 that it was "simply impossible" for an SMP kernel to "ever" scale beyond 8 CPUs. Needless to say, several of his past employers have since proved him wrong...)

There also seemed to be no end of concurrency hucksters and shysters, each eager to peddle their own quack cure for the miasma. Of these, the one that stuck in my craw was the two-level scheduling model, whereby many user-level threads are multiplexed on fewer kernel-level (schedulable) entities. (To paraphrase what has been oft said of little-known computer architectures, you haven't heard of it for a reason.) The rationale for the model -- that it allowed for cheaper synchronization and lightweight thread creation -- seemed to me at the time to be long on assertions and short on data. So working with my undergraduate advisor, I developed a project to explore this model both quantitatively and dynamically, work that I undertook in the first half of my senior year. And early on in that work, it became clear that -- in part due to intractable attributes of the model -- the two-level thread scheduling model was delivering deeply suboptimal performance...

Several months after starting the investigation, I came to interview for a job at Sun with Jeff, and he (naturally) asked me to describe my undergraduate work. I wanted to be careful here: Sun was the major proponent of the two-level model, and while I felt that I had the hard data to assert that the model was essentially garbage, I also didn't want to make a potential employer unnecessarily upset. So I stepped gingerly: "As you may know," I began, "the two-level threading model is very... intricate." "Intricate?!" Jeff exclaimed, "I'd say it's completely busted!" (That moment may have been the moment that I decided to come work with Jeff and for Sun: the fact that an engineer could speak so honestly spoke volumes for both the engineer and the company. And despite Sun's faults, this engineering integrity remains at Sun's core to this day -- and remains a draw to so many of us who have stayed here through the ups and downs.) With that, the dam had burst: Jeff and I proceeded to gush about how flawed we each thought the model to be -- and how dogmatic its presentation. So paradoxically, I ended up getting a job at Sun in part by telling them that their technology was unsound!

Back at school, I completed my thesis. Like much undergraduate work, it's terribly crude in retrospect -- but I stand behind its fundamental conclusion that the unintended consequences of the two-level scheduling model make it essentially impossible to achieve optimal performance. Upon arriving at Sun, I developed an early proof-of-concept of the (much simpler) single-level model. Roger Faulkner did the significant work of productizing this as an alternative threading model in Solaris 8 -- and he eliminated the two-level scheduling model entirely in Solaris 9, thus ending the ill-begotten experiment of the two-level scheduling model somewhere shy of its tenth birthday. (Roger gave me the honor of approving his request to integrate this work, an honor that I accepted with gusto.)

So why this meandering walk through a regrettable misadventure in the history of software systems? Because over a decade later, concurrency is still being used to instill panic in the uninformed. This time, it is chip-level multiprocessing (CMP) instead of SMP that promises to be the End of Days -- and the shysters have taken a new guise in the form of transactional memory. The proponents of this new magic tonic are in some ways darker than their forebears: it is no longer enough to warn of Judgement Day -- they must also conjure up notions of Original Sin to motivate their perverted salvation. "The heart of the problem is, perhaps, that no one really knows how to organize and maintain large systems that rely on locking," admonished Nir Shavit recently in CACM. (Which gives rise to the natural follow-up question: is the Solaris kernel not large, does it not rely on locking, or do we not know how to organize and maintain it? Or is it that we do not exist at all?) Shavit continues: "Locks are not modular and do not compose, and the association between locks and data is established mostly by convention." Again, no data, no qualifiers, no study, no rationale, no evidence of experience trying to develop such systems -- just a naked assertion used as a prop for a complicated and dubious solution. Are there elements of truth in Shavit's claims? Of course: one can write sloppy, lock-based programs that become a galactic, unmaintainable mess. But does that mean that such monstrosities are inevitable? No, of course not.

So fine, the problem statement is (deeply) flawed. Does that mean that the solution is invalid? Not necessarily -- but experience has taught me to be wary of crooked problem statements. And in this case (perhaps not surprisingly), I take umbrage at the solution as well. Even if one assumes that writing a transaction is conceptually easier than acquiring a lock, and even if one further assumes that transaction-based pathologies like livelock are easier on the brain than lock-based pathologies like deadlock, there remains a fatal flaw with transactional memory: much system software can never be in a transaction because it does not merely operate on memory. That is, system software frequently takes action outside of its own memory, requesting services from software or hardware operating on a disjoint memory (the operating system kernel, an I/O device, a hypervisor, firmware, another process -- or any of these on a remote machine). In much system software, the in-memory state that corresponds to these services is protected by a lock -- and the manipulation of such state will never be representable in a transaction. So for me at least, transactional memory is an unacceptable solution to a non-problem.
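
(To make that concrete, consider the following sketch -- contrived C with hypothetical names, not code from any actual system -- of a lock-based critical section in which the lock protects both in-memory state and a request made of something beyond memory. The write() to the device can be neither deferred nor rolled back, so no memory transaction can stand in for the mutex.)


/*
 * A contrived sketch: the lock protects both in-memory state (ds_pending)
 * and the ordering of requests issued to a device -- an action outside of
 * memory that cannot be undone, and therefore cannot be expressed as a
 * memory transaction.  All names here are hypothetical.
 */
#include <pthread.h>
#include <unistd.h>

typedef struct devstate {
        pthread_mutex_t ds_lock;        /* protects ds_pending and the device */
        int ds_fd;                      /* file descriptor for the device */
        int ds_pending;                 /* in-memory count of issued requests */
} devstate_t;

int
devstate_submit(devstate_t *dsp, const void *buf, size_t len)
{
        ssize_t rv;

        (void) pthread_mutex_lock(&dsp->ds_lock);
        dsp->ds_pending++;                      /* pure memory: TM could handle this */
        rv = write(dsp->ds_fd, buf, len);       /* not memory: there is no undo */

        if (rv < 0)
                dsp->ds_pending--;

        (void) pthread_mutex_unlock(&dsp->ds_lock);
        return (rv < 0 ? -1 : 0);
}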

As it turns out, I am not alone in my skepticism. When we on the Editorial Advisory Board of ACM Queue sought to put together an issue on concurrency, the consensus was twofold: to find someone who could provide what we felt was much-needed dissent on TM (and in particular on its most egregious outgrowth, software transactional memory), and to have someone speak from experience on the rise of CMP and what it would mean for practitioners.

For this first article, we were lucky enough to find Calin Cascaval and colleagues, who ended up writing a must-read article on STM in November's CACM. Their conclusions are unavoidable: STM is a dog. (Or as Cascaval et al. more delicately put it: "Based on our results, we believe that the road for STM is quite challenging.") Their work is quantitative and analytical and (best of all, in my opinion) the authors never lose sight of the problem that transactional memory was meant to solve: to make parallel programming easier. This is important, because while many of the leaks in the TM vessel can ultimately be patched, the patches themselves add layer upon layer of complexity. Cascaval et al. conclude:

    And because the argument for TM hinges upon its simplicity and productivity benefits, we are deeply skeptical of any proposed solutions to performance problems that require extra work by the programmer.

And while their language is tighter (and the subject of their work a weightier and more active research topic), the conclusions of Cascaval et al. are eerily similar to my final verdict on the two-level scheduling model, over a decade ago:

    The dominating trait of the [two-level] scheduling model is its complexity. Unfortunately, virtually all of its complexity is exported to the programmer. The net result is that programmers must have a complete understanding of the model and the inner workings of its implementation in order to be able to successfully tap its strengths and avoid its pitfalls.

So TM advocates: if Roger Faulkner knocks on your software's door bearing a scythe, you would be well-advised to not let him in...

For the second article, we brainstormed potential authors -- but as we dug up nothing but dry holes, I found myself coming to an inescapable conclusion: Jeff and I should write this, if nothing else as a professional service to prevent the latest concurrency hysteria from reaching epidemic proportions. The resulting article appears in full in the September issue of Queue, and substantially excerpted in the November issue of CACM. Writing the article was a gratifying experience, and gave us the opportunity to write down much of what we learned the hard way in the 1990s. In particular, it was cathartic to explore the history of concurrency. Having been at concurrency's epicenter for nearly a decade, I felt that the rise of CMP had recently been misrepresented as a failure of hardware creativity -- and it was vindicating to read CMP's true origins in the original DEC Piranha paper: that given concurrent databases and operating systems, implementing multiple cores on the die was simply the best way to deliver OLTP performance. That is, it was the success of concurrent software -- and not the failure of imagination on the part of hardware designers -- that gave rise to the early CMP implementations. Hopefully practitioners will enjoy reading the article as much as we enjoyed writing it -- and here's hoping that we live to see a day when concurrency doesn't attract so many schemers and dreamers!

Wednesday Sep 03, 2008

Happy 5th Birthday, DTrace!

It's hard to believe, but DTrace is five years old today: it was on September 3, 2003 that DTrace integrated into Solaris. DTrace was a project that extended all three of us to our absolute limit as software engineers -- and the 24 hours before integration was then (and remains now) the most harrowing of my career. As it will hopefully remain my most stressful experience as an engineer, the story of that final day merits a retelling...


Our project had been running for nearly two years, but it was not until mid-morning on September 2nd -- the day before we were slated to integrate -- that it was discovered that the DTrace prototype failed to boot on some very old hardware (the UltraSPARC-I, the oldest hardware still supported at that time). Now, "failed to boot" can mean a bunch of different things, but this was about as awful as it gets: a hard hang after the banner message. That is, booting mysteriously stopped making progress soon after control transferred to the kernel -- and one could not break in with the kernel debugger. This is an awful failure mode because with no debugger and no fatal error, one has no place to start other than adding print statements -- or ripping out the code that is the difference between the working system and the busted one. This was a terrifying position to be in less than 24 hours before integration! Strangely, it was only the non-DEBUG variant that failed to boot: the DEBUG version laden with assertions worked fine. Our only lucky break was that we were able to find two machines that exhibited the problem, enabling us to bifurcate our efforts: I started ripping out DTrace-specific code in one workspace, while Mike started frenetically adding print statements in another...


Meanwhile, while we were scrambling to save our project, Eric was having his first day at Sun. My office door was closed, and with our integration pending and me making frequent (and rapid) trips back and forth to the lab, the message to my coworkers was clear: stay the hell back. Eric was blissfully unaware of these implicit signals, however, and he cheerfully poked his head in my office to say hello (Eric had worked the previous summer in our group as an intern). I can't remember exactly what I said to Eric when he opened my office door, but suffice it to say that the implicit signals were replaced with a very explicit one -- and I remain grateful to this day that Eric didn't quit on the spot...


Back to our problem: Mike -- through a process of elimination -- had made the key breakthrough: it wasn't actually an instruction set architecture (ISA) issue, but rather it seemed to be a host bus adapter (HBA) issue. This was an incredibly important discovery: while we had a bevy of architectural changes that could conceivably be invalid on an ancient CPU, we had no such HBA-specific changes -- this was more likely to be something marring the surface of our work rather than cracking its foundation. Mike further observed that running a DEBUG variant of these ancient HBA drivers (esp and fas) would boot on an otherwise non-DEBUG kernel. At that, I remembered that we actually did have some cosmetic changes to these drivers, and on carefully reviewing the diffs, we found a deadly problem: in folding some old tracing code under a DEBUG-only #define, a critical line (the one that actually initiates the I/O) was compiled in only when DEBUG was defined. We hadn't seen this until now because these drivers were only used on ancient machines -- machines on which we had never tested non-DEBUG. We fixed the problem, and all of our machines booted DEBUG and non-DEBUG -- and we felt like we were breathing again for the first time in the more than six hours that we had been working on the problem. (Here is the mail that I sent out explaining the problem.)
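
(For those who would like to see the shape of the bug without digging through ancient SCSI HBA code, the pattern amounted to something like the following -- a contrived, compilable example with hypothetical names, not the actual esp or fas source.)


#include <stdio.h>

static void
start_io(int cmd)
{
        (void) printf("initiating I/O for command %d\n", cmd);
}

/*
 * Broken: in folding the old tracing under a DEBUG-only block, the line
 * that actually initiates the I/O went with it -- so a non-DEBUG build
 * silently never starts the I/O at all.
 */
static void
issue_broken(int cmd)
{
#ifdef DEBUG
        (void) printf("trace: issuing command %d\n", cmd);
        start_io(cmd);
#endif
}

/*
 * Intended: only the tracing is DEBUG-only; the I/O is always initiated.
 */
static void
issue_fixed(int cmd)
{
#ifdef DEBUG
        (void) printf("trace: issuing command %d\n", cmd);
#endif
        start_io(cmd);
}

int
main(void)
{
        issue_broken(1);        /* a silent no-op unless compiled with -DDEBUG */
        issue_fixed(2);
        return (0);
}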


To celebrate DTrace's birthday beyond just recounting the terror of its integration, I wanted to make a couple of documents public that we have not previously shared:


  • The primordial presentation on DTrace, developed in late 1999. Some of the core ideas are present here (in particular, production instrumentation and zero disabled probe effect), but we hadn't yet figured out some very basic notions -- like that we needed our own language.


  • Our first real internal presentation on DTrace, presented March 12, 2002 as a Kernel Technical Discussion. Here the thinking is much better fleshed out around kernel-level instrumentation -- and a prototype existed and was demonstrated. But a key direction for the technology -- the ability to instrument user-level generally and in semantically relevant ways in particular -- was still to come when Adam joined the team shortly after this presentation. (A video of this presentation also exists; in the unlikely event that anyone wants to actually relive three hours of largely outmoded thinking, I'll find a way to make it available.)


  • The e-mail we sent out after integration, September 3, 2003 -- five years ago today.


We said it then, and it's even truer today: it's been quite a ride. Happy 5th Birthday, DTrace -- and thanks again to everyone in the DTrace community for making it what it has become!

Tuesday Jul 29, 2008

DTrace and the Palisades Interstate Parkway

In general, I don't believe in drawing attention to bugs in the software of others: any significant body of software is likely to have bugs, and I think one can too easily draw overly broad inferences by looking at software through the lens of its defects (a pathology that I have previously discussed at some length). However -- and as you might imagine from the preamble -- I'm about to make an exception to that gentlemanly rule...


I, along with zillions of others, read the breathless hype about the new would-be Google slayer, cuil. When a new search engine pops up, my egotistical reflex is to first search for "dtrace", and the results of searching for "dtrace" on cuil were very, um, interesting. The search results themselves were fine; more creative were the images that cuil decided to associate with them. If you look at that screenshot, you will be able to find an image of a quilt, a strip mall, what could pass for a program from an August Wilson play, and -- strangest of all -- a sign that reads "Welcome to Palisades Interstate Parkway". I can't say for certain that I've never travelled on the Palisades Interstate Parkway, but with all deference to that 38.25-mile stretch of tarmac, I do believe that I can say that it played no role in DTrace -- or DTrace in it. Indeed, I can say with absolute confidence that searching for "palisades interstate parkway dtrace" will, in short order, yield only this blog entry -- provided, that is, that one doesn't perform said search on cuil... ;)

Friday Jul 18, 2008

Revisiting the Intel 432

As I have discussed before, I strongly believe that to understand systems, you must understand their pathologies -- systems are most instructive when they fail. Unfortunately, we in computing systems do not have a strong history of studying pathology: despite the fact that failure in our domain can be every bit as expensive (if not more so) than in traditional engineering domains, our failures do not (usually) involve loss of life or physical property and there is thus little public demand for us to study them -- and a tremendous industrial bias for us to forget them as much and as quickly as possible. The result is that our many failures go largely unstudied -- and the rich veins of wisdom that these failures generate live on only in oral tradition passed down by the perps (occasionally) and the victims (more often).

A counterexample to this -- and one of my favorite systems papers of all time -- is Robert Colwell's brilliant Performance Effects of Architectural Complexity in the Intel 432. This paper, which dissects the abysmal performance of Intel's infamous 432, practically drips with wisdom, and is just as relevant today as it was when the paper was originally published nearly twenty years ago.

For those who have never heard of the Intel 432, it was a microprocessor conceived of in the mid-1970s to be the dawn of a new era in computing, incorporating many of the latest notions of the day. But despite its lofty ambitions, the 432 was an unmitigated disaster both from an engineering perspective (the performance was absolutely atrocious) and from a commercial perspective (it did not sell -- a fact presumably not unrelated to its terrible performance). To add insult to injury, the 432 became a sort of punching bag for researchers, becoming, as Colwell described, "the favorite target for whatever point a researcher wanted to make."

But as Colwell et al. reveal, the truth behind the 432 is a little more complicated than trendy ideas gone awry; the microprocessor suffered not only from untested ideas, but also from terrible execution. For example, one of the core ideas of the 432 was that it was a capability-based system, implemented with a rich hardware-based object model. This model had many ramifications for the hardware, but it also introduced a dangerous dependency on software: the hardware was implicitly dependent on system software (namely, the compiler) for efficient management of protected object contexts ("environments" in 432 parlance). As it happened, the needed compiler work was not done, and the Ada compiler as delivered was pessimal: every function was implemented in its own environment, meaning that every function was in its own context, and that every function call was therefore a context switch! As Colwell explains, this software failing was the greatest single inhibitor to performance, costing some 25-35 percent on the benchmarks that he examined.

If the story ended there, the tale of the 432 would be plenty instructive -- but the story takes another series of interesting twists: because the object model consumed a bunch of chip real estate (and presumably a proportional amount of brain power and department budget), other (more traditional) microprocessor features were either pruned or eliminated. The mortally wounded features included a data cache (!), an instruction cache (!!) and registers (!!!). Yes, you read correctly: this machine had no data cache, no instruction cache and no registers -- it was exclusively memory-memory. And if that weren't enough to assure awful performance: despite having 200 instructions (and about a zillion addressing modes), the 432 had no notion of immediate values other than 0 or 1. Stunningly, Intel designers believed that 0 and 1 "would cover nearly all the need for constants", a conclusion that Colwell (generously) describes as "almost certainly in error." The upshot of these decisions is that you have more code (because you have no immediates) accessing more memory (because you have no registers) that is dog-slow (because you have no data cache) that itself is not cached (because you have no instruction cache). Yee haw!

Colwell's work builds to a crescendo as it methodically takes apart each of these architectural issues -- and then attempts to model what the microprocessor would look like were it properly implemented. The conclusion he comes to is that the object model -- long thought to be the 432's singular flaw -- was only one part of a more complicated picture, and that its performance was "dominated, in large part, by artifacts and not by concepts." If there's one imperfection with Colwell's work, it's that he doesn't realize how convincingly he's made the case that these artifacts were induced by a rigid and foolish adherence to the concepts.

So what is the relevance of Colwell's paper now, 20 years later? One of the principal problems that Colwell describes is the disconnect between innovation at the hardware and software levels. This disconnect continues to be a theme, and can be seen in current controversies in networking (TOE or no?), in virtualization (just how much microprocessor support do we want/need -- and at what price?), and (most clearly, in my opinion) in hardware transactional memory. Indeed, like an apparition from beyond the grave, the Intel 432 story should serve as a chilling warning to those working on transactional memory today: like the 432 object model, hardware transactional memory requires both novel microprocessor architecture and significant new system software. And like the 432 object model, hardware transactional memory has been touted more for its putative programmer productivity than for its potential performance gains. This is not to say that hardware transactional memory is not an appropriate direction for a microprocessor, just that its advocates should not so stubbornly adhere to their novelty that they lose sight of the larger system. To me, that is the lesson of the Intel 432 -- and thanks to Colwell's work, that lesson is available to all who wish to learn it.

Monday Jun 30, 2008

DTrace on Linux

The interest in DTrace on Linux is heating up again -- this time in an inferno on the Linux 2008 Kernel Summit discussion list. Under discussion is SystemTap, the Linux-born DTrace-knockoff, with people like Ted Ts'o explaining why they find SystemTap generally unusable ("Do you really expect system administrators to use this tool?") and in stark contrast to DTrace ("it just works").

While the comparison is clearly flattering, I find it a bit disappointing that no one in the discussion seems to realize that DTrace "just works" not merely by implementation, but also by design. Over and over again, we made architectural and technical design decisions that would yield an instrumentation framework that would be not just safe, powerful and flexible, but also usable. The subtle bit here is that many of those decisions were not at the surface of the system (where the discussion on the Linux list seems to be currently mired), but in its guts. To phrase it more concretely, innovations like CTF, DOF and provider-specified stability may seem like mind-numbing, arcane implementation detail (and okay, they probably are that too), but they are the foundation upon which the usability of DTrace is built. If you don't solve the problems that they solve, you won't have a system anywhere near as usable as DTrace.

So does SystemTap appreciate either the importance of these problems or the scope of their solutions? Almost certainly not -- for if they did, they would come to the same conclusion that technologists at Apple, QNX, and the FreeBSD project have come to: the only way to have a system at parity with DTrace is to port DTrace.

Fortunately for Linux users, there are some in the community who have made this realization. In particular, Paul Fox has a nascent port of DTrace to Linux. Paul still has a long way to go (and I'm sure he could use whatever help Linux folks are willing to offer) but it's impossible to believe that Paul isn't on a shorter and more realistic path than SystemTap to achieving safe, powerful, flexible -- and usable! -- dynamic Linux instrumentation. Good luck to you Paul; we continue to be available to help where we can -- and may the Linux community realize the value of your work sooner rather than later!
