Monday Nov 03, 2008

Concurrency's Shysters

For as long as I've been in computing, the subject of concurrency has always induced a kind of thinking man's hysteria. When I was coming up, the name of the apocalypse was symmetric multiprocessing -- and its arrival was to be the Day of Reckoning for software. There seemed to be no end of doomsayers, even among those who putatively had the best understanding of concurrency. (Of note was a famous software engineer who -- despite substantial experience in SMP systems at several different computer companies -- confidently asserted to me in 1995 that it was "simply impossible" for an SMP kernel to "ever" scale beyond 8 CPUs. Needless to say, several of his past employers have since proved him wrong...)

There also seemed to be no end of concurrency hucksters and shysters, each eager to peddle their own quack cure for the miasma. Of these, the one that stuck in my craw was the two-level scheduling model, whereby many user-level threads are multiplexed on fewer kernel-level (schedulable) entities. (To paraphrase what has been oft said of little-known computer architectures, you haven't heard of it for a reason.) The rationale for the model -- that it allowed for cheaper synchronization and lightweight thread creation -- seemed to me at the time to be long on assertions and short on data. So working with my undergraduate advisor, I developed a project to explore this model both quantitatively and dynamically, work that I undertook in the first half of my senior year. And early on in that work, it became clear that -- in part due to intractable attributes of the model -- the two-level thread scheduling model was delivering deeply suboptimal performance...

Several months after starting the investigation, I came to interview for a job at Sun with Jeff, and he (naturally) asked me to describe my undergraduate work. I wanted to be careful here: Sun was the major proponent of the two-level model, and while I felt that I had the hard data to assert that the model was essentially garbage, I also didn't want to make a potential employer unnecessarily upset. So I stepped gingerly: "As you may know," I began, "the two-level threading model is very... intricate." "Intricate?!" Jeff exclaimed, "I'd say it's completely busted!" (That moment may have been the moment that I decided to come work with Jeff and for Sun: the fact that an engineer could speak so honestly spoke volumes for both the engineer and the company. And despite Sun's faults, this engineering integrity remains at Sun's core to this day -- and remains a draw to so many of us who have stayed here through the ups and downs.) With that, the dam had burst: Jeff and I proceeded to gush about how flawed we each thought the model to be -- and how dogmatic its presentation. So paradoxically, I ended up getting a job at Sun in part by telling them that their technology was unsound!

Back at school, I completed my thesis. Like much undergraduate work, it's terribly crude in retrospect -- but I stand behind its fundamental conclusion that the unintended consequences of the two-level scheduling model make it essentially impossible to achieve optimal performance. Upon arriving at Sun, I developed an early proof-of-concept of the (much simpler) single-level model. Roger Faulkner did the significant work of productizing this as an alternative threading model in Solaris 8 -- and he eliminated the two-level scheduling model entirely in Solaris 9, thus ending the ill-begotten experiment of the two-level scheduling model somewhere shy of its tenth birthday. (Roger gave me the honor of approving his request to integrate this work, an honor that I accepted with gusto.)

So why this meandering walk through a regrettable misadventure in the history of software systems? Because over a decade later, concurrency is still being used to instill panic in the uninformed. This time, it is chip-level multiprocessing (CMP) instead of SMP that promises to be the End of Days -- and the shysters have taken a new guise in the form of transactional memory. The proponents of this new magic tonic are in some ways darker than their forebears: it is no longer enough to warn of Judgement Day -- they must also conjure up notions of Original Sin to motivate their perverted salvation. "The heart of the problem is, perhaps, that no one really knows how to organize and maintain large systems that rely on locking" admonished Nir Shavit recently in CACM. (Which gives rise to the natural follow-up question: is the Solaris kernel not large, does it not rely on locking or do we not know how to organize and maintain it? Or is that we do not exist at all?) Shavit continues: "Locks are not modular and do not compose, and the association between locks and data is established mostly by convention." Again, no data, no qualifiers, no study, no rationale, no evidence of experience trying to develop such systems -- just a naked assertion used as a prop for a complicated and dubious solution. Are there elements of truth in Shavit's claims? Of course: one can write sloppy, lock-based programs that become a galactic, unmaintainable mess. But does it mean that such monstrosities are inevitable? No, of course not.

So fine, the problem statement is (deeply) flawed. Does that mean that the solution is invalid? Not necessarily -- but experience has taught me to be wary of crooked problem statements. And in this case (perhaps not surprisingly) I take umbrage with the solution as well. Even if one assumes that writing a transaction is conceptually easier than acquiring a lock, and even if one further assumes that transaction-based pathologies like livelock are easier on the brain than lock-based pathologies like deadlock, there remains a fatal flaw with transactional memory: much system software can never be in a transaction because it does not merely operate on memory. That is, system software frequently takes action outside of its own memory, requesting services from software or hardware operating on a disjoint memory (the operating system kernel, an I/O device, a hypervisor, firmware, another process -- or any of these on a remote machine). In much system software, the in-memory state that corresponds to these services is protected by a lock -- and the manipulation of such state will never be representable in a transaction. So for me at least, transactional memory is an unacceptable solution to a non-problem.

As it turns out, I am not alone in my skepticism. When we on the Editorial Advisory Board of ACM Queue sought to put together an issue on concurrency, the consensus was twofold: to find someone who could provide what we felt was much-needed dissent on TM (and in particular on its most egregious outgrowth, software transactional memory), and to have someone speak from experience on the rise of CMP and what it would mean for practitioners.

For this first article, we were lucky enough to find Calin Cascaval and colleagues, who ended up writing a must-read article on STM in November's CACM. Their conclusions are unavoidable: STM is a dog. (Or as Cascaval et al. more delicately put it: "Based on our results, we believe that the road for STM is quite challenging.") Their work is quantitative and analytical and (best of all, in my opinion) the authors never lose sight of the problem that transactional memory was meant to solve: to make parallel programming easier. This is important, because while many of the leaks in the TM vessel can ultimately be patched, the patches themselves add layer upon layer of complexity. Cascaval et al. conclude:

And because the argument for TM hinges upon its simplicity and productivity benefits, we are deeply skeptical of any proposed solutions to performance problems that require extra work by the programmer.
And while their language is tighter (and the subject of their work a weightier and more active research topic), the conclusions of Cascaval et al. are eerily similar to my final verdict on the two-level scheduling model, over a decade ago:
The dominating trait of the [two-level] scheduling model is its complexity. Unfortunately, virtually all of its complexity is exported to the programmer. The net result is that programmers must have a complete understanding of the model the inner workings of its implementation in order to be able to successfully tap its strengths and avoid its pitfalls.
So TM advocates: if Roger Faulkner knocks on your software's door bearing a scythe, you would be well-advised to not let him in...

For the second article, we brainstormed potential authors -- but as we dug up nothing but dry holes, I found myself coming to an unescapable conclusion: Jeff and I should write this, if nothing else as a professional service to prevent the latest concurrency hysteria from reaching epidemic proportions. The resulting article appears in full in the September issue of Queue, and substantially excerpted in the November issue of CACM. Writing the article was a gratifying experience, and gave us the opportunity to write down much of what we learned the hard way in the 1990s. In particular, it was cathartic to explore the history of concurrency. Having been at concurrency's epicenter for nearly a decade, I felt that the rise of CMP has been recently misrepresented as a failure of hardware creativity -- and it was vindicating to read CMP's true origins in the original DEC Piranha paper: that given concurrent databases and operating systems, implementing multiple cores on the die was simply the best way to deliver OLTP performance. That is, it was the success of concurrent software -- and not the failure of imagination on the part of hardware designers -- that gave rise to the early CMP implementations. Hopefully practitioners will enjoy reading the article as much as we enjoyed writing it -- and here's hoping that we live to see a day when concurrency doesn't attract so many schemers and dreamers!

Wednesday Sep 03, 2008

Happy 5th Birthday, DTrace!

It's hard to believe, but DTrace is five years old today: it was on September 3, 2003 that DTrace integrated into Solaris. DTrace was a project that extended all three of us to our absolute limit as software engineers -- and the 24 hours before integration was then (and remains now) the most harrowing of my career. As it will hopefully remain my most stressful experience as an engineer, the story of that final day merits a retelling...

Our project had been running for nearly two years, but it was not until mid-morning on September 2nd -- the day before we were slated to integrate -- that it was discovered that the DTrace prototype failed to boot on some very old hardware (the UltraSPARC-I, the oldest hardware still supported at that time). Now, "failed to boot" can meet a bunch of different things, but this was about as awful as it gets: a hard hang after the banner message. That is, booting mysteriously stopped making progress soon after control transferred to the kernel -- and one could not break in with the kernel debugger. This is an awful failure mode because with no debugger and no fatal error, one has no place to start other than to start adding print statements -- or start ripping out the code that is the difference between the working system and the busted one. This was a terrifying position to be in less than 24 hours before integration! Strangely, it was only the non-DEBUG variant that failed to boot: the DEBUG version laden with assertions worked fine. Our only lucky break was that we were able to find two machines that exhibited the problem, enabling us to bifurcate our efforts: I starting ripping out DTrace-specific code in one workspace, while Mike started frenetically adding print statements in another...

Meanwhile, while we were scrambling to save our project, Eric was having his first day at Sun. My office door was closed, and with our integration pending and me making frequent (and rapid) trips back and forth to the lab, the message to my coworkers was clear: stay the hell back. Eric was blissfully unaware of these implicit signals, however, and he cheerfully poked his head in my office to say hello (Eric had worked the previous summer in our group as an intern). I can't remember exactly what I said to Eric when he opened my office door, but suffice it to say that the implicit signals were replaced with a very explicit one -- and I remain grateful to this day that Eric didn't quit on the spot...

Back on our problem, Mike -- through process of elimination -- had made the key breakthrough: it wasn't actually an instruction set architecture (ISA) issue, but rather it seemed to be a host bus adapter (HBA) issue. This was an incredibly important discovery: while we had a bevy of architectural changes that could conceivably be invalid on an ancient CPU, we had no such HBA-specific changes -- this was more likely to be something marring the surface of our work rather than cracking its foundation. Mike further observed that running a DEBUG variant of these ancient HBA drivers (esp and fas) would boot on an otherwise non-DEBUG kernel. At that, I remembered that we actually did have some cosmetic changes to these drivers, and on carefully reviewing the diffs, we found a deadly problem: in folding some old tracing code under a DEBUG-only #define, a critical line (the one that actually initiates the I/O) became compiled in only when DEBUG was defined. We hadn't seen this until now because these drivers were only used on ancient machines -- machines on which we had never tested non-DEBUG. We fixed the problem, and all of our machines booted DEBUG and non-DEBUG -- and we felt like we were breathing again for the first time in the more than six hours that we had been working on the problem. (Here is the mail that I sent out explaining the problem.)

To celebrate DTrace's birthday beyond just recounting the terror of its integration, I wanted to make a couple of documents public that we have not previously shared:

  • The primordial presentation on DTrace, developed in late 1999. Some of the core ideas are present here (in particular, production instrumentation and zero disabled probe effect), but we hadn't yet figured out some very basic notions -- like that we needed our own language.

  • Our first real internal presentation on DTrace, presented March 12, 2002 as a Kernel Technical Discussion. Here the thinking is much better fleshed out around kernel-level instrumentation -- and a prototype existed and was demonstrated. But a key direction for the technology -- the ability to instrument user-level generally and in semantically relevant ways in particular -- was still to come when Adam joined the team shortly after this presentation. (A video of this presentation also exists; in the unlikely event that anyone wants to actually relive three hours of largely outmoded thinking, I'll find a way to make it available.)

  • The e-mail we sent out after integration, September 3, 2003 -- five years ago today.

We said it then, and it's even truer today: it's been quite a ride. Happy 5th Birthday, DTrace -- and thanks again to everyone in the DTrace community for making it what it has become!

Tuesday Jul 29, 2008

DTrace and the Palisades Interstate Parkway

In general, I don't believe in drawing attention to bugs in the software of others: any significant body of software is likely to have bugs, and I think one can too easily draw overly broad inferences by looking at software through the lens of its defects (a pathology that I have previously discussed at some length). However -- and as you might imagine from the preamble -- I'm about to make an exception to that gentlemanly rule...

I, along with zillions of others, read the breathless hype about the new would-be Google slayer, cuil. When a new search engine pops up, my egotistical reflex is to first search for "dtrace", and the results of searching for "dtrace" on cuil were very, um, interesting. The search results themselves were fine; more creative were the images that cuil decided to associate with them. If you look at that screenshot, you will be able to find an image of a quilt, a strip mall, what could pass for a program from an August Wilson play, and -- strangest of all -- a sign that reads "Welcome to Palisades Interstate Parkway". I can't say for certain that I've never travelled on the Palisades Interstate Parkway, but with all deference to that 38.25-mile stretch of tarmac, I do believe that I can say that it played no role in DTrace -- or DTrace in it. Indeed, I can say with absolute confidence that searching for "palisades interstate parkway dtrace" will, in short order, yield only this blog entry -- provided, that is, that one doesn't perform said search on cuil... ;)

Friday Jul 18, 2008

Revisiting the Intel 432

As I have discussed before, I strongly believe that to understand systems, you must understand their pathologies -- systems are most instructive when they fail. Unfortunately, we in computing systems do not have a strong history of studying pathology: despite the fact that failure in our domain can be every bit as expensive (if not more so) than in traditional engineering domains, our failures do not (usually) involve loss of life or physical property and there is thus little public demand for us to study them -- and a tremendous industrial bias for us to forget them as much and as quickly as possible. The result is that our many failures go largely unstudied -- and the rich veins of wisdom that these failures generate live on only in oral tradition passed down by the perps (occasionally) and the victims (more often).

A counterexample to this -- and one of my favorite systems papers of all time -- is Robert Colwell's brilliant Performance Effects of Architectural Complexity in the Intel 432. This paper, which dissects the abysmal performance of Intel's infamous 432, practically drips with wisdom, and is just as relevant today as it was when the paper was originally published nearly twenty years ago.

For those who have never heard of the Intel 432, it was a microprocessor conceived of in the mid-1970s to be the dawn of a new era in computing, incorporating many of the latest notions of the day. But despite its lofty ambitions, the 432 was an unmitigated disaster both from an engineering perspective (the performance was absolutely atrocious) and from a commercial perspective (it did not sell -- a fact presumably not unrelated to its terrible performance). To add insult to injury, the 432 became a sort of punching bag for researchers, becoming, as Colwell described, "the favorite target for whatever point a researcher wanted to make."

But as Colwell et al. reveal, the truth behind the 432 is a little more complicated than trendy ideas gone awry; the microprocessor suffered from not only untested ideas, but also terrible execution. For example, one of the core ideas of the 432 is that it was a capability-based system, implemented with a rich hardware-based object model. This model had many ramifications for the hardware, but it also introduced a dangerous dependency on software: the hardware was implicitly dependent on system software (namely, the compiler) for efficient management of protected object contexts ("environments" in 432 parlance). As it happened, the needed compiler work was not done, and the Ada compiler as delivered was pessimal: every function was implemented in its own environment, meaning that every function was in its own context, and that every function call was therefore a context switch!. As Colwell explains, this software failing was the greatest single inhibitor to performance, costing some 25-35 percent on the benchmarks that he examined.

If the story ended there, the tale of the 432 would be plenty instructive -- but the story takes another series of interesting twists: because the object model consumed a bunch of chip real estate (and presumably a proportional amount of brain power and department budget), other (more traditional) microprocessor features were either pruned or eliminated. The mortally wounded features included a data cache (!), an instruction cache (!!) and registers (!!!). Yes, you read correctly: this machine had no data cache, no instruction cache and no registers -- it was exclusively memory-memory. And if that weren't enough to assure awful performance: despite having 200 instructions (and about a zillion addressing modes), the 432 had no notion of immediate values other than 0 or 1. Stunningly, Intel designers believed that 0 and 1 "would cover nearly all the need for constants", a conclusion that Colwell (generously) describes as "almost certainly in error." The upshot of these decisions is that you have more code (because you have no immediates) accessing more memory (because you have no registers) that is dog-slow (because you have no data cache) that itself is not cached (because you have no instruction cache). Yee haw!

Colwell's work builds to crescendo as it methodically takes apart each of these architectural issues -- and then attempts to model what the microprocessor would look like were it properly implemented. The conclusion he comes to is the object model -- long thought to be the 432's singular flaw -- was only one part of a more complicated picture, and that its performance was "dominated, in large part, by artifacts and not by concepts." If there's one imperfection with Colwell's work, it's that he doesn't realize how convincingly he's made the case that these artifacts were induced by a rigid and foolish adherence to the concepts.

So what is the relevance of Colwell's paper now, 20 years later? One of the principal problems that Colwell describes is the disconnect between innovation at the hardware and software levels. This disconnect continues to be a theme, and can be seen in current controversies in networking (TOE or no?), in virtualization (just how much microprocessor support do we want/need -- and at what price?), and (most clearly, in my opinion) in hardware transactional memory. Indeed, like an apparition from beyond the grave, the Intel 432 story should serve as a chilling warning to those working on transactional memory today: like the 432 object model, hardware transactional memory requires both novel microprocessor architecture and significant new system software. And like the 432 object model, hardware transactional memory has been touted more for its putative programmer productivity than for its potential performance gains. This is not to say that hardware transactional memory is not an appropriate direction for a microprocessor, just that its advocates should not so stubbornly adhere to their novelty that they lose sight of the larger system. To me, that is the lesson of the Intel 432 -- and thanks to Colwell's work, that lesson is available to all who wish to learn it.

Monday Jun 30, 2008

DTrace on Linux

The interest in DTrace on Linux is heating up again -- this time in an inferno on the Linux 2008 Kernel Summit discussion list. Under discussion is SystemTap, the Linux-born DTrace-knockoff, with people like Ted Ts'o explaining why they find SystemTap generally unusable ("Do you really expect system administrators to use this tool?") and in stark contrast to DTrace ("it just works").

While the comparison is clearly flattering, I find it a bit disappointing that no one in the discussion seems to realize that DTrace "just works" not merely my implementation, but also by design. Over and over again, we made architectural and technical design decisions that would yield an instrumentation framework that would be not just safe, powerful and flexible, but also usable. The subtle bit here is that many of those decisions were not at the surface of the system (where the discussion on the Linux list seems to be currently mired), but in its guts. To phrase it more concretely, innovations like CTF, DOF and provider-specified stability may seem like mind-numbing, arcane implementation detail (and okay, they probably are that too), but they are the foundation upon which the usability of DTrace is built. If you don't solve the problems that they solve, you won't have a system anywhere near as usable as DTrace.

So does SystemTap appreciate either the importance of these problems or the scope of their solutions? Almost certainly not -- for if they did, they would come to the same conclusion that technologists at Apple, QNX, and the FreeBSD project have come to: the only way to have a system at parity with DTrace is to port DTrace.

Fortunately for Linux users, there are some in the community who have made this realization. In particular, Paul Fox has a nascent port of DTrace to Linux. Paul still has a long way to go (and I'm sure he could use whatever help Linux folks are willing to offer) but it's impossible to believe that Paul isn't on a shorter and more realistic path than SystemTap to achieving safe, powerful, flexible -- and usable! -- dynamic Linux instrumentation. Good luck to you Paul; we continue to be available to help where we can -- and may the Linux community realize the value of your work sooner rather than later!

Saturday May 31, 2008

A Tribute to Jim Gray

Like several hundred others, I spent today at Berkeley at a tribute to honor Jim Gray. While many there today had collaborated with Jim in some capacity, my own connection to him is tertiary at best: I currently serve on the Editorial Advisory Board of ACM Queue, a capacity in which Jim also served up until his disappearance. And while I never met Jim, his presence is still very much felt on the Queue Board -- and I went today as much to learn about as to honor the man and computer scientist who still looms so large.

I came away inspired not only by what Jim had done and how he had done it, but also by the many in whom Jim had seen and developed a kind of greatness. It's rare for a gifted mind to be emotionally engaged with others (they are often "anti-human" in the words of one presenter), but practically unique for an intellect of Jim's caliber to have his ability for intense, collaborative engagement with so many others. Indeed, I found myself wondering if Jim didn't belong in the pantheon of Erdos and Feynman -- two other larger-than-life figures that shared Jim's zeal for problems and affection for those solving them.

On a slightly more concrete level, I was impressed that not only was Jim a big, adventurous thinker, but that he was one who remained moored to reality. This delicate balance pervades his life's work: The System R folks talked about how in a year he not only wrote 9 papers, but cut 10,000 lines of code. The Wisconsin database folks talked about how he who had pioneered so much in transaction processing also pioneered the database benchmarks, developing the precursor for what would become the Transaction Processing Council in a classic Datamation paper. The Tandem folks talked about how he reached beyond research and development to engage with the field and with customers to figure out (and publish!) why systems actually failed -- which was quite a reach for a company that famously dubbed every product "NonStop". And the Microsoft folks talked about his driving vision to put a large database on the web that people would actually use, leading him to develop TerraServer, and to inspire and assist subsequent systems like the Sloan Digital Sky Survey and Microsoft Research's WorldWide Telescope (which gave an incredible demo, by the way) -- systems that solved hard, abstract problems but that also delivered appreciable concrete results.

All along, Jim remained committed to big, future-looking ideas, while still insisting on designing and implementing actual systems in the present. We in computer science do not strike that balance often enough -- we are too often either detached from reality, or drowning in it -- so Jim's ability to not only strike that balance but to inspire it in others was a tremendous asset to our discipline. Jim, you are sorely missed -- even (and perhaps especially) by those of us who never met you...

Sunday Mar 16, 2008


dtrace.conf(08) was this past Friday, and (no surprise, given the attendees), it ended up being an incredible (un)conference. DTrace is able to cut across vertical boundaries in the software stack (taking you from say, PHP through to the kernel), and it is not surprising that this was also reflected in our conference, which ran from the darkness of new depths (namely, Andrew Gardner from the University of Arizona on using DTrace on dynamically reconfigurable FPGAs) through the more familiar deep of virtual machines (VMware, Xen and Zones) and operating system kernels (Solaris, OS X, BSD, QNX), into the rolling hills of databases (PostgreSQL) and craggy peaks of the Java virtual machine, and finally to the hypoxic heights of environments like Erlang and AIR. And as it is the nature of DTrace to reflect not the flowery theory of a system but rather its brutish reality, it was not at all surprising (but refreshing nonetheless) that nearly every presentation had a demo to go with it. Of these, it was particularly interesting that several were on actual production machines -- and that several others were on the presenter's Mac laptop. (I know I have been over the business case of the Leopard port of DTrace before, but these demos served to once again remind of the new vistas that have been opened up by having DTrace on such a popular development platform -- and how the open sourcing of DTrace has indisputably been in Sun's best business interest.)

When we opened in the morning, I claimed that the room had as much software talent as has ever been assembled in 70 people; after seeing the day's presentations and the many discussions that they spawned, I would double down on that claim in an instant. And while the raw pulling power of the room was awe-inspiring, the highlight for me was looking around as everyone was eating dinner: despite many companies having sent more than one participant, no dinner conversation seemed to have two people from the same company -- or even from the same layer of the stack. It almost brought a tear to the eye to see such interaction among such superlative software talent from such disjoint domains, and my dinner partner captured the zeitgeist precisely: "it's like a commune in here."

Anyway, I had a hell of a time, and judging by their blog entries, Keith, Theo, Jarod and Stephen did too. So thanks everyone for coming, and a huge thank you to our sponsor Forsythe -- and here's looking forward to dtrace.conf(09)!

Tuesday Dec 18, 2007

Announcing dtrace.conf

We on Team DTrace have realized that with so much going on in the world of DTrace -- ports to new systems, providers for new languages and development of new DTrace-based tools -- the time is long overdue for a DTrace summit of sorts. So I'm very pleased to announce our first-ever DTrace (un)conference: dtrace.conf(08), to be held in San Francisco on March 14, 2008. One of the most gratifying aspects of DTrace is its community: because DTrace operates both across abstraction boundaries and to great depth, it naturally attracts technologists who do the same -- systems generalists who couple an ability to think abstractly about the system with a thirst for the concrete data necessary to better understand it. So if nothing else, dtrace.conf should provide for thought-provoking and wide-ranging conversation!

The conference wiki has more details; hope to see you at dtrace.conf in San Francisco in March!

Tuesday Dec 04, 2007

Boom/bust cycles

Growing up, I was sensitive to the boom/bust cycles endemic in particular industries: my grandfather was a petroleum engineer, and he saw the largest project of his career cancelled during the oil bust in the mid-1980s.[1] And growing up in Colorado, I saw not only the destruction wrought by the oil bust (1987 was very bleak in Denver), but also the long boom/bust history in the mining industries -- unavoidable in towns like Leadville. (Indeed, it was only after going to the East Coast for university that I came to appreciate that school children in the rest of the country don't learn about the Silver Panic of 1893 in such graphic detail.)

So when, in the early 1990s, I realized that my life's calling was in software, I felt a sense of relief: here was an industry that was at last not shackled to the fickle Earth -- an industry that was surely "bust-proof" at some level. Ha! As I learned (painfully, like most everyone else) in the Dot-Com boom and subsequent bust, booms and busts are -- if anything -- even more endemic in our industry. Companies grow from nothing to spectacular heights in virtually no time at all -- and can crash back into nothingness just as quickly.

I bring up all of this, because if you haven't seen it, this video is absolutely brilliant, capturing these endemic cycles perfectly (and hilariously), and imparting a surprising amount of wisdom besides.

I'll still take our boom and bust cycles over those in other industries: there are, after all, still not quite yet software "ghost towns" -- however close the old Excite@Home building on the 101 was getting before it was finally leased!

If anyone is looking for an unspeakably large quantity of technical documentation on the (cancelled) ARAMCO al-Qasim refinery project, I have it waiting for your perusal!

Sunday Nov 11, 2007

On Dreaming in Code

As I noted previously, I recently gave a Tech Talk at Google on DTrace. When I gave the talk, I was in the middle of reading Scott Rosenberg's Dreaming in Code, and (for whatever poorly thought-out reason) I elected to use my dissatisfaction with the book as an entree into DTrace. I think that my thinking here was to use what I view to be Rosenberg's limited understanding of software as a segue into the more general difficulty of understanding running software -- with that of course being the problem that DTrace was designed to solve.

However tortured that path of reasoning may sound now, it was much worse in the actual presentation -- and in terms of Dreaming in Code, my introduction comes across as a kind of gangland slaying of Rosenberg's work. When I saw the video, I just cringed, noted with relief that at least the butchery was finished by five minutes in, and hoped that viewers would remember the meat of the presentation rather than the sloppy hors d'oeuvres.

I was stupidly naive, of course -- for as soon as so much as one blogging observer noted the connection between my presentation and the book, Google-alerted egosurfing would no doubt take over and I'd be completely busted. And that is more or less exactly what happened, with Rosenberg himself calling me out on my introduction, complaining (rightly) that I had slagged his book without providing much substance. Rosenberg was particularly perplexed because he felt that he and I are concluding essentially the same thing. But this is not quite the case: the conclusion (if not mantra) of his book is that "software is hard", while the point I was making in the Google talk is not so much that developing software is hard, but rather that software itself is wholly different -- that it is (uniquely) a confluence between information and machine. (And, in particular, that software poses unique observability challenges.)

This sounds like a really pedantic difference -- especially given that Rosenberg (in both book and blog entry) does make some attempt to show the uniqueness of software -- but the difference is significant: instead of beginning with software's information/machine duality, and from there exploring the difficulty of developing it, Rosenberg's work has at its core the difficulty itself. That is, Rosenberg uses the difficulty of developing software as the lens to understand all of software. And amazingly enough, if you insist on looking at the world through a bucket of shit, the world starts to look an awful lot like a bucket of shit: by the end of the book, Rosenberg has left the lay reader with the sense that we are careering towards some sort of software heat-death, after which meetings-about-meetings and stickies-on-whiteboards will prevent all future progress.

It's not clear if this is the course that Rosenberg intended to set at the outset; in his Author's Note, he gives us his initial bearings:

Why is good software so hard to make? Since no one seems to have a definitive answer even now, at the start of the twenty-first century, fifty years deep into the computer era, I offer by way of exploration the tale of the making of one piece of software.

The problem is that the "tale" that he tells is that of the OSAF's Chandler, a project plagued by metastasized confusion. In fact, the project is such a wreck that it gives rise to a natural question: did Rosenberg pick a doomed project because he was convinced at the outset that developing software was impossible, and he wanted to be sure to write about a project that wouldn't hang the jury? Or did his views of the impossibility of developing software come about as a result of his being trapped on such a reeking vessel? On the one hand, it seems unimaginable that Rosenberg would deliberately book his passage on the Mobro 4000, but on the other, it's hard to imagine how a careful search would have yielded a better candidate to show just how bad software development can get: PC-era zillionaire with too many ideas funds on-the-beach programmers to change the world with software that has no revenue model. Yikes -- call me when you ship something...

By the middle of the book it is clear that the Chandler garbage barge is adrift and listing, and Rosenberg, who badly wants the fate of Chandler to be a natural consequence of software and not merely a tale of bad ideas and poor execution, begins to mark time by trying to find the most general possible reasons for this failure. And it is this quest that led to my claim in the Google video that he was "hoodwinked by every long-known crank" in software, a claim that Rosenberg objects to, noting that in the video I provided just one example (Alan Kay). But this claim I stand by as made -- and I further claim that the most important contribution of Dreaming in Code may be its unparalleled Tour de Crank, including not just the Crank Holy Trinity of Minsky/Kurzweil/Joy, but many of the lesser known crazy relations that we in computer science have carefully kept locked in the cellar. Now, to be fair, Rosenberg often dismisses them (and Kapor put himself squarely in the anti-Kurzweil camp with his 2029 bet), but he dismisses them not nearly often enough or swiftly enough -- and by just presenting them, he grants many of them more than their due in terms of legitimacy.

So how would a change in Rosenberg's perspective have yielded a different work? For one, Rosenberg misses what is perhaps the most profound ramification of the information/machine duality: that software -- unlike everything else that we build -- can achieve a timeless and absolute perfection. To be fair, Rosenberg comes within sight of this truth -- but only for a moment: on page 336 he opens a section with "When software is, one way or another, done, it can have astonishing longevity." But just as quickly as hopes are raised, they are dashed to the ground again; he follows up that promising sentence with "Though it may take forever to build a good program, a good program will sometimes last almost as long." Damn -- so close, yet so far away. If "almost as long" were replaced with "in perpetuity", he might have been struck by the larger fallacies in his own reasoning around the intractable fallibility of software.

And software's ability to achieve perfection is indeed striking, for it makes software more like math than like traditional engineering domains -- after all, long after the Lighthouse at Alexandria crumbled, Euclid's greatest common divisor algorithm is showing no signs of wearing out. Why is this important? Because once software achieves something approximating perfection (and a surprising amount of it does), it sediments into the information infrastructure: the abstractions defined by the software become the bedrock that future generations may build upon. In this way, each generation of software engineer operates at a higher level of abstraction than the one that came before it, exerting less effort to do more. These sedimenting abstractions also (perhaps paradoxically) allow new dimensions of innovation deeper in the stack: with the abstractions defined and the constraints established, one can innovate underneath them -- provided one can do so in a way that is absolutely reliable (after all, this is to be bedrock), and with a sufficient improvement in terms of economics to merit the risk. Innovating in such a way poses hard problems that demand a much more disciplined and creative engineer than the software sludge that typified the PC revolution, so if Rosenberg had wanted to find software engineers who know how to deliver rock-solid highly-innovative software, he should have gone to those software engineers who provide the software bedrock: the databases, the operating systems, the virtual machines -- and increasingly the software that is made available as a service. There he still would have found problems to be sure (software still is, after all, hard), but he would have come away with a much more nuanced (and more accurate) view of both the state-of-the-art of software development -- and of the future of software itself.

Thursday Nov 08, 2007

DTrace on QNX!

There are moments in anyone's life that seem mundane at the time, but become pivotal in retrospect: that chance meeting of your spouse, or the job you applied for on a lark, or the inspirational course that you just stumbled upon.

For me, one of those moments was in late December, 1993. I was a sophomore in college, waiting in Chicago O'Hare for a connecting flight home to Denver for the winter break, reading an issue of the late Byte magazine dedicated to operating systems. At the time, I had just completed what would prove to be the most intense course of my college career -- Brown's (in)famous CS169 -- and the magazine contained an article on a very interesting microkernel-based operating system called QNX. The article, written by the late, great Dan Hildebrand, was exciting: we had learned about microkernels in CS169 (indeed, we had implemented one) and here was one running in the wild, solving real, hard problems. I was inspired. When I returned to school, I cold e-mailed Dan and asked about the possibilites of summer employment. Much to my surprise, he responded! When I returned to school in January, Dan and QNX co-founder Dan Dodge gave me an intense phone interview -- and then offered me work for the summer. I worked at QNX (based in Kanata, outside of Ottawa) that summer, and then came back the next. While I didn't end up working for QNX after graduation, I have always thought highly of the company, the people -- and especially the technology.

So you can imagine that for me, it's a unique pleasure -- and an an odd sort of homecoming -- to announce that DTrace is being ported to QNX. In my DTrace talk at Google I made passing reference (at about 1:11:53) to one "other system" that DTrace was being ported to, but that I was not at liberty to mention which. Some speculated that this "surely" must be Windows -- so hopefully the fact that it was QNX will serve to remind that it's a big heterogeneous world out there. So, to Colin Burgess and Thomas Fletcher at QNX: congratulations on your fine work. And to QNX'ers: welcome to the DTrace community!

Now, with the drinks poured and the toasts made, I must confess that this is also a moment colored by some personal sadness, for I would love nothing more right now than to call up danh, chat about the exciting prospects for DTrace on QNX, and reminisce about the many conversations over our shared cube wall so long ago. (And if nothing else, I would kid him about his summer-of-1994 enthusiasm for Merced!) Dan, you are sorely missed...

Monday Oct 29, 2007

DTrace, Leopard, and the business of open source

If you haven't seen it, DTrace is now shipping in Mac OS X Leopard. This is very exciting for us here at Sun, but one could be forgiven for asking an obvious question: why? How is having our technology in Leopard (which, if Ars Technica is to be believed, is "perhaps the most significant change in the Leopard kernel") helping Sun? More bluntly, haven't we in Sun handed Apple a piece of technology that we could have sold them instead? The answer to these questions -- which are very similar in spirit to the questions that were asked over and over again internally as we prepared to open source Solaris -- is that they are simply the wrong questions.

The thrust of the business questions around open source should not be "how does this directly help me?" or "can't I sell them something instead?" but rather "how much does it cost me?" and "does it hurt me?" Why must one shift the bias of the questions? Because open source often helps in much more profound (and unanticipated) ways than just this quarter's numbers; one must look at open source as long term strategy rather than short term tactics. And as for the intense (and natural) desire to sell a technology instead of giving away the source code, one has to understand that the choice is not between "I give a customer my technology" and "I sell a customer my technology", but rather between "a customer that I never would have had uses my technology" and "a customer that I never would have had uses someone else's technology." When one thinks of open source in this way, the business case becomes much clearer -- but this still may be a bit abstract, so let's apply these questions to the specific case of DTrace in Leopard...

The first question is "how much did it cost Sun to get DTrace on Leopard?" The answer to this first question is that it cost Sun just about nothing. And not metaphorical nothing -- I'm talking actual, literal nothing: Adam, Mike and I had essentially one meeting with the Apple folks, answering some questions that we would have answered for anyone anyway. But answering questions doesn't ship product; how could the presence of our software in another product cost us nothing? This is possible because of that most beautiful property of software: it has no variable cost; the only meaningful costs associated with software are fixed costs, and those costs were all borne by Apple. Indeed, it has cost Sun more money in terms of my time to blog how this didn't cost anything to Sun than it did in fact cost Sun in the first place...

With that question answered, the second question is "does the presence of DTrace on Leopard hurt Sun?" The answer is that it's very hard to come up with a situation whereby this hurts Sun: one would have to invent a fictitious customer who is happily buying Sun servers and support -- but only because they can't get DTrace on their beloved Mac OS X. In fact, this mythical customer apparently hates Sun (but paradoxically loves DTrace?) so much that they're willing to throw out all of their Sun and Solaris investment over a single technology -- and one that is present in both systems no less. Even leaving aside that Solaris and Mac OS X are not direct competitors, this just doesn't add up -- or at least, it adds up to such a strange, irrational customer that you'll find them in the ones and twos, not the thousands or millions.

But haven't we lost some competitive advantage to Apple? Doesn't that hurt Sun? The answer, again, is no. If you love DTrace (and again, that must be presupposed in the question -- if DTrace means nothing to you, then its presence in Mac OS X also means nothing to you), then you are that much more likely to look at (and embrace) other perhaps less ballyhooed Solaris
technologies like SMF, FMA, Zones, least-privilege, etc. That is, the kind of technologist who appreciates DTrace is also likely to appreciate the competitive advantages of Solaris that run far, far beyond merely DTrace -- and that appreciation is not likely to be changed by the presence of DTrace in another system.

Okay, so this doesn't cost Sun anything, and it doesn't hurt Sun. Once one accepts that, one is open to a much more interesting and meaningful question: namely, does this help Sun? Does it help Sun to have our technology -- especially a revolutionary one -- present in other systems? The answer is "you bet!" There are of course some general, abstract ways that it helps -- it grows our DTrace community, it creates larger markets for our partners and ISVs that wish to offer DTrace-based solutions and services, etc. But there are also more specific, concrete ways: for example, how could it not help Solaris to have Ruby developers (the vast majority of whom develop on Mac OS X) become accustomed to using DTrace to debug their Rails app? Today, Rails apps are generally developed on Mac OS X and deployed on Linux -- but one can make a very, very plausible argument that getting Rails developers hooked on DTrace on the development side could well the change the dynamics on the deployment side. (After all, DTrace + Leopard + Ruby-on-Rails is crazy delicious!) This all serves as an object lesson of how unanticipatable the benefits of open source can be: despite extensive war-gaming, no one at Sun anticipated that open sourcing DTrace would allow it to be used to Sun's advantage on a hot web development platform running on a hip development system, neither of which originated at Sun.

And the DTrace/Leopard/Ruby triumvirate points to a more profound change: the presence of DTrace in other systems assures that it transcends a company or its products -- that it moves beyond a mere a feature, and becomes a technological advance. As such, you can be sure that systems that lack DTrace will become increasingly unacceptable over time. DTrace's shift from product to technological advance -- just like the shifts in NFS or Java before it -- is decidedly and indisputably in Sun's interest, and indeed it embodies the value proposition of the open systems vision that started our shop in the first place. So here's to DTrace on Leopard, long may it reign!

Thursday Sep 06, 2007

DTrace on ONTAP?

As presumably many have seen, NetApp is suing Sun over ZFS. I was particularly disturbed to read Dave Hitz's account, whereby we supposedly contacted NetApp 18 months ago requesting that they pay us "lots of money" (his words) for infringing our patents. Now, as a Sun patent holder, reading this was quite a shock: I'm not enamored with the US patent system, but I have been willing to contribute to Sun's patent portfolio because we have always used them defensively. Had we somehow lost our collective way?

Now, I should say that there was something that smelled fishy about Dave's account from the start: I have always said that a major advantage of working for or doing business with Sun is that we're too disorganized to be evil. Being disorganized causes lots of problems, but actively doing evil isn't among them; approaching NetApp to extract gobs of money shows, if nothing else, a level of organization that we as a company are rarely able to summon. So if Dave's account were true, we had silently effected two changes at once: we had somehow become organized and evil!

Given this, I wasn't too surprised to learn that Dave's account isn't true. As Jonathan explains, NetApp first approached STK through a third party intermediary (boy, are they ever organized!) seeking to buy the STK patents. When we bought STK, we also bought the ongoing negotiations, which obviously, um, broke down.

Beyond clarifying how this went down, I think two other facts merit note: one is that NetApp has filed suit in East Texas, the infamous den of patent trolls. If you'll excuse my frankness for a moment, WTF? I mean, I'm not a lawyer, but we're both headquartered in California -- let's duke this out in California like grown-ups.

The second fact is that this is not the first time that NetApp has engaged in this kind of behavior: in 2003, shortly after buying the patents from bankrupt Auspex, NetApp went after BlueArc for infringing three of the patents that they had just bought. Importantly, NetApp lost on all counts -- even after an appeal. To me, the misrepresentation of how the suit came to be, coupled with the choice of venue and history show that NetApp -- despite their claptrap to the contrary -- would prefer to meet their competition in a court of law than in the marketplace.

In the spirit of offering constructive alternatives, how about porting DTrace to ONTAP? As I offered to IBM, we'll help you do it -- and thanks to the CDDL, you could do it without fear of infringing anyone's patents. After all, everyone deserves DTrace -- even patent trolls. ;)

Tuesday Aug 21, 2007

DTrace at Google

Recently, I gave a Tech Talk at Google on DTrace, the video for which is now online. If you've seen me present before and you don't want to suffer through the same tired anecdotes, arcane jokes, and disturbing rants, jump ahead to about 57:16 to see a demo of DTrace for Python -- and in particular John Levon's incredible Python ustack helper. You also might want to skip ahead to 1:10:46 for the Q&A -- any guesses what the first question was? (Hint: it was such an obvious question, both I and the room erupted with laughter.) Note that my rather candid answer to that question represents my opinion alone, and does not constitute legal advice -- and nor does it represent any sort of official position of Sun...

Thursday Aug 02, 2007

Caught on camera

Software is such an abstract domain that it is rare for a photograph to say much. But during his recent trip to OSCON, Adam snapped a photo that actually says quite a bit. For those who are curious, "jdub" is the nom de guerre of Jeff Waugh, who should really know better...



Top Tags
« July 2016