Turning the corner

It's a little hard to believe that it's been only fifteen months since we shipped our first product. It's been a hell of a ride; there is nothing as exhilarating nor as exhausting as having a newly developed product that is both intricate and wildly popular. Especially in the domain of enterprise storage -- where perfection is not just the standard but (entirely reasonably) the expectation -- this makes for some seriously spiked punch.

For my own part, I have had my head down for the last six months as the Technical Lead for our latest software release, 2010.Q1, which is publicly available as of today. In my experience, I have found that in software (if not in life), one may only ever pick two of quality, features and schedule -- and for 2010.Q1, we very much picked quality and features. (As for schedule, let it be only said that this release was once known as "2009.Q4"...)

2010.Q1 Quality

You don't often see enterprise storage vendors touting quality improvements for a very simple reason: if the product was perfect when you sold it to me, why are you talking about how much you've improved it? So I'm going to break a little bit with established tradition and acknowledge that the product has not been perfect, though not without good reason. With our initial development of the product, we were pushing many new technologies very aggressively: not only did we seek to build enterprise-grade storage on commodity components (a deceptively daunting challenge in its own right), we were also building on entirely new elements like flash -- and then topped it all off with an ambitious, from-scratch management stack. What were we possibly thinking by making so many bets at once? We made these bets not out of recklessness, but rather because they were essential elements of our Big Bet: that customers were sick of paying monopoly rents for enterprise storage, and that we could deliver a quantum leap in price-performance. (And if nothing else, let it be said that we got that one very, very right -- seemingly too right, at times.) As for the specific technology bets, some have proven to be unblemished winners, while others have been more of a struggle. Sometimes the struggle was because the problem was hard, sometimes it was because the software was immature, and sometimes it was because a component that was assumed to have known failure modes had several (or many) unanticipated (or byzantine) failure modes. And in the worst cases, of course, it was all three...

I'm pleased to report that in 2010.Q1, we turned the corner on all fronts: in addition to just fixing a boatload of bugs in key areas like clustering and networking, we engaged in fundamental work like Dave's rearchitecture of remote replication, adapted to new device failure modes as with Greg's rearchitecture around resilience to HBA logic failure, and -- perhaps most importantly -- integrated critical firmware upgrades to each of the essential components of the I/O path (HBAs, SIM cards and disks). Also in 2010.Q1, we changed the way that we run the evaluation of the software, opening the door to many in our rapidly growing customer base. As a result, this release is already running on more customer production systems than any of its predecessors were at the time that they shipped -- and on many more eval and production machines within our own walls.

2010.Q1 Features

But as important as quality is to this release, it's not the full story: the release is also packed with major features like deduplication, iSER/SRP support, Kerberized NFS support and Fibre Channel support. Of these, the last is of particular interest to me because, in addition to my role as the Technical Lead for 2010.Q1, I was also responsible for the integration of FC support into the product. There was a lot of hard work here, but much of it was borne by John Forte and his COMSTAR team, who did a terrific job not only on the SCSI Target Management facility (STMF) but also on the base ALUA support necessary to allow proper FC operation in a cluster. As for my role, it was fun to cut the code to make all of this stuff work. Thanks to some great design work by Todd Patrick, along with some helpful feedback from field-facing colleagues like Ryan Matthews, I think we came up with a clean, functional interface. And working closely with both John and our test team, we have developed a rock-solid FC product. But of course (and as one might imagine), for me personally, the really gratifying bit was adding FC support to analytics. With just a pinch of DTrace and a bit of glue code, we now have visibility into FC operations by LUN, by project, by target, by initiator, by operation, by SCSI command, by size, by offset and by latency -- and by any combination thereof.

As I was developing FC analytics, I would use as my source of load a silly disk benchmark I wrote back in the day when Adam and I were evaluating SSDs. Here, for example, is that benchmark running against a LUN that I named "thicktail-bench":

The initiator here is the machine "thicktail"; it's interesting to break down by initiator and see the paths by which thicktail is accessing the LUN:

(These names are human readable because I have added aliases for each of thicktail's two HBA ports. Had I not added those aliases, we would see WWNs here.) The above shows us that thicktail is accessing the LUN through both of its paths, which is what we would expect (but good to visually confirm). Let's see how it's accessing the LUN in terms of operations:

Nothing too surprising here -- this is the write phase of the benchmark and we have no log devices on this system, so we fully expect this. But let's break down by offset:

The first time I saw this, I was surprised. Not because of what it shows -- I wrote this benchmark, and I know what it does -- but rather because it was so eye-popping to really see its behavior for the first time. In particular, this captures an odd phase I added to this benchmark: it does random writes across an increasingly large range. I did this because we had discovered that some SSDs did fine when the writes were confined to a small logical region, but broke down -- badly -- when the writes were over a larger region. And no, I don't know why this was the case (presumably the firmware was in fragmented/wear-leveling/cache-busting hell); all I know is that we rejected any further exploration once the writes to the SSD were of a higher latency than that of my first hard drive: the IBM PC XT's 10 MB ST-412, which had roughly 95 ms writes! (We felt that expecting an SSD to have better write latency than a hard drive from the first Reagan Administration was tough but fair...)
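The benchmark itself isn't reproduced here, but a minimal sketch of that write phase -- synchronous random writes confined to a region that doubles in size each pass -- might look something like the following. The device path, I/O size and region bounds are illustrative assumptions, not the original benchmark's parameters:

    import os, random, time

    DEV = "/dev/rdsk/c1t0d0s0"    # hypothetical raw device -- not the original target
    IO_SIZE = 4096                # assumed write size
    WRITES_PER_STEP = 10000       # assumed number of writes at each region size

    def write_phase(fd, region_bytes):
        # Issue random, block-aligned, synchronous writes within the first
        # region_bytes of the device; return the average latency per write.
        buf = os.urandom(IO_SIZE)
        start = time.time()
        for _ in range(WRITES_PER_STEP):
            offset = random.randrange(0, region_bytes // IO_SIZE) * IO_SIZE
            os.pwrite(fd, buf, offset)
        return (time.time() - start) / WRITES_PER_STEP

    fd = os.open(DEV, os.O_WRONLY | os.O_DSYNC)
    region = 1 << 20                       # start with a 1MB region...
    while region <= 1 << 34:               # ...and grow it out to 16GB
        avg = write_phase(fd, region)
        print("region %11d bytes: %.2f ms/write" % (region, avg * 1000))
        region *= 2
    os.close(fd)

The behavior described above would show up here as a per-write latency that stays flat while the region is small and climbs sharply once the region grows past whatever the firmware can comfortably manage.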
What now?

As part of our ongoing maturity as a product, we have developed a new role here at Fishworks: starting in 2010.Q1, the Technical Lead for the release will, as the release ships, transition to become the full-time Support Lead for that release in the field. This means many things for the way we support the product, but for our customers, it means that if and when you do have an issue on 2010.Q1, you should know that the buck on your support call will ultimately stop with me. We are establishing an unprecedented level of engineering integration with our support teams, and we believe that it will show in the support experience. So welcome to 2010.Q1 -- and happy upgrading!

John Birrell

It is with a heavy heart that I announce that we in the DTrace community have lost one of our own: the indomitable John Birrell, who ported DTrace to FreeBSD, suffered a stroke and passed away on Friday, November 20, 2009.

We on Team DTrace knew John to be a remarkably talented and determined software engineer. As those who have attempted ports can attest, DTrace passes through rough country, and a port to a foreign system is a significant undertaking that requires mastery of both DTrace and (particularly) the target system. And in being the first to attempt a port, John's challenge was that much greater -- and his success in the endeavor a tribute to both his ability and (especially) his tenacity. For example, in performing the port, John decided that DTrace's dependency on the cyclic subsystem was such that it, too, needed to be ported. He didn't need to do this (and indeed, other ports have decided that an arbitrary resolution profile provider is not worth the significant trouble), but that he undertook this additional technical challenge anyway -- even when any victory would remain hidden to all but the most expert eye -- says a lot about John as both an engineer and a man. Later, when the port ran into some frustrating licensing issues, John once again did not give up. Rather, he backed up, and found a path forward that would satisfy all parties -- even though it required significant technical reworking on his part. I have long believed that the mark of a great engineer is not how frequently they get knocked down, but rather how quickly they get back up -- and in this regard, John was indisputably a giant.

John, you will be missed -- not only by the FreeBSD community upon which you made an indelible mark, but by those of us in the DTrace community who only had the opportunity to work with you more recently. And while your legacy might remain anonymous to the future generations that will benefit from the fruits of your long labor, we will always know that it never would have happened without you. Thank you, and farewell.

(Those who wish to memorialize John may want to do as I did and make a donation in his memory to the FreeBSD Foundation.)

Queue, CACM, and the rebirth of the ACM

As I have mentioned before (if in passing), I sit on the Editorial Advisory Board of ACM Queue, ACM's flagship publication for practitioners. In the past year, Queue has undergone a significant transformation, and now finds itself at the vanguard of a much broader shift within the ACM -- one that I confess to once thinking impossible.

My story with respect to the ACM is like that of many practitioners, I suspect: I first became aware of the organization as an undergraduate computer science student, when it appeared to me as the embodiment of academic computer science. This perception was cemented by its flagship publication, Communications of the ACM, a magazine which, to a budding software engineer longing for the world beyond academia, seemed to be adrift in dreamy abstraction. So when I decided at the end of my undergraduate career to practice my craft professionally, I didn't for a moment consider joining the ACM: it clearly had no interest in the practitioner, and I had no interest in it.

Several years into my career, my colleague David Brown mentioned that he was serving on the Editorial Board of a new ACM publication aimed at the practitioner, dubbed ACM Queue. The idea of the ACM focussing on the practitioner brought to mind a piece of Sun engineering lore from the old Mountain View days. Sometime in the early 1990s, the campus engaged itself in a water fight that pitted one building against the next. The researchers from the Sun Labs building built an elaborate catapult to launch water-filled missiles at their adversaries, while the gritty kernel engineers in legendary MTV05 assembled surgical tubing into simple but devastatingly effective three-person water balloon slingshots. As one might guess, the Labs folks never got their catapult to work -- and the engineers doused them with volley after volley of water balloons. So when David first mentioned that the ACM was aiming a publication at the practitioner, my mental image was of lab-coated ACM theoreticians, soddenly tinkering with an overcomplicated contraption. I chuckled to myself at this picture, wished David good luck on what I was sure was going to be a fruitless endeavor, and didn't think any more of it.

Several months after it launched, I happened to come across an issue of the new ACM Queue. With skepticism, I read a few of the articles. I found them to be surprisingly useful -- almost embarrassingly so. I sheepishly subscribed, and I found that even the articles that I disagreed with -- like this interview with an apparently insane Alan Kay -- were more thought-provoking than enraging. And in soliciting articles on sordid topics like fault management from engineers like my long-time co-conspirator Mike Shapiro, the publication proved itself to be interested in both abstract principles and their practical application. So when David asked me to be a guest expert for their issue on system performance, I readily accepted. I put together an issue that I remain proud of today, with articles from Bart Smaalders on performance anti-patterns, Phil Beevers on development methodologies for high-performance software, me on DTrace -- and topped off with an interview between Kirk McKusick and Jarod Jenson that, among its many lessons, warns us of the subtle perils of Java's notifyAll.

Two years later, I was honored to be asked to join Queue's Editorial Advisory Board, where my eyes were opened to a larger shift within the ACM: the organization -- led by both its executive leadership in CEO John White and COO Pat Ryan and its past and present elected ACM leadership like Steve Bourne, Dave Patterson, Stu Feldman and Wendy Hall -- was earnestly and deliberately seeking to bring the practitioner into the ACM fold. And I quickly learned that I was not alone in my undergraduate dismissal of Communications of the ACM: CACM was broadly viewed within the ACM as being woefully out of touch with both academic and practitioner alike, with one past president confessing that he himself couldn't stomach reading it -- even when his name was on the masthead. There was an active reform movement within the ACM to return the publication to its storied past, and this trajectory intersected with the now-proven success of Queue: it was decided that the in-print vehicle for Queue would shift to become the Practice section of a new, revitalized CACM. I was elated by this change, for it meant that our superlative practitioner-authored content would at last enter the walled garden of the larger academic community. And for practitioners, a newly relevant CACM would also serve to expose us to a much broader swathe of computer science.

After much preparation, the new CACM launched in July 2008. Nearly a year later, I think it can safely be called a success. To wit, I point to two specific (if personal) examples from that first issue alone: thanks to the new CACM, my colleague Adam Leventhal's work on flash memory and our integration of it in ZFS found a much broader readership than it would have otherwise -- and Adam was recently invited to join an otherwise academic symposium on flash. And thanks to the new CACM, I -- and thousands of other practitioners -- were treated to David Shaw's incredible Anton, the kind of work that gives engineers an optimistic excitement uniquely induced by such moon shots. By bringing together the academic and the practitioner, the new CACM is effecting a new ACM.

So, to my fellow practitioners: I strongly encourage you to join me as a member of the ACM. While CACM is clearly a compelling and tangible benefit, it is not the only reason to join the ACM. As professionals, I believe that we have a responsibility to our craft: to learn from our peers, to offer whatever we might have to teach, and to generally leave the profession better than we found it. In other professions -- in law, in medicine, and in more traditional engineering domains -- this professional responsibility is indoctrinated to the point of expectation. But our discipline perhaps shows its youth in our ignorance of this kind of professional service. To be fair, this cannot be laid entirely at the practitioner's feet: the organizations that have existed for computer scientists have simply not been interested in attracting, cultivating, or retaining the practitioner. But with the shift within the ACM embodied by the new CACM, this is changing. The ACM now aspires to be the organization that represents all computer scientists -- not just those who teach students, perform research and write papers, but also those of us who cut code, deliver product and deploy systems for a living. Joining the ACM helps it make good on this aspiration; we practitioners cannot effect this essential change from outside its membership. And we must not stop at membership: if there is an article that you might like to write for the broader ACM audience, or an article that you'd like to see written, or a suggestion you might have for a CTO roundtable or a practitioner you think should be interviewed, or, for that matter, any other change that you might like to see in the ACM to further appeal to the practitioner, do not stay silent; the ACM has given us practitioners a new voice -- but it is only good if we use it!

Moore's Outlaws

My blog post eulogizing SPEC SFS has elicited quite a bit of reaction, much of it from researchers and industry observers who have drawn similar conclusions. While these responses were very positive, my polemic garnered a different reaction from SPEC SFS stalwart NetApp, where, in his response defending SPEC SFS, my former colleague Mike Eisler concocted this Alice-in-Wonderland defense of the lack of a pricing disclosure in the benchmark:

    Like many industries, few storage companies have fixed pricing. As much as heads of sales departments would prefer to charge the same highest price to every customer, it isn't going to happen. Storage is a buyers' market. And for storage devices that serve NFS and now CIFS, the easily accessible numbers on spec.org are yet another tool for buyers. I just don't understand why a storage vendor would advocate removing that tool.

Mike's argument -- and I'm still not sure that I'm parsing it correctly -- appears to be that the infamously opaque pricing in the storage business somehow helps customers because they don't have to pay a single "highest price"! That is, that the lack of transparent pricing somehow reflects the "buyers' market" in storage. If that is indeed Mike's argument, someone should let the buyers know how great they have it -- those silly buyers don't seem to realize that the endless haggling over software licensing and support contracts is for them!

And if that argument isn't contorted enough for you, Mike takes a second tack:

    In storage, the cost of the components to build the device falls continuously. Just as our customers have a buyers' market, we storage vendors are buyers of components from our suppliers and also enjoy a buyers' market. Re-submitting numbers after a hunk of sheet metal declines in price is silly.

His ludicrous "sheet metal" example aside (what enterprise storage product contains more than a few hundred bucks of sheet metal?), Mike's argument appears to be that technology curves like Moore's Law and Kryder's Law lead to enterprise storage prices that are falling with such alarming speed that they're wrong by the time they are so much as written down! If it needs to be said, this argument is absurd on many levels. First, the increases in transistor density and areal storage density tend to result in more computing bandwidth and more storage capacity per dollar, not lower absolute prices. (After all, your laptop is three orders of magnitude more powerful than a personal computer circa 1980 -- but it's certainly not a thousandth of the price.)

Second, has anyone ever accused the enterprise storage vendors of dropping their prices in pace with these laws -- or even abiding by them in the first place? The last time I checked, the single-core Mobile Celeron that NetApp currently ships in their FAS2020 and FAS2050 -- a CPU with a criminally small 256K of L2 cache -- is best described as a Moore's Outlaw: a CPU that, even when it first shipped six (six!) years ago, was off the curve. (A single-core CPU with 256K of L2 cache was abiding by Moore's Law circa 1997.)
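To make the arithmetic behind that parenthetical explicit -- a back-of-the-envelope check assuming the customary doubling period of roughly two years, not a precise model:

    # If 256K of L2 was on the Moore's Law curve in 1997, a ~2-year doubling
    # period (an assumption, not a measurement) predicts the following:
    baseline_kb, baseline_year = 256, 1997
    doubling_period = 2.0

    for year in (2003, 2009):
        doublings = (year - baseline_year) / doubling_period
        expected_kb = baseline_kb * 2 ** doublings
        print("%d: ~%dK of L2 expected, vs. the FAS20x0's 256K" % (year, expected_kb))

    # 2003 (when the CPU first shipped): ~2048K expected -- already off the curve
    # 2009: ~16384K expected -- a Moore's Outlaw indeed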
Though it's no wonder that NetApp sees plummeting component costs when they're able to source their CPUs by dumpster diving...

Getting back to SPEC SFS: even if the storage vendors were consistently reflecting technology improvements, SPEC SFS is (as I discussed) a drive latency benchmark that doesn't realize the economics of these curves anyway; drives are not rotating any faster year-over-year, having leveled out at 15K RPM some years ago due to some nasty physical constraints (like, the sound barrier). So there's no real reason to believe that the 2,016 15K RPM drives used in NetApp's opulent 1,032,461 op submission are any cheaper today than when this configuration was first submitted three years ago. Yes, those same drives would likely have more capacity (being 146GB or 300GB and not the 72GB in the submission), but recall that these drives are being short-stroked to begin with -- so insofar as additional capacity is being used at all by the benchmark, it will only be used to assure even less head movement!

Finally, even if Mike were correct that technology advances result in ever falling absolute prices, it still should not prohibit price disclosures. We all understand that prices reflect a moment in time, and if natural inflation does not dissuade us from price disclosures, nor should any technology-induced deflation. So to be clear: SPEC SFS needs pricing disclosures. TPC has them, SPC has them -- and SFS needs them if the benchmark has any aspiration to enduring relevance. While SPEC SFS's flaws run deeper than the missing price disclosure, the disclosure would at least keep the more egregious misbehaviors in check -- and it would also (I believe) show storage buyers the degree to which the systems measured by SPEC SFS do not in fact correspond to the systems that they purchase and deploy.

One final note: in his blog entry, Mike claims that "SPEC SFS performance is the minimum bar for entry into the NAS business." If he genuinely believes this, Mike may want to write a letter to the editors of InfoWorld: in their recent review of our Sun Storage 7210, they had the audacity to ignore the lack of SPEC SFS results for the appliance, instead running their own benchmarks. Their rating for the product's performance? 10 out of 10. What heresy!

Eulogy for a benchmark

I come to bury SPEC SFS, not to praise it. When we at Fishworks set out, our goal was to build a product that would disrupt the enterprise NAS market with revolutionary price/performance. Based on the economics of Sun's server business, it was easy to know that we would deliver on the price half of that promise, but the performance half promised to be more complicated: while price is concrete and absolute, the notion of performance fluctuates with environment, workload and expectations. To cut through these factors, computing systems have long had their performance quantified with benchmarks that hold environment and workload constant, and as we began to learn about NAS benchmarks, one in particular loomed large among the vendors: SPEC's system file server benchmark, SFS. Curiously, the benchmark didn't come up much in conversations with customers, who seemed to prefer talking about raw capabilities like maximum delivered read bandwidth, maximum delivered write bandwidth, maximum synchronous write IOPS (I/O operations per second) or maximum random read IOPS. But it was clear that the entrenched NAS vendors took SPEC SFS very seriously (indeed, to the point that they seemed to use no other metric to describe the performance of the system), and upstarts seeking to challenge them seemed to take it even more seriously, so we naturally assumed that we too should use SPEC SFS as the canonical metric of our system...

But as we explored SPEC SFS -- as we looked at the workload that it measures, examined its run rules, studied our rivals' submissions and then contrasted that to what we saw in the market -- an ugly truth emerged: whatever connection to reality it might have once had, SPEC SFS has long since become completely divorced from the way systems are actually used. And worse than simply being outdated or irrelevant, SPEC SFS is so thoroughly misguided as to implicitly encourage vendors to build the wrong systems -- ones that are increasingly outrageous and uneconomic. Quite the opposite of being beneficial to customers in evaluating systems, SPEC SFS has decayed to the point that it is serving the opposite ends: by rewarding the wrong engineering decisions, punishing the right ones and eliminating price from the discussion, SPEC SFS has actually led to lower performing, more expensive systems! And amazingly, in the update to SPEC SFS -- SPEC SFS 2008 -- the benchmark's flaws have not only gone unaddressed, they have metastasized. The result is such a deformed monstrosity that -- like the index case of some horrific new pathogen -- its only remaining utility lies on the autopsy table: by dissecting SPEC SFS and understanding how it has failed, we can seek to understand deeper truths about benchmarks and their failure modes.

Before taking the scalpel to SPEC SFS, it is worth considering system benchmarks in the abstract. The simplest system benchmarks are microbenchmarks that measure a small, well-defined operation in the system. Their simplicity is their great strength: because they boil the system down to its most fundamental primitives, the results can serve as a truth that transcends the benchmark. That is, if a microbenchmark measures a NAS box to provide 1.04 GB/sec read bandwidth from disk, then that number can be considered and understood outside of the benchmark itself. The simplicity of microbenchmarks conveys other advantages as well: microbenchmarks are often highly portable, easily reproducible, straightforward to execute, etc.
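As a concrete (if deliberately simple-minded) illustration of the genre, a sequential-read microbenchmark of the sort that might produce a number like the 1.04 GB/sec above can be only a few lines; the mount point and I/O size here are assumptions for illustration:

    import os, time

    PATH = "/mnt/nas/bigfile"     # hypothetical large file on the filer under test
    IO_SIZE = 1 << 20             # read in 1MB chunks (an assumption)

    fd = os.open(PATH, os.O_RDONLY)
    total = 0
    start = time.time()

    while True:
        buf = os.read(fd, IO_SIZE)
        if not buf:
            break
        total += len(buf)

    elapsed = time.time() - start
    os.close(fd)
    print("read %d bytes in %.1fs: %.2f GB/sec" % (total, elapsed, total / elapsed / 1e9))

(A real microbenchmark would take pains to defeat client caching, run multiple streams, and so on; the point is only that the measurement is small, self-contained and easily reasoned about.)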
Unfortunately, systems themselves are rarely as simple as their atoms, and microbenchmarks are unable to capture the complex interactions of a deployed system. More subtly, microbenchmarks can also lead to the wrong conclusions (or, worse, the wrong engineering decisions) by giving excessive weight to infrequent operations. In his excellent article on performance anti-patterns, my colleague Bart Smaalders discussed this problem with respect to the getpid system call. Because measuring getpid has been the canonical way to measure system call performance, some operating systems have "improved" system call performance by turning getpid into a library call. This effort is misguided, as are any decisions based on the results of measuring it: as Bart pointed out, no real application calls getpid frequently enough for it to matter in terms of delivered performance.

Making benchmarks representative of actual loads is a more complicated undertaking, with any approach stricken by potentially serious failings. The most straightforward approach is taken by application benchmarks, which run an actual (if simplified) application on the system, and measure its performance. This approach has the obvious advantage of measuring actual, useful work -- or at least one definition of it. This means, too, that system effects are being taken into consideration, and that one can have confidence that more than a mere back eddy of the system is being measured. But an equally obvious drawback to this approach is that it is only measuring one application -- an application which may not be at all representative of a deployed system. Moreover, because the application itself is often simplified, application benchmarks can still exhibit the microbenchmark's failings of oversimplification. From the perspective of storage systems, application benchmarks have a more serious problem: because application benchmarks require a complete, functional system, they make it difficult to understand and quantify merely the storage component. From the application's perspective, the system is opaque; who is to know if, say, an impressive TPC result is due to the storage system rather than more mundane factors like, say, database tuning?

Synthetic benchmarks address this failing by taking the hybrid approach of deconstructing application-level behavior into microbenchmark-level operations that they then run in a mix that matches actual use. Ideally, synthetic benchmarks combine the best of both variants: they offer the simplicity and reproducibility of the microbenchmarks, but the real-world applicability of the application-level benchmarks. But beneath this promise of synthetic benchmarks lurks an opposite peril: if not executed properly, synthetic benchmarks can embody the worst properties of both benchmark variants. That is, if a synthetic benchmark combines microbenchmark-level operations in a way that does not in fact correspond to higher level behavior, it has all of the complexity, specificity and opacity of the worst application-level benchmarks -- with the utter inapplicability to actual systems exhibited by the worst microbenchmarks.

As one might perhaps imagine from the foreshadowing, SPEC SFS is a synthetic benchmark: it combines NFS operations in an operation mix designed to embody "typical" NFS load. SPEC SFS has evolved over more than a decade, having started life as NFSSTONE and then morphing into NHFSSTONE (ca. 1992) and then LADDIS (a consortium of Legato, Auspex, DEC, Data General, Interphase and Sun) before becoming a part of SPEC. (As an aside, "LADDIS" is clearly BUNCH-like in being a portent of a slow and miserable death -- may Sun break the curse!) Here is the NFS operation mix for SPEC SFS over its lifetime:

    NFS operation    SFS 1.1 (LADDIS)    SFS 2.0/3.0 (NFSv2)    SFS 2.0/3.0 (NFSv3)    SFS 2008
    LOOKUP           34%                 36%                    27%                    24%
    READ             22%                 14%                    18%                    18%
    WRITE            15%                 7%                     9%                     10%
    GETATTR          13%                 26%                    11%                    26%
    READLINK         8%                  7%                     7%                     1%
    READDIR          3%                  6%                     2%                     1%
    CREATE           2%                  1%                     1%                     1%
    REMOVE           1%                  1%                     1%                     1%
    FSSTAT           1%                  1%                     1%                     1%
    SETATTR          -                   -                      1%                     4%
    READDIRPLUS      -                   -                      9%                     2%
    ACCESS           -                   -                      7%                     11%
    COMMIT           -                   -                      5%                     N/A

The first thing to note is that the workload hasn't changed very much over the years: it started off being 58% metadata read operations (LOOKUP, GETATTR, READLINK, READDIR, READDIRPLUS, ACCESS), 22% read operations and 15% write operations, and it's now 65% metadata read operations, 18% read operations and 10% write operations. So where did that original workload come from? From an unpublished study at Sun conducted in 1986! (I recently interviewed a prospective engineer who was not yet born when this data was gathered -- and I've always thought it wise to be wary of data older than oneself.) The updates to the operation mix are nearly as dubious: according to David Robinson's thorough paper on the motivation for SFS 2.0, the operation mix for SFS 3.0 was updated based on a survey of 750 Auspex servers running NFSv2 -- which even at the time of that paper's publication in 1999 must have elicited some cocked eyebrows about the relevance of workloads on such clunkers. And what of the most recent update? The 2008 reaffirmation of the decades-old workload is, according to SPEC, "based on recent data collected by SFS committee members from thousands of real NFS servers operating at customer sites." SPEC leaves unspoken the uncanny coincidence that the "recent data" pointed to an identical read/write mix as that survey of those now-extinct Auspex dinosaurs a decade ago -- plus ça change, apparently!

Okay, so perhaps the operation mix is paleolithic. Does that make it invalid? Not necessarily, but this particular operation mix does appear to be something of a living fossil: it is biased heavily towards reads, with a mere 15% of operations being writes (and a third of these being metadata writes). While I don't doubt that this is an accurate snapshot of NAS during the Reagan Administration, the world has changed quite a bit since then. Namely, DRAM sizes have grown by nearly five orders of magnitude (!), and client caching has grown along with it -- both in the form of traditional NFS client caching, and in higher-level caching technologies like memcached or (at a larger scale) content distribution networks. This caching serves to satisfy reads before they ever make it to the NAS head, which can leave the NAS head with those operations that cannot be cached, worked around or generally ameliorated -- which is to say, writes.

If the workload mix is dated because it does not express the rise of DRAM as cache, one might think that this would also shine through in the results, with systems increasingly using DRAM cache to achieve a high SPEC SFS result.
But this has not in fact transpired, and the reason it hasn't brings us to the first fatal flaw of SPEC SFS: instead of making the working set a parameter of the benchmark -- and having a result be not a single number but rather a graph of results given different working set sizes -- the working set size is dictated by the desired number of operations per second. In particular, in running SPEC SFS 3.0, one is required to have ten megabytes of underlying filesystem for every operation per second. (Of this, 10% is utilized as the working set of the benchmark.) This "scaling rule" is a grievous error, for it diminishes the importance of cache as load climbs: in order to achieve higher operations per second, one must have larger working sets -- even though there is absolutely no reason to believe that such a simple, linear relationship exists in actual workloads. (Indeed, in my experience, if anything the opposite may be true: those who are operation intensive seem to have smaller working sets, not larger ones -- and those with larger amounts of data tend to focus on bandwidth more than operations per second.)

Interestingly, when this scaling rule was established, it was done so with some misgivings. According to David Robinson's paper (emphasis added):

    From the large file set created, a smaller working set is chosen for actual operations. In SFS 1.0 the working set size was 20% of file set size or 1 MB per op/sec. With the doubling of the file set size in SFS 2.0, the working set was cut in half to 10% to maintain the same working set size. Although the amount of disk storage grows at a rapid rate, the amount of that storage actually being accessed grows at a much slower rate. [ ... ] A 10% working set size may still be too large. Further research in this area is needed.

David, at least, seems to have been aware that this scaling rule was specious even a decade ago. But if the scaling rule was suspect in the mid-1990s, it has become absurd since. To see why, take, for example, NetApp's reasonably recent result of 137,306 operations per second. Getting to this number requires 10 MB per op/sec, or about 1.3 TB. Now, 10% of this -- or about 130GB -- will be accessed over the course of the benchmark. The problem is that from the perspective of caching, the only hope here is to cache metadata, as the data itself exceeds the size of cache and the data access pattern is essentially random. With the cache effectively useless, the engineering problem is no longer designing intelligent caching architectures, but rather designing a system that can quickly serve data from disk. Solving the former requires creativity, trade-offs and balance -- but solving the latter just requires brute force: fast drives and more of 'em. And in this NetApp submission, the force is particularly shock-and-awe: not just 15K RPM drives, but a whopping 224 144GB 15K RPM drives -- delivering 32TB of raw capacity for a mere 1.3TB filesystem. Why would anyone overprovision storage by a factor of 20? The answer is that with the filesystem presumably designed to allocate from outer tracks before inner ones, allocating only 5% of available capacity guarantees that all data will live on those fastest, outer tracks. This practice -- so-called short-stroking -- means both faster transfers and minimal head movement, guaranteeing that any I/O operation can be satisfied in just the rotational latency of a 15K RPM drive.
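Spelled out, the arithmetic behind those numbers is nothing more than a restatement of the scaling rule:

    # SPEC SFS 3.0 scaling rule: 10 MB of underlying filesystem per op/sec,
    # of which 10% forms the benchmark's working set.  For the 137,306 op/sec
    # submission discussed above:
    ops = 137306
    fs_mb = ops * 10                 # required filesystem size, in MB
    working_set_mb = fs_mb * 0.10    # portion actually touched during the run

    print("filesystem:  %.2f TB" % (fs_mb / (1024.0 * 1024)))    # ~1.31 TB
    print("working set: %.0f GB" % (working_set_mb / 1024))      # ~134 GB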
Short-stroking 224 15K RPM drives is the equivalent of fueling a dragster with nitromethane -- it is top performance at a price so high as to be useless off the dragstrip. It's a safe bet that if one actually had this problem -- if one wished to build a system to optimize for random reads within a 130GB working set over a total data set of 1.3TB -- one would never consider such a costly solution. How, then, would one solve this particular problem? Putting the entire data set on flash would certainly become tempting: an all flash-based solution is both faster and cheaper than the fleet of nitro-belching 15K RPM drives. But if this is so, does it mean that the future of SFS is to be flash-based configurations vying for king of an increasingly insignificant hill? It might have been so were it not for the revisions in SPEC SFS 2008: the scaling rule has gone from absurd to laughably ludicrous, as what used to be 10MB per op/sec is now 120MB per op/sec. And as if this recklessness were not enough, the working set ratio has additionally been increased from 10% to 30% of total storage. One can only guess what inspired this descent into madness, but the result is certainly insane: to achieve this same 137,306 ops will require a 17TB filesystem -- of which an eye-watering 5TB will be hot! This is nearly a 40X increase in working set size, without (as far as I can tell) any supporting data. At best, David's warning that the scaling rule may have been excessive has been roundly ignored; at worst, the vendors have deliberately calculated how to adjust the problem posed by the benchmark such that thousands of 15K RPM drives remain the only possible solution, even in light of new technologies like flash. But it's hard to know for sure which case SPEC has fallen into: the decision to both increase the scaling rule and increase the working set ratio is so terrible that incompetence becomes indistinguishable from malice.

Be it due to incompetence or malice, SPEC's descent into a disk benchmark while masquerading as a system benchmark does worse than simply mislead the customer: it actively encourages the wrong engineering decisions. In particular, as long as SPEC SFS is thought to be the canonical metric of NFS performance, there is little incentive to add cache to NAS heads. (If SPEC SFS isn't going to use it, why bother?) The engineering decisions made by the NAS market leaders reflect this thinking, as they continue to peddle grossly undersized DRAM configurations -- like NetApp's top-of-the-line FAS6080 and its meager maximum of 32GB of DRAM per head! (By contrast, our Sun Storage 7410 has up to 128GB of DRAM -- and for a fraction of the price, I hasten to add.) And it is of no surprise that none of the entrenched players conceived of the hybrid storage pool; SPEC SFS does little to reward cache, so why focus on it? (Aside from the fact that it delivers much faster systems, of course!)

While SPEC SFS is hampered by its ancient workload and made ridiculous by its scaling rule, there is a deeper and more pernicious flaw in SPEC SFS: there is no pricing disclosure. This flaw is egregious, unconscionable and inexcusable: as the late, great Jim Gray made clear in his classic 1985 Datamation paper, one cannot consider performance in a vacuum -- when purchasing a system, performance must be considered relative to price. Gray tells us how the database community came to understand this: in 1973, a bank received two bids for a new transaction system. One was for $5M from a mini-computer vendor (e.g. DEC with its PDP-11), the other for $25M from a traditional mainframe vendor (presumably IBM). The solutions offered identical performance; the fact that there was a 5X difference in price (and therefore price/performance) "crystallized" (in Gray's words) the importance of price in benchmarking -- and Gray's paper in turn enshrined price as an essential metric of a database system. (Those interested in the details of the origins of Gray's iconoclastic Datamation paper and the long shadow that it has cast are encouraged to read David DeWitt and Charles Levine's excellent retrospective on Gray's work in database performance.)

Today, the TPC benchmarks that Gray inspired have pricing at their heart: each submission is required to have a full disclosure report (FDR) that must include the price of the system and everything that that price includes, including part numbers and per-part pricing. Moreover, the system must be orderable: customers must be able to call up the vendor and demand the specified config at the specified price. This is a beautiful thing, because TPC allows for competition not just on performance ("TpmC" in TPC parlance) but also price/performance ($/TpmC). And indeed, in the 1990s, this is exactly what happened as low $/TpmC submissions from the likes of SQLServer running on Dell put competitive pressure on vendors like Sun to focus on price/performance -- with customers being the clear winners in the contest.

By contrast, SPEC SFS's absence of a pricing disclosure forbids competitors from competing on price/performance, instead encouraging absolute performance at any cost. This was taken to the logical extreme with NetApp and their preposterous 1,032,461 result -- which took but 2,016 short-stroked 15K RPM drives! Steven Schwartz took NetApp to task for the exorbitance of this configuration, pointing out that NetApp's configuration was two to four times more expensive on a per-op basis than competitive results, in his blog entry aptly titled "Benchmarks - Lies and the Lying Liars Who Tell Them." But are lower results any less outrageous? Take again that NetApp config. We don't know how much that 3170 and its 224 15K RPM drives will cost you because NetApp isn't forced to disclose it, but suffice it to say that it's quite a bit -- almost certainly seven figures undiscounted. But for the sake of argument, let's assume that you get a steep discount and you somehow get this clustered, racked-out config to price out at $500K. Even then, given the meager 1.3TB delivered for purposes of the benchmark, this system costs an eye-watering $384/GB -- which is about 8X more expensive than DRAM! So even in the unlikely event that your workload and working set match SPEC SFS, you would still be better off blowing your wallet on a big honkin' RAM disk than buying the benchmarked configuration. And this embodies the essence of the failings of SPEC SFS: the (mis)design of the benchmark demands economic insanity -- but the lack of pricing disclosure conceals that insanity from the casual observer. The lesson of SPEC SFS is therefore manifold: be skeptical of a system benchmark that is synthetic, be suspicious of a system benchmark that lacks a price disclosure -- and be damning when they are one and the same.
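The dollars-per-gigabyte arithmetic above, spelled out (the $500K figure is, as noted, a charitable assumption rather than a disclosed price):

    # Assume (charitably) a $500K price for the clustered, racked-out config,
    # against the ~1.3 TB of filesystem actually delivered for the benchmark.
    price_dollars = 500000       # an assumption -- SPEC SFS discloses no price
    delivered_gb = 1300          # ~1.3 TB

    print("~$%.0f per delivered GB" % (price_dollars / float(delivered_gb)))  # ~$385/GB

    # Which, as noted above, is roughly 8X the per-GB price of DRAM of the day.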
With the SPEC SFS carcass dismembered and dispensed with, where does this leave Fishworks and our promise to deliver revolutionary price/performance? After considering SPEC SFS (and rejecting it, obviously), we came to believe that the storage benchmark well was so poisoned that the best way to demonstrate the performance of the product would be simple microbenchmarks that customers could run themselves -- which had the added advantage of being closer to the raw capabilities that customers wanted to talk about anyway. In this spirit, see, for example, Brendan's blog entry on the 7410's performance limits. Or, if you're more interested in latency than bandwidth, check out his screenshots of the L2ARC in action. Most importantly: don't take our word for it -- get one yourself, run it with your workload, and then use our built-in analytics to understand not just how the system runs, but also why. We have, after all, designed our systems to be run, not just to be sold...

The Hunter becomes the Hunted

I recently came into a copy of Dave Hitz's new book How to Castrate a Bull. A full review is to come, but I couldn't wait to serve up one delicious bit of irony. Among the book's many unintentionally fascinating artifacts is NetApp's original business plan, dated January 16, 1992. In that plan, NetApp's proposed differentiators are high availability, easy administration, high performance and low price -- differentiators that are eerily mirrored by Fishworks' proposed differentiators nearly fourteen years later. But the irony goes non-linear when Hitz discusses the "Competition" in that original business plan:

    Sun Microsystems is the main supplier of NFS file servers. Sun sells over 2/3 of all NFS file servers. Our initial product will be positioned to cost significantly less than Sun's lower-end server, with performance comparable to their high-end servers.

    It is unlikely that Sun will be able to produce a server that performs as well, or costs as little, for several reasons:

    - Sun's server hardware is inherently more expensive because it has lower production volumes than our components [...]
    - The culture among software engineers at Sun places little value on performance.
    - The structure of Sun -- with SunSoft doing NFS and UNIX, and SMCC [Sun Microsystems Computer Corporation] doing hardware -- makes it difficult for Sun to produce products that provide creative software-hardware system solutions.
    - Sun's distribution costs will likely remain high due to the level of technical support required to install and manage a Sun server.

While I had known that NetApp targeted Sun in its early days, I had no idea how explicit that attack had been. Now, it must be said that Hitz was right about Sun on all counts -- and that NetApp thoroughly disrupted Sun with its products, ultimately coming to dominate the NAS market itself. But it is stunning the degree to which NetApp's own business plan -- nearly verbatim -- is being used against it, not least by the very company that NetApp originally disrupted. (See Slide 5 of the Fishworks elevator pitch -- and use your imagination.) Indeed, like NetApp in the 1990s, the Sun Storage 7000 Series is not disruptive by accident, and as I elaborate on in this presentation, we are very deliberately positioning the product to best harness the economic winds blowing so strongly in its favor.

NetApp's success with their original business plan and our nascent success with Fishworks point to the most important lesson that the history of technology has to teach: economics always wins -- a product or a technology or a company ultimately cannot prop up unsustainable economics. Perhaps unlike Hitz, however, I had to learn that lesson the hard way: in the post-bubble meltdown that brought Sun within an inch of its life. But then again, perhaps Hitz has yet to have his final lesson on the subject...

On Modalities and Misadventures

Part of the design center of Fishworks is to develop powerful infrastructure that is also easy to use, with the browser as the vector for that usability. So it's been incredibly vindicating to hear some of the initial reaction to our first product, which uses words like "truly breathtaking" to describe the interface. But as much as we believe in the browser as system interface, we also recognize that it cannot be the only modality for interacting with the system: there remain too many occasions when one needs the lightweight precision of a command line interface. These occasions may be due to usability concerns (starting a browser can be an unnecessary and unwelcome burden for a simple configuration change), but they may also arise because there is no alternative: one cannot use a browser to troubleshoot a machine that cannot communicate over the network or to automate interaction with the system. For various reasons, I came to be the one working on this problem at Fishworks -- and my experiences (and misadventures) in solving it present several interesting object lessons in software engineering.

Before I get to those, a brief aside about our architecture: as will come as no surprise to anyone who has used our appliance and is familiar with browser-based technologies, our interface is AJAX-based. In developing an AJAX-based application, one needs to select a protocol for client-server communication, and for a variety of reasons, we selected XML-RPC -- a simple XML-based remote procedure call protocol. XML-RPC has ample client support (client libraries are readily available for JavaScript, Perl, Python, etc.), and it was a snap to write a C-based XML-RPC server. This allowed us to cleanly separate our server logic (the "controller" in MVC parlance) from our client (the "view"), and (more importantly) it allowed us to easily develop an automated test suite to test server-side logic.
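To give a feel for the shape of that split, here is a minimal sketch of the client side of an XML-RPC call using Python's standard xmlrpc.client module; the endpoint URL, method names and returned fields are hypothetical stand-ins, not the appliance's actual interface:

    import xmlrpc.client

    # Hypothetical endpoint and method names -- stand-ins for illustration only.
    proxy = xmlrpc.client.ServerProxy("https://myappliance/RPC2")

    # An XML-RPC call marshals native values (strings, ints, structs, arrays)
    # into an XML request, POSTs it to the server, and unmarshals the response.
    shares = proxy.shares.list()
    for share in shares:
        print(share["name"], share["space_available"])

The appeal of this arrangement is exactly that uniformity: any client that can speak HTTP and XML -- the browser, a test harness, or a shell -- can drive the same server-side logic.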
Now, (very) early in our development I had written a simple Perl script to allow us to manually test the server side -- a script that I called "aksh", the appliance kit shell. It provided a simple, captive, command-line interface with features like command history, and it gave developers (that is to say, us) the ability to manually tickle their server-side logic without having to write client-side code.

In part because I had written this primordial shell, the task of writing the command line interface fell to me. And when I first approached the problem it seemed natural to simply extend that little Perl script into something more elaborate. That is, I didn't stop to think if Perl was still the right tool for what was a much larger job. In not stopping to reconsider, I committed a classic software engineering blunder: the problem had changed (and in particular, it had grown quite a bit larger than I understood it to be), but I was still thinking in terms of my existing (outmoded) solution. As I progressed through the project -- as my shell surpassed 1,000 lines and made its way towards 10,000 -- I was learning of a painful truth about Perl that many others had discovered before me: that it is an undesigned dung heap of a language entirely inappropriate for software in-the-large.

As a coping mechanism, I began to vent my frustrations at the language with comments in the source, like this vitriolic gem around exceptions:

    eval {
            #
            # We need to install our own private __WARN__ handler that
            # calls die() to be able to catch any non-numeric exception
            # from the below coercion without inducing a tedious error
            # message as a side-effect.  And has it been said recently that
            # Perl is a trash heap of a language?  Indeed, it reminds one
            # of a reeking metropolis like Lagos or Nairobi:  having long
            # since outgrown its original design (such as there was ever
            # any design to begin with), it is now teeming but crippled --
            # sewage and vigilantes run amok.  And yet, still the masses
            # come.  But not because it is Utopia.  No, they come only
            # because this dystopia is marginally better than the only
            # alternatives that they know...
            #
            local $SIG{'__WARN__'} = sub { die(); };
            $val = substr($value, 0, length($value) - 1) + 0.0;
    };

(In an attempt to prevent roaming gangs of glue-huffing Perl-coding teenagers from staging raids on my comments section: I don't doubt that there's a better way to do what I was trying to achieve above. But I would counter that there's also a way to live like a king in Lagos or Nairobi -- that doesn't make them tourist destinations.)

Most disconcertingly, the further I got into the project, the more the language became an impediment -- exactly the wrong trajectory for an evolving software system. And so as I wrote more and more code -- and wrestled more and more with the ill-suited environment -- the feeling haunting me became unmistakable: this is the wrong path. There's no worse feeling for a software engineer: knowing that you have made what is likely the wrong decision, but feeling that you can't revisit that decision because of the time already spent down the wrong path. And so, further down the wrong path you go...

Meanwhile, as I was paying the price of my hasty decision, Eric -- always looking for a way to better test our code -- was experimenting with writing a test harness in which he embedded SpiderMonkey and emulated a DOM layer. These experiments were a complete success: Eric found that embedding SpiderMonkey into a C program was a snap, and the end result allowed us to get automated test coverage over client JavaScript code that previously had to be tested by hand.

Given both Eric's results and my increasing frustrations with Perl, an answer was becoming clear: I needed to rewrite the appliance shell as a JavaScript/C hybrid, with the bulk of the logic living in JavaScript and system interaction bits living in C. This would allow our two interface modalities (the shell and the web) to commonalize logic, and it would eradicate a complicated and burdensome language from our product. While this seemed like the right direction, I was wary of making another hasty decision. So I started down the new path by writing a library in C that could translate JavaScript objects into an XML-RPC request (and the response back into JavaScript objects). My thinking here was that if the JavaScript approach turned out to be the wrong approach for the shell, we could still use the library in Eric's new testing harness to allow a wider range of testing. As an aside, this is a software engineering technique that I have learned over the years: when faced with a decision, determine if there are elements that are common to both paths, and implement them first, thereby deferring the decision.
In my experience, making the decision after having tackled some of its elements greatly informs the decision -- and because the work done was common, no time (or less time) was lost. In this case, I had the XML-RPC client library rock-solid after about a week of work. The decision could be deferred no longer: it was time to rewrite Perl functionality in JavaScript -- time that would indeed be wasted if the JavaScript path was a dead-end. So I decided that I would give myself a week. If, after that week, it wasn't working out, at least I would know why, and I would be able to return to the squalor of Perl with fewer doubts. As it turns out, after that week, it was clear that the JavaScript/C hybrid approach was the much better approach -- Perl's death warrant had been signed.

And here we come to another lesson that I learned that merits an aside: in the development of DTrace, one regret was that we did not start the development of our test suite earlier. We didn't make that mistake with Fishworks: the first test was created just minutes after the first line of code. With my need to now rewrite the shell, this approach paid an unexpected dividend: because I had written many tests for the old, Perl-based shell, I had a ready-made test suite for my new work. Therefore, contrary to the impressions of some about test-driven development, the presence of tests actually accelerated the development of the new shell tremendously. And of course, once I integrated the new shell, I could say with confidence that it did not contain regressions over the old shell. (Indeed, the only user visible change was that it was faster. Much, much, much faster.)

While it was frustrating to think of the lost time (it ultimately took me six weeks to get the new JavaScript-based shell back to where the old Perl-based shell had been), it was a great relief to know that we had put the right architecture in place. And as often happens when the right software architecture is in place, the further I went down the path of the JavaScript/C hybrid, the more often I had the experience of new, interesting functionality simply falling out. In particular, it became clear that I could easily add a second JavaScript instance to the shell to allow for a scripting environment. This allows users to build full, programmatic flow control into their automation infrastructure without ever having to "screen scrape" output. For example, here's a script to display the used and available space in each share on the appliance:

    script
    run('shares');
    projects = list();

    printf('%-40s %-10s %-10s\n', 'SHARE', 'USED', 'AVAILABLE');

    for (i = 0; i < projects.length; i++) {
            run('select ' + projects[i]);
            shares = list();

            for (j = 0; j < shares.length; j++) {
                    run('select ' + shares[j]);
                    share = projects[i] + '/' + shares[j];
                    used = run('get space_data').split(/\s+/)[3];
                    avail = run('get space_available').split(/\s+/)[3];
                    printf('%-40s %-10s %-10s\n', share, used, avail);
                    run('cd ..');
            }

            run('cd ..');
    }

If you saved the above to a file named "space.aksh", you could run it this way:

    % ssh root@myappliance < space.aksh
    Password:
    SHARE                                    USED       AVAILABLE
    admin/accounts                           18K        248G
    admin/exports                            18K        248G
    admin/primary                            18K        248G
    admin/traffic                            18K        248G
    admin/workflow                           18K        248G
    aleventhal/hw_eng                        18K        248G
    bcantrill/analytx                        1.00G      248G
    bgregg/dashbd                            18K        248G
    bgregg/filesys01                         25.5K      100G
    bpijewski/access_ctrl                    18K        248G
    ...

(You can also upload SSH keys to the appliance if you do not wish to be prompted for the password.)
As always, don't take our word for it -- download the appliance and check it out yourself! And if you have the appliance (virtual or otherwise), click on "HELP" and then type "scripting" into the search box to get full documentation on the appliance scripting environment!


Fishworks

Fishworks: Now it can be told

In October 2005, longtime partner-in-crime Mike Shapiro and I were taking stock. Along with Adam Leventhal, we had just finished DTrace -- and Mike had finished up another substantial body of work in FMA -- and we were beginning to wonder about what was next. As we looked at Solaris 10, we saw an incredible building block -- the best, we felt, ever made, with revolutionary technologies like ZFS, DTrace, FMA, SMF and so on. But we also saw something lacking: despite being a great foundation, the truth was that the technology wasn't being used in many essential tasks in information infrastructure, from routing packets to storing blocks to making files available over the network. This last one especially grated: despite having invented network attached storage with NFS in 1983, and despite having the necessary components to efficiently serve files built into the system, and despite having exciting hardware like Thumper and despite having absolutely killer technologies like ZFS and DTrace, Sun had no share -- none -- of the NAS market. As we reflected on why this was so -- why, despite having so many of the necessary parts, Sun had not been able to put together a compelling integrated product -- we realized that part of the problem was organizational: if we wanted to go solve this problem, it was clear that we could not do it from the confines of a software organization.

With this in mind, we requested a meeting with Greg Papadopoulos, Sun's CTO, to brainstorm. Greg quickly agreed to a meeting, and Mike and I went to his office to chat. We described the problem that we wanted to solve: integrate Sun's industry-leading components together and build on them to develop a killer NAS box -- one with differentiators only made possible by our technology. Greg listened intently as we made our pitch, and then something unexpected happened -- something that tells you a lot about Sun: Greg rose from his chair and exclaimed, "let's do it!" Mike and I were caught a bit flat-footed; we had expected a much safer, more traditional answer -- like "let's commission a task force!" or something -- and instead here was Greg jumping out in front: "Get me a presentation that gives some of the detail of what you want to do, and I'll talk to Jonathan and Scott about it!"

Back in the hallway, Mike and I looked at each other, still somewhat in disbelief that Greg had been not just receptive, but so explicitly encouraging. Mike said to me exactly what I was thinking: "Well, I guess we're doing this!"

With that, Mike and I pulled into a nearby conference room, and we sat down with a new focus. This was neither academic exercise nor idle chatter over drinks -- we now needed to think about what specifically separated our building blocks from a NAS appliance. With that, we started writing missing technologies on the whiteboard, which soon became crowded with things like browser-based management, clustering, e-mail alerts, reports, integrated fault management, seamless upgrades and rollbacks, and so on. When the whiteboard was full and we took a look at all of it, the light went on: virtually none of this stuff was specific to NAS. At that instant, we realized that the NAS problem was but one example of a larger problem, and that the infrastructure to build fully-integrated, special-purpose systems was itself general-purpose across those special purposes!

We had a clear picture of what we wanted to go do.
We put our thoughts into a presentation that we entitled "A Problem, An Opportunity & An Idea" (of which I have made available a redacted version) and sent that to Greg. A week or so later, we had a con-call with Greg, in which he gave us the news from Scott and Jonathan: they bought it. It was time to put together a business plan, build our team and get going.

Now Mike and I went into overdrive. First, we needed a name. I don't know how long he had been thinking about it, or how it struck him, but Mike said that he was thinking of the name "Fishworks", it not only being a distinct name that paid homage to a storied engineering tradition (and with an oblique Simpsons reference to boot), but one that also embedded an apt acronym: "FISH", Mike explained, stood for "fully-integrated software and hardware" -- which is exactly what we wanted to go build. I agreed that it captured us perfectly -- and Fishworks was born.

We built our team -- including Adam, Eric and Keith -- and on February 15, 2006, we got to work. Over the next two and a half years, we went through many changes: our team grew to include Brendan, Greg, Cindi, Bill, Dave and Todd; our technological vision expanded as we saw the exciting potential of the flash revolution; and our product scope was broadened through hundreds of conversations with potential customers. But through these changes our fundamental vision remained intact: that we would build a general purpose appliance kit -- and that we would use it to build a world-beating NAS appliance. Today, at long last, the first harvest from this long labor is available: the Sun Storage 7110, Sun Storage 7210 and Sun Storage 7410.

It is deeply satisfying to see these products come to market, especially because the differentiators that we so boldly predicted to Sun's executives so long ago have not only come to fruition, they are also delivering on our promise to set the product apart in the marketplace. Of these, I am especially proud of our DTrace-based appliance analytics. With analytics, we sought to harness the great power of DTrace: its ability to answer ad hoc questions that are phrased in terms of the system's abstractions instead of its implementation. We saw an acute need for this in network storage, where even market-leading products cannot answer the most basic of questions: "what am I serving and to whom?" The key, of course, was to capture the strength of DTrace visually -- and the trick was to give up enough of the arbitrary latitude of DTrace to allow for strictly visual interactions without giving up so much as to unnecessarily limit the power of the facility.

I believe that the result -- which you can sample in this screenshot -- does more than simply strike the balance: we have come up with ways to visualize and interact with data that actually function as a force multiplier for the underlying instrumentation technology. So not only does analytics bring the power of DTrace to a much broader spectrum of technologists, it also -- thanks to the wonders of the visual cortex -- has much greater utility than just DTrace alone. (Or, as one hardened veteran of command line interfaces put it to me, "this is the one GUI that I actually want to use!")

There is much to describe about analytics, and for those interested in a reasonably detailed guided tour of the facility, check out this presentation on analytics that I will be giving later this week at Sun's Customer Engineering Conference in Las Vegas.
While the screenshots in that presentation are illustrative, the power of analytics (like DTrace before it) is in actually seeing it for yourself, in real-time. You can get a flavor for that in this video, in which Mike and I demonstrate and discuss analytics. (That video is part of a larger Inside Fishworks series that takes you through many elements of our team and the product.) While the video is great, it still can't compare to seeing analytics in your own environment -- and for that, you should contact your Sun rep or Sun reseller and arrange to test drive an appliance yourself. Or if you're the impatient show-me-now kind, download this VMware image that contains a full, working Sun Storage 7000 appliance, with 16 (virtual) disks. Configure the virtual appliance, add a few shares, access them via CIFS, WebDAV, NFS, whatever, and bust out some analytics!


Solaris

Concurrency's Shysters

For as long as I've been in computing, the subject of concurrency has always induced a kind of thinking man's hysteria. When I was coming up, the name of the apocalypse was symmetric multiprocessing -- and its arrival was to be the Day of Reckoning for software. There seemed to be no end of doomsayers, even among those who putatively had the best understanding of concurrency. (Of note was a famous software engineer who -- despite substantial experience in SMP systems at several different computer companies -- confidently asserted to me in 1995 that it was "simply impossible" for an SMP kernel to "ever" scale beyond 8 CPUs. Needless to say, several of his past employers have since proved him wrong...)

There also seemed to be no end of concurrency hucksters and shysters, each eager to peddle their own quack cure for the miasma. Of these, the one that stuck in my craw was the two-level scheduling model, whereby many user-level threads are multiplexed on fewer kernel-level (schedulable) entities. (To paraphrase what has been oft said of little-known computer architectures, you haven't heard of it for a reason.) The rationale for the model -- that it allowed for cheaper synchronization and lightweight thread creation -- seemed to me at the time to be long on assertions and short on data. So working with my undergraduate advisor, I developed a project to explore this model both quantitatively and dynamically, work that I undertook in the first half of my senior year. And early on in that work, it became clear that -- in part due to intractable attributes of the model -- the two-level thread scheduling model was delivering deeply suboptimal performance...

Several months after starting the investigation, I came to interview for a job at Sun with Jeff, and he (naturally) asked me to describe my undergraduate work. I wanted to be careful here: Sun was the major proponent of the two-level model, and while I felt that I had the hard data to assert that the model was essentially garbage, I also didn't want to make a potential employer unnecessarily upset. So I stepped gingerly: "As you may know," I began, "the two-level threading model is very... intricate." "Intricate?!" Jeff exclaimed, "I'd say it's completely busted!" (That moment may have been the moment that I decided to come work with Jeff and for Sun: the fact that an engineer could speak so honestly spoke volumes for both the engineer and the company. And despite Sun's faults, this engineering integrity remains at Sun's core to this day -- and remains a draw to so many of us who have stayed here through the ups and downs.) With that, the dam had burst: Jeff and I proceeded to gush about how flawed we each thought the model to be -- and how dogmatic its presentation. So paradoxically, I ended up getting a job at Sun in part by telling them that their technology was unsound!

Back at school, I completed my thesis. Like much undergraduate work, it's terribly crude in retrospect -- but I stand behind its fundamental conclusion that the unintended consequences of the two-level scheduling model make it essentially impossible to achieve optimal performance. Upon arriving at Sun, I developed an early proof-of-concept of the (much simpler) single-level model. Roger Faulkner did the significant work of productizing this as an alternative threading model in Solaris 8 -- and he eliminated the two-level scheduling model entirely in Solaris 9, thus ending the ill-begotten experiment of the two-level scheduling model somewhere shy of its tenth birthday.
(Roger gave me the honor of approving his request to integrate this work, an honor that I accepted with gusto.)

So why this meandering walk through a regrettable misadventure in the history of software systems? Because over a decade later, concurrency is still being used to instill panic in the uninformed. This time, it is chip-level multiprocessing (CMP) instead of SMP that promises to be the End of Days -- and the shysters have taken a new guise in the form of transactional memory. The proponents of this new magic tonic are in some ways darker than their forebears: it is no longer enough to warn of Judgement Day -- they must also conjure up notions of Original Sin to motivate their perverted salvation. "The heart of the problem is, perhaps, that no one really knows how to organize and maintain large systems that rely on locking" admonished Nir Shavit recently in CACM. (Which gives rise to the natural follow-up question: is the Solaris kernel not large, does it not rely on locking or do we not know how to organize and maintain it? Or is it that we do not exist at all?) Shavit continues: "Locks are not modular and do not compose, and the association between locks and data is established mostly by convention." Again, no data, no qualifiers, no study, no rationale, no evidence of experience trying to develop such systems -- just a naked assertion used as a prop for a complicated and dubious solution. Are there elements of truth in Shavit's claims? Of course: one can write sloppy, lock-based programs that become a galactic, unmaintainable mess. But does it mean that such monstrosities are inevitable? No, of course not.

So fine, the problem statement is (deeply) flawed. Does that mean that the solution is invalid? Not necessarily -- but experience has taught me to be wary of crooked problem statements. And in this case (perhaps not surprisingly) I take umbrage with the solution as well. Even if one assumes that writing a transaction is conceptually easier than acquiring a lock, and even if one further assumes that transaction-based pathologies like livelock are easier on the brain than lock-based pathologies like deadlock, there remains a fatal flaw with transactional memory: much system software can never be in a transaction because it does not merely operate on memory. That is, system software frequently takes action outside of its own memory, requesting services from software or hardware operating on a disjoint memory (the operating system kernel, an I/O device, a hypervisor, firmware, another process -- or any of these on a remote machine). In much system software, the in-memory state that corresponds to these services is protected by a lock -- and the manipulation of such state will never be representable in a transaction. So for me at least, transactional memory is an unacceptable solution to a non-problem.

As it turns out, I am not alone in my skepticism. When we on the Editorial Advisory Board of ACM Queue sought to put together an issue on concurrency, the consensus was twofold: to find someone who could provide what we felt was much-needed dissent on TM (and in particular on its most egregious outgrowth, software transactional memory), and to have someone speak from experience on the rise of CMP and what it would mean for practitioners.

For this first article, we were lucky enough to find Calin Cascaval and colleagues, who ended up writing a must-read article on STM in November's CACM. Their conclusions are unavoidable: STM is a dog. (Or as Cascaval et al.
more delicately put it: "Based on our results, we believe that the road for STM is quite challenging.") Their work is quantitative and analytical and (best of all, in my opinion) the authors never lose sight of the problem that transactional memory was meant to solve: to make parallel programming easier. This is important, because while many of the leaks in the TM vessel can ultimately be patched, the patches themselves add layer upon layer of complexity. Cascaval et al. conclude:

    And because the argument for TM hinges upon its simplicity and
    productivity benefits, we are deeply skeptical of any proposed
    solutions to performance problems that require extra work by the
    programmer.

And while their language is tighter (and the subject of their work a weightier and more active research topic), the conclusions of Cascaval et al. are eerily similar to my final verdict on the two-level scheduling model, over a decade ago:

    The dominating trait of the [two-level] scheduling model is its
    complexity. Unfortunately, virtually all of its complexity is
    exported to the programmer. The net result is that programmers must
    have a complete understanding of the model and the inner workings of
    its implementation in order to be able to successfully tap its
    strengths and avoid its pitfalls.

So TM advocates: if Roger Faulkner knocks on your software's door bearing a scythe, you would be well-advised to not let him in...

For the second article, we brainstormed potential authors -- but as we dug up nothing but dry holes, I found myself coming to an inescapable conclusion: Jeff and I should write this, if nothing else as a professional service to prevent the latest concurrency hysteria from reaching epidemic proportions. The resulting article appears in full in the September issue of Queue, and substantially excerpted in the November issue of CACM. Writing the article was a gratifying experience, and gave us the opportunity to write down much of what we learned the hard way in the 1990s. In particular, it was cathartic to explore the history of concurrency. Having been at concurrency's epicenter for nearly a decade, I felt that the rise of CMP has been recently misrepresented as a failure of hardware creativity -- and it was vindicating to read CMP's true origins in the original DEC Piranha paper: that given concurrent databases and operating systems, implementing multiple cores on the die was simply the best way to deliver OLTP performance. That is, it was the success of concurrent software -- and not the failure of imagination on the part of hardware designers -- that gave rise to the early CMP implementations. Hopefully practitioners will enjoy reading the article as much as we enjoyed writing it -- and here's hoping that we live to see a day when concurrency doesn't attract so many schemers and dreamers!
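To make the earlier point about non-memory state concrete, here is a minimal C sketch -- mine, not anything from the articles above -- of the pattern that a memory transaction cannot express: lock-protected, in-memory state whose critical section includes an action outside of memory (here, a write to a device) that no transaction could roll back.

    #include <sys/types.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <unistd.h>

    typedef struct logdev {
            pthread_mutex_t ld_lock;   /* protects ld_fd and ld_offset */
            int             ld_fd;     /* descriptor for the underlying device */
            off_t           ld_offset; /* in-memory state tied to on-device state */
    } logdev_t;

    int
    logdev_append(logdev_t *ld, const void *buf, size_t len)
    {
            ssize_t written;

            (void) pthread_mutex_lock(&ld->ld_lock);

            /*
             * This write has effects beyond our own memory; nothing can
             * undo it, so the in-memory offset that must stay consistent
             * with it is protected by a lock rather than a transaction.
             */
            written = pwrite(ld->ld_fd, buf, len, ld->ld_offset);

            if (written > 0)
                    ld->ld_offset += written;

            (void) pthread_mutex_unlock(&ld->ld_lock);

            return (written == (ssize_t)len ? 0 : -1);
    }

    int
    main(void)
    {
            logdev_t ld;

            ld.ld_fd = open("/dev/null", O_WRONLY);  /* stand-in for a real device */
            ld.ld_offset = 0;
            (void) pthread_mutex_init(&ld.ld_lock, NULL);

            return (logdev_append(&ld, "hello", 5) == 0 ? 0 : 1);
    }

The offset and the device contents must move together; a transaction could roll back the former but never the latter, which is exactly why this kind of code will always be written with locks.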


Solaris

Happy 5th Birthday, DTrace!

It's hard to believe, but DTrace is five years old today: it was on September 3, 2003 that DTrace integrated into Solaris. DTrace was a project that extended all three of us to our absolute limit as software engineers -- and the 24 hours before integration was then (and remains now) the most harrowing of my career. As it will hopefully remain my most stressful experience as an engineer, the story of that final day merits a retelling...

Our project had been running for nearly two years, but it was not until mid-morning on September 2nd -- the day before we were slated to integrate -- that it was discovered that the DTrace prototype failed to boot on some very old hardware (the UltraSPARC-I, the oldest hardware still supported at that time). Now, "failed to boot" can mean a bunch of different things, but this was about as awful as it gets: a hard hang after the banner message. That is, booting mysteriously stopped making progress soon after control transferred to the kernel -- and one could not break in with the kernel debugger. This is an awful failure mode because with no debugger and no fatal error, one has no place to start other than to start adding print statements -- or start ripping out the code that is the difference between the working system and the busted one. This was a terrifying position to be in less than 24 hours before integration! Strangely, it was only the non-DEBUG variant that failed to boot: the DEBUG version laden with assertions worked fine. Our only lucky break was that we were able to find two machines that exhibited the problem, enabling us to bifurcate our efforts: I started ripping out DTrace-specific code in one workspace, while Mike started frenetically adding print statements in another...

Meanwhile, while we were scrambling to save our project, Eric was having his first day at Sun. My office door was closed, and with our integration pending and me making frequent (and rapid) trips back and forth to the lab, the message to my coworkers was clear: stay the hell back. Eric was blissfully unaware of these implicit signals, however, and he cheerfully poked his head in my office to say hello (Eric had worked the previous summer in our group as an intern). I can't remember exactly what I said to Eric when he opened my office door, but suffice it to say that the implicit signals were replaced with a very explicit one -- and I remain grateful to this day that Eric didn't quit on the spot...

Back on our problem, Mike -- through the process of elimination -- had made the key breakthrough: it wasn't actually an instruction set architecture (ISA) issue, but rather it seemed to be a host bus adapter (HBA) issue. This was an incredibly important discovery: while we had a bevy of architectural changes that could conceivably be invalid on an ancient CPU, we had no such HBA-specific changes -- this was more likely to be something marring the surface of our work rather than cracking its foundation. Mike further observed that running a DEBUG variant of these ancient HBA drivers (esp and fas) would boot on an otherwise non-DEBUG kernel. At that, I remembered that we actually did have some cosmetic changes to these drivers, and on carefully reviewing the diffs, we found a deadly problem: in folding some old tracing code under a DEBUG-only #define, a critical line (the one that actually initiates the I/O) became compiled in only when DEBUG was defined.
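By way of illustration only -- this is a schematic sketch, in no way the actual esp/fas source -- here is the class of bug we had introduced: old tracing code folded under a DEBUG-only #define, with the line that actually initiates the I/O accidentally folded under it as well.

    #include <stdio.h>

    struct request {
            int r_id;
    };

    static void
    log_request(struct request *req)
    {
            (void) printf("starting I/O for request %d\n", req->r_id);
    }

    static void
    start_transfer(struct request *req)
    {
            /* In a real driver, this is the line that initiates the I/O. */
            (void) printf("transfer started for request %d\n", req->r_id);
    }

    static void
    issue_io(struct request *req)
    {
    #ifdef DEBUG
            log_request(req);       /* old tracing code: correctly DEBUG-only */
            start_transfer(req);    /* oops: the I/O itself is now DEBUG-only too */
    #endif
    }

    int
    main(void)
    {
            struct request r = { 1 };

            issue_io(&r);   /* on a non-DEBUG build, this quietly does nothing */
            return (0);
    }

Built with -DDEBUG, everything works; built without it, issue_io() silently compiles away to nothing -- a failure that only shows up where non-DEBUG bits actually run.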
We hadn't seen this until now because these drivers were only used on ancient machines -- machines on which we had never tested non-DEBUG. We fixed the problem, and all of our machines booted DEBUG and non-DEBUG -- and we felt like we were breathing again for the first time in the more than six hours that we had been working on the problem. (Here is the mail that I sent out explaining the problem.)

To celebrate DTrace's birthday beyond just recounting the terror of its integration, I wanted to make a couple of documents public that we have not previously shared:

- The primordial presentation on DTrace, developed in late 1999. Some of the core ideas are present here (in particular, production instrumentation and zero disabled probe effect), but we hadn't yet figured out some very basic notions -- like that we needed our own language.

- Our first real internal presentation on DTrace, presented March 12, 2002 as a Kernel Technical Discussion. Here the thinking is much better fleshed out around kernel-level instrumentation -- and a prototype existed and was demonstrated. But a key direction for the technology -- the ability to instrument user-level generally and in semantically relevant ways in particular -- was still to come when Adam joined the team shortly after this presentation. (A video of this presentation also exists; in the unlikely event that anyone wants to actually relive three hours of largely outmoded thinking, I'll find a way to make it available.)

- The e-mail we sent out after integration, September 3, 2003 -- five years ago today.

We said it then, and it's even truer today: it's been quite a ride. Happy 5th Birthday, DTrace -- and thanks again to everyone in the DTrace community for making it what it has become!


Solaris

Revisiting the Intel 432

As I have discussed before, I strongly believe that to understand systems, you must understand their pathologies -- systems are most instructive when they fail. Unfortunately, we in computing systems do not have a strong history of studying pathology: despite the fact that failure in our domain can be every bit as expensive (if not more so) than in traditional engineering domains, our failures do not (usually) involve loss of life or physical property and there is thus little public demand for us to study them -- and a tremendous industrial bias for us to forget them as much and as quickly as possible. The result is that our many failures go largely unstudied -- and the rich veins of wisdom that these failures generate live on only in oral tradition passed down by the perps (occasionally) and the victims (more often).

A counterexample to this -- and one of my favorite systems papers of all time -- is Robert Colwell's brilliant Performance Effects of Architectural Complexity in the Intel 432. This paper, which dissects the abysmal performance of Intel's infamous 432, practically drips with wisdom, and is just as relevant today as it was when the paper was originally published nearly twenty years ago. For those who have never heard of the Intel 432, it was a microprocessor conceived of in the mid-1970s to be the dawn of a new era in computing, incorporating many of the latest notions of the day. But despite its lofty ambitions, the 432 was an unmitigated disaster both from an engineering perspective (the performance was absolutely atrocious) and from a commercial perspective (it did not sell -- a fact presumably not unrelated to its terrible performance). To add insult to injury, the 432 became a sort of punching bag for researchers, becoming, as Colwell described, "the favorite target for whatever point a researcher wanted to make."

But as Colwell et al. reveal, the truth behind the 432 is a little more complicated than trendy ideas gone awry; the microprocessor suffered from not only untested ideas, but also terrible execution. For example, one of the core ideas of the 432 is that it was a capability-based system, implemented with a rich hardware-based object model. This model had many ramifications for the hardware, but it also introduced a dangerous dependency on software: the hardware was implicitly dependent on system software (namely, the compiler) for efficient management of protected object contexts ("environments" in 432 parlance). As it happened, the needed compiler work was not done, and the Ada compiler as delivered was pessimal: every function was implemented in its own environment, meaning that every function was in its own context, and that every function call was therefore a context switch! As Colwell explains, this software failing was the greatest single inhibitor to performance, costing some 25-35 percent on the benchmarks that he examined.

If the story ended there, the tale of the 432 would be plenty instructive -- but the story takes another series of interesting twists: because the object model consumed a bunch of chip real estate (and presumably a proportional amount of brain power and department budget), other (more traditional) microprocessor features were either pruned or eliminated. The mortally wounded features included a data cache (!), an instruction cache (!!) and registers (!!!). Yes, you read correctly: this machine had no data cache, no instruction cache and no registers -- it was exclusively memory-memory.
And if that weren't enough to assure awful performance: despite having 200 instructions (and about a zillion addressing modes), the 432 had no notion of immediate values other than 0 or 1. Stunningly, Intel designers believed that 0 and 1 "would cover nearly all the need for constants", a conclusion that Colwell (generously) describes as "almost certainly in error." The upshot of these decisions is that you have more code (because you have no immediates) accessing more memory (because you have no registers) that is dog-slow (because you have no data cache) that itself is not cached (because you have no instruction cache). Yee haw!

Colwell's work builds to a crescendo as it methodically takes apart each of these architectural issues -- and then attempts to model what the microprocessor would look like were it properly implemented. The conclusion he comes to is that the object model -- long thought to be the 432's singular flaw -- was only one part of a more complicated picture, and that its performance was "dominated, in large part, by artifacts and not by concepts." If there's one imperfection with Colwell's work, it's that he doesn't realize how convincingly he's made the case that these artifacts were induced by a rigid and foolish adherence to the concepts.

So what is the relevance of Colwell's paper now, 20 years later? One of the principal problems that Colwell describes is the disconnect between innovation at the hardware and software levels. This disconnect continues to be a theme, and can be seen in current controversies in networking (TOE or no?), in virtualization (just how much microprocessor support do we want/need -- and at what price?), and (most clearly, in my opinion) in hardware transactional memory. Indeed, like an apparition from beyond the grave, the Intel 432 story should serve as a chilling warning to those working on transactional memory today: like the 432 object model, hardware transactional memory requires both novel microprocessor architecture and significant new system software. And like the 432 object model, hardware transactional memory has been touted more for its putative programmer productivity than for its potential performance gains. This is not to say that hardware transactional memory is not an appropriate direction for a microprocessor, just that its advocates should not so stubbornly adhere to their novelty that they lose sight of the larger system. To me, that is the lesson of the Intel 432 -- and thanks to Colwell's work, that lesson is available to all who wish to learn it.


Solaris

DTrace on Linux

The interest in DTrace on Linux is heating up again -- this time in an inferno on the Linux 2008 Kernel Summit discussion list. Under discussion is SystemTap, the Linux-born DTrace-knockoff, with people like Ted Ts'o explaining why they find SystemTap generally unusable ("Do you really expect system administrators to use this tool?") and in stark contrast to DTrace ("it just works").

While the comparison is clearly flattering, I find it a bit disappointing that no one in the discussion seems to realize that DTrace "just works" not merely by implementation, but also by design. Over and over again, we made architectural and technical design decisions that would yield an instrumentation framework that would be not just safe, powerful and flexible, but also usable. The subtle bit here is that many of those decisions were not at the surface of the system (where the discussion on the Linux list seems to be currently mired), but in its guts. To phrase it more concretely, innovations like CTF, DOF and provider-specified stability may seem like mind-numbing, arcane implementation detail (and okay, they probably are that too), but they are the foundation upon which the usability of DTrace is built. If you don't solve the problems that they solve, you won't have a system anywhere near as usable as DTrace.

So does SystemTap appreciate either the importance of these problems or the scope of their solutions? Almost certainly not -- for if they did, they would come to the same conclusion that technologists at Apple, QNX, and the FreeBSD project have come to: the only way to have a system at parity with DTrace is to port DTrace.

Fortunately for Linux users, there are some in the community who have made this realization. In particular, Paul Fox has a nascent port of DTrace to Linux. Paul still has a long way to go (and I'm sure he could use whatever help Linux folks are willing to offer), but it's impossible to believe that Paul isn't on a shorter and more realistic path than SystemTap to achieving safe, powerful, flexible -- and usable! -- dynamic Linux instrumentation. Good luck to you Paul; we continue to be available to help where we can -- and may the Linux community realize the value of your work sooner rather than later!


Solaris

A Tribute to Jim Gray

Like several hundred others, I spent today at Berkeley at a tribute to honor Jim Gray. While many there today had collaborated with Jim in some capacity, my own connection to him is tertiary at best: I currently serve on the Editorial Advisory Board of ACM Queue, a capacity in which Jim also served up until his disappearance. And while I never met Jim, his presence is still very much felt on the Queue Board -- and I went today as much to learn about as to honor the man and computer scientist who still looms so large. I came away inspired not only by what Jim had done and how he had done it, but also by the many in whom Jim had seen and developed a kind of greatness. It's rare for a gifted mind to be emotionally engaged with others (they are often "anti-human" in the words of one presenter), but practically unique for an intellect of Jim's caliber to have his ability for intense, collaborative engagement with so many others. Indeed, I found myself wondering if Jim didn't belong in the pantheon of Erdos and Feynman -- two other larger-than-life figures that shared Jim's zeal for problems and affection for those solving them.

On a slightly more concrete level, I was impressed that not only was Jim a big, adventurous thinker, but that he was one who remained moored to reality. This delicate balance pervades his life's work: The System R folks talked about how in a year he not only wrote 9 papers, but cut 10,000 lines of code. The Wisconsin database folks talked about how he who had pioneered so much in transaction processing also pioneered the database benchmarks, developing the precursor for what would become the Transaction Processing Council in a classic Datamation paper. The Tandem folks talked about how he reached beyond research and development to engage with the field and with customers to figure out (and publish!) why systems actually failed -- which was quite a reach for a company that famously dubbed every product "NonStop". And the Microsoft folks talked about his driving vision to put a large database on the web that people would actually use, leading him to develop TerraServer, and to inspire and assist subsequent systems like the Sloan Digital Sky Survey and Microsoft Research's WorldWide Telescope (which gave an incredible demo, by the way) -- systems that solved hard, abstract problems but that also delivered appreciable concrete results.

All along, Jim remained committed to big, future-looking ideas, while still insisting on designing and implementing actual systems in the present. We in computer science do not strike that balance often enough -- we are too often either detached from reality, or drowning in it -- so Jim's ability to not only strike that balance but to inspire it in others was a tremendous asset to our discipline. Jim, you are sorely missed -- even (and perhaps especially) by those of us who never met you...


Solaris

dtrace.conf(08)

dtrace.conf(08) was this past Friday, and (no surprise, given the attendees), it ended up being an incredible (un)conference. DTrace is able to cut across vertical boundaries in the software stack (taking you from say, PHP through to the kernel), and it is not surprising that this was also reflected in our conference, which ran from the darkness of new depths (namely, Andrew Gardner from the University of Arizona on using DTrace on dynamically reconfigurable FPGAs) through the more familiar deep of virtual machines (VMware, Xen and Zones) and operating system kernels (Solaris, OS X, BSD, QNX), into the rolling hills of databases (PostgreSQL) and craggy peaks of the Java virtual machine, and finally to the hypoxic heights of environments like Erlang and AIR. And as it is the nature of DTrace to reflect not the flowery theory of a system but rather its brutish reality, it was not at all surprising (but refreshing nonetheless) that nearly every presentation had a demo to go with it. Of these, it was particularly interesting that several were on actual production machines -- and that several others were on the presenter's Mac laptop. (I know I have been over the business case of the Leopard port of DTrace before, but these demos served to once again remind us of the new vistas that have been opened up by having DTrace on such a popular development platform -- and how the open sourcing of DTrace has indisputably been in Sun's best business interest.)

When we opened in the morning, I claimed that the room had as much software talent as has ever been assembled in 70 people; after seeing the day's presentations and the many discussions that they spawned, I would double down on that claim in an instant. And while the raw pulling power of the room was awe-inspiring, the highlight for me was looking around as everyone was eating dinner: despite many companies having sent more than one participant, no dinner conversation seemed to have two people from the same company -- or even from the same layer of the stack. It almost brought a tear to the eye to see such interaction among such superlative software talent from such disjoint domains, and my dinner partner captured the zeitgeist precisely: "it's like a commune in here." Anyway, I had a hell of a time, and judging by their blog entries, Keith, Theo, Jarod and Stephen did too. So thanks everyone for coming, and a huge thank you to our sponsor Forsythe -- and here's looking forward to dtrace.conf(09)!


Solaris

Boom/bust cycles

Growing up, I was sensitive to the boom/bust cycles endemic in particular industries: my grandfather was a petroleum engineer, and he saw the largest project of his career cancelled during the oil bust in the mid-1980s.[1] And growing up in Colorado, I saw not only the destruction wrought by the oil bust (1987 was very bleak in Denver), but also the long boom/bust history in the mining industries -- unavoidable in towns like Leadville. (Indeed, it was only after going to the East Coast for university that I came to appreciate that school children in the rest of the country don't learn about the Silver Panic of 1893 in such graphic detail.)

So when, in the early 1990s, I realized that my life's calling was in software, I felt a sense of relief: here was an industry that was at last not shackled to the fickle Earth -- an industry that was surely "bust-proof" at some level. Ha! As I learned (painfully, like most everyone else) in the Dot-Com boom and subsequent bust, booms and busts are -- if anything -- even more endemic in our industry. Companies grow from nothing to spectacular heights in virtually no time at all -- and can crash back into nothingness just as quickly. I bring up all of this, because if you haven't seen it, this video is absolutely brilliant, capturing these endemic cycles perfectly (and hilariously), and imparting a surprising amount of wisdom besides.

I'll still take our boom and bust cycles over those in other industries: there are, after all, still not quite yet software "ghost towns" -- however close the old Excite@Home building on the 101 was getting before it was finally leased!

[1] If anyone is looking for an unspeakably large quantity of technical documentation on the (cancelled) ARAMCO al-Qasim refinery project, I have it waiting for your perusal!


Solaris

On Dreaming in Code

As I noted previously, I recently gave a Tech Talk at Google on DTrace. When I gave the talk, I was in the middle of reading Scott Rosenberg's Dreaming in Code, and (for whatever poorly thought-out reason) I elected to use my dissatisfaction with the book as an entree into DTrace. I think that my thinking here was to use what I view to be Rosenberg's limited understanding of software as a segue into the more general difficulty of understanding running software -- with that of course being the problem that DTrace was designed to solve.

However tortured that path of reasoning may sound now, it was much worse in the actual presentation -- and in terms of Dreaming in Code, my introduction comes across as a kind of gangland slaying of Rosenberg's work. When I saw the video, I just cringed, noted with relief that at least the butchery was finished by five minutes in, and hoped that viewers would remember the meat of the presentation rather than the sloppy hors d'oeuvres.

I was stupidly naive, of course -- for as soon as so much as one blogging observer noted the connection between my presentation and the book, Google-alerted egosurfing would no doubt take over and I'd be completely busted. And that is more or less exactly what happened, with Rosenberg himself calling me out on my introduction, complaining (rightly) that I had slagged his book without providing much substance. Rosenberg was particularly perplexed because he felt that he and I are concluding essentially the same thing. But this is not quite the case: the conclusion (if not mantra) of his book is that "software is hard", while the point I was making in the Google talk is not so much that developing software is hard, but rather that software itself is wholly different -- that it is (uniquely) a confluence between information and machine. (And, in particular, that software poses unique observability challenges.)

This sounds like a really pedantic difference -- especially given that Rosenberg (in both book and blog entry) does make some attempt to show the uniqueness of software -- but the difference is significant: instead of beginning with software's information/machine duality, and from there exploring the difficulty of developing it, Rosenberg's work has at its core the difficulty itself. That is, Rosenberg uses the difficulty of developing software as the lens to understand all of software. And amazingly enough, if you insist on looking at the world through a bucket of shit, the world starts to look an awful lot like a bucket of shit: by the end of the book, Rosenberg has left the lay reader with the sense that we are careering towards some sort of software heat-death, after which meetings-about-meetings and stickies-on-whiteboards will prevent all future progress.

It's not clear if this is the course that Rosenberg intended to set at the outset; in his Author's Note, he gives us his initial bearings:

    Why is good software so hard to make? Since no one seems to have a
    definitive answer even now, at the start of the twenty-first century,
    fifty years deep into the computer era, I offer by way of exploration
    the tale of the making of one piece of software.

The problem is that the "tale" that he tells is that of the OSAF's Chandler, a project plagued by metastasized confusion.
In fact, the project is such a wreck that it gives rise to a natural question: did Rosenberg pick a doomed project because he was convinced at the outset that developing software was impossible, and he wanted to be sure to write about a project that wouldn't hang the jury? Or did his views of the impossibility of developing software come about as a result of his being trapped on such a reeking vessel? On the one hand, it seems unimaginable that Rosenberg would deliberately book his passage on the Mobro 4000, but on the other, it's hard to imagine how a careful search would have yielded a better candidate to show just how bad software development can get: PC-era zillionaire with too many ideas funds on-the-beach programmers to change the world with software that has no revenue model. Yikes -- call me when you ship something...

By the middle of the book it is clear that the Chandler garbage barge is adrift and listing, and Rosenberg, who badly wants the fate of Chandler to be a natural consequence of software and not merely a tale of bad ideas and poor execution, begins to mark time by trying to find the most general possible reasons for this failure. And it is this quest that led to my claim in the Google video that he was "hoodwinked by every long-known crank" in software, a claim that Rosenberg objects to, noting that in the video I provided just one example (Alan Kay). But this claim I stand by as made -- and I further claim that the most important contribution of Dreaming in Code may be its unparalleled Tour de Crank, including not just the Crank Holy Trinity of Minsky/Kurzweil/Joy, but many of the lesser known crazy relations that we in computer science have carefully kept locked in the cellar. Now, to be fair, Rosenberg often dismisses them (and Kapor put himself squarely in the anti-Kurzweil camp with his 2029 bet), but he dismisses them not nearly often enough or swiftly enough -- and by just presenting them, he grants many of them more than their due in terms of legitimacy.

So how would a change in Rosenberg's perspective have yielded a different work? For one, Rosenberg misses what is perhaps the most profound ramification of the information/machine duality: that software -- unlike everything else that we build -- can achieve a timeless and absolute perfection. To be fair, Rosenberg comes within sight of this truth -- but only for a moment: on page 336 he opens a section with "When software is, one way or another, done, it can have astonishing longevity." But just as quickly as hopes are raised, they are dashed to the ground again; he follows up that promising sentence with "Though it may take forever to build a good program, a good program will sometimes last almost as long." Damn -- so close, yet so far away. If "almost as long" were replaced with "in perpetuity", he might have been struck by the larger fallacies in his own reasoning around the intractable fallibility of software.

And software's ability to achieve perfection is indeed striking, for it makes software more like math than like traditional engineering domains -- after all, long after the Lighthouse at Alexandria crumbled, Euclid's greatest common divisor algorithm is showing no signs of wearing out. Why is this important? Because once software achieves something approximating perfection (and a surprising amount of it does), it sediments into the information infrastructure: the abstractions defined by the software become the bedrock that future generations may build upon.
In this way, each generation of software engineer operates at a higher level of abstraction than the one that came before it, exerting less effort to do more. These sedimenting abstractions also (perhaps paradoxically) allow new dimensions of innovation deeper in the stack: with the abstractions defined and the constraints established, one can innovate underneath them -- provided one can do so in a way that is absolutely reliable (after all, this is to be bedrock), and with a sufficient improvement in terms of economics to merit the risk. Innovating in such a way poses hard problems that demand a much more disciplined and creative engineer than the software sludge that typified the PC revolution, so if Rosenberg had wanted to find software engineers who know how to deliver rock-solid highly-innovative software, he should have gone to those software engineers who provide the software bedrock: the databases, the operating systems, the virtual machines -- and increasingly the software that is made available as a service. There he still would have found problems to be sure (software still is, after all, hard), but he would have come away with a much more nuanced (and more accurate) view of both the state-of-the-art of software development -- and of the future of software itself.
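As a footnote to the Euclid example invoked above: his greatest common divisor algorithm is perhaps the purest illustration of software that is simply done. A minimal C rendering (mine, of course, not anything from Rosenberg's book):

    #include <stdio.h>

    /*
     * Euclid's greatest common divisor algorithm, unchanged in substance
     * for well over two thousand years.
     */
    static unsigned int
    gcd(unsigned int a, unsigned int b)
    {
            while (b != 0) {
                    unsigned int t = b;

                    b = a % b;
                    a = t;
            }

            return (a);
    }

    int
    main(void)
    {
            (void) printf("gcd(1071, 462) = %u\n", gcd(1071, 462));
            return (0);
    }

No patches, no maintenance, no wear: once an abstraction like this is correct, it sediments into the bedrock and stays there.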


Solaris

DTrace on QNX!

There are moments in anyone's life that seem mundane at the time, but become pivotal in retrospect: that chance meeting of your spouse, or the job you applied for on a lark, or the inspirational course that you just stumbled upon.

For me, one of those moments was in late December, 1993. I was a sophomore in college, waiting in Chicago O'Hare for a connecting flight home to Denver for the winter break, reading an issue of the late Byte magazine dedicated to operating systems. At the time, I had just completed what would prove to be the most intense course of my college career -- Brown's (in)famous CS169 -- and the magazine contained an article on a very interesting microkernel-based operating system called QNX. The article, written by the late, great Dan Hildebrand, was exciting: we had learned about microkernels in CS169 (indeed, we had implemented one) and here was one running in the wild, solving real, hard problems. I was inspired. When I returned to school, I cold e-mailed Dan and asked about the possibilities of summer employment. Much to my surprise, he responded! When I returned to school in January, Dan and QNX co-founder Dan Dodge gave me an intense phone interview -- and then offered me work for the summer. I worked at QNX (based in Kanata, outside of Ottawa) that summer, and then came back the next. While I didn't end up working for QNX after graduation, I have always thought highly of the company, the people -- and especially the technology.

So you can imagine that for me, it's a unique pleasure -- and an odd sort of homecoming -- to announce that DTrace is being ported to QNX. In my DTrace talk at Google I made passing reference (at about 1:11:53) to one "other system" that DTrace was being ported to, but that I was not at liberty to mention which. Some speculated that this "surely" must be Windows -- so hopefully the fact that it was QNX will serve to remind that it's a big heterogeneous world out there. So, to Colin Burgess and Thomas Fletcher at QNX: congratulations on your fine work. And to QNX'ers: welcome to the DTrace community!

Now, with the drinks poured and the toasts made, I must confess that this is also a moment colored by some personal sadness, for I would love nothing more right now than to call up danh, chat about the exciting prospects for DTrace on QNX, and reminisce about the many conversations over our shared cube wall so long ago. (And if nothing else, I would kid him about his summer-of-1994 enthusiasm for Merced!) Dan, you are sorely missed...


Solaris

DTrace, Leopard, and the business of open source

If you haven't seen it, DTrace is now shipping in Mac OS X Leopard. This is very exciting for us here at Sun, but one could be forgiven for asking an obvious question: why? How is having our technology in Leopard (which, if Ars Technica is to be believed, is "perhaps the most significant change in the Leopard kernel") helping Sun? More bluntly, haven't we in Sun handed Apple a piece of technology that we could have sold them instead? The answer to these questions -- which are very similar in spirit to the questions that were asked over and over again internally as we prepared to open source Solaris -- is that they are simply the wrong questions.

The thrust of the business questions around open source should not be "how does this directly help me?" or "can't I sell them something instead?" but rather "how much does it cost me?" and "does it hurt me?" Why must one shift the bias of the questions? Because open source often helps in much more profound (and unanticipated) ways than just this quarter's numbers; one must look at open source as long term strategy rather than short term tactics. And as for the intense (and natural) desire to sell a technology instead of giving away the source code, one has to understand that the choice is not between "I give a customer my technology" and "I sell a customer my technology", but rather between "a customer that I never would have had uses my technology" and "a customer that I never would have had uses someone else's technology." When one thinks of open source in this way, the business case becomes much clearer -- but this still may be a bit abstract, so let's apply these questions to the specific case of DTrace in Leopard...

The first question is "how much did it cost Sun to get DTrace on Leopard?" The answer to this first question is that it cost Sun just about nothing. And not metaphorical nothing -- I'm talking actual, literal nothing: Adam, Mike and I had essentially one meeting with the Apple folks, answering some questions that we would have answered for anyone anyway. But answering questions doesn't ship product; how could the presence of our software in another product cost us nothing? This is possible because of that most beautiful property of software: it has no variable cost; the only meaningful costs associated with software are fixed costs, and those costs were all borne by Apple. Indeed, it has cost Sun more money in terms of my time to blog how this didn't cost anything to Sun than it did in fact cost Sun in the first place...

With that question answered, the second question is "does the presence of DTrace on Leopard hurt Sun?" The answer is that it's very hard to come up with a situation whereby this hurts Sun: one would have to invent a fictitious customer who is happily buying Sun servers and support -- but only because they can't get DTrace on their beloved Mac OS X. In fact, this mythical customer apparently hates Sun (but paradoxically loves DTrace?) so much that they're willing to throw out all of their Sun and Solaris investment over a single technology -- and one that is present in both systems no less. Even leaving aside that Solaris and Mac OS X are not direct competitors, this just doesn't add up -- or at least, it adds up to such a strange, irrational customer that you'll find them in the ones and twos, not the thousands or millions.

But haven't we lost some competitive advantage to Apple? Doesn't that hurt Sun? The answer, again, is no.
If you love DTrace (and again, that must be presupposed in the question -- if DTrace means nothing to you, then its presence in Mac OS X also means nothing to you), then you are that much more likely to look at (and embrace) other perhaps less ballyhooed Solaris technologies like SMF, FMA, Zones, least-privilege, etc. That is, the kind of technologist who appreciates DTrace is also likely to appreciate the competitive advantages of Solaris that run far, far beyond merely DTrace -- and that appreciation is not likely to be changed by the presence of DTrace in another system.

Okay, so this doesn't cost Sun anything, and it doesn't hurt Sun. Once one accepts that, one is open to a much more interesting and meaningful question: namely, does this help Sun? Does it help Sun to have our technology -- especially a revolutionary one -- present in other systems? The answer is "you bet!" There are of course some general, abstract ways that it helps -- it grows our DTrace community, it creates larger markets for our partners and ISVs that wish to offer DTrace-based solutions and services, etc. But there are also more specific, concrete ways: for example, how could it not help Solaris to have Ruby developers (the vast majority of whom develop on Mac OS X) become accustomed to using DTrace to debug their Rails app? Today, Rails apps are generally developed on Mac OS X and deployed on Linux -- but one can make a very, very plausible argument that getting Rails developers hooked on DTrace on the development side could well change the dynamics on the deployment side. (After all, DTrace + Leopard + Ruby-on-Rails is crazy delicious!) This all serves as an object lesson of how unanticipatable the benefits of open source can be: despite extensive war-gaming, no one at Sun anticipated that open sourcing DTrace would allow it to be used to Sun's advantage on a hot web development platform running on a hip development system, neither of which originated at Sun.

And the DTrace/Leopard/Ruby triumvirate points to a more profound change: the presence of DTrace in other systems assures that it transcends a company or its products -- that it moves beyond a mere feature, and becomes a technological advance. As such, you can be sure that systems that lack DTrace will become increasingly unacceptable over time. DTrace's shift from product to technological advance -- just like the shifts in NFS or Java before it -- is decidedly and indisputably in Sun's interest, and indeed it embodies the value proposition of the open systems vision that started our shop in the first place. So here's to DTrace on Leopard, long may it reign!


Solaris

DTrace on ONTAP?

As presumably many have seen, NetApp is suing Sun over ZFS. I was particularly disturbed to read Dave Hitz's account, whereby we supposedly contacted NetApp 18 months ago requesting that they pay us "lots of money" (his words) for infringing our patents. Now, as a Sun patent holder, reading this was quite a shock: I'm not enamored with the US patent system, but I have been willing to contribute to Sun's patent portfolio because we have always used our patents defensively. Had we somehow lost our collective way?

Now, I should say that there was something that smelled fishy about Dave's account from the start: I have always said that a major advantage of working for or doing business with Sun is that we're too disorganized to be evil. Being disorganized causes lots of problems, but actively doing evil isn't among them; approaching NetApp to extract gobs of money shows, if nothing else, a level of organization that we as a company are rarely able to summon. So if Dave's account were true, we had silently effected two changes at once: we had somehow become both organized and evil!

Given this, I wasn't too surprised to learn that Dave's account isn't true. As Jonathan explains, NetApp first approached STK through a third-party intermediary (boy, are they ever organized!) seeking to buy the STK patents. When we bought STK, we also bought the ongoing negotiations -- which obviously, um, broke down.

Beyond clarifying how this went down, I think two other facts merit note. One is that NetApp has filed suit in East Texas, the infamous den of patent trolls. If you'll excuse my frankness for a moment: WTF? I mean, I'm not a lawyer, but we're both headquartered in California -- let's duke this out in California like grown-ups. The second fact is that this is not the first time that NetApp has engaged in this kind of behavior: in 2003, shortly after buying the patents from bankrupt Auspex, NetApp went after BlueArc for infringing three of the patents that they had just bought. Importantly, NetApp lost on all counts -- even after an appeal. To me, the misrepresentation of how the suit came to be, coupled with the choice of venue and the history, shows that NetApp -- despite their claptrap to the contrary -- would prefer to meet their competition in a court of law than in the marketplace.

In the spirit of offering constructive alternatives, how about porting DTrace to ONTAP? As I offered to IBM, we'll help you do it -- and thanks to the CDDL, you could do it without fear of infringing anyone's patents. After all, everyone deserves DTrace -- even patent trolls. ;)


Solaris

On the beauty in Beautiful Code

I finished Beautiful Code this week, and have been reflecting on the book and its development. In particular, I have thought back to some of the authors' discussion, in which some advocated a different title. Many of us were very strongly in favor of the working title of "Beautiful Code", and I weighed in with my specific views on the matter:

Date: Tue, 19 Sep 2006 16:18:22 -0700
From: Bryan Cantrill
To: [Beautiful Code Co-Authors]
Subject: Re: [Beautifulcode] reminder: outlines due by end of September

Probably pointless to pile on here, but I'm for "Beautiful Code", if only because it appropriately expresses the nature of our craft. We suffer -- tremendously -- from a bias from traditional engineering that writing code is like digging a ditch: that it is a mundane activity best left to day labor -- and certainly beneath the Gentleman Engineer. This belief is profoundly wrong because software is not like a dam or a superhighway or a power plant: in software, the blueprints _are_ the thing; the abstraction _is_ the machine. It is long past time that our discipline stood on its feet, stepped out from the shadow of sooty 19th century disciplines, and embraced our unique confluence between mathematics and engineering. In short, "Beautiful Code" is long, long overdue.

Now, I don't disagree with my reasoning from last September (though I think that the Norma Rae-esque tone was probably taking it a bit too far), but having now read the book, I stand by the title for a very different reason: this book is so widely varied -- there are so many divergent ideas here -- that only the most subjective possible title could encompass them. That is, any term less subjective than "beautiful" would be willfully ignorant of the disagreements (if implicit) among the authors about what constitutes ideal software. To give an idea of what I'm talking about, here is the breakdown of languages and their representations in chapters:

  Language        Chapters
  C               11
  Java            5
  Scheme/Lisp     3
  C++             2
  Fortran         2
  Perl            2
  Python          2
  Ruby            2
  C#              1
  JavaScript      1
  Haskell         1
  Visual BASIC    1

(I'm only counting each chapter once, so for the very few chapters that included two languages, I took whatever appeared more frequently. Also, note that some chapters were about the implementation of one language feature in a different language -- so for example, while there are two additional chapters on Python, both pertain more to the C-based implementation of those features than to their actual design or use in Python.)

Now, one could argue (and it would be an interesting argument) about how much choice of language matters in software or, for that matter, in thought. And indeed, in some (if not many) of these chapters, the language of implementation is completely orthogonal to the idea being discussed. But I believe that language does ultimately affect thought, and it's hard to imagine how one could have a sense of beauty that is so uncritical as to equally accommodate all of these languages. More specifically: read, say, R. Kent Dybvig's chapter on the implementation of syntax-case in Scheme and William Otte and Douglas Schmidt's chapter on implementing a distributed logging service using an object-oriented C++ framework. It seems unlikely to me that one person will come away saying that both are beautiful to them. 
(And I'm not talking new-agey "beautiful to someone" kind of beautiful -- I'm talking the "I want to write code like that" kind of beautiful.) This is not meant to be a value judgement on either of these chapters -- just the observation that their definitions of beauty are (in my opinion, anyway) so wildly divergent as to be nearly mutually exclusive. And that's why the title is perfect: both of these chapters are beautiful to their authors, and we can come away saying "Hey, if it's beautiful to you, then great."

So I continue to strongly recommend Beautiful Code, but perhaps not in the way that O'Reilly might intend: you should read this book not because it's cover-to-cover perfection, but rather to hone your own sense of beauty. To that end, this is a book best read concurrently with one's peers: discussing (and arguing about) what is beautiful, what isn't beautiful, and why will help you discover and refine your own voice in your code. And doing this will enable you to write the most important code of all: code that is, if nothing else, beautiful to you.


Solaris

Beautiful Code

So my copy of Beautiful Code showed up last week. Although I am one of the (many) authors and have thus had access to the entire book online for some time, I do all of my pleasure reading in venues that need the printed page (e.g. the J Church) and have therefore waited for the printed copy to start reading. Although I have only read the first twelve chapters or so, it's already clear (and perhaps not at all surprising) that there are starkly different definitions of beauty here: the book's greatest strength -- and, frankly, its greatest weakness -- is that the chapters are so incredibly varied. For one chapter, beauty is a small and titillating act of recursion; for the next, it's that a massive and complicated integrated system could be delivered quickly and cheaply. (I might add that the definition of beauty in my own chapter draws something from both of these poles: that in software, the smallest and most devilish details can affect the system at the largest and most basic levels.) If one can deal with the fact that the chapters are widely divergent, and that there is not even a token attempt to weave them together into a larger tapestry, this book (at least so far, anyway) is (if nothing else) exceptionally thought provoking; if Oprah were a code-cranking propeller head, this would be the ideal choice for her book club.

Now, in terms of some of the specific thoughts that have been provoked: as I mentioned, quite a few of my coauthors are enamored with the elegance of recursion. While I confess that I like writing a neatly recursive routine, I also find that I frequently end up having to unroll the recursion when I discover that I must deal with data structures that are bigger than I anticipated -- and that my beautiful code is resulting (or can result) in a stack overflow. (Indeed, I spent several unpleasant days last week doing exactly this when I discovered that pathologically bad input could cause blown stacks in some software that I'm working on.)

To take a concrete example, Brian Kernighan has a great chapter in Beautiful Code about some tight, crisp code written by Rob Pike to perform basic globbing. And the code is indeed beautiful. But it's also (at least in a way) busted: it overflows the stack on some categories of bad input. Admittedly, one is talking about very bad input here -- strings that consist of hundreds of thousands of stars in this case -- but this highlights exactly the problem I have with recursion: it leaves you with edge conditions that on the one hand really are edge conditions (deeply pathological input), but with a failure mode (a stack overflow) that's just too nasty to ignore.

Now, there are ways to deal with this. If one can stomach it, the simplest way is to set up a sigaltstack and then siglongjmp out of a SIGSEGV/SIGBUS signal handler. You have to be very careful about doing this: the signal handler should look at the si_addr field in the siginfo and compare it to the stack bounds to confirm that it's a stack overflow, lest it end up siglongjmp'ing out of a non-recursion-induced SIGSEGV (which, needless to say, would make a bad problem much worse). 
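To make that a little more concrete, here is a minimal sketch of the sigaltstack/sigsetjmp/siglongjmp technique -- not my actual program, just the shape of it. The stack-bounds check (the address of a local in main plus RLIMIT_STACK) is an approximation, and recurse() is a trivial stand-in for the real recursive code:

#include <signal.h>
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

static sigjmp_buf env;
static char *stack_base;        /* approximate top of the main stack */
static size_t stack_size;       /* soft limit on the main stack */

static void
handler(int sig, siginfo_t *sip, void *unused)
{
        char *addr = (char *)sip->si_addr;

        /*
         * Only treat this as a stack overflow if the faulting address
         * lies within the main stack's bounds; otherwise re-raise the
         * signal so a genuine bug still dumps core.
         */
        if (addr <= stack_base && addr > stack_base - stack_size)
                siglongjmp(env, 1);

        (void) signal(sig, SIG_DFL);
        (void) raise(sig);
}

static int
recurse(const char *s)          /* stand-in for the recursive matching code */
{
        char frame[64];

        (void) memset(frame, 0, sizeof (frame));
        return (*s == '\0' ? 0 : recurse(s + 1) + 1);
}

int
main(int argc, char **argv)
{
        stack_t ss;
        struct sigaction sa;
        struct rlimit rl;
        char local;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s string\n", argv[0]);
                return (2);
        }

        /* Give the handler its own stack so it can run after an overflow. */
        ss.ss_sp = malloc(SIGSTKSZ);
        ss.ss_size = SIGSTKSZ;
        ss.ss_flags = 0;
        (void) sigaltstack(&ss, NULL);

        (void) getrlimit(RLIMIT_STACK, &rl);
        stack_base = &local;
        stack_size = rl.rlim_cur;

        (void) memset(&sa, 0, sizeof (sa));
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
        (void) sigaction(SIGSEGV, &sa, NULL);
        (void) sigaction(SIGBUS, &sa, NULL);

        if (sigsetjmp(env, 1) != 0) {
                (void) printf("recursing over the input blew the stack\n");
                return (1);
        }

        (void) printf("survived to depth %d\n", recurse(argv[1]));
        return (0);
}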
While an alternative signal stack solution may sound hideous to some, at least the recursion doesn't have to go under the knife in this approach. If having a SIGSEGV handler to catch this condition feels uncomfortably brittle (as well it might), or if one's state cannot be neatly unwound after an arbitrary siglongjmp (as well it might not), the code will have to change: either a depth counter will have to be passed down and failure propagated when depth exceeds a reasonable maximum, or the recursion will have to be unrolled into iteration. For most aesthetic senses, none of these options is going to make the code more beautiful -- but they will make it indisputably more correct.

I was actually curious about where exactly the Pike/Kernighan code would blow up, so I threw together a little program that uses sigaltstack along with sigsetjmp/siglongjmp to binary search for the shortest input that induces the failure. My program, which (naturally) includes the Pike/Kernighan code, is here. Here are the results of running my program on a variety of Solaris platforms, with each number denoting the maximum string length that can be processed by the Pike/Kernighan code without the possibility of stack overflow:

                               x86                   SPARC
                         32-bit    64-bit       32-bit    64-bit
  Sun cc, unoptimized   403,265   187,225       77,649    38,821
  gcc, unoptimized      327,651   218,429       69,883    40,315
  Sun cc, optimized     327,651   327,645      174,723    95,303
  gcc, optimized        582,489   524,227      149,769    87,367

As can be seen, there is a tremendous range here, even across just two different ISAs, two different data models and two different compilers: from 38,821 on 64-bit SPARC using Sun cc without optimization to 582,489 on 32-bit x86 using gcc with optimization -- an order of magnitude difference. So while recursion is a beautiful technique, it is one that ends up with the ugliest of implicit dependencies: on the CPU architecture, on the data model and on the compiler. And while recursion is still beautiful to me personally, it will always be a beauty that is more superficial than profound...
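(For completeness, the depth-counter alternative mentioned above can be sketched in a few lines. The routine here is a trivial stand-in rather than the Pike/Kernighan matcher, and MAXDEPTH is an arbitrary illustrative limit; the point is only the shape of the transformation -- pass the depth down, and propagate a failure instead of recursing past the limit:

#include <stdio.h>

#define MAXDEPTH        100000          /* illustrative limit */

/* returns -1 if the depth limit was hit, otherwise the count */
static int
counted(const char *s, int depth)
{
        int rv;

        if (depth > MAXDEPTH)
                return (-1);            /* fail cleanly, don't overflow */
        if (*s == '\0')
                return (0);
        rv = counted(s + 1, depth + 1);
        return (rv < 0 ? rv : rv + 1);  /* propagate the failure */
}

int
main(void)
{
        (void) printf("%d\n", counted("some input", 0));
        return (0);
}

Not more beautiful, as I said -- but indisputably more correct.)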


Solaris

The inculcation of systems thinking

As is known but perhaps not widely reported, all three of us on Team DTrace are products of Brown University Computer Science. More specifically, we were all students in (and later TAs for) Brown's operating systems course, CS169. This course has been taught by the same professor, Tom Doeppner, over its thirty-year lifetime, and has become something of a legend in Silicon Valley, having produced some of the top engineers at major companies like NetApp, SGI, Adobe, and VMware -- not to mention tons of smaller companies. And at Sun, CS169 has cast a particularly long shadow, with seven CS169 alums (Adam, Dan, Dave, Eric, Matt, Mike and me) having together played major roles in developing many of the revolutionary technologies in Solaris 10 (specifically, DTrace, ZFS, SMF, FMA and Zones).

I mention the Brown connection because this past Thursday, Brown hosted a symposium to honor both the DTrace team in particular and the contributions of former CS169 undergraduate TAs more generally. We were each invited to give a presentation on a topic of our choosing, and seizing the opportunity for intellectual indulgence, I chose to reflect on a broad topic: the inculcation of systems thinking. My thoughts on this topic deserve their own lengthy blog entry, but this presentation will have to suffice for now -- albeit stripped of the references to the Tupolev Tu-144, LBJ, Ray Kurzweil, the 737 rudder reversal and Ruby stack backtraces that peppered (or perhaps polluted?) the actual talk...


Solaris

DTrace on AIX?

So IBM has been on the warpath recently against OpenSolaris, culminating with their accusation yesterday that OpenSolaris is a "facade." This is so obviously untrue that it's not even worth refuting in detail. In fact, being the father of a toddler, I would liken IBM's latest outburst to something of a temper tantrum -- and as with a screaming toddler, the best way to deal with this is to not reward the performance, but rather to offer some constructive alternatives. So, without further ado, here are my constructive suggestions to IBM:

Open source OS/2. Okay, given the tumultuous history of OS/2, this is almost certainly not possible from a legal perspective -- but it would be a great open source contribution (some reasonably interesting technology went down with that particular ship), and many hobbyists would love you for it. Like I said, it's probably not possible -- but just to throw it out there.

Open source AIX. AIX is one of the true enterprise-class operating systems -- one with a long history of running business-critical applications. As such, it would be both a great contribution to open source and a huge win for AIX customers for AIX to go open source -- if only to be able to look at the source code when chasing a problem that isn't necessarily a bug in the operating system. (And I confess that on a personal level, I'm very curious to browse the source code of an operating system that was ported from PL/1.) However, as with OS/2, AIX's history is likely going to make open sourcing it tough from a legal perspective: its Unix license dates from the Bad Old Days, and it would probably be time consuming (and expensive) to unencumber the system to allow it to be open sourced.

Okay, those two are admittedly pretty tough for legal reasons. Here are some easier ones:

Support the port of OpenSolaris to POWER/PowerPC. Sun doesn't sell POWER-based gear, so you would have the comfort of knowing that your efforts would in no way assist a Sun hardware sale, and your POWER customers would undoubtedly be excited to have another choice for their deployments.

Support the nascent effort to port OpenSolaris to the S/390. Hey, if Linux makes sense on an S/390, surely OpenSolaris with all of its goodness makes sense too, right? Again, customers love choice -- and even an S/390 customer that has no intention of running OpenSolaris will love having the choice made available to them.

Okay, so those two are easier because the obstacles aren't legal obstacles, but there are undoubtedly internal IBM cultural issues that make them effectively non-starters. So here's my final suggestion, and it's an absolutely serious one. It's also relatively easy, it clearly and immediately benefits IBM and IBM's customers -- and it doesn't even involve giving up any IP:

Port DTrace to AIX. Your customers want it. Apple has shown that it can be done. We'll help you do it. And you'll get to participate in the DTrace community (and therefore the OpenSolaris community) in a way that doesn't leave you feeling like you've been scalped by Scooter. Hell, you can even follow Apple's lead with Xray and innovate on top of DTrace: from talking to your customers over the years, it's clear that they love SMIT -- integrate a SMIT frontend with a DTrace backend! Your customers will love you for it, and the DTrace community will be excited to have yet another system on which they can use DTrace.

Now, IBM may respond to these alternatives just as a toddler sometimes responds to constructive alternatives ("No! No! NO! Mine! MINE! MIIIIIIIINE!", etc). 
But if cooler heads prevail at Big Blue, these suggestions -- especially the last one -- will be seen as a way to constructively engage that will have clear benefits for IBM's customers (and therefore for IBM). So to IBM I say what parents have said to screaming toddlers since time immemorial: we're ready when you are.


Solaris

DTrace on Mac OS X!

From WWDC here in San Francisco: Apple has just announced support for DTrace in Leopard, the upcoming release of Mac OS X! Often (or even usually?) announcements at conferences are more vapor than ware. In this case, though, Apple is being quite modest: they have done a tremendous amount of engineering work to bring DTrace to their platform (including, it turns out, implementing DTrace's FBT provider for PowerPC!), and they are using DTrace as part of the foundation for their new Xray performance tool. This is very exciting news, as it brings DTrace to a whole slew of new users. (And speaking personally, it will be a relief to finally have DTrace on the only computer under my roof that doesn't run Solaris!) Having laid hands on DTrace on Mac OS X myself just a few hours ago, I can tell you that while it's not yet a complete port, it's certainly enough to be uniquely useful -- and it was quite a thrill to see Objective C frames in a ustack action!

So kudos to the Apple engineering team working on the port: Steve Peters, James McIlree, Terry Lambert, Tom Duffy and Sean Callanan. It's been fun for us to work with the Apple team, and gratifying to see their results. And it isn't just rewarding for us; the entire OpenSolaris community should feel proud about this development, because it gives the lie to IBM's nauseating assertion that we in the OpenSolaris community aren't in the "spirit" of open source.

So to Apple users: welcome to DTrace! (And to DTrace users: welcome to Mac OS X!) Be sure to join us in the DTrace community -- and if you're at the WWDC, we on Team DTrace will be on hand on Friday at the DTrace session, so swing by to see a demo of DTrace running on Mac OS X and to meet both the team at Apple that worked on the port and those of us at Sun who developed DTrace. And with a little luck, you might even hear Adam commemorate the occasion by singing his beautiful and enchanting ISA aria...

Update: For disbelievers, Adam posted photos -- and Mike went into greater detail on the state of the Leopard DTrace port, and what it might mean for the direction of DTrace.
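(For the curious, seeing those Objective C frames takes nothing more than an ordinary D one-liner; something along these lines -- with 12345 standing in for the pid of whatever application you're poking at -- profiles the target and aggregates on user stacks:

# dtrace -n 'profile-97 /pid == $target/ { @[ustack()] = count(); }' -p 12345

This is a sketch, of course -- the exact set of providers available in the Leopard port is Apple's to describe -- but it gives a flavor of what "laying hands on it" looks like.)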


Solaris

DTrace on FreeBSD, update

A while ago, I blogged about the possibility of a FreeBSD port of DTrace. For the past few months, John Birrell has been hard at work on the port, and has announced recently that he has much of the core DTrace functionality working. Over here at DTrace Central, we've been excitedly watching John's progress for a little while, providing help and guidance where we can -- albeit not always solicited ;) -- and have been very impressed with how far he's come. And while John has quite a bit further to go before one could call it a complete port, what he has now is indisputably useful. If you run FreeBSD in production, you're going to want John's port as it stands today -- and if you develop for the FreeBSD kernel (drivers or otherwise), you're going to need it. (Once you've done kernel development with DTrace, there's no going back.)

So this is obviously a win for FreeBSD users, who can now benefit from the leap in software observability that DTrace provides. It's also clearly a win for DTrace users, because now you have another platform on which you can observe your production software -- and a larger community with whom to share your experiences and thoughts. And finally, it's a huge win for OpenSolaris users: the presence of a FreeBSD port of DTrace validates that OpenSolaris is an open, innovative platform (despite what some buttheads say) -- one that will benefit from and contribute to the undeniable economics of open source.

So congrats to John! And to the FreeBSD folks: welcome to the DTrace community!


Solaris

DTrace on Rails

First I need to apologize for having been absent for so long -- I amvery much heads-down on a new project. (The details of which will needto wait for another day, I'm afraid -- but suffice it to say that it, likejust about everything else I've done at Sun, leverages much of what I'vedone before it.)That said, I wanted to briefly emerge to discuss some recent work. A while ago,I blogged aboutDTraceand Ruby, usingRich Lowe'sprototype DTrace provider. This provider represents a quantum leap for Rubyobservability, but it suffers from thefact that we must do work (in particular, we must get the class and method)even when disabled. This is undesirable (especially considering thatthe effect can be quite significant -- up to 2X), and it runs against theDTrace ethos of zero disabled probe effect, but there has been no bettersolution. Now, however, thanks to Adam's workonis-enabled probes, we can have a Ruby provider that has zero disabledprobe effect. (Or essentially zero: I actually measured the probe effect at 0.2% --very much in the noise.) Having zero disabled probe effectallows us to deploy DTrace on Ruby inproduction -- which in turn opens up a whole new domain for DTrace:Ruby on Rails. And as I wasreminded byJason Hoffman's recentScale with Rails presentation (in which he outlines why they pickedSolaris generally -- and ZFS in particular), this is a hugely importantgrowth area for Solaris. So without further ado,here is a (reasonably) simple script that relies on somedetails of WEBrick and Rails to yield a system profile for Rails requests:#pragma D option quietself string uri;syscall::read:entry/execname == "ruby" && self->uri == NULL/{ self->fd = arg0; self->buf = arg1; self->size = arg2;}syscall::read:return/self->uri == NULL && self->buf != NULL && strstr(this->str = copyinstr(self->buf, self->size), "GET ") == this->str/{ this->head = strtok(this->str, " "); self->uri = this->head != NULL ? 
strtok(NULL, " ") : NULL; self->syscalls = 0; self->rbcalls = 0;}syscall::read:return/self->buf != NULL/{ self->buf = NULL;}syscall:::entry/self->uri != NULL/{ @syscalls[probefunc] = count();}ruby$1:::function-entry/self->uri != NULL/{ @rbclasses[this->class = copyinstr(arg0)] = count(); this->sep = strjoin(this->class, "#"); @rbmethods[strjoin(this->sep, copyinstr(arg1))] = count();}pid$1::mysql_send_query:entry/self->uri != NULL/{ @queries[copyinstr(arg1)] = count();}syscall::write:entry/self->uri != NULL && arg0 == self->fd && strstr(this->str = copyinstr(arg1, arg2), "HTTP/1.1") == this->str/{ self->uri = NULL; ncalls++;}END{ normalize(@syscalls, ncalls); trunc(@syscalls, 10); printf("Top ten system calls per URI serviced:\\n"); printf("---------------------------------------"); printf("--------------------------------+------\\n"); printa(" %-68s | %@d\\n", @syscalls); normalize(@rbclasses, ncalls); trunc(@rbclasses, 10); printf("\\nTop ten Ruby classes called per URI serviced:\\n"); printf("---------------------------------------"); printf("--------------------------------+------\\n"); printa(" %-68s | %@d\\n", @rbclasses); normalize(@rbmethods, ncalls); trunc(@rbmethods, 10); printf("\\nTop ten Ruby methods called per URI serviced:\\n"); printf("---------------------------------------"); printf("--------------------------------+------\\n"); printa(" %-68s | %@d\\n", @rbmethods); trunc(@queries, 10); printf("\\nTop ten MySQL queries:\\n"); printf("---------------------------------------"); printf("--------------------------------+------\\n"); printa(" %-68s | %@d\\n", @queries);}Running the above while horsing around with the Depot application fromAgileWeb Development with Rails yields the following:Top ten system calls per URI serviced:-----------------------------------------------------------------------+------ setcontext | 15 fcntl | 16 fstat64 | 16 open64 | 21 close | 25 llseek | 27 lwp_sigmask | 30 read | 62 pollsys | 80 stat64 | 340Top ten Ruby classes called per URI serviced:-----------------------------------------------------------------------+------ ActionController::CodeGeneration::Source | 89 ActionController::CodeGeneration::CodeGenerator | 167 Fixnum | 190 Symbol | 456 Class | 556 Hash | 1000 String | 1322 Array | 1903 Object | 2364 Module | 6525Top ten Ruby methods called per URI serviced:-----------------------------------------------------------------------+------ Object#dup | 235 String#== | 250 Object#is_a? | 288 Object#nil? | 316 Hash#[] | 351 Symbol#to_s | 368 Object#send | 593 Module#included_modules | 1043 Array#include? | 1127 Module#== | 5058Top ten MySQL queries:-----------------------------------------------------------------------+------ SELECT \* FROM products LIMIT 0, 10 | 2 SELECT \* FROM products WHERE (products.id = '7') LIMIT 1 | 2 SELECT count(\*) AS count_all FROM products | 2 SHOW FIELDS FROM products | 5While this gives us lots of questions we might want to answer (e.g., "whythe hell are we doing 340 stats on every 'effing request?!"1), it might bea little easier to lookat a view that lets us see requests and the database queries that theyinduce. 
Here, for example, is a similar script to do just that:#pragma D option quietself string uri;self string newuri;BEGIN{ start = timestamp;}syscall::read:entry/execname == "ruby" && self->uri == NULL/{ self->fd = arg0; self->buf = arg1; self->size = arg2;}syscall::read:return/self->uri == NULL && self->buf != NULL && (strstr(this->str = copyinstr(self->buf, self->size), "GET ") == this->str || strstr(this->str, "POST ") == this->str)/{ this->head = strtok(this->str, " "); self->newuri = this->head != NULL ? strtok(NULL, " ") : NULL;}syscall::read:return/self->newuri != NULL/{ self->uri = self->newuri; self->newuri = NULL; printf("%3d.%03d => %s\\n", (timestamp - start) / 1000000000, ((timestamp - start) / 1000000) % 1000, self->uri);}pid$1::mysql_send_query:entry/self->uri != NULL/{ printf("%3d.%03d -> \\"%s\\"\\n", (timestamp - start) / 1000000000, ((timestamp - start) / 1000000) % 1000, copyinstr(self->query = arg1));}pid$1::mysql_send_query:return/self->query != NULL/{ printf("%3d.%03d


Solaris

Welcome to ZFS!

If you haven't already seen it,ZFS is now availablefor download, marking a major milestone in the history of filesystems.Today is a way station down a long road:for aslong as I have knownJeff Bonwick,he has wanted to solve the filesystem problem -- and about five years ago,Jeff set outto do just that, starting (as Jeff is wont to do) from a blank sheet ofpaper. I vividly remember Jeff describing some of his nascent ideas onmy whiteboard; the ideas were radical and revolutionary, theirimplications manifold. I remember thinking "he's either missedsomething basic that somehow invalidates these ideas -- or this is the mostimportant development in storage since RAID."AsI recently recounted,Jeff is the reason that I came to Sun almost a decade ago -- and inparticular, I was drawn by Jeff's contagious belief that nothing isimpossible simplybecause it hasn't been done before. So I knew better than to doubthim at the time -- and I knew that the road ahead promised excitementif nothing else. Years after that moment,there is no other conclusion left to be had:ZFSis the most important revolution in storage software in twodecades -- and may be the most important idea since the filesystem itself.That may seem a heady claim, but keep reading...To get an idea of what ZFS can do,first check outDan Price's awesomeZFS flashdemoThen join me on a tour of today's ZFS blog entries,as ZFS developers and users inside Sun illustrate thepower of ZFS: ease of administration, absolute reliability andrippin' performance.Administration.If you're an administrator, start your ZFS blog tour withMartin Englund'sentry onZFS from a sysadmin's view. Martin walks you through the ease of settingup ZFS; there are no hidden wires -- it really is that easy!And if, as a self-defence mechanism, your brain refuses to let you recallthe slog of traditional volume management, check outTim Foster'sentrycomparingZFS management to Veritas management. (And have we mentionedthe price?)For insight into the design principles that guided the developmentof the administration tools, check outEric Schrock'sentry onthe principles of the ZFS CLI. Eric's entry and his design reflectthe principles that we used inDTrace as well:make it simple to do simple things and make itpossible to do complicated things.As you can imagine, this simplicity of management is winning fans bothinside and outside of Sun. For some testimonials, check outLin Ling'sentry on the lovefor ZFS -- both Lin's and our Beta customers'.As Lin's entry implies,a common theme among ZFS users is "if I only had this when..."; checkoutJames McPhearson'sentrywishing he had ZFS back in the day.And if you think that the management of ZFS couldn't get any easier, checkoutSteve Talley'sentry onmanaging ZFS from your browser.Steve's work highlightsthe proper role for GUI admin tools in a system: they should makesomething that's already simple even simpler. 
Theyshould not be used to smear lipstick over a hideously over-complicatedsystem -- doing so leads to an unresolvable rift betweenwhat the tool is telling you the system looks like, and what the systemactually looks like.Thanks to the simplicity of ZFS itself, there is no secondguessing about what the GUI is actually doing under the hood -- it'sall just gravy!Speaking of gravy,check out the confluence of ZFS with another revolutionary Solaris technologyinDan Price'sentry onZFS and Zones -- thanks to some great integration work,local zone administrators can have the full power of ZFSwithout compromising the security of the system!For details on particular features of ZFS, check outMark Maybee'sentry onquotas and reservations in ZFS. Unlikesome other systems, quotas and reservations are first-class citizensin ZFS, not bolted-on afterthoughts.Die, /usr/sbin/quota, die!And for details on another feature of ZFS, check outMark Shellenbaum'sentry onaccess controllists in ZFS, andLisa Week'sentry describingwhyZFS adopted the NFSv4 ACL model.Like quotas and reservations, ACLs were a part ofthe design of ZFS -- not something that was carpet-bombed over the sourceafter the fact.Reliability. Unlike virtually every other filesystem thathas come before it, ZFS is designed around unreliable hardware.This design-center means that ZFS can detect -- and correct! -- errorsthat other filesystems just silently propagate to the user. To get avisceral feel for this, readEric Lowe's entryonZFS saving the day. Reading this entry will send a chill up yourspine: Eric had a data-corrupting hardware problem thathe didn't know he haduntil ZFS. How much data is being corrupted out there today becausepre-ZFS filesystems are too trusting of faulty hardware? More to thepoint, how much of your data is being corrupted today?Yeah -- scary, ain't it? And not only can ZFS detect hardware errors,in a mirrored configuration it can correct them.Fortunately, you don't have to have bustedhardware to see this: look atTim Cook's entrydemonstrating ZFS's self-healing by using dd to simulatedate corruption.But if problems like Eric's are all over the place, how is anyone's data evercorrect? The answer is pretty simple, if expensive: you pay forreliability bybuyingover-priced hardware.That is, we've compensated for dumb software by having smart (and expensive)hardware. ZFS flips the economics on its head: smart software allowsfor stupid (and cheap) hardware -- with ultra-high reliability.This is a profound shift; for more details on it check outRichard Elling'sentry onthe reliability of ZFS.ZFS is reliable by its architecture, but what of the implementation?As Bill Moore writes,testing ZFS was every bit as important as writing it.And testing ZFS involved many people, asJim Walkerdescribes in his entry onthe scopeof the ZFS testing effort.Performance. So fine: ZFS is a snap to administer, and it'sultra-reliable -- but at what performance cost? 
The short answer is:none, really -- and in fact, on many workloads, it rips.How can you have such features and still have great performance?Generally speaking, ZFS is able to delivergreat performance because it has more context, aphenomenon thatBill Sommerfeldnotes is a consequence ofthe end-to-end principle.To see how this unlocks performance, look atBill Moore's entryonI/O scheduling; as Bill describes (and as I can personally attest to)ZFS is much smarter about how it uses I/O devicesthan previous filesystems.For another architectural feature forperformance, look atNeil Perrin's entryon theZFS intent log -- and chase it withNeelakanth Nadgir'sentry taking you throughthe ZILcode.If you're looking for some performance numbers,check outRoch Bourbonnais's entrycomparing the performance of ZFS and UFS.Or letEric Kustarztake you to school, as you gotoFilesystems Performance 101: Disk Bandwidth,Filesystems Performance 102: Filesystem Bandwidth and finallygraduate to Filesystems Performance 201: When ZFS Attacks!So given that ZFS is all that, when can we start forgetting about everyother on-disk filesystem? For that, we'll need to be able to boot offZFS. Bad news: this is hard. Good news:Tabriz Lemanand the rest of the ZFS Boot team are making great progress, asTabriz describes in her entry onbooting ZFS. Once we can boot ZFS -- that is, once we can assume ZFS --all sorts of cool things become possible, asBart Smaaldersbrainstorms in his entry onthe impact of ZFS on Solaris. As Bart says, this is just thebeginning of the ZFS revolution...Finally, this has been a long, hard slog for the ZFS team.Anyone who has worked through "crunch time" on a big project willsee something of themselves inNoel Dellofano'sentry onthe final push.And any parent can empathize withSherry Moore'sentrycongratulating the team -- and looking forward to havingher husbandonce again available to help with the kids.So congratulations to everyone on the ZFS team (and your families!)-- and for everyone else,welcome to ZFS!Technorati tags:OpenSolarisSolarisZFS


Solaris

Man, myth, legend

On a Sunday night shortly after we bought Kealia, I got a call at home from John Fowler. He asked me if I'd like to join Glenn Weinberg and him the next morning to meet with Andy Bechtolsheim at Kealia's offices in Palo Alto. It's hard to express my excitement at this proposition -- it was like asking a kid if he wants to go throw a ball around with Joe DiMaggio. And indeed, my response was something akin to the "Gee mister -- that'd be swell!" that this image evokes...

When we walked into Kealia's offices the next morning, there, in the foyer, was Andy! Andy blinked for a moment, and then -- without any introductions -- began to excitedly describe some of his machines. Still talking, he marched to his office, whereupon he went to the whiteboard and started furiously drawing block diagrams. Here, at long last, was the Real Deal: a fabled engineer who didn't disappoint -- a giant who dwarfed the substantial legend that preceded him. After several minutes at the whiteboard, Andy got so excited that he had to actually get the plans to show us how some particular piece had been engineered. And with that, he flew out of the room.

As we caught our breath, Glenn looked at me and said "just so you know, this is what it's like trying to talk to you." While I was still trying to figure out if this was a compliment or an insult (which I still haven't figured out, by the way), Andy flew back in, unfurled some plans for a machine and excitedly pointed out some of the finer details of his design. Andy went on for a few more minutes when, like a raging prairie fire that had suddenly hit a fireline, he went quiet. With that, I kicked the door down (metaphorically) and started describing what we had been working on in Solaris 10. (After all, John hadn't brought me along to just sit back and watch.) As I went through revolutionary feature after revolutionary feature, I was astounded by how quickly Andy grasped detail -- he asked incisive questions that reflected a greater understanding of software than any other hardware engineer I had ever encountered. And as he seemed to be absorbing detail faster and faster, I began delivering it faster and faster. Now, as others have observed, I'm not exactly a slow talker; this might have been one of the few times in my life where I thought I actually needed to speak faster to stay in front of my audience. Whew!

Most impressive of all, Andy had a keen intuition for the system -- he immediately saw how his innovative hardware and our innovative software could combine to deliver some uniquely innovative systems to our customers. He was excited about our software; we were excited about his hardware. How much better than that does it get?

Needless to say, ever since that morning -- which was nearly a year and a half ago now -- I have eagerly awaited the day that Andy's boxes would ship. If you've talked to me over the last year, you know that I've been very bullish on Sun; now you know why. (Well, now you have a taste as to why; believe it or not, Andy's best boxes are still to come.) Not everyone can own a car designed by Enzo Ferrari or a lamp crafted by Louis Comfort Tiffany -- but at just over two grand a pop, pretty much everyone can own a machine designed by the greatest single-board computer designer in history. Congratulations to Andy and team on an historic launch! And might I add that it was especially fitting that it was welcomed with what is easily the funniest ad in Sun's history.


Solaris

DTrace and Ruby

It's been an exciting few weeks forDTrace.The party got started withWez Furlong's newPHPDTrace provider at OSCON. ThenDevon O'Dellannounced that he was starting to work in earnest on aDTraceport to FreeBSD. And now,Rich Lowehas made available a prototypeRubyDTrace provider.To install this, grabRuby 1.8.2,applyRich'spatch, and run ./configure withthe --enable-dtrace option.When you run the resulting ruby, you'll see two probes:function-entry andfunction-return.The arguments to these probes are as follows:arg0 is the name of the class (a pointer to a string within Ruby)arg1 is the name of the method (also a pointer toa string within Ruby)arg2 is the name of the file containing the call site (again,a pointer to a string within Ruby)arg3 is the line number of the call site.So if, for example, you'd like to know the classes and methods thatare called in a particular Ruby script, you could do it with thissimple D script:#pragma D option quietruby$target:::function-entry{ @[copyinstr(arg0), copyinstr(arg1)] = count();}END{ printf("%15s %30s %s\\n", "CLASS", "METHOD", "COUNT"); printa("%15s %30s %@d\\n", @);}To run this against the cal.rb that ships in the sampledirectory of Ruby, call the above script whatmethods.d andrun it this way:# dtrace -s ./whatmethods.d -c "../ruby ./cal.rb" August 2005 S M Tu W Th F S 1 2 3 4 5 6 7 8 9 10 11 12 1314 15 16 17 18 19 2021 22 23 24 25 26 2728 29 30 31 CLASS METHOD COUNT Array = 2 Hash each 2 Hash keys 2 Module attr 2 Module method_undefined 2 Module public 2 Rational coerce 2 Array + 3 Class civil_to_jd 3 Hash [] 3 Object collect 3 Array collect 4 Class inherited 4 Range each 4 String size 4 Module private_class_method 5 Object eval 5 Object require 5 String gsub 5 Class jd_to_wday 7 Class once 7 Date __8713__ 7 Date wday 7 Fixnum % 8 Array join 10 Hash []= 10 String + 10 Array each 11 NilClass to_s 11 Module alias_method 22 Module private 22 Symbol to_s 26 Module module_eval 28 Date mday 31 Object send 31 Date mon 42 Date __11105__ 43 Class jd_to_civil 45 Date succ 47 Class os? 48 Date + 49 Fixnum > 4698 Fixnum [] 7689 Fixnum == 11436This may leave us with many questions. For example, there are a coupleof calls to construct new objects -- where are they coming from? Toanswer that question:#pragma D option quietruby$target:::function-entry/copyinstr(arg1) == "initialize"/{ @[copyinstr(arg0), copyinstr(arg2), arg3] = count();}END{ printf("%-10s %-40s %-10s %s\\n", "CLASS", "INITIALIZED IN FILE", "AT LINE", "COUNT"); printa("%-10s %-40s %-10d %@d\\n", @);}Calling the above whereinit.d, we can run it in a similarmanner:# dtrace -s ./whereinit.d -c "../ruby ./cal.rb" August 2005 S M Tu W Th F S 1 2 3 4 5 6 7 8 9 10 11 12 1314 15 16 17 18 19 2021 22 23 24 25 26 2728 29 30 31CLASS INITIALIZED IN FILE AT LINE COUNTCal ./cal.rb 144 1Date /usr/local/lib/ruby/1.8/date.rb 593 1Date /usr/local/lib/ruby/1.8/date.rb 703 1Date /usr/local/lib/ruby/1.8/date.rb 916 1Time /usr/local/lib/ruby/1.8/date.rb 702 1Date /usr/local/lib/ruby/1.8/date.rb 901 49Rational /usr/local/lib/ruby/1.8/rational.rb 374 610Looking at the Date class, it's interesting to look at line 901 offile /usr/local/lib/ruby/1.8/date.rb: 897 # If +n+ is not a Numeric, a TypeError will be thrown. In 898 # particular, two Dates cannot be added to each other. 899 def + (n) 900 case n 901 when Numeric; return self.class.new0(@ajd + n, @of, @sg) 902 end 903 raise TypeError, 'expected numeric' 904 endThis makes sense: we're initializing new Dateinstances in the +method for Date. 
And where are those coming from?It's not hard to build a script that will tell us the file and linefor the call site of an arbitrary class and method:#pragma D option quietruby$target:::function-entry/copyinstr(arg0) == $$1 && copyinstr(arg1) == $$2/{ @[copyinstr(arg2), arg3] = count();}END{ printf("%-40s %-10s %s\\n", "FILE", "LINE", "COUNT"); printa("%-40s %-10d %@d\\n", @);}For this particular example (Date#+()), call the abovewherecall.d and run it this way:# dtrace -s ./wherecall.d "Date" "+" -c "../ruby ./cal.rb" August 2005 S M Tu W Th F S 1 2 3 4 5 6 7 8 9 10 11 12 1314 15 16 17 18 19 2021 22 23 24 25 26 2728 29 30 31FILE LINE COUNT./cal.rb 102 2./cal.rb 60 6./cal.rb 63 41And looking at the indicated lines in cal.rb: 55 def pict(y, m) 56 d = (1..31).detect{|d| Date.valid_date?(y, m, d, @start)} 57 fi = Date.new(y, m, d, @start) 58 fi -= (fi.jd - @k + 1) % 7 59 60 ve = (fi..fi + 6).collect{|cu| 61 %w(S M Tu W Th F S)[cu.wday] 62 } 63 ve += (fi..fi + 41).collect{|cu| 64 if cu.mon == m then cu.send(@da) end.to_s 65 } 66So this is doing exactly what we would expect, given the code.Now, if we were interested in making this perform a little better,we might be interested to know the work that is being inducedby Date#+(). Here's a script that reports the classes andmethods called by a given class/method -- and its callees:#pragma D option quietruby$target:::function-entry/copyinstr(arg0) == $$1 && copyinstr(arg1) == $$2/{ self->date = 1;}ruby$target:::function-entry/self->date/{ @[strjoin(strjoin(copyinstr(arg0), "#"), copyinstr(arg1))] = count();}ruby$target:::function-return/copyinstr(arg0) == $$1 && copyinstr(arg1) == $$2/{ self->date = 0; ndates++;}END{ normalize(@, ndates); printf("Each call to %s#%s() induced:\\n\\n", $$1, $$2); printa("%@8d call(s) to %s()\\n", @);}Calling the above whatcalls.d, we can answer the questionabout Date#+() this way:# dtrace -s ./whatcalls.d "Date" "+" -c "../ruby ./cal.rb" August 2005 S M Tu W Th F S 1 2 3 4 5 6 7 8 9 10 11 12 1314 15 16 17 18 19 2021 22 23 24 25 26 2728 29 30 31Each call to Date#+() induced: 1 call(s) to Class#new0() 1 call(s) to Class#reduce() 1 call(s) to Date#+() 1 call(s) to Date#initialize() 1 call(s) to Fixnum#+() 1 call(s) to Fixnum#() 37 call(s) to Fixnum#[]() 52 call(s) to Fixnum#==()That's a lot of work for something that should be pretty simple! Indeed,it's counterintuitive that, say, Integer#gcd() would be calledfrom Date#+() -- and it certainly seems suboptimal. I'll leavefurther exploration into this as an exercise to the reader, but sufficeit to say that this has to do with the use of a rational number in theDate class --the elimination of which would elminate most of the above calls andpresumably greatly improve the performance of Date#+().Now, Ruby aficionados may note that some of the above functionalityhas been available in Ruby by setting theset_trace_func function (upon which theRuby profileris implemented). 
While that's true (to a point -- the set_trace_func seems to be a pretty limited mechanism), the Ruby DTrace provider is nonetheless a great lurch forward for Ruby developers: it allows developers to use DTrace-specific constructs like aggregations and thread-local variables to hone in on a problem; it allows Ruby-related work performed lower in the stack (e.g., in the I/O subsystem, CPU dispatcher or network stack) to be connected to the Ruby code inducing it; it allows a running Ruby script to be instrumented (and reinstrumented) without stopping or restarting it; and it allows multiple, disjoint Ruby scripts to be coherently observed and understood as a single entity. More succinctly, it's just damned cool. So thanks to Rich for developing the prototype provider -- and if you're a Ruby developer, enjoy!


Solaris

DTrace and PHP, demonstrated

If you were atmy presentation at OSCON yesterday,I apologize for the briskpace -- there was a ton of material to cover, and forty-five minutesisn't much time. Now that I've got a little more time,I'd like to get into thedetails of the PHP demonstration that I did during the presentation.Here is the (very) simple PHP program that I was using:<?phpfunction blastoff(){ echo "Blastoff!\\n\\n";}function one(){ echo "One...\\n"; blastoff();}function two(){ echo "Two...\\n"; one();}function three(){ echo "Three...\\n"; two();}function launch(){ three();}while (1){ launch(); sleep(1);}?>Running this in a window just repeats this output:% php ./blastoff.phpphp ./blastoff.phpContent-type: text/htmlX-Powered-By: PHP/5.1.0-devThree...Two...One...Blastoff!Three...Two...One...Blastoff!...Now, because I have specifiedWez Furlong'snewdtrace.soas an extension in myphp.inifile, when I run the above, two probes show up:# dtrace -l | grep php 4 php806 dtrace.so php_dtrace_execute function-return 5 php806 dtrace.so php_dtrace_execute function-entryThe function-entry and function-return probeshave three arguments:arg0 is the name of the function (a pointer to a stringwithin PHP)arg1 is the name of the file containing the call site (alsoa pointer to a string within PHP)arg2 is the line number of the call sitea pointer to a string within PHP)So for starters, let's just get an idea of which functions are being calledin my PHP program:# dtrace -n function-entry'{printf("called %s() in %s at line %d\\n", \\ copyinstr(arg0), copyinstr(arg1), arg2)}' -qcalled launch() in /export/php/blastoff.php at line 32called three() in /export/php/blastoff.php at line 27called two() in /export/php/blastoff.php at line 22called one() in /export/php/blastoff.php at line 16called blastoff() in /export/php/blastoff.php at line 10called launch() in /export/php/blastoff.php at line 32called three() in /export/php/blastoff.php at line 27called two() in /export/php/blastoff.php at line 22called one() in /export/php/blastoff.php at line 16called blastoff() in /export/php/blastoff.php at line 10\^CIf you're new to DTrace, note that you have the power to trace anarbitrary D expression in your action. For example, instead of printingout the file and line number of the call site,we could trace the walltimestamp:# dtrace -n function-entry'{printf("called %s() at %Y\\n", \\ copyinstr(arg0), walltimestamp)}' -qcalled launch() at 2005 Aug 5 08:08:24called three() at 2005 Aug 5 08:08:24called two() at 2005 Aug 5 08:08:24called one() at 2005 Aug 5 08:08:24called blastoff() at 2005 Aug 5 08:08:24called launch() at 2005 Aug 5 08:08:25called three() at 2005 Aug 5 08:08:25called two() at 2005 Aug 5 08:08:25called one() at 2005 Aug 5 08:08:25called blastoff() at 2005 Aug 5 08:08:25\^CNote, too, that (unless I direct it not to) this will aggregate acrossPHP instances. 
So, for example:#!/usr/sbin/dtrace -s#pragma D option quietphp\*:::function-entry{ @bypid[pid] = count(); @byfunc[copyinstr(arg0)] = count(); @bypidandfunc[pid, copyinstr(arg0)] = count();}END{ printf("By pid:\\n\\n"); printa(" %-40d %@d\\n", @bypid); printf("\\nBy function:\\n\\n"); printa(" %-40s %@d\\n", @byfunc); printf("\\nBy pid and function:\\n\\n"); printa(" %-9d %-30s %@d\\n", @bypidandfunc);}If I run three instances of blastoff.php and then the above script,I see the following after I \^C:By pid: 806 30 875 30 889 30By function: launch 18 three 18 two 18 one 18 blastoff 18By pid and function: 875 two 6 875 three 6 875 launch 6 875 blastoff 6 889 blastoff 6 806 launch 6 889 one 6 806 three 6 889 two 6 806 two 6 889 three 6 806 one 6 889 launch 6 806 blastoff 6 875 one 6The point is that DTrace allows you to aggregate across PHPinstances, allowing you to understand not just what a particularPHP is doing, but what PHP is doing more generally.If we're interested in better understanding the code flow in a particularPHP instance, we can write a script that uses a thread-local variableto follow the code flow:#pragma D option quietself int indent;php$target:::function-entry/copyinstr(arg0) == "launch"/{ self->follow = 1;}php$target:::function-entry/self->follow/{ printf("%\*s", self->indent, ""); printf("-> %s\\n", copyinstr(arg0)); self->indent += 2;}php$target:::function-return/self->follow/{ self->indent -= 2; printf("%\*s", self->indent, ""); printf(" %-20s %\*sphp\\n", copyinstr(arg0), 46 - self->indent, ""); self->indent += 2;}php$target:::function-return/self->follow/{ self->indent -= 2; printf("%\*s", self->indent, ""); printf(" %-20s %\*s%s\\n", probefunc, 46 - self->indent, "", probemod); self->indent += 2;}pid$target:libc.so.1::return/self->follow/{ self->indent -= 2; printf("%\*s", self->indent, ""); printf(" %-20s %\*skernel\\n", probefunc, 46 - self->indent, ""); self->indent += 2;}fbt:genunix::return/self->follow/{ self->indent -= 2; printf("%\*s", self->indent, ""); printf("


Solaris

DTrace and PHP

Tonight during our OpenSolaris BOF at OSCON, PHP core developer Wez Furlong was busy adding a DTrace provider to PHP. After a little bit of work (and a little bit of debugging), we got it working -- and damn is it cool. Wez implemented it as a shared object, which may then be loaded via an explicit extension directive in php.ini. Once loaded, two probes show up: function-entry and function-return. These probes have as their arguments a pointer to the function name, a pointer to the file name, and a line number. This allows one to, for example, get a count of all PHP functions being called:

# dtrace -n function-entry'{@[copyinstr(arg0)] = count()}'

Or you can aggregate on file name and quantize by line number:

# dtrace -n function-entry'{@[copyinstr(arg1)] = lquantize(arg2, 0, 5000)}'

Or you can determine the amount of wall time taken by a given PHP function:

# dtrace -n function-entry'/copyinstr(arg0) == "myfunc"/{self->ts = timestamp}' -n function-return'/self->ts/{@ = avg(timestamp - self->ts); self->ts = 0}'

And because it's DTrace, this can all be done on a production box -- and without regard to the number of PHP processes. (So if you have 200 Apache processes handling PHP, the above invocations would aggregate across them.)

When I get back, I'll download Wez's provider and post some more comprehensive examples. In the meantime, if you're a PHP developer at OSCON, stop Wez if you see him and ask him to give you a demo -- it's the kind of thing that needs to be seen to be appreciated... Finally, if you're interested in adding your own DTrace provider to the application, language or system that you care about, be sure to check out my presentation on DTrace tomorrow at 4:30 in Portland room 255. (Hopefully this time I won't be tortured by memories of being mindfucked by Inside UFO 54-40.)


Solaris

DTrace at OSCON

So I'll be at OSCON this week, where I'll be giving two presentations on DTrace. The first is a free tutorial that Keith and I will be giving on OpenSolaris development with DTrace. This tutorial is on Tuesday from 1:30p to 5p in room D140, and did I mention that this tutorial is free? So even if you didn't plan to attend any of the tutorials, if you're going to be in Portland on Tuesday afternoon, you should feel free to swing by -- no need to preregister. This tutorial should be fun -- we're going to keep it very hands-on, and it will be a demo-intensive introduction to both DTrace and to the larger tool-set that we use both to build OpenSolaris and to diagnose it when it all goes horribly wrong. Hopefully you can join us!

The second session is a presentation exclusively on DTrace. This is quite a bit shorter (it's 45 minutes), so this presentation is going to give a quick review of the DTrace fundamentals, and then focus on the confluence of DTrace and open source -- both what DTrace can do for your open source project, and what you can do (if you're so inclined) for the DTrace open source project. This session is on Thursday at 4:30 in Portland room 255.

Other than that, my schedule is filled only with odds and ends; if you're going to be at OSCON and you want to connect, drop me a line or leave a message for me at my hotel, the 5th Avenue Suites. See you in Portland!


Solaris

DTrace Safety

DTrace is a big piece of technology, and it can be easy to lose the principles in the details. But understanding these principles is key to understanding the design decisions that we have made -- and to understanding the design decisions that we will make in the future. Of these principles, the most fundamental is the principle of safety: DTrace must not be able to accidentally induce system failure. It is our strict adherence to this principle that allows DTrace to be used with confidence on production systems -- and it is its use on production systems that most fundamentally separates DTrace from what has come before it. Of course, it's easy to say that this should be true, but what does the safety constraint mean? First and foremost, given that DTrace allows for dynamic instrumentation, this means that the user must not be allowed to instrument code and contexts that are unsafe to instrument. In any sufficiently dynamic instrumentation framework, such code and contexts exist (if nothing else, the framework itself cannot be instrumented without inducing recursion), and this must be dealt with architecturally to assure safety. We have designed DTrace such that probes are provided by instrumentation providers that guarantee their safety. That is, instead of the user picking some random point to instrument, instrumentation providers make available only the points that can be safely instrumented -- and the user is restricted to selecting among these published probes. This puts the responsibility for instrumentation safety where it belongs: in the provider. The specific techniques that the providers use to assure safety are a bit too arcane to discuss here,[1] but suffice it to say that the providers are very conservative in their instrumentation. This addresses one aspect of instrumentation safety -- instrumenting wholly unsafe contexts -- but it doesn't address the recursion issue, where the code required to process a probe (a context that we call probe context) ends up itself encountering an enabled probe. This kind of recursion can be dealt with in one of two ways: lazily (that is, the recursion can be detected when it happens, and processing of the probe that induced the recursion can be aborted) or proactively (the system can be designed such that recursion is impossible). For a myriad of reasons, we elected for the second approach: to make recursion architecturally impossible. We achieve this by mandating that while in probe context, DTrace itself must not call into any facilities in the kernel at-large. This means both implicit and explicit transfers of control into the kernel-at-large -- so just as DTrace must avoid (for example) allocating memory in probe context, it must also avoid inducing scheduler activity by blocking.[2] Once the fundamental safety issues of instrumentation are addressed, focus turns to the safety of user-specified actions and predicates. Very early in our thinking on DTrace, we knew that we wanted actions and predicates to be completely programmable, giving rise to a natural question: how are they executed? For us, the answer was so clear that it was almost unspoken: we knew that we needed to develop a virtual machine that could act as a target instruction set for a custom compiler. Why was this the clear choice? Because the alternative -- to execute user-specified code natively in the kernel -- is untenable from a safety perspective. 
Executing user-specified code natively in the kernel is untenable for many reasons:

Explicit stores to memory would have to be forbidden. To allow for user-defined variables (as we knew we wanted to do), one must clearly allow data to be stored. But if one executes natively, one has no way of differentiating a store to legal variable memory from a stray store to arbitrary kernel state. One is reduced to either forbidding stores completely (destroying a critical architectural component in the process), rewriting the binary to add checks around the store, or emulating stores and manually checking the target address in the emulation code. The first option is unacceptable from a feature perspective, the second option is a non-trivial undertaking rife with new failure modes, and the third option -- emulating stores -- shouldn't be thought of as native execution, but rather as executing on a virtual machine that happens to match the underlying instruction set architecture.

Loops would have to be dynamically detected. One cannot allow user-defined code to spin on the CPU in the kernel, so loops must be dynamically detected and halted. Static analysis might be tempting, but in a Turing-complete system such analysis will always be heuristic -- one cannot solve the Halting Problem. While heuristics are fine when trying to achieve performance, they are not acceptable when correctness is the constraint. The problem must therefore be solved dynamically, and dynamic detection of loops isn't simple -- one must detect loops using any control transfer mechanism, including procedure calls. As with stores, one is reduced to either forbidding calls and backwards branches, rewriting the binary to add dynamic checks before all control transfers, or emulating them and manually checking the target address in the emulation code. And again, emulating these instructions negates the putative advantages of native execution.

Code would have to be very carefully validated for illegal instructions. For example, any instruction that operates on floating point state is (generally) illegal to execute in the kernel (Solaris -- like most operating systems -- doesn't save and restore the floating point registers on a kernel/user context switch; using the floating point registers in the kernel would corrupt user floating point state); floating point operations would have to be detected and code containing them rejected. There are many such examples (e.g. register-indirect transfers of control must be prohibited to prevent user-specified code from transferring control into the kernel at large, privileged instructions must be prohibited to prevent user-specified code from hijacking the operating system, etc. etc.), and detecting them isn't fail-safe: if one fails to detect so much as one of these cases, the entire system is vulnerable.

Executing natively isn't portable. This might seem counterintuitive because executing natively seems to "automatically" work on any instruction set architecture that the operating system supports -- it leverages the existing tool chain for compiling the user-specified code. But this leverage is Fools' Gold: as described above, the techniques to assure safety for native execution are profoundly specific to the instruction set -- and any new instruction set would require completely new validation code. And again, this isn't fail-safe: so much as one slip-up in a new instruction set architecture means that the entire system is at risk on the new architecture.
We left these many drawbacks of native execution largely unspoken because the alternative -- a purpose-built virtual machine for executing user-specified code -- was so clearly the better choice. The virtual machine that we designed, the D Intermediate Format (DIF) virtual machine, has the following safety properties:

It has no mechanism for storing to arbitrary addresses; the only store opcodes represent stores to specific kinds of user-defined variables. This solves two problems in a single design decision: it prohibits stores to arbitrary kernel memory, and it allows us to distinguish stores to different kinds of variables (global, thread-local, clause-local, etc.) from the virtual instruction opcode itself. This allows us to off-load intelligence about variable management from the instruction set and into the runtime where it belongs.

It has no backwards branches, and supports only calls to defined runtime support routines -- eliminating the possibility of user-defined loops altogether. This may seem unnecessarily severe (this makes loops an impossibility by architecture), but to us it was an acceptable tradeoff to achieve absolute safety.

It is sufficiently simple to be easily (and rigorously) validated, as can be seen from the straightforward DIF object validation code.

It is completely portable, allowing the validation and emulation code to be written and debugged once -- accelerating bringup of DTrace on new platforms.

Just having an appropriately restricted virtual machine addressed many safety issues, but several niggling safety issues still had to be dealt with explicitly:

Runtime errors like division-by-zero or misaligned loads. While a virtual machine doesn't solve these in and of itself, it makes them trivial to solve: the emulator simply refuses to perform such operations, aborting processing and indicating a runtime error.

Loads from I/O space. In the Solaris kernel, devices have memory that can be mapped into the kernel's address space. Loads from these memory ranges can have side-effects; they must be prohibited. This is only slightly more complicated than dealing with divisions-by-zero; before performing a load, the emulator checks that the virtual address does not fall in a range reserved for memory mapped devices, aborting processing if it does.

Loads from unmapped memory. Given that we wanted to allow user-defined code to chase pointers in the kernel, we knew that we had to deal with user-defined code attempting to load from an unmapped address. This can't be dealt with strictly in the emulator, as it would require probing kernel VM structures from probe context (which, if allowed, would prohibit instrumentation of the VM system). We dealt with this instead by modifying the kernel's page fault handler to check if a load has been DIF-directed before vectoring into the VM system to handle the fault. If the fault is DIF-directed, the kernel sets a bit indicating that the load has faulted, increments the trapping instruction pointer past the faulting load, and returns from the trap. The emulation code checks the faulted bit after emulating each instruction, aborting processing if it is set.

So DTrace is not safe by accident -- DTrace is safe by deliberate design and by careful execution. DTrace's safety comes from a probe discovery process that assures safe instrumentation, a purpose-built virtual machine that assures safe execution, and a careful implementation that assures safe exception handling.
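To make the preceding concrete, here's a minimal (and deliberately contrived) demonstration of both halves of the design; the specific function named in the first command, kmem_alloc, is just an arbitrary illustrative choice:

# dtrace -l -f kmem_alloc
# dtrace -n 'BEGIN{trace(*(int *)0)}'

The first command lists the probes that providers have published for kmem_alloc -- the only instrumentation points you can enable; there is no way to ask DTrace to instrument an arbitrary address. The second enables the BEGIN probe with an action that loads from an unmapped address: rather than a panic, the DIF emulator aborts the action and dtrace reports a runtime error to the effect of "invalid address (0x0)" -- the faulted-bit handshake with the page fault handler doing exactly what it was designed to do.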
Could a safe system for dynamic instrumentation be built with a different set of design decisions? Perhaps -- but we believe that were such a system to be as safe, it would either be so under-powered or so over-complicated as to invalidate those design decisions.

[1] The best place to see this provider-based safety is in the implementation of the FBT provider (x86, SPARC) and in the implementation of the pid provider (x86, SPARC).

[2] While it isn't a safety issue per se, this led us to two other important design decisions: probe context is lock-free (and almost always wait-free), and interrupts are disabled in probe context.

Technorati tags: OpenSolaris Solaris DTrace


Solaris

Using DTrace to understand GNOME

I read with some interest about the GNOME startup bounty. As Stephen O'Grady pointed out, this problem is indeed perfect for DTrace. To get a feel for the problem, I wrote a very simple D script:

#!/usr/sbin/dtrace -s

#pragma D option quiet

proc:::exec-success
/execname == "gnome-session"/
{
	start = timestamp;
	go = 1;
}

io:::start
/go/
{
	printf("%10d { -> I/O %d %s %s %s }\n",
	    (timestamp - start) / 1000000, pid, execname,
	    args[0]->b_flags & B_READ ? "reads" : "writes",
	    args[2]->fi_pathname);
}

io:::done
/go/
{
	printf("%10d { %d %s (from %d %s)\n",
	    (timestamp - start) / 1000000, pid, execname,
	    curpsinfo->pr_ppid, self->parent);
	self->parent = NULL;
}

proc:::exit
/go/
{
	printf("%10d "

in the output. This is due to I/Os being induced by lookups that are going through directories and/or symlinks that haven't been explicitly opened. For these I/Os, my script also gives you the pointer to the vnode structure in the kernel; you can get the path to these by using ::vnode2path in MDB:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp usba uhci s1394 random nca lofs nfs audiosup sppp ptm ipc ]
> ffffffff8f6851c0::vnode2path
/usr/sfw/lib/libXrender.so.1
>

Yes, having to do this sucks a bit, and it's a known issue. And rumor has it that Eric even has a workspace with the fix, so stay tuned... Update: Eric pointed me to a prototype with the fix, and I reran the script on a GNOME cold start; here is the output from that run. Interestingly, because the symlinks now show up, a little postprocessing reveals that we chased nearly eighty symlinks on startup! From the output, reading many of these took 10 to 20 milliseconds; we might be spending as much as one second of GNOME startup blocked on I/O just to chase symlinks! Ouch! Again, it's not clear how one would fix this; having an app link to libpango-1.0.so.0 and not libpango-1.0.so.400.1 is clearly a Good Thing, and having this be a symlink instead of a hardlink is clearly a Good Thing -- but all of that goodness leaves you with a read dependency that's hard to work around. Anyway, be looking for Eric's fix in OpenSolaris and then in an upcoming Solaris Express release; it makes this kind of analysis much easier -- and thanks Eric for the quick prototype!

Technorati tags: OpenSolaris Solaris DTrace GNOME
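A quick postscript: if you'd rather quantify where the I/O time goes than trace every operation, an aggregating variant along the same lines is worth a try. This is only a sketch -- it reuses the gnome-session trigger from the script above, and it borrows the b_edev/b_blkno request keying used by the standard io provider examples; the output format is just my guess at what's useful:

#!/usr/sbin/dtrace -s

#pragma D option quiet

proc:::exec-success
/execname == "gnome-session"/
{
	go = 1;
}

io:::start
/go/
{
	ts[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/go && ts[args[0]->b_edev, args[0]->b_blkno]/
{
	@[args[2]->fi_pathname] =
	    sum(timestamp - ts[args[0]->b_edev, args[0]->b_blkno]);
	ts[args[0]->b_edev, args[0]->b_blkno] = 0;
}

END
{
	trunc(@, 20);
	printf("%-60s %s\n", "FILE", "TOTAL I/O NANOSECONDS");
	printa("%-60s %@d\n", @);
}

Run it, cold-start GNOME, and ^C once the desktop is up: the twenty files that accounted for the most I/O time pop out -- symlinks and all.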


Solaris

Still more blog sifting...

Even though the launch of OpenSolaris was well over a week ago, and even though the Opening Day entries have now been sifted through in five different blog entries (here, here, here, here and here), there are still some great uncategorized entries. Without further ado...

Zones. Zones are among the most talked-about new features in Solaris, and the engineers on the team have developed some highly-readable Opening Day entries describing the implementation. Start with David Comay's entry taking you on a tour of zone state transitions, and then check out Dan Price's entry to extend your tour to the innards of the zones console. Dan's tour is notable because he takes you by an ASCII art comment of his that is one of the more elaborate in Solaris. (One of these days, I'll take you on a tour of my favorite ASCII art comments in the source base -- but today we have too much to see, so everyone back in the car!) Wrap up your tour of zones with John Beck's entry describing adding command-line editing and history to zonecfg. This entry has applicability beyond zones -- it's a useful how-to for adding command-line editing and history to any C-based or C++-based application.

Booting. The most exciting project to integrate into Solaris since Solaris 10 shipped is surely the rearchitecture of the booting system to use GRUB -- and Shudong Zhou's entry describing testing the code at Fry's is a must-read. For more details, check out Jan Setje-Eilers' entry describing the new boot architecture. Look for more from these two -- using GRUB allows for many new possibilities, and Shudong and Jan have much to blog about.

Solaris Volume Manager. Despite the fact that it could save them gobs in licensing fees to certain third parties, many seem to not realize that we bundle an industrial-strength volume manager with Solaris. Hell-bent on describing what they've been working on for so long, the Solaris Volume Manager team was out in force on Opening Day. If you have any storage responsibilities in your shop and you're not running SVM, you should carefully read these entries; SVM may well save you a bundle -- making you such a hero that you'll be able to tell the suits exactly what they can do with their TPS report.
Andre Molyneux on RAID 0+1 vs. RAID 1+0 and again on the role of timing in testing RAID 5 in SVM.
Jerry Jelinek on improving SVM response to disk failure.
Sanjay Nadkarni on resync regions and optimized resyncs -- a feature which VxVM users will recognize as the SVM equivalent of dirty region logging (DRL).
Susan Kamm-Worrell on multi-owner disksets, an SVM feature that allows multiple nodes to simultaneously access volumes.
Steve Peng on disk relocation in SVM.
Tony Nguyen on the SVM default interlace and resync buffer values.

Miscellany. There are a handful of great entries that don't fit neatly into any one category -- or are so specific as to be their own category. Be sure not to miss these:
Jeff Bonwick on revealing the origins of the slab allocator.
Phil Harman on getting getenv to scale. Because so much Solaris scalability work happened many years ago, Phil's is the only entry describing the specifics of getting a subsystem to scale with CPUs; Phil's entry should be considered a must-read for anyone working on scalability.
Peter Memishian on using doors as a synchronization primitive.
Peter's work is a good example of why -- contrary to the beliefs of some pinheads -- developing a user-level service provider can be quite a bit more challenging than developing in the kernel.
Tim Marsland continuing his magnum opus on the implementation of Solaris 10 on x64 with Part 4: Userland.
Ienup Sung on efficiently handling illegal UTF-8 byte sequences.
Chandan describing the implementation of the OpenSolaris Source Browser.
Cyndi Eastham on developing libavl.
Chris Beal describing the implementation of signal delivery.
Dave Miner leading a tour of the DHCP server.
Dave Powell on the implementation of pseudo-filesystems -- and why /system is the new home for all such filesystems.
Jonathan Adams on macros and powers of two -- and the simple pleasures of bit-twiddling.
Tom Erickson on the implementation of libdtrace. libdtrace is still a private interface (we still have some sanding and polishing to do), but Tom's entry is a must-read for anyone considering writing their own DTrace consumer.
Keith M Wesolowski on getting SPARC inlines to work with GCC.

Phew! I think that about does it. When we first tried to encourage Solaris engineers to blog on Opening Day, I thought we were going to have a hard time convincing engineers to blog -- I knew that providing in-depth, technical content takes a lot of time, and I knew that everyone had other priorities. So when we were planning the launch and talking about the possibility of dealing with a massive amount of Opening Day content, my response was "hurt me with that problem." Well, as it turns out, most engineers didn't need much convincing -- many provided rich, deep content -- and I was indeed hurt with that problem! While it was time consuming to sift through them, hopefully you've enjoyed reading these entries as much as I have. And let it be said once more: welcome to OpenSolaris!

Technorati tags: OpenSolaris Solaris DTrace


Solaris

Yet more blog sifting...

Despite there being now four blog entries to sift through the Opening Day entries (one from me, one from Liane, one from Claire and another from me), there are still some great entries that have gone uncategorized. Many of these entries fall into the category of "debugging war stories" -- engineers describing a particularly tough or interesting bug that they nailed. The proliferation of these kinds of stories among the Opening Day entries is revealing: in some organizations, you're a hero among engineers if you come up with a whizzy new app (even if the implementation is a cesspool), while in others you might gain respect among engineers by getting a big performance win (even if it makes things a bit flakey), but in Solaris development, nothing gains the respect of peers like debugging some diabolical bug that has plagued the operating system for years. So if you're new to OpenSolaris and you're looking to make a name for yourself, a good place to start is in the never-ending search for elusive bugs. To get a flavor for the thrill of the hunt, check out:

George Shepherd on using DTrace to debug a nasty STREAMS bug.
Jim Carlson on debugging a hang in ldtermclose. You can infer how much we obsess about debugging from the first line of Jim's entry: "Every once in a while, a bug sticks with you long enough that you remember the ID number without having to think about it."
Adam Leventhal on debugging a cross call hang.
Chris Gerhard on a two line fix in init.
Sarah Jelinek on debugging a UFS file truncation bug. For me personally, it's always very gratifying to see a bug like Sarah's -- where DTrace was guided by expert hands to help out on a tough problem.
Sherry Moore on debugging a bug at the murky interface between compiler and operating system.
Saurabh Mishra on debugging a memory ordering problem.
Alok Aggarwal on debugging an NFSv4 problem on SPARC.
Narayana Kadoor on debugging a logic error in dynamic intimate shared memory (DISM).
Surya Prakki on debugging kernel memory corruption, surely the most pernicious of software pathologies.
Raja Gopal Andra on debugging an application bug.
Martin Englund on tracking down a bug in the audit daemon.
Peter Harvey on increasing UNIX group membership. In this entry, Peter doesn't dwell on the bug -- which is clear in this case -- but rather the complexities of fixing it. It's a good example of how something that appears simple can be stubbornly complicated.
Prabahar Jeyaram on debugging a nasty panic in the UFS lockfs protocol.

And here are a few more on the simple (but thorough) satisfaction from fixing old bugs:

Paul Roberts on an ancient bug in xargs.
Stacey Marshall on a four year old bug in the name service switch. Stacey's entry touches on the larger satisfaction of going from a bug in a strange subsystem to understanding the code, developing the right fix, verifying the fix and then writing the test case to be sure. The dedication to this kind of craftsmanship -- taking the time to do it the Right Way, even for small stuff -- is what has drawn many of us to Solaris; it is, as Stacey says, "what I love about this job."
Darren Moffat on a nine year old bug in usermod.

That about does it for today. There are still many more entries to categorize, but fortunately, I think I can see the light at the end of the tunnel -- or is that just the approach of death from the exhaustion of sifting through all of this content?

Technorati tags: OpenSolaris Solaris DTrace


Solaris

More blog sifting

If you didn't see it, Liane Praza picked up where my sifting left off, adding a blog entry pointing to more Opening Day entries -- this time in the categories of devices and device configuration, security, networking, and standards. But there are still a ton of entries to categorize, so picking up again in no particular order...

System calls. System calls are among the most fundamental mechanisms in operating systems: they are the mechanism by which untrusted, unprivileged software requests a service of trusted, privileged software. We are lucky to have two great entries describing the architecture-specific mechanisms of system calls in Solaris: check out Russ Blaine's entry on system calls on x86, and Gavin Maltby's entry on system calls on SPARC. Then, to understand the architecture-neutral aspects of system calls, head over to Eric Schrock's entry on how to add a system call. As a quick aside, that last entry is a great example of how we in Solaris Kernel Development are using blogs to write down information that (believe it or not) has just been an unspoken part of the craft before now. As Tim Bray observed, blogs have become a critical conduit of information for us -- we believe that they are the most scalable way to get information from the people who have it to the people who need it. If (when?) you become an OpenSolaris developer, you can expect some friendly peer pressure to create a blog and join the party.

Build process and workspace management. We pride ourselves on a seamless build process, and a couple of entries have gone into various aspects of this in depth. To give you an idea of how seriously we take the build process -- and why -- check out Scott Rotondo's entry on using lint to find security vulnerabilities. In particular, note what Scott says when he added a new lint option that generated 500 new warnings: "I needed to fix all of these before integrating my change to Makefile.master because we require the Solaris source to be lint-clean." To which I add only, "dammit." Next, head over to Jim Carlson's entry describing the work he did to support non-root builds. Jim's entry demonstrates how difficult it is to radically change the build process -- and how he managed to pull it off. Finally, if you want to really let your makefile flag fly, check out Mark Nelson's entry describing the build support for localized messages. In terms of workspace management, you'll want to check out Will Fiveash's entry describing our workspace management tool, wx. For a long time, wx was a shell script in Bonwick's home directory. It was incredibly useful, but it was also easy to accidentally blow your brains out. (As Bart is fond of saying, it was "all blade and no handle.") Will's rewrite made for a much safer, much more sophisticated wx -- and it was a huge help to us in automating the final approach of the DTrace integration.

Debuggability.
If you read just a couple of the Opening Day entries, you probably noticed a trend: many of the entries were about finding some nasty bug in the system. This is an accurate reflection of our ethos in developing Solaris: the operating system must be reliable above all else, and we view debugging the operating system as our primary responsibility. This responsibility runs deeper than just the act of debugging, because our needs so outstripped existing tools that we designed and built our own -- most notably mdb and DTrace. Fortunately, we ship these tools to you, so you can use them on your own system and on your own applications. There are many entries describing these tools and how they were used to tackle a problem. Fittingly, a good place to start is Mike Shapiro's entry describing using mdb to debug a sendmail bug. This bug is described in 4278156, which has one of the greatest bug synopses of all time: "sendmail died in a two SIGALRM fire." For more on the power of mdb, take a look at Eric Saxe's entry on using mdb to debug a scheduling problem, Ashish Mehta's entry on using mdb to debug a race condition, and Eric Kustarz's entry demonstrating an mdb debugger command ("dcmd") that he wrote to retrieve NFSv4 recovery messages postmortem. This last example is a particularly good one because this is exactly the kind of custom debugging infrastructure that mdb's modular architecture makes easy to build. For a comprehensive example of how we have developed subsystem-specific debugging infrastructure, read Sasha Kolbasov's entry on the mdb dcmds related to STREAMS. As Sasha mentions, the place to start for learning to write your own modules is the documentation -- but you can get a flavor for it by reading Yu Xiangning's entry on writing a module for kmdb. kmdb is the in-situ kernel debugger that implements mdb, and when you need it, nothing else will do -- as Dan Mick describes in his entry on debugging with kmdb and moddebug. For more details on kmdb itself, check out Matt Simmons' entry on kmdb's design and implementation. To see how mdb can help debug your application, take a look at Will Fiveash's notes on debugging application memory problems. Will mentions ::findleaks, a debugger command that I originally implemented for kernel crash dumps, and that Jonathan Adams subsequently ported to work on application core files and -- as he mentions in his entry -- reworked substantially in the process. While mdb is the acme of postmortem debugging, if the manifestation of a bug is non-fatal, it's often more effective to use DTrace to debug it. For an example of this, look at Bart Smaalders' entry on using DTrace to debug jitter. It was gratifying to see Bart debug this problem using DTrace, because latency bubbles were actually one of the motivating pathologies behind DTrace. And finally, debuggability doesn't end with tools; subsystems must be designed with debuggability in mind, as Stephen Hahn describes in his entry on designing libuutil for debuggability.

I think that about does it for today. As someone pointed out on Liane's blog, we need a Wiki for this; we agree -- it's on the list of planned enhancements for opensolaris.org. Until then, stay tuned for more sifting...

Technorati tags: OpenSolaris Solaris DTrace mdb


Solaris

Sifting through the blogs...

Yesterday was Opening Day for OpenSolaris, and we welcomed OpenSolaris with hundreds of blog entries describing various aspects of the implementation. The breadth and depth of our blogging will hopefully put to rest any notion that open sourcing Solaris isn't a grass-roots effort: if nothing else, it should be clear that we in the trenches are very excited to finally be able to talk about the system that we have poured so much of our lives into -- and to welcome new would-be contributors into the fold. In our excitement, we may have overwhelmed a tad: there was so much content yesterday that it would have been impossible for anyone to keep up -- we blogged over 200,000 words (over 800 pages!) yesterday alone. So over the next few days, I want to highlight some entries that you might have missed, broken down by subject area. In no particular order...

Fault management. Fault management in Solaris 10 has been completely revolutionized by the new predictive self-healing feature pioneered by my longtime co-conspirator Mike Shapiro. There are two must-read entries in this area: Andy Rudoff's entry providing a predictive self-healing overview, and Dilpreet Bindra's entry going into more depth on PCI error handling. (If for nothing else, read Dilpreet's entry for his Reading of the Vows between OpenSolaris and the Community.)

Virtual memory. The virtual memory system is core to any modern operating system, and there are several interesting entries here. Start with Eric Lowe's extensive entry describing page fault handling. As Eric rightly points out, page fault handling is the epicenter of the VM system; one can learn a tremendous amount about the system just by following page fault processing -- and Eric is a great guide on this journey. Once you've read Eric's entry, check out Michael Corcoran's entry on page coalescing, a technique to assure availability of large-sized pages -- which are in turn necessary to increase TLB reach. And discussion of page_t's naturally brings you to Rick Mesta's entry describing a big performance win by prefetching these structures during boot. A less-discussed aspect of virtual memory is the virtual memory layout of the kernel itself. To learn about some of the complexities of this, check out Kit Chow's entry on address space limitations on 32-bit kernels. The limitation that Kit describes is one of the nasty gotchas of running 32-bit x86 in flat mode. As Kit mentions, the best workaround is to run a 64-bit kernel -- but if you're stuck with a 32-bit x86 chip, you'll want to read Kit's suggestions carefully. Kit's entry is a good segue to Prakash Sangappa's entry describing his work on dynamic segkp for 32-bit x86 systems. Prakash's work was critical for getting some more breathing space on 32-bit x86 systems -- saving hundreds of megabytes of precious VA. Of course, the ultimate breathing space is that afforded by 64 bits of VA -- and in this vein check out Nils Nieuwejaar's entry on the kernel address space layout on x64. Both Prakash and Nils quote one of those comments in the kernel source code that you really need to know about if you're going to do serious kernel development: the comment describing the address space layout in i86pc/os/startup.c and sun4/os/startup.c. This comment is one of the canonical ASCII-art comments (more on these eventually), and I usually find these comments in startup.c by searching forward for "----".

Linking and Loading.
One of the most polished subsystems in Solaris is the linker and loader -- the craftsmanship of the engineers that have built it has been an ongoing inspiration for many of us in Solaris development. To learn more about the linker, start with Rod Evans' entry taking you on a source tour of the link-editors, and then head over to Mike Walker's entry describing library bindings. As long as you're checking out the linker, be sure to look at past entries like Rod's entry on the tracing of a link-edit. As you can imagine, because the dynamic linker is invoked whenever a dynamically-linked binary is executed, it's a natural place to improve performance -- especially with complicated programs like Mozilla or StarOffice that are linked to hundreds (!) of shared objects. We've certainly found some big wins in the linker over the years, but we've also discovered that it's difficult to help megaprograms without hurting nanoprograms -- and vice versa. For an interesting description of this tradeoff, check out David Safford's entry on dynamic linker performance. If nothing else, you'll see from David's work the research element of operating system development: we often aren't assured of success when we endeavor to improve the system.

Scheduling. CPU scheduling is one of the most basic properties of a multitasking operating system. Despite it being an old problem, we find ourselves constantly improving and extending this subsystem. To learn about CPU scheduling, start with Bill Kucharski's entry describing the architecture-specific elements of context switching. Then head over to Gavin Maltby's entry describing the short-term prevention of thread migration. (Before Gavin introduced this facility, the only way to prevent migration was to prevent kernel preemption -- an overly blunt mechanism that led to a really nasty latency bubble that I debugged many years ago.) If you're going to understand thread dispatching, you'll need to understand the way thread state is manipulated -- and for that you'll want to look at Saurabh Mishra's entry describing thread locks. Thread locks are different from normal synchronization primitives, as you can infer from my own entry describing a bug in user-level priority inheritance -- which is a good segue to a more general problem when dealing with thread control: how does one change the scheduling properties of a running thread? For an idea of how tricky this can be, check out Andrei Dorofeev's entry describing binding processes to resource pools. Andrei's problem was even more challenging than traditional thread manipulation, as he needed to change the scheduling properties of a group of threads atomically. If for no other reason, you should read Andrei's entry to learn of the "curse of disp.c." Speaking of the cursed, wrap up your tour of scheduling with Eric Saxe's entry describing debugging a wedged kernel -- you'll see from Eric's odyssey that scheduling problems can require a lot of brain-bending (and patience) to debug!

Okay, I think that's enough for today -- and yet it barely scratches the surface! I didn't even touch on gigantic topics with many Opening Day entries like security, networking, I/O, filesystems, performance, scheduling, service management, observability, etc. etc. Stay tuned -- or check out the Opening Day entries for yourself...

Technorati tags: OpenSolaris Solaris


Solaris

OpenSolaris Sewer Tour

Having OpenSolaris available today is an odd phenomenon in some ways. For me, having spent virtually my entire professional engineering life developing Solaris, opening the source is like having tourists suddenly flock to your hometown. And as the proud and unabashed local that I am, I feel compelled to welcome newcomers with a little bit of a personal tour through the source. But I don't want to lead the kind of bus tour that you'll get from the big tour operators (not that you don't want to take in those sights of course); I'd rather take you on something like a sewer tour through the city's underbelly: I want to show you the water mains and the power grid and the telco infrastructure that make Solaris what it is. Solaris is the New York City of operating systems: it is what you don't see that makes what you do see possible -- and like New York City, it's what you don't see that's really interesting...[1]

So, with that preamble, let's do a little source spelunking into the core of Solaris. A word of warning: we're going to go deep into the system, and this won't be for everyone. (Or even, perhaps, anyone?) But if you're ready to have your brain bent a bit, grab yourself a cup of coffee, close your office door and read on...

For today, I want to fully explain the context around this somewhat cryptic comment of mine in turnstile_block in usr/src/uts/common/os/turnstile.c:

	/*
	 * Follow the blocking chain to its end, willing our priority to
	 * everyone who's in our way.
	 */
	while (t->t_sobj_ops != NULL &&
	    (owner = SOBJ_OWNER(t->t_sobj_ops, t->t_wchan)) != NULL) {
		if (owner == curthread) {
			if (SOBJ_TYPE(sobj_ops) != SOBJ_USER_PI) {
				panic("Deadlock: cycle in blocking chain");
			}

			/*
			 * If the cycle we've encountered ends in mp,
			 * then we know it isn't a 'real' cycle because
			 * we're going to drop mp before we go to sleep.
			 * Moreover, since we've come full circle we know
			 * that we must have willed priority to everyone
			 * in our way.  Therefore, we can break out now.
			 */
			if (t->t_wchan == (void *)mp)
				break;

There's quite a story behind this comment -- both in human and engineering terms. The human side of the story is that this code was added in the last moments of Solaris 8, in one of the more intense experiences of my engineering career: a week-long collaboration with Jeff Bonwick that required so much shared mental state that he and I both came to call it "the mind-meld." The engineering story is quite a bit more intricate, and requires a lot more background. It starts earlier in Solaris 8, when we developed an infrastructure for user-level priority inheritance -- a mechanism to avoid priority inversion. For those unfamiliar with the problem of priority inversion, it is this: given three threads at three different priorities, if the highest priority thread blocks on a synchronization object held by the lowest priority thread, the middling priority thread could (in a pure priority preemptive system running on a uniprocessor) run in perpetuity. This is an inversion, and one mechanism to solve it is something called priority inheritance. Under priority inheritance, when one thread is going to block on a lower priority thread, the higher priority thread wills its priority to the lower priority thread. That is, the lower priority thread inherits the higher priority for the duration of the critical section. Now, in Solaris we have long had priority inheritance for kernel synchronization primitives -- indeed, this is one of the architectural differences between SunOS 4.x and Solaris 2.x.
And just getting priority inheritance right for kernel synchronization primitives is nasty: one must know who owns a lock, and one must know which lock that owner is blocked on (if any). That is, if a thread is blocking on a lock that itself is owned by a blocked thread, we need to be able to determine what lock the blocked thread is blocked on, and which thread owns that lock. In order to avoid missed wakeups, we must do this without one of the threads in the blocking chain changing its state. (That is, we don't want to conclude that a thread that we're blocking on isn't blocked itself, only to have it block before we go to sleep -- this would create a window for potential priority inversion.) The way that we traditionally prevent a thread from moving underneath us is to grab its thread lock (a clever synchronization mechanism that merits its own lengthy blog entry[2]), but this implies that we're going to be simultaneously holding two locks of the same type. This is a problem because -- in any parallel system -- simultaneously holding more than one lock of the same type is a tricky proposition: there is a possibility of deadlock if the lock acquisition order is not rigidly defined. This is a particular problem for thread locks, because both thread locks can be locks on turnstile hash chains -- and turnstile hash chains can be used by disjoint threads simultaneously. (A turnstile is the mechanism that we use to put a thread to sleep.) Jeff solved this particular problem of turnstile deadlocking in an elegant way; I'll let his comment in turnstile_interlock do the explaining:

/*
 * When we apply priority inheritance, we must grab the owner's thread lock
 * while already holding the waiter's thread lock.  If both thread locks are
 * turnstile locks, this can lead to deadlock: while we hold L1 and try to
 * grab L2, some unrelated thread may be applying priority inheritance to
 * some other blocking chain, holding L2 and trying to grab L1.  The most
 * obvious solution -- do a lock_try() for the owner lock -- isn't quite
 * sufficient because it can cause livelock: each thread may hold one lock,
 * try to grab the other, fail, bail out, and try again, looping forever.
 * To prevent livelock we must define a winner, i.e. define an arbitrary
 * lock ordering on the turnstile locks.  For simplicity we declare that
 * virtual address order defines lock order, i.e. if L1 < L2, then the
 * correct lock ordering is L1, L2.  Thus the thread that holds L1 and
 * wants L2 should spin until L2 is available, but the thread that holds
 * L2 and can't get L1 on the first try must drop L2 and return failure.
 * Moreover, the losing thread must not reacquire L2 until the winning
 * thread has had a chance to grab it; to ensure this, the losing thread
 * must grab L1 after dropping L2, thus spinning until the winner is done.
 * Complicating matters further, note that the owner's thread lock pointer
 * can change (i.e. be pointed at a different lock) while we're trying to
 * grab it.  If that happens, we must unwind our state and try again.
 *
 * On success, returns 1 with both locks held.
 * On failure, returns 0 with neither lock held.
 */

This lock ordering issue is part of what made it difficult to implement priority inheritance for kernel synchronization objects -- but priority inheritance for kernel synchronization objects only solves part of the larger priority inversion problem: in a multithreaded real-time system, one needs priority inheritance for both kernel-level and user-level synchronization objects.
And this problem -- user-level priority inheritance -- is the problem that we endeavored to solve in Solaris 8. We assigned an engineer to solve it, and (with extensive guidance from those of us who best understand the guts of scheduling and synchronization), the new facility was integrated in October of 1999. A few months later -- in December of 1999 -- I was looking at a crash dump from an operating system panic that a colleague had encountered. It was immediately clear that this was some sort of defect in our implementation of user-level priority inheritance, but as I understood the bug, I came to realize that this was no surface problem: this was a design defect. Here is my analysis:

[ bmc, 12/13/99 ]

The following sequence of events can explain the state in the dump:

Thread A (300039c8580), executing on CPU 10:
  - Calls lwp_upimutex_lock() on lock 0xff350000
  - lwp_upimutex_lock() acquires upibp->upib_lock
  - lwp_upimutex_lock(), seeing the lock held, calls turnstile_block()
  - turnstile_block():
      - Acquires A's thread lock
      - Transitions A into TS_SLEEP
      - Drops A's thread lock
      - Drops upibp->upib_lock
      - Calls swtch()

The holder of 0xff350000 releases the lock, explicitly handing it off to thread A (and thus setting upi_owner to 300039c8580).

Thread A:
  - Returns from turnstile_block()

Thread B (30003c492a0), executing on CPU 4:
  - Calls lwp_upimutex_lock() on lock 0xff350000
  - lwp_upimutex_lock() acquires upibp->upib_lock
  - Seeing the lock held (by A), calls turnstile_block()

Thread A:
  - Calls lwp_upimutex_owned() to check for lock hand-off

Thread B, in turnstile_block():
  - Acquires B's thread lock
  - Transitions B into TS_SLEEP, setting B's wchan to the upimutex corresponding to 0xff350000
  - Attempts to promote the holder of 0xff350000 (Thread A)
  - Acquires A's thread lock
  - Adjusts A's priority
  - Drops A's thread lock

Thread A:
  - lwp_upimutex_owned() attempts to acquire upibp->upib_lock
  - upibp->upib_lock is held by B; calls into turnstile_block() through mutex_vector_enter()

Thread B:
  - Drops upibp->upib_lock

Thread A, in turnstile_block() (blocking on upib_lock, held by Thread B):
  - Acquires B's thread lock
  - Adjusts B's priority
  - Drops B's thread lock
  - Seeing that B's wchan is not NULL, attempts to continue priority inheritance
  - Calls SOBJ_OWNER() on B's wchan
  - Seeing that the owner of B's wchan is A, panics with "Deadlock: cycle in blocking chain"

As the above sequence implies, the problem is in turnstile_block():

	THREAD_SLEEP(t, &tc->tc_lock);
	t->t_wchan = sobj;
	t->t_sobj_ops = sobj_ops;
	...
	/*
	 * Follow the blocking chain to its end, or until we run out of
	 * inversions, willing our priority to everyone who's in our way.
	 */
	while (inverted && t->t_sobj_ops != NULL &&
	    (owner = SOBJ_OWNER(t->t_sobj_ops, t->t_wchan)) != NULL) {
		...
	}
(1) -->	thread_unlock_nopreempt(t);

	/*
	 * At this point, "t" may not be curthread.  So, use "curthread", from
	 * now on, instead of "t".
	 */
	if (SOBJ_TYPE(sobj_ops) == SOBJ_USER_PI) {
(2) -->		mutex_exit(mp);
	...

We're dropping the thread lock of the blocking thread (at (1)) before we drop the upibp->upib_lock at (2). From (1) until (2) we are violating one of the invariants of SOBJ_USER_PI locks: when sleeping on a SOBJ_USER_PI lock, _no_ kernel locks may be held; any held kernel locks can yield a deadlock panic.

Once I understood the problem, it was disconcertingly easy to reproduce: in a few minutes I was able to bang out a test case that panicked the system in the same manner as seen in the crash dump.[3] While I had some ideas on how to fix this, the late date in the release and the seriousness of the problem prompted me to call Jeff at home to discuss.[4] As Jeff and I discussed the problem, we couldn't seem to come up with a potential solution that didn't introduce a new problem. Indeed, the more we talked about the problem the harder it seemed. The essence of the problem is this: for user-level locks, we normally keep track of the state associated with the lock (for example, whether or not there's a waiter) at user-level -- and that information is considered purely advisory by the kernel. (There are several situations in which the waiters bit can't be trusted, and the kernel knows not to trust it in these situations.) To implement priority inheritance for user-level locks, however, one must become much more precise about ownership -- the ownership must be tracked the same way we track ownership for kernel-level synchronization primitives. That is, when we're doing the complicated thread lock dance that I described above, we can't be doing loads from user-level memory to determine ownership.[5] Here's the nasty implication of this: the kernel-level state tracking the ownership of the user-level lock must itself be protected by a lock, and that (in-kernel) lock must itself implement priority inheritance to avoid a potential inversion. This leads us to a deadlock that we did not predict: the in-kernel lock must be acquired and dropped to both acquire the user-level lock and to drop it. That is, there are conditions in which a thread owns the in-kernel lock and wants the user-level lock, and there are conditions in which a thread owns the user-level lock and wants the in-kernel lock. (The upib_lock is the in-kernel lock that implements this. Knowing this, you might want to pause to reread my description of the race condition above.) The above bug captures one manifestation of this, but Jeff and I began to realize that there must be another manifestation lurking: if one were blocking on the in-kernel lock when the false deadlock was discovered, the kernel would clearly panic. But what if one were blocking on the user-level lock when the false deadlock was discovered? We quickly determined (and a test case confirmed) that in this case, the attempt to acquire the user-level lock would (erroneously) return EDEADLK. That is, in this case, the code saw that the "deadlock" was induced by a user-level synchronization primitive, and therefore assumed that it was an application-induced deadlock -- a bug in the application. So in this failure mode, a correct program would have one of its calls to pthread_mutex_lock erroneously fail -- a failure mode even more serious than a panic because it could easily lead to the application corrupting its data. So how to solve these problems? We found this to be a hard problem because we kept trying to find a way to avoid that in-kernel lock.
(I have presented the in-kernel lock as a natural constraint on the problem, but that was a conclusion that we only came to with tremendous reluctance.) Whenever one of us came up with some scheme to avoid the lock, the other would find some window invalidating the scheme. After exhausting ourselves on the alternatives, we were forced to the conclusion that the in-kernel lock was a constraint on the problem -- and our focus switched from avoiding the situation to detecting it. There are two cases to detect: the panic case and the false deadlock case. The false deadlock case is actually pretty easy to detect and handle, because we always find ourselves at the end of the blocking chain -- and we always find that the lock that we own that induced the deadlock is the in-kernel lock passed as a parameter to turnstile_block (mp). Because we know that we have willed our priority to the entire blocking chain, we can just detect this and break out. The panic case is nastier to deal with. To remind: this case is the one in which the thread owns the user-level synchronization object, and is blocking trying to acquire the in-kernel lock. We might wish to handle this case in a similar way, by adding a check: if the deadlock ends in the current thread and the last synchronization object in the blocking chain is a user-level synchronization object, then it's a false deadlock. (That is, handle this case by a more general handling of the above case.) This is simple, but it's also wrong: it ignores the possibility of an actual application-level deadlock (that is, an application bug), in which case EDEADLK must be returned. To deal with this case, we observe that if a blocking chain runs from in-kernel synchronization objects to user-level synchronization objects, we know that we're in this case. Since we know that we've caught another thread in code in which it can't be preempted, we can fix this by busy-waiting until the lock changes and then restarting the priority inheritance dance. Here's the code to handle this case:[6]

	/*
	 * We now have the owner's thread lock.  If we are traversing
	 * from non-SOBJ_USER_PI ops to SOBJ_USER_PI ops, then we know
	 * that we have caught the thread while in the TS_SLEEP state,
	 * but holding mp.  We know that this situation is transient
	 * (mp will be dropped before the holder actually sleeps on
	 * the SOBJ_USER_PI sobj), so we will spin waiting for mp to
	 * be dropped.  Then, as in the turnstile_interlock() failure
	 * case, we will restart the priority inheritance dance.
	 */
	if (SOBJ_TYPE(t->t_sobj_ops) != SOBJ_USER_PI &&
	    owner->t_sobj_ops != NULL &&
	    SOBJ_TYPE(owner->t_sobj_ops) == SOBJ_USER_PI) {
		kmutex_t *upi_lock = (kmutex_t *)t->t_wchan;

		ASSERT(IS_UPI(upi_lock));
		ASSERT(SOBJ_TYPE(t->t_sobj_ops) == SOBJ_MUTEX);

		if (t->t_lockp != owner->t_lockp)
			thread_unlock_high(owner);
		thread_unlock_high(t);
		if (loser)
			lock_clear(&turnstile_loser_lock);

		while (mutex_owner(upi_lock) == owner) {
			SMT_PAUSE();
			continue;
		}

		if (loser)
			lock_set(&turnstile_loser_lock);

		t = curthread;
		thread_lock_high(t);
		continue;
	}

Once these problems were fixed, we thought we were done. But further stress testing revealed that an even darker problem lurked -- one that I honestly wasn't sure that we would be able to solve. I'm actually going to leave the description of this problem (and its solution) for another day -- but for an embarrassing reason: when I went into the code to write this blog entry, I discovered (to my horror) that an addition was made to turnstile.c with an apparent disregard for the subtle intricacies of this subsystem.
(Indeed, fear of such a hasty change is exactly why we went to such pains to comment these intricacies in the first place.) That is, this darker bug has been reintroduced in a new manifestation. The bug will be virtually impossible to hit, but there is no question that it's a bug. I'm not going to describe it here, but if you find it (and you can explain it fully), and you're in the job market, we'll fly you out for an interview. (I'm not joking.) Everything you need to know to find the bug is in turnstile.c; if you can find it, pretty soon you'll be leading your own sewer tour...

[1] New York City owes a particularly large debt to what isn't normally seen: its Manhattan schist is an extremely hard variant of granite, without which its buildings' towering heights would be impossible.

[2] Calling Joe Eykholt...

[3] This is one of the most gratifying feelings in software engineering: analyzing a failure postmortem, discovering that the bug should be easily reproduced, writing a test case testing the hypothesis, and then watching the system blow up just as you predicted. Nothing quite compares to this feeling; it's the software equivalent of the walk-off home run.

[4] Jeff was at home because it was a Sunday. And Jeff was up because it was very late on Sunday -- so much so that it was actually early on Monday morning.

[5] I won't go into details about why this is the case -- it's left as an exercise to the reader -- but suffice it to say it's one of the many reasons that we have joked about requiring a license to grab thread lock.

[6] The code dealing with turnstile_loser_lock didn't actually exist when we wrote this case -- that was added to deal with (yet) another problem we discovered as a result of our four day mind-meld. This problem deserves its own blog entry, if only for the great name that Jeff gave it: dueling losers.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris


Solaris

Your Java fell into my DTrace!

So I've been following with excitement the JVMPI and JVMTI agents prototyped by Jarod Jenson of Aeysis and developed by Kelly O'Hair of Sun (with some helpful guidance from Adam on Team DTrace). The agent exports DTrace probes that correspond to well-known Java phenomena ("method-entry", "gc-start", "vm-init", etc), and while it isn't a perfect solution from a DTrace perspective (the requirement that one restart the JVM means it's still more appropriate for developers than for dynamic production use, for example), damn is it a big leap forward. For an example of its power, check out Adam's blog entry following code flow from Java through native code and into the kernel (and back). Such a view of code flow is without precedent -- it is awesome (in all senses of that word) to see the progression through the stack of software abstraction.

Wanting to get in on the fun, I've been playing around with it myself. To get a sample program, I downloaded Per Cerderberg's Java implementation of Tetris. To get it to run with the agent, I run it this way:

# export LD_LIBRARY_PATH=/path/to/djvm/build/i386/lib
# java -Xrundjvmti:all -cp $LD_LIBRARY_PATH -jar tetris-1.2-bin.jar

The "-Xrundjvmti" pulls in the JVMTI agent, and the ":all" following it denotes that I want to get DTrace probes for all possible events. (It defaults to less than all events to minimize probe effect.) Running with this agent creates a new provider in the JVM process called djvm. Listing its probes:

# dtrace -l -P 'djvm*'
   ID   PROVIDER       MODULE                    FUNCTION NAME
    4    djvm793 libdjvmti.so            gc_finish_worker gc-stats
    5    djvm793 libdjvmti.so   cbGarbageCollectionFinish gc-finish
    6    djvm793 libdjvmti.so               cbThreadStart thread-start
    7    djvm793 libdjvmti.so                _method_exit method-return
    8    djvm793 libdjvmti.so                    cbVMInit vm-init
    9    djvm793 libdjvmti.so                 cbThreadEnd thread-end
   10    djvm793 libdjvmti.so     cbMonitorContendedEnter monitor-contended-enter
   11    djvm793 libdjvmti.so               cbMonitorWait monitor-wait
   12    djvm793 libdjvmti.so               _method_entry method-entry
   13    djvm793 libdjvmti.so            track_allocation object-alloc
   14    djvm793 libdjvmti.so             cbMonitorWaited monitor-waited
   15    djvm793 libdjvmti.so   cbMonitorContendedEntered monitor-contended-entered
   16    djvm793 libdjvmti.so                cbObjectFree object-free
   17    djvm793 libdjvmti.so    cbGarbageCollectionStart gc-start
   18    djvm793 libdjvmti.so                cbObjectFree class-unload
   19    djvm793 libdjvmti.so                   cbVMDeath vm-death
   20    djvm793 libdjvmti.so         cbClassFileLoadHook class-load
#

I can enable these just like any DTrace probe. For example, if I want to see output whenever a method is entered, I could enable the method-entry probe. This probe has two arguments: the first is a pointer to the class name, and the second is a pointer to the method name. These are both in user-level, so you need to use copyinstr() to get at them.
So, to see which methods Tetris is calling, it's as simple as:

# dtrace -n djvm793:::method-entry'{printf("called %s.%s\n", copyinstr(arg0), copyinstr(arg1))}' -q

Just dragging my mouse into the window containing Tetris generates a flurry of output:

called sun/awt/AWTAutoShutdown.notifyToolkitThreadBusy
called sun/awt/AWTAutoShutdown.getInstance
called sun/awt/AWTAutoShutdown.setToolkitBusy
called sun/awt/AWTAutoShutdown.isReadyToShutdown
called java/util/Hashtable.isEmpty
called java/awt/event/MouseEvent.<init>
called java/awt/event/InputEvent.<init>
called sun/reflect/DelegatingMethodAccessorImpl.invoke
called sun/reflect/GeneratedMethodAccessor1.invoke
called java/lang/Boolean.booleanValue
called java/awt/EventQueue.wakeup
called sun/awt/motif/MWindowPeer.handleWindowFocusIn
called java/awt/event/WindowEvent.<init>
called java/awt/event/WindowEvent.<init>
called java/awt/event/ComponentEvent.<init>
called java/awt/AWTEvent.<init>
called java/util/EventObject.<init>
called java/awt/SequencedEvent.<init>
called java/util/EventObject.getSource
called java/awt/AWTEvent.<init>
called java/util/EventObject.<init>
called java/util/LinkedList.add
called java/util/LinkedList.addBefore
called java/util/LinkedList$Entry.<init>
...

This is way too much output to sift through, so let's instead aggregate on class and method name:

#pragma D option quiet

djvm$1:::method-entry
{
	@[strjoin(strjoin(basename(copyinstr(arg0)), "."),
	    copyinstr(arg1))] = count();
}

END
{
	printa("%-70s %8@d\n", @);
}

Running this (providing our pid -- 793 -- as the argument), fooling around with Tetris for a second or two, and then ^C'ing gave me this output:

String.equals                                                1
PlatformFont.makeConvertedMultiFontString                    1
String.toCharArray                                           1
PlatformFont.makeConvertedMultiFontChars                     1
CharToByteISO8859_1.getMaxBytesPerChar                       1
CharToByteISO8859_1.reset                                    1
CharToByteConverter.setSubstitutionMode                      1
SquareBoard.getBoardHeight                                   1
X11InputMethod.deactivate                                    1
ExecutableInputMethodManager.setInputContext                 1
SquareBoard$SquareBoardComponent.redrawAll                   1
SquareBoard.clear                                            1
Figure.getRotation                                           1
InputMethodManager.getInstance                               1
Figure.rotateRandom                                          1
FontDescriptor.isExcluded                                    1
CharToByteISO8859_1.canConvert                               1
PlatformFont$PlatformFontCache.<init>                        1
InputContext.deactivateInputMethod                           1
...
SquareBoard.access$100                                     807
AWTEvent.getID                                             883
Figure.getRelativeY                                        959
X11Renderer.drawLine                                      1168
SunGraphics2D.drawLine                                    1168
Rectangle.intersects                                      1246
SquareBoard$SquareBoardComponent.paintSquare              1246
SunGraphics2D.getClipBounds                               1284
X11Renderer.validate                                      1352
SurfaceData.getNativeOps                                  1352
Rectangle.translate                                       1361
Figure.getRelativeX                                       1383
Rectangle2D.<init>                                        1516
RectangularShape.<init>                                   1516
SurfaceData.isValid                                       1620
SunGraphics2D.getCompClip                                 1620
Rectangle.<init>                                          2839
SquareBoard.access$000                                    8019
SquareBoard.access$200                                    8496

That's a lot of calls to the Rectangle constructor, so a natural question might be "where are we calling this constructor?"
For this, we can use the djvm provider along with the jstack() action:

djvm$1:::method-entry
/basename(copyinstr(arg0)) == "Rectangle" && copyinstr(arg1) == "<init>"/
{
	@[jstack()] = count();
}

Running the above generated a bunch of output, with stack traces sorted in order of popularity; here was the most popular stack trace:

              libdjvmti.so`_method_entry+0x5c
              com/sun/tools/dtrace/internal/Tracker._method_entry*
              java/awt/Rectangle.<init>*
              sun/java2d/SunGraphics2D.getClipBounds*
              net/percederberg/tetris/SquareBoard$SquareBoardComponent.paintSquare*
              net/percederberg/tetris/SquareBoard$SquareBoardComponent.paintComponent*
              net/percederberg/tetris/SquareBoard$SquareBoardComponent.paint
              net/percederberg/tetris/SquareBoard$SquareBoardComponent.redraw
              net/percederberg/tetris/SquareBoard.update
              net/percederberg/tetris/Figure.moveDown
              0xf9002a7b
              0xf9002a7b
              0xf9002a7b
              ...

Note those Java frames: we can see exactly why we're creating a new Rectangle. We might have all sorts of questions from the above output. The one that occurred to me was "just how much work do we do from that paint method?" Answering this question is a snap in DTrace using bread-and-butter D constructs like thread-locals and aggregations:

#pragma D option quiet

self int follow;
int ttl;

djvm$1:::method-entry
/self->follow/
{
	@[basename(copyinstr(arg0)), copyinstr(arg1)] = count();
}

djvm$1:::method-entry,
djvm$1:::method-return
/basename(copyinstr(arg0)) == "SquareBoard$SquareBoardComponent" &&
    copyinstr(arg1) == "paint"/
{
	ttl += self->follow;
	self->follow ^= 1;
}

END
{
	normalize(@, ttl);
	printa("%35s %35s %8@d\n", @);
}

Running the above for a bit and ^C'ing generated ~200 lines of output; here are the last 30:

               CompositeGlyphMapper            getCachedGlyphCode        4
   SquareBoard$SquareBoardComponent                getDarkerColor       14
   SquareBoard$SquareBoardComponent               getLighterColor       14
                        X11Renderer                      fillRect       15
                      SunGraphics2D                      fillRect       15
                        SquareBoard                    access$100       19
                              Color                      hashCode       28
                              Color                        equals       28
                          Hashtable                           get       29
                              Color                        getRGB       44
                      SunGraphics2D                      setColor       44
                PixelConverter$Xrgb                    rgbToPixel       45
                        SurfaceType                      pixelFor       45
                        SurfaceData                      pixelFor       45
   SquareBoard$SquareBoardComponent                   paintSquare       65
                          Rectangle                    intersects       65
                      SunGraphics2D                 getClipBounds       66
                          Rectangle                     translate       67
                   RectangularShape                        <init>       69
                        Rectangle2D                        <init>       69
                        X11Renderer                      drawLine      115
                      SunGraphics2D                      drawLine      115
                        SurfaceData                  getNativeOps      130
                        X11Renderer                      validate      130
                      SunGraphics2D                   getCompClip      134
                        SurfaceData                       isValid      134
                          Rectangle                        <init>      137
                        SquareBoard                    access$000      191
                        SquareBoard                    access$200      238

This means that (on average) each call to SquareBoard$SquareBoardComponent's paint method induced 137 creations of Rectangle. Seeing this, I thought it would be cool to write a little "stat" utility that let me know how many of each kind of object was being constructed on a real-time, per-second basis. Here's that script:

#pragma D option quiet

djvm$1:::method-entry
/copyinstr(arg1) == "<init>"/
{
	@[basename(copyinstr(arg0))] = count();
}

profile:::tick-1sec
{
	trunc(@, 20);
	printf("%-70s %8s\n", "CLASS", "COUNT");
	printa("%-70s %8@d\n", @);
	printf("\n");
	clear(@);
}

Running the above gives output once per second.
Initially, it didn't provide any output -- but as soon as I moused into the Tetris window, I saw a (small) explosion of mouse-related object construction:

CLASS                                                     COUNT
Region                                                        0
RectangularShape                                              0
Rectangle2D                                                   0
Rectangle                                                     0
LinkedList$ListItr                                            0
SentEvent                                                     0
LinkedList$Entry                                              0
WindowEvent                                                   0
InvocationEvent                                              15
HashMap$Entry                                                69
ComponentEvent                                               74
InputEvent                                                   74
MouseEvent                                                   74
WeakReference                                                78
AWTEvent                                                     79
EventObject                                                  79
Point2D                                                     128
Reference                                                   156
EventQueueItem                                              158
Point                                                       192

As soon as I moved the mouse out of the window, the rates dropped back to zero (as you would hope and expect from a well-behaved app). When I started actually playing Tetris, I saw a different kind of output:

CLASS                                                     COUNT
Dimension                                                     7
HashMap$Entry                                                 8
EventObject                                                   8
AWTEvent                                                      8
ComponentEvent                                                8
InputEvent                                                    8
Finalizer                                                    14
AffineTransform                                              14
Graphics                                                     14
SunGraphics2D                                                14
Graphics2D                                                   14
FinalReference                                               14
EventQueueItem                                               16
KeyEvent                                                     16
WeakReference                                                24
Region                                                       42
Reference                                                    62
Rectangle2D                                                 105
RectangularShape                                            105
Rectangle                                                   175

This means that we're seeing 175 Rectangle creations a second when I'm playing Tetris as fast as my BattleTris-hardened fingers can play. That doesn't seem so bad, but you can easily see how useful this is going to be on those pathological Java apps! Needless to say, the ability to peer into a Java app with DTrace is a very exciting development. To catch the fever, download the JVMTI and JVMPI agents, head to your nearest Solaris 10 machine (or download Solaris 10 if there isn't one), and have at it!

Technorati tags: DTrace Java
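Postscript: the same method-entry and method-return probes can just as easily yield latency as counts. Here's a minimal sketch along those lines -- illustrative only, and assuming (as in the scripts above) that $1 is the pid of the JVM running with the agent -- that quantizes the time spent in SquareBoardComponent's paint method:

#pragma D option quiet

/*
 * Illustrative sketch: quantize the wall-clock latency of
 * SquareBoard$SquareBoardComponent.paint using the djvm provider's
 * method-entry and method-return probes.
 */
djvm$1:::method-entry
/basename(copyinstr(arg0)) == "SquareBoard$SquareBoardComponent" &&
    copyinstr(arg1) == "paint"/
{
	self->ts = timestamp;
}

djvm$1:::method-return
/self->ts && basename(copyinstr(arg0)) == "SquareBoard$SquareBoardComponent" &&
    copyinstr(arg1) == "paint"/
{
	@["paint latency (ns)"] = quantize(timestamp - self->ts);
	self->ts = 0;
}

Fair warning: the copyinstr() calls in the predicates mean every method entry and return pays for them, which is fine for poking at Tetris but worth keeping in mind on anything bigger.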


Solaris

On Reverse Engineering

It has been a bit horrifying to watch the BitKeeper sagaunfold. Not that it's surprising of course that Larry rescinded the BKLinux license;if you knowLarry or even know of him, you know that Larry's tragic flaws --hypersensitivity, volatility and vindictiveness -- made this aninevitability of sorts.1 So the horrifying bit has not been theact itself, but rather the specific reasonthat Larrycited when rescinding the license: he seems to havetaken issue with Tridge's attempt to reverse engineer the BitKeeperprotocols.This rankles; I (like many engineers, I suspect)view reverse engineering as a Natural Right. That is, I believe thatwe are endowed with certain unalienable Rights, and that among these areLife, Liberty and the pursuit of Understanding how the hell something works(or doesn't, as is frequently the case). Perhaps perversely to some, itis my strong belief in the right to reverse engineer that leads me tomy equally strong belief in the responsibility of government to establisha system of patents: if you use my product, youhave the right to take it apart and understand its inner workings, but Ihave the right to protect my intellectual property by patenting the novelmechanism that represents a non-obviousadvance in the state of the art. That is, it should be the protectionafforded by patents -- and not the obfuscation inherent in a runningsystem --that prevents the rip-off artists.2 My belief reflects the fact that nearly all applications of reverse engineering do not in any way violate anyone's intellectual property -- and the act itself and alone can never violate intellectualproperty. I believe strongly in reverse engineering in particular, but it playsan especially critical role in the development of software:in my experience, when developinga layer in the stack of software abstraction, you alwaysneed to understand at least one layer below you and you often need tounderstand at least one layerabove you -- and reverse engineering is often the primary means to achievethis understanding. More generally, software is usually reverse engineeredto work around oversights or blunders, orto simply understand a software system sufficiently well to interoperate withit. It is in part out of the recognition of the importance of reverse engineering in software development and integration that wedeveloped DTrace --a tool which many regard as thene plus ultra of software reverse engineering.Returning to the case at hand,if BitMover believes that Tridge violated one of its patents, fine --BitMover should sue for infringement.3 But torescind the free BK licensesimply because someone dared to even understand howit works is just...cowardly. In doing this, BitMover is exhibitingclassic Bad ISV Behavior:they are devoting their efforts to preserving their natural monopoly(such as it is) over their own users -- joining the fetid ranksof the ISVs that have demanded that we disableDTrace for their application. 
And it adds insult to injury for Torvalds tocondemn Tridgefor "ruining it for everyone."Tridge "ruined it for everyone" just like Rosa Parks and Helen Gahagan Douglas and Nathan Haleand anyone else who ever took a stand for what was right.And I don't mean this comparison to diminish the courage that it tookthese others to stand up to tyranny, but rather to underscore the degreethat I believe that reverse engineering is a Natural Right.So I, for one, hope that Tridge continues to reverse engineer BitKeeper --and I would be honored if DTrace helped him do it.1 I actually like Larry -- he's sharp, forthright, andengaged and he can be very sweet -- but I do viewhim as ultimately tragic...2 I also believe that patents havegotten way out of hand, and that the proliferation of bad software patentsrepresents a serious problem -- but that doesn't change my feelings aboutpatents in the abstract.3 Of course, BitMover is unlikelyto do this, for several reasons. First, it seems highly unlikelythat Tridge hasviolated any BitMover patents if he has only reverse engineered theprotocol. Second, even if he has somehow managed to violate a patent,there's the little problem of damages to BitMover -- or rather, the lackof suchdamages; if there aren't damages, treble damages still amount to nothing.Third, even if there were enormous damages, who would pay them?Suing Tridge is not likely to be terribly gratifying; I can't imaginethat his pockets are deep enough to even pay the substantial expense ofjust prosecuting patent infringement. And bothTridge and OSDL claim that the work was done in his spare time;if it wasn't done using OSDL equipment, there isn't much of acase to be made against OSDL. Finally, there is a more practical reason that BitMover is unlikely tosue for patent infringement:suing a well-known White Knight in the open source world for patentinfringement would likelycause several megacorpswith large patent portfoliosto carefully review both their patents on SCM and the prior art in same. If BitMover is lucky, this would only result ina deluge of amicus briefs; if BitMover isunlucky, it would find itself buried in enough counter-litigation todestroy the company.


Solaris

DTrace is not a security risk!

Recently, there was a presentation at the annual meeting of the Chaos Computer Club in Berlin. As the presentation describes DTrace at some length, several have asked the question: is DTrace a security risk? The answer is an emphatic "no" -- quite the contrary in fact -- but it merits some explanation.

DTrace can only be used by users on the system who have the appropriate privileges (as discussed in the Security chapter of the DTrace documentation). By default, the only user with sufficient privileges to use DTrace is root -- the super-user. The techniques described in the paper and in the presentation are only for use on a system that one has already compromised. Of course, once a system is compromised, all bets are off; a nefarious user can:

- Load their own daemons to act as trojan horses, potentially sniffing passwords and compromising subsequent machines
- Examine /etc/shadow and crack it to obtain cleartext for every password on the system
- Use the pre-existing Solaris observability tools (truss(1), gcore(1), mdb(1), etc.) to observe and modify arbitrary processes
- Crash and/or destroy the system beyond repair
- Load their own kernel modules to spoof arbitrary parts of the system

Yes, you can use DTrace on a compromised system to glean additional information, but everything you can do with DTrace was in principle possible before DTrace -- DTrace just happens to make it a little easier. Indeed, the presentation doesn't even discuss the ways in which a nefarious user on a compromised system can use DTrace -- rather it describes how DTrace can be used to understand the system well enough to design a nefarious spoofing kernel module in the first place. And revealingly, the presentation spends quite a bit of time describing how to design a nefarious kernel module such that it evades instrumentation by DTrace.1 The fact that time and effort were spent on DTrace evasion is telling: as a tool designed to expose the inner workings of a production system, DTrace is much more feared by the Black Hats than it is useful to them; far from being a security risk, DTrace is very much a security asset.

1 I hasten to add that the author's techniques for evading DTrace won't actually work completely. They will successfully evade one form of instrumentation, but they leave the nefarious module completely exposed to several other forms of instrumentation and detection by DTrace. A more devilish rootkit would completely replace DTrace with some sort of Bizarro DTrace that knew how to completely deny the existence of its cohorts...

Technorati tags: DTrace Security
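A related practical note: the privilege model mentioned above is finer-grained than "root or nothing." Roughly speaking -- the Security chapter of the DTrace documentation and privileges(5) are the authoritative references here -- an administrator can grant a non-root user the dtrace_proc and dtrace_user privileges, which allow that user to trace their own processes while withholding dtrace_kernel. A minimal sketch (the user name is hypothetical):

# usermod -K defaultpriv=basic,dtrace_proc,dtrace_user someuser

After that user logs back in, ppriv $$ should show the new privileges in effect, and dtrace will work against that user's own processes -- while kernel tracing remains off-limits to them.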


Solaris

DTrace Tips, Tricks and Gotchas

Last week, Mike, Adam and I presented the first "advanced" DTrace talk. That is, since shipping DTrace, our presentations have been to people who have either never heard of DTrace, or have heard of it but haven't seen it before, or have heard of it and seen it before and now want to learn how to use it. The time for an advanced talk was clearly overdue; many of the questions both in the DTrace forum and internally at Sun have indicated that the limits of the DTrace documentation are being reached1, and that an advanced presentation would be helpful to some users. So given that others may find this presentation useful -- and with the caveats that it doesn't substitute for seeing it in the flesh2 and that it contains some rather arcane detail -- the "Advanced DTrace: Tips, Tricks and Gotchas" presentation can be found here.

As the title implies, this really is a random collection of tips, tricks and gotchas. Indeed, it's so random the slides could pretty much be in any order -- there is no narrative arc to this presentation whatsoever (there isn't even a conclusion slide because there wasn't much to conclude), and the whole thing thus has a somewhat dream-like (or nightmare-like?) quality. Apologies for that; there didn't seem to be a better way to present such arcana...

1 Yes, there are limits to the documentation -- which is pretty incredible considering that there's already 400 pages of it.

2 For starters, it's much easier to grok some of these concepts when they can be demo'd on-the-fly. And by not seeing it live, you won't have the pleasure of hearing me speak "a million miles an hour" in Dan Lacher's words. (To which all I will say in my defense is that I've never had anyone fall asleep during one of my presentations...)

Technorati tag: DTrace


Solaris

Solaris 10 Revealed

Or some of it, anyway. If you haven't yet seen it, today we open sourced some of Solaris.This isn't actually theformal launch ofOpenSolaris -- we're still busily working away on that -- but we wantedto reveal enough of the juicy bits of the Solaris source for the world to realize that we'reserious about OpenSolaris. I view OpenSolaris asan important milestone in the history of operating systems -- doubly sobecause it is Solaris 10 that we are open sourcing, an operating systemthat I believe to be animportant historical milestonein its own right -- so it is an honor for meto report that the source code that we decided to release today is that forDTrace. But enough with the pomp; let's talk source.I assume that most people who download the source today will downloadit, check to see that it's actually source code,1and then say to themselves, "now what?" Tohelp answer that question, Ithought I would take a moment to describe the layout of the source,and point you to some interesting tidbits therein.As with any Unix variant with Bell labs roots,you'll find all of the source under the "usr/src" directory.Under usr/src, you'll find four directories:usr/src/cmd contains commands. Soon, this directory willbe populated with its 400+ commands, but for now itjust contains subdirectories for each of the following DTrace consumers:dtrace(1M), intrstat(1M), lockstat(1M) andplockstat(1M).This directory additionally contains a subdirectory for the isaexeccommand, as the DTrace consumers are all isaexec(3C)'d. usr/src/lib contains libraries. Soon, this directorywill be populated with its 150+ libraries, but for now it just containsthe libraries upon which the DTrace consumers depend. Each of theselibraries is in an eponymous subdirectory:libdtrace(3LIB) is the library that does much of the heavylifting for a DTrace consumer. The D compiler lives here, along withall of the infrastructure to process outbound data from the kernel.The kernel/user interface is discussed in detail in<sys/dtrace.h>.libctf is a library that is able to interpret the CompactC Type Format (CTF2). CTF is much more compact than thetraditional type stabs, and we use it to represent the kernel's types.(CTF is what allows ::print to work in mdb. If you'venever done it, try "echo '::print -at vnode_t' | mdb -k" asroot ona Solaris 9 or Solaris 10 machine.) DTrace is very dependent on libctf and hence the reason that we're including it now. Notethat much of the source for libctf is in usr/src/common;see below.libproc is a library that exports interfaces for process control via /proc. (The Solaris /proc is vastlydifferent from theLinux /proc in that it is used primarily as a means of process control and process information -- not simply as a means of system information;see proc(4) for details.) Many, many Solaris utilities link withlibproc includingpcred(1),pfiles(1),pflags(1),pldd(1),plimit(1),pmap(1),ppgsz(1),ppriv(1),prctl(1),preap(1),prstat(1),prun(1),pstack(1),pstop(1),ptime(1),ptree(1)andpwdx(1). Thanks to the powerful interfaces inlibproc, many of these utilities are quite short in termsof lines of code.These interfaces aren't yet public, but you can get a taste for themby looking at /usr/include/libproc.h -- or by reading the source,of course!usr/src/uts is the main event -- the kernel. 
("uts" standsfor "Unix Time-Sharing System", and is another artifact from Bell Labs.)The subdirectories here are roughly what you might expect: common contains common codei86pc contains codespecific to the PC machine architectureintel contains code specificto the x86 instruction set architecturesparc contains code specific to the SPARC instructionset architecturesun4 contains code specific to the sun4u machine architecture,but general to all platform architectures within that machine architectureThe difference between instruction set architecture and machine architectureis a bit fuzzy, especially when there is a one-to-one relationship betweenthe two. (And in case you're not yet confused, platform architecturesadd another perplexing degree of freedom within machine architectures.)All of this made more sense when there was a one-to-many relationshipbetween instruction sets and machine architectures, but sun4m,sun4d and sun4c havebeen EOL'd and the source for these machine architectures has been removed.This layout may seem confusing, but I'm here to describe the source layout --not defend it -- so moving on...In terms of DTrace, most of the excitement is inusr/src/uts/common/dtrace, uts/src/uts/intel/dtrace andusr/src/uts/sparc/dtrace.In usr/src/uts/common/dtrace, you'll find the meat of thein-kernel DTracecomponent in the 13,000+ lines of dtrace.c. In this directoryyou will additionally find the source for common providers like profile(7D)and systrace(7D) along with the common components for thelockstat(7D) provider. In the ISA-specific directories, you'll findthe ISA-specific halves to these providers, along with wholly ISA-specificproviders like fbt(7D). You will also find the ISA-specificcomponents to DTrace itself in dtrace_asm.s anddtrace_isa.c.So that's the basic layout of the source but...now what? If you're like me, you don't have the time or interest to understandsomething big and foreign -- and you mainly look at source codeto get a flavor for things. Maybe you just want to grep for "XXX" (you'llfind two -- but we on the DTrace team are responsible for neither one) orlook for curse words (you'll find none -- at least none yet) orjust search for revealingly frankwords like "hack", "kludge", "vile", "sleaze", "ugly", "gross", "mess", etc.(of which youwill regrettably find at least one example of each). But in the interest of leaving you with more than just whiffs ofour dirty laundry,let me guide you to some interesting tidbits that require little or no DTraceknowledge to appreciate. I'm not claiming that these tidbits are particularlynovel or even particularly core to DTrace; I'm only claiming that they'reinteresting for one reason or another. For example, check out this comment in dtrace.c: /\* \* We want to have a name for the minor. In order to do this, \* we need to walk the minor list from the devinfo. We want \* to be sure that we don't infinitely walk a circular list, \* so we check for circularity by sending a scout pointer \* ahead two elements for every element that we iterate over; \* if the list is circular, these will ultimately point to the \* same element. You may recognize this little trick as the \* answer to a stupid interview question -- one that always \* seems to be asked by those who had to have it laboriously \* explained to them, and who can't even concisely describe \* the conditions under which one would be forced to resort to \* this technique. Needless to say, those conditions are \* found here -- and probably only here. 
Is this is the only \* use of this infamous trick in shipping, production code? \* If it isn't, it probably should be... \*/This is the code that executes the ddi_pathname function in D.A critical constraint on executing D is that any programming errors must be caught and handled gracefully. (We callthis the "safety constraint" because failure to abide by it will inducefatal system failure.)While many programming environments recover from memory-relatederrors, we in DTrace must additionally guard against infinite iteration -- a muchharder problem. (In fact, this problem is so impossibly hard that itsname has become synonymous with undecidability: this is thehalting problemthat Turing proved impossible to solve in 1936.)We skirt the halting problem in DTrace by not allowing programmer-specifiediteration whatsoever: DIF doesn't allow backwards branches. But some Dfunctions -- like ddi_pathname() -- require iteration over untrusteddata structures to function. When iterating over an untrusted list, ourfear is circularity (be it innocent or pernicious), and the easiest wayfor us to determine this circularity is to use the interview questiondescribed above.I wrote this code a while ago; now that I read it againI actually canimagine some uses for this in otherproduction code -- but I would imagine it would all be of the assertionvariety. (That is, again the data structure is effectively untrusted.)Any other use still strikes me as busted (prove me wrong?),and I still have disdain for those that ask it as an interviewquestion (apologies if this includes you, gentle reader).Here's another interesting tidbit,also in dtrace.c: switch (v) { case DIF_VAR_ARGS: ASSERT(mstate->dtms_present & DTRACE_MSTATE_ARGS); if (i >= sizeof (mstate->dtms_arg) / sizeof (mstate->dtms_arg[0])) { int aframes = mstate->dtms_probe->dtpr_aframes + 2; dtrace_provider_t \*pv; uint64_t val; pv = mstate->dtms_probe->dtpr_provider; if (pv->dtpv_pops.dtps_getargval != NULL) val = pv->dtpv_pops.dtps_getargval(pv->dtpv_arg, mstate->dtms_probe->dtpr_id, mstate->dtms_probe->dtpr_arg, i, aframes); else val = dtrace_getarg(i, aframes); /\* \* This is regrettably required to keep the compiler \* from tail-optimizing the call to dtrace_getarg(). \* The condition always evaluates to true, but the \* compiler has no way of figuring that out a priori. \* (None of this would be necessary if the compiler \* could be relied upon to _always_ tail-optimize \* the call to dtrace_getarg() -- but it can't.) \*/ if (mstate->dtms_probe != NULL) return (val); ASSERT(0); ...This is the code that retrieves an argument due to a reference toargs[n] (or argn). Theclause above will only be executed if n is equal to or greaterthan five -- in which case we need to go fishing in the stack framefor the argument. And here's where things get a bit gross: in order tobe able to find the right stack frame, we must know exactly how many stack frameshave been artificially pushed in the process of getting into DTrace.This includes frames that the provider may have pushed (tracked inthe probe as the dtpr_aframes variable) and the framesthat DTrace itselfhas pushed (rather bogusly represented by the constant "2", above:one for dtrace_probe() and one for dtrace_dif_emulate()).The problem is that if the call to dtrace_getarg() is tail-optimized, our calculation is incorrect. We therefore have totrick the compiler by having an expression after the call that the compileris forced to evaluate after the call. 
We do this by having an expressionthat always evaluates to true, but dereferences through a pointer.Because dtrace_getarg() is in another object file,no amount of alias disambiguationis going to figure out that it doesn't modify dtms_probe;the compiler doesn't tail-optimize the abovecall, the stack frame calculation is correct, and the arguments are correctly fished out of the (true) caller's stack frame. There's an interesting footnote to the above code: recently, weran a research tool that performs static analysis of code on the sourcefor the Solaris kernel. The toolwas actually pretty good, and found all sorts of interesting issues.Among other things, the tool flagged the above code, observing thatdtms_probe is never NULL. (The tool may be clever enoughto determine that, but it obviously can't beclever enough to know that we're trying to outsmart the compilerhere.) While this might give us pause, it needn't: the tool mightwarn about it, but no compiler could safely avoid evaluating the expression --because dtrace_getarg() is not in the same object file,it cannot be absolutely certain that dtrace_getarg()does not store to dtms_probe.As long as we're going through dtrace_getarg(), though, it maybe interestingto look at a routine that implements this on SPARC.3This routine, foundin usr/src/uts/sparc/dtrace/dtrace_asm.s, fishes a specificargument out of a specified register window -- without causing a window spill trap. Here's the function: #if defined(lint) /\*ARGSUSED\*/ int dtrace_fish(int aframes, int reg, uintptr_t \*regval) { return (0); } #else /\* lint \*/ ENTRY(dtrace_fish) rd %pc, %g5 ba 0f add %g5, 12, %g5 mov %l0, %g4 mov %l1, %g4 mov %l2, %g4 mov %l3, %g4 mov %l4, %g4 mov %l5, %g4 mov %l6, %g4 mov %l7, %g4 mov %i0, %g4 mov %i1, %g4 mov %i2, %g4 mov %i3, %g4 mov %i4, %g4 mov %i5, %g4 mov %i6, %g4 mov %i7, %g4 0: sub %o1, 16, %o1 ! Can only retrieve %l's and %i's sll %o1, 2, %o1 ! Multiply by instruction size add %g5, %o1, %g5 ! %g5 now contains the instr. to pick rdpr %ver, %g4 and %g4, VER_MAXWIN, %g4 ! ! First we need to see if the frame that we're fishing in is still ! contained in the register windows. ! rdpr %canrestore, %g2 cmp %g2, %o0 bl %icc, 2f rdpr %cwp, %g1 sub %g1, %o0, %g3 brgez,a,pt %g3, 0f wrpr %g3, %cwp ! ! CWP minus the number of frames is negative; we must perform the ! arithmetic modulo MAXWIN. ! add %g4, %g3, %g3 inc %g3 wrpr %g3, %cwp 0: jmp %g5 ba 1f 1: wrpr %g1, %cwp stn %g4, [%o2] retl clr %o0 ! Success; return 0. 2: ! ! The frame that we're looking for has been flushed to the stack; the ! caller will be forced to ! retl add %g2, 1, %o0 ! Failure; return deepest frame + 1 SET_SIZE(dtrace_fish) #endifFirst, apologies for the paucity of comments in the above.The lack of comments is particularly unfortunate because the functionis somewhat subtle, asit uses some dicey register window manipulation plus an oddSPARC technique knownas "instruction picking": the jmp with the the bain the delay slot picks one of the instructions out of the table that followsthe ba 0f, thus allowing the caller to specify any registerto fish out of the window without requiring any compares.4If you're interested in the details of the register window manipulationlogic in this function, you shouldconsult theSPARC V9Architecture Manual.That about does it for tidbits, at least for now.As you browse the DTrace source, you may well find yourself asking"does it need to be this complicated?" The short answer is, in most cases,"regrettably yes." 
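Returning for a moment to the circular-list comment quoted in the first tidbit: for those who would rather see the trick than read about it, here is a minimal, self-contained sketch of the scout-pointer technique. This is an illustration of the idea only, not the dtrace.c code itself:

#include <stddef.h>

struct node {
	struct node *next;
};

/*
 * Walk a list that we don't trust, detecting circularity by sending a
 * scout pointer ahead two elements for every element that the walker
 * visits.  If the list is circular, the scout eventually lands on the
 * walker; if it is NULL-terminated, the scout falls off the end first.
 * Returns 1 if a cycle was found, 0 otherwise.
 */
static int
list_is_circular(struct node *head)
{
	struct node *walk = head, *scout = head;

	while (scout != NULL && scout->next != NULL) {
		walk = walk->next;
		scout = scout->next->next;

		if (walk == scout)
			return (1);
	}

	return (0);
}

The invariant is simply that the scout advances two nodes for every one that the walker advances, so on untrusted data this bounds the iteration without ever having to trust the list's terminator.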
If you're looking for somein-depth discussion on the specific issues that complicate specificfeatures,I would direct you to functions likedtrace_hres_tick() (in usr/src/uts/common/os/dtrace_subr.c),dtrace_buffer_reserve() (inusr/src/uts/common/dtrace/dtrace.c) anddt_consume_begin()(in usr/src/lib/libdtrace/common/dt_consume.c). These functionsare good examples of howseemingly simple DTrace features like timestamps, ring buffers andBEGIN/END probes can lead to much more complexity thanone might guess.Finally, I suppose there's an outside chance that you might actually want tounderstand how DTrace works -- perhaps even to modify it yourself.If this describes you, you should first heed this advicefrom usr/src/uts/common/dtrace/dtrace.c: /\* \* DTrace - Dynamic Tracing for Solaris \* \* This is the implementation of the Solaris Dynamic Tracing framework \* (DTrace). The user-visible interface to DTrace is described at length in \* the "Solaris Dynamic Tracing Guide". The interfaces between the libdtrace \* library, the in-kernel DTrace framework, and the DTrace providers are \* described in the block comments in the <sys/dtrace.h> header file. The \* internal architecture of DTrace is described in the block comments in the \* <sys/dtrace_impl.h> header file. The comments contained within the DTrace \* implementation very much assume mastery of all of these sources; if one has \* an unanswered question about the implementation, one should consult them \* first. \* ...This is important advice, because we (by design) put many of the implementation comments in a stock header file,<sys/dtrace_impl.h>.We did this because we believe in the Unix idea that the system implementation should be described as much as possible in its publicly available header files.5Discussing the comments in <sys/dtrace_impl.h> evokes a somewhatamusing anecdote; take this comment describing the implementation of speculativetracing: /\* \* DTrace Speculations \* \* Speculations have a per-CPU buffer and a global state. Once a speculation \* buffer has been committed or discarded, it cannot be reused until all CPUs \* have taken the same action (commit or discard) on their respective \* speculative buffer. However, because DTrace probes may execute in arbitrary \* context, other CPUs cannot simply be cross-called at probe firing time to \* perform the necessary commit or discard. The speculation states thus \* optimize for the case that a speculative buffer is only active on one CPU at \* the time of a commit() or discard() -- for if this is the case, other CPUs \* need not take action, and the speculation is immediately available for \* reuse. If the speculation is active on multiple CPUs, it must be \* asynchronously cleaned -- potentially leading to a higher rate of dirty \* speculative drops. The speculation states are as follows: \* \* DTRACESPEC_INACTIVE and <sys/dtrace_impl.h>) then you'reready to start reading the source for purposes of understanding it. If yourun into source that you don't understand (and certainly if you believe thatyou've found a bug), please post tothe DTrace forum.Not only will one of us answer yourquestion, but there's a good chance that we'll update the comments as well;if you can't understand it, we probably haven't been sufficiently clearin our comments. (If you haven't already inferred it, source readabilityis very important to us.)Well, that should be enough to get oriented. If it isn't obvious, we're veryexcited to be making the source to Solaris available. 
Andhopefully this hors d'oeuvre of DTrace source will hold your appetite untilwe serve the main course of Solaris source. Bon appetit!1 It's unclear what passes for convincing in this regard. Perhapspeople just want to be sure that it's not just a bunch of files filledwith the results of "while true; do banner all work and noplay makes jack a dull boy ; done"?2 The reason that it's CTF and not CCTF is a kind of strangeinside joke: the format is so compact, even its acronym is compressed.Yes, this is about as unfunny as theLanguage Hhumor that we never seem to get sick of...3 Please don't infer from this that I'm a SPARC bigot;both of my laptops, my desktop and my build machine are all AMD64 boxesrunning the 64-bit kernel. It's only that the SPARC version of thisparticular operation happens to be interesting, not that it's interestingbecause it happens to be SPARC...4 This brings up a fable of sorts: once, many years ago,there was a systemfor dynamic instrumentation of SPARC.Unlike DTrace, however, this system was both aggressive and naïve in itsinstrumentation and, as a result of this unfortunate combination ofattributes,couldn't guarantee safety. In particular,the technique used by the function here -- using a DCTI coupleto effect instruction picking -- was incorrectlyinstrumented by the system. When confronted with this in front of alarge-ish and highly-technical audience at Sun, the author of said system(who was interviewing for a job at Sun at the time) responded (in a confident,patronizing tone) that theSPARC architecture didn't allow such a construct. The members of the audience gasped, winced and/or snickered: to not dealcorrectly withthis construct was bad enough, but to simply pretend it didn't exist (andto be a prick about it on top of it all) was beyond the pale; the authordidn't get the job offer, and the whole episode entered local lore.The moral of the story is this: don't assume that someoneasking you a question is an idiot -- especially if the question is aboutthe intricacies of SPARC DCTI couples.5 It's unclear if this was really a deliberate philosophy ormore an accident of history, but it's certainly a Solaris philosophy atany rate.Perhaps its transition fromUnix accident to Solaris philosophy was marked by Jeff's "Riemann sum" comment(and accompanying ASCII art diagram) in <sys/kstat.h>? 6 I'm pretty sure that I was joking...Technorati tags: DTraceOpenSolarisSolaris


Solaris

The Economics of Software, redux

So there has been quite a bit of reaction to my thoughts on the economics of software. While most of the reaction has been quite positive,several economists have taken issue withmy analysis. (Not surprisingly, I suppose; I'm sure I would take issuewith any software written by an economist.)For example, Susan Stewart brought up the legitimate point aboutmy conflation of macro- and microeconomic analysis, especiallywith regard to the supply curve for the firm. Susan is right on this,but we agreed that it doesn't change my overall analysis. While feedback like Susan's was quite helpful, that of another economistwas decidedly less so:he was foaming at the mouth becauseI had the gall to ignore the theories that he set forth inhis dissertation.His (largely ad hominem) attackrevealed that (a) he didn'tread past the first paragraph or so of my blog entry and (b) that he knows absolutely nothing about software.1I ultimately got him to agree to (b), after which pointI stopped trying to convince him of anything else. And at any rate, thesequalms from the academy didn't have anything to do with thecrux of my thesis: that demand for a particular software product is largely price inelastic, that software vendors act as natural monopolists,that open source is an effective way of driving demand for complementarygoods, and that this all adds up to a powerful supply-side open sourcemovement.From my peers in the software industry, the reactionhas been more positive -- and certainly much better informed.David Ogren had a thoughtful follow-upexploring the genesis for tiered pricing in software,and thenanotherdiscussing the economics of software support. I view software and supportas two different products, so I don't think that David's analysisinvalidates any of my own.Paul Brownadded histhoughts on how the economics of software and the economicsof software support relate to one another. AndJohn Mitchellwas apparently inspired toask an intriguing question: "what do examples of post-fiateconomics have to do with open source?" I have absolutely no idea,but if I start getting paid in Papiermarks,there's going to be big trouble.The most commonpoint of disagreement from those in the industryhas been that I didn't include the cost of salesand marketingsoftware in the cost of software. At the risk of stating a tautology,the cost of marketing a product is not a part of the variable cost ofmaking the product; the variable cost is simply the cost of manufacturingone more unit, and (for software at least) that cost is damn near zero.(The cost of bandwidthis the only argument to be had, and that's pretty cheap.)This isn't to say that sales and marketing aren't necessary to shipa product, just that they don't factor into the variable cost.Another point of disagreement is my contention that software doesn'twear out. I'm not budging on this one, and I refer doubters to myrecent experiments with VENIX, software that has had zeromaintenance applied to it in the last twenty years. 
The softwareworks exactly today as it did twenty years ago -- it's as good as new.2 Some mistake software that stops working as having"worn out", but this is an incorrect inference: the software itselfhas not changed; a heretofore unknown attribute has merely been exposed -- an attribute that was there all along.I've been meaning to do this round-up of reaction for a while, but Iwas prompted bythistale of onedatabase customer and theFirebird database.This vividly shows pretty much everything I described: a softwarecompany, seeking to extract additional revenue, jacks the price ofthe software that the customer already has by using one of theugliest tricks in the book: the software audit.But the software company gets a bittoo greedy, and overplays its hand by raising the price four-fold (!) --pushing the customer above their FYO point. (As an aside, I obviouslystand by my original nomenclature.) Meanwhile, a hard-on-its-luck also-ran software product written ages ago ends up in the hands of acompany that isn't making much money off of it. Of course, it coststhem nothing to manufacture it, so the company decides to give itaway for free by open sourcing it. After that initial supply-side push,the demand-side takes off: development begins to blossom, with the demand-side adding many long sought-after features. And in particular,a compatibility layer is added that dramatically lowers the FYO pointof everyone running the proprietary database. The pissed customerfinds the open source product -- andthe proprietary product gets the boot (and a lawsuit, I might add).Everyone's a winner. Or rather, almost everyone...DespiteLarry Ellison's (expensive) attemptto fulfill his ownprophesies,this is the true future of the software business: it may takeyears -- and it may even take decades -- but open source from thesupply-side coupled with energy from the demand-side will ultimately drive the FYO point towards zero for manyexisting software products. Companiesthat can use open source to drive complementary goods will surviveand thrive; those that can't will slowly whither away, until oneday theirsoftware is acquired for pennies-on-the-dollar by some company thatjust happens to have some complementary goods to sell...1A choice quote:"I define software as anything that is an organized collectionof information." Seeing as the ancient Sumerians were writing softwareby this definition, I found myself wondering what his definition of"hardware" must be...2That is, as good as it ever was. Software doesn't wear out, but thatdoesn't mean that it's perfect...


Solaris

UNIX, circa 1984

I recently ran across Xhomer, a simulator for the DEC Pro 350. While there are several historic machine simulators out there, Xhomer is polished: it compiled and ran on my Opteron laptop running Solaris 10 with no difficulties. But best of all, Xhomer includes system software: you can download a disk image of VENIX 2.0 -- a "real-time" UNIX variant from VenturCom. Here's a screenshot of me logging in. While I had heard of VENIX (a System III derivative), I had never used it before today. (And nor, presumably, have the good folks at pharmanex.) In using it, it's clear that VENIX had some BSD influence: VENIX 2.0 includes both csh (blech) and vi (phew).

Using a twenty-year-old UNIX is a strange experience: I'm amazed at how little the most basic things have changed. There was very little that I didn't recognize ("em1" anyone?), and I was familiar with all of the tools necessary to write a program (vi), compile it (cc) and debug it (adb). Using the latter of these was the most amusing; here's a screenshot of me using adb. Compare this output with the output of "$a" on adb, and you'll see why this got me excited. (And then try "$a" on mdb for our warped idea of an easter egg.) It should go without saying that I looked (so far in vain) for an Algol 68 compiler on this system. Seeing adb spit out a true Algol stack backtrace would be like sipping from the debugging fountain of youth...

While it's amazing how familiar VENIX feels, I'm also stunned by how anemic its facilities are: it has no TCP/IP stack, no real filesystem, no multiple processor support, no resource management, no dynamic linking, no real virtual memory system, no observability and poor debuggability. It reminds me how far we have come -- even before we embarked on the far more radical technologies found in Solaris 10...

Now, does anyone know of a Language H compiler for the PDP-11? OBEY


Solaris

Solaris 10 Launch

So it's been an exciting week for Solaris: at long last, we officially launched Solaris 10 on Monday. Unlike most product launches, the Solaris 10 launch was heavy on both technical details and customer testimonials: it was very important to us that those covering the event understand that this isn't ballyhooed nothingness -- this is real technology that is having a tangible impact on those using it. To that end, Mike, Andy and I described the Solaris 10 technology areas in some depth to a group of fifty journalists in a Solaris "boot camp" on the morning of the launch. I was pleased by how many journalists were there to begin with, and impressed that none left over the two hours or so of informal presentations: this showed a real willingness on the part of the press to understand what we had done. (Impressively, they even stayed after I suggested to one journalist that he and I strip to the waist and wrestle to settle a difference of opinion. Fortunately, we were able to settle the difference without resorting to fisticuffs.)

But my favorite part of the launch -- hands down -- was when Don Fike from FedEx stood on the stage and described the application performance problems that FedEx has found using DTrace. It's always gratifying to see a customer achieve a win with DTrace (which of course is what motivated us to write DTrace in the first place), but it's something else entirely to have a customer be willing to stand on a stage with you and put their reputation on the line by vouching for your technology. And on top of it all, to have that customer be FedEx -- a company that I (and most, I suspect) hold in very high regard -- well, it nearly brought a tear to my eye; moments like that just don't come often in one's career...

Overall, the launch was a great success. Driving back up to the City with Mike, we wondered aloud: how would the competition respond? As it turns out, we didn't have to wait long: Martin Fink, HP's VP of Linux, dashed off a hasty diatribe against Solaris 10. As others have pointed out, this is pure HP FUD: it doesn't attack our technology in any concrete fashion, but rather attempts to put baseless fear in the minds of those who might be considering it. In particular, Fink returns to a classic FUD attack from the early 1990s: fear of a mixed-endianness planet. This was certainly a surprising angle of attack: given that this issue has been technically solved for nearly a decade, I naturally assumed that this was a dead issue for any technologist. But then, his attack reveals what is confirmed by Fink's bio (and photo?): Fink isn't a technologist. But most amusing was Ben Rockwood's hilarious response. Thank you, Ben, for responding with the pluck and thoroughness that I believe characterize the Solaris community...


Solaris

DTrace update

Since I've been away, there's been quite a bit of exciting DTrace activity. First, there's the DTrace Challenge, where we will award prizes for both the most creative D script and the best use of DTrace to get a big performance win. When this idea was first floated internally, I assumed that the prizes would, well, kinda suck. Maybe a T-shirt or a mug or some Solaris DVDs or whatever. Much to my surprise, we've ponied up some real dough to give away some seriously cool prizes, including a 32" flat-panel plasma HDTV, a Ferrari-branded ACER Opteron laptop (running the 64-bit Solaris kernel, natch!), and an Apple iPod. We're giving away these prizes for each of the two challenges (plus two more for use of Zones), so there are twelve prizes in all. Needless to say, I've never been so bummed to see the "employees and their families are ineligible" rider -- even at his young age, I'm sure Tobin could have developed a winning D script for his Da-da!

In other DTrace news, the first DTrace course is now available for registration. Many of you have asked about formal training; here's your opportunity to sign up. And I'm pleased to report that fellow Solaris kernel developer Jonathan Adams has now joined the Sun blogasm. Jonathan was incredibly helpful to us on DTrace, culminating in his volunteering (or perhaps acquiescing?) to do the DTrace code review. (I can say with confidence that there are bodies of in-kernel DTrace code that are understood by only two people -- Jonathan and me.)

Finally, DTrace and Solaris 10 have been in the press quite a bit. We were especially excited to see DTrace -- along with other Solaris 10 technologies -- mentioned in John Markoff's story in the New York Times (free registration required). The money quote:

More than any other factor, though, it will be the success or failure of Solaris 10 - the new operating system being shown to Wall Street this week - that could determine whether Sun can return to anything resembling its former glory. Analysts and first customers of the new software say it represents a generational shift for Sun, offering a variety of technology features not available under Linux or Windows.

Markoff nailed it: Solaris 10 is indeed a "generational shift" -- and it's one that we on Team DTrace are very much proud to be a part of...


Solaris

The Economics of Software

Software is like nothing else in the history of human endeavor:1unlike everything else we have ever built, software costs nothingto manufacture, and it never wears out. Yet these magical properties are arguably overshadowed by theugly truth that software remains incredibly expensive to build. Thisgives rise to some strange economic properties: software's fixedcosts are high (very high -- too high), but its variable costs are zero. As strange as they are, these economic properties aren'tactuallyunique to software; they are also true (to varying degree) of the productsthat we have traditionally called "intellectual property."But unlike books or paintingsor movies,software is predominantly an industrial good -- it is almost always used as a component in a larger, engineered system. When you take these together -- software's role as an industrial good, coupled with its high fixedcosts and zero variable costs -- you get all sorts of strange economicphenomena.For example, doesn't it strike you as odd thatyour operating system is essentiallyfree,but your database is still costing you forty grand per CPU?Is a database infinitely more difficult to write than an operatingsystem? (Answer: no.) If not, why the enormous pricing discrepancy?I want to ultimately address the paradox of the software price discrepancy,but firsta quick review of the laws of supply and demand ina normal market:at high prices, suppliers tend to want tosupply more, while consumers tend to demand less; at low prices,consumers tend to demand more,while suppliers tend to want to supply less. We can show priceversus quantity demanded/supplied with theclassicsupply and demand curves:The point of intersection of the curves is theequilibrium price, and the laws of supply and demandtend to keep the market in equilibrium:as prices rise slightly out of equilibrium, suppliers willsupply a little more, consumers will demand a little less, inventorieswill rise a little bit, and prices will fall back into equilibrium.Likewise, if prices fall slightly, consumers will demand a little more,inventories will become depleted, and prices will rise back intoequilibrium. The degree to which suppliers and consumers can react to prices -- theslope of their respective curve -- is known aspriceelasticity. In a price inelastic market, suppliers or consumers cannot react quickly toprices. For example, cigarettes have canonically high inelastic demand:if pricesincrease, few smokers will quit. (That said, demand for a particularbrand of cigarettes is roughly normal: if Marlboros suddenlycost ten bucks a pack, cheap Russianimports may begin to look a lot more attractive.)So that's the market for smokes, but what of software? Let's start by looking at the supply side, because it's pretty simple:the zero variable cost means that suppliers can supply an arbitraryquantity at a given price.That is, this is the supply curve for software:The height of the "curve" will be dictated by various factors:competitive environment, fixed costs, etc.; we'll talk about how the heightof this curve is set (and shifted) in a bit.And what of the demand side? 
Software demand is normal to the degree that consumers have the freedom to choose software components.The problem is that for all of the rhetoric about software becominga "commodity", most software is still very much not a commodity: onesoftware product is rarely completely interchangeable with another.The lack of interchangeability isn't as much of an issue for a project that is still being specified(one can design around the specific intricacies of a specific piece of software),but it's very much an issue after a project has deployed:deployed systems are rife with implicit dependencies among the differentsoftware components. These dependencies -- and thus the cost to replace a given software component -- tend to increase over time.That is, your demand becomes more and more price inelastic as time goeson, until you reach a point of complete price inelasticity. Perhapsthis is the point when you have so much layered on top of the decision,that a change is economically impossible. Or perhaps it's the point whenthe technical talent that would retool your infrastructure around adifferent product has gone on to do something else -- or perhaps they'reno longer with the company. Whatever the reason, it's the point afterwhich the software has become so baked into your infrastructure,the decision cannot be revisited.So instead of looking the nice supply and demand curves above, softwaresupply and demand curves tend to look like this:And of course, yourfriendlysoftware vendorknows that your demandtends towards inelasticity -- which iswhy they so frequentlyraise the rent while offering so little in return.We've always known about this demand inelasticity, we've just calledit something else:vendor lock-in.If software suppliers have such unbelievable pricing power,why don't companies end up forking over every lastcent for software?Because the demand for software isn't completely price inelastic. It's only inelastic as long asthe price is below the cost of switching software. In the spirit of theFYObillboard on the 101,I dub this switching point the "FYO point": it is the point at whichyou get so pissed off at your vendor that you completelyreevaluate your software decision -- you put everything back on thetable. So here's the completed picture:What happens at the FYO point?In extreme cases, you may decide torewriteit yourself. Or maybe you'll decide that it's worth the pain to switch toa different (and less rapacious) vendor -- or at least scaringyour existing vendor to ease up a bit on their pricing. Or maybe you'll ramp up anew project to replace the existing one, using all new components -- thus normalizing your demand curve.And increasingly often, you decide that you're not using half of the features of this thing anyway -- and you start looking fora "good enough" open source option to get out of this ugly mess once and for all.(More on this later.) Now, your software vendor doesn't actually want you to hit the FYO point;they want to keep you far enough below it that you just sigh (orgroan) and sign the check. 
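Before moving on, it may help to put the inelasticity claim in textbook terms. This is my own back-of-the-envelope framing, not something lifted from the curves above:

    elasticity of demand  =  (% change in quantity demanded) / (% change in price)

Demand is called inelastic when the magnitude of that ratio is less than one. The argument here is that for deployed software the ratio drifts toward zero as implicit dependencies pile up (the vertical segment of the kinked demand curve), right up until price approaches the switching cost, where the curve goes flat and demand becomes very elastic indeed. That flat spot is the FYO point.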
(Which most of them are pretty good at, by the way -- but of course, you already know that from all of your sighing and groaning.) There are essentially two ways for a software company to grow revenue on an established software product:

1. Take away business from competitors
2. Extract more dough out of existing customers

In terms of the FYO point, taking away competitors' business amounts to lowering the FYO point of competitors' customers. This can be done through technology (for example, by establishing open standards or developing migration tools) or it can be done through pricing (for example, by lowering the price they charge their competitors' customers for their software -- the "competitive upgrade"). This is a trend that generally benefits customers. If only there were no other option...

Sadly, there is another option, and most software companies opt for it: extracting more money from their existing customers. In terms of the FYO point, this amounts to raising the FYO point of their own customers. That is, software vendors act as a natural monopolist does: focusing their efforts not on competition, but rather on raising barriers to entry. They have all sorts of insidious ways of doing this: proprietary data formats, complicated interdependencies, deliberate incompatibilities, etc. Personally, I find these behaviors abhorrent, and I have been amazed at how brazen some software vendors are about maintaining their inalienable right to screw their own customers. To wit: I have now had not one but two software vendors tell me that I must add a way to disable DTrace for their app to prevent their own customers from observing their software. They're not even worried about their competitors -- they're too busy screwing their own customers! (Needless to say, their requests for such a feature were, um, declined.)

So how does open source fit into this? Open source is a natural consequence of the economics of software, on both the demand side and the supply side. The demand side has been discussed ad nauseam (and frequently, ad hominem): the demand for open source comes from customers who are sick of their vendors' unsavory tactics to raise their FYO point -- and they are sick more generally of the whole notion of vendor lock-in. The demand side is generally responsible for customers writing their own software and making it freely available, or participating in similar projects in the community at large. To date, the demand side has propelled much of open source software, including web servers (Apache) and scripting languages (Perl, Python). With some exception, the demand side consists largely of individuals participating out of their interest more than their self-interest. As a result, it generally cannot sustain full-time, professional software developers.

But there's also a supply side to open source: if software has no variable cost, software companies' attempts to lower their competitors' customers' FYO point will ultimately manifest themselves in free software. And the best (if not the only) way to make software convincingly free is to make the source code freely available -- to make it open source.2 The tendency towards open source is especially strong when companies profit not directly from the right-to-use of the software, but rather from some complementary good: support, services, other software, or even hardware. (In the specific case of Solaris and Sun, it's actually all of the above.) And what if customers never consume any of these products?
Well, the software costs nothing to manufacture, so there isn't a loss -- and there is often an indirect gain. To take the specific example of Solaris: if you run Solaris and never give a nickel to Sun, that's fine by us; it didn't even cost us a nickel to make your copy, and your use will increase the market for Solaris applications and solutions -- driving platform adoption and ultimately driving revenue for Sun. To put this in retail terms, open source software has all of the properties of a loss-leader -- minus the loss, of course.

While the demand side has propelled much of open source to date, the supply side is (in my opinion) ultimately a more powerful force in the long run: the software created by supply-side forces has generally been developed by people who do it full-time for a living -- there is naturally a greater attention to detail. For a good example of supply-side forces, look at operating systems, where Linux -- the traditionally dominant open source operating system -- has enjoyed profound benefits from the supply side. These include contributions from operating systems such as AIX (JFS, scalability work), IRIX (XFS, observability tools and the occasional whoopsie), DYNIX/ptx (RCU locks), and even OS/2 (DProbes). And an even larger supply-side contribution looms: the open sourcing of Solaris. This will certainly be the most important supply-side contribution to date, and a recognition of the economics both of the operating systems market and of the software market more generally. And unlike much prior supply-side open source activity, the open sourcing of Solaris is not open source as capitulation -- it is open source as counter-strike.

To come back to our initial question: why is the OS basically free while the database is costing you forty grand per CPU? The short answer is that the changes that have swept through the enterprise OS market are still ongoing in the database market. Yes, there have been traditional demand-side efforts like MySQL and research efforts like PostgreSQL, but neither of these "good enough" efforts has actually been good enough to compete with Informix, Oracle, DB/2 or Sybase in the enterprise market. In the last few years, however, we have seen serious supply-side movement, with MaxDB from SAP and Ingres from CA both becoming open source. Will either of these be able to start taking serious business away from Oracle and IBM? That is, will they be enough to lower the FYO point such that more customers say "FY, O"? The economics of software tells us that, in the long run, this is likely the case: either the demand side will ultimately force sufficient improvements to the existing open source databases, or the supply side will force the open sourcing of one of the viable competitors. And the fact that software does not wear out and costs nothing to manufacture assures us that the open source databases will survive to stalk their competitors into the long run. Will this happen anytime soon? As Keynes famously pointed out, "in the long run, we are all dead" -- so don't count on any less sighing or groaning or check writing in the immediate future...

1 As an aside, I generally hate this rhetorical technique of saying that "[noun] is the [superlative] [same noun] [verb] by humankind." It makes it sound like the chimps did this years ago, but we humans have only recently caught up.
I regret using this technique myself, so let me clarify: with the notable exception of gtik2_applet2, the chimps have not yet discovered how to write software.

2 Just to cut off any rabid comments about definitions: by "open source" I only mean that the source code is sufficiently widely and publicly available that customers don't question that its right-to-use is (and will always be) free. This may or may not mean OSI-approved, and it may or may not mean GPL. And of course, many customers have discovered that open source alone doesn't solve the problem. You need someone to support it -- and the company offering support begins to look, act and smell a lot like a traditional, rapacious software company. (Indeed, the FYO point may ultimately be renamed the "FYRH point.") You still need open standards, open APIs, portable languages and so on...


Solaris

DTrace on LKML

So DTrace was recently mentioned on the linux-kernel mailing list. The question in the subject line was "is DTrace-like analysis possible with future Linux kernels?" The responses have been interesting. Karim Yaghmour rattled in with his usual blather about the existence of DTrace proving that LTT should have been accepted into the Linux kernel long ago. I find this argument incredibly tedious, and have addressed it at some length. Then there was this inane post:

> http://www.theregister.co.uk/2004/07/08/dtrace_user_take/:
> "Sun sees DTrace as a big advantage for Solaris over other versions of Unix
> and Linux."

That article is way too hypey. It sounds like one of those strange american commercials you see sometimes at night, where two overenthusiastic persons are telling you how much that strange fruit juice machine has changed their lives, with making them loose 200 pounds in 6 days and improving their performance at beach volleyball a lot due to subneutronic antigravity manipulation. You usually can't watch those commercials for longer than 5 minutes. The same applies to that article, I couldn't even read it completely, it was just too much. And is it just me or did that article really take that long to mentioning what dtrace actually IS? Come on, it's profiling. As presented by that article, it is even more micro optimization than one would think. What with tweaking the disk I/O improvements and all... If my harddisk accesses were a microsecond more immediate or my filesystem giving a quantum more transfer rate, it would be nice, but I certainly wouldn't get enthusiastic and I bet nobody would even notice. Maybe, without that article, I would recognize it as a fine thing (and by "fine" I don't mean "the best thing since sliced bread"), but that piece of text was just too ridiculous to take anything serious. I sure hope that article is meant sarcastically. By the way, did I miss something or is profiling suddenly a new thing again?

Regards,
Julien

Yes, you missed something, Julien: you forgot to type "dtrace" into google. (If there were a super-nerd equivalent of the Daily Show, we might expect Lewis Black to say that -- probably punctuated with his usual "you moron!") If you had done this, you would have been taken to the DTrace BigAdmin site, which contains links to the DTrace USENIX paper, the DTrace documentation, and a boatload of other material that supports the claims in The Register story. In fact, if you had just scrolled to the bottom of that story you would have read the "Bootnotes" section -- which provides plenty of low-level supporting detail. (Indeed, I'm not sure that I've ever seen The Register publish such user-supplied detail to support a story.)

Sometimes the bigotry surrounding Linux surprises even me: in the time he took to record his misconceptions, Julien could have (easily!) figured out that he was completely wrong. But I guess that even this is too much work for someone who is looking to confirm preconceived notions rather than understand new technology... Fortunately, one of the responses did call Julien on this, if only slightly:

* Julien Oster:

> Miles Lane writes:
>> http://www.theregister.co.uk/2004/07/08/dtrace_user_take/:
>> "Sun sees DTrace as a big advantage for Solaris over other versions of Unix
>> and Linux."
>
> That article is way too hypey.

Maybe, but DTrace seems to solve one really pressing problem: tracking disk I/O to the processes causing it.
Unexplained high I/O utilization is a *very* common problem, and there aren't any tools to diagnose it. Most other system resources can be tracked quite easily: disk space, CPU time, committed address space, even network I/O (with tcpdump and netstat -p). But there's no such thing for disk I/O.

Of course, the responder misses the larger point about DTrace -- that one can instrument one's system arbitrarily and safely with DTrace -- but at least he correctly identifies one capacity in which DTrace clearly leads the pack. And I suppose that this is the best that a rival technology can expect to do, so close to the epicenter of Linux development...
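As an aside on that disk I/O point: attributing I/O to processes is a one-liner with the DTrace io provider. The following is just a sketch of the idea (my illustration, not something from the thread); it sums the bytes of physical I/O issued, keyed by the name of the application that was on CPU when the I/O started:

# dtrace -n 'io:::start{@[execname] = sum(args[0]->b_bcount)}'

Let it run for a while and then ^C, and you get a table of application names and byte counts -- which is exactly the "who is beating up my disk?" question above. (One caveat: because attribution is to the process on CPU at io:::start, deferred writes flushed later by the kernel may not be charged to the application that originally dirtied the data.)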


Solaris

Demo'ing DTrace

With my explanation of ademogone wrong,several people have asked me the more general question: how does one demoDTrace? Thisquestion doesn't have a single answer, especiallygiven that DTrace is best demonstrated with ad hocqueries of the system. Indeed, the best demos are whensomeone in the audience shouts out their own question that they want to see answered: answering such a question instantly with DTrace nearlyalways blows away the questioner -- who has presumably suffered in thepast trying to answer similar questions. Of course, saying "why, there are many ways to demo to DTrace!" is a useless answerto the question of how one demos DTrace; while there are many ways that one can demo DTrace,it's useful to get a flavorof how a typical demo might go. So with the substantial caveat that this is not the way to demoDTrace, but merely a way, let me walk you through an example demo.It all starts by running dtrace(1M)without any arguments:# dtraceUsage: dtrace [-32|-64] [-aACeEFGHlqSvVwZ] [-b bufsz] [-c cmd] [-D name[=def]] [-I path] [-o output] [-p pid] [-s script] [-U name] [-x opt[=val]] [-X a|c|s|t] [-P provider [[ predicate ] action ]] [-m [ provider: ] module [[ predicate ] action ]] [-f [[ provider: ] module: ] func [[ predicate ] action ]] [-n [[[ provider: ] module: ] func: ] name [[ predicate ] action ]] [-i probe-id [[ predicate ] action ]] [ args ... ] predicate -> '/' D-expression '/' action -> '{' D-statements '}' -32 generate 32-bit D programs and ELF files -64 generate 64-bit D programs and ELF files -a claim anonymous tracing state -A generate driver.conf(4) directives for anonymous tracing -b set trace buffer size -c run specified command and exit upon its completion -C run cpp(1) preprocessor on script files -D define symbol when invoking preprocessor -e exit after compiling request but prior to enabling probes -E exit after enabling probes but prior to tracing data -f enable or list probes matching the specified function name -F coalesce trace output by function -G generate an ELF file containing embedded dtrace program -H print included files when invoking preprocessor -i enable or list probes matching the specified probe id -I add include directory to preprocessor search path -l list probes matching specified criteria -m enable or list probes matching the specified module name -n enable or list probes matching the specified probe name -o set output file -p grab specified process-ID and cache its symbol tables -P enable or list probes matching the specified provider name -q set quiet mode (only output explicitly traced data) -s enable or list probes according to the specified D script -S print D compiler intermediate code -U undefine symbol when invoking preprocessor -v set verbose mode (report program stability attributes) -V report DTrace API version -w permit destructive actions -x enable or modify compiler and tracing options -X specify ISO C conformance settings for preprocessor -Z permit probe descriptions that match zero probesI usually just point out this just gives your typical Unix-ish helpmessage -- implicitly making the point that dipping into DTrace doesn'trequire much more than typing its name on the command line. From the output of the usage message,I highlight the "-l" option, noting that it lists all the probes in thesystem. 
I then run with that option -- being sure to pipe the output to more(1):# dtrace -l | more ID PROVIDER MODULE FUNCTION NAME 1 dtrace BEGIN 2 dtrace END 3 dtrace ERROR 4 lockstat genunix mutex_enter adaptive-acquire 5 lockstat genunix mutex_enter adaptive-block 6 lockstat genunix mutex_enter adaptive-spin 7 lockstat genunix mutex_exit adaptive-release 8 lockstat genunix mutex_destroy adaptive-release 9 lockstat genunix mutex_tryenter adaptive-acquire 10 lockstat genunix lock_set spin-acquire 11 lockstat genunix lock_set spin-spin 12 lockstat genunix lock_set_spl spin-acquire 13 lockstat genunix lock_set_spl spin-spin 14 lockstat genunix lock_try spin-acquire 15 lockstat genunix lock_clear spin-release 16 lockstat genunix lock_clear_splx spin-release 17 lockstat genunix CLOCK_UNLOCK spin-release 18 lockstat genunix rw_enter rw-acquire 19 lockstat genunix rw_enter rw-block 20 lockstat genunix rw_exit rw-release 21 lockstat genunix rw_tryenter rw-acquire 22 lockstat genunix rw_tryupgrade rw-upgrade 23 lockstat genunix rw_downgrade rw-downgrade 24 lockstat genunix thread_lock thread-spin 25 lockstat genunix thread_lock_high thread-spin 26 fasttrap fasttrap fasttrap 27 syscall nosys entry 28 syscall nosys return 29 syscall rexit entry 30 syscall rexit return--More--I point out that each of these rows is a point of instrumentation. I thenexplain each of the columns, explaining that thePROVIDER column denotesthe DTrace notion of instrumentationproviderswhich allows for disjoint instrumentation techniques. If the audience isa particularly savvy one, I may point out that thelockstatprovider is a recasting of the instrumentation technique used by lockstat(1M),and that lockstat has been made to be a DTrace consumer. I mention that the MODULE column contains the name of the kernel module (in the case of a kernel probe) or shared object name (in the case of a user-level probe),that the FUNCTION column contains the function that the probeis in, and that the NAME column contains the name of the probe.I explain that a each probe is uniquely identified by the probe tuple:provider, module, function and name,Scrolling down through the output,I point out the probes from syscallprovider, describing its ability toinstrument every systemcall entry and return. (For some audiences, I may pause here to remindthem of the definition of asystemcall.)I then continue to scroll down my probe listing untilI get to output that looks something like this: ... 
1268 fbt unix dvma_release entry 1269 fbt unix dvma_release return 1270 fbt unix softlevel1 entry 1271 fbt unix softlevel1 return 1272 fbt unix freq_notsc entry 1273 fbt unix freq_notsc return 1274 fbt unix psm_unmap entry 1275 fbt unix psm_unmap return 1276 fbt unix cpu_visibility_online entry 1277 fbt unix cpu_visibility_online return 1278 fbt unix setbackdq entry 1279 fbt unix setbackdq return 1280 fbt unix lgrp_choose entry 1281 fbt unix lgrp_choose return 1282 fbt unix cpu_intr_swtch_enter entry 1283 fbt unix cpu_intr_swtch_enter return 1284 fbt unix page_upgrade entry 1285 fbt unix page_upgrade return 1286 fbt unix page_lock_es entry 1287 fbt unix page_lock_es return 1288 fbt unix lgrp_shm_policy_set entry 1289 fbt unix lgrp_shm_policy_set return 1290 fbt unix segkmem_gc entry 1291 fbt unix segkmem_gc return 1292 fbt unix disp_anywork entry 1293 fbt unix disp_anywork return 1294 fbt unix prgetprxregsize entry 1295 fbt unix prgetprxregsize return 1296 fbt unix cpuid_pass1 entry 1297 fbt unix cpuid_pass1 return 1298 fbt unix cpuid_pass2 entry 1299 fbt unix cpuid_pass2 return 1300 fbt unix cpuid_pass3 entry 1301 fbt unix cpuid_pass3 return 1302 fbt unix cpuid_pass4 entry 1303 fbt unix cpuid_pass4 return 1304 fbt unix release_bootstrap entry 1305 fbt unix release_bootstrap return 1306 fbt unix i_ddi_rnumber_to_regspec entry 1307 fbt unix i_ddi_rnumber_to_regspec return 1308 fbt unix page_mem_avail entry 1309 fbt unix page_mem_avail return 1310 fbt unix page_pp_lock entry 1311 fbt unix page_pp_lock return ...I pause here to explain that thefbtprovidercan instrument every function entry and return in the kernel.Making the observation that there are quite a few functions in the kernel,I then quit out of the output, and re-run the command -- this timepiping the output throughwc.Usually, the output is something like this:# dtrace -l | wc -l 32521Anyone who has dealt with traditionalinstrumentation frameworkswill typically express some shock at this -- that there are (for example)32,520 points of instrumentation on an optimized, stock system.Occasionally, someone who has perhaps heard of DTrace may pipe up:"Is that where that 30,000 number comes from that I've seen in in the press?"The quick answer is that this is indeed where that number comes from --but the longer answer is that this really only the beginning becausewith thepid provider, DTrace can instrumentevery instruction in every running app. For perhaps obvious reasons, we create these pid probes lazily -- we don't want to clog up the systemwith millions of unused probes.After having listed all probes and counted them, my next invocation of dtrace is to list the probes yet again -- but this time listing onlyprobes that match the probe tuple "syscall:::entry". I explainthat this means I want to match probes from the syscallprovider named entry -- andthat I don't care about the module and function. 
This output is simplya shorter version of the previous listing:# dtrace -l -n syscall:::entry ID PROVIDER MODULE FUNCTION NAME 27 syscall nosys entry 29 syscall rexit entry 31 syscall forkall entry 33 syscall read entry 35 syscall write entry 37 syscall open entry 39 syscall close entry 41 syscall wait entry...I then explain that to enable these probes (instead of just listingthem), I need to merely not include the "-l":# dtrace -n syscall:::entrydtrace: description 'syscall:::entry' matched 226 probesCPU ID FUNCTION:NAME 0 125 ioctl:entry 0 125 ioctl:entry 0 261 sysconfig:entry 0 261 sysconfig:entry 0 195 sigaction:entry 0 195 sigaction:entry 0 125 ioctl:entry 0 261 sysconfig:entry 0 61 brk:entry 0 61 brk:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 403 fstat64:entry 0 61 brk:entry 0 61 brk:entry 0 403 fstat64:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 35 write:entry 0 125 ioctl:entry 0 403 fstat64:entry 0 35 write:entry 0 125 ioctl:entry 0 125 ioctl:entry 0 339 pollsys:entry 0 33 read:entry ...I let this run on the screen for a while, explaining that we've gone froman optimized, uninstrumented system call table to an instrumented one --and that we're now seeing every system call as it happens on the box.I explain that (on the one hand) this is novel information: in Solaris,we haven't had a way to get system call information for the entiresystem. (I often remind the audience that this may look similar totruss,but it's really quite different: truss operates only on a singleprocess, and has an enormous probe effect from constantly stopping andrunning the target process.) But while this information may be novel (at least for Solaris), it'sactually not thatvaluable -- merely knowing that we executed these systems calls is unlikelyto be terrible helpful.Rather, we may like to know which applications are inducing thesesystem calls. To do this, we add a clause in the DTrace control language,D. The clause we need to add is quitesimple; we're going to use the traceaction to record the current application name, which is stored inexecname variable. So I \^C the previousenabling, and run the following:# dtrace -n syscall:::entry'{trace(execname)}'dtrace: description 'syscall:::entry' matched 226 probes... 
0 125 ioctl:entry xterm 0 125 ioctl:entry xterm 0 339 pollsys:entry xterm 0 339 pollsys:entry xterm 0 313 lwp_sigmask:entry Xsun 0 313 lwp_sigmask:entry Xsun 0 313 lwp_sigmask:entry Xsun 0 313 lwp_sigmask:entry Xsun 0 33 read:entry Xsun 0 233 writev:entry Xsun 0 125 ioctl:entry xterm 0 33 read:entry xterm 0 339 pollsys:entry xterm 0 339 pollsys:entry xterm 0 125 ioctl:entry xterm 0 125 ioctl:entry xterm 0 339 pollsys:entry xterm 0 125 ioctl:entry xterm 0 125 ioctl:entry xterm 0 339 pollsys:entry xterm 0 339 pollsys:entry xterm 0 313 lwp_sigmask:entry Xsun 0 313 lwp_sigmask:entry Xsun 0 313 lwp_sigmask:entry Xsun 0 313 lwp_sigmask:entry Xsun 0 339 pollsys:entry Xsun 0 35 write:entry dtrace 0 125 ioctl:entry xterm 0 125 ioctl:entry xterm 0 339 pollsys:entry xterm 0 33 read:entry xterm 0 125 ioctl:entry xterm 0 35 write:entry xterm 0 125 ioctl:entry xterm 0 339 pollsys:entry xterm 0 125 ioctl:entry xterm 0 33 read:entry xterm 0 35 write:entry xterm 0 125 ioctl:entry xterm 0 125 ioctl:entry xterm 0 339 pollsys:entry xterm 0 339 pollsys:entry xterm 0 313 lwp_sigmask:entry Xsun ...This is clearly more useful: we can see which apps are performingsystem calls. But seeing this also leads to a very important observation,one that some heckler in a savvy audience may have already made by now:what we're seeing in the above output is the system calls to write allof this gunk out to the terminal. That is, if we were to take this output andcut it,sort it,uniq it,and sort it again, we would come to the conclusion that dtrace, xterm and Xsun were the problem --regardless of what was being executed on the machine. This is a serious problem, and the DTrace solution represents a substantialdifference from virtually all earler work in this domain: in DTrace,the aggregation of data is a first-class citizen. Instead of aggregatingdata postmortem with traditional Unix tools, DTrace allows data to beaggregated in situ. This aggregation can reduce the data flow by up to a factor of the number of data points -- a tremendous savingsof both space and time.So to recast the example in terms of aggregations, we want to aggregateon the application name, with our aggregating action being to count:# dtrace -n syscall:::entry'{@[execname] = count()}'dtrace: description 'syscall:::entry' matched 226 probesWhen we run the above, we don't see any output -- just that we matchedsome number of probes. I explain that behind the scenes, we're justkeeping a table of application name, and the count of system calls.Whenever we \^C the above, we get the answer:\^C utmpd 4 automountd 8 inetd 9 svc.configd 9 syslogd 16 FvwmAuto 39 in.routed 48 svc.startd 97 MozillaFirebird- 125 sendmail 183 fvwm2 621 dhcpagent 1192 dtrace 2195 xclock 2894 vim 7760 xterm 11768 Xsun 24916#I often point out that in DTrace -- as in good cocktail conversation -- theanswer to one question typicallyprovokes the next question. Looking at the above output, wemay have all sorts of questions. For example, we may well ask: what thehell isdhcpagentdoing? To answer this question, we can add apredicateto our clause:# dtrace -n syscall:::entry'/execname == "dhcpagent"/{@[probefunc] = count()}'The predicate is contained between the forward slashes; the expression inthe curly braces is only evaluated if the predicate evaluates to a non-zerovalue. And note that we have changed the tuple on which we are aggregating:instead of aggregating on execname (which, thanks to the predicate,will always be "dhcpagent"), we're aggregating on the probefunc. 
For the syscall provider, the probefuncis the name of the system call.Running the above for ten seconds or so, and then \^C'ing yieldsthe following (at least on my laptop):\^C pollsys 3 close 3 open 3 lwp_sigmask 6 fstat64 6 read 12 ioctl 12 llseek 15Now, we may be interested in drilling down more into one of these systemcalls. Let's say we want to know what dhcpagent is doing whenit performs anopen.We can effect this by narrowing our enabling a bit: instead of enablingall syscall probes named entry, we want to enableonly those probes that are also from the function named open.We want to keep our predicate that the execname is dhcpagent, but this time we don't want to aggregate onthe probefunc (which, thanks to the narrowed enabling, willalways be "open"); we want to aggregate on the name of thefile that is being opened -- which is the first argument to the open system call. To do this, we'll aggregate on arg0.For slightly arcane reasons,we need to use thecopyinstraction to effect this. The enabling:# dtrace -n syscall::open:entry'/execname == "dhcpagent"/{@[copyinstr(arg0)] = count()}'dtrace: description 'syscall::open:entry' matched 1 probeAgain, running the above for about ten seconds:\^C /etc/default/dhcpagent 4This tells us that there were four calls to the open system call from dhcpagent -- and that all four were to the same file.Just on the face of it, this is suspicious: why is dhcpagentopening the file "/etc/default/dhcpagent" with such frequency?Hasn't whoever wrote this code ever heard ofstat?1 To answer the formerquestion, we're going to change our enablingto aggregate on dhcpagent's stackbacktrace when it performs an open. This is done with the ustack action. The new enabling:# dtrace -n syscall::open:entry'/execname == "dhcpagent"/{@[ustack()] = count()}'dtrace: description 'syscall::open:entry' matched 1 probeRunning the above for ten seconds or so and then \^C'ing yieldsthe following:\^C libc.so.1`__open+0xc libc.so.1`_endopen+0x92 libc.so.1`fopen+0x26 libcmd.so.1`defopen+0x40 dhcpagent`df_get_string+0x27 dhcpagent`df_get_int+0x1c dhcpagent`dhcp_requesting+0x248 libinetutil.so.1`iu_expire_timers+0x8a libinetutil.so.1`iu_handle_events+0x95 dhcpagent`main+0x300 805314a 4This tells us that all four of the open calls were from the above stacktrace.From here we have many options, and our course of action would depend onhow we wanted to investigate this. Up until this point, we have beeninstrumenting only the system call layer -- the shim layer between applications and the kernel. We may wish to now instrument the applicationitself. To do this, we first need to know how many dhcpagentprocesses are contributing to the output we're seeing above. Thereare lots of ways to do this; a simple way is to aggregate on pid:# dtrace -n syscall:::entry'/execname == "dhcpagent"/{@[pid] = count()}'dtrace: description 'syscall:::entry' matched 226 probes\^C 43 80This indicates that in the time between when we started the above, andthe time that we \^C'd, we saw 80 system calls from pid 43 -- andthat is was the only dhcpagent process that made any system calls.Now that we have the pid that we're intrested in, we can instrument theapplication. For starters, let's just instrument the defopenfunction in libcmd.so.1. 
(We know that it's being called becausewe saw it in the stack backtrace of the open system call.)To do this:# dtrace -n pid43::defopen:entrydtrace: description 'pid43::defopen:entry' matched 1 probeCPU ID FUNCTION:NAME 0 32622 defopen:entry 0 32622 defopen:entry 0 32622 defopen:entry 0 32622 defopen:entry \^CWith this we have instrumented the running app. When we \^C'd,the app went back to being its usual optimized self. (Or at least, asoptimized as dhcpagent ever is.)If we wanted to trace both entry to and return from the defopenfunction, we could add another enabling to the mix:# dtrace -n pid43::defopen:entry,pid43::defopen:returndtrace: description 'pid43::defopen:entry,pid43::defopen:return' matched 2 probesCPU ID FUNCTION:NAME 0 32622 defopen:entry 0 32623 defopen:return 0 32622 defopen:entry 0 32623 defopen:return 0 32622 defopen:entry 0 32623 defopen:return 0 32622 defopen:entry 0 32623 defopen:return \^CBut we may want to know more than this: we may want to know everythingcalled from defopen. Answering this is a bit more sophisticated -- and it's a little too much for the command line. Toanswer this question, we head to an editor and type in the followingDTrace script:#!/usr/sbin/dtrace -spid43::defopen:entry{ self->follow = 1;}pid43:::entry,pid43:::return/self->follow/{}pid43::defopen:return/self->follow/{ self->follow = 0; exit(0);}So what is this script doing? We're enabling defopen again,but this time we're setting athread-local variable --which we named "follow" to 1. We then enabled every functionentry and return in the process, with the predicate that we're only interestedif our thread-local variable is non-zero. This has the effect of onlytracing function entry and return if the function call was induced bydefopen. We're only interested in capturing one of thesetraces, which is the reason why we explicitly exitin the pid43::defopen:return clause.Running the above is going to instrument the hell out of dhcpagent. We're going to see that we matched many, many probes.Then we'll see a bunch of output as defopen is called. Finally,we'll be dropped back onto the shell prompt as we hit the exitaction. By the time we see the shell again, we have restored thedhcpagent process to its uninstrumented self. 
Running the above,assuming it's in dhcp.d:# chmod +x dhcp.d# ./dhcp.ddtrace: script './dhcp.d' matched 10909 probesCPU ID FUNCTION:NAME 0 32622 defopen:entry 0 33328 _get_thr_data:entry 0 37823 thr_getspecific:entry 0 43272 thr_getspecific:return 0 38784 _get_thr_data:return 0 37176 fopen:entry 0 37161 _findiop:entry 0 37525 pthread_rwlock_wrlock:entry 0 37524 rw_wrlock_impl:entry 0 37513 rw_read_held:entry 0 42963 rw_read_held:return 0 37514 rw_write_held:entry 0 42964 rw_write_held:return 0 37518 rwlock_lock:entry 0 37789 sigon:entry 0 43238 sigon:return 0 42968 rwlock_lock:return 0 42974 rw_wrlock_impl:return 0 42975 pthread_rwlock_wrlock:return 0 37174 getiop:entry 0 37674 mutex_trylock:entry 0 43123 mutex_trylock:return 0 37676 mutex_unlock:entry 0 43125 mutex_unlock:return 0 42625 getiop:return 0 37174 getiop:entry 0 37674 mutex_trylock:entry 0 43123 mutex_trylock:return 0 37676 mutex_unlock:entry 0 43125 mutex_unlock:return 0 42625 getiop:return 0 37174 getiop:entry 0 37674 mutex_trylock:entry 0 43123 mutex_trylock:return 0 37676 mutex_unlock:entry 0 43125 mutex_unlock:return 0 42625 getiop:return 0 37174 getiop:entry 0 37674 mutex_trylock:entry 0 43123 mutex_trylock:return 0 35744 memset:entry 0 41198 memset:return 0 37676 mutex_unlock:entry 0 43125 mutex_unlock:return 0 42625 getiop:return 0 37531 pthread_rwlock_unlock:entry 0 37514 rw_write_held:entry 0 37680 mutex_held:entry 0 43129 mutex_held:return 0 42964 rw_write_held:return 0 37789 sigon:entry 0 43238 sigon:return 0 42981 pthread_rwlock_unlock:return 0 42612 _findiop:return 0 37123 _endopen:entry 0 37309 _open:entry 0 37951 __open:entry 0 43397 __open:return 0 42760 _open:return 0 37152 _flockget:entry 0 37670 mutex_lock:entry 0 37669 mutex_lock_impl:entry 0 43118 mutex_lock_impl:return 0 43119 mutex_lock:return 0 42603 _flockget:return 0 37676 mutex_unlock:entry 0 43125 mutex_unlock:return 0 42574 _endopen:return 0 42627 fopen:return 0 32623 defopen:return 0 32623 defopen:return #This is interesting, but a little hard to sift through. Go back todhcp.d, and add the following line:#pragma D option flowindentThis sets a simple optionthat indents a code flow: probes named entry cause indentation to increase two spaces, and probes namedreturn cause indentation to decrease two spaces. Running thenew dhcp.d:# ./dhcp.ddtrace: script './dhcp.d' matched 10909 probesCPU FUNCTION 0 -> defopen 0 -> _get_thr_data 0 -> thr_getspecific 0 _findiop 0 -> pthread_rwlock_wrlock 0 -> rw_wrlock_impl 0 -> rw_read_held 0 rw_write_held 0 rwlock_lock 0 -> sigon 0 start = 0; exit(0);}Running dhcptime.d:# ./dhcptime.ddtrace: script './dhcptime.d' matched 10909 probesCPU FUNCTION 0 -> defopen 0 0 -> _get_thr_data 2276 0 -> thr_getspecific 6214 0 _findiop 34498 0 -> pthread_rwlock_wrlock 37822 0 -> rw_wrlock_impl 39265 0 -> rw_read_held 42514 0 rw_write_held 64573 0 rwlock_lock 69802 0 -> sigon 89495 0
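One natural next step from here is timing: knowing the flow of control is good, but knowing where the time goes is better. A minimal sketch of a timing variant of dhcp.d -- assuming the same pid 43 and the same self->follow idiom as above, and tracing the nanoseconds elapsed since defopen was entered (call it dhcptime.d) -- might look something like this:

#!/usr/sbin/dtrace -s

#pragma D option flowindent

pid43::defopen:entry
{
	self->follow = 1;
	self->start = timestamp;
}

pid43:::entry,
pid43:::return
/self->follow/
{
	trace(timestamp - self->start);
}

pid43::defopen:return
/self->follow/
{
	self->follow = 0;
	self->start = 0;
	exit(0);
}

This prints a running elapsed time next to each function entry and return in the flow-indented output; substituting vtimestamp for timestamp would exclude the time spent in DTrace itself, which starts to matter once you're instrumenting every function in the process.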


Solaris

Demo Perils

One of the downsides of being an operating systems developer is that thedemos of the technology that you develop often suck. ("Look, it boots!And hey, we can even run programs and it doesn't crash!") So it's been apleasant change to develop DTrace, a technology that packs a jaw-droppingdemo. In demonstrating DTrace for customers around the world, I have hadthe distinct (and rare) pleasure of impressing the most technically adept(and often jaded) audiences. My typical demonstration is on mySolaris x86 laptop, where I use DTrace toinstrument the running system -- exploringwith the audience the peculiarities that exist even on an idle laptop.(This usually involves discovering and understanding the unnecessary workbeing done byacroread,dhcpagent,sendmail, etc.)This ad hoc demoshows DTrace as it's meant to be used: dynamically answering questionsthat themselves were formed on-the-fly.And when I demonstrate DTrace, I always do so on the absolute latestSolaris 10 build. Our mantra in Solaris Kernel Development is "FCS QualityAll the Time" -- we believe that the product should always beready to be run in production. And if we're going to tell a customer thatit's ready to be run in production, we damn well better run it inproduction ourselves. This has the added advantage that we tend to runinto bugs before our customers do, allowing us to ship a final product thatis that much more solid. Over the past year, I have given hundreds ofDTrace demonstrations in front of customers running latest bits, and beforelast week, it had always gone off without a hitch...1Last week, I had the opportunity to give a DTrace demonstration for ahighly technical -- and highly influential -- audience at a Fortune 100company. When I demonstrate DTrace, I typically do a couple of invocationson the command line before things become sufficiently complicated to meritwriting a DTracescript. And it was when I went to runthe first such script (a script that explored the activity ofxclock) thatit happened:# dtrace -s ./xclock.dSegmentation Fault (core dumped)#If you've never had it, there's no feeling quite like having a demo blow upon you: it's as if you peed your pants, failed an exam and were punched inthe gut -- all at the same horrifying instant. It's a feeling that everysoftware developer should have exactly once in their lives: that uniquerush of shock,and then humiliation and then despair, followed by the adrenal surge of afight-or-flight reaction. In the time it takes a single process to dumpcore, you go from an (over)confident technologist to a frightened woodlandcreature, transfixed by the light of an oncoming freight train. For thewoodland creature, at least it all ends mercifully quickly; the creature isspared the suffering of trying to explain away its foolishness. Thehapless technologist, on the other hand, is left with several options: Pretend that you didn't write the software: "Boy, will you get a loadof those fancy-pants software engineers? Overpriced underworked morons,every last one!" Explain that this is demo software and isn't expected to work:"Well, that's why we haven't shipped it yet! I mean, what fool wouldrun this stuff anyway? Other than me, that is." Make light of it: "Hey, knock knock! Who's there? Not mysoftware, that's for sure! Wocka wocka wocka!" Suck it up: "That's a serious problem. 
If you can excuse me fora second, let me get a handle on what we've got here that we can demo."I always aim for this last option, buton the rare occasion that this has happened to me (and this is -- honest --probably the worst that a customer-facing demo has gone for me)I usually end up withsome combination of the last three, often with plenty of stuttering,some mild swearing ("Damn! Damn!") and profuse sweating.In my particular case, the worst part was not knowing the exact pathologyof the bug that I had justrun into. Was there something basic that was broken or toxic aboutmy machine? Would all scripts that I tried to run dumpcore? And if this was broken, what else was broken? Would I panic themachine or crash a target app if I continued? (Much more seriousproblems, both.) In an effort to get a handle on it, I did a quickpstack on thecore file: 0804718f ???????? (8046604, 2) d137c839 dt_instr_size (82d051a, 8067320, 223, d1380fe2) + 59 d137c0c2 dt_pid_create_return_probe (81651b8, 8067320, 8046af0, 8047170, 80472d d137370d dt_pid_per_sym (80472ac, 8047170, d087b02c) + 15b d13739ae dt_pid_sym_filt (80472ac, 8047170, d087b02c, 804715c) + 7c d13152ca Psymbol_iter_com (81651b8, ffffffff, 8069060, 1, 407, 1) + 1e0 d13153ae Psymbol_iter_by_addr (81651b8, 8069060, 1, 407, d1373932, 80472ac) + 1 d1373b81 dt_pid_per_mod (80472ac, 82cf600, 8069060) + 191 d1373d56 dt_pid_mod_filt (80472ac, 82cf600, 8069060) + a3 d1314fe4 Pobject_iter (81651b8, d1373cb3, 80472ac) + 4f d13740b4 dt_pid_create_probes (82cafa0, 8067320) + 344 d1353af8 dt_setcontext (8067320, 82cafa0) + 42 d13537d4 dt_compile_one_clause (8067320, 82be430, 82cdae0) + 32 d1353a9c dt_compile_clause (8067320, 82be430) + 26 d1354d66 dt_compile (8067320, 16a, 3, 0, 80, 1) + 3d9 d1355263 dtrace_program_strcompile (8067320, 8047ec2, 3, 80, 1, 8066848) + 23 080526ef ???????? (8066e48) 0805370e main (3, 8047df8, 8047e08) + 8fc 0805177a ???????? (3, 8047eb8, 8047ebf, 8047ec2, 0, 8047edf)This was dying in the code that analyzes a target binary as part ofcreatingpidprovider probes. There was at least a chance thatthis problem was localized to something specific about the xclockprogram text -- it was worth trying a similar script on a differentprocess.Fortunately, I was able to stave off total panic longenough to write such a script and -- even better -- this one worked. The problem did indeed seem to be localized to somethingspecific in xclock. And thanks to mycoreadmsettings, the core file from the seg faultingdtracehad been stashed away for later analysis; the best thing I could do atthat point wasdrive on with the rest of the demo.And this is what I did. The rest of the demo went well, and theaudience was ultimately impressed with the technology. And while I never quiteregained my stride (in part because my mind was racingabout which change to DTrace could have introduced the problem), Iwas at least sufficiently effective -- we achieved the goals of themeeting.2On the plane back home, I root-caused the problem and developed a fix.The next day, I integrated the fix into Solaris -- and I don't thinkI've ever been so relieved to put latest bits on my laptop!In the end, having the demo blow up certainly wasn't a pleasant experience --but I wouldn't change mydecision to demo on the latest bits. Not only did we discover a seriousbug, we discovered the hole in our test suite that prevented us fromfinding the bug before it integrated. So who am I to get upset about alittle personal humiliation if the upshot is a better product? 
;)

1 This is a slight exaggeration. I had actually run into DTrace bugs in front of customers, but they were always sufficiently small that only a trained eye would realize that something was amiss -- things like slightly incorrect error messages.

2 The primary goal of such a demo is often to get the customer sufficiently excited about Solaris 10 to download Solaris Express (usually for x86) and start playing around with the technology themselves. We are nearly always successful in this -- and I have even had a few customers start downloading Solaris Express before the end of the meeting!
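As an aside for anyone who wants the same safety net mentioned above: the reason there was a core file to pstack at all is that global core dumps had been configured ahead of time with coreadm. A hypothetical setup (the path and pattern here are just examples) looks like:

# mkdir -p /var/cores
# coreadm -g /var/cores/core.%f.%p -e global

In the pattern, %f expands to the executable name and %p to the process ID; with that in place, any process that dumps core leaves a predictably named file behind for later debugging instead of a core file scattered (or clobbered) in some working directory.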


Solaris

The DTrace integration

I've been following with interest this thread on the linux-kernel mailing list. The LTT folks have apparently given up on the claim that they've got "basically almost everything [DTrace] claims to provide." They now acknowledge the difference in functionality between LTT and DTrace, but they're using it to continue an attack on the Linux development model. (Or more accurately, to attack how that model was applied to them.) The most interesting paragraph is this one in a post by Karim Yaghmour:

As for DTrace, then its existence and current feature set is only a testament to the fact that the Linux development model can sometimes have perverted effects, as in the case of LTT. The only reason some people can rave and wax about DTrace's capabilities and others, as yourself, drag LTT through the mud because of not having these features, is because the DTrace developers did not have to put up with having to be excluded from their kernel for 5 years. As I had said earlier, we would be eons ahead if LTT had been integrated into the kernel in one of the multiple attempts that was made to get it in in the past few years. Lest you think that DTrace actually got all its features in a single merge iteration ...

Some points of clarification: we actually did get most of our features in our initial integration into Solaris (on September 3, 2003). In Solaris, projects that integrate are expected to be in a completed state; if there is follow-on work planned, fine -- but what integrates into the gate must be something that is usable and complete from a customer's perspective. So contrary to Karim's assertion, most of DTrace came in that first (giant) putback.

As a consequence of this, DTrace spent a long time living the same way LTT has lived: outside the gate, trying to keep in sync while development was underway. Admittedly, DTrace did this for two years -- not five. And this is Solaris, not Linux; it's easier to keep in sync if only because there is only one definition of the latest Solaris. (The newest DTrace build was always based off of the latest Solaris 10 build.) Still: we didn't let the fact that we had not yet integrated prevent us from developing DTrace, nor did we let it prevent us from building a user community around DTrace. By the time DTrace (finally!) integrated into Solaris 10, we had hundreds of internal users, and a long list of actual problems that were found only with the help of DTrace. Not that DTrace would have been unable to integrate without these things, but having them certainly accelerated the process of integration.

More generally, though, I'm getting a little tired of this argument that LTT would be exactly where DTrace is had they only been allowed into the Linux kernel five years ago. I believe that there is some fundamental innovation in DTrace that LTT simply did not anticipate. For insight into what LTT did anticipate, look at the LTT To Do List from 2002. In that document, you will find many future directions, but not so much as a whisper of the need for DTrace basics like aggregations, speculative tracing, thread-local variables or associative arrays -- let alone DTrace arcana like stability or translators. Would LTT be further along now had it been allowed to integrate into Linux five years ago? Surely. Would it be anywhere near where DTrace is today? In the immortal words of the Magic 8-Ball, "very doubtful."


Solaris

DTrace vs. DProbes/LTT

Recently, Karim Yaghmour posted the following to the linux-kernel mailing list:

As I noted when discussing this with Andrew, we've been trying to get LTT into the kernel for the past five (5) years. During that time we've repeatedly encountered the same type of arguments for not including it, and have provided proof as to why those arguments are not substantiated. Lately I've at least got Andrew to admit that there were no maintenance issues with the LTT trace statements (given that they've literally remained unchanged ever since LTT was introduced.) In an effort to address the issues regarding the usefulness of such a tool, I direct those interested to this article on DTrace, a trace utility for Solaris:

http://www.theregister.co.uk/2004/07/08/dtrace_user_take/

<rant>
With LTT and DProbes, we've basically got almost everything this tool claims to provide, save that we would be even further down the road if we did not need to spend so much time updating patches ...
</rant>

Karim
-- Author, Speaker, Developer, Consultant

Now, Karim's really only interested in DTrace in that it helps him make his larger point that his project has been unfairly (or unwisely) denied entry into the Linux kernel. His is a legitimate point, and something that is often lost in the assertions that Linux is developed faster than other operating systems: for all of its putative development speed, Linux has a surprising number of otherwise valuable projects that have been repeatedly denied entry for reasons that seem to be petty and non-technical. DProbes/LTT is certainly one example of such a project, and LKCD is probably another.

But what of Karim's assertion that LTT and DProbes "basically [have] everything [DTrace] claims to provide"? This claim is false, and indicates that while Karim may have scanned The Register article, he didn't bother to browse even our USENIX paper -- let alone our documentation. From these, one will note that while LTT lacks many DTrace niceties, it also lacks several vital features. Two among these are aggregations and thread-local variables -- two features that are not syntactic sugar or bolted-on afterthoughts, but rather are core to the DTrace architecture. These features turn out to be essential in using DTrace to quickly resolve problems. For an example of how these features are used, see Section 9 of our USENIX paper -- and note that every script that we wrote to debug that problem used aggregations, and that several critical steps were only possible with thread-local variables. And fortunately, you don't even have to take my word for it: Red Hat developer Daniel Berrangé has posted a comparison of DTrace and DProbes/LTT that reaches roughly the same conclusions...
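For readers who haven't seen those two features working together, here is a minimal sketch (my own illustration, not a script from the paper) that needs both: it measures read(2) latency by application. The thread-local variable is what lets the entry and return probes be correlated correctly even when many threads are in read(2) at once, and the aggregation is what keeps the data volume sane -- summaries are maintained per-CPU in the kernel, and only the aggregated result is ever copied out:

#!/usr/sbin/dtrace -s

syscall::read:entry
{
	/* stash the entry time for this thread only */
	self->ts = timestamp;
}

syscall::read:return
/self->ts/
{
	/* power-of-two latency distribution, keyed by application name */
	@[execname] = quantize(timestamp - self->ts);
	self->ts = 0;
}

^C'ing this prints a latency histogram per application. Neither half of that is expressible in a static tracepoint-plus-postprocessing model without first shipping every event to user-level -- which is the architectural point of the comparison above.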


Solaris

Whither systems research?

Ted Leung noted the discussion that Werner and I have been having, and observed that we should consider Rob Pike's (in)famous polemic, "Systems Software Research is Irrelevant." I should say that I broadly agree with most of Pike's conclusions -- and academic systems software research has seemed increasingly irrelevant in the last five years. That said, I think that what Pike characterizes as "systems research" is far too skewed to the interface to the system -- which (tautologically) is but the periphery of the larger system. In my opinion, "systems research" should focus not on the interface of the system, but rather on its guts: those hidden Rube Goldberg-esque innards that are rife with old assumptions and unintended consequences. Pike would perhaps dismiss the study of these innards as "phenomenology", but I would counter that understanding phenomena is a prerequisite to understanding larger systemic truths. Of course, the problem to date has been that much systems research has not been able to completely understand phenomena -- the research has often consisted merely of characterizing it.

As evidence that systems research has become irrelevant, Pike points to the fact that SOSP has had markedly fewer papers presenting new operating systems, observing that "a new language or OS can make the machine feel different, give excitement, novelty." While I agree with the sentiment that innovation is the source of excitement (and that such exciting innovation has been woefully lacking from academic systems research), I disagree with the implication that systems innovation is restricted to a new language or OS; a new file system, a new debugger, or a new way of virtualization can be just as exciting. So the good news is that work need not be a new system to be important systems work, but the bad news is that while none of these is as large as a new OS, they're still huge projects -- far more than a graduate student (or even a lab of graduate students) can be expected to complete in a reasonable amount of time.

So if even these problems are too big for academia, what's to become of academic systems research? For starters, if it's to be done by graduate students, it will have to be content with smaller innovation. This doesn't mean that it need be any less innovative -- just that the scope of innovation will be naturally narrower. As an extreme example, take the new nohup -p in Solaris 9. While this is a very small body of work, it is exciting and innovative. And yet, most academics would probably dismiss this work as completely uninteresting -- even though most could probably not describe the mechanism by which it works. Is this a dissertation? Certainly not -- and it's not even clear how such a small body of work could be integrated into a larger thesis. But it's original, novel work, and it solves an important and hard (if small) problem. Note, too, that this work is interesting because of the phenomenon that prohibited a naive implementation: any solution that doesn't address the deadlock inherent in the problem isn't actually an acceptable solution. This is an extreme example, but it should make the point that smaller work can be interesting -- as long as it's innovative, robust and thorough.

But if the problems that academic systems researchers work on are going to become smaller, the researchers must have the right foundation upon which to build their work: small work is necessarily more specific, and work is markedly less relevant if it's based on an obsolete system.
And (believe it or not) this actually brings us to one of our primary motivations for open sourcing Solaris: we wish to provide complete access to a best-of-breed system that allows researchers to solve new problems instead of revisiting old ones. Will an open source Solaris single-handedly make systems research relevant? Certainly not -- but it should make for one less excuse...


Solaris

Whither USENIX? (Part II)

Werner Vogels,a member of theUSENIX '04Program Committee, haswrittenverythoughtful responses to some ofmyobservations. And it's clear thatWerner and I see the same problem: there is insufficientindustrial/academiccooperation in computer science systems research -- and the lack ofcooperation is to the detriment of both groups.That said, it's clear that there are some different perspectives as to howto address the problem. A common sentiment that I'm seeing in the commentsis that it is up to industry to keep USENIX relevant (in Werner's words,"industry will need to be more pro-active in making researchers aware ofwhat the problems are that they need to solve"). I don't entirely agree;in my opinion, the responsibility for keeping USENIX relevant doesn't lieexclusively with industry -- and it doesn't lie exclusively with academia, either. Rather, theresponsibility lies with USENIX itself, for it isthe mission of USENIX toencourage research with a "practical bias." As such, it is up to USENIX toassemble a Program Committee that will reflect this mission, and it is upto both academia and industry to participate as requested.This means that USENIX cannotsimply wait for volunteers from industry to materialize -- USENIX mustseek out people in industry who understand both the academic and theindustrial sides of systems research, and they must convince thesepeopleto work on a Program Committee. Now, I know that this has happened inthe past -- and frankly I thought that the USENIX '04 Program Committee wasa step in the right direction: whereUSENIX '03 had four (of sixteen)members from industry, USENIX '04 had six (of seventeen). Butunfortunately,USENIX'05seems to be a marked decline in industry participation, even from USENIX'03: the number from industry has dropped back to four (of eighteen).Worse, all four are from industry labs; where both USENIX '03 and USENIX'04 had at least one product-oriented member from industry, USENIX '05 hasnone.Examining these three years of USENIX brings up an interesting question:what has the Program Committee composition looked like over time?That is, is the situation getting better or worse vis a vis industryparticipation?To answer this question,I looked at the Program Committee composition for the last nine years.The results are perhaps well-known, but they were shocking to me:To me, this trend should be deeply disconcerting: an organization thathas dedicated itself to research with a "practical bias" is clearlylosing that bias in its flagship conference.So what to do? First, we need some recognition from the USENIX sidethat this is a serious issue, and that it requires substantial correctiveaction. I believe thatthe USENIX Boardshould chartera committee that consists of academia and industry (both labs andproduct groups) in roughly equal measure.This committee should hash out some of the misconceptions thateach group has of the other, clearly define the problems, develop somelong-term (measurable) goals, and make some concreteshort- and medium-term recommendations. The deliverable of the committeeshould be a report summarizing their findings and recommendations --recommendations that the Board should consider but is obviously free toignore.The situation is serious, and there is much work to be done torectify it -- but I am heartened by the amount of thought that Wernerhas put into this issue. If we can find more like him from bothindustry and academia, we can get the "practical bias" backinto USENIX.


Solaris

Whither USENIX?

As I mentioned earlier, I recently returned from USENIX '04, where we presented the DTrace paper. It was a little shocking to me that our paper was the only paper to come exclusively from industry: most papers had no industry connection whatsoever, and the papers that had any authors from industry were typically primarily written by PhD students interning at industry labs. The content of the General Session was thus academic in the strictest sense: it consisted of papers written by graduate students, solving problems in systems sufficiently small to be solvable by a single graduate student working for a reasonably short period of time. The problem is that many of these systems -- to me at least -- are so small as to not be terribly relevant. This is important because relevance is sufficiently vital to USENIX to be embodied in the Mission Statement: USENIX "supports and disseminates research with a practical bias." And of course, there is a more pragmatic reason to seek relevance in the General Session: most of the attendees are from industry, and most of them are paying full freight. Given that relevance is so critical to USENIX, I was a little surprised that -- unlike most industry conferences I have attended -- there was no way to provide feedback on the General Session. How does the Steering Committee know if the research has a "practical bias" if they don't ask the question?

This leads to the more general question: how do we keep the "practical bias" in academic systems research? Before I try to answer that directly, it's worth looking at the way research is conducted by other engineering disciplines. (After all, one of the things that separates systems from the rest of computer science is its relative proximity to engineering.) To me, it's very interesting to look at the history of mechanical engineering at MIT. In particular, note the programs that no longer exist:

Marine engineering, stopped in 1913
Locomotive engineering, stopped in 1918
Steam turbine engineering, stopped in 1918
Engine design, stopped in 1925
Automotive engineering, stopped in 1949

Why did these programs stop? It's certainly not because there weren't engineering problems to solve. (I can't imagine that anyone would argue that a 1949 V8 was the ne plus ultra of internal combustion engines.) This is something of an educated guess (I'm not a mechanical engineer, so I trust someone will correct me if I'm grossly wrong here), but I bet these programs were stopped because the economics no longer made sense: it became prohibitively expensive to meaningfully contribute to the state of the art. That is, these specialties were so capital and resource intensive that they could no longer be undertaken by a single graduate student, or even by a single institution. By the time an institution had built a lab and the expertise to contribute meaningfully, the lab would be obsolete and the expertise would have graduated. Moreover, the disciplines were mature enough that there was an established industry that understood that research begat differentiated product, and differentiated product begat profit. Industry was therefore motivated to do its own research -- which is a good thing, because only industry could afford it.

And what has happened to, say, engine design since the formal academic programs stopped? Hard problems are still being solved, but the way those problems are solved has changed. For example, look at the 2001 program for the Small Engine Technology Conference. A roughly typical snippet:

G.P. BLAIR - The Queen's University of Belfast (United Kingdom); D.O. MACKEY, M.C. ASHE, G.F. CHATFIELD - OPTIMUM Power Technology (USA): "Exhaust pipe tuning on a four-stroke engine; experimentation and simulation"
G.P. BLAIR, E. CALLENDER - The Queen's University of Belfast (United Kingdom); D.O. MACKEY - OPTIMUM Power Technology (USA): "Maps of discharge coefficient data for valves, ports and throttles"
V. LAKSHMINARASIMHAN, M.S. RAMASAMY, Y. RAMACHANDRA BABU - TVS-Suzuki (India): "4 stroke gasoline engine performance optimization using statistical techniques"
K. RAJASHEKHAR SWAMY, V. HARNE, D.S. GUNJEGAONKAR - TVS-Suzuki (India); K.V. GOPALKRISHNAN - Indian Institute of Technology (India): "Study and development of lean burn systems on small 4-stroke gasoline engine"

Note that there's some work exclusively by industry, and some work done in conjunction with academia. (There's some work done exclusively by academia, too -- but it's the exception, and it tends to be purely analytical work.) And here's the Program Committee for this conference:

M. NUTI - Industrial Consultant
P. COLOMBO - Dell'Orto
C. DOVERI - EDI Progetti e Sviluppo
G. FORASASSI - Università di Pisa
R. GENTILI - Università di Pisa
G. LASSANSKE - Chair, North American Technical Committee
G. LEVIZZARI - ATA
M. MARCACCI - Piaggio
L. MARTORANO - Università di Pisa
L. PETRINI - Aprilia
R. RINOLFI - Centro Ricerche Fiat

Of these, three are clearly academics, and seven are clearly from industry.

Okay, so that's one example of how a traditional engineering discipline conducts joint academic/industrial research. Let's get back to USENIX with a look at the Program Committee for USENIX '05. Note that the mix is exactly the inverse: twelve work for a university and five work for a company. Worse, of those five putatively from industry, all of them work in academic labs. In our industry, these labs have a long tradition of being pure research outfits -- they often have little-to-no product responsibilities. (Which, by the way, is just an observation -- it's not meant to be a value judgement.) Even more alarming, the makeup of the FREENIX '05 program committee is almost completely from industry. This leads to the obvious question: is FREENIX becoming some sort of dumping ground for systems research deemed to be too "practically biased" for the academy? I hope not: aside from the obvious problem of confusing research problems with business models, having the General Session become strictly academic and leaving the FREENIX track to become strictly industrial effectively separates the academics from the practitioners. And this, in my opinion, is exactly what we don't need...

So how do we keep the "practical bias" in the academic work presented at USENIX? For starters, industry should be better represented on the Program Committee and on the Steering Committee. In my opinion, this is most easily done by eliminating FREENIX (as such), folding the FREENIX Program Committee into the General Session Program Committee, and then having an interleaved "practical" track within the General Session. That is, alternate between practical sessions and academic ones -- forcing the practitioners to sit through the academic sessions and vice versa. That may be too radical, but the larger point is that we need to start having an honest conversation: how do we prevent USENIX from becoming irrelevant?


Solaris

Beating the Odds

So I just got back from USENIX '04, and I had planned to spend the flight writing up some observations on the conference. Unfortunately, these observations -- as pithy as they no doubt will be -- will have to wait: I ended up spending the flight inhaling Bringing Down the House by Ben Mezrich. While the book itself is not very well written,1 the subject is fascinating: a well-disciplined (and apparently successful) card-counting team from MIT. The book was brain candy in the purest sense: it was exhilarating and fun -- but it definitely ruined my dinner.

If you're looking for something with a little more meat in it, check out Tom Bass's classic, The Eudaemonic Pie. Bass's subjects are more interesting to me, if only because the problem they're solving is so much harder: a group of physicists and computer scientists develop a device to give them an advantage over roulette. (After all, it's just Newtonian physics, right?) And if the idea sounds incredibly implausible, just wait until you see how they implemented it. And while the "Bringing Down the House" protagonists seem destined for a life of overpaid corporate consulting and/or 12-step programs, the leader of the "Eudaemonic" tribe, Doyne Farmer, now writes papers for academic journals like Quantitative Finance from his roost at The Santa Fe Institute. Meatier stuff, to be sure.

1 The author had an incredibly difficult time separating himself from the story -- I don't particularly care if a stripper was "on his lap" for an interview, and I care even less that he knew the principal protagonist through "a friend from Harvard." I didn't drop fifteen bones to read "The Making of 'Bringing Down the House'"...


Solaris

The de-commoditization of the OS

So the DTrace team is currently at USENIX '04, where yesterday we presented our paper on DTrace. The presentation went quite well -- though it's a bit difficult to jam so much content in 25 minutes! The reception to the work was very positive, and even the questions largely praised DTrace. The only wrinkle in the whole operation came with the last question, from an employee of IBM. To paraphrase:

DTrace seems great, I imagine many people would be interested in this, etc. etc. When are you going to port it to Linux?

I'm afraid that my answer was probably perceived as a bit politically incorrect; it was roughly:

Look: we believe in choice; we believe that people should pick the best operating system for the task at hand. We've been busting butt for the last three years on DTrace to make Solaris the best operating system for many different tasks. Enough said...

To be clear: this is not an attack on Linux. But there is a fundamental disagreement out there: many seem to believe that the operating system is a "commodity" -- that all operating systems are basically the same. We disagree: we believe that the operating system is a nexus of innovation. And we believe that we're proving that with Solaris 10 technologies like DTrace, Zones, ZFS, etc. etc. You certainly don't have to agree with us; we believe that you always have the right to choose Linux and that there may be many great reasons to do so. But please stop asking if we're going to port these features to Linux -- if you want to take advantage of our OS innovation, run Solaris.


Solaris

Be careful what you ask for...

Earlier, I lamented the fact that a press roundtable on three key technology areas in Solaris 10 (DTrace, Zones and ZFS) had yielded only stories about open source -- a topic which we explicitly didn't talk about. Fortunately, there is now a new story by one of the attendees of the roundtable that focuses on the three technology areas. And even better, the larger points about DTrace are certainly correct, e.g.:

DTrace, which uses more than 30,000 data monitoring points in the kernel alone, lets administrators see their entire system in a new way, revealing systemic problems that were previously invisible and fixing performance issues that used to go unresolved.

And the example that the article is trying to cite has an absolute basis in fact -- it's discussed in depth in Section 9 of our upcoming USENIX paper. But that said, the details of the specific example are incredibly wrong. (So wrong, in fact, that they're just odd; what does "a wild-card desktop applet that had somehow gotten channeled into the central system" even mean?) Perhaps the terms used are so opaque that readers will come away confused, but with the right overall impression -- but given that readers at LWN.net went so far as to accuse me of being a pointy-haired boss based on the C++ misquote, I can only imagine what I'll be accused of being now...


Solaris

The Early Adopters

Several years ago, Salon.com had a contest for the motto for Silicon Valley. Maurice Herlihy1 won with the slogan "Quality is Job 1.1." Maurice's slogan is certainly clever (and disconcertingly accurate at times), but one of the honorable mentions actually struck me as being truer to Silicon Valley: Eli Neiburger's "God bless the early adopters." If you have ever developed a revolutionary technology -- one that requires people to change the way they think at some level -- you know how unbelievably true this is. For it is the Early Adopter who puts up with tremendous pain to get their hands on a technology, goes through the tedium of constantly communicating the technology's shortcomings to its inventors, endures the slow march towards something usable, and through it all somehow finds the energy to talk enthusiastically about the nascent technology at every opportunity. The Early Adopters are something of a riddle to me, but they're so incredibly important to birthing new technology that I almost view it as uncouth to dissect what makes them tick. So "God bless the early adopter," indeed. There is no better slogan for Silicon Valley; you were robbed, Eli.

I bring all of this up because one of the great DTrace Early Adopters, Jon Haslam, has joined the Sun blogmania. Jon is a canonical Early Adopter in that he remained a terrific advocate for the technology, even when it was in a painfully unfinished state. We sometimes don't understand what makes Jon tick, but DTrace certainly wouldn't be what it is without him; God bless him...

1 Maurice was actually a professor of mine at school; his course on lock- and wait-free synchronization was one of the highlights of my education. The course was a seminar, and one week the low quality of that week's paper led me to decry the generally woeful state of academic computer science: "Maurice," I whined, "95% of it is crap!" "Bryan," he replied, "95% of everything is crap." I conceded the point...


Solaris

Open source, again!

So another article showed up covering the same press meeting that I discussed earlier. Again, despite the fact that it was less than one tenth of one percent of the content of the roundtable, the headline is open source. On the one hand, I feel slightly vindicated in that this one at least quoted me a little more accurately. And then there's this:

"But I'm also sure we'll be revisiting a few comments in the code here and there -- I just thought of a particularly disparaging one I might have left in having to do with C++ unions," Cantrill said with a laugh.

This was actually in response to a question -- someone asked (lightheartedly, I had thought) if there were any inappropriate comments in Solaris that would have to be cleaned up. I responded -- indeed with a laugh -- that there were some profane comments I could think of that had to do with unions in C. (Note: C, not C++ -- not that C++ unions don't deserve the same coarse words, only that C++ has bigger problems than just unions.)

Needless to say, I was quite surprised to see this off-hand comment show up -- and the reason for the disparaging comment probably deserves a little context. The comment in question is in code that I wrote for a loadable module in mdb, the Solaris modular debugger. This particular code is part of postmortem object type identification, a mechanism for identifying arbitrary memory objects from a system crash dump. I actually wrote a paper on this, and presented it at AADEBUG 2003 in September. Anyway, if you read the paper, you'll see why I was feeling malice towards unions when I wrote the comment: the presence of unions makes type identification needlessly difficult.

So that's the explanation for the disparaging comment about unions, for whatever that's worth. And I'm still holding out some hope that we'll see an article on the actual technical content that we presented, and not the open sourcing of Solaris that we refused to talk about...
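To give a feel for why unions are such a headache for postmortem type identification, here is a minimal, hypothetical C sketch -- my own illustration, not code from mdb or from the paper. Given only the raw bytes of a crash dump, an automatic type identifier cannot tell which member of the union is live, and therefore cannot know whether the word should be interpreted as a pointer (and followed) or as plain data (and ignored):

#include <stdint.h>

/*
 * A hypothetical structure for illustration only.  In a crash dump, the
 * bytes of n_u are just raw bits: nothing in memory records whether
 * u_child (a pointer that a type identifier would want to propagate
 * through) or u_cookie (an integer that merely happens to look like an
 * address) is the live member.
 */
typedef struct node {
        struct node *n_next;            /* unambiguous: always a pointer */
        union {
                struct node *u_child;   /* pointer interpretation */
                uint64_t u_cookie;      /* integer interpretation */
        } n_u;
} node_t;

With a struct, every offset has exactly one type; with a union, an identifier has to guess, carry multiple candidate interpretations, or give up -- hence the malice.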


Solaris

Open source! (P.S. DTrace, Zones, ZFS, etc.)

So several of us spoke with analysts and members of the press yesterday on Solaris 10. The idea was that it would be a deep-dive on three of the major technology areas in Solaris 10: DTrace, Zones, and ZFS (a.k.a. the "Dynamic File System"). Of course, at the outset, the press was really only interested in our (pre-)announcements about open sourcing Solaris. We had to spend two minutes at the beginning of the meeting saying yes, we were committed to it, and no, they weren't going to get any additional information out of us. But I guess I shouldn't be too surprised that one of the stories to come out of that roundtable headlined with open source. From the story, you would think that we spent the entire time talking about open source -- in reality, we spent the entire time talking about the three technology areas, and the first several minutes explaining that we explicitly weren't talking about open source. Oh well...

And for the record, I didn't say "technically, it is not a problem to do this"; I said "this is not a technical problem." To me, these have different connotations. I am also attributed in that article with "[w]e're engineers and we've written the cleanest code and we can't wait to share it with the world." While this expresses my sentiments accurately, my phraseology got a bit mangled. (For starters, I don't generally speak in run-on sentences!) What I said was more like: "We're engineers; we obviously understand the value of having the source code. We believe that we have some of the cleanest code anywhere, and we're looking forward to showing it to the world." (And hey: we do have some of the cleanest code anywhere -- but somehow I don't think that we're going to see a story about the beautiful ASCII art block comments in /usr/include/sys/dtrace_impl.h...)


Solaris

Some prehistory

I am an engineer in Solaris Kernel Development here at Sun. I have spent many (well, eight) years at Sun, and I've done work in quite a few different kernel subsystems. Most recently, I -- along with Mike Shapiro and Adam Leventhal1 -- designed, implemented and shipped DTrace, a new facility for dynamic instrumentation of production systems. As DTrace has been a nearly decade-long dream for me personally, much of this blog will focus on DTrace -- its history, its architecture, its use, its budding community, its ongoing development and its futures. But first, some prehistory to let you know where I'm coming from. (And apologies in advance for the length.)

Above all else, I believe that software should:

Work correctly
Work efficiently

One would think that it would be very straightforward to achieve these ends, and yet (historically, at least) it's been damned near impossible -- even for very bright people with high levels of expertise. (It may help to know that my definition of "works correctly" is not "works once", "seems to work" or "works for me" but rather "rock-solid, never-breaks, lean-my-business-on-it and trust-my-life-to-it".) So why is there so much software out there that's slow and broken?

The fundamental difficulty is that we cannot see software. And why can we not see it? Because software has the peculiar property that to see it, you must make it slower. That is, the act of indicating that a particular instruction is being executed will slow down the execution of that instruction. Worse, software has the even stranger property that if you ever want to even be able to see it, you must make it slower. That is, let's say you want to optionally be able to trace the entry to a function foo(). You might have something like this:

foo(int arg)
{
        if (tracing_enabled)
                trace(FOO_ENTRY, arg);
        ...

This boils down to instructions that look something like this (using a RISC-ish proto instruction set):

        set     tracing_enabled, %o0
        ld      [%o0], %o1
        cmp     %o1, 0
        be      go_around
        set     FOO_ENTRY, %o0
        mov     %i0, %o1
        call    trace
        ...
go_around:
        !
        ! The rest of the function foo()...
        !
        ...

That is, it boils down to a load, a compare and a branch. This slows down execution (loads hurt -- especially if they're not in the cache), causes a branch-taken in the common path, increases instruction cache footprint, etc. Not a big deal if foo() is the only function you do this to, but start putting this in every function, and you'll find that you have a system that is too slow to ship -- it has suffered the infamous "death of a thousand cuts." (Yes, if you're lucky or if the compiler supports a hint to indicate that trace() isn't called in the common case, the sense of the branch may change such that the branch will be not-taken in the common case -- which is better, but this still hurts too much to do it everywhere.)

So we can't leave this kind of code in our optimized, production code -- so what do we do? Many do something like this:

#ifdef DEBUG
#define TRACE(tok, arg) if (tracing_enabled) trace((tok), (arg))
#else
#define TRACE(tok, arg)
#endif

This is better -- at least the production version isn't hindered and we still have debug support in a DEBUG version. But now we have a new problem. We now have essentially two versions of our software: the slow one that we can see, and the fast one that we can't.
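As an aside on the parenthetical above about compiler hints: here is a minimal sketch -- my own illustration, not from the original post, and assuming a GCC-style __builtin_expect -- of what the hinted variant might look like. It avoids the two-version split, but it still leaves the load, the compare and the extra instruction footprint in every instrumented function:

#include <stdio.h>

/* GCC-style hint: tell the compiler the expression is expected to be false. */
#define unlikely(x)     __builtin_expect(!!(x), 0)

int tracing_enabled;                    /* set to nonzero to enable tracing */

#define FOO_ENTRY       1

static void
trace(int tok, int arg)
{
        (void) fprintf(stderr, "trace: token %d, arg %d\n", tok, arg);
}

int
foo(int arg)
{
        /*
         * The hint lets the compiler lay out the code so that the untraced
         * path is the straight-line, branch-not-taken path -- but the check
         * itself still executes on every call to foo().
         */
        if (unlikely(tracing_enabled))
                trace(FOO_ENTRY, arg);

        return (arg * 2);               /* stand-in for the rest of foo() */
}

Better, but as noted above, it still hurts too much to do everywhere -- which is what motivates the change of model described below.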
So what do we do when we see a performance problem in production? Well, we might try to run the DEBUG version (or worse, one with custom instrumentation) in production. But that requires downtime, and additional risk -- and usually doesn't fly. So what do we do? We try to reproduce the problem in development on the DEBUG version that we can see. (This is not "we" as in Sun, by the way; this is "we" as in humanity -- you and me and everyone.) And reproducing performance problems is bad, bad news: when you're reproducing performance problems, you're reproducing symptoms. (Naturally, because the symptoms are all you've got; if you knew the root cause you would be watching Knight Rider reruns instead of horsing around at work.) And why is reproducing symptoms such bad news? Because disjoint problems can manifest the same symptoms.

To borrow a medical analogy: let's say that you discover that you're running a fever in production. So you take your development or test environment, and you try to make it look closer and closer to your production environment until you have a fever. Maybe you add more artificial load, more hardware, more users, whatever. Finally, you see the fever in your development environment. You get all of the developers in the room, and they start throwing instrumented binaries on the development machine. Maybe you think you've got an OS issue, so you have Solaris engineers throwing on new kernels -- or maybe you have your ISVs giving you instrumented binaries of their products. Finally, after a huge amount of time and escalation and more time and frustration, you discover the problem: the fever is due to influenza. Okay, this isn't the end of the world: if the production environment stays off its feet, drinks fluids and gets some rest for the next few weeks, it should be fine. But here's the problem: it was influenza in the development environment -- that much was correct. But it's not influenza in the production environment. In the production environment, the problem was cerebral malaria. No amount of rest is going to help -- our diagnosis is completely wrong.2 It may strike you as a glib analogy, but it's an accurate one for the experiences of many. Just think: how many times have you found "a" problem without finding "the" problem?

Okay, so we're clearly down a blind alley of sorts. We actually need to start over with a clean sheet of paper -- we need to change the model. We need to be able to ship an optimized kernel and optimized apps, and when we want to be able to see the software, we need to be able to dynamically instrument it to answer our question. And when we're done answering our question, we want the system to go back to being completely optimized. And we want to do all of this in production environments. This means that it must be absolutely safe -- there must be no way to crash the system through user error.

And this, in essence, is what we've done with DTrace. DTrace is a new facility in Solaris 10 for the dynamic instrumentation of production systems. DTrace is available today via Solaris Express. It has been available since November, and many people have already used it to diagnose real problems. You can read some of their thoughts in the DTrace feature story that ran on sun.com late last month.

1 I would have hyperlinked to Mike and Adam's blogs, but they don't (yet) exist. I would expect Adam to have a blog shortly, but given that Mike doesn't yet have a cell phone, it might be a longer wait. Then again, Mike bought a TiVo the first weekend they were on sale at the Palo Alto Fry's back in 1998 -- so you never know when he's going to adopt a technology.

2 Lest you think medical science has figured this one out: I encourage you to contract cerebral malaria and present at your local emergency department -- and observe just how many weeks you spend bouncing around the health care system before some clever doc finally cracks the case...
