Friday Oct 22, 2010

End of File

In 1997, I decided to come join the kernel team at Sun Microsystems because I was eager to work on Solaris-- the best operating system on the planet. I was very privileged to work on some incredible technology over many years with some of the most talented engineers in the industry. Our work culminated in Solaris 10, the most innovative release of Solaris ever to that point, and what I think will be remembered for all time as a huge leap in operating system technology and the idea of what an OS can do.

In 2005, five years ago this week, I decided with a few friends that it was time to take the core technology in Solaris and take it in a decidedly different direction, into storage. For nearly three years, with our team of engineers and new ideas, we worked to create what became the ZFS Storage product line for Sun and now Oracle. After its launch, this core team and a growing organization took ZFS Storage from infancy to maturity, culminating in an announcement at Oracle Open World 2010 of our second-generation product line.

What began as a mere $2.1M incremental engineering investment for 2.8 years has now shipped more than 100 petabytes, more than 6000 systems, and 100X in revenue. And we redefined the Unified Storage category: first with enterprise flash, first with a Flash Hybrid Storage Pool, first with Analytics, first with triple-parity RAID, first with IB, FC, and 10GbE simultaneously, first with 6Gb SAS end-to-end, first with 5T of read cache, first with a 32-core controller. With the team working to enhance the product even further, it will be an incredible part of a growing storage portfolio at Oracle, combined with leading software to manage data, and uniquely interoperate with Solaris 11 and new OS innovations.

For me, the past five years have been an amazing experience, leading me to many places I never imagined. (Example: after being in my first management role at Sun for about 3 weeks, I was told to report to a small team of senior people at IBM to explain much of Sun to them since they were pondering buying the company.) I wouldn’t trade it for anything.

But my heart is still in engineering, and in being a new dad to my beautiful daughter, and so for me my two-year career as VP of Storage and Solaris is at an end. I am resigning because my effort on behalf of my team to bring ZFS Storage to maturity and to success and a strong position at its new company in a real storage org is now done: we made it. From outside, as a fan, I will be waiting eagerly to see what the team will accomplish next.

For all of my friends, colleagues, and mentors at Sun and Oracle, of whom there are simply too many amazing people to name, thank you. And for everyone who worked for me these last two years, thank you for your support, your dedication, and your patience.

My cell phone hasn’t changed, but my e-mail has: mike dot w dot shapiro at gmail dot com. So keep in touch. The inbox will be decidedly smaller so I’ll probably respond faster too.



Sunday Nov 09, 2008

Introducing the Sun Storage 7000 Series

Today I'm pleased to announce the new Sun Storage 7000 Series of unified network storage devices. For the past three years, the Fishworks team have been locked away at an undisclosed location creating this incredible new line of storage devices, and today we're taking the curtain off and announcing them to the world. We'll do it live by webcast at 3pm Pacific time.

Three years ago, my long-time partner-in-crime Bryan Cantrill and I created the Fishworks advanced engineering team at Sun with a simple goal: take the core of Sun's innovations in systems and software, and use these innovations to create a breakthrough line of integrated storage products. Our goal was quite simple: create a beautiful product that fully integrates the software and hardware experience (hence our name of F-I-S-H), deliver some killer storage features long missing from the market, and provide a world-class enterprise product at an unprecedented price point. And we've done just that. Bryan recounts the history of Fishworks in his blog, and the rest of the team gears up the rest of the story about our products. For my part, I'll try to provide the big picture of what we did, how we did it, and what comes next.

The New Economics

For more than a decade, storage systems have delivered performance (measured in IOPS for latency or MB/s for bandwidth) through two basic mechanisms: lots of DRAM or NVRAM cache, and lots of fast disks (those 15k RPM FC drives you have in your NAS box). And to really drive IOPS, you go further and take your 15k RPM disks and over-provision the storage, thereby short-stroking the data around the outer tracks. It works, but it's left us with an unfortunate legacy:

  • DRAM and 15k RPM drives are the two most expensive things in a large-scale storage system
  • DRAM and 15k RPM drives are the two most power-hungry and hot things in the system
  • You can't have maximum capacity and maximum performance simultaneously, since those 15k RPM drives are a lot smaller than highest-density 7200 RPM drives
  • Drive speeds aren't getting any faster, only drive density is increasing
And then some other unfortunate stagnation set in:
  • Your storage vendor started monetizing every last knob and screw of the architecture, charging exorbitant software fees and requiring license keys for each protocol and feature you use
  • Your storage vendor never implemented a scalable operating system, so storage devices don't scale with CPU efficiently, and are completely off the commodity compute doubling curve

Today we are changing all of that. The Sun Storage 7000 series delivers a completely new economic model for storage, based on the simple premise that using Flash memory, we can build storage performance in an entirely new way, using a Hybrid Storage Pool of transparently-managed DRAM, read-optimized Flash, write-optimized Flash, and low-cost, low-power spindles. Then we put all of that on top of an industry-standard compute architecture that has volume and doubles every eighteen months, with the most scalable storage microcode in the industry, the OpenSolaris kernel.
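
To make the idea of a transparently managed hierarchy concrete, here is a toy sketch of a read path that tries DRAM first, then read-optimized Flash, then disk. It is only an illustration under simplifying assumptions: the class names, the LRU eviction policy, and the rule that DRAM evictions spill into Flash are all hypothetical choices for this example, not a description of how the product is implemented.

```python
from collections import OrderedDict


class LRUTier:
    """A fixed-capacity LRU cache standing in for one tier (DRAM or read Flash)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, value):
        """Insert a block; return the evicted (key, value) pair, if any."""
        self.blocks[key] = value
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            return self.blocks.popitem(last=False)  # evict least recently used
        return None


class HybridPool:
    """Toy read path: DRAM first, then read-optimized Flash, then disk."""

    def __init__(self, disk, dram_blocks, flash_blocks):
        self.disk = disk  # a dict standing in for the spindles
        self.dram = LRUTier("dram", dram_blocks)
        self.flash = LRUTier("flash", flash_blocks)

    def read(self, key):
        """Return (data, name_of_the_tier_that_served_the_read)."""
        data = self.dram.get(key)
        if data is not None:
            return data, "dram"
        data = self.flash.get(key)
        source = "flash"
        if data is None:
            data = self.disk[key]
            source = "disk"
        evicted = self.dram.put(key, data)  # promote the block into DRAM
        if evicted is not None:
            self.flash.put(*evicted)  # assumption: DRAM evictions spill into Flash
        return data, source
```

The caller never chooses a tier: it asks for a block and the pool decides where it is served from, which is the sense in which the hierarchy is transparent.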

And then we throw in a dose of reality: no software fees, no license keys. Every bit of storage software we have is included for the price of the box. Then add whatever support plan you need, either from Sun or one of our partners. As one of our beta customers put it, "When I consider price, performance, rack density, and power consumption, your new storage systems give me 16X the storage value per dollar spent." A new economics indeed.

The Killer App

Every new product needs a killer app, and we've delivered one with these new storage products: Analytics. Analytics is a revolutionary new way of observing and understanding what your storage system is doing, in production, using real-time graphics. It lets you take any aspect of the storage system (protocols, disks, network, cpu, memory) and ask an arbitrary question about your workload and get an immediate answer over the web interface while the system is running. Then just point-and-click on something interesting and drill down to ask a new question. It goes like this:

  • How many IOPS am I delivering?
  • How many for CIFS and how many for NFS?
  • What CIFS clients are most active?
  • On the most active CIFS client, what files are being accessed?
  • For the most active file, can you show me the read-write mix?

You'll have the answer in less time than it took to read the preceding paragraph. It looks like this:

Analytics is the perfect match for our revolutionary Hybrid Storage Pool architecture because we've empowered storage administrators with the ability to understand their workload, and then given them unprecedented insight into how they can grow their architecture to improve performance. Need more networking? Add more read-optimized Flash to your caching hierarchy? Want to understand whether mirroring or RAID-Z DP is best for your workload? Now you'll have the real answer on the only workload that matters: the one running in your datacenter.
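
The drill-down sequence of questions listed earlier can be sketched as a series of progressively narrower breakdowns over a stream of I/O records. This is a minimal illustration only: the record layout, the sample data, and the function names are hypothetical, and the real feature answers these questions on a live system rather than over a static list.

```python
from collections import Counter

# Hypothetical I/O records: (protocol, client, file, operation)
SAMPLE_OPS = [
    ("cifs", "10.0.0.5", "/export/a.doc", "read"),
    ("cifs", "10.0.0.5", "/export/a.doc", "read"),
    ("cifs", "10.0.0.5", "/export/a.doc", "write"),
    ("cifs", "10.0.0.5", "/export/b.xls", "read"),
    ("cifs", "10.0.0.9", "/export/c.txt", "read"),
    ("nfs", "10.0.0.7", "/export/d.log", "write"),
    ("nfs", "10.0.0.7", "/export/d.log", "write"),
]

FIELDS = ("protocol", "client", "file", "op")


def breakdown(ops, by, **filters):
    """Count operations grouped by one field, keeping only records that match the filters."""
    idx = FIELDS.index(by)
    counts = Counter()
    for rec in ops:
        if all(rec[FIELDS.index(name)] == want for name, want in filters.items()):
            counts[rec[idx]] += 1
    return counts


def drill_down(ops):
    """Walk the questions in order: protocol split, busiest CIFS client, its busiest file, read/write mix."""
    by_protocol = breakdown(ops, "protocol")
    top_client = breakdown(ops, "client", protocol="cifs").most_common(1)[0][0]
    top_file = breakdown(ops, "file", protocol="cifs", client=top_client).most_common(1)[0][0]
    mix = breakdown(ops, "op", protocol="cifs", client=top_client, file=top_file)
    return by_protocol, top_client, top_file, mix
```

Each step reuses the answer to the previous question as a filter, which mirrors the point-and-click refinement described above.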

The Products

The products we're introducing today include:

  • The Sun Storage 7110, a 2u box with 2TB of storage,
  • The Sun Storage 7210, a 4u box with 46TB of storage, and
  • The Sun Storage 7410, a 2u box with up to 288TB of storage that can be clustered (and we'll double that to 576T with a software update in a few months)

All of them have the same software features, except the 7410 adds active-active clustering. And all of the software features come included with the box, including advanced features like replication, compression, thin-provisioning, and all of our data protocols.

The 7410 is the full expression of the Hybrid Storage Pool architecture. It supports:

  • Up to 16 cores (32 for a cluster)
  • Up to 128G of DRAM (256G for a cluster)
  • Up to 600G of Read-optimized Flash (1.2T for a cluster)
  • Up to 288G of Write-optimized Flash
  • Up to 288T of raw disk capacity

And then you have plenty of PCIe lanes to plug in things like 2x10Gb or 4x1Gb Ethernet cards, or an FC HBA to connect to a tape library for backup.

The Features

All our systems include these core data protocols:
  • NFS v3 and v4
  • CIFS
  • iSCSI
  • HTTP
  • WebDAV
  • FTP
and these data services:
  • RAID-Z (RAID-5 and RAID-6), Mirrored, and Striped disk configurations
  • Unlimited Read-only and Read-write Snapshots, with Snapshot Schedules
  • Built-in Data Compression
  • Remote Replication of data for Disaster Recovery
  • Active-Active Clustering (in the Sun Storage 7410) for High Availability
  • Thin Provisioning of iSCSI LUNs
  • Virus Scanning and Quarantine
  • NDMP Backup and Restore

Finally, to maximize the availability of your data in production, the Sun Storage products include a complete end-to-end architecture for data integrity, including redundancies at every level of the stack. Key features include:

  • Predictive Self-Healing and Diagnosis of all System FRUs: CPUs, DRAM, I/O cards, Disks, Fans, Power Supplies
  • ZFS End-to-End Data Checksums of all Data and Metadata, protecting data throughout the stack
  • RAID-6 (DP) and optional RAID-6 Across JBODs
  • Active-Active Clustering for High Availability
  • Link Aggregations and IP Multipathing for Network Failure Protection
  • I/O Multipathing between the Sun Storage 7410 and JBODs
  • Integrated Software Restart of all System Software Services
  • Phone-Home of Telemetry for all Software and Hardware Issues
  • Lights-out Management of each System for Remote Power Control and Console Access

And a really nice user interface:

What's Next

To make it easy for you to try out Analytics and get started with these products, check out the home page for the Sun Storage 7000 Series and download our Unified Storage Simulator: it runs in a virtual environment on your laptop but is an entirely functioning network storage device.

Over the next few months, I'll dive inside the implementation details of many of the features we worked on to make these products happen, and talk about how we see the storage world changing. One thing for certain is that today's announcement is only a beginning: we've defined the storage architecture for the next ten years, and we intend to make good use of it. There are three things that will be at the core of everything we do: a true software architecture for Flash, the Hybrid Storage Pool; users empowered with real-time Analytics so you can finally understand what your box is doing and how to make it better; and an open, industry-standard architecture: open on-disk formats, open protocols, and a compute and i/o architecture with volume economics behind it that doubles in speed and capacity every 12-18 months.

Enjoy the launch.


Saturday Jul 19, 2008

Once Blitten

I've worked on a fair number of debuggers over the years, and in my efforts to engineer new things, I always spend time researching what has gone before. The ACM recently added the ability to unlock papers at their extensive Digital Library, so this gives me the pleasure of beginning to unlock some of the older systems papers that influenced my thinking in various topics over the past decade or so. One of these was The Blit Debugger, originally published in SIGSOFT proceedings, and written by Thomas Cargill of Bell Labs. The Blit itself takes us back quite a ways: it was a bitmapped terminal containing a small microkernel that could communicate with host programs running on UNIX. The idea was that one could write a small C program, which was turned into relocatable form, that could control the bitmap display and the mouse and keyboard, and beam it over to the Blit, where it would be executed by mpx. These mini-programs running in the Blit could then communicate with normal UNIX programs back on the host, i.e. executing in the complete UNIX timesharing environment, to form a complete interactive program. Written in 1983, the same year bytes of 68000 assembly code were being downloaded over a serial line from a Lisa to early prototype Macs, the Blit quickly looked behind the times only a few months later when MacPaint and MacWrite showed up.

But in computing, everything old is new again, and we don't spend enough time studying our history. Hence we've reinvented virtualization and interpreters about four times now. And looking back at the Blit now, you can kind of squint your eyes and see something quite remarkable. A high-resolution bit-mapped display with keyboard and mouse control, running a kernel of software capable of multiplexing between multiple downloaded graphical applications that can drive user interaction, each communicating over a channel to a more fully capable UNIX machine with complete network access and a larger application logic. Sound familiar? I pretty much just described your favorite web browser, Javascript, and the AJAX programming model. So now on to debugging.

Cargill's Blit Debugger basically let you wander around the Blit display, pointing and clicking to attach the debugger to a program of interest. Then once you did that, it could download the symbol tables to the debugger, and then conjure up appropriate menus on the fly that displayed various program symbols and let you descend data structures. I'll let you read the paper for the full details, but there is a very central concept here, independent of the Blit, that for me turned on a big light bulb when I first read this paper in college. As a programming student, I always considered the debugger a kind of container: i.e. you either ran your program, or you ran it inside the debugger. This was the way we all learned to program, and this was the way all those big all-encompassing IDEs behaved (and mostly still do). The Blit debugger was the first description I'd read of a debugger that was truly a general-purpose tool that could be used to explore any aspect of a running environment, literally allowing its user to wander the screen in search of interesting processes and their symbols that you could shine your flashlight on. And it wasn't a debug environment, it was just your normal, running compute environment.

This is a seminal concept, and one that has had great influence on my work in debugging over the years, most prominently with the development of DTrace, where Bryan and Adam and I created a modern version of this kind of flashlight, one that would let you take your production software environment and roam around arbitrarily asking questions, poking into any aspect of the running system. But looking back at the Blit and its analogies to the rapidly-evolving AJAX environment, I hope that others will find inspiration in this idea as well, and bring new (and old) thinking to what kinds of debugging tools are needed in this environment. Yes, we've got the breakpoint debugger (mostly, it seems to cause my browser to crash) and the DOM inspector (getting better, slowly), but what's really needed is more thought into connecting the browser, the Javascript interpreter, the DOM, and the backend together, in a way that permits interactive on-the-fly exploration. One critical building block for that world now exists, in the form of Brendan's DTrace Support for Javascript. But imagine what would be possible if say, a browser plug-in itself actually leveraged this DTrace support. Suddenly features become possible like "Option-click on a button in an AJAX UI, and I will pop up the stack trace of the HttpRequest that was sent when you did that, and show you the XML-RPC call that was made and its reply." Or "Turn on a profiling mode that shows me what XML-RPC requests spent the most time blocked on the network while I click around the user interface." One can also envision DTrace linkages between the debugger control engine in a browser and a corresponding control engine in the XML-RPC backend that can also use DTrace to instrument itself or the system that contains it.

Meantime, enjoy the paper, and more about the Blit can be found here.


Self-Healing Unlocked

A couple of years ago, I wrote an article on Self-Healing in Modern Operating Systems for ACM Queue, detailing the state of the art in self-healing techniques for operating systems and introducing the Fault Management capabilities we built for Solaris 10. The ACM has recently added the ability for authors to unlock articles, so I've gone and unlocked that article here. This is a really fantastic idea to make available more of the amazing content available at their Digital Library. The other thing happening at the ACM everyone in the field should be excited about is the inclusion of new articles and an increased focus on practice and engineering in their magazine, Communications of the ACM. The increased energy really shows in the July issue with articles on Flash (courtesy of our own Adam Leventhal), Transactional Memory, and others. I also strongly recommend Pamela Samuelson's article on software patents, which avoids the usual religious and emotional responses and instead gives us a long overdue detailed and thoughtful overview of where the courts and the case law stands on this critical subject.


Thursday Nov 08, 2007

Unified POSIX and Windows Credentials for Solaris

Last week something really exciting happened in OpenSolaris: we got a native Windows CIFS server checked in. CIFS, also known as SMB, is the file sharing protocol used by Microsoft Windows systems, akin to NFS on UNIX. (Unlike NFS, CIFS crosses boundaries into other aspects of Windows as well, but that's a story for a different day.) The result of this project is that OpenSolaris can now serve up ZFS to UNIX clients over NFS and Windows clients over CIFS, along with all of the other file sharing protocols you find on a modern system. Similar to the Solaris NFS implementation, our CIFS implementation is in-kernel for performance and scale, and is accompanied by a collection of userland services that handle the non-performance-critical aspects of the problem like session management and so forth. The OpenSolaris CIFS putback is the result of a lot of really hard work led by Alan Wright and our team of CIFS engineers. Read Alan's blog for more of the back story there.

To make the CIFS server a reality inside of Solaris, one nasty root design problem we had to confront was how to represent Windows credentials and identities in Solaris. Like all UNIX systems, Solaris represents user and group identities as integers (UIDs and GIDs) that are semantically associated with a name by your current name service. But the kernel and your filesystem just store those integers in memory in the kernel and on-disk, and it's up to the system administrator to connect the system to the appropriate name service to semantically bind those identifiers to the right names. However, unlike a UNIX/POSIX system, Windows has a much more complex notion of identity, represented by an identifier called an SID, which is in effect a universally unique identifier with an Active Directory Domain that has a more complex set of security attributes associated with it. The challenge for us was to consider how to represent this other notion of identity and credential in our system, while still maintaining all of the POSIX APIs and compatibility. And we decided to get aggressive: rather than just viewing Windows identities as an artifact of one service on the system and mapping everything to a POSIX lowest-common-denominator internally, we've actually gone much further and changed Solaris to support a much more sophisticated representation of identity and credentials so we can fully support both models.

I worked with the CIFS team earlier this year to make this happen, and now that their putback is done and the project has been opened to the community, I've made the full details available. (You can also directly download a copy of the design spec here.) If this is a technical area that interests you, I hope you will benefit from reading our examination of the problem and how we decided to do what we did. Here is the executive summary:

  • We changed the type of uid_t and gid_t in Solaris from 32-bit signed to 32-bit unsigned. This really should have been done a long time ago, and brings us (in my opinion) into better alignment with other UNIX variants.
  • We reserved the UID and GID values 0x80000000 - 0xFFFFFFFE to be used for what I call ephemeral mappings to foreign identifiers, represented by a generic form of an SID. These mappings are done by the new identity mapping service in OpenSolaris, called Winchester. (Winchester also can perform mappings of user names between POSIX and Windows name services: see the Winchester project page for all of the details.)
  • We created a way for Solaris filesystems to store persistent identifiers in the filesystem on-disk that can represent arbitrary identifiers including both POSIX IDs and SIDs and convert those back to credentials in the kernel.
  • We extended the Solaris ucred mechanism so that these more complex credentials can be expressed back to userland processes for services that need them.
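
The first two points in the summary above interact: the reserved range only makes sense once IDs are treated as unsigned, because the same 32-bit patterns read as negative numbers under a signed type. The sketch below shows that arithmetic. The range bounds come from the summary; the constant and function names are hypothetical and are not the identifiers used in the Solaris source.

```python
EPHEMERAL_MIN = 0x80000000  # first reserved value, per the summary above
EPHEMERAL_MAX = 0xFFFFFFFE  # last reserved value, per the summary above


def as_unsigned32(raw_id):
    """Reinterpret an ID that may have been stored as a signed 32-bit integer as unsigned."""
    return raw_id & 0xFFFFFFFF


def is_ephemeral(raw_id):
    """True if the ID falls in the range reserved for ephemeral mappings."""
    return EPHEMERAL_MIN <= as_unsigned32(raw_id) <= EPHEMERAL_MAX
```

For example, a program that still holds an ID in a signed 32-bit variable would see the value -2 for the bit pattern 0xFFFFFFFE, which `as_unsigned32` maps back into the reserved range.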

Read the spec for more details, as there is obviously a lot that needed to occur to make this happen. For developers, the best part is that we're able to maintain a very sound compatibility story with existing applications and APIs, as you would expect from Solaris. The key thing for developers to understand is that, as was always implied by POSIX but now has even more significant meaning, you should not write programs that store integer UIDs and GIDs to files on disk, or send them over the network as part of a network protocol. Instead, programmers should convert identifiers to a persistent form such as an SID or qualified name, and serialize those forms instead. Most modern programs and protocols, such as GNU tar, Solaris tar, NFSv4, and others already do this.
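
As a rough illustration of that advice, the sketch below serializes a file owner by name rather than by integer and resolves the name again on the receiving side. Everything here is hypothetical: the JSON layout, the function names, and the injected lookup callables are assumptions made for the example, not an interface from Solaris or from the design spec.

```python
import json


def serialize_owner(uid, lookup_name):
    """Serialize an owner by name; fall back to a clearly tagged number only if no name exists."""
    name = lookup_name(uid)
    if name is not None:
        return json.dumps({"owner": name})
    return json.dumps({"owner_unmapped_uid": uid})


def deserialize_owner(blob, lookup_uid):
    """Resolve a serialized owner to a UID on the receiving system; None if it is unknown here."""
    record = json.loads(blob)
    if "owner" in record:
        return lookup_uid(record["owner"])
    return None  # a bare integer from another system is not trusted as a local UID
```

On a UNIX system the two lookups could be built on the standard library's `pwd.getpwuid(uid).pw_name` and `pwd.getpwnam(name).pw_uid`, with a `KeyError` treated as "no mapping". The point of the design is that the integer never leaves the machine that assigned it, so an ephemeral ID on one system cannot be mistaken for an unrelated user on another.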

All the ID machinations aside, it's incredibly exciting to see OpenSolaris expand its reach as a high-performance, high-scale server for Windows and Mac clients in a heterogeneous environment. For the first time since I've been at Sun, I can honestly enthusiastically say: Fire up those Windows laptops! Because there's something interesting to see.


Monday May 28, 2007

Purpose-Built Languages in Systems Design

A couple of weeks ago I attended the Industrial Partners Program symposium at my alma mater, Brown University. For the May symposium Brown invited back a collection of former teaching assistants from the Operating Systems course (cs169), including myself, Bryan, Adam, Eric, Matt, Jeff, and Jason for a thoroughly enjoyable day of talks. The host was Professor Tom Doeppner, who has presided over this amazing and unique course for more than twenty years now. Tom gave a fabulous presentation of the history of the course to start the day, with many former students in attendance, and it really struck me how effective the course has been at constantly reinventing itself along with the evolution of operating systems in industry, thereby empowering both TAs and students to engage in some incredible design and implementation challenges along the way. Tom deserves an enormous amount of credit for this, and for me personally this course is a major part of the reason I work in an operating system group today.

For my talk, I presented a three-part discussion of Purpose-Built Languages in Systems Design. The topic is impossibly broad, but I thought it would be fun to take a look at a different side of systems than is usually discussed and look at how little, bizarre languages have contributed to the evolution of systems, in particular UNIX. In the first part of the talk, I traced the evolution of the original UNIX debugger from its predecessor ODB on the PDP-8 through db and adb up to mdb, the rewrite I contributed to Solaris 8 and in use today. Purpose-built languages may have all the elegance of pond scum, but they also have the resilience of pond-scum when attached to the grout on your bathroom tile. This one has now lasted more than thirty years, and that is quite an achievement for any language. (Side note: I would love to hear from anyone who has experiences with ODB, or early DB or ADB to share.)

In the second part of the talk, I talked about the evolution of the D language we designed for DTrace. In particular I think one of the most powerful design principles (illustrated also in part one) is how purpose-built languages can leverage syntax and semantics from existing paradigms to speed their adoption. This is entirely common-sense, but achieving it in implementation can be quite difficult, especially as the new language will naturally extend the basic form in some areas while pruning it in others, all while attempting to achieve some rational semantics which hopefully don't violate those of the original.

In the final part of the talk, I showed some examples of what I called "mutants": purpose-built languages that evolve as a mutation between two or more existing languages, sometimes by attempting to form a bridge from one to the other (as in the case of preprocessed language front-ends) and sometimes by, well, just being plain weird. The most bizarre is the mutation of Forth and SPARC assembler used at Sun for the Open Boot PROM source code -- here Forth words are redefined to cause the generation of in-memory SPARC opcodes, and the OBP image is created by taking a core dump of the interpreter after all the files have been loaded.

At the end of the day, we had an impromptu panel to discuss, among other things, how systems should be taught in universities. One point I feel strongly about is that it's vital to give students as many tools for their mental toolbox as possible to be effective in designing and building larger-scale systems: it should be mandatory to take an OS class and a compiler class, whether your intent is to be a compiler-writer or a kernel programmer. They both have a deep influence on the other, and I am constantly amazed in industry how rare a good working knowledge of both areas is. From the OS side, development of languages, in particular little ones, has played a central role in the evolution and adoption of UNIX, and is vital to its continuing growth by allowing us to more easily operate on the set of abstractions exported by the kernel.


Monday Aug 07, 2006

DTrace on MacOS X at WWDC

This morning I attended the first day of Apple's WWDC in San Francisco to see Apple announce, among other things, support for DTrace on Mac OS X Leopard. Apple's engineering team has been working hard on this for some time leading up to this announcement, and it was incredibly gratifying for us to see how much they have working already on both PowerPC and the new x86 Macs, how it has already begun to help their engineering teams, and how this is going to really expand the DTrace community to include a whole new family of application developers. Here are the particulars:
  • The base DTrace framework, including the command-line utility, our compiler, and the kernel support is essentially all running on MacOS X and is bundled with the Leopard preview DVD that was given away at the WWDC. dtrace(1M) even appears in /usr/sbin, so scripts that use common interfaces can literally be run as-is on MacOS.
  • Work on our various providers is in progress at various points, but suffice it to say that fbt, syscall, and pid are all happening on the x86 Macs, and SDT work is under development as well.
  • In addition to work on standard pid, Apple is providing an Objective C front-end for the pid provider whereby one can specify Objective C probes using the class name in place of module and then a pid probe will be created in the corresponding location.
  • The DTrace type system (CTF) is present under the hood. That is, there was a mach_kernel compiled with CTF, generated by ctfconvert-ing DWARF emitted by gcc just like we do. (Note to myself and Matt: did not imagine this ever happening during those frantic five weeks in February 2001).
  • As described on Apple's preview page above, their new performance utility XRay is using DTrace under the hood. It appears to provide support for creating DTrace scripts and capturing their output; more will be announced about this to WWDC attendees later this week.
After the show, Adam, Bryan, and I joined the Apple DTrace team (Steve Peters, James McIlree, Terry Lambert, Tom Duffy, Sean Callanan, and John Wright) for some Thai food and a fun discussion of kernel engineering at Apple and Sun and how to continue to enrich their DTrace experience and what changes in our broader community might help them. Some of the things that stood out for me were how much value Apple can bring to their developers by offering semantic USDT probes in their higher-level programming frameworks like Cocoa, Core Audio, etc., and also how the need to interface with Objective C might provide another good source of input for us to think about better integration for the naming schemes and data representation for multiple high-level languages including Objective C, C++, Java, and so forth.

So no matter whether your first love is sys_trap or an A-trap, join us at the DTrace community and get involved!


Sunday Jul 02, 2006

FMA on x64 and at DSN 06

Last Monday, Sun officially released Solaris 10 6/06, our second update to Solaris 10. Among the many new features are some exciting enhancements to our Solaris Predictive Self-Healing feature set, including:
  • Fault management support for Opteron x64 systems, including CPU, Memory diagnosis and recovery,
  • Fault management support for SNMP traps and a new MIB for browsing fault management results, and
  • Fault management support for ZFS, which also is new in Solaris 10 6/06.
And last week at Dependable Systems and Networks 2006, Dong Tang presented a paper we co-authored demonstrating some of the quantitative benefits of Solaris's unique self-healing features, showing that our unique memory retirement feature can decrease annual downtime by 37-54%. If you haven't already tried some of these features out through OpenSolaris or Solaris Express, and availability is important to you, you should definitely give Solaris 10 6/06 a try. In particular, the combination of ZFS protecting your data, AMD's Opteron RAS features, and our unique Solaris Predictive Self-Healing capabilities provide unprecedented availability for x64 platforms. Here are a few more details:

Opteron/x64 Features

The folks at AMD have put together an impressive (and growing) list of hardware RAS features in Opteron. These include:
  • Hardware cache and memory scrubbers,
  • ChipKill ECC for main memory,
  • Extended hardware error registers for first-fault analysis, and
  • a hardware watchdog for HyperTransport transactions.
With Solaris 10 6/06 (or Solaris Express or the latest OpenSolaris), we've provided a new kernel module loading mechanism to permit Solaris to load enhanced cpu-specific support for a particular type of CPU. For example, the Opteron fault management support is provided in this module:
 15 fffffffffbbd3eb0   3d10   -   1  cpu.AuthenticAMD.15 (AMD Athlon64/Opteron CPU Module)
which we load automatically on any Athlon64 or Opteron system. This new module permits Solaris to use the Opteron-specific hardware features to convert hardware error state into telemetry to drive our automated diagnosis software, and then trigger reactions like dynamically offlining a CPU core or retiring a physical page of memory. All of this is done automatically for you, with a first-class administrative model built into Solaris, rather than bursting into flames or spewing random bits of hardware error state out to the poor administrator. Gavin posted more low-level details and examples of our Opteron fault management features on his blog.

Memory Page Retire Benefits

Memory page retire (MPR) is a unique feature of Solaris's self-healing system. It provides the ability for the kernel, at the request of fmd based upon a diagnosis of an underlying memory fault, to remove a particular physical page of memory from use in the system. Thanks to virtual memory, we can actually copy the content to another physical page first (assuming we have a series of correctable errors or a ChipKill event), thereby making the entire operation transparent to running user processes. Similarly, if we have an uncorrectable error (UE) on a clean page, we can retire the page, and then let the kernel fault in the page content again by reading it from the backing object (e.g. a page of text from libc sitting on your filesystem). In the unlikely event of an uncorrectable memory error on a dirty page, we can kill the process, letting smf(5) restart the associated service according to its dependencies.
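The three cases above boil down to a small decision table. Here is a minimal sketch of that policy in C; the enum and function names are hypothetical and chosen for illustration, and this is not the actual Solaris kernel or fmd code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical error classes; illustrative names, not Solaris identifiers. */
typedef enum {
        MEM_ERR_CORRECTABLE,    /* series of CEs or a ChipKill event */
        MEM_ERR_UNCORRECTABLE   /* UE */
} mem_err_t;

/* Hypothetical reactions corresponding to the three cases in the text. */
typedef enum {
        MPR_COPY_AND_RETIRE,    /* copy content to a new page, retire the old */
        MPR_RETIRE_AND_REFAULT, /* retire; refault content from backing object */
        MPR_KILL_PROCESS        /* kill the process; smf(5) restarts service */
} mpr_action_t;

mpr_action_t
mpr_choose_action(mem_err_t err, bool page_dirty)
{
        assert(err == MEM_ERR_CORRECTABLE || err == MEM_ERR_UNCORRECTABLE);

        /* Correctable: the data is still good, so relocation is transparent. */
        if (err == MEM_ERR_CORRECTABLE)
                return (MPR_COPY_AND_RETIRE);

        /* Uncorrectable: a clean page can be re-read, a dirty one cannot. */
        return (page_dirty ? MPR_KILL_PROCESS : MPR_RETIRE_AND_REFAULT);
}
```

The point of the sketch is only that the least disruptive reaction is picked for each kind of fault.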

The great part about page retire is that it's free: it comes at no performance cost, and at a significantly smaller space and dollar cost than hardware memory redundancy. And of course the software to implement it, OpenSolaris, is entirely free too :) Page retire is therefore complementary to the hardware RAS features on Opteron and SPARC, and maximizes the benefit of everything when you have our diagnosis software looking at the underlying failures and figuring out which hammer is the right one to use on the problem. Recently, Dong Tang, Peter Carruthers, Zuheir Totari, and I wrote up a paper describing a quantitative model for the benefits of MPR, demonstrating that it can reduce annual downtime by 37-54%. You can read our DSN '06 paper here. Solaris now offers memory diagnosis and page retire on all our systems, including Opteron, UltraSPARC III, UltraSPARC IV, and UltraSPARC T1.

SNMP Features

We've also introduced a connector between the Solaris fault management stack and SNMP in Solaris 10 6/06, permitting the fault manager to publish SNMP traps and provide MIB browsing for diagnosed problems. This is implemented using the Solaris NetSNMP stack: Keith has posted examples of how to use these features on his blog. The SNMP MIB provides an ideal connection between the unique fault management features in Solaris and any type of existing heterogeneous management software you use. You can examine the MIB itself if you want to learn more.



Sunday Aug 28, 2005

SMBIOS Support for Solaris x86

For the past few years, Sun engineering has been hard at work making the x86 platform a first-class citizen. In case you've been napping, some of the highlights included:

Meantime, there's a bunch of great engineering work going on behind the scenes to continue to enhance our x86 offerings. Alongside the bigger projects we've been making progress on, there are also many smaller pieces that we're putting together, steps that improve the day-to-day quality of life for users, developers, and system administrators. Over the weekend, I integrated another piece: Solaris support for the x86 System Management BIOS (SMBIOS). First I'll talk about why we did this and why it's important, and then I'll spend the remainder of the blog discussing a bit about the design and code for you OpenSolaris propeller-heads. These changes will be visible in OpenSolaris and Solaris Express as part of Build 23.

What is SMBIOS?

System Management BIOS (SMBIOS) is a data structure exported by most modern BIOSes on x86 platforms, including those currently produced by Sun. SMBIOS is an industry-standard mechanism for low-level system software to export hardware configuration information to higher-level system management software. The SMBIOS data format itself is defined by the Distributed Management Task Force (DMTF), which publishes the specification if you want to read a copy.

The SMBIOS image consists of a table of structures, each describing some aspect of the system software or hardware configuration. The content of the image varies widely by platform and BIOS vendor, and may not exist at all on some systems. However, it is the only known mechanism for system software to obtain certain information on x86 systems. A simple example of information found in SMBIOS is the actual name of the platform (e.g. Sun Microsystems v40z), the BIOS vendor, version, and release date. More advanced records describe the DIMM slots in the machine, their labels (i.e. how you can locate them if you crack open the case), and various other slot types and properties.

Not much attention has been paid to SMBIOS: Windows uses it to get the system serial number and is beginning to emphasize its importance to hardware vendors, and the Linux dmidecode utility (DMI was the original name of the SMBIOS spec) has been available for some time to dump it all out in human-readable form for hackers and administrators who wish to peek at some of the details of the underlying hardware on their platform.

SMBIOS and Solaris

The work I integrated this weekend into Solaris provides an equivalent decoding utility for SMBIOS (albeit with some nicer features), but is actually a stepping stone to connecting this information with the new Fault Management features we introduced in Solaris 10. As such, I wanted to provide a much richer environment for dealing with SMBIOS. Specifically, we're working on the first collection of Fault Management features for x64 right now, part of our broader effort to implement Predictive Self-Healing in Solaris. For x64 platforms, the first set of features will include automated diagnosis and fault isolation for Opteron CPU and Memory errors and for our PCI-E stack, all connected to our standardized self-healing administration model, messaging, and knowledge article web. SMBIOS is a tiny piece of this puzzle: it gives us the ability to label self-healing diagnosis messages with the appropriate chassis serial number you need to provide if you file a service call, and allows us to properly identify faulty FRUs you need to replace with the labels that will let you easily locate and remove them from your box.

The initial SMBIOS support provides the following new features:
  • A common set of Solaris APIs to decode and examine SMBIOS images. The common source is compiled both into a new user library, libsmbios, and into the kernel itself. The APIs provide a simplified programming model for accessing SMBIOS, converting the string table indices to string pointers, sanitizing addresses, and generally handling all the bizarre implementation artifacts of the data format itself.

  • A new smbios(1M) utility to permit developers and administrators to examine SMBIOS images either exported by the system or stored in a file on disk. You can also copy your SMBIOS image out to a file (this might be useful for building a catalog of machine configurations, and was very useful to me in developing automated test software for this whole thing).

  • A new smbios(7D) character special device that exports a snapshot of the current system's SMBIOS image, taken at the time the system boots.

  • An implementation of prtdiag(1M) (long present on SPARC Solaris) for Solaris x86 platforms that displays the x86 platform name and summary hardware information using SMBIOS.
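To give a flavor of the "string table indices to string pointers" work mentioned in the first item above: per the SMBIOS specification, each structure's formatted area is followed by a set of NUL-terminated strings ended by an extra NUL, and fields refer to them by 1-based index (0 means "no string"). A minimal sketch of that lookup follows; smb_sketch_string is a hypothetical helper written for illustration, not the libsmbios API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Resolve a 1-based string index for the SMBIOS structure starting at st.
 * avail is the number of valid bytes from st to the end of the image.
 * Returns NULL for index 0, a missing string, or a malformed structure.
 */
const char *
smb_sketch_string(const uint8_t *st, size_t avail, uint8_t index)
{
        const uint8_t *p, *end, *nul;
        size_t hdrlen;
        unsigned int i;

        assert(st != NULL);
        if (index == 0 || avail < 4)
                return (NULL);

        hdrlen = st[1];         /* byte 1: length of the formatted area */
        if (hdrlen < 4 || hdrlen >= avail)
                return (NULL);

        p = st + hdrlen;        /* the string set starts right after it */
        end = st + avail;

        /* An empty string (a lone NUL) marks the end of the string set. */
        for (i = 1; p < end && *p != '\0'; i++) {
                nul = memchr(p, '\0', (size_t)(end - p));
                if (nul == NULL)
                        return (NULL);  /* unterminated: corrupt image */
                if (i == index)
                        return ((const char *)p);
                p = nul + 1;
        }
        return (NULL);
}
```

Bounds-checking every step matters here because the image comes from the BIOS and cannot be trusted to be well formed.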

Here is the usage for the new smbios(1M) utility:
Usage: smbios [-BeOsx] [-i id] [-t type] [-w file] [file]

        -B disable header validation for broken BIOSes
        -e display SMBIOS entry point information
        -i display only the specified structure
        -O display obsolete structure types
        -s display only a summary of structure identifiers and types
        -t display only the specified structure type
        -w write the raw data to the specified file
        -x display raw data for structures

And here is some output from the x86 prtdiag using SMBIOS:
System Configuration: Sun Microsystems   W1100z/2100z
BIOS Configuration: Sun Microsystems R01-B2 S0 10/12/2004

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
AMD Opteron(tm)                  CPU1
AMD Opteron(tm)                  CPU2

==== Memory Device Sockets ================================

Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
DDR     empty  1   DIMM1               Bank 0
DDR     empty  1   DIMM2               Bank 0
DDR     in use 2   DIMM3               Bank 1
DDR     in use 2   DIMM4               Bank 1
DDR     empty  5   DIMM5               Bank 2
DDR     empty  5   DIMM6               Bank 2
DDR     in use 6   DIMM7               Bank 3
DDR     in use 6   DIMM8               Bank 3

==== On-Board Devices =====================================

==== Upgradeable Slots ====================================

ID  Status    Type             Description
--- --------- ---------------- ----------------------------
0   unknown   AGP 8X           AGP8X PRO 110
1   available PCI-X            PCIX1-100MHZ
0   unknown   PCI-X            PCIX2-133MHZ
3   available PCI-X            PCIX3 100 ZCR
0   unknown   PCI-X            PCIX4 100
0   unknown   PCI-X            PCIX5 100

SMBIOS Subsystem Design

One of the most important principles in system software design (or really any large-scale software effort) is what I refer to as leverage, which is my shorthand for locating high-leverage subsystems that form the basic building blocks of the system, and really engineering these things right such that a wide variety of consumers can be easily and flexibly built on top.

For example, in Solaris, we have an incredibly flexible virtual memory allocator built out of two layers, vmem and kmem. The introduction of vmem in Solaris 8 allowed us to delete about 50 special-purpose non-scalable memory allocation subsystems in Solaris, greatly simplifying the kernel, while at the same time improving the scalability, performance, and debuggability of all those new clients. That's leverage.

File and data formats are another place where investing in a high-leverage subsystem is really worthwhile. The leverage points I wanted for SMBIOS were:
  • common API for both userland and kernel consumers built from same source code
  • ability to access SMBIOS from file, memory buffer, or pseudo-device
  • ability to easily maintain and extend the human-readable decoding

Here is a diagram of the new subsystem, showing the source code, binaries, and how the layering works:

The source code in /usr/src/common/smbios/ contains the code to initialize an SMBIOS snapshot from a file or in-memory buffer, and provide a set of common programming APIs for consuming software. This code is compiled twice: once to form an in-kernel SMBIOS subsystem, and a second time to form the bulk of the userland libsmbios library. The library and kernel each provide a small set of glue routines in a file named smb_subr.c, and then each provide some basic discovery routines. For the kernel, discovery is performed by mapping the BIOS's physical memory region and scanning for the SMBIOS table. For the userland library, discovery is performed either by opening a file or a device, or scanning a device which exports the BIOS's physical memory region.
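For the curious, the "scanning" step can be sketched as follows. Under the SMBIOS 2.x specification, the entry point begins with the anchor string "_SM_" on a 16-byte boundary in the BIOS region, with a checksum byte at offset 4 and the entry point length at offset 5; all bytes of the entry point must sum to zero. This is a simplified illustration over an in-memory buffer with a hypothetical function name, not the Solaris discovery code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SMB_SKETCH_ANCHOR       "_SM_"
#define SMB_SKETCH_ANCHOR_LEN   4
#define SMB_SKETCH_ALIGN        16

/*
 * Scan buf (a copy of the BIOS memory region) for a valid entry point.
 * Returns its byte offset within buf, or -1 if none is found.
 */
long
smb_sketch_find_entry(const uint8_t *buf, size_t len)
{
        size_t off, i, eplen;
        uint8_t sum;

        assert(buf != NULL);
        for (off = 0; off + SMB_SKETCH_ALIGN <= len; off += SMB_SKETCH_ALIGN) {
                if (memcmp(buf + off, SMB_SKETCH_ANCHOR,
                    SMB_SKETCH_ANCHOR_LEN) != 0)
                        continue;

                eplen = buf[off + 5];   /* entry point length in bytes */
                if (eplen < 6 || eplen > len - off)
                        continue;

                /* All bytes of the entry point must sum to 0 (mod 256). */
                sum = 0;
                for (i = 0; i < eplen; i++)
                        sum = (uint8_t)(sum + buf[off + i]);
                if (sum == 0)
                        return ((long)off);
        }
        return (-1);
}
```

Checking the checksum as well as the anchor avoids latching onto stray bytes that merely happen to spell "_SM_".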

The kernel then implements a small pseudo-driver, smbios(7D), which simply takes the SMBIOS data discovered at boot time and exports it for use by the library. In userland, the smbios(1M) utility provides generic data display for developers, and an x86 prtdiag is provided which also leverages the library. In the kernel, the IPMI bmc(7D) driver uses the in-kernel subsystem to discover the local Baseboard Management Controller, for use with ipmitool. As you might expect, the original IPMI driver implementation had its own private, buggy, non-leveraged SMBIOS decoder, and all that just got deleted.

One final trick to achieve maintainability is that I wrote a shell script that auto-generates C source to convert the integers used in various SMBIOS fields into the corresponding human-readable strings by simply grabbing the comments next to the appropriate #defines in the new smbios.h. As such, there's no way that a developer can add or modify our definition of the format and forget to update the smbios(1M) output.
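The real generator is a shell script, but the transformation itself is easy to picture. Here is a rough sketch in C of turning one #define line with a trailing comment into a line of a lookup function; gen_case_line and the example macro name used in the note below are made up for illustration, and the actual script's output format may well differ:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Given a "#define NAME value" line that ends with a C comment, write
 * a switch-case line mapping NAME to the comment text into out.
 * Returns 1 on success, 0 if the line has no name or no comment.
 */
int
gen_case_line(const char *line, char *out, size_t outlen)
{
        char name[64];
        const char *open, *close;

        assert(line != NULL && out != NULL);
        if (sscanf(line, "#define %63s", name) != 1)
                return (0);

        open = strstr(line, "/*");
        close = (open != NULL) ? strstr(open + 2, "*/") : NULL;
        if (close == NULL)
                return (0);

        /* Trim the comment delimiters and surrounding blanks. */
        open += 2;
        while (open < close && *open == ' ')
                open++;
        while (close > open && close[-1] == ' ')
                close--;

        (void) snprintf(out, outlen, "case %s: return (\"%.*s\");",
            name, (int)(close - open), open);
        return (1);
}
```

Fed a header line defining a hypothetical EXAMPLE_TYPE_BIOS as 0 with the comment "BIOS information", this produces a case line for EXAMPLE_TYPE_BIOS that returns the string "BIOS information". Because the strings come straight from the header, the two cannot drift apart.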

You can find out more about Fault Management in Solaris from our OpenSolaris community or by posting a question to the discussion forum. If you're interested in looking at the source code or experimenting with the interfaces, you can use the OpenSolaris Source Browser to locate things. It will take a week or so for the code to propagate into the Source Browser's snapshot of the OpenSolaris tree.



Sunday Jul 31, 2005

FMA at the Open Solaris User Group

This past Tuesday evening I gave a presentation of the Solaris 10 Fault Management (FMA) features, part of our collection of Predictive Self-Healing technology, at the Silicon Valley OpenSolaris User Group meeting. I've placed a copy of the slides in PDF format here if you want to view them. Jim Grisanzio also took some great pictures of the event and the video will be available at the SVOSUG community page.

There was excellent turn-out at this event: thanks to Alan for organizing and to everyone who turned out for investing so much time and contributing excellent questions and discussion. Prior to the event, we created a Fault Management Community on the OpenSolaris web site where you can post more questions and get involved in development. There have already been some excellent questions and interest in converting the legacy UltraSPARC I and II error messages in the discussion forum.



Friday Jun 17, 2005

DTrace Challenge at JavaOne

As part of our ongoing efforts to show how DTrace can be used to tune Java applications, we've set up a booth at JavaOne where you can stop by and take the DTrace Challenge. Put simply, you either leave with a faster app or a free iPod. We're not kidding. If you're in San Francisco for the conference, definitely check this out and chat with DTrace team member Adam Leventhal and DTrace commando Jarod Jenson, who may or may not be hanging from the ceiling this time.


Tuesday Jun 14, 2005

Sendmail Died in a Two SIGALRM Fire

In honor of today's full release of OpenSolaris I thought I would retell the story of one of my favorite bugs that I've debugged while working at Sun on Solaris. This problem is Sun bug 4278156 and was originally debugged during development of Solaris 8 during the first week of October, 1999.

The investigation itself began with a single core file found on one of our internal test servers from a sendmail daemon that was running there. At the time, we were testing out the use of a new Solaris facility called coreadm(1M) that Roger Faulkner and I designed. If you haven't used it before, coreadm permits administrators and developers to create different core file patterns (i.e. other than ./core) to be used when creating core files, and to create a global core file repository where a copy of all core dumps will appear on a system. This is an essential part of our strategy of first-fault debugging at Sun: how we try to ensure a single failure can be debugged to root-cause.

Aside from illustrating how we do this with a good story which ultimately benefited everyone using Sendmail, I'll also go into some details about:

  • How we debug core files at a low-level with mdb
  • How signal handling works between the kernel and user processes
  • How sometimes one bug helps you find another

The problem began simply enough: a single core file from sendmail sitting in our global core file directory /var/core on a machine running build 31 of Solaris 8. The stack trace was as follows:

$ mdb core
> $c
sfgets+0x80(0, 0, 0, 0, f0878, d1800)
That is, the stack was truncated at a single frame in sfgets(). Before launching into a detailed analysis of how exactly sendmail died, here are a few pieces of background information to help understand the analysis:

  • Our sendmail is configured (in conf.h) such that in sendmail's source code, jmp_bufs are #defined to be sigjmp_bufs, setjmp() is #defined to be sigsetjmp(env, 1), and longjmp() is #defined to be siglongjmp(). This is important to keep in mind when reading the sendmail code snippets.

  • This bug report presumes a reasonably detailed knowledge of how signal delivery (in this case on SPARC 32-bit) is implemented by the kernel and libc. To understand how this works and how a debugger can locate signal frames on the stack in Solaris, refer to my comments in libproc which are found here.

  • To debug the problem, I wrote several new mdb dcmds and walkers. It's always worth the effort to do so: as part of this debugging effort, we not only got sendmail fixed, we also shipped these new debugging features, thereby benefitting all future development efforts. You can learn more about mdb by reading the Solaris Modular Debugger Guide. I'll go into further details about mdb and how to write commands for it in a future blog.

By looking at the core file I was able to observe the immediate cause of death: the process died from a SIGBUS at sfgets+0x80 because it attempted to store to %fp - 0xc, and somehow its frame pointer had been corrupted (it was set to 0x1, a clearly bogus value):

> $c
sfgets+0x80(0, 0, 0, 0, f0878, d1800)
> <pc/i
sfgets+0x80:    st        %o0, [%fp - 0xc]
> <fp=K

Specifically, sendmail was attempting to store the pointer to the string "local" to the stack for a call to sm_syslog(), as shown below:
char *
sfgets(buf, siz, fp, timeout, during)
        /* ... */

        if (timeout != 0)
                if (setjmp(CtxReadTimeout) != 0)
                        if (LogLevel > 1)
                                sm_syslog(LOG_NOTICE, CurEnv->e_id,
                                       "timeout waiting for input from %.100s during %s",
                                       CurHostName ? CurHostName : "local",

        /* ... */

You can observe that it was executing code following a read timeout, which caused a siglongjmp back to the CtxReadTimeout sigjmp_buf, and caused the setjmp (remember, really a sigsetjmp) to return non-zero. Somehow when sendmail returned from an interrupt (a SIGALRM used to time out sfgets()), the frame pointer ended up as 0x1. The invalid frame pointer is also the reason why the stack backtrace only shows the single sfgets() frame. The readtimeout() routine used as a SIGALRM callback is quite simple, and I disassembled it to determine the address of CtxReadTimeout (since the binary was stripped). Here's the source code for readtimeout():

static void
readtimeout(time_t timeout)
{
        longjmp(CtxReadTimeout, 1);
}
mdb permits developers to add a custom symbol missing from a stripped symbol table to a private symbol table so you can use its name for later debugger queries. So I added a symbol for CtxReadTimeout at 0xcc800+304 with size 76 bytes. Once a symbol is added, the address-to-symbol conversions elsewhere in the debugger can reference it.
> 0xcc800+304::nmadd -s 0t76 CtxReadTimeout
added CtxReadTimeout, value=ccb04 size=4c
> ::nm -P
Value      Size       Type  Bind  Other Shndx    Name
0x000ccb04|0x0000004c|NOTY |GLOB |0x0  |UNDEF   |CtxReadTimeout
In order to confirm that sendmail came through readtimeout(), I decided to verify this hypothesis in two ways: (1) by confirming that the current %sp matched the %sp saved inside of CtxReadTimeout, and (2) by confirming that %o7 was still set to somewhere inside of siglongjmp(). Here is the annotated debugger output:
> CtxReadTimeout/3p
CtxReadTimeout: 1               0xffbe8630      sfgets+0x50
                ^ flags         ^ saved %sp     ^ saved %pc <--[ for reference ]
> <sp=K 
                ffbe8630        <-- same as the saved %sp above
> <o7=p
With this confirmed, my next move was to figure out what the %sp was before we made the siglongjmp of death, so I could figure out what the stack looked like before sendmail died. To do this, I wrote a new mdb walker called oldcontext to produce each lwp's lwpstatus_t.pr_oldcontext value, which is the address of the last ucontext_t saved on the stack by a signal handler. Again, to learn about how signal frames are saved on a stack, look here in the Solaris source code.
> ::walk oldcontext
Using the knowledge of how signal frames are set up, I knew that prior to sendmail's death, its stack basically looked like this:
   ucontext_t   \__ signal                               ^ high addresses
   struct frame /   frame                                |
   struct frame <-- frame saved by libc`sigacthandler    |
   struct frame <-- frame saved by siglongjmp            v low addresses
In 32-bit SPARC Solaris, sizeof (struct frame) = 0x60, so the old stack pointer should have been located at the address ffbe82c0 - (0x60 * 3) = ffbe81a0. And sure enough:
> 0xffbe81a0$C
00000000 0(0, 0, 0, 0, 0, 0)
ffbe81a0 0(d0800, 654c6, 6386c, 0, 0, 0)
ffbe8200`sigacthandler+0x28(e, 0, ffbe82c0, cfb60, 21858, 24cf4)
ffbe8578`_wait+0xc(654d7, f07c8, d00c8, f07c8, 654d7, 0)
ffbe85e0 endmailer+0x7c(188c6c, d0ae0, 0, d0ae0, a9c08, 5)
ffbe8640 smtpquit+0x88(f0878, d1800, 1, d0ae0, 188c6c, 54)
ffbe86a0 reply+0x3fc(cf400, cfc00, cfc00, 1, 0, 188c6c)
ffbe9700 smtpquit+0x64(f0878, d1800, 0, d0ae0, 188c6c, 78)
ffbe9760 mci_flush+0x58(2, 4, cfb28, 0, 1, 0)
ffbe97c0 finis+0xdc(cec00, cfabe, ffbe98e8, ff1ba000, 0, 0)
ffbe9828`sigacthandler+0x28(f, ffbe9ba0, ffbe98e8, ff1ba000, 0, 0)
ffbe9c20`sigacthandler+0x18(1, ffbe9f98, ffbe9ce0, 0, 21d2c, 245e0)
ffbea018`_read+0xc(cfba0, 18a0f4, cfba0, 0, 213a8, 21b94)
ffbea078`fgets+0x24c(ff1c164c, 0, cfba0, ff1ba000, 18a0f4, 7ff)
ffbea0d8 sfgets+0x164(d0000, cfba0, cfba0, e10, c44a0, a9b8f)
ffbea148 reply+0xbc(cf400, 1, ce0d9, d019c, 0, 188c6c)
ffbeb1a8 smtpgetstat+0x20(f0878, 188c6c, cf5dc, 162, 188c6c, cf5dc)
ffbeb208 deliver+0x2490(c1de0, 1000, 130208, ffbecb3c, c1cf0, d00c8)
ffbecff8 sendenvelope+0x1e0(cf5dc, 69, d98e8, a03b0, cfb60, 0)
ffbed170 sendall+0xddc(cf5dc, cfac9, ce6d0, 0, 0, 0)
ffbed1d8 dowork+0x200(ce400, 6d, 0, cfae4, 1, d98e8)
ffbed238 smtp+0x1a14(1, d98e8, 0, f1b70, cf4f8, ff00)
ffbee338 main+0x2ffc(cec00, cfc00, d0ae0, cf4f8, ce400, cfc00)
ffbefd68 _start+0xb8(0, 0, 0, 0, 0, 0)
Notice that there are 3 (!) observed instances of libc's sigacthandler() function on the stack, with signals 0x1 (HUP), 0xf (TERM), and 0xe (ALRM) pending respectively. To walk backward through the list of signals that should have been delivered, I wrote another mdb command: a ucontext walker which follows the uc_link chain. Here is the list of signal ucontext_t's I found by following the ucontext_t.uc_link field and my notes on each:
> ::walk oldcontext | ::walk ucontext
ffbe82c0 <-- SIGALRM that caused siglongjmp of death    [1]
ffbe82b8 <-- ?                                          [2]
ffbe98e8 <-- siginfo ffbe9ba0 = SIGTERM from PID 424186 [3]
ffbe9ce0 <-- siginfo ffbe9f98 = SIGHUP  from PID 424186 [4]
I've annotated each ucontext_t address with what I knew so far, and a bit more. I knew that ucontext 1 was the most recent, put there by the SIGALRM that caused sendmail's death. Looking back on the stack, I saw that there were siginfo_t's (the 2nd parameter to sigacthandler()) corresponding to the third and fourth ucontext_t's. I then decoded the contents of each one and determined that they were externally generated SIGTERM and SIGHUP signals respectively, both from process 424186. Interestingly, the SIGTERM was received at sigacthandler+0x18; that is, while we were still in libc setting up to handle the SIGHUP, but before we called sendmail's SIGHUP handler, we took a SIGTERM. Finally, the second ucontext_t was a mystery to me at this point: it didn't have a corresponding sigacthandler() frame on the stack.
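Conceptually, the ucontext walker described above just follows a singly linked list through uc_link. A minimal sketch of that traversal is below, written against an in-process ucontext_t from <ucontext.h> rather than a debugger target's address space; count_ucontext_chain and the sanity limit are hypothetical and are not the mdb walker's actual code:

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

/* Arbitrary sanity limit so a corrupt, cyclic chain cannot loop forever. */
#define UC_CHAIN_MAX    64

/*
 * Count the saved contexts reachable from uc by following uc_link,
 * most recent first, until the chain ends with a NULL link.
 */
size_t
count_ucontext_chain(const ucontext_t *uc)
{
        size_t n = 0;

        for (; uc != NULL; uc = uc->uc_link) {
                assert(n < UC_CHAIN_MAX);
                n++;
        }
        return (n);
}
```

A real debugger walker has to read each ucontext_t out of the core file and validate every address, which is where most of the work lies.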

In investigating this oddity, I discovered an interesting (and useful) bug in the Solaris SPARC siglongjmp implementation: normally, at the completion of a signal handler routine, a setcontext(2) to the interrupted context is performed. This setcontext() call effectively restores the interrupted context's uc_link value to lwp_oldcontext in the kernel, thus "deleting" the ucontext_t from the "list" of saved signal contexts. However, at that time, Solaris did not store the uc_link obtained from a getcontext(2) call in sigsetjmp() into the sigjmp_buf. Furthermore, libc's siglongjmp() performed another getcontext(), modified the resulting context, and then performed a setcontext() after updating the %pc and %sp, all without touching uc_link. Thus, at that time, Solaris SPARC callers of siglongjmp from a signal handler never deleted ucontexts from the uc_link list, even when these signals were retired. (After the sendmail problem had been fixed, a bug on this was also filed and fixed.)

At first glance this uc_link bug seemed bad to me, until I considered that it gave me a critical piece of data I wouldn't have had otherwise: I knew that a signal was delivered and retired after the SIGTERM but before the SIGALRM that caused sendmail's death, and I knew that its ucontext_t was saved at ffbe82b8 on the stack. So I was now ready to make a couple of observations and a hypothesis:

  • The initial signal (SIGHUP) was delivered while sendmail was blocked inside of read(2), as called from sfgets(), and thus the sfgets() SIGALRM had not yet fired. I also knew that much later, after sendmail decided to quit because of the subsequent SIGTERM, that reply() called smtpquit():
       ffbe8640 smtpquit+0x88(f0878, d1800, 1, d0ae0, 188c6c, 54)
       ffbe86a0 reply+0x3fc(cf400, cfc00, cfc00, 1, 0, 188c6c)
       ffbe9700 smtpquit+0x64(f0878, d1800, 0, d0ae0, 188c6c, 78)
    Notice, using the saved frame pointers, that the mystery ucontext_t pointer (ffbe82b8) was placed on the stack prior to reply()'s call to smtpquit(). I also knew (from groveling inside of CtxReadTimeout) that the signal mask (sigset_t) restored by our siglongjmp of death was:
       0x00004001 0x00000000 0x00000000 0x00000000
    which is the mask consisting of SIGTERM and SIGHUP blocked. This told me that SIGTERM and SIGHUP had not yet completed processing (as I had guessed from the stack trace), but that SIGALRM was unblocked (that is, its processing was done). I also knew from sendmail's source code that reply() itself calls sfgets() before calling smtpquit(), thus installing another SIGALRM event.

  • I now hypothesized that the mystery ucontext_t represented another SIGALRM, namely the SIGALRM corresponding to the first sfgets(), the one still pending on the stack. This SIGALRM was handled as if it was the timeout for the second sfgets (called from reply(), no longer on the stack) because sfgets always uses (and thus overwrites) the same global sigjmp_buf, and caused reply() to abort its sfgets() and call back to smtpquit(). At this point, a second SIGALRM (the one corresponding to reply()'s sfgets() call) was delivered, and this caused a second siglongjmp() back to the same stack location, a stack location we had now re-used and overwritten with local variable information. I then attempted to prove this hypothesis using sendmail's event queue and free event list.
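As a quick check on the signal mask reading above: in the first word of the mask, signal N corresponds to bit N - 1, so 0x00004001 has bits 0 and 14 set, meaning signals 1 (SIGHUP) and 15 (SIGTERM), while bit 13 for signal 14 (SIGALRM) is clear. A small sketch, using a hypothetical helper name and plain signal numbers rather than the real sigset_t accessors:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Test whether signal signo (1..32) is set in the first 32-bit word
 * of a signal mask, where signal N is stored in bit N - 1.
 */
int
mask_has_signal(uint32_t word0, int signo)
{
        assert(signo >= 1 && signo <= 32);
        return ((int)((word0 >> (signo - 1)) & 1u));
}
```

Here mask_has_signal(0x00004001, 1) and mask_has_signal(0x00004001, 15) are both 1, while mask_has_signal(0x00004001, 14) is 0.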

When sendmail wishes to set up an sfgets() timeout, it uses an internal setevent() function that keeps a linked list of pending event structures. alarm() is then used to install a SIGALRM event for the time of the next event that requires a timeout. The event structure looked like this:

struct event {
        time_t          ev_time;        /* time of the function call */
        void            (*ev_func)__P((int)); /* function to call */
        int             ev_arg;         /* argument to ev_func */
        int             ev_pid;         /* pid that set this event */
        struct event    *ev_link;       /* link to next item */
};

At the time of death, the EventQueue list was empty:
> EventQueue/K
0xd09ec:        0
but luckily, retired events were not immediately freed; instead, they were placed on the head of the FreeEventList (linked again using ev_link). Dumping out this list showed that two events had been retired:
> FreeEventList/K
0xcf4ec:        f1cd0
> f1cd0/YpnXDX
0xf1cd0:        1999 Sep 15 15:04:10    readtimeout
                0               414918          f1b50
> f1b50/YpnXDX
0xf1b50:        1999 Sep 15 15:02:20    readtimeout
                0               414918          0
Notice that (1) they both were set up to call readtimeout(), the aforementioned SIGALRM handler, and (2) sendmail conveniently records the time_t of expiration, which we have formatted using /Y in the output above. The code for reply() leading up to smtpquit() looks like this (trimmed for simplicity):
reply(m, mci, e, timeout, pfunc)
        /* ... */

                p = sfgets(bufp, MAXLINE, mci->mci_in, timeout, SmtpPhase);
                mci->mci_lastuse = curtime();

        /* ... */

                if (p == NULL)
                {
                        if (errno == 0)
                                errno = ECONNRESET;

                        mci->mci_errno = errno;
                        mci->mci_state = MCIS_ERROR;
                        smtpquit(m, mci, e);
                }

        /* ... */
Notice that when sfgets() returns, sendmail records the current time in an mci structure. Then if p is NULL, indicating sfgets() failure, it also records errno, and calls smtpquit(). Again from the stack, I was able to retrieve the pointer to the struct mci in question and dump it out:
> 188c6c/6xnXXnppDnXXXXXnYp
0x188c6c:       2048    83      0       0       0       0
                6c              0
                0               0               414935
                c44a0           f0878           130920
                0               0
                1999 Sep 15 15:02:20    0
If you haven't debugged enough core files and crash dumps to have format characters wired into your brain and fingers, use ::formats in mdb to list them out. In the output above I observed two things: (1) the 0x83 at the top is mci_errno, which is decimal 131, the value of ECONNRESET, and (2) the formatted time_t at the bottom (mci_lastuse) exactly matches the expiry time of the first SIGALRM event structure (0xf1b50 shown above). Observation (1) told me that errno was zero when we came back from sfgets() with p == NULL, which means sendmail timed out, as opposed to hitting a read() error. Observation (2) told me that indeed the earlier SIGALRM event (last on the FreeEventList) triggered the later siglongjmp back to sfgets(), confirming my hypothesis.

So the complete chain of events leading to the core dump was:

  • sfgets() was pending, SIGHUP interrupted
  • SIGHUP interrupted by SIGTERM, causing a call to finis()
  • reply() called sfgets(), overwriting global CtxReadTimeout with new sigsetjmp
  • sfgets() was pending, SIGALRM from earlier sfgets() interrupted
  • readtimeout() called siglongjmp(), sfgets() returned to reply() instead of to caller of earlier sfgets()
  • reply() called into smtpquit(), overwriting the portion of stack used by second sfgets()'s saved register window (including %fp) and pointed to by CtxReadTimeout's saved %sp value
  • SIGALRM from second sfgets() now delivered, causing another siglongjmp() to the same CtxReadTimeout sigjmp_buf values
  • %sp was restored to ffbe8630, but now this pointed to more recent stack data instead of a saved register window, so %fp was restored to 0x1
  • sfgets() now managed to run along from sfgets+0x50 to sfgets+0x80, the first instruction following return from the sigsetjmp call at sfgets+0x48 that references %fp, and then died from a SIGBUS trying to perform a store.

Essentially the bug illustrated a pretty serious design flaw in sendmail's timeout mechanism -- with an sfgets() timeout pending, it responded to a SIGTERM and began to execute a long sequence of cleanup code with other calls back into sfgets(). Since every sfgets() timeout re-used the same global sigjmp_buf, this buffer was then overwritten and the earlier timeout triggered a logical return to the wrong stack location. The sendmail code needed to potentially clear previously pending timeouts once it decided it was on a one-way path to quit (or block HUP and TERM while waiting on read in sfgets()), and perhaps the setevent() mechanism needed to have a per-event sigjmp_buf. In any case, the designers of this code needed to rethink the mechanism based on the evidence uncovered here. And they did just that: fixing the problem in the 8.9.3 release.
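To make the design point concrete, here is a minimal sketch of a read timeout that keeps no global state at all. It is not sendmail code and is not meant to describe how 8.9.3 fixed the problem; it swaps in poll(2) for the SIGALRM and siglongjmp() approach, and read_with_timeout() is a hypothetical name. Because the timeout lives entirely in the arguments of each call, a nested call made from cleanup code has nothing of the outer call's to overwrite.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not sendmail code): read up to len bytes from fd,
 * giving up after timeout_ms milliseconds.  Returns the result of read(2),
 * or -1 with errno set; errno is ETIMEDOUT if the timeout expired.
 */
ssize_t
read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd;
    int rv;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /*
     * A signal that interrupts poll() is not a timeout, so retry.
     * Simplification: the retry restarts the full timeout interval.
     */
    do {
        rv = poll(&pfd, 1, timeout_ms);
    } while (rv == -1 && errno == EINTR);

    if (rv == -1)
        return (-1);        /* poll() itself failed */
    if (rv == 0) {
        errno = ETIMEDOUT;  /* nothing became readable in time */
        return (-1);
    }
    return (read(fd, buf, len));
}
```

A caller would treat a -1 return with errno == ETIMEDOUT the way the code discussed here treats a NULL return from sfgets() with errno of zero, without any non-local jump back into a stack frame that may no longer exist.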

Looking back, this problem is a great illustration of how much can be discerned from a single core dump, whether it be from the kernel or a complex user application, and the value in investing the time in root-causing every failure, even those that happen once. Without coreadm this core file would likely never have been seen. And without root-causing the first time this fault was seen, this type of subtle signal condition would likely have gone unreproducible and undebugged for months or even years, until biting one of our customers. Sun's investment in a first-fault debugging strategy helps to ensure that Solaris and other OpenSource software we leverage like Sendmail are of the highest possible quality. As we move into the OpenSolaris era, I know we'll benefit from the debugging skills and brains of those who join the effort, and I hope that our ideas, tools, and techniques for debugging will spread to other Open efforts where developers believe in and value a high-quality product.



Friday May 20, 2005

DTrace Inlines, Translators, and File Descriptors

I've recently added some new features to the DTrace inline feature, so it seems like a good time to go back and review some of the more advanced features of DTrace's D language, and how these features are used to make observing the system easier for DTrace users. This entry is a bit long, but if you hang in there you'll be rewarded with a peek at a new DTrace feature that is headed for Solaris Express.


Early in DTrace's development, once Bryan and I had assembled the nascent DTrace prototype far enough to be able to locate probes, trace data, and execute simple D expressions, it was obvious that even this early stage of DTrace was an incredibly powerful kernel debugging tool. For fun and posterity, here is one of the earliest known actual D programs at work once the compiler, tracing framework, and access to kernel types had been connected together:

From: Bryan Cantrill <>
Subject: Leaving now...
To: (Michael Shapiro)
Date: Tue, 12 Feb 2002 17:47:36 -0800 (PST)

But this is pretty hot:

# dtrace -f 'bcopy/(arg2 > 1000) &&
    (curthread->t_procp->p_cred->cr_uid == 31992)/{stack(20)}'
dtrace: 2 probes enabled.
CPU     ID                    FUNCTION:NAME
  1   8576                      bcopy:entry 

And while this was indeed hot (and still is), you can immediately see how the relationship between the question ("What are the stack traces of all bcopy() calls performed on behalf of user Bryan of length greater than 1000 bytes?") and its realization in D requires knowledge of the Solaris kernel implementation (i.e. that a kthread_t has a proc_t pointer, which has a cred_t pointer, which contains the UID of the user associated with that process). So while great for us kernel programmers, this immediately presented two challenges for us to grapple with in making DTrace more accessible:
  • How can we allow administrators and developers to express concepts that they readily understand, like the idea that a process has a particular UID associated with it, without requiring them to understand how those concepts are implemented?

  • How can we allow DTrace users to write programs that continue to work as the implementation of these concepts changes over time inside of Solaris?

The second question is of particular importance because one of the challenges in writing observability tools and debuggers is that by exposing everything inside of a software system, you increase the risk of users and programs coming to depend upon that knowledge, which then begins to constrain the implementors. One of my most hated examples of this phenomenon is the fact that in the original UNIX ABI the stdio FILE structure was actually exported to programs along with a set of macros that referenced its members, and the file descriptor was represented as an unsigned char (0-255) instead of an int. The result: 32-bit program binaries that used the fileno(3C) macro could be created that would break or cause silent data corruption if we then fixed the design to support file descriptor values above 255. This issue is still causing problems more than a decade later, despite changing fileno() to a function and fixing the issue entirely in the 64-bit Solaris ABI. But I digress.
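The fileno problem is easy to reproduce in miniature. The sketch below uses a made-up structure and macro (struct legacy_file, LEGACY_FILENO, and legacy_roundtrip are all hypothetical, not the real stdio definitions) and assumes 8-bit chars. Because the macro compiles the member access into every caller, the field cannot later be widened without breaking existing binaries.

```c
/* Hypothetical stand-in for a legacy FILE layout; not the real definition. */
struct legacy_file {
    unsigned char lf_file;  /* file descriptor, can only hold 0-255 */
};

/* Macro in the style of the old fileno(): expanded into each caller. */
#define LEGACY_FILENO(fp)   ((fp)->lf_file)

/* Store fd the way the legacy layout would and return what reads back. */
int
legacy_roundtrip(int fd)
{
    struct legacy_file f;

    f.lf_file = (unsigned char)fd;  /* values above 255 wrap modulo 256 */
    return (LEGACY_FILENO(&f));
}
```

With this layout, descriptor 255 survives the round trip, but 256 reads back as 0 and 300 reads back as 44, which is exactly the kind of silent corruption described above.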

To address these two issues in DTrace, we created the notion of a translator. A translator is a collection of D assignment statements provided by the supplier of an interface that can be used to translate an input expression into an object of struct type. Like any D statements, the body of the translator can refer directly to kernel types and kernel global data structures, as well as other DTrace variables. If you're familiar with object-oriented programming, you can imagine a translator sort of like a class that implements a bunch of "get" methods (of course, we don't have functions in D since we can't allow recursion). Translator definitions correspond to the implementation of some piece of software, like a part of the kernel, but they yield a struct that is in effect a stable interface to that software.

For example, DTrace provides a translator from a kernel thread pointer such as the built-in curthread variable to the /proc lwpsinfo structure. This structure is well-defined and documented in proc(4) and is what you get if you read the file /proc/pid/lwp/lwpid/lwpsinfo on your Solaris system. Here is an excerpt of the translator definition, which is delivered to you in the file /usr/lib/dtrace/procfs.d:

translator lwpsinfo_t < kthread_t *T > {
        pr_syscall = T->t_sysnum;
        pr_pri = T->t_pri;
        pr_clname = `sclass[T->t_cid].cl_name;

As you can see, each statement in the translator body is in effect an expression that can be inlined by the D compiler to produce the value of that member when it is referenced. For example, the pr_clname field represents the idea that every Solaris LWP has an associated scheduling class with a well-defined name (e.g. TS=timeshare, RT=real-time) that you can pass to commands like priocntl(1) using the -c option. To retrieve the string, you take the class ID, an integer index into the kernel's sclass array, and then grab the name from the contents of that array. The translator isolates DTrace programs that you write from that implementation detail, so that if we were to, say, delete sclass and replace it with a hash table, you could still reliably use pr_clname in DTrace on various versions of Solaris and get the same result.
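For readers who think more readily in C than in D, a loose analogy for what a translator does is a function that maps an implementation structure onto a stable one. Everything in this sketch is hypothetical (the impl_ and stable_ types, the class table contents, and translate_thread()); it is not the kernel's definition of these structures. It also differs from a real translator in one important way: the C function computes every member eagerly, whereas the D compiler inlines only the member expression you actually reference.

```c
/* Hypothetical implementation-side types, free to change between releases. */
struct impl_class {
    const char *cl_name;
};

struct impl_thread {
    int t_sysnum;   /* current system call number */
    int t_pri;      /* priority */
    int t_cid;      /* index into impl_classes[] */
};

static const struct impl_class impl_classes[] = {
    { "SYS" }, { "TS" }, { "RT" }
};

/* Hypothetical stable view, the only thing consumers are meant to rely on. */
struct stable_lwpinfo {
    int pr_syscall;
    int pr_pri;
    const char *pr_clname;
};

/*
 * Translate the implementation structure into the stable one.
 * The caller must pass a t_cid that is a valid index into impl_classes[].
 */
struct stable_lwpinfo
translate_thread(const struct impl_thread *t)
{
    struct stable_lwpinfo out;

    out.pr_syscall = t->t_sysnum;
    out.pr_pri = t->t_pri;
    out.pr_clname = impl_classes[t->t_cid].cl_name;
    return (out);
}
```

If the class table were replaced with some other data structure, only the body of translate_thread() would change; code that reads pr_clname from the result would not.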


To use a translator in D, you apply the xlate operator to an input expression and specify an output type of either the desired structure or a pointer to it, as shown in the following example:

        printf("%s tid %d waiting for i/o, class=%s\n",
            execname, tid, xlate<lwpsinfo_t>(curthread).pr_clname);

Here we translate curthread to retrieve pr_clname and record the scheduling class of every thread that blocks on an i/o. The results of running this for a few seconds on my desktop look like this:

dtrace: script '/dev/stdin' matched 1 probe
CPU     ID                    FUNCTION:NAME
  1   2053               biowait:wait-start sched tid 0 waiting for i/o, class=SYS
  1   2053               biowait:wait-start cat tid 1 waiting for i/o, class=TS


While using the xlate operator directly is fun (for me, anyway), DTrace also provides an inline facility that makes D programs that use translators easier to read and write. An inline is the declaration of a typed identifier that is replaced by the compiler with the result of an expression whenever that identifier is referenced somewhere else in the program. This is more powerful than simple lexical substitution like the sort provided by C's #define, as we'll see in a moment. Here are some example inline declarations:

inline int c = 123;
inline uid_t uid = curthread->t_procp->p_cred->cr_uid;

Once declared, inlines can be used anywhere as if they were variables provided for you by DTrace. We can also use inlines to substitute translator expressions, which allows us to connect together all of the ideas discussed so far. For example, DTrace provides a built-in curlwpsinfo variable to let you access all of the process model information for the current LWP. This variable is not a variable at all, but instead the following inline provided for you by /usr/lib/dtrace/procfs.d:

inline lwpsinfo_t *curlwpsinfo = xlate <lwpsinfo_t *> (curthread);

So using the inlines and translators provided for you by DTrace, you can rewrite the previous example like this, using only the stable interfaces defined in proc(4):

        printf("%s tid %d waiting for i/o, class=%s\n",
            execname, tid, curlwpsinfo->pr_clname);

Together, inlines and translators let us provide stable representations of Solaris kernel interfaces in a form that resembles a Solaris administrative or user-programming concept that is already well-understood, while allowing us to continue to evolve the Solaris implementation underneath.

Observing File Descriptors

I recently added an extension to the inline facility in DTrace to permit inlines to define identifiers that act like D associative arrays, instead of scalar variables similar to the examples in the previous section. Everything from this point forward will be available in Build 16 of Nevada (aka the next Solaris release) which you will be able to download here at some point in the future. We'll likely backport this feature to a Solaris 10 Update later this year as well. To create an inline that acts like an associative array using the new DTrace feature, you can use a declaration like this:

inline int a[int x, int y] = x + y;

Given this definition, a reference to the expression a[1, 2] would be as if you typed 3 in your program. Using this new facility, I've added an fds[] array to DTrace that returns information about the file descriptors associated with the process corresponding to the current thread. The array's base type is the fileinfo_t structure already used by DTrace's I/O provider, with a new member for the open(2) flags. Here's an example of fds[] in action:

$ dtrace -q -s /dev/stdin
/ execname == "ksh" && fds[arg0].fi_oflags & O_APPEND /
        printf("ksh %d appending to %s\n", pid, fds[arg0].fi_pathname);

If I run this command on my desktop and start typing commands in another shell, I see output like this:

ksh 127453 appending to /home/mws/.sh_history
ksh 127453 appending to /home/mws/.sh_history

That is, given a file descriptor specified as an argument to write(2), I can match writes by ksh where the file descriptor was opened O_APPEND and then print the pathname of the file to which the data is being appended.

All of the implementation for fds[] is provided by a translator and an inline (i.e. zero new kernel support required). The translator converts a kernel file structure to a DTrace fileinfo_t, and then the inline declaration to define fds[] looks like this:

inline fileinfo_t fds[int fd] = xlate <fileinfo_t> (
    fd >= 0 && fd < curthread->t_procp->p_user.u_finfo.fi_nfiles ?
    curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);
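The conditional in that inline is a bounds check: a descriptor outside the table yields NULL rather than an out-of-range array access. A rough C rendering of the same guard follows; the types and the fd_lookup() name are hypothetical stand-ins, not the kernel's actual file table structures.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a per-process file table. */
struct open_file {
    const char *of_path;
    int of_oflags;
};

struct fd_table {
    int ft_nfiles;                  /* number of slots in ft_list */
    struct open_file **ft_list;     /* indexed by file descriptor */
};

/* Return the entry for fd, or NULL if fd is outside the table. */
const struct open_file *
fd_lookup(const struct fd_table *tab, int fd)
{
    if (fd >= 0 && fd < tab->ft_nfiles)
        return (tab->ft_list[fd]);
    return (NULL);
}
```

In the D version the NULL result is then handed to the translator, so presumably the translator has to cope with a NULL input itself; that part is not shown here.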

I'll discuss how inlines can affect how we programmatically compute the stability of your DTrace programs in a future blog.


Monday Jan 24, 2005

Self-Healing in Modern Operating Systems

I recently wrote an article for the ACM's Queue Magazine entitled Self-Healing in Modern Operating Systems, which you can read online or download as a PDF file. The article was subsequently discussed on Slashdot. I stated the thesis of the article as follows:

Your operating system provides threads as a programming primitive that permits applications to scale transparently and perform better as multiple processors, multiple cores per die, or more hardware threads per core are added. Your operating system also provides virtual memory as a programming abstraction that allows applications to scale transparently with available physical memory resources. Now we need our operating systems to provide the new abstractions that will enable self-healing activities or graceful degradation in service without requiring developers to rewrite applications or administrators to purchase expensive hardware that tries to work around the operating system instead of with it.

The article provides an overview of our approach to building real self-healing technology into Solaris 10, and tries to make the case that administrators and developers will only benefit from automated diagnosis and repair technology in a cost-effective fashion when the operating system is involved and provides new stable abstractions for these RAS interactions. You can learn more about self-healing in Solaris 10 on BigAdmin and at our Knowledge Article Web.


Sunday Jan 23, 2005


I work in Solaris Kernel Development at Sun Microsystems, where among other things I'm the architect for RAS (Reliability, Availability, Serviceability) features in Solaris. My research and engineering interests are focused on technology to enhance the availability of computer systems, including programming languages and debugging tools for developers, operating system technologies for handling and recovering from software and hardware faults and defects, and tools for administrators and users that improve the user experience. My work at Sun includes the design and implementation of:
  • Commands: dtrace(1M), dumpadm(1M), fmadm(1M), fmdump(1M), fmstat(1M), mdb(1), pgrep(1), pkill(1)
  • Daemons: fmd(1M)
  • Libraries: libctf, libdtrace, libfmd_adm, libfmd_log, libproc
  • Kernel Subsystems: Lock-Free Error Queues, Panic Subsystem, Firmware Locking, Error Trap Interpositioning (on_trap), UltraSPARC-I and II CPU and Memory Error Handling, DTrace Virtual Machine
  • File and Data Formats: CTF (Compact C Type Format), DOF (DTrace Object Format), FCF (FMD Checkpoint Format)

as well as contributions to the design of coreadm(1M), user core files, kernel crash dumps, the /proc filesystem, and other related areas. In Solaris 10, I designed and implemented the D programming language and compiler for DTrace, and led the effort to create Sun's architecture for Predictive Self-Healing, part of our innovative approach to Fault Management that is debuting in Solaris 10.

Contrary to earlier blog easter-eggs, I resemble neither the battering-ram power of Bosco Baracus nor the pasty-haired impishness of Larry Fine. I do, however, look pretty much exactly like my cartoon action figure, as seen in InsideJack Episode 2.

Prior to working at Sun, I was causing trouble with my partner-in-crime Bryan Cantrill at Brown University, where I received a BS and MS in Computer Science. I'm originally from the Boston area, and spend much of my free time reliving basketball games from the 80's now on DVD, this year's Red Sox triumph, and the weekly drama of a Man Named Brady.



