Thursday Aug 07, 2008

SES Sensors and Indicators

Last week, Rob Johnston and I coordinated two putbacks to Solaris to further the cause of Solaris platform integration, this time focusing on sensors and indicators. Rob has a great blog post with an overview of the new sensor abstraction layer in libtopo. Rob did most of the hard work- my contribution consisted only of extending the SES enumerator to support the new facility infrastructure.

You can find a detailed description of the changes in the original FMA portfolio here, but it's much easier to understand via demonstration. This is the fmtopo output for a fan node in a J4400 JBOD:

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
    label             string    Cooling Fan  0
    FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f
    target-path       string    /dev/es/ses3

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=ident
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x1 (LOCATE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?indicator=fail
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    type              uint32    0x0 (SERVICE)
    mode              uint32    0x0 (OFF)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=speed
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x4 (FAN)
    units             uint32    0x12 (RPM)
    reading           double    3490.000000
    state             uint32    0x0 (0x00)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0000000005:server-id=/ses-enclosure=1/fan=0?sensor=fault
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0000000005
    server-id         string
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    discrete
    type              uint32    0x103 (GENERIC_STATE)
    state             uint32    0x1 (DEASSERTED)
  group: ses                            version: 1   stability: Private/Private
    node-id           uint64    0x1f

Here you can see the available indicators (locate and service), the fan speed (3490 RPM) and if the fan is faulted. Right now this is just interesting data for savvy administrators to play with, as it's not used by any software. But that will change shortly, as we work on the next phases:

  • Monitoring of sensors to detect failure in external components which have no visibility in Solaris outside libtopo, such as power supplies and fans. This will allow us to generate an FMA fault when a power supply or fan fails, regardless of whether it's in the system chassis or an external enclosure.
  • Generalization of the disk-monitor fmd plugin to support arbitrary disks. This will control the failure indicator in response to FMA-diagnosed faults.
  • Correlation of ZFS faults with the associated physical disk. Currently, ZFS faults are against a "vdev" - a ZFS-specific construct. The user is forced to translate from this vdev to a device name, and then use the normal (i.e. painful) methods to figure out which physical disk was affected. With a little work it's possible to include the physical disk in the FMA fault to avoid this step, and also allow the fault LED to be controlled in response to ZFS-detected faults.
  • Expansion of the SCSI framework to support native diagnosis of faults, instead of a stream of syslog messages. This involves generating telemetry in a way that can be consumed by FMA, as well as a diagnosis engine to correlate these ereports with an associated fault.

Even after we finish all of these tasks and reach the nirvana of a unified storage management framework, there will still be lots of open questions about how to leverage the sensor framework in interesting ways, such as a prtdiag-like tool for assembling sensor information, or threshold alerts for non-critical warning states. But with these latest putbacks, it feels like our goals from two years ago are actually within reach, and that I will finally be able to turn on that elusive LED.

Sunday Jul 13, 2008

External storage enclosures in Solaris

Over the past few years, I've been working on various parts of Solaris platform integration, with an emphasis on disk monitoring. While the majority of my time has been focused on fishworks, I have managed to implement a few more pieces of the original design.

About two months ago, I integrated the libscsi and libses libraries into Solaris Nevada. These libraries, originally written by Keith Wesolowski, form an abstraction layer upon which higher level software can be built. The modular nature of libses makes it easy to extend with vendor-specific support libraries in order to provide additional information and functionality not present in the SES standard, something difficult to do with the kernel-based ses(7d) driver. And since it is written in userland, it is easy to port to other operating systems. This library is used as part of the fwflash firmware upgrade tool, and will be used in future Sun storage management products.

While libses itself is an interesting platform, it's true raison d'etre is to serve as the basis for enumeration of external enclosures as part of libtopo. Enumeration of components in a physically meaningful manner is a key component of the FMA strategy. These components form FMRIs (fault managed resource identifiers) that are the target of diagnoses. These FMRIs provide a way of not just identifying that "disk c1t0d0 is broken", but that this device is actually in bay 17 of the storage enclosure whose chassis serial number is "2029QTF0809QCK012". In order to do that effectively, we need a way to discover the physical topology of the enclosures connected to the system (chassis and bays) and correlate it with the in-band I/O view of the devices (SAS addresses). This is where SES (SCSI enclosure services) comes into play. SES processes show up as targets in the SAS fabric, and by using the additional element status descriptors, we can correlate physical bays with the attached devices under Solaris. In addition, we can also enumerate components not directly visible to Solaris, such as fans and power supplies.

The SES enumerator was integrated in build 93 of nevada, and all of these components now show up in the libtopo hardware topology (commonly referred to as the "hc scheme"). To do this, we walk over al the SES targets visible to the system, grouping targets into logical chassis (something that is not as straightforward as it should be). We use this list of targets and a snapshot of the Solaris device tree to fill in which devices are present on the system. You can see the result by running fmtopo on a build 93 or later Solaris machine:

# /usr/lib/fm/fmd/fmtopo














To really get all the details, you can use the '-V' option to fmtopo to dump all available properties:

# fmtopo -V '\*/ses-enclosure=0/bay=0/disk=0'
TIME                 UUID
Jul 14 03:54:23 3e95d95f-ce49-4a1b-a8be-b8d94a805ec8

  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
    ASRU              fmri      dev:///:devid=id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________//scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
    label             string    SCSI Device  0
    FRU               fmri      hc://:product-id=SUN-Storage-J4400:chassis-id=2029QTF0809QCK012:server-id=:serial=5QD0PC3X:part=SEAGATE-ST37500NSSUN750G-0720A0PC3X:revision=3.AZK/ses-enclosure=0/bay=0/disk=0
  group: authority                      version: 1   stability: Private/Private
    product-id        string    SUN-Storage-J4400
    chassis-id        string    2029QTF0809QCK012
    server-id         string    
  group: io                             version: 1   stability: Private/Private
    devfs-path        string    /scsi_vhci/disk@gATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3X
    devid             string    id1,sd@TATA_____SEAGATE_ST37500NSSUN750G_0720A0PC3X_____5QD0PC3X____________
    phys-path         string[]  [ /pci@0,0/pci10de,377@a/pci1000,3150@0/disk@1c,0 /pci@0,0/pci10de,375@f/pci1000,3150@0/disk@1c,0 ]
  group: storage                        version: 1   stability: Private/Private
    logical-disk      string    c0tATASEAGATEST37500NSSUN750G0720A0PC3X5QD0PC3Xd0
    manufacturer      string    SEAGATE
    model             string    ST37500NSSUN750G 0720A0PC3X
    serial-number     string    5QD0PC3X            
    firmware-revision string       3.AZK
    capacity-in-bytes string    750156374016

So what does this mean, other than providing a way for you to finally figure out where disk 'c3t0d6' is really located? Currently, it allows the disks to be monitored by the disk-transport fmd module to generate faults based on predictive failure, over temperature, and self-test failure. The really interesting part is where we go from here. In the near future, thanks to work by Rob Johnston on the sensor framework, we'll have the ability to manage LEDs for disks that are part of external enclosures, diagnose failures of power supplies and fans, as well as the ability to read sensor data (such as fan speeds and temperature) as part of a unified framework.

I often like to joke about the amount of time that I have spent just getting a single LED to light. At first glance, it seems like a pretty simple task. But to do it in a generic fashion that can be generalized across a wide variety of platforms, correlated with physically meaningful labels, and incorporate a diverse set of diagnoses (ZFS, SCSI, HBA, etc) requires an awful lot of work. Once it's all said and done, however, future platforms will require little to no integration work, and you'll be able to see a bad drive generate checksum errors in ZFS, resulting in a FMA diagnosis indicating the faulty drive, activate a hot spare, and light the fault LED on the drive bay (wherever it may be). Only then will we have accomplished our goal of an end-to-end storage strategy for Solaris - and hopefully someone besides me will know what it has taken to get that little LED to light.

Saturday Jun 09, 2007

Solaris Sensors and Indicators

For those of you who have been following my recent work with Solaris platform integration, be sure to check out the work Cindi and the FMA team are doing as part of the Sensor Abstraction Layer project. Cindi recently posted an initial version of the Phase 1 design document. Take a look if you're interested in the details, and join the discussion if you're interested in defining the Solaris platform experience.

The implications of this project for unified platform integration are obvious. With respect to what I've been working on, you'll likely see the current disk monitoring infrastructure converted into generic sensors, as well as the sfx4500-disk LED support converted into indicators. I plan to leverage this work as well as the SCSI FMA work to enable correlated ZFS diagnosis across internal and external storage.

Saturday May 26, 2007

Solaris platform integration - disk monitoring

Two weeks ago I putback PSARC 2007/202, the second step in generalizing the x4500 disk monitor. As explained in my previous blog post, one of the tasks of the original sfx4500-disk module was reading SMART data from disks and generating associated FMA faults. This platform-specific functionality needed to be generalized to effectively support future Sun platforms.

This putback did not add any new user-visible features to Solaris, but it did refactor the code in the following ways:

  • A new private library, libdiskstatus, was added. This generic library uses uSCSI to read data from SCSI (or SATA via emulation) devices. It is not a generic SMART monitoring library, focusing only on the three generally available disk faults: over temperature, predictive failure, and self-test failure. There is a single function, disk_status_get() that reurns an nvlist describing the current parameters reported by the drive and whether any faults are present.

  • This library is used by the SATA libtopo module to export a generic TOPO_METH_DISK_STATUS method. This method keeps all the implementation details within libtopo and exports a generic inerface for consumers.

  • A new fmd module, disk-transport, periodically iterates over libtopo nodes and invokes the TOPO_METH_DISK_STATUS method on any supported nodes. The module generates FMA ereports for any detected errors.

  • These ereports are translated to faults by a simple eversholt DE. These are the same faults that were originally generated by the sfx4500-disk module, so the code that consumes them remains unchanged.

These changes form the foundation that will allow future Sun platforms to detect and react to disk failures, eliminating 5200 lines of platform-specific code in the process. The next major steps are currently in progress:

The FMA team, as part of the sensor framework, is expanding libtopo to include the ability to represent indicators (LEDs) in a generic fashion. This will replace the x4500 specific properties and associated machinery with generic code.

The SCSI FMA team is finalizing the libtopo enumeration work that will allow arbitrary SCSI devices (not just SATA) to be enumerated under libtopo and therefore be monitored by the disk-transport module. The first phase will simply replicate the existing sfx4500-disk functionality, but will enable us to model future non-SATA platforms as well as external storage devices.

Finally, I am finishing up my long-overdue ZFS FMA work, a necessary step towards connecting ZFS and disk diagnosis. Stay tuned for more info.

Saturday Mar 17, 2007

Solaris platform integration - libipmi

As I continue down the path of improving various aspects of ZFS and Solaris platform integration, I found myself in the thumper (x4500) fmd platform module. This module represents the latest attempt at Solaris platform integration, and an indication of where we are headed in the future.

When I say "platform integration", this is more involved than the platform support most people typically think of. The platform teams make sure that the system boots and that all the hardware is supported properly by Solaris (drivers, etc). Thanks to the FMA effort, platform teams must also deliver a FMA portfolio which covers FMA support for all the hardware and a unified serviceability plan. Unfortunately, there is still more work to be done beyond this, of which the most important is interacting with hardware in response to OS-visible events. This includes ability to light LEDs in response to faults and device hotplug, as well as monitoring the service processor and keeping external FRU information up to date.

The sfx4500-disk module is the latest attempt at providing this functionality. It does the job, but is afflicted by the same problems that often plague platform integration attempts. It's overcomplicated, monolithic, and much of what it does should be generic Solaris functionality. Among the things this module does:

  • Reads SMART data from disks and creates ereports
  • Diagnoses ereports into corresponding disk faults
  • Implements an IPMI interface directly on top of /dev/bmc
  • Responds to disk faults by turning on the appropriate 'fault' disk LED
  • Listens for hotplug and DR events, updating the 'ok2rm' and 'present' LEDs
  • Updates SP-controlled FRU information
  • Monitors the service process for resets and resyncs necessary information

Needless to say, every single item on the above list is applicable to a wide variety of Sun platforms, not just the x4500, and it certainly doesn't need to be in a single monolithic module. This is not meant to be a slight against the authors of the module. As with most platform integration activities, this effort wasn't communicated by the hardware team until far too late, resulting in an unrealistic schedule with millions of dollars of revenue behind it. It doesn't help that all these features need to be supported on Solaris 10, making the schedule pressure all the more acute, since the code must soak in Nevada and then be backported in time for the product release. In these environments even the most fervent pleas for architectural purity tend to fall on deaf ears, and the engineers doing the work quickly find themselves between a rock and a hard place.

As I was wandering through this code and thinking about how this would interact with ZFS and future Sun products, it became clear that it needed a massive overhaul. More specifically, it needed to be burned to the ground and rebuilt as a set of distinct, general purpose, components. Since refactoring 12,000 lines of code with such a variety of different functions is non-trivial and difficult to test, I began by factoring out different pieces individually, redesigning the interfaces and re-integrating them into Solaris on a piece-by-piece basis.

Of all the functionality provided by the module, the easiest thing to separate was the IPMI logic. The Intelligent Platform Management Interface is a specification for communicating with service Pprocessors to discover and control available hardware. Sadly, it's anything but "intelligent". If you had asked me a year ago what I'd be doing at the beginning of this year, I'm pretty sure that reading the IPMI specification would have been at the bottom of my list (right below driving stakes through my eyeballs). Thankfully, the IPMI functionality needed was very small, and the best choice was a minimally functional private library, designed solely for the purpose of communicating with the Service Processor on supported Sun platforms. Existing libraries such as OpenIPMI were too complicated, and in their efforts to present a generic abstracted interface, didn't provide what we really needed. The design goals are different, and the ON-private IPMI library and OpenIPMI will continue to develop and serve different purposes in the future.

Last week I finally integrated libipmi. In the process, I eliminated 2,000 lines of platform-specific code and created a common interface that can be leveraged by other FMA efforts and future projects. It is provided for both x86 and SPARC, even though there are currently no supported SPARC machines with an IPMI-capable service processor (this is being worked on). This library is private and evolving quite rapidly, so don't use it in any non-ON software unless you're prepared to keep up with a changing API.

As part of this work, I also created a common fmd module, sp-monitor, that monitors the service processor, if present, and generates a new ESC_PLATFORM_RESET sysevent to notify consumers when the service processor is reset. The existing sfx4500-disk module then consumes this sysevent instead of monitoring the service processor directly.

This is the first of many steps towards eliminating this module in its current form, as well as laying groundwork for future platform integration work. I'll post updates to this blog with information about generic disk monitoring, libtopo indicators, and generic hotplug management as I add this functionality. The eventual goal is to reduce the platform-specific portion of this module to a single .xml file delivered via libtopo that all these generic consumers will use to provide the same functionality that's present on the x4500 today. Only at this point can we start looking towards future applications, some of which I will describe in upcoming posts.

Wednesday Mar 14, 2007

DTrace sysevent provider

I've been heads down for a long time on a new project, but occasionally I do put something back to ON worth blogging about. Recently I've been working on some problems which leverage sysevents (libsysevent(3LIB)) as a common transport mechanism. While trying to understand exactly what sysevents were being generated from where, I found the lack of observability astounding. After poking around with DTrace, I found that tracking down the exact semantics was not exactly straightforward. First of all, we have two orthogonal sysevent mechanisms, the original syseventd legacy mechanism, and the more recent general purpose event channel (GPEC) mechanism, used by FMA. On top of this, the sysevent_impl_t structure isn't exactly straightforward, because all the data is packed together in a single block of memory. Knowing that this would be important for my upcoming work, I decided that adding a stable DTrace sysevent provider would be useful.

The provider has a single probe, sysevent:::post, which fires whenever a sysevent post attempt is made. It doesn't necessarily indicate that the syevent was successfully queued or received. The probe has the following semantics:

# dtrace -lvP sysevent
   ID   PROVIDER            MODULE                          FUNCTION NAME
44528   sysevent           genunix                    queue_sysevent post

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: syseventchaninfo_t \*
                args[1]: syseventinfo_t \*

The 'syseventchaninfo_t' translator has a single member, 'ec_name',which is the name of the event channel. If this is being posted via the legacy sysevent mechanism, then this member will be NULL. The 'syeventinfo_t' translator has three members, 'se_publisher', 'se_class', and 'se_subclass'. These mirror the arguments to sysevent_post(). The following script will dump all sysevents posted to syseventd(1M):

#!/usr/sbin/dtrace -s

#pragma D option quiet

	printf("%-30s  %-20s  %s\\n", "PUBLISHER", "CLASS",

/args[0]->ec_name == NULL/
	printf("%-30s  %-20s  %s\\n", args[1]->se_publisher,
	    args[1]->se_class, args[1]->se_subclass);

And the output during a cfgadm -c unconfigure:

PUBLISHER                       CLASS                 SUBCLASS 
SUNW:usr:devfsadmd:100237       EC_dev_remove         disk
SUNW:usr:devfsadmd:100237       EC_dev_branch         ESC_dev_branch_remove
SUNW:kern:ddi                   EC_devfs              ESC_devfs_devi_remove

This has already proven quite useful in my ongoing work, and hopefully some other developers out there will also find it useful.

Monday Sep 12, 2005

First sponsored bugfix

Yes, I am still here. And yes, I'm still working on ZFS as fast as I can. But I do have a small amount of free time, and managed to pitch in with some of the OpenSolaris bug sponsor efforts over at the two basic code cleanup fixes into the Nevada gate. Nothing spectacular, but worthy of a proof of concept, and it adds another name to the list of contributors who have had fixes putback into Nevada. Next week I'll try to grab one of the remaining bugfixes to lend a hand. Maybe someday I'll have enough time to blog for real, but don't expect much until ZFS is back in the gate.

Also, check out the Nevada putback logs for build 22. Very cool stuff - kudos to Steve and the rest of the OpenSolaris team. Pay attention to the fixes contributed by Shawn Walker and Jeremy Teo - It's nice to see active work being done, despite the fact that we still have so much work left to do in building an effective community.

Technorati Tag:

Thursday Aug 11, 2005

Fame, Glory, and a free iPod shuffle

Thanks to Jarod Jenson (of DTrace and Aeysis fame), I now have a shiny new 512MB iPod shuffle to give away to a worthy OpenSolaris cause. Back when I posted my original MDB challenge, I had no cool stuff to entice potential suitors. So now I'll offer this iPod shuffle to the first person who submits an acceptable solution to the problem and follows through to integrate the code into OpenSolaris (I will sponsor any such RFE). Send your diffs against the latest OpenSolaris source to me at eric dot schrock at sun dot com. We'll put a time limit of, say, a month and a half (until 10/1) so that I can safely recycle the iPod shuffle into another challenge should no one respond.

Once again, the original challenge is here.

So besides the fame and glory of integrating the first non-bite size RFE into OpenSolaris, you'll also walk away with a cool toy. Not to mention all the MDB knowledge you'll have under your belt. Feel free to email me questions, or head over to the mdb-discuss forum. Good Luck!


Monday Aug 08, 2005

Where have I been?

It's been almost a month since my last blog post, so I thought I'd post an update. I spent the month of July in Massachusetts, alternately on vacation, working remotely, and attending my brother's wedding. The rest of the LAE (Linux Application Environment) team joined me (and Nils) for a week out there, and we made some huge progress on the project. For the curious, we're working on how best to leverage OpenSolaris to help the project and the community, at which point we can go into more details about what the final product will look like. Until then, suffice to say "we're working on it". All this time on LAE did prevent me from spending time with my other girlfriend, ZFS. Since getting back, I've caught up with most of the ZFS work in my queue, and the team has made huge progress on ZFS in my absence. As much as I'd like to talk about details (or a schedule), I can't :-( But trust me, you'll know when ZFS integrates into Nevada; there are many bloggers who will not be so quiet when that putback notice comes by. Not to mention that the source code will hit OpenSolaris shortly thereafter.

Tomorrow I'll be up at LinuxWorld, hanging out at the booth with Ben and hosting the OpenSolaris BOF along with Adam and Bryan (Dan will be there as well, though he didn't make the "official" billing). Whether you know nothing about OpenSolaris or are one of our dedicated community members, come check it out.

Tuesday Jul 12, 2005

Operating system tunables

There's an interesting discussion over at opensolaris-code, spawned from an initial request to add some tunables to Solaris /proc. This exposes a few very important philosophical differences between Solaris and other operating systems out there. I encourage you to read the thread in its entirety, but here's an executive summary:

  • When possible, the system should be auto-tuning - If you are creating a tunable to control fine grained behavior of your program or operating system, you should first ask yourself: "Why does this tunable exist? Why can't I just pick the best value?" More often than not, you'll find the answer is "Because I'm lazy" or "The problem is too hard." Only in rare circumstances is there ever a definite need for a tunable, and almost always control coarse on-off behavior.

  • If a tunable is necessary, it should be as specific as possible - The days of dumping every tunable under the sun into /etc/system are over. Very rarely do tunables need to be system wide. Most tunables should be per process, per connection, or per filesystem. We are continually converting our old system-wide tunables into per-object controls.

  • Tunables should be controlled by a well defined interface - /etc/system and /proc are not your personal landfills. /etc/system is by nature undocumented, and designing it as your primary interface is fundamentally wrong. While /proc is well documented, but it's also well defined to be a process filesystem. Besides the enormous breakage you'd introduce by adding /proc/tunables, its philosophically wrong. The /system directory is a slightly better choice, but it's intended primarily for observability of subsystems that translate well to a hierarchical layout. In general, we don't view filesystems as a primary administrative interface, but a programmatic API upon which more sophisticated tools can be built.

One of the best examples of these principles can been seen in the updated System V IPC tunables. Dave Powell rewrote this arcane set of /etc/system tunables during the course of Solaris 10. Many of the tunables were made auto-tuning, and those that couldn't be were converted into resource controls administered on a per process basis using standard Solaris administrative tools. Hopefully Dave will blog at some point about this process, the decisions he made, and why.

There are, of course, always going to be exceptions to the above rules. We still have far too many documented /etc/system tunables in Solaris today, and there will always be some that are absolutely necessary. But our philosophy is focused around these principles, as illustrated by the following story from the discussion thread:

Indeed, one of the more amusing stories was a Platinum Beta customer showing us some slideware from a certain company comparing their OS against Solaris. The slides were discussing available tunables, and the basic gist was something like:

"We used to have way fewer tunables than Solaris, but now we've caught up and have many more than they do. Our OS rules!"

Needless to say, we thought they company was missing the point.


Friday Jul 01, 2005

A parting MDB challenge

Like most of Sun's US employees, I'll be taking the next week off for vacation. On top of that, I'll be back in my hometown in MA for the next few weeks, alternately working remotely and attending my brother's wedding. I'll leave you with an MDB challenge, this time much more involved than past "puzzles". I don't have any prizes lying around, but this one would certainly be worth one if I had anything to give.

So what's the task? To implement munges as a dcmd. Here's the complete description:

Implement a new dcmd, ::stacklist, that will walk all threads (or all threads within a specific process when given a proc_t address) and summarize the different stacks by frequency. By default, it should display output identical to 'munges':

> ::stacklist
73      ##################################  tp: fffffe800000bc80

38      ##################################  tp: ffffffff82b21880


The first number is the frequency of the given stack, and the 'tp' pointer should be a representative thread of the group. The stacks should be organized by frequency, with the most frequent ones first. When given the '-v' option, the dcmd should print out all threads containing the given stack trace. For extra credit, the ability to walk all threads with a matching stack (addr::walk samestack) would be nice.

This is not an easy dcmd to write, at least when doing it correctly. The first key is to use as little memory as possible. This dcmd must be capable of being run within kmdb(1M), where we have limited memory available. The second key is to leverage existing MDB functionality without duplicating code. You should not be copying code from ::findstack or ::stack into your dcmd. Ideally, you should be able to invoke ::findstack without worry about its inner workings. Alternatively, restructuring the code to share a common routine would also be acceptable.

This command would be hugely beneficial when examining system hangs or other "soft failures," where there is no obvious culprit (such as a panicking thread). Having this functionality in KMDB (where we cannot invoke 'munges') would make debugging a whole class of problems much easier. This is also a great RFE to get started with OpenSolaris. It is self contained, low risk, but non-trivial, and gets you familiar with MDB at the same time. Personally, I have always found the observability tools a great place to start working on Solaris, because the risk is low while still requiring (hence learning) internal knowledge of the kernel.

If you do manage to write this dcmd, please email me (Eric dot Schrock at sun dot com) and I will gladly be your sponsor to get it integrated into OpenSolaris. I might even be able to dig up a prize somewhere...

Sunday Jun 26, 2005

Virtualization and OpenSolaris

There's actually a decent piece over at eWeek discussing the future of Xen and LAE (the project formerly known as Janus) on OpenSolaris. Now that our marketing folks are getting the right message out there about what we're trying to accomplish, I thought I'd follow up with a little technical background on virtualization and why we're investing in these different technologies. Keep in mind that these are my personal beliefs based on interactions with customers and other Solaris engineers. Any resemblance to a corporate strategy is purely coincidental ;-)

Before diving in, I should point out that this will be a rather broad coverage of virtualization strategies. For a more detailed comparison of Zones and Jails in particular, check out James Dickens' Zones comparison chart.

Benefits of Virtualization

First off, virtualization is here to stay. Our customers need virtualization - it dramatically reduces the cost of deploying and maintaining multiple machines and applications. The success of companies such as VMWare is proof enough that such a market exists, though we have been hearing it from our customers for a long time. What we find, however, is that customers are often confused about exactly what they're trying to accomplish, and companies try to pitch a single solution to virtualization problems without recognizing that more appropriate solutions may exist. The most common need for virtualization (as judged by our customer base) is application consolidation. Many of the larger apps have become so complex that they become a system in themselves - and often they don't play nicely with other applications on the box. So "one app per machine" has become the common paradigm. The second most common need is security, either for your application administrators or your developers. Other reasons certainly exist (rapid test environment deployment, distributed system simulation, etc), but these are the two primary ones.

So what does virtualization buy you? It's all about reducing costs, but there are really two types of cost associated with running a system:

  1. Hardware costs - This includes the cost of the machine, but also the costs associated with running that machine (power, A/C).
  2. Software management costs - This includes the cost of deploying new machines, and upgrading/patching software, and observing software behavior.

As we'll see, different virtualization strategies provide different qualities of the above savings.

Hardware virtualization

One of the most well-established forms of virtualization, the most common examples today are Sun Domains and IBM Logical Partitions. In each case, the hardware is responsible for dividing existing resources in such a way as to present multiple machines to the user. This has the advantage of requiring no software layer, no performance impact, and hardware fault isolation. The downside to this is that it requires specialized hardware that is extremely expensive, and provides zero benefit for reducing software management costs.

Software machine virtualization

This approach is probably the one most commonly associated with the term "virtualization". In this scheme, a software layer is created which allows multiple OS instances to run on the same hardware. The most commercialized versions are VMware and Virtual PC, but other projects exist (such as qemu and PearPC). Typically, they require a "host" operating system as well as multiple "guests" (although VMware ESX server runs a custom kernel as the host). While Xen uses a paravitualization technique that requires changes to the guest OS, it is still fundamentally a machine virtualization technique. And Usermode Linux takes a radically different approach, but accomplishes the basic same task.

In the end, this approach has similar strengths and weaknesses as the hardware assisted virtualization. You don't have to buy expensive special-purpose hardware, but you give up the hardware fault isolation and often sacrifice performance (Xen's approach lessens this impact, but its still visible). But most importantly, you still don't save any costs associated with software management - administering software on 10 virtual machines is just as expensive as administering 10 separate machines. And you have no visibility into what's happening within the virtual machine - you may be able to tell that Xen is consuming 50% of your CPU, but you can't tell why unless you log into the virtual system itself.

Software application virtualization

On the grand scale of virtualization, this ranks as the "least virtualized". With this approach, the operating system uses various tricks and techniques to present an alternate view of the machine. This can range from simple chroot(1), to BSD Jails, to Solaris Zones. Each of these provide a more complete OS view with varying degrees of isolation. While Zones is the most complete and the most secure, they all use the same fundamental idea of a single operating system presenting an "alternate reality" that appears to be a complete system at the application level. The upcoming Linux Application Environment on OpenSolaris will take this approach by leveraging Zones and emulating Linux at the system call layer.

The most significant downside to this approach is the fact there is a single kernel. You cannot run different operating systems (though LAE will add an interesting twist), and the "guest" environments have limited access to hardware facilities. On the other hand, this approach results in huge savings on the software management front. Because applications are still processes within the host environment, you have total visibility into what is happening within each guest, using standard operating system tools, as well as manage them as you would any other processes, using standard resource management tools. You can deploy, patch, and upgrade software from a single point without having to physically log into each machine. While not all applications will run in such a reduced environment, those that do will be able to benefit from vastly simplified software management. This approach also has the added bonus that it tends to make better use of shared resources. In Zones, for example, the most common configuration includes a shared /usr directory, so that no additional disk space is needed (and only one copy of each library needs to be resident in memory).

OpenSolaris virtualization in the future

So what does this all mean for OpenSolaris? Why are we continuing to pursue Zones, LAE, and Xen? The short answer is because "our customers want us to." And hopefully, from what's been said above, it's obvious that there is no one virtualization strategy that is correct for everyone. If you want to consolidate servers running a variety of different operating systems (including older versions of Solaris), then Xen is probably the right approach. If you want to consolidate machines running Solaris applications, then Zones is probably your best bet. If you require the ability to survive hardware faults between virtual machines, then domains is the only choice. If you want to take advantage of Solaris FMA and performance, but still want to run the latest and greatest from RedHat with support, then Xen is your option. If you have 90% of your applications on Solaris, and you're just missing that one last app, then LAE is for you. Similarly, if you have a Linux app that you want to debug with DTrace, you can leverage LAE without having to port to Solaris first.

With respect to Linux virtualization in particular, we are always going to pursue ISV certification first. No one at Sun wants you to run Oracle under LAE or Xen. Given the choice, we will always aggressively pursue ISVs to do a native port to Solaris. But we understand that there is an entire ecosystem of applications (typically in-house apps) that just won't run on Solaris x86. We want users to have a choice between virtualization options, and we want all those options to be a fundamental part of the operating system.

I hope that helps clear up the grand strategy. There will always be people who disagree with this vision, but we honestly believe we're making the best choices for our customers.


You may note, that I failed to mention cross-architecture virtualization. This is most common at the system level (like PearPC), but application-level solutions do exist (including Apple's upcoming Rosetta). This type of virtualization simply doesn't factor into our plans, yet, and still falls under the umbrella of one of the broad virtualization types.

I also apologize for any virtualization projects out there that I missed. There are undoubtedly many more, but the ones mentioned above serve to illustrate my point.

Saturday Jun 25, 2005

Fun source code facts

A while ago, for my own amusement, I went through the Solaris source base and searched for the source files with the most lines. For some unknown reason this popped in my head yesterday so I decided to try it again. Here are the top 10 longest files in OpenSolaris:

LengthSource File

You can see some of the largest files are still closed source. Note that the length of the file doesn't necessarily indicate anything about the quality of the code, it's more just idle curiosity. Knowing the quality of online journalism these days, I'm sure this will get turned into "Solaris source reveals completely unmaintable code" ...

After looking at this, I decided a much more interesting question was "which source files are the most commented?" To answer this question, I ran evey source file through a script I found that counts the number of commented lines in each file. I filtered out those files that were less than 500 lines long, and ran the results through another script to calculate the percentage of lines that were commented. Lines which have a comment along with source are considered a commented line, so some of the ratios were quite high. I filtered out those files which were mostly tables (like uwidth.c), as these comments didn't really count. I also ignored header files, because they tend to be far more commented that the implementation itself. In the end I had the following list:


Now, when I write code I tend to hover in the 20-30% comments range (my best of those in the gate is gfs.c, which with Dave's help is 44% comments). Some of the above are rather over-commented (especially snmp_sub.c, which likes to repeat comments above and within functions).

I found this little experiment interesting, but please don't base any conclusions on these results. They are for entertainment purposes only.

Technorati Tag:

Thursday Jun 23, 2005

MDB puzzle, take two

Since Bryan solved my last puzzle a little too quickly, this post will serve as a followup puzzle that may or may not be easier. All I know is that Bryan is ineligible this time around ;-)

Once again, the rules are simple. The solution must be a single line dcmd that produces precise output without any additional steps or post processing. For this puzzle, you're actually allowed two commands: one for your dcmd, and another for '::run'. For this puzzle, we'll be using the following test program:


main(int argc, char \*\*argv)
        int i;


        for (i = 0; i < 100; i++)
                write(rand() % 10, NULL, 0);

        return (0);

The puzzle itself demonstrates how conditional breakpoints can be implemented on top of existing functionality:

Stop the test program on entry to the write() system call only when the file descriptor number is 7

I thought this one would be harder than the last, but now I'm not so sure, especially once you absorb some of the finer points from the last post.

Technorati Tag:

MDB puzzle

On a lighter note, I'd thought I post an "MDB puzzle" for the truly masochistic out there. I was going to post two, but the second one was just way too hard, and I was having a hard time finding a good test case in userland. You can check out how we hope to make this better over at the MDB community. Unfortunately I don't have anything cool to give away, other than my blessing as a truly elite MDB hacker. Of course, if you get this one right I might just have to post the second one I had in mind...

The rules are simple. You can only use a single line command in 'mdb -k'. You cannot use shell escapes (!). Your answer must be precise, without requiring post-processing through some other utility. Leaders of the MDB community and their relatives are ineligible, though other Sun employees are welcome to try. And now, the puzzle:

Print out the current working directory of every process with an effective user id of 0.

Should be simple, right? Well, make sure you go home and study your MDB pipelines, because you'll need some clever tricks to get this one just right...

Technorati Tags:


Musings about Fishworks, Operating Systems, and the software that runs on them.


« July 2016