Thursday Oct 16, 2008

On stmsboot(1m)

When I started working with SAS (November 2006), our group's project (MPxIO support in mpt(7d)) was already off and running. The part that I was responsible for was booting - which meant fixing stmsboot(1m). Initially I was disappointed that I'd been given what I thought was such a small part of the problem to work on, but I quickly realised that there was a lot more to it than my first impression revealed.

Since we were under some pretty tight time pressure, I didn't really have time to do a redesign of stmsboot to make it more sustainable. The expected arrival of ZFS root meant that there was also some uncertainty about how that would tie in - nothing was nailed down, so I had to make some guesses and keep my eyes peeled for when ZFS root eventuated and then see what further changes needed to be made. We putback those changes into snv_63, and had a few followups in subsequent builds, and all seemed ok.

Then in February 2008 there was a thread on storage-discuss about how to obtain a particular device's LUN number after running devfsadm -C (or boot -r, for that matter). I looked into it and figured out that it would indeed be possible to provide that information - if you were willing to do a little digging and make use of a scsi_vhci ioctl() or two. That approach relies on hba-private data, unfortunately, so it is quite unsupportable. But it got me thinking, and I logged 6673281 stmsboot needs more clues as a placeholder.

Then a short while later I noticed that the -L or -lX options to stmsboot(1m) were now broken, as of snv_83 (nobody had worked on stmsboot(1m) since I made my changes in build 63). Since this is an essential part of the actual interface, I figured it was important enough to log (6673278 stmsboot -L/l is broken on snv_83 and later) but was unable to do much about it until I got Pluggable fwflash(1m) out of the way first. I was also annoyed to find that there were problems with updating /etc/vfstab (6691090 stmsboot -d failed to update /etc/vfstab with non-mpxio device paths). Things were not looking good, and I was watching code rot for real. Staggering!

The kicker was 6707555 stmsboot is lost in a ZFS root world, and so I knew what I had to do - redesign and rewrite stmsboot from scratch.

The Redesign

I started with 4 guiding principles:

  • require only one reboot

  • listing of MPxIO-enabled devices should be *fast*

  • minimise filesystem-dependent lookups, and

  • use libdevinfo and devlinks as much as possible.

I then looked at the overall effects that we need to achieve with the stmsboot(1m) command:

  • enable MPxIO for all MPxIO-capable devices

  • enable MPxIO for specific MPxIO-capable drivers

  • enable MPxIO for specific MPxIO-capable HBA ports

  • disable MPxIO for all MPxIO-capable devices

  • disable MPxIO for specific MPxIO-capable drivers

  • disable MPxIO for specific MPxIO-capable HBA ports

  • update MPxIO settings for all MPxIO-capable drivers

  • update MPxIO settings for specific MPxIO-capable drivers

  • list the mapping between non-MPxIO and MPxIO-enabled devices

  • list device guids, if available

What does the old code do?

The code makes use of a shell script (/usr/bin/stmsboot), a private binary (/lib/mpxio/stmsboot_util) and an SMF service (/lib/svc/method/mpxio-upgrade) which runs on reboot.

The private binary does the heavy lifting, providing a way for the shell script and SMF service to determine what a device's new MPxIO or non-MPxIO mapping is. The old private binary also walked through the device link entries in /dev/rdsk when called with the -L or -l $controller options, printing any device mappings. Finally, the private binary handles the task of re-writing /etc/vfstab.

The shell script (stmsboot) is the user interface part of the facility. Its chief task is to do editing of the driver.conf(4) files for the supported drivers (fp(7d) and mpt(7d)), and to set the eeprom bootpath variable on the x86/x64 platform if disabling or updating MPxIO configurations. (Failing to do this would prevent an x86/x64 host from booting). The shell script also makes backup copies of modified files, and creates a file with instructions on how to recover a system which has failed to boot properly after running the stmsboot script.

The SMF service is armed by the stmsboot script, and runs on reboot. It mounts /usr and root as read-write, invokes the private /lib/mpxio/stmsboot_util binary to rewrite the /etc/vfstab, updates the dump configuration and any SVM metadevice (/dev/md) device mappings, and then (in the old form) reboots the system.

What has changed

The new design makes use of a private cache of device data (stored using an nvlist) gathered from libdevinfo(3LIB) functions, and obviates the requirement for a second reboot since the vfstab rewriting function is reliable - we use the kernel's concept of what devices it has attached so we're always consistent. In addition, the new design provides a significant improvement in execution time when listing device mappings: we don't need to trawl through device links on disk but instead use libdevinfo functions and our private cache to provide the required information.

The data that we store in the cache for each device attached to an MPxIO-capable controller is

  • its devid (eg, id1,sd@n5000cca00510a7cc/aS_________________________________________3QF0EAFP/a)

  • its physical path (eg, /pci@0,0/pci10de,5c@9/pci108e,4131@1/sd@0,0)

  • its devlink path (eg, /dev/dsk/c2t0d0, which becomes c2t0d0)

  • its MPxIO-enabled devlink path (eg, /dev/rdsk/c3t500000E011637CF0d0,
    which becomes c3t500000E011637CF0d0)

  • whether MPxIO is enabled for the device in the running system (as a boolean_t B_TRUE or B_FALSE)

These are stored as nvlist properties:

#define NVL_DEVID "nvl-devid"
#define NVL_PATH "nvl-path"
#define NVL_PHYSPATH "nvl-physpath"
#define NVL_MPXPATH "nvl-mpxiopath"
#define NVL_MPXEN "nvl-mpxioenabled"

When we've found an MPxIO-capable device, we check whether it exists in our cached version, and if not, we create an nvlist containing the above properties and keyed off the device's devid. This nvlist is added to the global nvlist. In order to speed operations later, we also add some inverse mappings to the global nvlist:

devfspath -> devid
current devlink path -> devid
current MPxIO-enabled path -> devid
device physical path -> devid

This allows us to search for any of those paths and get the appropriate devid back, the nvlist of which we can then query for the desired properties.
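As a rough illustration of the forward and inverse mappings just described, here is a minimal sketch in plain C. It deliberately does not use libnvpair: the struct and function names (dev_entry_t, dev_cache_t, cache_add, cache_by_path), the fixed-size arrays and the linear searches are all assumptions made for the example, and the real stmsboot_util code may be organised quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define	MAX_DEVS	16			/* hypothetical limit */
#define	MAX_INV		(MAX_DEVS * 3)

/* One record per device; fields mirror the NVL_* properties. */
typedef struct dev_entry {
	const char *devid;	/* NVL_DEVID */
	const char *path;	/* NVL_PATH: current devlink name */
	const char *physpath;	/* NVL_PHYSPATH */
	const char *mpxpath;	/* NVL_MPXPATH */
	bool mpxen;		/* NVL_MPXEN */
} dev_entry_t;

/* Inverse mapping: some path -> devid. */
typedef struct inverse {
	const char *key;
	const char *devid;
} inverse_t;

typedef struct dev_cache {
	dev_entry_t devs[MAX_DEVS];
	size_t ndevs;
	inverse_t inv[MAX_INV];
	size_t ninv;
} dev_cache_t;

/* Forward lookup: devid -> record, or NULL if not cached. */
const dev_entry_t *
cache_by_devid(const dev_cache_t *c, const char *devid)
{
	for (size_t i = 0; i < c->ndevs; i++)
		if (strcmp(c->devs[i].devid, devid) == 0)
			return (&c->devs[i]);
	return (NULL);
}

/*
 * Add a device unless its devid is already cached. Strings are not
 * copied, so they must outlive the cache. Returns 0 on success (or if
 * already present), -1 if the cache is full.
 */
int
cache_add(dev_cache_t *c, const dev_entry_t *e)
{
	assert(c != NULL && e != NULL && e->devid != NULL);
	if (cache_by_devid(c, e->devid) != NULL)
		return (0);
	if (c->ndevs == MAX_DEVS || c->ninv + 3 > MAX_INV)
		return (-1);
	c->devs[c->ndevs++] = *e;

	const char *keys[] = { e->path, e->mpxpath, e->physpath };
	for (size_t i = 0; i < 3; i++) {
		if (keys[i] == NULL)
			continue;
		c->inv[c->ninv].key = keys[i];
		c->inv[c->ninv].devid = e->devid;
		c->ninv++;
	}
	return (0);
}

/* Inverse lookup: any known path -> devid -> record, or NULL. */
const dev_entry_t *
cache_by_path(const dev_cache_t *c, const char *path)
{
	for (size_t i = 0; i < c->ninv; i++)
		if (strcmp(c->inv[i].key, path) == 0)
			return (cache_by_devid(c, c->inv[i].devid));
	return (NULL);
}
```

With a structure along these lines, asking for either c2t0d0 or its MPxIO-enabled name lands on the same record, which is the property the inverse mappings are there to provide.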

When the mpxio-upgrade service is invoked, we need to determine the mapping for the root device in the currently running system and mount that device as read-write in order to continue with the boot process. We do this by reading the entry for root in /etc/vfstab and finding the physical path of that device in the running system. We mount /devices/$physicalpath as read-write, then re-invoke stmsboot_util to find the devlink (/dev/dsk...) path for root, which we then remount. This two-step remount is required because the devlink facility is not available to us at this early stage of the boot process (devfsadm is not running yet) - until we can determine what the root device is and mount it as read-write.

Once root and /usr have been remounted, we can then invoke stmsboot_util to re-write the vfstab. This is a fairly simple process of scanning through each line of the file and finding those which start with /dev/dsk, determining their mapping in the current system, and re-writing that line. As a safeguard, the new version of the vfstab is written to /etc/mpxio, and we let the mpxio-upgrade script take care of copying that file to /etc/vfstab. Once the vfstab has been updated, we run dumpadm, and if necessary, metadevadm. Finally, we re-generate the system's boot archive - which in fact is the longest single operation of all!
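The line-by-line rewrite can be sketched roughly as follows. This is only an illustration under stated assumptions: devmap_t, match_name and rewrite_vfstab_line are hypothetical names, the mapping table is passed in directly rather than read from the cache, and both the /dev/dsk and /dev/rdsk fields of a matching line are rewritten. It is not the code in stmsboot_util.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mapping entry: old device name -> new device name. */
typedef struct devmap {
	const char *from;	/* eg, c2t0d0 */
	const char *to;		/* eg, c3t500000E011637CF0d0 */
} devmap_t;

/*
 * Find a mapping whose 'from' name starts at s. The next character
 * must not be a digit, so that c2t0d1 does not match c2t0d10.
 */
const devmap_t *
match_name(const devmap_t *map, size_t nmap, const char *s)
{
	for (size_t i = 0; i < nmap; i++) {
		size_t n = strlen(map[i].from);
		if (strncmp(s, map[i].from, n) == 0 &&
		    !isdigit((unsigned char)s[n]))
			return (&map[i]);
	}
	return (NULL);
}

/*
 * Copy one vfstab line into out, replacing mapped device names in
 * lines that start with /dev/dsk/. Returns 1 if the line changed,
 * 0 if it was copied unchanged, -1 if out is too small.
 */
int
rewrite_vfstab_line(const char *line, const devmap_t *map, size_t nmap,
    char *out, size_t outlen)
{
	static const char *prefixes[] = { "/dev/dsk/", "/dev/rdsk/" };
	const char *p = line;
	size_t o = 0;
	int changed = 0;

	assert(line != NULL && out != NULL);
	if (strncmp(line, "/dev/dsk/", 9) != 0) {
		int n = snprintf(out, outlen, "%s", line);
		return ((n < 0 || (size_t)n >= outlen) ? -1 : 0);
	}
	while (*p != '\0') {
		const devmap_t *m = NULL;
		size_t plen = 0;

		for (size_t i = 0; i < 2 && m == NULL; i++) {
			size_t n = strlen(prefixes[i]);
			if (strncmp(p, prefixes[i], n) == 0) {
				plen = n;
				m = match_name(map, nmap, p + n);
			}
		}
		if (m != NULL) {
			/* Emit the directory prefix, then the new name. */
			int n = snprintf(out + o, outlen - o, "%.*s%s",
			    (int)plen, p, m->to);
			if (n < 0 || (size_t)n >= outlen - o)
				return (-1);
			o += (size_t)n;
			p += plen + strlen(m->from);
			changed = 1;
		} else {
			if (o + 1 >= outlen)
				return (-1);
			out[o++] = *p++;
		}
	}
	out[o] = '\0';
	return (changed);
}
```

A caller would run this over each line of /etc/vfstab and write the results to a separate file, in keeping with the safeguard described above of leaving the final copy to the mpxio-upgrade script.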

After this, we can disable the svc:/system/device/mpxio-upgrade:default service and exit.

When the mpxio-upgrade script exits, the svc:/system/filesystem/usr:default service takes over and the boot process completes normally - with the new device mappings already active and working. No second reboot required!

I'm not going to claim that the new form of stmsboot(1m) is a beautiful thing, but I do believe that the architecture and implementation that it has now are much more solid and should be easier to extend in the future if required.

Update (17 October 2008, 07:08 Australia/Brisbane): Jason's comment reminded me that I should have mentioned - I pushed these changes into build snv_99 and you can see them in the changelog.

See also links:

Solaris Express SAN Configuration and Multipathing Guide

Solaris ZFS Administration Guide

Linker and Libraries Guide


devfsadm(1m) stmsboot(1m) zfs(1m) zpool(1m)

libdevinfo(3LIB) libnvpair(3LIB) libdevid(3LIB)

fp(7d) mpt(7d) scsi_vhci(7d)


Wednesday Oct 31, 2007

Today is a very good day



Today I'm ecstatic to be able to announce that the S10 patches for our backport are finally available. We've delivered PSARC 2006/703 MPxIO extension for Serial Attached SCSI, and (my personal favourite) PSARC 2007/046 stmsboot(1M) extension for mpt(7D).

The patches that you need to install are

sparc:: 125081-10
(We recommend that on sparc you also install 127747-01, due to 6466248)


x86/x64:: 125082-10


The full list of RFEs and bugs is as follows:

6443044 add mpxio support to SAS mpt driver
6502231 stmsboot needs to support SAS devices
6544226 mpt needs mdb module

6242789 primary path comes up as standby instead online even if auto-failback is enabled
6442215 mpt.conf maybe overwritten because filetype within SUNWckr package is 'f'
6449836 stmsboot -d failed to boot if several LUNs or targets map to same partition
6510425 properties "flow_control" and "queue" in mpt.conf are useless
6525558 untagged command unlikely to be sent to HBA during heavy I/O
6541750 CAM5.1.1b2: 2530, MPT2: Vdbench bailed out after I pull ctlr-A out
6545198 build should allow architecture-dependent class action scripts
6546164 stmsboot does not remove sun4u SMF service, erroneously lists parallel SCSI HBAs
6548867 mpxio-upgrade script has fatally mis-defined variable
6550585 mpt driver has a memory leak in mpt_send_tur
6550591 mpt should not print unnecessary messages
6550849 WARNING: mpt TEST_UNIT_READY failure
6554029 mpt should get maxdevice from portfacts, not IOCfacts
6554556 stmsboot's privilege message is not quite correct
6556832 after ctlr brought online, some paths failed to come back
6560371 mpt hangs during ST2530 firmware upgrade
6566097 mpt: sd targets under mpt are not power-manageable
6566815 changes for 6502231 broke g11n in stmsboot
6531069 SCSI2 (tc_mhioctkown test cases) testing are showing UNRESOLVED results for ST2530
6546465 mpt: kernel panic due to NULL pointer reference in an error code path
6556852 mpt needs to support Sun Fire x4540 platform
6588204 mpt_check_scsi_io_error() incorrectly tests IOCStatus register
6588278 mpt driver doesn't check GUID of LUN when the path online
6591973 panic in mdi_pi_free() when remapping devices
6613189 T125082-09 and T125081-09 don't work - missing misc/scsi module from deliverables

As an interesting side note, during the development process we stumbled across

6566270 Seagate Savvio 10k1 disks do not enumerate under scsi_vhci

You'll probably see this if you have a Galaxy or T2000/T1000 system. (Unfortunately you need a service contract to view the bug report due to its category).


And on a personal note, I'd like to thank the other members of our team for working so well together - with Greg in Melbourne, Javen and Dolpher up in Beijing, test teams in Beijing, Menlo Park, Broomfield and San Diego and yours truly in Sydney (and now Brisbane) - we have truly been a virtual team. I reckon we've demonstrated that physical distance does not get in the way of designing, developing, testing and (most importantly) delivering good software that provides solutions for our customers.



Saturday Sep 22, 2007

A load off my mind

This morning I awoke and saw that the gatekeepers had given me the ok to putback our wad of changes to the Solaris 10 patch and feature gates. The gates opened less than a day ago, US/Pacific time, so I'm very happy that we were amongst the first to get in.

So after a few hours on the phone with the rest of our team, making absolutely darned sure that we had everything correct and doing a last-minute merge+test with the feature gate, I putback. I don't think I've been this nervous since the day I got married!

We expect to see patches for this backport in about a month - a coupla weeks after the close of the patch gate for this build. I'm not sure what the patch IDs will be; when I find out I'll mention them here.

The PSARC fasttracks we integrated are

PSARC 2006/703 MPxIO extension for Serial Attached SCSI (SAS) on mpt(7D)
PSARC 2007/046 stmsboot(1M) extension for mpt(7D)

And the primary bugids are

6443044 add mpxio support to SAS mpt driver
6502231 stmsboot needs to support SAS devices
6544226 mpt needs mdb module

( won't show the updates until tomorrow).

Our little project team is very happy and despite the fact that we're in Beijing, Brisbane and Melbourne I think we might all go off to a pub to celebrate.


I work at Oracle in the Solaris group. The opinions expressed here are entirely my own, and neither Oracle nor any other party necessarily agrees with them.

