By jmcp-Oracle on Oct 16, 2008
When I started working with SAS (November 2006), our group's project (MPxIO support in mpt(7d)) was already off and running. The part that I was responsible for was booting - which meant fixing stmsboot(1m). Initially I was disappointed that I'd been given what I thought was such a small part of the problem to work on, but I quickly realised that there was a lot more to it than my first impression revealed.
Since we were under some pretty tight time pressure, I didn't really have time to do a redesign of stmsboot to make it more sustainable. The expected arrival of ZFS root meant that there was also some uncertainty about how that would tie in - nothing was nailed down, so I had to make some guesses and keep my eyes peeled for when ZFS root eventuated and then see what further changes needed to be made. We putback those changes into snv_63, and had a few followups in subsequent builds, and all seemed ok.
Then in February 2008 there was a thread on storage-discuss about how to obtain a particular device's lun number after running devfsadm -C (or boot -r, for that matter). I did a little digging and figured out that it would indeed be possible to provide that information - if you were willing to do a little digging and make use of a scsi_vhci ioctl() or two. Using hba-private data, unfortunately, so quite unsupportable. But it got me thinking, and I logged 6673281 stmsboot needs more clues as a placeholder.
Then a short while later I noticed that the -L or -lX options to stmsboot(1m) were now broken, as of snv_83 (nobody had worked on stmsboot(1m) since I made my changes in build 63). Since this is an essential part of the actual interface, I figured it was important enough to log (6673278 stmsboot -L/l is broken on snv_83 and later) but was unable to do much about it until I got Pluggable fwflash(1m) out of the way first. I was also annoyed to find that there were problems with updating /etc/vfstab, too (6691090 stmsboot -d failed to update /etc/vfstab with non-mpxio device paths... things were not looking good, and I was watching code rot for real. Staggering!
The kicker was (6707555 stmsboot is lost in a ZFS root world, and so I knew what I had to do - redesign and rewrite stmsboot from scratch.
I started with 4 guiding principles:
- require only one reboot
- listing of MPxIO-enabled devices should be \*fast\*
- minimise filesystem-dependent lookups, and
- use libdevinfo and devlinks as much as possible.
I then looked at the overall effects that we need to achieve with the stmsboot(1m) command:
- enable MPxIO for all MPxIO-capable devices
- enable MPxIO for specific MPxIO-capable drivers
- enable MPxIO for specific MPxIO-capable HBA ports
- disable MPxIO for all MPxIO-capable devices
- disable MPxIO for specific MPxIO-capable drivers
- disable MPxIO for specific MPxIO-capable HBA ports
- update MPxIO settings for all MPxIO-capable drivers
- update MPxIO settings for specific MPxIO-capable drivers
- list the mapping between non-MPxIO and MPxIO-enabled devices
- list device guids, if available
What does the old code do?
The code makes use of a shell script (/usr/bin/stmsboot), a private binary (/lib/mpxio/stmsboot_util) and an SMF service (/lib/svc/method/mpxio-upgrade) which runs on reboot.
The private binary does the heavy lifting, providing a way for the shell script and SMF service to determine what a device's new MPxIO or non-MPxIO mapping is. The old private binary also walked through the device link entries in /dev/rdsk when called with the -L or -l $controller options, printing any device mappings. Finally, the private binary handles the task of re-writing /etc/vfstab.
The shell script (stmsboot) is the user interface part of the facility. Its chief task is to do editing of the driver.conf(4) files for the supported drivers (fp(7d) and mpt(7d)), and to set the eeprom bootpath variable on the x86/x64 platform if disabling or updating MPxIO configurations. (Failing to do this would prevent an x86/x64 host from booting). The shell script also makes backup copies of modified files, and creates a file with instructions on how to recover a system which has failed to boot properly after running the stmsboot script.
The SMF service is armed by the stmsboot script, and runs on reboot. It mounts /usr and root as read-write, invokes the private /lib/mpxio/stmsboot_util binary to rewrite the /etc/vfstab, updates the dump configuration and any SVM metadevice (/dev/md) device mappings, and then (in the old form) reboots the system.
What has changed
The new design makes use of a private cache of device data (stored using an nvlist) gathered from libdevinfo(3LIB) functions, and obviates the requirement for a second reboot since the vfstab rewriting function is reliable - we use the kernel's concept of what devices it has attached so we're always consistent. In addition, the new design provides a significant improvement in execution time when listing device mappings: we don't need to trawl through device links on disk but instead use libdevinfo functions and our private cache to provide the required information.
The data that we store in the cache for each device attached to an MPxIO-capable controller is
- its devid (eg, id1,sd@n5000cca00510a7cc/aS_________________________________________3QF0EAFP/a
- its physical path (eg, /pci@0,0/pci10de,5c@9/pci108e,4131@1/sd@0,0)
- its devlink path (eg, /dev/dsk/c2t0d0, which becomes c2t0d0)
- its MPxIO-enabled devlink path (eg, /dev/rdsk/c3t500000E011637CF0d0,
which becomes c3t500000E011637CF0d0)
- whether MPxIO is enabled for the device in the running system (as a boolean_t B_TRUE or B_FALSE)
These are stored as nvlist properties:
#define NVL_DEVID "nvl-devid"
#define NVL_PATH "nvl-path"
#define NVL_PHYSPATH "nvl-physpath"
#define NVL_MPXPATH "nvl-mpxiopath"
#define NVL_MPXEN "nvl-mpxioenabled"
When we've found an MPxIO-capable device, we check whether it exists in our cached version, and if not, we create an nvlist containing the above properties and keyed off the device's devid. This nvlist is added to the global nvlist. In order to speed operations later, we also add some inverse mappings to the global nvlist:
devfspath -> devid
current devlink path -> devid
current MPxIO-enabled path -> devid
device physical path -> devid
This allows us to search for any of those paths and get the appropriate devid back, the nvlist of which we can then query for the desired properties.
When the mpxio-upgrade service is invoked, we need to determine the mapping for the root device in the currently running system and mount that device as read-write in order to continue with the boot process. We do this by reading the entry for root in /etc/vfstab and finding the physical path of that device in
the running system. We mount /devices/$physicalpath as read-write, then re-invoke stmsboot_util to find the devlink (/dev/dsk...) path for root, which we then remount. This two-remount option is required because the devlink facility is not available to us at this early stage of the boot process devfsadm is not running yet) - until we can determine what the root device is and mount it as read-write.
Once root and /usr have been remounted, we can then invoke stmsboot_util to re-write the vfstab. This is a fairly simple process of scanning through each line of the file and finding those which start with /dev/dsk, determining their mapping in the current system, and re-writing that line. As a safeguard, the new version of the vfstab is written to /etc/mpxio, and we let the mpxio-upgrade script take care of copying that file to /etc/vfstab. Once the vfstab has been updated, we run dumpadm, and if necessary, metadevadm. Finally, we re-generate the system's boot archive - which in fact is the longest single operation of all!
After this, we can disable the svc:/system/device/mpxio-upgrade:default service and exit.
When the mpxio-upgrade script exits, the svc:/system/filesystem/usr:default service takes over and the boot process completes normally - with the new device mappings already active and working. No second reboot required!
I'm not going to claim that the new form of stmsboot(1m) is a beautiful thing, but I do believe that the architecture and implementation that it has now are much more solid and should be easier to extend in the future if required.