Friday Mar 18, 2011

Documentation

The material here, to the extent that it's still relevant (and much of it is), should be migrated into documentation. For the moment, I'm posting an update to avoid having what's here disappear as a matter of process.

Thursday Jun 04, 2009

Getting with the program in roughly an hour.

Note: This is mostly useful to developers with existing nevada based build machines.

This morning I came in to an e-mail describing a couple ways in which live-upgrade was blowing up on our (nevada based) primary x86 build machine. This wasn't really news to me, but it was now stuck on bits that pre-dated some build machine flag days and really did need to be moved forward. We could either work around all the lu issues and stay on the pre ips nevada train, or get the same bits via ips by moving to opensolaris (the distro, we're already running the OS and none of this changes the source base).

So, since the system was idle, I formed a quick plan to move to opensolaris and less than an hour and 15 minutes later it was successfully producing full nightly builds. This included a couple of trips up and down three flights of stairs and is dramatically less time than a live-upgrade would have consumed.

Here's what I did:

I knew that the system had graphical console redirection, so I could just use the live-cd. I didn't want to deal with the virtual CD/DVD drive over the 100Mbit link to my office if I could help it, so I kicked off a DVD burner to produce a 2009.06 111b CD from an image I had downloaded a while back.

While it was burning I halted the system and ran downstairs to check if it had a physical CD/DVD drive. It did. I also swapped one of the (zfs mirrored) root drives for a blank one so I could fall back or reference it if worse came to worse. I also disconnected the JBOD with the build workspaces, mostly to make things probe faster as I'd be rebooting the system a couple of times, but also to keep my data safely away from my brain that might get one of the controller numbers mixed up.

Back in my office the DVD was burned (it's CD sized, but I'm using DVD+RW media). I ran downstairs with it, put it in the system and then back up-stairs to my office, where I grabbed the console redirection. Once the live-cd was on-line I pointed the installer at one of the disks and got it started. I then paused it to mirror and enable compression on the root pool, more or less:

pstop `pgrep cpio`
zfs set compression=on rpool/ROOT/opensolaris
# used format to verify that the disks were the same size and
# that they were using whole disk Solaris2 fdisk partitions with
# s0 spanning all but the first
zpool attach -f rpool s0 s0
# run zpool status and wait roughly a minute for resilver to finish
prun `pgrep cpio`
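
The "run zpool status and wait" step could be scripted rather than watched. A minimal sketch, assuming that zpool status reports an ongoing resilver with the phrase "resilver in progress"; the helper name resilver_running is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` output on stdin and succeeds
# (exit 0) while that output still reports a resilver in progress.
resilver_running() {
        grep "resilver in progress" > /dev/null
}

# Intended use on the live system (left as a comment, not run here):
#   while zpool status rpool | resilver_running; do sleep 10; done
#   prun `pgrep cpio`
```

Taking the status text on stdin keeps the check separate from the pool name, so it can be tried against saved output first.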

Then rather than rebooting the system at the end, I halted it. I ran back downstairs and removed the CD and re-connected the JBOD. Back in my office I booted -s and ran sys-unconfig. Note: I didn't think of it at the time, but before running sys-unconfig might have been the right time to edit /rpool/boot/grub/menu.lst to remove all the splashimage loads and console=graphics and set console to ttya as it would have saved a later reboot.

The sys-unconfig rebooted the system. It came up in sysid, which I ran through, in essence setting things to english/C, PST, static IP w/ipv6, nisdomain mpklab.sfbay.sun.com and NFSv4 domain sun.com. The system then rebooted again.

Now I could log in remotely. To get the bits needed to run a build, I connected the system to the internal dev (not required, but desirable) and extras repositories by doing the following:

pkg set-publisher -P -O http://ipkg.sfbay/dev/ opensolaris.org
pkg set-publisher -O http://ipkg.sfbay/extra/ extra
pkg install osnet

I also had to:

cd /opt
ln -s /ws/onnv-tools/teamware
ln -s /ws/onnv-tools/SUNWspro/SS12 SUNWspro

To get the build workspaces back on-line from the zpool on the JBOD, I simply ran:

zpool import -f builds

Since things like whether or not a filesystem is shared are set as fs properties with zfs, that configuration was all preserved. Neat.

Finally I edited the GRUB menu as described above and ran a pkg image-update to get any bits that weren't up to date on the CD (not much at this point) and did a final reboot.
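
The GRUB menu edit mentioned above (dropping the splashimage loads and swapping console=graphics for console=ttya) can be done with a small filter. This is a sketch rather than what I actually typed; the function name serial_console_menu is hypothetical, and the [[:space:]] class assumes a POSIX sed such as /usr/xpg4/bin/sed:

```shell
#!/bin/sh
# Hypothetical filter: reads a menu.lst on stdin and writes a copy with the
# splashimage lines dropped and console=graphics replaced by console=ttya.
serial_console_menu() {
        sed -e '/^[[:space:]]*splashimage/d' \
            -e 's/console=graphics/console=ttya/g'
}

# Intended use (left as a comment, not run here):
#   cp /rpool/boot/grub/menu.lst /rpool/boot/grub/menu.lst.orig
#   serial_console_menu < /rpool/boot/grub/menu.lst.orig \
#       > /rpool/boot/grub/menu.lst
```

Keeping the .orig copy around means the graphical entries can be restored if the serial console turns out to be the wrong choice.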

Wednesday Sep 27, 2006

Update: Solaris on x86 Macs

Background:

While I'm clearly a Solaris stake-holder, this is a spare time project for me as much as it is for the external contributors that have helped get this as far as it has. FWIW, I have not managed to spend more than a couple of hours on this since I made the last post.

If any of you want this to move along faster, please do not hesitate to grab the code and contribute. On that note: Many, many thanks to those that have!

That said, I do want to share some updates.

What's required:

The bits required to get Solaris working on the x86 Macs are available in current Solaris Express / OpenSolaris releases. These bits can be downloaded today.

The easy way:

A lot of folks are running Solaris under Parallels, which simply works. Note: The mouse is emulated as a ps/2 device, but there seems to be an auto-detection problem, so just set it explicitly.

The cheap way:

If you want to run Solaris using the beta bootcamp BIOS emulation the following applies:

The most disabling bugs have been fixed, but you will still need the workaround for 6413240. I just noticed that that bug is not published since it refers to source that is still in legal review on its way to becoming open. So I've made the complete workaround available in 6413235. At this moment that bug update has not been pushed out yet, so please keep checking back.

The other issue is the warning called out in the workaround. Until we can root cause and fix what's causing these systems to go into the hosed state some of the time, we really don't want the bits to do that to people's machines out of the box. Clearly this behavior can be managed by ensuring that a full back-up of any user data and configuration on the OSX portion of the disk exists, but given that as a hurdle, applying the workaround is hardly a big deal and serves as a safety check.

The bleeding edge:

We're also slowly (1 person in their spare time) working on the beginnings of EFI support as there's a possibility that the x86 market will go in that direction. The proof of concept is working, but things like the lack of UGA console support mean that this will remain marginal for now. That code is covered by 6475349.

Thursday Apr 13, 2006

Solaris on the iMac

Solaris Nevada build 36 running on an iMac

We now have Solaris Nevada build 36 installed and running on a bootcamp equipped iMac.

The most significant hurdles were conquered when Juergen Keil enabled both multiboot and GRUB to work in an i8042-free system. That work is covered by 6412224 and 6412226.

The next hurdle is that the Solaris fdisk doesn't interact all that well with the bootcamp prepared disk. This is 6413235. The reason this becomes a hurdle is that install re-writes the fdisk table even when we can simply use the existing one. That is 6413240.

This is all using the bootcamp bits which make the Macs look like just another (BIOS bugs and all) x86 system. Bringing Solaris up in native (EFI) mode is still the real challenge.


Sunday Mar 12, 2006

Faster seamless boot progress graphics

In the tradition of bigger (smaller?), better, faster, more, "console=graphics" has, as of Solaris Nevada build 34 (Solaris Express snv_34 or higher) and Solaris 10 Update 2, become historical in favor of loading the boot progress image faster/earlier. It is now loaded by GRUB and simply left on the screen. The kernel (multiboot to be precise) then detects the video mode (unless the system is on a serial console) and if it is the right graphics mode, the progress indicator is drawn and updated.

None of this addresses the fundamental issue of not being able to return to text mode, so graphical boot progress remains unsupported and undocumented for now. So as long as you're willing to accept that, you can remain blissfully ignorant of all the pesky messages produced during boot by following these steps:

  1. If you used bfu (rather than upgraded), you should install the new GRUB bits in the bootblock by either running installgrub(1m) by hand or running, as root on your system, update_grub.
  2. To put up the image add a splashimage command to your Solaris entry. You'll also want to set the foreground and background correctly. For example:
     
    title Solaris Nevada (happyface) 
      root (hd0,1,a) 
      splashimage /boot/solaris.xpm 
      foreground b2bc00 
      background 35556b 
      kernel /platform/i86pc/multiboot 
      module /platform/i86pc/boot_archive 
    
    Eventually bootadm will learn how to manage such entries. For now you can run add_happy_face_entries (as root on your system) and it will create the appropriate entries.
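
For anyone who would rather generate such an entry than type it, here is a minimal sketch. It is not the add_happy_face_entries script; the function name happy_face_entry is made up, and the colors and paths are simply copied from the example above:

```shell
#!/bin/sh
# Hypothetical generator: prints a GRUB menu entry with the splash image
# settings from the example above. $1 is the title, $2 the GRUB root device.
happy_face_entry() {
        cat <<EOF
title $1
  root $2
  splashimage /boot/solaris.xpm
  foreground b2bc00
  background 35556b
  kernel /platform/i86pc/multiboot
  module /platform/i86pc/boot_archive
EOF
}

happy_face_entry "Solaris Nevada (happyface)" "(hd0,1,a)"
```

The output can be reviewed and then appended to menu.lst by hand.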

In addition to the technical changes, the boot progress graphics have been re-designed by an artist as opposed to an engineer. This resulted in giving them not only a far more polished look, but also integrating them with the other boot-time graphic, the GRUB background, and the (dtlogin / gdm) login screens which will appear in Solaris 10 Update 2 as well as the Nevada builds.

Thursday Jan 05, 2006

Happiness through Ignorance

Background:

What sort of messages an OS can or should emit during boot has been (and will likely continue to be) the subject of much debate (which I'm not going to go into here). One extreme is a message for everything (note that the definition of "a thing" is very vague). The other extreme is no messages unless something truly fatal has occurred. In current Solaris bits (s10u1 and current nv bits) there's an unsupported/undocumented mode that goes one step further: It doesn't even bother the user when something truly fatal has occurred.

The reason for this is not so much that we like to see how far we can push things, but rather that the implementation is still incomplete. As such it's not only unsupported, but in fact unsupportable, as there is no code in place that will get the console back into text mode in case of a sufficiently fatal event. Such an event may be anything that requires sulogin (single user login) to run in order to repair the system, or worse, a system panic.

What it does do is display a pretty picture (which could be prettier¹) while the system boots (and hopefully nothing goes wrong), up until the X-server starts taking over the console.

Warnings and disclaimers:

The current code is quite primitive in that it assumes a VGA device. There are graphics cards out there that don't actually support VGA, so this mode should not be enabled on a system with such a card. Doing so anyway will likely lead to a hang and no pretty picture.

So, if the previous text was too vague: Please enable this only at your own risk and don't call anyone for support without disabling it first. Also, please consider the purpose of the system you enable this on. While this might be a cute thing to use on one's laptop, it's a seriously poor idea on a system with any sort of RAS requirements.

Now, here's how you flip it on:

Graphics boot mode is controlled by the console property. It can either be set via eeprom(1M) in bootenv.rc or on the fly on the grub kernel command line. The latter is probably the better idea because you can just make another menu entry and power-cycle and fall back to the non-graphics one should things decide to hang. Here's an example of such menu entries (they live in /boot/grub/menu.lst):

 
title Solaris 11 nv_15 X86 
        root (hd0,1,a) 
        kernel /platform/i86pc/multiboot 
        module /platform/i86pc/boot_archive 
 
title Solaris happy face 
        root (hd0,1,a) 
        kernel /platform/i86pc/multiboot -B console=graphics 
        module /platform/i86pc/boot_archive 

If eeprom was used to set the console property to "graphics", it can be overridden by specifying it on-the-fly on the GRUB kernel line (by hitting 'e' and editing the entry) and setting it to "text" like so:

 
        kernel /platform/i86pc/multiboot -B console=text 

Adding your own pretty picture:

The graphics file that's loaded is /boot/solaris.xpm and needs to be 15 colors 640x480 - error checking on the image file is minimal, so make sure you get the image right or things will look ugly.

1 Since this is still "engineering only / experimental code" the image had no help from any of our graphic artists. Such folks have since shown me that they can easily outperform me when it comes to making an image look good in high rez VGA mode 640x480x4 (4-bit, 16 colors).

Monday Nov 14, 2005

Disabling toxic drivers

Unfortunately, not all code is perfect. Even drivers (particularly ones under development) can on occasion have problems that may wedge a specific system during boot. Back in the days of the DCA (provided it wasn't the real-mode driver that was hanging), one could edit the fake prom-tree and keep such drivers from binding.

Equivalent functionality does exist post new-boot. It is, however, not officially documented yet, as the syntax may still change a little: there is currently no way to disable only a specific instance. It currently (snv_28, s10u1_18) uses the generic -B property passing mechanism. All instances of a driver are disabled by setting the property named disable- followed by the driver name (for example disable-sd) to true. The -B option is just another kernel option, and is easily set at boot time by typing 'e' in the GRUB menu, selecting the kernel line, hitting 'e' again, adding whatever options to the end of the line, and then hitting enter and 'b' to boot.

So to disable both sd and usbms, one would boot with a kernel line like this:

 
kernel /platform/i86pc/multiboot -B disable-sd=true,disable-usbms=true 

This will cause no instances of either driver to bind. This gives the system a chance to boot, allowing a less buggy driver to be installed, or toxic features to be disabled.
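
Since the property list is just comma separated name=value pairs, it is easy to build from a list of driver names. A minimal sketch; the helper name disable_drivers_arg is made up for illustration and is not part of Solaris:

```shell
#!/bin/sh
# Hypothetical helper: turns a list of driver names into the -B argument
# described above, e.g. "sd usbms" -> "-B disable-sd=true,disable-usbms=true".
disable_drivers_arg() {
        props=""
        for drv in "$@"; do
                # Add a comma only between entries, not before the first one.
                props="${props:+$props,}disable-$drv=true"
        done
        if [ -n "$props" ]; then
                echo "-B $props"
        fi
}

echo "kernel /platform/i86pc/multiboot `disable_drivers_arg sd usbms`"
```

With no driver names the helper prints nothing, so the kernel line stays as it was.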

Monday Oct 17, 2005

Securing the GRUB Menu

Background and problem:

Since the in-memory miniroot compressed binary glob is only around 46Mb in size we started copying it to our desktop and test machines as a sort of failsafe secondary system image that could be booted in case we hosed the root image. This allowed us to develop at a far more reckless pace than before, as recovering a system had in the past involved configuring and then performing a netboot. Having the failsafe archive allows me to bfu untested bits onto my laptop without having another system near-by or becoming nervous.

In the end it turns out that we liked having this stand-alone image on our test machines enough to make it part of the product. It contributes very directly to the serviceability of the system if something should go wrong.

However, much in the same way that a CD or network boot can be used to mount a root filesystem, so can the failsafe archive. This of course (amongst other things) allows the etc/shadow file on that filesystem to be accessed and/or modified. This suggests that the failsafe menu entry needs to be secured in the same environments that require securing the system's firmware to avoid unauthorized removable media boots. In addition to that, the grub command line also offers full filesystem access, so it needs to be secured as well.

Solution:

Luckily this has all already occurred to the fine folks who've been working on GRUB. Details can be found in the security section of the GRUB documentation.

On Solaris the grub shell is currently being delivered as /boot/grub/bin/grub (Note: while that location is unstable at this point, it is unlikely to change until we have some sort of bootadm(1m) integration).

So to protect the failsafe entry as well as the command line by a password of my choosing I can do the following:

clubsix:~> /boot/grub/bin/grub

    GNU GRUB  version 0.95  (640K lower / 3072K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename. ]

grub> md5crypt 

Password: ********
Encrypted: $1$1g5y41$HX2EHzvJIUbBda0m/txBq.

grub> Control-C

Then I edit /boot/grub/menu.lst and add a password line like:

password --md5 $1$1g5y41$HX2EHzvJIUbBda0m/txBq.

I also add the lock command to the failsafe entry:

title Solaris failsafe
lock
root (hd0,1,a)
kernel /boot/multiboot -B console=ttyb kernel/unix -s
module /boot/x86.miniroot-safe

Now when the system is rebooted and drops into GRUB, only the default Solaris entry (there are no other entries on this system) can be booted. Both the failsafe entry and access to the command line require hitting 'p' and entering the password. Any other entry that might compromise system security (for instance a service partition) should also be locked in the same manner.

While not ideal, such a hashed password could be deployed via jumpstart finish scripts.
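
A finish script for this could look roughly like the following. This is a sketch under assumptions: the function name secure_menu is hypothetical, the password line is placed ahead of the first title (the global section, per the GRUB documentation), and the filter is not idempotent, so it should only be run once against a fresh menu.lst:

```shell
#!/bin/sh
# Hypothetical helper for a finish script: reads a menu.lst on stdin and
# writes it to stdout with a "password --md5" line in front and a "lock"
# line after every title that mentions "failsafe".
# $1 is the hash produced by md5crypt in the grub shell.
secure_menu() {
        printf '%s\n' "password --md5 $1"
        awk '{ print }
             /^title/ && /failsafe/ { print "lock" }'
}

# Intended use (left as a comment, not run here). /a as the mount point of
# the freshly installed root is an assumption about the jumpstart setup:
#   secure_menu '$1$1g5y41$HX2EHzvJIUbBda0m/txBq.' \
#       < /a/boot/grub/menu.lst > /tmp/menu.lst.new &&
#       cp /tmp/menu.lst.new /a/boot/grub/menu.lst
```

The hash goes in single quotes on the command line so the shell leaves its dollar signs alone.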

Conclusion / Better Solution:

In an ideal world bootadm(1M) would implement an abstraction that allows the password to be set or manipulated. In an even more ideal world setting it would be tightly integrated with selecting a security model for the system via install and jumpstart.

Wednesday Sep 28, 2005

Post new-boot Solaris on 128Mb systems

Background:

New-boot introduced the install-miniroot ramdisk. This means the entire miniroot for CD/DVD based installs as well as for network installs is mounted on an in memory ramdisk.

The entire x86 miniroot is 273Mb. Since 256Mb systems are still interesting, something had to give. We do not run the X server or the GUI installer on low(er) memory systems. So one obvious thing to pull out for lower memory systems is X. Dropping /usr/openwin and /usr/dt gains back 137Mb. They are stored in a cpio archive on the media and unpacked into tmpfs on systems with sufficient memory. The packaging databases, which in a running miniroot are obscured by the /tmp tmpfs mount, are nearly another 10Mb. They are needed for patchadd -C or pkgadd -R, so root_archive (un)packmedia archives and restores them. This brings the miniroot down to 126Mb. Solaris and the text mode installer officially require 64Mb to run. The memory occupied by the ramdisk needs to be available in addition to this 64Mb. This means it is possible to install Solaris on a system with 192Mb of memory. Please note that I used the word possible. The minimum memory requirement for a supportable system is currently 256Mb which leaves some room for things to grow.
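
The arithmetic in the paragraph above, spelled out. All numbers are the ones quoted in the text (with "nearly another 10Mb" rounded to 10); the variable names are just labels for this sketch:

```shell
#!/bin/sh
# All sizes in Mb, taken from the text above.
full_miniroot=273
x_bits=137        # /usr/openwin and /usr/dt
pkg_db=10         # packaging databases ("nearly" 10)
installer_ram=64  # Solaris plus the text mode installer

small_miniroot=`expr $full_miniroot - $x_bits - $pkg_db`
needed=`expr $small_miniroot + $installer_ram`

echo "miniroot: ${small_miniroot}Mb, memory needed: ${needed}Mb"
```

The memory needed comes out just under 192Mb, which is why 192Mb is possible while 256Mb is the supportable minimum.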

Systems with less memory can still be live-upgraded. However since there is no way to boot the miniroot any more their serviceability is compromised and they fall off the support matrix.

There may be some situation in which it might seem desirable to run Solaris on systems with less memory than that. Most of these tend to fall into one of three categories:

  1. Embedded systems.
    While these seem to keep coming up in this context, no one really runs the installers on embedded systems. Their root (and usr) filesystems are generally constructed on another system and then deployed on the assembly line. So the miniroot install time restrictions do not really apply here.
  2. Previously retired systems that have been pulled out of a closet to try out Solaris 10 or investigate something specific.
    In most of those cases, further digging in said closet will reveal another system that the rest of the required memory can be pilfered from. A quick look at one of the mail order price checkers suggests that 128Mb of pc133 SDRAM can be had for as little as $7 (likely around $20 incl. shipping from a reputable retailer), so that's a fairly cheap option (RDRAM is a little more, but still an option).
  3. Large existing deployments of desktop clients.
    In nearly all of those cases more memory should be fitted if they are to be moved to Solaris 10. The biggest issue in such a case is likely the fitting of the memory and not the cost of the memory.

Then there's another category which I'll call: "Because I want to."

That last category will have to go to more effort. The most reasonable approach is to further minimize the miniroot.

The hard way to run such a machine:

Disclaimer: The following makes extensive use of rm in ways that violate package boundaries. This may break over time as things are added to the install process (for instance in postinstall scripts) that consume any of these items. It may not work tomorrow, YMMV and it most certainly voids any warranty or support agreement you might have.

First the miniroot needs to be unpacked. For details take a look at root_archive(1m). Note: since we need to hold onto ownership and permissions this is a root only operation:

/boot/solaris/bin/root_archive unpack /cdrom/cdrom0/boot/x86.miniroot \
    root-small

This will result in a copy of the miniroot under root-small. You'll find it's roughly 126Mb in size.

I was hoping that just deleting the non-C locales would be sufficient. However in the end quite a bit more had to go.

Here is a rough script that removes things I hope are not used during a basic run through ttinstall:

#!/bin/sh

CWD=`pwd`

if [ "$CWD" = / ] ; then
        echo You probably do not want to run this in /.
        echo Please cd into the top level of an unpacked miniroot to run this.
        exit 1
fi

if [ ! -f sbin/install-discovery ] ; then
        echo This does not look like an unpacked install miniroot.
        echo Please cd into the top level of an unpacked miniroot to run this.
        exit 1
fi


# The first thing to go is all the non C locales (if you don't speak C,
# you should learn it before proceeding any further):
#
for i in usr/lib/locale usr/snadm/classes/locale usr/lib/install/data/lib/locale
do 
        mv $i/C .
        rm -rf $i/*
        mv C $i
done

# The iconv modules also do not appear to be used unless the system is in
# a non C locale, so they go:
#
rm -rf usr/lib/iconv

# Since there is not enough memory to run the java installers, they
# can be deleted.
#
# On second thought usr/lib/install/data/wizards/apps/launcher.class
# and postinstall.class are copied to the installed system and it will
# be DOA without them, so better hang on to them. For the record I
# learned this the hard way; through reckless experimentation.
#
mv usr/lib/install/data/wizards/apps/launcher.class .
mv usr/lib/install/data/wizards/apps/postinstall.class .
rm -rf usr/lib/install/data/wizards/apps/*
mv launcher.class postinstall.class usr/lib/install/data/wizards/apps

# Perl is only on the miniroot so folks can write their finish scripts
# in perl. So unless such a finish script is to be used, it can go:
#
rm -rf usr/perl5

# most network services will never be run in the miniroot - remote
# access can be useful for debugging, but we really don't have enough
# memory to run another shell, let alone telnetd.
#
rm usr/sbin/in.comsat usr/sbin/in.fingerd usr/sbin/in.rexecd
rm usr/sbin/in.rlogind usr/sbin/in.rshd usr/sbin/in.rwhod usr/sbin/in.talkd
rm usr/sbin/in.tftpd usr/sbin/in.telnetd 
rm -r usr/lib/gss usr/lib/mps

# Since Solaris is perfect, it will not need any debugging, so we can
# ditch the debugger:
#
find kernel platform usr -name \*mdb\* | xargs rm -r 2> /dev/null

# Other random stuff we should not need for a basic run through ttinstall:
#
rm -r usr/lib/patch usr/lib/spell usr/lib/lp

# while open source is nifty and all, I've never compiled my own zoneinfo's
#
rm -r usr/share/lib/zoneinfo/src

rm kernel/fs/cachefs kernel/fs/autofs
rm -r usr/lib/fs/cachefs
rm kernel/misc/nfssrv 

rm usr/sbin/rpc.*

rm usr/bin/localedef usr/bin/genmsg
rm usr/bin/mailx usr/bin/pfksh usr/bin/pfcsh usr/bin/rlogin usr/bin/rcp
rm usr/bin/rdist usr/bin/rksh usr/bin/rsh usr/bin/telnet usr/bin/tip

# note: ping _is_ needed
#
rm usr/sbin/snoop usr/sbin/traceroute

# We don't use ipv6 during install:
#
rm usr/lib/inet/in.*

# No dr:
#
rm -r usr/*/cfgadm

# IB support is large on disk as well as in memory, so away it goes:
#
rm kernel/misc/ibmf
rm kernel/misc/ibcm kernel/misc/ibdm
rm kernel/drv/tavor

# ipsec is cool, but we're not using secure tunnels at install-time (yet):
#
rm usr/lib/libike.so.1 usr/sbin/ikeadm usr/sbin/ikecert 

# Most of grub is not needed in the miniroot:
#
mv boot/grub/stage? .
rm -r boot/grub/*
mv stage?  boot/grub
rm boot/solaris.xpm

# audio support and drivers aren't needed
#
find kernel -name \*audio\* | xargs rm
rm platform/i86pc/kernel/drv/sbpro

# how would one use a smartcard during install?
#
find . -name \*smartcard\* | xargs rm -r 2> /dev/null

# not all libraries and binaries on the miniroot are stripped
# - letting strip sort the binaries from other stuff is crude, but
#   seems safe enough.
#
find . -type f | xargs strip 2> /dev/null

# this likely breaks disk space relocation, but it's another 400k
#
rm usr/lib/fs/ufs/ufs*

# ditch the optimized copies of libc
#
rm -r usr/lib/libc

This takes the miniroot down to 71Mb. Due to ufs overhead, this works out to about 78Mb in memory. Some of the bits that have been deleted might be used by postinstall scripts. However my testing produced no errors in the install logs. I have not attempted an upgrade.
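
The script repeats one pattern several times: move the few things worth keeping aside, delete the rest of the directory, and move them back. A minimal sketch of that pattern as a function; the name keep_only is hypothetical, and unlike the script it bails out before deleting anything if one of the files to keep is missing:

```shell
#!/bin/sh
# Hypothetical helper: empties directory $1 except for the entries named in
# the remaining arguments, using the same move aside / delete / move back
# steps as the script above.
keep_only() {
        [ $# -ge 2 ] || return 1
        dir=$1
        shift
        # Stash next to the directory so the moves stay on one filesystem.
        stash=$dir.keep.$$
        mkdir "$stash" || return 1
        for name in "$@"; do
                mv "$dir/$name" "$stash/" || return 1
        done
        rm -rf "$dir"/*
        mv "$stash"/* "$dir/"
        rmdir "$stash"
}

# Intended use inside an unpacked miniroot (left as a comment, not run here):
#   keep_only usr/lib/locale C
#   keep_only usr/lib/install/data/wizards/apps launcher.class postinstall.class
```

Like the original, this leaves dot files in the directory alone, since the shell glob does not match them.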

I've left all the stuff needed for netinstalls and mirrored root upgrade in place. I haven't tested mirrored root upgrade, but I suspect that md, libsvm and friends will need more memory than a 128Mb system has to spare to run.

Lastly the miniroot needs to be re-packed into its compressed ramdisk image:

/boot/solaris/bin/root_archive pack x86.miniroot.small root-small

This miniroot can then be copied to a netimage or CD under < relative>/boot. If you want to make a CD, the mkisofs incantation I currently use is:

mkisofs -o hacked_Solaris.iso \
    -b boot/grub/stage2_eltorito -c .catalog -no-emul-boot \
    -boot-load-size 4 -boot-info-table -relaxed-filenames -N -L -l \
    -r -J -d -D -V hacked_Solaris 

The 128Mb system I used to test was a fairly plain pIII system with only IDE (CDROM, hard disk and ZIP drive) and an iprb. I did not configure networking at install time. I performed a CD based (not DVD) install. I set up 1024Mb of swap just to be on the safe side. Side note: I did experiment with adding swap early, but ttinstall startup requires real memory as it drags in a number of shared objects which all do need to be in memory at load time.

I was able to perform a fresh, full netinstall of nevada build 22 on a large memory system using the same minimized nevada build 22 miniroot.

Recommendations:

  • Disconnect or remove any devices not used during install. This will keep their drivers from attaching and hence conserve memory.
  • Don't try this with gigabit or 10gig NICs and 128Mb. Their drivers have a tendency to set up large buffers. BTW, if you have such a setup, consider trading your NIC for more memory.
  • Don't panic when adding SUNWj5rt or updating the boot_archive takes a really long time. Since you wanted to run a 128Mb system, this is a great time to get used to watching it swap.

It is possible, and even likely that any number of the items I chose to delete will be required in future builds, which is why this is basically unmaintainable and a blog entry rather than part of the product. As with many things in software that would require more work: If someone wishes to investigate the exact interface and packaging boundaries for the bits that needed to be removed and to establish a stable way of making these items optional, this could become part of the product. Considering how close I had to cut it, it may not be possible to keep this working for much longer without more serious work. More serious work in this case would be to use a compressed filesystem for some or all of the ramdisk.

Conclusion:

The 128Mb test system I used is really amazingly slow when compared to the same system with 256Mb of RAM installed. Its interactive performance is bad enough that the time spent figuring out what type of memory it takes, ordering and installing it could easily be recouped in a matter of days. Given just how cheap workable memory for most such systems is, going to the effort of running one with less than 256Mb does not appear to be particularly sensible.

Tuesday Jun 14, 2005

A new boot architecture

Here I'll attempt to give a very brief overview of the components that make up the new boot architecture we recently integrated into (Open)Solaris.

New-boot, as it was called during the development phase, utilizes GRUB as the initial boot loader. At the moment we're using GRUB 0.95 with a couple of patches and a number of bugfixes. The source to this GRUB is available both via OpenSolaris under usr/src/grub/grub-0.95 as well as via the SUNWgrubS source package on all post new-boot Solaris media.

One of the goals of the project was to establish an interface between Solaris and the boot-loader allowing for (more) independent development of either. The multiboot spec, as implemented from the side of the boot-loader by GRUB seemed ideally suited to this.

The other side of the spec is implemented by the multiboot kernel / boot loader simply called multiboot. Its code can be found under: usr/src/psm/stand/boot/i386/common. From GRUB's perspective it is a truly multiboot compliant kernel. From the perspective of the Solaris kernel, it is merely a ramdisk loader and boot strap. It makes boot-time options passed through GRUB available as properties and can read and load (gen)unix and krtld from the ramdisk (more on the ramdisk in a moment).
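
To make the "options become properties" step concrete, here is how a -B argument breaks down into name and value pairs. This is only an illustration of the format in shell; the real parsing lives in the multiboot C code, quoting rules are not covered, and the function name boot_props is made up:

```shell
#!/bin/sh
# Hypothetical illustration: splits a -B argument of the form
# name=value[,name=value...] into one "name value" pair per line.
boot_props() {
        echo "$1" | tr ',' '\n' | while IFS='=' read name value; do
                echo "$name $value"
        done
}

boot_props "console=ttya,disable-sd=true"
```

Each output line corresponds to one property the kernel can later look up by name.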

Another goal was to reduce Solaris's reliance on hardware specific features, specifically the ability to perform read operations from IO devices early on in boot. Unlike some other operating systems, Solaris loads all drivers dynamically and assembles the "path" of drivers required to access the root device dynamically. This means it needs to be able to access the files that are the driver binaries before it can access the device they live on directly.

The history of this pre-root-mount IO is that on SPARC such IO is accomplished by device specific fcode delivered via OBP (IEEE 1275), or extensions to it delivered on IO adapters. On x86, the lack of such an OBP was compensated for by bootconf and its collection of real-mode drivers. To the Solaris kernel, these real-mode drivers presented a very OBP like interface allowing for little divergence in the kernel configuration code. This meant that in order for a particular device to be usable as a boot device on x86, it needed two drivers: one real-mode driver to boot, and then another Solaris kernel driver to access the device once the system is booted.

Much like OBP on SPARC (or on a PowerMac), x86 systems tend to come with code that can access bootable devices. This is the BIOS, and in many cases BIOS option ROMs on adapters. While the very early stage of boot has always utilized this code, calling back into it during kernel boot is problematic: it involves not only switching back into real mode, but also restoring enough state for the BIOS to be able to run again, as well as potentially saving and restoring state specific to whatever IO device is being used.

So the problem is that we need to be able to read and load arbitrary modules during boot and would like to utilize device specific code that ships with the hardware, but we can't call back into that code once we've started to boot. The solution is to simply load everything we could possibly need at once and then boot with it in memory (which we don't need system specific drivers to access). The implementation of this solution involves a ramdisk that is populated pre-boot (either at install time or, if it needs to be updated, before reboot).

As I alluded to earlier, the multiboot kernel knows how to read this ramdisk well enough to load krtld (the kernel linker/loader) and (gen)unix from it. Then krtld, which can also read the ramdisk thanks to the code in usr/src/uts/common/krtld/bootrd.c, can bring in modules (and other files) via kobj_open() until root can be mounted.
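The point of the ramdisk is that, once it is in memory, opening and reading a file needs nothing more than a lookup and a memcpy. The toy below shows that idea with a flat table of files. It is only a model: the real ramdisk is a filesystem image and bootrd.c has to understand that format, and the names rd_file, rd_handle, rd_open() and rd_read() are all made up here.

```c
#include <stddef.h>
#include <string.h>

/* One file that was placed in memory before boot (hypothetical layout). */
struct rd_file {
	const char *path;
	const unsigned char *data;
	size_t size;
};

/* Open-file state: which file, and how far we have read. */
struct rd_handle {
	const struct rd_file *file;
	size_t offset;
};

/* Find path in the in-memory table. Returns 0 on success, -1 if absent. */
int
rd_open(const struct rd_file *files, size_t nfiles, const char *path,
    struct rd_handle *h)
{
	size_t i;

	for (i = 0; i < nfiles; i++) {
		if (strcmp(files[i].path, path) == 0) {
			h->file = &files[i];
			h->offset = 0;
			return (0);
		}
	}
	return (-1);
}

/* Copy up to len bytes out of memory; no device driver is involved. */
size_t
rd_read(struct rd_handle *h, void *buf, size_t len)
{
	size_t left = h->file->size - h->offset;

	if (len > left)
		len = left;
	memcpy(buf, h->file->data + h->offset, len);
	h->offset += len;
	return (len);
}
```

A loader built this way can keep pulling in modules until the real root device is reachable, without ever calling back into the BIOS.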

Before we can make any real progress towards mounting root, we need to have an idea of the physical layout of the machine. On SPARC this is accomplished by looking at the hardware tree that OBP has built (prtconf -p to view). On x86, this used to be accomplished by code in bootconf[1] that, when it was done, exported something very similar to a 1275 device tree.

Luckily all PCI devices can be enumerated quite reliably by parsing PCI config space. This happens in pci_setup_tree(). If you look closely at things like pci_reprogram(), you'll notice that we do a little more than just enumerate them there.
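For readers who haven't seen it, this is roughly what enumerating by parsing config space means. The sketch relies on standard PCI facts: an empty slot reads back a vendor ID of 0xFFFF, bit 7 of the header type marks a multi-function device, and configuration mechanism #1 addresses a register as shown in pci_conf1_addr(). To keep it runnable in userland, config space is simulated by a small table (sim_space and sim_read32() are stand-ins I made up). It also brute-forces every bus, which is not necessarily how pci_setup_tree() walks the hierarchy.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_NO_DEVICE  0xFFFFu  /* vendor ID read back from an empty slot */
#define PCI_HDR_MULTI  0x80u    /* header-type bit: multi-function device */

/* Address word for PCI configuration mechanism #1 (written to port 0xCF8). */
uint32_t
pci_conf1_addr(uint8_t bus, uint8_t dev, uint8_t func, uint8_t reg)
{
	return (0x80000000u | ((uint32_t)bus << 16) |
	    ((uint32_t)(dev & 0x1F) << 11) |
	    ((uint32_t)(func & 0x07) << 8) | (reg & 0xFCu));
}

/* Simulated PCI function, standing in for real hardware (hypothetical). */
struct sim_func {
	uint8_t bus, dev, func;
	uint16_t vendor, device;
	uint8_t header_type;
};

struct sim_space {
	const struct sim_func *funcs;
	size_t count;
};

/* Read one 32-bit config register; absent functions read as all ones. */
uint32_t
sim_read32(const struct sim_space *s, uint8_t bus, uint8_t dev,
    uint8_t func, uint8_t reg)
{
	size_t i;

	for (i = 0; i < s->count; i++) {
		const struct sim_func *f = &s->funcs[i];

		if (f->bus != bus || f->dev != dev || f->func != func)
			continue;
		if (reg == 0x00)        /* device ID (high), vendor ID (low) */
			return (((uint32_t)f->device << 16) | f->vendor);
		if (reg == 0x0C)        /* header type lives in bits 16..23 */
			return ((uint32_t)f->header_type << 16);
		return (0);
	}
	return (0xFFFFFFFFu);
}

/* Walk every bus/device/function and count the functions that respond. */
int
pci_count_functions(const struct sim_space *s)
{
	int bus, dev, func, found = 0;

	for (bus = 0; bus < 256; bus++) {
		for (dev = 0; dev < 32; dev++) {
			uint8_t b = (uint8_t)bus, d = (uint8_t)dev;
			uint32_t hdr;
			int nfunc;

			if ((sim_read32(s, b, d, 0, 0x00) & 0xFFFFu) ==
			    PCI_NO_DEVICE)
				continue;   /* nothing in this slot */
			hdr = (sim_read32(s, b, d, 0, 0x0C) >> 16) & 0xFFu;
			nfunc = (hdr & PCI_HDR_MULTI) ? 8 : 1;
			for (func = 0; func < nfunc; func++) {
				if ((sim_read32(s, b, d, (uint8_t)func, 0x00) &
				    0xFFFFu) != PCI_NO_DEVICE)
					found++;
			}
		}
	}
	return (found);
}
```

On real hardware the read routine would write pci_conf1_addr() to port 0xCF8 and read the data from port 0xCFC instead of consulting a table.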

While PCI accounts for most devices in a modern system, those of us living in a UNIX world with serial consoles still like to use the on-board serial ports (which are still ISA) that are still found on many systems. Similarly, PS/2 ports (8042), while finally starting to disappear on desktop systems, still account for nearly all integrated laptop keyboards and pointing devices. So we need to deal with at least them. The good news is that the need to power manage devices in the order they are connected (if you power down an HBA, you can't talk to the disk to power it down) already led to the need for some sort of system wide description of how devices are interconnected, and the ACPI tables can provide this information.

Before I explain how ISA devices are enumerated, let's take a look at ACPI. If you're thinking of ACPI as a power management related spec, you are correct, but it has grown to include things that supersede the MP tables (how do I find the not yet running processors), Plug-and-Play interrupt programming and other things that are a constant source of system specific bugs. Up to and including Solaris 10, we used acpi_intp, which was a home-grown ACPI interpreter that suffered greatly from the vagueness of the original ACPI specs. While we could have brought it up to speed with respect to the ACPI 2.0 spec, there is still the issue of machine specific bugs. As luck would have it, Intel had recently made acpica (which is a fairly complete OS-side ACPI implementation) available under a sufficiently free license. After much reality checking with various engineers, and not least legal review, we decided to dump acpi_intp and incorporate acpica. It will likely form the basis of future power management efforts and can now be found under usr/src/uts/i86pc/io/acpica.

Information provided by ACPI is also used in pcplusmp (on mp systems) and uppc to configure interrupt routing. On multiprocessor systems it also supplements information found in the MP tables to help us find APICs and their associated CPUs.

Before I get back to ISA enumeration, one more note on ACPI. I previously mentioned that the initial ACPI spec was quite vague. This led to odd implementations not only on the OS side, but also on many systems that are still in use today. Currently we don't trust any systems (strictly speaking, BIOSes) made before 1999. The startup code in acpica_process_user_options() makes that check. Beyond that, David implemented a mechanism for us to deliver fixed tables as regular files to override ones delivered by the hardware in case they are hopelessly broken. This code can be found in AcpiOsTableOverride().
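The two safeguards above can be sketched in a few lines. The only facts taken as given are the 1999 cutoff from the text and two properties of ACPI tables: the common header is 36 bytes with a little-endian length at offset 4, and all bytes of a table sum to zero modulo 256. How the BIOS year is obtained, how replacement tables are located on disk, and the names bios_year_trusted(), acpi_table_valid(), acpi_fix and acpi_pick_table() are assumptions made for this example; the real logic lives in acpica_process_user_options() and AcpiOsTableOverride().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ACPI_HDR_SIZE     36    /* size of the common ACPI table header */
#define ACPI_CUTOFF_YEAR  1999  /* BIOSes older than this are not trusted */

/* Hypothetical check: is a BIOS from this year recent enough to trust? */
int
bios_year_trusted(int bios_year)
{
	return (bios_year >= ACPI_CUTOFF_YEAR);
}

/* A table is sane if its length fits and all its bytes sum to zero. */
int
acpi_table_valid(const unsigned char *t, size_t avail)
{
	uint32_t len;
	uint8_t sum = 0;
	size_t i;

	if (avail < ACPI_HDR_SIZE)
		return (0);
	len = (uint32_t)t[4] | ((uint32_t)t[5] << 8) |
	    ((uint32_t)t[6] << 16) | ((uint32_t)t[7] << 24);
	if (len < ACPI_HDR_SIZE || len > avail)
		return (0);
	for (i = 0; i < len; i++)
		sum = (uint8_t)(sum + t[i]);
	return (sum == 0);
}

/* A replacement table that was read from a regular file (hypothetical). */
struct acpi_fix {
	const unsigned char *data;
	size_t size;
};

/*
 * Return a valid replacement whose 4-byte signature matches the
 * firmware's table, or the firmware's own table if there is none.
 */
const unsigned char *
acpi_pick_table(const unsigned char *existing, const struct acpi_fix *fixes,
    size_t nfixes)
{
	size_t i;

	for (i = 0; i < nfixes; i++) {
		if (acpi_table_valid(fixes[i].data, fixes[i].size) &&
		    memcmp(fixes[i].data, existing, 4) == 0)
			return (fixes[i].data);
	}
	return (existing);
}
```

Validating the replacement before using it matters: a corrupt fix file should never be worse than the broken table it was meant to replace.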

Now, on to ISA enumeration. This happens when the isa nexus is attached and it sets out to enumerate its children in isa_alloc_nodes(). If ACPI can be used on this system, acpi_isa_device_enum() is called and the devinfo nodes are built. Otherwise we fall back to the following fixed set of devices:

  • Two serial ports.
  • One parallel port.
  • One i8042 (PS/2) node for mouse and keyboard.
  • No floppy (it is known to hard hang if not present).
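The shape of that decision can be sketched as follows. The node names and the I/O addresses and IRQs in the default table are the conventional PC values, filled in by me for illustration rather than copied from isa_alloc_nodes(), and isa_enumerate() with its NULL-means-no-ACPI convention is likewise made up.

```c
#include <stddef.h>
#include <stdint.h>

/* A much-simplified stand-in for a devinfo node (hypothetical). */
struct isa_node {
	const char *name;
	uint16_t ioaddr;
	uint8_t irq;
};

/* Fallback set used when ACPI cannot describe the ISA devices. */
static const struct isa_node isa_defaults[] = {
	{ "serial",   0x3F8, 4 },   /* first serial port (assumed COM1 values) */
	{ "serial",   0x2F8, 3 },   /* second serial port (assumed COM2 values) */
	{ "parallel", 0x378, 7 },   /* one parallel port */
	{ "i8042",    0x060, 1 }    /* PS/2 controller for keyboard and mouse */
	/* deliberately no floppy: probing for an absent one can hard hang */
};

/*
 * Fill out[] with ISA children. acpi_nodes is what ACPI enumeration
 * produced, or NULL when ACPI cannot be used on this system.
 * Returns the number of nodes written.
 */
size_t
isa_enumerate(const struct isa_node *acpi_nodes, size_t n_acpi,
    struct isa_node *out, size_t max)
{
	const struct isa_node *src = acpi_nodes;
	size_t n = n_acpi;
	size_t i;

	if (src == NULL) {
		src = isa_defaults;
		n = sizeof (isa_defaults) / sizeof (isa_defaults[0]);
	}
	if (n > max)
		n = max;
	for (i = 0; i < n; i++)
		out[i] = src[i];
	return (n);
}
```

Keeping the fallback list short and conservative is the point: anything on it gets touched blindly, so a device that can hang the machine when absent has no business being there.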

That pretty much covers device enumeration. Please keep in mind that this is a 9000ft view and I'm skipping over numerous things that took many engineers many months to figure out with less than a word.

Once all the needed drivers and support modules have been loaded, it becomes time to actually mount root for the first time, read only, in the kernel via mount_root() (for nfs) or ufs_mountroot(). Strictly speaking, most of the drivers were probably loaded as a result of mountroot opening the root device and devfs assembling the required drivers.

The actual root device to be mounted is specified via the bootpath property. This could be any typical root device, even a metadevice. If it is not set, we default to mounting root directly on the ramdisk.
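Put as code, the rule is a one-liner. In this sketch choose_root(), the root_choice structure and the RAMDISK_ROOT placeholder string are all hypothetical; the only thing taken from the text is that a set bootpath wins and an unset one means root goes on the ramdisk.

```c
#include <stddef.h>

#define RAMDISK_ROOT  "ramdisk"  /* placeholder, not a real device path */

enum root_kind { ROOT_ON_DEVICE, ROOT_ON_RAMDISK };

struct root_choice {
	enum root_kind kind;
	const char *path;
};

/* Pick the root device: bootpath if it is set, else the ramdisk. */
struct root_choice
choose_root(const char *bootpath)
{
	struct root_choice rc;

	if (bootpath != NULL && bootpath[0] != '\0') {
		rc.kind = ROOT_ON_DEVICE;
		rc.path = bootpath;
	} else {
		rc.kind = ROOT_ON_RAMDISK;
		rc.path = RAMDISK_ROOT;
	}
	return (rc);
}
```

Treating an empty string the same as an unset property is a choice made for the sketch, not something the text specifies.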

In the case when root is mounted on a real device (not the ramdisk), the ramdisk needs to contain little more than the kernel and all required drivers. This type of ramdisk image is stored in /platform/i86pc/boot_archive and needs to be kept in sync with the kernel binaries on the root device in order to avoid loading mismatched modules from the root filesystem after root is mounted[2]. The list of files and directories is kept in /boot/solaris/filelist.ramdisk on the running system. The task of syncing the boot archive is handled by bootadm(1M).
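To show what "in sync" means in practice, here is one simple way to detect a stale archive: compare the archive's modification time with that of every file it is supposed to contain. This is an assumption made purely for illustration and not necessarily how bootadm(1M) decides; tracked_file and archive_first_stale() are invented names, and the caller is assumed to have already collected the timestamps.

```c
#include <stddef.h>
#include <time.h>

/* A file that belongs in the archive, with its last-modified time. */
struct tracked_file {
	const char *path;
	time_t mtime;
};

/*
 * Return the path of the first tracked file that is newer than the
 * archive, or NULL if the archive is at least as new as all of them.
 */
const char *
archive_first_stale(time_t archive_mtime, const struct tracked_file *files,
    size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (files[i].mtime > archive_mtime)
			return (files[i].path);
	}
	return (NULL);
}
```

A non-NULL result means a module on the root filesystem has changed since the archive was built, which is exactly the mismatch the archive must not be booted with.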

The other case, in which we mount root on the ramdisk, requires the ramdisk to contain something closer to a minimal system. An example of this is the install miniroot, but the options are really only limited by one's time and available main memory.

[1] In a way bootconf was pretty neat. The idea behind it was that device probing during boot should be interactive, so that probe conflicts could be resolved by the end-user/system-admin on a system by system basis. This had a lot of value in the bad old days before self describing buses, when you had to poke device registers, guess what kind of device it was based on the device's reaction, and pray that some other device wouldn't hard-hang the system if treated the same way.

[2] Solaris dynamically loads and unloads modules and drivers on an as-needed basis.
