A new boot architecture
By user12619798 on Jun 14, 2005
Here I'll attempt to give a very brief overview of the components that make up the new boot architecture we recently integrated into (Open)Solaris.
New-boot, as it was called during the development phase, utilizes GRUB as the initial boot loader. At the moment we're using GRUB 0.95 with a couple of patches and a number of bugfixes. The source to this GRUB is available both via OpenSolaris under usr/src/grub/grub-0.95 and via the SUNWgrubS source package on all post-new-boot Solaris media.
One of the goals of the project was to establish an interface between Solaris and the boot loader, allowing for (more) independent development of either. The multiboot spec, as implemented on the boot-loader side by GRUB, seemed ideally suited to this.
The other side of the spec is implemented by the multiboot kernel / boot loader, simply called multiboot. Its code can be found under usr/src/psm/stand/boot/i386/common. From GRUB's perspective it is a truly multiboot-compliant kernel. From the perspective of the Solaris kernel, it is merely a ramdisk loader and bootstrap. It makes boot-time options passed through GRUB available as properties and can read and load (gen)unix and krtld from the ramdisk (more on the ramdisk in a moment).
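On disk, this hand-off shows up as an ordinary GRUB menu.lst entry: GRUB loads multiboot as its multiboot "kernel" and passes the boot archive along as a module. A typical entry might look roughly like this (the device name and paths are conventional examples, not a prescription):

```
title Solaris
	root (hd0,0,a)
	kernel /platform/i86pc/multiboot
	module /platform/i86pc/boot_archive
```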
Another goal was to reduce Solaris's reliance on hardware-specific features, specifically the ability to perform read operations from IO devices early on in boot. Unlike some other operating systems, Solaris loads all drivers dynamically and assembles the "path" of drivers required to access the root device at run time. This means it needs to be able to access the files that are the driver binaries before it can access the device they live on directly.
The history of this pre-root-mount IO is that on SPARC such IO is accomplished by device-specific fcode delivered via OBP (IEEE 1275), or extensions to it delivered on IO adapters. On x86, the lack of such an OBP was compensated for by bootconf and its collection of real-mode drivers. To the Solaris kernel, these real-mode drivers presented a very OBP-like interface, allowing for little divergence in the kernel configuration code. This meant that in order for a particular device to be usable as a boot device on x86, it needed two drivers: one real-mode driver to boot, and another Solaris kernel driver to access the device once the system is booted.
Much like OBP on SPARC (or on a PowerMac), x86 systems tend to come with code that can access bootable devices. This is the BIOS, and in many cases BIOS option ROMs on adapters. While the very early stage of boot has always utilized this code, calling back into it during kernel boot is problematic, as it not only involves switching back into real mode, but also restoring enough state for the BIOS to be able to run again, as well as potentially saving and restoring state specific to whatever IO device is being used.
So the problem is that we need to be able to read and load arbitrary modules during boot, and would like to utilize device-specific code that ships with the hardware, but we can't call back into that code once we've started to boot. The solution is simply to load everything we could possibly need at once and then boot with it in memory (which we don't need system-specific drivers to access). The implementation of this solution involves a ramdisk that is populated pre-boot (either at install time or, when it needs updating, pre-reboot).
As I alluded to earlier, the multiboot kernel knows how to read this ramdisk well enough to load krtld (the kernel linker/loader) and (gen)unix from it. Then krtld, which can also read the ramdisk thanks to the code in usr/src/uts/common/krtld/bootrd.c, can bring in modules (and other files) via kobj_open() until root can be mounted.
Before we can make any real progress towards mounting root, we need to have an idea of the physical layout of the machine. On SPARC this is accomplished by looking at the hardware tree that OBP has built (prtconf -p to view). On x86, this used to be accomplished by code in bootconf[1] which, when it was done, exported something very similar to a 1275 device tree.
Luckily all PCI devices can be enumerated quite reliably by parsing PCI config space. This happens in pci_setup_tree(). If you look closely at things like pci_reprogram(), you'll notice that we do a little more than just enumerate them there.
While PCI accounts for most devices in a modern system, those of us living in a UNIX world with serial consoles still like to use the on-board serial ports (which are still ISA) found on many systems. Similarly, PS/2 ports (8042), while finally starting to disappear on desktop systems, still account for nearly all integrated laptop keyboards and pointing devices. So we need to deal with at least those. The good news is that the need to power-manage devices in the order they are connected (if you power down an HBA, you can't talk to the disk behind it to power it down) had already led to the need for some sort of system-wide description of how devices are interconnected, and the ACPI tables can provide this information.
Before I explain how ISA devices are enumerated, let's take a look at ACPI. If you're thinking of ACPI as a power-management-related spec, you are correct, but it has grown to include things that supersede the MP tables (how do I find the not-yet-running processors?), Plug-and-Play interrupt programming, and other things that are a constant source of system-specific bugs. Up to and including Solaris 10, we used acpi_intp, a home-grown ACPI interpreter that suffered greatly from the vagueness of the original ACPI specs. While we could have brought it up to speed with respect to the ACPI 2.0 spec, there would still be the issue of machine-specific bugs. As luck would have it, Intel had recently made acpica (a fairly complete OS-side ACPI implementation) available under a sufficiently free license. After much reality checking with various engineers and, not least, legal review, we decided to dump acpi_intp and incorporate acpica. It will likely form the basis of future power management efforts and can now be found under usr/src/uts/i86pc/io/acpica.
Information provided by ACPI is also used in pcplusmp (on MP systems) and uppc to configure interrupt routing. On multiprocessor systems it also supplements information found in the MP tables to help us find APICs and their associated CPUs.
Before I get back to ISA enumeration, one more note on ACPI. I previously mentioned that the initial ACPI spec was quite vague. This led to odd implementations not only on the OS side, but also on many systems that are still in use today. Currently we don't trust any systems (strictly speaking, BIOSes) made before 1999; the startup code in acpica_process_user_options() makes that check. Beyond that, David implemented a mechanism for us to deliver fixed tables as regular files to override ones delivered by the firmware in case they are hopelessly broken. This code can be found in AcpiOsTableOverride().
Now, on to ISA enumeration. This happens when the isa nexus is attached and it sets out to enumerate its children in isa_alloc_nodes(). If ACPI can be used on this system, acpi_isa_device_enum() is called and the devinfo nodes are built. Otherwise we fall back to the following devices:
- Two serial ports.
- One parallel port.
- One i8042 (PS/2) node for mouse and keyboard.
- No floppy (probing for one is known to hard-hang the system when it is not present).
That pretty much covers device enumeration. Please keep in mind that this is a 9000 ft view, and I'm skipping over numerous things that took many engineers many months to figure out, giving them less than a word each.
Once all the needed drivers and support modules have been loaded, it is time to actually mount root for the first time, read-only, in the kernel via mount_root() (for NFS) or ufs_mountroot(). Strictly speaking, most of the drivers were probably loaded as a result of mountroot opening the root device and devfs assembling the required drivers.
The actual root device to be mounted is specified via the bootpath property. This could be any typical root device, even a metadevice. If it is not set, we default to mounting root directly on the ramdisk.
In the case where root is mounted on a real device (not the ramdisk), the ramdisk needs to contain little more than the kernel and all required drivers. This type of ramdisk image is stored in /platform/i86pc/boot_archive and needs to be kept in sync with the kernel binaries on the root device in order to avoid loading mismatched modules from the root filesystem after root is mounted[2]. The list of files and directories is kept in /boot/solaris/filelist.ramdisk on the running system. The task of syncing the boot archive is handled by bootadm(1M).
The other case, in which we mount root on the ramdisk, requires the ramdisk to contain something closer to a minimal system. An example of this is the install miniroot, but the options are really only limited by one's time and available main memory.
[1] In a way, bootconf was pretty neat: the idea behind it was that device probing during boot should be interactive, so that probe conflicts could be resolved by the end user/system admin on a system-by-system basis. This had a lot of value in the bad old days before self-describing buses, when you had to poke device registers, guess based on the device's reaction what kind of device it was, and pray that some other device wouldn't hard-hang the system if treated the same way.
[2] Solaris dynamically loads and unloads modules and drivers on an as-needed basis.