Solaris 10 on x64 Processors: Part 3 - Kernel

Virtual Memory

One of the most critical components of a 64-bit operating system is its ability to manage large amounts of memory using the additional addressing capabilities of the hardware. The key to those capabilities in Solaris is the HAT (Hardware Address Translation) "layer" of the otherwise generic VM system. Unfortunately, the 32-bit HAT layer for Solaris x86 was a bit long in the tooth and, after years of neglect, was extremely difficult to understand, let alone extend. So we decided on a ground-up rewrite pretty early on in the project; the eventual benefit of that was being able to use the same source code for both 32-bit and 64-bit mode, and to bring the benefits of the NX (no-execute) bit to both 32-bit and 64-bit kernels seamlessly. Joe Bonasera, who led this work, told me a few weeks ago that he'd expand on this in his own blog here, so I'm not going to describe it any further than that.

Interrupts, DMA, DDI, device drivers

The Solaris DDI (Device Driver Interface) was designed to support writing portable drivers between releases, and between instruction sets, to concentrate bus-dependent details and interfaces in specialized bus-dependent drivers (called nexus drivers), and to minimize the amount of low-level, bus-specific code in regular drivers (called leaf drivers). Most of the work we did on the 64-bit SPARC project back in 1997 was completely reused, and the majority of the work on the x86 DDI implementation was essentially making the code LP64 clean, and fixing some of the more hacky internals of some of the nexus drivers.

The most difficult part of the work was porting the low-level interrupt handlers, which were a monumental mass of confusing assembler. Though I had thought that it would be simplest to port the i386 assembler to amd64 conventions, this turned out to have been a poor decision. Sherry Moore tried to get this done quickly and accurately, but it was a very difficult challenge. We spent many days debugging problems with interrupts that were really rooted in the differences in register allocations between the two instruction set architectures and ABIs, as well as the highly contorted nature of the original code. We spent so much time on it that I eventually became consumed with guilt and rewrote most of it in C, which unsurprisingly turned out to be much easier to debug, and is now probably the best way to understand how the threads-as-interrupts implementation actually works.

The remaining work can be split into two parts. The first was ensuring that the drivers properly described their addressing capabilities, particularly those that hadn't been updated in a while. The second was the usual problem of handling ioctls from 32-bit and 64-bit applications where the two environments use different sizes and alignments for the data types passed across the interface. Again, Solaris already had a bunch of mechanisms for doing this, which we simply reused on previously i386-specific drivers to make them usable on amd64 kernels too.
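To make the ioctl problem concrete, here is a minimal userland sketch of the general technique: the driver looks at the caller's data model and converts a 32-bit layout into the native one before using it. Everything named demo_* is hypothetical and invented for illustration; in a real Solaris driver the data model would come from the kernel (for example via ddi_model_convert_from(9F)) and the copy would be done with ddi_copyin(9F) rather than memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the data model the kernel reports for the caller. */
enum demo_model { DEMO_MODEL_ILP32, DEMO_MODEL_LP64 };

/* Native (LP64) form of a hypothetical ioctl argument. */
struct demo_req {
	uint64_t buf;	/* user buffer address (64-bit pointer) */
	uint64_t len;	/* buffer length (64-bit size) */
};

/* The same argument as a 32-bit application lays it out. */
struct demo_req32 {
	uint32_t buf;	/* 32-bit pointer */
	uint32_t len;	/* 32-bit size */
};

/*
 * Copy the caller's argument into native form, widening the fields if the
 * caller is a 32-bit application. Returns 0 on success, or -1 if the
 * supplied buffer is too small for the caller's data model.
 */
static int
demo_copyin_req(const void *uarg, size_t ulen, enum demo_model model,
    struct demo_req *out)
{
	assert(uarg != NULL && out != NULL);

	if (model == DEMO_MODEL_ILP32) {
		struct demo_req32 r32;

		if (ulen < sizeof (r32))
			return (-1);
		memcpy(&r32, uarg, sizeof (r32));
		out->buf = r32.buf;	/* zero-extend pointer */
		out->len = r32.len;	/* zero-extend length */
		return (0);
	}

	if (ulen < sizeof (*out))
		return (-1);
	memcpy(out, uarg, sizeof (*out));
	return (0);
}
```

The rest of the driver then only ever sees struct demo_req, so the 32-bit versus 64-bit distinction stays confined to the one routine that crosses the user/kernel boundary.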

One slight thorn in our side was the difference in alignment constraints for the long long data type. On 32-bit SPARC and 64-bit SPARC, the alignment is 8 bytes for both; between i386 and amd64, however, the alignment changes from 4 bytes to 8 bytes. This seems mildly arcane, until you recall that the alignment of these data types controls the way that basic data structures are laid out between the two ABIs. Data structures containing long long types that were compatible between a 32-bit SPARC application and the 64-bit SPARC kernel now needed special handling for a 32-bit x86 application running on a 64-bit amd64 kernel. The same problem was discovered in a few network routing interfaces, cachefs, priocntl etc. Once we'd debugged a couple of these by hand, Ethan Solomita started a more systematic effort to locate the remaining problems; Mike Shapiro suggested that we build a CTF tool that would help us find the rest more automatically, or at least semi-automatically, which was an excellent idea and helped enormously.
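The layout difference is easy to see in a small example. The sketch below assumes an LP64 target such as amd64 and a compiler that honours #pragma pack (GCC and Clang do); the structure names are hypothetical and only illustrate the effect, not any particular Solaris interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A hypothetical argument structure as the amd64 kernel lays it out. */
struct demo_arg {
	int32_t	flags;	/* offset 0 */
	int64_t	offset;	/* 8-byte aligned: offset 8, after 4 bytes of padding */
};

/*
 * The same structure as a 32-bit i386 application lays it out, where
 * long long only needs 4-byte alignment. Packing to 4 bytes reproduces
 * that layout when compiling for a 64-bit target.
 */
#pragma pack(4)
struct demo_arg32 {
	int32_t	flags;	/* offset 0 */
	int64_t	offset;	/* 4-byte aligned: offset 4, no padding */
};
#pragma pack()

/* Convert the 32-bit application's layout to the kernel's native layout. */
static void
demo_arg32_to_native(const struct demo_arg32 *src, struct demo_arg *dst)
{
	assert(src != NULL && dst != NULL);
	dst->flags = src->flags;
	dst->offset = src->offset;
}
```

On amd64 the native structure is 16 bytes with offset at byte 8, while the i386 layout is 12 bytes with offset at byte 4, so the kernel cannot simply overlay one on the other. On SPARC both ABIs would give the 16-byte layout, which is why the problem never showed up there.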

MP bringup, EM64-T bringup

Back in 1990, one of the core design goals of the SunOS 5.0 project was to build a multithreaded operating system designed to run on multiprocessor machines. We weren't just doing a simple port of SVR4 to SPARC; we reworked the scheduler, and invested a large amount of effort throughout the kernel, adding fine-grain locking to extract the maximal concurrency from the hardware. Fast forward to 2005, and we're still working on it! The effort to extend scalability remains one of our core activities. However, we didn't have to do a lot of work to make multiprocessor Opteron machines run the 64-bit kernel; apart from porting the locking primitives, the only porting work was creating a primitive environment around the non-boot processors to switch them into long mode. William Kucharski (of amd64 booter fame) did this work in a week or so, and impressed us all with how quickly and how well this worked from the beginning.

We also wanted to run our 64-bit kernel on Intel's EM64-T CPUs, since we really do want Solaris to run well on non-Sun x86 and x64 systems. As we were doing other work on the system, we had been anticipating what we needed to do from Intel's documentation, so as soon as the hardware was publicly available (unfortunately we weren't able to get it earlier from Intel) Russ Blaine started working on it and had the 64-bit kernel up and running multiuser in about a week. I'm not sure if that's because Intel's specifications are particularly well written, or because Russ's debugging skills were even more excellent that week, or if it's testament to the skills of the Intel engineers at making their processor so compatible with the Opteron architecture, but we were pretty pleased with the result.

Debugging Infrastructure

Critical aspects of the debugging architecture of Solaris that needed to be ported include the CTF system for embedding dense type information in ELF files, and the corresponding library and toolchain infrastructure that manipulates it; libproc, which encapsulates a bunch of /proc operations for the ptools; /proc itself; mdb; and the DTrace infrastructure. I worked on the easy part, /proc; the difficult work was done by Matt Simmons, Eric Schrock and, for DTrace, Adam Leventhal and of course Bryan Cantrill.

At the same time as we were starting our bring-up efforts on Opteron, an unrelated project in the kernel group was busy creating a new debugging architecture based on mdb(1). The basic idea was that we wanted to be able to bring most of mdb's capabilities to debugging live kernel problems. The kmdb team observed that our existing kernel debugger, kadb, was always in a state of disrepair, and yet, because of its co-residence with the kernel, needed constant tweaking for new platforms. So rather than continue this state of affairs, they came to the idea that it would be simpler if we could assume that the Solaris kernel would provide the basic infrastructure for the debugger.

This has considerable advantages for incremental development, and for the vast majority of kernel developers who aren't working on new platform bringup this is clearly a Good Thing. But it does make porting to a fresh platform or instruction set a little more difficult because kmdb is sophisticated, and doesn't really work until some of the more difficult kernel code has been debugged into existence. The amd64 project had that problem in a particularly extreme form, because the debugger design and interfaces were under development at the same time as we needed them. As a result, the early amd64 kernel bringup work was really done using a simulator (SIMICS), and then by doing printf-style debugging and post-mortem trap-tracing, rather than with kmdb. I still remember debugging init(1M) using the simulator on the last day of one of our offsites in San Francisco, figuring out the bug while riding BART back home.

At this point of course, kmdb works fine and is of great help when debugging more subtle problems. However, knowing what we know now, we should have built a simple bringup-debugger to get us through those early stages where almost nothing worked. Something that could catch and decode exceptions, do stack traces and dump memory would be enough. I'd certainly recommend that path to anyone thinking of porting Solaris to another instruction set architecture; as soon as you get to the point that the kernel starts taking interrupts and doing context switches, things get way too hard for printf-style debugging!

System calls Revisited

For 64-bit applications we used the syscall instruction. We used the same register calling conventions as Linux; these are somewhat forced upon you by the combination of the behaviour of the instruction, and the C calling convention, and besides, there is no value in being deliberately different.

Interestingly, the 64-bit system call parameter passing convention is extremely similar to SPARC i.e. the first six system call arguments are passed in registers, with additional arguments passed on the stack. As a result, we based the 64-bit system call handler algorithm for amd64 on the 64-bit handler for sparcv9.
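As a rough illustration of that convention, the sketch below models how a handler might gather system call arguments: the first six come from the saved registers, in the order used by Linux on x86-64 (rdi, rsi, rdx, r10, r8, r9), and any further arguments are fetched from the caller's stack. r10 takes the place of the C calling convention's rcx because the syscall instruction itself overwrites rcx with the return address. The structure, the function, the limit of eight arguments, and the assumption that ustack already points at the first stack-passed argument are all hypothetical simplifications, not the actual Solaris handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	DEMO_NREGARGS	6	/* arguments passed in registers */
#define	DEMO_MAXARGS	8	/* hypothetical upper bound for this sketch */

/* Hypothetical snapshot of the user registers saved at system call entry. */
struct demo_regs {
	uint64_t rax;				/* system call number */
	uint64_t rdi, rsi, rdx, r10, r8, r9;	/* arguments 0 through 5 */
};

/*
 * Collect nargs arguments into argv. ustack is assumed to point at the
 * first stack-passed argument and may be NULL when nargs <= 6.
 * Returns 0 on success, -1 on a bad argument count or missing stack.
 */
static int
demo_gather_args(const struct demo_regs *rp, const uint64_t *ustack,
    size_t nargs, uint64_t *argv)
{
	assert(rp != NULL && argv != NULL);

	if (nargs > DEMO_MAXARGS)
		return (-1);

	const uint64_t regargs[DEMO_NREGARGS] = {
		rp->rdi, rp->rsi, rp->rdx, rp->r10, rp->r8, rp->r9
	};

	for (size_t i = 0; i < nargs; i++) {
		if (i < DEMO_NREGARGS) {
			argv[i] = regargs[i];
		} else {
			if (ustack == NULL)
				return (-1);
			argv[i] = ustack[i - DEMO_NREGARGS];
		}
	}
	return (0);
}
```

A real handler would also have to copy the stack arguments in safely from user memory and validate the system call number in rax; both are left out here to keep the register-then-stack shape visible.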

The 32-bit system call handlers include the 32-bit variant of the syscall instruction which works sufficiently well when the processor is running the 64-bit kernel to be usable. We also made the sysenter instruction work for Intel CPUs, and of course, the lcall handler; though this is actually handled via a #np trap in C. Our latest version of this assigns a new int trap to 32-bit syscalls which will improve the performance of the various types of system call that don't work well with plain syscall or sysenter.

More Tool Chain Issues

In the earlier "preliminaries" blog, I mentioned our use of gcc; however the Solaris kernel contains its own linker, krtld, based on the same relocation engine used in the userland utility. Fortunately, we had Mike Walker to do the amd64 linker work early on; we had a working linker a week or two ahead of having a linkable kernel.

One more thing

In my first posting on this topic I neglected to mention that there's a really good reference work for people trying to navigate the Solaris kernel - the book by Jim Mauro and Richard McDougall called Solaris Internals: Core Kernel Components; ISBN 0130224960.

Next time, I'll describe more of the userland work that completed the port.
