Tuesday Aug 09, 2005

Solaris Deployment and Kernel Development with Diskless Clients

We did the Xen bringup effort using NFS diskless technology. Why? Are we just so infatuated with NFS that we can't resist using it? Nope, there are at least two good reasons: bringup and deployment.

The Kernel Bringup Cycle
One of the things that characterizes bringup projects (particularly ones based on printf-debugging(!)) is a repeated cycle of fix-test-debug. This is different from incremental development. You're not working on one particular subsystem or tricky bug in a 99% working system, you're continually moving from subsystem to subsystem as the bringup progresses.

If you do disk-based bringup, then to install the next version of the system you have to boot that system using a working kernel, install the bits, then reboot to try them out. This can be quite a pain on a completely new machine, where a working kernel may not exist at all, in which case you're continually recabling the disk. Even if you can boot it, you quickly get tired of listening to the disk clank and whirr, and of watching the BIOS chug through its tests to announce the identical things it has (re-)detected over and over again.

Diskless bringup using NFS is a lot faster and (once you tweak the configuration correctly) simpler. Instead you just place the new kernel bits you want to test onto an NFS server, then just boot the client machine. And of course booting diskless domains under Xen is even simpler because there's no BIOS involved at all - Xen's domain builder is vastly simpler and faster.

Once I/O happens over the network, you can easily observe what the client kernel is actually doing via snoop(1M), watching the first RARP and ARP attempts, through to the fully fledged NFS traffic between client and server.

Finally, of course, it's easier to work on the disk driver this way too, with a fully functional diskless system around you. That also helps as you don't place your boot image at quite the same level of risk as when testing your prototype driver.

One of the key things that Xen can do is transparent workload migration; that is the ability to move a running pure-virtual domain from machine to machine with almost imperceptible down-time. Diskless operation is a natural environment for exploring domain migration across a pool of machine resources in a data center, because of the various advantages of file-based protocols generally, and because that state is in storage across the network.

Diskless operation is also a means of managing the multiple OS images and patch levels for all the virtual machine environments that you might want to create on a pool of hardware resources. That is one of the biggest problems with large scale virtual machine deployments, and one of the problems that OS virtualization technologies like Solaris Zones neatly avoid.

While we're on this topic, I thought people might be interested in this little vignette. I attended a virtualization BOF at the Ottawa Linux Symposium a week or two ago; people from Red Hat, VMware and IBM spoke, but I was quite surprised when one of the IBM VM technologists stood up and said that Solaris Zones solved the problem of managing multiple OS images really well, and how their customers were asking for it, and how he wished the Linux community could extend projects like vservers to try to solve those problems in a similar way. Since IBM has been working on virtualization technologies for many, many years, and I have a lot of respect for their technology and experience in this area, I took that as quite a compliment to what we built in Solaris 10.

There is no "one-size fits all" virtualization technology; each has its own advantages and disadvantages. Eric Schrock wrote a great explanation of the relationship between the technologies, i.e. where OS virtualization technology like Zones is useful, and where hardware virtualization technology like Xen can help. I strongly believe the two technologies are complementary, and will allow customers to strike a balance between utilization and isolation while keeping a lid on complexity for large-scale deployment.

I rather suspect the IBM VM guys think so too.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: Xen

Thursday Jul 28, 2005

"Hello World" from Solaris on Xen

Last Friday we went multiuser for the first time on our Solaris-on-Xen port. (For the uninitiated, Xen is an open source hypervisor from the University of Cambridge - see http://xen.sf.net)

The underlying hardware is a 2-way Opteron box. We're at a point in the port where we still emit loads of debugging noise, and the boot-up sequence itself isn't that interesting, but it all works pretty well, and I just ssh'ed into the virtual machine and posted this very blog entry from a Solaris domU running side by side with a dom0 kernel and a domU version of Linux.

Here's some cut-and-paste of my ssh session:

hostname% uname -a
SunOS hostname 5.11 fpfix-2005-07-22 i86xen i386 i86xen
hostname% isainfo -x
i386: sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov cx8 tsc fpu
hostname% psrinfo -vp
The physical processor has 1 virtual processor (0)
  x86 (AuthenticAMD family 15 model 5 step 10 clock 2391 MHz)
        AMD Opteron(tm) Processor 250

Here's what this looks like from the dom0 side - the "control" kernel for the machine. (If this doesn't look right in your browser, it's my fault - the command really does generate nicely aligned columns ..)

# xm list
Name             Id  Mem(MB)  CPU  State  Time(s)  Console
Domain-0          0      955    0  r----   1115.2        
linuxhost        67      128    0  -b---      5.5    9667
hostname         66      511    1  -b---    539.7    9666

Thanks to everyone on the team - Todd, Joe, Stu, for all their hard work, and thanks to Ian Pratt of the Xen project for patiently answering our silly and occasionally not-so-silly questions.

S'very cool -- it boots pretty fast, despite the fact that we're still relying on non-batched hypervisor calls, and the kernel code is covered in ASSERTs. No, we don't have domain migration working yet - Joe and Stu are working on the infrastructure for that right now.

What's next for Solaris-on-Xen?
Well, we've been working off the (now rather long in the tooth) Xen 2.x source base; and we need to move on to the next major release of Xen: specifically the Xen 3.0 source base which the Xen team in Cambridge say is close to entering its testing phase. Xen 3.x provides a bunch of interesting capabilities that we're keen to explore: multiprocessor guests, 64-bit kernels, and we also want to make it possible to use Solaris in domain 0. Ian gave a great presentation on Xen, including more 3.0 details last week at OLS.

And we're happy to have other people join this project at this early stage to help us do that, or even just to experiment with the code in whatever other way they want to. To enable that, we're launching an OpenSolaris community discussion group about OpenSolaris on Xen where future postings like this will end up. Just to set expectations - we do have a wad of cleanup of the 2.0 work to do, and we have to sync up with the Solaris gate so that we catch up with OpenSolaris (we started before the OpenSolaris launch and we've been based off build 15 ever since). There's a few weeks' work to do there.

Obviously, we're early in the evaluation phase of this technology, and while the capability and code base will be an OpenSolaris project, it won't be integrated into the top-level OpenSolaris tree until the project is complete. So please don't expect these capabilities to show up in the official OpenSolaris builds for quite a while. No, I don't even have a schedule for when they will.

How to participate? Well we're working on various ideas to make source and various builds and other components available to let the OpenSolaris community try this technology out, give us feedback, and participate in the engineering.

You can keep up with what we're doing by joining the OpenSolaris on Xen community at opensolaris.org which should appear in a day or two.

The OpenSolaris on Xen community is now up: register on the OpenSolaris web site to participate.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: Xen

Tuesday Jun 14, 2005

Opening Day

This is opening day, and I want to say "Welcome!" to everyone that's interested in taking a look under the hood of, and tinkering with, our favourite operating system. It's taken many of us a lot of hard work to get this far, and yet this is where the conversation starts, and the journey really begins.

I'm really looking forward to participating, and seeing what we, the OpenSolaris community, can build. Together.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Solaris 10 on x64 Processors: Part 4 - Userland


The amount of work involved in the kernel part of the amd64 project was fairly large; fortunately the userland part was more straightforward because of our prior work on 64-bit Solaris on SPARC back in 1997. So, for this project, once the kernel work, which abstracts the hardware differences between processors, was done, many smaller tasks appeared that were mostly solved by tweaking Makefiles and finding occasional #ifdefs that needed something added or modified. Fortunately, it was also work that was done in parallel by many people from across the organizations that contribute to the Solaris product.

Of course there were other substantial pieces of work like the Sun C and C++ compilers, and the Java Virtual Machine; though the JVM was already working on 32-bit and 64-bit Solaris on SPARC as well as 32-bit on x86, and the Linux port of the JVM had already caused that team to explore many of the amd64 code generation issues.

One of the things we tried to do was to be compatible with the amd64 ABI on Linux. As we talked to industry partners, we discovered that there was a variety of interpretations of the term "ABI." Many of the people we talked to outside of Sun thought that "ABI" only referred to register usage, C calling conventions, data structure sizes and alignments. A specification for compiler and linker writers, but with little or nothing beyond that about the system interfaces an application can actually invoke. But, the System V ABI is a larger concept than that, and was at least intended to provide a sufficient set of binary specifications to allow complete application binaries to be constructed that could be built once, and run on any ABI-conformant implementation. Thus Sun engineers tend to think of "the ABI" as being the complete set of interfaces used by user applications, rather than just compiler conventions; and over the years we expanded this idea of maintaining a binary compatible interface to applications all the way to the Solaris application guarantee program.

Though we tried to be compatible at this level with Linux on amd64, we discovered a number of issues in the system call and library interfaces that made that difficult, and while we did eliminate gratuitous differences where we could, we eventually decided on a more pragmatic approach. We decided to be completely compatible with the basic "compiler" style view of the ABI, and simply try to make it simple to port applications from 32-bit Solaris to 64-bit Solaris, and from Solaris on sparcv9 to Solaris on x64, and leave the thornier problems of full 64-bit Linux application compatibility to the Linux Application Environment (LAE) project.

Threads and Selectors

In previous releases of Solaris, the 32-bit threads library used the %gs selector to allow each LWP in a process to refer to a private LDT entry to provide the per-thread state manipulated by the internals of the thread library. Each LWP gets a different %gs value that selects a different LDT entry; each LDT entry is initialized to point at per-thread state. On LWP context switch, the kernel loads the per-process LDT register to virtualize all this data to the process. Workable, yes, but the obvious inefficiency here was requiring every process to have at least one extra locked-down page to contain a minimal LDT. More serious was the implied upper bound of 8192 LWPs per process (derived from the hardware limit on LDT entries).
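The 8192 figure falls straight out of the x86 selector format: a selector is 16 bits, of which 13 are the table index, one picks GDT or LDT, and two carry the requested privilege level. Here's a minimal sketch of that encoding; the macro and function names are made up for illustration and are not taken from the Solaris source.

```c
#include <assert.h>
#include <stdint.h>

/* x86 segment selector: bits 15..3 index, bit 2 table indicator, bits 1..0 RPL */
#define SEL_TI_LDT      0x4u                    /* 1 = LDT, 0 = GDT */
#define SEL_INDEX_BITS  13
#define MAX_DT_ENTRIES  (1u << SEL_INDEX_BITS)  /* 8192 entries per table */

/* Build a selector naming LDT entry `index` at privilege level `rpl`. */
static uint16_t ldt_selector(unsigned index, unsigned rpl)
{
    assert(index < MAX_DT_ENTRIES && rpl <= 3);
    return (uint16_t)((index << 3) | SEL_TI_LDT | rpl);
}

/* Recover the table index from a selector. */
static unsigned selector_index(uint16_t sel)
{
    return sel >> 3;
}

/* Non-zero if the selector refers to the LDT rather than the GDT. */
static int selector_is_ldt(uint16_t sel)
{
    return (sel & SEL_TI_LDT) != 0;
}
```

With one LDT entry per LWP, the largest index a selector can express is 8191, so no amount of memory gets you past 8192 entries.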

For the amd64 port, following the draft ABI document, we needed to use the %fs selector for the analogous purpose in 64-bit processes too. On the 64-bit kernel, we wanted to use the FSBASE and GSBASE MSRs to virtualize the addresses that a specific magic %fs and magic %gs select, and we obviously wanted to use a similar technique on 32-bit applications, and on the 32-bit kernel too. We did this by defining specific %fs and %gs values that point into the GDT, and arranged that context switches update the corresponding underlying base address from predefined lwp-private values - either explicitly by rewriting the relevant GDT entries on the 32-bit kernel, or implicitly via the FSBASE and GSBASE MSRs on the 64-bit kernel. The result of all this work makes the code simpler, it scales cleanly, and the resulting upper bound on the number of LWPs is derived only from available memory (modulo resource controls, obviously).

Floating point

Most of the prework we had done to establish the SSE capabilities in the 32-bit kernel was readily reused for amd64; modulo some restructuring to allow the same code to be compiled appropriately for the two kernel builds. However, late in the development cycle, the guys in our floating point group pointed out that we didn't capture the results of floating point exceptions properly; the result of a subtle difference in the way that AMD and Intel processors presented information to the kernel after the floating point exception had been acknowledged. Fortunately they noticed this, and we rewrote the handler to be more robust and to behave the same way on both flavors of hardware.

Continuous Integration vs. One Giant Putback

To try to keep our merging and synchronization efforts under control, we did our best to integrate many of the changes we were making directly into the Solaris 10 gate so that the rest of the Solaris development organization could see it. This wasn't a willy-nilly integration of modified files; instead each putback was a regression-tested subset of the amd64 project that could stand alone if necessary. Perhaps I should explain this a little further. The Solaris organization has, for many years, tried to adhere to the principle of integrating complete projects, that is, changes that can stand alone, even if the follow-on projects are cancelled, fail, or become too delayed to make the release under development. Some of the code reorganization we needed was done this way, as well as most of the items I described as "prework" in part 1. There were also a bunch of code removal projects we did that helped us avoid the work of porting obsolete subsystems and support for drivers. As an aside, it's interesting to muse on exactly who is responsible for getting rid of drivers for obsolete hardware; it's a very unglamorous task, but one that's highly necessary if you aren't to flounder under an ever more opaque and untestable collection of crufty old source code.

In the end though, we got to the point where creating and testing subsets of our change by hand to create partial projects in Solaris 10 became just too painful for the team to countenance. Instead, we focussed on creating a single delivery of all our change in one coherent whole. Our Michigan-based "army of one," Roger Faulkner, did all of this, as well as most of the rest of the heavy lifting in userland, i.e. creating the 64-bit libc and basic C run-time etc. as well as the threading primitives. Roger really did an amazing job on the project.

Projects of this giant size and scope are always difficult; and everyone gets even more worried when the changes are integrated towards the end of a release. However, we did bring unprecedented levels of testing to the amd64 project, from some incredible, hard working test people. Practically speaking I think we did a reasonable job of getting things right by the end of the release, despite a few last minute scares around our mishandling of process-private LDTs. Fortunately these were only really needed for various forms of Windows emulation, so we disabled them on the 64-bit kernel for the FCS product; this works now in the Solaris development gate, and a backported fix is working its way through the system.

Not to say that there aren't bugs of course ...

Distributed Development

I think it's worth sharing some of the experiences of how the core team worked on this project. First, when we started, Todd Clayton (the engineering lead, who also did the segmentation work, among other things) and I asked to build a mostly-local team. We asked for that because we believed that time-to-market was critical, and we thought that we could go the fastest with all the key contributors in close proximity. However, for a number of reasons, that was not possible, and we ended up instead with a collection of talented people spread over many sites as geographically distributed as New Zealand, Germany, Boston, Michigan, and Colorado as well as a small majority of the team back in California.

To help unify the team and make rapid progress, we came up with the idea of periodically getting the team together physically in one place (either offsite in California or Colorado) and spending a focussed week together. We spent the first week occupying a contiguous block of adjacent offices in another building; problem was that we didn't really change the dynamics of the way people worked with each other. Our accidental discovery came during our first Colorado meeting where we ended up in one (large!) training room for our kick-off meeting. Rather than trudge back across campus where we had reserved office space, we decided to stay put and just start work where we were, and suddenly everything clicked. We stayed in the room for the rest of the week, working closely with each other, immersing ourselves in the project, the team, and what needed to be done. This was very effective, because as well as reinforcing the sense of team during the week away, everyone was able to go back to their home sites and work independently and effectively for many weeks before meeting up again - with only an occasional phone call or email between team-members to synchronize.

Looking Back

I've tried to do a reasonable tour of the amd64 project, driven mostly by what stuck in my memory, and biassed by the work I was involved in to some degree, but obviously much detail has been omitted or completely forgotten. To the people at Sun whose work or contribution I've either not mentioned, foolishly glossed over or forgotten completely, sorry, and thanks for your efforts. To the people at AMD that helped support us, another thank you. To our families and loved ones that put up with "one more make," yet more thanks. This was a lot of work, done faster than any of us thought possible, and 2004 was in truth, well, a bit of a blur.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Wednesday Jun 08, 2005

Solaris 10 on x64 Processors: Part 3 - Kernel

Virtual Memory

One of the most critical components of a 64-bit operating system is its ability to manage large amounts of memory using the additional addressing capabilities of the hardware. The key to those capabilities in Solaris is the HAT (Hardware Address Translation) "layer" of the otherwise generic VM system. Unfortunately, the 32-bit HAT layer for Solaris x86 was a bit long in the tooth and after years of neglect was extremely difficult to understand, let alone extend. So we decided on a ground-up rewrite pretty early on in the project; the eventual benefit of that was being able to use the same source code for both 32-bit and 64-bit mode, and to bring the benefits of the NX (no-execute) bit to both 32-bit and 64-bit kernels seamlessly. Joe Bonasera, who led this work, told me a few weeks ago that he'd expand on this in his own blog here, so I'm not going to describe it any further than that.

Interrupts, DMA, DDI, device drivers

The Solaris DDI (Device Driver Interface) was designed to support writing portable drivers between releases, and between instruction sets, to concentrate bus-dependent details and interfaces in specialized bus-dependent drivers (called nexus drivers), and to minimize the amount of low-level, bus-specific code in regular drivers (called leaf drivers). Most of the work we did on the 64-bit SPARC project back in 1997 was completely reused, and the majority of the work on the x86 DDI implementation was essentially making the code LP64 clean, and fixing some of the more hacky internals of some of the nexus drivers.

The most difficult part of the work was porting the low-level interrupt handlers, which were a monumental mass of confusing assembler. Though I had thought that it would be simplest to port the i386 assembler to amd64 conventions, this turned out to have been a poor decision. Sherry Moore tried to get this done quickly and accurately, but it was a very difficult challenge. We spent many days debugging problems with interrupts that were really rooted in the differences in register allocations between the two instruction set architectures and ABIs, as well as the highly contorted nature of the original code. We spent so much time on it that I eventually became consumed with guilt and rewrote most of it in C, which unsurprisingly turned out to be much easier to debug, and is now probably the best way to understand how the threads-as-interrupts implementation actually works.

The remaining work can be split into two parts. The first was ensuring that the drivers properly described their addressing capabilities, particularly those that hadn't been updated in a while. The second was the usual problem of handling ioctls from 32-bit and 64-bit applications where the two environments use different sizes and alignments for the data types passed across the interface. Again, Solaris already had a bunch of mechanisms for doing this which we simply reused on previously i386-specific drivers to make them usable on amd64 kernels too.
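To make the ioctl problem concrete, here's a user-space sketch of the general shape of the fix: the handler learns which data model the caller uses, reads the argument in the caller's layout, and widens it into the native one. Everything named here (the enum, both structs, the copy-in function) is hypothetical and invented for illustration; a real driver would get the caller's data model from the DDI rather than from a parameter like this.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag for the caller's data model. */
enum data_model { MODEL_ILP32, MODEL_LP64 };

/* Native view of a made-up ioctl argument, as the 64-bit kernel sees it. */
struct xfer_req {
    void     *buf;   /* native pointer */
    uint64_t  len;   /* stands in for a 64-bit size */
};

/* The same argument as a 32-bit application lays it out. */
struct xfer_req32 {
    uint32_t buf;    /* 32-bit user address */
    uint32_t len;
};

/*
 * Convert the caller's bytes into the native structure.
 * Returns 0 on success, -1 if the caller's buffer is too short.
 */
static int xfer_req_copyin(const void *arg, size_t arglen,
                           enum data_model model, struct xfer_req *out)
{
    assert(arg != NULL && out != NULL);

    if (model == MODEL_ILP32) {
        struct xfer_req32 r32;

        if (arglen < sizeof(r32))
            return -1;
        memcpy(&r32, arg, sizeof(r32));
        out->buf = (void *)(uintptr_t)r32.buf;  /* zero-extend the address */
        out->len = r32.len;
        return 0;
    }

    if (arglen < sizeof(*out))
        return -1;
    memcpy(out, arg, sizeof(*out));
    return 0;
}
```

The point of the pattern is that only the copy-in (and the matching copy-out) knows about the 32-bit layout; the rest of the driver works on the native structure.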

One slight thorn in our side was the difference in alignment constraints for the long long data type. On 32-bit SPARC and 64-bit SPARC, the alignment is 8 bytes for both; however, between i386 and amd64, the alignment changes from 4 bytes to 8 bytes. This seems mildly arcane, until you recall that the alignment of these data types controls the way that basic data structures are laid out between the two ABIs. Data structures containing long long types that were compatible between a 32-bit SPARC application and the 64-bit SPARC kernel now needed special handling for a 32-bit x86 application running on a 64-bit amd64 kernel. The same problem was discovered in a few network routing interfaces, cachefs, priocntl etc. Once we'd debugged a couple of these by hand, Ethan Solomita started a more systematic effort to locate the remaining problems; Mike Shapiro suggested that we build a CTF tool that would help us find the rest more automatically, or at least semi-automatically, which was an excellent idea and helped enormously.
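A small sketch shows why this bites. The two structures below have identical members, but the first is forced to the i386 rule (64-bit integers aligned to 4 bytes) and the second to the 8-byte rule, so the 64-bit member lands at a different offset and the overall size changes. The structure names are invented, and using `#pragma pack` and `_Alignas` to pin the two layouts is my own device for making the example behave the same on any build host, not a description of how the Solaris headers do it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout seen by a 32-bit x86 caller: 64-bit integers need only 4-byte alignment. */
#pragma pack(push, 4)
struct stamp_i386 {
    int32_t id;
    int64_t when;               /* offset 4, no padding */
};
#pragma pack(pop)

/* Layout with 8-byte alignment for 64-bit integers (amd64, and both SPARC ABIs). */
struct stamp_lp64 {
    int32_t id;
    _Alignas(8) int64_t when;   /* offset 8, after 4 bytes of padding */
};

_Static_assert(offsetof(struct stamp_i386, when) == 4, "i386 view: when at 4");
_Static_assert(sizeof(struct stamp_i386) == 12, "i386 view: 12 bytes");
_Static_assert(offsetof(struct stamp_lp64, when) == 8, "LP64 view: when at 8");
_Static_assert(sizeof(struct stamp_lp64) == 16, "LP64 view: 16 bytes");

/* The "special handling": copy member by member instead of byte for byte. */
static void stamp_from_i386(const struct stamp_i386 *in, struct stamp_lp64 *out)
{
    assert(in != NULL && out != NULL);
    out->id = in->id;
    out->when = in->when;
}
```

A byte-for-byte copy between the two views would read the high half of `when` from padding, which is exactly the kind of bug that is tedious to find by hand and easy to find with type information.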

MP bringup, EM64-T bringup

Back in 1990, one of the core design goals of the SunOS 5.0 project was to build a multithreaded operating system designed to run on multiprocessor machines. We weren't just doing a simple port of SVR4 to SPARC, we reworked the scheduler, and invested a large amount of effort throughout the kernel, adding fine-grain locking to extract the maximal concurrency from the hardware. Fast forward to 2005, and we're still working on it! The effort to extend scalability remains one of our core activities. However, we didn't have to do a lot of work to make multiprocessor Opteron machines run the 64-bit kernel; apart from porting the locking primitives, the only porting work was around creating a primitive environment around the non-boot processors to switch them into long mode. William Kucharski (of amd64 booter fame) did this work in a week or so, and impressed us all with how quickly and how well this worked from the beginning.

We also wanted to run our 64-bit kernel on Intel's EM64-T CPUs, since we really do want Solaris to run well on non-Sun x86 and x64 systems. As we were doing other work on the system, we had been anticipating what we needed to do from Intel's documentation, so as soon as the hardware was publicly available (unfortunately we weren't able to get it earlier from Intel) Russ Blaine started working on it and had the 64-bit kernel up and running multiuser in about a week. I'm not sure if that's because Intel's specifications are particularly well written, or because Russ's debugging skills were even more excellent that week, or if it's testament to the skills of the Intel engineers at making their processor be so compatible with the Opteron architecture, but we were pretty pleased with the result.

Debugging Infrastructure

Critical aspects of the debugging architecture of Solaris that needed to be ported include the CTF system for embedding dense type information in ELF files, and the corresponding library and toolchain infrastructure that manipulates it, libproc that encapsulates a bunch of /proc operations for the ptools, /proc itself, mdb, and the DTrace infrastructure. I worked on the easy part - /proc - the difficult work was done by Matt Simmons, Eric Schrock and for DTrace, Adam Leventhal and of course Bryan Cantrill.

At the same time as we were starting our bring-up efforts on Opteron, an unrelated project in the kernel group was busy creating a new debugging architecture based on mdb(1). The basic idea was that we wanted to be able to bring most of mdb's capabilities to debugging live kernel problems. The kmdb team observed that our existing kernel debugger, kadb, was always in a state of disrepair, and yet because of its co-residence with the kernel, needed constant tweaking for new platforms. So rather than continue this state of affairs, they came to the idea that it would be simpler if we could assume that the Solaris kernel would provide the basic infrastructure for the debugger.

This has considerable advantages for incremental development, and for the vast majority of kernel developers who aren't working on new platform bringup this is clearly a Good Thing. But it does make porting to a fresh platform or instruction set a little more difficult because kmdb is sophisticated, and doesn't really work until some of the more difficult kernel code has been debugged into existence. The amd64 project had that problem in a particularly extreme form, because the debugger design and interfaces were under development at the same time as we needed them. As a result, the early amd64 kernel bringup work was really done using a simulator (SIMICS), and then by doing printf-style debugging and post-mortem trap-tracing, rather than with kmdb. I still remember debugging init(1M) using the simulator on the last day of one of our offsites in San Francisco, figuring out the bug while riding BART back home.

At this point of course, kmdb works fine and is of great help when debugging more subtle problems. However, knowing what we know now, we should have built a simple bringup-debugger to get us through those early stages where almost nothing worked. Something that could catch and decode exceptions, do stack traces and dump memory would be enough. I'd certainly recommend that path to anyone thinking of porting Solaris to another instruction set architecture; as soon as you get to the point that the kernel starts taking interrupts and doing context switches, things get way too hard for printf-style debugging!
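As an illustration of how little the memory-dump part of such a bringup debugger needs, here's a user-space sketch of a hex dumper. It is not from any Solaris debugger: the function names are invented, it leans on snprintf (an early kernel would substitute its own formatter), and the output goes through a caller-supplied callback that stands in for whatever console routine happens to work at that stage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format up to 16 bytes at `p` as "addr: xx xx ..." into `out`.
 * Returns the number of characters the full line needs.
 */
static int dump_line(char *out, size_t outlen, unsigned long addr,
                     const uint8_t *p, size_t n)
{
    assert(out != NULL && outlen > 0 && n <= 16);

    int off = snprintf(out, outlen, "%08lx:", addr);
    for (size_t i = 0; i < n && off >= 0 && (size_t)off < outlen; i++)
        off += snprintf(out + off, outlen - (size_t)off, " %02x",
                        (unsigned)p[i]);
    return off;
}

/* Dump `len` bytes starting at `start`, 16 per line, via `emit`. */
static void dump_memory(const void *start, size_t len,
                        void (*emit)(const char *line))
{
    const uint8_t *p = start;
    char line[80];

    for (size_t done = 0; done < len; done += 16) {
        size_t n = (len - done < 16) ? len - done : 16;

        dump_line(line, sizeof(line),
                  (unsigned long)(uintptr_t)(p + done), p + done, n);
        emit(line);
    }
}
```

Decoding exceptions and walking stack frames are more architecture-specific, but they can be built in the same spirit: small routines with no dependencies beyond a working console.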

System calls Revisited

For 64-bit applications we used the syscall instruction. We used the same register calling conventions as Linux; these are somewhat forced upon you by the combination of the behaviour of the instruction, and the C calling convention, and besides, there is no value in being deliberately different.

Interestingly, the 64-bit system call parameter passing convention is extremely similar to SPARC i.e. the first six system call arguments are passed in registers, with additional arguments passed on the stack. As a result, we based the 64-bit system call handler algorithm for amd64 on the 64-bit handler for sparcv9.
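The "six in registers, the rest on the stack" rule can be sketched as a simple marshalling step. This is a simulation rather than handler code: the frame structure, the function name and the limit of eight arguments are all assumptions made for the example. On Linux/amd64 the six register slots correspond to %rdi, %rsi, %rdx, %r10, %r8 and %r9, in that order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NREGARGS 6   /* arguments carried in registers */
#define MAXARGS  8   /* hypothetical upper bound for this sketch */

struct syscall_frame {
    uint64_t reg[NREGARGS];              /* first six arguments */
    uint64_t stack[MAXARGS - NREGARGS];  /* overflow arguments, in order */
    size_t   nstack;                     /* how many spilled to the stack */
};

/* Place the first six arguments in register slots and spill the rest. */
static void marshal_args(struct syscall_frame *f, const uint64_t *args,
                         size_t nargs)
{
    assert(f != NULL && nargs <= MAXARGS);

    f->nstack = 0;
    for (size_t i = 0; i < nargs; i++) {
        if (i < NREGARGS)
            f->reg[i] = args[i];
        else
            f->stack[f->nstack++] = args[i];
    }
}
```

A handler built on this convention only has to touch user memory for calls with more than six arguments, which is the rare case.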

The 32-bit system call handlers include the 32-bit variant of the syscall instruction which works sufficiently well when the processor is running the 64-bit kernel to be usable. We also made the sysenter instruction work for Intel CPUs, and of course, the lcall handler; though this is actually handled via a #np trap in C. Our latest version of this assigns a new int trap to 32-bit syscalls which will improve the performance of the various types of system call that don't work well with plain syscall or sysenter.

More Tool Chain Issues

In the earlier "preliminaries" blog, I mentioned our use of gcc; however the Solaris kernel contains its own linker, krtld, based on the same relocation engine used in the userland utility. Fortunately, we had Mike Walker to do the amd64 linker work early on; we had a working linker a week or two ahead of having a linkable kernel.

One more thing

In my first posting on this topic I neglected to mention that there's a really good reference work for people trying to navigate the Solaris kernel - the book by Jim Mauro and Richard McDougall called Solaris Internals: Core Kernel Components; ISBN 0130224960.

Next time, I'll describe more of the userland work that completed the port.

Technorati Tag: OpenSolaris
Technorati Tag: Solaris

Monday May 16, 2005

CTO Advisory Council

Sun's CTO, Greg Papadopoulos, invited me to a customer advisory council event that he held a couple of weeks ago. This was similar to a customer advisory council that Jonathan Schwartz wrote about here. The interesting twist here was that the attendees were a mix of CTOs from Sun's business units and equivalent CTO/CIOs from some of Sun's largest customers.

It was fascinating to hear what things really keep our customers up at night, and their perceptions of our technology strengths and gaps. For our part, we discussed different aspects of our technology, including processors, platforms, operating systems, middleware and the Sun Grid.

Reactions varied - but apart from occasional surprises ("I didn't know Sun even did [this product or that service] .."), we heard the following (paraphrased) question: "but why are you investing in [this component]?" This theme was particularly evident when we talked about the capabilities of each area, isolated from the others. Perhaps the way we presented our technologies reinforced the common misperception that all computer system components are now just "commodities," i.e. "done," at least as far as technological innovation goes.

But by the end of the day, the customer CTOs seemed to have internalized our set of investment areas; and had a clearer idea of how each component investment related to the other, and thus improved the effectiveness of the whole offering. A great example was our CMT processor technologies, ambitious new platform designs, Solaris 10 threading and related observability capabilities like DTrace and plockstat, and so on up the stack to the Sun Grid. And then we reached one of those "Eureka!" moments, when I think the room did the synthesis step, and rediscovered the intrinsic value of a systems company - that takes an integrated approach to the coevolution of all the components of an information processing system.

Then after a brief moment basking in our little ray of sunshine, we were quickly taken to task for not communicating that message to the marketplace effectively! Sun is a company that lives and breathes innovation, and tends to focus on detailed technology messages. But for many of these CTOs and CIOs, their executive staffs need simpler, more direct, messages that make sense to less technical people.

Everyone is searching for simplicity, in particular, simple answers to complex problems, yet the levels of abstraction spanned by modern operating systems make their implementation intrinsically complex, and it's easy to get lost in the detail. For my organization, and the other members of the OpenSolaris community, our biggest challenge may simply be communicating the full value of Solaris 10 and OpenSolaris to a broad audience - both as a set of component technologies, and as a systems technology.

One of the participants suggested that Sun needs to find a "great communicator" who can take our technology portfolio, and translate it into something more accessible to non-technical people, to really get our messages and core values across.

That's an interesting idea, and one that's certainly set us thinking.

Wednesday May 11, 2005

Solaris 10 on x64 Processors: Part 2 - Getting Started

Booting and Startup
Some of the trickier issues with porting Solaris to a new platform architecture originate from some of the decisions we made 15 years ago. This is a complex story, and one that we're soon to drastically improve on x86/x64 systems with newboot (as demoed recently by Jan Setje-Eilers at the first OpenSolaris User Group meeting), but I'll try to relate the minimum needed so that you understand why we took the path that we did for Solaris on x64 processors.

The Solaris 2 booting system was originally designed back in 1991 in the context of the problems and issues we had in the world of SunOS 4.x, our desires to have the hardware and software organizations in the company execute relatively independently, as well as to support the then-blossoming clone market for SPARC workstations. We wanted to enable both ourselves and 3rd parties to deliver support for new hardware without having to change the Solaris product CD bits - except by adding new driver modules to it. Also remember that speeds and feeds were two orders of magnitude slower then, and that kernel text was a larger proportion of the minimum memory size than it is today so we had to be far more selective about which modules to load as we booted.

The design we came up with starts with the primary bootstrap loading the secondary booter, and the secondary booter putting the core kernel into memory. The kernel then turns around and invokes various services from the secondary booter using the bootops interface to allow the kernel to discover, load and assemble the drivers and filesystem modules it needs as it initializes itself and explores the system it finds itself on. Once it determines it has all the modules it needs to mount the root filesystem, it takes over IO and memory allocation, mounts the root, and (if successful) continues to boot by loading additional kernel modules from that point on.

Note that this early part of the boot process starts out with the secondary boot program being the one true resource allocator i.e. in charge of physical memory, virtual memory and all I/O, and ends with the kernel being that resource allocator at the end. While moving from the former state to the latter sounds simple in principle, it's quite complex in practice because of the incremental nature of the handoff. For example, the DDI (device driver interfaces) aren't usable until the kernel has initialized its infrastructure for managing physical and virtual memory. So we have to load the modules we might need based on the name of the root device given to us by the booter which in turn comes from OpenBoot. Kernel startup somehow has to get most of the VM system initialized and working, yet still allow the boot program and its underlying firmware to do I/O to find the drivers and filesystem modules it needs from the boot filesystem to mount the filesystem. Practically speaking, this entails repeated resynchronization between the boot program and the kernel over the layout and ownership of physical and virtual memory. In other words, the kernel tries to take over physical and virtual memory management while effectively avoiding conflicting with the secondary booter and the firmware using the same resources. It's really quite a dance.
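To make the handoff a little more concrete, here is a minimal sketch of the idea in C. It is not the real bootops interface: the structure, the function names and the two fixed-size arenas are all invented for illustration. All it shows is a single allocation entry point that is served by the booter until the kernel declares that it owns memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a boot services vector; not the real bootops. */
struct boot_services_sketch {
	void *(*alloc)(size_t nbytes);
};

/* Two toy memory pools, one per owner (sizes are arbitrary). */
static unsigned char booter_arena[4096];
static unsigned char kernel_arena[4096];
static size_t booter_used;
static size_t kernel_used;

/* Simple bump allocation; alignment is ignored to keep the sketch short. */
static void *
bump(unsigned char *arena, size_t size, size_t *used, size_t nbytes)
{
	if (nbytes > size - *used)
		return (NULL);
	void *p = arena + *used;
	*used += nbytes;
	return (p);
}

static void *
booter_alloc(size_t nbytes)
{
	return (bump(booter_arena, sizeof (booter_arena), &booter_used, nbytes));
}

static void *
kernel_alloc(size_t nbytes)
{
	return (bump(kernel_arena, sizeof (kernel_arena), &kernel_used, nbytes));
}

static const struct boot_services_sketch booter = { booter_alloc };
static int kernel_owns_memory;	/* 0 until the kernel takes over */

/* Early kernel code calls this without caring who currently owns memory. */
void *
early_alloc(size_t nbytes)
{
	return (kernel_owns_memory ? kernel_alloc(nbytes) : booter.alloc(nbytes));
}

/* The one-way handoff: from now on the kernel is the allocator. */
void
take_over_memory(void)
{
	assert(!kernel_owns_memory);
	kernel_owns_memory = 1;
}

/* Helper for inspection: did this pointer come from the booter's pool? */
int
allocated_by_booter(const void *p)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t lo = (uintptr_t)booter_arena;
	return (a >= lo && a < lo + sizeof (booter_arena));
}
```

The real handoff is incremental rather than a single flag flip, which is exactly where the repeated resynchronization described above comes from.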

For the x86 port, a similar approach was used, using real-mode drivers that were placed onto a boot floppy as an analogue of the OpenBoot drivers in the SPARC world to construct a primitive device tree data structure analogous to the OpenBoot device tree. (In 1995, for the PowerPC port, we implemented "virtual open firmware" which was an even closer simulation of OpenBoot to make it easier to reuse SPARC boot and configuration code).

Note that the x86 secondary boot program itself runs in protected mode like the kernel; it is responsible for switching to real-mode and back to run the real-mode drivers.

Six years go by ...

During that time, specifically in Solaris 2.5, we made things even more complicated for hardware bringup by splitting the basic kernel into separate modules: unix, genunix and the kernel linker krtld. These make bringup more difficult because genunix and krtld are not relocated until run-time, so mapping a hexadecimal address where the kernel has fallen over in genunix or krtld back to the code becomes a significant pain, absent a debugger like kmdb or its predecessor kadb.

Now, in 1997 when we created 64-bit Solaris for SPARC, it was relatively simple to make a 64-bit boot program use the OpenBoot interfaces; there is no "mode" switch between operating a SPARC V9 processor in "32-bit" mode or "64-bit" mode - apart from the address mask, the rest of the difference was entirely about software conventions, not hardware per se. So we didn't really have to do anything to the basic boot architecture; this part of the 64-bit Solaris project on SPARC was really quite straightforward, mostly a matter of getting the boot program LP64-clean.

Meanwhile, in late 2003 ...

The Opteron architecture presented us with a far more difficult challenge because the processor needed to switch to long mode via an arcane sequence of instructions, including switching between different formats of page tables and descriptor tables. Worse still, the 64-bit Solaris kernel on Opteron would need to turn around and invoke what it would think of as a 64-bit boot program running in long mode in order to fetch modules from the disk (and, as discussed above, that program would invoke real-mode code and the BIOS to do so)!

Our initial approach was to use the existing, unmodified, protected mode 32-bit booter and have it boot an interposer called vmx that used the 32-bit booter to load the 64-bit kernel into double mapped memory. In this case, "double mapped" means that there's an (up to) 4Gbyte area of physical memory that is (a) mapped by 32-bits of VA in the protected mode page tables and (b) mapped by the bottom 32-bits of VA and the top 4G of VA in the long mode page tables. The interposer then pretended to be a 64-bit booter to the 64-bit kernel. When the 64-bit kernel asked vmx for a boot service via the bootops vector (fortunately a relatively small and well-behaved interface), vmx quickly switched back to protected mode, then after massaging the arguments appropriately, invoked the bootops of the 32-bit, protected mode booter to provide the service. That service would in turn often result in the protected mode booter switching the processor back to real mode to deliver it. Though our colleagues at AMD winced at the thought of the poor processor switching back and forth hundreds or thousands of times through so many years of x86 history, Opteron didn't mind this at all.
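As an illustration of what "massaging the arguments" can look like with double mapped memory, here is a small sketch in C. It is an assumption about how the aliasing could be exploited, not the actual vmx code, and the function name is made up. A 64-bit virtual address that lies in the bottom 4G or in the top 4G of the address space aliases the same memory as its low 32 bits, so it can be narrowed before being handed to a 32-bit service; anything else is rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: narrow a long mode virtual address to the 32-bit
 * address that aliases it in the double mapped region. Returns false if
 * the address is outside both the bottom 4G and the top 4G.
 */
bool
va64_to_va32(uint64_t va64, uint32_t *va32)
{
	uint32_t upper = (uint32_t)(va64 >> 32);

	assert(va32 != NULL);
	if (upper != 0 && upper != UINT32_MAX)
		return (false);		/* not in the double mapped range */
	*va32 = (uint32_t)va64;		/* keep only the low 32 bits */
	return (true);
}
```

A caller would narrow every pointer argument this way before switching to protected mode, and fail the request if any pointer could not be narrowed.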

Finally, before we delivered the code to the Solaris 10 gate, William Kucharski, who took the half-thought-out prototype and made all this insanity actually work correctly, integrated the vmx code inside the 32-bit booter so this component is invisible in the final product.

What else could we have done?

We could've made the boot program be completely 64-bit, and thus have that program deal with the mode switching from long, to protected to real to invoke the BIOS and back again. While possible, it would've involved porting a bunch of code in the x86 boot program to LP64, and reworking both protected-mode and real-mode assembler in the boot program to fit into an ELF64 environment. The latter seemed like a lot more work than we wanted, even if we'd somehow managed to convince the assembler and linker to support it.

Another suggestion was to somehow jump into the 64-bit kernel and do magic to allow us to call back to 32-bit boot services there. But that seemed to us to be just a matter of finding a different place to put the code we had in vmx; we thought putting it into boot where it will be eventually reused by the system was better than making that ugliness live in the kernel proper.

The other option on the table was to become dependent on the newboot project which we were planning at the time to bring us into the modern world of open source kernel booting; but we were unwilling to wait, or to force that dependency to be resolved earlier because of schedule risk.

Descriptor Tables and Segmentation

It quickly becomes apparent to students of the x86 architecture that with x64, AMD tried hard to preserve the better parts of the x86 segmentation architecture, while trying to preserve compatibility sufficient to allow switching to and from long mode relatively painlessly. But to the kernel programmer, it only seems to get more complicated. In previous versions of Solaris, we used to build the descriptor tables (almost) statically, with a single pass over the tables to munge IDT and GDT structures, from a form easy to initialize in software, into the form expected by the hardware. Early on, we realized it was worth bringing this up-to-date too, so we discarded what we had (for both 32-bit and 64-bit Solaris) and used the FreeBSD version which uses a series of function calls to build table entries piece by piece.
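For readers who haven't met the hardware format, here is a rough sketch in C of building one entry with a function call. This is neither the FreeBSD nor the Solaris code, and the function name is invented; it simply packs the legacy 8-byte x86 segment descriptor, where the base and limit fields are scattered across the word, which is why building entries by hand is so error-prone.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack a legacy 8-byte segment descriptor.
 *   bits  0-15  limit[15:0]     bits 40-47  access byte
 *   bits 16-39  base[23:0]      bits 48-51  limit[19:16]
 *   bits 52-55  flags           bits 56-63  base[31:24]
 */
uint64_t
make_segment_descriptor(uint32_t base, uint32_t limit, uint8_t access,
    uint8_t flags)
{
	uint64_t d = 0;

	assert(limit <= 0xFFFFF);	/* the limit field is only 20 bits */
	assert(flags <= 0xF);		/* the flags field is only 4 bits */

	d |= (uint64_t)(limit & 0xFFFF);
	d |= (uint64_t)(base & 0xFFFFFF) << 16;
	d |= (uint64_t)access << 40;
	d |= (uint64_t)((limit >> 16) & 0xF) << 48;
	d |= (uint64_t)(flags & 0xF) << 52;
	d |= (uint64_t)(base >> 24) << 56;
	return (d);
}
```

For example, a flat 32-bit code segment (base 0, limit 0xFFFFF, access byte 0x9A, flags 0xC) packs to 0x00CF9A000000FFFF.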

Some of our more amusing early mistakes here included copying various IDT entries for machine exceptions as "trap" type exceptions instead of interrupt-type exceptions which caused real havoc when an interrupt would sneak in before the initial swapgs instruction of the handler. All terribly obvious now, but less than obvious at the time.

One optimization we made was to exploit the fact that the data segment descriptor registers %ds and %es were effectively ignored in long mode. Further, %cs and %ss were always set correctly by the hardware, %fs was unused by the kernel, and %gs has a special instruction to change the base address underlying the segment. Taken together this led us to a scheme of lazy update of segment registers; we only update segment registers on the way out of the kernel if we know that something needs to be changed.
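The lazy update idea can be sketched in a few lines of C. The structure and function names below are made up for illustration, and the "reload" is just a counter standing in for the expensive register writes; the point is only that the common exit path is a single test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-lwp state; the real kernel structures differ. */
struct lwp_sketch {
	bool segs_dirty;	/* something changed the segment state */
	unsigned reloads;	/* counts the expensive reloads performed */
};

/* Called by whatever code changes the lwp's segment state. */
void
mark_segments_dirty(struct lwp_sketch *lwp)
{
	assert(lwp != NULL);
	lwp->segs_dirty = true;
}

/* Stand-in for actually rewriting the segment registers. */
static void
reload_segment_registers(struct lwp_sketch *lwp)
{
	lwp->reloads++;
}

/* On the way out of the kernel: reload only if something changed. */
void
kernel_exit(struct lwp_sketch *lwp)
{
	assert(lwp != NULL);
	if (lwp->segs_dirty) {
		reload_segment_registers(lwp);
		lwp->segs_dirty = false;
	}
}
```

With this shape, any number of kernel exits cost nothing extra until something marks the state dirty, and then exactly one reload happens.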

Next time I'll try and touch on the rest of the work involved with the kernel port proper.


Tuesday May 10, 2005

The Xen Summit

A few weeks ago I attended the Xen summit in Cambridge, UK. Xen is an open-source hypervisor project being driven by Ian Pratt from the University's Computer Laboratory. Xen is attracting a lot of interest as the de facto open source hypervisor for commodity hardware.

Xen is designed to be a thin layer of software which allows multiple kernels to run on a single machine. Among the many cool things this new layer of virtualization allows is OS checkpoint and resume, which the Xen team have used to good effect in their workload migration experiments. (I was at a Linuxworld BOF last March where the sound of jaws dropping as Ian presented their results was quite evident!) Take a look at the papers on their website - it's pretty cool stuff.

Xen is based on paravirtualization; that is, you have to make some changes to the low-level kernel to allow the OS to run. This is both because that was easier to do on existing x86 hardware and, more importantly, because it performs better than other approaches.

Anyhow, we've been looking at Xen for about a year now, and recently a few of us have been working on a prototype port of Solaris to Xen on the x86 architecture. We're planning to make it work on x64 machines where we can exploit the new hardware virtualization technology as it becomes available. We're also planning to make the Solaris side of things into an OpenSolaris community project too; particularly since Xen is itself an open source project. Although we're still working on the mechanics of all that, I'd like to hear from people who want to participate.

Some of the comments below imply that you might think I'm only interested in help from kernel developers. I'm also interested to hear from people who are already using Xen, and are prepared to experiment with, and give us feedback on, early alpha-class builds of Solaris on Xen too.

Another Update
The OpenSolaris on Xen community is now up: see the OpenSolaris web site to participate.


Solaris 10 on x64 Processors: Part 1 - Prework

SSE Support
Back in 2002, after we had resurrected Solaris on x86, we realized that we needed to get back to basics with a number of core kernel subsystems, because while we'd been slowly disinvesting in Solaris on x86, the x86 hardware world had been scampering off doing some really interesting things e.g. the Streaming SIMD Extensions (SSE) to the instruction set, and introducing fast system call mechanisms. We also knew that these were basic assumptions of the 64-bit architecture that AMD was working on, so we started work on the basic kernel support which allows the xmm and mxcsr registers to be part of the state of every Solaris lwp. At the same time, Matt Simmons helped out with the disassembler and debugger support for the new instructions. This didn't take too long, and the work was integrated into Solaris 10 in October 2003, and Solaris 9 Update 6. One of the immediate benefits was Java floating point performance which used SSE instructions on capable platforms, and on the right hardware, Solaris was now one of those platforms!

Fast System Calls
In earlier Solaris releases, system calls were implemented using the lcall instruction; we'd been faithfully doing this for years, without really noticing that the performance of call gates was falling further and further behind. For Solaris 10, we decided to make fast system calls work using the sysenter instruction that was first introduced on Pentium II processors. Because of some awkward limitations around the register usage of the sysexit instruction (in particular, dealing with system calls that return two values), plus our desire to run on older machines, we also keep the older lcall handler around too.

First, I should explain something about the way Solaris system call handlers work in general. As you can imagine, in a highly observable system like Solaris, we can, in some circumstances, end up doing a lot of work in the system call enter and system call exit path. But, most of the time, we don't actually need to do all the checks, so more than 10 years ago, one of my former colleagues restructured SPARC system calls to do a single test for all possible pre-work on the per-lwp thread variable called t_presys, and a single test for all possible post-system call handler work on another thread variable called t_postsys. The system call handler is then constructed assuming that the t_presys and t_postsys cases are rare - but if either t_presys or t_postsys is set e.g. by the current system call, previous system call, or via /proc, we handle the relevant rare case in C code, allowing us to code the fast path in a small amount of assembler. To summarize:

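if (curthread->t_presys)
        presys();              /* rare pre-work, handled in C */
call the system call handler
if (curthread->t_postsys)
        postsys();             /* rare post-work, handled in C */
return-from-trap sequence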

Obviously the Solaris x86 architecture mirrored this to some extent, but the presys() and postsys() functions had been partially rendered from C into assembler which was, as usual, difficult to understand, port and maintain, and wasn't even particularly fast. So the initial exercise was to turn the slow paths back into C code, and macro-ize the assembler code involved in performing the pre and post checks so that different syscall handlers could easily share code. Then I coded up a sysenter style handler, and we were pretty impressed with the results on our system call microbenchmarks.
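The single-test structure is easy to model in ordinary C. The sketch below is illustrative only: the thread structure is a toy, the slow paths just bump a counter, and syscall_dispatch() is a made-up name; only the t_presys and t_postsys flag names and the presys()/postsys() function names come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy thread structure; the real one has far more in it. */
struct thread_sketch {
	int t_presys;		/* nonzero: pre-syscall work is pending */
	int t_postsys;		/* nonzero: post-syscall work is pending */
	int slow_calls;		/* counts trips through the slow paths */
};

typedef long (*handler_t)(long arg);

/* The rare cases; here they only record that they ran. */
static void
presys(struct thread_sketch *t)
{
	t->slow_calls++;
}

static void
postsys(struct thread_sketch *t)
{
	t->slow_calls++;
}

/* Fast path: one test before the handler, one test after it. */
long
syscall_dispatch(struct thread_sketch *t, handler_t handler, long arg)
{
	long rv;

	assert(t != NULL && handler != NULL);
	if (t->t_presys)
		presys(t);
	rv = handler(arg);
	if (t->t_postsys)
		postsys(t);
	return (rv);
}

/* A trivial handler to exercise the dispatcher. */
long
double_it(long x)
{
	return (2 * x);
}
```

When both flags are clear, the only overhead around the handler is the two tests, which is what makes it practical to keep the fast path in a small amount of assembler.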

Hardware Capability Architecture
All this kernel work was fun, but we didn't have a clear idea of how we were going to let libc use fast instructions on machines capable of handling them, and fall back to lcall on machines that couldn't. We also noted that when AMD processors are running in long mode, sysenter is not supported but syscall (similar but different) is.

Earlier attempts to introduce support for this facility had considered using either the libc_psr mechanism that we introduced in Solaris 2.5 for dealing with the fast bcopy instructions available on UltraSPARC platforms, or using the isalist mechanism. The former scheme assumes that the instruction set extensions are specific to a platform, while the latter implicitly assumes that there is a small set of instruction set extensions that are additive and act to improve performance. However, we realized that in the x86 world we weren't dealing with platform extensions so much as processor extensions, and that processor vendors were adding instruction set extensions orthogonally, so we'd be better off describing each instruction set extension by close analogy to the way the vendors were describing them in the cpuid instruction, i.e. via a bit value in a feature word. See getisax(3C) for programmatic access to the kernel's view; <sys/auxv_386.h> contains the list of capabilities we expose.
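To show the flavour of "a bit value in a feature word", here is a small C sketch. The CAP_* constants and function names are invented for this example and are not the values in the Solaris headers, and the selection policy is a simplification of the constraints described above: use syscall when running in long mode on AMD, sysenter where it is available otherwise, and fall back to lcall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits; not the real header values. */
#define	CAP_SYSENTER	(1u << 0)	/* sysenter/sysexit available */
#define	CAP_SYSCALL	(1u << 1)	/* syscall/sysret available */
#define	CAP_SSE		(1u << 2)	/* SSE instructions available */

enum syscall_mech { MECH_LCALL, MECH_SYSENTER, MECH_SYSCALL };

/* True if every bit in 'wanted' is present in the feature word. */
bool
has_caps(uint32_t caps, uint32_t wanted)
{
	assert(wanted != 0);	/* asking for nothing is a caller bug */
	return ((caps & wanted) == wanted);
}

/* Pick the fastest system call mechanism the processor can handle. */
enum syscall_mech
choose_syscall_mech(uint32_t caps, bool amd_long_mode)
{
	if (amd_long_mode)
		return (has_caps(caps, CAP_SYSCALL) ? MECH_SYSCALL : MECH_LCALL);
	if (has_caps(caps, CAP_SYSENTER))
		return (MECH_SYSENTER);
	return (MECH_LCALL);		/* always works on older machines */
}
```

Because each extension is its own bit, a new extension only needs a new constant and a new test; nothing has to assume that extensions arrive in a fixed order.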

What we ended up with is (currently) three copies of the libc binary compiled different ways; the basic version in /lib/libc.so.1 is able to run on the oldest hardware we support, the newer versions in /usr/lib/libc correspond to more modern hardware running on a 32-bit or 64-bit kernel. Thanks to Rod Evans the libraries are marked with the capabilities they require, and the system figures out which is the best library to use on the running system at boot time. Last November, Darren Moffat wrote something up about how the system configures which libc it uses; there's no point in repeating that here.
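One way to picture the selection is as a subset test on capability masks. The sketch below is a guess at the general shape, not a description of what the system really does: the structure, the labels and the "most required capabilities wins" tie-break are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one library variant. */
struct lib_variant {
	const char *label;	/* illustrative label, not a real path */
	uint32_t required;	/* capability bits the variant needs */
};

/* Count the set bits in a capability mask. */
static int
count_bits(uint32_t x)
{
	int n = 0;

	while (x != 0) {
		x &= x - 1;	/* clear the lowest set bit */
		n++;
	}
	return (n);
}

/*
 * Return the usable variant that requires the most capabilities, or NULL
 * if none is usable. A variant is usable when every bit it requires is
 * present in hw_caps.
 */
const struct lib_variant *
pick_variant(const struct lib_variant *v, size_t n, uint32_t hw_caps)
{
	const struct lib_variant *best = NULL;
	int best_bits = -1;

	assert(v != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if ((v[i].required & ~hw_caps) != 0)
			continue;	/* needs something we lack */
		int bits = count_bits(v[i].required);
		if (bits > best_bits) {
			best = &v[i];
			best_bits = bits;
		}
	}
	return (best);
}
```

A variant that requires nothing plays the role of the basic library: it is always usable, so the search only returns NULL when the table is empty.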

The Tool Chain
The other key piece we needed for the amd64 kernel was to make the Solaris kernel compile and run with the GNU C compiler and assembler, so I started work on that too. Note that this wasn't because we didn't want to use the Sun compiler, it's just that it's easier to bring up an OS using a compiler that works, instead of debugging both the kernel and the compiler simultaneously. More critically, there wasn't a Sun compiler that would build 64-bit objects at the time. GNU C is great for finding bugs, and really complemented the capabilities of the Sun compiler and lint tools. I got the 32-bit kernel working fairly easily, and we were able to start the 64-bit project using this compiler, once we'd hacked up an initial configuration.

In the meantime, while we were completing and integrating some of these prerequisites into Solaris 10, we were assembling the main amd64 project team; work really started in earnest in January of 2004. Next time I'll describe some of our early problems with the porting work.


Hello, World

[Picture of Tim Marsland]

Hello. First some preliminaries. Who am I? Well, I'm a Distinguished Engineer in the Operating Platforms Group at Sun; I'm also the Chief Technical Officer for that group. The first part of that means that I get to work on Solaris; the second part means that it's a continuous struggle to do any engineering work in the face of too many meetings :) Among the other CTOs I've met both within Sun and outside of Sun, it's a relief to see that there's a wide interpretation of the role. Some CTOs are only concerned with abstract futures, some are solely buried in day-to-day issues. I try to be somewhere in the middle, and thus please nobody all of the time :(

I'm an engineer by calling; like my colleagues, I'm in this to build software artifacts that make other people's lives better. I've worked on Solaris for many years, on numerous subsystems and problems, from architectural direction to fixing bugs. We're very excited about Solaris 10 and how much it seems to be helping customers with their problems, and changing the way people think about Solaris, Sun, and Operating Systems technology in general. I think we're living in a time of transitions, and operating systems are relevant again, as the boundaries between software system components are shifting, hardware devices become ever more capable and complex, while new business models emerge.

During Solaris 10, my principal code contributions were around modernizing the Solaris kernel port to the x86 architectures, and on bringing 64-bit Solaris up on x64 platforms - our slightly boringly named "amd64" project. Some months ago a member of that team blogged about some of the work he'd been doing on that project, and hoped that someone would spend a bit more time talking about the other work that we did during the amd64 port. That seemed like a good topic for a blog entry, so that's what I thought I'd write about first.


