Friday Jul 20, 2007

The Xen of DomU Installation

More virtualization from me, moving on from the emulated Linux environment provided by Solaris Containers for Linux Applications, to Xen!

Let's say you've got this brand spanking new OpenSolaris release that you've downloaded, configured and installed on your machine. Now what?

Well, the whole point of virtualization is to allow you to run a guest OS on your machine, just as if you had two (or more) machines, so let's walk through the installation of a guest domain, otherwise known as a DomU, on your machine running Xen, otherwise known as a Dom0.

To make things as easy as possible, I'll assume that your machine, like most today, does not have hardware support for running a guest OS (otherwise known as an HVM DomU). Instead, let's install a paravirtualized guest. As I'm biased, let's install a second copy of OpenSolaris.

The provided documentation explaining how to do an install is pretty complete, so I'm not going to rehash the details here except to consolidate a few potential gotchas from the Release Notes and the configuration pages:

  1. If you install a DomU with the OpenSolaris installer's default settings, which were designed to allow installation on a system with a minimal configuration, you will not be able to upgrade the instance to a newer release of OpenSolaris later because of a lack of disk space.

    To avoid this situation, you should make sure to create a root ("/") partition of at least 1 GB in size, or 1.5 GB if you want to play it safe.

  2. Xen currently gets really cranky if it runs short on memory. Symptoms can range from simply refusing to boot a guest domain to madly trying to resize Dom0 to find memory to boot the DomU, with the result that your Dom0 may become catatonic for a long period of time, refusing to respond to inputs and even closing down network connections until it determines that it can't in fact shrink the Dom0's memory space to accommodate the DomU's memory requirements. Not good.

    Instead, it's best to avoid this situation by using a rule of thumb, namely that the combined memory usage of all DomUs you intend to run concurrently should never exceed approximately 55% of the physical RAM installed in your machine. Pretty straightforward stuff.
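
    As a rough illustration of that rule of thumb, here is a minimal Python sketch that checks a planned set of DomUs against the 55% budget. The function names are hypothetical, and the 0.55 factor is simply the guideline above, not a limit that Xen itself enforces.

```python
def max_domu_memory_mb(physical_ram_mb, fraction=0.55):
    """Combined DomU memory budget, in MB, under the 55% rule of thumb."""
    return int(physical_ram_mb * fraction)


def fits_budget(physical_ram_mb, domu_memory_mb):
    """True if the DomUs run concurrently stay within the budget."""
    return sum(domu_memory_mb) <= max_domu_memory_mb(physical_ram_mb)


# A 2 GB machine running a single 1 GB DomU stays inside the guideline;
# adding a second 512 MB DomU would not.
print(max_domu_memory_mb(2048))
print(fits_budget(2048, [1024]))
print(fits_budget(2048, [1024, 512]))
```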

OK, now it's time to put that advice into action and install your first guest operating system instance.

There are two different interfaces you can use to create and run guest operating systems: xm and virt-install. The documentation I've linked to above references xm throughout, so I'll stick with that method.

First, as in the documentation, I'll create a simple Python configuration file containing information regarding the DomU I want to install.

For my sample install, I'm going to save the OpenSolaris ISO image I downloaded to /export/images/66-0624-nd.iso.

Once I've done that I will create a little infrastructure as I like to keep my disk images, Python config files for installation and Python config files for booting installed DomUs separate.

Feel free to come up with your own hierarchy; I like:

  • /export/images - ISO disk images
  • /export/config/install - Python config files for installation
  • /export/config/run - Python config files for installed DomUs
  • /export/root - Image files used as file systems for DomUs
So I'll create the Python configuration file /export/config/install/, containing:
    name = "solaris-install"
    memory = "1024"
    disk = [ 'file:/export/images/66-0624-nd.iso,6:cdrom,r',
             'file:/export/root/solaris_domu.img,0,w' ]
    vif = [ '' ]
    on_shutdown = 'destroy'
    on_reboot = 'destroy'
    on_crash = 'destroy'
This file basically states that I want to boot the ISO file /export/images/66-0624-nd.iso as if it were a boot disc (in this case, a DVD), and that I want to use the disk image file /export/root/solaris_domu.img as the "hard disk" for my DomU. It also states that in case of a shutdown, reboot or crash, I just want to xm destroy the domain - no muss, no fuss.
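
To make the structure of those disk entries easier to see, here is a small Python sketch that splits one entry into its parts. The parse_disk_spec() helper and its field names are hypothetical and purely illustrative; this is not how xm itself parses the file, and it only handles the simple three-field form used in this post.

```python
def parse_disk_spec(spec):
    """Split a 'backend:path,device,mode' disk entry into labeled fields."""
    target, device, mode = spec.split(",")
    backend, path = target.split(":", 1)
    return {"backend": backend, "path": path, "device": device, "mode": mode}


disk = [
    "file:/export/images/66-0624-nd.iso,6:cdrom,r",
    "file:/export/root/solaris_domu.img,0,w",
]

for entry in disk:
    fields = parse_disk_spec(entry)
    access = "read-only" if fields["mode"] == "r" else "writable"
    print(fields["path"], "-> device", fields["device"], "(" + access + ")")
```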

Next, I'll create the file to be used as the DomU's "hard disk" using dd; I think 8 GB makes for a nice compact "drive" size for a DomU:

    dd if=/dev/zero of=/export/root/solaris_domu.img bs=1024k seek=8k count=1
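
If the seek/count arithmetic in that dd command isn't obvious, this short Python sketch works it out. It assumes that dd's "k" suffix means a multiple of 1024, and the dd_file_size() helper is hypothetical. The command skips 8k blocks of 1024k each and then writes a single block, so the resulting file is just over 8 GB, and on a file system that supports sparse files only that last block actually consumes disk space.

```python
KIB = 1024
MIB = 1024 * KIB


def dd_file_size(bs, seek, count):
    """Size in bytes of a new file written by dd with these bs/seek/count values."""
    return (seek + count) * bs


# bs=1024k seek=8k count=1
size = dd_file_size(bs=1024 * KIB, seek=8 * KIB, count=1)
print(size // MIB, "MB")
```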

Feel free to use whatever size you'd prefer; you can also use an entire dedicated disk partition or a zpool. The aforementioned documentation will tell you how.

Now it's time to create the new DomU using the OpenSolaris installer, via the xm create command:

    xm create -c /export/config/install/

When you run this command, you'll see a text-based OpenSolaris installer, the same as if you were installing your Dom0.

Simply follow the installer's prompts, the same as if you were installing a new machine, except you're really installing the new guest domain.

To make things easier, I suggest that you tell the installer to not eject CDs automatically when the installation is complete; that way you can walk away from your install and come back later and see that the install completed the way you want:

    - Eject a CD/DVD Automatically? ------------------------------------------------
      During the installation of Solaris software, you may be using one or more
      CDs/DVDs. With the exception of the currently booted CD/DVD, you can choose
      to have the system eject each CD/DVD automatically after it is installed or
      you can choose to manually eject each CD/DVD.
      Note: The currently booted CD/DVD must be manually ejected during system
                [ ] Automatically eject CD/DVD
                [X] Manually eject CD/DVD
         F2_Continue    F3_Go Back    F5_Exit

When the installer gets to the point where it detects disks, it should see your specified disk file/partition/zpool as a "hard drive" named c0d0, of the size you specified:

    - Select Disks -----------------------------------------------------------------
      On this screen you must select the disks for installing Solaris software.
      Start by looking at the Suggested Minimum field; this value is the
      approximate space needed to install the software you've selected. Keep
      selecting disks until the Total Selected value exceeds the Suggested Minimum
      NOTE: ** denotes current boot disk
      Disk Device                                              Available Space
      [X]    c0d0                                              8173 MB  (F4 to edit)
                                         Total Selected:   8173 MB
                                      Suggested Minimum:   4727 MB
         F2_Continue    F3_Go Back    F4_Edit    F5_Exit    F6_Help
When you continue, tell Solaris to go ahead and automatically lay out your file systems; here's what I do:
    - Automatically Layout File Systems --------------------------------------------
      On this screen you must select all the file systems you want auto-layout to
      create, or accept the default file systems shown.
      NOTE: For small disks, it may be necessary for auto-layout to break up some of
      the file systems you request into smaller file systems to fit the available
      disk space. So, after auto-layout completes, you may find file systems in the
      layout that you did not select from the list below.
              File Systems for Auto-layout
              ========================================
              [X] /
              [X] /opt
              [X] /usr
              [X] /usr/openwin
              [X] /var
              [X] swap
    --------------------------------------------------------------------------------
         F2_Continue    F5_Cancel    F6_Help
This will result in a layout that looks something like this:
    - File System and Disk Layout --------------------------------------------------
      The summary below is your current file system and disk layout, based on the
      information you've supplied.
      NOTE: If you choose to customize, you should understand file systems, their
      intended purpose on the disk, and how changing them may affect the operation
      of the system.
      File sys/Mnt point           Disk/Slice             Size
      ========================================================================
      /                            c0d0s0                  792 MB
      /usr/openwin                 c0d0s1                  815 MB
      overlap                      c0d0s2                 8181 MB
      /var                         c0d0s3                  509 MB
      swap                         c0d0s4                  517 MB
      /opt                         c0d0s5                 1066 MB
      /usr                         c0d0s6                 4471 MB
    --------------------------------------------------------------------------------
         F2_Continue    F3_Go Back    F4_Customize    F5_Exit    F6_Help
Now, because of what I mentioned a bit earlier, if you want to avoid problems upgrading your DomU in the future to a newer release of OpenSolaris, you should edit your partitions to make the / partition at least 1 GB in size.

I also like to take the opportunity to bump up the swap size, as I find the default to be a bit small.

Here's a configuration I've had good luck with:

    - Customize Disk: c0d0 ---------------------------------------------------------
      Boot Disk: c0d0
      Entry: /opt                  Recommended:  628 MB     Minimum:  534 MB
    ================================================================================
      Slice  Mount Point                Size (MB)
          0  /                               1262
          1  /usr/openwin                     815
          2  overlap                         8181
          3  /var                             509
          4  swap                             776
          5  /opt                             705
          6  /usr                            4102
          7                                     0
    ================================================================================
                                 Solaris Partition Size:   8181 MB
                                            OS Overhead:      8 MB
                                        Usable Capacity:   8173 MB
                                              Allocated:   8169 MB
                                         Rounding Error:      4 MB
                                                   Free:      0 MB
         F2_OK    F4_Options    F5_Cancel    F6_Help
When the installation completes, you will be asked to "eject" the disk and reboot the system. When that occurs, you will simply be returned to your shell prompt.

Now, you can run your installed DomU!

However, if you just use the configuration file mentioned above, you'll rerun the installer; instead, you'll need to create a new Python configuration file referencing your newly installed DomU.

Here's what I use; let's call the Python configuration file /export/config/run/

    name = "nevada"
    vcpus = 1
    memory = "1024"
    root = "/dev/dsk/c0d0s0"
    disk = ['file:/export/root/solaris_domu.img,0,w']
    vif = ['']
    on_shutdown = 'destroy'
    on_reboot = 'restart'
    on_crash = 'destroy'
Since I tend to use machines that have at least 2 GB of RAM installed and only run a single DomU, the "memory" line above denotes that I want to give my DomU 1 GB of RAM. This falls within the "55 %" guideline mentioned near the top of this post.

Now, the guest domain can be booted using the command:

    xm create -c /export/config/run/
It should boot just as if you were connected to the serial console of a second machine running OpenSolaris, except you don't need to fuss with BIOS or boot loaders; here's what mine looks like:
    # xm create -c /export/config/run/
    Using config file "/export/config/run/".
    Started domain nevada
    SunOS Release 5.11 Version xen-nv66-2007-06-24 64-bit
    Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.
    Use is subject to license terms.
    ip: joining multicasts failed (4) on xnf0 - will use link layer broadcasts for m
    Hostname: chaser-guest
    NIS domain name is boulder.Central.Sun.COM
    /dev/rdsk/c0d0s6 is clean
    /dev/rdsk/c0d0s5 is clean
    chaser-guest console login:

I know this has been a lengthy post, but it's really much simpler than my verbosity would indicate.

Why not give it a try?

Tuesday Oct 24, 2006

"Just trust us" as Microsoft's security policy

According to The Register, Microsoft has announced that 64-bit Windows Vista will be a black box to which only Microsoft has a key. Microsoft Security VP Ben Fathi apparently compared developers insisting on the ability to patch the kernel to "a Sony Walkman user invalidating their warranty by opening up the device" and stated "That's just not the way the box was designed... we're putting a stop to that."

While the desire to avoid having multiple application-installed kernel patches floating around is noble, I would have to ask if you trust Microsoft to know what is best for you. Perhaps next Microsoft will decide to deliver the "64-bit Windows Vista Service Pack 1" kernel on a secure ROM encrypted with your Windows license registration number. Time will tell if I've just inadvertently given away their strategy for 2009, but we can see how well "security through obscurity" has worked for Microsoft to date.

Compare that approach to that of OpenSolaris. While patching the kernel is something we go out of our way to make sure you don't have to do, we certainly don't forbid it. Our kernel API? Documented right here. You can browse the OpenSolaris source code here, or jump right to the kernel source directories here. It's up to you. We don't need to hide our kernel from anyone, least of all developers who want to know how things work. There are even books written by Sun engineers explaining how things work in detail.

If we don't already provide a capability you want, go ahead and add it. Submit the source code and we may even integrate it directly into our tree. Find a security bug? By all means let us know; you can even submit a fix if you like. Want to write your own scheduler? Feel free. New VM subsystem? If you've got the time and talent, go for it.

The key is that we won't prevent you from doing what you need or just want to do to make OpenSolaris more useful for you. We try to make OpenSolaris as feature rich as possible "out of the box," but logic dictates that we can never be all things to all people. You should feel free to customize things if you need to for your particular environment. That's what open source is all about.

In short, you're in control of your Operating System when you run OpenSolaris. Isn't that the way it should be?

If you're Microsoft, apparently not.

Wednesday Jun 14, 2006

Embedded Solaris? Why not?

A recent article by Ashlee Vance in The Register asked a common question under the disparaging subhead "Embedded Slowaris":
"There's a lot of momentum behind PowerPC right now and a decent amount behind Solaris 10, so we'll see how this plays out. Although, it's hard to see how Solaris offers a huge edge over Linux in the embedded market."
As a person who's done embedded development before, the answer is most emphatically tools.

Simply put, Solaris has the best development tools in the market, most glaringly the combination of mdb and DTrace. When I was an embedded developer, tools such as those (or even tools with a mere fraction of their capabilities) would have made me fall to my knees and cry.

Consider that many embedded environments have few, if any, tools available at all. Many have custom debuggers with command sets that correspond to no other debugger, meaning you have to learn yet another command set. A standard kernel debugger? Perhaps, depending in part on whether the copyright holder of your OS even considers them useful. But certainly not one with anywhere near the power of kmdb. Using ::print to output a formatted copy of most any kernel structure? Certainly beats counting byte offsets in a raw hex dump!

Compared to Solaris, an environment whose debug tools consist largely of printfs output to a serial console looks as anachronistic as toggling boot code into a machine using front panel toggle switches.

When it comes to application development for any type of embedded environment, the availability of something like DTrace is a dream. Ever get frustrated at how long it takes for the Program Guide to come up on your digital cable set top box? If that developer had access to DTrace, perhaps they would have found that extra loop wasting precious processing time. Or that memory leak that requires you to unplug your satellite receiver or DVR every so often when it locks up. Good development tools are even more useful when programming for an embedded target.

Writing an application for a Linux target? No problem - develop on Solaris, debug with DTrace, port to Linux. Porting most any reasonably portably written program from Solaris to Linux is a minimal effort compared with the time it would take you to debug the same application using only Linux tools. Are you a GNU fan that prefers to use gcc and gdb? No problem, they run quite happily on Solaris as well. (Of course you'll have a much easier time using mdb to debug your multithreaded application…)

Is Solaris the perfect embedded OS? Perhaps not; Solaris includes a lot of code that allows it to scale predictably across everything from single CPU systems with modest amounts of memory to systems with multiple multithreaded cores addressing gigabytes of RAM. That's flexibility not always required when your target platform is known and fixed in terms of RAM and processing capability ahead of time. But it's certainly not anything that should cause Solaris to be dismissed out of hand, and projects like Tom Riddle's PowerPC project at SunLabs, Project Pulsar, combined with the efforts of the OpenSolaris community will make it an even more attractive option in the future. Solaris of course has no pricing disadvantages as compared to Linux (and in fact Solaris plus a support contract can be much more affordable than an equivalent Linux offering.) Need to make source code modifications to the kernel? (Though frankly, most anything you need to do from a user or device level can be done via the interfaces described in the Solaris Writing Device Drivers manual.) As of the release of OpenSolaris 366 days ago (as I write this), that's not an issue, either.

In short, perhaps someday embedded developers using other operating systems will have access to development tools like those available on Solaris, but until then if a target platform is one that Solaris supports, it seems to me the time and effort saved by being able to develop and debug on Solaris makes development in any other environment an inexcusable waste of time and money.

Happy Birthday, OpenSolaris!

Just a quick note to wish OpenSolaris a happy first birthday!

I've been truly impressed by the number of people who've participated in the various OpenSolaris discussions over the past year, and have been even more impressed by the numbers of people out there who have submitted enhancements or bug fixes. Your submissions are truly appreciated.

I'd also like to thank the many, many people I've been able to meet at the Front Range OpenSolaris User Group meetings; it's one thing for Sun to open Solaris to the world, but quite another for people to take it and run with it. It's you who have made OpenSolaris the success it is today and not just a bunch of code thrown over a wall.

It's incredibly exciting to be able to talk to people and actually show them what you've spent months and years working on rather than have to wave your hand and say "I wrote the code that does this." OpenSolaris is what has made that possible. We've always been proud of what we've done here at Sun, but opening our code to the world is the best way I can think of to show it.

So, thank you everyone, and don't forget to buy yourself an anniversary T-Shirt!

See you all in the BrandZ and PowerPC communities, and of course on the main discussion forum.

Happy Birthday OpenSolaris!


Tuesday Dec 13, 2005

Signal Corps

I'm so glad BrandZ is finally available via OpenSolaris!

Now I can finally chat a bit about what I've been doing for the last several months since finishing up the work to bring Solaris 10 to amd64 hardware.

BrandZ and Signals

One of the more difficult problems of running a Linux branded process in BrandZ is that of signal delivery. While Solaris and Linux largely support the same signals, they often have different signal numbers, and in one case have different default semantics. Signal delivery is further complicated by differences in signal stack structure and contents and the action taken when a user signal handler exits. Finally, many signal-related structures, such as sigset_t, differ between Solaris and Linux. All of these differences must be accounted for if the signal delivery mechanism is to have any hope of working correctly, and frankly dealing with them made me yearn for the "simpler" days of translating amd64 page table entries back and forth between 32 and 64-bit mode. :-)

Signal number conversion

The simplest transformation that must be done when sending signals is to translate between Linux and Solaris signal numbers.

    Major signal number differences between Linux and Solaris

So for example, when a Linux process sends a signal using the kill(2) system call, the BrandZ infrastructure must translate the signal into its Solaris equivalent before handing control off to the Solaris kill(2). Conversely, when a signal is delivered to a Linux process, BrandZ must convert the signal number from its Solaris value back to its Linux value. This translation of signal numbers at both the time of generation and the time of delivery ensures that the Solaris kernel sees only Solaris signals and that any signals generated by the kernel are seen by Linux processes as the proper signal.
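
The two-way translation can be pictured with a pair of lookup tables. The Python sketch below is only an illustration of the idea, not BrandZ's actual tables or code: the helper names are hypothetical, the table covers only a handful of signals, and the numbers are the usual x86 Linux and Solaris <signal.h> values as best I recall them, so check them against your own headers before relying on them.

```python
# Partial name -> number tables (assumed values; see the caveat above).
LINUX_SIGNALS = {
    "SIGHUP": 1, "SIGINT": 2, "SIGKILL": 9, "SIGSEGV": 11, "SIGTERM": 15,
    "SIGBUS": 7, "SIGUSR1": 10, "SIGUSR2": 12, "SIGCHLD": 17, "SIGPWR": 30,
}
SOLARIS_SIGNALS = {
    "SIGHUP": 1, "SIGINT": 2, "SIGKILL": 9, "SIGSEGV": 11, "SIGTERM": 15,
    "SIGBUS": 10, "SIGUSR1": 16, "SIGUSR2": 17, "SIGCHLD": 18, "SIGPWR": 19,
}

# Number -> number maps, one for each direction of travel.
LTOS = {num: SOLARIS_SIGNALS[name] for name, num in LINUX_SIGNALS.items()}
STOL = {num: LINUX_SIGNALS[name] for name, num in SOLARIS_SIGNALS.items()}


def linux_to_solaris(signo):
    """Applied at generation time, e.g. before calling the Solaris kill(2)."""
    return LTOS[signo]


def solaris_to_linux(signo):
    """Applied at delivery time, before the Linux handler sees the signal."""
    return STOL[signo]


# A Linux process sending SIGUSR1 (10) must raise Solaris signal 16.
print(linux_to_solaris(10))
# A Solaris SIGCHLD (18) must arrive in the Linux process as 17.
print(solaris_to_linux(18))
```

Note that a plain pass-through of the number would be wrong in both directions here: Linux signal 10 is SIGUSR1, while Solaris signal 10 is SIGBUS.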

Calling user signal handlers

In order to support user-level signal handlers, BrandZ uses a double layer of indirection to process and deliver signals to branded threads.

In a native Solaris process, signal delivery is interposed upon by libc for any thread registering a signal handler. Libc needs to do various bits of magic to provide thread-safe critical regions, so it registers its own handler for the signal, named sigacthandler(), using the sigaction(2) system call.

When a signal is received, sigacthandler() is called, and after some processing, turns around and calls the user's signal handler via the aptly named routine call_user_handler().

Adding a Linux branded thread to the mix complicates this behavior further, as when a thread receives a signal, it may be running with a Linux value in the x86 %gs segment register as opposed to the value Solaris threads expect; if control were passed directly to a bit of Solaris code, that code would suffer a segmentation fault the first time it tried to dereference a memory location using %gs.

This need to impose upon the normal Solaris signal handling mechanism means that while the path from signal generation to delivery for a native Solaris thread looks something like:

    kernel ->
        sigacthandler() ->
            call_user_handler() ->
                user signal handler
for BrandZ Linux threads, this instead would look like:
    kernel ->
        lx_sigacthandler() ->
            sigacthandler() ->
                call_user_handler() ->
                    lx_call_user_handler() ->
                        Linux user signal handler
The new routines mentioned above are:
  • lx_sigacthandler()
    This routine is responsible for setting the %gs segment register to the value Solaris code expects, and jumping to Solaris' libc signal interposition handler, sigacthandler().
  • lx_call_user_handler()
    This routine is responsible for translating Solaris signal numbers to their Linux equivalents, building a Linux signal stack based on the information Solaris has provided, and passing the stack to the registered Linux signal handler. It is, in effect, the Linux thread equivalent to libc's call_user_handler()
Installing lx_sigacthandler() is a bit tricky, as normally sigacthandler() is hidden from user programs. To facilitate this, a new private function was added to libc, setsigacthandler():
    void setsigacthandler(void (*new_handler)(int, siginfo_t *, void *),
        void (**old_handler)(int, siginfo_t *, void *))
This routine works by modifying a per-thread data structure that libc already maintains to keep track of the addresses of interposition handlers. The old handler's address is stored in the pointer pointed to by the second argument if it is non-NULL, somewhat mimicking the behavior of sigaction(2).

Once setsigacthandler() has been executed, all future branded threads the thread may create will automatically have the proper interposition handler installed as the result of any sigaction() call.

Note that none of this interposition is necessary unless a Linux thread registers a user signal handler, as the default action for all signals is the same between Solaris and Linux save for one signal, SIGPWR. To handle this case, BrandZ always installs its own internal signal handler for SIGPWR that performs the Linux default action, namely to terminate the process upon receipt. (Solaris' default action is to ignore SIGPWR.)

Returning from a user signal handler

The process of returning to an interrupted thread of execution from a user signal handler is entirely different between Solaris and Linux. While Solaris generally expects to set the context to the interrupted one on a normal return from a signal handler, Linux instead pushes actual code that calls a specific Linux system call, sigreturn(2), onto the signal handler's stack. Then when a Linux signal handler completes execution, instead of returning through what would in Solaris' libc be a call to setcontext(2), a call to sigreturn() is responsible for accomplishing much the same thing.

This trampoline code:

    pop   %eax
    mov   $LX_SYS_sigreturn, %eax
    int   $0x80
is referenced such that when the Linux user signal handler is eventually called, the stack looks like this:
    Pointer to trampoline code
    Linux signal number
    Pointer to Linux siginfo_t
    Pointer to Linux ucontext_t
    Linux ucontext_t
    Linux fpstate_t
    Linux siginfo_t

When the trampoline code is executed, BrandZ interposes upon the Linux sigreturn(2) call in order to turn it into the return through the libc call stack that Solaris expects. This is done by the lx_sigreturn() routine, which removes the Linux signal frame from the stack and passes the resulting stack pointer to another routine, lx_sigreturn_tolibc(), which makes libc believe the user signal handler it had called returned.

When control returns to call_user_handler(), a setcontext(2) will be done that (in most cases) returns the thread executing the code back to the location originally interrupted by receipt of the signal.

One final complication in this process is the restoration of the %gs segment register. The proper value is saved when the thread context is originally saved, but prior to BrandZ, code existed in libc to force the value to that expected by libc before calling setcontext(2).

For BrandZ, the code that did so has been removed. While perhaps making faults due to bad user context values for %gs harder to debug (as such bad values will now make applications appear to segmentation fault deep within Solaris' libc), the versatility to properly restore custom %gs values seems worth the trade-off.


So while this may all seem unbelievably complex, it actually works rather well. Best of all, it does so without impacting the performance of native Solaris threads running in other zones on the same machine, one of our primary goals with this project.

BrandZ is, of course, still a work in progress with the actual product still undergoing heavy development, but at least the OpenSolaris release gives you something you can touch, feel and play with.

I suspect you'll find it was well worth the wait.


Thursday Nov 10, 2005


Now that more of the boot code has made its way into the OpenSolaris tree, I can finally say a few words about the page translation mechanisms that make booting a 64-bit amd64 system possible, a topic Tim Marsland first alluded to back in May (!).

Essentially, the booter is a 32-bit binary, but the 64-bit kernel needs the booter to perform initial memory allocations for it until it can take over and do things for itself.

When the kernel needs to allocate memory for itself before its own allocation mechanism is set up, it calls the BOP_ALLOC() macro, which ends up calling the booter's registered bsys_alloc() routine, bkern_alloc(), via the bootops callback mechanism.

But this has an inherent problem; memory allocation will be performed by the 32-bit booter and a mapping made in the booter's 32-bit page tables. How will the 64-bit kernel use those mappings?

The answer is to interpose a translation layer between the two, often called a thunking layer, and have the "thunking" code translate any 32-bit memory allocations made on the kernel's behalf into entries in the kernel's 64-bit page tables. The routines that do all this are now available for your viewing pleasure in the OpenSolaris source tree under /on/usr/src/psm/stand/boot/amd64, with the real meat located in the files ptops.c and ptxlate.c.

Whenever a mapping is made in the 32-bit page tables on the kernel's behalf, it is translated into a 64-bit mapping before returning to the 64-bit kernel by the routine amd64_xlate_legacy_va(). This code walks the 32-bit page tables for the memory range allocated, decodes the information the booter placed there and makes equivalent entries in the kernel's 64-bit page tables.

Now this may sound simple, but in practice it's rather a pain to get right, and errors in this type of code are always difficult to find and debug, especially as when they occur the most common failure modes are:

  • Random corruption of memory
  • A hard hang of the machine
  • A complete reset of the machine
None of which is especially diagnosable without a good simulator and an excellent kernel debugger, like kmdb.

The heart of this code is the routine amd64_tbl_lookup(). Since 32-bit and 64-bit page tables are very closely related, by manipulating bit shift amounts and page table index sizes, we can use a single routine to do all needed lookups, whether in 32-bit or 64-bit page tables.

Looking at struct amd64_mmumode:

    typedef struct amd64_mmumode {
            uint8_t shift_base;     /* shift to start of page tables */
            uint8_t level_shift;    /* shift between page table levels */
            uint8_t map_level;      /* mapping level for AMD64_PAGESIZE pages */
            uint16_t tbl_entries;   /* number of entries per table level */
    } amd64_mmumode_t;
and the values in the mmus array:
    static amd64_mmumode_t mmus[] = {
            { 22, 10, 2, 1024 },    /* legacy 32-bit mode */
            { 39, 9, 4, 512 }       /* long 64-bit mode */
    };
you can begin to see how this works. For example, for 64-bit page tables, you would shift a virtual address right by shift_base (39) bits and keep the low level_shift (nine) bits of the result to form an index into the level 1 page table. Each successive page table index is formed by shifting the virtual address level_shift (nine) fewer bits to the right, and a full 64-bit page table walk has map_level (four) levels.
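
To make the shift arithmetic concrete, here is a small Python sketch that derives the per-level table indices from the same four parameters. It is not a port of amd64_tbl_lookup(): the MmuMode tuple and the table_indices() helper are hypothetical names, and the sketch only computes indices without walking any real page tables.

```python
from collections import namedtuple

# Same fields, in the same order, as struct amd64_mmumode.
MmuMode = namedtuple("MmuMode", "shift_base level_shift map_level tbl_entries")

LEGACY32 = MmuMode(22, 10, 2, 1024)   # legacy 32-bit mode
LONG64 = MmuMode(39, 9, 4, 512)       # long 64-bit mode


def table_indices(va, mode):
    """Return the page table index used at each level, top level first."""
    indices = []
    for level in range(mode.map_level):
        # Each level down shifts level_shift fewer bits to the right.
        shift = mode.shift_base - level * mode.level_shift
        # tbl_entries is a power of two, so this keeps the low level_shift bits.
        indices.append((va >> shift) & (mode.tbl_entries - 1))
    return indices


print(table_indices(0x1000, LEGACY32))
print(table_indices(0xffffffff00001000, LONG64))
```

The same loop serves both modes; only the parameters change, which is the point of driving the lookup from a table of MMU modes.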

(The tbl_entries field? Well, it's a bit of historical cruft left over from initial debugging that saved humans from having to calculate the value (1 << level_shift).)

In addition to being responsible for on-the-fly translation done after each memory allocation, there are two routines responsible for creating the initial 64-bit page table used to bring the system up into 64-bit mode for the first time, amd64_init_longpt() and amd64_xlate_boot_tables().

You'll see that amd64_init_longpt() creates the 64-bit page tables with a single Page Map Level-4 Offset table (pml4), two Page-Directory Pointer Offset tables (pdpt and pdpt_hi), four contiguous Page-Directory Offset tables (pdt) and a full page table (pte_zero.)

What's interesting here is what we decided to do to make memory operations easier, namely the entire 4G of the 32-bit address space is "reflected" into the top 4G of the 64-bit address space. This means that if the booter mapped an entry at virtual address 0x1000 in 32-bit space, the kernel should be able to access that memory at the 64-bit virtual address 0xffffffff00001000. The initial page tables are set up to do just that - to provide just enough page table entries to map an initial block of memory which boot has identity mapped (physical address = virtual address, 1:1) and the top 4G of the 64-bit address space.
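
The reflection itself is just a fixed offset. As a minimal Python sketch (the REFLECT_BASE constant and reflect_va() helper are hypothetical names, not identifiers from the booter source):

```python
# Base of the top 4G of the 64-bit virtual address space.
REFLECT_BASE = 0xffffffff00000000


def reflect_va(va32):
    """Map a 32-bit booter virtual address to its 64-bit reflected address."""
    if not 0 <= va32 <= 0xffffffff:
        raise ValueError("not a 32-bit address: %#x" % va32)
    return REFLECT_BASE | va32


print(hex(reflect_va(0x1000)))
```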

(You may notice that we start our identity mapped pages at 0x1000; what about the low page of memory, from 0x0 - 0xfff? That range is purposely left unmapped to catch NULL pointer references.)

One optimization we make is to identity map the range from 0x200000 through magic_phys (currently set to be 15M) using large (2M) pages. While that does map an "extra" 1M of memory, the magic_phys address is only significant internally within the booter, so it doesn't really matter if an extra 1M of memory is mapped. Note also that we save a considerable bit of memory by using 2M pages as compared to having to map the range using conventional 4K pages, so it's all worth it.
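
The arithmetic behind the "extra" 1M and the memory saved can be sketched as follows. This is an illustration only: the macro and function names are hypothetical, the start of the range is assumed to be 2M-aligned, and the savings figure assumes the usual x86 layout in which each 2M region mapped with 4K pages needs one 4K page table.

```c
#include <stdint.h>

#define SMALL_PAGE  (4ULL * 1024)           /* conventional 4K page */
#define LARGE_PAGE  (2ULL * 1024 * 1024)    /* large 2M page */
#define MAP_START   0x200000ULL             /* start of the large-page range */
#define MAGIC_PHYS  (15ULL * 1024 * 1024)   /* 15M, the value quoted above */

/* Round addr up to the next multiple of align (a power of two). */
uint64_t round_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Number of 2M pages needed to cover [start, end); start is 2M-aligned. */
uint64_t large_pages_needed(uint64_t start, uint64_t end)
{
    return (round_up(end, LARGE_PAGE) - start) / LARGE_PAGE;
}

/* Bytes mapped past end because the last large page overshoots it. */
uint64_t extra_bytes_mapped(uint64_t end)
{
    return round_up(end, LARGE_PAGE) - end;
}

/* Bytes of 4K page tables avoided: one 4K table per 2M region. */
uint64_t page_table_bytes_saved(uint64_t start, uint64_t end)
{
    return large_pages_needed(start, end) * SMALL_PAGE;
}
```

For the range from 0x200000 to 15M this works out to seven large pages, 1M of extra mapping (the last page ends at 16M), and seven 4K page tables (28K) that never have to be allocated.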

Then when all this is done, amd64_xlate_boot_tables() is called to convert all the mappings the booter has set up to date into 64-bit mappings suitable for use by the kernel.

So to sum up all this, when the booter is ready to hand off execution to the amd64 kernel, it calls amd64_handoff(), which in turn calls amd64_makectx64(), which then calls amd64_init_longpt() to set up the initial 64-bit page tables.

To give you a flavor of what the kernel needs to do to perform memory allocations until it is capable of handling things itself, it will:

  1. Trap back into the booter via the bootops callback mechanism.
  2. Begin execution in __vtrap_common(), which will switch the CPU from 64-bit mode back into 32-bit mode. Once the CPU is back in 32-bit mode, the code will call
  3. amd64_vtrap(), which will call
  4. amd64_invoke_bootop(). There memory will be allocated via the BOP_ALLOC() macro, which will call bkern_alloc() to allocate space in the 32-bit page tables the booter operates with.

    When that routine returns, control will return to
  5. amd64_invoke_bootop(), which will call
  6. amd64_xlate_legacy_va() to translate the mapping into an entry in the kernel's 64-bit page tables. When amd64_invoke_bootop() returns, it will return to
  7. amd64_vtrap(), which will switch the CPU back into 64-bit mode before returning to the kernel, which can now use the new entries in its page tables provided by the page table translation mechanism.
Simple, isn't it? :-)

Monday Jun 27, 2005

Episode I: The Phantom #defines

If you've spent any time at all looking at the Solaris code, especially assembly routines, you may have noticed that there are #defines that are used but that just don't seem to exist anywhere.

For example, in the swtch() code I referenced last time, you'll see this construct:

movq    %rbx, T_RBX(thread_t);

So what's T_RBX? If you look through the source code, you'll find it in usr/src/uts/intel/amd64/ml/, #defined as:

    #define   T_RBX           _CONST(T_LABEL + LABEL_RBX)
and yet if you search for a #define for T_LABEL, you won't find one anywhere.

How can this code possibly assemble?

The answer? Genassym, and files and assym.h.

But wait! There's no assym.h file, either! What's going on?!?!

Hold on, and I'll try to explain.

The genassym program, part of Solaris' stabs tools, is Solaris' way of generating calculated #defines as a way of accessing structure elements within assembly language code. It's documented at the beginning of the file usr/src/uts/sun4/ml/

\ Guidelines:
\ A blank line is required between structure/union/intrinsic names.
\ The general form is:
\       name size_define [shift_define]
\               member_name [offset_define]
\       {blank line}
\ If offset_define is not specified then the member_name is
\ converted to all caps and used instead.  If the size of an item is
\ a power of two then an optional shift count may be output using
\ shift_define as the name but only if shift_define was specified.
\ Arrays cause stabs to automatically output the per-array-item increment
\ in addition to the base address:
\        foo FOO_SIZE
\               array   FOO_ARRAY
\ results in:
\       #define FOO_ARRAY       0x0
\       #define FOO_ARRAY_INCR  0x4
\ which allows \#define's to be used to specify array items:
\       #define FOO_0   (FOO_ARRAY + (0 * FOO_ARRAY_INCR))
\       #define FOO_1   (FOO_ARRAY + (1 * FOO_ARRAY_INCR))
\       ...
\       #define FOO_n   (FOO_ARRAY + (n * FOO_ARRAY_INCR))
\ There are several examples below (search for _INCR).
\ There is currently no manner in which to identify "anonymous"
\ structures or unions so if they are to be used in assembly code
\ they must be given names.
\ When specifying the offsets of nested structures/unions each nested
\ structure or union must be listed separately then use the
\ "\#define" escapes to add the offsets from the base structure/union
\ and all of the nested structures/unions together.  See the many
\ examples already in this file.
This allows us to manipulate structure definitions and not have to go back and adjust #defines or correct assembly code along the way. This also provides a way to easily index arrays from within assembly code.
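
For a feel of what such a generator produces, here is a small sketch in C that emits assym.h-style lines for a made-up structure using offsetof and sizeof. It is not how genassym itself is implemented (genassym works from stabs data); struct foo and emit_foo_defines() are hypothetical names used only to mirror the FOO example in the guidelines.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical structure standing in for "foo" in the guidelines. */
struct foo {
    int32_t array[4];
    int32_t other;
};

/*
 * Write assym.h-style #defines for struct foo into buf.
 * Returns the number of characters written, as reported by snprintf().
 */
int emit_foo_defines(char *buf, size_t len)
{
    return snprintf(buf, len,
        "#define FOO_SIZE 0x%zx\n"
        "#define FOO_ARRAY 0x%zx\n"
        "#define FOO_ARRAY_INCR 0x%zx\n",
        sizeof (struct foo),
        offsetof(struct foo, array),
        sizeof (((struct foo *)0)->array[0]));
}
```

If a member is later added in front of array, regenerating the output changes FOO_ARRAY automatically, which is exactly the property the assembly code relies on.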

To continue our resume() example, in usr/src/uts/intel/amd64/ml/, you'd find this code:

\#define        LABEL_RBP       _CONST(_MUL(2, LABEL_VAL_INCR) + LABEL_VAL)
\#define        LABEL_RBX       _CONST(_MUL(3, LABEL_VAL_INCR) + LABEL_VAL)
\#define        LABEL_R12       _CONST(_MUL(4, LABEL_VAL_INCR) + LABEL_VAL)
\#define        LABEL_R13       _CONST(_MUL(5, LABEL_VAL_INCR) + LABEL_VAL)
\#define        LABEL_R14       _CONST(_MUL(6, LABEL_VAL_INCR) + LABEL_VAL)
\#define        LABEL_R15       _CONST(_MUL(7, LABEL_VAL_INCR) + LABEL_VAL)
\#define        T_RBP           _CONST(T_LABEL + LABEL_RBP)
\#define        T_RBX           _CONST(T_LABEL + LABEL_RBX)
\#define        T_R12           _CONST(T_LABEL + LABEL_R12)
\#define        T_R13           _CONST(T_LABEL + LABEL_R13)
\#define        T_R14           _CONST(T_LABEL + LABEL_R14)
\#define        T_R15           _CONST(T_LABEL + LABEL_R15)
Coupled with this code in usr/src/uts/i86pc/ml/
_kthread        THREAD_SIZE
        t_pcb                   T_LABEL
[ ... snip ... ]
        val     LABEL_VAL
You can probably see how it begins to pull together at this point; when the assembly code references T_RBX, genassym predefines that to be the offset, within the thread structure, of the label_t slot where %rbx will be stored, or in this case:

    T_LABEL + (3 * LABEL_VAL_INCR) + LABEL_VAL
where T_LABEL is the offset of the t_pcb field within the _kthread, LABEL_VAL is the offset of the val field within a label_t, and LABEL_VAL_INCR is the offset that must be added to get to each successive member of the label_t's val array.

Thus the #define above becomes this for an amd64 kernel (given the sources as of the date this is being written):

    0x38 + (3 * 8) + 0
yielding an offset of 0x50 bytes into the _kthread structure for the original movq instruction, so the instruction I mentioned at the top of this post:
    movq    %rbx, T_RBX(thread_t);
is assembled as:
    movq    %rbx, 0x50(thread_t);

Now this may all seem overly complex, but what if the size of the label_t changed? What if elements were added to the _kthread structure such that the offset of the t_pcb element changed? All that would be required is a recompile of the kernel. No changes to header files, no further changes to source code, not even a change to the file.
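
The same calculation can be reproduced in plain C with offsetof. The sketch below uses stand-in types: struct kthread_sketch is hypothetical, its padding is sized only so that t_pcb lands at the 0x38 offset quoted above, and the eight-entry val array is an assumption about label_t rather than a copy of the real definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for label_t; the eight-entry array is an assumption. */
typedef struct label {
    uint64_t val[8];
} label_t;

/* Stand-in for _kthread; only the position of t_pcb matters here. */
struct kthread_sketch {
    char    pad[0x38];      /* assumption: members that precede t_pcb */
    label_t t_pcb;
};

#define T_LABEL         offsetof(struct kthread_sketch, t_pcb)
#define LABEL_VAL       offsetof(label_t, val)
#define LABEL_VAL_INCR  sizeof (((label_t *)0)->val[0])
#define LABEL_RBX       (3 * LABEL_VAL_INCR + LABEL_VAL)
#define T_RBX           (T_LABEL + LABEL_RBX)

/* Offset of the saved %rbx slot from the start of the thread structure. */
size_t t_rbx_offset(void)
{
    return T_RBX;
}
```

With this layout t_rbx_offset() returns 0x50. Change the size of pad and the result follows without touching any of the #defines, which is the point of deriving them from the structure definitions.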

How does this magic happen? In summary, genassym, parses the and files and creates the assym.h at compile time.

That's why you won't find it in the source tree, yet it is #included at the top of most assembly files, such as this snippet at the top of usr/src/uts/intel/ia32/ml/swtch.s:

#if defined(__lint)
#else   /* __lint */
#include "assym.h"
#endif  /* __lint */
and things still assemble.

I hope this helps explain where many of those magical #defines come from.

All in all, pretty nifty.


Tuesday Jun 14, 2005

We now resume() your regular programming.

Now that Opening Day is here and you can see for yourself that OpenSolaris is not vaporware, I can finally discuss code snippets without having to say vague things like "in general, this code does this." You may now actually look for yourself. (Even better, I can now link to the actual code I was talking about.)

One of the most confusing areas of the kernel involves swtch() and resume().

Put simply, these two routines are largely responsible for multitasking: swtch() selects the next thread a given CPU core is to run, and resume() actually resumes execution of a given thread, beginning at either its initial address or the point at which it was last swtch()ed out.

Given that the various resume() routines contain some of the most architecture-specific code within Solaris, the code differs not only between SPARC and x86, but also between x86 and AMD64. Since I was responsible for the changes made to the resume() routines for AMD64, I thought I'd cover that in a bit more detail:

Both the IA-32 ABI and AMD64 ABI (both PDF files, Adobe Reader or equivalent required) refer to certain registers as non-volatile; that is, the contents of those registers must be preserved, by the called function, across function calls. This makes sense, as function call overhead would be enormous if you, as the caller, had to manually save and restore each register value you cared about whenever you called another function.

This gets a bit trickier within the context of resume(), as those registers have to be carefully saved off because one thread is about to give up use of a CPU core to another thread. Therefore the non-volatile registers need to be saved somewhere thread-specific where the values may easily be retrieved. (Likewise, those register values have to be restored when the thread later resume()s execution.)

Prior to Solaris 10, the x86 resume() routines saved the contents of the non-volatile registers on the stack. While a perfectly serviceable solution, the problem was that debuggers like kmdb and mdb had to go through a fair amount of trouble to find the "real stack" for those threads in order to provide a stack trace. To summarize, the debugger had to start at the last saved stack pointer and generate stack traces two different ways and it would present whichever stack trace it found to be longer. Functional, but rather non-optimal.

The SPARC resume() code worked around this by saving the registers within a pre-defined area of its kthread_t, called a label_t. (The label_t had already seen use in SPARC and x86 as the area where register values were saved in order to provide non-local goto support, otherwise known as setjmp() and longjmp()).

Since the kthread_t definition is shared between SPARC and x86, this meant that on x86, save for setjmp() and longjmp(), this area had largely gone unused. By modifying the x86 resume() code to also use the label_t for swtch()ed out thread register storage, we were able to not only reduce the complexity of mdb's stack backtrace algorithm, but we would also be able to provide a new degree of congruency in the way resume() works on SPARC, x86 and AMD64.
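
If you haven't used non-local gotos before, the user-level setjmp() and longjmp() from the C standard library show the idea: a jmp_buf plays roughly the role that a label_t plays in the kernel, holding the saved register state to come back to. A minimal sketch (the function names are made up for this example):

```c
#include <setjmp.h>

/* User-space analogue of a label_t: holds the saved register context. */
static jmp_buf saved_context;

/* Recurse a few frames deep, then jump straight back to the setjmp(). */
static void deep_work(int depth)
{
    if (depth == 0)
        longjmp(saved_context, 42);     /* does not return */
    deep_work(depth - 1);
}

/* Returns the value handed to longjmp(), or -1 if the jump never happened. */
int run_nonlocal_goto(void)
{
    int rc = setjmp(saved_context);     /* 0 on the initial call */

    if (rc == 0) {
        deep_work(3);
        return -1;                      /* not reached */
    }
    return rc;                          /* second return, via longjmp() */
}
```

Calling run_nonlocal_goto() returns 42: the second "return" from setjmp() restores the saved stack pointer and non-volatile registers, skipping the intermediate frames entirely.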

The macros that save and restore the registers in the resume() code are called, unsurprisingly enough, SAVE_REGS() and RESTORE_REGS():

/*
 * Save non-volatile regs other than %rsp (%rbx, %rbp, and %r12 - %r15)
 * The stack frame must be created before the save of %rsp so that tracebacks
 * of swtch()ed-out processes show the process as having last called swtch().
 */
#define SAVE_REGS(label_t, retaddr)                             \
        movq    %rbp, LABEL_RBP(label_t);                       \
        movq    %rbx, LABEL_RBX(label_t);                       \
        movq    %r12, LABEL_R12(label_t);                       \
        movq    %r13, LABEL_R13(label_t);                       \
        movq    %r14, LABEL_R14(label_t);                       \
        movq    %r15, LABEL_R15(label_t);                       \
        pushq   %rbp;                                           \
        movq    %rsp, %rbp;                                     \
        movq    %rsp, LABEL_SP(label_t);                        \
        movq    retaddr, (label_t);                             \
        movq    %rdi, %r12;                                     \
        call    __dtrace_probe___sched_off__cpu

/*
 * Restore non-volatile regs other than %rsp (%rbx, %rbp, and %r12 - %r15)
 * We load up %rsp from the label_t as part of the context switch, so
 * we don't repeat that here.
 * We don't do a 'leave,' because reloading %rsp/%rbp from the label_t
 * already has the effect of putting the stack back the way it was when
 * we came in.
 */
#define RESTORE_REGS(scratch_reg)                       \
        movq    %gs:CPU_THREAD, scratch_reg;            \
        leaq    T_LABEL(scratch_reg), scratch_reg;      \
        movq    LABEL_RBP(scratch_reg), %rbp;           \
        movq    LABEL_RBX(scratch_reg), %rbx;           \
        movq    LABEL_R12(scratch_reg), %r12;           \
        movq    LABEL_R13(scratch_reg), %r13;           \
        movq    LABEL_R14(scratch_reg), %r14;           \
        movq    LABEL_R15(scratch_reg), %r15
If you peruse the code in swtch.s, you're also likely to notice that there is not one resume() routine, but rather there are three variants:
  • resume(): This is the big cheese, the routine that gets called a vast majority of the time. Its task is "pretty simple": save the registers of the current thread and a return address to the label_t, then load the registers for the new thread from its label_t, and resume its execution wherever execution of that thread last left off. Simple in concept, rather tricky in execution; this is the kind of code where misplacing one instruction will cause incredibly frustrating, unreproducible errors that will literally take weeks to hunt down and locate.

    (There's no exaggeration at all on that "weeks" time frame; I'll leave it as an exercise for the reader to determine how I know.)

  • resume_from_intr(): This routine resumes execution of a thread that had been forced to give up the CPU due to an interrupt. Solaris processes interrupts using dedicated interrupt threads, so whatever thread has control of a CPU core must give up control to an interrupt thread when an interrupt arrives. Thus resume_from_intr() is called from swtch() when the interrupt service routine has finished executing and the system is ready to release the CPU core back to normal use.
  • resume_from_zombie(): This one is a bit different; when a thread has finished execution and is ready to release its resources back to the system, it's considered to be zombied. When a thread is zombied and is to relinquish use of the CPU core, it actually calls the routine swtch_from_zombie(), which in turn calls resume_from_zombie() to start execution of a new thread. This routine also performs housekeeping operations on the zombied thread, including putting it on "death row" so that it may be reaped by its creator.
The code construct for all three routines is basically the same; save the non-volatiles on entry, load up the new thread's registers, and begin executing it. However, if you try to trace the execution of a thread once the thread's registers have been restored, you'll see something seemingly bizarre occurring.

We execute this instruction, one that seemingly doesn't come back:

        jmp     spl0
This is because very early on in each of the resume() routines, you'll find code that looks something like this, to give the example of resume() itself:
        leaq    resume_return(%rip), %r11
        SAVE_REGS(%rax, %r11)
What this code does is save resume_return as the return address for the thread. This means that when the thread resumes execution, and its registers are restored:
        leaq    T_LABEL(%r12), %r11
        movq    LABEL_SP(%r11), %rsp    /* switch to outgoing thread's stack */
        movq    (%r11), %r13            /* saved return addr, LABEL_PC is 0 */

[ ... condensed for brevity ... ]

        movq    %r13, %rax      /* save return address */
        pushq   %rax            /* push return address for spl0() */
when the jmp is executed:
        jmp     spl0
spl0() will run to completion, and as spl0 ends with a ret instruction, this will cause the thread to return to the return address specified earlier.

So in the case of a thread calling resume(), when it later is allowed to resume execution after having given up the CPU core it was executing on, it will return from spl0() to the label resume_return, at which time the stack frame created in SAVE_REGS() is destroyed and execution will return to the caller that had originally called resume():

        /*
         * Remove stack frame created in SAVE_REGS()
         */
        addq    $CLONGSIZE, %rsp
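
If you'd like to experiment with the same save-and-restore idea without touching the kernel, the ucontext routines (getcontext(), makecontext() and swapcontext()) offer a rough user-level analogue: swapcontext() stores the current register state in one structure and loads another, much as resume() does with two label_ts. The sketch below is only an illustration; it assumes a system whose C library still provides these routines (they are obsolescent in POSIX), and every name other than the library calls is made up.

```c
#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;     /* saved register contexts */
static char worker_stack[64 * 1024];        /* stack for the worker context */
static int trace[4];                        /* order in which steps ran */
static int ntrace;

/* Runs on its own stack, gives up the CPU once, then finishes. */
static void worker(void)
{
    trace[ntrace++] = 1;
    swapcontext(&worker_ctx, &main_ctx);    /* save ourselves, resume main */
    trace[ntrace++] = 3;                    /* picks up where it left off */
}

/* Returns the recorded step order as a decimal number, or -1 on error. */
int run_switch_demo(void)
{
    ntrace = 0;
    if (getcontext(&worker_ctx) != 0)
        return -1;
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof (worker_stack);
    worker_ctx.uc_link = &main_ctx;         /* where to go when worker ends */
    makecontext(&worker_ctx, worker, 0);

    swapcontext(&main_ctx, &worker_ctx);    /* run worker until it yields */
    trace[ntrace++] = 2;
    swapcontext(&main_ctx, &worker_ctx);    /* resume worker after its yield */
    trace[ntrace++] = 4;

    return trace[0] * 1000 + trace[1] * 100 + trace[2] * 10 + trace[3];
}
```

The steps run in the order 1, 2, 3, 4: each context continues from exactly the point at which it last gave up the CPU, which is the behavior resume() provides for kernel threads.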

There you have it, an all too short treatise on resume() that's likely to have raised more questions than it answered (and if that was indeed the case, let me apologize in advance).

However, if you are now more confused than when you started reading, thanks to OpenSolaris, you may now read the source code and investigate the mysteries of resume() for yourself.

That's what the world of Open Source is all about.


Wednesday May 25, 2005

Hello? Is this thing on? (thump thump thump...)

Let's see, introductions... I'm William Kucharski, and I've been in the Solaris Kernel Group since November, 2003; prior to this I was a member of various new platform engineering groups here at Sun providing kernel changes necessary to support those new platforms. Though I've only been at Sun since November, 2000, all told I've been writing kernel-level software for over sixteen years now, a number that I find both surprising and a little bit depressing. :-)

My career to date has largely focused on issues of booting and VM, with a lot of my time spent in initial platform bringup and massaging memory management on various systems. Over the course of my career I've worked on machines from IBM 370-class mainframes down to PCs, and on a wide variety of CPUs including MIPS R3000s, various PowerPCs, SPARC CPUs and now x86, especially Opterons.

The last two years of my life have been largely devoted to porting Solaris to Opteron, and I've got to say it's been one of the most rewarding projects I've ever been involved in. One thing I've noticed when working on past projects is that many of them seemed like a lot of work at the time but didn't seem especially technologically innovative or important except perhaps when looked back upon with the benefit of hindsight. The amd64 port was one of the few projects I've worked on where I could tell how important the project was (in an engineering sense, not only in a marketing one) while the work was occurring. We faced considerable technical challenges each day (and over more than a few sleepless nights) but in the end it was all worth it when we got to release our code to the world in Solaris 10. It's really neat to be able to think that whenever a multi-CPU Opteron system boots, CPUs other than 0 start because of my code. :-)

In coming posts I hope to detail a few of the more entertaining challenges we faced and what we had to do to overcome them (for me, one of the biggest was trying to learn to think in little-endian byte order. :-))

For now, the blogs of Tim Marsland, Joe Bonasera and Nils Nieuwejaar, a few of the other members of the Opteron team, are excellent sources for more information. (They are, of course, also considerably more interesting to read than anything I've blathered on about for far too long now.)



