Sunday Oct 21, 2007

Debugging an xVM Panic

Handling an xVM Panic

The first project I took on when joining the xVM team was figuring out how to react when the xVM hypervisor panics. When Solaris panics, we are almost always able to capture a core dump representing the entire kernel state when the panic occurred. When the hypervisor panics, it dumps some cryptic information to the screen and resets the machine.

Obviously, our preference would be to capture as much information on a hypervisor panic as we do on a Solaris panic. There are two significant difficulties with this. First, xVM has no device drivers of its own, so xVM isn't capable of writing any of its own state to disk. Second, in the xVM model the hypervisor and the dom0 OS work closely together to deliver the full virtualization experience, so even capturing the xVM state would only give us half the information we need.

The solution I adopted was to modify the xVM hypervisor to pass control to Solaris in the event of a panic, allowing Solaris to read all of the xVM state, interpret the xVM hypervisor as a standard kernel module, and execute the standard Solaris panic code that we have spent many years hardening. Exactly how this works is pretty cool (for some particularly geeky definition of "cool"), but is much more complicated than I'm prepared to handle in a blog post. For the highlights, see the slides attached to the entry below.

If we just take it as read that a Solaris dom0 can properly capture an xVM core dump, I'd like to walk through one fairly simple example of using a dump to examine the state of both Solaris and xVM to diagnose a bug that one of our QA guys hit a few months back.

Interpreting an xVM core dump

The test machine is in a lab somewhere in China. It paniced one night, giving us the following information:

panic[cpu2]/vcpu=0xffbe1100: Domain 0 crashed: 
         gs:      161  fs:        0  es:     e010  ds:     e010
        edi: f4b63000 esi: ff192f62 ebp: ffbf3f38 esp: ffbf3f1c
        ebx: ffbf3f44 edx:        3 ecx:     1efc eax:        3
        trp:        0 err:        0 eip: ff13b83f  cs:     e008
        efl:      282 usp: ffbf3fe0  ss:     e010
        cr0: 8005003b cr2: c5a08fd4 cr3:   be1d00 cr4:      6f0
Xen panic[cpu2]/vcpu=0xffbe1100: Domain 0 crashed:

ffbf3f38 0xff11b058 (in Xen)
ffbf3f48 0xff1043f0 (in Xen)
ffbf3f68 0xff10435c (in Xen)
ffbf3f88 0xff10438b (in Xen)
ffbf3fa8 0xff17daf1 (in Xen)
dumping to /dev/dsk/c0t0d0s1, offset 860356608, content: kernel
100% done: 52473 pages dumped, compression ratio 3.51, dump succeeded

There's not a whole lot there to go on. If necessary, I'm sure we could have made a start at debugging the problem with only these few lines. Fortunately, that wasn't necessary. As the last two lines show, we were able to capture a core dump for postmortem examination.

Since this was a test machine under active use it was subject to crashing on a regular basis, so trying to examine the core dump on that machine was destined to be an exercise in futility. The QA engineer uploaded the entire dump to a more stable machine sitting in a lab in California, where I was able to debug it at my leisure.

The first step is to load the core dump into mdb, the Solaris binary debugger. Here we can see that the machine on which the debugger is running is called fsh-trout and the machine that generated the dump was called goldcloth. We can also see exactly which Solaris build the machine was running.

fsh-trout# ls -l
total 1316400
-rw-r--r--   1 root     root     1196459 Apr 10 11:29 unix.0
-rw-r--r--   1 root     root     672440320 Apr 10 11:59 vmcore.0
fsh-trout# mdb *0
Loading modules: [ unix genunix specfs dtrace xpv_psm scsi_vhci ufs  ip
hook neti sctp arp usba fctl nca lofs zfs random sppp ptm nfs md logindmux
xpv ]
> $<utsname
{
    sysname = [ "SunOS" ]
    nodename = [ "goldcloth" ]
    release = [ "5.11" ]
    version = [ "matrix-build-2007-04-02" ]
    machine = [ "i86pc" ]
}

For a standard Solaris core dump this is a pretty typical way to start. For an xVM dump, however, we need to know more than the Solaris build. We also need to know which version of xVM we were running:

> xenver::print
{
    xv_ver = 0xccad8fc0 "3.0.4-1-sun"
    xv_chgset = 0xccad1f80 "Fri Jan 05 16:11:49 2007 +0000 13187:e2466414acc3"
    xv_compiler = 0xccacffc0 "gcc version 3.4.3 (csl-sol210-3_4-20050802)"
    xv_compile_date = 0xccad6f08 "Mon Apr  2 06:59:37 PDT 2007"
    xv_compile_by = 0xccad8fa0 "xen-discuss"
    xv_compile_domain = 0xccad8f80 "opensolaris.org"
    xv_caps = 0xccad8f60 "xen-3.0-x86_32p"
}

In this case, the QA engineer emailed me the original panic message. Even if he hadn't, I could have extracted it from the core dump:

> ::msgbuf
Xen panic[cpu2]/vcpu=0xffbe1100: Domain 0 crashed: 
         gs:      161  fs:        0  es:     e010  ds:     e010
        edi: f4b63000 esi: ff192f62 ebp: ffbf3f38 esp: ffbf3f1c
        ebx: ffbf3f44 edx:        3 ecx:     1efc eax:        3
        trp:        0 err:        0 eip: ff13b83f  cs:     e008
        efl:      282 usp: ffbf3fe0  ss:     e010
        cr0: 8005003b cr2: c5a08fd4 cr3:   be1d00 cr4:      6f0
ffbf3f38 0xff11b058 (in Xen)
ffbf3f48 0xff1043f0 (in Xen)
ffbf3f68 0xff10435c (in Xen)
[...]

The last few lines represent the bottom of the panicing thread's execution stack. If you're a glutton for punishment, you can manually translate those addresses into symbol names. I, on the other hand, am lazy and prefer to let the machine do that kind of grunt work. To show the stack in human readable form:

> $C                                  
ffbf3f38 xpv`panic+0x33()
ffbf3f48 xpv`dom0_shutdown+0x43()
ffbf3f68 xpv`domain_shutdown+0x22()
ffbf3f88 xpv`__domain_crash+0x9d()
ffbf3fa8 xpv`__domain_crash_synchronous+0x25()
ffbf3fc8 0xff17daf1()
c5a0904c page_ctr_sub+0x47(0, 0, f3328700, 1)
c5a090e4 page_get_mnode_freelist+0x3c1(0, 10, 0, 0, 20009)
c5a09164 page_get_freelist+0x16f(f502ed14, e5406000, 0, c5a09224, e5406000,
[...] 
c5a0a154 lofi_map_file+0xfc(2400000, c5a0a3e8, 1, c5a0a800, cd326e78, 80100403)
c5a0a360 lofi_ioctl+0x13e(2400000, 4c4601, c5a0a3e8, 80100403, cd326e78, 
c5a0a38c cdev_ioctl+0x2e(2400000, 4c4601, c5a0a3e8, 80100403, cd326e78, c5a0a800
c5a0a3b4 ldi_ioctl+0xa4(dfb15310, 4c4601, c5a0a3e8, 80100403, cd326e78, c5a0a800
c5a0a804 xdb_setup_node+0xbc(e52e1000, c5a0a84c)
c5a0ac58 xdb_open_device+0xc9(e52e1000)
c5a0ac88 xdb_start_connect+0x59(e52e1000)
c5a0acbc xdb_oe_state_change+0x70(c66c9088, c57e5c40, 0, c5a0ad70)
c5a0acf8 ndi_event_run_callbacks+0x87(c54f0d80, c66c9088, c57e5c40, c5a0ad
[...]

In this stacktrace, we are looking at a mixture of Solaris and xVM hypervisor code. The symbols prefixed by xpv` represent symbols in the xVM portion of the address space. Those of you familiar with mdb will recognize this as the notation for a kernel module. Even though the machine originally booted with xVM living in an address space completely unknown to the kernel, in the core dump we have made it look like just another kernel module. This allows our existing debugging tools to work on xVM dumps without any modification at all.

One unfortunate limitation of mdb at the moment is that it doesn't understand xVM's notation for exception frames. xVM represents these frames on the stack by inverting the frame pointer. Since mdb isn't aware of this convention, it simply prints the first value of the frame out as a hex number. In the example above, that occurs in the frame at 0xffbf3fc8.

Since I suspect the cause of the panic is to be found in the exception frame, I have to decode it by hand:

> ffbf3fc8,4/X                        
0xffbf3fc8:     c5a0901c        0               e0002           f4c76e9d

The third slot in the frame contains the type of exception (0xe, which is a page fault), and the fourth slot in the frame contains the instruction pointer when the exception occurred. How do I know this? By reading the xVM exception handling code. Now you know, and you don't have to read the code yourself. You're welcome.

We have the instruction pointer, but what does it correspond to?

> f4c76e9d::dis  
page_ctr_sub_internal:          pushl  %ebp
page_ctr_sub_internal+1:        movl   %esp,%ebp
page_ctr_sub_internal+3:        subl   $0x30,%esp
page_ctr_sub_internal+6:        andl   $0xfffffff0,%esp
page_ctr_sub_internal+9:        pushl  %ebx
page_ctr_sub_internal+0xa:      pushl  %esi
page_ctr_sub_internal+0xb:      pushl  %edi

The very first instruction of the Solaris routine page_ctr_sub_internal() caused xVM to take a page fault. When you see a stack operation trigger a page fault, it almost always means you've blown your stack - that is, you've used all the space allocated for the stack and are now accessing some address beyond the stack limit.

We want to confirm that this is the cause of the panic, and then try to understand why it happened. To do so, we have to go back to the panic message to find the address that caused the fault (which on x86 can be found in the %cr2 register) and the CPU on which it happened.


> ::msgbuf !egrep "(cr2|vcpu)"
     Xen panic[cpu2]/vcpu=0xffbe1100: Domain 0 crashed: 
        cr0: 8005003b cr2: c5a08fd4 cr3:   be1d00 cr4:      6f0

OK, the faulting address is easy enough: 0xc5a08fd4. The CPU is a little bit trickier. Since this was an xVM panic, we know the xVM vcpu that caught the exception, but we really want to know the Solaris CPU. To figure that out, we have to wander through a few levels of indirection.

First we translate from the vcpu pointer to the VCPU id.

> 0xffbe1100::print struct vcpu vcpu_id
vcpu_id = 0

The VCPU id (0) is the same as the Solaris CPU id, so we look it up in the cpuinfo table:

> ::cpuinfo
 ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
  0 f50454b8  1b    0    0  60   no    no t-0    c5a0ade0 sched
  1 c5423080  1b    1    0  59  yes    no t-0    c6178ee0 python2.4
  2 c5a71a80  1b    0    0  59   no    no t-0    c617cac0 xenstored
  3 c5a70a00  1b    1    0  -1   no    no t-0    c5afdde0 (idle)

As a bonus, this table also tells us which thread was running on the CPU when it paniced: 0xc5a0ade0. Since our hypothesis is that this thread blew its stack, let's take a look at the base address of the stack:

> c5a0ade0::print kthread_t t_stkbase
t_stkbase = 0xc5a09000

The bottom of the stack is at 0xc5a09000 and we just tried to access 0xc5a08fd4 - which is 44 bytes below the bottom of the stack. So we were right about the cause of the panic. To see what ate up all of the stack space, let's look at the stack again:

> $C                                  
[...] 
c5a0a154 lofi_map_file+0xfc(2400000, c5a0a3e8, 1, c5a0a800, cd326e78, 80100403)
c5a0a360 lofi_ioctl+0x13e(2400000, 4c4601, c5a0a3e8, 80100403, cd326e78, 
c5a0a38c cdev_ioctl+0x2e(2400000, 4c4601, c5a0a3e8, 80100403, cd326e78, c5a0a800
c5a0a3b4 ldi_ioctl+0xa4(dfb15310, 4c4601, c5a0a3e8, 80100403, cd326e78, c5a0a800
c5a0a804 xdb_setup_node+0xbc(e52e1000, c5a0a84c)
c5a0ac58 xdb_open_device+0xc9(e52e1000)
c5a0ac88 xdb_start_connect+0x59(e52e1000)
c5a0acbc xdb_oe_state_change+0x70(c66c9088, c57e5c40, 0, c5a0ad70)
c5a0acf8 ndi_event_run_callbacks+0x87(c54f0d80, c66c9088, c57e5c40, c5a0ad
[...]

The leftmost column contains the frame pointer for each frame in the stack. Each time that number increases by a large amount, it means the calling routine allocated a lot of memory on the stack. In this example, both xdb_open_device() and xdb_setup_node() are allocating over 1KB on the stack.
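
You can put rough numbers on that claim using mdb's expression evaluator. Taking the frame pointers from the stack above (the output here is reconstructed from memory rather than pasted from the original session, so the formatting may not be exact):

> c5a0a804-c5a0a3b4=D
                1104
> c5a0ac58-c5a0a804=D
                1108

Both gaps are a bit over 1KB.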

By inspecting these routines we could see that each was allocating MAXPATHLEN-sized arrays on the stack. That's not large for a userspace application, but it's pretty big for the kernel. By replacing these stack allocations with calls to kmem_alloc(), we were able to reduce the stack consumption, which fixed the panic.

Without the ability to capture a full dump of both xVM and Solaris, this trivial bug would have been much harder to detect and fix. The only information we would have had was the fact that we took a page fault, and the address on which we faulted. If we were lucky, we might also have gotten a hex stack dump - not a human readable stack like we see above, but a raw block of a few hundred hex numbers.

Friday Apr 20, 2007

Panic! on the streets of London^H^H^H^H^H^HOssining

I spent a few days in beautiful Ossining, New York this week, attending the spring Xen Summit at IBM's T.J. Watson Research Center. While most of the surrounding area was still underwater from this weekend's nor'easter, the folks at IBM had the good sense to build their facility on top of a hill. The auditorium in which most of the talks were held had a kind of 70's ski lodge vibe to it, so maybe the architecture was influenced by the altitude.

While there, I gave a presentation on how Solaris handles panics triggered by Xen. The short version is: if Xen can't be trusted anymore, we let Solaris take over and bring us in for a soft landing. The longer version can be seen in the slideset available here.

Tuesday Mar 13, 2007

Zones, BrandZ, and Xen in Burlington

I gave a presentation on the various virtualization technologies available through OpenSolaris at last night's New England OpenSolaris Users' Group meeting. I didn't do a real count, but I would guess that there were about 30 people there. That's a larger crowd than I saw at the last Bay Area Users' Group meeting that I attended, which was a pleasant surprise. Of course, they didn't have the Sun Blackbox on hand either.

I was asked last night, and reminded twice today, to post the slides I presented. So here you go.

I'll be giving essentially the same talk next week at the Sun Tech Days in Paris. I'll be cutting back on the Zones material a little bit, since the presentation ran long and that section in particular seemed to drag a bit.

Saturday Nov 25, 2006

Least Convincing Book Review...Ever.

"I don't do a whole lot of reading or buying of books, but this book is definitly worth the money."

Killer

Wednesday Apr 19, 2006

More On Parallels

Parallels released an updated beta of their virtualization software for MacIntel today. Among the problems they claim to have fixed is the "host machine panics when it sleeps" bug. Since I am now retyping this blog entry after a MacOSX panic, I can testify that this bug in fact has not been fixed - at least not in all cases.

(The astute Sun blogger should now be wondering why I didn't "Save as draft" before trying a test I expected to kill my machine. Well, I pressed the button. I just hadn't gotten around to training the NoScript Firefox extension that it was OK to execute JavaScript on sun.com.)

The most annoying bug I found last time (the broken "\" and "|" keys) is still there. Luckily, somebody posted a workaround in the comments section of my last entry, so this has been downgraded from Showstopper to Annoyance. Also still present is the PCI timeout problem.

The only real improvement that I've found is the addition of full-screen support, which is a big step forward in usability. In full-screen mode, the mouse tracking does get better. I would prefer it if they polled a little more frequently, but I can live with it.

More on networking performance. As a zero-effort bandwidth test, I used ftp between my machine at home and either ftp.gnu.org or my machine at work. I also tried two different technologies to get through Sun's firewall: IPSec on Solaris and a third-party VPN solution on Mac OSX:

        Host                Target        Tunnelling technology   Bandwidth
        Mac OSX             ftp.gnu.org   None                    634 kB/s
        Parallels/Solaris   ftp.gnu.org   None                    607 kB/s
        Parallels/Solaris   ftp.gnu.org   IPSec                   157 kB/s
        Parallels/Solaris   Sun Machine   IPSec                   161 kB/s
        Mac OSX             ftp.gnu.org   VPN                     240-290 kB/s (bouncing all over the place)
        Mac OSX             Sun Machine   VPN                     240-300 kB/s (bouncing all over the place)

Obviously IPSec from inside the VM had the worst bandwidth for this test, backing up the qualitative assessment I made in the last entry. It is delivering only about 25% of what I can get with a direct connection to the network. Since the Parallels network interface delivers more than 95% of the raw network bandwidth to the Solaris VM without any tunnelling, this drop in performance is either the fault of the IPSec software, or its interaction with Parallels.

VPN looks better in this simple test, but in practice I often have it hang on me for seconds or minutes at a stretch. The lower, but more reliable, performance of the IPSec solution is much easier to work with interactively.

There are still some tests I need to do to fill out the picture. First, I need to measure the IPSec performance on a non-virtualized Solaris machine. This will tell me how much of the performance loss can be blamed on the interaction between IPSec and the virtualization environment. Second, I want to measure the performance of Solaris IPSec when running in a VMware machine. This will help determine whether this performance drop is specific to the Parallels VM or if it is a common virtualization problem.

Monday Apr 17, 2006

At What Point Does This Get Ridiculous?

I'll start with the punchline. Below is a picture of Mac OSX hosting a Solaris virtual machine, which itself has a Linux zone running on it.

I finally got myself a Mac Mini. I've been planning to buy one since Apple released the first x86-based models. Since I have no use for a computer that can't run Solaris, I had to wait for a few months. For me, the tipping point was not Apple's recent Boot Camp release, even though that does allow you to boot Solaris on a Mac. Instead, it was the announcement of an OS virtualization environment running on MacOSX. I assumed it was just a matter of time before VMware or Microsoft ported their virtualization tools, but some company I had never heard of beat them to it: Parallels.

The Parallels VM is much less featureful than VMware, but it is also much lighter weight. For my simple usage scenario, I really don't need most of the bells and whistles VMware provides, so Parallels is fine.

I was able to get S10 FCS installed and running on a Parallels VM on the second try. It would have worked on the first try, but I got fed up with the DVD performance and killed the install. I was then able to Liveupgrade the system to BrandZ using the BrandZ build 35 DVD image.

Now the bad news. Parallels still has quite a way to go before it is ready for prime time - even for casual home use.

There is one level-0 issue that is a complete showstopper for anybody who plans to run Solaris (or any Unix): the '\' and '|' keys don't work. I guess Parallels doesn't do any testing of Unix-based systems at all, or this wouldn't have made it out the door.

Some other issues that I have run into so far:

  • When the Mac OS sleeps, it panics if Parallels is running. The kernel stack shows that it is running in the Parallels hypervisor, so this doesn't seem like something you can pin on Apple.
  • Adding a second drive leads to PCI timeouts. It doesn't matter which PCI bus it's on or how many busses you have. Since you can't add more disk space, make sure you plan ahead.
  • You can't boot from a Solaris Express CD. It does boot an S10 FCS CD, so this is probably something grub related. Having said that, I was able to Liveupgrade to a grub-based build successfully, so it's obvious that grub isn't DOA.
  • Parallels doesn't support full-screen mode yet.
  • The mouse tracking is a little bit twitchy. I have the same issue with VMware on a Linux system. It seems to be a little better in full-screen mode on VMware, so maybe there is some hope that this will get better when Parallels supports full-screen.
  • Solaris doesn't support the network device Parallels pretends to provide. Fortunately, there is a Solaris driver for the device available from the ever-reliable Masayuki Murayama.
  • When installing, the DVD performance was terrible - although I can't say whether that is an issue with the Mac or with Parallels. Either way, it was faster for me to stop the install half-way through, rip the DVD to an ISO image, and restart the install with the ISO in place of the physical DVD.
  • When connecting to Sun from my Solaris VM at home, the performance seems a little bit sluggish. I suspect this is due to the Parallels network virtualization, but that is purely a guess. I haven't done anything to investigate this yet. It's also possible that there is something in our IPSec software that suffers in a virtualization environment.
  • Liveupgrade is very slow. It took nearly 3 hours to liveupgrade from S10 FCS to Brandz35. I was upgrading from a Solaris install image on an NFS server, so this could easily be another manifestation of network performance issues. I really need to take some measurements...
  • If you have two VMs running simultaneously, shutting down or rebooting one will cause the other to die.

I've submitted a few of these problems to Parallels, but their response doesn't fill me with confidence:

             Hello Nils!
             Thank you for your interest in Parallels Workstation.

             We appreciate your feedback.

We'll see how this product improves over the coming months. Since VMware is allegedly planning a Mac OSX release, Parallels has their work cut out for them.

Thursday Mar 09, 2006

Nearly Instantaneous Deployment of Linux Zones

One of the limitations of the lx brand (AKA, a Linux userspace hosted on top of Solaris) is that we only support whole root zones. This means that each zone contains a full copy of the Linux software, which consumes a lot of disk space. The native Solaris brand supports whole root zones as well as sparse root zones. A sparse root zone has its own private copy of some things (like most of /etc and /var), but the vast bulk of the Solaris software is lofs-mounted from the global zone, allowing a sparse root Solaris zone to consume less than 100MB of disk space.

In addition to consuming a lot of disk space, limiting lx to supporting whole root zones means that zone installation can take quite a bit of time.

To address both the time and space issues, we can take advantage of the new ZFS clone functionality. In brief, ZFS allows you to make a snapshot of a filesystem, capturing the filesystem's state at that moment. ZFS then allows you to make an arbitrary number of clones of that snapshot. A clone is actually a copy-on-write copy of the snapshot, meaning that it consumes no disk space of its own until you start modifying it. As you make changes to data in the clone, ZFS makes a new, private copy of that data, which is then assigned to the clone.

When installing a new Linux zone, instead of installing a fresh copy of the Linux software in the zone, we can use the ZFS clone functionality to copy the software from an existing zone.

Creating the initial 'master' filesystem

Before you can start using the ZFS clone functionality to install new Linux zones, you need to construct the initial snapshot on which the clones will be based.

Full documentation on how to create ZFS filesystems and install Linux zones is beyond the scope of this post. For that information, you should seek out the official documentation in both the ZFS and BrandZ communities. The short version is:

        fireball# zfs create zpool/demo
        fireball# zfs set mountpoint=/export/zones/demo zpool/demo
        fireball# zoneadm -z demo install -d <path to centos tarball>

Once the zone has been installed, you need to capture a snapshot of it:

        fireball# zfs snapshot zpool/demo@template

At this point you should have something like this:

        fireball# zfs list | grep demo
        zpool/demo             632M  14.1G  1.91G  /export/zones/demo
        zpool/demo@template       0      -  1.91G  -

Installing the new zone

Now that you have your initial zone created, you are ready to start installing new zones using the initial zone as an installation source. To do so, you will need a version of the /usr/lib/brand/lx/lx_install script that has the clone support added. You can get the modified lx_install script right here:

http://blogs.sun.com/roller/resources/nilsn/lx_install

This script clones the snapshot to zpool/<zonename> and mounts it at <zoneroot>. This script is never executed directly - you should continue to use the zoneadm install command to install the zone. However, where you used to specify -d <linux source location> on the install line, you will now specify -z <zfs snapshot>. The zoneadm tool calls the lx_install script at the appropriate time during the zone construction process, and it will pass the -z option through to the script.

Two things to note about the script: it hardcodes the assumption that the ZFS pool from which the filesystems should be created is called zpool; and it assumes that you want the new filesystem to be created at zpool/<zonename>. If this were being done the Right Way, instead of as an on-the-train hack, these limitations would have to be removed. If these assumptions don't match your system, feel free to modify the script to change them.
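
For reference, the heart of what the modified script does boils down to a couple of zfs commands. This is just a sketch using the placeholder names above, not the script itself:

        # zfs clone zpool/demo@template zpool/<zonename>
        # zfs set mountpoint=<zoneroot> zpool/<zonename>

Because the clone initially shares all of its blocks with the snapshot, both commands complete almost instantly, which is where the sub-second install time below comes from.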

The bottom line

        <configuration of zone ldemo1 left as an exercise for the reader.>

        fireball# time zoneadm -z ldemo1 install -z zpool/demo@template
        lx_install: Cloning ldemo1 from zpool/demo@template
        Installation done.

        real    0m0.630s
        user    0m0.047s
        sys     0m0.106s
        fireball# zfs list zpool/ldemo1
        NAME                   USED  AVAIL  REFER  MOUNTPOINT
        zpool/ldemo1            97K  14.1G  1.52G  /export/zones/ldemo1

A new Linux zone was installed in under 1 second, and consumes less than 100K of disk space. Full disclosure: after booting and halting the zone once, which (among other things) modifies some log files in /var, the zone has grown to a whopping 1.65MB:

        fireball# zfs list zpool/ldemo1
        NAME                   USED  AVAIL  REFER  MOUNTPOINT
        zpool/ldemo1          1.65M  14.1G  1.52G  /export/zones/ldemo1

Not only does this save you time and disk space, it also saves time and effort in configuration. You can create a single master zone, modify and configure it to your heart's content, and then take a snapshot of the configured zone to be used as a template when creating new zones. Any new zone that is cloned from the template will automatically inherit all of the configuration.
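
For example, once the master zone looks the way you want it, you might capture a fresh snapshot and install from that instead (the names here are only illustrative):

        # zfs snapshot zpool/demo@configured
        # zoneadm -z ldemo2 install -z zpool/demo@configured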

Removing a zone

When it comes time to uninstall or delete a zone, we don't need to take the time to walk through the filesystem deleting each file and directory individually. Instead, we can simply throw away the clone, which takes a fraction of the time. In fact, given the way ZFS implements cloning, deleting a file in the clone constitutes a change in the filesystem, which triggers a copy-on-write operation in the clone. So, an rm -rf of a clone is even more expensive than on a typical filesystem.

To avoid this cost, a zone uninstall should be done as follows:

        # zfs destroy zpool/<zonename>
        # zoneadm -z <zonename> uninstall

Future

Finally, since this is such an obvious way to leverage the ZFS clone functionality, it should not be surprising to learn that we are exploring the possibility of using it to clone Solaris zones as well. Since we hope to make this a generic zones feature, it is highly unlikely that the Linux-specific hack I've described here will find its way into the final product. Until this is done the Right Way in zones, we'll be using this hack in-house.


Thursday Jan 12, 2006

Installing a Debian Zone with BrandZ

Background

One of the reasons we released BrandZ to the community before integrating with Solaris was the hope that people would use the framework to create their own brands.

Less complex than creating a completely new brand is extending the lx brand to support other Linux distros. The simplest case is supporting a distro based on the same kernel/glibc version as our original RHEL/CentOS brand. To support such a distro, in theory we just need to create an installer for the new distro.

There are two distros that I have been asked about multiple times: Debian and Ubuntu. Both are based on the Debian package format, instead of RPM like the distros we currently support. I wanted to install Ubuntu, since its emphasis on getting wide distribution means it has a much smaller learning curve than Debian. Unfortunately (in this case), Ubuntu is also much more aggressive about upgrading their core components. Debian still has an actively used version ('sarge') based on the kernel/glibc revs currently supported by BrandZ, while Ubuntu does not. So, Debian it was.

What follows is a description of the steps I went through to get Debian 'sarge' installed and running in a zone on Solaris. This isn't meant as a tutorial on how to perform a similar feat on your own system, since (as will become clear), it's really a pain in the neck. Rather, this is offered as a starting point (or possibly a cautionary tale) for anybody interested in writing a new install script for Debian or any other distro.

Zone configuration and layout

The first step, as always, is to configure the new Linux zone:

        crunchyfrog# zonecfg -z debian
        debian: No such zone configured
        Use 'create' to begin configuring a new zone.
        zonecfg:debian> create -B lx
        zonecfg:debian> set zonepath=/export/zones/debian
        zonecfg:debian> add net
        zonecfg:debian:net> set physical=bge0
        zonecfg:debian:net> set address=10.8.28.84
        zonecfg:debian:net> end
        zonecfg:debian> commit
        zonecfg:debian> exit

Next we do the install. Normally this process actually installs the Red Hat or CentOS bits into your zone. In this case, I wanted a blank slate on which to install the Debian software, so I used the 'tarball' install option to unpack an essentially empty tarball:

        crunchyfrog# cd tmp/
        crunchyfrog# mkdir usr
        crunchyfrog# tar cf dummy.tar usr
        crunchyfrog# zoneadm -z debian install -d /tmp/dummy.tar

With this done, I now had a Linux zone that was nominally in the 'installed' state, but it didn't actually have any software in it.

Installing Debian

When installing Red Hat or CentOS, we use the RPM software in the SFWrpm package. Debian uses a different packaging format, which we don't have any tool for manipulating. Fortunately, there is another OpenSolaris project which is working on creating a new Solaris distro built around this packaging format: Nexenta. I grabbed the version of dpkg they had built for Solaris, and I was ready to start installing.

Since I already had a sarge 'netinstall' CD handy, that's the medium I decided to install from. It can be downloaded from the Debian site here: debian-31r1a-i386-netinst.iso.

Getting to single-user mode

Through a little trial and error, I settled on these as the minimal set of packages needed to boot a zone that is capable of further self-installation:

        libc6, libacl1, libattr1, dpkg, dselect, tar, grep, perl, sed,
        find, mawk, gzip, zlib1g, libuuid1, debianutils, coreutils, procps,
        passwd, login, sysvinit, initscripts, sysv, util-linux, bsdutils,
        mount, base-passwd, base-files, bash, hostname, libncurses5,
        libpam0g, libpam-modules, libpam-runtime

It turns out that dpkg is much more insistent on running pre/post-install scripts than RPM. So much so that even an 'unpack' operation requires running a pre-install script. So, I decided to break the process into two parts: extract the packages with 'dpkg-deb --extract' now, and do a real install once the zone was booted.

        crunchyfrog# for i in `cat pkgs`; do dpkg-deb --extract $i /export/zones/debian/root/; done

Since this method ensures that the install scripts aren't run, we have to do some of their work by hand before the zone will boot. Again, this was done largely by trial and error. The RHEL / CentOS installer we ship as part of lx does this work in the setup_lx_env script invoked by the installer.

        1. crunchyfrog# cd /export/zones/debian/root
           crunchyfrog# cp usr/share/sysvinit/inittab etc/
           crunchyfrog# cp usr/share/base-passwd/passwd.master etc/passwd
           crunchyfrog# cp usr/share/base-passwd/group.master etc/group
           crunchyfrog# cp usr/share/passwd/shells etc/shells

           crunchyfrog# cat > etc/rcS.d/S20proc
           #!/bin/bash
           echo "Mounting /proc"
           mount -n -t proc /proc /proc
           ^C
           crunchyfrog# chmod +x etc/rcS.d/S20proc
           crunchyfrog# cat > etc/fstab
           none /       ufs     defaults        1 1
           none /proc   proc    defaults        0 0
           ^C

2. I also modified inittab to eliminate the gettys running on the virtual consoles (which Solaris doesn't support) and add one on the zone console.

           [...]
           1:2345:respawn:/sbin/getty 38400 console
           #2:23:respawn:/sbin/getty 38400 tty2
           #3:23:respawn:/sbin/getty 38400 tty3
           #4:23:respawn:/sbin/getty 38400 tty4
           #5:23:respawn:/sbin/getty 38400 tty5
           #6:23:respawn:/sbin/getty 38400 tty6
           [...]

3. Finally, there are a few links that need to be created because the zones framework expects the files to be in one place while Linux puts them somewhere else:

           ln -s /bin/sh sbin/sh
           ln -s /bin/su usr/bin/su
           ln -s /bin/login usr/bin/login

We now have a zone that can boot to single user mode:

        crunchyfrog# zoneadm -z debian boot -s
On the debian console we see:
        [NOTICE: Zone booting up]
        INIT: version 2.86 booting
        /etc/init.d/rcS: line 27: /etc/default/rcS: No such file or directory
        Press enter for maintenance
        (or type Control-D to continue):
        root@debian:~# uname -a
        Linux debian 2.4.21 BrandX fake linux i686 GNU/Linux

Housekeeping

Now to install the rest of the software. I started by copying the list of already-installed packages into the zone, and by making the mounted CD image visible as well:

        crunchyfrog# cp /tmp/pkgs /export/zones/debian/root/tmp/
        crunchyfrog# mount -F lofs /mnt/ /export/zones/debian/root/mnt/

The first thing I did was ensure sanity in the dpkg database by reinstalling all of the packages I had extracted before. Most of the following dpkg operations spit out a number of warnings, basically reflecting the fact that the environment is still insane.

To make dpkg happy I had to create a few files:

        debian# touch /var/lib/dpkg/status
        debian# touch /var/lib/dpkg/available

I then installed two packages 'by hand':

        debian# dpkg -i --force-depends /mnt/pool/main/d/dpkg/dpkg_1.10.28_i386.deb
        debian# dpkg -i --force-depends /mnt/pool/main/g/glibc/libc6_2.3.2.ds1-22_i386.deb

As part of its install process, libc interactively sets the timezone. Thus the need to install the package this way. Presumably this could be automated, but that's a problem for another day.

Next I did a bulk (re)install of all the other packages we added to the zone prior to booting.

        debian# for i in `cat /tmp/pkgs`; do dpkg -i --force-depends $i; done

We now have a working debian zone with a consistent dpkg database. dpkg -l should list all of the installed software and dpkg --audit shouldn't report any problems.

One final bit of cleanup is needed. In the process of properly installing all those packages, /etc/rcS.d was populated with a bunch of startup scripts that we don't want to run. The easiest way to deal with this is:

        debian# cd /etc/rcS.d
        debian# mv S20proc /tmp
        debian# rm *
        debian# mv /tmp/S20proc .

This is a pretty heavy-handed operation, but it works because we don't have any necessary services installed yet. All we're wiping out are the low-level, hardware-config kind of scripts that we don't need inside a zone.

Preparing to apt-get the universe

The next step was to start loading the bulk of the software into the zone. My goal was to be able to use apt-get to get as much as possible off of the net, but obviously I needed to install at least enough stuff to get networking working first.

        libgcc1
        gcc-3.3-base
        libstdc++5
        apt
        net-tools
        debconf [1]
        debconf-english
        ifupdown
        netkit-inetd
        netkit-ping
        libwrap
        tcpd
        netbase

With these packages installed and my /etc/resolv.conf configured properly, I could now run apt-get to install any additional software into the zone.
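
For completeness, the configuration involved is minimal. Something along these lines is enough for apt-get to reach the outside world; the nameserver address and mirror are examples rather than what I actually used:

        debian# cat /etc/resolv.conf
        nameserver 10.8.28.1
        debian# cat /etc/apt/sources.list
        deb http://ftp.us.debian.org/debian sarge main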

First up: satisfy any lingering dependencies for the packages I already manhandled into place:

        root@debian:~# apt-get -f install
        Reading Package Lists... Done
        Building Dependency Tree... Done
        Correcting dependencies... Done
        The following extra packages will be installed:
          e2fslibs e2fsprogs initscripts libblkid1 libcap1 libcomerr2
          libdb1-compat libdb3 libdb4.2 libss2 slang1a-utf8
        Suggested packages:
          gpart parted e2fsck-static
        The following NEW packages will be installed:
          e2fslibs e2fsprogs initscripts libblkid1 libcap1 libcomerr2
           libdb1-compat libdb3 libdb4.2 libss2 slang1a-utf8
        0 upgraded, 11 newly installed, 0 to remove and 0 not upgraded.
        1 not fully installed or removed.
        Need to get 1685kB of archives.
        After unpacking 4497kB of additional disk space will be used.
        Do you want to continue? [Y/n]

Now we can install whatever applications we intend to run, and rely on apt-get to resolve any dependencies for us. For example, I want to run the IRC client 'xchat', so I do:

        root@debian:/# apt-get install xchat

The tool identifies the 41 packages I need to install before this will work, downloads all 25MB of .deb files, and installs them for me.

Lather, rinse, repeat until you have a zone that does everything you need it to.

Conclusion and Caveats

After all the work I went through to get the zone installed, I haven't actually used it very much. Our focus is primarily on Red Hat, since that's the distro we expect to formally support, so the Debian zone isn't getting a lot of use.

The only problems I found with the zone which are not resolved in our Build 31 release revolve around threading. Due to the completely different approaches Linux and Solaris take to threading, this has been a particularly tricky area for us to get right. It appears that Debian is still using the old linuxthreads model, while Red Hat has moved ahead to NPTL. In the unlikely event that I have some free time, I might investigate the linuxthreads issues. Otherwise, if you're trying to run a Debian zone and run into problems, this is probably the first place you should look.

As I said at the top, this really isn't a process that most people are going to want to go through manually. Bootstrapping any operating system is ugly, and trying to do it on top of a different operating system is doubly so.

My original goal was just to work through the install process to help light the way for other people, but I ended up uncovering a handful of bugs as well. The libraries and apps available on Debian are different enough from the CentOS code we've been using that they poke our emulation support in new and different ways. Since it revealed bugs that would have bitten us at some point, this has turned out to be a more generally useful exercise than I had originally expected.


[1] Something funny was going on with perl's flock(). When trying to configure debconf, it kept complaining that the configuration database was locked by another user. At first glance it appears that this is a bug in either perl or debconf. The file is being opened read-only, but we are trying to get a write-lock on it. That's not allowed in Solaris, but if it is allowed in Linux, then this is a bug in our emulation. Temporary workaround: comment out the flock() calls in /usr/share/perl5/Debconf/DbDriver/File.pm.


Tuesday Dec 13, 2005

BrandZ: fitting square pegs into round holes since 2005.

Today we are unleashing the initial release of BrandZ into the wild. BrandZ is a framework that builds on the zones functionality introduced in Solaris 10, and is the technology that underlies the feature known as Solaris Containers for Linux Applications (SCLA).

BrandZ allows for the creation of zones populated with non-native software. Each new zone type is referred to as a brand and the installed zones are called Branded Zones. This software could include a different set of Solaris software, such as a GNU/Solaris distribution.

More interesting, BrandZ allows us to create zones populated with non-Solaris software. This initial release includes a single brand, lx, which enables the creation of zones that run Linux software rather than Solaris.

BrandZ and the lx brand are both still very much works in progress, so they are being released independently from the mainline Solaris source tree. Much of the basic functionality is in place, but there is still a lot of work to do before this is ready to be integrated into Solaris. A description of the remaining work can be found on the BrandZ community page. If you just want to play with BrandZ, this description can help identify what you shouldn't expect to work yet. If you are interested in participating in the BrandZ development, this list will give you some ideas for places to start.

Seriously? Why Linux on Solaris?

This feature has been in development for several years (more on that below). In its early days the focus was on customers that were trying to transition from Linux to Solaris, but had a number of legacy Linux applications that were preventing the switch. Linux emulation was viewed as a way to help them make the switch to Solaris before their full software toolset was available. Since the release and rapid uptake of Solaris 10, ISVs have been porting their applications to Solaris on x86 even faster than we expected, somewhat reducing the need for this transitional tool. There are still many customers for whom this will be useful, but their interest is now as likely to be on in-house applications as on ISV applications.

Another area in which this feature is useful is for consolidation. A monocultural example would be a customer that wants to consolidate multiple Linux applications onto a single machine, but is not sure how, or if, those applications can coexist peacefully. BrandZ allows each application to run in its own Linux zone, eliminating most of the possible coexistence problems. Alternatively, a customer may have a mix of Linux and Solaris applications, but not want to maintain two sets of machines to run them. This use case is already being explored by an EDA group in Sun.

In a consolidation environment, whether Linux-only or a Solaris/Linux mix, you can think of the underlying Solaris kernel as being a highly-featured, scalable, and stable hypervisor. As noted in this Register story, this feature also allows you to host your Linux environments on a ZFS filesystem.

The most interesting use case for this OpenSolaris release is in the developer community. By enabling Linux applications to run on Solaris, developers are able to use Solaris tools such as DTrace to develop, debug, and tune their Linux applications. This use case has attracted the most immediate attention from customers, as it is an easy and low risk environment in which to evaluate a new technology. For more information on using DTrace on your Linux applications, see Adam's blog entry.

Janus, or What Took You Guys So Long?

Many people, both inside and outside of Sun, have asked whether this is "Project Janus", which is an internal code name that got a surprising and unfortunate amount of external exposure. Janus was the internal name for an initial prototype of Linux application support on Solaris. Rather than saying that BrandZ is Janus, it would be more accurate to say that BrandZ is the follow-on to Janus.

The Janus project was intended to be delivered with Solaris 10, and its development was wrapping up in time to make the initial release. However, last fall during the late-stage design and code reviews we decided that we really needed to rethink how the functionality was going to integrate into Solaris as a whole. There were a number of problems that Janus solved in task-specific ways, which we decided were solved more cleanly and generally by the then-recently integrated Zones technology.

In the end we decided that it was more important that we get this project right than it was to make our original ship date. So we pulled the functionality from the initial release of Solaris 10 and essentially started over with a zones-based approach.

The Janus prototype was released to a small number of customers as the "Linux Application Environment technology preview", to get their thoughts on the quality of the emulation and the overall user experience. Much of the feedback from that preview has been rolled into BrandZ.

Why are we doing an OpenSolaris release?

Another question that I've gotten several times is: "why are you putting this out on opensolaris.org when you still have so much work to do on it?"

The first, and most obvious, answer is that people keep asking about it. Customers are asking their sales people where the feature is, and people within Sun are asking us how they can use it. Getting this out on opensolaris.org lets us answer everybody at once. This isn't feature-complete by any stretch of the imagination but, for good or ill, this release will let everybody know exactly where the project stands.

Second, we want to enable the creation of new brand types. We want to verify that our framework is sufficient to support a variety of brands before we roll it into Solaris.

Given the limited time and resources available, we are focussing our efforts on supporting the Red Hat and CentOS distributions. These distros only cover a small subset of the Linux environments currently in use. By distributing BrandZ through OpenSolaris, we hope to enable users to develop their own brands that support other distributions.

In some cases, supporting a new distro may be as simple as modifying our install scripts. A more interesting possibility is that some people may want to develop radically new brands. The most likely possibility seems to be a FreeBSD brand. Implementing these new brands is likely to require extending or modifying the BrandZ infrastructure. Allowing users to get an early start on this development will increase the likelihood that we will be able to roll any needed infrastructure changes into our code before we integrate into the main Solaris source tree.

Finally, emulating a new operating system is a complex proposition and ensuring that the emulation is correct and complete is equally complex. We have a number of test suites available to us to help verify the completeness and correctness, but test suites are necessarily limited. They tend to be comprised of small, focused tests, designed to ferret out relatively simple errors.

Properly testing a system such as BrandZ requires running a wide variety of applications in a wide range of environments. We can make some progress by testing applications ourselves, but we cannot reasonably test more than a small number of applications. By releasing BrandZ via OpenSolaris we hope to get early exposure to a wide range of users, environments, and applications. We hope that this exposure will help us identify important applications that we have overlooked, and find errors or shortcomings that the formal test suites cannot.

If you have any interest in experimenting with the Linux application support, improving the support for our current Linux distributions, adding support for new distros, or even developing completely new brands, then please check out the new BrandZ Community and/or sign up for the BrandZ discussion list.

Monday Sep 12, 2005

A Machine For All (Operating) Systems

In my earlier fit of crankiness I forgot to mention why I was looking at the Network Computing page at all: today we officially rolled out our new line of Opteron Servers. I first saw one of these systems hidden away in a lab near the Solaris engineers' offices last year, and I've been dying to get them into customers' hands since then. I'm a software guy, so bits of metal don't generally grab me, but these boxes blew me away. It's amazing how much power they squeezed into such a tiny space.

The reviews are starting to come in, and it looks like we have a hit on our hands. Even the Slashdot crowd generally seems to like what they see.

One complaint really stood out though:

The biggest problem I foresee for Sun in competing with Dell is simple, Suns don't run Windows and they don't run Linux. [...]
But back to that Windows thing, it's nice to be able to take a Dell and repurpose it from being a Linux system to a Windows system or vice versa. [...] If I buy a bunch of shiny new Suns not only am I locked into Solaris (which is painful to use after working on Linux for so many years) but I'm also locked into that hardware.

Others have already started to show this poster the error of his ways, but I hope this isn't a common misconception. These new machines are capable of running Solaris, Linux, or Windows 2003 Server. Obviously I think you should be running Solaris but neither I, nor the people who designed this system, are going to tell you you have to be running Solaris.

[Updated: I originally said the new servers support Windows XP. They actually support Windows 2003 Server. Thanks to Manish Kapur for the correction.]

Solaris Users Need Not Apply

This week is our big quarterly marketing event: "Network Computing 05Q3".

If you want somebody^h^h^h^hthing to remind you to check out the webcast, you can sign up to get an email message when the big day arrives. You can also download and run a widget on your desktop, which will presumably give you a series of whizzy, annoying announcements with increasing frequency as the time draws near.

I say 'presumably' because I can't actually run the widget. The instructions say:

Don't miss the NEXT Network Computing Event. Download the NC Events widget for a reminder.
Here is what you need to do:

1. Download and install Konfabulator
2. Then download and install the NC Events widget
Runs on Windows XP or 2000 and Mac OS X 10.2 or greater

That's right. The widget announcing Sun's big quarterly product rollout does not work on Sun's operating system. It doesn't even work on Linux. You have to be running Windows or MacOS X to get marketing's full attention.

If anybody can think of an occasion where Apple required a potential customer to use Windows-only technology, or where Microsoft required a Mac-only technology, please drop off a pointer in the comments section.

Friday Jul 29, 2005

Does your cow wear a watch?

An AP story about the recent vote to extend Daylight Savings says:

July 28,2005 | WASHINGTON -- Daylight-saving time would start three weeks earlier and run through Halloween under a change included in an energy bill approved Thursday by the House. The energy legislation was expected to be passed by the Senate, probably Friday, and sent to President Bush. The time change is supposed save energy because people have more daylight in the evening and do not have to turn on lights. The House had approved a two-month extension -- one in the spring and the other in the fall. But that was scaled back after airline officials complained that the extension would cause problems with intentional flight schedules. Farmers said the change would adversely affect livestock.

Can somebody explain to me how a sheep knows what time it is?

Wednesday Jun 22, 2005

Building Solaris - like a videogame, but not nearly as much fun.

No review of a videogame would be complete without screenshots (or an artist's rendition thereof).

No review of a new web browser, mp3 jukebox, or GUI editor would be complete without screenshots.

Heck, I've even seen screenshots in articles about cell phones.

But, I think this may be going just a bit too far. I don't speak Hungarian (or much of anything else for that matter, but I digress), but it looks as though this person has posted 100 screenshots of OpenSolaris being built, BFUed, and booted. Sadly, he ran out of pixels at the second grub screen, so we don't get to see the new OS actually come up. While less traumatic than the film breaking during the last reel of "The Usual Suspects", it leaves one without a sense of closure.

(Thanks to OS News for the link.)

Tuesday Jun 14, 2005

Kernel Address Space Layout on x86/x64

To celebrate the launch of OpenSolaris, I thought I would spend a little time describing the layout of the kernel's virtual address space.

For reasons that will become clear, this cannot be done without first giving an overview of how a Solaris system is booted. (If you want to know any more about booting Solaris, you should read Tim Marsland's writeup. He talks a bit more about the mechanisms, rationale, and history of the Solaris boot process.)

As Joe and Kit have been (and presumably will be) discussing, there are a number of complications brought on by the limited amount of virtual address space available to the kernel in 32-bit mode. For most VM operations the 64-bit address space allows for a much simpler implementation. On the other hand, booting a 64-bit system is significantly more complicated than a 32-bit system.

When we boot an x86/x64 system, we do it in several stages. The primary boot code is an OS-independent routine in the BIOS that reads the OS-specific boot code. In Solaris, the code that the BIOS loads is called the 'primary bootstrap', which is responsible for reading the next stage of boot code (which we usually call the 'boot loader') from a fixed location on the disk or from the network. The boot loader has to identify the hardware in the system to find the boot device, send messages to the console or serial port, and so on. Once the hardware has been identified, the boot loader loads the Solaris kernel into memory and jumps to the first instruction of the _start() routine.

Sounds complicated, right? We're just getting started. It turns out that we need to keep the boot loader around for a while. Until we get a fair amount of initialization out of the way, the boot loader is the only one who knows which memory is available and how to talk to the various devices in the system. This means that any memory allocation, disk access, or console message requires calling back into the boot loader.

Having to call back into the boot loader is a minor nuisance on 32-bit systems, but on 64-bit systems it's a nightmare. The problem is that the OS is running in 64-bit mode, but the boot loader is a 32-bit program. So, every I/O access or memory allocation requires a mode-switch between 32-bit and 64-bit mode. Fortunately, implementing this particular nastiness fell to somebody else so I didn't have to think about it. What does affect us is that the 32-bit boot loader can only use the lower 4GB of the virtual address space. So, we are trying to load and execute a kernel in the top of a 64-bit address range, but the memory allocator only recognizes the bottom 4GB of memory.

Getting the two worldviews to make some kind of sense required a trick we called 'doublemapping'. Essentially, we would call into the 32-bit program to allocate a range of memory. The 32-bit program would reserve a bit of physical memory and set up the page tables to map that memory at address X in the bottom 4GB of virtual address space. During the transition from 32-bit mode back to 64-bit mode, we would map the same physical memory into the upper 4GB of memory, at address 0xFFFFFFFF.00000000 + X. So a single piece of memory would be mapped at two different addresses. The boot loader lives in roughly the bottom 1GB of memory which, due to doublemapping, means it lives from 0x00000000.00000000 to 0x00000000.3FFFFFFF and from 0xFFFFFFFF.00000000 to 0xFFFFFFFF.3FFFFFFF.

Finally, we get to discussing the actual layout of the kernel's address space. Early on in the AMD64 porting project we started with a blank sheet of paper and laid out a clean, sensible, elegant address space for Solaris on x64 systems. What you see below is what is left of the design once it ran headlong into the constraints and limitations imposed by the x86 and AMD64 architectures and by the Solaris boot process.

From startup.c:

/*
 *              64-bit Kernel's Virtual memory layout. (assuming 64 bit app)
 *                      +-----------------------+
 *                      |       psm 1-1 map     |
 *                      |       exec args area  |
 * 0xFFFFFFFF.FFC00000  |-----------------------|- ARGSBASE
 *                      |       debugger        |
 * 0xFFFFFFFF.FF800000  |-----------------------|- SEGDEBUGBASE
 *                      |       unused          |
 *                      +-----------------------+
 *                      |       Kernel Data     |
 * 0xFFFFFFFF.FBC00000  |-----------------------|
 *                      |       Kernel Text     |
 * 0xFFFFFFFF.FB800000  |-----------------------|- KERNEL_TEXT
 *                      |       LUFS sinkhole   |
 * 0xFFFFFFFF.FB000000 -|-----------------------|- lufs_addr
 * ---                  |-----------------------|- valloc_base + valloc_sz
 *                      |  early pp structures  |
 *                      |  memsegs, memlists,   |
 *                      |  page hash, etc.      |
 * ---                  |-----------------------|- valloc_base
 *                      |       ptable_va       |
 * ---                  |-----------------------|- ptable_va
 *                      |       Core heap       | (used for loadable modules)
 * 0xFFFFFFFF.C0000000  |-----------------------|- core_base / ekernelheap
 *                      |        Kernel         |
 *                      |         heap          |
 * 0xFFFFFXXX.XXX00000  |-----------------------|- kernelheap (floating)
 *                      |        segkmap        |
 * 0xFFFFFXXX.XXX00000  |-----------------------|- segkmap_start (floating)
 *                      |    device mappings    |
 * 0xFFFFFXXX.XXX00000  |-----------------------|- toxic_addr (floating)
 *                      |         segkp         |
 * ---                  |-----------------------|- segkp_base
 *                      |        segkpm         |
 * 0xFFFFFE00.00000000  |-----------------------|
 *                      |       Red Zone        |
 * 0xFFFFFD80.00000000  |-----------------------|- KERNELBASE
 *                      |      User stack       |- User space memory
 *                      |                       |
 *                      | shared objects, etc   |       (grows downwards)
 *                      :                       :
 *                      |                       |
 * 0xFFFF8000.00000000  |-----------------------|
 *                      |                       |
 *                      |  VA Hole / unused     |
 *                      |                       |
 * 0x00008000.00000000  |-----------------------|
 *                      |                       |
 *                      |                       |
 *                      :                       :
 *                      |       user heap       |       (grows upwards)
 *                      |                       |
 *                      |       user data       |
 *                      |-----------------------|
 *                      |       user text       |
 * 0x00000000.04000000  |-----------------------|
 *                      |       invalid         |
 * 0x00000000.00000000  +-----------------------+
 */
 
Within the kernel's address space, there are only three fixed addresses.
  • KERNEL_TEXT: the address at which 'unix' is loaded.
  • BOOT_DOUBLEMAP_BASE: the lowest address the kernel can use while the boot loader is still in memory.
  • KERNELBASE: the bottom of the kernel's address space or, alternatively, the top of the user's address space.

At KERNEL_TEXT, we allocate two 4MB pages: one for the kernel text and one for kernel data. This value is hardcoded in two different places: the kernel source (i86pc/sys/machparam.h) and the unix ELF header (i86pc/conf/Mapfile.amd64). This is a value that changes rarely, but it does happen. As Solaris 10 neared completion, one of the ISVs using the first release of Solaris Express to include AMD64 support found that putting kernel text and data in the top 64MB of the address range was interfering with the operation of a particular piece of system virtualization software. This interference did not cause any problems with correctness, but it did reduce performance. By lowering the kernel to 2^64 - 64MB - 8MB we were able to remove the interference and eliminate the performance problem.
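
For the curious, here is how that arithmetic works out. This little standalone program is only an illustration; the authoritative value lives in i86pc/sys/machparam.h and has to agree with the mapfile, as noted above.

#include <stdio.h>

#define MB	(1ULL << 20)

int
main(void)
{
	/* 2^64 wraps to 0 in unsigned arithmetic, so this is 2^64 - 64MB - 8MB. */
	unsigned long long kernel_text = 0ULL - 64 * MB - 8 * MB;

	printf("KERNEL_TEXT = 0x%llx\n", kernel_text);	/* 0xfffffffffb800000 */
	return (0);
}

The result matches the KERNEL_TEXT line in the layout above: 0xFFFFFFFF.FB800000.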

Directly below the kernel text is an 8MB region which is used as a scratch area by the logging UFS filesystem during boot. According to the comments in the boot loader code, that address was chosen because it was safely hidden within segmap in the 32-bit x86 address space. Since that segmap isn't touched by Solaris proper until after the boot loader has finished unrolling the UFS log, the address seemed harmless enough, and the Solaris address map in startup.c was not updated when the change was putback to the gate. After all, nobody would ever think of restructuring the kernel's address space, would they?

This feature was putback to the main Solaris gate while the AMD64 team was still in the early stages of bringing Solaris up on 64-bit x86 processors for the first time. When that code was putback, the entire AMD64 team was sequestered in a conference room in Colorado for a week of work uninterrupted by meetings, conference calls, or managers. At that time we were still struggling along without kadb, kmdb, or a working simulator, so the only debugging facility available to us was inserting printf()s into the code stream. Obviously this made debugging tedious, tricky, and time-consuming. Anything that introduced new work and new debugging headaches was not well received. Thus, the somewhat disparaging title this region was given in the address space map once we identified it as the cause of our new failures.

The moral of this story: don't use Magic Values. If you must use Magic Values, make sure they are used in a contained fashion and within a single layer. If you have Magic Values that must be shared between layers, document their every use, assumption, and dependency. If you can't be bothered to document things properly...well, come talk to me. And wear a helmet.

Below the LUFS sinkhole is a region used for some fundamental virtual memory data structures. For example:

  • The memlists represent the physical memory exported to us by the boot loader.
  • The memsegs describe regions of physical memory that have been incorporated by the operating system.
  • The page_hash is a hash table used to translate <vnode, offset> pairs into page structures. In other words, this is the core data structure used to find the physical page, if any, that currently holds a given piece of a file or segment (a rough lookup sketch follows this list).
  • The page_ts are the data structures used to track the state of every physical page in the system.
  • The page_freelists form the head of the linked lists of free pages in the system.
  • The page_cachelists form the head of the linked lists of cached pages in the system. These are pages that contain valid data but which are not currently being used by any applications. When your web browser crashes, the pages containing the program text end up on the cachelists where they can be quickly reclaimed when you restart the browser.
  • The page_counters are a little more complicated than the name would suggest - and more complicated than a little squib in a bootblog can do justice to. For now, let's just say that they are used to keep track of free contiguous physical memory.
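
To give a flavor of how the page_hash is used, here is a rough, hypothetical sketch of the <vnode, offset> lookup. The structure layout and hash function below are simplified stand-ins; the kernel's real lookup code adds locking and a carefully tuned hash.

#include <stddef.h>
#include <stdint.h>

struct vnode;				/* opaque here */

struct page {
	struct vnode	*p_vnode;	/* backing object */
	uint64_t	p_offset;	/* offset within that object */
	struct page	*p_hash;	/* next page on this hash chain */
};

extern struct page	**page_hash;	/* allocated in the early-boot region above */
extern size_t		page_hashsz;	/* number of buckets */

static struct page *
page_find_sketch(struct vnode *vp, uint64_t off)
{
	/* Illustrative hash; the kernel's real hash function is more careful. */
	size_t		bucket = ((uintptr_t)vp + (size_t)(off >> 12)) % page_hashsz;
	struct page	*pp;

	for (pp = page_hash[bucket]; pp != NULL; pp = pp->p_hash) {
		if (pp->p_vnode == vp && pp->p_offset == off)
			return (pp);	/* this <vnode, offset> is in memory */
	}
	return (NULL);			/* not present; caller must fault it in */
}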

All of the above consumes about 62MB on my laptop with 2GB of memory, which we round up to the nearest 4MB. On a system with 16GB in it, the page_t structures alone would consume 120MB. Since we cannot afford to use that much memory in this stage of boot, only the page_ts for the bottom 4GB of physical memory are allocated in this part of memory. The page_ts for the rest of memory are allocated once we have broken free of the boot loader.

Next we start loading kernel modules. Once again, the oddities of 64-bitness come into play. For performance reasons, modules generally use %rip-relative addressing to access their data. In order to use this addressing, there is a limit to how distant a relative the address can be. Specifically, the address you are accessing cannot be more than 2GB away from the instruction that is doing the accessing. As long as a system doesn't have too many modules, the text is loaded into the 4MB kernel text page and the data is allocated in the 4MB kernel data page. Since these pages are adjacent, there are no problems with %rip-relative addressing.
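
The constraint itself is easy to state in code. Below is a hedged sketch of the reachability test implied here; within_rip_reach() is a made-up name, but the arithmetic is just the signed 32-bit displacement that %rip-relative addressing encodes.

#include <stdint.h>

/* Can code at text_va reach data_va with a %rip-relative access? */
static int
within_rip_reach(uintptr_t text_va, uintptr_t data_va)
{
	int64_t delta = (int64_t)(data_va - text_va);

	/* The displacement field is a signed 32 bits: roughly +/- 2GB. */
	return (delta >= INT32_MIN && delta <= INT32_MAX);
}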

Once we run out of space on those two pages, we have to start allocating kernel heap space for the module text and data. To maintain the 2GB limit we subdivided the heap, introducing a 2GB region we call the 'core' heap. This subdivision can be seen in the code for kernelheap_init(), which grows the heap, and in the kernel's runtime linker.

Below the core heap is the general-purpose kernel heap, which is the memory region from which most kmem_alloc()s are satisfied. This is the last region to be created or used while we are still dependent on the boot loader's services. The astute reader will note that the top of the kernel heap is in the upper 4GB of memory but the bottom of the heap, while not specifically defined, is clearly not in the upper 4GB. As one might expect, this causes problems with the boot loader and our doublemapping trick. In fact, while we are still making use of the boot loader the kernelheap bottom is set to boot_kernelheap, which is defined as the bottom of the doublemapping region. When we unload the boot loader we call kernelheap_extend(), which grows the heap down to its true lower bound.
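
Conceptually, the two-phase setup looks something like the sketch below. The names final_kernelheap and heap_add_range() are hypothetical stand-ins; the real work is done by kernelheap_init() and kernelheap_extend().

extern void	*ekernelheap;		/* top of the kernel heap (core_base) */
extern void	*boot_kernelheap;	/* lower bound while the boot loader is resident */
extern void	*final_kernelheap;	/* true lower bound (hypothetical name) */

/* Hypothetical helper: hand the range [low, high) to the kernel heap arena. */
extern void	heap_add_range(void *low, void *high);

void
heap_grow_after_boot(void)
{
	/*
	 * Once the boot loader has been unloaded, the space between the true
	 * lower bound and boot_kernelheap becomes safe to use, so the heap
	 * effectively grows downward.
	 */
	heap_add_range(final_kernelheap, boot_kernelheap);
}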

The regions below the kernel heap are created after Solaris has unloaded the boot loader and taken over responsibility for managing the full virtual address space. This post has already gotten much longer than I intended, so I'll have to leave their description to a later date. In the meantime, you can find a bit more about segkp at Prakash's blog. If you are interested in the topic, you should really be reading Solaris Internals.

