Monday Sep 13, 2010

An End and a Beginning

I have worked at Sun, and now Oracle, for longer than I really want to admit, but I have reached a point in my career where I feel that I need to try something new. Today, September 13th, 2010, is my last day at Oracle. For the past 5 years I've worked on zones. The zones team is such an outstanding group of engineers and I'm going to miss working with them. I've learned a lot from them, as well as all of the other engineers at Sun that I have been privileged to work with over the years.

Now I'm moving on to something new: tomorrow I start work at Joyent. Hopefully I'll have a chance to blog more than I have recently; my blog is moving as well. I'll continue to work on zones, and Solaris in general, at Joyent, and I'm excited about the new challenges ahead. There are obviously a lot of changes going on with Solaris right now, so it's going to be interesting. And fun!

Tuesday May 26, 2009

Community One Slides

I'll be delivering two presentations at Community One this year. My slides are posted on the wiki in case you want to download them. I like to use my slides as an outline instead of just reading them, so hopefully people who are attending will actually get some value from hearing me speak. :-) Don't forget that the Tuesday Deep Dive is free if you register with the OSDDT code. There are several ways to get into the deep dives if you are planning on attending. All of these are on the wiki.

Wednesday Feb 11, 2009

zones p2v

About two years ago the zones team sat down and began to create the solaris8 brand for zones. This brand allows you to run your existing Solaris 8 system images inside of a branded zone on Solaris 10. One of the key goals for this project was to easily enable migration of Solaris 8 based systems into a zone on Solaris 10. To accomplish this, as part of the project we built support for a "physical to virtual" capability, or p2v for short. The idea with p2v is that you can create an image of an existing system using a flash archive, cpio archive, a UFS dump, or even just a file system image that is accessible over NFS, then install the zone using that image. There is no explicit p2v tool you have to run; behind the scenes the zone installation process does all of the work to make sure the Solaris 8 image runs correctly inside of the zone.
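As a rough sketch of that workflow (the zone name, server, and archive path below are made up for illustration, so adjust them for your site), a Solaris 8 p2v might look like this: create a flash archive on the old Solaris 8 host, configure a solaris8-branded zone on the Solaris 10 host, and install the zone from the archive. The -u option runs sys-unconfig on the image so you can give the zone a new identity.

    s8-host# flarcreate -n s8-sys /net/server/images/s8-sys.flar
    s10-host# zonecfg -z s8zone
    zonecfg:s8zone> create -t SUNWsolaris8
    zonecfg:s8zone> set zonepath=/zones/s8zone
    zonecfg:s8zone> exit
    s10-host# zoneadm -z s8zone install -u -a /net/server/images/s8-sys.flar
    s10-host# zoneadm -z s8zone boot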

Once we finished the solaris8 brand we followed that with the solaris9 brand which has this same p2v capability. Of course, while we were doing this work, we understood that having a similar feature for native zones would be useful as well. This would greatly simplify consolidation using zones, since you could deploy onto bare metal, then later consolidate that application stack into a zone with very little work.

The problem for p2v with native zones is that there is no brand module that mediates between the user-level code running in the zone and the kernel code, as we have with the solaris8 and solaris9 brands. Thus, the native zones must be running user-level code that is in sync with the kernel. This includes things like libc, which has a close relationship with the kernel. Every time a patch is applied which impacts both kernel code and user-level library code, all of the native zones must be kept in sync or unpredictable bugs will occur.

Just doing native p2v, as we did for the solaris8 and solaris9 brands, doesn't make sense since the odds that the system image you want to install in the zone will be exactly in sync with the kernel are pretty low. Most deployed systems are at different patch levels or even running different minor releases (e.g. Solaris 10 05/08 vs. 11/08), so there is no clean way to reliably p2v those images.

We really felt that native p2v was important, but we couldn't make any progress until we solved the problem of syncing up the system image to match the global zone. Fortunately I was able to find some time to add this capability, which we call update on attach. This was added into our zone migration subcommands, 'detach' and 'attach', which can be used to move zones from one system to another. Since zone migration has a similar problem as p2v, where the source and target systems can be out of sync, we do a lot of validation to make sure that the new host can properly run the zone. Of course this validation made zone migration pretty restrictive. Now that we have "update on attach", we can automatically update the zone software when you move it to the new host.
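For example, a migration with update on attach looks roughly like this (the host and zone names are invented for the example; the zonepath can be copied with whatever tool you prefer):

    old-host# zoneadm -z web1 detach
        (copy the zonepath, e.g. with cpio or zfs send/recv, to the new host)
    new-host# zonecfg -z web1 create -a /zones/web1
    new-host# zoneadm -z web1 attach -u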

While "update on attach" is a valuable feature in its own right, we also built this with an eye on p2v, since it is the enabling capability needed for p2v. In addition, we leveraged all of the work Dan Price did on the installers for the solaris8 and solaris9 brands and were able to reuse much of that. As with the solaris8 and solaris9 brands, the native brand installer accepts a variety of image inputs; flar, cpio, compressed cpio, pax xustar, UFS dump or a directly accessible root image (e.g. over NFS). It was also enhanced to accept a pre-existing image in the zone root path. This is useful if you use ZFS send and receive to set up the zone root and want to then p2v that as a fully installed zone.
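Putting it together, a native p2v install from a flash archive might look like the following sketch (the archive path is hypothetical). Here -p preserves the system identity from the image, while -u would instead sys-unconfig it:

    # zonecfg -z app1
    zonecfg:app1> create
    zonecfg:app1> set zonepath=/zones/app1
    zonecfg:app1> exit
    # zoneadm -z app1 install -p -a /net/server/images/app1.flar
    # zoneadm -z app1 boot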

I integrated the native p2v feature into NV build 109 this morning. The webrev from the code review is still available if anyone is interested in seeing the scope of the changes. At over 2000 lines of new code this is a pretty substantial addition to zones which should greatly improve future zone consolidation projects.

Tuesday Jan 20, 2009

OpenSolaris Bible Samples

A comment on my last post noted that there were no sample chapters available for the book; however, I just noticed that Wiley has posted some samples on the book's webpage.

The samples include chapter one, the index, and the detailed table of contents.

The index and TOC are probably the best sections for getting a feel for the material in the book. This is actually the first time I've seen the index myself, since it was produced after we finished writing and the final pages were nailed down. I haven't reviewed it closely yet, but at first glance it looks to be pretty comprehensive at 35 pages. I've always thought that the index was critical for a book like this. The detailed TOC is also useful for getting a sense of the topics covered in each chapter.

Tuesday Jan 06, 2009

Writing the OpenSolaris Bible

2008 was a busy year for me, since I spent most of my free time co-authoring a book on OpenSolaris: the OpenSolaris Bible.

Having never written a book before, this was a new experience for me. Nick originally had the idea for writing a book on OpenSolaris and he'd already published Professional C++ with Wiley, so he had an agent and a relationship with a publisher. In December 2007 he contacted me about being a co-author and after thinking it through, I agreed. I had always thought writing a book was something I wanted to do, so I was excited to give this a try. Luckily, Dave agreed to be the third author on the book, so we had our writing team in place. After some early discussions, Wiley decided our material fit best into their "Bible" series, hence the title.

In early January 2008 the three of us worked on the outline and decided which chapters each of us would write. We actually started writing in early February of 2008. Given the publishing schedule we had with Wiley, we had to complete each chapter in about 3 weeks, so there wasn't a lot of time to waste. Also, because this project was not part of our normal work for Sun, we had to ensure that we only worked on the book on our own time, that is, evenings and weekends. In the end it turned out that we each wrote exactly a third of the book, based on the page counts. Since the book came out at around 1000 pages, with approximately 950 pages of written material, not counting front matter or the index, we each wrote over 300 pages of content. Over the course of the project we were also fortunate that many of our friends and colleagues who work on OpenSolaris were willing to review our early work and provide much useful feedback.

We finished the first draft at the end of August 2008 and worked on the revisions to each chapter through early December 2008. Of course the OpenSolaris 2008.11 release came out right at the end of our revision process, so we had to scramble to be sure that everything in the book was up-to-date with respect to the new release.

From a personal perspective, this was a particularly difficult year because we also moved to a "new" house in April of 2008. Our new house is actually about 85 years old and hadn't been very well maintained for a while, so it needs some work. The first week we moved in, we had the boiler go out, the sewer back up into the basement, the toilet and the shower wouldn't stop running, the electrical work for our office took longer than expected, our DSL wasn't hooked up right, and about a million other things all seemed to go wrong. Somehow we managed to cope with all of that, keep working for our real jobs, plus I was able to finish my chapters for the book on schedule. I'm pretty sure Sarah wasn't expecting anything like this when I talked to her about working on the book the previous December. Needless to say, we're looking forward to a less hectic 2009.

If you are at all interested in OpenSolaris, then I hope you'll find something in our book that is worthwhile, even if you already know a lot about the OS. The book is targeted primarily at end-users and system administrators. It has a lot of breadth and we tried to include a balanced mix of introductory material as well as advanced techniques. Here's the table of contents so you can get a feel for what's in the book.
I. Introduction to OpenSolaris.
    1. What Is OpenSolaris?
    2. Installing OpenSolaris.
    3. OpenSolaris Crash Course.

II. Using OpenSolaris
    4. The Desktop.
    5. Printers and Peripherals.
    6. Software Management.

III. OpenSolaris File Systems, Networking, and Security.
    7. Disks,  Local File Systems, and the Volume Manager.
    8. ZFS.
    9. Networking.
    10. Network File Systems and Directory Services.
    11. Security.

IV. OpenSolaris Reliability, Availability, and Serviceability.
    12. Fault Management.
    13. Service Management.
    14. Monitoring and Observability.
    15. DTrace.
    16. Clustering for High Availability.

V. OpenSolaris Virtualization.
    17. Virtualization Overview.
    18. Resource Management.
    19. Zones.
    20. xVM Hypervisor.
    21. Logical Domains (LDoms).
    22. VirtualBox.

VI. Developing and Deploying on OpenSolaris.
    23. Deploying a Web Stack on OpenSolaris.
    24. Developing on OpenSolaris. 
If this looks interesting, you can pre-order a copy from Amazon. It comes out early next month, February 2009, and I'm excited to hear people's reactions once they've actually had a chance to look it over.

Tuesday Dec 23, 2008

Updating zones on OpenSolaris 2008.11 using detach/attach

In my last post I talked a bit about the new way that software and dataset management works for zones on the 2008.11 release.

One of the features that is still under development is to provide a way to automatically keep the non-global zones in sync with the global zone when you do a 'pkg image-update'. The IPS project still needs some additional enhancements to be able to describe the software dependencies between the global and non-global zones. In the meantime, you must manually ensure that you update the non-global zones after you do an image-update and reboot the global zone. Doing this will create new ZFS datasets for each zone which you can then manually update so that they match the global zone software release.

The easiest way to update the zones is to use the new detach/attach capabilities we added to the 2008.11 release. You can simply detach the zone, then re-attach it. We provide some support for the zone update on attach option for ipkg-branded zones, so you can use 'attach -u' to simply update the zone.

The following shows an example of this.
# zoneadm -z jj1 detach
# zoneadm -z jj1 attach -u
       Global zone version: pkg:/entire@0.5.11,5.11-0.101:20081119T235706Z
   Non-Global zone version: pkg:/entire@0.5.11,5.11-0.98:20080917T010824Z
Updating non-global zone: Output follows
                    Cache: Using /var/pkg/download.
PHASE                                          ITEMS
Indexing Packages                              54/54

DOWNLOAD                                    PKGS       FILES     XFER (MB)
Completed                                  54/54   2491/2491   52.76/52.76

PHASE                                        ACTIONS
Removal Phase                              1253/1253
Install Phase                              1440/1440
Update Phase                               3759/3759
Reading Existing Index                           9/9
Indexing Packages                              54/54

Here you can see how the zone is updated when it is re-attached to the system. This updates the software in the currently active dataset associated with the global zone BE. If you roll back to an earlier image, the dataset associated with the zone and the earlier BE will be used instead of this newly updated dataset. We've also enhanced the IPS code so it can use the pkg cache from the global zone, so the zone update is very quick.

Because the zone attach feature is implemented as a brand-specific capability, each brand provides its own options for how zones can be attached. In addition to the -u option, the ipkg brand supports a -a or -r option. The -a option allows you to take an archive (cpio, bzip2, gzip, or USTAR tar) of a zone from another system and attach it. The -r option allows you to receive the output of a 'zfs send' into the zone's dataset. Either of these options can be combined with -u to enable zone migration from one OpenSolaris system to another. An additional option, which didn't make it into 2008.11, but is in the development release, is the -d option, which allows you to specify an existing dataset to be used for the attach. The attach operation will take that dataset and add all of the properties needed to make it usable on the current global zone BE.
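For instance, migrating an ipkg zone with a gzip'd cpio archive of the zonepath might look like this sketch (the zone name and archive path are invented for the example):

    source# zoneadm -z jj1 detach
        (archive the zonepath and copy it to the target host)
    target# zonecfg -z jj1 create -a /rpool/zones/jj1
    target# zoneadm -z jj1 attach -a /tmp/jj1.cpio.gz -u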

If you used zones on 2008.11, you might have noticed that the zone's dataset is not mounted when the zone is halted. This is something we might change in the future, but in the meantime, one final feature related to zone detach is that it leaves the zone's dataset mounted. This provides an easy way to access the zone's data: simply detach the zone, access the zone's mounted file system, then re-attach the zone.

Friday Aug 12, 2005


I haven't written a blog for quite a while now. I'm actually not working on SVM right now. Instead, I am busy on some zones related work. It has been a busy summer. My wife and I were in Beijing for about 10 days talking to some customers about Solaris. Sarah has posted some pictures and written a funny story about our trip.

Last night we had the inaugural meeting of FROSUG (the Front Range OpenSolaris Users Group). I gave an overview presentation introducing OpenSolaris. The meeting seemed to go well and it got blogged by Stephen O'Grady, which is pretty cool. Hopefully it will take off and we can get an active community of OpenSolaris people in the Denver area.

Friday Apr 01, 2005

Solaris Volume Manager root mirror problems on S10

There are a couple of bugs that we found in S10 that make it look like Solaris Volume Manager root mirroring does not work at all. Unfortunately we found these bugs after the release went out. These bugs will be patched, but I wanted to describe the problems a bit and offer some workarounds.

On a lot of systems that use SVM to do root mirroring there are only two disks. When you set up the configuration you put one or more metadbs on each disk to hold the SVM configuration information.

SVM implements a metadb quorum rule which means that during the system boot, if more than 50% of the metadbs are not available, the system should boot into single-user mode so that you can fix things up. You can read more about this here.

On a two disk system there is no way to set things up so that more than 50% of the metadbs will be available if one of the disks dies.
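For example, a typical two-disk setup puts two replicas on a small dedicated slice of each disk (the disk and slice names here are illustrative):

    # metadb -a -f -c 2 c0t0d0s7 c0t1d0s7

If one disk dies, only two of the four replicas remain. That is exactly 50%, not more than 50%, so quorum is lost.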

When SVM does not have metadb quorum during the boot it is supposed to leave all of the metadevices read-only and boot into single-user. This gives you a chance to confirm that you are using the right SVM configuration and that you don't corrupt any of your data before having a chance to cleanup the dead metadbs.

What a lot of people do when they set up a root mirror is pull one of the disks to check that the system will still boot and run OK. If you do this experiment on a two-disk configuration running S10, the system will panic very early in the boot process and go into an infinite panic/reboot cycle.

What is happening here is that we found a bug related to UFS logging, which is on by default in S10. Since the root mirror stays read-only because there is no metadb quorum we hit a bug in the UFS log rolling code. This in turn leaves UFS in a bad state which causes the system to panic.

We're testing the fix for this bug right now, but in the meantime it is easy to work around this bug by just disabling logging on the root file system. You can do that by specifying the "nologging" option in the last field of the vfstab entry for root. You should reboot once before doing any SVM experiments (like pulling a disk) to ensure that UFS has rolled the log and is no longer using logging on root.
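For example, the root entry in /etc/vfstab would change to something like the following (the metadevice names are illustrative; the mount options go in the last field):

    /dev/md/dsk/d0  /dev/md/rdsk/d0  /  ufs  1  no  nologging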

Once a patch for this bug is out, you will definitely want to remove this workaround from the vfstab entry, since UFS logging offers so many performance and availability benefits.

By the way, UFS logging is also on by default in the S9 9/04 release, but that code does not suffer from this bug.

The second problem we found is not as serious as the UFS bug. It involves an interaction with the Service Management Facility (SMF), which is new in S10, and again it is related to not having metadb quorum during boot. What should happen is that the system enters single-user mode so you can clean up the dead metadbs. Instead it boots all the way to multi-user, but since the root device is still read-only, things don't work very well. This turned out to be a missing dependency which we didn't catch when we integrated SVM with SMF. We'll have a patch for this too, but this problem is much less serious: you can still log in as root and clean up the dead metadbs so that you can then reboot with a good metadb quorum.

Both of the problems result because there is no metadb quorum so the root metadevice remains read-only after a boot with a dead disk. If you have a third disk which you can use to add a metadb onto, then you can reduce the likelihood of hitting this problem since losing one disk won't cause you to lose quorum during boot.

Given these kinds of problems you might wonder why does SVM bother to implement the metadb quorum? Why not just trust the metadbs that are alive? SVM is conservative and always chooses the path to ensure that you won't lose data or use stale data. There are various corner cases to worry about when SVM cannot be sure it is using the most current data. For example, in a two disk mirror configuration, you might run for a while on the first disk with the second disk powered down. Later you might reboot off the second disk (because the disk was now powered up) and the first disk might now be powered down. At this point you would be using the stale data on the mirror, possibly without even realizing it. The metadb quorum rule gives you a chance to intervene and fix up the configuration when SVM cannot do it automatically.

Thursday Mar 31, 2005

Solaris Volume Manager x86 root mirroring

I am one of the engineers working on Solaris Volume manager. There are some questions I get asked a lot and I thought it would be useful to get this information out to a broader audience so I am starting this blog. We have also done a lot of work on the volume manager in the recent Solaris releases and I'd like to talk about some of that too.

One of the questions I get asked most frequently is how to do root mirroring on x86 systems. Unfortunately our docs don't explain this very well yet. This is a short description I wrote about root mirroring on x86 which hopefully explains how to set this up.

Root mirroring on x86 is more complex than root mirroring on SPARC. Specifically, there are issues with being able to boot from the secondary side of the mirror when the primary side fails. Compared to SPARC systems, on x86 machines you have to be sure to properly configure the system BIOS and set up your fdisk partitioning.

The x86 BIOS is analogous to the PROM interpreter on SPARC. The BIOS is responsible for finding the right device to boot from, then loading and executing the master boot record from that device.

All modern x86 BIOSes are configurable to some degree but the discussion of how to configure them is beyond the scope of this post. In general you can usually select the order of devices that you want the BIOS to probe (e.g. floppy, IDE disk, SCSI disk, network) but you may be limited in configuring at a more granular level. For example, it may not be possible to configure the BIOS to boot from the first and second IDE disks. These limitations may be a factor with some hardware configurations (e.g. a system with two IDE disks that are root mirrored). You will need to understand the capabilities of the BIOS that is on your hardware. If your primary boot disk fails and your BIOS is set up properly, then it will automatically boot from the second disk in the mirror. Otherwise, you may need to break into the BIOS while the machine is booting and reconfigure to boot from the second disk or you may even need to boot from a floppy with the Solaris Device Configuration Assistant (DCA) on it so that you can select the alternate disk to boot from.

On x86 machines fdisk partitions are used and it is common to have multiple operating systems installed. Also, there are different flavors of master boot programs (e.g. LILO or Grub), in addition to the standard Solaris master boot program. The boot(1M) man page is a good resource for a detailed discussion of the multiple components that are used during booting on Solaris x86.

Since SVM can only mirror Solaris slices within the Solaris fdisk partition this post focuses on a configuration that only has Solaris installed. If you have multiple fdisk partitions then you will need to use some other approach to protect the data outside of the Solaris fdisk partition. SVM can't mirror that data.

For an x86 system with Solaris installed there are two common fdisk partitioning schemes. One approach uses two fdisk partitions. There is a Solaris fdisk partition and another, small fdisk partition of about 10MB called the x86 boot partition. This partition has an Id value of 190. The Solaris system installation software will create a configuration with these two fdisk partitions as the default. The x86 boot partition is needed in some cases, such as when you want to use live-upgrade on a single disk configuration, but it is problematic when using root mirroring. The Solaris system installation software only allows one x86 boot partition for the entire system and it places important data on that fdisk partition. That partition is mounted in the vfstab with this entry:

/dev/dsk/c2t1d0p0:boot - /boot pcfs - no -

Because this fdisk partition is outside of the Solaris fdisk partition it cannot be mirrored by SVM. Furthermore, because there is only a single copy of this fdisk partition it represents a single point of failure.

Since the x86 boot partition is not required in most cases it is recommended that you do not use this as the default when setting up a system that will have root mirroring. Instead, just use a single Solaris fdisk partition and omit the x86 boot partition for your installation. It is easiest to do this at the time you install Solaris. If you already have Solaris installed and you created the x86 boot partition as part of that process then the easiest thing would be to delete that with the fdisk(1M) command and reinstall, taking care not to create the x86 boot partition during the installation process.

Once your system is installed you create your metadbs and root mirror using the normal procedures.
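For reference, the normal procedure looks roughly like this (the metadevice and disk names are illustrative; check the SVM documentation for your release). The metaroot command updates /etc/vfstab and /etc/system for you, and the second submirror is attached only after rebooting onto the mirror:

    # metadb -a -f -c 2 c0t0d0s7 c0t1d0s7
    # metainit -f d10 1 1 c0t0d0s0
    # metainit d20 1 1 c0t1d0s0
    # metainit d0 -m d10
    # metaroot d0
    # reboot
    # metattach d0 d20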

When you use fdisk to partition your second disk you must make the disk bootable with a master boot program. The -b option of fdisk(1M) does this.

e.g. fdisk -b /usr/lib/fs/ufs/mboot /dev/rdsk/c2t1d0p0

On x86 machines the Solaris VTOC (the slices within the Solaris fdisk partition) is slightly different from what is seen on SPARC. On SPARC there are 8 VTOC slices (0-7) but on x86 there are more. In particular slice 8 is used as a "boot" slice. You will see that this slice is 1 cylinder in size and starts at the beginning of the disk (offset 0). The other slices will come after that, starting at cylinder 1.

Slice 8 is necessary for booting Solaris from this fdisk partition. It holds the partition boot record (pboot), the Solaris VTOC for the disk and the bootblk. This information is disk specific so it is not mirrored with SVM. However, you must ensure that both disks are bootable so that you can boot from the secondary disk if the primary fails. You can use the installboot program to setup the second disk as a Solaris bootable disk (see installboot(1M)). An example command is:

installboot /usr/platform/i86pc/lib/fs/ufs/pboot \
    /usr/platform/i86pc/lib/fs/ufs/bootblk /dev/rdsk/c2t1d0s2

There is one further consideration for booting an x86 disk. You must ensure that the root slice has a slice tag of "root" and the root slice must be slice 0. See format(1M) for checking and setting the slice tag field.

Solaris x86 emulates some of the behavior of the SPARC eeprom; see eeprom(1M). The boot device is stored in the "bootpath" property, which you can view with the eeprom command. The value is the device tree path of the primary boot disk.

You should set up the alternate bootpath via the eeprom command so that the system will try to boot from the second side of the root mirror. First you must get the device tree path for the other disk in the mirror. You can use the ls command. For example:

# ls -l /dev/dsk/c2t1d0s0
lrwxrwxrwx 1 root root 78 Sep 28 23:41 /dev/dsk/c2t1d0s0 -> ../../devices/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@1,0:a

The device tree path is the portion of the output following "../../devices".

Use the eeprom command to set up the alternate boot path. For example:

eeprom altbootpath='/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@1,0:a'

If your primary disk fails, you will boot from the secondary disk. This may be automatic if your BIOS is configured properly; otherwise you may need to manually enter the BIOS and boot from the secondary disk, or you may even need to boot from a floppy with the DCA on it. Once the system starts to boot, it will try to boot from the "bootpath" device. Assuming the primary boot disk is the dead disk in the root mirror, the system will then attempt to boot from the "altbootpath" device.

If the system fails to boot from the altbootpath device for some reason, you will need to finish booting manually. The boot should drop into the Device Configuration Assistant (DCA). You must choose the secondary disk as the boot disk within the DCA. After the system has booted, you should update the "bootpath" value with the device path that you used for the secondary disk (the "altbootpath") so that the machine will boot automatically.

For example, run the following to set the boot device to the second scsi disk (target 1):

eeprom bootpath='/pci@0,0/pci8086,2545@3/pci8086,1460@1d/pci8086,341a@7,1/sd@1,0:a'

Note that all of the usual considerations of mddb quorum apply to x86, just as they do for SPARC.


