Tuesday Nov 11, 2008

OpenSolaris 2008.11 Release Candidate 1B (nv101a) is now available for testing

The initial release candidate (rc1b) for OpenSolaris 2008.11 (based on nv101a) is now available for download and testing. Additional (larger) images are available for non-English locales as well as USB images for faster installs. If you have not played with a USB image you will be dazzled at the speed of the installation. Amazing what happens when you eliminate all those slow seeks.

The new release candidate has quite a few interesting features and updates. The items that caught my attention were:
  • IPS Package Manager
  • Automatically cloning root file system (beadm clone) during image update
  • GNOME 2.24
  • Evolution 2.24 for those of us that are stubborn enough to continue using it
  • OpenOffice 3.0
  • Songbird - an iTunes-like media player. Still needs lots of codecs (like the free Fluendo MP3 decoder) to be really useful
  • Brasero - a Nero-like media burner
Our own Dan Roberts has more to say on the subject in this video podcast.

Using the graphical package manager it only took a few minutes to set up the installation plan for a nice web based development system including Netbeans, a web stack (including Glassfish), and a Xen based virtualization system.

OpenSolaris 2008.11 is shaping up to be quite a nice release. Now that I have figured out how to make it play nicely in a root zpool with other Solaris releases, I will be spending a lot more time with it as the daily driver.

Download it, play with it, and please remember to file bugs when you run into things that don't work.


Tuesday Nov 04, 2008

Solaris and OpenSolaris coexistence in the same root zpool

Some time ago, my buddy Jeff Victor gave us FrankenZone. An idea that is disturbingly brilliant. It has taken me a while, but I offer for your consideration VirtualBox as a V2P platform for OpenSolaris. Nowhere near as brilliant, but at least as unusual. And you know that you have to try this out at home.

Note: This is totally a science experiment. I fully expect to see the two guys from Myth Busters showing up at any moment. It also requires at least build 100 of OpenSolaris on both the host and guest operating system to work around the hostid difficulties.

With the caveats out of the way, let me set the back story to explain how I got here.

Until virtualization technologies become ubiquitous and nothing more than BIOS extensions, multi-boot configurations will continue to be an important capability. And for those working with [Open]Solaris there are several limitations that complicate this unnecessarily. Rather than lamenting these, the possibility of leveraging ZFS root pools, now in Solaris 10 10/08, should offer up some interesting solutions.

What I want to do is simple - have a single Solaris fdisk partition that can have multiple versions of Solaris all bootable with access to all of my data. This doesn't seem like much of a request, but as of yet this has been nearly impossible to accomplish in anything close to a supportable configuration. As it turns out the essential limitation is in the installer - all other issues can be handled if we can figure out how to install OpenSolaris into an existing pool.

What we will do is use our friend VirtualBox to work around the installer issues. After installing OpenSolaris in a virtual machine we take a ZFS snapshot, send it to the bare metal Solaris host and restore it in the root pool. Finally we fix up a few configuration files to make everything work and we will be left with a single root pool that can boot Solaris 10, Solaris Express Community Edition (nevada), and OpenSolaris.

How cool is that :-) Yeah, it is that cool. Let's proceed.

Prepare the host system

The host system is running a fresh install of Solaris 10 10/08 with a single large root zpool. In this example the root zpool is named panroot. There is also a separate zpool that contains data that needs to be preserved in case a re-installation of Solaris is required. That zpool is named pandora, but it doesn't matter - it will be automatically imported in our new OpenSolaris installation if all goes well.
# lustatus 
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
s10u6_baseline             yes      no     no        yes    -         
s10u6                      yes      no     no        yes    -         
nv95                       yes      yes    yes       no     -         
nv101a                     yes      no     no        yes    -    

     
# zpool list
NAME      SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
pandora  64.5G  56.9G  7.61G    88%  ONLINE  -
panroot    40G  26.7G  13.3G    66%  ONLINE  -
One challenge that came up was the less than stellar performance of ssh over the VirtualBox NAT interface. So rather than fight this I set up a shared NFS file system in the root pool to stage the ZFS backup file. This made the process go much faster.
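If you would rather stay with ssh, compressing the send stream can mask much of the NAT overhead. This is a workaround of my own, not part of the recipe above; the helper is written generically so the round trip can be checked locally.

```shell
# Hedged alternative to the NFS staging step (my workaround, not the
# post's method): compress the ZFS send stream so less data crosses
# the slow VirtualBox NAT link.
stage_stream() {
    # compress stdin into the file named by $1; the receiving side
    # runs gunzip before feeding the stream to 'zfs receive'
    gzip -1 -c > "$1"
}

# Intended use inside the guest (10.0.2.2 is the VirtualBox NAT host):
#   pfexec zfs send rpool/ROOT/opensolaris@scooby | \
#       stage_stream /net/10.0.2.2/share/scooby.zfs.gz
```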

In the host Solaris system
# zfs create -o sharenfs=rw,anon=0 -o mountpoint=/share panroot/share

Prepare the OpenSolaris virtual machine

If you have not already done so, get a copy of VirtualBox, install it and set up a virtual machine for OpenSolaris.

Important note: Do not install the VirtualBox guest additions. This will install some SMF services that will fail when booted on bare metal.

Send a ZFS snapshot to the host OS root zpool

Let's take a look around the freshly installed OpenSolaris system to see what we want to send.

Inside the OpenSolaris virtual machine
bash-3.2$ zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
rpool                   6.13G  9.50G    46K  /rpool
rpool/ROOT              2.56G  9.50G    18K  legacy
rpool/ROOT/opensolaris  2.56G  9.50G  2.49G  /
rpool/dump               511M  9.50G   511M  -
rpool/export            2.57G  9.50G  2.57G  /export
rpool/export/home        604K  9.50G    19K  /export/home
rpool/export/home/bob    585K  9.50G   585K  /export/home/bob
rpool/swap               512M  9.82G   176M  -
My host system root zpool (panroot) already has swap and dump, so these won't be needed. And it also has an /export hierarchy for home directories. I will recreate my OpenSolaris Primary System Administrator user once on bare metal, so it appears the only thing I need to bring over is the root dataset itself.

Inside the OpenSolaris virtual machine
bash-3.2$ pfexec zfs snapshot rpool/ROOT/opensolaris@scooby
bash-3.2$ pfexec zfs send rpool/ROOT/opensolaris@scooby > /net/10.0.2.2/share/scooby.zfs
We are now done with the virtual machine. It can be shut down and the storage reclaimed for other purposes.

Restore the ZFS dataset in the host system root pool

In addition to restoring the OpenSolaris root pool, the canmount property should be set to noauto. I also destroy the NFS shared directory since it will no longer be needed.
# zfs receive panroot/ROOT/scooby < /share/scooby.zfs
# zfs set canmount=noauto panroot/ROOT/scooby
# zfs destroy panroot/share
Now mount the new OpenSolaris root filesystem and fix up a few configuration files. Specifically
  • /etc/zfs/zpool.cache so that all boot environments have the same view of available ZFS pools
  • /etc/hostid to keep all of the boot environments using the same hostid. This is extremely important and failure to do this will leave some of your boot environments unbootable - which isn't very useful. /etc/hostid is new to build 100 and later.
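The fixups in this list can be wrapped in a small helper. This is a sketch of my own; the function name and the SRC/ALTROOT parameterization are mine, added only so the copy logic can be exercised anywhere (on the live system SRC is simply /).

```shell
# Sketch of the fixup step: give the new boot environment the same
# view of pools (zpool.cache) and the same hostid as the host BE.
# SRC and ALTROOT are parameters for illustration; this is not the
# post's literal procedure.
SRC=${SRC:-/}
ALTROOT=${ALTROOT:-/mnt}

fixup_be() {
    mkdir -p "$ALTROOT/etc/zfs" &&
    cp "$SRC/etc/zfs/zpool.cache" "$ALTROOT/etc/zfs/" &&
    cp "$SRC/etc/hostid" "$ALTROOT/etc/"
    # on the real system this is followed by:
    #   bootadm update-archive -f -R "$ALTROOT"
}
```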
Rebuild the OpenSolaris boot archive and we will be done with that filesystem.
# zfs set canmount=noauto panroot/ROOT/scooby
# zfs set mountpoint=/mnt panroot/ROOT/scooby
# zfs mount panroot/ROOT/scooby

# cp /etc/zfs/zpool.cache /mnt/etc/zfs
# cp /etc/hostid /mnt/etc/hostid

# bootadm update-archive -f -R /mnt
Creating boot_archive for /mnt
updating /mnt/platform/i86pc/amd64/boot_archive
updating /mnt/platform/i86pc/boot_archive

# umount /mnt
Make a home directory for your OpenSolaris administrator user (in this example the user is named admin). Also add a GRUB stanza so that OpenSolaris can be booted.
# mkdir -p /export/home/admin
# chown admin:admin /export/home/admin
# cat >> /panroot/boot/grub/menu.lst   <<DOO
title Scooby
root (hd0,3,a)
bootfs panroot/ROOT/scooby
kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive
DOO
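Before rebooting, it's cheap to confirm the stanza points at the dataset you actually restored. The check below is my own addition, not part of the original steps; MENU and BE are parameterized only for illustration.

```shell
# Sanity check (an assumption layered on the step above): the GRUB
# menu must contain a bootfs line naming the restored dataset, or the
# new entry will not boot.
MENU=${MENU:-/panroot/boot/grub/menu.lst}
BE=${BE:-panroot/ROOT/scooby}

check_menu() {
    grep -q "^bootfs ${BE}\$" "$MENU"
}
```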
At this point we are done. Reboot the system and you should see a new GRUB stanza for our new OpenSolaris installation (scooby). Cue large audience applause track.

Live Upgrade and OpenSolaris Boot Environment Administration

One interesting side effect, and a welcome one, is the healthy interaction of Live Upgrade and beadm(1M). For your Solaris and nevada based installations you can continue to use lucreate(1M), luupgrade(1M), and luactivate(1M). On the OpenSolaris side you can see all of your Live Upgrade boot environments as well as your OpenSolaris boot environments. Note that we can create and activate new boot environments as needed.

When in OpenSolaris
# beadm list
BE                           Active Mountpoint Space   Policy Created          
--                           ------ ---------- -----   ------ -------          
nv101a                       -      -          18.17G  static 2008-11-04 00:03 
nv95                         -      -          122.07M static 2008-11-03 12:47 
opensolaris                  -      -          2.83G   static 2008-11-03 16:23 
opensolaris-2008.11-baseline R      -          2.49G   static 2008-11-04 11:16 
s10u6                        -      -          97.22M  static 2008-11-03 12:03 
s10x_u6wos_07b               -      -          205.48M static 2008-11-01 20:51 
scooby                       N      /          2.61G   static 2008-11-04 10:29 

# beadm create doo
# beadm activate doo
# beadm list
BE                           Active Mountpoint Space   Policy Created          
--                           ------ ---------- -----   ------ -------          
doo                          R      -          5.37G   static 2008-11-04 16:23 
nv101a                       -      -          18.17G  static 2008-11-04 00:03 
nv95                         -      -          122.07M static 2008-11-03 12:47 
opensolaris                  -      -          25.5K   static 2008-11-03 16:23 
opensolaris-2008.11-baseline -      -          105.0K  static 2008-11-04 11:16 
s10u6                        -      -          97.22M  static 2008-11-03 12:03 
s10x_u6wos_07b               -      -          205.48M static 2008-11-01 20:51 
scooby                       N      /          2.61G   static 2008-11-04 10:29 

For the first time I have a single Solaris disk environment that can boot Solaris 10, Solaris Express Community Edition (nevada) or OpenSolaris and have access to all of my data. I did have to add a mount for my shared FAT32 file system (I have an iPhone and several iPods - so Windows does occasionally get opened), but that is about it. Now off to the repository to start playing with all of the new OpenSolaris goodies like Songbird, Brasero, Bluefish and the Xen bits.


Monday Feb 18, 2008

ZFS and FMA - Two great tastes .....

Our good friend Isaac Rozenfeld talks about the Multiplicity of Solaris. When talking about Solaris I will use the phrase "The Vastness of Solaris". If you have attended a Solaris Boot Camp or Tech Day in the last few years you get an idea of what we are talking about - when we go on about Solaris hour after hour after hour.

But the key point in Isaac's multiplicity discussion is how the cornucopia of Solaris features work together to do some pretty spectacular (and competitively differentiating) things. In the past we've looked at combinations such as ZFS and Zones or Service Management, Role Based Access Control (RBAC) and Least Privilege. Based on a conversation last week in St. Louis, let's consider how ZFS and Solaris Fault Management (FMA) play together.

Preparation

Let's begin by creating some fake devices that we can play with. I don't have enough disks on this particular system, but I'm not going to let that slow me down. If you have sufficient real hot swappable disks, feel free to use them instead.
# mkfile 1g /dev/dsk/disk1
# mkfile 1g /dev/dsk/disk2
# mkfile 512m /dev/dsk/disk3
# mkfile 512m /dev/dsk/disk4
# mkfile 1g /dev/dsk/spare1
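If mkfile(1M) isn't handy (say you want to try the ZFS side of this somewhere other than Solaris), sparse files work just as well as backing store. A hedged equivalent of my own; the staging directory is an assumption, and the names and sizes match the zpool commands that follow.

```shell
# Equivalent backing files without mkfile, created as sparse files.
# ZFS is happy with file-backed vdevs, so this is enough for the
# experiment. DEMO is my choice so this runs unprivileged.
DEMO=${DEMO:-/tmp/zfsdemo}
mkdir -p "$DEMO"
for f in disk1:1024 disk2:1024 disk3:512 disk4:512 spare1:1024; do
    name=${f%:*}
    mb=${f#*:}
    # count=0 with seek writes nothing; dd truncates the file to the
    # seek offset, giving a sparse file of the requested size
    dd if=/dev/zero of="$DEMO/$name" bs=1048576 count=0 seek="$mb" 2>/dev/null
done
```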

Now let's create a couple of zpools using the fake devices. pool1 will be a 1GB mirrored pool using disk1 and disk2. pool2 will be a 512MB mirrored pool using disk3 and disk4. Device spare1 will spare both pools in case of a problem - which we are about to inflict upon the pools.
# zpool create pool1 mirror disk1 disk2 spare spare1
# zpool create pool2 mirror disk3 disk4 spare spare1
# zpool status
  pool: pool1
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

  pool: pool2
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool2       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk3   ONLINE       0     0     0
            disk4   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

So far so good. If we run a scrub on either pool, it will complete almost immediately. Remember that unlike hardware RAID disk replacement, ZFS scrubbing and resilvering only touch blocks that contain actual data. Since there is no data in these pools (yet), there is little for the scrubbing process to do.
# zpool scrub pool1
# zpool scrub pool2
# zpool status
  pool: pool1
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Feb 18 09:24:16 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

  pool: pool2
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Feb 18 09:24:17 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool2       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk3   ONLINE       0     0     0
            disk4   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

Let's populate both pools with some data. I happen to have a directory of scenic images that I use as screen backgrounds - that will work nicely.

# cd /export/pub/pix
# find scenic -print | cpio -pdum /pool1
# find scenic -print | cpio -pdum /pool2

# df -k | grep pool
pool1                1007616  248925  758539    25%    /pool1
pool2                 483328  248921  234204    52%    /pool2

And yes, cp -r would have been just as good.

Problem 1: Simple data corruption

Time to inflict some harm upon the pool. First, some simple corruption. Writing some zeros over half of the mirror should do quite nicely.
# dd if=/dev/zero of=/dev/dsk/disk1 bs=8192 count=10000 conv=notrunc
10000+0 records in
10000+0 records out 

At this point we are unaware that anything has happened to our data. So let's try accessing some of the data to see if we can observe ZFS self healing in action. If your system has plenty of memory and is relatively idle, accessing the data may not be sufficient. If you still end up with no errors after the cpio, try a zpool scrub - that will catch all errors in the data.
# cd /pool1
# find . -print | cpio -ov > /dev/null
416027 blocks

Let's ask our friend fmstat(1M) if anything is wrong.
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.1   0   0     0     0      0      0
disk-transport           0       0  0.0  366.5   0   0     0     0    32b      0
eft                      0       0  0.0    2.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       1       0  0.0    0.2   0   0     0     0      0      0
io-retire                0       0  0.0    1.1   0   0     0     0      0      0
snmp-trapgen             1       0  0.0   16.0   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  620.3   0   0     0     0      0      0
syslog-msgs              1       0  0.0    9.7   0   0     0     0      0      0
zfs-diagnosis          162     162  0.0    1.5   0   0     1     0   168b   140b
zfs-retire               1       1  0.0  112.3   0   0     0     0      0      0

As the guys in the Guinness commercial say, "Brilliant!" The important thing to note here is that the zfs-diagnosis engine has run several times indicating that there is a problem somewhere in one of my pools. I'm also running this on Nevada so the zfs-retire engine has also run, kicking in a hot spare due to excessive errors.
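Rather than eyeballing the whole fmstat table each time, awk can flag the engines that have actually received events (ev_recv is the second column). A small convenience of my own, not part of FMA:

```shell
# Filter fmstat output down to engines with a nonzero event count.
# Reads the table on stdin so it can be fed from 'fmstat' directly.
active_engines() {
    # skip the header line, then print module names where ev_recv > 0
    awk 'NR > 1 && $2 + 0 > 0 { print $1 }'
}

# Intended use:  fmstat | active_engines
```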

So which pool is having the problems? We continue our FMA investigation to find out.
# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 18 09:56:24 d82d1716-c920-6243-e899-b7ddd386902e  ZFS-8000-GH    Major    

Fault class : fault.fs.zfs.vdev.checksum

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.


# zpool status -x
  pool: pool1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress, 44.83% done, 0h0m to go
config:

        NAME          STATE     READ WRITE CKSUM
        pool1         DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            spare     DEGRADED     0     0     0
              disk1   DEGRADED     0     0   162  too many errors
              spare1  ONLINE       0     0     0
            disk2     ONLINE       0     0     0
        spares
          spare1      INUSE     currently in use

errors: No known data errors

This tells us all that we need to know. The device disk1 was found to have quite a few checksum errors - so many in fact that it was replaced automatically by a hot spare. The spare was resilvering and a full complement of data replicas would be available soon. The entire process was automatic and completely observable.

Since we inflicted harm upon the (fake) disk device ourselves, we know that it is in fact quite healthy. So we can restore our pool to its original configuration rather simply - by detaching the spare and clearing the error. We should also clear the FMA counters and repair the ZFS vdev so that we can tell if anything else is misbehaving in either this or another pool.
# zpool detach pool1 spare1
# zpool clear pool1
# zpool status pool1
  pool: pool1
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Feb 18 10:25:26 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors


# fmadm reset zfs-diagnosis
# fmadm reset zfs-retire
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  223.5   0   0     0     0    32b      0
eft                      1       0  0.0    4.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       4       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    1.1   0   0     0     0      0      0
snmp-trapgen             4       0  0.0    8.8   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  372.7   0   0     0     0      0      0
syslog-msgs              4       0  0.0    5.4   0   0     0     0      0      0
zfs-diagnosis            0       0  0.0    1.4   0   0     0     0      0      0
zfs-retire               0       0  0.0    0.0   0   0     0     0      0      0


# fmdump -v -u d82d1716-c920-6243-e899-b7ddd386902e
TIME                 UUID                                 SUNW-MSG-ID
Feb 18 09:51:49.3025 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH
  100%  fault.fs.zfs.vdev.checksum

        Problem in: 
           Affects: zfs://pool=pool1/vdev=449a3328bc444732
               FRU: -
          Location: -

# fmadm repair zfs://pool=pool1/vdev=449a3328bc444732
fmadm: recorded repair to zfs://pool=pool1/vdev=449a3328bc444732

# fmadm faulty

Problem 2: Device failure

Time to do a little more harm. In this case I will simulate the failure of a device by removing the fake device. Again we will access the pool and then consult fmstat to see what is happening (are you noticing a pattern here?).
# rm -f /dev/dsk/disk2
# cd /pool1
# find . -print | cpio -oc > /dev/null
416027 blocks

# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  214.2   0   0     0     0    32b      0
eft                      1       0  0.0    4.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       4       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    1.1   0   0     0     0      0      0
snmp-trapgen             4       0  0.0    8.8   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  372.7   0   0     0     0      0      0
syslog-msgs              4       0  0.0    5.4   0   0     0     0      0      0
zfs-diagnosis            0       0  0.0    1.4   0   0     0     0      0      0
zfs-retire               0       0  0.0    0.0   0   0     0     0      0      0

Rats - the find was served entirely from cache this time, so nothing ever touched the damaged device. As before, should this happen, proceed directly to zpool scrub.
# zpool scrub pool1
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  190.5   0   0     0     0    32b      0
eft                      1       0  0.0    4.1   0   0     0     0   1.4M      0
fmd-self-diagnosis       5       0  0.0    0.5   0   0     0     0      0      0
io-retire                1       0  0.0    1.0   0   0     0     0      0      0
snmp-trapgen             6       0  0.0    7.4   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  329.0   0   0     0     0      0      0
syslog-msgs              6       0  0.0    4.6   0   0     0     0      0      0
zfs-diagnosis           16       1  0.0   70.3   0   0     1     1   168b   140b
zfs-retire               1       0  0.0  509.8   0   0     0     0      0      0

Again, hot sparing has kicked in automatically. The evidence of this is the zfs-retire engine running.
# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 18 11:07:29 50ea07a0-2cd9-6bfb-ff9e-e219740052d5  ZFS-8000-D3    Major    
Feb 18 11:16:43 06bfe323-2570-46e8-f1a2-e00d8970ed0d

Fault class : fault.fs.zfs.device

Description : A ZFS device failed.  Refer to http://sun.com/msg/ZFS-8000-D3 for
              more information.

Response    : No automated response will occur.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

# zpool status -x
  pool: pool1
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver in progress, 4.94% done, 0h0m to go
config:

        NAME          STATE     READ WRITE CKSUM
        pool1         DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            disk1     ONLINE       0     0     0
            spare     DEGRADED     0     0     0
              disk2   UNAVAIL      0     0     0  cannot open
              spare1  ONLINE       0     0     0
        spares
          spare1      INUSE     currently in use

errors: No known data errors

As before, this tells us all that we need to know. A device (disk2) has failed and is no longer in operation. Sufficient spares existed and one was automatically attached to the damaged pool. Resilvering completed successfully and the data is once again fully mirrored.

But here's the magic. Let's repair the device - again simulated, this time by recreating our fake device.
# mkfile 1g /dev/dsk/disk2
# zpool replace pool1 disk2
# zpool status pool1 
  pool: pool1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 4.86% done, 0h1m to go
config:

        NAME               STATE     READ WRITE CKSUM
        pool1              DEGRADED     0     0     0
          mirror           DEGRADED     0     0     0
            disk1          ONLINE       0     0     0
            spare          DEGRADED     0     0     0
              replacing    DEGRADED     0     0     0
                disk2/old  UNAVAIL      0     0     0  cannot open
                disk2      ONLINE       0     0     0
              spare1       ONLINE       0     0     0
        spares
          spare1           INUSE     currently in use

errors: No known data errors

Get a cup of coffee while the resilvering process runs.
# zpool status
  pool: pool1
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Feb 18 11:23:13 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    AVAIL   


# fmadm faulty

Notice the nice integration with FMA. Not only was the new device resilvered, but the hot spare was detached and the FMA fault was cleared. The fmstat counters still show that there was a problem, and the fault report still exists in the fault log for later interrogation.
# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  171.5   0   0     0     0    32b      0
eft                      1       0  0.0    3.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       6       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    0.9   0   0     0     0      0      0
snmp-trapgen             6       0  0.0    6.8   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  294.3   0   0     0     0      0      0
syslog-msgs              6       0  0.0    4.2   0   0     0     0      0      0
zfs-diagnosis           36       1  0.0   51.6   0   0     0     1      0      0
zfs-retire               1       0  0.0  170.0   0   0     0     0      0      0

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Feb 16 11:38:16.0976 48935791-ff83-e622-fbe1-d54c20385afc ZFS-8000-GH
Feb 16 11:38:30.8519 9f7f288c-fea8-e5dd-bf23-c0c9c4e07233 ZFS-8000-GH
Feb 18 09:51:49.3025 2ac4568f-4040-cb5d-f3b8-ae3d69e7d713 ZFS-8000-GH
Feb 18 09:56:24.8029 d82d1716-c920-6243-e899-b7ddd386902e ZFS-8000-GH
Feb 18 10:23:07.2228 7c04a6f7-d22a-e467-c44d-80810f27b711 ZFS-8000-GH
Feb 18 10:25:14.6429 faca0639-b82b-c8e8-c8d4-fc085bc03caa ZFS-8000-GH
Feb 18 11:07:29.5195 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3
Feb 18 11:16:44.2497 06bfe323-2570-46e8-f1a2-e00d8970ed0d ZFS-8000-D3


# fmdump -V -u 50ea07a0-2cd9-6bfb-ff9e-e219740052d5
TIME                 UUID                                 SUNW-MSG-ID
Feb 18 11:07:29.5195 50ea07a0-2cd9-6bfb-ff9e-e219740052d5 ZFS-8000-D3

  TIME                 CLASS                                 ENA
  Feb 18 11:07:27.8476 ereport.fs.zfs.vdev.open_failed       0xb22406c635500401

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 50ea07a0-2cd9-6bfb-ff9e-e219740052d5
        code = ZFS-8000-D3
        diag-time = 1203354449 236999
        de = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        product-id = Dimension XPS                
                        chassis-id = 7XQPV21
                        server-id = arrakis
                (end authority)

                mod-name = zfs-diagnosis
                mod-version = 1.0
        (end de)

        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = fault.fs.zfs.device
                certainty = 0x64
                asru = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x3a2ca6bebd96cfe3
                        vdev = 0xedef914b5d9eae8d
                (end asru)

                resource = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x3a2ca6bebd96cfe3
                        vdev = 0xedef914b5d9eae8d
                (end resource)

        (end fault-list[0])

        fault-status = 0x3
        __ttl = 0x1
        __tod = 0x47b9bb51 0x1ef7b430

# fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset

# fmadm reset zfs-retire
fmadm: zfs-retire module has been reset

Problem 3: Unrecoverable corruption

As those of you who have attended one of my Boot Camps or Solaris Best Practices training classes know, House is one of my favorite TV shows - the only one that I watch regularly. And this next example would make a perfect episode. Is it likely to happen? No, but it is so cool when it does :-)

Remember our second pool, pool2. It has the same contents as pool1. Now, let's do the unthinkable - let's corrupt both halves of the mirror. Surely data loss will follow, but the fact that Solaris stays up and running and can report what happened is pretty spectacular. But it gets so much better than that.
# dd if=/dev/zero of=/dev/dsk/disk3 bs=8192 count=10000 conv=notrunc
# dd if=/dev/zero of=/dev/dsk/disk4 bs=8192 count=10000 conv=notrunc
# zpool scrub pool2
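For scale: those dd invocations zero the first 10,000 8 KB blocks of each device. A quick check of the arithmetic shows how much of each disk that is:

```shell
# 10,000 blocks of 8192 bytes zeroed at the front of each disk
echo "$((8192 * 10000)) bytes, or $((8192 * 10000 / 1024 / 1024)) MB overwritten"
```

Roughly speaking, only data that happens to live in that overwritten region loses both copies; everything else still has a good side to scrub from, which is why only some files end up in the permanent-error list below.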

# fmstat
module             ev_recv ev_acpt wait  svc_t  %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0    0.5   0   0     0     0      0      0
disk-transport           0       0  0.0  166.0   0   0     0     0    32b      0
eft                      1       0  0.0    3.6   0   0     0     0   1.4M      0
fmd-self-diagnosis       6       0  0.0    0.6   0   0     0     0      0      0
io-retire                1       0  0.0    0.9   0   0     0     0      0      0
snmp-trapgen             8       0  0.0    6.3   0   0     0     0    32b      0
sysevent-transport       0       0  0.0  294.3   0   0     0     0      0      0
syslog-msgs              8       0  0.0    3.9   0   0     0     0      0      0
zfs-diagnosis         1032    1028  0.6   39.7   0   0    93     2    15K    13K
zfs-retire               2       0  0.0  158.5   0   0     0     0      0      0

As before, lots of zfs-diagnosis activity. And two hits to zfs-retire. But we only have one spare - this should be interesting. Let's see what is happening.
# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 18 09:56:24 d82d1716-c920-6243-e899-b7ddd386902e  ZFS-8000-GH    Major    
Feb 18 13:18:42 c3889bf1-8551-6956-acd4-914474093cd7

Fault class : fault.fs.zfs.vdev.checksum

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Feb 16 11:38:30 9f7f288c-fea8-e5dd-bf23-c0c9c4e07233  ZFS-8000-GH    Major    
Feb 18 09:51:49 2ac4568f-4040-cb5d-f3b8-ae3d69e7d713
Feb 18 10:23:07 7c04a6f7-d22a-e467-c44d-80810f27b711
Feb 18 13:18:42 0a1bf156-6968-4956-d015-cc121a866790

Fault class : fault.fs.zfs.vdev.checksum

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

# zpool status -x
  pool: pool2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed with 602 errors on Mon Feb 18 13:20:14 2008
config:

        NAME          STATE     READ WRITE CKSUM
        pool2         DEGRADED     0     0 2.60K
          mirror      DEGRADED     0     0 2.60K
            spare     DEGRADED     0     0 2.43K
              disk3   DEGRADED     0     0 5.19K  too many errors
              spare1  DEGRADED     0     0 2.43K  too many errors
            disk4     DEGRADED     0     0 5.19K  too many errors
        spares
          spare1      INUSE     currently in use

errors: 247 data errors, use '-v' for a list

So ZFS tried to bring in a hot spare, but there were insufficient replicas to be able to reconstruct all of the data. But here is where it gets interesting. Let's see what zpool status -v says about things.
# zpool status -v
  pool: pool1
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Feb 18 11:23:13 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk1   ONLINE       0     0     0
            disk2   ONLINE       0     0     0
        spares
          spare1    INUSE     in use by pool 'pool2'

errors: No known data errors

  pool: pool2
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed with 602 errors on Mon Feb 18 13:20:14 2008
config:

        NAME          STATE     READ WRITE CKSUM
        pool2         DEGRADED     0     0 2.60K
          mirror      DEGRADED     0     0 2.60K
            spare     DEGRADED     0     0 2.43K
              disk3   DEGRADED     0     0 5.19K  too many errors
              spare1  DEGRADED     0     0 2.43K  too many errors
            disk4     DEGRADED     0     0 5.19K  too many errors
        spares
          spare1      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        /pool2/scenic/cider mill crowds.jpg
        /pool2/scenic/Cleywindmill.jpg
        /pool2/scenic/csg_Landscapes001_GrandTetonNationalPark,Wyoming.jpg
        /pool2/scenic/csg_Landscapes002_ElowahFalls,Oregon.jpg
        /pool2/scenic/csg_Landscapes003_MonoLake,California.jpg
        /pool2/scenic/csg_Landscapes005_TurretArch,Utah.jpg
        /pool2/scenic/csg_Landscapes004_Wildflowers_MountRainer,Washington.jpg
        /pool2/scenic/csg_Landscapes!idx011.jpg
        /pool2/scenic/csg_Landscapes127_GreatSmokeyMountains-NorthCarolina.jpg
        /pool2/scenic/csg_Landscapes129_AcadiaNationalPark-Maine.jpg
        /pool2/scenic/csg_Landscapes130_GettysburgNationalPark-Pennsylvania.jpg
        /pool2/scenic/csg_Landscapes131_DeadHorseMill,CrystalRiver-Colorado.jpg
        /pool2/scenic/csg_Landscapes132_GladeCreekGristmill,BabcockStatePark-WestVirginia.jpg
        /pool2/scenic/csg_Landscapes133_BlackwaterFallsStatePark-WestVirginia.jpg
        /pool2/scenic/csg_Landscapes134_GrandCanyonNationalPark-Arizona.jpg
        /pool2/scenic/decisions decisions.jpg
        /pool2/scenic/csg_Landscapes135_BigSur-California.jpg
        /pool2/scenic/csg_Landscapes151_WataugaCounty-NorthCarolina.jpg
        /pool2/scenic/csg_Landscapes150_LakeInTheMedicineBowMountains-Wyoming.jpg
        /pool2/scenic/csg_Landscapes152_WinterPassage,PondMountain-Tennessee.jpg
        /pool2/scenic/csg_Landscapes154_StormAftermath,OconeeCounty-Georgia.jpg
        /pool2/scenic/Brig_Of_Dee.gif
        /pool2/scenic/pvnature14.gif
        /pool2/scenic/pvnature22.gif
        /pool2/scenic/pvnature7.gif
        /pool2/scenic/guadalupe.jpg
        /pool2/scenic/ernst-tinaja.jpg
        /pool2/scenic/pipes.gif
        /pool2/scenic/boat.jpg
        /pool2/scenic/pvhawaii.gif
        /pool2/scenic/cribgoch.jpg
        /pool2/scenic/sun1.gif
        /pool2/scenic/sun1.jpg
        /pool2/scenic/sun2.jpg
        /pool2/scenic/andes.jpg
        /pool2/scenic/treesky.gif
        /pool2/scenic/sailboatm.gif
        /pool2/scenic/Arizona1.jpg
        /pool2/scenic/Arizona2.jpg
        /pool2/scenic/Fence.jpg
        /pool2/scenic/Rockwood.jpg
        /pool2/scenic/sawtooth.jpg
        /pool2/scenic/pvaptr04.gif
        /pool2/scenic/pvaptr07.gif
        /pool2/scenic/pvaptr11.gif
        /pool2/scenic/pvntrr01.jpg
        /pool2/scenic/Millport.jpg
        /pool2/scenic/bryce2.jpg
        /pool2/scenic/bryce3.jpg
        /pool2/scenic/monument.jpg
        /pool2/scenic/rainier1.gif
        /pool2/scenic/arch.gif
        /pool2/scenic/pv-anzab.gif
        /pool2/scenic/pvnatr15.gif
        /pool2/scenic/pvocean3.gif
        /pool2/scenic/pvorngwv.gif
        /pool2/scenic/pvrmp001.gif
        /pool2/scenic/pvscen07.gif
        /pool2/scenic/pvsltd04.gif
        /pool2/scenic/banhall28600-04.JPG
        /pool2/scenic/pvwlnd01.gif
        /pool2/scenic/pvnature08.gif
        /pool2/scenic/pvnature13.gif
        /pool2/scenic/nokomis.jpg
        /pool2/scenic/lighthouse1.gif
        /pool2/scenic/lush.gif
        /pool2/scenic/oldmill.gif
        /pool2/scenic/gc1.jpg
        /pool2/scenic/gc2.jpg
        /pool2/scenic/canoe.gif
        /pool2/scenic/Donaldson-River.jpg
        /pool2/scenic/beach.gif
        /pool2/scenic/janloop.jpg
        /pool2/scenic/grobacro.jpg
        /pool2/scenic/fnlgld.jpg
        /pool2/scenic/bells.gif
        /pool2/scenic/Eilean_Donan.gif
        /pool2/scenic/Kilchurn_Castle.gif
        /pool2/scenic/Plockton.gif
        /pool2/scenic/Tantallon_Castle.gif
        /pool2/scenic/SouthStockholm.jpg
        /pool2/scenic/BlackRock_Cottage.jpg
        /pool2/scenic/seward.jpg
        /pool2/scenic/canadian_rockies_csg110_EmeraldBay.jpg
        /pool2/scenic/canadian_rockies_csg111_RedRockCanyon.jpg
        /pool2/scenic/canadian_rockies_csg112_WatertonNationalPark.jpg
        /pool2/scenic/canadian_rockies_csg113_WatertonLakes.jpg
        /pool2/scenic/canadian_rockies_csg114_PrinceOfWalesHotel.jpg
        /pool2/scenic/canadian_rockies_csg116_CameronLake.jpg
        /pool2/scenic/Castilla_Spain.jpg
        /pool2/scenic/Central-Park-Walk.jpg
        /pool2/scenic/CHANNEL.JPG



In my best Hugh Laurie voice trying to sound very Northeastern American: that is so cool! But we're not even done yet. Let's take this list of files and restore them - in this case, from pool1. Operationally this would be from a backup tape or nearline backup cache, but for our purposes, the contents in pool1 will do nicely.

First, let's clear the zpool error counters and return the spare disk. We want to make sure that our restore works as desired. Oh, and clear the FMA stats while we're at it.
# zpool clear pool2
# zpool detach pool2 spare1

# fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset

# fmadm reset zfs-retire   
fmadm: zfs-retire module has been reset

Now individually restore the files that have errors in them and check again. You can even export and reimport the pool and you will find a very nice, happy, and thoroughly error free ZFS pool. Some rather unpleasant gnashing of zpool status -v output with awk has been omitted for sanity's sake.
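The omitted gnashing looks roughly like this - a sketch rather than the exact script I used. It assumes the zpool status -v report was captured to a file (status.txt is a hypothetical name) and keys on the indented lines beginning with /pool2/, since the damaged paths can contain spaces:

```shell
# Hypothetical sketch of the omitted awk/sed gnashing: pull the damaged-file
# list out of a captured 'zpool status -v' report.  The damaged paths are the
# indented lines that begin with /pool2/ and may contain spaces, so match on
# the prefix instead of splitting into whitespace-separated fields.
list_damaged() {
  sed -n 's|^ *\(/pool2/.*\)$|\1|p' "$1"
}

# Restore each damaged file from the intact copy in pool1 (operationally this
# would come from backup tape or a nearline cache):
#
#   list_damaged status.txt | while IFS= read -r f; do
#     cp "/pool1${f#/pool2}" "$f"
#   done
```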
# zpool scrub pool2
# zpool status pool2
  pool: pool2
 state: ONLINE
 scrub: scrub completed with 0 errors on Mon Feb 18 14:04:56 2008
config:

        NAME        STATE     READ WRITE CKSUM
        pool2       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk3   ONLINE       0     0     0
            disk4   ONLINE       0     0     0
        spares
          spare1    AVAIL   

errors: No known data errors

# zpool export pool2
# zpool import pool2
# dircmp -s /pool1 /pool2

Conclusions and Review

So what have we learned ? ZFS and FMA are two great tastes that taste great together. No, that's chocolate and peanut butter, but you get the idea. One more great example of Isaac's Multiplicity of Solaris.

That, and I have finally found a good lab exercise for the FMA training materials. Ever since Christine Tran put the FMA workshop together, we have been looking for some good FMA lab exercises. The materials reference a synthetic fault generator that is not available in public (for obvious reasons). I haven't explored the FMA test harness enough to know if there is anything in there that would make a good lab. But this exercise that we have just explored seems to tie a number of key pieces together.

And of course, one more reason why Roxy says, "You should run Solaris."


Wednesday Oct 03, 2007

Live Upgrade from Solaris 10 11/06 to 8/07 without nonglobal zones

Live Upgrade is one of the most useful Solaris features, yet in my travels around the US I still don't see it used as much as I would like. I can think of several reasons for this - not all of them totally valid
  • I tried it once a long time ago and a patch or package that wasn't LU aware messed up my current boot environment. Not valid for Solaris components although we do see the occasional partner product with this problem. The last one I saw was the NVidia driver, and the good folks from NVidia fixed it very quickly once reported.
  • The documentation can be a bit intimidating. Valid with a capital V. But Live Upgrade is an amazingly flexible feature, so at some point you do have to describe these capabilities. As a guide through this documentation, several folks have blogged manageable howto guides. You can find mine back in March 2007, although I've recently updated it. And there are other good blogs with plenty of examples. There is a very good Blueprint on Live Upgrade.
  • It doesn't work with the Veritas Volume Manager.
  • I didn't know about Live Upgrade. Well, you do now. But I have noticed that a lot of the Solaris conversation is focused on new features, like ZFS, Zones, SMF, DTrace and some of the older features like Flash archives and Live Upgrade don't receive the attention they deserve. The simple fact is that Live Upgrade takes all of the pain out of the patching process, at least once you know what to patch.
And I'm sure there are other reasons, but these are the ones I hear most often.

Let's turn our attention to the topic at hand, upgrading a Solaris 10 11/06 system to 8/07, without zones. This example will be on an x64 system, but the SPARC approach is similar.

If you have read my earlier blog on Live Upgrade, you will recall the process is
  1. Read Infodoc 72099 and install any required patches
  2. Install the LU packages SUNWluu SUNWlur and SUNWlucfg (if present) from the installation media
  3. lurename(1m) if you want to change the name of your new boot environment
  4. lumake(1m) or ludelete(1m) + lucreate(1m) to repopulate the target boot environment with the proper software and configuration files
  5. luupgrade(1m) to upgrade the target boot environment
  6. luactivate(1m) to activate the new boot environment
  7. init 0 to perform the file synchronization and conversions, create the new boot archive and update your GRUB menu


So I fire up my web browser and run over to SunSolve to pick up Infodoc 72099 and see a rather large set of patches. And there are two lists, one for systems with non-global zones and one without. Since we're looking at a system without non-global zones, we will start with the shorter of the two lists (the next article will cover systems with non-global zones).

Apparently we need patches

Solaris 10   x86   118816-03 or higher   nawk patch
Solaris 10   x86   120901-03 or higher   libzonecfg patch
Solaris 10   x86   121334-04 or higher   SUNWzoneu required patch
Solaris 10   x86   119255-42 or higher   patchadd/patchrm patches
Solaris 10   x86   119318-01 or higher   SVr4 Packaging Commands (usr) Patch
Solaris 10   x86   117435-02 or higher   biosdev patch for GRUB Boot

(Reboot after installation)

Solaris 10   x86   120236-01 or higher   SUNWluzone required patches
Solaris 10   x86   121429-08 or higher   SUNWluzone required patches
Solaris 10   x86   121003-03 or higher   pax patch
Solaris 10   x86   123122-02 or higher   prodreg patch
Solaris 10   x86   121005-03             sh patch
Solaris 10   x86   119043-10             /usr/sbin/svccfg patch
Solaris 10   x86   121902-02             i.manifest r.manifest class action script patch
Solaris 10   x86   120901-03             libzonecfg patch
Solaris 10   x86   120069-03             telnet security patch
Solaris 10   x86   120070-02             cpio patch
Solaris 10   x86   123333-01             tftp patch


Hmmm, seems like a lot of patches and a required reboot! So I fire up our new friend updatemanager to patch my system. I see that there is a new updatemanager patch available (121119-13), so I installed that one all by itself and restarted updatemanager.

I soon realize that my choice of patching tools is making this a bit challenging. Users of patch tools such as Patch Check Advanced(PCA) may have an easier time, but I was determined to do this with updatemanager, with occasional help from the patch READMEs in SunSolve.

The list of patches required for this upgrade applies to any release of Solaris 10. A fresh install of a Solaris 10 11/06 system only needed the following four patches - which is a lot better than I first thought.

119255-42
121429-08
126539-01 as it replaces the required 121902-02
125419-01 as it replaces the required 120069-03

The difficulty with updatemanager was the set of obsoleted patches. A case like the required 121902-02, which had been obsoleted by the already-installed 126539-01, took a bit of manual trolling through patch READMEs. So I'll save you the research - it came down to only the four patches above.
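Chasing patch revisions by hand gets old quickly, so here is a hypothetical helper (not part of the original procedure). It assumes showrev -p output has been captured to a file, with each line starting "Patch: <base>-<rev>"; note that it only compares revisions of the same base patch id, so a replacement such as 126539-01 standing in for 121902-02 still takes a trip through the README:

```shell
# Hypothetical revision check against captured 'showrev -p' output.
# Reports ok if the base patch id is installed at the wanted revision or
# higher; it does not follow obsolescence chains.
check_patch() {  # usage: check_patch 119255-42 showrev.txt
  base=${1%-*}; want=${1#*-}
  have=$(awk -v b="$base" '$1 == "Patch:" { split($2, a, "-");
           if (a[1] == b) print a[2] }' "$2" | sort -n | tail -1)
  if [ -n "$have" ] && [ "$have" -ge "$want" ]; then
    echo "$1 ok (rev $have installed)"
  else
    echo "$1 missing"
  fi
}
```

Run it once per required patch, e.g. for p in 119255-42 121429-08 126539-01 125419-01; do check_patch $p showrev.txt; done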

One important note: the required reboot after patch 117435-02 wasn't needed after all - so I'll try to save all of you Solaris 10 11/06 users one reboot. While I have your attention, it is a good idea, if not a best practice, to install patch and packaging patches separately.

Feeling a lot better about this process, I proceed and install the four required patches using updatemanager in two steps (119255-42 and then the other three patches) and all succeeded, as expected. All that was left to do was finish the standard procedure
# mount -o ro -F hsfs `lofiadm -a /export/iso/s10u4/solarisdvd.iso` /mnt 
# pkgadd -d /mnt/Solaris_10/Product SUNWlur SUNWluu SUNWlucfg 
# lurename -e nv71 -n s10u4 
# lumake -n s10u4 
# luupgrade -u -s /mnt -n s10u4 
# luactivate s10u4 
# init 0 


And all went as expected. Next time I will tackle the longer list of patches and examine the same upgrade path, but with nonglobal zones.


Monday Mar 26, 2007

Securing MySQL using SMF - the Ultimate Manifest

The best way to learn the Solaris Service Management Facility (SMF) is to migrate a legacy service. The version of MySQL that comes with Solaris is an ideal application. It is relatively simple, has few dependencies, and can be done in just a few quick edits of an existing manifest (utmp would be a good starting template). We cover the basic process in the SMF Deep Dive and various people have contributed manifests to OpenSolaris and Blastwave. While these are good illustrations of how easy the process is, few show what SMF can really do. The motivation for this how-to came from a recent Solaris Bootcamp attendee who asked "what was wrong with the RC scripts the way they were ?".

Without skipping a beat.....
  1. Easy support of multiple service instances
  2. Deterministic location of service log files
  3. Timeouts on the start and stop methods to prevent system boots from hanging indefinitely.
  4. Quickly observable service state
  5. Flexible service dependencies
  6. Automatic restarting of the service upon failure
Upon closer inspection, recognizing when the service terminated and restarting it automatically isn't that special for mysql. The mysqld_safe daemon actually performs that step, restarting the database server if it fails. Yes, this is unique to mysql and may not exist for other services. Certainly, if the mysqld_safe parent actually fails, SMF does provide an additional capability by automatically restarting it. But we need more.

Most of the service migration demonstrations are single instance with no downstream application dependencies - so we still need more.

The mysql service start script runs through a set of configuration files, setting variables and starting a detached daemon, so it's highly unlikely that it will ever get stuck. Sure, it can get hacked and have bad things happen to it, but as delivered it is relatively safe. So we still need more.

The answer to the question lies in security. SMF provides a rich set of security features that demonstrate the power of Solaris Role Based Access Control (RBAC) and least privilege. Contrary to what you might think, these features are quite easy to use - once you learn a few simple concepts. This is how we will answer the question "what was wrong with the RC scripts the way they were?".

Authorizations

One of the most useful applications of RBAC is to create administration and operations roles. While the details of these roles will vary from customer to customer, the common theme is that operator roles should be able to start and stop a service in a safe manner and an administrative role should be able to modify service properties (some of which may include the ability to start or stop the service).

Historically this has been accomplished by third party security software inserting itself all over the kernel (sometimes in a manner that makes upgrades or maintenance difficult) or custom scripts that make use of setuid(2). Solaris 10 can perform many of these functions with just a few entries to some configuration files, and SMF makes this process extremely easy.

You can get lots of valuable information on Solaris Security features (roles, profiles, auths, privileges) at the OpenSolaris Security Community. As you navigate the wealth of white papers, ARC cases, and how-to examples, think of Solaris authorizations as the magic that makes this possible (or more precisely simple).

In a sentence, auths are labels that a privileged application uses to restrict access to its features. In our case the privileged applications are svcadm(1M) and svccfg(1M). If you read the smf_security(5) man page (which is excellent reading) you will see that SMF provides several authorizations.
  • solaris.smf.manage - ability to start and stop any SMF managed service (good - but not what we are looking for)
  • action_authorization (in the general property group) - allows a non-root user to run the methods (start, stop, and refresh)
  • value_authorization (any property group) - change properties in the property group (such as general/enabled)
  • modify_authorization (any property group) - add or delete properties in the property group
Now this is getting interesting. So it appears that we can use either the action or modify authorization for the operator role. So which one do we use ?

The action_authorization would only allow running the method but not modifying any of the properties. The implication is that you can do
# svcadm enable -t mysql
but not
# svcadm enable mysql
The difference between the two commands is that enable without -t will try to set the property general/enabled to true in addition to running the start method. This would require the value_authorization. But value_authorization will allow you to change (almost) any property in the property group (in this case the general property group), so let's see what else value_authorization will let you do.
# svcprop -p general ssh
general/enabled boolean true
general/action_authorization astring solaris.smf.manage.ssh
general/entity_stability astring Unstable
general/single_instance boolean true
Hmmm, the only properties that might be abused would be the authorizations, but those require additional authorizations (solaris.smf.modify) to change. So it would seem that value_authorization would be safe for an operator role - unorthodox perhaps, but safe. modify_authorization would allow the creation of other service properties, and if limited to the general property group might be confusing, but relatively harmless - unless of course we add a new general property later. For this reason, modify_authorization would not be a good candidate for an operator role.

So which authorization to use ? Use action_authorization if you want a user (or role) to be able to start and stop the service, but not make the change permanent. This is the most common case. Use value_authorization in the general property group if you want that user or role to be able to permanently turn a service on or off - this is generally an administrative role.

Let's put this all together.

Start with your existing SMF manifest for MySQL. If you don't have one, you can use mine at http://blogs.sun.com/resources/bobn/mysql.xml or Keith Lawson's contributed MySQL manifest at the OpenSolaris SMF Contributed Manifests and Methods page.

Add the following section
<property_group name='general' type='framework'>
        <propval name='action_authorization' type='astring' value='mysql.operator' />
        <propval name='value_authorization' type='astring' value='mysql.administrator' />
</property_group>

Import the new manifest by the method of your choice (svccfg import, /lib/svc/method/manifest-import, or reboot) and your new MySQL service can be managed by auths. So how do we get those auths assigned to users (or roles)?

Authorizations are granted to users and roles by the configuration file /etc/user_attr. You can read the user_attr(4) man page for all of the details, but the process is to add auths=mysql.operator to the user or role entry. For example
# grep \^joeuser /etc/user_attr
joeuser::::type=normal;auths=mysql.operator
It is possible that a user or role may not be present in /etc/user_attr. In that case just add a line like the one above and assign the appropriate auth.

Let's see all of this in action.
% auths
mysql.operator,solaris.smf.manage.name-service.cache,solaris.smf.manage.bind,solaris.admin.dcmgr.clients,solaris.admin.dcmgr.read,solaris.snmp.\*,solaris.network.hosts.\*,solaris.smf.value.routing,solaris.smf.manage.routing,solaris.network.wifi.config,solaris.device.cdrw,solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read,solaris.jobs.user,solaris.device.mount.removable

% svcadm enable -t mysql
% svcs mysql
STATE          STIME    FMRI
online         15:51:02 svc:/application/mysql:default

So far so good.
% svcadm enable mysql
svcadm: svc:/application/mysql:default: Permission denied.

Why did this fail ?
% svcprop -p general mysql
general/enabled boolean true
general/action_authorization astring mysql.operator
general/entity_stability astring Unstable
general/single_instance boolean true
general/value_authorization astring mysql.administrator

Because enable also tries to set the general/enabled property - and that requires value or modify authorization. So let's change my user definition in /etc/user_attr.
% grep \^joeuser /etc/user_attr
joeuser::::type=normal;auths=mysql.operator,mysql.administrator
% auths
mysql.operator,mysql.administrator,solaris.smf.manage.name-service.cache,solaris.smf.manage.bind,solaris.admin.dcmgr.clients,solaris.admin.dcmgr.read,solaris.snmp.\*,solaris.network.hosts.\*,solaris.smf.value.routing,solaris.smf.manage.routing,solaris.network.wifi.config,solaris.device.cdrw,solaris.profmgr.read,solaris.jobs.users,solaris.mail.mailq,solaris.admin.usermgr.read,solaris.admin.logsvc.read,solaris.admin.fsmgr.read,solaris.admin.serialmgr.read,solaris.admin.diskmgr.read,solaris.admin.procmgr.user,solaris.compsys.read,solaris.admin.printer.read,solaris.admin.prodreg.read,solaris.snmp.read,solaris.project.read,solaris.admin.patchmgr.read,solaris.network.hosts.read,solaris.admin.volmgr.read,solaris.jobs.user,solaris.device.mount.removable

% svcadm enable mysql
% svcs mysql
STATE          STIME    FMRI
online         16:10:37 svc:/application/mysql:default

This is all very cool - but we can still do more.

Removing Root from the Equation

For both simplicity and compatibility with other operating systems, the MySQL service is started by a script that is run as root. This script is generally linked into /etc/rc3.d, but since we have converted it to an SMF service we have many more options. We have already looked at delegated administration using auths; now it is time to turn our attention to privileges.
# /etc/sfw/mysql/mysql.server start
# ps -ef | grep mysqld | grep -v grep
   mysql  1975  1955   0 21:43:17 pts/8       0:00 /usr/sfw/sbin/mysqld --basedir=/usr/sfw --datadir=/var/mysql --user=mysql --pid
    root  1955     1   0 21:43:17 pts/8       0:00 /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var/mysql/pa
# /etc/sfw/mysql/mysql.server stop

This suggests two immediate questions. Does the parent mysqld_safe really have to run as root, or can it be started as a lesser privileged user ? If it can run as a non-root user, exactly what privileges are required to run mysql ?

The answer to the first question is simple: it can be run as a regular user. It only runs as root out of convenience to operating systems that don't have as sophisticated a security framework as Solaris.
#  su - mysql
Sun Microsystems Inc.   SunOS 5.11      snv_57  October 2007
$ sh /etc/sfw/mysql/mysql.server start
$ /usr/sfw/bin/mysqladmin status
Uptime: 1174  Threads: 1  Questions: 1  Slow queries: 0  Opens: 6  Flush tables: 1  Open tables: 0  Queries per second avg: 0.001
$ sh /etc/sfw/mysql/mysql.server stop
Killing mysqld with pid 1975
Wait for mysqld to exit done
$ exit
#
Now that we have established the fact that a fully privileged user isn't required to run MySQL, what privileges are really required ? How far can we restrict the mysql user ? Glenn Brunette's privilege debugger privdebug.pl is the perfect tool to help us answer this question.
# privdebug.pl -f -v  -e "su - mysql /usr/sfw/sbin/mysqld_safe --user=mysql"
STAT TIMESTAMP          PPID   PID    PRIV                 CMD
USED 2005619300419      2211   2212   proc_taskid          su
USED 2005620883559      2211   2212   proc_setid           su
USED 2005621147993      2211   2212   proc_setid           su
USED 2005621161490      2211   2212   proc_setid           su
USED 2005621165094      2211   2212   proc_setid           su
USED 2005630560973      2211   2212   proc_exec            su
Starting mysqld daemon with databases from /var/mysql                                  contract_event       
USED 2005679230394      2211   2212   proc_fork            sh
USED 2005750348321      2211   2212   proc_fork            sh
USED 2005751386190      2212   2214   proc_exec            sh
USED 2005756249415      2211   2212   proc_fork            sh
USED 2005757238096      2212   2215   proc_fork            sh
USED 2005758495289      2212   2215   proc_exec            sh
USED 2005761778059      2211   2212   proc_fork            sh
USED 2005762623018      2212   2217   proc_fork            sh
USED 2005763874569      2212   2217   proc_exec            sh
USED 2005767441408      2211   2212   proc_fork            sh
USED 2005768337263      2212   2219   proc_exec            sh
USED 2005772916576      2211   2212   proc_fork            sh
USED 2005773996432      2212   2220   proc_fork            sh
USED 2005775465400      2212   2220   proc_exec            sh
USED 2005778750305      2211   2212   proc_fork            sh
USED 2005779846375      2212   2222   proc_exec            sh
USED 2005782042348      2211   2212   proc_fork            sh
USED 2005783110622      2212   2223   proc_exec            sh
USED 2005785636236      2211   2212   proc_fork            sh
USED 2005786824801      2212   2224   proc_exec            sh
USED 2005788593079      2212   2224   proc_exec            nohup
USED 2005790693138      2212   2224   proc_exec            nohup
USED 2005792812264      2211   2212   proc_fork            sh
USED 2005794010658      2212   2225   proc_exec            sh
USED 2005795756145      2212   2225   proc_exec            nohup
USED 2005797704273      2212   2225   proc_exec            nohup
NEED 2005799674735      2211   2212   file_dac_write       sh
USED 2005800708905      2211   2212   proc_fork            sh
USED 2005801869396      2212   2226   proc_exec            sh
USED 2005804780370      2211   2212   proc_fork            sh
USED 2005805854317      2212   2227   proc_exec            sh
USED 2005807860051      2211   2212   proc_fork            sh
USED 2005808907677      2212   2228   proc_exec            sh
USED 2005811293197      2211   2212   proc_fork            sh
USED 2005812393916      2212   2229   proc_exec            sh
USED 2005814589669      2212   2229   proc_exec            nohup
USED 2005816674186      2212   2229   proc_exec            nohup
STOPPING server from pid file /var/mysql/pandora.pid
070325 22:11:18  mysqld ended


Ignore the proc_taskid and proc_setid; they are artifacts of using su(1M) to run the database server as user mysql. We see that mysqld only needs proc_fork and proc_exec. The file_dac_write failure comes from a call to access(2) and is not needed for proper operation.

What do we do with what we have just learned ?

Referring to the smf_method(5) man page (another excellent read), it seems that all we need to do is add a method_credential option to the various methods (start, stop, and refresh). The appropriate section of my new and improved MySQL manifest now looks like
        <exec_method   type='method' name='start' exec='/etc/sfw/mysql/mysql.server %m'  timeout_seconds='60'>
                <method_context>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
        
        <exec_method   type='method' name='stop' exec='/etc/sfw/mysql/mysql.server %m'  timeout_seconds='120'>
                <method_context>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
        
        <exec_method   type='method' name='refresh' exec='/etc/sfw/mysql/mysql.server restart'  timeout_seconds='120'>
                <method_context>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
   

So we quickly modify our manifest and import it using one of the standard methods (svccfg import, /lib/svc/method/manifest-import, or a reboot) and we should be done, right ? Well...... not exactly - but we're close.
% svcadm enable mysql
% svcs mysql
STATE          STIME    FMRI
maintenance    21:53:37 svc:/application/mysql:default

$ tail -5 `svcprop -p restarter/logfile mysql`
[ Mar 26 21:51:12 Method "stop" exited with status 0 ]
[ Mar 26 21:53:36 Enabled. ]
[ Mar 26 21:53:36 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
svc.startd could not set context for method: chdir: No such file or directory
[ Mar 26 21:53:37 Method "start" exited with status 96 ]

Doh! When we followed the MySQL installation instructions at /etc/sfw/mysql/README.solaris.mysql we created a user account called mysql. But we didn't specify a home directory, did we ? No - so the default template value of /home/mysql was used. But there is no /home/mysql, is there ? Well, no.

How do we fix this ?

Set a reasonable home directory for the mysql user. How about /var/mysql ? Elsewhere in the installation instructions we did set ownership and proper permissions to this directory - so that would seem like a reasonable home directory.

As root
# usermod -d /var/mysql mysql
That is one solution, but it may not be practical for all cases. Perhaps a better idea would be to provide a working directory for each of the methods. The benefit is that I could set it differently for each service instance. This would be done in the method_context tag for the method. So I modify my service manifest to look like
        <exec_method   type='method' name='start' exec='/etc/sfw/mysql/mysql.server %m'  timeout_seconds='60'>
                <method_context working_directory='/var/mysql'>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
        
        <exec_method   type='method' name='stop' exec='/etc/sfw/mysql/mysql.server %m'  timeout_seconds='120'>
                <method_context working_directory='/var/mysql'>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
        
        <exec_method   type='method' name='refresh' exec='/etc/sfw/mysql/mysql.server restart'  timeout_seconds='120'>
                <method_context working_directory='/var/mysql'>
                        <method_credential user='mysql' group='mysql' privileges='proc_fork,proc_exec'  />
                </method_context>
        </exec_method>
Reimport the manifest and let's see how things go.
# svccfg import /var/svc/manifest/application/mysql.xml
# svcadm clear mysql
# svcs mysql
STATE          STIME    FMRI
maintenance    22:17:49 svc:/application/mysql:default

Argh - now what ?
# tail -5 `svcprop -p restarter/logfile mysql`
/sbin/sh: /etc/sfw/mysql/mysql.server: cannot execute
[ Mar 26 22:17:49 Method "start" exited with status 1 ]
[ Mar 26 22:17:49 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
/sbin/sh: /etc/sfw/mysql/mysql.server: cannot execute
[ Mar 26 22:17:49 Method "start" exited with status 1 ]

Doh! Since Solaris delivers MySQL as a legacy service the start script doesn't have execute permissions for the mysql user. That's easy to fix.
# ls -l /etc/sfw/mysql/mysql.server
-rwxr--r--   1 root     sys         5655 Mar 22 17:05 /etc/sfw/mysql/mysql.server
# chown mysql /etc/sfw/mysql/mysql.server
# svcadm clear mysql
# svcs mysql
STATE          STIME    FMRI
online         22:23:08 svc:/application/mysql:default
Now that's more like it. One last item to check.
# ps -ef | grep mysqld | grep -v grep
   mysql 12656 12634   0 22:23:11 ?           0:00 /usr/sfw/sbin/mysqld --basedir=/usr/sfw --datadir=/var/mysql --pid-file=/var/my
   mysql 12634     1   0 22:23:09 ?           0:00 /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var/mysql/pa
   
# ppriv 12634
12634:  /bin/sh /usr/sfw/sbin/mysqld_safe --datadir=/var/mysql --pid-file=/var
flags = 
        E: basic,!file_link_any,!proc_info,!proc_session
        I: basic,!file_link_any,!proc_info,!proc_session
        P: basic,!file_link_any,!proc_info,!proc_session
        L: all

Now that's what I wanted to see. The parent mysqld_safe is now running as user mysql and with exactly the right privileges. This is very cool indeed. Armed with this information we could also create a zone and use the limitpriv attribute to restrict the zone privilege - but we'll leave that for another day.
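The zone-side version of the same trick is a one-line zonecfg setting. A sketch only, assuming a hypothetical zone named dbzone; the privileges you drop should come from your own observation of the workload, not from this example, and the zone must be rebooted for the new limit set to take effect:

```
# zonecfg -z dbzone
zonecfg:dbzone> set limitpriv="default,!proc_info,!proc_session"
zonecfg:dbzone> commit
zonecfg:dbzone> exit
```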

Conclusion

It is quite easy not only to leverage Solaris authorizations but also to run services with restricted privileges. We have presented a few templates and a general approach that should make this process less cumbersome.

More important though - we now have a compelling reply when asked "what was wrong with the RC scripts the way they were?"


Tuesday Mar 20, 2007

Zones in a Flash - Literally

Fantastic improvements have been made in the Solaris installation and upgrade process - even more in OpenSolaris (available in the various community releases). As we examined the cloning feature introduced in Solaris 10 11/06, it became apparent that we have stumbled upon a most intriguing capability. When combining zone cloning with the attach/detach capability we have discovered a model for flashing zones: zoneflash.

In a recent boot camp we took a look at this in more detail. Unfortunately the slides (which will be posted soon) didn't quite follow the level of depth we were exploring. Several people asked for notes on how this works - and here they are. The irony is that it will take longer to read about it than it does to perform the actual process - but it is so cool.

The Promise

We start with a fresh Solaris system. In this case just live upgraded from media, but it could have been jumpstarted from media or a flash archive. The key point here is that the system has had very little done to it, other than naming and some software installation. Since zone attach makes sure that key system components (specifically packages and patches) are compatible, it makes sense to build our flashzones on a system that will look similar to those that will be built in the future.

So how many zones will we build ? That's a good question. If these were system flash archives the answer would be as few as possible - one per architecture in the most efficient case. But these zoneflashes are different - just applications, some metadata, and perhaps some customizations (naming, security, SMF). It seems reasonable to create one zoneflash for each type of application server you would deploy - think of it as a userspace template. In this example I have chosen four: a blank uncustomized flash (for building a new zoneflash in a flash), a database server (MySQL), a web server (apache2), and the community edition of webmin (just another application).

Our procedure will be to build a minimal default zoneflash, run it through first boot to populate the SMF repository, and then clone it for the remaining zoneflashes. Each of these will be booted, customized for the particular application, and tested to make sure everything is operating properly.

We will then detach the zones and move the detached zoneroots onto some media that can be transported. Of course, keeping with the theme of zones and flash, the transport could be the flasharchive itself. How cool would it be to jumpstart a server using flasharchives and have all the application zones already present in a known location, such as /zoneflash ? Unfortunately, I'm sitting in seat 18A on an American Airlines flight to Los Angeles and don't quite have the required infrastructure to do that sort of test. But I do have a USB stick and multiple boot environments. That will do nicely.

Once attached, we will clone the zoneflashes as necessary, adding resources (network, local filesystems) and attributes (resource controls) required for the proper operation of the application. When finished we will detach the zoneflashes so they may be used elsewhere.

The Turn

The first step is to build and boot a simple generic sparse root zone. Since this zone isn't really meant for operation, most zonecfg attributes (network configuration, resource limits, et al) will be skipped. We will add them later when we build the real zones - remember, these are just user space application templates.

# zonecfg -z default
default: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:default> create
zonecfg:default> set zonepath=/local/default
zonecfg:default> add inherit-pkg-dir; set dir=/opt; end
zonecfg:default> commit
zonecfg:default> exit
#
# zoneadm -z default install
A few minutes later we have an installed zone, ready for first boot. Since I've attended my Solaris Zones Best Practices class, or at least read the materials, I know how to build a sysidcfg file that will satisfy the sysidtool first boot service. This will allow the zone to boot up all the way without any additional console interaction. Let's do that for our new zone.
# cat > /local/default/root/etc/sysidcfg <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=xxxxxxxxxx        You supply your own encrypted string from /etc/shadow, I'm not going to post mine!
system_locale=C
terminal=ansi
timezone=US/Central
network_interface=NONE {hostname=default}
EOF
# zoneadm -z default boot
# zlogin -C default 
We need to let first boot processing complete. Since we supplied a valid sysidcfg, it is just a matter of waiting for manifest-import and sysidtool to complete their magic. When complete, log in and take a look around to make sure all is well. Once satisfied, shut down the zone (either from inside the zone or from the global zone) - we are through with it for now.
(from the global zone)
# zoneadm -z default halt
Now we are done with this first zone. Time to clone it for our remaining application zones. Please pardon a bit of inline shell scripting - I hate to type the same thing over and over and over. Sort of makes for a nice script template, doesn't it ? Not quite the sophistication of Brad Diggs' zonemanager, but it will do nicely for our example.

# for zone in webmin mysql web
? do
        echo "create -t default; set zonepath=/local/${zone}" | zonecfg -z ${zone}
        zoneadm -z ${zone} clone default
        echo "name_service=NONE" > /local/${zone}/root/etc/sysidcfg
        echo "nfs4_domain=dynamic" >> /local/${zone}/root/etc/sysidcfg
        echo "security_policy=NONE" >> /local/${zone}/root/etc/sysidcfg
        echo "root_password=xxxxxxxxxxx" >> /local/${zone}/root/etc/sysidcfg
        echo "system_locale=C" >> /local/${zone}/root/etc/sysidcfg
        echo "network_interface=NONE {hostname=${zone}}" >> /local/${zone}/root/etc/sysidcfg
        echo "terminal=ansi" >> /local/${zone}/root/etc/sysidcfg
        echo "timezone=US/Central" >> /local/${zone}/root/etc/sysidcfg
        zoneadm -z ${zone} boot
done
#
What in the heck was that all about ? OK, one more time - line by line with annotation.

# for zone in webmin mysql web
do

A quick interactive loop for the creation of three application zones. The variable ${zone} will be set to the name of the zone we are trying to construct.
echo "create -t default; set zonepath=/local/${zone}" | zonecfg -z ${zone}
A one liner that creates a new zone configuration based on the already existing default. At this point the only thing we need to change is the zonepath, and it should be set to /local/${zone}.
        zoneadm -z ${zone} clone default
We recognize this as a zone cloning operation. The zone root is copied and a /reconfigure file is created in the new zone root so that sysidtool performs a complete configuration on first boot. If you happen to be running on a recent release of OpenSolaris, you can put your zoneroot on ZFS and the cloning operation will only take a few seconds and very little additional disk space will be required. Those of us on Solaris 10 11/06 will have to wait for the 160MB or so to be copied. Still better than the 9 minutes to go through a complete zone installation.
        echo "name_service=NONE" > /local/${zone}/root/etc/sysidcfg
        echo "nfs4_domain=dynamic" >> /local/${zone}/root/etc/sysidcfg
        echo "security_policy=NONE" >> /local/${zone}/root/etc/sysidcfg
        echo "root_password=xxxxxxxxxxx" >> /local/${zone}/root/etc/sysidcfg
        echo "system_locale=C" >> /local/${zone}/root/etc/sysidcfg
        echo "network_interface=NONE {hostname=${zone}}" >> /local/${zone}/root/etc/sysidcfg
        echo "terminal=ansi" >> /local/${zone}/root/etc/sysidcfg
        echo "timezone=US/Central" >> /local/${zone}/root/etc/sysidcfg
This step creates a custom sysidcfg file for each zone. Remember to supply your own root password from /etc/shadow in the global zone. This answers all of the sysidtool questions, including the NFSv4 question.
        zoneadm -z ${zone} boot
Boot the zone. If we have done everything correctly, the next interaction will be with console login.

done
Close the for loop in the interactive script. This process will take a few minutes on Solaris 10 11/06, or if we are being clever with OpenSolaris and ZFS - a few seconds.
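The eight echo lines per zone can be folded into a small helper function. A sketch in plain sh; the field values mirror the sysidcfg above (supply your own encrypted root password), and /tmp/sysidcfg-demo is just a scratch stand-in for a real zoneroot such as /local/${zone}/root:

```shell
#!/bin/sh
# write_sysidcfg ZONEROOT HOSTNAME - generate a sysidcfg under ZONEROOT/etc.
# Assumes ZONEROOT/etc already exists; the password is a placeholder.
write_sysidcfg() {
    zoneroot="$1"; zhostname="$2"
    cat > "${zoneroot}/etc/sysidcfg" <<EOF
name_service=NONE
nfs4_domain=dynamic
security_policy=NONE
root_password=xxxxxxxxxxx
system_locale=C
network_interface=NONE {hostname=${zhostname}}
terminal=ansi
timezone=US/Central
EOF
}

# Example: populate a scratch zoneroot standing in for /local/mysql/root
mkdir -p /tmp/sysidcfg-demo/etc
write_sysidcfg /tmp/sysidcfg-demo mysql
cat /tmp/sysidcfg-demo/etc/sysidcfg
```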

Now for the hard part - customizing the individual application zones. Well, it's not all that difficult. And if you do this regularly, you probably have scripts to do most of the work. It's just individual application installation and customization.

Here is what I did for my example zones.
MySQL
The installation instructions for MySQL on Solaris 10 can be found in /etc/sfw/mysql/README.solaris.mysql. There is a typo in the Solaris 10 version of the README that will cause a lot of grief if you cut and paste without looking at the results. Fortunately it has been corrected in Nevada (aka OpenSolaris Community Edition).

Boot the mysql zone and log in as root.
# /usr/sfw/bin/mysql_install_db
# groupadd mysql
# useradd -g mysql mysql
# chgrp -R mysql /var/mysql
# chmod -R 770 /var/mysql       This line is incorrect in the Solaris 10 README - my chmod works better with two arguments
# installf SUNWmysqlr /var/mysql d 770 root mysql
# cp /usr/sfw/share/mysql/my-medium.cnf /var/mysql/my.cnf
The installation instructions continue by linking the start script into /etc/rc3.d. Since we are big SMF fans in these parts, let's do that instead. Feel free to use my MySQL manifest as it contains a couple of cool features (value and action authorizations - more on that later).
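For the curious, those "cool features" amount to two framework properties in the manifest. A sketch of the conventional form; the authorization names here are assumptions for illustration - use whatever names you have registered in /etc/security/auth_attr:

```
<property_group name='general' type='framework'>
        <!-- lets a non-root user with this authorization enable/disable/restart -->
        <propval name='action_authorization' type='astring'
                 value='solaris.smf.manage.mysql' />
        <!-- lets a non-root user with this authorization change property values -->
        <propval name='value_authorization' type='astring'
                 value='solaris.smf.value.mysql' />
</property_group>
```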

Since the mysql zone doesn't have any networking configured, perform this next step from the global zone. If you already have a suitable manifest, or have stashed mine away somewhere in the global zone you can use that instead.
# cd /local/mysql/root/var/svc/manifest/application
# wget http://blogs.sun.com/bobn/resource/mysql.xml
It's probably a good idea to make sure that all of this is working properly. Either reboot the mysql zone, run the manifest-import service manually, or run svccfg import on the new manifest. Your choice. What you should see upon completion is
# svcs mysql
STATE          STIME    FMRI
online         14:41:19 svc:/application/mysql:default

# /usr/sfw/bin/mysqladmin status
Uptime: 459  Threads: 1  Questions: 2  Slow queries: 0  Opens: 6  Flush tables: 1  Open tables: 0  Queries per second avg: 0.004

We're done for now. Unless of course you want to go for some extra credit. In that case
  1. Set up a web server with PHP support. Apache 1 plus the SFWmphp package from the Solaris Companion will do just fine.
  2. Download and unpack phpMyAdmin in the webserver htdocs directory.
  3. Create a user with the mysql.operator authorization
  4. Create a user with the mysql.administrator authorization

Shut down the mysql zone.
Web
This is about as easy as it gets. Boot the web zone and perform the following steps.
# cp /etc/apache2/httpd.conf-example /etc/apache2/httpd.conf
# svcadm enable apache2

A quick check to make sure all is well.
# svcs apache2
STATE          STIME    FMRI
online         17:17:41 svc:/network/http:apache2


# telnet localhost 80
Trying ::1...
telnet: connect to address ::1: Network is unreachable
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>501 Method Not Implemented</title>

Connection to localhost closed by foreign host.
We're done for now. Shut down the web zone.
Webmin
This one is a little more complicated. We did this one last time in the zone cloning, but it is worth a second look.

Our task here is to replace the Solaris webmin with the latest download from http://webmin.com. The technique we are using will allow us to install a custom version of an application into a sparse root zone. Specifically, webmin.com's package installs into /opt/webmin, but /opt is a read-only inherit-pkg-dir. The easiest solution for this would be the creation of a symbolic link in the global zone /opt pointing to a location that can be safely written by each non-global zone. In my example that would be /local-pkgs.

In the global zone, create the link in /opt, create the local package directory in the webmin zoneroot, and download the latest webmin package.
# ln -s ../local-pkgs/webmin /opt/webmin
# mkdir -p /local/webmin/root/local-pkgs/webmin
# cd /local/webmin/root/var/tmp
# wget http://prdownloads.sourceforge.net/webadmin/webmin-1.330.pkg.gz
# gunzip webmin-1.330.pkg.gz

Now boot the webmin zone and log in as root.
# zoneadm -z webmin boot
# zlogin webmin
Remove the Solaris webmin packages (SUNWwebminu SUNWwebminr). The usr package needs to be removed twice - the first pkgrm will leave it as a partially installed package, the second will completely remove it - at least as far as our zone (and future patching) is concerned. Once removed, install the webmin.com version, which should be conveniently located in /var/tmp.
# pkgrm SUNWwebminu SUNWwebminr SUNWwebminu
# pkgadd -d /var/tmp/webmin-1.330.pkg
We are done with this zone. Shut it down.
Detach
We have just built four zones: an empty zone suitable for future customizations, one with the Solaris webmin replaced by the community edition, one with a working MySQL database, and one with a webserver. The last task to be performed on these zones in their current state is to detach them, using another new feature in Solaris 10 11/06. Zone detach copies the zone configuration into the zoneroot (to be used with a subsequent zone attach) and sets the current zone state to configured. You can even delete the zone configurations as a final cleanup prior to building a flash archive.
# zoneadm -z default detach
# zoneadm -z webmin detach
# zoneadm -z mysql detach
# zoneadm -z web detach
# zonecfg -z default delete -F
# zonecfg -z webmin delete -F
# zonecfg -z mysql delete -F
# zonecfg -z web delete -F

And flash
Unless the person in 18B wants to be a jumpstart server, we will have to simulate jumpstart/flasharchive process. We can do this by booting into an alternate boot environment and then delivering the detached zoneroots by some sort of shared or removable storage - something like a USB memory stick. When we are done with this exercise, our zoneflashes will still be on the memory device, ready for their next use. Since the zones will never be booted, just cloned, the speed of the memory device really isn't important.

We need to prepare the USB memory stick (currently formatted as FAT16). We will use rmformat -l to locate the device, fdisk to put a proper label on it, and finally newfs to install a proper file system. ZFS would be interesting, but it would just get in our way later.
# rmformat -l
Looking for devices...
     1. Logical Node: /dev/rdsk/c2t0d0p0
        Physical Node: /pci@0,0/pci1179,1@1d,7/storage@4/disk@0,0
        Connected Device:          USB DISK 2.0     PMAP
        Device Type: Removable
        Bus: USB
        Size: 984.0 MB
        Label: 
        Access permissions: 
     2. Logical Node: /dev/rdsk/c1t0d0p0
        Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
        Connected Device: TEAC     DW-224E-A        7.2A
        Device Type: CD Reader
        Bus: IDE
        Size: 
        Label: 
        Access permissions: 

# fdisk /dev/rdsk/c2t0d0p0
3 (to delete the existing partition)
1 (to create a new Solaris partition)
5 (to exit and write the new label)

# newfs /dev/rdsk/c2t0d0s2
newfs: construct a new file system /dev/rdsk/c2t0d0s2: (y/n)? y
/dev/rdsk/c2t0d0s2:     2009088 sectors in 981 cylinders of 64 tracks, 32 sectors
        981.0MB in 62 cyl groups (16 c/g, 16.00MB/g, 7680 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
 32, 32832, 65632, 98432, 131232, 164032, 196832, 229632, 262432, 295232,
 1705632, 1738432, 1771232, 1804032, 1836832, 1869632, 1902432, 1935232,
 1968032, 2000832
 
# mkdir /tmp/flash
# mount /dev/dsk/c2t0d0s2 /tmp/flash
# cd /local
# find default webmin web mysql -print | cpio -pdum /tmp/flash
# umount /tmp/flash
We are now done with the original system. At this point we would create a flasharchive (with the detached zoneroots in a convenient place in the archive).
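Before declaring victory it is cheap to verify that the copy landed intact. A small sketch that wraps the copy with a file-count check; tar is used here as a portable stand-in for the cpio pass above, and the scratch directories are placeholders for /local and /tmp/flash:

```shell
#!/bin/sh
# copy_tree SRC DST - replicate SRC into DST and verify the entry count.
copy_tree() {
    src="$1"; dst="$2"
    ( cd "$src" && tar cf - . ) | ( cd "$dst" && tar xf - )
    before=$( cd "$src" && find . | wc -l )
    after=$( cd "$dst" && find . | wc -l )
    [ "$before" -eq "$after" ] || { echo "copy mismatch: $before vs $after" >&2; return 1; }
    echo "copied $before entries"
}

# Example with scratch directories standing in for the zoneroots and the stick
mkdir -p /tmp/ztree/default/root/etc
mkdir -p /tmp/zflash
echo hello > /tmp/ztree/default/root/etc/sysidcfg
copy_tree /tmp/ztree /tmp/zflash
```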

The Prestige

The final act in our magic trick is the delivery. Specifically the transport, reattachment, and subsequent cloning of the zoneflashes on a new system. 18B is now asleep and I really don't want to disturb him, so I'll do this part myself. I'll boot my laptop into another boot environment - built from the same media using the same Live Upgrade method as the boot environment that created the zones.

We begin by mounting the removable media (USB memory stick) that contains the zoneflash. Do take a look around, it is quite likely that our friend volfs has already done this for us. Remember - if we were using a flasharchive to deliver the zoneflash this step would be unnecessary.
# mkdir /flash
# mount /dev/dsk/c2t0d0s2 /flash        (we used rmformat -l to derive the device name)
Now that our zoneflashes have arrived, time to reattach them. The first step is to create zone configurations. If you recall, these were stored in the zoneroot when they were detached. The zonecfg command create -a is used to retrieve the stored configuration information and adapt it to the new system - specifically the new location of the zoneroot. Once configured we use zoneadm attach to reconnect them.

The sequence to reattach our default zone, now called flashdefault, would look something like this.
# zonecfg -z flashdefault
flashdefault: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:flashdefault> create -a /flash/default
zonecfg:flashdefault> commit
zonecfg:flashdefault> exit
# zoneadm -z flashdefault attach
We'll be a little more clever attaching the other three zones.
# for zone in webmin web mysql
  do
      echo "create -a /flash/${zone}" | zonecfg -z flash${zone}
      zoneadm -z flash${zone} attach
  done
At this point our zoneroots are still on the USB memory device - but don't worry, these zones will never be booted. Their only purpose is to deliver preconfigured zones. We will use zone cloning to create our real application zones.

Which we will now do. It is very convenient to use the flashzone as a template for our new zone in case there were some special attributes like limitpriv that we might want to preserve. We will also need to add items that were not present in the zoneflashes - specifically networking and local file systems. Once we are satisfied with the zone configurations we will clone the zoneflash. If we are only building one of each type of zone we can detach the zoneflash so that other administrators can use it on their systems.

Let's do this for the mysql zone.
# zonecfg -z mysql
mysql: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:mysql> create -t flashmysql
zonecfg:mysql> set zonepath=/zones/mysql
zonecfg:mysql> add net; set physical=e1000g0; set address=192.168.100.102/24; end
zonecfg:mysql> add fs; set dir=/export; set special=/export; set options=[rw,nosuid,nodevices]; set type=lofs; end
zonecfg:mysql> commit
zonecfg:mysql> exit

# zoneadm -z mysql clone flashmysql
Copying /flash/mysql...

# zoneadm -z flashmysql detach

# echo "name_service=NONE" >    /zones/mysql/root/etc/sysidcfg
# echo "nfs4_domain=dynamic" >> /zones/mysql/root/etc/sysidcfg
# echo "security_policy=NONE" >> /zones/mysql/root/etc/sysidcfg
# echo "root_password=xxxxxxxxxxx" >> /zones/mysql/root/etc/sysidcfg
# echo "system_locale=C" >> /zones/mysql/root/etc/sysidcfg
# echo "network_interface=NONE {hostname=mysql}" >> /zones/mysql/root/etc/sysidcfg
# echo "terminal=ansi" >> /zones/mysql/root/etc/sysidcfg
# echo "timezone=US/Central" >> /zones/mysql/root/etc/sysidcfg

And for the finale - boot the newly flashed mysql zone and you should see an enabled and operating mysql service.
# zoneadm -z mysql boot
# zlogin -C mysql
[Connected to zone 'mysql' console]
Hostname: mysql
Creating new rsa public/private host key pair                           
Creating new dsa public/private host key pair
Mar 20 06:15:44 mysql sendmail[1719]: My unqualified host name (mysql) unknown; sleeping for retry
Mar 20 06:15:44 mysql sendmail[1722]: My unqualified host name (mysql) unknown; sleeping for retry

mysql console login: root
Password: 
Last login: Mon Mar 19 17:10:10 on console
Mar 20 06:15:49 mysql login: ROOT LOGIN /dev/console
Sun Microsystems Inc.   SunOS 5.11      snv_57  October 2007
# 
# svcs mysql
STATE          STIME    FMRI
online          6:31:28 svc:/application/mysql:default
# /usr/sfw/bin/mysqladmin status
Uptime: 8  Threads: 1  Questions: 1  Slow queries: 0  Opens: 6  Flush tables: 1  Open tables: 0  Queries per second avg: 0.125

How cool is that ? Not only did we clone the zone, but since the database is in /var, it was cloned as well. Perhaps not practical for every situation, but still pretty cool.

I will leave the flashing of default, web, and webmin as an exercise to the reader. Follow the sequence we used for the mysql zone and you should have four working zones, built from a flash like mechanism that can be delivered via removable media, flasharchive, or shared storage.

Next time we'll take a closer look at MySQL and explore running it as a less privileged user. We'll also look at the action and value authorizations.


Friday Feb 16, 2007

Cloning Isn't Just for Sheep Any More



While it may not have the social implications nor headline appeal of the now famous Dolly the Sheep, the zone cloning feature introduced with Solaris 10 11/06 is worth further investigation. Before we do that, it is probably a good idea to review basic zone creation and installation prior to the new cloning capability.

Building Zones the Old Fashioned Way

The first step in the creation of a zone is establishing its configuration. This is done by conversing with our friend, zonecfg(1M), who handles all the details of writing the configuration xml file in /etc/zones and updating the zones index file /etc/zones/index.
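The bookkeeping zonecfg does behind your back is nothing exotic; /etc/zones/index is a colon-delimited flat file with one line per zone. Roughly what it contains (values illustrative; recent releases append a UUID as an additional field):

```
# zone name : current state : zone root path
global:installed:/
zone1:configured:/zones/zone1
```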

Such a conversation might go something like....
# zonecfg -z zone1
zone1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone1> create
zonecfg:zone1> set zonepath=/zones/zone1
zonecfg:zone1> add inherit-pkg-dir; set dir=/opt; end
zonecfg:zone1> add net; set physical=iprb0; set address=192.168.100.101/24; end
zonecfg:zone1> add fs; set dir=/export; set special=/export; set options=[rw,nosuid]; set type=lofs; end
zonecfg:zone1> verify
zonecfg:zone1> commit
zonecfg:zone1> exit
#


If you grok zones then you recognize this as a typical sparse root zone. If you have attended one of my zones best practices workshops then you will also notice that I'm following my own advice and making /opt an inherited package directory.

A quick check to make sure all is well.
# zoneadm list -cv
  ID NAME             STATUS         PATH                           BRAND     
   0 global           running        /                              native    
   - zone1            configured     /zones/zone1                   native    


All is as it should be (which is always the case for a how-to example).

The next step is a rather magical affair where the zoneroot is populated. This process is initiated by uttering the following sequence
# zoneadm -z zone1 install
Once spoken, fantastic things start happening behind the scenes - all of them by our good friend Live Upgrade. The actual sequence of events is something like
  1. Create the new zoneroot if it doesn't already exist. If it does exist make sure the permissions are set to 700 and it is owned by [0,0].

  2. Mount all of the inherit-pkg-dir and file systems listed in the zone configuration file.

  3. Create a candidate list of files for the new zoneroot by looking at the global zone contents file /var/sadm/install/contents.

    On my laptop daily driver, this totals approximately 2 million files.

  4. Pick from this list all files that should be delivered to the new zone root by removing all files from packages that are marked as global zone only (SUNW_PKG_THISZONE is set to true)

    We're still over 2 million files, folks!

  5. From the remaining list of files, remove all of those that will be delivered via inherit-pkg-dir directories.

    This is why I like inherit-pkg-dir. We are now down to about 2,300 files. If not for inherit-pkg-dir I would be hitting my boss up for a lot more storage.

  6. Copy all of these files from the global zone into the new zoneroot, replacing commonly edited configuration files with those that were originally delivered with the package (ie /etc/passwd).

  7. Once the files are in place there is one more step to perform. Some of the packages have preinstall and postinstall scripts that might do something important. These need to be run, even if all of the files are delivered via inherited directories. So in package dependency order, all of the packages identified as applicable to the new zone (SUNW_PKG_THISZONE=false) are installed sequentially.

  8. Update the zones index file /etc/zones/index marking the new zone as installed.

  9. Unmount all of the file systems mounted in step 2.
And we are done with the first part. The amount of time this takes can be estimated as O(sparseness, number of packages, disk speed). To speed up this process I would have to increase the degree of sparseness, which is pretty hard to do once /opt has been added. I could also decrease the number of packages in the global zone - this has some interesting possibilities. I could also get faster disks, but that isn't always practical, especially with a small server configuration or a home system. I may be talking myself into a minimal global zone installation with full root zones - but that sounds like a topic for another day.
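The winnowing in steps 3-5 amounts to set subtraction over the global zone contents file. A rough sketch of the idea against a tiny mock contents(4) file - the package names, paths, and the THISZONE list here are all made up for illustration:

```shell
# Mock contents(4) file: pathname, type, class, mode, owner, group,
# ... and the owning package as the last field
cat > /tmp/contents.mock <<'EOF'
/etc/passwd f none 0644 root sys 1234 5678 1 SUNWcsr
/usr/bin/ls f none 0755 root bin 2345 6789 1 SUNWcsu
/opt/tool/bin/run f none 0755 root bin 3456 7890 1 HYPOtool
EOF

# Pretend HYPOtool is marked SUNW_PKG_THISZONE=true (step 4) and /usr is
# an inherit-pkg-dir (step 5); what's left is what gets copied (step 6)
result=$(awk -v skip="HYPOtool" '
    BEGIN { n = split(skip, a, " "); for (i = 1; i <= n; i++) drop[a[i]] = 1 }
    !($NF in drop) && $1 !~ /^\/usr\// { print $1 }
' /tmp/contents.mock)
echo "$result"
```

On a real system the input would be /var/sadm/install/contents and the skip list would come from the package metadata, but the filtering logic is this simple.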

Enough of the theory, how long did this really take?

On a relatively clean Nevada (aka OpenSolaris Community Edition) install it was almost 10 minutes. The output is below and I have annotated it with the installation steps outlined above.
# time zoneadm -z zone1 install
[1] [2] Preparing to install zone <zone1>.
[3] [4] [5] Creating list of files to copy from the global zone.
[6] Copying <1934> files to the zone.
[7] Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1290> packages on the zone.
Initialized <1290> packages on zone.                                 
[8] [9] Zone <zone1> is initialized.
Installation of <1> packages was skipped.
Installation of these packages generated warnings: 
The file  contains a log of the zone installation.

real    9m38.951s
user    1m26.582s
sys     2m51.252s

But we're still not done, are we? We still have first boot processing which includes initial population of the SMF repository (which is O(number of services, speed of disks)) and system identification (which is either constant if a sysidconfig file is supplied or O(Bob's increasingly bad typing rate) if we choose an interactive dialog).

For this example the first boot process took about 3 minutes to complete.

We now have a pristine zone, ready for work. But there is more to do, isn't there? We have to install some software, or at least configure software that is already present. In fact, these customizations might be more complicated than the zone installation process. If I had invested in developing automation scripts or was using some form of advanced provisioning technology this might not be a big deal. If I'm doing this manually then it may be quite a bit of work - and work that I don't want to repeat with regularity. In other words: I'm not likely to use lots of zones and I don't particularly look forward to OS updates.

Let's look at this a bit more and see if we can make this any easier.

This example comes from my (about to be posted) Zones workshop. In our new non-global zone we will replace the Solaris version of Webmin with the community release from webmin.com.

A quick pkgchk(1M) of SUNWwebminu shows that its contents are in /usr/sfw and SUNWwebminr deposits its payload in /etc/webmin and an SMF manifest in /var/svc/manifest/application/management. Performing the same task on the community edition of Webmin shows that it will install in /etc/webmin and /opt/webmin. The clashing of /etc/webmin indicates that these cannot easily co-exist, but complete replacement is possible (all inherit-pkg-dir destinations are contained in a single directory).
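That kind of survey reduces to a one-line filter over the pkgchk output. A sketch against mock output - the pathnames below are illustrative, not the package's real inventory:

```shell
# Mock of the lines of interest from `pkgchk -l SUNWwebminu` output;
# real output carries many more fields per entry
cat > /tmp/pkgchk.mock <<'EOF'
Pathname: /usr/sfw/lib/webmin/miniserv.pl
Type: regular file
Pathname: /usr/sfw/bin/webmin
Type: editable file
EOF

# Pull out just the delivery paths to see where the package lands
paths=$(awk '$1 == "Pathname:" { print $2 }' /tmp/pkgchk.mock)
echo "$paths"
```

Run against each package in turn, this makes the /usr/sfw versus /etc/webmin split obvious at a glance.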

So begin by removing the Solaris version of webmin. This is all done in our new zone.
# zonename
zone1

# pkgrm SUNWwebminr SUNWwebminu



pkgrm: ERROR: unable to remove 
/usr/sfw/bin 
/usr/sfw 
/usr 
## Updating system information.

Removal of <SUNWwebminu> partially failed.

At this point the root package SUNWwebminr is completely gone and SUNWwebminu is marked as partially installed. One more pkgrm(1M) and it will be gone, at least as far as our package contents are concerned. The bits in /usr/sfw are still there, but without the configuration files in /etc/webmin, they are just that, bits in a directory.
# pkgrm SUNWwebminu

The following package is currently installed:
   SUNWwebminu  Webmin - Web-Based System Administration (usr)
                (i386) 11.11.0,REV=2007.01.23.02.15

Do you want to remove this package? [y,n,?,q] y

## Removing installed package instance <SUNWwebminu>
(A previous attempt may have been unsuccessful.)
## Verifying package <SUNWwebminu> dependencies in global zone
## Processing package information.
## Removing pathnames in class <none>
## Updating system information.

Removal of <SUNWwebminu> was successful.

Now to install the new webmin package. While you weren't looking I put the package in /var/tmp. But there are some things to do before we can proceed. Remember, the package wants to write into /opt/webmin, but /opt is read-only. We can do a couple of things: mount a writable file system (LOFS, local real disk or NFS) onto /opt/webmin in our new zoneroot, or create a symbolic link for /opt/webmin that points somewhere writable. The link is much less confusing, so let's go that route this time.
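For the record, had we chosen the mount, the lofs variant would be expressed in the zone configuration roughly like this - the backing directory /export/zone1/webmin is an assumption, any writable global-zone path would do:

```
zonecfg:zone1> add fs
zonecfg:zone1> set dir=/opt/webmin
zonecfg:zone1> set special=/export/zone1/webmin
zonecfg:zone1> set type=lofs
zonecfg:zone1> add options rw
zonecfg:zone1> end
zonecfg:zone1> commit
```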

In the global zone do something like
# ln -s /local/webmin /opt/webmin
# mkdir -p /zones/zone1/root/local/webmin
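Why does this work? The absolute symlink target is resolved against the zone's own root once you are inside the zone. A quick sketch with a scratch directory standing in for the zoneroot (the paths below are throwaway, not the real zone):

```shell
# Scratch directory standing in for the zoneroot /zones/zone1/root
ZONEROOT=$(mktemp -d)
mkdir -p "$ZONEROOT/opt" "$ZONEROOT/local/webmin"

# The link lives under the shared /opt (created once, from the global zone)...
ln -s /local/webmin "$ZONEROOT/opt/webmin"

# ...but its absolute target resolves against the zone's own root, so from
# inside the zone /opt/webmin lands in the zone's writable /local/webmin
target=$(readlink "$ZONEROOT/opt/webmin")
echo "$target"
```

The same indirection is what lets a single link in the inherited /opt give every zone its own private, writable /local/webmin.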
Now we are ready to proceed. In zone zone1, do the following
# pkgadd -d /var/tmp/webmin-1.320.pkg

The following packages are available:
  1  WSwebmin     Webmin - Web-based system administration
                  (all) 1.320

Select package(s) you wish to process (or 'all' to process
all packages). (default: all) [?,??,q]: 



Webmin has been installed and started successfully. Use your web
browser to go to

  http://zone1:10000/

and login with the name and password you entered previously.


Installation of <WSwebmin> was successful.
Now we have a nicely customized non-global zone with one application ready to go. It wasn't all that bad, but there were a few manual steps. Multiply this by 20 or so for all of the other applications and configuration steps that you need to do for your system standards and then by 20 or so for the number of zones you want to provision and it is suddenly looking like a tremendous amount of work.

Until Solaris 10 11/06.

Send in the Clones: Solaris Zone Cloning

Zone cloning is a new feature that bypasses all of the steps in the zone installation process and replaces them by copying the source zoneroot and performing a sys-unconfig(1M). Of course this makes perfect sense - if you duplicate the installation process you should get the exact same results (a wise science teacher taught me that a long time ago). So why not shortcut the process: copy the zoneroot, run sys-unconfig(1M), fix up the zones index file, and you are done.

But it gets better than that. If we are copying the zone root then any customization performed on that zoneroot will be preserved. This includes the SMF repository. Not only do we skip the initial import, we also preserve any customizations, such as service related security hardening. Our new cloned zone would also have the community edition of Webmin instead of the one in Solaris. And it's configured, enabled, and will start automatically when the new zone boots - without requiring me to do anything else. Now that's cool.

Let's see how all this works.

Step 1 - create a new zone configuration using our clone source as a template. We need to change the zoneroot and IP address. In more complex configurations, other attributes might need to be changed, but for this simple example this is all that is required.
# zonecfg -z zone2
zone2: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone2> create -t zone1
zonecfg:zone2> set zonepath=/zones/zone2
zonecfg:zone2> select net address=192.168.100.101; set address=192.168.100.102/24; end
zonecfg:zone2> verify
zonecfg:zone2> commit
zonecfg:zone2> exit
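As an aside, the same configuration can be produced non-interactively by feeding zonecfg(1M) a command file with -f - handy when stamping out several of these. The commands mirror the session above:

```
create -t zone1
set zonepath=/zones/zone2
select net address=192.168.100.101
set address=192.168.100.102/24
end
verify
commit
```

Saved as zone2.cfg, this would be applied with zonecfg -z zone2 -f zone2.cfg.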
Instead of installing a new zone, let's clone from zone1.
# time zoneadm -z zone2 clone zone1
WARNING: read-write lofs file system on '/export' is configured in both zones.
Copying /zones/zone1...

real    0m31.135s
user    0m0.431s
sys     0m3.818s

# zoneadm -z zone2 boot
# zlogin -C zone2  (or supply a sysidconfig file)
Now we're getting somewhere. Zone creation, including application configuration and setup, is reduced from about 15 minutes down to 31 seconds. This is getting really cool.
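If you would rather skip the zlogin -C dialog entirely, a sysidconfig file along these lines does the trick. All values here are placeholders (see sysidcfg(4) for the full keyword list; nfs4_domain is the Solaris 10 newcomer), and the file goes into the zonepath under root/etc/sysidcfg before first boot:

```
system_locale=C
terminal=xterm
timezone=US/Central
network_interface=primary { hostname=zone2 }
name_service=NONE
security_policy=NONE
nfs4_domain=dynamic
root_password=<encrypted-hash-from-/etc/shadow>
```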

Clones to the left of me, zpools to the right

But wait, there's more! There's an opportunity to make this even more efficient by taking advantage of ZFS clones. Note that this is only available in OpenSolaris at present, but consider the implications of the following example.

Note the use of zone relocation (move) - also a nifty new feature in Solaris 10 11/06.
# zpool create zfs_zones c4t0d0s2
# zoneadm -z zone1 move /zfs_zones/zone1
A ZFS file system has been created for this zone.
Moving across file systems; copying zonepath /zones/zone1...
Cleaning up zonepath /zones/zone1...

# zonecfg -z zone3
zone3: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone3> create -t zone1
zonecfg:zone3> set zonepath=/zfs_zones/zone3
zonecfg:zone3> select net address=192.168.100.101; set address=192.168.100.103/24; end
zonecfg:zone3> verify
zonecfg:zone3> commit
zonecfg:zone3> exit

# time zoneadm -z zone3 clone zone1
WARNING: read-write lofs file system on '/export' is configured in both zones.
Cloning snapshot local/zone1@SUNWzone3
Instead of copying, a ZFS clone has been created for this zone.

real    0m11.402s
user    0m0.380s
sys     0m0.412s

Wow! In under 12 seconds a completely configured and ready to run zone is built. Throw in a sysidconfig file and we're ready to run. And by using a ZFS clone, almost no additional disk space was required for this new zone.
# df -k | grep zfs_zones
zfs_zones            1007616      27  808469     1%    /zfs_zones
zfs_zones/zone1      1007616  198590  808469    20%    /zfs_zones/zone1
zfs_zones/zone3      1007616  198592  808469    20%    /zfs_zones/zone3

1GB - 200MB - 200MB should leave about 600MB available, but df shows nearly 800MB free. Since the zones are nearly identical at this point, only about 200MB total is consumed from the zpool.

Practical applications of Zone Cloning

Development environments and testbeds seem a very good fit. Build one standard configuration of a zone and clone it as necessary for each developer or test scenario. If things go wrong, which can happen while testing, just delete the zone and re-clone it. 30 seconds later you are back in business.

Shhhh - don't tell anyone, but I like the privilege restrictions of zones. I'm very likely to give a developer the root password to the zone and let them do what they need to do. The worst they can do is destroy their environment. The impact to me is two zoneadm(1M) invocations and about 30 seconds of clock time.

The better use case comes when you combine this with another new feature in Solaris 10 11/06: zone migration. Imagine the following scenario.
  1. Mount a file system containing a company standard non-global zoneroot
  2. Attach the zone to the system (zonecfg create -a and zoneadm attach)
  3. Clone this new zone as many times as needed
  4. Detach the original zone from the server (zoneadm detach)
  5. Unmount the detached zoneroot filesystem
This sounds a lot like jumpstart and flasharchives, doesn't it? You bet it does, and it has many of the same benefits. The flashzone (I'm making up this phrase) can be delivered via USB stick, NFS file services, network file copy (scp), or embedded in a server flasharchive. The possibilities are very intriguing.
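The five steps above lend themselves to a small wrapper script. This sketch only echoes the commands it would run - a dry run for review; the master zone name, paths, and NFS source are all my assumptions, and each clone would still need its own zonecfg first, as in the zone2 example earlier:

```shell
# Dry-run sketch of the five-step "flashzone" deployment.
# Swap the echo in run() for "$@" to actually execute the commands.
run() { echo "+ $*"; }

MASTER=goldzone
MASTER_ROOT=/flash/goldzone

run mount -F nfs server:/export/flashzones /flash   # 1. mount the zoneroot
run zonecfg -z "$MASTER" create -a "$MASTER_ROOT"   # 2. register the config...
run zoneadm -z "$MASTER" attach                     #    ...and attach the zone
for clone in web1 web2 web3; do                     # 3. clone as many as needed
    run zoneadm -z "$clone" clone "$MASTER"
done
run zoneadm -z "$MASTER" detach                     # 4. detach the master
run umount /flash                                   # 5. unmount the source
```

Reviewing the echoed transcript before flipping the switch is cheap insurance when the script will be touching production zone configurations.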

I hope that this has helped introduce you to a few new zones features with Solaris 10 11/06 (and one in OpenSolaris). As I ponder the combination of these new features I find myself beginning to think that a minimal global zone and cloned full root zones may in fact be a superior practice. We'll explore that in more detail soon.


Thursday May 25, 2006

What's in a name? that which we call a zone

What's in a name? that which we call a zone
By any other name would virtualize as complete;


One of the most common questions raised during boot camps and other Solaris briefings deals with the subject of containers and zones. There seems to be some confusion as the terms appear to be used interchangeably. Yes, they are related - specifically a zone is a new type of container introduced in Solaris 10, but containers have their origins much earlier.

The 1913 Webster dictionary defines a Container as
    Container \Con*tain"er\, n.
         1. One who, or that which, contains
which provides the foundation of the Solaris container. Quite simply, a Solaris container is any method by which the resources of an application can be controlled (contained). I suppose the origins of the container could date back to the earliest days of Solaris 2 with the introduction of the processor_bind(2) system call and the pbind(1M) administrative command. These controls were somewhat cumbersome for all but specific workloads and a bit primitive to be called a container.

The container became a recognizable entity with the introduction of the Fair Share Scheduler (FSS) in the Solaris 2.6 timeframe. We had a new scheduler class and a relatively easy to use framework to label and control resource usage for complex applications. So we had a container (project), but it was an unbundled product - so not quite a Solaris container.

When did Solaris get a container? When the Solaris Resource Manager (SRM) became bundled in Solaris 9. Every instance of Solaris had the capability to control resource usage of nearly every application. Why didn't we call it a container in Solaris 9? We only had one type of container (a project), so it wasn't really necessary to give it two different labels. With the introduction of Solaris 10, we have a new type of container, the Solaris zone.

Solaris zones are a virtualization technology that adds a security barrier around each user space instance. We now have two orthogonal application controls: security and resource limits. The name containers was introduced to describe both of these technologies.

So is a zone a container? Absolutely. As are Solaris Resource Management projects and resource pools. And container technologies can be combined to provide several dimensions of application controls (virtualized user space object, resource caps, resource guarantees). Perhaps there will be other types of containers in the future, but for the moment we have three very interesting technologies that can all wear the name container.


Monday May 22, 2006

To zone, or not to zone

To zone, or not to zone: that is the question:
Whether 'tis nobler in the mind of the administrator to suffer
The slings and arrows of outrageous utilization,
Or to take arms against a sea of application consolidations


One of the most interesting (and often hotly debated) questions raised while planning the adoption of Solaris 10 is when to deploy applications in zones. You can almost hear Howie Mandel asking: zone, or no zone? Some early adopters of Solaris 10 didn't include zones in their Standard Operating Environment (SOE) certifications, preferring to consider their use later after the new OS environments had been deployed and their comfort level with Solaris 10 improved. There is wisdom in this approach, but perhaps the time is right to reconsider this question.

As with any new technology there are trade-offs that should be considered before committing to a course of action. In the case of Solaris Zones, the considerations aren't quite as complicated as they may seem - in fact they can be reduced to the following questions
  1. Am I upgrading on existing hardware or installing on new hardware?

    This is the most important question, for several reasons. If you are going to upgrade to Solaris 10 from a previous release and not change the hardware then the most efficient method is to use Live Upgrade. Create a new boot environment, install Solaris 10 in the new set of disk slices, and let Live Upgrade manage all of the details of the upgrade (users, file systems, network settings, etc). The upgrade can occur while the applications are running in the current environment, so there is little impact. The previous Solaris environment can be quickly restored if problems are discovered in the new Solaris 10 installation, so the level of risk is minimized.

    At present, Live Upgrade is not supported on a system with local zones, but if you are coming from Solaris 8 or 9 you won't have local zones, so this restriction is rather moot. Conversely, if you are installing on new hardware then you won't be using Live Upgrade, at least not initially.

    So if you are upgrading on existing hardware then don't deploy zones initially. Perform the upgrade (using Live Upgrade) and once the new environment has settled down, start planning the migration of the existing applications into a zone, at a time that is convenient.

  2. Can the application run correctly in a local zone?

    The first question considered the most efficient approach, but we still must consider the feasibility of running applications in zones. And there are a few considerations.

    Nonglobal zones have a reduced set of privileges that may cause some applications to fail. An example would be something like a DHCP server that requires raw IP access to communicate with systems that don't have IP addresses. Since this privilege doesn't exist in a local zone (at least until we get configurable privileges and per-zone IP stacks) this type of application will not work in a local zone.

    Some applications that don't appear to work with nonglobal zones may work with a little bit of creativity. An example would be the NFS server - it does not work in a nonglobal zone. But that doesn't mean that you can't share data from a nonglobal zone, you just have to use the NFS server in the global zone. Use a writable loopback filesystem between the global and nonglobal zone and share the directory using the NFS server in the global zone. Users in the nonglobal zone can modify and share data, just as if the NFS server were running locally. Another example would be a backup client. It may be unnecessary to run a backup client in a nonglobal zone since all files are visible from the global zone. This can also be true for performance data collectors, and is actually an interesting design goal for intrusion detection.

    And that's really about it. If the application can run in a nonglobal zone and it's convenient to do so, why not? Let's hear the arguments in the case of the single nonglobal zone.
  • Prosecution: You can't JumpStart a server with nonglobal zones.

  • Defense: It is true that the JumpStart installation environment doesn't have the services required to build and manage zones. But that doesn't prevent you from developing a simple first boot service to create an initial set of nonglobal zones. Leveraging SMF dependencies and service properties, you can leave a nice log file behind to record what was done during zone creation (artifacts that the scripts indeed ran as expected).

  • Prosecution: It takes longer to patch a system with nonglobal zones.

  • Defense: That is true, but isn't it really a question of degree (ie, this is a civil case, not criminal)? With one nonglobal zone, the additional time required to patch the system is very small - and you can easily make the argument that if your maintenance window isn't large enough to support one nonglobal zone then it is really too small for even a global zone only installation.

    But wait, there's more. With a nonglobal zone, it may be possible to have different patch levels or versions of applications. Zones are a user space abstraction so there's only one kernel (and devices), but it is possible to have different versions of nearly all of the user space components. Most applications are insulated from the kernel by libraries (such as libc), so this capability extends to applications as well as the basic OS components. The Branded Zones project in OpenSolaris extends this abstraction so that the user space components don't even have to be Solaris.

  • Prosecution: Zones are more complicated to configure and administer.
  • Defense: This simply isn't the case. Spend some time with the Zones in a Day workshop and you will see how to script the creation of a nonglobal zone. You will also notice that the zone configuration contains a small subset of all of the platform configuration elements - the nonglobal zone doesn't even contain its own IP address. Details such as IP multi-pathing (IPMP) or IP Quality of Service (IPQoS) are inherited from the global zone and only need to be configured once on the system. It is certainly less effort to administer two nonglobal zones than two separate servers, and even in the one nonglobal zone case, it's about break even.

    From the viewpoint of an application, provisioning managers, such as Sun N1 Service Provisioning System, handle most of the platform details. Even for this form of automation, nonglobal zones represent the most efficient framework for provisioning applications. Reading the design documents for zone cloning and migration show that this will become even more efficient.

    It's time for the defense to present their case.

  • Defense: Once you have one nonglobal zone, it's easy to add a second zone.
    The prosecution doesn't have much of a rebuttal. To consolidate future applications, existing applications deployed in the global zone would have to be migrated to a nonglobal zone, which requires a significant additional effort. If the applications are in production, this migration would be challenging and quite disruptive.

  • Defense: All resource usage in a nonglobal zone can be measured.
    Again, the prosecution stays silent. A privileged (root) user can circumvent project level accounting, making it difficult to guarantee that all workload is identified in project level reporting. Nonglobal zones do not see their projects, nor do they have the administrative rights to modify the associated projects, even if they could be observed.

  • Defense: Nonglobal zones can be covertly audited.
    As with the preceding argument, a local zone would have no visibility into intrusion detection and auditing being run in the global zone, specifically security logs. This makes it impossible to cover your tracks if you compromise a zone. In fact, the lack of visible intrusion detection might influence a hacker to stay around a bit longer and leave more evidence that will assist future forensic analysis.

  • Defense: Zones and CPU pools may allow lower costs for software licensing.
    Many software partners, such as Oracle, consider the combination of processor pools and nonglobal zones as hard partitioning that may allow for the licensing of a subset of the available resources. Since the nonglobal zone lacks the privileges to change the processor pool configuration, even a rogue administrator or developer cannot invalidate the licensing that is being enforced from the global zone. Regular configuration audits are easy to run to ensure future compliance.

    We haven't heard from the prosecution lately - are they still here? Bueller? Bueller?

  • Defense: A compromise in a nonglobal zone doesn't compromise another zone (global or nonglobal).
    The reduction of privileges in a nonglobal zone will prevent a compromise in one zone from affecting another zone. The only user space components that are shared between zones are file systems, and those can be protected somewhat by mount options (nodevice, nosuid, noexec). If a nonglobal zone is compromised, there is a limit on the promotion of privileges that isolates other zones from further compromises.

    The prosecution was last seen downloading the latest Software Express and working on a first boot service to create nonglobal zones after a JumpStart installation.



Wednesday Feb 08, 2006

SMF manifest examples for Apache1 and MySQL

In the Service Management in a Day workshop (and the earlier Migrating a Legacy RC Service module from the Solaris Deep Dives) we examine the migration of MySQL from an RC script to a fully managed SMF service.

Why MySQL? Well, it's a convenient way to point out that MySQL is included in Solaris 10. But the real reason is that it is rather simple and makes a great platform to show what SMF can really do for us - and it's certainly more than a one trick pony.

So let's set up MySQL and see where this goes. You will find the instructions in /etc/sfw/mysql/README.solaris.mysql, but be careful as there is a small error. The last time I looked, chmod -R requires two arguments, not one.

# /usr/sfw/bin/mysql_install_db
# groupadd mysql
# useradd -g mysql mysql
# chgrp -R mysql /var/mysql
# chmod -R 770 /var/mysql
# installf SUNWmysqlr /var/mysql d 770 root mysql
# cp /usr/sfw/share/mysql/my-medium.cnf /var/mysql/my.cnf


Let's start the database manually and make sure that all is well.

# /etc/sfw/mysql/mysql.server start
Starting mysqld daemon with databases from /var/mysql

# /usr/sfw/bin/mysqladmin status
Uptime: 32  Threads: 1  Questions: 1  Slow queries: 0  Opens: 6  
Flush tables: 1  Open tables: 0  Queries per second avg: 0.031


Time for the first SMF value - resilient services. Let's terminate mysqld and see what happens.

# pkill mysql
#  mysqladmin status
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)'
Check that mysqld is running and that the socket: '/tmp/mysql.sock' exists!


This is what we expect. When mysqld terminates, nobody is watching and it remains down until the next reboot (or transition back to run level 3).

So what can SMF do for me here? Paying attention to a non-transient service is a good start.

What we need now is a manifest for MySQL. You can take a look at mine, or if you follow the RC Service Migration howto you will come up with something very close. Put mysql.xml somewhere under /var/svc/manifest (application or local seem good candidates, local probably being the best choice). Reboot or run the manifest-import service method to make SMF aware of the new service definition
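For the impatient, a trimmed manifest along these lines captures the essential parts. This is a sketch rather than my exact file - the dependency choice and timeouts are assumptions, and a production manifest would carry the full template boilerplate:

```xml
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='mysql'>
  <service name='application/mysql' type='service' version='1'>
    <create_default_instance enabled='false' />
    <single_instance />
    <!-- wait for local file systems before starting the database -->
    <dependency name='fs-local' grouping='require_all'
        restart_on='none' type='service'>
      <service_fmri value='svc:/system/filesystem/local' />
    </dependency>
    <exec_method type='method' name='start'
        exec='/etc/sfw/mysql/mysql.server start' timeout_seconds='60' />
    <exec_method type='method' name='stop'
        exec='/etc/sfw/mysql/mysql.server stop' timeout_seconds='60' />
    <stability value='Evolving' />
  </service>
</service_bundle>
```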

# svcs mysql
svcs: Pattern 'mysql' doesn't match any instances
STATE          STIME    FMRI

# /lib/svc/method/manifest-import
Loaded 1 smf(5) service descriptions

# svcadm enable mysql
# svcs mysql
STATE          STIME    FMRI
online         22:39:54 svc:/application/mysql:default

# mysqladmin status
Uptime: 4  Threads: 1  Questions: 1  Slow queries: 0  Opens: 6
  Flush tables: 1  Open tables: 0  Queries per second avg: 0.250


Now, let's try that pkill thing again.

# pkill mysqld

# svcs mysql
STATE          STIME    FMRI
online         22:45:45 svc:/application/mysql:default
Now, if we watch the service log file, conveniently located at /var/svc/log/application-mysql:default.log, you will see svc.startd notice that all of the processes have terminated, yet it isn't a transient service. So there is a problem and the service should be restarted.

[ Feb  8 16:53:36 Stopping because all processes in service exited. ]
[ Feb  8 16:53:36 Executing stop method ("/etc/sfw/mysql/mysql.server stop") ]
No mysqld pid file found. Looked for /var/mysql/pandora.pid.
[ Feb  8 16:53:36 Method "stop" exited with status 0 ]
[ Feb  8 16:53:36 Executing start method ("/etc/sfw/mysql/mysql.server start") ]
[ Feb  8 16:53:36 Method "start" exited with status 0 ]


This is pretty cool. We've made MySQL somewhat more available than it would have been straight out of the box. Does this eliminate the requirement for High Availability Clusters? No, but it does open an interesting discussion.

In this example my observation of MySQL's availability is rather naive - if it's running it must be OK. For something like a database server you may want to connect and manipulate some tables to see if the service is really running. We should also note that SMF doesn't really handle the platform availability issues - so HA Clusters are still needed. But it's also interesting to note that many HA scripts only provide coverage for a subset of critical services, usually a database, but ignore the dozens of other services that are also required for proper operation of the service being clustered. Lacking a sophisticated dependency framework, a node failover occurs when one of these other services fails.

SMF provides such a framework, including the watchdog monitor (svc.startd) - and it does so with very little effort on the part of the administrator or application packager.

But wait, there's more.

In a recent discussion over service minimization (the idea that you don't install software that you have no intention of running) a more subtle value of SMF can be observed. Solaris 10 allows us to separate the question of installation from activation. It's quite easy to install software and then verify that it is disabled. In fact a routine scan of service properties and a comparison against a baseline is a good idea.

Here is where a bit of creativity can give us additional safeguards. A well developed SMF manifest will allow us to make an additional distinction. We can now observe the installation of a service, the configuration of a service, and whether or not that service has actually been activated.

How is this done? A dependency on a configuration file is a good start. Let's look at the MySQL manifest and see how this was done.

<dependency
                name='config_file'
                type='path'
                grouping='require_all'
                restart_on='none'>
                <service_fmri value='file://localhost/var/mysql/my.cnf' />   
</dependency>


This is a dependency on a particular configuration file, in this case /var/mysql/my.cnf. If this file is missing then MySQL will not transition to online. If enabled it will immediately transition to the offline state and a check of svcs -l mysql will show the missing configuration file.

Now this is very cool indeed. For this service to be activated it must be installed, configured and enabled. Failing to configure the service (consider the case of sshd which you probably don't want to run without a configuration file) will provide an obvious and easily observed error condition. This may change the way you look at service minimization.

The takeaway from this exercise is that as you plan your RC service migrations to SMF, add a dependency on an easily observed indication that the service has been properly configured, such as a configuration file.

This brings me to my next example, an Apache 1 service manifest. We start by copying the Apache 2 service manifest at /var/svc/manifest/network/http-apache2.xml, which seemed like a good place to start. I changed the service name, the documentation block, and the start/stop methods as before.

There's one new wrinkle - take a look at the following property group

<property_group name='httpd' type='application'>
                        <stability value='Evolving' />
                        <propval name='ssl' type='boolean' value='false' />
                </property_group>


If we look at the Apache2 start method /lib/svc/method/http-apache2, you will see a query for this service property


        ssl=`svcprop -p httpd/ssl svc:/network/http:apache2`
        if [ "$ssl" = false ]; then
                cmd="start"
        else
                cmd="startssl"
        fi
        ;;


So this is how we enable SSL support for Apache 2. If we want to do something similar for Apache 1 then we will have to modify the start script /etc/init.d/apache. The other solution would be to remove the property group from the manifest and modify the start definition to call either /etc/init.d/apache start or /etc/init.d/apache startssl.
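The first option - teaching the Apache 1 method to honor the property - is only a few lines of shell. In this sketch svcprop(1) is stubbed out so the fragment runs standalone; on a real system you would delete the stub and keep the query, and svc:/network/http:apache1 is my assumed FMRI for the renamed service:

```shell
# Stub for illustration only -- a real method calls the actual svcprop(1)
svcprop() { echo "false"; }

# Same pattern as the bundled apache2 method, pointed at Apache 1
ssl=$(svcprop -p httpd/ssl svc:/network/http:apache1)
if [ "$ssl" = "false" ]; then
    cmd="start"
else
    cmd="startssl"
fi
echo "/etc/init.d/apache $cmd"
```

Flipping the httpd/ssl property with svccfg and refreshing the service would then switch the instance between start and startssl without touching the script again.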

After you import this new manifest, please remember to unlink the start and stop links from all run level directories (there's one start in rc3.d and one kill in each run level).

This brings me to my last recommendation - using a configuration file dependency to help keep service instances separated. This is particularly important for the http service as all the executables are named httpd. By adding a dependency to the configuration file you have added an important documentation item that will come in handy when diagnosing service failures. If the instance fails and ends up in a maintenance state, a quick look at svcs -l will tell you which instance you need to investigate.

Where can I learn more about this? The OpenSolaris SMF Community would be a good place to look. In addition to the excellent articles on Solaris Service Management, there is a repository of contributed manifests that might help you get started. And you are invited to contribute manifests for your converted services - you might even receive a nice OpenSolaris trinket for your efforts.


Wednesday Nov 09, 2005

Common First Time Mistakes - Containers

Solaris 10 Containers are built on an interesting virtualization technology called zones. Local zones are amazingly easy to configure and install, but there are a few things that can trip you up the first few times.
  1. Local zones require system identification
    Since each local zone has its own /etc, it can have an identity different from that of the global zone (timezone, locale, root password), so we need to supply some basic configuration information about the local zone. If you are experienced with Solaris you will recognize the system identification process that runs at first boot. Solaris 10 adds the NFS V4 domain mapping question, which must be answered in addition to supplying an /etc/sysidcfg file. We'll deal with that later.

    The complication presented by local zones is that you can use zlogin(1) to enter a local zone before it has completed its system identification. This is not possible for the global zone, nor in a prior Solaris release, so you may not even consider this a possibility when diagnosing your first few zone configuration problems.

    The symptoms are that you can enter the zone using zlogin(1), but nothing else works. You cannot get in via ssh, rlogin, or telnet even though they have been configured properly (or so you believe at this point). Your first step in diagnosis should be an svcs -a, where you will see services in the uninitialized state. This is the clue!

    If you look back through the service list you will see a service called sysidtool (which calls the method script /lib/svc/method/sysidtool-system). This is where system identification is done (and if you look at the method you will discover how to answer the NFS V4 question).

    The resolution is simple - connect to the local zone console using zlogin -C and answer the identification questions.

    If you are using the Java Desktop System then terminal type 12 (xterms) will provide the best results.

    You will also experience this problem if your sysidcfg file contains an error. The most common errors are incorrect specifications of the timezone and root password.

  2. Failure to answer the NFS V4 question
    You can script the creation of a local zone and supply default identification through the use of an /etc/sysidcfg file. Experienced Solaris administrators will recognize this method from unattended JumpStart installs. Solaris 10 requires one additional configuration item that isn't satisfied by /etc/sysidcfg: the NFS V4 domain mapping question.

    Automating the NFS V4 configuration requires two steps. First, specify the value of NFSMAPID_DOMAIN in $zonepath/root/etc/default/nfs. Second, create a file called $zonepath/root/etc/.NFS4inst_state.domain to let sysidtool know that you have answered the question.
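    Those two steps might be scripted like this; the zonepath and domain shown are placeholders:

```shell
#!/bin/sh
# Sketch only: the zonepath and domain passed in are placeholders.
# answer_nfs4 ZONEPATH DOMAIN -- pre-answer the NFSv4 domain question.
answer_nfs4() {
        echo "NFSMAPID_DOMAIN=$2" >> "$1/root/etc/default/nfs"
        touch "$1/root/etc/.NFS4inst_state.domain"
}
# e.g. answer_nfs4 /zones/web1 example.com
```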

  3. The local zone root directory is $zonepath/root
    If the lab equipment is sufficiently fast then we have a little competition in the Containers workshop. The challenge is to completely automate the installation of a local container and provision an application (typically Apache or MySQL) as well as set up root access via telnet, rlogin, or ssh - but do it in a single script with no intervention. Run the script and the next step is to connect to the provisioned service.

    After 10 minutes of scripting work, and another 10 minutes for the local zone to install, there are always a few exclamations of "Doh!" as students realize that they dropped their sysidcfg file into $zonepath/etc rather than $zonepath/root/etc.

  4. Make sure the mount points exist for all file systems being supplied by zoneadmd
    Supplying lofs file systems via the zone configuration file (see zonecfg man page) is a convenient way to share files between zones (including the global zone). The advantage of this method is that zoneadmd performs the loopback mount from the privileged global zone as it readies the local zone, thus the local zone isn't permitted to undo this mount. If the mount point (in the local zone) does not exist then the zone will fail to boot.
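    For example, sharing /export/shared into a zone might look like the following zonecfg session; the zone name web1 and the paths are placeholders:

```
zonecfg:web1> add fs
zonecfg:web1:fs> set dir=/export/shared
zonecfg:web1:fs> set special=/export/shared
zonecfg:web1:fs> set type=lofs
zonecfg:web1:fs> end
```

    Remember to create the mount point inside the local zone first (e.g. mkdir -p $zonepath/root/export/shared), or the zone will fail to boot.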

  5. An unplumbed network will cause a local zone to fail to boot
    This one mostly affects mobile users. For convenience you may boot your global zone without networking configured; once you log in, you can run a simple script to plumb your network interfaces based on how you need to connect (fixed IP address at home or in a lab, DHCP in a hotel, etc.). If your local zones have network resources, which is typically the case, those interfaces must be plumbed in the global zone before booting the local zone. This one has gotten me more than once during a customer demonstration.


About

Bob Netherton is a Principal Sales Consultant for the North American Commercial Hardware group, specializing in Solaris, Virtualization and Engineered Systems. Bob is also a contributing author of Solaris 10 Virtualization Essentials.

This blog will contain information about all three, but will focus primarily on topics for Solaris system administrators.

Please follow me on Twitter or Facebook, or send me email
