Thursday Jun 30, 2011

Common Live Upgrade Problems

As I have worked with customers deploying Live Upgrade in their environments, several problems seem to surface over and over. In this article I will collect these problems and suggest some workarounds. If this sounds like the beginnings of a wiki, you would be right. At present there is not enough material for one, so we will use this blog for the time being. I do expect new material to be posted on occasion, so if you wish to bookmark it for future reference, a permanent link can be found here.

To help with your navigation, here is an index of the common problems.

  1. lucreate(1M) copies a ZFS root rather than making a clone
  2. luupgrade(1M) and the Solaris autoregistration file
  3. Watch out for an ever growing /var/tmp
Without any further delay, here are some common Live Upgrade problems.

Live Upgrade copies the ZFS root rather than making a clone

This problem was introduced in Solaris 10 10/09 (u8), and the root cause is a duplicate entry in the source boot environment's ICF configuration file. Prior to u8, a ZFS root file system was not included in /etc/vfstab, since the mount is implicit at boot time. Starting with u8, the root file system is included in /etc/vfstab, and when the boot environment is scanned to create the ICF file, a duplicate entry is recorded. Here's what the error looks like.
# lucreate -n s10u9-baseline
Checking GRUB menu...
System has findroot enabled GRUB
Analyzing system configuration.
Comparing source boot environment  file systems with the
file system(s) you specified for the new boot environment. Determining
which file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment .
Source boot environment is .
Creating boot environment .
Creating file systems on boot environment .
Creating  file system for  in zone  on .

The error indicator -----> /usr/lib/lu/lumkfs: test: unknown operator zfs

Populating file systems on boot environment .
Checking selection integrity.
Integrity check OK.
Populating contents of mount point .

This should not happen ------> Copying.

Ctrl-C and cleanup
If you weren't paying close attention, you might not even know this is an error. The symptoms are lucreate times that are far too long because of the extraneous copy, or (the one that alerted me to the problem) a root file system that keeps filling up, again thanks to the redundant copy.
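
A quick way to confirm the diagnosis (a sketch; each boot environment gets its own /etc/lu/ICF.n file, so adjust the file name for your system) is to count the root file system entries in the source boot environment's ICF file:

# grep -c ':/:' /etc/lu/ICF.*

A count of 1 is what you want to see; a count of 2 for the source boot environment's ICF file means you are running into this duplicate entry problem.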

This problem has already been identified and corrected, and a patch (121431-58 or later for x86, 121430-57 or later for SPARC) is available. Unfortunately, this patch has not yet made it into the Solaris 10 Recommended Patch Cluster. Applying the prerequisite patches from the latest cluster is a recommendation from the Live Upgrade Survival Guide blog, so an additional step will be required until the patch makes it into the cluster. Let's see how this works.

# patchadd -p | grep 121431
Patch: 121429-13 Obsoletes: Requires: 120236-01 121431-16 Incompatibles: Packages: SUNWluzone
Patch: 121431-54 Obsoletes: 121436-05 121438-02 Requires: Incompatibles: Packages: SUNWlucfg SUNWluu SUNWlur

# unzip 121431-58
# patchadd 121431-58
Validating patches...

Loading patches installed on the system...

Done!

Loading patches requested to install.

Done!

Checking patches that you specified for installation.

Done!


Approved patches will be installed in this order:

121431-58


Checking installed patches...
Executing prepatch script...
Installing patch packages...

Patch 121431-58 has been successfully installed.
See /var/sadm/patch/121431-58/log for details
Executing postpatch script...

Patch packages installed:
  SUNWlucfg
  SUNWlur
  SUNWluu

# lucreate -n s10u9-baseline
Checking GRUB menu...
System has findroot enabled GRUB
Analyzing system configuration.
INFORMATION: Unable to determine size or capacity of slice .
Comparing source boot environment  file systems with the
file system(s) you specified for the new boot environment. Determining
which file systems should be in the new boot environment.
INFORMATION: Unable to determine size or capacity of slice .
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment .
Source boot environment is .
Creating boot environment .
Cloning file systems from boot environment  to create boot environment .
Creating snapshot for  on .
Creating clone for  on .
Setting canmount=noauto for  in zone  on .
Saving existing file  in top level dataset for BE  as //boot/grub/menu.lst.prev.
Saving existing file  in top level dataset for BE  as //boot/grub/menu.lst.prev.
Saving existing file  in top level dataset for BE  as //boot/grub/menu.lst.prev.
File  propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE  in GRUB menu
Population of boot environment  successful.
Creation of boot environment  successful.
This time it took just a few seconds. A cursory examination of the offending ICF file (/etc/lu/ICF.3 in this case) shows that the duplicate root file system entry is now gone.
# cat /etc/lu/ICF.3
s10u8-baseline:-:/dev/zvol/dsk/panroot/swap:swap:8388608
s10u8-baseline:/:panroot/ROOT/s10u8-baseline:zfs:0
s10u8-baseline:/vbox:pandora/vbox:zfs:0
s10u8-baseline:/setup:pandora/setup:zfs:0
s10u8-baseline:/export:pandora/export:zfs:0
s10u8-baseline:/pandora:pandora:zfs:0
s10u8-baseline:/panroot:panroot:zfs:0
s10u8-baseline:/workshop:pandora/workshop:zfs:0
s10u8-baseline:/export/iso:pandora/iso:zfs:0
s10u8-baseline:/export/home:pandora/home:zfs:0
s10u8-baseline:/vbox/HardDisks:pandora/vbox/HardDisks:zfs:0
s10u8-baseline:/vbox/HardDisks/WinXP:pandora/vbox/HardDisks/WinXP:zfs:0
This error can show up in a slightly different form. When activating a new boot environment, propagation of the bootloader and configuration files may fail with an error indicating that an old boot environment could not be mounted. That prevents the activation from taking place, and you will find yourself booting back into the old BE.

Again, the root cause is the root file system entry in /etc/vfstab. Even though the mount-at-boot flag is set to no, it confuses lumount(1M) as it cycles through the boot environments during the propagation phase. To correct this problem, boot back to the offending boot environment and remove the vfstab entry for /.
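
For reference, the offending entry looks something like this (a sketch; the output is illustrative and the dataset name will be whatever root dataset belongs to the BE you booted into):

# grep zfs /etc/vfstab
panroot/ROOT/s10u8-baseline     -       /       zfs     -       no      -

Delete that line (the ZFS root is mounted implicitly at boot and does not need a vfstab entry), then retry the activation.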

luupgrade(1M) and the new (Solaris 10 9/10 and later) autoregistration file

This one is actually mentioned in the Oracle Solaris 10 9/10 release notes. I know, I hate it when that happens too.

Here's what the "error" looks like.

# luupgrade -u -s /mnt -n s10u9-baseline

System has findroot enabled GRUB
No entry for BE  in GRUB menu
Copying failsafe kernel from media.
61364 blocks
miniroot filesystem is 
Mounting miniroot at 
ERROR:
        The auto registration file <> does not exist or incomplete.
        The auto registration file is mandatory for this upgrade.
        Use -k  argument along with luupgrade command.
        autoreg_file is path to auto registration information file.
        See sysidcfg(4) for a list of valid keywords for use in
        this file.

        The format of the file is as follows.

                oracle_user=xxxx
                oracle_pw=xxxx
                http_proxy_host=xxxx
                http_proxy_port=xxxx
                http_proxy_user=xxxx
                http_proxy_pw=xxxx

        For more details refer "Oracle Solaris 10 9/10 Installation
        Guide: Planning for Installation and Upgrade".

As with the previous problem, this one is also easy to work around. Assuming that you don't want to use the auto registration feature at upgrade time, create a file that contains just autoreg=disable and pass the file name to luupgrade with the -k option.

Here is an example.

# echo "autoreg=disable" > /var/tmp/no-autoreg
# luupgrade -u -s /mnt -k /var/tmp/no-autoreg -n s10u9-baseline
 
System has findroot enabled GRUB
No entry for BE  in GRUB menu
Copying failsafe kernel from media.
61364 blocks
miniroot filesystem is 
Mounting miniroot at 
#######################################################################
 NOTE: To improve products and services, Oracle Solaris communicates
 configuration data to Oracle after rebooting.

 You can register your version of Oracle Solaris to capture this data
 for your use, or the data is sent anonymously.

 For information about what configuration data is communicated and how
 to control this facility, see the Release Notes or
 www.oracle.com/goto/solarisautoreg.

 INFORMATION: After activated and booted into new BE ,
 Auto Registration happens automatically with the following Information

autoreg=disable
#######################################################################
Validating the contents of the media .
The media is a standard Solaris media.
The media contains an operating system upgrade image.
The media contains  version <10>.
Constructing upgrade profile to use.
Locating the operating system upgrade program.
Checking for existence of previously scheduled Live Upgrade requests.
Creating upgrade profile for BE .
Checking for GRUB menu on ABE .
Saving GRUB menu on ABE .
Checking for x86 boot partition on ABE.
Determining packages to install or upgrade for BE .
Performing the operating system upgrade of the BE .
CAUTION: Interrupting this process may leave the boot environment unstable
or unbootable.
The Live Upgrade operation now proceeds as expected. Once the system upgrade is complete, we can manually register the system. If you want to do a hands-off registration during the upgrade, see the Oracle Solaris Auto Registration section of the Oracle Solaris Release Notes for instructions on how to do that.

/var/tmp and the ever growing boot environment

Let's start with a clean installation of Solaris 10 10/09 (u8).
# df -k /
Filesystem                       kbytes    used   avail capacity  Mounted on
rpool/ROOT/s10x_u8wos_08a      20514816 4277560 13089687    25%    /

So far, so good. Solaris is just a bit over 4GB. Another 3GB is used by the swap and dump devices. That should leave plenty of room for half a dozen or so patch cycles (assuming 1GB each) and an upgrade to the next release.
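
If you want to verify where those 3GB are going, the swap and dump devices are ZFS volumes in the root pool. Here is a quick check, assuming the default dataset names rpool/swap and rpool/dump:

# zfs list rpool/swap rpool/dump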

Now, let's put on the latest recommended patch cluster. Note that I am following the suggestions in my Live Upgrade Survival Guide, installing the prerequisite patches and the LU patch before actually installing the patch cluster.

# cd /var/tmp
# wget patchserver:/export/patches/10_x86_Recommended-2012-01-05.zip .
# unzip -qq 10_x86_Recommended-2012-01-05.zip

# wget patchserver:/export/patches/121431-69.zip
# unzip 121431-69

# cd 10_x86_Recommended
# ./installcluster --apply-prereq --passcode (you can find this in README)

# patchadd -M /var/tmp 121431-69

# lucreate -n s10u8-2012-01-05
# ./installcluster -d -B s10u8-2012-01-05 --passcode

# luactivate s10u8-2012-01-05
# init 0

After the new boot environment is activated, let's upgrade to the latest release of Solaris 10. In this case, it will be Solaris 10 8/11 (u10).

Yes, it does seem like an awful lot is happening in a short period of time. I'm trying to demonstrate a situation that really does happen when you forget something as simple as a patch cluster clogging up /var/tmp. Think of this as one of those time-lapse video sequences you might see in a nature documentary.

# pkgrm SUNWluu SUNWlur SUNWlucfg
# pkgadd -d /cdrom/sol_10_811_x86  SUNWluu SUNWlur SUNWlucfg
# patchadd -M /var/tmp 121431-69

# lucreate -n s10u10-baseline
# echo "autoreg=disable" > /var/tmp/no-autoreg
# luupgrade -u -s /cdrom/sol_10_811_x86 -k /var/tmp/no-autoreg -n s10u10-baseline
# luactivate s10u10-baseline
# init 0
As before, everything went exactly as expected. Or so I thought, until I logged in for the first time and checked the free space in the root pool.
# df -k /
Filesystem                       kbytes    used   avail capacity  Mounted on
rpool/ROOT/s10u10-baseline     20514816 10795038 2432308    82%    /
Where did all of the space go? Back-of-the-napkin calculations: 4.5GB (s10u8) + 4.5GB (s10u10) + 1GB (patch set) + 3GB (swap and dump) = 13GB. A 20GB pool minus 13GB used should leave 7GB free. But there's only 2.4GB free?

This is about the time that I smack myself on the forehead and realize that I put the patch cluster in /var/tmp. Old habits die hard. This is not a problem, I can just delete it, right?

Not so fast.

# du -sh /var/tmp
 5.4G   /var/tmp

# du -sh /var/tmp/10*
 3.8G   /var/tmp/10_x86_Recommended
 1.5G   /var/tmp/10_x86_Recommended-2012-01-05.zip

# rm -rf /var/tmp/10*

# du -sh /var/tmp
 3.4M   /var/tmp

Imagine the look on my face when I check the pool free space, expecting to see 7GB free.
# df -k /
Filesystem                      kbytes    used   avail capacity  Mounted on
rpool/ROOT/s10u10-baseline    20514816 5074262 2424603    68%    /

We are getting closer, I suppose. At least my root filesystem size is reasonable (5GB vs 11GB). But the free space hasn't changed at all.

Once again, I smack myself on the forehead. The patch cluster is also in the other two boot environments. All I have to do is get rid of them there too, and I'll get my free space back. Right?

# lumount s10u8-2012-01-05 /mnt
# rm -rf /mnt/var/tmp/10_x86_Recommended*
# luumount s10u8-2012-01-05

# lumount s10x_u8wos_08a /mnt
# rm -rf /mnt/var/tmp/10_x86_Recommended*
# luumount s10x_u8wos_08a
Surely, the free space will now be 7GB.
# df -k /
Filesystem                    kbytes    used   avail capacity  Mounted on
rpool/ROOT/s10u10-baseline  20514816 5074265 2429261    68%    /

This is when I smack myself on the forehead for the third time in one afternoon. Just getting rid of the files in the boot environments is not sufficient. It would be if I were using UFS as a root file system, but lucreate uses the ZFS snapshot and cloning features when the root is on ZFS. So the patch cluster is still referenced by a snapshot, and the oldest one at that.
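
You can see where the space is being held. Here is a sketch, using the pool layout from this example and assuming a pool recent enough (Solaris 10 10/09 or later) to report the usedbysnapshots property:

# zfs list -t snapshot -r rpool/ROOT
# zfs get usedbysnapshots rpool/ROOT/s10x_u8wos_08a

The snapshot that lucreate took of the original boot environment still references the patch cluster, which is why deleting the files from the live file systems returned almost no space.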

Let's try this all over again, but this time I will put the patches somewhere else that is not part of a boot environment. If you are thinking of using root's home directory, think again - it is part of the boot environment. If you are running out of ideas, let me suggest that /export/patches might be a good place to put them.
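
If a shared place for patches does not already exist, it only takes a moment to create one (a sketch, assuming /export is the rpool/export dataset, as it is on this system):

# zfs create rpool/export/patches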

Doing the exercise again, with the patches in /export/patches, I get similar results (to be expected), but this time the patches are in a shared ZFS dataset (/export).

# lustatus
Boot Environment           Is       Active Active    Can    Copy      
Name                       Complete Now    On Reboot Delete Status    
-------------------------- -------- ------ --------- ------ ----------
s10x_u8wos_08a             yes      no     no        yes    -         
s10u8-2012-01-05           yes      no     no        yes    -         
s10u10-baseline            yes      yes    yes       no     -         

# df -k /
Filesystem                      kbytes    used   avail capacity  Mounted on
rpool/ROOT/s10u10-baseline    20514816 5184578 2445140    68%    /


# df -k /export
Filesystem                      kbytes    used   avail capacity  Mounted on
rpool/export                  20514816 5606384 2445142    70%    /export

This means that I can delete them, and reclaim the space.
# rm -rf /export/patches/10_x86_Recommended*

# df -k /
Filesystem                      kbytes    used   avail capacity  Mounted on
rpool/ROOT/s10u10-baseline    20514816 5184578 8048050    40%    /

Now, that's more like it. With this free space, I can continue to patch and maintain my system as I had originally planned - estimating a few hundred MB to 1.5GB per patch set.
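
Going forward, here is a quick way to keep an eye on which boot environments and snapshots are holding on to space (a sketch using the pool names from this example):

# lustatus
# zfs list -t filesystem,snapshot -r rpool/ROOT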
