Heads up on Kernel patch installation issues with jumpstart or ZFS Root

I'd like to give you a heads-up on a couple of Kernel patch installation issues:

1. There was a bug (since fixed) in the Deferred Activation Patching functionality in a ZFS Root environment, originally reported on x86 only; see Sun Alert 263928.  You may see an error message to the effect that a Class Action Script has failed to complete and that the environment for Deferred Activation Patching could not be set up.  The relevant CR is 6850329: "KU 139556-08 fails to apply on x86 systems that have ZFS root filesystems and corrupts the OS".  SPARC systems are similarly affected.  The following error message is returned:
mv: cannot rename /var/run/.patchSafeMode/root/lib/libc.so.1.20102 to /lib/libc.so.1: Device busy
ERROR: Move of /var/run/.patchSafeMode/root/lib/libc.so.1.20102 to dstActual failed
usage: puttext [-r rmarg] [-l lmarg] string
pkgadd: ERROR: class action script did not complete successfully

Installation of <SUNWcslr> failed.

This issue is fixed in Patch Utilities patch 119255-70 or later revisions.

BTW: The principal reason ZFS Root support was implemented in Live Upgrade is so that applying patches like this to the live boot environment would not be necessary.   With ZFS Root, creating a clone Boot Environment is so easy that there's no good reason not to.   This avoids the need for technologies such as Deferred Activation Patching, which attempt to make it safer to apply arbitrary changes to a live boot environment, an inherently risky process.
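
For illustration, a minimal Live Upgrade flow on a ZFS Root system might look like the following sketch (the BE name "patchbe" and the patch directory are assumptions, not recommendations):

# Create a clone Boot Environment (cheap on ZFS Root, as it uses snapshots/clones)
lucreate -n patchbe
# Apply the patches to the inactive BE rather than to the live one
luupgrade -t -n patchbe -s /var/tmp/patches 139555-08
# Activate the patched BE and reboot into it
luactivate patchbe
init 6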

2. There are reproducible issues using jumpstart finish scripts and other scenarios to install Kernel patch 137137-09 followed by Kernel patch 139555-08.   Here's the gist of the issues, pulled from an engineering email thread on the subject:

Issue 1: I have a customer whose system is not booting after applying the patch cluster with Live Upgrade (LU).

Solution 1: If using 'luupgrade -t', you must ensure that the latest revision of the LU patch is installed first: currently 121430-36 on SPARC and 121431-37 on x86. Once these patches are installed, LU will automatically handle the build of the boot archive when 'luactivate' is called, thus avoiding the problem.
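
As a quick sanity check (a sketch only), you can confirm which revision of the LU patch is installed before running 'luupgrade -t', e.g. on SPARC:

# List installed patches and look for the Live Upgrade patch (121430 on SPARC, 121431 on x86)
showrev -p | grep 121430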

Issue 2: There are other ways to get oneself into a situation where the boot archive is out of sync, e.g. using jumpstart finish scripts to apply patches that include 137137-09.  Basically, any operation that patches an ABE outside of 'luupgrade' will require a manual rebuild of the boot archive.

Solution 2: You must manually rebuild the boot archive on the /a partition after applying the patches.  Otherwise, once the system boots, the boot archive will be out of sync.
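
For example (a sketch only; the BE name and mount point are assumptions), if an ABE has been patched outside of 'luupgrade', you could rebuild its boot archive before booting it:

# Mount the ABE at /mnt
lumount patchbe /mnt
# Rebuild the boot archive against the ABE's root
/mnt/boot/solaris/bin/create_ramdisk -R /mnt
# Unmount the ABE again
luumount patchbe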

Here's some more detail on the jumpstart finish script version of this: 

We've seen the same panic a few times when the latest patch cluster is applied via a finish script to a pre-S10U6 boot environment during a jumpstart installation. It appears that the boot archive is out of sync with the kernel on the system: the boot archive was created from the 137137-09 patch and not updated after the 139555-08 kernel was applied, hence the mismatch between the kernel and the boot archive.

In these instances, updating the boot archive allows the system to boot successfully. Booting failsafe (ok boot -F failsafe) will detect an out-of-sync boot archive.  Execute the automated update, then reboot.  The system will now boot from the later kernel (139555-08), which was successfully installed from the finish script.
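
For reference, the failsafe recovery looks roughly like this on SPARC (a sketch; the root device is an assumption, and the failsafe boot itself offers to update any out-of-sync archive it finds):

ok boot -F failsafe
# If you skip the automated prompt, mount the root filesystem and update the archive manually:
mount /dev/dsk/c0t0d0s0 /a
bootadm update-archive -R /a
umount /a
reboot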

I reproduced the problem in a jumpstart installation environment, applying the latest 10_Recommended patch cluster from a finish script. The initial installation was S10U5, which is deployed from a miniroot that has no knowledge of a boot archive (my theory, anyway).  This is similar to a Live Upgrade environment in which the boot environment doing the patching is also boot-archive unaware (meaning its kernel is pre-137137-09).

In the jumpstart scenario, the immediate problem was solved by updating the boot archive from failsafe as previously described.  The solution was to update the boot archive from the finish script after the patch cluster installation completed.  BTW, all patches in the patch cluster installed successfully per /var/sadm/system/logs/finish.log.

In a standard jumpstart, the boot device (install target) is mounted at /a, so adding the following entry to the finish script solved the problem:

/a/boot/solaris/bin/create_ramdisk -R /a

Depending on the finish script configuration and variables, the following would also work:

$ROOTDIR/boot/solaris/bin/create_ramdisk -R $ROOTDIR
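
Putting it together, a finish script fragment might look like this (a sketch only; the patch cluster location under /a/var/tmp and the use of the cluster's patch_order file are assumptions about your jumpstart setup):

# Apply the patch cluster to the install target mounted at /a
patchadd -R /a -M /a/var/tmp/10_Recommended `cat /a/var/tmp/10_Recommended/patch_order`
# Rebuild the boot archive so it matches the newly installed kernel
/a/boot/solaris/bin/create_ramdisk -R /a
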
Issue 3: The above issues are sometimes mis-diagnosed as CR 6850202: "bootadm fails to build bootarchive in certain configurations leading to unbootable system".

But CR 6850202 will only be encountered under very specific circumstances, all of which must occur in order to hit this particular bug, namely:

1. Install U6 with the SUNWCreq (required) metacluster - there's no mkisofs, so a UFS boot archive is built

2. Limit /tmp to 512MB - thus forcing the UFS build to happen in /var/run

3. Have a separate /var - bootadm.c only lofs "nosub" mounts "/" when creating the alternate root for the DAP patching build of the boot archive

4. Install 139555-08

You must have all 4 of the above in order to hit this, i.e. step 4 must be the installation of a DAP patch, such as a Kernel patch associated with a Solaris 10 Update, e.g. 139555-08.

Solution 3: Removing the 512MB limit (or whatever limit has been imposed) on /tmp in /etc/vfstab, and/or adding SUNWmkcd (and probably SUNWmkcdS) so that mkisofs is available on the system, is sufficient to avoid the code path that fails this way.
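
Concretely (a sketch; the media path is an assumption), the /tmp size limit lives in /etc/vfstab and the missing packages can be added from the install media:

# /etc/vfstab entry imposing the 512MB limit on /tmp; remove the size option:
#   swap  -  /tmp  tmpfs  -  yes  size=512m
#   swap  -  /tmp  tmpfs  -  yes  -
# And/or add the packages that provide mkisofs:
pkgadd -d /cdrom/cdrom0/Solaris_10/Product SUNWmkcd SUNWmkcdS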

If you do hit it, booting failsafe and recreating the boot archive will recover the system.

Here's further input from one of my senior engineers, Enda O'Connor:

If using Live Upgrade (LU), and LU on the live partition is up to date in terms of the latest revision of the LU patch, 121430 (SPARC) or 121431 (x86), the boot archive will be built automatically once the user runs shutdown (after 'luactivate' to activate the new BE).  This is done from a kill script in rc0.d.

If using a jumpstart finish script or jumpstart profile to patch a pre-U6 image with the latest kernel patches, then you need to run create_ramdisk from the finish script after all patching/packaging operations have finished.  Alternatively, you can patch your pre-U6 miniroot to the U6 SPARC NewBoot level (137137-09), at which point the modified miniroot will handle the build of the boot_archive after the finish script has run.
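
For the second option, the miniroot of a SPARC net-install image can be patched in place with 'patchadd -C' (a sketch; the image and patch paths are assumptions):

# Patch the jumpstart miniroot to the U6 NewBoot level so it becomes boot-archive aware
patchadd -C /export/install/s10_sparc/Solaris_10/Tools/Boot /var/tmp/137137-09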

If patching U6 and upwards from jumpstart, the boot_archive will be built automatically after the finish script has run, so there's no issue in this scenario.

If using any home-grown technology to patch or install/modify software on an Alternate Boot Environment (ABE), such as ufsrestore/cpio/tar for example, you must always run create_ramdisk manually before booting to said ABE.
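
For example (a sketch; the target slice and dump location are assumptions), restoring a root backup onto an ABE and rebuilding its archive before booting it:

# Restore a root filesystem backup onto the ABE slice mounted at /mnt
mount /dev/dsk/c0t1d0s0 /mnt
cd /mnt && ufsrestore rf /backup/root.dump
# Always rebuild the boot archive before booting the ABE
/mnt/boot/solaris/bin/create_ramdisk -R /mnt
umount /mnt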

Best Wishes,

Gerry.

Comments:

Thanks, Gerry. I've run into this issue on multiple machines (x86) while patching them up to 139556. The problem is that the 139556 patch install tries to do something with /lib/libc.so.1 and can't write to it, causing the patch to fail mid-install and leading to unresolved symbol errors on a subsequent boot.

The fix, of course, is to re-install the patch from failsafe and update the boot archive. This problem seems so prevalent that I wonder why 139555/139556 hasn't been withdrawn as a bad patch, or why a new revision hasn't been released that fixes this install issue.

Posted by Dale Ghent on June 25, 2009 at 03:20 PM IST #

Hi Dale!

The problem isn't caused by 139555/6. It's a problem in the Deferred Activation Patching (DAP) algorithm in the patch utilities. libc.so is already mounted, so when DAP tries to overlay mount it, ZFS returns EBUSY and DAP fails. UFS doesn't return EBUSY in this instance. Using one of the workarounds listed above (e.g. using Live Upgrade or unmounting libc prior to applying the patch) avoids the issue in a ZFS environment.
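
For completeness, that workaround looks roughly like this (a sketch only; run it before patchadd, and only if a lofs mount of libc is actually present):

# Check whether an optimised libc is lofs-mounted over /lib/libc.so.1
mount | grep '/lib/libc.so.1'
# If so, unmount it before applying the patch (the optimised libc is remounted at the next boot)
umount /lib/libc.so.1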

Best Wishes,

Gerry.

Posted by Gerry Haskins on June 25, 2009 at 04:11 PM IST #

Are there plans to fix DAP so this problem doesn't happen without a workaround? This seems like a major problem if you are unlucky enough not to have read this blog before applying that patch.

Posted by Larry on June 29, 2009 at 02:46 PM IST #

Hi Larry!

Yes, we're working with Sustaining on identifying a fix to DAP (or possibly ZFS). This was simply an early heads-up to give customers some chance to avoid hitting the issue. We've already identified a "hack" fix on the DAP side. Sustaining are engaging the ZFS team to look for something more elegant.

BTW: With ZFS Root, cloning and patching an ABE (e.g. using Live Upgrade) is strongly recommended. This avoids the DAP issue.

Best Wishes,

Gerry.

Posted by Gerry Haskins on June 30, 2009 at 11:24 AM IST #

Hey Gerry, any news on this? This patch issue has become an even larger issue for my org, as a lot of my machines hit the Issue 3 you described: separate /var, no mkisofs, and a /tmp limited to 512MB. Given my very automated patching environment and some CR fixes we need that are being delivered in patches which ultimately depend on 13955(5|6)-08, this is a pretty big deal.

I seem to recall an issue on SPARC systems with an SVM'd root when newboot for SPARC was released, and the fix was to release a patch which installed pre/postpatch scripts that superseded the scripts in the faulty KU. Could the Sun patch team do something similar for this problematic KU, implementing one of the 3 workarounds you described?

Posted by Dale Ghent on July 06, 2009 at 02:02 PM IST #

Hi Dale!

My folk are talking to the responsible engineering team on this.

It would really help if you can raise a Customer Escalation on CR 6850202 through your normal Sun Support channels. This will really help me build the case that this is a real issue impacting customers which needs to be fixed as a matter of urgency, and not just a theoretical corner case issue found by my internal test team.

In general, always escalate issues which are critical to your environment. The more Customer Escalations outstanding against a CR, the higher it will be prioritized for fixing.

BTW: The fix to the DAP bug in ZFS Root is proving non-trivial. Again, my team is working with Solaris Sustaining on the hunt for a fix.

I'm sorry that I don't have better news.

Best Wishes,

Gerry.

Posted by Gerry Haskins on July 07, 2009 at 10:36 AM IST #

Thanks for the info Gerry.

I guess all of this is very concerning. 139555-08 can corrupt your OS unless you apply it in a very specific manner on servers with ZFS root, but the patch has not been withdrawn, the notes for the patch have not been updated with this info, and it sounds like you are having problems convincing Sustaining that this is a big problem that needs to take priority.

I work in an environment where we have started converting the servers to ZFS root and didn't want to have 2 different patching methods until we had finished the conversions. Our current method applies the patches in single-user mode, the same strategy that has been supported forever, but apparently no longer.

Posted by Larry on July 09, 2009 at 09:51 AM IST #

Just to provide an update on this:

The bug in the Deferred Activation Patching functionality in a ZFS Root environment (x86 only) is now fixed in Patch Utilities patch 119255-70 (and above). See Sun Alert http://sunsolve.sun.com/search/document.do?assetkey=1-66-263928-1.

Best Wishes,

Gerry.

Posted by Gerry Haskins on September 08, 2009 at 10:58 AM IST #

