Thursday Oct 06, 2011

Walking in the shadows of giants

As I sit here in 22A on an American Airlines flight from San Francisco to O'Hare at the start of my 16-hour journey home to Ireland, I'm reflecting on some of the key Solaris 11-related events at Oracle OpenWorld this week.

For the first time in a couple of years, I got to spend the weekend in Northern California, having been here last week for Solaris 11 planning meetings.  I went up to the Sierras to hug some Sequoias.  I'm not normally the tree-hugging type, but I make an exception for these giants.  I saw Mono Lake.  Cool.  Devil's Postpile.  Way cool.  And Sequoia National Park - it's truly amazing walking in the shadows of these giants.

As usual, Oracle OpenWorld and JavaOne this week provided the opportunity to hear about bleeding edge technologies directly from their architects and to chat with them about the what and the why.

Markus Flierl (VP, Solaris Engineering) hosted a session on Monday with some of his key architects who have been developing Solaris 11 over the last 7+ years, including Liane Praza (IPS), Bart Smaalders (IPS), Darren Moffat (Security), Dan Price (Zones), and Mark Maybee (I/O).  It was great to hear these experts express their passion, ingenuity, and innovation.  They have a justifiable parental sense of pride in Solaris 11.  Technologies which were bolt-ons in Solaris 10, or indeed far too disruptive to even be considered for release in a Solaris 10 Update, are tightly integrated and honed in Solaris 11.  Low latency (i.e. performance), scalability, security, availability, robustness, and diagnosability are all qualities that customers have come to expect of Solaris.  Solaris 11 takes them to a whole new level.  Warp drive.

My colleague, Pete Dennis, and I have been working closely with Bart, Liane, David Comay, and others to ensure that IPS fully meets the needs of our customers' maintenance lifecycle.  They've listened to us and subtly tweaked and adapted their implementations where necessary.  Working with geniuses is great.  Working with geniuses who are prepared to listen and adapt is truly wonderful.

But what really blew me away this week was a presentation by Nicolas Droux last night on Network Virtualization in Solaris 11.  Some of you may know earlier incarnations of this, codenamed Project "Crossbow".  But the fleshing out of the capabilities in Solaris 11 is truly amazing: virtualized NICs (VNICs), virtualized LANs (VLANs), Zones which act as virtualized switches, Zones which act as virtualized firewalls, fully segregated data "lanes", "flows", and more, all with diagnosability built in via new utilities such as 'dlstat' (data link statistics) and 'flowstat'.  I hadn't met Nicolas before, but wow!  Not only is Nicolas a key architect, he also has an amazing ability to explain the technology with crystal clarity.  As I said to the Product Manager, Joost Pronk, we've got to video Nicolas giving this talk once Solaris 11 ships so that the world can see it.
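
For a flavour of what this looks like at the command line, here's a minimal sketch using the new utilities (the link, VNIC, and flow names are purely illustrative):

dladm create-vnic -l net0 vnic0                                      # create a VNIC over physical link net0
flowadm add-flow -l vnic0 -a transport=tcp,local_port=80 httpflow    # define a flow for HTTP traffic
dlstat show-link vnic0                                               # per-link traffic statistics
flowstat -l vnic0                                                    # per-flow statistics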

At the end of Nicolas's presentation, Thierry Manfe showed how he is leveraging Network Virtualization in the cloud infrastructure Oracle provides to enable ISVs to test their apps with complete data integrity and segregation.  You can sign up for it; it's available now.  "Solaris 11. #1 for Clouds" isn't just some Marketing hype.  It's true.

I'm walking in the shadow of giants.  And it's a wonderful feeling.

Roll on Solaris 11.  It won't be long now and I really can't wait.  It's amazing.  Big time!

Thank you to the 90+ of you who attended the presentation Pete Dennis, Isaac Rozenfeld, and I gave on Solaris 11 Customer Maintenance Lifecycles, policies, and best practices.  If you missed it, there'll be another chance to catch an updated version with more technical content at the DOAG (German Oracle Users Group) conference in Nuremberg, Germany, in November (see previous posting for details).

Finally, I'd like to pay my respects to a true giant of our industry, Steve Jobs.  Gone way too soon.  RIP Steve.  You'll be missed.  Big time!

Best Wishes,

Gerry.

Disclaimer: Any forward looking statements in this posting are subject to the vagaries of my Crystal ball, possible hallucinations, and lack of coffee.  You get the drift.

Thursday Jun 25, 2009

Heads up on Kernel patch installation issues with jumpstart or ZFS Root

I'd like to give you a heads-up on a couple of Kernel patch installation issues:

1. There was a bug (since fixed) in the Deferred Activation Patching functionality in a ZFS Root environment.  See Sun Alert 263928.  The relevant CR is 6850329: "KU 139556-08 fails to apply on x86 systems that have ZFS root filesystems and corrupts the OS".  Although the CR title mentions x86, SPARC systems are similarly affected.  You may see an error message to the effect that a Class Action Script failed to complete and that the environment for Deferred Activation Patching could not be set up.  The following error message is returned:
mv: cannot rename /var/run/.patchSafeMode/root/lib/libc.so.1.20102 to /lib/libc.so.1: Device busy
ERROR: Move of /var/run/.patchSafeMode/root/lib/libc.so.1.20102 to dstActual failed
usage: puttext [-r rmarg] [-l lmarg] string
pkgadd: ERROR: class action script did not complete successfully

Installation of <SUNWcslr> failed.

This issue is fixed in the Patch Utilities patch, 119255-70 or later revision.

BTW: The principal reason ZFS Root support was implemented in Live Upgrade is so that patch application like this to the live boot environment would not be necessary.  With ZFS Root, creating a clone Boot Environment is so easy that there's no good reason not to.  This avoids the need for technologies such as Deferred Activation Patching, which attempt to make it safer to apply arbitrary change to a live boot environment, an inherently risky process.
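
For example, here's a minimal sketch of the clone-and-patch flow on a ZFS Root system (the BE name and patch staging directory are hypothetical):

lucreate -n patchBE                                       # clone the current boot environment
luupgrade -t -n patchBE -s /var/tmp/patches 139555-08     # patch the clone, not the live BE

Activation and the subsequent reboot are covered in Enda's notes below.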

2. There are reproducible issues when using jumpstart finish scripts (and other scenarios) to install Kernel patch 137137-09 followed by Kernel patch 139555-08.  Here's the gist of the issue, pulled from an engineering email thread on the subject:

Issue 1: I have a customer whose system is not booting after applying the patch cluster with Live Upgrade (LU).

Solution 1: If using 'luupgrade -t', you must ensure that the latest revision of the LU patch is installed first: currently 121430-36 on SPARC and 121431-37 on x86. Once these patches are installed, LU will automatically handle the build of the boot archive when 'luactivate' is called, thus avoiding the problem.
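
For instance, to check and install the LU patch before running 'luupgrade -t' (the patch staging location is illustrative):

showrev -p | grep 121430        # check the currently installed LU patch revision (SPARC)
patchadd /var/tmp/121430-36     # install the latest revision first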

Issue 2: There are other ways to get oneself into situations where a boot archive is out of sync, e.g. using jumpstart finish scripts to apply patches that include 137137-09.  Basically, any operation that patches an ABE outside of 'luupgrade' will require a manual build of the boot archive.

Solution 2: One must manually rebuild the boot archive on the /a partition after applying the patches.  Otherwise, once the system boots, the boot archive will be out of sync.

Here's some more detail on the jumpstart finish script version of this: 

We've seen the same panic a few times when the latest patch cluster is applied via a finish script to a pre-S10U6 boot environment during a jumpstart installation.  It appears that the boot archive is out of sync with the kernel on the system: the boot archive was created from the 137137-09 patch and not updated after the 139555-08 kernel was applied, hence the mismatch between the kernel and the boot archive.

In these instances, updating the boot archive allows the system to boot successfully.  Booting failsafe (ok boot -F failsafe) will detect an out-of-sync boot archive.  Execute the automated update, then reboot.  The system will now boot from the later kernel (139555-08), which was successfully installed from the finish script.
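
On SPARC, the recovery looks something like this (the prompt wording varies by release):

ok boot -F failsafe
# answer 'y' when the failsafe boot detects the out-of-sync boot archive
# and offers the automated update, then reboot normally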

I reproduced the problem in a jumpstart installation environment by applying the latest 10_Recommended patch cluster from a finish script.  The initial installation was S10U5, which is deployed from a miniroot that has no knowledge of a boot archive (my theory anyway).  This is similar to a live upgrade environment where the boot environment doing the patching is also boot-archive unaware (meaning the kernel is pre-137137-09).

In the jumpstart scenario, the immediate problem was solved by updating the boot archive via a failsafe boot as previously described.  The solution was to update the boot archive from the finish script after the patch cluster installation completed.  BTW, all patches in the patch cluster installed successfully per /var/sadm/system/logs/finish.log.

In a standard jumpstart the boot device (install target) is mounted to /a, therefore adding the following entry to the finish script solved the problem:

/a/boot/solaris/bin/create_ramdisk -R /a

Depending on the finish script configuration and variables, the following would also work:

$ROOTDIR/boot/solaris/bin/create_ramdisk -R $ROOTDIR

Issue 3: The above issues are sometimes misdiagnosed as CR 6850202: "bootadm fails to build bootarchive in certain configurations leading to unbootable system".

But CR 6850202 will only be encountered in very specific circumstances, all of which must occur in order to hit this specific bug, namely:

1. Install u6 SUNWCreq - there's no mkisofs, so a UFS boot archive is built

2. Limit /tmp to 512M - thus forcing the ufs build to happen in /var/run

3. Have a separate /var - when creating the alternate root for the Deferred Activation Patching build of the boot archive, bootadm.c lofs-mounts only "/" (with the 'nosub' option), so a separate /var is not included

4. Install 139555-08

You must have all 4 of the above in order to hit this bug, i.e. step 4 must install a Deferred Activation Patch, such as a Kernel patch associated with a Solaris 10 Update like 139555-08.

Solution 3: Removing the 512MB limit (or whatever limit has been imposed) on /tmp in /etc/vfstab, and/or adding SUNWmkcd (and probably SUNWmkcdS) so that mkisofs is available on the system, is sufficient to avoid the code path that fails this way.
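
To illustrate, the /tmp entry in /etc/vfstab changes from:

swap  -  /tmp  tmpfs  -  yes  size=512m

to:

swap  -  /tmp  tmpfs  -  yes  -

and mkisofs can be added from the install media with something like (the media path is illustrative):

pkgadd -d /cdrom/cdrom0/Solaris_10/Product SUNWmkcd SUNWmkcdS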

If you do hit this bug, booting failsafe and recreating the boot archive as described above will recover the system.

Here's further input from one of my senior engineers, Enda O'Connor:

If using Live Upgrade (LU), and LU on the live partition is up to date with the latest revision of the LU patch, 121430 (SPARC) or 121431 (x86), the boot archive will be built automatically once the user runs shutdown (after 'luactivate' to activate the new BE).  This is done from a kill script in rc0.d.
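
To illustrate that activation step (the BE name is hypothetical):

luactivate patchBE
init 6     # use 'init 6' or 'shutdown -y -g0 -i6'; a plain 'reboot' skips the rc0.d kill script that builds the boot archive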

If using a jumpstart finish script or jumpstart profile to patch a pre-U6 image with the latest kernel patches, then you need to run create_ramdisk from the finish script after all patching/packaging operations have finished.  Alternatively, you can patch your pre-U6 miniroot to the U6 SPARC NewBoot level (137137-09), at which point the modified miniroot will handle the build of the boot archive after the finish script has run.
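
Here's a hedged sketch of such a finish script fragment (the patch staging path and the patch_order list file are assumptions about how the cluster was staged):

#!/bin/sh
# Apply the patch cluster to the install target mounted at /a
/usr/sbin/patchadd -R /a -M /a/var/tmp/10_Recommended patch_order
# Rebuild the boot archive so it matches the newly patched kernel
/a/boot/solaris/bin/create_ramdisk -R /a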

If patching U6 and upwards from jumpstart, the boot archive will get built automatically after the finish script has run, so there's no issue in this scenario.

If using any home-grown technology, such as ufsrestore, cpio, or tar, to patch or install/modify software on an Alternate Boot Environment (ABE), you must always run create_ramdisk manually before booting to said ABE.

Best Wishes,

Gerry.
