Wednesday Aug 15, 2012

SPARC SuperDuperCluster

Hi Folks!

Some of you may have noticed that I've been a little quieter than usual in the last year.

Is it because I've lost interest in patching, maintenance best practices, and improving our customers' lifecycle experience?  Not a bit of it.

It's because my team and I have been rather busy - to put it mildly! - on developing the installation configuration utilities and maintenance updates for SPARC SuperCluster.

SPARC SuperCluster is so good, and the feedback from the already substantial customer base has been so positive, that I'm lobbying Marketing to rename it SPARC SuperDuperCluster.

Available in half rack (2 T4-4s, 3 Exadata Storage Cells) or full rack (4 T4-4s, 7 Exadata Storage Cells) configurations, both of which have a general purpose 7320 ZFSSA (ZFS Storage Appliance) and 3 Infiniband Switches, SPARC SuperDuperCluster is the prime example of the integrated Oracle Red Stack at its best.

It is a true example of an Engineered System, engineered with enhancements at every layer of the Red Stack to improve performance, robustness, and quality: from the phenomenal performance of the SPARC T4 chips, through the excellent LDoms (Logical Domains) virtualization layer and enhancements such as RDSv3 support in Solaris (alongside all the other great features of Solaris 11 and 10), to leveraging the performance of Infiniband, Exadata Storage Cells, and the 11gR2 database.

Seemingly paradoxically, SPARC SuperDuperCluster is both a highly flexible General Purpose "app" consolidation platform and an Engineered System, offering a wide variety of optimized configurations with various combinations of 11gR2 database domains, Solaris 11 General Purpose "app" domains, and Solaris 10 General Purpose "app" domains.

But how can SPARC SuperDuperCluster be both an Engineered System and offer extremely flexible configurations at the same time?  That's easy.  The hardware layer and cabling are fixed in an optimized fashion (Engineered).  But which apps a customer chooses to run on SuperCluster, on how many LDoms, and how much memory/CPU is allocated to each is up to them, optimized for their needs (Flexible), rather than a one-size-fits-none approach.
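
To make that flexibility concrete, domain resources are managed with the standard LDoms 'ldm' utility.  A minimal sketch, assuming a hypothetical application domain named 'appdom1' (the domain name and sizings are illustrative, not a prescribed SuperCluster configuration):

ldm list                        # show current domains and their CPU/memory allocations
ldm set-vcpu 64 appdom1         # resize the hypothetical domain to 64 virtual CPUs
ldm set-memory 128G appdom1     # allocate 128 GB of memory to it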

SPARC SuperDuperCluster is more than just the hardware and software.  It's also the extraordinary cross-organizational team that has been built around it.  From the absolute cream of Services, Support, and Sustaining, to the architects and management from Performance Technologies, to the cooperation and deep engagement between engineering teams for each layer of the Oracle Red Stack, to my own small but extremely dedicated install configuration utility and maintenance update team, it's the people behind SPARC SuperDuperCluster which ensure its success. 

Feedback from the rapidly growing customer install base worldwide is extremely positive.  To find out more, please see the SPARC SuperCluster resource page.  You'll be hearing lots more about SPARC SuperDuperCluster at Oracle Open World this year - wow, it's nearly that time of year again! - but, for once, I won't be presenting myself.

I will be there and available to meet/talk about Solaris 10 Patching, Solaris 11 SRUs (Support Repository Updates), or SPARC SuperDuperCluster.  I look forward to seeing you there!

Best Wishes,

Gerry.

Thursday Sep 09, 2010

Solaris 10 9/10 (Update 9) released

Solaris 10 9/10 (Update 9) has been released.  See here for information and here for the download (remember to accept the license agreement at the top).  There's also a podcast and a dedicated Solaris blog.

A number of technical articles have been released, including George Wilson's video overview of ZFS enhancements in Solaris 10 9/10.

As with all Solaris Updates, Solaris 10 9/10 contains all bug fixes which were available at the time that its contents were finalized, pre-applied into the Solaris Update image.

It also contains a significant number of feature enhancements as described in the above links.

The corresponding Solaris Update Patch Bundle is currently in test and I expect it to be released in a similar timeframe to previous Updates.  See http://blogs.sun.com/patch/entry/solaris_10_10_08_patch for information on Solaris Update Patch Bundles.

All standard patches in Update 9 have already been released to SunSolve and My Oracle Support (MOS).  I've updated the Solaris 10 Kernel PatchID Sequence entry below with the Kernel PatchIDs for Solaris 10 9/10 (Update 9).

As with previous Updates, there are a small number of "special" or "script" patches whose sole purpose is to correct issues in the pre-application of patches to the Solaris Update release image.  Since these patches have no purpose whatsoever outside of the Solaris Update build process, they are not released to SunSolve/MOS.  Newer "special" patches have PatchIDs of the format 800xxx to make them easily identifiable, but older "special"/"script" patches are identifiable by the words "SPECIAL PATCH" and/or "script patch" in the patch synopsis.  See the SPARC and x86 patch lists.
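
If you're scripting against those patch lists, a minimal sketch for flagging such patches (the filename 'patchlist.txt' is an illustrative stand-in for a downloaded list, and the leading-800xxx match assumes the PatchID starts each line):

# flag "special"/"script" patches by PatchID prefix or synopsis wording
egrep -i '^800|SPECIAL PATCH|script patch' patchlist.txt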

<pet peeve>

Please note it is incorrect to refer to Kernel Patch 142909-17 (SPARC) / 142910-17 (x86) as the "Update 9 Kernel patch".  It is the latest Kernel Patch included in Update 9, but this Kernel patch can equally be applied to all previous Solaris 10 releases.   Solaris Updates are built from patches (and a few new packages), patches are not built from Solaris Updates.

</pet peeve>
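
To see which Kernel patch revision a system is actually running, irrespective of which Update it was installed from, here's a quick check (using the SPARC PatchID from above):

uname -v                    # reports the kernel patch level, e.g. Generic_142909-17
showrev -p | grep 142909    # confirms the patch is recorded as applied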

Wednesday Sep 09, 2009

Patches for "Turbo-Charging SVR4 Package Install" are now available

I am delighted to announce that Casper Dik's "Turbo-Charging SVR4 Package Install" feature enhancement is now available by downloading and installing the following patches:

119254-70 (SPARC) / 119255-70 (x86): Patch utilities patch
121428-13 (SPARC) / 121429-13 (x86): Live Upgrade Zones Support Patch
121430-40 (SPARC) / 121431-41 (x86): Live Upgrade Patch
124630-28 (SPARC) / 124631-29 (x86): System Administration Applications, Network, and Core Libraries Patch

It is important to apply 119254-70 (SPARC) / 119255-70 (x86) and 121428-13 (SPARC) / 121429-13 (x86) if the system is running non-global zones.  Otherwise, booting newly installed zones will fail until the pkgserv daemon exits, about 5 minutes after 'zoneadm install' finishes.  Zones which were already installed can be booted as expected.
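
A minimal sketch for verifying the relevant revisions are in place on a zoned system (the egrep pattern simply matches the PatchIDs above):

showrev -p | egrep '119254|119255|121428|121429'   # confirm revisions -70/-13 or later
pgrep -l pkgserv   # if pkgserv is still running after 'zoneadm install', wait for it to exit before booting the new zone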

"Turbo Packaging" is designed to dramatically improve the performance of install, upgrade, Live Upgrade, and zone creation on Solaris 10.  It delivers only a small improvement for patching performance.   (See my Zones Parallel Patching blog entry for information on dramatic patching performance improvements.)

For background reading on the "Turbo Packaging" feature, please see
http://www.opensolaris.org/jive/thread.jspa?messageID=358081 and http://arc.opensolaris.org/caselog/PSARC/2009/173/

The "Turbo-charging SVR4 Package Install" feature will also be included in the forthcoming Solaris 10 Update 8 release and will be documented in the Release Notes for that release.

Great work, Casper - well done!

Friday Aug 14, 2009

Improvements to Solaris 10 Recommended and Sun Alert Patch Clusters released

My colleague, Ed Clark, has made significant improvements to the Solaris 10 Recommended and Sun Alert patch clusters.  These improvements have just been released and are in the current clusters available to contract customers from the Patch Cluster & Patch Bundle Downloads on SunSolve.

Ed's improvements include:

  • Filtering out "false negatives" from the patch utility return codes, so that if the cluster install script returns "1", you know you've got a real problem which needs investigating.  As you may know, the Solaris patch utility, 'patchadd', can return errors for some acceptable situations - for example, if the patch is already applied to the system, if a later revision of the patch or a patch which obsoletes it is already applied, or if none of the packages in the patch are on the target system (e.g. because a reduced Install Metacluster was used to install it or the system has been security hardened by package removal).  Such conditions are acceptable "errors" which do not usually require further investigation by the user.  By filtering these conditions out, if the 'installcluster' script returns "1", you know it isn't because of one of these acceptable "errors", and therefore you need to look at the logfiles to find out what's gone wrong.  (A minimal invocation sketch is shown after this list.)  For further information, please see the cluster README and Analyzing a patchadd or patchrm Failure in the Solaris OS.
  • The new 'installcluster' script will exit as soon as it encounters an unexpected failure - i.e. not one of the acceptable "errors" mentioned above.  This prevents potentially compounding issues by attempting to apply further patches.
  • The new 'installcluster' script includes context intelligence for patching operations.  It informs the user when zones need to be halted, and it provides phased installation to handle patches which absolutely require an immediate reboot before further patches can be applied.  Such interim reboots are only needed when patching a live boot environment on a system below Kernel patch 118833-36 (SPARC) / 118855-36 (x86), as well as the earlier interim reboot required on x86 relating to 'libc.so' patches and Kernel patch 118844-14.  On systems below these patch levels, the 'installcluster' script will stop at the appropriate point when patching the live boot environment, and inform the user to reboot and re-invoke the 'installcluster' script.  (The old cluster install script simply tried to carry on blindly past such interim reboots, spewing out error messages, although code in the relevant patches prevented any harm from being done.)  These interim reboots, when required, are dealt with relatively early in the cluster install sequence so that once they're completed, the Sys Admin can leave the rest of the installation to finish unattended and move on to other systems.
  • The new 'installcluster' script provides better integration with Solaris Live Upgrade as the user can now specify the Live Upgrade alternate boot environment to patch by name.
  • The new 'installcluster' script performs space checking prior to installing each patch, and will halt if it believes there is insufficient space to complete the installation successfully.  For example, this helps avoid non-global zones getting out of sync with the global zone regarding patch levels.  This is an important enhancement, as running out of space during patching can potentially leave the system in an inconsistent state and is to be avoided.  Even removing a patch requires space, so immediate removal of a patch which has failed to apply correctly due to space issues should be avoided until sufficient space is freed up and potential issues caused by its partial installation have been investigated - for example, was the undo.Z file successfully created to enable backout?  (Tip: in such circumstances it may be better to retry the patch installation once space has been freed up, rather than remove the patch.  Contact Sun Support for instructions if you encounter such issues.)  The space checking enhancements in the 'installcluster' script are designed to prevent such problems occurring.
  • The messages and log files produced by the 'installcluster' script are clear and well structured.  For example, a "failed" log is created if a patch fails to apply.  See the Cluster README for further information.
  • The 'patch_order' file places patches in an optimal order for installation to avoid known issues - for example, the patch utilities patches are installed as early in the sequence as possible to avoid hitting patch installation bugs which are fixed in the patch utility patches, and the Kernel patch procedural script override patch, 125555 (SPARC) / 125556 (x86), is ordered prior to 137137-09 (SPARC) / 137138-09 (x86) to resolve some known issues.  When patching an alternate boot environment (which is recommended), a small sub-set of pre-requisite patches, primarily the patch utility patches, need to be applied to the live boot environment to ensure correct patching operation.  The 'installcluster' script will check for these pre-requisite patches and halt installation if they are not present, advising the user of the 'installcluster' script option to use to install them.  Further patches may need to be installed on the live boot environment to support Live Upgrade.  See the cluster README for further information.
  • The patches have been moved to a 'patches' sub-directory, to de-clutter the top level directory of the unzipped cluster.
  • Please see the cluster README file for further information.  Customers should read the cluster README file and look at the Special Install Instructions in the patches within the cluster prior to installation.
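
As referenced above, here's a minimal invocation sketch.  The exact command-line options are defined in the cluster README and have changed over time, so treat this as illustrative rather than definitive:

unzip 10_Recommended.zip
cd 10_Recommended
./installcluster            # consult the cluster README for any required options
if [ $? -ne 0 ]; then
    # with the "false negative" filtering described above, a non-zero exit
    # means a real failure - inspect the log files named in the README
    echo "installcluster reported a failure - check its log files"
fi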

I really want to thank Ed Clark for the enormous amount of thought and effort he has put into improving the cluster installation experience.  The work he's done on the Solaris 10 Recommended and Sun Alert patch clusters is a continuation of his previous work on the Solaris Update Patch Bundles and the Solaris 10 Live Upgrade Zones Starter Patch Bundle.  Nice work, Ed!

While the 'installcluster' script is copyrighted, I am happy for customers to use it, and the 'patch_order' file, as a starting point for their own customized patch bundles, so long as it is for their own use and is not to be given to a 3rd party or used for commercial gain (e.g. by a 3rd party maintainer or 3rd party commercial automation tool).

We have also made significant improvements to the back end processes to ensure higher and more consistent cluster quality. 

Originally, the clusters were created by the Patch Operations and Distribution (POD) team after patch release.  The POD Cluster QA process left a lot to be desired, resulting in inconsistent cluster quality.   To plug this gap, my Patch System Test team have been testing the clusters for several years, but the old process only allowed us to test them in parallel with their release, which meant that we found issues at the same time that early downloaders of the cluster encountered them.  Although we ensured such issues were fixed as quickly as possible, it still obviously compromised our customers' experience.

In the new process, the clusters are routed to Patch System Test (PST) prior to release.  PST run a transformation script on them to optimize the patch installation order, etc.  The clusters will only be released once they have passed PST testing.  This should ensure higher and more consistent quality for customers.  Work is continuing to move the entire patch cluster generation process to PST, although such future backend enhancements should be invisible to customers.

Thursday Jun 25, 2009

Heads up on Kernel patch installation issues with jumpstart or ZFS Root

I'd like to give you a heads-up on a couple of Kernel patch installation issues:

1. There was a bug (since fixed) in the Deferred Activation Patching functionality in a ZFS Root environment.  See Sun Alert 263928.  An error message may be seen to the effect that a Class Action Script has failed to complete and that the environment for Deferred Activation Patching could not be set up.  The relevant CR is 6850329: "KU 139556-08 fails to apply on x86 systems that have ZFS root filesystems and corrupts the OS".  Despite the x86-specific CR synopsis, SPARC systems are similarly affected.  The following error message is returned:
mv: cannot rename /var/run/.patchSafeMode/root/lib/libc.so.1.20102 to /lib/libc.so.1: Device busy
ERROR: Move of /var/run/.patchSafeMode/root/lib/libc.so.1.20102 to dstActual failed
usage: puttext [-r rmarg] [-l lmarg] string
pkgadd: ERROR: class action script did not complete successfully

Installation of <SUNWcslr> failed.

This issue is fixed in the Patch Utilities patch, 119254-70 (SPARC) / 119255-70 (x86), or later revision.

BTW: The principal reason ZFS Root support was implemented in Live Upgrade is so that patch application like this to the live boot environment would not be necessary.   With ZFS Root, creating a clone Boot Environment is so easy that there's no good reason not to.   This avoids the need to use technologies such as Deferred Activation Patching which attempt to make it safer to apply arbitrary change to a live boot environment, which is an inherently risky process.
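
A minimal sketch of that safer workflow on a ZFS Root system (the boot environment name 'patchbe' and the patch location/ID are illustrative):

lucreate -n patchbe                                      # clone the live BE - near-instant on ZFS Root
luupgrade -t -n patchbe -s /var/tmp/patches 119254-70    # patch the inactive clone
luactivate patchbe                                       # select the patched BE for next boot
init 6                                                   # a single reboot activates it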

2. There are reproducible issues using jumpstart finish scripts and other scenarios to install Kernel patch 137137-09 followed by Kernel patch 139555-08.   Here's the gist of the issue which I've pulled from an engineering email thread on the subject:

Issue 1: I have a customer whose system is not booting after applying the patch cluster with Live Upgrade (LU).

Solution 1: If using 'luupgrade -t', then you must ensure that the latest revision of the LU patch is installed first - currently 121430-36 on SPARC and 121431-37 on x86.  Once these patches are installed, LU will automatically handle the build of the boot archive when 'luactivate' is called, thus avoiding the problem.

Issue 2: There are other ways to get oneself into situations where a boot archive is out of sync: e.g. using jumpstart finish scripts to apply patches that include 137137-09.  Basically, any operation that involves patching an ABE outside of 'luupgrade' will involve a manual build of the boot archive.

Solution 2: One must manually rebuild the boot archive on the /a partition after applying the patches.  Otherwise, once the system boots, the boot archive will be out of sync.

Here's some more detail on the jumpstart finish script version of this: 

We've seen the same panic a few times when the latest patch cluster is applied via a finish script to a pre-S10U6 boot environment during a jumpstart installation.  It appears that the boot archive is out of sync with the kernel on the system: the boot archive was created from the 137137-09 patch and not updated after the 139555-08 kernel was applied, hence the mismatch between the kernel and the boot archive.

In these instances updating the boot archive allows the system to boot successfully. Boot failsafe (ok boot -F failsafe) will detect an out of sync boot archive.  Execute the automated update then reboot.  This will now boot from the later kernel (139555-08) which successfully installed from the finish script.
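
A minimal recovery sketch from the OBP prompt (failsafe boot normally offers the automated update itself; the device path below is illustrative if you do it by hand):

ok boot -F failsafe
# if updating manually rather than accepting the automated update:
mount /dev/dsk/c0t0d0s0 /a       # mount the root slice - device is illustrative
bootadm update-archive -R /a     # rebuild the boot archive for that root
umount /a
reboot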

I reproduced the problem in a jumpstart installation environment applying the latest 10_Recommended patch cluster from a finish script.  The initial installation was S10U5, which is deployed from a miniroot that has no knowledge of a boot archive (my theory anyway).  This is similar to a Live Upgrade environment if the boot environment doing the patching is also boot-archive unaware (meaning its kernel is pre-137137-09).

In the jumpstart scenario the immediate problem was solved by updating the boot archive by booting failsafe as previously described.  The longer-term solution was to update the boot archive from the finish script after the patch cluster installation completed.  BTW, all patches in the patch cluster installed successfully per /var/sadm/system/logs/finish.log.

In a standard jumpstart the boot device (install target) is mounted to /a, so adding the following entry to the finish script solved the problem:

/a/boot/solaris/bin/create_ramdisk -R /a

Depending on the finish script configuration and variables, the following would also work:

$ROOTDIR/boot/solaris/bin/create_ramdisk -R $ROOTDIR

Issue 3: The above issues are sometimes misdiagnosed as CR 6850202: "bootadm fails to build bootarchive in certain configurations leading to unbootable system".

But CR 6850202 will only be encountered in very specific circumstances, all of which must occur in order to hit this specific bug, namely:

1. Install U6 SUNWCreq - there's no mkisofs, so we build a UFS boot archive

2. Limit /tmp to 512M - thus forcing the ufs build to happen in /var/run

3. Have a separate /var - bootadm.c only lofs nosub mounts "/" when creating the alternate root for the DAP patching build of the boot archive

4. Install 139555-08

You must have all 4 of above in order to hit this, i.e. step 4 must be installing a DAP patch such as a Kernel patch associated with a Solaris 10 Update such as 139555-08. 

Solution 3: Removing the 512MB limit (or whatever limit has been imposed) on /tmp in /etc/vfstab, and/or adding SUNWmkcd (and probably SUNWmkcdS) so that mkisofs is available on the system, is sufficient to avoid the code path that fails this way.
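
A minimal sketch for checking whether a system matches the trigger conditions above (standard Solaris 10 commands; adjust for your configuration):

pkginfo SUNWmkcd            # is the mkisofs package installed?
grep -w /tmp /etc/vfstab    # look for a size= mount option limiting /tmp
df -h /var                  # confirm whether /var is a separate filesystem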

Booting failsafe and recreating the boot archive will successfully recreate the boot archive.

Here's further input from one of my senior engineers, Enda O'Connor:

If using Live Upgrade (LU), and LU on the live partition is up to date in terms of the latest revision of the LU patch, 121430 (SPARC) and 121431 (x86), the boot archive will be built automatically once the user runs shutdown (after 'luactivate' to activate the new BE).  This is done from a kill script in rc0.d.

If using a jumpstart finish script, or jumpstart profile to patch a pre-U6 image with latest kernel patches, then you need to run create_ramdisk from the finish script after all patching/packaging operations have been finished.  Alternatively, you can patch your pre-U6 miniroot to the U6 SPARC NewBoot level (137137-09), at which point the modified miniroot will handle the build of the boot_archive after the finish script has run.

If patching U6 and upwards from jumpstart, the boot archive will get built automatically after the finish script has run, so there's no issue in this scenario.

If using any home grown technology to patch or install/modify software on an Alternate Boot Environment ( ABE ), such as ufsrestore/cpio/tar for example, you must always run create_ramdisk manually before booting to said ABE.
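
A minimal sketch for that last case, assuming the ABE is mounted at /mnt (the mount point is illustrative; the pattern matches the /a example earlier):

# after ufsrestore/cpio/tar has populated the ABE, rebuild its boot archive
/mnt/boot/solaris/bin/create_ramdisk -R /mnt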

Best Wishes,

Gerry.

Wednesday Jan 09, 2008

Patch Install Downtime Requirements

As mentioned previously, Solaris Live Upgrade can help minimize the downtime associated with patching, by enabling users to patch an inactive boot environment.  When all modifications have been made to the inactive boot environment, a single reboot is required to activate it.  Also, most Special Install Instructions specified in patch READMEs can be ignored when patching an inactive boot environment.

When patching a live boot environment, certain patches require system downtime in order to complete their installation.

Such requirements will be specified in the patch README file and are also recorded in the SUNW_PATCH_PROPERTIES field of the patch's pkginfo file(s).

As mentioned previously, the problem with patching a live boot environment (without Deferred Activation Patching) is that some objects delivered in a patch, such as shared objects, may be invoked immediately while other objects, such as genunix, will only be activated when the system is rebooted.  Also, processes already running in memory may be using the old versions of objects while processes started after the patch(es) were applied will be using the new versions of the same objects (when Deferred Activation Patching isn't specified).  In some cases this is harmless.  In other cases, the system may be in a potentially inconsistent state until it is rebooted.

Some patches require that they be installed in Single User Mode (run level S) when applied to a live boot environment.  This is to ensure that the system is in a quiesced state to avoid the above potential problems.  This will be specified in the patch README file.  Also, the SUNW_PATCH_PROPERTIES field in the patch's pkginfo file(s) will contain the entry singleuser_required.

Some patches require that the system be rebooted at some convenient point after the patch is applied in order to activate its contents. Either a normal reboot or a reconfigure reboot may be required. This will be specified in the patch README file.  Also, the SUNW_PATCH_PROPERTIES field in the patch's pkginfo file(s) will contain the entry rebootafter or reconfigafter.  For such patches, the system remains in a consistent state until the reboot takes place.  The reboot is simply to allow the changes supplied by the patch to be activated. If a reconfigure reboot is required, application of the patch will cause the creation of a /.reconfigure file which will result in a reconfigure reboot when the system is rebooted.

Some patches require that the system be rebooted immediately after the patch is applied to a live boot environment.  For such patches, the system is in a potentially inconsistent state until the reboot takes place.  Either a normal reboot or a reconfigure reboot may be required. This will be specified in the patch README file.  Also, the SUNW_PATCH_PROPERTIES field in the patch's pkginfo file(s) will contain the entry rebootimmediate or reconfigimmediate.  If a reconfigure reboot is required, application of the patch will cause the creation of a /.reconfigure file which will result in a reconfigure reboot when the system is rebooted.  Normally, it is OK to apply further patches to the live boot environment before initiating the reboot.  However, normal operations must not be resumed until after the reboot is performed.  On the rare occasion where the reboot must be instigated before any further patches are applied, such as is the case with Solaris 10 Kernel patch 118833-36 (SPARC) / 118855-36 (x86), such patches will typically contain code to prevent further patches from being applied as an added safety mechanism.
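
A minimal sketch for checking these properties in an unzipped patch before applying it to a live boot environment (the patch ID and download location are illustrative):

# look for singleuser_required, rebootafter, reconfigafter,
# rebootimmediate or reconfigimmediate across the patch's packages
grep SUNW_PATCH_PROPERTIES /var/tmp/118833-36/*/pkginfo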

