Tuesday Apr 22, 2014

Configure failover LDom (Oracle VM Server for SPARC) on Solaris Cluster 4.1 by using 'live' migration

This blog shows an example of how to configure the 'Oracle Solaris Cluster Data Service for Oracle VM Server for SPARC' on Solaris Cluster 4.1, and it also mentions some hints around such a configuration. For this setup, Solaris Cluster 4.1 SRU3 or higher and Oracle VM Server for SPARC 3.0 or higher are required.
Essentially, this is a summary of
Oracle Solaris Cluster Data Service for Oracle VM Server for SPARC Guide
and
Oracle VM Server for SPARC 3.0 Administration Guide.
Please check these guides for further restrictions and requirements.

This procedure is specifically for 'live' migration of guest LDoms, which means the OS in the LDom is not shut down during the failover. In earlier OVM releases this was called 'warm' migration; however, the word 'live' is used in this example. A 'cold' migration means that the OS in the guest LDom is stopped before the migration.

Let's start:
The necessary services must be identical on all the potential control domains (primary domains) which run as Solaris Cluster 4.1 nodes. It is assumed that Oracle VM Server for SPARC is already installed.

1) Prepare all primary domains which should manage the failover LDom with the necessary services.
all_primaries# ldm add-vconscon port-range=5000-5100 primary-vcc0 primary
all_primaries# svcadm enable svc:/ldoms/vntsd:default
all_primaries# ldm add-vswitch net-dev=net0 public-vsw1 primary
all_primaries# ldm add-vdiskserver primary-vds0 primary

To verify:
all_primaries# ldm list-bindings primary


2) Set failure-policy on all primary domains:
all_primaries# ldm set-domain failure-policy=reset primary
To verify:
all_primaries# ldm list -o domain primary


3) Create failover guest domain (fgd0) on one primary domain.
Simple example:
primaryA# ldm add-domain fgd0
primaryA# ldm set-vcpu 16 fgd0
primaryA# ldm set-mem 8G fgd0



4) Add public network to failover guest domain:
primaryA# ldm add-vnet public-net0 public-vsw1 fgd0
To verify:
primaryA# ldm list-bindings fgd0

For more details on setting up guest LDoms, refer to the Oracle VM Server for SPARC 3.0 Administration Guide.


5) Set necessary values on failover guest domain fgd0:
primaryA# ldm set-domain master=primary fgd0
primaryA# ldm set-var auto-boot?=false fgd0

To verify run:
primaryA# ldm list -o domain fgd0
auto-boot?=false is a "must have" to prevent data corruption. More details are available in DocID 1585422.1 Solaris Cluster and HA VM Server Agent SUNW.ldom Data Corruption may Occur in a failover Guest Domain when "auto-boot?=true" is set


6) Select boot device for failover guest domain fgd0.
Possible options for the root file system of a domain with 'live' migration are: Solaris Cluster global filesystem (UFS/SVM), NFS, iSCSI, and SAN LUNs, because all of them are accessible from both nodes at the same time. The recommendation for 'live' migration is to use a full raw disk. The full raw disk can be provided via SAN or iSCSI to all primary domains.
Remember that ZFS as the root file system can ONLY be used for 'cold' migration, because 'live' migration requires both nodes to access the root file system at the same time, which is not possible with ZFS.
Using a Solaris Cluster global filesystem is an alternative, but the performance is not as good as root on a raw disk.
Details available in DocID 1366967.1 Solaris Cluster Root Filesystem Configurations for a Guest LDom Controlled by a SUNW.ldom Resource

So, root on a raw disk is selected.
Add boot device to fgd0:
all_primaries# ldm add-vdsdev /dev/did/rdsk/d7s2 boot_fgd0@primary-vds0
primaryA# ldm add-vdisk root_fgd0 boot_fgd0@primary-vds0 fgd0
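To verify that the DID device behind the boot disk is visible from all primary domains (d7 is the DID device used above; a quick check, not part of the original procedure):
all_primaries# cldev list -v d7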



6a) Optional: Configure the MAC addresses of the LDom. The LDom Manager assigns MAC addresses automatically, but the following issues can occur:
* Duplicate MAC addresses if other guest LDoms are down when creating a new LDom.
* The MAC address can change after a failover of the LDom.
Assigning your own MAC addresses is possible. This example uses the suggested range 00:14:4F:FC:00:00 – 00:14:4F:FF:FF:FF as described in
Assigning MAC Addresses Automatically or Manually of Oracle VM Server for SPARC 3.0 Administration Guide.
Example:
Identify current automatically assigned MAC addresses
primaryA# ldm list -l fgd0
To see the HOSTID, which is derived from the MAC address, an 'ldm bind fgd0' is necessary. Unbind fgd0 afterwards with 'ldm unbind fgd0'.
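For example:
primaryA# ldm bind fgd0
primaryA# ldm list -l fgd0
primaryA# ldm unbind fgd0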
MAC: 00:14:4f:fb:50:dc → change to 00:14:4f:fc:50:dc
HOSTID: 0x84fb50dc → change to 0x84fc50dc
public-net: 00:14:4f:fa:01:49 → change to 00:14:4f:fc:01:49
primaryA# ldm set-domain mac-addr=00:14:4f:fc:50:dc fgd0
primaryA# ldm set-domain hostid=0x84fc50dc fgd0
primaryA# ldm set-vnet mac-addr=00:14:4f:fc:01:49 public-net0 fgd0
primaryA# ldm list-constraints fgd0 (this now shows the assigned MAC addresses)
If it is necessary to change the MAC addresses on an already configured failover guest LDom, refer to DocID 1559415.1 Solaris Cluster HA-LDom Agent do not Preserve hostid and MAC Address Upon Failover


7) Bind and start the fgd0
primaryA# ldm bind fgd0
primaryA# ldm start fgd0


8) Log in to the LDom using the console:
primaryA# telnet localhost 5000


9) Install Solaris 10 or Solaris 11 in the LDom by using an install server.
To identify the MAC address of the LDom, run the following in the console of fgd0:
{0} ok devalias net
{0} ok cd /virtual-devices@100/channel-devices@200/network@0
{0} ok .properties
local-mac-address 00 14 4f fc 01 49

For other installation methods, please refer to Installing Oracle Solaris OS on a Guest Domain of the Oracle VM Server for SPARC 3.0 Administration Guide.


10) Install HA-LDom (HA for Oracle VM Server Package) on all primary domain nodes if not already done
all_primaries# pkg info ha-cluster/data-service/ha-ldom
all_primaries# pkg install ha-cluster/data-service/ha-ldom



11) Check that 'cluster' is the first entry in the name service switch configuration (/etc/nsswitch.conf).
all_primaries# svccfg -s name-service/switch listprop config/host
config/host astring "files dns"
all_primaries# svccfg -s name-service/switch listprop config/ipnodes
config/ipnodes astring "files dns"
all_primaries# svccfg -s name-service/switch listprop config/netmask
config/netmask astring files
If not, add it:
all_primaries# svccfg -s name-service/switch setprop config/host = astring: '("cluster files dns")'
all_primaries# svccfg -s name-service/switch setprop config/ipnodes = astring: '("cluster files dns")'
all_primaries# svccfg -s name-service/switch setprop config/netmask = astring: '("cluster files")'
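Note: After changing these properties, refresh the name service switch service so that the new values take effect (this also regenerates /etc/nsswitch.conf):
all_primaries# svcadm refresh name-service/switch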

More Details in DocID 1554887.1 Solaris Cluster: HA LDom Migration Fails With "Failed to establish connection with ldmd(1m) on target"


12) Create the resource group for failover LDom fgd0 on the primary domains:
primaryA# clrg create -n primaryA,primaryB fldom-rg


13) Register SUNW.HAStoragePlus if not already done:
primaryA# clrt register SUNW.HAStoragePlus


14) Create HAStoragePlus resource for boot device:
primaryA# clrs create -g fldom-rg -t SUNW.HAStoragePlus -p GlobalDevicePaths=/dev/global/dsk/d7s2 fgd0-has-rs
Using slice 2 (the full disk, d7s2) is a requirement!


15) Enable the LDom resource group on the current node:
primaryA# clrg online -M -n primaryA fldom-rg
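To verify that the resource group is online on the current node:
primaryA# clrg status fldom-rg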


16) Register SUNW.ldom
primaryA# clrt register SUNW.ldom


17) Set up the password file for non-interactive 'live' migration on all primary nodes:
all_primaries# vi /.pass
Add the root password to this file.
all_primaries# chmod 600 /.pass
Requirements:
* The first line of the file must contain the password
* The password must be plain text
* The password must not exceed 256 characters in length
A newline character at the end of the password and all lines that follow the first line are ignored.
These details are from Performing Non-Interactive Migrations in the Oracle VM Server for SPARC 3.0 Administration Guide.
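Optional: as a sanity check (a hedged example, not part of the cluster procedure itself), the password file can be tested with a non-interactive dry-run migration before the cluster agent takes control; primaryB is the assumed target control domain. The -n option only checks whether the migration would succeed, it does not move the domain:
primaryA# ldm migrate-domain -n -p /.pass fgd0 primaryB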



18) Create SUNW.ldom resource
primaryA# clrs create -g fldom-rg -t SUNW.ldom -p Domain_name=fgd0 -p Password_file=/.pass -p resource_dependencies=fgd0-has-rs fgd0-rs

Notice: Solaris Cluster retrieves the domain configuration with the "ldm list-constraints -x <ldom>" command and stores it in the CCR. This information is used to create or destroy the domain on the node where the resource group is brought online or offline.
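To look at the XML configuration which gets captured (an illustration of the command mentioned above):
primaryA# ldm list-constraints -x fgd0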


19) Check Migration_type property. It should be MIGRATE for 'live' migration:
primaryA# clrs show -v fgd0-rs | grep Migration_type
If not MIGRATE then set it:
primaryA# clrs set -p Migration_type=MIGRATE fgd0-rs


20) To stop/start the SUNW.ldom resource
primaryA# clrs disable fgd0-rs
primaryA# clrs enable fgd0-rs


21) Verify the setup by switching failover LDom to other node and back.
primaryA# clrg switch -n primaryB fldom-rg
primaryA# clrg switch -n primaryA fldom-rg
To monitor the migration process, run 'ldm list -o status fgd0' on the target primary domain.


22) Tune your timeout values depending on your system.
primaryA# clrs set -p STOP_TIMEOUT=1200 fgd0-rs
Details in DocID 1423937.1 Solaris Cluster: HA LDOM Migration Fails With "Migration of domain timed out, the domain state is now shut off"


23) Consider further tuning of timeout values as described in
SPARC: Tuning the HA for Oracle VM Server Fault Monitor of Oracle Solaris Cluster Data Service for Oracle VM Server for SPARC Guide
For less frequent probing, the following settings can be used, for example:
primaryA# clrs set -p Thorough_probe_interval=180 -p Probe_timeout=90 fgd0-rs

Last but not least, it's not supported to run Solaris Cluster within a failover LDom!

Wednesday Jul 04, 2012

How to configure a zone cluster on Solaris Cluster 4.0

This is a short overview on how to configure a zone cluster on Solaris Cluster 4.0. This is a little different from Solaris Cluster 3.2/3.3 because Solaris Cluster 4.0 only runs on Solaris 11. The name of the zone cluster must be unique throughout the global Solaris Cluster and must be configured on a global Solaris Cluster. Please read all the requirements for zone clusters in the Solaris Cluster Software Installation Guide for SC4.0.
For Solaris Cluster 3.2/3.3 please refer to my previous blog Configuration steps to create a zone cluster in Solaris Cluster 3.2/3.3.

A. Configure the zone cluster into the already running global cluster
  • Check if zone cluster can be created
    # cluster show-netprops
    to change the number of zone clusters use
    # cluster set-netprops -p num_zoneclusters=12
    Note: 12 zone clusters is the default; the value can be customized!

  • Create config file (zc1config) for zone cluster setup e.g:
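    The command file uses the clzc configure command syntax; a minimal sketch for a two-node zone cluster could look like the following (zonepath, physical hosts, hostnames, IP addresses and network interface are assumptions that must be adapted):
    create
    set zonepath=/zones/zc1
    add node
    set physical-host=node1
    set hostname=zc1-node1
    add net
    set address=192.168.10.11
    set physical=net0
    end
    end
    add node
    set physical-host=node2
    set hostname=zc1-node2
    add net
    set address=192.168.10.12
    set physical=net0
    end
    end
    commit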

  • Configure zone cluster
    # clzc configure -f zc1config zc1
    Note: If not using a config file, the configuration can also be done interactively with # clzc configure zc1

  • Check zone configuration
    # clzc export zc1

  • Verify zone cluster
    # clzc verify zc1
    Note: The following message is a notice and comes up on several clzc commands
    Waiting for zone verify commands to complete on all the nodes of the zone cluster "zc1"...

  • Install the zone cluster
    # clzc install zc1
    Note: Monitor the consoles of the global zone to see how the install proceeds! (The output is different on the nodes.) It's very important that all global cluster nodes have the same set of ha-cluster packages installed!
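    A simple way to compare the installed cluster packages across the global cluster nodes (run on every node):
    # pkg list 'ha-cluster/*'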

  • Boot the zone cluster
    # clzc boot zc1

  • Log in to the non-global zones of zone cluster zc1 on all nodes and finish the Solaris installation.
    # zlogin -C zc1

  • Check status of zone cluster
    # clzc status zc1

  • Log in to the non-global zones of zone cluster zc1 and configure the shell environment for root (for PATH: /usr/cluster/bin, for MANPATH: /usr/cluster/man)
    # zlogin -C zc1

  • If using an additional name service, configure /etc/nsswitch.conf of the zone cluster non-global zones:
    hosts: cluster files
    netmasks: cluster files

  • Configure /etc/inet/hosts of the zone cluster zones
    Enter all the logical hosts of non-global zones



B. Add resource groups and resources to zone cluster
  • Create a resource group in zone cluster
    # clrg create -n <zone-hostname-node1>,<zone-hostname-node2> app-rg

    Note1: Use command # cluster status for zone cluster resource group overview.
    Note2: You can also run all zone cluster commands from the global cluster by adding the -Z option to the command, e.g.:
    # clrg create -Z zc1 -n <zone-hostname-node1>,<zone-hostname-node2> app-rg

  • Set up the logical host resource for zone cluster
    In the global zone do:
    # clzc configure zc1
    clzc:zc1> add net
    clzc:zc1:net> set address=<zone-logicalhost-ip>
    clzc:zc1:net> end
    clzc:zc1> commit
    clzc:zc1> exit
    Note: Check that logical host is in /etc/hosts file
    In zone cluster do:
    # clrslh create -g app-rg -h <zone-logicalhost> <zone-logicalhost>-rs

  • Set up storage resource for zone cluster
    Register HAStoragePlus
    # clrt register SUNW.HAStoragePlus

    Example1) ZFS storage pool
    In the global zone do:
    Configure the zpool, e.g.: # zpool create zdata mirror cXtXdX cXtXdX
    and
    # clzc configure zc1
    clzc:zc1> add dataset
    clzc:zc1:dataset> set name=zdata
    clzc:zc1:dataset> end
    clzc:zc1> verify
    clzc:zc1> commit
    clzc:zc1> exit
    Check setup with # clzc show -v zc1
    In the zone cluster do:
    # clrs create -g app-rg -t SUNW.HAStoragePlus -p zpools=zdata app-hasp-rs


    Example2) HA filesystem
    In the global zone do:
    Configure SVM diskset and SVM devices.
    and
    # clzc configure zc1
    clzc:zc1> add fs
    clzc:zc1:fs> set dir=/data
    clzc:zc1:fs> set special=/dev/md/datads/dsk/d0
    clzc:zc1:fs> set raw=/dev/md/datads/rdsk/d0
    clzc:zc1:fs> set type=ufs
    clzc:zc1:fs> add options [logging]
    clzc:zc1:fs> end
    clzc:zc1> verify
    clzc:zc1> commit
    clzc:zc1> exit
    Check setup with # clzc show -v zc1
    In the zone cluster do:
    # clrs create -g app-rg -t SUNW.HAStoragePlus -p FilesystemMountPoints=/data app-hasp-rs


    Example3) Global filesystem as loopback file system
    In the global zone, configure the global filesystem and add it to /etc/vfstab on all global nodes, e.g.:
    /dev/md/datads/dsk/d0 /dev/md/datads/dsk/d0 /global/fs ufs 2 yes global,logging
    and
    # clzc configure zc1
    clzc:zc1> add fs
    clzc:zc1:fs> set dir=/zone/fs (zc-lofs-mountpoint)
    clzc:zc1:fs> set special=/global/fs (globalcluster-mountpoint)
    clzc:zc1:fs> set type=lofs
    clzc:zc1:fs> end
    clzc:zc1> verify
    clzc:zc1> commit
    clzc:zc1> exit
    Check setup with # clzc show -v zc1
    In the zone cluster do: (Create scalable rg if not already done)
    # clrg create -p desired_primaries=2 -p maximum_primaries=2 app-scal-rg
    # clrs create -g app-scal-rg -t SUNW.HAStoragePlus -p FilesystemMountPoints=/zone/fs hasp-rs

    More details on adding storage are available in the Installation Guide for zone clusters.

  • Switch resource group and resources online in the zone cluster
    # clrg online -eM app-rg
    # clrg online -eM app-scal-rg

  • Test: Switch the resource group in the zone cluster to the other node
    # clrg switch -n zonehost2 app-rg
    # clrg switch -n zonehost2 app-scal-rg

  • Add supported dataservice to zone cluster
    Documentation for SC4.0 is available here



  • Example output:



    Appendix: To delete a zone cluster do:
    # clrg delete -Z zc1 -F +

    Note: Zone cluster uninstall can only be done if all resource groups are removed in the zone cluster. The command 'clrg delete -F +' can be used in zone cluster to delete the resource groups recursively.
    # clzc halt zc1
    # clzc uninstall zc1

    Note: If the clzc command does not successfully uninstall the zone, run 'zoneadm -z zc1 uninstall -F' on the nodes where zc1 is configured
    # clzc delete zc1

Monday Dec 19, 2011

More robust control of zfs in Solaris Cluster 3.x

In some situations there is a possibility that a zpool will not be exported correctly if it is controlled by a SUNW.HAStoragePlus resource. Please refer to the details in Document 1364018.1: Potential Data Integrity Issues After Switching Over a Solaris Cluster High Availability Resource Group With Zpools

I'd like to mention this because ZFS is used more and more in Solaris Cluster environments. Therefore I highly recommend installing the following patches to get a more reliable Solaris Cluster environment in combination with zpools on SC3.3 and SC3.2. So, if you are already running such a setup, start planning NOW to install the following patch revisions (or higher) for your environment...

Solaris Cluster 3.3:
145333-10 Oracle Solaris Cluster 3.3: Core Patch for Oracle Solaris 10
145334-10 Oracle Solaris Cluster 3.3_x86: Core Patch for Oracle Solaris 10_x86

Solaris Cluster 3.2
144221-07 Solaris Cluster 3.2: CORE patch for Solaris 10
144222-07 Solaris Cluster 3.2: CORE patch for Solaris 10_x86

Wednesday Aug 10, 2011

Oracle Solaris Cluster 3.3 05/11 Update1 patches

The Oracle Solaris Cluster 3.3 05/11 Update1 is released. Click here for further information.

The package versions of Oracle Solaris Cluster 3.3 05/11 Update1 are the same for the core framework and the agents as for Oracle Solaris Cluster 3.3. Therefore it's possible to patch up an existing Oracle Solaris Cluster 3.3.

The package versions of Oracle Solaris Cluster Geographic Edition 3.3 05/11 Update1 are NOT the same as for Oracle Solaris Cluster Geographic Edition 3.3. But it's possible to upgrade the Geographic Edition 3.3 without interruption of the service. See the documentation for details.

The following patches (with the mentioned revision) are included/updated in Oracle Solaris Cluster 3.3 05/11 Update1. If these patches are installed on an Oracle Solaris Cluster 3.3 release, then the features for framework & agents are identical to Oracle Solaris Cluster 3.3 05/11 Update1. It's always necessary to read the "Special Install Instructions" of a patch, but I made a note behind some patches where it's especially important to read them (using the shortcut SIIOTP).

Included/updated patch revisions of Oracle Solaris Cluster 3.3 05/11 Update1 for Solaris 10 10/09 Update8 or higher
145333-08 Oracle Solaris Cluster 3.3 5/11: Core Patch for Oracle Solaris 10 Note: Please read SIIOTP
145335-07 Oracle Solaris Cluster 3.3 5/11: HA-Oracle Patch for Oracle Solaris 10 Note: Please read SIIOTP
145337-03 Oracle Solaris Cluster 3.3: HA-SunONE Message queue Patch for Oracle Solaris 10
145341-03 Oracle Solaris Cluster 3.3: HA-Tomcat Patch for Oracle Solaris 10 Note: Please read SIIOTP
145343-03 Oracle Solaris Cluster 3.3 5/11: HA-BEA WebLogic Patch for Oracle Solaris 10
145345-02 Oracle Solaris Cluster 3.3: HA-SunONE App Server Patch for Oracle Solaris 10 Note: Please read SIIOTP
145636-02 Oracle Solaris Cluster 3.3: HA-JavaWebServer Patch for Oracle Solaris 10 Note: Please read SIIOTP
145638-04 Oracle Solaris Cluster 3.3 5/11: SC Checks / sccheck Patch for Oracle Solaris 10 Note: Please read SIIOTP
145640-04 Oracle Solaris Cluster 3.3 5/11: SunPlex/Managability and Serviceability Patch Note: Please read SIIOTP
145644-02 Oracle Solaris Cluster 3.3: HA-PostgreSQL Patch for Oracle Solaris 10
145646-03 Oracle Solaris Cluster 3.3: HA-MySQL Patch for Oracle Solaris 10
146085-03 Oracle Solaris Cluster 3.3: SDS/SVM Mediator patch for Oracle Solaris 10
146087-02 Oracle Solaris Cluster 3.3: HA-Apache Patch for Oracle Solaris 10
146091-04 Oracle Solaris Cluster 3.3 5/11: HA-SAP WEB AS for Oracle Solaris 10 Note: Please read SIIOTP
146093-02 Oracle Solaris Cluster 3.3: HA-NFS for Oracle Solaris 10
146241-04 Oracle Solaris Cluster 3.3: SWIFTAlliance patch
146760-02 Oracle Solaris Cluster 3.3: HA-Oracle E-business suite
146762-03 Oracle Solaris Cluster 3.3: HA-xVM Patch
146764-04 Oracle Solaris Cluster 3.3: HA-Sybase Patch
147073-02 Oracle Solaris Cluster 3.3: Man Pages Patch
147077-01 Oracle Solaris Cluster 3.3: HA-Containers Patch for Oracle Solaris 10
147079-01 Oracle Solaris Cluster 3.3: HA-OracleBIE Patch
147080-01 Oracle Solaris Cluster 3.3: HA-Kerberos Patch
147082-01 Oracle Solaris Cluster 3.3: HA-DB patch
147084-01 Oracle Solaris Cluster 3.3: HA-DNS
147086-02 Oracle Solaris Cluster 3.3: HA-Siebel Patch
147087-01 Oracle Solaris Cluster 3.3: HA-LiveCache Patch
147089-01 Oracle Solaris Cluster 3.3: HA-SAP Patch
147091-03 Oracle Solaris Cluster 3.3 5/11: HA-SAPDB Patch
147095-01 Oracle Solaris Cluster 3.3: HA-SUNWscTimesTen Patch
147141-01 Oracle Solaris Cluster 3.3: Localization patch for Solaris 10 sparc
147358-01 Oracle Solaris Cluster 3.3 5/11: HA-Agfa Patch


Included/updated patch revisions of Solaris Cluster 3.3 05/11 Update1 for Solaris 10 x86 10/09 Update8 or higher
145334-08 Oracle Solaris Cluster 3.3_x86 5/11: Core Patch for Oracle Solaris 10_x86 Note: Please read SIIOTP
145336-07 Oracle Solaris Cluster 3.3_x86 5/11: HA-Oracle Patch for Oracle Solaris 10_x86 Note: Please read SIIOTP
145338-02 Oracle Solaris Cluster 3.3: HA-SunONE Message queue Patch for Oracle Solaris 10_x86
145342-03 Oracle Solaris Cluster 3.3: HA-Tomcat Patch for Oracle Solaris 10_x86 Note: Please read SIIOTP
145344-03 Oracle Solaris Cluster 3.3 5/11: HA-BEA WebLogic Patch for Oracle Solaris 10_x86
145346-02 Oracle Solaris Cluster 3.3: HA-SunONE App Server Patch for Solaris 10_x86 Note: Please read SIIOTP
145637-02 Oracle Solaris Cluster 3.3: HA-JavaWebServer Patch for Oracle Solaris 10_x86 Note: Please read SIIOTP
145639-04 Oracle Solaris Cluster 3.3 5/11: SC Checks / sccheck Patch for Solaris 10_x86 Note: Please read SIIOTP
145641-04 Oracle Solaris Cluster 3.3 5/11: SunPlex/Managability and Serviceability Patch Note: Please read SIIOTP
145645-02 Oracle Solaris Cluster 3.3: HA-PostgreSQL Patch for Oracle Solaris 10_x86
145647-03 Oracle Solaris Cluster 3.3: HA-MySQL Patch for Oracle Solaris 10_x86
146086-03 Oracle Solaris Cluster 3.3_x86: SDS/SVM Mediator patch for Oracle Solaris 10
146088-02 Oracle Solaris Cluster 3.3_x86: HA-Apache Patch for Oracle Solaris 10_x86
146092-04 Oracle Solaris Cluster 3.3_x86 5/11: HA-SAP WEB AS for Oracle Solaris 10_x86 Note: Please read SIIOTP
146094-02 Oracle Solaris Cluster 3.3_x86: HA-NFS for Oracle Solaris 10_x86 Note: Please read SIIOTP
146766-02 Oracle Solaris Cluster 3.3_x86: HA-Sybase patch
147074-02 Oracle Solaris Cluster 3.3_x86: Man Pages Patch
147078-01 Oracle Solaris Cluster 3.3_x86: HA-Containers Patch for Oracle Solaris 10_x86
147081-01 Oracle Solaris Cluster 3.3_x86: HA-Kerberos Patch
147083-01 Oracle Solaris Cluster 3.3_x86: HA-DB patch
147085-01 Oracle Solaris Cluster 3.3_x86: HA-DNS
147088-01 Oracle Solaris Cluster 3.3_x86: HA-LiveCache Patch
147090-01 Oracle Solaris Cluster 3.3_x86: HA-SAP Patch
147092-03 Oracle Solaris Cluster 3.3_x86 5/11: HA-SAPDB Patch
147142-01 Oracle Solaris Cluster 3.3: Localization patch for Solaris 10_x86


The quorum server is an alternative to the traditional quorum disk. The quorum server is outside of the Oracle Solaris Cluster and will be accessed through the public network. Therefore the quorum server can be a different architecture.
Note: The quorum server software is only required on the quorum server and NOT on the Solaris Cluster nodes which are using the quorum server.

Included/updated patch revisions in Oracle Solaris Cluster 3.3 05/11 Update1 for quorum server feature:
146089-03 Oracle Solaris Cluster 3.3: Quorum Server patch
146090-03 Oracle Solaris Cluster 3.3_x86: Quorum Server patch for Oracle Solaris 10


Beware of the "Install Requirements" of the Oracle Solaris Cluster Framework patches.
For further details refer to Overview of Patching Oracle Solaris Cluster in the Oracle Solaris Cluster System Administration Guide.

Last but not least, the complete current lists can be found at:
Oracle Solaris Cluster 3.3/SPARC patches for Solaris 10
Oracle Solaris Cluster 3.3/x86 patches for Solaris 10

Tuesday Mar 15, 2011

Setup of local zpool on local or shared device with Solaris Cluster

Maybe there is a need (for whatever reason) to configure a local zpool in a Solaris Cluster environment. By local zpool I mean a zpool that should only be available on one Solaris Cluster node WITHOUT using SUNW.HAStoragePlus. Such a local zpool can be configured with local devices (only connected to one node) or shared devices (accessible from all nodes in the cluster via SAN). However, in the case of a shared device, it would be better to configure a zone on the SAN switch to make the device available to only one host.

The following procedure is necessary to use local devices in local zpool:

In this example I use the local device c1t3d0 to create a local zpool
a) Look for the did device of the device which should be used by the zpool
# scdidadm -l c1t3d0
49 node0:/dev/rdsk/c1t3d0 /dev/did/rdsk/d49
b) Check the settings of the used did device
# cldg show dsk/d49
Note: Only one node should be in the node list
c) Set localonly flag for the did device. Optional: set autogen flag
# cldg set -p localonly=true -p autogen=true dsk/d49
or disable fencing for the did device
# cldev set -p default_fencing=nofencing d49
d) Verify the settings
# cldg show dsk/d49
e) Create the zpool
# zpool create localpool c1t3d0
# zfs create localpool/data


The following procedure is necessary to use shared devices in local zpool:

In this example I use the shared device c6t600C0FF00000000007BA1F1023AE1711d0 to create a local zpool
a) Look for the did device of the device which should be used by the zpool
# scdidadm -L c6t600C0FF00000000007BA1F1023AE1711d0
11 node0:/dev/rdsk/c6t600C0FF00000000007BA1F1023AE1710d0 /dev/did/rdsk/d11
11 node1:/dev/rdsk/c6t600C0FF00000000007BA1F1023AE1710d0 /dev/did/rdsk/d11
b) Check the settings of the used did device
# cldg show dsk/d11
c) Remove the node which should not access the did device
# cldg remove-node -n node1 dsk/d11
d) Set localonly flag for the did device. Optional: set autogen flag
# cldg set -p localonly=true -p autogen=true dsk/d11
or disable fencing for the did device
# cldev set -p default_fencing=nofencing d11
e) Verify the settings
# cldg show dsk/d11
f) Create the zpool
# zpool create localpool c6t600C0FF00000000007BA1F1023AE1711d0
# zfs create localpool/data


If you forget to do this for a local zpool, there is a possibility that the zpool will be in FAULTED state after a boot.
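If that happens, one possible way back (a rough sketch; the exact recovery depends on why the pool faulted) is to fix the localonly/fencing settings as described above, check the pool state and then try to clear the errors or export and re-import the pool:
# zpool status localpool
# zpool clear localpool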

Tuesday Oct 19, 2010

Available: Oracle Solaris Cluster 3.3

Oracle Solaris Cluster 3.3 has been available for a couple of weeks.

WHAT'S NEW in Oracle Solaris Cluster 3.3 & Introduction Video

Quick summary:
* Active Monitoring of Storage Resources
* Flexible load distribution of application services
* Oracle Solaris Containers cluster support for Oracle Business Intelligence Enterprise Edition, Oracle Weblogic Server, MySQL cluster, PeopleSoft, TimesTen
* NAS support in Containers cluster
* Global File System support with Containers cluster
* Infiniband on public network and as storage connectivity
* Support for Reliable Data Sockets (RDS) over Infiniband (IB) for Oracle RAC in Containers cluster
* New supported applications, versions and configurations
* Solaris Containers cluster with Geographic Edition
* Oracle Sun Unified Storage 7xxx in Campus Cluster deployments
* Qualification with Solaris Trusted Extensions
* Wizards for ASM configurations set-up
* User Interface performance improvements in large configurations
* Power Management User interface
* Node Rename

Refer to the Oracle Solaris Cluster Support Community for
- Oracle Solaris Cluster 3.3 patch list for Solaris 10 SPARC
- Oracle Solaris Cluster 3.3 patch list for Solaris 10 x86

Tuesday Aug 17, 2010

New numbers of Solaris Cluster 3.2 core patches

There was a rejuvenation of the Solaris Cluster 3.2 core patch. The new patches are

144220 Solaris Cluster 3.2: CORE patch for Solaris 9
144221 Solaris Cluster 3.2: CORE patch for Solaris 10
144222 Solaris Cluster 3.2: CORE patch for Solaris 10_x86
At this time these patches do NOT have the requirement to be installed in non-cluster single-user mode. They can be installed in order while the cluster is running, but they require a reboot.

Beware: the new patches require the previous revision -42 of the SC 3.2 core patch.
126105-42 Sun Cluster 3.2: CORE patch for Solaris 9
126106-42 Sun Cluster 3.2: CORE patch for Solaris 10
126107-42 Sun Cluster 3.2: CORE patch for Solaris 10_x86
And the -42 still has the requirement to be installed in non-cluster single-user mode. Furthermore, carefully study the special install instructions and some entries of this blog.
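To check which revision of the SC 3.2 core patch is currently installed (example for the Solaris 10 SPARC patch ID; use 126105/126107 accordingly):
# showrev -p | grep 126106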

The advantage is that once -42 is applied, patching Solaris Cluster 3.2 becomes easier.

Certainly, it's possible to apply the new SC core patch at the same time as the -42 core patch in non-cluster-single-user-mode.

Friday Mar 26, 2010

Summary of install instructions for 126106-40 and 126107-40

My last blog describes some issues around these patches (please read it):
126106-40 Sun Cluster 3.2: CORE patch for Solaris 10
126107-40 Sun Cluster 3.2: CORE patch for Solaris 10_x86
This is a follow-up with a summary of best practices on how to install these patches. There is a difference between new installations, 'normal' patching and live upgrade patching.
Important: The mentioned instructions work if Solaris Cluster 3.2 1/09 Update2 (or Solaris Cluster 3.2 core patch revision -27 (sparc) / -28 (x86)) or higher is already installed. If running a lower version of the Solaris Cluster 3.2 core patch, additional steps are necessary. Please refer to the special install instructions of the patches for the additional needs.
Update: 28.Apr 2010
This also applies to the already released -41 and -42 SC core patches when -40 is not already active.

A) In case of new installations:

Install the SC core patch -40 immediately after the installation of the Solaris Cluster 3.2 software.
In brief:
    1.) Install Solaris Cluster 3.2 via JES installer
    2.) Install the SC core patch -40
    3.) Run scinstall
    4.) Do the reboot
Note: Do NOT reboot between 1.) and 2.). Follow the EIS Solaris Cluster 3.2 checklist, which also has a note for this issue. If it is not available, follow the standard installation process of Sun Cluster 3.2.


B) In case of 'normal' patching

It is vital to use the right approach when patching, because if you do not use the following approach then Solaris Cluster 3.2 can no longer boot:
    0.) Only if using AVS 4.0
    # patchadd 12324[67]-05 (Follow Special Install Instructions)
    1.) # boot in non-cluster mode
    2.) # svcadm disable svc:/system/cluster/loaddid
    3.) # svccfg delete svc:/system/cluster/loaddid
    4.) # patchadd 12610[67]-40
    5.) # init 6


C) In case of LU (Live Upgrade feature) to install patches
    1.) Create ABE:
    For zfs root within the same root pool:
    # lucreate -n patchroot
    For ufs on different root drive:
    # prtvtoc /dev/rdsk/c1t3d0s2 | fmthard -s - /dev/rdsk/c1t2d0s2
    # lucreate -c "c1t3d0s0-root" -m /:/dev/dsk/c1t2d0s0:ufs -m /global/.devices/node@2:/dev/dsk/c1t2d0s6:ufs -n "c1t2d0s0-patchroot"
    2.) Install patch into ABE ( patch is already unpacked in /var/tmp )
    # luupgrade -t -n c1t2d0s0-patchroot -s /var/tmp 126106-40
    3.) Activate ABE
    # luactivate patchroot
    4.) # init 6
    # Some errors come up at this point
    (dependency cycle & ORB error - please look at the example below)
    5.) # init 6 (the second reboot fixes the problem; Bug 6938144)





My personal recommendation to minimize the risk of installing the SC core patch -40 is:
Step 1) Upgrade the Solaris Cluster to
a) Solaris 10 10/09 update 8 and Solaris Cluster 3.2 11/09 update3.
or
b) EIS Baseline 26JAN10, which includes the Solaris kernel update 14144[45]-09 and SC core patch -39. If the EIS baseline is not available, use another patch set which includes the mentioned patches.
Step 2) After the successful upgrade do a single patch install of the SC core patch -40 by using installation instruction B) mentioned above. In this software state the -40 can be applied 'rolling' to the cluster.

Note: 'Rolling' means: Boot node1 in non-cluster-mode -> install -40 (see B) -> boot node1 back into cluster -> boot node2 in non-cluster-mode -> install -40 (see B) -> boot node2 back into cluster.

Wednesday Mar 03, 2010

Oracle Solaris Cluster core patch 126106-40 and 126107-40

This is a notification because there is some trouble with the following Sun Cluster 3.2 -40 core patches:
126106-40 Sun Cluster 3.2: CORE patch for Solaris 10
126107-40 Sun Cluster 3.2: CORE patch for Solaris 10_x86
Before installing the patch carefully read the Special Install Instructions.
Update: 28.Apr 2010
This also applies to the already released -41 and -42 SC core patches, when -40 is not already active.

Two new notes were added to these patches:


NOTE 16: Remove the loaddid SMF service by running the following
commands before installing this patch, if current patch level
(before installing this patch) is less than -40:
svcadm disable svc:/system/cluster/loaddid
svccfg delete svc:/system/cluster/loaddid


      So, the right approach is:
      # boot in non-cluster mode
      # svcadm disable svc:/system/cluster/loaddid
      # svccfg delete svc:/system/cluster/loaddid
      # patchadd 126106-40
      # init 6


NOTE 17:
Installing this patch on a machine with Availability Suite
software installed will cause the machine to fail to boot with
dependency errors due to BugId 6896134 (AVS does not wait for
did devices to startup in a cluster).
Please contact your Sun
Service Representative for relief before installing this patch.


The solution for Bug 6896134 is now available, please follow the right approach below for installation...
123246-05 Sun StorEdge Availability Suite 4.0: Patch for Solaris 10
123247-05 Sun StorEdge Availability Suite 4.0: Patch for Solaris 10_x86
       # patchadd 12324[67]-05 (Follow Special Install Instructions)
       # boot in non-cluster mode
       # svcadm disable svc:/system/cluster/loaddid
       # svccfg delete svc:/system/cluster/loaddid
       # patchadd 12610[67]-40
       # init 6



Important to know: These 2 issues only come up if using Solaris 10 10/09 Update8 or the kernel patch 141444-09 or higher. There are changes in the startup of the iSCSI initiator (it is now an SMF service) - please refer to Bug 6888193 for details.
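A quick way to check whether the iSCSI initiator is already delivered as an SMF service on your kernel level:
# svcs svc:/network/iscsi/initiator:default
If svcs does not know this FMRI, the iSCSI initiator on this kernel level is not yet an SMF service.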

Hint: If using LU (live upgrade) for patching please refer to my blog Summary of installation instructions for 126106-40 and 126107-40

ATTENTION: NOTE 16 is also valid for removal of the patch - carefully read the 'Special Removal Instructions'.


Additional information around NOTE 16
1) What happens if you forget to delete the loaddid service before the patch installation?

The following error (or similar) comes up during the patch installation on the console of the server:
Mar 2 12:01:46 svc.startd[7]: Transitioning svc:/system/cluster/loaddid:default to maintenance because it completes a dependency cycle (see svcs -xv for details):
svc:/network/iscsi/initiator:default
svc:/network/service
svc:/network/service:default
svc:/network/rpc/nisplus
svc:/network/rpc/nisplus:default
svc:/network/rpc/keyserv
svc:/network/rpc/keyserv:default
svc:/network/rpc/bind
svc:/network/rpc/bind:default
svc:/system/sysidtool:net
svc:/milestone/single-user:default
svc:/system/cluster/loaddid
svc:/system/cluster/loaddid:default
Mar 2 12:01:46 svc.startd[7]: system/cluster/loaddid:default transitioned to maintenance by request (see 'svcs -xv' for details)

But this should NOT be a problem because the patch 126106-40 is installed in non-cluster mode. This means that after the next boot into cluster mode the error should disappear. This is reported in Bug 6911030.

But to be sure that the system is booting correctly:
- check the log file /var/svc/log/system-cluster-loaddid:default.log
- verify that one of the last lines is: [ Mar 16 10:31:15 Rereading configuration. ]
- if not, go to the 'Recovering procedure' below


2) What happens if the loaddid delete is done after the patch installation?

Maybe you then see the problem mentioned in 1). If you disable and delete the 'svc:/system/cluster/loaddid' service after the patch installation, the system will no longer join the cluster. The following errors come up:
...
Mar 16 11:42:54 Cluster.Framework: Could not initialize the ORB. Exiting.
Mar 16 11:42:54 : problem waiting the deamon, read errno 0
Mar 16 11:42:54 Cluster.Framework: Could not initialize the ORB. Exiting.
Mar 16 11:42:54 svc.startd[8]: svc:/system/cluster/sc_failfast:default: Method "/usr/cluster/lib/svc/method/sc_failfast start" failed with exit status 1.
Mar 16 11:42:54 svc.startd[8]: svc:/system/cluster/cl_execd:default: Method "/usr/cluster/lib/svc/method/sc_cl_execd start" failed with exit status 1.
Mar 16 11:42:54 Cluster.Framework: Could not initialize the ORB. Exiting.
Mar 16 11:42:54 : problem waiting the deamon, read errno 0
Configuring devices.
Mar 16 11:42:54 svc.startd[8]: svc:/system/cluster/sc_failfast:default: Method "/usr/cluster/lib/svc/method/sc_failfast start" failed with exit status 1.
Mar 16 11:42:54 Cluster.Framework: Could not initialize the ORB. Exiting.
Mar 16 11:42:54 svc.startd[8]: svc:/system/cluster/cl_execd:default: Method "/usr/cluster/lib/svc/method/sc_cl_execd start" failed with exit status 1.
Mar 16 11:42:54 Cluster.Framework: Could not initialize the ORB. Exiting.
Mar 16 11:42:54 : problem waiting the deamon, read errno 0
Mar 16 11:42:54 svc.startd[8]: svc:/system/cluster/sc_failfast:default: Method "/usr/cluster/lib/svc/method/sc_failfast start" failed with exit status 1.
Mar 16 11:42:54 svc.startd[8]: system/cluster/sc_failfast:default failed: transitioned to maintenance (see 'svcs -xv' for details)
Mar 16 11:42:54 Cluster.Framework: Could not initialize the ORB. Exiting.
Mar 16 11:42:55 svc.startd[8]: svc:/system/cluster/cl_execd:default: Method "/usr/cluster/lib/svc/method/sc_cl_execd start" failed with exit status 1.
Mar 16 11:42:55 svc.startd[8]: system/cluster/cl_execd:default failed: transitioned to maintenance (see 'svcs -xv' for details)

If you see this error, refer to the Recovering procedure below.


Recovering procedure

1) Boot in non-cluster mode if not able to log in
2) Bring the files loaddid and loaddid.xml into place (normally using the files from SC core patch -40).
ONLY in case of trouble with the files from SC core patch -40 use the old files!
Note: If you restore the old files without the dependency on the iSCSI initiator, there can be problems when trying to use iSCSI storage within Sun Cluster.
3) Repair loaddid service
# svcadm disable svc:/system/cluster/loaddid
# svccfg delete svc:/system/cluster/loaddid
# svccfg import /var/svc/manifest/system/cluster/loaddid.xml
# svcadm restart svc:/system/cluster/loaddid:default
4) check the log file /var/svc/log/system-cluster-loaddid:default.log
# tail /var/svc/log/system-cluster-loaddid:default.log
for the following line (which should be at the end of the log file)
[ Mar 16 11:43:06 Rereading configuration. ]
Note: The 'Rereading configuration' entry must be present before rebooting!
5) reboot the system
# init 6


Additional information: differences in the loaddid files after installation of SC core patch -40.

A) /var/svc/manifest/system/cluster/loaddid.xml
The SC core patch -40 delivers a new version with the following changes:
<       ident "@(#)loaddid.xml 1.3 06/05/12 SMI"
---
>       ident "@(#)loaddid.xml 1.5 09/11/04 SMI"
56,61c79,92
<       <dependent
<              name='loaddid_single-user'
<              grouping='optional_all'
<             restart_on='none'>
<             <service_fmri value='svc:/milestone/single-user:default' />
<       </dependent>
---
>       <!--
>              The following dependency is for did drivers to get loaded
>              properly for iSCSI based quorum and data devices. We want to
>              start loaddid service after the time when iSCSI connections
>              can be made.
>        -->
>       <dependency
>              name='cl_iscsi_initiator'
>              grouping='optional_all'
>              restart_on='none'
>              type='service'>
>              <service_fmri
>              value='svc:/network/iscsi/initiator:default' />
>       </dependency>


Before patch -40 is applied:
node1 # svcs -d loaddid:default
STATE STIME FMRI
online 11:29:45 svc:/system/cluster/cl_boot_check:default
online 11:29:49 svc:/system/coreadm:default
online 11:30:52 svc:/milestone/devices:default

node1 # svcs -D svc:/system/cluster/loaddid:default
STATE STIME FMRI
online 15:34:41 svc:/system/cluster/bootcluster:default
online 15:34:46 svc:/milestone/single-user:default


After patch -40 is applied:
node1 # svcs -d loaddid:default
STATE STIME FMRI
online 12:09:18 svc:/system/coreadm:default
online 12:09:20 svc:/system/cluster/cl_boot_check:default
online 12:09:21 svc:/network/iscsi/initiator:default
online 12:10:21 svc:/milestone/devices:default

node1 # svcs -D svc:/system/cluster/loaddid:default
STATE STIME FMRI
online 16:08:19 svc:/system/cluster/bootcluster:default


B) /usr/cluster/lib/svc/method/loaddid
The SC core patch -40 delivers a new version with the following changes:
< #ident "@(#)loaddid 1.7 06/08/07 SMI"
---
> #ident "@(#)loaddid 1.9 09/11/04 SMI"
15,16c36,44
<        svcprop -q -p system/reconfigure system/svc/restarter:default 2>/dev/null
<        if [ $? -eq 0 ] && [ `svcprop -p system/reconfigure system/svc/restarter:default` = "true" ]
---
>        # The property "reconfigure" is used to store whether the boot is
>        # a reconfiguration boot or not. The property "system/reconfigure"
>        # of the "system/svc/restarter" Solaris SMF service can be used
>        # for this purpose as well. However the system/reconfigure
>        # property is reset at the single-user milestone. SC requires this
>        # property for use by service after the single-user milestone as
>        # well.
>        svcprop -q -p clusterdata/reconfigure system/cluster/cl_boot_check 2>/dev/null
>        if [ $? -eq 0 ] && [ `svcprop -p clusterdata/reconfigure system/cluster/cl_boot_check` = "true" ]


Thursday Jan 21, 2010

Configuration of 3rd mediator host in campus cluster

One of the new features in Sun Cluster 3.2 11/09 Update3 is the Solaris Volume Manager Three-Mediator support. This is very helpful for two-room campus cluster configurations. It's recommended to install the 3rd mediator host in a third room. Please refer to the details in the Guidelines for Mediators.
In the Solaris Cluster 3.3 docs you can find the Configuring Dual-String Mediators documentation.

Advantages:
-- The 3rd mediator host is used to get the majority of the Solaris Volume Manager configuration when one room is lost due to an error.
-- The 3rd mediator host only needs a public network connection. (It does not have to be part of the cluster and needs no connection to the shared storage.)
Consider:
-- One more room is necessary
-- One more host to administer, but if using a Sun Cluster quorum server, the 3rd mediator can be on the same host.


This example shows how to add a 3rd mediator host to an existing Solaris Volume Manager diskset

Configuration steps on the 3rd mediator host
  • A) Add root to sysadmin in /etc/group
    # vi /etc/group
    sysadmin::14:root

  • B) Create metadb and dummy diskset
    # metadb -afc 3 c0t0d0s7 c1t0d0s7
    # metaset -s <dummyds> -a -h <3rd_mediator_host>

  • Note: Maybe a good name for the dummyds is a combination of the set name used on the campus cluster and the campus cluster name, e.g. 'setnameofcampuscluster_campusclustername'. If using more than one set it could be e.g. 'sets_of_campusclustername'. Or, if using it for more than one cluster, it's possible to create one set with a specific name for each cluster. This could be helpful for monitoring/configuration purposes. But keep in mind this is not required; one set is enough for all clusters which use this 3rd mediator host.

    Configuration steps on cluster nodes
  • A) Add 3rd mediator host to /etc/hosts
    # echo <ipaddress hostname> >> /etc/hosts

  • B) Add 3rd mediator host to existing diskset on one cluster node.
    # metaset -s <setname> -a -m <3rd_mediator_host>

  • ATTENTION: If using the 3rd mediator host for more than one cluster, each cluster node and diskset must have a unique name throughout all clusters, and a diskset cannot be named 'shared' or 'admin'.

    Example output:


    Hint1: On the cluster nodes and on the third mediator the configuration is stored in the file /etc/lvm/meddb

    Hint2: Use a script to monitor the mediator status on the cluster nodes. Similar to:

    Download the mediatorcheck.ksh script here!
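    The original script is not reproduced here; a minimal sketch of such a check (assuming a diskset named 'datads' and relying on medstat(1M) reporting the status Ok or Bad per mediator host) could look like this:
    #!/usr/bin/ksh
    # Minimal mediator status check (sketch) - adjust SET to your diskset name
    SET=datads
    # medstat -s prints one line per mediator host with its status (Ok/Bad)
    if medstat -s $SET | grep -i bad > /dev/null 2>&1
    then
        echo "WARNING: at least one mediator of diskset $SET is in Bad state"
        exit 1
    else
        echo "All mediators of diskset $SET are Ok"
        exit 0
    fi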

    Saturday Dec 19, 2009

    Quorum server patches and Sun Cluster 3.2

    This is an early notification because there is some trouble with the following Sun Cluster 3.2 quorum server patches:
    127404-03 Sun Cluster 3.2: Quorum Server Patch for Solaris 9
    127405-04 Sun Cluster 3.2: Quorum Server patch for Solaris 10
    127406-04 Sun Cluster 3.2: Quorum Server patch for Solaris 10_x86
    All these patches are part of Sun Cluster 3.2 11/09 Update3 release, but also available on My Oracle Support.

    These patches deliver new features which require attention in case of upgrade or patching. The installation of the mentioned patches on a Sun Cluster 3.2 quorum server can lead to a panic of all Sun Cluster 3.2 nodes which use this quorum server. The panic of the Sun Cluster 3.2 nodes looks as follows:
    ...
    Dec 4 16:43:57 node1 \^Mpanic[cpu18]/thread=300041f0700:
    Dec 4 16:43:57 node1 unix: [ID 265925 kern.notice] CMM: Cluster lost operational quorum; aborting.
    ...

    Update 11.Jan.2010:
    More details available Alert 1021769.1: Sun Cluster 3.2 Quorum Server Patches Cause All Cluster Nodes to Panic with "Cluster lost operational quorum"

    General workaround before patching:
    If the Sun Cluster only uses the quorum server as a quorum device, temporarily add a second quorum device.
    For example:
    1) Configure a second temporary quorum device on each cluster node which uses the quorum server (use a disk if available).
    # clq add d13 (or use clsetup)
    2) Un-configure the quorum server from each cluster node that uses the quorum server.
    # clq remove QuorumServer1 (or use clsetup)
    3) Verify that the quorum server no longer serves any cluster.
    on the Sun Cluster nodes # clq status
    on the Sun Cluster quorum server # clqs show +
    4) Install the Sun Cluster 3.2 Quorum Server patch
    # patchadd 12740x-0x
    5) Reboot and start the quorum server if not already started
    # init 6
    6) From a cluster node, configure the patched quorum server again as a quorum device.
    # clq add -t quorum_server -p qshost=129.152.200.5 -p port=9000 QuorumServer1
    7) Un-configure the temporary quorum device
    # clq remove d13


    Workaround if quorum server is already patched:
    If the patch is already installed (and the system rebooted) on the Sun Cluster 3.2 quorum server but the quorum server is not working correctly, then the following messages are visible on the Sun Cluster nodes:
    ...
    Dec 3 17:24:24 node1 cl_runtime: [ID 868277 kern.warning] WARNING: CMM: Erstwhile online quorum device QuorumServer1 (qid 1) is inaccessible now.
    Dec 3 17:29:20 node1 cl_runtime: [ID 237999 kern.notice] NOTICE: CMM: Erstwhile inaccessible quorum device QuorumServer1 (qid 1) is online now.
    Dec 3 17:29:24 node1 cl_runtime: [ID 868277 kern.warning] WARNING: CMM: Erstwhile online quorum device QuorumServer1 (qid 1) is inaccessible now.
    Dec 3 17:32:58 node1 cl_runtime: [ID 237999 kern.notice] NOTICE: CMM: Erstwhile inaccessible quorum device QuorumServer1 (qid 1) is online now.
    ...
    DO NOT TRY to remove the quorum server on the Sun Cluster 3.2 nodes because this can end up in a panic loop.

    Do:
    1) Clear the configuration on the quorum server
    # clqs clear <clustername> <quorumname>
    or # clqs clear +

    2.) Re-add the quorum server to the Sun Cluster nodes
    # cluster set -p installmode=enabled
    # clq remove QuorumServer1
    # clq add -t quorum_server -p qshost=129.152.200.5 -p port=9000 QuorumServer1
    # cluster set -p installmode=disabled


    This is reported in Bug 6907934

    As stated in my last blog the following note of the Special Install Instructions of Sun Cluster 3.2 core patch -38 and higher is very important.
    NOTE 17: Quorum server patch 127406-04 (or greater) needs to be installed on quorum server host first, before installing 126107-37 (or greater) Core Patch on cluster nodes.
    This means that if using a Sun Cluster 3.2 quorum server, it's necessary to upgrade the quorum server before upgrading the Sun Cluster 3.2 nodes which use it to Sun Cluster 3.2 11/09 Update3. AND furthermore the same applies in case of patching: if installing the Sun Cluster core patch -38 or higher (-38 is part of Sun Cluster 3.2 11/09 Update3)
    126106-38 Sun Cluster 3.2: CORE patch for Solaris 10
    126107-38 Sun Cluster 3.2: CORE patch for Solaris 10_x86
    126105-38 Sun Cluster 3.2: CORE patch for Solaris 9
    then the same rule applies: first update the quorum server and then the Sun Cluster nodes. Please refer to the details above on how to do it...
    For upgrade also refer to the document: How to upgrade quorum server software

    Keep in mind: Fresh installations with Sun Cluster 3.2 11/09 update3 on the Sun Cluster nodes and on the quorum server are NOT affected!

    Sunday Dec 06, 2009

    Sun Cluster 3.2 11/09 Update3 Patches

    The Sun Cluster 3.2 11/09 Update3 is released. Click here for further information.

    The package versions of Sun Cluster 3.2 11/09 Update3 are the same for the core framework and the agents as for Sun Cluster 3.2, Sun Cluster 3.2 2/08 Update1 and Sun Cluster 3.2 1/09 Update2. Therefore it's possible to patch up an existing Sun Cluster 3.2, Sun Cluster 3.2 2/08 Update1 or Sun Cluster 3.2 1/09 Update2.

    The package versions of Sun Cluster Geographic Edition 3.2 11/09 Update3 are NOT the same as for Sun Cluster Geographic Edition 3.2. But it's possible to upgrade the Geographic Edition 3.2 without interruption of the service. See the documentation for details.

    The following patches (with the mentioned revision) are included/updated in Sun Cluster 3.2 11/09 Update3. If these patches are installed on a Sun Cluster 3.2, Sun Cluster 3.2 2/08 Update1 or Sun Cluster 3.2 1/09 Update2 release, then the features for framework & agents are identical to Sun Cluster 3.2 11/09 Update3. It's always necessary to read the "Special Install Instructions" of a patch, but I made a note behind some patches where it's especially important to read them (using the shortcut SIIOTP).

    Included/updated patch revisions of Sun Cluster 3.2 11/09 Update3 for Solaris 10 05/09 Update7 or higher
    126106-38 Sun Cluster 3.2: CORE patch for Solaris 10 Note: Please read SIIOTP
    125992-05 Sun Cluster 3.2: SC Checks patch for Solaris 10
    126017-03 Sun Cluster 3.2: HA-DNS Patch for Solaris 10
    126032-09 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 10 Note: Please read SIIOTP
    126035-06 Sun Cluster 3.2: HA-NFS Patch for Solaris 10
    126044-06 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 10 Note: Please read SIIOTP
    126047-12 Sun Cluster 3.2: Ha-Oracle patch for Solaris 10 Note: Please read SIIOTP
    126050-04 Sun Cluster 3.2: HA-Oracle E-business suite Patch for Solaris 10 (-04 not yet on SunSolve)
    126059-05 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 10
    126071-02 Sun Cluster 3.2: HA-Tomcat Patch for Solaris 10
    126080-04 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 10
    126083-04 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 10 Note: Please read SIIOTP
    126095-06 Sun Cluster 3.2: Localization patch for Solaris 9 sparc and Solaris 10 sparc
    128556-04 Sun Cluster 3.2: Man Pages Patch for Solaris 9 and Solaris 10, sparc
    137931-02 Sun Cluster 3.2: Sun Cluster 3.2: HA-Informix patch for Solaris 10


    Included/updated patch revisions of Sun Cluster 3.2 11/09 Update3 for Solaris 10 x86 05/09 Update7 or higher
    126107-38 Sun Cluster 3.2: CORE patch for Solaris 10_x86 Note: Please read SIIOTP
    125993-05 Sun Cluster 3.2: Sun Cluster 3.2: SC Checks patch for Solaris 10_x86
    126018-05 Sun Cluster 3.2: HA-DNS Patch for Solaris 10_x86
    126033-10 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 10_x86 Note: Please read SIIOTP
    126036-07 Sun Cluster 3.2: HA-NFS Patch for Solaris 10_x86
    126045-07 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 10_x86 Note: Please read SIIOTP
    126048-12 Sun Cluster 3.2: Ha-Oracle patch for Solaris 10_x86 Note: Please read SIIOTP
    126060-06 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 10_x86
    126072-02 Sun Cluster 3.2: HA-Tomcat Patch for Solaris 10_x86
    126081-05 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 10_x86
    126084-06 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 10_x86 Note: Please read SIIOTP
    126096-06 Sun Cluster 3.2: Localization patch for Solaris 10 amd64
    128557-04 Sun Cluster 3.2: Man Pages Patch for Solaris 10_x86
    137932-02 Sun Cluster 3.2:Sun Cluster 3.2: HA-Informix patch for Solaris 10_x86


    Included/updated patch revisions of Sun Cluster 3.2 11/09 Update3 for Solaris 9 5/09 or higher
    126105-38 Sun Cluster 3.2: CORE patch for Solaris 9 Note: Please read SIIOTP
    125991-05 Sun Cluster 3.2: Sun Cluster 3.2: SC Checks patch for Solaris 9
    126016-03 Sun Cluster 3.2: HA-DNS Patch for Solaris 9
    126031-09 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 9 Note: Please read SIIOTP
    126034-06 Sun Cluster 3.2: HA-NFS Patch for Solaris 9
    126043-06 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 9 Note: Please read SIIOTP
    126046-12 Sun Cluster 3.2: HA-Oracle patch for Solaris 9 Note: Please read SIIOTP
    126049-04 Sun Cluster 3.2: HA-Oracle E-business suite Patch for Solaris 9 (-04 not yet on SunSolve)
    126058-05 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 9
    126070-02 Sun Cluster 3.2: HA-Tomcat Patch for Solaris 9
    126079-04 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 9
    126082-04 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 9 Note: Please read SIIOTP
    126095-06 Sun Cluster 3.2: Localization patch for Solaris 9 sparc and Solaris 10 sparc
    128556-04 Sun Cluster 3.2: Man Pages Patch for Solaris 9 and Solaris 10, sparc


    The quorum server is an alternative to the traditional quorum disk. The quorum server is outside of the Sun Cluster and will be accessed through the public network. Therefore the quorum server can be a different architecture.

    Included/updated patch revisions in Sun Cluster 3.2 11/09 Update3 for quorum server feature:
    127404-03 Sun Cluster 3.2: Quorum Server Patch for Solaris 9
    127405-04 Sun Cluster 3.2: Quorum Server Patch for Solaris 10
    127406-04 Sun Cluster 3.2: Quorum Server Patch for Solaris 10_x86
    Please beware of the following note which is in the Special Install Instructions of Sun Cluster 3.2 core patch -38 and higher:
    NOTE 17: Quorum server patch 127406-04 (or greater) needs to be installed on quorum server host first, before installing 126107-37 (or greater) Core Patch on cluster nodes.
    127408-02 Sun Cluster 3.2: Quorum Man Pages Patch for Solaris 9 and Solaris 10, sparc
    127409-02 Sun Cluster 3.2: Quorum Man Pages Patch for Solaris 10_x86


    If some patches must be applied when the node is in noncluster mode, you can apply them in a rolling fashion, one node at a time, unless a patch's instructions require that you shut down the entire cluster. Follow procedures in How to Apply a Rebooting Patch (Node) in Sun Cluster System Administration Guide for Solaris OS to prepare the node and boot it into noncluster mode. For ease of installation, consider applying all patches at once to a node that you place in noncluster mode.

    Friday Oct 30, 2009

    Kernel patch 141444-09 or 141445-09 with Sun Cluster 3.2

    As stated in my last blog the following kernel patches are included in Solaris 10 10/09 Update8.
    141444-09 SunOS 5.10: kernel patch or
    141445-09 SunOS 5.10_x86: kernel patch

    Update 10.Dec.2009:
    Support of Solaris 10 10/09 Update8 with Sun Cluster 3.2 1/09 Update2 is now announced. The recommendation is to use the 126106-39 (sparc) / 126107-39 (x86) with Solaris 10 10/09 Update8. Note: The -39 Sun Cluster core patch is a feature patch because the -38 Sun Cluster core patch is part of Sun Cluster 3.2 11/09 Update3 which is already released.
    For new installations/upgrades with Solaris 10 10/09 Update8 use:
    * Sun Cluster 3.2 11/09 Update3 with Sun Cluster core patch -39 (fix problem 1)
    * Use the patches 142900-02 / 142901-02 (fix problem 2)
    * Add "set nautopush=64" to /etc/system (workaround for problem 3)

    For patch updates to 141444-09/141445-09 use:
    * Sun Cluster core patch -39 (fix problem 1)
    * Also use patches 142900-02 / 142901-02 (fix problem 2)
    * Add "set nautopush=64" to /etc/system (workaround for problem 3)


    It's time to mention that there are some issues with these kernel patches in combination with Sun Cluster 3.2.

    1.) The patch breaks the zpool cachefile feature if using SUNW.HAStoragePlus

    a.) If the kernel patch 141444-09 (sparc) / 141445-09 (x86) is installed on a Sun Cluster 3.2 system where the Sun Cluster core patch 126106-33 (sparc) / 126107-33 (x86) is already installed then hastorageplus_prenet_start will fail with the following error message:
    ...
    Oct 26 17:51:45 nodeA SC[,SUNW.HAStoragePlus:6,rg1,rs1,hastorageplus_prenet_start]: Started searching for devices in '/dev/dsk' to find the importable pools.
    Oct 26 17:51:53 nodeA SC[,SUNW.HAStoragePlus:6,rg1,rs1,hastorageplus_prenet_start]: Completed searching the devices in '/dev/dsk' to find the importable pools.
    Oct 26 17:51:54 nodeA zfs: [ID 427000 kern.warning] WARNING: pool 'zpool1' could not be loaded as it was last accessed by another system (host: nodeB hostid: 0x8516ced4). See: http://www.sun.com/msg/ZFS-8000-EY
    ...


    b.) If the kernel patch 141444-09 (sparc) / 141445-09 (x86) is installed on a Sun Cluster 3.2 system where the Sun Cluster core patch 126106-35 (sparc) / 126107-35 (x86) is already installed then hastorageplus_prenet_start will work but the zpool cachefile feature of SUNW.HAStoragePlus is disabled. Without the zpool cachefile feature the time of zpool import increases because the import will scan all available disks. The messages look like:
    ...
    Oct 30 15:37:45 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 148650 daemon.notice] Started searching for devices in '/dev/dsk' to find the importable pools.
    Oct 30 15:37:45 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 148650 daemon.notice] Started searching for devices in '/dev/dsk' to find the importable pools.
    Oct 30 15:37:49 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 547433 daemon.notice] Completed searching the devices in '/dev/dsk' to find the importable pools.
    Oct 30 15:37:49 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 547433 daemon.notice] Completed searching the devices in '/dev/dsk' to find the importable pools.
    Oct 30 15:37:49 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 792255 daemon.warning] Failed to update the cachefile contents in /var/cluster/run/HAStoragePlus/zfs/zpool1.cachefile to CCR table zpool1.cachefile for pool zpool1 : file /var/cluster/run/HAStoragePlus/zfs/zpool1.cachefile open failed: No such file or directory.
    Oct 30 15:37:49 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 792255 daemon.warning] Failed to update the cachefile contents in /var/cluster/run/HAStoragePlus/zfs/zpool1.cachefile to CCR table zpool1.cachefile for pool zpool1 : file /var/cluster/run/HAStoragePlus/zfs/zpool1.cachefile open failed: No such file or directory.
    Oct 30 15:37:49 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 205754 daemon.info] All specified device services validated successfully.
    ...


    If the ZFS cachefile feature is not required AND the above kernel patches are installed, problem a.) is resolved by installing Sun Cluster core patch 126106-35 (sparc) / 126107-35 (x86).
    Solution for a) and b):
    126106-39 Sun Cluster 3.2: CORE patch for Solaris 10
    126107-39 Sun Cluster 3.2: CORE patch for Solaris 10_x86

    Alert 1021629.1: A Solaris Kernel Change Stops Sun Cluster Using "zpool.cachefiles" to Import zpools Resulting in ZFS pool Import Performance Degradation or Failure to Import the zpools
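    To check which Sun Cluster core patch and kernel patch revisions are currently installed (sparc patch IDs shown; use 126107 / 141445 on x86), a quick check like the following can help:
    # showrev -p | grep 126106
    # showrev -p | grep 141444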

    2.) The patch breaks probe-based IPMP if more than one interface is in the same IPMP group

    After installing one of the already mentioned kernel patches:
    141444-09 SunOS 5.10: kernel patch or
    141445-09 SunOS 5.10_x86: kernel patch
    probe-based IPMP is broken if the system uses more than one interface in the same IPMP group. This means all Solaris 10 systems which use more than one interface in the same probe-based IPMP group are affected!

    After installing this kernel patch the following errors will be sent to the system console after a reboot:
    ...
    nodeA console login: Oct 26 19:34:41 in.mpathd[210]: NIC failure detected on bge0 of group ipmp0
    Oct 26 19:34:41 in.mpathd[210]: Successfully failed over from NIC bge0 to NIC e1000g0
    ...

    Workarounds:
    a) Use link-based IPMP instead of probe-based IPMP (a configuration sketch follows after this list)
    b) Use only one interface in the same IPMP group if using probe-based IPMP
    See the blog "Tips to configure IPMP with Sun Cluster 3.x" for more details if you would like to change the configuration.
    c) Do not install the kernel patches listed above. Note: A fix is already in progress and can be obtained via a service request. I will update this blog when the general fix is available.
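    A minimal sketch of workaround a), link-based IPMP without test addresses, assuming the interfaces bge0 and e1000g0 and the hostname entry nodeA-pub (all placeholders):
    # cat /etc/hostname.bge0
    nodeA-pub group sc_ipmp0 up
    # cat /etc/hostname.e1000g0
    group sc_ipmp0 up
    Because no test addresses are configured, in.mpathd uses link-based failure detection for this group.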

    Solution:
    142900-02 SunOS 5.10: kernel patch
    142901-02 SunOS 5.10_x86: kernel patch

    Alert 1021262.1 : Solaris 10 Kernel Patches 141444-09 and 141445-09 May Cause Interface Failure in IP Multipathing (IPMP)
    This is reported in Bug 6888928

    3.) When applying the patch Sun Cluster can hang on reboot

    After installing one of the already mentioned patches:
    141444-09 SunOS 5.10: kernel patch or
    141511-05 SunOS 5.10_x86: ehci, ohci, uhci patch
    the Sun Cluster nodes can hang during boot because they have exhausted the default number of autopush structures. When the clhbsndr module is loaded, it causes many more autopushes to occur than would happen on a non-clustered system. By default, only nautopush=32 of these structures are allocated.

    Workarounds:
    a) Do not use the mentioned kernel patch with Sun Cluster
    b) Boot in non-cluster-mode and add the following to /etc/system
    set nautopush=64
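    A minimal sketch of workaround b) on one SPARC node (adjust the boot step for x86/GRUB):
    ok boot -x                                  (boot the node into non-cluster-mode)
    # echo "set nautopush=64" >> /etc/system    (raise the number of autopush structures)
    # grep nautopush /etc/system                (verify the entry)
    # init 6                                    (reboot the node into the cluster)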

    Solution:
    126106-42 Sun Cluster 3.2: CORE patch for Solaris 10
    126107-42 Sun Cluster 3.2: CORE patch for Solaris 10_x86
    for Sun Cluster 3.1 the issue is fixed in:
    120500-26 Sun Cluster 3.1: Core Patch for Solaris 10

    Alert 1021684.1: Solaris autopush(1M) Changes (with patches 141444-09/141511-04) May Cause Sun Cluster 3.1 and 3.2 Nodes to Hang During Boot
    This is reported in Bug 6879232

    Thursday Sep 24, 2009

    Configuration steps to create a zone cluster on Solaris Cluster 3.2/3.3

    This is a short overview on how to configure a zone cluster. It is highly recommended to use Solaris 10 5/09 update7 with patch baseline July 2009 (or higher) and Sun Cluster 3.2 1/09 with Sun Cluster 3.2 core patch revision -33 or higher. The name of the zone cluster must be unique throughout the global Sun Cluster and must be configured on a global Sun Cluster. Please read the requirements for zone cluster in Sun Cluster Software Installation Guide.
    For Solaris Cluster 4.0 please refer to blog How to configure a zone cluster on Solaris Cluster 4.0


    A. Configure the zone cluster into the global cluster
    • Check if zone cluster can be created
      # cluster show-netprops
      to change number of zone clusters use
      # cluster set-netprops -p num_zoneclusters=12
      Note: 12 zone clusters is the default, values can be customized!

    • Configure filesystem for zonepath on all physical nodes
      # mkdir -p /zones/zc1
      # chmod 0700 /zones/zc1

    • Create config file (zc1config) for zone cluster setup e.g:
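      A minimal sketch of what zc1config could contain for a two-node zone cluster, assuming the global cluster nodes node1 and node2, the zonepath created above, and placeholder hostnames/addresses (verify the resource and property names against the clzonecluster(1CL) man page):
      create
      set zonepath=/zones/zc1
      add node
      set physical-host=node1
      set hostname=zone-hostname-node1
      add net
      set address=192.168.10.11
      set physical=e1000g0
      end
      end
      add node
      set physical-host=node2
      set hostname=zone-hostname-node2
      add net
      set address=192.168.10.12
      set physical=e1000g0
      end
      end
      add sysid
      set root_password=<encrypted-root-password>
      end
      commit
      exit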

    • Configure zone cluster
      # clzc configure -f zc1config zc1
      Note: If not using a config file, the configuration can also be done interactively with # clzc configure zc1

    • Check zone configuration
      # clzc export zc1

    • Verify zone cluster
      # clzc verify zc1
      Note: The following message is a notice and comes up on several clzc commands
      Waiting for zone verify commands to complete on all the nodes of the zone cluster "zc1"...

    • Install the zone cluster
      # clzc install zc1
      Note: Monitor the console of the global zone to see how the install proceeds!

    • Boot the zone cluster
      # clzc boot zc1

    • Check status of zone cluster
      # clzc status zc1

    • Log in to the non-global zones of zone cluster zc1 and configure the shell environment for root (for PATH: /usr/cluster/bin, for MANPATH: /usr/cluster/man); see the sketch below
      # zlogin -C zc1
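      A small sketch of such an environment setup, appended to root's profile inside each zone (adjust to the shell in use):
      PATH=$PATH:/usr/cluster/bin; export PATH
      MANPATH=$MANPATH:/usr/cluster/man; export MANPATH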

    B. Add resource groups and resources to zone cluster
    • Create a resource group in zone cluster
      # clrg create -n zone-hostname-node1,zone-hostname-node2 app-rg

      Note: Use command # cluster status for zone cluster resource group overview

    • Set up the logical host resource for zone cluster
      In the global zone do:
      # clzc configure zc1
      clzc:zc1> add net
      clzc:zc1:net> set address=<zone-logicalhost-ip>
      clzc:zc1:net> end
      clzc:zc1> commit
      clzc:zc1> exit
      Note: Check that logical host is in /etc/hosts file
      In zone cluster do:
      # clrslh create -g app-rg -h <zone-logicalhost> <zone-logicalhost>-rs

    • Set up storage resource for zone cluster
      Register HAStoragePlus
      # clrt register SUNW.HAStoragePlus

      Example1) ZFS storage pool
      In the global zone do:
      Configure zpool eg: # zpool create <zdata> mirror cXtXdX cXtXdX
      and
      # clzc configure zc1
      clzc:zc1> add dataset
      clzc:zc1:dataset> set name=zdata
      clzc:zc1:dataset> end
      clzc:zc1> exit
      Check setup with # clzc show -v zc1
      In the zone cluster do:
      # clrs create -g app-rg -t SUNW.HAStoragePlus -p zpools=zdata app-hasp-rs


      Example2) HA filesystem
      In the global zone do:
      Configure SVM diskset and SVM devices.
      and
      # clzc configure zc1
      clzc:zc1> add fs
      clzc:zc1:fs> set dir=/data
      clzc:zc1:fs> set special=/dev/md/datads/dsk/d0
      clzc:zc1:fs> set raw=/dev/md/datads/rdsk/d0
      clzc:zc1:fs> set type=ufs
      clzc:zc1:fs> end
      clzc:zc1> exit
      Check setup with # clzc show -v zc1
      In the zone cluster do:
      # clrs create -g app-rg -t SUNW.HAStoragePlus -p FilesystemMountPoints=/data app-hasp-rs

      More details of adding storage

    • Switch resource group and resources online in the zone cluster
      # clrg online -eM app-rg

    • Test: Switch over the resource group in the zone cluster
      # clrg switch -n zonehost2 app-rg

    • Add supported dataservice to zone cluster
      Further details about supported dataservices available at:
      - Zone Clusters - How to Deploy Virtual Clusters and Why
      - Running Oracle Real Application Clusters on Solaris Zone Cluster

    Example output:



    Appendix: To delete a zone cluster do:
    # clrg delete -Z zc1 -F +

    Note: Zone cluster uninstall can only be done if all resource groups are removed in the zone cluster. The command 'clrg delete -F +' can be used in zone cluster to delete the resource groups recursively.
    # clzc halt zc1
    # clzc uninstall zc1

    Note: If the clzc command does not successfully uninstall the zone, run 'zoneadm -z zc1 uninstall -F' on the nodes where zc1 is configured
    # clzc delete zc1

    Monday Sep 07, 2009

    Entries in infrastructure file if using tagged VLAN for cluster interconnect

    In some cases it is necessary to add a tagged VLAN id to the cluster interconnect. This example shows the difference in the cluster interconnect configuration with and without a tagged VLAN id. The interface e1000g2 has a "normal" setup (no VLAN id) and the interface e1000g1 has a VLAN id of 2. The ethernet switch must be configured with the tagged VLAN id before the cluster interconnect can be configured. Use "clsetup" to assign a VLAN id to the cluster interconnect.

    Entries for "normal" cluster interconnect interface in /etc/cluster/ccr/global/infrastructure - no tagged VLAN:
    cluster.nodes.1.adapters.1.name e1000g2
    cluster.nodes.1.adapters.1.properties.device_name e1000g
    cluster.nodes.1.adapters.1.properties.device_instance 2
    cluster.nodes.1.adapters.1.properties.transport_type dlpi
    cluster.nodes.1.adapters.1.properties.lazy_free 1
    cluster.nodes.1.adapters.1.properties.dlpi_heartbeat_timeout 10000
    cluster.nodes.1.adapters.1.properties.dlpi_heartbeat_quantum 1000
    cluster.nodes.1.adapters.1.properties.nw_bandwidth 80
    cluster.nodes.1.adapters.1.properties.bandwidth 70
    cluster.nodes.1.adapters.1.properties.ip_address 172.16.1.129
    cluster.nodes.1.adapters.1.properties.netmask 255.255.255.128
    cluster.nodes.1.adapters.1.state enabled
    cluster.nodes.1.adapters.1.ports.1.name 0
    cluster.nodes.1.adapters.1.ports.1.state enabled


    Entries for cluster interconnect interface in /etc/cluster/ccr/global/infrastructure - with tagged VLAN:
    cluster.nodes.1.adapters.2.name e1000g2001

    cluster.nodes.1.adapters.2.properties.device_name e1000g
    cluster.nodes.1.adapters.2.properties.device_instance 1
    cluster.nodes.1.adapters.2.properties.vlan_id 2
    cluster.nodes.1.adapters.2.properties.transport_type dlpi
    cluster.nodes.1.adapters.2.properties.lazy_free 1
    cluster.nodes.1.adapters.2.properties.dlpi_heartbeat_timeout 10000
    cluster.nodes.1.adapters.2.properties.dlpi_heartbeat_quantum 1000
    cluster.nodes.1.adapters.2.properties.nw_bandwidth 80
    cluster.nodes.1.adapters.2.properties.bandwidth 70
    cluster.nodes.1.adapters.2.properties.ip_address 172.16.2.1
    cluster.nodes.1.adapters.2.properties.netmask 255.255.255.128
    cluster.nodes.1.adapters.2.state enabled
    cluster.nodes.1.adapters.2.ports.1.name 0
    cluster.nodes.1.adapters.2.ports.1.state enabled

    The tagged VLAN interface is a combination of the VLAN id and the used network interface. In this example e1000g2001, the 2 after the e1000g is the VLAN id and the 1 at the end is the instance of the e1000g driver. Normally this would be the e1000g1 interface but with the VLAN id it becomes the interface e1000g2001.
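    In other words, the instance number of a tagged VLAN interface is calculated as (VLAN id * 1000) + driver instance, for example:
    # expr 2 \* 1000 + 1
    2001                (VLAN id 2 on e1000g instance 1 -> interface e1000g2001)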

    The ifconfig -a of the above configuration is:
    # ifconfig -a

    lo0: flags=20010008c9 mtu 8232 index 1
           inet 127.0.0.1 netmask ff000000
    e1000g0: flags=9000843 mtu 1500 index 2
          inet 10.16.65.63 netmask fffff800 broadcast 10.16.55.255
          groupname sc_ipmp0
          ether 0:14:4f:20:6a:18
    e1000g2: flags=201008843 mtu 1500 index 4
          inet 172.16.1.129 netmask ffffff80 broadcast 172.16.1.255
          ether 0:14:4f:20:6a:1a
    e1000g2001: flags=201008843 mtu 1500 index 3
          inet 172.16.2.1 netmask ffffff80 broadcast 172.16.2.127
          ether 0:14:4f:20:6a:19

    clprivnet0: flags=1009843 mtu 1500 index 5
          inet 172.16.4.1 netmask fffffe00 broadcast 172.16.5.255
          ether 0:0:0:0:0:1

    Thursday Jul 23, 2009

    Sun Cluster 3.x command line overview

    I put together a quick reference guide for Sun Cluster 3.x. The guide includes the "old" command line, which is used for Sun Cluster 3.0, 3.1 and 3.2, as well as the new object-based command line known from Sun Cluster 3.2. Please do not expect the whole command line in these two pages; it is meant as a reminder of the most-used commands within Sun Cluster 3.x. I added the pictures to this blog, but the pdf file is also available for download.





    Further reference guides are available:
    Sun Cluster 3.2 Quick Reference Guide
    German Sun Cluster 3.2 Quick Reference Guide
    Sun Cluster 3.1 command line cheat sheet

    Wednesday Jun 17, 2009

    Ready for Sun Cluster 3.2 1/09 Update2?

    Now it's time to install/upgrade to Sun Cluster 3.2 1/09 Update2. The major bugs of Sun Cluster 3.2 1/09 Update2 are fixed in
    126106-33 or higher Sun Cluster 3.2: CORE patch for Solaris 10
    126107-33 or higher Sun Cluster 3.2: CORE patch for Solaris 10_x86
    126105-33 or higher Sun Cluster 3.2: CORE patch for Solaris 9

    This means the core patch should be applied immediately after the installation of the Sun Cluster 3.2 1/09 Update2 software. The installation approach in short:

  • Install Sun Cluster 3.2 1/09 Update2 with java enterprise installer

  • Install the necessary Sun Cluster 3.2 core patch as mentioned above

  • Configure Sun Cluster 3.2 with scinstall

  • Further details available in Sun Cluster Software Installation Guide for Solaris OS.
    Also Installation services delivered by Oracle Advanced Customer Services are available.
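    Putting these steps together, a rough sketch for each cluster node could look like this (sparc core patch shown, installer location depends on the media layout):
    # ./installer                       (java enterprise installer from the Sun Cluster 3.2 1/09 Update2 media)
    # patchadd /var/tmp/126106-33       (Sun Cluster 3.2 core patch -33 or higher; 126107-33 on x86)
    # /usr/cluster/bin/scinstall        (configure the cluster framework)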

    Friday May 08, 2009

    Administration of zpool devices in Sun Cluster 3.2 environment


    Carefully configure zpools in Sun Cluster 3.2, because it is possible to use the same physical device in different zpools on different nodes at the same time. This means the zpool command does NOT check whether the physical device is already in use by another zpool on another node. For example, if node1 has an active zpool with device c3t3d0, it is possible to create a new zpool with c3t3d0 on another node (assumption: c3t3d0 is the same shared device on all cluster nodes).

    Output of testing...


    If problems occurred due to administration mistakes then the following errors have been seen:

    NODE1# zpool import tank
    cannot import 'tank': I/O error

    NODE2# zpool import tankothernode
    cannot import 'tankothernode': one or more devices is currently unavailable

    NODE2# zpool import tankothernode
    cannot import 'tankothernode': no such pool available

    NODE1# zpool import tank
    cannot import 'tank': pool may be in use from other system, it was last accessed by NODE2 (hostid: 0x83083465) on Fri May 8 13:34:41 2009
    use '-f' to import anyway
    NODE1# zpool import -f tank
    cannot import 'tank': one or more devices is currently unavailable


    Furthermore, the zpool command also uses a disk without any warning if it is already used by a Solaris Volume Manager diskset or a Symantec (Veritas) Volume Manager diskgroup.

    Summary for Sun Cluster environment:
    ALWAYS MANUALLY CHECK THAT A DEVICE IS FREE BEFORE USING IT FOR A ZPOOL! (see the sketch below)
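    A minimal sketch of such a manual check, run on every cluster node before creating a zpool on a shared device (c3t3d0 is a placeholder; the last check only applies if Veritas Volume Manager is installed):
    # zpool status | grep c3t3d0               (is the device part of an imported pool on this node?)
    # zpool import                             (does any exported/importable pool use the device?)
    # metaset                                  (is the device listed in a Solaris Volume Manager diskset?)
    # vxdisk -o alldgs list | grep c3t3d0      (is the device part of a Veritas diskgroup?)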


    This is addressed in bug 6783988.

    Monday May 04, 2009

    cluster configuration repository can get corrupted on installation Sun Cluster 3.2 1/09 Update2


    The issue only occurs if Sun Cluster 3.2 1/09 Update2 is installed with a non-default netmask for the cluster interconnect.

    Problems seen if a system is affected:
    Errors with:
          * did devices
          * quorum device
          * The command 'scstat -i' can look like:
    -- IPMP Groups --
                     Node Name       Group    Status    Adapter   Status
                     ---------               -----        ------       -------       ------
    scrconf: RPC: Authentication error; why = Client credential too weak
    scrconf: Failed to get zone information for scnode2 - unexpected error.
    scrconf: RPC: Authentication error; why = Client credential too weak
    scrconf: Failed to get zone information for scnode1 - unexpected error.
    scrconf: RPC: Authentication error; why = Client credential too weak
    scrconf: Failed to get zone information for scnode2 - unexpected error.
    scrconf: RPC: Authentication error; why = Client credential too weak
    scrconf: Failed to get zone information for scnode1 - unexpected error.
    IPMP Group: scnode2    sc_ipmp0    Online    qfe0      Online
    IPMP Group: scnode1    sc_ipmp0    Online    qfe0      Online
    IPMP Group: scnode2    sc_ipmp0    Online    qfe0      Online
    IPMP Group: scnode1    sc_ipmp0    Online    qfe0      Online


    How does the problem occur?
    After the installation of the Sun Cluster 3.2 1/09 Update2 product with the java installer it is necessary to run the #scinstall command. If you choose the "Custom" installation instead of the "Typical" installation, it is possible to change the default netmask of the cluster interconnect. The following questions come up within the installation procedure if the default netmask question is answered with 'no'.

    Example scinstall:
           Is it okay to accept the default netmask (yes/no) [yes]? no
           Maximum number of nodes anticipated for future growth [64]? 4
           Maximum number of private networks anticipated for future growth [10]?
           Maximum number of virtual clusters expected [12]? 0
           What netmask do you want to use [255.255.255.128]?
    Prevent the issue by answering the virtual clusters question with '1', or with a higher value based on serious consideration of future growth if necessary.
    Do NOT answer the virtual clusters question with '0'!


    Example of the whole scinstall log when the corrupted CCR occurs:

    In the /etc/cluster/ccr/global/infrastructure file the error can be identified by an empty entry for cluster.properties.private_netmask. Furthermore, some other lines do not reflect the correct netmask values as chosen within scinstall.
    Wrong infrastructure file:
    cluster.state enabled
    cluster.properties.cluster_id 0x49F82635
    cluster.properties.installmode disabled
    cluster.properties.private_net_number 172.16.0.0
    cluster.properties.cluster_netmask 255.255.248.0
    cluster.properties.private_netmask
    cluster.properties.private_subnet_netmask 255.255.255.248
    cluster.properties.private_user_net_number 172.16.4.0
    cluster.properties.private_user_netmask 255.255.254.0

    cluster.properties.private_maxnodes 6
    cluster.properties.private_maxprivnets 10
    cluster.properties.zoneclusters 0
    cluster.properties.auth_joinlist_type sys

    If answering the virtual cluster question with value '1' then the correct netmask entries are:
    cluster.properties.cluster_id 0x49F82635
    cluster.properties.installmode disabled
    cluster.properties.private_net_number 172.16.0.0
    cluster.properties.cluster_netmask 255.255.255.128
    cluster.properties.private_netmask 255.255.255.128
    cluster.properties.private_subnet_netmask 255.255.255.248
    cluster.properties.private_user_net_number 172.16.0.64
    cluster.properties.private_user_netmask 255.255.255.224

    cluster.properties.private_maxnodes 6
    cluster.properties.private_maxprivnets 10
    cluster.properties.zoneclusters 1
    cluster.properties.auth_joinlist_type sys


    Workaround if the problem has already occurred:
    1.) Boot all nodes in non-cluster-mode with 'boot -x'
    2.) Change the wrong values of /etc/cluster/ccr/global/infrastructure on all nodes. See example above.
    3.) Write a new checksum for all infrastructure files on all nodes. Use -o (master file) on the node which is booting up first.
    scnode1 # /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure -o
    scnode2 # /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure
    4.) First reboot scnode1 (the node where the master infrastructure file was written) into the cluster, then the other nodes.
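    Once all nodes are back in cluster mode, the repaired values can be verified, for example with:
    scnode1# cluster show-netprops
    scnode1# grep netmask /etc/cluster/ccr/global/infrastructure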
    This is reported in bug 6825948.


    Update 17.Jun.2009:
    The -33 revision of the Sun Cluster core patch is the first released version which fixes this issue at installation time.
    126106-33 Sun Cluster 3.2: CORE patch for Solaris 10
    126107-33 Sun Cluster 3.2: CORE patch for Solaris 10_x86

    Friday Apr 24, 2009

    Upgrade to Sun Cluster 3.2 1/09 Update2 and SUNWscr preremove script

    There is a missing/outdated preremove script in Sun Cluster 3.2 2/08 Update1, which is equivalent to the patches
    126106-12 until -19 Sun Cluster 3.2: CORE patch for Solaris 10
    126107-12 until -19 Sun Cluster 3.2: CORE patch for Solaris 10_x86
    126105-12 until -19 Sun Cluster 3.2: CORE patch for Solaris 9

    This means the issue can occur in case of an upgrade (using scinstall -u) from Sun Cluster 3.2 to Sun Cluster 3.2 Update1 or Update2.
    More details are available in Missing preremove script in Sun Cluster 3.2 core patch revision 12 and higher.
    The issue is that, if the mentioned Sun Cluster core patches are installed, it is not possible to remove the SUNWscr package within the upgrade to Sun Cluster 3.2 1/09 Update2.

    The problem looks like this:
    # ./scinstall -u update
    Starting upgrade of Sun Cluster framework software
    Saving current Sun Cluster configuration
    Do not boot this node into cluster mode until upgrade is complete.
    Renamed "/etc/cluster/ccr" to "/etc/cluster/ccr.upgrade".
    ** Removing Sun Cluster framework packages **
        ...
        Removing SUNWscrtlh..done
        Removing SUNWscr.....failed
        scinstall: Failed to remove "SUNWscr"
        Removing SUNWscscku..done
        ...
    scinstall: scinstall did NOT complete successfully!



    Workaround:
    Before the upgrade to Sun Cluster 3.2 Update1/Update2, install the appropriate one of the following patches, which deliver a correct preremove script for Sun Cluster 3.2:
    140016 Sun Cluster 3.2: CORE patch for Solaris 9
    140017 Sun Cluster 3.2: CORE patch for Solaris 10
    140018 Sun Cluster 3.2: CORE patch for Solaris 10_x86

    If one of the following patches is already installed, then the above patches are not necessary, because these patches also include a correct preremove script for the SUNWscr package.
    126106-27 or higher Sun Cluster 3.2: CORE patch for Solaris 10
    126107-28 or higher Sun Cluster 3.2: CORE patch for Solaris 10_x86
    126105-26 or higher Sun Cluster 3.2: CORE patch for Solaris 9

    This is reported in bugs 6676771 and 6747530 with further details.

