Monday Mar 09, 2009

memory leaks in "rgmd -z global" process

A memory leak occurs in the "rgmd -z global" process on Sun Cluster 3.2 1/09 Update2. The global zone instance of the rgmd process leaks memory when basic commands such as "scstat" or "cluster show" are run. The problem is severe: the rgmd heap keeps growing until the Sun Cluster node runs out of swap space and panics.

The issue only happens if one of the following Sun Cluster core patches is active.
126106-27 or -29 or -30 Sun Cluster 3.2: CORE patch for Solaris 10
126107-28 or -30 or -31 Sun Cluster 3.2: CORE patch for Solaris 10_x86
Because these patches are also part of the Sun Cluster 3.2 1/09 Update2 release, the issue also occurs on freshly installed Sun Cluster 3.2 1/09 Update2 systems.

The error can look as follows:
Analyze the growth of the memory allocation with prstat, pmap, or similar tools:
# prstat
3942 root 61M 11M sleep 101 - 0:00:02 0.7% rgmd/41
Some time later the increase of the memory allocation is visible:
3942 root 61M 20M sleep 101 - 0:01:15 0.7% rgmd/41
or
# pmap -x <pid_of_rgmd-z_global> | grep heap
00022000 47648 6992 6984 - rwx-- [ heap ]
Some time later the increase of the memory allocation is visible:
00022000 47648 15360 15352 - rwx-- [ heap ]
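The comparison of two pmap samples can also be scripted. The following is only a minimal sketch: the function names heap_rss_kb and heap_growth_kb are hypothetical, the RSS value is assumed to be the third column as in the pmap -x lines above, and a POSIX shell (e.g. ksh or /usr/xpg4/bin/sh) is assumed.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Sun Cluster or Solaris.

# Print the resident size (RSS, in KB) of the heap segment from
# "pmap -x" output read on stdin. Assumes RSS is the third column.
heap_rss_kb() {
    awk '/\[ *heap *\]/ { sum += $3 } END { print sum + 0 }'
}

# Print the heap growth in KB between two saved "pmap -x" samples.
# usage: heap_growth_kb FIRST_SAMPLE SECOND_SAMPLE
heap_growth_kb() {
    before=$(heap_rss_kb < "$1")
    after=$(heap_rss_kb < "$2")
    echo $((after - before))
}
```

To use it, save the output of "pmap -x <pid_of_rgmd-z_global>" to a file twice, some minutes apart, and pass both files to heap_growth_kb. A value that keeps rising across samples points to the leak described here.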

When the memory is full the Sun Cluster node panics with the following message:
Feb 25 07:59:23 node1 RGMD[1843]: [ID 381173 daemon.error] RGM: Could not allocate 1024 bytes; node is out of swap space; aborting node.
...
Feb 25 08:10:05 node1 cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Feb 25 08:10:05 node1 unix: [ID 836849 kern.notice]
Feb 25 08:10:05 node1 \^Mpanic[cpu0]/thread=2a100047ca0:
Feb 25 08:10:05 node1 unix: [ID 562397 kern.notice] Failfast: Aborting zone "global" (zone ID 0) because "globalrgmd" died 30 seconds ago.
Feb 25 08:10:06 node1 unix: [ID 100000 kern.notice]
...

Update 20.Mar.2009:
Available now:
Alert 1020253.1 Memory Leak in the "rgmd" Process of Solaris Cluster 3.2 may Cause a failfast Panic

Update 17.Jun.2009:
The -33 revision of the Sun Cluster core patch is the first released version which fixes this issue.
126106-33 Sun Cluster 3.2: CORE patch for Solaris 10
126107-33 Sun Cluster 3.2: CORE patch for Solaris 10_x86


Workaround: Use the previous revision -19 to prevent the issue.
126106-19 Sun Cluster 3.2: CORE patch for Solaris 10
126107-19 Sun Cluster 3.2: CORE patch for Solaris 10_x86
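The revision lists in this entry can be turned into a quick check. This is a sketch under assumptions: the function name rgmd_leak_status is hypothetical, and it only knows the revisions named in this entry (affected: 126106-27/-29/-30 and 126107-28/-30/-31, fixed: -33 or higher). Any other revision is reported as "not listed" rather than guessed.

```shell
#!/bin/sh
# Hypothetical helper: classify a core patch id such as 126106-27
# using only the revisions named in this blog entry.
rgmd_leak_status() {
    id=${1%-*}      # patch number, e.g. 126106
    rev=${1##*-}    # revision, e.g. 27
    case "$id" in
        126106|126107) ;;
        *) echo "not listed"; return 0 ;;
    esac
    case "$1" in
        126106-27|126106-29|126106-30|126107-28|126107-30|126107-31)
            echo "affected" ;;
        *)
            if [ "$rev" -ge 33 ]; then
                echo "fixed"
            else
                echo "not listed"
            fi ;;
    esac
}
```

The installed revision can be looked up with e.g. "showrev -p | grep 126106" and then passed to the function, for example "rgmd_leak_status 126106-27".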

The issue is reported in bug 6808508 (description: scalable services coredump during the failover due to network failure). A fix is in progress. This blog will be updated when the fix is available.

Monday Feb 16, 2009

Sun Cluster 3.2 1/09 Update2 Patches

The Sun Cluster 3.2 1/09 Update2 is released. Click here for further information.

The package versions of Sun Cluster 3.2 1/09 Update2 are the same for the core framework and the agents as for Sun Cluster 3.2 and Sun Cluster 3.2 2/08 Update1. Therefore it's possible to patch up an existing Sun Cluster 3.2 or Sun Cluster 3.2 2/08 Update1.

The package versions of Sun Cluster Geographic Edition 3.2 1/09 Update2 are NOT the same as for Sun Cluster Geographic Edition 3.2. Therefore an upgrade is necessary for the Geographic Edition.
But don't worry about that, because unlike core Sun Cluster 3.2 the Geographic Edition framework does not create updates through patches. The update can be done without interruption of the service. Click here for details.

The following patches (with the mentioned revision) are included in Sun Cluster 3.2 1/09 Update2. So the complete list is a combination of the Sun Cluster 3.2 2/08 Update1 patches and this list. If these patches are installed on a Sun Cluster 3.2 or Sun Cluster 3.2 2/08 Update1 release, then the features for framework & agents are identical. It's always necessary to read the "Special Install Instructions of the patch", but I added a note to those patches where it's especially important to read them (using the shortcut SIIOTP). Furthermore I added a note when a new resource type comes with the patch.

New additional included patch revisions of Sun Cluster 3.2 1/09 Update2 for Solaris 10 05/08 update5 or higher
126106-27 Sun Cluster 3.2: CORE patch for Solaris 10
Note:
Delivers SUNW.rac_udlm:3, SUNW.rac_framework:4, SUNW.crs_framework:2, SUNW.ScalMountPoint:3, SUNW.ScalDeviceGroup:3, SUNW.rac_svm:3, SUNW.rac_cvm:3 and SUNW.LogicalHostname:3 (but LogicalHostname was introduced in revision -17). Please read SIIOTP
125514-05 Sun Cluster 3.2: Solaris Volume Manager (Mediator) Patch
125992-03 Sun Cluster 3.2: SC Checks patch for Solaris 10
126008-02 Sun Cluster 3.2: HA-DB Patch for Solaris 10
126014-05 Sun Cluster 3.2: Ha-Apache Patch for Solaris 10
126017-02 Sun Cluster 3.2: HA-DNS Patch for Solaris 10
126020-04 Sun Cluster 3.2: HA-Containers Patch for Solaris 10 Note: Please read SIIOTP
126023-03 Sun Cluster 3.2: Sun Cluster HA for Java Web Server, Patch for Solaris 10
126026-02 Sun Cluster 3.2: HA-Kerberos Patch for Solaris 10
126032-05 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 10 Note: Please read SIIOTP
126035-05 Sun Cluster 3.2: HA-NFS Patch for Solaris 10
126044-04 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 10 Note: Please read SIIOTP
126047-10 Sun Cluster 3.2: Ha-Oracle patch for Solaris 10 Note: Please read SIIOTP
126050-03 Sun Cluster 3.2: HA-Oracle E-business suite Patch for Solaris 10
126059-04 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 10
126062-06 Sun Cluster 3.2: HA-SAP-WEB-AS Patch for Solaris 10
126068-05 Sun Cluster 3.2: HA-Sybase Patch for Solaris 10 Note: Please read SIIOTP
126080-03 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 10
126083-02 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 10
126092-03 Sun Cluster 3.2: HA-Websphere MQ Patch Note: Please read SIIOTP
126095-05 Sun Cluster 3.2: Localization patch for Solaris 9 sparc and Solaris 10 sparc
128556-03 Sun Cluster 3.2: Man Pages Patch for Solaris 9 and Solaris 10, sparc
139921-02 Sun Cluster 3.2: JFreeChart patch for Solaris 10


New additional included patch revisions of Sun Cluster 3.2 1/09 Update2 for Solaris 10 x86 05/08 update5 or higher
126107-28 Sun Cluster 3.2: CORE patch for Solaris 10_x86
Note:
Delivers SUNW.rac_framework:4, SUNW.crs_framework:2, SUNW.ScalMountPoint:3, SUNW.ScalDeviceGroup:3, SUNW.rac_svm:3 and SUNW.LogicalHostname:3 (but LogicalHostname was introduced in revision -17). Please read SIIOTP
125515-05 Sun Cluster 3.2: Solaris Volume Manager (Mediator) Patch
125993-03 Sun Cluster 3.2: SC Checks patch for Solaris 10_x86
126009-04 Sun Cluster 3.2: HA-DB Patch for Solaris 10_x86
126015-06 Sun Cluster 3.2: HA-Apache Patch for Solaris 10_x86
126018-04 Sun Cluster 3.2: HA-DNS Patch for Solaris 10_x86
126021-04 Sun Cluster 3.2: HA-Containers Patch for Solaris 10_x86 Note: Please read SIIOTP
126024-04 Sun Cluster 3.2: Sun Cluster HA for Java Web Server, Patch for Solaris 10_x86
126027-04 Sun Cluster 3.2: HA-Kerberos Patch for Solaris 10_x86
126033-06 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 10_x86 Note: Please read SIIOTP
126036-06 Sun Cluster 3.2: HA-NFS Patch for Solaris 10_x86
126045-05 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 10_x86 Note: Please read SIIOTP
126048-10 Sun Cluster 3.2: Ha-Oracle patch for Solaris 10_x86 Note: Please read SIIOTP
126060-05 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 10_x86
126063-07 Sun Cluster 3.2: HA-SAP-WEB-AS Patch for Solaris 10_x86
126069-04 Sun Cluster 3.2: HA_Sybase Patch for Solaris 10_x86 Note: Please read SIIOTP
126081-04 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 10_x86
126084-04 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 10_x86
126093-05 Sun Cluster 3.2: HA-Websphere MQ Patch for Solaris 10_x86 Note: Please read SIIOTP
126096-05 Sun Cluster 3.2: Localization patch for Solaris 10 amd64 ??
128557-03 Sun Cluster 3.2: Man Pages Patch for Solaris 10_x86
139922-02 Sun Cluster 3.2: JFreeChart patch for Solaris 10_x86


New additional included patch revisions of Sun Cluster 3.2 1/09 Update2 for Solaris 9 8/05 update8 or higher
126105-26 Sun Cluster 3.2: CORE patch for Solaris 9
Note:
Delivers SUNW.rac_udlm:3, SUNW.rac_framework:4, SUNW.crs_framework:2, SUNW.ScalMountPoint:3, SUNW.ScalDeviceGroup:3, SUNW.rac_svm:3, SUNW.rac_cvm:3 and SUNW.LogicalHostname:3 (but LogicalHostname was introduced in revision -18). Please read SIIOTP
125513-04 Sun Cluster 3.2: Solaris Volume Manager (Mediator) Patch
125991-03 Sun Cluster 3.2: SC Checks patch for Solaris 9
126007-02 Sun Cluster 3.2: HA-DB Patch for Solaris 9
126013-05 Sun Cluster 3.2: HA-Apache Patch for Solaris 9
126016-02 Sun Cluster 3.2: HA-DNS Patch for Solaris 9
126022-03 Sun Cluster 3.2: Sun Cluster HA for Java Web Server, Patch for Solaris 9
126031-05 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 9 Note: Please read SIIOTP
126034-05 Sun Cluster 3.2: HA-NFS Patch for Solaris 9
126043-04 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 9 Note: Please read SIIOTP
126046-10 Sun Cluster 3.2: HA-Oracle patch for Solaris 9 Note: Please read SIIOTP
126049-03 Sun Cluster 3.2: HA-Oracle E-business suite Patch for Solaris 9
126058-04 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 9
126061-06 Sun Cluster 3.2: HA-SAP-WEB-AS Patch for Solaris 9
126067-05 Sun Cluster 3.2: HA-Sybase Patch for Solaris 9 Note: Please read SIIOTP
126079-03 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 9
126082-02 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 9
126091-03 Sun Cluster 3.2: HA-Websphere MQ Patch Note: Please read SIIOTP
126095-05 Sun Cluster 3.2: Localization patch for Solaris 9 sparc and Solaris 10 sparc
128556-03 Sun Cluster 3.2: Man Pages Patch for Solaris 9 and Solaris 10, sparc
139920-02 Sun Cluster 3.2: JFreeChart patch for Solaris 9


The quorum server is an alternative to the traditional quorum disk. The quorum server runs outside of the Sun Cluster and is accessed through the public network. Therefore the quorum server can run on a different architecture.

Included patch revisions in Sun Cluster 3.2 1/09 Update2 for quorum server feature:
127404-02 Sun Cluster 3.2: Quorum Server Patch for Solaris 9
127405-03 Sun Cluster 3.2: Quorum Server Patch for Solaris 10
127406-03 Sun Cluster 3.2: Quorum Server Patch for Solaris 10_x86


If some patches must be applied when the node is in noncluster mode, you can apply them in a rolling fashion, one node at a time, unless a patch's instructions require that you shut down the entire cluster. Follow procedures in How to Apply a Rebooting Patch (Node) in Sun Cluster System Administration Guide for Solaris OS to prepare the node and boot it into noncluster mode. For ease of installation, consider applying all patches at once to a node that you place in noncluster mode.

Information about patch management is available at Oracle Enterprise Manager Ops Center.

Thursday Feb 12, 2009

scalable service does not failover after network outage

If a network outage hits the IPMP group which is part of the scalable resource group, the scalable resource can NOT fail over to the other host.

The issue only happens if one of the following Sun Cluster core patches is active.
126106-27 Sun Cluster 3.2: CORE patch for Solaris 10
126107-28 Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-26 Sun Cluster 3.2: CORE patch for Solaris 9
Because these patches are also part of the Sun Cluster 3.2 1/09 Update2 release, the issue also occurs on freshly installed Sun Cluster 3.2 1/09 Update2 systems.

The error can look as follows:
Feb 10 16:56:51 node1 in.mpathd[174]: NIC failure detected on e1000g0 of group ipmp0
Feb 10 16:56:51 node1 in.mpathd[174]: Successfully failed over from NIC e1000g0 to NIC e1000g4
Feb 10 16:57:18 node1 in.mpathd[174]: All Interfaces in group ipmp0 have failed
Feb 10 16:57:19 node1 SC[SUNW.apache:4.1,apache-rg,apache-rs,SSM_IPMP_CALLBACK]: IPMP group ipmp0 has failed, so scalable resource apache-rs in resource group apache-rg may not be able to respond to client requests. A request will be issued to relocate resource apache-rs off of this node.
Feb 10 16:57:23 node1 genunix: NOTICE: core_log: ssm_ipmp_callbac[2130] core dumped: /var/core/core.ssm_ipmp_callbac.2130.1227135261.0

Update 7.Apr.2009:
Solution: The bug 6774504 is fixed in
126106-28 or higher Sun Cluster 3.2: CORE patch for Solaris 10
126107-29 or higher Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-27 or higher Sun Cluster 3.2: CORE patch for Solaris 9
But the mentioned revisions of the patches still have trouble with the rgmd process. Please refer to memory leaks in "rgmd -z global" process.

Workaround: Use the previous revision -19 if using scalable services
126106-19 Sun Cluster 3.2: CORE patch for Solaris 10
126107-19 Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-19 Sun Cluster 3.2: CORE patch for Solaris 9

The issue is reported in bug 6774504 (description: scalable services coredump during the failover due to network failure). A fix is in progress. This blog will be updated when the fix is available.

Wednesday Jan 28, 2009

private interconnect and patch 138888/138889


In specific Sun Cluster 3.x configurations the cluster node cannot join the cluster. Most of the time this issue comes up after the installation of the kernel update patch
138888-01 until 139555-08 or higher SunOS 5.10: Kernel Patch OR
138889-01 until 139556-08 or higher SunOS 5.10_x86: Kernel Patch
AND
Sun Cluster 3.x using an Ethernet switch (with VLAN) for the private interconnect
AND
Sun Cluster 3.x using e1000g, nxge, bge or ixgb (GLDv3) interfaces for the private interconnect.

The issue looks similar to the following messages during the boot up of the cluster node.
...
Jan 25 15:46:14 node1 genunix: [ID 279084 kern.notice] NOTICE: CMM: node reconfiguration #2 completed.
Jan 25 15:46:15 node1 genunix: [ID 884114 kern.notice] NOTICE: clcomm: Adapter e1000g1 constructed
Jan 25 15:46:15 node1 ip: [ID 856290 kern.notice] ip: joining multicasts failed (18) on clprivnet0 - will use link layer broadcasts for multicast
Jan 25 15:46:16 node1 genunix: [ID 884114 kern.notice] NOTICE: clcomm: Adapter e1000g3 constructed
Jan 25 15:47:15 node1 genunix: [ID 604153 kern.notice] NOTICE: clcomm: Path node1:e1000g1 - node2:e1000g1 errors during initiation
Jan 25 15:47:15 node1 genunix: [ID 618107 kern.warning] WARNING: Path node1:e1000g1 - node2:e1000g1 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
Jan 25 15:47:16 node1 genunix: [ID 604153 kern.notice] NOTICE: clcomm: Path node1:e1000g3 - node2:e1000g3 errors during initiation
Jan 25 15:47:16 node1 genunix: [ID 618107 kern.warning] WARNING: Path node1:e1000g3 - node2:e1000g3 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
...
Jan 25 16:33:51 node1 genunix: [ID 224783 kern.notice] NOTICE: clcomm: Path node1:e1000g1 - node2:e1000g1 has been deleted
Jan 25 16:33:51 node1 genunix: [ID 638544 kern.notice] NOTICE: clcomm: Adapter e1000g1 has been disabled
Jan 25 16:33:51 node1 genunix: [ID 224783 kern.notice] NOTICE: clcomm: Path node1:e1000g3 - node2:e1000g3 has been deleted
Jan 25 16:33:51 node1 genunix: [ID 638544 kern.notice] NOTICE: clcomm: Adapter e1000g3 has been disabled
Jan 25 16:33:51 node1 ip: [ID 856290 kern.notice] ip: joining multicasts failed (18) on clprivnet0 - will use link layer broadcasts for multicast

Update 6.Mar.2009:
Available now:
Alert 1020193.1 Kernel Patches/Changes may Stop Sun Cluster Nodes From Joining the Cluster

Update 26.Jun.2009:
The issue is fixed in the patches
141414-01 or higher SunOS 5.10: kernel patch OR
137104-02 or higher SunOS 5.10_x86: dls patch

Both patches require the 13955[56]-08 kernel update patch, which is included in Solaris 10 5/09 update7. If using Solaris 10 5/09 update7, Sun Cluster 3.2 requires the Sun Cluster core patch in revision -33 or higher. So, to get this issue fixed, it's recommended to use Solaris 10 5/09 update7 (patch 13955[56]-08 or higher and 141414-01 (sparc) or 137104-02 (x86)) with the Sun Cluster 3.2 core patch -33 or higher.


Choose one of the following corrective actions (if you do not install the patch with the fix):
  • Before installing the mentioned patches, configure VLAN tagging on the Sun interface and on the switch. This makes VLAN-tagged packets expected and prevents drops. The interface name then changes to e.g. e1000g810000. After changing the configuration to e.g. e1000g810000 it's recommended to reboot the Sun Cluster hosts. Configuration details.

  • If using the above mentioned kernel update patch enable QoS (Quality of Service) on the Ethernet switch. The switch should be able to handle priority tagging. Please refer to the switch documentation because each switch is different.

  • Do not install the above mentioned kernel update patch if using VLAN in Sun Cluster 3.x private interconnect.

The mentioned kernel update patch delivers some new features in the GLDv3 architecture. It makes packets 802.1q standard compliant by including priority tagging. Therefore the following Sun Cluster 3.x configurations should not be affected.
* Sun Cluster 3.x which use ce, ge, hme, qfe, ipge or ixge network interfaces.
* Sun Cluster 3.x which have back-to-back connections for the private interconnect.
* Sun Cluster 3.x on Solaris 8 or Solaris 9.

Sunday Jan 18, 2009

ce_taskq_disable and Sun Cluster 3.x


The /etc/system setting "set ce:ce_taskq_disable=1" is a recurring topic of discussion with Sun Cluster 3.x. Some new features are now available which make this setting unnecessary. If the following conditions are met, remove "set ce:ce_taskq_disable=1" from /etc/system.


Overview: Two enhancement requests have been resolved:

6281341: ce_taskq_disable should be able to set on per instance basis. (fixed in Solaris patches)
6487117: Sun Cluster should automatically request for intr mode RX processing for private interconnects. (fixed in Sun Cluster patches)

These enhancements are integrated in
Solaris 10:
118777-12 SunOS 5.10: Sun GigaSwift Ethernet 1.0 driver patch (bundled in Solaris 10 5/08 Update5 onwards)
125915-01 SunOS 5.10: dlpi.h patch (obsoleted by 128004-01)
128004-01 SunOS 5.10: headerfile patch (bundled in Solaris 10 5/08 Update5 onwards)
120500-20 Sun Cluster 3.1: Core Patch for Solaris 10
or
126106-18 Sun Cluster 3.2: CORE patch for Solaris 10

Solaris 10_x86:
118778-11 SunOS 5.10_x86: Sun GigaSwift Ethernet 1.0 driver patch (bundled in Solaris 10 5/08 Update5 onwards)
125916-01 SunOS 5.10_x86: dlpi.h patch (obsoleted by 128005-01)
128005-01 SunOS 5.10_x86: headerfile patch (bundled in Solaris 10 5/08 Update5 onwards)
120501-20 Sun Cluster 3.1: Core Patch for Solaris 10_x86
or
126107-18 Sun Cluster 3.2: CORE patch for Solaris 10_x86

Solaris 9:
112817-32 SunOS 5.9: Sun GigaSwift Ethernet 1.0 driver patch
126849-01 SunOS 5.9: patch usr/include/sys/dlpi.h
117949-35 Sun Cluster 3.1: Core Patch for Solaris 9
or
126105-18 Sun Cluster 3.2: CORE patch for Solaris 9

Note: There are NO patches for Solaris 8 and Solaris 9_x86 because 6281341 is not backported to these releases.

Overall recommendation for Solaris 10: Due to some other issues it makes sense to use Solaris 10 10/08 Update6 (patches are bundled) instead of Solaris 10 5/08 Update5 and the mentioned Sun Cluster 3.x core patch.
This is especially due to Alert 1019642.1: Failure to run clock thread may lead to a system hang.


Otherwise if you can not install these patches the workaround is:

If using supported network adapters which use the ce network driver for the private interconnect, uncomment (activate) in /etc/system:
set ce:ce_taskq_disable=1
The Sun Cluster installation automatically adds this value to the /etc/system file.
Additionally, consider using the following settings in case of performance issues in the public network. Beware: this tuning always depends on the network infrastructure!
set ce:ce_ring_size=1024
set ce:ce_comp_ring_size=4096

Note: If using the ce network driver only for the public network, the default value of ce_taskq_disable=0 is ok.


Need to know:
In case of Sun Cluster 3.1 and 3.2 remove the following entry from /etc/system if active.
set ce:ce_reclaim_pending=1
This value was only necessary for Sun Cluster 3.0.
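To see whether one of these ce settings is still active, /etc/system can be searched for uncommented entries. This is a minimal sketch, assuming that commented lines start with "*" (the /etc/system comment character) and that grep understands POSIX character classes. The function name ce_setting_active is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if "set ce:NAME=..." is active
# (not commented out) in the given system file.
# usage: ce_setting_active FILE NAME
ce_setting_active() {
    # A commented line starts with "*", so it cannot match the
    # anchored pattern below.
    grep "^[[:space:]]*set[[:space:]][[:space:]]*ce:$2[[:space:]]*=" "$1" >/dev/null
}
```

For example, "ce_setting_active /etc/system ce_reclaim_pending && echo remove it" would print a reminder on a Sun Cluster 3.1 or 3.2 node where the old entry is still active.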


Further details are available in Technical Instruction 1017839.1

Thursday Dec 11, 2008

Tips to configure IPMP with Sun Cluster 3.x

Configure IPMP (probe based or link based):
Set up IPMP (IP network multipathing) groups on all nodes for all public network interfaces which are used for an HA dataservice. This article summarizes the possibilities and known issues. An overview of IPMP can be found in System Administration Guide: IP Services.


Example probe-based IPMP group active-active with interfaces qfe0 and qfe4 with one production IP:

Entry of /etc/hostname.qfe0:
<production_IP_host> netmask + broadcast + group ipmp1 up \
addif <test_IP_host> netmask + broadcast + deprecated -failover up

Entry of /etc/hostname.qfe4:
<test_IP_host> netmask + broadcast + group ipmp1 deprecated -failover up
The IPMP group name ipmp1 is freely chosen in this example!

If the defaultrouter is NOT 100% available please read
Technical Instruction 1010640.1: Summary of typical IPMP Configurations
and
Technical Instruction 1001790.1: The differences between Network Adapter Failover (Sun Cluster 3.0) and IP Multipathing (Sun Cluster 3.1)

Notes:
* Do not use the test IP for normal applications.
* When using Solaris 9 12/02 or later & Sun Cluster 3.1 Update1 or later there is no need for an IPMP test address if you have only 1 IP address in the IPMP group. (RFE 4511634, 4741473)
  e.g. /etc/hostname.qfe0 entry:
    <production_IP_host> netmask + broadcast + group ipmp1 up
* The test IPs for all adapters in the same IPMP group must belong to a single IP subnet.


Example link-based IPMP group active-active with interfaces qfe0 and qfe4 with one production IP:

Entry of /etc/hostname.qfe0:
<production_IP_host> netmask + broadcast + group ipmp1 up
Entry of /etc/hostname.qfe4:
<dummy_IP_host> netmask + broadcast + deprecated group ipmp1 up

Notes:
* Do NOT use the 0.0.0.0 IP address as dummy_IP_host for link-based IPMP due to bug 6457375.
* At this time the recommendation is to use a valid IP address, but it can also be a dummy IP address.

The bug 6457375 is fixed in kernel update patch 138888-01 (sparc) or 138889-01 (x86). These kernel patches are based on Solaris 10 10/08 Update6. Now it's possible to use the 0.0.0.0 IP address as described in the following example:
Entry of /etc/hostname.qfe0:
<production_IP_host> netmask + broadcast + group ipmp1 up
Entry of /etc/hostname.qfe4:
group ipmp1 up

Further Details:
Technical Instruction 1008064.1: IPMP Link-based Only Failure Detection with Solaris 10 Operating System (OS)


Hints / Checkpoints for all configurations:
  • You need an additional IP for each logical host.

  • If there is a firewall between the clients and an HA service running on this cluster, and if this HA service uses UDP and does not bind to a specific address, the IP stack chooses the source address for all outgoing packets from the routing table. Because there is no guarantee that the same source address is chosen for all packets (the routing table might change), it is necessary to configure all addresses available on a network interface as valid source addresses in the firewall. More details can be found in the blog Why a logical IP is marked as DEPRECATED?

  • An active-standby configuration of IPMP groups is also possible.

  • In the /etc/default/mpathd file, the value of TRACK_INTERFACES_ONLY_WITH_GROUPS must be yes (default).

  • In case of Sun Cluster 3.1 the value of FAILBACK in the /etc/default/mpathd file must be yes (default) due to bug 6429808, which is fixed in Sun Cluster 3.2.

  • Use only one IPMP group in the same subnet. It's not supported to use more IPMP groups in the same subnet.

  • The SC installer adds an IPMP group to all public network adapters. If desired remove the IPMP configuration for network adapters that will NOT be used for HA dataservices.

  • Remove IPMP groups from dman interfaces (SunFire 12/15/20/25K) if they exist. (Bug 6309869)
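The two /etc/default/mpathd checkpoints from the list above can be verified with a small script. This is a sketch only: the function names mpathd_value and check_mpathd are hypothetical, the file is assumed to contain plain KEY=value lines, and a key that is not set is assumed to fall back to its default of yes. Note that the FAILBACK requirement applies to Sun Cluster 3.1 only.

```shell
#!/bin/sh
# Hypothetical helpers for checking /etc/default/mpathd.

# Print the value of KEY from FILE (last uncommented assignment wins).
# usage: mpathd_value FILE KEY
mpathd_value() {
    sed -n "s/^[[:space:]]*$2=\([^[:space:]#]*\).*/\1/p" "$1" | tail -n 1
}

# Report every checked key whose value is not "yes".
# Returns 1 if at least one such key is found.
# usage: check_mpathd [FILE]
check_mpathd() {
    file=${1:-/etc/default/mpathd}
    rc=0
    for key in TRACK_INTERFACES_ONLY_WITH_GROUPS FAILBACK; do
        val=$(mpathd_value "$file" "$key")
        # Assumption: an unset key means the default (yes) is used.
        if [ -n "$val" ] && [ "$val" != "yes" ]; then
            echo "$key=$val (expected yes)"
            rc=1
        fi
    done
    return $rc
}
```

Running check_mpathd without an argument checks /etc/default/mpathd on the local node and prints nothing when both values are yes.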

Thursday Oct 02, 2008

Sun Cluster 3.2 and VxVM 5.0 patch 124361-06

Some issues arise after you have installed the

Patch-ID# 124361-06
Synopsis: VRTSvxvm 5.0_MP1_RP5: Rolling Patch 5 for Volume Manager 5.0 MP1


This patch changes the handling of VxVM devices, which leads to conflicts with Sun Cluster 3.2.

Seen errors:
a)
host0 # scswitch -z -D testdg -h host1
Sep 26 10:20:17 host0 Cluster.CCR: build_devlink_list: readlink failed for /dev/vx/dsk//global/.devices/node@1/dev/vx/dsk/testdg: No such file or directory

Sep 26 10:23:41 host0 SC[SUNW.HAStoragePlus:6,test-rg,test-hastp-rs,hastorageplus_prenet_start]: Failed to analyze the device special file associated with file system mount point /test/data/AB: No such file or directory


b)
host0 # clrg create test-rg
host0 # clresource create -g test-rg -t SUNW.HAStoragePlus -p FileSystemMountPoints="/testdata" test-rs
clresource: host1 - Failed to analyze the device special file associated with file system mount point /testdata: No such file or directory.

clresource: (C189917) VALIDATE on resource test-rs, resource group test-rg, exited with non-zero exit status.
clresource: (C720144) Validation of resource test-rs in resource group test-rg on node node1 failed.
clresource: (C891200) Failed to create resource "test-rs".


On the other node:
host1# Sep 26 14:27:38 host1 SC[SUNW.HAStoragePlus:4,test-rg,test-rs,hastorageplus_validate]: Failed to analyze the device special file associated with file system mount point /testdata: No such file or directory.
Sep 26 14:27:38 host1 Cluster.RGM.rgmd: VALIDATE failed on resource , resource group , time used: 0% of timeout <1800, seconds>

Workaround:
Do not install patch 124361-06; use patch 124361-05 instead.
Important if patch is already installed: Before backing out 124361-06 ensure that Solaris 10 patch 125731-02 is installed to avoid bug 6622037.


Update 10.Oct.2008: New patch 122058-11 is released which fixes the problem and obsoletes 124361-06
Patch-ID# 122058-11
Synopsis: VRTSvxvm 5.0MP3: Maintenance Patch for Volume Manager 5.0


Update 24.Oct.2008:
Basically, all the problems arise when 124361-06 is installed and a VxVM volume is created in a Sun Cluster configuration. With patch 124361-06, creating a VxVM volume creates the special device under /devices, and symbolic links under /dev/vx/[r]dsk/<dg>/ point to the /devices entries. This behaviour does not happen when 122058-11 is installed: the special files are created under /dev/vx/[r]dsk/<dg>/ and NOT under /devices.

Check if the devices are correct. It's quite important that there are NO links in the mentioned directory of a device group. Two workarounds are available if the wrong links exist. Be sure that 122058-11 is already installed and that the volumes are inactive.
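Looking for leftover links in the device group directories can be scripted with find. The following is a minimal sketch: the function name list_vx_symlinks is hypothetical, and the directory layout (dsk/<dg> and rdsk/<dg> below a base directory) is taken from the paths in this entry.

```shell
#!/bin/sh
# Hypothetical helper: print all symbolic links found in the dsk and
# rdsk directories of a VxVM device group.
# usage: list_vx_symlinks DEVICE_GROUP [BASE_DIR]
list_vx_symlinks() {
    dg=$1
    base=${2:-/dev/vx}
    for sub in dsk rdsk; do
        dir=$base/$sub/$dg
        [ -d "$dir" ] || continue
        find "$dir" -type l -print
    done
}
```

For example, "list_vx_symlinks testdg /global/.devices/node@1/dev/vx" lists the links of device group testdg in the global devices namespace of node 1. No output means that no links exist there.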

Workaround1:
node1# cd /global/.devices/node@1/dev/vx/dsk/testdg
node1# ls -l
total 4
lrwxrwxrwx 1 root root 46 Oct 15 16:27 vol01 -> /devices/pseudo/vxio@0:testdg,vol01,59000,blk
lrwxrwxrwx 1 root root 46 Oct 15 16:27 vol02 -> /devices/pseudo/vxio@0:testdg,vol02,59001,blk
node1# rm vol01 vol02
node1# cd /global/.devices/node@1/dev/vx/rdsk/testdg
node1# ls -l
total 4
lrwxrwxrwx 1 root root 46 Oct 15 16:27 vol01 -> /devices/pseudo/vxio@0:testdg,vol01,59000,raw
lrwxrwxrwx 1 root root 46 Oct 15 16:27 vol02 -> /devices/pseudo/vxio@0:testdg,vol02,59001,raw
node1# rm vol01 vol02
node1#
node1# cldg sync testdg
node1#
node1# ls -l /global/.devices/node@1/dev/vx/dsk/testdg
total 0
brw------- 1 root root 282, 59000 Oct 15 16:32 vol01
brw------- 1 root root 282, 59001 Oct 15 16:32 vol02
node1# ls -l /global/.devices/node@1/dev/vx/rdsk/testdg
total 0
crw------- 1 root root 282, 59000 Oct 15 16:32 vol01
crw------- 1 root root 282, 59001 Oct 15 16:32 vol02

Workaround2:
If symbolic links exist, remove them
node1# rm /dev/vx/[r]dsk/testdg/symlink
and then recreate the special files using the cluster command
node1# /usr/cluster/lib/dcs/scvxvmlg

Afterwards, to be on the safe side, a reconfiguration boot of all nodes is recommended.

Update 30.Oct.2008:
Available now:
Alert 1019694.1 Sun Cluster Resource "HAstoragePlus" May Fail if Veritas Volume Manager Patch 124361-06 is Installed

Wednesday Aug 06, 2008

Sun SPARC Enterprise Mx000 with active bge interface

The Sun SPARC Enterprise Server M4000, M5000, M8000 or M9000 can sporadically hang at boot time
a) if the system is part of Sun Cluster
and
b) if the system has a configured bge network interface


Example of boot hang:
...
Booting as part of a cluster
NOTICE: CMM: Node node1 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node node2 (nodeid = 2) with votecount = 1 added.
NOTICE: CMM: Quorum device 2 (/dev/did/rdsk/d7s2) added; votecount = 5, bitmask of nodes with configured paths = 0x3f.
NOTICE: clcomm: Adapter bge3 constructed
... now the system hangs at this point ...


Solution: Install 138042-02 (or higher) of SunOS 5.10: MAC patch

Monday Jun 30, 2008

prevent reservation conflict panic if using active/passive storage controller

Reservation conflicts can happen in a Sun Cluster environment if using active/passive storage controllers e.g. SE6540, SE6140, FLX380.

First of all you should always consider disabling the auto-failback flag if using MPxIO on shared devices. This can also prevent reservation conflict panics.

Change the auto-failback value in /kernel/drv/scsi_vhci.conf to disable.
e.g. /kernel/drv/scsi_vhci.conf:
...
# Automatic failback configuration
# possible values are auto-failback="enable" or auto-failback="disable"
auto-failback="disable";
...
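The current setting can be read back from the file with a short script. This is a sketch, assuming the auto-failback line has the form shown in the excerpt above. The function name auto_failback_value is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the active auto-failback value
# ("enable" or "disable") from a scsi_vhci.conf file.
# Comment lines start with "#" and are skipped by the anchored pattern.
# usage: auto_failback_value [FILE]
auto_failback_value() {
    sed -n 's/^[[:space:]]*auto-failback[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' \
        "${1:-/kernel/drv/scsi_vhci.conf}"
}
```

Called without an argument it reads /kernel/drv/scsi_vhci.conf. The output should be "disable" on all cluster nodes which use MPxIO on shared devices.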


Furthermore the reservation conflict panic was seen when one cluster node is down and the shared storage array made some (at least 2 or 3) failovers between the active/passive controllers. The behavior always depends on the design of the storage array controller.

Two workarounds are available at the moment:

1.) In case of Sun Cluster 3.2 force the cluster to do scsi3 reservations even in 2 node cluster configurations. If you have a cluster with 3 or more nodes, the cluster should do scsi3 reservations anyway.

Be aware of Alert 1019005.1. In case of SE6540/SE6140/FLX380 use firmware 6.60.11.xx (which is part of CAM 6.1) or higher. To avoid trouble update this code before enabling SCSI3 reservations.

To force the Sun Cluster 3.2 to do scsi3 reservations run the command:
# cluster set -p global_fencing=prefer3

Verify the setting using :
# cluster show | grep -i scsi
   Type:                       scsi
   Access Mode:        scsi3


2.) Allow Reservation on Unowned LUNs in SE6540/SE6140. You should prefer the workaround #1 but in case of Sun Cluster 3.1 you can not force scsi3 reservation mechanism for 2 node clusters. So, there is a need to use scsi2 reservations.

The bit "Allow Reservation on Unowned LUNs" determines the controller response to Reservation/Release commands that are received for LUNs that are not owned by the controller. The value needs to be changed from 0x01 to 0x00. Beware this setting will be lost after a NVSRAM update!

Using CAM management software do the following:
# cd /opt/SUNWsefms/bin/

For 6540/FLX380/FLX240/FLX280 run:
# ./service -d -c set -q nvsram region=0xf2 offset=0x19 value=0x00 host=0x02

For 6140 and 6130 run:
# ./service -d -c set -q nvsram region=0xf2 offset=0x19 value=0x00 host=0x00

Reboot both controllers in order to make the change active :
# ./service -d -c reset -t a

Wait at least 5 minutes until the A controller is up again.
# ./service -d -c reset -t b


Why did this not happen before? With the changes of patch 125081-14 (sparc) or 125082-14 (x86) Sun delivers a new driver for MPxIO. Due to these changes the problem can be triggered.

Tuesday May 27, 2008

Missing preremove script in Sun Cluster 3.2 core patch revision 12 and higher.

In my last blog I stated that the Sun Cluster 3.2 GA release with the -12 Sun Cluster core patch is the same as Sun Cluster 3.2 2/08 aka Update1. This is still true, but the preremove script of the SUNWscr package is missing in the Sun Cluster 3.2 core patch revision -12 and higher. This is documented as internal bug 6676771. Therefore it's NOT possible to remove the SUNWscr package when revision -12 or higher of the Sun Cluster 3.2 core patch is installed. (Lower revisions of the core patches are NOT affected.) Removing the SUNWscr package is necessary in case of an upgrade with the command "scinstall -u update".


The fastest workaround is described in the special install instructions of the Sun Cluster core patch revision -12:
NOTE 5: After removing this patch, remove the SunCluster smf service for
        service tag.
        svcadm disable /system/cluster/sc_svtag:default
        svccfg delete /system/cluster/sc_svtag:default
Execute these commands before the start of Sun Cluster upgrade.


To fix the issue immediately it's possible to change the preremove script of SUNWscr package. At the moment the preremove script of SUNWscr will NOT be delivered with the Sun Cluster core patch. Therefore the workaround is persistent.

Add the following to the /var/sadm/pkg/SUNWscr/install/preremove script (version 1.3):

1.) New subroutine (before the main part)
remove_svtag()
{
      STCLIENT=${BASEDIR}/usr/bin/stclient
      CL_URN_FILE=${BASEDIR}/var/sadm/servicetag/cl.urn
      if [ -f ${CL_URN_FILE} ]; then
         # read the urn from the file
         URN=`cat ${CL_URN_FILE}`
         if [ -f ${STCLIENT} ]; then
         ${STCLIENT} -d -i ${URN} >/dev/null 2>&1
         fi
         rm -f ${CL_URN_FILE}
      fi
      return 0
}


2.) In the part of SVCADMD="/usr/sbin/svcadm disable -s" add
$SVCADMD svc:/system/cluster/sc_svtag:default

3.) Near the end of main routine before the line "if [ ${errs} -ne 0 ]; then" add
# Remove service tag for cluster
remove_svtag || errs=`expr ${errs} + 1`


Or download the new preremove version 1.5 script for SUNWscr package and replace the 1.3 version.
# cd /var/sadm/pkg/SUNWscr/install/
# cp preremove preremove.old
# cp preremove_version1.5_SUNWscr preremove

Thursday Apr 10, 2008

Sun Cluster 3.2 2/08 Update1 Patches

Sun Cluster 3.2 2/08 Update1 has been released.

The package versions of Sun Cluster 3.2 2/08 Update1 are the same for the core framework and the agents as in Sun Cluster 3.2. Therefore it's possible to patch up an existing Sun Cluster 3.2.

The package versions of Sun Cluster Geographic Edition 3.2 2/08 Update1 are NOT the same as in Sun Cluster Geographic Edition 3.2. Therefore an upgrade is necessary for the Geographic Edition.

The following patches (with the mentioned revisions) are included in Sun Cluster 3.2 2/08 Update1. If these patches are installed on the Sun Cluster 3.2 release, then the features of framework & agents are identical. It's always necessary to read the "Special Install Instructions of the patch" (shortcut: SIIOTP), but I added a note behind the patches where reading them is especially important. Furthermore, I added a note when a patch delivers a new resource type.

Included patch revisions in Sun Cluster 3.2 2/08 Update1 for Solaris 10 11/06 update3 or higher
125511-02 Sun Cluster 3.2: Core Patch for Solaris 10
126106-12 Sun Cluster 3.2: CORE patch for Solaris 10 Note: Deliver SUNW.HAStoragePlus:6 Please read SIIOTP
125508-06 Sun Cluster 3.2: Manageability and Serviceability Agent for Solaris 10 Note: Please read SIIOTP
125514-02 Sun Cluster 3.2: Solaris Volume Manager (Mediator) Patch
125517-04 Sun Cluster 3.2: OPS Core Patch for Solaris 10 Note: Deliver SUNW.rac_framework:3 Please read SIIOTP
125992-01 Sun Cluster 3.2: SC Checks patch for Solaris 10
125998-01 Sun Cluster 3.2: Sun Cluster Reliable Datagram Transport Patch
126008-01 Sun Cluster 3.2: HA-DB Patch for Solaris 10
126011-02 Sun Cluster 3.2: HA-DHCP Patch for Solaris 10
126014-03 Sun Cluster 3.2: Ha-Apache Patch for Solaris 10
126017-01 Sun Cluster 3.2: HA-DNS Patch for Solaris 10
126020-02 Sun Cluster 3.2: HA-Containers Patch for Solaris 10 Note: Please read SIIOTP
126023-02 Sun Cluster 3.2: Sun Cluster HA for Java Web Server, Patch for Solaris 10
126026-01 Sun Cluster 3.2: HA-Kerberos Patch for Solaris 10
126029-01 Sun Cluster 3.2: HA-LiveCache Patch for Solaris 10
126032-03 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 10 Note: Please read SIIOTP
126035-02 Sun Cluster 3.2: HA-NFS Patch for Solaris 10
126041-01 Sun Cluster 3.2: HA-N1 Grid Engine Patch for Solaris 10
126044-02 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 10 Note: Please read SIIOTP
126047-05 Sun Cluster 3.2: Ha-Oracle patch for Solaris 10 Note: Please read SIIOTP
126050-02 Sun Cluster 3.2: HA-Oracle E-business suite Patch for Solaris 10
126053-01 Sun Cluster 3.2: HA-Oracle Application Server Patch for Solaris 10
126056-01 Sun Cluster 3.2: HA-SAP Patch for Solaris 10 Note: Deliver SUNW.sap_as:4 Please read SIIOTP
126059-03 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 10
126062-03 Sun Cluster 3.2: HA-SAP-WEB-AS Patch for Solaris 10
126065-03 Sun Cluster 3.2: HA-Siebel Patch for Solaris 10
126068-03 Sun Cluster 3.2: HA-Sybase Patch for Solaris 10 Note: Deliver SUNW.sybase:5 Please read SIIOTP
126071-01 Sun Cluster 3.2: HA-Tomcat Patch for Solaris 10
126074-01 Sun Cluster 3.2: HA-BEA WebLogic Patch for Solaris 10
126080-02 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 10
126083-01 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 10
126092-02 Sun Cluster 3.2: HA-Websphere MQ Patch
126095-04 Sun Cluster 3.2: Localization patch for Solaris 9 sparc and Solaris 10 sparc
128481-02 Sun Cluster 3.2: Grid Service Provisioning for Solaris 10
128556-01 Sun Cluster 3.2: Man Pages Patch for Solaris 9 and Solaris 10, sparc


Included patch revisions in Sun Cluster 3.2 2/08 Update1 for Solaris 10 x86 11/06 update3 or higher
125512-02 Sun Cluster 3.2: Core Patch for Solaris 10_x86
126107-12 Sun Cluster 3.2: CORE patch for Solaris 10_x86 Note: Deliver SUNW.HAStoragePlus:6 Please read SIIOTP
125509-06 Sun Cluster 3.2: Manageability and Serviceability Agent for Solaris 10_x86 Note: Please read SIIOTP
125515-02 Sun Cluster 3.2: Solaris Volume Manager (Mediator) Patch
125518-05 Sun Cluster 3.2: OPS Core Patch for Solaris 10_x86 Note: Deliver SUNW.rac_framework:3 Please read SIIOTP
125993-01 Sun Cluster 3.2: Sun Cluster 3.2: SC Checks patch for Solaris 10_x86
126009-03 Sun Cluster 3.2: HA-DB Patch for Solaris 10_x86
126012-03 Sun Cluster 3.2: HA-DHCP Patch for Solaris 10_x86
126015-04 Sun Cluster 3.2: HA-Apache Patch for Solaris 10_x86
126018-03 Sun Cluster 3.2: HA-DNS Patch for Solaris 10_x86
126021-02 Sun Cluster 3.2: HA-Containers Patch for Solaris 10_x86 Note: Please read SIIOTP
126024-03 Sun Cluster 3.2: Sun Cluster HA for Java Web Server, Patch for Solaris 10_x86
126027-03 Sun Cluster 3.2: HA-Kerberos Patch for Solaris 10_x86
126030-03 Sun Cluster 3.2: HA-LiveCache Patch for Solaris 10_x86
126033-04 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 10_x86 Note: Please read SIIOTP
126036-03 Sun Cluster 3.2: HA-NFS Patch for Solaris 10_x86
126042-01 Sun Cluster 3.2: HA-N1 Grid Engine Patch for Solaris 10_x86
126045-03 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 10_x86 Note: Please read SIIOTP
126048-06 Sun Cluster 3.2: Ha-Oracle patch for Solaris 10_x86 Note: Please read SIIOTP
126054-02 Sun Cluster 3.2: HA-Oracle Application Server Patch for Solaris 10_x86
126057-03 Sun Cluster 3.2: HA-SAP Patch for Solaris 10_x86 Note: Deliver SUNW.sap_as:4 Please read SIIOTP
126060-04 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 10_x86
126063-04 Sun Cluster 3.2: HA-SAP-WEB-AS Patch for Solaris 10_x86
126069-02 Sun Cluster 3.2: HA_Sybase Patch for Solaris 10_x86 Note: Deliver SUNW.sybase:5 Please read SIIOTP
126072-01 Sun Cluster 3.2: HA-Tomcat Patch for Solaris 10_x86
126075-03 Sun Cluster 3.2: HA-BEA WebLogic Patch for Solaris 10_x86
126081-03 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 10_x86
126084-03 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 10_x86
126090-01 Sun Cluster 3.2: HA Websphere Message Broker Patch for Solaris 10_x86
126093-04 Sun Cluster 3.2: HA-Websphere MQ Patch for Solaris 10_x86
126096-04 Sun Cluster 3.2: Localization patch for Solaris 10 amd64
128482-02 Sun Cluster 3.2: Grid Service Provisioning for Solaris 10_x86
128557-01 Sun Cluster 3.2: Man Pages Patch for Solaris 10_x86


Included patch revisions in Sun Cluster 3.2 2/08 Update1 for Solaris 9 8/05 update8 or higher
125510-02 Sun Cluster 3.2: Core Patch for Solaris 9
126105-12 Sun Cluster 3.2: CORE patch for Solaris 9 Note: Deliver SUNW.HAStoragePlus:6 Please read SIIOTP
125507-06 Sun Cluster 3.2: Manageability and Serviceability Agent for Solaris 9 Note: Please read SIIOTP
125513-01 Sun Cluster 3.2: Solaris Volume Manager (Mediator) Patch
125516-04 Sun Cluster 3.2: OPS Core Patch for Solaris 9 Note: Deliver SUNW.rac_framework:3 Please read SIIOTP
125597-01 Sun Cluster 3.2: Sun Cluster Reliable Datagram Transport Patch
125991-01 Sun Cluster 3.2: Sun Cluster 3.2: SC Checks patch for Solaris 9
126004-01 Sun Cluster 3.2: HA-Agfa Patch for Solaris 9
126007-01 Sun Cluster 3.2: HA-DB Patch for Solaris 9
126010-02 Sun Cluster 3.2: HA-DHCP Patch for Solaris 9
126013-03 Sun Cluster 3.2: HA-Apache Patch for Solaris 9
126016-01 Sun Cluster 3.2: HA-DNS Patch for Solaris 9
126022-02 Sun Cluster 3.2: Sun Cluster HA for Java Web Server, Patch for Solaris 9
126025-01 Sun Cluster 3.2: HA-Oracle Application Server Patch for Solaris 9
126028-01 Sun Cluster 3.2: HA-LiveCache Patch for Solaris 9
126031-03 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 9 Note: Please read SIIOTP
126034-02 Sun Cluster 3.2: HA-NFS Patch for Solaris 9
126040-01 Sun Cluster 3.2: HA-N1 Grid Engine Patch for Solaris 9
126043-02 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 9 Note: Please read SIIOTP
126046-05 Sun Cluster 3.2: HA-Oracle patch for Solaris 9 Note: Please read SIIOTP
126049-02 Sun Cluster 3.2: HA-Oracle E-business suite Patch for Solaris 9
126055-01 Sun Cluster 3.2: HA-SAP Patch for Solaris 9 Note: Deliver SUNW.sap_as:4 Please read SIIOTP
126058-03 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 9
126061-03 Sun Cluster 3.2: HA-SAP-WEB-AS Patch for Solaris 9
126064-03 Sun Cluster 3.2: HA-Siebel Patch for Solaris 9
126067-03 Sun Cluster 3.2: HA-Sybase Patch for Solaris 9 Note: Deliver SUNW.sybase:5 Please read SIIOTP
126070-01 Sun Cluster 3.2: HA-Tomcat Patch for Solaris 9
126073-01 Sun Cluster 3.2: HA-BEA WebLogic Patch for Solaris 9
126079-02 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 9
126082-01 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 9
126085-02 Sun Cluster 3.2: HA-SWIFTAlliance Access Patch Note: Please read SIIOTP
126091-02 Sun Cluster 3.2: HA-Websphere MQ Patch
126095-04 Sun Cluster 3.2: Localization patch for Solaris 9 sparc and Solaris 10 sparc
126097-01 Sun Cluster 3.2: HA-SWIFTAllianceGateway patch for Solaris 9
128480-02 Sun Cluster 3.2: Grid Service Provisioning for Solaris 9
128556-01 Sun Cluster 3.2: Man Pages Patch for Solaris 9 and Solaris 10, sparc


The quorum server is an alternative to the traditional quorum disk. The quorum server is outside of the Sun Cluster and will be accessed through the public network. Therefore the quorum server can be a different architecture.

Included patch revisions in Sun Cluster 3.2 2/08 Update1 for quorum server feature:
127404-01 Sun Cluster 3.2: Quorum Server Patch for Solaris 9
127405-02 Sun Cluster 3.2: Quorum Server Patch for Solaris 10
127406-02 Sun Cluster 3.2: Quorum Server Patch for Solaris 10_x86
127408-01 Sun Cluster 3.2: Quorum Man Pages Patch for Solaris 9 and Solaris 10, sparc
127409-01 Sun Cluster 3.2: Quorum Man Pages Patch for Solaris 10_x86


If some patches must be applied when the node is in noncluster mode, you can apply them in a rolling fashion, one node at a time, unless a patch's instructions require that you shut down the entire cluster. Follow procedures in How to Apply a Rebooting Patch (Node) in Sun Cluster System Administration Guide for Solaris OS to prepare the node and boot it into noncluster mode. For ease of installation, consider applying all patches at once to a node that you place in noncluster mode.

Information about patch management is available at Oracle Enterprise Manager Ops Center.

Sunday Mar 09, 2008

The nscd does not cache hosts for Sun Cluster

After installation of patch 120011-14 (Solaris 10 sparc) or the patch 120012-14 (Solaris 10 x86) the nscd does not cache hosts for Sun Cluster configuration. The Solaris 10 8/07 Update4 is also affected because the mentioned patches are bundled.


By default the 'cluster' database is the first entry for hosts and netmasks in the file /etc/nsswitch.conf. The function _nscd_get_smf_state() does not recognise 'cluster' as a backend, so the cluster nss entry is not invoked by the nscd process and hence is not cached by nscd. This issue is due to the new interface between the nss switch engine and the backend layers, which was introduced by the Sparks project.

Update 2.Mar 2009 - Start -
The issue is fixed in the following patches for sparc:
126106-27 Sun Cluster 3.2: CORE patch for Solaris 10 (included in Sun Cluster 3.2 1/09 update2)
138263-03 SunOS 5.10: nscd patch (Solaris 10 10/08 Update6 only includes 138263-02)
or for x86:
126107-28 Sun Cluster 3.2: CORE patch for Solaris 10_x86 (included in Sun Cluster 3.2 1/09 update2)
138264-03 SunOS 5.10_x86: nscd patch (Solaris 10 10/08 Update6 only includes 138264-02)
This means that if you have the mentioned patches installed, the 'cluster' database should be the first entry for hosts and netmasks in the file /etc/nsswitch.conf.
Update 2.Mar 2009 - End -


Known issues which occur:
- slow name lookups
- Applications linked with '-library=stlport4' abort on gethostbyname

There is no general approach to identify the issue. But if you suspect a hostname resolution issue, turn on the nscd debug mode:
  % more /etc/nscd.conf
  [...]
       logfile         /var/adm/nscd.log    <---- uncomment
  #    enable-cache    hosts           no
       debug-level     9                    <---- set debug level here
  [...]

  Then stop and restart nscd.
  # svcadm restart svc:/system/name-service-cache:default


Workarounds:

A) Example of a default configuration for a 2-node Sun Cluster 3.x

Add private cluster interconnect addresses to hosts, netmasks and remove 'cluster' from nsswitch.conf on ALL nodes.
Add to /etc/hosts:
172.16.0.129   clusternode1-priv-physical1
172.16.1.1     clusternode1-priv-physical2
172.16.4.1     clusternode1-priv
172.16.0.130   clusternode2-priv-physical1
172.16.1.2     clusternode2-priv-physical2
172.16.4.2     clusternode2-priv
Add to /etc/netmasks:
172.16.0.128       255.255.255.128
172.16.1.0         255.255.255.128
172.16.4.0         255.255.254.0
Remove 'cluster' entry for hosts and netmasks in /etc/nsswitch.conf e.g:
hosts: files <any other hosts database>
netmasks: files <any other netmasks database>

For non-default installations or configurations with more than 2 nodes, use the next workaround, or identify the values with the commands 'ifconfig' and 'scconf -pvv|grep -i private' on all nodes.


B) For all individual Sun Cluster configurations.


 1) Backup configuration files.
   # cp /etc/nsswitch.conf /etc/nsswitch.conf.cluster
   # cp /etc/inet/hosts /etc/inet/hosts.cluster
   # cp /etc/netmasks /etc/netmasks.cluster

 2) Add Private Cluster interconnect addresses to each cluster node's local /etc/hosts file.
  NOTE: Make sure the 'cluster' is still in nsswitch.conf for the hosts entry whilst performing the following.
   # getent hosts clusternode1-priv-physical1 >>/etc/hosts
   # getent hosts clusternode1-priv-physical2 >>/etc/hosts
   # getent hosts clusternode1-priv >>/etc/hosts
   # getent hosts clusternode2-priv-physical1 >>/etc/hosts
   # getent hosts clusternode2-priv-physical2 >>/etc/hosts
   # getent hosts clusternode2-priv >>/etc/hosts
   The above is an example for a default two node cluster with two private interconnects. If you have more nodes, more interconnects or non-default hostnames then identify them by using 'ifconfig' and 'scconf -pvv|grep -i private' on all nodes.
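For more nodes or more interconnects the list of getent calls gets long. A loop can generate the default names instead. This is a minimal sketch, assuming bash and the default naming scheme shown above (clusternodeN-priv-physicalM and clusternodeN-priv); the function name priv_hostnames is made up for illustration.

```shell
# Print the default private hostnames for a cluster with
# $1 nodes and $2 private interconnects per node.
priv_hostnames() {
    nodes=$1
    links=$2
    n=1
    while [ "$n" -le "$nodes" ]; do
        l=1
        while [ "$l" -le "$links" ]; do
            echo "clusternode${n}-priv-physical${l}"
            l=$((l + 1))
        done
        echo "clusternode${n}-priv"
        n=$((n + 1))
    done
}

# Usage on a cluster node, while 'cluster' is still in nsswitch.conf.
# Collect the entries in a scratch file first and review them:
#   for h in $(priv_hostnames 2 2); do getent hosts "$h"; done > /tmp/hosts.priv
#   cat /tmp/hosts.priv >> /etc/hosts
```

Non-default private hostnames are not covered by this loop; identify those with 'scconf -pvv|grep -i private' as described above.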

 3) Add private cluster interconnect netmasks to netmasks.
   Note that the netmasks file should contain the network number in the first column and the corresponding netmask in the second column.
   The following script will collect these from 'cluster' before you remove it in the next step:
   # ifconfig -a | nawk '/flags/&&!/PRIVATE/{p=0}/flags/&&/PRIVATE/{p=1} \
   p==1&&$3 ~ /netmask/{d=0;h=tolower($4);j=length(h); \
   for (i=1;i <= j; i++) {
     d=d*16+index("123456789abcdef",substr(h,i,1));
     if (!(i%2)){n[(i/2)]=d;d=0}
   } al=split($2,a,".");bl=split($6,b,"."); \
   for (i=1; i<5; i++) if(n[i] != 255) a[i]=b[i] - (255 - n[i]);
     printf("%d.%d.%d.%d\t%d.%d.%d.%d\n",
     a[1],a[2],a[3],a[4],n[1],n[2],n[3],n[4]);
   }' > /tmp/netmasks
   # cat /tmp/netmasks
   172.16.0.128      255.255.255.128
   172.16.1.0        255.255.255.128
   172.16.4.0        255.255.254.0
  Check the output for errors. Depending on the configuration it may be necessary to remove some entries.
   Append new entries to netmasks:
   # cat /tmp/netmasks >> /etc/netmasks
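To cross-check a single entry by hand: the network number is the address ANDed with the netmask, octet by octet. The nawk script above derives it from the broadcast address instead, but the result should be the same. The following is a minimal sketch of that arithmetic, assuming bash (for the 0x hex constants); the function name net_and_mask is made up for illustration and it is not a replacement for the script above.

```shell
# Print "network<TAB>netmask" for a dotted IPv4 address and an 8-digit
# hex netmask as shown by ifconfig (e.g. ffffff80).
net_and_mask() {
    ip=$1
    hex=$2
    # Split the hex netmask into its four octets.
    m1=$(printf '%s' "$hex" | cut -c1-2)
    m2=$(printf '%s' "$hex" | cut -c3-4)
    m3=$(printf '%s' "$hex" | cut -c5-6)
    m4=$(printf '%s' "$hex" | cut -c7-8)
    # Split the address on dots into $1..$4.
    old_ifs=$IFS
    IFS=.
    set -- $ip
    IFS=$old_ifs
    # network = address AND netmask, octet by octet
    printf '%d.%d.%d.%d\t%d.%d.%d.%d\n' \
        $(( $1 & 0x$m1 )) $(( $2 & 0x$m2 )) $(( $3 & 0x$m3 )) $(( $4 & 0x$m4 )) \
        $(( 0x$m1 )) $(( 0x$m2 )) $(( 0x$m3 )) $(( 0x$m4 ))
}

# Usage with the values of the default 2-node example:
#   net_and_mask 172.16.0.129 ffffff80
#   net_and_mask 172.16.4.1   fffffe00
```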

 4) Edit nsswitch.conf to remove 'cluster' entries.
  In the following example we list the hosts and netmasks entries, change the relevant lines with nawk, verify the changes, and update nsswitch.conf:
   # egrep -n '^hosts|^netmasks' /etc/nsswitch.conf
   hosts: cluster files dns nisplus
   netmasks: cluster files nisplus
   # nawk '/^(hosts|netmasks)/&&/cluster/ \
   {gsub(/cluster/, "")}{print}' /etc/nsswitch.conf > /tmp/nsswitch.conf
   # diff /etc/nsswitch.conf /tmp/nsswitch.conf
   28c28
   < hosts: cluster files dns nisplus
   ---
   > hosts: files dns nisplus
   42c42
   < netmasks: cluster files nisplus
   ---
   > netmasks: files nisplus
   # cat /tmp/nsswitch.conf > /etc/nsswitch.conf
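Because the workaround has to be done on ALL nodes, a quick check of the result can help. This is a minimal sketch, assuming bash or ksh; the function name lists_cluster_backend is made up for illustration. It only looks at the hosts and netmasks lines and compares whole words.

```shell
# Succeeds (exit 0) if the hosts or netmasks line of the given
# nsswitch.conf-style file still lists the 'cluster' backend.
# The body runs in a subshell so that 'set -f' does not leak out.
lists_cluster_backend() (
    set -f   # no filename expansion while splitting the backend list
    while read -r key rest; do
        case $key in
            hosts:|netmasks:)
                for backend in $rest; do
                    [ "$backend" = "cluster" ] && exit 0
                done
                ;;
        esac
    done < "$1"
    exit 1
)

# Usage:
#   if lists_cluster_backend /etc/nsswitch.conf; then
#       echo "'cluster' is still configured for hosts or netmasks"
#   fi
```

Once the fixed patches from the update above are installed, the same check tells you whether 'cluster' has been put back.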


There are 3 internal bugs which address this issue.
Bug 6632298: nscd doesn't cache hosts for cluster after sparks project (120011-14)
Bug 6634592: nss_cluster mods to match new nsswitch API in s10u4
Bug 6644077: nscd rejects - foreign nsswitch backend
A fix for Solaris 10 is expected by the end of this year.

Friday Feb 08, 2008

Decrease boot time of Sun Cluster

The boot time of your system can increase dramatically if you use Sun Cluster 3.2 in combination with volume management software. Bug 6626457 (boot time increases exponentially with S10U4 + VxVM 5.0 + Sun Cluster 3.2) describes the issue in more detail. Long boot times can be critical from a high availability point of view.

Be sure you have installed the following patch to decrease the boot time.

127718-04 SunOS 5.10: svc.startd and rpc.metad patch
127719-04 SunOS 5.10_x86: svc.startd and rpc.metad patch


It's worth installing this patch anyway because the boot time also decreases if you are running Sun Cluster without volume management software.

Wednesday Aug 22, 2007

Memory leak in scdpmd

The scdpmd (Sun Cluster disk path monitor daemon) has a memory leak when the reboot_on_path_failure flag is enabled. This is a known issue, reported in bug 6563949, which will be fixed soon. The workaround is to use the default of reboot_on_path_failure, which is disabled. Only Sun Cluster 3.2 is affected because this is a new feature of Sun Cluster 3.2.
Details are in the documentation about Administering Disk-Path Monitoring.
Update 8.Feb.2008: The bug 6563949 is now fixed in the patches
126106-04 Sun Cluster 3.2: CORE patch for Solaris 10
126107-04 Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-04 Sun Cluster 3.2: CORE patch for Solaris 9
Update 30.Jun.2008: The bug 6682663 can prevent the reboot. This is fixed in the revision -15 of the already mentioned Sun Cluster 3.2 CORE patches.



How to identify if reboot_on_path_failure is enabled?

t2000d# scdpm -p all:all
t2000e:reboot_on_path_failure enabled
t2000e:/dev/did/rdsk/d1 Ok
t2000e:/dev/did/rdsk/d2 Ok
t2000e:/dev/did/rdsk/d4 Ok
t2000e:/dev/did/rdsk/d5 Ok
t2000e:/dev/did/rdsk/d6 Ok
t2000e:/dev/did/rdsk/d7 Ok
t2000d:reboot_on_path_failure enabled
t2000d:/dev/did/rdsk/d10 Ok
t2000d:/dev/did/rdsk/d11 Ok
t2000d:/dev/did/rdsk/d13 Ok
t2000d:/dev/did/rdsk/d14 Ok
t2000d:/dev/did/rdsk/d6 Ok
t2000d:/dev/did/rdsk/d7 Ok



How to find out if scdpmd consumes too much memory?

t2000d# ps -ef | grep scdpmd
root 5355 1 0 Aug 20 ? 390:26 /usr/cluster/lib/sc/scdpmd
t2000d# prstat
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
5355 root 952M 8520K sleep 59 0 6:29:59 4.4% scdpmd/14
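If you want to compare the SIZE column against a threshold in a script, the K/M/G suffix has to be normalized first. This is a minimal sketch, assuming bash or ksh and integer values with a K, M or G suffix as in the output above; the function name size_to_kb is made up for illustration.

```shell
# Convert a prstat-style size such as "8520K", "952M" or "2G" to kilobytes.
size_to_kb() {
    num=${1%[KMG]}   # strip the unit suffix
    case $1 in
        *K) echo "$num" ;;
        *M) echo $((num * 1024)) ;;
        *G) echo $((num * 1024 * 1024)) ;;
        *)  echo "unknown size: $1" >&2; return 1 ;;
    esac
}

# Usage with the PID from the example above (SIZE is the third column):
#   size=$(prstat -p 5355 1 1 | nawk '$1 == 5355 { print $3 }')
#   [ "$(size_to_kb "$size")" -gt 524288 ] && echo "scdpmd is larger than 512M"
```

The 512M threshold is only an example value; a healthy scdpmd on your system may need a different limit.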



How to disable reboot_on_path_failure flag?

t2000d# clnode set -p reboot_on_path_failure=disabled t2000d t2000e
t2000d# scdpm -p all:all
t2000e:reboot_on_path_failure disabled
t2000e:/dev/did/rdsk/d1 Ok
t2000e:/dev/did/rdsk/d2 Ok
t2000e:/dev/did/rdsk/d4 Ok
t2000e:/dev/did/rdsk/d5 Ok
t2000e:/dev/did/rdsk/d6 Ok
t2000e:/dev/did/rdsk/d7 Ok
t2000d:reboot_on_path_failure disabled
t2000d:/dev/did/rdsk/d10 Ok
t2000d:/dev/did/rdsk/d11 Ok
t2000d:/dev/did/rdsk/d13 Ok
t2000d:/dev/did/rdsk/d14 Ok
t2000d:/dev/did/rdsk/d6 Ok
t2000d:/dev/did/rdsk/d7 Ok



How to restart scdpm service to prevent memory leak?

t2000d# svcadm restart svc:/system/cluster/scdpm:default


Additional information and best practices are available in
Infodoc 1004119.1: Sun[TM] Cluster 3.2 Disk Path Monitoring and how to test for losing path to storage

Friday Jul 20, 2007

Variable SC_DEBUG

    For troubleshooting reasons it could be helpful to set the SC_DEBUG variable in your root environment.
    e.g: for bash
    # SC_DEBUG=1; export SC_DEBUG
    This enables the (*) on all options of the scinstall main menu.

e.g:
# scinstall
 
   *** Main Menu ***
 
     Please select from one of the following (*) options:
      * 1) Create a new cluster or add a cluster node
      * 2) Configure a cluster to be JumpStarted from this install server
      * 3) Manage a dual-partition upgrade
      * 4) Upgrade this cluster node
      * 5) Print release information for this cluster node
 
      * ?) Help with menu options
      * q) Quit
 
     Option:

  • Can be used for re-install, remove or add node to a cluster.

  • Will provide more diagnostic information during the installation.

Tuesday Jul 17, 2007

Patchlist for Sun Cluster 3.2 Geographic Edition

The first patches for Sun Cluster 3.2 Geographic Edition are released. If you have a contract you can download the required patches.


Patchlist for Solaris 9 8/05 update8 or higher & Solaris 10 11/06 update3 or higher
126607-02 Sun Cluster Geographic Edition: Core, Utilities and Man Pages Patch
126611-02 Sun Cluster Geographic Edition: Availability Suite Data Replication Patch
126613-02 Sun Cluster Geographic Edition: TrueCopy Data Replication Patch
126746-02 Sun Cluster Geographic Edition: SRDF Data Replication Patch


Patchlist for Solaris 10 x86 11/06 update3 or higher
126608-02 Sun Cluster Geographic Edition_x86: Core, Utilities and Man Pages Patch
126612-02 Sun Cluster Geographic Edition_x86: Availability Suite Data Replication Patch


Friday Jun 15, 2007

Patchlist for Sun Cluster 3.2

The first patches for Sun Cluster 3.2 are released. It's highly recommended to install the core patch and the necessary agent patches before going into production. If you have a contract you can download the required patches.
Update 31.Aug.2007: Added new Sun Cluster 3.2 core patches 126105, 126106 and 126107.
Update 7.Jan.2008: Added new Sun Cluster 3.2 HA-Oracle E-business suite patches 126049 and 126050.


Patchlist for Solaris 10 11/06 update3 or higher
125511-02 Sun Cluster 3.2: Core Patch for Solaris 10
126106-01 Sun Cluster 3.2: CORE patch for Solaris 10
125508-03 Sun Cluster 3.2: Manageability and Serviceability Agent
125514-01 Sun Cluster 3.2: Solaris Volume Manager (Mediator) Patch
125517-02 Sun Cluster 3.2: OPS Core Patch for Solaris 10
126011-01 Sun Cluster 3.2: HA-DHCP Patch for Solaris 10
126014-01 Sun Cluster 3.2: HA-Apache Patch
126032-01 Sun Cluster 3.2: HA-MySQL Patch
126047-01 Sun Cluster 3.2: HA-Oracle Patch
126050-02 Sun Cluster 3.2: HA-Oracle E-business suite Patch for Solaris 10
126062-01 Sun Cluster 3.2: HA-SAPWebAS Patch
126065-01 Sun Cluster 3.2: HA-Siebel Patch
126068-01 Sun Cluster 3.2: HA-Sybase Patch for Solaris 10
126092-01 Sun Cluster 3.2: HA-Websphere MQ Patch


Patchlist for Solaris 10 x86 11/06 update3 or higher
125512-02 Sun Cluster 3.2_x86: Core Patch for Solaris 10_x86
126107-01 Sun Cluster 3.2_x86: CORE patch for Solaris 10_x86
125509-03 Sun Cluster 3.2_x86: Manageability and Serviceability Agent
125515-01 Sun Cluster 3.2_x86: Solaris Volume Manager (Mediator) Patch
125518-02 Sun Cluster 3.2_x86: OPS Core Patch for Solaris 10_x86
126012-01 Sun Cluster 3.2_x86: HA-DHCP Patch for Solaris 10
126015-01 Sun Cluster 3.2_x86: HA-Apache Patch
126033-01 Sun Cluster 3.2_x86: HA-MySQL Patch
126048-01 Sun Cluster 3.2_x86: HA-Oracle Patch
126063-01 Sun Cluster 3.2_x86: HA-SAPWebAS Patch
126093-01 Sun Cluster 3.2_x86: HA-Websphere MQ Patch


Patchlist for Solaris 9 8/05 update8 or higher
125510-02 Sun Cluster 3.2: Core Patch for Solaris 9
126105-01 Sun Cluster 3.2: CORE patch for Solaris 9
125507-03 Sun Cluster 3.2: Manageability and Serviceability Agent
125513-01 Sun Cluster 3.2: Solaris Volume Manager (Mediator) Patch
125516-02 Sun Cluster 3.2: OPS Core Patch for Solaris 9
126010-01 Sun Cluster 3.2: HA-DHCP Patch for Solaris 9
126013-01 Sun Cluster 3.2: HA-Apache Patch
126031-01 Sun Cluster 3.2: HA-MySQL Patch
126046-01 Sun Cluster 3.2: HA-Oracle Patch
126049-02 Sun Cluster 3.2: HA-Oracle E-business suite Patch for Solaris 9
126061-01 Sun Cluster 3.2: HA-SAPWebAS Patch
126064-01 Sun Cluster 3.2: HA-Siebel Patch
126067-01 Sun Cluster 3.2: HA-Sybase Patch for Solaris 9
126085-01 Sun Cluster 3.2: HA-SWIFTAlliance Access Patch
126091-01 Sun Cluster 3.2: HA-Websphere MQ Patch


In case of a localized version you additionally need:
126095-01 Sun Cluster 3.2: Localization patch for Solaris 9 sparc and Solaris 10 sparc
126096-01 Sun Cluster 3.2_x86: Localization patch for Solaris 10 amd64


In case of cacao 2.x for Sun Cluster Manager GUI:
123893-04 SunOS 5.10: Common Agent Container (cacao) runtime 2.1 upgrade patch
123894-03 SunOS 5.10: Common Agent Container (cacao) secure web server 2.1 upgrade patch
123895-03 SunOS 5.10: Common Agent Container (cacao) monitoring 2.1 upgrade patch
123896-04 SunOS 5.10_x86: Common Agent Container (cacao) runtime 2.1 upgrade patch
123897-03 SunOS 5.10_x86: Common Agent Container (cacao) secure web server 2.1 upgrade patch
123898-03 SunOS 5.10_x86: Common Agent Container (cacao) monitoring 2.1 upgrade patch


Monday Mar 19, 2007

scsymon-srv goes always online

Certainly, some of you have noticed that the scsymon-srv service will be enabled on each boot of a Sun Cluster.

Mar 16 08:59:05 host345 svc.startd[8]: system/cluster/scsymon-srv:default misconfigured: transitioned to maintenance (see 'svcs -xv' for details)

But this service can only start successfully if you have SUNWescom (Sun Management Center Common Components) installed. This will be fixed in the future. In the meantime there are different workarounds.

The script "/usr/cluster/lib/svc/method/svc_cl_svc_enable" tries to enable all userland cluster services by default on each boot of the Sun Cluster.

  1. For Sun Cluster 3.1 you can simply disable the service by executing
    # svcadm -v disable system/cluster/scsymon-srv:default
    It's crazy but the service is still disabled after the boot due to a typo in svc_cl_svc_enable.

  2. For Sun Cluster 3.2 the typo is fixed in svc_cl_svc_enable and therefore it's not enough to disable the service. Furthermore, Sun Cluster 3.2 has another script, /usr/cluster/lib/svc/method/svc_boot_check, which disables/enables Sun Cluster smf services.


    Make a copy and modify svc_cl_svc_enable and svc_boot_check

    # cd /usr/cluster/lib/svc/method/
    # cp svc_cl_svc_enable svc_cl_svc_enable.org
    # cp svc_boot_check svc_boot_check.org
    # vi svc_cl_svc_enable
    change from:
    $SVCADM enable svc:/system/cluster/scslmclean:default
    $SVCADM enable svc:/system/cluster/scsymon-srv:default
    $SVCADM enable svc:/system/cluster/cl-svc-cluster-milestone:default
    to:
    $SVCADM enable svc:/system/cluster/scslmclean:default
    $SVCADM enable svc:/system/cluster/cl-svc-cluster-milestone:default
    # vi svc_boot_check
    change from:
    svc:/system/cluster/cl-svc-cluster-milestone:default
    svc:/system/cluster/scsymon-srv:default"
    to (Take care of " at the end to the line!):
    svc:/system/cluster/cl-svc-cluster-milestone:default"

    Beware: these changes will be lost after the installation of the Sun Cluster core patch.
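    Because the edit has to be repeated after each core patch installation, the change to svc_cl_svc_enable can also be done with a filter instead of vi. This is a minimal sketch; the function name drop_scsymon_enable is made up for illustration. It only covers svc_cl_svc_enable. The change in svc_boot_check still needs a manual edit because the closing " has to move to the previous line.

```shell
# Copy a method script from stdin to stdout and drop the line
# that enables the scsymon-srv service.
drop_scsymon_enable() {
    grep -v 'enable svc:/system/cluster/scsymon-srv:default'
}

# Usage, with the backup copy from above as input:
#   cd /usr/cluster/lib/svc/method/
#   drop_scsymon_enable < svc_cl_svc_enable.org > svc_cl_svc_enable
#   diff svc_cl_svc_enable.org svc_cl_svc_enable
```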


Update June 18 2007:
This is Bug 6492070: error message (system/cluster/scsymon-srv:default misconfigured) is displayed when boot cluster node

Which is fixed in the Sun Cluster 3.2 core patch (125511-02 S10, 125512-02 S10_x86, 125510-02 S9)

Monday Feb 19, 2007

Confusing message of sc_delegated_restarter

Sun Cluster 3.2 provides a new daemon called sc_delegated_restarter as part of the normal running cluster framework on each node of the cluster. The sc_delegated_restarter is SMF (Service Management Facility of Solaris 10) and RGM (Resource Group Manager of Sun Cluster) aware. Therefore you can manage your existing SMF services with the provided resource types SUNW.Proxy_SMF_failover, SUNW.Proxy_SMF_multimaster or SUNW.Proxy_SMF_scalable of Sun Cluster 3.2.


You may see the following message during the boot of your cluster, which you can simply ignore.

Jan 29 08:40:16 host345 Cluster.SMF.DR: Unable to open door descriptor /var/run/rgmd_receptionist_door
Jan 29 08:40:16 host345 Cluster.SMF.DR:
Jan 29 08:40:16 host345 Error in scha_cluster_get

This is a race condition where sc_delegated_restarter comes up before RGM.
This is the known Bug 6490541: The sc_delegated_restarter SMF svc should depend on rgm SMF svc

Update June 18 2007
Which is fixed in the Sun Cluster 3.2 core patch (125511-02 S10, 125512-02 S10_x86, 125510-02 S9)

Wednesday Feb 07, 2007

Start with Sun Cluster 3.2


The next powerful version of Sun Cluster is available. Sun Cluster 3.2 is part of the Solaris Cluster product suite which includes Sun Cluster, Sun Cluster Geographic Edition, developer tools and support for commercial and open-source applications through agents. Some resources I like to mention:

For the installation of Sun Cluster 3.2 you should know:
--> Solaris 10 11/06 (Update3) is required. No earlier release of Solaris 10 is supported.
--> Additionally, the following patches are highly recommended. (The table lists the minimum revision of the patches)

SPARC:
118833-36 SunOS 5.10: kernel patch
124918-02 SunOS 5.10: devfsadm, devlinks, drvconfig patch
124916-01 SunOS 5.10: sd, ssd drivers patch
121010-04 SunOS 5.10: rpc.metad patch
120986-10 SunOS 5.10: mkfs and newfs patch
X86:
118855-36 SunOS 5.10_x86: kernel patch
124920-02 SunOS 5.10_x86: Solaris boot filelist.ramdisk patch
124919-02 SunOS 5.10_x86: devfsadm patch support
124917-02 SunOS 5.10_x86: sd driver patch
121011-04 SunOS 5.10_x86: rpc.metad patch
120987-10 SunOS 5.10_x86: mkfs, newfs, other ufs utils patch

Note: Sun Cluster 3.2 can also run on Solaris 9 8/05 (Update8). Of course, the Solaris 10 specific features are not available on Solaris 9.

About

I'm still mostly blogging around Solaris Cluster and support. Independently if for Sun Microsystems or Oracle. :-)
