Scalable service does not fail over after network outage

If a network outage affects the IPMP group that is part of the scalable resource group, the scalable resource cannot fail over to the other host.

The issue only happens if one of the following Sun Cluster core patches is active:
126106-27 Sun Cluster 3.2: CORE patch for Solaris 10
126107-28 Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-26 Sun Cluster 3.2: CORE patch for Solaris 9
Because these patches are also part of the Sun Cluster 3.2 1/09 Update2 release, the issue also occurs on freshly installed Sun Cluster 3.2 1/09 Update2 systems.

The error can look as follows:
Feb 10 16:56:51 node1 in.mpathd[174]: NIC failure detected on e1000g0 of group ipmp0
Feb 10 16:56:51 node1 in.mpathd[174]: Successfully failed over from NIC e1000g0 to NIC e1000g4
Feb 10 16:57:18 node1 in.mpathd[174]: All Interfaces in group ipmp0 have failed
Feb 10 16:57:19 node1 SC[SUNW.apache:4.1,apache-rg,apache-rs,SSM_IPMP_CALLBACK]: IPMP group ipmp0 has failed, so scalable resource apache-rs in resource group apache-rg may not be able to respond to client requests. A request will be issued to relocate resource apache-rs off of this node.
Feb 10 16:57:23 node1 genunix: NOTICE: core_log: ssm_ipmp_callbac[2130] core dumped: /var/core/core.ssm_ipmp_callbac.2130.1227135261.0
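
To check a messages file for this signature, a small script can look for a complete IPMP group failure followed by a core dump of ssm_ipmp_callbac. The following is only a sketch: the regular expressions are derived from the sample messages above, the function name find_failover_coredumps is made up for illustration, and the wording of the messages on other systems may differ.

```python
import re

# Patterns derived from the sample messages in this post (assumption:
# other releases may word these messages differently).
GROUP_FAILED = re.compile(
    r"in\.mpathd\[\d+\]: All Interfaces in group (\S+) have failed"
)
CORE_DUMPED = re.compile(
    r"core_log: ssm_ipmp_callbac\[\d+\] core dumped: (\S+)"
)


def find_failover_coredumps(lines):
    """Return (ipmp_group, core_file) pairs where a core dump of
    ssm_ipmp_callbac follows a complete IPMP group failure."""
    hits = []
    failed_group = None
    for line in lines:
        match = GROUP_FAILED.search(line)
        if match:
            # Remember the most recent group that lost all interfaces.
            failed_group = match.group(1)
            continue
        match = CORE_DUMPED.search(line)
        if match and failed_group is not None:
            hits.append((failed_group, match.group(1)))
            failed_group = None
    return hits


# Two of the messages shown above, used as sample input.
SAMPLE = [
    "Feb 10 16:57:18 node1 in.mpathd[174]: All Interfaces in group ipmp0 have failed",
    "Feb 10 16:57:23 node1 genunix: NOTICE: core_log: ssm_ipmp_callbac[2130] "
    "core dumped: /var/core/core.ssm_ipmp_callbac.2130.1227135261.0",
]

if __name__ == "__main__":
    for group, core_file in find_failover_coredumps(SAMPLE):
        print(group, core_file)
```

On a real node the lines would come from the system log (for example /var/adm/messages) instead of the SAMPLE list.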

Update 7.Apr.2009:
Solution: Bug 6774504 is fixed in:
126106-28 or higher Sun Cluster 3.2: CORE patch for Solaris 10
126107-29 or higher Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-27 or higher Sun Cluster 3.2: CORE patch for Solaris 9
However, these patch revisions still have trouble with the rgmd process. Please refer to memory leaks in "rgmd -z global" process.
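
The revision numbers from this post can be put into a small lookup to classify an installed patch ID. This is a sketch under assumptions: the function name classify_patch and the result labels are made up for illustration, and revisions that are not named in this post are reported as "unlisted" because the post makes no statement about them.

```python
# Revisions copied from the lists in this post; keys are the patch base IDs.
AFFECTED = {"126106": 27, "126107": 28, "126105": 26}
# First revision that contains the fix for bug 6774504.
FIXED_FROM = {"126106": 28, "126107": 29, "126105": 27}


def classify_patch(patch_id):
    """Classify a patch ID such as '126106-27' with respect to bug 6774504.

    Returns one of:
      'affected'        revision named in this post as showing the issue
      'fixed'           revision at or above the one that contains the fix
      'unlisted'        a core patch revision this post says nothing about
      'not-applicable'  not one of the three Sun Cluster 3.2 core patches
    """
    base, sep, rev = patch_id.strip().partition("-")
    if base not in AFFECTED or not sep or not rev.isdigit():
        return "not-applicable"
    revision = int(rev)
    if revision >= FIXED_FROM[base]:
        # 'fixed' refers to bug 6774504 only, not to the rgmd issue.
        return "fixed"
    if revision == AFFECTED[base]:
        return "affected"
    return "unlisted"


if __name__ == "__main__":
    for pid in ("126106-27", "126106-28", "126105-19"):
        print(pid, classify_patch(pid))
```

The patch IDs to feed in could be taken from the output of showrev -p on each cluster node.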

Workaround: Use the previous revision -19 if you are using scalable services:
126106-19 Sun Cluster 3.2: CORE patch for Solaris 10
126107-19 Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-19 Sun Cluster 3.2: CORE patch for Solaris 9

The issue is reported in bug 6774504 (description: scalable services coredump during the failover due to network failure). A fix is in progress. This blog will be updated when the fix is available.

