Cluster configuration repository can get corrupted when installing Sun Cluster 3.2 1/09 Update2


The issue only occurs if Sun Cluster 3.2 1/09 Update2 is installed with a non-default netmask for the cluster interconnect.

Symptoms if the system is affected:
Errors with:
      * did devices
      * quorum device
      * The output of the command 'scstat -i' can look like:
-- IPMP Groups --
                 Node Name       Group    Status    Adapter   Status
                 ---------       -----    ------    -------   ------
scrconf: RPC: Authentication error; why = Client credential too weak
scrconf: Failed to get zone information for scnode2 - unexpected error.
scrconf: RPC: Authentication error; why = Client credential too weak
scrconf: Failed to get zone information for scnode1 - unexpected error.
scrconf: RPC: Authentication error; why = Client credential too weak
scrconf: Failed to get zone information for scnode2 - unexpected error.
scrconf: RPC: Authentication error; why = Client credential too weak
scrconf: Failed to get zone information for scnode1 - unexpected error.
IPMP Group: scnode2    sc_ipmp0    Online    qfe0      Online
IPMP Group: scnode1    sc_ipmp0    Online    qfe0      Online
IPMP Group: scnode2    sc_ipmp0    Online    qfe0      Online
IPMP Group: scnode1    sc_ipmp0    Online    qfe0      Online


How does the problem occur?
After installing the Sun Cluster 3.2 1/09 Update2 product with the Java installer, it is necessary to run the scinstall command. If the "Custom" installation is chosen instead of the "Typical" installation, it is possible to change the default netmask of the cluster interconnect. The following questions come up during the installation procedure if the default netmask question is answered with 'no'.

Example scinstall:
       Is it okay to accept the default netmask (yes/no) [yes]? no
       Maximum number of nodes anticipated for future growth [64]? 4
       Maximum number of private networks anticipated for future growth [10]?
       Maximum number of virtual clusters expected [12]? 0
       What netmask do you want to use [255.255.255.128]?
Prevent the issue by answering the virtual clusters question with '1', or with a higher value if future growth needs to be taken into account.
Do NOT answer the virtual clusters question with '0'!
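
For illustration, the same dialog answered in a way that avoids the problem (a sketch only; the node and network counts are assumptions for a small cluster, the important difference is the virtual clusters answer):
       Is it okay to accept the default netmask (yes/no) [yes]? no
       Maximum number of nodes anticipated for future growth [64]? 4
       Maximum number of private networks anticipated for future growth [10]?
       Maximum number of virtual clusters expected [12]? 1
       What netmask do you want to use [255.255.255.128]?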



In the /etc/cluster/ccr/global/infrastructure file the error shows up as an empty entry for cluster.properties.private_netmask. Furthermore, some other lines do not reflect the correct netmask values as chosen within scinstall. A quick check for the empty entry is sketched after the two file excerpts below.
Wrong infrastructure file:
cluster.state enabled
cluster.properties.cluster_id 0x49F82635
cluster.properties.installmode disabled
cluster.properties.private_net_number 172.16.0.0
cluster.properties.cluster_netmask 255.255.248.0
cluster.properties.private_netmask
cluster.properties.private_subnet_netmask 255.255.255.248
cluster.properties.private_user_net_number 172.16.4.0
cluster.properties.private_user_netmask 255.255.254.0

cluster.properties.private_maxnodes 6
cluster.properties.private_maxprivnets 10
cluster.properties.zoneclusters 0
cluster.properties.auth_joinlist_type sys

If the virtual clusters question is answered with the value '1', the correct netmask entries are:
cluster.properties.cluster_id 0x49F82635
cluster.properties.installmode disabled
cluster.properties.private_net_number 172.16.0.0
cluster.properties.cluster_netmask 255.255.255.128
cluster.properties.private_netmask 255.255.255.128
cluster.properties.private_subnet_netmask 255.255.255.248
cluster.properties.private_user_net_number 172.16.0.64
cluster.properties.private_user_netmask 255.255.255.224

cluster.properties.private_maxnodes 6
cluster.properties.private_maxprivnets 10
cluster.properties.zoneclusters 1
cluster.properties.auth_joinlist_type sys
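
To check whether a node is affected, look for the empty entry (a minimal sketch; the output corresponds to the wrong infrastructure file shown above, a healthy node shows a netmask value behind the keyword):
scnode1 # grep private_netmask /etc/cluster/ccr/global/infrastructure
cluster.properties.private_netmask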


Workaround if the problem has already occurred:
1.) Boot all nodes in non-cluster mode with 'boot -x'.
2.) Correct the wrong values in /etc/cluster/ccr/global/infrastructure on all nodes. See the examples above.
3.) Write a new checksum for all infrastructure files on all nodes. Use -o (master file) on the node which boots up first.
scnode1 # /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure -o
scnode2 # /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure
4.) First reboot scnode1 (the node with the master infrastructure file) into the cluster, then reboot the other nodes.
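
Once the nodes are back in the cluster, a quick verification of the edited file can look like this (a minimal sketch; the expected values correspond to the corrected example above and depend on the chosen configuration):
scnode1 # grep 'cluster.properties.*netmask' /etc/cluster/ccr/global/infrastructure
cluster.properties.cluster_netmask 255.255.255.128
cluster.properties.private_netmask 255.255.255.128
cluster.properties.private_subnet_netmask 255.255.255.248
cluster.properties.private_user_netmask 255.255.255.224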
This is reported in bug 6825948.


Update 17.Jun.2009:
The -33 revision of the Sun Cluster core patch is the first released version which fixes this issue at installation time.
126106-33 Sun Cluster 3.2: CORE patch for Solaris 10
126107-33 Sun Cluster 3.2: CORE patch for Solaris 10_x86
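
Whether a node already runs the fixed revision can be checked with showrev (a minimal sketch; look for patch revision -33 or higher):
# showrev -p | egrep '126106|126107'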
