Preface
This is a troubleshooting case study for a 9i RAC environment on HP-UX PA-RISC 64-bit, which used HP ServiceGuard, a third-party solution, for cluster maintenance.
The problem
The gsdctl and srvctl commands were failing with the following error:
(oracle@rac2): gsdctl stat
PRKC-1021 : Problem in the clusterware
Failed to get list of active nodes from clusterware
(oracle@rac1): srvctl config
PRKC-1021 : Problem in the clusterware
(oracle@rac1): pwd
/u01/oracle/uatdb/9.2/bin
(oracle@rac1): ./lsnodes
rac1
rac2
app1
app2
Troubleshooting Approach
Followed note 178683.1 to enable tracing in $ORACLE_HOME/bin/gsd.sh, then ran the script:
# added -DTRACING.ENABLED=true -DTRACING.LEVEL=2 before -classpath
exec $JRE -DPROGRAM=gsd -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath $CLASSPATH oracle.ops.mgmt.daemon.OPSMDaemon $MY_OHOME
$ $ORACLE_HOME/bin/gsd.sh
...
...
[main] [13:15:42:852] [line# N/A] my property portnum=null
[main] [13:15:42:942] [NativeSystem.<init>:123] Detected Cluster
[main] [13:15:42:946] [NativeSystem.<init>:124] Cluster existence = true
[main] [13:15:42:947] [UnixSystem.<init>:118] Going to load SRVM library
[main] [13:15:42:950] [UnixSystem.<init>:118] loaded libraries
[main] [13:15:42:950] [OPSMDaemon.main:726] Initializing the daemon ...
[main] [13:15:42:951] [OPSMDaemon.<init>:188] NativeDebug is set to true
[main] [13:15:42:952] [OPSMDaemon.<init>:188] UnixSystem.initializeDaemon: groupName is opsm
[main] [13:15:42:953] [OPSMDaemon.<init>:188] Unsatisfied Link Error caught. Could not initialize the cluster
[main] [13:15:42:954] [OPSMDaemon.main:726] initializeDaemon status = false
[main] [13:15:42:955] [OPSMDaemon.main:726] Failed to initialize and register with clusterware
[main] [13:15:42:955] [OPSMDaemon.main:726] OPSMErrCode = 1003
[main] [13:15:42:958] [OPSMDaemon.main:729] java.rmi.RemoteException: Unable to initialize with the clusterware
java.rmi.RemoteException: Unable to initialize with the clusterware
at oracle.ops.mgmt.daemon.OPSMDaemon.<init>(OPSMDaemon.java:195)
at oracle.ops.mgmt.daemon.OPSMDaemon.main(OPSMDaemon.java:726)
[main] [13:15:42:958] [line# N/A] Exiting from main..no point trying to start the daemon
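The trace is long, so it can help to filter it down to the lines that signal a failed clusterware initialization. The following is a minimal sketch, assuming the trace output was first saved to a file. The function name gsd_trace_failures and the file path are hypothetical, and the patterns are simply the messages seen in the trace above; other releases may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a saved GSD trace that indicate
# a failed clusterware initialization. The patterns come from the trace
# shown in this case study and are an assumption for any other system.
gsd_trace_failures() {
    trace_file=$1
    grep -E 'Unsatisfied Link Error|initializeDaemon status = false|OPSMErrCode' "$trace_file"
}

# Example usage (assumes both stdout and stderr are captured):
#   $ORACLE_HOME/bin/gsd.sh > /tmp/gsd_trace.log 2>&1
#   gsd_trace_failures /tmp/gsd_trace.log
```

The function returns a non-zero status when none of the patterns is found, so it can also be used in a shell conditional.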
At this point, one option was to initialize the srvconfig raw device (OCR) and then add the RAC instances etc. manually using the srvctl add command:
(oracle@rac1): srvconfig -init
java.lang.UnsatisfiedLinkError: readRawObject
at oracle.ops.mgmt.nativesystem.UnixSystem.readRawObject(UnixSystem.java:410)
at oracle.ops.mgmt.rawdevice.RawDevice.readMagicString(RawDevice.java:187)
at oracle.ops.mgmt.rawdevice.RawDeviceVersion.readVersionString(RawDeviceVersion.java:175)
at oracle.ops.mgmt.rawdevice.RawDeviceVersion.isValidConfigDevice(RawDeviceVersion.java:75)
at oracle.ops.mgmt.rawdevice.RawDeviceUtil.<init>(RawDeviceUtil.java:147)
at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)
Exception in thread "main" (oracle@rac1):
(oracle@rac1): srvconfig -version
java.lang.UnsatisfiedLinkError: readRawObject
at oracle.ops.mgmt.nativesystem.UnixSystem.readRawObject(UnixSystem.java:410)
at oracle.ops.mgmt.rawdevice.RawDevice.readMagicString(RawDevice.java:187)
at oracle.ops.mgmt.rawdevice.RawDeviceVersion.readVersionString(RawDeviceVersion.java:175)
at oracle.ops.mgmt.rawdevice.RawDeviceVersion.isValidConfigDevice(RawDeviceVersion.java:75)
at oracle.ops.mgmt.rawdevice.RawDeviceUtil.<init>(RawDeviceUtil.java:147)
at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)
If the config file is pointing to a raw device the following type of output should be returned:
$ raw device version "9.0.0.0.0"
Since we were not getting that output, either the OCR raw device could not be accessed or the soft link was not working because of a permissions issue. Dumping the contents of the OCR to standard output with $ dd if=/dev/orar1/rrawuat.conf bs=1500 showed that the OCR device was readable.
$ more /var/opt/oracle/srvConfig.loc
srvconfig_loc=/dev/orar1/rrawuat.conf
$ ls -l /dev/orar1/rrawuat.conf
crw-rw---- 1 oracle dba 64 0x110004 Apr 11 2007 /dev/orar1/rrawuat.conf
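The manual checks above (read the location file, then look at the device it points to) can be sketched as a small script. This is only an illustration: the function names srvconfig_location and check_srvconfig_location are hypothetical, and the check only tests readability for the current user, not whether the device holds a valid configuration.

```shell
#!/bin/sh
# Hypothetical helper: print the value of srvconfig_loc from a
# srvConfig.loc-style file (first matching line only).
srvconfig_location() {
    loc_file=$1
    sed -n 's/^srvconfig_loc=//p' "$loc_file" | head -n 1
}

# Hypothetical helper: report whether the configured location exists
# and is readable by the current user. Returns non-zero otherwise.
check_srvconfig_location() {
    loc=$(srvconfig_location "$1")
    if [ -z "$loc" ]; then
        echo "no srvconfig_loc entry in $1"
        return 1
    fi
    if [ -r "$loc" ]; then
        echo "readable: $loc"
    else
        echo "not readable: $loc"
        return 1
    fi
}

# Example usage:
#   check_srvconfig_location /var/opt/oracle/srvConfig.loc
```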
The next idea was to relink the srv* binaries using the make command, but that also resulted in an error:
(oracle@rac2): cd $ORACLE_HOME/srvm/lib
(oracle@rac2): make -f ins_srvm.mk install
nm -gp /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a | grep T | grep Java | awk '{ print "-u " $3 }' > /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def;
/u01/oracle/uatdb/9.2/bin/echodo ld +s -G -b -o libsrvm.sl -c
/u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a -L/u01/oracle/uatdb/9.2/lib32/ -L/u01/oracle/uatdb/9.2/srvm/lib/ -L/usr/lib -lc -lclntsh -lwtc9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9
-lunls9 -lnls9 /opt/nmapi/nmapi2/lib/libnmapi2.sl -lm `cat /u01/oracle/uatdb/9.2/lib/sysliblist` ;
rm /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def
ld +s -G -b -o libsrvm.sl -c /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a -L/u01/oracle/uatdb/9.2/lib32/ -L/u01/oracle/uatdb/9.2/srvm/lib/
-L/usr/lib -lc -lclntsh -lwtc9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 /opt/nmapi/nmapi2/lib/libnmapi2.sl -lm -l:libcl.sl -l:librt.sl -lpthread -l:libnss_dns.1 -l:libdld.sl
ld: Mismatched ABI (not an ELF file) for -lclntsh, found /u01/oracle/uatdb/9.2/lib32//libclntsh.sl
Fatal error.
*** Error exit code 1
Stop.
The turning point
Saw bug 6281672, which gave a hint. Compared the file $ORACLE_HOME/lib32/libsrvm.sl with the one on another RAC system (duat):
(oracle@duat1): ls -l $ORACLE_HOME/lib*/libsrvm*
-rwxr-xr-x 1 oracle dba 57344 Nov 7 20:04 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl
-rwxr-xr-x 1 oracle dba 57344 Nov 7 08:11 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x 1 oracle dba 36864 Nov 7 20:04 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-rwxr-xr-x 1 oracle dba 36864 Nov 7 08:11 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0
(oracle@duat1):
On rac1/2, the /u01/oracle/uatdb/9.2/lib32/libsrvm.sl file was missing. Saw that on duat the .sl and .sl0 files were copies of each other (identical sizes in the listing above), so on rac1/2 did the following:
$ cd /u01/oracle/uatdb/9.2/lib32
$ cp libsrvm.sl0 libsrvm.sl
Before:
(oracle@rac2): ls -l $ORACLE_HOME/lib32/libsrvm*
-r-xr-xr-x 1 oracle dba 57344 Oct 17 2004 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x 1 oracle dba 36864 Feb 9 06:49 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-r-xr-xr-x 1 oracle dba 106496 Nov 30 2004 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0
After:
(oracle@rac2): ls -l $ORACLE_HOME/lib32/libsrvm*
-r-xr-xr-x 1 oracle dba 57344 Mar 5 15:14 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl
-r-xr-xr-x 1 oracle dba 57344 Oct 17 2004 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x 1 oracle dba 36864 Feb 9 06:49 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-r-xr-xr-x 1 oracle dba 106496 Nov 30 2004 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0
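The comparison done by eye above can be sketched as a script that walks the .sl0 files in a directory and reports whether each has a matching .sl file. This is a sketch, not an Oracle-provided tool: the function name check_sl_pairs is hypothetical, and the assumption that every .sl0 file should have a same-named .sl counterpart comes only from what was observed on the duat system.

```shell
#!/bin/sh
# Hypothetical helper: for every <name>.sl0 file in a directory, report
# whether the corresponding <name>.sl exists and, if it does, whether
# the two files are byte-for-byte identical.
check_sl_pairs() {
    lib_dir=$1
    for sl0_file in "$lib_dir"/*.sl0; do
        [ -e "$sl0_file" ] || continue    # glob matched nothing
        lib=${sl0_file%0}                 # strip the trailing "0"
        name=$(basename "$lib")
        if [ ! -e "$lib" ]; then
            echo "MISSING   $name"
        elif cmp -s "$lib" "$sl0_file"; then
            echo "IDENTICAL $name"
        else
            echo "DIFFERS   $name"
        fi
    done
}

# Example usage:
#   check_sl_pairs "$ORACLE_HOME/lib32"
```

Run against the "Before" state on rac2, a check like this would be expected to flag libsrvm.sl as missing and libsrvmocr.sl as differing from its .sl0 file, since the sizes in the listing differ.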
The size mismatch between libsrvmocr.sl0 and libsrvmocr.sl did not seem to be a showstopper. The srv* commands appeared to depend on libsrvm.sl.
After this, the gsdctl and srvctl commands started working:
(oracle@rac1): gsdctl stat
GSD is running on the local node
(oracle@rac1): srvctl config
uat
(oracle@rac2): gsdctl stat
GSD is not running on the local node
(oracle@rac2):
(oracle@rac2): gsdctl start
Successfully started GSD on local node
(oracle@rac2): srvctl status database -d uat
Instance uat1 is not running on node rac1
Instance uat2 is not running on node rac2
But there was "just" one more problem..
Interestingly, srvctl stop database/instance worked fine at this point, but srvctl start did not:
(oracle@rac2): srvctl start instance -d uat -i uat1
PRKP-1001 : Error starting instance uat1 on node rac1
ORA-00119: invalid specification for system parameter local_listener
ORA-00132: syntax error or unresolved network name 'uat1' reserved.
ORA-01078: failure in processing system parameters local_listener
ORA-00132: syntax error or unresolved network name 'uat1' reserved.
(oracle@rac2): srvctl start instance -d uat -i uat2
PRKP-1001 : Error starting instance uat2 on node rac2
ORA-00119: invalid specification for system parameter local_listener
ORA-00132: syntax error or unresolved network name 'uat2' reserved.
ORA-01078: failure in processing system parameters
SQL> Disconnected
ORA-00132: syntax error or unresolved network name 'uat2' reserved.
initUAT2.ora:
-------------
uat1.local_listener='uat1'
uat2.local_listener='uat2'
uat1.remote_listener='uat2'
uat2.remote_listener='uat1'
(oracle@rac1): strings spfileuat1.ora | grep listener
uat1.local_listener='uat1'
uat2.local_listener='uat2'
uat1.remote_listener='uat2'
uat2.remote_listener='uat1'
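Since the ORA-00132 error complains about an unresolved network name, it is useful to list exactly which aliases the local_listener parameters refer to. A minimal sketch, assuming a text parameter file (or the saved output of strings on the spfile) with single-quoted values as shown above; the function name local_listener_aliases is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct alias names assigned to
# local_listener in a text parameter file. Assumes single-quoted values
# and lower-case parameter names, as in the listing above.
local_listener_aliases() {
    pfile=$1
    sed -n "s/^[^#]*local_listener *= *'\([^']*\)'.*/\1/p" "$pfile" | sort -u
}

# Example usage: check each alias with tnsping.
#   for a in $(local_listener_aliases initUAT2.ora); do tnsping "$a"; done
```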
The tnsping utility was working for both the uat1 and uat2 service names:
(oracle@rac2): tnsping uat1
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:19:36
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522)) (CONNECT_DATA= (SERVICE_NAME=uat) (INSTANCE_NAME=uat1)))
OK (0 msec)
(oracle@rac2): tnsping uat2
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:19:39
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522)) (CONNECT_DATA= (SERVICE_NAME=uat) (INSTANCE_NAME=uat2)))
OK (0 msec)
Added entries for local_listener and remote_listener to $TNS_ADMIN/tnsnames.ora on rac1 and, correspondingly, on rac2, and made sure that tnsping to them was working:
local_listener=
(DESCRIPTION=
(ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522))
)
remote_listener=
(DESCRIPTION=
(ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522))
)
(oracle@rac2): tnsping local_listener
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:44:05
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522)))
OK (10 msec)
(oracle@rac2): tnsping remote_listener
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:44:13
Copyright (c) 1997 Oracle Corporation. All rights reserved.
Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora
Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522)))
OK (10 msec)
The final clincher
I saw the light when I realized that the database ORACLE_HOME was using autoconfig, which meant that $TNS_ADMIN had $CONTEXT_NAME in it and was not the plain vanilla $ORACLE_HOME/network/admin. Therefore, the listener.ora and tnsnames.ora files in $TNS_ADMIN were not the same as those in $ORACLE_HOME/network/admin. Presumably the instances started through srvctl did not inherit TNS_ADMIN from the shell and so looked in $ORACLE_HOME/network/admin, where the uat1 and uat2 aliases could not be resolved.
To get around this issue gracefully, the srvctl setenv command could be used:
$ srvctl setenv instance -d uat -i uat1 -t TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat1_rac1
$ srvctl getenv instance -d uat -i uat1
TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat1_rac1
$ srvctl setenv instance -d uat -i uat2 -t TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat2_rac2
$ srvctl getenv instance -d uat -i uat2
TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat2_rac2
(oracle@rac1): srvctl start instance -d uat -i uat1
(oracle@rac1): srvctl status instance -d uat -i uat1
Instance uat1 is running on node rac1
(oracle@rac1): srvctl start instance -d uat -i uat2
(oracle@rac1): srvctl status instance -d uat -i uat2
Instance uat2 is running on node rac2
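The TNS_ADMIN mismatch described in this section could have been spotted earlier by checking in which directories an alias is actually defined. The following is a rough sketch under stated assumptions: the function names alias_defined and compare_tns_dirs are hypothetical, the match is a plain text search for a line starting with the alias, and it does not follow IFILE includes or handle aliases with a domain suffix.

```shell
#!/bin/sh
# Hypothetical helper: succeed if tnsnames.ora in the given directory
# has a line that starts with the given alias (case-insensitive).
# Plain text match only; IFILE includes are not followed.
alias_defined() {
    admin_dir=$1
    alias_name=$2
    [ -f "$admin_dir/tnsnames.ora" ] || return 1
    grep -i -q "^[[:space:]]*$alias_name[[:space:]]*=" "$admin_dir/tnsnames.ora"
}

# Hypothetical helper: report, for one alias, which of the given
# directories define it.
compare_tns_dirs() {
    alias_name=$1
    shift
    for dir in "$@"; do
        if alias_defined "$dir" "$alias_name"; then
            echo "$alias_name: defined in $dir"
        else
            echo "$alias_name: NOT defined in $dir"
        fi
    done
}

# Example usage:
#   compare_tns_dirs uat1 \
#       /u01/oracle/uatdb/9.2/network/admin/uat1_rac1 \
#       /u01/oracle/uatdb/9.2/network/admin
```

If an alias shows up only under the context-specific directory, a process that falls back to $ORACLE_HOME/network/admin will not be able to resolve it.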
Some pending issues..
Some issues were still pending, pointing to further misconfiguration, but at least we were able to get past the initial error:
(oracle@rac1): srvconfig -version
java.lang.NullPointerException
at java.text.MessageFormat.format(Compiled Code)
at java.text.MessageFormat.format(MessageFormat.java)
at java.text.MessageFormat.format(MessageFormat.java)
at oracle.ops.mgmt.nls.MessageBundle.getMessage(MessageBundle.java:225)
at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)
