Troubleshooting case study for 9i RAC: PRKC-1021 : Problem in the clusterware

Preface

This is a troubleshooting case study for a 9i RAC environment on HP-UX PA-RISC 64-bit, which used HP ServiceGuard (a third-party solution) for cluster management.

The problem

Both gsdctl and srvctl failed with PRKC-1021, although lsnodes still listed the cluster nodes:

(oracle@rac2):  gsdctl stat
PRKC-1021 : Problem in the clusterware
Failed to get list of active nodes from clusterware

(oracle@rac1):  srvctl config
PRKC-1021 : Problem in the clusterware

(oracle@rac1):  pwd
/u01/oracle/uatdb/9.2/bin
(oracle@rac1):  ./lsnodes
rac1
rac2
app1
app2


Troubleshooting Approach

Following note 178683.1, I edited $ORACLE_HOME/bin/gsd.sh as shown below to enable tracing, then ran gsd.sh:

# added -DTRACING.ENABLED=true -DTRACING.LEVEL=2 before -classpath
exec $JRE -DPROGRAM=gsd -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath $CLASSPATH oracle.ops.mgmt.daemon.OPSMDaemon $MY_OHOME
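
The same edit can be scripted rather than made by hand. The following is a minimal sketch, not something taken from note 178683.1: the helper name add_gsd_tracing is hypothetical, and it assumes that the launch line in gsd.sh contains -DPROGRAM=gsd followed later by -classpath, as in the edited line above. It prints the modified script to standard output and leaves the original file untouched.

```shell
# Hypothetical helper: print gsd.sh with the tracing flags inserted
# before -classpath on the line that launches the daemon.
# Lines that already carry the flags are left alone, so running it
# on its own output changes nothing.
add_gsd_tracing() {
    # $1 = path to gsd.sh (assumed layout: "... -DPROGRAM=gsd -classpath ...")
    sed '/-DPROGRAM=gsd/{
/-DTRACING\.ENABLED=/!s/ -classpath / -DTRACING.ENABLED=true -DTRACING.LEVEL=2 -classpath /
}' "$1"
}
```

Review the output before putting it in place of gsd.sh, and keep a backup of the original so that tracing can be switched off again.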


$ $ORACLE_HOME/bin/gsd.sh
...
...
[main] [13:15:42:852] [line# N/A]  my property portnum=null
[main] [13:15:42:942] [NativeSystem.<init>:123]  Detected Cluster
[main] [13:15:42:946] [NativeSystem.<init>:124]  Cluster existence = true
[main] [13:15:42:947] [UnixSystem.<init>:118]  Going to load SRVM library
[main] [13:15:42:950] [UnixSystem.<init>:118]  loaded libraries
[main] [13:15:42:950] [OPSMDaemon.main:726]  Initializing the daemon ...
[main] [13:15:42:951] [OPSMDaemon.<init>:188]  NativeDebug is set to true
[main] [13:15:42:952] [OPSMDaemon.<init>:188]  UnixSystem.initializeDaemon: groupName is opsm
[main] [13:15:42:953] [OPSMDaemon.<init>:188]  Unsatisfied Link Error caught. Could not initialize the cluster
[main] [13:15:42:954] [OPSMDaemon.main:726]  initializeDaemon status = false
[main] [13:15:42:955] [OPSMDaemon.main:726]  Failed to initialize and register with clusterware
[main] [13:15:42:955] [OPSMDaemon.main:726]  OPSMErrCode = 1003
[main] [13:15:42:958] [OPSMDaemon.main:729]  java.rmi.RemoteException: Unable to initialize with the clusterware
java.rmi.RemoteException: Unable to initialize with the clusterware
        at oracle.ops.mgmt.daemon.OPSMDaemon.<init>(OPSMDaemon.java:195)
        at oracle.ops.mgmt.daemon.OPSMDaemon.main(OPSMDaemon.java:726)

[main] [13:15:42:958] [line# N/A]  Exiting from main..no point trying to start the daemon


At this point, one option was to initialize the srvconfig raw device (OCR) and then add the RAC instances etc. manually using the srvctl add command, but srvconfig itself failed with an UnsatisfiedLinkError:

(oracle@rac1):  srvconfig -init
java.lang.UnsatisfiedLinkError: readRawObject
        at oracle.ops.mgmt.nativesystem.UnixSystem.readRawObject(UnixSystem.java:410)
        at oracle.ops.mgmt.rawdevice.RawDevice.readMagicString(RawDevice.java:187)
        at oracle.ops.mgmt.rawdevice.RawDeviceVersion.readVersionString(RawDeviceVersion.java:175)
        at oracle.ops.mgmt.rawdevice.RawDeviceVersion.isValidConfigDevice(RawDeviceVersion.java:75)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.<init>(RawDeviceUtil.java:147)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)
Exception in thread "main" (oracle@rac1): 

(oracle@rac1):  srvconfig  -version  
java.lang.UnsatisfiedLinkError: readRawObject
        at oracle.ops.mgmt.nativesystem.UnixSystem.readRawObject(UnixSystem.java:410)
        at oracle.ops.mgmt.rawdevice.RawDevice.readMagicString(RawDevice.java:187)
        at oracle.ops.mgmt.rawdevice.RawDeviceVersion.readVersionString(RawDeviceVersion.java:175)
        at oracle.ops.mgmt.rawdevice.RawDeviceVersion.isValidConfigDevice(RawDeviceVersion.java:75)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.<init>(RawDeviceUtil.java:147)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)

If the configuration file points to a valid raw device, srvconfig -version should return output like the following:

     $ raw device version "9.0.0.0.0"

Since we were not getting that output, either there was a problem accessing the OCR raw device or the soft link was not working because of a permissions issue. However, dumping the contents of the OCR to standard output with dd if=/dev/orar1/rrawuat.conf bs=1500 showed that the device was readable.

$  more /var/opt/oracle/srvConfig.loc
srvconfig_loc=/dev/orar1/rrawuat.conf

$ ls -l /dev/orar1/rrawuat.conf
crw-rw----   1 oracle    dba         64 0x110004 Apr 11  2007 /dev/orar1/rrawuat.conf
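
The checks in this step (find the OCR location, confirm access, attempt a read) can be collected into one helper. This is only a sketch under assumptions: the function names ocr_location and check_ocr_device are hypothetical, and the default path is simply the srvConfig.loc location shown above.

```shell
# Print the value of srvconfig_loc from a srvConfig.loc-style file.
ocr_location() {
    sed -n 's/^[[:space:]]*srvconfig_loc[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Report whether the configured OCR device can be read by the current user.
check_ocr_device() {
    dev=$(ocr_location "${1:-/var/opt/oracle/srvConfig.loc}")
    [ -n "$dev" ] || { echo "srvconfig_loc not set"; return 1; }
    [ -r "$dev" ] || { echo "$dev: not readable by $(id -un)"; return 1; }
    # Read one small block; if this fails, the problem is the device
    # or its permissions rather than the srvconfig tool.
    dd if="$dev" bs=1500 count=1 >/dev/null 2>&1 || { echo "$dev: read failed"; return 1; }
    echo "$dev: readable"
}
```

A "readable" result, as in this case, moves suspicion away from the raw device and towards the libraries that srvconfig loads.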


The next idea was to relink the srv* binaries using make, but that also failed:

(oracle@rac2):  cd $ORACLE_HOME/srvm/lib  
(oracle@rac2):  make -f ins_srvm.mk install
nm -gp /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a | grep T  | grep Java | awk '{ print "-u " $3 }' >  /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def;
/u01/oracle/uatdb/9.2/bin/echodo ld +s -G -b -o libsrvm.sl -c
/u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a                           -L/u01/oracle/uatdb/9.2/lib32/ -L/u01/oracle/uatdb/9.2/srvm/lib/  -L/usr/lib -lc  -lclntsh -lwtc9 -lnls9  -lcore9 -lnls9 -lcore9  -lnls9 -lxml9 -lcore9
-lunls9 -lnls9 /opt/nmapi/nmapi2/lib/libnmapi2.sl -lm `cat /u01/oracle/uatdb/9.2/lib/sysliblist` ;
rm /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def
ld +s -G -b -o libsrvm.sl -c /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.def /u01/oracle/uatdb/9.2/srvm/lib/libsrvm.a -L/u01/oracle/uatdb/9.2/lib32/ -L/u01/oracle/uatdb/9.2/srvm/lib/
-L/usr/lib -lc -lclntsh -lwtc9 -lnls9 -lcore9 -lnls9 -lcore9 -lnls9 -lxml9 -lcore9 -lunls9 -lnls9 /opt/nmapi/nmapi2/lib/libnmapi2.sl -lm -l:libcl.sl -l:librt.sl -lpthread -l:libnss_dns.1 -l:libdld.sl
ld: Mismatched ABI (not an ELF file) for -lclntsh, found /u01/oracle/uatdb/9.2/lib32//libclntsh.sl
Fatal error.
*** Error exit code 1

Stop.
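
The "not an ELF file" message is a clue in its own right. On HP-UX PA-RISC, 64-bit objects are in ELF format while 32-bit objects use the older SOM format, so my reading is that the linker was working in 64-bit mode and picked up the 32-bit libclntsh.sl from lib32. The file command would normally show this; the sketch below does the same check by hand by looking for the ELF magic bytes. The function name is_elf is hypothetical.

```shell
# Succeed if the file starts with the ELF magic number (0x7f 'E' 'L' 'F').
is_elf() {
    magic=$(dd if="$1" bs=4 count=1 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}
```

For example, is_elf $ORACLE_HOME/lib32/libclntsh.sl && echo ELF || echo "not ELF" could be run against each library named in the link line to see which ones do not match the mode the linker is using.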

The turning point

Bug 6281672 provided a hint.

I compared the $ORACLE_HOME/lib32/libsrvm* files with those on another RAC system (duat):

(oracle@duat1):  ls -l $ORACLE_HOME/lib*/libsrvm*
-rwxr-xr-x   1 oracle    dba          57344 Nov  7 20:04 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl
-rwxr-xr-x   1 oracle    dba          57344 Nov  7 08:11 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x   1 oracle    dba          36864 Nov  7 20:04 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-rwxr-xr-x   1 oracle    dba          36864 Nov  7 08:11 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0
(oracle@duat1): 


On rac1 and rac2, the /u01/oracle/uatdb/9.2/lib32/libsrvm.sl file was missing. On duat, the .sl and .sl0 files were the same size and appeared to be copies of each other.

So on rac1 and rac2, I did the following:

$ cd /u01/oracle/uatdb/9.2/lib32
$ cp libsrvm.sl0 libsrvm.sl


Before:

(oracle@rac2):  ls -l $ORACLE_HOME/lib32/libsrvm*
-r-xr-xr-x   1 oracle    dba          57344 Oct 17  2004 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x   1 oracle    dba          36864 Feb  9 06:49 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-r-xr-xr-x   1 oracle    dba         106496 Nov 30  2004 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0


After:

(oracle@rac2):  ls -l $ORACLE_HOME/lib32/libsrvm*
-r-xr-xr-x   1 oracle    dba          57344 Mar  5 15:14 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl
-r-xr-xr-x   1 oracle    dba          57344 Oct 17  2004 /u01/oracle/uatdb/9.2/lib32/libsrvm.sl0
-rwxr-xr-x   1 oracle    dba          36864 Feb  9 06:49 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl
-r-xr-xr-x   1 oracle    dba         106496 Nov 30  2004 /u01/oracle/uatdb/9.2/lib32/libsrvmocr.sl0

The size mismatch between libsrvmocr.sl0 and libsrvmocr.sl did not seem to be a showstopper. The srv* commands appeared to depend on libsrvm.sl.
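
The comparison of .sl and .sl0 files can be automated so that a missing or altered library stands out immediately. This is a minimal sketch; the function name sl_report is hypothetical, and it assumes (as observed on duat) that each .sl file is expected to be a copy of its .sl0 counterpart.

```shell
# For every *.sl0 file in a directory, report a missing or differing *.sl.
sl_report() {
    # $1 = library directory, e.g. $ORACLE_HOME/lib32
    for orig in "$1"/*.sl0; do
        [ -e "$orig" ] || continue        # the glob matched nothing
        lib=${orig%0}                     # libsrvm.sl0 -> libsrvm.sl
        if [ ! -e "$lib" ]; then
            echo "missing $lib"
        elif ! cmp -s "$orig" "$lib"; then
            echo "differs $lib"
        fi
    done
}
```

On rac2 before the fix, this would have been expected to flag libsrvm.sl as missing and libsrvmocr.sl as differing, matching the listings above.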

After this, the gsdctl and srvctl commands started working:

(oracle@rac1):  gsdctl stat                                             
GSD is running on the local node

(oracle@rac1):  srvctl config
uat

(oracle@rac2):  gsdctl stat
GSD is not running on the local node
(oracle@rac2): 

(oracle@rac2):  gsdctl start
Successfully started GSD on local node

(oracle@rac2):  srvctl status database -d uat
Instance uat1 is not running on node rac1
Instance uat2 is not running on node rac2

But there was "just" one more problem..

Interestingly, srvctl stop database/instance worked fine at this point, but srvctl start did not.

(oracle@rac2):  srvctl start instance -d uat -i uat1
PRKP-1001 : Error starting instance uat1 on node rac1
ORA-00119: invalid specification for system parameter local_listener
ORA-00132: syntax error or unresolved network name 'uat1' reserved.
ORA-01078: failure in processing system parameters local_listener
ORA-00132: syntax error or unresolved network name 'uat1' reserved.

(oracle@rac2):  srvctl start instance -d uat -i uat2
PRKP-1001 : Error starting instance uat2 on node rac2
ORA-00119: invalid specification for system parameter local_listener
ORA-00132: syntax error or unresolved network name 'uat2' reserved.
ORA-01078: failure in processing system parameters
SQL> Disconnected
ORA-00132: syntax error or unresolved network name 'uat2' reserved.

initUAT2.ora:
-------------
uat1.local_listener='uat1'
uat2.local_listener='uat2'
uat1.remote_listener='uat2'
uat2.remote_listener='uat1'

(oracle@rac1):  strings spfileuat1.ora | grep listener
uat1.local_listener='uat1'
uat2.local_listener='uat2'
uat1.remote_listener='uat2'
uat2.remote_listener='uat1'

The tnsping utility worked for both the uat1 and uat2 aliases:

(oracle@rac2):  tnsping uat1
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:19:36

Copyright (c) 1997 Oracle Corporation.  All rights reserved.

Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522)) (CONNECT_DATA= (SERVICE_NAME=uat) (INSTANCE_NAME=uat1)))
OK (0 msec)

(oracle@rac2):  tnsping uat2
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:19:39

Copyright (c) 1997 Oracle Corporation.  All rights reserved.

Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522)) (CONNECT_DATA= (SERVICE_NAME=uat) (INSTANCE_NAME=uat2)))
OK (0 msec)

I added entries for local_listener and remote_listener to $TNS_ADMIN/tnsnames.ora on rac1 (and likewise on rac2; the rac2 entries are shown here) and made sure that tnsping resolved them:


local_listener=
        (DESCRIPTION=
                (ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522))
        )
remote_listener=
        (DESCRIPTION=
                (ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522))
        )

(oracle@rac2):  tnsping local_listener
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:44:05

Copyright (c) 1997 Oracle Corporation.  All rights reserved.

Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac2.test.com)(PORT=1522)))
OK (10 msec)


(oracle@rac2):  tnsping remote_listener
TNS Ping Utility for HPUX: Version 9.2.0.6.0 - Production on 05-MAR-2008 16:44:13
Copyright (c) 1997 Oracle Corporation.  All rights reserved.

Used parameter files:
/u01/oracle/uatdb/9.2/network/admin/uat2_rac2/sqlnet_ifile.ora

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION= (ADDRESS=(PROTOCOL=tcp)(HOST=rac1.test.com)(PORT=1522)))
OK (10 msec)

The final clincher

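I saw the light when I realized that the database ORACLE_HOME was using AutoConfig, which meant that $TNS_ADMIN had $CONTEXT_NAME in it and was not just plain vanilla $ORACLE_HOME/network/admin. Therefore, the listener.ora and tnsnames.ora files in $TNS_ADMIN were not the same as those in $ORACLE_HOME/network/admin. The tnsping tests succeeded because the interactive shell had $TNS_ADMIN set, whereas srvctl presumably started the instances without it and so could not resolve 'uat1' and 'uat2'.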
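
A quick way to confirm this kind of mismatch is to check whether an alias is defined in each candidate directory. The sketch below is an illustration only: the function names are hypothetical, it looks only at tnsnames.ora itself (it does not follow IFILE includes), and it assumes the alias starts a line and contains no regular-expression metacharacters.

```shell
# Succeed if tnsnames.ora in directory $1 defines alias $2
# (case-insensitive, with an optional domain suffix such as .test.com).
tns_has_alias() {
    grep -i -q -E "^[[:space:]]*$2(\.[[:alnum:]_.-]+)?[[:space:]]*=" "$1/tnsnames.ora" 2>/dev/null
}

# Report, for one alias, which of the given directories define it.
compare_tns_admin() {
    # $1 = alias, remaining arguments = candidate TNS_ADMIN directories
    alias_name=$1; shift
    for dir in "$@"; do
        if tns_has_alias "$dir" "$alias_name"; then
            echo "$alias_name: defined in $dir"
        else
            echo "$alias_name: NOT defined in $dir"
        fi
    done
}
```

Running compare_tns_admin uat1 "$TNS_ADMIN" "$ORACLE_HOME/network/admin" would show at a glance whether the two locations disagree about the alias used in local_listener.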

To get around this issue gracefully, the srvctl setenv command can be used:

$ srvctl setenv instance -d uat -i uat1 -t TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat1_rac1


$ srvctl getenv instance -d uat -i uat1
TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat1_rac1

$ srvctl setenv instance -d uat -i uat2 -t TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat2_rac2

$ srvctl getenv instance -d uat -i uat2
TNS_ADMIN=/u01/oracle/uatdb/9.2/network/admin/uat2_rac2
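
With more instances, the setenv commands follow a pattern that can be generated rather than typed. The sketch below only prints the commands (a dry run) so that they can be reviewed first. The helper name and the instance:context argument format are my own assumptions; the srvctl flags are exactly the ones used above.

```shell
# Print one "srvctl setenv instance" command per instance:context pair.
print_setenv_cmds() {
    # $1 = database name, $2 = base network/admin directory,
    # remaining arguments = instance:context pairs, e.g. uat1:uat1_rac1
    db=$1; base=$2; shift 2
    for pair in "$@"; do
        inst=${pair%%:*}     # text before the first colon
        ctx=${pair#*:}       # text after the first colon
        echo "srvctl setenv instance -d $db -i $inst -t TNS_ADMIN=$base/$ctx"
    done
}
```

Calling print_setenv_cmds uat /u01/oracle/uatdb/9.2/network/admin uat1:uat1_rac1 uat2:uat2_rac2 prints the two setenv commands shown above.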


(oracle@rac1):  srvctl start instance -d uat -i uat1

(oracle@rac1):  srvctl status instance -d uat -i uat1
Instance uat1 is running on node rac1

(oracle@rac1):  srvctl start instance -d uat -i uat2

(oracle@rac1):  srvctl status instance -d uat -i uat2
Instance uat2 is running on node rac2

Some pending issues..

Some issues were still pending, pointing towards further misconfiguration, but at least we were able to get past the initial error. For example, srvconfig -version now failed with a different exception:

(oracle@rac1):  srvconfig  -version
java.lang.NullPointerException
        at java.text.MessageFormat.format(Compiled Code)
        at java.text.MessageFormat.format(MessageFormat.java)
        at java.text.MessageFormat.format(MessageFormat.java)
        at oracle.ops.mgmt.nls.MessageBundle.getMessage(MessageBundle.java:225)
        at oracle.ops.mgmt.rawdevice.RawDeviceUtil.main(Compiled Code)

