10gR2 CRS case study: CRS would not start after reboot - stuck at /etc/init.d/init.cssd startcheck

Preface

I recently installed 10gR2 CRS on SuSE Linux 9.3 (2.6.5.7-244 kernel) and noticed that, after a reboot of the RAC nodes, CRS would not come up!

The CSS daemon was stuck at the /etc/init.d/init.cssd startcheck command:

raclinux1:/tmp # ps -ef | grep css
root      6929     1  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root      6960  6928  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      6963  6929  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      7064  6935  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck

Debugging..

To debug this further, I went to the $ORA_CRS_HOME/log/<nodename>/client directory and checked the latest files there:

raclinux1:/opt/oracle/product/10.2.0/crs/log/raclinux1/client # ls -ltr
total 435
-rw-r-----  1 root   root 2561 May 18 23:20 ocrconfig_8870.log
-rw-r--r--  1 root   root  195 May 18 23:22 clscfg_8924.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15307_3.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15319_3.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15447_3.log
...
...
...
drwxr-x---  2 oracle dba  3472 May 19 08:10 .
drwxr-xr-t  8 root   dba   232 May 19 13:50 ..
-rw-r--r--  1 root   root 2946 May 19 14:11 clsc.log
-rw-r--r--  1 root   root 7702 May 19 14:11 css.log

I paged through clsc.log and css.log with more and saw the following errors:

$ more clsc.log
...
...
...
2008-05-19 14:11:29.912: [ COMMCRS][1094672672]clsc_connect: (0x81c74b8) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))

2008-05-19 14:11:31.582: [ COMMCRS][1094672672]clsc_connect: (0x817e3f0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

2008-05-19 14:11:31.583: [ default][1094672672]Terminating clsd session

$ more css.log
...
...
...
2008-05-19 02:42:48.307: [  OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/ocr2, errno=19, os err string=No such device
2008-05-19 02:42:48.308: [  OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:42:48.308: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:42:48.308: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.

2008-05-19 02:43:41.982: [  OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/ocr2, errno=19, os err string=No such device
2008-05-19 02:43:41.983: [  OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:43:41.983: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:43:41.983: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.

2008-05-19 02:46:40.204: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9

2008-05-19 14:11:28.217: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9

2008-05-19 14:11:37.186: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9
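
With a log this repetitive, it can help to collapse the entries down to the distinct messages. The following is a minimal sketch, not an Oracle tool: the function name ocr_errors is hypothetical, and the component tags it filters on (OCROSD, OCRRAW, CSSCLNT) are simply the ones visible in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct OCR/CSS client messages in a log,
# with the timestamp, component tag and thread id stripped so repeats collapse.
ocr_errors() {
    logfile=$1
    grep -E 'OCROSD|OCRRAW|CSSCLNT' "$logfile" \
        | sed -e 's/^[^]]*\]//' -e 's/^\[[^]]*\]//' \
        | sort -u
}

# Typical use on a live node:
#   ocr_errors $ORA_CRS_HOME/log/<nodename>/client/css.log
```

The first sed expression removes everything up to the closing bracket of the component tag, and the second removes the bracketed thread id that follows it.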

So the logs pointed to the OCR not being available, which the /tmp/crsctl.<PID> files confirmed:

raclinux1:/tmp # ls -ltr crsctl*
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6826
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6679
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6673
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7784
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7890
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7794
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.7034
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.6886
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.6883
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.6960
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.7064
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.6963

raclinux1:/tmp # more crsctl.6963
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Permission denied] [13]
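
Since a new crsctl.<PID> file shows up on every retry, the newest one is the one worth reading. A small sketch for pulling it up directly; the function name latest_crsctl is hypothetical, and it assumes the file names contain no whitespace (true for the crsctl.<PID> pattern).

```shell
#!/bin/sh
# Hypothetical helper: show the most recently modified crsctl.<PID> file in a
# directory (default /tmp), since the newest one reflects the current failure.
latest_crsctl() {
    dir=${1:-/tmp}
    # ls -t sorts newest first; head -n 1 keeps only the newest name.
    newest=$(ls -t "$dir"/crsctl.* 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no crsctl.<PID> files in $dir" >&2
        return 1
    fi
    echo "== $newest"
    cat "$newest"
}

# Typical use:
#   latest_crsctl /tmp
```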

Permission issue!

Duh! So it was a permission issue on the OCR disk (for now), which could later extend to the voting and ASM disks as well:

raclinux1:/tmp # ls -ltr /dev/raw/raw*
crw-rw-r--  1 root disk 162,  9 Nov 18  2005 /dev/raw/raw9
crw-rw-r--  1 root disk 162,  8 Nov 18  2005 /dev/raw/raw8
crw-rw-r--  1 root disk 162,  7 Nov 18  2005 /dev/raw/raw7
crw-rw-r--  1 root disk 162,  6 Nov 18  2005 /dev/raw/raw6
crw-rw-r--  1 root disk 162,  5 Nov 18  2005 /dev/raw/raw5
crw-rw-r--  1 root disk 162,  4 Nov 18  2005 /dev/raw/raw4
crw-rw-r--  1 root disk 162,  3 Nov 18  2005 /dev/raw/raw3
crw-rw-r--  1 root disk 162,  2 Nov 18  2005 /dev/raw/raw2
crw-rw-r--  1 root disk 162, 15 Nov 18  2005 /dev/raw/raw15
crw-rw-r--  1 root disk 162, 14 Nov 18  2005 /dev/raw/raw14
crw-rw-r--  1 root disk 162, 13 Nov 18  2005 /dev/raw/raw13
crw-rw-r--  1 root disk 162, 12 Nov 18  2005 /dev/raw/raw12
crw-rw-r--  1 root disk 162, 11 Nov 18  2005 /dev/raw/raw11
crw-rw-r--  1 root disk 162, 10 Nov 18  2005 /dev/raw/raw10
crw-rw-r--  1 root disk 162,  1 Nov 18  2005 /dev/raw/raw1
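
Every device in the listing above belongs to root:disk. To check this without reading through a long ls listing, something like the following sketch could be used. The function name not_owned_by is hypothetical, and it assumes GNU coreutils stat (the -c option), which is what Linux ships.

```shell
#!/bin/sh
# Hypothetical helper: report every file whose owner:group differs from the
# expected value. Assumes GNU coreutils stat (-c), as found on Linux.
not_owned_by() {
    expected=$1; shift
    for f in "$@"; do
        # Skip files that stat cannot read (stat prints its own error).
        actual=$(stat -c '%U:%G' "$f") || continue
        [ "$actual" = "$expected" ] || echo "$f is $actual, expected $expected"
    done
}

# Typical use:
#   not_owned_by oracle:dba /dev/raw/raw*
```

No output means every listed device already has the expected owner and group.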

I enabled read and write permission on the raw devices with chmod +rw /dev/raw/raw*, but even after that, the newest /tmp/crsctl.<PID> files showed this message:

raclinux1:/tmp # more crsctl.6960
Failure -2 opening file handle for (vote1)
Failure 1 checking the CSS voting disk 'vote1'.
Failure -2 opening file handle for (vote2)
Failure 1 checking the CSS voting disk 'vote2'.
Failure -2 opening file handle for (vote3)
Failure 1 checking the CSS voting disk 'vote3'.
Not able to read adequate number of voting disks
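
One possible reason the chmod made no difference: when no who-specifier (u, g, o or a) is given, chmod leaves alone the bits that are masked by the umask. The sketch below shows this on a scratch file rather than on the raw devices. The umask of 022 is an assumption about the root shell, not something recorded in the session above, and stat -c assumes GNU coreutils.

```shell
#!/bin/sh
# Demonstrates on a scratch file why "chmod +rw" may not grant write access to
# group/other. The umask value 022 is an assumption, not taken from the post.
demo_chmod_umask() {
    (
        umask 022
        f=$(mktemp)
        chmod 000 "$f"
        chmod +rw "$f"              # no who-specifier: umask-masked bits are skipped
        plain=$(stat -c '%a' "$f")  # GNU stat; prints the octal mode
        chmod a+rw "$f"             # explicit "a": the umask is ignored
        all=$(stat -c '%a' "$f")
        rm -f "$f"
        echo "chmod +rw -> $plain, chmod a+rw -> $all"
    )
}

demo_chmod_umask
```

Under that umask the scratch file ends up at 644 after chmod +rw and at 666 after chmod a+rw. On devices that were already crw-rw-r--, a plain +rw would therefore add nothing for a user outside the owning group.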

At this point, I just chowned /dev/raw/raw* to oracle:dba like this:

raclinux1:/tmp # chown oracle:dba /dev/raw/raw*

After 1-2 mins, the CSS came up:

raclinux1:/tmp # ps -ef | grep css
root      6929     1  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root     10900  6929  0 14:39 ?        00:00:00 /bin/sh /etc/init.d/init.cssd daemon
oracle   10980 10900  0 14:40 ?        00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd;  /opt/oracle/product/10.2.0/crs/bin/ocssd  || exit $?'
oracle   10981 10980  0 14:40 ?        00:00:00 /bin/sh -c ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd;  /opt/oracle/product/10.2.0/crs/bin/ocssd  || exit $?
oracle   11007 10981  2 14:40 ?        00:00:00 /opt/oracle/product/10.2.0/crs/bin/ocssd.bin
root     12013  7414  0 14:40 pts/2    00:00:00 grep css
raclinux1:/tmp #

The CRS components came up fine automatically:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy
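
Because the stack takes a minute or two to come up, it can be handy to poll for the three "appears healthy" lines instead of re-running the check by hand. A minimal sketch, keyed only to the output format shown above; the function name crs_all_healthy and the 10-second interval are my own choices.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if `crsctl check crs` style output on
# standard input reports CSS, CRS and EVM as healthy.
crs_all_healthy() {
    out=$(cat)
    for comp in CSS CRS EVM; do
        echo "$out" | grep -q "^$comp appears healthy" || return 1
    done
    return 0
}

# Typical use on a live node (the polling interval is arbitrary):
#   until ./crsctl check crs | crs_all_healthy; do sleep 10; done
```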

The ASM and RAC instances also came up fine:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # ps -ef |grep smon
oracle   12257     1  0 14:41 ?        00:00:00 asm_smon_+ASM1
oracle   13100     1  0 14:41 ?        00:00:02 ora_smon_o10g1
root     32282  7414  0 14:55 pts/2    00:00:00 grep smon

For the long term..

To make this change permanent, I put the chown in the /etc/init.d/boot.local file, along with the modprobe hangcheck-timer command:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # more /etc/init.d/boot.local
#! /bin/sh
#
# Copyright (c) 2002 SuSE Linux AG Nuernberg, Germany.  All rights reserved.
#
# Author: Werner Fink <werner@suse.de>, 1996
#         Burchard Steinbild, 1996
#
# /etc/init.d/boot.local
#
# script with local commands to be executed from init on system startup
#
# Here you should add things, that should happen directly after booting
# before we're going to the first run level.
#
chown oracle:dba /dev/raw/raw*
modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
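
The chown above hands every raw device to oracle:dba. A narrower variant would name only the devices that CRS and ASM actually use. The following is a sketch of that idea, not what I ran: the function name set_device_owner is hypothetical, and the device numbers in the example call are placeholders, since which raw devices hold the OCR and voting disks is not shown here.

```shell
#!/bin/sh
# Hypothetical helper for boot.local: chown only the listed devices and warn
# about any that are missing, instead of globbing every /dev/raw/raw*.
set_device_owner() {
    owner=$1; shift
    rc=0
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            chown "$owner" "$dev" || rc=1
        else
            echo "warning: $dev not found, skipping" >&2
            rc=1
        fi
    done
    return $rc
}

# Example call; the device numbers are placeholders, not from this setup:
#   set_device_owner oracle:dba /dev/raw/raw1 /dev/raw/raw2
```

A non-zero return tells the boot script that at least one device was missing or could not be changed, which is easier to notice than CSS silently sitting in startcheck.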

Conclusion

If something as simple as the permissions on the OCR devices is incorrect, it can hold down the CRS daemons and the ASM/DB instances. Workarounds in /etc/init.d/boot.local may be needed to get around the situation.


Comments:

Thanks Gaurav Verma. We had this issue some time, I thought to blog.... but ...

Posted by Virag on May 19, 2008 at 02:19 PM EDT #

Hi Virag, Thanks for your comment. I was just playing with virtualbox (building a 10gR2 rac) and came across it.. since I had the debug information, I thought why not put together a quick post. Thanks

Posted by Gaurav Verma on May 19, 2008 at 02:22 PM EDT #

I had a similar issue after installing SLES10 SP2. The problem was that the oracle user was no longer assigned "dba" as its primary group. I fixed that by assigning the oracle user the groups "oinstall" and "disk", with "dba" as the primary group.

Posted by phimic on November 18, 2008 at 11:27 PM EST #
