« Considerations for virtual IP setup before doing the 10gR2 CRS install | Main | puzzling RMAN: No channel to restore a backup or copy of datafile? »

10gR2 CRS case study: CRS would not start after reboot - stuck at /etc/init.d/init.cssd startcheck

Preface

I had recently done a 10gR2 CRS installation on SuSE linux 9.3 (2.6.5.7-244 kernel) and noticed that after a reboot of the RAC nodes, the CRS would not come up!

The CSS daemon was stuck at the /etc/init.d/init.cssd startcheck command:

raclinux1:/tmp # ps -ef | grep css
root      6929     1  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root      6960  6928  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      6963  6929  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      7064  6935  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck

Debugging..

To debug this more, I went to the $ORA_CRS_HOME/log/<nodename>/client and checked the latest files there:

raclinux1:/opt/oracle/product/10.2.0/crs/log/raclinux1/client # ls -ltr
total 435
-rw-r-----  1 root   root 2561 May 18 23:20 ocrconfig_8870.log
-rw-r--r--  1 root   root  195 May 18 23:22 clscfg_8924.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15307_3.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15319_3.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15447_3.log
...
...
...
drwxr-x---  2 oracle dba  3472 May 19 08:10 .
drwxr-xr-t  8 root   dba   232 May 19 13:50 ..
-rw-r--r--  1 root   root 2946 May 19 14:11 clsc.log
-rw-r--r--  1 root   root 7702 May 19 14:11 css.log

I did a more of the clsc.log & css.log and saw the following errors:

$ more clsc.log
...
...
...
2008-05-19 14:11:29.912: [ COMMCRS][1094672672]clsc_connect: (0x81c74b8) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))

2008-05-19 14:11:31.582: [ COMMCRS][1094672672]clsc_connect: (0x817e3f0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

2008-05-19 14:11:31.583: [ default][1094672672]Terminating clsd session

$ more css.log
...
...
...
2008-05-19 02:42:48.307: [  OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/oc
r2, errno=19, os err string=No such device
2008-05-19 02:42:48.308: [  OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:42:48.308: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:42:48.308: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.

2008-05-19 02:43:41.982: [  OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/oc
r2, errno=19, os err string=No such device
2008-05-19 02:43:41.983: [  OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:43:41.983: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:43:41.983: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.

2008-05-19 02:46:40.204: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9

2008-05-19 14:11:28.217: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9

2008-05-19 14:11:37.186: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9

So it was pointing towards the OCR being not available, as could be verified by the /tmp/crsctl.<PID> files too:

raclinux1:/tmp # ls -ltr crsctl*
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6826
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6679
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6673
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7784
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7890
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7794
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.7034
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.6886
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.6883
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.6960
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.7064
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.6963

raclinux1:/tmp # more crsctl.6963
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Permission denied] [13]

Permission issue!

Duh! So it was a permission issue on the OCR disk (at this moment), which could expand into a permissions issue for Voting and asm disks later:

raclinux1:/tmp # ls -ltr /dev/raw/raw*
crw-rw-r--  1 root disk 162,  9 Nov 18  2005 /dev/raw/raw9
crw-rw-r--  1 root disk 162,  8 Nov 18  2005 /dev/raw/raw8
crw-rw-r--  1 root disk 162,  7 Nov 18  2005 /dev/raw/raw7
crw-rw-r--  1 root disk 162,  6 Nov 18  2005 /dev/raw/raw6
crw-rw-r--  1 root disk 162,  5 Nov 18  2005 /dev/raw/raw5
crw-rw-r--  1 root disk 162,  4 Nov 18  2005 /dev/raw/raw4
crw-rw-r--  1 root disk 162,  3 Nov 18  2005 /dev/raw/raw3
crw-rw-r--  1 root disk 162,  2 Nov 18  2005 /dev/raw/raw2
crw-rw-r--  1 root disk 162, 15 Nov 18  2005 /dev/raw/raw15
crw-rw-r--  1 root disk 162, 14 Nov 18  2005 /dev/raw/raw14
crw-rw-r--  1 root disk 162, 13 Nov 18  2005 /dev/raw/raw13
crw-rw-r--  1 root disk 162, 12 Nov 18  2005 /dev/raw/raw12
crw-rw-r--  1 root disk 162, 11 Nov 18  2005 /dev/raw/raw11
crw-rw-r--  1 root disk 162, 10 Nov 18  2005 /dev/raw/raw10
crw-rw-r--  1 root disk 162,  1 Nov 18  2005 /dev/raw/raw1

I enabled read and write permission for the raw devices using the # chmod +rw /dev/raw/raw* devices. but even after that the latest /tmp/crsctl.<PID> files being generated were showing this message:

raclinux1:/tmp # more crsctl.6960
Failure -2 opening file handle for (vote1)
Failure 1 checking the CSS voting disk 'vote1'.
Failure -2 opening file handle for (vote2)
Failure 1 checking the CSS voting disk 'vote2'.
Failure -2 opening file handle for (vote3)
Failure 1 checking the CSS voting disk 'vote3'.
Not able to read adequate number of voting disks

At this point, I just chowned /dev/raw/raw* to oracle:dba like this:

raclinux1:/tmp # chown oracle:dba /dev/raw/raw*

After 1-2 mins, the CSS came up:

raclinux1:/tmp # ps -ef | grep css
root      6929     1  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root     10900  6929  0 14:39 ?        00:00:00 /bin/sh /etc/init.d/init.cssd daemon
oracle   10980 10900  0 14:40 ?        00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd;  /opt/oracle/product/10.2.0/crs/bin/ocssd  || exit $?'
oracle   10981 10980  0 14:40 ?        00:00:00 /bin/sh -c ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd;  /opt/oracle/product/10.2.0/crs/bin/ocssd  || exit $?
oracle   11007 10981  2 14:40 ?        00:00:00 /opt/oracle/product/10.2.0/crs/bin/ocssd.bin
root     12013  7414  0 14:40 pts/2    00:00:00 grep css
raclinux1:/tmp #

The CRS components came up fine automatically:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

The ASM and RAC instances also came up fine:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # ps -ef |grep smon
oracle   12257     1  0 14:41 ?        00:00:00 asm_smon_+ASM1
oracle   13100     1  0 14:41 ?        00:00:02 ora_smon_o10g1
root     32282  7414  0 14:55 pts/2    00:00:00 grep smon

For the long term..

To make this change permanent, I put it in /etc/init.d/boot.local file, along with the modprobe hangcheck-timer  command:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # more /etc/init.d/boot.local
#! /bin/sh
#
# Copyright (c) 2002 SuSE Linux AG Nuernberg, Germany.  All rights reserved.
#
# Author: Werner Fink <werner@suse.de>, 1996
#         Burchard Steinbild, 1996
#
# /etc/init.d/boot.local
#
# script with local commands to be executed from init on system startup
#
# Here you should add things, that should happen directly after booting
# before we're going to the first run level.
#
chown oracle:dba /dev/raw/raw*
modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180

Conclusion

If simple things are permissions are not correct on the OCR devices, it can hold down the CRS daemons and the ASM/DB instances. It may be needed to put workarounds in /etc/init.d/boot.local for getting around the situation.


TrackBack

TrackBack URL for this entry:
http://blogs.oracle.com/mte1521/mt-tb.cgi/2109

Comments (3)

Thanks Gaurav Verma. We had this issue some time, I thought to blog.... but ...

Gaurav Verma:

Hi Virag, Thanks for your comment. I was just playing with virtualbox (building a 10gR2 rac) and came across it.. since I had the debug information, I thought why not put together a quick post. Thanks

i had a similar issue after installin SLES10 SP2, the problem was that the user oracle was not any longer assigned to the "dba" as primary group. I fixed that by assigning the groups "oinstall", "disk" to "dba" as primary group to the oracle user.

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

About This Entry

This page contains a single entry from the blog posted on May 19, 2008 12:25 PM.

The previous post in this blog was Considerations for virtual IP setup before doing the 10gR2 CRS install.

The next post in this blog is puzzling RMAN: No channel to restore a backup or copy of datafile?.

Many more can be found on the main index page or by looking through the archives.

Top Tags

Creative Commons License
This weblog is licensed under a Creative Commons License.
Powered by
Movable Type and Oracle