Case Study: Manual OCR Cleanup and Reconfiguration

This morning I had the opportunity to work with a colleague on a case of OCR corruption; it was an interesting learning experience.

Late in the evening I got a call describing this scenario:

On a new two-node RAC environment, local connections to the database were not working, and while trying to fix the problem the RAC components went down and refused to start again.

When I arrived at the site I found that the nodeapps were running on node 2 but refused to start on node 1, two listeners were registered with CRS on each node, and the database refused to start as well.
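
For context (an aside, not from the original session): on 10g RAC, crs_stat gives a quick survey of what CRS thinks of every registered resource. The output here is illustrative only:

[dbtst1] > crs_stat -t
Name            Type         Target    State     Host
ora.dbtst1.vip  application  ONLINE    OFFLINE
ora.dbtst2.vip  application  ONLINE    ONLINE    dbtst2
...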

The first check was the alert log: nothing at all had been written to it... It seemed that CRS was trying to start a different database.
The second check was the CRS configuration of the database:

[dbtst1] > srvctl config database -d sbtst1
dbtst1 sbtst11 /dbtst/app01/oracle/product/10gASM
dbtst2 sbtst12 /dbtst/app01/oracle/product/10gASM

Somehow the database was pointing to the ASM home... So we corrected this:

[dbtst1] > srvctl modify database -d sbtst1 -n dbtst1 -o /dbtst/app01/oracle/product/10gDB -p +SBDATADG/parameter/sbtst1_spfile.ora
[dbtst1] > srvctl modify database -d sbtst1 -n dbtst2 -o /dbtst/app01/oracle/product/10gDB -p +SBDATADG/parameter/sbtst1_spfile.ora
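
To verify the fix, srvctl config can simply be re-run; the output below is what one would expect after the modify, sketched rather than captured at the time:

[dbtst1] > srvctl config database -d sbtst1
dbtst1 sbtst11 /dbtst/app01/oracle/product/10gDB
dbtst2 sbtst12 /dbtst/app01/oracle/product/10gDB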

A new attempt to start the database worked!  ...
... but only for the instance on node 2; the instance on node 1 seemed to be in the same state as before, with not a single byte written to the alert log.
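
srvctl can confirm this kind of asymmetry directly; the output below is a sketch of what such a state typically looks like, not a capture from the session:

[dbtst1] > srvctl status database -d sbtst1
Instance sbtst11 is not running on node dbtst1
Instance sbtst12 is running on node dbtst2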

In addition, the nodeapps status on node 1 was not correct, so we checked again with srvctl:

[dbtst1] > srvctl config nodeapps -n dbtst2
dbtst2 sbtst12 /dbtst/app01/oracle/product/10gDB
[dbtst1] > srvctl config nodeapps -n dbtst1
<no output>

Node 1 didn't return any output, so we decided to remove nodeapps and recreate it:

[dbtst1] > srvctl remove nodeapps -n dbtst1

The command worked only partially: it was able to remove ONS and GSD, but not the listeners and the VIP. We tried to re-add the nodeapps but kept getting the same errors. At that moment we checked with ifconfig and found that the virtual IP was down on node 1...
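
For illustration, the check was along these lines (the VIP address comes from the add nodeapps command later in this post; grepping for the address itself avoids guessing the alias name):

[dbtst1] > ifconfig -a | grep 10.4.10.46
<no output — the virtual IP was not plumbed on any interface of node 1>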

The next step was to reconfigure the virtual IP with vipca, but this attempt failed with the error: virtual IP in use. The status at this moment was:
  • srvctl remove nodeapps did not succeed, even with -f
  • vipca could not configure the VIP because the previous configuration was still in place in the OCR
So we decided to unregister the resources with crs_unregister. This is a low-level command and should not be used without Oracle Support supervision.
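
Before unregistering anything, it helps to list the exact resource names still registered for node 1; crs_stat without arguments prints one profile block per resource, so grepping on NAME is enough (sketch, output illustrative):

[dbtst1] > crs_stat | grep NAME | grep dbtst1
NAME=ora.dbtst1.LISTENER_SBTST11.lsnr
NAME=ora.dbtst1.LISTENER_PRD_SBTST11.lsnr
NAME=ora.dbtst1.vip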

First we unregistered both listeners, because they depend on the VIP, and then the VIP itself; that completely cleaned the nodeapps on node 1:

./crs_unregister ora.dbtst1.LISTENER_SBTST11.lsnr
./crs_unregister ora.dbtst1.LISTENER_PRD_SBTST11.lsnr
./crs_unregister ora.dbtst1.vip
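
Each removal can be checked afterwards; once a resource is unregistered, crs_stat should report it as missing, along these lines (illustrative):

./crs_stat ora.dbtst1.vip
CRS-0210: Could not find resource 'ora.dbtst1.vip'.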

After that, vipca executed successfully and we got the VIP network service up and running.
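
At this point the VIP should be visible on the public interface; an illustrative check (output sketched, broadcast derived from the /24 mask):

[dbtst1] > ifconfig -a | grep 10.4.10.46
inet addr:10.4.10.46  Bcast:10.4.10.255  Mask:255.255.255.0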

Adding the nodeapps worked this time, and we got ONS, GSD and the VIP up and running using:

[dbtst1] > srvctl add nodeapps -n dbtst1 -A 10.4.10.46/255.255.255.0/bond0
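
Verification is again a one-liner; the output below follows the usual 10g srvctl format, sketched rather than captured:

[dbtst1] > srvctl status nodeapps -n dbtst1
VIP is running on node: dbtst1
GSD is running on node: dbtst1
Listener is not running on node: dbtst1
ONS daemon is running on node: dbtst1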

We still needed to add the listener and get local connections to work, which was where the problem had started...

We used the Network Configuration Assistant, netca, to remove and recreate the default cluster listener, using both the TCP and IPC protocols. IPC provides the local-connection capability that had been missing before.
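
For reference, a listener.ora along the lines netca generates when both protocols are selected; the listener name matches the CRS resource above, but the IPC key and exact layout are illustrative, not taken from the actual system:

LISTENER_SBTST11 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC))
    )
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.4.10.46)(PORT = 1521)(IP = FIRST))
    )
  )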

Netca registered the listener with CRS and we were able to start all RAC components.
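
A final illustrative check that the listener serves both endpoints (output heavily abridged; the exact summary lines vary by version):

[dbtst1] > lsnrctl status LISTENER_SBTST11
...
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.4.10.46)(PORT=1521)))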





