This blog posting is about protecting an instance from a 10.2/11.1 single instance RAC database so that it can act in a cold-failover situation. I want to refer to the great document "Using Oracle Clusterware to Protect a single instance oracle database 11g" written by my collegue Philip Newlan since most input comes from here.
The mentioned pdf describes how to make sure a single instance database can failover with the use of Oracle Clusterware.
With this posting I want to show how to do this for a single instance RAC database where you have to manage instance1' and 'instance2' instead of just one instance.
Since the grid control agent may have some problems with an instance[number] travelling from node1 to node2 this choice was made on purpose for some customer. Other reason for this awkward solution is that the application for some reason cannot run with two instances concurrently.
Also, the fact that instance deployment with sequential instance numbers is standard in their grid environment the choice has been made to do this with different instance numbers instead of one.
Within this posting I will also show the required updates in tnsnames and spfile.
First, I made sure the OCR entries of the RAC database, the Services and the instances are removed from the OCR. Then a "resource group" will be created. This is the container for all the resources.
oracle@pts0138([crs]):/ora/product/11.1.0/crs> crs_profile -create \ oss.xdbprk.rg -t application -a \ /ora/product/11.1.0/crs/crs/public/act_resgroup.pl -o "ci=600" oracle@pts0138([crs]):/ora/product/11.1.0/crs> crs_register oss.xdbprk.rg
Now, let's verify the new entry:
oracle@pts0138([crs]):/ora/product/11.1.0/crs/crs/public> crsstat | grep rg HA Resource Target State oss.xdbprk.rg OFFLINE OFFLINE
The scripts mentioned in the pdf are placed in $CRS_HOME/crs/public on both nodes. I made sure the scripts are executable and tested them:
export CLUSTERWARE_HOME=/ora/product/11.1.0/crs/ export ORACLE_HOME=/ora/product/10.2.0/db_2 export _USR_ORA_LANG=$ORACLE_HOME export _USR_ORA_SRV=xdbprk2 export _USR_ORA_FLAGS=1 oracle@pts0138(xdbprk2):/tmp> $CLUSTERWARE_HOME/crs/public/act_db.pl start SQL*Plus: Release 10.2.0.4.0 - Production on Fri Sep 18 10:12:21 2009 Copyright (c) 1982, 2007, Oracle. All Rights Reserved. SQL> Connected to an idle instance. SQL> ORACLE instance started. Total System Global Area 536870912 bytes Fixed Size 2085360 bytes Variable Size 150998544 bytes Database Buffers 377487360 bytes Redo Buffers 6299648 bytes Database mounted. Database opened. SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP, Data Mining and Real Application Testing options
Also tested the stop function.
oracle@pts0138(xdbprk2):/tmp> $CLUSTERWARE_HOME/crs/public/act_db.pl stop SQL*Plus: Release 10.2.0.4.0 - Production on Fri Sep 18 10:12:37 2009 Copyright (c) 1982, 2007, Oracle. All Rights Reserved. SQL> Connected. SQL> Database closed. Database dismounted. ORACLE instance shut down. SQL> Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP, Data Mining and Real Application Testing options oracle@pts0138(xdbprk2):/ora/product/11.1.0/crs/crs/public>
The action above was executed on both nodes. Just to verify if the instance could be started/stoped with the scripts. Only problem was: for each node, the ORACLE_SID had to be changed, as both nodes have another ORACLE_SID for that database. With only one SID you would not have the problem.
Then, the failover resource was created and registered.
oracle@pts0138([crs]):/tmp> crs_profile -create oss.xdbprk.db-cold-failover \ -t application -r oss.xdbprk.rg -a /ora/product/11.1.0/crs/crs/public/act_db.pl -o "ci=20,ra=5,osrv=xdbprk,ol=/ora/product/10.2.0/db_2,oflags=1,rt=600" oracle@pts0138([crs]):/tmp> crs_register oss.xdbprk.db-cold-failover
The value osrv=xdbprk will never work as this is not the correct instance name for any of the nodes. Even if I made the value osrv=xdbprk1, then the script would only work on one of the nodes i.e. the node that had the appropriate init.ora etc.
So, leaving the value osrv=xdbprk to this, I actually hard-coded the ORACLE_SID on both sides of the cluster in the act_db.pl. Since the value is now hard coded, it should work. This clearly limits the option to re-use the script for other database, so I'd better change the name of the script to act_db_xdbprk.pl if I do this for real.
So, how will the resource profiles look now ?
oracle@pts0101([crs]):/var/opt/oracle> crs_profile -print oss.xdbprk.rg NAME=oss.xdbprk.rg TYPE=application ACTION_SCRIPT=/ora/product/11.1.0/crs/crs/public/act_resgroup.pl ACTIVE_PLACEMENT=0 AUTO_START=restore CHECK_INTERVAL=600 DESCRIPTION=oss.xdbprk.rg FAILOVER_DELAY=0 FAILURE_INTERVAL=0 FAILURE_THRESHOLD=0 HOSTING_MEMBERS= OPTIONAL_RESOURCES= PLACEMENT=balanced REQUIRED_RESOURCES= RESTART_ATTEMPTS=1 SCRIPT_TIMEOUT=60 START_TIMEOUT=0 STOP_TIMEOUT=0 UPTIME_THRESHOLD=7d USR_ORA_ALERT_NAME= USR_ORA_CHECK_TIMEOUT=0 USR_ORA_CONNECT_STR=/ as sysdba USR_ORA_DEBUG=0 USR_ORA_DISCONNECT=false USR_ORA_FLAGS= USR_ORA_IF= USR_ORA_INST_NOT_SHUTDOWN= USR_ORA_LANG= USR_ORA_NETMASK= USR_ORA_OPEN_MODE= USR_ORA_OPI=false USR_ORA_PFILE= USR_ORA_PRECONNECT=none USR_ORA_SRV= USR_ORA_START_TIMEOUT=0 USR_ORA_STOP_MODE=immediate USR_ORA_STOP_TIMEOUT=0 USR_ORA_VIP=
oracle@pts0101([crs]):/var/opt/oracle> crs_profile -print oss.xdbprk.db-cold-failover NAME=oss.xdbprk.db-cold-failover TYPE=application ACTION_SCRIPT=/ora/product/11.1.0/crs/crs/public/act_db.pl ACTIVE_PLACEMENT=0 AUTO_START=restore CHECK_INTERVAL=20 DESCRIPTION=oss.xdbprk.db-cold-failover FAILOVER_DELAY=0 FAILURE_INTERVAL=0 FAILURE_THRESHOLD=0 HOSTING_MEMBERS= OPTIONAL_RESOURCES= PLACEMENT=balanced REQUIRED_RESOURCES=oss.xdbprk.rg RESTART_ATTEMPTS=5 SCRIPT_TIMEOUT=60 START_TIMEOUT=600 STOP_TIMEOUT=0 UPTIME_THRESHOLD=7d USR_ORA_ALERT_NAME= USR_ORA_CHECK_TIMEOUT=0 USR_ORA_CONNECT_STR=/ as sysdba USR_ORA_DEBUG=0 USR_ORA_DISCONNECT=false USR_ORA_FLAGS=1 USR_ORA_IF= USR_ORA_INST_NOT_SHUTDOWN= USR_ORA_LANG=/ora/product/10.2.0/db_2 USR_ORA_NETMASK= USR_ORA_OPEN_MODE= USR_ORA_OPI=false USR_ORA_PFILE= USR_ORA_PRECONNECT=none USR_ORA_SRV=xdbprk USR_ORA_START_TIMEOUT=0 USR_ORA_STOP_MODE=immediate USR_ORA_STOP_TIMEOUT=0 USR_ORA_VIP=
Okay, and then the basic test of starting the resource, first all is down:
oracle@pts0101([crs]):/var/opt/oracle> crsstat HA Resource Target State ----------- ------ ----- oss.xdbprk.db-cold-failover OFFLINE OFFLINE oss.xdbprk.rg OFFLINE OFFLINE
Then, the start:
oracle@pts0101([crs]):/var/opt/oracle> crs_start oss.xdbprk.db-cold-failover Attempting to start `oss.xdbprk.db-cold-failover` on member `pts0138` Start of `oss.xdbprk.db-cold-failover` on member `pts0138` succeeded. oracle@pts0138([crs]):/var/opt/oracle> ps -ef | grep smon | grep dbprk oracle 31568 1 0 11:51 ? 00:00:00 ora_smon_xdbprk2
And the relocate:
oracle@pts0101([crs]):/var/opt/oracle> crs_relocate -f oss.xdbprk.db-cold-failover Attempting to stop `oss.xdbprk.db-cold-failover` on member `pts0138` Stop of `oss.xdbprk.db-cold-failover` on member `pts0138` succeeded. Attempting to stop `oss.xdbprk.rg` on member `pts0138` Stop of `oss.xdbprk.rg` on member `pts0138` succeeded. Attempting to start `oss.xdbprk.rg` on member `pts0101` Start of `oss.xdbprk.rg` on member `pts0101` succeeded. Attempting to start `oss.xdbprk.db-cold-failover` on member `pts0101` Start of `oss.xdbprk.db-cold-failover` on member `pts0101` succeeded.
Let's verify if it runs on the other node:
oracle@pts0101([crs]):/var/opt/oracle> ps -ef | grep smon | grep dbprk oracle 11568 1 0 11:55 ? 00:00:00 ora_smon_xdbprk1
And stopped on the original:
oracle@pts0138(xdbprk2):/ora/product/10.2.0/db_2/dbs> ps -ef | grep smon | grep dbprk
Okay, what is left to do from here:
Service names that used to be managed by CRS, now have to be coded hard in the spfile so that they register each time with the listener:
service_names in spfile added:
dbprk.xe.grid dbprk.xe.supp dbprk.xe.link
Also, for each node, another local listener will be used. I made sure this is in the spfile:
xdbprk1.local_listener='pts0101-LOCAL-LISTENER' xdbprk2.local_listener='pts0138-LOCAL-LISTENER'
In order to be sure there can only be started one instance the cluster_database_instances_parameter is set to 1.
*.cluster_database_instances=1
For tnsnames a normail failover entry is created, if the first instance is down, the next will be found (and should be running):
oracle@pts0101(xdbprk1):/ora/dbprk/admin/xdbprk1/bdump> tnsping dbprk.xe.grid TNS Ping Utility for Linux: Version 10.2.0.4.0 - Production on 18-SEP-2009 12:39:14 Copyright (c) 1997, 2007, Oracle. All rights reserved. Used parameter files: /etc/oss/sqlnet.ora Used TNSNAMES adapter to resolve the alias Attempting to contact (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=pts0101-grid.nl.eu.abnamro.com)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=pts0138-grid.nl.eu.abnamro.com)(PORT=1521))(LOAD_BALANCE=ON))(CONNECT_DATA=(SERVICE_NAME=dbprk.xe.grid)(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)))) OK (10 msec)
So, another basic test, what is the situation:
HA Resource Target State ----------- ------ ----- oss.xdbprk.db-cold-failover ONLINE ONLINE on pts0101 oss.xdbprk.rg ONLINE ONLINE on pts0101 oracle@pts0101([crs]):/ora/dbprk/admin/xdbprk1/bdump> crs_start oss.xdbprk.db-cold-failover Attempting to start `oss.xdbprk.db-cold-failover` on member `pts0101` Start of `oss.xdbprk.db-cold-failover` on member `pts0101` succeeded.
In which instance will my session end ?
oracle@pts0101(xdbprk1):/ora/dbprk/admin/xdbprk1/bdump> sqlplus rk/rk@dbprk.xe.grid SQL*Plus: Release 10.2.0.4.0 - Production on Fri Sep 18 12:39:55 2009 Copyright (c) 1982, 2007, Oracle. All Rights Reserved. Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP, Data Mining and Real Application Testing options SQL> select instance_name from v$instance; INSTANCE_NAME ---------------- xdbprk1
And after the relocate, the session should go to the other instance:
oracle@pts0101([crs]):/ora/dbprk/admin/xdbprk1/bdump> crs_relocate -f oss.xdbprk.db-cold-failover Attempting to stop `oss.xdbprk.db-cold-failover` on member `pts0101` Stop of `oss.xdbprk.db-cold-failover` on member `pts0101` succeeded. Attempting to stop `oss.xdbprk.rg` on member `pts0101` Stop of `oss.xdbprk.rg` on member `pts0101` succeeded. Attempting to start `oss.xdbprk.rg` on member `pts0138` Start of `oss.xdbprk.rg` on member `pts0138` succeeded. Attempting to start `oss.xdbprk.db-cold-failover` on member `pts0138` Start of `oss.xdbprk.db-cold-failover` on member `pts0138` succeeded. oracle@pts0101([crs]):/ora/dbprk/admin/xdbprk1/bdump> crsstat HA Resource Target State ----------- ------ ----- oss.xdbprk.db-cold-failover ONLINE ONLINE on pts0138 oss.xdbprk.rg ONLINE ONLINE on pts0138
oracle@pts0101(xdbprk1):/ora/dbprk/admin/xdbprk1/bdump> sqlplus rk/rk@dbprk.xe.grid SQL*Plus: Release 10.2.0.4.0 - Production on Fri Sep 18 12:41:22 2009 Copyright (c) 1982, 2007, Oracle. All Rights Reserved. Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP, Data Mining and Real Application Testing options SQL> select instance_name from v$instance; INSTANCE_NAME ---------------- xdbprk2
As an extra test, to make sure the two instances cannot be started concurrently, started with a test to start the second instance after starting the first.
As you can see this is not possible.
oracle@pts0101(*):/var/opt/oracle> db xdbprk1 ORACLE_SID=xdbprk1 ORACLE_HOME=/ora/product/10.2.0/db_2 oracle@pts0101(xdbprk1):/var/opt/oracle> sqlplus / as sysdba SQL*Plus: Release 10.2.0.4.0 - Production on Fri Sep 18 12:42:02 2009 Copyright (c) 1982, 2007, Oracle. All Rights Reserved. Connected to an idle instance. SQL> startup; ORACLE instance started. Total System Global Area 536870912 bytes Fixed Size 2085360 bytes Variable Size 331353616 bytes Database Buffers 197132288 bytes Redo Buffers 6299648 bytes Database mounted. ORA-01092: ORACLE instance terminated. Disconnection forced
Rene Kundersma
Oracle Technology Services, The Netherlands
Comments (4)
You say "you need to take out OCR entries for instances/services"? Why is that - so that you can mimic the RAC instance as a single DB with cold failover using the rg1 resources as mentioned in the pdf? If I did not remove the entries and still have single instance RAC - what steps should I take to move the Instance from one node to another? My DB version is 11.1.0.6.
Thanks.
Posted by Pattar | September 24, 2009 6:51 AM
Posted on September 24, 2009 06:51
If you keep your existing OCR entries the database, instance and services will be managed the conventional way. What you need to do here is make sure only the script is used. Therefor you need to create a new entry. Resource groups are not really required b.t.w.
If you only want to move the RAC instance from one node to the other, you need to make sure you have init.ora, password-file, diag_dest etc. in place.
By the way, I was thinking "Rac one node" maybe what you are looking for:
http://www.oracle.com/technology/products/database/clustering/pdf/ds-oracleraconenode-2009.pdf
Rene
Posted by Rene | September 24, 2009 12:33 PM
Posted on September 24, 2009 12:33
I was looking into Omotion but that looks like in 11gR2 only. Will have to do further reading on it... But, your way will work with DB stopped/started (cold down) which might also work in my case...
Posted by Pattar | October 6, 2009 11:53 AM
Posted on October 6, 2009 11:53
Oh! nevermind - I need to look into resource manager instance caging in my case.
Thanks.
Posted by Pattar | October 6, 2009 11:57 AM
Posted on October 6, 2009 11:57