« April 2008 | Main | June 2008 »

May 2008 Archives

May 1, 2008

Considerations for virtual IP setup before doing the 10gR2 CRS install

Preface


Many a times, simple things can bite you when you are planning something really important, like a jumping the gun for virtual IP installing 10gR2 CRS.

The CRS installation assumes that the virtual IP hostname and IP information should be either in the DNS or in local /etc/hosts file ONLY.  It does not assume that the virtual IP is already tied up to a network interface on the RAC nodes.

A cryptic error...

If the virtual IP is already attached to network interface during the CRS install, it throws a cryptc error, saying " the virtual interface seems to be already assigned to another system in the network ", which can be pretty misleading. We faced this situation recently during one of our upgrades:

10gRAC error1.jpg: ,

The problem...

The problem was that the racprod1-vip virtual hostname was already associated with a network interface alias lan1:801:

10gRAC error2.jpg:

Please note: The hostnames, domain and IPs mentioned in this diagram are completely fictional and have been fudged from their original values. The current representation is enough for the sake of illustration.

The solution...

Needless to say, the solution was to dis-associate the racprod1-vip with lan1:801, after which the 10gR2 CRS installation proceeded fine.

Conclusion..

Many people do not appreciate the fineties involved in CRS installation, especially around virtual IP. The actual association of virtual IP hostname and network interface is done by the $ORA_CRS_HOME/bin/vipca script, which is automatically run as part of $ORA_CRS_HOME/root.sh

May 19, 2008

10gR2 CRS case study: CRS would not start after reboot - stuck at /etc/init.d/init.cssd startcheck

Preface

I had recently done a 10gR2 CRS installation on SuSE linux 9.3 (2.6.5.7-244 kernel) and noticed that after a reboot of the RAC nodes, the CRS would not come up!

The CSS daemon was stuck at the /etc/init.d/init.cssd startcheck command:

raclinux1:/tmp # ps -ef | grep css
root      6929     1  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root      6960  6928  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      6963  6929  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck
root      7064  6935  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd startcheck

Debugging..

To debug this more, I went to the $ORA_CRS_HOME/log/<nodename>/client and checked the latest files there:

raclinux1:/opt/oracle/product/10.2.0/crs/log/raclinux1/client # ls -ltr
total 435
-rw-r-----  1 root   root 2561 May 18 23:20 ocrconfig_8870.log
-rw-r--r--  1 root   root  195 May 18 23:22 clscfg_8924.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15307_3.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15319_3.log
-rw-r-----  1 root   root  172 May 18 23:29 ocr_15447_3.log
...
...
...
drwxr-x---  2 oracle dba  3472 May 19 08:10 .
drwxr-xr-t  8 root   dba   232 May 19 13:50 ..
-rw-r--r--  1 root   root 2946 May 19 14:11 clsc.log
-rw-r--r--  1 root   root 7702 May 19 14:11 css.log

I did a more of the clsc.log & css.log and saw the following errors:

$ more clsc.log
...
...
...
2008-05-19 14:11:29.912: [ COMMCRS][1094672672]clsc_connect: (0x81c74b8) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))

2008-05-19 14:11:31.582: [ COMMCRS][1094672672]clsc_connect: (0x817e3f0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

2008-05-19 14:11:31.583: [ default][1094672672]Terminating clsd session

$ more css.log
...
...
...
2008-05-19 02:42:48.307: [  OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/oc
r2, errno=19, os err string=No such device
2008-05-19 02:42:48.308: [  OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:42:48.308: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:42:48.308: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.

2008-05-19 02:43:41.982: [  OCROSD][1094672672]utopen:7:failed to open OCR file/disk /var/opt/oracle/ocr1 /var/opt/oracle/oc
r2, errno=19, os err string=No such device
2008-05-19 02:43:41.983: [  OCRRAW][1094672672]proprinit: Could not open raw device
2008-05-19 02:43:41.983: [ default][1094672672]a_init:7!: Backend init unsuccessful : [26]
2008-05-19 02:43:41.983: [ CSSCLNT][1094672672]clsssinit: Unable to access OCR device in OCR init.

2008-05-19 02:46:40.204: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9

2008-05-19 14:11:28.217: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9

2008-05-19 14:11:37.186: [ CSSCLNT][1094672672]clsssInitNative: connect failed, rc 9

So it was pointing towards the OCR being not available, as could be verified by the /tmp/crsctl.<PID> files too:

raclinux1:/tmp # ls -ltr crsctl*
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6826
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6679
-rw-r--r--  1 oracle dba 148 May 19 02:44 crsctl.6673
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7784
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7890
-rw-r--r--  1 oracle dba 148 May 19 02:49 crsctl.7794
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.7034
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.6886
-rw-r--r--  1 oracle dba 148 May 19 13:55 crsctl.6883
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.6960
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.7064
-rw-r--r--  1 oracle dba 148 May 19 14:18 crsctl.6963

raclinux1:/tmp # more crsctl.6963
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [Permission denied] [13]

Permission issue!

Duh! So it was a permission issue on the OCR disk (at this moment), which could expand into a permissions issue for Voting and asm disks later:

raclinux1:/tmp # ls -ltr /dev/raw/raw*
crw-rw-r--  1 root disk 162,  9 Nov 18  2005 /dev/raw/raw9
crw-rw-r--  1 root disk 162,  8 Nov 18  2005 /dev/raw/raw8
crw-rw-r--  1 root disk 162,  7 Nov 18  2005 /dev/raw/raw7
crw-rw-r--  1 root disk 162,  6 Nov 18  2005 /dev/raw/raw6
crw-rw-r--  1 root disk 162,  5 Nov 18  2005 /dev/raw/raw5
crw-rw-r--  1 root disk 162,  4 Nov 18  2005 /dev/raw/raw4
crw-rw-r--  1 root disk 162,  3 Nov 18  2005 /dev/raw/raw3
crw-rw-r--  1 root disk 162,  2 Nov 18  2005 /dev/raw/raw2
crw-rw-r--  1 root disk 162, 15 Nov 18  2005 /dev/raw/raw15
crw-rw-r--  1 root disk 162, 14 Nov 18  2005 /dev/raw/raw14
crw-rw-r--  1 root disk 162, 13 Nov 18  2005 /dev/raw/raw13
crw-rw-r--  1 root disk 162, 12 Nov 18  2005 /dev/raw/raw12
crw-rw-r--  1 root disk 162, 11 Nov 18  2005 /dev/raw/raw11
crw-rw-r--  1 root disk 162, 10 Nov 18  2005 /dev/raw/raw10
crw-rw-r--  1 root disk 162,  1 Nov 18  2005 /dev/raw/raw1

I enabled read and write permission for the raw devices using the # chmod +rw /dev/raw/raw* devices. but even after that the latest /tmp/crsctl.<PID> files being generated were showing this message:

raclinux1:/tmp # more crsctl.6960
Failure -2 opening file handle for (vote1)
Failure 1 checking the CSS voting disk 'vote1'.
Failure -2 opening file handle for (vote2)
Failure 1 checking the CSS voting disk 'vote2'.
Failure -2 opening file handle for (vote3)
Failure 1 checking the CSS voting disk 'vote3'.
Not able to read adequate number of voting disks

At this point, I just chowned /dev/raw/raw* to oracle:dba like this:

raclinux1:/tmp # chown oracle:dba /dev/raw/raw*

After 1-2 mins, the CSS came up:

raclinux1:/tmp # ps -ef | grep css
root      6929     1  0 13:56 ?        00:00:00 /bin/sh /etc/init.d/init.cssd fatal
root     10900  6929  0 14:39 ?        00:00:00 /bin/sh /etc/init.d/init.cssd daemon
oracle   10980 10900  0 14:40 ?        00:00:00 /bin/su -l oracle -c /bin/sh -c 'ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd;  /opt/oracle/product/10.2.0/crs/bin/ocssd  || exit $?'
oracle   10981 10980  0 14:40 ?        00:00:00 /bin/sh -c ulimit -c unlimited; cd /opt/oracle/product/10.2.0/crs/log/raclinux1/cssd;  /opt/oracle/product/10.2.0/crs/bin/ocssd  || exit $?
oracle   11007 10981  2 14:40 ?        00:00:00 /opt/oracle/product/10.2.0/crs/bin/ocssd.bin
root     12013  7414  0 14:40 pts/2    00:00:00 grep css
raclinux1:/tmp #

The CRS components came up fine automatically:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # ./crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

The ASM and RAC instances also came up fine:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # ps -ef |grep smon
oracle   12257     1  0 14:41 ?        00:00:00 asm_smon_+ASM1
oracle   13100     1  0 14:41 ?        00:00:02 ora_smon_o10g1
root     32282  7414  0 14:55 pts/2    00:00:00 grep smon

For the long term..

To make this change permanent, I put it in /etc/init.d/boot.local file, along with the modprobe hangcheck-timer  command:

raclinux1:/opt/oracle/product/10.2.0/crs/bin # more /etc/init.d/boot.local
#! /bin/sh
#
# Copyright (c) 2002 SuSE Linux AG Nuernberg, Germany.  All rights reserved.
#
# Author: Werner Fink <werner@suse.de>, 1996
#         Burchard Steinbild, 1996
#
# /etc/init.d/boot.local
#
# script with local commands to be executed from init on system startup
#
# Here you should add things, that should happen directly after booting
# before we're going to the first run level.
#
chown oracle:dba /dev/raw/raw*
modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180

Conclusion

If simple things are permissions are not correct on the OCR devices, it can hold down the CRS daemons and the ASM/DB instances. It may be needed to put workarounds in /etc/init.d/boot.local for getting around the situation.


May 23, 2008

puzzling RMAN: No channel to restore a backup or copy of datafile?

An RMAN puzzle

  
In this article, we will look at a puzzling situation that happened at one of our environments, when we were carrying out an RMAN restoration. The RMAN backup was taken using a catalog (not using controlfile) and our mission was to restore PROD to UAT. Simple, right?

Wrong!

An Arrrgh! Moment..

We were using the following restoration script:

RMAN>
RMAN> run
2> {
3> set until time  "to_date('05/13/08 14:40:00','MM/DD/YY hh24:mi:ss')";
4> allocate auxiliary channel 'adisk_0' type DISK;
5> allocate auxiliary channel 'adisk_1' type DISK;
6> allocate auxiliary channel 'adisk_2' type DISK;
7> allocate auxiliary channel 'adisk_3' type DISK;
8> DUPLICATE TARGET DATABASE TO UAT;
9> }


The backup location was available and so were adequate backup channels, but we always ended up getting this laconic error:

...
...

Starting restore at 13-MAY-08

released channel: adisk_0
released channel: adisk_1
released channel: adisk_2
released channel: adisk_3
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 05/13/2008 15:25:38
RMAN-03015: error occurred in stored script Memory Script
RMAN-06026: some targets not found - aborting restore
RMAN-06100: no channel to restore a backup or copy of datafile 471
RMAN-06100: no channel to restore a backup or copy of datafile 466

RMAN>
RMAN> **end-of-file**

RMAN>

Recovery Manager complete.


No channel to restore?...

Huh? Well, as you can see, the channels were very well defined and even the rman log showed that the channels were being allocated. We even tried increasing the channels to 8 or 10, and also tried allocating auxiliary channels (was irrelevant), but to no avail.

Another option that was tried was plain and simple restoration using the RMAN> restore datafile 1; command. Although that failed since the paths of datafile 1 was not switched to UAT's convention, it did not prove anything.

My teammates, Dhandapani Perungulam D.PNG:   and Sandeep Rebba sandeep.PNG: were trying out different options. Then someone had the bright idea of checking the timestamp of the completed backup. It was possible that we were trying to restore to a point in time before the backup had completed.

Verifying the rman backup completion time from catalog

First of all, we needed to set the NLS_DATE_FORMAT variable to include the HH24:MI format:

$  export NLS_DATE_FORMAT='dd-mon-rr hh24:mi'

Then we needed to connect to the target (source instance for replication in RMAN terminology) and RMAN catalog:

$  rman catalog rman@rman target sys@prd

Recovery Manager: Release 10.2.0.3.0 - Production on Fri May 23 14:34:02 2008

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

rman database Password:

target database Password:

connected to target database: PRD (DBID=2530951715)
connected to recovery catalog database

RMAN>

This command will give us details of the backupsets and the time when the backup completed:

RMAN> list backup of database completed between '12-may-2008' and '14-may-2008;

...
...
...
BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
2794098 Incr 0  53.28G     SBT_TAPE    00:38:54     13-may-08 23:11
        BP Key: 2794121   Status: AVAILABLE  Compressed: NO  Tag: TAG20080513T203101
        Handle: racprod1_PRD<PRD_116563:654647578:1>.dbf   Media:
  List of Datafiles in backup set 2794098
  File LV Type Ckp SCN    Ckp Time        Name
  ---- -- ---- ---------- --------------- ----
  7    0  Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/owad01.dbf
  295  0  Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/system06.dbf
  379  0  Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/rbs11.dbf
  459  0  Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_interface01.dbf
  460  0  Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_summary01.dbf
  481  0  Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_tx_data12.dbf
  509  0  Incr 7247000051870 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_archive02.dbf

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
2794099 Incr 0  55.62G     SBT_TAPE    00:39:54     13-may-08 23:12
        BP Key: 2794122   Status: AVAILABLE  Compressed: NO  Tag: TAG20080513T203101
        Handle: racprod1_PRD<PRD_116562:654647562:1>.dbf   Media:
  List of Datafiles in backup set 2794099
  File LV Type Ckp SCN    Ckp Time        Name
  ---- -- ---- ---------- --------------- ----
  5    0  Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/system05.dbf
  17   0  Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/rbs15.dbf
  18   0  Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_interface03.dbf
  22   0  Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_summary05.dbf
  398  0  Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/discoverer01.dbf
  480  0  Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_tx_data11.dbf
  508  0  Incr 7247000051797 13-may-08 22:32 /u01/oracle/prdracdata/apps_ts_nologging02.dbf

RMAN>

As we can see,
13-may-08 22:32 is the golden time. Anything after that in the until time statement will do.

Further progress..

So then, we tried with this script:

RMAN>
RMAN> run {
2> allocate auxiliary channel 'adisk_0' type DISK;
3> allocate auxiliary channel 'adisk_1' type DISK;
4> allocate auxiliary channel 'adisk_2' type DISK;
5> allocate auxiliary channel 'adisk_3' type DISK;
6> allocate auxiliary channel 'adisk_4' type DISK;
7>
8> set until time  "to_date('09/013/2008 22:40:00','DD/MM/YYYY HH24:MI:SS')";
9>
10> DUPLICATE TARGET DATABASE TO UAT;
11> }

Although, we got the following error for the archivelogs, we were able to open the database after manually supplying them and then doing an SQL> alter database open resetlogs; :

...
...

Starting recover at 13-MAY-08

starting media recovery

Oracle Error:
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/u01/oracle/uatracdata/system01.dbf'

released channel: adisk_0
released channel: adisk_1
released channel: adisk_2
released channel: adisk_3
released channel: adisk_4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 05/13/2008 22:47:51
RMAN-03015: error occurred in stored script Memory Script
RMAN-06053: unable to perform media recovery because of missing log
RMAN-06102: no channel to restore a backup or copy of log thread 2 seq 111965 lowscn 7246877796511
RMAN-06102: no channel to restore a backup or copy of log thread 1 seq 143643 lowscn 7246877796417

Conclusion

Some RMAN errors can be really hard to understand or debug. Especially, it seems that the no channel to restore a backup or copy seems to be a very generic error message in RMAN, which could really mean anything and can be very misleading.

For example, we can see that at the end, it gave the
RMAN-06102: no channel to restore a backup or copy of log thread 2 seq 111965 lowscn 7246877796511 error, but that meant that it could not find the archive logs. After supplying the archive logs, the database opened up fine with the resetlogs option.


May 24, 2008

Time saving tactic: How we saved 6 hrs of downtime in production 10g upgrade

Preface

So you want to upgrade your database to 10g from 9i. Well, welcome to the club.

If you have RAC, then you will definitely need to install the CRS and database binaries, along with some RDBMS patches.

When we were challenged with keeping the downtime to within 24 hrs by a customer, that set us thinking as to what time saving tactics could be employed to achieve this end.

Divide and conquer the downtime..

The old strategy of divide and conquer is really time tested and works well in most paradigms. In this article, we demonstrate how its possible to split the 10g upgrade downtime into two logical parts:

Part 1)

Install the 10g technology stack ahead of time, including the database, any patchsets (usually 10.2.0.3) and rdbms patches. Before doing this, we shutdown the 9i RAC/DB binaries. After the 10g CRS/DB technology stack installation, you shutdown the 10g CRS , bring up the 9i RAC processes and carry on as it nothing happened.


   At this stage, 9i and 10g technology stack will co-exist peacefully with each other. The production system can run on 9i RAC for a week or more till you decide to do the actual upgrade.

Part 2)

In the subsequent main upgrade outage, you can shutdown the 9i RAC/DB, bring up DB using 10g CRS/DB oracle home  and do the actual upgrade. This outage could last anywhere from 10-24 hrs or even upto 32 hrs depending on your pre-production timing practice. It is assumed that one would do at least 2-3 rounds of upgrade before gaining the confidence of doing the production round.


Adopting this split strategy saved us about 6 hrs in the main upgrade downtime window and we were able to do the upgrade in a window of 16-20 hrs. The size of the database was ~1 TB on HP-UX PA RISC 64 bit OS.

How was it done..

When you do an 9i->10g upgrade for RAC, the following happens:

1) OCR device location is automatically picked up from /var/opt/oracle/srvConfig.loc (9i  setup file)

2) The OCR device's contents is upgraded to 10g format

3) A new file called /var/opt/oracle/ocr.loc is created with the OCR device name

Since we had to preserve the 9i OCR device for running 9i RAC after the 10g CRS/DB techstack installation, we did the following:

1) Got a new set of OCR and Voting device for 10g. This was a separate set of devices in addition to the 9i OCR and voting disks. Then we brought down the 9i CRS processes.

A caveat here was that the HP service guard (vendor cluster solution) was required to be up to perform the 10g CRS installation.

2) copied the contents of 9i OCR into the 10g OCR device using the dd command:

$ whoami
oracle

$ dd if=/dev/rawdev/raw_9i_ocr.dbf  of=/dev/rawdev/raw_10g_ocr.dbf   bs=1024


3) Edited the srvConfig.loc file to point to the /dev/rawdev/raw_10g_ocr.dbf file

4) Did the 10g CRS installation and ran root.sh that upgraded the OCR device to 10g format

5) We then installed the 10.2.0.1 DB binaries and applied the 10.2.0.3 patchset, along with RDBMS patches. This was the major activity and took about 4-5 hrs.

Another option to save time here was that we could have cloned the 10.2.0.3 ORACLE_HOME from the UAT servers, but since this was production, we wanted to do everything with a clean start and not carry on any mistakes from UAT.

6) Brought down 10g CRS services with $ORA_CRS_HOME/bin/crsctl stop crs and also disabled the automatic startup of CRS with /etc/init.d/init.crs crs disable

7) Re-pointed the 9i OCR device back in /var/opt/oracle/srvConfig.loc as  /dev/rawdev/raw_9i_ocr.dbf

Then we brought up the 9i CRS services again and then brought up the Database with 9i binaries.

At this point, the 9i and 10g binaries co-existed with each other peacefully as if 10g techstack was never there.

Conclusion

There might be better ways of reducing downtime during the upgrade, but this was one of the selling points of projecting a successful 10g upgrade to the customer by reducing the downtime, especially when we were under extreme pressure to keep the main downtime window to less than 24 hrs.




May 31, 2008

So you thought you upgraded your OLAP database to 10g..


W
ell, so did we. In fact, the dba_registry view showed the version of OLAP option as 10.2.0.3. Here, see it for yourself:

  1  select comp_name , version, status from dba_registry
  2* where comp_name like  '%Analytic%'
SQL> /

COMP_NAME                  VERSION      STATUS
-------------------------- ------------ -------
OLAP Analytic Workspace    10.2.0.3.0   VALID


Seems pretty convincing, does it not? One would think so.

An unexpected error..

But, a month later, when the customer tried to do an archive and restore of a demand plan, they received the following error:

Demand Planning: Version : 11.5.0

Copyright (c) 1979, 1999, Oracle Corporation. All rights
reserved.

MSDRESTORE module: Restore demand plan

+---------------------------------------------------------------------------+

Current system time is 30-MAY-2008 08:24:19

+---------------------------------------------------------------------------+

**Starts**30-MAY-2008 08:24:19

**Ends**30-MAY-2008 08:24:21

ORA-35071: EIF file ARCH4027.eif cannot be imported because
analytic workspace ODPCODE has not been upgraded to version 10.0.0.0.

+---------------------------------------------------------------------------+

Start of log messages from FND_FILE

+---------------------------------------------------------------------------+

Error in restore process: see batch log for details

ORA-35071: EIF file ARCH4027.eif cannot be imported because
analytic workspace ODPCODE has not been upgraded to version 10.0.0.0.

+---------------------------------------------------------------------------+

End of log messages from FND_FILE

+---------------------------------------------------------------------------+

Executing request completion options...

Finished executing request completion options.

+---------------------------------------------------------------------------+

Exceptions posted by this request:

Concurrent Request for "Restore demand plan" has
completed with error.

+---------------------------------------------------------------------------+

Metalink, our friend in need..

So we turned to our friend, Metalink and came across Note 390004.1, which says that Imported the Analytic Workspace in an different database release, but the AWs are still of the old version.

Eh! So we tried to confirm the same and found that there were indeed 4 analytical workspaces that were still of 9.1 version, including the ODPCODE AW:

SQL> select aw_name , aw_version
     from all_aws
     where aw_version like '9%';

AW_NAME                        AW_V
------------------------------ ----
ZPBANNOT                       9.1
ZPBDATA                        9.1
MSD4029                        9.1
ODPCODE                        9.1


Strangely, the other AWs were all 10.2:

SQL> select aw_name , aw_version
     from all_aws
     where aw_version like '10%';

AW_NAME                        AW_V
------------------------------ ----
EXPRESS                        10.2
AWMD                           10.2
AWCREATE                       10.2
AWCREATE10G                    10.2
AWXML                          10.2
AWREPORT                       10.2
XWDEVKIT                       10.2
MMSD4027                       10.2
MSD4027                        10.2
MMMSD4027                      10.2
PMMSD4027A0                    10.2
PMMSD4027A21                   10.2
PMMSD4027A22                   10.2
PMMSD4027A26                   10.2
PMMSD4027A32                   10.2
PMMSD4027A33                   10.2
PMMSD4027A55                   10.2
PMMSD4027A57                   10.2
PMMSD4027A58                   10.2
MSD4027A58                     10.2
PMMSD4027A78                   10.2
PMMSD4027A79                   10.2
PMMSD4027A80                   10.2
PMMSD4027A86                   10.2
PMMSD4027A87                   10.2
PMMSD4027A88                   10.2
PMMSD4027A93                   10.2
PMMSD4027A95                   10.2
PMMSD4029A                     10.2
PMMSD4029A5                    10.2
PMMSD4029A3                    10.2
PMMSD4029A1                    10.2
MMMSD4029                      10.2
MMSD4029                       10.2
ZPBCODE                        10.2

35 rows selected.

SQL>

Where did our OLAP upgrade procedure go wrong..

So the question before us was that how did our production upgrade procedure, the one which we thought was so meticulously planned and executed, go wrong?

For upgrading the ODP workspaces, we had followed Note 418200.1 - ODP: Steps When Upgrading the RDBMS Version Used With ODP.
We had even
got a
clarification from the author of that note about what steps were needed
for ODP
workspaces upgrade.


Unfortunately, The following note does not have any direct reference to upgrading OLAP workspaces:



Note 362203.1
Interopratbility notes Oracle Applications Release 11i with Oracle 10g Release 2 (10.2.0)

We had found the following Notes on OLAP upgrade/exporting/clone:

472215.1 How to Export AWs or a Plan in Demand Planning
418200.1 ODP Steps When Upgrading the RDBMS Version Used With ODP
412308.1 How to Check the Version for ODP 11.5.10?
339110.1 How To Clone the OLAP Portion of ODP 11.5.10

Note 418200.1 is the most relevant for the upgrade activity, but unfortunately, even that does not have any reference to dba_aws view to check the aw_version column of each workspace and use the dbms_aw.convert API for the workspaces that could not be upgraded due to some reason.

So that is definitely one thing that should be part of that note. We would be sending a note to the author to include these steps into note
418200.1.

Another gotcha while implementing the solution (exec dbms_aw.convert API) is that it does not accept the OWNER name of the workspace in the first argument. Note 390004.1 actually sets a false expectation on this front.

Also, you need to be logged in as the owner of the workspace while issuing the dbms_aw API or you would get this error:

SQL> show user
USER is "SYS"
SQL> exec dbms_aw.convert('ODPCODE');
BEGIN dbms_aw.convert('
ODPCODE'); END;

*
ERROR at line 1:
ORA-33262: Analytic workspace
ODPCODE does not exist.
ORA-06512: at "SYS.DBMS_AW", line 58
ORA-06512: at "SYS.DBMS_AW", line 134
ORA-06512: at "SYS.DBMS_AW", line 1245
ORA-06512: at line 1


Oh, Oh.. A Gotcha while implementing the solution..


To make matters worse, by a copy-paste, I specified the owner name along with the workspace like this:

SQL> exec dbms_aw.convert('apps.odpcode');

It came back with an error like you cant specify the owner name of the table.   

After that, the name of the aw got changed to ODPCODE_ (with the _ in the end), but it upgraded to 10.2 as per dba_aws or all_aws.

SQL> select aw_name, aw_version
     from all_aws
     where aw_name like 'ODP%';


AW_NAME                        AW_V
------------------------------ ----
ODPCODE_                       10.2


Essentially, the dbms_aw.convert had left the ODPCODE workspace in a transient state and not done a good job of cleaning up after the exception.

A workaround for cleaning up...

Well, there is an API called dbms_aw.aw_rename that would do the trick of putting the name back to ODPCODE like this:

SQL> show user
USER is APPS

SQL> exec dbms_aw.aw_rename('ODPCODE_','ODPCODE');

PL/SQL procedure successfully completed.

SQL> select aw_name, aw_version from all_aws where aw_name='ODPCODE';

AW_NAME                         AW_V
------------------------------ ----
ODPCODE                        10.2

Now just to make sure that it was really 10.2, we ran the dbms_aw.convert API again and got this message, which implied that it was already in 10.2 format:

SQL> exec dbms_aw.convert('ODPCODE');
BEGIN dbms_aw.convert('ODPCODE'); END;

*
ERROR at line 1:
ORA-33298: Analytic Workspace ODPCODE is already in the newest format allowed
by the current compatibility setting
ORA-06512: at "SYS.DBMS_AW", line 1267
ORA-06512: at line 1


After this, the MSDRESTORE module: Restore demand plan concurrent job completed successfully.

Conclusion...

There were multiple learnings from this Sev 1 experience in production instance:
  • Ask your customer to test exhaustively. In this case, the customer is a very small shop with an IT department of 5-6 people and yet to graduate to the enterprise level thinking for managing their IT systems.
Any testing was good testing for them and this error was discovered when they were going to go live with another project phase implementation.

Unfortunately, in this case, we had very little say in their testing plan.
  • In this case, we were caught a little off guard because we did not have exhaustive expertise in OLAP upgrades and had to rely on published metalink notes, which did not cross link all the known upgrade issues. 
In this case, it would have paid to check the output of all_aws view too. Similarly, other Oracle Server options may have their own views.
  • While executing the dbms_aw APIs, it is best to be logged in as the owner of the analytical workspace being worked on, or you might land up in a worse soup than you signed up for.
As per our observation, even the alter session set current_schema=<OWNER> sql does not work on dbms_aw APIs.


About May 2008

This page contains all entries posted to Experiments from the Field..Based on True Stories in May 2008. They are listed from oldest to newest.

April 2008 is the previous archive.

June 2008 is the next archive.

Many more can be found on the main index page or by looking through the archives.

Creative Commons License
This weblog is licensed under a Creative Commons License.
Powered by
Movable Type and Oracle