After Production Upgrade Thoughts, or Think Twice Before Dropping a Big Table

In a previous post, "Production Upgrade from 8.1.7.4 to 9.2.0.8", I described the preparation steps for a critical production database upgrade. I was responsible for the RDBMS upgrade, a small part of a larger project that included several OS and application layers.

In short, the purpose of this post is to summarize some lessons learned during the upgrade. They are:

When executing a critical process that must finish within a restricted time window:

  1. Build a detailed, step-by-step checklist for every action related to the process.
  2. Prepare ready-made scripts for each step that can be scripted.
  3. Rehearse and refine the checklists on a test system until they run smoothly.
  4. Hold at least 2 general rehearsals on a full-scale environment.
  5. In production, follow your checklists carefully.
  6. If something arises that can wait until after the procedure finishes, put it on a waiting list.
  7. If something unexpected arises, before taking any action:

    a) Gather the relevant people and review the facts,
    b) Review the options,
    c) Only then decide on a course of action.

  8. Once the process finishes, gather the relevant people and perform an after-action review to summarize the lessons learned.

Now for the anecdotal part of the post:

I started the database upgrade at 15:30 and finished at 17:00, less than 50% of the time it took during the general rehearsal; we were running on a much more powerful machine, and the recompilation, done with parallel degree 35, finished really fast. Being 3 hours ahead of the planned time was nice.

Then the problems started.

A colleague was upgrading Statspack and got some errors, so he decided to remove the PERFSTAT tables. One of them was a dictionary-managed table that had several hundred thousand extents.
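
With hindsight, a quick look at the extent counts before dropping anything would have raised a red flag. Something like this (a minimal sketch; the 10,000-extent threshold is an arbitrary choice of mine) lists the segments that will be expensive to drop:

select owner, segment_name, segment_type, extents
from dba_segments
where owner = 'PERFSTAT'
and extents > 10000
order by extents desc;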

We were ready to turn the database back into archivelog mode to proceed with application checks, but the drop was still running, so we went to have dinner in the meantime.

Archivelog mode was a precondition for the next stage, and we needed to shut down the database to switch it to archivelog mode. But the drop was still running...

My colleague killed his session, expecting that this would let us proceed with the other steps, but that was not the case.

When a session that has been running for more than an hour dropping a dictionary-managed object with more than 200,000 extents is killed, smon takes over the task of moving used extents to free extents. The only correct action in this case is to wait and let smon finish its work.

Shutdown immediate will not work because Oracle waits for smon to finish. The original drop statement converted the segment to a temporary segment; when the drop is interrupted, that temporary segment is cleaned up by smon. Shutdown abort will work and the database will come up, but you will not be able to set archivelog mode until smon finishes its cleanup.
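
While smon is working, the half-dropped segment remains visible as a temporary segment. A query like this (an assumption on my part, not something we ran that night) shows it and lets you watch its extent count go down:

select segment_name, tablespace_name, extents
from dba_segments
where segment_type = 'TEMPORARY';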

Shutdown abort in this case was a bad idea: on startup it forced smon to re-scan all the extents it had already processed, a slow and painful process... A trace of smon showed it scanning uet$ row by row, reviewing all the work that had already been completed:

PARSING IN CURSOR #2 len=94 dep=1 uid=0 oct=3 lid=0 tim=169417024926 hv=1948998214 ad='99fffd90'
select file#,block#,length from uet$ where ts#=:1 and segfile#=:2 and segblock#=:3 and ext#=:4
END OF STMT
...
...
...
   value=203542
...
...
=====================
PARSING IN CURSOR #2 len=94 dep=1 uid=0 oct=3 lid=0 tim=169417086781 hv=1948998214 ad='99fffd90'
select file#,block#,length from uet$ where ts#=:1 and segfile#=:2 and segblock#=:3 and ext#=:4
END OF STMT
...
...
...
   value=203541
...

And so on. It took 3 hours just to get back to the point where it could resume the cleanup; then the trace started to show work on the free extents table:

select length from fet$ where file#=:1 and block#=:2 and ts#=:3

And after a while it began to move extents into fet$:

insert into fet$ (file#,block#,ts#,length) values (:1,:2,:3,:4)

At this point we were able to estimate how long the whole cleanup would take. I set up this script to show how things were going; 550 was the average number of extents cleaned per minute:

#!/usr/bin/ksh
# Report smon cleanup progress once a minute: rows left in uet$ (used
# extents), rows in fet$ (free extents), and the estimated minutes left,
# based on the observed rate of ~550 extents cleaned per minute.
export ORACLE_HOME=/u01/app01/oracle/product/9i
export ORACLE_SID=proddb
vfree=sys.fet$
vused=sys.uet$
while true
do
sqlplus -s / <<eof
set echo off feed off
col "Expected Time" for 9999.99
select  sysdate,
        (select count(block#) from $vused) Used,
        (select count(block#) from $vfree) Free,
        (select count(block#) from $vused)/550 "Expected Time"
from dual
/
exit
eof
sleep 60
done

After 5 long hours the cleanup finished; we were able to restart the database in archivelog mode and start the application migration and checks.

Worth knowing and remembering:

a) If you kill a session that was dropping an object with many extents and then shut down the database, the time smon needs just to get back to the point where it can resume cleaning up extents is roughly three times the time the drop had been running. Every time you restart the database, smon needs to skip all the extents on fet$ that it has already visited before it can resume its work.

b) If you must drop an object that has many extents, use the procedure described in Note 68836.1, "How To Efficiently Drop A Table With Many Extents".
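
From memory, the gradual approach in that note amounts to truncating the table while keeping its storage, then releasing the extents in steps so that no single operation has to clean up hundreds of thousands of extents at once. A sketch only, with a hypothetical table name; check the note itself for the exact procedure:

-- keep all the allocated extents, just empty the table
truncate table big_table reuse storage;
-- release the extents in decreasing steps instead of all at once
alter table big_table deallocate unused keep 1000m;
alter table big_table deallocate unused keep 500m;
alter table big_table deallocate unused keep 0;
-- the final drop now has very little cleanup left to do
drop table big_table;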

There were several other highlights during this 36-hour upgrade, which finished with great success and on time. That is another rule: always add at least 40% spare time for each task :-)



Comments:

Since this was a Perfstat table, you could have gone in for a drop of the PERFSTAT user itself (DROP USER PERFSTAT CASCADE) and recreated it. Of course this would lead to the loss of all Statspack data, but it may be a viable option in case of time constraints.

Posted by Nitin on July 19, 2007 at 07:27 AM IDT #

Nitin, unless I am mistaken, dropping the user with the cascade option leads to the same object/segment/extent cleanup, and so to the same issue. Regards, Eric

Posted by Eric Grancher on July 26, 2007 at 04:57 AM IDT #

That's right, Eric. Unfortunately, when there is a dictionary-managed object with a large number of extents, there is no way to avoid cleaning them up, except by planning a gradual cleanup as explained in Note 68836.1.

Posted by Alejandro Vargas on July 26, 2007 at 05:52 AM IDT #

You should also be able to convert the underlying tablespace to LMT, which is a surprisingly quick process. exec sys.dbms_space_admin.tablespace_migrate_to_local('STATSPACK_DATA'); Another reason to have only catalog objects in SYSTEM tablespace ;)

Posted by Jeff Wong on May 05, 2009 at 03:29 PM IDT #
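
For reference, Jeff's suggestion would look like this (a sketch; the tablespace name comes from his comment, and the verification query is my own addition). After the migration, dba_tablespaces should report LOCAL for extent management:

exec sys.dbms_space_admin.tablespace_migrate_to_local('STATSPACK_DATA');

select tablespace_name, extent_management
from dba_tablespaces
where tablespace_name = 'STATSPACK_DATA';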
