The CIO asked: How long has my production database been down?


And I had no answer for him. I couldn't blame him. CIOs want to know this kind of information. Better still, he would have liked a pie chart depicting it like this:

downtimepiechart.png:

I wish..

Well, for once, it would have been nice if Oracle 9i or 10g kept the historical startup or shutdown information in the v$ or some dba_* tables. A simple query off the table would have got me the answer.

Anyways, it set me thinking. There are always other ways to get the same information. Statspack or AWR was one possibility, but I was not sure if it really gathers the details information about when the instance was shutdown or started up (historically) -- they sure are aware if there was an instance restart between two snaps.

An Aha moment..

But wait, the database alert log has information about each startup and shutdown! So if we could mine the alert log for the right regular expressions, and then find the time difference between time stamps, it could be done.

This method would not give you the overall downtime for the production instance, including the downtime for the middle tiers or Apache web server, but the same idea could probably be extended for the other services, but in this article, the scope is just the database. There is an auxiliary script (get_epoch.sh) supplied here that would be useful in this quest.

Auxiliary script: get_epoch.sh

Also available for download here.

#!/usr/bin/perl
#
# The format of input date is:  Thu Jun  5 21:15:48 2008
#
# NOTE: The format of `date` command is:  Thu Jun  5 21:15:48 EDT 2008
# -- it has the timezone in the output as well

# BUT this script does not assume that since the timestamps in
# alert log dont have the timezone in it

#
# This script heavily uses this function to convert a UTC timestamp
# into seconds after 1 jan 1970:

# timelocal($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst);

use Time::Local;

my $wday = $ARGV[0];

my $month = $ARGV[1];
# convert the month shortname into 0-11 number
if ( $month eq "Jan" ) { $mon = 0 }
elsif ( $month eq "Feb" ) { $mon = 1 }
elsif ( $month eq "Mar" ) { $mon = 2 }
elsif ( $month eq "Apr" ) { $mon = 3 }
elsif ( $month eq "May" ) { $mon = 4 }
elsif ( $month eq "Jun" ) { $mon = 5 }
elsif ( $month eq "Jul" ) { $mon = 6 }
elsif ( $month eq "Aug" ) { $mon = 7 }
elsif ( $month eq "Sep" ) { $mon = 8 }
elsif ( $month eq "Oct" ) { $mon = 9 }
elsif ( $month eq "Nov" ) { $mon = 10 }
elsif ( $month eq "Dec" ) { $mon = 11 };

my $mday = $ARGV[2];

# initialize time varialble and split hours (24 hr format), minutes, seconds into an array
my $time = $ARGV[3];
@time = split /:/, $time;

# if the timezone is left out of the input, the position of year becomes 5th in ARGV
my $year = $ARGV[4];

#######################################################################################
# I found that by excluding $wday, the seconds results (EPOCH) is more
# accurate, so $wday
parameter has been omitted from the call to
# timelocal() function.

#######################################################################################
$epoch= timelocal($time[2], $time[1], $time[0], $mday, $mon, $year);
print "$epoch\n";

The main script..

Due to formatting issues, the main script is available for download here.

Sample Usage and output..

I realized that it would probably make more sense to have an optional cutoff date to calculate the downtime from, so that was added to the version 2 of the script. The version 1 which calculates the downtime from the first database startup time is uploaded here.

sandbox:sandbox> ./calculate_downtime.sh                             
Correct Usage: calculate_downtime.sh alertlogfilepath [cutoff_date in format Sat Jun 7 08:49:34 2008]

sandbox:sandbox> ./calculate_downtime.sh $DATA_DIR/admin/bdump/alert*.log Fri Mar 28 15:20:59 2008

Cutoff date is : Fri Mar 28 15:20:59 2008

Shutdown times:

Timestamp              --  epoch (seconds)

Wed Jan 9 17:53:08 2008 - 1199919188
Wed Jan 16 12:05:09 2008 - 1200503109
Fri Jan 18 11:19:42 2008 - 1200673182
Thu Jan 24 17:34:15 2008 - 1201214055
Fri Feb 15 09:00:44 2008 - 1203084044
Wed Feb 20 16:50:14 2008 - 1203544214
Wed Mar 12 12:43:26 2008 - 1205340206
Fri Mar 28 15:21:59 2008 - 1206732119
Thu Apr 3 11:03:52 2008 - 1207235032
Thu Apr 3 11:10:20 2008 - 1207235420
Thu Apr 3 11:15:44 2008 - 1207235744
Thu Apr 3 11:22:38 2008 - 1207236158
Thu Apr 3 11:27:36 2008 - 1207236456
Thu Apr 3 11:34:35 2008 - 1207236875
Thu Apr 3 11:41:36 2008 - 1207237296
Mon May 12 14:17:13 2008 - 1210616233
Thu Jun 5 10:36:58 2008 - 1212676618
shutdown_counter=17

Startup times:

Timestamp              --  epoch (seconds)

Wed Jan 9 17:50:42 2008 -- 1199919042
Thu Jan 10 09:43:18 2008 -- 1199976198
Thu Jan 17 12:00:03 2008 -- 1200589203
Fri Jan 18 11:26:13 2008 -- 1200673573
Wed Jan 30 12:19:21 2008 -- 1201713561
Tue Feb 19 22:57:38 2008 -- 1203479858
Wed Mar 12 12:39:03 2008 -- 1205339943
Mon Mar 24 13:44:20 2008 -- 1206380660
Thu Apr 3 11:00:33 2008 -- 1207234833
Thu Apr 3 11:07:12 2008 -- 1207235232
Thu Apr 3 11:14:01 2008 -- 1207235641
Thu Apr 3 11:20:54 2008 -- 1207236054
Thu Apr 3 11:25:25 2008 -- 1207236325
Thu Apr 3 11:31:53 2008 -- 1207236713
Thu Apr 3 11:40:18 2008 -- 1207237218
Tue Apr 29 16:50:49 2008 -- 1209502249
Mon Jun 2 14:20:38 2008 -- 1212430838
Thu Jun 5 10:38:39 2008 -- 1212676719
startup_counter=18
 As per the alert log, The instance is currently up


Here are the downtime windows ...

Wed Jan 9 17:50:42 2008 -- Wed Jan 9 17:53:08 2008
Thu Jan 10 09:43:18 2008 -- Wed Jan 16 12:05:09 2008
Thu Jan 17 12:00:03 2008 -- Fri Jan 18 11:19:42 2008
Fri Jan 18 11:26:13 2008 -- Thu Jan 24 17:34:15 2008
Wed Jan 30 12:19:21 2008 -- Fri Feb 15 09:00:44 2008
Tue Feb 19 22:57:38 2008 -- Wed Feb 20 16:50:14 2008
Wed Mar 12 12:39:03 2008 -- Wed Mar 12 12:43:26 2008
Mon Mar 24 13:44:20 2008 -- Fri Mar 28 15:21:59 2008
Thu Apr 3 11:00:33 2008 -- Thu Apr 3 11:03:52 2008
Thu Apr 3 11:07:12 2008 -- Thu Apr 3 11:10:20 2008
Thu Apr 3 11:14:01 2008 -- Thu Apr 3 11:15:44 2008
Thu Apr 3 11:20:54 2008 -- Thu Apr 3 11:22:38 2008
Thu Apr 3 11:25:25 2008 -- Thu Apr 3 11:27:36 2008
Thu Apr 3 11:31:53 2008 -- Thu Apr 3 11:34:35 2008
Thu Apr 3 11:40:18 2008 -- Thu Apr 3 11:41:36 2008
Tue Apr 29 16:50:49 2008 -- Mon May 12 14:17:13 2008
Mon Jun 2 14:20:38 2008 -- Thu Jun 5 10:36:58 2008
Thu Jun 5 10:38:39 2008 --


Downtime 1 : Wed Jan  9 17:53:08 2008 (1199919188)            --> Thu Jan 10 09:43:18 2008 (1199976198) = 57010 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Wed Jan  9 17:53:08 2008 - so not accruing
Running Cumulative downtime = 0 seconds

Downtime 2 : Wed Jan 16 12:05:09 2008 (1200503109)            --> Thu Jan 17 12:00:03 2008 (1200589203) = 86094 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Wed Jan 16 12:05:09 2008 - so not accruing
Running Cumulative downtime = 0 seconds

Downtime 3 : Fri Jan 18 11:19:42 2008 (1200673182)            --> Fri Jan 18 11:26:13 2008 (1200673573) = 391 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Fri Jan 18 11:19:42 2008 - so not accruing
Running Cumulative downtime = 0 seconds

Downtime 4 : Thu Jan 24 17:34:15 2008 (1201214055)            --> Wed Jan 30 12:19:21 2008 (1201713561) = 499506 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Thu Jan 24 17:34:15 2008 - so not accruing
Running Cumulative downtime = 0 seconds

Downtime 5 : Fri Feb 15 09:00:44 2008 (1203084044)            --> Tue Feb 19 22:57:38 2008 (1203479858) = 395814 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Fri Feb 15 09:00:44 2008 - so not accruing
Running Cumulative downtime = 0 seconds

Downtime 6 : Wed Feb 20 16:50:14 2008 (1203544214)            --> Wed Mar 12 12:39:03 2008 (1205339943) = 1795729 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Wed Feb 20 16:50:14 2008 - so not accruing
Running Cumulative downtime = 0 seconds

Downtime 7 : Wed Mar 12 12:43:26 2008 (1205340206)            --> Mon Mar 24 13:44:20 2008 (1206380660) = 1040454 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is < than shutdown time Wed Mar 12 12:43:26 2008 - so not accruing
Running Cumulative downtime = 0 seconds

Downtime 8 : Fri Mar 28 15:21:59 2008 (1206732119)            --> Thu Apr  3 11:00:33 2008 (1207234833) = 502714 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Fri Mar 28 15:21:59 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 502714 seconds

Downtime 9 : Thu Apr  3 11:03:52 2008 (1207235032)            --> Thu Apr  3 11:07:12 2008 (1207235232) = 200 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr  3 11:03:52 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 502914 seconds

Downtime 10 : Thu Apr  3 11:10:20 2008 (1207235420)            --> Thu Apr  3 11:14:01 2008 (1207235641) = 221 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr  3 11:10:20 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 503135 seconds

Downtime 11 : Thu Apr  3 11:15:44 2008 (1207235744)            --> Thu Apr  3 11:20:54 2008 (1207236054) = 310 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr  3 11:15:44 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 503445 seconds

Downtime 12 : Thu Apr  3 11:22:38 2008 (1207236158)            --> Thu Apr  3 11:25:25 2008 (1207236325) = 167 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr  3 11:22:38 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 503612 seconds

Downtime 13 : Thu Apr  3 11:27:36 2008 (1207236456)            --> Thu Apr  3 11:31:53 2008 (1207236713) = 257 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr  3 11:27:36 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 503869 seconds

Downtime 14 : Thu Apr  3 11:34:35 2008 (1207236875)            --> Thu Apr  3 11:40:18 2008 (1207237218) = 343 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr  3 11:34:35 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 504212 seconds

Downtime 15 : Thu Apr  3 11:41:36 2008 (1207237296)            --> Tue Apr 29 16:50:49 2008 (1209502249) = 2264953 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Apr  3 11:41:36 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 2769165 seconds

Downtime 16 : Mon May 12 14:17:13 2008 (1210616233)            --> Mon Jun  2 14:20:38 2008 (1212430838) = 1814605 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Mon May 12 14:17:13 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 4583770 seconds

Downtime 17 : Thu Jun  5 10:36:58 2008 (1212676618)            --> Thu Jun  5 10:38:39 2008 (1212676719) = 101 seconds
the cutoff date Fri Mar 28 15:20:59 2008 is > shutdown time Thu Jun  5 10:36:58 2008 - greater than cutoff, so accruing
Running Cumulative downtime = 4583871 seconds


Calculating lifetime of instance as per criteria specified..

Starting time being used = Fri Mar 28 15:20:59 2008 -- 1206732059
Ending time epoch = Sat Jun 7 11:49:49 2008 -- 1212853789
Total lifetime in seconds = 6121730

   Beginning Fri Mar 28 15:20:59 2008, The instance was down 74 % of the time


Application of this data..

Now, this data could be put in some kind of dashboard for CIO meetings. This would give them an approximate idea of how long do their databases and (the dependent middle tier or admin tier services) remain down due to maintenance. Sure, this method cannot distinguish between unplanned and planned maintenance, but its probably a good start.


Comments:

And why not using grid control? You can even manage SLAs there.

Posted by Fernando Jose on June 07, 2008 at 02:40 AM EDT #

Grid Control 10g has a seeded report "Availability History (Target)" which shows the availability summary, chart and details for the target.

Posted by Binoy Chakraborty on June 11, 2008 at 03:51 AM EDT #

Good start Gaurav. Based on previous comments add Grid Monitoring for Applications (11i/R12) and show downtime for Apache, Forms or even CM (apart from DB) Atul http://onlineappsdba.com

Posted by Atul on November 23, 2008 at 11:40 PM EST #

Post a Comment:
  • HTML Syntax: NOT allowed
About

bocadmin_ww

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today