Thursday Jan 02, 2014

High Availability

As companies become more and more dependent upon their information systems just to be able to function, the availability of those systems becomes more and more important. Outages can cost millions of dollars an hour in lost revenue, not to mention the potential damage to a company’s image. To add to the problem, a number of natural disasters have shown that even the best data center designs can’t handle tsunamis and tidal waves, causing many companies to implement or re-evaluate their disaster recovery plans and systems. Practically every customer I talk to asks about disaster recovery (DR) and how to configure their systems to maximize availability and support DR. This series of articles will contain some of the information I share with these customers.

The first thing to do is define availability and how it is measured. The definition I prefer is that availability represents the percentage of time a system is able to correctly process requests within an acceptable time period during its normal operating period. I like this definition because it allows for times when a system isn’t expected to be available, such as during evening hours or a maintenance window. That being said, more and more systems are expected to be available 24x7, especially as more and more businesses operate globally and there are no common evening hours.

Measuring availability is pretty easy. Simply put, it is the ratio of the time a system is available to the time the system should be available. I know, not rocket science. While it’s good to measure availability, it’s usually better to be able to predict the availability of a given system to determine if it will meet a company’s availability requirements. To predict availability for a system, one needs to know a few things, or at least have good guesses for them. The first is the mean time between failures, or MTBF. For single components like a disk drive, these numbers are pretty well known. For a large computer system the computation gets much more difficult. More on the MTBF of complex systems later. The next thing one needs to know is the mean time to repair, or MTTR, which is simply how long it takes to put the system back into working order.
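
To make the measurement concrete, here is a minimal sketch in Python; the uptime figures are made up purely for illustration:

```python
# Measured availability: the ratio of the time a system was actually
# available to the time it was supposed to be available.
# The numbers below are illustrative only.

scheduled_hours = 30 * 24          # a 30-day month, expected up 24x7
outage_hours = 2.5                 # total downtime observed in that month

available_hours = scheduled_hours - outage_hours
measured_availability = available_hours / scheduled_hours * 100

print(f"Measured availability: {measured_availability:.3f}%")   # ~99.653%
```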

Obviously, the higher the MTBF of a system, the higher its availability, and the lower the MTTR, the higher its availability as well. In mathematical terms, the system availability in percent is:

Availability (%) = MTBF / (MTBF + MTTR) × 100
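
Here is a quick sketch of that calculation in Python:

```python
def predicted_availability(mtbf_hours, mttr_hours):
    """Predicted availability (%) given mean time between failures
    and mean time to repair, both in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

# Example: MTBF of 1000 hours and MTTR of 1 hour
print(f"{predicted_availability(1000, 1):.4f}%")   # 99.9001%
```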

So if the MTBF is 1000 hours and the MTTR is 1 hour, the availability would be 99.9%, often called 3 nines. To give you an idea of how much downtime per year corresponds to various numbers of nines, here is a table showing the various levels, or classes, of availability:

Availability    Total Down Time per Year    Class or # of 9s    Typical application or type of system
90%             ~36 days                    1
99%             ~4 days                     2                   LANs
99.9%           ~9 hours                    3                   Commodity Servers
99.99%          ~1 hour                     4                   Clustered Systems
99.999%         ~5 minutes                  5                   Telephone Carrier Servers
99.9999%        ~1/2 minute                 6                   Telephone Switches
99.99999%       ~3 seconds                  7                   In-flight Aircraft Computers
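
To see where the downtime figures in the table come from, here is a minimal sketch that converts an availability percentage into the allowed downtime over a 24x365 year:

```python
HOURS_PER_YEAR = 24 * 365   # assumes the system must be available 24x365

def downtime_per_year_hours(availability_percent):
    """Allowed downtime in hours per year for a given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

for nines in range(1, 8):
    availability = 100 * (1 - 10 ** -nines)   # 90%, 99%, 99.9%, ...
    hours = downtime_per_year_hours(availability)
    if hours >= 24:
        print(f"{nines} nines: ~{hours / 24:.1f} days")
    elif hours >= 1:
        print(f"{nines} nines: ~{hours:.1f} hours")
    elif hours * 60 >= 1:
        print(f"{nines} nines: ~{hours * 60:.1f} minutes")
    else:
        print(f"{nines} nines: ~{hours * 3600:.1f} seconds")
```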

As you can see, the amount of allowed downtime gets very small as the class of availability goes up. Note, though, that these times assume the system must be available 24x365, which isn’t always the case.

More about high availability in my next entry. 
