Thursday Jan 23, 2014

High Availability Part 5

Tuxedo provides an extremely high availability platform for deploying the world's most critical applications.  This post covers some of the ways Tuxedo supports redundancy to improve overall application availability.  Upcoming posts will explain additional ways that Tuxedo ensures high availability, as well as ways to improve the availability of a Tuxedo application.

Wednesday Jan 15, 2014

High Availability Part 4

To Err is Human; To Survive is High Availability

In this post I’d like to look at the various causes of unavailability or outages. The most obvious, although often overlooked, is scheduled system maintenance. Whether that is included in your measurement of availability depends upon the stakeholders for a system or application. The ideal systems have no scheduled maintenance that causes the system to be unavailable. That isn’t to say they don’t receive maintenance, but that the maintenance doesn’t cause the system to be unavailable. This can be done via rolling upgrades, site switchovers, etc. For now it suffices to say that this type of down time is intentional, known, and typically scheduled.

The interesting part comes in looking at other causes of unavailability, in particular those caused by failures. The most commonly thought of failure is that of a hardware failure such as a disk drive failing, or a server failing. These failures tend to be obvious and easily remedied. Most people then guess that software failures make up the next significant portion of failures. But as is all too often the case, the most common failures in highly available systems are those caused by people. Estimates place hardware failures at around 10% of the causes of an outage. This low percentage is largely due to the ever improving MTBF of hardware. Software is estimated to cause about 20% of outages for highly available systems. The remaining 70% of outages are attributable to human action, and increasingly these actions are intentional, i.e., purposeful interruptions of service for malicious intent such as denial of service attacks.

To give an example, one study looked at replacing a failed hard drive in a software RAID configuration. It is a seemingly simple task, yet engineers replaced the wrong drive a surprising number of times during their first few attempts at the repair. This indicates that putting procedures in place to repair a system isn’t adequate; actually performing the procedures several times is needed to eliminate human error. More importantly, it points out the need to eliminate human intervention as much as possible, as any human intervention, whether for normal operation or for remediating a failure, has a significant possibility of being done incorrectly. That incorrect intervention can be relatively catastrophic: in the study above, replacing the wrong drive caused a complete loss of data in some instances.

So what is the takeaway from this information? Minimize or eliminate human intervention as much as possible in order to minimize outages attributable to human error. Typically this means automating as much as possible any necessary steps to resume normal operation after a failure or even during normal operation. Every manual step taken by an administrator has some probability of causing an outage. It also suggests that repair procedures be well tested, preferably in a test environment that duplicates the production environment.

More on how Tuxedo can help solve these problems in my next entry.

Saturday Jan 11, 2014

High Availability Part 3

In my previous posts on High Availability I looked at the definition of availability and ways to increase the availability of a system using redundant components.  In this post I'll look at another way to increase the availability of a system.  Let’s go back to the calculation of availability:

    Availability = MTBF / (MTBF + MTTR)

Based upon this formula, we can see that if we can decrease the MTTR, we can increase the overall availability of the system. For a computer system, let’s look at what makes up the time to repair the system. It includes some time that may not be obvious, but that is in fact extremely important. The timeline for a typical computer system failure might look like:

  1. Normal operation
  2. Failure or outage occurs
  3. Failure or outage detected
  4. Action taken to remediate the failure or outage
  5. System placed back into normal operation
  6. Normal operation

Most people only consider item (4) above, the time taken to remediate the outage. That might be something like replacing a failed hard drive or network controller. It could even be as simple as reconnecting an accidentally disconnected network cable, a 30 second repair. But the MTTR isn’t 30 seconds; it’s the time included in (3), (4), and (5) above. For the network cable example, the amount of time taken in (3) will depend upon network timers at multiple levels and could be many minutes if just relying on the operating system network stack. The time taken for (4) may be as low as the 30 seconds needed to reconnect the cable, although finding the cable might take a bit longer than 30 seconds. The time for (5) again depends upon the service resumption steps, such as re-establishing a DHCP address, reconnection of applications or servers, etc. So while on the surface the MTTR may appear to be 30 seconds, the actual time could be many minutes, especially in the extreme case where systems, servers, applications, etc., need to be restarted or rebooted manually to recover.

So how does this impact the design of highly available systems? It indicates that whatever can be done to decrease items (3), (4), and (5) above will improve overall system availability. The more of these steps that can be automated, the lower the MTTR one can achieve, and the higher the availability of the system. Too often the detection phase (3) is left up to someone calling a help desk to say they can’t access or use the system. Likewise, items (4) and (5) often require manual intervention or steps. When one wants to achieve 99.99% availability, manual repair or remediation is going to make that very difficult to achieve.
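As a back-of-the-envelope illustration, the timeline above can be put into numbers. This is only a sketch, not anything from Tuxedo itself; the MTBF and the detection and resumption times below are assumptions chosen to show how much they can dominate the hands-on repair time:

```python
# A minimal sketch of how the full repair timeline, not just the hands-on
# fix, drives MTTR and thus availability.  All timings are illustrative.

def availability(mtbf_hours, mttr_hours):
    """Availability as a fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

MTBF = 1000.0  # assumed hours of normal operation between failures

# Naive view: MTTR is just the 30-second cable reconnect, step (4).
naive_mttr = 30.0 / 3600.0

# Full view: detection (3) + remediation (4) + resumption (5), in hours.
detect = 10 * 60.0   # e.g. network timers take 10 minutes to notice
repair = 30.0        # reconnecting the cable
resume = 5 * 60.0    # DHCP renewal, application reconnections
full_mttr = (detect + repair + resume) / 3600.0

print(f"naive availability: {availability(MTBF, naive_mttr):.5%}")
print(f"full availability:  {availability(MTBF, full_mttr):.5%}")
```

Running this shows the "30 second repair" view overstating availability by a wide margin, which is exactly why automating detection and resumption pays off.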

More on the causes of failures in my next post.

Monday Jan 06, 2014

High Availability Part 2

To compute the availability of a system, you need to examine the availability of the components that make up the system.  To combine the availability of the components, you need to determine if a component's failure prevents the system from being usable, or if the system can still be available despite the failure. That sounds strange until you consider redundancy. In a non-redundant subsystem, if it fails, the system is unavailable. So in a completely non-redundant system, the availability of the system is simply the product of each component’s availability:

    A(system) = A(1) × A(2) × … × A(n)

A very simplified view of this might be:

     Client => LAN => Server => Disk

If we take the client out of the picture, as it really isn't part of the system, we at least have a network, a server, and a disk drive that must all be available in order for the system to be available. Let’s say each has an availability of 99.9%; then the system availability would be:

    0.999 × 0.999 × 0.999 ≈ 0.997

or 99.7% available. That’s roughly equivalent to a day’s worth of outage a year. So although each subsystem is only unavailable about 9 hours a year, the three combined end up being unavailable for over a day. As the number of required subsystems or components grows, the availability of the overall system decreases. To alleviate this, one can use redundancy to help mask failures. With n redundant components, each with availability A, of which only one is needed, the availability is determined by the formula:

    A(redundant) = 1 − (1 − A)^n

Let’s look at just the server component. If instead of a single server with 99.9% availability we have two servers, each with 99.9% availability, but only one of them is needed for the system to be available, then the availability of the server component increases from 99.9% to 1 − (0.001)² = 99.9999%, or six nines of availability, just by adding an additional server. As you can see, redundancy can dramatically increase the availability of a system. If we also make the LAN and disk subsystems redundant in the example above, then instead of 99.7% availability we get about 99.9997% availability, or roughly a minute and a half of down time a year instead of over a day.

OK, so what does all of this have to do with creating highly available systems? Everything! What it tells us is that, all things being equal, simpler systems have higher availability. In other words, the fewer required components you have, the more available your system will be. It also tells us that to improve availability we can either purchase components with higher availability, or add some redundancy into the system. Buying more reliable or available components is certainly an option, although generally a fairly costly one. Mainframe computers are an example of this option: they generally provide better availability than blade servers, but do so at a very high premium. Using redundant commodity components is typically much cheaper and can yield even better overall availability.
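The series and parallel availability math above is easy to play with in a few lines of code. This is just an illustrative sketch, using the 99.9% per-component figure from the example:

```python
# Series availability: every component is required, so availabilities
# multiply.  Parallel availability: n redundant components, any one of
# which is sufficient, so only simultaneous failure takes the system down.

def series(*components):
    """Availability when all components are required."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel(a, n):
    """Availability of n redundant components, each with availability a."""
    return 1.0 - (1.0 - a) ** n

a = 0.999  # 99.9% per component, as in the example
print(f"LAN, server, disk in series: {series(a, a, a):.4%}")   # ≈ 99.70%
print(f"two redundant servers:       {parallel(a, 2):.6%}")
pair = parallel(a, 2)
print(f"three redundant pairs:       {series(pair, pair, pair):.6%}")
```

Note how three required components drag 99.9% parts down to 99.7%, while pairing each component pushes the system well past its individual parts.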

More on high availability in my next post. 

Thursday Jan 02, 2014

High Availability

As companies become more and more dependent upon their information systems just to be able to function, the availability of those systems becomes more and more important.  Outages can cost millions of dollars an hour in lost revenue, to say nothing of the potential damage done to a company’s image. To add to the problem, a number of natural disasters have shown that even the best data center designs can’t handle tsunamis and tidal waves, causing many companies to implement or re-evaluate their disaster recovery plans and systems. Practically every customer I talk to asks about disaster recovery (DR) and how to configure their systems to maximize availability and support DR. This series of articles will contain some of the information I share with these customers.

The first thing to do is define availability and how it is measured. The definition I prefer is that availability represents the percentage of time a system is able to correctly process requests within an acceptable time period during its normal operating period. I like this definition as it allows for times when a system isn’t expected to be available, such as during evening hours or a maintenance window. That being said, more and more systems are expected to be available 24x7, especially as more and more businesses operate globally and there are no common evening hours.

Measuring availability is pretty easy. Simply put, it is the ratio of the time a system is available to the time the system should be available. I know, not rocket science. While it’s good to measure availability, it’s usually better to be able to predict the availability of a given system, to determine if it will meet a company’s availability requirements. To predict availability for a system, one needs to know a few things, or at least have good guesses for them. The first is the mean time between failures, or MTBF. For single components like a disk drive, these numbers are pretty well known. For a large computer system the computation gets much more difficult; more on the MTBF of complex systems later. The next thing one needs to know is the mean time to repair, or MTTR, which is simply how long it takes to put the system back into working order.

Obviously, the higher the MTBF of a system, the higher its availability, and the lower the MTTR, the higher its availability. In mathematical terms, the system availability in percent is:

    Availability = MTBF / (MTBF + MTTR) × 100
So if the MTBF is 1000 hours and the MTTR is 1 hour, then the availability would be 1000 / 1001 ≈ 99.9%, often called 3 nines. To give you an idea of how much down time in a year equates to a given number of nines, here is a table showing various levels or classes of availability:


    Total Down Time per Year   Class or # of 9s   Typical application or type of system
    ~36 days                   1 (90%)
    ~4 days                    2 (99%)
    ~9 hours                   3 (99.9%)          Commodity Servers
    ~1 hour                    4 (99.99%)         Clustered Systems
    ~5 minutes                 5 (99.999%)        Telephone Carrier Servers
    ~1/2 minute                6 (99.9999%)       Telephone Switches
    ~3 seconds                 7 (99.99999%)      In-flight Aircraft Computers

As you can see, the amount of allowed down time gets very small as the class of availability goes up. Note, though, that these times assume the system must be available 24x365, which isn’t always the case.
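The down-time figures in the table follow directly from the availability percentages. A quick sketch, assuming 24x365 operation:

```python
# Convert an availability class (number of nines) into the allowed yearly
# down time, assuming the system must be up 24 hours a day, 365 days a year.

SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_seconds(nines):
    """Allowed down time per year for a given number of nines."""
    unavailability = 10.0 ** -nines   # 1 nine = 10% down, 2 nines = 1%, ...
    return unavailability * SECONDS_PER_YEAR

for nines in range(1, 8):
    secs = downtime_seconds(nines)
    if secs >= 86400:
        print(f"{nines} nines: ~{secs / 86400:.0f} days")
    elif secs >= 3600:
        print(f"{nines} nines: ~{secs / 3600:.0f} hours")
    elif secs >= 60:
        print(f"{nines} nines: ~{secs / 60:.0f} minutes")
    else:
        print(f"{nines} nines: ~{secs:.0f} seconds")
```

The printed values line up with the rounded entries in the table above.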

More about high availability in my next entry. 


This is the Tuxedo product team blog.

