By Todd Little on Jan 06, 2014
To compute the availability of a system, you need to examine the availability of the components that make up the system. To combine the availability of the components, you need to determine if the components failure prevents the system from being usable, or if the system can still be available regardless of the failure. Now that sounds strange until you consider redundancy. In a non-redundant subsystem, if it fails, the system is unavailable. So in a completely non-redundant system, the availability of the system is simply the product of each component’s availability:
A very simplified view of this might be:
Client => LAN => Server => Disk
If we take the client out of the picture as it really isn't part of the system, we at least have a network, a server, and a disk drive to be available in order for the system to be available. Let’s say each has an availability of 99.9%, then the system availability would be:
or 99.7% available. That’s roughly equivalent to a day’s worth of outage a year. So although each subsystem is only unavailable about 9 hours a year, the 3 combined ends up being unavailable for over a day. As the number of required subsystems or components grows the availability of the overall system decreases. To alleviate this, one can use redundancy to help mask failures. With redundant components, the availability is determined by the formula:
Let’s look at just the server component. If instead of a single server with 99.9% availability , we have two servers each with 99.9% availability, but only one of them is needed to actually have the system be available, then the availability of the server component of the system increases from 99.9% to 99.999% or 5 nines of availability just by adding an additional server. As you can see, redundancy can dramatically increase the availability of a system. If we have redundant LAN and disk subsystems in the example above, instead of 99.7% availability, we get 99.997% availability or about 16 minutes of down time a year instead of over a day of down time.
OK, so what does all of this have to do with creating highly available systems? Everything! What it tells us is that all things being equal, simpler systems have higher availability. In other words, the fewer required components you have the more available your system will be. And it tells us that to improve availability we can either purchase components with higher availability, or we can add some redundancy into the system. Buying more reliable or available components is certainly an option, although generally that is a fairly costly option. Mainframe computers are an example of this option. They generally provide better availability than blade servers, but do so at a very high premium. Using cheaper redundant components is typically much cheaper and can even better overall availability.
More on high availability in my next post.