By Todd Little-Oracle on Jan 11, 2014
In my previous posts on High Availability I looked at the definition of availability and ways to increase the availability of a system using redundant components. In this post I'll look at another way to increase the availability of a system. Let’s go back to the calculation of availability:
Based upon this formula, we can see that if we can decrease the MTTR, we can increase the overall availability of the system. For a computer system, let’s look at what makes up the time to repair the system. It includes some time that may not be obvious, but in fact is extremely important. The timeline for a typical computer system failure might look light:
- Normal operation
- Failure or outage occurs
- Failure or outage detected
- Action taken to remediate the failure or outage
- System placed back into normal operation
- Normal operation
Most people only consider item (4) above, the time taken to remediate the outage. That might be something like replacing a failed hard drive or network controller. It could even be as simple as reconnecting an accidentally disconnected network cable, a 30 second repair. But the MTTR isn't 30 seconds. It’s the time included in (3), (4), and (5) above. For the network cable example, the amount of time taken in (3) will depend upon network timers at multiple levels and could be many minutes if just relying on the operating system network stack. The time taken for (4) may be as low as the 30 seconds needed to reconnect the cable although finding the cable might take a bit longer than 30 seconds. The time for (5) again depends upon the service resumption steps such as re-establishing a DHCP address, reconnection of applications or servers, etc. So on the surface the MTTR may be assumed to be 30 seconds, the actual time could be many minutes, especially in the extreme case where systems, servers, applications, etc., need to be restarted or rebooted manually to recover.
So how does this impact system design for highly available systems? It indicates that whatever can be done to decrease items (3), (4), and (5) above, will improve overall system availability. The more of these steps that can be automated, the lower the MTTR one can achieve, and the higher the availability of the system. Too often the detection phase (3) is left up to someone calling a help desk to say they can’t access or use the system. As well items (4) and (5) often require manual intervention or steps. When one wants to achieve 99.99% availability, manual repairs or remediation is going to make that very difficult to achieve.
More on the causes of failures in my next post.