By Todd Little on Jan 15, 2014
To Err is Human; To Survive is High Availability
In this post I’d like to look at the various causes of unavailability or outages. The most obvious although often overlooked is that of scheduled system maintenance. Now whether that is included in your measurement of availability depends upon the stack holders for a system or application. The ideal systems have no scheduled maintenance that causes the system to be unavailable. That isn’t to say they don’t receive maintenance, but that the maintenance doesn’t cause the system to be unavailable. This can be done via rolling upgrades, site switchovers, etc. For now it suffices to say that this type of down time is intentional, known, and typically scheduled.
The interesting part comes in looking at other causes of unavailability, in particular those caused by failures. The most commonly thought of failure is that of a hardware failure such as a disk drive failing, or a server failing. These failures tend to be obvious and easily remedied. Most people then guess that software failures make up the next significant portion of failures. But as is all too often the case, the most common failures in highly available systems are those caused by people. Estimates place hardware failures at around 10% of the causes of an outage. This low percentage is largely due to the ever improving MTBF of hardware. Software is estimated to cause about 20% of outages for highly available systems. The remaining 70% of outages are attributable to human action, and increasingly these actions are intentional, i.e., purposeful interruptions of service for malicious intent such as denial of service attacks.
To give an example a study was done on replacing a failed hard drive in a software RAID configuration. A seemingly simple task, yet a surprising number of cases of replacing the wrong drive occurred in the first few times an engineer was asked to repair the systems. This indicates that putting procedures in place to repair a system isn’t adequate, but that actually performing the procedures several times is needed to eliminate human error. But more importantly it points out the need to eliminate human intervention as much as possible as any human intervention either for normal operation or for remediating a failure has a significant possibility of being done incorrectly. That incorrect intervention could be relatively catastrophic as in replacing the wrong drive in the above study caused a complete loss of data in some instances.
So what is the takeaway from this information? Minimize or eliminate human intervention as much as possible in order to minimize outages attributable to human error. Typically this means automating as much as possible any necessary steps to resume normal operation after a failure or even during normal operation. Every manual step taken by an administrator has some probability of causing an outage. It also suggests that repair procedures be well tested, preferably in a test environment that duplicates the production environment.
More on how Tuxedo can help solve these problems in my next entry.