Boeing & Root Cause of Failure
By dcb on Dec 15, 2004
I found the following very interesting! It is buried in a 22 page report on Boeing's web site: http://www.boeing.com/news/techissues/pdf/statsum.pdf
Statistical Summary of Commercial Jet Airplane Accidents Worldwide Operations (1959 - 2003)
On page 19, you'll find the following graphic (I've added some context elements) that describes the root cause of hull loss and/or loss of life in the worldwide commercial air fleet over the last 10 years:
It is interesting to note that the large majority of cases of, um, "down" time were due to people either making mistakes (or acting maliciously), or to people correctly following faulty or incomplete procedures (which were themselves written by people). It is rarely the product (the airplane) or the environment (weather).
In the same way, Gartner and others have long held that complex IT systems fail to deliver expected service levels mostly because of people and process related root causes (est. ~80%). Product failures actually account for a tiny fraction of IT service disruptions.
This seems to point to a general pattern: whenever complex systems expose their complexity to human touch points, catastrophic failures that impact business and/or life will occur, even when those humans are psychologically screened, highly trained, highly paid, and limited in number.
This is probably no surprise to us. Each one of us has made mistakes behind the wheel, in social settings, etc., for a variety of reasons (boredom, over-confidence, and so on). But the implication of the Boeing and Gartner studies is that we should strive to abstract complexity away from human touch points at every opportunity. Think of "fly-by-wire" controls, in which a pilot's actions are constrained by a flight control system that will not allow actions that could harm the airplane or its passengers. Freedom and flexibility are permitted up to, but not exceeding, a "pain" threshold.
In professional audio systems, a "compressor" is often used. Dynamic response is not affected until the signal reaches a threshold at which it might distort or consume undesired energy; then the system steps in and cleanly limits further dynamic range. As long as you operate in the expected range, you have freedom. If your actions threaten the quality of the output, your action is constrained. That seems like a fair trade-off of freedom and control.
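The compressor idea is simple enough to sketch in a few lines of code. Here is a minimal hard-knee compressor, purely for illustration; the threshold and ratio values are my own arbitrary choices, not anything from the Boeing report or a real audio product:

```python
def compress(sample: float, threshold: float = 0.8, ratio: float = 4.0) -> float:
    """Pass the signal through unchanged below the threshold; above it,
    let only 1/ratio of the excess through (a 4:1 ratio here)."""
    magnitude = abs(sample)
    if magnitude <= threshold:
        return sample  # expected operating range: full freedom
    # Above the threshold the system steps in and limits dynamic range.
    limited = threshold + (magnitude - threshold) / ratio
    return limited if sample >= 0 else -limited

print(compress(0.5))  # in range, untouched: 0.5
print(compress(1.2))  # 0.4 over threshold, only 0.1 gets through: 0.9
```

Inside the expected range you have complete freedom; only when your signal threatens the output does the constraint kick in.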
The ultimate expression of an automated datacenter would be to define desired service levels (along with the cost and reward sensitivities as actual service levels vary from the nominal/desired) and let "fly-by-wire" micro-adjustments to the IT infrastructure handle the optimization. This could radically reduce IT service disruptions, as complexity would be managed by highly available, hardened controllers rather than by distracted operators. The sensitivity parameters allow the system to distribute excess resources to the services that would benefit most from better-than-desired performance, or to degrade the least sensitive services if a shortfall were to occur.
Of course, cascading failures are still possible. Remember the recent blackout in the Northeast! Codified heuristics that control optimization decisions are simply human-designed algorithmic procedures, and procedures can be flawed, or can reach an "if" branch of a decision tree based on stimuli for which there is no "then" statement. But once solved and hardened, this datacenter control "product" will be much more dependable at delivering desired service levels than an army of humans manually adjusting knobs.
Can this go too far? Sure! I'm not sure I'd want to fly in a pilotless helicopter around Kauai... There are limits to the value of automated services that pre-define concepts of optimization. However, a helicopter with controllers that prevent potentially harmful actions by an error-prone human pilot would be comforting, and might not only keep the charter service in business (and me alive!) but also be leveraged as a way to drive more business.