
Brill's Law of Catastrophic Failure: Catastrophic failure is never the result of a single event or
interaction. Such failures are the result of at least five, and as many as 10, interactions. Any
single interaction or several in combination might be bad, but not catastrophic. But when
the right combination of interactions occurs (typically seven are required), they will produce
a domino effect and devastation will occur 100% of the time.
My ladder failure occurred while I was winterizing a rooftop cooling unit on my home. Here
is the sequence of interacting events. Any one done differently would have reduced the
magnitude of my injury or would have prevented it.
I was working alone. (Management decision not to incur the cost for a helper.)
It was late in the day; I was tired and rushing to get done before the sun went down and
the air temperature dropped to below freezing. (Management decision to press forward
rather than stopping for the day.)

In an attempt to save time, I reduced the number of trips up and down the ladder by
carrying things in my hands and not holding onto the ladder. (Human error and arrogance
brought on by a lifetime of taking increasingly greater risks, encouraged by never having
fallen previously.)
One of the things in my hands was six feet of bubble wrap. (Life happens in unexpected
ways.)
The ladder was in a location protected from the wind at the bottom where I got onto the
ladder. (Life happens.)
When I got to the top of the ladder and out of the protection of the building, a wind was
blowing. The bubble wrap acted as a sail and, since my hands were full, I couldn't hang on
and the wind blew me off backward before I could react. (Catastrophic failure.)
I believe my analytical process can be applied to virtually any catastrophic failure and the
number of interactions required will range between five and 10. Subtract almost any one or
two of the interactions and failure will either not occur or its impact will be greatly reduced.
Let's use a public catastrophic failure example that placed more than 14,000 airplane
passengers at risk of death. This occurred many years ago when a central office in lower
Manhattan failed and left 1,400 airplanes in mid-flight with no way to communicate with Air
Traffic Control.
The local electric utility had offered a $200,000 annual incentive if the central office would
go off the grid and run on back-up engine generators when electric demand was high on hot
days. (Management decision to accept the incentive.)
It was a hot day and the utility called for the incentive to be honored starting at 8:30 a.m.
(A hot summer day will happen.)
The engines were started and the utility feed was disconnected, but the batteries weren't
being charged. As a result, the batteries started discharging. (Failures occur.)
Company policy called for technicians to be stationed in the engine room whenever the
central office was running on the generators. Management directed that the technicians be
at an off-site training class, so no one saw that the engines were idling. (Management
decision to deviate from standard policy.)
The technicians were in training all day and did not return until after the batteries
ultimately failed. (Things happen.)
The battery discharge alarm bell had been disconnected. (Management inaction to ensure
proper monitoring.)
The battery discharge alarm light had not been relocated when the command center
moved. (Management inaction to ensure proper monitoring.)
Battery discharge continued successfully for more than eight hours. When the batteries
finally failed, the central office went off the air. (Batteries will run down and ultimately fail.)
When the central office failed, all air traffic communications with airplanes on the East Coast
failed because the fiber for multiple communications carriers was routed through this single
central office. (Management failure to implement true diversity and redundancy.)
Without air traffic communication, 1,400 airplanes were left to get back down on the
ground using manual procedures. Fortunately, the manual landing procedures worked and
all 1,400 planes landed safely. (Catastrophic risk averted.)

As these examples show, management errors are much more common than human errors. I
have applied this analytical approach to the Minneapolis bridge structural failure that
toppled cars and killed people. The results are always the same. It doesn't matter what the
underlying physical, mechanical or electrical portions of the event are; management error or
inaction contributes up to half of the interactions resulting in catastrophic failure.
The Uptime Institute data center failure statistics consistently demonstrate that only
one-third of data center failures are caused by equipment. Of the remaining two-thirds of
availability failures, more than 70% are caused by intentional management decisions or by
management inaction. This leaves less than 30% of that two-thirds, or only about 20% of all
data center failures, to human error.
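The arithmetic behind that 20% figure can be checked with a short sketch. It assumes the three causes (equipment, management, human error) are mutually exclusive and cover all failures, and it takes "more than 70%" as exactly 70%, so the human-error share it prints is an upper bound. The function and variable names are illustrative only and do not come from the Uptime Institute.

```python
from fractions import Fraction


def failure_shares(equipment_share, management_share_of_rest):
    """Split all failures into equipment, management and human-error shares.

    Assumes the three causes are mutually exclusive and exhaustive.
    """
    rest = 1 - equipment_share                    # failures not caused by equipment
    management = rest * management_share_of_rest  # management decisions or inaction
    human_error = rest - management               # whatever is left over
    return {
        "equipment": equipment_share,
        "management": management,
        "human_error": human_error,
    }


# One-third equipment; 70% of the remainder attributed to management (assumed exact).
shares = failure_shares(Fraction(1, 3), Fraction(7, 10))

for cause, share in shares.items():
    print(f"{cause}: {share} ({float(share):.1%})")
```

Under these assumptions management accounts for 7/15 of all failures (about 47%) and human error for 1/5 (20%). Because the column says "more than 70%", the true management share would be somewhat higher and the human-error share somewhat lower.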
Systemically addressing management issues is the quickest and ultimately cheapest path to
reducing catastrophic failure.
Kenneth G. Brill is executive director of the Uptime Institute in Santa Fe, N.M.
