Stephen Foskett,
Pack Rat
Understanding the accumulation of data
Defining Failure: What Is MTTR, MTTF, and MTBF?
Most IT professionals are used to talking about uptime, downtime, and system failure. But
not everyone is entirely clear on the definition of the terms widely used in the industry.
What exactly differentiates “mean time to failure” from “mean time between failures”? And
how does “mean time to repair” play into it? Let’s get some definitions straight!
Definition of a Failure
I suppose it is wise to begin by considering what exactly qualifies as a “failure.” Clearly, if
the system is down, it has failed. But what about the system running in degraded mode, such
as a RAID array that is rebuilding? And what about systems that are intentionally brought
offline?
Technically speaking, a failure is declared when the system does not meet its desired
objectives. When it comes to IT systems, including disk storage, this generally means an
outage or downtime. But I have experienced situations where the system was running so
slowly that it should be considered failed even though it was technically still “up.”
Therefore, I consider any system that cannot meet minimum performance or availability
requirements to be “failed.”
Similarly, a return to normal operations signals the end of downtime or system failure.
Perhaps the system is still in a degraded mode, with some nodes or data protection systems
not yet online, but if it is available for normal use I would consider it to be “non-failed.”
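
The working definition above can be sketched as a simple check. This is only a minimal illustration: the `Sample` fields and the `max_response_ms` threshold are hypothetical names invented for this example, not part of any real monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One health observation of a system (hypothetical fields, for illustration)."""
    reachable: bool          # does the system answer requests at all?
    response_ms: float       # observed response time in milliseconds
    degraded: bool = False   # e.g. a RAID array that is still rebuilding


def is_failed(sample: Sample, max_response_ms: float) -> bool:
    """Return True when the system misses its minimum requirements.

    A system counts as failed if it is down, or if it is "up" but slower than
    the agreed minimum. Degraded mode alone is deliberately not a failure.
    """
    if not sample.reachable:
        return True
    return sample.response_ms > max_response_ms
```

Note that the `degraded` flag is recorded but ignored by the check: a system that is rebuilding yet still usable counts as “non-failed.”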
Mean Time to Failure (MTTF)
Most systems only occasionally fail, so it is important to think of reliability in statistical
terms. Manufacturers often run controlled tests to see how reliable a device is expected to
be, and sometimes report these results to buyers. This is a good indication of the reliability
of a device, as long as these manufacturer tests are reasonably accurate. Unfortunately,
many vendors refer to this metric as “mean time between failure” (MTBF) rather than
“mean time to failure” (MTTF), which is incorrect as we shall soon see.
Note too that “MTTF” often exceeds the expected lifetime or usefulness of a device by a
good margin. A typical hard disk drive might list an MTTF of 1,000,000 hours, or over 100
years. But no one should expect a given hard disk drive to last this long. In fact, disk
replacement rate is much higher than disk failure rate!
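
To put the 1,000,000-hour figure in perspective, here is a rough sketch of the arithmetic. The fleet of 1,000 drives is a hypothetical number chosen for illustration, and reading MTTF as a fleet-wide statistic assumes a constant failure rate with failed drives replaced, which is a simplification rather than anything a vendor guarantees.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored for simplicity


def mttf_in_years(mttf_hours: float) -> float:
    """Convert a quoted MTTF from hours to years."""
    return mttf_hours / HOURS_PER_YEAR


def expected_failures_per_year(drive_count: int, mttf_hours: float) -> float:
    """Expected failures per year across a fleet.

    Assumption: constant failure rate, and every failed drive is replaced.
    """
    return drive_count * HOURS_PER_YEAR / mttf_hours


print(round(mttf_in_years(1_000_000), 1))                     # 114.2
print(round(expected_failures_per_year(1000, 1_000_000), 2))  # 8.76
```

Under these assumptions, a quoted MTTF of over a century for one drive amounts to roughly nine failures a year in a fleet of a thousand.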
Mean Time to Repair (MTTR)
Many vendors suppose that repairs are instantaneous or non-existent, but IT professionals
know that this is not the case. In fact, I might still be a systems administrator if it wasn’t for
the fact that I had to spend hours in freezing cold datacenters trying to repair failed
systems! The amount of time required to repair a system and bring it back online is the
“time to repair”, another critical metric.
In our example above, our flaky RAID array had an MTTF of 96 hours. If it fails once a
week, this leaves three days, or 72 hours, to get things operational again. Over time, we
would come to expect a “mean time to repair” or “MTTR” of 72 hours for any typical failure.
Again, we would be justified in complaining to the vendor at this point.
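
As a rough sketch, MTTR can be computed by averaging the repair durations in an outage log. The timestamps below are hypothetical, chosen only so that each repair takes the 72 hours used in the example.

```python
from datetime import datetime
from statistics import mean

# Hypothetical outage log: (time of failure, time service was restored).
outages = [
    (datetime(2011, 7, 1, 12), datetime(2011, 7, 4, 12)),
    (datetime(2011, 7, 8, 12), datetime(2011, 7, 11, 12)),
]


def mean_time_to_repair(log) -> float:
    """Average number of hours between each failure and its return to service."""
    return mean((up - down).total_seconds() / 3600 for down, up in log)


print(mean_time_to_repair(outages))  # 72.0
```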
Repairs can be excruciating, but they often do not take anywhere near as long as this. In
fact, most computer systems and devices are wonderfully reliable, with MTTF measured in
months or years. But when things do go wrong, it can often take quite a while to diagnose,
replace, or repair the failure. Even so, MTTR in IT systems tends to be measured in hours
rather than days.
Mean Time Between Failures (MTBF)
The most common failure-related metric is also the one most often used incorrectly. “Mean
time between failures” or “MTBF” refers to the amount of time that elapses between one
failure and the next. Mathematically, this is the sum of MTTF and MTTR, the total time
required for a device to fail and that failure to be repaired.
For example, our faulty disk array with an MTTF of 96 hours and an MTTR of 72 hours
would have an MTBF of one week, or 168 hours. But many disk drives only fail once in their
life, and most never fail. So manufacturers don’t bother to talk about MTTR and instead use
MTBF as a shorthand for average failure rate over time. In other words, “MTBF” often
reflects the number of drives that fail rather than the rate at which they fail!
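
The relationship between the three metrics is simple enough to sketch in a few lines. The `availability` function is an addition not discussed in this post: it uses the common steady-state formula MTTF / (MTTF + MTTR), included here only to show what the numbers imply.

```python
def mtbf_hours(mttf: float, mttr: float) -> float:
    """MTBF is the time to fail plus the time to repair."""
    return mttf + mttr


def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up (assumed steady-state formula)."""
    return mttf / (mttf + mttr)


print(mtbf_hours(96, 72))              # 168
print(round(availability(96, 72), 3))  # 0.571
```

For the faulty disk array, that works out to being usable only about 57% of the time.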
Stephen’s Stance
Most computer industry vendors use the term “MTBF” rather indiscriminately. But IT pros
know that systems do not magically repair themselves, at least not yet, so MTTR and MTTF
are just as important!
http://blog.fosketts.net/2011/07/06/defining-failure-mttr-mttf-mtbf/ 2/4
11/15/2014 Defining Failure: What Is MTTR, MTTF, and MTBF?