11/15/2014 Defining Failure: What Is MTTR, MTTF, and MTBF?

Stephen Foskett,
Pack Rat
Understanding the accumulation of data

Defining Failure: What Is MTTR, MTTF, and MTBF?

July 6, 2011 by Stephen

Most IT professionals are used to talking about uptime, downtime, and system failure. But
not everyone is entirely clear on the definition of the terms widely used in the industry.
What exactly differentiates “mean time to failure” from “mean time between failures”? And
how does “mean time to repair” play into it? Let’s get some definitions straight!

Definition of a Failure
I suppose it is wise to begin by considering what exactly qualifies as a “failure.” Clearly, if
the system is down, it has failed. But what about a system running in degraded mode, such
as a RAID array that is rebuilding? And what about systems that are intentionally brought
offline?

Technically speaking, a failure is declared when the system does not meet its desired
objectives. When it comes to IT systems, including disk storage, this generally means an
outage or downtime. But I have experienced situations where the system was running so
slowly that it should be considered failed even though it was technically still “up.”
Therefore, I consider any system that cannot meet minimum performance or availability
requirements to be “failed.”

Similarly, a return to normal operations signals the end of downtime or system failure.
Perhaps the system is still in a degraded mode, with some nodes or data protection systems
not yet online, but if it is available for normal use I would consider it to be “non-failed.”
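
As a rough illustration of this definition, the sketch below treats a system as failed when it is unreachable or too slow, while a degraded but usable system still counts as non-failed. The class names, field names, and the 50 ms threshold are all hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical minimum service levels; the threshold is illustrative."""
    max_latency_ms: float  # slowest acceptable response time


@dataclass
class Sample:
    """One health observation of the system (field names are assumptions)."""
    reachable: bool
    latency_ms: float
    degraded: bool = False  # e.g. a RAID array that is rebuilding


def is_failed(sample: Sample, req: Requirements) -> bool:
    """Return True when the system misses its minimum requirements.

    Degraded mode alone does not count as a failure; only an unreachable
    or too-slow system does.
    """
    if not sample.reachable:
        return True
    return sample.latency_ms > req.max_latency_ms


req = Requirements(max_latency_ms=50.0)
print(is_failed(Sample(reachable=True, latency_ms=12.0, degraded=True), req))  # False
print(is_failed(Sample(reachable=True, latency_ms=900.0), req))                # True
print(is_failed(Sample(reachable=False, latency_ms=0.0), req))                 # True
```

The second case is the “technically up but too slow” situation described above.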

Figure: MTBF is the sum of MTTR and MTTF

Mean Time to Failure (MTTF)
The first metric that we should understand is the time that a system is not failed, or is
available. Often referred to as “uptime” in the IT industry, the length of time that a system
is online between outages or failures can be thought of as the “time to failure” for that
system.

For example, if I bring my RAID array online on Monday at noon and the system functions
normally until a disk failure Friday at noon, it was “available” for exactly 96 hours. If this
happens every week, with repairs lasting from Friday noon until Monday noon, I could
average these numbers to reach a “mean time to failure” or “MTTF” of 96 hours. I would
probably also call my system vendor and demand that they replace this horribly unreliable
device!
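
The averaging in this example can be sketched in a few lines of Python. This is only an illustration: the helper name `mean_hours`, the four-week window, and the starting date are assumptions made for the sketch, not part of any real monitoring tool.

```python
from datetime import datetime, timedelta


def mean_hours(intervals):
    """Average length, in hours, of a list of (start, end) datetime pairs."""
    total = sum((end - start for start, end in intervals), timedelta())
    return total.total_seconds() / 3600 / len(intervals)


# Four weeks of the example: online Monday noon, disk failure Friday noon.
first_monday = datetime(2011, 7, 4, 12, 0)  # an arbitrary Monday, for illustration
uptimes = [
    (first_monday + timedelta(weeks=w), first_monday + timedelta(weeks=w, days=4))
    for w in range(4)
]

mttf_hours = mean_hours(uptimes)
print(mttf_hours)  # 96.0
```

Passing the repair windows (Friday noon to Monday noon) to the same helper would give the average repair time instead.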

Most systems only occasionally fail, so it is important to think of reliability in statistical
terms. Manufacturers often run controlled tests to see how reliable a device is expected to
be, and sometimes report these results to buyers. This is a good indication of the reliability
of a device, as long as these manufacturer tests are reasonably accurate. Unfortunately,
many vendors refer to this metric as “mean time between failures” (MTBF), which is
incorrect as we shall soon see.

Note too that “MTTF” often exceeds the expected lifetime or usefulness of a device by a
good margin. A typical hard disk drive might list an MTTF of 1,000,000 hours, or over 100
years. But no one should expect a given hard disk drive to last this long. In fact, disk
replacement rate is much higher than disk failure rate!

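One way to make sense of such a large figure is to read it as a statistic about a population of drives rather than a promise about one drive. The sketch below converts the quoted MTTF to years and then estimates failures per year across a fleet. It assumes a constant failure rate of 1/MTTF per drive-hour, which is a simplification of mine and not something a drive vendor guarantees; the fleet size of 1,000 is also just an example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; leap years ignored for simplicity


def mttf_in_years(mttf_hours):
    """Convert an MTTF quoted in hours to years."""
    return mttf_hours / HOURS_PER_YEAR


def expected_failures_per_year(drive_count, mttf_hours):
    """Expected failures in one year of continuous operation.

    Assumes a constant failure rate of 1/MTTF per drive-hour
    (an assumption for this sketch, not a vendor guarantee).
    """
    return drive_count * HOURS_PER_YEAR / mttf_hours


print(round(mttf_in_years(1_000_000), 1))            # 114.2
print(expected_failures_per_year(1_000, 1_000_000))  # 8.76
```

Under that assumption, a fleet of 1,000 such drives would still see roughly nine failures a year, even though no single drive is expected to run for a century.
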
Mean Time to Repair (MTTR)
Many vendors suppose that repairs are instantaneous or non-existent, but IT professionals
know that this is not the case. In fact, I might still be a systems administrator if it weren’t for
the fact that I had to spend hours in freezing cold datacenters trying to repair failed
systems! The amount of time required to repair a system and bring it back online is the
“time to repair”, another critical metric.

In our example above, our flaky RAID array had an MTTF of 96 hours. This leaves three
days of each week, or 72 hours, to get things operational again. Over time, we would come
to expect a “mean time to repair” or “MTTR” of 72 hours for any typical failure. Again, we
would be justified in complaining to the vendor at this point.

Repairs can be excruciating, but they often do not take anywhere near as long as this. In
fact, most computer systems and devices are wonderfully reliable, with MTTF measured in
months or years. But when things do go wrong, it can often take quite a while to diagnose,
replace, or repair the failure. Even so, MTTR in IT systems tends to be measured in hours
rather than days.

Mean Time Between Failures (MTBF)
The most common failure-related metric is also mostly used incorrectly. “Mean time
between failures” or “MTBF” refers to the amount of time that elapses between one failure
and the next. Mathematically, this is the sum of MTTF and MTTR, the total time required
for a device to fail and that failure to be repaired.

For example, our faulty disk array with an MTTF of 96 hours and an MTTR of 72 hours
would have an MTBF of one week, or 168 hours. But many disk drives only fail once in their
life, and most never fail. So manufacturers don’t bother to talk about MTTR and instead use
MTBF as a shorthand for average failure rate over time. In other words, “MTBF” often
reflects the number of drives that fail rather than the rate at which they fail!

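The 168-hour figure can be cross-checked by measuring the gaps between consecutive failures directly and comparing the result with MTTF plus MTTR. The following is a minimal sketch; the function name `mtbf_hours`, the starting date, and the five-failure sample are assumptions made for illustration.

```python
from datetime import datetime, timedelta

MTTF_HOURS = 96  # uptime per cycle, from the example
MTTR_HOURS = 72  # repair time per cycle, from the example

# Failure timestamps for the weekly pattern: every Friday at noon.
first_failure = datetime(2011, 7, 8, 12, 0)  # an arbitrary Friday, for illustration
failures = [first_failure + timedelta(weeks=w) for w in range(5)]


def mtbf_hours(failure_times):
    """Mean gap, in hours, between consecutive failure timestamps."""
    gaps = [later - earlier for earlier, later in zip(failure_times, failure_times[1:])]
    return sum(gaps, timedelta()).total_seconds() / 3600 / len(gaps)


print(mtbf_hours(failures))                             # 168.0
print(mtbf_hours(failures) == MTTF_HOURS + MTTR_HOURS)  # True
```
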
Stephen’s Stance
Most computer industry vendors use the term “MTBF” rather indiscriminately. But IT pros
know that systems do not magically repair themselves, at least not yet, so MTTR and MTTF
are just as important!

Filed under: Computer History, Enterprise Storage, Personal. Tagged with: Failure, MTBF, MTTF, MTTR