Failure Prevention and Recovery

Failure
• There is always a chance that things might go wrong – we must accept this, NOT ignore it.
• Critical failure:
◦ Loss of customers
◦ High downtime
◦ High repair cost
◦ Injury or loss of life (company reputation)
• Non-critical failure – lesser effect
• Organizations must discriminate between failures and give priority to critical ones, by understanding “why things fail” and “how to measure the impact of failure”
Failure as an Opportunity
• All failures can be traced back to some kind of human failure.
◦ A machine failure might have been caused by someone’s poor design or maintenance.
◦ A delivery failure might have been someone’s error in managing the supply schedule.
• Failures are rarely the result of random chance.
◦ They can be controlled to a certain extent.
◦ We can learn from failure and change accordingly.
• Failure is an opportunity to examine the causes and plan for their elimination.
System Failure
Why things fail:
1) Failure resulting from within the operation:
• Design failure
• Facilities failure
• People failure
2) Failure resulting from material or information
input
• Supplier failure
3) Failure resulting from customer actions
• Customer failure
Why Things Fail
Design failure:
– Operations may look fine on paper but cannot cope with real circumstances.
– Type 1: A characteristic of demand was overlooked or miscalculated.
• A bearing factory is designed to produce 100 bearings per day but customers demand 125 bearings per day.
– Type 2: The circumstances under which the operation has to work are not as expected.
• A factory building designed to house stationary machinery fails when it is used to house a vibrating machine.
Why Things Fail
Facilities failure:
– All facilities (machines, equipment, buildings, fittings) are liable to ‘break down’.
– Type 1: Partial breakdown
• Worn-out carpet in a hotel
• Machine can only run at half its normal rate
– Type 2: Complete breakdown
• Sudden stop of operation
– It is the effect of the breakdown that is important – some breakdowns could paralyse the whole operation.
– Some failures have a significant cumulative impact.
Why Things Fail
People failure:
– Type 1: ‘Errors’ are mistakes in judgement
• A manager’s decision to continue running the plant with a partially failed heat exchanger resulted in a more expensive complete breakdown.
– Type 2: ‘Violations’ are acts which are contrary to defined operating procedures
• A machine operator’s failure to lubricate the bearings of the motor resulted in the bearings overheating and failing.
Why Things Fail
Supplier failure:
– A supplier’s failure to:
• Deliver
• Deliver on time
• Deliver quality goods and services
can lead to failure within an operation.
Customer failure:
– Customer failure can result when customers misuse products and services.
• Example: Someone loading a 14 kg washing machine with 18 kg of clothes will cause the machine to fail.
Measuring Failure
– There are three main ways of measuring
failure:
• Failure rates – how often a failure
occurs
• Reliability – the chances of failure
occurring
• Availability – the amount of available
useful operating time
Measuring Failure
• Failure rate (FR), measured per product tested:
$$FR = \frac{\text{number of failures}}{\text{total number of products tested}}$$
• Failure rate (FR), measured over time:
$$FR = \frac{\text{number of failures}}{\text{operating time}}$$
• Example: If an engine fails 4 times after operating for 300 hours, it has a failure rate of 4/300 = 0.013 failures per hour (1.3%).
• Example: If 5 out of 250 products tested for operability failed, the failure rate is 5/250 = 0.02 (2%).
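The two failure-rate definitions can be checked with a short sketch. The function names below are illustrative choices, not part of the original slides; the inputs are the engine and product figures from the examples.

```python
def failure_rate_per_unit(failures: int, units_tested: int) -> float:
    """Failure rate as a fraction of the products tested."""
    if units_tested <= 0:
        raise ValueError("units_tested must be positive")
    return failures / units_tested


def failure_rate_per_hour(failures: int, operating_hours: float) -> float:
    """Failure rate as failures per hour of operating time."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


engine_fr = failure_rate_per_hour(4, 300)    # 4 failures in 300 operating hours
product_fr = failure_rate_per_unit(5, 250)   # 5 failures among 250 products tested

print(f"{engine_fr:.4f} failures per hour")    # 0.0133 failures per hour
print(f"{product_fr:.1%} of products tested")  # 2.0% of products tested
```

Note that the two measures have different units: the first is a proportion of items, the second a rate per unit of operating time.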
Measuring Failure
Failure over time – the ‘bath-tub’ curve
• At different stages during the life of anything, the probability of it failing will be different.
• The failure pattern of most physical entities follows the bath-tub curve.
The ‘bath-tub’ curve comprises three stages:
• The ‘infant-mortality’ stage, where early failures occur, caused by defective parts or improper use.
• The ‘normal life’ stage, when the failure rate is low and reasonably constant, and failures are caused by normal random factors.
• The ‘wear-out’ stage, when the failure rate increases as the part approaches the end of its working life, and failure is caused by the ageing and deterioration of parts.
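Bath-Tub Curve
[Figure: failure rate plotted against time, showing the infant-mortality stage, the normal-life stage and the wear-out stage in turn, with times x and y marked on the time axis.]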
Reliability
◦ Measures the probability that a system, product or service will perform as expected over time.
◦ Takes values between 0 and 1 (0 to 100% reliability).
◦ Used to relate the reliability of the parts of a system to the reliability of the whole system.
• If the components in a system are all interdependent, a failure in any individual component will cause the whole system to fail.
• Hence the reliability of the whole system, R_s, is:
$$R_s = R_1 \times R_2 \times R_3 \times \dots \times R_n$$
Where: R_1 = reliability of component 1
R_2 = reliability of component 2
R_3 = reliability of component 3
Etc…
Worked Example
An automated pizza-making machine in a food manufacturer’s factory has five major components, with individual reliabilities (the probability of the component not failing) as follows:
Component                  Reliability
Dough mixer                0.95
Dough roller and cutter    0.99
Tomato paste applicator    0.97
Cheese applicator          0.90
Oven                       0.98
If one of these parts of the production system fails, the whole system will stop working. Thus the reliability of the whole system is:
$$R_s = 0.95 \times 0.99 \times 0.97 \times 0.90 \times 0.98 = 0.805$$
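The worked example can be reproduced with a minimal sketch. The function name `series_reliability` is an illustrative choice, and the calculation assumes, as the slide does, that the system fails whenever any single component fails.

```python
from math import prod


def series_reliability(component_reliabilities):
    """Reliability of a system in which every component must work."""
    values = list(component_reliabilities)
    for r in values:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability must be between 0 and 1, got {r}")
    return prod(values)


# Component reliabilities from the pizza-making machine example.
pizza_machine = {
    "dough mixer": 0.95,
    "dough roller and cutter": 0.99,
    "tomato paste applicator": 0.97,
    "cheese applicator": 0.90,
    "oven": 0.98,
}

rs = series_reliability(pizza_machine.values())
print(round(rs, 3))  # 0.805
```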
Worked Example
Notes:
◦ The reliability of the whole system is only about 0.8, even though the reliability of every individual component is higher.
◦ If the system had more components, its reliability would be lower.
◦ E.g. for a system with 10 components, each with a reliability of 0.99, the reliability of the system is about 0.9, BUT if the system has 50 components, each with a reliability of 0.99, the reliability of the system falls to about 0.6.
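The effect of component count can be checked directly. This is a small sketch assuming identical, interdependent components; the function name is illustrative.

```python
def system_reliability(component_reliability: float, n_components: int) -> float:
    """Reliability of n identical components that must all work."""
    return component_reliability ** n_components


for n in (10, 50):
    print(n, round(system_reliability(0.99, n), 3))
# 10 0.904
# 50 0.605
```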
Availability
◦ Availability is the degree to which the operation is ready to work.
◦ An operation is not available if it has either failed or is being repaired following a failure.
$$\text{Availability } (A) = \frac{MTBF}{MTBF + MTTR}$$
Where
MTBF = mean time between failures
MTTR = mean time to repair
$$MTBF = \frac{\text{operating hours}}{\text{number of failures}}$$
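A minimal sketch of the availability calculation follows. The operating hours, failure count and repair time are hypothetical figures chosen for illustration; they do not come from the slides.

```python
def mtbf(operating_hours: float, number_of_failures: int) -> float:
    """Mean time between failures."""
    return operating_hours / number_of_failures


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Share of time the operation is ready to work."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical machine: 4 failures in 300 operating hours, 5 hours to repair each time.
machine_mtbf = mtbf(300, 4)                 # 75.0 hours
machine_a = availability(machine_mtbf, 5)   # 75 / (75 + 5)
print(machine_mtbf, machine_a)              # 75.0 0.9375
```

Availability can be raised either by making failures rarer (a larger MTBF) or by repairing faster (a smaller MTTR).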
The three tasks of failure prevention and recovery
• Failure detection and analysis: finding out what is going wrong and why
• Improving system reliability: stopping things going wrong
• Recovery: coping when things do go wrong
Failure detection and analysis
Mechanisms to detect failure:
1. In process checks
2. Machine diagnostic check
3. Point-of-departure interviews
4. Phone surveys
5. Focus groups
6. Complaint cards or feedback sheets
7. Questionnaires
1. In process checks – employees check that
the process is acceptable during the process.
Example: “Is everything alright with your meal,
madam?”
2. Machine diagnostic check – a machine is
put through a prescribed sequence of activities
to expose any failures or potential failures.
Example: A heat exchanger tested for leaks,
cracks and wear
3. Point-of-departure interviews – at the end
of a service, staff may check that the service has
been satisfactory.
5. Focus groups – groups of customers are brought together to discuss some aspects of a product or service.
4, 6 and 7. Phone surveys, complaint cards and questionnaires – these can be used to ask for opinions about products or services.
Failure analysis:
1. Accident investigation
• Trained staff analyse the cause of the
accident.
• Make recommendations to minimize or eradicate the chance of the failure happening again.
• Specialized investigation technique suited to
the type of accident
2. Product liability
• Ensures all products are traceable.
• Traced back to the process, the components
from which they were produced and the
supplier who supplied them.
• Goods can be recalled if necessary.
3. Complaint analysis
• Complaints and compliments are recorded
and taken seriously.
• Cheap and easily available source of
information about errors.
• Involves tracking number of complaints over
time.
4. Critical incident analysis
• Requires customers to identify the elements
of products or services they found either
satisfying or not satisfying.
• Especially used in service operations.
5. Failure mode and effect analysis (FMEA)
• Used to identify failures before they happen so that proactive measures can be taken.
• For each possible cause of failure, the following types of question are asked:
◦ What is the likelihood that a failure will occur?
◦ What would the consequence of the failure be?
◦ How likely is such a failure to be detected before it affects the customer?
• A risk priority number (RPN) is calculated based on the answers to these questions.
• Corrective action is taken based on the RPN.
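The RPN step can be sketched as follows. The slides do not give the formula; this sketch uses the common FMEA convention of multiplying the three ratings, and the 1 to 10 scales and the failure modes listed are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    occurrence: int  # likelihood of failure: 1 (remote) to 10 (almost certain)
    severity: int    # consequence of failure: 1 (minor) to 10 (catastrophic)
    detection: int   # 1 (almost certainly detected) to 10 (almost never detected)

    @property
    def rpn(self) -> int:
        """Risk priority number: product of the three ratings."""
        return self.occurrence * self.severity * self.detection


# Hypothetical failure modes and ratings.
modes = [
    FailureMode("late delivery from supplier", occurrence=4, severity=6, detection=3),
    FailureMode("motor bearing overheats", occurrence=2, severity=9, detection=5),
    FailureMode("label misprinted", occurrence=6, severity=2, detection=2),
]

# Highest RPN first: these are the candidates for corrective action.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for m in ranked:
    print(m.rpn, m.name)
```

Ranking by RPN means a rare but severe and hard-to-detect failure can outrank a frequent but trivial one.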
6. Fault-tree analysis
• This is a logical procedure that starts with
a failure or potential failure and works
backwards to identify all the possible
causes and therefore the origins of that
failure.
• Made up of branches connected by AND nodes and OR nodes.
• All of the branches below an AND node need to occur for the event above the node to occur.
• Only one of the branches below an OR node needs to occur for the event above the node to occur.
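The AND/OR logic of a fault tree can be sketched as a small recursive evaluation. The event names are borrowed from the below-temperature food example in this deck, but the gate types assigned to them here are assumptions, as is the separate two-pump AND example.

```python
def occurs(node, basic_events):
    """Return True if the event represented by this node occurs.

    A node is either a string (a basic cause) or a tuple (gate, branches)
    where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return node in basic_events
    gate, branches = node
    outcomes = [occurs(branch, basic_events) for branch in branches]
    if gate == "AND":
        return all(outcomes)   # every branch must occur
    if gate == "OR":
        return any(outcomes)   # one branch is enough
    raise ValueError(f"unknown gate: {gate!r}")


# Assumed structure: all gates treated as OR nodes for illustration.
food_is_cold = ("OR", ["oven malfunction", "timing error by chef",
                       "ingredients not defrosted"])
plate_is_cold = ("OR", ["plate warmer malfunction",
                        "plate taken too early from warmer", "cold plate used"])
below_temperature = ("OR", [food_is_cold, plate_is_cold])

# Hypothetical AND example: supply is lost only if both pumps fail.
supply_lost = ("AND", ["main pump fails", "backup pump fails"])

print(occurs(below_temperature, {"timing error by chef"}))  # True
print(occurs(supply_lost, {"main pump fails"}))             # False
```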
Fault-tree analysis for below-temperature food being served to customers
[Figure: fault tree, with a key distinguishing AND nodes from OR nodes.]
Food served to customer is below temperature
• Food is cold
◦ Oven malfunction
◦ Timing error by chef
◦ Ingredients not defrosted
• Plate is cold
◦ Plate warmer malfunction
◦ Plate taken too early from warmer
◦ Cold plate used
Improving Process Reliability
• Once the causes and effects of a failure are known, the next course of action is to try to prevent the failure from taking place. This can be done in a number of ways:
◦ Designing out fail points in the process
◦ Building redundancy into the process
◦ ‘Fail-safeing’ some of the activities in the process
◦ Maintenance of the physical facilities in the process
Designing out fail points
• Identifying and then controlling process, product and service characteristics to try to prevent failures.
• Use of process maps to detect potential fail points in operations.
Redundancy
• Building redundancy into an operation means having back-up systems in case of failure.
• Increases the reliability of a component.
• An expensive solution.
• Used for breakdowns with critical impact.
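How a back-up raises reliability can be sketched as below. The slides do not give a formula; this assumes the back-up is only needed when the primary fails, that the two fail independently, and that switchover always works. The 0.90 figure for the back-up is hypothetical.

```python
def reliability_with_backup(primary: float, backup: float) -> float:
    """Reliability of a component that has one back-up.

    The pair works if the primary works, or if the primary fails
    and the back-up works (independent failures assumed).
    """
    return primary + backup * (1.0 - primary)


# Cheese applicator (0.90) with an identical, hypothetical back-up unit.
combined = reliability_with_backup(0.90, 0.90)
print(round(combined, 4))  # 0.99
```

Under these assumptions a second 0.90 unit lifts the pair to 0.99, which illustrates why redundancy is reserved for critical components: the gain is bought by duplicating the equipment.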
Fail-safeing
• Called poka-yoke in Japan.
• Based on the principle that human mistakes are to some extent inevitable.
• The objective is to prevent them from becoming defects.
• Poka-yokes are simple (preferably inexpensive) devices or systems which are incorporated into a process to prevent inadvertent operator mistakes resulting in a defect.
Maintenance
• Maintenance is the method used by organizations to avoid failure by taking care of their physical facilities.
• Important to organizations whose physical facilities play a central role in creating their goods and services.
Benefits of maintenance:
• Enhanced safety
• Increased reliability
• Higher quality
• Lower operating costs
• Longer life span
• Higher end value
Approaches to maintenance
1. Run to breakdown (RTB)
• Allows the facilities to continue operating until they fail.
• Maintenance work is performed only after failure has taken place.
• Used when the effect of failure is neither catastrophic nor frequent – e.g. it does not paralyze the whole operation.
• Regular checks are sufficient.
2. Preventive maintenance (PM)
• Attempts to eliminate or reduce the
chances of failure by servicing the facilities
at pre-planned intervals.
• Used when the consequence of failure is
considerably more serious.
• Can be used to detect impending failures.
Remedial actions can be planned for, thus
improving overall efficiency.
• The useful life of certain components can be increased beyond their recommended life span.
3. Condition-based maintenance (CBM)
• Attempts to perform maintenance only when the facilities require it.
• May involve continuously monitoring parameters (vibration, temperature, displacement) of the facility.
• The results of the monitored parameters are used to decide whether to stop the facility to conduct maintenance.
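The condition-based decision can be sketched as a simple limit check. The parameter names, units and alarm limits below are all hypothetical; a real facility would take them from the equipment maker or from its own monitoring history.

```python
# Hypothetical alarm limits for the monitored parameters (assumed values).
LIMITS = {"vibration_mm_s": 7.0, "temperature_c": 85.0, "displacement_mm": 0.5}


def parameters_out_of_limit(readings, limits=LIMITS):
    """Return the monitored parameters whose latest reading exceeds its limit."""
    return sorted(name for name, value in readings.items() if value > limits[name])


def should_stop_for_maintenance(readings, limits=LIMITS):
    """Stop the facility if any monitored parameter is out of limit."""
    return bool(parameters_out_of_limit(readings, limits))


healthy = {"vibration_mm_s": 3.2, "temperature_c": 71.0, "displacement_mm": 0.2}
worn = {"vibration_mm_s": 9.4, "temperature_c": 88.5, "displacement_mm": 0.3}

print(should_stop_for_maintenance(healthy))  # False
print(parameters_out_of_limit(worn))         # ['temperature_c', 'vibration_mm_s']
```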
4. Mixed maintenance strategies
• Most operations adopt a mixture of these
approaches because different elements of their
facilities have different characteristics.
Use ???
Use ???
Use ???
5. Run to breakdown versus preventive maintenance
• The more frequently preventive maintenance is carried out, the lower the chance of the facility breaking down.
• The cost of preventive maintenance is often high.
• Infrequent preventive maintenance will cost less but will result in a higher chance of breaking down.
• The cost of an unplanned breakdown is often high.
[Figures: the cost of preventive maintenance and the cost of breakdowns, each plotted against the amount of preventive maintenance.]
Maintenance cost model 1: One model of the costs
associated with preventive maintenance shows an
optimum level of maintenance effort.
[Figure: total cost, cost of providing preventive maintenance and cost of breakdowns plotted against the amount of preventive maintenance, with the ‘optimum’ level of preventive maintenance marked.]
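The idea of an optimum level of preventive maintenance can be illustrated with a small numerical sketch. The two cost functions below are invented for illustration only: the cost of providing preventive maintenance is assumed to rise linearly with the amount of maintenance, and the cost of breakdowns is assumed to fall as maintenance increases.

```python
def pm_cost(level: int) -> float:
    """Assumed cost of providing preventive maintenance (rises with level)."""
    return 10.0 * level


def breakdown_cost(level: int) -> float:
    """Assumed cost of breakdowns (falls as maintenance increases)."""
    return 360.0 / (1 + level)


def total_cost(level: int) -> float:
    return pm_cost(level) + breakdown_cost(level)


levels = range(0, 21)                   # candidate amounts of preventive maintenance
optimum = min(levels, key=total_cost)   # level with the lowest total cost
print(optimum, total_cost(optimum))     # 5 110.0
```

With these assumed curves, total cost falls while each extra unit of maintenance saves more in breakdowns than it costs, and rises once that is no longer true.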
Maintenance cost model 2: an optimum level of
maintenance effort.
[Figures: three panels, each plotting costs against the amount of preventive maintenance. The first compares the Model 1 cost of providing preventive maintenance with the actual cost of providing it. The second compares the Model 1 cost of breakdowns with the actual cost of breakdowns. The third shows the resulting total cost, cost of breakdowns and cost of providing preventive maintenance.]
Notes:
• In actuality, the cost of PM does not increase as steeply as indicated in Model 1.
◦ Model 1 assumes that all maintenance jobs must be carried out by a specialist maintenance team, but Model 2 recognizes that operators themselves can carry out simple, in-process maintenance. Etc…
• The cost of breakdowns could be higher than indicated in Model 1.
◦ A breakdown may cost more than the cost of repair and the cost of the stoppage itself – a stoppage can take away the stability of the operation.
Run To Breakdown or Preventive Maintenance?
Based on the arguments above, the balance shifts towards the use of preventive maintenance.
6. Failure distributions
• The shape of the failure probability
distribution of a facility can determine if it
benefits from preventive maintenance.
[Figure: probability of failure plotted against time for Machine A and Machine B, with times x and y marked on the time axis.]
Notes:
• Machine A
◦ The probability that it will break down before time x is relatively low.
◦ It has a high probability of breaking down between times x and y.
◦ If preventive maintenance is carried out just before point x, the chances of breakdown can be reduced.
Notes:
• Machine B
◦ It has a relatively high probability of breaking down at any time.
◦ Its failure probability increases gradually as it passes through time x.
◦ Carrying out preventive maintenance at point x, or at any other point, cannot dramatically reduce the probability of failure.
Total Productive Maintenance (TPM)
Approach
Total productive maintenance (TPM) is defined as:
…the productive maintenance carried
out by all employees through small group
activities…
Where productive maintenance is:

…maintenance management which
recognizes the importance of reliability,
maintenance and economic efficiency in plant
design…
The five goals of TPM:
1. Improve equipment effectiveness:
• Examine how the facilities contribute to the
effectiveness of the operation by examining
all the losses which occur.
2. Achieve autonomous maintenance:
• Allow people who operate the equipment to take responsibility for some maintenance tasks.
• Encourage maintenance staff to take responsibility for the improvement of maintenance performance.
There are three levels at which maintenance
staff can take responsibility for process
reliability:
• Repair level – staff carry out instructions but
do not predict the future, they simply react to
problems.
• Prevention level – staff can predict the future
by foreseeing problems, and take corrective
action.
• Improvement level – staff can predict the
future by foreseeing problems, they not only take
corrective action but also propose improvements
to prevent recurrence.
Example:
Suppose the screws on a machine become loose. Each
week it jams up and is passed to maintenance to be
fixed.
• A ‘repair level’ maintenance engineer will simply
repair it and hand it back to production.
• A ‘prevention level’ maintenance engineer will
spot the weekly pattern to the problem and
tighten the screws in advance of their loosening.
• An ‘improvement-level’ maintenance engineer
will recognize that there is a design problem and
modify the machine so that the problem cannot
recur.
The five goals of TPM (cont):
3. Plan maintenance:
• To have a fully worked out approach to all
maintenance activities. Includes
– the level of preventive maintenance
required for each piece of equipment.
– the standard for condition-based
maintenance
– the respective responsibilities of operating
staff and maintenance staff. See Slide 19.55
4. Train all staff in relevant maintenance skills:
• TPM emphasises appropriate and continuous training to ensure staff have the skills to carry out their roles.
The roles and responsibilities of operating staff and maintenance staff in TPM
                   Maintenance staff                Operating staff
Roles              To develop:                      To take on:
                   • Preventive actions             • Ownership of facilities
                   • Breakdown services             • Care of facilities
Responsibilities   • Train operators                • Correct operation
                   • Devise maintenance practice    • Routine preventive maintenance
                   • Problem-solving                • Routine condition-based maintenance
                   • Assess operating practice      • Problem detection
The five goals of TPM (cont):
5. Achieve early equipment management:
• This goal is directed at avoiding
maintenance altogether by ‘maintenance
prevention’ (MP).
• MP involves considering root causes of
failure and maintainability of equipment
during the design stage, manufacture,
installation and its commissioning.
Reliability Centred Maintenance (RCM)
Approach
1. TPM tends to recommend preventive maintenance even when it is not appropriate; RCM is a response to this.
2. RCM uses the pattern of failure for each type of failure mode to dictate the approach to maintenance.
3. The approach of RCM is sometimes summarized as “If we cannot stop it from happening, we had better stop it from mattering” – efforts need to be directed at reducing the impact of the failure.
Example:
This is a simple shredding process which prepares vegetables prior to freezing. The part of the process which requires the most maintenance attention is the cutter sub-assembly. However, the cutters have several modes of failure:
1) They require changing because they have worn out through usage.
2) They have been damaged by small stones entering the process.
3) They have shaken loose because they were not fitted correctly.
One part in one process can have several different failure modes, each of which requires a different approach.
[Figures: failures plotted against time for each failure mode of the cutter in the shredding process.]
• Cutter ‘wear out’ failure pattern. Solution: preventive maintenance before end of useful life.
• Cutter ‘damage’ failure pattern. Solution: prevent damage, fix stone screen.
• Cutter ‘shake loose’ failure pattern. Solution: ensure correct fitting through training.
Conclusion
Failure detection and the provision of solutions must be handled with a multifaceted approach. It requires the cooperation of operators, maintenance personnel/professionals and management.
The cost of failure is very substantial, so it makes economic sense to prevent failure in the first instance and, if it does occur, to reduce its effect to the barest minimum.
A VERY BIG THANK YOU TO
ALL OF YOU
ANY QUESTIONS
