
PRINCIPLES OF

SYSTEM SAFETY
ENGINEERING AND MANAGEMENT
Felix Redmill
Redmill Consultancy, London
Felix.Redmill@ncl.ac.uk

RISK

(c) Felix Redmill, 2011

CERN, May '11

SAFETY ENGINEERING AND MANAGEMENT


It is necessary both to achieve appropriate safety and to
demonstrate that it has been achieved
Achieve - not only in design and development, but in all
stages of a system's life cycle
Appropriate - to the system and the circumstances
Demonstrate - that all that could reasonably have been
done has been done, at every stage of the life cycle


THE U.K. LAW ON SAFETY


Health and Safety at Work Etc. Act 1974:
Safety risks imposed on others (employees and the public
at large) must be reduced so far as is reasonably
practicable (SFAIRP)


THE HSE'S ALARP PRINCIPLE


(Regions listed in order of decreasing risk)

Unacceptable Region
(Risk cannot be justified except in
extraordinary circumstances)

Limit of tolerability threshold

ALARP or Tolerability Region
(Risk is tolerable only if its reduction is
impracticable or if the cost of reduction is
grossly disproportionate to the improvement
gained)

Broadly acceptable threshold

Broadly Acceptable Region
(Risk is tolerable without reduction. But it is
necessary to maintain assurance that it
remains at this level)


CALIBRATION OF THE ALARP MODEL


Recommended for Nuclear Industry
Intolerability threshold:
1/10000 per year (for the public)
1/1000 per year (for employees)
Broadly acceptable threshold:
1/1000000 per year (for everyone)
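
The calibration above can be sketched as a small classifier. This is only an illustration: the function name, the group labels and the treatment of values exactly on a threshold are assumptions, not part of the HSE guidance.

```python
from enum import Enum


class Region(Enum):
    UNACCEPTABLE = "unacceptable"
    ALARP = "ALARP or tolerability"
    BROADLY_ACCEPTABLE = "broadly acceptable"


# Thresholds quoted on the slide (risk of death per year).
INTOLERABILITY_THRESHOLD = {"public": 1 / 10_000, "employee": 1 / 1_000}
BROADLY_ACCEPTABLE_THRESHOLD = 1 / 1_000_000


def classify(risk_per_year: float, group: str) -> Region:
    """Place an estimated annual risk in one of the three ALARP regions.

    Assumption: a value exactly on a threshold is assigned to the
    lower-risk region; the slide does not say how boundaries are treated.
    """
    if risk_per_year > INTOLERABILITY_THRESHOLD[group]:
        return Region.UNACCEPTABLE
    if risk_per_year <= BROADLY_ACCEPTABLE_THRESHOLD:
        return Region.BROADLY_ACCEPTABLE
    return Region.ALARP
```

Note that the same estimate, say 5 in 10,000 per year, falls in different regions for the public and for employees, because the intolerability thresholds differ.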


A VERY SIMPLE SYSTEM


Chemicals A and B are mixed in the tank to form product P
P opens & closes input and output valves
If emergency signal arrives, then cease operation
[Figure: tank fed by chemicals A and B, producing P, with a controller that receives the emergency signal]

QUESTIONS
How could the accident have been avoided?

Better algorithm
How could the software designer have known that a better
algorithm was required?
Domain knowledge
But we can't be sure that such a fault won't be made, so
how can we find and correct such faults?
Risk analysis techniques


A SIMPLE THOUGHT EXPERIMENT


Draw an infusion pump
Note its mechanical parts
Think about designing and developing software to control
its operation
Safety consideration: delivery of either too much or too
little of a drug could be fatal
How would you guarantee that it would kill no more than 1
in 10,000 patients per year?

What level of confidence - on a percentage scale - do
you have in your estimate?

CONFIDENCE IN SAFETY IN ADVANCE


What is safety?
How can it be measured?
What can give confidence that safety is high?

Need also to demonstrate safety in advance


Therefore, need to find hazards in advance


NEW RISKS OF NEW TECHNOLOGY: STRASBOURG AIR CRASH


Mode: climbs and descents in degrees to horizontal
3.3° descent represented as 'minus 3.3'
Mode: climbs and descents in units of 100 feet/minute
3300 feet/minute descent represented as 'minus 33'
Plane was descending at 3300 feet/minute
Needed to descend at angle of 3.3° to horizontal
To interpret correctly, pilot needed to know the mode

Mode error in the system


Human error? If so, which human made it?
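
The ambiguity can be sketched in a few lines. The function and mode names here are hypothetical and are not taken from any real avionics interface; the point is only that near-identical entries mean very different things depending on the mode.

```python
def decode(entry: float, mode: str):
    """Interpret a descent entry under one of the two modes on the slide.

    Hypothetical sketch: 'angle' treats the entry as degrees to the
    horizontal; 'vertical_speed' treats it as units of 100 feet/minute.
    """
    if mode == "angle":
        return (entry, "degrees")
    if mode == "vertical_speed":
        return (entry * 100, "feet/minute")
    raise ValueError(f"unknown mode: {mode}")


# The intended descent, and what a similar-looking entry means in the other mode.
intended = decode(-3.3, "angle")
actual = decode(-33, "vertical_speed")
```

Nothing in the number itself tells the pilot which interpretation is active, which is why the mode has to be known independently.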

CONCERN WITH TECHNOLOGICAL RISKS


We are concerned no longer exclusively with making nature
useful, or with releasing mankind from traditional
constraints, but also and essentially with problems resulting
from techno-economic development itself [Beck]
Risks induce often irreversible harm; not restricted to their
places of origin but threaten all forms of life in all parts of the
planet; in some cases could affect those not alive at the time
or place of an accident [Beck]
People who build, design, plan, execute, sell and maintain
complex systems do not know how they may work [Coates]

RISK - AN IMPORTANT SUBJECT


Risk is a subject of research in many fields,
e.g. Psychology, Sociology, Anthropology, Engineering
It is a critical element in many fields
e.g. Geography (climate change), Agriculture, Sport,
Leisure activities, Transport, Government policy
It influences government policy (e.g. on bird flu) and
local government decisions (e.g. closure of children's
playgrounds)
An influential sociological theory is that it is the most
influential factor in modern society
(we are in the 'risk society')

Every activity carries risk


All decisions are risky

FUNCTIONAL SAFETY:
ACHIEVING UTILITY AND CREATING RISK
We are concerned with safety that depends on the correct
functioning of equipment

Control system -> Equipment under control -> Utility (plus risks)

FUNCTIONAL SAFETY 2
Functional safety depends on hardware, software,
humans, data, and interactions between all of these

Control system -> Equipment under control -> Utility (plus risks)

Humans (design, operation, maintenance, etc.) (human factors)
Environment (management, culture, etc.)

ORGANISATIONS ARE ALSO COMPLEX


(Vincristine example)
Child received injection for leukaemia into spine instead of
into vein
Root cause analysis showed 40 points at which the
accident could have been averted
Complexity of modern organisational systems
Need to identify risks in advance
But no standard can do this for us
Requires a risk-based approach


SAFETY - A DEFINITION
Safety: freedom from unacceptable risk (IEC)
Safety is not directly measurable
But it may be addressed via risk


VOCABULARY
In common usage, 'risk' is used to imply:
Likelihood (e.g. 'there's a high risk of infection')
Consequence (e.g. 'infection carries a high risk')
A combination of the two
Something, perhaps unspecified, to be avoided (e.g.
'going out into the night is risky')


RISK - DEFINITIONS
Risk: A combination of the probability of occurrence and the
severity of its consequences if the event did occur (IEC)


Tolerable risk: a willingness to live with a risk, so as to secure
certain benefits, in the confidence that the risk is one that is
worth taking and that it is being properly controlled (HSE)


SAFETY AND RISK


Safety is only measurable in retrospect

Safety is gauged by trying to understand risk


Risk, being of the future, is estimable but not measurable
The higher the risk, the lower the confidence in safety
We increase safety by reducing risk
Two components of risk are significant:
Likelihood of occurrence
Magnitude of outcome


SOME PRINCIPLES
Absolute safety (zero risk) cannot be achieved
Doing it well does not guarantee safety
Correct functionality ≠ safety
We must address safety as well as functionality
Reliability is not a guarantee of safety
We require confidence of safety in advance, not
retrospectively
We must not only achieve safety but also demonstrate it


A RISK-BASED APPROACH
If safety is addressed via risk
We must base our safety-management actions on risk
We must understand the risks in order to manage safety


SAFETY AS A WAY OF THINKING


If risk is too high, it must be reduced

But how do we know how high it is?


Carry out risk analysis
But even if the risk is high, does it need to be reduced?
We must understand what risk is tolerable in the
circumstances (in the UK, apply the ALARP Principle)
Achieving safety demands a combination of engineering
and management approaches


RISK - TWO COMPONENTS


Two components:
Probability (likelihood) of occurrence
Consequence (magnitude of outcome)

R = f(P.C) or f(L.C)


A SIMPLE CALCULATION
Probability of a 100-year flood = 0.01/year
Expected damage = 50M
R (financial, Expected Value)
= 0.01 x 50,000,000 = 500,000/year
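
The same arithmetic as a short sketch. The function name is illustrative, and the currency is left unspecified, as on the slide.

```python
def expected_annual_loss(probability_per_year: float, consequence: float) -> float:
    """Risk as an expected value: probability of the event times its cost."""
    return probability_per_year * consequence


# A '100-year flood' has an annual probability of 1/100.
flood_risk = expected_annual_loss(1 / 100, 50_000_000)
```

The result is an average over many years, not a prediction for any one year: in most years the loss is zero, and in a flood year it is the full 50M.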


CONFIDENCE IN RISK VALUES


Accuracy of the result depends on the reliability of information
Reliability of information depends on the pedigree of its source
What are the sources of the probability and consequence
values?

What are their pedigrees?


What confidence do we have in the risk values?
Why do we need confidence?

Because we derive risk values in order to inform decisions



CONFIDENCE IN RISK CALCULATIONS


Where did the information come from?
What trust do we have in the source?
When and by whom was it collected?
Was it ever valid? If so, is it still valid?
What assumptions have we made?
How valid are they?
Are we aware of them?


COT DEATH CASE


A study concluded that the risk of a mature, non-smoking,
affluent couple suffering a cot death is 1 in 8543
Prof. Meadow deduced that Pr. of two cot deaths in the
same family = 1/8543 x 1/8543 = 1 in 73 million
Two cot deaths in her family resulted in Mrs Clark being
convicted of infanticide


DEPENDENCE AND NOT RANDOMNESS


But deaths in the same family are not independent events
(multiplying the two probabilities assumes independence)
One death in the family placed Mrs Clark in the
highest-risk category for another
Probability of a second death is considerably greater
than 1 in 8543
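
A minimal sketch of the difference. Only the 1-in-8543 figure comes from the study quoted above; the dependence multiplier is a purely hypothetical value chosen to show the effect, not an estimate from any study.

```python
P_SINGLE = 1 / 8543  # figure quoted from the study


def p_two_deaths(p_first: float, p_second_given_first: float) -> float:
    """General multiplication rule: P(A and B) = P(A) * P(B | A)."""
    return p_first * p_second_given_first


# Independence assumption: P(B | A) is taken to equal P(B).
independent = p_two_deaths(P_SINGLE, P_SINGLE)
odds_if_independent = 1 / independent  # about 73 million to one

# Hypothetical dependence: suppose a first death made a second one
# 10 times more likely (illustrative value only).
HYPOTHETICAL_MULTIPLIER = 10
dependent = p_two_deaths(P_SINGLE, HYPOTHETICAL_MULTIPLIER * P_SINGLE)
```

Whatever the true multiplier is, the joint probability scales with it directly, so the headline figure stands or falls with the independence assumption.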


COMMON-MODE FAILURES
Identifying common-mode failures is a crucial part of
traditional risk analysis


ASSUMPTIONS
We never have full knowledge

We fill the gaps with assumptions


Assumptions carry uncertainty, and therefore risk
Recognise your assumptions (if possible)

If you admit to them (document them)


Readers will recognise uncertainty
Other persons may provide closer approximations


CONTROL OF RISK
Risk is eliminated if either Pr or C is reduced to zero

And risk is reduced by reduction of Pr or C or both


In many cases we have no control over C, but we may be
able to estimate it
It may be difficult or impossible to derive Probability, but
we may be able to reduce it (e.g. by software testing &
fixing)


WHY TAKE RISKS?


Progress demands risk
e.g. exploration, bridge design
e.g. drug development (TeGenero's TGN1412)
Technology provides utility - at the cost of risk
Suppliers want cheap designs
e.g. Ronan Point (need to foresee hazards)
To save money
e.g. Cutting corners on a project
To save face
e.g. 'I'd rather die than make a fool of myself'
We can't avoid taking risks in every decision and action

RISK VS. RISK


Decision is often not between risk and no risk
But between one risk and another

Decisions may create risks (unintended consequences)


Surgery, medication, or keep the illness
Consequences of being late vs. risks of driving fast
Solar haze vs. global warming

Software maintenance can introduce a new defect


Change creates a new product; test results obsolete
Recognise both risks
Assess both
Carry out impact analysis to identify new hazards

SOME NOTES ON RISK


A single-system risk may be tiny, but the overall risk of many
systems may be high (individual vs. societal risk)
Risk per usage may be remote, but daily use may accrue a
high risk (one-shot vs. lifetime risk)
Beware of focusing on a single risk; a system is likely to carry
numerous risks
Confidence in probabilities attributed to rare events must be
low
For random events with histories (e.g. electromechanical
equipment failure) frequencies may be deduced
Given similar circumstances, they may be predictive

For systematic events (such as software failure) history is not
an accurate predictor of the future
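
The first two notes (many systems, repeated use) can be made concrete with a short sketch. The numbers are hypothetical, and the calculation assumes independent events with a constant probability, which is itself an assumption to be questioned.

```python
def cumulative_risk(p_per_event: float, n_events: int) -> float:
    """Probability of at least one occurrence in n independent events."""
    return 1 - (1 - p_per_event) ** n_events


def expected_occurrences(p_per_event: float, n_events: int) -> float:
    """Expected number of occurrences across n events or n systems."""
    return p_per_event * n_events


# Hypothetical figures: a one-in-a-million risk per use,
# used once a day for 50 years (18,250 uses).
lifetime = cumulative_risk(1e-6, 50 * 365)

# The same per-system risk spread over ten million systems.
societal = expected_occurrences(1e-6, 10_000_000)
```

With these illustrative values, a risk that looks remote per use becomes a lifetime risk of nearly 2%, and across ten million systems about ten occurrences are expected.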

RISK AND UNCERTAINTY


Risk is of the future - there is always uncertainty
Risk may be estimated but not measured
Risk is open to perception
Identifying and analysing risks requires information
The quality of information depends on the source
'Facts' are open to different interpretations
Risk cannot be eliminated if goals are to be achieved
Risk management activities may introduce new risks
In the absence of certainty, we need to increase confidence

Reducing uncertainty depends on gathering information



RISK MANAGEMENT STRATEGIES

Avoid
Eliminate
Reduce
Minimise - within defined constraints
Transfer or share
- Financial risks may be insured
- Technical risks may be transferred to experts, maintainers
Hedge
- Make a second investment, which is likely to succeed if the
first fails
Accept
- Must do this in the end, when risks are deemed tolerable
- Need contingency plans to reduce consequences

RISK IS OPEN TO PERCEPTION

Voluntary or involuntary
Control in hands of self or another
Statistical or personal
Level of knowledge, uncertainty
Level of dread or fear evoked
Short-term or long-term view
Severity of outcome
Value of the prize
Level of excitement
Status quo bias

Perception determines where we look for risks and what
we identify as risks

PROGRAMME FOR COMBATTING DISEASE


(1)
The country is preparing for the outbreak of a disease
which is expected to kill 600 people. Two alternative
programmes to combat it have been proposed, and the
experts have deduced that the precise estimates of the
outcomes are:

If programme A is adopted, 200 people will be saved


If programme B is adopted, there is a one third probability
that 600 people will be saved and a two thirds probability
that nobody will be saved


PROGRAMME FOR COMBATTING DISEASE


(2)
The country is preparing for the outbreak of a disease
which is expected to kill 600 people. Two alternative
programmes to combat it have been proposed, and the
experts have deduced that the precise estimates of the
outcomes are:

If programme C is adopted, 400 people will die


If programme D is adopted, there is a one third probability
that nobody will die and a two thirds probability that 600
people will die
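
A short check of the arithmetic behind the two framings, using exact fractions. The helper name is illustrative; the slides pose the choice and do not include this calculation.

```python
from fractions import Fraction

AT_RISK = 600


def expected_value(outcomes):
    """Sum of probability * value over (probability, value) pairs."""
    return sum(p * v for p, v in outcomes)


# Framing 1: outcomes stated as lives saved.
saved_a = expected_value([(Fraction(1), 200)])
saved_b = expected_value([(Fraction(1, 3), 600), (Fraction(2, 3), 0)])

# Framing 2: outcomes stated as deaths, converted to lives saved.
saved_c = AT_RISK - expected_value([(Fraction(1), 400)])
saved_d = AT_RISK - expected_value([(Fraction(1, 3), 0), (Fraction(2, 3), 600)])
```

All four programmes have the same expected outcome, and A is the same programme as C and B the same as D. Any difference in preference between the two versions therefore comes from the framing, not from the numbers.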


UNINTENDED CONSEQUENCES
Actions always likely to have some unintended results
Downsizing, but losing the wrong staff

e.g. University redundancies in UK


Staff come to rely entirely on a support system
Building high in mountain causes landslide
Abolishing DDT increased malaria
Store less chemical at plant, more trips by road
Unintended, but not necessarily unforeseeable
Foreseeable, but not necessarily obvious

Carry out hazard and risk analysis


Better to foresee them than encounter them later

RISK COMMUNICATION
The risk analyst is usually not the decision maker
Risk information must be transferred (liver biopsy)
Usually only results are communicated
e.g. 'There's an 80% chance of success'
But are they correct (Bristol Royal Infirmary)?
Overconfidence bias
Managers rely heavily on risk information, collected,
analysed, packaged, by other staff
But what confidence does the staff have in it?
What were the information sources, analysis methods,
and framing choices?
What uncertainties exist?
What assumptions were made?

COMMUNICATION OF RISK INFORMATION


The risk is one in a million

One in seventeen thousand of dying in a road accident


Age, route, time of day
Framing
Appropriate presentation of numbers
1 x 10^-6
One in a million
One in a city the size of Birmingham
(A launch per day for 300 years with only one failure)


RISK IS TRICKY
We manage risks daily, with reasonable success
Our risk management is intuitive

We do not recognise mishaps as the results of poor
risk management
Risk is a familiar subject
We do not realise what we don't know
We do not realise that our intuitive techniques for
managing simple situations are not adequate for complex
ones
Risk estimating can be simple arithmetic
The difficult part is obtaining the correct information on
which to base estimates - and decisions

SAFETY ENGINEERING PRINCIPLES


A SIMPLE MODEL OF SYSTEM SAFETY

[Figure: state diagram with the states Safe state (in any mode), Failure or unsafe deviation, Danger, Accident and Disaster, and return paths labelled Recovery and Restoration]

SAFETY ACROSS THE LIFE CYCLE


Safety activities must extend across the life of a system

And must be planned accordingly


Modern safety standards call for the definition and use of a
safety life cycle
A life-cycle model shows all the phases of a system's life,
relative to each other
The overall safety lifecycle defined in safety standard IEC
61508 is the best known model


SAFETY LIFE CYCLE MODELS


Provide a framework for planning safety engineering and
management activities
Provide a guide for creating an infrastructure for the
management of project and system documentation
Remind engineers and managers at each phase that they
need to take other phases into consideration in their
planning and activities
Facilitate the demonstration as well as the achievement of
safety
The model in safety standard IEC 61508 is the best known


OVERALL SAFETY LIFECYCLE


1 Concept
2 Overall scope definition
3 Hazard and risk analysis
4 Overall safety requirements
5 Safety requirements allocation
Overall planning of:
6 O&M
7 Safety validation
8 Installation & commissioning
Realisation of:
9 Safety-related E/E/PES
10 Other tech. safety-related systems
11 External risk reduction facilities
12 Overall installation and commissioning
13 Overall safety validation
14 Overall operation, maintenance & repair
15 Overall modification and retrofit
16 Decommissioning or disposal

AFTER STAGE THREE


Work done after stage 3 of the model creates additions to
the overall system that were not included in the hazard
and risk analysis of stage 3


SAFETY LIFECYCLE SUMMARY


Understand the functional goals and design

Identify the hazards

Analyse the hazards

Determine and assess the risks posed by the hazards

Specify the risk reduction measures and their SILs

Define the required safety functions (and their SILs)

Carry out safety validation

Operate, maintain, change, decommission, dispose safely



THE SAFETY-CASE PRINCIPLE


The achievement of safety must be demonstrated in advance
of deployment of the system
Demonstration is assessed by independent safety assessors
Safety assessors will not (cannot)
Examine a complex system without guidance
Assume responsibility for the system's safety
Their assessment is guided by claims made by developers,
owners and operators of the system
The claims must be for the adequacy of safety in defined
circumstances, considering the context and application and
the benefits of accepting any residual risks

THE SAFETY CASE


The purpose: Demonstration of adequate and appropriate
safety of a system, under defined conditions
To convince ourselves of adequate safety
The basis of independent safety assessment
Provides later requirements for proof
(It can protect and it can incriminate)
The means
Make claims of what is adequately safe
And for what it is adequately safe
Present arguments for why it is adequately safe for the
intended purpose
Provide evidence in support of the arguments

GENERALIZED EXAMPLE
Claim: System S is acceptably safe when used in Application A
Claims are presented as structured arguments

The use of System S in Application A was subjected to
thorough hazard identification
All the identified hazards were analysed
All risks associated with the hazards either were found to be
tolerable or have been reduced to tolerable levels
Emergency plans are in place in case of unexpected
hazardous events
The safety arguments are supported by evidence

e.g., Evidence of all activities and results, descriptions of all
relevant documentation, and references to it
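
The claim-argument-evidence structure of the generalized example can be sketched as a small data model. The class names, fields and evidence references are hypothetical, and this is not GSN or the format of any safety-case tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    statement: str
    evidence: List[str] = field(default_factory=list)  # references to documents


@dataclass
class Claim:
    statement: str
    arguments: List[Argument] = field(default_factory=list)

    def unsupported(self) -> List[Argument]:
        """Arguments that have no evidence behind them yet."""
        return [a for a in self.arguments if not a.evidence]


# Hypothetical evidence references, for illustration only.
claim = Claim(
    "System S is acceptably safe when used in Application A",
    [
        Argument("Thorough hazard identification was carried out",
                 ["hazard identification report"]),
        Argument("All the identified hazards were analysed",
                 ["hazard analysis records"]),
        Argument("All risks are tolerable or reduced to tolerable levels",
                 ["risk assessment records"]),
        Argument("Emergency plans are in place"),  # no evidence recorded yet
    ],
)
gaps = claim.unsupported()
```

Even a structure this simple makes gaps visible early: any argument without evidence is a weak point that an independent assessor would find.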

THE CASE MUST BE STRUCTURED


Demonstration of adequate safety of a system requires
demonstration of justified confidence in its components
Some components may be COTS (commercial off-the-shelf)
Don't wait until too late to find that they cannot be justified
Evidence is derived from different sources, and different
categories of sources
The evidence for each claim must be structured so as to
support a logical argument for the top-level claim


PRESENTATION OF SAFETY CLAIMS


The claims must be structured so that the logic of the overall
case is demonstrated
The principal claims may be presented in a top-level
document
With references out to the sources of supporting evidence
(gathered throughout the life cycle)
Modern software-based tools are available for the
documentation of safety cases (primarily GSN, the Goal Structuring
Notation)


THE NATURE OF EVIDENCE


Evidence may need to be probabilistic
e.g. for nuclear plants
It may be qualitative
e.g. for humans, software
In applications of human control it may need to include
Competence, training, management, guidelines
It is collected throughout the system's life
Importantly during development


EVIDENCE PLANNING - 1
Any system-related documentation (project, operational,
maintenance) may be required as evidence

It should be easily and quickly accessible to


Creators of safety arguments
Independent safety assessors
Numbering and filing systems should be designed appropriately

Principal safety claims should be identified in advance


Activities may need to be planned so that appropriate
evidence is collected and stored
The safety case should be developed throughout a development
project
And maintained throughout a system's life

EVIDENCE PLANNING - 2
The safety case structure should be designed early
Evidential requirements should be identified early and
planned for
Safety case development should be commenced early
Evidence of software adequacy may be derived from

Analysis
Testing
Proven-in-use
Process


VALIDITY OF A SAFETY CASE


Validity (and continued validity) depends on (examples only)
Observance of design and operational constraints, e.g.:
Use only specified components
Don't exceed maximum speed
Environment, e.g.:
Hardware: atmospheric temperature between T1 and T2 °C
Software: operating system remains unchanged
Assumptions remain valid, e.g.:
Those underlying estimation of occurrence probability

Routine maintenance carried out according to spec.


HAZARD AND RISK ANALYSIS


NEED FOR CLARITY


Is a software bug a risk?
Is a banana skin on the ground a risk?
Is a bunker on a golf course a risk?


THE CONCEPT OF A HAZARD


A hazard is the source of risk
The risks that may arise depend on context
What outcomes could result from the hazard?
What consequences might the outcomes lead to?

Hazards form the basis of risk estimation


GOLF-SHOT RISK
What is the probability of hitting your golf ball into a bunker?
What is the consequence (potential consequence) of hitting
your ball into a bunker?

Should I take the risk?


In this case: What level of risk should I take?
What's missing is a question about benefit!


WHAT IS RISK ANALYSIS?


If we take risk to be a function of Probability (Pr) &
Consequence (C)
Then, in order to estimate a value of a Risk, we need to
derive values of Pr and C for that risk
Values may be qualitative or quantitative
They may need to be more or less accurate, depending on
importance
Thus, risk analysis consists of:
Identifying relevant hazards
Collecting evidence that could contribute to the
determination of values of Pr and C for a defined risk
Analysing that evidence
Synthesising the resulting values of Pr and C
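
When the values of Pr and C are qualitative, the synthesis step is often done with a lookup rather than a multiplication. The sketch below assumes three hypothetical categories on each scale and an illustrative mapping to risk classes; none of these labels come from the slides or from a standard.

```python
# Hypothetical ordinal scales, lowest to highest.
LIKELIHOOD = ["remote", "occasional", "frequent"]
CONSEQUENCE = ["minor", "serious", "catastrophic"]

# Illustrative mapping from combined rank (0..4) to a risk class.
RISK_CLASS_BY_RANK = ["low", "low", "medium", "high", "high"]


def risk_class(likelihood: str, consequence: str) -> str:
    """Combine qualitative Pr and C values into a qualitative risk class."""
    rank = LIKELIHOOD.index(likelihood) + CONSEQUENCE.index(consequence)
    return RISK_CLASS_BY_RANK[rank]
```

The categories and the mapping carry judgement, so they should be defined and documented before the analysis rather than adjusted to fit its results.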

PROBABILITY OF WHAT EXACTLY?


Automobile standards focus on the control of a
vehicle (and the possible loss of it)
Each different situation will
Occur with a different probability
Result in different consequences
Carry different risks


CHOICE OF CONSEQUENCE ESTIMATES


Worst possible
Worst credible
Most likely
Average


DETERMINATION OF RISK VALUES


[Figure: diagram relating sources of risk (threat to security, safety hazard, threat of damage, potential for unreliability, potential for unavailability) to Risk via causal analysis and consequence analysis]

PRELIMINARY REQUIREMENTS
Knowledge of the subject of risk
Understanding of the current situation and context
Knowledge of the purpose of the intended analysis
The questions (e.g. tolerability) that it must answer
Such knowledge and understanding are essential to
searching for the appropriate information


BOTTOM-UP ANALYSIS
(Herald of Free Enterprise)
Bosun asleep in his cabin when ship is due to depart

Bow doors not closed

Ship puts to sea with bow doors open

Water enters car deck

As ship rolls, water rushes to one side

Ship capsizes

Lives lost


TOP-DOWN ANALYSIS
(Herald of Free Enterprise)
Ship puts to sea with bow doors open
  Bosun did not close doors
    Bosun not available to close doors
      Bosun not on ship
      Bosun on board but not at station
        Bosun asleep in cabin
        Bosun in bar
    Problem with doors & bosun can't close them
      Problem with closing mechanism
      Door or hinge problem
      Problem with power supply
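
A top-down analysis of this kind can be held as a simple tree and queried for its lowest-level causes. The nesting below is one reading of the diagram, so treat it as an assumption, and the function name is illustrative.

```python
# Each node is (description, list of contributing causes).
TREE = ("Ship puts to sea with bow doors open", [
    ("Bosun did not close doors", [
        ("Bosun not available to close doors", [
            ("Bosun not on ship", []),
            ("Bosun on board but not at station", [
                ("Bosun asleep in cabin", []),
                ("Bosun in bar", []),
            ]),
        ]),
        ("Problem with doors and bosun can't close them", [
            ("Problem with closing mechanism", []),
            ("Door or hinge problem", []),
            ("Problem with power supply", []),
        ]),
    ]),
])


def basic_events(node):
    """Return the leaf causes beneath a node, in left-to-right order."""
    description, causes = node
    if not causes:
        return [description]
    return [event for cause in causes for event in basic_events(cause)]


leaves = basic_events(TREE)
```

The bottom-up chain on the previous slide follows only one of these leaves (bosun asleep in cabin); the top-down view shows the others that would lead to the same top event.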

RISK MANAGEMENT PROCESS


Define scope of study
Identify the hazards
Analyse the hazards to determine the risks they pose
Assess risks against tolerability criteria

Take risk-management decisions and actions


It may also be appropriate to carry out emergency
planning and prepare for the unexpected
If so, we need to carry out rehearsals


FOUR STAGES OF RISK ANALYSIS


(But be careful with vocabulary)
Definition of scope
Define the objectives and scope of the study
Hazard identification
Define hazards and hazardous events
Hazard analysis
Determine the sequences leading to hazardous events
Determine likelihood and consequences of hazardous
events
Risk assessment
Assess tolerability of risks associated with hazardous
events


DEFINITION OF SCOPE
Types of risks to be studied
e.g. safety, security, financial
Risks to whom or what
e.g. employees, all people, environment, the company,
the mission
Study boundary
Plant boundary
Region
Admissible sources of information
e.g. stakeholders (which?), experts, public


SCOPE OF STUDY
How will the results be used?

What questions are to be answered?


What decisions are to be made?
What accuracy and confidence are required?
Define study parameters
What effort to be invested?
What budget afforded?
How much time allowed?


OUTCOME SUBJECTIVELY INFLUENCED


Defining scope is subjective
Involves judgement
Includes bias
Can involve manipulation

Scope definition
Influences the nature and direction of the analysis
Is a predisposing factor on its results


VOCABULARY - CHOICE OF TERMS


Same term means different things to different people
Same process given different titles
There is no internationally agreed vocabulary
Even individuals use different terms for the same process
Beware: ask what others mean by their terms
Have a convention and define it


SOME TERMS IN USE


Hazard identification,
Risk identification
Risk analysis,
Risk
assessment,
Risk
management

Hazard analysis,
Risk analysis
Risk assessment,
Risk evaluation
Risk mitigation,
Risk reduction,
Risk management


AN APPROPRIATE CONVENTION?
Risk management:
  Risk analysis:
    Scope definition
    Hazard identification
    Hazard analysis
    Risk assessment
  Risk communication
  Risk mitigation
  Emergency planning

HAZARD AND RISK ANALYSIS


Obtaining appropriate information, from the most
appropriate sources, and analyzing it, while making
assumptions that are recognized, valid, and as few as
possible


HAZARD IDENTIFICATION
The foundation of risk analysis
- Identify the hazards (what could go wrong)
- Deduce their causes
- Determine whether they could lead to undesirable
outcomes
Knowing chains of cause and effect facilitates decisions
on where to take corrective action
But many accidents are caused by unexpected
interactions rather than by failures


HAZARD ID METHODS
Checklists
Brainstorming
Expert judgement
What-if analysis
Audits and reports

Site inspections
Formal and informal staff interviews
Interviews with others, such as customers, visitors
Specialised techniques


WHAT WE FIND DEPENDS ON WHERE WE LOOK

We don't find hazards where we don't look
We don't look because
We don't think of looking there
We don't know of that place
We assume there are no hazards there
We assume that the risks are small
Must take a methodical approach
And be thorough


CRACK IN TIRELESS COOLING SYSTEM


EXTRACT FROM FILE ON 4 INTERVIEW (12/12/00)

(O'Halloran) For the Navy, Captain Hurford concedes that the possibility that this critical
section of pipework might fail was never even considered in the many years that these 12
submarines of the Swiftsure and Trafalgar classes have been in service.
(Hurford) "This component was analysed against its duty that it saw in service and was
supposed never to crack and so the fact that this crack had occurred in this component in the
way that it did and caused a leak before we had detected it, is a serious issue."
(O'Halloran) How big a question mark does this place over your general risk probability
assumptions about the whole working of one of these nuclear reactors?
(Hurford) "It places a question on the surveillance that we do when the submarines are in
refit and maintenance, unquestionably."
(O'Halloran) How long have these various submarines been in service?
(Hurford) "The oldest of the Swiftsure class came into service in the early seventies."
(O'Halloran) So has this area of the pipework ever been looked at in any of the submarines,
the 12 hunter killer submarines now in service?
(Hurford) "No it hasn't, because the design of the component was understood and the
calculations showed and experience showed that there would be no problem."
(O'Halloran) But the calculations were wrong?
(Hurford) "Clearly there is something wrong with that component that caused the crack and
we don't know if it was the calculations or whether it was the way it was made and that's what
is being found out in the analysis at the moment."

REPEAT FORMAL HAZARD ID
Hazard identification should be repeated when new information becomes available and when
significant change is proposed, e.g.
At the concept stage
When a design is available
When the system has been built
Prior to system or environmental change during operation
Prior to decommissioning


HAZARD ANALYSIS
Analyse the identified hazards to determine
Potential consequences
Worst credible
Most likely
Ways in which they could lead to undesirable outcomes
Likelihood of undesirable outcomes
Sometimes rough estimates are adequate
Sometimes precision is essential
Infusion pump (too much: poison; too little: no cure)
X-ray dose (North Stafford)

NOTES ON HAZARD ANALYSIS


Two aspects of a hazard: likelihood and consequence
Consequence is usually closely related to system goals
So risk reduction may focus on reduction of likelihood
Usually numerous hazards associated with a system
Analysis is based on collecting and deriving appropriate
information
Pedigree of information depends on pedigree of its source
The question of confidence should always be raised
Likelihood and consequences may be expressed
quantitatively or qualitatively


NOTES ON HAZARD ANALYSIS - 2


For random events (such as hardware component failure)
Historic failure frequencies may be known
Probabilistic hazard analysis may be possible
For systematic events (such as software failures)
History is not an accurate predictor of the future
Qualitative hazard analysis is necessary
When low event frequencies (e.g. 10^-6 per year) are desired,
confidence in figures must be low


THE NEED FOR QUALITATIVE ANALYSIS AND ASSESSMENT
When failures and hazardous events are random
Historic data may exist
Numerical analysis may be possible
When failures and hazardous events are systematic, or
historic data do not exist
Qualitative analysis and assessment are necessary
An example of a qualitative technique is the risk matrix


HAZARD ANALYSIS TECHNIQUES


FMEA analyses the effects of failures
FMECA analyses the risks attached to failures
HAZOP analyses both causes and effects
Event tree analysis (ETA) works forward from identified
hazards or events (e.g. component failures) to determine
their consequences
Fault tree analysis (FTA) works backwards from identified
hazardous events to determine their causes


USING THE RESULTS


The purpose is to derive reliable information on which to
base decisions on risk-management action
Derive - by carrying out risk analysis
Reliable - be thorough as appropriate
Information - the key to decision-making

Decisions - risk analysis is not carried out for its own sake
but to inform decisions (usually made by others)
Hazard identification and analysis may be carried out
concurrently


RISK ASSESSMENT
To determine the tolerability of analysed risks
So that risk-management decisions can be taken
Need tolerability criteria
Tolerability differs according to circumstance
e.g. medical
Tolerability differs in time
e.g. nuclear
Tolerability differs according to place
e.g. oil companies' treatment of staff and environment in
different countries

TOLERABLE RISK
Risk accepted in a given context based on the current
values of society
Not trivial to determine
Differs across industry sectors
May change with time
Depends on perception
Should be determined by discussion among parties,
including
Those posing the risks
Those to be exposed to the risks
Other stakeholders, e.g. regulators


THE HSE'S ALARP PRINCIPLE


RISK TOLERABILITY JUDGEMENT


In the case of machinery
We may know what could go wrong (uncertainty is low)
It may be reversible (can fix it after one accident)
In the case of BSE or genetically modified organisms
The risk may only be suspected (uncertainty is high)
It may be irreversible
Moral values influence the judgement of how much is enough
If we don't take risks we don't make progress


HAZARD AND RISK ANALYSIS TECHNIQUES


TECHNIQUES
Techniques support risk analysis
They should not govern it


TECHNIQUES TO BE CONSIDERED

Failure (fault) modes and effects analysis (FMEA)
Failure (fault) modes, effects and criticality analysis (FMECA)
Hazard and operability studies (HAZOP)
Event tree analysis (ETA)
Fault tree analysis (FTA)
Risk matrices
Human reliability analysis (HRA - mentioned only)
Preliminary hazard analysis (PHA)


A SIMPLE CHEMICAL PLANT
[Figure: plant diagram showing Fluid A, Fluid B, pumps P1 and P2, valves V1, V2 and V3, and Vat R]


FAILURE MODES AND EFFECTS ANALYSIS


Usually qualitative investigation of the modes of failure of
individual system components
Components may be at any level (e.g. basic, sub-system)
Components are treated as black boxes
For each failure mode, FMEA investigates
Possible causes
Local effects
System-level effects
Corrective action may be proposed
Best done by a team of experts with different viewpoints


FAILURE MODES AND EFFECTS ANALYSIS


Boundary of study must be clearly defined
Does not usually find the effects of
Multiple faults or failures
Failures caused by communication and interactions
between components

Integration of components
Installation


EXAMPLE FMEA OF A SIMPLE CHEMICAL PROCESS

Study No. | Item | Failure mode | Possible causes | Local effects | System-level effects | Proposed correction
1 | Pump P1 | Fails to start | 1. No power; 2. Burnt out | Fluid A does not flow | Excess of Fluid B in Vat R | Monitor pump operation
  | Pump P1 | Burns out | 1. Loss of lubricant; 2. Excessive temperature | Fluid A does not flow | Excess of Fluid B in Vat R | Add alarm to pump monitor
  | Valve V1 | Sticks closed | 1. No power; 2. Jammed | Fluid A cannot flow | Excess of Fluid B in Vat R | Monitor valve operation
  | Valve V1 | Sticks open | 1. No power; 2. Jammed | Cannot stop flow of Fluid A | Danger of excess of Fluid A in Vat R | Introduce additional valve in series

FAILURE MODES, EFFECTS AND CRITICALITY ANALYSIS
FMECA = FMEA + analysis of the criticality of each failure
Criticality in terms of risk or one of its components
i.e. severity and probability or frequency of occurrence
Usually quantitative
Purpose: to identify the failure modes of highest risk
This facilitates prioritization of corrective actions,
focusing attention and resources first where they will
have the greatest impact on safety
Recording: add one or two columns to an FMEA table
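
The prioritisation step can be sketched in a few lines of Python. This is only an illustration: the 1-to-5 severity and likelihood scores are invented, and taking criticality as their product is one common convention, not something the slides prescribe.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    item: str
    mode: str
    severity: int    # hypothetical scale: 1 (minor) to 5 (catastrophic)
    likelihood: int  # hypothetical scale: 1 (rare) to 5 (frequent)

    @property
    def criticality(self) -> int:
        # Assumed convention: criticality = severity x likelihood
        return self.severity * self.likelihood


def rank_by_criticality(modes):
    """Return the failure modes ordered from highest to lowest criticality."""
    return sorted(modes, key=lambda m: m.criticality, reverse=True)


# Failure modes from the FMEA example; the scores are invented for illustration.
modes = [
    FailureMode("Pump P1", "Fails to start", severity=3, likelihood=2),
    FailureMode("Pump P1", "Burns out", severity=3, likelihood=1),
    FailureMode("Valve V1", "Sticks closed", severity=3, likelihood=3),
    FailureMode("Valve V1", "Sticks open", severity=5, likelihood=2),
]

for m in rank_by_criticality(modes):
    print(f"{m.item:9} {m.mode:15} criticality={m.criticality}")
```

The ranked list is what the extra FMECA columns are for: it shows where corrective effort should go first.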


HAZARD AND OPERABILITY STUDIES (HAZOP)
The methodical study of a documented representation of a
system, by a managed team, to identify hazards and
operational problems.
Correct system operation depends on the attributes of the
items on the representation remaining within design intent.
Therefore, studying what would happen if the attributes
deviated from design intent should lead to the identification
of the hazards associated with the system's operation. This
is the principle of HAZOP.
'Guide words' are used to focus the investigation on the
various types of deviation.

A GENERIC SET OF GUIDE WORDS

Guide word | Meaning
No | The complete negation of the design intention. No part of the intention is achieved, but nothing else happens.
More | A quantitative increase.
Less | A quantitative decrease.
As well as | A qualitative increase. All the design intention is achieved, together with something additional.
Part of | A qualitative decrease. Only part of the design intention is achieved.
Reverse | The logical opposite of the design intention.
Other than | A complete substitution where no part of the design intention is achieved but something different occurs.
Early | The design intention occurs earlier in time than intended.
Late | The design intention occurs later in time than intended.
Before | The design intention occurs earlier in a sequence than intended.
After | The design intention occurs later in a sequence than intended.
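
As a rough sketch of how guide words drive a study, the snippet below pairs each guide word with each attribute under examination to produce the prompts a team would work through. The study items are hypothetical, and a real HAZOP relies on the team's judgement of which combinations are meaningful.

```python
from itertools import product

GUIDE_WORDS = [
    "No", "More", "Less", "As well as", "Part of", "Reverse",
    "Other than", "Early", "Late", "Before", "After",
]


def deviation_prompts(items):
    """Yield one prompt per (entity, attribute, guide word) combination."""
    for (entity, attribute), word in product(items, GUIDE_WORDS):
        yield f"{entity} / {attribute}: {word}"


# Hypothetical study items (entity, attribute), loosely based on the chemical plant example.
study_items = [
    ("Line from V1 to Vat R", "flow of Fluid A"),
    ("Vat R", "level"),
]

prompts = list(deviation_prompts(study_items))
print(len(prompts), "prompts to consider")
for prompt in prompts[:3]:
    print(prompt)
```

Not every combination is credible; part of the study leader's job is to dismiss the meaningless ones quickly.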


HAZOP OVERVIEW
[Flowchart of a HAZOP study meeting]
1. Introductions
2. Presentation of design representation
3. Examine design representation methodically
4. Possible deviation from design intent? If No, go to step 8
5. If Yes: examine consequences and causes
6. Document results
7. Define follow-up work
8. Time up, or completed study? If No, return to step 3
9. If Yes: agree documentation
10. Sign off meeting


HAZOP SUMMARY
HAZOP is a powerful technique for hazard identification and analysis
It requires a team of carefully chosen members
It depends on planning, preparation, and leadership
A study usually requires several meetings
Study proceeds methodically
Guide words are used to focus attention
Outputs are the identified hazards, recommendations,
questions


EVENT TREE ANALYSIS


Starts with an event that might affect safety
Follows a cause-and-effect chain to the system-level
consequence
Does not assume that an event is hazardous
Includes both operational and fault conditions
Each event results in a branch, so N events = 2^N branches
If event probabilities can be derived
They may be assigned to the branches of the tree
The probability of the final events may be calculated


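A SIMPLE EVENT TREE
[Figure: event tree with the successive events "Valve operates", "Valve monitor O.K.", "Alarm relay O.K.", "Claxon O.K." and "Operator responds", each with Yes/No branches, leading to an "Outcome" column of one safe outcome and several unsafe outcomes]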


REFINED EVENT TREE
[Figure: event tree with the successive events "Valve operates", "Valve monitor functions", "Alarm relay operates", "Claxon sounds" and "Operator responds", each with Yes/No branches]


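EVENT TREE: ANOTHER EXAMPLE
Events in order: Fire starts; Fire spreads quickly; Sprinkler fails to work; People cannot escape
Fire starts: Yes
  Fire spreads quickly: Yes (P=0.1)
    Sprinkler fails to work: Yes (P=0.2)
      People cannot escape: Yes (P=0.4). Resulting event: Multiple fatalities
      People cannot escape: No (P=0.6). Resulting event: Damage and loss
    Sprinkler fails to work: No (P=0.8). Resulting event: Fire controlled
  Fire spreads quickly: No (P=0.9). Resulting event: Fire contained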
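
A minimal sketch of how the probability of each resulting event follows from the branch probabilities of the fire example. It assumes that the branch probabilities are conditional on reaching that point in the tree, and that the branches attach to outcomes as listed in the code comments, which is one reading of the slide's diagram.

```python
from math import isclose, prod

# Each path is (resulting event, branch probabilities along the path).
# Assumed reading of the tree: spreads quickly (0.1 / 0.9),
# sprinkler fails (0.2 / 0.8), people cannot escape (0.4 / 0.6).
PATHS = [
    ("Multiple fatalities", [0.1, 0.2, 0.4]),
    ("Damage and loss", [0.1, 0.2, 0.6]),
    ("Fire controlled", [0.1, 0.8]),
    ("Fire contained", [0.9]),
]


def outcome_probabilities(paths):
    """Multiply the branch probabilities along each path of the event tree."""
    return {outcome: prod(branches) for outcome, branches in paths}


probabilities = outcome_probabilities(PATHS)
for outcome, p in probabilities.items():
    print(f"{outcome}: {p:.3f}")

# The outcomes are exhaustive and mutually exclusive, so they should sum to 1.
assert isclose(sum(probabilities.values()), 1.0)
```

All figures are conditional on a fire having started; multiplying by the frequency of fires would give outcome frequencies.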

BOTTOM-UP ANALYSIS
(Herald of Free Enterprise)
Bosun asleep in his cabin when ship is due to depart

Bow doors not closed

Ship puts to sea with bow doors open

Water enters car deck

As ship rolls, water rushes to one side

Ship capsizes

Lives lost


TOP-DOWN ANALYSIS
(Herald of Free Enterprise)
Ship puts to sea with bow doors open
  Bosun did not close doors
    Bosun not available to close doors
      Bosun not on ship
      Bosun on board but not at station
        Bosun asleep in cabin
        Bosun in bar
    Problem with doors & bosun can't close them
      Problem with closing mechanism
      Door or hinge problem
      Problem with power supply

FAULT TREE ANALYSIS


Starts with a single 'top event' (e.g., a system failure)
Combines the chains of cause and effect which could lead
to the top event
Next-level causes are identified and added to the tree and
this is repeated until the most basic causes are identified
Causal relationships are defined by AND and OR gates
One lower-level event may cause several higher-level
events
Examples of causal events: component failure, human
error, software bug
Each top-level undesired event requires its own fault tree
Probabilities may be attributed to events, and from these
the probability of the top event may be derived
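
A minimal sketch of the last point: how probabilities propagate through AND and OR gates. It assumes that the input events are statistically independent, which a real analysis would need to justify, and the events and figures are invented for illustration.

```python
from math import isclose, prod


def p_and(*probs):
    """AND gate: all input events must occur (independence assumed)."""
    return prod(probs)


def p_or(*probs):
    """OR gate: at least one input event occurs (independence assumed)."""
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical basic-event probabilities.
p_no_power = 1e-3
p_jammed = 2e-3
p_monitor_fails = 1e-2

# Intermediate event: the valve sticks if it loses power OR jams.
p_valve_sticks = p_or(p_no_power, p_jammed)

# Top event: the valve sticks AND the monitor fails to reveal it.
p_top = p_and(p_valve_sticks, p_monitor_fails)

print(f"P(valve sticks) = {p_valve_sticks:.6f}")
print(f"P(top event)    = {p_top:.3e}")
assert isclose(p_valve_sticks, 0.002998)
```

For small probabilities the OR gate is close to a simple sum, which is why rare-event approximations are often used in hand calculations.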

EXAMPLE FTA OF SIMPLE CHEMICAL PROCESS


THE PRINCIPLE OF THE PROTECTION SYSTEM
Required event frequency: 10^-7 per hour
AND gate, combining:
Dangerous failure frequency of equipment: 10^-3 / hour
Reliability of safety function: 10^-4 / hour
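
The arithmetic behind the AND gate can be sketched as follows. The sketch assumes that the safety function's figure is read as a probability of failing when the equipment fails dangerously, and that the two failures are independent; the function name is illustrative.

```python
from math import isclose


def required_failure_probability(target_rate_per_hour, equipment_rate_per_hour):
    """Probability of failure the safety function may have, at most, so that
    equipment failure rate x safety function failure probability <= target."""
    if equipment_rate_per_hour <= 0:
        raise ValueError("equipment failure rate must be positive")
    return target_rate_per_hour / equipment_rate_per_hour


# Figures from the slide: target 10^-7 per hour, equipment 10^-3 per hour.
needed = required_failure_probability(1e-7, 1e-3)
print(f"Safety function may fail with probability at most {needed:.0e}")
assert isclose(needed, 1e-4)
```

If the equipment itself is improved to 10^-4 per hour, the same target is met by a safety function that is ten times less reliable.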

COMPLEMENTARITY OF TECHNIQUES
Compare results of FMEA with low-level causes from FTA
Carry out HAZOP on a sub-system identified as risky by a
high-level FMEA
Carry out ETA on low-level items identified as risky by FTA


A RISK MATRIX
(An example of qualitative risk analysis)

Likelihood or Frequency | Consequence: Negligible | Moderate | High | Catastrophic
High | | | |
Medium | | | |
Low | | | |


RISKS POSED BY IDENTIFIED HAZARDS
[Figure: the risk matrix (rows: likelihood or frequency High, Medium, Low; columns: consequence Negligible, Moderate, High, Catastrophic) with hazards H1 to H6 entered in its cells; H1 and H5 share a cell]

A RISK CLASS MATRIX
(example only)

Likelihood or Frequency | Consequence: Negligible | Moderate | High | Catastrophic
High | | | |
Medium | | | |
Low | | | |


TOLERABILITY OF ANALYSED RISKS


Refer the cell of each analysed hazard in the risk matrix to the
equivalent cell in the risk class matrix to determine the class
of the analysed risks
So, Hazard 1 poses a D Class risk, Hazard 2 a B, etc.
Risk class
Defines the (relative) importance of risk reduction
Provides a means of prioritising the handling of risks
Can be equated to a defined type of action
Risk class gives no indication of time criticality
This must be derived from an understanding of the risk
What is done to manage risks depends on circumstances
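
The look-up from risk matrix to risk class can be sketched as below. Everything specific here is an assumption: the class letters in each cell, the convention that class A is the most serious, and the hazards HX, HY and HZ are invented, because the slides give the principle rather than the cell values.

```python
CONSEQUENCES = ["Negligible", "Moderate", "High", "Catastrophic"]

# Hypothetical risk class matrix: rows are likelihoods, columns follow CONSEQUENCES.
# Assumed convention: class A is the most serious, class D the least.
RISK_CLASS = {
    "High":   ["C", "B", "A", "A"],
    "Medium": ["D", "C", "B", "A"],
    "Low":    ["D", "D", "C", "B"],
}


def risk_class(likelihood, consequence):
    """Return the class of the cell at (likelihood, consequence)."""
    return RISK_CLASS[likelihood][CONSEQUENCES.index(consequence)]


# Hypothetical analysed hazards: name -> (likelihood, consequence).
hazards = {
    "HX": ("Low", "Moderate"),
    "HY": ("Medium", "High"),
    "HZ": ("High", "Catastrophic"),
}

classes = {name: risk_class(*cell) for name, cell in hazards.items()}
prioritised = sorted(classes, key=lambda name: classes[name])  # class A first

print(classes)
print("Handle in this order:", prioritised)
```

The ordering gives relative importance only; as the slide notes, it says nothing about how urgently each risk must be handled.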

THOUGHTS ON THE USE OF TECHNIQUES
Techniques are intended to support what you have decided
to do
They should not define what you do
Each is useful on its own for a given purpose
Thorough analysis requires a combination of techniques
Risk analysis is not achieved in a single activity
It should be continuous throughout a system's life
Early analysis (PHA) is particularly effective
Techniques may be used to verify each other's results
And to expand on them and provide further detail


IMPORTANCE OF EARLY ANALYSIS

Root causes: 1000s
Hazards: <100
Accidents: <20

Too many causes to start with bottom-up analysis
Effort and financial cost would be too great
Would lead to over-engineering for small risks
Would miss too many important hazards
Fault trees commence with accidents or hazards
Carry out Preliminary Hazard Analysis early in a project

PRELIMINARY HAZARD ANALYSIS


At the Objectives or early specification stage
Potential then for greatest safety effectiveness
Take a system perspective
Consider historic understanding of such systems
Address differences between this system and historic
systems (novel features)
Consider such matters as: boundaries, operational intent,
physical operational environment, assumptions
Identify accident types, potential accident scope and
consequences
Identify system-level hazards
If a checklist is used, review it for obsolescence
Create a new checklist for future use

HUMAN RELIABILITY ASSESSMENT


Human components of systems are unreliable
Hazards posed by them should be included in risk analysis
Several Human Reliability Assessment (HRA) techniques
have been developed for the purpose
These will be covered later in the degree course


RISKS POSED BY HUMANS


We need to
Extend risk analysis to include HRA
Pay more attention to ergonomic considerations in
design
Consider human cognition in interface design
Produce guidelines on human factors in safety
Safety is multi-dimensional, so take an interdisciplinary
approach
Include ergonomists and psychologists in development
and operational teams


RISKS POSED BY SENIOR MANAGERS


Senior managers (should)
Define safety policy
Make strategic safety decisions
Provide leadership in safety (or not)
Define culture by design or by their behaviour
They predispose an organisation to safety or accident
Accident inquiry reports show that the contribution of
senior management to accident causation is considerable
Yet their contributions to risk are not included in risk
analyses
Risk analyses therefore tend to be optimistic
Risk analyses need to include management failure
Management should pay attention to management failure

RISK ANALYSIS SUMMARY


Risk analysis is the cornerstone of modern safety engineering
and management
It can be considered as a four-staged process
The first stage, hazard identification, is the basis of all further
work - risks not identified are not analysed or mitigated
All four stages include considerable subjectivity
Subjectivity, in the form of human creativity and judgement, is
essential to the effectiveness of the process
Subjectivity also introduces bias and error
HRA techniques need to be included in risk analyses
We need techniques to include management risks
Often the greatest value of risk analysis is having to do it

SAFETY INTEGRITY LEVELS (SILs)


SAFETY AS A WAY OF THINKING


Reliability is not a guarantee of safety
If risk is too great, it must be reduced
(the beginning of 'safety thinking')
But how do we know if the risk is too great?
Carry out risk analysis
(safety engineering enters software and systems
engineering)
But even if the risk is high, does it need to be reduced?
Determine what level of risk is tolerable
(this introduces the concept of safety management)


SAFETY INTEGRITY
If risk is not tolerable it must be reduced
High-level requirement of safety function is 'to reduce the
risk'
Analysis leads to the functional requirements
The safety function becomes part of the overall system
Safety depends on it
So, will it reduce the risk to (at least) a tolerable level?
We try to ensure that it does by defining the reliability with
which it performs its safety function
In IEC 61508, this is expressed in terms of the probability of dangerous failure
This is referred to as 'safety integrity'


SAFETY INTEGRITY IS A TARGET REQUIREMENT
Safety integrity sets a target for the maximum tolerable rate of
dangerous failures of a safety function
(This is a reliability-type attribute)
Example: S should not fail dangerously more than once in 7 years
Example: There should be no more than 1 dangerous failure in 1000 demands
If the rate of dangerous failures of the safety function can be
measured accurately, achievement of the defined safety
integrity may be claimed


WHY SAFETY INTEGRITY LEVELS?


A safety integrity target could have an infinite number of
values
It is practical to divide the total possible range into bands
(categories)
The bands are 'safety integrity levels' (SILs)
In IEC 61508 there are four
N.B. Interesting that the starting point was that
safety and reliability are not synonymous
and the end point is the reliance of safety
on a reliability-type measure (rate of dangerous failure)


IEC 61508 SIL VALUES

Safety Integrity Level | Low Demand Mode of Operation (Pr. of failure to perform its safety functions on demand) | Continuous/High-demand Mode of Operation (Pr. of dangerous failure per hour)
4 | >= 10^-5 to < 10^-4 | >= 10^-9 to < 10^-8
3 | >= 10^-4 to < 10^-3 | >= 10^-8 to < 10^-7
2 | >= 10^-3 to < 10^-2 | >= 10^-7 to < 10^-6
1 | >= 10^-2 to < 10^-1 | >= 10^-6 to < 10^-5
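
A small sketch of the table as a look-up, mapping a target figure to its SIL band. The function and constant names are illustrative, and figures outside all four bands simply return None here.

```python
# Bands as (lower bound inclusive, upper bound exclusive), keyed by SIL.
LOW_DEMAND_BANDS = {
    4: (1e-5, 1e-4),
    3: (1e-4, 1e-3),
    2: (1e-3, 1e-2),
    1: (1e-2, 1e-1),
}
CONTINUOUS_BANDS = {
    4: (1e-9, 1e-8),
    3: (1e-8, 1e-7),
    2: (1e-7, 1e-6),
    1: (1e-6, 1e-5),
}


def sil_for(value, bands):
    """Return the SIL whose band contains value, or None if no band does."""
    for sil, (lower, upper) in bands.items():
        if lower <= value < upper:
            return sil
    return None


# A probability of failure on demand of 5 in 10,000 (low-demand mode).
print(sil_for(5e-4, LOW_DEMAND_BANDS))
# A dangerous failure rate of 2 x 10^-8 per hour (continuous mode).
print(sil_for(2e-8, CONTINUOUS_BANDS))
```

A figure that falls outside every band is a prompt to revisit the risk analysis or the allocation, not something the table resolves.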


RELATIONSHIP (IN IEC 61508) BETWEEN LOW-DEMAND AND CONTINUOUS MODES
Low-demand mode:
'Frequency of demands ... no greater than one per year'
1 year = 8760 hours (approx. 10^4)
Continuous and low-demand figures related by factor of 10^4
Claims for a low-demand system must be justified
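
The factor of 10^4 is a rounding of the number of hours in a year, as this short check shows. The helper name is illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24  # ignoring leap years

# 8760 hours is a little under 10^4, so the factor is an order-of-magnitude one.
print(HOURS_PER_YEAR, round(math.log10(HOURS_PER_YEAR), 2))


def per_hour_to_per_year(rate_per_hour):
    """Convert a dangerous failure rate per hour to an approximate rate per year."""
    return rate_per_hour * HOURS_PER_YEAR


# A continuous-mode rate of 10^-8 per hour is a little under 10^-4 per year.
print(f"{per_hour_to_per_year(1e-8):.2e}")
```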


APPLYING SAFETY INTEGRITY TO DEVELOPMENT PROCESS
When product integrity cannot be measured with confidence
(particularly when systematic failures dominate)
The target is related, in IEC 61508, to the development
process
Processes are equated to safety-integrity levels
But
Such equations are judgemental
Process may be an indicator of product, but not an
infallible one


IEC 61508 DEFINITIONS


Safety integrity:
Probability of a safety-related system satisfactorily
performing the required safety functions under all the
stated conditions within a stated period of time

Safety integrity level:
Discrete level (one out of a possible four) for specifying the
safety integrity requirements of the safety functions to be
allocated to the E/E/PE safety-related systems, where SIL 4
has the highest level of safety integrity and SIL 1 the lowest


SIL IN TERMS OF RISK REDUCTION (IEC 61508)

The necessary risk reduction is effected by a safety function
Its functional requirements define what it should do
Its safety integrity requirements define its tolerable rate of
dangerous failure

THE IEC 61508 MODEL OF RISK REDUCTION
[Figure: the EUC (equipment under control) provides utility plus risks; safety functions are provided by the control system and by the protection system]

PRINCIPLE OF THE PROTECTION SYSTEM
Event frequency: 10^-7
AND gate, combining:
EUC dangerous failure frequency: 10^-3 (or 10^-4)
Reliability of safety function: 10^-4 (or 10^-3)

BEWARE
Note that a protection system claim requires total
independence of the safety function from the protected
function


EACH HAZARD CARRIES A RISK
[Figure: on a scale of increasing risk, Risk 1 and Risk 2 are shown with their tolerable levels; for risk 1 the diagram marks the necessary reduction, the actual reduction and the residual risk]


EXAMPLE OF A SAFETY FUNCTION

The target SIL applies to the entire safety instrumentation system
Sensor, connections, hardware platform, software, actuator, valve,
data
All processes must accord with the SIL
Validation, configuration management, proof checks, maintenance,
change control and revalidation ...


PROCESS OF SIL DERIVATION


Hazard identification
Hazard analysis
Risk assessment
Resulting in requirements for risk reduction
Safety requirements specification
Functional requirements
Safety integrity requirements
Allocation of safety requirements to safety functions
Safety functions to be performed
Safety integrity requirements
Allocation of safety functions to safety-related systems
Safety functions to be performed
Safety integrity requirements

THE IEC 61508 SIL PRINCIPLE


BUT NOTE
IEC 61508 emphasises process evidence but does not
exclude the need for product or analysis evidence
It is impossible for process evidence to be conclusive
It is unlikely that conclusive evidence can be derived for
complex systems in which systematic failures dominate
Evidence of all types should be sought and assembled


SIL ALLOCATION - 1 (from IEC 61508)


Functions within a system may be allocated different SILs
The required SIL of hardware is that of the software safety
function with the highest SIL
For a safety-related system that implements safety
functions of different SILs, the hardware and all the
software shall be of the highest SIL unless it can be shown
that the implementations of the safety functions of the
different SILs are sufficiently independent
Where software is to implement safety functions of different
SILs, all the software shall be of the highest SIL unless the
different safety functions can be shown to be independent


SIL ALLOCATION - 2
(from IEC 61508)
Where a safety-related system is to implement both safety
and non-safety functions, all the hardware and software
shall be treated as safety-related unless it can be shown
that the implementation of the safety and non-safety
functions is sufficiently independent (i.e. that the failure of
any non-safety-related functions does not affect the safety-related
functions). Wherever practicable, the safety-related
functions should be separated from the non-safety-related
functions


SIL ACHIEVED OR TARGET RELIABILITY?
SIL is initially a statement of target rate of dangerous failures
There may be reasonable confidence that achieved reliability >=
target reliability (that the SIL has been met), when
Simple design
Simple hardware components with known fault histories in
same or similar applications
Random failures
But there is no confidence that the target SIL is met when
There is no fault history
Systematic failures dominate
Then, IEC 61508 focuses on the development process


APPLICATION OF SIL PRINCIPLE


IEC 61508 is a meta standard, to be used as the basis for
sector-specific standards
The SIL principle may be adapted to suit the particular
conditions of the sector
An example is the application of the principle in the Motor
Industry guideline


EXAMPLE OF SIL BASED ON CONSEQUENCE
(Motor Industry)
In the Motor Industry Software Reliability Association (MISRA)
guidelines (1994), the allocation of a SIL is determined by 'the
ability of the vehicle occupants to control the situation
following a failure'
Steps in determining an integrity level:
a) List all hazards that result from all failures of the system
b) Assess each failure mode to determine its controllability
category
c) The failure mode with the highest associated controllability
category determines the integrity level of the system

MOTOR INDUSTRY EXAMPLE

Controllability category | Acceptable failure rate | Integrity level
Uncontrollable | Extremely improbable | 4
Difficult to control | Very remote | 3
Debilitating | Remote | 2
Distracting | Unlikely | 1
Nuisance only | Reasonably possible | 0
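
The three steps for determining a system's integrity level can be sketched as below. The mapping of controllability categories to levels 4 down to 0 is assumed here to follow the 1994 MISRA guidelines, and the failure modes are invented for illustration.

```python
# Assumed mapping of controllability category to integrity level.
INTEGRITY_LEVEL = {
    "Uncontrollable": 4,
    "Difficult to control": 3,
    "Debilitating": 2,
    "Distracting": 1,
    "Nuisance only": 0,
}


def system_integrity_level(failure_mode_categories):
    """Step c): the failure mode with the highest controllability category
    determines the integrity level of the whole system."""
    return max(INTEGRITY_LEVEL[category] for category in failure_mode_categories.values())


# Steps a) and b): hypothetical failure modes and their assessed categories.
failure_modes = {
    "Warning lamp stuck on": "Nuisance only",
    "Display flickers": "Distracting",
    "Assistance lost at speed": "Difficult to control",
}

print(system_integrity_level(failure_modes))
```

One severe failure mode sets the level for the whole system, however benign the others are.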


SIL CAN BE MISLEADING


Different standards derive SILs differently
SIL is not a synonym for overall reliability
Could lead to over-engineering and greater costs
When SIL claim is based only on development process
What were competence, QA, safety assessment, etc.?
Glib use of the term SIL
No certainty of what is meant
Is the claim understood by those making it?
Be suspicious until examining the evidence
SIL X may be intended to imply use in all circumstances
But safety is context-specific
SIL says nothing about required attributes of the system
Must go back to the specification to identify them

THREE QUESTIONS
Does good process necessarily lead to good product?
Instead of using a safety function, why not simply improve
the basic system (EUC)?
Can the SIL concept be applied to the basic system?


DOES GOOD PROCESS LEAD TO GOOD PRODUCT?


Adhering to good process does not guarantee reliability
Link between process and product is not definitive
Using a process is only a start
Quality, and safety, come from professionalism
Not only development processes, but also operation,
maintenance, management, supervision, etc., throughout the
life cycle
Not only must development be rigorous but also the safety
requirements must be correct
And requirements engineering is notoriously difficult
Meeting process requirements offers confidence, not proof

WHY NOT SIMPLY IMPROVE THE BASIC SYSTEM?
IEC 61508 bases safety on the addition of safety functions
This assumes that EUC + control system are fixed
Why not reduce risk by making EUC & control system safer?
This is an iterative process
Then, if all risks are tolerable no safety functions required
But (according to IEC 61508) claim cannot be lower than
10^-5 dangerous failures/hour (SIL 1)
If there comes a time when we can't (technological) or won't
(cost) improve further
We must add safety functions if all risks are not tolerable
We are then back to the standard as it is


CAN THE SIL CONCEPT BE APPLIED TO THE BASIC SYSTEM?
Determine its tolerable rate of dangerous failure and translate
this into a SIL (as in the MISRA guidelines)
But note the limit on the claim that can be made (10^-5
dangerous failures per hour) to satisfy IEC 61508


DOES MEETING THE SIL GUARANTEE SAFETY?


Derivation of a SIL is based on hazard and risk analyses
Hazard and risk analyses are subjective processes
If the derived risk value is an accurate representation of the
risk AND the selected level of risk tolerability is not too high
THEN meeting the SIL may offer a good chance of safety
But these criteria may not hold
Also, our belief that the SIL has been met may not be justified


SIL SUMMARY
The SIL concept specifies the tolerable rate of dangerous
failures of a safety-related system
It defines a safety-reliability target
Evidence of achievement comes from product and process
IEC 61508 SIL defines constraints on development
processes
Different standards use the concept differently
SIL derivation may be sector-specific
It can be misleading in numerous ways
The SIL concept is a tool
It should be used only with understanding and care
