
Technical Report

ISA-TR84.00.02-Draft D

Safety Integrity Level (SIL) Verification of Safety Instrumented Functions

Approved XX XXXX 2010?

August 2009

ISA-TR84.00.02, Safety Integrity Level (SIL) Verification of Safety Instrumented Functions
ISBN: 1-55617-802-6
Copyright 2009 by ISA, the International Society of Automation. All rights reserved. Not for resale. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), without the prior written permission of the Publisher.
ISA
67 Alexander Drive
P.O. Box 12277
Research Triangle Park, North Carolina 27709

August 2009

ISA-TR84.00.02-2010

Draft D

Preface
This preface, as well as all footnotes and annexes, is included for information purposes and is not part of ISA-TR84.00.02-2009.

This document has been prepared as part of the service of ISA, the Instrumentation, Systems, and Automation Society, toward a goal of uniformity in the field of instrumentation. To be of real value, this document should not be static but should be subject to periodic review. Toward this end, the Society welcomes all comments and criticisms and asks that they be addressed to the Secretary, Standards and Practices Board; ISA; 67 Alexander Drive; P. O. Box 12277; Research Triangle Park, NC 27709; Telephone (919) 549-8411; Fax (919) 549-8288; E-mail: standards@isa.org.

The ISA Standards and Practices Department is aware of the growing need for attention to the metric system of units in general, and the International System of Units (SI) in particular, in the preparation of instrumentation standards. The Department is further aware of the benefits to USA users of ISA standards of incorporating suitable references to the SI (and the metric system) in their business and professional dealings with other countries. Toward this end, this Department will endeavor to introduce SI-acceptable metric units in all new and revised standards, recommended practices, and technical reports to the greatest extent possible. Standard for Use of the International System of Units (SI): The Modern Metric System, published by the American Society for Testing & Materials as IEEE/ASTM SI 10-97, and future revisions, will be the reference guide for definitions, symbols, abbreviations, and conversion factors.

It is the policy of ISA to encourage and welcome the participation of all concerned individuals and interests in the development of ISA standards, recommended practices, and technical reports.
Participation in the ISA standards-making process by an individual in no way constitutes endorsement by the employer of that individual, of ISA, or of any of the standards, recommended practices, and technical reports that ISA develops.

CAUTION

ISA ADHERES TO THE POLICY OF THE AMERICAN NATIONAL STANDARDS INSTITUTE WITH REGARD TO PATENTS. IF ISA IS INFORMED OF AN EXISTING PATENT THAT IS REQUIRED FOR USE OF THE STANDARD, IT WILL REQUIRE THE OWNER OF THE PATENT TO EITHER GRANT A ROYALTY-FREE LICENSE FOR USE OF THE PATENT BY USERS COMPLYING WITH THE STANDARD OR A LICENSE ON REASONABLE TERMS AND CONDITIONS THAT ARE FREE FROM UNFAIR DISCRIMINATION.

EVEN IF ISA IS UNAWARE OF ANY PATENT COVERING THIS STANDARD, THE USER IS CAUTIONED THAT IMPLEMENTATION OF THE STANDARD MAY REQUIRE USE OF TECHNIQUES, PROCESSES, OR MATERIALS COVERED BY PATENT RIGHTS. ISA TAKES NO POSITION ON THE EXISTENCE OR VALIDITY OF ANY PATENT RIGHTS THAT MAY BE INVOLVED IN IMPLEMENTING THE STANDARD. ISA IS NOT RESPONSIBLE FOR IDENTIFYING ALL PATENTS THAT MAY REQUIRE A LICENSE BEFORE IMPLEMENTATION OF THE STANDARD OR FOR INVESTIGATING THE VALIDITY OR SCOPE OF ANY PATENTS BROUGHT TO ITS ATTENTION. THE USER SHOULD CAREFULLY INVESTIGATE RELEVANT PATENTS BEFORE USING THE STANDARD FOR THE USER'S INTENDED APPLICATION. HOWEVER, ISA ASKS THAT ANYONE REVIEWING THIS STANDARD WHO IS AWARE OF ANY PATENTS THAT MAY IMPACT IMPLEMENTATION OF THE STANDARD NOTIFY THE ISA STANDARDS AND PRACTICES DEPARTMENT OF THE PATENT AND ITS OWNER.

ADDITIONALLY, THE USE OF THIS STANDARD MAY INVOLVE HAZARDOUS MATERIALS, OPERATIONS OR EQUIPMENT. THE STANDARD CANNOT ANTICIPATE ALL POSSIBLE APPLICATIONS OR ADDRESS ALL POSSIBLE SAFETY ISSUES ASSOCIATED WITH USE IN HAZARDOUS CONDITIONS. THE USER OF THIS STANDARD MUST EXERCISE SOUND PROFESSIONAL JUDGMENT CONCERNING ITS USE AND APPLICABILITY UNDER THE USER'S PARTICULAR CIRCUMSTANCES. THE USER MUST ALSO CONSIDER THE APPLICABILITY OF ANY GOVERNMENTAL REGULATORY LIMITATIONS AND ESTABLISHED SAFETY AND HEALTH PRACTICES BEFORE IMPLEMENTING THIS STANDARD.

THE USER OF THIS DOCUMENT SHOULD BE AWARE THAT THIS DOCUMENT MAY BE IMPACTED BY ELECTRONIC SECURITY ISSUES. THE COMMITTEE HAS NOT YET ADDRESSED THE POTENTIAL ISSUES IN THIS VERSION.

The following people served as members of ISA Committee SP84:

NAME COMPANY


Contents
1 Introduction .......................................................................................................................... 10
2 Scope ................................................................................................................................... 15
3 Background .......................................................................................................................... 16
4 Caution ................................................................................................................................. 17
5 Nature of Failures ................................................................................................................. 18
  5.1 Device Failure .................................................................................................................. 18
  5.2 Failure Causes and Mechanisms ..................................................................................... 20
  5.3 Failure Classification ....................................................................................................... 27
  5.4 Failure Rates ................................................................................................................... 40
6 Probability of Failure ............................................................................................................. 45
  6.1 Instantaneous Probability of Failure ................................................................................ 46
  6.2 PFDavg ............................................................................................................................ 49
  6.3 Effect of MTTR ................................................................................................................ 51
  6.4 Effect of Bypassing ......................................................................................................... 54
  6.5 Effect of Proof Testing and Diagnostics .......................................................................... 54
  6.6 Effect of Voting ................................................................................................................ 55
  6.7 Effect of Common Cause ................................................................................................ 56
7 Spurious Trip Rate ................................................................................................................ 58
8 SIF Calculation Overview ..................................................................................................... 60
9 Special Topics ...................................................................................................................... 68
  9.1 Systematic error and the management system ............................................................... 68
  9.2 Methods to Analyze the Performance of Equipment with Unrevealed Failures ............... 71
  9.3 Single Sided Confidence Limit ........................................................................................ 73
  9.4 Software Packages ......................................................................................................... 74
Annex A Abbreviations, Acronyms and Symbols ......................................................................... 75
Annex B Definitions ..................................................................................................................... 78
Annex C Fault Tree Analysis ....................................................................................................... 84
Annex D Markov Analysis ........................................................................................................... 91
Annex E References ................................................................................................................. 108


Foreword
Technical Reports Supporting User Implementation of ANSI/ISA-84.00.01-2004


The process sector specific ANSI/ISA-84.00.01-2004 standard defines the minimum requirements for the specification, design, installation, and maintenance of an SIS, given that a set of functional requirements has been defined and an SIL requirement has been established for each safety instrumented function. A series of complementary, informative technical reports has been developed to provide the end user with guidance as well as practical examples for implementing and maintaining compliance with the requirements of the standard. Three of the technical reports, TR84.00.02, TR84.00.03, and TR84.00.04, are specific to the specification, design, installation, and life cycle mechanical integrity of the SIS. Each technical report builds on the others and provides detailed role and responsibility guidance for each phase of the SIS lifecycle covered by the standard. A brief overview of each technical report and their interrelation is given below.

TR84.00.02, Safety Integrity Level (SIL) Verification of Safety Instrumented Functions, provides detailed descriptions and examples of the data analysis tools and techniques necessary to ensure and continuously improve the reliability of an SIS, and is intended for use by experienced design and reliability practitioners. Life cycle assurance of system reliability and continuous improvement is complemented by the performance data capture, categorization, and feedback mechanisms detailed in TR84.00.03, Mechanical Integrity of SIS.

TR84.00.03, Mechanical Integrity of Safety Instrumented Systems (SIS), provides detailed step descriptions as well as defined roles and responsibilities in assuring the life cycle mechanical integrity of SIS. The technical report includes examples of incorporating the inspection, testing, and performance data capture of SIS elements into the operating facility's overall maintenance and mechanical integrity strategy. The results of the performance data capture, i.e., failure rates of specific components, are fed back to the analysis expertise and processes described in TR84.00.02, Safety Integrity Level (SIL) Verification of Safety Instrumented Functions.

TR84.00.04, Guidelines for the Implementation of ANSI/ISA-84.00.01-2004, puts it all together to assist the designer in applying the concepts necessary to achieve an acceptable design that meets and maintains the Safety Integrity Level (SIL) as specified in the Safety Requirements Specification (SRS). This report covers various aspects such as guidance on grandfathering existing instrumentation for use in an SIS; acceptable sensor, wiring, power, and installation configurations; developing proof tests and setting proof test frequencies using available performance data; ensuring independence of the various Layers of Protection; etc.



1 Introduction

ANSI/ISA-84.00.01-2004 describes a safety lifecycle model (Figure 1.1) for the implementation of safety instrumented systems (SIS) in the process industry. The standard defines four levels of safety integrity (Safety Integrity Levels, SIL) that may be used to benchmark the capability of a safety instrumented function (SIF) within the SIS to operate under all stated conditions within the time required. ISA-TR84.00.02-2009 provides methodologies for verifying that an SIF achieves a specified SIL. The approaches outlined in this document are performance-based. Consequently, the examples provided in this document do not represent prescriptive architectural configurations or mechanical integrity requirements for any given SIL.

THE READER/USER IS CAUTIONED TO CLEARLY UNDERSTAND THE ASSUMPTIONS AND DATA ASSOCIATED WITH THE METHODOLOGIES AND EXAMPLES IN THIS DOCUMENT BEFORE DERIVING ANY CONCLUSIONS REGARDING THE PERFORMANCE VERIFICATION OF ANY SPECIFIC SIF. THE METHODOLOGIES ARE DEMONSTRATED THROUGH EXAMPLES (SIS ARCHITECTURES) THAT REPRESENT POSSIBLE SYSTEM CONFIGURATIONS AND SHOULD NOT BE INTERPRETED AS RECOMMENDATIONS FOR SIS. THIS DOCUMENT PROVIDES INFORMATION ON THE BENEFITS OF VARIOUS METHODOLOGIES AS WELL AS SOME OF THE DRAWBACKS THEY MAY HAVE.

The users of ISA-TR84.00.02 include:

- developers of components of safety instrumented systems wishing to ensure that users can correctly implement and document solutions;
- SIS designers who want a better understanding of how device selection, redundancy, diagnostic coverage, test interval, common cause failure, etc. affect the SIS performance;
- personnel who need to use reliability techniques to evaluate, verify, and document SIFs; and
- personnel involved in verification, functional safety assessment, and auditing, who must ensure that the desired level of risk reduction continues to be provided in the changing facility environment.

The quantitative verification of SIL takes place before the SIS detailed design phase of the life cycle (see Figure 1.1, Safety Lifecycle Model). This document assumes that an SIS is required. It does not provide guidance on the hazard and risk analysis used to identify the need for an SIS. The user is referred to ANSI/ISA-84.00.01-2004 Part 3 and CCPS Layers of Protection Analysis: Simplified Risk Assessment for guidance on assigning the SIL. This document involves the evaluation of the entire SIF, from the sensors through the logic solver to the final elements. Process industry experience shows that sensors and final elements are major contributors to loss of SIS integrity and that the operating environment plays a major role in sensor and final element failure. When evaluating the performance of sensors and final elements, issues such as component technology, installation, and maintenance should be considered. However, logic solvers can pose significant risk of common cause failure between multiple SIFs. Frequently, multiple functions are implemented in a single SIS logic solver. The failure of this common logic solver may adversely impact the performance of many SIFs, whereas a single sensor or final element rarely affects more than a few functions.


Common cause should be evaluated whenever:

- Any element of an SIS is common to more than one SIF;
- Redundant elements are used within a SIF to achieve the required integrity; and
- A support system for the SIS is common to more than one SIF.

Each element should be evaluated to:

- determine the impact of failure on the SIF operation;
- understand systematic, common cause, and common mode potential within and between SIFs; and
- ensure that each SIF meets the required integrity level.


Figure 1.1--Safety lifecycle model


Figure 1.2 shows the boundaries of the SIS and how it relates to other systems. The safety requirements specification addresses the design elements (hardware, software, redundancy, etc.) and the operational attributes (inspection/maintenance policy, frequency and quality of testing, etc.) of the SIS. Each element contributes to the probability that the SIF will fail, depending on the element's integrity and reliability.

Figure 1.2--Safety Instrumented System (SIS) Boundary


The SIL is related to a range of the probability of failure on demand (PFD) for demand mode SIFs and to the probability of failure as a function of time for continuous or high demand mode. The SIF performance can be determined using historical data from maintenance records. For example, the PFD can be estimated by dividing the number of times a device technology has failed under proof test by the total proof tests of the technology. A relatively large data pool is necessary to have a statistically significant population, so many users estimate the SIF performance using quantitative analysis and industry published data.

This document provides users with a discussion of quantitative analysis techniques that can be used to verify whether a SIF meets the required SIL. Quantitative analysis breaks down complex systems into their basic elements. The effect of the failure of each basic element on the overall system is evaluated using reliability models, such as reliability block diagrams, fault trees, and Markov models.

Safety integrity is defined as "the probability of a Safety Instrumented Function satisfactorily performing the required safety functions under all stated conditions within a stated period of time." Safety integrity consists of two elements: 1) hardware safety integrity and 2) systematic safety integrity. Hardware safety integrity, which is based upon random hardware failures, can normally be estimated to a reasonable level of accuracy. ANSI/ISA-84.00.01-2004 addresses the hardware safety integrity by specifying target failure measures for each SIL. For SIFs operating in the demand mode the target failure measure is PFDavg (average probability of failure to perform its design function on demand). PFDavg is also commonly referred to as the average probability of failure on demand.
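The historical estimate described above (failures found on proof test divided by total proof tests) can be sketched in a few lines. The failure counts below are hypothetical, and a point estimate of this kind would normally be tempered by a single-sided confidence limit (see clause 9.3) before use:

```python
def pfd_from_proof_tests(failures_found: int, total_tests: int) -> float:
    """Point estimate of PFD: the fraction of proof tests that found the
    device in a dangerous failed state. Meaningful only when the test
    population is statistically significant."""
    if total_tests <= 0:
        raise ValueError("need at least one proof test")
    return failures_found / total_tests

# Hypothetical tally from maintenance records for one device technology:
# 3 dangerous failures found in 400 documented proof tests.
print(pfd_from_proof_tests(3, 400))  # 0.0075
```

With few recorded failures the estimate is dominated by sampling noise, which is why the text recommends a relatively large data pool or industry published data.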
Systematic integrity is difficult to quantify due to the diversity of causes of failures; systematic failures may be introduced during the specification, design, implementation, operation, and modification phases and may affect hardware as well as software. ANSI/ISA-84.00.01-2004 addresses systematic safety integrity by requiring the implementation of a management system that seeks to reduce the potential for systematic failures.

An acceptable safe failure rate is also normally specified for a SIF. The safe failure rate is commonly referred to as the false trip, nuisance trip, or spurious trip rate. The spurious trip rate should be considered during the evaluation of a SIF, since unnecessary shutdowns lead to unnecessary process start-ups, which are frequently periods where the likelihood of abnormal operation is high. Hence, in many cases, a reduction of spurious trips enhances the overall safety of the process. Further, spurious trips have a direct impact on production and product quality. A target safe failure rate is typically expressed as the mean time to a spurious trip (MTTFSP). Often, an increase in the target MTTFSP can be justified based on the cost impact of a spurious trip.

The objective of this technical report is to provide users with techniques for the evaluation of the PFDavg and MTTFSP. ISA-TR84.00.02-2009 shows how to model a complete SIF, including the sensors, the logic solver, and the final elements.
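As a minimal numerical illustration of these two measures, consider a single (1oo1) device with constant dangerous undetected failure rate lambda_DU and safe failure rate lambda_S. Under the commonly used simplified equations, PFDavg is approximately lambda_DU x TI / 2 and MTTFSP is approximately 1/lambda_S. The rates below are assumed for illustration only, not taken from any published database:

```python
HOURS_PER_YEAR = 8760.0

def pfd_avg_1oo1(lambda_du: float, ti_hours: float) -> float:
    """Simplified average PFD for a single device proof-tested every ti_hours."""
    return lambda_du * ti_hours / 2.0

def mttf_spurious(lambda_s: float) -> float:
    """Mean time to a spurious trip, in hours, for a single device."""
    return 1.0 / lambda_s

lam_du = 2.0e-6            # dangerous undetected failures per hour (assumed)
lam_s = 5.0e-6             # safe/spurious failures per hour (assumed)
ti = 1.0 * HOURS_PER_YEAR  # annual proof test

print(pfd_avg_1oo1(lam_du, ti))               # about 8.8e-3, in the SIL 2 PFD band
print(mttf_spurious(lam_s) / HOURS_PER_YEAR)  # about 22.8 years between spurious trips
```

The two measures pull in opposite directions: design choices that lower PFDavg (e.g., tripping on any single sensor) often shorten MTTFSP, which is why both are evaluated together.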


2 Scope

2.1 ISA-TR84.00.02 is informative and does not contain any mandatory clauses. ISA-TR84.00.02-2009 is intended to be used only with a thorough understanding of ANSI/ISA-84.00.01-2004. Prior to proceeding with use of ISA-TR84.00.02, the hazard and risk analysis should be completed and the following information provided:

- An SIF is required.
- The SIF functionality is defined.
- The risk reduction required for each SIF is defined.

2.2 ISA-TR84.00.02-2009 provides guidance on the following:

- Assessing random and systematic failures, classifying failure modes, and estimating the failure rates for individual elements of an SIF;
- Assessing the impact of diagnostic and mechanical integrity choices on the element and SIF performance;
- Assessing and estimating the potential for common cause, common mode, and systematic failures;
- Verifying that an SIF achieves a specified SIL and spurious trip rate; and
- Meeting the minimum hardware fault tolerance.

2.3 ISA-TR84.00.02 provides guidance on techniques for evaluating the following:

a. probability of failure on demand for demand mode
b. probability of failure as a function of time for continuous mode
c. spurious trip rate


3 Background

During a hazard and risk analysis, initiating causes for hazardous events are identified where deviation from intended operation results in abnormal operation. Safety functions are identified that achieve or maintain a safe state of the process when defined safe operating limits are exceeded. Each safety function is allocated to an independent protection layer and allocated the risk reduction necessary to reduce the process risk below the owner/operator risk criteria. When the safety function is allocated to the SIS, it becomes an SIF. The allocated risk reduction determines its SIL according to Table ? in ANSI/ISA-84.00.01-2004.

An SIF operates in the continuous mode if its dangerous failure is an initiating cause for a hazardous event; this occurs only when the operation of the SIF is required for normal operation. When the SIF operates in the continuous mode, the requirements are stated in terms of hazard rate or failure frequency. In the process sector, SIFs are generally designed to operate in the low demand mode, where the required risk reduction is related to the target probability of failure on demand (PFD). The target failure frequency or PFD establishes the minimum required performance for the SIF and the target SIL. The target SIL serves as the performance benchmark for the design and management practices used throughout the SIF life. The SIL establishes three criteria: 1) equipment should be user approved for the operating environment and claimed PFD; 2) the subsystems should have the necessary fault tolerance against dangerous failure; and 3) the PFDavg for a demand mode SIF, or the hazard rate for a continuous mode SIF, should meet the target failure measure. ANSI/ISA-84.00.01-2004 Clause 11.9 requires that the SIL be verified quantitatively.

The target SIL also establishes a minimum level of management system rigor that must be provided to reduce the potential for systematic errors to a sufficiently low probability. Systematic errors are caused, or indirectly induced, by human error or unforeseeable complex process conditions. Systematic failures are not random events and must be addressed by the management system, using quality management processes to minimize systematic error. Most systematic errors are not easily included in the verification calculation. Random failures, which occur when stress causes a fault to develop in a component of a device, are readily modeled using probabilistic math, allowing the performance to be estimated. The performance calculation determines whether the planned SIF design can theoretically achieve the desired integrity and reliability, taking into account six design parameters:

- Mean time to failure (MTTF),
- Voting architecture,
- Diagnostic coverage (DC),
- Testing interval (TI),
- Mean time to repair (MTTR), and
- Common cause failure.

Once the SIF performance is benchmarked, it is possible to identify optimal solutions that meet the process unit's operability, maintainability, and reliability requirements. Any personnel assigned responsibility for verifying the risk reduction should understand how installed equipment can fail and the strategies used to address the failures. There are many books available on the subject of reliability engineering, so this technical report provides only a brief overview of the mathematics for calculation of the PFDavg, hazard rate, and spurious trip rate.
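How these design parameters interact can be seen in a simplified sketch for a redundant (1oo2) pair. The equation below is one commonly published approximation: diagnostic coverage reduces the dangerous rate to its undetected portion, the test interval sets the exposure time, and a beta factor models common cause; MTTR terms are neglected, and all numeric values are assumed for illustration only:

```python
def pfd_avg_1oo2(lambda_d: float, dc: float, ti: float, beta: float) -> float:
    """Simplified PFDavg for a 1oo2 voted pair (MTTR terms neglected).

    lambda_d: total dangerous failure rate, per hour
    dc:       diagnostic coverage, fraction of dangerous failures revealed
    ti:       proof test interval, hours
    beta:     fraction of undetected failures assumed common cause
    """
    lam_du = lambda_d * (1.0 - dc)            # dangerous undetected rate
    independent = (lam_du * ti) ** 2 / 3.0    # both legs fail independently
    common_cause = beta * lam_du * ti / 2.0   # both legs fail from a shared cause
    return independent + common_cause

# Assumed illustrative values. Note that the common cause term dominates
# the independent term here, which is why the beta estimate matters so
# much when benchmarking redundant designs.
print(pfd_avg_1oo2(lambda_d=5.0e-6, dc=0.6, ti=8760.0, beta=0.05))
```

With these assumed inputs the common cause contribution (beta term) is roughly four times the independent contribution, illustrating why adding redundancy without controlling common cause yields diminishing returns.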


4 Caution

The calculations are not an end in themselves. Safe operation is achieved by designing and implementing SISs that take into account a wide variety of site-specific criteria. Verification involves a predictive calculation, which is only as good as the failure rate data and the model. The calculation should be viewed as a tool for benchmarking and comparing different options. The calculations should never be perceived as a precise measure of SIF performance, and the numbers used to benchmark SIF performance should never be construed to indicate that there is an acceptable level of dangerous failure.

Equipment must be designed, constructed, installed, and maintained to minimize the risk of significant consequence events, especially those involving highly hazardous chemicals. SIS equipment must not be run to failure, but maintained in the as-good-as-new condition. This requires an on-going mechanical integrity program, which assures the equipment's long-term integrity. The safety management system must seek to identify and correct systematic errors, which reduce the SIF's actual performance. While the calculation does not explicitly address human error, the limitations of the human systems in supporting the required performance should be considered when estimating the achievable failure rate.

A significant assumption in the calculation is that inspection, preventive maintenance, and proof testing are performed at a rate sufficient to maintain a constant failure rate. The calculation does not implicitly or explicitly account for how changes in the design, operating, and maintenance practices affect the achievable failure rate. For example, the calculation may indicate that a change in testing from once per year to once in 10 years is acceptable. However, the change in testing will result in less opportunity to detect incipient and degraded failures, allowing more critical failures to occur. Less frequent movement of mechanical components and linkages may result in a higher potential for the component to freeze in position or lock up. As the test interval is extended, it becomes more likely that the test will find a critically failed device, rather than a degraded one.

Breakdown maintenance is not acceptable for the SIS. Consequently, when dangerous failures are found, an investigation should determine the root cause and should identify means to reduce recurrence, as necessary, including an increase in the inspection, preventive maintenance, and/or proof test frequency or rigor.
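The test interval effect can be made concrete with the simplified 1oo1 relationship PFDavg = lambda_DU x TI / 2 under an assumed failure rate. The arithmetic shows the calculated PFDavg growing linearly with TI, while deliberately capturing none of the degradation mechanisms the caution above warns about:

```python
lam_du = 1.0e-6  # assumed constant dangerous undetected rate, per hour

# Compare the calculated PFDavg at a 1-year and a 10-year proof test interval.
for years in (1, 10):
    ti = years * 8760.0
    pfd_avg = lam_du * ti / 2.0
    print(f"TI = {years:>2} yr -> PFDavg = {pfd_avg:.2e}")

# The tenfold growth in calculated PFDavg is the whole story only if the
# failure rate really stays constant; infrequently exercised mechanical
# devices tend to violate that assumption, as noted above.
```

This is precisely why the report cautions against reading such numbers as a precise measure: the model's constant-rate assumption is itself a function of the maintenance practices being changed.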


5 Nature of Failures

Potential failures within a SIS can be predicted probabilistically through analysis of the equipment in the system. The probability calculation requires a fundamental understanding of failure analysis and its associated terms. The failure mode is the symptom, condition, or effect by which the failure is observed. Failure modes that either cause a hazard or prevent equipment from performing its protective function are the primary focus of an analysis of a SIS to support ISA 84.01. The following presents a discussion of the various terminology used to describe failure and to classify its impact.

5.1 Device Failure

Failures are often divided into three categories: complete, degraded, and incipient. A complete failure is the termination of the equipment's ability to operate as specified, whereas a degraded failure represents a partial loss of function. An incipient failure describes a condition that will likely result in a loss of one or more functions if not corrected in a reasonable time frame. Failures are further classified as safe or dangerous, depending on the effect that they have on the process or on the ability of the equipment to provide protection.

A failure results from a variety of circumstances that manifest themselves during design, manufacture, installation, commissioning, operation, or maintenance. Failures can be instantaneous or gradual, partial or complete, and intermittent or transient. The wide variety of failures invariably creates differing interpretations among engineers as to what happened, why it happened, and how it should be categorized, even when they wish to communicate consistently.

5.1.1 Incipient Failures

Incipient failures are conditions that do not currently prevent a device from meeting its design specification or performing its SIF. If corrective action is not taken, the incipient condition is more likely to progress into a degraded or complete failure. The resulting failure may be considered safe or dangerous, depending upon its effect on the process. In addition, diagnostics influence the percentage of failures that are detected and undetected. Examples of incipient conditions are:

- Loose electrical or mechanical connections
- Corroded terminations
- Damaged electrical insulation
- Calibration drift (not exceeding allowed limits)
- Partially plugged solenoid valve vent (shutdown valve doesn't exceed allowable response time)
- Missing bolt from sensor flange (no leakage yet)
- Buildup of water/oil in a junction box
- Buildup of fluid in air/gas lines
- Partially plugged impulse line (doesn't exceed SIF response time)
- Partially plugged dip tube (doesn't exceed required calibration accuracy)
- Outdoor panel cover gasket cut or missing
- Missing filter or screen on an air line

Examples of remote actuated valve incipient conditions per CCPS PERD taxonomy:

- Body Cracked
- Body Eroded
- Body Corroded
- Body Material Wrong
- Guide Fouled
- Guide Galled
- Guide Corroded
- Guide Worn
- Stem Fouled
- Stem Galled
- Stem Corroded
- Stem Bent
- Stem Worn
- Seat Fouled
- Seat Cut
- Seat Eroded
- Seat Corroded
- Seat Excessive Wear
- Seat (soft) Embedded Debris
- Seat (soft) Overheat Evidence
- Seat Loading Mechanism Dysfunctional
- Spring Cracked
- Spring Corroded
- Spring Fatigued
- Spring Rubbing
- Improperly Installed
- Excessive Vibration

Incipient conditions may not prevent a device from performing its safety function but, if not repaired or restored to the initial design requirements, may eventually result in failure of the required SIF. Incipient conditions can be detected by inspections and in some cases by diagnostics (e.g., transmitter signal comparison to detect calibration drift).

5.1.2 Degraded Failures

Degraded failures are those that may decrease a component or subsystem's reliability and/or prevent a component from fully meeting its design specifications. In some cases a degraded failure prevents the SIS from performing its required SIF. Examples of degraded failures excerpted from the CCPS PERD instrument loop taxonomy include:

- Control Output High
- Control Output Low
- Control Output Slow to Respond
- Control Output Too Fast
- Control Output Erratic
- Auto Controller in Manual Mode
- Process Variable Indication High
- Process Variable Indication Low
- Process Variable Indication Erratic
- Control Output Indication High
- Control Output Indication Low
- Control Output Indication Erratic
- Alarm Function Delayed
- Interlock Function Early
- Interlock Function Delayed
- Interlock Voting Channel Fails to Function
- Interlock Voting Channel Spuriously Functions

Degraded failures can be detected during inspection, maintenance, or by component/system diagnostics.

5.1.3 Inspection/Diagnostic Testing for Incipient Conditions & Degraded Failures

An effective inspection and maintenance program is required to detect most incipient conditions and degraded failures. The maintenance program will typically include both preventive and condition-based activities. When equipment is known to have consumable components (e.g., batteries, catalytic bead sensors), they can be replaced on a periodic basis. Diagnostic alarms can provide a means to implement condition-based monitoring, allowing on-line repair. These techniques complement periodic function testing, which is necessary to detect those failures that can otherwise go undetected until a demand is placed upon the SIF. Together, these three types of maintenance activities increase the likelihood that the SIF will function correctly throughout its entire installed life.

Without a sound mechanical integrity program incorporating periodic inspection, appropriate response to diagnostics, and proof testing, one runs the risk of running equipment to dangerous failure. It is essential that equipment be maintained to the original safety requirements specification (SRS) to assure the equipment's long term functionality. Inspection and maintenance programs are essential to maintaining the equipment's assumed performance criteria in the SIL verification calculations. The lack of a good inspection/maintenance program for the devices, the integrated loop, and associated utilities used in a SIS will result in increased spurious and dangerous failure rates for the SIS.

5.2 Failure Causes and Mechanisms

With the types of failure as the initial foundation, it is now essential to consider failure cause and failure mechanism to facilitate a more fundamental and consistent interpretation of how and why something failed. This understanding has great practical application when defining the as-found and as-left condition in inspection, maintenance, and proof test reports.

Failures can be traced to root causes, which may be systematic or random. Systematic failures are related in a deterministic way to a root cause, which can only be minimized by effective implementation of a safety management system. Random failures are unpredictable and result from various degradation mechanisms related to the operating environment, which place physical, chemical, or electrical stresses on the equipment. When a failure is determined to be systematic, it should be possible to replicate the failure by producing the same set of conditions. Software errors are systematic, and they are identified and corrected through extensive software testing. A more complex systematic failure is human error; it is deterministic only to an extent, as humans by their very nature are not fully predictable. Whereas improved training, practice, and procedures can reduce the impact of human error, it is quite naïve to believe it can be eliminated in a fully deterministic manner.

Random hardware failures occur at unpredictable times and result from one or more of the possible degradation mechanisms in the hardware. Random failure is often the result of a physical failure where a stressor (or combination of stressors) has exceeded the capability of the installed equipment.
There are many degradation mechanisms inherent in the equipment, occurring at different rates in different components. Because manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of an equipment system comprised of many components occur at predictable rates but at unpredictable (i.e., random) times.
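The idea of "predictable rates at unpredictable times" is the constant-failure-rate (exponential) model, and it can be sketched numerically. The failure rate below is a hypothetical value chosen for illustration only, not data from this report:

```python
import math
import random

def survival(lam, t):
    """Probability that a constant-failure-rate device survives to time t."""
    return math.exp(-lam * t)

lam = 1e-2          # assumed failure rate: 1 failure per 100 device-years
random.seed(1)

# Individual failure times are unpredictable (exponentially distributed)...
times = [random.expovariate(lam) for _ in range(100_000)]

# ...but the population fails at a predictable rate.
frac_failed_10yr = sum(t <= 10 for t in times) / len(times)
predicted = 1 - survival(lam, 10)   # 1 - exp(-0.1), about 9.5 %
```

With the seed fixed, the simulated fraction failed within 10 years tracks the predicted 9.5% closely, even though no individual failure time can be foreseen.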


In addition, there may be external influences (e.g., weather, process upsets) that are random in nature and that place stresses on the components, causing random failures. A variety of failure mechanisms can lead to failure of equipment. Corrosion, erosion, fatigue due to mechanical or thermal cycling, and other environmental stresses are all examples of failure mechanisms. For instance, freezing weather might cause an impulse line to freeze, so that the pressure transmitter's output signal does not change as the process pressure changes.

A major distinguishing feature between random and systematic failures is that the equipment failure rate arising from random hardware failures can be predicted with reasonable accuracy, but many systematic failures, by their very nature, cannot be accurately predicted. That is, equipment failure rates arising from random failures can be quantified with reasonable accuracy, but those arising from systematic failures are more difficult to quantify because the events leading to them are often difficult to predict.

An SIF uses many different devices to execute the safety functions intended to reduce the risk of identified hazardous events. The SIF performance depends on the SIF device characteristics (e.g., individual failure rates), the properties of the system itself, and the interactions among its components (e.g., voting architecture, common cause failures). An implicit assumption made during SIF design and its performance verification is that the devices are in their useful life period and are replaced as they approach the unacceptable wear-out portion of their life. The SIF probability of failure is a function of the random dangerous failure rate of its devices (e.g., field sensors, final elements, logic solvers, and support systems) and the design basis parameters (e.g., redundancy, architecture, proof test interval, and diagnostic coverage). The SIF may also operate spuriously, causing unnecessary process trips.
Spurious and dangerous failures are considered critical failures, because these failures result in the loss of the equipment's ability to operate as specified. Other systematic, non-random contributions also affect the observed performance. Systematic and random faults can cause the failure of an individual device or the simultaneous failure of multiple devices.

Device failures are tracked and analyzed as part of the safety management system. The goal of a failure tracking program is to identify root causes and data trends and to develop means to reduce the potential for failure recurrence. The analysis may identify systematic errors and common cause failures. When a common cause failure occurs, the root causes tend to be installation-, software-, and/or application-specific. Systematic errors occur because a human error broke through the administrative controls and quality assurance processes that were supposed to detect and correct it. Therefore, systematic errors tend to be perceived as a local personnel issue, even though they can occur at any stage of an instrument's life, e.g., manufacturer's design/production, user's design/installation, operation/maintenance, and management of change. Regardless of the nature of the failure, dangerous failure reports and identified trends should be communicated to appropriate personnel: the better personnel understand how devices can fail, the better prepared they are to prevent failures.

Failures are managed within the ISA 84.01 work process using different strategies depending on whether they are random or systematic. Consequently, the following sections briefly discuss random, systematic, and common cause failures, so the strategies for their management can be better understood.

5.2.1 Random Failure

SIF hardware is often manufactured with electrical, electronic, programmable electronic and mechanical components. Each component wears out or breaks down after a different length of time, depending on how well it was originally manufactured, how much it has been used, the variation in the operating conditions, etc. Since these components are lumped together to make a device, the failures of the device appear to be random even though the failure distributions of the individual components may not be random.
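Under the constant-failure-rate assumption, this lumping of components is simple addition: a device whose series components each contribute a failure rate fails at the sum of those rates. The component names and rates below are hypothetical illustrations, not published failure data:

```python
# Parts-count sketch: for series components under the constant-rate
# assumption, the device failure rate is the sum of the component rates.
# All rates are assumed values, in failures per 1e6 operating hours.
component_rates = {
    "sensing_element": 0.5,
    "electronics": 1.2,
    "process_connection": 0.1,
}
device_rate = sum(component_rates.values())   # failures per 1e6 h
mttf_hours = 1e6 / device_rate                # mean time to failure
```

This is why a device built from many components with different individual lifetimes can still be usefully described by a single aggregate rate.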


If it can be demonstrated that an SIF device (e.g., a block valve) has dominant time-based failure mechanisms (i.e., wear-out), the random failure rate model can lead to erroneous conclusions and practices. For example, in calculating test intervals, a random model may lead to testing more frequently than actually required during the early life of the device and testing too infrequently during the later wear-out phase. Owner/operators should be aware that reliability models (e.g., Weibull) are available that divide failures into infant mortality, random, and wear-out modes. This technical report assumes failures are random.

One very effective barrier against random device failures is to implement redundancy. Fault tolerance is provided using multiple devices in voting configurations that are appropriate for the SIL. If one device breaks down, another device is available to provide the safety action. Since failures occur randomly, it is less likely that multiple devices fail at the same time.

By observing the operation of a device over time, data can be collected about how often it breaks down. This information can be used to estimate how long a device is likely to last before it stops working properly. However, in the case of Programmable Electronic (PE) devices and logic solvers, the technology is evolving so rapidly that the reliability data collected on any device is often limited unless databases are pooled.

Random failures are the result of hardware degradation mechanisms, which can be accelerated by stress factors due to the operating environment. Stress factors are caused by many types of events, such as:

- Normal or abnormal process conditions,
- Presence of an adverse microenvironment (e.g., corrosion produced by chemical impurities or metallurgy),
- Presence of solids or other materials which randomly deposit in SIF devices,
- Exposure to electrostatic discharge,
- Operating for long periods at the extreme of the device's environmental specification, and
- Excessive device wear and tear.

Low frequency atmospheric events (e.g., snow in Houston, Texas, USA) can also be considered random events. A random failure does not follow any pattern, but instead occurs randomly during the device's life. The user approval process should ensure that a device's random failures and failure modes are well understood prior to approval.

Random failures can be identified by internal device diagnostics, external diagnostics, inspection, and proof tests. Redundant, fault tolerant subsystems are often used to reduce the probability that a single failure will cause the SIF to fail to operate correctly. Redundant subsystems also provide the opportunity for external diagnostics, where a diagnostic algorithm is executed at a specified interval to detect device failures. The inspection and proof test interval is generally chosen based on maintenance history, manufacturer's recommendations, good engineering practice, insurance requirements, regulatory requirements, and what is necessary to achieve the required performance. As the test interval gets longer, there is an increased probability that multiple devices within a subsystem have malfunctioned prior to fault detection.
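The effect of the proof test interval can be sketched with the simplified average-PFD equations commonly used for single (1oo1) and redundant (1oo2) devices, with dangerous undetected rate λDU and test interval TI, neglecting common cause and repair time. The numeric rate below is an assumed example value:

```python
def pfd_avg_1oo1(lam_du, ti):
    """Simplified average PFD for a single (1oo1) device.

    lam_du: dangerous undetected failure rate (per hour)
    ti: proof test interval (hours)
    """
    return lam_du * ti / 2

def pfd_avg_1oo2(lam_du, ti):
    """Simplified average PFD for a 1oo2 pair (common cause neglected)."""
    return (lam_du * ti) ** 2 / 3

lam_du = 1e-6                                # assumed rate, per hour
annual = pfd_avg_1oo1(lam_du, 8760)          # yearly proof test
three_year = pfd_avg_1oo1(lam_du, 3 * 8760)  # longer interval, higher PFD
```

The sketch shows the text's point directly: stretching the test interval from one year to three triples the average PFD of a single device, while redundancy drives the (independent-failure) PFD down sharply.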

5.2.2 Systematic Failures

Due to the nature of these errors, it is impossible to predict how often systematic failure leads to SIF failure. Unlike random hardware failures, redundancy may not be effective against systematic failures, because the redundant devices are often affected by the same systematic failure. Under the same operating conditions, all redundant devices could fail due to a common systematic failure.

A partially effective barrier against systematic failures is device diversity, i.e., redundancy provided using a different device, system, technology, programmer, etc. If one device fails, the other continues to work, provided the cause of failure does not result in the failure of both components. Use caution to avoid deterioration of SIF performance from the use of diverse devices with poor performance characteristics. The most effective defense against systematic failure is full integration of the ANSI/ISA 84.00.01-2004 lifecycle and functional safety management concepts into the project management process.

Systematic failure is related in a deterministic way to a root cause, which can only be minimized or eliminated by changes in the design basis, installation practices, software systems, or operating basis. Systematic failures can be due to a single failure event or to a combination of errors, such as poor design and operation/maintenance practices. Systematic errors which have resulted in process safety incidents are:

- Risk assessment errors,
- Design errors,
- Specification errors,
- Unexpected operating environment impact,
- Installation and commissioning errors,
- Operator errors,
- Maintenance errors, and
- Change management errors.

While random hardware failures are caused mainly by physical degradation mechanisms, systematic failures are the direct consequence of SIF complexity. Every device is subject to failure due to design, specification, operating, maintenance, and installation errors. These mistakes immediately put the devices on the path to failure. The following table presents a summary of major differences between random and systematic failure.

Table 5.1. Summary of the important differences between random and systematic failures (ISA TR84.00.04).

                                                       Random Failures   Systematic Failures
  Will always occur under the same conditions          No                Yes
  Effectively prevented by redundancy                  Yes               No
  Effectively prevented by diversity in redundancy     Yes               Partially


The following are examples where systematic failures can become significant:

- A SIF that involves unusual or complex design or maintenance features
- A site with poor operating discipline
- A significant change in management practices, such as downsizing, impacting operating and maintenance practices

As SIF complexity increases, the potential for systematic errors increases due to the combination of failures. Additionally, the probability of detecting these errors decreases. Each device has many known opportunities for systematic error. With any new technology, there is the potential for many unknown (or as yet unidentified) failures. When issues associated with interconnectivity, communication, and support (utility) systems are added to the analysis, there are generally a large number of potential systematic failures.

The complex nature of systematic failures often makes them difficult to analyze probabilistically. Collected data typically includes some systematic failures, which contribute to the observed failure rate. Since the intent of the quantitative analysis is to predict SIF performance, the systematic failures should be tracked and their inherent presence considered when estimating the random failure rate. As more information is collected, trends can be identified and used to minimize random and systematic failures in new or modified designs. For example, it may take multiple failure reports before it is recognized that the instrument air quality is causing the equipment failure.

Only a limited number of device failures and failure paths can be tested. When the failure patterns are not detected by the limited testing that is practically achievable, failure can happen every time the specific set of conditions occurs (e.g., every time there is lightning in the area). This potential failure becomes an intrinsic part of the SIF. Systematic errors are a major source of common cause failure, having the potential to disable redundant devices.
Systematic failures include many types of errors, such as:

- Manufacturing defects, e.g., software and hardware errors built into the device by the manufacturer,
- Specification mistakes, e.g., incorrect design basis and inaccurate software specification,
- Implementation errors, e.g., improper installation, incorrect programming, interface problems, and not following the safety manual for the SIS devices, and
- Operation and maintenance errors, e.g., poor inspection, incomplete testing, and improper bypassing.

Systematic errors related to manufacturing defects can be reduced through the use of diverse redundancy. Diversity typically involves the use of different technologies or different manufacturers. While manufacturer errors can be addressed by diversity, this increases the SIF complexity. Incorrect specification, implementation, operation, and maintenance constitute root causes that are not solved by this type of diversity and can actually increase if unnecessary complexity is introduced into the SIS design. The level and type of diversity utilized should be balanced against the level of complexity to achieve optimal error-free operation and maintenance.

The perceived improvements gained by diversity are based on the assumption that different devices exhibit different failures and failure modes. In other words, it is less probable for all of them to fail simultaneously if they are different. However, diversity only reduces the potential for common mode failures. Many common cause failures may not be addressed by the type of diversity selected.


Systematic errors are best addressed through a safety management system, which emphasizes continuous improvement through performance monitoring. A rigorous system is necessary to decrease systematic errors and enhance safe and reliable operation. Each verification, assessment, audit, and validation is aimed at reducing the probability of systematic error to a sufficiently low level. As errors or failures are detected, their occurrence should be investigated, so that lessons can be learned and communicated to affected personnel.

The following is an actual case description of the impact of systematic failures, i.e., human error, operating discipline, etc., on the operation of an offshore production platform. Offshore platforms have many different safety related systems. Several years after one system was started up, the low pressure flare system required modification. (There were actually two flare systems: low and high pressure.) The modification required removing a piece of pipe to the flare in order to install a heater. The project was reviewed very carefully, considering other incidents the company had gone through. The pipe was removed while the plant was running, so the flare system was now operating at atmospheric pressure. The flare system was purged and deemed safe. Personnel were instructed that if anything went wrong they were to hit a manual call point, which was an input into the fire & gas system. This was considered a Class 1 ESD: shutdown and isolate. What they weren't told was that the input also started a two hour blowdown timer.

Someone heard something hissing in the flare piping. They thought it was a release, so they activated the manual call point. As a result, all the alarms went off and everyone stood down. People eventually realized that all that had really happened was that some condensate ice in the flare line had bubbled off and there really was nothing to worry about.
This happened just before lunch, so everyone decided now would be a good time for their lunch break. No one knew that the two hour blowdown timer was still running. Once the timer expired, the entire plant started blowing down. Now there really was gas flowing down the line to the base of the flare stack! The flare tip was a common ignition source for both the low and high pressure flares. While the high pressure flare was operating successfully, it was drawing oxygen and gas up through the open piping at the base of the low pressure flare. The mixture exploded at the flare tip and flames shot out the pipe opening at the bottom of the stack. This cycle of explosion and flames repeated approximately every 15 seconds.

The open blowdown valves were near the base of the flare stack. The blowdown valves had manual resets on them, but personnel couldn't get anywhere near the valves to manually reset (close) them because of the recurring explosions and flames. The only thing left to do was abandon the platform. Approximately 200 people were evacuated using dozens of helicopters in the area. (Launching life boats was not desirable.) A crew of 20 remained on board overnight and the situation worked itself out. The event was covered by the local TV news. Fortunately, very little physical damage was actually done.

So how could this have happened? The original design (from the 1980s) called for over two dozen halon zones and a halon based flare snuffing system that would have been able to snuff out the flare and eliminate the source of ignition. However, by the time the systems were installed in the early 1990s, there was a move within industry to reduce the amount of halon being used. The flare snuffing system was not deemed appropriate for halon, so the bottles were never installed, although the connections, piping, and release button were. An operator pressed the button for the halon release during the emergency, but there never was any halon to release.
People later reported that the halon system didn't work. It obviously couldn't work. Unfortunately, not everyone was aware of the design change, nor was the design change thoroughly documented. The platform was abandoned for 24 hours and down for seven weeks. The OIM (Offshore Installation Manager) and 12 other people lost their jobs that day.

5.2.3 Common Cause Failures

Common cause failure (CCF) is a term used to describe random and systematic events that cause multiple devices, systems, or layers to fail simultaneously. Another term is common mode failure, which describes the simultaneous failure of two devices in the same mode. Common mode failure is related to the use of identical devices in the redundant subsystem. For example, two redundant differential pressure sensors can be simultaneously disabled due to loss of signal (common mode failure) originating from diaphragm damage (failure cause) caused by water hammer. Common mode failure is a subset of common cause failure.

Common cause failures are important considerations in predicting SIF performance, particularly for SIL 2 and above applications. When common cause failures are not evaluated, there is an implicit assumption that good practices for design, installation, operation, maintenance, and management of change are in place. Good practice can result in a low common cause failure rate with little impact on the estimate of the PFDavg. Poor practice can result in a high common cause failure rate, negatively impacting the achievable PFDavg.

All common cause failures have the potential to reduce the SIF performance; however, they are addressed in different ways depending on the nature of the failure (e.g., systematic or random). Throughout the ISA 84.01 lifecycle, it is recommended that devices, systems, or protection layers be assessed for independence and the potential for common cause failure. Independence and common cause are often interrelated. A lack of independence means that there is a potential for a common cause failure. Likewise, an identified common cause indicates a lack of independence and therefore some dependency.

Diversity is often suggested as a means to eliminate common cause failure. However, common cause can impact identical and diverse devices. For example, the process application or external environmental condition can affect different technologies simultaneously when the conditions trigger each device's failure mechanisms.
These devices may eventually fail due to different reasons, but the abnormal process condition is the root cause that started the failure propagation. The use of different technologies (i.e., diversity) does reduce the potential for common mode failure. Diversity reduces the potential for dependent failure by minimizing common mode failure, but does not eliminate the potential for common cause failure.

The approach taken to manage CCF is specific to the nature of the failure. Two types of CCF are addressed: 1) single points of failure, where one malfunctioning device causes an SIF failure; and 2) single events that lead to multiple failures in a redundant subsystem. Single points of failure can occur due to systematic or random events. Systematic failures occur when human errors result in the violation or invalidation of design and operating basis assumptions (e.g., the process is assumed to be clean but in reality is not). Random failures can occur throughout the useful life of a device. These failures are managed using redundancy, diagnostics, and proof testing.

As with single points of failure, redundant subsystems can fail due to systematic errors in the device manufacture, specification, design, installation, and maintenance. These errors typically happen due to lack of knowledge, information, and training and are generally unknown to personnel. Test procedures may not identify these errors, since they are not expected. Systematic errors are difficult to test for even when easily identified. For example, it is possible that a valve actuator is incorrectly specified, but how do you test to determine that the valve actuator will not close under emergency process conditions? Specification errors must be caught during the design and engineering phases using independent verification. Checklists can be used to identify CCF; the list of questions guides the engineer through the design aspects, examining opportunities for CCF.
Installation, commissioning, and maintenance errors are reduced by independent checks, verifications, and audits. Random failures of redundant subsystems can be caused either by conditions that are inherent to the device or inherent to the system. Random failures inherent to the device are generally manufacturing defects, which may include hardware and/or software failures.

Common cause failures should be considered in the PFD calculation. These failures are often estimated using the beta factor method. The operating environment, the installation, and the interconnection to other systems affect the device operation. This system-induced random failure can be divided into two categories depending on the availability of failure frequency data. If data is available, the failure can be modeled explicitly as an event. For example, if fault tree analysis is the selected analytical methodology, this type of CCF is treated as a basic event with its own failure rate. If data is not available, the CCF can be addressed using the beta factor method. The beta factor accounts for random events that cause a dangerous failure in the operating environment. The value of the beta factor is selected based on engineering judgment. Many owner/operators use a beta factor between 0.1% and 5.0% when good engineering practices are applied in the design, installation, inspection, and maintenance practices. The beta factor can be substantially higher if good engineering practices are not followed.
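The beta factor treatment can be sketched for a 1oo2 subsystem: a fraction β of the dangerous undetected failures is taken as common cause and contributes like a single (1oo1) device, while the remainder fails independently. The rate and test interval below are assumed illustrative values:

```python
def pfd_avg_1oo2_ccf(lam_du, ti, beta):
    """Simplified 1oo2 average PFD with a beta-factor common cause term.

    lam_du: dangerous undetected rate (per hour)
    ti: proof test interval (hours)
    beta: common cause fraction (0..1)
    """
    independent = ((1 - beta) * lam_du * ti) ** 2 / 3
    common_cause = beta * lam_du * ti / 2   # behaves like a 1oo1 device
    return independent + common_cause

lam_du, ti = 1e-6, 8760        # assumed values: per hour, 1-year proof test
low = pfd_avg_1oo2_ccf(lam_du, ti, 0.001)   # beta = 0.1 %
high = pfd_avg_1oo2_ccf(lam_du, ti, 0.05)   # beta = 5 %
# Even a few percent of common cause dominates the redundant pair's PFD.
```

Comparing the two results shows why the 0.1% to 5.0% range quoted above matters: at β = 5% the common cause term is roughly ten times the independent-failure contribution.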

5.3 Failure Classification

All devices eventually fail. A fault occurs somewhere within the device's components and propagates into a degraded or complete failure, resulting in the device's inability to operate as specified. A device failure is observed by its effects on the device's operation. The device's failure mode is the device's observed loss of function, e.g., the signal does not change with the process variable. The user approval process (see ISA TR84.00.04 Annex L) relies heavily on gaining sufficient understanding of the device's failure modes in the operating environment. Failure mode and effects analysis (FMEA) is a qualitative analysis method used to analyze the effects of identified device failure modes on the device operation, so the design basis can take these modes into account. The effects (OREDA 1992) may include:

- Complete failure (e.g., failure to operate),
- Spurious operation (e.g., premature function),
- Degraded failure (e.g., out-of-tolerance), and
- Incipient conditions (e.g., damaged electrical insulation).

Degraded and complete failures cause the loss of the device's ability to operate as specified, resulting in either a safe or dangerous failure. These failures can occur suddenly or gradually over time. Pass-fail criteria are established for each device to determine when a failure is critical. If a device fails to operate as specified in the design basis for safety, the failure is considered dangerous. The risk reduction capability of a safety function is related to its dangerous failure rate. If the device has spuriously operated in a manner that does not create a hazard (e.g., continuous SIF) or does not result in the loss of its ability to perform its protective function, the failure is considered safe.

Incipient conditions are typically identified during inspection and preventive maintenance activities. An incipient condition does not currently affect the device operation. However, if corrective action is not taken, the condition could propagate into a degraded or complete failure. For example, suppose the screen is missing from the vent port of a solenoid-operated valve. There is no obstruction currently in the vent port, but if the screen is not replaced, debris could accumulate in the port, resulting in a degraded or complete failure.

5.3.1

Safe and Dangerous Failures

Degraded and complete failures can be further classified as safe or dangerous, as shown in Figure 5.1. A safe failure results in the device going to the safe state or direction defined in the design basis. A dangerous failure causes the device to fail in a manner where the protection fails to function when required by its safety requirement specification.

August 2009

ISA-TR84.00.02-2010

28

Draft D

Failure Modes
  Fail Safe
    Safe Detected: fail to start; spurious trip
    Safe Undetected: detected by test; detected by inspection
  Fail Dangerous
    Dangerous Detected: effects can lead to dangerous condition
    Dangerous Undetected: detected by proof test; detected by failure on demand

The probability of false trips is also of concern, because shutdowns and startups can create dangerous conditions in addition to causing business interruption. Fault tree analysis can be used to determine how to most effectively reduce false trips.

The probability of failure on demand is a fraction of all system failures and is of primary concern; it is the probability used to define the SIL. The objective of proof testing is to detect a dangerous undetected failure before it is revealed by a demand.

Figure 5.1 Illustration of General Failure Classification (ISA TR84.00.04)


There are various models for describing a device's critical failures. Some devices are non-repairable and are replaced when failure is detected. Other devices are repairable and are inspected and maintained in a manner that achieves a constant failure rate throughout their useful life. In some cases, a device may actually be a complex system of repairable/replaceable components; such complex devices may not be easily described by any one model. Figure 5.2 provides a Venn diagram illustrating failure classification, where some failures can be detected using automatic diagnostics while others remain undetected until proof test.



[Venn diagram regions: Safe Detected; Safe Undetected; Dangerous Detected; Dangerous Undetected]

Figure 5.2 Failure Classification Considering Diagnostics (ISA TR84.00.04)


The device's failure modes are characterized by the mean time to failure (MTTF), which is related to the expected operating life. For a repairable device, the failure rate can be determined from the average time between failures, the mean time between failure (MTBF). Since the MTBF spans from one failure to the next, it includes both the MTTF and the mean time to repair (MTTR):

MTBF = MTTF + MTTR

When the MTTR is small compared to the MTTF, the MTTF represents the average time that the device is in the operational condition.

To evaluate an SIS, two major classes of critical failure are examined: dangerous and safe. A dangerous equipment failure causes the process to be put in a hazardous state or puts the SIS in a condition where it may fail to operate when required; the equipment is no longer capable of responding to a demand. A safe equipment failure causes, or places the equipment in a condition where it can potentially cause, the process to achieve or maintain a safe state; the equipment takes its specified safe-state condition. In a simplex (1oo1) architecture, a safe failure results in a spurious trip.
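The MTBF relationship can be illustrated numerically; the figures below are illustrative assumptions, not values from this report:

```python
# Relationship between MTBF, MTTF, and MTTR for a repairable device.
# Values are illustrative, not taken from this report.
mttf_hours = 87_600.0   # mean time to failure (about 10 years)
mttr_hours = 8.0        # mean time to repair

# MTBF spans from one failure to the next: operating time plus repair time.
mtbf_hours = mttf_hours + mttr_hours

# When MTTR << MTTF, steady-state availability is close to 1 and the
# MTBF is dominated by the MTTF, as the text notes.
availability = mttf_hours / mtbf_hours

print(f"MTBF = {mtbf_hours:.0f} h, availability = {availability:.5f}")
```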

5.3.2

Detected and Undetected Failures

Safe and dangerous failures can be further broken down based on the ability to detect the equipment failure, resulting in four sub-classifications. The equipment failure must be analyzed to determine whether a particular failure mode is safe or dangerous based on the expected equipment operation.

Dangerous undetected (DU) failure -- Occurrence of a failure which puts the equipment in a dangerous state and lies undetected until a demand is placed upon the equipment. Synonyms include unrevealed and covert.


Dangerous detectable (DD) failure -- Occurrence of a failure which puts the equipment in a dangerous state and is detected through automated diagnostics or through the operator's normal observation of the process and its equipment. Synonyms include announced, revealed, and overt.

Safe undetectable (SU) failure -- Occurrence of a failure which puts the equipment in a safe state and lies undetected until a demand is placed upon the equipment.

Safe detectable (SD) failure -- Occurrence of a failure which puts the equipment in a safe state and is detected through automated diagnostic tests or through the operator's normal observation of the process and its equipment.

More detailed examination allows classification of the failures based on their impact on the SIF:

Dangerous failure rate: λD = 1/MTBFD (mean time between failure, dangerous)

Safe failure rate: λS = 1/MTBFS (mean time between failure, safe)

Critical failure rate: λCRIT = λD + λS

When diagnostics are provided, the dangerous failure rate can be divided into detected and undetected portions using the diagnostic coverage (DC):

Dangerous detected failure rate: λDD = DC x λD

Dangerous undetected failure rate: λDU = (1 - DC) x λD

When the detection of a dangerous failure results in the device being taken to its specified safe state, the spurious failure rate can be calculated from λS and λDD:

Spurious failure rate: λSP = λS + λDD

By substitution, λCRIT can be defined in terms of λDU and λSP:

λCRIT = λDU + λSP

These relationships are illustrated graphically in Figure 5.3 below.
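The rate relationships above can be sketched in a few lines; the rates and diagnostic coverage used here are illustrative assumptions:

```python
# Partitioning a device's critical failure rate into safe/dangerous and
# detected/undetected components, following the relationships above.
# Rates are in failures per hour; all values are illustrative.
lam_d = 1.0e-6   # dangerous failure rate (1/MTBF_D)
lam_s = 2.0e-6   # safe failure rate (1/MTBF_S)
dc = 0.9         # diagnostic coverage of dangerous failures

lam_crit = lam_d + lam_s        # total critical failure rate
lam_dd = dc * lam_d             # dangerous detected
lam_du = (1.0 - dc) * lam_d     # dangerous undetected

# If a detected dangerous failure drives the device to its safe state,
# it adds to the spurious trip rate.
lam_sp = lam_s + lam_dd

# Consistency check: lam_crit = lam_du + lam_sp
assert abs(lam_crit - (lam_du + lam_sp)) < 1e-18
```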


Figure 5.3--Safe and Dangerous-Detected and Undetected Failure Rate Diagram


The significant difference between a detected and an undetected failure is the time that the device remains in the failed state before detection. For devices with continuous on-line diagnostics, the failure is detected within the diagnostic interval (DI), also known as the mean time to detect, and the device is repaired and returned to fully operational condition within the mean time to repair (MTTR). The DI is generally significantly smaller than the MTTR, so the DI is often neglected in the analysis. For a device without on-line diagnostics, the failure is detected by proof test or demand. Assuming that the proof test is sufficient to detect the failure, the time the device stays in the failed state is set by the test interval (TI) plus the MTTR. In contrast to the DI, the TI is generally much larger than the MTTR, so it dominates the analysis. In terms of risk, the longer the time to detect a failed device, the higher the likelihood of a hazardous event.
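The difference in time-at-risk described above can be quantified with a short sketch; all values are illustrative assumptions:

```python
# Mean downtime of a failed device: with on-line diagnostics the failure
# is found within the diagnostic interval (DI); without them it waits,
# on average, half the proof-test interval (TI), since failures are
# commonly assumed to occur mid-interval. Values are illustrative.
HOURS_PER_YEAR = 8760.0

di_hours = 1.0             # diagnostic interval (mean time to detect)
mttr_hours = 8.0           # mean time to repair
ti_hours = HOURS_PER_YEAR  # annual proof test

# Detected failure: downtime is DI + MTTR, dominated by the MTTR.
downtime_detected = di_hours + mttr_hours

# Undetected failure: the device sits failed for TI/2 on average before
# the proof test reveals it, plus the repair time.
downtime_undetected = ti_hours / 2.0 + mttr_hours

print(f"detected: {downtime_detected:.0f} h, undetected: {downtime_undetected:.0f} h")
```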

5.3.3

Examples

The user needs adequate information to design a safety instrumented system so that its probability of failing to function is less than or equal to some acceptable probability, accounting for the proof test intervals associated with dangerous undetected failures and the required repair times for dangerous detected failures. The user also needs data on the frequency of spurious shutdowns due to equipment failures in order to predict plant reliability. This requires data for failure modes that go beyond the dangerous and safe classifications referenced in the industry standards. The examples below illustrate the important concepts:

Remote actuated valve
3-way solenoid valve
Transmitter
Electromechanical relay
PES


5.3.3.1

Remote Actuated Valve

Table 5.2 documents the remote actuated valve failure modes as determined by the Center for Chemical Process Safety (CCPS) Process Equipment Reliability Database (PERD) initiative as part of their taxonomy development procedure. The failure modes are applicable to the boundary diagram of a remote actuated valve as depicted in Figure 5.3.

Table 5.2 Example Failure Modes for a Remote Actuated Valve

Failure Mode: Failure Classification

Complete Failures
  Spuriously fail to closed position: Depends on application
  Spuriously fail to open position: Depends on application
  Fail to close on demand: Dangerous
  Fail to open on demand: Dangerous
  Frozen position (modulating service): Dangerous
  Valve rupture: Dangerous
  Seal/packing blowout: Dangerous

Partial Failures
  Reduced capacity: Depends on application
  Seat leakage: Depends on application
  External leak, body/bonnet: Depends on application
  External leak, packing/seal: Depends on application
  Fugitive emission: Depends on application
  Controlled variable high: Depends on application
  Controlled variable low: Depends on application
  Fail to hold position: Depends on application
  Unstable control (hunting): Depends on application
  Responds too quickly: Depends on application
  Responds too slowly: Depends on application
  Excessive noise: Depends on application

(Failure Modes Excerpted from CCPS PERD Taxonomies)


Figure 5.3--Example Boundary Diagram Remote Actuated Valve


[Boundary diagram elements: Power; Air Supply; Switches & Other Monitoring Devices; Control Signal Input; Control Signal Output; Instrument Air; Air Regulation & Filtration]

(Boundary diagram excerpted from CCPS PERD taxonomies)

After considering the failure mode examples in Table 5.2, it should be apparent that the particular application has a significant impact upon whether a particular failure mode results in a dangerous loss-of-protection failure or a spurious shutdown. The following examples help to illustrate:

Single isolation valve on fuel gas feed to furnace: In this case, both Fail to Close and Seat Leakage would be considered dangerous failure modes. Experience tells us that Seat Leakage occurs much more frequently than a complete failure like Fail to Close; that is why double block and bleed isolation valve arrangements are often employed.

Double block and bleed isolation valves on fuel gas feed to furnace: In this installation, Seat Leakage is still a dangerous failure, but fault tolerance has been used to lessen its likelihood of having a significant negative impact. Consider the bleed valve, however. If it were to Fail to Open during a shutdown, the significance of primary valve Seat Leakage would increase significantly. As such, the failure mode Fail to Open would be considered dangerous for the bleed valve, while it would not be for the primary isolation valves. Depending upon the risks being considered and the vent system design, the bleed valve Spuriously Opening during normal operation may or may not be dangerous.

Cryogenic liquid isolation valve on feed to vaporizer: In the event that the feed flow through the vaporizer exceeded its capacity, there would be the potential for embrittlement and rupture of warm-end downstream piping. Therefore, the failure mode Fail to Close is clearly a dangerous failure in this case. Assuming the valve closes, however, Seat Leakage would not be considered a dangerous failure, as the vaporizer would perform its function in an inherently safe manner by significantly reducing the flow.

5.3.3.2

3-Way Solenoid Valve

Table 5.3 documents three-way solenoid valve (used in SIF service on remote actuated valve pneumatic actuators) failure modes as determined by the Center for Chemical Process Safety (CCPS) Process Equipment Reliability Database (PERD) initiative as part of their taxonomy development procedure. The failure modes are applicable to the boundary diagram of a 3-way solenoid valve as depicted in Figure 5.4.


Table 5.3 Example Failure Modes, Causes and Mechanisms for a 3-Way Solenoid Valve

Complete Failures
  Fail to vent actuator: Dangerous
    Causes: plugged port; stuck seat; coil burnout in energize-to-trip system
  Spuriously vent actuator: Safe
    Cause: coil burnout in de-energize-to-trip system

Partial Failures
  Vent response slow: Potentially dangerous
    Cause: partially plugged port
  Vent response too quick: Safe or Dangerous
    Cause: oversized vent port for application
  Partially vented actuator: Safe
    Cause: vent port leakage

(Failure Modes Excerpted from CCPS PERD Taxonomies)

Figure 5.4 3-Way Solenoid Valve Boundary Diagram

(Boundary diagram excerpted from CCPS PERD taxonomies)


5.3.3.3

Transmitter

Table 5.4 provides failure causes for an electronic transmitter. Each failure mode results in an erroneous signal, which may be identified as degraded or complete, depending on the device specification and pass-fail criteria. These failure modes may be further classified based on their effect on the equipment operation. For example, a transmitter failing high would be a safe failure if the function normally takes the safe state on high process variable, but a dangerous failure if the trip occurs on low process variable.

Table 5.4 Example Failure Modes, Causes and Mechanisms for an Electronic Pressure Transmitter (5)

Complete Failures
  Signal output saturated high (i.e., > 100 %): Dependent on application (2)
    Cause: electronic failure (corrosion; ageing; thermal stress)
  Signal output frozen: Dangerous
    Causes: isolation valve closed (human error); impulse line plugged (solids precipitation from process; liquids frozen due to ambient temperature); left in the test mode (human error); electronic failure (corrosion; ageing; thermal stress)
  Signal output saturated low (i.e., < 0 %): Dependent on application (3)
    Cause: electronic failure (corrosion; ageing; thermal stress)

Partial Failures
  Signal output high: Dependent on application (2)
    Causes: electronic failure (corrosion; ageing; thermal stress); out of adjustment (human error); sensor deformation (water hammer; mis-installation; process upset); build-up of fluid in impulse line
  Signal output low: Dependent on application (3)
    Causes: electronic failure (corrosion; ageing; thermal stress); out of adjustment (human error); sensor deformation (water hammer); impulse line partially plugged (solids precipitation from process; liquids partially frozen due to ambient temperature)
  Signal output slow to respond: Dependent on total safety time (4)
    Causes: impulse line partially plugged; impulse line crimped (mechanical damage); loss of seal fluid (mechanical damage; material corrosion); impulse line leak (vibration; corrosion; mechanical damage)
  Signal output too fast: Dependent on application
    Cause: electronic failure (corrosion; ageing; thermal stress)
  Signal output erratic: Dangerous
    Cause: electronic failure (corrosion; ageing; thermal stress)

Utility Impact
  Power supply output high: ??????
  Power supply output low: Dangerous
  Power supply no output: Safe for de-energize to trip; Dangerous for energize to trip
  Power supply output AC ripple (degraded DC source wave): Unpredictable, Dangerous or Safe
    Causes: electrolytic capacitor failure (capacitor dry-out); EMI/RFI (improper installation; inadequate design)

(1) Failure Modes Excerpted from CCPS PERD Taxonomies
(2) Trip on high process variable safe, while trip on low process variable dangerous
(3) Trip on low process variable safe, while trip on high process variable dangerous
(4) Any time it exceeds its contribution to the total safety time, it is dangerous
(5) Failure causes and mechanisms are intended as examples and are not complete

Figure 5.5 Transmitter Boundary Diagram

(Boundary diagram excerpted from CCPS PERD taxonomies)


5.3.3.4

Electromechanical Relay

Table 5.5 Example Failure Modes, Causes and Mechanisms for an Electromechanical Relay

Complete Failures
  Contact fails to open: Depends on application
    Causes: contacts fouled; contacts welded; coil burned out
  Contact fails to close: Depends on application
    Causes: contacts corroded; contacts stuck; coil burned out
  Contacts spuriously open
    Causes: contacts corroded; coil burned out
  Contacts spuriously close
    Causes: coil burned out; vibration
  Contacts chatter
    Cause: vibration

Partial Failures
  Contacts open late
    Causes: contacts fouled; mis-installation
  Contacts close late
    Cause: contacts fouled

(Failure Modes Excerpted from CCPS PERD Taxonomies)

Figure 5.6 Electromechanical Relay Boundary Diagram

(Boundary diagram excerpted from CCPS PERD taxonomies)


5.3.3.5

PES in Protection Service

Table 5.6 Example Failure Modes, Causes and Mechanisms for a Programmable Electronic System in Protection Service With Discrete Outputs

Complete Failures (classification for each mode: Depends on application)
  Discrete output channel fails to open the external circuit
    Causes: analog input card component failure; communication error; logic solver error; component failure on discrete output card
  Discrete output channel fails to close the external circuit
    Causes: analog input card component failure; communication error; logic solver error; component failure on discrete output card
  Group of n discrete output channels on a card fail to open their external circuits
    Causes: analog input card common-mode failure; communication error; logic solver error; common-mode component failure on discrete output card
  Group of n discrete output channels on a card fail to close their external circuits
    Causes: analog input card common-mode failure; communication error; logic solver error; common-mode component failure on discrete output card
  All discrete output channels on a card fail to open their external circuits
    Causes: analog input card common-mode failure; communication error; logic solver error; common-mode component failure on discrete output card
  All discrete output channels on a card fail to close their external circuits
    Causes: analog input card common-mode failure; communication error; logic solver error; common-mode component failure on discrete output card
  All discrete output cards fail to open their external circuits
    Causes: communication error; logic solver error
  All discrete output cards fail to close their external circuits
    Causes: communication error; logic solver error

For each of the complete failure modes above, the failure mechanisms include: analog input channel signal saturates high; analog input channel excessive drift high; analog input channel signal freezes; analog input channel excessive drift low; analog input channel signal saturates low; configuration error; programming error.

Partial Failures
  Diagnostic fails to detect failure
    Causes: component failure; configuration error; software error
    Mechanisms: corrosion; mechanical shock; manufacturing defect; human error
  Diagnostic spuriously initiates action
    Causes: component failure; configuration error; software error
    Mechanisms: corrosion; mechanical shock; manufacturing defect; human error

(Failure Modes Excerpted from CCPS PERD Taxonomies)

Figure 5.7 Programmable Electronic System Boundary Diagram

(Boundary diagram excerpted from CCPS PERD taxonomies)


5.4

Failure Rates

The failure behavior of a population of hypothetical devices over their lifecycle (i.e., from production to disposal) is commonly represented by the bathtub curve. The bathtub curve for a device type is developed by counting the failures that occur in a population of identical (or sufficiently similar) devices over a certain period of time. When the devices are in their useful life period, the collected data set is used to calculate the failure rate using statistical techniques. The plot shows how the overall failure rate changes with time. As indicated by Smith (1997), the bathtub curve is in reality composed of three overlapping curves, one for each of three regions. The initial failure rate of the hypothetical device is driven by its burn-in, infant mortality, or early failure rate, which declines rapidly. The middle, flat section of the curve represents the useful life of the device and is typically characterized by a constant failure rate; the SIF performance calculation is based on this middle section. The last part represents the wear-out or end-of-life failure rate and is characterized by an increasing failure frequency.

Early failures occur during a component's initial life and are caused by manufacturing, assembly, test, installation, and commissioning errors. Manufacturing flaws in the components that comprise the device may cause device failures during the burn-in period. Manufacturers affect the shape of this curve when they perform burn-in and function testing prior to release from production. Early failures also occur as a result of device handling and installation. Devices can be damaged during shipment, unpacking, storage, transport to the work site, and installation. Many early failures are caused by rough device handling, improper pre-installation storage, poor installation practices, and sloppy construction practices.
Materials used for installation activities, such as paint, pipe dope, insulation, and small pieces of welding rod, have been shown to cause devices to fail by getting into places where they are not supposed to be. Water and moisture intrusion during installation and commissioning can seriously damage SIS equipment. Rigorous inspection, commissioning, and validation activities are necessary to identify and correct these failures. However, not all flawed components fail immediately, and some early failures occur during the useful life of the device; this is indicated by the extension of the burn-in curve into the useful life and wear-out regions. The transition between burn-in and useful life occurs when the failure rate becomes constant.

Failures during the useful life are caused mainly by random events that increase the stress on the device beyond its physical limits. When the wear-out failures become dominant, the overall failure rate increases; this indicates the end of the device's useful life. The wear-out period is characterized by an increased slope related to the device technology. Programmable devices tend to have a very sharp increase in failure rate due to the large number of aging components. Electromechanical devices tend to have a more gradual increase in failure rate as they age, and the slope of their failure rate curve is heavily affected by how well their mechanical components are inspected and maintained. Preventive maintenance extends the useful life of electromechanical devices, and a lack of inspection and preventive maintenance has been cited as a primary cause of early failure. As with infant mortality, not all wear-out failures occur after the end of the useful life; these failures can happen during any stage of the device lifecycle, and their rate of occurrence increases as time passes. The device should be replaced when maintenance records demonstrate that it has reached its wear-out phase.
When infant mortality and end-of-life issues are addressed within the device installation, commissioning and maintenance plans, it is assumed that the device failure rate is constant. The useful


life section of Figure 4.7 illustrates the predictable failure rate that is the goal of any effective inspection and preventive maintenance program.

Figure 4.7 Bathtub Curve

For SIF analysis, it is very common to assume a constant failure rate; however, it has been demonstrated that this region does not exist for many devices. Some devices, such as block valves, are characterized by dominant failure mechanisms that are a function of time (i.e., lower failure rate at early stages of useful life, higher failure rate at wear-out stages). Other statistical distributions, such as the Weibull or log normal, should be considered when the failure rate varies significantly with time.

The random failure rate is characteristic of the device when it is operated in accordance with its specification. The failure rate is estimated based on specific operational and maintenance conditions implicit in the collected data. Reducing inspection and maintenance rigor will affect a device's failure rate. For example, mechanical equipment has a higher likelihood of failure when left in one position for extended periods of time; periodic movement of mechanical equipment reduces the likelihood of failure for some failure mechanisms. Operating the device outside of the manufacturer's specification may damage or stress it, thereby causing early failure. Unexpected process application impact, such as chemical attack, corrosion, and deposition, or external environmental impact, such as electromagnetic interference, vibration, and heat, can also shorten its useful life.

Reliability practitioners often work with failure rates expressed as failures per million hours. The critical failure rate λCRIT can be determined by examining a pool of N devices over a period of time t and counting the number of failures Nf:

λCRIT = Nf / (N x t)

Failure rates may be inverted and expressed as a mean time to failure (MTTF), often stated in years. Note that MTTF and component life are not the same.
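The rate estimate above can be illustrated with a short sketch; the pool size, observation window, and failure count used here are hypothetical:

```python
# Estimating a critical failure rate from pooled field data, and
# converting between failures per million hours and MTTF in years.
# All numbers are illustrative, not values from this report.
HOURS_PER_YEAR = 8760.0

n_devices = 200    # pool size N
obs_years = 5.0    # observation window t
n_failures = 9     # critical failures Nf observed in the pool

# lambda_CRIT = Nf / (N x t), with t expressed in hours.
unit_hours = n_devices * obs_years * HOURS_PER_YEAR
lam_crit = n_failures / unit_hours              # failures per hour
lam_per_million_hours = lam_crit * 1.0e6

# Inverting the rate gives the MTTF, here stated in years.
mttf_years = 1.0 / (lam_crit * HOURS_PER_YEAR)

print(f"{lam_per_million_hours:.2f} failures per million hours, "
      f"MTTF about {mttf_years:.0f} years")
```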

5.4.1

PES failure rates

The table below presents data for logic solver components submitted by seven logic solver suppliers that have a global presence in the process sector. The data provided are an average of the values submitted. It is


provided to serve users so they may benchmark values obtained from other sources. The values used in user calculations should originate from the logic solver supplier.

Table 4.6--Hardware failure rates

Item (failure rate, failures/million hours)                Low      Typical  High
Main Processor Board (memory, bus logic, communication)    2.50     5.00     10.00
Backup Control Unit                                        2.50     5.00     10.00
I/O Processor/Common logic I/O module                      0.10     0.20     0.40
Single Digital Input Circuit                               0.10     0.20     0.40
Single Digital Output Circuit                              0.05     0.10     0.20
Single Analog Input Circuit                                0.25     0.50     1.00
Single Analog Output Circuit                               0.20     0.50     2.00
Relay (industrial type)                                    1.50     2.50     5.00
Electromechanical Timer                                    0.10     0.20     0.40
Solid state: Input circuit                                 0.10     0.20     0.40
Solid state: Output circuit                                0.01     0.10     0.20
Solid state: Logic gate                                    0.10     1.00     2.00
Solid state: Timer                                         0.05     0.10     0.20
Inherently fail-safe solid state: Input circuit            0.10     0.20     0.40
Inherently fail-safe solid state: Output circuit           0.001    0.01     0.10
Inherently fail-safe solid state: Logic gate               0.05     0.50     1.00
Inherently fail-safe solid state: Off delay timer          0.20     0.40     0.80
Analog Trip Amplifier                                      2.50     5.00     10.00
Power supply                                               12.00    25.00    50.00

Common Cause Failures
Common Cause Factor (fraction)                             0.005    0.01     0.05

5.4.2

Failure rate data for commonly used field instrumentation

In order to predict the MTTFspurious and PFDavg of an SIS, one must have failure rate data for the different components, such as the sensor, logic solver, and final elements. Failure rate data may come from a variety of sources, such as public databases, user-compiled maintenance records, vendor-compiled field returns, reliability calculations, or operating experience (13-18). THE NUMBERS IN TABLE 4.7 WERE COMPILED FROM USER DATA AND ARE IN NO WAY ENDORSED BY ISA. THEY DO NOT REFLECT THE VARIABILITY OF FAILURE RATES DUE TO THE SEVERITY OF DIFFERENT PROCESSES, THE DIFFERENT FAILURE MODES OF THE ACTUAL DEVICES, NOR HOW THE DATA WAS ACTUALLY COLLECTED BY EACH COMPANY. THE USER IS CAUTIONED IN THEIR USE AND SHOULD TRY TO COMPILE THEIR OWN DATA.


Table 4.7--Example MTTFD and MTTFSP in years for common field instrumentation
Company A MTTF Sensors Flow Switch Pressure Switch Level Switch Temp. Switch Pressure Transmitter (service < 1500 psig) Pressure Transmitter (service 1500 psig) Level Transmitter Flow Transmitter Orifice Meter Mag Meter Coriolis Meter Vortex Shedding Temp. Transmitter Flame Detector Thermo couple RTD (Resistance Temp. Detect.) Vibration Proxmitor Combustible Gas Detector Final Elements (See Next Page) 20 20 60-80 30-40 30 15 75 10,000 75 1 75 75 30-50 40-50 40-60 40-60 40-60 15-30 60-80 15-25 20-25 20-30 20-30 20-30 5-15 30-40 40 20 20 10 100 20 40 100 40 50 40 20 25 15 10 20 160 100 65 50 20 150 76.1 2 150 35 15 100 100 20-30 20-30 20-30 20-30 40-60 10-15 10-15 10-15 10-15 20-30 10 35 25 15 50 5 15 5-10 5 25 7 16 80 10 60 8 20 60 20 60 25 35 30 10 55 20 60 12 55
D

Company B
spur

Company C
spur

Company D
spur

Company E
spur

MTTF

MTTF

MTTF

MTTF

MTTF

MTTF

MTTF

MTTF

MTTF

spur

100

100

40-60

20-30

60

60

55

55

50

50

40-60

20-30

30

15

25

25

35

15

40-60

20-30

10

5 2.8


Table 4.7 (continued)--Example MTTFD and MTTFSP for common field instrumentation


Company A MTTF Final Elements Air Operate Gate Valve Air Operate Globe Valve Air Operate Ball Valve Solenoid (DTT) Solenoid (ETT) Motor Starter Hydraulic Operated Valve 50 50 50 100 30 50 50 50 10 100 10001500 25 Ball 15 135 30 80 30-50 40-60 40-60 25-35 15-25 20-30 20-30 12-15 50 60 50 50 25 25 25 25 40 40 40 100 40 40 120 15 40 40 40 125 40 40 40
D

Company B
spur

Company C
spur

Company D
spur

Company E
spur

MTTF

MTTF

MTTF

MTTF

MTTF

MTTF

MTTF

MTTF

MTTF

spur

Motor Operated Ball Valve

ElectroMechanical Relay Annunciator Current Switch Sensors

15002500

70 4

40 10

25-35 (See Previous Page)


5.5

Probability of Failure

The probability of failure is the metric used to describe the safety integrity for the SIF. It is the probability that the safety instrumented function will fail in a manner which will render it incapable of performing its intended safety function. As such, the SIF will be unable to respond to a demand and no safety action (e.g. shutdown) will be initiated. PFD is usually expressed as PFDavg, which is the average value over the proof test interval and is used to define four SILs, as given in ANSI/ISA-84.00.01-2004. To satisfy the requirements of a given SIL, the PFDavg should be less than the upper limit as provided in ANSI/ISA-84.00.01-2004. Given the above, we can state that in general: Probability of Failure = f (failure rate, repair rate, test interval, common cause, etc.) The specific functionality (f) is an attribute of the system architecture selected for the SIS (E/E/PES and field devices). The proof test interval is that period of time over which the SIF must operate within the PFD limits of the specified SIL. It could be equal to a mission time and should be as long as possible to reduce the possibility of human error associated with frequent proof testing of the SIS. The calculations assume that the device faults are repaired after detection and the device is returned to service in the good as new condition. The dangerous failure rate can be determined from the mean time D to failure dangerous (MTTF ):

λ^D = 1 / MTTF^D

The dangerous undetected failure rate is strongly impacted by the diagnostic coverage factor of the system (for random hardware failures). Comprehensive internal diagnostics dramatically improve the diagnostic coverage factor, thereby reducing the possibility of a dangerous undetected failure and consequently minimizing the PFD. It is important to recognize that the SIF configuration, including the number and quality of field devices, has a major impact on the PFD for the SIF. If one assumes a single input with a MTTF^DU of 30 years, a single output with a MTTF^DU of 50 years, and a logic solver with a MTTF^DU of 100 years, about 85% of the SIF PFD will fall in the field devices and about 15% in the logic solver. On the other hand, using redundant inputs and outputs with similar MTTF^DU, configured for high safety availability and with the same logic solver, the PFD for the field devices may account for only 30% and the logic solver may account for 70% of the SIF PFD. The distribution of SIF PFD will change as the number and type of system elements and the configuration vary. Therefore, enhancing the PFD of the logic solver (E/E/PES) of the SIF without corresponding enhancement of the field devices may not be effective in improving the SIF SIL. Making a change to improve one portion of the system (e.g., adding a redundant PES) may not have a significant impact on the overall SIF. Thus, the overall SIF must be evaluated.

Equation D.19 represents the PFDavg for a single device assuming a constant failure rate. Since all things degrade with time, a constant failure rate can only be achieved when the device is subjected to a minimum level of inspection, preventive maintenance, and proof testing. The amount of inspection and preventive maintenance required is affected by many factors, such as the device manufacturer, the operating environment, and support system quality.
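The 85%/15% split quoted above can be checked directly from the MTTF^DU values. The sketch below assumes simple 1oo1 subsystems and the rare event approximation PFDavg ≈ λ·TI/2 with a common proof test interval (the interval cancels out of the ratio):

```python
# Sketch: PFD contribution split between field devices and logic solver.
# Assumes 1oo1 subsystems and the rare-event approximation
# PFDavg ~= lambda * TI / 2 with a common proof test interval TI.

TI = 1.0  # proof test interval, years (assumed; cancels in the ratio)

# Dangerous undetected failure rates (per year) from MTTF_DU in years
lam_input = 1.0 / 30   # single input, MTTF_DU = 30 yr
lam_output = 1.0 / 50  # single output, MTTF_DU = 50 yr
lam_logic = 1.0 / 100  # logic solver, MTTF_DU = 100 yr

pfd_field = (lam_input + lam_output) * TI / 2
pfd_logic = lam_logic * TI / 2
total = pfd_field + pfd_logic

print(f"field device share: {pfd_field / total:.0%}")  # ~84%
print(f"logic solver share: {pfd_logic / total:.0%}")  # ~16%
```
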
When N devices are installed, voting subsystems can be created, where M represents the number of devices that must agree out of the total N. MooN (M out of N) voting subsystems can be modeled using Markov modeling or fault tree analysis. Markov modeling and fault tree analysis yield different numerical results for architectures where M ≠ N. Both approaches are considered acceptable for the PFDavg calculation. The error associated with the data assumptions is generally significantly larger than the error introduced by the equation. The derivations shown in this section assume an exponential failure distribution, which may not be true for all devices. Other failure distributions may apply, such as the Weibull, log normal, Poisson, or binomial. Given the number of assumptions required for the analysis, the calculations should not be the only form of evidence used to justify deviation from what is considered good engineering practice within a particular market sector.

6.1 Instantaneous Probability of Failure

The equations for instantaneous PFD can be derived by examining the transition of the component from the working state to the failed state. For standby equipment, there are only two states, as shown in Figure 6.1. State 1 represents the state where the component is available to perform its function. State 2 represents the state where the component is not available to perform its function. The transition between State 1 and State 2 is the product of the failure rate of the component and the time increment Δt.

Figure 6.1--Representation of the States of a Device


The probability of the component being in State 1 can be derived as follows:

P1(t + Δt) = P1(t) - λ P1(t) Δt

Rearranging,

[P1(t + Δt) - P1(t)] / Δt = -λ P1(t)

Taking the limit as Δt → 0,

lim (Δt → 0) [P1(t + Δt) - P1(t)] / Δt = -λ P1(t)

dP1(t)/dt = -λ P1(t)


Using the Laplace transform, the equation for dP1(t)/dt can be restated as:

L{dP1(t)/dt} = s P1(s) - P1(0)

s P1(s) - P1(0) = -λ P1(s)

At the initial condition, t = 0, P1(0) = 1. Therefore,

s P1(s) - 1 = -λ P1(s)

Rearranging and solving for P1(s),

P1(s) = 1 / (s + λ)

To convert from the Laplace domain to the time domain, the following transform pair is used:

f(s) = 1 / (s - a) corresponds to f(t) = e^(at)

Therefore, in the time domain, the probability of the component being in State 1 at any time t is

P1(t) = e^(-λt)

For the evaluation of a SIF, the SIL is related to the probability of the component being in State 2, the unavailable state, where P2(t) = 1 - P1(t):

P2(t) = PFD(t) = 1 - e^(-λt)
NOTE Equation I.28 is the instantaneous PFD as a function of any selected time.

When t → ∞, PFD(t) = 1 - e^(-λt) leads to PFD(∞) = 1. This means that a system, or an element of a system, without repair inevitably reaches the failed state. Figure 6.2 illustrates the probability of failure of an element as a function of time, assuming λ = 0.0001 hr^-1 and a time interval of 10 years.
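The Figure 6.2 example can be reproduced numerically; a minimal sketch using the stated λ = 0.0001 hr^-1:

```python
import math

# Instantaneous PFD(t) = 1 - exp(-lambda * t) for the Figure 6.2 example:
# lambda = 0.0001 per hour, evaluated across a 10-year interval.
lam = 1.0e-4           # failures per hour
hours_per_year = 8760

for years in (1, 5, 10):
    t = years * hours_per_year
    pfd = 1 - math.exp(-lam * t)
    print(f"t = {years:2d} yr: PFD(t) = {pfd:.3f}")
```

As expected for an unrepaired element, PFD(t) approaches 1 near the end of the interval.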


[Figure: P0(t) and P1(t) plotted versus time in years, probability scale 0 to 1]

Figure 6.2 Probability as a Function of Time


Sometimes, the instantaneous probability equation is shown in its rare event form, which is applicable when λt < 0.1. To determine the rare event form of the equation, the exponential series expansion is used for the exponential term:

e^(-λt) = 1 - λt + λ²t²/2 - λ³t³/6 + λ⁴t⁴/24 - ...

PFD(t) = 1 - (1 - λt + λ²t²/2 - λ³t³/6 + ...)

PFD(t) = λt - λ²t²/2 + λ³t³/6 - ...

When the rare event assumption is valid, the second order and higher terms become very small and can be neglected. In practice, the rare event approximation provides good results for most SIFs when λt < 0.1. The instantaneous PFD can then be calculated as

PFD(t) = λt

NOTE Equation B.22 is the rare event approximation of the instantaneous PFD as a function of any selected time.

The instantaneous PFD(t) is a snapshot of the failure probability of the device taken at time t. The initial state is at time = 0, where the probability of failure is zero. At time = ∞, the probability of failure is 1. As shown in Figure 6.2, PFD(t) increases exponentially with time. When the device is proof tested at the test interval, TI, the PFD(t) is reduced to its initial value. This involves two implicit assumptions: 1) all device failures are detected by inspection and proof test, and 2) the device is repaired and returned to service in as good as new condition. The effect of the proof test is illustrated by the saw tooth shape shown in Figure 6.3.

Figure 6.3--Typical Saw Tooth Shape for the PFD(t).

6.2 PFDavg

To calculate the average PFD, the instantaneous PFD must be averaged over a defined time interval. For safety instrumented system evaluations, this time interval is the proof test interval. The equation for PFDavg is derived by integrating the PFD(t) from time 0 to the test interval, TI, assuming TI >> MTTR, and dividing by the test interval:

PFDavg = (1/TI) ∫₀^TI PFD(t) dt

PFDavg = (1/TI) ∫₀^TI (1 - e^(-λt)) dt

Integrating the terms,

PFDavg = (1/TI) [ t + e^(-λt)/λ ] evaluated from 0 to TI

Substituting the bounds of the integration,

PFDavg = (1/TI) [ (TI - 0) + (e^(-λTI) - e^(-λ(0)))/λ ]

Rearranging,

PFDavg = (1/TI) [ TI + (e^(-λTI) - 1)/λ ]

This results in one of the most common forms of the PFDavg equation, describing standby components, such as those used in safety instrumented functions:

PFDavg = 1 + (e^(-λTI) - 1)/(λTI)
NOTE Equation I.37 is the equation for the average probability of failure on demand for a basic event at the defined test interval (TI).

Sometimes, the rare event equation is used for the PFDavg. As shown previously, the exponential series expansion is used for the exponential term:

e^(-λt) = 1 - λt + λ²t²/2 - λ³t³/6 + λ⁴t⁴/24 - ...

1 - e^(-λt) = 1 - (1 - λt + λ²t²/2 - λ³t³/6 + λ⁴t⁴/24 - ...)

PFDavg = (1/TI) ∫₀^TI (λt - λ²t²/2 + λ³t³/6 - λ⁴t⁴/24 + ...) dt

PFDavg = (1/TI) [ λt²/2 - λ²t³/6 + λ³t⁴/24 - ... ] evaluated from 0 to TI

Substituting the bounds of the integration,

PFDavg = (1/TI) [ λ(TI² - 0²)/2 - λ²(TI³ - 0³)/6 + λ³(TI⁴ - 0⁴)/24 - ... ]


For λTI < 0.1, the third order and higher terms may be neglected. The rare event equation can then be shown as:

PFDavg = (1/TI)(λTI²/2)

PFDavg = λTI/2

This is the basic simplified equation.

NOTE Equation I.44 is the rare event approximation of the average probability of failure on demand for a basic event at the defined test interval (TI).
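The exact and rare event forms can be compared numerically; a sketch (the failure rate value is an assumption, chosen so that λTI spans the validity limit):

```python
import math

def pfd_avg_exact(lam, ti):
    """Exact single-device PFDavg: 1 + (exp(-lam*TI) - 1) / (lam*TI)."""
    return 1 + (math.exp(-lam * ti) - 1) / (lam * ti)

def pfd_avg_rare(lam, ti):
    """Rare-event approximation: lam * TI / 2."""
    return lam * ti / 2

lam = 0.01  # dangerous failure rate, per year (assumed)
for ti in (1, 5, 10):
    exact = pfd_avg_exact(lam, ti)
    rare = pfd_avg_rare(lam, ti)
    print(f"TI = {ti:2d} yr (lam*TI = {lam * ti:.2f}): "
          f"exact = {exact:.5f}, rare event = {rare:.5f}")
```

The two forms agree closely while λTI < 0.1 and diverge slowly beyond it, with the rare event form conservative (larger).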

6.3 Effect of MTTR

The mean time to repair (MTTR) is the average time to repair equipment, from the time of failure to its return to normal operation. The mean time to repair is sometimes referred to as the mean time to restoration. This time includes the time required to identify the failure, as well as the physical repair time necessary to return the equipment to service in the as good as new condition. The achievable MTTR is affected by maintenance personnel availability, travel time, equipment location, safe access, spares, service contracts, environmental constraints, and permit requirements. These factors determine how quickly maintenance can safely plan, execute, and validate work activities.

The equipment is considered non-repairable when the SIS equipment cannot be repaired on-line. For the process, the MTTR represents the minimum process downtime. Depending on the time required to achieve the shutdown condition and the time required to re-start the process, the total process downtime can be substantially longer.

The equipment is considered repairable when the SIS equipment is repairable on-line. In this case, the MTTR typically represents a time interval where the process is being operated with a degraded SIF. The out of service period should be treated as a temporary operating mode, with procedures identifying the compensating measure that offsets the loss of protection, i.e., provides protection equivalent to the failed equipment for the repair period. Compensating measures are often implemented by the operator upon notification of the detected fault to safeguard the process during the outage period.


For dangerous undetected failures, the exposure time to failure on demand equals the mean time to detect the failure plus the mean time to repair: MTDF + MTTR ≈ MTDF. Dangerous undetected failures are revealed only by a proof test or by a demand, whichever comes first, and then they are repaired. For dangerous detected failures, the exposure time is MTDF + MTTR ≈ MTTR, because detected failures are immediately revealed and repaired.

Figure 6.4 Relationship between MTTR, Proof Test, and Exposure Time
Compensating measures may restrict production rates, limit operating modes, and/or require additional operator staffing. Some compensating measures may be difficult to maintain for long time periods. Limitations on operational activities should be considered during the process requirements specification and operating procedure development. Compensating measures should be capable of acting within one-half the process safety time.

Equation ? for the PFDavg assumes that dangerous failure and repair occur once in the MTTF^D. Many calculations assume that the contribution of the MTTR is negligible, since the MTTR is much smaller than the TI. This assumption was made in the derivation of Equation ?. If safe failures are detected and repaired on-line, they also contribute to a loss of SIF availability. However, most calculations only account for the safe failures as part of the spurious trip evaluation.

While it is true that the MTTR is typically much smaller than the TI, the MTTR does represent a period of unavailability. For example, assume the equipment is out of service for repair once per year (8,760 hr) rather than once in the MTTF^D, and the repair takes 72 hours. The fractional dead time is 72 hr / 8,760 hr = 0.008, so the equipment is available only 99.2% of the time due to repair and test. If the repair time is reduced to 8 hours, the fractional dead time is reduced to 0.0009, and the equipment availability is 99.91%. Multiple on-line repairs can significantly decrease the equipment availability, because each repair is another increment of fractional dead time.

The significance of MTTR to safe and reliable operation is not adequately addressed by conventional calculations, which focus on the random hardware failures. Frequent or extended repair is considered a systematic problem that should be addressed through management system work processes. User approval should give preference to reliable equipment which experiences few failures. Priority should be placed on SIF equipment maintenance, repair, and testing to minimize the out of service periods.
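The fractional dead time arithmetic above can be sketched directly (the repair frequency of once per year is the assumption from the example; the printed availabilities are unrounded, so they differ slightly from the rounded values in the text):

```python
# Fractional dead time from on-line repair: one repair per year assumed,
# compared for a 72-hour and an 8-hour repair.
HOURS_PER_YEAR = 8760

for mttr in (72, 8):
    fdt = mttr / HOURS_PER_YEAR        # fraction of the year under repair
    availability = 1 - fdt
    print(f"MTTR = {mttr:2d} hr: fractional dead time = {fdt:.4f}, "
          f"availability = {availability:.2%}")
```
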


The allowable repair time should balance reasonable maintenance expectations for MTTR with operations expectations for the compensating measures. The equipment safety manual may specify a maximum MTTR, which should be met unless deviation is justified by the design basis. When repair cannot be completed within the allowable repair time, additional review should be conducted and higher management approval should be obtained. This review and approval is typically a management of change activity.

When a failure is detected, the device is either repaired or replaced. This activity results in a time period where the device is not available to perform its function. This time period is the MTTR, which is the average repair time given site maintenance practice. The derivation of Equation D.19 assumed that the MTTR is significantly less than the TI, which should be true for any SIS. Equation ? adds the contribution of MTTR to the PFDavg:

PFDavg = λ^D TI / 2 + λ^D MTTR

The λ^D MTTR term represents the period when the device is unavailable because it is under repair or replacement. As shown in Equation D.20, this period is assumed to occur only once in the MTTF^D of the device. However, repair periods also occur in response to safe and detected faults. Devices should be removed from the approved equipment list when they are not performing as required. Otherwise, the design and installation practices should be modified to reduce the likelihood of device failure. Consequently, the probability of finding a repairable system in either the working or failed state is modeled by means of ordinary differential equations:

dP0(t)/dt = -λ P0(t) + μ P1(t)

dP1(t)/dt = λ P0(t) - μ P1(t)

where

Pi(t) is the probability of finding the element in state i,
λ is the failure rate of the element, and
μ is the repair rate.

Integration of this system with the initial conditions P0(0) = 1 and P1(0) = 0 leads to the well-known expressions for the availability and unavailability of a repairable element:

P0(t) = 1 - (λ/(λ+μ))(1 - e^(-(λ+μ)t)) = μ/(λ+μ) + (λ/(λ+μ)) e^(-(λ+μ)t)

P1(t) = (λ/(λ+μ))(1 - e^(-(λ+μ)t))

Letting t → ∞ leads to the steady-state values of the probabilities:


P0(∞) = μ/(λ+μ) = (1/MTTR) / (1/MTTF + 1/MTTR) = MTTF / (MTTF + MTTR) = MTTF/MTBF = A

P1(∞) = 1 - A = λ/(λ+μ) = MTTR/MTBF = U

where

A is the availability, and
U is the unavailability.

6.4 Effect of Bypassing

{Introduce fractional deadtime}

6.5 Effect of Proof Testing and Diagnostics

- Discuss importance of proof testing to demonstrate that devices are maintained in the as good as new condition.
- Discuss assumption related to perfect testing.
- Discuss impact of partial testing.

A proof test can be divided into a series of tests conducted at different test intervals. All tests should be executed during validation and the scheduled, off-line proof test. A partial test reduces the theoretical PFD(t) by the degree of its test coverage. The effect of partial testing is illustrated in Figure 6.5, which provides the PFD(t) on a logarithmic scale as a function of time.

Figure 6.5--Effect of Partial Testing on PFD(t).

Each test often detects only specific failure modes. The ratio of the dangerous detected failures to the total dangerous failures is the diagnostic coverage (DC). The PFD(t) is determined for each partial test using the appropriate diagnostic coverage and the partial test interval. Care should be taken to ensure that the total diagnostic coverage provided by the tests does not exceed a value of 1. For example, if the partial test (PT) detects 70% of the failures, the full test (FT) detects that 70% plus the remaining 30%. The equation is as follows:

PFD(t) = (1 - 0.7) λ^D TI_FT + (0.7) λ^D TI_PT
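The partial test equation above can be sketched numerically; the failure rate and test intervals below are assumptions chosen only for illustration:

```python
# Sketch: effect of a partial test (PT) covering 70% of dangerous failures,
# combined with a less frequent full test (FT), per the equation above.
lam_d = 1.67e-2   # dangerous failure rate, per year (assumed)
ti_ft = 5.0       # full proof test interval, years (assumed)
ti_pt = 1.0       # partial test interval, years (assumed)
pt_coverage = 0.7 # fraction of dangerous failures the partial test detects

pfd_partial = (1 - pt_coverage) * lam_d * ti_ft + pt_coverage * lam_d * ti_pt
pfd_full_only = lam_d * ti_ft

print(f"PFD with partial testing: {pfd_partial:.4f}")
print(f"PFD with full test only:  {pfd_full_only:.4f}")
```

With these assumed values, the frequent partial test reduces the end-of-interval PFD by more than half relative to relying on the five-year full test alone.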


6.6 Effect of Voting

The process requirements provide the general functional description that is turned into an SIF architecture consisting of input subsystems (e.g., a single transmitter or voting switches), logic (e.g., discrete, calculation), and output subsystems (e.g., a pump motor control circuit or dual block valves). Redundant subsystems are often used to achieve the target performance at the desired test interval. The most common subsystem architectures are simplex (1oo1), dual (1oo2, 2oo2), and triplicated (1oo3, 2oo3). Table 6.1 provides example voting architectures, a simple logic diagram for each architecture, and the voting subsystem fault tolerance against dangerous and spurious failure.
Voting Logic    Dangerous Fault Tolerance    Spurious Fault Tolerance
1oo1            0                            0
1oo2            1                            0
2oo2            0                            1
2oo3            1                            1
2oo4            2                            1

Table 6.1 Voting Considerations (logic diagrams omitted).

Markov modeling was developed by the Russian mathematician Andrei Markov around 1900. Markov models are based on average after logic and assume that initially all SIF subsystems are working; therefore, the PFD at time 0 is zero. With time, each device can fail, bringing the SIF to degraded intermediate states; when enough devices fail, the SIF fails dangerously. The logical relationship between the devices is developed as an equation for PFD(t), which is integrated to yield the PFDavg.


Assuming the device failures are independent, i.e., no common cause failure, the PFDavg for an MooN subsystem, where R = N - M + 1, is:

PFDavg = [ N! / ((N-R)! R!) ] (λ^D)^R { [ (1-DC)TI + DC(DI) + MTTR ]^(R+1) - (MTTR)^(R+1) } / { (R+1) [ (1-DC)TI + DC(DI) ] }

Fault tree analysis was developed at Bell Laboratories in the early 1960s for missile launch control reliability (the Minuteman program). Fault tree analysis is generally based on the average before logic approach, which simplifies the mathematics considerably. In this method, the PFDavg is determined for each device, and the resulting PFDavg values are combined using Boolean algebra. Assuming the device failures are independent, i.e., no common cause failure, the PFDavg for an MooN subsystem, where R = N - M + 1, is:

PFDavg = [ N! / (R! (M-1)!) ] [ (1-DC) λ^D TI / 2 + DC λ^D DI / 2 + λ^D MTTR ]^R
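The fault tree (average before logic) equation can be sketched as a small function. The device failure rate below is an assumption; DC and MTTR default to zero for brevity:

```python
from math import factorial

def pfd_avg_moon(n, m, lam_d, ti, dc=0.0, di=0.0, mttr=0.0):
    """Average-before-logic PFDavg for an MooN subsystem (no common cause),
    per the fault tree equation above, with R = N - M + 1."""
    r = n - m + 1
    coeff = factorial(n) / (factorial(r) * factorial(m - 1))
    per_leg = (1 - dc) * lam_d * ti / 2 + dc * lam_d * di / 2 + lam_d * mttr
    return coeff * per_leg ** r

lam_d = 6.67e-3   # dangerous failure rate, per year (assumed)
ti = 1.0          # proof test interval, years (assumed)
print(f"1oo1: {pfd_avg_moon(1, 1, lam_d, ti):.2e}")
print(f"1oo2: {pfd_avg_moon(2, 1, lam_d, ti):.2e}")
print(f"2oo3: {pfd_avg_moon(3, 2, lam_d, ti):.2e}")
```

Note that for 1oo1 the function reduces to λ^D TI/2 + λ^D MTTR, matching the single-device equation derived earlier.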
6.7 Effect of Common Cause

An understanding of operating, maintenance, testing, and diagnostic information is key to identifying which common cause failures and systematic failures should be included in the calculation. A common cause failure is the occurrence of an event that results in the failure of a subsystem or the entire system. These are often referred to as dependent failures, because multiple devices are affected; thus, there is dependency between the devices. Common cause failures can occur due to many types of events, such as manufacturing defects in redundant devices, aging components, operating environment stresses, common process connections, and common support systems. The failure rates for any potential events can be estimated using plant data for the frequency of common cause failures and systematic failures, or with data from published sources.

Human factor data is available in published literature. Guidelines for Preventing Human Error in Process Safety(9) provides data for the chemical industry and also describes the techniques utilized in evaluating and modeling human reliability. An Engineer's View of Human Error(10) provides a discussion on how human factors can affect the safe operation of process units.

There are two ways to account for common cause failures:

1. Explicit model: Common cause failure can be modeled explicitly as an event, especially if specific failure causes can be identified and estimated. For example, the instrument air failure that disables the primary transmitter can be the same instrument air failure that disables the redundant transmitter; in this case, both instances of instrument air should be modeled as the same basic event.

2. Approximation technique: Common cause failure can be modeled using the beta factor method. The probability of common cause failure is estimated using a fraction, β, of the highest dangerous failure rate among the devices in the subsystem, β × λ^D. For a repairable device, the PFD is calculated as:

PFD = β [ (1-DC) λ^D TI / 2 + DC λ^D DI / 2 + λ^D MTTR ]

The value of the beta factor is selected based on prior-use data in the operating environment. The prior use data can be for a specific device technology or for a sufficiently similar device. Many owner/operators use a beta factor between 0.1 and 5.0% when the devices are user approved for the application and good engineering practices are applied in the design and installation to minimize CCF. The beta factor can be substantially higher, if good engineering practices are not followed.
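The practical effect of the beta factor can be sketched for a 1oo2 subsystem, combining the independent fault tree term with the common cause term (rare-event forms; DC = 0 and MTTR neglected; all numeric values are assumptions):

```python
# Sketch: beta factor dominance in a 1oo2 subsystem.
# Independent term uses the fault tree form (lam_d * TI / 2) ** 2;
# common cause term is treated as a 1oo1 device with rate beta * lam_d.
lam_d = 6.67e-3   # dangerous failure rate, per year (assumed)
ti = 1.0          # proof test interval, years (assumed)

pfd_independent = (lam_d * ti / 2) ** 2
for beta in (0.0, 0.01, 0.05):
    pfd_ccf = beta * lam_d * ti / 2
    print(f"beta = {beta:4.0%}: PFD = {pfd_independent + pfd_ccf:.2e}")
```

Even a 1% beta factor contributes more PFD than the independent 1oo2 term itself, which is why good separation practices matter for redundant devices.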


Spurious Trip Rate

The following equations cover the typical SIS configurations. The MTTF^spurious for the individual SIS elements is converted to a failure rate by

λ^S = 1 / MTTF^spurious     (Eq. I.60)

For 1oo1,

STR = λ^S + λ^DD     (Eq. I.61)

where

λ^S is the safe or spurious failure rate for the component, and
λ^DD is the dangerous detected failure rate for the component.

The second term in the equation is the dangerous detected failure rate term and the third term is the systematic error rate term. The dangerous detected failure term is included in the spurious trip calculation when the detected dangerous failure puts that channel (of a redundant system) or the system (if it is non-redundant) in a safe (de-energized) state. This can be done either automatically or by human intervention. If the dangerous detected failure does not place the channel or system into a safe state, this term is not included in Equations I.54 through I.59.

The spurious failure rate, λ^SP, can be determined from the mean time to failure spurious (MTTF^SP):

λ^SP = 1 / MTTF^SP

For a simplex device, the spurious trip rate (STR) is the device spurious failure rate, λ^SP:

STR = λ^SP
The spurious trip rate is typically calculated without consideration for the support systems such as instrument air and power supplies. The failure of the support systems should be monitored and tracked separately from the instrumentation and controls. When the subsystem is fault tolerant for spurious failure, the STR is calculated by examining the frequency of failure of one device and the probability of failure of the second device in the time it takes to detect and repair the first device failure. When continuous diagnostics are provided, this diagnostic interval is typically much smaller than the MTTR, so the time to detect and repair can be assumed to be MTTR. For an MooN subsystem of identical devices operating under active diagnostics, the spurious trip rate can be determined by:

STR_MooN = [ (N-1)! / ((N-M)! (M-1)!) ] N λ^SP ( λ^SP MTTR )^(M-1)


The time to detect can be assumed to be TI, when the safe failure is detected by proof test only. For an MooN subsystem of identical devices with no active diagnostics, the spurious trip rate can be determined by:

STR_MooN = [ (N-1)! / ((N-M)! (M-1)!) ] N λ^SP ( λ^SP TI )^(M-1)
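The two MooN spurious trip rate equations differ only in the detection time, so they can share one sketch (the spurious failure rate is an assumed value):

```python
from math import factorial

def str_moon(n, m, lam_sp, detect_time):
    """Spurious trip rate for MooN identical devices, per the equations
    above; detect_time is MTTR (active diagnostics) or TI (proof test only)."""
    coeff = factorial(n - 1) / (factorial(n - m) * factorial(m - 1))
    return coeff * n * lam_sp * (lam_sp * detect_time) ** (m - 1)

lam_sp = 0.1        # spurious failure rate, per year (assumed)
mttr = 72 / 8760    # 72 hours, expressed in years

print(f"1oo2 STR: {str_moon(2, 1, lam_sp, mttr):.3f} /yr")   # 2 * lam_sp
print(f"2oo2 STR: {str_moon(2, 2, lam_sp, mttr):.2e} /yr")
print(f"2oo3 STR: {str_moon(3, 2, lam_sp, mttr):.2e} /yr")
```

As expected, 1oo2 doubles the simplex spurious rate, while the spurious-fault-tolerant 2oo2 and 2oo3 architectures reduce it by orders of magnitude when failures are detected and repaired quickly.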


SIF Calculation Overview

Need introduction

8.1 Step 1 Understand the assumptions

Should we have a discussion of what it means to violate these assumptions?

The following is generally assumed during the verification calculation:

- The SIF being evaluated will be designed, installed, and maintained in accordance with ANSI/ISA-84.00.01-2004.
- Component failure and repair rates are assumed to be constant over the life of the SIF.
- Once a component has failed in one of the possible failure modes, it cannot fail again in one of the remaining failure modes; it can only fail again after it has first been repaired. This assumption has been made to simplify the modeling effort.
- The equations assume similar failure rates for redundant components.
- The sensor failure rate includes everything from the sensor to the input module of the logic solver, including the process effects (e.g., a plugged impulse line to a transmitter).
- The logic solver failure rate includes the input modules, logic solver, output modules, and power supplies. These failure rates typically are supplied by the logic solver vendor.
NOTE ISA-TR84.00.02-2009 - Part 5 illustrates a suggested method to use in developing failure rate data for the logic solver.

- The final element failure rate includes everything from the output module of the logic solver to the final element, including the process effects.
- The failure rates shown in the formulas for redundant architectures are for a single leg or slice of a system (e.g., for 2oo3 transmitters, the failure rate used is for a single transmitter, not three times the single transmitter value).
- The test interval (TI) is assumed to be much shorter than the mean time to failure (MTTF).
- Testing and repair of components in the system are assumed to be perfect.
- All SIF components have been properly specified based on the process application. For example, final elements (valves) have been selected to fail in the safe direction depending on their specific application.
- All equations used in the calculations in this part are based on Reference 3.
- All power supply failures are assumed to be to the de-energized state.
- It is assumed that when a dangerous detected failure occurs, the SIS will take the process to a safe state or plant personnel will take the necessary action to ensure the process is safe (operator response is assumed to occur before a demand, i.e., instantaneous, and the PFD of operator response is assumed to be 0).
NOTE If the action depends on plant personnel to provide safety, the user is cautioned to account for the probability of failure of personnel to perform the required function in a timely manner.

- The target PFDavg and MTTF^spurious are defined for each SIF implemented in the SIS.
- The Beta model is used to estimate common cause failures.

NOTE A detailed explanation of the Beta model is given in Annex A of Part 1.

Each section below should be expanded.

8.2 Step 2 Understand the functionality

Identify the hazardous event for which the SIS is providing a layer of protection and the specific individual components that protect against the event.

- Discuss the connection between process hazard detection, response, process safety time, and SIS functionality.
- Discuss how some inputs and outputs to an SIF may not be required from a safety perspective but are needed to support various modes of operation, to ensure complete shutdown, or to sequence the process shutdown when cascade trips are possible.

Only those inputs and outputs required to detect and respond to the identified hazardous event are included in the calculation.

8.3 Step 3 Determine integrity requirements

Remind the audience that meeting the Safety Integrity Level (SIL) requires that: 1) the equipment is user approved for safety, 2) the fault tolerance requirements are met, and 3) the PFD target is met.

8.4 Step 4 Understand the reliability requirements

Discuss how low reliability devices affect process safety through higher numbers of spurious trips and maintenance periods. Discuss how only one maintenance period is generally assumed in the PFD calculations, based on the MTTF^D. Include the effects of spurious trips on process safety, including process demands on other equipment, loss of production, equipment stressing, and re-start.

8.5 Step 5 Calculate the PFD

List the components that have an impact on each SIF. These will typically be the sensors and final elements identified in the process hazard analysis (PHA) process. The associated SIFs are assigned a SIL by the PHA team. Using the SIS architecture being considered, calculate the PFDavg for each SIF by combining the contributions from the sensors, logic solver, final elements, power supply, and any other components that impact that SIF. Determine if the PFDavg meets the Safety Requirements Specification for each SIF. If required, modify the SIS (hardware configuration, test interval, hardware selection, etc.) and re-calculate to meet the requirements specified in the Safety Requirements Specification (see ANSI/ISA-84.00.01-2004, Clause 5 and Clause 6.2.2) for each SIF.

PFDavg calculations

The PFDavg is determined by calculating the PFD for each SIF subsystem required to detect and respond to abnormal process conditions. The values for the subsystems are combined to obtain the PFDavg for the SIF:


PFD_SIF = PFD_S + PFD_LS + PFD_FE + PFD_SS

where

PFD_S represents the various sensors used to detect abnormal process conditions,
PFD_LS represents the logic solver used to make decisions based on the process conditions,
PFD_FE represents the final elements used to take action on the process,
PFD_SS represents any required support systems, such as a power supply in ETT, and
PFD_SIF is the PFDavg for the SIF.

8.6 Step 6 Calculate the MTTF^SP

Determine the expected spurious trip rate (STR) for the system components and combine them to obtain the MTTF^SP for the SIS. If the calculated MTTF^spurious is unacceptable, modify the configuration (add redundancy, use components with better reliability, etc.) and re-calculate to meet the requirements in the Safety Requirements Specification. This will require re-calculation of the PFDavg value for each SIF as well.

A safe device failure may cause a spurious trip of the system. The mean time to a safe failure is referred to as the Mean Time to Failure Spurious (MTTF^spurious), which is the estimated time between safe failures of a component or system. If trips of the SIS caused by failures of system components are a concern, the anticipated spurious trip rate may be calculated to determine if additional steps are justified to improve SIS reliability. In ISA-TR84.00.02-2009, the term Spurious Trip Rate (STR) refers to the rate at which a nuisance or spurious trip might occur in the SIS.
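The Step 5 and Step 6 bookkeeping can be sketched together; all numeric subsystem values below are assumptions chosen only to illustrate the combination:

```python
# Sketch: combine subsystem PFDavg values into the SIF PFD, and spurious
# trip rates into an overall STR and MTTF_spurious.
pfd = {"sensors": 3.3e-3, "logic_solver": 2.5e-5, "final_elements": 5.4e-3}
pfd_sif = sum(pfd.values())

# Spurious trip rates, per year (assumed)
str_rates = {"sensors": 1.3e-2, "logic_solver": 1.4e-3, "final_elements": 4.0e-2}
str_sif = sum(str_rates.values())
mttf_spurious = 1.0 / str_sif

print(f"PFD_SIF = {pfd_sif:.2e}  (within the SIL 2 band of 1E-3 to 1E-2)")
print(f"STR = {str_sif:.3f} /yr, MTTF_spurious = {mttf_spurious:.1f} yr")
```
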

8.7 Step 7 Adjust SIF design to meet integrity and functionality requirements

When the PFDavg and MTTF^spurious values satisfy those specified in the Safety Requirements Specifications, the calculation procedure is complete.

8.7.1 PFD improvement techniques

Where adjustments are required to decrease PFDavg, additional redundancy may be used on components, the proof test interval may be decreased, the SIS configuration may be changed, or components with lower failure rates may be considered.

8.7.2 Reducing Spurious Trip Rate

Where the spurious trip rate is not acceptable, additional redundancy may be added to system components or more reliable components may be used. This will require re-evaluating the system PFDavg to confirm that it still meets the requirements of the Safety Requirements Specifications.


8.8 Examples

Need continuous and high demand mode examples?

A Safety Instrumented System (SIS) is designed to achieve or maintain a safe state of the process when unacceptable process conditions are detected. The need for an SIS is identified during the Hazard and Risk Analysis (H&RA), which evaluates the process risk associated with identified hazardous events. Independent protection layers (IPL) are identified that reduce the risk associated with the hazardous event below the owner/operator risk criteria. The risk reduction allocated to the SIS is its target safety integrity level (SIL). This metric serves as the quality performance benchmark for the SIS design and management.

The use of cookbook SIS design and management practices was very common in the process industry at the time of the issuance of ANSI/ISA 84.01-1996. The performance-based approach of ISA 84.01 provides more flexibility but also adds significant complexity, because a wide range of options can be used to achieve the required risk reduction. These options often present very different capital, installation, and long-term operational costs. This paper provides examples of simple cookbook approaches and illustrates how architectures must evolve when addressing higher integrity levels and/or process reliability.

Prescriptive approaches are often favored over performance-based ones due to the apparent simplicity offered by the cookbook. However, the user of the cookbook must understand its assumptions, meaning, and intent. When the assumptions are violated, the performance achieved by the SIS may be insufficient to provide the required integrity and reliability for the specific application. The assumptions for this paper's cookbook are as follows:

1. The SIS is managed throughout its lifecycle to achieve the required core attributes;
2. Equipment failure rates are constant and random, requiring a minimum level of inspection and preventive maintenance;
3. Devices are specified to fail to the safe state on loss of power and other support systems;
4. Redundant sensors are installed on separate process connections;
5. Block valves are specified as spring-return fail-closed and are actuated using de-energize-to-trip solenoid operated valves;
6. The logic solver is fault tolerant and safety-configured to achieve SIL 3; and
7. The proof test procedure fully validates the required operation of each device, and inspection and preventive maintenance activities are performed as part of the proof test to return the device to the good as new condition.

8.8.1 Hazard Scenario

During a hazard and risk analysis, the team identifies a hazardous event: if the feed valve fails open, the pressure exceeds the vessel maximum allowable working pressure (MAWP) and there is the potential for vessel failure. To reduce the risk associated with the hazardous event, the project team implements a safety instrumented function (SIF) to isolate the vessel pressure source when the system detects a specified high pressure. Since the feed valve failure is a potential initiating cause, the feed valve cannot be used as the only means of process isolation.


8.8.2 Data for Examples

The failure rates, diagnostic coverage (DC), common cause factors (CCF), and mean time to repair (MTTR) used for the calculations are shown in Table 1.

Table ?--Data used for the devices

Diagnostic coverage factors: Simplex (DC1) = 0%; Dual (DC2) = 80%; Triple (DC3) = 90%.

Device                                       λD (per year)  λS (per year)  MTTR (hrs)  β CCF (%)
Pressure transmitter                         6.67E-03       1.25E-02       72          2
Logic solver                                 5.00E-05       1.36E-03       72          -
Block valve (fail to close, clean service)   1.67E-02       6.67E-03       72          0.2
Solenoid valve (de-energize to trip)         1.67E-02       3.33E-02       72          0.2

NOTE For redundant devices the following formulas are used: the dangerous failure rate for two pressure transmitters with diagnostic coverage is λD2 = (1 − β)(1 − DC2)λD; the dangerous failure rate for three pressure transmitters with diagnostic coverage is λD3 = (1 − β)(1 − DC3)λD, where λD is the dangerous failure rate of a single transmitter.
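The note's formulas can be checked numerically against the table data. A minimal sketch (the variable names are mine, values from the table above):

```python
# Values from the data table above
beta = 0.02      # common cause factor for pressure transmitters (2%)
dc2 = 0.80       # diagnostic coverage, dual
dc3 = 0.90       # diagnostic coverage, triple
lam_d = 6.67e-3  # dangerous failure rate of one transmitter, per year

# Effective dangerous failure rates per the note above
lam_d2 = (1 - beta) * (1 - dc2) * lam_d  # two transmitters
lam_d3 = (1 - beta) * (1 - dc3) * lam_d  # three transmitters
```

With these values, the dual arrangement yields about 1.31E-03 per year and the triple arrangement about 6.54E-04 per year, showing how diagnostic coverage and common cause together limit the benefit of redundancy.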

8.8.3 Example Architectures

8.8.3.1 Case 1


8.8.3.2 Case 2

8.8.3.3 Case 3

8.8.3.4 Case 4


8.8.3.5 Case 5

8.8.3.6 Case 6

8.8.3.7 Case 7


8.8.3.8 Case 8

8.8.4 Analysis

{These calculations should be illustrated using simplified equations for fault tree and Markov analysis. The fault trees and the Markov graphics could be provided in the appendices, based on committee interest.}


9 Special Topics

9.1 Systematic error and the management system

Systematic failures are the result of errors of omission or errors of commission. In the models, the systematic failures are separated into safe failures and dangerous failures. It is assumed that a systematic safe failure, λFS, will have the potential to result in a spurious trip. In a similar manner, a systematic dangerous failure, λFD, will have the potential to result in a fail-to-function state of the SIF. The estimation of the systematic failure rate must consider many possible causes. A partial list of root causes is as follows:

9.1.1 SIF Design Errors

Could we get some real examples of SIF design, implementation, maintenance, and software errors that have resulted in incidents? These can be sanitized to protect the innocent. These errors include errors in the safety logic specification, incorrect selection of sensors and final elements, and errors in the design of the interface between the E/E/PES and the sensors and actuators. Failures due to errors of this type will likely fail an entire redundant architecture.

9.1.2 Implementation Errors

These errors include errors in the installation, setup (calibration, initialization, set-point settings), and start-up of SIS components which are not detected and resolved during proof testing of the SIS.

a) Wiring/tubing errors. Examples of such errors are:

- Inadequate electrical/pneumatic power supply;
- Improper or blocked-in connections to the process (impulse lines); and
- Installation of the wrong sensor or final control component.

b) Software errors. These errors include errors in vendor-written and user-written embedded, application, and utility software. Vendor software errors typically include errors in the operating system, I/O routines, diagnostics, application-oriented functions, and programming languages. User-written software errors include errors in the application program, diagnostics, and user interface routines (display systems, etc.).

c) Human interaction errors. These errors include errors in the operation of the man-machine interface to the SIF, errors during periodic testing of the SIF, and errors during the repair of failed modules in the SIF.

d) Hardware design errors. These errors include errors in the manufacturer's design or construction of selected SIF components which prevent proper functionality and cause the component to fail to meet specified conditions. Examples of such errors are failure of a component to:


- Meet specified process conditions (pressure, temperature, etc.);
- Function properly in the installed environment (atmospheric temperature/humidity, vibration, etc.); and
- The wrong selection of type or rating for components.

e) Modification errors. These errors occur when altering any or all of the five categories mentioned in this clause (i.e., SIF design, hardware implementation, software, human interaction, hardware design). These errors have three common characteristics:

- They may occur during maintenance activities.
- They may be subtle.
- They may impact an adjacent area.

Examples of such errors are:

- Inability to replace a component with an identical component (e.g., due to discontinued support), where the replacement component does not fully function with adjacent components.
- A modification reduces the load on the power source, thus raising the voltage level and causing problems with other existing non-modified components.

9.1.3 Systematic Probability

Systematic failures(8) manifest themselves quite differently from random hardware failures. Random hardware failures can be caused by all kinds of stressors and by the susceptibility of the components to these stressors. Through careful testing before operation, it is assumed that there are no hardware failures at the start of the operational period. Systematic failures show a different behavior: during the design and engineering of the safety loops, the software, and the (chip) hardware, failures arise that cannot be detected by tests. The number of systematic failures does not increase during the operational period, but they can be triggered under all kinds of normal operational conditions. To capture this particular behavior of systematic failures in the Markov model, initialization values for the operational state and the fail-dangerous state have to be established for the three types of systematic failures in logic solvers. The nature of systematic failures is that they are present at the moment of the operational start of a logic solver. The probability that the logic solver is operational is therefore < 100%, and the probability that the logic solver is in the fail-dangerous state is > 0%.

Table 2.?--Systematic failures--Initial probabilities

                                              Initial probability
Item                                          Low        Typical    High
Engineering/Design (complexity: complex)      0.0001     0.001      0.01
Chip hardware                                 0.0004     0.004      0.04
Software                                      0.00075    0.0075     0.075
Total systematic failures                     0.00125    0.0125     0.125

By applying the appropriate techniques, it is possible to detect systematic failures during the phases of the safety lifecycle. The range of coverage factors applied to the systematic failures in the examples of Clause 6 is defined in Table B.4.2.

Table 2.?--Systematic failures--Coverage factors

                                              Coverage factor
Item                                          Low      Typical   High
Factory test                                  0.99     0.995     0.999
Engineering/Design (complexity: complex)      0.9      0.95      0.99
Chip hardware                                 0.9      0.95      0.99
Software (embedded, utility, application)     0.9      0.95      0.99
Total systematic failures                     0.9      0.95      0.99

Systematic failures play a major and, in most cases, dominant role in safety instrumented systems; as a possible alternative, the table below indicates a broader range. In a logic solver comparison calculation that uses the table above, the systematic failure range will in many cases dominate the results for the higher SILs. Using, for example, a high comparison coverage factor ignores this dominant role and gives a better indication of the influence of the remaining parameters on the safety integrity level. A practical range of the factory test coverage factor, including systematic failures, is:

Table B.4.3--Systematic failures--Practical coverage factors, factory test

                                              Coverage factor
Item                                          Low      Typical   High
Practical coverage factor, factory test       0.9      0.925     0.95

To estimate the initial range for the PFD in the numeric Markov matrix methodology (as defined in ISA-TR84.00.02-2009 Part 4), apply the diagnostic coverage factor for the factory test (defined in Table B.4.2) and assume a 50% probability that a systematic failure results in the dangerous state.

Table B.4.4--Systematic failures--Start PFD

                   Initial probability
Item               Low        Typical    High
Start PFD          1.25E-05   6.25E-06   1.25E-06
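The Start PFD row can be reproduced numerically by applying the factory test coverage factors (0.99, 0.995, 0.999) to a total systematic initial probability of 0.00125; under this reading the 50% dangerous-state factor is already folded into that initial value. This pairing of columns is my interpretation, not a statement from the report:

```python
# Interpretation: Start PFD = (total systematic initial probability)
#                             x (1 - factory test coverage factor)
p0 = 0.00125  # total systematic failures (low column of the initial
              # probability table), interpreted as the dangerous share

start_pfds = [p0 * (1 - cov) for cov in (0.99, 0.995, 0.999)]
# matches the table's 1.25E-05, 6.25E-06, 1.25E-06
```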

For reliability calculations not applying the numeric Markov matrix methodology, the systematic failures are also expressed as failure rates.


Table B.4.5--Systematic failures--Failure rates

                                              Failure rate (failures / million hours)
Item                                          Low      Typical   High
Engineering/Design (complexity: complex)      0.01     0.10      1.0
Chip hardware                                 0.05     0.50      5.0
Software                                      0.09     0.90      9.0
Total systematic failures                     0.15     1.50      15.0
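Because the device data earlier in this clause is given per year while this table uses failures per million hours, a small conversion helper may be useful. A sketch, assuming 8760 hours per year (the function name is mine):

```python
HOURS_PER_YEAR = 8760

def per_million_hours_to_per_year(rate_per_1e6_hr):
    """Convert a failure rate in failures per million hours
    to failures per year."""
    return rate_per_1e6_hr * HOURS_PER_YEAR / 1.0e6

# e.g., the 'low' total systematic rate of 0.15 failures per million hours
low_total_per_year = per_million_hours_to_per_year(0.15)
```

The 'low' total of 0.15 failures per million hours corresponds to about 1.3E-03 failures per year, comparable in magnitude to the random dangerous failure rates tabulated for individual devices.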

9.2 Methods to Analyze the Performance of Equipment with Unrevealed Failures

The statistical methods used to determine failure rates or other reliability statistics for equipment depend upon whether the failure mode in question is revealed or unrevealed. For revealed failures, the statistical methods are relatively straightforward, as described by the equation below.
    MTTF = ΣTTF / Nf

Where:
MTTF = mean time to failure
TTF = equipment time to failure
Nf = number of failures

For the case where the failure mode is unrevealed, it is not as straightforward.

9.2.1 Simple probability of failing on demand

The simplest means of estimating the probability of failure on demand is to use the following formula to analyze the data.
    PFD = FTF / D

Where:
PFD = probability of failing to function on demand
FTF = number of fail-to-function events
D = number of demands
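As a minimal sketch (the function name is mine):

```python
def pfd_simple(fail_to_function_count, demand_count):
    """Simple demand-based PFD estimate: fail-to-function events
    divided by demands (meaningful only for the tested interval)."""
    if demand_count <= 0:
        raise ValueError("at least one demand is required")
    return fail_to_function_count / demand_count
```

For example, 2 failures to function in 1000 demands gives an estimate of 0.002, but, as noted below, this number carries no information about how the estimate would change with a different proof test interval.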

The problem is that there is no way to estimate the impact on this value if the proof test interval were to be extended or decreased. The value is only truly meaningful if the entire population is tested at the same time interval. In addition, it is only meaningful for applications where that interval is used. Oftentimes there is a temptation to assume the exponential distribution and back-calculate, but the analyst has no means to test that hypothesis or to determine at what point in the equipment's life wear-out occurs.


9.2.2 Censored Data

The reason why unrevealed failures are a statistical analysis problem is that the data is censored, i.e., it is unknown when the failure occurred. The longer the proof test interval, the greater the uncertainty, as the failure could have existed at time zero, occurred just before the proof test, or occurred at any time in between. Statistical methods have been developed to address this issue, but they require appropriate data to be collected under a high quality assurance program to ensure that data is not missed or rationalized away.

9.2.3 Quantal Response Analysis Method Overview

One such analysis method is known as the quantal response analysis method [1, 2]. The proof test results are first organized in ascending time order, irrespective of failure/success status. The data is then divided into non-overlapping time interval groups, where each group has a specified number of failures within it. Each grouping encompasses a time interval within the overall span from the smallest proof test interval to the longest proof test interval. Each group with its failures essentially represents a single data point for further analysis. The greater the number of failures that have accrued and the greater the diversity of proof test intervals, the more groups can be created, which improves the overall analysis. This work is best done by persons trained in statistical analysis, but support by persons intimate with the test procedures and the work process used to obtain the data is important to a credible analysis and appropriate assumptions. The quantal response method provides an estimate of F(t), the probability of failure in the interval (0, t). If h(t) is the instantaneous failure rate, F(t) and h(t) are related by the following equation:
    F(t) = 1 − exp( −∫₀ᵗ h(u) du )

This equation is valid for any continuous distribution. If h(t) is a constant, it becomes the exponential distribution. If it is not constant, the Weibull distribution can be used as a reasonable approximation of the normal and other distributions. The objective is to find an estimate of F(t). The general steps necessary to perform a quantal response analysis are:

1. Divide the data into 5 to 6 (minimum) non-overlapping intervals, each containing at least 5 test failures. Each data point within an interval is a binary trial that is either a success or a failure.

2. For the data points contained in an interval, compute the fraction, Fi(t), that failed. This fraction is an estimate of the cumulative fraction failed by the end of that interval, since all that is known about a failure is that it occurred between 0 and t.

3. Plot each failed fraction (i.e., Fi(t)) against the weighted failure time of the interval, given by

    t̄i = ( Σf tif + Σs tis ) / ni

with the index i denoting the i-th cell, the index f denoting the failures (test times), and the index s denoting the survivors. ni denotes the sample size for the i-th cell. The weighted point is used as an estimate of the ages of the equipment inspected in the interval.


4. Fit a Weibull distribution to the data. In this case, F(t) = 1 − exp(−(t/α)^β) (this is the two-parameter Weibull distribution). Find the parameters α and β using a probability plot or weighted regression analysis for varying numbers of inspections in each interval.

5. Test the null hypothesis H0: β = 1 versus the alternative hypothesis H1: β ≠ 1. (If β = 1, then the Weibull distribution reduces to the exponential distribution with constant failure rate λ = 1/α.) If the null hypothesis is not rejected, then linear regression can be used to fit an exponential distribution.

6. Fit other distributions to the data, for example, the Normal, Lognormal, and Extreme Value distributions. If none provide an adequate fit, then try polynomial fitting of the empirical data.
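Step 4 can be sketched with a least-squares fit on the linearized Weibull form ln(−ln(1 − F)) = β ln t − β ln α. This is a simple unweighted sketch (stdlib only; the function name is mine), not the weighted regression the method calls for:

```python
import math

def fit_weibull(times, fractions_failed):
    """Least-squares fit of the two-parameter Weibull
    F(t) = 1 - exp(-(t/alpha)**beta) on the linearized form
    ln(-ln(1 - F)) = beta*ln(t) - beta*ln(alpha)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(1.0 - f)) for f in fractions_failed]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))
    alpha = math.exp(xbar - ybar / beta)  # recovered from the intercept
    return alpha, beta
```

On synthetic data generated from a known Weibull, the fit recovers the parameters; a fitted β near 1 then motivates the exponential-distribution hypothesis test in step 5.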

9.2.4 References

1. Nelson, W. B. (1982), Applied Life Data Analysis, New York: John Wiley & Sons.

2. Sheesley, J. H., Thomas, H. W. and Valenzuela, C. A. (1995), "Quantal Response Analysis of Relief Valve Test Data," ASQC 49th Annual Quality Congress Proceedings, pp. 741-748.

9.3 Single Sided Confidence Limit



Clause 5.2.5.3 requires assessing whether dangerous failure rates of the safety instrumented system are in accordance with those assumed during the design. The note associated with clause 11.9.2.c provides some statistical insight for "in accordance with," since the initial use of an experience-based MTTF is allowed when the experience is sufficient "to demonstrate the claimed mean time to failure on a statistical basis to a single-sided lower confidence limit of at least 70 %."

Demonstrating that an assumed MTTF is supported by system performance requires: (1) collecting field data on failures during operation and pass/fail data from proof tests; (2) determining the sample MTTF from that field performance data; (3) repeating this process to generate a set of sample MTTFs; (4) analyzing the set of sample MTTFs to estimate the sample population mean and standard deviation; and (5) determining whether the assumed MTTF is supported at the required confidence limit. The set of sample MTTFs from step (3) can be collected from different units, plants, and even companies, as long as the operating environments and mechanical integrity practices are similar. Collecting enough samples to support the calculation may be challenging for a single organization.

In statistical terms, the assessment for Clause 5.2.5.3 asks, "Is the data consistent with the assumed mean?" A full discussion of the statistical methods, formulas, and tables required for this determination is beyond the scope of this technical report. Such a presentation is available online in the free NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/ (accessed 7/18/2006).
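A minimal sketch of steps (4) and (5) above, using a one-sided normal approximation on the set of sample MTTFs. The z-value 0.5244 is the 70th percentile of the standard normal; a chi-square interval on the pooled failure data is the more usual treatment for exponential lifetimes, and all names here are mine:

```python
import math

Z_70 = 0.5244  # one-sided 70% standard normal quantile

def lower_70_confidence_limit(sample_mttfs):
    """Single-sided 70% lower confidence limit on the mean of a set
    of sample MTTFs (normal-approximation sketch)."""
    n = len(sample_mttfs)
    mean = sum(sample_mttfs) / n
    var = sum((x - mean) ** 2 for x in sample_mttfs) / (n - 1)
    return mean - Z_70 * math.sqrt(var / n)

def assumed_mttf_supported(sample_mttfs, assumed_mttf):
    # supported if the assumed value does not exceed the lower limit
    return assumed_mttf <= lower_70_confidence_limit(sample_mttfs)
```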


9.4 Software Packages

Cautionary tale. Software packages, warnings, assumptions, data, etc. See existing TR


Annex A Abbreviations, Acronyms and Symbols


Here is a starter set. Check for new editions. Will remove any unused when document is complete.

AIChE--American Institute of Chemical Engineers
ALARP--As Low as Reasonably Practicable
ANSI--American National Standards Institute
API--American Petroleum Institute
APL--Asset Protection Level
ASME--American Society of Mechanical Engineers
ATEX--Atmosphères Explosibles
BMS--Burner Management System
BPCS--Basic Process Control System
C&E--Cause and Effect
CCF--Common Cause Factor
CCPS--Center for Chemical Process Safety
CENELEC--The European Committee for Electrotechnical Standardization
CFR--Code of Federal Regulations
COMAH--Control of Major Accident Hazards
CPU--Central Processing Unit
DC--Diagnostic Coverage
DCS--Distributed Control System
DD--Dangerous Detected
DTT--De-energize to Trip
DU--Dangerous Undetected
EEMUA--European Equipment Manufacturers and Users Association
E/E/PES--Electrical/Electronic/Programmable Electronic System
EIL--Environmental Integrity Level
EMC--Electromagnetic Compatibility
EPA--Environmental Protection Agency
ESD--Emergency Shutdown System
ESS--Emergency Support System
ETA--Event Tree Analysis
ETT--Energize to Trip
FAT--Factory Acceptance Test
FM--Factory Mutual Global
FMEA--Failure Mode and Effects Analysis
FMECA--Failure Mode, Effects and Criticality Analysis
FPL--Fixed Program Language
FTA--Fault Tree Analysis
FVL--Full Variability Language
HAZOP--Hazard and Operability Study
H&RA--Hazard and Risk Analysis
HFE--Human Factors Engineering
HSIS--High Integrity Protective System
HMI--Human Machine Interface
HNIL--High Noise Immunity Logic
HRA--Human Reliability Analysis
HSE--Health and Safety Executive
IEC--International Electrotechnical Commission
IEEE--Institute of Electrical and Electronics Engineers
IL--Integrity Level
I/O--Input/Output
IPL--Independent Protection Layer
ISA--Instrumentation, Systems, and Automation Society
ISO--International Organization for Standardization
ISPE--International Society of Pharmaceutical Engineering
ISS--Instrumented Safety System
LOPA--Layers of Protection Analysis
LVL--Limited Variability Language
MOC--Management of Change
MTBF--Mean Time Between Failure
MTTF--Mean Time To Failure
MTTR--Mean Time To Repair
NEC--National Electrical Code
NFPA--National Fire Protection Association
NRTL--Nationally Recognized Testing Laboratory
OSHA--Occupational Safety and Health Administration
PC--Personal Computer
PE--Programmable Electronic
PERD--Process Equipment Reliability Database
PES--Programmable Electronic Systems
PFD(t)--Probability of Failure as a Function of Time
PFDAVG--Average Probability of Failure on Demand
P&IDs--Piping and Instrumentation Diagrams
PHA--Process Hazard Analysis
PLC--Programmable Logic Controller
PSAT--Pre-Startup Acceptance Test
PSM--Process Safety Management
PSSR--Pre-Startup Safety Review
QRA--Quantitative Risk Analysis
RAGAGEP--Recognized and Generally Accepted Good Engineering Practice
RMP--Risk Management Program
RRF--Risk Reduction Factor
RFI--Radio Frequency Interference
RTL--Resistor-Transistor Logic
RTD--Resistance Temperature Detector
SAT--Site Acceptance Test
SD--Safe Detected
SFF--Safe Failure Fraction
SIF--Safety Instrumented Function
SIL--Safety Integrity Level
SIS--Safety Instrumented System
SOE--Sequence of Events
SOV--Solenoid Operated Valve
SRS--Safety Requirements Specification
STR--Spurious Trip Rate
SU--Safe Undetected
TI--Test Interval
UPS--Uninterruptible Power Supply
1oo1--one-out-of-one
1oo2--one-out-of-two
2oo2--two-out-of-two
2oo3--two-out-of-three
2oo4--two-out-of-four
MooN--M-out-of-N


Annex B Definitions
Allowable Time To Repair--Length of time that has been determined by hazard and risk analysis to be acceptable for continued process operation with degraded or disabled equipment. The time is often constrained by operations' ability to maintain the necessary compensating measure.

Architecture--The physical organization, interconnection, or integration of the equipment of a system that operates according to the design basis.

As good as new--Equipment is maintained in a manner that sustains its useful life. "As good as new" often refers to the initial condition after proof test, where the probability of failure at time 0 is zero and the failure rate expected during the useful life is unchanged.

Availability (Instantaneous)--The probability that equipment is capable of operating under given conditions at a given instant of time, assuming the required external resources or support systems are provided.

Availability (Mean)--The fraction of time that equipment is capable of operating in the desired manner. It is the mean of the instantaneous availability over a given time interval.

Bathtub Curve--Typically a plot of equipment failures as a function of time, used to characterize the equipment lifecycle.

Benchmark--A point of reference from which measurements may be made or from which other things can be measured. A program that is used to compare the operation of two or more systems is known as a benchmark program.

Benign Failure--Failure which does not affect the ability of equipment to perform its design function. Benign failures are often further classified as degraded or incipient failures.

Burn-In--The equipment is subjected to higher than typical stress to identify early failures. Often referred to as infant mortality.

Bypass--An action taken to override, defeat, disable, or inhibit a protective system. These actions prevent operation of the protective system.

Capability--Ability of equipment to operate as specified.

Claim Limit--The maximum integrity level in which equipment can be used without additional fault tolerance against dangerous failure. The limit occurs due to random and systematic failures.

Common Cause Failure--Failure of more than one device, function, or system due to the same cause.

Common Mode Failure--Failure of more than one device, function, or system in the same manner, causing the same erroneous result.

Complete Failure--Failure that results in a 100% loss of a required function.

Component--One of the parts of a system, subsystem, or device performing a specific function.

Configuration--The functional and/or physical characteristics of the hardware and/or software required for the equipment to operate according to the design basis.

Continuous Mode--A dangerous SIF failure causes a hazardous event without further failure.


Critical Failure--Failure that results in either a dangerous or safe action, as defined by IEC 61508.

Dangerous Failure--Failure affecting equipment within a system, which causes the process to be put in a hazardous state or puts the system in a condition where it may fail to operate when required.

De-Energize To Trip--Circuits where the final elements are energized under normal operation and the removal of the power source (e.g., electricity, instrument air) causes the safety instrumented system to take its defined action.

Degraded Failure--Failure that results in a partial loss of function, that is, less than "as good as new," but does not result in a complete loss of the function.

Demand Mode--Dormant or standby operation where the SIF takes action only when a process demand occurs.

Demand Rate--The number of demands divided by the total elapsed operating time during which the demands occur.

Design Life--The expected equipment life due to its design and durability, as well as its capability to continue meeting the design specification. The design life may end with obsolescence.

Detected Failure--Failure found through diagnostics or through the operator's normal observation of the process and its equipment. Synonyms include announced, revealed, and overt.

Diagnostic Coverage--Percentage of the failures of the equipment detected by automated diagnostics that report equipment faults to the operator and cause the equipment to take a specified action on fault detection.

Diagnostic Interval--Time period between the operation of diagnostics used to detect equipment failures.

Diagnostics--Hardware and software installed to automatically identify and report specific failure modes.

Diverse--Use of independent and different means to perform the same function. Diversity may include the use of different physical methods, technology, manufacturers, installation, maintenance personnel, and/or environment.

Early Failure--Failure identified early in the installed equipment life. These failures are typically due to manufacturing defects, assembly errors, or installation or implementation errors. Many of these failures are discovered and corrected during commissioning. The remainder are typically discovered during and shortly after startup. Failures in this time period are generally dominated by systematic rather than random failures.

Energize To Trip--Circuits where the final elements require power to take or maintain the safe state.

Equipment Boundary--Demarcation of the equipment, defining devices, components, and interfaces needed for the equipment to function according to its specification.

Equipment Manual--Compilation of installation, configuration, and mechanical integrity requirements for equipment approved for safety instrumented system (SIS) applications.

Error--1) Discrepancy between a computed, observed, or measured value (or condition) and the true, specified, or theoretically correct value (or condition); or 2) failure of planned actions to achieve their desired goal.

Expected Life--The length of time a device is expected to function according to its specification.


Failure--The termination of the ability of equipment to perform a required function.

Failure Cause--The circumstances during design, manufacture, or use which led to failure.

Failure Frequency (Instantaneous)--The expected number of failures between time t and time t+Δt. The number counted may include first, second, or more failures of the same equipment. The failure frequency can be integrated between 0 and t to determine the expected number of failures within time t.

Failure Mode--The observed manner of failure. The failure modes describe the loss of required system function(s) that result from failures.

Failure Mode and Effects Analysis (FMEA)--A qualitative analysis method that identifies equipment failure modes and determines their impact on the equipment operation.

Failure Rate--Limit, as Δt goes to 0, of the expected rate at which equipment failures occur in the time interval t to t+Δt, given that no failures have occurred until time t.

Failure-To-Operate--Failure that inhibits, or sufficiently delays, the actions required to achieve or maintain a safe state of the process when a process demand occurs. A failure to operate on demand has a direct and detrimental effect on safety.

Fault--Abnormal condition resulting in degraded operation or critical failure.

Fault Tolerant--Voting architecture that allows an equipment subsystem to continue to operate in the presence of one or more hardware or software faults.

Fault Tree Analysis (FTA)--Method used to analyze graphically the failure logic of a given event, to identify various failure scenarios (so-called cut-sets), and to support the probabilistic estimation of the event.

Field Experience--The collection of data and information about the performance of equipment based on actual use in a similar application.

Field Sensors--See Sensor.

Final Element--Device that takes action on the process or process equipment. For a safety instrumented function (SIF), the final element takes action on the process to achieve or maintain the safe state. The final element boundary includes the signal connection to the logic solver and the devices required to take action on the process.

Fitness for Service--Management system used to assess the current condition of equipment to determine whether it is capable of continuing operation within the equipment specification until the next opportunity to test or perform maintenance.

Hardware Fault Tolerance--See Fault Tolerant.

Historical Data--Data recorded from actual past experience.

Incipient Failure--The equipment operates within specification, but its current condition could result in a degraded or critical failure if corrective action is not taken.

Integrity--Core attribute of a protection layer related to the risk reduction reasonably achievable given its design and management. Integrity is limited by the rigor of the management system used to identify and correct equipment failures and systematic errors.


Integrity Level (IL)--Represents one of four discrete ranges used to benchmark the integrity of each PIF and the PIS, where IL 4 is the highest and IL 1 is the lowest.

Logic Solver--That portion of an instrumented system performing one or more logic functions.

Low Demand--See Modes of Operation: Demand Mode.

Mean Time Between Failure (MTBF)--For a repairable device, the mean time to failure plus the mean time to restore.

Mean Time Between Failure Dangerous (MTBFD)--The average time between dangerous failures of a repairable device.

Mean Time Between Failure Safe (MTBFS)--The average time between safe failures of a repairable device.

Mean Time to Failure (MTTF)--The average time before a device's first failure.

Mean Time to Failure Dangerous (MTTFD)--The average time before a device's first dangerous failure.

Mean Time to Failure Safe (MTTFS)--The average time before a device's first safe failure.

Mean Time to Repair--The average time to repair equipment, from the time of failure to its return to normal operation. The mean time to repair is sometimes referred to as the mean time to restoration. Note that this time is more than the physical repair time.

Mechanical Integrity--Management system assuring equipment is inspected, maintained, tested, and operated in a safe manner consistent with its risk reduction allocation.

Operating Environment--The conditions under which a device is intended to be used, such as external environmental conditions, process operating conditions, communication robustness, process and system interconnections, and support system quality.

Partial Testing--Method of proof testing that checks a portion of the failures of a device, e.g., partial stroke testing of valves and simulation of input or output signals.

Pass/Fail Criteria--Pre-established criteria that define the acceptability of equipment operation relative to the design basis and equipment specification.

Probability--Expression for the likelihood of occurrence of an event or an event sequence during an interval of time, or the likelihood of the success or failure of an event on test or on demand. Probability is expressed as a dimensionless number ranging from 0 to 1.

Probability of Failure on Demand Average (PFDAVG)--The average probability of a device failing to respond to a demand within a specified proof test interval. It is also the average unavailability.

Process Demand--A process condition (or event) that requires a protective system to take action to achieve or maintain a safe state of the process.

Proof Test--A physical inspection and witnessed test, or series of tests, executed to demonstrate that the equipment operates according to the design basis and is maintained in the "as good as new" condition.

Proof Test Coverage--Expressed as the percentage of failures that can be detected by the proof test. A complete proof test should provide 100% coverage of the failures.

Random Failure--Failure whose occurrence is unpredictable and which results from various degradation mechanisms in the hardware.

Redundancy--Use of two or more devices, systems, or layers to perform the same function.

Reliability--The probability that equipment operates according to its specification for a specified period of time under all relevant conditions. It is one of the core attributes of a protection layer.

Repair Time--The time required to detect a failure, repair the failure, and return the equipment to its normal operation. The average of this total duration is called the mean time to repair.

Safe Failure--Failure affecting equipment within a system which causes, or places the equipment in a condition where it can potentially cause, the process to achieve or maintain a safe state.

Safety Instrumented Function (SIF)--A safety function allocated to a safety instrumented system with a safety integrity level (SIL) necessary to achieve the required risk reduction for an identified hazardous event involving a catastrophic release.

Safety Instrumented System (SIS)--Composed of a separate and independent combination of sensors, logic solvers, final elements, and support systems that are designed and managed to achieve a specified safety integrity level. An SIS may implement one or more safety instrumented functions (SIFs).

Safety Integrity Level (SIL)--Represents one of four discrete ranges used to benchmark the integrity of each SIF and the SIS, where SIL 4 is the highest and SIL 1 is the lowest.

Sensor--A measurement device (instrument) or combination of devices that detect process variables or conditions (e.g., transmitters, transducers, process switches, and toxic gas detectors). The sensor boundary includes the process connection, sensor, transmitter, and signal connection to the logic solver.

Spurious Operation--Failure causing equipment to take action on the process when not required. Spurious operation has an immediate impact on process uptime and potentially on process safety.

Spurious Trip--Refers to a process shutdown, or disruption, due to the spurious operation of equipment. Other terms often used include nuisance trip and false shutdown.

Spurious Trip Rate (STR)--Expected rate (number of trips per unit time) at which a process shutdown, or disruption, occurs due to the spurious operation of equipment. Other terms used include nuisance trip rate and false shutdown rate.

Support Systems--Human machine interfaces, communications, wiring, power supplies, and other utilities that are required for the system to function.

Systematic Failure--Failure related in a deterministic way to a root cause, which can only be minimized by effective implementation of the safety management system.

Test Interval--Time period between two successive proof tests.

Trip--A process shutdown that may be due to a process demand or to a spurious failure of a safety instrumented system.

Unavailability (Instantaneous)--The probability that the equipment is not capable of performing its required function under given conditions at a given instant of time, assuming the required external resources or support systems are provided.


Unavailability (Mean)--The fraction of time that the equipment is not capable of operating in the desired manner.

Undetected Failure--Failure that is revealed only by inspection, proof test, or process demand. Synonyms include hidden, concealed, unannounced, unrevealed, and covert.

Useful Life Failure--Random failure that occurs during the time period where the failure rate can be considered relatively constant, because early failures have been corrected and wear-out failures have not begun.

Verification--Activity of reviewing, inspecting, checking, testing, or by other means determining and documenting whether the outcome of work processes, activities, or tasks conforms to specified requirements and traceable input information.

Voting--Specific configuration of equipment within a subsystem. Voting is often expressed as MooN (M out of N), where N designates the total number of devices (or channels) implemented and M designates the minimum number of devices (or channels) out of N required to initiate, take, or maintain the safe state. Also called voting system or voting architecture.

Wear-out Failure--Failure occurring during the time period when the equipment's failure rate begins to increase due to various failure mechanisms.


Annex C Fault Tree Analysis


C.1 Introduction to Fault Tree Analysis

Fault Tree Analysis (FTA) originated in the 1960s at Bell Telephone Laboratories under the direction of H. A. Watson. FTA was developed to evaluate the safety of the Polaris missile project and was used to determine the probability of an inadvertent launch of a Minuteman missile. The methodology was extended to the nuclear industry in the 1970s for evaluating the potential for runaway nuclear reactors. Since the early 1980s, FTA has been used to evaluate the potential for incidents in the process industry, including the potential for failure of the safety instrumented function (SIF). FTA is a well-recognized and well-respected technique for determining the probability of events that occur due to failures of various equipment and components. The symbols used in Fault Tree Analysis are given in Annex A, and the mathematics used are given in Annex B.

FTA can be a rigorous and time-consuming methodology. It is a very structured, graphical technique that can be used to examine a single interlock or the interaction of multiple interlocks. Since FTA is applied at the component and application-specific event level, it should not be applied until the SIF design is well understood. In terms of the ANSI/ISA-84.01-2004 Life Cycle Model, the FTA should be performed only after the Safety Requirement Specification or Conceptual Design phases are complete.

C.2 WARNINGS

C.2.1 FTA, like all the other methods in this report, cannot arrive at an absolute answer. FTA can only account for failure pathways that the person doing the analysis identifies and includes in the model. Furthermore, the failure rate values used in the assessment are based on large samples of industrial data. These failure rates must be adjusted with knowledge of the actual process operating conditions, external environmental conditions, operating history, maintenance history, and equipment age.

C.2.2 FTA, like all the other methods in this report, is not a replacement for good engineering design principles, but it is a good method to assess the SIL of the SIF design.

C.2.3 ANSI/ISA-84.01-2004, like other international standards describing the application of SIFs in the process industry, defines SIL in terms of PFDavg. Unfortunately, it is difficult to obtain a PFDavg value for an entire system due to the time-dependent, non-linear properties of most SIF logic. Calculation of the actual average can be performed by either a) deriving the instantaneous equation that describes the SIF logic and symbolically integrating it over the test interval, or b) numerically integrating the SIF logic using a large number of discrete time intervals over the test interval. As an alternative, many practitioners of FTA use an approximation to calculate PFDavg in a single step. Using the approximation, the analyst integrates the instantaneous equation for each component over its test interval to determine the PFDavg for the component. Then, the individual component PFDavg values are combined using Boolean algebra based on the fault tree logic to calculate the overall PFDavg. Care should be exercised when employing this approximation. The deviation from the actual average can be substantial, and the direction of the error is typically non-conservative (i.e., it results in a lower PFDavg than is actually achieved). When using this approximation, the analyst is cautioned to select conservative failure rates to account for non-conservative inaccuracies in the approximation technique. The approaches described above are different and may not result in the same PFDavg, depending on the configuration. Both approaches are discussed further in Annex B with a comparison of the numerical results. Section 7.0 also uses both solution techniques to solve the Base Case Example.
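The difference between the exact average and the single-step approximation can be illustrated for a single 1oo1 component with a constant dangerous failure rate. A minimal Python sketch, with a hypothetical failure rate and test interval chosen only for illustration:

```python
import math

# Hypothetical 1oo1 component (illustrative values only):
lam = 1.0e-6   # dangerous undetected failure rate, per hour
ti = 8760.0    # proof test interval, hours

def pfd(t):
    """Instantaneous PFD of a single component: 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)

# b) numerical integration over the test interval (trapezoidal rule)
n = 10000
h = ti / n
pfd_avg_numeric = sum(0.5 * (pfd(i * h) + pfd((i + 1) * h)) for i in range(n)) / n

# a) symbolic integration of the instantaneous equation (closed form)
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam * ti)) / (lam * ti)

# single-step approximation commonly used by FTA practitioners
pfd_avg_approx = lam * ti / 2.0
```

For a single component the lam·TI/2 value sits slightly above the true average; as noted above, the non-conservative deviations appear when component-level averages are combined through the fault tree logic.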
Due to the widespread use of FTA, many software packages are available to facilitate the calculations. These software packages typically use the approximation technique for obtaining the PFDavg. As with any software tool, the User is cautioned to understand the equations, mathematics, and any simplifying assumptions, restrictions, or limitations.

C.3 Procedure

FTA is generally an iterative process that involves modeling a SIF to determine the PFD, then modifying the SIF (and the associated model) to achieve the target PFD. The fault tree analysis of a SIF can be broken down into six essential steps:

1. SIF Description and Application Information;
2. Top Event Identification;
3. Construction of the Fault Tree;
4. Qualitative Examination of the Fault Tree Structure;
5. Quantitative FTA Evaluation; and
6. Documentation of FTA Results.

The following procedure summarizes the important aspects of how a SIF is modeled using FTA.

C.3.1 Step 1. SIF description and application information

Calculations to verify that the SIF design meets the specified SIL are generally performed during the Conceptual Design phase of the Safety Life Cycle Model. Consequently, the information required for the FTA should be well understood and readily available. Information critical to the successful development of the fault trees is as follows:
Instrumentation description
Process description
Support systems (instrument air, cooling water, hydraulic, electrical power, etc.) involved in SIF operations
Testing frequency and whether testing is done on-line or off-line
Testing procedures and equipment used, and the likelihood for SIF equipment to be compromised by testing
Failure modes
Failure rates
Diagnostic coverage
Repair intervals and whether repair is done on-line or off-line
Maintenance procedures and the likelihood of SIF equipment being compromised by repair
Management of change procedures, frequency of change, and likelihood of error introduced during change


Operating and maintenance discipline, including an estimate of the frequency of human error and circumstances where incorrect bypassing could occur
Administrative procedures
Common cause failures
Systematic failures

C.3.2 Step 2. Top event identification

The FTA process begins with the determination of the Top Event. For SIL determination, the Top Event is the failure of the SIF to respond to a process demand for a given safety function. Fault trees can also be constructed to determine the potential for the SIF to trip spuriously. The structure of the fault tree is different for SIL determination and for spurious tripping, so the Top Event to be modeled must be defined prior to proceeding with the fault tree analysis.

A process unit often has more than one safety function that will require SIL determination. Each safety function has a defined Top Event that is associated with a specific process hazard identified by the Process Hazards Analysis (PHA). The Top Event will, in turn, have failure logic associated with the event that can be modeled in a fault tree. For instance, a furnace might have a tube rupture Top Event that can be detected with a pass flow measurement. The same furnace might have a firebox overpressure Top Event that is detected by burner pressure. The tube rupture and firebox overpressure safety functions would be modeled with separate fault trees, although they may share a logic solver and a fuel gas shutoff valve. The two safety functions might even have different SIL requirements. Only those sensors and final elements that prevent or mitigate the designated event are included in the calculations.

C.3.3 Step 3. Construction of the fault tree

Once the Top Event has been determined, the fault trees are constructed using appropriate failure logic. FTA models how the failure of a particular component or set of components can result in the Top Event. The SIF is analyzed by a top-down procedure, in which the primary causes of the Top Event are identified. The fault tree construction continues by determining the failures that lead to the primary event failures. The fault tree is constructed using fault tree symbols and logic gates as described in Annex A.
The construction of the fault tree continues until all the basic events that influence the Top Event are evaluated. Ideally, all logic branches in the fault tree are developed to the point that they terminate in Basic events. At a minimum, the fault tree logic should include how failures of individual SIF components, including the various inputs, outputs, and the logic solver, affect the Top Event. SIF component failures that are Basic events include primary, common cause, and systematic failures.

Examples of symbols typically used in Fault Tree Analysis are shown in Figure A.1, followed by a brief description.(6,7,8)


Figure A.1 Examples of fault tree symbols: Basic Event, Boxed Basic Event, Undeveloped Event, House Event, AND Gate, OR Gate, Transfer Gate, Transfer


Each fault tree symbol represents specific logic:

A basic event is the limit to which the failure logic can be resolved. A basic event must have sufficient definition for determination of the appropriate failure rate data and equation.

A boxed basic event is the same as a basic event. The box allows a text description to be placed above the basic event.

Undeveloped events are events that could be broken down into sub-components but, for the purposes of the model under development, are not broken down further. An example of an undeveloped event is the failure of the instrument air supply: an undeveloped event symbol and a single failure rate can be used to model the instrument air supply rather than modeling all of its components. FTA treats undeveloped events in the same way as basic events.

House events are events that are guaranteed to occur or guaranteed not to occur. House events are typically used when modeling a SIF with sequential events or when operator action or inaction results in SIF failure (for example, over-rides).

AND gates are used to define a set of conditions or causes in which all the events in the set must be present for the gate event to occur. The set of events under an AND gate must meet the test of "necessary and sufficient." Necessary means each cause listed in a set is required for the event above it to occur; if a necessary cause is omitted from a set, the event above will not occur. Sufficient means the event above will occur if the set of causes is present; no other causes or conditions are needed.

OR gates define a set of events in which any one of the events in the set, by itself, can cause the gate event. The set of events under an OR gate must meet the test of "sufficient."

Transfer gates are used to relate multiple fault trees. The right or left transfer gates associate the results of the fault tree with a transfer-in gate on another fault tree.
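The AND and OR gate rules can be sketched as a small recursive evaluator for independent basic events. The example tree and probability values below are hypothetical; real FTA tools add cut-set generation, common cause treatment, and house/transfer event handling:

```python
def evaluate(node, basic_probs):
    """Probability of the gate event, assuming independent basic events."""
    kind = node[0]
    if kind == "basic":
        return basic_probs[node[1]]
    child_probs = [evaluate(child, basic_probs) for child in node[1]]
    if kind == "and":   # all causes in the set must be present
        p = 1.0
        for q in child_probs:
            p *= q
        return p
    if kind == "or":    # any one cause, by itself, is sufficient
        p_none = 1.0
        for q in child_probs:
            p_none *= (1.0 - q)
        return 1.0 - p_none
    raise ValueError("unknown gate: " + kind)

# Hypothetical Top Event: (sensor A AND sensor B) OR logic solver OR valve
tree = ("or", [("and", [("basic", "sensorA"), ("basic", "sensorB")]),
               ("basic", "logic_solver"),
               ("basic", "valve")])
probs = {"sensorA": 0.01, "sensorB": 0.01, "logic_solver": 0.001, "valve": 0.005}
top_probability = evaluate(tree, probs)
```

The OR gate uses the exact complement-product rule rather than the rare-event sum, so redundant (AND) branches and high-probability events are handled without double counting.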

Problems in Constructing Models

The User should be cautioned to proceed with fault tree development carefully to ensure that the fault tree does not evolve into a functional logic description of the SIF. A key point in the fault tree development is that the fault tree should model how failures in the SIF propagate into the Top Event (fail-safe or fail-dangerous conditions). In the initial stages of fault tree development, it is critical to address all known paths to SIF failure. Basic events that are proven to be negligible in their effect on the probability of the Top Event may be omitted from the analysis at a later time.

C.3.4 Step 4. Qualitative review of the fault tree structure

After the fault tree is constructed, it should be reviewed. The fault tree review should include the process and instrumentation designers, operations, and risk assessment. This review confirms that the fault tree model has correctly captured:

The Top Events and the safety functions specified in the PHA and the SRS
The failure modes of the components
The combinations of basic events leading to the Top Events
All significant pathways to failure
Common cause failures
Systematic failures
Other SIF complexities or interactions

For large and/or complex fault trees, qualitative examination of the fault tree alone may not be sufficient to completely audit its structure. For these fault trees, a listing of the minimal cut sets should also be generated and reviewed for consistency with how the SIF functions. A cut set is a combination of basic events that gives rise to the Top Event; that is, when all the basic events in the cut set fail, the Top Event will occur. A brief discussion of minimal cut sets is provided in Annex B.

C.3.5 Step 5. Quantitative evaluation of the fault tree

Once the fault tree structure is fully developed, failure rate data is employed to quantify the fault tree. Failure rate data can be obtained from plant experience or from industry published data. A listing of the industry published data sources is provided in ISA-TR84.00.02-2009 - Part 1. The data must be obtained for all SIF components. Since the primary objective of the Fault Tree Analysis is to obtain a reasonable and conservative estimate of PFDavg, it is better to use conservative failure rates for the field components; that is, conservative failure rates will result in a higher estimate of PFDavg.

Fault tree analysis involves the use of Boolean algebra for the mathematical quantification. An overview of the equations typically used in the assessment of safety instrumented functions is provided in Annex B. Hand calculations using these equations are possible but can become quite cumbersome. Therefore, it is recommended that a computer software program be used for quantification of the fault trees. There are several commercially available software tools.

As the tree is quantified, the results should be examined for consistency. A cut set report should be generated showing the order of importance of each cut set to the overall PFDavg. The cut sets at the top


and the bottom of the importance list should be examined to see whether their presence in the importance list (influence on PFDavg) makes sense in view of practical knowledge of the facility and similar facilities. Next, the calculated PFDavg should be compared to the target PFDavg specified in the Safety Requirements Specification (see ANSI/ISA-84.01-2004, Clause 5 and Clause 6.2.2) for each safety instrumented function (SIF). If the calculated PFDavg does not meet the target, apply risk reduction techniques and re-calculate until the target PFDavg is met. Typical risk reduction techniques that might be addressed are as follows:

Increase the testing frequency for SIF components.
Investigate the MTTFD and MTTFspurious of SIF components and consider replacing low integrity SIF components with types or models that have greater integrity.
Consider modifying the SIF to include more redundancy or diversity.
Increase the diagnostic capability of the SIF components.

Other risk reduction techniques require PHA team participation:

Improve administrative procedures for design, operation, and maintenance, or
Add other layers of SIF protection.
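The first technique, increasing the testing frequency, can be quantified with the common PFDavg ≈ λ·TI/2 approximation for a single-channel component (valid when λ·TI is small). The failure rate below is hypothetical, chosen only to illustrate the proportional effect:

```python
# Effect of proof test interval on PFDavg for a 1oo1 component,
# using the PFDavg ~= lam * TI / 2 approximation (valid for lam*TI << 1).
lam = 2.0e-6  # hypothetical dangerous undetected failure rate, per hour

def pfd_avg(test_interval_hours):
    """Single-step approximation of the average probability of failure on demand."""
    return lam * test_interval_hours / 2.0

annual = pfd_avg(8760.0)     # proof tested once per year
quarterly = pfd_avg(2190.0)  # proof tested four times per year
```

Under this approximation, PFDavg scales linearly with the test interval, so testing four times as often reduces PFDavg by a factor of four, possibly moving the SIF into the next SIL band.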

The fault tree model can be updated to calculate the new PFDavg as these risk reduction techniques are applied.

C.3.6 Step 6. Documentation of FTA Results

The FTA documentation may include, but is not limited to:

SIF application (Company, Plant, Unit, Safety Function)
Assumptions
Reference to the SRS documents used in the FTA
Data
Model
Cut sets and importances for each top event
PFDavg
MTTFspurious
Sensitivity and what-if studies (A sensitivity study estimates the change in PFDavg or MTTFspurious for estimates of uncertainty in the component failure rate data. A what-if study estimates the change in PFDavg or MTTFspurious for changes in the SIF configuration.)
Recommendations for improvement of the SIF (if any)
Calculation details:

The FTA analysis program used
Equations chosen

Hand calculations used to transform component failure rate data into program input format, if used
Software options selected (for example, cut-off criteria)
Input and output files (on disk or in electronic form)
Name of the person doing the calculations
Date(s) the work was completed


Annex D Markov Analysis


D.1 INTRODUCTION
The Markov modeling technique originated with the Russian mathematician A. A. Markov (1856-1922), who was engaged in research on mathematically describing random processes. Over the years, that work has been extensively developed, and the Markov technique has received more attention and increased use.

The basic principle of Markov analysis is that a system can exist in different states. Each state is defined by an internal failure in the system. Usually these internal failures are combined up to the level of what are called system states. These states are often driven by the availability of data; for example, data can be available at the board level but can also be available at the transistor level. Independent of the level of detail, the system can be:

a fully operational system;
a partially failed (degraded) system, still fulfilling its function; or
a totally failed system.

A Markov model consists of Markov states and the transitions between these states (see Figure D.1). The driving force for a transition from one state to another is the failure or repair probability of components. There are two reasons why a transition from one state to another can occur: first, a component in an operating state can fail; second, a component in a failed state can be repaired.
Figure D.1 Simple Markov model (State 1 and State 2 connected by a Failure transition arc and a Repair transition arc)


Markov modeling is a quantitative analysis technique that can be used to verify the SIS performance. This section presents:

A general discussion of Markov modeling.
An example of a non-repairable system using a discrete Markov model. This model is compared to two other models: the Simplified Equation model and the continuous Markov model.
Insight into the time series developed by discrete Markov modeling, demonstrated by an analogy of a tank farm with controllers.
An example of repairable Markov models.
An example of a discrete Markov model combining repairable and non-repairable models.


All of the examples in this section were calculated using Excel spreadsheets.

For a Markov model, a system component is represented as being in one of two states: it is either in the working state or in the failed state. When a system is made up of multiple components, failure of one component will result in different failure modes or conditions of the overall system. Modeling an entire system requires each component to be modeled as working or failed, and then the state of the system to be assessed as working or failed. Assuming that a component is initially in the working state, the mathematics employed to describe its transition from working to failed would ideally use the probability distribution that most accurately describes the equipment's failure characteristics as a function of time. Typically, however, that information is not known, and two assumptions are made:

Equipment is operating in its useful life.
Failures are random and have a constant (probability) failure rate; therefore, the exponential distribution is used to describe the transitions.

If either of these assumptions is not true, the results will not be correct. The actual mathematical expression used to describe the transition of a component depends upon whether the failure is detected or undetected. If the failure is detected, the opportunity to repair is available, reducing the probability of being in the failed state. If the failure is undetected, then the failure can only be repaired if the component is tested and the failure is identified; the probability of being in the failed state will increase as the time between tests and subsequent repair is increased.

When performing a Markov analysis, a Markov diagram, also known as a state transition diagram, must be constructed. The state transition diagram is a graphical representation of the system's operational, degraded, and failed states, as well as the transitions between them. Most commonly, the transition rates are constant failure rates, constant repair rates, and, for discrete Markov models, deterministic repair rates. The Markov diagram represents the system. The results of a Markov model include estimates of PFD, PFDavg, MTTFspurious, RRF, etc. A Markov diagram uses three symbols: circles show states of successfully operating components, degraded components, and failed components; failures and repairs are shown with transition arcs.

The State symbol represents a Safe State, a Degraded State (the degraded state is not shown), or a Failed State.
The Failure Rate arc symbol, designated with the character λ -- failure rates can be dangerous, safe, etc.
The Repair Rate arc symbol, designated with the character μ -- repair rates can be continuous or deterministic. The deterministic repair rate is very important and is what allows us to model different scheduled repair frequencies, such as shutdowns and on-line testing.


Failure rates can be divided into fail dangerous, fail safe, fail detected, fail undetected, fail degraded, and failure from common cause. Repairs can be divided into continuous repairs, shutdown repairs, complete repairs, and partial repairs. The development of a Markov model requires an understanding of the probability failure rate, as well as the use of a transition matrix to calculate the probabilities of each identified state. Each subject is briefly discussed below.

D.1.1 Probability Failure Rate

The definition of failure rate is:

λ = Nf / (Ns × Δt)

where
Ns = number of successful units at the end of the time period
Nf = number of failed units during the time period
Δt = time period (Tn+1 - Tn)

Notice that the failure rate is a probability (Nf / Ns) per unit time; a better name would be probability failure rate. A (probability) failure rate implies that if the number of successful units (Ns) decreases, then the number of failures per unit time (Nf / Δt) also must decrease. This is an important concept and is discussed later in the section on the Markov tank farm.

D.1.2 Transition Matrix

The transition matrix contains all of the information about the system, so we need to understand how to derive the transition matrix from the model representation. Using Figure 2, the following transition matrix will be developed. The transition matrix is always a square matrix whose dimension is the number of states; Figure 2 has three states, so this is a 3x3 matrix. The matrix is populated using the arcs defined in the model. Let's develop the non-diagonal cells of row 1: from State 0 to State 1 is λsafe, and from State 0 to State 2 is λdangerous. For the diagonal cell, the sum of each row must equal 1, therefore from State 0 to State 0 = 1 - λsafe - λdangerous. It is left to the reader to complete the development of the matrix. The solution is shown in Table 2.


To State
From State    0                         1         2
0             1 - λsafe - λdangerous    λsafe     λdangerous
1             μ                         1 - μ     0
2             0                         0         1

Table 2
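As a check on the development above, the transition matrix can be constructed in code and its defining property verified: every row of a transition matrix must sum to 1. The rates match the numeric example that follows:

```python
lam_safe, lam_dang, mu = 0.01, 0.0022, 0.03

# Transition matrix per Table 2: rows are "from" states, columns are "to" states.
P = [
    [1 - lam_safe - lam_dang, lam_safe, lam_dang],  # State 0: full operation
    [mu,                      1 - mu,   0.0],       # State 1: failed safe, repairable
    [0.0,                     0.0,      1.0],       # State 2: failed dangerous, absorbing
]

row_sums = [sum(row) for row in P]
```

State 2 has no outgoing repair arc, so its row is [0, 0, 1]: a dangerous undetected failure is absorbing until a proof test returns the system to State 0.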

D.1.3 Matrix Math

The time series is developed by multiplying the state vector by the transition matrix, starting from the row vector [1, 0, 0]. That is, S1 = S0 × P (the state vector at time 1 is equal to the state vector at time 0 multiplied by the transition matrix). The reader can review matrix multiplication in most spreadsheets. This multiplication develops the time series for PFD; PFDavg is then the time average of the PFD over the model. The only dangerous state is State 2, so the PFDavg is calculated by taking the arithmetic average of the State 2 probability. Below is the transition matrix for the three rates λsafe = 0.01, λdangerous = 0.0022, and μ = 0.03.

0.9878   0.01    0.0022
0.03     0.97    0
0        0       1

Time   State 0   State 1   State 2
0      1.000     0.000     0.000
1      0.988     0.010     0.002
2      0.976     0.020     0.004
3      0.965     0.029     0.007
4      0.954     0.038     0.009
5      0.943     0.046     0.011
6      0.933     0.054     0.013
7      0.923     0.062     0.015
8      0.914     0.069     0.017
9      0.905     0.076     0.019


The time series from 91 to 104 is continued below (the deterministic repair at the end of the 100-increment test interval returns the state vector to [1, 0, 0]):

Time   State 0   State 1   State 2
91     0.636     0.216     0.149
92     0.634     0.215     0.150
93     0.633     0.215     0.151
94     0.632     0.215     0.153
95     0.631     0.215     0.154
96     0.629     0.215     0.156
97     0.628     0.215     0.157
98     0.627     0.215     0.158
99     0.626     0.214     0.160
100    0.625     0.214     0.161
101    1.000     0.000     0.000
102    0.988     0.010     0.002
103    0.976     0.020     0.004
104    0.965     0.029     0.007

Chart 2 shows the time series chart for all three states. State 2 will be averaged to calculate the PFDavg. So far we have defined the P matrix. MTTFspurious is calculated using the Q matrix, the I matrix, and the N matrix; the MTTFspurious will not be calculated in this example.
The matrix calculations need to be shown for MTTF spurious and STR.

Chart 2
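The time series behind Chart 2 can be reproduced by repeated multiplication of the state vector by P; the reset to [1, 0, 0] after each 100-increment test interval models the deterministic proof-test repair described earlier, and the 400-increment horizon matches the chart. A minimal sketch:

```python
lam_safe, lam_dang, mu = 0.01, 0.0022, 0.03
P = [[1 - lam_safe - lam_dang, lam_safe, lam_dang],
     [mu, 1 - mu, 0.0],
     [0.0, 0.0, 1.0]]

def step(s):
    """One time increment: s_new = s * P (row vector times matrix)."""
    return [sum(s[i] * P[i][j] for i in range(3)) for j in range(3)]

history = [[1.0, 0.0, 0.0]]           # state vector at time 0
for t in range(1, 401):
    if t > 1 and (t - 1) % 100 == 0:  # deterministic repair after each 100-step test interval
        history.append([1.0, 0.0, 0.0])
    else:
        history.append(step(history[-1]))

# PFDavg is the arithmetic average of the State 2 probability over the horizon
pfd_avg = sum(s[2] for s in history) / len(history)
```

The iteration reproduces the tabulated values (e.g., [0.988, 0.010, 0.002] at time 1 and the return to [1, 0, 0] at time 101), and averaging State 2 over the horizon gives the PFDavg directly.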


D.2 MARKOV TANK FARM


An analogy of a tank farm may prove useful to understand the interaction of the States. These interactions are controlled by the failure rates and repair rates. The following components of the tank farm analogy are defined below. Figure 3 depicts the analogy showing the failure rate and repair rates controlling valves between the tanks.

Figure 3 Tank farm analogy: a State 0 (No Failures) tank and a State 1 (Failed State) tank connected through valves, with ratio controllers (flow and level transmitters) representing the failure rate λsafe and the repair rate

States can be thought of as the tanks. There are several types of tanks -- the Safe tank, the Fail Dangerous Undetected tank, the Fail Safe Detected tank -- i.e., the modes of failure. The level in a tank is analogous to the probability of being in that State.

Repair rates can be thought of as repair probability controllers. The definition of the repair rate (μ = Nr / (Ns × Δt)) can be rearranged to understand how the repair controllers work. The controller is defined as the Flow/Volume, where Flow = Nr / Δt and Volume = Ns, so the Repair controller has a setpoint of the probability repair rate. In the examples considered in this section, it is assumed the Repair controller has a constant setpoint, except during shutdowns when the setpoint goes to zero. During non-shutdown times, as the volume in the tanks changes, the controller will adjust the Flow to maintain a constant Flow/Volume, i.e., a constant repair rate.

Failure rates can also be thought of as controllers. The definition of the failure rate (λ = Nf / (Ns × Δt)) can be rearranged to understand how the failure controller works. The controller is defined as the Flow/Volume, where Flow = Nf / Δt and Volume = Ns, so the Failure controller has a setpoint of the probability failure rate. In the examples considered in this section, it is assumed the failure controller has a constant setpoint, except during shutdowns when the setpoint goes to 1. During non-shutdown times, as the volume in the tanks changes, the controller will adjust the Flow to maintain a constant Flow/Volume, i.e., a constant failure rate.

The PFD can be thought of as the combined level in all dangerous tanks, and the PFDavg as the average of that level over time. The level in the Safe tank can be thought of as the availability of the system. The time series charts plot the levels of the different tanks.


The Safety Integrity Level (the "level" in SIF terminology) can be thought of as the average combined level in all of the dangerous tanks. A SIL 3 SIF implies the average combined dangerous level is between 0.1% and 0.01%; a SIL 2 SIF, between 1% and 0.1%; a SIL 1 SIF, between 10% and 1%. The analogy suggests one conclusion concerning SIL: it is difficult to design a reliable control system that will maintain a tank level between 10% and 1%; it is more difficult (i.e., special hardware is needed) to maintain a level between 1% and 0.1%; and it is very difficult to maintain a level between 0.1% and 0.01%.

D.3 FAIL SAFE MODEL


The probability of failing safe is shown in Figure 4, assuming that safe failures occur at the rate λsafe and are repaired at the rate μ.

Figure 4

The P matrix is shown in Table 3.

                  To State
From State     0             1
0              1 - λsafe     λsafe
1              μ             1 - μ

Table 3

Substituting λsafe = 0.01 and μ = 0.02 into the P matrix gives Table 4; performing the matrix calculations for the first 10 time increments gives Table 5.

0.99    0.01
0.02    0.98

Table 4


Time    State 0    State 1
0       1.0000     0.0000
1       0.9900     0.0100
2       0.9803     0.0197
3       0.9709     0.0291
4       0.9618     0.0382
5       0.9529     0.0471
6       0.9443     0.0557
7       0.9360     0.0640
8       0.9279     0.0721
9       0.9201     0.0799

Table 5
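The time series in Table 5 can be reproduced by repeatedly multiplying the state-probability row vector by the P matrix. A minimal Python sketch (ours, not part of the technical report), using the Table 4 values:

```python
# Two-state Markov model of the fail-safe example (Tables 4 and 5):
# lambda_safe = 0.01 (failure) and mu = 0.02 (repair) per time increment.
P = [[0.99, 0.01],   # from State 0: stay in State 0, or fail safe
     [0.02, 0.98]]   # from State 1: get repaired, or stay failed

def step(p):
    """One time increment: multiply the state-probability row vector by P."""
    return [sum(p[i] * P[i][j] for i in range(2)) for j in range(2)]

p = [1.0, 0.0]       # all probability starts in State 0 (no failures)
series = [p]
for _ in range(100):
    p = step(p)
    series.append(p)

# The first increments reproduce Table 5; in the long run State 1
# approaches MTTR / (MTTF + MTTR) = 50 / (100 + 50) = 0.333.
print(series[2])     # ≈ [0.9803, 0.0197]
```

Iterating far enough shows the asymptotic behavior discussed below Chart 3.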

Chart 3 shows the graphical representation of the Markov model.


(Markov time series: probability of State 0 and State 1 over 100 time increments.)

Chart 3

Several conclusions can be drawn from Chart 3. This system will eventually come to steady state. The steady-state value depends on the values of μ and λsafe. Defining MTTF = 1/λsafe and MTTR = 1/μ, the steady-state value of State 1 can be shown to be MTTR / (MTTF + MTTR). For this case the steady state of State 1 = 50 / (100 + 50) = 0.333, seen as the asymptotic value of State 1 in Chart 3.

Using our Markov tank farm analogy, this asymptotic value seems appropriate. When the system is turned on just after a shutdown, we expect Safe Tank 0 to be full and Fail Safe Tank 1 to be empty. At first the failure controller (λsafe) has a considerable amount of flow from Safe Tank 0 to Fail Safe Tank 1, because the controller has opened the valve to maintain the constant failure rate. As time marches on, the failure controller closes the valve in order to keep a constant Flow/Volume ratio. The repair controller initially has zero flow from Tank 1 to Tank 0; eventually it must start flow to Tank 0 to keep the repair Flow/Volume constant for Tank 1. At some point in time,


the failure controller and the repair controller come to equilibrium and the levels reach a steady state: Tank 0 approaches 67% and Tank 1 approaches 33%. It is interesting to note what happens to State 1 as the repair rate approaches zero: the system approaches the graph presented in the first example. This is shown in Chart 4.
(Markov time series: State 1 with λ = 0.1, μ = 0.1 versus State 1 with λ = 0.1, μ = 0.)

Chart 4

D.4 FAIL SAFE AND FAIL DANGEROUS MODEL


This example includes three failure states: 1) Safe Detected, 2) Dangerous Detected, and 3) Dangerous Undetected. Assuming that the dangerous undetected failures cannot be repaired on-line, the Markov model is shown in Figure 5.

Figure 5


The probability of finding the system in the different states is modeled by means of ordinary differential equations:

dP0/dt = -(λS + λDD + λDU) P0 + μS P1 + μDD P2
dP1/dt = λS P0 - μS P1
dP2/dt = λDD P0 - μDD P2
dP3/dt = λDU P0

where

Pi(t) is the probability of finding the element in state i,
λ is a failure rate of the element, and
μ is a repair rate.

The P matrix is shown in Table 6.

                        To State
From State    0                          1            2           3
0             1 - λsafe - λdd - λdu      λsafe        λdd         λdu
1             μsafe                      1 - μsafe    0           0
2             μdd                        0            1 - μdd     0
3             0                          0            0           1

Table 6

The P matrix with λsafe = 0.001, μsafe = 0.04, λdd = 0.06, μdd = 0.125, λdu = 0.03 is shown in Table 7.

0.909    0.001    0.06     0.03
0.04     0.96     0        0
0.125    0        0.875    0
0        0        0        1

Table 7

The first ten time increments of the time series are shown in Table 8.

Time    State 0    State 1    State 2    State 3
0       1.0000     0.0000     0.0000     0.0000
1       0.9090     0.0010     0.0600     0.0300
2       0.8338     0.0019     0.1070     0.0573
3       0.7714     0.0026     0.1437     0.0823
4       0.7193     0.0033     0.1720     0.1054
5       0.6754     0.0039     0.1937     0.1270
6       0.6383     0.0044     0.2100     0.1473
7       0.6067     0.0049     0.2220     0.1664
8       0.5794     0.0053     0.2307     0.1846
9       0.5557     0.0056     0.2366     0.2020

Table 8
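Table 8 can be generated the same way as the two-state example, by iterating the state-probability vector against the Table 7 matrix. A sketch (ours, not part of the report) that also illustrates the tank-farm reading of PFD:

```python
# Four-state Markov model of the fail safe / fail dangerous example (Table 7).
# State 0 = no failures, State 1 = fail safe detected,
# State 2 = fail dangerous detected, State 3 = fail dangerous undetected.
# State 3 is absorbing: dangerous undetected failures are not repaired on-line.
P = [[0.909, 0.001, 0.060, 0.030],
     [0.040, 0.960, 0.000, 0.000],
     [0.125, 0.000, 0.875, 0.000],
     [0.000, 0.000, 0.000, 1.000]]

def step(p):
    """One time increment: multiply the state-probability row vector by P."""
    return [sum(p[i] * P[i][j] for i in range(4)) for j in range(4)]

p = [1.0, 0.0, 0.0, 0.0]   # fresh start just after a shutdown
history = [p]
for _ in range(100):
    p = step(p)
    history.append(p)

# history[1] and history[2] reproduce the first rows of Table 8.
# The PFD at any time is the combined level in the dangerous "tanks"
# (States 2 and 3); PFDavg is its average over the interval considered.
pfd_avg = sum(h[2] + h[3] for h in history) / len(history)
print(history[2])   # ≈ [0.8338, 0.0019, 0.1070, 0.0573]
```

Running the iteration long enough shows State 3 absorbing nearly all of the probability, as Chart 5 and the discussion below describe.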

The chart of the Markov time series is shown in Chart 5.


(Example 3 Markov model failed states: probability of States 1, 2, and 3 versus time.)

Chart 5

Several conclusions can be drawn from Chart 5. This system will eventually end up with zero probability of being in State 1 or State 2.


Using the Markov tank farm analogy, this seems appropriate. When the system is turned on just after a shutdown, we expect Fail Safe Detected Tank 1, Fail Dangerous Detected Tank 2, and Fail Dangerous Undetected Tank 3 to be empty, i.e., we should have fixed all of the SIF failures during the shutdown. Staying with Chart 5, the Fail Dangerous Detected State 2 tank shows an interesting time series. At first the probability of being in State 2 increases rapidly: Safe Tank 0 has some level, which allows the dangerous failure controller to provide flow to Tank 2. Around time increment 10, however, it is evident that something is happening to the State 2 tank. Tank 3 is inexorably robbing level from Tank 0, so the Tank 2 failure controller eventually slows the flow to Tank 2, while the repair controller for Tank 2 empties its contents back into Tank 0. As time marches on, Tank 3 fills; Tank 3 absorbs everything.

D.5 CALCULATION PROCEDURES


Markov analysis offers certain advantages and disadvantages. The main advantage of Markov modeling is its flexibility: Markov analysis can model all the aspects that are important for SIFs. In one Markov model it is possible to model, for example, different failure modes of different components, different repair or test strategies (i.e., on-line, off-line, periodic), imperfect testing and repair, diagnostic capabilities, time-dependent sequences of failures, and common cause or systematic failures. Once the Markov model is constructed, all the information is available to calculate the probability of a failure on demand or of a spurious trip.

The main disadvantage is its computational and modeling complexity. A number of computer programs are available on the market to perform the actual calculations, for example CARE III(7), CARMS(8), MARKOV1(9), PC Availability(10), and MKV(11). The construction of the Markov model itself is seen by users and practitioners of the technique as the largest disadvantage. Today's current practice is that these models are constructed by hand. A straightforward FMEA approach can be used to construct the Markov model. This method is easy to use, although constructing the Markov model becomes more time consuming and tedious as the SIS grows in complexity. The basic work process is as follows:

1. Assign each safety function to its SIS as defined in the safety requirements specification(1).

2. List the components that have a safety impact on each safety function. This will include logic solver(s), sensor(s), and final control element(s).

3. List the possible failure modes for each component.

4. Determine the degraded (intermediate) and failed system states by introducing in a systematic way the different failure modes of each component and their effect on the safety function. Determine how the SIS can be repaired from the degraded (intermediate) and failed system states and construct the Markov model (Clause 7).

5. Solve the Markov model to determine the probability of being in any state as a function of time.

6. Calculate the PFDavg and the probability of a spurious trip of the SIS (Clause 8).

7. Determine if the PFDavg of the SIS generated by the Markov model technique meets the SIL requirements of the safety requirements specification(1).

8. If required, modify the configuration (hardware configuration, functional test interval, hardware selection, etc.) and repeat from step 3.

9. If the calculated probability of a spurious trip is unacceptable, modify the configuration (incorporate redundancy, use components with better reliability, etc.) and repeat from step 3.

10. When the SIS SIL and the probability of a spurious trip meet the specified requirements, the calculation procedure is done.

D.6 ASSUMPTIONS FOR MARKOV CALCULATIONS FOR AN SIF


The following assumptions were used in this Part for Markov analysis:

6.1 The SIF being evaluated will be designed, installed, and maintained in accordance with ANSI/ISA-84.01-2004.

6.2 Component failure and repair rates are assumed to be constant over the life of the SIF.

6.3 Redundant components have the same failure rates.

6.4 The sensor failure rate includes everything from the sensor to the input module of the logic solver, including the process effects (e.g., a plugged impulse line to a transmitter).

6.5 The logic solver failure rate includes the input modules, logic solver, output modules, and power supplies. These failure rates are typically supplied by the logic solver vendor.
NOTE 1 ISA-TR84.00.02-2009 - Part 5 illustrates a suggested method to use in developing failure rate data for the logic solver.

For the examples shown in this Part, the logic solver failure rate was estimated by taking the PFDavg for the logic solver, as supplied by the vendor, and converting it into a rate using Equation 6.1. The derivation of this equation is shown in ISA-TR84.00.02-2009 Part 3, Annex B.

(Eq. 6.1)    PFDavg = λ × TI / 2,  i.e.,  λ = 2 × PFDavg / TI
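Rearranging Equation 6.1 gives λ = 2 × PFDavg / TI. A small sketch of the conversion (the vendor PFDavg value used here is hypothetical, not taken from this report):

```python
def pfd_avg_to_rate(pfd_avg, ti_hours):
    """Back out an equivalent failure rate from PFDavg = lambda * TI / 2."""
    return 2.0 * pfd_avg / ti_hours

# Hypothetical vendor-supplied PFDavg of 1.5E-4 with a one-year (8,760 h)
# functional test interval:
lam = pfd_avg_to_rate(1.5e-4, 8760.0)
print(lam)  # ≈ 3.4E-8 failures per hour
```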

6.6 The final element failure rate includes everything from the output module to the final element, including the process effects.

6.7 The Test Interval (TI) is assumed to be much shorter than the Mean Time To Failure (MTTF).

6.8 Testing and repair of components in the system are assumed to be perfect.

6.9 All SIF components have been properly specified based on the process application. For example, final elements (valves) have been selected to fail in the safe direction depending on their specific application.

6.10 Once a component has failed in one of the possible failure modes, it cannot fail again in one of the remaining failure modes until it has first been repaired. This assumption has been made to simplify the modeling effort.
NOTE 2 In real life it is, for example, possible that a component first fails dangerous and after some time fails safe.

6.11 It is assumed that when a dangerous detected failure occurs, the SIS will take the process to a safe state or plant personnel will take the necessary action to ensure the process is safe (operator response is assumed to occur before a demand, i.e., instantaneously, and the PFD of the operator response is assumed to be 0).


NOTE 3 If the action depends on plant personnel to provide safety, the user is cautioned to account for the probability of failure of personnel to perform the required function in a timely manner.

6.12 The fail-safe and fail-dangerous states are treated as absorbing states. This means that, once a component failure leads to either state, it will not be repaired. This assumption has been made to simplify the modeling effort. In real life, these states are not absorbing; in particular, the fail-safe state will be repaired relatively quickly, because entering it results in a spurious trip of the process. The assumption also means that further failures out of either state are not modeled; for example, a component failure that causes a transition from the fail-dangerous state to the fail-safe state is not modeled.

6.13 The target PFDavg and MTTFspurious are defined for each SIF implemented in the SIS.
6.14 For the first two examples the power supplies are not taken into account. The examples used in this Part assume a de-energized-to-trip system, which means that power supply failures only contribute to the fail-safe state.

6.15 The Beta model is used to treat possible common cause failures.

NOTE 4 A detailed explanation of the Beta model is given in Annex A of ISA-TR84.00.02-2009 - Part 1.

D.7 SYSTEM EQUATIONS


The following equations cover the typical configurations used in SIS designs. For the derivation of the equations listed, refer to Reference 3 or ISA-TR84.00.02 - Part 5.

Converting MTTF to failure rate λ:

(Eq. No. 2)    λDU = 1 / MTTFDU

Equations for typical configurations:

1oo1 (Eq. No. 3)

PFDavg = λDU (TI/2) + λF^D (TI/2)

where

λDU is the undetected dangerous failure rate,
λF^D is the dangerous systematic failure rate, and
TI is the time interval between manual functional tests of the component.


NOTE The equations in ISA-TR84.00.02-2009 - Part 1 model the systematic failure as an error that occurred during the specification, design, implementation, commissioning, or maintenance that resulted in the SIF component being susceptible to a random failure. Some systematic failures do not manifest themselves randomly, but exist at time 0 and remain failed throughout the mission time of the SIF. For example, if the valve actuator is specified improperly, leading to the inability to close the valve under the process pressure that occurs during the hazardous event, then the average value as shown in the above equation is not applicable. In this event, the systematic failure would be modeled using λF^D × TI. When modeling systematic failures, the reader must determine which model is more appropriate for the type of failure being assessed.

1oo2 (Eq. No. 4A)

PFDavg = ((1 - β) λDU)² (TI²/3) + (1 - β) λDU λDD MTTR TI + β λDU (TI/2) + λF^D (TI/2)

For simplification, 1 - β is generally assumed to be one, which yields conservative results. Consequently, the equation reduces to

(Eq. No. 4B)

PFDavg = (λDU)² (TI²/3) + λDU λDD MTTR TI + β λDU (TI/2) + λF^D (TI/2)

where

MTTR is the mean time to repair,
λDD is the dangerous detected failure rate, and
β is the fraction of failures that impact more than one channel of a redundant system (common cause).

The second term represents multiple failures during repair. This factor is typically negligible for short repair times (typically less than 8 hours). The third term is the common cause term. The fourth term is the systematic error term.

1oo3 (Eq. No. 5)

PFDavg = (λDU)³ (TI³/4) + (λDU)² λDD MTTR TI + β λDU (TI/2) + λF^D (TI/2)

The second term accounts for multiple failures during repair. This factor is typically negligible for short repair times. The third term is the common cause term and the fourth term is the systematic error term.

2oo2 (Eq. No. 6)

PFDavg = λDU TI + β λDU (TI/2) + λF^D (TI/2)

The second term is the common cause term and the third term is the systematic error term.

2oo3 (Eq. No. 7)

PFDavg = (λDU)² (TI)² + 3 λDU λDD MTTR TI + β λDU (TI/2) + λF^D (TI/2)


The second term in the equation represents multiple failures during repair. This factor is typically negligible for short repair times. The third term is the common cause term. The fourth term is the systematic error term. The terms in the equations representing common cause (Beta factor term) and systematic failures are typically not included in calculations performed in the process industries. These factors are usually accounted for during the design by using components based on plant experience. Common cause includes environmental factors, e.g., temperature, humidity, vibration, external events such as lightning strikes, etc. Systematic failures include calibration errors, design errors, programming errors, etc. If there is concern related to these factors, refer to ISA-TR84.00.02-2009 - Part 1 for a discussion of their impact on the PFDavg calculations. If systematic errors (functional failures) are to be included in the calculations, separate values for each sub-system, if available, may be used in the equations above. An alternate approach is to use a single value for functional failure for the entire SIF and add this term as shown in Equation 1a in 5.1.6.
NOTE Systematic failures are rarely modeled for SIF Verification calculations due to the difficulty in assessing the failure modes and effects and the lack of failure rate data for various types of systematic failure. However, these failures are extremely important and can result in significant impact to the SIF performance. For this reason, ANSI/ISA-84.01-2004, IEC 61508, and IEC 61511 provide a lifecycle process that incorporates design and installation concepts, validation and testing criteria, and management of change. This lifecycle process is intended to support the reduction in the systematic failures. SIL Verification is therefore predominantly concerned with assessing the SIS performance related to random failures.

The simplified equations, without the terms for multiple failures during repair, common cause, and systematic errors, reduce to the following for use in the procedures outlined in 5.1.1 through 5.1.4.

1oo1 (Eq. No. 3a)    PFDavg = λDU (TI/2)

1oo2 (Eq. No. 4a)    PFDavg = (λDU TI)² / 3

1oo3 (Eq. No. 5a)    PFDavg = (λDU TI)³ / 4

2oo2 (Eq. No. 6a)    PFDavg = λDU TI

2oo3 (Eq. No. 7a)    PFDavg = (λDU TI)²
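These simplified equations can be collected into small helper functions. A sketch (the function names are ours, not from the report):

```python
# Simplified PFDavg equations (Eq. Nos. 3a-7a): lam_du is the undetected
# dangerous failure rate (per hour), ti the manual test interval (hours).
def pfd_1oo1(lam_du, ti): return lam_du * ti / 2
def pfd_1oo2(lam_du, ti): return (lam_du * ti) ** 2 / 3
def pfd_1oo3(lam_du, ti): return (lam_du * ti) ** 3 / 4
def pfd_2oo2(lam_du, ti): return lam_du * ti
def pfd_2oo3(lam_du, ti): return (lam_du * ti) ** 2

# Example: lambda_DU = 2.5E-6 f/hr (a 46-year MTTF_DU, as for the valve in
# the Annex E simple example) and a one-year (8,760 h) test interval.
ti = 8760.0
lam = 2.5e-6
print(pfd_1oo1(lam, ti))  # ≈ 1.1E-2
print(pfd_2oo3(lam, ti))  # ≈ 4.8E-4
```

Note how redundancy (1oo2, 1oo3, 2oo3) drives the PFDavg down by powers of λDU·TI, while 2oo2 doubles the exposure relative to 1oo1.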

Combining component PFDs to obtain the SIF PFDavg

Once the sensor, final element, logic solver, and power supply (if applicable) portions are evaluated, the overall PFDavg for the SIF being evaluated is obtained by summing the individual contributions. The result is the PFDavg for the SIF for the event being protected against.

(Eq. No. 1a)    PFDSIS = Σ PFDSi + Σ PFDAi + Σ PFDLi + Σ PFDPSi + λF^D (TI/2)

NOTE The last term in the equation, the systematic failure term, is only used when systematic error has not been accounted for in the individual component PFDs and the user desires to include an overall value for the entire SIF.


Annex E Calculation Examples

Following are two examples of system calculations. The first is a relatively simple example of a single switch, single relay, and single valve, all with yearly testing and optimistic assumptions. The second is a more complex example incorporating redundant devices, common cause, automatic diagnostics, imperfect manual testing, and multiple diverse outputs.

Simple system evaluation with optimistic assumptions

System description:

- 1oo1 switch, λS = 3.0 E-6 f/hr (MTTFS = 38 yrs), λD = 3.0 E-6 f/hr (MTTFD = 38 yrs)
- 1oo1 relay, λS = 3.5 E-7 f/hr (MTTFS = 330 yrs), λD = 3.5 E-8 f/hr (MTTFD = 3,300 yrs)
- 1oo1 valve, λS = 2.5 E-6 f/hr (MTTFS = 46 yrs), λD = 2.5 E-6 f/hr (MTTFD = 46 yrs)
- 1 year test interval
- 100% effective (thorough) manual testing
- 24 hour repair time
- Redundant power

Figure 1: System Block Diagram (for both MTTFspurious and PFDavg Calculations)

Determine the MTTFspurious and RRF.

MTTFspurious calculation

Calculating the MTTFspurious involves adding all the safe failure rates, as implied in Figure 1; MTTF is the reciprocal of failure rate. The impact of the redundant power supply is insignificant compared to the non-redundant items and can usually be neglected in the calculation.

System MTTFspurious = 1 / λS = 1 / (1/38 yrs + 1/330 yrs + 1/46 yrs) = 20 years


RRF calculation

Risk Reduction Factor (RRF) is the reciprocal of PFDavg (average Probability of Failure on Demand). Calculating the PFDavg of the system involves calculating the PFDavg of each component and adding the results, as implied in Figure 1. Assuming no automatic diagnostics, perfect manual testing, and no redundancy, the following formula can be used:

PFDavg = λD × TI/2

where λD is the dangerous failure rate and TI is the manual test interval.

Sensor PFDavg = 1/38 yrs × (1 yr / 2) = 1.3 E-2 (RRF = 76)
Logic PFDavg  = 1/3,300 yrs × (1 yr / 2) = 1.5 E-4 (RRF = 6,600)
Valve PFDavg  = 1/46 yrs × (1 yr / 2) = 1.1 E-2 (RRF = 92)

System RRF = 1/PFDavg = 1 / (1.3 E-2 + 1.5 E-4 + 1.1 E-2) = 41 (SIL 1)

Note:

This calculation agrees with the fault tolerance tables indicating non-redundant field devices are suitable for use in SIL 1.
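The simple-system numbers above can be checked in a few lines. A sketch (ours), reusing the report's MTTF values in years:

```python
# Safe-failure MTTFs in years (switch, relay, valve) from the system description.
mttf_s = [38.0, 330.0, 46.0]
mttf_spurious = 1.0 / sum(1.0 / m for m in mttf_s)

# Dangerous MTTFs in years, 1-year test interval: PFDavg = (1/MTTF_D) * TI/2.
mttf_d = [38.0, 3300.0, 46.0]
pfd = [0.5 / m for m in mttf_d]
rrf = 1.0 / sum(pfd)

print(round(mttf_spurious))  # 20 years
print(round(rrf))            # 41 (SIL 1)
```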


Complex system evaluation with realistic assumptions

System description

- 2oo3 transmitters, λS = 1.5 E-6 f/hr (MTTFS = 76 yrs), λD = 1.5 E-6 f/hr (MTTFD = 76 yrs), 99% diagnostic coverage assumed due to comparison between transmitters
- 2oo3 safety PLC with 99% diagnostics:
  Input module λS = 2.3 E-6 f/hr (MTTFS = 50 yrs), λD = 2.3 E-6 f/hr (MTTFD = 50 yrs)
  CPU λS = 5.7 E-6 f/hr (MTTFS = 20 yrs), λD = 5.7 E-6 f/hr (MTTFD = 20 yrs)
  Output module λS = 2.3 E-6 f/hr (MTTFS = 50 yrs), λD = 2.3 E-6 f/hr (MTTFD = 50 yrs)
- 1oo2 valves with partial stroking, λS = 3.0 E-6 f/hr, λD = 3.0 E-6 f/hr, weekly partial stroking, 80% diagnostic coverage factor assumed
- 1oo1 pump (with motor), λS = 2.0 E-6 f/hr (MTTFS = 57 yrs), λD = 0.2 E-6 f/hr (MTTFD = 570 yrs)
- 5% Beta common cause factor for redundant, identical devices (field devices and PLC)
- 2 year manual test interval
- 95% effective manual testing of all components (including the safety PLC)
- 15 year life
- 24 hour repair time

Determine the MTTFspurious and RRF.

MTTFspurious calculation

Calculating the MTTFspurious involves adding the appropriate safe failure rates of all the system sub-element configurations, as implied in Figure 2; MTTF is the reciprocal of failure rate. The valves are shown in series because either valve failing safe could cause a nuisance trip (common cause is negligible in such cases and can be ignored). The impact of the redundant power supply is insignificant compared to the non-redundant items and can usually be neglected in the calculation.

Figure 2: System Block Diagram for MTTFspurious Calculation


The following formulas can be used for calculating the mean time to fail spurious of various configurations:

1oo1 MTTFspurious = 1 / λS
1oo2 MTTFspurious = 1 / (2 × λS)
2oo3 MTTFspurious = 1 / [(6 × (λS)² × MTTR) + (β × λS)]

where λS is the safe failure rate, MTTR is the mean time to repair, and β is the Beta factor.

Sensor MTTFspurious = 1 / [(6 × (1.5 E-6 f/hr)² × 24 hr) + (0.05 × 1.5 E-6 f/hr)] = 13,000,000 hrs (1,500 yrs)
Logic MTTFspurious  = 1 / [(6 × (4.6 E-6 f/hr + 5.7 E-6 f/hr)² × 24 hr) + (0.05 × 10.3 E-6 f/hr)] = 1,900,000 hrs (215 yrs)
Valve MTTFspurious  = 1 / (2 × 3.0 E-6 f/hr) = 170,000 hrs (19 yrs)
Pump MTTFspurious   = 1 / (2.0 E-6 f/hr) = 500,000 hrs (57 yrs)

System MTTFspurious = 1 / (1/1,500 yrs + 1/215 yrs + 1/19 yrs + 1/57 yrs) = 13 years

The valves represent 70% of the spurious trip contribution, the pump 23%, the PLC 6%, and the sensors 1%.
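The spurious-trip arithmetic above can be reproduced as follows (a sketch; variable names are ours, and 8,760 hours per year is assumed):

```python
MTTR, BETA = 24.0, 0.05          # hours, common cause Beta factor
HRS_PER_YR = 8760.0

def mttf_1oo1(lam_s): return 1.0 / lam_s
def mttf_1oo2(lam_s): return 1.0 / (2 * lam_s)
def mttf_2oo3(lam_s): return 1.0 / (6 * lam_s**2 * MTTR + BETA * lam_s)

sensors = mttf_2oo3(1.5e-6)      # ≈ 13,000,000 h (1,500 yrs)
logic   = mttf_2oo3(10.3e-6)     # input + CPU + output = 2.3 + 5.7 + 2.3 E-6 f/hr
valves  = mttf_1oo2(3.0e-6)      # ≈ 170,000 h (19 yrs)
pump    = mttf_1oo1(2.0e-6)      # 500,000 h (57 yrs)

# Series combination: spurious trip rates add, so MTTFs combine reciprocally.
system = 1.0 / sum(1.0 / m for m in (sensors, logic, valves, pump))
print(system / HRS_PER_YR)       # ≈ 13 years
```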


RRF calculation
Risk Reduction Factor (RRF) is the reciprocal of PFDavg (average Probability of Failure on Demand). Calculating the PFDavg of the system involves calculating the PFDavg of all the sub-component configurations and adding the results, as implied in Figure 3. The valves are now shown in parallel, including common cause, because it would take two simultaneous failures for the system to fail dangerously.

Figure 3: System Block Diagram for PFDavg Calculation

Assuming imperfect automatic diagnostics, imperfect manual testing, and common cause, the following formulas can be used:

1oo1 PFDavg = [λDD × TIA/2] + [λDU × TIM/2] + [λDN × Life/2]
1oo2 PFDavg = [((λDD)² × (TIA)²) / 3] + [((λDU)² × (TIM)²) / 3] + [((λDN)² × Life²) / 3] + [λDU × β × TIM/2]
2oo3 PFDavg = [(λDD)² × (TIA)²] + [(λDU)² × (TIM)²] + [(λDN)² × Life²] + [λDU × β × TIM/2]

where

λDD is the dangerous detected failure rate,
λDU is the dangerous undetected failure rate,
λDN is the dangerous never-detected failure rate,
TIA is the automatic test interval,
TIM is the manual test interval, and
β is the Beta factor.


Sensor PFDavg (2oo3) = [(1.5 E-6 f/hr × 0.99)² × (0.01 hr)²] + [(1.5 E-6 f/hr × 0.01 × 0.95)² × (17,500 hrs)²] + [(1.5 E-6 f/hr × 0.01 × 0.05)² × (131,000 hrs)²] + [(1.5 E-6 f/hr × 0.01 × 0.05) × (17,500 hrs / 2)]
= 2.2 E-16 + 6.2 E-8 + 9.7 E-9 + 6.6 E-6 = 6.6 E-6 (RRF = 150,000)

Safety PLC PFDavg (2oo3) = [(10.3 E-6 f/hr × 0.99)² × (0.01 hr)²] + [(10.3 E-6 f/hr × 0.01 × 0.95)² × (17,500 hrs)²] + [(10.3 E-6 f/hr × 0.01 × 0.05)² × (131,000 hrs)²] + [(10.3 E-6 f/hr × 0.01 × 0.05) × (17,500 hrs / 2)]
= 1.0 E-10 + 2.9 E-6 + 4.6 E-7 + 4.5 E-5 = 4.8 E-5 (RRF = 21,000)

Valve PFDavg (1oo2) = [((3.0 E-6 f/hr × 0.8)² × (168 hr)²) / 3] + [((3.0 E-6 f/hr × 0.2 × 0.95)² × (17,500 hrs)²) / 3] + [((3.0 E-6 f/hr × 0.2 × 0.05)² × (131,000 hrs)²) / 3] + [(3.0 E-6 f/hr × 0.2 × 0.05) × (17,500 hrs / 2)]
= 5.4 E-8 + 3.3 E-5 + 5.1 E-6 + 2.6 E-4 = 2.9 E-4 (RRF = 3,400)

Pump PFDavg (1oo1) = 0.2 E-6 f/hr × 17,500 hrs / 2 = 1.8 E-3 (RRF = 570)

System RRF = 1/PFDavg = 1 / (6.6 E-6 + 4.8 E-5 + 2.9 E-4 + 1.8 E-3) = 500 (SIL 2)
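Summing the subsystem contributions confirms the SIL 2 result. A sketch (ours), with the intermediate values taken from the sums above:

```python
# Subsystem PFDavg contributions as summed in the worked example above.
sensor_pfd = 2.2e-16 + 6.2e-8 + 9.7e-9 + 6.6e-6    # 2oo3 transmitters
plc_pfd    = 1.0e-10 + 2.9e-6 + 4.6e-7 + 4.5e-5    # 2oo3 safety PLC
valve_pfd  = 5.4e-8 + 3.3e-5 + 5.1e-6 + 2.6e-4     # 1oo2 valves
pump_pfd   = 0.2e-6 * 17500.0 / 2                   # 1oo1 pump: lambda_D * TIM/2

system_pfd = sensor_pfd + plc_pfd + valve_pfd + pump_pfd
rrf = 1.0 / system_pfd
# PFDavg falls in the SIL 2 band (1E-3 <= PFDavg < 1E-2); RRF is on the
# order of 500, dominated by the pump and valve contributions.
print(rrf)
```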

Annex F References

1. Instrumentation, Systems, and Automation Society (ISA), ANSI/ISA-84.01-1996, "Application of Safety Instrumented Systems for the Process Industries," Research Triangle Park, NC, February 1996.

2. ISA, ANSI/ISA-84.00.01-2004, "Functional Safety: Safety Instrumented Systems for the Process Industry Sector," Research Triangle Park, NC, September 2004.

3. Center for Chemical Process Safety (CCPS), Guidelines for Chemical Process Quantitative Risk Analysis, American Institute of Chemical Engineers, New York, NY, 1989.

4. CCPS, Guidelines for Process Equipment Reliability Data with Data Tables, American Institute of Chemical Engineers, 1989, ISBN 8169-0422-7.

5. CCPS, Guidelines for Safe Automation of Chemical Processes, American Institute of Chemical Engineers, New York, NY, 1993.

6. CCPS, Guidelines for Preventing Human Error in Process Safety, American Institute of Chemical Engineers, New York, NY, 1994.

7. CCPS, Guidelines for Safe and Reliable Instrumented Protective Systems, American Institute of Chemical Engineers, New York, NY, 2007.

8. Offshore Reliability Data (OREDA) Handbook, DNV, 1992, 1997, 2002, 2007.

9. Reliability Analysis Center (RAC), 1991, NSN 7540-01-280-5500.

10. MIL-HDBK-217, Reliability Prediction of Electronic Equipment.

11. Kletz, Trevor A., What Went Wrong? Case Histories of Process Plant Disasters, Gulf Publishing Company, Houston, TX, 1988.

12. Kletz, Trevor A., Learning from Disaster: How Organizations Have No Memory, Gulf Publishing Company, Houston, TX, 1993.

13. Kletz, Trevor A., An Engineer's View of Human Error, Gulf Publishing Company, Houston, TX, 1991.

14. Swain, A. D., and Guttmann, H. E., NUREG/CR-1278, Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications, 1983.

15. Gruhn, Paul, "Oops, Sorry! And Other Safety System War Stories," presented at the AIChE 2008 Spring National Meeting, 42nd Annual Loss Prevention Symposium.


Developing and promulgating sound consensus standards, recommended practices, and technical reports is one of ISA's primary goals. To achieve this goal the Standards and Practices Department relies on the technical expertise and efforts of volunteer committee members, chairmen, and reviewers. ISA is an American National Standards Institute (ANSI) accredited organization. ISA administers United States Technical Advisory Groups (USTAGs) and provides secretariat support for International Electrotechnical Commission (IEC) and International Organization for Standardization (ISO) committees that develop process measurement and control standards. To obtain additional information on the Society's standards program, please write: ISA, Attn: Standards Department, 67 Alexander Drive, P.O. Box 12277, Research Triangle Park, NC 27709. ISBN: 1-55617-802-6
