

Safety Performance vs. Cost Analysis of Redundant Architectures Used in Safety Systems

By Dr. Lawrence Beckman, HIMA-Americas, Inc.
There are several system architectures available for use in process safety applications.
These range from single channel systems to triplicated or higher redundancy
configurations. The selected architecture should first satisfy the safety requirements of
the process under control, but likewise address production and operational issues which
have a definite impact on safety, i.e., false tripping of the process. As such, high
availability is also an important consideration for safety systems. If frequent internal
failures of the safety system force the process through an excessive number of
shutdown/start-up cycles, that process is operating in its most hazardous state for
unnecessarily long periods of time. Operation under these conditions should be
minimized in the interest of safety.
There is considerable work in progress to establish standards and implementation
guidelines, both in the USA and internationally, which match the risk inherent in a given
situation to the required integrity level of the safety system. Regrettably, these efforts are not
specific to a particular type of process, but deal only with a qualitative level of risk. This
paper is intended to provide a background and further economic insight into this subject,
and explore some architectural alternatives available to the control or safety engineer,
given a state-of-the-art approach to safety system design.
Safety System Architecture
Most safety systems in the process environment are designed to shut the process down
upon detecting a hazardous state or condition. These systems are called Emergency
Shutdown (ESD) Systems, and many operate in a fail safe mode. In this mode of
operation, an internal failure of the safety system will result in a shutdown of the
process, and all repairs to this system are performed in the non-operating state. Fail safe
systems can be redundant, but may lack sufficient redundancy to be considered fault
tolerant. As such, they are subject to false trips and are not used in applications requiring
a high level of availability. Availability is defined as the percentage of time over which a
system is capable of performing its intended function during a given time period.
Efforts have been made to increase the availability of dual systems by operating them in
a mode where both channels must fail for the system to shut down the process. This
mode of operation is called the 2-out-of-2 (2oo2) mode, as opposed to the normal 1-out-
of-2 (1oo2) fail safe mode. As such, the system continues to operate on a single channel
after sustaining the first internal failure. While this configuration is certainly more
available, its integrity depends heavily on comprehensive internal diagnostics.
Fault tolerant (dual or higher) architectures are capable of sustaining reliable operation in
the presence of a fault, by providing additional levels of redundancy. The 1oo2D
configuration is an example of a dual implementation of fault tolerant architecture. It
normally operates in the 2oo2 mode, but reverts to the 1oo2 mode upon the unlikely
occurrence of an unresolved fault. Diagnostic watchdogs are provided in both channels
as a secondary means of de-energizing outputs. Either channel is capable of switching off
its outputs, as well as those of the other channel if required. As such, it is both very safe
and available. Details of this Programmable Electronic System (PES) configuration are
given in the literature (2,3). After sustaining a fault, the system will continue to operate
properly on its remaining channel, thus avoiding a process shutdown and allowing repairs
to be performed on-line. The safety integrity of the system in the presence of a fault is
not compromised for the sake of increased availability, as all internal diagnostics are
fully functional on the remaining channel. As such, fault tolerant systems found in
industrial applications are at minimum dual redundant, providing two independent
channels of redundancy.
Another inherent advantage of some redundant architectures is the ability to
communicate data between channels in order to decide which of the channels of the
system is malfunctioning. This capability significantly improves the system's ability to
diagnose faults and subsequently increases the safety integrity of the system. A system's
ability to diagnose faults is often referred to as its diagnostic coverage, where the
Coverage Factor (C) is defined as the probability of detecting a fault, given that one has
occurred. Perfect coverage would imply 100% effective self-diagnostics, which is of
course impossible.
After sustaining a fault, a triplicated 2-out-of-3 (2oo3) system can operate in either of
two modes. If a second fault occurs before repairs can be effected on the first
malfunctioning channel, the system can shut itself down immediately upon the
occurrence of the second fault; or it can revert to single channel operation. For safety
applications of a 2oo3 system, two channel operation is restricted to a short time interval;
and single channel operation is never allowed, due to the lack of comprehensive internal
diagnostics. As such, the integrity of a triplicated (TMR) system depends heavily on its
ability to vote, and consequently diagnostic coverage degrades consistent with the
number of operating channels.
The Coverage Factor of the various architectures will vary based on the quality of the
system's internal diagnostics. Triplicated architectures rely heavily on their voting
capability to implement diagnostic coverage, and as such an operational third channel is
critical. After sustaining a fault, diagnostic capability, and consequently coverage, is
substantially diminished and in some instances may be non-existent. Dual architectures
typically offer superior internal diagnostics, which are capable of diagnosing the
operational state of the entire system every scan cycle. This differs dramatically from
other architectures, which may require 30 seconds or longer to diagnose a problem (e.g., a
memory failure). Diagnostic coverage is an important consideration in evaluating covert
system unavailability ($U_C$).
Levels of redundancy beyond triplicated systems are rare in the industrial environment,
and are very difficult to justify economically considering cost versus incremental
improvement in safety integrity. Single channel systems are definitely not recommended
for critical safety applications. Please refer to the following table of PES Architectures
for further clarification.
Configuration          Operating Mode   Channels Needed to Operate   Channels Needed to Trip
1oo1                   1-0              1                            1
1oo2                   2-0              2                            1
2oo2                   2-1-0            1                            2
2oo3                   3-2-0            2                            2
1oo2D (2oo2 → 1oo2)    2-1-0            1                            1
Safety vs. Availability
Having discussed the redundant configuration options available, let us quantify their
relative performance for both safety and availability operation. The criteria are
necessarily different, and will be characterized as follows:
The Safety criterion will be the Hazard Rate (H), which is calculated as

$H = D \cdot U_C$

where D = Demand Rate (demands/yr) and $U_C$ = Covert (safety) Unavailability.
The Covert (safety) Unavailability (also referred to as fractional dead time or probability
of dangerous failure) is the probability that the system is in a failed or non-functioning
state because of a covert failure. It is this condition which represents the true hazard.
Not all covert failures are dangerous failures, but all are potentially dangerous. Thus, a
more conservative approach would require that covert and dangerous be considered
synonymous. As such, the Covert Unavailability ($U_C$) is a function of the unrevealed
system failure rate ($\lambda_C$) and the proof test interval ($T_P$).
The Availability criterion will be the False Trip Rate (F). It is a function of the revealed
failure rate ($\lambda_R$) and the repair time ($T_R$). Whether a given failure is revealed or
unrevealed (covert) depends upon the level of coverage provided by the system's
diagnostics. The repair process likewise is heavily dependent upon the system's ability to
detect a fault, as the repair time is the sum of the time to detect the fault and the time to
make the repair. In a system with a low level of diagnostic coverage, the repair time will
be extended to equal the proof test interval in most instances. Programmable Electronic
Systems designed for high integrity/safety applications have comprehensive diagnostics
and consequently high coverage factors, ranging from 97% to 99+%.
For a given coverage factor (C), the failure rates can be computed as follows:

$\lambda_R = C \lambda$ ;  $\lambda_C = (1 - C) \lambda$

where $\lambda$ is the total failure rate of the unit or module.

For a system consisting of "l" input modules and "m" output modules, with a main
processor module, the revealed failure rate of the system can be calculated as follows:

$\lambda_R^{SYSTEM} = l \, C_i \lambda_i + C_P \lambda_P + m \, C_o \lambda_o$

where l = number of input modules, m = number of output modules, C = module coverage
factor, and $\lambda$ = total module failure rate (i = input, P = processor, o = output).

The covert failure rate of this system can be calculated by substituting $(1 - C_i)$ for $C_i$,
$(1 - C_P)$ for $C_P$, etc. above.
For the system configurations of interest, we can develop a list of Covert Unavailabilities
($U_C$) and False Trip Rates (F) as follows. A discussion of these equations can be found
in Beckman (1).
System           Operating   Failures   Covert ($U_C$)               False Trip (F)
Configuration    Mode        Allowed    Unavailability               Rate
1-out-of-1       1-0         0          $\lambda_C T_P / 2$          $\lambda_R$
1-out-of-2       2-0         1          $\lambda_C^2 T_P^2 / 3$      $2 \lambda_R$
2-out-of-2       2-1-0       0          $\lambda_C T_P$              $2 \lambda_R^2 T_R$
2-out-of-3       3-2-0       1          $\lambda_C^2 T_P^2$          $6 \lambda_R^2 T_R$
1-out-of-2D      2-1-0       1          $\lambda_C^2 T_P^2 / 3$      $2 \lambda_R^2 T_R$
The basic system used in this analysis is simplex, in that it has a single set of inputs and
outputs. For the redundant configurations, all channels operate independently, the inputs
are in parallel and the outputs are arranged to provide the level of operational redundancy
required; i.e., for the 1-out-of-2 configuration the two outputs are in series. For the 2-
out-of-2 configuration, the two outputs are in parallel. In addition, the above results
were obtained assuming that coverage factors for all configurations are equal, and that all
units are tested simultaneously. Only normal mode failures were included in this
analysis.
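These expressions transcribe directly into code. The sketch below evaluates them with the example rates used later in the paper ($\lambda_C$ = 0.18/yr, $\lambda_R$ = 5.82/yr) and an assumed 10-week proof test interval and 8-hour repair time; it is a convenience check only, not part of the original analysis.

```python
# Minimal sketch of the covert-unavailability and false-trip equations tabulated
# above. lam_c / lam_r are per-channel covert and revealed failure rates (per
# year); tp / tr are the proof test interval and repair time, also in years.

CONFIGS = {
    # name: (U_C as f(lam_c, tp), F as f(lam_r, tr))
    "1oo1":  (lambda lc, tp: lc * tp / 2,        lambda lr, tr: lr),
    "1oo2":  (lambda lc, tp: (lc * tp) ** 2 / 3, lambda lr, tr: 2 * lr),
    "2oo2":  (lambda lc, tp: lc * tp,            lambda lr, tr: 2 * lr ** 2 * tr),
    "2oo3":  (lambda lc, tp: (lc * tp) ** 2,     lambda lr, tr: 6 * lr ** 2 * tr),
    "1oo2D": (lambda lc, tp: (lc * tp) ** 2 / 3, lambda lr, tr: 2 * lr ** 2 * tr),
}

lam_c, lam_r = 0.18, 5.82       # failures/yr (example values used later in the paper)
tp, tr = 10.0 / 52, 8.0 / 8760  # assumed 10-week proof test interval, 8-hour repair

for name, (u_c, f) in CONFIGS.items():
    print(f"{name:5s}  U_C = {u_c(lam_c, tp):.2e}   F = {f(lam_r, tr):.3f} /yr")
```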
Common Mode Failures
Common Mode Failure occurs when a single cause affects multiple channels of a
redundant system, usually resulting in complete system failure. Sources of common
mode failure are environmental conditions, design errors, manufacturing errors, and
operational or maintenance failures. The higher the level of redundancy, the more likely
the occurrence of this type of failure. For example, a dual system (consisting of channels
A and B) has only a single common mode failure possibility; while a triple redundant
system (Channels A, B and C) has multiple common mode failure possibilities (AB, AC,
BC and ABC). This situation is exacerbated further when multiple channels share a
common hardware platform; i.e., a common I/O module, etc.
Common Mode Failure is typically modeled using the "beta factor" method, where beta
($\beta$) represents the percentage of total failures attributable to common mode failure; i.e.,

$\beta = \lambda_{CM} / (\lambda_{CM} + \lambda_{NM})$

where total failures include both common mode and normal mode failures.
Usually this fraction is in the range of 5-15%, but can be smaller based on operational
experience. Necessarily, it is only a reasonable estimate. However, depending upon the
importance placed on this type of failure, the resulting system reliability will be
significantly altered.
Considering two of the redundant architectures discussed, the occurrence of a covert
common mode failure in either the 1-out-of-2/1-out-of-2D or the 2-out-of-3 system
configuration results in a fail-to-function situation. This result is the same irrespective of
the architecture. However, the susceptibility is lower by a factor of three for the dual
architecture. Incorporating covert common mode failure into their corresponding
reliability models yields the following modified equations for Covert Unavailability ($U_C$):

System Configuration         Covert Unavailability ($U_C$)
                             Common Mode              Normal Mode
1-out-of-2 or 1-out-of-2D    $\lambda_{CC} T_P / 3$   $\lambda_{CN}^2 T_P^2 / 3$
2-out-of-3                   $\lambda_{CC} T_P$       $\lambda_{CN}^2 T_P^2$

where $U_C$ = $U_C$(Common Mode Failure) + $U_C$(Normal Mode Failure)

$\lambda_{CC} = \beta \lambda / 2$  (assuming 50% of common mode failures are covert)

$\lambda_{CN} = (1 - \beta)(1 - C) \lambda$
The net effect is equivalent to placing a simplex (non-redundant) element in series with
the redundant architecture for both the 1-out-of-2 and 2-out-of-3 system configurations
considered. Depending upon a reasonable estimate of the beta factor and the resulting
common mode dangerous failure rate, the Common Mode term can completely dominate
the computation, rendering Normal Mode failures insignificant for higher levels of
redundancy. In practice, this is typically not the case; and care should be taken to keep
common mode failure in perspective. However, it certainly should not be ignored in
critical safety evaluations. In the economic analysis that follows, common mode failures
have not been included, as they are outside the scope of this paper. A comprehensive
discussion is provided in the literature (2,4,5).
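For concreteness, the $\beta$-factor relations above can be evaluated with assumed numbers ($\beta$ = 0.10, a total module failure rate of 6 failures/yr, and C = 0.97). The sketch below is illustrative only and, as noted, common mode failures are not carried into the economic analysis that follows.

```python
# Illustrative beta-factor split, using the relations above with assumed values.

beta, lam, cov = 0.10, 6.0, 0.97             # assumed beta, failures/yr, coverage

lam_cc = beta * lam / 2.0                    # covert common-mode rate (50% assumed covert)
lam_cn = (1.0 - beta) * (1.0 - cov) * lam    # covert normal-mode rate

print(f"lambda_CC = {lam_cc:.2f} /yr   lambda_CN = {lam_cn:.3f} /yr")
```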
System Integrity
In a hazardous process environment there are typically two types of systems in operation:
the control system and the safety or protective system. The two systems should be totally
independent of each other.
The purpose of the safety system is to protect against the process hazard, while
preventing plant shutdowns due to false trips. The safety system is typically dormant for
extended periods of time and susceptible to functional failures, which are generally
unrevealed failures. Given less than perfect diagnostic coverage, the internal diagnostics
of today's programmable safety systems will not detect 100% of all possible failure
conditions. As such, it is necessary to conduct periodic proof testing to detect such
undiagnosed failures. Proof testing is, however, not a substitute for comprehensive internal
diagnostics. This testing should be made as quick and simple as possible in order to give
the maximum system availability, while reducing the possibility of human error. Repair
should be able to be performed while the system is operating. No advantage is gained if
the safety system or the process has to be stopped to rectify any faults found.
The time interval between proof testing is of great concern, as the potential for human
error while conducting the proof test is significant. Considerations which affect the
choice of proof test interval are as follows:
1) System redundancy and the coverage factor of the internal diagnostics.
2) Potential for human error due to complexity of the test/repair process.
3) The time required to perform the necessary testing and repair.
During the test period, the system (or some portion thereof) is under test and unavailable
to perform its intended safety function. As such, proof testing too frequently increases
the unavailability of the safety system and the probability of human error. On the other
hand, infrequent testing increases the risk of developing undiagnosed faults, particularly
in systems with a low level of diagnostic coverage.
As stated earlier, the purpose of the proof test is to improve the reliability of the safety
system. The objective is to minimize the safety unavailability of the system while
conducting the required periodic testing to maintain system integrity. The selection of
the optimum proof test interval based on minimizing safety unavailability is critical.
Consider the following equation for total system Unavailability ($U_{TOTAL}$), including field
devices:

$U_{TOTAL} = U_C + U_T + U_{FD} + U_E$

where
$U_C$ = covert (safety) unavailability due to unrevealed system failures
$U_T$ = unavailability resulting from proof testing
$U_{FD}$ = covert (safety) unavailability due to unrevealed field device failures
$U_E$ = unavailability resulting from human error (system isolation, i.e., bypass not restored)

It is desired to minimize $U_{TOTAL}$ with respect to $T_P$ for a given configuration of the
system and field devices. The resulting optimum proof test interval is $T_{P_{MIN}}$. A
derivation of this methodology can be found in Beckman (1).
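A simplified numerical version of this minimization is sketched below for a 1-out-of-1 channel, keeping only the covert and proof-testing terms of $U_{TOTAL}$ (the full model also carries the field-device and human-error terms). With the example values used later in the paper (0.18 covert failures/yr, a 4-hour test time), it lands near the 3.7-week optimum reported in the results table.

```python
# Simplified sketch of the proof-test-interval optimization for a 1oo1 channel,
# keeping only the covert term and the testing term U_T = T_D / T_P.

def u_total(tp_hours, lam_c_per_yr=0.18, test_time_hours=4.0):
    lam_c = lam_c_per_yr / 8760.0        # covert failure rate per hour
    return lam_c * tp_hours / 2.0 + test_time_hours / tp_hours

# Coarse grid search over proof test intervals from 1 day to 1 year.
best_tp = min(range(24, 24 * 365), key=u_total)
print(f"Optimum T_P ~ {best_tp / 168:.1f} weeks")   # ~3.7 weeks for these inputs
```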
Testing and Repair
The safety system design should facilitate maintenance of both the safety system and
associated field devices. The system itself should give the maintenance technician a
clear, visual indication of the fault, so that repair can proceed with absolute certainty,
thereby reducing the possibility of human error. Repair procedures should be simple
and straightforward to allow fast, easy repair and keep the repair time as short as
possible (low MTTR). In addition, provisions should be made to simplify the by-passing
of field devices for purposes of proof testing, calibration and maintenance.
Many redundant systems based on traditional PLCs are not fully integrated, and are
consequently difficult to test and repair. Redundant implementations which require the
maintenance technician to diagnose complex problems, perform difficult repair
procedures, or reload the application program as part of the repair process are prone to
human error, and will at the least contribute to false or nuisance trips of the process. At
worst, incomplete or inadequate repair could result in a catastrophic failure of the safety
system. Steps should be taken to minimize the occurrence of human error during testing
and maintenance of the safety system.
In determining the optimum proof test interval, consideration should also be given to the
potential for human error. Under ideal conditions, the human error rate is estimated to be
1 in 100. However, most process testing conditions are far from ideal, and as such this
rate will be substantially higher. Measures can be taken in the safety system design to
mitigate this situation, but the potential for human error both while conducting the testing
and required repair must be considered. The "Human Error" failure rate far exceeds that
of other safety system components such as sensors, actuators, etc. As such, it represents
the largest potential cause for operational failure of the safety system.
Economic Model
Given the above, it is now possible to construct an economic model for the system
configurations of interest. This analysis will focus on the safety system itself, and as
such will not include the associated field devices.
The link between Safety and Availability is becoming significantly stronger, as industry
recognizes that cycling processes up and down inherently has safety implications, in
addition to the cost associated with lost production. Hence, availability is now
considered a key factor in safety system design and operation. Considering the above,
the model includes three terms as follows:
1) Hazardous failures
2) False or Nuisance Trips
3) Periodic Proof Testing
The focus of the model is on Total Safety Cost, and consequently does not include the
initial cost of system hardware, integration, programming or maintenance. One could
safely assume that these costs would be in proportion to the selected level of redundancy,
with triplication being the most expensive. These costs, even when amortized over the
life cycle of the system, differ from one configuration to another; but are mostly fixed
compared to the Total Safety Cost. The model utilizes an operating period of one (1)
year, and a proof testing interval that is optimized for each configuration considered.
Given these conditions, the Safety Cost model is

$C_{TOTAL}(\$) = H \cdot C_H + F \cdot C_F + C_T / T_{P_{MIN}}$

where
$C_{TOTAL}(\$)$ = Total Annual Safety Cost
H = Hazard Rate
F = False Trip Rate
$C_H$ = Hazard Cost ($)
$C_F$ = Nuisance Trip Cost ($)
$C_T$ = Proof Testing Cost ($)
$T_{P_{MIN}}$ = Optimum Proof Test Interval

Please note that $T_{P_{MIN}}$ is also used in computing the Safety Unavailability and
consequently the Hazard Rate.
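Transcribed literally, the model is a one-line function. The sketch below simply assumes the caller supplies the rates per year, the optimum test interval in years, and the per-occurrence costs in dollars.

```python
# Direct transcription of the cost model above. H and F are in events per year,
# the optimum proof test interval is in years, and costs are dollars per occurrence.

def total_safety_cost(hazard_rate, trip_rate, tp_min_years,
                      hazard_cost, trip_cost, test_cost):
    """C_TOTAL($) = H*C_H + F*C_F + C_T / T_P(min)."""
    return (hazard_rate * hazard_cost
            + trip_rate * trip_cost
            + test_cost / tp_min_years)

# Example with the 1oo2D figures from the table that follows:
# total_safety_cost(1.74e-3, 0.062, 14.4/52, 500_000, 50_000, 2_000) -> ~11,200
```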
Using the following values for Coverage Factor (C), Covert Failure Rate ($\lambda_C$), Demand
Rate (D), Mean Time to Repair ($T_R$), and Test Time ($T_D$), we compute the following
for the configurations of interest, using the equations for Covert Unavailability and False
Trip Rate given in the listing:

C = 0.97,  $\lambda_C$ = 0.18 failures/yr,  D = 0.5 demands/yr,  $T_R$ = 8 hrs,  $T_D$ = 4 hrs
Configuration    $T_{P_{MIN}}$ (weeks)   $U_{TOTAL}$   H (per year)   F (per year)   $C_{TOTAL}$ ($)
1-out-of-1       3.7                     1.38x10^-2    6.91x10^-3     5.82           322,534
1-out-of-2       14.4                    3.48x10^-3    1.74x10^-3     11.64          590,103
2-out-of-2       2.6                     1.91x10^-2    9.56x10^-3     0.062          47,585
2-out-of-3       10.0                    4.57x10^-3    2.28x10^-3     0.186          20,855
1-out-of-2D      14.4                    3.48x10^-3    1.74x10^-3     0.062          11,196
$C_{TOTAL}(\$)$ was calculated using the following associated costs (per occurrence):

Hazard Cost ($C_H$) = $500,000
Nuisance Trip Cost ($C_F$) = $50,000
Proof Testing Cost ($C_T$) = $2,000
The lowest cost was achieved by the 1-out-of-2D configuration, where both hazardous
failures and nuisance trips were virtually eliminated. The 2-out-of-3 configuration
finished second, due to increased costs associated with proof testing and nuisance trips.
Note that the 1-out-of-2/1-out-of-2D configuration also had the longest proof test
interval, and that the 2-out-of-2 configuration was the least safe, actually less safe than
the 1-out-of-1 (simplex) configuration.
In addition, no attempt was made to account for the following in the model:
1) Increase in Covert Unavailability ($U_C$) due to human error associated with more
frequent proof testing.
2) Increase in Demand Rate (D) due to more frequent false trips and process start-ups.
Including these effects would have further biased the results in favor of the higher
integrity configurations.
Effects of Coverage Factors
It would now be of interest to investigate the effect of the Coverage Factor (C) on the
economic model for the configurations of interest. We will use the same model for
$C_{TOTAL}(\$)$, keeping all parameters constant with the exception of the Coverage Factor,
and the resulting covert and revealed system failure rates. Based on this analysis, we
compute the following dollar values for $C_{TOTAL}(\$)$ on an annual basis.

$C_{TOTAL}(\$)$
Configuration    C = 0.98 ($\lambda_C$ = 0.12)   C = 0.90 ($\lambda_C$ = 0.6)   C = 0.75 ($\lambda_C$ = 1.5)
1-out-of-1       319,793                         327,366                        315,559
1-out-of-2       594,243                         557,772                        482,527
2-out-of-2       39,531                          83,678                         129,815
2-out-of-3       18,365                          33,510                         52,349
1-out-of-2D      9,400                           20,435                         34,376
The results are interesting in that two distinct effects are observed. $C_{TOTAL}(\$)$ actually
decreases as C decreases for the two configurations (1-out-of-X) which are prone to
false trips. Correspondingly, $C_{TOTAL}(\$)$ increases as the coverage factor decreases for the
2-out-of-X configurations and the 1-out-of-2D configuration (which are less prone to
false trips), because of a dramatic increase in the Hazard Rate. As these are the most
likely configurations to be utilized, a decrease in the coverage factor represents a
significant increase in the probability of a hazard, and its inherent financial
consequences. Please refer to Figure 1 for a summary of these results.
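The covert failure rates in the column headings follow directly from $\lambda_C = (1 - C)\lambda$, using the total failure rate of roughly 6 failures/yr implied by the base case (C = 0.97, $\lambda_C$ = 0.18); a one-line check:

```python
# Covert failure rates for the coverage factors considered, assuming the
# base-case total failure rate of about 6 failures/yr implied above.

lam_total = 6.0
for cov in (0.98, 0.97, 0.90, 0.75):
    print(f"C = {cov:.2f}  ->  lambda_C = {(1 - cov) * lam_total:.2f} failures/yr")
```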
It is also interesting to note that a small increase in the coverage factor (which implies a
corresponding decrease in the covert failure rate) resulting from more comprehensive
internal diagnostics, voting, etc. will substantially reduce the Hazard Rate in all cases,
and consequently the Total Safety Cost for those configurations least affected by false
trips. The optimum Proof Test Interval ($T_{P_{MIN}}$) likewise increases as the overall integrity
of the system improves. This effect could also be achieved by reducing the total failure
rates of the individual modules which comprise the overall system configuration.
Conclusions
An economic analysis of the safety system must account for both process safety and
availability. A model was constructed which included Hazardous Failures, False or
Nuisance Trips and lastly Periodic Proof Testing required to maintain system integrity
for the system configurations of interest. Hazardous failures were computed based on the
Total Safety Unavailability of the system. Human error is a significant contributor to
Safety Unavailability, and steps to minimize the probability of occurrence should be
employed in the integration, testing, and repair of the system.
The analysis indicates that the use of either the 1-out-of-1 or 1-out-of-2 configuration is
not economically feasible, although the 1-out-of-2 configuration is quite safe. It is best
suited for fail-safe applications where loss of production is not a consideration.
The 2-out-of-2 configuration is the least safe, and should be utilized only where safety is
not the primary consideration. The 2-out-of-3 and 1-out-of-2D configurations are
economically advantaged, in that they satisfy both safety and availability requirements,
thus minimizing the Total Safety Cost on an annual basis. However, the 1-out-of-2D
configuration is superior in both safety performance and cost. Including common mode
failure in the economic model would further reinforce this result.
The importance of having comprehensive diagnostics and consequently a high coverage
factor in the safety system cannot be overemphasized. Improving coverage has a
dramatic effect on increasing reliability and safety system integrity, and on reducing Total
Safety Cost. Proof testing should be used to complement a system's internal diagnostics,
and not as a substitute for inadequate diagnostics. Frequent proof testing and complex
repair procedures increase the probability of human error, and should always be avoided.
Given the above, a safety analysis should be performed both prior to design and again
after installation to determine whether the system achieves the Safety Integrity Level (SIL)
required by the Process Hazard Analysis (PHA). This analysis can likewise establish the
proper selection of the system architecture to satisfy economic criteria, and the tangible
performance of the system as regards the mitigation of hazards which can lead to
significant economic, safety, and environmental consequences.
References

(1) Beckman, L.V., "Optimum Proof Testing of Programmable Safety Systems," Hydrocarbon Processing, November 1992.
(2) Bukowski, J.V. and W.M. Goble, "Comparing Control Systems Reliability: Architecture, Diagnostics, and Common Cause," Proceedings of the ISA/94 Conference and Exhibit, ISA, 1994.
(3) IEC 1508 (Draft), Part 2: Requirements for Electrical/Electronic/Programmable Electronic Systems; Part 6: Guidelines on the Application of Parts 2 and 3.
(4) Bourne, A.J., et al., "Defenses against Common-Mode Failures in Redundant Systems," Safety and Reliability Directorate, UK AEA, January 1981.
(5) Freeman, Raymond A., "Reliability of Interlocking Systems," Process Safety Progress, Vol. 13, No. 3, July 1994, p. 146.
Figure 1: Total Annual Safety Cost (dollars) for the 1oo1, 1oo2, 2oo2, 2oo3, and 1oo2D configurations at coverage factors C = 0.98, 0.97, 0.90, and 0.75.