
2.6 Redundant or Voting Systems for Increased Reliability


I. H. GIBSON

As the old adage has it, a chain is no stronger than its weakest
link. This is demonstrably true in control systems design,
where statistical theory is used to predict the behavior of
complex systems. When a single sensor/logic/effector chain
has inadequate reliability, redundant systems can be brought
into play to reduce the probability of failure on demand or
to reduce the probability of false tripping or to yield both
functions.
There are several approaches available, and the correct
choice of method depends on the relative sensitivity of the
systems to the two failure modes, in economic and human-factor risk, and on the cost and practicality of proof testing.
I/O REDUNDANCY AND VOTING
When assembling a high-reliability system with current-generation equipment, the logic solver is the most reliable component of the chain, as it will normally be provided with a
high level of internal diagnostics, and redundant input/output
(I/O). The vast majority of faults in logic systems come from
faults in functional specification, where the equipment performs in accordance with the designer's stated requirement,
but the concept itself is faulty.
The field sensor(s) will be, for preference, analog devices
rather than switches, as the latter, by their very nature, provide little facility for online fault detection.
The least reliable components are normally the effectors,
the final control elements. IEC 61508 usefully divides safety-related systems into two categories: demand-type systems,
where proof testing is practical at a frequency significantly
higher than the anticipated frequency of demand, and continuous-type systems where this is not so.
Demand-type systems are then rated to PFDave, the average probability of failure on demand over the period between
successive proof tests. It should be noted that the PFDave tends
to deteriorate between successive proof tests, as no proof test
can enable the return of the system to its original reliability
rating, because it can detect only some fraction of hidden
faults. Continuous rated systems are more difficult to build
to equivalent levels of functionality, because of the difficulty
involved in demonstrating the absence of concealed faults. A
typical example would be automotive ABS braking systems,
where the demand rate can be several times per minute. This
is also a reason process control systems are not rated as safety
integrity level 1 (SIL 1) or higher without specific certification.

© 2002 by Béla G. Lipták

Redundancy is the term used for multiple devices performing the same function. Voting is the term used to describe
how the information from redundant devices is applied. One
can have redundant devices without voting, or with manual
voting if an operator can check the output of the redundant
devices and decide which output is to perform the required
function. A simple example is the traditional temperature measurement installation, where a bimetallic thermometer is
installed in one thermowell, and a duplex thermocouple element is installed in a second well, with one element connected to an alarm/shutdown system, and the other to the
process control system. Here the thermometer can provide
local operator indication, and the two thermocouple elements
offer redundant information to the control room operator.
This system also demonstrates one of the weaknesses of
redundant systems: common-mode failure. Both of the
duplex elements are identical, and are closely coupled
mechanically and electrically. Drift in calibration (such as
from manganese migration between sheaths and elements),
phase change hysteresis, or straight mechanical damage can
produce similar effects in both components. A slip during
maintenance can easily take out both measurements at once
and it is impossible to repair one element without taking the
other out of service.
NooM TERMINOLOGY
N out of M redundancy (commonly written as NooM, e.g.,
2oo3) is a method to increase reliability. The terminology
denotes the number of coincident detected trip signals
required for the system to trip. The term NooMD is often
used to denote a system with inherent diagnostic capability,
which can detect and take action on some defined percentage
of otherwise unrevealed failures.
Table 2.6a shows the effect of adding parallel, identical
quality information to a logic system in the absence of common
mode effects. In the equations in Table 2.6a, TI denotes the
time interval between proof tests, MTTR is the mean time to
repair, $\lambda^{DU} = 1/\mathrm{MTTF}^{DU}$ is the failure rate for dangerous
undetected faults, and $\lambda^{S} = 1/\mathrm{MTTF}^{S}$ is the spurious trip rate.

TABLE 2.6a
Redundancy and Probability of Failure

Redundancy   Average Probability of Failure on Demand   Probability of False Tripping
1oo1         $\lambda^{DU} \, TI / 2$                    $\lambda^{S}$
1oo2         $(\lambda^{DU} \, TI)^{2} / 3$              $2\lambda^{S}$
2oo2         $\lambda^{DU} \, TI$                        $2(\lambda^{S})^{2} \, MTTR$
2oo3         $(\lambda^{DU} \, TI)^{2}$                  $6(\lambda^{S})^{2} \, MTTR$
1oo3         $(\lambda^{DU} \, TI)^{3} / 4$              $3\lambda^{S}$

TABLE 2.6b
Estimated Failure Rates for Siemens Moore 345 Transmitter

Type                                  Output Response                                        Failures per $10^9$ hours
DD: Dangerous detected                Short/fail overrange (FO), output ≥ 21 mA              17.5
                                      Open/fail underrange (FU), output ≤ 3.6 mA             162.7
                                      Failure (3.7 mA)                                       295.4
                                      Fail-safe reaction (FO, FU, or 3.7 mA)                 475.6
SD: Safe detected                     Fail-safe reaction to 3.7 mA                           174.1
SU: Safe undetected                   Output normal                                          86.5
DU: Dangerous undetected              Bad output (readback outside 2% of calculated value)   29.2
AU: Diagnostic annunciation failure   None                                                   94.2

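The simplified equations of Table 2.6a can be evaluated directly to compare architectures. The following is a minimal sketch only: the function names, the one-year proof-test interval, the 8-hour repair time, and the spurious-trip rate are assumptions made for illustration, and common-mode effects are ignored, as in the table. The dangerous undetected rate is the DU figure quoted in Table 2.6b.

```python
# Simplified redundancy equations in the form of Table 2.6a.
# Assumed for illustration (not from the source): function names,
# LAMBDA_S, TI, and MTTR. Common-mode effects are ignored.

def pfd_avg(architecture: str, lambda_du: float, ti: float) -> float:
    """Average probability of failure on demand.

    lambda_du: dangerous undetected failure rate (failures per hour)
    ti: interval between proof tests (hours)
    """
    x = lambda_du * ti
    formulas = {
        "1oo1": x / 2,
        "1oo2": x ** 2 / 3,
        "2oo2": x,
        "2oo3": x ** 2,
        "1oo3": x ** 3 / 4,
    }
    return formulas[architecture]


def spurious_trip_rate(architecture: str, lambda_s: float, mttr: float) -> float:
    """False-trip rate (per hour) for a spurious rate lambda_s and repair time mttr."""
    formulas = {
        "1oo1": lambda_s,
        "1oo2": 2 * lambda_s,
        "2oo2": 2 * lambda_s ** 2 * mttr,
        "2oo3": 6 * lambda_s ** 2 * mttr,
        "1oo3": 3 * lambda_s,
    }
    return formulas[architecture]


LAMBDA_DU = 29.2e-9  # per hour; DU rate quoted in Table 2.6b
LAMBDA_S = 1.0e-6    # per hour; hypothetical spurious-trip rate
TI = 8760.0          # hours; assumed one-year proof-test interval
MTTR = 8.0           # hours; assumed mean time to repair

for arch in ("1oo1", "1oo2", "2oo2", "2oo3", "1oo3"):
    print(f"{arch}: PFDavg = {pfd_avg(arch, LAMBDA_DU, TI):.3e}, "
          f"false trips per hour = {spurious_trip_rate(arch, LAMBDA_S, MTTR):.3e}")
```

With these inputs, 1oo2 and 2oo3 voting improve the probability of failure on demand relative to 1oo1, while 2oo2 and 2oo3 voting reduce the false-trip rate. This is the trade-off between the two failure modes that drives the choice of architecture.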

As far as practical, redundant measurements should utilize differing technologies, both hardware and software, to
minimize common mode effects, but this needs to be balanced against the innate probabilities of failure of the equipment. There is no point in using different technology if one
of the devices is appreciably less reliable than the other.
Common-mode failures and systematic faults add separate terms to both probabilities, limiting the effect of redundancy in reducing the effects of individual failure.
The probability of failure on demand of the chain is, to a close
approximation when the individual probabilities are small, the sum
of the probabilities of failure of each series component in the system.
This means that the chain is always poorer than its least reliable
component.
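As a rough illustration of the series rule, the sketch below sums hypothetical component figures and maps the result to the demand-mode SIL bands of IEC 61508. The component values and function names are invented for the example and do not come from any data source cited here.

```python
# Series chain: the PFD of sensor, logic solver, and final element add up.
# All component values are hypothetical, chosen only to illustrate the rule.

def chain_pfd(component_pfds: dict) -> float:
    """Approximate PFD of a series chain as the sum of its component PFDs.

    The plain sum is a good approximation only while every PFD is small.
    """
    return sum(component_pfds.values())


def sil_band(pfd: float) -> int:
    """Demand-mode SIL band per IEC 61508 (0 means worse than SIL 1)."""
    if pfd >= 1e-1:
        return 0
    if pfd >= 1e-2:
        return 1
    if pfd >= 1e-3:
        return 2
    if pfd >= 1e-4:
        return 3
    return 4


components = {
    "sensor": 2e-3,          # hypothetical
    "logic solver": 1e-4,    # hypothetical
    "shutdown valve": 2e-2,  # hypothetical
}

total = chain_pfd(components)
weakest = max(components, key=components.get)
print(f"chain PFD = {total:.4f} (SIL {sil_band(total)}); weakest link: {weakest}")
```

In this example the final element dominates the chain, so improving the sensor or logic solver alone would leave the overall rating unchanged.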
The discussion above is deliberately simplified. More
precise methods take into account common-mode failures,
the probability of further failures during the repair period,
and systematic errors. The effect of online diagnostic coverage
is also important in many cases.
Several approaches to these calculations can be found in
the ISA TR84.0.02 series, and in IEC 61508 Part 6. Computer
programs that analyze complex systems are readily available
from several sources, such as exida.com and Honeywell. In
all cases, however, the most difficult part of the calculation
is obtaining accurate failure rate data for the components.
Some indicative figures are to be found in ISA TR84.0.02
Part 1, which were derived from the records of several major
chemical plants. The variation between these sets is appreciable. Other data can be found in OREDA, which covers
most of the oil platforms in the North Sea. In all cases, it
must be borne in mind that these reflect previous generations
of equipment, and all manufacturers have been working
toward higher reliability.
With newly designed equipment, historical data cannot
be relied on and analytical techniques such as FMEDA (Failure Modes, Effects and Diagnostic Analysis) are applied by
the manufacturers to provide estimates of the frequencies of
the various types of potential failures.
A published example of such data for the Siemens-Moore
Critical Transmitter is shown in Tables 2.6b and 2.6c.


TABLE 2.6c
Alternative Format for Table 2.6b Data

Parameter   Value         Remarks
MTTF        147.7 years   Assume constant failure rate, where $\mathrm{MTTF} = 1/(\lambda^{DD} + \lambda^{DU} + \lambda^{SD} + \lambda^{AU})$
MTTFD       226 years     Assume constant failure rate, where $\mathrm{MTTFD} = 1/(\lambda^{DD} + \lambda^{DU})$
$C^{D}$     94.2%         $C^{D} = \lambda^{DD}/(\lambda^{DD} + \lambda^{DU})$

MTTF: mean time to failure; MTTFD: mean time to fail dangerously; $C^{D}$: dangerous coverage factor.

The failure rates in Table 2.6b are for an ambient temperature of 40°C
and are in units of failures per $10^9$ hours. As Siemens-Moore notes,
rates may vary considerably from site to site depending on chemical,
environmental, and mechanical stress factors.
These figures are indicative of the upper end of current
transmitter technology.
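The Table 2.6c values follow from the Table 2.6b rates by simple arithmetic, which the sketch below reproduces. It assumes 8760 hours per year for the unit conversion, which the source does not state; the variable and function names are illustrative.

```python
# Reproduce the Table 2.6c parameters from the Table 2.6b failure rates.
HOURS_PER_YEAR = 8760.0  # assumed conversion factor; not stated in the source

# Failure rates from Table 2.6b, in failures per 1e9 hours.
rates = {"DD": 475.6, "DU": 29.2, "SD": 174.1, "AU": 94.2}


def mttf_years(*failure_rates: float) -> float:
    """Mean time to failure in years for a constant total failure rate."""
    total_per_hour = sum(failure_rates) * 1e-9
    return 1.0 / total_per_hour / HOURS_PER_YEAR


mttf = mttf_years(rates["DD"], rates["DU"], rates["SD"], rates["AU"])
mttf_d = mttf_years(rates["DD"], rates["DU"])
coverage = rates["DD"] / (rates["DD"] + rates["DU"])

print(f"MTTF  = {mttf:.1f} years")
print(f"MTTFD = {mttf_d:.1f} years")
print(f"C^D   = {coverage * 100:.1f}%")
```

The results round to the tabulated 147.7 years, 226 years, and 94.2%.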

VOTING AS A CONTROL STRATEGY


Voting of analog information can be useful in control as well
as in safety systems. Many modern control systems carry a
quality bit from each input through their calculations, and
this can be used to switch calculations on failure of input.
One of the more useful tools is Median Signal Select. The median
of an odd number of signals is the value that has an equal number
of signals larger and smaller than itself. Thus, if we consider
three transmitters measuring a common variable, with measurements
25, 26, and 30%, the median signal is 26%. The median is less
sensitive to gross errors than the mean or average, and will
allow a system to continue operating in reasonable control if
any one device fails either high or low. The median of three
signals can be computed by selecting the highest of the outputs
from three low signal selects. Thus, for signals A, B, and C,

MEDIAN(A, B, C) = MAX(X, Y, Z),

where X = MIN(A, B), Y = MIN(B, C), and Z = MIN(C, A).
The same result can be obtained from the lowest of three
high selects. See Figure 2.6d.
For an even number of inputs, greater than three, the
median is taken as the average of the two most central points.
Care must be taken in using median signal selection with
a derivative action controller, as the switching between
transmitters must cause a step change in the rate of change of the
measurement.
With three identically ranged transmitters, one can
extract redundant data to determine high alarm, low alarm,
and control measurement. The median signal becomes the
control measurement, the highest good-quality signal gives
the high alarm, and the lowest good-quality signal gives the
low alarm. Good quality is normally taken as between 0
and 100%, but this may be narrowed considerably by using
other information, such as being less than 3 standard deviations
away from the historical mean of all signals, or extracted
from the on-board diagnostic messages from a smart transmitter.

NAMUR DIAGNOSTIC LEVELS


The German NAMUR Empfehlung NE-43 defines a set of
diagnostic signal levels that can be handled within an analog
4-20 mA system. These are given in Table 2.6e.
It should be noted that this requires the receiving device
to be able to interpret the full 0 to 22 mA signal range, and
not clip either top or bottom. Many PLC (programmable logic
controller) systems have A/D converters that will not read
above 20 mA, and some will not read below 4 mA.
With the growing application of digital signal transmission
techniques, it is now practical to monitor both the process and
the input and output control devices for developing problems.
These techniques range from HART, which superimposes a digital
data train on a 4-20 mA carrier without changing the average
value of the 4-20 mA signal, through to PROFIBUS PA or Foundation
Fieldbus (FF), which can provide a mass of diagnostic data from
the field equipment to a properly configured control system. Such
monitoring can offer advantages in availability by, in effect,
reducing the proof-test interval to an individual
measurement/transmit cycle.
This is only useful if the information is actually used,
and the application of Asset Management Systems in association with information extracted from the process control
measurements is a growing field.

FIG. 2.6d
Median signal select. (Plot titled "Median of Three Inputs": traces A, B, C, and Z versus time.)

TABLE 2.6e
NAMUR Transmitter Signal Ranges for Nominal 4-20 mA Signals

Signal Condition                          Current Range (mA)
Short circuit or transducer failed high   Output ≥ 21.0
Overrange                                 20.5 ≥ Output > 20.0
Maximum scale                             20.0
Minimum scale                             4.0
Underrange                                4.0 > Output ≥ 3.8
Transducer detected failure (low)         3.8 > Output > 3.6 (3.7 nominal)
Open circuit or transducer failed low     3.6 ≥ Output
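The NE-43 bands of Table 2.6e can serve as the "good quality" test for the median, high, and low selection described under Voting as a Control Strategy. The sketch below is one possible arrangement, not a vendor implementation: the function names and status labels are hypothetical, the overrange band is taken as above 20.0 up to 20.5 mA, currents between 20.5 and 21.0 mA (which the table does not assign) are treated as not good, and the fallback to a plain median when fewer than three signals are good is an assumption.

```python
import statistics

# Statuses that still carry measurement information (assumed per Table 2.6e).
GOOD = {"normal", "underrange", "overrange"}


def ne43_status(ma: float) -> str:
    """Classify a loop current in mA against the Table 2.6e bands."""
    if ma >= 21.0:
        return "failed high"
    if ma <= 3.6:
        return "failed low"
    if ma < 3.8:
        return "detected failure"
    if ma > 20.5:
        return "unassigned"  # 20.5 to 21.0 mA is not defined in the table
    if ma > 20.0:
        return "overrange"
    if ma < 4.0:
        return "underrange"
    return "normal"


def median3(a: float, b: float, c: float) -> float:
    """Median of three as the highest of three low selects."""
    return max(min(a, b), min(b, c), min(c, a))


def select_signals(readings_ma):
    """Return control, high-alarm, and low-alarm signals from good inputs."""
    good = [ma for ma in readings_ma if ne43_status(ma) in GOOD]
    if not good:
        return None  # no usable measurement left
    if len(good) == 3:
        control = median3(*good)
    else:
        control = statistics.median(good)  # assumed degraded-mode fallback
    return {"control": control, "high": max(good), "low": min(good)}


# 25, 26, and 30% of span on a 4-20 mA scale
print(select_signals([8.0, 8.16, 8.8]))
# third transmitter failed high: selection continues on the remaining two
print(select_signals([8.0, 8.16, 21.5]))
```

A receiving device that clips its input at 4 or 20 mA could not make these distinctions, which is why the full signal range has to be readable.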

FIG. 2.6f
Siemens-Moore 345 XTC Critical Transmitter block diagram. Redrawn from the Siemens-Moore 345 Instruction Manual. (Block labels: Dual Element MycroSENSOR; ASIC Linearization and Compensation; Microprocessor A/D; D/A; Detects Known Sensor Failure Modes; Output 1; Output 2; Comparator; Diagnostic Circuit; Set Output to Failsafe Low; Verify Out; Transmitter Output.)

A recent development in process measurement technology has been the appearance of SIL 2 rated pressure and
differential pressure transmitters from at least two suppliers
(Siemens-Moore XTC series and ABB 600T series). SIL 2
signifies that the equipment offers an $\mathrm{MTTF}^{DU}$ better than 100
years. Such a level of integrity would normally require use
of two transmitters with data comparison.
Both the XTC and 600T apply redundant measurement
systems with onboard diagnostic techniques to compare the
(duplicated) process sensor measurements through the internal signal conversion/temperature compensation and the
actual output current, and provide diagnostic messages of
detected problems for an installed cost higher than, but comparable
with, that of older designs. These designs do suffer from a possibility
of common-mode failure if the process connection becomes
plugged and both sensors see a constant value, but an analysis of expected process noise can in some cases detect this
fault. The suppliers both claim SIL 3 ($\mathrm{MTTF}^{DU}$ greater than
1000 years) if two such transmitters are fitted in a redundant
installation with suitable voting; this would normally require
three good-quality transmitters at higher installed cost; see
Figure 2.6f.
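The link between these MTTF figures and the SIL bands can be checked with the simplified 1oo1 relation for the average probability of failure on demand. The worked equation below is a sketch that assumes a proof-test interval TI of one year (an assumption, not a supplier figure) and uses the demand-mode bands of IEC 61508.

```latex
\[
\mathrm{PFD}_{\mathrm{avg}} \approx \frac{\lambda^{DU}\,TI}{2}
  = \frac{TI}{2\,\mathrm{MTTF}^{DU}}
\]
\[
\mathrm{MTTF}^{DU} = 100\ \text{years},\quad TI = 1\ \text{year}:\qquad
\mathrm{PFD}_{\mathrm{avg}} \approx \frac{1}{2 \times 100} = 5 \times 10^{-3}
\qquad (10^{-3} \le \mathrm{PFD}_{\mathrm{avg}} < 10^{-2}:\ \text{SIL 2})
\]
\[
\mathrm{MTTF}^{DU} = 1000\ \text{years},\quad TI = 1\ \text{year}:\qquad
\mathrm{PFD}_{\mathrm{avg}} \approx \frac{1}{2 \times 1000} = 5 \times 10^{-4}
\qquad (10^{-4} \le \mathrm{PFD}_{\mathrm{avg}} < 10^{-3}:\ \text{SIL 3})
\]
```

A longer proof-test interval raises the average probability proportionally, so the same transmitter can fall into a lower band if it is tested less often.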
The effector or final control element is traditionally the
most expensive item in an individual loop. Not only are process
valves large, high-pressure equipment, but they are exposed
to the flow of the process fluid, which may contain erosive,
corrosive, or gummy components to reduce the reliability of
the device.
For a shutdown system to be testable, the effector must
be tested. Unless the valve is closed under flowing conditions,
there can be no certainty that it will operate when required.
Even then, most block valves cannot be given better than a SIL 1
rating, although at least one manufacturer (Mokveld) claims
SIL 3 on the basis of long experience and documented in-service reliability on certain of its designs.

FIG. 2.6g
Valve redundancy patterns. (Panels: 1oo1; 1oo1 with Bypass; 1oo2 with Bypass; 3oo4.)


Because modern plants are commonly required to stay
on line for several years, on-stream testing is necessary to
validate the availability of the shutdown system. It is possible
in some cases to trip a shutdown valve and reopen it before
the upstream process is out of bounds, but there can be no
certainty that once the valve has closed, it can be reopened.
This is a 1oo1 system. Some manufacturers (e.g., Metso/
Neles, Emerson) offer partial-stroke testing systems that
allow the valve to be stroked partway, while measuring the
pressure/torque/time characteristic on the actuator. Such
diagnostic information can provide warning of deterioration
of the valve and actuator but cannot determine whether the
trim will give tight shutoff. The partial-stroke limiting equipment can also be a potential source of error.
If on-stream testing is essential, then a normally closed
manual bypass can be provided, fitted with position switches
to demonstrate that it is closed in service. This is the 1oo1 with
bypass, which gives a lower level of security, as the bypass
may be left open and the trip system inhibited. For a higher
level of security, a 1oo2 system can be used, with two shutdown valves in series. This can give a SIL 3 rating; again, a
test bypass can be fitted at the expense of possible maloperation.

There is a possibility of using 3oo4 voting, but this is
unlikely to be economic. In this format, any one valve closing
cannot stop flow through the system, but any three valves
closing must stop the flow. This permits each valve to be
tested on full stroke without loss of safety coverage. (Closing
both valves of either vertical pair will also shut off flow.) See Figure 2.6g.
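The 3oo4 claims can be checked by enumerating valve states. Because Figure 2.6g is not reproduced here, the sketch below assumes one layout that is consistent with the text: two pairs of parallel valves, with the two pairs in series. The valve names and the pairing are hypothetical.

```python
from itertools import combinations

# Assumed layout (the figure is not available): (V1 parallel V2) in series
# with (V3 parallel V4). Valve names and pairing are hypothetical.
VALVES = ("V1", "V2", "V3", "V4")
PARALLEL_PAIRS = (("V1", "V2"), ("V3", "V4"))


def flow_blocked(closed: set) -> bool:
    """Flow stops when both valves of at least one parallel pair are closed."""
    return any(all(v in closed for v in pair) for pair in PARALLEL_PAIRS)


def always_blocked(n_closed: int) -> bool:
    """True if every combination of n_closed closed valves stops the flow."""
    return all(flow_blocked(set(c)) for c in combinations(VALVES, n_closed))


def never_blocked(n_closed: int) -> bool:
    """True if no combination of n_closed closed valves stops the flow."""
    return not any(flow_blocked(set(c)) for c in combinations(VALVES, n_closed))


print("any one valve closed never stops flow:", never_blocked(1))
print("any three valves closed always stop flow:", always_blocked(3))
print("both valves of one pair closed stop flow:", flow_blocked({"V1", "V2"}))
```

Under this assumed layout a single valve can be fully stroked for testing while the other pair remains available to shut off flow on demand.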
Redundant design solenoid valve assemblies are also
commercially available (e.g., Norgren/Herion, ASCO/Sistech). These can use redundant output signals from the PLC
to allow online testing of the trip solenoids without taking
the safety system into bypass, at the expense of considerably
increased complexity. An example is shown as Figure 2.6h.
The common practice of having a control valve with a
series shutdown valve offers a low-cost redundancy, but care
must be taken to ensure independence of trip and control
functions. In many cases the cause of a problem can be a
stuck control valve, in which case the trip facility will be
ineffective. Care needs to be taken when using double-acting
piston actuators, as merely tripping the air supply to the
positioner may be ineffective for an extended period. Instead,
the trip signal should vent one side of the cylinder and apply
supply pressure to the other.

FIG. 2.6h
Schematic representation of a Herion 2oo3 voting solenoid valve. Internally piloted Herion 2oo3 solenoid valve with switching position
monitoring. V1, V4, and (V2/V3) can be independently energized (triplicated output) or energized from a single source. (Diagram labels: Redundancy I, II, and III; Signal; valves V1 to V4; ports 1, 2, and 3.)


It is good practice to trip the control signal to manual/closed
whenever the shutdown is activated, but no credit can
be taken for this in a SIL calculation, as the entire distributed
control system (DCS) is normally rated below SIL 1.
Control valves have traditionally been provided with
single-block isolation and a globe valve bypass of comparable capacity to the control valve. This offers poor functionality in many cases; if the process is a high-pressure one,
then upstream and downstream double-block and bleed isolation may be mandated for access to the control valve, and
the globe valve will often be unsuitable for throttling the
process flow for an extended period while the control valve
is removed for service. A parallel-redundant control valve
with single-block isolation (or even upstream-only isolation)
may be an acceptable alternative, to carry operation through
to the next plant shutdown.
The concept of Independent Protection Levels is discussed elsewhere. This takes the concept of redundancy to a
higher level, where the same risk is mitigated by the use of
separate systems.
