Fault Tolerance Techniques For Distributed Systems

Fault tolerance techniques for distributed systems
As our high-tech society becomes increasingly dependent on computers, the demand for more
dependable software will increase and likely become the norm. In the past, fault-tolerant
computing was the exclusive domain of very specialized organizations such as telecom
companies and financial institutions. With business-to-business transactions taking place over the
Internet, however, we are interested not only in making sure that things work as intended, but
also, when the inevitable failures do occur, that the damage is minimal. (None of us would be
happy to lose money because a fault occurred during the transfer of funds from one account to
another, for instance.)
Unfortunately, fault-tolerant computing is extremely hard, involving intricate algorithms for
coping with the inherent complexity of the physical world. As it turns out, that world conspires
against us and is constructed in such a way that, generally, it is simply not possible to devise
absolutely foolproof, 100% reliable software 1 . No matter how hard we try, there is always a
possibility that something can go wrong. The best we can do is to reduce the probability of failure
to an "acceptable" level. Unfortunately, the more we strive to reduce this probability, the higher
the cost.
The Concepts Behind Fault-Tolerant Computing
There is much confusion around the terminology used with fault tolerance. For example, the
terms "reliability" and "availability" are often used interchangeably, but do they always mean the
same thing? What about "faults" and "errors"? In this section, we introduce the basic concepts
behind fault tolerance 2 .
Fault tolerance is the ability of a system to perform its function correctly even in the presence of
internal faults. The purpose of fault tolerance is to increase the dependability of a system. A
complementary but separate approach to increasing dependability is fault prevention. This
consists of techniques, such as inspection, whose intent is to eliminate the circumstances by
which faults arise.
Failures, Errors, and Faults
Implicit in the definition of fault tolerance is the assumption that there is a specification of what
constitutes correct behavior. Afailure occurs when an actual running system deviates from this
specified behavior. The cause of a failure is called an error. An error represents an invalid system
state, one that is not allowed by the system behavior specification. The error itself is the result of
a defect in the system or fault. In other words, a fault is the root cause of a failure. That means
that an error is merely the symptom of a fault. A fault may not necessarily result in an error, but
the same fault may result in multiple errors. Similarly, a single error may lead to multiple failures.
These basic concepts are illustrated using the Unified Modeling Language (UML) class diagram
in Figure 1.
Figure 1: Failures, Errors, and Faults
For example, in a software system, an incorrectly written instruction in a program may decrement
an internal variable instead of incrementing it. Clearly, if this statement is executed, it will result
in the incorrect value being written. If other program statements then use this value, the whole
system will deviate from its desired behavior. In this case, the erroneous statement is the fault, the
invalid value is the error, and the failure is the behavior that results from the error. Note that if the
variable is never read after being written, no failure will occur. Or, if the invalid statement is
never executed, the fault will not lead to an error. Thus, the mere presence of errors or faults does
not necessarily imply system failure.
As this example illustrates, the designation of what constitutes a fault -- the underlying cause of a
failure -- is relative in the sense that it is simply a point beyond which we do not choose to delve
further. After all, the incorrect statement itself is really an error that arose in the process of
writing the software, and so on.
At the heart of all fault tolerance techniques is some form of masking redundancy. This means
that components that are prone to defects are replicated in such a way that if a component fails,
one or more of the non-failed replicas will continue to provide service with no appreciable
disruption. There are many variations on this basic theme.
Fault Classifications
It is helpful to classify faults in a number of different ways, as shown by the UML class diagram
in Figure 2.
Figure 2: Different Classifications of Faults
Based on duration, faults can be classified as transient or permanent. A transient fault will
eventually disappear without any apparent intervention, whereas a permanent one will remain
unless it is removed by some external agency. While it may seem that permanent faults are more
severe, from an engineering perspective, they are much easier to diagnose and handle. A
particularly problematic type of transient fault is the intermittent fault that recurs, often
unpredictably.
A different way to classify faults is by their underlying cause. Design faults are the result of
design failures, like our coding example above. While it may appear that in a carefully designed
system all such faults should be eliminated through fault prevention, this is usually not realistic in
practice. For this reason, many fault-tolerant systems are built with the assumption that design
faults are inevitable, and theta mechanisms need to be put in place to protect the system against
them. Operational faults, on the other hand, are faults that occur during the lifetime of the system
and are invariably due to physical causes, such as processor failures or disk crashes.
Finally, based on how a failed component behaves once it has failed, faults can be classified into
the following categories:
• Crash faults -- the component either completely stops operating or never returns to a
valid state;
• Omission faults -- the component completely fails to perform its service;
• Timing faults -- the component does not complete its service on time;
• Byzantine faults -- these are faults of an arbitrary nature. 3
General Fault Tolerance Procedure

In general, the process for dealing with faults can be grouped into a series of distinct activities
that are typically (although not necessarily) performed in sequence, as shown in the UML activity
diagram in Figure 3.
Figure 3: General Fault Tolerance Procedure
Error detection is the process of identifying that the system is in an invalid state. This means that
some component in the system has failed. To ensure that the effects of the error are limited, it is
necessary to isolate the failed component so that its effects are not propagated further. This is
known as damage confinement. In the error recovery phase, the error and -- more importantly --
its effects, are removed by restoring the system to a valid state. Finally, in fault treatment, we go
after the fault that caused the error so that it can be isolated. In other words, we first treat the
symptoms and then go after the underlying cause. While it may seem more appropriate to go after
the fault immediately, this is often not practical, since diagnosing the true cause of an error can be
a very complex and lengthy process. In a software system, it is typical for a single fault to cause
many cascading errors that are reported independently. Correlating and tracing through a
potential multitude of such error reports often requires sophisticated reasoning.
Error Detection
The most common techniques for error detection are:
• Replication checks -- In this case, multiple replicas of a component perform the same
service simultaneously. The outputs of the replicas are compared, and any discrepancy is
an indication of an error in one or more components. A particular form of this that is
often used in hardware is called triple-modular redundancy (TMR), in which the output
of three independent components is compared, and the output of the majority of the
components is actually passed on 4 . In software, this can be achieved by providing
multiple independently developed realizations of the same component. This is called N-
version programming.
• Timing checks -- This is used for detecting timing faults. Typically a timer is started, set
to expire at a point at which a given service is expected to be complete. If the service
terminates successfully before the timer expires, the timer is cancelled. However, if the
timer times out, then a timing error has occurred. The problem with timers is in cases
where there is variation in the execution of a function. In such cases, it is dangerous to set
the timer too tightly, since it may indicate false positives. However, setting it too loosely
would delay the detection of the error, allowing the effects to be propagated much more
widely.
• Run-time constraints checking -- This involves detecting that certain constraints, such
as boundary values of variables not being exceeded, are checked at run time. The
problem is that such checks introduce both code and performance overhead. A particular
form is robust data structures, which have built-in redundancy (e.g., a checksum). Every
time these data structures are modified, the redundancy checks are performed to detect
any inconsistencies. Some programming languages also support an assertion mechanism.
• Diagnostic checks -- These are typically background audits that determine whether a
component is functioning correctly. In many cases, the diagnostic consists of driving a
component with a known input for which the correct output is also known.
Damage Confinement
This consists of first determining the extent to which an error has spread (because time may have
passed between the time the error occurred and the time it is detected). It requires understanding
the flow of information in a system and following it -- starting from a known failed component.
Each component along such flows is checked for errors, and the boundary is determined. Once
this boundary is known, that part of the system is isolated until it can be fixed.
To assist with damage confinement, special "firewalls" are often constructed between
components. This implies a loose coupling between components so that components outside of
the firewall can be easily de-coupled from the faulty components inside and re-coupled to an
alternative, error-free redundant set. The cost of this is system overhead, both in performance
(communication through a firewall is typically slower) and memory.
Error Recovery
To recover from an error, the system needs to be restored to a valid state. There are two general
approaches to achieving this. Inbackward error recovery, the system is restored to a previous
known valid state. This often requires checkpointing the system state and, once an error is
detected, rolling back the system state to the last checkpointed state. Clearly, this can be a very
expensive capability. Not only is it necessary to keep copies of previous states, but also it is
necessary to stop the operation of the system during checkpointing to ensure that the state that is
stored is consistent. In many cases, this is not viable, since it may not be possible to restore the
environment in which the system is operating to the state that corresponds to the checkpointed
state.
For such cases, forward error recovery is more appropriate. This involves driving the system
from the erroneous state to a new valid state. It may be difficult to do unless the fault that caused
the error is known precisely and is well isolated, so that it does not keep interfering. Because it is
tricky, forward error recovery is not used often in practice.
Fault Treatment
In this phase, the fault is first isolated and then repaired. The repair procedure depends on the
type of fault. Permanent faults require that the failed component be replaced by a non-failed
component. This requires a standby component. The standby component has to be integrated into
the system, which means that its state has to be synchronized with the state of the rest of the
system. There are three general types of standby schemes:
• Cold standby -- This means that the standby component is not operational, so that its
state needs to be changed fully when the cutover occurs. This may be a very expensive
and lengthy operation. For instance, a large database may have to be fully reconstructed
(e.g., using a log of transactions) on a standby disc. The advantage of cold standby
schemes is that they do not introduce overhead during the normal operation of the
system. However, the cost is paid in fault recovery time.
• Warm standby -- In this case, the standby component is used to keep the last checkpoint
of the operational component that it is backing up. When the principal component fails,
the backward error recovery can be relatively short. The cost of warm standby schemes is
the cost of backward recovery discussed earlier (mainly high overhead).
• Hot standby -- In this approach, the standby component is fully active and duplicating
the function of the primary component. Thus, if an error occurs, recovery can be
practically instantaneous. The problem with this scheme is that it is difficult to keep two
components in lock step. In contrast to warm standby schemes, in which synchronization
is only performed during checkpoints, in this case it has to be done on a constant basis.
Invariably, this requires communications between the primary and the standby, so that the
overhead of these schemes is often higher than the overhead for warm standby.
Dependability
Dependability means that our system can be trusted to perform the service for which it has been
designed. Dependability can be decomposed into specific aspects. Reliability characterizes the
ability of a system to perform its service correctly when asked to do so. Availability means that
the system is available to perform this service when it is asked to do so. Safety is a characteristic
that quantifies the ability to avoid catastrophic failures that might involve risk to human life or
excessive costs. Finally, security is the ability of a system to prevent unauthorized access.
Technically, reliability is defined as the probability that a system will perform correctly up to a
given point in time. A common measure of reliability, therefore, is the mean time between
failures (MTBF).
Availability is defined as the probability that a system is operational at a given point in time. For
a given system, this characteristic is strongly dependent on the time it takes to restore it to service
once a failure occurs. A common way of characterizing this ismean time to repair (MTTR).
The two measures for reliability (MTBF) and availability (MTBR) can be used to show the
relationship between these two important measures. It is important to distinguish these two
technical terms, since they are often used interchangeably in everyday communications. This can
lead to confusion. The availability of a system can be calculated from these two measures
according to the formula:
Availability = (MTBF) / (MTTR + MTBF)
Note that for systems that never fail, availability is equal to reliability.
Distributed Systems
We define a distributed software system (Figure 4) as: a system with two or more independent
processing sites that communicate with each other over a medium whose transmission delays
may exceed the time between successive state changes.
Figure 4: A Distributed System
From a fault-tolerance perspective, distributed systems have a major advantage: They can easily
be made redundant, which, as we have seen, is at the core of all fault-tolerance techniques.
Unfortunately, distribution also means that the imperfect and fault-prone physical world cannot
be ignored, so that as much as they help in supporting fault-tolerance, distributed systems may
also be the source of many failures. In this section we briefly review these problems.
Processing Site Failures
The fact that the processing sites of a distributed system are independent of each other means that
they are independent points of failure. While this is an advantage from the viewpoint of the user
of the system, it presents a complex problem for developers. In a centralized system, the failure of
a processing site implies the failure of all the software as well. In contrast, in a fault-tolerant
distributed system, a processing site failure means that the software on the remaining sites needs
to detect and handle that failure in some way. This may involve redistributing the functionality
from the failed site to other, operational, sites, or it may mean switching to some emergency
mode of operation.
Communication Media Failures
Another kind of failure that is inherent in most distributed systems comes from the
communication medium. The most obvious, of course, is a permanent hard failure of the entire
medium, which makes communication between processing sites impossible. In the most severe
cases, this type of failure can lead to partitioning of the system into multiple parts that are
completely isolated from each other. The danger here is that the different parts will undertake
activities that conflict with each other.
A different type of media failure is an intermittent failure. These are failures whereby messages
travelling through a communication medium are lost, reordered, or duplicated. Note that these are
not always due to hardware failures. For example, a message may be lost because the system may
have temporarily run out of memory for buffering it. Message reordering may occur due to
successive messages taking different paths through the communication medium. If the delays
incurred on these paths are different, they may overtake each other. Duplication can occur in a
number of ways. For instance, it may result from a retransmission due to an erroneous conclusion
that the original message was lost in transit.
One of the central problems with unreliable communication media is that it is not always possible
to positively ascertain that a message that was sent has actually been received by the intended
remote destination. A common technique for dealing with this is to use some type of positive
acknowledgement protocol. In such protocols, the receiver notifies the sender when it receives a
message. Of course, there is the possibility that the acknowledgement message itself will be lost,
so that such protocols are merely an optimization and not a solution.
The most common technique for detecting lost messages is based on time-outs. If we do not get a
positive acknowledgement within some reasonable time interval that our message was received,
we conclude that it was dropped somewhere along the way. The difficulty of this approach is to
distinguish situations in which a message (or its acknowledgement) is simply slow from those in
which a message has actually been lost. If we make the time-out interval too short, then we risk
duplicating messages and also reordering in some cases. If we make the interval too long, then
the system becomes unresponsive.
Transmission Delays
While transmission delays are not necessarily failures, they can certainly lead to failures. We've
already noted that a delay can be misconstrued as a message loss.
There are two different types of problems caused by message delays. One type results
from variable delays (jitter). That is, the time it takes for a message to reach its destination may
vary significantly. The delays depend on a number of factors, such as the route taken through the
communication medium, congestion in the medium, congestion at the processing sites (e.g., a
busy receiver), intermittent hardware failures, etc. If the transmission delay is constant, then we
can much more easily assess when a message has been lost. For this reason, some communication
networks are designed as synchronous networks, so that delay values are fixed and known in
advance.
However, even if the transmission delay is constant, there is still the problem of out-of-date
information. Since messages are used to convey information about state changes between
components of the distributed system, if the delays experienced are greater than the time required
to change from one state to the next, the information in these messages will be out of date. This
can have major repercussions that can lead to unstable systems. Just imagine trying to drive a car
if visual input to the driver were delayed by several seconds.
Transmission delays also lead to a complex situation that we will refer to as the relativistic effect.
This is a consequence of the fact that transmission delays between different processing sites in a
distributed system may be different. As a result, different sites may see the same set of messages
but in a different order. This is illustrated in Figure 5 below:
Figure 5: The Relativistic Effect
In this case, distributed sites NotifierP and NotifierQ each send out a notification about an event
to the two clients (ClientAand ClientB). Due to the different routes taken by the individual
messages and the different delays along those routes, we see that ClientB sees one sequence
(event1 followed by event2), whereas ClientA sees a different one (event2-event1). As a
consequence, the two clients may reach different conclusions about the state of the system.
Note that the mismatch here is not the result of message overtaking (although this effect is
compounded if overtaking occurs); it is merely a consequence of the different locations of the
distributed agents relative to each other.
Distributed Agreement Problems
The various failure scenarios in distributed systems and transmission delays in particular have
instigated important work on the foundations of distributed software. 5 Much of this work has
focused on the central issue of distributed agreement. There are many variations of this problem,
including time synchronization, consistent distributed state, distributed mutual exclusion,
distributed transaction commit, distributed termination, distributed election, etc. However, all of
these reduce to the common problem of reaching agreement in a distributed environment in the
presence of failures.
A Fault-Tolerant Pattern for Distributed Systems

We now examine a specific pattern that has been successfully used to construct complex, fault-
tolerant embedded systems in a distributed environment. This pattern is suitable for a class of
distributed applications that is characterized by the star-like topology shown in Figure 6, which
commonly occurs in practice.
Figure 6: System Topology
The system consists of a set of distributed agents, each on a separate processing site, which
collaboratively perform some function. The role of the controller is to coordinate the operation of
the agents. Note that it is not necessarily the case that the agents have to communicate through
the controller. We have simply not shown such connections in our diagram since they are not
relevant at this point. We make the following assumptions about this system:
• Soft response time Â— The processing times for global functions allow for some
variability.
• Non-critical agents Â— The system may still be useful even if one or more agents fail
permanently.
In this system, the permanent failure of the central controller would lead to the loss of all global
functionality. Hence, it is a single point of failure that needs to be made fault tolerant. However,
we would like to avoid the overhead of a full hot standby -- or even a warm standby -- for this
component.
The essential feature of this approach is to distribute the state information about the system as a
whole between the controller and the agents. This is done in such a way that (a) the controller
holds the global state information that comprises the state information for each agent, and (b)
each agent keeps its own copy of its state information. Every time the local state of an agent
changes, it informs the controller of the change. The controller caches this information. Thus, we
have state redundancy. Note that there is no need for a centralized consistent checkpoint of the
entire system, which, as we have mentioned, is a complex and high-overhead operation.
Obviously, the state redundancy does not protect us from permanent failures of the controller.
Note however, that any local functions performed by the agents are unaffected by the failure of
the agents. Global functions, on the other hand, may have to be put on hold until the controller
recovers. Thus, it is critical to be able to recover the controller.
Controller recovery can be easily achieved by using a simple, low-overhead, cold standby
scheme. The standby controller can recover the global state information of the failed controller
simply by querying each agent in turn. Once it has the full set, the system can resume its
operation; the only effect is the delay incurred during the recovery of the controller. Of course, it
is also possible to use other standby schemes as well with this approach.
If an agent fails and recovers, it can restore its local state information from the controller. One
interesting aspect to this topology is that it is not necessary for the controller to monitor its
agents. A recovering agent needs merely to contact its controller to get its state information. This
eliminates costly polling.
error detection and correction are techniques to ensure that data is transmitted without errors,
even across unreliable media or networks.
General definitions of terms
Error detection and error correction are defined as follows:
• Error detection is the detection of errors caused by noise or other impairments during
transmission from the transmitter to the receiver.[1]
• Error correction is the detection of errors and reconstruction of the original, error-free
data.
There are two basic ways to design the channel code and protocol for an error-correcting system:
• Automatic Repeat reQuest (ARQ): The transmitter sends the data and also an error-
detection code, which the receiver uses to check for errors, and requests retransmission of
the data that was deemed erroneous. In many cases the request is implicit: The receiver
sends an acknowledgement (ACK) of correctly received data, and the transmitter re-sends
anything not acknowledged within a reasonable period of time.
• Forward error correction (FEC): The transmitter encodes the data with an error-
correcting code (ECC) and sends the coded message. The receiver never sends any
messages back to the transmitter. The receiver decodes what it receives into the "most
likely" data. Forward error-correction codes are designed so that it would take an
"unreasonable" amount of noise to trick the receiver into misinterpreting the data.
It is possible to combine the ARQ and FEC so that minor errors are corrected without
retransmission, and major errors are corrected via a request for retransmission. The combination
is called a hybrid automatic repeat-request.
Error detection schemes
Several schemes exist to achieve error detection. The general idea is to add some redundancy
(i.e., some extra data) to a message, which enables detection of any errors in the delivered
message. Most such error-detection schemes are systematic: The transmitter sends the original
data bits, and attaches a fixed number of check bits, which are derived from the data bits by some
deterministic algorithm. The receiver applies the same algorithm to the received data bits and
compares its output to the received check bits; if the values do not match, an error has occurred at
some point during the transmission. In a system that uses a "non-systematic" code, such as some
raptor codes, the original message is transformed into an encoded message that has at least as
many bits as the original message.
In general, any hash function may be used to compute the redundancy. However, some functions
are of particularly widespread use because of either their simplicity or their suitability for
detecting certain kinds of errors (e.g., the cyclic redundancy check's performance in detecting
burst errors).
Other mechanisms of adding redundancy are repetition schemes and error-correcting codes.
Repetition schemes are rather inefficient but very simple to implement. Error-correcting codes
can provide strict guarantees on the number of errors that can be detected.
Repetition codes
Main article: Repetition code
A repetition code is an coding scheme that repeats the bits across a channel to achieve error-free
communication. Given a stream of data to be transmitted, the data is divided into blocks of bits.
Each block is transmitted some predetermined number of times. For example, to send the bit
pattern "1011", the four-bit block can be repeated three times, thus producing "1011 1011 1011".
However, if this twelve-bit pattern was received as "1010 1011 1011" – where the first block is
unlike the other two – it can be determined that an error has occurred.Repetition codes are not
very efficient, and can be susceptible to problems if the error occurs in exactly the same place for
each group (e.g., "1010 1010 1010" in the previous example would be detected as correct). The
advantage of repetition codes is that it they are extremely simple, and are in fact used in some
transmissions of numbers stations.[citation needed]
Parity bits
A parity bit is a bit that is added to ensure that the number of set bits (i.e., bits with the value 1) in
a group of bits is even or odd. A parity bit can only detect an odd number of errors (i.e., one,
three, five, etc. bits that are incorrect).
There are two variants of parity bits: even parity bit and odd parity bit. When using even parity,
the parity bit is set to 1 if the number of ones in a given set of bits (not including the parity bit) is
odd, making the entire set of bits (including the parity bit) even. When using odd parity, the
parity bit is set to 1 if the number of ones in a given set of bits (not including the parity bit) is
even, making the entire set of bits (including the parity bit) odd. In other words, an even parity bit
will be set if the number of set bits plus one is even, and an odd parity bit will be set if the
number of set bits plus one is odd.
There is a limitation to parity schemes. A parity bit is only guaranteed to detect an odd number of
bit errors. If an even number of bits (i.e., two, four, six, etc.) are flipped, the parity bit will appear
to be correct even though the data is erroneous. Extensions and variations on the parity bit
mechanism are horizontal redundancy checks, vertical redundancy checks, and "double," "dual,"
or "diagonal" parity (used in RAID-DP).
Checksums
A checksum of a message is a modular arithmetic sum of message code words of a fixed word
length (e.g., byte values). The sum is often negated by means of a one's-complement prior to
transmission as the redundancy information to detect errors resulting in all-zero messages.
Checksum schemes include parity bits, check digits, and longitudinal redundancy checks. Some
checksum schemes, such as the Luhn algorithm and the Verhoeff algorithm, are specifically
designed to detect errors commonly introduced by humans in writing down or remembering
identification numbers.
Cyclic redundancy checks (CRCs)

A cyclic redundancy check (CRC) is a non-secure hash function designed to detect accidental
changes to raw computer data. Its computation resembles a long division operation in which the
quotient is discarded and the remainder becomes the result, with the important distinction that the
arithmetic used is the carry-less arithmetic of a finite field. The length of the remainder is always
less than or equal to the length of the divisor, which therefore determines how long the result can
be.
Cyclic redundancy checks have favorable properties in that they are specifically suited for
detecting burst errors. CRCs are easily implemented in hardware, and are commonly used in
digital networks and storage devices such as hard disk drives.
Even parity is a special case of a cyclic redundancy check, where the single-bit CRC is generated
by the polynomial x+1.
Cryptographic hash functions
A cryptographic hash function can provide strong assurances about data integrity, provided that
changes of the data are only accidental (i.e., due to transmission errors). Any modification to the
data will likely be detected through a mismatching hash value. Furthermore, given some hash
value, it is infeasible to find some input data (other than the one given) that will yield the same
hash value. Message authentication codes, also called keyed cryptographic hash functions,
provide additional protection against intentional modification by an attacker.
Error-correcting codes
Any error-correcting code can be used for error detection. A code with minimum Hamming
distance, d, can detect up to d-1 errors in a code word. Using error-correcting codes for error
detection can be favorable if strict integrity guarantees are desired, and the capacity of the
transmission channel can be modeled.
Codes with minimum Hamming distance d=2 are degenerate cases of error-correcting codes, and
can be used to detect single errors. The parity bit is an example of a single-error-detecting code.
The Berger code is an early example of a unidirectional error(-correcting) code that can detect
any number of errors on an asymmetric channel, provided that only transitions of cleared bits to
set bits or set bits to cleared bits can occur.
Error correction
Automatic repeat request
Automatic Repeat reQuest (ARQ) is an error control method for data transmission that makes use
of error-detection codes, acknowledgment and/or negative acknowledgment messages, and
timeouts to achieve reliable data transmission. An acknowledgment is a message sent by the
receiver to indicate that it has correctly received a data frame.
Usually, when the transmitter does not receive the acknowledgment before the timeout occurs
(i.e., within a reasonable amount of time after sending the data frame), it retransmits the frame
until it is either correctly received or the error persists beyond a predetermined number of
retransmissions.
Three types of ARQ protocols are Stop-and-wait ARQ, Go-Back-N ARQ, and Selective Repeat
ARQ.
ARQ is appropriate if the communication channel has varying or unknown capacity, such as is
the case on the Internet. However, ARQ requires the availability of a back channel, results in
possibly increased latency due to retransmissions, and requires the maintenance of buffers and
timers for retransmissions, which in the case of network congestion can put a strain on the server
and overall network capacity.[2]
Error-correcting code
An error-correcting code (ECC) or forward error correction (FEC) code is a system of adding
redundant data, or parity data, to a message, such that it can be recovered by a receiver even
when a number of errors (up to the capability of the code being used) were introduced, either
during the process of transmission, or on storage. Since the receiver does not have to ask the
sender for retransmission of the data, a back-channel is not required in forward error correction,
and it is therefore suitable for simplex communication such as broadcasting. Error-correcting
codes are frequently used in lower-layer communication, as well as for reliable storage in media
such as CDs, DVDs, and dynamic RAM.
Error-correcting codes are usually distinguished between convolutional codes and block codes:
• Convolutional codes are processed on a bit-by-bit basis. They are particularly suitable for
implementation in hardware, and the Viterbi decoder allows optimal decoding.
• Block codes are processed on a block-by-block basis. Early examples of block codes are
repetition codes, Hamming codes and multidimensional parity-check codes. They were
followed by a number of efficient codes, of which Reed-Solomon codes are the most
notable ones due to their widespread use these days. Turbo codes and low-density parity-
check codes (LDPC) are relatively new constructions that can provide almost optimal
efficiency.
Shannon's theorem is an important theorem in forward error correction, and describes the
maximum information rate at which reliable communication is possible over a channel that has a
certain error probability or signal-to-noise ratio (SNR). This strict upper limit is expressed in
terms of the channel capacity. More specifically, the theorem says that there exist codes such that
with increasing encoding length the probability of error on a discrete memoryless channel can be
made arbitrarily small, provided that the code rate is smaller than the channel capacity. The code
rate is defined as the fraction k/n of k source symbols and n encoded symbols.
The actual maximum code rate allowed depends on the error-correcting code used, and may be
lower. This is because Shannon's proof was only of existential nature, and did not show how to
construct codes which are both optimal and have efficient encoding and decoding algorithms.
Hybrid schemes
Hybrid ARQ is a combination of ARQ and forward error correction. There are two basic
approaches[2]:
• Messages are always transmitted with FEC parity data (and error-detection redundancy).
A receiver decodes a message using the parity information, and requests retransmission
using ARQ only if the parity data was not sufficient for successful decoding (identified
through a failed integrity check).
• Messages are transmitted without parity data (only with error-detection information). If a
receiver detects an error, it requests FEC information from the transmitter using ARQ,
and uses it to reconstruct the original message.
The latter approach is particularly attractive on the binary erasure channel when using a rateless
erasure code.
Applications
Applications that require low latency (such as telephone conversations) cannot use Automatic
Repeat reQuest (ARQ); they must use Forward Error Correction (FEC). By the time an ARQ
system discovers an error and re-transmits it, the re-sent data will arrive too late to be any good.
Applications where the transmitter immediately forgets the information as soon as it is sent (such
as most television cameras) cannot use ARQ; they must use FEC because when an error occurs,
the original data is no longer available. (This is also why FEC is used in data storage systems
such as RAID and distributed data store).
Applications that use ARQ must have a return channel. Applications that have no return channel
cannot use ARQ.
Applications that require extremely low error rates (such as digital money transfers) must use
ARQ.
The Internet
In a typical TCP/IP stack, error detection is performed at multiple levels:
• Each Ethernet frame carries a CRC-32 checksum. The receiver discards frames if their
checksums do not match.
• The IPv4 header contains a header checksum of the contents of the header (excluding the
checksum field). Packets with checksums that don't match may be discarded or
processed, depending on application.
• The checksum was omitted from the IPv6 header, because most current link layer
protocols have error detection.
• UDP has an optional checksum. Packets with wrong checksums may be discarded or
retained depending on application.
• TCP has a checksum of the payload, TCP header (excluding the checksum field) and
source- and destination addresses of the IP header. Packets found to have incorrect
checksums are discarded and eventually get retransmitted when the sender receives a
triple-ack or a timeout occurs.
Deep-space telecommunications
Development of error-correction codes was tightly coupled with the history of deep-space
missions due to the extreme dilution of signal power over interplanetary distances, and the limited
power availability aboard space probes. Whereas early missions sent their data uncoded, starting
from 1968 digital error correction was implemented in the form of (sub-optimally decoded)
convolutional codes or Reed-Muller codes. The Reed-Muller code was well suited to the noise
the spacecraft was subject to (approximately matching a Bell curve), and was implemented at the
Mariner spacecraft for missions between 1969 and 1977.
The Voyager 1 and Voyager 2 missions, which started in 1977, were designed to deliver color
imaging amongst scientific information of Jupiter and Saturn. This resulted in increased coding
requirements, and thus the spacecrafts were supported by (optimally Viterbi-decoded)
convolutional codes that could be concatenated with an outer Golay (24,12,8) code. The Voyager
2 probe additionally supported an implementation of a Reed-Solomon code: the concatenated
Reed-Solomon-Viterbi (RSV) code allowed for very powerful error correction, and enabled the
spacecraft's extended journey to Uranus and Neptune.
The CCSDS currently recommends usage of error correction codes with performance similar to
the Voyager 2 RSV code as a minimum. Concatenated codes are increasingly falling out of favor
with space missions due to their relatively high hardware costs, and are replaced by more
powerful codes such as Turbo codes or LDPC codes.
The different kinds of deep space and orbital missions that are conducted suggest that trying to
find a "one size fits all" error correction system will be an ongoing problem for some time to
come. For missions close to earth the nature of the channel noise is different from that a
spacecraft on an interplanetary mission experiences. Additionally, as a spacecraft increases its
distance from earth, the problem of correcting for noise gets larger.
Satellite broadcasting (DVB)
The demand for satellite transponder bandwidth continues to grow, fueled by the desire to deliver
television (including new channels and High Definition TV) and IP data. Transponder availability
and bandwidth constraints have limited this growth, because transponder capacity is determined
by the selected modulation scheme and Forward error correction (FEC) rate.
Overview
• QPSK coupled with traditional Reed Solomon and Viterbi codes have been used for
nearly 20 years for the delivery of digital satellite TV.
• Higher order modulation schemes such as 8PSK, 16QAM and 32QAM have enabled the
satellite industry to increase transponder efficiency by several orders of magnitude.
• This increase in the information rate in a transponder comes at the expense of an increase
in the carrier power to meet the threshold requirement for existing antennas.
• Tests conducted using the latest chipsets demonstrate that the performance achieved by
using Turbo Codes may be even lower than the 0.8 dB figure assumed in early designs.
Data storage
Error detection and correction codes are often used to improve the reliability of data storage
media.
A "parity track" was present on the first magnetic tape data storage in 1951. The "Optimal
Rectangular Code" used in group code recording tapes not only detects but also corrects single-bit
errors.
Some file formats, particularly archive formats, include a checksum (most often CRC32) to detect
corruption and truncation and can employ redundancy and/or parity files to recover portions of
corrupted data.
Reed Solomon codes are used in compact discs to correct errors caused by scratches.
Modern hard drives use CRC codes to detect and Reed-Solomon codes to correct minor errors in
sector reads, and to recover data from sectors that have "gone bad" and store that data in the spare
sectors.[3]
RAID systems use a variety of error correction techniques, to correct errors when a hard drive
completely fails.
Error-correcting memory
DRAM memory may provide increased protection against soft errors by relying on error
correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is
particularly desirable for high fault-tolerant applications, such as servers, as well as deep-space
applications due to increased radiation.
Error-correcting memory controllers traditionally use Hamming codes, although some use triple
modular redundancy.
Interleaving allows distributing the effect of a single cosmic ray potentially upsetting multiple
physically neighboring bits across multiple words by associating neighboring bits to different
words. As long as a single event upset (SEU) does not exceed the error threshold (e.g., a single
error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error
correcting code), and the illusion of an error-free memory system may be maintained.[4]
List of error-correcting codes
• BCH code
• Constant-weight code
• Convolutional code
• Group codes
• Golay codes, of which the Binary Golay code is of practical interest
• Goppa code, used in the McEliece cryptosystem
• Hadamard code
• Hagelbarger code
• Hamming code
• Latin square based code for non-white noise (prevalent for example in broadband over
powerlines)
• Lexicographic code
• Low-density parity-check code, also known as Gallager code, as the archetype for sparse
graph codes
• LT code, which is a near-optimal rateless erasure correcting code (Fountain code)
• m of n codes
• Online code, a near-optimal rateless erasure correcting code
• Raptor code, a near-optimal rateless erasure correcting code
• Reed-Solomon code
• Reed-Muller code
• Repeat-accumulate code
• Repetition codes, such as Triple modular redundancy
• Tornado code, a near-optimal erasure correcting code, and the precursor to Fountain
codes
• Turbo code
Error detection &correction codes
• BCH Codes
o Berlekamp–Massey algorithm
o Peterson-Gorenstein-Zierler algorithm
o Reed Solomon error correction
• BCJR algorithm: decoding of error correcting codes defined on trellises (principally
convolutional codes)
• Hamming codes
o Hamming(7,4): a Hamming code that encodes 4 bits of data into 7 bits by adding
3 parity bits
o Hamming distance: sum number of positions which are different
o Hamming weight (population count): find the number of 1 bits in a binary word
• Redundancy checks
o Adler-32
o Cyclic redundancy check
o Fletcher's checksum
o Luhn algorithm: a method of validating identification numbers
o Luhn mod N algorithm: extension of Luhn to non-numeric characters
o Parity: simple/fast error detection technique
o Verhoeff algorithm
o Longitudinal redundancy check (LRC)
Functional Microcontroller Design and Implementation
Abstract
In many situations, hardware description languages (HDL) such as VHDL, Verilog or SystemC is
used to develop the functionality of the digital system, while the timing and control signal
generation is either neglected or ignored. The authors have used a methodology wherein a
hardware structure was conceptually laid out of the digital system under consideration. The
system development started with topdown planning approach and the blocks were designed using
bottom-up implementation. The programs were written, simulated and synthesized using
Electronic Data Automation (EDA) tools such as ModelSim and Leonardo Spectrum. Instruction
set such as transfer, arithmetic, logic, input, output and control instructions were implemented.
This approach guaranteed the integrity of the system realization with proper timing and data flow,
without the invisible ghost states. In this article, the authors have presented the design
methodology of such a multi purpose microcontroller, and provided the functional
HDL code, simulation and synthesis results. Also, the authors have presented the sequence in
which this microcontroller can be made a general purpose controller unit which can be used by
other system designs.
Keywords
VLSI, VHDL, Microcontroller, Simulation, Synthesis
1. Introduction
In any control and controller system applications, microcontroller is an important module, which
provides the control, timing and status signals, in any complex digital system realizations.
(Bartbel, 1997) A microprocessor is usually defined as “a single chip that contains control logic
and data processing logic, so that it can execute instructions listed in a program to operate on
some data”. Microcontrollers are nothing but microprocessors with on-chip memory. Whether
using ASIC, FPGA or CPLD based realizations, it is essential to incorporate the microcontroller
module, as an integral part of the system. Functional microcontroller has been developed using
VHDL coding using structural design of logic blocks which generates control and timing signals
used for the data processing operation.
2. Methodology
While designing a microcontroller module, it is important to determine the number data bits,
which can be processed by the microcontroller in one cycle. In the process of designing, the
instruction set, instruction width, internal register set, type of control unit, flags, amount of
memory to be used has to be laid out. (Gloria, 1999) VLSI implementation point of view, the
most relevant factor that intervenes in the decision of implementing a certain instruction lies on
the number of operands needed to specify the instruction. This is not an absolute factor, it
depends on the number of registers, and therefore on the number of bits required for coding a
register reference. Present day VLSI technology has lead to the design and development of
millions of gates on a chip. Hardware designers create several VLSI modules for their
research and development purposes. It is often important to re-use these modules to reduce
product development time, thereby minimizing the time to market. Therefore, it is important to
design hardware in a modular fashion, so that these modules can be included in the development
of a complex system. The design and development of such a modular design microcontroller
helps other designers to incorporate
this module with minimal or no modifications to the hardware module. The block diagram of
such a 4-bit microcontroller is as shown in fig.1.
Figure 1: 4-bit Slice Microcontroller Block Diagram
3. Module Development
In this study, 4-bit microcontroller was designed using a modular approach. Using top-down
approach,
the elements of the microcontroller were identified as basic registers, instruction decoders, ALU,
RAM,
Control and timing Unit. (Zabawa and Wunnava, 2004) propose the design of a microcontroller
unit with
the following building blocks:
 A 16-word by four bit port RAM.
Four registers (RegA, RegB, RegC, RegD)
An ALU selector which select two inputs. The D (Destination Input) will always be rega_data.
The S (Source Input) will be either regb_data, regc_data, regd_data based on the selected output
from the MUX.
The MUX module used instruction [11:8] as the select input. If select is ‘0010’ then regb_data
is
set to signal S (Source Input). If select is ‘0100’ then regc_data is set to signal S. If Select is
‘1000’ then regd_data is set to signal S.
A 4-bit ALU capable of doing arithmetic, logical, and bitwise functions on the selected source
and destination words.
An instruction decoder is used to decide whether to load the ALU output into RegA or whether
to
load the read data from the RAM. It also load RegB, RegC, and RegD from memory if the
MOVB, MOVC, or MOVD operations occurs.
The implementation was done using a bottom-up approach. The basic hardware blocks like
adders, flip flops, shift registers, counters, and comparators were designed. Later, these blocks
were used to form ALU, RAM, and instruction decoder. Finally, in the top level module, these
blocks were connected to form a functional microcontroller. The top level module has the I/O
pins as shown in Table 1. The instruction set has a 12-bit instruction width, which has three 3-bit
fields whose functions are as shown in
Table 2. The ALU performs arithmetic, logic and shift operations as defined in Table 2. In all, 14
instructions can be performed by this ALU.
4. VHDL Programming, Simulation and Synthesis

Programs for each building block of this system were written in VHDL. Mentor Graphics
simulation tool ModelSim was used for writing the code, simulating the programs and to test its
behavior. VHDL can be very difficult language to learn by reading the VHDL language
Reference Manual (also called the LRMLanguage Reference Manual) but should be learnt
enough to start coding basic hardware blocks. The hardware programming was done in a
structural fashion so that the modules can be easily used as the library for complex hardware
development. The VHDL program for ALU is shown in fig 2. (Sosnowski, 2006) Developing test
procedures for micro architectural level blocks is a cumbersome process. Appropriate test stimuli
need sophisticated sequences of instructions. The programs were tested using test benches written
using TEXTIO format, which enables us to write multiple test vectors for extensive test
scenarios. The generated test benches were able to read in any test vectors from the test file and
will act as
stimuli to the designed hardware. The outputs thus obtained from these test benches were
compared and verified for functional accuracy. (Lloyd et al, 1999) After the design has been
developed and validated it becomes possible, through the use of VHDL descriptions, to employ
field-programmable gate array logic that will allow a custom device to be rapidly prototyped and
tested. Leonardo Spectrum Synthesis tool was used to synthesize the microcontroller module by
choosing A1240XLPG132 from Actel family as the FPGA target device.
Figure 2: VHDL implementation of ALU
After synthesizing the module the Register Transfer Level (RTL) schematic, Technology
schematics and Critical Path schematic can be seen and compared to the original design. The
synthesized RTL schematic of the top level of microcontroller module and the instruction decoder
is as shown in fig 3 and fig 4 respectively. (Flynn, 1999) The most successful microprocessor
implementations depend not simply on the use of the current state of the art in hardware
algorithms, but more importantly in bringing together the knowledge of these algorithms together
with projected advances in the technology and user state of the art. Therefore, the designer should
not only depend on the technology to develop faster hardware modules but also need to find out
ways to improve speed by minimizing redundant blocks in the design.
Figure 4: Top level entity of Instruction decoder
5. Results
The VHDL program was simulated in ModelSim using TEXTIO test benches with extensive test
vectors. The simulation of instruction phase OR and NOT are shown in the fig 5. The device
utilization of the target FPGA device A1240XLPG132, from Actel family is as shown in Table 3.
The synthesized microcontroller module has 29 pins for address, data, I/O and for other signals.
The Leonardo spectrum synthesis report shows 443 accumulated instances with 24 sequential and
280 combinational modules with the speed of operation of 14.47 MHz.
Figure 5: Simulation results for microcontroller instruction phase OR, NOT

6. Conclusion
In this paper, the design and the development of a basic 4-bit microcontroller has been discussed.
Functional verification shows the accurate results for instruction set Add, Subtract, compare,
Shift-left, Shift-Right, Rotate-left, Rotate-right, AND, OR, NOT operations. The developed
microcontroller module functions with simple control signals. The developed module is
functionally built using structural approach in VHDL. Therefore, this microcontroller can be used
as a module in complex board level designs. Also, because of the modular design and bottom-up
implementation of this microcontroller, the VHDL code can easily be expanded to develop higher
order microcontroller without making extensive changes. This VHDL program will be
compatible with any other VHDL system that is compliant with either IEEE Standard 1076-1987
or 1076-1993. The speed of the design is limited to the target FPGA technology and can only be
improved by using ASIC design methodology. Also ASIC design involves more resources and
expensive development costs, and hence there needs to be a trade off of speed and economy.
Design Hierarchy
The design hierarchy and the corresponding VHDL files are depicted in figure 2.
figure 2: Design hierarchy of the 8051 microcontroller IP-core.

The VHDL source files have been consistently named throughout the whole design:
 V HDL entities entity-name_.vhd
VHDL architectures entity-name_rtl.vhd for modules containing logic
entity-name_struc.vhd
 for modules just connecting submodules
VHDL configurations entity-name_rtl_cfg.vhd
entity-name_struc_cfg.vhd
The core itself is made up of the submodules timer/counter, ALU, serial interface,
and control unit. RAM or ROM blocks are most often generated corresponding to the selected
target technology and are therefore instantiated in the highest design hierarchy. Generated RAM
blocks and BIST structures - for ASIC production test integration - can be easily added at this
level of the design.
Clock Domains
The 8051 IP core is a fully synchronous design. There is a single clock signal that controls the
clock input of every storage element. Clock gating is not used. The clock signal is not fed into
any combinatorial element. The interrupt input lines are
synchronized to the global clock signal using a standard two-level synchronization stage because
they may be driven by external circuitry that operates with another clock. The parallel port input
signals are not synchronized that way. If the user decides that there is also the need for
synchronizing these signals it may be added easily.
Memory Interfaces
Due to the optimized architecture the signals coming from and going to the memory blocks have
not been registered. So during synthesis input and output timing constraints should be placed on
the corresponding ports and synchronous memory blocks should be used for the mc8051 IP-core.
Configuring the 8051 IP Core
In the following the parameterizability of the 8051 microcontroller IP-core design will be
discussed and information for embedding the IP-core in larger designs will be given.
Timer/Counter, Serial Interface, and Interrupts
The original microcontroller design offered only 2 timer/counter units, one serial
interface, and two external interrupt sources. 8051 derivates later offered more of
these resources on chip. Since this is sometimes a limiting factor we decided to
implement some sort of parameterization in the 8051 IP core. This 8051
microcontroller IP-core offers the capability to generate up to 256 of these units by simply
changing a VHDL constant’s value. In the VHDL source file mc8051_p.vhd the constant
C_IMPL_N_TMR can take
values from 1 to 256 to control this feature. Values out of this interval result in a non functioning
configuration of the core. Figure 3 shows the corresponding lines of
VHDL code.
page 7 of 11
-----------------------------------------------------------------------------
-- Select how many timer/counter units should be implemented
-- Default: 1
constant C_IMPL_N_TMR : integer := 1;
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- Select how many serial interface units should be implemented
-- Default: C_IMPL_N_TMR ---(DO NOT CHANGE!)---
constant C_IMPL_N_SIU : integer := C_IMPL_N_TMR;
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- Select how many external interrupt-inputs should be implemented
-- Default: C_IMPL_N_TMR ---(DO NOT CHANGE!)---
constant C_IMPL_N_EXT : integer := C_IMPL_N_TMR;
-----------------------------------------------------------------------------
figure 3: VHDL source code for configuring the number of timer/counter units, serial
interfaces,
and external interrupts.
Optional Instructions
In some cases it makes sense to not implement instructions which are not needed
and consume furthermore much chip area. Such instructions are 8bit multiplication,
8bit division, and 8bit decimal correction. Therefore the MUL instruction for 8bit
multiplication can be skipped when the VHDL constant C_IMPL_MUL in the
mc8051_p.vhd source file is set to 0. Equally the 8bit division DIV can be skipped
through setting the VHDL constant C_IMPL_DIV to 0 and the decimal correction
instruction can be skipped by setting the constant C_IMPL_DA to 0. The
corresponding lines of VHDL source code can be seen in figure 5.
-----------------------------------------------------------------------------
-- Select whether to implement (1) or skip (0) the multiplier
-- Default: 1
constant C_IMPL_MUL : integer := 1;
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- Select whether to implement (1) or skip (0) the divider
-- Default: 1
constant C_IMPL_DIV : integer := 1;
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- Select whether to implement (1) or skip (0) the decimal adjustment command
-- Default: 1
constant C_IMPL_DA : integer := 1;
-----------------------------------------------------------------------------
figure 5: Code fragment showing how instructions can be skipped.
The gain in terms of chip area when not implementing all three optional instructions is
approximately 10 %.
Parallel I/O Ports
The mc8051 IP-core offers just as the original 8051 microcontroller 4 bidirectional
8bit I/O ports to conveniently exchange data with the microcontroller’s environment.
To ease integration of our core for IC design the original’s multi-function ports have
not been rebuilt and all signals (e.g. serial interface, interrupts, counter inputs, and
interface to external memory) have been fed separately out of the core (see figure 1).
The basic structure of the parallel I/O ports is shown in figure 6.
figure 6: Basic structure of the parallel I/O ports.

Verification
Verification of the core was accomplished by simulating the VHDL code and
comparing the results of the executed program (i.e. ROM contents) with the results produced by
an industry standard 8051 simulator
After simulation the contents of a certain memory area is written to a file both for the standard
8051 simulator (using the command save keil.hex 0x00,0xFF) and the VHDL code simulation
(e.g. using the script write2file.do, producing the file regs.log in the simulation directory). The
resulting text files have to be identical. To be able to feed the compiled assembler file from the
Keil development software to the VHDL code simulation a short C program is provided in the
latest distribution of the IP core. It converts the Intel hex format file into a text file containing
binary 8 bit data, suited for being read by the VHDL simulator (file mc8051_rom.dua in the
simulation directory).
Deliverables
Extract the mc8051.zip file as is using the directory tree we recommend. There are already
scripts for synthesis using Synopsys’s DesignCompiler for ASIC design and Synplicity’s
Synplify for FPGA design. Additionally there are scripts for RTL simulation using
Mentor’s/Modeltech’s Modelsim. See figure 7 for an overview of the mc8051 directory tree.
figure 7: Directory tree in which the mc8051 IP-core is distributed.

Fault Tolerance Techniques For Distributed Systems

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Fault Tolerance Techniques For Distributed Systems

Uploaded by

Copyright:

Available Formats

Fault tolerance techniques for distributed systems

General Fault Tolerance Procedure

A Fault-Tolerant Pattern for Distributed Systems

General definitions of terms

Error detection and error correction are defined as follows:

Error detection schemes

Main article: Repetition code

Cyclic redundancy checks (CRCs)

Cryptographic hash functions

Automatic repeat request

In a typical TCP/IP stack, error detection is performed at multiple levels:

Satellite broadcasting (DVB)

List of error-correcting codes

Error detection &correction codes

4. VHDL Programming, Simulation and Synthesis

Figure 4: Top level entity of Instruction decoder

Figure 5: Simulation results for microcontroller instruction phase OR, NOT

figure 2: Design hierarchy of the 8051 microcontroller IP-core.

figure 6: Basic structure of the parallel I/O ports.

figure 7: Directory tree in which the mc8051 IP-core is distributed.

You might also like