
Distributed Operating Systems

EMC2
BITS Pilani
MS-WILP
06 April 2013
Definition of a Distributed System (1)
A distributed system is:

A collection of independent
computers that appears to
its users as a single coherent
system
Definition of a Distributed System (2)
A distributed system organized as middleware
Note that the middleware layer extends over multiple machines
Transparency in a Distributed System
Different forms of transparency in a distributed system:

Access:      hide differences in data representation and how a resource is accessed
Location:    hide where a resource is located
Migration:   hide that a resource may move to another location
Relocation:  hide that a resource may be moved to another location while in use
Replication: hide that a resource is replicated
Concurrency: hide that a resource may be shared by several competitive users
Failure:     hide the failure and recovery of a resource
Persistence: hide whether a (software) resource is in memory or on disk
Scalability Problems
Examples of scalability limitations:

Centralized services:   a single server for all users
Centralized data:       a single on-line telephone book
Centralized algorithms: doing routing based on complete information
Scaling Techniques
1. Hiding communication latencies
2. Distribution
3. Replication

Scaling Techniques (1)
The difference between letting (a) a server or (b) a client check forms as they are being filled
Scaling Techniques (2)
An example of dividing the DNS name space into zones
Hardware Concepts
Different basic organizations and memories in distributed computer
systems
Multiprocessors (1)
A bus-based multiprocessor
Multiprocessors (2)
a) A crossbar switch
b) An omega switching network
Homogeneous Multicomputer Systems
a) Grid
b) Hypercube
Software Concepts
An overview of
DOS (Distributed Operating Systems)
NOS (Network Operating Systems)
Middleware
DOS: tightly-coupled OS for multiprocessors and homogeneous multicomputers.
Main goal: hide and manage hardware resources.

NOS: loosely-coupled OS for heterogeneous multicomputers (LAN and WAN).
Main goal: offer local services to remote clients.

Middleware: additional layer atop a NOS implementing general-purpose services.
Main goal: provide distribution transparency.
Uniprocessor Operating Systems
Separating applications from OS code through
a microkernel
Multiprocessor Operating Systems (1)
A monitor to protect an integer against concurrent access
monitor Counter {
private:
    int count = 0;
public:
    int value() { return count; }
    void incr() { count = count + 1; }
    void decr() { count = count - 1; }
}
Multiprocessor Operating Systems (2)
A monitor to protect an integer against concurrent access, but blocking a process
monitor Counter {
private:
    int count = 0;
    int blocked_procs = 0;
    condition unblocked;
public:
    int value() { return count; }
    void incr() {
        if (blocked_procs == 0)
            count = count + 1;
        else
            signal(unblocked);
    }
    void decr() {
        if (count == 0) {
            blocked_procs = blocked_procs + 1;
            wait(unblocked);
            blocked_procs = blocked_procs - 1;
        }
        else
            count = count - 1;
    }
}
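For comparison, here is a minimal sketch of the same blocking counter in Python (an illustration, not from the original slides), using threading.Condition in place of the monitor's condition variable; the while-loop re-check replaces the explicit blocked_procs bookkeeping:

import threading

class BlockingCounter:
    # Sketch of the blocking monitor: decr() waits while count == 0.
    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()   # lock + condition, as in a monitor

    def value(self):
        with self._cond:
            return self._count

    def incr(self):
        with self._cond:
            self._count += 1
            self._cond.notify()              # like signal(unblocked)

    def decr(self):
        with self._cond:
            while self._count == 0:          # re-check after each wakeup
                self._cond.wait()            # like wait(unblocked)
            self._count -= 1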
Multicomputer Operating Systems (1)
General structure of a multicomputer operating
system
Multicomputer Operating Systems (2)
Alternatives for blocking and buffering in message passing
Multicomputer Operating Systems (3)
Relation between blocking, buffering, and reliable communications:

Block sender until buffer not full:   send buffer needed: yes;  reliable comm. guaranteed: not necessary
Block sender until message sent:      send buffer needed: no;   reliable comm. guaranteed: not necessary
Block sender until message received:  send buffer needed: no;   reliable comm. guaranteed: necessary
Block sender until message delivered: send buffer needed: no;   reliable comm. guaranteed: necessary
Distributed Shared Memory Systems (1)
a) Pages of address space distributed among four machines
b) Situation after CPU 1 references page 10
c) Situation if page 10 is read only and replication is used
Distributed Shared Memory Systems (2)
False sharing of a page between two independent processes
Network Operating System (1)
General structure of a network operating system
Network Operating System (2)
Two clients and a server in a network operating system
Network Operating System (3)
Different clients may mount the servers in different places
Positioning Middleware
General structure of a distributed system as middleware
Middleware and Openness
In an open middleware-based distributed system, the
protocols used by each middleware layer should be the same,
as well as the interfaces they offer to applications
Comparison between Systems
A comparison between multiprocessor OS, multicomputer
OS, network OS, and middleware based distributed systems
Item                     | Distributed OS (multiproc.) | Distributed OS (multicomp.) | Network OS | Middleware-based OS
Degree of transparency   | Very high                   | High                        | Low        | High
Same OS on all nodes     | Yes                         | Yes                         | No         | No
Number of copies of OS   | 1                           | N                           | N          | N
Basis for communication  | Shared memory               | Messages                    | Files      | Model specific
Resource management      | Global, central             | Global, distributed         | Per node   | Per node
Scalability              | No                          | Moderately                  | Yes        | Varies
Openness                 | Closed                      | Closed                      | Open       | Open
Multitiered Architectures (1)
Alternative client-server organizations (a)-(e)
Multitiered Architectures (2)
An example of a server acting as a client
Modern Architectures
An example of horizontal distribution of a Web service
Issues in DOS
Synchronization within one system is hard
enough
Semaphores
Messages
Monitors

Synchronization among processes in a
distributed system is much harder

Outline
We begin with clocks and see how the semantic requirement for real time made Lamport's logical clocks possible
Given global clocks, virtual or real, we consider mutual exclusion
Centralized algorithms keep information in one place; the coordinator effectively becomes a monitor
Distribution handles mutual exclusion in parallel at the cost of O(N) messages per CS use
Token algorithms reduce messages under some circumstances but introduce heartbeat overhead
Each has strengths and weaknesses
Outline
Many distributed algorithms require a coordinator
Creating the need to select, monitor, and replace the
coordinator as required
Election algorithms provide a way to select a
coordinator
Bully algorithm
Ring algorithm
Transactions provide a high-level abstraction with significant power for organizing, expressing, and implementing distributed algorithms
Mutual Exclusion
Locking
Deadlock
Outline
Transactions are useful because they can be aborted
Concurrency control issues were considered
Locking
Optimistic
Deadlock
Detection
Prevention
Yet again
Distributed systems have the same problems
Only more so
Network Time Protocol
A requests the time of B at its own T1
B receives the request at its T2 and records it
B responds at its T3, sending the values of T2 and T3
A receives the response at its T4

Assume transit time is approximately the same both ways
Assume that B is the time server that A wants to synchronize to

A knows (T4 - T1) from its own clock
B reports T3 and T2 in response to the NTP request
A computes the total transit time as (T4 - T1) - (T3 - T2)
Network Time Protocol

One-way transit time is approximately half the total, i.e.,
    [(T4 - T1) - (T3 - T2)] / 2

B's clock at T4 reads approximately
    T3 + [(T4 - T1) - (T3 - T2)] / 2
Network Time Protocol

B's clock at T4 reads approximately (from the previous slide)
    T3 + [(T4 - T1) - (T3 - T2)] / 2

Thus, the difference (offset) between B's and A's clocks at T4 is
    T3 + [(T4 - T1) - (T3 - T2)] / 2 - T4  =  [(T2 - T1) + (T3 - T4)] / 2
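As a sanity check, this small Python sketch computes the standard NTP offset and round-trip delay from the four timestamps above (the values in the example call are made up):

def ntp_offset_and_delay(t1, t2, t3, t4):
    # t1: client send, t2: server receive, t3: server send, t4: client receive
    delay = (t4 - t1) - (t3 - t2)            # total transit time both ways
    offset = ((t2 - t1) + (t3 - t4)) / 2.0   # server clock minus client clock
    return offset, delay

offset, delay = ntp_offset_and_delay(10.0, 10.6, 10.7, 11.1)
print(offset, delay)  # 0.1 s estimated offset, 1.0 s total transit time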
What is stratum in NTP?

In the world of NTP, stratum levels define the distance from the reference clock. A reference clock is a stratum-0 device that is assumed to be accurate and has little or no delay associated with it. The reference clock typically synchronizes to the correct time (UTC) using GPS transmissions, CDMA technology or other time signals such as IRIG-B, WWV, DCF77, etc. Stratum-0 servers cannot be used on the network; instead, they are directly connected to computers which then operate as stratum-1 servers.

A server that is directly connected to a stratum-0 device is called a stratum-1 server. This includes all time servers
with built-in stratum-0 devices, such as the EndRun Time Servers, and also those with direct links to stratum-0
devices such as over an RS-232 connection or via an IRIG-B time code. The basic definition of a stratum-1 time
server is that it be directly linked (not over a network path) to a reliable source of UTC time such as GPS, WWV, or
CDMA transmissions. A stratum-1 time server acts as a primary network time standard.

A stratum-2 server is connected to the stratum-1 server OVER A NETWORK PATH. Thus, a stratum-2 server gets its
time via NTP packet requests from a stratum-1 server. A stratum-3 server gets its time via NTP packet requests from
a stratum-2 server, and so on.

As you progress through different strata there are network delays involved that reduce the accuracy of the NTP
server in relation to UTC. Timestamps generated by an EndRun Stratum 1 Time Server will typically have 10
microseconds accuracy to UTC. A stratum-2 server will have anywhere from 1/2 to 100 ms accuracy to UTC and each
subsequent stratum layer (stratum-3, etc.) will add an additional 1/2-100 ms of inaccuracy.

NTP Stratum
Yellow arrows indicate a direct
connection;
Red arrows indicate a network
connection.
Ref: http://en.wikipedia.org/wiki/NTP_server#Clock_strata
NTP uses a hierarchical, semi-layered system of levels of clock sources. Each level of this hierarchy is termed a stratum and is assigned a layer number, starting with 0 (zero) at the top. The stratum level defines its distance from the reference clock and exists to prevent cyclical dependencies in the hierarchy.
It is important to note that the stratum is not an indication of quality or reliability; it is common to find stratum-3 time sources that are higher quality than other stratum-2 time sources.
Network Time Protocol

Servers organized as strata
Stratum 0 server adjusts itself to WWV directly
Stratum 1 adjusts self to Stratum 0 servers
Etc.
Within a stratum, servers adjust with each other

If T_A is slow, add c to the clock rate, to speed it up gradually
If T_A is fast, subtract c from the clock rate, to slow it down gradually



WWV is the call sign of the United States National Institute of Standards and Technology's (NIST)
HF ("shortwave") radio station in Fort Collins, Colorado. WWV continuously transmits official U.S.
Government frequency and time signals on 2.5, 5, 10, 15 and 20 MHz. These carrier frequencies
and time signals are controlled by local atomic clocks traceable to NIST's primary standard in
Boulder, Colorado by GPS common view observations and other time transfer methods.

Ref: http://en.wikipedia.org/wiki/WWV_(radio_station)
Berkeley Algorithm
Time Daemon polls other systems
Computes average time
Tells other machines how to adjust their clocks
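A minimal sketch of the daemon's averaging step in Python (an illustration; RTT compensation is omitted and all names are made up):

def berkeley_adjustments(daemon_time, reported_times):
    # Average the daemon's clock with the polled clocks and return the
    # delta each machine should apply to its own clock.
    all_times = [daemon_time] + list(reported_times.values())
    avg = sum(all_times) / len(all_times)
    adjustments = {m: avg - t for m, t in reported_times.items()}
    adjustments["daemon"] = avg - daemon_time
    return adjustments

print(berkeley_adjustments(3.00, {"A": 3.25, "B": 2.75}))
# average is 3.00, so A slows down by 0.25 and B advances by 0.25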
Problem
Time not a reliable method of
synchronization
Users mess up clocks
(and forget to set their time zones!)
Unpredictable delays in Internet
Relativistic issues
If A and B are far apart physically, and two events T_A and T_B are very close in time, then which comes first? How do you know?
Example
At midnight PDT (Pacific Daylight Time), the bank posts interest to your account based on the current balance.
At 3:00 AM EDT (Eastern Daylight Time), you withdraw some cash.
Does interest get paid on
the cash you just
withdrew?
Depends upon which event
came first!
What if transactions made
on different replicas?
Solution: Logical Clocks
Not clocks at all
Just monotonic counters
Lamport's temporal logic
Definition: a → b means a occurs before b
I.e., all processes agree that a happens, then later b happens
E.g., send(message) → receive(message)
Logical Clocks
Every machine maintains its own logical clock C
Transmit C with every message
If C_received > C_own, then adjust C_own forward to C_received + 1

Result: anything that is known to follow something else in logical time has a larger logical clock value.
Logical Clocks
Clock Synchronization
When each machine has its own clock, an event
that occurred after another event may
nevertheless be assigned an earlier time.
Physical Clocks : Clock Synchronization
Maximum resolution desired for global time keeping
determines the maximum difference which can be
tolerated between synchronized clocks
The timekeeping of a clock, i.e. its tick rate dC/dt, should satisfy:
    1 - ρ <= dC/dt <= 1 + ρ
where ρ is the maximum drift rate. The worst possible divergence between two such clocks over an interval Δt is thus:
    δ = 2ρΔt
So the maximum time Δt between clock synchronization operations that can ensure a maximum difference δ is:
    Δt = δ / (2ρ)
Physical Clocks : Clock Synchronization
Cristian's Algorithm
Periodically poll the machine with access to the reference time source
Estimate round-trip delay with a time stamp
Estimate interrupt processing time
(figure 3-6, page 129, Tanenbaum)
Take a series of measurements to estimate the time it takes for a timestamp to make it from the reference machine to the synchronization target
This allows the synchronization to converge with a certain degree of confidence
Probabilistic algorithm and guarantee
Physical Clocks : Clock Synchronization
Wide availability of hardware and software to keep clocks
synchronized within a few milliseconds across the Internet
is a recent development
Network Time Protocol (NTP), discussed in papers by David Mills
A GPS receiver in the local network synchronizes other machines
What if all machines have GPS receivers?
Increasing deployment of distributed system algorithms
depending on synchronized clocks
Supply and demand constantly in flux
Physical Clocks - 1
Computation of the mean solar day.
Physical Clocks - 2
TAI seconds are of constant length, unlike solar
seconds. Leap seconds are introduced when
necessary to keep in phase with the sun.
Clock Synchronization Algorithms
The relation between clock time and UTC when clocks tick at different rates.


Clock Synchronization
Physical Clocks
Physical clock example: counter + holding register +
oscillating quartz crystal
The counter is decremented at each oscillation
Counter interrupts when it reaches zero
Reloads from the holding register
Interrupt = clock tick (often 60 times/second)
Software clock: counts interrupts
This value represents number of seconds since some
predetermined time (Jan 1, 1970 for UNIX systems;
beginning of the Gregorian calendar for Microsoft)
Can be converted to normal clock times

Clock Skew
In a distributed system each computer has its
own clock
Each crystal will oscillate at slightly different
rate.
Over time, the software clock values on the
different computers are no longer the same.

Clock Skew
Clock skew(offset): the difference between the
times on two different clocks
Clock drift : the difference between a clock
and actual time
Ordinary quartz clocks drift by ~1 sec in 11-12 days (10^-6 secs/sec).
High-precision quartz clocks have a somewhat better drift rate.
Various Ways of Measuring Time*
The sun
Mean solar second gradually getting longer as the earth's rotation slows.
International Atomic Time (TAI)
Atomic clocks are based on transitions of the cesium atom
Atomic second = value of solar second at some fixed time
(no longer accurate)
Universal Coordinated Time (UTC)
Based on TAI seconds, but more accurately reflects sun
time (inserts leap seconds to synchronize atomic second
with solar second)

Getting the Correct (UTC) Time*
WWV radio station or similar stations in other
countries (accurate to +/- 10 msec)
UTC services provided by earth satellites
(accurate to 0.5 msec)
GPS (Global Positioning System) (accurate to
20-35 nanoseconds)

Clock Synchronization Algorithms*
In a distributed system one machine may have
a WWV receiver and some technique is used
to keep all the other machines in synch with
this value.
Or, no machine has access to an external time
source and some technique is used to keep all
machines synchronized with each other, if not
with real time.
Clock Synchronization Algorithms
Network Time Protocol (NTP):
Objective: to keep all clocks in a system synchronized to UTC
time (1-50 msec accuracy) not so good in WAN
Uses a hierarchy of passive time servers
The Berkeley Algorithm:
Objective: to keep all clocks in a system synchronized to each
other (internal synchronization)
Uses active time servers that poll machines periodically
Reference broadcast synchronization (RBS)
Objective: to keep all clocks in a wireless system synchronized
to each other

Three Philosophies of Clock
Synchronization

Try to keep all clocks synchronized to real
time as closely as possible
Try to keep all clocks synchronized to each
other, even if they vary somewhat from UTC
time
Try to synchronize enough so that interacting
processes can agree upon an event order.
Refer to these clocks as logical clocks
6.2 Logical Clocks
Observation: if two processes (running on
separate processors) do not interact, it
doesn't matter if their clocks are not
synchronized.
Observation: When processes do interact,
they are usually interested in event order,
instead of exact event time.
Conclusion: Logical clocks are sufficient for
many applications
Formalization
The distributed system consists of n processes, p1, p2, ..., pn (e.g., an MPI group)
Each p_i executes on a separate processor
No shared memory
Each p_i has a state s_i

Process execution: a sequence of events
Changes to the local state
Message send or receive
Two Versions
Lamport's logical clocks: synchronize logical clocks
Can be used to determine an absolute ordering among a set of events, although the order doesn't necessarily reflect causal relations between events.
Vector clocks: can capture the causal
relationships between events.
Lamport's Logical Time
Lamport defined a happens-before relation
between events in a process.
"Events" are defined by the application. The
granularity may be as coarse as a procedure or
as fine-grained as a single instruction.

Happened Before Relation (a → b)
a → b: (pages 244-245)
a occurs before b in the same [sequential] process
a is a send and b is the corresponding receive in different processes (messages)
transitivity: if a → b and b → c, then a → c
If a → b, then a and b are causally related; i.e., event a potentially has a causal effect on event b.
Concurrent Events
Happens-before defines a partial order of
events in a distributed system.
Some events can't be placed in the order
a and b are concurrent (a || b) if ¬(a → b) and ¬(b → a).
If a and b aren't connected by the happened-before relation, there's no way one could affect the other.
Logical Clocks
Needed: method to assign a timestamp to event a
(call it C(a)), even in the absence of a global clock
The method must guarantee that the clocks have
certain properties, in order to reflect the definition of
happens-before.
Define a clock (event counter) C_i at each process (processor) P_i.
When an event a occurs, its timestamp ts(a) = C(a),
the local clock value at the time the event takes
place.


Correctness Conditions
If a and b are in the same process, and a → b, then C(a) < C(b)
If a is the event of sending a message from Pi, and b is the event of receiving the message by Pj, then Ci(a) < Cj(b).
The value of C must be increasing (time doesn't go backward).
Corollary: any clock corrections must be made by adding a positive number to a time.


Implementation Rules

Between any two successive events a and b in Pi, increment the local clock: Ci = Ci + 1; thus Ci(b) = Ci(a) + 1.
When a message m is sent from Pi, set its timestamp ts_m to Ci, the time of the send event after following the previous step.
When the message is received at Pj, the local time must be greater than ts_m. The rule is: Cj = max{Cj, ts_m} + 1.
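These rules fit in a few lines of code; here is a minimal Python sketch of them (an illustration, with made-up method names):

class LamportClock:
    # Sketch of the implementation rules for Lamport's logical clock.
    def __init__(self):
        self.time = 0

    def internal_event(self):
        self.time += 1                        # tick between successive events
        return self.time

    def send(self):
        self.time += 1                        # tick, then stamp the message
        return self.time                      # ts_m

    def receive(self, ts_m):
        self.time = max(self.time, ts_m) + 1  # Cj = max{Cj, ts_m} + 1
        return self.time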
Lamport's Logical Clocks (2)
Figure 6-9. (a) Three processes, each with its own clock. The
clocks run at different rates.
Event a: P1 sends m1 to
P2 at t = 6,
Event b: P2 receives m1
at t = 16.
If C(a) is the time m1
was sent, and C(b) is the
time m1 is received, do
C(a) and C(b) satisfy the correctness conditions?

Lamport's Logical Clocks (3)
Figure 6-9. (b) Lamport's algorithm corrects the clocks.
Event c: P3
sends m3 to
P2 at t = 60
Event d: P2
receives m3
at t = 56
Do C(c) and
C(d) satisfy
the
conditions?

Figure 6-10. The positioning of Lamport's logical clocks in distributed systems: handling clock management as a middleware operation.
Application layer: the application sends message m_i
Middleware layer: adjust the local clock, timestamp m_i, and send the message
Network layer: message m_i is received
Middleware layer: adjust the local clock, then deliver m_i to the application
Figure 5.3 (Advanced Operating Systems, Singhal and Shivaratri): how Lamport's logical clocks advance.
(Space-time diagram: P1 with events e11-e17, P2 with events e21-e25; eij represents event j on processor i.)
Which events are causally related? Which events are concurrent?
A Total Ordering Rule
(does not guarantee causality)
A total ordering of events can be obtained if
we ensure that no two events happen at the
same time (have the same timestamp).
Why? So all processors can agree on an
unambiguous order.
How? Attach the process number to the low-order end of the time, separated by a decimal point; e.g., an event at time 40 at process P1 is 40.1, and an event at time 40 at process P2 is 40.2
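The rule amounts to sorting by (time, process id) pairs; a tiny Python illustration (made-up example data):

def total_order_key(lamport_time, process_id):
    # Ties in Lamport time are broken by process id, mirroring 40.1 < 40.2.
    return (lamport_time, process_id)

events = [(40, 2), (40, 1), (39, 3)]
print(sorted(events, key=lambda e: total_order_key(*e)))
# [(39, 3), (40, 1), (40, 2)]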
Figure 5.3 (Singhal and Shivaratri), revisited: what is the total ordering of the events in these two processes?
Example: Total Order Multicast
Consider a banking database, replicated
across several sites.
Queries are processed at the geographically
closest replica
We need to be able to guarantee that DB
updates are seen in the same order
everywhere
Totally Ordered Multicast
Update 1: Process 1 at Site A adds $100 to an account,
(initial value = $1000)
Update 2: Process 2 at Site B increments the account by
1%
Without synchronization, it's possible that replica 1 = $1111 and replica 2 = $1110

Message 1: add $100.00
Message 2: increment account by 1%
The replica that sees the messages in the
order m1, m2 will have a final balance of
$1111
The replica that sees the messages in the
order m2, m1 will have a final balance of
$1110
The Problem
Site 1 has a final account balance of $1,111 after both transactions complete, and Site 2 has a final balance of $1,110.
Which is right? Either, from the standpoint of
consistency.
Problem: lack of consistency.
Both values should be the same
Solution: make sure both sites see/process all
messages in the same order.
Implementing Total Order
Assumptions:
Updates are multicast to all sites, including
(conceptually) the sender
All messages from a single sender arrive in the
order in which they were sent
No messages are lost
Messages are time-stamped with Lamport clock
values.

Implementation
When a process receives a message, put it in a
local message queue, ordered by timestamp.
Multicast an acknowledgement to all sites
Each ack has a timestamp larger than the
timestamp on the message it acknowledges
The message queue at each site will
eventually be in the same order
Implementation
Deliver a message to the application only when the
following conditions are true:
The message is at the head of the queue
The message has been acknowledged by all other
receivers. This guarantees that no update messages with
earlier timestamps are still in transit.
Acknowledgements are deleted when the message
they acknowledge is processed.
Since all queues have the same order, all sites
process the messages in the same order.
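The delivery test is the crux of the implementation; here is a minimal Python sketch of it under the stated assumptions (class and method names are made up):

import heapq

class TotalOrderQueue:
    # Sketch of the delivery test for totally ordered multicast. Messages and
    # acks carry (timestamp, sender) keys; a message is delivered only when it
    # heads the queue and every process has acknowledged it.
    def __init__(self, all_processes):
        self.all_processes = set(all_processes)
        self.queue = []          # heap of (timestamp, sender, payload)
        self.acks = {}           # (timestamp, sender) -> set of ackers

    def on_message(self, ts, sender, payload):
        heapq.heappush(self.queue, (ts, sender, payload))

    def on_ack(self, ts, sender, acker):
        self.acks.setdefault((ts, sender), set()).add(acker)

    def deliverable(self):
        # Pop and return all messages that now satisfy both conditions.
        out = []
        while self.queue:
            ts, sender, payload = self.queue[0]
            if self.acks.get((ts, sender), set()) >= self.all_processes:
                heapq.heappop(self.queue)
                out.append(payload)
            else:
                break
        return out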

Causality
Causally related events:
Event a may causally affect event b if a → b
Events a and b are causally related if either a → b or b → a.
If neither of the above relations holds, then there is no causal relation between a and b. We say that a || b (a and b are concurrent).

Vector Clock Rationale
Lamport clocks limitation:
If (a → b) then C(a) < C(b), but
if C(a) < C(b) then we only know ¬(b → a), i.e., either (a → b) or (a || b)
In other words, you cannot look at the clock values of events on two different processors and decide which one happens before.
Lamport clocks do not capture causality

Lamports Logical Clocks (3)
Figure 6-12. Suppose we add messages m2 and m3 to the scenario in Fig. 6-12(b).
Tsnd(m1) < Tsnd(m3): (6) < (32). Does this mean send(m1) → send(m3)?
But Tsnd(m1) < Tsnd(m2): (6) < (20). Does this mean send(m1) → send(m2)?

Figure 5.4 (space-time diagram: P1 with events e11, e12; P2 with events e21, e22; P3 with events e31, e32, e33; Lamport clock values in parentheses).
C(e11) < C(e22) and C(e11) < C(e32), but while e11 → e22, we cannot say e11 → e32, since there is no causal path connecting them. So, with Lamport clocks we can guarantee that if C(a) < C(b) then ¬(b → a); but by looking at the clock values alone we cannot say whether or not the events are causally related.
Vector Clocks How They Work
Each processor keeps a vector of values, instead of a
single value.
VC
i
is the clock at process i; it has a component for
each process in the system.
VC
i
[i] corresponds to P
i
s local time.
VC
i
[j] represents P
i
s knowledge of the time at P
j
(the # of events that P
i
knows have occurred at Pj

Each processor knows its own time exactly, and
updates the values of other processors clocks based
on timestamps received in messages.


Implementation Rules
IR1: Increment VC_i[i] before each new event.
IR2: When process i sends a message m, it sets m's (vector) timestamp to VC_i (after incrementing VC_i[i]).
IR3: When a process receives a message, it does a component-by-component comparison of the message timestamp to its local time and picks the maximum of the two corresponding components, adjusting the local components accordingly. Then it delivers the message to the application.
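A minimal Python sketch of these rules, plus the comparison used later to test causality (an illustration; the receive-side self-increment is an assumption, since some formulations omit it):

class VectorClock:
    # Sketch of IR1-IR3 for process i of n (0-based indices).
    def __init__(self, i, n):
        self.i = i
        self.v = [0] * n

    def internal_event(self):
        self.v[self.i] += 1                       # IR1

    def send(self):
        self.v[self.i] += 1                       # IR1 before stamping
        return list(self.v)                       # IR2: message timestamp

    def receive(self, ts):
        self.v = [max(a, b) for a, b in zip(self.v, ts)]  # IR3: componentwise max
        self.v[self.i] += 1    # count the receive as a local event
                               # (some formulations omit this increment)

def happened_before(a, b):
    # ts(a) < ts(b): componentwise <=, with at least one strict inequality.
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    return not happened_before(a, b) and not happened_before(b, a)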

Review
Physical clocks: hard to keep synchronized
Logical clocks: can provide some notion of relative
event occurrence
Lamport's logical time
happened-before relation defines causal relations
logical clocks don't capture causality
total ordering relation
use in establishing totally ordered multicast
Vector clocks
Unlike Lamport clocks, vector clocks capture causality
Have a component for each process in the system
Figure 5.5 (Singhal and Shivaratri). Vector clock values in a 3-process system, VC(Pi) = (vc1, vc2, vc3):
P1: e11 (1,0,0), e12 (2,0,0), e13 (3,0,0), e14 (4,5,2)
P2: e21 (0,1,0), e22 (2,2,0), e23 (2,3,1), e24 (2,4,2), e25 (2,5,2)
P3: e31 (0,0,1), e32 (0,0,2), e33 (0,0,3)
Establishing Causal Order
When Pi sends a message m to Pj, Pj knows
How many events occurred at Pi before m was sent
How many relevant events occurred at other sites before m
was sent (relevant = happened-before)
In Figure 5.5, VC(e24) = (2, 4, 2). Two events in P1 and two events in P3 happened before e24.
Even though P1 and P3 may have executed other events, those don't have a causal effect on e24.
Happened Before/Causally Related Events -
Vector Clock Definition
a → b iff ts(a) < ts(b)
(a happens before b iff the timestamp of a is less than the timestamp of b)
Events a and b are causally related if ts(a) < ts(b) or ts(b) < ts(a)
Otherwise, we say the events are concurrent.
Any pair of events that satisfies the vector clock definition of happens-before will also satisfy the Lamport definition, and vice-versa.
Comparing Vector Timestamps
Less than: ts(a) < ts(b) iff at least one component of ts(a) is strictly less than the corresponding component of ts(b), and all other components of ts(a) are less than or equal to the corresponding components of ts(b).
Examples: (3,3,5) < (3,4,5); (3,3,3) is not less than (3,3,3) (they are equal); (3,3,5) is not less than (3,2,4); (3,3,5) || (4,2,5).



Figure 5.4, with vector timestamps: P1 events e11 (1,0,0), e12 (2,0,0); P2 events e21 (0,1,0), e22 (2,2,0); P3 events e31 (0,0,1), e32 (0,0,2), e33 (0,0,3).
ts(e11) = (1,0,0) and ts(e32) = (0,0,2), which shows that the two events are concurrent.
ts(e11) = (1,0,0) and ts(e22) = (2,2,0), which shows that e11 → e22.
Causal Ordering of Messages
An Application of Vector Clocks
Premise: deliver a message only if messages that causally precede it have already been received
I.e., if send(m1) → send(m2), then it should be true that receive(m1) → receive(m2) at each site.
If messages are not related (send(m1) || send(m2)), delivery order is not of interest.

Compare to Total Order
Totally ordered multicast (TOM) is stronger
(more inclusive) than causal ordering (COM).
TOM orders all messages, not just those that are
causally related.
Weaker COM is often what is needed.
Enforcing Causal Communication
Clocks are adjusted only when sending or receiving messages; i.e., these are the only events of interest.
Send m: P_i increments VC_i[i] by 1 and applies timestamp ts(m).
Receive m: P_i compares VC_i to ts(m); it sets VC_i[k] to max{VC_i[k], ts(m)[k]} for each k.
Message Delivery Conditions
Suppose P_j receives message m from P_i.
Middleware delivers m to the application iff:
ts(m)[i] = VC_j[i] + 1
(all previous messages from P_i have been delivered)
ts(m)[k] <= VC_j[k] for all k != i
(P_j has received all messages that P_i had seen before it sent message m)

In other words, if a message m is received from P_i, you should also have received every message that P_i received before it sent m; e.g.,
if m is sent by P1 and ts(m) is (3, 4, 0) and you are P3, you should already have received exactly 2 messages from P1 and at least 4 from P2
if m is sent by P2 and ts(m) is (4, 5, 1, 3) and you are P3 and VC_3 is (3, 3, 4, 3), then you need to wait for a fourth message from P2 and at least one more message from P1.
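The delivery test itself is tiny; a Python sketch (0-based indices, so P1 is index 0; the example call reproduces the 4-process case above):

def causally_deliverable(ts_m, sender, vc_local):
    # Deliver m from `sender` iff ts(m)[sender] == VC[sender] + 1 (it is the
    # next message from that sender) and ts(m)[k] <= VC[k] for all other k
    # (no causally preceding message is missing).
    if ts_m[sender] != vc_local[sender] + 1:
        return False
    return all(ts_m[k] <= vc_local[k]
               for k in range(len(ts_m)) if k != sender)

# P3 holds VC = [3, 3, 4, 3]; a message from P2 (index 1) arrives with
# ts(m) = [4, 5, 1, 3]: not deliverable yet, as in the slide's example.
print(causally_deliverable([4, 5, 1, 3], 1, [3, 3, 4, 3]))  # False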



Figure 6-13. Enforcing causal communication.
P1 receives message m (timestamped (1,0,0)) from P0 before sending message m* (timestamped (1,1,0)) to P2; P2 must wait for the delivery of m before it may deliver m*.
(In this scheme a process increments its own clock only on message send; before sending or receiving any messages, each clock is (0, 0, 0).)
Cristian's Algorithm
Getting the current time from a time server.


One more introduction
Global Time & Global States of Distributed
Systems
Asynchronous distributed systems consist of several processes without common memory which communicate (solely) via messages with unpredictable transmission delays
Global time and global state are hard to realize in distributed systems
The rate of event occurrence is very high
Event execution times are very small
We can only approximate the global view
Simulate a synchronous distributed system on a given asynchronous system
Simulate a global time: clocks (physical and logical)
Simulate a global state: global snapshots
Simulate Synchronous
Distributed Systems
Synchronizers [Awerbuch 85]
Simulate clock pulses in such a way that a message is only generated
at a clock pulse and will be received before the next pulse
Drawback
Very high message overhead
The Concept of Time in Distributed Systems
A standard time is a set of instants with a temporal precedence order < satisfying certain conditions [Van Benthem 83]:
Irreflexivity
Transitivity
Linearity
Eternity (for all x there is a y: x < y)
Density (for all x, y with x < y there is a z: x < z < y)
Transitivity and irreflexivity imply asymmetry
A linearly ordered structure of time is not always adequate for distributed systems
It captures dependence, not independence, of distributed activities
Time as a partial order: a partially ordered system of vectors forming a lattice structure is a natural representation of time in a distributed system




Global time in distributed systems
An accurate notion of global time is difficult to achieve in
distributed systems.
Uniform notion of time is necessary for correct operation of many
applications (mission critical distributed control, online
games/entertainment, financial apps, smart environments etc.)
Clocks in a distributed system drift
Relative to each other
Relative to a real world clock
Determination of this real world clock itself may be an issue
Clock synchronization is needed to simulate global time
Physical Clocks vs. Logical clocks
Physical clocks are logical clocks that must not deviate from real time by more than a certain amount.
We often derive causality of events from loosely synchronized clocks

Physical Clock Synchronization
Physical Clocks
How do we measure real time?
17th century - Mechanical clocks based on
astronomical measurements
Solar Day - Transit of the sun
Solar Seconds - Solar Day/(3600*24)
Problem (1940) - Rotation of the earth varies (gets
slower)
Mean solar second - average over many days
Atomic Clocks
1948
Counting transitions of a cesium-133 (or quartz crystal) oscillator used as an atomic clock
The crystal oscillates at a well-known frequency
TAI - International Atomic Time
9,192,631,770 transitions of cesium-133 = 1 mean solar second (as of 1948)
UTC (Universal Coordinated Time)
From time to time a leap second is inserted to stay in phase with the sun (30+ times since 1958)
UTC is broadcast by several sources (satellites)


From Distributed Systems (cs.nju.edu.cn/distribute-systems/lecture-notes/)
How Clocks Work in Computers
Quartz crystal: oscillates at a well-defined frequency
Counter: each crystal oscillation decrements the counter by 1
Holding register: when the counter reaches 0, its value is reloaded from the holding register
When the counter reaches 0, an interrupt is also generated; this is called a clock tick
At each clock tick, an interrupt service procedure adds 1 to the time stored in memory (the software clock)
Accuracy of Computer Clocks
Modern timer chips have a relative error of
1/100,000 - 0.86 seconds a day
To maintain synchronized clocks
Can use UTC source (time server) to obtain
current notion of time
Use solutions without UTC.
Cristians (Time Server) Algorithm
Uses a time server to synchronize clocks
Time server keeps the reference time (say UTC)
A client asks the time server for time, the server responds with its
current time, and the client uses the received value to set its clock
But network round-trip time introduces errors
Let RTT = response-received-time - request-sent-time (measurable at the client).
If we know (a) min = the minimum client-server one-way transmission time and (b) that the server timestamped the message at the last possible instant before sending it back,
then the actual time could be anywhere in [T + min, T + RTT - min].

Cristians Algorithm
Client sets its clock to halfway between T + min and T + RTT - min, i.e., at T + RTT/2
Expected (i.e., average) skew in client clock time = (RTT/2 - min)
Can increase clock value, should never decrease it.
Can adjust speed of clock too (either up or down is ok)
Multiple requests to increase accuracy
For unusually long RTTs, repeat the time request
For non-uniform RTTs
Drop values beyond threshold; Use averages (or weighted
average)
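A sketch of one round in Python (an illustration; `fetch_server_time` is a hypothetical stand-in for the network request that returns the server's timestamp T):

import time

def cristian_estimate(fetch_server_time):
    # One round of Cristian's algorithm: returns the estimated server time
    # at the moment the reply arrives (T + RTT/2) plus the measured RTT.
    t_sent = time.monotonic()
    server_t = fetch_server_time()
    rtt = time.monotonic() - t_sent
    return server_t + rtt / 2.0, rtt

A client would repeat this, discarding rounds with unusually large RTTs and averaging the rest, as described above.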
Berkeley UNIX algorithm
One daemon without UTC
Periodically, this daemon polls and asks all the
machines for their time
The machines respond.
The daemon computes an average time and
then broadcasts this average time.
Decentralized Averaging Algorithm
Each machine has a daemon without UTC
Periodically, at fixed agreed-upon times, each
machine broadcasts its local time.
Each of them calculates the average time by
averaging all the received local times.
Distributed Time Service
Software service that provides precise, fault-tolerant clock
synchronization for systems in local area networks (LANs) and
wide area networks (WANs).
determine duration, perform event sequencing and
scheduling.
Each machine is either a time server or a clerk
software components on a group of cooperating systems;
client obtains time from DTS(distributed time system) entity
DTS entities
DTS server
DTS clerk that obtain time from DTS servers on other hosts
Clock Synchronization in DCE
(Distributed Computing Environment)
DCE's time model is an interval, i.e., time in DCE is actually an interval
Comparing 2 times may yield 3 answers:
t1 < t2, t2 < t1, not determined
Periodically a clerk obtains time-intervals from several servers ,e.g. all
the time servers on its LAN
Based on their answers, it computes a new time and gradually
converges to it.
Compute the intersection where the intervals overlap. Clerks then adjust
the system clocks of their client systems to the midpoint of the computed
intersection.
When clerks receive a time interval that does not intersect with the
majority, the clerks declare the non-intersecting value to be faulty.
Clerks ignore faulty values when computing new times, thereby ensuring
that defective server clocks do not affect clients.
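A minimal sketch of the clerk's intersect-and-midpoint step in Python (an illustration under simplified assumptions: a real clerk would first vote out intervals that miss the majority intersection, while this sketch just fails if there is no common overlap):

def dce_midpoint(intervals):
    # Intersect the reported [lo, hi] time intervals, return the midpoint.
    lo = max(a for a, b in intervals)
    hi = min(b for a, b in intervals)
    if lo > hi:
        raise ValueError("no common intersection: some server clock is faulty")
    return (lo + hi) / 2.0

print(dce_midpoint([(9.8, 10.4), (10.0, 10.6), (10.1, 10.5)]))  # 10.25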
Network Time Protocol (NTP)
Most widely used physical clock synchronization protocol on
the Internet (http://www.ntp.org)
Currently used: NTP V3 and V4
10-20 million NTP servers and clients in the Internet
Claimed Accuracy (Varies)
milliseconds on WANs, submilliseconds on LANs, submicroseconds using a precision time source
Nanosecond NTP in progress

NTP Design
Hierarchical tree of time servers.
The primary server at the root
synchronizes with the UTC.
The next level contains
secondary servers, which act as a
backup to the primary server.
At the lowest level is the
synchronization subnet which
has the clients.

Logical Clock Synchronization
Causal Relations
Distributed application results in a set of distributed
events
Induces a partial order causal precedence relation
Knowledge of this causal precedence relation is
useful in reasoning about and analyzing the
properties of distributed computations
Liveness and fairness in mutual exclusion
Consistency in replicated databases
Distributed debugging, checkpointing

Logical Clocks
Used to determine causality in distributed systems
Time is represented by non-negative integers
Event structures represent distributed computation
(in an abstract way)
A process can be viewed as consisting of a sequence of
events, where an event is an atomic transition of the local
state which happens in no time
Process Actions can be modeled using the 3 types of
events
Send Message
Receive Message
Internal (change of state)
Logical Clocks
A logical clock C is some abstract mechanism which assigns to any event e ∈ E the value C(e) of some time domain T such that certain conditions are met:
C: E → T, where (T, <) is a partially ordered set and e < e' ⇒ C(e) < C(e') holds (the clock condition)
Consequences of the clock condition [Morgan 85]:
Events occurring at a particular process are totally ordered by their local sequence of occurrence
If an event e occurs before event e' at some single process, then event e is assigned a logical time earlier than the logical time assigned to event e'
For any message sent from one process to another, the logical time of the send event is always earlier than the logical time of the receive event
Each receive event has a corresponding send event
The future cannot influence the past (causality relation)
Event Ordering
Lamport defined the happens-before (=>) relation:
If a and b are events in the same process, and a occurs before b, then a => b.
If a is the event of a message being sent by one process and b is the event of the message being received by another process, then a => b.
If x => y and y => z, then x => z.
If a => b then time(a) < time(b).
Event Ordering: the example
(Space-time diagram: P1 with events e11-e14, P2 with events e21-e23, P3 with events e31, e32, against global time.)
Program order: e13 < e14
Send-receive: e23 < e12
Transitivity: e21 < e32
Processor order: e precedes e' in the same process
Send-receive: e is a send and e' is the corresponding receive
Transitivity: there exists e'' such that e < e'' and e'' < e'
Causal Ordering
Happens Before also called causal ordering
Possible to draw a causality relation between 2
events if
They happen in the same process
There is a chain of messages between them
Happens Before notion is not straightforward in
distributed systems
No guarantees of synchronized clocks
Communication latency

Implementation of Logical Clocks
Requires
Data structures local to every process to represent logical time and
a protocol to update the data structures to ensure the consistency condition.
Each process Pi maintains data structures that give it the following two capabilities:
A local logical clock, denoted by LC_i, that helps process Pi measure its own progress.
A logical global clock, denoted by GC_i, that is a representation of process Pi's local view of the logical global time. Typically, LC_i is a part of GC_i.
The protocol ensures that a process's logical clock, and thus its view of the global time, is managed consistently.
The protocol consists of the following two rules:
R1: This rule governs how the local logical clock is updated by a process when it executes an event.
R2: This rule governs how a process updates its global logical clock to update its view of the global time and global progress.
Types of Logical Clocks
Systems of logical clocks differ in their
representation of logical time and also in the
protocol to update the logical clocks.
3 kinds of logical clocks
Scalar
Vector
Matrix
Scalar Logical Clocks - Lamport
Proposed by Lamport in 1978 as an attempt to totally
order events in a distributed system.
Time domain is the set of non-negative integers.
The logical local clock of a process pi and its local
view of the global time are squashed into one integer
variable Ci .
Monotonically increasing counter
No relation with real clock
Each process keeps its own logical clock used to
timestamp events

Consistency with Scalar Clocks
To guarantee the clock condition, local clocks must
obey a simple protocol:
When executing an internal event or a send event at
process P
i
the clock C
i
ticks
C
i
+= d (d>0)
When P
i
sends a message m, it piggybacks a logical
timestamp t which equals the time of the send event
When executing a receive event at P
i
where a message
with timestamp t is received, the clock is advanced
C
i
= max(C
i
,t)+d (d>0)
Results in a partial ordering of events.
Total Ordering
Extending partial order to total order


Global timestamps: (Ta, Pa), where Ta is the local timestamp and Pa is the process id.
(Ta, Pa) < (Tb, Pb) iff (Ta < Tb) or ((Ta = Tb) and (Pa < Pb))
Total order is consistent with partial order.
Independence
Two events e and e' are mutually independent (i.e., e || e') if ¬(e < e') ∧ ¬(e' < e)
Two events are independent if they have the same timestamp
Events which are causally independent may get the same or different timestamps
By looking at the timestamps of events it is not possible to assert that some event could not influence some other event:
If C(e) < C(e') then ¬(e' < e); however, it is not possible to decide whether e < e' or e || e'
C is an order homomorphism which preserves < but does not preserve negations (i.e., it obliterates a lot of structure by mapping E into a linear order)
Problems with Total Ordering
A linearly ordered structure of time is not always adequate for
distributed systems
It captures dependence of events but loses their independence: it artificially enforces an ordering for events that need not be ordered, and thus loses information
Mapping partially ordered events onto a linearly ordered set of integers loses information
Events which may happen simultaneously may get different timestamps, as if they happened in some definite order.
A partially ordered system of vectors forming a lattice
structure is a natural representation of time in a distributed
system
Vector Times
The system of vector clocks was developed independently by Fidge,
Mattern and Schmuck.
To construct a mechanism by which each process gets an
optimal approximation of global time
In the system of vector clocks, the time domain is represented by a set of
n-dimensional non-negative integer vectors.
Each process has a clock C_i consisting of a vector of length n, where n is the total number of processes: vt[1..n], where vt[j] is the local logical clock of P_j and describes the logical time progress at process P_j.
A process P_i ticks by incrementing its own component of its clock: C_i[i] += 1
The timestamp C(e) of an event e is the clock value after ticking
Each message gets a piggybacked timestamp consisting of the vector of the local clock
The receiving process gets some knowledge about the other processes' time approximation:
C_i = sup(C_i, t), where sup(u, v) = w with w[i] = max(u[i], v[i]) for all i
Vector Clocks example
An Example of vector clocks
From A. Kshemkalyani and M. Singhal (Distributed Computing)
Figure 3.2: Evolution of vector time.
Vector Times (cont)
Because of the transitive nature of the scheme, a process may receive time updates about clocks in non-neighboring processes
Since process P_i can advance the i-th component of global time, it always has the most accurate knowledge of its local time
At any instant of real time, for all i, j: C_i[i] >= C_j[i]
For two time vectors u, v:
u <= v iff for all i: u[i] <= v[i]
u < v iff u <= v and u != v
u || v iff ¬(u < v) ∧ ¬(v < u)  (|| is not transitive)
Structure of the Vector Time
For any n > 0, (N^n, <=) is a lattice
The set of possible time vectors of an event set E is a sublattice of (N^n, <=)
For an event set E, the lattice of consistent cuts and the lattice of possible time vectors are isomorphic
For e, e' ∈ E: e < e' iff C(e) < C(e'), and e || e' iff C(e) || C(e')
In order to determine whether two events e and e' are causally related or not, just take their timestamps C(e) and C(e'):
If C(e) < C(e') ∨ C(e') < C(e), then the events are causally related
Otherwise, they are causally independent
Matrix Time
Vector time contains information about latest direct
dependencies
What does Pi know about Pk
Also contains info about latest direct dependencies
of those dependencies
What does Pi know about what Pk knows about Pj
Message and computation overheads are high
Powerful and useful for applications like distributed
garbage collection
Time Manager Operations
Logical Clocks
C.adjust(L,T)
adjust the local time displayed by clock C to T (can be gradually,
immediate, per clock sync period)
C.read
returns the current value of clock C
Timers
TP.set(T) - reset the timer to timeout in T units
Messages
receive(m,l); broadcast(m); forward(m,l)

Simulate A Global State
The notions of global time and global state are closely related
A process can (without freezing the whole computation)
compute the best possible approximation of a global state
[Chandy & Lamport 85]
A global state that could have occurred
No process in the system can decide whether the state did really
occur
Guarantee stable properties (i.e. once they become true, they remain
true)
Event Diagram
(Processes P1, P2, P3 against time, with events e11-e13 on P1, e21-e25 on P2, and e31-e34 on P3.)
Equivalent Event Diagram: Rubber Band Transformation
(The same computation redrawn by stretching the time axis; the partial order of events is preserved.)
Poset Diagram
(Processes P1-P4 with events e11, e12, e21, e22, e31, e41, e42, and a cut separating the past from the rest.)
Chandy-Lamport Distributed Snapshot
Algorithm
Assumes FIFO communication in channels
Uses a control message, called a marker to separate messages in the
channels.
After a site has recorded its snapshot, it sends a marker, along all of its
outgoing channels before sending out any more messages.
The marker separates the messages in the channel into those to be included in
the snapshot from those not to be recorded in the snapshot.
A process must record its snapshot no later than when it receives a marker
on any of its incoming channels.
The algorithm terminates after each process has received a marker on all
of its incoming channels.
All the local snapshots get disseminated to all other processes and all the
processes can determine the global state.

Chandy-Lamport Distributed Snapshot
Algorithm
Marker receiving rule for Process Pi
If (Pi has not yet recorded its state) it
records its process state now
records the state of c as the empty set
turns on recording of messages arriving over other channels
else
Pi records the state of c as the set of messages received over c
since it saved its state
Marker sending rule for process Pi
After Pi has recorded its state, for each outgoing channel c:
Pi sends one marker message over c (before it sends any other message over c)
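The two rules translate directly into code; here is a minimal Python sketch for one process (an illustration assuming FIFO channels; `send` is a made-up stand-in for the transport):

class SnapshotProcess:
    # Sketch of the Chandy-Lamport marker rules at one process.
    def __init__(self, in_channels, out_channels):
        self.recorded_state = None
        self.in_channels = list(in_channels)
        self.out_channels = list(out_channels)
        self.channel_msgs = {c: [] for c in in_channels}  # msgs since snapshot
        self.channel_state = {}                           # finished channels

    def start_snapshot(self, local_state, send):
        # Record own state, then send a marker on every outgoing channel
        # before any other message.
        self.recorded_state = local_state
        for c in self.out_channels:
            send(c, "MARKER")

    def on_marker(self, channel, local_state, send):
        if self.recorded_state is None:            # first marker seen
            self.start_snapshot(local_state, send)
            self.channel_state[channel] = []       # state of this channel: empty
        else:
            # messages that arrived on `channel` after we recorded our state
            self.channel_state[channel] = self.channel_msgs[channel]
        # snapshot is complete once a marker arrived on every incoming channel
        return len(self.channel_state) == len(self.in_channels)

    def on_message(self, channel, msg):
        if self.recorded_state is not None and channel not in self.channel_state:
            self.channel_msgs[channel].append(msg)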
Computing Global States without FIFO
Assumption
Algorithm
All processes agree on some future virtual time s, or on a set of virtual time instants s1, ..., sn which are mutually concurrent and have not yet occurred
A process takes its local snapshot at virtual time s
After time s the local snapshots are collected to construct a global snapshot
P_i ticks and then fixes its next time s = C_i + (0, ..., 0, 1, 0, ..., 0) to be the common snapshot time
P_i broadcasts s
P_i blocks, waiting for all the acknowledgements
P_i ticks again (setting C_i = s), takes its snapshot, and broadcasts a dummy message (i.e., forces everybody else to advance their clocks to a value > s)
Each process takes its snapshot and sends it to P_i when its local clock becomes > s
Computing Global States without FIFO
Assumption (cont)
Inventing an (n+1)-st virtual process whose clock is managed by P_i:
P_i can use its own clock, and because the virtual clock C_{n+1} ticks only when P_i initiates a new run of the snapshot:
The first n components of the vector can be omitted
The first broadcast phase is unnecessary
A counter modulo 2 suffices, giving 2 states:
White (before the snapshot)
Red (after the snapshot)
Every message is red or white, indicating whether it was sent before or after the snapshot
Each process (which is initially white) becomes red as soon as it receives a red message for the first time, and starts a virtual broadcast algorithm to ensure that all processes will eventually become red

Computing Global States without FIFO
Assumption (cont)
Virtual broadcast
Dummy red messages to all processes
Flood the network by using a protocol where a process sends dummy red
messages to all its neighbors
Messages in transit
White messages received by red process
Target process receives the white message and sends a copy to the initiator
Termination
Distributed termination detection algorithm [Mattern 87]
Deficiency counting method
Each process has a counter which counts messages sent minus messages received.
Thus, it is possible to determine the number of messages still in transit.

Ordering and Cuts
A distributed system is:
A collection of sequential processes p1, p2, p3, ..., pn
A network capable of implementing communication channels between pairs of processes for message exchange
Channels are reliable but may deliver messages out of order
Every process can communicate with every other process (though possibly not directly)
There is no reasoning based on global clocks
All kinds of synchronization must be done by message passing
Distributed Computation
A distributed computation is a single execution of a distributed
program by a collection of processes. Each sequential process
generates a sequence of events that are either internal events, or
communication events

The local history of process p_i during a computation is a (possibly infinite) sequence of events h_i = e_i^1, e_i^2, ...
A partial local history of a process is a prefix of the local history: h_i^n = e_i^1, e_i^2, ..., e_i^n
The global history of a computation is the set H = h_1 ∪ h_2 ∪ ... ∪ h_n
Global history
It is just the collection of events that have
occurred in the system
It does not give us any idea about the relative
times between the events
As there is no notion of global time, events
can only be ordered based on a notion of
cause and effect
Happened Before Relation (→)
If a and b are events in the same process and a occurs before b, then a → b
If a is the sending of a message m by a process and b is the corresponding receive event, then a → b
Finally, if a → b and b → c, then a → c
If neither a → b nor b → a, then a and b are concurrent
→ defines a partial order on the set H
Space Time Diagram
Graphical representation of a distributed system
If there is a path between two events then they are
related
Else they are concurrent
Is this notion of ordering really important?
Some idea of ordering of events is fundamental to reason
about how a system works
Global State Detection is a fundamental problem in
distributed computing
Enables detecting stable properties of a system
How do we get a snapshot of the system when there is no
notion of global time or shared memory
How do we ensure that the state collected is consistent
Use this problem to illustrate the importance of ordering
This will also give us the notion of what is a consistent
global state
Global States and Cuts
Global State is a n-tuple of local states one for
each process
A cut is a subset of the global history that contains an initial prefix of each local history
Therefore every cut is a natural global state
Intuitively a cut partitions the space time
diagram along the time axis
A Cut is identified by the last event of each
process that is part of the cut
Example of a Cut
Introduction to consistency
Consider this solution for the common problem of
deadlock detection
System has 3 processes p1, p2, p3
An external process p0 sends a message to each process
(Active Monitoring)
Each process on getting this message reports its local
state
Note that this global state thus collected at p0 is a cut
p0 uses this information to create a wait-for graph
Consider the space-time diagram below and the cut C shown there: a cycle is formed in the wait-for graph.
So what went wrong?
p0 detected a cycle when there was no deadlock
The state recorded contained a message received by p3 which p1 never sent
The system could never be in such a state, and hence the state p0 saw was inconsistent
So we need to make sure that application see
consistent states

So what is a consistent global state?
A cut C is consistent if, for all events e and e':
(e ∈ C) ∧ (e' → e) ⇒ e' ∈ C
Intuitively, if an event is part of a cut then all events that happened before it must also be part of the cut
A consistent cut defines a consistent global state
The notion of ordering is needed after all!
Passive Deadlock Detection
Let's change our approach to deadlock detection
p0 now monitors the system passively
Each process sends p0 a message when an event occurs
What global state does p0 now see?
Basically, all hell breaks loose
FIFO Channels
Communication channels need not preserve message order
Therefore p0 can construct any permutation of events as a global state
Some of these may not even be valid runs (events of the same process may not be in order)
Implement FIFO channels using sequence numbers:
send_i(m) → send_i(m')  ⇒  deliver_j(m) → deliver_j(m')
Now we know that p0 constructs valid runs
But the issue of consistency still remains
Ok lets now fix consistency
Assume a global real-time clock and a bound δ on the message delay
(Don't panic: we shall get rid of this assumption soon)
RC(e): time when event e occurs
Each process reports the global timestamp to p0 along with the event
Delivery rule at p0: at time t, deliver all received messages with timestamps up to t - δ, in increasing timestamp order
So do we have a consistent state now?

Clock Condition
e is observed before e' iff RC(e) < RC(e')
Recall our definition of consistency: (e ∈ C) ∧ (e' → e) ⇒ e' ∈ C
Therefore the state is consistent iff: e → e' ⇒ RC(e) < RC(e')
This is the clock condition
For timestamps from a global clock this is obviously true
Can we satisfy it for asynchronous systems?
Logical Clocks
It turns out that the clock condition can be satisfied in asynchronous systems as well
RC is defined such that the clock condition holds if:
a and b are events of the same process and a comes before b, then RC(a) < RC(b)
a is the send of a message and b is the corresponding receive, then RC(a) < RC(b)
Lamport's Clocks
Local variable LC in every process
LC: a kind of logical clock, a simple counter that assigns timestamps to events
Every send event is timestamped
LC modification rules:
LC(e_i) = LC + 1, if e_i is an internal event or a send
LC(e_i) = max{LC, TS(m)} + 1, if e_i is receive(m)
Example of Logical Clocks
(Space-time diagram: p1, p2, p3 exchange messages; each event is labeled with its Lamport clock value, e.g. 1, 2, 3, 4, 5.)
Observations on Lamport's Clocks
Lamport says: a → b implies C(a) < C(b)
However: does C(a) < C(b) imply a → b?
Solution: Vector Clocks
Clock (C) is a vector of length n
C[i]: own logical time
C[j]: best guess about j's logical time
Vector Clocks Example
(Space-time diagram: p1 events stamped (1,0,0), (2,0,0), (3,4,1); p2 events stamped (0,1,0), (2,2,0), (2,3,1), (2,4,1); p3 event stamped (0,0,1).)
Let's formalise the idea
C[i] is incremented between successive local events
On receiving a message m timestamped t:
for all k: C[k] = max(C[k], t[k])
It can be shown that both directions of the relation hold
So are Lamport clocks useful only for finding
global state?
Definitely not!!!
Mutual Exclusion using Lamport clocks
Only one process can use resource at a time
Requests are granted in the order in which they
are made
If every process releases the resource then every
request is eventually granted
Assumptions
FIFO reliable channels
Direct connection between processes

Algorithm
(Space-time trace: p1 broadcasts a request timestamped (1,1) and p2 a request timestamped (1,2); each process queues both requests and sends timestamped replies.)
p1 has higher-timestamped reply messages from p2 and p3, and its own request is at the top of the queue, so p1 enters.
p1 sends release, and now p2 enters.
Algorithm Summary
Requesting CS
Send timestamped REQUEST
Place request on request queue
On receiving REQUEST
Put request on queue
Send back timestamped REPLY
Enter CS if
Received a larger-timestamped REPLY from every other process
Own request is at the head of the queue
Releasing CS
Send RELEASE message
On receiving RELEASE remove request
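A compact Python sketch of this summary at one process (an illustration; `broadcast` and `send` are hypothetical stand-ins for a reliable FIFO transport):

import heapq

class LamportMutex:
    # Sketch of Lamport's mutual exclusion algorithm at one process.
    def __init__(self, pid, peers):
        self.pid, self.peers = pid, set(peers)
        self.clock = 0
        self.queue = []                  # heap of (timestamp, pid) requests
        self.replies = set()
        self.my_req = None

    def _tick(self, ts=0):
        self.clock = max(self.clock, ts) + 1

    def request(self, broadcast):
        self._tick()
        self.my_req = (self.clock, self.pid)
        heapq.heappush(self.queue, self.my_req)
        self.replies = set()
        broadcast(("REQUEST", self.my_req))

    def on_request(self, req, send):
        self._tick(req[0])
        heapq.heappush(self.queue, req)
        send(req[1], ("REPLY", self.clock, self.pid))  # timestamped reply

    def on_reply(self, ts, sender):
        self._tick(ts)
        self.replies.add(sender)

    def can_enter(self):
        # Enter iff own request heads the queue and every peer has replied
        # (replies necessarily carry larger timestamps than the request).
        return (self.my_req is not None and self.queue
                and self.queue[0] == self.my_req
                and self.replies >= self.peers)

    def release(self, broadcast):
        heapq.heappop(self.queue)        # own request is at the head here
        self.my_req = None
        broadcast(("RELEASE", self.pid))

    def on_release(self, pid):
        self.queue = [r for r in self.queue if r[1] != pid]
        heapq.heapify(self.queue)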
Global State Revisited
Earlier in the talk we had discussed the
problem where a process actively tries to get
the global state
Solution to the problem that calculates only
consistent global states
Model
Process only knows about its internal events
Messages it sends and receives
Requirements
Each process records its own local state
The state of the communication channels is
recorded
All these small parts form a consistent whole
State Detection must run along with
underlying computation
FIFO reliable channels
Global States
What exactly is channel state?
Let c be a channel from p to q
p records its local state (Lp) and so does q (Lq)
p has some sends in Lp whose receives may not be in Lq
It is these sent messages that are the state of c
Intuitively: the messages in transit when the local states were collected



Distributed Mutual Exclusion
Agenda for Spring 2013
Lamport's Mutual Exclusion Algorithm
Ricart & Agrawala Algorithm
Token Ring Mutex Algorithm
Bully Algorithm
Ring Algorithm
Suzuki-Kasami's broadcast Algorithm
Raymond's Tree Based Algorithm



Mutual Exclusion?
A condition in which there is a set of
processes, only one of which is able to access
a given resource or perform a given function
at any time

Mutual Exclusion
Distributed components still need to coordinate their
actions, including but not limited to access to shared
data
Mutual exclusion over some limited set of operations and data is thus required
Consider several approaches and compare and contrast
their advantages and disadvantages
Centralized Algorithm
The single central process is essentially a monitor
Central server becomes a semaphore server
Three messages per use: request, grant, release
Centralized performance constraint and point of failure
Mutual Exclusion: Distributed Algorithm Factors
Functional Requirements
1) Freedom from deadlock
2) Freedom from starvation
3) Fairness
4) Fault tolerance
Performance Evaluation
Number of messages
Latency
Throughput (e.g. of a semaphore system)
Synchronization is always overhead and must be accounted for as a cost
Mutual Exclusion: Distributed Algorithm Factors
Performance should be evaluated under a variety of loads
Cover a reasonable range of operating conditions
We care about several types of performance
Best case
Worst case
Average case
Different aspects of performance are important for different reasons and in different contexts
Centralized Systems
Mutual exclusion via:
Test & set
Semaphores
Messages
Monitors
Distributed Mutual Exclusion
Assume there is agreement on how a resource
is identified
Pass identifier with requests

Create an algorithm to allow a process to
obtain exclusive access to a resource
Distributed Mutual Exclusion
Centralized Algorithm
Token Ring Algorithm
Distributed Algorithm
Decentralized Algorithm

Centralized algorithm
Mimic single processor system
One process elected as coordinator
(figure) P sends request(R) to coordinator C, receives grant(R), and later sends release(R)
1. Request resource
2. Wait for response
3. Receive grant
4. Access resource
5. Release resource
Centralized algorithm
If another process claimed resource:
Coordinator does not reply until release
Maintain queue
Service requests in FIFO order
(figure) P0 sends request(R) and receives grant(R). While P0 holds R, P1 and P2 send request(R); the coordinator queues P1 and then P2 without replying. When P0 sends release(R), the coordinator dequeues P1 and sends it grant(R).
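A minimal sketch of this coordinator's message handlers (the send callback and the message names are assumptions, not part of the slides):

from collections import deque

class Coordinator:
    """Sketch of the centralized mutex coordinator (hypothetical message API)."""

    def __init__(self, send):
        self.send = send          # send(pid, msg) callback, an assumption
        self.holder = None        # process currently granted the resource
        self.queue = deque()      # FIFO queue of waiting processes

    def on_request(self, pid):
        if self.holder is None:
            self.holder = pid
            self.send(pid, "grant")     # resource free: grant immediately
        else:
            self.queue.append(pid)      # busy: queue, do not reply yet

    def on_release(self, pid):
        self.holder = self.queue.popleft() if self.queue else None
        if self.holder is not None:
            self.send(self.holder, "grant")  # serve waiters in FIFO order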
Centralized algorithm
Benefits
Fair
All requests processed in order
Easy to implement, understand, verify

Problems
Process cannot distinguish being blocked from a
dead coordinator
Centralized server can be a bottleneck
Token Ring algorithm
Assume known group of processes
Some ordering can be imposed on group
Construct logical ring in software
Process communicates with neighbor

(figure) Processes P0 through P5 arranged in a logical ring; each process communicates with its neighbor
Token Ring algorithm
Initialization
Process 0 gets token for resource R
Token circulates around the ring: from P_i to P_(i+1) mod N
When process acquires token
Checks to see if it needs to enter critical section
If no, send token to neighbor
If yes, access resource
Hold token until done

(figure) token(R) circulating around the ring P0 through P5
Token Ring algorithm
Only one process at a time has token
Mutual exclusion guaranteed
Order well-defined
Starvation cannot occur
If token is lost (e.g. process died)
It will have to be regenerated
Does not guarantee FIFO order
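A minimal sketch of one node's token handler under these rules; the send() transport and the critical-section hook are assumptions:

class TokenRingNode:
    """Sketch of one token-ring mutex node (hypothetical message API)."""

    def __init__(self, pid, n, send, critical_section):
        self.pid, self.n, self.send = pid, n, send
        self.critical_section = critical_section  # callable run inside the CS
        self.wants_cs = False

    def on_token(self):
        if self.wants_cs:
            self.critical_section()                 # hold token while in CS
            self.wants_cs = False
        self.send((self.pid + 1) % self.n, "TOKEN") # pass to P_(i+1) mod N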
Lamport's Mutual Exclusion
Each process maintains a request queue
Contains mutual exclusion requests

Requesting critical section:
Process P_i sends request(i, T_i) to all nodes
Places request on its own queue
When a process P_j receives a request, it returns a timestamped ack
Lamport's Mutual Exclusion
Entering critical section (accessing resource):
P_i has received a message (ack or release) from every other process with a timestamp larger than T_i
P_i's request has the earliest timestamp in its queue

Difference from Ricart-Agrawala:
Everyone responds always; there is no hold-back
A process decides to go based on whether its request is the earliest in its queue
Lamport's Mutual Exclusion
Releasing critical section:
Remove request from its own queue
Send a timestamped release message

When a process receives a release message:
It removes the request for that process from its queue
This may cause its own entry to have the earliest timestamp in the queue, enabling it to access the critical section
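Putting the three rules together, a minimal sketch of one participant might look like this; the transport send(), the peer list, and all method names are assumptions, not part of the slides:

import heapq

class LamportMutex:
    """Sketch of Lamport's mutual exclusion at one process (hypothetical API)."""

    def __init__(self, pid, peers, send):
        self.pid, self.peers, self.send = pid, peers, send
        self.clock = 0
        self.queue = []        # min-heap of (timestamp, pid) requests
        self.last = {}         # peer -> latest (timestamp, pid) seen from it
        self.my_req = None

    def _tick(self, ts=None):
        self.clock = max(self.clock, ts[0] if ts else 0) + 1

    def request_cs(self):
        self._tick()
        self.my_req = (self.clock, self.pid)
        heapq.heappush(self.queue, self.my_req)      # own queue too
        for p in self.peers:
            self.send(p, ("REQUEST", self.my_req))

    def on_message(self, kind, ts, sender):
        self._tick(ts)
        self.last[sender] = ts                       # ack and release both count
        if kind == "REQUEST":
            heapq.heappush(self.queue, ts)
            self._tick()
            self.send(sender, ("ACK", (self.clock, self.pid)))
        elif kind == "RELEASE":
            self.queue = [r for r in self.queue if r[1] != sender]
            heapq.heapify(self.queue)

    def can_enter(self):
        # L1: a newer message from every peer; L2: own request at queue head
        return (self.my_req is not None and self.queue
                and self.queue[0] == self.my_req
                and all(self.last.get(p, (0, 0)) > self.my_req
                        for p in self.peers))

    def release_cs(self):
        self.queue = [r for r in self.queue if r != self.my_req]
        heapq.heapify(self.queue)
        self.my_req = None
        self._tick()
        for p in self.peers:
            self.send(p, ("RELEASE", (self.clock, self.pid)))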
Mutual Exclusion: Lamport's Algorithm
Every site keeps a request queue sorted by logical timestamp
Uses Lamport's logical clocks to impose a total global order on events associated with synchronization
The algorithm assumes ordered message delivery between every pair of communicating sites
Messages sent from site S_i in a particular order arrive at S_j in the same order
Note: since messages arriving at a given site come from many sources, the delivery order of all messages can easily differ from site to site
Lamport's Algorithm: Request Resource r
1) Site S_i sends REQUEST(ts_i, i) to all S_j ∈ R_r and places the request on its local request_queue_i
   (R_r is the set of all processes using resource r)
2) When site S_j receives REQUEST(ts_i, i) from site S_i, it returns a timestamped REPLY message to site S_i and places the request on request_queue_j

Thus, each site has a request queue containing resource use requests and replies
Note that the requests and replies for any given pair of sites must be in the same order in the queues at both sites
Because of the message order delivery assumption
Lamport's Algorithm
Entering CS for Resource r
Site S_i enters the CS protecting the resource when:
L1) Site S_i has received a message from all other sites with a timestamp larger than (ts_i, i)
    This ensures that no message from any site with a smaller timestamp could ever arrive
L2) The request from site S_i is at the head of its queue request_queue_i
    This ensures that no other site will enter the CS
Recall that requests to all potential users of the resource, and replies from them, go into the request queues of all processes, including the sender of the message
Lamport's Algorithm
Releasing the CS
The site holding the resource is releasing it; call that site S_i
1) Site S_i removes its request from the front of request_queue_i and sends a RELEASE(r, i) message to all S_j ∈ R_r
   Note that the request for resource r had to be at the head of the request_queue at the site holding the resource, or it would never have entered the CS
2) When site S_j receives a RELEASE(r, i) message, it removes REQUEST(ts_i, i) from request_queue_j
   Note that the request may or may not have been at the head of the request_queue at the receiving site
Lamport ME Example (figure): P_i requests with timestamp (i,5) and P_j with (j,10); after the replies, both sites queue (j10, i5). P_i's request is earliest, so P_i enters its critical section. After release(i5), the queues hold only (j10) and P_j enters its critical section.
Lamport's Algorithm
Comments
Performance: 3(N−1) messages per CS invocation, since each invocation requires (N−1) REQUEST, (N−1) REPLY, and (N−1) RELEASE messages
Observation: some REPLY messages are not required
If S_i sends a request to S_j and then receives a REQUEST from S_j with a timestamp smaller than its own REQUEST, S_i need not send a reply to S_j, because it already has enough information to make a decision
This reduces the message count to between 2(N−1) and 3(N−1)
As a distributed algorithm there is no single point of failure, but there is increased overhead
Ricart & Agrawala algorithm
Distributed algorithm using reliable multicast and
logical clocks
Process wants to enter critical section:
Compose message containing:
Identifier (machine ID, process ID)
Name of resource
Timestamp (totally-ordered Lamport)
Send request to all processes in group
Wait until everyone gives permission
Enter critical section / use resource
Ricart & Agrawala algorithm
When process receives request:
If receiver not interested:
Send OK to sender
If receiver is in critical section
Do not reply; add request to queue
If receiver just sent a request as well:
Compare timestamps: received & sent messages
Earliest wins
If receiver is loser, send OK
If receiver is winner, do not reply, queue
When done with critical section
Send OK to all queued requests
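A minimal sketch of this receive rule, assuming a node object holding a Lamport-timestamped pending request (all names are hypothetical):

class RANode:
    """Sketch of the Ricart & Agrawala receive rule (hypothetical node)."""

    def __init__(self, send):
        self.send = send
        self.in_cs = False
        self.my_ts = None        # (lamport_time, pid) of own pending request
        self.deferred = []       # requesters to answer after leaving the CS

    def on_request(self, ts, sender):
        if not self.in_cs and self.my_ts is None:
            self.send(sender, "OK")          # not interested: OK immediately
        elif self.in_cs or self.my_ts < ts:
            self.deferred.append(sender)     # we win (or are inside): defer
        else:
            self.send(sender, "OK")          # their timestamp is earlier

    def on_exit_cs(self):
        self.in_cs = False
        for p in self.deferred:              # release all deferred replies
            self.send(p, "OK")
        self.deferred.clear()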
Ricart & Agrawala algorithm
N points of failure
A lot of messaging traffic
Demonstrates that a fully distributed
algorithm is possible
Ricart and Agrawala
Refine Lamport's mutual exclusion by merging the REPLY and RELEASE messages
Assumption: total ordering of all events in the system, implying the use of Lamport's logical clocks with tie-breaking
Request CS (P) operation:
1) Site S_i requesting the CS creates a REQUEST(ts_i, i) message and sends it to all processes using the CS, including itself
Messages are assumed to be reliably delivered in order
Group communication support can play an obvious role
Ricart and Agrawala
Receive a CS Request
If the receiver is not currently in the CS and does not have
pending request for it in its request_queue
Send REPLY
If the receiver is already in the CS
Queue the request, sending no reply
If the receiver desires the CS but has not entered:
Compare the timestamp of its own request to that just received
REPLY if the received request is earlier (the earliest timestamp wins)
Queue the received request if its own pending request is earlier
Ricart and Agrawala
Enter a CS
A process enters the CS when it receives a REPLY from
every member of the group that can use the CS
Leave a CS
When the process leaves the CS it sends a REPLY to the
senders of all pending messages on its queue
Ricart and Agrawala
Example 1 (figure): I requests with timestamp (i,8) and K with (k,12); J sends OK to both; I's timestamp is smaller, so I enters the CS, and K enters after receiving OK(i)
Ricart and Agrawala
Example 2 (figure): I, J and K request with timestamps (i,7), (j,8), (k,9); deferred requests are queued as (j8), (j8, k9), (k9); the processes enter the CS in timestamp order: first I, then J, then K
Ricart and Agrawala
Observations
The algorithm works because the global logical clock
ensures a global total ordering on events
This ensures, in turn, that the decision about who enters
the CS is unambiguous
Single point of failure is now N points of failure
A crashed group member cannot be distinguished from a
busy CS
Distributed and optimized version is N times more
vulnerable than the centralized version!
An explicit message denying entry helps reliability, but converts blocking into busy waiting
Ricart and Agrawala
Observations
Either group communication support is used, or each
user of the CS must keep track of all other potential
users correctly
Powerful motivation for standard group communication
primitives
Argument against a centralized server said that a single
process involved in each CS decision was bad
Now we have N processes involved in each decision
Improvements: get a majority - Maekawa's algorithm
Bottom Line: a distributed algorithm is possible
Shows theoretical and practical challenges of designing
distributed algorithms that are useful
Characteristics of Decentralized Algorithms

No machine has complete information about the system state

Machines make decisions based only on local information

Failure of one machine does not ruin the algorithm

There is no implicit assumption that a global clock exists
Decentralized Algorithm
Based on the Distributed Hash Table (DHT)
system structure previously introduced
Peer-to-peer
Object names are hashed to find the successor
node that will store them
Here, we assume that n replicas of each object
are stored

Placing the Replicas
The resource is known by a unique name:
rname
Replicas: rname-0, rname-1, ..., rname-(n-1)
rname-i is stored at succ(rname-i), where names
and site names are hashed as before
If a process knows the name of the resource it
wishes to access, it also can generate the hash
keys that are used to locate all the replicas
The Decentralized Algorithm
Every replica has a coordinator that controls
access to it (the coordinator is the node that
stores it)
For a process to use the resource it must
receive permission from m > n/2 coordinators
This guarantees exclusive access as long as a
coordinator only grants access to one process
at a time
The Decentralized Algorithm
The coordinator notifies the requester when it
has been denied access as well as when it is
granted
Requester must count the votes, and decide
whether or not overall permission has been
granted or denied
If a process (requester) gets fewer than m
votes it will wait for a random time and then
ask again
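A minimal sketch of the requester's side of this voting scheme; request_vote() and the back-off bounds are assumptions, not part of the slides:

import random
import time

def acquire(coordinators, m, request_vote):
    """Sketch of the decentralized voting loop (request_vote is an assumption).

    coordinators: the coordinators of the n replicas; m > n/2 votes required.
    request_vote(c) returns True if coordinator c grants its vote.
    """
    while True:
        votes = sum(request_vote(c) for c in coordinators)
        if votes >= m:
            return                       # majority permission obtained
        # fewer than m votes: back off for a random time, then ask again
        # (a fuller version would also release any votes it did obtain)
        time.sleep(random.uniform(0.01, 0.1))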
Analysis
If a resource is in high demand, multiple
requests will be generated
It's possible that processes will wait a long time to get permission
Deadlock?
Resource usage drops
Analysis
More robust than the central coordinator
approach and the distributed approaches. If
one coordinator goes down others are
available.
If a coordinator fails and resets then it will not
remember having granted access to one
requestor, and may then give access to another.
According to the authors, it is highly unlikely that
this will lead to a violation of mutual exclusion.

Token Passing Mutex
General structure:
One token per CS; the token denotes permission to enter
Only the process with the token is allowed in the CS
Token passed from process to process around a logical ring
Mutex:
Pass token to process (i + 1) mod N
Received token gives permission to enter CS
Hold the token while in the CS
Must pass the token after exiting the CS
Fairness ensured: each process waits at most N−1 entries to get the CS
Token Passing Mutex
Correctness is obvious
No starvation since passing is in strict order
Difficulties with token passing mutex
Idle case of no process entering CS pays overhead of
constantly passing the token
Lost tokens: diagnosis and creating a new token
Duplicate tokens: ensure generation of only one token
Crashes: require a receipt to detect dead destinations
Receipts double the message overhead
Design challenge: holding time for unneeded
token
Too short: high overhead; too long: high CS latency

Mutex Comparison
Centralized
Simplest and most efficient
Centralized coordinator crashes create the need to detect crash
and choose a new coordinator
M/use: 3; Entry Latency: 2
Distributed
3(N-1) messages per CS use (Lamport)
2(N-1) messages per CS use (Ricart & Agrawala)
If any process crashes with a non-empty queue, the algorithm won't work
M/use: 2(N-1); Entry Latency: 2(N-1)
Mutex Comparison
Token Ring
Ensures fairness
Overhead is subtle no longer linked to CS use
M/use: 1 to ∞; Entry Latency: 0 to N−1
This algorithm pays overhead when idle
Need methods for re-generating a lost token
Design Principle: building fault handling into algorithms
for distributed systems is hard
Crash recovery is subtle and introduces overhead in normal
operation
Performance Metrics: M/use and Entry Latency
Centralized approaches often necessary
Best choice in mutex, for example
Need method of electing a new coordinator when it fails
General assumptions
Give processes unique system/global numbers (e.g. PID)
Elect process using a total ordering on the set
All processes know process number of members
All processes agree on new coordinator
All processes do not know which members are up or down; the election algorithm is responsible for determining this
Design challenge: network delay vs. crashed
peer
Election Algorithms
Bully Algorithm
Suppose the coordinator doesn't respond to P1's request
P1 holds an election by sending an election message to all processes
with higher numbers
If P1 receives no responses, P1 is the new coordinator
If any higher numbered process responds, P1 ends its election
Process receives an election request
Reply to the sender tells it that it has lost the election
Holds an election of its own
Eventually all but highest surviving process give up
Process recovering from a crash takes over if highest
Example:
- Processes 0-7, 4 detects that 7 has crashed
- 4 holds election and loses
- 5 holds election and loses
- 6 holds election and wins
- Message overhead variable
- Who starts an election matters
- Solid lines say "Am I leader?"
- Dotted lines say "you lose"
- Hollow lines say "I won"
- 6 becomes the coordinator
- When 7 recovers it is a bully and sends I win to all
Bully Algorithm (figure): eight processes numbered 0 to 7; election messages flow toward higher-numbered processes
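A minimal sketch of the bully election at one process, assuming messaging helpers send_election() (returns whether a higher process answered in time) and send_coordinator(); all names are hypothetical:

class BullyNode:
    """Sketch of the bully election (messaging helpers are assumptions)."""

    def __init__(self, pid, all_pids, send_election, send_coordinator):
        self.pid, self.all_pids = pid, all_pids
        self.send_election = send_election        # True if the peer replied
        self.send_coordinator = send_coordinator  # announce "I won"
        self.coordinator = None

    def hold_election(self):
        higher = [p for p in self.all_pids if p > self.pid]
        replies = [self.send_election(p) for p in higher]  # "Am I leader?"
        if any(replies):
            return            # a higher process is alive and takes over
        self.coordinator = self.pid                # nobody higher answered: win
        for p in self.all_pids:
            if p != self.pid:
                self.send_coordinator(p)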
Ring Algorithm
Processes have a total order known by all
Each process knows its successor, forming a ring
Ring: mod N, so the successor of P_i is P_(i+1) mod N
No token involved
Any process P_i noticing that the coordinator is not responding:
Sends an ELECTION message to its successor P_(i+1) mod N
If the successor is down, sends to the next member (after a timeout)
The receiving process adds its number to the message and passes it along
Ring Algorithm
When the ELECTION message gets back to the election initiator:
Change the message to COORDINATOR
Circulate it to all members
The coordinator is the highest process in the total order
All processes know the order, and thus all will agree no matter how the election started
Strength: only one coordinator is chosen
Weakness: scalability, since latency increases with N because the algorithm is sequential
Ring Algorithm
What if more than one process detects a crashed coordinator?
More than one election will be produced: a message storm
All messages will contain the same information: member process numbers and the order of members
The same coordinator is chosen (highest number)
A refinement might include filtering duplicate messages
Some duplicates will happen
Consider two elections chasing each other
Eliminate the one initiated by the lower-numbered process
Duplicates continue until the lower election reaches the source of the higher one
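A minimal sketch of the two message handlers, assuming each node knows its pid and successor and has a send() transport (names are hypothetical; termination of the COORDINATOR circulation is simplified):

class RingElectionNode:
    """Sketch of the ring election handlers (send/successor are assumptions)."""

    def __init__(self, pid, successor, send):
        self.pid, self.successor, self.send = pid, successor, send
        self.coordinator = None

    def start_election(self):
        self.send(self.successor, ("ELECTION", [self.pid]))

    def on_election(self, members):
        if self.pid in members:
            # the message came back around: the highest number wins
            self.send(self.successor, ("COORDINATOR", max(members)))
        else:
            # add own number to the message and pass it along
            self.send(self.successor, ("ELECTION", members + [self.pid]))

    def on_coordinator(self, winner):
        self.coordinator = winner
        if winner != self.pid:                   # circulate the result once
            self.send(self.successor, ("COORDINATOR", winner))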
The Bully Algorithm (figure, continued):
d) Process 6 tells 5 to stop
e) Process 6 wins and tells everyone
A Ring Algorithm (figure): an election algorithm using a ring
Mutual Exclusion:
A Centralized Algorithm
a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted
b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply.
c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2
A Distributed Algorithm
a) Two processes want to enter the same critical region at the same moment.
b) Process 0 has the lowest timestamp, so it wins.
c) When process 0 is done, it sends an OK also, so 2 can now enter the critical
region.
A Token Ring Algorithm
a) An unordered group of processes on a
network.
b) A logical ring constructed in software.
Comparison
A comparison of three mutual exclusion algorithms:

Algorithm   | Messages per entry/exit | Delay before entry (in message times) | Problems
Centralized | 3                       | 2                                     | Coordinator crash
Distributed | 2(n−1)                  | 2(n−1)                                | Crash of any process
Token ring  | 1 to ∞                  | 0 to n−1                              | Lost token, process crash



Token Based Algorithms
Token based Algorithms
In token-based algorithms, a unique token is shared among all sites. A site
is allowed to enter its CS if it possesses the token.
Depending upon the way a site carries out its search for the token, there
are numerous token-based algorithms.
Suzuki-Kasami's broadcast algorithm
Singhal's heuristic algorithm
Raymond's tree-based algorithm
Token-based algorithms use sequence numbers instead of timestamps.
Every request for the token contains a sequence number and the
sequence numbers of sites advance independently.
A site increments its sequence number counter every time it makes a
request for the token.
A primary function of the sequence numbers is to distinguish between old
and current requests.


Raymond's algorithm
Raymond's algorithm: data structures
Per-processor variables:
Boolean token_holder
Boolean inCS
current_dir: the neighbor that is in the direction of the token (or self if this processor holds the token)
requests_queue: FIFO queue holding the IDs of neighbors from which requests for the token arrived (may also contain self)
Raymond's algorithm: entry and exit code

Request_CS:
1  if not token_holder                 // if this processor holds the token it enters the CS immediately; otherwise:
2    if requests_queue.isEmpty()       //   if the queue is empty, send a request for the token
3      send(current_dir, REQUEST)      //   (if the queue is non-empty, a request for the token was already sent)
4    requests_queue.enqueue(self)      //   enqueue self, since this request is on behalf of this processor
5    wait until token_holder is true   //   when token_holder is set, this processor has the token
6  inCS ← true                         // and may enter the CS

Release_CS:
7   inCS ← false                              // no longer in the critical section
8   if not requests_queue.isEmpty()           // if requests are waiting:
9     current_dir ← requests_queue.dequeue()  //   dequeue the next hop for the earliest request; update the orientation of the token
10    send(current_dir, TOKEN)                //   send the TOKEN to it
11    token_holder ← false                    //   this processor no longer holds the token
12    if not requests_queue.isEmpty()         //   if more requests remain in this processor's queue,
13      send(current_dir, REQUEST)            //     send another request for the token
Raymond's algorithm: monitoring
A listener thread responds to protocol messages at all times.

Monitor_CS:
1   while (true)
2     wait for a REQUEST or a TOKEN message
    REQUEST:                                   // upon a request:
3     if token_holder                          //   if the current processor holds the token:
4       if inCS                                //     if it is in the CS, the request must wait:
5         requests_queue.enqueue(sender)       //       enqueue the direction of the requesting processor
6       else                                   //     it holds the token but is not in the CS, hence its requests queue is empty:
7         current_dir ← sender                 //       record the new orientation of the token
8         send(current_dir, TOKEN)             //       send the token to where the request came from
9         token_holder ← false                 //       mark that this processor no longer holds the token
10    else                                     //   the current processor does not hold the token:
11      if requests_queue.isEmpty()            //     if the requests queue is empty,
12        send(current_dir, REQUEST)           //       send a request in the direction of the token
13      requests_queue.enqueue(sender)         //     enqueue the direction of this request
    TOKEN:                                     // upon the arrival of the token:
14    current_dir ← requests_queue.dequeue()   //   dequeue the oldest request; the token's new orientation is its direction
15    if current_dir = self
16      token_holder ← true                    //   the request was by this processor: it has the token and may enter the CS
17    else
18      send(current_dir, TOKEN)               //   otherwise, send the token in the direction of the request
19      if not requests_queue.isEmpty()        //   if the queue is non-empty,
20        send(current_dir, REQUEST)           //     send another request for the token
Raymond's algorithm: execution scenario
DME: Suzuki-Kasami's Algorithm
The Suzuki-Kasami broadcast algorithm is a token-based mutual exclusion algorithm
A unique token is shared among all processes
If a process possesses the token, it is allowed to enter its critical section
If it does not possess the token, it broadcasts a request message to all other processes
The process that possesses the token sends it to the requesting process, provided the token-owner is not in the critical section
Otherwise, the token is sent after leaving the critical section
In order to know the actual requests, every process keeps an array of integers RN_i[1..n], where n is the number of processes, i is the ID of process p_i, and RN_i[j] is the largest sequence number of requests received so far in a request message from process p_j
The token itself has an internal FIFO queue in order to save the requests, and an array of integers LN[1..n], where n is the number of processes and LN[j] is the sequence number of the request that process p_j executed most recently
Thereby it is possible to check whether a process's request has already been executed
So this array holds the most recent known executed request of every process


Requesting the critical section:
If the requesting process does not have the token, it increments its sequence number RN_i[i] and sends a REQUEST(i, sn) message to all other sites
When a site S_j receives this message, it sets RN_j[i] to max(RN_j[i], sn); if S_j has the idle token, it sends the token to S_i if RN_j[i] = LN[i] + 1
Executing the CS:
Site S_i executes the CS when it has received the token
Releasing the CS: having finished the execution of the CS, site S_i takes the following actions:
It sets the LN[i] element of the token array equal to RN_i[i]
For every site S_j whose ID is not in the token queue, it appends that ID to the token queue if RN_i[j] = LN[j] + 1
If the token queue is nonempty after the above update, it deletes the top site ID from the queue and sends the token to the site indicated by that ID
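A minimal sketch of these rules at site i; the send() transport and the dictionary layout of the token are assumptions, not part of the slides:

from collections import deque

class SuzukiKasami:
    """Sketch of Suzuki-Kasami at site i (send() and token layout assumed)."""

    def __init__(self, i, n, has_token, send):
        self.i, self.n, self.send = i, n, send
        self.rn = [0] * n                # RN_i[j]: largest sn seen from p_j
        self.token = {"ln": [0] * n, "q": deque()} if has_token else None
        self.in_cs = False

    def request_cs(self):
        if self.token is None:
            self.rn[self.i] += 1         # new sequence number for this request
            for j in range(self.n):
                if j != self.i:
                    self.send(j, ("REQUEST", self.i, self.rn[self.i]))
        else:
            self.in_cs = True            # already holding the idle token

    def on_request(self, j, sn):
        self.rn[j] = max(self.rn[j], sn)
        # pass an idle token if p_j's request is current (RN[j] = LN[j] + 1)
        if self.token and not self.in_cs and self.rn[j] == self.token["ln"][j] + 1:
            t, self.token = self.token, None
            self.send(j, ("TOKEN", t))

    def on_token(self, t):
        self.token = t
        self.in_cs = True                # may now execute the CS

    def release_cs(self):
        self.in_cs = False
        t = self.token
        t["ln"][self.i] = self.rn[self.i]          # record own executed request
        for j in range(self.n):                    # append new outstanding sites
            if j != self.i and j not in t["q"] and self.rn[j] == t["ln"][j] + 1:
                t["q"].append(j)
        if t["q"]:
            j = t["q"].popleft()
            self.token = None
            self.send(j, ("TOKEN", t))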
DME: Raymonds Tree-Based Algorithm
Developed by Kerry Raymond in 1989
In progression:
Ricart/Agrawala : 2*(N-1) messages
Suzuki/Kasami : N messages
Maekawa : sqrt(N) messages
Raymond : log(N) messages


Deadlocks
Deadlocks: An Introduction
What are DEADLOCKS?
A blocked process which can never be resolved unless there is some outside intervention
For example: resource R1 is requested by process P1 but is held by process P2
Illustrating A Deadlock
Wait-For Graph (WFG)
Nodes: processes in the system
Directed edges: the wait-for blocking relation
A cycle represents a deadlock
Starvation: a process's execution is permanently halted
(figure) Process 1 waits for Resource 2, held by Process 2; Process 2 waits for Resource 1, held by Process 1
Causes of Deadlocks
Mutual exclusion: resources being held must be in non-shareable mode
Hold and wait: a process is holding one resource and is waiting for another, which is held by another process
No preemption: a resource cannot be preempted even if it is being requested
Circular wait: presence of a cycle of waiting processes
Deadlocks in Distributed Systems
Resource deadlock
The most common; occurs due to lack of a requested resource
Communication deadlock
A process waits for certain messages before it can proceed
Handling Deadlocks
Deadlock Avoidance
Only fulfill those resource requests that won't cause deadlock in the future
Simulate resource allocation and determine if the resultant state is safe or not
Drawbacks:
Inefficient
Requires prior resource-requirement information for all processes
High cost of scalability
Handling Deadlocks
Deadlock Prevention
Provide all required resources from the start itself
Prioritize processes; assign resources accordingly
Make prior rules, e.g. process P1 cannot request resource R1 unless it releases resource R2
Drawbacks:
Inefficient and affects concurrency
Future resource requirements are unpredictable
Starvation is possible
Handling Deadlocks
Deadlock Detection
Resource allocation with an optimistic outlook.
Periodically examine process status.
Detect, then break the deadlock
Resolution: roll back one or more processes to break the dependency
Deadlock Detection: Control Organizations
Centralized deadlock detection: one control node (coordinator) maintains the global WFG and searches for cycles
Distributed deadlock detection: each node is equally responsible for maintaining the global WFG and detecting deadlocks
Hierarchical deadlock detection: nodes are organized in a tree, where each site detects deadlocks involving only its descendants
Deadlock Detection Algorithms
Centralized: Ho-Ramamoorthy's one- and two-phase algorithms
Distributed: Obermarck's path-pushing algorithm; Chandy-Misra-Haas edge-chasing algorithm
Hierarchical: Menasce-Muntz algorithm; Ho-Ramamoorthy's algorithm
Centralized Deadlock Detection
Ho-Ramamoorthy's 1-Phase Algorithm
Each site maintains two status tables: a process table and a resource table
One of the sites becomes the central control site
The central control site periodically asks for the status tables
The control site builds the WFG using the status tables
It analyzes the WFG and resolves any cycles present
Shortcomings:
Phantom deadlocks
High storage and communication costs
Phantom Deadlocks
(figure) Processes P0, P1, P2 and resources R, S, T spread over systems A and B
P1 releases resource S and asks for resource T
Two messages are sent to the control site:
1. Releasing S
2. Waiting for T
Message 2 arrives at the control site first, so the control site builds a WFG with a cycle, detecting a phantom deadlock
Centralized Deadlock Detection
Ho-Ramamoorthy's 2-Phase Algorithm
Each site maintains a status table for processes: resources locked and resources awaited
Phase 1:
The control site periodically asks for these locked & awaited tables
It then searches for the presence of cycles in these tables
Phase 2:
If cycles are found in the phase 1 search, the control site makes a second request for the tables
The details found common to both table requests are analyzed for cycle confirmation
Shortcomings:
Phantom deadlocks
Distributed Deadlock Detection
Obermarck's Path-Pushing Algorithm
Individual sites maintain local WFGs
A virtual node x exists at each site; node x represents external processes
Detection process:
Case 1: if site S_n finds a cycle not involving x, a deadlock exists
Case 2: if site S_n finds a cycle involving x, a deadlock is possible
Site S_n sends a message containing its detected cycles to the other sites; all sites receive the message, update their WFGs and re-evaluate the graph
In Case 2, suppose site S_j receives the message:
Site S_j checks for local cycles; if a cycle is found not involving x (of S_j), a deadlock exists
If site S_j finds a cycle involving x, it forwards the message to other sites
The process continues until a deadlock is found
Distributed Deadlock Detection
Chandy-Misra-Haas Edge Chasing algorithm.
The blocked process sends probe message to the
resource holding process.
Probe message contains:
- ID of blocked process.
- ID of process sending the message.
- ID of process to which the message was sent.
When probe is received by blocked process it
forwards it to processes holding the requested
resources.
If Blocked Process receives its own probe ->
Deadlock Exists.
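A minimal sketch of this probe rule, assuming each process object knows whether it is blocked and which processes it waits for (all names are hypothetical):

class CMHProcess:
    """Sketch of Chandy-Misra-Haas probe handling (fields are assumptions)."""

    def __init__(self, pid, send):
        self.pid, self.send = pid, send
        self.blocked = False
        self.waiting_for = []            # pids holding resources we wait on
        self.deadlock_detected = False

    def block_on(self, holders):
        self.blocked, self.waiting_for = True, list(holders)
        for h in holders:                # probe = (initiator, sender, receiver)
            self.send(h, (self.pid, self.pid, h))

    def on_probe(self, probe):
        initiator, _, receiver = probe
        if receiver != self.pid:
            return
        if initiator == self.pid:
            self.deadlock_detected = True         # own probe came back: cycle
        elif self.blocked:
            for h in self.waiting_for:            # forward along wait-for edges
                self.send(h, (initiator, self.pid, h))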
Hierarchical Deadlock Detection
Menasce-Muntz Algorithm
Sites (controllers) are organized in a tree structure
Leaf controllers manage the local WFGs
Upper controllers handle deadlock detection
Each parent node maintains a global WFG that is the union of the WFGs of its children, and detects deadlocks among its children
Changes are propagated upwards in the tree
Hierarchical Deadlock Detection
Ho-Ramamoorthy's Algorithm
Sites are grouped into clusters
Periodically one site is chosen as the central control site, which chooses a control site for each other cluster
The control site for each cluster collects the status graph there (the Ho-Ramamoorthy 1-phase centralized algorithm is used)
All control sites forward their status reports to the central control site, which combines the WFGs and performs the cycle search
Summary
Centralized deadlock detection algorithms: large communication overhead; the coordinator is a performance bottleneck; possibility of a single point of failure
Distributed deadlock detection algorithms: high complexity; detection of phantom deadlocks possible
Hierarchical deadlock detection algorithms: most common; efficient
Deadlock Prevention
Ordered resource allocation is the most common example
Consider the link with two-phase locking: grow and shrink phases
Works, but requires a global view of all resources
A total order on resources must exist for the system
Process code must allocate resources in order
Under-utilizes resources when the period of use of a resource conflicts with the total resource order
Consider processes P_i and P_k using resources R_1 and R_2:
P_i uses R_1 90% of its execution time and R_2 10%
P_k uses R_2 90% of its execution time and R_1 10%
One process holds one resource far too long
Deadlock Avoidance
General method: refuse allocations that may lead to deadlock
Method for keeping track of states
Need to know the resources required by a process
Banker's algorithm:
Must know the maximum number of resources allocated to each P_i
Keep track of the resources available
For each request, make sure the maximum need will not exceed the total available
Under-utilizes resources
Never used in practice: advance knowledge is not available and it is CPU-intensive
Deadlock Detection and Resolution
Attractive for two main reasons
Prevention and avoidance are hard, have significant
overhead, and require information difficult or impossible
to obtain
Deadlock is comparatively rare in most systems so a form
of the argument for optimistic concurrency control
applies: detect and fix comparatively rare situations
Availability of transactions helps
DL resolution requires us to kill some participant(s)
Transactions are designed to be rolled back and restarted
Centralized Deadlock Detection
General method: construct a resource graph and analyze it
Analyze through resource reductions
If a cycle exists after analysis, deadlock has occurred; processes in the cycle are deadlocked
Local graphs on each machine:
P_i requests R_1: R_1's machine places the request in its local graph
If a cycle exists in the local graph, perform reductions to detect deadlock
Need to calculate the union of all local graphs, since a deadlock cycle may transcend machine boundaries
Cycles don't always mean deadlock!
Graph Reduction (figure): example resource graphs over P1, P2, P3 and R1, R2 reduced step by step; one reduces completely (no deadlock), while the other leaves a cycle (deadlock)
Waits-For Graphs (WFGs)
Based on the resource allocation (SR) graph
An edge from P_i to P_j means P_i is waiting for P_j to release a resource
Replaces the two edges P_i → R and R → P_j of the SR graph
Deadlocked when a cycle is found
Centralized Deadlock Detection
All hosts communicate resource state to a coordinator
Construct the global resource graph on the coordinator
The coordinator must be reliable and fast
When to construct the graph is an important choice:
Report every resource operation (request, acquire, release): large overhead and significant use latency
Periodically send sets of operations: lower overhead and use latency, but higher detection latency
Whenever a need for cycle detection is indicated (central or local decision)
All have drawbacks because of false deadlocks
False Deadlock
Problem: messages may not arrive in a timely fashion
Inconsistent and out-of-date world view at a particular machine; in particular, out-of-order arrival
Assume two processes on two machines and two resources:
P2 releases R2 (message A)
P1 requests an instance of R2 (message B)
(figure) Machine M1 holds P1 and R1; machine M2 holds P2 and R2, with an edge involving R1
Problem: the coordinator detects a false deadlock after B
False Deadlock
(figure) The coordinator's graph over P1, P2, R1, R2: the initial representation; after receiving message B (a cycle appears); and after receiving message A
False Deadlock
Lack of a global message delivery order causes false deadlocks
Could apply Lamport's global virtual clock, but it is expensive
When the coordinator detects a potential deadlock:
It requests all outstanding messages with lower timestamps
The aim is to establish a common global message order
This establishes a total order on resource operations, and thus a common world view and common decision making
Fixes some false deadlocks, but others are harder

Distributed Deadlock Detection
Chandy-Misra-Haas algorithm
Processes can request more than one resource with a single message, so a process can wait on several resources
Amortizes the message overhead and speeds the growing phase
Uses the waits-for graph to represent system state
Dependencies across machine boundaries make looking for cycles hard
A process sends probe messages when it has to wait
If a probe gets back to its sender, deadlock has occurred
Distributed Deadlock Detection
When process has to wait
Send message to process holding resources
Recipient forwards to all processes it is waiting on
Creates concurrent probe of wait-for graph for cycles
If message gets back to originator
Cycle exists in wait-for graph so deadlock has occurred
Note that first field of message will always be the
initiator
Many messages every time a process blocks

Distributed Deadlock Detection
An Example
P0 gets blocked on a resource held by P1
Initial message from P0 to P1: (0, 0, 1)
P1 is waiting on P2, so P1 sends message (0, 1, 2) to P2
P2 is waiting on P3: (0, 2, 3)
P3 is waiting on P4 and P5: (0, 3, 4) and (0, 3, 5)
P5's chain ends, but P4 → P6 → P8
But P8 is waiting on P0:
P0 gets the message (0, 8, 0) and sees itself as the initiator
A cycle thus exists; P0 knows there is deadlock
Distributed Deadlock Resolution
Some process in the cycle must be killed
Structuring resource use as transactions makes this better behaved and easier to understand
Race condition: two processes block at the same time and send probes, and both discover the cycle in parallel
Damping is difficult, as it is hard to tell which messages may be discarded; the killing process must know the cycle
Practice should emphasize the simplest and cheapest approach: most cycles are between two processes
An example of the importance of gathering performance data
Distributed Deadlock Prevention
Prevention
Careful design to make deadlocks structurally impossible
Make sure at least one of the 4 necessary
conditions for deadlock cannot hold
Process can only hold one resource at a time
Request all resources initially
Process releases all resources before requesting a new one
Resource ordering
All are cumbersome in practice
Distribution opens some new possibilities
Lamport clocks create total order preventing cycles


Fault Tolerant Systems
Concepts of Fault Tolerance
Hardware, software and networks cannot be totally free from failures
Fault tolerance is a non-functional (QoS) requirement that requires a system to continue to operate, even in the presence of faults
Fault tolerance should be achieved with minimal involvement of users or system administrators
Distributed systems can be more fault tolerant than centralized systems, but with more processor hosts the occurrence of individual faults is generally likely to be more frequent
Notion of a partial failure in a distributed system
Attributes, Consequences and Strategies
Attributes (what is a dependable system?): availability, reliability, safety, confidentiality, integrity, maintainability
Consequences (how to distinguish faults?): fault, error, failure
Strategies (how to handle faults?): fault prevention, fault tolerance, fault recovery, fault forecasting
Attributes of a Dependable System
System attributes:
Availability: the system is always ready for use, or the probability that the system is ready or available at a given time
Reliability: the property that a system can run without failure, for a given time
Safety: indicates the safety issues in the case the system fails
Maintainability: refers to the ease of repair of a failed system
Failure in a distributed system: when a service cannot be fully provided
System failure may be partial
A single failure may affect other parts of a system (failure escalation)
Terminology of Fault Tolerance
Fault → Error → Failure: a fault causes an error, which results in a failure
Fault: a defect within the system
Error: observed as a deviation from the expected behaviour of the system
Failure: occurs when the system can no longer perform as required (does not meet its specification)
Fault tolerance: the ability of the system to provide a service even in the presence of errors
Types of Fault (with respect to time)
Hard or permanent: a repeatable error, e.g. failed component, power failure, fire, flood, design error (usually software), sabotage
Soft faults:
Transient: occurs once or seldom, often due to an unstable environment (e.g. a bird flies past a microwave transmitter)
Intermittent: occurs randomly, where the factors influencing the fault are not clearly identified, e.g. an unstable component
Operator error: human error
Types of Fault (with respect to attributes): classification of failures
Crash failure
Omission failure
Transient failure
Byzantine failure
Software failure
Temporal failure
Security failure
Environmental perturbations
Type of failure: Description
Crash failure: a server halts, but is working correctly until it halts
  Amnesia crash: lost all history, must be rebooted
  Pause crash: still remembers the state before the crash, can be recovered
  Halting crash: hardware failure, must be replaced or re-installed
Omission failure: a server fails to respond to incoming requests
  Receive omission: a server fails to receive incoming messages
  Send omission: a server fails to send messages
Timing failure: a server's response lies outside the specified time interval
Response failure: the server's response is incorrect
  Value failure: the value of the response is wrong
  State transition failure: the server deviates from the correct flow of control
Arbitrary failure: a server may produce arbitrary responses at arbitrary times
Strategies to Handle Faults
Fault avoidance: techniques that aim to prevent faults from entering the system during the design stage
Fault removal: methods that attempt to find faults within a system before it enters service
Actions to identify and remove errors: design reviews, testing, use of certified tools
Analysis: hazard analysis, formal methods (proof & refinement)
Fault detection: techniques used during service to detect faults within the operational system
Fault tolerance: techniques designed to tolerate faults, i.e. to allow the system to operate correctly in the presence of faults
No non-trivial system can be guaranteed free from error; we must have an expectation of failure and make appropriate provision
Crash failures
Crash failure = the process halts. It is irreversible.

In synchronous system, it is easy to detect crash failure (using heartbeat
signals and timeout). But in asynchronous systems, it is never accurate, since
it is not possible to distinguish between a process that has crashed, and a
process that is running very slowly.

Some failures may be complex and nasty. Fail-stop failure is a simple
abstraction that mimics crash failure when program execution becomes
arbitrary. Implementations help detect which processor has failed. If a system
cannot tolerate fail-stop failure, then it cannot tolerate crash.
Transient failure
(Hardware) Arbitrary perturbation of the global state. May be induced by power surges, weak batteries, lightning, radio-frequency interference, cosmic rays, etc.


(Software) Heisenbugs are a class of temporary internal faults
and are intermittent. They are essentially permanent faults
whose conditions of activation occur rarely or are not easily
reproducible, so they are harder to detect during the testing
phase.

Over 99% of bugs in IBM DB2 production code are non-
deterministic and transient (Jim Gray)
Not Heisenberg
Temporal failures
Inability to meet deadlines: correct results are generated, but too late to be useful.
Very important in real-time systems.

May be caused by poor algorithms, poor
design strategy or loss of synchronization
among the processor clocks
Byzantine failure
Anything goes! Includes every conceivable form of
erroneous behavior. The weakest type of failure

Numerous possible causes. Includes malicious
behaviors (like a process executing a different
program instead of the specified one) too.

Most difficult kind of failure to deal with.
Origin
Byzantine refers to the Byzantine Generals' Problem, an
agreement problem (first proposed by Marshall Pease,
Robert Shostak, and Leslie Lamport in 1980)[Ref] in which
generals of the Byzantine Empire's army must decide
unanimously whether to attack some enemy army.

The Byzantine Army was chosen as an example for the
problem as the Byzantine state experienced frequent
treachery among the high levels of its administration.

The problem is complicated by the geographic separation
of the generals, who must communicate by sending
messengers to each other, and by the presence of traitors
amongst the generals.
Origin
These traitors can act arbitrarily in order to achieve the
following aims: trick some generals into attacking; force a
decision that is not consistent with the generals' desires,
e.g. forcing an attack when no general wished to attack; or
confusing some generals to the point that they are unable
to make up their minds.

If the traitors succeed in any of these goals, any resulting
attack is doomed, as only a concerted effort can result in
victory.

Byzantine fault tolerance can be achieved, if the loyal (non-
faulty) generals have a unanimous agreement on their
strategy. Note that if the source general is correct, all loyal
generals must agree upon that value. Otherwise, the choice
of strategy agreed upon is irrelevant.

Failure Modes
A Byzantine fault is an arbitrary fault that occurs during
the execution of an algorithm by a distributed system.
It encompasses both omission failures (e.g., crash
failures, failing to receive a request, or failing to send a
response) and commission failures (e.g., processing a
request incorrectly, corrupting local state, and/or
sending an incorrect or inconsistent response to a
request).
When a Byzantine failure has occurred, the system may
respond in any unpredictable way, unless it is designed
to have Byzantine fault tolerance.
For example, if the output of one function is the input
of another, then small round-off errors in the first
function can produce much larger errors in the second.
Failure Modes
If the second function were fed into a third, the
problem could grow even larger, until the values
produced are worthless.
Another example is in compiling source code. One
minor syntactical error early on in the code can produce
large numbers of perceived errors later, as the parser of
the compiler gets out-of-phase with the lexical and
syntactic information in the source program.
Such failures have brought down major Internet
services. For example, in 2008 Amazon S3 was brought
down for several hours when a single-bit hardware
error propagated through the system.[Ref]
Failure Modes
In a Byzantine fault tolerant (BFT) algorithm, steps are taken
by processes, the logical abstractions that represent the
execution path of the algorithms.
A faulty process is one that at some point exhibits any of
the above failures. A process that is not faulty is correct.
The Byzantine failure assumption models real-world
environments in which computers and networks may
behave in unexpected ways due to hardware failures,
network congestion and disconnection, as well as malicious
attacks.
Byzantine failure-tolerant algorithms must cope with such
failures and still satisfy the specifications of the problems
they are designed to solve.
Such algorithms are commonly characterized by their
resilience t, the number of faulty processes with which an
algorithm can cope.

Failure Modes
Many classic agreement problems, such as the
Byzantine Generals' Problem, have no solution unless n
> 3t
where n is the number of processes in the system.
In other words, the algorithm can ensure correct
operation only if fewer than one third of the processes
are faulty.
Early Solutions
Several solutions were originally described by Lamport,
Shostak, and Pease in 1982.[1]
They began by noting that the Generals' Problem can
be reduced to solving a "Commander and Lieutenants"
problem where Loyal Lieutenants must all act in unison
and that their action must correspond to what the
Commander ordered in the case that the Commander is
Loyal.
Roughly speaking, the Generals vote by treating each
others' orders as votes.
Solutions-1
One solution considers scenarios in which messages may be
forged, but which will be Byzantine-fault-tolerant as long as
the number of traitorous generals does not equal or exceed
one third.
The impossibility of dealing with one-third or more traitors
ultimately reduces to proving that the 1 Commander + 2
Lieutenants problem cannot be solved, if the Commander is
traitorous.
The reason is: if we have three commanders, A, B, and C, and A is the traitor, then when A tells B to attack and C to retreat, and B and C send messages to each other forwarding A's message, neither B nor C can figure out who the traitor is, since it isn't necessarily A; the other commander could have forged the message purportedly from A.
It can be shown that if n is the number of generals in total,
and t is the number of traitors in that n, then there are
solutions to the problem only when n is greater than or
equal to 3t + 1.[4]
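To make the voting concrete, below is a minimal Python simulation of the oral-messages algorithm OM(m) in the spirit of Lamport, Shostak, and Pease. It is a sketch under simplifying assumptions: the recursion stands in for synchronous message rounds, the traitor model (alternating replies by recipient) is illustrative, and the helpers majority and send are not part of the original paper.

from collections import Counter

def majority(values, default="retreat"):
    # Majority vote with a deterministic default on ties.
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else default

def send(general, value, traitors, seed):
    # A loyal general relays the value unchanged; a traitor sends
    # arbitrary values (here: alternating by recipient) to sow discord.
    if general in traitors:
        return "attack" if seed % 2 == 0 else "retreat"
    return value

def om(commander, lieutenants, value, m, traitors):
    # Step 1: the commander sends its value to every lieutenant.
    received = {l: send(commander, value, traitors, seed=l)
                for l in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant acts as commander in OM(m-1), relaying
    # the value it received to the other lieutenants.
    votes = {l: [received[l]] for l in lieutenants}
    for l in lieutenants:
        others = [x for x in lieutenants if x != l]
        relayed = om(l, others, received[l], m - 1, traitors)
        for o in others:
            votes[o].append(relayed[o])
    # Step 3: each lieutenant decides by majority over all values it holds.
    return {l: majority(v) for l, v in votes.items()}

# n = 4, t = 1 (the commander is the traitor): the three loyal
# lieutenants still reach the same decision, since n > 3t.
print(om(commander=0, lieutenants=[1, 2, 3], value="attack",
         m=1, traitors={0}))

Running this prints the same decision ("retreat") for all three loyal lieutenants, even though the traitorous commander told them different things.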
Solutions-2
A second solution uses unforgeable signatures (in modern
computer systems, this may be achieved in practice using
public-key cryptography), and maintains Byzantine fault
tolerance in the presence of an arbitrary number of
traitorous generals.
Solutions-Misc

Also presented is a variation on the first two solutions
allowing Byzantine-fault-tolerant behavior in some
situations where not all generals can communicate
directly with each other.
Practical Byzantine fault tolerance
Byzantine fault tolerant replication protocols were long
considered too expensive to be practical.
Then in 1999, Miguel Castro and Barbara Liskov
introduced the "Practical Byzantine Fault Tolerance"
(PBFT) algorithm,[5] which provides high-performance
Byzantine state machine replication, processing
thousands of requests per second with sub-millisecond
increases in latency.
Practical Byzantine fault tolerance
PBFT triggered a renaissance in BFT replication
research, with protocols like Q/U,[6] HQ,[7] Zyzzyva,[8]
and ABsTRACTs [9] working to lower costs and improve
performance and protocols like Aardvark[10] and
RBFT[11] working to improve robustness.
Practical Byzantine fault tolerance
UpRight[12] is an open source library for constructing
services that tolerate both crashes ("up") and Byzantine
behaviors ("right") that incorporates many of these
protocols' innovations.
Practical Byzantine fault tolerance
One example of BFT in use is Bitcoin, a peer-to-peer
digital currency system.
The Bitcoin network works in parallel to generate a
chain of Hashcash style proof-of-work.
The proof-of-work chain is the key to solving the
Byzantine Generals' Problem of synchronising the global
view and generating computational proof of the
majority consensus.[13]
Practical Byzantine fault tolerance
The BFT-SMaRt library[14] is a high-performance Byzantine
fault-tolerant state machine replication library
developed in Java.
This library implements a protocol very similar to
PBFT's, plus complementary protocols which offer state
transfer and on-the-fly reconfiguration of hosts.
BFT-SMaRt is the most recent effort to implement state
machine replication, still being actively maintained.
Hardware Errors and Error Control Schemes

Failures: Soft Errors, Hard Failures, System Crash
Causes: External Radiation, Thermal Effects, Power Loss, Poor Design, Aging
Metrics: FIT, MTTF, MTBF
Traditional Approaches: Spatial Redundancy (TMR, Duplex, RAID-1, etc.) and Data Redundancy (EDC, ECC, RAID-5, etc.)

FIT: Failures In Time (per 10^9 hours)
MTTF: Mean Time To Failure
MTBF: Mean Time Between Failures
TMR: Triple Modular Redundancy
EDC: Error Detection Codes
ECC: Error Correction Codes
RAID: Redundant Array of Inexpensive Disks

Hardware failures are increasing as technology scales:
(e.g.) the soft error rate (SER) increases by up to 1000 times [Mastipuram, 04]
Redundancy techniques are expensive:
(e.g.) ECC-based protection in caches can incur a 95% performance penalty [Li, 05]
Software Errors and Error Control Schemes

Failures: Wrong outputs, Infinite loops, Crash
Causes: Incomplete Specification, Poor software design, Bugs, Unhandled Exceptions
Metrics: Number of Bugs/Klines, QoS, MTTF, MTBF
Traditional Approaches: Spatial Redundancy (N-version Programming, etc.), Temporal Redundancy (Checkpoints and Backward Recovery, etc.)

QoS: Quality of Service

Software errors become dominant as system complexity increases:
(e.g.) several bugs per kilo-lines of code
Hard to debug, and redundancy techniques are expensive:
(e.g.) backward recovery with checkpoints is inappropriate for real-time applications
Network Errors and Error Control Schemes

Failures: Data Losses, Deadline Misses, Node (Link) Failures, System Down
Causes: Network Congestion, Noise/Interference, Malicious Attacks
Metrics: Packet Loss Rate, Deadline Miss Rate, SNR, MTTF, MTBF, MTTR
Traditional Approaches: Resource Reservation, Data Redundancy (CRC, etc.), Temporal Redundancy (Retransmission, etc.), Spatial Redundancy (Replicated Nodes, MIMO, etc.)

SNR: Signal-to-Noise Ratio
MTTR: Mean Time To Recovery
CRC: Cyclic Redundancy Check
MIMO: Multiple-Input Multiple-Output

Omission errors: lost/dropped messages
The network is unreliable (especially wireless networks):
buffer overflows, collisions at the MAC layer, receivers out of range
Joint approaches across OSI layers have been investigated for minimal cost [Vuran, 06][Schaar, 07]
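As a concrete instance of the data-redundancy row above, the sketch below computes a CRC-32 frame check sequence in Python. The bitwise loop follows the standard reflected CRC-32 polynomial (the one used by Ethernet and zlib); the message bytes are illustrative.

def crc32(data: bytes, poly=0xEDB88320) -> int:
    # Bitwise CRC-32 over the message, reflected polynomial form.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

msg = b"hello, network"
fcs = crc32(msg)
# The receiver recomputes the CRC; a mismatch signals a corrupted frame.
assert crc32(msg) == fcs
assert crc32(b"hEllo, network") != fcs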
Simplex systems: highly reliable components
Dual systems: twin identical; twin dissimilar; control + monitor
N-way redundant systems: identical / dissimilar; self-checking / voting

The basic approach to achieving fault tolerance is redundancy.
Dissimilar systems are also known as "diverse" systems, in which an
operation is performed in a different way in the hope that the same
fault will not be present in different implementations.
Architectural approaches
Example: RAID
(Redundant Array of Independent Disks)
RAID has been classified into several levels: 0, 1, 2, 3, 4, 5, 6, 10,
50; each level provides a different degree of fault tolerance
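To illustrate the parity-based levels (RAID-4/5), here is a minimal Python sketch: parity is the byte-wise XOR of the data blocks, and any single lost block can be rebuilt by XOR-ing the parity with the surviving blocks. Block contents and sizes are illustrative.

def parity(blocks):
    # RAID-4/5 style parity: byte-wise XOR of all data blocks.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def rebuild(surviving, parity_block):
    # Any single lost block is the XOR of the parity with the survivors.
    return parity(surviving + [parity_block])

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
p = parity([d0, d1, d2])
assert rebuild([d0, d2], p) == d1   # recover the lost block d1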
Failure Masking by TMR:
(a) Original circuit
(b) Triple modular redundancy
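A hedged sketch of the TMR idea in Python: three replicas of the same module run on the same input, and a voter masks a single faulty output. The square/faulty functions are stand-ins for real hardware modules.

def tmr(replicas, x):
    # Run three (possibly faulty) replicas and mask a single failure
    # by majority voting on their outputs.
    a, b, c = (f(x) for f in replicas)
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one replica failed")

square = lambda x: x * x
faulty = lambda x: x * x + 1       # a replica with a (hypothetical) fault
print(tmr([square, square, faulty], 6))   # -> 36, the fault is masked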
Uses 5 identical computers which can be assigned to redundant operation
under program control.
During critical mission phases - boost, re-entry and landing - 4 of its 5
computers operate in an NMR configuration, receiving the same inputs and
executing identical tasks. When a failure is detected, the computer
concerned is switched out of the system, leaving a TMR arrangement.
The fifth computer is used to perform non-critical tasks in a simplex
mode; however, in extreme cases it may take over critical functions. This
unit has "diverse" software and could be used if a systematic fault was
discovered in the other four computers.
The shuttle can tolerate up to two computer failures; after a second failure
it operates as a duplex system and uses comparison and self-test
techniques to survive a third fault.
Example: Space Shuttle
Hardware redundancy
Use more hardware
Software redundancy
Use more software
Information redundancy, e.g.
Parity bits
Error detecting or correcting codes
Checksums
Temporal (time) redundancy
Repeating calculations and comparing results
For detecting transient faults
Forms of redundancy
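As a small illustration of temporal redundancy, this Python sketch repeats a computation and compares the results. It can detect transient faults, but by construction not consistent ones; the function and repeat count are illustrative.

def recompute_and_compare(f, x, repeats=2):
    # Temporal redundancy: repeat the calculation and compare results
    # to detect transient faults (consistent faults go undetected).
    results = [f(x) for _ in range(repeats)]
    if len(set(results)) != 1:
        raise RuntimeError("transient fault detected: results differ")
    return results[0]

print(recompute_and_compare(lambda x: x * x, 12))   # -> 144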
Program code (may) contain bugs if the actual behavior disagrees with
the intended specification. These faults may arise from:
specification errors
design errors
coding errors, e.g. use of uninitialized variables
integration errors
run-time errors, e.g. operating system stack overflow, divide by zero
Software failure is (usually) deterministic, i.e. predictable, based on
the state of the system. There is no random element to the failure
unless the system state cannot be specified precisely. Non-deterministic
fault behavior usually indicates that the relevant system
state parameters have not been identified.
Fault coverage defines the fraction of possible faults that can be
detected by testing (statement, condition or structural analysis)
Software Faults
N-version programming
Use several different implementations of the same specification.
The versions may run sequentially on one processor or in
parallel on different processors.
They use the same input and their results are compared.
In the absence of a disagreement, the result is output.
When the versions produce different results:
If there are 2 routines:
the routines may be repeated, in case this was a transient error;
otherwise it is not possible to decide which routine is in error.
If there are 3 or more routines:
voting may be applied to mask the effects of the fault.
Software Fault Tolerance
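A minimal sketch of N-version voting in Python, assuming three independently written versions of an integer square root; in a real system the versions would come from separate teams and possibly run on separate processors.

from collections import Counter
import math

def n_version(versions, x):
    # Run independently developed implementations of the same spec
    # and vote on their results (3+ versions can mask one faulty version).
    results = [v(x) for v in versions]
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) // 2:
        return value
    raise RuntimeError("no majority among versions")

# Three (hypothetical) implementations of integer square root:
v1 = lambda n: math.isqrt(n)
v2 = lambda n: int(math.sqrt(n))          # may disagree for very large n
v3 = lambda n: next(i for i in range(n + 2) if (i + 1) ** 2 > n)
print(n_version([v1, v2, v3], 10**6))     # -> 1000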
Process Groups
Organize several identical processes into a group.
When a message is sent to the group, all members of the
group receive it.
If one process in a group fails (no matter for what reason),
hopefully some other process can take over for it.
The purpose of introducing groups is to allow processes to
deal with collections of processes as a single abstraction.
An important design issue is how to reach agreement within a
process group when one or more of its members cannot be
trusted to give correct answers.
a) Communication in a flat group.
b) Communication in a simple hierarchical group
Process Group Architectures
Fault Tolerance in Process Groups
A system is said to be k-fault-tolerant if it can survive
faults in k components and still meet its specification.
If the components (processes) fail silently, then having k +
1 of them is enough to provide k-fault tolerance.
If processes exhibit Byzantine failures (continuing to run
when sick and sending out erroneous or random replies), a
minimum of 2k + 1 processes is needed.
If we demand that a process group reaches agreement,
such as electing a coordinator, synchronization, etc., we
need even more processes to tolerate faults.
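The group sizes just stated can be captured in one line; the model names below are illustrative labels for the three cases on this slide.

def processes_needed(k, failure_model):
    # Minimum group size to survive k faulty members.
    return {"fail-silent": k + 1,            # detectably dead replicas
            "byzantine": 2 * k + 1,          # majority out-votes k liars
            "byzantine-agreement": 3 * k + 1}[failure_model]

print(processes_needed(1, "byzantine"))            # -> 3
print(processes_needed(1, "byzantine-agreement"))  # -> 4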
Agreement: Byzantine Generals Problem
[Figure: four generals A, B, C, D, with C faulty. (a) Each general
broadcasts its local troop strength; the loyal generals receive vectors
such as M1 = (A B X D), M2 = (A B Y D), M4 = (A B Z D), where X, Y, Z are
the arbitrary values C reported to each of them. (b) Each general then
broadcasts the vector it received: e.g. general A collects M2(ABYD),
M3(HIJK), M4(ABZD); general B collects M1(ABXD), M3(EFGH), M4(ABZD);
general D collects M1(ABXD), M2(ABYD), M3(MNPQ); the faulty C's vector M3
is an arbitrary list. Majority voting over the relayed vectors lets every
loyal general reconstruct the same values for A, B, and D.]
Need 3k + 1 generals for k-fault tolerance; the number of messages is O(N^2).
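A simplified Python simulation of the two-round vector exchange in the figure: round 1 broadcasts local values, round 2 relays what was heard, and each loyal general takes a per-entry majority with a deterministic default on ties. The traitor's lies (100 + i, 999) and the helper majority are modeling assumptions, not part of the protocol itself.

from collections import Counter

def majority(reports, default=0):
    # Majority with a deterministic default when no value dominates.
    value, count = Counter(reports).most_common(1)[0]
    return value if count > len(reports) // 2 else default

def agree_on_vectors(values, faulty):
    n = len(values)
    # Round 1: general j sends its value to every general i; a traitor
    # may send a different value to each recipient (modeled as 100 + i).
    heard = [[(100 + i if j in faulty else values[j])
              for j in range(n)] for i in range(n)]
    # Round 2: every general relays what it heard about everyone else;
    # a traitor relays garbage (modeled as 999).
    decided = {}
    for i in (g for g in range(n) if g not in faulty):
        vector = []
        for j in range(n):
            reports = [heard[i][j]]
            for k in range(n):
                if k in (i, j):
                    continue
                reports.append(999 if k in faulty else heard[k][j])
            vector.append(majority(reports))
        decided[i] = vector
    return decided

# n = 4, k = 1 traitor (general 2): all loyal generals compute the
# same vector, at O(n^2) messages; this requires n >= 3k + 1.
print(agree_on_vectors([10, 20, 30, 40], faulty={2}))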
Reliable Communication
Fault tolerance in a distributed system must consider
communication failures.
A communication channel may exhibit crash, omission,
timing, and arbitrary failures.
Reliable point-to-point communication is established by a
reliable transport protocol, such as TCP.
In the client/server model, RPC/RMI semantics must be
satisfied in the presence of failures.
In process group architectures or distributed replication
systems, a reliable multicast/broadcast service is very
important.
In the case of process failure, the following situations need
to be dealt with:
Client unable to locate server
Client request to server is lost
Server crash after receiving client request
Server reply to client is lost
Client crash after sending server request
Reliable Client-Server Communication
A server in client-server communication
a) Normal case
b) Crash after execution
c) Crash before execution
Lost Request Messages when Server Crashes
Client unable to locate server, e.g. server down, or server has
changed
Solution:
- Use an exception handler, but this is not always possible in
the programming language used

Client request to server is lost
Solution:
- Use a timeout to await the server reply, then re-send; but be
careful about idempotent operations (no side effects when re-sent)
- If multiple requests appear to get lost, assume a "cannot locate
server" error
Solutions to Handle Server Failures (1)
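A hedged client-side sketch of the timeout-and-resend rule above, assuming a connected datagram socket and a server that caches replies by request id (so re-sends of non-idempotent operations are safe). The framing, with the id prefixed to the payload, is an illustrative convention.

import socket, uuid

def call_with_retry(sock, payload: bytes, timeout=1.0, attempts=3):
    # The unique request id lets the server detect re-sent duplicates
    # and return the cached reply instead of re-executing the operation.
    request_id = uuid.uuid4().bytes
    sock.settimeout(timeout)
    for _ in range(attempts):
        sock.send(request_id + payload)
        try:
            return sock.recv(4096)        # reply (or cached duplicate)
        except socket.timeout:
            continue                      # request or reply was lost
    # Repeated losses: report a "cannot locate server" error.
    raise ConnectionError("cannot locate server")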
Server crash after receiving client request
The problem is not being able to tell whether the request was carried
out (e.g. the client requests a page to be printed; the server may stop
before or after printing, before the acknowledgement).
Solutions:
- rebuild the server and retry the client request (assuming at-least-once
semantics for the request)
- give up and report request failure (assuming at-most-once
semantics); what is usually required is exactly-once semantics, but this
is difficult to guarantee

Server reply to client is lost
The client can simply set a timer, and if no reply arrives in time, assume
the server is down, the request was lost, or the server crashed while
processing the request.
Solutions to Handle Server Failures (2)
Client crash after sending server request: the server is unable to reply
to the client (an orphan request).

Options and Issues:
- Extermination: the client makes a log of each RPC and kills the
orphan after reboot. Expensive.
- Reincarnation: time is divided into epochs (large intervals).
When the client restarts, it broadcasts to all and starts a new epoch.
Servers dealing with client requests from a previous epoch can be
terminated. Also, unreachable servers (e.g. in different network areas)
may reply later, but will refer to obsolete epoch numbers.
- Gentle reincarnation: as above, but an attempt is made to
contact the client owner (e.g. who may be logged out) to take action.
- Expiration: the server times out if the client cannot be reached to
return the reply.
Solutions to Handle Client Failures
Static groups: group membership is pre-defined
Dynamic groups: members may join and leave, as necessary
Member = process (or coordinator, or RM: Replica Manager)
[Figure: group communication architecture. A group send passes through
address expansion and multicast communication; membership management
handles Join, Leave, and Fail events.]
Group Communication
A simple solution to reliable multicasting when all receivers
are known and are assumed not to fail
a) Message transmission
b) Reporting feedback
Basic Reliable-Multicasting
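A minimal sketch of the scheme above: the sender keeps retransmitting to every receiver whose acknowledgement is missing, assuming the receiver set is known and static. Here send_to is an assumed transport hook that returns whether an ACK came back.

def reliable_multicast(send_to, receivers, msg, max_rounds=5):
    # Keep the message buffered and retransmit to every receiver
    # that has not yet acknowledged it.
    pending = set(receivers)
    for _ in range(max_rounds):
        # send_to(r, msg) returns True iff r's ACK came back
        pending -= {r for r in pending if send_to(r, msg)}
        if not pending:
            return True
    return False     # some receiver never acknowledged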
The essence of hierarchical reliable multicasting (best for large
process groups):
a) Each local coordinator forwards the message to its children.
b) A local coordinator handles retransmission requests.
Hierarchical Feedback Control
A group membership service maintains group views, which are
lists of current group members.
This is NOT a list maintained by one member; rather, each member
maintains its own view (thus, views may differ across members).
A view V_p(g) is process p's understanding of its group (a list of
members).
Example: V_p,0(g) = {p}, V_p,1(g) = {p, q}, V_p,2(g) = {p, q, r},
V_p,3(g) = {p, r}
A new group view is generated, throughout the group, whenever a
member joins or leaves.
A member detecting the failure of another member reliably
multicasts a "view change" message (causal-total order).

Group View (1)
Group View (2)
An event is said to occur in a view v_p,i(g) if the event occurs at p, and
at the time of the event's occurrence, p has delivered v_p,i(g) but has
not yet delivered v_p,i+1(g).
Messages sent out in a view i need to be delivered in that view at all
members in the group ("What happens in the View, stays in the View").
Requirements for view delivery:
Order: If p delivers v_i(g) and then v_i+1(g), then no other process q
delivers v_i+1(g) before v_i(g).
Integrity: If p delivers v_i(g), then p is in v_i(g).
Non-triviality: If process q joins a group and becomes reachable
from process p, then eventually q will always be present in the views
delivered at p.
Virtual Synchronous Communication = Reliable multicast + Group
Membership.
The following guarantees are provided for multicast messages:
Integrity: If p delivers message m, p does not deliver m again.
Also p ∈ group(m).
Validity: Correct processes always deliver all messages. That is, if p
delivers message m in view v(g), and some process q ∈ v(g) does not
deliver m in view v(g), then the next view v'(g) delivered at p will
exclude q.
Agreement: Correct processes deliver the same set of messages in
any view.
All view delivery conditions (Order, Integrity and Non-triviality,
from the previous slide) are satisfied.
"What happens in the View, stays in the View."
Virtual Synchronous Communication (1)

[Figure: four scenarios for processes p, q, r in view V(p,q,r), where p
crashes and the view changes to V(q,r). A multicast that is delivered by
all surviving members before they install V(q,r), or by none of them, is
Allowed; a multicast delivered by some survivors in V(p,q,r) and by
others only in V(q,r), or missed by a survivor entirely, is Not Allowed.]
Virtual Synchronous Communication (2)
Six different versions of virtually synchronous reliable multicasting:

Multicast                  Basic Message Ordering      Total-ordered Delivery?
Reliable multicast         None                        No
FIFO multicast             FIFO-ordered delivery       No
Causal multicast           Causal-ordered delivery     No
Atomic multicast           None                        Yes
FIFO atomic multicast      FIFO-ordered delivery       Yes
Causal atomic multicast    Causal-ordered delivery     Yes
Virtual Synchronous Communication (3)
Once a failure has occurred, in many cases it is important to
recover critical processes to a known state in order to
resume processing.
The problem is compounded in distributed systems.
Two approaches:
Backward recovery: use checkpointing (a global
snapshot of the distributed system's status) to record the system
state; but checkpointing is costly (performance degradation).
Forward recovery: attempt to bring the system to a new stable
state from which it is possible to proceed (applied in
situations where the nature of the errors is known and a reset can
be applied).
Recovery Techniques
A recovery line is a distributed snapshot which
records a consistent global state of the system
Checkpointing
If these local checkpoints jointly do not form a distributed
snapshot, the cascaded rollback of the recovery process may
lead to what is called the domino effect.
A possible solution is to use globally coordinated checkpointing,
which requires global time synchronization, rather than
independent (per-processor) checkpointing.
Independent Checkpointing
most extensively used in distributed systems and
generally safest
can be incorporated into middleware layers
no guarantee that the same fault will not occur again
(this deterministic view affects failure-transparency
properties)
cannot be applied to irreversible (non-idempotent)
operations, e.g. an ATM withdrawal or UNIX rm *
Backward Recovery
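A toy backward-recovery sketch in Python: the state is serialized at a checkpoint and restored on rollback. Real systems checkpoint to stable storage and must coordinate the checkpoints across processes; this single-process version only illustrates the rollback step.

import pickle

class Checkpointing:
    # Minimal backward recovery: snapshot the state, roll back on failure.
    def __init__(self, state):
        self.state = state
        self.snapshot = None
    def checkpoint(self):
        self.snapshot = pickle.dumps(self.state)
    def rollback(self):
        self.state = pickle.loads(self.snapshot)

c = Checkpointing({"balance": 100})
c.checkpoint()
c.state["balance"] -= 30          # a step that later turns out faulty
c.rollback()                      # restore the checkpointed state
assert c.state["balance"] == 100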
Exceptions
System states that should not occur.
Exceptions can be either:
predefined (e.g. array index out of bounds, divide by zero), or
explicitly declared by the programmer.
Raising an exception:
the action of indicating the occurrence of such a state, when it is
detected during the execution of the program.
Exception handler:
code to be executed when an exception is raised,
declared by the programmer,
for recovery action.
Supported by several programming languages:
Ada, ISO Modula-2, Delphi, Java, C++.
Forward Recovery (Exception)
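A small Python example of forward recovery with a predefined exception: the handler moves the computation to a well-defined state instead of crashing. The fallback value is an illustrative choice.

def mean(xs):
    try:
        return sum(xs) / len(xs)
    except ZeroDivisionError:     # predefined exception (divide by zero)
        # exception handler: proceed from a well-defined state
        return 0.0

print(mean([2, 4, 6]))   # -> 4.0
print(mean([]))          # -> 0.0, instead of crashing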
Classifying fault-tolerance
Fail-safe tolerance
Given safety predicate is preserved, but liveness may be affected

Example. Due to failure, no process can enter its critical section for
an indefinite period. In a traffic crossing, failure changes the traffic in
both directions to red.
Graceful degradation
Application continues, but in a degraded mode. Much depends on
what kind of degradation is acceptable.

Example. Consider message-based mutual exclusion. Processes will
enter their critical sections, but not in timestamp order.

Conventional Approaches
Build redundancy into hardware/software:
Modular Redundancy, N-Version Programming. Conventional TMR
(Triple Modular Redundancy) can incur 200% overhead without
optimization.
Replication of tasks and processes may result in
overprovisioning.
Error Control Coding.
Checkpointing and rollbacks:
Usually accomplished through logging (e.g. of messages).
Backward recovery with checkpoints cannot guarantee
the completion time of a task.
Hybrid:
Recovery Blocks
Defining Consensus

N processes.
Each process p has:
an input variable x_p: initially either 0 or 1
an output variable y_p: initially b (b = undecided); can be changed
only once
Consensus problem: design a protocol so that
1. all non-faulty processes set their output variables to 0,
2. or all non-faulty processes set their output variables to 1,
3. and there is at least one initial state that leads to each of
outcomes 1 and 2 above.
Solving Consensus
No failures: trivial; all-to-all broadcast.
With failures:
Assumption: processes fail only by crash-stopping.
Synchronous system: bounds on message delays and on the
max time for each process step,
e.g. a multiprocessor (common clock across processors).
Asynchronous system: no such bounds!
e.g., the Internet! The Web!
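A sketch of synchronous consensus under crash-stop failures: with at most f crashes, f + 1 rounds of all-to-all flooding plus a deterministic decision rule (min of all known values) suffice. The simplification that a crashing process sends to everyone or to no one in a round is a modeling assumption; the real argument also handles partial sends in the crash round.

def flooding_consensus(inputs, crash_round):
    # Synchronous rounds, crash-stop failures only. crash_round maps a
    # process id to the round in which it crashes (before sending).
    n, f = len(inputs), len(crash_round)
    known = [{v} for v in inputs]
    for rnd in range(f + 1):
        outgoing = [set(k) for k in known]
        for i in range(n):
            if crash_round.get(i, f + 1) <= rnd:
                continue                    # i has crashed: sends nothing
            for j in range(n):
                known[j] |= outgoing[i]
    alive = [i for i in range(n) if i not in crash_round]
    return {i: min(known[i]) for i in alive}

# Process 1 crashes before sending anything; 0 and 2 still agree.
print(flooding_consensus([3, 1, 2], crash_round={1: 0}))  # {0: 2, 2: 2}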

Asynchronous Consensus
Messages have arbitrary delay, processes are arbitrarily
slow.
Impossible to achieve!
A slow process is indistinguishable from a crashed process.

Theorem: In a purely asynchronous distributed
system, the consensus problem is impossible to
solve if even a single process crashes.

Result due to Fischer, Lynch, and Paterson (commonly
known as FLP, 1985).
Failure detection
The design of fault-tolerant algorithms will be simple if
processes can detect failures.
In synchronous systems with bounded delay channels,
crash failures can definitely be detected using timeouts.
In asynchronous distributed systems, the detection of
crash failures is imperfect.
Completeness: every crashed process is suspected.
Accuracy: no correct process is suspected.
Classification of completeness:
Strong completeness: every crashed process is
eventually suspected by every correct process, and
remains a suspect thereafter.
Weak completeness: every crashed process is
eventually suspected by at least one correct process,
and remains a suspect thereafter.
Strong accuracy: no correct process is ever
suspected.
Weak accuracy: there is at least one correct process
that is never suspected.
Note that we don't care what mechanism is used for suspecting a process.
Classifying failure detectors
Perfect (P): (strongly) complete and strongly accurate
Strong (S): (strongly) complete and weakly accurate
Eventually perfect (◊P): (strongly) complete and eventually
strongly accurate
Eventually strong (◊S): (strongly) complete and eventually
weakly accurate

Other classes are feasible: W (weak completeness and
weak accuracy) and ◊W.
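A timeout-based heartbeat detector sketch in Python: it is complete (a crashed process stops sending heartbeats and is eventually suspected) but only eventually accurate, since a slow process can be suspected by mistake, which places it in the ◊P family. The timeout value is illustrative.

import time

class HeartbeatDetector:
    # Suspect any process whose last heartbeat is older than `timeout`.
    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_seen = {}
    def heartbeat(self, pid):
        # Called whenever a heartbeat message arrives from `pid`.
        self.last_seen[pid] = time.monotonic()
    def suspects(self):
        now = time.monotonic()
        return {p for p, t in self.last_seen.items()
                if now - t > self.timeout}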
Enhances a service by replicating data.
Increased availability: of the service, when servers fail or when
the network is partitioned.
Fault tolerance: under the fail-stop model, if up to f of f+1
servers crash, at least one is alive.
Load balancing: one approach: multiple server IPs can be assigned
to the same name in DNS, which returns answers round-robin.

P: probability that one server fails; 1 - P = availability of the service.
e.g. P = 5% => the service is available 95% of the time.
P^n: probability that all n servers fail; 1 - P^n = availability of the service.
e.g. P = 5%, n = 3 => the service is available 99.9875% of the time.
Replication
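The availability figures follow directly from 1 - P^n, assuming independent server failures:

p = 0.05                      # probability that a single server is down
for n in (1, 2, 3):
    print(n, 1 - p ** n)      # 0.95, 0.9975, 0.999875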
Request Communication: requests can be made to a single RM or to
multiple RMs.
Coordination: the RMs decide
whether the request is to be applied, and
the order of requests:
FIFO ordering: if a FE issues r and then r', then any correct RM handles
r before r'.
Causal ordering: if the issue of r happened before the issue of r', then
any correct RM handles r before r'.
Total ordering: if a correct RM handles r and then r', then any correct RM
handles r before r'.
Execution: the RMs execute the request (often they do
this tentatively).
Replication Management




Request Communication: the request is issued to the primary RM and
carries a unique request id.
Coordination: the primary takes requests atomically, in order, and
checks the id (it resends the response if the id is not new).
Execution: the primary executes the request and stores the response.
Agreement: if the request is an update, the primary sends the updated
state/result, the request id, and the response to all backup RMs
(a 1-phase commit is enough).
Response: the primary sends the result to the front end.
[Figure: clients issue requests through front ends to the primary RM,
which propagates updates to the backup RMs.]
Passive Replication
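A toy sketch of the passive scheme above, with the request-id check, execution at the primary, and a state push to the backups. The classes, method names, and in-memory "sync" call are illustrative stand-ins for real messaging.

class BackupRM:
    def __init__(self):
        self.state = {}
    def sync(self, state):
        self.state = state                 # adopt the primary's state

class PrimaryRM:
    # Passive replication: the primary executes requests, de-duplicates
    # by request id, and pushes the new state to the backups.
    def __init__(self, backups):
        self.state, self.seen, self.backups = {}, {}, backups
    def handle(self, req_id, key, value):
        if req_id in self.seen:            # duplicate: resend old reply
            return self.seen[req_id]
        self.state[key] = value            # execution
        for b in self.backups:             # agreement: push new state
            b.sync(dict(self.state))
        self.seen[req_id] = "ok"
        return "ok"                        # response to the front end

backups = [BackupRM(), BackupRM()]
primary = PrimaryRM(backups)
primary.handle("r1", "x", 42)
assert all(b.state == {"x": 42} for b in backups)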





Request Communication: the request contains a unique identifier and is
multicast to all RMs by a reliable totally-ordered multicast.
Coordination: group communication ensures that requests are delivered
to each RM in the same order (but possibly at different physical times!).
Execution: each replica executes the request. (Correct replicas return
the same result since they run the same program, i.e., they are
replicated protocols or replicated state machines.)
Agreement: no agreement phase is needed, because of the multicast
delivery semantics of requests.
Response: each replica sends its response directly to the FE.
[Figure: clients issue requests through front ends, which multicast
them to all RMs.]
Active Replication
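A toy sketch of active replication: deterministic replicas consume the same totally-ordered request log (simulated here by a plain Python list) and therefore produce identical replies without any agreement phase. Classes and names are illustrative.

class ActiveRM:
    # Each replica is a deterministic state machine; delivering the
    # same requests in the same order yields the same state everywhere.
    def __init__(self):
        self.state = {}
    def deliver(self, op, key, value=None):
        if op == "put":
            self.state[key] = value
        return self.state.get(key)

rms = [ActiveRM() for _ in range(3)]
ordered_log = [("put", "x", 1), ("put", "x", 2)]   # totally-ordered multicast
for req in ordered_log:              # same order at every replica
    replies = [rm.deliver(*req) for rm in rms]
    assert len(set(replies)) == 1    # the FE receives identical replies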
Message Logging
Tolerates crash failures.
Each process periodically records its local state and logs the
messages received after the checkpoint.
Once a crashed process recovers, its state must be consistent with the
states of the other processes.
Orphan processes:
surviving processes whose states are inconsistent with the recovered state
of a crashed process.
Message logging protocols guarantee that upon recovery no
processes are orphan processes:
Pessimistic logging: avoids the creation of orphans.
Optimistic logging: eliminates orphans during recovery.
Causal logging: no orphans when failures happen, and does not block
processes when failures do not occur (adds information to messages).
One more Perspective
