Unit7 PDF

Distributed Systems:
Synchronization
Gian Pietro Picco
Dipartimento di Elettronica e Informazione
Politecnico di Milano, Italy
picco@elet.polimi.it
http://www.elet.polimi.it/~picco
P olitecnico
di M ilano
Synchronization in
Distributed Systems
The problem of synchronizing concurrent activities
arises also in non-distributed systems
However, distribution complicates matters:
Absence of a global physical clock
Absence of globally shared memory
Partial failures
In these lectures, we study distributed algorithms for:
Synchronizing physical clocks

Simulating time using logical clocks & preserving event ordering
Leader election
Collecting global state & termination detection
Mutual exclusion
Distributed transactions
Distributed Systems - 2004-2006 G.P. Picco
P olitecnico
di M ilano
Time and Distributed Systems

Time plays a fundamental role in many applications:
Execute a given action at a given time
Timestamping objects/data/messages enables reconstruction of
event ordering
File versioning
Distributed debugging
Security algorithms
Problem: ensure all machines see the same global time
P olitecnico
di M ilano
Time
Clocks vs. timers
Time is a tricky issue per se:
Up to 1940, time is measured astronomically
1 second = 1/86400th of a solar day
Earth is slowing down, making measures inaccurate
Since 1948, atomic clocks (International Atomic Time)

1 second = 9,192,631,770 transitions of an atom of Cesium 133
Collected and averaged in Paris from 50 labs around the world
Skew between TAI and solar days accommodated by UTC

(Universal Time Coordinates) when greater than 800ms
Greenwich Mean Time is only astronomical
About 30 leap seconds from 1958 to now
UTC disseminate via radio stations (DCF77 in Europe, WWV in US),
GPS and GEOS satellite systems
P olitecnico
di M ilano
Synchronizing Physical Clocks

To guarantee synchronization:
Maximum clock drift rate is a constant of the timer
For ordinary quartz crystals, =10-6 s/s, i.e., 1s every 11.6 days
Maximum allowed clock skew is an engineering parameter

If two clocks are drifting in opposite directions, during a time
interval t they accumulate a skew of 2t
resynch needed at least every /2 seconds
The problem is either:

Synchronize all clocks against a single one, usually the one with
external, accurate time information (accuracy)
Synchronize all clocks among themselves (agreement)
At least time monotonicity must be preserved

Several protocols have been devised
P olitecnico
di M ilano
Simple Algorithms
Cristian (1989)
Periodically, each client sends a request to the time server

Messages are assumed to travel fast w.r.t. required time accuracy
Problems:
Major: time might run backwards on client machine.
Therefore, introduce change gradually (e.g., advance clock 9ms
instead of 10ms on each clock tick)
Minor: it takes a non-zero amount of time to get the message to the
server and back
measure round-trip time and adjust, e.g., T1 = CUTC + Tround/2
average over several measurements,
Berkeley UNIX (1989)
The time server collects the time from all clients, averages, and
then retransmits the required adjustment
P olitecnico
di M ilano
Network Time Protocol (NTP)

Designed for UTC synch over large-scale networks
Used in practice over the Internet, on top of UDP

Estimate of 10-20 million NTP clients and servers
Widely available (even under Windows)
~1ms over LANs, 1-50ms over the Internet
More info at www.ntp.org
Hierarchical synchronization subnet organized in strata

Servers in stratum 1 are directly connected to a UTC source
Lower strata (higher levels) provide more accurate information
Connections and strata membership change over time
Synchronization mechanisms
Multicast (over LAN)
Procedure-call mode (similar to Cristians)
Symmetric mode (for higher levels)
P olitecnico
di M ilano
Network Time Protocol (NTP)
Each message bears timestamps of recent message events

Ti-2 = Ti-3 + t + o and Ti = Ti-1 + t - o
where t and t are the messages transmission times
A delay estimate is derived by imposing di = t + t
Similarly, the offset is oi = o + (t - t)/2
The pairs <oi,di> are stored and processed through a statistical
filter, to obtain a measure of reliability
Each node communicates with several others
Only the most reliable data is returned to applications
Allows to select the best primary data source
P olitecnico
di M ilano
Some Observations
In many applications it is sufficient to agree
on a time, even if it is not accurate w.r.t. the
absolute time
What matters is often the ordering and
causality relationships of events, rather than
the timestamp itself
If two processes do not interact, it is not
necessary that their clocks be synchronized
P olitecnico
di M ilano
Logical
Time
Scalar
Clocks
L. Lamport. Time, clocks, and the ordering of events in a distributed system.
Communications of the ACM, 21(7):558-565, July 1978.
Based on the happens-before relationship, A B:

If events A and B occur in the same process and A occurs
before B, then A B
If A=send(msg) and B=recv(msg), then A B
is transitive
If neither A B nor B A, they are concurrent (A || B)

Use integers to represent the clock value
No relationship with a physical clock whatsoever
P1
B
C
P2
P3
D
F
10
P olitecnico
di M ilano
Assigning Scalar Clocks

Each process keeps a logical scalar clock
Starts at zero
Increments clock upon each local event
Each message is timestamped with the senders scalar time
Upon receipt, the receivers clock is set to:
MAX(msg timestamp, receivers clock) + 1

Only partial ordering is achieved: total ordering can be
obtained trivially by attaching process IDs to clocks
A
0 1
P1
P2
D
4
P3
F
5
11
P olitecnico
di M ilano
Example
12
P olitecnico
di M ilano
Example:Totally Ordered
Multicast
Updates in sequence:
Customer deposits $100
bank adds 1% interest
Updates are propagated to all locations:

If updates in the same order at each copy, consistent result (e.g., $1111)
If updates arrive in opposite orders, inconsistent result (e.g., $1110)
Totally ordered multicast delivers messages in the same global order

Using logical clocks:
Messages are acknowledged using multicast
All messages (including acks) carry a timestamp with the senders clock
Receivers store all messages in a queue, ordered according to its
timestamp
Eventually, all processes have the same messages in the queue
Assumption: reliable and FIFO links
A message is delivered to the application only when it is at the highest in

the queue and all its acks have been received
13
P olitecnico
di M ilano
Vector Clocks
In scalar clocks, A B C(A) < C(B)
But the reverse does not necessarily hold, e.g., if A || B
Scalar clocks do not capture all possible causality

relationships
In vector clocks each process Pi maintains a vector Vi of
N values (N=#processes) such that:
Vi[i] is the number of events that have occurred at Pi
If Vi[j]=k then Pi knows that k events have occurred at Pj
Rules for updating the vectors:
Initially, Vi[j]=0 for all i,j
Local event at Pi causes an increment of Vi[i]
Pi attaches a timestamp t=Vi in all messages it sends
When Pi receives a message containing t, it sets
Vi[j]=max(Vi[j], t[j]) and then increments Vi[i]
14
P olitecnico
di M ilano
Vector Clocks (contd)

Relations on vector clocks
V(A) V(B) Vi(A) Vi(B), for all i
V(A) < V(B) V(A) V(B) for all i and it exist a j
such that Vi(A) < Vi(B)
V(A) || V(B) V(A) < V(B) V(B) < V(A)
An isomorphism between the set of partially
ordered events and their timestamps
Determining causality:
A B V(A) < V(B)
A || B V(A) || V(B)
15
P olitecnico
di M ilano
Examples
P1
1,0,0 2,0,0
P2
P3
P1
P2
P3
2,1,0
2,2,0
0,0,1
2,2,2
1,0,0 2,0,0
0,1,0
2,2,0
0,0,1
0,0,2
0,0,3
By looking only at the

timestamps we are able
to determine whether
two events are causally
related or concurrent
Can be exploited to
implement causal
message delivery
16
P olitecnico
di M ilano
Example: Bulletin Boards

Pi
Pj
msg
2
Pk
Pi
Pj
Pk
msg
reply 5
1,0,0
If M1 arrives before M2, it

does not necessarily mean
that the two are related
2,2,0
Using vector clocks:
msg
1,1,0 1,2,0
0,0,1 reply
msg
?
Need to preserve the

ordering only between
messages and replies
Totally ordered multicast is
too strong
1,0,
Increment clock only when

sending a message
Hold a reply until the
previous messages are
received:
ts(r)[j] = Vk[j]+1
ts(r)[i] Vk[i] for all i j
17
P olitecnico
di M ilano
Mutual Exclusion
Required to prevent interference and ensure
consistency of resource access
Critical section problem, typical of OS
But here, no shared memory
Requirements:
Safety property: at most one process may execute in the critical
section at a time
Liveness property: all requests to enter/exit the critical section
eventually succeed
Simplest solution: a server coordinating access

Emulates a centralized solution
Server manages the lock using a token
Resource access request and release obtained with respective
messages to the coordinator
Easy to guarantee mutual exclusion and fairness
Drawbacks: performance bottleneck and single point of failure
18
P olitecnico
di M ilano
Mutual Exclusion
with Scalar Clocks
To request access to a resource:
A process Pi sends a resource request message m, with
timestamp Tm, to all processes (including itself)
The request is put into a local queue ordered according to Tm
(process identifiers are used to break ties)
Upon receipt of m, a process sends an acknowledgment to Pi,
with timestamp >Tm
To release a resource
A message is sent to everybody

When the release message is received, the request is removed
from the queue
A resource is granted to Pi when
Its request message is ahead of all others in the request queue

Its request has been acknowledged by all the other processes
with a message >Tm
19
P olitecnico
di M ilano
Example
Acquire
P1:1
P3:1 P1
78
9 10
13
13
23
P1:1
P3:1 P2
P1:1
P3:1 P3
45
11
5 6
Request
11 12
Acquire
Ack
Release
20
P olitecnico
di M ilano
A Token Ring Solution
Processes are logically arranged in a ring, regardless of

their physical connectivity
At least for the purpose of mutual exclusion
Access is granted by a token that is forwarded along a

given direction on the ring
Resource access is achieved by retaining the token
Resource release is achieved by forwarding the token
The token must circulate regardless of the existence of

access requests
21
P olitecnico
di M ilano
Comparison
Algorithm
Messages
per entry/exit
Delay before entry

Problems
(in message times)
Centralized
Coordinator crash
Distributed
(Lamport)
3(n1)
2(n1)
Crash of any process
Token ring
1 to
0 to n 1
Lost token, process crash
If nobody wants to enter the critical

section, the token circulates indefinitely
22
P olitecnico
di M ilano
Leader Election
Many distributed algorithms require a process to act as a
coordinator (or some other special role)
E.g., for server-based mutual exclusion
Problem: make everybody agree on a new leader

When the old is no longer available, e.g., because of failure or
applicative reasons
Assumptions:
Nodes are distinguishable
Otherwise, no way to perform selection
Typically use the identifier: the process with the highest ID becomes
the leader
Closed system: processes know each other

But do not know who is up and who has failed
Algorithms differ on the selection process
23
P olitecnico
di M ilano
Bully Election Algorithm
The leader, 7, fails
Process 4 (noticing the failure of

7) holds an election
Process 5 and 6 respond, telling
4 to stop
Now 5 and 6 each hold an
election
6 tells 5 to stop
6 wins and tells everyone
When a process comes back (or

joins) it holds an election
24
P olitecnico
di M ilano
A Ring-Based Algorithm
Assume a (physical or logical) ring topology among nodes

When a process detects a leader failure, it sends an ELECTION
message containing its ID to the closest alive neighbor
Upon receipt of the election message on process P:
If P is not in the message, add P and propagate to next neighbor
If P is in the list, change message type to COORDINATOR, and
re-circulate
On receiving a COORDINATOR message, a node considers the

process with the highest id as the new leader (and is also informed
about the remaining members of the ring)
Multiple messages may circulate at the same time
Eventually converge to the same content
25
P olitecnico
di M ilano
Capturing Global State
The global state of a distributed system consists of the local state of

each process, together with the message in transit over the links
Useful to know for distributed debugging, termination detection,
deadlock detection,
Banking example:
Constant amount of money in the bank system (e.g., 120)

Money is transferred among banks in messages
Problem: find out of any money accidentally lost
Not the problem: find out where the money is located
48
12
30
20
26
P olitecnico
di M ilano
Cuts & Distributed Snapshots

A distributed snapshot reflects a (consistent, global)
state in which the distributed system might have been
Particular care must be taken when reconstructing the
global state, to preserve consistency
No messages to jump from the future into the past
If a message receipt is recorded, the message send must as well
Conceptual tool: cut
27
P olitecnico
di M ilano
Simplistic Algorithm
Stop all processes

Let all channels empty (not trivial)
Record the state
Start everything back up
Why is this
a bad idea?
28
P olitecnico
di M ilano
Distributed Snapshot
Chandy-Lamport, 1985
Assume FIFO, reliable links/nodes

Initiate a snapshot by
12
Recording internal state

Sending a token on all outgoing channels
Signals a snapshot is being run
Start recording local snapshot

Record messages arriving on every incoming channel
30
Upon receiving a token:

If not already recording local snapshot
Record internal state
Send token on all outgoing channels
Start recording local snapshot
In any case stop recording incoming message on channel the token arrived along
Recording messages
If a message arrives on a channel which is recording messages, record the arrival
of the message, then process the message as normal
Otherwise, just process the message as normal
Snapshot complete when token has arrived on all incoming channels

Irrelevant how the various data are collected and/or transmitted
29
Processing
P olitecnico
di M ilano
Normal processing, first marker

about to be received
Recording of messages from the

incoming links from which a token has
been received
Token just received for the first time,

state saved, token forwarded to outgoing
links, begin recording messages
Token received,
state recording is stopped
30
P olitecnico
di M ilano
Distributed Snapshot: Intuition

Finished
Processing
Processing
Not Yet
Processing
source
Received all tokens
Received no token
Received some of the tokens
31
P olitecnico
di M ilano
Example
12
2
48
30
30
33
20
20
18
Assume bi-directional links (although not necessary)

Let node with 30 start the processing
32
P olitecnico
di M ilano
Example
14
14
48
30
48
38
20
18
33
P olitecnico
di M ilano
Some Observations1
Distributed snapshot effectively selects a consistent cut
from the history of execution
Proof:
Consider ei and ej events at pi and pj, so that ei ej
For the cut to be consistent, if ej is in the cut, so does ei
In other words, if ej occurred before pj recorded its state, the same
must have happened for ei and pi
Obvious if pi = pj
Assume that ei and ej are in the cut, although pi recorded its state
before ei
Consider the messages m1 mn that establish ei ej
FIFO and marker rules are such that the marker would have
reached pj before ej
Therefore, ej cannot be in the cut
34
P olitecnico
di M ilano
Some Observations2
Important: the distributed snapshot algorithm does
not require blocking of the computation
Collecting the snapshot is interleaved with processing
What happens if the snapshot it started at more than

one location at the same time?
Easily dealt with by associating an identifier to each
snapshot, set by the initiator
Several variations have been devised

E.g., incremental snapshots take an initial snapshot, each
node remembers where it has sent/received messages, and
when a new snapshot is requested, only these links are
included in the result
How is the snapshot result collected?
35

Unit7 PDF

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Unit7 PDF

Uploaded by

Copyright:

Available Formats

Distributed Systems:

In these lectures, we study distributed algorithms for:

Synchronizing physical clocks

Distributed Systems - 2004-2006 G.P. Picco

Time and Distributed Systems

Problem: ensure all machines see the same global time

Distributed Systems - 2004-2006 G.P. Picco

Since 1948, atomic clocks (International Atomic Time)

Skew between TAI and solar days accommodated by UTC

Distributed Systems - 2004-2006 G.P. Picco

Synchronizing Physical Clocks

Maximum allowed clock skew is an engineering parameter

The problem is either:

At least time monotonicity must be preserved

Periodically, each client sends a request to the time server

Berkeley UNIX (1989)

Distributed Systems - 2004-2006 G.P. Picco

Network Time Protocol (NTP)

Used in practice over the Internet, on top of UDP

Hierarchical synchronization subnet organized in strata

Distributed Systems - 2004-2006 G.P. Picco

Network Time Protocol (NTP)

Each message bears timestamps of recent message events

Distributed Systems - 2004-2006 G.P. Picco

Distributed Systems - 2004-2006 G.P. Picco

Based on the happens-before relationship, A B:

If neither A B nor B A, they are concurrent (A || B)

Assigning Scalar Clocks

MAX(msg timestamp, receivers clock) + 1

Distributed Systems - 2004-2006 G.P. Picco

Updates are propagated to all locations:

Totally ordered multicast delivers messages in the same global order

A message is delivered to the application only when it is at the highest in

Scalar clocks do not capture all possible causality

Vector Clocks (contd)

Distributed Systems - 2004-2006 G.P. Picco

Distributed Systems - 2004-2006 G.P. Picco

By looking only at the

Example: Bulletin Boards

If M1 arrives before M2, it

Using vector clocks:

Distributed Systems - 2004-2006 G.P. Picco

Need to preserve the

Increment clock only when

Simplest solution: a server coordinating access

A message is sent to everybody

A resource is granted to Pi when

Its request message is ahead of all others in the request queue

Distributed Systems - 2004-2006 G.P. Picco

A Token Ring Solution

Processes are logically arranged in a ring, regardless of

Access is granted by a token that is forwarded along a

The token must circulate regardless of the existence of

Delay before entry

Crash of any process

Lost token, process crash

If nobody wants to enter the critical

Distributed Systems - 2004-2006 G.P. Picco

Problem: make everybody agree on a new leader

Closed system: processes know each other

Algorithms differ on the selection process

Distributed Systems - 2004-2006 G.P. Picco