
Distributed Systems:

Synchronization
Gian Pietro Picco
Dipartimento di Elettronica e Informazione
Politecnico di Milano, Italy
picco@elet.polimi.it
http://www.elet.polimi.it/~picco

Politecnico di Milano

Synchronization in
Distributed Systems
The problem of synchronizing concurrent activities
arises also in non-distributed systems
However, distribution complicates matters:
Absence of a global physical clock
Absence of globally shared memory
Partial failures

In these lectures, we study distributed algorithms for:

Synchronizing physical clocks


Simulating time using logical clocks & preserving event ordering
Leader election
Collecting global state & termination detection
Mutual exclusion
Distributed transactions

Distributed Systems - 2004-2006 G.P. Picco


Time and Distributed Systems


Time plays a fundamental role in many applications:
Execute a given action at a given time
Timestamping objects/data/messages enables reconstruction of
event ordering
File versioning
Distributed debugging
Security algorithms

Problem: ensure all machines see the same global time


Time
Clocks vs. timers
Time is a tricky issue per se:
Up to 1940, time was measured astronomically
1 second = 1/86400th of a solar day
Earth is slowing down, making measures inaccurate

Since 1948, atomic clocks (International Atomic Time, TAI)
1 second = 9,192,631,770 transitions of an atom of Cesium 133
Collected and averaged in Paris from 50 labs around the world

Skew between TAI and solar days is accommodated by UTC (Coordinated Universal Time): a leap second is introduced when the skew is greater than 800 ms
Greenwich Mean Time is only astronomical
About 30 leap seconds from 1958 to now
UTC is disseminated via radio stations (DCF77 in Europe, WWV in the US), GPS and GEOS satellite systems


Synchronizing Physical Clocks


To guarantee synchronization:
Maximum clock drift rate $\rho$ is a constant of the timer
For ordinary quartz crystals, $\rho = 10^{-6}$ s/s, i.e., 1 s every 11.6 days

Maximum allowed clock skew $\delta$ is an engineering parameter
If two clocks are drifting in opposite directions, during a time interval $\Delta t$ they accumulate a skew of $2\rho\Delta t$
Resynchronization is needed at least every $\delta/(2\rho)$ seconds
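
The resynchronization bound above can be checked with a few lines of Python. This is only an illustrative sketch: the function name and the 1 ms skew bound are assumptions, not part of the lecture.

```python
def max_resync_interval(rho: float, delta: float) -> float:
    """Longest time between resynchronizations that keeps the skew within delta.

    rho   -- maximum clock drift rate (seconds of drift per second)
    delta -- maximum allowed skew between two clocks (seconds)
    """
    if rho <= 0 or delta <= 0:
        raise ValueError("rho and delta must be positive")
    # Two clocks drifting in opposite directions diverge at rate 2 * rho.
    return delta / (2 * rho)


# Quartz crystal (rho = 1e-6 s/s) with a hypothetical skew bound of 1 ms.
interval = max_resync_interval(rho=1e-6, delta=1e-3)
print(f"resynchronize at least every {interval:.0f} s")
```

With these numbers the clocks must be resynchronized roughly every 500 seconds.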

The problem is either:


Synchronize all clocks against a single one, usually the one with
external, accurate time information (accuracy)
Synchronize all clocks among themselves (agreement)

At least time monotonicity must be preserved


Several protocols have been devised

Simple Algorithms
Cristian (1989)

Periodically, each client sends a request to the time server


Messages are assumed to travel fast w.r.t. the required time accuracy
Problems:
Major: time might run backwards on the client machine
Therefore, introduce the change gradually (e.g., advance the clock 9 ms instead of 10 ms on each clock tick)
Minor: it takes a non-zero amount of time to get the message to the server and back
Measure the round-trip time and adjust, e.g., $T_1 = C_{UTC} + T_{round}/2$
Average over several measurements
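
The round-trip correction can be sketched as follows. The function names and the `SimulatedLink` class are hypothetical helpers for illustration, and the sketch assumes the request and the reply take the same time to travel.

```python
def cristian_offset(request_server_time, local_clock):
    """Estimate (server clock - local clock) from one request/reply exchange."""
    t0 = local_clock()
    c_utc = request_server_time()   # server's clock value carried in the reply
    t_round = local_clock() - t0    # measured round-trip time
    t1 = c_utc + t_round / 2        # T1 = C_UTC + T_round / 2
    return t1 - (t0 + t_round)      # compare with the local time on arrival


def averaged_offset(request_server_time, local_clock, samples=5):
    """Average the offset estimate over several measurements."""
    total = sum(cristian_offset(request_server_time, local_clock)
                for _ in range(samples))
    return total / samples


class SimulatedLink:
    """Toy stand-in for a real time server reached over a network."""

    def __init__(self, true_offset, one_way_delay):
        self.now = 0.0                      # local clock of the client
        self.true_offset = true_offset      # server clock minus client clock
        self.one_way_delay = one_way_delay

    def local_clock(self):
        return self.now

    def request_server_time(self):
        self.now += self.one_way_delay      # request travels to the server
        reading = self.now + self.true_offset
        self.now += self.one_way_delay      # reply travels back
        return reading


link = SimulatedLink(true_offset=100.0, one_way_delay=0.01)
offset = averaged_offset(link.request_server_time, link.local_clock)
```

To avoid time running backwards, a client would apply `offset` gradually rather than setting its clock in one step.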

Berkeley UNIX (1989)

The time server collects the time from all clients, averages, and
then retransmits the required adjustment
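
A minimal sketch of the averaging step is given below. It assumes the server includes its own reading in the average and ignores message delays; the function name and the sample readings (in minutes) are illustrative only.

```python
from typing import Dict


def berkeley_adjustments(server_time: float,
                         client_times: Dict[str, float]) -> Dict[str, float]:
    """Return the correction each machine should apply to its clock."""
    # Assumption: the server's own reading takes part in the average.
    readings = {"server": server_time}
    readings.update(client_times)
    average = sum(readings.values()) / len(readings)
    # Each machine receives the difference between the average and its reading.
    return {name: average - reading for name, reading in readings.items()}


# Readings in minutes since midnight: 3:00, 3:25 and 2:50.
adjustments = berkeley_adjustments(180.0, {"a": 205.0, "b": 170.0})
```

Sending adjustments rather than absolute times means a slow reply does not directly corrupt the receiver's clock.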


Network Time Protocol (NTP)


Designed for UTC synch over large-scale networks

Used in practice over the Internet, on top of UDP


Estimate of 10-20 million NTP clients and servers
Widely available (even under Windows)
~1ms over LANs, 1-50ms over the Internet
More info at www.ntp.org

Hierarchical synchronization subnet organized in strata


Servers in stratum 1 are directly connected to a UTC source
Lower-numbered strata (higher levels) provide more accurate information
Connections and strata membership change over time

Synchronization mechanisms
Multicast (over LAN)
Procedure-call mode (similar to Cristian's)
Symmetric mode (for higher levels)


Network Time Protocol (NTP)

Each message bears timestamps of recent message events

$T_{i-2} = T_{i-3} + t + o$ and $T_i = T_{i-1} + t' - o$
where $t$ and $t'$ are the transmission times of the two messages and $o$ is the true offset between the two clocks
A delay estimate is derived by imposing $d_i = t + t' = (T_{i-2} - T_{i-3}) + (T_i - T_{i-1})$
Similarly, the offset estimate is $o_i = o + (t - t')/2 = \left((T_{i-2} - T_{i-3}) + (T_{i-1} - T_i)\right)/2$
The pairs $\langle o_i, d_i \rangle$ are stored and processed through a statistical filter, to obtain a measure of reliability
Each node communicates with several others
Only the most reliable data is returned to applications
This allows selecting the best primary data source
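
The offset and delay estimates follow directly from the four timestamps. The sketch below uses a hypothetical function name and made-up sample values; it is not taken from any NTP implementation.

```python
def ntp_offset_delay(t_i3, t_i2, t_i1, t_i):
    """Compute the pair (o_i, d_i) from the four timestamps of one exchange.

    t_i3 -- first message sent      (sender's clock)
    t_i2 -- first message received  (receiver's clock)
    t_i1 -- second message sent     (receiver's clock)
    t_i  -- second message received (sender's clock)
    """
    d_i = (t_i2 - t_i3) + (t_i - t_i1)        # equals t + t'
    o_i = ((t_i2 - t_i3) + (t_i1 - t_i)) / 2  # equals o + (t - t') / 2
    return o_i, d_i


# Made-up exchange: true offset o = 5, transmission times t = 0.3 and t' = 0.1.
o, t, t_prime = 5.0, 0.3, 0.1
t_i3 = 10.0
t_i2 = t_i3 + t + o
t_i1 = 16.0
t_i = t_i1 + t_prime - o
o_i, d_i = ntp_offset_delay(t_i3, t_i2, t_i1, t_i)
```

The estimate `o_i` differs from the true offset by `(t - t') / 2`, so it is exact only when the two transmission times are equal.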


Some Observations
In many applications it is sufficient to agree
on a time, even if it is not accurate w.r.t. the
absolute time
What matters is often the ordering and
causality relationships of events, rather than
the timestamp itself
If two processes do not interact, it is not
necessary that their clocks be synchronized


Logical Time
Scalar Clocks
L. Lamport. Time, clocks, and the ordering of events in a distributed system.
Communications of the ACM, 21(7):558-565, July 1978.

Based on the happens-before relationship, $A \rightarrow B$:

If events A and B occur in the same process and A occurs before B, then $A \rightarrow B$
If A = send(msg) and B = recv(msg), then $A \rightarrow B$
$\rightarrow$ is transitive

If neither $A \rightarrow B$ nor $B \rightarrow A$, they are concurrent ($A \parallel B$)


Use integers to represent the clock value
No relationship with a physical clock whatsoever

[Figure: space-time diagram of events on processes P1, P2, and P3]

Assigning Scalar Clocks


Each process keeps a logical scalar clock

Starts at zero
Increments clock upon each local event
Each message is timestamped with the sender's scalar time
Upon receipt, the receiver's clock is set to:

MAX(msg timestamp, receiver's clock) + 1


Only partial ordering is achieved: total ordering can be
obtained trivially by attaching process IDs to clocks
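
The rules above fit in a small class. This is a sketch with hypothetical names, and it treats sending as a local event (an assumption about how the increment rule applies).

```python
class ScalarClock:
    """Logical scalar clock of a single process."""

    def __init__(self, pid):
        self.pid = pid
        self.time = 0                     # starts at zero

    def local_event(self):
        self.time += 1                    # increment upon each local event
        return self.time

    def send(self):
        # Assumption: sending counts as a local event; the returned value
        # is the timestamp attached to the message.
        return self.local_event()

    def receive(self, msg_timestamp):
        self.time = max(msg_timestamp, self.time) + 1
        return self.time

    def total_order_key(self):
        # Attaching the process ID breaks ties and yields a total order.
        return (self.time, self.pid)


p1, p2 = ScalarClock(1), ScalarClock(2)
p1.local_event()              # P1's clock becomes 1
stamp = p1.send()             # P1's clock becomes 2, message carries 2
received_at = p2.receive(stamp)
```

Here the receive event on P2 gets a value larger than the send event on P1, as happens-before requires.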
[Figure: scalar clock values assigned to the events on P1, P2, and P3]


Example: Totally Ordered Multicast

Updates in sequence:
Customer deposits $100
Bank adds 1% interest

Updates are propagated to all locations:
If updates are applied in the same order at each copy, the result is consistent (e.g., $1111 everywhere, starting from $1000)
If updates arrive in opposite orders, the result is inconsistent (e.g., $1111 at one copy and $1110 at the other)

Totally ordered multicast delivers messages in the same global order


Using logical clocks:
Messages are acknowledged using multicast
All messages (including acks) carry a timestamp with the sender's clock
Receivers store all messages in a queue, ordered by timestamp
Eventually, all processes have the same messages in the queue
Assumption: reliable and FIFO links

A message is delivered to the application only when it is at the head of the queue and all its acks have been received
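
The delivery rule can be simulated in a single program. The sketch below makes several assumptions: class and method names are hypothetical, every process (the sender included) acknowledges each message, and the driver hands messages over by hand in an order that respects FIFO links.

```python
import heapq


class Process:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.clock = 0
        self.queue = []       # heap of (timestamp, sender, payload)
        self.acks = {}        # (timestamp, sender) -> processes that acked
        self.delivered = []   # payloads handed to the application

    def stamp(self):
        self.clock += 1
        return self.clock

    def on_message(self, ts, sender, payload):
        self.clock = max(self.clock, ts) + 1
        heapq.heappush(self.queue, (ts, sender, payload))
        self.try_deliver()
        return (ts, sender)   # identifier to be acknowledged via multicast

    def on_ack(self, msg_id, acker, ack_ts):
        self.clock = max(self.clock, ack_ts) + 1
        self.acks.setdefault(msg_id, set()).add(acker)
        self.try_deliver()

    def try_deliver(self):
        # Deliver only the head of the queue, and only once fully acked.
        while self.queue:
            ts, sender, payload = self.queue[0]
            if len(self.acks.get((ts, sender), ())) < self.n:
                break
            heapq.heappop(self.queue)
            self.delivered.append(payload)


procs = [Process(0, 2), Process(1, 2)]
m0 = (procs[0].stamp(), 0, "deposit $100")
m1 = (procs[1].stamp(), 1, "add 1% interest")
arrivals = {0: [m0, m1], 1: [m1, m0]}   # the two copies see opposite orders
acks = []
for pid, msgs in arrivals.items():
    for ts, sender, payload in msgs:
        msg_id = procs[pid].on_message(ts, sender, payload)
        acks.append((msg_id, pid, procs[pid].stamp()))
for p in procs:
    for msg_id, acker, ack_ts in acks:
        p.on_ack(msg_id, acker, ack_ts)
```

Although the updates arrive in opposite orders, both processes deliver them in the same order, because ties on the timestamp are broken by the sender's ID.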

Vector Clocks
In scalar clocks, $A \rightarrow B \Rightarrow C(A) < C(B)$
But the reverse does not necessarily hold, e.g., if $A \parallel B$

Scalar clocks do not capture all possible causality relationships
In vector clocks each process Pi maintains a vector Vi of
N values (N=#processes) such that:
Vi[i] is the number of events that have occurred at Pi
If Vi[j]=k then Pi knows that k events have occurred at Pj
Rules for updating the vectors:
Initially, Vi[j]=0 for all i,j
Local event at Pi causes an increment of Vi[i]
Pi attaches a timestamp t=Vi in all messages it sends
When Pi receives a message containing t, it sets
Vi[j]=max(Vi[j], t[j]) and then increments Vi[i]
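
The update rules can be sketched as a small class. The names are hypothetical, process indices start at 0, and sending is treated as a local event (an assumption).

```python
class VectorClock:
    """Vector clock of process `pid` in a system of `n` processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n                  # initially all entries are 0

    def local_event(self):
        self.v[self.pid] += 1             # a local event increments V_i[i]
        return list(self.v)

    def send(self):
        # Assumption: sending is a local event; a copy of the vector is
        # attached to the message as its timestamp.
        return self.local_event()

    def receive(self, t):
        # Component-wise maximum, then increment the local entry.
        self.v = [max(mine, theirs) for mine, theirs in zip(self.v, t)]
        self.v[self.pid] += 1
        return list(self.v)


p1, p2 = VectorClock(0, 3), VectorClock(1, 3)
first = p1.local_event()      # [1, 0, 0]
stamp = p1.send()             # [2, 0, 0]
after = p2.receive(stamp)     # [2, 1, 0]
```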

Vector Clocks (cont'd)

Relations on vector clocks, where $V_i(A)$ is the i-th component of $V(A)$
$V(A) \le V(B) \iff V_i(A) \le V_i(B)$ for all $i$
$V(A) < V(B) \iff V(A) \le V(B)$ and there exists a $j$ such that $V_j(A) < V_j(B)$
$V(A) \parallel V(B) \iff \neg(V(A) < V(B)) \wedge \neg(V(B) < V(A))$
An isomorphism between the set of partially ordered events and their timestamps
Determining causality:
$A \rightarrow B \iff V(A) < V(B)$
$A \parallel B \iff V(A) \parallel V(B)$
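
These relations translate into three short predicates. The function names are illustrative, and timestamps are plain lists of equal length.

```python
def leq(a, b):
    """V(A) <= V(B): every component of a is <= the matching one of b."""
    return all(x <= y for x, y in zip(a, b))


def less(a, b):
    """V(A) < V(B): a <= b and at least one component is strictly smaller."""
    return leq(a, b) and any(x < y for x, y in zip(a, b))


def concurrent(a, b):
    """V(A) || V(B): neither timestamp is smaller than the other."""
    return not less(a, b) and not less(b, a)


happened_before = less([2, 0, 0], [2, 1, 0])      # causally related
unrelated = concurrent([2, 0, 0], [0, 0, 1])      # concurrent events
```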


Examples
[Figure: two space-time diagrams of P1, P2, and P3 annotated with vector timestamps, e.g., (1,0,0), (2,0,0), (2,1,0), (2,2,0), (0,0,1), (2,2,2)]
By looking only at the timestamps we are able to determine whether two events are causally related or concurrent
Can be exploited to implement causal message delivery

Example: Bulletin Boards
[Figure: Pi posts a message, Pj posts a reply, and Pk may receive the reply before the message; the diagrams are annotated with scalar and vector timestamps, e.g., (1,0,0), (1,1,0), (1,2,0)]
If M1 arrives before M2, it does not necessarily mean that the two are related
Need to preserve the ordering only between messages and replies
Totally ordered multicast is too strong

Using vector clocks:
Increment clock only when sending a message
Hold a reply r, sent by Pj and received by Pk, until the previous messages are received:
$ts(r)[j] = V_k[j] + 1$
$ts(r)[i] \le V_k[i]$ for all $i \ne j$
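
The hold-back condition can be sketched as a receiver that buffers messages until they become deliverable. Names are hypothetical, process indices start at 0, and clocks are incremented only on send, as stated above.

```python
def can_deliver(ts, sender, v_local):
    """True if ts is the next message from `sender` and nothing it depends on is missing."""
    return ts[sender] == v_local[sender] + 1 and all(
        ts[i] <= v_local[i] for i in range(len(ts)) if i != sender)


class CausalReceiver:
    def __init__(self, n):
        self.v = [0] * n
        self.pending = []     # messages held back
        self.delivered = []

    def receive(self, ts, sender, payload):
        self.pending.append((ts, sender, payload))
        progress = True
        while progress:       # a delivery may unblock held messages
            progress = False
            for item in list(self.pending):
                item_ts, item_sender, item_payload = item
                if can_deliver(item_ts, item_sender, self.v):
                    self.pending.remove(item)
                    self.v = [max(a, b) for a, b in zip(self.v, item_ts)]
                    self.delivered.append(item_payload)
                    progress = True


pk = CausalReceiver(3)
pk.receive([1, 1, 0], 1, "reply")   # arrives first, held back
held = list(pk.delivered)
pk.receive([1, 0, 0], 0, "msg")     # unblocks the reply
```

The reply is held because its timestamp shows that its sender had already seen a message from process 0 that this receiver has not yet delivered.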

Mutual Exclusion
Required to prevent interference and ensure
consistency of resource access
Critical section problem, typical of OS
But here, no shared memory

Requirements:
Safety property: at most one process may execute in the critical
section at a time
Liveness property: all requests to enter/exit the critical section
eventually succeed

Simplest solution: a server coordinating access


Emulates a centralized solution
Server manages the lock using a token
Resource access request and release obtained with respective
messages to the coordinator
Easy to guarantee mutual exclusion and fairness
Drawbacks: performance bottleneck and single point of failure

Mutual Exclusion with Scalar Clocks
To request access to a resource:
A process Pi sends a resource request message m, with timestamp Tm, to all processes (including itself)
The request is put into a local queue ordered according to Tm (process identifiers are used to break ties)
Upon receipt of m, a process sends an acknowledgment to Pi, with timestamp > Tm

To release a resource:
A message is sent to everybody
When the release message is received, the request is removed from the queue

A resource is granted to Pi when:
Its request message is ahead of all others in the request queue
Its request has been acknowledged by all the other processes with a message with timestamp > Tm
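
The request, acknowledge, and release steps can be sketched as a single-program simulation. Class and method names are hypothetical, messages are passed by direct method calls, and a process handles its own request locally instead of sending it to itself.

```python
class MutexProcess:
    def __init__(self, pid, all_pids):
        self.pid = pid
        self.others = set(all_pids) - {pid}
        self.clock = 0
        self.queue = []           # pending requests as (Tm, pid), kept sorted
        self.my_request = None
        self.acked_by = set()

    def request(self):
        self.clock += 1
        self.my_request = (self.clock, self.pid)
        self.queue.append(self.my_request)
        self.queue.sort()
        self.acked_by = set()
        return self.my_request

    def on_request(self, req):
        self.clock = max(self.clock, req[0]) + 1
        self.queue.append(req)
        self.queue.sort()         # ties on Tm are broken by process ID
        self.clock += 1
        return (self.clock, self.pid)     # ack with timestamp > Tm

    def on_ack(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        if self.my_request is not None and ts > self.my_request[0]:
            self.acked_by.add(sender)

    def release(self):
        req, self.my_request = self.my_request, None
        self.queue.remove(req)
        return req

    def on_release(self, req):
        self.queue.remove(req)

    def may_enter(self):
        return (self.my_request is not None
                and self.queue[0] == self.my_request
                and self.acked_by == self.others)


procs = {pid: MutexProcess(pid, [1, 2, 3]) for pid in (1, 2, 3)}
r1, r3 = procs[1].request(), procs[3].request()   # both carry timestamp 1
for req, owner in ((r1, 1), (r3, 3)):
    for pid, p in procs.items():
        if pid != owner:
            procs[owner].on_ack(*p.on_request(req))
first = (procs[1].may_enter(), procs[3].may_enter())
released = procs[1].release()
for pid in (2, 3):
    procs[pid].on_release(released)
second = procs[3].may_enter()
```

P1 and P3 request concurrently with the same timestamp; the tie is broken in favour of P1, and P3 may enter only after P1's release has been received.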


Example
[Figure: space-time diagram of P1, P2, and P3 exchanging Request, Ack, and Release messages; every local queue holds the requests P1:1 and P3:1, and the two requesters acquire the resource one after the other]

A Token Ring Solution

Processes are logically arranged in a ring, regardless of their physical connectivity
At least for the purpose of mutual exclusion

Access is granted by a token that is forwarded along a given direction on the ring
Resource access is achieved by retaining the token
Resource release is achieved by forwarding the token

The token must circulate regardless of the existence of access requests

Comparison

Algorithm             | Messages per entry/exit | Delay before entry (in message times) | Problems
Centralized           | 3                       | 2                                     | Coordinator crash
Distributed (Lamport) | 3(n-1)                  | 2(n-1)                                | Crash of any process
Token ring            | 1 to ∞                  | 0 to n-1                              | Lost token, process crash

If nobody wants to enter the critical section, the token circulates indefinitely

Leader Election
Many distributed algorithms require a process to act as a
coordinator (or some other special role)
E.g., for server-based mutual exclusion

Problem: make everybody agree on a new leader
When the old one is no longer available, e.g., because of failure or application-level reasons

Assumptions:
Nodes are distinguishable
Otherwise, no way to perform selection
Typically use the identifier: the process with the highest ID becomes
the leader

Closed system: processes know each other


But do not know who is up and who has failed

Algorithms differ on the selection process


Bully Election Algorithm

The leader, 7, fails
Process 4 (noticing the failure of 7) holds an election
Processes 5 and 6 respond, telling 4 to stop
Now 5 and 6 each hold an election
6 tells 5 to stop
6 wins and tells everyone

When a process comes back (or joins) it holds an election
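
The scenario above can be replayed with a small simulation. The function name and the log format are illustrative; the sketch assumes the initiator is alive and that every live process answers, so timeouts are not modelled.

```python
def bully_election(initiator, alive):
    """Return (leader, log) for an election started by `initiator`."""
    log = []
    pending = [initiator]     # processes that still have to hold an election
    started = set()
    leader = None
    while pending:
        p = pending.pop(0)
        if p in started:
            continue
        started.add(p)
        log.append(f"{p} holds an election")
        higher = sorted(q for q in alive if q > p)
        if not higher:
            # Nobody with a higher ID answered: p becomes the leader.
            leader = p
            log.append(f"{p} wins and tells everyone")
        for q in higher:
            log.append(f"{q} tells {p} to stop")
            pending.append(q)
    return leader, log


# Processes 0..6 are alive, the old leader 7 has failed, 4 notices it.
leader, log = bully_election(4, alive={0, 1, 2, 3, 4, 5, 6})
```

The log reproduces the steps listed above: 5 and 6 stop 4, 6 stops 5, and 6 announces itself as the new leader.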

A Ring-Based Algorithm

Assume a (physical or logical) ring topology among nodes
When a process detects a leader failure, it sends an ELECTION message containing its ID to the closest alive neighbor
Upon receipt of the ELECTION message at process P:
If P is not in the message, add P and propagate to the next neighbor
If P is in the message, change the message type to COORDINATOR and re-circulate

On receiving a COORDINATOR message, a node considers the process with the highest ID as the new leader (and is also informed about the remaining members of the ring)
Multiple messages may circulate at the same time
Eventually converge to the same content
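
A single ELECTION message travelling around the ring can be simulated as follows. The function name is hypothetical, and the sketch assumes the initiator is alive and no process fails while the election is running.

```python
def ring_election(ring, alive, initiator):
    """Return (leader, members) after one ELECTION message goes round the ring.

    ring      -- process IDs in ring order
    alive     -- set of IDs that are currently up
    initiator -- process that detected the leader failure
    """
    n = len(ring)

    def next_alive(index):
        # Skip failed processes to reach the closest alive neighbor.
        j = (index + 1) % n
        while ring[j] not in alive:
            j = (j + 1) % n
        return j

    members = [initiator]     # content of the ELECTION message
    index = next_alive(ring.index(initiator))
    while ring[index] not in members:
        members.append(ring[index])
        index = next_alive(index)
    # The message reached a process already listed: it becomes COORDINATOR
    # and the process with the highest ID is the new leader.
    return max(members), members


leader, members = ring_election(list(range(8)), alive=set(range(7)), initiator=4)
```

The member list carried by the message is what informs every node about the processes still in the ring.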


Capturing Global State

The global state of a distributed system consists of the local state of each process, together with the messages in transit over the links
Useful to know for distributed debugging, termination detection, deadlock detection, ...
Banking example:
Constant amount of money in the bank system (e.g., 120)
Money is transferred among banks in messages
Problem: find out if any money was accidentally lost
Not the problem: find out where the money is located
[Figure: network of banks with the amounts 48, 12, 30, and 20]

Cuts & Distributed Snapshots


A distributed snapshot reflects a (consistent, global)
state in which the distributed system might have been
Particular care must be taken when reconstructing the global state, to preserve consistency
No message may jump from the future into the past
If a message receipt is recorded, the corresponding message send must be recorded as well

Conceptual tool: cut
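
The consistency condition on a cut can be checked mechanically. In this sketch the representation is an assumption: a cut maps each process to the number of its events that are included, and each message is a pair of (process, event index) positions with indices starting at 1.

```python
def is_consistent(cut, messages):
    """True if no message is received inside the cut but sent outside it."""
    for (sender, send_index), (receiver, recv_index) in messages:
        received_in_cut = recv_index <= cut[receiver]
        sent_in_cut = send_index <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False      # the message would jump from the future into the past
    return True


# One message: sent as the 2nd event of P1, received as the 1st event of P2.
messages = [(("P1", 2), ("P2", 1))]
both_recorded = is_consistent({"P1": 2, "P2": 1}, messages)
receipt_only = is_consistent({"P1": 1, "P2": 1}, messages)
in_transit = is_consistent({"P1": 2, "P2": 0}, messages)
```

A cut that records the send but not the receipt is still consistent: the message is simply in transit.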


Simplistic Algorithm

Stop all processes
Wait for all channels to empty (not trivial)
Record the state
Start everything back up
Why is this a bad idea?


Distributed Snapshot
Chandy-Lamport, 1985

Assume FIFO, reliable links/nodes

Initiate a snapshot by:
Recording internal state
Sending a token on all outgoing channels (signals a snapshot is being run)
Starting to record the local snapshot: record messages arriving on every incoming channel

Upon receiving a token:
If not already recording local snapshot:
Record internal state
Send token on all outgoing channels
Start recording local snapshot
In any case, stop recording incoming messages on the channel the token arrived along

Recording messages:
If a message arrives on a channel which is recording messages, record the arrival of the message, then process the message as normal
Otherwise, just process the message as normal

Snapshot complete when a token has arrived on all incoming channels
Irrelevant how the various data are collected and/or transmitted
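
The token rules can be simulated on the banking example. The `BankNetwork` class, the node names, and the balances are hypothetical, and the whole network lives in one program with FIFO queues standing in for the links.

```python
from collections import deque

MARKER = "MARKER"   # the token


class BankNetwork:
    def __init__(self, balances, links):
        self.balance = dict(balances)
        self.chan = {link: deque() for link in links}   # FIFO, reliable
        self.state = {}       # node -> recorded internal state
        self.recording = {}   # node -> {source: messages recorded so far}
        self.channel = {}     # node -> {source: final channel content}

    def transfer(self, src, dst, amount):
        self.balance[src] -= amount
        self.chan[(src, dst)].append(amount)

    def start_snapshot(self, node):
        self.state[node] = self.balance[node]
        self.recording[node] = {s: [] for (s, d) in self.chan if d == node}
        self.channel[node] = {}
        for (s, d) in self.chan:
            if s == node:
                self.chan[(s, d)].append(MARKER)

    def deliver(self, src, dst):
        msg = self.chan[(src, dst)].popleft()
        if msg == MARKER:
            if dst not in self.state:        # first token: record and forward
                self.start_snapshot(dst)
            # In any case, stop recording on the channel the token came from.
            self.channel[dst][src] = self.recording[dst].pop(src)
        else:
            if src in self.recording.get(dst, {}):
                self.recording[dst][src].append(msg)
            self.balance[dst] += msg         # process the message as normal

    def run(self):
        while any(self.chan.values()):
            for link, queue in self.chan.items():
                if queue:
                    self.deliver(*link)

    def snapshot_total(self):
        in_transit = sum(sum(msgs) for per_node in self.channel.values()
                         for msgs in per_node.values())
        return sum(self.state.values()) + in_transit


nodes = ["A", "B", "C"]
links = [(s, d) for s in nodes for d in nodes if s != d]
net = BankNetwork({"A": 50, "B": 40, "C": 30}, links)
net.transfer("A", "B", 10)    # money already in transit
net.start_snapshot("C")
net.transfer("B", "C", 5)     # sent before the token reaches B
net.run()
total = net.snapshot_total()
```

The transfer from B to C is sent before B sees the token but arrives at C after C has recorded its state, so it is captured as channel content and the recorded total still matches the money in the system.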

[Figure: the stages a process goes through during a snapshot]
Normal processing, first token about to be received
Token just received for the first time: state saved, token forwarded to outgoing links, recording of messages begins
Messages are recorded on the incoming links from which a token has not yet been received
Token received on a link: recording on that link is stopped

Distributed Snapshot: Intuition
[Figure: starting from the source, nodes fall into three groups: finished processing (received all tokens), processing (received some of the tokens), not yet processing (received no token)]

Example
[Figure: bank network with nodes holding 12, 48, 30, and 20 and money in transit on the links; a second diagram shows a later stage of the snapshot]
Assume bi-directional links (although not necessary)
Let the node with 30 start the processing

Some Observations (1)
Distributed snapshot effectively selects a consistent cut from the history of execution
Proof:
Consider events $e_i$ and $e_j$ at $p_i$ and $p_j$, such that $e_i \rightarrow e_j$
For the cut to be consistent, if $e_j$ is in the cut, so must $e_i$
In other words, if $e_j$ occurred before $p_j$ recorded its state, the same must have happened for $e_i$ and $p_i$
Obvious if $p_i = p_j$

Otherwise, assume by contradiction that $e_j$ is in the cut although $p_i$ recorded its state before $e_i$
Consider the messages $m_1 \ldots m_n$ that establish $e_i \rightarrow e_j$
FIFO and marker rules are such that the marker would have reached $p_j$ before $e_j$
Therefore, $e_j$ cannot be in the cut, a contradiction


Some Observations (2)
Important: the distributed snapshot algorithm does not require blocking of the computation
Collecting the snapshot is interleaved with processing

What happens if the snapshot is started at more than one location at the same time?
Easily dealt with by associating an identifier to each snapshot, set by the initiator

Several variations have been devised
E.g., incremental snapshots: take an initial snapshot; each node remembers where it has sent/received messages; when a new snapshot is requested, only these links are included in the result

How is the snapshot result collected?
