
Distributed Systems

Principles and Paradigms

Chapter 05
Synchronization
Communication & Synchronization

• Why do processes communicate in DS?


– To exchange messages
– To synchronize processes

• Why do processes synchronize in DS?


– To coordinate access to shared resources
– To order events
Time, Clocks and Clock Synchronization

• Time
– Why is time important in DS?
– E.g. UNIX make utility (see Fig. 5-1)

• Clocks (Timer)
– Physical clocks
– Logical clocks (introduced by Leslie Lamport)
– Vector clocks (introduced by Colin Fidge)

• Clock Synchronization
– How do we synchronize clocks with real-world time?
– How do we synchronize clocks with each other?



Physical Clocks (1/3)
Problem: Clock Skew – clocks gradually get out of synch and give
different values
Solution: Coordinated Universal Time (UTC):
• Formerly called GMT (Greenwich Mean Time)
• Based on the number of transitions per second of the cesium 133
atom (very accurate).
• At present, the real time is taken as the average of some 50
cesium clocks around the world – International Atomic Time (TAI)
• Introduces a leap second from time to time to compensate for the
fact that days are gradually getting longer.

UTC is broadcast through shortwave radio (with an accuracy of about
+/- 1 msec) and by satellite (Geostationary Operational Environmental
Satellite, GOES, with an accuracy of about +/- 0.5 msec).

Question: Does this solve all our problems? Don’t we now have
some global timing mechanism?
Physical Clocks (2/3)
Problem: Suppose we have a distributed system with a UTC receiver
somewhere in it; we still have to distribute its time to each
machine.
Basic principle:
• Every machine has a timer that generates an interrupt H (typically
60) times per second.
• There is a clock in machine p that ticks on each timer interrupt.
Denote the value of that clock by Cp (t) , where t is UTC time.
• Ideally, we have that for each machine p, Cp (t) = t, or, in other
words, dC/ dt = 1
• Theoretically, a timer with H=60 should generate 216,000 ticks per
hour
• In practice, the relative error of modern timer chips is about 10^-5
(i.e., between 215,998 and 216,002 ticks per hour)
Physical Clocks (3/3)

Where is the max. drift rate

Goal: Never let two clocks in any system differ by more than time units =>
synchronize at least every 2seconds.



Clock Synchronization Principles

• Principle I: Every machine asks a time server for the accurate time
at least once every δ/(2ρ) seconds (see Fig. 5-5).
But you need an accurate measure of the round-trip delay,
including interrupt handling and processing of incoming
messages (a sketch follows after this list).
• Principle II: Let the time server scan all machines
periodically, calculate an average, and inform each machine
how it should adjust its time relative to its present time.
Ok, you’ll probably get every machine in sync. Note: you
don’t even need to propagate UTC time (why not?)
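The round-trip compensation required by Principle I (in the style of Cristian's algorithm) can be sketched roughly as follows. This is a minimal Python illustration, not code from the slides; get_server_time() is a hypothetical stand-in for the actual request/reply exchange with the time server, and halving the round trip is the usual simplifying assumption.

    import time

    def estimate_server_time(get_server_time):
        """Estimate the server's current time while compensating for the
        round-trip delay (sketch). get_server_time stands in for the real
        request/reply exchange with the time server."""
        t_send = time.monotonic()          # local clock when the request leaves
        server_time = get_server_time()    # clock value reported in the reply
        t_recv = time.monotonic()          # local clock when the reply arrives

        round_trip = t_recv - t_send
        # Assume the reply spent roughly half the round trip in transit;
        # interrupt handling and message-processing time would be subtracted
        # here if they could be measured.
        return server_time + round_trip / 2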



Clock Synchronization Algorithms
• The Berkeley Algorithm
– The time server periodically polls every machine for its time
– The received times are averaged, and each machine is notified of the
amount by which it should adjust its time (see the sketch after this list)
– A centralized algorithm; see Figure 5-6
• Decentralized Algorithm
– Every machine periodically broadcasts its time, once per fixed-length
resynchronization interval
– Each machine averages the values received from all other machines
(or averages after discarding the highest and lowest values)
• Network Time Protocol (NTP)
– The most popular protocol used by machines on the Internet
– Uses an algorithm that combines centralized and decentralized elements
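A minimal Python sketch of the averaging step in the Berkeley algorithm, assuming the server has already collected every machine's reported time; the function name and data layout are illustrative, not from the slides.

    def berkeley_adjustments(server_time, reported_times):
        """Compute per-machine clock corrections, Berkeley style (sketch).

        reported_times maps machine id -> clock value reported to the server.
        The server includes its own clock in the average and returns, for each
        machine, the amount it should add to its clock (possibly negative)."""
        all_times = list(reported_times.values()) + [server_time]
        average = sum(all_times) / len(all_times)
        return {machine: average - t for machine, t in reported_times.items()}

    # Example (values in minutes): server at 3:00, machines at 2:50 and 3:25
    # -> average is 3:05, so A adds 15 minutes and B subtracts 20.
    print(berkeley_adjustments(180, {"A": 170, "B": 205}))   # {'A': 15.0, 'B': -20.0}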
Network Time Protocol (NTP)
• A protocol for synchronizing the clocks of computers over packet-
switched, variable-latency data networks (e.g., the Internet)
• NTP uses UDP on port 123 as its transport. It is designed
particularly to resist the effects of variable latency
• NTPv4 can usually maintain time to within 10 milliseconds (1/100 s)
over the public Internet, and can achieve accuracies of 200
microseconds (1/5000 s) or better in local area networks under ideal
conditions
• visit the following URL to understand NTP in more detail
http://en.wikipedia.org/wiki/Network_Time_Protocol
The Happened-Before Relationship
Problem: We first need to introduce a notion of ordering before we
can order anything.
The happened-before relation on the set of events in a distributed
system is the smallest relation satisfying:

• If a and b are two events in the same process, and a
comes before b, then a → b (a happened before b).
• If a is the sending of a message, and b is the receipt of
that message, then a → b.
• If a → b and b → c, then a → c (the relation is transitive).
Note: if two events, x and y, happen in different processes that do not
exchange messages, then they are said to be concurrent.
Note: this introduces a partial ordering of events in a system with
concurrently operating processes.
Logical Clocks (1/2)

Problem: How do we maintain a global view of the system’s
behavior that is consistent with the happened-before relation?
Solution: attach a timestamp C(e) to each event e, satisfying the
following properties:

P1: If a and b are two events in the same process, and a → b,
then we demand that C(a) < C(b)
P2: If a corresponds to sending a message m, and b to
the receipt of that message, then also C(a) < C(b)
Problem: How do we attach a timestamp to an event when there’s
no global clock? ⇒ maintain a consistent set of logical clocks, one
per process.



Logical Clocks (2/2)

Each process Pi maintains a local counter Ci and adjusts this counter
according to the following rules:
(1) For any two successive events that take place within Pi, Ci is
incremented by 1.
(2) Each time a message m is sent by process Pi, the message
receives a timestamp Tm = Ci.
(3) Whenever a message m is received by a process Pj, Pj adjusts its
local counter Cj to Cj := max{Cj, Tm} + 1.
Property P1 is satisfied by (1); Property P2 by (2) and (3).


This is called Lamport’s algorithm.
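The three rules can be captured in a few lines of Python. This is a minimal sketch (message transport and the surrounding process are left out), not code from the slides:

    class LamportClock:
        """Lamport's logical clock for one process (rules 1-3 above)."""

        def __init__(self):
            self.value = 0

        def local_event(self):
            self.value += 1                 # rule (1): tick for each local event
            return self.value

        def send(self):
            self.value += 1                 # sending is itself an event
            return self.value               # timestamp Tm carried by the message

        def receive(self, tm):
            self.value = max(self.value, tm) + 1   # rule (3): Cj := max{Cj, Tm} + 1
            return self.value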



Logical Clocks – Example

Fig 5-7. (a) Three processes, each with its own clock. The clocks
run at different rates. (b) Lamport’s algorithm corrects the clocks



[Timing diagram: process P1 has events a, b, c, d; P2 has events e, f, g, h, i;
P3 has events j, k, l. Messages are sent from b to f, from k to g, and from d to i.]
• Assign the Lamport logical clock values for all the events in the
above timing diagram. Assume that each process’s local clock is
set to 0 initially.
[Answer: Lamport clock values – P1: a=1, b=2, c=3, d=4;
P2: e=1, f=3, g=4, h=5, i=6; P3: j=1, k=2, l=3]
From the above timing diagram, what can you say about the
following events?
• between a and b: a → b
• between b and f: b → f
• between e and k: concurrent
• between c and h: concurrent
• between k and h: k → h
Total Ordering with Logical Clocks
Problem: it can still occur that two events happen at the
same time. Avoid this by attaching a process number to
an event:
Pi timestamps event e with Ci(e).i
Then: Ci(a).i happened before Cj(b).j if and only if:
1: Ci(a) < Cj(b); or
2: Ci(a) = Cj(b) and i < j
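In code, this total order is just a lexicographic comparison of (clock value, process number) pairs; a tiny illustrative check (the function name is mine, not from the slides):

    def totally_ordered_before(ts_a, ts_b):
        """True if timestamp ts_a = (Ci(a), i) precedes ts_b = (Cj(b), j)."""
        return ts_a < ts_b        # Python compares tuples element by element

    print(totally_ordered_before((4, 2), (4, 3)))   # True: clocks tie, 2 < 3 breaks it
    print(totally_ordered_before((5, 1), (4, 3)))   # False: 5 > 4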



Example: Totally-Ordered Multicast (1/2)
Problem: We sometimes need to guarantee that concurrent updates
on a replicated database are seen in the same order everywhere:
• Process P1 adds $100 to an account (initial value: $1000)
• Process P2 increments account by 1%
• There are two replicas

Outcome: in the absence of proper synchronization, replica #1 will end up
with $1111, while replica #2 ends up with $1110.
Example: Totally-Ordered Multicast (2/2)

• Process Pi sends timestamped message msgi to all others. The
message itself is put in a local queue queuei.
• Any incoming message at Pj is queued in queuej, according to its
timestamp.
• Pj passes a message msgi to its application if:
(1) msgi is at the head of queuej
(2) for each process Pk, there is a message msgk in queuej with a
larger timestamp.
Note: We are assuming that communication is reliable and FIFO
ordered.
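A rough Python sketch of this delivery rule at one process. It assumes reliable FIFO multicast as stated above, uses (Lamport clock, sender id) pairs as totally ordered timestamps, and the class and method names are illustrative, not the slides' notation:

    import heapq

    class TotalOrderQueue:
        """Delivery rule at one process Pj (sketch)."""

        def __init__(self, other_process_ids):
            self.others = set(other_process_ids)
            self.queue = []            # min-heap of (timestamp, sender, payload)

        def on_multicast(self, timestamp, sender, payload):
            """Queue every incoming (and own) multicast message by timestamp."""
            heapq.heappush(self.queue, (timestamp, sender, payload))

        def try_deliver(self):
            """Return messages that may be passed to the application, in order."""
            delivered = []
            while self.queue:
                head_ts, head_sender, payload = self.queue[0]
                senders_with_larger = {s for ts, s, _ in self.queue if ts > head_ts}
                # Deliver the head only if every other process already has a
                # queued message with a larger timestamp (with FIFO channels,
                # nothing with a smaller timestamp can still be in flight).
                if self.others - {head_sender} <= senders_with_larger:
                    heapq.heappop(self.queue)
                    delivered.append(payload)
                else:
                    break
            return delivered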



Fidge’s Logical Clocks

• With Lamport’s clocks, one cannot directly compare the timestamps
of two events to determine their precedence relationship:
- if C(a) is not less than C(b), then a ↛ b (a did not happen before b)
- if C(a) < C(b), it could be that a → b or that a ↛ b
- e.g., events e and b in the previous example figure:
  * C(e) = 1 and C(b) = 2
  * thus C(e) < C(b) but e ↛ b
• The main problem is that a simple integer clock cannot order
both events within a process and events in different processes
• Colin Fidge developed an algorithm that overcomes this problem
• Fidge’s clock is represented as a vector [c1, c2, …, cn] with an
integer clock value for each process (ci contains the clock value
of process i)
Fidge’s Algorithm
Fidge’s logical clock is maintained as follows:
1: Initially all clock values are set to the smallest value.
2: The local clock value is incremented at least once before
each primitive event in a process.
3: The current value of the entire logical clock vector is
delivered to the receiver for every outgoing message.
4: Values in the timestamp vectors are never decremented.
5: Upon receiving a message, the receiver sets the value of
each entry in its local timestamp vector to the maximum of
the two corresponding values in the local vector and in the
remote vector received.
The element corresponding to the sender is a special
case; it is set to one greater than the value received, but
only if the local value is not greater than that received.
• Get r_vector from the received msg sent by process q;
  if l_vector[q] <= r_vector[q] then
      l_vector[q] := r_vector[q] + 1;
  for i := 1 to n do
      l_vector[i] := max(l_vector[i], r_vector[i]);
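The same rules as a small Python sketch (vectors are 0-indexed lists and the function names are mine); note that receiving a message is itself a primitive event, so rule 2 applies to it as well:

    def fidge_local_event(l_vector, p):
        """Rule 2: increment the process's own entry before each primitive event."""
        l_vector[p] += 1
        return l_vector

    def fidge_receive(l_vector, r_vector, q):
        """Rule 5: merge the vector received from process q into the local one."""
        if l_vector[q] <= r_vector[q]:
            l_vector[q] = r_vector[q] + 1                 # sender's entry: one greater
        for i in range(len(l_vector)):
            l_vector[i] = max(l_vector[i], r_vector[i])   # values never decrease (rule 4)
        return l_vector

    # Example: P2's clock just before event f, receiving the message sent at b = [2,0,0]
    t = fidge_receive([0, 1, 0], [2, 0, 0], q=0)   # -> [3, 1, 0]
    t = fidge_local_event(t, p=1)                  # the receive event itself -> [3, 2, 0]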

• Timestamps attached to the events are compared as follows:
• ep → fq iff Tep[p] < Tfq[p]
• (where ep represents an event e occurring in process p, Tep represents
the timestamp vector of the event ep, and the i-th element of Tep is
denoted by Tep[i].)

• This means event ep happened before event fq if and only if process
q received a direct or indirect message from p and that message
was sent after ep had occurred. If ep and fq are in the same process
(i.e., p = q), the local elements of their timestamps represent their
occurrences in the process.
[The same timing diagram as before: P1 has events a, b, c, d; P2 has e, f, g, h, i;
P3 has j, k, l; messages are sent from b to f, from k to g, and from d to i.]

• Assign the Lamport and Fidge logical clock values for all the events
in the above timing diagram. Assume that each process’s logical
clock is set to 0 initially.
[Answer: Lamport and Fidge clock values –
P1: a = 1 [1,0,0], b = 2 [2,0,0], c = 3 [3,0,0], d = 4 [4,0,0];
P2: e = 1 [0,1,0], f = 3 [3,2,0], g = 4 [3,3,3], h = 5 [3,4,3], i = 6 [5,5,3];
P3: j = 1 [0,0,1], k = 2 [0,0,2], l = 3 [0,0,3]]
The above diagram shows both Lamport timestamps (an integer value)
and Fidge timestamps (a vector of integer values) for each event.
– Lamport clocks:
2 < 5 and indeed b → h,
3 < 4 but c ↛ g (c and g are concurrent).

– Fidge clocks:
f → h since 2 < 4 is true,
b → h since 2 < 3 is true,
h ↛ a since 4 < 0 is false,
c and h are concurrent since (3 < 3) is false and (4 < 0) is false.
[Timing diagram with four processes: P1 has events a, b, c, d; P2 has e, f, g, h, i;
P3 has j, k, l; P4 has m, n, o; the processes exchange messages (not reproduced here).]

Assign the Lamport’s and Fidge’s logical clock values for all the events in the above
timing diagram. Assume that each process’s logical clock is set to 0 initially.
From the above timing diagram, what can you say
about the following events?

1. between b and n:
2. between b and o:
3. between m and g:
4. between c and h:
5. between c and l:
6. between j and g:
7. between k and i:
8. between j and h:
READING Reference:
• Colin Fidge, “Logical Time in Distributed
Computing Systems”, IEEE Computer, Vol.
24, No. 8, pp. 28-33, August 1991.
Global State (1/3)

Basic Idea: Sometimes you want to collect the current state of a
distributed computation, called a distributed snapshot. It consists of all
local states and messages in transit.

Important: A distributed snapshot should reflect a consistent state.



Global State (2/3)
Note: any process P can initiate taking a distributed snapshot
• P starts by recording its own local state
• P subsequently sends a marker along each of its outgoing channels
• When Q receives a marker through channel C, its action depends on
whether it had already recorded its local state:
– Not yet recorded: it records its local state, and sends the marker
along each of its outgoing channels
– Already recorded: the marker on C indicates that the channel’s state
should be recorded: all messages received on C after Q recorded its
own state and before this marker arrived.
• Q is finished when it has received a marker along each of its incoming
channels
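A compact sketch of the marker handling just described (the algorithm is usually credited to Chandy and Lamport). This is an illustrative Python skeleton, not the slides' code: the channel names, the send() transport primitive, and get_local_state() are hypothetical stand-ins.

    class SnapshotProcess:
        """Marker handling for a distributed snapshot (illustrative sketch)."""

        MARKER = "MARKER"

        def __init__(self, incoming, outgoing, send, get_local_state):
            self.incoming = incoming            # names of incoming channels
            self.outgoing = outgoing            # names of outgoing channels
            self.send = send                    # hypothetical transport primitive
            self.get_local_state = get_local_state
            self.recorded_state = None          # local state, once recorded
            self.channel_state = {}             # channel -> messages in transit
            self.recording = {}                 # channel -> still recording?

        def start_snapshot(self):
            """Any process may initiate: record own state, then send markers."""
            self.recorded_state = self.get_local_state()
            self.recording = {c: True for c in self.incoming}
            self.channel_state = {c: [] for c in self.incoming}
            for c in self.outgoing:
                self.send(c, self.MARKER)

        def on_message(self, channel, msg):
            if msg == self.MARKER:
                if self.recorded_state is None:     # first marker seen
                    self.start_snapshot()
                self.recording[channel] = False     # this channel's state is complete
                if not any(self.recording.values()):
                    print("snapshot done:", self.recorded_state, self.channel_state)
            elif self.recorded_state is not None and self.recording.get(channel):
                # Received after recording the local state but before the marker
                # on this channel: the message belongs to the channel's state.
                self.channel_state[channel].append(msg)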



Global State (3/3)

(a) Organization of a process and channels for a distributed snapshot


(b) Process Q receives a marker for the first time and records its local state
(c) Q records all incoming messages
(d) Q receives a marker for its incoming channel and finishes recording the state of the
incoming channel
Election Algorithms

Principle: Many distributed algorithms require that some process acts
as a coordinator. The question is how to select this special process
dynamically.

Note: In many systems the coordinator is chosen by hand (e.g., file
servers, DNS servers). This leads to centralized solutions => single
point of failure.

Question: If a coordinator is chosen dynamically, to what extent can
we speak about a centralized or distributed solution?

Question: Is a fully distributed solution, i.e., one without a coordinator,
always more robust than any centralized/coordinated solution?



Election by Bullying (1/2)

Principle: Each process has an associated priority (weight). The
process with the highest priority should always be elected as the
coordinator.
Issue: How do we find the heaviest process?
• Any process can just start an election by sending an election
message to all other processes (assuming you don’t know the weights
of the others).
• If a process Pheavy receives an election message from a lighter
process Plight, it sends a take-over message to Plight. Plight is out of
the race.
• If a process doesn’t get a take-over message back, it wins, and
sends a victory message to all other processes.
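A stripped-down Python sketch of this decision logic from one process's point of view; is_alive() is a hypothetical stand-in for "the heavier process answered my ELECTION message with a take-over", and the actual messaging is abstracted away.

    def bully_election(my_id, all_ids, is_alive):
        """Core of the bully election at one process (sketch)."""
        heavier = [p for p in all_ids if p > my_id]
        if any(is_alive(p) for p in heavier):
            # A heavier process answered with a take-over message: we are out,
            # and that process continues the election among the heavier ones.
            return None
        # No take-over received: this process wins and announces victory.
        return my_id

    # Example: processes 1..7, with 6 and 7 crashed -> 5 wins
    alive = {1, 2, 3, 4, 5}
    print(bully_election(4, range(1, 8), lambda p: p in alive))   # None (5 took over)
    print(bully_election(5, range(1, 8), lambda p: p in alive))   # 5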



Election by Bullying (2/2)

Question: We’re assuming something very important here – what?

Assumption: Each process knows the process number of other processes
Election in a Ring
Principle: Process priority is obtained by organizing processes into a
(logical) ring. Process with the highest priority should be elected as
coordinator.
• Any process can start an election by sending an election message to
its successor. If a successor is down, the message is passed on to the
next successor.
• If a message is passed on, the sender adds itself to the list. When it
gets back to the initiator, everyone had a chance to make its presence
known.
• The initiator sends a coordinator message around the ring containing
a list of all living processes. The one with the highest priority is elected
as coordinator. See Figure 5-12.
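A sketch of a single election pass around the ring; the message passing itself is abstracted away, and the function and parameter names are illustrative, not from the slides.

    def ring_election(initiator, ring, alive):
        """One pass of the ring election (sketch): the election message collects
        the ids of all living processes; when it returns to the initiator, the
        one with the highest id becomes coordinator. 'ring' gives the successor
        order; crashed processes are simply skipped."""
        candidates = []
        n = len(ring)
        start = ring.index(initiator)
        for step in range(n):
            p = ring[(start + step) % n]
            if p in alive:
                candidates.append(p)          # each living process adds itself
        coordinator = max(candidates)
        # A second pass (not shown) circulates the COORDINATOR message with this list.
        return coordinator

    print(ring_election(3, ring=[2, 5, 7, 3, 6, 1], alive={2, 3, 5, 6}))   # 6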

Question: Does it matter if two processes initiate an election?


Question: What happens if a process crashes during the election?



Mutual Exclusion
Problem: A number of processes in a distributed system want exclusive
access to some resource.
Basic solutions:
• Via a centralized server.
• Completely distributed, with no topology imposed.
• Completely distributed, making use of a (logical) ring.
Centralized: Really simple: a coordinator grants requests to access the
resource one at a time, queuing requests while the resource is in use.


Mutual Exclusion: Ricart & Agrawala

Principle: The same as Lamport except that acknowledgments aren’t
sent. Instead, replies (i.e., grants) are sent only when:
• The receiving process has no interest in the shared resource; or
• The receiving process is waiting for the resource, but has lower
priority (known through comparison of timestamps).
In all other cases, reply is deferred (see the algorithm on pg. 267)
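The reply rule can be sketched as a small predicate; the state names and (Lamport clock, process id) timestamps are illustrative conventions of mine, not the book's notation.

    def should_reply(my_state, my_request_ts, incoming_ts):
        """Ricart & Agrawala reply decision at the receiving process (sketch).

        my_state is one of 'RELEASED', 'WANTED', 'HELD'; timestamps are
        (Lamport clock, process id) pairs, so ties are broken by process id.
        Returns True if an OK reply is sent now, False if it is deferred."""
        if my_state == "RELEASED":
            return True                         # no interest in the shared resource
        if my_state == "WANTED":
            return incoming_ts < my_request_ts  # requester has higher priority
        return False                            # HELD: always defer until release

    print(should_reply("WANTED", (8, 2), (6, 1)))   # True: (6, 1) beats (8, 2)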



Mutual Exclusion: Token Ring Algorithm
Essence: Organize processes in a logical ring, and let a token be
passed between them. The one that holds the token is allowed to
enter the critical region (if it wants to)
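A toy sketch of the token circulating around such a ring; wants_cs() stands in for whatever the process uses to decide that it needs the critical region.

    def token_ring_step(holder, ring, wants_cs):
        """One step (sketch): the token holder enters the critical region if it
        wants to, then passes the token to its successor on the ring."""
        if wants_cs(holder):
            print(f"process {holder} enters its critical region")
            # ... critical region ...
            print(f"process {holder} leaves its critical region")
        successor = ring[(ring.index(holder) + 1) % len(ring)]
        return successor              # the token moves on around the ring

    holder = 0
    for _ in range(4):                # circulate the token a few steps
        holder = token_ring_step(holder, ring=[0, 2, 4, 7], wants_cs=lambda p: p == 4)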



Distributed Transactions

• The transaction model

• Classification of transactions

• Concurrency control
The Transaction Model (1)

• Updating a master tape is fault tolerant.

Question: What happens if this computer operation fails?
Both tapes are rewound and the job is restarted
from the beginning without any harm being done
The Transaction Model (2)

Primitive Description

BEGIN_TRANSACTION Make the start of a transaction

END_TRANSACTION Terminate the transaction and try to commit

ABORT_TRANSACTION Kill the transaction and restore the old values

READ Read data from a file, a table, or otherwise

WRITE Write data to a file, a table, or otherwise

Figure 5-18 Example primitives for transactions.


The Transaction Model (3)

(a) BEGIN_TRANSACTION
        reserve BOS -> JFK;
        reserve JFK -> ICN;
        reserve SEL -> KPO;
    END_TRANSACTION

(b) BEGIN_TRANSACTION
        reserve BOS -> JFK;
        reserve JFK -> ICN;
        reserve SEL -> KPO  (full) =>
    ABORT_TRANSACTION

a) Transaction to reserve three flights commits
b) Transaction aborts when the third flight is unavailable
ACID Properties of Transactions

• Atomic
To the outside world, the transaction happens indivisibly

• Consistent
The transaction does not violate system invariants

• Isolated
Concurrent transactions do not interfere with each other

• Durable
Once a transaction commits, the changes are permanent
Nested Transactions
• Constructed from a number of subtransactions
• The top-level transaction may create children that run in parallel with
one another to gain performance or simplify programming
• Each of these children is called a subtransaction and it may also have
one or more subtransactions

• When any transaction or subtransaction starts, it is conceptually given
a private copy of all data in the entire system for it to manipulate as it
wishes
• If it aborts, its private space is destroyed
• If it commits, its private space replaces the parent’s space

• If the top-level transaction aborts, all the changes made in the
subtransactions must be wiped out
Distributed Transactions
- Transactions involving subtransactions that operate on data that
are distributed across multiple machines

- Separate distributed algorithms are needed to handle the locking
of data and committing the entire transaction
Implementing Transactions

1. Private Workspace
• Gives a private workspace (i.e., all the data it has access to) to
a process when it begins a transaction

2. Writeahead Log
• Files are actually modified in place but before any block is
changed, a record is written to a log telling
 which transaction is making the change
 which file and block is being changed
 what the old and new values are
• Only after the log record has been written successfully is the change
made to the file

Question: Why is a log needed?
→ for “rollback” if necessary
Private Workspace

a) The file index and disk blocks for a three-block file
b) The situation after a transaction has modified block 0 and appended block 3
c) After committing
Writeahead Log

(a) A transaction:
    x = 0;
    y = 0;
    BEGIN_TRANSACTION;
        x = x + 1;
        y = y + 2;
        x = y * y;
    END_TRANSACTION;

(b) Log: [x = 0 / 1]
(c) Log: [x = 0 / 1] [y = 0 / 2]
(d) Log: [x = 0 / 1] [y = 0 / 2] [x = 1 / 4]

(a) A transaction; (b) – (d) the log before each statement is executed
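A minimal Python sketch that replays this figure: each assignment appends an [old / new] record to the log before the value is changed, and on abort the log is replayed backwards. The data structures and the rollback-on-exception convention are my own illustration, not the book's.

    def run_transaction_with_wal(data, statements):
        """Apply statements with a write-ahead log (sketch)."""
        log = []
        try:
            for var, compute_new in statements:
                old, new = data[var], compute_new(data)
                log.append((var, old, new))        # write-ahead: log before the change
                data[var] = new
        except Exception:
            for var, old, _ in reversed(log):      # rollback using the log
                data[var] = old
            raise
        return log

    data = {"x": 0, "y": 0}
    log = run_transaction_with_wal(data, [
        ("x", lambda d: d["x"] + 1),       # log entry: [x = 0 / 1]
        ("y", lambda d: d["y"] + 2),       # log entry: [y = 0 / 2]
        ("x", lambda d: d["y"] * d["y"]),  # log entry: [x = 1 / 4]
    ])
    print(data, log)   # {'x': 4, 'y': 2} [('x', 0, 1), ('y', 0, 2), ('x', 1, 4)]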
Concurrency Control (1)
• The goal of concurrency control is to allow multiple transactions
to be executed simultaneously
• Final result should be the same as if all transactions had run
sequentially

Fig. 5-23 General organization of managers for handling transactions


Concurrency Control (2)

General organization of
managers for handling
distributed transactions.
Serializability (1)
(a) T1: BEGIN_TRANSACTION; x = 0; x = x + 1; END_TRANSACTION
(b) T2: BEGIN_TRANSACTION; x = 0; x = x + 2; END_TRANSACTION
(c) T3: BEGIN_TRANSACTION; x = 0; x = x + 3; END_TRANSACTION

(a) – (c) Three transactions T1, T2, and T3

Schedule 1 x = 0; x = x + 1; x = 0; x = x + 2; x = 0; x = x + 3 Legal
Schedule 2 x = 0; x = 0; x = x + 1; x = x + 2; x = 0; x = x + 3; Legal
Schedule 3 x = 0; x = 0; x = x + 1; x = 0; x = x + 2; x = x + 3; Illegal

(d)

(d) Possible schedules


Question: Why is Schedule 3 illegal? (Its final result, x = 5, cannot be produced by any serial execution of T1, T2, and T3, so the schedule is not serializable.)
Serializability (2)

• Two operations conflict if they operate on the same data item and
at least one of them is a write operation
– read-write conflict: exactly one of the operations is a write
– write-write conflict: both operations are writes

• Concurrency control algorithms can generally be classified by
looking at the way read and write operations are synchronized:
– Using locking
– Explicitly ordering operations using timestamps
Two-Phase Locking (1)
• In two-phase locking (2PL), the scheduler first acquires all the
locks it needs during the growing (1st) phase, and then releases
them during the shrinking (2nd) phase
• See the rules on pg. 284

Fig. 5-26 Two-phase locking
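How the two-phase discipline might be enforced per transaction is sketched below; the shared lock_table dictionary and the return-False-instead-of-blocking convention are simplifications of mine, not the book's scheduler.

    class TwoPhaseLockingTransaction:
        """Sketch of 2PL: locks may only be acquired while the transaction is
        still growing; the first release starts the shrinking phase and no
        further lock may be taken afterwards."""

        def __init__(self, lock_table, tid):
            self.locks = lock_table      # shared dict: data item -> owning transaction
            self.tid = tid
            self.held = set()
            self.shrinking = False

        def lock(self, item):
            if self.shrinking:
                raise RuntimeError("2PL violation: acquiring a lock after a release")
            if self.locks.get(item) not in (None, self.tid):
                return False             # held by another transaction; caller must wait
            self.locks[item] = self.tid
            self.held.add(item)
            return True

        def release(self, item):
            self.shrinking = True        # the shrinking phase has begun
            if item in self.held:
                self.held.discard(item)
                del self.locks[item]

        def release_all(self):           # strict 2PL: release everything at commit/abort
            for item in list(self.held):
                self.release(item)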


Two-Phase Locking (2)
• In strict two-phase locking, the shrinking phase does not take
place until the transaction has finished running and has either
committed or aborted.

Fig. 5-27 Strict two-phase locking


READING:
• Read Chapter 5
