
Distributed Systems

Principles and Paradigms

Chapter 05
Synchronization
Communication & Synchronization

• Why do processes communicate in DS?


– To exchange messages
– To synchronize processes

• Why do processes synchronize in DS?


– To coordinate access to shared resources
– To order events
Time, Clocks and Clock Synchronization

• Time
– Why is time important in DS?
– E.g. UNIX make utility (see Fig. 5-1)

• Clocks (Timer)
– Physical clocks
– Logical clocks (introduced by Leslie Lamport)
– Vector clocks (introduced by Colin Fidge)

• Clock Synchronization
– How do we synchronize clocks with real-world time?
– How do we synchronize clocks with each other?



Physical Clocks (1/3)
Problem: Clock Skew – clocks gradually get out of synch and give
different values
Solution: Coordinated Universal Time (UTC):
• Formerly called GMT (Greenwich Mean Time)
• Based on the number of transitions per second of the cesium 133
atom (very accurate).
• At present, the real time is taken as the average of some 50
cesium clocks around the world – International Atomic Time (TAI)
• Introduces a leap second from time to time to compensate for the
fact that days are gradually getting longer.

UTC is broadcast through shortwave radio (with an accuracy of about
+/- 1 msec) and by satellite (Geostationary Operational Environmental
Satellite, GOES, with an accuracy of about +/- 0.5 msec).

Question: Does this solve all our problems? Don’t we now have
some global timing mechanism?
Physical Clocks (2/3)
Problem: Suppose we have a distributed system with a UTC receiver
somewhere in it; we still have to distribute its time to each
machine.
Basic principle:
• Every machine has a timer that generates an interrupt H (typically
60) times per second.
• There is a clock in machine p that ticks on each timer interrupt.
Denote the value of that clock by Cp (t) , where t is UTC time.
• Ideally, we have that for each machine p, Cp (t) = t, or, in other
words, dC/ dt = 1
• Theoretically, a timer with H=60 should generate 216,000 ticks per
hour
• In practice, the relative error of modern timer chips is about 10^-5
(i.e., between 215,998 and 216,002 ticks per hour)
Physical Clocks (3/3)

Where is the max. drift rate

Goal: Never let two clocks in any system differ by more than time units =>
synchronize at least every 2seconds.



Clock Synchronization Principles

• Principle I: Every machine asks a time server for the accurate time
at least once every δ/(2ρ) seconds (see Fig. 5-5).
But you need an accurate measure of the round-trip delay,
including interrupt handling and processing of incoming
messages (a sketch follows after this list).
• Principle II: Let the time server scan all machines
periodically, calculate an average, and inform each machine
how it should adjust its time relative to its present time.
Ok, you’ll probably get every machine in sync. Note: you
don’t even need to propagate UTC time (why not?)
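The round-trip compensation required by Principle I (in the style of Cristian's algorithm) can be sketched roughly as follows. This is a minimal Python illustration, not code from the slides; get_server_time() is a hypothetical stand-in for the actual request/reply exchange with the time server, and halving the round trip is the usual simplifying assumption.

    import time

    def estimate_server_time(get_server_time):
        """Estimate the server's current time while compensating for the
        round-trip delay (sketch). get_server_time stands in for the real
        request/reply exchange with the time server."""
        t_send = time.monotonic()          # local clock when the request leaves
        server_time = get_server_time()    # clock value reported in the reply
        t_recv = time.monotonic()          # local clock when the reply arrives

        round_trip = t_recv - t_send
        # Assume the reply spent roughly half the round trip in transit;
        # interrupt handling and message-processing time would be subtracted
        # here if they could be measured.
        return server_time + round_trip / 2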



Clock Synchronization Algorithms
• The Berkeley Algorithm
– The time server periodically polls every machine for its time
– The received times are averaged, and each machine is notified of the
amount by which it should adjust its time (see the sketch after this list)
– A centralized algorithm; see Figure 5-6
• Decentralized Algorithm
– Every machine periodically broadcasts its time, once per fixed-length
resynchronization interval
– Each machine averages the values received from all other machines
(or averages after discarding the highest and lowest values)
• Network Time Protocol (NTP)
– The most popular protocol used by machines on the Internet
– Uses an algorithm that combines centralized and decentralized elements
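A minimal Python sketch of the averaging step in the Berkeley algorithm, assuming the server has already collected every machine's reported time; the function name and data layout are illustrative, not from the slides.

    def berkeley_adjustments(server_time, reported_times):
        """Compute per-machine clock corrections, Berkeley style (sketch).

        reported_times maps machine id -> clock value reported to the server.
        The server includes its own clock in the average and returns, for each
        machine, the amount it should add to its clock (possibly negative)."""
        all_times = list(reported_times.values()) + [server_time]
        average = sum(all_times) / len(all_times)
        return {machine: average - t for machine, t in reported_times.items()}

    # Example (values in minutes): server at 3:00, machines at 2:50 and 3:25
    # -> average is 3:05, so A adds 15 minutes and B subtracts 20.
    print(berkeley_adjustments(180, {"A": 170, "B": 205}))   # {'A': 15.0, 'B': -20.0}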
Network Time Protocol (NTP)
• A protocol for synchronizing the clocks of computers over packet-
switched, variable-latency data networks (e.g., the Internet)
• NTP uses UDP on port 123 as its transport. It is designed
particularly to resist the effects of variable latency
• NTPv4 can usually maintain time to within 10 milliseconds (1/100 s)
over the public Internet, and can achieve accuracies of 200
microseconds (1/5000 s) or better in local area networks under ideal
conditions
• visit the following URL to understand NTP in more detail
http://en.wikipedia.org/wiki/Network_Time_Protocol
The Happened-Before Relationship
Problem: We first need to introduce a notion of ordering before we
can order anything.
The happened-before relation on the set of events in a distributed
system is the smallest relation satisfying:

• If a and b are two events in the same process, and a
comes before b, then a → b (a happened before b).
• If a is the sending of a message, and b is the receipt of
that message, then a → b.
• If a → b and b → c, then a → c (the relation is transitive).
Note: if two events, x and y, happen in different processes that do not
exchange messages, then they are said to be concurrent.
Note: this introduces a partial ordering of events in a system with
concurrently operating processes.
Logical Clocks (1/2)

Problem: How do we maintain a global view of the system’s
behavior that is consistent with the happened-before relation?
Solution: attach a timestamp C(e) to each event e, satisfying the
following properties:

P1: If a and b are two events in the same process, and a → b,
then we demand that C(a) < C(b)
P2: If a corresponds to sending a message m, and b to
the receipt of that message, then also C(a) < C(b)
Problem: How do we attach a timestamp to an event when there’s
no global clock? ⇒ maintain a consistent set of logical clocks, one
per process.



Logical Clocks (2/2)

Each process Pi maintains a local counter Ci and adjusts this counter
according to the following rules:
(1) For any two successive events that take place within Pi, Ci is
incremented by 1.
(2) Each time a message m is sent by process Pi, the message
receives a timestamp Tm = Ci.
(3) Whenever a message m is received by a process Pj, Pj adjusts its
local counter Cj to Cj := max{Cj, Tm} + 1.
Property P1 is satisfied by (1); Property P2 by (2) and (3).


This is called Lamport’s algorithm.
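The three rules can be captured in a few lines of Python. This is a minimal sketch (message transport and the surrounding process are left out), not code from the slides:

    class LamportClock:
        """Lamport's logical clock for one process (rules 1-3 above)."""

        def __init__(self):
            self.value = 0

        def local_event(self):
            self.value += 1                 # rule (1): tick for each local event
            return self.value

        def send(self):
            self.value += 1                 # sending is itself an event
            return self.value               # timestamp Tm carried by the message

        def receive(self, tm):
            self.value = max(self.value, tm) + 1   # rule (3): Cj := max{Cj, Tm} + 1
            return self.value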



Logical Clocks – Example

Fig 5-7. (a) Three processes, each with its own clock. The clocks
run at different rates. (b) Lamport’s algorithm corrects the clocks



[Timing diagram: process P1 has events a, b, c, d; P2 has events e, f, g, h, i;
P3 has events j, k, l. Messages are sent from b to f, from k to g, and from d to i.]
• Assign the Lamport logical clock values for all the events in the
above timing diagram. Assume that each process’s local clock is
set to 0 initially.
[Answer: Lamport clock values – P1: a=1, b=2, c=3, d=4;
P2: e=1, f=3, g=4, h=5, i=6; P3: j=1, k=2, l=3]
From the above timing diagram, what can you say about the
following events?
• between a and b: a → b
• between b and f: b → f
• between e and k: concurrent
• between c and h: concurrent
• between k and h: k → h
Total Ordering with Logical Clocks
Problem: it can still occur that two events happen at the
same time. Avoid this by attaching a process number to
an event:
Pi timestamps event e with Ci(e).i
Then: Ci(a).i happened before Cj(b).j if and only if:
1: Ci(a) < Cj(b); or
2: Ci(a) = Cj(b) and i < j
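In code, this total order is just a lexicographic comparison of (clock value, process number) pairs; a tiny illustrative check (the function name is mine, not from the slides):

    def totally_ordered_before(ts_a, ts_b):
        """True if timestamp ts_a = (Ci(a), i) precedes ts_b = (Cj(b), j)."""
        return ts_a < ts_b        # Python compares tuples element by element

    print(totally_ordered_before((4, 2), (4, 3)))   # True: clocks tie, 2 < 3 breaks it
    print(totally_ordered_before((5, 1), (4, 3)))   # False: 5 > 4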



Example: Totally-Ordered Multicast (1/2)
Problem: We sometimes need to guarantee that concurrent updates
on a replicated database are seen in the same order everywhere:
• Process P1 adds $100 to an account (initial value: $1000)
• Process P2 increments account by 1%
• There are two replicas

Outcome: in the absence of proper synchronization, replica #1 will end up
with $1111, while replica #2 ends up with $1110.
Example: Totally-Ordered Multicast (2/2)

• Process Pi sends timestamped message msgi to all others. The
message itself is put in a local queue queuei.
• Any incoming message at Pj is queued in queuej, according to its
timestamp.
• Pj passes a message msgi to its application if:
(1) msgi is at the head of queuej
(2) for each process Pk, there is a message msgk in queuej with a
larger timestamp.
Note: We are assuming that communication is reliable and FIFO
ordered.
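A rough Python sketch of this delivery rule at one process. It assumes reliable FIFO multicast as stated above, uses (Lamport clock, sender id) pairs as totally ordered timestamps, and the class and method names are illustrative, not the slides' notation:

    import heapq

    class TotalOrderQueue:
        """Delivery rule at one process Pj (sketch)."""

        def __init__(self, other_process_ids):
            self.others = set(other_process_ids)
            self.queue = []            # min-heap of (timestamp, sender, payload)

        def on_multicast(self, timestamp, sender, payload):
            """Queue every incoming (and own) multicast message by timestamp."""
            heapq.heappush(self.queue, (timestamp, sender, payload))

        def try_deliver(self):
            """Return messages that may be passed to the application, in order."""
            delivered = []
            while self.queue:
                head_ts, head_sender, payload = self.queue[0]
                senders_with_larger = {s for ts, s, _ in self.queue if ts > head_ts}
                # Deliver the head only if every other process already has a
                # queued message with a larger timestamp (with FIFO channels,
                # nothing with a smaller timestamp can still be in flight).
                if self.others - {head_sender} <= senders_with_larger:
                    heapq.heappop(self.queue)
                    delivered.append(payload)
                else:
                    break
            return delivered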



Fidge’s Logical Clocks

• With Lamport’s clocks, one cannot directly compare the timestamps
of two events to determine their precedence relationship:
- if C(a) is not less than C(b), then a ↛ b (a did not happen before b)
- if C(a) < C(b), it could be that a → b or that a ↛ b
- e.g., events e and b in the previous example figure:
  * C(e) = 1 and C(b) = 2
  * thus C(e) < C(b) but e ↛ b
• The main problem is that a simple integer clock cannot order
both events within a process and events in different processes
• Colin Fidge developed an algorithm that overcomes this problem
• Fidge’s clock is represented as a vector [c1, c2, …, cn] with an
integer clock value for each process (ci contains the clock value
of process i)
Fidge’s Algorithm
Fidge’s logical clock is maintained as follows:
1: Initially all clock values are set to the smallest value.
2: The local clock value is incremented at least once before
each primitive event in a process.
3: The current value of the entire logical clock vector is
delivered to the receiver for every outgoing message.
4: Values in the timestamp vectors are never decremented.
5: Upon receiving a message, the receiver sets the value of
each entry in its local timestamp vector to the maximum of
the two corresponding values in the local vector and in the
remote vector received.
The element corresponding to the sender is a special
case; it is set to one greater than the value received, but
only if the local value is not greater than that received.
• Get r_vector from the received msg sent by process q;
  if l_vector[q] <= r_vector[q] then
      l_vector[q] := r_vector[q] + 1;
  for i := 1 to n do
      l_vector[i] := max(l_vector[i], r_vector[i]);
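The same rules as a small Python sketch (vectors are 0-indexed lists and the function names are mine); note that receiving a message is itself a primitive event, so rule 2 applies to it as well:

    def fidge_local_event(l_vector, p):
        """Rule 2: increment the process's own entry before each primitive event."""
        l_vector[p] += 1
        return l_vector

    def fidge_receive(l_vector, r_vector, q):
        """Rule 5: merge the vector received from process q into the local one."""
        if l_vector[q] <= r_vector[q]:
            l_vector[q] = r_vector[q] + 1                 # sender's entry: one greater
        for i in range(len(l_vector)):
            l_vector[i] = max(l_vector[i], r_vector[i])   # values never decrease (rule 4)
        return l_vector

    # Example: P2's clock just before event f, receiving the message sent at b = [2,0,0]
    t = fidge_receive([0, 1, 0], [2, 0, 0], q=0)   # -> [3, 1, 0]
    t = fidge_local_event(t, p=1)                  # the receive event itself -> [3, 2, 0]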

• Timestamps attached to the events are compared as follows:
• ep → fq iff Tep[p] < Tfq[p]
• (where ep represents an event e occurring in process p, Tep represents
the timestamp vector of the event ep, and the i-th element of Tep is
denoted by Tep[i].)

• This means event ep happened before event fq if and only if process
q received a direct or indirect message from p and that message
was sent after ep had occurred. If ep and fq are in the same process
(i.e., p = q), the local elements of their timestamps represent their
occurrences in the process.
[The same timing diagram as before: P1 has events a, b, c, d; P2 has e, f, g, h, i;
P3 has j, k, l; messages are sent from b to f, from k to g, and from d to i.]

• Assign the Lamport and Fidge logical clock values for all the events
in the above timing diagram. Assume that each process’s logical
clock is set to 0 initially.
[Answer: Lamport and Fidge clock values –
P1: a = 1 [1,0,0], b = 2 [2,0,0], c = 3 [3,0,0], d = 4 [4,0,0];
P2: e = 1 [0,1,0], f = 3 [3,2,0], g = 4 [3,3,3], h = 5 [3,4,3], i = 6 [5,5,3];
P3: j = 1 [0,0,1], k = 2 [0,0,2], l = 3 [0,0,3]]
The above diagram shows both Lamport timestamps (an integer value)
and Fidge timestamps (a vector of integer values) for each event.
– Lamport clocks:
2 < 5 and indeed b → h,
3 < 4 but c ↛ g (c and g are concurrent).

– Fidge clocks:
f → h since 2 < 4 is true,
b → h since 2 < 3 is true,
h ↛ a since 4 < 0 is false,
c and h are concurrent since (3 < 3) is false and (4 < 0) is false.
[Timing diagram with four processes: P1 has events a, b, c, d; P2 has e, f, g, h, i;
P3 has j, k, l; P4 has m, n, o; the processes exchange messages (not reproduced here).]

Assign the Lamport’s and Fidge’s logical clock values for all the events in the above
timing diagram. Assume that each process’s logical clock is set to 0 initially.
From the above timing diagram, what can you say
about the following events?

1. between b and n:
2. between b and o:
3. between m and g:
4. between c and h:
5. between c and l:
6. between j and g:
7. between k and i:
8. between j and h:
READING Reference:
• Colin Fidge, “Logical Time in Distributed
Computing Systems”, IEEE Computer, Vol.
24, No. 8, pp. 28-33, August 1991.
Global State (1/3)

Basic Idea: Sometimes you want to collect the current state of a
distributed computation, called a distributed snapshot. It consists of all
local states and messages in transit.

Important: A distributed snapshot should reflect a consistent state.



Global State (2/3)
Note: any process P can initiate taking a distributed snapshot
• P starts by recording its own local state
• P subsequently sends a marker along each of its outgoing channels
• When Q receives a marker through channel C, its action depends on
whether it had already recorded its local state:
– Not yet recorded: it records its local state, and sends the marker
along each of its outgoing channels
– Already recorded: the marker on C indicates that the channel’s state
should be recorded: all messages received on C after Q recorded its
own state and before this marker arrived.
• Q is finished when it has received a marker along each of its incoming
channels
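A compact sketch of the marker handling just described (the algorithm is usually credited to Chandy and Lamport). This is an illustrative Python skeleton, not the slides' code: the channel names, the send() transport primitive, and get_local_state() are hypothetical stand-ins.

    class SnapshotProcess:
        """Marker handling for a distributed snapshot (illustrative sketch)."""

        MARKER = "MARKER"

        def __init__(self, incoming, outgoing, send, get_local_state):
            self.incoming = incoming            # names of incoming channels
            self.outgoing = outgoing            # names of outgoing channels
            self.send = send                    # hypothetical transport primitive
            self.get_local_state = get_local_state
            self.recorded_state = None          # local state, once recorded
            self.channel_state = {}             # channel -> messages in transit
            self.recording = {}                 # channel -> still recording?

        def start_snapshot(self):
            """Any process may initiate: record own state, then send markers."""
            self.recorded_state = self.get_local_state()
            self.recording = {c: True for c in self.incoming}
            self.channel_state = {c: [] for c in self.incoming}
            for c in self.outgoing:
                self.send(c, self.MARKER)

        def on_message(self, channel, msg):
            if msg == self.MARKER:
                if self.recorded_state is None:     # first marker seen
                    self.start_snapshot()
                self.recording[channel] = False     # this channel's state is complete
                if not any(self.recording.values()):
                    print("snapshot done:", self.recorded_state, self.channel_state)
            elif self.recorded_state is not None and self.recording.get(channel):
                # Received after recording the local state but before the marker
                # on this channel: the message belongs to the channel's state.
                self.channel_state[channel].append(msg)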



Global State (3/3)

(a) Organization of a process and channels for a distributed snapshot


(b) Process Q receives a marker for the first time and records its local state
(c) Q records all incoming messages
(d) Q receives a marker for its incoming channel and finishes recording the state of the
incoming channel
Election Algorithms

Principle: Many distributed algorithms require that some process acts
as a coordinator. The question is how to select this special process
dynamically.

Note: In many systems the coordinator is chosen by hand (e.g., file
servers, DNS servers). This leads to centralized solutions => single
point of failure.

Question: If a coordinator is chosen dynamically, to what extent can
we speak about a centralized or distributed solution?

Question: Is a fully distributed solution, i.e., one without a coordinator,
always more robust than any centralized/coordinated solution?



Election by Bullying (1/2)

Principle: Each process has an associated priority (weight). The
process with the highest priority should always be elected as the
coordinator.
Issue: How do we find the heaviest process?
• Any process can just start an election by sending an election
message to all other processes (assuming you don’t know the weights
of the others).
• If a process Pheavy receives an election message from a lighter
process Plight, it sends a take-over message to Plight. Plight is out of
the race.
• If a process doesn’t get a take-over message back, it wins, and
sends a victory message to all other processes.
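A stripped-down Python sketch of this decision logic from one process's point of view; is_alive() is a hypothetical stand-in for "the heavier process answered my ELECTION message with a take-over", and the actual messaging is abstracted away.

    def bully_election(my_id, all_ids, is_alive):
        """Core of the bully election at one process (sketch)."""
        heavier = [p for p in all_ids if p > my_id]
        if any(is_alive(p) for p in heavier):
            # A heavier process answered with a take-over message: we are out,
            # and that process continues the election among the heavier ones.
            return None
        # No take-over received: this process wins and announces victory.
        return my_id

    # Example: processes 1..7, with 6 and 7 crashed -> 5 wins
    alive = {1, 2, 3, 4, 5}
    print(bully_election(4, range(1, 8), lambda p: p in alive))   # None (5 took over)
    print(bully_election(5, range(1, 8), lambda p: p in alive))   # 5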



Election by Bullying (2/2)

Question: We’re assuming something very important here – what?

Assumption: Each process knows the process number of other processes
Election in a Ring
Principle: Process priority is obtained by organizing processes into a
(logical) ring. Process with the highest priority should be elected as
coordinator.
• Any process can start an election by sending an election message to
its successor. If a successor is down, the message is passed on to the
next successor.
• If a message is passed on, the sender adds itself to the list. When it
gets back to the initiator, everyone had a chance to make its presence
known.
• The initiator sends a coordinator message around the ring containing
a list of all living processes. The one with the highest priority is elected
as coordinator. See Figure 5-12.
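A sketch of a single election pass around the ring; the message passing itself is abstracted away, and the function and parameter names are illustrative, not from the slides.

    def ring_election(initiator, ring, alive):
        """One pass of the ring election (sketch): the election message collects
        the ids of all living processes; when it returns to the initiator, the
        one with the highest id becomes coordinator. 'ring' gives the successor
        order; crashed processes are simply skipped."""
        candidates = []
        n = len(ring)
        start = ring.index(initiator)
        for step in range(n):
            p = ring[(start + step) % n]
            if p in alive:
                candidates.append(p)          # each living process adds itself
        coordinator = max(candidates)
        # A second pass (not shown) circulates the COORDINATOR message with this list.
        return coordinator

    print(ring_election(3, ring=[2, 5, 7, 3, 6, 1], alive={2, 3, 5, 6}))   # 6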

Question: Does it matter if two processes initiate an election?


Question: What happens if a process crashes during the election?



Mutual Exclusion
Problem: A number of processes in a distributed system want exclusive
access to some resource.
Basic solutions:
• Via a centralized server.
• Completely distributed, with no topology imposed.
• Completely distributed, making use of a (logical) ring.
Centralized: Really simple: a coordinator grants requests to access the
resource one at a time, queuing requests while the resource is in use.


Mutual Exclusion: Ricart & Agrawala

Principle: The same as Lamport except that acknowledgments aren’t
sent. Instead, replies (i.e., grants) are sent only when:
• The receiving process has no interest in the shared resource; or
• The receiving process is waiting for the resource, but has lower
priority (known through comparison of timestamps).
In all other cases, reply is deferred (see the algorithm on pg. 267)
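The reply rule can be sketched as a small predicate; the state names and (Lamport clock, process id) timestamps are illustrative conventions of mine, not the book's notation.

    def should_reply(my_state, my_request_ts, incoming_ts):
        """Ricart & Agrawala reply decision at the receiving process (sketch).

        my_state is one of 'RELEASED', 'WANTED', 'HELD'; timestamps are
        (Lamport clock, process id) pairs, so ties are broken by process id.
        Returns True if an OK reply is sent now, False if it is deferred."""
        if my_state == "RELEASED":
            return True                         # no interest in the shared resource
        if my_state == "WANTED":
            return incoming_ts < my_request_ts  # requester has higher priority
        return False                            # HELD: always defer until release

    print(should_reply("WANTED", (8, 2), (6, 1)))   # True: (6, 1) beats (8, 2)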



Mutual Exclusion: Token Ring Algorithm
Essence: Organize processes in a logical ring, and let a token be
passed between them. The one that holds the token is allowed to
enter the critical region (if it wants to)
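A toy sketch of the token circulating around such a ring; wants_cs() stands in for whatever the process uses to decide that it needs the critical region.

    def token_ring_step(holder, ring, wants_cs):
        """One step (sketch): the token holder enters the critical region if it
        wants to, then passes the token to its successor on the ring."""
        if wants_cs(holder):
            print(f"process {holder} enters its critical region")
            # ... critical region ...
            print(f"process {holder} leaves its critical region")
        successor = ring[(ring.index(holder) + 1) % len(ring)]
        return successor              # the token moves on around the ring

    holder = 0
    for _ in range(4):                # circulate the token a few steps
        holder = token_ring_step(holder, ring=[0, 2, 4, 7], wants_cs=lambda p: p == 4)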



Distributed Transactions

• The transaction model

• Classification of transactions

• Concurrency control
The Transaction Model (1)

• Updating a master tape is fault tolerant.

Question: What happens if this computer operation fails?
Both tapes are rewound and the job is restarted
from the beginning without any harm being done
The Transaction Model (2)

Primitive Description

BEGIN_TRANSACTION Make the start of a transaction

END_TRANSACTION Terminate the transaction and try to commit

ABORT_TRANSACTION Kill the transaction and restore the old values

READ Read data from a file, a table, or otherwise

WRITE Write data to a file, a table, or otherwise

Figure 5-18 Example primitives for transactions.


The Transaction Model (3)

(a) BEGIN_TRANSACTION
        reserve BOS -> JFK;
        reserve JFK -> ICN;
        reserve SEL -> KPO;
    END_TRANSACTION

(b) BEGIN_TRANSACTION
        reserve BOS -> JFK;
        reserve JFK -> ICN;
        reserve SEL -> KPO  (full) =>
    ABORT_TRANSACTION

a) Transaction to reserve three flights commits
b) Transaction aborts when the third flight is unavailable
ACID Properties of Transactions

• Atomic
To the outside world, the transaction happens indivisibly

• Consistent
The transaction does not violate system invariants

• Isolated
Concurrent transactions do not interfere with each other

• Durable
Once a transaction commits, the changes are permanent
Nested Transactions
• Constructed from a number of subtransactions
• The top-level transaction may create children that run in parallel with
one another to gain performance or simplify programming
• Each of these children is called a subtransaction and it may also have
one or more subtransactions

• When any transaction or subtransaction starts, it is conceptually given
a private copy of all data in the entire system for it to manipulate as it
wishes
• If it aborts, its private space is destroyed
• If it commits, its private space replaces the parent’s space

• If the top-level transaction aborts, all the changes made in the
subtransactions must be wiped out
Distributed Transactions
- Transactions involving subtransactions that operate on data that
are distributed across multiple machines

- Separate distributed algorithms are needed to handle the locking
of data and committing the entire transaction
Implementing Transactions

1. Private Workspace
• Gives a private workspace (i.e., all the data it has access to) to
a process when it begins a transaction

2. Writeahead Log
• Files are actually modified in place but before any block is
changed, a record is written to a log telling
 which transaction is making the change
 which file and block is being changed
 what the old and new values are
• Only after the log record has been written successfully is the change
made to the file

Question: Why is a log needed?
→ for “rollback” if necessary
Private Workspace

a) The file index and disk blocks for a three-block file
b) The situation after a transaction has modified block 0 and appended block 3
c) After committing
Writeahead Log

(a) A transaction:
    x = 0;
    y = 0;
    BEGIN_TRANSACTION;
        x = x + 1;
        y = y + 2;
        x = y * y;
    END_TRANSACTION;

(b) Log: [x = 0 / 1]
(c) Log: [x = 0 / 1] [y = 0 / 2]
(d) Log: [x = 0 / 1] [y = 0 / 2] [x = 1 / 4]

(a) A transaction; (b) – (d) the log before each statement is executed
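A minimal Python sketch that replays this figure: each assignment appends an [old / new] record to the log before the value is changed, and on abort the log is replayed backwards. The data structures and the rollback-on-exception convention are my own illustration, not the book's.

    def run_transaction_with_wal(data, statements):
        """Apply statements with a write-ahead log (sketch)."""
        log = []
        try:
            for var, compute_new in statements:
                old, new = data[var], compute_new(data)
                log.append((var, old, new))        # write-ahead: log before the change
                data[var] = new
        except Exception:
            for var, old, _ in reversed(log):      # rollback using the log
                data[var] = old
            raise
        return log

    data = {"x": 0, "y": 0}
    log = run_transaction_with_wal(data, [
        ("x", lambda d: d["x"] + 1),       # log entry: [x = 0 / 1]
        ("y", lambda d: d["y"] + 2),       # log entry: [y = 0 / 2]
        ("x", lambda d: d["y"] * d["y"]),  # log entry: [x = 1 / 4]
    ])
    print(data, log)   # {'x': 4, 'y': 2} [('x', 0, 1), ('y', 0, 2), ('x', 1, 4)]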
Concurrency Control (1)
• The goal of concurrency control is to allow multiple transactions
to be executed simultaneously
• Final result should be the same as if all transactions had run
sequentially

Fig. 5-23 General organization of managers for handling transactions


Concurrency Control (2)

General organization of
managers for handling
distributed transactions.
Serializability (1)
(a) T1: BEGIN_TRANSACTION; x = 0; x = x + 1; END_TRANSACTION
(b) T2: BEGIN_TRANSACTION; x = 0; x = x + 2; END_TRANSACTION
(c) T3: BEGIN_TRANSACTION; x = 0; x = x + 3; END_TRANSACTION

(a) – (c) Three transactions T1, T2, and T3

Schedule 1 x = 0; x = x + 1; x = 0; x = x + 2; x = 0; x = x + 3 Legal
Schedule 2 x = 0; x = 0; x = x + 1; x = x + 2; x = 0; x = x + 3; Legal
Schedule 3 x = 0; x = 0; x = x + 1; x = 0; x = x + 2; x = x + 3; Illegal

(d)

(d) Possible schedules


Question: Why is Schedule 3 illegal? (Its final result, x = 5, cannot be produced by any serial execution of T1, T2, and T3, so the schedule is not serializable.)
Serializability (2)

• Two operations conflict if they operate on the same data item and
at least one of them is a write operation
– read-write conflict: exactly one of the operations is a write
– write-write conflict: both operations are writes

• Concurrency control algorithms can generally be classified by
looking at the way read and write operations are synchronized:
– Using locking
– Explicitly ordering operations using timestamps
Two-Phase Locking (1)
• In two-phase locking (2PL), the scheduler first acquires all the
locks it needs during the growing (1st) phase, and then releases
them during the shrinking (2nd) phase
• See the rules on pg. 284

Fig. 5-26 Two-phase locking
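How the two-phase discipline might be enforced per transaction is sketched below; the shared lock_table dictionary and the return-False-instead-of-blocking convention are simplifications of mine, not the book's scheduler.

    class TwoPhaseLockingTransaction:
        """Sketch of 2PL: locks may only be acquired while the transaction is
        still growing; the first release starts the shrinking phase and no
        further lock may be taken afterwards."""

        def __init__(self, lock_table, tid):
            self.locks = lock_table      # shared dict: data item -> owning transaction
            self.tid = tid
            self.held = set()
            self.shrinking = False

        def lock(self, item):
            if self.shrinking:
                raise RuntimeError("2PL violation: acquiring a lock after a release")
            if self.locks.get(item) not in (None, self.tid):
                return False             # held by another transaction; caller must wait
            self.locks[item] = self.tid
            self.held.add(item)
            return True

        def release(self, item):
            self.shrinking = True        # the shrinking phase has begun
            if item in self.held:
                self.held.discard(item)
                del self.locks[item]

        def release_all(self):           # strict 2PL: release everything at commit/abort
            for item in list(self.held):
                self.release(item)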


Two-Phase Locking (2)
• In strict two-phase locking, the shrinking phase does not take
place until the transaction has finished running and has either
committed or aborted.

Fig. 5-27 Strict two-phase locking


READING:
• Read Chapter 5
