
Coordination and Agreement

Outline
Introduction
Distributed Mutual Exclusion
Election Algorithms
Group Communication
Consensus and Related Problems

Introduction

If a collection of processes share a resource or collection of
resources, then mutual exclusion is required to prevent
interference and to ensure consistency when accessing the
resources. This is the critical section problem.
Neither shared variables nor facilities supplied by a single local
kernel can be used to solve it. A solution to distributed mutual
exclusion must be based solely on message passing.
Example Usages

NFS servers are stateless: they do not maintain client state, and
hence do not lock files on behalf of clients. A distributed mutual
exclusion mechanism is required to ensure consistency and prevent
interference
Ethernet and Wi-Fi adapters coordinate access to the same transmission
medium; mutual exclusion is again required
Any application with distributed processes, for example a car-park
management application with entrance and exit processes working
independently

Main Assumptions

A system of N processes pi, i = 1, 2, ..., N,
that do not share variables
Each pair of processes is connected by reliable
channels
Processes are independent of each other
The processes access common resources, but they
do so in a critical section
There is only one critical section
The network does not disconnect
Processes fail only by crashing
A local failure detector is available

Distributed Mutual Exclusion

(1)

[Figure: processes 1 .. n all accessing a shared resource]

Mutual exclusion is very important:
Prevent interference
Ensure consistency when accessing the
resources

Distributed Mutual Exclusion

(2)

Mutual exclusion is useful when the server managing the
resources does not use locks

[Figure: critical section protocol]
Enter() — enter the critical section (blocking)
Access shared resources in the critical section
Exit() — leave the critical section

Distributed Mutual Exclusion

(3)

Distributed mutual exclusion: no shared
variables, only message passing
Properties:

Safety: At most one process may execute in the
critical section at a time

Liveness: Requests to enter and exit the critical
section eventually succeed
No deadlock and no starvation

Ordering: If one request to enter the CS happened-before
another, then entry to the CS is granted in
that order

Performance Evaluation of an Algorithm

The bandwidth consumed: the number of
messages sent in each entry and exit operation
The client delay: the delay incurred by a process at
each entry and exit operation
Throughput of the system: the rate at which
the collection of processes as a whole can access
the critical section. Measured using the
synchronization delay between one process
exiting the critical section and the next process
entering it.
The throughput is greater when the
synchronization delay is shorter.

Mutual Exclusion Algorithms

Central Server Algorithm

Ring-Based Algorithm

Mutual Exclusion using Multicast and Logical Clocks

Maekawa's Voting Algorithm

Mutual Exclusion Algorithms Comparison

Central Server Algorithm

[Figure: a server holds the token and a FIFO queue of requests.
1) P2 requests the token; 2) P3, which holds the token, releases it;
3) the server grants the token to the process at the head of the
queue. P4 is waiting.]
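The central server's behavior can be sketched in a few lines. This is an illustrative model, not an implementation from the slides: the class name and method names are ours, and real deployments would use messages rather than method calls.

```python
# Sketch: a central coordinator granting a mutual-exclusion token,
# with a FIFO queue of waiting requesters (as in the figure above).
from collections import deque

class CentralServer:
    def __init__(self):
        self.holder = None          # process currently holding the token
        self.queue = deque()        # waiting requests, FIFO

    def request(self, pid):
        """1) Request token: granted immediately if free, else queued."""
        if self.holder is None:
            self.holder = pid       # 3) Grant token
            return True
        self.queue.append(pid)
        return False

    def release(self, pid):
        """2) Release token: pass it to the head of the queue, if any."""
        assert self.holder == pid
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder

server = CentralServer()
assert server.request("P1") is True      # P1 enters its critical section
assert server.request("P2") is False     # P2 must wait in the queue
assert server.request("P4") is False
assert server.release("P1") == "P2"      # token passes FIFO to P2
```

Note how the two-message entry (request + grant) and one-message exit match the algorithm's bandwidth cost.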

Ring-Based Algorithm

(1)

[Figure: a group of otherwise unordered processes P1 .. Pn on a
network (e.g. Ethernet), arranged into a logical ring]

Ring-Based Algorithm

(2)

[Figure: the token navigates around the ring P1 → P2 → ... → Pn → P1.
A process calls Enter() when the token arrives, executes its critical
section, then Exit() forwards the token to its neighbor.]
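The token's circulation can be simulated to see the entry order it produces. A minimal sketch, with a hypothetical function name and a hop bound of ours, assuming each holder either enters the CS or immediately forwards the token:

```python
# Sketch: a token circulating around a logical ring; a process enters
# its critical section only while holding the token, then forwards it.
def ring_token_schedule(n, wanting, max_hops=100):
    """Return the order in which processes in `wanting` enter the CS
    as the token travels P0 -> P1 -> ... -> P(n-1) -> P0 -> ..."""
    order, pending, pos = [], set(wanting), 0
    for _ in range(max_hops):
        if not pending:
            break
        if pos in pending:          # holder wants the CS: enter, then exit
            order.append(pos)
            pending.discard(pos)
        pos = (pos + 1) % n         # forward the token to the neighbor
    return order

# The token, starting at P0, reaches P1 before P3:
assert ring_token_schedule(5, {3, 1}) == [1, 3]
```

This also makes the performance trade-off visible: the token consumes bandwidth continuously, even when nobody wants the critical section.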

Mutual Exclusion using Multicast and Logical Clocks

The basic idea is that a process that requires entry to a critical section
multicasts a request message, and can enter it only when all the other
processes have replied to this message

[Figure: subsets S0 = {0,1,2}, S1 = {1,3,5}, S2 = {2,4,5} — in the voting
variant treated later, each process i is required to receive permission
from its subset Si only. Correctness requires that multiple processes can
never receive permission from all members of their respective subsets.]

Mutual Exclusion using Multicast and Logical Clocks

The processes p1, p2, ..., pN bear distinct numeric
identifiers.
They are assumed to possess communication channels
to one another
Messages requesting entry are of the form <T, pi>,
where T is the sender's timestamp and pi is the
sender's identifier.

Mutual Exclusion using Multicast and Logical Clocks (1)

[Figure: P1 and P2 request entering the critical section
simultaneously, with timestamps 19 and 23. P3 replies OK to both.
P1's request <19, P1> is the earlier one, so P2 replies OK to P1 and
queues its own request; P1 enters the critical section while P2 waits
in the queue until P1 exits and replies.]

Mutual Exclusion using Multicast and Logical Clocks (2)

Main steps of the algorithm:

Initialization
  state := RELEASED;
Process pi requests entry to the critical section
  state := WANTED;
  T := request's timestamp;
  Multicast request <T, pi> to all processes;
  Wait until (number of replies received = (N − 1));
  state := HELD;

Mutual Exclusion using Multicast and Logical Clocks (3)

Main steps of the algorithm (cont'd):

On receipt of a request <Ti, pi> at pj (i ≠ j)
  If (state = HELD) OR
     (state = WANTED AND (T, pj) < (Ti, pi))
  Then queue the request from pi without replying;
  Else reply immediately to pi;
To quit the critical section
  state := RELEASED;
  Reply to any queued requests;

Analysis

Bandwidth utilization: the number of messages required
to gain entry into the critical section is 2(N − 1), where
N is the number of processes

N − 1 messages are required to multicast the request
to all members except the process itself
Each other process then responds, giving another N − 1 messages.
Total = 2(N − 1)

Client delay: one round-trip time

Synchronization delay: one message only. The
process that is releasing extracts the head of its
queue and sends that process the reply
message
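The deferral rule at the heart of the algorithm can be stated compactly. A hedged sketch in Python — the function name is ours, and the tuple comparison `(T, pid)` stands in for the total order on timestamped requests used on the previous slides:

```python
# Sketch of the multicast/logical-clock entry rule: a request (T, pid)
# is deferred by pj iff pj holds the CS, or pj wants it and pj's own
# request is earlier in the (timestamp, identifier) order.
def defers(state_j, own_req, incoming_req):
    """True iff pj queues the incoming request instead of replying."""
    if state_j == "HELD":
        return True
    if state_j == "WANTED" and own_req < incoming_req:
        return True                 # lexicographic (T, pid) comparison
    return False

# The slide example: P1 requests with timestamp 19, P2 with 23.
assert defers("WANTED", (19, "P1"), (23, "P2")) is True   # P1 defers P2
assert defers("WANTED", (23, "P2"), (19, "P1")) is False  # P2 replies to P1
assert defers("RELEASED", None, (23, "P2")) is False      # idle: reply
```

Because the `(T, pid)` order is total, exactly one of two simultaneous requesters defers to the other, which is what guarantees safety.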

Maekawa's Voting Algorithm

(1)

Candidate process: must collect sufficient votes to enter the
critical section
Each process pi maintains a voting set Vi (i = 1, ..., N), where
Vi ⊆ {p1, ..., pN}
The sets Vi are chosen such that, for all i, j:
  pi ∈ Vi
  Vi ∩ Vj ≠ ∅  (at least one common member of any two voting sets)
  |Vi| = K  (fairness)
  Each process pj is contained in M of the voting sets Vi

Maekawa's optimal solution

Maekawa proved that the optimal solution
which minimizes K and allows the processes to
achieve mutual exclusion is
K ≈ ceiling(√N)
and M = K

Given N = 7: K = ceiling(√7) = 3
So each process should have a voting set of K = 3
processes, including itself, and each process
should be part of at most 3 sets.

Maekawa's algorithm
Example. Let there be seven processes 0, 1, 2, 3, 4, 5, 6. Then the
optimal solution suggests the following set for each process:
S0 = {0, 1, 2}
S1 = {1, 3, 5}
S2 = {2, 4, 5}
S3 = {0, 3, 4}
S4 = {1, 4, 6}
S5 = {0, 5, 6}
S6 = {2, 3, 6}
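These seven sets can be checked mechanically against Maekawa's conditions. A short verification sketch (the dictionary layout is ours):

```python
# Check the slide's voting sets for N = 7, K = 3: each pi is in its own
# set, any two sets intersect, every set has size K, and each process
# appears in M = K sets.
from itertools import combinations

S = {0: {0, 1, 2}, 1: {1, 3, 5}, 2: {2, 4, 5}, 3: {0, 3, 4},
     4: {1, 4, 6}, 5: {0, 5, 6}, 6: {2, 3, 6}}
K = 3

assert all(i in S[i] for i in S)                        # pi ∈ Vi
assert all(S[i] & S[j] for i, j in combinations(S, 2))  # Vi ∩ Vj ≠ ∅
assert all(len(S[i]) == K for i in S)                   # |Vi| = K
membership = [sum(p in S[i] for i in S) for p in range(7)]
assert membership == [K] * 7                            # each pj in M = K sets
```

The pairwise-intersection check is exactly what makes the common member the "arbitrator" in the ME1 argument later in the deck.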

Maekawa's algorithm
Version 1 {life of process i}
1. Send a timestamped request to each process in Si.
2. Request received: send ack to the process with the
lowest timestamp. Thereafter, "lock" (i.e. commit)
yourself to that process, and keep the others waiting.
3. Enter the CS if you receive an ack from each member
of Si.
4. To exit the CS, send release to every process in Si.
5. Release received: unlock yourself, then send
ack to the next process with the lowest timestamp.


Maekawa's Voting Algorithm

(2)

Main steps of the algorithm:

Initialization
  state := RELEASED;
  voted := FALSE;
For pi to enter the critical section
  state := WANTED;
  Multicast request to all processes in Vi − {pi};
  Wait until (number of replies received = K − 1);
  state := HELD;

pi enters the critical section only after collecting K − 1 votes

Maekawa's Voting Algorithm

(3)

Main steps of the algorithm (cont'd):

On receipt of a request from pi at pj (i ≠ j)
  If (state = HELD OR voted = TRUE)
  Then queue the request from pi without replying;
  Else
    Reply immediately to pi;
    voted := TRUE;
For pi to exit the critical section
  state := RELEASED;
  Multicast release to all processes in Vi − {pi};

Maekawa's Voting Algorithm

(4)

Main steps of the algorithm (cont'd):

On receipt of a release from pi at pj (i ≠ j)
  If (queue of requests is non-empty)
  Then remove the head of the queue, say pk;
    send reply to pk;
    voted := TRUE;
  Else voted := FALSE;

Maekawa's algorithm — version 1
ME1. At most one process can enter its critical
section at any time.

Let i and j attempt to enter their critical sections.
Si ∩ Sj ≠ ∅ implies there is a process k ∈ Si ∩ Sj.
Process k will never send ack to both.
So it acts as the arbitrator and establishes ME1.


Maekawa's algorithm — version 1
ME2. No deadlock. Unfortunately, deadlock is
possible! Assume 0, 1, 2 want to enter their
critical sections.

From S0 = {0, 1, 2}: 0 and 2 send ack to 0, but 1 sends ack to 1.
From S1 = {1, 3, 5}: 1 and 3 send ack to 1, but 5 sends ack to 2.
From S2 = {2, 4, 5}: 4 and 5 send ack to 2, but 2 sends ack to 0.

Now 0 waits for 1 (to send a release), 1 waits for 2 (to send a
release), and 2 waits for 0 (to send a release). So deadlock
is possible!

Maekawa's algorithm — Version 2
Avoiding deadlock
If processes always received messages in
increasing order of timestamp, then
deadlock could be avoided. But this is too
strong an assumption.
Version 2 uses three additional messages:
- failed
- inquire
- relinquish


Maekawa's algorithm — Version 2
New features in version 2

Send ack and set the lock as usual.

If the lock is set and a request with a larger
timestamp arrives, send failed (you have no
chance). If the incoming request has a lower
timestamp, then send inquire (are you in the CS?) to
the locked process.

On receiving inquire, if you have also received at least
one failed message, send relinquish. The recipient resets the lock.

Sanders identified a bug in version 2 — can you find it?


Analysis

Bandwidth utilization: 2√N messages per entry to the
critical section (√N requests plus √N replies). On exit,
send a release message to all the members of your set:
√N messages.
Total consumption = 3√N

Note: Maekawa's analysis of Version 2 reveals a complexity of 7√N

Client delay: one round-trip time

Synchronization delay: one round-trip time

Fault Tolerance

What happens when messages are lost?

None of the algorithms that we have described would tolerate
the loss of messages if the channels were unreliable.

What happens when a process crashes?

The central server algorithm can tolerate the crash failure of
a client process that neither holds nor has requested the
token.
Ring-based algorithm: if the process holding the token
crashes, then there is no way out.
Multicast and logical clock based algorithm: if the crashed
process had not requested entry, then it can be treated
as if it had sent every process an OK reply.
Maekawa's algorithm can tolerate some process crash
failures: if a crashed process is not in a voting set that is
required, then its failure will not affect the other processes.

Outline
Introduction
Distributed Mutual Exclusion
Election Algorithms
Group Communication
Consensus and Related Problems


Election Algorithms

(1)

Objective: elect one process pi from a group of processes
p1 ... pN, even if multiple elections have been started
simultaneously
Utility: elect a primary manager, a master process, a
coordinator or a central server
Each process pi maintains the identity of the elected process
in the variable Electedi (NIL if it isn't defined yet)

Properties to satisfy, for all pi:

Safety: Electedi = NIL or Electedi = P, where P is the
non-crashed process with the largest identifier
Liveness: pi participates and eventually sets
Electedi ≠ NIL, or crashes

Major assumptions: none of the processes crash, the system is
asynchronous, and the identifiers of the processes are unique

Election Algorithms

(2)

Ring-Based Election Algorithm

Bully Algorithm

Election Algorithms Comparison

Ring-Based Election Algorithm

(1)

[Figure: process with identifier 5 starts the election; identifiers
5, 16, 9, 25, 3 travel around the ring, and at each hop only the
larger of the carried and local identifier survives, so 25 wins.]

Ring-Based Election Algorithm

(2)

Initialization
  Participanti := FALSE;
  Electedi := NIL;
pi starts an election
  Participanti := TRUE;
  Send the message <election, pi> to its neighbor;
Receipt of a message <elected, pj> at pi
  Participanti := FALSE;
  Electedi := pj;
  If pi ≠ pj
  Then send the message <elected, pj> to its neighbor;

Ring-Based Election Algorithm

(3)

Receipt of an election message <election, pi> at pj
  If pi > pj
  Then send the message <election, pi> to its neighbor;
    Participantj := TRUE;
  Else If pi < pj AND Participantj = FALSE
  Then send the message <election, pj> to its neighbor;
    Participantj := TRUE;
  Else If pi = pj
  Then Electedj := pj;
    Participantj := FALSE;
    Send the message <elected, pj> to its neighbor;

Analysis

Bandwidth consumption: if only a single
process starts an election, then the worst-performing
case is when its anti-clockwise
neighbor has the highest identifier.

A total of N − 1 messages are required to reach this
neighbor.
After receiving the message, this node will send its own
identifier (the highest). This identifier takes a further
N messages to come back to the originator, letting it
know that it is the coordinator.
The elected message is then sent N times, making
3N − 1 messages in all.
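The message flow above is easy to simulate. A sketch with our own function name, assuming a single starter and distinct identifiers:

```python
# Sketch of the ring election as on the previous slides: the starter
# sends its id clockwise; each hop forwards the larger of the carried
# id and its own; a process that sees its own id back has won.
def ring_election(ids, starter):
    """ids: distinct identifiers in ring order. Returns (winner, msgs)."""
    n = len(ids)
    token = ids[starter]               # the starter marks itself participant
    pos = (starter + 1) % n
    msgs = 1                           # the starter's <election, id>
    while token != ids[pos]:
        token = max(token, ids[pos])   # only the larger id survives
        pos = (pos + 1) % n
        msgs += 1
    return token, msgs + n             # plus one <elected> lap of the ring

# The slides' ring: the process with id 5 starts; 25 is the largest.
winner, _ = ring_election([5, 16, 9, 25, 3], starter=0)
assert winner == 25
```

Running this for a starter whose anti-clockwise neighbor holds the highest identifier reproduces the 3N − 1 worst case.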

Bully Algorithm

(1)

Characteristic: allows processes to crash during an
election

Hypotheses:

Reliable transmission
Synchronous system
  T = 2 × Dtrans + Dprocess, where Dtrans is the message
  transmission delay and Dprocess is the message
  processing delay

Bully Algorithm

(2)

Hypotheses (cont'd):

Each process knows which processes have higher
identifiers, and it can communicate with all such processes

Three types of messages:

Election: starts an election
OK: sent in response to an election message
Coordinator: announces the new coordinator

An election is started by a process when it notices, through
timeouts, that the coordinator has failed

Bully Algorithm

(3)

[Figure: the coordinator fails; process 5 detects the failure first
and sends Election to all higher-numbered processes; they reply OK
and start elections of their own; eventually process 7, the highest
live process, announces itself with Coordinator messages — 7 is the
new coordinator.]

Bully Algorithm

(4)

Initialization
  Electedi := NIL;
pi starts the election
  Send the message (Election, pi) to every pj with pj > pi;
  Wait until all messages (OK, pj) from such pj are received;
  If no message (OK, pj) arrives during T
  Then Elected := pi;
    Send the message (Coordinator, pi) to every pj with pj < pi;
  Else wait until receipt of the message (Coordinator)
    (if it doesn't arrive during another timeout T, it begins
    another election)

Bully Algorithm

(5)

Receipt of the message (Coordinator, pj)
  Elected := pj;
Receipt of the message (Election, pj) at pi
  Send the message (OK, pi) to pj;
  Start an election unless one has already begun;

When a process is started to replace a crashed process, it
begins an election
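The cascade of elections can be modeled to count messages and find the winner. This is an illustrative sketch of the bully rule, not the slides' exact pseudocode: the message accounting (Election to all higher ids, one OK per live higher process, one Coordinator per lower id) follows the assumptions above.

```python
# Sketch: each contender sends Election to all higher ids; live higher
# processes answer OK and contend in turn; the highest live process,
# receiving no OK within T, broadcasts Coordinator.
def bully(ids, alive, detector):
    """Return (new coordinator, total messages) for one election."""
    msgs, elected = 0, None
    contenders, seen = [detector], set()
    while contenders:
        p = contenders.pop(0)
        if p in seen:
            continue
        seen.add(p)
        higher = [q for q in ids if q > p]
        msgs += len(higher)                  # Election to all higher ids
        live_higher = [q for q in higher if q in alive]
        msgs += len(live_higher)             # each live one replies OK
        if live_higher:
            contenders.extend(live_higher)   # they take over the election
        else:
            elected = p                      # no OK within T: p wins
            msgs += sum(q < p for q in ids)  # Coordinator broadcast
    return elected, msgs

# Coordinator 4 has crashed; process 1 detects it and starts the election.
coordinator, _ = bully([1, 2, 3, 4], alive={1, 2, 3}, detector=1)
assert coordinator == 3      # the highest live process wins
```

Starting the election from the lowest identifier, as here, exhibits the O(N²) worst-case message count discussed in the analysis.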

Properties to Satisfy — Test

Recall the required properties, for all pi:

Safety: Electedi = NIL or Electedi = P, where P is the
non-crashed process with the largest identifier
Liveness: pi participates and eventually sets
Electedi ≠ NIL, or crashes

Are these properties met by the bully algorithm?

Property 1? Maybe not. Reason? See the next slide for
the answer.
Property 2? Yes, it is met, since each process
participates in the election.

Failure Reasons of Property 1

Property 1 is met if no crashed process is replaced by a process with a
higher identifier.

It is impossible for two processes to decide that they are the coordinator,
since the process with the lower identifier will discover that the other
exists and defer to it.
But the algorithm is not guaranteed to meet the safety property if
processes that have crashed are replaced by processes with the same
identifier:

A process that replaces a crashed process p may decide that it has the highest
identifier just as another process (which has detected p's crash) decides that it has
the highest identifier.
Two processes will therefore announce themselves as the coordinator concurrently.
Since there are no guarantees on message delivery order, the recipients of
these messages may reach different conclusions about which is the coordinator
process.

Property 1 is also met only if the assumed timeout values are
accurate. If the timeouts occur because of traffic congestion
rather than process failure, the bully algorithm may fail.

Analysis

Bandwidth consumption

In the best case, the process with the second-highest
identifier notices the coordinator's failure.
It can immediately elect itself and send N − 2
Coordinator messages.
The bully algorithm requires O(N²) messages in
the worst case — that is, when the process with the
lowest identifier first detects the coordinator's
failure.

This is because N − 1 processes altogether begin elections,
each sending messages to processes with higher
identifiers.

Consensus

One important application of the consensus problem is the election of a
coordinator or leader in a fault-tolerant environment for initiating some global
action. A consensus algorithm allows you to do this on the fly, without fixing a
"supernode" in advance (which would introduce a single point of failure).
Another application is maintaining consistency in a distributed network:
suppose different sensor nodes monitor the same
environment. If some of these sensor nodes crash (or even
start sending corrupted data due to a hardware fault), a consensus protocol
ensures robustness against such faults.
The data managers in a distributed database system need to agree on
whether they should commit or abort a transaction.
In a replicated system, the nodes may need to agree on where the replicas of
the data must reside.
In a flight control system, multiple processes need to agree on whether to
continue or abort a landing.
In a bank money transfer scenario, both processes — the one that debits
the first account and the one that credits the other account — must
agree on the same amount for the debit and credit transactions.
Reservations.
Distributed, fault-tolerant logging with globally consistent sequencing.

Mutual exclusion and election algorithms: the
catch in those algorithms was that all
processes had to be functioning and able to
communicate with each other. Faults — both
process failures and communication failures —
make this difficult.

Example

As a particular example, consider a service whose
state has been distributed over several nodes. To maintain a
consistent copy of the service state, each node must apply to its
copy the same sequence of updates that have been issued to
modify the service state. So there are two problems to solve:

(1) Disseminate the updates to the nodes that have a copy of the service
state.
(2) Apply the updates in the same order at each copy.

The first problem can be solved by using a reliable multicast
primitive. The second problem is more difficult to solve: the nodes
have to agree on a common value, namely the order in which they
will apply the updates. This well-known problem (namely, the Atomic
Broadcast problem) is actually a classical agreement problem.

Any agreement problem can be seen as a particular instance of a more


general problem, namely, the Consensus problem. In the Consensus
problem, each process proposes a value, and all non-faulty processes have
to agree on a single decision which has to be one of the proposed values

Failures in Distributed Systems

Link failure: a link fails and remains inactive; the
network may become disconnected

Processor crash: at some point, a processor stops
taking steps

Byzantine processor: a processor changes state
arbitrarily and sends messages with arbitrary content

Link Failures

[Figure: five processors p1 .. p5. With non-faulty links all messages
are delivered; with a faulty link, some of the messages are not
delivered.]

Crash Failures

[Figure: with non-faulty processors all messages are sent; when a
processor crashes, some of its messages are not sent.]

Rounds

[Figure: execution proceeds in rounds; after a crash failure the
processor disappears from the network for all subsequent rounds.]

Consensus Problem

Every processor has an input x ∈ X

Termination: eventually every non-faulty
processor must decide on a value y.
Agreement: all decisions by non-faulty
processors must be the same.
Validity: if all inputs are the same, then the
decision of a non-faulty processor must equal
the common input (this avoids trivial
solutions).

Agreement

[Figure: Start — everybody has an initial value; Finish — all
non-faulty processors must decide the same value.]

Validity

[Figure: if everybody starts with the same value (here 1), then the
non-faulty processors must decide that value.]

A simple algorithm for fault-free consensus

Each processor:

1. Broadcasts its input to all processors

2. Decides on the minimum

(Only one round is needed, since the graph is
complete.)
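The two steps above amount to very little code. A sketch (function name ours), valid only under the fault-free assumption:

```python
# Sketch: fault-free consensus on a complete graph in one round —
# every processor broadcasts its input, then decides the minimum
# of all values it has received.
def fault_free_consensus(inputs):
    """Return the decision of each processor (all identical)."""
    received = set(inputs)          # every processor hears every input
    return [min(received) for _ in inputs]

assert fault_free_consensus([0, 1, 2, 3, 4]) == [0, 0, 0, 0, 0]
assert fault_free_consensus([7, 7, 7]) == [7, 7, 7]   # validity holds
```

The crash example on the next slides shows exactly where this breaks: a processor that crashes mid-broadcast makes `received` differ between processors.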

[Figure: Start — inputs 0, 1, 2, 3, 4. Broadcast values — every
processor ends up with {0, 1, 2, 3, 4}. Decide on minimum — every
processor decides 0. Finish.]

This algorithm satisfies the validity condition: if everybody starts
with the same initial value, everybody decides on that value (the
minimum).

Consensus with Crash Failures

The simple algorithm doesn't work:

Each processor:

1. Broadcasts its value to all processors

2. Decides on the minimum

[Figure: the processor with input 0 crashes during its broadcast, so
it does not reach all processors. Those that received the 0 decide 0;
the others decide 1. No consensus!]

If an algorithm solves consensus for
f failed (crashing) processors, we say it is
an f-resilient consensus algorithm.

An f-resilient algorithm

Round 1:
  Broadcast my value
Rounds 2 to f + 1:
  Broadcast any newly received values
End of round f + 1:
  Decide on the minimum value received
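The f + 1 rounds of flooding can be simulated, including a processor that crashes mid-broadcast. A sketch under our own crash model (the `crash_round`/`reach` parameters are illustrative, not from the slides):

```python
# Sketch: the f-resilient flooding algorithm — each live processor
# broadcasts its newly learned values for f+1 rounds, then decides the
# minimum. A crashing processor reaches only `reach` in its crash
# round and is silent afterwards.
def flooding_consensus(inputs, f=1, crash_round=None, crash_pid=None,
                       reach=()):
    n = len(inputs)
    known = [{v} for v in inputs]
    sent = [set() for _ in inputs]
    for rnd in range(f + 1):
        deliveries = []
        for p in range(n):
            new = known[p] - sent[p]          # only newly received values
            sent[p] |= new
            if p == crash_pid and crash_round is not None \
                    and rnd > crash_round:
                targets = set()               # already crashed: silent
            elif p == crash_pid and rnd == crash_round:
                targets = set(reach)          # crashes mid-broadcast
            else:
                targets = set(range(n))
            deliveries.append((new, targets))
        for new, targets in deliveries:       # all sends land together
            for q in targets:
                known[q] |= new
    survivors = [p for p in range(n) if p != crash_pid]
    return [min(known[p]) for p in survivors]

# f = 1: P0 (input 0) crashes in round 1 after reaching only P4;
# round 2 lets P4 relay the 0, so all survivors still agree on 0.
assert flooding_consensus([0, 1, 2, 3, 4], f=1, crash_round=0,
                          crash_pid=0, reach=[4]) == [0, 0, 0, 0]
```

This is precisely the f = 1 scenario pictured on the following slides: the second round exists so that partially delivered values get relayed.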

An f-resilient Algorithm

Example: f = 1 failure, so f + 1 = 2 rounds are needed.

[Figure: Round 1 — everybody broadcasts its value; the processor
with 0 crashes mid-broadcast, so only some processors learn the 0.
Round 2 — everybody broadcasts any new values, so the 0 reaches
everyone. Finish — every survivor holds {0, 1, 2, 3, 4} and decides
the minimum, 0.]

Example: f = 2 failures, so f + 1 = 3 rounds are needed.

[Figure: Round 1 — the first faulty processor crashes while
broadcasting 0, reaching only one other processor. Round 2 — that
processor relays the 0 but itself crashes mid-broadcast (failure 2).
Round 3 — the remaining holders of 0 broadcast it, so every survivor
holds {0, 1, 2, 3, 4}. Finish — all survivors decide the minimum, 0.]

If there are f failures and f + 1 rounds, then
there is at least one round with no failed processor.

Example: with 5 failures and 6 rounds, at least
one round has no failure.

Lemma: In the algorithm, at the end of the
round with no failure, all processors know the same set of values.
Proof: For the sake of contradiction, assume the claim is false. Let x be a
value known only to a subset of the (non-faulty) processors. When
a processor learned x for the first time, it broadcast x to all in
the next round. So the only possibility is that it received x in this
very round, otherwise all the others would know x as well. But in this
round there are no failures, so x must be received by all.

Then, at the end of the round with no failure:

Every (non-faulty) processor knows
about all the values of all other
participating processors.

This knowledge doesn't change until
the end of the algorithm.

Therefore, at the end of the
round with no failure,
everybody would decide the same value.

However, we don't know the position
of this round in advance, so we have to let the algorithm
execute for f + 1 rounds.

Byzantine Failures

[Figure: with non-faulty processors, messages carry correct values.
A Byzantine (faulty) processor sends arbitrary messages, and some
messages may not be sent at all.]

Rounds

[Figure: execution in rounds with Byzantine failures; after failing,
the processor may continue functioning in the network, and may fail
again in later rounds.]

Consensus under link failures:
the two generals problem

There are two generals of the same army who
have encamped a short distance apart.
Their objective is to capture a hill, which is
possible only if they attack simultaneously.
If only one general attacks, he will be defeated.
The two generals can communicate only by
sending messengers, which is not reliable.
Is it possible for them to attack simultaneously?
(No: with unreliable message delivery, no protocol can
guarantee agreement.)

Here the situation is much harder.

In general we need at least 3f + 1 processes in
a system to tolerate f Byzantine failures.

For example, to tolerate 1 failure we need 4 or
more processes.

We also need f + 1 rounds.

Let's see why this happens.

Byzantine scenario

Generals (N of them) surround a city.
Each has an opinion: attack or wait.
They communicate by courier.
In fact, an attack would succeed: the city will
fall.
Waiting will succeed too: the city will surrender.
But if some attack and some wait, disaster
ensues.

Some generals (f of them) are traitors: it
doesn't matter whether they attack or wait, but
we must prevent them from disrupting the
battle.

A traitor can't forge messages from other
generals.

A timeline perspective

[Figure: processes p, q, r, s, t exchange votes.]

Suppose that p and q favor attack, r is a
traitor, and s and t favor waiting — assume
that in a tie vote, we attack.

After the first round the collected votes are:

{attack, attack, wait, wait, traitor's-vote}

What can the traitor do?

Add a legitimate vote of attack: anyone with 3 votes
to attack then knows the outcome.
Add a legitimate vote of wait: the vote now favors wait.
Or send different votes to different folks.
Or don't send a vote at all to some.

Outcomes?

Traitor simply votes:
either all see {a,a,a,w,w} or all see {a,a,w,w,w}.

Traitor double-votes:
some see {a,a,a,w,w} and some see {a,a,w,w,w}.

Traitor withholds some vote(s):
some see {a,a,w,w}, perhaps others see
{a,a,a,w,w}, and still others see {a,a,w,w,w}.

Notice that the traitor can't manipulate the votes
of loyal generals!

What can we do?

Clearly we can't decide yet; some loyal
generals might have contradictory data.

In fact, if anyone has 3 votes to attack, they can
already decide.
Similarly, anyone with 4 votes can decide.
But with 3 votes to wait, a general isn't sure
(one of them could come from the traitor).

So: in round 2, each general sends out witness
messages: "here's what I saw in round 1".

"General Smith sent me: attack (signed, Smith)"

A timeline perspective

[Figure: in the second round, if the traitor didn't behave
identically toward all generals, we can weed out his faulty votes.
All loyal generals conclude "Attack!"; the traitor realizes "Damn!
They're on to me." We attack!]

Consensus under Byzantine faults

Story:

N Byzantine generals out to repel an attack by a
Turkish sultan.
Each general has a preference: attack or retreat.
A coordinated attack or retreat by the loyal generals is
necessary for victory.
Treacherous Byzantine generals could conspire
together and send conflicting messages to mislead the
loyal generals.

Byzantine General Agreement (BGA)

Reliable messages.
It is possible to show that no protocol can tolerate
f failures if N ≤ 3f.

Let's assume N > 4f.

BGA Algorithm

Takes f + 1 rounds.

Rotating coordinator processes (kings):
Pi is the king in round i.

Phase 1:
  Exchange V with the other processes.
  Based on V, decide myvalue (the majority value).
Phase 2:
  Receive the value from the king: kingvalue.
  If V has more than N/2 + f copies of myvalue, then
  V[i] := myvalue, else V[i] := kingvalue.

After f + 1 rounds, decide on V[i].

A Consensus Algorithm

The King algorithm solves consensus in 2(f + 1) rounds
with n processors and f failures, where n > 4f.

Assumptions:
1. The number f must be known to the processors;
2. Processor ids are in {1, ..., n}.

The King algorithm

There are f + 1 phases.

Each phase has 2 broadcast rounds.

In each phase there is a different king.
At least one phase has a king that is non-faulty!

Each processor pi has a preferred value vi.

In the beginning,
the preferred value is set to the initial value.

The King algorithm

Phase k, round 1, processor pi:

Broadcast preferred value vi.
Let a be the majority of the received values (including vi);
in case of a tie, pick an arbitrary value.
Set vi := a.

The King algorithm

Phase k, round 2, king pk:

Broadcast new preferred value vk.

Phase k, round 2, processor pi:

If vi had a majority of fewer than n/2 + f + 1 votes,
then set vi := vk.

The King algorithm

End of phase f + 1:

Each processor decides on its preferred value.
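The two rounds per phase translate into a short simulation. This is an illustrative sketch, not the slides' pseudocode: the Byzantine node is modeled by a `lie(receiver)` function of our own, and ties in the majority are broken toward the smallest value.

```python
# Sketch of the King (phase-king) algorithm for n > 4f, here n = 6, f = 1.
# A Byzantine processor may send different values to different peers;
# the loyal ones still agree because some phase has a loyal king.
def phase_king(prefs, faulty, lie, f=1):
    """prefs: initial values (entry for `faulty` is ignored);
    lie(receiver) is the value the Byzantine node sends that receiver."""
    n = len(prefs)
    v = list(prefs)
    for k in range(f + 1):                 # king of phase k is processor k
        # Round 1: everybody broadcasts; take the majority of received values.
        new_v, strong = [], []
        for i in range(n):
            recv = [lie(i) if s == faulty else v[s] for s in range(n)]
            maj = max(sorted(set(recv)), key=recv.count)  # tie -> smallest
            new_v.append(maj)
            strong.append(recv.count(maj) >= n // 2 + f + 1)
        v = new_v
        # Round 2: the king broadcasts; weak majorities adopt its value.
        for i in range(n):
            if i != faulty and not strong[i]:
                v[i] = lie(i) if k == faulty else v[k]
        # (A loyal king is either strong or adopts its own value,
        # so v[k] is stable while the others read it.)
    return [v[i] for i in range(n) if i != faulty]

# Processor 5 is Byzantine and sends receiver i the value i % 2; both
# kings (0 and 1) are loyal, so the loyal processors agree on 0.
assert phase_king([0, 0, 0, 1, 1, None], faulty=5,
                  lie=lambda i: i % 2) == [0, 0, 0, 0, 0]
```

The `strong` flag is the n/2 + f + 1 threshold from round 2 above: a processor with that many votes keeps its value even against a lying king, which is what the correctness lemma relies on.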

Example: 6 processors, 1 fault

[Figure: king 1 is the faulty processor; king 2 is loyal.]

Phase 1, round 1: everybody broadcasts; each processor receives a
multiset such as 2,1,1,0,0,0 or 2,1,1,1,0,0 and chooses the majority.
Each majority count was at most 3 < n/2 + f + 1 = 5, so in round 2
everybody will choose the king's value.

Phase 1, round 2: the faulty king 1 broadcasts different values
(0, 1, 2) to different processors, so the loyal processors may still
disagree.

Phase 2, round 1: everybody broadcasts and chooses the majority;
again each majority count was below 5, so in round 2 everybody will
choose the king's value.

Phase 2, round 2: the loyal king 2 broadcasts 0; everybody chooses
the king's value, 0 — the final decision.

Correctness of the King algorithm

Lemma 1: At the end of a phase whose king is non-faulty, every
non-faulty processor decides the same value.

Proof: Consider the end of round 1 of that phase.
There are two cases:

Case 1: some node has chosen its preferred
value with a strong majority (≥ n/2 + f + 1 votes).

Case 2: no node has chosen its preferred value with a strong majority.

Case 1:

Suppose some node has chosen its preferred value a
with a strong majority (≥ n/2 + f + 1 votes).

Then at the end of round 1, every other non-faulty node must also
have preferred value a (including the king).

Explanation: at least n/2 + 1 non-faulty nodes must have broadcast a
at the start of round 1, and every node receives those broadcasts, so
a is the majority everywhere.

At the end of round 2:
If a node keeps its own value, then it decides a.
If a node takes the value of the king, then it also decides a,
since the king has a.

Therefore: every non-faulty node decides a.

Consensus with Byzantine Failures

How many processors in total are needed to
solve consensus when f = 1?
Suppose n = 2. If p0 has input 0 and p1
has input 1, someone has to change, but not
both. What if one processor is faulty?
How can the other one know?
Suppose n = 3. If p0 has input 0, p1 has
input 1, and p2 is faulty, then a tie-breaker
is needed, but p2 can act maliciously.

Set 10: Consensus with Byzantine Failures (CSCE 668)

Impossibility with N ≤ 3f (outline)

Pease et al. generalized the basic impossibility result.

Simulation-based argument; impossibility is shown by contradiction.
Assume there exists an algorithm for some N ≤ 3f (e.g.
N = 12, f = 4).
Use that algorithm to solve BG for N = 3 and f = 1, thus
reaching a contradiction!
Let three processes each simulate the behavior of 4 of the
12 generals.
Assume one of the three processes is faulty; then the 4 generals
it simulates exhibit Byzantine failures, while all other simulated
generals are correct.
Correctness of the simulated algorithm tells us that it
terminates and the correct processes satisfy agreement and validity.
So 2 correct processes solve consensus in spite of the
failure of the third.
Contradiction (since the N = 3, f = 1 case is unsolvable).

Impossibility (no solution) with N = 3, f = 1

Lamport et al. (1982) considered three
processes with one Byzantine process:
no solution achieves agreement.

Notation: 1:v means "1 says v"; 2:1:v means "2 says 1 says v".

[Figure: two different scenarios appear identical to p2. In one, the
commander p1 is correct and sends 1:v to both lieutenants, but the
faulty p3 relays 3:1:u. In the other, the commander p1 is faulty and
sends different values (1:w and 1:x) to the two lieutenants, while
the correct p3 relays what it received. Faulty processes are shown
coloured. Since p2 cannot distinguish the scenarios, it cannot
always decide correctly.]

BG Algorithm for N = 3f + 1

Two rounds:
1. The commander sends its value to the lieutenants.
2. The lieutenants send the received value to their peers;
each lieutenant then decides on the majority of the values it holds.

[Figure: with a faulty lieutenant p3, the correct commander p1 sends
1:v to all; p2 decides majority(v, v, u) = v and p4 decides
majority(v, v, w) = v, so the correct lieutenants agree on the
commander's value. With a faulty commander sending u, v, w to p2, p3,
p4, each correct lieutenant holds the same multiset and decides the
same value majority(u, v, w). Faulty processes are shown coloured.]

Processor Lower Bound for f = 1

Theorem (5.7): Any consensus algorithm
for 1 Byzantine failure must have at least
4 processors.
Proof: Suppose, in contradiction, that there is a
consensus algorithm A = (A, B, C) for 3
processors and 1 Byzantine failure.

[Figure: a triangle of p0, p1, p2 running components A, B, C.]

Specifying Faulty Behavior

Consider a ring of 6 non-faulty processors running
components of A like this:

[Figure: hexagon of p0 .. p5, running the components of A, with
inputs 0 and 1 assigned around the ring.]

This execution probably doesn't solve consensus (it
doesn't have to).
But the processors do something — and this behavior is used
to specify the behavior of the faulty processors in executions
of A in the triangle.

Getting a Contradiction

Let α0 be this execution:

[Figure: triangle with inputs 0; the faulty processor behaves
toward its neighbors the way the corresponding ring processors
(p3 toward p4, and p0 toward p5) behave in the ring execution.
p1 and p2 must decide 0.]

Getting a Contradiction

Let α1 be this execution:

[Figure: triangle with inputs 1; the faulty processor behaves
like p2 toward p1 and like p5 toward p0 in the ring execution.
p0 and p1 must decide 1.]

The Contradiction

Let β be this execution:

[Figure: p1 is faulty, acting like p1-to-p0 in α1 and like
p4-to-p5 in α0. What do p0 and p2 decide?]

The view of p0 in β equals the view of p0 in α1, so p0 decides 1.
The view of p2 in β equals the view of p2 in α0, so p2 decides 0.

Contradiction!

Views

[Figure: the views of p0 and p2 in β coincide with their views in
α1 and α0 respectively, because the faulty p1 simulates the matching
segments of the ring execution exactly.]
