
UNIT 1

Introduction to Distributed Systems


Concepts
Introduction & Definition
Goals
Hardware Concepts
Software Concepts
Client-Server model
Definition: A Distributed System is one in which
hardware or software components located at networked
computers communicate and coordinate their actions
only by passing messages.
(or)
A Distributed System is a collection of independent
computers that appears to its users as a single coherent
system.
Resource sharing is the main advantage of Distributed
Systems. Resources such as:
Printers
Scanners
Files
Database records, web pages, fax machines, etc.
Goals
1) Connecting Users & Resources:
Sharing remote resources like:
Printers,
Computers,
Storage facilities,
Web pages, etc.
Collaborate and exchange information like:
Files,
Mails,
Audio,
Video, etc.
2. Transparency:
An important goal of a Distributed System is to hide the fact
that its processes and resources are physically distributed
across multiple computers.
Different forms of transparency in a distributed system (ISO, 1995):

Transparency   Description
Access         Hide differences in data representation and how a resource is accessed
Location       Hide where a resource is located
Migration      Hide that a resource may move to another location
Relocation     Hide that a resource may be moved to another location while in use
Replication    Hide that a resource is replicated
Concurrency    Hide that a resource may be shared by several competitive users
Failure        Hide the failure and recovery of a resource
Persistence    Hide whether a (software) resource is in memory or on disk
3. Openness:
An Open Distributed System is a system that offers services
according to standard rules that describe the syntax and
semantics of those services.
Systems should conform to well-defined interfaces
Systems should support portability of applications
Systems should easily interoperate
4. Scalability:
Scalability of a system can be measured along at least three
different dimensions:
Number of users and/or processes (size scalability)
Maximum distance between nodes (geographical
scalability)
Number of administrative domains (administrative
scalability)
Scalability Problems:
Centralized services: a single server for all users
Centralized data: a single online telephone book
Centralized algorithms: doing routing based on complete information
Scaling Techniques:
a) Hiding Communication Latency:
Try to avoid waiting for responses to remote service requests.
b) Distribution:
Taking a component, splitting it into smaller parts, and subsequently
spreading those parts across the system.
Ex: Internet Domain Name System (DNS)
The DNS name space is hierarchically organized into a tree of
domains, which are divided into nonoverlapping zones.
c) Replication:
Copying information to increase availability and decrease
centralized load.
Example: P2P networks distribute copies uniformly or in
proportion to use
Example: Caching is a replication decision made by the client
Example: Web browser cache
Issue: Consistency of replicated information
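As a small illustration of the caching form of replication, here is a minimal Python sketch; the fetch_from_server function and the TTL (time-to-live) policy are assumptions for illustration, and the TTL is exactly where the consistency trade-off shows up:

import time

def fetch_from_server(key):
    return f"value-of-{key}"          # placeholder for a remote lookup

class Cache:
    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds        # bound on staleness: the consistency trade-off
        self.entries = {}             # key -> (value, fetch_time)

    def get(self, key):
        hit = self.entries.get(key)
        if hit and time.time() - hit[1] < self.ttl:
            return hit[0]             # served locally: no load on the server
        value = fetch_from_server(key)
        self.entries[key] = (value, time.time())
        return value

cache = Cache(ttl_seconds=10)
print(cache.get("page1"))             # remote fetch
print(cache.get("page1"))             # cache hit until the TTL expires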
Hardware Concepts
Divide all computers into two groups:
Multiprocessors (Shared Memory)
Multicomputers (Private Memory)
Heterogeneous Multicomputers
Homogeneous Multicomputers
Multiprocessors:
All the CPUs have direct access to the shared memory.
Bus-based multiprocessors:
Consist of some number of CPUs all connected to a common
bus, along with a memory module.
The bus will usually be overloaded and performance will drop
drastically.
The solution is to add a high-speed cache memory between
the CPU and the bus.
The cache holds the most recently accessed words.
The problem with bus-based multiprocessors is their limited
scalability.
Switch-based multiprocessors:
Divide the memory into modules and connect them to the
CPUs with a crossbar switch.
An omega network contains four 2x2 switches, each having
two inputs and two outputs.
Each switch can route either input to either output.
Every CPU can access every memory module. These switches can be
set in nanoseconds or less.
This is also known as NUMA (Non-Uniform Memory
Access).
Multicomputers: Multiple computers are connected through a LAN.
Homogeneous Multicomputer Systems:
In a bus-based multicomputer, the processors are connected
through a shared multi-access network such as Fast Ethernet.
In a switch-based multicomputer, messages between the
processors are routed through an interconnection network.
Two popular interconnection topologies are: Mesh (Grid)
and Hypercube.
Heterogeneous Multicomputer Systems:
Today most distributed systems are built on top of a
heterogeneous multicomputer,
i.e., differing in processor type, memory sizes and I/O.
Example: ATM & FDDI
Software Concepts
Software (Operating Systems) plays a vital role in Distributed
Systems.
Divided into two categories:
Tightly coupled Operating Systems (Distributed OS - DOS)
Used in multiprocessors and homogeneous
multicomputers.
Loosely coupled Operating Systems (Network OS - NOS)
Used for heterogeneous multicomputers
Middleware sits on top of a NOS and provides distribution
transparency.
System       Description                                      Main Goal
DOS          Tightly-coupled operating system for multi-      Hide and manage
             processors and homogeneous multicomputers        hardware resources
NOS          Loosely-coupled operating system for             Offer local services to
             heterogeneous multicomputers (LAN and WAN)       remote clients
Middleware   Additional layer atop a NOS implementing         Provide distribution
             general-purpose services                         transparency
Uniprocessor Operating Systems:
Allow users and applications an easy way of sharing resources
such as the CPU, main memory, disks and peripheral devices.
The kernel acts as an interface between the operating system and the hardware.
In kernel mode, all instructions are permitted to be executed.
A microkernel contains only the code that must execute in kernel
mode.
Multiprocessor Operating Systems:
Support for multiple processors having access to a shared
memory.
Support high performance through multiple CPUs.
Inter-process communication plays a vital role.
Concurrent access is the main problem with inter-process
communication.
Techniques for inter-process communication are:
Semaphores
Monitors
Multicomputer Operating Systems:
Each node has its own kernel containing modules for managing
local resources such as memory, the local CPU, a local disk, etc.
Each node has a separate module for handling interprocessor
communication, i.e., sending & receiving messages to and from
other nodes.
Message passing is used for sending and receiving messages.
Synchronization point                    Send buffer   Reliable comm. guaranteed?
Block sender until buffer not full       Yes           Not necessary
Block sender until message sent          No            Not necessary
Block sender until message received      No            Necessary
Block sender until message delivered     No            Necessary
Distributed Shared Memory Systems:
Provide a virtual shared memory machine, running on a multicomputer.
The address space of memory is divided into pages; this is called
Distributed Shared Memory (DSM).
Pages are spread over all the processors in the system.
When a processor references an address that is not present
locally, a trap occurs, and the operating system fetches the
page containing the address.
Network Operating System:
Collection of uniprocessor systems, each with its own OS.
Machines and operating systems may be different, but they are
all connected to each other in a computer network.
Typical services: rlogin, rcp
rlogin machine -----------> remote login
rcp machine1:file1 machine2:file2 -----------> remote copy
The file system is supported by one or more machines called file
servers.
Middleware:
Many modern distributed systems are constructed by means of
such an additional layer of what is called middleware.
Positioning Middleware:
An additional layer of software between applications and the
network OS, offering a higher level of abstraction. Such a
layer is accordingly called middleware.
Middleware models:
A relatively simple model is that of treating everything as a
file, i.e., all resources such as the keyboard, mouse, disk, etc.
Another model is based on Remote Procedure Call (RPC).
When calling such a procedure, parameters are transparently
shipped to the remote machine where the procedure is
subsequently executed, after which the results are sent back to
the caller.
Middleware Services:
Implementation of access transparency, by offering high-
level communication facilities.
Another common service is Naming.
Ex: URL, i.e., a URL contains the name of the server
where the document to which the URL refers is stored.
Provide facilities for Security.
Middleware and Openness:
Middleware interfaces are open,
so that we can implement an Open Middleware Distributed
System.
In an open middleware-based distributed system, the
protocols used by each middleware layer should be the same,
as well as the interfaces they offer to applications.
If different: compatibility issues arise.
If incomplete: users build their own or use lower-layer
services.
Item                      Distributed OS      Distributed OS      Network OS   Middleware-
                          (Multiproc.)        (Multicomp.)                     based OS
Degree of transparency    Very High           High                Low          High
Same OS on all nodes      Yes                 Yes                 No           No
Number of copies of OS    1                   N                   N            N
Basis for communication   Shared memory       Messages            Files        Model specific
Resource management       Global, central     Global, distributed Per node     Per node
Scalability               No                  Moderately          Yes          Varies
Openness                  Closed              Closed              Open         Open

DOS is the most transparent, but closed and only moderately scalable
NOS is not so transparent, but open and scalable
Middleware provides a bit more transparency than NOS
UNIT 2
Communication
Concepts
Layered protocols
Remote Procedure Call
Remote Object Invocation
Message-Oriented Communication
Stream-Oriented Communication
Layered Protocols
OSI: Open Systems Interconnection
Developed by the International Organization for Standardization
(ISO)
Provides a generic framework to discuss the layers and functionalities
of communication protocols.
Standard rules that govern the format, contents, and meaning of the
messages sent and received are called Protocols.
The collection of protocols used in a particular system is called a
Protocol Stack.
Physical layer
Deals with the transmission of bits
Physical interface between the data transmission device
(Ex: Computer) and the transmission medium or network
Concerned with:
Characteristics of the transmission medium, signal levels, data
rates
Data link layer:
Deals with detecting and correcting bit transmission errors
Bits are grouped into frames
Checksums are used for integrity
Network layer:
Performs multi-hop routing across multiple networks
Implemented in end systems and routers
Transport layer:
Packing of data
Reliable delivery of data (breaks a message into pieces small
enough, assigns each one a sequence number, and then sends
them)
Ordering of delivery
Examples:
TCP (connection-oriented)
UDP (connectionless)
RTP (Real-time Transport Protocol)
Session layer
Provides dialog control to keep track of which party is talking
and provides synchronization facilities
Presentation layer
Deals with non-uniform data representation and with
compression and encryption
Application layer
Support for user applications
e.g., HTTP, SMTP, FTP
Middleware Protocols:
Support high-level communication services
The session and presentation layers are merged into the
middleware layer.
Ex: Microsoft ODBC (Open Database Connectivity), OLE
DB
Types of Communication:
Persistent vs. Transient
Persistent messages are stored as long as necessary by the communication
system (e.g., e-mail)
Transient messages are discarded when they cannot be delivered (e.g.,
TCP/IP)
Synchronous vs. Asynchronous
Asynchronous implies the sender proceeds as soon as it sends the message: no
blocking
Synchronous implies the sender blocks till the receiving host buffers the
message
Remote Procedure call
Basic idea: To execute a procedure at a remote site and ship the
results back.
Goal: To make this operation as distribution transparent as
possible (i.e., the remote procedure call should look like a local
one to the calling procedure).
Example:
Swap(a,b);
Steps of a Remote Procedure Call:
1. Client procedure calls client stub in normal way
2. Client stub builds message, calls local OS
3. Client's OS sends message to remote OS
4. Remote OS gives message to server stub
5. Server stub unpacks parameters, calls server
6. Server does work, returns result to the stub
7. Server stub packs it in message, calls local OS
8. Server's OS sends message to client's OS
9. Client's OS gives message to client stub
10.Stub unpacks result, returns to client
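The ten steps above can be seen in miniature with Python's standard-library XML-RPC, where the proxy object plays the role of the client stub; the add procedure and the port number are illustrative choices, not part of any particular system:

import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(a, b):                       # the remote procedure (step 6)
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

proxy = ServerProxy("http://localhost:8000")   # the proxy acts as the client stub
print(proxy.add(2, 3))               # steps 1-10 hide behind this one local-looking call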
Parameter Passing
Packing parameters into a message is called parameter
marshaling.
In a large distributed system, multiple machine types are present.
Each machine has its own representation for numbers, characters,
and other data items.
Asynchronous RPC:
Avoids blocking of the client process.
Allows the client to proceed without getting the final result of
the call.
Deferred synchronous RPC:
One-way RPC model: the client does not wait for an
acknowledgement of the server's acceptance of the request.
Combining two asynchronous RPCs is sometimes called
deferred synchronous RPC.
DCE RPC (Distributed Computing Environment RPC):
DCE is a true middleware system in that it is designed to
execute as a layer of abstraction between existing (network)
operating systems and distributed applications.
Goals of DCE RPC
Make it possible for a client to access a remote service by
simply calling a local procedure.
Components:
Languages
Libraries
Daemon
Utility programs
Others
Generate a prototype IDL
file containing an interface
identifier
Fill in the names of the remote
procedures and their parameters
The IDL compiler is called to
compile it into three files
Writing a Client and a Server
Binding a Client to a Server:
1) Locate the server's machine
2) Locate the server (i.e., the correct process) on that machine.
An endpoint (port) is used by the server's
OS to distinguish incoming messages.
The DCE Daemon maintains an endpoint
table of (server, endpoint) pairs.
Message-Oriented Communication
RPC and RMI enhance access transparency.
But the client has to block until its request has been processed.
Message-Oriented Communication can solve this problem by
messaging.
A) Message-Oriented Transient Communication:
Many distributed systems & applications are built directly
on top of the simple message-oriented model offered by the
Transport Layer.
1) Berkeley Sockets: (introduced in the 1970s in Berkeley UNIX)
A socket is a communication end point to which an
application can write data that are to be sent out over the
underlying network, and from which incoming data can
be read.
Primitive   Meaning
Socket      Create a new communication endpoint
Bind        Attach a local address to a socket
Listen      Announce willingness to accept connections
Accept      Block caller until a connection request arrives
Connect     Actively attempt to establish a connection
Send        Send some data over the connection
Receive     Receive some data over the connection
Close       Release the connection
(Figure: connection-oriented communication using sockets - the server
creates a new endpoint, associates an address with it, waits for a
connection, and blocks waiting for requests; the client creates an
endpoint and connects.)
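A minimal sketch of these primitives using Python's socket module, which wraps the Berkeley interface nearly one-to-one; the port number and the echo behavior are illustrative assumptions:

import socket, threading, time

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # Socket
    s.bind(("localhost", 9090))                             # Bind
    s.listen(1)                                             # Listen
    conn, _ = s.accept()                                    # Accept (blocks)
    conn.send(conn.recv(1024))                              # Receive, then Send (echo)
    conn.close(); s.close()                                 # Close

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)                                             # let the server start

c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)       # Socket
c.connect(("localhost", 9090))                              # Connect
c.send(b"hello")                                            # Send
print(c.recv(1024))                                         # Receive
c.close()                                                   # Close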
2) Message-Passing Interface (MPI):
Sockets were deemed insufficient for two reasons:
a) They support only simple send & receive primitives
b) They are suitable only for TCP/IP, not for other protocols
(e.g., those of high-speed interconnection networks).
MPI is designed for parallel applications and as such is
tailored to transient communication.
Primitive      Meaning
MPI_bsend      Append outgoing message to a local send buffer
MPI_send       Send a message and wait until copied to local or remote buffer
MPI_ssend      Send a message and wait until receipt starts
MPI_sendrecv   Send a message and wait for reply
MPI_isend      Pass reference to outgoing message, and continue
MPI_issend     Pass reference to outgoing message, and wait until receipt starts
MPI_recv       Receive a message; block if there are none
MPI_irecv      Check if there is an incoming message, but do not block
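A small sketch of the blocking and non-blocking send primitives using the mpi4py binding (assumed installed; run under mpiexec with two processes, e.g. mpiexec -n 2 python script.py); the tags and payloads are illustrative:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"data": 42}, dest=1, tag=0)   # like MPI_send: block until buffered
    req = comm.isend("more", dest=1, tag=1)  # like MPI_isend: continue at once
    req.wait()                               # later, confirm the send completed
else:
    print(comm.recv(source=0, tag=0))        # like MPI_recv: block if none yet
    print(comm.recv(source=0, tag=1))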
B) Message-Oriented Persistent Communication:
Message-oriented middleware services are generally known as
Message-Queuing Systems or Message-Oriented
Middleware (MOM).
1) Message-Queuing Model:
Apps communicate by inserting messages in specific queues
Loosely-coupled communication
Support for:
Persistent asynchronous communication
Longer message transfers (e.g., e-mail systems)
Basic interface to a queue in a message-queuing system (a sketch in
code follows the list below):
Primitive   Meaning
Put         Append a message to a specified queue
Get         Block until the specified queue is nonempty, and remove the first message
Poll        Check a specified queue for messages, and remove the first. Never block
Notify      Install a handler to be called when a message is put into the specified
            queue
a) Both the sender & receiver execute during the entire
transmission of a message.
b) Only the sender is executing, while the receiver is passive.
c) The combination of a passive sender and an executing receiver.
d) The system stores messages even while sender & receiver are
passive.
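A minimal in-process sketch of the four primitives, using Python's thread-safe queue as a stand-in for a real message-queuing system; a real MOM would additionally persist messages and work across machines:

import queue, threading

q = queue.Queue()

def put(msg):            # Put: append a message to the queue
    q.put(msg)

def get():               # Get: block until the queue is nonempty
    return q.get()

def poll():              # Poll: check for a message, never block
    try:
        return q.get_nowait()
    except queue.Empty:
        return None

def notify(handler):     # Notify: install a handler called on arrival
    def watch():
        while True:
            handler(q.get())
    threading.Thread(target=watch, daemon=True).start()

notify(lambda m: print("received:", m))
put("hello")             # sender and receiver are decoupled in time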
2) General Architecture of the Message-Queuing Model:
Messages can only be put into and received from local queues
Ensure transmission of messages between the source
queues and destination queues, meanwhile storing the
messages as long as necessary
Each queue is maintained by a queue manager
Queue managers are not only responsible for directly
interacting with applications but are also responsible for
acting as relays (or routers)
3) Message Brokers:
In a message-queuing system, conversions are handled by special
nodes in a queuing network, known as message brokers.
A message broker can be as simple as a reformatter for messages.
Ex: Suppose an incoming message contains a table from a database, in
which records are separated by a special end-of-record delimiter.
If the destination app expects a different delimiter between records,
then a message broker can be used to convert messages to the format
expected by the destination.
Example: IBM MQSeries
Attribute           Description
Transport type      Determines the transport protocol to be used
FIFO delivery       Indicates that messages are to be delivered in the order they are sent
Message length      Maximum length of a single message
Setup retry count   Specifies maximum number of retries to start up the remote MCA
Delivery retries    Maximum times MCA will try to put received message into queue

Primitive   Description
MQopen      Open a (possibly remote) queue
MQclose     Close a queue
MQput       Put a message into an opened queue
MQget       Get a message from a (local) queue
Stream-Oriented Communication
Form of communication in which timing plays a crucial role.
Ex: Audio or Video
A) Support for Continuous Media:
Types of media:
1) Continuous media
Temporal dependence between data items
Ex: Motion - a series of images, audio and video
2) Discrete media
No temporal dependence between data items
Ex: text, still images, object code or executable files
Data Stream
A sequence of data units
Discrete data stream: UNIX pipes or TCP/IP connections
Continuous data stream: an audio file (connection between file
and audio devices)
Transmission modes
Asynchronous: no timing constraints
Synchronous: upper bound on propagation delay
Isochronous: upper and lower bounds on propagation delay
(transfer on time)
Types of stream
Simple stream
Consists of only a single sequence of data
Complex stream
Consists of several related simple streams called sub-
streams.
Ex: stereo audio, movie
Streams & Quality of Service:
Streams are all about timely delivery of data. How do you specify this
Quality of Service (QoS)? Basics:
The required bit rate at which data should be transported.
The maximum delay until a session has been set up (i.e., when an
application can start sending data).
The maximum end-to-end delay (i.e., how long will it take until a data
unit makes it to a recipient).
The maximum delay variance, or jitter.
The maximum round-trip delay.
Enforcing QoS:
There are various network-level tools, such as differentiated services, by
which certain packets can be prioritized.
Ex: expedited forwarding (absolute priority), assured forwarding
Problem: How to reduce the effects of packet loss (when multiple samples
are in a single packet)?
Solution: simply spread the samples over multiple packets (interleaving),
as sketched below.
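A small sketch of the interleaving idea: spreading consecutive samples across packets turns a lost packet into isolated gaps rather than an audible burst; the packet and sample counts are arbitrary:

samples = list(range(16))            # pretend audio samples
PACKETS = 4

# Non-interleaved: packet i carries samples 4i..4i+3 (a burst if lost).
plain = [samples[i*4:(i+1)*4] for i in range(PACKETS)]

# Interleaved: packet i carries samples i, i+4, i+8, i+12.
interleaved = [samples[i::PACKETS] for i in range(PACKETS)]

print(plain[1])        # losing this loses samples 4,5,6,7 in a row
print(interleaved[1])  # losing this loses samples 1,5,9,13 - easy to conceal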
Stream Synchronization:
A) Synchronization between a discrete data stream and a continuous
data stream:
Consider, for example, a slide show on the Web that has been
enhanced with audio.
Each slide is transferred from the server to the client in the form
of a discrete data stream.
At the same time, the client should play out a specific audio
stream that matches the current slide, which is also fetched from the
server.
In this case, the audio stream is to be synchronized with the
presentation of slides.
B) Synchronization between continuous data streams:
An example is playing a movie in which the video stream needs to be
synchronized with the audio, commonly referred to as lip
synchronization.
Another example of synchronization is playing a stereo audio
stream consisting of two sub-streams, one for each channel.
Synchronization Mechanisms:
Explicit Synchronization:
Synchronization is done explicitly by operating on the data
units of simple streams.
Supported by high-level interfaces.
(Clock-synchronization aside: TAI is International Atomic Time, UTC is
Universal Coordinated Time; Cristian's algorithm adjusts a client clock
by (T1 - T0 - I)/2.)
A comparison of three mutual exclusion algorithms:

Algorithm     Messages per entry/exit   Delay before entry          Problems
                                        (in message times)
Centralized   3                         2                           Coordinator crash
Distributed   2(n - 1)                  2(n - 1)                    Crash of any process
Token ring    1 to infinity             0 to n - 1                  Lost token, process crash
Primitive           Description
BEGIN_TRANSACTION   Mark the start of a transaction
END_TRANSACTION     Terminate the transaction and try to commit
ABORT_TRANSACTION   Kill the transaction and restore the old values
READ                Read data from a file, a table, or otherwise
WRITE               Write data to a file, a table, or otherwise
BEGIN_TRANSACTION
reserve WP -> JFK;
reserve JFK -> Nairobi;
reserve Nairobi -> Malindi;
END_TRANSACTION
(a)
BEGIN_TRANSACTION
reserve WP -> JFK;
reserve JFK -> Nairobi;
reserve Nairobi -> Malindi full =>
ABORT_TRANSACTION
(b)
x = 0;
y = 0;
BEGIN_TRANSACTION;
x = x + 1;
y = y + 2;
x = y * y;
END_TRANSACTION;
(a)

Log: [x = 0/1]                                (b)
Log: [x = 0/1] [y = 0/2]                      (c)
Log: [x = 0/1] [y = 0/2] [x = 1/4]            (d)

(a) A transaction. (b)-(d) The write-ahead log (old/new values) as each
assignment is executed.
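A minimal sketch of how such a write-ahead log supports abort: each entry records a variable with its old and new values, and rollback replays the old values in reverse; the variable names follow the example above:

store = {"x": 0, "y": 0}
log = []

def write(var, value):
    log.append((var, store[var], value))   # e.g. ("x", 0, 1) ~ [x = 0/1]
    store[var] = value

def abort():
    for var, old, _new in reversed(log):   # undo, most recent first
        store[var] = old
    log.clear()

write("x", store["x"] + 1)                 # log: [x = 0/1]
write("y", store["y"] + 2)                 # log: [x = 0/1] [y = 0/2]
write("x", store["y"] * store["y"])        # log: ... [x = 1/4]
abort()
print(store)                               # back to {'x': 0, 'y': 0}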
BEGIN_TRANSACTION
x = 0;
x = x + 1;
END_TRANSACTION
(a)
BEGIN_TRANSACTION
x = 0;
x = x + 2;
END_TRANSACTION
(b)
BEGIN_TRANSACTION
x = 0;
x = x + 3;
END_TRANSACTION
(c)

Schedule 1 x = 0; x = x + 1; x = 0; x = x + 2; x = 0; x = x + 3; Legal
Schedule 2 x = 0; x = 0; x = x + 1; x = x + 2; x = 0; x = x + 3; Legal
Schedule 3 x = 0; x = 0; x = x + 1; x = 0; x = x + 2; x = x + 3; Illegal
(d)
DISTRIBUTED SYSTEMS: Principles and Paradigms, Second Edition
Andrew S. Tanenbaum & Maarten van Steen
Chapter 7
Consistency And Replication
Reasons for Replication
Data are replicated to increase the
reliability of a system.
Replication for performance
Scaling in numbers
Scaling in geographical area
Caveat:
The gain in performance comes at the cost of increased bandwidth for
maintaining replication.
Data-centric Consistency Models
Figure 7-1. The general organization of a logical
data store, physically distributed and
replicated across multiple processes.
Continuous Consistency (1)
Figure 7-2. An example of keeping track of consistency
deviations [adapted from (Yu and Vahdat, 2002)].
Continuous Consistency (2)
Figure 7-3. Choosing the appropriate granularity for a conit.
(a) Two updates lead to update propagation.
Continuous Consistency (3)
Figure 7-3. Choosing the appropriate granularity for a conit.
(b) No update propagation is needed (yet).
Sequential Consistency (1)
Figure 7-4. Behavior of two processes operating
on the same data item. The horizontal axis is time.
Sequential Consistency (2)
A data store is sequentially consistent when:
The result of any execution is the same as if
the (read and write) operations by all
processes on the data store
were executed in some sequential order,
and
the operations of each individual process
appear in this sequence
in the order specified by its program.
Sequential Consistency (3)
Figure 7-5. (a) A sequentially consistent data store.
(b) A data store that is not sequentially consistent.
Sequential Consistency (4)
Figure 7-6. Three concurrently-executing processes.
Sequential Consistency (5)
Figure 7-7. Four valid execution sequences for the processes of
Fig. 7-6. The vertical axis is time.
Causal Consistency (1)
For a data store to be considered causally
consistent, it is necessary that the store obeys
the following condition:
Writes that are potentially causally related
must be seen by all processes
in the same order.
Concurrent writes
may be seen in a different order
on different machines.
Causal Consistency (2)
Figure 7-8. This sequence is allowed with a causally-consistent
store, but not with a sequentially consistent store.
Causal Consistency (3)
Figure 7-9. (a) A violation of a causally-consistent store.
Causal Consistency (4)
Figure 7-9. (b) A correct sequence of events
in a causally-consistent store.
Grouping Operations (1)
Necessary criteria for correct synchronization:
An acquire access of a synchronization variable is not
allowed to perform with respect to a process until all updates
to the guarded shared data have been performed with respect
to that process.
Before an exclusive mode access to a synchronization
variable by a process is allowed to perform with respect to
that process, no other process may hold the synchronization
variable, not even in nonexclusive mode.
After an exclusive mode access to a synchronization variable
has been performed, any other process's next
nonexclusive mode access to that synchronization
variable may not be performed until it has performed with
respect to that variable's owner.
Grouping Operations (2)
Figure 7-10. A valid event sequence for entry consistency.
Eventual Consistency
Figure 7-11. The principle of a mobile user accessing
different replicas of a distributed database.
Monotonic Reads (1)
A data store is said to provide monotonic-read
consistency if the following condition holds:
If a process reads the value of a data item x,
any successive read operation on x by that
process will always return that same value or a
more recent value.
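A minimal sketch of one way to enforce monotonic reads, using per-item version numbers; the replica contents and the refuse-stale policy are illustrative assumptions:

replicas = {"L1": ("x-value-v2", 2), "L2": ("x-value-v1", 1)}  # (value, version)

class Client:
    def __init__(self):
        self.seen = 0                       # highest version of x read so far

    def read(self, replica):
        value, version = replicas[replica]
        if version < self.seen:
            raise RuntimeError("replica too stale: wait or forward the reads")
        self.seen = version
        return value

c = Client()
print(c.read("L1"))        # reads version 2 at local copy L1
try:
    print(c.read("L2"))    # version 1 < 2: must not be served as-is
except RuntimeError as e:
    print(e)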
Monotonic Reads (2)
Figure 7-12. The read operations performed by a single process P
at two different local copies of the same data store.
(a) A monotonic-read consistent data store.
Monotonic Reads (3)
Figure 7-12. The read operations performed by a single process P
at two different local copies of the same data store.
(b) A data store that does not provide monotonic reads.
Monotonic Writes (1)
In a monotonic-write consistent store, the following
condition holds:
A write operation by a process on a data item x
is completed before any successive write
operation on x by the same process.
Monotonic Writes (2)
Figure 7-13. The write operations performed by a single process P
at two different local copies of the same data store. (a) A
monotonic-write consistent data store.
Monotonic Writes (3)
Figure 7-13. The write operations performed by a single process P
at two different local copies of the same data store. (b) A data
store that does not provide monotonic-write consistency.
Read Your Writes (1)
A data store is said to provide read-your-writes
consistency if the following condition holds:
The effect of a write operation by a process on data
item x
will always be seen by a successive read
operation on x by the same process.
Read Your Writes (2)
Figure 7-14. (a) A data store that provides
read-your-writes consistency.
Read Your Writes (3)
Figure 7-14. (b) A data store that does not.
Writes Follow Reads (1)
A data store is said to provide writes-follow-reads
consistency if the following holds:
A write operation by a process
on a data item x following a previous read
operation on x by the same process
is guaranteed to take place on the same or a
more recent value of x than that which was read.
Writes Follow Reads (2)
Figure 7-15. (a) A writes-follow-reads consistent data store.
Writes Follow Reads (3)
Figure 7-15. (b) A data store that does
not provide writes-follow-reads consistency.
Replica-Server Placement
Figure 7-16. Choosing a proper cell size for server placement.
Content Replication and Placement
Figure 7-17. The logical organization of different kinds
of copies of a data store into three concentric rings.
Server-Initiated Replicas
Figure 7-18. Counting access requests from different clients.
State versus Operations
Possibilities for what is to be propagated:
1. Propagate only a notification of an update.
2. Transfer data from one copy to another.
3. Propagate the update operation to other
copies.
Pull versus Push Protocols
Figure 7-19. A comparison between push-based and pull-based
protocols in the case of multiple-client, single-server systems.
Remote-Write Protocols
Figure 7-20. The principle of a primary-backup protocol.
Local-Write Protocols
Figure 7-21. Primary-backup protocol in which the primary
migrates to the process wanting to perform an update.
Quorum-Based Protocols
Figure 7-22. Three examples of the voting algorithm. (a) A
correct choice of read and write set. (b) A choice that
may lead to write-write conflicts. (c) A correct choice,
known as ROWA (read one, write all).
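The quorum constraints behind Figure 7-22 can be checked directly: with N replicas, a read quorum NR and a write quorum NW must satisfy NR + NW > N and NW > N/2. A small sketch:

def valid_quorum(n, nr, nw):
    no_rw_conflict = nr + nw > n    # every read set overlaps every write set
    no_ww_conflict = nw > n / 2     # any two write sets always intersect
    return no_rw_conflict and no_ww_conflict

print(valid_quorum(12, 3, 10))  # True  - like Fig. 7-22 (a)
print(valid_quorum(12, 7, 6))   # False - write-write conflicts, like (b)
print(valid_quorum(12, 1, 12))  # True  - ROWA: read one, write all, like (c)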
DISTRIBUTED SYSTEMS: Principles and Paradigms, Second Edition
Andrew S. Tanenbaum & Maarten van Steen
Chapter 8
Fault Tolerance
Fault Tolerance Basic Concepts
Being fault tolerant is strongly related to
what are called dependable systems
Dependability implies the following:
Availability
Reliability
Safety
Maintainability
Types of faults: Transient, Intermittent, Permanent
Failure Models
Figure 8-1. Different types of failures.
Failure Masking by Redundancy
Information Redundancy. For example, adding extra bits
(like in Hamming Codes, see the book Coding and
Information Theory) to allow recovery from garbled bits
Time Redundancy. Repeat actions if need be
Physical Redundancy. Extra equipment or processes are
added to make the system tolerate loss of some components
Failure Masking by Physical Redundancy
Figure 8-2. Triple modular redundancy.
Process Resilience
Achieved by replicating processes into groups.
How to design fault-tolerant groups?
How to reach an agreement within a group when
some members cannot be trusted to give correct
answers?
Flat Groups versus Hierarchical Groups
Figure 8-3. (a) Communication in a flat group.
(b) Communication in a simple hierarchical group.
Failure Masking and Replication
Primary-backup protocol. A primary coordinates all write
operations. If it fails, then the others hold an election to replace the
primary
Replicated-write protocols. Active replication as well as
quorum-based protocols. Corresponds to a flat group
A system is said to be k fault tolerant if it can survive faults in
k components and still meet its specifications.
For fail-silent components, k+1 are enough to be k fault
tolerant
For Byzantine failures, at least 2k+1 extra components are
needed to achieve k fault tolerance
Requires atomic multicast: all requests arrive at all servers in the
same order
Agreement in Faulty Systems (1)
Possible cases:
Synchronous versus asynchronous systems
Communication delay is bounded or not
Message delivery is ordered or not
Message transmission is done through
unicasting or multicasting
Agreement in Faulty Systems (2)
Figure 8-4. Circumstances under which distributed
agreement can be reached.
Two-Army Problem
(Figure: two allied generals, Bonaparte and Alexander, must agree on
when to attack.)
Nonfaulty generals with unreliable communication
Byzantine Generals problem
Red army in the valley, n blue generals each with their own army
surrounding them.
Communication is pairwise, instantaneous and perfect.
However, m of the blue generals are traitors (faulty processes)
and are actively trying to prevent the loyal generals from
reaching agreement. The generals know the value m.
Goal: The generals need to exchange their troop strengths. At the
end of the algorithm, each general has a vector of length n. If ith
general is loyal, then the ith element has their troop strength
otherwise it is undefined.
Conditions for a Solution
All loyal generals decide upon the same plan of action
A small number of traitors cannot cause the loyal
generals to adopt a bad plan
Agreement in Faulty Systems (3)
Figure 8-5. The Byzantine agreement problem for three
nonfaulty and one faulty process. (a) Each process
sends their value to the others.
Byzantine Example
The Byzantine generals problem for 3 loyal generals and 1 traitor:
The generals announce their troop strengths (in units of 1 kilo soldiers)
The vectors that each general assembles based on previous step
The vectors that each general receives
If a value has a majority, then we know it correctly, else it is unknown
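A sketch of the per-position majority step, with one loyal general's received vectors as illustrative data (position 3 is where the traitor's inconsistent values land):

from collections import Counter

# Vectors as seen by one loyal general: its own, plus those relayed by
# the other loyal generals (troop strengths in units of 1 kilo soldiers).
received = [
    [1, 2, 3, 4],      # own vector
    [1, 2, 5, 4],      # relayed by general 2
    [1, 2, 6, 4],      # relayed by general 4; position 3 is the traitor's slot
]

result = []
for position in zip(*received):
    value, count = Counter(position).most_common(1)[0]
    result.append(value if count > len(received) // 2 else "UNKNOWN")

print(result)          # [1, 2, 'UNKNOWN', 4]: no majority for the traitor's slot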
Byzantine Example (2)
The same as on the previous slide, except now with 2
loyal generals and one traitor.
For m faulty processes, we need a total of 3m+1 processes to reach
agreement.
Reliable Client-Server Communication
Example: TCP masks omission failures (e.g. lost messages) but crash
failures of connections are often not masked. The distributed system
may automatically try to set up a new connection...
RMI semantics in the presence of failures.
The client is unable to locate the server
The request message from the client to the server is lost
The server crashes after receiving a request
The reply message from the server to the client is lost
The client crashes after sending a request
RPC semantics in the presence of failures
Client cannot locate server: Raise an exception or send a signal to
client leading to loss in transparency
Lost request messages: Start a timer when sending a request. If timer
expires before a reply is received, send the request again. Server
would need to detect duplicate requests
Server crashes: A server crash before or after executing the request
is indistinguishable from the client side...
At-least-once semantics
At-most-once semantics
Guarantee-nothing semantics!
Exactly-once semantics
Server Crashes (1)
A server in client-server communication
a. Normal case
b. Crash after execution
c. Crash before execution
Server: Prints text on receiving request from client and sends
message to client after text is printed.
Send a completion message just before it actually tells the printer
to do its work
Or after the text has been printed
Client:
Never to reissue a request.
Always reissue a request.
Reissue a request only if it did not receive an acknowledgement of its
request being delivered to the server.
Reissue a request only if it has not received an acknowledgement of its
print request.
Server Crashes (2)
Three events that can happen at the server:
Send the completion message (M)
Print the text (P)
Crash (C)
Server Crashes (3)
These events can occur in six different orderings:
1. M P C: A crash occurs after sending the completion
message and printing the text.
2. M C (P): A crash happens after sending the
completion message, but before the text could be
printed.
3. P M C: A crash occurs after printing the text and
sending the completion message.
4. P C (M): The text is printed, after which a crash occurs
before the completion message could be sent.
5. C (P M): A crash happens before the server could
do anything.
6. C (M P): A crash happens before the server could
do anything.
Server Crashes (4)
M: send the completion message, P: print the text, C: server crash

Different combinations of client and server strategies in the presence
of server crashes:

                          Server strategy M -> P     Server strategy P -> M
Client reissue strategy   MPC    MC(P)   C(MP)       PMC    PC(M)   C(PM)
Always                    DUP    OK      OK          DUP    DUP     OK
Never                     OK     ZERO    ZERO        OK     OK      ZERO
Only when ACKed           DUP    OK      ZERO        DUP    OK      ZERO
Only when not ACKed       OK     ZERO    OK          OK     DUP     OK
RPC Semantics continued...
Lost Reply Messages. Set a timer on the client. If it expires without a reply,
then send the request again. If requests are idempotent, then they
can be repeated without ill effects.
Client Crashes. Creates orphans. An orphan is an active computation on the
server for which there is no client waiting. How to deal with orphans:
Extermination: The client logs each request in a file before sending it. After
a reboot the file is checked and the orphan is explicitly killed off.
Expensive, and cannot locate grand-orphans, etc.
Reincarnation: Divide time into sequentially numbered epochs. When a
client reboots, it broadcasts a message declaring a new epoch. This allows
servers to terminate orphan computations.
Gentle reincarnation: A server tries to locate the owner of orphans before
killing the computation.
Expiration: Each RPC is given a quantum of time to finish its job. If it
cannot finish, then it asks for another quantum. After a crash, a client need
only wait for a quantum to make sure all orphans are gone.
Idempotent Operations
An idempotent operation is one that can be repeated as often as necessary
without any harm being done. E.g., reading a block from a file.
In general, try to make RPC/RMI methods idempotent if possible. If
not, it can be dealt with in a couple of ways, sketched below:
Use a sequence number with each request so the server can detect
duplicates. But now the server needs to keep state for each client.
Have a bit in the message to distinguish between an original and a
duplicate transmission.
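A minimal sketch of the sequence-number approach; the client identifiers and the in-memory table are illustrative assumptions (a real server would bound or persist this per-client state):

last_seen = {}     # client_id -> highest sequence number processed

def handle_request(client_id, seq, operation):
    if seq <= last_seen.get(client_id, -1):
        return "duplicate ignored"        # a retransmission: do not re-execute
    last_seen[client_id] = seq
    return operation()                    # execute exactly this once

print(handle_request("c1", 0, lambda: "appended to file"))
print(handle_request("c1", 0, lambda: "appended to file"))  # retry: ignored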
Reliable Group Communication
Reliable multicasting guarantees that messages
are delivered to all members in a process group.
Reliable multicasting turns out to be surprisingly
tricky.
Basic reliable multicasting schemes
Scalability in reliable multicasting
Non-hierarchical feedback control
Hierarchical feedback control
Atomic multicasting using Virtual Synchrony
Basic Reliable-Multicasting Schemes (1)
Reliable point-to-point channels are available but
reliable communication to a group of processes is
rarely built-in the transport layer. For example,
multicasting uses datagrams, which are not reliable.
Few processes: Set up reliable point-to-point
channels. Inefficient for more processes
What does reliable multicasting mean?
What if a process joins during the communication?
What happens if a sending process crashes during
communication?
How to reach agreement on what does the group look
like?
Basic Reliable-Multicasting Schemes (2)
Assume that processes do not fail and join or leave
the group during communication.
Sending process assigns a sequence number to
each message it multicasts. Messages are received
in the order which they were sent
Each multicast message is kept in a history buffer at
the sender. Assuming that the sender knows the
receivers, the sender simply keeps the message
until all receivers have returned the
acknowledgement (ack).
Sender retransmits on a negative ack or on timeout
before all acks were received
Acks can be piggy-backed. Retransmissions can be
done with point-to-point communication
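A minimal in-process sketch of this scheme: the sender holds each message in a history buffer until every known receiver has acknowledged it; direct method calls stand in for network messages, and retransmission/timeout logic is omitted:

class Sender:
    def __init__(self, receivers):
        self.receivers = receivers
        self.history = {}                    # seq -> message, kept until all-acked
        self.acks = {}                       # seq -> set of receivers that acked
        self.next_seq = 0

    def multicast(self, msg):
        seq = self.next_seq; self.next_seq += 1
        self.history[seq] = msg
        self.acks[seq] = set()
        for r in self.receivers:
            r.deliver(self, seq, msg)

    def ack(self, receiver, seq):
        self.acks[seq].add(receiver)
        if self.acks[seq] == set(self.receivers):
            del self.history[seq]            # stable: safe to discard

class Receiver:
    def __init__(self, name):
        self.name, self.expected = name, 0

    def deliver(self, sender, seq, msg):
        if seq == self.expected:             # in order: deliver and ack
            print(self.name, "delivered", msg)
            self.expected += 1
            sender.ack(self, seq)            # in practice, acks are piggybacked

s = Sender([Receiver("p1"), Receiver("p2")])
s.multicast("m0")
print(s.history)                             # {}: m0 acked by all, so discarded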
Basic Reliable-Multicasting Schemes (3)
Figure 8-9. A simple solution to reliable multicasting when all
receivers are known and are assumed not to fail.
(a) Message transmission. (b) Reporting feedback.
Scalability in Reliable Multicasting
Negative acknowledgements: A receiver returns
feedback only if it is missing a message. This
improves scalability by cutting down on the number
of messages. However this forces the sender to
keep a message in its buffer forever (so we need to
use timeouts for the buffer)
Nonhierarchical feedback control: feedback
suppression via multicasting of negative feedback
Hierarchical feedback control: use subgroups and
coordinators in each subgroup
Nonhierarchical Feedback Control
Figure 8-10. Several receivers have scheduled a request for
retransmission, but the first retransmission request
leads to the suppression of others.
Hierarchical Feedback Control
Figure 8-11. The essence of hierarchical reliable multicasting.
Each local coordinator forwards the message to its children and
later handles retransmission requests.
Atomic Multicasting
The atomic multicast setup to achieve reliable
multicasting in the presence of process failures
requires the following conditions:
A message is delivered to all processes or to none
at all
All messages are delivered in the same order to all
processes
For example, this solves the problem of a replicated
database on top of a distributed system.
Atomic multicasting ensures that non-faulty processes
maintain a consistent view of the database, and forces
reconciliation when a replica recovers and rejoins the
group.
Virtual Synchrony (1)
Figure 8-12. The logical organization of a distributed system to
distinguish between message receipt and message delivery.
Virtual Synchrony (2)
A multicast message m is uniquely associated with a
list of processes to which it should be delivered. This
delivery list corresponds to a group view. Each
process on the list has the same view.
A view change takes place by multicasting a
message vc announcing the joining or leaving of a
process.
Suppose m and vc are simultaneously in transit. We
need to guarantee that m is either delivered to all
processes in the group view G before each of them
is delivered message vc, or m is not delivered at all.
Note that m being not delivered is because the
sender of m crashed.
Virtual Synchrony (3)
A reliable multicast is said to be virtually synchronous
if it has the following properties:
A message multicast to group view G is delivered to
each non-faulty process in G
If the sender of the message crashes during the
multicast, the message may either be delivered to all
processes, or ignored by each of them
The principle is that all multicasts take place between
view changes. All multicasts that are in transit when a
view change takes place are completed before the
view change comes into effect.
Virtual Synchrony (4)
Figure 8-13. The principle of virtual synchronous multicast.
Message Ordering (1)
Virtual synchrony allows us to think about
multicasts as taking place in epochs. But we
can have several possible orderings of the
multicasts:
Unordered multicasts
FIFO-ordered multicasts
Causally-ordered multicasts (requires vector
timestamps)
Totally-ordered multicasts
Message Ordering (2)
Figure 8-14. Three communicating processes in the same group.
The ordering of events per process is shown along the
vertical axis. This shows unordered multicasts.
Message Ordering (3)
Figure 8-15. Four processes in the same group with two different
senders, and a possible delivery order of messages under
FIFO-ordered multicasting
Comparison of Message Orderings
Figure 8-16. Six different versions of virtually synchronous reliable
multicasting.
Implementing Virtual Synchrony (1)
Assume that we have reliable point to point communication (e.g. TCP)
and that messages from the same source are received in the same order
as sent (e.g. TCP).
The main problem is that all messages sent to group view G need to be
delivered to all non-faulty processes in G before the next group membership
change takes place.
Every process in G keeps message m until it knows for sure that all members
in G have received it. If m has been received by all members in G, then m is
said to be stable. Only stable messages are allowed to be delivered. To
ensure stability, it is sufficient to pick an arbitrary process in G and request it
to send m to all other processes.
That arbitrary process can be the coordinator.
Assumes that no process crashes during a view change although it can be
generalized to handle that as well.
Implementing Virtual Synchrony (2)
a. Process 4 notices that process 7 has crashed, sends a view change
b. Process 6 sends out all its unstable messages, followed by a flush
message
c. Process 6 installs the new view when it has received a flush message
from everyone else
Distributed Commit
The distributed commit problem involves having an
operation being performed by each member of a group
or none at all. Examples:
Reliable multicasting is a specific example, with the
operation being the delivery of a message
With distributed transactions, the operation may be
the commit of a transaction at a single site that takes
part in the transaction
Solutions:
One-phase commit, two-phase commit and three-
phase commit
One-Phase Commit
The coordinator tells all processes whether or not to
(locally) perform the operation in question.
If one of the participants cannot perform the operation,
then there is no way to tell the coordinator.
Two-Phase Commit (1)
Figure 8-18.
(a) The finite state machine for the coordinator in 2PC.
(b) The finite state machine for a participant.
Two-Phase Commit (2)
The protocol can fail if a process crashes for other processes
may be waiting indefinitely for a message from the crashed
process. We deal with this using timeouts
Participant blocked in INIT: A participant is waiting for a
VOTE_REQUEST message from the coordinator. On a
timeout, it can locally abort the transaction and thus send a
VOTE_ABORT message to the coordinator
Coordinator blocked in WAIT: If it doesn't get all the votes, it
votes for an abort and sends a GLOBAL_ABORT to all
participants
Participant blocked in READY: The participant cannot simply
decide to abort. It needs to know what message the
coordinator sent.
We can simply block until the coordinator recovers
Or contact another participant Q to see if it can decide from Q's state what to
do. The four cases to deal with here are summarized on the next slide
Two-Phase Commit (3)
Actions taken by a participant P when residing in
state READY and having contacted another
participant Q:

State of Q   Action by P
COMMIT       Make transition to COMMIT
ABORT        Make transition to ABORT
READY        Contact another participant
INIT         Make transition to ABORT
Two-Phase Commit (4)
Outline of the steps taken by the coordinator in a two-phase
commit protocol:

actions by coordinator:

write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}
Two-Phase Commit (5)
Steps taken by a participant process in 2PC:

actions by participant:

write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
    write VOTE_ABORT to local log;
    exit;
}
if participant votes COMMIT {
    write VOTE_COMMIT to local log;
    send VOTE_COMMIT to coordinator;
    wait for DECISION from coordinator;
    if timeout {
        multicast DECISION_REQUEST to other participants;
        wait until DECISION is received;   /* remain blocked */
        write DECISION to local log;
    }
    if DECISION == GLOBAL_COMMIT
        write GLOBAL_COMMIT to local log;
    else if DECISION == GLOBAL_ABORT
        write GLOBAL_ABORT to local log;
} else {
    write VOTE_ABORT to local log;
    send VOTE_ABORT to coordinator;
}
Two-Phase Commit (6)
Steps taken for handling incoming decision requests:

actions for handling decision requests: /* executed by separate thread */
while true {
    wait until any incoming DECISION_REQUEST is received; /* remain blocked */
    read most recently recorded STATE from the local log;
    if STATE == GLOBAL_COMMIT
        send GLOBAL_COMMIT to requesting participant;
    else if STATE == INIT or STATE == GLOBAL_ABORT
        send GLOBAL_ABORT to requesting participant;
    else
        skip; /* participant remains blocked */
}
Three-Phase Commit (1)
If the coordinator crashes in two-phase commit, the
processes may not be able to reach a final decision
and have to block until the coordinator recovers
The three-phase commit protocol avoids blocking
processes in the presence of crashes. The states of the
coordinator and each participant satisfy the following
two conditions:
There is no single state from which it is possible to make a
transition directly to either a COMMIT or an ABORT state
There is no state in which it is not possible to make a final
decision, and from which a transition to a COMMIT state can be
made
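
To make the two conditions concrete, here is a minimal sketch (in Python)
of the participant's transitions, assuming the standard 3PC state names
from Figure 8-22; the dictionary encoding is illustrative, not the book's
pseudocode.

# Participant transitions in 3PC; the extra PRECOMMIT buffer state is what
# separates 3PC from 2PC.
PARTICIPANT_NEXT = {
    "INIT":      ["READY", "ABORT"],      # VOTE_REQUEST received / timeout
    "READY":     ["PRECOMMIT", "ABORT"],  # PREPARE_COMMIT or GLOBAL_ABORT
    "PRECOMMIT": ["COMMIT"],              # GLOBAL_COMMIT
    "COMMIT":    [],
    "ABORT":     [],
}

# Condition 1: no state can move directly to both COMMIT and ABORT.
# (In 2PC, READY violates this, which is why participants can block.)
assert all(not ({"COMMIT", "ABORT"} <= set(successors))
           for successors in PARTICIPANT_NEXT.values())

# Condition 2 says that the only state adjacent to COMMIT (PRECOMMIT)
# is one from which a final decision can always be reached.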
Three-Phase Commit (2)
Figure 8-22. (a) The finite state machine for the coordinator in 3PC.
(b) The finite state machine for a participant.
Recovery
Backward recovery. Roll back the system from an erroneous state to a
previously correct state. This requires the system to record checkpoints,
which raises the following issues:
Checkpointing is relatively costly. It is often combined with message logging
for better performance: messages are logged before sending or before
receiving, and combined with checkpoints to make recovery possible.
Checkpoints alone cannot solve the issue of replaying all messages in
the right order
Backward recovery can enter a recovery loop (the same failure may recur
after rollback), so failure transparency cannot be guaranteed. Some states
can never be rolled back to (e.g., cash already dispensed by an ATM)
Forward recovery. Bring the system to a correct new state from which
it can continue execution. E.g., in an (n,k) block erasure code, a set of k
source packets is encoded into a set of n encoded packets, such that
any set of k encoded packets is enough to reconstruct the original k
source packets.
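
As a worked example of forward recovery, the following Python sketch uses
the simplest possible erasure code, a single XOR parity packet (n = k + 1):
losing any one packet never forces a rollback, because the lost packet is
reconstructed and execution moves forward. The packet contents are made
up for illustration.

# A (k+1, k) erasure code: one XOR parity packet over k source packets.
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

source = [b"pkt0", b"pkt1", b"pkt2"]   # k = 3 source packets
parity = source[0]
for p in source[1:]:
    parity = xor_bytes(parity, p)      # n = k + 1 packets in total

# Suppose packet 1 is lost; any k = 3 surviving packets suffice.
recovered = xor_bytes(xor_bytes(source[0], source[2]), parity)
assert recovered == b"pkt1"            # recovered forward, no rollback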
Stable Storage
We need fault-tolerant disk storage for the checkpoints and
message logs. Examples are the various RAID (Redundant Array
of Independent Disks) schemes (although they are used for
improved performance as well as improved fault
tolerance). Some common schemes:
RAID-0 (block-level striping)
RAID-1 (mirroring)
RAID-5 (block-level striping with distributed parity)
RAID-6 (block-level striping with double distributed
parity)
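
The idea behind stable storage (Figure 8-23) can be sketched as a pair of
mirrored drives written in a fixed order. The Python below models only the
crash-between-writes case; a decayed copy detected by a checksum would
likewise be overwritten from the other drive. The names are illustrative.

# Each "drive" is modelled as a dict mapping block number to data.
drive1, drive2 = {}, {}

def stable_write(block, data):
    drive1[block] = data   # write drive 1 first...
    drive2[block] = data   # ...then drive 2; a crash in between leaves
                           # the drives different, which recovery detects.

def recover(block):
    # After a crash: if the copies differ, drive 1 holds the newer value
    # (it is always written first), so propagate it to drive 2.
    if drive1.get(block) != drive2.get(block):
        drive2[block] = drive1[block]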
Recovery: Stable Storage
Figure 8-23. (a) Stable storage. (b) Crash after drive 1 is updated.
(c) Bad spot due to spontaneous decay can be dealt with.
Checkpointing
Backward error recovery schemes require that a distributed system regularly
records a consistent global state to stable storage. This is known as a distributed
snapshot
In a distributed snapshot, if a process P has recorded the receipt of a message,
then there is also a process Q that has recorded the sending of that message
To recover after a process or system failure, it is best to recover to the most
recent distributed snapshot, also known as the recovery line
Approaches discussed below:
Independent checkpointing
Coordinated checkpointing
Message logging (optimistic or pessimistic)
Checkpointing
Figure 8-24. A recovery line.
Independent Checkpointing
With independent checkpointing, each process records its local state whenever
it chooses. Rolling back one process can then force other processes to roll back
to even earlier checkpoints, cascading in the worst case all the way to the
initial state (the domino effect).
Figure 8-25. The domino effect.
Coordinated Checkpointing
All processes synchronize to jointly write their state to local stable storage, which
implies that the saved state is automatically consistent.
Simple coordinated checkpointing. The coordinator multicasts a
CHECKPOINT_REQUEST to all processes. When a process receives the
request, it takes a local checkpoint, queues any subsequent messages handed to
it by the application it is executing, and acknowledges to the coordinator. When
the coordinator has received an acknowledgement from all processes, it
multicasts a CHECKPOINT_DONE message to allow the blocked processes to
continue. A minimal sketch of this exchange follows.
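
A minimal sketch of the request/acknowledge/done exchange above, simulated
sequentially in a single Python process; Process, on_checkpoint_request and
the other names are illustrative, not from the book.

class Process:
    def __init__(self, name):
        self.name, self.blocked = name, False

    def on_checkpoint_request(self):
        self.take_local_checkpoint()
        self.blocked = True          # queue application messages from now on
        return "ACK"

    def take_local_checkpoint(self):
        print(self.name, "checkpointed")

    def on_checkpoint_done(self):
        self.blocked = False         # deliver queued messages, resume

def coordinator(processes):
    acks = [p.on_checkpoint_request() for p in processes]  # multicast request
    if len(acks) == len(processes):                        # all ACKs received
        for p in processes:
            p.on_checkpoint_done()                         # multicast DONE

coordinator([Process("P"), Process("Q"), Process("R")])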
Incremental snapshot. The coordinator multicasts a checkpoint request only to
those processes it has sent a message to since it last took a checkpoint. When a
process P receives such a request, it forwards it to all those processes to which P
itself has sent a message since the last checkpoint, and so on. A process
forwards the request only once. When all processes have been identified, a
second message is multicast to trigger checkpointing and to allow the processes
to continue where they had left off.
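
The forwarding step amounts to a reachability computation over "who
messaged whom since the last checkpoint"; a minimal Python sketch, with
illustrative names:

# sent_to records, per process, whom it has messaged since its last checkpoint.
def identify(coordinator, sent_to):
    seen, frontier = set(), set(sent_to.get(coordinator, ()))
    while frontier:
        p = frontier.pop()
        if p in seen:
            continue                         # a process forwards only once
        seen.add(p)
        frontier |= set(sent_to.get(p, ()))  # forward to p's recent targets
    return seen                              # these processes must checkpoint

sent_to = {"C": ["P"], "P": ["Q"], "Q": []}
assert identify("C", sent_to) == {"P", "Q"}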
Message Logging
If the transmission of messages can be replayed, we can still
reach a globally consistent state by starting from a checkpointed
state and retransmitting all messages sent since. This helps in
reducing the number of checkpoints
Assumes a piecewise deterministic model, in which execution
consists of deterministic intervals separated by the sending and
receiving of messages
An orphan process is a process that has survived the crash of
another process, but whose state is inconsistent with the
crashed process after its recovery
Message Logging
Incorrect replay of messages after
recovery, leading to an orphan process.
Message Logging Schemes
A message is said to be stable if it can no longer be lost, because it has
been written to stable storage. Stable messages can be used for recovery
by replaying their transmission.
DEP(m): the set of processes that depend upon the delivery of message m.
COPY(m): the set of processes that have a copy of m, but not yet in their
local stable storage.
A process Q is an orphan process if there is a message m such that Q is
contained in DEP(m), while at the same time all processes in COPY(m)
have crashed. We want to avoid this scenario.
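
The orphan condition can be stated directly in code; in this minimal sketch
DEP and COPY are modelled as plain sets of process names, which is an
illustrative simplification.

def is_orphan(q, m, DEP, COPY, crashed):
    # Q depends on m, yet every process still holding a non-stable copy of m
    # has crashed: m can never be replayed, so Q's state is inconsistent.
    return q in DEP[m] and COPY[m] <= crashed

DEP  = {"m": {"Q"}}
COPY = {"m": {"P"}}
assert is_orphan("Q", "m", DEP, COPY, crashed={"P"})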
Pessimistic logging protocol: for each non-stable message m, there is at most
one process dependent upon m, which means that this process is in
COPY(m). In effect, a process P is not allowed to send any messages after
the delivery of m without first writing m to stable storage
Optimistic logging protocol: after a crash, orphan processes are rolled back
until they are no longer in DEP(m). Much more complicated than pessimistic logging
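
A minimal sketch of the pessimistic rule: before a process sends anything, it
forces every message it has delivered (but not yet logged) to stable storage.
The names stable_log and Process are illustrative.

stable_log = []                      # stands in for stable storage

class Process:
    def __init__(self):
        self.unlogged = []           # delivered but not-yet-stable messages

    def deliver(self, m):
        self.unlogged.append(m)      # this process alone is in COPY(m)

    def send(self, m, dest):
        # Flush every unlogged message first: after this, no other process
        # can become dependent on a message that only we hold.
        stable_log.extend(self.unlogged)
        self.unlogged.clear()
        dest.deliver(m)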