
Distributed Computing - lecture notes

2014

Contents
1. Introduction
2. System architectures
3. Naming
4-5. Synchronization and coordination
6. Replication and Consistency
7. Consensus
8. Distributed transactions
9. Distributed deadlock
10. Distributed file systems
11. Map-Reduce
12. Bigtable
13. P2P systems
14. Fault Tolerance

1. Introduction
What is a distributed system? Andrew Tanenbaum [TvS07] defines it as
A distributed system is a collection of independent computers that appear to its users as a single coherent
system.
This certainly is the ideal form of a distributed system, where the implementation detail of building a powerful system
out of many simpler systems is entirely hidden from the user.
Are there any such systems? Unfortunately, when we look at the reality of networked computers, we find that the
multiplicity of system components usually shines through the abstractions provided by the operating system and other
software. In other words, when we work with a collection of independent computers, we are almost always made
painfully aware of this. For example, some applications require us to identify and distinguish the individual computers
by name while in others our computer hangs due to an error that occurred on a machine that we have never heard of
before.
Throughout this course, we will investigate the various technical challenges that ultimately are the cause for the
current lack of true distributed systems. Moreover, we will investigate various approaches to solving these challenges
and study several systems that provide services that are implemented across a collection of computers, but appear as a
single service to the user.
For the purpose of this course, we propose to use the following weaker definition of a distributed system,
A distributed system is a collection of independent computers that are used jointly to perform a single task or to
provide a single service.
A distributed system by Tanenbaum's definition would surely also be one by our definition; however, our definition is
more in line with the current state of the art as perceived by today's users of distributed systems and, not
surprisingly, it characterises the kind of systems that we will study throughout this course.

Examples of distributed systems


Probably the simplest and most well known example of a distributed system is the collection of Web servers (or more
precisely, servers implementing the HTTP protocol) that jointly provide the distributed database of hypertext and
multimedia documents that we know as the World-Wide Web. Other examples include the computers of a local network
that provide a uniform view of a distributed file system and the collection of computers on the Internet that implement
the Domain Name Service (DNS).
A rather sophisticated version of a distributed system is the XT series of parallel computers by Cray (currently XK7).
These are high-performance machines consisting of a collection of computing nodes that are linked by a high-speed, low-latency network. The operating system, Cray Linux Environment (CLE) (in the past also called UNICOS/lc), presents
users with a standard Linux environment upon login, but transparently schedules login sessions over a number of
available login nodes. However, the implementation of parallel computing jobs on the XT generally requires the
programmer to explicitly manage a collection of compute nodes within the application code using XT-specific versions
of common parallel programming libraries.
Despite the fact that the systems in these examples are all similar (because they fulfill the definition of a distributed
system), there are also many differences between them. The World-Wide Web and DNS, for example, both operate on a
global scale. The distributed file system, on the other hand, operates on the scale of a LAN, while the Cray supercomputer
operates on an even smaller scale, making use of a specially designed high-speed network to connect all of its nodes.

Why do we use distributed systems?


The alternative to using a distributed system is to have a huge centralised system, such as a mainframe. For many
applications there are a number of economic and technical reasons that make distributed systems much more attractive
than their centralised counterparts.
Cost. Better price/performance as long as commodity hardware is used for the component computers.
Performance. By using the combined processing and storage capacity of many nodes, performance levels can be
reached that are beyond the range of centralised machines.
Scalability. Resources such as processing and storage capacity can be increased incrementally.
Reliability. By having redundant components the impact of hardware and software faults on users can be reduced.
Inherent distribution. Some applications, such as email and the Web (where users are spread out over the whole
world), are naturally distributed. This includes cases where users are geographically dispersed as well as when
single resources (e.g., printers, data) need to be shared.
However, these advantages are often offset by the following problems encountered during the use and development of
distributed systems:
New component: network. Networks are needed to connect independent nodes and are subject to performance
limitations. Besides these limitations, networks also constitute new potential points of failure.
Software complexity. As will become clear throughout this course, distributed software is more complex and harder to
develop than conventional software; hence, it is more expensive to develop and there is a greater chance of
introducing errors.
Failure. With many more computers, networks, and other peripherals making up the whole system, there are more
elements that can fail. Distributed systems must be built to survive failure of some of their elements, adding even
more complexity to the system software.
Security. Because a distributed system consists of multiple components there are more elements that can be
compromised and must, therefore, be secured. This makes it easier to compromise distributed systems.

Hardware and Software Architectures


A key characteristic of our definition of distributed systems is that it includes both a hardware aspect (independent
computers) and a software aspect (performing a task and providing a service). From a hardware point of view
distributed systems are generally implemented on multicomputers. From a software point of view they are generally
implemented as distributed operating systems or middleware.

Figure 1: A multicomputer.
Multicomputers
A multicomputer consists of separate computing nodes connected to each other over a network (Figure 1).
Multicomputers generally differ from each other in three ways:
1. Node resources. This includes the processors, amount of memory, amount of secondary storage, etc. available on
each node.
2. Network connection. The network connection between the various nodes can have a large impact on the
functionality and applications that such a system can be used for. A multicomputer with a very high bandwidth
network is more suitable for applications that actively share data over the nodes and modify large amounts of
that shared data. A lower bandwidth network, however, is sufficient for applications where there is less intense
sharing of data.
3. Homogeneity. A homogeneous multicomputer is one where all the nodes are the same, that is, they are based on
the same physical architecture (e.g. processor, system bus, memory, etc.). A heterogeneous multicomputer is one
where the nodes are not expected to be the same.
One common characteristic of all types of multicomputers is that the resources on any particular node cannot be
directly accessed by any other node. All access to remote resources ultimately takes the form of requests sent over the
network to the node where that resource resides.
Distributed Operating System
A distributed operating system (DOS) is an operating system that is built, from the ground up, to provide distributed
services. As such, a DOS integrates key distributed services into its architecture (Figure 2). These services may include
distributed shared memory, assignment of tasks to processors, masking of failures, distributed storage, interprocess
communication, transparent sharing of resources, distributed resource management, etc.
A key property of a distributed operating system is that it strives for a very high level of transparency, ideally
providing a single system image. That is, with an ideal DOS users would not be aware that they are, in fact, working on
a distributed system.
Distributed operating systems generally assume a homogeneous multicomputer. They are also generally more
suited to LAN environments than to wide-area network environments.
In the earlier days of distributed systems research, distributed operating systems were the main topic of interest.
Most research focused on ways of integrating distributed services into the operating system, or on ways of distributing
traditional operating system services. Currently, however, the emphasis has shifted more toward middleware systems.
The main reason for this is that middleware is more flexible (i.e., it does not require that users install and run a particular
operating system), and is more suitable for heterogeneous and wide-area multicomputers.
Figure 2: A distributed operating system (machines A, B and C each run a local kernel; a common layer of distributed operating system services spans all machines over the network and supports the distributed applications).

Middleware
Whereas a DOS attempts to create a specific system for distributed applications, the goal of middleware is to create
system independent interfaces for distributed applications.

Figure 3: A middleware system (machines A, B and C each run a kernel and network OS services; a common layer of middleware services spans the machines over the network and supports the distributed applications).


As shown in Figure 3, middleware consists of a layer of services added between those of a regular network OS[1] and
the actual applications. These services facilitate the implementation of distributed applications and attempt to hide the
heterogeneity (both hardware and software) of the underlying system architectures.
The principal aim of middleware, namely raising the level of abstraction for distributed programming, is achieved
in three ways: (1) communication mechanisms that are more convenient and less error prone than basic message
passing; (2) independence from OS, network protocol, programming language, etc.; and (3) standard services (such as a
naming service, transaction service, security service, etc.).
To make the integration of these various services easier, and to improve transparency and system independence,
middleware is usually based on a particular paradigm, or model, for describing distribution and communication. Since
a paradigm is an overall approach to how a distributed system should be developed, this often manifests itself in a
particular programming model such as "everything is a file", remote procedure call, and distributed objects. Providing
such a paradigm automatically provides an abstraction for programmers to follow, and provides direction for how to
design and set up the distributed applications. Paradigms will be discussed in more detail later on in the course.
Although some forms of middleware focus on adding support for distributed computing directly into a language
(e.g., Erlang, Ada, Limbo, etc.), middleware is generally implemented as a set of libraries and tools that enable retrofitting
of distributed computing capabilities to existing programming languages. Such systems typically use a central
mechanism of the host language (such as the procedure call or method invocation) and dress remote operations up such
that they use the same syntax as that mechanism resulting, for example, in remote procedure calls and remote method
invocation.
Since an important goal of middleware is to hide the heterogeneity of the underlying systems (and in particular of
the services offered by the underlying OS), middleware systems often try to offer a complete set of services so that
clients do not have to rely on underlying OS services directly. This provides transparency for programmers writing
distributed applications using the given middleware. Unfortunately, this "everything but the kitchen sink" approach often
leads to highly bloated systems. As such, current systems exhibit an unhealthy tendency to include more and more
functionality in basic middleware and its extensions, which leads to a jungle of bloated interfaces. This problem has
been recognised and an important topic of research is investigating adaptive and reflective middleware that can be
tailored to provide only what is necessary for particular applications.
With regards to the common paradigms of remote procedure call and remote method invocations, Waldo et al.
[WWWK94] have eloquently argued that there is also a danger in confusing local and remote operations and that initial
application design already has to take the differences between these two types of operations into account. We shall
return to this point later.

[1] A Network OS is a regular OS enhanced with network services such as sockets, remote login, remote file transfer, etc.

Distributed systems and parallel computing


Parallel computing systems aim for improved performance by employing multiple processors to execute a single
application. They come in two flavours: shared-memory systems and distributed memory systems. The former use
multiple processors that share a single bus and memory subsystem. The latter are distributed systems in the sense of
the systems that we are discussing here and use independent computing nodes connected via a network (i.e., a
multicomputer). Despite the promise of improved performance, parallel programming remains difficult and if care is
not taken performance may end up decreasing rather than increasing.

Distributed systems in context


The study of distributed systems is closely related to two other fields: Networking and Operating Systems. The
relationship to networking should be pretty obvious: distributed systems rely on networks to connect the individual
computers together. There is a fine and fuzzy line between when one talks about developing networks and developing
distributed systems. As we will discuss later, the development (and study) of distributed systems concerns itself with
the issues that arise when systems are built out of interconnected networked components, rather than the details of
communication and networking protocols.
The relationship to operating systems may be less clear. To make a broad generalisation, operating systems are
responsible for managing the resources of a computer system, and providing access to those resources in an application
independent way (and dealing with the issues such as synchronisation, security, etc. that arise). The study of distributed
systems can be seen as trying to provide the same sort of generalised access to distributed resources (and likewise
dealing with the issues that arise).
Many distributed applications solve the problems related to distribution in application-specific ways. The goal of
this course is to examine these problems and provide generalised solutions that can be used in any application.
Furthermore we will also examine how these solutions are incorporated into infrastructure software (either distributed
OS or middleware) to ease the job of the distributed application developer and help build well functioning distributed
applications.

Basic Goals
When considering the design and development of distributed systems, in particular in the context of their distributed
nature, there are several key properties that we wish the systems to have. These can be seen as the basic goals of
distributed systems.
Transparency
Scalability
Dependability
Performance
Flexibility
Providing systems with these properties leads to many of the challenges that we cover in this course.
We discuss the goals in turn.
Transparency
Transparency is the concealment from the user and the application programmer of the separation of the components
of a distributed system (i.e., a single image view). Transparency is a strong property that is often difficult to achieve.
There are a number of different forms of transparency including the following:
Access Transparency: Local and remote resources are accessed in same way
Location Transparency: Users are unaware of the location of resources
Migration Transparency: Resources can migrate without name change
Replication Transparency: Users are unaware of the existence of multiple copies of resources
Failure Transparency: Users are unaware of the failure of individual components
Concurrency Transparency: Users are unaware of sharing resources with others
Note that complete transparency is not always desirable due to the trade-offs with performance and scalability, as
well as the problems that can be caused when confusing local and remote operations. Furthermore complete
transparency may not always be possible since nature imposes certain limitations on how fast communication can take
place in wide-area networks.
Scalability
Scalability is important in distributed systems, and in particular in wide area distributed systems, and those expected
to experience large growth. According to Neuman [Neu94] a system is scalable if:
It can handle the addition of users and resources without suffering a noticeable loss of performance or increase
in administrative complexity.
Adding users and resources causes a system to grow. This growth has three dimensions:
Size: A distributed system can grow with regards to the number of users or resources (e.g., computers) that it
supports. As the number of users grows the system may become overloaded (for example, because it must
process too many user requests). Likewise as the number of resources managed by the system grows the
administration that the system has to perform may become too overwhelming for it.
Geography: A distributed system can grow with regards to geography or the distance between nodes. An increased
distance generally results in greater communication delays and the potential for communication failure. Another

aspect of geographic scale is the clustering of users in a particular area. While the whole system may have enough
resources to handle all users, when they are all concentrated in a single area, the resources available there may
not be sufficient to handle the load.
Administration: As a distributed system grows, its various components (users, resources, nodes, networks, etc.) will
start to cross administrative domains. This means that the number of organisations or individuals that exert
administrative control over the system will grow. In a system that scales poorly with regards to administrative
growth this can lead to problems of resource usage, reimbursement, security, etc. In short, an administrative mess.
A claim often made for newly introduced distributed systems (or solutions to specific distributed systems problems)
is that they are scalable. These claims of scalability are often (unintentionally) unfounded because they focus on very
limited aspects of scalability (for example, a vendor may claim that their system is scalable because it can support up to
several hundred servers in a cluster). Although this is a valid claim it says nothing about the scalability with regards to
users, or geographic distribution, for example. Another problem with claims of scalability is that many of the techniques
used to improve scalability (such as replication), introduce new problems that are often fixed using non-scalable
solutions (e.g., the solutions for keeping replicated data consistent may be inherently non-scalable).
Note also that, although a scalable system requires that growth does not affect performance adversely, the
mechanisms to make the system scalable may have adverse effects on the overall system performance (e.g., the
performance of the system when deployed in a small scale, when scalability is not as important, may be less than optimal
due to the overhead of the scalability mechanisms).
The key approach to designing and building a scalable system is decentralisation. Generally this requires avoiding
any form of centralisation, since this can cause performance bottlenecks. In particular a scalable distributed system
must avoid centralising:
components (e.g., avoid having a single server),
tables (e.g., avoid having a single centralised directory of names), and
algorithms (e.g., avoid algorithms based on complete information).
When designing algorithms for distributed systems the following design rules can help avoid centralisation:
Do not require any machine to hold complete system state.
Allow nodes to make decisions based on local information.
Algorithms must survive failure of nodes.
No assumption of a global clock.
Other, more specific, approaches to avoiding centralisation and improving scalability include: Hiding (or masking)
communication delays introduced by wide area networks; Distributing data over various machines to reduce the load
placed on any single machine; and Creating replicas of data and services to reduce the load on any single machine.
Besides spreading overall system load out over multiple machines, distribution and replication also help to bring data
closer to the users, thus improving geographic scalability. Furthermore, by allowing distribution of data, the
management of, and responsibility over, data can be kept within any particular administrative domain, thus helping to
improve administrative scalability as well.
Dependability
Although distributed systems provide the potential for higher availability due to replication, the distributed nature of
services means that more components have to work properly for a single service to function. Hence, there are more
potential points of failure and if the system architecture does not take explicit measures to increase reliability, there
may actually be a degradation of availability. Dependability requires consistency, security, and fault tolerance.
Performance
Any system should strive for maximum performance, but in the case of distributed systems this is a particularly
interesting challenge, since it directly conflicts with some other desirable properties. In particular, transparency,
security, dependability and scalability can easily be detrimental to performance.
Flexibility
A flexible distributed system can be configured to provide exactly the services that a user or programmer needs. A
system with this kind of flexibility generally provides a number of key properties.
Extensibility allows one to add or replace system components in order to extend or modify system functionality.
Openness means that a system provides its services according to standard rules regarding invocation syntax and
semantics. Openness allows multiple implementations of standard components to be produced. This provides
choice and flexibility.
Interoperability ensures that systems implementing the same standards (and possibly even those that do not) can
interoperate.
An important concept with regards to flexibility is the separation of policy and mechanism. A mechanism provides
the infrastructure necessary to do something while a policy determines how that something is done. For example, a
distributed system may provide secure communication by enabling the encryption of all messages. A system where
policy and mechanism is not separated might provide a single hardcoded encryption algorithm that is used to encrypt
all outgoing messages. A more flexible system, on the other hand, would provide the infrastructure (i.e., the mechanism)
needed to call an arbitrary encryption routine when encrypting outgoing messages. In this way the user or programmer
is given an opportunity to choose the most appropriate algorithm to use, rather than a built-in system default.
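As a small illustration of this separation, the sketch below (Python, with made-up names; the XOR routine is purely illustrative and not secure) provides the mechanism as a send function that accepts whatever encryption routine the caller supplies, while the choice of routine remains a policy decision made outside the mechanism.

# Sketch: separating mechanism (sending encrypted messages) from policy
# (which encryption routine to use). All names here are illustrative.
from typing import Callable

def send_encrypted(payload: bytes,
                   encrypt: Callable[[bytes], bytes],
                   transport_send: Callable[[bytes], None]) -> None:
    """Mechanism: encrypt the payload with whatever routine the caller chose,
    then hand the result to the transport."""
    transport_send(encrypt(payload))

# Policy 1: a trivial XOR "cipher" (for illustration only, not secure).
def xor_cipher(data: bytes, key: int = 0x5A) -> bytes:
    return bytes(b ^ key for b in data)

# Policy 2: no encryption at all.
identity = lambda data: data

if __name__ == "__main__":
    sent = []
    send_encrypted(b"hello", xor_cipher, sent.append)  # the caller picks the policy
    send_encrypted(b"hello", identity, sent.append)
    print(sent)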
Component-based architectures are inherently more flexible than monolithic architectures, which makes them
particularly attractive for distributed systems.

Common mistakes
Developing distributed systems is different than developing nondistributed ones. Developers with no experience
typically make a number of false assumptions when first developing distributed applications [RGO06]. All of these
assumptions hold for nondistributed systems, but typically do not hold for distributed systems.
The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
The topology does not change
There is one administrator
Transport cost is zero
Everything is homogeneous
Making these assumptions invariably leads to difficulties achieving the desired goals for distributed systems. For
example, assuming that the network is reliable will lead to trouble when the network starts dropping packets. Likewise
assuming latency is zero will undoubtedly lead to scalability problems as the application grows geographically, or is
moved to a different kind of network. The principles that we discuss in the rest of this course deal with the consequence
of these assumptions being false, and provide approaches to deal with this.

Principles
There are several key principles underlying all distributed systems. As such, any distributed system can be described
based on how those principles apply to that system.
System Architecture
Communication
Synchronisation
Replication and Consistency
Fault Tolerance
Security
Naming
During the rest of the course we will examine each of these principles in detail.

Paradigms
As mentioned earlier, most middleware systems (and, therefore, most distributed systems) are based on a particular
paradigm, or model, for describing distribution and communication. Some of these paradigms are:
Shared memory
Distributed objects
Distributed file system
Distributed coordination
Service Oriented Architecture and Web Services
Shared documents
Agents
Shared memory, distributed objects and distributed file systems, and service oriented architectures will be
discussed in detail during this course (time permitting, other paradigms may also be handled, but in less detail). Note,
however, that the principles and other issues discussed in the course are relevant to all distributed system paradigms.

Rules of Thumb
Finally, although not directly, or solely, related to distributed systems, we present some rules of thumb that are relevant
to the study and design of distributed systems.
Trade-offs: As has been mentioned previously, many of the challenges faced by distributed systems lead to conflicting
requirements. For example, we have seen that scalability and overall performance may conflict, likewise flexibility may
interfere with reliability. At a certain point trade-offs must be made: a choice must be made about which requirement
or property is more important and how far to go when fulfilling that requirement. For example, is it necessary for the
reliability requirement to be absolute (i.e., the system must always be available, e.g., even during a natural disaster or
war) or is it sufficient to require that the system must remain available only in the face of certain problems, but not
others?
Separation of Concerns: When tackling a large, complex, problem (such as designing a distributed system), it is useful
to split the problem up into separate concerns and address each concern individually. In distributed systems, this might,
for example, mean separating the concerns of communication from those of replication and consistency. This allows the
system designer to deal with communication issues without having to worry about the complications of replication and
consistency. Likewise, when designing a consistency protocol, the designer does not have to worry about particulars of
communication. Approaching the design of a distributed system in this way leads to highly modular or layered systems,
which helps to increase a system's flexibility.
End-to-End Argument: In a classic paper Saltzer et al. [SRC84] argue that when building layered systems some
functions can only be reliably implemented at the application level. They warn against implementing too much end-to-end functionality (i.e., functionality required at the application level) in the lower layers of a system. This is relevant to
the design of distributed systems and services because one is often faced with the question of where to implement a
given functionality. Implementing it at the wrong level not only forces everyone to use that, possibly inappropriate,
mechanism, but may render it less useful than if it was implemented at a higher (application) level. Implementing
encryption as part of the communication layer, for example, may be less secure than end-to-end encryption
implemented by the application and therefore offer users a false sense of security.
Policy versus Mechanism: This rule has been discussed previously. Separation of policy and mechanism helps to build
flexible and extensible systems, and is, therefore, important to follow when designing distributed systems.
Keep It Simple, Stupid (KISS): Overly complex systems are error prone and difficult to use. If possible, solutions to
problems and resulting architectures should be simple rather than mind-numbingly complex.

References
[Neu94] B. Clifford Neuman. Scale in distributed systems. In T. Casavant and M. Singhal, editors, Readings in Distributed Computing Systems, pages 463-489. IEEE Computer Society Press, Los Alamitos, CA, USA, 1994. http://clifford.neuman.name/papers/pdf/94--_scale-dist-sys-neuman-readings-dcs.pdf.

[RGO06] Arnon Rotem-Gal-Oz. Fallacies of distributed computing explained. http://www.rgoarchitects.com/Files/fallacies.pdf, 2006.

[SRC84] J. H. Saltzer, D. P. Reed, and D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, 2(4), November 1984. http://www.reed.com/dpr/docs/Papers/EndtoEnd.html.

[TvS07] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, second edition, 2007.

2. System Architecture
A distributed system is composed of a number of elements, the most important of which are software
components, processing nodes and networks. Some of these elements can be specified as part of a distributed system's
design, while others are given (i.e., they have to be accepted as they are). Typically when building a distributed system,
the software is under the designer's control. Depending on the scale of the system, the hardware can be specified within
the design as well, or already exists and has to be taken as-is. The key, however, is that the software components must
be distributed over the hardware components in some way.
The software of distributed systems can become fairly complex, especially in large distributed systems, and
its components can spread over many machines. It is important, therefore, to understand how to organise the system.
We distinguish between the logical organisation of software components in such a system and their actual physical
organisation. The software architecture of distributed systems deals with how software components are organised and
how they work together, i.e., communicate with each other. Typical software architectures include the layered, object-oriented, data-centred, service-oriented and event-based architectures. Once the software components are instantiated
and placed on real machines, we talk about an actual system architecture. A few such architectures are discussed in this
section. These architectures are distinguished from each other by the roles that the communicating processes take on.
Choosing a good architecture for the design of a distributed system allows splitting of the functionality of the system,
thus structuring the application and reducing its complexity. Note that there is no single best architecture; the best
architecture for a particular system depends on the application's requirements and the environment.

Client-Server
The client-server architecture is the most common and widely used model for communication between
processes. As Figure 1 shows, in this architecture one process takes on the role of a server, while all other processes
take on the roles of clients. The server process provides a service (e.g., a time service, a database service, a banking
service, etc.) and the clients are customers of that service. A client sends a request to a server; the request is processed
at the server and a reply is returned to the client.
Figure 1: The client-server communication architecture (a client process sends a request to a server process, which returns a reply).


A typical client-server application can be decomposed into three logical parts: the interface part, the application
logic part, and the data part. Implementations of the client-server architecture vary with regards to how the parts are
separated over the client and server roles. A thin client implementation will provide a minimal user interface layer, and
leave everything else to the server. A fat client implementation, on the other hand, will include all of the user interface
and application logic in the client, and only rely on the server to store and provide access to data. Implementations in
between will split up the interface or application logic parts over the clients and server in different ways.
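To make the request/reply pattern concrete, here is a minimal sketch using Python sockets (hypothetical host/port, a single request, no error handling): the server process waits for a request, processes it, and returns a reply, while the client blocks until the reply arrives.

# A minimal request/reply sketch using TCP sockets (illustrative only:
# one request, no error handling; host and port are arbitrary choices).
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 5000

def server():
    with socket.socket() as srv:
        srv.bind((HOST, PORT))
        srv.listen()
        conn, _ = srv.accept()                 # wait for a client to connect
        with conn:
            request = conn.recv(1024)          # receive the request
            conn.sendall(b"echo: " + request)  # process it and return a reply

def client():
    with socket.socket() as cli:
        cli.connect((HOST, PORT))
        cli.sendall(b"what time is it?")       # send the request
        print(cli.recv(1024).decode())         # block until the reply arrives

t = threading.Thread(target=server)
t.start()
time.sleep(0.2)                                # crude: give the server time to start
client()
t.join()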

Vertical Distribution (Multi-Tier)


An extension of the client-server architecture, the vertical distribution, or multi-tier, architecture (see Figure
2) distributes the traditional server functionality over multiple servers. A client request is sent to the first server. During
processing of the request this server will request the services of the next server, who will do the same, until the final
server is reached. In this way the various servers become clients of each other (see Figure 3). Each server is responsible
for a different step (or tier) in the fulfilment of the original client request.

Figure 2: The vertical distribution (multi-tier) communication architecture (a client sends a request to an application server, which in turn sends a request to a database server; replies flow back along the same path).

Figure 3: Communication in a multi-tier system (the user interface requests an operation from the application server and waits for the result; the application server requests data from the database server, waits for the returned data, and then returns the result).


Splitting up the server functionality in this way is beneficial to a system's scalability as well as its flexibility. Scalability
is improved because the processing load on each individual server is reduced, and the whole system can therefore
accommodate more users. With regards to flexibility this architecture allows the internal functionality of each server to
be modified as long as the interfaces provided remain the same.

Horizontal Distribution
While vertical distribution focuses on splitting up a server's functionality over multiple computers, horizontal
distribution involves replicating a server's functionality over multiple computers. A typical example, as shown in Figure
4, is a replicated Web server. In this case each server machine contains a complete copy of all hosted Web pages and
client requests are passed on to the servers in a round robin fashion. The horizontal distribution architecture is
generally used to improve scalability (by reducing the load on individual servers) and reliability (by providing
redundancy).
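A sketch of the round-robin dispatching described above (hypothetical names, with plain functions standing in for the replica servers): the front end cycles through the replicas and hands each incoming request to the next one in turn.

# Sketch: a front end dispatching requests to replicated servers round-robin.
import itertools

class RoundRobinFrontEnd:
    def __init__(self, replicas):
        # replicas: list of callables, each standing in for a replica server
        self._next = itertools.cycle(replicas)

    def handle(self, request):
        replica = next(self._next)   # pick the next replica in turn
        return replica(request)

# Three "replicas" that each hold the same (copied) content.
pages = {"/index.html": "<h1>hello</h1>"}
replicas = [lambda req, i=i: f"server{i}: {pages[req]}" for i in range(3)]

front_end = RoundRobinFrontEnd(replicas)
for _ in range(4):
    print(front_end.handle("/index.html"))   # requests rotate over server0..server2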

Figure 4: An example of a horizontally distributed Web server (a front end handles incoming requests from the Internet and dispatches them in round-robin fashion to replicated Web servers, each containing the same Web pages on its disks).


Note that it is also possible to combine the vertical and horizontal distribution models. For example, each of the servers
in the vertical decomposition can be horizontally distributed. Another approach is for each of the replicas in the
horizontal distribution model to themselves be vertically distributed.

Peer to Peer
Whereas the previous models have all assumed that different processes take on different roles in the
communication architecture, the peer to peer (P2P) architecture takes the opposite approach and assumes that all
processes play the same role, and are therefore peers of each other. In Figure 5 each process acts as both a client and a
server, both sending out requests and processing incoming requests. Unlike in the vertical distribution architecture,
where each server was also a client to another server, in the P2P model all processes provide the same logical services.
Well known examples of the P2P model are file-sharing applications. In these applications users start up a
program that they use to search for and download files from other users. At the same time, however, the program also
handles search and download requests from other users.
With the potentially huge number of participating nodes in a peer to peer network, it becomes practically impossible
for a node to keep track of all other nodes in the system and the information they offer. To reduce the problem, the
nodes form an overlay network, in which nodes form a virtual network among themselves and only have direct
knowledge of a few other nodes. When a node wishes to send a message to an arbitrary other node it must first locate
that node by propagating a request along the links in the overlay network. Once the destination node is found, the two
nodes can typically communicate directly (although that depends on the underlying network of course).
There are two key types of overlay networks, the distinction being based on how they are built and maintained.
In all cases a node in the network will maintain a list of neighbours (called its partial view of the network). In
unstructured overlays the structure of the network often resembles a random graph. Membership management is
typically random, which means that a node's partial view consists of a random list of other nodes. In order to keep the
network connected as nodes join and leave, all nodes periodically exchange their partial views with neighbours, creating
a new
neighbour list for themselves. As long as nodes both push and pull this information the network tends to stay well
connected (i.e., it doesn't become partitioned).

Figure 5: The peer to peer communication architecture (each peer runs on its own kernel and acts as both client and server, sending requests to and returning replies to other peers).
In the case of structured overlays the choice of a node's neighbours is determined according to a specific structure. In a
distributed hash table, for example, nodes work together to implement a hash table. Each node is responsible for storing
the data associated with a range of identifiers. When joining a network, a node is assigned an identifier, locates the node
responsible for the range containing that identifier, and takes over part of that identifier space. Each node keeps track
of its neighbours in the identifier space. We will discuss specific structured overlays in more detail in a future lecture.
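The identifier-range idea can be sketched as follows (a toy example only, not a real protocol such as Chord): nodes and keys are hashed into the same identifier space, and the node with the smallest identifier at or after a key's identifier (wrapping around the ring) is responsible for storing that key.

# Sketch: identifier-space responsibility in a structured overlay (toy DHT).
# This only illustrates the idea; it is not a real protocol such as Chord.
import hashlib
from bisect import bisect_left

ID_SPACE = 2 ** 16   # small identifier space for readability

def ident(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % ID_SPACE

node_ids = sorted(ident(f"node-{i}") for i in range(8))   # nodes on the ring

def responsible_node(key: str) -> int:
    """Return the id of the node responsible for the key: the first node
    clockwise from hash(key) on the identifier ring."""
    k = ident(key)
    i = bisect_left(node_ids, k)
    return node_ids[i % len(node_ids)]   # wrap around the ring

for key in ["song.mp3", "lecture-notes.pdf"]:
    print(key, "->", responsible_node(key))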

Hybrid
Many more architectures can be designed by combining the previously described architectures in different ways and
result in what are called hybrid architectures. A few examples are:
Superpeer networks In this architecture a few superpeers form a peer to peer network, while the regular peers are
clients to a superpeer. This hybrid architecture maintains some of the advantages of a peer to peer system, but
simplifies the system by having only the superpeers managing the index of the regular peers, or acting as brokers
(e.g., Skype).
Collaborative distributed systems In collaborative distributed systems, peers typically support each other to deliver
content in a peer to peer like architecture, while they use a client server architecture for the initial setup of the
network. In BitTorrent for example, nodes requesting to download a file from a server first contact the server to
get the location of a tracker. The tracker then tells the nodes the locations of other nodes, from which chunks of
the content can be downloaded concurrently. Nodes must then offer downloaded chunks to other nodes and are
registered with the tracker, so that the other nodes can find them.
Edge-server networks In edge-server networks, as the name implies, servers are placed at the edge of the Internet,
for example at internet service providers (ISPs) or close to enterprise networks. Client nodes (e.g., home users or
an enterprise's employees) then access the nearby edge servers instead of the original server (which may be
located far away). This architecture is typically well suited for large-scale content-distribution networks such as
that provided by Akamai.

Processes and Server Architecture


A key property of all distributed systems is that they consist of separate processes that communicate in order to
get work done. Before exploring the various ways that processes on separate computers can communicate, we will first
review communication between processes on a single computer (i.e., a uniprocessor or multiprocessor).
Communication takes place between threads of control. There are two models for dealing with threads of control
in an operating system. In the process model, each thread of control is associated with a single private address space.
The threads of control in this model are called processes. In the thread model, multiple threads of control share a single
address space. These threads of control are called threads. Sometimes threads are also referred to as lightweight
processes because they take up less operating system resources than regular processes.
An important distinction between processes and threads is memory access. Threads share all of their memory,
which means that threads can freely access and modify each other's memory. Processes, on the other hand, are
prevented from accessing each other's memory. As an exception to this rule, in many systems it is possible for processes
to explicitly share memory with other processes.
Some systems provide only a process model, while others provide only a thread model. More common are systems
that provide both threads and processes. In this case each process can contain multiple threads, which means that the
threads can only freely access the memory of other threads in the same process. In general, when we are not concerned
about whether a thread of control is a process or thread, we will refer to it as a process.


A server process in a distributed system typically receives many requests for work from various clients, and it
is important to provide quick responses to those clients. In particular, a server should not refuse to do work for one
client because it is blocked (e.g., because it invoked a blocking system call) while doing work for another client. This is a
typical result of implementing a server as a single-threaded process. Alternatives to this are to implement the server
using multiple threads, one of which acts as a dispatcher, and the others acting as workers. Another option is to design
and build the server as a finite state machine that uses non-blocking system calls.
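The dispatcher/worker organisation can be sketched like this (illustrative handler logic only): a dispatcher accepts incoming requests and places them on a shared queue, and a small pool of worker threads takes requests off the queue, so a blocking request only ties up one worker rather than the whole server.

# Sketch: dispatcher/worker threads so one blocked request does not stall the server.
import queue
import threading
import time

requests = queue.Queue()

def worker(wid: int):
    while True:
        client, req = requests.get()      # take the next request off the queue
        time.sleep(0.1)                   # stand-in for (possibly blocking) work
        print(f"worker {wid} replies to {client}: done({req})")
        requests.task_done()

def dispatcher(incoming):
    for client, req in incoming:          # accept requests as they arrive
        requests.put((client, req))       # hand each one to the worker pool

for i in range(3):                        # start a small worker pool
    threading.Thread(target=worker, args=(i,), daemon=True).start()

dispatcher([("client-A", "read"), ("client-B", "write"), ("client-C", "read")])
requests.join()                           # wait until all requests are served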
A key issue in the design of servers is whether they store information about clients or not. In the stateful model a server
stores persistent information about a client (e.g., which files it has opened). While this leads to good performance since
clients do not have to constantly remind servers what their state is, the flipside is that the server must keep track of all
its clients (which leads to added work and storage on its part) and must ensure that the state can be recovered after a
crash. In the stateless model, the server keeps no persistent information about its clients. In this way it does not need to
use up resources to track clients, nor does it need to worry about restoring client state after a crash. On the other hand,
it requires more communication since clients have to resend their state information with every request.
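The contrast shows up in the shape of a request (a hypothetical file-server interface, illustration only): the stateful server remembers which file a client has open and its current offset, while the stateless server must be told the file and offset on every call.

# Sketch: the same "read next block" operation against a stateful and a
# stateless file server (hypothetical interfaces, illustration only).

class StatefulServer:
    def __init__(self):
        self.open_files = {}                    # per-client state kept on the server
    def open(self, client, path):
        self.open_files[client] = {"path": path, "offset": 0}
    def read(self, client, nbytes):
        st = self.open_files[client]            # server remembers path and offset
        data = f"<{nbytes} bytes of {st['path']} at {st['offset']}>"
        st["offset"] += nbytes
        return data

class StatelessServer:
    def read(self, path, offset, nbytes):
        # the client must supply its full state (path and offset) on every request
        return f"<{nbytes} bytes of {path} at {offset}>"

stateful, stateless = StatefulServer(), StatelessServer()
stateful.open("clientA", "/notes.txt")
print(stateful.read("clientA", 100))            # offset tracked by the server
print(stateless.read("/notes.txt", 0, 100))     # offset tracked by the client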
Often, as discussed above, the server in client-server is not a single machine, but a collection of machines that
act as a clustered server. In many modern systems separate virtual machines are hosted on a single physical machine.
This allows consolidation of many servers on a single machine, while providing isolation between them. Since virtual
machines can be stopped, migrated, and restarted, virtualisation also provides a good basis for code mobility and load
balancing.
Typically the machines (and processes) in such a cluster are assigned dedicated roles including, a logical switch,
compute (or application logic) servers, and file or database servers. We have discussed the latter two previously, so now
focus on the switch. The role of the switch is to receive client requests and route them to appropriate servers in the
cluster. There are several ways to do this. At the lowest level, a transport-layer switch, reroutes TCP connections to
other servers. The decision regarding which server to route the request to typically depends on system load. On a
slightly higher level, an application switch analyses the incoming request and routes it according to application-specific
logic. For example, HTTP requests for HTML files could be routed to one set of servers, while requests for images could
be served by other servers. Finally, at an even higher level, a DNS server could act as a switch by returning different IP
addresses for a single host name. Typically the server will store multiple addresses for a given name and cycle through
them in a round-robin fashion. The disadvantage of this approach is that the DNS server does not use any application or
cluster specific knowledge to route requests.
A final issue with regards to processes is that of code mobility. In some cases it makes sense to change the
location where a process is being executed. For example, a process may be moved to an unloaded server in a cluster to
improve its performance, or to reduce the load on its current server. Processes that are required to process large
amounts of data may be moved to a machine that has the data locally available to prevent the data from having to be
sent over the network. We distinguish between two types of code mobility: weak mobility and strong mobility. In the
first case, only code is transferred and the process is restarted from an initial state at its destination. In the second case,
the code and an execution context are transferred, and the process resumes execution from where it left off before being
moved.


3. Communication
In order for processes to cooperate (e.g., work on a single task together), they must communicate. There are
two reasons for this communication: synchronisation and sharing of data. Processes synchronise in order to coordinate
their activities. This includes finding out whether another process is alive, determining how much of a task a process
has executed, acquiring exclusive access to a resource, requesting another process to perform a certain task, etc.
Processes share data about tasks that they are cooperatively working on. This may include sending data as part of a
request (e.g., data to perform calculations on), returning the results of a calculation, requesting particular data, etc.
There are two ways that processes can communicate: through shared memory or through message passing. In
the first case processes must have access to some form of shared memory (i.e., they must be threads, they must be
processes that can share memory, or they must have access to a shared resource, such as a file). Communicating using
shared memory requires processes to agree on specific regions of the shared memory that will be used to pass
synchronisation information and data.
The other option (and the only option for processes that do not have access to shared memory), is for processes
to communicate by sending each other messages. This generally makes use of interprocess communication (IPC)
mechanisms made available by the underlying operating system. Examples of these mechanisms include pipes and
sockets.

Communication in a Distributed System


While the discussion of communication between processes has, so far, explicitly assumed a uniprocessor (or
multiprocessor) environment, the situation for a distributed system (i.e., a multicomputer environment) remains
similar. The main difference is that in a distributed system, processes running on separate computers cannot directly
access each other's memory. Nevertheless, processes in a distributed system can still communicate through either
shared memory or message passing.
Message Passing
Message passing in a distributed system is similar to communication using messages in a nondistributed system. The
main difference is that the only mechanism available for the passing of messages is network communication.
At its core, message passing involves two operations: send() and receive(). Although these are very simple
operations, there are many variations on the basic model. For example, the communication can be connectionless or
connection oriented. Connection oriented communication requires that the sender and receiver first create a connection
before send() and receive() can be used.
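As a small connectionless example (UDP datagrams in Python; the address is an arbitrary choice), the receiver simply calls its receive operation and the sender calls its send operation without any prior connection setup, in contrast to the connection-oriented case just described.

# Sketch: connectionless send()/receive() using UDP datagrams (no connection setup).
import socket
import threading
import time

ADDR = ("127.0.0.1", 6000)   # arbitrary local address for the example

def receiver():
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(ADDR)
        msg, sender = sock.recvfrom(1024)    # receive(): blocks for one message
        print("received", msg, "from", sender)

def sender():
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"hello", ADDR)          # send(): no prior connection needed

t = threading.Thread(target=receiver)
t.start()
time.sleep(0.2)                              # crude: let the receiver bind first
sender()
t.join()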
There are a number of important issues to consider when dealing with processes that communicate using message
passing, which are described in the next section. Besides these variations in the message passing model, there are also
issues involved with communicating between processes on heterogeneous computers. This brings up issues such as
data representation and dealing with pointers, which will be discussed in more detail later.

Communication Modes
There are a number of alternative ways, or modes, in which communication can take place. It is important to know and
understand these different modes, because they are used to describe the different services that a communication
subsystem offers to higher layers.
A first distinction is between two modes: data-oriented communication and control-oriented communication. In the
first mode, communication serves solely to exchange data between processes. Although the data might trigger an action
at the receiver, there is no explicit transfer of control implied in this mode. The second mode, control-oriented
communication, explicitly associates a transfer of control with every data transfer. Data-oriented communication is
clearly the type of communication used in communication via shared address space and shared memory, as well as
message passing. Control-oriented communication is the mode used by abstractions such as remote procedure call,
remote method invocation, active messages, etc. (communication abstractions are described in the next section).
Next, communication operations can be synchronous or asynchronous. In synchronous communication the sender of a
message blocks until the message has been received by the intended recipient. Synchronous communication is usually
even stronger than this in that the sender often blocks until the receiver has processed the message and the sender has
received a reply. In asynchronous communication, on the other hand, the sender continues execution immediately after
sending a message (possibly without having received an answer).
Another possible alternative involves the buffering of communication. In the buffered case, a message will be stored if
the receiver is not able to pick it up right away. In the unbuffered case the message will be lost.
Communication can also be transient or persistent. In transient communication a message will only be delivered if a
receiver is active. If there is no active receiver process (i.e., no one interested in or able to receive messages) then an
undeliverable message will simply be dropped. In persistent communication, however, a message will be stored in the
system until it can be delivered to the intended recipient. As Figure 6 shows, all combinations of
synchronous/asynchronous and transient/persistent are possible.


Figure 6: Possible combinations of synchronous/asynchronous and transient/persistent communication.


There are also varying degrees of reliability of the communication. With reliable communication errors are discovered
and fixed transparently. This means that the processes can assume that a message that is sent will actually arrive at the
destination (as long as the destination process is there to receive it). With unreliable communication messages may get
lost and processes have to deal with it.
Finally it is possible to provide guarantees about the ordering of messages. Thus, for example, a communication system
may guarantee that all messages are received in the same order that they are sent, while another system may make no
guarantees about the order of arrival of messages.

Communication Abstractions
In the previous discussion it was assumed that all processes explicitly send and receive messages (e.g., using
send() and receive()). Although this style of programming is effective and works, it is not always easy to write
correct programs using explicit message passing. In this section we will discuss a number of communication
abstractions that make writing distributed applications easier. In the same way that higher level programming
languages make programming easier by providing abstractions above assembly language, so do communication
abstractions make programming in distributed systems easier.
Some of the abstractions discussed attempt to completely hide the fact that communication is taking place, while
others do not. All abstractions have in common, however, that they hide the
details of the communication taking place. For example, the programmers using any of these abstractions do not have
to know what the underlying communication protocol is, nor do they have to know how to use any particular operating
system communication primitives.
The abstractions discussed in the coming sections are often used as core foundations of most middleware systems. Using
these abstractions, therefore, generally involves using some sort of middleware framework. This brings with it a number
of the benefits of middleware, in particular the various services associated with the middleware that tend to make a
distributed application programmer's life easier.
Message-Oriented Communication
The message-oriented communication abstraction does not attempt to hide the fact that communication is
taking place. Instead its goal is to make the use of flexible message passing easier.
Message-oriented communication is based around the model of processes sending messages to each other. Underlying
message-oriented communication are two orthogonal properties: communication can be synchronous or
asynchronous, and it can be transient or persistent. Whereas RPC and RMI are generally synchronous and transient,
message-oriented communication systems make many other options available to programmers.
Message-oriented communication is provided by message-oriented middleware (MOM). Besides providing
many variations of the send() and receive() primitives, MOM also provides infrastructure required to support
persistent communication. The send() and receive() primitives offered by MOM also abstract from the underlying
operating system or hardware primitives. As such, MOM allows programmers to use message passing without having
to be aware of what platforms their software will run on, and what services those platforms provide. As part of this
abstraction MOM also provides marshalling services. Furthermore, as with most middleware, MOM also provides other
services that make building distributed applications easier.
MPI (Message Passing Interface) is an example of a MOM that is geared toward high-performance transient message
passing. MPI is a message passing library that was designed for parallel computing. It makes use of available networking
protocols, and provides a huge array of functions that basically perform synchronous and asynchronous send() and
receive().
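As a flavour of what this looks like in practice, the sketch below uses the mpi4py Python binding (an assumption: mpi4py and an MPI runtime are installed; the file name and run command are only examples): process 0 sends a message to process 1 and waits for a reply.

# Sketch: transient message passing with MPI via the mpi4py binding.
# Assumes mpi4py and an MPI implementation are installed;
# run with, e.g.:  mpiexec -n 2 python mpi_example.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # this process's id within the communicator

if rank == 0:
    comm.send({"task": "compute", "n": 42}, dest=1, tag=0)   # send a request
    result = comm.recv(source=1, tag=1)                      # wait for the reply
    print("rank 0 got result:", result)
elif rank == 1:
    msg = comm.recv(source=0, tag=0)                         # blocking receive
    comm.send(msg["n"] * 2, dest=0, tag=1)                   # send back a result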
Another example of MOM is MQ Series from IBM. This is an example of a message queuing system. Its main characteristic
is that it provides persistent communication. In a message queuing system, messages are sent to other processes by
placing them in queues. The queues hold messages until an intended receiver extracts them from the queue and
processes them. Communication in a message queuing system is largely asynchronous.
The basic queue interface is very simple. There is a primitive to append a message onto the end of a specified
queue, and a primitive to remove the message at the head of a specific queue.
These can be blocking or nonblocking. All messages contain the name or address of a destination queue.
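In rough Python terms, such an interface might look like the sketch below (hypothetical names, in-process only; a real message-queuing system would transfer messages between queues on different machines): one primitive appends a message to a named queue, and another removes the message at the head, in blocking or non-blocking form.

# Sketch of a basic message-queuing interface (hypothetical, in-process only).
import queue
from collections import defaultdict

class QueueSystem:
    def __init__(self):
        self._queues = defaultdict(queue.Queue)   # one queue per destination name

    def put(self, queue_name: str, message: dict) -> None:
        """Append a message to the end of the named queue."""
        self._queues[queue_name].put(message)

    def get(self, queue_name: str, block: bool = True):
        """Remove and return the message at the head of the named queue.
        With block=False this raises queue.Empty if the queue is empty."""
        return self._queues[queue_name].get(block=block)

mq = QueueSystem()
mq.put("orders", {"dest": "orders", "body": "buy 10 widgets"})
print(mq.get("orders"))                    # blocking variant
try:
    mq.get("orders", block=False)          # non-blocking variant
except queue.Empty:
    print("queue is empty")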
Messages can only be added to and retrieved from local queues. Senders place messages in source queues (or
send queues), while receivers retrieve messages from destination queues (or receive queues). The underlying system is
responsible for transferring messages from source queues to destination queues. This can be done simply by fetching
messages from source queues and directly sending them to machines responsible for the appropriate destination queues. Or it can be more complicated and involve relaying messages to their destination queues through an overlay network of routers. An example of such a system is shown in Figure 7. In the figure, an application on sender A sends a message to an application on receiver B. It places the message in its local source queue, from where it is forwarded through routers R1 and R2 into the receiver's destination queue.

Figure 7: An example of a message queuing system.

Remote Procedure Call (RPC)


The idea behind a remote procedure call (Rpc) is to replace the explicit message passing model with the model of
executing a procedure call on a remote node [BN84]. A programmer using Rpc simply performs a procedure call, while
behind the scenes messages are transferred between the client and server machines. In theory the programmer is
unaware of any communication taking place.
Figure 8 shows the steps taken when an Rpc is invoked. The numbers in the figure correspond to the following steps
(steps seven to eleven are not shown in the figure):
1. client program calls client stub routine (normal procedure call)
2. client stub packs parameters into message data structure (marshalling)
3. client stub performs send() syscall and blocks
4. kernel transfers message to remote kernel


Figure 8: A remote procedure call.


5. remote kernel delivers to server stub procedure, blocked in receive()
6. server stub unpacks message, calls service procedure (normal procedure call)
7. service procedure returns to stub, which packs result into message
8. server stub performs send() syscall
9. kernel delivers to client stub
10. client stub unpacks result (unmarshalling)
11. client stub returns to client program (normal return from procedure)
A server that provides remote procedure call services defines the available procedures in a service interface. A service
interface is generally defined in an interface definition language (IDL), which is a simplified programming language,
sufficient for defining data types and procedure signatures but not for writing executable code. The IDL service interface
definition is used to generate client and server stub code. The stub code is then compiled and linked in with the client
program and service procedure implementations respectively.
The first widely used Rpc framework was proposed by Sun Microsystems in Internet RFC1050 (currently
defined in RFC1831). It is based on the XDR (External Data Representation) format defined in Internet RFC1014
(currently defined in RFC4506) and is still being heavily used as the basis for standard services originating from Sun
such as NFS (Network File System) and NIS (Network Information Service). Another popular Rpc framework is DCE
(Distributed Computing Environment) Rpc, which has been adopted in Microsoft's base system for distributed
computing.
More modern Rpc frameworks are based on XML as a data format and are defined to operate on top of widely used
standard network protocols such as HTTP. This simplifies integration with Web servers and is useful when transparent
operation through firewalls is desired. Examples of such frameworks are XML-RPC and the more powerful, but often
unnecessarily complex SOAP.
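As a brief illustration of an XML-based Rpc framework, Python's standard library ships an XML-RPC server and client; a minimal sketch (the procedure name and port are chosen arbitrarily, and server and client would run as separate processes):

from xmlrpc.server import SimpleXMLRPCServer

def inc(i):
    return i + 1

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(inc, "inc")   # expose the procedure under the name "inc"
server.serve_forever()                 # blocks; run the client below separately

# --- client side (separate process) ---
# import xmlrpc.client
# proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
# print(proxy.inc(41))   # marshalling and transport over HTTP happen behind the scenes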
As mentioned earlier there are issues involved with communicating between processes on heterogeneous architectures.
These include different representations of data, different byte orderings, and problems with transferring pointers or
pointer-based data structures. One of the tasks that Rpc frameworks hide from programmers is the packing of data into
messages (marshalling) and unpacking data from messages (unmarshalling). Marshalling and unmarshalling are
performed in the stubs by code generated automatically from IDL compilers and stub generators.
An important part of marshalling is converting data into a format that can be understood by the receiver. Generally,
differences in format can be handled by defining a standard network format into which all data is converted. However,
this may be wasteful if two communicating machines use the same internal format, but that format differs from the
network format. To avoid this problem, an alternative is to indicate the format used in the transmitted message and rely
on the receiver to apply conversion where required.
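As a small illustration of converting data into a standard network format, the sketch below marshals an integer into big-endian (network) byte order using Python's struct module:

import struct

# marshal a 32-bit integer into network byte order ('!' means big-endian)
wire = struct.pack("!i", 42)

# unmarshal on the receiving side, regardless of the host's native byte order
(value,) = struct.unpack("!i", wire)
assert value == 42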
Because pointers cannot be shared between remote processes (i.e., addresses cannot be transferred verbatim
since they are usually meaningless in another address space) it is necessary to flatten, or serialise, all pointer-based
data structures when they are passed to the Rpc client stub. At the server stub, these serialised data structures must be
unpacked and recreated in the recipients address space. Unfortunately this approach presents problems with aliasing
and cyclic structures. Another approach to dealing with pointers involves the server sending a request for the
referenced data to the client every time a pointer is encountered.
In general the Rpc abstraction assumes synchronous, or blocking, communication. This means that clients invoking Rpcs
are blocked until the procedure has been executed remotely and a reply returned. Although this is often the desired
behaviour, sometimes the waiting is not necessary. For example, if the procedure does not return any values, it is not
necessary to wait for a reply. In this case it is better for the Rpc to return as soon as the server acknowledges receipt of
the message. This is called an asynchronous RPC.
It is also possible that a client does require a reply, but does not need it right away and does not want to block
for it either. An example of this is a client that prefetches network addresses of hosts that it expects to contact later. The
information is important to the client, but since it is not needed right away the client does not want to wait. In this case
it is best if the server performs an asynchronous call to the client when the results are available. This is known as
deferred synchronous RPC.
A final issue that has been silently ignored so far is how a client stub knows where to send the Rpc message. In a
regular procedure call the address of the procedure is determined at compile time, and the call is then made directly. In
Rpc this information is acquired from a binding service: a service that allows registration and lookup of services. A
binding service typically provides an interface similar to the following:
register(name, version, handle, UID)
deregister(name, version, UID)
lookup(name, version) → (handle, UID)
Here handle is some physical address (IP address, process ID, etc.) and UID is used to distinguish between servers offering the same service. Moreover, it is important to include version information, since the flexibility requirement for distributed systems requires us to deal with different versions of the same software in a heterogeneous environment.
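A minimal in-memory sketch of such a binding service (the class and its bookkeeping are hypothetical and not any particular framework's API) could look like this:

class Binder:
    """Toy registry mapping (name, version) to (handle, UID) pairs."""
    def __init__(self):
        self.table = {}   # (name, version) -> list of (handle, uid)

    def register(self, name, version, handle, uid):
        self.table.setdefault((name, version), []).append((handle, uid))

    def deregister(self, name, version, uid):
        entries = self.table.get((name, version), [])
        self.table[(name, version)] = [e for e in entries if e[1] != uid]

    def lookup(self, name, version):
        entries = self.table.get((name, version))
        return entries[0] if entries else None   # (handle, uid) or None

binder = Binder()
binder.register("inc", "1.0", "10.0.0.5:8000", uid=42)
print(binder.lookup("inc", "1.0"))    # ('10.0.0.5:8000', 42)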
Remote Method Invocation (RMI)
When using Rpc, programmers must explicitly specify the server on which they want to perform the call
(possibly using information retrieved from a binding service). Furthermore, it is complicated for a server to keep track
of the different state belonging to different clients and their invocations. These problems with Rpc lead to the remote
method invocation (Rmi) abstraction. The transition from Rpc to Rmi is, at its core, a transition from the server metaphor
to the object metaphor.
When using Rmi, programmers invoke methods on remote objects. The object metaphor associates all
operations with the data that they operate on, meaning that state is encapsulated in the remote object and much easier
to keep track of. Furthermore, the concept of a remote object improves location transparency: once a client is bound to a remote object, it no longer has to worry about where that object is located. Also, objects are first-class citizens in an
object-based model, meaning that they can be passed as arguments or received as results in Rmi. This helps to relieve
many of the problems associated with passing pointers in Rpc.
Although, technically, Rmi is a small evolutionary step from Rpc, the model of remote and distributed objects is very
powerful. As such, Rmi and distributed objects form the base for a widely used distributed systems paradigm, and will
be discussed in detail in a future lecture.
The Danger of Transparency
Unfortunately, the illusion of a procedure call is not perfect for Rpcs and that of a method invocation is not
perfect for Rmi. The reason for this is that an Rpc or Rmi can fail in ways that a real procedure call or method invocation
cannot. This is due to problems such as not being able to locate a service (e.g., it may be down or have the wrong
version), messages getting lost, servers crashing while executing a procedure, etc. As a result, the client code has to
handle error cases that are specific to Rpcs.
In addition to the new failure modes, the use of threads (to alleviate the problem of blocking) can lead to
problems when accessing global program variables (like the POSIX errno). Moreover, some forms of arguments, like varargs in C, do not lend themselves well to the static generation of marshalling code. As mentioned earlier, pointer-based structures also require extra attention, and exceptions, such as user interrupts via the keyboard, are more difficult to handle.
Furthermore, Rpc and Rmi involve many more software layers than local system calls and also incur network
latencies. Both form potential performance bottlenecks. The code must, therefore, be carefully optimised and should
use lightweight network protocols. Moreover, since copying often dominates the overhead, hardware support can help.
This includes DMA directly to/from user buffers and scatter-gather network interfaces that can compose a message
from data at different addresses on the fly. Finally, issues of concurrency control can show up in subtle ways that, again,
break the illusion of executing a local operation. These problems are discussed in detail by Waldo et al. [WWWK94].
Group Communication
Group communication provides a departure from the point-to-point style of communication (i.e., where each
process communicates with exactly one other process) assumed so far. In this model of communication a process can
send a single message to a group of other processes. Group communication is often referred to as broadcast (when a
single message is sent out to everyone) and multicast (when a single message is sent out to a predefined group of
recipients).
Group communication can be applied in any of the previously discussed system architectures. It is often used to send
requests to a group of replicas, or to send updates to a group of servers containing the same data. It is also used for
service discovery (e.g., broadcasting a request asking "who offers this service?") as well as event notification (e.g., to tell everyone that the printer is on fire).
Issues involved with implementing and using group communication are similar to those involved with regular point-to-point communication. These include reliability and ordering. The issues are made more complicated because now there
are multiple recipients of a message and different combinations of problems may occur. For example, what if only one
of the recipients does not receive a message, should it be multicast out to everyone again, or only to the process that did
not receive it? Or, what if messages arrive in a different order on the different recipients and the order of messages is
important?
A widely implemented (but not as widely used) example of group communication is IP multicast. The IP
multicast specification has existed for a long time, but it has taken a while for implementations to make their way into
the routers and gateways (which is necessary for it to become viable).
An increasingly important form of group communication is gossip-based communication, which is often used for data
dissemination. This technique relies on epidemic behaviour like diseases spreading among people. One variant is
rumour spreading (or simply gossiping), which resembles the way in which rumours spread in a group of people. In this
form of communication, a node A that receives some new information will contact an arbitrary other node B in the
system to push the data to that node. If node B did not have the data previously, nodes A and B will continue to contact
other nodes and push the data to them. If, however, node B already had that data, then node A stops spreading the data
with a certain probability. Gossiping cannot make any guarantees that all nodes will receive all data, but works quite
well in practice to disseminate data quickly.
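A toy simulation of push-style rumour spreading (the node identifiers and the stop probability are arbitrary choices) illustrates the behaviour:

import random

STOP_PROBABILITY = 0.25   # chance that a sender stops after pushing to an already informed node

def gossip(nodes, start):
    """Push-style rumour spreading over a list of node ids; returns the set of informed nodes."""
    informed = {start}
    active = [start]
    while active:
        node = active.pop()
        target = random.choice([n for n in nodes if n != node])
        if target not in informed:
            informed.add(target)
            active.append(target)      # the newly informed node starts gossiping too
            active.append(node)        # the sender keeps going
        elif random.random() > STOP_PROBABILITY:
            active.append(node)        # with some probability the sender loses interest
    return informed

print(len(gossip(list(range(100)), start=0)))   # usually close to 100, but there is no guarantee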
Event-Based Communication
The event-based abstraction decouples senders and receivers in communication. Senders produce events
(which may carry data), without specifying a receiver for them. Receivers listen for events that they are interested in,
without specifying specific senders. The underlying middleware is responsible for delivering appropriate events to
appropriate receivers. A common category of event-based communication is publish/subscribe systems, where senders
publish events, and receivers subscribe to events of interest. Subscriptions can be based on topic, content, or a more
complex combination of event and context properties.
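A toy topic-based publish/subscribe broker (all names are illustrative) makes the decoupling of senders and receivers concrete:

class Broker:
    """Minimal topic-based publish/subscribe broker (illustrative only)."""
    def __init__(self):
        self.subscribers = {}    # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, event):
        # the publisher does not know who, if anyone, receives the event
        for callback in self.subscribers.get(topic, []):
            callback(event)

broker = Broker()
broker.subscribe("printer", lambda e: print("received:", e))
broker.publish("printer", {"status": "on fire"})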
Distributed Shared Memory
Because distributed processes cannot access each other's memory directly, using shared memory in a
distributed system requires special mechanisms that emulate the presence of directly accessible shared memory. This
is called distributed shared memory (DSM). The idea behind DSM is that processes on separate computers all have access
to the same virtual address space. The memory pages that make up this address space actually reside on separate
computers. Whenever a process on one of the computers needs to access a particular page, it must find the computer that actually hosts that page and request the data from it. Figure 9 shows an example of how a virtual address space
might be distributed over various computers.
There are many issues involved in the use and design of distributed shared memory. As such, a separate lecture will be
dedicated to a detailed discussion of DSM.
Tuple Spaces
A tuple space is an abstraction of distributed shared memory into a generalised shared space. In this model of
communication, processes place tuples containing data into the space, while others can search the space, and read and
remove tuples from the space. The underlying middleware coordinates the placement and removal of the tuples, and
ensures that properties such as ordering and consistency are maintained.
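A rough in-process sketch of the classic tuple-space operations (Linda-style out(), rd() and inp(); this toy version ignores blocking, distribution, and consistency):

class TupleSpace:
    """Toy tuple space: insert tuples and match them with patterns (None acts as a wildcard)."""
    def __init__(self):
        self.tuples = []

    def out(self, tup):
        self.tuples.append(tup)

    def _match(self, pattern, tup):
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup))

    def rd(self, pattern):
        # read (without removing) the first matching tuple, or None
        return next((t for t in self.tuples if self._match(pattern, t)), None)

    def inp(self, pattern):
        # remove and return the first matching tuple, or None
        t = self.rd(pattern)
        if t is not None:
            self.tuples.remove(t)
        return t

space = TupleSpace()
space.out(("temperature", "room1", 21.5))
print(space.rd(("temperature", "room1", None)))   # ('temperature', 'room1', 21.5)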
Streams
Whereas the previous communication abstractions dealt with discrete communication (that is, they communicated discrete chunks of data), the stream abstraction deals with continuous communication, and in particular with
the sending and receiving of continuous media. In continuous media, data is represented as a single stream of data rather
than discrete chunks (for example, an email is a discrete chunk of data, a live radio program is not). The main
characteristic of continuous media is that besides a spatial relationship (i.e., the ordering of the data), there is also a
temporal relationship between the data. Film is a good example of continuous media. Not only must the frames of a film
be played in the right order, they must also be played at the right time, otherwise the result will be incorrect.
A stream is a communication channel that is meant for transferring continuous media. Streams can be set up between
two communicating processes, or possibly directly between two devices (e.g., a camera and a TV). Streams of continuous
media are examples of isochronous communication, that is communication that has minimum and maximum end-to-end
time delay requirements.
When dealing with isochronous communication, quality of service is an important issue. In this case quality of
service is related to the time dependent requirements of the communication. These requirements describe what is
required of the underlying distributed system so that the temporal relationships in a stream can be preserved. This
generally involves timeliness and reliability.

Figure 10: The token bucket model.


Quality of service requirements are often specified in terms of the parameters of a token bucket model (shown in Figure
10). In this model tokens (permission to send a fixed number of bytes) are regularly generated and stored in a bucket.
An application wanting to send data removes the required amount of tokens from the bucket and then sends the data.
If the bucket is empty the application must wait until more tokens are available. If the bucket is full newly generated
tokens are discarded.
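A sketch of the token bucket check described above (the rate and capacity values are arbitrary):

import time

class TokenBucket:
    """Tokens (permission to send bytes) accumulate at a fixed rate up to a maximum."""
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens (bytes) added per second
        self.capacity = capacity      # bucket size; excess tokens are discarded
        self.tokens = capacity
        self.last = time.monotonic()

    def try_send(self, nbytes):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes     # remove the required tokens and send
            return True
        return False                  # bucket (nearly) empty: the application must wait

bucket = TokenBucket(rate=1000, capacity=4000)
print(bucket.try_send(2500), bucket.try_send(2500))   # True False (the second call must wait)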
It is often necessary to synchronise two or more separate streams. For example, when sending stereo audio it is
necessary to synchronise the left and right channels. Likewise when streaming video it is necessary to synchronise the
audio with the video.
Formally, synchronisation involves maintaining temporal relationships between substreams. There are two
basic approaches to synchronisation. The first is the client based approach, where it is up to the client receiving the
substreams to synchronise them. The client uses a synchronisation profile that details how the streams should be
synchronised. One possibility is to base the synchronisation on timestamps that are sent along with the stream. A
problem with client side synchronisation is that, if the substreams come in as separate streams, the individual streams
may encounter different communication delays. If the difference in delays is significant the client may be unable to
synchronise the streams. The other approach is for the server to synchronise the streams. By multiplexing the
substreams into a single data stream, the client simply has to demultiplex them and perform some rudimentary
synchronisation.

References
[BN84] Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems, 2:39-59, 1984.
[WWWK94] Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. A note on distributed computing. Technical Report SMLI TR-94-29, Sun Microsystems Laboratories, Inc., 1994. http://research.sun.com/techrep/1994/smli_tr-94-29.pdf.


4.Naming
Most computer systems (in particular operating systems) manage wide collections of entities (such as files,
users, hosts, networks, and so on). These entities are referred to by users of the system and other entities by various
kinds of names. Examples of names in Unix systems include the following:
Files: /boot/vmlinuz, ~/lectures/DS/notes/tex/naming.tex
Processes: 1, 14293
Devices: /dev/hda, /dev/ttyS1
Users: chak, cs92
For largely historical reasons, different entities are often named using different naming schemes. We say that
they exist in different name spaces. From time to time a new system design attempts to integrate a variety of entities
into a homogeneous name space, and then also attempts to provide a uniform interface to these entities. For example, a
central concept of Unix systems is the uniform treatment of files, devices, sockets, and so on. Some systems also
introduce a /proc file system, which maps processes to names in the file system and supports access to process
information through this file interface. In addition, Linux provides access to a variety of kernel data structures via the
/proc file system. The systems Plan 9 [ATT93] and Inferno go even further and are designed according to the concept
that all resources are named and accessed like files in a forest of hierarchical file systems.

Basic Concepts
A name is the fundamental concept underlying naming. We define a name as a string of bits or characters that
is used to refer to an entity. An entity in this case is any resource, user, process, etc. in the system. Entities are accessed
by performing operations on them; these operations are performed at an entity's access point. An access point is also referred to by a name; we call an access point's name an address. Entities may have multiple access points and may therefore have multiple addresses. Furthermore, an entity's access points may change over time (that is, an entity may get new access points or lose existing ones), which means that the set of an entity's addresses may also change.
We distinguish between a number of different kinds of names. A pure name (the distinction between pure and nonpure names is due to Needham [Nee93]) is a name that consists of an uninterpreted bit pattern that does not encode any of the named entity's attributes. A nonpure name, on the other hand, does encode entity attributes (such as an access point address) in the name. An identifier is a name that uniquely identifies an entity. An identifier refers to at most one entity and an entity is referred to by at most one identifier. Furthermore, an identifier can never be reused, so that it will always refer to the same entity. Identifiers allow for easy comparison of entities; if two entities have the same identifier then they are the same entity. Pure names that are also identifiers are called pure identifiers. Location-independent names are names that are independent of an entity's address. They remain valid even if an entity moves or otherwise changes its address. Note that pure names are always location independent, though location-independent names do not have to be pure names.

System Names Versus Human Names


Related to the purity of names is the distinction between system-oriented and human-oriented names. Human-oriented names are usually chosen for their mnemonic value, whereas system-oriented names are a means for efficient access to, and identification of, objects.
Taking into account the desire for transparency, human-oriented names would ideally be pure. In contrast, system-oriented names are often nonpure, which speeds up access to repeatedly used object attributes. We can characterise
these two kinds of names as follows:
System-oriented names are usually fixed size numerals (or a collection thereof); thus they are easy to store,
compare, and manipulate, but difficult for the user to remember.
Human-oriented names are usually variable-length strings, often with structure; thus they are easy for humans to
remember, but expensive to process by machines.
System-Oriented Names
As mentioned, system-oriented names are usually implemented as one or more fixed-sized numerals to facilitate
efficient handling. Moreover, they typically need to be unique identifiers and may be sparse to convey access rights (e.g.,
capabilities). Depending on whether they are globally or locally unique, we also call them structured or unstructured.
These are two examples of how structured and unstructured names may be implemented:
The structuring may be over multiple levels. Note that a structured name is not pure.
Global uniqueness without further mechanism requires a centralised generator with the usual drawbacks regarding scalability and reliability. In contrast, distributed generation without excessive communication usually leads to structured names. For example, a globally unique structured name can be constructed by combining the local time with a locally unique identifier. Both values can be generated locally and do not require any communication.
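A sketch of this construction (the node identifier and the field layout are hypothetical):

import itertools
import time

_counter = itertools.count()      # locally unique within this process
NODE_ID = 7                       # assumed to be unique per node (e.g., configured or derived from a MAC address)

def make_global_id():
    """Structured, globally unique name: (node, local time, local counter)."""
    return (NODE_ID, time.time_ns(), next(_counter))

print(make_global_id())   # e.g., (7, 1700000000000000000, 0)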
Human-Oriented Names
In many systems, the most important attribute bound to a human-oriented name is the system-oriented name of the object. All further information about the entity is obtained via the system-oriented name. This enables the system to perform the usually costly resolution of the human-oriented name just once and implement all further operations on the
basis of the system-oriented name (which is more efficient to handle). Often a whole set of human-oriented names is
mapped to a single system-oriented name (symbolic links, relative addressing, and so on).
As an example of all this, consider the naming of files in Unix. A pathname is a human-oriented name that, by means of
the directory structure of the file system, can be resolved to an inode number, which is a system-oriented name. All
attributes of a file are accessible via the inode (i.e., the system-oriented name). By virtue of symbolic and hard links
multiple human-oriented names may refer to the same inode, which makes equality testing of files merely by their
human-oriented name impossible.
The design space for human-oriented names is considerably wider than that for system-oriented names. As such, naming systems for human-oriented names usually require considerably greater implementation effort.

Name Spaces
Names are grouped and organised into name spaces. A structured name space is represented as a labeled
directed graph, with two types of nodes. A leaf node represents a named entity and stores information about the entity.
This information could include the entity itself, or a reference to the entity (e.g., an address). A directory node (also called
a context) is an inner node and does not represent any single entity. Instead it stores a directory table, containing (node id, edge label) pairs, that describes the node's children. A leaf node only has incoming edges, while a directory node
has both incoming and outgoing edges. A third kind of node, a root node is a directory node with only outgoing edges.
A structured name space can be strictly hierarchical or can form a directed acyclic graph (DAG). In a strictly
hierarchical name space a node will only have one incoming edge. In a DAG name space any node can have multiple
incoming edges. It is also possible to have name spaces with multiple root nodes. Scalable systems usually use
hierarchically structured name spaces.
A sequence of edge labels leading from one node to another is called a path name. A path name is used to refer to a node
in the graph. An absolute path name always starts from a root node. A relative path name is any path name that does not
start at the root node. In Figure 1 the absolute path name that corresponds to the leftmost branch is
<home, ikuz, cs9243_lectures>. The path <ikuz, cs9243_lectures>, on the other hand, represents a relative
path name.
Many name spaces support aliasing, in which case an entity may be reachable by multiple paths from a root node and
will therefore be named by numerous path names. There are two types of alias. A hard link is when there are two or
more paths that directly lead to that entity. A soft link, on the other hand, occurs when a leaf node holds a pathname that
refers to another node. In this case the leaf node implicitly refers to the file named by the pathname. Figure 1 shows an
example of a name space with both a hard link (the solid arrow from d3 to n0) and a soft link (the dashed arrow from
n1 to n2).
Figure 1: An example of a name space with aliasing


Ideally we would have a global, homogeneous name space that contains names for all entities used. However, we are
often faced with the situation where we already have a collection of name spaces that have to be combined into a larger
name space. One approach is to simply create a new name that combines names from the other name spaces. For
example, a Web URL http://www.cs.un.edu.uk/~cs92/naming.pdf
globalises the local name ~cs92/naming.pdf by adding the context www.cs.un.edu.uk. Unfortunately, this approach often compromises location transparency, as is the case in the example of URLs.
Another example of the composition of name spaces is mounting a name space onto a mount point in a different
(external) name space. This approach is often applied to merge file systems (e.g., mounting a remote file system onto a
local mount point). In terms of a name space graph, mounting requires one directory node to contain information about
another directory node in the external name space. This is similar to the concept of soft linking, except that in this case
the link is to a node outside of the name space. The information contained in the mount point node must, therefore,
include information about where to find the external name space.


Name Resolution
The process of determining what entity a name refers to is called name resolution. Resolving a name (also referred to as looking up a name) results in
a reference to the entity that the name refers to. Resolving a name in a name space often results in a reference to the
node that the name refers to. Path name resolution is a process that starts with the resolution of the first element in the
path name, and ends with resolution of the last element in the name. There are two approaches to this process, iterative
resolution and recursive resolution.
In iterative resolution the resolver contacts each node directly to resolve each individual element of the path
name. In recursive resolution the resolver only contacts the first node and asks it to resolve the name. This node looks up the node referred to by the first element of the name and then passes the rest of the name on to that node. The process
is repeated until the last element is resolved after which the result is returned back through the nodes to the resolver.
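The difference between the two approaches can be sketched as follows (the Node class and its resolve_one() operation are hypothetical stand-ins for contacting a name server):

class Node:
    """Toy directory node mapping edge labels to child nodes (hypothetical)."""
    def __init__(self, children=None):
        self.children = children or {}
    def resolve_one(self, label):
        return self.children[label]

def resolve_iterative(root, path):
    """The resolver itself contacts each node in turn."""
    node = root
    for label in path:
        node = node.resolve_one(label)    # one request per path element, answered directly
    return node

def resolve_recursive(node, path):
    """Each node resolves the first element and forwards the remainder."""
    if not path:
        return node
    next_node = node.resolve_one(path[0])
    return resolve_recursive(next_node, path[1:])   # the result travels back along the chain

root = Node({"home": Node({"ikuz": Node({"naming.tex": Node()})})})
leaf = resolve_iterative(root, ["home", "ikuz", "naming.tex"])
assert leaf is resolve_recursive(root, ["home", "ikuz", "naming.tex"])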
A problem with name resolution is how to determine which node to start resolution at. Knowing how and
where to start name resolution is referred to as the closure mechanism. One approach is to keep an external reference
(e.g., in a file) to the root node of the name space. Another approach is to keep a reference to the current directory node
for dealing with relative names. Note that the actual closure mechanism is always implicit, that is it is never explicitly
defined in a name. The reason for this is that if a closure mechanism was defined in a name there would have to be a
way to resolve the name used for that closure mechanism. This would require the use of a closure mechanism to
bootstrap the original closure mechanism. Because this could be repeated indefinitely, at a certain point an implicit
mechanism will always be required.

Naming Service
A naming service is a service that provides access to a name space allowing clients to perform operations on
the name space. These operations include adding and removing directory or leaf nodes, modifying the contents of nodes
and looking up names. The naming service is implemented by name servers. Name resolution is performed on behalf of
clients by resolvers. A resolver can be implemented by the client itself, in the kernel, by the name server, or as a separate
service.

Distributed Naming Service


As with most other system services, naming becomes more involved in a distributed environment. A distributed
naming service is implemented using multiple name servers over which the name space is partitioned and/or
replicated. The goal of a distributed naming service is to distribute both the management and name resolution load over
these name servers.
Before discussing implementation aspects of distributed naming services it is useful to split a name space up into several
layers according to the role the nodes play in the name space. These layers help to determine how and where to partition
and replicate that part of the name space. The highest level nodes belong to the global layer. A main characteristic of
nodes in this layer is that they are stable, meaning that they do not change much. As such, replicating these nodes is
relatively easy because consistency does not cause much of a problem. The next layer is the administrational layer. The
nodes in this layer generally represent a part of the name space that is associated to a single organisational entity (e.g.,
a company or a university). They are relatively stable (but not as stable as the nodes in the global layer). Finally the
lowest layer is the managerial layer. This layer sees much change. Nodes may be added or removed as well as have their
contents modified. The nodes in the top layers generally see the most traffic and, therefore, require more effort to keep
their performance at an acceptable level.
Typically, a client does not directly converse with a name server, but delegates this to a local resolver that may use
caching to improve performance. Each of the name servers stores one or more naming contexts, some of which may be
replicated. We call the name servers storing attributes of an object this object's authoritative name servers.
Directory nodes are the smallest unit of distribution and replication of a name space. If they are all on one host,
we have one central server, which is simple, but does not scale and does not provide fault tolerance. Alternatively, there
can be multiple copies of the whole name space, which is called full replication. Again, this is simple and access may be
fast. However, the replicas will have to be kept consistent and this may become a bottleneck as the system grows.
In the case of a hierarchical name space, partial subtrees (often called zones) may be maintained by a single server. In
the case of the Internet Domain Name Service (DNS), this distribution also matches the physical distribution of the
network. Each zone is associated with a name prefix that leads from the root to the zone. Now, each node maintains a
prefix table (essentially, a hint cache for name servers corresponding to zones) and, given a name, the server
corresponding to the zone with the longest matching prefix is contacted. If it is not the authoritative name server, the
next zone's prefix is broadcast to obtain the corresponding name server (and update the prefix table). As an alternative
to broadcasting, the contacted name server may be able to provide the address of the authoritative name server for this
zone. This scheme can be efficiently implemented, as the prefix table can be relatively small and, on average, only a small
number of messages are needed for name resolution. Consistency of the prefix table is checked on use, which removes
the need for explicit update messages.
For smaller systems, a simpler structure-free distribution scheme may be used. In this scheme contexts can be
freely placed on the available name servers (usually, however, some distribution policy is in place). Name resolution
starts at the root and has to traverse the complete resolution chain of contexts. This is easy to reconfigure and, for
example, used in the standard naming service of CORBA.


Implementation of Naming Services


In the following, we consider a number of issues that must be addressed by implementations of name services.
First, a starting point for name resolution has to be fixed. This essentially means that the resolver must have a list of
name servers that it can contact. This list will usually not include the root name server to avoid overloading it. Instead,
physically close servers are normally chosen. For example, in the BIND (Berkeley Internet Name Domain)
implementation of DNS, the resolver is implemented as a library linked to the client program. It expects the file
/etc/resolv.conf to contain a list of name servers. Moreover, it facilitates relative naming in the form of the search option.

Name Caches
Name resolution is expensive. For example, studies found that a large proportion of Unix system calls (and network
traffic in distributed systems) is due to name-mapping operations. Thus, caching of the results of name resolution on
the client is attractive:
High degree of locality of name lookup; thus, a reasonably sized name cache can give good hit ratio.
Slow update of name information database; thus, the cost for maintaining consistency is low.
On-use consistency of cached information is possible; thus, no invalidation on update: stale entries are detected
on use.
There are three types of name caches:
Directory cache: directory node data is cached. Directory caches are normally used with iterative name resolution.
They require large caches, but are useful for directory listings etc.
Prefix cache: path name prefix and zone information is cached. Prefix caching is unsuitable with structure-free
context distribution.
Full-name cache: full path name information is cached. Full-name caching is mostly used with structure-free
context distribution and tends to require larger cache sizes than prefix caches.
A name cache can be implemented as a process-local cache, which lives in the address space of the client process. Such
a cache does not need many resources, as it typically will be small in size, but much of the information may be duplicated
in other processes. More seriously, it is a short-lived cache and incurs a high rate of start-up misses, unless a scheme
such as cache inheritance is used, which propagates cache information from parent to child processes. The alternative
is a kernel cache, which avoids duplicate entries and excessive start-up misses, but access to a kernel cache is slower
and it takes up valuable kernel memory. Alternatively, a shared cache can be located in a user-space cache process that
is utilised by clients directly or by redirection of queries via the kernel (the latter is used in the CODA file system). Some
Unix variants use a tool called the name service cache daemon (nscd) as a user-space cache process.

Example: Domain Name System (DNS)


Information about DNS, the main concepts, the model, and implementation details can be found in RFCs 1034 [Moc87a]
and 1035 [Moc87b].

Attribute-Based Naming
Whereas names as described above encode at most one attribute of the named entity (e.g., a domain name encodes the entity's administrative or geographical location), in attribute-based naming an entity's name is composed of multiple attributes. An example of an attribute-based name is given below:
/C=AU/O=UNSW/OU=CSE/CN=WWW Server/Hardware=Sparc/OS=Solaris/Server=Apache
The name not only encodes the location of the entity (/C=AU/O=UNSW/OU=CSE, where C is the attribute country, O is
organisation, and OU is organisational unit; these are standard attributes in X.500 and LDAP), it also identifies it as a Web
server, and provides information about the hardware that it runs on, the operating system running on it, and the
software used. Although an entity's attribute-based name contains information about all attributes, it is common to also
define a distinguished name (DN), which consists of a subset of the attributes and is sufficient to uniquely identify the
entity.
In attribute-based naming systems the names are stored in directories, and each distinguished name refers to a directory
entry. Attribute-based naming services are normally called directory services. Similar to a naming service, a directory
service implements a name space that can be flat or hierarchical. With a hierarchical name space its structure mirrors
the structure of distinguished names.
The structure of the name space (i.e., the naming graph) is defined by a directory information tree (DIT). The actual
contents of the directory (that is the collection of all directory entries) are stored in the directory information base (DIB).

Directory Service
A directory service implements all the operations that a naming service does, but it also adds a search operation that
allows clients to search for entities with particular attributes. A search can use partial knowledge (that is, a search does
not have to be based on all of an entity's attributes) and it does not have to include attributes that form part of a
distinguished name. Thus, given a directory service that stores the entity named in the previous example, a search for all entities that have Solaris as their operating system would return a list of directory entries that contain the
OS=Solaris attribute. This ability to search based on attributes is one of the key properties that distinguishes a
directory service from a naming service.

Distributed Directory Service


The directory service is implemented by a directory server. As with naming services, directory servers can be
centralised or distributed. Even more than name services, centralised directory services run the risk of becoming
overloaded, thus distributed implementations are preferable when scalability is required. Distributed implementations
also increase the reliability of the service.
As in naming services, the directory service can be partitioned or replicated (or partitioned and replicated). Partitioning
of directory services follows the structure of the DIT, with the tree being split up over available directory servers.
Generally the nodes are partitioned so that administratively or geographically related parts of the DIT are placed on the
same servers. Figure 2 shows an example of the administrative partitioning of a DIT.

Figure 2: A partitioned DIT


Replication in a directory service involves either replicating the whole directory, or replicating individual partitions.
This replication is generally more sophisticated than in naming services, and as a result a distributed directory service
usually provides both read-only and read/write replicas. Furthermore, caching (e.g., caching of query results) is also
used to improve performance.
Lookup in a distributed directory service is similar to lookup in a distributed naming service: it can be done iteratively (called referral) or recursively (called chaining). Search operations are also handled using referral or chaining.
Because a search can be performed on any attributes, performing a search may require the examination of
every directory entry to find those containing the desired attributes. In a distributed directory service, this would
require that all directory servers be contacted and requested to perform the search locally. As this is inherently
unscalable and incurs a high performance penalty it is necessary for users to reduce the scope of searches as much as
possible, for example, by specifying a limited part of the directory tree in which to search. Another approach to increase
the performance for commonly performed searches is to keep a catalog at each directory server. The catalog contains a
subset of each directory entry in the DIB. Catalog entries generally contain the distinguished attributes and some of the
most searched for attributes. This way a search can easily be resolved by searching through the local catalog and finding
all entries that fulfill the search criteria. It is necessary to tune the set of attributes stored in the catalog so that the
catalog remains effective, but does not become too large.

Example: X.500 and LDAP


X.500 and LDAP are examples of widely used directory services. An overview of X.500 and LDAP can be found in
[How95]. More technical details about both can be found in RFCs 1309 [WRH92], 1777 [YHK95], and 2251 [WHK97].

Address Resolution of Unstructured Identifiers


Unstructured identifiers are almost like random bit strings and contain no information whatsoever on how to
locate the access point of the entity they refer to. Because of this lack of structured information, we have the problem of
how to find the corresponding address of the entity. Examples of such unstructured identifiers are IP numbers in a LAN
or hash values.
A simple solution is to use broadcasting: The resolver simply broadcasts the query to every node and only the node that
has the access point answers with the address. This approach works well in smaller systems. However, as the system
scales, the broadcasts impose an increasing load on the network and the nodes, which makes it impractical for larger
systems. As a practical example, the address resolution protocol (ARP) uses this approach to resolve IP addresses of
local nodes or routers in a network to MAC addresses.
A more complicated and scalable approach is to use distributed hash tables (DHT). DHTs are constructed as
overlay networks and allow the typical operations of hash tables such as put(), get() and remove(). The details
of a DHT implementation called Chord can be found in [SMLN+03]. An advantage of such DHTs is that lookups of keys and their values can be done in O(log n) (where n is the number of nodes in the DHT), which makes DHTs very practical even in very large-scale systems. A well-known application of DHTs is peer-to-peer file-sharing networks, where DHTs
are used to store meta-information about the filenames or keywords of the files in the network.


References
[ATT93] AT&T Bell Laboratories, Murray Hill, NJ, USA. Plan 9 from Bell Labs, Second Release Notes, 1993.
[How95] Timothy A. Howes. The lightweight directory access protocol: X.500 lite. Technical Report 95-8, University of Michigan CITI, July 1995.
[Moc87a] P. Mockapetris. Domain names: concepts and facilities. RFC 1034, November 1987. http://www.ietf.org/rfc/rfc1034.txt.
[Moc87b] P. Mockapetris. Domain names: implementation and specification. RFC 1035, November 1987. http://www.ietf.org/rfc/rfc1035.txt.
[Nee93] R. Needham. Names. In S. Mullender, editor, Distributed Systems, an Advanced Course. Addison-Wesley, second edition, 1993.
[SMLN+03] Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking, 11(1):17-32, February 2003.
[WHK97] M. Wahl, T. Howes, and S. Kille. Lightweight directory access protocol (v3). RFC 2251, December 1997. http://www.ietf.org/rfc/rfc2251.txt.

5.Synchronisation and coordination


This chapter deals with one of the fundamental issues encountered when constructing a system made up of
independent communicating processes: dealing with time and making sure that processes do the right thing at the right
time. In essence this comes down to allowing processes to synchronise and coordinate their actions. Coordination refers
to coordinating the actions of separate processes relative to each other and allowing them to agree on global state (such
as values of a shared variable). Synchronisation is coordination with respect to time, and refers to the ordering of events
and execution of instructions in time. Examples of synchronisation include ordering distributed events in a log file and
ensuring that a process performs an action at a particular time. Examples of coordination include ensuring that
processes agree on what actions will be performed (e.g., money will be withdrawn from the account), who will be
performing actions (e.g., which replica will process a request), and the state of the system (e.g., the elevator is stopped).
Synchronisation and coordination play an important role in most distributed algorithms (i.e., algorithms
intended to work in a distributed environment). In particular, some distributed algorithms are used to achieve
synchronisation and coordination, while others assume the presence of synchronisation or coordination mechanisms.
Discussions of distributed algorithms generally assume one of two timing models for distributed systems. The first is a
synchronous model, where the time to perform all actions, communication delay, and clock drift on all nodes, are
bounded. In asynchronous distributed systems there are no such bounds. Most real distributed systems are asynchronous; however, it is easier to design distributed algorithms for synchronous distributed systems. Algorithms for asynchronous systems are always valid on synchronous systems; the converse, however, is not true.

Time & Clocks


As mentioned, time is an important concept when dealing with synchronisation and coordination. In particular it is often
important to know when events occurred and in what order they occurred. In a nondistributed system dealing with
time is trivial as there is a single shared clock. All processes see the same time. In a distributed system, on the other
hand, each computer has its own clock. Because no clock is perfect, each of these clocks has its own skew, which causes clocks on different computers to drift and eventually become out of sync.
There are several notions of time that are relevant in a distributed system. First of all, internally a computer clock
simply keeps track of ticks that can be translated into physical time (hours, minutes, seconds, etc.). This physical time
can be global or local. Global time is a universal time that is the same for everyone and is generally based on some form of absolute time (although Einstein's special relativity theory shows that time is relative and there is, therefore, no absolute time, for our purposes, and at a worldwide scale, we can safely assume that such an absolute time does exist). Currently Coordinated Universal Time (UTC), which is based on oscillations of the Cesium-133 atom,
is the most accurate global time. Besides global time, processes can also consider local time. In this case the time is only
relevant to the processes taking part in the distributed system (or algorithm). This time may be based on physical or
logical clocks (which we will discuss later).

Physical Clocks
Physical clocks keep track of physical time. In distributed systems that rely on actual time it is necessary to keep
individual computer clocks synchronised. The clocks can be synchronised to global time (external synchronisation), or
to each other (internal synchronisation). Cristian's algorithm and the Network Time Protocol (NTP) are examples of
algorithms developed to synchronise clocks to an external global time source (usually UTC). The Berkeley Algorithm is
an example of an algorithm that allows clocks to be synchronised internally.
Cristian's algorithm requires clients to periodically synchronise with a central time server (typically a server with a UTC receiver). One of the problems encountered when synchronising clocks in a distributed system is that unpredictable communication latencies can affect the synchronisation. For example, when a client requests the current time from the time server, by the time the server's reply reaches the client the time will have changed. The client must, therefore, determine what the communication latency was and adjust the server's response accordingly. Cristian's algorithm deals with this problem by attempting to calculate the communication delay based on the time elapsed
between sending a request and receiving a reply.
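A sketch of the client-side adjustment (this is the usual simple estimate, assuming a roughly symmetric delay; real implementations refine it further):

# t0: local time when the request was sent, t1: local time when the reply arrived
# server_time: the time reported in the server's reply
def estimate_time(t0, t1, server_time):
    """Assume the one-way delay is roughly half the round-trip time."""
    round_trip = t1 - t0
    return server_time + round_trip / 2

print(estimate_time(t0=100.0, t1=100.4, server_time=250.0))   # 250.2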
The Network Time Protocol is similar to Cristian's algorithm in that synchronisation is also performed using time servers and an attempt is made to correct for communication latencies. Unlike Cristian's algorithm, however, NTP is not centralised and is designed to work on a wide-area scale. As such, the calculation of delay is somewhat more complicated.
Furthermore, NTP provides a hierarchy of time servers, with only the top layer containing UTC clocks. The NTP
algorithm allows client-server and peer-to-peer (mostly between time servers) synchronisation. It also allows clients
and servers to determine the most reliable servers to synchronise with. NTP typically provides accuracies between 1
and 50 msec depending on whether communication is over a LAN or WAN.
Unlike the previous two algorithms, the Berkeley algorithm does not synchronise to a global time. Instead, in this
algorithm, a time server polls the clients to determine the average of everyone's time. The server then instructs all
clients to set their clocks to this new average time. Note that in all the above algorithms a clock should never be set
backward. If time needs to be adjusted backward, clocks are simply slowed down until time catches up.

Logical Clocks
For many applications, the relative ordering of events is more important than actual physical time. In a single process
the ordering of events (e.g., state changes) is trivial. In a distributed system, however, besides the local ordering of events, all processes must also agree on the ordering of causally related events (e.g., the sending and receiving of a single message).

Given a system consisting of N processes pi, i ∈ {1,...,N}, we define the local event ordering →i as a binary relation, such that, if pi observes e before e′, we have e →i e′. Based on this local ordering, we define a global ordering as a happened-before relation →, as proposed by Lamport [Lam78]: The relation → is the smallest relation, such that
1. e →i e′ implies e → e′,
2. for every message m, send(m) → receive(m), and
3. e → e′ and e′ → e″ implies e → e″ (transitivity).
The relation → is almost a partial order (it lacks reflexivity). If a → b, then we say a causally affects b. We consider events to be concurrent if they are unordered; i.e., a ↛ b and b ↛ a implies a ∥ b.
As an example, consider Figure 1. We have the following causal relations:
E11 → E12, E13, E14, E23, E24, ...
E21 → E22, E23, E24, E13, E14, ...
Figure 1: Example of event ordering


Moreover, the following events are concurrent: E11 ∥ E21, E12 ∥ E22, E13 ∥ E23, E11 ∥ E22, E13 ∥ E24, E14 ∥ E23, and so on.
Lamport Clocks
Lamport's logical clocks can be implemented as a software counter that locally computes the happened-before relation →. This means that each process pi maintains a logical clock Li. Given such a clock, Li(e) denotes the Lamport timestamp of event e at pi and L(e) denotes the timestamp of event e at the process it occurred at. Processes now proceed as follows:
1. Before time stamping a local event, a process pi executes Li := Li + 1.
2. Whenever a message m is sent from pi to pj:
Process pi executes Li := Li + 1 and sends the new Li with m.
Process pj receives Li with m and executes Lj := max(Lj,Li) + 1. receive(m) is annotated with the new Lj.
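A minimal sketch of these rules (message transport is abstracted away; only the counter updates are shown):

class LamportClock:
    """Scalar logical clock following the two rules above."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # rule 1: before timestamping a local event
        self.time += 1
        return self.time

    def send(self):
        # rule 2a: increment and piggyback the timestamp on the message
        return self.tick()

    def receive(self, msg_time):
        # rule 2b: take the maximum of local and received time, then add one
        self.time = max(self.time, msg_time) + 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
t = p1.send()             # p1: 1
print(p2.receive(t))      # p2: 2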
In this scheme, a → b implies L(a) < L(b), but L(a) < L(b) does not necessarily imply a → b. As an example, consider Figure 2. In this figure E12 → E23 and L1(E12) < L2(E23) (i.e., 2 < 3); however, we also have E13 ↛ E24 while L1(E13) < L2(E24) (i.e., 3 < 4).
Figure 2: Example of the use of Lamport's clocks


In some situations (e.g., to implement distributed locks), a partial ordering on events is not sufficient and a total ordering is required. In these cases, the partial ordering can be completed to a total ordering by including process identifiers. Given local time stamps Li(e) and Lj(e′), we define global time stamps ⟨Li(e), i⟩ and ⟨Lj(e′), j⟩. We then use standard lexicographical ordering, where ⟨Li(e), i⟩ < ⟨Lj(e′), j⟩ iff Li(e) < Lj(e′), or Li(e) = Lj(e′) and i < j.


Vector Clocks
Figure 3: Example of the lack of causality with Lamport's clocks


The main shortcoming of Lamport's clocks is that L(a) < L(b) does not imply a → b; hence, we cannot deduce causal dependencies from time stamps. For example, in Figure 3, we have L1(E11) < L3(E33), but E11 ↛ E33. The root of the problem is that clocks advance independently or via messages, but there is no history as to where the advance comes from.
This problem can be solved by moving from scalar clocks to vector clocks, where each process maintains a vector clock Vi. Vi is a vector of size N, where N is the number of processes. The component Vi[j] contains process pi's knowledge about pj's clock. Initially, we have Vi[j] := 0 for all i, j ∈ {1,...,N}. Clocks are advanced as follows:
1. Before pi timestamps an event, it executes Vi[i] := Vi[i] + 1.
2. Whenever a message m is sent from pi to pj:
Process pi executes Vi[i] := Vi[i] + 1 and sends Vi with m.
Process pj receives Vi with m and merges the vector clocks Vi and Vj as follows:
Vj[k] := max(Vj[k], Vi[k]) + 1   if j = k (as in scalar clocks)
Vj[k] := max(Vj[k], Vi[k])       otherwise

This last part ensures that everything that subsequently happens at pj is now causally related to everything
that previously happened at pi.
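A sketch of these update rules along the same lines as the scalar clock above (message transport is again abstracted away):

class VectorClock:
    """Vector clock for process i out of n processes, following the rules above."""
    def __init__(self, i, n):
        self.i = i
        self.v = [0] * n

    def tick(self):
        # before timestamping a local event (and before sending)
        self.v[self.i] += 1
        return list(self.v)

    def send(self):
        return self.tick()           # the returned copy is piggybacked on the message

    def receive(self, other):
        # merge: componentwise maximum, plus one for the receiver's own component
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.v[self.i] += 1
        return list(self.v)

p1, p2 = VectorClock(0, 2), VectorClock(1, 2)
m = p1.send()            # p1: [1, 0]
print(p2.receive(m))     # p2: [1, 1]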
Under this scheme, we have, for all i, j, Vi[i] ≥ Vj[i] (i.e., pi always has the most up-to-date version of its own clock); moreover, a → b iff V(a) < V(b), where
V = V′ iff V[i] = V′[i] for all i ∈ {1,...,N},
V ≥ V′ iff V[i] ≥ V′[i] for all i ∈ {1,...,N},
V > V′ iff V ≥ V′ ∧ V ≠ V′; and
V ∥ V′ iff V ≯ V′ ∧ V′ ≯ V.
For example, consider the annotations in the diagram in Figure 4. Each event is annotated with both its vector clock value (the triple) and the corresponding value of a scalar Lamport clock. For L1(E12) and L3(E32), we have 2 = 2 versus (2,0,0) ≠ (0,0,2). Likewise we have L2(E24) > L3(E32) but (2,4,1) ≯ (0,0,2), and thus E32 ↛ E24.
Figure 4: Example contrasting vector and scalar clock annotations (three processes; each event carries a scalar Lamport timestamp and a vector timestamp, e.g. E12 is annotated with 2 and (2,0,0), E32 with 2 and (0,0,2), and E24 with 5 and (2,4,1))
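The update and comparison rules above can be sketched as follows in Python (the class and helper names are invented for illustration; message transport is left out). The assertions at the end reproduce the comparisons made for Figure 4.

class VectorClock:
    """Minimal vector clock for process i of n (illustrative sketch)."""

    def __init__(self, i, n):
        self.i = i
        self.v = [0] * n                      # V_i, initially all zeros

    def local_event(self):
        self.v[self.i] += 1                   # rule 1: tick own component

    def send_event(self):
        self.v[self.i] += 1                   # rule 2: tick, then piggyback a copy on m
        return list(self.v)

    def receive_event(self, msg_v):
        # rule 2 at the receiver: component-wise maximum, then tick own component
        self.v = [max(a, b) for a, b in zip(self.v, msg_v)]
        self.v[self.i] += 1

def vc_leq(a, b):
    return all(x <= y for x, y in zip(a, b))       # V <= V'

def vc_happened_before(a, b):
    return vc_leq(a, b) and a != b                 # V(a) < V(b), i.e. a -> b

def vc_concurrent(a, b):
    return not vc_leq(a, b) and not vc_leq(b, a)   # V(a) || V(b)

# Figure 4: E12 and E32 have equal scalar timestamps (2 = 2) but are concurrent:
assert vc_concurrent([2, 0, 0], [0, 0, 2])
# L2(E24) > L3(E32), yet (2,4,1) and (0,0,2) are incomparable, so E32 does not happen before E24:
assert not vc_happened_before([0, 0, 2], [2, 4, 1])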


Global State
Determining global properties in a distributed system is often difficult, but crucial for some applications. For
example, in distributed garbage collection, we need to be able to determine for some object whether it is referenced by
any other objects in the system. Deadlock detection requires detection of cycles of processes infinitely waiting for each
other. To detect the termination of a distributed algorithm, we need to obtain simultaneous knowledge of all involved processes as well as take account of messages that may still traverse the network. In other words, it is not sufficient to check the activity of all processes: even if all processes appear to be passive, there may be messages in transit that, upon arrival, trigger further activity.
In the following, we are concerned with determining stable global states or properties that, once they occur,
will not disappear without outside intervention. For example, once an object is no longer referenced by any other object
(i.e., it may be garbage collected), no reference to the object can appear at a later time.

Consistent Cuts
To reason about the validity of global observations (i.e., observations that combine information from multiple nodes), the notion of consistent cuts is useful. Due to the lack of global time, we cannot simply require that all local
observations must happen at the same time. As it is clear that using the state of the individual processes at arbitrary
points in time is not generally going to result in a consistent overall picture, we need to define a criterion for determining
when we regard a collection of local states to be globally consistent.
To formalise the notion of a consistent cut, we again refer to a system of N processes pi, i ∈ {1,...,N}. Each process pi, over time, proceeds through a series of events ⟨e_i^0, e_i^1, e_i^2, ...⟩, which we call pi's history, denoted by h_i. This series may be finite or infinite. In any case, we denote by h_i^k the k-prefix of h_i (the history of pi up to and including event e_i^k). Each event e_i^j, as before, is either a local event or a communication event (e.g., the sending or receiving of a message). We denote the state of any process pi immediately before event e_i^k as s_i^k, i.e., the state recording all events included in the history h_i^(k-1). This makes s_i^0 refer to the initial state of pi.
Using a total event ordering, we can merge all local histories into a global history H = h_1 ∪ h_2 ∪ ... ∪ h_N and, similarly, we can combine a set of local states s_1,...,s_N into a global state S = (s_1,...,s_N). This raises the question as to which combination of local states is consistent (a global state is consistent if, for every received message included in the state, the corresponding send is also included). To answer this question, we need one more concept, namely that of a cut. Similar to the global history, a cut is defined in terms of k-prefixes of the local histories,

C = h_1^(c_1) ∪ h_2^(c_2) ∪ ... ∪ h_N^(c_N),

where h_i^(c_i) is the history of pi up to and including event e_i^(c_i). The cut C corresponds to the global state in which each process pi has executed exactly the events in its prefix h_i^(c_i). The final events in a cut are its frontier, defined as {e_i^(c_i) | i ∈ {1,...,N}}.
Figure 5: A consistent cut (cut 1) and an inconsistent cut (cut 2) across the histories of processes P1, P2, and P3 (send events s and receive events r)


We call a cut consistent iff for all events e ∈ C, e′ → e implies e′ ∈ C (i.e., all events that happened before an event in the cut are also in the cut). A global state is consistent if it corresponds to a consistent cut. As a result, we can characterise the execution of a system as a sequence of consistent global states S0 → S1 → S2 → ... . Figure 5 displays both a consistent cut (labeled cut 1) and an inconsistent cut (labeled cut 2). The inconsistent cut contains a receive event whose corresponding send event (which happened before it) is not part of the cut.
A global history that is consistent with the happened-before relation is also called a linearisation or consistent run. A linearisation only passes through consistent global states. Finally, we call a state S′ reachable from a state S if there is a linearisation that passes through S and then through S′.

Snapshots
Now that we have a precise characterisation of a consistent cut, the next question is whether such cuts can be
computed effectively. Chandy & Lamport [CL85] introduced an algorithm that yields a snapshot of a distributed system,

which embodies a consistent global state and takes care of messages that are in transit while the snapshot is being taken. The resulting snapshots are useful for evaluating stable global properties.
Chandy & Lamport's algorithm makes strong assumptions about the underlying infrastructure. In particular, communication must be reliable and processes must be failure-free. Furthermore, point-to-point message delivery must be ordered and the process/channel graph must be strongly connected (i.e., each node can communicate with every other node). Under these assumptions, after the algorithm completes, each process holds a copy of its local state and a set of messages that were in transit, with that process as their destination, during the snapshot.
The algorithm proceeds as follows: One process initiates the algorithm by recording its local state and sending a
marker message over each outgoing channel. On receipt of a marker message over incoming channel c, a process
distinguishes two cases:
1. If its local state is not yet saved, it behaves like the initiating process: it saves its local state and sends marker messages over each outgoing channel.
2. Otherwise, if its local state is already saved, it saves all messages that it received via c since it saved its local state
and until the marker arrived.
A process's local contribution is complete after it has received markers on all incoming channels. At this time, it has
accumulated (a) a local state snapshot and (b), for each incoming channel, a set of messages received after performing
the local snapshot and before the marker came down that channel.

Figure 6: Marker messages during the collection of a snapshot (three processes P1, P2, P3 exchanging application messages m1, m2, m3)
Figure 6 outlines the marker messages (dotted arrows) and the points where local snapshots are taken (marked by the stars) for three processes.
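A compact sketch of the per-process logic follows (Python; the class, the MARKER sentinel, and the injected send function are all hypothetical, and the transport plumbing is assumed to exist outside the class).

MARKER = object()    # hypothetical sentinel standing in for a marker message

class SnapshotProcess:
    """Per-process state for the Chandy-Lamport snapshot algorithm (sketch)."""

    def __init__(self, in_channels, out_channels, send):
        self.in_channels = list(in_channels)
        self.out_channels = list(out_channels)
        self.send = send                        # send(channel, message), assumed provided
        self.local_snapshot = None
        self.recording = set()                  # incoming channels still being recorded
        self.channel_state = {}                 # channel -> messages recorded in transit

    def initiate(self, local_state):
        """Called on the single process that starts the snapshot."""
        self._record(local_state)

    def _record(self, local_state):
        self.local_snapshot = local_state       # save the local state ...
        for c in self.out_channels:
            self.send(c, MARKER)                # ... and send a marker on every outgoing channel
        self.recording = set(self.in_channels)
        self.channel_state = {c: [] for c in self.in_channels}

    def on_message(self, channel, msg, current_state):
        """Returns (local_snapshot, channel_state) once the local contribution is complete."""
        if msg is MARKER:
            if self.local_snapshot is None:
                self._record(current_state)     # case 1: first marker seen
            self.recording.discard(channel)     # stop recording this channel
            if not self.recording:
                return self.local_snapshot, self.channel_state
        elif self.local_snapshot is not None and channel in self.recording:
            self.channel_state[channel].append(msg)   # case 2: message still in transit
        return None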

Distributed Concurrency Control


Some of the issues encountered when looking at concurrency in distributed systems are familiar from the study
of operating systems and multithreaded applications, in particular, dealing with race conditions that occur when concurrent processes access shared resources. In non-distributed systems these problems are solved by implementing mutual exclusion using local primitives such as locks, semaphores, and monitors. In distributed systems, dealing with
concurrency becomes more complicated due to the lack of directly shared resources (such as memory, CPU registers,
etc.), the lack of a global clock, the lack of a single global program state, and the presence of communication delays.

Distributed Mutual Exclusion


When concurrent access to distributed resources is required, we need to have mechanisms to prevent race
conditions while processes are within critical sections. These mechanisms must fulfill the following three requirements:
1. Safety: At most one process may execute the critical section at a time
2. Liveness: Requests to enter and exit the critical section eventually succeed
3. Ordering: Requests are processed in happened-before ordering
Method 1: Central Server
The simplest approach is to use a central server that controls the entering and exiting of critical sections.
Processes must send requests to enter and exit a critical section to a lock server (or coordinator), which grants
permission to enter by sending a token to the requesting process. Upon leaving the critical section, the token is returned
to the server. Processes that wish to enter a critical section while another process is holding the token are put in a queue.
When the token is returned the process at the head of the queue is given the token and allowed to enter the critical
section.
This scheme is easy to implement, but it does not scale well due to the central authority. Moreover, it is
vulnerable to failure of the central server.
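The coordinator's bookkeeping is little more than a holder variable and a FIFO queue, as the following Python sketch shows (class and method names are invented; the actual token transfer over the network is assumed to happen outside the class).

from collections import deque

class LockServer:
    """Central coordinator granting one token for a single critical section (sketch)."""

    def __init__(self):
        self.holder = None                # process currently holding the token
        self.queue = deque()              # processes waiting to enter

    def request(self, pid):
        """pid asks to enter; returns True if the token is granted immediately."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)            # someone else holds the token: wait in line
        return False

    def release(self, pid):
        """pid returns the token; returns the next holder, or None if nobody waits."""
        assert pid == self.holder
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder                # caller would now send the token to this process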
Method 2: Token Ring
More sophisticated is a setup that organises all processes in a logical ring structure, along which a token
message is continuously forwarded. Before entering the critical section, a process has to wait until the token comes by
and then retain the token until it exits the critical section.


A disadvantage of this approach is that the ring imposes an average delay of N/2 hops, which again limits
scalability. Moreover, the token messages consume bandwidth and failing nodes or channels can break the ring. Another
problem is that failures may cause the token to be lost. In addition, if new processes join the network or wish to leave,
further management logic is needed.
Method 3: Using Multicast and Logical Clocks
Ricart & Agrawala [RA81] proposed an algorithm for distributed mutual exclusion that makes use of logical
clocks. Each participating process pi maintains a Lamport clock and all processes must be able to communicate pairwise.
At any moment, each process is in one of three states:
1. Released: Outside of critical section
2. Wanted: Waiting to enter critical section
3. Held: Inside critical section
If a process wants to enter a critical section, it multicasts a request message ⟨Li, pi⟩ and waits until it has received a reply from every other process. The processes operate as follows:
If a process is in Released state, it immediately replies to any request to enter the critical section.
If a process is in Held state, it delays replying until it is finished with the critical section.
If a process is in Wanted state, it replies to a request immediately only if the requesting timestamp is smaller
than the one in its own request.
The only hurdle to scalability is the use of multicasts (i.e., all processes have to be contacted in order to enter a
critical section). More scalable variants of this algorithm require each individual process to only contact subsets of its
peers when wanting to enter a critical section. Unfortunately, failure of any peer process can deny all other processes
entry to the critical section.
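The per-process state machine can be sketched as follows (Python; the reply and multicast callbacks are assumed to be provided by a messaging layer, and the names are illustrative only).

class RicartAgrawala:
    """Per-process logic of the Ricart & Agrawala algorithm (sketch)."""

    RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

    def __init__(self, pid, peers, send_reply, multicast_request):
        self.pid = pid
        self.peers = set(peers)                 # identifiers of all other processes
        self.state = self.RELEASED
        self.clock = 0                          # Lamport clock
        self.my_request = None                  # (timestamp, pid) of our pending request
        self.pending_replies = set()
        self.deferred = []                      # requests answered only after we exit
        self.send_reply = send_reply            # send_reply(to_pid), assumed provided
        self.multicast_request = multicast_request

    def request_entry(self):
        self.state = self.WANTED
        self.clock += 1
        self.my_request = (self.clock, self.pid)
        self.pending_replies = set(self.peers)
        self.multicast_request(self.my_request)           # multicast <L_i, p_i>

    def on_request(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        if self.state == self.HELD or \
           (self.state == self.WANTED and self.my_request < (ts, sender)):
            self.deferred.append(sender)        # our own request has priority: defer reply
        else:
            self.send_reply(sender)             # RELEASED, or the other request is earlier

    def on_reply(self, sender):
        self.pending_replies.discard(sender)
        if self.state == self.WANTED and not self.pending_replies:
            self.state = self.HELD              # all replies received: enter critical section

    def exit_critical_section(self):
        self.state = self.RELEASED
        for pid in self.deferred:
            self.send_reply(pid)                # now answer every deferred request
        self.deferred = []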
Comparison of Algorithms
When comparing the three distributed mutual exclusion algorithms we focus on the number of messages
exchanged per entry/exit of the critical section, the delay that a process experiences before being allowed to enter a
critical section, and the reliability of the algorithms (that is, what kinds of problems the algorithms face).
The centralised algorithm requires a total of 3 messages to be exchanged every time a critical section is
executed (two to enter and one to leave). After a process has requested permission to enter a critical section it has to
wait for a minimum of two messages to be exchanged (one for the current holder to return the token to the coordinator
and one for the coordinator to send the token to the waiting process). The biggest problem this algorithm faces is that
if the coordinator crashes (or becomes otherwise unavailable) the whole algorithm fails.
For the ring algorithm, the number of messages exchanged per entry and exit of a critical section depends on
how often processes need to enter the section. The less often processes want to enter, the longer the token will travel
around the ring, and the higher the cost (in terms of messages exchanged) of entry into a critical section will be. With
regards to delay, depending on where the token is, it will take between 0 and n-1 messages before a process can enter
the critical section. The biggest problems faced by this algorithm are loss of the token and a crashed process breaking
the ring. It is possible to overcome the latter by providing all processes information about the ring structure so that
broken nodes can be skipped.
Finally, the decentralised algorithm effectively requires 2(n-1) messages to be sent per entry and exit of a critical section (i.e., n-1 request messages and n-1 replies). Likewise, there is a delay of 2(n-1) messages before another process can enter the critical section (once again because n-1 requests and n-1 replies have to be sent). With
regards to reliability the decentralised algorithm is worse than the others because the failure of any single node is
enough to break the algorithm.

Coordination and Multicast


Recall that group communication provides a model of communication whereby a process can send a single
message to a group of other processes. When such a message is sent to a predefined group of recipients (as opposed to
all nodes on the network), we refer to the communication as multicast. Since a multicast is sent out to a specific group,
it is important to have agreement on the membership of that group. We distinguish between static group membership,
where membership does not change at runtime, and dynamic membership, where the group membership may change.
Likewise we distinguish between open and closed groups. In an open group anyone can send a message to the group,
while in a closed group only group members can send messages to the group.
Besides group membership there are two other key properties for multicast: reliability and ordering. With
regards to reliability we are interested in message delivery guarantees in the face of failure. There are two delivery
guarantees that a reliable multicast can provide: the guarantee that if the message is delivered to any member of the
group it is guaranteed to be delivered to all members of the group; and the slightly weaker guarantee that if a message
is delivered to any non-failed member of the group it will be delivered to all non-failed members of the group. We will
further discuss multicast reliability in a future lecture.


With regards to ordering we are interested in guarantees about the order in which messages are delivered to
group members. This requires synchronisation between the group members in order to ensure that everyone delivers
received messages in an appropriate order. We will look at four typical multicast ordering guarantees: basic (no
guarantee), FIFO, causal, and total order.
Before discussing the different ordering guarantees and their implementations, we introduce the basic
conceptual model of operation for multicast. A multicast sender performs a single multicast send operation
msend(g,m) specifying the recipient group and the message to send. This operation is provided by a multicast
middleware. When it is invoked, it eventually results in the invocation of an underlying unicast send(m) operation (as
provided by the underlying OS) to each member of the group. At each recipient, when a message arrives, the OS delivers
the message to the multicast middleware. The middleware is responsible for reordering received messages to comply
with the required ordering model and then delivering (mdeliver(m)) the message to the recipient process. Only after
a message has been mdelivered is it available to the recipient and do we say that it has been successfully delivered.
Note that multicast implementations do not typically follow this model directly. For example, unicasting the
message to every member in the group is inefficient, and solutions using multicast trees are often used to reduce the
amount of actual communication performed. Nevertheless the model is still a correct generalisation of such a system.
For the following we also assume a static and closed group. We will discuss how to deal with dynamic groups in a future
lecture together with reliable multicast. An open group can easily be implemented in terms of a closed group by having
the sender send (i.e., unicast) its message to a single member of the group who then multicasts the message on to the
group members.
In basic multicast there are no ordering guarantees. Messages may be delivered in any order. The stricter
orderings that we discuss next can all be described and implemented making use of basic multicast to send messages to
all group members. This means that any suitable basic multicast implementation, such as IP multicast, can be used to
implement the following.
FIFO multicast guarantees that message order is maintained per sender. This means that messages from a
single sender are always mdelivered in the same order as they were sent. FIFO multicast can be implemented by giving
each group member a local send counter and a vector of the counters received from the other group members. The send
counter is included with each message sent, while the receive vector is used to delay delivery of messages until all
previous messages from a particular sender have been delivered.
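A sketch of this bookkeeping in Python (basic_multicast and mdeliver are assumed to be supplied by the layers below and above; all names are illustrative):

from collections import defaultdict

class FifoMulticast:
    """FIFO-ordered multicast layered on a basic multicast (sketch)."""

    def __init__(self, my_id, basic_multicast, mdeliver):
        self.my_id = my_id
        self.send_seq = 0                        # local send counter
        self.delivered = defaultdict(int)        # per sender: last delivered sequence number
        self.holdback = defaultdict(dict)        # per sender: seq -> message that arrived early
        self.basic_multicast = basic_multicast
        self.mdeliver = mdeliver

    def msend(self, group, message):
        self.send_seq += 1
        self.basic_multicast(group, (self.my_id, self.send_seq, message))

    def on_receive(self, sender, seq, message):
        self.holdback[sender][seq] = message
        # deliver in per-sender order; anything that arrived too early stays held back
        while self.delivered[sender] + 1 in self.holdback[sender]:
            self.delivered[sender] += 1
            self.mdeliver(self.holdback[sender].pop(self.delivered[sender]))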
The causal message delivery guarantee requires that order is maintained between causally related sends. Recall
that a message B is causally related to a message A (A happens before B) if A is received at the sender before B is sent.
In this case all recipients must have causally related messages delivered in the happened before order. There is no
ordering requirement for concurrent messages. Causal multicast can be implemented using a vector clock at each group
member. This vector is sent with each message and is used to track causal relationships. A message is queued at a
receiver until all previous (i.e., happened before) messages have been successfully delivered.
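One possible sketch of causal delivery, with one vector entry per group member counting delivered messages (again, basic_multicast and mdeliver are assumed to exist; this is only one way of implementing the idea):

class CausalMulticast:
    """Causally-ordered multicast using a per-member vector clock (sketch)."""

    def __init__(self, my_index, group_size, basic_multicast, mdeliver):
        self.i = my_index
        self.v = [0] * group_size                # v[j]: messages delivered from member j
        self.holdback = []                       # messages waiting for causal predecessors
        self.basic_multicast = basic_multicast
        self.mdeliver = mdeliver

    def msend(self, group, message):
        self.v[self.i] += 1                      # count our own message before sending
        self.basic_multicast(group, (self.i, list(self.v), message))

    def on_receive(self, sender, msg_v, message):
        if sender == self.i:
            self.mdeliver(message)               # own message: already counted at send time
            return
        self.holdback.append((sender, msg_v, message))
        self._try_deliver()

    def _try_deliver(self):
        progress = True
        while progress:
            progress = False
            for entry in list(self.holdback):
                sender, msg_v, message = entry
                next_from_sender = msg_v[sender] == self.v[sender] + 1
                predecessors_seen = all(msg_v[k] <= self.v[k]
                                        for k in range(len(self.v)) if k != sender)
                if next_from_sender and predecessors_seen:
                    self.holdback.remove(entry)
                    self.mdeliver(message)
                    self.v[sender] += 1          # this sender's message is now delivered
                    progress = True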
Finally, totally-ordered multicast guarantees that all messages will be delivered in the same order at all group
members. There are various approaches to implementing totally-ordered multicast. We briefly present two: a
sequencer-based approach, and an agreement-based approach. The sequencer-based approach requires the
involvement of a centralised sequencer (which can be either a separate node, or one of the group members). In this case,
all messages are assigned a sequence number by the sequencer, with all recipients delivering messages in the assigned
sequence order.
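A minimal sequencer-based sketch might look as follows (Python; senders are assumed to unicast their message to the sequencer, which stamps it and basic-multicasts it to the group; all names are made up for illustration):

class Sequencer:
    """The sequencer node: stamps each message with the next global number (sketch)."""

    def __init__(self, basic_multicast):
        self.seq = 0
        self.basic_multicast = basic_multicast

    def on_request(self, group, message):
        self.seq += 1
        self.basic_multicast(group, (self.seq, message))   # stamp and forward to the group


class TotalOrderMember:
    """A group member: delivers messages strictly in the sequencer-assigned order (sketch)."""

    def __init__(self, mdeliver):
        self.next_to_deliver = 1
        self.holdback = {}                       # seq -> message that arrived out of order
        self.mdeliver = mdeliver

    def on_receive(self, seq, message):
        self.holdback[seq] = message
        while self.next_to_deliver in self.holdback:
            self.mdeliver(self.holdback.pop(self.next_to_deliver))
            self.next_to_deliver += 1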
The agreement-based approach is decentralised, and requires all group members to vote on a sequence number
for each message. In this approach each process keeps a local sequence number counter. After receiving a message each
process replies to the sender with a proposed sequence number (the value of its counter). The sender chooses the
largest proposed sequence number and informs all the group members. Each group member then assigns the sequence
number to the message and sets its counter to the maximum of the counter's current value and the sequence number.
Messages are delivered in their sequence number order.
Totally-ordered multicast is often combined with reliable multicast to provide atomic multicast. Note that most
practical totally-ordered multicast implementations are based on some form of the sequencer-based approach, typically
with optimisations applied to prevent overloading a single node.

Coordination and Elections


Various algorithms require a set of peer processes to elect a leader or coordinator. In the presence of failure, it
can be necessary to determine a new leader if the present one fails to respond. Provided that all processes have a unique
identification number, leader election can be reduced to finding the non-crashed process with the highest identifier.
Any algorithm to determine this process needs to meet the following two requirements:
1. Safety: A process either doesn't know the coordinator, or it knows the identifier of the process with the largest identifier.
2. Liveness: Eventually, a process crashes or knows the coordinator.
Bully Algorithm
The following algorithm was proposed by Garcia-Molina [GM82] and uses three types of messages:
Election: Announce election
Answer: Response to an election
Coordinator: Elected coordinator announces itself


A process begins an election when it notices through a timeout that the coordinator has failed, or when it receives an Election message. When starting an election, a process sends an Election message to all higher-numbered processes. If it receives no Answer within a predetermined time bound, the process that started the election decides that it must be the coordinator and sends a Coordinator message to all other processes. If an Answer arrives, the process that triggered the election waits a predetermined period of time for a Coordinator message. A process that receives an Election message can immediately announce that it is the coordinator if it knows that it is the highest-numbered process. Otherwise, it starts a sub-election of its own by sending an Election message to all higher-numbered processes. This algorithm is called the bully algorithm because the highest-numbered process will always be the coordinator.
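A sketch of the per-process logic (Python; send(pid, msg) and the timeout that eventually calls announce() are assumed to be provided by the surrounding system, and the names are illustrative):

class BullyElection:
    """Per-process logic of the bully algorithm (sketch)."""

    def __init__(self, my_id, all_ids, send):
        self.my_id = my_id
        self.all_ids = sorted(all_ids)
        self.send = send                          # send(pid, message), assumed provided
        self.coordinator = None

    def start_election(self):
        higher = [p for p in self.all_ids if p > self.my_id]
        if not higher:
            self.announce()                       # highest id: become coordinator at once
            return
        for p in higher:
            self.send(p, ("ELECTION", self.my_id))
        # if no ("ANSWER", ...) arrives within a time bound, announce() is called;
        # otherwise we wait (for a bounded time) for a ("COORDINATOR", ...) message

    def on_election(self, sender):
        self.send(sender, ("ANSWER", self.my_id)) # bully the lower-numbered process ...
        self.start_election()                     # ... and run our own (sub-)election

    def on_coordinator(self, sender):
        self.coordinator = sender                 # accept the announced coordinator

    def announce(self):
        self.coordinator = self.my_id
        for p in self.all_ids:
            if p != self.my_id:
                self.send(p, ("COORDINATOR", self.my_id))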
Ring Algorithm
An alternative to the bully algorithm is to use a ring algorithm [CR79]. In this approach all processes are
ordered in a logical ring and each process knows the structure of the ring. There are only two types of messages
involved: Election and Coordinator. A process starts an election when it notices that the current coordinator has
failed (e.g., because requests to it have timed out). An election is started by sending an Election message to the first
neighbour on the ring. The Election message contains the node's process identifier and is forwarded on around the ring, with each process adding its own identifier to the message. When the Election message reaches the originator, the election is complete. Based on the contents of the message, the originator process determines the highest-numbered
process and sends out a Coordinator message specifying this process as the winner of the election.
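The following Python sketch captures this id-collecting variant (send_to_next forwards a message to the next live neighbour on the ring and is assumed to be provided; the originator's identity is carried along so that the Coordinator message circulates the ring exactly once):

class RingElection:
    """Per-process logic of a ring-based election with id collection (sketch)."""

    def __init__(self, my_id, send_to_next):
        self.my_id = my_id
        self.send_to_next = send_to_next          # forward a message to the next ring neighbour
        self.coordinator = None

    def start_election(self):
        self.send_to_next(("ELECTION", [self.my_id]))

    def on_election(self, ids):
        if self.my_id in ids:
            # the message has travelled all the way around: pick the highest id as winner
            self.send_to_next(("COORDINATOR", max(ids), self.my_id))
        else:
            self.send_to_next(("ELECTION", ids + [self.my_id]))

    def on_coordinator(self, winner, originator):
        self.coordinator = winner
        if originator != self.my_id:
            self.send_to_next(("COORDINATOR", winner, originator))   # forward once around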

References
[CL85] K. Mani Chandy and Leslie Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems, 3:63-75, 1985.
[Lam78] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21:558-565, 1978.
[RA81] G. Ricart and A. K. Agrawala. An optimal algorithm for mutual exclusion in computer networks. Communications of the ACM, 24(1), January 1981.
[CR79] E. G. Chang and R. Roberts. An improved algorithm for decentralized extrema-finding in circular configurations of processors. Communications of the ACM, 22(5):281-283, 1979.
[GM82] Hector Garcia-Molina. Elections in a distributed computing system. IEEE Transactions on Computers, 31(1), January 1982.
[LS76] Butler Lampson and H. Sturgis. Crash recovery in a distributed system. Working paper, Xerox PARC, Palo Alto, CA, USA, 1976.
[Ske81] D. Skeen. Nonblocking commit protocols. In SIGMOD International Conference on Management of Data, 1981.


6.Replication and Consistency


Replication
Replication involves creating and maintaining copies of services and data provided by a distributed system.
Unlike communication, without which it is impossible to build a distributed system, replication is not a fundamental
principle. This means that it is possible to build a distributed system that does not make use of replication.
Replication does, however, become important when reliability, performance, and scalability of a distributed
system are key concerns. In the case of reliability, creating many redundant copies of a service improves that service's
availability. With multiple servers available to clients, it is less likely that a malfunction of one of them will render the
whole service unavailable. Likewise, if the data on a server becomes corrupt, data stored at replicas can be used to
restore the correct state. With regards to performance, replicating services helps to reduce the load on individual
servers. Likewise, by placing replicas close to clients the impact of communication can be greatly reduced. Finally,
replication is a key technique for improving a system's scalability. As a service grows, creating more replicas allows the
service to scale along with the growth.
When considering the replication of services, there are two types of replication possible: data replication and
control replication. In the first case, only a service's data is replicated. Processing and manipulation of the data is
performed either by a non-replicated server, or by clients accessing the data. A typical example of data replication is a
replicated (also known as mirrored) FTP site. Web browsers with caches are another example of data replication. In the
second case, only the control part of the service is replicated while the data remains at a single centralised server. This
form of replication is generally used to improve or maintain performance by spreading the computational load over
multiple servers. It is also possible to combine data and control replication, in which case both the data and control are
replicated. They may be replicated together (i.e., both control and data are placed on the same replica servers), or
separately (i.e., data is replicated on different servers than control).
During the design and implementation of replication in a distributed system, there are a number of issues that
must be addressed. The most important of these is keeping the copies of replicated data consistent. Furthermore, it is
important to decide how replicas propagate updates amongst each other, where to place the replicas, how many replicas
to create, when to add and remove replicas, etc.
Distributed Data-Store
The following model of a distributed data-store will be used during the further discussion of replication. A data
store is a generic term for a service that stores data. Examples of data stores include: shared memory, databases, file
systems, objects, web servers, etc. A data store stores data items (depending on the data store, a data item could be a
page of memory, a record, a file, a variable, a Web page, etc.). Clients wishing to access data from a data store connect
to the data store and perform read and write operations on it. The exact nature of a client connection depends on the
underlying data store and could be through a network connection or direct access. We abstract from this detail by
assuming that, from the client's point of view, the time required to communicate with the data store is insignificant.
From a client's point of view, a data store acts like a centralised service hosted on a single server. Internally,
however, the data store consists of multiple servers (called replica servers) each containing a copy of all the data items
stored in the data store. Each replica server runs a replica manager process, which receives operation invocation
requests from clients and executes the operations locally on its copy of the data. The replica manager is also responsible
for communicating with replica managers running on the other replica servers. We refer to the combination of a replica
manager running on a replica server as a replica. A client always connects to a single replica. Figure 1 illustrates this
model.
Figure 1: The data store model (clients A-D, each connected to one of the replicas 1-4 that together form the data store)

The operations performed by clients on a data store will be represented on a time line, an example of which is
shown in Figure 2. In this figure, time flows to the right (the figure assumes an absolute global time, and the position of
the operations on the timeline reflect their ordering according to this time). For the operations there are three relevant
times: the time of issue, the time of execution, and the time of completion. Arrows show the time of execution of

operations on remote replicas. Read operations are always performed locally, while writes are assumed to be performed
locally first and then propagated to remote replicas.

Figure 2: An example timeline of two clients accessing a distributed data-store

Consistency
When we replicate data we must ensure that when one copy of the data is updated all the other copies are
updated too. Depending on how and when these updates are executed we can get inconsistencies in the data store (i.e.,
all the copies of the data are not the same). There are two ways in which the data store can be inconsistent. First the
data could be stale, that is, the data at some replicas has not been updated, while others have. Staleness is typically
measured using time or versions. As long as updates are reliably propagated to all replicas, given enough time (and a
lack of new updates) stale data will eventually become up to date. The other type of inconsistency occurs when
operations are performed in different orders at different replicas. This can cause more problems than staleness because
applying operations in different orders can lead to different results that cannot be consolidated by simply ensuring that
all replicas have received all updates. In the following we concentrate on consistency with regards to the ordering of
operations at different replicas.
Because clients on different machines can access the data store concurrently, it is possible for clients to invoke
conflicting operations. Two (or more) operations on the same data item are conflicting if they may occur concurrently
(that is each is invoked by a different client), and at least one of them is a write. We distinguish between read-write
conflicts (where only one operation is a write) and write-write conflicts (where more than one operation is a write). In
order for a data store to be consistent all write-write conflicting operations must be seen in an agreed upon order by
the clients.
All operations executed at a single replica occur in a particular order. This is called the replica's partial ordering.
Central to reasoning about consistency is the notion of an interleaving of all the partial orderings into a single timeline
(as though the operations were all performed on a single non-replicated data store). This is called the total ordering.

Consistency Models
In a non-distributed data-store the program order of operations is always maintained (i.e., the order of writes
as performed by a single client must be maintained). Likewise, data coherence is always respected. This means that if a
value is written to a particular data item, subsequent reads will return that value until the data item is modified again.
Ideally, a distributed data-store would also exhibit these properties in its total ordering. However, implementing such
a distributed data store is expensive, and so weaker models of consistency (that are less expensive to implement) have
been developed.
A consistency model defines which interleavings of operations (i.e., total orderings) are acceptable (admissible).
A data store that implements a particular consistency model must provide a total ordering of operations that is
admissible.
Data-Centric Consistency Models
The first, and most widely used, class of consistency models is the class of data-centric consistency models.
These are consistency models that apply to the whole data store. This means that any client accessing the data store
will see operations ordered according to the model. This is in contrast to client-centric consistency models (discussed
later) in which clients request a particular consistency model and different clients may see operations ordered in
different ways.
Strict Consistency The strict consistency model requires that any read on a data item returns a value corresponding to
the most recent write on that data item. This is what one would expect from a single program running on a uniprocessor.
A problem with strict consistency is that the interpretation of "most recent" is not clear in a distributed data store. A
strict interpretation requires that all clients have a notion of an absolute global time. It also requires instant propagation
of writes to all replicas. Due to the fact that it is not possible to achieve absolute global time in a distributed system and
the fact that communicating between replicas can never be instantaneous, strict consistency is impossible to implement
in a distributed data store.
A model that is close to strict consistency, but that is possible to implement in a distributed data store, is the
linearisable consistency model. In this model the requirement of absolute global time is dropped. Instead all operations
are ordered according to a timestamp taken from the invoking client's loosely synchronised local clock. Linearisable
consistency requires that all operations be ordered according to their timestamp. This means that all operations are
executed in the same order at all replicas. Note that although it is possible for a distributed data store to implement this
model, it is still very expensive to do so. For this reason linearisable consistency is rarely implemented.


Sequential Consistency Linearisable consistency is expensive to implement because of the time ordering requirement. The
sequential consistency model drops this requirement. In a data store that provides sequential consistency, all clients
see all (write) operations performed in the same order. However, unlike in the linearisable consistency model where
there is exactly one valid total ordering, in sequential consistency there are many valid total orderings. The only
requirement is that all clients see the same total ordering.

Figure 3: An example of a valid (sequential) and an invalid (not sequential) ordering of operations for the sequential consistency model (clients A and B write W(x)a and W(x)b; clients C and D read x)

Figure 3 shows an example of a valid and an invalid ordering of operations for the sequential consistency model. In
the example of invalid ordering the two write operations are executed in a different order on the replicas associated
with client C and client D. This is not admissible with the sequential consistency model.
It has been shown that there is a fixed minimum cost for implementations of sequential consistency. It is possible
to provide an implementation where reads are instantaneous but writes have a significant overhead, or an
implementation where writes are instantaneous but reads have a significant overhead. In other words, changing the
implementation to improve read performance makes write performance worse and vice versa.
Causal Consistency Often the requirement that all operations are seen in the same order is not important. The causal
consistency model weakens sequential consistency by requiring that only causally related write operations are
executed in the same order on all replicas. Two writes are causally related if the execution of one write possibly
influences the value written by the second write. Specifically two operations are causally related if:
A read is followed by a write in the same client
A write of a particular data item is followed by a read of that data item in any client.
If operations are not causally related they are said to be concurrent. Concurrent writes can be executed in any order, as
long as program order is respected.
Figure 4 shows an example of a valid and an invalid ordering of operations for the causal consistency model. In the
example of an invalid ordering we see that the write performed by client B (W(x)b) is causally related to the previous
write performed by client A (W(x)a). As such, these writes must appear in the same (causal) order at all replicas. This
is not the case for client D where we see that W(x)a is executed after W(x)b.
FIFO Consistency The FIFO (or Pipelined RAM) consistency model weakens causal consistency further: it drops all ordering requirements between writes issued by different clients, even causally related ones. FIFO consistency requires only that any total ordering respect the partial orderings of operations (i.e., program order).
Figure 5 provides an example of a valid and invalid ordering for FIFO consistency. In the invalid ordering example
client D does not observe the writes coming from client A in the correct order (i.e., W(x)c is executed before W(x)a).

Figure 4: An example of a valid (causally consistent) and an invalid (not causally consistent) ordering of operations for the causal consistency model (clients A-D performing writes W(x)a, W(x)b, W(x)c and reads of x)


Figure 5: An example of a valid and an invalid ordering of operations for the FIFO consistency model

Weak Consistency Whereas the previous consistency models specified the ordering requirements for all operations on
the data store, the following data-centric models drop this requirement dealing instead with the ordering of groups of
instructions. They do this by defining groups of operations (comparable to critical sections) and only dictating
requirements for the ordering of these groups, rather than individual operations.
The first of these models is the weak consistency model. In this model, critical sections are delimited using operations
on synchronisation variables (e.g., locks). In this model performing a synchronise operation on a synchronisation
variable causes the following to happen. First, all local writes are completed and the updated data items are propagated
to all other replicas. Second, all updates from other clients are executed locally (that is, the replica makes sure that its
copy of the data is up-to-date). Essentially, weak consistency imposes a sequentially consistent ordering of synchronise
operations.
Release Consistency The functionality of the synchronise operation as defined in the weak consistency model can be
split into two separate functions: bringing local state up-to-date and propagating local updates to all other replicas.
Because there is only one operation, weak consistency requires both to occur whenever entering and leaving a critical
section. This is generally not necessary. The release consistency model makes the distinction between the two functions
explicit by associating the first with an acquire() operation and the second with a release() operation. The
model requires a client to call acquire() when entering a critical section and release() when leaving it. This
ensures that when entering a critical section all data is up-to-date, while when leaving the section all updates are made
available to other replicas.
A slight modification of the release consistency model, called lazy release consistency, requires that updates are only
propagated and executed when an acquire() operation is performed. This saves much communication when a
critical section is performed repeatedly by a single client.
Entry Consistency The entry consistency model is similar to lazy release consistency, except that the synchronisation
variables are explicitly associated with specific data items. These data items are called guarded data-items. Use of
synchronisation variables in the entry consistency model is as follows. In order to write to a guarded data-item a client
must acquire that item's synchronisation variable in exclusive mode. This means that no other clients can hold that
variable. When the client is done updating the data item it releases the associated synchronisation variable. When a
client wishes to read a particular data item it must acquire the associated synchronisation variable in nonexclusive
mode. Multiple clients may hold a synchronisation variable in nonexclusive mode.
When performing an acquire, the client fetches the most recent version of the data item from the synchronisation variable's owner (a synchronisation variable's owner is the last client that performed an exclusive acquire on it).
Although this, and the previous weak consistency models, result in extra complexity for the programmer, it is
possible to hide the use of synchronisation variables by associating guarded data-items with objects. Invoking a method
on an object then automatically invokes the associated acquire and release operations.
CAP Theory and Eventual Consistency
In 2000 Eric Brewer claimed (and in 2002 it was proven [GL02]) that in a replicated data store, of the three desired
properties, consistency, availability, and partition tolerance, only two can ever be guaranteed at once. This is called the
CAP theorem. What it means is that in a system that can survive a network partition (that is, a system that continues to
function, and does not outright fail, when it is split into two or more parts that are (temporarily) unable to communicate
with each other) it is only possible to provide consistency (specifically the property that a read provides the results of
the latest write) or availability (that a write is always accepted and processed in a timely fashion), but not both.
Since network partitions are statistically very likely in large distributed systems, the CAP theorem presents a
real limitation for modern, large-scale distributed systems. This has led to the increasing popularity of eventual
consistency [Vog08].
The eventual consistency model weakens the temporal aspect of consistency, guaranteeing only that, if no
updates take place for a while, eventually all the replicas will contain the same data. This model generally applies when
there are few conflicting operations and means that only the ordering of writes to the same data item is respected.
The eventual consistency model requires that the data store experience few read-write conflicts (e.g., because
there are many more reads than writes), and that there are few if any write-write conflicts (e.g., because all writes are
performed by the same client). Also, it is imperative that clients accept temporary inconsistencies (i.e., staleness).
Typical examples of systems that allow eventual consistency are DNS and the Web. In both systems it takes new data a
while to propagate and replace old data stored in caches.


Client-Centric Consistency Models


The data-centric consistency models had an underlying assumption that the number of reads was
approximately equal to the number of writes, and that concurrent writes occur often. Client-centric consistency models,
on the other hand, assume that clients perform more reads than writes and that there are few concurrent writes. They
also assume that clients can be mobile, that is, they will connect to different replicas during the course of their execution.
Client-centric consistency models are based on the eventual consistency model but offer per-client models that hide some of the inconsistencies of eventual consistency. Client-centric consistency models are useful because they are
relatively cheap to implement.
For the discussion of client-centric consistency models we extend the data store model and notation somewhat.
The change to the data store model is that the client can change which replica it communicates with (i.e., the client is
mobile). We also introduce the concept of a write set (WS). A write set contains the history of writes that led to a
particular value of a particular data item at a particular replica. When showing timelines for client-centric consistency
models, we are now concerned with only one client performing operations while connected to different replicas.
Monotonic Reads The monotonic-reads model ensures that a client will always see progressively newer data and never
see data older than what it has seen before. This means that when a client performs a read on one replica and then a
subsequent read on a different replica, the second replica will have at least the same write set as the first replica. This
is shown in Figure 6. The figure also shows an invalid ordering for monotonic reads. This ordering is invalid because
the write set at the second replica does not yet contain that from the first.

Figure 6: An example of a valid and invalid ordering for the monotonic reads consistency model

Monotonic Writes The monotonic-writes model ensures that a write operation on a particular data item will be
completed before any successive write on that data item by the same client. In other words, all writes that a client
performs on a particular data item will be sequentially ordered. This is essentially a client-centric version of FIFO
consistency (the difference being that it only applies to a single client). Figure 7 shows an example of a valid and invalid
ordering for monotonic writes consistency. The example of the invalid ordering shows that the write performed at
replica 1 has not yet been executed at replica 2 when the second write is performed at that replica.

Figure 7: An example of a valid and invalid ordering for the monotonic writes consistency model

Read Your Writes In the read your writes consistency model, a client is guaranteed to always see its most recent writes.
Figure 8 shows an example of read your writes ordering. The figure also shows an example where the client does not
see its most recent write at another replica. In this case, the write set at replica 2 does not contain the most recent write
operation performed on replica 1.

Figure 8: An example of a valid and invalid ordering for the read your writes consistency model

Write Follows Reads This model states the opposite of read your writes, and guarantees that a client will always
perform writes on a version of the data that is at least as new as the last version it saw. Figure 9 shows an example of write
follows reads ordering. In the example of the non write follows reads ordering, the two replicas do not have the same
write set (and the one on replica 2

is also not newer than the one on replica 1). This means that the read and the write operations are not performed on
the same state.

Figure 9: An example of a valid and invalid ordering for the write follows reads consistency model

Consistency Protocols
Having discussed various consistency models, it is now time to focus on the implementation of these models. A
consistency protocol provides an implementation of a consistency model in that it manages the ordering of operations
according to its particular consistency model. In this section we focus on the various ways of implementing data-centric
consistency models, with an emphasis on sequential consistency (which includes the weak consistency models as well).
There are two main classes of data-centric consistency protocols: primary-based protocols and replicated-write protocols. Primary-based protocols require that each data item have a primary copy (or home) on which
all writes are performed. In contrast, the replicated-write protocols require that writes are performed on multiple
replicas simultaneously.
The primary-based approach to consistency protocols can further be split into two classes: remote-write and
local-write. In remote-write protocols writes are possibly executed on a remote replica. In local-write writes are always
executed on the local replica.
Single Server The first of the remote-write protocols is the single server protocol. This protocol implements sequential
consistency by effectively centralising all data and foregoing data replication altogether (it does, however, allow data
distribution). All write operations on a data item are forwarded to the server holding that item's primary copy. Reads
are also forwarded to this server. Although this protocol is easy to implement, it does not scale well and has a negative
impact on performance. Note that due to the lack of replication, this protocol does not provide a distributed system with
reliability.
Primary-Backup The primary-backup protocol allows reads to be executed at any replica, however, writes can still
only be performed at a data item's primary copy. The replicas (called backups) all hold copies of the data item, and a
write operation blocks until the write has been propagated to all of these replicas. Because of the blocking write, this
protocol can easily be used to implement sequential consistency. However, this has a negative impact on performance
and scalability. It does, however, improve a systems reliability. Furthermore, while it is possible to make the write
nonblocking, greatly improving performance, such a system would no longer guarantee sequential consistency.
Migration The migration protocol is the first of the local-write protocols. This protocol is similar to single server in that
the data is not replicated. However, when a data item is accessed it is moved from its original location to the replica of
the client accessing it. The benefit of this approach is that data is always consistent and repeated reads and writes occur
quickly, with no delay. The drawback is that concurrent reads and writes can lead to thrashing behaviour where the
data item is constantly being copied back and forth. Furthermore, the system must keep track of every data item's
current home. There are many techniques for doing this including broadcast, forwarding pointers, name services, etc.
A number of these techniques will be discussed in a future lecture on naming.
Migrating Primary (multiple reader/single writer) An improvement on the migration protocol is to allow read
operations to be performed on local replicas and to migrate the primary copy only on writes. This improves on the write
performance of primary-backup (only if nonblocking writes are used), and avoids some of the thrashing of the migration
approach. It is also good for (mobile) clients operating in disconnected mode. Before disconnecting from the network
the client becomes the primary allowing it to perform updates locally. When the client reconnects to the network it
updates all the backups.
Active Replication The active replication protocol is a replicated write protocol. In this protocol write operations are
propagated to all replicas, while reads are performed locally. The writes can be propagated using either point-to-point
communication or multicast. The benefit of this approach is that all replicas receive all operations at the same time (and
in the same order), and it is not necessary to track a primary, or send all operations to a single server. However it does
require atomic multicast or a centralised sequencer, neither of which are scalable approaches.
Quorum-Based Protocols With quorum-based protocols, write operations are executed at a subset of all replicas. When
performing read operations clients must also contact a subset of replicas to find out the newest version of the data. In
this protocol all data items are associated with a version number. Every time a data item is modified its version number
is increased.
This protocol defines a write quorum and a read quorum, which specify the number of replicas that must be contacted
for writes and reads respectively. The write quorum must be greater than half of the total replicas, while the sum of the
read quorum and the write quorum must be greater than the total number of replicas. In this way a client performing a

read operation is guaranteed to contact at least one replica that has the newest version of the data item. The choice of
quorum sizes depends on the expected read-write ratio and the cost of group communication.
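As a small illustration (a sketch, with invented function names), the two constraints and the read-side version check can be written down directly; with 5 replicas, for example, a write quorum of 4 combined with a read quorum of 2 is valid, whereas a write quorum of 3 with a read quorum of 2 is not.

def quorum_ok(n_replicas, read_quorum, write_quorum):
    """Check the constraints that guarantee every read quorum overlaps every write quorum."""
    return (write_quorum > n_replicas / 2 and
            read_quorum + write_quorum > n_replicas)

def read_newest(replies):
    """Given (version, value) pairs from a read quorum, return the newest value.
    Assumes distinct version numbers per data item."""
    version, value = max(replies)
    return value

assert quorum_ok(5, read_quorum=2, write_quorum=4)        # read-heavy configuration
assert not quorum_ok(5, read_quorum=2, write_quorum=3)    # a read could miss the newest write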

Update Propagation
Another important aspect of implementing replication and consistency protocols is the question of how updates are
propagated to other replicas. There are three approaches to this: send the data, send the operation, and send an
invalidation. In the first approach the updated data item is simply sent to the other replicas. In the second approach the
operation that caused an update to the data item is sent to all replicas. This operation is then performed by the remote
replicas, updating their local store. Finally sending an invalidation involves notifying the replicas that the copy of the
data item that they hold is no longer valid. It is then up to each replica to contact the sender of the invalidation to retrieve
the new state of the data item. Which approach to use largely depends on the context.
The benefits of sending the updated data are that the replicas do not need to perform any actions other than simply
replacing their copy of the appropriate data item. Sending the data is a good approach if the data items are small, or if
few updates are performed. The benefits of propagating the update operation are that the messages may be significantly
smaller than the actual data items, which is useful when bandwidth is limited. Likewise, replicas can store logs of
operations, and it may be possible to resolve write conflicts if the updates affect different parts of the data item.
Invalidation is a useful approach if data items are large, and many updates occur.
Push vs Pull
Besides deciding what to send in an update message, it is also important to decide whether updates are pushed to all
replicas when they occur or pulled from replicas when they are needed. The push model is a useful approach when a
high degree of freshness is required (i.e., clients always want to access the newest data), as well as when there are few
updates and many reads. A drawback of this approach, however, is that the writer must keep track of all replicas. This
is not a problem when the set of replicas is small and stable, but does become a problem when there are many replicas
(e.g., web browser caches) and they are unstable (e.g., browsers purging their caches, being stopped and started, etc.)
On the other hand, when there are many updates and few reads it is more efficient to have the reader pull the update,
that is, send a request for the newest version whenever a read is performed. It is also efficient to do this when the server
does not want to keep track of all replicas. The drawback of this approach is that it may incur a polling delay, meaning
that replicas must check for the most up to date version every time a read request is made. It is possible to avoid the
poll delay by periodically checking the freshness of the replicated data, however, this means that replicas may contain
stale data (as happens in the Web).
Leases Because the push approach is inefficient if a replica has no interested clients, the concept of timed leases can be
used to keep track of and push to interested replicas only. When a replica is interested in receiving updates for a
particular data item it acquires a lease for that item. Whenever updates occur they are then propagated to that replica.
When the lease expires the replica no longer receives updates. It is up to each replica to renew its lease if it is still
interested in updates.
To cut down on the costs of constantly renewing leases and to prevent sending unnecessary updates, it is
possible to base the length of leases on characteristics of the replicas and data items. Lease length can be based on the
age of the data item, on the renewal frequency of a replica, and on the overhead that lease management incurs. With age
based leases, data items that were recently modified will receive shorter leases. This is based on the expectation that
they will be modified again soon and will therefore generate many update messages. As such it is important to make
sure that the receivers of the updates are still interested in them. For data items that are not expected to be modified
soon, the lease can be longer, as they will not generate many update messages. With renewal-frequency-based leases, replicas that often request to have their copy of a data item updated will receive longer leases than those that do so infrequently. Finally, state-space-overhead-based leases base the lease length on available resources for storing and
processing lease state as well as propagating updates. When available resources are low, lease lengths will also be low.
Multicast vs Unicast
As has been mentioned previously, multicast communication can often be used to propagate updates to other replicas.
This is particularly useful when updates have to be propagated to many replicas. Furthermore atomic multicast is very
useful for maintaining operation order. In atomic multicast a message is guaranteed to be delivered to all recipients or
none at all. Also all messages are delivered in the same order to all recipients. Unfortunately it is difficult to implement
atomic multicast in a scalable way.
In the situation where all replicas are on the same LAN, broadcast can also be used. This is obviously not
geographically scalable, but in many situations (e.g., a cluster of replicated servers) such scalability is not required.
Unicast (or point-to-point communication) is more useful when updates are pulled by clients, or when multicast cannot
be implemented efficiently. A further consideration is that unicast communication mechanisms are also more readily
available to programmers.

Replica Placement
A final issue with regards to the implementation of replication is the question of where to place replicas, how many
replicas to create, who is responsible for creating and maintaining them, and how clients find the most appropriate
replicas to connect to. Replicas can be categorised into permanent replicas, server-initiated replicas, and client-initiated
replicas as shown in Figure 10.

[Figure 10: Different kinds of replicas: permanent replicas, server-initiated replicas, and client-initiated replicas, with clients at the edge.]

Permanent replicas are created by the data store owner and function as permanent storage for the data. Often
this is a single server, but it may also be a cluster or group of mirrors maintained by the data store owner. This category
also includes replicas that are created for fault tolerance (availability) reasons. Writes are usually only performed by
clients directly connected to permanent replicas. Server-initiated replicas are replicas created in order to enhance the
performance of the system. They are created at the request of the data store owner but are often placed on servers
maintained by others. These replicas are not as long-lived as the permanent replicas, although exceptions where server-initiated
replicas exist for as long as the permanent replicas are possible. In order to improve performance, these replicas
are placed close to large concentrations of clients. Finally, client-initiated replicas are temporary copies that the data
store owner is not generally aware of. They are created by clients to improve their own performance and access to the
data. A typical example of client-initiated replicas are Web browser caches and proxy caches.
Dynamic Replication
In many cases the patterns of use that a distributed system will experience will change over time. For example,
the number of clients accessing the system can change (grow or shrink), the amount of data that the system contains
will tend to grow, the access characteristics may change (i.e., the R/W ratio may change), etc. The changes can be steady
or bursty. Bursty changes are often characterised by sudden, heavy, increases in usage, followed by sharp declines.
In order to adapt to these changes many systems apply dynamic replica placement. With dynamic replica
placement, the decisions about where to place replicas and when to create new ones are made automatically by the
system. This kind of automatic replication requires a specific infrastructure which allows the collection of usage pattern
data and the migration of replicas to and from other servers. It generally also requires the availability of a supporting
network of servers willing to host replicas.
An example of a dynamic replica placement strategy comes from the RaDaR Web hosting service [RA99]. In
that system clients send all requests to the nearest server, which forwards them on to a server that holds an actual
replica. All servers keep track of where requests for replicated data originated. The system defines a number of
thresholds: replication, migration and deletion. These thresholds are used to determine what should happen with
replicas: whether new ones should be created, existing ones destroyed, or existing ones moved to different servers. For
example, when the number of requests at a particular replica exceeds the replication threshold a new replica will be
created.
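The following Python fragment sketches the threshold idea only; the constants, counters, and function signature are illustrative assumptions and not RaDaR's actual interface.

    REPLICATION_THRESHOLD = 1000   # requests per interval before adding a replica
    DELETION_THRESHOLD    = 10     # below this, a replica is a candidate for deletion
    MIGRATION_THRESHOLD   = 500    # enough remote demand to move a replica

    def placement_decision(local_requests, requests_by_region, current_region):
        # Decide what to do with one replica at the end of a measurement interval.
        busiest_region, remote_requests = max(requests_by_region.items(),
                                              key=lambda kv: kv[1])
        if local_requests > REPLICATION_THRESHOLD:
            return ("replicate", busiest_region)     # create an additional replica
        if local_requests < DELETION_THRESHOLD:
            return ("delete", current_region)        # too little demand here
        if busiest_region != current_region and remote_requests > MIGRATION_THRESHOLD:
            return ("migrate", busiest_region)       # move closer to the demand
        return ("keep", current_region)

The real system bases its decisions on where requests originate, as described above; the per-region request counts here simply stand in for that bookkeeping.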

Request Routing
So far it has been assumed that clients always connect to the most appropriate replica. Determining where a
client's most appropriate replica is, or even what "most appropriate" means for a client, is a difficult problem. Most notably, it
is difficult to integrate replication solutions into existing distributed systems (e.g., the Web, FTP, etc.) precisely because
of this problem. Ideally a client would transparently connect to its most appropriate replica, without user intervention
and without a noticeable detour to a redirection service. The details of identifying and finding replicas and other
resources in a distributed system will be discussed in a future lecture on naming in distributed systems.

References
[GL02] Seth Gilbert and Nancy A. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, June 2002.
[RA99] Michael Rabinovich and Amit Aggarwal. RaDaR: a scalable architecture for a global Web hosting service.
Computer Networks, 31(11-16):1545-1561, 1999.
[Vog08] Werner Vogels. Eventual consistency. ACM Queue, October 2008.
[CL85] K. Mani Chandy and Leslie Lamport. Distributed snapshots: Determining global states of distributed systems.
ACM Transactions on Computer Systems, 3(1):63-75, 1985.
[Lam78] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM,
21(7):558-565, 1978.
[RA81] G. Ricart and A. Agrawala. An optimal algorithm for mutual exclusion in computer networks. Communications of
the ACM, 24(1), January 1981.
[CR79] E. G. Chang and R. Roberts. An improved algorithm for decentralized extrema-finding in circular configurations
of processors. Communications of the ACM, 22(5):281-283, 1979.

[GM82] Hector Garcia-Molina. Elections in a distributed computing system. IEEE Transactions on Computers, 31(1),
January 1982.
[LS76] Butler Lampson and H. Sturgis. Crash recovery in a distributed system. Working paper, Xerox PARC, CA, USA,
1976.
[Ske81] D. Skeen. Nonblocking commit protocols. In SIGMOD International Conference on Management of Data, 1981.


7.Consensus
Reaching agreement
Consensus is the task of getting all processes in a group to agree on some specific value based on the votes of each
process. All processes must agree upon the same value and it must be a value that was submitted by at least one of
the processes (i.e., the consensus algorithm cannot just invent a value). In the most basic case, the value may be binary
(0 or 1), which will allow all processes to use it to make a decision on whether to do something or not.
With election algorithms, our goal was to pick a leader. With distributed transactions, we needed to get unanimous
agreement on whether to commit. These are forms of consensus. With a consensus algorithm, we need to get unanimous
agreement on some value. This is a simple-sounding problem but finds a surprisingly large amount of use in distributed
systems. Any algorithm that relies on multiple processes maintaining common state relies on solving the consensus
problem. Some examples of places where consensus has come in useful are:

- synchronizing replicated state machines and making sure all replicas have the same (consistent) view of system state
- electing a leader (e.g., for mutual exclusion)
- distributed, fault-tolerant logging with globally consistent sequencing
- managing group membership
- deciding whether to commit or abort a distributed transaction

Consensus among processes is easy to achieve in a perfect world. For example, when we examined distributed
mutual exclusion algorithms earlier, we visited a form of consensus where everybody reaches the same decision on
who can access a resource. The simplest implementation was to assign a system-wide coordinator who is in charge of
determining the outcome. The two-phase commit protocol is also an example of a system where we assume that the
coordinator and cohorts are alive and communicating or we can afford to wait for them to restart, indefinitely if
necessary. The catch to those algorithms was that all processes had to be functioning and able to communicate with
each other. Faults, both process failures and communication failures, make consensus difficult.
Dealing with failure
We cannot provably achieve consensus with completely asynchronous faulty processes. Here, asynchronous means making no
assumptions about the speeds of the processes or of network communication. The core problem is that there is no way to
check whether a process has failed or whether the process is alive but the communication to the process is intolerably
slow. This impossibility was proved by Fischer, Lynch, and Paterson (the FLP result, 1985). Also, in the presence of unreliable
communication, consensus is impossible to achieve since we may never be able to communicate with a process.
We will examine two fault tolerance scenarios that illustrate some basic constraints that are imposed on us. The two
army problem is particularly relevant.
Two Army Problem
Let's examine the case of good processors but faulty communication lines. This is known as the two army problem and
can be summarized as follows:
Two divisions of an army, A and B, coordinate an attack on enemy army, C. A and B are physically separated and
use a messenger to communicate. A sends a messenger to B with a message of "let's attack at dawn". B receives the
message and agrees, sending back the messenger with an "OK" message. The messenger arrives at A, but A realizes that
B does not know whether the messenger made it back safely. If B is not convinced that A received the acknowledgement,
then it will not be confident that the attack should take place, since neither division can win on its own. A may choose to
send the messenger back to B with a message of "A received the OK", but A will then be unsure as to whether B received
this message. The two army problem demonstrates that even with non-faulty processors, provable agreement between
two processes is not possible with unreliable communication channels.
In the real world, we will need to place upper bounds on communication and computing speeds and consider a process
to be faulty if it does not respond within that bounded time.
Fail-stop, also known as fail-silent, is the condition in which a failed process does not communicate. A Byzantine fault is
one where the faulty process continues to communicate but may produce faulty information. We can create a consensus
algorithm that is resilient to fail-stop. If there are n processes, of which t may be faulty, then a process can never expect
to receive more than (n-t) acknowledgements. The consensus problem now is to make sure that the same decision is
made by all processes, even if each process receives up to (n-t) answers from a different set of processes (perhaps due
to partial network segmentation or routing problems).
A fail-stop resilient algorithm can be demonstrated as follows. It is an iterative algorithm; each phase consists of:
1. A process broadcasts its preferred value and the number of processes that it has seen that also have that
preferred value (this count is called the cardinality and is 1 initially).
2. The process receives (n-t) answers, each containing a preferred value and cardinality.
3. The process may then change its preferred value according to which value was preferred most by other
processes. It updates the corresponding cardinality to the number of responses it has received with that value,
plus itself.

Continue this process until a process receives t messages of a single value with cardinality at least n/2. This means that
at least half of the systems have agreed on the same value. At this point, run two more phases, broadcasting this value.
As the number of phases goes to infinity, the probability that consensus is not reached approaches 0.
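To make the per-phase step concrete, here is a much-simplified Python sketch of one phase from a single process's point of view; the function name, the (value, cardinality) pair format, and the synchronous delivery of exactly (n-t) responses are illustrative assumptions.

    from collections import Counter

    def run_phase(preferred, cardinality, responses):
        # responses: list of (value, cardinality) pairs received from (n - t)
        # other processes during this phase.
        counts = Counter()
        for value, card in responses:
            counts[value] += 1
        # Adopt the value preferred by the most responders (ties broken arbitrarily).
        best_value, votes = counts.most_common(1)[0]
        preferred = best_value
        # New cardinality: responses carrying this value, plus this process itself.
        cardinality = votes + 1
        return preferred, cardinality

For example, run_phase(0, 1, [(1, 2), (1, 3), (0, 1)]) returns (1, 3): the process switches its preference to 1 because two of the three responses preferred it.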
To make it easier to develop algorithms in the real world, we can relax the definition of asynchronous and allow some
synchrony. Several types of asynchrony may exist in a system:
1. Process asynchrony: a process may go to sleep or be suspended for an arbitrary amount of time.
2. Communication asynchrony: there is no upper bound on the time a message may take to reach its destination.
3. Message order asynchrony: messages may be delivered in a different order than sent.

It has been shown [Dolev, D., Dwork, C., and Stockmeyer] that it is not sufficient to make only the processes synchronous, but
that any of the following cases is sufficient to make a consensus protocol possible:
1. Process and communication synchrony: place an upper bound on process sleep time and on message
transmission time.
2. Process and message order synchrony: place an upper bound on process sleep time and deliver messages in the
order they were sent.
3. Message order synchrony and broadcast capability.
4. Communication synchrony, broadcast capability, and send/receive atomicity. Send/receive atomicity means
that a processor can carry out the operations of receiving a message, performing computation, and sending
messages to other processes as a single atomic step.

Byzantine failures in synchronous systems


Solutions to the Byzantine Generals problem are not obvious, intuitive, or simple, and they are not presented in
these notes. You can read Lamport's paper on the problem for the details; brief summaries and various
solutions that go beyond the Lamport paper are also available.
We looked at the case of unreliable communication lines with reliable processors, and at fail-stop process failures with
reliable communication. The other case to consider is that of reliable communication lines but faulty (not fail-stop)
processors. Byzantine failures are failed processors that, instead of staying quiet (fail-stop), continue to communicate but send erroneous data.
Byzantine Generals Problem
Consensus with reliable communication lines and byzantine failures is illustrated by the Byzantine Generals
Problem. In this problem, there are n army generals who head different divisions. Communication is reliable (radio or
telephone) but m of the generals are traitors (faulty) and are trying to prevent others from reaching agreement by
feeding them incorrect information. The question is: can the loyal generals still reach agreement? Specifically, each
general knows the size of his division. At the end of the algorithm, can each general know the troop strength of every
other loyal division?
Lamport demonstrated a solution that works for certain cases. His answer to this problem is that any solution
to the problem of overcoming m traitors requires a minimum of 3m+1 participants (2m+1 loyal generals). This means
that more than 2/3 of the generals must be loyal. Moreover, it was demonstrated that no protocol can overcome m faults
with fewer than m+1 rounds of message exchanges and O(m*n^2) messages. If n < 3m + 1 then the problem has no solution.
Clearly, this is a rather costly solution. While the Byzantine model may be applicable to certain types of special-purpose
hardware, it will rarely be useful in general purpose distributed computing environments.
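As a quick worked check of these bounds (an illustrative example, not part of the original result): to tolerate a single traitor (m = 1) we need at least 3m + 1 = 4 generals, at least m + 1 = 2 rounds of messages, and on the order of m*n^2 = 1*16 = 16 messages; with only n = 3 generals, one of whom is a traitor, agreement is impossible.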
There is a variation on the Byzantine Generals Problem that uses signed messages. What this means is that messages
from loyal generals cannot be forged or modified. In this case, there are algorithms that can achieve consensus for values
of n >= m + 2, where
n = total number of processors
m = total number of faulty processors
Replicated state machines
An important motivation for building distributed systems is to achieve high scalability and high availability. High
availability can be achieved via redundancy: replicated functioning components will take the place of those that ceased
to function. To achieve redundancy with multiple active components, we want all working replicas to do the same thing:
produce the same outputs given the same inputs.


A state machine approach to systems design models each replica (each component of the system) as a deterministic
state machine. For some given input to a specific state of the system, a deterministic output and transition to a new
state will be produced. We refer to each replica (component) as a process. For correct execution and high availability, it
is important that each process sees the same inputs. To do this, we rely on a consensus algorithm. This ensures that
multiple processes will do the same thing since they will each be provided with the same set of inputs.
An example of an input may be a request from a client to read data from a specific location from a file or write data to a
specific location of a file. We want the replicated files to contain the exact same data and yield the same results. To
achieve this, we need agreement among all processes on what the client requests are and the requests must be totally
ordered: each server must see file read/write requests in the exact same order as everyone else. The total ordering part
is most easily achieved by electing one process to serve sequence numbers (although there are other, more complex but
more distributed, implementations).
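A minimal Python sketch of the replicated state machine idea follows; the key-value store, command format, and class name are illustrative assumptions. The point is only that deterministic replicas fed the same totally ordered log end up in the same state.

    class ReplicaStateMachine:
        # A deterministic key-value state machine; one instance per replica.
        def __init__(self):
            self.store = {}

        def apply(self, command):
            # Commands are deterministic: the same input sequence always
            # produces the same state and the same outputs.
            op, key, value = command
            if op == "write":
                self.store[key] = value
                return "ok"
            if op == "read":
                return self.store.get(key)
            raise ValueError("unknown operation: " + op)

    # If a consensus protocol hands every replica the same ordered log,
    # all replicas produce identical stores and identical outputs.
    log = [("write", "x", 1), ("write", "y", 2), ("read", "x", None)]
    replicas = [ReplicaStateMachine() for _ in range(3)]
    outputs = [[r.apply(cmd) for cmd in log] for r in replicas]
    assert all(o == outputs[0] for o in outputs)

Everything interesting, of course, is in getting that identical ordered log to every replica, which is exactly what the consensus algorithm provides.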

Paxos
Paxos is a popular fault-tolerant distributed consensus algorithm. It allows a globally consistent (total) order to be
assigned to client messages (actions).
Much of what is summarized here is from Lamport's Paxos Made Simple but I tried to simplify it substantially. Please
refer to that paper for more detail and definitive explanations.
The goal of a distributed consensus algorithm is to allow a set of computers to all agree on a single value that one of the
nodes in the system proposed (as opposed to making up a random value). The challenge in doing this in a distributed
system is that messages can be lost or machines can fail. Paxos guarantees that a set of machines will choose a single
proposed value as long as a majority of the systems that participate in the algorithm are available.
The setting for the algorithm is that of a collection of processes that can propose values. The algorithm has to ensure
that a single one of those proposed values is chosen and all processes should learn that value.
There are three classes of agents:
1. Proposers
2. Acceptors
3. Learners
A machine can take on any or all of these roles. Proposers put forth proposed values. Acceptors drive the algorithm's
goal to reach agreement on a single value and ensure that the learners are informed of the outcome. Acceptors either reject a
proposal or agree to it and make promises on what proposals they will accept in the future. This ensures that only the
latest set of proposals will be accepted. A process can act as more than one agent in an implementation. Indeed, many
implementations have collections of processes where each process takes on all three roles.
Agents communicate with each other asynchronously. They may also fail to communicate and may restart. Messages
can take arbitrarily long to deliver. They can be duplicated or lost but are not corrupted. A corrupted message
should be detectable as such and can be counted as a lost one (this is what UDP does, for example).
The absolutely simplest implementation contains a single acceptor. A proposer sends a proposal value to the acceptor.
The acceptor processes one request at a time, chooses the first proposed value that it receives, and lets everyone
(learners) know. Other proposers must agree to that value.
This works as long as the acceptor doesn't fail. Unfortunately, acceptors are subject to failure. To guard against the
failure of an acceptor, we turn to replication and use multiple acceptor processes. A proposer now sends a proposal
containing a value to a set of acceptors. The value is chosen when a majority of the acceptors accept that proposal (agree
to it).
Different proposers, however, could independently initiate proposals at approximately the same time and
those proposals could contain different values. They each will communicate with a different subset of acceptors. Now
different acceptors will each have different values but none will have a majority. We need to allow an acceptor to be
able to accept more than one proposal. We will keep track of proposals by assigning a unique proposal number to each
proposal. Each proposal will contain a proposal number and a value. Different proposals must have different proposal
numbers. Our goal is to agree on one of those proposed values from the pool of proposals sent to different subsets of
acceptors.
A value is chosen when a single proposal with that value has been accepted by a majority of the acceptors.
Multiple proposals can be chosen, but all of them must have the same value: if a proposal with
a value v is chosen, then every higher-numbered proposal that is chosen must also have value v.
If a proposal with proposal number n and value v is issued, then there is a set S consisting of a majority of acceptors
such that either:
1. no acceptor in S has accepted any proposal numbered less than n, or
2. v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the
acceptors in S.

A proposer that wants to issue a proposal numbered n must learn the highest numbered proposal with number less
than n, if any, that has been or will be accepted by each acceptor in a majority of acceptors. To do this, the proposer gets
a promise from an acceptor that there will be no future acceptance of proposals numbered less than n.

The Paxos algorithm


The Paxos algorithm operates in two phases:
Phase 1: Prepare: send a proposal request
Proposer:
A proposer chooses a proposal number n and sends a prepare request to a majority of acceptors. The number
n is stored in the proposer's stable storage so that the proposer can ensure that a higher number is used for the
next proposal (even if the proposer process restarts).
Acceptor:
If an acceptor has already responded to a prepare request numbered greater than n, then it ignores this prepare(n) request.
Otherwise, the acceptor promises never to accept a proposal numbered less than n.
The acceptor replies to the proposer with the proposal that it has previously accepted with the
highest number less than n (if any): reply(n', v').

If a proposer receives the requested responses to its prepare request from a majority of the acceptors, then it
can issue a proposal with number n and value v, where v is the value of the highest-numbered proposal among
the responses or any value selected by the proposer if the responding acceptors reported no proposals.
Phase 2: Accept: send a proposal (and then propagate it to learners after acceptance)
Proposer:
A proposer can now issue its proposal. It will send a message to a set of acceptors stating that its proposal
should be accepted (an accept(n,v) message). If the proposer receives a response to its prepare(n) requests
from a majority of acceptors, it then sends an accept(n, v) request to each of those acceptors for a proposal
numbered n with a value v, where v is the highest-numbered proposal among the responses, or is any value if
the responses reported no proposals.
Acceptor:
If an acceptor receives an accept(n, v) request for a proposal numbered n, it accepts the proposal unless it has
already responded to a prepare request having a number greater than n.
The acceptor receives two types of requests from proposers: prepare and accept requests. Any request can be
ignored. An acceptor only needs to remember the highest-numbered proposal that it has ever accepted and the number
of the highest-numbered prepare request to which it has responded. The acceptor must store these values in stable
storage so they can be preserved in case the acceptor fails and has to restart.
A proposer can make multiple proposals as long as it follows the algorithm for each one.
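A minimal Python sketch of the acceptor's side of these two phases is given below. The message format and the in-memory fields are illustrative assumptions; as noted above, a real acceptor must keep its promised number and its highest accepted proposal in stable storage.

    class Acceptor:
        def __init__(self):
            self.promised_n = None   # highest prepare number responded to
            self.accepted = None     # (n, v) of the highest-numbered accepted proposal

        def on_prepare(self, n):
            # Ignore prepare requests numbered no higher than one already promised.
            if self.promised_n is not None and n <= self.promised_n:
                return None
            self.promised_n = n
            # Promise not to accept proposals numbered < n and report any
            # previously accepted proposal back to the proposer.
            return ("promise", n, self.accepted)

        def on_accept(self, n, v):
            # Accept unless a higher-numbered prepare has already been promised.
            if self.promised_n is not None and n < self.promised_n:
                return None
            self.promised_n = n
            self.accepted = (n, v)
            return ("accepted", n, v)

Whenever on_accept succeeds, the acceptor would also forward the accepted proposal to the learners, as described next.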
Consensus
Now that the acceptors have a proposed value, we need a way to learn that a proposal has been accepted by a
majority of acceptors. The learner is responsible for getting this information. Each acceptor, upon accepting a proposal,
forwards it to all the learners. The problem with doing this is the potentially large number of duplicate messages:
(number of acceptors) * (number of learners). If desired, this could be optimized. One or more "distinguished learners"
could be elected. Acceptors will communicate to them and they, in turn, will inform the other learners.
Ensuring progress
One problem with the algorithm is that it is possible for two proposers to keep issuing sequences of proposals
with increasing numbers, none of which gets chosen. An accept message from one proposer may be ignored by an
acceptor because a higher-numbered prepare message has been processed from the other proposer. To ensure that the
algorithm will make progress, a "distinguished proposer" is selected as the only one to try issuing proposals.
In operation, clients send commands to the leader, an elected "distinguished proposer". This proposer sequences
the commands (assigns a value) and runs the Paxos algorithm to ensure that an agreed-upon sequence number gets
chosen. Since there might be conflicts due to failures or another server thinking it is the leader, using Paxos ensures
that only one command (proposal) gets assigned that value.
Leasing versus Locking
Processes often rely on locks to ensure exclusive access to a resource. The difficulty with locks is that they are
not fault-tolerant. If a process holding a lock dies or forgets to release the lock, the lock persists unless additional software
is in place to detect these situations and break the lock. For this reason, it is safer to add an expiration time to a lock.
This turns a lock into a lease.


We saw an example of this approach with the two-phase and three-phase commit protocols. A two-phase commit
protocol uses locking while the three-phase commit uses leasing; if a lease expires, the transaction is aborted. We also
saw this approach with maintaining references to remote objects. If the lease expires, the server considers the object
unreferenced and suitable for deletion. The client is responsible for renewing the lease periodically as long as it needs
the object.
The downside with a leasing approach is that the resource is unavailable to others until the lease expires. Now
we have a trade-off: have long leases with a possibly long wait after a failure or have short leases that need to be
renewed frequently.
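A toy sketch of a lease as an expiring lock, in Python; the API and the use of local clock time are simplifying assumptions (a real system also has to worry about clock skew between the client and the server granting the lease).

    import time

    class Lease:
        def __init__(self, duration):
            self.duration = duration      # lease length in seconds
            self.holder = None
            self.expires_at = 0.0

        def acquire(self, client):
            now = time.monotonic()
            # A lease held by a crashed client simply expires; there is no need
            # for separate lock-breaking machinery.
            if self.holder is None or now >= self.expires_at:
                self.holder = client
                self.expires_at = now + self.duration
                return True
            return False

        def renew(self, client):
            # Only the current holder may renew, and only before expiry.
            now = time.monotonic()
            if self.holder == client and now < self.expires_at:
                self.expires_at = now + self.duration
                return True
            return False

The trade-off described above shows up directly in the duration parameter: a long lease means a possibly long wait after a holder fails, while a short one means frequent renew calls.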
Hierarchical leases versus consensus
In a fault tolerant system with replicated components, leases for resources should be granted by running a
consensus algorithm. Looking at Paxos, it is clear that, while there is not a huge amount of message passing taking place,
there are a number of players involved and hence there is a certain efficiency cost in using the algorithm. A compromise
approach is to use the consensus algorithm as an election algorithm to elect a coordinator. This coordinator is granted
a lease on a large set of resources or the state of the system. In turn, the coordinator is now responsible for handing out
leases for all or a subset of the system state. When the coordinator's main lease expires, a consensus algorithm has to
be run again to grant a new lease and possibly elect a new coordinator but it does not have to be run for every client's
lease request; that is simply handled by the coordinator.
References
Leslie Lamport, Paxos Made Simple, November 2001.
One of the clearest papers out there detailing the Paxos algorithm
Lampson, Butler. How to Build a Highly Available System Using Consensus, Microsoft Research
An updated version of Distributed Algorithms, ed. Babaoglu and Marzullo, Lecture Notes in Computer Science
1151, Springer, 1996, pp 1-17.
A great coverage of leases, the Paxos algorithm, and the need for consensus in achieving highly available
computing using replicated state machines.
Henry Robinson, Consensus Protocols: Paxos, Paper Trail blog, February 2009.
Yair Amir and Jonathan Kirsch, Paxos for System Builders: An Overview, Johns Hopkins University.
Written from a system-builder's perspective and covers some of the details of implementation.


8.Distributed Transactions
ACID, Commit protocols, and BASE
We've looked at a number of low level techniques that can be used for managing synchronization in a
distributed environment: algorithms for mutual exclusion and critical section management. In addition (and we'll look
at these later), we can have algorithms for deadlock resolution and crash recovery. Much as remote procedure calls
allowed us to concentrate on the functionality of a program and express it in a more natural way than sends and
receives, we crave a higher level of abstraction in dealing with issues of synchronization. This brings us to the topic of
atomic transactions (also known colloquially simply as transactions).
In transacting business, all parties involved may have to go through a number of steps in negotiating a contract but the
end result of the transaction won't be committed until both parties sign on the dotted line. If even one of the parties
reconsiders and aborts, the contract will be forgotten and life goes on as before.
Consider, for example, the purchase of a house. You express your interest in purchasing a house by making an
offer (and possibly putting some money down with a trusted party). At that point, you have not bought the house, but
you have entered the transaction of purchasing a house. You may have things to do (such as getting a mortgage and
inspection) and the seller may have things to do (such as fixing up certain flaws). If something goes wrong (you can't
get a mortgage, the seller won't fix the heating system, you find the house is sitting on a fault line, the seller won't
remove the black velvet wallpaper, ...), then the transaction is cancelled (aborted) and both parties go back to life as
before: you look for another house and the seller remains in the house, possibly still trying to sell it. If, however, the
transaction is not aborted and both parties sign the contract on the closing day, it is made permanent. The deed is signed
over and you own the house. If the seller changes her mind at this point, she'll have to try to buy back the house. If you
change your mind, you'll have to sell the house.
The concept of a transaction in the realm of computing is quite similar. One process announces that it's
beginning a transaction with one or more processes. Certain actions take place. When all processes commit, the results
are permanent. Until they do so, any process may abort (if something fails, for example). In that case, the state of
computing reverts to the state before the transaction began: all side effects are gone. A transaction has an all or nothing
property.
The origins of transactions in computing date back to the days of batch jobs scheduled to process tapes. A
day's worth of "transactions" would be logged on a tape. At the end of the day, a merge job would be run with the original
database tape and the transactions tape as inputs, producing a new tape with all the transactions applied. If anything
went wrong, the original database tape was unharmed. If the merge succeeded, then the original tapes could be reused.
Transaction model
A process that wishes to use transactions must be aware of certain primitives associated with them. These primitives
are:
1. begin transaction - mark the start
2. end transaction - mark the end; try to commit
3. abort transaction - kill the transaction, restore old values
4. read data from an object (file), write data to an object (file)
In addition, ordinary statements, procedure calls, etc. are allowed in a transaction.


To get a flavor for transactions, consider booking a flight from Newark, New Jersey to Ridgecrest, California. The
destination requires us to land at Inyokern airport, and non-stop flights are not available:
transaction begin
1. reserve a seat for Newark to Denver (EWK-DEN)
2. reserve a seat for Denver to Los Angeles (DEN-LAX)
3. reserve a seat for Los Angeles to Inyokern (LAX-IYK)
transaction end
Suppose there are no seats available on the LAX-IYK leg of the journey. In this case, the transaction is aborted,
reservations for (1) and (2) are undone, and the system reverts to the state before the reservation was made.
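The same flow can be sketched with the transaction primitives above. The Python below uses a hypothetical transaction handle tx with begin/commit/abort and a reserve_seat call; all of these names are assumptions for illustration, not a real API.

    def book_trip(tx, legs):
        # Reserve every leg or none at all.
        tx.begin()
        try:
            for origin, destination in legs:
                tx.reserve_seat(origin, destination)   # raises if no seat is free
            tx.commit()    # all legs reserved: make the booking permanent
            return True
        except Exception:
            tx.abort()     # any failure: undo every reservation made so far
            return False

    # book_trip(tx, [("EWK", "DEN"), ("DEN", "LAX"), ("LAX", "IYK")])

If reserving the last leg fails, the abort restores the first two reservations, exactly as described above.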
Properties of transactions
The properties of transactions are summarized with the acronym ACID, which stands for Atomic, Consistent, Isolated,
and Durable.
Atomic
either an entire transaction happens completely or not at all. If the transaction does happen, it happens as a
single indivisible action. Other processes cannot see intermediate results. For example, suppose we have a file
that is 100 bytes long and a transaction begins appending to it. If other processes read the file, they only see
the 100 bytes. At the end of the transaction, the file instantly grows to its new size.
Consistent
If the system has certain invariants, they must hold after the transaction (although they may be broken within
the transaction). For example, in some banking application, the invariant may be that the amount of money
before a transaction must equal the amount of money after the transaction. Within the transaction, this
invariant may be violated but this is not visible outside the transaction.
Isolated (or serializable)
If two or more transactions are running at the same time, to each of them and to others, the final result looks as
though all transactions ran sequentially in some order.
An order of running transactions is called a schedule. Orders may be interleaved. If no interleaving is done and
the transactions are run in some sequential order, they are serialized.
Consider the following three (small) transactions:
Transaction 1: begin; x = 0; x = x + 1; end
Transaction 2: begin; x = 0; x = x + 2; end
Transaction 3: begin; x = 0; x = x + 3; end

Some possible schedules are (with time flowing from left to right):
schedule     execution order                          final x   legal?
schedule 1   x=0  x=x+1  x=0  x=x+2  x=0  x=x+3       3         yes
schedule 2   x=0  x=0  x=x+1  x=x+2  x=0  x=x+3       3         yes
schedule 3   x=0  x=0  x=x+1  x=0  x=x+2  x=x+3       5         NO
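A quick way to sanity-check the table is to replay each interleaving and compare the final value of x against every serial execution of the three transactions. This is only a sketch (comparing final values is a simplification of true serializability, which compares operation orderings), but it is enough to rule out the last schedule.

    from itertools import permutations

    T1 = ["x=0", "x=x+1"]
    T2 = ["x=0", "x=x+2"]
    T3 = ["x=0", "x=x+3"]

    def run(ops):
        x = 0
        for op in ops:
            x = 0 if op == "x=0" else x + int(op[-1])
        return x

    # Final values produced by every serial order of the three transactions.
    serial_results = {run(a + b + c) for a, b, c in permutations([T1, T2, T3])}

    schedule3 = ["x=0", "x=0", "x=x+1", "x=0", "x=x+2", "x=x+3"]
    print(run(schedule3), run(schedule3) in serial_results)   # prints: 5 False

Since 5 cannot be produced by any serial order (those give 1, 2, or 3), the third schedule is not serializable.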

Durable
Once a transaction commits, the results are made permanent. No failure after a commit can undo results or
cause them to get lost. [Conversely, the results are not permanent until a transaction commits.]
Nested transactions
Transactions may themselves contain subtransactions (nested transactions). A top-level transaction may fork off
children that run in parallel with each other. Any or all of these may execute subtransactions.
The problem with this is that the subtransactions may commit but, later in time, the parent may abort. Now we find
ourselves having to undo the committed transactions. The level of nesting (and hence the level of undoing) may be
arbitrarily deep. For this to work, conceptually, each subtransaction must be given a private copy of every object it may
manipulate. On commit, the private copy displaces its parent's universe (which may be a private copy of that parent's
parent).
Implementation
We cannot just allow a transaction to update the objects (files, DB records, et cetera) that it uses. The
transactions won't be atomic (i.e., appear indivisible) or consistent in that case. If other transactions read and act on
the data, we also violate the isolated property. Finally, we need to ensure that we can undo changes if the transaction
aborts. One way of supporting object modification is by providing a private workspace. When a process starts a
transaction, it's given a private workspace containing all the objects to which it has access. On a commit, the private
workspace becomes the real workspace. Clearly this is an expensive proposition. It requires us to copy everything that
the transaction may modify (every file, for example). However, it is not as bleak as it looks. A number of optimizations
can make this a feasible solution.
Suppose that a process (transaction) reads a file but doesn't modify it. In that case it doesn't need a copy. The
private workspace can be empty except that it contains a pointer back to the parent's workspace. How about writing a
file? On an open, don't copy the file to the private workspace but just copy the index (information of where the file's
data is stored; a UNIX inode, for example). The file is then read in the usual way. When a block is modified, a local copy
is made and the address for the copied block is inserted into the index. New blocks (appends) work this way too.
Privately allocated blocks are called shadow blocks.
If this transaction was to abort, the private blocks go back on the free list and the private space is cleaned up.
Should the transaction commit, the private indices are moved into the parent's workspace (atomically). Any parent
blocks that would be overwritten are freed.
Another, and more popular, mechanism for ensuring that transactions can be undone (and possibly redone) is
the use of a write-ahead log, also known as an intentions list. With this system, objects are modified in place (proper
locking should be observed to control when other processes can access these objects). Before any data is changed, a
record is written to the write-ahead log in stable storage. The record identifies the transaction (with an ID number),
the block or page modified, and the old and new values. This log allows us to undo the effects of a transaction should an
abort be necessary.
If the transaction succeeds (i.e., commits), a commit record is written to the log. If the transaction aborts, the
log is used to back up to the original state (this is called a rollback). The write-ahead log can also be played forward for
crash recovery (this becomes useful in the two-phase commit protocol, which is discussed next). A term associated with
the write-ahead log is stable storage. This is intended to be a data repository that can survive system crashes. After
a datum is written to stable storage, it is retrievable even if the system crashes immediately after the write. A disk is
suitable for stable storage, but it is important that any writes are immediately flushed to the disk and not linger in the
memory (unstable) buffer cache.
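A minimal Python sketch of a write-ahead log with undo follows; the in-memory list stands in for stable storage and the record format is an illustrative assumption.

    log = []             # stands in for the write-ahead log in stable storage
    data = {"x": 100}    # objects are modified in place

    def tx_write(tx_id, key, new_value):
        # Write the undo/redo record *before* changing the data.
        log.append({"tx": tx_id, "key": key, "old": data[key], "new": new_value})
        data[key] = new_value

    def tx_commit(tx_id):
        log.append({"tx": tx_id, "commit": True})

    def tx_abort(tx_id):
        # Roll back: undo this transaction's changes in reverse order.
        changes = [r for r in log if r.get("tx") == tx_id and "key" in r]
        for record in reversed(changes):
            data[record["key"]] = record["old"]
        log.append({"tx": tx_id, "abort": True})

    tx_write(1, "x", 50)
    tx_abort(1)
    assert data["x"] == 100   # the old value has been restored

A real implementation would force each log record to disk before applying the change, which is the same write-before-act discipline used by the commit protocols below.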
The two-phase commit protocol
(Gray, 1978)
In a distributed system, a transaction may involve multiple processes on multiple machines. Even in this environment,
we still need to preserve the properties of transactions and achieve an atomic commit (either all processes involved in
the transaction commit or else all of them will abort the transaction - it will be unacceptable to have some commit and
some abort). A protocol that achieves this atomic commit is the two-phase commit protocol.
In implementing this protocol, we assume that one process will function as the coordinator and the rest as cohorts (the
coordinator may be the one that initiated the transaction, but that's not necessary). We further assume that there is
stable storage and a write-ahead log at each site. Furthermore, we assume that no machine involved crashes forever.
The protocol works as follows (the coordinator is ready to commit and needs to ensure that everyone else will do so as
well):
Phase 1 (request):
Coordinator:
- write a "prepare to commit" message to the log
- send the prepare to commit message
- wait for a reply
Cohort:
- work on the transaction; when done, wait for a prepare message
- receive the message; when the transaction is ready to commit, write "agree to commit" (or abort) to the log
- send an "agree" or "abort" reply

Phase 2 (commit):
Coordinator:
- write a commit message to the log
- send the commit (or abort) message
- wait for all cohorts to respond
- clean up all state. Done.
Cohort:
- wait for a commit message
- receive the commit (or abort) message
- if a commit was received, write "commit" to the log, release all locks & resources, and update the databases
- if an abort was received, undo all changes
- send a done message


What the two phase commit protocol does is this:
In phase 1, the coordinator sends a request to commit to all the cohorts and waits for a reply from all of them.
The reply is either an agreement or an abort. Note that nobody has committed at this point. After the coordinator
receives a reply from all cohorts, it knows that all transaction-relevant computation is finished so nothing more will
happen to abort the transaction. The transaction can now be committed or, in the case that at least one of the parties
could not complete its transaction, aborted. The second phase is to wait for all cohorts to commit (or abort). If aborting,
an abort message is sent to everyone. The coordinator waits until every cohort responds with an acknowledgement. If
committing, a cohort receives a commit message, commits locally, and sends an acknowledgment back. All message
deliveries are reliable (retransmits after time-out).
No formal proof will be given here of the correctness of the two-phase commit protocol. Inspecting for correctness, it is readily
apparent that if one cohort completes the transaction, all cohorts will eventually complete it. If a cohort is completing a
transaction, it is because it received a commit message, which means that we are in the commit phase and all cohorts
have agreed. This information is in permanent storage in case of a crash (that is why information is written to the log
before a message is sent). If any system crashes, it can replay its log to find its latest state (so it will know if it was ready
to commit, for example). When the coordinator is completing, it is ensured that every cohort completes before the
coordinator's record of the transaction is erased.
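A highly simplified sketch of the coordinator's side of two-phase commit in Python; the cohort interface, the log list, and the absence of timeouts, retransmission, and crash recovery are all simplifying assumptions.

    def two_phase_commit(coordinator_log, cohorts):
        # Phase 1: ask every cohort to prepare and collect their votes.
        coordinator_log.append("prepare")
        votes = [cohort.prepare() for cohort in cohorts]   # "agree" or "abort"

        # Phase 2: commit only if every cohort agreed; otherwise abort everywhere.
        decision = "commit" if all(v == "agree" for v in votes) else "abort"
        coordinator_log.append(decision)
        for cohort in cohorts:
            if decision == "commit":
                cohort.commit()
            else:
                cohort.abort()
        return decision

Note that this sketch blocks forever if any cohort never replies, which is precisely the weakness that the three-phase commit and Paxos Commit protocols below try to address.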
Three-Phase Commit
A problem with the two-phase commit protocol is that there is no time limit for the protocol to complete. A subtransaction may be delayed indefinitely or the process (or machine) may die and it might be a long time before it
restarts. If the coordinator dies, there is no easy way for a standby coordinator to find out the state of the protocol and
continue the commit. From a practical point of view, this is not good.


The three-phase commit protocol is a variation of the two-phase commit protocol that places an upper bound on the
time that a transaction may take to commit or abort. It also introduces an extra phase where cohorts are told what the
consensus was so that any of them that received this information before a coordinator died could inform a standy
coordinator whether there was a unanimous decision to commit or abort.
The setup is the same as with the two-phase commit protocol. A coordinator process is in charge of soliciting votes from
multiple cohorts that are responsible for the various sub-transactions of the top-level transaction. Here are the steps:
Phase 1 (request):
Coordinator:
- write a "prepare to commit" message to the log
- send the prepare to commit message
- wait for a reply from all cohorts
- if all replies have been received and all replies are "agree" messages, then write a prepare-to-commit message to the log
- else, if not all replies are received before a timeout, or at least a single abort message is received, then write an abort message to the log
Cohort:
- work on the transaction; when done, wait for a prepare message
- receive the message; when the transaction is ready to commit, write "agree to commit" (or abort) to the log
- if there is a timeout waiting for a prepare message, then abort
- wait for a prepare-to-commit or abort message

Phase 2 (commit authorized):
Coordinator:
- send a prepare-to-commit or abort message to all cohorts
- if a prepare-to-commit was sent, then wait for all cohorts to respond; otherwise, we're done
Cohort:
- if the cohort receives a prepare-to-commit message, it sends back an acknowledgement and waits; the commit does not yet take place
- if the cohort receives an abort message or times out waiting for a message from the coordinator, then it aborts the transaction: it releases all locks & resources and reverts the state of the data it modified

Phase 3 (commit finalized):
Coordinator:
- write a commit message to the log
- send the commit message
- receive commit completed messages from all cohorts; give up waiting after a certain time
- clean up all state. Done.
Cohort:
- wait for a commit message
- receive the commit: release all locks & resources, make the database changes permanent
- if there is a timeout waiting for the commit message, then commit anyway
- send a commit completed message

If the coordinator crashes during this protocol, another one can step in and query the cohorts for the commit decision.
If every cohort received the prepare-to-commit message then the new coordinator can commit. If only some cohorts received
the message, the new coordinator now knows that the unanimous decision was to commit and can re-issue the request. If
no cohort received the message, the coordinator can restart the protocol or, if necessary, restart the transaction.
Paxos Commit
What's wrong with the two-phase commit protocol?
The problem with the two-phase commit protocol is that it requires all systems to be available in order to
complete. A single fault can make the two-phase commit protocol block. Two-phase commit is not fault tolerant because
it uses a single coordinator whose failure can cause the protocol to block.
What about three-phase commit?
Three-phase commit tries to solve this with timeouts but no implementations have been put forth with a truly
complete algorithm with a correctness proof. If the three-phase commit protocol implements voting for a coordinator,
a key problem with the algorithm is that it is undefined what happens when a resource manager (the cohort, responsible
for a sub-transaction) receives messages from two different processes, both claiming to be the current transaction
manager (coordinator).
Can we get ACID guarantees that we want and still survive F faults?


Fault-tolerant consensus algorithms such as Paxos are designed to reach agreement and do not block whenever any
majority of the processes are working. Let's use Paxos to create a fault-tolerant commit protocol that uses multiple
coordinators. A majority of functioning coordinators will allow the commit to occur.
The participants in the algorithm are:

- N resource managers (RMs). Each resource manager is associated with a single sub-transaction. For the
transaction to be committed, each participating resource manager must be willing to commit it.
- 2F+1 acceptors, where F is the number of failures that we can tolerate. If F+1 acceptors see that all resource
managers are prepared, then the transaction can be committed. All instances of Paxos can share the same set
of acceptors.
- a leader. The leader coordinates the commit algorithm. All instances of Paxos share the same leader. Unlike
the two-phase commit coordinator, it is not a single point of failure.
One instance of the Paxos consensus algorithm is executed for each resource manager. Each instance provides
a fault-tolerant way to agree on the commit or abort decision proposed by that resource manager.

Here's how we run the algorithm:


1. A client requests a commit by sending a commit request to a transaction manager. The Paxos Commit algorithm
uses a separate instance of the Paxos consensus algorithm to obtain agreement on the decision each RM makes
of whether to prepare (commit) or abort. We can represent this decision by unique values that represent
Prepared and Aborted, respectively. The transaction will be committed if and only if each resource manager's
instance chooses Prepared. Otherwise, the transaction is aborted.
2. The transaction manager sends a PREPARE message to each resource manager.
3. Each resource manager then sends a proposal to its own consensus algorithm (running on multiple servers).
Each resource manager is the first proposer in its own instance of Paxos.
4. Each instance of the consensus algorithm sends the results back to the transaction manager.
5. The transaction manager is stateless and just gets the consensus outcomes. It issues a COMMIT or ABORT
message to each resource manager based on whether it received any ABORT messages.

As long as the majority of acceptors are working, the transaction manager can always learn what was chosen. If it fails
to hear from all the resource managers then it can make the decision to abort. Paxos maintains consistency, never
allowing two different values to be chosen, even if multiple processes think they are the leader.
Paxos provides a fault-tolerant commit algorithm based on replication. With two-phase commit, you rely on the
coordinator to not fail or to recover after a failure. With Paxos Commit, the two-phase commit's transaction manager's
stable storage is replaced by the acceptor's stable storage. The transaction manager itself is replaced with a set of
possible leaders. With two-phase commit, the transaction manager is solely responsible for deciding whether to abort.
With Paxos Commit, a leader will make an abort decision only for a resource manager that cannot decide for itself (e.g.,
it is not functioning). This will ensure that the protocol will not block due to a failed resource manager.
Brewer's CAP Theorem
Eric Brewer proposed a conjecture that states that if you want consistency, availability, and partition tolerance, you
have to settle for two out of three for any shared data system. This assertion has since been proven, and Brewer's proposal
is known as Brewer's CAP theorem, where CAP stands for Consistency, Availability, and Partition tolerance. Partition tolerance
means that all the systems will continue to work unless there is a total network failure. The inaccessibility of a few nodes
will not impact the system. Let's examine each of the aspects of CAP.
Consistency
Consistency in this discussion means that everyone sees the same view of the data if it is replicated in a
distributed system. This can be enforced by forcing the algorithms to wait until all participating nodes
acknowledge their actions (e.g., two phase commit). Guaranteeing this impacts availability. Alternatively, if we
want to offer availability, we need to ensure that all live nodes can get updated and we have to give up on
partition tolerance.
Availability
Availability refers to the system being highly available. Since commodity-built individual systems are not highly
available, we achieve availability through redundancy, which means replication. If one system is down, a
request can be fulfilled by another. In an environment with multiple systems connected on a network we have
to be concerned about network partitioning. If we have partition tolerance, then we lose consistency: some
systems are disconnected from the network segment where updates are being issued. Conversely, to keep
consistency, we have to ensure that the network remains fully connected so that all live nodes can get updates.
This means giving up on partition tolerance.
Partition Tolerance
Partition tolerance means that the system performs correctly even if the network gets segmented. This can be
enforced by using a non-distributed system (in which case partitioning is meaningless) or by forcing the
algorithms to wait until network partitioning no longer exists (e.g., two phase commit). Guaranteeing this
impacts availability. Alternatively, the system can continue running, but partitioned nodes will not participate
in the computation (e.g., commits, updates) and will hence have different values of data, impacting consistency.
Giving up on consistency allows us to use optimistic concurrency control techniques as well as leases instead of locks.
Examples of this are web caches and the Domain Name System (DNS).
BASE: Giving up on ACID
Availability and partition tolerance are not part of the ACID guarantees of a transaction, so we may be willing to give
those up to preserve database integrity. However, that may not be the best choice in all environments since it limits a
system's ability to scale and be highly available. In fact, in a lot of environments, availability and partition tolerance are
more important than consistency (so what if you get stale data?).
In order to guarantee ACID behavior in transactions, objects (e.g., parts of the database) have to be locked so that
everyone will see consistent data, which involves other entities having to wait until that data is consistent and unlocked.
Locking works well on a small scale but is difficult to do efficiently on a huge scale. Instead, it is attractive to consider
using cached data. The risk is that we violate the "C" and "I" in ACID (Consistent & Isolated): two separate transactions
might see different views of the same data. An example might be that you just purchased the last copy of a book on
Amazon.com but I still see one copy remaining.
An alternative to the strict requirements of ACID is BASE, which stands for Basic Availability, Soft-state, Eventual
consistency. Instead of requiring consistency after every transaction, it is enough for a database to eventually be in a
consistent state. In these environments, accessing stale data is acceptable. This leniency makes it easy to cache copies
of data throughout multiple nodes, never have to lock access to all those copies for any extensive time (e.g., a transaction
operating on data will not lock all copies of that data), and update that data asynchronously (eventually). With a BASE
model, extremely high scalability is obtainable through caching (replication), no central point of congestion, and no
need for excessive messages to coordinate activity and access to data.

Concurrency Control
When discussing transactions, we alluded to schedules, or valid orders of execution. We can play it safe and use
mutual exclusion on a transaction level to ensure that only one transaction is executing at any time. However, this is
usually overkill and does not allow us to take advantage of the concurrency that we may get in distributed systems. What
we would really like is to allow multiple transactions to execute simultaneously but keep them out of each other's way
and ensure serializability. This is called concurrency control.
Locking
One mechanism that we can use to serialize transactions is grabbing an exclusive lock on a resource. A process will lock
any data that it needs to use within its transaction. Resource locking in a distributed system can be implemented using
mutual exclusion algorithms. A common approach is to use a lock manager, which is essentially the same as the
centralized implementation for mutual exclusion. One process serves as a lock manager. Processes request a lock for a
resource (e.g., a file or a shared data object) and then they either are granted a lock or wait for it to be granted when
another process releases it.
Two-phase locking
Getting and releasing locks precisely can be tricky. Done improperly, it can lead to inconsistency and/or deadlocks. We
need to ensure that a transaction can always commit without violating the serializability invariant. For transactions to
be serial, all access to data must be serialized with respect to accesses by other transactions. To ensure that conflicting
operations of multiple transactions are executed in the same order, a restriction is imposed: a transaction is not allowed
to obtain new locks once it has released a lock. This restriction is called two-phase locking. The first phase is known as
the growing phase, in which a transaction acquires all the locks it needs. The second phase is known as the shrinking
phase, where the process releases the locks. If a process fails to acquire all the locks during the first phase, then it is
obligated to release all of them, wait, and start over. It has been proved (Eswaran et al., 1976) that if all transactions
use two-phase locking, then all schedules formed by interleaving them are serializable.
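A toy Python sketch of the two-phase locking discipline follows; the lock-manager interface is an assumption, and the only point being illustrated is that no lock may be acquired once any lock has been released.

    class TwoPhaseLockingTransaction:
        def __init__(self, lock_manager):
            self.lock_manager = lock_manager
            self.held = set()
            self.shrinking = False    # becomes True after the first release

        def lock(self, resource):
            # Growing phase only: acquiring after any release violates 2PL.
            if self.shrinking:
                raise RuntimeError("2PL violation: cannot acquire after releasing")
            self.lock_manager.acquire(resource)
            self.held.add(resource)

        def release_all(self):
            # Shrinking phase: release everything; no further acquisitions allowed.
            self.shrinking = True
            for resource in self.held:
                self.lock_manager.release(resource)
            self.held.clear()

Strict two-phase locking, described next, is the special case in which release_all is only ever called as part of commit or abort.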
Strict two-phase locking
A risk with two-phase locking is that another transaction may access data that was modified by a transaction that has
not yet committed: the shrinking phase, where locks on resources are released, happens before the transaction commits.
If this happens, any transactions that accessed that data have to be aborted since the data they accessed will be rolled
back to its previous state. This is a phenomenon known as cascading aborts.
To ensure that a transaction not access data until another transaction that was manipulating the data has either
committed or aborted, locks may be held until the transaction commits or aborts. This is known as strict two-phase
locking. This is similar to two-phase locking except that in this case, the shrinking phase takes place during the commit
or abort process. One effect of strict two-phase locking is that by placing the second phase at the end of a transaction,
all lock acquisitions and releases can be handled by the system without the transaction's knowledge.
Locking granularity
A typical system will have many objects and typically a transaction will access only a small amount of data at any
given time and it will frequently be the case that a transaction will not clash with other transactions. The granularity of
locking affects the amount of concurrency we can achieve. If we can have a smaller granularity by locking smaller objects
or pieces of objects then we can generally achieve higher concurrency.
For example, suppose that all of a bank's customers are locked for any transaction that needs to modify a single
customer datum. Concurrency is severely limited because any other transactions that need to access any customer data
will be blocked. If, however, we use a customer record as the granularity of locking, transactions that access different
customer records will be capable of running concurrently.
Read and write locks
If a process imposes a read lock on a resource, other processes will still be able to request read locks on that same
resource. However, a request for a write lock would fail or be blocked until all read locks are released5. If a process
imposes a write lock on a resource, then neither read nor write locks will be granted until that process releases the
resource.
Locking can be optimized to yield better resource usage by distinguishing read locks from write locks. To
support this, we introduce the use of two locks per object: read locks and write locks. Read locks are also known as
shared locks (since they can be shared by multiple transactions). If a transaction needs to read an object, it will request
a read lock from the lock manager. If a transaction needs to modify an object, it will request a write lock from the lock
manager. If the lock manager cannot grant a lock, then the transaction will wait until it can get the lock (after the
transaction with the lock committed or aborted)6. To summarize lock granting:

If a transaction has:    another transaction may obtain:
no locks                 a read lock or a write lock
read lock                a read lock (it must wait for a write lock)
write lock               it must wait for a read or write lock
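The compatibility rules in this table can be captured in a few lines; the function below only illustrates the decision a lock manager has to make and is not drawn from a real implementation.

    def can_grant(requested, held_by_others):
        """held_by_others: set of lock modes ('read'/'write') held by other transactions."""
        if not held_by_others:
            return True                           # no locks held: grant a read or a write lock
        if requested == "read":
            return "write" not in held_by_others  # reads are compatible only with other reads
        return False                              # a write conflicts with any existing lock

    assert can_grant("read", set()) and can_grant("write", set())
    assert can_grant("read", {"read"})            # shared read locks
    assert not can_grant("write", {"read"})       # must wait for the readers
    assert not can_grant("read", {"write"})       # must wait for the writer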

Two-version locking
Two-version locking is a somewhat optimistic concurrency control scheme that allows one transaction to write
tentative versions of objects while other transactions read from committed versions of the same objects. Read
operations only wait if another transaction is currently committing the same object. This scheme allows more
concurrency than read-write locks, but writing transactions risk a wait (or rejection) when they attempt to commit.
Transactions cannot commit their write operations immediately if other uncommitted transactions have read the same
objects. Transactions that request to commit in this situation have to wait until the reading transactions have completed.
Two-version locking requires three types of locks: read, write, and commit locks. Before an object is read, a
transaction must obtain a read lock. Before an object is written, the transaction must obtain a write lock (the same as
with two-phase locking). Neither of these locks will be granted if there is a commit lock on the object. When the
transaction is ready to commit:
All of the transaction's write locks are changed to commit locks.
If any objects used by the transaction have outstanding read locks, the transaction must wait until the transactions
that set these locks have completed and those locks are released.
If we compare the performance of two-version locking with that of strict two-phase locking using
read/write locks, we find:
Read operations in two-version locking are delayed only while transactions are being committed rather than
during the entire execution of transactions. The commit protocol usually takes far less time than the time to
perform the transaction.
Operations of one transaction in two-version locking can cause a delay in the committing of other transactions.
Optimistic concurrency control
Locking is not without problems. Locks have an overhead associated with maintaining and checking them, and a lock
manager must be run. Even transactions performing read-only operations on objects must request locks. The use of
locks can result in deadlock. We will need to have software in place to detect or avoid deadlock. Locks can also decrease
the potential concurrency in a system by having a transaction hold locks for the duration of the transaction until a
commit or abort takes place. Concurrency is reduced because the transactions hold on to locks longer than needed (as
in strict two-phase locking). An alternative proposed by Kung and Robinson in 1981 is optimistic concurrency control,
which tells the transaction to just go ahead and do what it has to do without worrying about what someone else is doing.
It is based on the observation that, in most applications, the chance of two transactions accessing the same object is low.
So why introduce overhead? When a conflict does arise, then the system will have to deal with it.
We will allow transactions to proceed as if there were no possibility of conflict with other transactions: a
transaction does not have to obtain or check for locks. This is the working phase. Each transaction creates a tentative
version of the objects it updates: a copy of the most recently committed version. Write operations record new values as
tentative values.
When a transaction is ready to commit, a validation is performed on all the data items to see whether the data
conflicts with operations of other transactions. This is the validation phase. If the validation fails, then the transaction
will have to be aborted and restarted later (again, optimistically we hope these conflicts will be few and far between). If
it succeeds, then the tentative versions are made permanent. This is the update phase. Optimistic control is clearly
deadlock free (no locking or waiting on resources) and allows for maximum parallelism (since no process has to wait
for a lock, all can execute in parallel).
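A minimal sketch of the validation and update phases, assuming backward validation against transactions that committed after the validating transaction started; the data structures and names are hypothetical.

    class OptimisticTxn:
        def __init__(self, start_tn):
            self.start_tn = start_tn    # highest committed transaction number when it started
            self.read_set = set()       # object ids read during the working phase
            self.write_set = {}         # object id -> tentative value

    committed = []     # (transaction number, set of object ids written), in commit order

    def validate_and_commit(txn, store, new_tn):
        """Abort if txn read anything written by a transaction that committed after txn started."""
        for tn, writes in committed:
            if tn > txn.start_tn and txn.read_set & writes:
                return False                        # conflict: abort, restart later
        for obj, value in txn.write_set.items():    # update phase: make tentative values permanent
            store[obj] = value
        committed.append((new_tn, set(txn.write_set)))
        return True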

5 This is similar to DFS and its use of tokens in granting file access permissions.
6 This is similar to DFS and its use of tokens in granting file access permissions.

Timestamp ordering
Another approach to concurrency control is the use of timestamp ordering, developed by Reed in 1983. In this
algorithm, we assign a timestamp to a transaction when it begins. The timestamp has to be unique with respect to the
timestamps of other transactions (this is easily accomplished; for example, Lamport's algorithm can be used). Every
object in the system has a read and a write timestamp associated with it, two timestamps per object, identifying which
committed transaction last read it and which committed transaction last wrote it. Note that the timestamps are obtained
from the transaction timestamp: the start of that transaction.
Normally, when a process tries to access a file, the file's read and write timestamps will be older than those of the
current transaction. This implies good ordering. If this is not the case, and the ordering is incorrect, this means that a
transaction that started later than the current one accessed the file and committed. In this case the current transaction
is too late and has to abort: the rule here is that the lower numbered transaction always goes first. The rule of timestamp
ordering is:
If a transaction wants to write an object, it compares its own timestamp with the object's read and write
timestamps. If the object's timestamps are older, then we have good ordering.
If a transaction wants to read an object, it compares its own timestamp with the object's write timestamp. If the
object's write timestamp is older than the current transaction, then the ordering is good.
If a transaction attempts to access an object and does not detect proper ordering, then we have bad ordering. The
transaction is aborted and restarted (improper ordering means that a newer transaction came in and modified
data before the older one could access the data or read data that the older one wants to modify).
For example, suppose there are three transactions: a, b, and c. Transaction a ran a long time ago and used every
object needed by b and c. Transactions b and c start concurrently, with b receiving a smaller (older) timestamp than
c (Tb < Tc).
Case 1 (proper ordering): Transaction b writes a file (assume the read timestamp of the file is TR and the write
timestamp is TW). Unless c has already committed, TR and TW are Ta, which is less than Tb. The write is accepted,
performed tentatively, and made permanent on commit. Tb is recorded as the tentative TW.
Case 2 (improper ordering): If c has either read or written the object and committed, then the object's timestamp(s) has
been modified to Tc. In this case, when b tries to access the object and compares its timestamp with that of the
object, it sees that ordering is incorrect because b is an older transaction trying to modify data committed by a
younger transaction (c). Transaction b must be aborted. It can restart, get a new timestamp, and try again.
Case 3 (delaying execution): Suppose transaction b has written the object but not yet committed. The write timestamp
of the object is a tentative Tb. If c now wants access to the object, we have a situation in which ordering is correct
but the timestamp is in a tentative state. Transaction c must now wait for b to finish before it can access the object.
Concurrency control with timestamps is different than locking in one important way. When a transaction encounters
a later timestamp, it aborts. With locking, it would either wait or proceed immediately.
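The two rules reduce to a pair of timestamp comparisons. The sketch below simplifies matters by committing writes immediately instead of keeping them tentative; the names are illustrative only.

    class Abort(Exception):
        """Ordering check failed; the transaction restarts with a new timestamp."""

    class TSObject:
        def __init__(self, value=None):
            self.value = value
            self.read_ts = 0     # timestamp of the last committed transaction that read the object
            self.write_ts = 0    # timestamp of the last committed transaction that wrote the object

    def ts_read(txn_ts, obj):
        if txn_ts < obj.write_ts:               # a younger transaction already committed a write
            raise Abort("bad ordering on read")
        obj.read_ts = max(obj.read_ts, txn_ts)
        return obj.value

    def ts_write(txn_ts, obj, value):
        if txn_ts < obj.read_ts or txn_ts < obj.write_ts:
            raise Abort("bad ordering on write")  # restart with a new timestamp
        obj.value, obj.write_ts = value, txn_ts   # simplified: the write is made permanent here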

Vocabulary
abort
transaction will not complete (commit). All changes are undone to the state before the transaction started.
commit
action which indicates that the transaction has successfully completed. All changes to the database, files, and
objects are made permanent.
commit protocol
a fault-tolerant algorithm which ensures that all sides in a distributed system either commit or abort a
transaction unanimously.
log
a record of system activity recorded in sufficient detail so that a previous state of a process can be restored.
redo
given a log record, redo the action specified in the log.
stable storage
permanent storage to which we can do atomic writes.
transaction
an atomic action which is some computation that read and/or changes the state of one or more data objects
and appears to take place indivisibly.
write-ahead log protocol
a method in which operations done on objects may be undone after restarting a system.

9. Distributed Deadlock
A deadlock is a condition in a system where a set of processes (or threads) have requests for resources that can
never be satisfied. Essentially, a process cannot proceed because it needs to obtain a resource held by another process
but it itself is holding a resource that the other process needs.

Figure 1. Deadlock

More formally, Coffman defined four conditions that have to be met for a deadlock to occur in a system:
1. Mutual exclusion A resource can be held by at most one process.
2. Hold and wait Processes that already hold resources can wait for another resource.
3. Non-preemption A resource, once granted, cannot be taken away.
4. Circular wait Two or more processes are waiting for resources held by one of the other processes.
A directed graph model is used to record the resource allocation state of a system. This state consists of n processes, P1 ...
Pn, and m resources, R1 ... Rm. In such a graph:
R1 → P1 means that resource R1 is allocated to process P1.
P1 → R1 means that resource R1 is requested by process P1.
Deadlock is present when the graph has a directed cycle. An example is shown in Figure 1.
Such a graph is called a Wait-For Graph (WFG).
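Detecting deadlock then amounts to finding a directed cycle in the WFG; a small depth-first search over a hypothetical adjacency-set representation might look like this.

    def has_cycle(wfg):
        """wfg: dict mapping each waiting process to the set of processes it waits for."""
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {}

        def visit(p):
            color[p] = GRAY
            for q in wfg.get(p, ()):
                if color.get(q, WHITE) == GRAY:                 # back edge: a cycle exists
                    return True
                if color.get(q, WHITE) == WHITE and visit(q):
                    return True
            color[p] = BLACK
            return False

        return any(color.get(p, WHITE) == WHITE and visit(p) for p in wfg)

    print(has_cycle({"P1": {"P2"}, "P2": {"P1"}}))   # True: P1 and P2 wait for each other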
Deadlock in distributed systems

Figure 2. Resource graph on A

Figure 3. Resource graph on B


The same conditions for deadlock in uniprocessors apply to distributed systems. Unfortunately, as in many other aspects
of distributed systems, they are harder to detect, avoid, and prevent. Four strategies can be used to handle deadlock:
1. ignorance: ignore the problem; assume that a deadlock will never occur. This is a surprisingly common approach.
2. detection: let a deadlock occur, detect it, and then deal with it by aborting and later restarting a process that causes
deadlock.
3. prevention: make a deadlock impossible by granting requests so that one of the necessary conditions for deadlock
does not hold.


4. avoidance: choose resource allocation carefully so that deadlock will not occur. Resource requests can be honored
as long as the system remains in a safe (non-deadlock) state after resources are allocated.
The last of these, deadlock avoidance through careful resource allocation, is difficult and requires the ability to predict
precisely the resources that will be needed and the times that they will be needed. This is difficult and not practical in
real systems. The first of these is trivially simple but, of course, ineffective for actually doing anything about deadlock
conditions. We will focus on the middle two approaches.
In a conventional system, the operating system is the component that is responsible for resource allocation and
is the ideal entity to detect deadlock. Deadlock can be resolved by killing a process. This, of course, is not a good thing
for the process. However, if processes are transactional in nature, then aborting the transaction is an anticipated
operation. Transactions are designed to withstand being aborted and, as such, it is perfectly reasonable to abort one or
more transactions to break a deadlock. The transaction can be restarted later at a time when, we hope, it will not create
another deadlock.

Centralized deadlock detection


Centralized deadlock detection attempts to imitate the nondistributed algorithm through a central coordinator.
Each machine is responsible for maintaining a resource graph for its processes and resources. A central coordinator
maintains the resource utilization graph for the entire system: the Global Wait-For Graph. This graph is the union of the
individual Wait-For Graphs. If the coordinator detects a cycle in the global wait-for graph, it aborts one process to break
the deadlock.
In the non-distributed case, all the information on resource usage lives on one system and the graph may be constructed
on that system. In the distributed case, the individual subgraphs have to be propagated to a central coordinator. A
message can be sent each time an arc is added or deleted. If optimization is needed, a list of added or deleted arcs can
be sent periodically to reduce the overall number of messages sent.

Figure 4. Resource graph on coordinator

Figure 5. False deadlock


Here is an example (from Tanenbaum). Suppose machine A has a process P0, which holds the resource S and wants
resource R, which is held by P1. The local graph on A is shown in Figure 2. Another machine, machine B, has a process
P2, which is holding resource T and wants resource S. Its local graph is shown in Figure 3. Both of these machines send
their graphs to the central coordinator, which maintains the union (Figure 4).
All is well. There are no cycles and hence no deadlock. Now two events occur. Process P1 releases resource R and asks
machine B for resource T. Two messages are sent to the coordinator:
message 1 (from machine A): releasing R
message 2 (from machine B): waiting for T
This should cause no problems (no deadlock). However, if message 2 arrives first, the coordinator would then construct
the graph in Figure 5 and detect a deadlock. Such a condition is known as false deadlock. A way to fix this is to use
Lamport's algorithm to impose a global time ordering on all machines. Alternatively, if the coordinator suspects deadlock,
it can send a reliable message to every machine asking whether it has any release messages. Each machine will then
respond with either a release message or a negative acknowledgement to acknowledge receipt of the message.
Distributed deadlock detection


An algorithm for detecting deadlocks in a distributed system was proposed by Chandy, Misra, and Haas in 1983.
Processes request resources from the current holder of that resource. Some processes may wait for resources, which
may be held either locally or remotely. Cross-machine arcs make looking for cycles, and hence detecting deadlock,
difficult. This algorithm avoids the problem of constructing a Global WFG.
The Chandy-Misra-Haas algorithm works this way: when a process has to wait for a resource, a probe message
is sent to the process holding that resource. The probe message contains three components: the process ID that blocked,
the process ID that is sending the request, and the destination. Initially, the first two components will be the same. When
a process receives the probe: if the process itself is waiting on a resource, it updates the sending and destination fields
of the message and forwards it to the resource holder. If it is waiting on multiple resources, a message is sent to each
process holding the resources. This process continues as long as processes are waiting for resources. If the originator
gets a message and sees its own process number in the blocked field of the message, it knows that the probe has traversed a cycle
and deadlock exists. In this case, some process (transaction) will have to die. The sender may choose to commit suicide
and abort itself or an election algorithm may be used to determine an alternate victim (e.g., youngest process, oldest
process, ...).
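A sketch of probe forwarding, assuming a waits_for map that records which processes each blocked process is waiting on, and a seen set that keeps probes from circulating forever; the single global map and these names are simplifications for illustration, not part of the published algorithm.

    from collections import namedtuple

    Probe = namedtuple("Probe", ["initiator", "sender", "destination"])

    waits_for = {}    # process -> set of processes holding resources it is blocked on

    def send_probe(probe, on_deadlock, seen=None):
        seen = set() if seen is None else seen
        target = probe.destination
        if target == probe.initiator:
            on_deadlock(probe.initiator)            # the probe came back: a cycle exists
            return
        if target in seen:                          # simplification: do not revisit a process
            return
        seen.add(target)
        for holder in waits_for.get(target, ()):    # the target is itself blocked: forward the probe
            send_probe(Probe(probe.initiator, target, holder), on_deadlock, seen)

    def detect(blocked_process):
        for holder in waits_for.get(blocked_process, ()):
            send_probe(Probe(blocked_process, blocked_process, holder),
                       on_deadlock=lambda p: print("deadlock involving", p))

    waits_for.update({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}})
    detect("P1")    # prints: deadlock involving P1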

Distributed deadlock prevention


An alternative to detecting deadlocks is to design a system so that deadlock is impossible. We examined the four
conditions for deadlock. If we can deny at least one of these conditions then we will not have deadlock.
Mutual exclusion To deny this means that we will allow a resource to be held (used) by more than one process at a
time. If a resource can be shared then there is no need for mutual exclusion and deadlock cannot occur. Too often,
however, a process requires mutual exclusion for a resource because the resource is some object that will be
modified by the process.
Hold and wait Denying this means that processes that hold resources cannot wait for another resource. This typically
implies that a process should grab all of its resources at once. This is not practical either since we cannot always
predict what resources a process will need throughout its execution.
Non-preemption A resource, once granted, cannot be taken away. In transactional systems, allowing preemption
means that a transaction can come in and modify data (the resource) that is being used by another transaction.
This differs from mutual exclusion since the access is not concurrent but the same problem arises of having
multiple transactions modify the same resource. We can support this with optimistic concurrency control
algorithms that will check for out-of-order modifications at commit time and roll back (abort) if there are potential
inconsistencies.
Circular wait Avoiding circular wait means that we ensure that a cycle of waiting on resources does not occur. We can
do this by enforcing an ordering on granting resources and aborting transactions or denying requests if an
ordering cannot be granted.
One way of avoiding circular wait is to obtain a globally-unique timestamp (e.g., Lamport total ordering) for every
transaction so that no two transactions get the same timestamp. When one process is about to block waiting for a
resource that another process is using, check which of the two processes has a younger timestamp and give priority to
the older process.
If a younger process is using the resource, then the older process (that wants the resource) waits. If an older process is
holding the resource, the younger process (that wants the resource) aborts itself. This forces the resource utilization
graph to be directed from older to younger processes, making cycles impossible. This algorithm is known as the wait-die algorithm.
An alternative, but similar, method by which resource request cycles may be avoided is to have an old process abort
(kill) the younger process that holds a resource. If a younger process wants a resource that an older one is using, then
it waits until the older process is done. In this case, the graph flows from young to old and cycles are again impossible.
This variant is called the wound-wait algorithm.
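The two policies differ only in what the requester does when it loses the age comparison; a compact way to state the decision (the function name is made up) is shown below.

    def resolve(requester_ts, holder_ts, scheme):
        """Smaller timestamp = older transaction. Returns the requester's action."""
        requester_is_older = requester_ts < holder_ts
        if scheme == "wait-die":
            return "wait" if requester_is_older else "die (abort self, restart later)"
        if scheme == "wound-wait":
            return "wound (abort the holder)" if requester_is_older else "wait"
        raise ValueError(scheme)

    print(resolve(1, 5, "wait-die"))     # older requester waits for the younger holder
    print(resolve(5, 1, "wait-die"))     # younger requester dies
    print(resolve(1, 5, "wound-wait"))   # older requester wounds the younger holder
    print(resolve(5, 1, "wound-wait"))   # younger requester waits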
References
Andrew S. Tanenbaum and Maarten Van Steen, Distributed Systems: Principles and Paradigms, Second Edition.
Prentice Hall, October 2006.
K. Mani Chandy, Jayadev Misra, and Laura M. Haas, Distributed Deadlock Detection, ACM Transactions on
Computer Systems, Vol. 1, No. 2, May 1983, Pages 144-156.
E. G. Coffman, Jr., M. J. Elphick, and A. Shoshani, System Deadlocks, Computing Surveys, Vol. 3, No. 2, June 1971.
Edgar Knapp, Deadlock Detection in Distributed Databases, ACM Computing Surveys, Vol. 19, No. 4, December
1987.


10. Distributed File Systems


Introduction
Presently, our most common exposure to distributed systems that exemplify some degree of transparency is
through distributed file systems. We'd like remote files to look and feel just like local ones.
A file system is responsible for the organization, storage, retrieval, naming, sharing, and protection of files. File
systems provide directory services, which convert a file name (possibly a hierarchical one) into an internal identifier
(e.g. inode, FAT index). They contain a representation of the file data itself and methods for accessing it (read/write).
The file system is responsible for controlling access to the data and for performing low-level operations such as
buffering frequently-used data and issuing disk I/O requests.
Our goals in designing a distributed file system are to present certain degrees of transparency to the user and the
system.
access transparency Clients are unaware that files are distributed and can access them in the same way as local files
are accessed.
location transparency A consistent name space exists encompassing local as well as remote files. The name of a file
does not give its location.
concurrency transparency All clients have the same view of the state of the file system. This means that if one process
is modifying a file, any other processes on the same system or remote systems that are accessing the files will see
the modifications in a coherent manner.
failure transparency The client and client programs should operate correctly after a server failure.
heterogeneity File service should be provided across different hardware and operating system platforms.
scalability The file system should work well in small environments (1 machine, a dozen machines) and also scale
gracefully to huge ones (hundreds through tens of thousands of systems).
replication transparency To support scalability, we may wish to replicate files across multiple servers. Clients should
be unaware of this.
migration transparency Files should be able to move around without the client's knowledge.
support fine-grained distribution of data To optimize performance, we may wish to locate individual objects near
the processes that use them.
tolerance for network partitioning The entire network or certain segments of it may be unavailable to a client during
certain periods (e.g. disconnected operation of a laptop). The file system should be tolerant of this.
2 Distributed file system concepts
A file service is a specification of what the file system offers to clients. A file server is the implementation of a file
service and runs on one or more machines.
A file itself contains a name, data, and attributes (such as owner, size, creation time, access rights). An immutable file
is one that, once created, cannot be changed. (The Bullet server on the Amoeba operating system is an example of a
system that uses immutable files.) Immutable files are easy to cache and to replicate across servers since their contents
are guaranteed to remain unchanged.
Two forms of protection are generally used in distributed file systems, and they are essentially the same techniques
that are used in single-processor non-networked systems:
capabilities Each user is granted a ticket (capability) from some trusted source for each object to which it has access.
The capability specifies what kinds of access are allowed.
access control lists Each file has a list of users associated with it and access permissions per user. Multiple users may
be organized into an entity known as a group.
2.1 File service types
To provide a remote system with file service, we will have to select one of two models of operation. One of these is the
upload/download model. In this model, there are two fundamental operations: read file transfers an entire file from
the server to the requesting client, and write file copies the file back to the server. It is a simple model and efficient in
that it provides local access to the file when it is being used. Three problems are evident. It can be wasteful if the client
needs access to only a small amount of the file data. It can be problematic if the client doesn't have enough space to
cache the entire file. Finally, what happens if others need to modify the same file? The second model is a remote access
model. The file service provides remote operations such as open, close, read bytes, write bytes, get attributes, etc. The file
system itself runs on servers. The drawback in this approach is that the servers are accessed for the duration of file access
rather than once to download the file and again to upload it.
Another important distinction in providing file service is that of understanding the difference between directory
service and file service. A directory service, in the context of file systems, maps human-friendly textual names for files
to their internal locations, which can be used by the file service. The file service itself provides the file interface (this is
mentioned above). Another component of distributed file systems is the client module. This is the client-side
interface for file and directory service. It provides a local file system interface to client software (for example, the VFS
layer of a UNIX/Linux kernel).
2.2 Naming issues
In designing a distributed file service, we should consider whether all machines (and processes) should have the exact
same view of the directory hierarchy. We might also wish to consider whether the name space on all machines should
have a global root directory (a.k.a. "super root") so that files can be accessed as, for example, //server/path. This is a
model that was adopted by the Apollo Domain System, an early distributed file system, and more recently by the web
community in the construction of a uniform resource locator (URL).
In considering our goals in name resolution, we must distinguish between location transparency and location
independence. By location transparency we mean that the path name of a file gives no hint to where the file is located.
For instance, we may refer to a file as //server1/dir/file. The server (server1) can move anywhere without the client
caring, so we have location transparency. However, if the file moves to server2, things will not work. If we have location
independence, the files can be moved without their names changing. Hence, if machine or server names are embedded
into path names we do not achieve location independence.
It is desirable to have access transparency, so that applications and users can access remote files just as they access
local files. To facilitate this, the remote file system name space should be syntactically consistent with the local name
space. One way of accomplishing this is by redefining the way files are named and require an explicit syntax for
identifying remote files. This can cause legacy applications to fail and user discontent (users will have to learn a new
way of naming their files). An alternate solution is to use a file system mounting mechanism to overlay portions of
another file system over a node in a local directory structure. Mounting is used in the local environment to construct a
uniform name space from separate file systems (which reside on different disks or partitions) as well as incorporating
special-purpose file systems into the name space (e.g. /proc on many UNIX systems allows file system access to
processes). A remote file system can be mounted at a particular point in the local directory tree. Attempts to access files
and directories under that node will be directed to the driver for that file system.
To summarize, our naming options are:
machine and path naming (machine:path, /machine/path).
mount remote file systems onto the local directory hierarchy (merging the local and remote name spaces).
provide a single name space which looks the same on all machines.
The first two of these options are relatively easy to implement.
2.3 Types of names

When we talk about file names, we refer to symbolic names (for example, server.c). These names are used by
people (users or programmers) to refer to files. Another "name" is the identifier used by the system internally to refer
to a file. We can think of this as a binary name (more precisely, as an address). On most POSIX file systems, this would
be the device number and inode number.
Directories provide a mapping from symbolic names to file addresses (binary names). Typically, one symbolic
name maps to one file address. If multiple symbolic names map onto one binary name, these are called hard links. On
inode-based file systems (e.g., most
UNIX systems), hard links must exist within the same device since the address (inode) is unique only on that device. On
Windows systems, they are not supported because file attributes are stored with the name of the file. Having two
symbolic names refer to the same data will cause problems in synchronizing file attributes (how would you locate other
files that point to this data?). A hack to allow multiple names to refer to the same file (whether it's on the same device
or a different device) is to have the symbolic name refer to a single file address but that file may have an attribute to tell
the system that its contents contain a symbolic file name that should be dereferenced. Essentially, this adds a level of
indirection: access a file which contains another file name, which references the file attributes and data. These files are
known as symbolic links. Finally, it is possible for one symbolic name to refer to multiple file addresses. This doesn't
make much sense on a local system 10, but can be useful on a networked file system to provide fault tolerance or enable
the system to use the file address which is most efficient.
2.4 Semantics of file sharing
The analysis of file sharing semantics is that of understanding how files behave. For instance, on most systems, if a read
follows a write, the read of that location will return the values just written. If two writes occur in succession, the
following read will return the results of the last write. File systems that behave this way are said to observe sequential
semantics.

10 It really does make sense in a way. In the late 1980s, David Korn created a file system that allowed multiple directories to be mounted over the same
directory node. Several operating systems later adopted this technique and called it union mounts. It was a core aspect of the Plan 9 operating system,
where it allowed one to build a fully custom name space and avoid the need for a PATH environment variable in searching for executables. The executables
are always found in /bin. The name is resolved by searching through the file systems (directories) mounted on that node in a last-mounted, first-searched
order.

Sequential semantics can be achieved in a distributed system if there is only one server and clients do not cache data.
This can cause performance problems since clients will be going to the server for every file operation (such as
single-byte reads). The performance problems can be alleviated with client caching. However, now if the client modifies its
cache and another client reads data from the server, it will get obsolete data. Sequential semantics no longer hold.
One solution is to make all the writes write-through to the server. This is inefficient and does not solve the
problem of clients having invalid copies in their cache. To solve this, the server would have to notify all clients holding
copies of the data.
Another solution is to relax the semantics. We will simply tell the users that things do not work the same way
on the distributed file system as they did on the local file system. The new rule can be "changes to an open file are
initially visible only to the process (or machine) that modified it." These are known as session semantics.
Yet another solution is to make all the files immutable. That is, a file cannot be open for modification, only
for reading or creating. If we need to modify a file, we'll create a completely new file under the old name. Immutable
files are an aid to replication but they do not help with changes to the file's contents (or, more precisely, that the old
file is obsolete because a new one with modified contents succeeded it). We still have to contend with the issue that
there may be another process reading the old file. It's possible to detect that a file has changed and start failing
requests from other processes.
A final alternative is to use atomic transactions. To access a file or a group of files, a process first executes a
begin transaction primitive to signal that all future operations will be executed indivisibly. When the work is
completed, an end transaction primitive is executed.
If two or more transactions start at the same time, the system ensures that the end result is as if they were run in some
sequential order. All changes have an all or nothing property.
2.5 File usage patterns

It's clear that we cannot have the best of all worlds and that compromises will have to be made between the desired
semantics and efficiency (efficiency encompasses increasing client performance, reducing network traffic, and
minimizing server load). To design or select a suitable distributed file system, it is important to understand the usage
patterns within a file system.
A comprehensive study was made by Satyanarayanan in 1981 which showed the following use patterns. Bear in mind
that this is in the days before people kept vast collections of audio and video files.
Most files are under 10K bytes in size. This suggests that it may be feasible to transfer entire files (a simpler
design). However, a file system should still be able to support large files.
Most files have short lifetimes (many files are temporary files created by editors and compilers). It may be a good
idea to keep these files local and see if they will be deleted soon.
Few files are shared. While sharing is a big issue theoretically, in practice it's hardly done. We may choose to
accept session semantics and not worry too much about the consequences.
Files can be grouped into different classes, with each class exhibiting different properties:
system binaries: read-only, widely distributed. These are good candidates for replication.
compiler and editor temporary files: short, unshared, disappear quickly. We'd like to keep these local if possible.
mailboxes (email): not shared, frequently updated. We don't want to replicate these.
ordinary data files: these may be shared.
System design issues
Name resolution
In looking up the pathname of a file (e.g. via the namei function in the UNIX kernel), we may choose to evaluate a
pathname a component at a time. For example, for a pathname aaa/bbb/ccc, we would perform a remote lookup of aaa,
then another one of bbb, and finally one of ccc. Alternatively, we may pass the rest of the pathname to the remote
machine as one lookup request once we find that a component is remote. The drawback of the latter scheme is (a) the
remote server may be asked to walk up the tree by processing .. (parent node) components and reveal more of its file
system than it wants and (b) other components cannot be mounted underneath the remote tree on the local system.
Because of this, component at a time evaluation is generally favored but it has performance problems (a lot more
messages). We may choose to keep a local cache of component resolutions.
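A sketch of component-at-a-time resolution with such a cache; lookup() stands in for whatever local or remote directory lookup the system actually provides.

    def resolve(path, root_handle, lookup, cache):
        """Resolve a pathname one component at a time.
        lookup(dir_handle, name) is assumed to return the handle for name,
        issuing a remote request when the directory lives on another machine."""
        handle = root_handle
        walked = ""
        for component in path.strip("/").split("/"):
            if not component:
                continue
            walked = walked + "/" + component
            if walked in cache:                 # reuse an earlier resolution
                handle = cache[walked]
                continue
            handle = lookup(handle, component)  # one (possibly remote) lookup per component
            cache[walked] = handle
        return handle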
Should servers maintain state?
A stateless system is one in which the client sends a request to a server, the server carries it out, and returns the
result. Between these requests, no client-specific information is stored on the server. A stateful system is one where
information about client connections is maintained on the server. In a stateless system:
Each request must be complete - the file has to be fully identified and any offsets specified.
Fault tolerance: if a server crashes and then recovers, no state was lost about client connections because there
was no state to maintain.
No remote open/close calls are needed (they only serve to establish state).
No wasted server space per client.


No limit on the number of open files on the server; they aren't "open" - the server maintains no per-client state.
No problems if the client crashes. The server does not have any state to clean up.
On a stateful system:
requests are shorter (less info to send).
better performance in processing the requests.
idempotency works; cache coherence is possible.
file locking is possible; the server can keep state that a certain client is locking a file (or portion thereof).
Caching
We can employ caching to improve system performance. There are four places in a distributed system where we can
hold data:
1. On the server's disk
2. In a cache in the server's memory
3. In the client's memory
4. On the client's disk
The first two places are not an issue since any interface to the server can check the centralized cache. It is in the last
two places that problems arise and we have to consider the issue of cache consistency. Several approaches may be
taken:
write-through What if another client reads its own cached copy? All accesses would require checking with the server
first (adding network congestion) or require the server to maintain state on who has what files cached. Write-through also does not alleviate congestion on writes.
delayed writes Data can be buffered locally (where consistency suffers) but files can be updated periodically. A single
bulk write is far more efficient than lots of little writes every time any file contents are modified. Unfortunately
the semantics become ambiguous.
write on close This is admitting that the file system uses session semantics.
centralized control Server keeps track of who has what open in which mode. We would have to support a stateful
system and deal with signaling traffic.
3 Distributed File Systems: case studies
It is clear that compromises have to be made in a practical design. For example, we may trade-off file consistency for
decreased network traffic. We may choose to use a connectionless protocol to enable clients to survive a server crash
gracefully but sacrifice file locking. This section examines a few distributed file systems.
3.1 Network File System (NFS)
Sun's Network File System (NFS) is one of the earliest distributed file systems, is still widely used, and is the de facto
standard network file system on various flavors of UNIX, Linux and BSD and is natively supported in Apple's OS X. We
will look at its early design to understand what the designers were trying to do and why certain decisions were made.
The design goals of NFS were:
Any machine can be a client and/or a server.
NFS must support diskless workstations (that are booted from the network). Diskless workstations were Sun's
major product line.
Heterogeneous systems should be supported: clients and servers may have different hardware and/or operating
systems. Interfaces for NFS were published to encourage the widespread adoption of NFS.
High performance: try to make remote access comparable to local access through caching and read-ahead.
From a transparency point of view NFS offers:
Access transparency Remote (NFS) files are accessed through normal system calls; On POSIX systems, the protocol is
implemented under the VFS layer.
Location transparency The client adds remote file systems to its local name space via mount. File systems must be
exported at the server. The user is unaware of which directories are local and which are remote. The location of
the mount point in the local system is up to the client's administrator.
Failure transparency NFS is stateless; UDP is used as a transport. If a server fails, the client retries.


Performance transparency Caching at the client will be used to improve performance.


No migration transparency The client mounts machines from a server. If the resource moves to another server, the
client must know about the move.
No support for all Unix file system features NFS is stateless, so stateful operations such as file locking are a problem.
All UNIX file system controls may not be available.
Devices? Since NFS had to support diskless workstations, where every file is remote, remote device files had to refer to
the client's local devices. Otherwise there would be no way to access local devices in a diskless environment.
3.2 NFS protocols
The NFS client and server communicate over remote procedure calls (Sun/ONC RPC) using two protocols: the
mounting protocol and the directory and file access protocol. The mounting protocol is used to request access to an
exported directory (and the files and directories within that file
system under that directory). The directory and file access protocol is used for accessing the files and directories (e.g.
read/write bytes, create files, etc.). The use of RPC's external data representation (XDR) allows NFS to communicate
with heterogeneous machines. The initial design of NFS ran only with remote procedure calls over UDP. This was done
for two reasons. The first reason is that UDP is somewhat faster than TCP but does not provide error correction (the
UDP header provides a checksum of the data and headers). The second reason is that UDP does not require a connection
to be present. This means that the server does not need to keep per-client connection state and there is no need to
reestablish a connection if a server was rebooted.
The lack of UDP error correction is remedied by the fact that remote procedure calls have built-in retry logic.
The client can specify the maximum number of retries (default is 5) and a timeout period. If a valid response is not
received within the timeout period the request is re-sent. To avoid server overload, the timeout period is then doubled.
The retry continues until the limit has been reached. This same logic keeps NFS clients fault-tolerant in the presence of
server failures: a client will keep retrying until the server responds.
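A sketch of that retry loop, assuming a send_request callable that raises a timeout exception when no reply arrives in time; the defaults mirror the numbers above, not any real mount option.

    import socket

    def call_with_retry(send_request, timeout=1.0, max_retries=5):
        """Re-send the request on timeout, doubling the timeout each time."""
        for attempt in range(max_retries + 1):
            try:
                return send_request(timeout)     # returns the reply if one arrives in time
            except socket.timeout:
                timeout *= 2                     # back off to avoid overloading the server
        raise RuntimeError("server not responding")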
Mounting protocol
The client sends the pathname to the server and requests permission to access the contents of that directory.
If the name is valid and exported (stored in /etc/exports on Linux and BSD systems and in /etc/dfs/sharetab on System
V release 4/SunOS 5.x), the server returns a file handle to the client. This file handle contains all the information needed
to identify the file on the server: {file system type, disk ID, inode number, security info}.
Mounting an NFS file system is accomplished by parsing the path name, contacting the remote machine for a
file handle, and creating an in-memory vnode at the mount point. A vnode points to an inode for a local UNIX file or, in
the case of NFS, an rnode. The rnode contains specific information about the state of the file from the point of view of
the client. Two forms of mounting are supported:
static In this case, file systems are mounted with the mount command (generally during system boot).
automounting One problem with static mounting is that if a client has a lot of remote resources mounted, boot-time
can be excessive, particularly if any of the remote systems are not responding and the client keeps retrying.
Another problem is that each machine has to maintain its own name space. If an administrator wants all machines
to have the same name space, this can be an administrative headache. To combat these problems the automounter
was introduced.
The automounter allows mounts and unmounts to be performed in response to client requests. A set of remote
directories is associated with a local directory. None are mounted initially. The first time any of these is referenced, the
operating system sends a message to each of the servers. The first reply wins and that file system gets mounted (it is up
to the administrator to ensure that all file systems are the same). To configure this, the automounter relies on mapping
files that provide a mapping of client pathname to the server file system. These maps can be shared to facilitate
providing a uniform naming space to a number of clients.
Directory and file access protocol
Clients send RPC messages to the server to manipulate files and directories. A file is accessed by performing a
lookup remote procedure call. This returns a file handle and attributes. It is not like an open in that no information is
stored in any system tables on the server. After that, the handle may be passed as a parameter for other functions. For
example, a read(handle, offset, count) function will read count bytes from location offset in the file referred to by handle.
The entire directory and file access protocol is encapsulated in sixteen functions.3 These are (a sketch of a stateless read built on them appears after the table):


Function     Description
null         no-operation, but ensures that connectivity exists
lookup       look up a file name in a directory
create       create a file or a symbolic link
remove       remove a file from a directory
rename       rename a file or directory
read         read bytes from a file
write        write bytes to a file
link         create a link to a file
symlink      create a symbolic link to a file
readlink     read the data in a symbolic link (do not follow the link)
mkdir        create a directory
rmdir        remove a directory
readdir      read from a directory
getattr      get attributes of a file or directory (type, access and modify times, and access permissions)
setattr      set file attributes
statfs       get information about the remote file system

3 These functions are present in versions 2 and 3 of the NFS protocol. Version 3 added six more functions and even more were added in
version 4.
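Because no per-client state is kept on the server, every request must carry the file handle and an explicit offset. The wrapper below sketches that pattern; nfs_lookup and nfs_read are placeholders, not calls from a real NFS library.

    def read_range(nfs_lookup, nfs_read, dir_handle, name, offset, count):
        """Stateless access: no open/close; the handle plus an explicit offset
        identify what to read on every call."""
        handle, attrs = nfs_lookup(dir_handle, name)   # returns a handle and attributes; stores nothing
        data = b""
        while count > 0:
            chunk = nfs_read(handle, offset, min(count, 8192))   # 8K byte transfers (the default mentioned later)
            if not chunk:
                break                                            # reached end of file
            data += chunk
            offset += len(chunk)
            count -= len(chunk)
        return data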

Accessing files
Files are accessed through conventional system calls (thus providing access transparency). If you recall
conventional UNIX systems, a hierarchical pathname is dereferenced to the file location with a kernel function called
namei. This function maintains a reference to a current directory, looks at one component and finds it in the directory,
changes the reference to that directory, and continues until the entire path is resolved. At each point in traversing this
pathname, it checks to see whether the component is a mount point, meaning that name resolution should continue on
another file system. In the case of NFS, it continues with remote procedure calls to the server hosting that file system.
Upon realizing that the rest of the pathname is remote, namei will continue to parse one component of the
pathname at a time to ensure that references to .. (dot-dot, the parent directory) and to symbolic links become local if
necessary. Each component is retrieved via a remote procedure call which performs an NFS lookup. This procedure
returns a file handle. An in-memory rnode is created and the VFS layer in the file system creates a vnode to point to
it. The application can now issue read and write system calls. The file descriptor in the user's process will reference the
in-memory vnode at the VFS layer, which in turn will reference the in-memory rnode at the NFS level which contains
NFS-specific information, such as the file handle. At the NFS level, NFS read, write, etc. operations may now be
performed, passing the file handle and local state (such as file offset) as parameters. No information is maintained on
the server between requests; it is a stateless system.
The RPC requests have the user ID and group ID number sent with them. This is a security hole that may be stopped
by turning on RPC encryption.
Performance
NFS performance was usually found to be slower than accessing local files because of the network overhead. To improve
performance, reduce network congestion, and reduce server load, file data is cached at the client. Entire pathnames are
also cached at the client to improve performance for directory lookups.
server caching Server caching is automatic at the server in that the same buffer cache is used as for all other files on
the server. The difference for NFS-related writes is that they are all write-through to avoid unexpected data loss
if the server dies.
client caching The goal of client caching is to reduce the amount of remote operations. Three forms of information are
cached at the client: file data, file attribute information, and pathname bindings. NFS caches the results of read,
readlink, getattr, lookup, and readdir operations. The danger with caching is that inconsistencies may arise. NFS
tries to avoid inconsistencies (and/or increase performance) with:
validation: if caching one or more blocks of a file, save a time stamp. When a file is opened or if the server is
contacted for a new data block, compare the last modification time. If the remote modification time is more
recent, invalidate the cache.
Validation is performed every three seconds on open files.
Cached data blocks are assumed to be valid for three seconds.
Cached directory blocks are assumed to be valid for thirty seconds.
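A sketch of that validation policy with the three- and thirty-second freshness windows; the getattr_rpc call and the attribute names are placeholders.

    import time

    FRESH_FILE_SECS = 3     # cached data blocks are trusted for three seconds
    FRESH_DIR_SECS = 30     # cached directory blocks are trusted for thirty seconds

    class CacheEntry:
        def __init__(self, data, server_mtime, is_dir=False):
            self.data = data
            self.server_mtime = server_mtime
            self.validated_at = time.time()
            self.is_dir = is_dir

    def read_cached(entry, getattr_rpc, handle):
        """Return cached data, revalidating against the server's modification time
        once the freshness window has expired. Returns None if the cache is stale."""
        window = FRESH_DIR_SECS if entry.is_dir else FRESH_FILE_SECS
        if time.time() - entry.validated_at > window:
            attrs = getattr_rpc(handle)              # ask the server for the current attributes
            if attrs["mtime"] > entry.server_mtime:
                return None                          # invalidated: the caller must re-fetch
            entry.validated_at = time.time()
        return entry.data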


Whenever a page is modified, it is marked dirty and scheduled to be written (asynchronously). The page is
flushed when the file is closed.
Transfers of data are done in large chunks; the default is 8K bytes. As soon as a chunk is received, the client
immediately requests the next 8K-byte chunk. This is known as read-ahead. The assumption is that most file accesses
are sequential and we might as well fetch the next block of data while we're working on our current block, anticipating
that we'll likely need it. This way, by the time we do, it will either be there or we don't have to wait too long for it since
it is on its way.
Problems
The biggest problem with NFS is file consistency. The caching and validation policies do not guarantee session
semantics.
NFS assumes that clocks between machines are synchronized and performs no clock synchronization between
client and server. One place where this hurts is in distributed software development environments. A program such as
make, which compares times of files (such as object and source) to determine whether to regenerate them, can either
fail or give confusing results.
Because of its stateless design, open with append mode cannot be guaranteed to work. You can open a file, get
the attributes (size), and then write at that offset, but you'll have no assurance that somebody else did not write to that
location after you received the attributes. In that case your write will overwrite the other one since it will go to the old
end-of-file byte offset.
Also because of its stateless nature, file locking cannot work. File locking implies that the server keeps track of
which processes have locks on the file. Sun's solution to this was to provide a separate process (a lock manager). This
introduces state to NFS.
One common programming practice under UNIX file systems for manipulating temporary data in files is to open
a temporary file and then remove it from the directory. The name is gone, but the data persists because you still have
the file open. Under NFS, the server maintains no state about remotely opened files and removing a file will cause the
file to disappear. Since legacy applications depended on this, Sun's solution was to create a special hack for UNIX: if the
same process that has a file open attempts to delete it, it is instead moved to a temporary name and deleted on close.
It's not a perfect solution, but it works well.
Permission bits might change on the server and disallow future access to a file. Since NFS is stateless, it has to
check access permissions each time it receives an NFS request. With local file systems, once access is granted initially,
a process can continue accessing the file even if permissions change.
By default, no data is encrypted and Unix-style authentication is used (user ID, group ID). NFS supports two
additional forms of authentication: Diffie-Hellman and Kerberos. However, data is never encrypted and user-level
software should be used to encrypt files if this is necessary.
More fixes
The original version of NFS was released in 1985, with version 2 released around 1988. In 1992, NFS was enhanced to
version 3 (SunOS 5.5). This version is still supported, although version 4 was introduced around 2003. Several changes
were added to enhance the performance of the system in version 3:
1. NFS was enhanced to support TCP. UDP caused more problems over wide-area networks than it did over LANs
because of errors. To combat that and to support larger data transfer sizes, NFS was modified to support TCP as
well as UDP. To minimize connection setup, all traffic can be multiplexed over one TCP connection.
2. NFS always relied on the system buffer cache for caching file data. The buffer cache is often not very large and
useful data was getting flushed because of the size of the cache. Sun introduced a caching file system, CacheFS,
that provides more caching capability by using the disk. Memory is still used as before, but a local disk is used if
more cache space is needed. Data can be cached in chunks as large as 64K bytes and entire directories can be
cached.
3. NFS was modified to support asynchronous writes. If a client needed to send several write requests to a server, it
would send them one after another. The server would respond to a request only after the data was flushed to the
disk. Now multiple writes can be collected and sent as an aggregate request to the server. The server does not
have to ensure that the data is on stable storage (disk) until it receives a commit request from the client.
4. File attributes are returned with each remote procedure call now. The overhead is slight and saves clients from
having to request file attributes separately (which was a common operation).
5. Version 3 allows 64-bit rather than the old 32-bit file offsets (supporting file sizes over 18 million terabytes).
6. An enhanced lock manager was added to provide monitored locks. A status monitor monitors hosts with locks
and informs a lock manager of a system crash. If a server crashes, the status monitor reinstates locks on recovery.
If a client crashes, all locks from that client are freed on the server.
Version 4, described in RFC 3530, is a major enhancement to NFS and the original stateless model of the system is
essentially gone now. A few of the major additions in version 4 were:
Single pseudo file system export: Instead of exporting a set of directories, a server can now export a single pseudo
file system that appears as one directory but is created from multiple components.
Compound RPC: The system is still based on ONC RPC but supports the use of compound RPCs to reduce network latency.
Compound RPCs allow one to issue several disparate requests in one message.


Connection-oriented transport: NFS now requires the use of TCP.


Access control lists and server objects: NFS now supports arbitrary server objects beyond files and directories. This
was added primarily to support Windows systems. Support for access control lists was also added.
File locking and consistency: NFS still has a weak consistency model for general file access but now supports
mandatory as well as advisory file locking. The NFS server can also direct the client on what data it can cache.
Security: Strong security is now required. One of three mechanisms is supported and negotiated at mount time:
Kerberos, LIPKEY, and SPKM-3. Identification is sent as username strings instead of user ID numbers.
3.3 Andrew File System (AFS)
The goal of the Andrew File System (from Carnegie Mellon University, then a product of Transarc Corp., and now part
of the Transarc division of IBM and available via the IBM public license) was to support information sharing on a large
scale (thousands to 10,000+ users). There were several incarnations of AFS, with the first version being available around
1984, AFS-2 in 1986, and AFS-3 in 1989. The assumptions about file usage were:
most files are small
reads are much more common than writes
most files are read/written by one user
files are referenced in bursts (locality principle). Once referenced, a file will probably be referenced again.
From these assumptions, the original goal of AFS was to use whole file serving on the server (send an entire file
when it is opened) and whole file caching on the client (save the entire file onto a local disk). To enable this mode of
operation, the user would have a cache partition on a local disk devoted to AFS. If a file was updated then the file would
be written back to the server when the application performs a close. The local copy would remain cached at the client.
Implementation
The client's machine has one disk partition devoted to the AFS cache (for example, 100M bytes, or whatever
the client can spare). The client software manages this cache in an LRU (least recently used) manner and the clients
communicate with a set of trusted servers. Each server presents a location-transparent hierarchical file name space
to its clients. On the server, each physical disk partition contains files and directories that can be grouped into one or
more volumes. A volume is nothing more than an administrative unit of organization (e.g., a user's home directory, a
local source tree). Each volume has a directory structure (a rooted hierarchy of files and directories) and is given a name
and ID. Servers are grouped into administrative entities called cells. A cell is a collection of servers, administrators,
clients, and users. Each cell is autonomous but cells may cooperate and present users with one uniform name space.
The goal is that every client will see the same name space (by convention, under a directory /afs). Listing the directory
/afs shows the participating cells (e.g., /afs/mit.edu).
Internally, each file and directory is identified by three 32-bit numbers:
volume ID: This identifies the volume to which the object belongs. The client caches the binding between volume ID
and server, but the server is responsible for maintaining the bindings.
vnode ID: This is the "handle" (vnode number) that refers to the file on a particular server and disk partition (volume).
uniquifier: This is a unique number to ensure that the same vnode IDs are not reused.
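
As a small illustration, the three identifiers above can be modelled as an immutable triple; the class and field names below are invented for this sketch and are not taken from the AFS sources.

from dataclasses import dataclass

@dataclass(frozen=True)
class AFSFid:
    volume_id: int   # which volume the object belongs to
    vnode_id: int    # the vnode number within that volume
    uniquifier: int  # guards against reuse of the same vnode ID

# Two generations of the "same" vnode remain distinguishable:
old = AFSFid(volume_id=42, vnode_id=17, uniquifier=1)
new = AFSFid(volume_id=42, vnode_id=17, uniquifier=2)
assert old != new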
Each server maintains a copy of a database that maps a volume number to its server. If the client request is
incorrect (because a volume moved to a different server), the server forwards the request. This provides AFS with
migration transparency: volumes may be moved between servers without disrupting access.
Communication in AFS is with RPCs via UDP. Access control lists are used for protection; UNIX file permissions
are ignored. The granularity of access control is directory based; the access rights apply to all files in the directory. Users
may be members of groups and access rights specified for a group. Kerberos is used for authentication.
Cache coherence
The server copies a file to the client and provides a callback promise: it will notify the client when any other process
modifies the file.
When a server gets an update from a client, it notifies all the clients by sending a callback (via RPC). Each client
that receives the callback then invalidates the cached file. If a client that had a file cached was down, on restart, it
contacts the server with the timestamps of each cached file to decide whether to invalidate the file. Note that if a process
has a file open, it can continue using it, even if it has been invalidated in the cache. Upon close, the contents will still be
propagated to the server. There is no further mechanism for coherency. AFS presents session semantics.
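
A minimal sketch of this client-side bookkeeping is shown below. The structures and function names are hypothetical; a real AFS cache manager is considerably more involved.

# Sketch of AFS-style callback handling on a client.
cache = {}  # path -> {"data": ..., "valid": bool, "open_count": int}

def on_callback_break(path):
    # The server told us another client modified the file: invalidate our copy.
    if path in cache:
        cache[path]["valid"] = False

def open_file(path, fetch_from_server):
    entry = cache.get(path)
    if entry is None or not entry["valid"]:
        # (Re)fetch the whole file and receive a fresh callback promise.
        entry = {"data": fetch_from_server(path), "valid": True, "open_count": 0}
        cache[path] = entry
    entry["open_count"] += 1
    # A process that already has the file open keeps using this data even if
    # the entry is later invalidated; changes are written back on close.
    return entry["data"]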
Under AFS, read-only files may be replicated on multiple servers.
Whole file caching is not feasible for very large files, so AFS caches files in 64K byte chunks (by default) and
directories in their entirety. File modifications are propagated only on close. Directory modifications are propagated
immediately.
AFS does not support byte-range file locking. Advisory file locking (query to see whether a file has a lock on it)
is supported.
AFS Summary

AFS demonstrates that whole file (or large chunk) caching offers dramatically reduced loads on servers,
creating an environment that scales well. The AFS file system provides a uniform name space from all workstations,
unlike NFS, where clients mount each NFS file system at a client-specific location (the name space is uniform only
under the /afs directory, however). Establishing the same view of the file name space from each client is easier than
with NFS. This enables users to move to different workstations and see the same view of the file system.
Access permission is handled through control lists per directory, but there is no per-file access control.
Workstation/user authentication is performed via the Kerberos authentication protocol using a trusted third party
(more on this in the security section).
A limited form of replication is supported. Replicating read-only (and read-mostly at your own risk) files can
alleviate some performance bottlenecks for commonly accessed files (e.g. password files, system binaries).
3.4 Coda
Coda is a descendent of AFS, also created at CMU (c. 1990-1992). Its goals are:
Provide better support for replication of file volumes than offered by AFS. AFS' limited form (read-only volumes)
of replication will be a limiting factor in scaling the system. We would like to support widely shared read/write
files.
Provide constant data availability in disconnected environments through hoarding (user-directed caching). This
requires logging updates on the client and reintegration when the client is reconnected to the network. Such a
scheme will support the mobility of PCs.
Improve fault tolerance. Failed servers and network problems shouldn't seriously inconvenience users.
To achieve these goals, AFS was modified in two substantial ways:
1. File volumes can be replicated to achieve higher throughput of file access operations and improve fault tolerance.
2. The caching mechanism was extended to enable disconnected clients to operate.
Volumes can be replicated to a group of servers. The set of servers that can host a particular volume is the volume
storage group (VSG) for that volume. In identifying files and directories, a client no longer uses a volume ID as AFS did,
but instead uses a replicated volume ID. The client performs a one-time lookup to map the replicated volume ID to a list
of servers and local volume IDs. This list is cached for efficiency. Read operations can take place from any of these
servers to distribute the load. A write operation has to be multicast to all available servers. Since some servers may be
inaccessible at a particular point in time, a client may be able to access only a subset of the VSG. This subset is known
as the Available Volume Storage Group, or AVSG.
Since some volume servers may be inaccessible, special treatment is needed to ensure that clients do not read
obsolete data. Each file copy has a version stamp. Before fetching a file, a client requests version stamps for that file
from all available servers. If some servers are found to have old versions, the client initiates a resolution process which
tries to automatically resolve differences (administrative intervention may be required if the process finds problems
that it cannot fix). Resolution is initiated only by the client, but the resolution process itself is handled entirely by the servers.
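
The version check can be sketched roughly as follows. The helper names are hypothetical, and real Coda uses version vectors and a server-side resolution protocol rather than a single integer stamp.

# Sketch: detect stale replicas in the AVSG before fetching a file.
def check_replicas(avsg, get_version_stamp, start_resolution):
    stamps = {server: get_version_stamp(server) for server in avsg}
    newest = max(stamps.values())
    stale = [s for s, v in stamps.items() if v < newest]
    if stale:
        start_resolution(stale)   # the client only triggers it; the servers do the work
    return newest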
Disconnected operation
If a client's AVSG is empty, then the client is operating in a disconnected operation mode. If a file is not cached
locally and is needed, nothing can be done: the system simply retries access and fails. For writes, however, the client
does not report a failure of an update. Instead, the client logs the update locally in a Client Modification Log (CML). The
user is oblivious to this. On reconnection, a process of reintegration with the server(s) commences to bring the server
up to date. The CML is played back (the log playback is optimized so that only the latest changes are sent). The system
tries to resolve conflicts automatically. This is not always possible (for example, someone may have modified the same
parts of the file on a server while our client was disconnected). In cases where conflicts arise, user intervention is
required to reconcile the differences.
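
The CML idea can be sketched in a few lines (hypothetical structures; real Coda logs richer operation records and must detect conflicts during reintegration).

# Toy sketch of a Client Modification Log (CML).
cml = []   # ordered list of (path, new_contents) updates made while disconnected

def write_while_disconnected(path, contents):
    cml.append((path, contents))   # the user sees an immediate success

def reintegrate(send_to_server):
    # Playback optimization: only the latest update to each file is replayed.
    latest = {}
    for path, contents in cml:
        latest[path] = contents
    for path, contents in latest.items():
        send_to_server(path, contents)   # conflicts may still need user intervention
    cml.clear()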
To further support disconnected operation, it is desirable to cache all the files that will be needed for work to
proceed when disconnected and keep them up to date even if they are not being actively used. To do this, Coda supports
a hoard database that contains a list of these "important" files. The hoard database is constructed both by monitoring
a user's file activity and allowing a user to explicitly specify files and directories that should be present on the client.
The client frequently asks the server to send updates if necessary (that is, when it receives a callback).
3.5 Distributed File System (DFS), also known as AFS v3
DFS is the file system that is part of the Open Group's (formerly the Open Software Foundation or OSF)
Distributed Computing Environment (DCE) and is the third version of AFS. Like AFS, it assumes that:
most file accesses are sequential
most file lifetimes are short
the majority of accesses are whole-file transfers
the majority of accesses are to small files

With these assumptions, the conclusion is that file caching can reduce network traffic and server load. Since
the studies on file usage in the early and mid-1980s, file throughput per user has increased dramatically and typical
file sizes have become much larger. However, disk capacity and network throughput have also grown, so the conclusion
still holds.
DFS implements a strong consistency model (unlike AFS) with Unix semantics supported. This means that a
read will return the effects of all writes that precede it. Cache consistency under DFS is maintained by the use of tokens.
A token is a guarantee from the server that a client can perform certain operations on the cached file. The server
will revoke a token if another client attempts a conflicting operation. A server grants and revokes tokens. It will grant
any number of read tokens to clients but as soon as one client requests write access, the server will revoke all
outstanding read and write tokens and issue a single write token to the requestor. This token scheme makes long term
caching possible (which is not possible under NFS). Caching is in units of chunk sizes that range from 8K to 256K bytes.
Caching is both in client memory and on the disk. DFS also employs read-ahead (similar to NFS) to attempt to bring
additional chunks of the file to the client before they are needed.
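
The grant/revoke rule described above can be summarized with the sketch below. It is a simplification with invented names: DFS actually has several token types (for data, status, and locks) and tokens can cover byte ranges.

# Sketch of DFS-style token management on the server.
read_tokens = set()    # clients currently holding read tokens
write_token = None     # at most one client may hold the write token

def request_write(client, revoke):
    global write_token
    # Revoke all outstanding read and write tokens held by other clients.
    for c in list(read_tokens) + ([write_token] if write_token else []):
        if c != client:
            revoke(c)
    read_tokens.clear()
    write_token = client

def request_read(client, revoke):
    global write_token
    if write_token is not None and write_token != client:
        revoke(write_token)        # assumption: a conflicting writer loses its token
        write_token = None
    read_tokens.add(client)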
DFS is integrated with DCE security services. File protection is via access control lists (ACL) and all
communication between client and server is via authenticated remote procedure calls.
3.6 Server Message Block (SMB)
SMB is a protocol for sharing files, devices, and communication abstractions (such as named pipes or mailslots).
It was created by Microsoft and Intel in 1987 and evolved over the years.
SMB is a client-server request-response protocol. Servers make file systems and other resources available to
clients and clients access the shared file systems (and printers) from the servers. The protocol is connection-oriented
and initially required either Microsoft's IPX/SPX or NetBIOS over either TCP/IP or NetBEUI (these are session-layer APIs). A
typical session proceeds as follows:
1. Client sends a negprot SMB to the server. This is a protocol negotiation request.
2. The server responds with a version number of the protocol (and version-specific information, such as a maximum
buffer size and naming information).
3. The client logs on (if required) by sending a sesssetupX SMB, which includes a user-name and password. It
receives an acknowledgement or failure from the server. If successful, the server sends back a user ID (UID) of the
logged-on user. This UID must be submitted with future requests.
4. The client can now connect to a tree. It sends a tcon (or tconX) SMB with the network name of the shared resource
to request access to the resource. The server responds with a tree identifier (TID) that the client will use for all
future requests for that resource.
5. Now the client can send open, read, write, close SMBs.
Machine naming was restricted to 15-character NetBIOS names if NetBEUI or TCP/IP are used. Since SMB was
designed to operate in a small local-area network, clients find out about resources either by being configured to know
about the servers in their environment or by having each server periodically broadcast information about its presence.
Clients would listen for these broadcasts and build lists of servers. This is fine in a LAN environment but does not scale
to wide-area networks (e.g. a TCP/IP environment with multiple subnets or networks). To combat this deficiency,
Microsoft introduced browse servers and the Windows Internet Name Service (WINS).
The SMB security model has two levels:
1. Share level: Protection is applied per "share" (resource). Each share can have a password. The client needs to know
that password to be able to access all files under that share. This was the only security model in early versions of
SMB and became the default under Windows 95 and later systems.
2. User level: Protection is applied to individual files in each share based on user access rights. A user (client) must
log into a server and be authenticated. The client is then provided with a UID which must be presented for all
future accesses.
3.7 CIFS (SMB evolves)
SMB continued to evolve and eventually Microsoft made the protocol public. CIFS is a version of SMB based on
the public protocol and continued evolution.
It is based on the server message block protocol and draws from concepts in AFS and DFS. SMB shunned
extensive client-side caching for fear of inconsistency, but caching is useful both for client performance and for alleviating
load on servers, so coherent caching was a key addition to the protocol. To support wide-area (slow) networks, CIFS
allows multiple requests to be combined into a single message to minimize round-trip latencies. The obsolete
requirement for NetBIOS or NetBEUI has been dropped. CIFS is transport-independent but requires a reliable
connection-oriented message-stream transport. Sharing is in units of directory trees or individual devices. For access,
the client sends authentication information to the server (name and password). The granularity of authorization is up
to the server (e.g., individual files or an entire directory tree).
The caching mechanism is one of the more interesting aspects of CIFS. Network load is reduced if the number of times
that the client informs the server of changes is minimized. This also minimizes the load on servers. An extreme
optimization leads to session semantics but a goal in CIFS is to provide coherent caching. Client caching is safe if any
number of clients are reading data. Read-ahead operations (prefetch) are also safe as long as other clients are only

reading the file. Write-behind (delayed writes) is safe only if a single client is accessing the file. None of these
optimizations is safe if multiple clients are writing the file. In that case, operations will have to go directly to the server.
To support this behavior, the server grants opportunistic locks (oplocks) to a client for each file that it is accessing.
This is a slight modification of the token granting scheme used in DFS. An oplock takes one of the following forms:
exclusive oplock: Tells the client that it is the only one with the file open (for write). Local caching, read-ahead, and
write-behind are allowed. The server must receive an update upon file close. If someone else opens the file, the
server has the previous client break its oplock. The client must send the server any lock and write data and
acknowledge that it no longer has the lock.
level II oplock: Allows multiple clients to have the same file open as long as none are writing to the file. It tells the
client that there are multiple concurrent clients, none of whom have modified the file (read access). Local caching
of reads as well as read-ahead are allowed. All other operations must be sent directly to the server.
batch oplock: Allows the client to keep the file open on the server even if a local process that was using it has closed
the file. A client requests a batch oplock if it expects that programs may behave in a way that generates a lot of
traffic (accessing the same file over and over). This oplock tells the client that it is the only one with the file open.
All operations may be done on cached data and data may be cached indefinitely.
no oplocks: Tells the client that other clients may be writing data to the file: all requests other than reads must be sent
to the server. Read operations may work from a local cache only if the byte range was locked by the client.
The server has the right to asynchronously send a message to the client changing the oplock. For example, a client
may be granted an exclusive oplock initially since nobody else was accessing the file. Later on, when another client
opened the same file for read, the oplock was changed to a level II oplock. When another client opened the file for
writing, both of the earlier clients were sent a message revoking their oplocks.
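
The caching rights implied by each oplock type can be tabulated as below. This is a simplification (invented names, and it ignores the byte-range-lock exception for reading without an oplock).

# What a CIFS client may do locally under each oplock type (sketch).
OPLOCK_RULES = {
    "exclusive": {"cache_reads": True,  "read_ahead": True,  "write_behind": True},
    "batch":     {"cache_reads": True,  "read_ahead": True,  "write_behind": True},
    "level II":  {"cache_reads": True,  "read_ahead": True,  "write_behind": False},
    "none":      {"cache_reads": False, "read_ahead": False, "write_behind": False},
}

def may_write_behind(oplock):
    return OPLOCK_RULES[oplock]["write_behind"]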

4 Google File System (GFS)


The Google File System was designed to provide a fault tolerant file system for an environment of thousands of
machines. Multi-gigabyte and multi-terabyte files are the norm for the environment and it does not make sense to design
a file system that is optimized for smaller files.
In addition to the aforementioned design assumptions, the assumption is that appends to files will be large and
that hundreds of processes may append to the same file concurrently.
4.1 Interface
GFS does not provide a file system interface at the operating-system level (e.g., under the VFS layer). As such, file system
calls are not used to access it. Instead, a user-level API is provided. GFS is implemented as a set of user-level services
that store data onto native Linux file systems. Moreover, since GFS was designed with special considerations in mind, it
does not support all the features of POSIX (Linux, UNIX, OS X, BSD) file system access. It provides a familiar interface of
files organized in directories with basic create, delete, open, close, read, and write operations. In addition, two special
operations are supported. A snapshot is an efficient way of creating a copy of the current instance of a file or directory
tree. An append operation allows a client to append data to a file as an atomic operation without having to lock a file.
Multiple processes can append to the same file concurrently without fear of overwriting one another's data.
4.2 Configuration
A file in GFS is broken up into multiple fixed-size chunks. Each chunk is 64 MB. The set of machines that
implements an instance of GFS is called a GFS cluster. A GFS cluster consists of one master and many chunkservers.
The master is responsible for storing all the metadata for the files in GFS.

This includes their names, directories, and the mapping of files to the list of chunks that contain each file's data.
The chunks themselves are stored on the chunkservers. For fault tolerance, chunks are replicated onto multiple systems.
Figure 1 shows how a file is broken into chunks and distributed among multiple chunkservers.
The Google File System is a core part of the Google Cluster Environment. This environment has GFS and a cluster
scheduling system for dispatching processes as its core services.

Figure 1. GFS data chunking and distribution

Typically, hundreds to thousands of active jobs are run. Over 200 clusters are deployed, many with thousands of
machines. Pools of thousands of clients access the files in this cluster. The file systems exceed four petabytes and provide
read/write loads of 40 GB/s. Jobs often run on the same machines that implement GFS, which are commodity PCs
running Linux. The environment is shown in Figure 2.
4.3 Client interaction model
Since the GFS client is not implemented in the operating system at the VFS layer, GFS client code is linked into each
application that uses GFS. This library interacts with the GFS master for all metadata-related operations (looking up
files, creating them, deleting them, etc.). For accessing data, it interacts directly with the chunkservers that hold that
data. This way, the master is not a point of congestion. Except for caching within the buffer cache on the chunkservers,
neither clients nor chunkservers cache a file's data. However, client programs do cache the metadata for an open file
(for example, the location of a file's chunks). This avoids additional traffic to the master.
4.4 Implementation
Each chunkserver stores chunks. A chunk is identified by a chunk handle, which is a globally unique 64-bit number
that is assigned by the master when the chunk is first created. On the chunkserver, every chunk is stored on the local
disk as a regular Linux file. For integrity, the chunkserver stores a 32-bit checksum, also logged to disk, for each chunk
on that chunkserver.
Every chunk is replicated onto multiple chunkservers. By default, there are three replicas of a chunk although
different levels can be specified on a per-file basis. Files that are accessed by lots of processes may need more replicas
to avoid congestion at any server.
4.5 Master
The primary role of the master is to maintain all of the file system metadata. This includes the names and directories of
each file, access control information, the mapping from each file to a set of chunks, and the current location of chunks
on chunkservers. Metadata is stored only on the master. This simplifies the design of GFS as there is no need to handle
synchronizing information for a changing file system among multiple masters.
For fast performance, all metadata is stored in the master's main memory. This includes the entire filesystem
namespace as well as all the name-to-chunk maps. For fault tolerance, any changes are written to

Figure 2. Google Cluster Environment

the disk onto an operation log. This operation log is also replicated onto remote machines. The operation log is
similar to a journal. Every operation to the file system is logged into this file. Periodic checkpoints of the file system
state, stored in a B-tree structure, are performed to avoid having to recreate all metadata by playing back the entire log.
Having a single master for a huge file system sounds like a bottleneck but the role of the master is only to tell
clients which chunkservers to use. The data access itself is handled between the clients and chunkservers.
The file system namespace of directories and file names is maintained by the master. Unlike most file systems,
there is no separate directory structure that contains the names of all the files within that directory. The namespace is

simply a single lookup table that contains pathnames (which can look like a directory hierarchy if desired) and maps
them to metadata. GFS does not support hard links or symbolic links.
The master manages chunk leases (locks on chunks with expiration), garbage collection (the freeing of unused
chunks), and chunk migration (the movement and copying of chunks to different chunkservers). It periodically
communicates with all chunkservers via heartbeat messages to get the state of chunkservers and sends commands to
chunkservers. The master does not store chunk locations persistently on its disk. This information is obtained from
queries to chunkservers and is done to keep consistency problems from arising.
The master is a single point of failure in GFS and replicates its data onto backup masters for fault tolerance.
4.6 Chunk size
The default chunk size in GFS is 64MB, which is a lot bigger than block sizes in normal file systems (which are
often around 4KB). Small chunk sizes would not make a lot of sense for a file system designed to handle huge files since
each file would then have a map of a huge number of chunks. This would greatly increase the amount of data a master
would need to manage and increase the amount of data that would need to be communicated to a client, resulting in
extra network traffic. A master stores less than 64 bytes of metadata for each 64MB chunk. By using a large chunk size,
we reduce the need for frequent communication with the master to get chunk location information. It becomes feasible
for a client to cache all the information related to where the data of large files is located. To reduce the risk of caching
stale data, client metadata caches have timeouts. A large chunk size also makes it feasible to keep a TCP connection open
to a chunkserver for an extended time, amortizing the time of setting up a TCP connection.
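
A quick back-of-the-envelope calculation, using the figures above, shows why the single in-memory master remains practical:

# Rough metadata estimate for 1 PB of file data stored as 64 MB chunks.
chunk_size = 64 * 2**20               # 64 MB
data = 2**50                          # 1 PB
chunks = data // chunk_size           # 16,777,216 chunks
metadata = chunks * 64                # at most 64 bytes of metadata per chunk
print(chunks, metadata / 2**30)       # about 16.8 million chunks, roughly 1 GiB of metadata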
4.7 File access
To read a file, the client contacts the master to read a file's metadata; specifically, to get the list of chunk handles.
It then gets the location of each of the chunk handles. Since chunks are replicated, each chunk handle is associated with
a list of chunkservers. The client can contact any available chunkserver to read chunk data.
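
In outline, a read through the client library looks roughly like the sketch below; master.lookup and the chunkservers dictionary are hypothetical stand-ins, not Google's actual client API.

# Sketch of the GFS read path: metadata from the master, data from any replica.
CHUNK_SIZE = 64 * 2**20

def gfs_read(path, offset, length, master, chunkservers, cache):
    chunk_index = offset // CHUNK_SIZE
    key = (path, chunk_index)
    if key not in cache:                                  # contact the master only on a miss
        cache[key] = master.lookup(path, chunk_index)     # -> (chunk_handle, replica list)
    handle, replicas = cache[key]
    server = replicas[0]                                  # any available replica will do
    return chunkservers[server].read(handle, offset % CHUNK_SIZE, length)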
File writes are expected to be far less frequent than file reads. To write to a file, the master grants a chunk
lease to one of the replicas. This replica will be the primary replica chunkserver and will be the first one to get updates
from clients. The primary can request lease extensions if needed. When the master grants the lease, it increments the
chunk version number and informs all of the replicas containing that chunk of the new version number.
The actual writing of data is split into two phases: sending and writing.
1. First, the client is given a list of replicas that identifies the primary chunkserver and secondaries. The client sends
the data to the closest replica chunkserver. That replica forwards the data to another replica chunkserver, which
then forwards it to yet another replica, and so on. Eventually all the replicas get the data, which is not yet written
to a file but sits in a cache.
2. When the client gets an acknowledgement from all replicas that the data has been received it then sends a write
request to the primary, identifying the data that was sent in the previous phase. The primary is responsible for
serialization of writes. It assigns consecutive serial numbers to all write requests that it has received, applies the
writes to the file in serial-number order, and forwards the write requests in that order to the secondaries. Once
the primary gets acknowledgements from all the secondaries, the primary responds back to the client and the
write operation is complete.
The key point to note is that data flow is different from control flow. The data flows from the client to a
chunkserver and then from that chunkserver to another chunkserver, and from that other chunkserver to yet another
one until all chunkservers that store replicas for that chunk have received the data. The control (the write request) flow
goes from the client to the primary chunkserver for that chunk. The primary then forwards the request to all the
secondaries. This ensures that the primary is in control of the order of writes even if it receives multiple write requests
concurrently. All replicas will have data written in the same sequence. Chunk version numbers are used to detect if any
replica has stale data that was not updated because that chunkserver was down during some update.
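
The separation of the two flows can be sketched as follows. The object methods are hypothetical; real GFS also pipelines the data transfer and handles replica failures.

# Sketch of a GFS-style write: push data along a chain, then commit via the primary.
def push_data(data, replica_chain):
    # Data flow: client -> closest replica -> next replica -> ...
    for replica in replica_chain:
        replica.buffer(data)              # cached at the replica, not yet applied

def commit(primary, secondaries, write_id):
    # Control flow: only the primary decides the order of writes.
    serial = primary.next_serial_number()
    primary.apply(write_id, serial)
    for s in secondaries:
        s.apply(write_id, serial)         # every replica applies writes in the same order
    return serial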
5 Hadoop Distributed File System (HDFS)
5.1 File system operations
The Hadoop Distributed File System is inspired by GFS. The overall architecture is the same, although some terminology
changes.

GFS Name           HDFS Name
Master             NameNode
Chunkserver        DataNode
chunk              block
Checkpoint image   FsImage
Operation log      EditLog

The file system provides a familiar file system interface. Files and directories can be created, deleted, renamed,
and moved and symbolic links can be created. However, there is no goal of providing the rich set of features available
through, say, a POSIX (Linux/BSD/OS X/Unix) or Windows interface. That is, synchronous I/O, byte-range locking, seek-and-modify, and a host of other features may not be supported. Moreover, the file system is provided through a set of
user-level libraries and not as a kernel module under VFS. Applications have to be compiled to incorporate these
libraries.
A file is made up of equal-size data blocks, except for the last block of the file, which may be smaller. These data
blocks are stored on a collection of servers called DataNodes. Each block of a file may be replicated on multiple
DataNodes for high availability. The block size and replication factor are configurable per file. DataNodes are responsible
for storing blocks, handling read/write requests, allocating and deleting blocks, and accepting commands to replicate
blocks on another DataNode. A single NameNode is responsible for managing the name space of the file system and
coordinating file access. It keeps track of which block numbers belong to which file and implements open, close,
rename, and move operations on files and directories. All knowledge of files and directories resides in the NameNode.
See Figure 3.
5.2 Heartbeating and Replication
DataNodes periodically send a heartbeat message and a block report to the NameNode. The heartbeat
informs the NameNode that the DataNode is functioning. The block report contains a list of all the blocks on that
DataNode. A block is considered safely replicated if the minimum number of replica blocks have been sent by block
reports from all available DataNodes. The NameNode waits for a configured percentage of DataNodes to check in and
then waits an additional 30 seconds. After that time, if any data blocks do not have their minimum number of replicas,
the NameNode sends replica requests to DataNodes, asking them to create replicas of specific blocks.
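
The NameNode's bookkeeping amounts to something like the sketch below; the real logic also takes rack placement, decommissioning, and corrupt replicas into account.

# Sketch: find under-replicated blocks from the latest block reports.
def under_replicated(block_reports, min_replicas):
    counts = {}
    for datanode, blocks in block_reports.items():
        for b in blocks:
            counts[b] = counts.get(b, 0) + 1
    return [b for b, n in counts.items() if n < min_replicas]

# The NameNode would then ask a DataNode holding each such block to copy it elsewhere.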

Figure 3. HDFS data chunking

The system is designed to be rack-aware and data center-aware in order to improve availability and
performance. What this means is that the NameNode knows which DataNodes occupy the same rack and which racks
are in one data center. For performance, it is desirable to have a replica of a data block in the same rack. For availability,
it is desirable to have a replica on a different rack (in case the entire rack goes down) or even in a different data center
(in case the entire data center fails). HDFS supports a pluggable interface to support custom algorithms that decide on
replica placement. In the default case of three replicas, the first replica goes to the local rack and both the second and
third replicas go to the same remote rack.
The NameNode chooses a list of DataNodes that will host replicas of each block of a file.

A client writes directly to the first replica. As the first replica gets the data from the client, it sends it to the second
replica even before the entire block is written (e.g., it may get 4 KB out of a 64 MB block). As the second replica gets the
data, it sends it to the third replica, and so on (see Figure 4).
5.3 Implementation
The NameNode contains two files: EditLog and FsImage. The EditLog is a persistent record of changes to any HDFS
metadata (file creation, addition of new blocks to files, file deletion, changes in replication, etc.). It is stored as a file on
the server's native file system.
The FsImage file stores the entire file system namespace. This includes file names, their location in directories, block
mapping for each file, and file attributes. This is also stored as a file on the server's native file system.
The entire active file system image is kept in memory. On startup, the NameNode reads FsImage and applies the list of
changes in EditLog to create an up-to-date file system image. Then, the image is flushed to the disk and the EditLog is
cleared. This sequence is called a checkpoint. From this point, changes to file system metadata are logged to EditLog
but FsImage is not modified until the next checkpoint.
On DataNodes, each block is stored as a separate file in the local file system. The DataNode does not have any knowledge
of file names, attributes, and associated blocks; all that is handled by the NameNode. It simply processes requests to
create, delete, write, read blocks, or replicate blocks. Any use of directories is done strictly for local efficiency - to ensure
that a directory does not end up with a huge number of files that will impact performance.
To ensure data integrity, each HDFS file has a separate checksum file associated with it. This file is created by the
client when the client creates the data file. Upon retrieval, if there is a mismatch between the block checksum and the
computed block checksum, the client can request a read from another DataNode.
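
The client-side check amounts to the sketch below; dn.read is a hypothetical DataNode call, and the real HDFS checksums small sub-block ranges rather than whole blocks.

import zlib

def read_block_verified(block_id, expected_checksum, datanodes):
    # Try each DataNode holding the block until the checksum matches.
    for dn in datanodes:
        data = dn.read(block_id)
        if zlib.crc32(data) == expected_checksum:
            return data
    raise IOError("no DataNode returned an intact copy of block %r" % block_id)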
6 References
The NFS Distributed File Service, A White Paper from SunSoft, 1994 Sun Microsystems, Inc.
RFC 1094: NFS: Network File System Protocol Specification, Sun Microsystems,
March 1989
Sun's Network File System, Operating Systems: Three Easy Pieces, Remzi H. Arpaci-Dusseau and
Andrea C. Arpaci-Dusseau
S. Shepler et al., RFC 3530, Network Working Group, IETF, April 2003
John H. Howard, An Overview of the Andrew File System, Carnegie Mellon University, 1988
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, Google, SOSP'03, October 19-22, 2003.
Google File System, Wikipedia article.
Robin Harris, Google File System Eval: Part 1, StorageMojo blog.
HDFS Architecture Guide, Apache Hadoop project, December 4, 2011.

11.MapReduce
A framework for large-scale parallel processing
Traditional programming tends to be serial in design and execution. We tackle many problems with a
sequential, stepwise approach and this is reflected in the corresponding program. With parallel programming,
we break up the processing workload into multiple parts, that can be executed concurrently on multiple
processors. Not all problems can be parallelized. The challenge is to identify as many tasks as possible that can
run concurrently. Alternatively, we can identify data groups that can be processed concurrently. This will allow
us to divide the data among multiple concurrent tasks.
The most straightforward situation that lends itself to parallel programming is one where there is no
dependency among data. Data can be split into chunks and each process can be assigned a chunk to work on. If
we have lots of processors, we can split the data into lots of chunks. A master/worker approach is a design
where a master process coordinates overall activity. It identifies the data, splits it up based on the number of
available workers, and assigns a data segment to each worker. A worker receives the data segment from the
master, performs whatever processing is needed on the data, and then sends results to the master. At the end,
the master does whatever final processing (e.g., merging) is needed to the resultant data.
MapReduce Etymology
MapReduce was created at Google in 2004 by Jeffrey Dean and Sanjay Ghemawat. The name is
inspired by the map and reduce functions in the LISP programming language. In LISP, the map function takes as
parameters a function and a set of values. That function is then applied to each of the values. For example:
(map length (() (a) (ab) (abc)))
applies the length function to each of the four items in the list. Since length returns the length of an item, the
result of map is a list containing the length of each item:
(0 1 2 3)
The reduce function is given a binary function and a set of values as parameters. It combines all the values
together using the binary function. If we use the + (add) function to reduce the list (0 1 2 3):
(reduce #'+ '(0 1 2 3))
we get:
6
If we think about how the map operation works, we realize that each application of the function to a value can be
performed in parallel (concurrently) since there is no dependence of one upon another. The reduce operation
can take place only after the map is complete.
MapReduce is not an implementation of these LISP functions; they are merely an inspiration and etymological
predecessor.

MapReduce
MapReduce is a framework for parallel computing. Programmers get a simple API and do not have to deal with
issues of parallelization, remote execution, data distribution, load balancing, or fault tolerance. The framework
makes it easy for one to use thousands of processors to process huge amounts of data (e.g., terabytes and
petabytes).
From a user's perspective, there are two basic operations in MapReduce: Map and Reduce.
The Map function reads a stream of data and parses it into intermediate (key, value) pairs. When that is
complete, the Reduce function is called once for each unique key that was generated by Map and is given the
key and a list of all values that were generated for that key as a parameter. The keys are presented in sorted
order.
As an example of using MapReduce, consider the task of counting the number of occurrences of each word in a
large collection of documents. The user-written Map function reads the document data and parses out the
words. For each word, it writes the (key, value) pair of (word, 1). That is, the word is treated as the key and the
associated value of 1 means that we saw the word once. This intermediate data is then sorted by MapReduce
by keys and the user's Reduce function is called for each unique key. Since the only values are the count of 1,
Reduce is called with a list containing a "1" for each occurrence of the word that was parsed from the document. The
function simply adds them up to generate a total word count for that word. Here's what the code looks like:
map(String key, String value):
    // key: document name, value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word; values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
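
For comparison, here is a self-contained Python sketch of the same word-count job using a toy, single-process stand-in for MapReduce that mimics the map, shuffle, and reduce phases:

from collections import defaultdict

def map_fn(doc_name, contents):
    for word in contents.split():
        yield word, 1                     # (key, value) = (word, 1)

def reduce_fn(word, counts):
    return word, sum(counts)              # total occurrences of the word

def toy_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:             # map phase
        for k, v in map_fn(key, value):
            intermediate[k].append(v)     # shuffle: group values by key
    return [reduce_fn(k, intermediate[k]) for k in sorted(intermediate)]  # reduce phase

docs = [("d1", "the cat sat on the mat"), ("d2", "the dog")]
print(toy_mapreduce(docs, map_fn, reduce_fn))
# [('cat', 1), ('dog', 1), ('mat', 1), ('on', 1), ('sat', 1), ('the', 3)]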
Let us now look at what happens in greater detail.
MapReduce: More Detail
To the programmer, MapReduce is largely seen as an API: communication with the various machines that play
a part in execution is hidden. MapReduce is implemented in a master/worker configuration, with one master
serving as the coordinator of many workers. A worker may be assigned a role of either a map worker or a reduce
worker.
Step 1. Split input

Figure 1. Split input into shards


The first step, and the key to massive parallelization in the next step, is to split the input into multiple pieces.
Each piece is called a split, or shard. For M map workers, we want to have M shards, so that each worker will
have something to work on. The number of workers is mostly a function of the number of machines we have at
our disposal.
The MapReduce library of the user program performs this split. The actual form of the split may be specific to
the location and form of the data. MapReduce allows the use of custom readers to split a collection of inputs
into shards, based on specific format of the files.

Step 2. Fork processes

Figure 2. Remotely execute worker processes


The next step is to create the master and the workers. The master is responsible for dispatching jobs to
workers, keeping track of progress, and returning results. The master picks idle workers and assigns them
either a map task or a reduce task. A map task works on a single shard of the original data. A reduce task
works on intermediate data generated by the map tasks. In all, there will be M map tasks and R reduce tasks.
The number of reduce tasks is the number of partitions defined by the user. A worker is sent a message by the
master identifying the program (map or reduce) it has to load and the data it has to read.
Step 3. Map

Figure 3. Map task


Each map task reads from the input shard that is assigned to it. It parses the data and generates (key, value)
pairs for data of interest. In parsing the input, the map function is likely to get rid of a lot of data that is of no
interest. By having many map workers do this in parallel, we can linearly scale the performance of the task of
extracting data.
Step 4: Map worker: Partition

Figure 4. Create intermediate files


The stream of (key, value) pairs that each worker generates is buffered in memory and periodically stored on
the local disk of the map worker. This data is partitioned into R regions by a partitioning function.
The partitioning function is responsible for deciding which of the R reduce workers will work on a specific
key. The default partitioning function is simply a hash of key modulo R but a user can replace this with a custom
partition function if there is a need to have certain keys processed by a specific reduce worker.
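
The default rule is essentially one line, and a custom partitioner only has to map a key to a reducer index in the range 0..R-1 (a sketch, not the library's actual interface):

# Default partitioner: hash the key modulo the number of reduce tasks R.
def default_partition(key, R):
    return hash(key) % R

# Example of a custom partitioner: send all keys sharing a prefix (say, the
# host part of a URL) to the same reduce worker.
def host_partition(url, R):
    return hash(url.split("/")[0]) % R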
Step 5: Reduce: Sort (Shuffle)

Figure 5. Sort and merge partitioned data


When all the map workers have completed their work, the master notifies the reduce workers to start
working. The first thing a reduce worker needs to do is to get the data that it needs to present to the user's reduce
function. The reduce worker contacts every map worker via remote procedure calls to get the (key, value) data
that was targeted for its partition. This data is then sorted by the keys. Sorting is needed since it will usually be
the case that there are many occurrences of the same key and many keys will map to the same reduce worker
(same partition). After sorting, all occurrences of the same key are grouped together so that it is easy to grab
all the data that is associated with a single key.
This phase is sometimes called the shuffle phase.
Step 6: Reduce function

Figure 6. Reduce function writes output


With data sorted by keys, the user's Reduce function can now be called. The reduce worker calls the Reduce
function once for each unique key. The function is passed two parameters: the key and the list of intermediate
values that are associated with the key.
The Reduce function writes its output to a file.
Step 7: Done!
When all the reduce workers have completed execution, the master passes control back to the user program.
Output of MapReduce is stored in the R output files that the R reduce workers created.
The big picture
Figure 7 illustrates the entire MapReduce process. The client library initializes the shards and creates map
workers, reduce workers, and a master. Map workers are assigned a shard to process. If there are more shards
than map workers, a map worker will be assigned another shard when it is done. Map workers invoke the user's
Map function to parse the data and write intermediate (key, value) results onto their local disks. This
intermediate data is partitioned into R partitions according to a partitioning function. Each of R reduce workers
contacts all of the map workers and gets the set of (key, value) intermediate data that was targeted to its
partition. It then calls the user's Reduce function once for each unique key and gives it a list of all values that

were generated for that key. The Reduce function writes its final output to a file that the user's program can
access once MapReduce has completed.

Figure 7. MapReduce

Dealing with failure


The master pings each worker periodically. If no response is received within a certain time, the worker is
marked as failed. Any map or reduce tasks that have been assigned to this worker are reset back to the initial
state and rescheduled on other workers.

Locality
MapReduce is built on top of GFS, the Google File System. Input and output files are stored on GFS. The
MapReduce workers run on GFS chunkservers. The MapReduce master attempts to schedule a map worker
onto one of the machines that holds a copy of the input chunk that it needs for processing. Alternatively,
MapReduce may read from or write to BigTable.
What is it good for and who uses it?
MapReduce is clearly not a general-purpose framework for all forms of parallel programming. Rather,
it is designed specifically for problems that can be broken up into the map-reduce paradigm. Perhaps
surprisingly, there are a lot of data analysis tasks that fit nicely into this model. While MapReduce is heavily
used within Google, it has also found use in companies such as Yahoo, Facebook, and Amazon.

The original, and proprietary, implementation was done by Google. It is used internally for a large
number of Google services. The Apache Hadoop project built a clone to specs defined by Google. Amazon, in
turn, uses Hadoop MapReduce running on their EC2 (elastic cloud) computing-on-demand service to offer the
Amazon Elastic MapReduce service.
Some problems it has been used for include:
Distributed grep (search for words)
Map: emit a line if it matches a given pattern
Reduce: just copy the intermediate data to the output
Count URL access frequency
Map: process logs of web page accesses; output a (URL, 1) pair for each access
Reduce: add all values for the same URL; output a (URL, total count) pair
Reverse web-link graph
Map: output a (target, source) pair for each link to a target found in a page named source
Reduce: concatenate the list of all source URLs associated with a target; output a (target, list(source)) pair
Inverted index
Map: parse the document, emit (word, document ID) pairs
Reduce: for each word, sort the corresponding document IDs and emit a (word, list(document ID)) pair. The set of all
output pairs is an inverted index
References
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth
Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
The introductory and definitive paper on MapReduce
Jerry Zhao, Jelena Pjesivac-Grbovic, MapReduce: The programming model and practice, SIGMETRICS'09 Tutorial,
2009.
Tutorial of the MapReduce programming model.
MapReduce.org
Amazon Elastic MapReduce, Amazon's MapReduce service
Google: MapReduce in a Week, Google Code University

12.BigTable
A NoSQL massively parallel table
Traditional relational databases present a view that is composed of multiple tables, each with rows and named
columns. Queries, mostly performed in SQL (Structured Query Language) allow one to extract specific columns from a
row where certain conditions are met (e.g., a column has a specific value). Moreover, one can perform queries across
multiple tables (this is the "relational" part of a relational database). For example a table of students may include a
student's name, ID number, and contact information. A table of grades may include a student's ID number, course
number, and grade. We can construct a query that extracts grades by name by searching for the ID number in the
student table and then matching that ID number in the grade table. Moreover, with traditional databases, we expect
ACID guarantees: that transactions will be atomic, consistent, isolated, and durable. As we saw when we studied
distributed transactions, it is impossible to guarantee consistency while providing high availability and network
partition tolerance. This makes ACID databases unattractive for highly distributed environments and led to the
emergence of alternative data stores that are targeted at high availability and high performance. Here, we will look at the
structure and capabilities of BigTable.

BigTable
BigTable is a distributed storage system that is structured as a large table: one that may be petabytes in size
and distributed among tens of thousands of machines. It is designed for storing items such as billions of URLs, with
many versions per page; over 100 TB of satellite image data; hundreds of millions of users; and performing thousands
of queries a second. BigTable was developed at Google and has been in use since 2005 in dozens of Google services. An
open source version, HBase, was created by the Apache project on top of the Hadoop core. Apache Cassandra, first
developed at Facebook to power their search engine, is similar to BigTable with a tunable consistency model and no
master (central server).
BigTable is designed with semi-structured data storage in mind. It is a large map that is indexed by a row key,
column key, and a timestamp. Each value within the map is an array of bytes that is interpreted by the application. Every
read or write of data to a row is atomic, regardless of how many different columns are read or written within that row.
It is easy enough to picture a simple table. Let's look at a few characteristics of BigTable:
map
A map is an associative array; a data structure that allows one to look up a value to a corresponding key quickly.
BigTable is a collection of (key, value) pairs where the key identifies a row and the value is the set of columns.
persistent
The data is stored persistently on disk.
distributed
BigTable's data is distributed among many independent machines. At Google, BigTable is built on top of GFS
(Google File System). The Apache open source version of BigTable, HBase, is built on top of HDFS (Hadoop
Distributed File System) or Amazon S3. The table is broken up among rows, with groups of adjacent rows
managed by a server. A row itself is never distributed.
sparse
The table is sparse, meaning that different rows in a table may use different columns, with many of the columns
empty for a particular row.
sorted
Most associative arrays are not sorted. A key is hashed to a position in a table. BigTable sorts its data by keys.
This helps keep related data close together, usually on the same machine assuming that one structures keys
in such a way that sorting brings the data together. For example, if domain names are used as keys in a BigTable,
it makes sense to store them in reverse order to ensure that related domains are close together (see the sketch after this list).
multidimensional
A table is indexed by rows. Each row contains one or more named column families. Column families are
defined when the table is first created. Within a column family, one may have one or more named columns. All
data within a column family is usually of the same type. The implementation of BigTable usually compresses
all the columns within a column family together. Columns within a column family can be created on the fly.
Rows, column families and columns provide a three-level naming hierarchy in identifying data. For example:
"edu.rutgers.cs" : {                 // row
    "users" : {                      // column family
        "watrous": "Donald",         // column
        "hedrick": "Charles",        // column
        "pxk" : "Paul"               // column
    },
    "sysinfo" : {                    // another column family
        "" : "SunOS 5.8"             // column (null name)
    }
}
To get data from BigTable, you need to provide a fully-qualified name in the form column-family:column. For
example, users:pxk or sysinfo:. The latter shows a null column name.
time-based

Time is another dimension in BigTable data. Every column family may keep multiple versions of column family
data. If an application does not specify a timestamp, it will retrieve the latest version of the column family.
Alternatively, it can specify a timestamp and get the latest version that is earlier than or equal to that timestamp.
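
Two of the properties above, sorted row keys and timestamped versions, can be illustrated with a small sketch. The structures below are hypothetical and are not the BigTable or HBase API: reversing a domain name keeps related rows adjacent in the sort order, and each cell holds a sorted list of (timestamp, value) versions from which we return the newest one at or before the requested time.

import bisect

def row_key(domain):                         # "www.cnn.com" -> "com.cnn.www"
    return ".".join(reversed(domain.split(".")))

def read_cell(versions, ts=None):
    # versions: list of (timestamp, value) pairs sorted by timestamp.
    if ts is None:
        return versions[-1][1]               # latest version
    i = bisect.bisect_right(versions, (ts, chr(0x10FFFF)))
    return versions[i - 1][1] if i else None

print(sorted(map(row_key, ["www.cnn.com", "money.cnn.com", "www.bbc.co.uk"])))
versions = [(10, "v1"), (20, "v2"), (35, "v3")]
print(read_cell(versions), read_cell(versions, ts=25))   # 'v3' 'v2'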
Columns and column families

Figure 1. BigTable column families and columns


Let's look at a sample slice of a table that stores web pages (this example is from Google's paper on BigTable).
The row key is the page URL, for example "com.cnn.www". Various attributes of the page are stored in column families.
A contents column family contains page contents (there are no columns within this column family). A language column
family contains the language identifier for the page. Finally, an anchor column family contains the text of various
anchors from other web pages. The column name is the URL of the page making the reference. These three column
families underscore a few points. A column may be a single short value, as seen in the language column family. This is
our classic database view of columns. In BigTable, however, there is no type associated with the column. It is just a
bunch of bytes. The data in a column family may also be large, as in the contents column family.
The anchor column family illustrates the extra hierarchy created by having columns within a column family. It
also illustrates the fact that columns can be created dynamically (one for each external anchor), unlike column families.
Finally, it illustrates the sparse aspect of BigTable. In this example, the list of columns within the anchor column family
will likely vary tremendously for each URL. In all, we may have a huge number (e.g., hundreds of thousands or millions)
of columns but the column family for each row will have only a tiny fraction of them populated. While the number of
column families will typically be small in a table (at most hundreds), the number of columns is unlimited.
Rows and partitioning
A table is logically split among rows into multiple subtables called tablets. A tablet is a set of consecutive rows
of a table and is the unit of distribution and load balancing within BigTable. Because the table is always sorted by row,
reads of short ranges of rows are efficient: one typically communicates with a small number of machines. Hence, a key
to ensuring a high degree of locality is to select row keys properly (as in the earlier example of using domain names in
reverse order).
Timestamps
Each column family cell can contain multiple versions of content. For example, in the earlier example, we may
have several timestamped versions of page contents associated with a URL. Each version is identified by a 64-bit
timestamp that either represents real time or is a value assigned by the client. Reading column data retrieves the most
recent version if no timestamp is specified, or the latest version that is earlier than a specified timestamp.
A table is configured with per-column-family settings for garbage collection of old versions. A column family can be
defined to keep only the latest n versions or to keep only the versions written since some time t.

Implementation
BigTable comprises a client library (linked with the user's code), a master server that coordinates activity, and many
tablet servers. Tablet servers can be added or removed dynamically.
The master assigns tablets to tablet servers and balances tablet server load. It is also responsible for garbage collection
of files in GFS and managing schema changes (table and column family creation).
Each tablet server manages a set of tablets (typically 10-1,000 tablets per server). It handles read/write requests to the
tablets it manages and splits tablets when a tablet gets too large. Client data does not move through the master; clients
communicate directly with tablet servers for reads/writes. The internal file format for storing data is Google's SSTable,
which is a persistent, ordered, immutable map from keys to values.
BigTable uses the Google File System (GFS) for storing both data files and logs. A cluster management system contains
software for scheduling jobs, monitoring health, and dealing with failures.
Chubby
Chubby is a highly available and persistent distributed lock service that manages leases for resources and stores
configuration information. The service runs with five active replicas, one of which is elected as the master to serve
requests. A majority must be running for the service to work. Paxos is used to keep the replicas consistent. Chubby
provides a namespace of files & directories. Each file or directory can be used as a lock.
In BigTable, Chubby is used to:

ensure there is only one active master


store the bootstrap location of BigTable data
discover tablet servers
store BigTable schema information
store access control lists

Startup and growth

Figure 2. BigTable indexing hierarchy


A table starts off with just one tablet. As the table grows, it is split into multiple tablets. By default, a table is split at
around 100 to 200 MB.

Locating rows within a BigTable is managed in a three-level hierarchy. The root (top-level) tablet stores the locations of
all the tablets of a special Metadata table. Each Metadata tablet contains the locations of user data tablets. Each Metadata
row is keyed by an encoding of a tablet's table ID and its end row. For efficiency, the client library caches
tablet locations.
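
A sketch of the three-level lookup follows; read_row is a hypothetical helper, and the real client library also prefetches and caches locations aggressively.

# Sketch of BigTable's three-level tablet location lookup.
def locate_tablet(root_tablet, table_id, row, read_row):
    # 1. The root tablet tells us which Metadata tablet covers (table_id, row).
    metadata_tablet = read_row(root_tablet, (table_id, row))
    # 2. That Metadata tablet tells us which tablet server holds the user tablet.
    user_tablet_location = read_row(metadata_tablet, (table_id, row))
    # 3. The client then talks to that tablet server directly for the data.
    return user_tablet_location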
A tablet is assigned to one tablet server at a time. Chubby keeps track of tablet servers. When a tablet server starts, it
creates and acquires an exclusive lock on a uniquely-named file in a Chubby servers directory. The master monitors this
directory to discover new tablet servers. When the master starts, it:

grabs a unique master lock in Chubby (to prevent multiple masters from starting)
scans the servers directory in Chubby to find live tablet servers
communicates with each tablet server to discover what tablets are assigned to each server
scans the Metadata table to learn the full set of tablets
builds a set of unassigned tablets, which are eligible for tablet assignment

Replication
A BigTable can be configured for replication to multiple BigTable clusters in different data centers to ensure availability.
Data propagation is asynchronous and results in an eventually consistent model.
References
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra,
Andrew Fikes, Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, Google, Inc. OSDI 2006
The definitive paper on BigTable
Robin Harris, Google's Bigtable Distributed Storage System, StorageMojo.com
Understanding HBase and BigTable, Jumoojw.com

13.Distributed Lookup Services


Distributed Hash Tables
Distributed Lookup Services deal with the problem of locating data that is distributed among a collection of
machines. In the general case, a lookup service may involve full-content searching or a directory-services or structured
database query approach of finding data records that match multiple attributes.
We limit the problem to the common task of looking up data that is associated with a specific, unique search
key rather than, for instance, locating all items whose content contains some property (e.g., locate all files that contain
the word polypeptide). The unique key with which data is associated may be a file name, shopping cart ID, session ID,
or user name. Using the key, we need to find out on which node (of possibly thousands of nodes) the data is located.
Ideally, the machines involved in managing distributed lookup are all cooperating peers and there should be no central state that needs to be maintained. At any time, some nodes may be unavailable. The challenge is to find a way to
locate a node that stores the data associated with a specific key in a scalable, distributed manner.
There are three basic approaches we can take to locating such data:
1. Central coordinator. This uses a server that is in charge of locating resources that are distributed among a collection of servers. Napster is a classic example of this. The Google File System (GFS) is another.
2. Flooding. This relies on sending queries to a large set of machines in order to find the node that has the data we need. Gnutella is an example of this for peer-to-peer file sharing.
3. Distributed hash tables. This technique is based on hashing the key to locate the node that stores the associated data. There are many examples of this, including Chord, Amazon Dynamo, CAN, and Tapestry. We will focus on CAN, Chord, and Amazon Dynamo.

Central coordinator
A central server keeps a database of key-to-node mappings. Other nodes in the system store (key, value) sets. A
query to the central server identifies the node (or, for redundancy, nodes) that hosts the data for the desired key. A
subsequent query to that node will return the associated value.
Napster, the original peer-to-peer file sharing service, is an example of this model. The service was focused on music
sharing and was in operation from 1999 through 2001, when it was shut down for legal reasons. The central server
holds an index of all the music (e.g., song names) with pointers to machines that host the content.
GFS, the Google File System, also implements a central coordinator model. All metadata, including file names,
is managed by the master while data is spread out among chunkservers. A distinction is that the contents of each file
are broken into chunks and distributed among multiple chunkservers. In Napster, each server held complete files.
The advantage of this model is that it is simple and easy to manage. As the volume of queries increases, the central
server can become a bottleneck. With GFS, this issue was ameliorated by catering to an environment of huge files where
the ratio of lookups to data reads was exceptionally low. The central server is also crucial to the operation of the entire
system. If the server cannot be contacted, then the entire service is dead. In the case of Napster, the entire service was
shut down simply by shutting down the server.
Flooding
For flooding, each node is aware of a set of other nodes that are a subset of the entire set of nodes. This makes up an
overlay network. An overlay network is a logical network that is built on top of another network. For our needs, an


overlay network refers to a group of nodes where each node has limited information about the overall network topology
and uses other nodes (typically neighboring nodes) to route requests.
Figure 1. Flooding in an overlay network
Figure 2. Back propagation
With flooding, a node that needs to find the content corresponding to a certain key will contact peers when looking for
content. If a peer node has the needed content, it can respond to the requestor. If not, it will then forward the request
to its peers. As long as any peer does not have the requested content, it will forward the request onward to its peers
(Figure 1). If a node has the needed content, it will respond to the node from which it received the request in a process
called back propagation. This chain of responses is followed back to the originator (Figure 2). For implementations
where the data associated with a key is likely to be large (e.g., more than a packet), the response will contain the address
of a node. The originator, having now obtained the node's address, can then connect to the node directly and request the data corresponding to the key.
To keep messages from looping or propagating without limit, a time-to-live (TTL) count is often associated with the message. The TTL is decremented at each hop and the message is discarded when it drops below zero (Figure 1).
This flooding approach is the model that Gnutella, a successor to Napster, used for distributed file sharing.
Disadvantages to flooding include a potentially large number of messages on the network and a potentially large
number of hops to find the desired key.
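To make the mechanism concrete, the following is a minimal Python sketch of flooding with a TTL and back propagation; the Node class, its neighbor list, and the request IDs are hypothetical, and the recursive call chain stands in for messages flowing out and responses propagating back.

# Sketch of flooding with a TTL and back propagation over an overlay (hypothetical Node class).
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []   # overlay neighbors (other Node objects)
        self.store = {}       # (key, value) pairs held locally
        self.seen = set()     # request IDs already handled, so loops and duplicates are dropped

    def lookup(self, key, request_id, ttl):
        """Return (node_name, value) if some reachable node holds the key, else None."""
        if request_id in self.seen or ttl < 0:
            return None                               # duplicate or expired request: discard
        self.seen.add(request_id)
        if key in self.store:
            return (self.name, self.store[key])       # found locally: answer flows back to the caller
        for peer in self.neighbors:
            result = peer.lookup(key, request_id, ttl - 1)   # forward with a decremented TTL
            if result is not None:
                return result                         # back propagation toward the originator
        return None

In a real system the originator would then contact the named node directly to fetch a large object, as described above.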

Hash tables
Before we jump into distributed hash tables, let us refresh our memory of hash tables.


In non-distributed systems, a hash function (the non-cryptographic kind) maps a key (the thing you are searching for) to a number in some range 0 … n−1. The content is then accessed by indexing into a hash table, looking up the value at table[hash(key)]. The appeal of hash tables is that you can often realize close to O(1) performance in lookups, compared to O(log N) for trees or sorted tables, or O(N) for an unsorted list. Considerations in implementing a hash table include the following:
1. Picking a good hash function. We want to ensure that the function will yield a uniform distribution for all values of keys throughout the hash table instead of clustering a large chunk of values in specific parts of the table.
2. Handling collisions. There is a chance that two keys will hash to the same value, particularly for smaller tables. To handle this, each entry of the table (table[i]) represents a bucket, or slot, that contains a collection of (key, value) pairs. Within each bucket, one can use chaining (a linked list) or another layer of hashing.
3. Growth and shrinkage of the table. If the size of the table changes, existing (key, value) sets will have to be rehashed and, if necessary, moved to new slots. Since a hash function is often a mod N function (where N is the table size), this means that, in many cases, a large percentage of the data will need to be relocated.
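As a concrete (non-distributed) refresher, a minimal hash table with chaining might look like the following sketch; the bucket count and the use of Python's built-in hash are illustrative choices.

# Minimal hash table with chaining: hash(key) % N selects a bucket, collisions share a bucket.
class HashTable:
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # key already present: overwrite its value
                return
        bucket.append((key, value))        # collision handling by chaining within the bucket

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

Note that growing self.buckets would change the modulus and force most entries to be rehashed, which is exactly the problem consistent hashing (discussed below) addresses in the distributed setting.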

Distributed Hash Tables


In a distributed implementation, known as a distributed hash table, or DHT, the hash table becomes a logical construct
for (key, data) pairs that are distributed among a set of nodes. Each node stores a portion of the key space. The goal of
a DHT is to find the node that is responsible for holding data associated with a particular key.
A key difference between DHTs and the centralized or flooding approaches is that a specific (key, value) set is not placed
on an arbitrary node but rather on a node that is identified in some way by the hash of the key.
Some challenges with distributed hashing are:

How do we partition the (key, data) sets among the group of nodes? That is, what sort of hash function do we use and how do we use its results to locate the node holding the data that we want?
How do we build a decentralized system so there is no coordinator?
How can the system be designed to be scalable? There are two aspects to scalability. One is performance. We'd like to avoid flooding or having an algorithm that requires traversing a large number of nodes in order to get the desired results. The other aspect is the ability to grow or shrink the system as needed. We would like to be able to add additional nodes to the group as the data set gets larger and, perhaps, remove nodes as the data set shrinks. We'd like to do this without rehashing a large portion of the key set.
How can the system be designed to be fault tolerant? This implies replication, and we need to know where to find the replicated data and what assumptions to make about its consistency.

We will now take a look at two approaches to DHTs:
1. CAN, a Content-Addressable Network
2. Chord
We will then follow up with a look at Amazon's Dynamo, a production-grade approach to implementing a DHT modeled on Chord.
CAN (Content-Addressable Network)


Figure 3. A node in a CAN grid


Think of a grid and two separate hash functions hx(key) and hy(key), one for each dimension of the grid. The key is hashed with both of them: i = hx(key) gives you the x coordinate and j = hy(key) produces the y coordinate. Each node in the group is also mapped onto this logical grid and is responsible for managing values within a rectangular sub-grid, called a zone: that is, some range (xa..xb, ya..yb). See Figure 3. The node responsible for the location (i, j) stores (key, V), the key and its value, as long as xa ≤ i < xb and ya ≤ j < yb.

Figure 4. Two zones in a CAN grid

Figure 5. Three zones in a CAN grid

Initially, a system can start with a single node and, hence, a single zone. Any zone can be split in two either horizontally or vertically. For example, Figure 4 shows a grid split into two zones managed by two nodes, n1 and n2. Node n1 is responsible for all (key, value) sets whose x-hashes are less than xmax/2 and node n2 manages all (key, value) sets whose x-hashes are between xmax/2 and xmax. Either of these zones can then be split into two zones. For example (Figure 5), zone n1 can be split into two zones, n0 and n1. These two zones are still responsible for all (key, value) sets whose x-hash is less than xmax/2, but n0 is responsible for those (key, value) sets whose y-hash is less than ymax/2.


Figure 5. Neighboring zones in a CAN grid


A node only knows about its immediate neighbors. For looking up and routing messages to the node that holds the data it needs, it will use neighbors that minimize the distance to the destination. For a two-dimensional grid, a node knows its own minimum and maximum x and y values. If the target x coordinate (the result of the x-hash) is less than the node's minimum x value, the request is passed to the left neighbor; if it is greater than the node's maximum x value, it is passed to the right neighbor. Similarly, if the target y coordinate is greater than the node's maximum y value, the request is passed to the top neighbor; if it is less than the node's minimum y value, it is passed to the bottom neighbor. If both coordinates are out of range, subsequent nodes will take care of the remaining routing. For example, a request that is passed to the top node may be forwarded to the right node if the x coordinate is greater than that node's maximum x value.
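A sketch of this greedy routing decision for a two-dimensional CAN follows; the zone tuple and the neighbor dictionary are hypothetical fields, not part of any particular CAN implementation.

# Greedy routing in a two-dimensional CAN (sketch).
# node.zone = (x_min, x_max, y_min, y_max); node.neighbors maps 'left', 'right', 'bottom', 'top' to nodes.
def route(node, target_x, target_y):
    x_min, x_max, y_min, y_max = node.zone
    if x_min <= target_x < x_max and y_min <= target_y < y_max:
        return node                                        # this node's zone contains the target point
    if target_x < x_min:
        return route(node.neighbors['left'], target_x, target_y)
    if target_x >= x_max:
        return route(node.neighbors['right'], target_x, target_y)
    if target_y >= y_max:
        return route(node.neighbors['top'], target_x, target_y)
    return route(node.neighbors['bottom'], target_x, target_y)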
A new node is inserted by the following process:
pick a random pair of values in the grid: (p, q)
contact some node in the system and ask it to look up the node that is responsible for (p, q)
negotiate with that node to split its zone in half; the new node will own half of the area

We discussed a CAN grid in two dimensions, which makes it easy to diagram and visualize, but CAN can be deployed for an arbitrary number of dimensions. For d dimensions, each node has to keep track of 2d neighbors. CAN is highly scalable, although the hop count to find the node hosting an arbitrary (key, value) pair does increase with the number of nodes in the system. It has been shown that the average route for a two-dimensional CAN grid is O(sqrt(n)) hops, where n is the number of nodes in the system.
To handle failure, we need to add a level of indirection: a node needs to know its neighbors' neighbors. If a node fails, one of the node's neighbors will take over the failed zone. For this to work, data has to be replicated onto that neighbor during any write operation while the node is still up.


Consistent hashing

Figure 6. Consistent hashing


Before going on to the next DHT, we will detour to describe consistent hashing. Most hash functions will require
practically all keys in the table to be remapped if the table size changes. For a distributed hash table, this would mean
that the (key, value) sets would need to be moved from one machine to another. With consistent hashing, only k/n keys
will need to be remapped on average, where k is the number of keys and n is the number of slots, or buckets, in the
table. What this means in a distributed hash table is that most (key, value) sets remain untouched. Only those from a
node that is split into two nodes or two nodes that are combined into one node may need to be relocated (Figure 6).
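A minimal consistent-hashing sketch in Python follows; the choice of MD5 truncated to 32 bits and the node names are illustrative only, and the ring is assumed to contain at least one node.

import bisect
import hashlib

# Consistent hashing: nodes and keys are hashed onto the same circular space;
# a key is stored at the first node at or after its hash position (its clockwise successor).
class Ring:
    def __init__(self):
        self.points = []   # sorted node positions on the ring
        self.nodes = {}    # position -> node name

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

    def add_node(self, name):
        pos = self._hash(name)
        bisect.insort(self.points, pos)
        self.nodes[pos] = name

    def lookup(self, key):
        pos = self._hash(key)
        i = bisect.bisect_right(self.points, pos) % len(self.points)   # wrap around past the last node
        return self.nodes[self.points[i]]

When a node is added, only the keys between its predecessor and its own position move to it; all other assignments are unchanged.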
Chord
Figure 7. Logical ring in Chord
Think of a sequence of numbers arranged in a logical ring: 0, 1, 2, ..., n−1, looping back to 0. Each node in the system occupies a position in this ring that is the number you'd get by hashing its IP address and taking the value modulo the size of the ring: hash(IP) mod n. Figure 7 shows a tiny ring of just 16 elements for illustrative purposes. Four nodes are mapped onto this ring at positions 3, 8, 10, and 14. These locations are obtained because the IP address of each node happens to hash to those values. For instance, the IP address of the machine in position 3 hashes to 3. In reality, the hash value for Chord will be a number that is much larger than the number of nodes in the system, making it highly unlikely that two addresses will hash to the same position.
Each node is a bucket for storing a subset of (key, value) pairs. Because not every potential bucket position (hash value of the key) contains a node (most will not), data is assigned to a node based on the hash of the key and is stored at the successor node: the node whose value is greater than or equal to the hash of the key. Looking at the example in Figure 7, consider a key that hashes to 1. Since there is no node in position 1, the key will be managed by the successor node: the first node that we encounter as we traverse the ring clockwise. Node 3 is hence responsible for keys that hash to 15, 0, 1, 2, and 3. Node 8 is responsible for keys that hash to 4, 5, 6, 7, and 8. Node 10 is responsible for keys that hash to 9 and 10. Node 14 is responsible for keys that hash to 11, 12, 13, and 14.


Figure 8. Adding a node in Chord


When a new node joins a network at some position j, where j = hash(node's IP), it will take on some of the keys from its successor node. As such, some existing (key, value) data will have to migrate from the successor node to this new node. Figure 8 shows an example of adding a new node at position 6. This node now manages keys that hash to 4, 5, and 6. These were previously managed by node 8. Conversely, if a node is removed from the set, then all keys managed by that node need to be reassigned to the node's successor.
For routing queries, a node only needs to know of its successor node. Queries can be forwarded through successors
until a node that holds the value is found. This yields an O(n) lookup.

We can optimize the performance and obtain O(1) lookups by having each node maintain a list of all the nodes in the group and know each node's hash value. Finding the node that hosts a specific (key, data) set now becomes a matter of searching the table for the node whose value is the same as hash(key) or is its successor. If a node is added or removed, all nodes in the system need to get this information so they can update their tables.
A compromise between storing the entire list of nodes at every node and knowing only a successor is to use finger tables. A finger table allows each node to store a partial list of nodes but places an upper bound on the size of the table. The i-th entry (counting from zero) in the finger table contains the address of the first node that succeeds the current node by at least 2^i in the circle. What this means is that finger_table[0] contains the node's successor, finger_table[1] contains the first node at least two (2^1) positions ahead, finger_table[2] the first node at least four (2^2) positions ahead, finger_table[3] the first node at least eight (2^3) positions ahead, and so on. The desired successor may not be present in the table, in which case the node forwards the request to the closest preceding node in its list, which will in turn have more knowledge of closer successors. On average, O(log N) nodes need to be contacted to find the node that owns a key.
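A sketch of a finger-table lookup follows; the identifier size m, the node objects, and the fingers list (where fingers[i] is the first node at least 2^i positions ahead) are hypothetical, but the forwarding rule is the one described above.

# Chord-style lookup sketch; identifiers live on a ring of size 2**m.
def in_interval(x, a, b, m):
    """True if x lies in the half-open ring interval (a, b]."""
    a, b, x = a % 2**m, b % 2**m, x % 2**m
    if a < b:
        return a < x <= b
    return x > a or x <= b                  # the interval wraps around position 0

def find_successor(node, key_hash, m):
    if in_interval(key_hash, node.id, node.successor.id, m):
        return node.successor               # the immediate successor owns the key
    # Forward to the closest preceding finger, which has better knowledge of that region of the ring.
    for finger in reversed(node.fingers):
        if in_interval(finger.id, node.id, key_hash - 1, m):
            return find_successor(finger, key_hash, m)
    return find_successor(node.successor, key_hash, m)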
Amazon Dynamo
As an example of a real-world distributed hash table, we will take a look at Amazon Dynamo, which is somewhat
modeled on the idea of Chord. Amazon Dynamo is not exposed as a customer-facing web service but is used to power
parts of Amazon Web Services (such as S3) as well as internal Amazon services. Its purpose is to be a highly available
key-value storage system. Many services within Amazon only need this sort of primary-key access to data rather than the complex querying capabilities offered by a full-featured database. Examples include best seller lists, shopping
carts, user preferences, user session information, sales rank, and parts of the product catalog.


Design goals and assumptions


A full relational database is overkill and limits the scale and availability of the system, given that it is still a challenge to scale or load balance relational database management systems (RDBMS) on a large scale. Moreover, a relational database's ACID guarantees value consistency over availability. Dynamo is designed with a weaker, eventual consistency model in order to provide high availability. Amazon Dynamo is designed to be highly fault tolerant. As in other systems we have looked at, such as GFS and BigTable, something is always expected to be failing in an infrastructure with millions of components.
Applications themselves should be able to configure Dynamo for their desired latency and throughput needs. One can properly balance performance, cost, availability, and durability guarantees for each application. Latency is hugely important in many of Amazon's operations. Amazon measured that every 100 ms of latency costs the company 1% in sales! [1] Because of this, Dynamo is designed so that at least 99.9% of read/write operations can be performed within a few hundred milliseconds. A great way to reduce latency is to avoid routing requests through multiple nodes (as we do with flooding, CAN, and Chord's finger tables). Dynamo's design can be seen as a zero-hop DHT. This is accomplished by having each node be aware of all the other nodes in the group.
Dynamo is designed to provide incremental scalability. A system should be able to grow by adding a node at a time. The
system is decentralized and symmetric: each node has the same programming interface and set of responsibilities.
There is no coordinator. However, because some servers may be more powerful than others, the system should support
workload partitioning in proportion to the capabilities of servers. For instance, a machine that is faster or has twice as
much storage may be configured to be responsible for managing twice as many keys as another machine.
Dynamo provides two basic operations: get(key) and put(key, data). The data is an arbitrary binary object that is
identified by a unique key. These objects tend to be small, typically under a megabyte. Dynamo's interface is a simple, highly available key-value store. This is far more basic than Google's BigTable, which offers a column store that manages column families and the columns within them, and also allows the programmer to iterate over a sorted sequence of keys. Because Dynamo is designed to be highly available, updates are not rejected even in the presence of
network partitions or server failures.
Storage and retrieval
As we saw in the last section, the Dynamo API provides two operations to the application. Get(key) returns the object
associated with the given key or a list of objects if there are conflicting versions. It also returns a context that serves as
a version. The user will pass this to future put operations to allow the system to keep track of causal relationships.
Put(key, value, context) stores a (key, value) pair and creates any necessary replicas for redundancy. The context encodes the version; it is obtained from a previous related get operation and is otherwise opaque to the application. The
key is hashed with an MD5 hash function to create a 128-bit identifier that is used to determine the storage nodes that
serve the key.
A key to scalability is being able to break up data into chunks that can be distributed over all nodes in a group of servers.
We saw this in Bigtable's tablets, MapReduce's partitioning, and GFS's chunkservers. Dynamo is also designed to be
scalable to a huge number of servers. It relies on consistent hashing to identify which nodes hold data for a specific
key and constructs a logical ring of nodes similar to Chord.
Every node is assigned a random value in the hash space (i.e., some 128-bit number). This becomes its position in the ring. The node is then responsible for managing the data for all keys that hash to values between its own value and its predecessor's value. Conceptually, one would hash the key and then walk the ring clockwise to find the first node whose position is greater than or equal to that hash. Adding or removing a node affects only the immediate neighbors of that node: the new node takes over values that were managed by its successor.


Virtual nodes

Figure 9. Virtual nodes in Dynamo


Unlike Chord, a physical node (machine) is assigned to multiple points in the logical ring. Each such point is called a
virtual node. Figure 9 shows a simple example of two physical nodes where Node A has virtual nodes 3, 8, and 14 and
Node B has virtual nodes 1 and 10. As with Chord, each key is managed by the successor node.
The advantage of virtual nodes is that we can balance the load distribution of the system. If any node becomes unavailable and its neighbors take over, the load is evenly dispersed among the available nodes. If a new node is added, it results in the addition of multiple virtual nodes that are scattered throughout the ring and will thus take on load from multiple nodes rather than from a single neighboring node. Finally, the number of virtual nodes that a machine hosts can be based on the capacity of that machine: a bigger, faster machine can be assigned more virtual nodes.
Replication
Data is replicated onto N nodes, where N is a configurable number. The primary node is called the coordinator node and is the node to which the key hashes (the successor node, as described for Chord). This coordinator is in charge of replicating the data and replicates it at each of the N−1 clockwise successor nodes in the ring. Hence, if any node is unavailable, the system needs to look for the next available node clockwise in the ring to find a replica of that data.
The parameter for the degree of replication is configurable, as are other values governing the availability of nodes for get and put operations. The minimum number of nodes that must participate in a successful get operation and the minimum number of nodes that must participate in a successful put operation are both configurable. If a node was unreachable for replication during a put operation, the replica is sent to another node in the ring along with metadata identifying the originally intended destination. Periodically, this node checks whether the originally targeted node is alive. If so, it transfers the object to that node. If necessary, it may also delete its copy of the object to keep the number of replicas in the system at the required amount. To account for data center failures, each object is replicated across multiple data centers.
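A sketch of building the list of replicas for a key follows, extending the Ring sketch shown earlier for consistent hashing; with virtual nodes, several ring positions map to the same physical machine, so the walk skips duplicates. The field names are hypothetical.

import bisect

# Walk clockwise from the key's hash position and collect N distinct physical nodes (sketch).
def preference_list(ring, key, n_replicas):
    pos = ring._hash(key)
    start = bisect.bisect_right(ring.points, pos)
    chosen = []
    for i in range(len(ring.points)):
        p = ring.points[(start + i) % len(ring.points)]
        physical = ring.nodes[p]          # virtual node position -> physical machine name
        if physical not in chosen:
            chosen.append(physical)       # the first entry acts as the coordinator for this key
        if len(chosen) == n_replicas:
            break
    return chosen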

Consistency and versioning


We have seen that consistency is at odds with high availability. Because Dynamo's design values high availability, it uses optimistic replication techniques that result in an eventually consistent model. Changes to replicas are propagated in the background. This can lead to conflicting data (for example, in the case of a temporary network partition and two writes, each applied to a different side of the partition). The traditional approach to resolving such conflicts is during a write operation: a write request is rejected if the node cannot reach a majority of (or, in some cases, all) replicas.


Dynamo's approach is more optimistic: it resolves conflicts during a read operation. The highly available design attempts to provide an always-writable data store where read and write operations can continue even during network partitions. The rationale for this is that rejecting customer-originated updates will not make for a good user experience. For instance, a customer should always be able to add or remove items in a shopping cart, even if some servers are unavailable.
Given that conflicts can arise, the question then is how to resolve them. Resolution can be done either by the data store system (Dynamo) or by the application. If we let the data store do it, we have to realize that it has minimal information. It has no knowledge of the meaning of the data, only that some arbitrary data is associated with a particular key. Because of this, it can offer only simple policies, such as last write wins. If, on the other hand, we present the set of conflicts to an application, it is aware of the structure of the data and can implement application-aware conflict resolution. For example, it can merge multiple shopping cart versions to produce a unified shopping cart. Dynamo offers both options. Application-based reconciliation is the preferred choice, but the system can fall back on a Dynamo-implemented last write wins if the application does not want to bother with reconciling the data.
The context that is passed to put operations and obtained from get operations is a vector clock. It captures the causal
relations between different versions of the same object. The vector clock is a sequence of <node, counter> pairs of values.
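A sketch of vector-clock bookkeeping as <node, counter> pairs follows; comparing two clocks tells Dynamo whether one version supersedes the other or whether they are concurrent and need reconciliation. The dictionary representation is an illustrative choice.

# Vector clocks represented as {node: counter} dictionaries (sketch).
def increment(clock, node):
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a, b):
    """True if version a is equal to or newer than version b on every component."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def conflicting(a, b):
    # Neither clock dominates the other: the versions were written concurrently.
    return not descends(a, b) and not descends(b, a)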
Storage nodes
Each node in Dynamo has three core functions.
1. Request coordination. The coordinator is responsible for executing get/put (read/write) requests on behalf of requesting clients. A state machine contains all the logic for identifying the nodes that are responsible for managing a key, sending requests to those nodes, waiting for responses, processing retries, and packaging the response for the application. Each instance of the state machine manages a single request.
2. Membership. Each node is aware of all the other nodes in the group and may detect the failure of other nodes. It is prepared to receive write requests that contain metadata informing the node that another node was down and needs to get a replica of the data when it is available once again.
3. Local persistent storage. Finally, each node manages a portion of the global (key, value) space and hence needs to store keys and their associated values. Dynamo provides multiple storage back ends depending on application needs. The most popular is the Berkeley Database (BDB) Transactional Data Store. Alternatives include the Berkeley Database Java Edition, MySQL (useful for large objects), and an in-memory buffer with a persistent backing store (for high performance).

References

Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan, Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, SIGCOMM '01, August 27-31, 2001, San Diego, California, USA. Copyright 2001 ACM.
Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker, A Scalable Content-Addressable Network, SIGCOMM '01: Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, pages 161-172. Copyright 2001 ACM.
Sylvia Paul Ratnasamy, A Scalable Content-Addressable Network, PhD Thesis, University of California at Berkeley, Fall 2002.
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, Dynamo: Amazon's Highly Available Key-value Store, SOSP '07, October 14-17, 2007, Stevenson, Washington, USA. Copyright 2007 ACM.


14.Fault Tolerance
In previous lectures we've mentioned that one of the reasons that distributed systems are different from (and more complicated than) nondistributed systems is the partial failure of system components. We've mentioned that dependability is an important challenge in designing and building distributed systems and that the presence of failure often makes achieving transparency (e.g., for RPC) difficult if not impossible. In this lecture we take a deeper look into the concept of failure and how failure is dealt with in distributed systems.
Dependability
Dependability is the ability to avoid service failures that are more frequent or severe than desired. A key requirement of most systems is to provide some level of dependability. A dependable system has the following properties.
Availability: refers to the probability that the system is operating correctly at any given moment and is available to perform its functions. Availability is often given as a percentage; e.g., 99.9999% availability means that 99.9999% of the time the system will be operating correctly. Over a year this amounts to less than a minute of downtime.
Reliability: refers to the ability of the system to run continuously without failure. Note that this is different from availability. A 99.9999% available system could go down for a millisecond every hour. Although the availability would be high, the reliability would be low.
Safety: refers to the fact that when a system (temporarily) fails to operate correctly, nothing catastrophic happens. For example, the controller of a nuclear power plant requires a high degree of safety.
Maintainability: refers to how easily a failed system can be repaired. This is especially useful if automatic recovery from failure is desired and can lead to high availability.
Integrity and Confidentiality: these are related to security and will be dealt with in a separate lecture.
Building a dependable system comes down to preventing failure.
Faults and Failures
In the following discussion of failure and fault tolerance we use the following terminology:
A system is a set of hardware and software components designed to provide a specific service. Its components may also
be (sub)systems.
A failure of a system occurs when the system fails to meet its promises or does not perform its services in the specified
manner. Note that this implies a specification or understanding of what correct services are.
An erroneous state is a state which could lead to a system failure by a sequence of valid state transitions.
An error is a part of the system state which differs from its intended value. An error is a manifestation of a fault in the
system, which could lead to system failure.
A fault is an anomalous condition. Faults can result from design errors, manufacturing faults, deterioration, or external
disturbance.
Failure recovery is the process of restoring an erroneous state to an error-free state.
A good overview of the relation between faults, errors, and failures can be found in [ALRL04].


Faults
Failures are caused by faults. In order to better understand failures, and to better understand how to prevent faults
from leading to failure we look at the different properties of faults. We distinguish between three categories of faults.
Transient faults are those that occur once and never reoccur. These are typically caused by external disturbances,
for example, wireless communication being interrupted by external interference, such as a bird flying through a
microwave transmission. Intermittent faults are those that reoccur irregularly. Typically these kinds of faults occur repeatedly, then vanish for a while, then reoccur again. They are often caused by loose contacts, or by non-deterministic software bugs such as race conditions. Finally, permanent faults are those that persist until the faulty component is
replaced. Typical examples of permanent faults are software bugs, burnt out chips, crashed disks, etc.
Faults are typically dormant until they are activated by some event causing them to create an error.
When systems rely on other (sub)systems, faults can propagate, causing failures in multiple systems. For example,
a fault in a disk can cause the disk to fail by returning incorrect data. A service using the disk sees the disk failure as a
fault that causes it to read an incorrect value (i.e., an error), which in turn causes it to provide an incorrect service (e.g.,
return an incorrect reply to a database query).
Failures
There are different kinds of failures that can occur in a distributed system. The important types of failure are:
System failure: the processor may fail to execute due to software faults (e.g., an OS bug) or hardware faults (affecting the CPU, main memory, the bus, the power supply, a network interface, etc.). Recovery from such failures generally
involves stopping and restarting the system. While this may help with intermittent faults, or failures triggered by
a specific and rare combination of circumstances, in other cases it may lead to the system failing again at the same
point. In such a case, recovery requires external interference (replacing the faulty component) or an automatic
reconfiguration of the system to exclude the faulty component (e.g. reconfiguring the distributed computation to
continue without a faulty node).
Process failure: a process proceeds incorrectly (owing to a bug or consistency violation) or not at all (due to deadlock,
livelock or an exception). Recovery from process failure involves aborting or restarting the process.
Storage failure: some (supposedly) stable storage has become inaccessible (typically a result of hardware failure).
Recovery involves rebuilding the devices data from archives, logs, or mirrors (RAID).
Communication failure: communication fails due to a failure of a communications link or intermediate node. This leads
to corruption or loss of messages and may lead to partitioning of the network. Recovery from communication
medium failure is difficult and sometimes impossible.
In nondistributed systems a failure is almost always total, which means that all parts of the system fail. For example,
if the operating system that an application runs on crashes, then the whole application goes down with the operating
system. In a distributed system this is rarely the case. It is more common that only a part of the system fails, while the
rest of the system components continue to function normally. For example, in a distributed shared memory system, the
network link between one of the servers and the rest of the servers may fail. Despite the network link failing the rest of
the components (all the processes, and the rest of the network links) continue to work correctly. This is called partial
failure. In this lecture we concentrate on the problems caused by partial failures in distributed systems.
There are a number of different ways that a component in a distributed system can fail. The way in which a
component fails often determines how difficult it is to deal with that failure and to recover from it.
Crash Failure: a server halts, but works correctly until it halts
  Fail-Stop: the server stops in a way that allows clients to tell that it has halted
  Fail-Silent: clients do not know that the server has halted
Omission Failure: a server fails to respond to incoming requests
  Receive Omission: fails to receive incoming messages
  Send Omission: fails to send messages


Timing Failure: a server's response lies outside the specified time interval
Response Failure: a server's response is incorrect
  Value Failure: the value of the response is wrong
  State Transition Failure: the server deviates from the correct flow of control
Arbitrary Failure: a server may produce arbitrary responses at arbitrary times (also known as Byzantine failure)
Fault Tolerance
The topic of fault tolerance is concerned with being able to provide correct services, even in the presence of faults. This
includes preventing faults and failures from affecting other components of the system, automatically recovering from
partial failures, and doing so without seriously affecting performance.
Failure Masking
One approach to dealing with failure is to hide the occurrence of failures from other processes. The most common approach
to such failure masking is redundancy. Redundancy involves duplicating components of the system so that if one fails,
the system can continue operating using the nonfailed copies. There are actually three types of redundancy that can be
used. Information redundancy involves including extra information in data so that if some of that data is lost, or modified,
the original can be recovered. Time redundancy allows actions to be performed multiple times if necessary. Time
redundancy is useful for recovering from transient and intermittent faults. Finally physical redundancy involves
replicating resources (such as processes, data, etc.) or physical components of the system.
Process Resilience
Protection against the failure of processes is generally achieved by creating groups of replicated processes. Since all the
processes perform the same operations on the same data, the failure of some of these processes can be detected and
resolved by simply removing them from the group, continuing execution with only the nonfailed processes. Such an
approach requires the group membership to be dynamic, and relies on the presence of mechanisms for managing the
groups and group membership. The group is generally transparent to its users, that is, the whole group is dealt with as
a single process.
Process groups can be implemented hierarchically or as a flat space. In a hierarchical group a coordinator makes all decisions and the other members follow the coordinator's orders. In a flat group the processes must make decisions collectively. A benefit of the flat group approach is that there is no single point of failure; however, the decision making process is more complicated. In a hierarchical group, on the other hand, the decision making process is much simpler, but the coordinator forms a single point of failure.
Process groups are typically modelled as replicated state machines [Sch90]. In this model every replica process that
is part of the group implements an identical state machine. The state machine transitions to a new state whenever it
receives a message as input, and it may send one or more messages as part of the state transition. Since the state
machines are identical, any process executing a given state will react to input messages in the same way. In order for
the replicas to remain consistent they must perform the same state transitions in the same order. This can be achieved
by ensuring that each replica receives and processes input messages in exactly the same order: the replicas must all
agree on the same order of message delivery, which requires consensus. The result is a deterministic set of replica
processes, and all correct replicas are assured to produce exactly the same output.
A group of replicated processes is k-fault tolerant if it can survive k faults and still meet its specifications. With fail-stop semantics, k + 1 replicas are sufficient to survive k faults; however, with arbitrary failure semantics, 2k + 1 processes are required. This is because a majority (k + 1) of the processes must provide the correct result even if the k failing processes all manage to provide the same incorrect result.
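A toy sketch of the arbitrary-failure case: with 2k + 1 replicas a client accepts a result only once k + 1 identical replies have arrived, so k Byzantine replicas can never outvote the correct ones. The voting helper is purely illustrative.

from collections import Counter

def vote(replies, k):
    """Accept a value only if at least k+1 of the collected replies agree on it."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= k + 1 else None     # no majority yet: keep waiting for more replies

# Example with k = 1: three replicas, one of which returns a wrong answer.
print(vote([42, 42, 17], k=1))   # prints 42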


Consensus
As we have seen previously, consensus (or agreement) is often needed in distributed systems. For example, processes
may need to agree on a coordinator, or they may need to agree on whether to commit a transaction or not. The consensus algorithms we've looked at previously all assumed that there were no faults: no faulty communication and no faulty processes. In the presence of faults we must ensure that all nonfaulty processes reach and establish consensus within a
finite number of steps.
A correct consensus algorithm will have the following properties:
Agreement: all processes decide on the same value
Validity: the decided value was proposed by one of the processes
Termination: all processes eventually decide.
We will look at the problem of reaching consensus in synchronous and asynchronous systems. We start with
synchronous systems which assume that execution time is bounded (i.e., processes can execute in rounds), and that
communication delay is bounded (i.e., timeouts can be used to detect failure).
We first look at the problem of reaching consensus with nonfaulty processes, but unreliable communication. The
difficulty of agreement in this situation is illustrated by the two-army problem (Figure 1). In this problem the two blue
armies must agree to attack simultaneously in order to defeat the green army. Blue army 1 plans an attack at dawn and
informs blue army 2. Blue army 2 replies to blue army 1 acknowledging the plans. Both armies now know that the plan
is to attack at dawn. However blue army 2 does not know whether blue army 1 received its acknowledgment. It reasons
that if blue army 1 did not receive the acknowledgment, it will not know whether blue army 2 received the original
message and will therefore not be willing to attack. Blue army 1 knows that blue army 2 may be thinking this, and
therefore decides to acknowledge blue army 2's acknowledgment. Of course blue army 1 does not know whether blue army 2 received the acknowledgment of the acknowledgment, so blue army 2 has to return an acknowledgment of the
acknowledgment of the acknowledgment. This process can continue indefinitely without both sides ever reaching
consensus.

Figure 1: The two-army problem (two blue armies of 3000 troops each must coordinate an attack on a green army of 5000)


A different problem involves reliable communication, but faulty (byzantine) processes. This problem is illustrated
by the Byzantine generals problem, where a group of generals must come to consensus about each other's troop strengths. The problem is that some of the generals are traitors and will lie in order to prevent agreement. Lamport et
al. devised a recursive algorithm to solve this problem. It was also proven that in a system with k faulty (byzantine)
processes, agreement can only be achieved if there are 2k + 1 nonfaulty processes [LSP82].
In an asynchronous system the temporal assumptions are dropped: execution is no longer bounded, and
communication delay is not bounded (so timeouts cannot be used to detect failure). It turns out that in an asynchronous
distributed system it is impossible to create a totally correct consensus algorithm that can tolerate the failure of even
one process. What this means is that no algorithm can guarantee correctness in every scenario if one or more processes


fail. This was proven by Fischer, Lynch, and Paterson [FLP85] in 1985. In practice, however, we can get algorithms that
are good enough. An example of one such algorithm is Paxos.
The Paxos consensus algorithm [Lam98, Lam01] is currently the best-known algorithm for consensus in real distributed systems, and is used to implement highly scalable and reliable services such as Google's Chubby lock server. The goal of Paxos is for a group of processes to agree on a value, and to make that value known to any interested parties. The algorithm requires a leader process to be elected and proceeds in two phases. In the first phase, the leader (called the Proposer) sends a proposal to all the other processes (called Acceptors). A proposal contains a monotonically increasing sequence number and a proposed value. Each acceptor receives the proposal and decides whether to accept it or not. The proposal is rejected if its sequence number is lower than the sequence number of a proposal the acceptor has already responded to in the past. Otherwise, the proposal is accepted, and the acceptor sends back a promise message containing that acceptor's most recently accepted value (if any; otherwise no value is returned in the message).
In the second phase, the proposer waits until it has received a reply from a majority of the acceptors. Given all the
replies, the proposer chooses a value to propose as follows: if any of the promises contained a value, then the proposer
must choose the highest of these values (that is, the value associated with the highest sequence number), otherwise it
is free to choose an arbitrary value. The proposer sends its chosen value in an accept message to all the acceptors. Upon
receiving this message each acceptor checks whether it can still accept the value. It will accept the value unless it has
subsequently sent a promise with a higher sequence number to another proposer. The acceptor replies to the proposer
with an accepted message. Once the proposer has received accepted messages from a majority of the acceptors, it knows
that the proposed value has been agreed upon and can pass this on to any other interested parties.
Paxos can tolerate the failure of up to half of the acceptor processes as well as the failure of the proposer process
(when a proposer fails, a new one can be elected and can continue the algorithm where the old one left off). If multiple
proposers are simultaneously active, it is possible for Paxos to enter a state known as dueling proposers which could
continue without progress indefinitely. In this way Paxos chooses to sacrifice termination rather than agreement in the
face of failure.
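A minimal sketch of the acceptor side of the two phases described above follows; the message formats and field names are invented for illustration, and a real implementation would also have to write this state to stable storage before replying.

# Paxos acceptor sketch: remembers the highest proposal number promised and the last accepted proposal.
class Acceptor:
    def __init__(self):
        self.promised = -1       # highest proposal number this acceptor has promised
        self.accepted_n = -1     # number of the most recently accepted proposal
        self.accepted_v = None   # value of the most recently accepted proposal

    def on_prepare(self, n):
        if n > self.promised:
            self.promised = n
            # Promise not to accept lower-numbered proposals; report any previously accepted value.
            return ('promise', self.accepted_n, self.accepted_v)
        return ('reject', self.promised)

    def on_accept(self, n, value):
        if n >= self.promised:   # no higher-numbered promise was made in the meantime
            self.promised = n
            self.accepted_n = n
            self.accepted_v = value
            return ('accepted', n)
        return ('reject', self.promised)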
Implementing this algorithm in real systems turns out to be more difficult than the simple description of it in
literature would imply. An interesting account of the challenges faced while implementing a version of Paxos is given
by Chandra et al. [CGR07].
Reliable Communication
Besides processes the other part of a distributed system that can fail is the communication channel. Masking failure of
the communication channel leads to reliable communication.
Reliable Point-to-Point Communication
Reliable point-to-point communication is generally provided by reliable protocols such as TCP/IP. TCP/IP masks
omission failures but not crash failures. When communication is not reliable, and in the presence of crash failures, it is
useful to define failure semantics for communication protocols. An example is the semantics of RPC in the presence of
failures. There are five different classes of failures that can occur in an RPC system.
Client cannot locate server
Request message to server is lost
Server crashes after receiving a request
Reply message from server is lost
Client crashes after sending a request
In the first case the RPC system must inform the caller of the failure. Although this weakens the transparency of the
RPC it is the only possible solution. In the second case the sender can simply resend the message after a timeout. In the
third case there are two possibilities. First, the request had already been carried out before the server crashed, in which


case the client cannot retransmit the request message and must report the failure to the user. Second, the request was not carried out before the server crashed, in which case the client can simply retransmit the request. The problem is that the client cannot distinguish between the two possibilities. Depending on how this problem is solved, the result is at-least-once, at-most-once, or maybe semantics.
In the fourth case it would be sufficient for the server to simply resend the reply. However, the server does not know
whether the client received its reply or not and the client cannot tell whether the server has crashed or whether the
reply was simply lost (or if the original request was lost). For idempotent operations (i.e., operations that can be safely
repeated) the client simply resends its request. For nonidempotent operations, however, the server must be able to
distinguish a retransmitted request from an original request. One approach is to add sequence numbers to the requests,
or to include a bit that distinguishes an original request from a retransmission.
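A sketch of server-side duplicate filtering with per-client sequence numbers follows, giving at-most-once execution of nonidempotent operations; the client identifiers, the handler, and the single-entry reply cache are hypothetical simplifications.

# At-most-once request handling: retransmissions are answered from a reply cache, not re-executed (sketch).
class RpcServer:
    def __init__(self, handler):
        self.handler = handler
        self.last_reply = {}     # client_id -> (sequence number, cached reply)

    def handle(self, client_id, seq, request):
        if client_id in self.last_reply:
            last_seq, reply = self.last_reply[client_id]
            if seq == last_seq:
                return reply                    # duplicate of the last request: resend the old reply
        reply = self.handler(request)           # new request: execute it exactly once
        self.last_reply[client_id] = (seq, reply)
        return reply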
Finally, in the fifth case, when a client crashes after sending a request, the server may be left performing unnecessary
work because the client is not around to receive the results. Such a computation is called an orphan. There are four ways
of dealing with orphans. Extermination involves the client explicitly killing off the orphans when it comes back up.
Reincarnation involves the client informing all servers that it has restarted, and leaving it up to the servers to kill off any
computations that were running on behalf of that client. Gentle reincarnation is similar to reincarnation, except that
servers only kill off computations whose parents cannot be contacted. Finally, in expiration each RPC is given a fixed amount of time to complete; if it cannot complete in time, it must explicitly ask the client for more time. If the client crashes and reboots, it must simply wait long enough for all previous computations to have expired.
Reliable Group Communication
Reliable group communication, that is, guaranteeing that messages are delivered to all processes in a group, is
particularly important when process groups are used to increase process resilience. We distinguish between reliable
group communication in the presence of faulty processes and reliable group communication in the presence of
nonfaulty processes. In the first case the communication succeeds only when all nonfaulty group members receive the
messages. The difficult part is agreeing on who is a member of the group before sending the message. In the second case
it is simply a question of delivering the messages to all group members.

Figure 2: Basic reliable multicast. (a) The sender, which keeps sent messages in a history buffer, multicasts message M25; one receiver has only seen messages up to #23 and so has missed #24. (b) The receivers acknowledge M25, while the receiver that missed #24 reports the gap so the sender can retransmit it.


Figure 2 shows a basic approach to reliable multicasting assuming nonfaulty processes. In this example the sender
assigns a sequence number to each message and stores sent messages in a history buffer. Receivers keep track of the
sequence number of the last messages they have seen. When a receiver successfully receives a message that it was
expecting, it returns an acknowledgment to the sender. When a receiver receives a message that it was not expecting
(e.g., because it was expecting an older message first) it informs the sender which messages it has not yet received. The
sender can then retransmit the messages to that particular receiver.
A problem with this approach is that if the group is large enough, the sender must deal with a large number of acknowledgment messages, which is bad for scalability. This problem is known as feedback implosion. In order to avoid
this the multicast approach can be modified to reduce the amount of feedback the server must process. In this approach
receivers do not send acknowledgments, but only negative acknowledgments (NACKs) when they are missing messages.
A major drawback of this approach is that the sender must keep its history buffer indefinitely as it does not know when
all receivers have successfully received a message.
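A sketch of the receiver side of such a NACK-based scheme follows; the sender interface and the delivery hook are hypothetical, and retransmitted messages would arrive through on_message again.

# Receiver-side gap detection for NACK-based reliable multicast (sketch).
class Receiver:
    def __init__(self, sender):
        self.sender = sender
        self.last = 0                    # highest sequence number delivered in order so far

    def on_message(self, seq, payload):
        if seq == self.last + 1:
            self.deliver(payload)        # the expected message: deliver it, send no feedback
            self.last = seq
        elif seq > self.last + 1:
            missing = list(range(self.last + 1, seq))
            self.sender.nack(missing)    # report only the gap (negative acknowledgment)
        # seq <= self.last: duplicate of something already delivered, silently discard

    def deliver(self, payload):
        print('delivered', payload)

A real implementation would also buffer the out-of-order message until the gap has been filled by retransmission.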
A different approach to improving scalability is to arrange groups in a hierarchical fashion (see Figure 3). In this
approach a large group of receivers is partitioned into subgroups, and the subgroups are organised into a tree. Each
subgroup is small enough so that any of the above mentioned reliable group communication schemes can be applied
and for each subgroup a local coordinator acts as the sender. All the coordinators and the original sender also form a
group and use one of the above mentioned multicast schemes. The main problem with this approach is constructing the
tree. It is particularly difficult to support dynamic addition and removal of receivers.
Figure 3: Hierarchical multicast (the sender is connected via long-haul links to local-area subgroups, each consisting of a coordinator C and receivers R; one coordinator acts as the root of the tree)


When discussing reliable group communication in the face of possibly faulty processes, it is useful to look at the
atomic multicast problem. Atomic multicast guarantees that a message will be delivered to all members of a group, or
to none at all. It is generally also required that the messages are delivered in the same order at all receivers. When
combined with faulty processes, atomic multicast requires that these processes be removed from the group, leaving
only nonfaulty processes in receiver groups. All processes must agree on the group view, that is, the view of the group
the sender had when the message was sent.
Failure Recovery
The term recovery refers to the process of restoring a (failed) system to a normal state of operation. Recovery can apply
to the complete system (involving rebooting a failed computer) or to a particular application (involving restarting of
failed process(es)).
While restarting processes or computers is a relatively straightforward exercise in a centralised system, things are
(as usual) significantly more complicated in a distributed system. The main challenges are:


Reclamation of resources: a process may hold resources, such as locks or buffers, on a remote node. Naively restarting
the process or its host will lead to resource leaks and possibly deadlocks.
Consistency: Naively restarting one part of a distributed computation will lead to a local state that is inconsistent with
the rest of the computation. In order to achieve consistency it is, in general, necessary to undo partially completed
operations on other nodes prior to restarting.
Efficiency: One way to avoid the above problems would be to restart the complete computation whenever one part
fails. However, this is obviously very inefficient, as a significant amount of work may be discarded unnecessarily.
Forward vs. backward recovery
Recovery can proceed either forward or backward. Forward error recovery requires removing (repairing) all errors in
the systems state, thus enabling the processes or system to proceed. No actual computation is lost (although the repair
process itself may be time consuming). Forward recovery implies the ability to completely assess the nature of all errors and damage resulting from the faults that led to the failure. An example could be the replacement of a broken network
cable with a functional one. Here it is known that all communication has been lost, and if appropriate protocols are used
(which, for example, buffer all outgoing messages) a forward recovery may be possible (e.g. by resending all buffered
messages). In most cases, however, forward recovery is impossible.
The alternative is backward error recovery. This restores the process or system state to a previous state known to
be free from errors, from which the system can proceed (and initially retrace its previous steps). Obviously this incurs
overheads due to the lost computation and the work required to restore the state. Also, there is in general no guarantee
that the same error will not reoccur (e.g. if the failure resulted from a software bug). Furthermore, there may be
irrecoverable components, such as external input (from humans) or irrevocable outputs (e.g. cash dispensed from an
ATM).
While the implementation of backward recovery faces substantial difficulties, it is in practice the only way to go, due to the impossibility of forward recovery from most errors. For the remainder of this lecture we will, therefore, only look at backward recovery.
Backward recovery
Backward error recovery works by restoring processes to a recovery point, which represents a prefailure state of the
process. A system can be recovered by restoring all its active processes to their recovery points. Recovery can happen
in one of two ways:
Operation-based recovery keeps a log (or audit trail) of all state-changing operations. The recovery point is reached
from the present state by reversing these operations;
State-based recovery stores a complete prior process state (called a checkpoint). The recovery point is reached by restoring the process state from the checkpoint (called roll-back). State-based recovery is also frequently called rollback-recovery.
Both approaches require the recovery data (log or checkpoint) to be recorded on stable storage. Combinations of
both are possible, e.g. by using checkpoints in an operation-based scheme to reduce the rollback overhead.
Operation-based recovery is usually implemented by in-place updates in combination with write-ahead logging.
Before a change is made to data, a record is written to the log, completely describing the change. This includes an
identification (name) of the affected object, the pre-update object state (for undoing the operation, roll-back) and the
post-update object state (for redoing the operation, roll-forward).[1] The implied transaction semantics makes this scheme attractive for databases.
State-based recovery requires checkpoints to be performed during execution. There exists an obvious trade-off for
the frequency of checkpointing: checkpoints slow execution and the overhead of frequent checkpoints may be
prohibitive. However, a low checkpoint frequency increases the average recovery cost (in terms of lost computation).
¹ Strict operation-based recovery does not require the new object state to be logged, but recovery can be sped up if this information is available in the log.
Checkpointing overhead can be reduced by some standard techniques:
Incremental checkpointing: rather than including the complete process state in a checkpoint, include only the changes since the previous checkpoint. This can be implemented by the standard memory-management technique of copy-on-write: after a checkpoint the whole address space is write protected. Any write will then cause a protection fault, which is handled by copying the affected page to a buffer, and un-protecting the page. On a checkpoint only those dirty pages are written to stable storage.
The drawback of incremental checkpointing is increased restart overhead, as first the process must be restored
to the last complete checkpoint (or its initial state), and then all incremental checkpoints must be applied in order.
Much of that work is in fact redundant, as processes tend to dirty the same pages over and over. For the same
reason, the sum of the incremental checkpoints can soon exceed the size of a complete checkpoint. Therefore,
incremental checkpointing schemes will occasionally perform a complete checkpoint to reduce the consumption
of stable storage and the restart overhead.
Asynchronous checkpointing: rather than blocking the process for the complete duration of a checkpoint, copy-on-write techniques are used to protect the checkpoint state from modification while the checkpoint is written to
stable storage concurrently with the process continuing execution. Inevitably, the process will attempt to modify
some pages before they have been written, in which case the process must be blocked while those pages are
written. Blocking is minimised by prioritising writes of such pages. The scheme works best if the checkpoint is
written in two stages: first to a buffer in memory, and from there to disk. The process will not have to block once
the in-memory checkpoint is completed.
Asynchronous checkpointing can be easily (and beneficially) combined with incremental checkpointing. Its
drawback is that a failure may occur before the complete checkpoint is written to stable storage, in which case
the checkpoint is useless and recovery must use an earlier checkpoint. Hence this scheme generally requires
several checkpoints to be kept on stable storage. Two checkpoints are sufficient, provided that a new checkpoint
is not commenced until the previous one is completed. Under adverse conditions this can lead to excessive delays
between checkpoints, and may force synchronisation of a checkpoint in order to avoid delaying the next one
further.
Compressed checkpoints: The checkpoint on stable storage can be compressed in order to reduce storage
consumption, at the expense of increased checkpointing and restart costs.
These techniques can be implemented by the operating system, or transparently at user level (for individual
processes)[PBKL95]. This is helped significantly by the semantics of the Unix fork() system call: in order to perform a
checkpoint, the process forks an identical copy of itself. The parent can then continue immediately, while the child
performs the checkpoint by dumping its address space (or the dirty parts of it).
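A user-level sketch of this fork()-based technique is given below (Python on a Unix system; the state is assumed to be a picklable object, and the helper names are illustrative). The parent resumes immediately after the fork, while the child writes its copy-on-write snapshot of the state to stable storage:

    import os, pickle

    def checkpoint(state, path):
        """Fork-based checkpoint: parent continues, child writes the snapshot."""
        pid = os.fork()
        if pid == 0:                        # child: owns a COW snapshot of the address space
            with open(path, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())        # the checkpoint must reach stable storage
            os._exit(0)                     # exit without running the parent's cleanup code
        return pid                          # parent: continue computing immediately

    def restore(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    # usage sketch
    state = {"step": 0, "data": []}
    for step in range(1000):
        state["step"] = step
        state["data"].append(step * step)
        if step % 100 == 0:
            child = checkpoint(state, "ckpt.tmp")
            os.waitpid(child, 0)            # a real scheme would reap the child asynchronously
            os.replace("ckpt.tmp", "ckpt")  # atomically switch to the new checkpoint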
Problems with recovery
The main challenge for implementing recovery in a distributed system arises from the requirement of consistency. An
isolated process can easily be recovered by restoring it to a checkpoint. A distributed computation, however, consists
of several processes running on different nodes. If one of them fails, it may have causally affected another process
running on another node. A process B is causally affected by process A if any of A's state has been revealed to B (and B may have subsequently based its computation on this knowledge of A's state). See Global States for details.
Domino rollback If A has causally affected B since A's last checkpoint, and A subsequently fails and is restored to its checkpointed state, then B's state is inconsistent with A's state, as B depends on a possible future state of A. It is possible
that in its new run A does not even enter that state again. In any case, the global state may no longer be consistent, as it
may not be reachable from the initial state by a fault-free execution.
Such a situation must be avoided, by rolling back all processes which have been causally affected by the failed
process. This means that all processes must establish recovery points, and that furthermore any recovery must be guaranteed to roll back to a consistent global state. The problem is that this can lead to domino rollback, as shown in Figure 4.

Figure 4: Domino effect leads to total rollback after failure of P3


If P1 fails, it can be rolled back to R13, which leaves the system in a consistent state. We use the notation P1↓ to indicate that P1 fails, and P1 ⇝ R13 to indicate P1 being rolled back to R13.
A failure of P2 is more serious: P2↓ ⇒ P2 ⇝ R22 leads to an inconsistent state, as the state recorded in R22 has causally affected the state recorded in R13, and hence P1's present state. We use the notation R22 → R13 to indicate this relationship. In order to make the global state consistent, the orphan message m, which has been received but not sent, must be removed. This forces P1 ⇝ R12.
The situation is worst if P3 fails:
P3↓ ⇒ P3 ⇝ R32 ⇒ P2 ⇝ R21 ⇒ P1 ⇝ R11, P3 ⇝ R31,
and the system has rolled back to its initial state. The reason behind the domino rollbacks is the uncoordinated (independent) nature of the checkpoints taken by the different processes.

Figure 5: Rollback leading to message loss after failure of P2


Message loss Domino rollbacks are triggered by messages sent by the failing process since the checkpoint to which it is
rolling back. However, messages received since the potential recovery point also cause problems, as indicated in Figure
5. Here, P2↓ ⇒ P2 ⇝ R21, leading to a consistent global state. However, the behaviour will still be incorrect, as P1 has sent message m, while P2 has not received m, and will never receive it. The message is lost.
However, this sort of message loss results not only from process failure but also from communication failure, a frequent event in networks. For the participating processes the two types of message loss are indistinguishable, and can be addressed by the same mechanisms: loss-resilient communication protocols.

Figure 6: State before and after failure of P2. The message in transit (n1) leads to livelock.
Livelock Finite communication latencies can also lead to problems, as shown in Figure 6. Failure of P2 between sending m1 and receiving n1 (already sent by P1) leads to P2 ⇝ R21; message m1 forces P1 ⇝ R11. The latter rollback orphans message n1, which forces P2 ⇝ R21 once more. Since P2 managed to send m2 before receipt of n1, this latter rollback orphans m2 and forces P1 ⇝ R11. With the right timing this can happen indefinitely.
Consistent Checkpointing and Recovery
The source of the domino rollback and livelock problems was that the local checkpoints were taken in an independent (uncoordinated) fashion, facing the system at recovery time with the task of finding a set of local checkpoints that together represent a consistent cut.

Figure 7: Strongly consistent checkpoint ({R11,R21,R31}) and consistent checkpoint ({R12,R22,R32})
There are two basic kinds of such checkpoints. A strongly consistent checkpoint, such as the cut {R11,R21,R31} in Figure
7, has no messages in transit during the checkpointing interval. This requires a quiescent system during the checkpoint,
and thus blocks the distributed computation for the full duration of the checkpoint.
The alternative to strongly consistent checkpoints is checkpoints representing consistent cuts, simply called consistent checkpoints. An example is Chandy & Lamport's snapshot algorithm [CL85]. Remember that this algorithm buffers messages in transit to cope with the message loss problem mentioned above. It also assumes reliable communication, which is unrealistic in most distributed systems. It is preferable to use an approach that can tolerate message loss.
One simple approach is to have each node perform a local checkpoint immediately after sending any message to
another node. While this is an independent checkpointing scheme, it is easy to see that the last local checkpoints
together form a consistent checkpoint. This scheme obviously suffers from high overhead (frequent checkpointing). Any
less frequent checkpointing requires systemwide coordination (synchronous checkpoints) to be able to guarantee a
consistent checkpoint. (For example, checkpointing after every second message will, in general, not lead to consistent
checkpoints.)
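A sketch of this checkpoint-after-send rule is shown below (the channel and take_local_checkpoint() helpers are placeholders). The last checkpoints always form a consistent cut because any message recorded as received by some process was immediately followed, on the sender's side, by a checkpoint that records it as sent:

    def send_and_checkpoint(channel, msg, state, take_local_checkpoint):
        """Independent checkpointing scheme: every message send is immediately
        followed by a local checkpoint, so no last checkpoint can record a
        message as received whose send is not recorded by the sender."""
        channel.send(msg)                # hand the message to the network
        take_local_checkpoint(state)     # checkpoint before doing anything else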


A simple synchronous checkpointing scheme providing strongly consistent checkpoints is described below. The
scheme assumes ordered communication (FIFO channels), a strongly connected network (no partitioning) and some
mechanism for dealing with message loss (which could be a protocol such as sliding window or nodes buffering all
outgoing messages on stable storage). Each node maintains two kinds of checkpoints: a permanent checkpoint is part of
a global checkpoint, while a tentative checkpoint is a candidate for being part of a global checkpoint. The checkpoint is
initiated by one particular node, called the coordinator. Not surprisingly, the algorithm is based on the two-phase commit
protocol [LS76]. The algorithm works as follows:
First phase:
1. the coordinator Pi takes a tentative checkpoint;
2. Pi sends a t message to all other processes Pj to take tentative checkpoints;
3. each process Pj informs Pi whether it succeeded in taking a tentative checkpoint;
4. if Pi receives a true reply from each Pj, it decides to make the checkpoint permanent; if Pi receives at least one false reply, it decides to discard the tentative checkpoints.
Second phase:
1. the coordinator Pi sends a p (permanent) or u (undo) message to all other processes Pj;
2. each Pj converts or discards its tentative checkpoint accordingly;
3. a reply from Pj back to Pi can be used to let Pi know when a successful checkpoint is complete.
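The sketch below outlines the coordinator's side of this two-phase scheme (the net.send()/net.recv() primitives and the checkpoint callbacks are placeholders, not a real messaging API):

    def coordinate_checkpoint(peers, net, take_tentative, make_permanent, discard):
        """Two-phase checkpointing, coordinator side; assumes blocking
        send(dst, msg) and recv(src) primitives over FIFO channels."""
        # First phase: take a tentative checkpoint and ask all peers to do the same.
        if not take_tentative():
            decision = "u"                          # cannot even checkpoint locally
        else:
            for p in peers:
                net.send(p, ("t",))                 # request tentative checkpoints
            replies = [net.recv(p) for p in peers]  # true/false from every peer
            decision = "p" if all(replies) else "u"

        # Second phase: broadcast the decision; peers convert or discard accordingly.
        for p in peers:
            net.send(p, (decision,))
        if decision == "p":
            make_permanent()
        else:
            discard()
        for p in peers:
            net.recv(p)                             # optional acks: checkpoint complete
        return decision == "p"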

Figure 8: Synchronous checkpointing creates redundant checkpoints (R32)


In order to ensure consistency, this algorithm requires that processes do not send any messages other than the true
or false reply between receiving the coordinator's control messages (t and p or u). This is a weak form of blocking that
limits the performance of this checkpointing scheme. Furthermore, as the algorithm generates strongly consistent
checkpoints, it performs redundant checkpoints if only simple consistency is required, as shown in Figure 8. Here P1
initiates two global checkpoints, {R11,R21,R31} and {R12,R22,R32}. As the cut {R12,R22,R31} is also consistent (but not strongly
consistent), local checkpoint R32 is redundant. Note that a redundant local checkpoint is not necessarily bad, as it would
reduce the amount of lost work in the case of a rollback.
Redundant checkpoints can be avoided by keeping track of messages sent [KT87]. In this approach, each message m is tagged with a label m.l, which is incremented for each message. Each process maintains three arrays:
last_rec_i[j] := m.l, where m is the last message received by Pi from Pj since the last checkpoint (last_rec_i[j] = 0 if no messages were received from Pj since the last checkpoint);
first_sent_i[j] := m.l, where m is the first message sent by Pi to Pj since the last checkpoint (first_sent_i[j] = 0 if no message was sent to Pj since the last checkpoint);
cohort_i := {j | last_rec_i[j] > 0} is the set of processes from which Pi has received a message (has been causally affected) since the last checkpoint.
Each process initialises first_sent to zero, and also maintains a variable OK which is initialised to true. The idea of the algorithm is that a process only needs to take a local checkpoint if the new checkpoint's coordinator has been causally affected by the process since the last permanent checkpoint. This is the case if, when receiving the t message from Pi, Pj finds that last_rec_i[j] ≥ first_sent_j[i] > 0.
The algorithm is as follows:
The coordinator Pi initiates a checkpoint by:
1. send(t, i, last_rec_i[j]) to all Pj ∈ cohort_i;
2. if all replies are true, send(p) to all Pj ∈ cohort_i, else send(u) to all Pj ∈ cohort_i.
Other processes, Pj, upon receiving (t, i, last_rec_i[j]) do:
1. if OK_j and last_rec_i[j] ≥ first_sent_j[i] > 0:
2.   take a tentative checkpoint,
3.   send(t, j, last_rec_j[k]) to all Pk ∈ cohort_j,
4.   if all replies are true, OK_j := true, else OK_j := false;
5. send(OK_j, j) to Pi.
Upon receiving the commit message x ∈ {p, u} from Pi, the other processes, Pj, do:
1. if x = p, make the tentative checkpoint permanent, else discard it;
2. send(x, j) to all Pk ∈ cohort_j.
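The bookkeeping, and the test that decides whether a process has to join a checkpoint, can be sketched as follows (the class and method names are illustrative; the messaging layer is assumed to exist and to deliver the labels):

    class CheckpointTracker:
        """Per-process bookkeeping for the [KT87]-style scheme described above."""
        def __init__(self, n_procs):
            self.next_label = 1
            self.last_rec   = [0] * n_procs  # last_rec[j]: label of last message received from Pj
            self.first_sent = [0] * n_procs  # first_sent[j]: label of first message sent to Pj

        def on_send(self, j):
            label, self.next_label = self.next_label, self.next_label + 1
            if self.first_sent[j] == 0:      # first message to Pj since the last checkpoint
                self.first_sent[j] = label
            return label                     # tag the outgoing message with this label

        def on_receive(self, j, label):
            self.last_rec[j] = label

        def cohort(self):
            # processes that have causally affected this process since the last checkpoint
            return {j for j, l in enumerate(self.last_rec) if l > 0}

        def must_checkpoint(self, i, last_rec_i_of_me):
            # Take a tentative checkpoint iff the coordinator Pi has received a message
            # sent by this process after its last checkpoint:
            #     last_rec_i[j] >= first_sent_j[i] > 0
            return 0 < self.first_sent[i] <= last_rec_i_of_me

        def on_checkpoint(self):
            self.last_rec   = [0] * len(self.last_rec)
            self.first_sent = [0] * len(self.first_sent)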

Note that this algorithm is expensive as it requires O(n²) messages. Recovery is initiated by the coordinator sending a rollback message r to all other processes. A two-phase commit is used to ensure that the computation does not continue until all processes have rolled back. This leads to unnecessary rollbacks. This can again be avoided with a more
sophisticated protocol which checks whether processes were causally affected.
Some more-or-less obvious improvements can be applied to consistent checkpointing. For example, explicit control
messages can be avoided by tagging a sufficient amount of state information on all normal messages. A process will
then take a tentative checkpoint (and inform its cohort) when it finds that its checkpointing state is inconsistent with
that recorded in the message. This is the same idea as that of logical time[Mat93]. Whether this approach is beneficial is
not clear a priori, as the reduced number of messages comes at the expense of larger messages.
Asynchronous Checkpointing and Recovery
Synchronous checkpointing as described above produces consistent checkpoints by construction, making recovery
easy. In this sense it is a pessimistic scheme, optimised towards low recovery overheads. Its drawbacks are essentially blocking the computation for the duration of a checkpoint, and the O(n²) message overhead, which severely limits scalability. These features make the scheme unattractive in a scenario where failures are rare.
The alternative is to use an optimistic scheme, which assumes infrequent failures and consequently minimises
checkpointing overheads at the expense of increased recovery overheads. In contrast to the pessimistic scheme, this
approach makes no attempt at generating consistent checkpoints at normal run time, but leaves it to the recovery phase
to construct a consistent state from the available checkpoints.
In this approach, checkpoints are taken locally in an independent (unsynchronised) fashion, with precautions that
allow the construction of a consistent state later. Remember that orphan messages are the source of inconsistencies.
These are messages that have been received by another process but, after the rollback, have not been sent by the rolled-back process. The negative effect of
orphaned messages can be avoided by realising that the process restarting from its last checkpoint will generate the
same messages again. If the process knows that it is recovering, i.e., it is in its roll-forward phase (prior to reaching the
previous point of failure), it can avoid the inconsistencies caused by orphan messages by suppressing any messages that it would normally have sent. Once the roll-forward proceeds past the point where the last message was sent (and
received) prior to failure, the computation reverts to its normal operation. Except for timing, this results in an operation
that is indistinguishable from a restart after the last message send operation, provided there is no message loss. Note that
this implies a relaxation of the definition of a consistent state (but the difference is unobservable except for timing).
The requirement of suppressing outgoing messages during the roll-forward phase implies the knowledge of the
number of messages sent since the last checkpoint. Hence it is necessary to log the send count on stable storage.
Furthermore, the roll-forward cannot proceed past the receipt of any lost message, hence message loss must be avoided.
This can be achieved by logging all incoming messages on stable storage. During roll-forward these messages can then
be replayed from the log.
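Under these assumptions, the roll-forward phase can be sketched as follows (handle() is a deterministic, application-specific step; receive_log and sent_before_failure are assumed to have been read back from stable storage; all names are illustrative):

    def roll_forward(state, receive_log, sent_before_failure, net, handle):
        """Replay logged incoming messages after restoring 'state' from the last
        checkpoint, suppressing sends until the pre-failure send count is reached."""
        sends_regenerated = 0

        def send(dst, msg):
            nonlocal sends_regenerated
            if sends_regenerated < sent_before_failure:
                sends_regenerated += 1    # duplicate of an already-delivered message:
                return                    # suppress it so no receiver becomes an orphan
            net.send(dst, msg)            # past the failure point: send normally

        for msg in receive_log:           # deterministic replay of the logged input
            handle(state, msg, send)
        return state                      # now consistent with the rest of the system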
This scheme works under the assumption of a deterministic behaviour of all processes. Such a determinism may be
broken by such factors as dependence on the resident set size (the available memory may be different for different
executions of the same code, leading to different caching behaviour), changed process identifiers (a restarted process
may be given a different ID by the OS), or multi-threaded processes, which exhibit some inherent non-determinism.
Furthermore, interrupts (resulting from I/O to local backing store) are inherently asynchronous and thus
nondeterministic. One way around this problem is to checkpoint prior to handling any interrupt (or asynchronous
signal)[BBG+89], although this could easily lead to excessive checkpointing. All these factors are related to the
interaction with the local OS (and OS features) and can be resolved by careful implementation and appropriate OS
support.
A more serious problem is that any message missing from the replay log will terminate the roll-forward prematurely and may result in an inconsistent global state. This can be avoided by synchronous logging of messages; however, this slows down the computation, reducing (or eliminating) any advantage over synchronous checkpointing.
A way out of this dilemma is to be even more optimistic: using asynchronous (or optimistic) logging [SY85, JZ90]. Here, incoming messages are logged to volatile memory and flushed to stable storage asynchronously. On failure, any unflushed part of the log is lost, resulting in an inconsistent state. This is repaired by rolling back the orphan process(es) to a consistent state. Obviously, this can result in a domino rollback, but the likelihood of that is significantly reduced by the asynchronous flush. In the worst case this approach is no better than independent checkpointing, but on average it is much better. One advantage is that there is no synchronisation requirement between checkpointing and logging, which greatly simplifies implementation and improves efficiency.
An optimistic checkpointing algorithm has been presented by Juang & Venkatesan[JV91]. It assumes reliable FIFO
communication channels with infinite buffering capacity and finite communication delays. It considers a computation
as event driven, where an event is defined as the receipt of a message. Each process is considered to be waiting for a
message between events. When a message m arrives, as part of the event m is processed, an event counter, e, is
incremented, and optionally a number of messages are sent to some directly connected nodes. Each event is logged
(initially to volatile storage) as a triplet E = {e, m, msgs_sent}, where msgs_sent is the set of messages sent during the event. The log is asynchronously flushed to stable storage.
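A sketch of such an optimistic (asynchronous) log is shown below: entries are appended to a volatile queue and written to stable storage by a background thread, so a crash may lose the unflushed tail of the log (the class name and file format are illustrative):

    import os, pickle, queue, threading

    class OptimisticEventLog:
        """Volatile event log with asynchronous flushing to stable storage.
        Each entry is a triplet (e, m, msgs_sent) as described above."""
        def __init__(self, path):
            self.pending = queue.Queue()         # volatile part of the log
            self.f = open(path, "ab")
            threading.Thread(target=self._flusher, daemon=True).start()

        def log_event(self, e, m, msgs_sent):
            self.pending.put((e, m, msgs_sent))  # returns immediately (optimistic)

        def _flusher(self):
            while True:                          # entries still queued here are lost
                entry = self.pending.get()       # if the process crashes
                pickle.dump(entry, self.f)
                self.f.flush()
                os.fsync(self.f.fileno())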

Figure 9: Optimistic checkpointing example


Figure 9 shows an example of how this works. After failure of P2, that process can roll back to local checkpoint R21
and then roll forward to event E23, provided that this event was logged in stable storage. If it was not logged, then E13 is an orphan state and a rollback to the consistent state {E12,E22,E31} is required. Note that E22 is contained in checkpoint
R21 and thus guaranteed to be logged.
The challenge here is the detection of the latest consistent state. This is done by keeping track of the messages sent and received. Specifically, each process Pi maintains two arrays of counters:
n_rcvd_ij(E): the number of messages received by Pi from Pj (up to event E);
n_sent_ij(E): the number of messages sent by Pi to Pj (up to event E).
At restart, these are used to compare the local message count with that of the process's neighbours. If for any neighbour Pj it is found that n_rcvd_ji > n_sent_ij, then Pj is orphaned and must be rolled back until n_rcvd_ji ≤ n_sent_ij. This may, of course, cause domino rollbacks of other processes.
The recovery is initiated by a restarting process, Pi, sending a broadcast message announcing its failure. Pi
determines the last event, Ei, which has been recorded in its stable event log. For each process Pj receiving the failure
message, Ej denotes the last local event (prior to the failure message). Each of the N processes then performs the
following algorithm:
1. for k := 1 to N do
2.   for each neighbour j do
3.     send(r, i, n_sent_ij(Ei));
4.   wait for r messages from all neighbours;
5.   for each message (r, j, s) received do
6.     if n_rcvd_ij(Ei) > s then /* have orphan */
7.       Ei := latest E such that n_rcvd_ij(E) = s;

In the example of Figure 9, the following steps are taken when P2 fails and finds that E22 is the last logged event (implicitly logged by R21):
1. P2: recover from R21; E2 := E22; send(r, P2, 2) → P1; send(r, P2, 1) → P3;
2. P1 ← P2; E1 := E13; n_rcvd_12(E13) = 3 > 2: E1 := E12; send(r, P1, 2) → P2;
3. P3 ← P2; E3 := E32; n_rcvd_32(E32) = 2 > 1: E3 := E31; send(r, P3, 1) → P2;
4. P2 ← P1; n_rcvd_21(E22) = 1 ≤ 2; no change; send(r, P2, 2) → P1;
5. P2 ← P3; n_rcvd_23(E22) = 1 ≤ 1; no change; send(r, P2, 1) → P3;
Here, P1 ← P2 is a shorthand for P1 receiving the previously sent message from P2. The algorithm determines the recovery state to be {E12,E22,E31}, which is reached by rolling back to {R11,R21,R31} and then rolling forward by replaying the logs.
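For concreteness, the per-process part of this recovery protocol can be sketched as follows (the event objects are assumed to carry the n_rcvd and n_sent counters defined above, with the entry at index 0 corresponding to the restored checkpoint, where all counters are zero; the messaging primitives are placeholders):

    def recover(i, neighbours, events, net, N):
        """Juang & Venkatesan-style recovery, run by every process Pi after a
        failure announcement; 'events' are Pi's logged events, in order."""
        k = len(events) - 1                      # index of the last logged event Ei
        for _ in range(N):                       # N rounds suffice
            for j in neighbours:
                net.send(j, ("r", i, events[k].n_sent[j]))
            for _ in neighbours:
                _, j, s = net.recv()             # ("r", j, n_sent_j[i](Ej)) from Pj
                while events[k].n_rcvd[j] > s:   # we hold orphan messages from Pj:
                    k -= 1                       # move back to the latest consistent event
        return events[k]                         # recovery point: restore and replay up to here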
References
[ALRL04] Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11-33, 2004.
[BBG+89] Anita Borg, Wolfgang Blau, Wolfgang Graetsch, Ferdinand Herrmann, and Wolfgang Oberle. Fault tolerance under UNIX. ACM Transactions on Computer Systems, 7:1-24, 1989.
[CGR07] Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. Paxos made live: an engineering perspective. In PODC '07: Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, pages 398-407, New York, NY, USA, 2007. ACM.
[CL85] K. Mani Chandy and Leslie Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems, 3:63-75, 1985.
[FLP85] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374-382, 1985.
[JV91] T. Juang and S. Venkatesan. Crash recovery with little overhead. In Proceedings of the 11th IEEE International Conference on Distributed Computing Systems (ICDCS), pages 454-461. IEEE, May 1991.
[JZ90] David B. Johnson and Willy Zwaenepoel. Recovery in distributed systems using optimistic message logging and checkpointing. Journal of Algorithms, 11:462-491, 1990.
[KT87] R. Koo and S. Toueg. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering, 13:23-31, January 1987.
[Lam98] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133-169, 1998.
[Lam01] Leslie Lamport. Paxos made simple. ACM SIGACT News (Distributed Computing Column), 32(4):51-58, 2001.
[LS76] Butler Lampson and H. Sturgis. Crash recovery in a distributed system. Working paper, Xerox PARC, CA, USA, 1976.
[LSP82] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4:382-401, 1982.
[PBKL95] J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: Transparent checkpointing under UNIX. In Proceedings of the 1995 USENIX Technical Conference, pages 213-223, January 1995.
[Sch90] Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 22(4):299-319, 1990.
[SY85] R. Strom and S. Yemini. Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, 3(3):204-226, 1985.
