
Prescript to the lectures

Distributed Systems (DS)


(work in progress)

Dr Tomasz Jordan Kruk


T.Kruk@ia.pw.edu.pl

June 20, 2006

Semester: 2005/2006 summer


Contents

1 Introduction

2 Communication (I)

3 Communication (II)

4 Synchronization (I)

5 Synchronization (II)

6 Consistency and Replication

7 Fault Tolerance

8 Distributed File System

9 Naming

10 Peer-to-Peer Systems

11 Web Services

Bibliography

Chapter 1

Introduction

[1.1] Lectures (1)


Tomasz Jordan Kruk (T.Kruk@ia.pw.edu.pl)

Consultations: room 530

• Tuesday 18.15-19.00,

• Thursday 14.15-15.00

Slides available after lectures on http://studia.elka.pw.edu.pl

Books:

1. Distributed Systems. Principles and Paradigms. Andrew S. Tanenbaum, Maarten van Steen. Prentice Hall 2002.

2. Distributed Systems. Concepts and Design. Fourth edition. G. Coulouris, J. Dollimore, T. Kindberg. Addison Wesley 2005.

3. Systemy rozproszone. Zasady i paradygmaty. Andrew S. Tanenbaum, Maarten van Steen. WNT 2005.

[1.2] Lectures (2)

1. Introduction

2. Communication (I)


3. Communication (II)

4. Synchronization (I)

5. Synchronization (II)

6. Consistency and replication

7. Fault tolerance

8. File systems

9. Naming

10. Peer-to-peer systems

11. Web services

12. Security

[1.3] Definition of a Distributed System (1)


Distributed system
Collection of independent computers that appears to its users as a single coherent
system.

Goals:

• connecting users and resources,

• transparency,

• openness = offering services according to standard rules that describe the syntax and semantics of those services (e.g. POSIX for operating systems),

• scalability.

[1.4] Definition of a Distributed System (2)


[Figure: machines A, B, and C each run a local OS; a middleware service layer spans all machines and supports the distributed applications; the machines are connected by a network.]

A distributed system organized as middleware. Note that the middleware layer extends over multiple machines.

[1.5] Transparency in a Distributed System

Different forms of transparency in a distributed system.

[1.6] Degree of Transparency

• blindly hiding all distribution aspects from users is not always a good idea,

• there is a trade-off between a high degree of transparency and performance.


A goal not achieved in practice: parallelism transparency.

Parallelism transparency
Transparency level with which a distributed system is supposed to appear to the
users as a traditional uniprocessor timesharing system.

[1.7] Openness
Completeness and neutrality of specifications are important factors for interoperability and portability of distributed solutions.

completeness – everything necessary to make an implementation has been specified,

neutrality – specifications do not prescribe what an implementation should look like.

Interoperability
The extent to which two implementations of systems from different manufacturers can cooperate.

Portability
The extent to which an application developed for system A can be executed, without modification, on some system B that implements the same interfaces as A.

[1.8] Scalability Problems


Three different dimensions of the system scalability:

• scalable with respect to size,

• geographically scalable (users and resources may lie far apart),

• administratively scalable (spanning many administrative domains).


[1.9] Decentralized Algorithms

1. No machine has complete information about the system state.

2. Machines make decisions based only on local information.

3. Failure of one machine does not ruin the algorithm.

4. There is no implicit assumption that a global clock exists.

[1.10] Scaling Techniques (1)

• asynchronous communication (to hide communication latencies),

• distribution (splitting into smaller parts and spreading),

• replication (to increase availability and to balance the load),

• caching (as a special form of replication).
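To make the caching idea concrete, here is a small Python sketch (all names, such as `Cache` and `fetch_from_server`, are invented for illustration, not from the lecture): repeated reads of the same item are served from a local replica, so only the first read costs a "remote" fetch.

```python
# Minimal sketch of client-side caching as a special form of replication.

class Cache:
    def __init__(self, fetch):
        self.fetch = fetch      # function that contacts the "server"
        self.store = {}         # local replica of previously fetched items
        self.misses = 0

    def get(self, key):
        if key not in self.store:           # only a miss costs a remote call
            self.misses += 1
            self.store[key] = self.fetch(key)
        return self.store[key]

def fetch_from_server(key):
    return key.upper()                      # stand-in for a remote lookup

cache = Cache(fetch_from_server)
cache.get("page"); cache.get("page"); cache.get("page")
# three reads, but only one remote fetch (cache.misses == 1)
```

The cached copy may of course go stale; keeping replicas consistent is the subject of the Consistency and Replication chapter.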

[1.11] Scaling Techniques (2)

[Figure: two ways of handling a form: (a) the client sends every keystroke across the network to the server, which checks and processes the form; (b) the client checks the form locally and sends only the completed form to the server for processing.]

A difference between letting (a) a server or (b) a client check forms as they are being filled.

[1.12] Scaling Techniques (3)

[Figure: the DNS name space as a tree with generic top-level domains (int, com, edu, gov, mil, org, net) and country domains (jp, us, nl), with subtrees such as sun.com, cs.vu.nl, and keio.ac.jp; the tree is divided into zones Z1, Z2, Z3 at different depths.]

An example of dividing the DNS name space into zones.

[1.13] Hardware Concepts

[Figure: a 2x2 taxonomy of multiple-CPU systems: shared memory (multiprocessors) versus private memory (multicomputers), each either bus-based or switch-based; P denotes a processor, M a memory.]

Different basic organizations of processors and memories in distributed computer systems.

[1.14] Multiprocessors (1)

[Figure: several CPUs, each with a private cache, and a shared memory, all attached to a common bus.]

A bus-based multiprocessor.

[1.15] Multiprocessors (2)

Memories
CPUs Memories
M M M M
P M
P
P M
P
CPUs
P M
P
P M
P

Crosspoint switch 2x2 switch

(a) (b)

a. a crossbar switch,

b. an omega switching network (2k inputs and a like outputs; log2 N stages,
each having N/2 exchange elements at each stage),

NUMA - NonUniform Memory Access - hierarchical systems.

[1.16] Homogeneous Multicomputer Systems


[Figure: multicomputers with (a) a grid interconnect and (b) a hypercube interconnect.]

a. grid,

b. hypercube.

Examples: Massively Parallel Processors (MPPs), Clusters of Workstations (COWs).

[1.17] Software Concepts

DOS – distributed operating system,

NOS – network operating system.

[1.18] Uniprocessor Operating Systems


[Figure: a uniprocessor operating system: the user application and the OS modules (memory module, process module, file module) run in user mode behind a single OS interface, with no direct data exchange between modules; system calls enter a microkernel running in kernel mode on the hardware.]

[1.19] Multicomputer Operating Systems (1)

[Figure: machines A, B, and C each run a kernel; a layer of distributed operating system services spans all machines and supports the distributed applications; the machines are connected by a network.]

General structure of a multicomputer operating system.

[1.20] Multicomputer Operating Systems (2)


[Figure: a sender and a receiver, each with a local buffer, connected by a network; S1-S4 mark the possible synchronization points along the path: the sender buffer, entry into the network, the receiver buffer, and the receiving process.]

Alternatives for blocking and buffering in message passing.

[1.21] Multicomputer Operating Systems (3)

Relation between blocking, buffering, and reliable communications.

[1.22] Distributed Shared Memory Systems (1)


[Figure: a shared global address space of 16 pages, with the pages spread over the local memories of four CPUs.]

a. pages of the address space distributed among four machines,

b. situation after CPU 1 references page 10,

c. situation if page 10 is read only and replication is used.

[1.23] Distributed Shared Memory Systems (2)


[Figure: two independent data items A and B lie on the same page p; code on machine A uses only A and code on machine B uses only B, yet every access by the other machine forces a transfer of the whole page back and forth.]

False sharing of a page between two independent processes.

[1.24] Network Operating Systems (1)

[Figure: machines A, B, and C each run a kernel with their own network OS services on top, supporting distributed applications; the machines are connected by a network.]

General structure of a network operating system.

[1.25] Network Operating Systems (2)


[Figure: (a) two servers export directory trees (games: pacman, pacwoman, pacchild; work: mail, teaching, research); (b) and (c) two clients mount these subtrees at different points in their local name spaces.]

Different clients may mount the servers in different places.

[1.26] Positioning Middleware

[Figure: machines A, B, and C each run a kernel with network OS services; a middleware services layer spans all machines and supports the distributed applications; the machines are connected by a network.]

General structure of a distributed system as middleware.

[1.27] Middleware and Openness


[Figure: two machines, each running an application on top of middleware on top of a network OS; the applications see the same programming interface, and the middleware layers speak a common protocol across the network.]

In an open middleware-based distributed system, the protocols used by each middleware layer should be the same, as well as the interfaces they offer to applications.

[1.28] Comparison of Operating Systems Types

A comparison between multiprocessor OS, multicomputer OS, network OS, and middleware-based distributed systems.

[1.29] Clients and servers


[Figure: the client sends a request and blocks, waiting for the result; the server provides the service and sends the reply.]

General interaction between a client and a server.

[1.30] Application Layering

• the user-interface level,

• the processing level,

• the data level.

[1.31] Processing Level

[Figure: an Internet search engine in three layers: the user-interface level accepts a keyword expression and displays an HTML page containing a ranked list of page titles; the processing level holds the query generator, the ranking component, and the HTML generator; the data level is a database with Web pages, queried for page titles and meta-information.]

The general organization of an Internet search engine into three different layers.

[1.32] Multitiered Architectures (1)


[Figure: five ways of splitting the user interface, application, and database between the client machine and the server machine, ranging from (a) only part of the user interface on the client to (e) the user interface, the application, and part of the database on the client.]

Alternative client-server organizations (a)-(e).

[1.33] Multitiered Architectures (2)

[Figure: the user interface (presentation) requests an operation from the application server and waits for the result; the application server in turn requests data from the database server, waits for the data, and then returns the result.]

An example of a server acting as a client.

[1.34] Modern Architectures (1)


Vertical distribution
Achieved by placing logically different components on different machines.

Horizontal distribution
Client or server may be physically split up into logically equivalent parts, but
each part is operating on its own share of the complete data set, thus balancing
the load.

[1.35] Modern Architectures (2)


[Figure: a front end handles incoming requests from the Internet in round-robin fashion and distributes them over replicated Web servers, each containing the same Web pages on its own disks.]

An example of horizontal distribution of a Web service.

[1.36] Internet and web servers

Date          Computers      Web servers
1979, Dec.            188              0
1989, July        130 000              0
1999, July     56 218 000      5 560 866
2003, Jan.    171 638 297     35 424 956


Chapter 2

Communication (I)

[2.1] Communication (I)

1. Layered Protocols

2. Remote Procedure Call

3. Remote Object Invocation

4. Message-oriented Communication

5. Stream-oriented Communication

[2.2] Necessary Agreements

• How many volts should be used to signal a 0-bit, and how many for a
1-bit?

• How does the receiver know which bit is the last one of the message?

• How can it detect if a message has been damaged or lost?

• How long are numbers, strings and other data items?

• How are they represented?

ISO OSI = OSI Model = Open Systems Interconnection Reference Model

Protocols: connection-oriented vs. connectionless.


protocol suite = protocol stack = the collection of protocols used in a particular system.

[2.3] Protocols (1)

Example protocol as a discussion:

A: Please, retransmit message n,

B: I already retransmitted it,

A: No, you did not,

B: Yes, I did,

A: All right, have it your way, but send it again.

[2.4] Protocols (2)

Protocol
A well-known set of rules and formats to be used for communication between
processes in order to perform a given task.

Two important parts of the definition:

• a specification of the sequence of messages that must be exchanged,

• a specification of the format of the data in the messages.

How to create protocols:


On the Design of Application Protocols, RFC 3117,
http://www.rfc-editor.org/rfc/rfc3117.txt

[2.5] Layered Protocols (1)


[Figure: the seven OSI layers (7 application, 6 presentation, 5 session, 4 transport, 3 network, 2 data link, 1 physical), each with its own peer protocol, stacked on top of the network.]

Layers, interfaces, and protocols in the OSI model.

• focus on message-passing only,

• often unneeded or unwanted functionality.

[2.6] Layered Protocols (2)

[Figure: a message wrapped for transmission: the application, presentation, session, transport, network, and data link layers each prepend their own header, and the data link layer also appends a trailer; the resulting frame is what actually appears on the network.]

A typical message as it appears on the network.
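This encapsulation can be sketched in Python (the header names are invented placeholders, not real protocol fields): each layer wraps the payload handed down from above, and the receiving side peels the headers off in the reverse order.

```python
# Sketch of encapsulation: each layer prepends its header around the payload
# from the layer above; the data link layer also appends a trailer.

def encapsulate(message):
    # innermost header first (application), outermost last (network)
    for header in ["APP", "PRES", "SESS", "TRANS", "NET"]:
        message = f"[{header}]{message}"
    return f"[DL]{message}[DL-TRAILER]"   # data link: header and trailer

def decapsulate(frame):
    # each layer strips only its own header on the way up the stack
    assert frame.startswith("[DL]") and frame.endswith("[DL-TRAILER]")
    message = frame[len("[DL]"):-len("[DL-TRAILER]")]
    for header in ["NET", "TRANS", "SESS", "PRES", "APP"]:
        assert message.startswith(f"[{header}]")
        message = message[len(header) + 2:]
    return message

frame = encapsulate("hello")
assert decapsulate(frame) == "hello"
```

The point of the exercise: no layer needs to look inside another layer's header, which is exactly what makes the layers independently replaceable.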

[2.7] Layered Protocols (3)


Physical layer
Contains the specification and implementation of bits, and their transmission
between sender and receiver.

Data link layer
Describes how a series of bits is grouped into frames to allow error and flow control.

Network layer
Describes how packets in a network of computers are to be routed.

Transport Layer
Provides the actual communication facilities for most distributed systems.

Standard Internet protocols:


• TCP: connection-oriented, reliable, stream-oriented communication,
• UDP: unreliable (best-effort) datagram communication.

[2.8] Data Link Layer

Time  Messages in transit    Event
0     Data 0                 A sends data message 0
1     Data 0                 B gets 0, sees bad checksum
2     Data 1, Control 0      A sends data message 1; B complains about the checksum
3     Control 0, Data 1      Both messages arrive correctly
4     Data 0, Control 1      A retransmits data message 0; B says: "I want 0, not 1"
5     Control 1, Data 0      Both messages arrive correctly
6     Data 0                 A retransmits data message 0 again
7     Data 0                 B finally gets message 0

Discussion between a receiver and a sender in the data link layer.
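The retransmit-until-acknowledged idea behind this dialogue can be sketched as a stop-and-wait simulation in Python (a simplified model, not from the slides: losses of acknowledgements are ignored, and a seeded random generator stands in for an unreliable link).

```python
# Stop-and-wait sketch: the sender retransmits a frame until the receiver
# gets it intact, alternating a one-bit sequence number between frames.

import random

def transmit(frame, loss_rate, rng):
    """Deliver the frame, or lose/corrupt it with probability loss_rate."""
    return frame if rng.random() >= loss_rate else None

def stop_and_wait(data, loss_rate=0.3, seed=42):
    rng = random.Random(seed)       # deterministic "unreliable" channel
    received, seq = [], 0
    for item in data:
        while True:                              # retransmit until ACKed
            frame = transmit((seq, item), loss_rate, rng)
            if frame is not None:                # receiver got it intact
                received.append(frame[1])
                break                            # ACK: move to next frame
        seq ^= 1                                 # alternate 0/1 sequence bit
    return received

assert stop_and_wait(["m0", "m1", "m2"]) == ["m0", "m1", "m2"]
```

The sequence bit is what lets the receiver tell a retransmission of the old frame from a genuinely new one, just as B does in the table above.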

[2.9] Network level protocols


Network layer:


• IP packets

• ATM virtual channels (unidirectional connection-oriented protocol),

• collections of virtual channels grouped into virtual paths – predefined routes between pairs of hosts.

Transport layer:

• TCP, UDP

• RTP - Real-time Transport Protocol

• TP0 – TP4, the official ISO transport protocols.

[2.10] Client-Server TCP

[Figure: (a) normal TCP: a three-way handshake (SYN; SYN,ACK(SYN); ACK(SYN)), then the request, then FIN/ACK exchanges carrying the answer and tearing the connection down — nine messages in total; (b) transactional TCP: the request travels with the SYN and FIN, the answer with the server's SYN,ACK(FIN),FIN, and a final ACK(FIN) closes the exchange — three messages in total.]

(a) Normal operation of TCP. (b) Transactional TCP.

[2.11] Networking - review


Networking, keywords, review:

• routing in IP, default gateway,


• hardware: router, bridge, hub, switch, gateway, firewall, transceiver,

• domain name resolution,

• CIDR – classless interdomain routing,

• private networks (10.x.y.z, 172.16.x.y, 192.168.x.y),

• NAT.

[2.12] Above the Transport Layer


Many application protocols are implemented directly on top of transport protocols, each redoing a lot of application-independent work.

              News         FTP                    WWW
Transfer      NNTP         FTP                    HTTP
Naming        Newsgroup    Host + path            URL
Distribution  Push         Pull                   Pull
Replication   Flooding     Caching + DNS tricks   Caching + DNS tricks
Security      None (PGP)   Username + Password    Username + Password

[2.13] Middleware Protocols (1)


Middleware
An application that logically lives in the application layer, but which contains
many general-purpose protocols that warrant their own layers, independent of
other, more specific applications.

Middleware was invented to provide common services and protocols that can be used by many different applications.

Example protocols:

• open communication protocols,

• marshaling and unmarshaling of data, for systems integration,

• naming protocols, for resource sharing,

• security protocols, distributed authentication and authorization,

• scaling mechanisms, support for caching and replication.


[2.14] Middleware Protocols (2)

[Figure: a reference model with middleware as its own layer: application (6), middleware (5), transport (4), network (3), data link (2), physical (1), each with its own peer protocol on top of the network.]

An adapted ISO OSI reference model for networked communication.

[2.15] High-level Middleware Communication Services


Some high-level middleware protocol types:

1. remote procedure call,

2. remote object invocation,

3. message queuing services,

4. stream-oriented communication.

[2.16] Local Procedure Call

[Figure: parameter passing in a local call read(fd, buf, bytes): (a) the stack before the call, holding the main program's local variables; (b) the stack while the called procedure is active, with the parameters bytes, buf, fd, the return address, and read's local variables pushed on top.]


• Application developers familiar with simple procedure model,

• Procedures as black boxes (isolation),

• No fundamental reason not to execute procedures on separate machine.

[2.17] Remote Procedure Call


When we try to call procedures located on other machines, some subtle problems
exist:

• different address spaces,

• parameters and results have to be passed,

• both machines may crash.

Standard parameter-passing modes in function calls:

• call-by-value,

• call-by-reference,

• call-by-copy/restore.

[2.18] Steps in RPC

1. Client procedure calls client stub in normal way.

2. Client stub builds message, calls local OS.

3. Client’s OS sends message to remote OS.

4. Remote OS gives message to server stub.

5. Server stub unpacks parameters, calls server.

6. Server does work, returns result to the stub.

7. Server stub packs it in message, calls local OS.

8. Server’s OS sends message to client’s OS.


9. Client’s OS gives message to client stub.

10. Stub unpacks result, returns to client.
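The ten steps can be sketched in Python without a real network (names such as `client_stub` and `server_stub` are illustrative; JSON stands in for the wire format, and a direct function call stands in for the OS-to-OS message transfer).

```python
# Sketch of an RPC round trip: the client stub marshals the call into a byte
# message, the server stub unmarshals it, invokes the procedure, and marshals
# the result back.

import json

def add(i, j):                       # the remote procedure itself
    return i + j

SERVER_PROCS = {"add": add}

def server_stub(request_bytes):      # steps 4-7 on the server machine
    call = json.loads(request_bytes.decode())            # unpack parameters
    result = SERVER_PROCS[call["proc"]](*call["args"])   # call the server
    return json.dumps({"result": result}).encode()       # pack the reply

def client_stub(proc, *args):        # steps 1-3 and 9-10 on the client
    request = json.dumps({"proc": proc, "args": args}).encode()
    reply = server_stub(request)     # stands in for the OS/network transfer
    return json.loads(reply.decode())["result"]

k = client_stub("add", 4, 7)         # looks like an ordinary local call
```

To the caller, `client_stub("add", 4, 7)` is indistinguishable from a local `add(4, 7)` — which is exactly the transparency RPC aims at.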

[2.19] Passing Value Parameters (1)

[Figure: an RPC between a client machine and a server machine: (1) the client calls k = add(i,j) on the client stub; (2) the stub builds a message naming proc "add" with the values of i and j; (3) the message is sent across the network; (4) the server OS hands the message to the server stub; (5) the stub unpacks the message; (6) the stub makes a local call to the implementation of add; the reply follows the same path back.]

Steps involved in doing remote computation through RPC.

parameter marshaling – packing parameters into a message.

[2.20] Passing Value Parameters (2)

• IBM mainframes: EBCDIC character code,

• IBM personal computers: ASCII character code.

[Figure: a message holding the integer 5 and the string "JILL", shown byte by byte with its byte addresses.]

a. original message on the Pentium,

b. the message as being received on the SPARC,

c. the message after being inverted. The little numbers in boxes indicate the address of each byte.
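Python's struct module makes the byte-order problem easy to demonstrate: the "<" and ">" format prefixes select little- and big-endian layouts explicitly, which is exactly the choice a wire format must fix.

```python
# The same 32-bit integer has different byte layouts on little-endian
# (e.g. Pentium) and big-endian (e.g. SPARC) machines.

import struct

value = 5
little = struct.pack("<i", value)    # little-endian layout
big = struct.pack(">i", value)       # big-endian layout

assert little == b"\x05\x00\x00\x00"
assert big == b"\x00\x00\x00\x05"
assert little == bytes(reversed(big))

# Interpreting little-endian bytes as big-endian garbles the value:
assert struct.unpack(">i", little)[0] == 5 * 2**24
```

This is why RPC systems either agree on one network byte order or tag messages with the sender's format.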

[2.21] Extended RPC models – Doors

Door
A procedure in the address space of a server process that can be called by processes collocated with the server.

• local IPC to be much more efficient than networking,

• door to be registered to be called (door_create),

• in Solaris, each door has a file name (fattach),

• calling doors by door_call (OS makes an upcall),

• result returned to the client through door_return.

• benefit: a single mechanism (procedure calls) for effective communication in a distributed system,

• drawback: still the need to distinguish standard procedure calls, calls to other local processes, and calls to remote processes.

[2.22] Doors


[Figure: a client and a server process on the same computer; the operating system invokes the registered door in the server process and returns the result to the calling process.]

    /* server process */
    server_door(...)
    {
        ...
        door_return(...);
    }

    main()
    {
        ...
        fd = door_create(...);          /* create the door */
        fattach(fd, door_name, ...);    /* register it under a file name */
        ...
    }

    /* client process */
    main()
    {
        ...
        fd = open(door_name, ...);
        door_call(fd, ...);             /* OS makes an upcall into the server */
        ...
    }
[2.23] Asynchronous RPC (1)

[Figure: (a) the client calls the remote procedure and blocks until the reply arrives; (b) with asynchronous RPC the client blocks only until the server accepts the request, then continues while the server calls the local procedure.]

a. the interconnection between client and server in a traditional RPC,

b. the interaction using asynchronous RPC.


[2.24] Asynchronous RPC (2)

[Figure: the client calls the remote procedure, waits only until the request is accepted, and continues; when the server finishes the local procedure, it interrupts the client with a one-way RPC to return the results, which the client acknowledges.]

deferred synchronous RPC – asynchronous RPC combined with a second, one-way call made by the server to return the results,

one-way RPC – the client does not wait even for acceptance of the request (at the cost of reliability).
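Deferred synchrony can be imitated with a future: a sketch (invented names; a thread pool stands in for the remote server) in which the client submits the call, keeps computing, and collects the result only when it actually needs it.

```python
# Sketch of deferred synchronous RPC using a future: submit the call,
# continue working, block later only for the reply. A registered callback
# would correspond to the server's one-way RPC back to the client.

from concurrent.futures import ThreadPoolExecutor
import time

def remote_call(x):                  # stand-in for the remote procedure
    time.sleep(0.05)                 # simulated network + processing delay
    return x * x

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(remote_call, 6)   # request accepted, client returns
    other_work = sum(range(10))            # client keeps computing meanwhile
    result = future.result()               # block only when the reply is needed
```

The client thus overlaps its own computation with the server's work, which is the whole point of the asynchronous variants above.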

[2.25] Writing a Client and a Server

[Figure: uuidgen produces an interface definition file; the IDL compiler generates a header, a client stub, and a server stub; the client code and client stub (with #include of the header) are compiled and linked with the runtime library into the client binary, and likewise the server code and server stub into the server binary.]

Steps in writing a client and a server in DCE RPC. Let the developer concentrate
only on the client- and server-specific code. Leave the rest for RPC generators
and libraries.

[2.26] Binding a Client to a Server


The client must first locate the server machine, and then locate the server (its endpoint) on that machine.

[Figure: binding in DCE RPC: (1) the server registers its endpoint in the endpoint table of the DCE daemon on the server machine; (2) the server registers its service with the directory server; (3) the client looks up the server at the directory machine; (4) the client asks the DCE daemon for the endpoint; (5) the client does the RPC.]

Client-to-server binding in DCE – a separate daemon runs on each server machine.

[2.27] Remote Distributed Objects (1)


The basic idea of remote objects:

• data and operations encapsulated in an object,

• operations are implemented as methods, and are accessible through interfaces,

• object offers only its interface to clients,

• object server is responsible for a collection of objects,

• client stub (proxy) implements interface,

• server skeleton handles (un)marshaling and object invocation.

[2.28] Remote Distributed Objects (2)


[Figure: the client invokes a method on the proxy, which offers the same interface as the object; the marshalled invocation is passed across the network to the server machine, where the skeleton unmarshals it and invokes the same method on the object holding the state.]

Common organization of a remote object with client-side proxy.

[2.29] Remote Distributed Objects (3)


Compile-time objects
Language-level objects, from which proxy and skeletons are automatically gen-
erated.

Runtime objects
Can be implemented in any language, but require use of an object adapter that
makes the implementation appear as an object.

Transient object lives only by virtue of a server: if the server exits, so will the
object.

Persistent object lives independently from a server: if a server exits, the object's state and code remain (passively) on disk.

[2.30] Binding a Client to an Object (1)


Having an object reference allows a client to bind to an object:

• reference denotes server, object, and communication protocol,

• client loads associated stub code,

• stub is instantiated and initialized for specific object.


Remote-object references enable passing references as parameters, which was hardly possible with ordinary RPCs.

Two ways of binding:

Implicit: invoke methods directly on the referenced object.

Explicit: client must first explicitly bind to object before invoking it.

[2.31] Binding a Client to an Object (2)

a. Example with implicit binding using only global references.

b. Example with explicit binding using global and local references.

[2.32] RMI - Parameter Passing

[Figure: a client on machine A holds a local reference L1 to object O1 on A and a remote reference R1 to object O2 on machine B; when the client invokes a method on the server at machine C with L1 and R1 as parameters, a copy of O1 is shipped to C (pass by value), while for O2 only a copy of the reference R1 is passed.]


Objects sometimes passed by reference, but sometimes by value.

• a client running on machine A, a server on machine C,

• the client calls the server with two references as parameters, O1 and O2,
to local and remote objects,

• copying of an object as a possible side effect of invoking a method with an object reference as a parameter (transparency versus efficiency).

Chapter 3

Communication (II)

[3.1] Communication (II)

1. Layered Protocols

2. Remote Procedure Call

3. Remote Object Invocation

4. Message-oriented Communication

5. Stream-oriented Communication

[3.2] Persistence and Synchronicity in Communication (1)


Assumption – communication system organized as follows:

• applications are executed on hosts,

• each host connected to one communication server,

• buffers may be placed either on hosts or in the communication servers of the underlying network,

• example: an e-mail system.

persistent vs transient communication,

asynchronous communication – the sender continues immediately after it has submitted its message for transmission,

synchronous communication – the sender is blocked until its message is stored in a local buffer at the receiving host or actually delivered to the receiver.

[3.3] Persistence and Synchronicity in Communication (2)


Client/server computing is generally based on a model of synchronous communication:

• client and server to be active at the time of communication,

• client issues request and blocks until reply received,

• server essentially waits only for incoming requests and subsequently processes them.

Drawbacks of synchronous communication:

• client cannot do any other work while waiting for reply,

• failures to be dealt with immediately (the client is waiting),

• in many cases the model simply not appropriate (mail, news).

[3.4] Persistence and Synchronicity in Communication (3)

[Figure: a sending host and a receiving host, each running an application on a local OS behind a messaging interface, connected through communication servers that hold routing functions and local buffers; buffering in the communication servers is independent of the communicating hosts.]

General organization of a communication system in which hosts are connected through a network.

• queued messages sent among processes,

• sender not stopped in waiting for immediate reply,

• fault tolerance often ensured by middleware.


[3.5] Persistence and Synchronicity in Communication (4)

Persistent vs. transient communication

Persistent communication
A message is stored at a communication server as long as it takes to deliver it
at the receiver.

Transient communication
A message is discarded by a communication server as soon as it cannot be
delivered at the next server or at the receiver.

[3.6] Persistence and Synchronicity in Communication (5)

[Figure: letters move between post offices by pony and rider; at each post office, mail is stored and sorted, to be sent out depending on destination and on when a pony and rider are available.]

Persistent communication of letters back in the days of the Pony Express.

[3.7] Persistence and Synchronicity in Communication (6)


[Figure: six timing diagrams of sender A and receiver B, showing when A blocks and when B must be running.]

Different forms of communication:

a. persistent asynchronous (A sends and continues; the message is stored at B's location for later delivery),

b. persistent synchronous (A waits until the message is accepted; B need not be running),

c. transient asynchronous (A sends and continues; the message can be sent only if B is running),

d. receipt-based transient synchronous (A waits until the request is received),

e. delivery-based transient synchronous (A waits until the request is accepted for processing),

f. response-based transient synchronous (A waits for the reply).

[3.8] Message-Oriented Transient Communication

• socket interface introduced in Berkeley UNIX,


• another transport layer interface: XTI, the X/Open Transport Interface, formerly called the Transport Layer Interface (TLI), developed by AT&T.

socket
Communication endpoint to which an application writes data that are to be sent over the underlying network, and from which incoming data can be read.

[3.9] Berkeley Sockets (1)


Socket primitives for TCP/IP.

[3.10] Berkeley Sockets (2)

[Figure: the server issues socket, bind, listen, accept, then read/write and close; the client issues socket, connect, then write/read and close; the accept/connect pair is the synchronization point, after which communication proceeds.]

Connection-oriented communication pattern using sockets.
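The same pattern as a runnable loopback sketch in Python (illustrative only; a thread plays the server, and an ephemeral port avoids conflicts):

```python
# Server side: socket, bind, listen, accept, read, write, close.
# Client side: socket, connect, write, read, close.

import socket
import threading

def server(sock):
    conn, _ = sock.accept()          # synchronization point with connect()
    data = conn.recv(1024)           # read the client's request
    conn.sendall(data.upper())       # write the reply
    conn.close()                     # close the connection

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket
srv.bind(("127.0.0.1", 0))                                # bind (ephemeral port)
srv.listen(1)                                             # listen
port = srv.getsockname()[1]
threading.Thread(target=server, args=(srv,)).start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket
cli.connect(("127.0.0.1", port))                          # connect
cli.sendall(b"hello")                                     # write
reply = cli.recv(1024)                                    # read
cli.close()                                               # close
srv.close()
```

Note how little the primitives say about message boundaries or buffering — the "wrong level of abstraction" complaint that motivates MPI in the next section.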

[3.11] The Message-Passing Interface (MPI) (1)


MPI
Group of message-oriented primitives that would allow developers to easily write
highly efficient applications.

Sockets insufficient because:


• at the wrong level of abstraction supporting only send and receive primi-
tives,

• designed to communicate using general-purpose protocol stacks such as TCP/IP, not suitable for the high-speed interconnection networks used in COWs and MPPs (with different forms of buffering and synchronization).

[3.12] The Message-Passing Interface (MPI) (2)


MPI assumptions:

• communication within a known group of processes,

• each group with assigned id,

• each process within a group also with an assigned id,

• all serious failures (process crashes, network partitions) assumed to be fatal, without any recovery,


• a (groupID, processID) pair used to identify the source and destination of a message,

• only receipt-based transient synchronous communication (form d above) is not supported; all other forms are.

[3.13] The Message-Passing Interface (3)

Some of the most intuitive message-passing primitives of MPI.

[3.14] Message-Oriented Persistent Communication


Message-queueing systems = Message-Oriented Middleware (MOM)

The essence of MOM systems:

• offer the intermediate-term storage capacity for messages,

• target to support message transfers that are allowed to take minutes instead
of seconds or milliseconds,

• no guarantees about when or even if the message will be actually read,

• the sender and receiver can execute completely independently.
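A minimal sketch of this decoupling, with an in-process queue standing in for the communication servers (illustrative only, not a real MOM — a production system would persist the queue): the sender submits messages and continues immediately, while the receiver reads them at its own pace.

```python
# MOM sketch: sender and receiver execute independently; the queue holds
# messages until the receiver happens to read them.

import queue
import threading

mailbox = queue.Queue()              # stands in for a persistent message queue

def sender():
    for n in range(3):
        mailbox.put(f"msg-{n}")      # submit and continue, no waiting

received = []

def receiver():
    for _ in range(3):
        received.append(mailbox.get())   # read at the receiver's own pace

t1 = threading.Thread(target=sender)
t2 = threading.Thread(target=receiver)
t1.start(); t2.start()
t1.join(); t2.join()
```

Neither side ever waits for the other to be at a particular point — only the queue mediates, which is exactly the MOM property listed above.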

[3.15] Message-Queuing Model


Basic interface to a queue in a message-queuing system.

Most queuing systems also allow a process to install handlers as callback functions.

[3.16] Architecture of Message-Queuing Systems (1)

[Figure: sender and receiver each have a queuing layer on top of the local OS; the queuing layer looks up the transport-level address of a queue in an address look-up database that maps queue-level addresses to transport-level addresses.]

The relationship between queue-level addressing and network-level addressing.

source queue, destination queue, a database mapping queue names to network locations.

[3.17] Architecture of Message-Queuing Systems (2)


[Figure: applications with send and receive queues communicate through a network of routers; a message from sender A reaches receiver B via routers R1 and R2.]

The general organization of a message-queuing system with routers:

• may grow into an overlay network,

• may need dynamic routing schemes.

Queue managers:

• normally interact directly with applications,

• some operate as routers or relays.

[3.18] Message Brokers

Figure: the general organization of a message broker in a message-queuing system. The broker program sits on top of the queuing layer and uses a database with conversion rules to translate between the source and destination clients' formats.


Message broker
Acts as an application-level gateway in a message-queuing system. Its main purpose is to convert incoming messages to a format that can be understood by the destination application. It may provide routing capabilities.

[3.19] Notes on Message-Queuing Systems

• with message brokers it may be necessary to accept a certain loss of information during transformation,

• at the heart of a message broker lies a database of conversion rules,

• general message-queuing systems are not aimed at supporting only end users,

• they are set up to enable persistent communication,

• range of applications:

– e-mail, workflow, groupware, batch processing,

– integration of a collection of databases or database applications.

[3.20] Example: IBM MQSeries

Figure: general organization of IBM's MQSeries message-queuing system. Programs access queue managers through the MQ interface; message channel agents (MCAs) move messages asynchronously between queue managers over the internetwork, while local access uses synchronous RPC.

[3.21] Channels


Table: some attributes associated with message channel agents.

[3.22] Message Transfer (1)

Figure: the general organization of an MQSeries queuing network using routing tables and aliases. By using logical names, in combination with name resolution to local queues, it is possible to put a message in a remote queue.

[3.23] Message Transfer (2)


Table: primitives available in the IBM MQSeries MQI.

[3.24] Stream-Oriented Communication

• forms of communication in which timing plays a crucial role,


• example:
– an audio stream built up as a sequence of 16-bit samples each repre-
senting the amplitude of the sound wave as it is done through PCM
(Pulse Code Modulation),
– audio stream represents CD quality, i.e. 44100Hz,
– samples to be played at intervals of exactly 1/44100 sec,
• what facilities should a distributed system offer to exchange time-dependent
information such as audio and video streams?
– support for the exchange of time-dependent information = support
for continuous media,
– continuous (representation) media vs. discrete (representation) me-
dia.

[3.25] Support for Continuous Media


In continuous media:

• temporal relationships between data items are fundamental to correctly interpreting the data,

• timing is crucial.

Asynchronous transmission mode
Data items in a stream are transmitted one after the other, but there are no further timing constraints on when transmission of items should take place.

Synchronous transmission mode
A maximum end-to-end delay is defined for each unit in a data stream.

Isochronous transmission mode
It is necessary that data units are transferred on time. Data transfer is subject to bounded (delay) jitter.

[3.26] Data Stream (1)

Figure:

a. Setting up a stream between two processes across a network,

b. Setting up a stream directly between two devices.

• a stream is a sequence of data units and may be considered a virtual connection between a source and a sink,

• simple stream vs. complex stream (consisting of several related substreams).

[3.27] Data Stream (2)

Figure: an example of multicasting a stream to several receivers. An intermediate node, possibly with filters, forwards the stream to one sink over a lower-bandwidth link.


• problem: receivers may have different requirements with respect to the quality of the stream,

• filters adjust the quality of an incoming stream, differently for each outgoing stream.

[3.28] Specifying QoS (1)

Table: a flow specification.

Time-dependent requirements are among the Quality of Service (QoS) requirements.

[3.29] Specifying QoS (2)

Application

Irregular stream One token is added


of data units to the bucket every ∆T

Regular stream

The principle of a token bucket algorithm.

• tokens generated at a constant rate,

• tokens buffered in a bucket which has limited capacity.
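These two rules can be sketched in a few lines of Python (a simplified, single-threaded model; the class name and parameters are illustrative, not from the lectures):

```python
class TokenBucket:
    """Token bucket: tokens are generated at a constant `rate` (tokens per
    second) and buffered in a bucket of limited `capacity`; sending `units`
    data units consumes that many tokens."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start with a full bucket
        self.last = 0.0          # time of the last call

    def allow(self, units, now):
        """Return True if `units` may be sent at time `now`."""
        # Add the tokens generated since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if units <= self.tokens:
            self.tokens -= units
            return True
        return False
```

A burst of up to `capacity` units may pass at once; after that, the sustained output is limited to `rate` units per second, which is how an irregular input stream is turned into a regular one.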


[3.30] Setting Up a Stream

Figure: the basic organization of RSVP (Resource reSerVation Protocol), a transport-level protocol for resource reservation in a distributed system. The application's data stream passes through the local OS and data link layer, while an RSVP process performs policy and admission control and exchanges reservation requests with other RSVP-enabled hosts.

[3.31] Synchronization Mechanisms (1)

Figure: the principle of explicit synchronization at the level of data units. An application procedure on the receiver's machine reads two audio data units for each video data unit from the incoming stream.


Given a complex stream, how to keep the different substreams in synch?

[3.32] Synchronization Mechanisms (2)

Figure: the principle of synchronization as supported by high-level interfaces. Multimedia control is part of the middleware, and the application tells the middleware what to do with incoming streams.

Multiplexing of all substreams into a single stream and demultiplexing at the receiver; synchronization is handled at the multiplexing/demultiplexing point (as in MPEG).

Chapter 4

Synchronization (I)

[4.1] Synchronization (I)

1. Clock synchronization

2. Logical clocks

3. Global state (distributed snapshot)

4. Election algorithms

5. Mutual exclusion

Synchronization
Setting the time order of the set of events caused by concurrent processes.

[4.2] Clock Synchronization

Figure: when each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time. Here output.c, created after output.o, receives an earlier timestamp because the clock of the machine running the editor runs behind the clock of the machine running the compiler.

[4.3] Timers

• timer,
• registers associated with each crystal:
– counter,
– holding register;
• an interrupt is generated when the counter reaches 0,
• interrupt called every clock tick,
• impossible to guarantee two crystals run at exactly the same frequency,
• after getting out of sync, the difference in time values called clock skew.

[4.4] The Mean Solar Day

Figure: computation of the mean solar day. A transit of the sun occurs when the sun reaches the highest point of the day; because the earth also moves along its orbit, at the transit of the sun n days later the earth has rotated fewer than 360°. The period of the earth's rotation is not constant.

[4.5] Physical Clocks (1)


Transit of the sun The event of the sun reaching its highest apparent point in the sky.

Solar day The interval between two consecutive transits of the sun.

Solar second 1/86400th of a solar day.

• mean solar second (300 million years ago a year had about 400 days).

[4.6] Physical Clocks (2)


Sometimes we simply need the exact time, not just an ordering.
Solution: Universal Coordinated Time (UTC):

• based on the number of transitions per second of the cesium 133 atom
(pretty accurate),

• at present, the real time is taken as the average of some 50 cesium clocks around the world,

• introduces a leap second from time to time to compensate for the fact that days are getting longer.

NIST operates a shortwave radio station with call letters WWV from Fort Collins
in Colorado (a short pulse at the start of each UTC second). UTC is broadcast
through short wave radio and satellite. Satellites can give an accuracy of about
±0.5 ms.
Does this solve all our problems? Don’t we now have some global timing
mechanism? This timing is still way too coarse for ordering every event.

[4.7] Physical Clocks (3)

Figure: TAI seconds are of constant length, unlike solar seconds; leap seconds are introduced into UTC when necessary to keep it in phase with the sun.


• TAI – International Atomic Time,

• 86400 TAI seconds is about 3 msec less than a mean solar day,

• UTC – TAI with leap seconds whenever the discrepancy between TAI and
solar time grows to 800 msec.

[4.8] Physical Clocks (4)

Assumption: a distributed system with a UTC receiver somewhere in it.
Basic principle:

• every machine has a timer that generates an interrupt H times per second,

• there is a clock in machine p that ticks on each timer interrupt; denote the value of that clock by Cp(t), where t is UTC time,

• ideally, we have for each machine p that Cp(t) = t, or, in other words, dC/dt = 1,

• in practice: 1 − ρ ≤ dC/dt ≤ 1 + ρ,

• in order to keep any two clocks from differing by more than δ time units, they must be synchronized at least every δ/(2ρ) seconds.

[4.9] Clock Synchronization Algorithms

Figure: the relation between clock time C and UTC t when clocks tick at different rates. A fast clock has dC/dt > 1, a perfect clock dC/dt = 1, and a slow clock dC/dt < 1.

[4.10] Clock Synchronization Principles

Principle I Every machine asks a time server for the accurate time at least once
every δ/(2ρ) seconds.

• needs an accurate measure of the round trip delay, including interrupt handling and processing of incoming messages.

Principle II Let the time server scan all machines periodically, calculate an
average, and inform each machine how it should adjust its time relative to
its present time.

• probably gets every machine in sync.

• setting the time back is never allowed, therefore smooth adjustments.

[4.11] Clock Synchronization Algorithms


Clock synchronization algorithms:

• Cristian's Algorithm

• The Berkeley Algorithm

• Averaging Algorithms

[4.12] Cristian’s Algorithm

Figure: getting the current time from a time server. The client records T0 when sending the request and T1 when receiving CUTC; both are measured with the same clock, and I is the server's interrupt handling time.

• the client sets its clock to CUTC + (T1 − T0)/2,

• messages with T1 − T0 above some threshold are discarded as victims of network congestion,

• the message that came back fastest is taken as the most accurate one.
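The computation can be sketched as follows (function names are mine; a real implementation would also subtract the server's interrupt handling time I):

```python
def cristian_adjust(t0, t1, c_utc):
    """Estimate the current UTC time at the client from one probe.
    t0, t1: client clock at request send / reply receipt (same clock);
    c_utc:  time reported by the server. The one-way delay is assumed
    to be half the round trip, (T1 - T0) / 2."""
    return c_utc + (t1 - t0) / 2

def best_estimate(samples):
    """Given (t0, t1, c_utc) tuples from repeated probes, use the reply
    that came back fastest, as it is the most accurate one."""
    t0, t1, c_utc = min(samples, key=lambda s: s[1] - s[0])
    return cristian_adjust(t0, t1, c_utc)
```

For example, a reply carrying CUTC = 200 received after a round trip of 4 time units yields the estimate 202.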

[4.13] The Berkeley Algorithm

Figure: the Berkeley algorithm. The time daemon's clock reads 3:00 while the other machines read 2:50 and 3:25; after the exchange, all three adjust to the average, 3:05.

1. The time daemon asks all the other machines for their clock values.

2. The machines answer.

3. The time daemon tells everyone how to adjust their clock.
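The averaging step can be sketched as follows (a simplified model: a real daemon also compensates for message propagation times, and fast clocks are slowed down gradually rather than set back):

```python
def berkeley_adjustments(daemon_time, clocks):
    """Compute per-machine clock adjustments so that every clock,
    including the daemon's, moves to the average of all reported values.
    clocks: mapping machine name -> reported clock value."""
    values = [daemon_time] + list(clocks.values())
    average = sum(values) / len(values)
    adjustments = {name: average - value for name, value in clocks.items()}
    adjustments["daemon"] = average - daemon_time
    return adjustments
```

With the values from the figure expressed in minutes (daemon 180, machines 170 and 205) the average is 185, i.e. 3:05, giving adjustments of +5, +15 and -20.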

[4.14] Averaging Algorithms

• the previous methods are highly centralized,

• decentralized algorithms:

– divide time into fixed-length resynchronization intervals,

– the i-th interval starts at T0 + iR and runs until T0 + (i + 1)R, where R is a system parameter,

– machines broadcast the current time according to their clocks,

– another variation: correcting each message by considering the propagation time from the source,

• Internet: the Network Time Protocol (NTP), accuracy in the range of 1-50 msec.

[4.15] Logical Clocks

• often it is sufficient that all machines agree on the same time,

• only internal consistency matters, not whether the clocks are particularly close to the real time,

• what usually matters is not that all processes agree on what time it is, but rather that they agree on the order in which events occur,

• Lamport's algorithm, which synchronizes logical clocks,

• an extension to Lamport's approach, called vector timestamps.

[4.16] The Happened-Before Relationship


The happened-before relation on the set of events in a distributed system is the
smallest relation satisfying:
• if a and b are two events in the same process, and a comes before b, then
a → b.
• if a is the sending of a message, and b is the receipt of that message, then
a → b.
• if a → b and b → c, then a → c.
This introduces a partial ordering of events in a system with concurrently operating processes.

Concurrent events
Events a and b for which neither a → b nor b → a: nothing can be said about when the events happened or which event happened first.
[4.17] Logical Clocks (1)
How do we maintain a global view on the system’s behavior that is consistent
with the happened-before relation?
Solution: attach a time-stamp C(e) to each event e, satisfying the following
properties:
P1 If a and b are two events in the same process, and a → b, then we demand
that C(a) < C(b).
P2 If a corresponds to sending a message m, and b to the receipt of that message,
then also C(a) < C(b).

How to attach a time-stamp to an event when there’s no global clock?


Solution: maintain a consistent set of logical clocks, one per process.
[4.18] Logical Clocks (2)
Each process Pi maintains a local counter Ci and adjusts this counter according to the following rules:

1. For any two successive events that take place within Pi, Ci is incremented by 1.

2. Each time a message m is sent by process Pi, the message receives a time-stamp Tm = Ci.

3. Whenever a message m is received by a process Pj, Pj adjusts its local counter Cj:

Cj := max{Cj + 1, Tm + 1}.

• property P1 is satisfied by rule 1,

• property P2 is satisfied by rules 2 and 3.
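The three rules translate directly into a small counter class (a sketch; the class and method names are mine):

```python
class LamportClock:
    """Lamport logical clock for one process."""

    def __init__(self):
        self.counter = 0

    def event(self):
        # Rule 1: successive local events increment the counter by 1.
        self.counter += 1
        return self.counter

    def send(self):
        # Rule 2: sending is itself an event; the outgoing message
        # carries the time-stamp Tm = Ci returned here.
        self.counter += 1
        return self.counter

    def receive(self, tm):
        # Rule 3: Cj := max{Cj + 1, Tm + 1}, so the receipt is later
        # than both the previous local event and the send.
        self.counter = max(self.counter + 1, tm + 1)
        return self.counter
```

With one clock per process, properties P1 and P2 hold for the time-stamps these methods return.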

[4.19] Logical Clocks (3)

Figure: Lamport's algorithm example. In (a) three processes run clocks ticking at different rates and exchange messages A, B, C and D; in (b) the receiving clocks have been corrected so that every message is received at a later logical time than it was sent.

[4.20] Total Ordering with Logical Clocks


Still it can occur that two events happen at the same logical time. This may be avoided by attaching a process number to an event:

If Pi time-stamps event e with Ci(e).i, then Ci(a).i comes before Cj(b).j if and only if:

• Ci(a) < Cj(b), or

• Ci(a) = Cj(b) and i < j.

[4.21] Example: Totally-Ordered Multicasting

Figure: a replicated database in which update 1 is performed before update 2 at one replica, while update 2 is performed before update 1 at the other.

• this situation requires totally-ordered multicasting, which can be implemented with Lamport timestamps,

• each message is always timestamped with the current logical time of the sender,

• a received message is put into a local queue, ordered according to its timestamp, and the receiver multicasts an acknowledgement to the others,

• a process can deliver a queued message to the application it is running only when that message is at the head of the queue and has been acknowledged by every other process.

[4.22] Vector Timestamps (1)

• Lamport timestamps do not guarantee that if C(a) < C(b) then a indeed happened before b. Vector timestamps are required for that.

– each process Pi has an array Vi[1 . . . n], where Vi[j] denotes the number of events that process Pi knows have taken place at process Pj,

– when Pi sends a message m, it adds 1 to Vi[i], and sends Vi along with m as vector timestamp vt(m). Upon arrival, each other process knows Pi's timestamp.

• the timestamp vt(m) tells the receiver how many events in other processes have preceded m, and on which m may causally depend.

[4.23] Vector Timestamps (2)

• when a process Pj receives m from Pi with vt(m), it:

– updates each Vj[k] to max{Vj[k], vt(m)[k]},

– increments Vj[j] by 1.

• to support causal delivery of messages, assume you increment your own component only when sending a message. Then, Pj postpones delivery of m until:

– vt(m)[i] = Vj[i] + 1 and

– vt(m)[k] ≤ Vj[k] for all k ≠ i.

Example
Given V3 = [0, 2, 2] and vt(m) = [1, 3, 0]:
What information does P3 have, and what will it do after receiving m (from P1)?

[4.24] An example of Causal Delivery of Messages (1)


Assumptions:

• messages are multicast by each process to all other processes participating in the communication,

• all messages sent by one process are received in the same order by each other process,

• a reliable message sending mechanism,

• the order of messages from different processes is not forced.

Actions on the sender side:

1. Sending (multicasting) of the message.

Actions on the receiver side:

1. Receiving of the message by the communication layer.

2. Delivering of the message to the target process.

[4.25] An example of Causal Delivery of Messages (2)


Let

vtm - vector timestamp of message m,

VP - current vector of process P.

Rules
When message m is sent by process P, it is sent together with vector timestamp vtm built up in the following way:

1. vtm[P] = VP[P] + 1,

2. vtm[X] = VP[X] for all X different from P.

A received message m from P is delivered to the process Q only if the following conditions are met:

1. vtm[P] = VQ[P] + 1,

2. vtm[X] ≤ VQ[X] for all X different from P.

When message m is delivered to the process Q:

1. VQ[X] = max{VQ[X], vtm[X]}.
[4.26] An example of Causal Delivery of Messages (3)


Three processes: A, B, C with initial vectors VA = VB = VC = (0, 0, 0).
General scenario:

1. Process A multicasts request m1.

2. Process B multicasts reply m2 as a result of obtaining the request in message m1.

Goal:
All processes should deliver message m2 only after delivering message m1. If m2 is received first by the transport layer of some process, its delivery must be postponed until m1 has been received and delivered.

[4.27] An example of Causal Delivery of Messages (4)


A sends m1(0 + 1, 0, 0) = m1(1, 0, 0).

B receives m1(1, 0, 0) from A:

VB = (0, 0, 0), vtm1 = (1, 0, 0),

m1 is delivered at once because:

vtm1[A] = VB[A] + 1,

vtm1[X] ≤ VB[X] for all X different from A;

after m1 delivery the new value of VB is (1, 0, 0).

B sends m2(1, 0 + 1, 0) = m2(1, 1, 0).

A receives m2(1, 1, 0) from B:

VA = (1, 0, 0), vtm2 = (1, 1, 0),

m2 is delivered at once because:

vtm2[B] = VA[B] + 1,

vtm2[X] ≤ VA[X] for all X different from B;

after m2 delivery the new value of VA is (1, 1, 0).

[4.28] An example of Causal Delivery of Messages (5)


C receives m2(1, 1, 0) from B:

VC = (0, 0, 0), vtm2 = (1, 1, 0),

m2 delivery is postponed because:

vtm2[A] > VC[A], and A is different from B.

Comment:
We should not deliver the message m2 sent by B to process C yet, because at the time B sent it, B had already received some message from process A that C does not know about.

Perhaps that earlier message, already received by B but not yet by C, contained something important that C should receive before m2. C first has to deliver the message that B had already delivered before sending m2.

[4.29] An example of Causal Delivery of Messages (6)

C receives m1(1, 0, 0) from A:

VC = (0, 0, 0), vtm1 = (1, 0, 0),

m1 is delivered at once because:

vtm1[A] = VC[A] + 1,

vtm1[X] ≤ VC[X] for all X different from A;

after m1 delivery the new value of VC is (1, 0, 0).

Now C checks its delivery queue; m2 may be, and is, delivered because:

VC = (1, 0, 0), vtm2 = (1, 1, 0),

vtm2[B] = VC[B] + 1,

vtm2[X] ≤ VC[X] for all X different from B;

after m2 delivery the new value of VC is (1, 1, 0).

After the two multicasts A → BC and B → AC, the current values of the vector timestamps of the processes are: VA = VB = VC = (1, 1, 0).

[4.30] Global State (1)

Sometimes one wants to collect the current state of a distributed computation, called a distributed snapshot.
It consists of: (1) all local states and (2) messages currently in transit.

Figure: a distributed snapshot should reflect a consistent state. In the consistent cut (a), every message received before the cut was also sent before it; the cut in (b) is inconsistent because the sender of m2 cannot be identified with this cut.

[4.31] Global State (2)

• a collection of processes connected to each other through unidirectional point-to-point communication channels,

• any process P can initiate taking a distributed snapshot:

1. P starts by recording its own local state,

2. P subsequently sends a marker along each of its outgoing channels,

3. when Q receives a marker through channel C, its action depends on whether it has already recorded its local state:

• not yet recorded: it records its local state, and sends the marker along each of its outgoing channels,

• already recorded: the marker on C indicates that the channel's state should be recorded; all messages received since the time Q recorded its own state and before that marker are recorded as the channel's state,

4. Q is finished when it has received a marker along each of its incoming channels.

[4.32] Global State (3)


Distributed snapshot, channel state recording:


1. Process Q receives a marker for the first time and records its local state.

2. Q records all incoming messages.

3. Q receives a marker for its incoming channel and finishes recording the
state of the incoming channel.

[4.33] Election Algorithms


An algorithm may require that some process acts as a coordinator. How do we select this special process dynamically?

• in many systems the coordinator is chosen by hand (e.g. file servers); this leads to centralized solutions ⇒ a single point of failure,

• if a coordinator is chosen dynamically, to what extent can one speak about a centralized or distributed solution? Having a central coordinator does not necessarily make an algorithm non-distributed.

• is a fully distributed solution, i.e. one without a coordinator, always more robust than any centralized/coordinated solution? Fully distributed solutions are not necessarily better.

Example election algorithms:

• the bully algorithm,

• a ring algorithm.

66
CHAPTER 4. SYNCHRONIZATION (I)

[4.34] The Bully Election Algorithm (1)

Each process has an associated priority (weight). The process with the highest priority should always be elected as the coordinator.
How do we find the heaviest process?

• any process can start an election by sending an election message to all other processes (assuming you don't know the weights of the others),

• if process Pheavy receives an election message from a lighter process Plight, it sends a take-over message to Plight; Plight is out of the race,

• if a process doesn't get a take-over message back, it wins, and sends a victory message to all other processes.
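The message flow can be sketched as follows (the function name and trace format are mine; failures during the election itself are ignored, so the winner is simply the highest-numbered live process):

```python
def bully_election(alive, starter):
    """alive: set of live process ids; starter: the process that notices
    the coordinator crashed. Returns (coordinator, trace), where trace
    lists (kind, src, dst) messages."""
    trace = []
    p = starter
    while True:
        higher = sorted(q for q in alive if q > p)
        for q in higher:
            trace.append(("ELECTION", p, q))
        if not higher:
            # Nobody bigger answered: p wins and announces victory.
            for q in sorted(alive - {p}):
                trace.append(("COORDINATOR", p, q))
            return p, trace
        for q in higher:
            trace.append(("OK", q, p))   # p is out of the race
        # Every higher process holds its own election; following the
        # lowest of them is enough to reach the eventual winner.
        p = higher[0]
```

With processes 0-7 alive except the crashed coordinator 7, an election started by 4 ends with 6 as the new coordinator, matching the figures in [4.35] and [4.36].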

[4.35] The Bully Election Algorithm (2)

Figure: the bully election algorithm.

a. process 4 holds an election after the previous coordinator (7) has crashed,

b. processes 5 and 6 respond, telling 4 to stop,

c. now 5 and 6 each hold an election.

[4.36] The Bully Election Algorithm (3)

Figure: the bully election algorithm, continued.

d. process 6 tells 5 to stop,

e. process 6 wins and tells everyone.

[4.37] A Ring Algorithm (1)

Process priority is obtained by organizing the processes into a (logical) ring. The process with the highest priority should be elected as coordinator.

• any process can start an election by sending an election message to its successor; if a successor is down, the message is passed on to the next successor,

• if a message is passed on, the sender adds itself to the list; when it gets back to the initiator, everyone has had a chance to make its presence known,

• the initiator sends a coordinator message around the ring containing a list of all living processes; the one with the highest priority is elected as coordinator.
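One election round can be sketched as follows (names are mine; concurrent initiators, as in the figure, are handled in practice by discarding duplicate coordinator messages):

```python
def ring_election(ring, alive, starter):
    """ring: process ids in ring order; alive: set of live ids;
    starter: the initiator. The election message accumulates the ids of
    the live processes it visits; the highest one becomes coordinator."""
    members = [starter]
    i = ring.index(starter)
    while True:
        i = (i + 1) % len(ring)      # pass to the successor...
        succ = ring[i]
        if succ not in alive:        # ...skipping crashed processes
            continue
        if succ == starter:          # message came back to the initiator
            break
        members.append(succ)
    return max(members), members
```

For the ring 0-7 with coordinator 7 crashed, an election started by 5 collects [5, 6, 0, 1, 2, 3, 4] and elects 6.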

[4.38] A Ring Algorithm (2)

Figure: election in a ring. The previous coordinator (7) has crashed and gives no response; processes 2 and 5 both start elections, and each election message accumulates the identifiers of the live processes it visits ([5], [5,6], [5,6,0], ... and [2], [2,3], ...).

[4.39] Mutual Exclusion


A number of processes in a distributed system want exclusive access to some
resource.
Standard solutions:
• via a centralized server,
• completely distributed, with no topology imposed,
• completely distributed, making use of a logical ring.

[4.40] MutEx: A Centralized Algorithm

Figure: mutual exclusion with a centralized coordinator (panels a-c). The coordinator grants OK to the first requester, queues the second request without replying, and sends OK to the queued process on release.

1. Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.


2. Process 2 then asks permission to enter the same critical region. The
coordinator does not reply.

3. When process 1 exits the critical region, it tells the coordinator, which then replies to 2.

[4.41] MutEx: Ricart & Agrawala Algorithm (1)


Ricart & Agrawala algorithm – completely distributed, with no topology im-
posed.

• the same as Lamport except that acknowledgments aren’t sent. Instead,


replies (i.e. grants) are sent only when:

– the receiving process has no interest in the shared resource or


– the receiving process is waiting for the resource, but has lower priority
(known through comparison of time-stamps).

• in all other cases, reply is deferred, implying some more local administra-
tion.

[4.42] MutEx: Ricart & Agrawala Algorithm (2)

Figure: the Ricart & Agrawala algorithm (panels a-c). Processes 0 and 2 multicast requests with timestamps 8 and 12; process 0, having the lower timestamp, enters the critical region first, and 2 enters after receiving 0's OK.

1. Two processes want to enter the same critical region at the same moment.

2. Process 0 has the lowest timestamp, so it wins.

3. When process 0 is done, it sends an OK as well, so 2 can now enter the critical region.


[4.43] MutEx: A Token Ring Algorithm

Figure:

1. An unordered group of processes on a network.

2. A logical ring constructed in software (here the token circulates in the order 0, 2, 4, 9, 7, 1, 6, 5, 8, 3).

[4.44] Mutual Exclusion - Comparison

Algorithm      Messages per entry/exit    Delay before entry (in message times)    Potential problems
Centralized    3                          2                                        Coordinator crash
Distributed    2(n − 1)                   2(n − 1)                                 Crash of any process
Token Ring     1 to ∞                     0 to n − 1                               Lost token, process crash

A comparison of three mutual exclusion algorithms.

Chapter 5

Synchronization (II)

[5.1] Distributed Transactions

1. The transaction model

• ACID properties

2. Classification of transactions

• flat transactions,
• nested transactions,
• distributed transactions.

3. Concurrency control

• serializability,
• synchronization techniques
– two-phase locking,
– pessimistic timestamp ordering,
– optimistic timestamp ordering.

[5.2] The Transaction Model (1)

Figure: updating a master tape is fault tolerant. The previous inventory and today's updates are read from input tapes; the computer writes the new inventory to an output tape.

[5.3] The Transaction Model (2)

Table: examples of primitives for transactions.

[5.4] The Transaction Model (3)

a. transaction to reserve three flights commits,

b. transaction aborts when third flight is unavailable.


[5.5] ACID Properties


Transaction
Collection of operations on the state of an object (database, object composition,
etc.) that satisfies the following properties:

Atomicity All operations either succeed, or all of them fail. When the transaction fails, the state of the object will remain unaffected by the transaction.

Consistency A transaction establishes a valid state transition. This does not exclude the possibility of invalid, intermediate states during the transaction's execution.

Isolation Concurrent transactions do not interfere with each other. It appears to each transaction T that other transactions occur either before T, or after T, but never both.

Durability After the execution of a transaction, its effects are made permanent:
changes to the state survive failures.

[5.6] Transaction Classification


Flat transactions
The most familiar one: a sequence of operations that satisfies the ACID properties.

Nested transactions
A hierarchy of transactions that allows (1) concurrent processing of subtransactions, and (2) recovery per subtransaction.

Distributed transactions
A (flat) transaction that is executed on distributed data. Often implemented as a two-level nested transaction with one subtransaction per node.

[5.7] Flat Transactions – Limitations

• they do not allow partial results to be committed or aborted,

• the strength of the atomicity property of a flat transaction is also partly its weakness,

• solution: usage of nested transactions,

• difficult scenarios:

– a subtransaction commits but the higher-level transaction aborts,

– if a subtransaction commits and a new subtransaction is started, the second one has to have the results of the first one available.

[5.8] Distributed Transactions

• a nested transaction is logically decomposed into a hierarchy of subtransactions,

• a distributed transaction is a logically flat, indivisible transaction that operates on distributed data. Separate distributed algorithms are required for (1) handling the locking of data and (2) committing the entire transaction.

Figure: (a) a nested transaction whose subtransactions run on two different (independent) databases, an airline database and a hotel database; (b) a distributed transaction whose subtransactions operate on two physically separated parts of the same distributed database.

[5.9] Transaction Implementation

1. private workspace

• use a private workspace, by which the client gets its own copy of (the relevant part of) the database; when things go wrong delete the copy, otherwise commit the changes to the original,

• optimization: do not copy everything up front.

2. write-ahead log

• use a write-ahead log in which changes are recorded, allowing you to roll back when things go wrong.


[5.10] TransImpl: Private Workspace

Figure: the private workspace technique.

a. The file index and disk blocks for a three-block file,

b. The situation after a transaction has modified block 0 and appended block 3,

c. After committing.

[5.11] TransImpl: Write-Ahead Log

a. A transaction,


b.-d. The log before each statement is executed.

[5.12] Transactions: Concurrency Control (1)

Figure: general organization of managers for handling transactions. Transactions issue BEGIN_TRANSACTION/END_TRANSACTION and READ/WRITE primitives to the transaction manager; the scheduler issues LOCK/RELEASE or timestamp operations; the data manager executes the actual read/write operations.

[5.13] Transactions: Concurrency Control (2)

Figure: general organization of managers for handling distributed transactions. A single transaction manager talks to a scheduler on each of the machines A, B and C, and each scheduler controls a local data manager.

[5.14] Serializability (1)

a.-c. Three transactions T1, T2, and T3,

d. Possible schedules.


[5.15] Serializability (2)


Consider a collection E of transactions T1, ..., Tn. The goal is to conduct a serializable execution of E:

• the transactions in E are executed, possibly concurrently, according to some schedule S,

• schedule S is equivalent to some totally ordered execution of T1, ..., Tn.

Because we are not concerned with the computations of each transaction, a transaction can be modeled as a log of read and write operations.

Two operations OPER(Ti, x) and OPER(Tj, x) from different logs, both on the same data item x, may conflict at a data manager:

read-write conflict (rw) one is a read operation while the other is a write op-
eration on x,

write-write conflict (ww) both are write operations on x.
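The two conflict rules fit in a few lines. Operations are modeled here as (transaction, kind, item) tuples, which is an assumption of this sketch, not the lecture's notation:

```python
def conflicts(op1, op2):
    """Two operations conflict iff they come from different transactions,
    touch the same data item, and at least one of them is a write
    (the rw and ww conflicts defined above)."""
    t1, kind1, x1 = op1
    t2, kind2, x2 = op2
    return t1 != t2 and x1 == x2 and "W" in (kind1, kind2)

assert conflicts(("T1", "R", "x"), ("T2", "W", "x"))      # read-write conflict
assert conflicts(("T1", "W", "x"), ("T2", "W", "x"))      # write-write conflict
assert not conflicts(("T1", "R", "x"), ("T2", "R", "x"))  # reads never conflict
assert not conflicts(("T1", "W", "x"), ("T2", "W", "y"))  # different data items
```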

[5.16] Synchronization Techniques

1. Two-phase locking
Before reading or writing a data item, a lock must be obtained. After a
lock is given up, the transaction is not allowed to acquire any more locks.

2. Timestamp ordering
Operations in a transaction are time-stamped, and data managers are forced
to handle operations in timestamp order.

3. Optimistic control
Don’t prevent things from going wrong, but correct the situation if conflicts
actually did happen. Basic assumption: you can pull it off in most cases.

[5.17] Two-Phase Locking (1)

• clients do only READ and WRITE operations within transactions,

• locks are granted and released only by scheduler,

• locking policy is to avoid conflicts between operations.


1. When a client submits OPER(Ti, x), the scheduler tests whether it conflicts with an operation OPER(Tj, x) from any other client. If there is no conflict, it grants LOCK(Ti, x); otherwise it delays the execution of OPER(Ti, x).

• conflicting operations are thus executed in the same order as the locks are granted.

2. If LOCK(Ti, x) has been granted, the lock is not released until OPER(Ti, x) has been executed by the data manager.

3. Once RELEASE(Ti, x) has taken place, no more locks for Ti may be granted.
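The three rules can be sketched as a tiny scheduler. This is a deliberately simplified model (exclusive locks only, conflicting operations reported rather than queued), not a full 2PL implementation:

```python
class TwoPhaseScheduler:
    """Minimal two-phase-locking scheduler following the three rules above."""

    def __init__(self):
        self.locks = {}         # data item -> transaction holding its lock
        self.shrinking = set()  # transactions already past their lock point

    def acquire(self, tx, item):
        if tx in self.shrinking:
            raise RuntimeError(f"{tx} is past its lock point; no new locks")
        holder = self.locks.get(item)
        if holder not in (None, tx):
            return False        # conflict: the operation must be delayed
        self.locks[item] = tx   # rule 1: no conflict, grant the lock
        return True

    def release(self, tx, item):
        assert self.locks.get(item) == tx
        del self.locks[item]
        self.shrinking.add(tx)  # rule 3: the shrinking phase has begun

s = TwoPhaseScheduler()
assert s.acquire("T1", "x")        # T1 locks x
assert not s.acquire("T2", "x")    # T2 conflicts and must wait
s.release("T1", "x")
assert s.acquire("T2", "x")        # now T2 may proceed
```

After the release, any further acquire by T1 raises an error: that is precisely the two-phase property.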

[5.18] Two-Phase Locking (2)

[Figure: the number of locks held over time: a growing phase up to the lock point, followed by a shrinking phase.]
Two-phase locking.

[5.19] Two-Phase Locking (3)


Types of 2PL

Centralized 2PL A single site handles all locks,

Primary 2PL Each data item is assigned a primary site to handle its locks.
Data is not necessarily replicated,

Distributed 2PL Assumes data can be replicated. Each primary is responsible for handling locks for its data, which may reside at remote data managers.


Problems:
• deadlocks possible – handled by ordering lock acquisition, deadlock detection, or a timeout scheme,

• cascaded aborts – prevented by strict two-phase locking.

[5.20] 2PL: Strict 2PL

[Figure: the number of locks held over time: a growing phase up to the lock point, after which all locks are released at the same time.]
Strict two-phase locking.

[5.21] Pessimistic Timestamp Ordering (1)

• each transaction T has a timestamp ts(T) assigned,

• timestamps are unique (Lamport's algorithm),

• every operation that is part of T is timestamped with ts(T),

• every data item x has a read timestamp tsRD(x) and a write timestamp tsWR(x),

• if operations conflict, the data manager processes the one with the lowest timestamp,

• compared to locking (like 2PL): aborts are possible, but the scheme is deadlock free.
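These rules can be sketched per data item. The class below is a simplified model of my own (it omits the tentative writes shown in the next figure):

```python
class TSOItem:
    """Pessimistic timestamp ordering for one data item x: an operation
    that arrives with too low a timestamp forces its transaction to abort."""

    def __init__(self):
        self.value = None
        self.ts_rd = 0   # tsRD(x): highest timestamp that has read x
        self.ts_wr = 0   # tsWR(x): highest timestamp that has written x

    def read(self, ts):
        if ts < self.ts_wr:
            return "abort"    # a younger transaction already wrote x
        self.ts_rd = max(self.ts_rd, ts)
        return self.value

    def write(self, ts, value):
        if ts < self.ts_rd or ts < self.ts_wr:
            return "abort"    # x was already read/written at a later timestamp
        self.ts_wr, self.value = ts, value
        return "ok"

x = TSOItem()
assert x.write(ts=2, value="a") == "ok"
assert x.read(ts=1) == "abort"              # T1 is too old: T2 already wrote x
assert x.read(ts=3) == "a"
assert x.write(ts=2, value="b") == "abort"  # x was already read at timestamp 3
```

Note that there is no waiting anywhere, which is why the scheme is deadlock free; the price is the aborts.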

[5.22] Pessimistic Timestamp Ordering (2)

[Figure: timelines (a)-(h) comparing ts(T2) with tsRD(x), tsWR(x), and a tentative write timestamp tstent(x); depending on the ordering, T2's operation is performed (OK), performed as a tentative write, or aborted.]

(a)-(d) T 2 is trying to write an item, (e)-(f) T 2 is trying to read an item.

[5.23] Optimistic Timestamp Ordering


Assumptions:
• conflicts are relatively rare,
• go ahead and do whatever you want, solve conflicts later on,
• keep track of which data items have been read and written (private workspaces,
shadow copies),
• check possible conflicts at the time of committing.

Features:
• deadlock free with maximum parallelism,
• under conditions of heavy load, the probability of failure (and abort) goes
up substantially,

• mostly studied in the context of nondistributed systems,

• hardly ever implemented in commercial or prototype systems.
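The commit-time check can be sketched as backward validation: compare the committing transaction's read set against the write sets of transactions that committed while it was running. The set-based interface is my own simplification:

```python
def validate(read_set, committed_write_sets):
    """Backward-validation sketch: the committing transaction may proceed
    only if nothing it read was overwritten by a concurrently committed
    transaction; otherwise it must abort and restart."""
    for ws in committed_write_sets:
        if read_set & ws:     # someone overwrote an item we read
            return False
    return True

# T read {x, y}; a transaction that committed during T's run wrote {z}: fine.
assert validate({"x", "y"}, [{"z"}])
# If the concurrent commit wrote {x} instead, T's reads are stale: abort.
assert not validate({"x", "y"}, [{"x"}])
```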

[5.24] MySQL: Transactions (1)


By default, MySQL runs with autocommit mode enabled. This means that as
soon as you execute a statement that updates (modifies) a table, MySQL stores
the update on disk.


• SET AUTOCOMMIT = {0 | 1}

Start and stop transaction:

• START TRANSACTION | BEGIN [WORK]

• COMMIT [WORK] [AND [NO] CHAIN] [[NO] RELEASE]

• ROLLBACK [WORK] [AND [NO] CHAIN] [[NO] RELEASE]

[5.25] MySQL: Transactions (2)

• If you issue a ROLLBACK statement after updating a non-transactional table within a transaction, a warning occurs. Changes to transaction-safe tables are rolled back, but not changes to non-transaction-safe tables.

• InnoDB – transaction-safe storage engine,

• MySQL uses table-level locking for MyISAM and MEMORY tables, page-
level locking for BDB tables, and row-level locking for InnoDB tables.

• Some statements cannot be rolled back. In general, these include data definition language (DDL) statements, such as those that create or drop databases, those that create, drop, or alter tables or stored routines.

• Transactions cannot be nested. This is a consequence of the implicit COMMIT performed for any current transaction when you issue a START TRANSACTION statement or one of its synonyms.

[5.26] MySQL: Savepoints


The savepoints syntax:

• SAVEPOINT identifier

• ROLLBACK [WORK] TO SAVEPOINT identifier

• RELEASE SAVEPOINT identifier

Description:


• The ROLLBACK TO SAVEPOINT statement rolls back a transaction to the named savepoint. Modifications that the current transaction made to rows after the savepoint was set are undone in the rollback, but InnoDB does not release the row locks that were stored in memory after the savepoint.

• All savepoints of the current transaction are deleted if you execute a COMMIT, or a ROLLBACK that does not name a savepoint.
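These semantics can be mimicked with a toy in-memory model. This is a sketch of the savepoint behavior described above, not MySQL code, and it ignores locking entirely:

```python
class SavepointTx:
    """Toy savepoint semantics: ROLLBACK TO SAVEPOINT undoes only the
    changes made after the savepoint was set; COMMIT discards all
    savepoints of the transaction."""

    def __init__(self, row):
        self.row = dict(row)
        self.savepoints = {}            # name -> snapshot of the row

    def savepoint(self, name):
        self.savepoints[name] = dict(self.row)

    def update(self, key, value):
        self.row[key] = value

    def rollback_to(self, name):
        self.row = dict(self.savepoints[name])  # undo work after the savepoint

    def commit(self):
        self.savepoints.clear()         # all savepoints of the tx are deleted
        return self.row

tx = SavepointTx({"balance": 100})
tx.update("balance", 80)
tx.savepoint("sp1")
tx.update("balance", 0)
tx.rollback_to("sp1")
assert tx.row["balance"] == 80   # only the post-savepoint change was undone
assert tx.commit() == {"balance": 80} and tx.savepoints == {}
```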

[5.27] MySQL: Isolation Levels in InnoDB (1)


Isolation levels:

• SET [SESSION | GLOBAL] TRANSACTION ISOLATION LEVEL


{READ UNCOMMITTED | READ COMMITTED | REPEATABLE READ | SE-
RIALIZABLE}

• SELECT @@global.tx_isolation;

• SELECT @@tx_isolation;

• Suppose that you are running in the default REPEATABLE READ isola-
tion level. When you issue a consistent read (that is, an ordinary SELECT
statement), InnoDB gives your transaction a timepoint according to which
your query sees the database. If another transaction deletes a row and
commits after your timepoint was assigned, you do not see the row as
having been deleted. Inserts and updates are treated similarly.

[5.28] MySQL: Isolation Levels in InnoDB (2)

READ UNCOMMITTED SELECT statements are performed in a non-locking fashion, but a possible earlier version of a record might be used. Thus, using this isolation level, such reads are not consistent. This is also called a dirty read. Otherwise, this isolation level works like READ COMMITTED.

READ COMMITTED Consistent reads behave as in other databases: Each consistent read, even within the same transaction, sets and reads its own fresh snapshot.


REPEATABLE READ This is the default isolation level of InnoDB. All con-
sistent reads within the same transaction read the snapshot established by
the first such read in that transaction. You can get a fresher snapshot for
your queries by committing the current transaction and after that issuing
new queries.

SERIALIZABLE This level is like REPEATABLE READ, but InnoDB implicitly converts all plain SELECT statements to SELECT ... LOCK IN SHARE MODE.

Chapter 6

Consistency and Replication

[6.1] Consistency and Replication


Consistency and replication

1. Introduction

2. Data-centric consistency models

3. Client-centric consistency models

4. Consistency protocols

[6.2] Introduction
Two primary reasons for replicating data:

• reliability – to increase reliability of a system,

• performance – to scale in numbers and geographical area.

Reliability corresponds to fault tolerance, performance/scalability corresponds to


high availability.

The cost of replication:

• modifications have to be carried out on all copies to ensure consistency,

• when and how modifications need to be carried out, determines the price
of replication.

87
CHAPTER 6. CONSISTENCY AND REPLICATION

[6.3] Performance and Scalability


Main issue: To keep replicas consistent, we generally need to ensure that all conflicting operations are done in the same order everywhere.

Conflicting operations:
read–write conflict a read operation and a write operation act concurrently,
write–write conflict two concurrent write operations.
Guaranteeing global ordering on conflicting operations may be a costly operation,
downgrading scalability.

Solution: weaken consistency requirements so that hopefully global synchro-


nization can be avoided.

[6.4] Data-Centric Consistency Models (1)

[Figure: several processes, each operating on a local copy of the same distributed data store.]

The general organization of a logical data store, physically distributed and repli-
cated across multiple processes.
Consistency model
A contract between a (distributed) data store and processes, in which the data
store specifies precisely what the results of read and write operations are in the
presence of concurrency.

[6.5] Data-Centric Consistency Models (2)


Strong consistency models: Operations on shared data are synchronized:


• strict consistency (related to time),

• sequential consistency (what we are used to),

• causal consistency (maintains only causal relations),

• FIFO consistency (maintains only individual ordering).

Weak consistency models: Synchronization occurs only when shared data is


locked and unlocked:

• general weak consistency,

• release consistency,

• entry consistency.

Observation: The weaker the consistency model, the easier it is to build a


scalable solution.

[6.6] Strict Consistency


Strict consistency

Any read to a shared data item X returns the value stored by the
most recent write operation on X.

(a) P1: W(x)a
    P2: R(x)a

(b) P1: W(x)a
    P2: R(x)NIL R(x)a

Behavior of two processes, operating on the same data item.

a. a strictly consistent store,

b. a store that is not strictly consistent.

[6.7] Linearizability and Sequential Consistency (1)


Sequential Consistency


The result of any execution is the same as if the operations of all processes were executed in some sequential order, and the operations of each individual process appear in this sequence in the order specified by its program.

(a) P1: W(x)a
    P2: W(x)b
    P3: R(x)b R(x)a
    P4: R(x)b R(x)a

(b) P1: W(x)a
    P2: W(x)b
    P3: R(x)b R(x)a
    P4: R(x)a R(x)b

All processes should see the same interleaving of operations.


a. a sequentially consistent data store,
b. a data store that is not sequentially consistent.
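For tiny histories like those in the figure, the definition can be tested by brute force: search for one legal total order that respects every process's program order. The tuple encoding and function names below are my own, and the search is exponential, so this is purely didactic:

```python
from itertools import permutations

def legal(schedule):
    """A total order is legal if every read returns the value written by
    the most recent preceding write to that item (or NIL if none)."""
    val = {}
    for _proc, kind, x, v in schedule:
        if kind == "W":
            val[x] = v
        elif val.get(x, "NIL") != v:
            return False
    return True

def sequentially_consistent(histories):
    """Brute-force search for one legal interleaving that preserves each
    process's program order."""
    ops = [op for h in histories for op in h]
    for perm in permutations(ops):
        in_order = all(
            [op for op in perm if op[0] == h[0][0]] == h for h in histories
        )
        if in_order and legal(perm):
            return True
    return False

# The two panels of the figure, encoded as (process, kind, item, value):
a = [[("P1", "W", "x", "a")],
     [("P2", "W", "x", "b")],
     [("P3", "R", "x", "b"), ("P3", "R", "x", "a")],
     [("P4", "R", "x", "b"), ("P4", "R", "x", "a")]]
b = [h[:] for h in a]
b[3] = [("P4", "R", "x", "a"), ("P4", "R", "x", "b")]
assert sequentially_consistent(a)      # (a): one order satisfies everyone
assert not sequentially_consistent(b)  # (b): P3 and P4 disagree on the write order
```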

[6.8] Linearizability and Sequential Consistency (3)


linearizable = sequential + operations ordered according to a global time.


Four valid execution sequences for the presented processes. The vertical axis is
time.

[6.9] Causal Consistency (1)


Causal consistency
Writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines.

P1: W(x)a W(x)c


P2: R(x)a W(x)b
P3: R(x)a R(x)c R(x)b
P4: R(x)a R(x)b R(x)c

This sequence is allowed with a causally-consistent store, but not with a sequentially or strictly consistent store.

[6.10] Causal Consistency (2)

(a) P1: W(x)a
    P2: R(x)a W(x)b
    P3: R(x)b R(x)a
    P4: R(x)a R(x)b

(b) P1: W(x)a
    P2: W(x)b
    P3: R(x)b R(x)a
    P4: R(x)a R(x)b

a. a violation of a causally-consistent store,


b. a correct sequence of events in a causally-consistent store.

[6.11] FIFO Consistency (1)


FIFO consistency
Writes done by a single process are seen by all other processes in the
order in which they were issued, but writes from different processes
may be seen in a different order by different processes.


P1: W(x)a
P2: R(x)a W(x)b W(x)c
P3: R(x)b R(x)a R(x)c
P4: R(x)a R(x)b R(x)c

A valid sequence of events of FIFO consistency.

• PRAM consistency = pipelined RAM, writes from a single process can be pipelined,

• easy to implement by tagging each write operation with a (process, sequence number) pair.
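A sketch of that tagging idea, under the assumption that a replica buffers writes that arrive out of order from a given process (the class and names are invented for illustration):

```python
class FifoReceiver:
    """Writes arrive tagged with a (process, sequence number) pair; writes
    from the same process are applied in sequence order, buffering any that
    arrive early. Writes from different processes may still interleave
    arbitrarily, which is exactly FIFO (PRAM) consistency."""

    def __init__(self):
        self.next_seq = {}   # process -> next expected sequence number
        self.pending = {}    # (process, seq) -> buffered write
        self.applied = []    # order in which writes were applied locally

    def receive(self, proc, seq, write):
        self.pending[(proc, seq)] = write
        # apply as many in-order writes from this process as possible
        while (proc, self.next_seq.get(proc, 0)) in self.pending:
            n = self.next_seq.get(proc, 0)
            self.applied.append(self.pending.pop((proc, n)))
            self.next_seq[proc] = n + 1

r = FifoReceiver()
r.receive("P2", 1, "W(x)c")   # arrives early: buffered, not applied
r.receive("P2", 0, "W(x)b")   # fills the gap: both applied, in order
r.receive("P1", 0, "W(x)a")
assert r.applied == ["W(x)b", "W(x)c", "W(x)a"]
```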

[6.12] FIFO Consistency (2)

Statement execution as seen by the three earlier presented processes. The state-
ments in bold are the ones that generate the output shown.

[6.13] FIFO Consistency (3)


Two concurrent processes.


Sequential vs. FIFO consistency:

• FIFO consistency: counterintuitive results – both processes can be killed,

• sequential consistency: none of interleavings results in both processes


being killed,

• in sequential consistency, although the order is non-deterministic, at least


all processes agree what it is. This is not the case in FIFO consistency.

[6.14] Weak Consistency (1)


Weak consistency models

Introduction of explicit synchronization variables. Changes of local


replica content propagated only when an explicit synchronization
takes place.

Properties:

• accesses to synchronization variables associated with a data store are sequentially consistent,

• no operation on a synchronization variable is allowed to be performed until all previous writes have been completed everywhere,

• no read or write operation on data items is allowed to be performed until all previous operations on synchronization variables have been performed.

[6.15] Weak Consistency (2)


(a) P1: W(x)a W(x)b S
    P2: R(x)a R(x)b S
    P3: R(x)b R(x)a S

(b) P1: W(x)a W(x)b S
    P2: S R(x)a

a. a valid sequence of events for weak consistency,

b. an invalid sequence for weak consistency.

Issue: what is the simplest way to implement the weak consistency model when replication uses full replicas?

[6.16] Release Consistency (1)

P1: Acq(L) W(x)a W(x)b Rel(L)


P2: Acq(L) R(x)b Rel(L)
P3: R(x)a

A valid event sequence for release consistency.

[6.17] Release Consistency (2)


Release consistency properties:

• before a read or write operation on shared data is performed, all previous acquires done by the process must have completed successfully,

• before a release is allowed to be performed, all previous reads and writes by the process must have completed,

• accesses to synchronization variables are FIFO consistent (sequential consistency is not required).

Additional issues:

• lazy release consistency versus eager release consistency,


• barriers instead of critical regions possible.

[6.18] Entry Consistency (1)

• with release consistency, all local updates are propagated to other copies/servers
during release of shared data.

• with entry consistency, each shared data item is associated with a synchro-
nization variable.

• when acquiring the synchronization variable, the most recent values of its
associated shared data item are fetched.

Note: Where release consistency affects all shared data, entry consistency affects
only those shared data associated with a synchronization variable.
Question: What would be a convenient way of making entry consistency more
or less transparent to programmers?

[6.19] Entry Consistency (2)

P1: Acq(Lx) W(x)a Acq(Ly) W(y)b Rel(Lx) Rel(Ly)


P2: Acq(Lx) R(x)a R(y)NIL
P3: Acq(Ly) R(y)b

A valid event sequence for entry consistency.

[6.20] Summary of Consistency Models


a. Strong consistency models.

b. Weak consistency models.

[6.21] Client-Centric Consistency Models (1)

1. System model

2. Coherence models

• monotonic reads,
• monotonic writes,
• read-your-writes,
• write-follows-reads.

[6.22] Client-Centric Consistency Models (2)


Goal: Avoiding system-wide consistency, by concentrating on what specific
clients want, instead of what should be maintained by servers.
Background: Most large-scale distributed systems (i.e., databases) apply repli-
cation for scalability, but can support only weak consistency:

DNS updates are propagated slowly, and inserts may not be immediately visible.


NEWS articles and reactions are pushed and pulled throughout the Internet,
such that reactions can be seen before postings.

Lotus Notes geographically dispersed servers replicate documents, but make no


attempt to keep (concurrent) updates mutually consistent.

WWW caches all over the place, but there need be no guarantee that you are
reading the most recent version of a page.

[6.23] Consistency for Mobile Users

Example: Consider a distributed database to which you have access through


your notebook. Assume your notebook acts as a front end to the database.

• at location A you access the database doing reads and updates.

• at location B you continue your work, but unless you access the same
server as the one at location A, you may detect inconsistencies:

– your updates at A may not have yet been propagated to B


– you may be reading newer entries than the ones available at A
– your updates at B may eventually conflict with those at A

Note: The only thing you really want is that the entries you updated and/or read
at A, are in B the way you left them in A. In that case, the database will appear
to be consistent to you.

[6.24] Eventual Consistency

Eventual consistency

Consistency model in large-scale distributed replicated databases that tolerate a relatively high degree of inconsistency. If no updates take place for a long time, all replicas gradually become consistent.

[Figure: a portable computer issues read and write operations against a distributed and replicated database over a wide-area network; when the client moves to another location and (transparently) connects to another replica, the replicas need to maintain client-centric consistency.]

The principle of a mobile user accessing different replicas of a distributed


database.

[6.25] Monotonic Reads (1)


Monotonic reads

If a process reads the value of a data item x, any successive read


operation on x by that process will always return that same or a
more recent value.

(a) L1: WS(x1) R(x1)
    L2: WS(x1;x2) R(x2)

(b) L1: WS(x1) R(x1)
    L2: WS(x2) R(x2) WS(x1;x2)

The read operations performed by a single process P at two different local copies
of the same data store.

a. a monotonic-read consistent data store,

b. a data store that does not provide monotonic reads.


[6.26] Monotonic Reads (2)


Example

Automatically reading your personal calendar updates from different


servers. Monotonic reads guarantees that the user sees all updates,
no matter from which server the automatic reading takes place.

Example

Reading (not modifying) incoming mail while you are on the move.
Each time you connect to a different e-mail server, that server fetches
(at least) all the updates from the server you previously visited.

[6.27] Monotonic Writes (1)


Monotonic writes

A write operation by a process on a data item x is completed before


any successive write operation on x by the same process.

(a) L1: W(x1)
    L2: W(x1) W(x2)

(b) L1: W(x1)
    L2: W(x2)

The write operations performed by a single process P at two different local


copies of the same data store

a. a monotonic-write consistent data store.

b. a data store that does not provide monotonic-write consistency.

[6.28] Monotonic Writes (2)


Example

Updating a program at server S2, and ensuring that all components


on which compilation and linking depends, are also placed at S2.

Example


Maintaining versions of replicated files in the correct order every-


where (propagate the previous version to the server where the newest
version is installed).

[6.29] Read Your Writes


Read your writes

The effect of a write operation by a process on data item x, will


always be seen by a successive read operation on x by the same
process.

(a) L1: W(x1)
    L2: WS(x1;x2) R(x2)

(b) L1: W(x1)
    L2: WS(x2) R(x2)

a. a data store that provides read-your-writes consistency.

b. a data store that does not.

[6.30] Writes Follow Reads


Writes follow reads

A write operation by a process on a data item x following a previous


read operation on x by the same process, is guaranteed to take place
on the same or a more recent value of x that was read.

(a) L1: WS(x1) R(x1)
    L2: WS(x1;x2) W(x2)

(b) L1: WS(x1) R(x1)
    L2: WS(x2) W(x2)

a. a writes-follow-reads consistent data store,

b. a data store that does not provide writes-follow-reads consistency.


[6.31] Examples

Read-your-writes example

Updating your Web page and guaranteeing that your Web browser
shows the newest version instead of its cached copy.

Writes-follow-reads example

See reactions to posted articles only if you have the original posting
(a read “pulls in” the corresponding write operation).

[6.32] Consistency Protocols

Consistency protocol
Describes the implementation of a specific consistency model. We will concentrate only on sequential consistency.

• Primary-based protocols

– remote-write protocols,
– local-write protocols.

• Replicated-write protocols

– active replication,
– quorum-based protocols.

• Cache-coherence protocols (write-through, write-back)

[6.33] Remote-Write Protocols (1)


[Figure: a client's requests all go to a single (fixed) server for item x, with a backup server. Steps: W1 write request; W2 forward request to server for x; W3, W4 acknowledge write completed; R1 read request; R2 forward request to server for x; R3, R4 return response.]

Primary-based remote-write protocol with a fixed server to which all read and
write operations are forwarded.
[6.34] Remote-Write Protocols (2)

[Figure: a primary server for item x with backup servers. Steps: W1 write request; W2 forward request to primary; W3 tell backups to update; W4 acknowledge update; W5 acknowledge write completed; R1 read request, served from a local copy; R2 response to read.]

The principle of primary-backup protocol: read operations allowed on a locally


available copy, write operations forwarded to a fixed primary copy.

[6.35] Local-Write Protocols (1)


[Figure: a current and a new server for item x. Steps: 1. read or write request; 2. forward request to current server for x; 3. move item x to client's server; 4. return result of operation on client's server.]

Primary-based local-write protocol in which a single copy is migrated between


processes.

[6.36] Local-Write Protocols (2)

[Figure: an old and a new primary for item x, plus backup servers. Steps: W1 write request; W2 move item x to new primary; W3 acknowledge write completed; W4 tell backups to update; W5 acknowledge update; R1 read request; R2 response to read.]

Primary-backup protocol in which the primary migrates to the process wanting


to perform an update.

[6.37] Active Replication (1)


[Figure: client A replicates its invocation request; each of the three replicas B1, B2, and B3 forwards the invocation, so object C receives the same invocation three times.]

The problem of replicated invocations.

[6.38] Active Replication (2)

[Figure: (a) a coordinator of object B forwards the invocation request once to C's replicas C1 and C2; (b) a coordinator of object C returns a single result to B's replicas.]

a. forwarding an invocation request from a replicated object,

b. returning a reply to a replicated object.


[6.39] Quorum-Based Protocols

[Figure: twelve replicas A-L with the read and write quorums marked: (a) NR = 3, NW = 10; (b) NR = 7, NW = 6; (c) NR = 1, NW = 12.]

Three examples of the voting algorithm:

a. a correct choice of read and write set,

b. a choice that may lead to write-write conflicts,

c. a correct choice, known as ROWA (read one, write all).

Constraints: NR + NW > N and NW > N/2
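The two constraints are a one-liner to check, and the figure's three example choices fall out directly:

```python
def valid_quorum(n_r, n_w, n):
    """NR + NW > N prevents read-write conflicts (every read quorum overlaps
    every write quorum); NW > N/2 prevents write-write conflicts (any two
    write quorums overlap)."""
    return n_r + n_w > n and n_w > n / 2

# The three examples from the figure, with N = 12 replicas:
assert valid_quorum(3, 10, 12)      # (a) a correct choice
assert not valid_quorum(7, 6, 12)   # (b) NW = 6 is not a majority: ww conflicts possible
assert valid_quorum(1, 12, 12)      # (c) ROWA: read one, write all
```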

[6.40] Cache-Coherence Protocols


Cache coherence strategies:

• coherence detection strategy - when inconsistencies are detected,

• coherence enforcement strategy - how caches are kept consistent with the copies stored at servers.

When processes modify data:

• read-only cache - updates can be performed only by servers,

• write-through cache - clients directly modify cached data and forward


updates to servers,

• write-back cache - propagation of updates may be delayed by allowing


multiple writes to take place before informing servers.


Chapter 7

Fault Tolerance

[7.1] Fault Tolerance

1. Basic concepts - terminology

2. Process resilience

• groups and failure masking

3. Reliable communication

• reliable client-server communication


• reliable group communication

4. Distributed commit

• two-phase commit (2PC)


• three-phase commit (3PC)

[7.2] Dependability
A component provides services to clients. To provide services, the component
may require the services from other components ⇒ a component may depend
on some other component.
Dependability
A component C depends on C∗ if the correctness of C’s behavior depends on
the correctness of C∗’s behavior.
Properties of dependability:

107
CHAPTER 7. FAULT TOLERANCE

• availability readiness for usage,

• reliability continuity of service delivery,

• safety very low probability of catastrophes,

• maintainability how easily a failed system can be repaired.

For distributed systems, components can be either processes or channels.

[7.3] Fault Terminology

• Failure: When a component is not living up to its specifications, a failure


occurs.

• Error: That part of a component’s state that can lead to a failure.

• Fault: The cause of an error.

Different fault management techniques:


• fault prevention: prevent the occurrence of a fault,

• fault tolerance: build a component in such a way that it can meet its
specifications in the presence of faults (i.e., mask the presence of faults),

• fault removal: reduce the presence, number, seriousness of faults,

• fault forecasting: estimate the present number, future incidence, and the
consequences of faults.

[7.4] Different Types of Failures


Different types of failures. Crash failures are the least severe, arbitrary failures
are the worst.

[7.5] Failure Masking by Redundancy

[Figure: (a) three components A, B, and C connected in series; (b) the same circuit with each component triplicated (A1-A3, B1-B3, C1-C3) and voters V1-V9 inserted between the stages.]

Triple modular redundancy (TMR).

[7.6] Process Resilience


Process groups: Protect yourself against faulty processes by replicating and
distributing computations in a group.

[Figure: (a) a flat group; (b) a hierarchical group with a coordinator and workers.]


a. flat groups: good for fault tolerance as information exchange immediately


occurs with all group members. May impose more overhead as control is
completely distributed (hard to implement).

b. hierarchical groups: all communication through a single coordinator ⇒


not really fault tolerant and scalable, but relatively easy to implement.

[7.7] Groups and Failure Masking (1)

Group tolerance
When a group can mask any k concurrent member failures, it is said to be k-fault
tolerant (k is called degree of fault tolerance).

Assume that all members are identical and process all input in the same order. How large does a k-fault tolerant group need to be?

• assume crash/performance failure semantics ⇒ a total of k + 1 members


are needed to survive k member failures.

• assume arbitrary failure semantics, and group output defined by voting ⇒


a total of 2k + 1 members are needed to survive k member failures.

[7.8] Groups and Failure Masking (2)

Assumption: Group members are not identical, i.e., we have a distributed computation.

Problem: Nonfaulty group members should reach agreement on the same value.

Assuming arbitrary failure semantics, we need 3k + 1 group members to survive


the attacks of k faulty members.

We are trying to reach a majority vote among the group of loyalists, in the
presence of k traitors ⇒ we need 2k+1 loyalists. This is also known as Byzantine
failures.
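The three sizing rules can be collected into one helper; the model names are this sketch's own labels for the cases discussed above:

```python
def group_size(k, failure_model):
    """Minimum group size needed to survive k faulty members."""
    if failure_model == "crash":       # fail-silent members: k + 1 suffice
        return k + 1
    if failure_model == "arbitrary":   # identical members, output by voting
        return 2 * k + 1
    if failure_model == "byzantine":   # agreement in a distributed computation
        return 3 * k + 1
    raise ValueError(failure_model)

assert group_size(1, "crash") == 2
assert group_size(1, "arbitrary") == 3
assert group_size(1, "byzantine") == 4   # matches the figure: 4 generals, 1 traitor
```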

[7.9] Groups and Failure Masking (3)


[Figure: the message exchange among four generals, of which general 3 is the faulty process; x, y, and z are the arbitrary values it reports.]

The Byzantine generals problem for 3 loyal generals and 1 traitor.


a. the generals announce their troop strengths (in thousands of soldiers),

b. the vectors that each general assembles based on (a),

c. the vectors that each general receives in step 3.

[7.10] Groups and Failure Masking (4)

[Figure: the same exchange for three generals, of which general 3 is the faulty process.]

The same as before, except now with 2 loyal generals and one traitor.

[7.11] Reliable Communication


So far concentrated on process resilience (by means of process groups). What
about reliable communication channels?
Error detection:
• framing of packets to allow for bit error detection,

• use of frame numbering to detect packet loss.


Error correction:

• add so much redundancy that corrupted packets can be automatically corrected,

• request retransmission of lost, or last N packets.

Most of this work assumes point-to-point communication.

[7.12] Reliable RPC (1)


What can go wrong during RPC?

1. client cannot locate server

2. client request is lost

3. server crashes

4. server response is lost

5. client crashes

Notes:

1: relatively simple - just report back to client,

2: just resend message,

3: server crashes are harder, as no one knows what the server had already done.

[7.13] Reliable RPC (2)


If the server crashes, no one knows what the server had already done. We need to decide what we expect from the server.

[Figure: three message sequence charts of a REQ/REP exchange between a client and a server.]


(a) normal case (b) crash after execution (c) crash before execution.

Possible different RPC server semantics:


• at-least-once-semantics: the server guarantees it will carry out an operation at least once, no matter what.

• at-most-once-semantics: the server guarantees it will carry out an operation at most once.

[7.14] Reliable RPC (3)

4: Detecting lost replies can be hard, because it can also be that the server had
crashed. You don’t know whether the server has carried out the operation.

Possible solution: None, except that one can try to make your operations
idempotent – repeatable without any harm done if it happened to be
carried out before.

5: Problem: The server is doing work and holding resources for nothing
(called doing an orphan computation).
Possible solutions:

– orphan killed (or rolled back) by client when it reboots,

– broadcasting a new epoch number when recovering ⇒ servers kill orphans,

– requiring computations to complete within T time units; old ones are simply removed.

[7.15] Reliable Multicasting (1)


Basic model: There is a multicast channel c with two (possibly overlapping)
groups:
• the sender group S ND(c) of processes that submit messages to channel c,
• the receiver group RCV(c) of processes that can receive messages from
channel c.

Simple reliability If process P ∈ RCV(c) at the time message m was submitted


to c, and P does not leave RCV(c), m should be delivered to P.


Atomic multicast How to ensure that a message m submitted to channel c is


delivered to process P ∈ RCV(c) only if m is delivered to all members of
RCV(c).

[7.16] Reliable Multicasting (2)


If one can stick to a local-area network, reliable multicasting is "easy".
Let the sender log messages submitted to channel c:
• if P sends message m, m is stored in a history buffer,

• each receiver acknowledges the receipt of m, or requests retransmission at


P when noticing message lost,

• sender P removes m from history buffer when everyone has acknowledged


receipt.
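The three steps above can be sketched as a sender object (a minimal illustrative Python sketch; class and method names are not part of any real protocol): messages stay in the history buffer until every known receiver has acknowledged them, and NACKs are answered by retransmitting from that buffer.

```python
class ReliableMulticastSender:
    """Sketch of the basic history-buffer scheme for reliable multicast."""

    def __init__(self, receivers):
        self.receivers = set(receivers)
        self.history = {}        # seqno -> message, kept until all ACK
        self.acked = {}          # seqno -> set of receivers that ACKed
        self.next_seq = 0

    def send(self, message, transmit):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = message          # store in history buffer
        self.acked[seq] = set()
        for r in self.receivers:
            transmit(r, seq, message)
        return seq

    def on_ack(self, receiver, seq):
        self.acked[seq].add(receiver)
        if self.acked[seq] == self.receivers:
            del self.history[seq]            # everyone ACKed: safe to drop
            del self.acked[seq]

    def on_nack(self, receiver, seq, transmit):
        transmit(receiver, seq, self.history[seq])   # retransmit from buffer
```

Note how the sketch exposes the two scaling problems named below: `on_ack` is called once per receiver per message, and the constructor must know the full receiver set.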

Why doesn’t the basic algorithm scale?


• if RCV(c) is large, P will be swamped with feedback (ACKs and NACKs),

• sender P has to know all members of RCV(c).

[7.17] Basic Reliable-Multicasting Schemes

[Figure: a sender with a history buffer multicasts message M25 to four
receivers; three have Last = 24, one has Last = 23 (it missed message #24).
(a) Transmission of M25. (b) Feedback: three receivers send ACK 25, while the
fourth reports "Missed 24".]


A simple solution to reliable multicasting when all receivers are known and are
assumed not to fail: (a) message transmission and (b) reporting feedback.

[7.18] Scalable RM: Feedback Suppression

Idea: Let a process P suppress its own feedback when it notices another
process Q is already asking for a retransmission.
Assumptions:

• all receivers listen to a common feedback channel to which feedback mes-


sages are submitted,

• process P schedules its own feedback message randomly, and suppresses


it when observing another feedback message.

• random schedule needed to ensure that only one feedback message is even-
tually sent.

[Figure: four receivers schedule NACK timers T=3, T=4, T=1, T=2; the receiver
with T=1 fires first and multicasts its NACK on the feedback channel, the
others suppress theirs, so the sender receives only one NACK.]
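The scheme can be simulated in a few lines. This is an illustrative Python sketch under the assumptions stated above (a shared feedback channel, random back-off); the class and timer values are invented for illustration.

```python
import random

class Receiver:
    """Feedback-suppression sketch: schedule a NACK at a random delay and
    cancel it if another receiver's NACK is observed first."""

    def __init__(self, name):
        self.name = name
        self.nack_due = None     # scheduled time of our NACK, or None

    def detect_loss(self, now):
        self.nack_due = now + random.uniform(0, 4)   # random back-off

    def observe_nack(self):
        self.nack_due = None                         # suppress our own NACK

def run(receivers):
    # All receivers detect the loss at t=0; the earliest timer fires, its
    # NACK is multicast on the feedback channel, and the rest suppress.
    for r in receivers:
        r.detect_loss(0)
    first = min(receivers, key=lambda r: r.nack_due)
    for r in receivers:
        if r is not first:
            r.observe_nack()
    return [r for r in receivers if r.nack_due is not None]
```

Whatever the timer values, only one NACK survives; this is exactly why the random schedule matters — with identical deterministic timers, all NACKs would fire simultaneously.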

[7.19] Scalable RM: Hierarchical Solutions

Idea: Construct a hierarchical feedback channel in which all submitted messages


are sent only to the root. Intermediate nodes aggregate feedback messages before
passing them on.

Main challenge: Dynamic construction of the hierarchical feedback channels.


[Figure: a sender S at the root of a tree of local-area networks; each network
has a coordinator C connected by a (long-haul) connection, and coordinators
forward to local receivers R, aggregating feedback on the way back up.]

[7.20] Atomic Multicast


Idea: Formulate reliable multicasting in the presence of process failures in terms
of process groups and changes to group membership.
Guarantee: A message is delivered only to the non-faulty members of the
current group. All members should agree on the current group membership.
Keyword: Virtually synchronous multicast.

[Figure: processes P1–P4 over time. P1 joins the group (G = {P1,P2,P3,P4});
reliable multicast is done by multiple point-to-point messages; P3 crashes
(G = {P1,P2,P4}) and its partial multicast is discarded; later P3 rejoins
(G = {P1,P2,P3,P4}).]

[7.21] Virtual Synchrony (1)


[Figure: a message comes in from the network to the local OS, is received by
the communication layer, and is finally delivered to the application.]

The logical organization of a distributed system to distinguish between message receipt and message delivery.

[7.22] Virtual Synchrony (2)


Idea: We consider views V ⊆ RCV(c) ∪ SND(c).
Processes are added to or deleted from a view V through view changes to V∗. A
view change is to be executed locally by each P ∈ V ∩ V∗.

1. for each consistent state, there is a unique view on which all its members
agree. Note: implies that all non-faulty processes see all view changes in
the same order,

2. if message m is sent to V before a view change vc to V∗, then either all
P ∈ V that execute vc receive m, or no process P ∈ V that executes vc
receives m. Note: all non-faulty members in the same view get to see the
same set of multicast messages,

3. a message sent to view V can be delivered only to processes in V, and is


discarded by successive views.

A reliable multicast algorithm satisfying 1. – 3. is virtually synchronous.

[7.23] Virtual Synchrony (3)


A sender to a view V need not be member of V,


If a sender S ∈ V crashes, its multicast message m is flushed before S is removed
from V: m will never be delivered after the point that S ∉ V.
Note: Messages from S may still be delivered to all, or none, of the (non-faulty)
processes in V before they all agree on a new view to which S does not belong.
If a receiver P fails, a message m may be lost but can be recovered as we know
exactly what has been received in V. Alternatively, we may decide to deliver m
to members in V − P
Observation: Virtually synchronous behavior can be seen independent from the
ordering of message delivery. The only issue is that messages are delivered to
an agreed upon group of receivers.

[7.24] Virtually Synchronous Reliable Multicasting

Different versions of virtually synchronous reliable multicasting.

[7.25] Implementing Virtual Synchrony

[Figure: eight processes 0–7 in three stages (a)–(c): an unstable message, a
view change, flush messages, and installation of the new view, matching the
steps listed below.]

a. process 4 notices that process 7 has crashed and sends a view change.

b. process 6 sends out all its unstable messages, followed by a flush message.


c. process 6 installs the new view when it has received a flush message from
everyone else.

[7.26] Distributed Commit

• Two-phase commit (2PC)

• Three-phase commit (3PC)

Essential issue: Given a computation distributed across a process group, how


can we ensure that either all processes commit to the final result, or none of
them do (atomicity)?

[7.27] Two-Phase Commit (1)


Model: The client who initiated the computation acts as coordinator; processes
required to commit are the participants.

Phase 1a Coordinator sends VOTE_REQUEST to participants (also called a


pre-write).

Phase 1b When participant receives VOTE_REQUEST it returns either YES or


NO to coordinator. If it sends NO, it aborts its local computation.

Phase 2a Coordinator collects all votes; if all are YES, it sends COMMIT to
all participants, otherwise it sends ABORT.

Phase 2b Each participant waits for COMMIT or ABORT and handles accordingly.
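The coordinator side of these phases can be sketched in Python (an illustrative sketch only — message names follow the text, but the class, callbacks and string encoding are invented; real implementations also need timeouts and logging):

```python
class Coordinator:
    """2PC coordinator sketch following the states in the text
    (INIT, WAIT, COMMIT, ABORT); `send(participant, msg)` is a
    caller-supplied transport callback."""

    def __init__(self, participants):
        self.participants = participants
        self.state = "INIT"
        self.votes = {}

    def start(self, send):
        # Phase 1a: ask everyone to vote.
        self.state = "WAIT"
        for p in self.participants:
            send(p, "VOTE_REQUEST")

    def on_vote(self, participant, vote, send):
        # Phase 2a: decide once a NO arrives or all votes are in.
        self.votes[participant] = vote
        if vote == "NO":
            self.state = "ABORT"
            for p in self.participants:
                send(p, "GLOBAL_ABORT")
        elif len(self.votes) == len(self.participants):
            self.state = "COMMIT"
            for p in self.participants:
                send(p, "GLOBAL_COMMIT")
```

A single NO vote is enough to abort, and the coordinator blocks in WAIT until every vote has arrived — which is precisely the window in which a coordinator crash leaves participants stuck, as discussed for 3PC below.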

[7.28] Two-Phase Commit (2)

[Figure: (a) coordinator FSM: INIT --Commit / send Vote-request--> WAIT;
WAIT --Vote-abort / send Global-abort--> ABORT;
WAIT --Vote-commit / send Global-commit--> COMMIT.
(b) participant FSM: INIT --Vote-request / send Vote-abort--> ABORT, or
INIT --Vote-request / send Vote-commit--> READY;
READY --Global-abort / ACK--> ABORT;
READY --Global-commit / ACK--> COMMIT.]


a. the finite state machine for the coordinator in 2PC,

b. the finite state machine for a participant.

[7.29] 2PC – Failing Participant (1)


Consider participant crash in one of its states, and the subsequent recovery to
that state:

initial state no problem, as participant was unaware of the protocol,

ready state participant is waiting to either commit or abort. After recovery,


participant needs to know which state transition it should make → log the
coordinator’s decision,

abort state merely make entry into abort state idempotent, e.g., removing the
workspace of results,

commit state also make entry into commit state idempotent, e.g., copying workspace
to storage.

When distributed commit is required, having participants use temporary workspaces


to keep their results allows for simple recovery in the presence of failures.

[7.30] 2PC – Failing Participant (2)


Alternative: When a recovery is needed to the Ready state, check what the other
participants are doing. This approach avoids having to log the coordinator’s
decision.
Assume recovering participant P contacts another participant Q:


Result: If all participants are in the ready state, the protocol blocks. Apparently,
the coordinator is failing.

[7.31] 2PC – Coordinator

[7.32] 2PC – Participant


[7.33] 2PC – Handling Decision Requests

Actions for handling decision requests executed by separate thread.

[7.34] Three-Phase Commit (1)


Problem: with 2PC when the coordinator crashed, participants may not be able
to reach a final decision and may need to remain blocked until the coordinator
recovers.
Solution: three-phase commit protocol (3PC). The states of the coordinator
and each participant satisfy the following conditions:


• there is no single state from which it is possible to make a transition


directly to either a COMMIT or ABORT state,
• there is no state in which it is not possible to make a final decision, and
from which a transition to a COMMIT state can be made.

Note: not often applied in practice as the conditions under which 2PC blocks
rarely occur.
[7.35] Three-Phase Commit (2)

Phase 1a Coordinator sends VOTE_REQUEST to participants.

Phase 1b When participant receives VOTE_REQUEST it returns either YES or NO to coordinator. If it sends NO, it aborts its local computation.

Phase 2a Coordinator collects all votes; if all are YES, it sends PREPARE to all participants, otherwise it sends ABORT, and halts.

Phase 2b Each participant waits for PREPARE, or waits for ABORT after which it halts.

Phase 3a (Prepare to commit) Coordinator waits until all participants have ACKed receipt of the PREPARE message, and then sends COMMIT to all.

Phase 3b (Prepare to commit) Participant waits for COMMIT.

[7.36] Three-Phase Commit (3)

[Figure: (a) coordinator FSM for 3PC: INIT --Commit / Vote-request--> WAIT;
WAIT --Vote-abort / Global-abort--> ABORT;
WAIT --Vote-commit / Prepare-commit--> PRECOMMIT;
PRECOMMIT --Ready-commit / Global-commit--> COMMIT.
(b) participant FSM: INIT --Vote-request / Vote-abort--> ABORT, or
--Vote-request / Vote-commit--> READY;
READY --Global-abort / ACK--> ABORT;
READY --Prepare-commit / Ready-commit--> PRECOMMIT;
PRECOMMIT --Global-commit / ACK--> COMMIT.]

a. finite state machine for the coordinator in 3PC,


b. finite state machine for the participant.


Chapter 8

Distributed File System

[8.1] Distributed File System

1. Sun Network File System

2. The Coda File System

3. Plan 9: Resources Unified to Files

[8.2] Network File System (NFS)


NFS, basic idea: each file server provides a standardized view of its local file
system.
History of NFS:

• the 1st version internal to Sun,

• the 2nd version incorporated into SunOS 2.0,

• the 3rd (current) version – now undergoing major revisions.

NFS – not so much a true file system but a collection of protocols.

[8.3] NFS Architecture (1)

125
CHAPTER 8. DISTRIBUTED FILE SYSTEM

[Figure: (a) the remote access model: the file stays on the server and the
client's requests access the remote file there. (b) the upload/download model:
1. the file is moved to the client, 2. accesses are done on the client,
3. when the client is done, the file is returned to the server.]

a. the remote access model,

b. the upload/download model.

[8.4] NFS Architecture (2)

[Figure: on both client and server, a system call layer sits on top of a
virtual file system (VFS) layer; on the client, the VFS dispatches to the
local file system interface or to the NFS client and its RPC client stub; on
the server, the RPC server stub feeds the NFS server, which uses the local
file system interface. Client and server communicate over the network.]

The basic NFS architecture for UNIX systems.

[8.5] NFS Features

• NFS largely independent of local file system,

• supports hard and symbolic links,

• files named, accessed by means of Unix-like file handles,


• version 4

– create used for creating non-regular files,


– regular files created by open,
– server generally maintains state between operations on the same file,
– lookup attempts to resolve the entire name, also if it means crossing
mount points,
– supports compound procedures.

[8.6] File System Model

An incomplete list of file system operations supported by NFS.

[8.7] Communication


[Figure: (a) NFS version 3: separate LOOKUP (lookup name) and READ (read file
data) round trips. (b) NFS version 4: one compound procedure carries LOOKUP,
OPEN and READ, so the server looks up the name, opens the file and reads file
data in a single exchange.]

a. Reading data from a file in NFS version 3.

b. Reading data using a compound procedure in version 4.

[8.8] Stateless vs. Stateful Server

• NFS version 3:

– simplicity as the main advantage of the stateless approach,


– locking a file cannot be easily done,
– certain authentication protocols require maintaining state of clients.

• NFS version 4:

– expected to work across wide area network,


– clients can make effective use of caches requiring cache consistency
protocol,
– support for callback procedures by which a server can do an RPC to
a client.

[8.9] NFS - Naming (1)


[Figure: the server exports a directory (users, containing steen/mbox);
client A mounts it under remote/vu, client B under work/me, so the same
remote file mbox has a different path name on each client.]
Mounting (part of) a remote file system in NFS.

[8.10] NFS - Naming (2)

[Figure: server A exports a directory containing packages/draw, having itself
imported the subdirectory install from server B. The client imports the
directory from server A, but must explicitly import the subdirectory from
server B as well: an exported directory that contains an imported
subdirectory is not mounted transitively.]

[8.11] Automounting (1)


[Figure: 1. a lookup of "/home/alice" is intercepted by the automounter
through the local file system interface; 2. the automounter creates the
subdirectory "alice"; 3. it issues a mount request to the server; 4. the
server's subdirectory is mounted on "alice".]
A simple automounter for NFS.

[8.12] Automounting (2)

[Figure: the automounter mounts directories under /tmp_mnt/home and makes
/home/alice a symbolic link to "/tmp_mnt/home/alice".]


Whenever the command ls -l /home/alice is executed, the NFS server is contacted directly without involvement of the automounter.

[8.13] File Attributes

Some general mandatory (a) and recommended (b) file attributes in NFS.
Moreover one may have named attributes – an array of pairs (attribute, value).

[8.14] Semantics of File Sharing (1)


[Figure: (a) on a single machine with original file "ab": process A writes "c"
(1), and process B's subsequent read gets "abc" (2). (b) in a distributed
system with caching: client machine #1 reads "ab" (1) and writes "c" into its
cached copy (2), while a read on client machine #2 still gets "ab" from the
server (3).]

• On a single processor, when a read follows a write, the value returned by


the read is the value just written.

• In a distributed system with caching, obsolete values may be returned.

[8.15] Semantics of File Sharing (2)

Four ways of dealing with the shared files in a distributed system.

• NFS implements session semantics.


[8.16] File Locking in NFS

NFS version 4 operations related to file locking.

• v4: file locking integrated into file access protocol,

• lock failed ⇒

– error message and polling or


– client can request to be put on a FIFO-ordered list maintained by the
server (and still polling).

[8.17] Client Caching (1)

[Figure: the client application uses a memory cache and a disk cache on the
client machine; cache misses go over the network to the NFS server.]

Client-side caching in NFS.

[8.18] Client Caching (2)


[Figure: 1. the client asks for a file; 2. the server delegates the file and
the client keeps a local copy; 3. the server recalls the delegation;
4. the client returns the (updated) file.]
Using the NFS version 4 callback mechanism to recall file delegation.


• open delegation takes place when the client machine is allowed to locally
handle open and close operations from other clients on the same machine,

• recalling delegation requires callback support,

• NFS uses leases on cached attributes, file handles and directories.

[8.19] RPC Failures

[Figure: three timelines for a retransmitted request with XID = 1234:
(a) the original request is still being processed when the retransmission
arrives; (b) the reply has just been returned and cached; (c) the reply was
returned some time ago but was lost. In each case the server can answer the
retransmission from its reply cache instead of re-executing the request.]

Three situations for handling retransmissions (XID = transaction identifier).

a. the request is still in progress,

b. the reply has just been returned,


c. the reply was returned some time ago, but was lost.
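The XID-based duplicate handling behind all three cases can be sketched in a few lines. This is an illustrative Python sketch (the class, the `upper()` stand-in operation and the counter are invented; a real NFS server also bounds and ages its cache):

```python
class RpcServer:
    """Sketch of XID-based duplicate filtering: replies are cached per
    transaction identifier, so a retransmission is answered from the
    cache without re-executing the operation."""

    def __init__(self):
        self.reply_cache = {}   # XID -> cached reply
        self.executions = 0     # counts real executions, for illustration

    def handle(self, xid, request):
        if xid in self.reply_cache:          # retransmission: resend reply
            return self.reply_cache[xid]
        self.executions += 1                 # fresh request: execute once
        reply = request.upper()              # stand-in for the real work
        self.reply_cache[xid] = reply
        return reply
```

The cache makes a non-idempotent operation behave idempotently toward the client: however many times XID 1234 is retransmitted, the operation runs once.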

[8.20] File Locking in the Presence of Failures


If the server crashes and subsequently recovers, then:

• grace period:

– a client can reclaim locks that were previously granted to it,


– normal lock requests may be refused until the grace period is over.

Notice: leasing requires synchronization of client’s and server’s clocks.

[8.21] Security

[Figure: NFS version 3 security: client and server each run a virtual file
system layer with access control; the NFS client's RPC client stub talks to
the NFS server's RPC server stub over a secure channel.]

The NFS security architecture (version 3).

• system authentication,

• Diffie-Hellman key exchange (a public-key cryptosystem), but with only 192-bit keys in NFS,

• Kerberos.

[8.22] Secure RPCs


[Figure: NFS version 4 secure RPC: on both client and server machines, the
RPC stub uses RPCSEC_GSS on top of the GSS-API, under which Kerberos, LIPKEY
or other security mechanisms can be plugged in.]

Secure RPC in NFS version 4 (GSS - general security framework):

• LIPKEY - a public key system,

• clients to be authenticated using passwords,

• servers can be authenticated using a public key.

[8.23] Access Control


The classification of operations recognized by NFS with respect to access control.

[8.24] Users/ Processes by Access Control

The various kinds of users and processes distinguished by NFS with respect to
access control.

[8.25] The Coda File System

• developed at Carnegie Mellon University, main goal: high availability,

• advanced caching allows a client to continue operation despite being disconnected from a server,


• descendant of version 2 of the Andrew File System (AFS),

• Vice file servers and Virtue workstations with Venus processes,

• both Vice file server processes and Venus processes run as user-level processes,

• a user-level RPC system on top of UDP providing at-most-once semantics,

• trusted Vice machines run authentication servers,

• Coda appears as a traditional UNIX-based file system.

[8.26] Overview of Coda (1)

[Figure: the overall organization of AFS: Virtue clients have transparent
access to a shared collection of Vice file servers.]
The overall organization of AFS.

[8.27] Overview of Coda (2)


[Figure: inside a Virtue client machine, user processes and the Venus process
sit above the virtual file system layer in the local OS; Venus talks to Vice
servers through an RPC client stub over the network, while local files go
through the local file system interface.]

The internal organization of a Virtue workstation.

[8.28] Coda - communication

• RPC2, a system different from the ONC RPC used by NFS,

• offers reliable communication on top of the UDP protocol,

• a thread per RPC request,

• back messages are regularly sent by the server to the client,

• support for side effects – mechanisms for communication using application-specific protocols,

• support for multicasting; parallel RPC implemented by means of MultiRPC, fully transparent to callees,

• threads in Coda non-preemptive and entirely in user space,

• a separate thread handles all I/O operations, with low-level asynchronous I/O emulating synchronous I/O without blocking an entire process.

[8.29] Communication (1)


[Figure: the client application and the server communicate through RPC client
and server stubs over the RPC protocol, while client-side and server-side
"side effect" components exchange data directly over an application-specific
protocol.]

Side effects in Coda’s RPC2 system.

[8.30] Communication (2)

[Figure: (a) the server sends an invalidation message to each client one at a
time, waiting for each reply; (b) the server sends invalidation messages to
all clients in parallel and then collects the replies.]

a. sending an invalidation message one at a time,

b. sending invalidation messages in parallel.

[8.31] Naming


[Figure: naming is inherited from the server's name space: clients A and B
both mount the exported directory (with bin and pkg) under /afs, so all
clients see the same shared name space.]

Clients in Coda have access to a single shared name space.

[8.32] Volumes and File Identifiers

• volumes,

• only root nodes can act as mounting points,

• shared name space,

• file identifiers,

• RVID – replicated volume identifier,

• VID – volume identifier,

• volume replication database,

• volume location database,

• 64-bit handle identifying the file within the volume.

[8.33] File Identifiers


[Figure: resolving a Coda file identifier (RVID, file handle): the RVID is
looked up in the volume replication database, yielding volume identifiers
VID1 and VID2; each VID is looked up in the volume location database to find
the file server holding that volume; the file handle then identifies the file
within the volume at that server.]

The implementation and resolution of a Coda file identifier.

[8.34] Sharing Files in Coda

[Figure: one client opens file f for reading (session S_A) while another
opens it for writing (session S_B); when the writer closes, the server sends
an invalidate to the reader, giving the transactional behavior of sharing
files in Coda.]

The transactional behavior in sharing files in Coda.


[8.35] Transactional Semantics

• partition - part of the network isolated from the rest,

• recognition of different types of session (like the store session type),

• usage of versioning scheme,

• an update from a client is accepted only when it leads to the next version of the file,

• when a conflict occurs, the updates from the client’s session are undone and the client is forced to save its local version of the file for manual reconciliation,

• cache coherence maintained by means of callbacks,

• callback promise,

• callback break.

[8.36] Client Caching

[Figure: client A's first Open(RD) transfers file f from the server; client
B's write sessions (S_B) cause the server to send an invalidate (callback
break), so A's next Open(RD) transfers the file again; when no callback break
has occurred, a later open is answered "OK (no file transfer)" from the local
copy.]

The use of local copies when opening a session in Coda.

[8.37] Server Replication

• file servers may be replicated,

• Volume Storage Group (VSG),


• client’s Accessible VSG (AVSG),

• if the AVSG is empty, the client is said to be disconnected,

• consistency: Read-One, Write-All (ROWA),

• optimistic strategy for file replication,

• version vectors for conflicts detection.
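Version-vector conflict detection can be sketched directly. An illustrative Python sketch (function and server names invented): a vector maps each server in the VSG to the number of updates it has seen; a conflict is exactly the case where neither vector dominates the other, which arises when clients in different partitions updated the same file.

```python
def compare(v1, v2):
    """Compare two version vectors (server -> update count).
    Returns 'equal', 'first-newer', 'second-newer', or 'conflict'."""
    servers = set(v1) | set(v2)
    dom1 = all(v1.get(s, 0) >= v2.get(s, 0) for s in servers)
    dom2 = all(v2.get(s, 0) >= v1.get(s, 0) for s in servers)
    if dom1 and dom2:
        return "equal"
    if dom1:
        return "first-newer"
    if dom2:
        return "second-newer"
    return "conflict"   # concurrent updates in different partitions
```

When the partition heals, comparable vectors can be reconciled automatically (the dominated replica is brought up to date); an incomparable pair is the case Coda must hand to the user for manual reconciliation.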

[8.38] Server Replication

[Figure: servers S1, S2 and S3 replicate a file; the network between them is
broken, so client A's AVSG is {S1, S2} while client B's AVSG is {S3}.]

Two clients with different AVSG for the same replicated file.

[8.39] Coda - Hoarding

• hoarding – filling the cache in advance with the appropriate files,

• priority mechanism to ensure caching of useful data:

– user may store paths in hoard database (one per workstation),


– priority for each file based on the hoard database and last references,

• hoard walk invoked once every 10 minutes,

• cache in equilibrium, if:

– no uncached file with a higher priority,


– cache full or no uncached files with nonzero priority,


– each cached file is a copy of the one from client’s AVSG.

• anyway no guarantee.

[8.40] Disconnected Operation

[Figure: HOARDING --disconnection--> EMULATION --reconnection-->
REINTEGRATION --reintegration completed--> HOARDING; a disconnection during
reintegration returns the client to EMULATION.]

The state-transition diagram of a Coda client with respect to a volume.

• http://www.coda.cs.cmu.edu/

[8.41] Access Control


Classification of file and directory operations recognized by Coda with respect to access control.

• also: useful support for the listing of negative rights.

[8.42] Plan 9

• bringing back the idea of having a few centralized servers and numerous
client machines,

• developed by the Unix team at Bell Labs,


• file-based distributed system,
• all resources accessed in the same way (as files), including processes and
network interfaces,
• each server offers a hierarchical name space to the resources it controls,
• communication through the protocol 9P, tailored to file-oriented opera-
tions,

• for LAN Internet Link (IL) reliable datagram protocol, TCP for WAN.

[8.43] Plan 9: Resources Unified to Files

[Figure: general organization of Plan 9: a gateway (name space NS3, with a
network interface to the Internet), a file server (NS1) and a CPU server
(NS2, running processes); each client mounts the name spaces it needs — one
client, for instance, has mounted NS1 and NS2.]


General organization of Plan 9.

[8.44] Communication

Files associated with a single TCP connection in Plan 9.

• opening a telnet connection requires writing a special string such as "connect 192.31.231.42!23" to the ctl file.

[8.45] Processes

[Figure: the Plan 9 file server machine has an in-memory cache and a disk
cache in front of a WORM device; only the WORM contains the actual file
system.]
The Plan 9 file server.

[8.46] Resource Management

• let /net/inet denote the network interface,

• if M exports /net, a client can use M as a gateway by locally mounting


/net and subsequently opening /net/inet.


• multiple name spaces can be mounted at the same mount point, leading to a union directory,

• file systems appear to be Boolean or-ed,

• mounting order is important.

• Plan 9 implements UNIX file sharing semantics,

• all update operations always forwarded to the server.

[8.47] Naming

[Figure: a union directory in Plan 9: /remote merges file systems FS A and
FS B, so directories such as bin, src, lib, home and usr from both appear
under one mount point.]

A union directory in Plan 9.

• http://cm.bell-labs.com/plan9/

• http://www.vitanuova.com/inferno/

Chapter 9

Naming

[9.1] Names - Introduction


Usage of names:

• to share resources,

• to uniquely identify entities,

• to refer to locations.

A name can be resolved to the entity it refers to.


Name – a string of bits or characters that is used to refer to an entity.

• typical entities: hosts, printers, disks, files, processes, users, mailboxes,


newsgroups, Web pages, messages, network connections,

• an access point – special entity to access another entity,

• an address – the name of an access point,

• the address of an entity access point simply called an address of the entity.

[9.2] Names, Identifiers, Addresses

• if an entity offers more than one access point, it is not clear which address to use as a reference,

• location independent names,

149
CHAPTER 9. NAMING

An identifier – a name that is used to uniquely identify an entity.


An identifier has the following properties:

1. An identifier refers to at most one entity.

2. Each entity is referred to by at most one identifier.

3. An identifier always refers to the same entity (it is never reused).

Remarks:

• identifiers make it much easier to refer unambiguously to entities,

• human-friendly names in contrast to addresses and identifiers.

[9.3] Name Spaces (1)

• names in distributed systems organized into name spaces,

• name space may be represented as a labeled, directed graph,

– a leaf node represents a named entity,


– a directory node stores a table representing its outgoing edges as pairs (edge label, node identifier) – the directory table,

• root node of the naming graph,

• path name: N:<label-1, label-2, ..., label-n>,

• absolute path name (starts with root) vs. relative path name,

• global name – denotes the same entity in the whole system,

• local name – with interpretation depending on where the name is being


used.

[9.4] Name Spaces (2)


[Figure: a naming graph with root node n0 (edges home → n1 and keys → n5);
directory node n1 stores the table n2: "elke", n3: "max", n4: "steen";
node n4 has leaf entries .twmrc and mbox and an edge keys → n5, so both
"/keys" and "/home/steen/keys" refer to n5, while "/home/steen/mbox" names
the mbox leaf.]

A general naming graph with a single root node

• n5 can be referred to by /home/steen/keys as well as /keys,

• the idea of directed acyclic graph,

• in Plan9 all resources (processes, hosts, I/O devices, network interfaces)


named as files – single naming graph for all resources in a distributed
system,
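Resolution over such a graph is just repeated directory-table lookup. The following illustrative Python sketch encodes the graph from the figure above (node and edge names taken from it; the leaf identifier for mbox is invented):

```python
def lookup(graph, start, path):
    """Naming-graph lookup sketch: `graph` maps a directory node id to its
    directory table of (edge label -> node id) pairs."""
    node = start
    for label in path:
        node = graph[node][label]   # follow one labeled edge per step
    return node

# The naming graph from the figure: root n0, directory n1, etc.
GRAPH = {
    "n0": {"home": "n1", "keys": "n5"},
    "n1": {"elke": "n2", "max": "n3", "steen": "n4"},
    "n4": {"keys": "n5", "mbox": "mbox-leaf"},   # leaf id is illustrative
}
```

Both path names for n5 resolve to the same node, which is the point the figure makes: a node may be reachable via several absolute path names.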

[9.5] Name Resolution


Name resolution – the process of looking up a name, given a path name.

• a name lookup returns the identifier of a node from where the name resolution process continues,

• closure mechanism – knowing how and where to start name resolution,

– Unix file system: the inode of the root directory is the first inode in
the logical disk,
– ”000312345654” not recognizable as string, but recognizable as a
phone number,

• alias another name for the same entity,

• hard links versus symbolic links.

[9.6] Linking and Mounting (1)


[Figure: the same naming graph, but /home/steen/keys now leads to leaf node
n6, whose stored data is the string "/keys"; resolving the symbolic link
continues with that absolute path name and ends at n5.]

The concept of a symbolic link explained in a naming graph.

[9.7] Linking and Mounting (2)

• mount point – the directory node storing the node identifier,

• mounting point – the directory node in the foreign name space.

To mount a foreign name space in a distributed system the following information is required:

1. The name of an access protocol.

2. The name of the server.

3. The name of the mounting point in the foreign name space.

Remarks:

• each of these names has to be resolved,

• NFS as an example.

[9.8] Linking and Mounting (3)


[Figure: machine A's name server holds remote/vu as a mount point storing the
reference "nfs://flits.cs.vu.nl//home/steen"; resolving it crosses over the
network to machine B's name server for the foreign name space, where
home/steen contains mbox and keys.]

Mounting remote name spaces through a specific access protocol.

[9.9] Linking and Mounting (4)

• in DEC's GNS (Global Name Service) a new root is added, making all existing root nodes its children,

• names in GNS always (implicitly) include the identifier of the node from
where resolution should normally start,

• /home/steen/keys in NS1 expanded to n0:/home/steen/keys,

• hidden expansion,

• assumed that node identifiers universally unique,

• therefore: all nodes have different identifiers.

[9.10] Linking and Mounting (5)


[Figure: two name spaces NS1 (root n0, with home/elke, max, steen) and NS2
(root m0) are combined under a new GNS root with children vu and oxford;
expanded names such as "n0:/home/steen/keys" and "m0:/mbox" include the
identifier of the node where resolution should start.]

Organization of the DEC Global Name Service.

[9.11] Name Space Distribution (1)

• name space distribution,

• large scale name spaces partitioned into logical layers

– global layer,
– administrational layer,
– managerial layer.

• the name space may be divided into zones,

• a zone is a part of the name space that is implemented by a separate name server,

• availability and performance requirements are met through caching and replication.

[9.12] Name Space Distribution (2)


[Figure: the DNS name space partitioned into a global layer (top-level and
organization domains such as gov, mil, org, net, com, edu, jp, us, nl, sun,
yale, acm, ieee, vu, oce, keio, nec), an administrational layer (departments
such as cs, csl, eng, ai) and a managerial layer (hosts and files such as
ftp, www, pc24, robot, pub/globe/index.txt); a zone is indicated as a subtree
handled by one name server.]

An example partitioning of the DNS name space, including Internet-accessible files, into three layers.

[9.13] Name Space Distribution (3)

A comparison between name servers for implementing nodes from a large-scale name space partitioned into a global layer, an administrational layer, and a managerial layer.

[9.14] Implementation of Name Resolution (1)


Each client has access to a local name resolver:

• ftp://ftp.cs.vu.nl/pub/globe/index.txt


• root:<nl, vu, cs, ftp, pub, globe, index.txt>

Iterative name resolution:

• the name resolver hands the complete name to the root name server, but the root resolves only nl and returns the address of the associated name server,

• caching is restricted to the client’s name resolver; as a compromise, an intermediate name server can be shared by all clients.

Recursive name resolution:

• a name server passes the result to the next name server it finds,

• drawback: puts a higher performance demand on each name server,

• caching of results is more effective compared to iterative name resolution,

• communication costs may be reduced.
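The iterative variant can be sketched as a resolver loop. This is an illustrative Python sketch (server names and referral encoding invented): the client's resolver keeps control, each server resolves only the first remaining label and returns a referral to the next server, and the last server returns the address.

```python
def resolve_iterative(path, servers, root):
    """Iterative resolution sketch: `servers` maps a name server id to its
    table (label -> next server id, or the final address for the last
    label). Returns the address and the list of servers contacted."""
    contacted = []
    server, remaining = root, list(path)
    while True:
        contacted.append(server)
        result = servers[server][remaining[0]]   # server resolves one label
        remaining = remaining[1:]
        if not remaining:
            return result, contacted             # result is the address
        server = result                          # referral to next server
```

Counting `contacted` makes the communication cost visible: one round trip from the client per label, which is the cost the figure below compares against the recursive variant.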

[9.15] Implementation of Name Resolution (2)

[Figure: iterative resolution of <nl,vu,cs,ftp>: 1. the client's name
resolver sends <nl,vu,cs,ftp> to the root name server; 2. it returns #<nl>
and the name server for the nl node; 3–8. the resolver repeats with
<vu,cs,ftp>, <cs,ftp> and <ftp> against the nl, vu and cs name servers,
obtaining #<vu>, #<cs> and #<ftp>; the last nodes are managed by the same
server.]

The principle of iterative name resolution

[9.16] Implementation of Name Resolution (3)


[Figure: recursive resolution of <nl,vu,cs,ftp>: the client's name resolver
sends the full name to the root (1); the root forwards <vu,cs,ftp> to the nl
name server (2), which forwards <cs,ftp> to the vu server (3), which forwards
<ftp> to the cs server (4); the results #<ftp>, #<cs,ftp>, #<vu,cs,ftp> and
#<nl,vu,cs,ftp> propagate back (5–8).]

The principle of recursive name resolution.

[9.17] Implementation of Name Resolution (4)

Recursive name resolution of <nl, vu, cs, ftp>. Name servers cache intermediate
results for subsequent lookups.

[9.18] Implementation of Name Resolution (5)


[Figure: with recursive resolution the client performs a single long-distance exchange (R1) with the first name server, which communicates with the vu and cs name servers over shorter distances (R2, R3); with iterative resolution the client performs a long-distance exchange (I1, I2, I3) with each name server in turn.]

The comparison between recursive and iterative name resolution with respect to
communication costs.

[9.19] The DNS Name Space

The most important types of resource records forming the contents of nodes in
the DNS name space.

[9.20] DNS Implementation (1)


An excerpt from the DNS database for the zone cs.vu.nl.

[9.21] DNS Implementation (2)

Part of the description for the vu.nl domain which contains the cs.vu.nl domain.

[9.22] X.500


• directory service – special kind of naming service in which a client can look for an entity based on a description of properties instead of a full name,

• OSI X.500 directory service,

• Directory Information Base (DIB) – the collection of all directory entries in an X.500 directory service,

• each record in DIB uniquely named,

• unique name as a sequence of naming attributes,

• each attribute called a Relative Distinguished Name (RDN),

• Directory Information Tree (DIT) – a hierarchy of the collection of directory entries,

• DIT forms a naming graph in which each node represents a directory entry,

• each node in DIT may act as a directory in the traditional sense.

[9.23] The X.500 Name Space (1)

A simple example of an X.500 directory entry using X.500 naming conventions.

[9.24] The X.500 Name Space (2)


[Figure: part of the directory information tree: C=NL → O=Vrije Universiteit → OU=Math. & Comp. Sc. → CN=Main server, with leaves Host_Name=star and Host_Name=zephyr.]

Part of the directory information tree.

[9.25] X.500 Implementation

• DIT usually partitioned and distributed across several servers, known as Directory Service Agents (DSA),

• each DSA implements advanced search operations,

• clients represented by Directory User Agents (DUA),

• example: list of all main servers at the VU:

– answer=search(&(C=NL)(O=Vrije Universiteit)(OU=*)(CN=Main server))

• searching is generally an expensive operation.


[9.26] GNS Names


( /... or /.: ) + ( X.500 or DNS name ) + ( local name )

• /.../ENG.IBM.COM.US/nancy/letters/to/lucy

• /.../Country=US/OrgType=COM/OrgName=IBM/Dept=ENG/nancy/letters/to/lucy

• /.:/nancy/letters/to/lucy

[9.27] LDAP

• Lightweight Directory Access Protocol (LDAP),

• an application-level protocol implemented on top of TCP,

• LDAP servers as specialized gateways to X.500 servers,

• parameters of lookup and update simply passed as strings,

• LDAP contains:

– defined security model based on SSL,


– defined API,
– universal data exchange format, LDIF,

• http://www.openldap.org

[9.28] Naming versus Locating Entities (1)


Three types of names distinguished:

• human-friendly names,

• identifiers,

• addresses.


What happens if a machine ftp.cs.vu.nl is to move to ftp.cs.unisa.edu.au?

• recording the address of the new machine in the DNS for cs.vu.nl: whenever it moves again, its entry in the DNS for cs.vu.nl has to be updated as well,

• recording the name of the new machine in the DNS for cs.vu.nl: the lookup operation becomes less efficient.

Better solution: separate naming from locating entities by introducing identifiers.

[9.29] Naming versus Locating Entities (2)

[Figure: (a) names map directly to addresses; (b) names map through a naming service to an entity ID, and the ID maps through a location service to the current addresses.]

a. direct, single level mapping between names and addresses,

b. two-level mapping using identifiers.

• locating an entity is handled by means of a separate location service.
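The two-level mapping in (b) can be sketched in a few lines (a minimal model; the host name, identifier and addresses are illustrative). A move updates only the location service, and every name bound to the identifier keeps working:

```python
naming = {"ftp.cs.vu.nl": "id-42"}        # name -> stable entity ID (naming service)
location = {"id-42": {"130.37.24.11"}}    # ID -> current addresses (location service)

def lookup(name):
    """Two-level resolution: name -> ID -> addresses."""
    return location[naming[name]]

def move(entity_id, new_addr):
    """Relocation touches only the location service; names are untouched."""
    location[entity_id] = {new_addr}

move("id-42", "129.127.8.5")              # the machine moved to another network
```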

[9.30] Location service implementations

1. simple solutions

• broadcasting and multicasting,


– ARP to find the data-link address given only an IP address
• forwarding pointers,
– when an entity moves, it leaves behind a reference to its new
location

163
CHAPTER 9. NAMING

2. home-Based approaches,
3. hierarchical approaches.

[9.31] Forwarding Pointers (1)

[Figure: processes P1–P4; the proxy p in process P2 refers to the same skeleton as an identical proxy in P3; in P4, a local invocation on proxy p reaches the object through interprocess communication with the skeleton in the object's process.]

The principle of forwarding pointers using (proxy, skeleton) pairs.


• skeletons act as entry items for remote references, proxies as exit items,
• whenever an object moves from A to B, a proxy is installed on A referring to a skeleton on B.

[9.32] Forwarding Pointers (2)

[Figure: (a) an invocation request follows the chain of forwarding pointers and is sent to the object; the skeleton at the object's current process returns the current location; (b) the client proxy sets a shortcut, after which the intermediate skeleton is no longer referenced by any proxy.]

Redirecting a forwarding pointer, by storing a shortcut in a proxy.
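The redirection can be sketched as pointer-chain traversal with shortcutting (a toy model of the chain only, not the proxy/skeleton machinery; the locations "A", "B", "C" are illustrative):

```python
# forward[x] is the location x forwards to; None marks the object's current home.
forward = {"A": "B", "B": "C", "C": None}   # object currently at C

def invoke(start):
    """Follow forwarding pointers to the object, then store a shortcut."""
    node = start
    while forward[node] is not None:        # walk the chain A -> B -> C
        node = forward[node]
    if node != start:
        forward[start] = node               # shortcut: start now points at C
    return node

loc = invoke("A")
```

After the first invocation the chain from A is one hop long, mirroring the figure's "client proxy sets a shortcut" step.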

[9.33] Home-Based Approaches

[Figure: 1. the client sends a packet to the host at its home location; 2. the home agent returns the address of the host's current location; 3. the packet is tunnelled to the current location; 4. successive packets are sent directly to the current location.]

The principle of Mobile IP.


• home location with a home agent in the home LAN and a fixed IP address,
• whenever the mobile host enters another network, it requests a temporary care-of address, which is then registered at the home agent.

[9.34] Hierarchical Approaches (1)

[Figure: the root directory node dir(T) is associated with the top-level domain T; a subdomain S of T (S is contained in T) has directory node dir(S); leaf domains are contained in S.]

Hierarchical organization of a location service into domains, each having an associated directory node.
[9.35] Hierarchical Approaches (2)

[Figure: node M holds a location record for entity E with one field per child domain dom(N) that contains E (empty fields carry no data); the leaf nodes of domains D1 and D2 hold location records with only one field, containing E's address in that domain.]

An example of storing information of an entity having two addresses in different leaf domains.
[9.36] Hierarchical Approaches (3)

[Figure: a look-up request for entity E starts at the directory node of leaf domain D; a node that has no record for E forwards the request to its parent; the first node M that knows about E forwards the request down to the child holding E's record.]

Looking up a location in a hierarchically organized location service.

[9.37] Hierarchical Approaches (4)

[Figure: (a) an insert request for entity E travels up from domain D until it reaches the first node M that knows about E, where forwarding stops; (b) each node on the path down creates a record and stores a pointer, the leaf node storing the address itself.]

1. An insert request is forwarded to the first node that knows about entity E.

2. A chain of forwarding pointers to the leaf node is created.

Chapter 10

Peer-to-Peer Systems

[10.1] P2P Systems - Goals and Definition


Goal: to enable sharing of data and resources on a very large scale by eliminating any requirement for separately-managed servers and their associated infrastructure.
Goal: to support useful distributed services and applications using data and
computing resources present on the Internet in ever-increasing numbers.

• the scalability of standard services is limited when all the hosts must be owned and managed by a single service provider,
• administration and fault recovery costs tend to dominate.

Peer-to-peer systems: applications that exploit resources available at the edges


of the Internet - storage, cycles, content, human presence.

[10.2] P2P Systems - Features


Characteristics shared by the P2P systems:
• design ensures that each user contributes resources to the system,
• all the nodes in a P2P system have the same functional capabilities and
responsibilities, although they may differ in the resources that they con-
tribute,
• correctness of any operation does not depend on the existence of any
centrally-administered systems,
• often designed to offer a limited degree of anonymity to the providers and
users of resources,


• the key issues for P2P system efficiency:

– algorithms for data placement across many hosts and subsequent access to it,
– within those algorithms: workload balancing and ensuring availability without adding undue overheads.

[10.3] P2P Systems - History


Antecedents of P2P systems: distributed algorithms for placement or location of information, and early Internet-based services with multi-server, scalable and fault-tolerant architectures: DNS, Netnews/Usenet, classless inter-domain IP routing.
The potential for the deployment of P2P services emerged when a significant number of users had acquired always-on broadband connections (around 1999 in the USA).

Three generations of P2P systems:

1. launched by the Napster music exchange service,

2. file-sharing applications offering greater scalability, anonymity and fault


tolerance (Freenet, Gnutella, Kazaa, BitTorrent).

3. P2P middleware layers for application-independent management of dis-


tributed resources (Pastry, Tapestry, CAN, Chord, Kademlia).

[10.4] P2P Middleware Introduction


Middleware platforms for distributed resources management:

• designed to place resources and to route messages to them on behalf of


clients,

• relieve clients of decisions about placing resources and of holding resource address information,

• provide guarantee of delivery for requests in a bounded number of network


hops,

• resources identified by globally unique identifiers (GUIDs), usually derived as a secure hash from the resource's state,

• secure hashes make resources "self-certifying": clients receiving a resource can check the validity of the hash.


• inherently best suited to storage of immutable objects,

• usage for objects with dynamic state more challenging, usually addressed
by addition of trusted servers for session management and identification.

[10.5] IP and P2P Overlay Routing (1)

• Scale:

IP: IPv4 limited to 2^32 addressable nodes (in IPv6 to 2^128); addresses hierarchically structured, with much of the space preallocated according to administrative requirements.
OR: The GUID name space very large and flat (>2^128), allowing it to be much more fully occupied.

• Load balancing:

IP: Loads on routers are determined by network topology and associated


traffic patterns.
OR: Object locations can be randomized and hence traffic patterns are
divorced from the network topology.

• Network dynamics (addition/deletion of objects/nodes):

IP: IP routing tables are updated asynchronously on a best-efforts basis


with time constants on the order of 1 hour.
OR: Routing tables can be updated synchronously or asynchronously with
fractions of a second delays.

[10.6] IP and P2P Overlay Routing (2)

• Fault tolerance:

IP: Redundancy is designed into the IP network by its managers, ensuring


tolerance of a single router or network connectivity failure. n-fold
replication is costly.
OR: Routes and object references can be replicated n-fold, ensuring tol-
erance of n failures of nodes or connections.

• Target identification:


IP: Each IP address maps to exactly one target node.


OR: Messages can be routed to the nearest replica of a target object.

• Security and anonymity:

IP: Addressing is only secure when all nodes are trusted. Anonymity for
the owners of addresses is not achievable.
OR: Security can be achieved even in environments with limited trust. A
limited degree of anonymity can be provided.

[10.7] Distributed Computation (1)

• work with the first personal computers at Xerox PARC showed the feasibility of performing loosely-coupled compute-intensive tasks by running background processes on about 100 computers linked by a local network,

• Piranha/Linda and adaptive parallelism,

• SETI@home - most widely known project

– part of a wider project Search for Extra-Terrestrial Intelligence,


– stream of data partitioned into 107-second work units, each of about 350KB,
– each work unit distributed redundantly to 3-4 personal computers,
– distribution and coordination handled by a single server,
– 3.91 million computers participated by August 2002, resulting in the processing of 221 million work units,
– on average 27.36 teraflops of computational power,

[10.8] Distributed Computation (2)

• SETI@home didn't involve any communication or coordination between computers while processing the work units,

• although often described as P2P, such systems are really based on a client-server architecture,

• BOINC – Berkeley Open Infrastructure for Network Computing.


Similar scientific tasks:

• search for large prime numbers,

• attempts at brute-force decryption,

• climate prediction.

Grid projects - distributed platforms that support data sharing and the coordina-
tion of computation between participating computers on a large scale. Resources
are located in different organizations and are supported by heterogeneous com-
puter hardware, operating systems, programming languages and applications.

[10.9] Napster – Music Files P2P (1)

• launched in 1999 became very popular for music exchange,

• architecture: centralized replicated indexes, but users supplied the files


stored and accessed on their personal computers,

• locality – minimizing number of hops between client and server when


allocating a server to a client requesting a file,

• took advantage of special characteristics of the application:

– music files never updated, no need for consistency management,


– no guarantees required concerning availability of individual files (mu-
sic temporarily unavailable may be downloaded later).

• key to success: large, widely-distributed set of files available to users,

• Napster shut down as a result of legal proceedings instituted against Napster


service operators by the owners of the copyright in some of the material.

[10.10] Napster – Music Files P2P (2)

[Figure: 1. a peer sends a file location request to a Napster server; 2. the server returns a list of peers offering the file; 3. the peer sends a file request to one of them; 4. the file is delivered; 5. the new holder sends an index update to the Napster servers, which maintain replicated indexes.]

Napster: P2P file sharing with a centralized, replicated index. In step 5, clients are expected to add their own files to the pool of shared resources.

[10.11] P2P Middleware Requirements (1)


Function of the P2P middleware: to simplify the construction of services implemented across many hosts in a widely distributed network.

Expected functional requirements:

• enabling clients to locate and communicate with any individual resource


made available to a service,

• ability to add new resources and to remove them at will,

• ability to add hosts to the service and to remove them,

• offering simple programming interface independent of types of managed


distributed resources.

[10.12] P2P Middleware Requirements (2)


Expected non-functional requirements:

• global scalability,

• load balancing - random placement and usage of replicas,

• optimization for local interactions between neighbouring peers,

• accommodating to highly dynamic host availability,


• security of data in an environment with heterogeneous trust,

• anonymity, deniability and resistance to censorship.

[10.13] Routing Overlays


Routing overlay
A distributed algorithm which takes responsibility for locating nodes and objects
in P2P networks.
Randomly distributed identifiers (GUIDs) used to determine placement of ob-
jects and to retrieve them, thus overlay routing systems sometimes described as
distributed hash tables (DHT).
General tasks of a routing overlay layer:

• routing a request to an object, given its GUID,

• publishing a resource under a given GUID,

• servicing requests to remove a resource,

• reallocating responsibility among nodes as the set of participating peers changes.

[10.14] Routing Overlay – Identifiers


GUIDs – opaque identifiers that reveal nothing about the locations of the objects to which they refer. They are computed with a hash function (such as SHA-1) from all or part of the state of an object and are effectively unique; uniqueness can be verified by searching for another object with the same GUID.

Prefix routing - narrowing the search for the next node along the route by
applying a binary mask that selects an increasing number of hexadecimal digits
from the destination GUID after each hop.
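The GUID derivation and self-certification check can be sketched directly with a standard SHA-1 implementation (the object contents here are illustrative):

```python
import hashlib

def guid(data: bytes) -> str:
    """Derive a GUID as the SHA-1 hash of the object's state (160 bits, hex)."""
    return hashlib.sha1(data).hexdigest()

def verify(data: bytes, claimed_guid: str) -> bool:
    """Self-certification: recompute the hash and compare with the claimed GUID."""
    return guid(data) == claimed_guid

g = guid(b"Phil's Books")
```

A client that fetched the object and its GUID separately can detect any tampering with the content, since a modified object no longer hashes to the claimed GUID.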

[10.15] Routing Overlay – DHT

put(GUID, data)
The data is stored in replicas at all nodes responsible for the object identified by
GUID.
remove(GUID)
Deletes all references to GUID and the associated data.


value = get(GUID)
The data associated with GUID is retrieved from one of the nodes responsible
for it.

Basic programming interface for a distributed hash table (DHT) as implemented by the PAST API over Pastry.
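A toy in-memory model of this interface shows the core DHT idea, storing each item at the node whose GUID is numerically closest to the item's GUID (node GUIDs are small integers for clarity; replication and the overlay network itself are omitted):

```python
class ToyDHT:
    """Not Pastry/PAST: a single-process sketch of the put/get/remove API."""

    def __init__(self, node_guids):
        self.nodes = {g: {} for g in node_guids}   # per-node local store

    def _responsible(self, guid):
        # the node with GUID numerically closest to the item's GUID
        return min(self.nodes, key=lambda n: abs(n - guid))

    def put(self, guid, data):
        self.nodes[self._responsible(guid)][guid] = data

    def get(self, guid):
        return self.nodes[self._responsible(guid)].get(guid)

    def remove(self, guid):
        self.nodes[self._responsible(guid)].pop(guid, None)

dht = ToyDHT([0, 100, 200])
dht.put(90, "song.mp3")          # stored at node 100, the closest to 90
```

In a real deployment `put` would also replicate the data at the r numerically closest nodes, as described in the next section.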

[10.16] Routing Overlay – DOLR

publish(GUID)
GUID can be computed from the object (or some part of it, e.g. its name). This
function makes the node performing a publish operation the host for the object
corresponding to GUID.
unpublish(GUID)
Makes the object corresponding to GUID inaccessible.
sendToObj(msg, GUID, [n])
Following the object-oriented paradigm, an invocation message is sent to an
object in order to access it. This might be a request to open a TCP connection
for data transfer or to return a message containing all or part of the object’s state.
The final optional parameter [n], if present, requests the delivery of the same
message to n replicas of the object.

Basic programming interface for distributed object location and routing (DOLR)
as implemented by Tapestry.

[10.17] Routing Overlay – Routing and Location


DHT:

• when data is submitted to be stored with its GUID, the DHT layer takes responsibility for choosing a location, storing it (with replicas) and providing access,

• a data item with GUID X is stored at the node whose GUID is numerically closest to X, and moreover at the r hosts with GUIDs numerically closest to it, where r is a replication factor chosen to ensure high availability.

DOLR:

• locations for the replicas of data objects decided outside the routing layer,


• host address of each replica notified to DOLR using the publish() operation.

[10.18] Routing Overlay – Prefix Routing


Prefix routing:

• both Pastry and Tapestry employ prefix routing to determine routes,

• prefix routing is based on applying a binary mask that selects an increasing number of hexadecimal digits from the destination GUID after each hop (similar to CIDR in IP).

Other possible routing schemes:

• based on numerical difference between the GUIDs of the selected node


and the destination node (Chord),

• usage of distance in a d-dimensional hyperspace into which nodes are


placed (CAN),

• usage of the XOR of pairs of GUIDs as a metric for distance between


nodes (Kademlia).
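The different distance notions are easy to make concrete. A short sketch of two of them, the shared hex-prefix length (as in Pastry/Tapestry-style prefix routing) and the XOR metric (as in Kademlia), over GUIDs written as hex strings:

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading hexadecimal digits two GUIDs share (prefix routing)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def xor_distance(a: str, b: str) -> int:
    """Kademlia's metric: XOR of the GUIDs, interpreted as an integer."""
    return int(a, 16) ^ int(b, 16)

p = common_prefix_len("65a1fc", "65b2ae")   # the GUIDs share the prefix "65"
```

Each hop of prefix routing tries to increase `common_prefix_len` with the destination; Kademlia instead always forwards to a known node with a smaller `xor_distance` to the target.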

[10.19] P2P - Human-readable Names

• GUIDs are not human-readable, some form of indexing service using


human-readable names or search requests required,

• weakness of centralized indexes evidenced by Napster,

• example: indices on web pages in BitTorrent. Definitions: seed – peers


with complete copy of the torrent still offering upload; swarm – all peers
including seeds sharing a torrent,

• in BitTorrent a web search index leads to a stub file containing details of the desired resource. The torrent file contains metadata about all the files it makes downloadable, including: names, sizes, checksums of all pieces in the torrent, and the address of a tracker that coordinates communication between the peers in the swarm,

• tracker – server that keeps track of which seeds and peers are in the
swarm, not directly involved in the data transfer, does not have copies of
data files.


• clients report information to the tracker periodically and in exchange re-


ceive information about other clients that they can connect to.

[10.20] Pastry - Introduction


Pastry: message routing infrastructure deployed in several applications including
PAST, an archival (immutable) file storage system implemented as a distributed
hash table with DHT API and in Squirrel, a P2P web caching service.

• 128-bit GUIDs (from a hash function such as SHA-1) randomly distributed in the range 0 to 2^128 − 1,

• in a network with N participating nodes, the Pastry routing algorithm correctly routes a message addressed to any GUID in O(log N) steps,

• if a target node is active, message is delivered, otherwise message delivered


to active node which is numerically closest to it.

• active nodes take responsibility for processing requests addressed to all


objects in their numerical neighbourhood,

• moreover Pastry uses a locality metric based on network distance in the


underlying network to select appropriate neighbours,

[10.21] Pastry - Routing


Routing, simplified approach:

• each active node stores a leaf set – a vector L (of size 2l) containing
the GUIDs and IP addresses of the nodes whose GUIDs are numerically
closest on either side of its own (above and below),

• leaf sets are maintained by Pastry as nodes join and leave,

• any node A that receives a message M with destination address D routes


the message by comparing D with its own GUID A and with each of the
GUIDs in its leaf set and forwards M to the node amongst them that is
numerically closest to D,

• inefficient, requires about N/2l hops to deliver a message.

[10.22] Circular Routing


[Figure] Black colour depicts live nodes. The space is considered circular: node 0 is adjacent to node (2^128 − 1). The diagram illustrates the routing of a message from node 65A1FC to D46A1C using leaf set information alone, assuming leaf sets of size 8 (l = 4; in Pastry, usually 8). This is a degenerate type of routing that would scale very poorly; it is not used in practice.
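Leaf-set-only routing can be simulated directly (a toy simulation: a 2^16 ring instead of 2^128, node GUIDs as plain integers, and global knowledge of the sorted node list standing in for the per-node leaf sets):

```python
SPACE = 2**16          # small circular GUID space, for illustration

def ring_dist(a, b):
    """Distance on the circular GUID space."""
    d = abs(a - b)
    return min(d, SPACE - d)

def route(nodes, start, dest, l=4):
    """Hop sequence from `start` to the live node closest to `dest`,
    forwarding each time to the leaf-set member closest to the target."""
    nodes = sorted(nodes)
    hops, current = [start], start
    while True:
        i = nodes.index(current)
        # leaf set: l neighbours on either side, wrapping around the ring
        leaf = {nodes[(i + k) % len(nodes)] for k in range(-l, l + 1)}
        nxt = min(leaf, key=lambda n: ring_dist(n, dest))
        if nxt == current:
            return hops
        hops.append(nxt)
        current = nxt
```

Every hop advances at most l positions around the ring, which is exactly why this scheme needs on the order of N/2l hops and is only a fallback in real Pastry.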
[10.23] Pastry Routing

• efficient routing due to routing tables,


• each node maintains a tree-structured routing table of nodes spread
throughout the entire address range, with increased density of coverage for
GUIDs numerically close to,
• the routing process at any node uses the information in its routing table
and leaf set to handle each request from an application and each incoming
message from another node,


• new nodes use a joining protocol: they compute a suitable GUID (typically by applying SHA-1 to the node's public key) and then make contact with a nearby (in network distance) Pastry node.

[10.24] Pastry’s Routing Table

First four rows of a Pastry routing table located in a node whose GUID begins with 65A1.

• each "n" element represents a [GUID, IP address] pair specifying the next hop to be taken by messages addressed to GUIDs that match each given prefix,

• grey-shaded entries indicate that the prefix matches the current GUID up to the given value of p: the next row down or the leaf set should be examined to find a route,

• although there are a maximum of 128 rows in the table, only log16 N rows will be populated on average in a network with N active nodes.

[10.25] Pastry’s Routing Algorithm


Let R[p, i] denote the element at column i in row p of the routing table, and let L denote the leaf set, with L−l and Ll its extreme members. To handle a message M addressed to a node D:

if (L−l ≤ D ≤ Ll) {
    // D is within the range of the leaf set
    forward M to the element Li of the leaf set with GUID closest to D
    (possibly the current node A).
} else {
    find p, the length of the longest common prefix of D and A;
    find i, the (p + 1)th hexadecimal digit of D;
    if (R[p, i] ≠ null) {
        forward M to R[p, i];
    } else {
        // rare case: the routing table entry is empty
        forward M to any node in L or R with a common prefix of length p,
        but a GUID that is numerically closer to D.
    }
}
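The same decision procedure can be written as runnable code (a simplified sketch: GUIDs are equal-length lowercase hex strings, the leaf-set range test ignores the wrap-around of the circular GUID space, and the routing table is modelled as a dict keyed by (row, digit)):

```python
def shl(a: str, b: str) -> int:
    """Length of the longest common prefix of two GUIDs, in hex digits."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(A, D, leaf_set, table):
    """Choose where node A forwards a message addressed to GUID D."""
    num = lambda g: int(g, 16)
    lo, hi = min(leaf_set), max(leaf_set)
    if lo <= D <= hi:                       # D falls within the leaf-set range
        return min(leaf_set | {A}, key=lambda n: abs(num(n) - num(D)))
    p = shl(A, D)                           # longest common prefix with D
    entry = table.get((p, D[p]))            # R[p, i], i = (p+1)th digit of D
    if entry is not None:
        return entry
    # rare case: any known node with prefix >= p that is numerically closer
    candidates = [n for n in leaf_set | set(table.values())
                  if shl(n, D) >= p and abs(num(n) - num(D)) < abs(num(A) - num(D))]
    return min(candidates, key=lambda n: abs(num(n) - num(D))) if candidates else A

hop = next_hop("65a1", "d46a", {"64ff", "65b0"}, {(0, "d"): "d13a"})
```

Here "65a1" shares no prefix with "d46a", so the table entry for row 0, digit "d" supplies the next hop.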

[10.26] Pastry Routing Example

[Figure] Routing a message from node 65A1FC to D46A1C. With the aid of a well-populated routing table the message can be delivered in log16(N) hops.

[10.27] Pastry - Host Failure and Fault Tolerance


• nodes may fail or depart without warning, node considered failed when its
immediate neighbours (in GUID space) can no longer communicate with
it,

• to repair leaf set, the node looks for a live node close to the failed one
and requests a copy of its leaf set (one value to replace),

• repairs to routing tables made on a ’when discovered’ basis,

• moreover all nodes send heartbeat messages to neighbouring nodes in


their leaf sets,

• to deal with any remaining failures or malicious nodes, a small degree of randomness is introduced into the route selection algorithm: a route may be taken from an earlier row of the table, which is less optimal but different.

[10.28] Tapestry

• nodes holding resources periodically use the publish(GUID) primitive to


make them known to Tapestry, holders responsible for storing resources,
replicated resources published with the same GUID,

• 160-bit identifiers used to refer both to objects and to nodes that perform
routing actions,

• for any resource with GUID G there is a unique root node with GUID RG numerically closest to G,

• on each invocation of publish(G), a publish message is routed towards RG,

• on receipt, RG enters the mapping (G, IPH) between G and the sending host's IP address in its routing table; the same mapping is cached along the publication path.

[10.29] Tapestry Routing


[Figure] Replicas of the file Phil's Books (G=4378), hosted at nodes 4228 and AA93. Node 4377 is the root node for object 4378. The routings shown are some of the entries in routing tables. The location mappings (cached while servicing publish messages) are subsequently used to route messages sent to 4378.

[10.30] Squirrel Web Cache (1)

• a P2P web caching service for use in local networks, developed by the authors of Pastry,
Web caching in general:

• browser cache, proxy cache, origin web server,

• metadata stored with an object in a cache: date of last modification T, time-to-live t or an eTag (a hash computed from the object contents),

• conditional GET (cGET) request issued to the next level for validation,

• cGET request types: If-Modified-Since, If-None-Match,

• in response either the entire object or not-modified message.
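The validation step can be sketched as a small function modelling the origin's side of a conditional GET (a toy model, not an HTTP implementation; the object fields and timestamps are illustrative):

```python
def serve_cget(origin_obj, if_modified_since=None, if_none_match=None):
    """origin_obj: dict with 'body', 'last_modified' (int timestamp), 'etag'.
    Returns (status, body): 304 with no body when the cached copy is valid."""
    if if_none_match is not None:                       # If-None-Match (eTag)
        if if_none_match == origin_obj["etag"]:
            return ("304 Not Modified", None)
    elif if_modified_since is not None:                 # If-Modified-Since
        if origin_obj["last_modified"] <= if_modified_since:
            return ("304 Not Modified", None)
    return ("200 OK", origin_obj["body"])               # send the entire object

obj = {"body": b"<html>...</html>", "last_modified": 1000, "etag": "abc1"}
status, _ = serve_cget(obj, if_none_match="abc1")
```

In Squirrel this exchange happens between a client and the object's home node, which itself validates against the origin web server when its own copy may be stale.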

[10.31] Squirrel Web Cache (2)


• SHA-1 hash function applied to the URL of each cached object to produce
a 128-bit Pastry GUID, GUID not used to validate content,

• in the simplest implementation: the node whose GUID numerically closest


to the GUID of an object becomes the object’s home node, responsible for
holding any cached copy of the object,

• Squirrel routes a Get or a cGet request via Pastry to the home node.

Evaluation, two real working environments within Microsoft, 105 active clients
(Cambridge), 36000 active clients (Redmond):

• reduction in total external bandwidth: caches 100MB, 37% (Cambridge),


28% (Redmond), hit ratio for centralized servers: 38% and 29% respec-
tively,

• local latency perceived by users accessing web objects: negligible,

• computational and storage load: low and likely to be imperceptible to users.

[10.32] OceanStore File Store

• OceanStore – unlike PAST, supports the storage of mutable files,

• goal: very large scale, scalable persistent storage facility for mutable data
objects with long-term persistence and reliability in changing network and
computing resources environment,

• privacy and integrity achieved through encryption of data and use of a


Byzantine agreement protocol for updates to replicated objects – because
trustworthiness of individual hosts cannot be assumed,

• Pond – OceanStore prototype implemented in Java, uses Tapestry routing


overlay to place blocks of data at distributed nodes and to dispatch requests
to them,

• data stored in a set of blocks, data blocks organized and accessed through
a metadata block called root block,

• each object represented as an ordered sequence of immutable versions kept forever; versions share unchanged blocks (copy-on-write technique).


[10.33] Ocean Store - Storage Organization (1)

• several replicas of each block stored at peer nodes selected according to locality and storage availability criteria,
• data block GUIDs are published (with publish()) by each of the nodes that holds a replica, so Tapestry can be used by clients to access the blocks,
• an AGUID (the permanent GUID identifying all versions of an object) is stored in directories against each file name,
• the association between an AGUID and the sequence of versions of the object is recorded in a signed certificate, stored and replicated by a primary-copy replication scheme,
• the trust model for P2P requires each new certificate to be agreed amongst a small set of hosts called the inner ring.

[10.34] Ocean Store - Storage Organization (2)


Version i + 1 has been updated in blocks d1, d2 and d3. The certificate and the
root blocks include some data not shown. All unlabelled arrows are BGUIDs.

[10.35] Pond Performance

Times in seconds to run different phases of the Andrew benchmark. (1) recursive
subdirectory creation, (2) source tree copying, (3) status only examining of all
the files in the tree, (4) every data byte examining in all the files, (5) compiling
and linking the files.

[10.36] Ivy File System

• read/write file system emulating a Sun NFS server,

• stores the state of files as logs of the file update requests issued by Ivy
clients,

• log records held in DHash, a distributed hash-addressed storage service (160-bit SHA-1 keys),

• version vectors to impose a total order on log entries when reading from
multiple logs,

• potentially very long read time reduced by use of a combination of local


caches and snapshots,

• the shared file system is seen as the result of merging all the updates performed by a dynamically selected set of participants (a view),


• operation can continue during network partitions; conflicting updates to shared files are resolved similarly to the Coda file system.
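The version vectors mentioned above are simple to illustrate (a generic sketch of the mechanism, not Ivy's actual log-record format; participant names are illustrative):

```python
# Each participant keeps one counter per participant. Vector u happened
# before vector v iff every component of u is <= the corresponding
# component of v and the vectors differ; otherwise the updates may conflict.

def before(u, v):
    keys = set(u) | set(v)                 # missing entries count as 0
    return all(u.get(k, 0) <= v.get(k, 0) for k in keys) and u != v

def concurrent(u, v):
    """Neither precedes the other: a potential conflict to resolve."""
    return not before(u, v) and not before(v, u) and u != v

a = {"p1": 2, "p2": 0}
b = {"p1": 2, "p2": 1}   # b has seen everything a has, plus one update by p2
```

Log entries whose vectors are totally ordered can be replayed in that order; concurrent entries are the ones needing Coda-style conflict resolution.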

[10.37] Ivy Architecture

Ivy system architecture.

[10.38] Ivy – Performance


Each participant maintains a mutable DHash block (called log-head) that points
to a participant’s most recent log record. Mutable blocks are assigned a cryp-
tographic public key pair by their owner. The contents of the block are signed
with the private key. Any participant that has the public key can retrieve the
log-head and use it to access all the records in the log.
Performance:
• execution times mostly two times (for some operations three times) larger than for NFS,
• in a WAN about 10 times slower than in a LAN, similar to NFS; still, NFS was not designed for use in a WAN,

Primary contribution of Ivy: novel approach to the management of security


and integrity in an environment of partial trust (in networks spanning many
organizations and jurisdictions).

[10.39] P2P – Summary


The benefits of P2P:


• ability to exploit unused resources (storage, processing) in the host computers,

• ability to support large numbers of clients and hosts with adequate bal-
ancing of the loads on network links and host computer resources,

• self-organizing properties of the middleware platforms lead to costs largely independent of the numbers of clients and hosts deployed.

Weaknesses and subjects of research:

• relatively costly as a storage solution for mutable data, compared to trusted centralized service solutions,

• still lack of strong guarantees for client and host anonymity.

Chapter 11

Web Services

[11.1] XML – Introduction


The Extensible Markup Language (XML) is a W3C-recommended general-purpose
markup language for creating special-purpose markup languages, capable of
describing many different kinds of data.

• a way of describing data,

• a simplified subset of Standard Generalized Markup Language (SGML),

• primary purpose: to facilitate the sharing of data across different systems,
particularly systems connected via the Internet.

[11.2] XML – Main Features


XML as a well-suited medium for data transfer:

• simultaneously human- and machine-readable format,

• support for Unicode, allowing almost any information in any human lan-
guage to be communicated,

• ability to represent the most general computer science data structures:
records, lists and trees,

• the self-documenting format that describes structure and field names as
well as specific values,

• the strict syntax and parsing requirements that allow the necessary parsing
algorithms to remain simple, efficient, and consistent.
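The point about records, lists and trees can be illustrated with a short sketch (Python and its standard xml.etree module are used here purely for illustration; the "person" record is invented for the example):

```python
import xml.etree.ElementTree as ET

# a record ("person") with a scalar field and a list, encoded as an XML tree
person = ET.Element("person")
ET.SubElement(person, "name").text = "Alice"
languages = ET.SubElement(person, "languages")
for code in ("en", "pl"):
    ET.SubElement(languages, "language").text = code

xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)
# <person><name>Alice</name><languages><language>en</language><language>pl</language></languages></person>
```

Note how the result is simultaneously machine-parsable and human-readable, and how the tags document the structure and field names alongside the values.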


[11.3] XML and Correctness


For an XML document to be correct, it must be:

1. well-formed: conforming to all of XML’s syntax rules.

2. valid: conforming to some XML schema. An XML schema is a description
of a type of XML document, typically expressed in terms of constraints
on the structure and content of documents of that type.
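Well-formedness can be checked with any XML parser; a minimal sketch (using Python's standard xml.etree parser, purely as an example — validity against a schema would require a separate validator):

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Check conformance to XML's syntax rules (not validity against a schema)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<greeting><to>World</to></greeting>"))  # True
print(is_well_formed("<greeting><to>World</greeting>"))       # False: <to> never closed
```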

DTD Document Type Definition, inherited from SGML, included in the XML
1.0 standard,

XSD XML Schema Definition, schema with rich datatyping system and XML
syntax,

Relax NG proposed by OASIS, now part of the ISO DSDL (Document Schema
Definition Languages) standard.

• two formats: an XML-based syntax and a compact syntax,

• the compact syntax aims to increase readability and writability, with a
strict way to translate the compact syntax to the XML syntax and back
again.

[11.4] Web Service – Definition


Web service definition (W3C):
A Web service is a software system designed to support interoperable machine-
to-machine interaction over a network.

• It has an interface described in a machine-processable format (specifically
WSDL).

• Other systems interact with the Web service in a manner prescribed by its
description using SOAP messages, typically conveyed using HTTP with
an XML serialization in conjunction with other Web-related standards.

[11.5] Web Service – Introduction

• a web service provides a service interface enabling clients to interact with
servers in a more general way than web browsers do,


• clients access the operations in the interface of a web service by means of
requests and replies formatted in XML and usually transmitted over HTTP,

• like CORBA and Java, the interface of web services can be described in an
IDL. But for web services, additional information including the encoding
and communication protocols in use and the service location need to be
described,

• the secure channels of TLS do not meet all of the necessary requirements;
XML security is intended to bridge this gap.

[11.6] Web Services - Core Components


Web services - core components:

XML All data to be exchanged is formatted with XML tags. The encoded
message may conform to a messaging standard such as SOAP or the older
XML-RPC. The XML-RPC scheme calls functions remotely, whilst SOAP
favours a more modern (object-oriented) approach based on the Command
pattern.

SOAP Lightweight protocol for exchange of information in a decentralized,
distributed environment.

WSDL Web Services Description Language, an XML-based language for
describing the public interface to web services. It describes how to
communicate using the web service.

UDDI a protocol for publishing web service information. It enables applications
to look up web services in order to determine whether to use them.

[11.7] Web Services - Other Components

Web Services Protocol Stack standards and protocols used to consume a web
service.

Common protocols protocols for data transport such as HTTP, FTP and SMTP.

ebXML a set of specifications enabling a modular electronic business framework.
The vision of ebXML is to enable a global electronic marketplace for
conducting business through the exchange of XML-based messages.


WS-Security a specification that allows authentication of actors and
confidentiality of the messages sent (OASIS standard).

WS-ReliableMessaging a SOAP-based specification that fulfills reliable
messaging requirements critical to some applications of Web Services
(OASIS standard).

WS-Management a specification which describes a SOAP-based protocol for
systems management of personal computers, servers, devices, and other
manageable hardware, as well as Web services and other applications.

[11.8] Web Services Infrastructure

The infrastructure layers, from top to bottom (each line lists the components
at one layer):

Applications
Directory service | Security | Choreography
Web Services | Service descriptions (in WSDL)
SOAP
URIs (URLs or URNs) | XML | HTTP, SMTP or other transport

Web services infrastructure and components.

[11.9] WS Features (1)

• an XML-based data representation model,

• SOAP protocol specifies the rules for using XML to package messages,
for example to support a request-reply protocol,

• SOAP used to encapsulate these messages and transmit them over HTTP
or another protocol,

• a Web service provides a service description, which includes an interface
definition and other information, such as the server's URL,

• XML security: documents or parts of documents may be signed or
encrypted,


• Web services do not provide means for coordinating their operations with
one another.

[11.10] WS Features (2)


The main differences from the distributed object model:

• remote objects cannot be instantiated; effectively a web service consists
of a single remote object,

– remote object references are irrelevant,

• although the interaction is similar to that in RMI, remote object references
are not very similar to URIs,

• web services cannot create instances of remote objects, so garbage
collection is irrelevant.

[11.11] SOAP (Simple Object Access Protocol)


SOAP
XML-based lightweight protocol for exchange of information in a decentralized,
distributed environment.

SOAP message structure:

Envelope the top-level root element of a SOAP message, which contains the
header and body elements.

Header a collection of zero or more SOAP header blocks each of which might
be targeted at any SOAP receiver within the SOAP message path.

Body a collection of zero or more element information items targeted at an
ultimate SOAP receiver in the SOAP message path.
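The envelope/header/body structure above can be sketched programmatically. This is an illustrative construction only (Python's standard xml.etree module; the SOAP 1.2 envelope namespace URI is used, and the `exchange` application element is hypothetical):

```python
import xml.etree.ElementTree as ET

ENV = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
ET.register_namespace("env", ENV)

envelope = ET.Element(f"{{{ENV}}}Envelope")       # top-level root element
ET.SubElement(envelope, f"{{{ENV}}}Header")       # zero or more header blocks
body = ET.SubElement(envelope, f"{{{ENV}}}Body")  # items for the ultimate receiver

# hypothetical application-level element carried in the body
exchange = ET.SubElement(body, "{urn:example}exchange")
ET.SubElement(exchange, "{urn:example}arg1").text = "Hello"

print(ET.tostring(envelope, encoding="unicode"))
```

A real SOAP toolkit would generate such envelopes from the WSDL description rather than building them by hand.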

[11.12] SOAP Specification


The SOAP specification states:

• how XML is to be used to represent the contents of individual messages,

• how a pair of single messages can be combined to produce a request-reply
pattern,


• the rules as to how the recipients of messages should process the XML
elements that they contain,

• how HTTP and SMTP should be used to communicate SOAP messages.
It is expected that future versions of the specification will define how to
use other transport protocols, for example, TCP.

[11.13] SOAP Message in an Envelope

envelope
  header
    header element    header element
  body
    body element      body element

[11.14] SOAP Example (1)

<env:Envelope xmlns:env="namespace URI for SOAP envelopes">
  <env:Body>
    <m:exchange xmlns:m="namespace URI of the service description">
      <m:arg1>Hello</m:arg1>
      <m:arg2>World</m:arg2>
    </m:exchange>
  </env:Body>
</env:Envelope>

Example of a simple request without headers (in the original figure each XML
element is represented by a shaded box).

[11.15] SOAP Example (2)


<env:Envelope xmlns:env="namespace URI for SOAP envelopes">
  <env:Body>
    <m:exchangeResponse xmlns:m="namespace URI of the service description">
      <m:res1>World</m:res1>
      <m:res2>Hello</m:res2>
    </m:exchangeResponse>
  </env:Body>
</env:Envelope>

Example of a reply corresponding to the previous request.

[11.16] SOAP and HTTP POST

POST /examples/stringer HTTP/1.1                        <- endpoint address
Host: www.cdk4.net                                      <- HTTP headers
Content-Type: application/soap+xml
Action: http://www.cdk4.net/examples/stringer#exchange  <- action

<env:Envelope xmlns:env="namespace URI for SOAP envelope">
  <env:Header></env:Header>                             <- SOAP message
  <env:Body></env:Body>
</env:Envelope>

Use of an HTTP POST request in SOAP client-server communication.
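Such a request can be assembled by hand; a minimal Python sketch (the endpoint and host are taken from the example above; nothing is actually sent over the network):

```python
soap_message = (
    '<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">'
    "<env:Header></env:Header><env:Body></env:Body></env:Envelope>"
)

headers = {
    "Host": "www.cdk4.net",
    "Content-Type": "application/soap+xml",
    "Content-Length": str(len(soap_message)),
    "Action": "http://www.cdk4.net/examples/stringer#exchange",
}

# an HTTP request is the request line, the headers, a blank line, then the body
request = "POST /examples/stringer HTTP/1.1\r\n"
request += "".join(f"{name}: {value}\r\n" for name, value in headers.items())
request += "\r\n" + soap_message

print(request)
```

In practice an HTTP client library would be used; the sketch only makes the layering visible: the SOAP envelope travels as the body of an ordinary HTTP POST.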

[11.17] REST (Representational State Transfer)


Roy Fielding’s explanation of the meaning of Representational State Transfer:
Representational State Transfer is intended to evoke an image of how a well-
designed Web application behaves: a network of web pages (a virtual state-
machine), where the user progresses through an application by selecting links
(state transitions), resulting in the next page (representing the next state of the
application) being transferred to the user and rendered for their use.

REST (common meaning): any simple web-based interface that uses XML
and HTTP without the extra abstractions of MEP-based approaches like the
web services SOAP protocol. It is possible to design web service systems in
accordance with Fielding's REST architectural style (RESTful systems).
REST is an architectural style and not a standard.


[11.18] The use of SOAP with Java


The service interface
The Java interface of a web service must conform to the following rules:

• must extend the Remote interface,

• must not have constant declarations, such as public final static,

• the methods must throw java.rmi.RemoteException or one of its
subclasses,

• method parameters and return types must be permitted JAX-RPC types,

• no main method, no constructor,

• wscompile and wsdeploy are used to generate the skeleton class and the
service description (in WSDL),

• the service implementation runs as a servlet inside a servlet container
(like Tomcat),

• the client program may use static proxies, dynamic proxies or a dynamic
invocation interface.

[11.19] WSDL (Web Services Description Language)


WSDL
The Web Services Description Language (WSDL) is an XML format published
for describing Web services.

• an XML-based description of how to communicate using the web service,
namely, the protocol bindings and message formats required to interact
with it,

• supported operations and messages are described abstractly, and then bound
to a concrete network protocol and message format.

[11.20] The main elements in a WSDL description


definitions
  types      (target namespaces)      \
  message    (document style)          |  abstract part
  interface  (request-reply style)    /
  bindings   (how)                    \   concrete part
  services   (where)                  /

[11.21] WSDL Example
WSDL request and reply messages for the newShape operation:

<message name="ShapeList_newShape">
  <part name="GraphicalObject_1" type="tns:GraphicalObject"/>
</message>

<message name="ShapeList_newShapeResponse">
  <part name="result" type="xsd:int"/>
</message>

tns - target namespace, xsd - XML schema definitions
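A client-side tool can extract the part names and types from such a description; a small sketch (Python's standard xml.etree module, with the WSDL fragment inlined and the WSDL namespaces omitted for brevity):

```python
import xml.etree.ElementTree as ET

wsdl_fragment = """
<definitions>
  <message name="ShapeList_newShape">
    <part name="GraphicalObject_1" type="tns:GraphicalObject"/>
  </message>
  <message name="ShapeList_newShapeResponse">
    <part name="result" type="xsd:int"/>
  </message>
</definitions>
"""

root = ET.fromstring(wsdl_fragment)
# map each message name to its list of (part name, part type) pairs
messages = {
    msg.get("name"): [(p.get("name"), p.get("type")) for p in msg.findall("part")]
    for msg in root.findall("message")
}
print(messages["ShapeList_newShapeResponse"])  # [('result', 'xsd:int')]
```

This is exactly the kind of processing that tools like wscompile perform when generating client proxies from a WSDL description.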

[11.22] Message exchange patterns for WSDL operations

Name               Client sends   Server sends   Delivery     Fault message
In-Out             Request        Reply                       may replace Reply
In-Only            Request                                    no fault message
Robust In-Only     Request                       guaranteed   may be sent
Out-In             Reply          Request                     may replace Reply
Out-Only                          Request                     no fault message
Robust Out-Only                   Request        guaranteed   may send fault


Bibliography

[GCDT05] G. Coulouris, J. Dollimore, and T. Kindberg. Distributed Systems.
Concepts and Design, fourth edition. Addison Wesley, 2005.

[TvS02] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems.
Principles and Paradigms. Prentice Hall, 2002.

[TvS05] Andrew S. Tanenbaum and Maarten van Steen. Systemy rozproszone.
Zasady i paradygmaty. WNT, 2005.
