
Prescript to the lectures

Distributed Systems (DS)


(work in progress)

Dr Tomasz Jordan Kruk


T.Kruk@ia.pw.edu.pl

June 20, 2006

Semester: 2005/2006 summer


Contents

1 Introduction

2 Communication (I)

3 Communication (II)

4 Synchronization (I)

5 Synchronization (II)

6 Consistency and Replication

7 Fault Tolerance

8 Distributed File System

9 Naming

10 Peer-to-Peer Systems

11 Web Services

Bibliography

Chapter 1

Introduction

[1.1] Lectures (1)


Tomasz Jordan Kruk (T.Kruk@ia.pw.edu.pl)

Consultations: room 530

• Tuesday 18.15-19.00,

• Thursday 14.15-15.00

Slides available after lectures on http://studia.elka.pw.edu.pl

Books:

1. Distributed Systems. Principles and Paradigms. Andrew S. Tanenbaum, Maarten van Steen. Prentice Hall 2002.

2. Distributed Systems. Concepts and Design. Fourth edition. G. Coulouris, J. Dollimore, T. Kindberg. Addison Wesley 2005.

3. Systemy rozproszone. Zasady i paradygmaty. Andrew S. Tanenbaum, Maarten van Steen. WNT 2005.

[1.2] Lectures (2)

1. Introduction

2. Communication (I)


3. Communication (II)

4. Synchronization (I)

5. Synchronization (II)

6. Consistency and replication

7. Fault tolerance

8. File systems

9. Naming

10. Peer-to-peer systems

11. Web services

12. Security

[1.3] Definition of a Distributed System (1)


Distributed system
Collection of independent computers that appears to its users as a single coherent
system.

Goals:

• connecting users and resources,

• transparency,

• openness = offering services according to standard rules that describe the syntax and semantics of those services (e.g. POSIX for operating systems),

• scalability.

[1.4] Definition of a Distributed System (2)


[Figure: machines A, B, and C each run a local OS; a middleware service layer spans all machines and supports the distributed applications; the machines are connected by a network.]

A distributed system organized as middleware. Note that the middleware layer extends over multiple machines.

[1.5] Transparency in a Distributed System

Different forms of transparency in a distributed system.

[1.6] Degree of Transparency

• blindly hiding all distribution aspects from users is not always a good idea,

• there is a trade-off between a high degree of transparency and performance.


A goal not achieved in practice: parallelism transparency.

Parallelism transparency
Transparency level with which a distributed system is supposed to appear to the
users as a traditional uniprocessor timesharing system.

[1.7] Openness
Completeness and neutrality of specifications are important factors for interoperability and portability of distributed solutions.

completeness – everything necessary to make an implementation has been specified,

neutrality – specifications do not prescribe what an implementation should look like.

Interoperability
The extent to which two implementations of systems from different manufacturers can cooperate.

Portability
The extent to which an application developed for system A can be executed, without modification, on some system B that implements the same interfaces as A.

[1.8] Scalability Problems


Three different dimensions of the system scalability:

• scalable with respect to size,

• geographically scalable (users and resources may lie far apart),

• administratively scalable (spanning many administrative domains).


[1.9] Decentralized Algorithms

1. No machine has complete information about the system state.

2. Machines make decisions based only on local information.

3. Failure of one machine does not ruin the algorithm.

4. There is no implicit assumption that a global clock exists.

[1.10] Scaling Techniques (1)

• asynchronous communication (to hide communication latencies),

• distribution (splitting into smaller parts and spreading),

• replication (to increase availability and to balance the load),

• caching (as a special form of replication).
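To make the caching idea concrete, here is a small Python sketch (all names, such as `Cache` and `fetch_from_server`, are invented for illustration, not from the lecture): repeated reads of the same item are served from a local replica, so only the first read costs a "remote" fetch.

```python
# Minimal sketch of client-side caching as a special form of replication.

class Cache:
    def __init__(self, fetch):
        self.fetch = fetch      # function that contacts the "server"
        self.store = {}         # local replica of previously fetched items
        self.misses = 0

    def get(self, key):
        if key not in self.store:           # only a miss costs a remote call
            self.misses += 1
            self.store[key] = self.fetch(key)
        return self.store[key]

def fetch_from_server(key):
    return key.upper()                      # stand-in for a remote lookup

cache = Cache(fetch_from_server)
cache.get("page"); cache.get("page"); cache.get("page")
# three reads, but only one remote fetch (cache.misses == 1)
```

The cached copy may of course go stale; keeping replicas consistent is the subject of the Consistency and Replication chapter.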

[1.11] Scaling Techniques (2)

[Figure: two ways of handling a form: (a) the client sends every keystroke across the network to the server, which checks and processes the form; (b) the client checks the form locally and sends only the completed form to the server for processing.]

A difference between letting (a) a server or (b) a client check forms as they are being filled.

[1.12] Scaling Techniques (3)

[Figure: the DNS name space as a tree with generic top-level domains (int, com, edu, gov, mil, org, net) and country domains (jp, us, nl), with subtrees such as sun.com, cs.vu.nl, and keio.ac.jp; the tree is divided into zones Z1, Z2, Z3 at different depths.]

An example of dividing the DNS name space into zones.

[1.13] Hardware Concepts

[Figure: a 2x2 taxonomy of multiple-CPU systems: shared memory (multiprocessors) versus private memory (multicomputers), each either bus-based or switch-based; P denotes a processor, M a memory.]

Different basic organizations of processors and memories in distributed computer systems.

[1.14] Multiprocessors (1)

[Figure: several CPUs, each with a private cache, and a shared memory, all attached to a common bus.]

A bus-based multiprocessor.

[1.15] Multiprocessors (2)

Memories
CPUs Memories
M M M M
P M
P
P M
P
CPUs
P M
P
P M
P

Crosspoint switch 2x2 switch

(a) (b)

a. a crossbar switch,

b. an omega switching network (2k inputs and a like outputs; log2 N stages,
each having N/2 exchange elements at each stage),

NUMA - NonUniform Memory Access - hierarchical systems.

[1.16] Homogeneous Multicomputer Systems


[Figure: multicomputers with (a) a grid interconnect and (b) a hypercube interconnect.]

a. grid,

b. hypercube.

Examples: Massively Parallel Processors (MPPs), Clusters of Workstations (COWs).

[1.17] Software Concepts

DOS – distributed operating system,

NOS – network operating system.

[1.18] Uniprocessor Operating Systems


[Figure: a uniprocessor operating system: the user application and the OS modules (memory module, process module, file module) run in user mode behind a single OS interface, with no direct data exchange between modules; system calls enter a microkernel running in kernel mode on the hardware.]

[1.19] Multicomputer Operating Systems (1)

[Figure: machines A, B, and C each run a kernel; a layer of distributed operating system services spans all machines and supports the distributed applications; the machines are connected by a network.]

General structure of a multicomputer operating system.

[1.20] Multicomputer Operating Systems (2)


[Figure: a sender and a receiver, each with a local buffer, connected by a network; S1-S4 mark the possible synchronization points along the path: the sender buffer, entry into the network, the receiver buffer, and the receiving process.]

Alternatives for blocking and buffering in message passing.

[1.21] Multicomputer Operating Systems (3)

Relation between blocking, buffering, and reliable communications.

[1.22] Distributed Shared Memory Systems (1)


[Figure: a shared global address space of 16 pages, with the pages spread over the local memories of four CPUs.]

a. pages of the address space distributed among four machines,

b. situation after CPU 1 references page 10,

c. situation if page 10 is read only and replication is used.

[1.23] Distributed Shared Memory Systems (2)


[Figure: two independent data items A and B lie on the same page p; code on machine A uses only A and code on machine B uses only B, yet every access by the other machine forces a transfer of the whole page back and forth.]

False sharing of a page between two independent processes.

[1.24] Network Operating Systems (1)

[Figure: machines A, B, and C each run a kernel with their own network OS services on top, supporting distributed applications; the machines are connected by a network.]

General structure of a network operating system.

[1.25] Network Operating Systems (2)


[Figure: (a) two servers export directory trees (games: pacman, pacwoman, pacchild; work: mail, teaching, research); (b) and (c) two clients mount these subtrees at different points in their local name spaces.]

Different clients may mount the servers in different places.

[1.26] Positioning Middleware

[Figure: machines A, B, and C each run a kernel with network OS services; a middleware services layer spans all machines and supports the distributed applications; the machines are connected by a network.]

General structure of a distributed system as middleware.

[1.27] Middleware and Openness


[Figure: two machines, each running an application on top of middleware on top of a network OS; the applications see the same programming interface, and the middleware layers speak a common protocol across the network.]

In an open middleware-based distributed system, the protocols used by each middleware layer should be the same, as well as the interfaces they offer to applications.

[1.28] Comparison of Operating Systems Types

A comparison between multiprocessor OS, multicomputer OS, network OS, and middleware-based distributed systems.

[1.29] Clients and servers


[Figure: the client sends a request and blocks, waiting for the result; the server provides the service and sends the reply.]

General interaction between a client and a server.

[1.30] Application Layering

• the user-interface level,

• the processing level,

• the data level.

[1.31] Processing Level

[Figure: an Internet search engine in three layers: the user-interface level accepts a keyword expression and displays an HTML page containing a ranked list of page titles; the processing level holds the query generator, the ranking component, and the HTML generator; the data level is a database with Web pages, queried for page titles and meta-information.]

The general organization of an Internet search engine into three different layers.

[1.32] Multitiered Architectures (1)


[Figure: five ways of splitting the user interface, application, and database between the client machine and the server machine, ranging from (a) only part of the user interface on the client to (e) the user interface, the application, and part of the database on the client.]

Alternative client-server organizations (a)-(e).

[1.33] Multitiered Architectures (2)

[Figure: the user interface (presentation) requests an operation from the application server and waits for the result; the application server in turn requests data from the database server, waits for the data, and then returns the result.]

An example of a server acting as a client.

[1.34] Modern Architectures (1)


Vertical distribution
Achieved by placing logically different components on different machines.

Horizontal distribution
Client or server may be physically split up into logically equivalent parts, but
each part is operating on its own share of the complete data set, thus balancing
the load.

[1.35] Modern Architectures (2)


[Figure: a front end handles incoming requests from the Internet in round-robin fashion and distributes them over replicated Web servers, each containing the same Web pages on its own disks.]

An example of horizontal distribution of a Web service.

[1.36] Internet and web servers

Date          Computers      Web servers
1979, Dec.            188              0
1989, July        130 000              0
1999, July     56 218 000      5 560 866
2003, Jan.    171 638 297     35 424 956


Chapter 2

Communication (I)

[2.1] Communication (I)

1. Layered Protocols

2. Remote Procedure Call

3. Remote Object Invocation

4. Message-oriented Communication

5. Stream-oriented Communication

[2.2] Necessary Agreements

• How many volts should be used to signal a 0-bit, and how many for a
1-bit?

• How does the receiver know which bit is the last one of the message?

• How can it detect if a message has been damaged or lost?

• How long are numbers, strings and other data items?

• How are they represented?

ISO OSI = OSI Model = Open Systems Interconnection Reference Model

Protocols: connection-oriented vs. connectionless.


protocol suite = protocol stack = the collection of protocols used in a particular system.

[2.3] Protocols (1)

Example protocol as a discussion:

A: Please, retransmit message n,

B: I already retransmitted it,

A: No, you did not,

B: Yes, I did,

A: All right, have it your way, but send it again.

[2.4] Protocols (2)

Protocol
A well-known set of rules and formats to be used for communication between
processes in order to perform a given task.

Two important parts of the definition:

• a specification of the sequence of messages that must be exchanged,

• a specification of the format of the data in the messages.

How to create protocols:


On the Design of Application Protocols, RFC 3117,
http://www.rfc-editor.org/rfc/rfc3117.txt

[2.5] Layered Protocols (1)


[Figure: the seven OSI layers (7 application, 6 presentation, 5 session, 4 transport, 3 network, 2 data link, 1 physical), each with its own peer protocol, stacked on top of the network.]

Layers, interfaces, and protocols in the OSI model.

• focus on message-passing only,

• often unneeded or unwanted functionality.

[2.6] Layered Protocols (2)

[Figure: a message wrapped for transmission: the application, presentation, session, transport, network, and data link layers each prepend their own header, and the data link layer also appends a trailer; the resulting frame is what actually appears on the network.]

A typical message as it appears on the network.
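This encapsulation can be sketched in Python (the header names are invented placeholders, not real protocol fields): each layer wraps the payload handed down from above, and the receiving side peels the headers off in the reverse order.

```python
# Sketch of encapsulation: each layer prepends its header around the payload
# from the layer above; the data link layer also appends a trailer.

def encapsulate(message):
    # innermost header first (application), outermost last (network)
    for header in ["APP", "PRES", "SESS", "TRANS", "NET"]:
        message = f"[{header}]{message}"
    return f"[DL]{message}[DL-TRAILER]"   # data link: header and trailer

def decapsulate(frame):
    # each layer strips only its own header on the way up the stack
    assert frame.startswith("[DL]") and frame.endswith("[DL-TRAILER]")
    message = frame[len("[DL]"):-len("[DL-TRAILER]")]
    for header in ["NET", "TRANS", "SESS", "PRES", "APP"]:
        assert message.startswith(f"[{header}]")
        message = message[len(header) + 2:]
    return message

frame = encapsulate("hello")
assert decapsulate(frame) == "hello"
```

The point of the exercise: no layer needs to look inside another layer's header, which is exactly what makes the layers independently replaceable.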

[2.7] Layered Protocols (3)


Physical layer
Contains the specification and implementation of bits, and their transmission
between sender and receiver.

Data link layer
Describes how a series of bits is grouped into frames to allow error and flow control.

Network layer
Describes how packets in a network of computers are to be routed.

Transport Layer
Provides the actual communication facilities for most distributed systems.

Standard Internet protocols:


• TCP: connection-oriented, reliable, stream-oriented communication,
• UDP: unreliable (best-effort) datagram communication.

[2.8] Data Link Layer

Time  Messages in transit    Event
0     Data 0                 A sends data message 0
1     Data 0                 B gets 0, sees bad checksum
2     Data 1, Control 0      A sends data message 1; B complains about the checksum
3     Control 0, Data 1      Both messages arrive correctly
4     Data 0, Control 1      A retransmits data message 0; B says: "I want 0, not 1"
5     Control 1, Data 0      Both messages arrive correctly
6     Data 0                 A retransmits data message 0 again
7     Data 0                 B finally gets message 0

Discussion between a receiver and a sender in the data link layer.
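The retransmit-until-acknowledged idea behind this dialogue can be sketched as a stop-and-wait simulation in Python (a simplified model, not from the slides: losses of acknowledgements are ignored, and a seeded random generator stands in for an unreliable link).

```python
# Stop-and-wait sketch: the sender retransmits a frame until the receiver
# gets it intact, alternating a one-bit sequence number between frames.

import random

def transmit(frame, loss_rate, rng):
    """Deliver the frame, or lose/corrupt it with probability loss_rate."""
    return frame if rng.random() >= loss_rate else None

def stop_and_wait(data, loss_rate=0.3, seed=42):
    rng = random.Random(seed)       # deterministic "unreliable" channel
    received, seq = [], 0
    for item in data:
        while True:                              # retransmit until ACKed
            frame = transmit((seq, item), loss_rate, rng)
            if frame is not None:                # receiver got it intact
                received.append(frame[1])
                break                            # ACK: move to next frame
        seq ^= 1                                 # alternate 0/1 sequence bit
    return received

assert stop_and_wait(["m0", "m1", "m2"]) == ["m0", "m1", "m2"]
```

The sequence bit is what lets the receiver tell a retransmission of the old frame from a genuinely new one, just as B does in the table above.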

[2.9] Network level protocols


Network layer:


• IP packets

• ATM virtual channels (unidirectional connection-oriented protocol),

• collections of virtual channels grouped into virtual paths – predefined routes between pairs of hosts.

Transport layer:

• TCP, UDP

• RTP - Real-time Transport Protocol

• TP0 – TP4, the official ISO transport protocols.

[2.10] Client-Server TCP

[Figure: (a) normal TCP: a three-way handshake (SYN; SYN,ACK(SYN); ACK(SYN)), then the request, then FIN/ACK exchanges carrying the answer and tearing the connection down — nine messages in total; (b) transactional TCP: the request travels with the SYN and FIN, the answer with the server's SYN,ACK(FIN),FIN, and a final ACK(FIN) closes the exchange — three messages in total.]

(a) Normal operation of TCP. (b) Transactional TCP.

[2.11] Networking - review


Networking, keywords, review:

• routing in IP, default gateway,


• hardware: router, bridge, hub, switch, gateway, firewall, transceiver,

• domain name resolution,

• CIDR – classless interdomain routing,

• private networks (10.x.y.z, 172.16.x.y, 192.168.x.y),

• NAT.

[2.12] Above the Transport Layer


Many application protocols are implemented directly on top of transport protocols, each redoing a lot of application-independent work.

              News         FTP                    WWW
Transfer      NNTP         FTP                    HTTP
Naming        Newsgroup    Host + path            URL
Distribution  Push         Pull                   Pull
Replication   Flooding     Caching + DNS tricks   Caching + DNS tricks
Security      None (PGP)   Username + Password    Username + Password

[2.13] Middleware Protocols (1)


Middleware
An application that logically lives in the application layer, but which contains
many general-purpose protocols that warrant their own layers, independent of
other, more specific applications.

Middleware was invented to provide common services and protocols that can be used by many different applications.

Example protocols:

• open communication protocols,

• marshaling and unmarshaling of data, for systems integration,

• naming protocols, for resource sharing,

• security protocols, distributed authentication and authorization,

• scaling mechanisms, support for caching and replication.


[2.14] Middleware Protocols (2)

[Figure: a reference model with middleware as its own layer: application (6), middleware (5), transport (4), network (3), data link (2), physical (1), each with its own peer protocol on top of the network.]

An adapted ISO OSI reference model for networked communication.

[2.15] High-level Middleware Communication Services


Some high-level middleware protocol types:

1. remote procedure call,

2. remote object invocation,

3. message queuing services,

4. stream-oriented communication.

[2.16] Local Procedure Call

[Figure: parameter passing in a local call read(fd, buf, bytes): (a) the stack before the call, holding the main program's local variables; (b) the stack while the called procedure is active, with the parameters bytes, buf, fd, the return address, and read's local variables pushed on top.]


• Application developers familiar with simple procedure model,

• Procedures as black boxes (isolation),

• No fundamental reason not to execute procedures on separate machine.

[2.17] Remote Procedure Call


When we try to call procedures located on other machines, some subtle problems
exist:

• different address spaces,

• parameters and results have to be passed,

• both machines may crash.

Standard parameter-passing modes in function calls:

• call-by-value,

• call-by-reference,

• call-by-copy/restore.

[2.18] Steps in RPC

1. Client procedure calls client stub in normal way.

2. Client stub builds message, calls local OS.

3. Client’s OS sends message to remote OS.

4. Remote OS gives message to server stub.

5. Server stub unpacks parameters, calls server.

6. Server does work, returns result to the stub.

7. Server stub packs it in message, calls local OS.

8. Server’s OS sends message to client’s OS.


9. Client’s OS gives message to client stub.

10. Stub unpacks result, returns to client.
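The ten steps can be sketched in Python without a real network (names such as `client_stub` and `server_stub` are illustrative; JSON stands in for the wire format, and a direct function call stands in for the OS-to-OS message transfer).

```python
# Sketch of an RPC round trip: the client stub marshals the call into a byte
# message, the server stub unmarshals it, invokes the procedure, and marshals
# the result back.

import json

def add(i, j):                       # the remote procedure itself
    return i + j

SERVER_PROCS = {"add": add}

def server_stub(request_bytes):      # steps 4-7 on the server machine
    call = json.loads(request_bytes.decode())            # unpack parameters
    result = SERVER_PROCS[call["proc"]](*call["args"])   # call the server
    return json.dumps({"result": result}).encode()       # pack the reply

def client_stub(proc, *args):        # steps 1-3 and 9-10 on the client
    request = json.dumps({"proc": proc, "args": args}).encode()
    reply = server_stub(request)     # stands in for the OS/network transfer
    return json.loads(reply.decode())["result"]

k = client_stub("add", 4, 7)         # looks like an ordinary local call
```

To the caller, `client_stub("add", 4, 7)` is indistinguishable from a local `add(4, 7)` — which is exactly the transparency RPC aims at.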

[2.19] Passing Value Parameters (1)

[Figure: an RPC between a client machine and a server machine: (1) the client calls k = add(i,j) on the client stub; (2) the stub builds a message naming proc "add" with the values of i and j; (3) the message is sent across the network; (4) the server OS hands the message to the server stub; (5) the stub unpacks the message; (6) the stub makes a local call to the implementation of add; the reply follows the same path back.]

Steps involved in doing remote computation through RPC.

parameter marshaling – packing parameters into a message.

[2.20] Passing Value Parameters (2)

• IBM mainframes: EBCDIC character code,

• IBM personal computers: ASCII character code.

[Figure: a message holding the integer 5 and the string "JILL", shown byte by byte with its byte addresses.]

a. original message on the Pentium,

b. the message as being received on the SPARC,

c. the message after being inverted. The little numbers in boxes indicate the address of each byte.
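Python's struct module makes the byte-order problem easy to demonstrate: the "<" and ">" format prefixes select little- and big-endian layouts explicitly, which is exactly the choice a wire format must fix.

```python
# The same 32-bit integer has different byte layouts on little-endian
# (e.g. Pentium) and big-endian (e.g. SPARC) machines.

import struct

value = 5
little = struct.pack("<i", value)    # little-endian layout
big = struct.pack(">i", value)       # big-endian layout

assert little == b"\x05\x00\x00\x00"
assert big == b"\x00\x00\x00\x05"
assert little == bytes(reversed(big))

# Interpreting little-endian bytes as big-endian garbles the value:
assert struct.unpack(">i", little)[0] == 5 * 2**24
```

This is why RPC systems either agree on one network byte order or tag messages with the sender's format.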

[2.21] Extended RPC models – Doors

Door
A procedure in the address space of a server process that can be called by processes collocated with the server.

• local IPC to be much more efficient than networking,

• door to be registered to be called (door_create),

• in Solaris, each door has a file name (fattach),

• calling doors by door_call (OS makes an upcall),

• result returned to the client through door_return.

• benefit: a single mechanism (procedure calls) for effective communication in a distributed system,

• drawback: still the need to distinguish standard procedure calls, calls to other local processes, and calls to remote processes.

[2.22] Doors


[Figure: a client and a server process on the same computer; the operating system invokes the registered door in the server process and returns the result to the calling process.]

    /* server process */
    server_door(...)
    {
        ...
        door_return(...);
    }

    main()
    {
        ...
        fd = door_create(...);          /* create the door */
        fattach(fd, door_name, ...);    /* register it under a file name */
        ...
    }

    /* client process */
    main()
    {
        ...
        fd = open(door_name, ...);
        door_call(fd, ...);             /* OS makes an upcall into the server */
        ...
    }
[2.23] Asynchronous RPC (1)

[Figure: (a) the client calls the remote procedure and blocks until the reply arrives; (b) with asynchronous RPC the client blocks only until the server accepts the request, then continues while the server calls the local procedure.]

a. the interconnection between client and server in a traditional RPC,

b. the interaction using asynchronous RPC.


[2.24] Asynchronous RPC (2)

[Figure: the client calls the remote procedure, waits only until the request is accepted, and continues; when the server finishes the local procedure, it interrupts the client with a one-way RPC to return the results, which the client acknowledges.]

deferred synchronous RPC – asynchronous RPC combined with a second, one-way call made by the server to return the results,

one-way RPC – the client does not wait even for acceptance of the request (at the cost of reliability).
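Deferred synchrony can be imitated with a future: a sketch (invented names; a thread pool stands in for the remote server) in which the client submits the call, keeps computing, and collects the result only when it actually needs it.

```python
# Sketch of deferred synchronous RPC using a future: submit the call,
# continue working, block later only for the reply. A registered callback
# would correspond to the server's one-way RPC back to the client.

from concurrent.futures import ThreadPoolExecutor
import time

def remote_call(x):                  # stand-in for the remote procedure
    time.sleep(0.05)                 # simulated network + processing delay
    return x * x

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(remote_call, 6)   # request accepted, client returns
    other_work = sum(range(10))            # client keeps computing meanwhile
    result = future.result()               # block only when the reply is needed
```

The client thus overlaps its own computation with the server's work, which is the whole point of the asynchronous variants above.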

[2.25] Writing a Client and a Server

[Figure: uuidgen produces an interface definition file; the IDL compiler generates a header, a client stub, and a server stub; the client code and client stub (with #include of the header) are compiled and linked with the runtime library into the client binary, and likewise the server code and server stub into the server binary.]

Steps in writing a client and a server in DCE RPC. Let the developer concentrate
only on the client- and server-specific code. Leave the rest for RPC generators
and libraries.

[2.26] Binding a Client to a Server


The client must first locate the server machine, and then locate the server (its endpoint) on that machine.

[Figure: binding in DCE RPC: (1) the server registers its endpoint in the endpoint table of the DCE daemon on the server machine; (2) the server registers its service with the directory server; (3) the client looks up the server at the directory machine; (4) the client asks the DCE daemon for the endpoint; (5) the client does the RPC.]

Client-to-server binding in DCE – a separate daemon runs on each server machine.

[2.27] Remote Distributed Objects (1)


The basic idea of remote objects:

• data and operations encapsulated in an object,

• operations are implemented as methods, and are accessible through interfaces,

• object offers only its interface to clients,

• object server is responsible for a collection of objects,

• client stub (proxy) implements interface,

• server skeleton handles (un)marshaling and object invocation.

[2.28] Remote Distributed Objects (2)


[Figure: the client invokes a method on the proxy, which offers the same interface as the object; the marshalled invocation is passed across the network to the server machine, where the skeleton unmarshals it and invokes the same method on the object holding the state.]

Common organization of a remote object with client-side proxy.

[2.29] Remote Distributed Objects (3)


Compile-time objects
Language-level objects, from which proxy and skeletons are automatically gen-
erated.

Runtime objects
Can be implemented in any language, but require use of an object adapter that
makes the implementation appear as an object.

Transient object lives only by virtue of a server: if the server exits, so will the
object.

Persistent object lives independently from a server: if a server exits, the object's state and code remain (passively) on disk.

[2.30] Binding a Client to an Object (1)


Having an object reference allows a client to bind to an object:

• reference denotes server, object, and communication protocol,

• client loads associated stub code,

• stub is instantiated and initialized for specific object.


Remote-object references enable passing references as parameters, which was hardly possible with ordinary RPCs.

Two ways of binding:

Implicit: invoke methods directly on the referenced object.

Explicit: client must first explicitly bind to object before invoking it.

[2.31] Binding a Client to an Object (2)

a. Example with implicit binding using only global references.

b. Example with explicit binding using global and local references.

[2.32] RMI - Parameter Passing

[Figure: a client on machine A holds a local reference L1 to object O1 on A and a remote reference R1 to object O2 on machine B; when the client invokes a method on the server at machine C with L1 and R1 as parameters, a copy of O1 is shipped to C (pass by value), while for O2 only a copy of the reference R1 is passed.]


Objects sometimes passed by reference, but sometimes by value.

• a client running on machine A, a server on machine C,

• the client calls the server with two references as parameters, O1 and O2,
to local and remote objects,

• copying of an object as a possible side effect of invoking a method with an object reference as a parameter (transparency versus efficiency).

Chapter 3

Communication (II)

[3.1] Communication (II)

1. Layered Protocols

2. Remote Procedure Call

3. Remote Object Invocation

4. Message-oriented Communication

5. Stream-oriented Communication

[3.2] Persistence and Synchronicity in Communication (1)


Assumption – communication system organized as follows:

• applications are executed on hosts,

• each host connected to one communication server,

• buffers may be placed either on hosts or in the communication servers of the underlying network,

• example: an e-mail system.

persistent vs transient communication,

asynchronous communication – the sender continues immediately after it has submitted its message for transmission,

synchronous communication – the sender is blocked until its message is stored in a local buffer at the receiving host or actually delivered to the receiver.

[3.3] Persistence and Synchronicity in Communication (2)


Client/server computing is generally based on a model of synchronous communication:

• client and server to be active at the time of communication,

• client issues request and blocks until reply received,

• server essentially waits only for incoming requests and subsequently processes them.

Drawbacks of synchronous communication:

• client cannot do any other work while waiting for reply,

• failures to be dealt with immediately (the client is waiting),

• in many cases the model simply not appropriate (mail, news).

[3.4] Persistence and Synchronicity in Communication (3)

[Figure: a sending host and a receiving host, each running an application on a local OS behind a messaging interface, connected through communication servers that hold routing functions and local buffers; buffering in the communication servers is independent of the communicating hosts.]

General organization of a communication system in which hosts are connected through a network.

• queued messages sent among processes,

• sender not stopped in waiting for immediate reply,

• fault tolerance often ensured by middleware.


[3.5] Persistence and Synchronicity in Communication (4)

Persistent vs. transient communication

Persistent communication
A message is stored at a communication server as long as it takes to deliver it
at the receiver.

Transient communication
A message is discarded by a communication server as soon as it cannot be
delivered at the next server or at the receiver.

[3.6] Persistence and Synchronicity in Communication (5)

[Figure: letters move between post offices by pony and rider; at each post office, mail is stored and sorted, to be sent out depending on destination and on when a pony and rider are available.]

Persistent communication of letters back in the days of the Pony Express.

[3.7] Persistence and Synchronicity in Communication (6)


[Figure: six timing diagrams of sender A and receiver B, showing when A blocks and when B must be running.]

Different forms of communication:

a. persistent asynchronous (A sends and continues; the message is stored at B's location for later delivery),

b. persistent synchronous (A waits until the message is accepted; B need not be running),

c. transient asynchronous (A sends and continues; the message can be sent only if B is running),

d. receipt-based transient synchronous (A waits until the request is received),

e. delivery-based transient synchronous (A waits until the request is accepted for processing),

f. response-based transient synchronous (A waits for the reply).

[3.8] Message-Oriented Transient Communication

• socket interface introduced in Berkeley UNIX,


• another transport layer interface: XTI, the X/Open Transport Interface, formerly called the Transport Layer Interface (TLI), developed by AT&T.

socket
Communication endpoint to which an application writes data that are to be sent over the underlying network, and from which incoming data can be read.

[3.9] Berkeley Sockets (1)


Socket primitives for TCP/IP.

[3.10] Berkeley Sockets (2)

[Figure: the server issues socket, bind, listen, accept, then read/write and close; the client issues socket, connect, then write/read and close; the accept/connect pair is the synchronization point, after which communication proceeds.]

Connection-oriented communication pattern using sockets.
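The same pattern as a runnable loopback sketch in Python (illustrative only; a thread plays the server, and an ephemeral port avoids conflicts):

```python
# Server side: socket, bind, listen, accept, read, write, close.
# Client side: socket, connect, write, read, close.

import socket
import threading

def server(sock):
    conn, _ = sock.accept()          # synchronization point with connect()
    data = conn.recv(1024)           # read the client's request
    conn.sendall(data.upper())       # write the reply
    conn.close()                     # close the connection

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket
srv.bind(("127.0.0.1", 0))                                # bind (ephemeral port)
srv.listen(1)                                             # listen
port = srv.getsockname()[1]
threading.Thread(target=server, args=(srv,)).start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # socket
cli.connect(("127.0.0.1", port))                          # connect
cli.sendall(b"hello")                                     # write
reply = cli.recv(1024)                                    # read
cli.close()                                               # close
srv.close()
```

Note how little the primitives say about message boundaries or buffering — the "wrong level of abstraction" complaint that motivates MPI in the next section.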

[3.11] The Message-Passing Interface (MPI) (1)


MPI
Group of message-oriented primitives that would allow developers to easily write
highly efficient applications.

Sockets insufficient because:


• at the wrong level of abstraction supporting only send and receive primi-
tives,

• designed to communicate using general-purpose protocol stacks such as TCP/IP, not suitable for the high-speed interconnection networks used in COWs and MPPs (with different forms of buffering and synchronization).

[3.12] The Message-Passing Interface (MPI) (2)


MPI assumptions:

• communication within a known group of processes,

• each group with assigned id,

• each process within a group also with an assigned id,

• all serious failures (process crashes, network partitions) assumed to be fatal, without any recovery,


• a (groupID, processID) pair used to identify the source and destination of a message,

• only receipt-based transient synchronous communication (form d above) is not supported; all other forms are.

[3.13] The Message-Passing Interface (3)

Some of the most intuitive message-passing primitives of MPI.

[3.14] Message-Oriented Persistent Communication


Message-queueing systems = Message-Oriented Middleware (MOM)

The essence of MOM systems:

• offer the intermediate-term storage capacity for messages,

• target to support message transfers that are allowed to take minutes instead
of seconds or milliseconds,

• no guarantees about when or even if the message will be actually read,

• the sender and receiver can execute completely independently.
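A minimal sketch of this decoupling, with an in-process queue standing in for the communication servers (illustrative only, not a real MOM — a production system would persist the queue): the sender submits messages and continues immediately, while the receiver reads them at its own pace.

```python
# MOM sketch: sender and receiver execute independently; the queue holds
# messages until the receiver happens to read them.

import queue
import threading

mailbox = queue.Queue()              # stands in for a persistent message queue

def sender():
    for n in range(3):
        mailbox.put(f"msg-{n}")      # submit and continue, no waiting

received = []

def receiver():
    for _ in range(3):
        received.append(mailbox.get())   # read at the receiver's own pace

t1 = threading.Thread(target=sender)
t2 = threading.Thread(target=receiver)
t1.start(); t2.start()
t1.join(); t2.join()
```

Neither side ever waits for the other to be at a particular point — only the queue mediates, which is exactly the MOM property listed above.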

[3.15] Message-Queuing Model


Basic interface to a queue in a message-queuing system.

Most queuing systems also allow a process to install handlers as callback functions.

[3.16] Architecture of Message-Queuing Systems (1)

[Figure: sender and receiver each have a queuing layer on top of the local OS; the queuing layer looks up the transport-level address of a queue in an address look-up database that maps queue-level addresses to transport-level addresses.]

The relationship between queue-level addressing and network-level addressing.

source queue, destination queue, a database mapping queue names to network locations.

[3.17] Architecture of Message-Queuing Systems (2)


[Figure: applications with send and receive queues communicate through a network of routers; a message from sender A reaches receiver B via routers R1 and R2.]

The general organization of a message-queuing system with routers:

• may grow into an overlay network,

• may need dynamic routing schemes.

Queue managers:

• normally interact directly with applications,

• some operate as routers or relays.

[3.18] Message Brokers

Figure: the general organization of a message broker in a message-queuing system. The broker program sits on top of the queuing layer and uses a database with conversion rules to translate between the source and destination clients' formats.


Message broker
Acts as an application-level gateway in a message-queuing system. Its main purpose is to convert incoming messages to a format that can be understood by the destination application. It may provide routing capabilities.

[3.19] Notes on Message-Queuing Systems

• with message brokers it may be necessary to accept a certain loss of information during transformation,

• at the heart of a message broker lies a database of conversion rules,

• general message-queuing systems are not aimed at supporting only end users,

• they are set up to enable persistent communication,

• range of applications:

– e-mail, workflow, groupware, batch processing,

– integration of a collection of databases or database applications.

[3.20] Example: IBM MQSeries

Figure: general organization of IBM's MQSeries message-queuing system. Programs access queue managers through the MQ interface; message channel agents (MCAs) move messages asynchronously between queue managers over the internetwork, while local access uses synchronous RPC.

[3.21] Channels


Table: some attributes associated with message channel agents.

[3.22] Message Transfer (1)

Figure: the general organization of an MQSeries queuing network using routing tables and aliases. By using logical names, in combination with name resolution to local queues, it is possible to put a message in a remote queue.

[3.23] Message Transfer (2)


Table: primitives available in the IBM MQSeries MQI.

[3.24] Stream-Oriented Communication

• forms of communication in which timing plays a crucial role,


• example:
– an audio stream built up as a sequence of 16-bit samples each repre-
senting the amplitude of the sound wave as it is done through PCM
(Pulse Code Modulation),
– audio stream represents CD quality, i.e. 44100Hz,
– samples to be played at intervals of exactly 1/44100 sec,
• what facilities should a distributed system offer to exchange time-dependent
information such as audio and video streams?
– support for the exchange of time-dependent information = support
for continuous media,
– continuous (representation) media vs. discrete (representation) me-
dia.

[3.25] Support for Continuous Media


In continuous media:

• temporal relationships between data items are fundamental to correctly interpreting the data,

• timing is crucial.

Asynchronous transmission mode
Data items in a stream are transmitted one after the other, but there are no further timing constraints on when transmission of items should take place.

Synchronous transmission mode
A maximum end-to-end delay is defined for each unit in a data stream.

Isochronous transmission mode
It is necessary that data units are transferred on time. Data transfer is subject to bounded (delay) jitter.

[3.26] Data Stream (1)

Figure:

a. Setting up a stream between two processes across a network,

b. Setting up a stream directly between two devices.

• a stream is a sequence of data units and may be considered a virtual connection between a source and a sink,

• simple stream vs. complex stream (consisting of several related substreams).

[3.27] Data Stream (2)

Figure: an example of multicasting a stream to several receivers. An intermediate node, possibly with filters, forwards the stream to one sink over a lower-bandwidth link.


• problem: receivers may have different requirements with respect to the quality of the stream,

• filters adjust the quality of an incoming stream, differently for each outgoing stream.

[3.28] Specifying QoS (1)

Table: a flow specification.

Time-dependent requirements are among the Quality of Service (QoS) requirements.

[3.29] Specifying QoS (2)

Application

Irregular stream One token is added


of data units to the bucket every ∆T

Regular stream

The principle of a token bucket algorithm.

• tokens generated at a constant rate,

• tokens buffered in a bucket which has limited capacity.
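These two rules can be sketched in a few lines of Python (a simplified, single-threaded model; the class name and parameters are illustrative, not from the lectures):

```python
class TokenBucket:
    """Token bucket: tokens are generated at a constant `rate` (tokens per
    second) and buffered in a bucket of limited `capacity`; sending `units`
    data units consumes that many tokens."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start with a full bucket
        self.last = 0.0          # time of the last call

    def allow(self, units, now):
        """Return True if `units` may be sent at time `now`."""
        # Add the tokens generated since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if units <= self.tokens:
            self.tokens -= units
            return True
        return False
```

A burst of up to `capacity` units may pass at once; after that, the sustained output is limited to `rate` units per second, which is how an irregular input stream is turned into a regular one.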


[3.30] Setting Up a Stream

Figure: the basic organization of RSVP (Resource reSerVation Protocol), a transport-level protocol for resource reservation in a distributed system. The application's data stream passes through the local OS and data link layer, while an RSVP process performs policy and admission control and exchanges reservation requests with other RSVP-enabled hosts.

[3.31] Synchronization Mechanisms (1)

Figure: the principle of explicit synchronization at the level of data units. An application procedure on the receiver's machine reads two audio data units for each video data unit from the incoming stream.


Given a complex stream, how to keep the different substreams in synch?

[3.32] Synchronization Mechanisms (2)

Figure: the principle of synchronization as supported by high-level interfaces. Multimedia control is part of the middleware, and the application tells the middleware what to do with incoming streams.

Multiplexing of all substreams into a single stream and demultiplexing at the receiver; synchronization is handled at the multiplexing/demultiplexing point (as in MPEG).

Chapter 4

Synchronization (I)

[4.1] Synchronization (I)

1. Clock synchronization

2. Logical clocks

3. Global state (distributed snapshot)

4. Election algorithms

5. Mutual exclusion

Synchronization
Setting the time order of the set of events caused by concurrent processes.

[4.2] Clock Synchronization

Figure: when each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time. Here output.c, created after output.o, receives an earlier timestamp because the clock of the machine running the editor runs behind the clock of the machine running the compiler.

[4.3] Timers

• timer,
• registers associated with each crystal:
– counter,
– holding register;
• an interrupt is generated when the counter reaches 0,
• interrupt called every clock tick,
• impossible to guarantee two crystals run at exactly the same frequency,
• after getting out of sync, the difference in time values called clock skew.

[4.4] The Mean Solar Day

Figure: computation of the mean solar day. A transit of the sun occurs when the sun reaches the highest point of the day; because the earth also moves along its orbit, at the transit of the sun n days later the earth has rotated fewer than 360°. The period of the earth's rotation is not constant.

[4.5] Physical Clocks (1)


Transit of the sun The event of the sun reaching its highest apparent point in the sky.

Solar day The interval between two consecutive transits of the sun.

Solar second 1/86400th of a solar day.

• mean solar second (300 million years ago a year had about 400 days).

[4.6] Physical Clocks (2)


Sometimes we simply need the exact time, not just an ordering.
Solution: Universal Coordinated Time (UTC):

• based on the number of transitions per second of the cesium 133 atom
(pretty accurate),

• at present, the real time is taken as the average of some 50 cesium clocks around the world,

• introduces a leap second from time to time to compensate for the fact that days are getting longer.

NIST operates a shortwave radio station with call letters WWV from Fort Collins
in Colorado (a short pulse at the start of each UTC second). UTC is broadcast
through short wave radio and satellite. Satellites can give an accuracy of about
±0.5 ms.
Does this solve all our problems? Don’t we now have some global timing
mechanism? This timing is still way too coarse for ordering every event.

[4.7] Physical Clocks (3)

Figure: TAI seconds are of constant length, unlike solar seconds; leap seconds are introduced into UTC when necessary to keep it in phase with the sun.


• TAI – International Atomic Time,

• 86400 TAI seconds is about 3 msec less than a mean solar day,

• UTC – TAI with leap seconds whenever the discrepancy between TAI and
solar time grows to 800 msec.

[4.8] Physical Clocks (4)

Assumption: a distributed system with a UTC receiver somewhere in it.
Basic principle:

• every machine has a timer that generates an interrupt H times per second,

• there is a clock in machine p that ticks on each timer interrupt; denote the value of that clock by Cp(t), where t is UTC time,

• ideally, we have for each machine p that Cp(t) = t, or, in other words, dC/dt = 1,

• in practice: 1 − ρ ≤ dC/dt ≤ 1 + ρ,

• in order to keep any two clocks from differing by more than δ time units, they must be synchronized at least every δ/(2ρ) seconds.

[4.9] Clock Synchronization Algorithms

Figure: the relation between clock time C and UTC t when clocks tick at different rates. A fast clock has dC/dt > 1, a perfect clock dC/dt = 1, and a slow clock dC/dt < 1.

[4.10] Clock Synchronization Principles

Principle I Every machine asks a time server for the accurate time at least once
every δ/(2ρ) seconds.

• needs an accurate measure of the round trip delay, including interrupt handling and processing of incoming messages.

Principle II Let the time server scan all machines periodically, calculate an
average, and inform each machine how it should adjust its time relative to
its present time.

• probably gets every machine in sync.

• setting the time back is never allowed, therefore smooth adjustments.

[4.11] Clock Synchronization Algorithms


Clock synchronization algorithms:

• Cristian's Algorithm

• The Berkeley Algorithm

• Averaging Algorithms

[4.12] Cristian’s Algorithm

Figure: getting the current time from a time server. The client records T0 when sending the request and T1 when receiving CUTC; both are measured with the same clock, and I is the server's interrupt handling time.

• the client sets its clock to CUTC + (T1 − T0)/2,

• messages with T1 − T0 above some threshold are discarded as victims of network congestion,

• the message that came back fastest is taken as the most accurate one.
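The computation can be sketched as follows (function names are mine; a real implementation would also subtract the server's interrupt handling time I):

```python
def cristian_adjust(t0, t1, c_utc):
    """Estimate the current UTC time at the client from one probe.
    t0, t1: client clock at request send / reply receipt (same clock);
    c_utc:  time reported by the server. The one-way delay is assumed
    to be half the round trip, (T1 - T0) / 2."""
    return c_utc + (t1 - t0) / 2

def best_estimate(samples):
    """Given (t0, t1, c_utc) tuples from repeated probes, use the reply
    that came back fastest, as it is the most accurate one."""
    t0, t1, c_utc = min(samples, key=lambda s: s[1] - s[0])
    return cristian_adjust(t0, t1, c_utc)
```

For example, a reply carrying CUTC = 200 received after a round trip of 4 time units yields the estimate 202.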

[4.13] The Berkeley Algorithm

Figure: the Berkeley algorithm. The time daemon's clock reads 3:00 while the other machines read 2:50 and 3:25; after the exchange, all three adjust to the average, 3:05.

1. The time daemon asks all the other machines for their clock values.

2. The machines answer.

3. The time daemon tells everyone how to adjust their clock.
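The averaging step can be sketched as follows (a simplified model: a real daemon also compensates for message propagation times, and fast clocks are slowed down gradually rather than set back):

```python
def berkeley_adjustments(daemon_time, clocks):
    """Compute per-machine clock adjustments so that every clock,
    including the daemon's, moves to the average of all reported values.
    clocks: mapping machine name -> reported clock value."""
    values = [daemon_time] + list(clocks.values())
    average = sum(values) / len(values)
    adjustments = {name: average - value for name, value in clocks.items()}
    adjustments["daemon"] = average - daemon_time
    return adjustments
```

With the values from the figure expressed in minutes (daemon 180, machines 170 and 205) the average is 185, i.e. 3:05, giving adjustments of +5, +15 and -20.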

[4.14] Averaging Algorithms

• the previous methods are highly centralized,

• decentralized algorithms:

– divide time into fixed-length resynchronization intervals,

– the i-th interval starts at T0 + iR and runs until T0 + (i + 1)R, where R is a system parameter,

– machines broadcast the current time according to their clocks,

– another variation: correcting each message by considering the propagation time from the source,

• Internet: the Network Time Protocol (NTP), accuracy in the range of 1-50 msec.

[4.15] Logical Clocks

• often it is sufficient that all machines agree on the same time,

• only internal consistency matters, not whether the clocks are particularly close to the real time,

• what usually matters is not that all processes agree on what time it is, but rather that they agree on the order in which events occur,

• Lamport's algorithm, which synchronizes logical clocks,

• an extension to Lamport's approach, called vector timestamps.

[4.16] The Happened-Before Relationship


The happened-before relation on the set of events in a distributed system is the
smallest relation satisfying:
• if a and b are two events in the same process, and a comes before b, then
a → b.
• if a is the sending of a message, and b is the receipt of that message, then
a → b.
• if a → b and b → c, then a → c.
This introduces a partial ordering of events in a system with concurrently operating processes.

Concurrent events
Events a and b for which neither a → b nor b → a: nothing can be said about when the events happened or which event happened first.
[4.17] Logical Clocks (1)
How do we maintain a global view on the system’s behavior that is consistent
with the happened-before relation?
Solution: attach a time-stamp C(e) to each event e, satisfying the following
properties:
P1 If a and b are two events in the same process, and a → b, then we demand
that C(a) < C(b).
P2 If a corresponds to sending a message m, and b to the receipt of that message,
then also C(a) < C(b).

How to attach a time-stamp to an event when there’s no global clock?


Solution: maintain a consistent set of logical clocks, one per process.
[4.18] Logical Clocks (2)
Each process Pi maintains a local counter Ci and adjusts this counter according to the following rules:

1. For any two successive events that take place within Pi, Ci is incremented by 1.

2. Each time a message m is sent by process Pi, the message receives a time-stamp Tm = Ci.

3. Whenever a message m is received by a process Pj, Pj adjusts its local counter Cj:

Cj := max{Cj + 1, Tm + 1}.

• property P1 is satisfied by rule 1,

• property P2 is satisfied by rules 2 and 3.
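The three rules translate directly into a small counter class (a sketch; the class and method names are mine):

```python
class LamportClock:
    """Lamport logical clock for one process."""

    def __init__(self):
        self.counter = 0

    def event(self):
        # Rule 1: successive local events increment the counter by 1.
        self.counter += 1
        return self.counter

    def send(self):
        # Rule 2: sending is itself an event; the outgoing message
        # carries the time-stamp Tm = Ci returned here.
        self.counter += 1
        return self.counter

    def receive(self, tm):
        # Rule 3: Cj := max{Cj + 1, Tm + 1}, so the receipt is later
        # than both the previous local event and the send.
        self.counter = max(self.counter + 1, tm + 1)
        return self.counter
```

With one clock per process, properties P1 and P2 hold for the time-stamps these methods return.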

[4.19] Logical Clocks (3)

Figure: Lamport's algorithm example. In (a) three processes run clocks ticking at different rates and exchange messages A, B, C and D; in (b) the receiving clocks have been corrected so that every message is received at a later logical time than it was sent.

[4.20] Total Ordering with Logical Clocks


Still it can occur that two events happen at the same logical time. This may be avoided by attaching a process number to an event:

If Pi time-stamps event e with Ci(e).i, then Ci(a).i comes before Cj(b).j if and only if:

• Ci(a) < Cj(b), or

• Ci(a) = Cj(b) and i < j.

[4.21] Example: Totally-Ordered Multicasting

Figure: a replicated database in which update 1 is performed before update 2 at one replica, while update 2 is performed before update 1 at the other.

• this situation requires totally-ordered multicasting, which can be implemented with Lamport timestamps,

• each message is always timestamped with the current logical time of the sender,

• a received message is put into a local queue, ordered according to its timestamp, and the receiver multicasts an acknowledgement to the others,

• a process can deliver a queued message to the application it is running only when that message is at the head of the queue and has been acknowledged by every other process.

[4.22] Vector Timestamps (1)

• Lamport timestamps do not guarantee that if C(a) < C(b) then a indeed happened before b. Vector timestamps are required for that.

– each process Pi has an array Vi[1 . . . n], where Vi[j] denotes the number of events that process Pi knows have taken place at process Pj,

– when Pi sends a message m, it adds 1 to Vi[i], and sends Vi along with m as vector timestamp vt(m). Upon arrival, each other process knows Pi's timestamp.

• the timestamp vt(m) tells the receiver how many events in other processes have preceded m, and on which m may causally depend.

[4.23] Vector Timestamps (2)

• when a process Pj receives m from Pi with vt(m), it:

– updates each Vj[k] to max{Vj[k], vt(m)[k]},

– increments Vj[j] by 1.

• to support causal delivery of messages, assume you increment your own component only when sending a message. Then, Pj postpones delivery of m until:

– vt(m)[i] = Vj[i] + 1 and

– vt(m)[k] ≤ Vj[k] for all k ≠ i.

Example
Given V3 = [0, 2, 2] and vt(m) = [1, 3, 0]:
What information does P3 have, and what will it do after receiving m (from P1)?

[4.24] An example of Causal Delivery of Messages (1)


Assumptions:

• messages are multicast by each process to all other processes participating in the communication,

• all messages sent by one process are received in the same order by each other process,

• a reliable message sending mechanism,

• the order of messages from different processes is not forced.

Actions on the sender side:

1. Sending (multicasting) of the message.

Actions on the receiver side:

1. Receiving of the message by the communication layer.

2. Delivering of the message to the target process.

[4.25] An example of Causal Delivery of Messages (2)


Let

vtm - vector timestamp of message m,

VP - current vector of process P.

Rules
When message m is sent by process P, it is sent together with vector timestamp vtm built up in the following way:

1. vtm[P] = VP[P] + 1,

2. vtm[X] = VP[X] for all X different from P.

A received message m from P is delivered to the process Q only if the following conditions are met:

1. vtm[P] = VQ[P] + 1,

2. vtm[X] ≤ VQ[X] for all X different from P.

When message m is delivered to the process Q:

1. VQ[X] = max{VQ[X], vtm[X]}.
[4.26] An example of Causal Delivery of Messages (3)


Three processes: A, B, C with initial vectors VA = VB = VC = (0, 0, 0).
General scenario:

1. Process A multicasts request m1.

2. Process B multicasts reply m2 as a result of obtaining the request in message m1.

Goal:
All processes should deliver message m2 only after delivering message m1. If m2 is received first by the transport layer of some process, its delivery must be postponed until m1 has been received and delivered.

[4.27] An example of Causal Delivery of Messages (4)


A sends m1(0 + 1, 0, 0) = m1(1, 0, 0).

B receives m1(1, 0, 0) from A:

VB = (0, 0, 0), vtm1 = (1, 0, 0),

m1 is delivered at once because:

vtm1[A] = VB[A] + 1,

vtm1[X] ≤ VB[X] for all X different from A;

after m1 delivery the new value of VB is (1, 0, 0).

B sends m2(1, 0 + 1, 0) = m2(1, 1, 0).

A receives m2(1, 1, 0) from B:

VA = (1, 0, 0), vtm2 = (1, 1, 0),

m2 is delivered at once because:

vtm2[B] = VA[B] + 1,

vtm2[X] ≤ VA[X] for all X different from B;

after m2 delivery the new value of VA is (1, 1, 0).

[4.28] An example of Causal Delivery of Messages (5)


C receives m2(1, 1, 0) from B:

VC = (0, 0, 0), vtm2 = (1, 1, 0),

m2 delivery is postponed because:

vtm2[A] > VC[A], and A is different from B.

Comment:
We should not deliver the message m2 sent by B to process C yet, because at the time B sent it, B had already received some message from process A that C does not know about.

Perhaps that earlier message, already received by B but not yet by C, contained something important that C should receive before m2. C first has to deliver the message that B had already delivered before sending m2.

[4.29] An example of Causal Delivery of Messages (6)

C receives m1(1, 0, 0) from A:

VC = (0, 0, 0), vtm1 = (1, 0, 0),

m1 is delivered at once because:

vtm1[A] = VC[A] + 1,

vtm1[X] ≤ VC[X] for all X different from A;

after m1 delivery the new value of VC is (1, 0, 0).

Now C checks its delivery queue; m2 may be, and is, delivered because:

VC = (1, 0, 0), vtm2 = (1, 1, 0),

vtm2[B] = VC[B] + 1,

vtm2[X] ≤ VC[X] for all X different from B;

after m2 delivery the new value of VC is (1, 1, 0).

After the two multicasts A → BC and B → AC, the current values of the vector timestamps of the processes are: VA = VB = VC = (1, 1, 0).

[4.30] Global State (1)

Sometimes one wants to collect the current state of a distributed computation, called a distributed snapshot.
It consists of: (1) all local states and (2) messages currently in transit.

Figure: a distributed snapshot should reflect a consistent state. In the consistent cut (a), every message received before the cut was also sent before it; the cut in (b) is inconsistent because the sender of m2 cannot be identified with this cut.

[4.31] Global State (2)

• a collection of processes connected to each other through unidirectional point-to-point communication channels,

• any process P can initiate taking a distributed snapshot:

1. P starts by recording its own local state,

2. P subsequently sends a marker along each of its outgoing channels,

3. when Q receives a marker through channel C, its action depends on whether it has already recorded its local state:

• not yet recorded: it records its local state, and sends the marker along each of its outgoing channels,

• already recorded: the marker on C indicates that the channel's state should be recorded; all messages received since the time Q recorded its own state and before that marker are recorded as the channel's state,

4. Q is finished when it has received a marker along each of its incoming channels.

[4.32] Global State (3)


Distributed snapshot, channel state recording:


1. Process Q receives a marker for the first time and records its local state.

2. Q records all incoming messages.

3. Q receives a marker for its incoming channel and finishes recording the
state of the incoming channel.

[4.33] Election Algorithms


An algorithm may require that some process acts as a coordinator. How do we select this special process dynamically?

• in many systems the coordinator is chosen by hand (e.g. file servers); this leads to centralized solutions ⇒ a single point of failure,

• if a coordinator is chosen dynamically, to what extent can one speak about a centralized or distributed solution? Having a central coordinator does not necessarily make an algorithm non-distributed.

• is a fully distributed solution, i.e. one without a coordinator, always more robust than any centralized/coordinated solution? Fully distributed solutions are not necessarily better.

Example election algorithms:

• the bully algorithm,

• a ring algorithm.

66
CHAPTER 4. SYNCHRONIZATION (I)

[4.34] The Bully Election Algorithm (1)

Each process has an associated priority (weight). The process with the highest priority should always be elected as the coordinator.
How do we find the heaviest process?

• any process can start an election by sending an election message to all other processes (assuming you don't know the weights of the others),

• if process Pheavy receives an election message from a lighter process Plight, it sends a take-over message to Plight; Plight is out of the race,

• if a process doesn't get a take-over message back, it wins, and sends a victory message to all other processes.
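The message flow can be sketched as follows (the function name and trace format are mine; failures during the election itself are ignored, so the winner is simply the highest-numbered live process):

```python
def bully_election(alive, starter):
    """alive: set of live process ids; starter: the process that notices
    the coordinator crashed. Returns (coordinator, trace), where trace
    lists (kind, src, dst) messages."""
    trace = []
    p = starter
    while True:
        higher = sorted(q for q in alive if q > p)
        for q in higher:
            trace.append(("ELECTION", p, q))
        if not higher:
            # Nobody bigger answered: p wins and announces victory.
            for q in sorted(alive - {p}):
                trace.append(("COORDINATOR", p, q))
            return p, trace
        for q in higher:
            trace.append(("OK", q, p))   # p is out of the race
        # Every higher process holds its own election; following the
        # lowest of them is enough to reach the eventual winner.
        p = higher[0]
```

With processes 0-7 alive except the crashed coordinator 7, an election started by 4 ends with 6 as the new coordinator, matching the figures in [4.35] and [4.36].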

[4.35] The Bully Election Algorithm (2)

Figure: the bully election algorithm.

a. process 4 holds an election after the previous coordinator (7) has crashed,

b. processes 5 and 6 respond, telling 4 to stop,

c. now 5 and 6 each hold an election.

[4.36] The Bully Election Algorithm (3)

Figure: the bully election algorithm, continued.

d. process 6 tells 5 to stop,

e. process 6 wins and tells everyone.

[4.37] A Ring Algorithm (1)

Process priority is obtained by organizing the processes into a (logical) ring. The process with the highest priority should be elected as coordinator.

• any process can start an election by sending an election message to its successor; if a successor is down, the message is passed on to the next successor,

• if a message is passed on, the sender adds itself to the list; when it gets back to the initiator, everyone has had a chance to make its presence known,

• the initiator sends a coordinator message around the ring containing a list of all living processes; the one with the highest priority is elected as coordinator.
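One election round can be sketched as follows (names are mine; concurrent initiators, as in the figure, are handled in practice by discarding duplicate coordinator messages):

```python
def ring_election(ring, alive, starter):
    """ring: process ids in ring order; alive: set of live ids;
    starter: the initiator. The election message accumulates the ids of
    the live processes it visits; the highest one becomes coordinator."""
    members = [starter]
    i = ring.index(starter)
    while True:
        i = (i + 1) % len(ring)      # pass to the successor...
        succ = ring[i]
        if succ not in alive:        # ...skipping crashed processes
            continue
        if succ == starter:          # message came back to the initiator
            break
        members.append(succ)
    return max(members), members
```

For the ring 0-7 with coordinator 7 crashed, an election started by 5 collects [5, 6, 0, 1, 2, 3, 4] and elects 6.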

[4.38] A Ring Algorithm (2)

Figure: election in a ring. The previous coordinator (7) has crashed and gives no response; processes 2 and 5 both start elections, and each election message accumulates the identifiers of the live processes it visits ([5], [5,6], [5,6,0], ... and [2], [2,3], ...).

[4.39] Mutual Exclusion


A number of processes in a distributed system want exclusive access to some
resource.
Standard solutions:
• via a centralized server,
• completely distributed, with no topology imposed,
• completely distributed, making use of a logical ring.

[4.40] MutEx: A Centralized Algorithm

Figure: mutual exclusion with a centralized coordinator (panels a-c). The coordinator grants OK to the first requester, queues the second request without replying, and sends OK to the queued process on release.

1. Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.


2. Process 2 then asks permission to enter the same critical region. The
coordinator does not reply.

3. When process 1 exits the critical region, it tells the coordinator, which then replies to 2.

[4.41] MutEx: Ricart & Agrawala Algorithm (1)


Ricart & Agrawala algorithm – completely distributed, with no topology im-
posed.

• the same as Lamport except that acknowledgments aren’t sent. Instead,


replies (i.e. grants) are sent only when:

– the receiving process has no interest in the shared resource or


– the receiving process is waiting for the resource, but has lower priority
(known through comparison of time-stamps).

• in all other cases, reply is deferred, implying some more local administra-
tion.

[4.42] MutEx: Ricart & Agrawala Algorithm (2)

Figure: the Ricart & Agrawala algorithm (panels a-c). Processes 0 and 2 multicast requests with timestamps 8 and 12; process 0, having the lower timestamp, enters the critical region first, and 2 enters after receiving 0's OK.

1. Two processes want to enter the same critical region at the same moment.

2. Process 0 has the lowest timestamp, so it wins.

3. When process 0 is done, it sends an OK as well, so 2 can now enter the critical region.


[4.43] MutEx: A Token Ring Algorithm

Figure:

1. An unordered group of processes on a network.

2. A logical ring constructed in software (here the token circulates in the order 0, 2, 4, 9, 7, 1, 6, 5, 8, 3).

[4.44] Mutual Exclusion - Comparison

Algorithm      Messages per entry/exit    Delay before entry (in message times)    Potential problems
Centralized    3                          2                                        Coordinator crash
Distributed    2(n − 1)                   2(n − 1)                                 Crash of any process
Token Ring     1 to ∞                     0 to n − 1                               Lost token, process crash

A comparison of three mutual exclusion algorithms.

Chapter 5

Synchronization (II)

[5.1] Distributed Transactions

1. The transaction model

• ACID properties

2. Classification of transactions

• flat transactions,
• nested transactions,
• distributed transactions.

3. Concurrency control

• serializability,
• synchronization techniques
– two-phase locking,
– pessimistic timestamp ordering,
– optimistic timestamp ordering.

[5.2] The Transaction Model (1)

Figure: updating a master tape is fault tolerant. The previous inventory and today's updates are read from input tapes; the computer writes the new inventory to an output tape.

[5.3] The Transaction Model (2)

Table: examples of primitives for transactions.

[5.4] The Transaction Model (3)

a. transaction to reserve three flights commits,

b. transaction aborts when third flight is unavailable.


[5.5] ACID Properties


Transaction
Collection of operations on the state of an object (database, object composition,
etc.) that satisfies the following properties:

Atomicity All operations either succeed, or all of them fail. When the transaction fails, the state of the object will remain unaffected by the transaction.

Consistency A transaction establishes a valid state transition. This does not exclude the possibility of invalid, intermediate states during the transaction's execution.

Isolation Concurrent transactions do not interfere with each other. It appears to each transaction T that other transactions occur either before T, or after T, but never both.

Durability After the execution of a transaction, its effects are made permanent:
changes to the state survive failures.

[5.6] Transaction Classification


Flat transactions
The most familiar one: a sequence of operations that satisfies the ACID properties.

Nested transactions
A hierarchy of transactions that allows (1) concurrent processing of subtransactions, and (2) recovery per subtransaction.

Distributed transactions
A (flat) transaction that is executed on distributed data. Often implemented as a two-level nested transaction with one subtransaction per node.

[5.7] Flat Transactions – Limitations

• they do not allow partial results to be committed or aborted,

• the strength of the atomicity property of a flat transaction is also partly its weakness,

• solution: usage of nested transactions,

• difficult scenarios:

– a subtransaction commits but the higher-level transaction aborts,

– if a subtransaction commits and a new subtransaction is started, the second one has to have the results of the first one available.

[5.8] Distributed Transactions

• a nested transaction is logically decomposed into a hierarchy of subtransactions,

• a distributed transaction is a logically flat, indivisible transaction that operates on distributed data. Separate distributed algorithms are required for (1) handling the locking of data and (2) committing the entire transaction.

Figure: (a) a nested transaction whose subtransactions run on two different (independent) databases, an airline database and a hotel database; (b) a distributed transaction whose subtransactions operate on two physically separated parts of the same distributed database.

[5.9] Transaction Implementation

1. private workspace

• use a private workspace, by which the client gets its own copy of (the relevant part of) the database; when things go wrong delete the copy, otherwise commit the changes to the original,

• optimization: do not copy everything up front.

2. write-ahead log

• use a write-ahead log in which changes are recorded, allowing you to roll back when things go wrong.


[5.10] TransImpl: Private Workspace

Figure: the private workspace technique.

a. The file index and disk blocks for a three-block file,

b. The situation after a transaction has modified block 0 and appended block 3,

c. After committing.

[5.11] TransImpl: Write-Ahead Log

a. A transaction,


b.-d. The log before each statement is executed.

[5.12] Transactions: Concurrency Control (1)

Figure: general organization of managers for handling transactions. Transactions issue BEGIN_TRANSACTION/END_TRANSACTION and READ/WRITE primitives to the transaction manager; the scheduler issues LOCK/RELEASE or timestamp operations; the data manager executes the actual read/write operations.

[5.13] Transactions: Concurrency Control (2)

Figure: general organization of managers for handling distributed transactions. A single transaction manager talks to a scheduler on each of the machines A, B and C, and each scheduler controls a local data manager.

[5.14] Serializability (1)

a.-c. Three transactions T1, T2, and T3,

d. Possible schedules.


[5.15] Serializability (2)


Consider a collection E of transactions T1, ..., Tn. The goal is to conduct a serializable execution of E:

• the transactions in E are executed, possibly concurrently, according to some schedule S,

• schedule S is equivalent to some totally ordered execution of T1, ..., Tn.

Because we are not concerned with the computations of each transaction, a transaction can be modeled as a log of read and write operations.

Two operations OPER(Ti, x) and OPER(Tj, x) from different logs, both on the same data item x, may conflict at a data manager:

read-write conflict (rw) one is a read operation while the other is a write op-
eration on x,

write-write conflict (ww) both are write operations on x.
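The two conflict rules fit in a few lines. Operations are modeled here as (transaction, kind, item) tuples, which is an assumption of this sketch, not the lecture's notation:

```python
def conflicts(op1, op2):
    """Two operations conflict iff they come from different transactions,
    touch the same data item, and at least one of them is a write
    (the rw and ww conflicts defined above)."""
    t1, kind1, x1 = op1
    t2, kind2, x2 = op2
    return t1 != t2 and x1 == x2 and "W" in (kind1, kind2)

assert conflicts(("T1", "R", "x"), ("T2", "W", "x"))      # read-write conflict
assert conflicts(("T1", "W", "x"), ("T2", "W", "x"))      # write-write conflict
assert not conflicts(("T1", "R", "x"), ("T2", "R", "x"))  # reads never conflict
assert not conflicts(("T1", "W", "x"), ("T2", "W", "y"))  # different data items
```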

[5.16] Synchronization Techniques

1. Two-phase locking
Before reading or writing a data item, a lock must be obtained. After a
lock is given up, the transaction is not allowed to acquire any more locks.

2. Timestamp ordering
Operations in a transaction are time-stamped, and data managers are forced
to handle operations in timestamp order.

3. Optimistic control
Don’t prevent things from going wrong, but correct the situation if conflicts
actually did happen. Basic assumption: you can pull it off in most cases.

[5.17] Two-Phase Locking (1)

• clients do only READ and WRITE operations within transactions,

• locks are granted and released only by scheduler,

• locking policy is to avoid conflicts between operations.


1. When a client submits OPER(Ti, x), the scheduler tests whether it conflicts with an operation OPER(Tj, x) from any other client. If there is no conflict, it grants LOCK(Ti, x); otherwise it delays the execution of OPER(Ti, x).

• conflicting operations are thus executed in the same order as the locks are granted.

2. If LOCK(Ti, x) has been granted, the lock is not released until OPER(Ti, x) has been executed by the data manager.

3. Once RELEASE(Ti, x) has taken place, no more locks for Ti may be granted.
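The three rules can be sketched as a tiny scheduler. This is a deliberately simplified model (exclusive locks only, conflicting operations reported rather than queued), not a full 2PL implementation:

```python
class TwoPhaseScheduler:
    """Minimal two-phase-locking scheduler following the three rules above."""

    def __init__(self):
        self.locks = {}         # data item -> transaction holding its lock
        self.shrinking = set()  # transactions already past their lock point

    def acquire(self, tx, item):
        if tx in self.shrinking:
            raise RuntimeError(f"{tx} is past its lock point; no new locks")
        holder = self.locks.get(item)
        if holder not in (None, tx):
            return False        # conflict: the operation must be delayed
        self.locks[item] = tx   # rule 1: no conflict, grant the lock
        return True

    def release(self, tx, item):
        assert self.locks.get(item) == tx
        del self.locks[item]
        self.shrinking.add(tx)  # rule 3: the shrinking phase has begun

s = TwoPhaseScheduler()
assert s.acquire("T1", "x")        # T1 locks x
assert not s.acquire("T2", "x")    # T2 conflicts and must wait
s.release("T1", "x")
assert s.acquire("T2", "x")        # now T2 may proceed
```

After the release, any further acquire by T1 raises an error: that is precisely the two-phase property.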

[5.18] Two-Phase Locking (2)

[Figure: the number of locks held over time: a growing phase up to the lock point, followed by a shrinking phase.]
Two-phase locking.

[5.19] Two-Phase Locking (3)


Types of 2PL

Centralized 2PL A single site handles all locks,

Primary 2PL Each data item is assigned a primary site to handle its locks.
Data is not necessarily replicated,

Distributed 2PL Assumes data can be replicated. Each primary is responsible for handling locks for its data, which may reside at remote data managers.


Problems:
• deadlocks possible – handled by ordering lock acquisition, deadlock detection, or a timeout scheme,

• cascaded aborts – prevented by strict two-phase locking.

[5.20] 2PL: Strict 2PL

[Figure: the number of locks held over time: a growing phase up to the lock point, after which all locks are released at the same time.]
Strict two-phase locking.

[5.21] Pessimistic Timestamp Ordering (1)

• each transaction T has a timestamp ts(T) assigned,

• timestamps are unique (Lamport's algorithm),

• every operation that is part of T is timestamped with ts(T),

• every data item x has a read timestamp tsRD(x) and a write timestamp tsWR(x),

• if operations conflict, the data manager processes the one with the lowest timestamp,

• compared to locking (like 2PL): aborts are possible, but the scheme is deadlock free.
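These rules can be sketched per data item. The class below is a simplified model of my own (it omits the tentative writes shown in the next figure):

```python
class TSOItem:
    """Pessimistic timestamp ordering for one data item x: an operation
    that arrives with too low a timestamp forces its transaction to abort."""

    def __init__(self):
        self.value = None
        self.ts_rd = 0   # tsRD(x): highest timestamp that has read x
        self.ts_wr = 0   # tsWR(x): highest timestamp that has written x

    def read(self, ts):
        if ts < self.ts_wr:
            return "abort"    # a younger transaction already wrote x
        self.ts_rd = max(self.ts_rd, ts)
        return self.value

    def write(self, ts, value):
        if ts < self.ts_rd or ts < self.ts_wr:
            return "abort"    # x was already read/written at a later timestamp
        self.ts_wr, self.value = ts, value
        return "ok"

x = TSOItem()
assert x.write(ts=2, value="a") == "ok"
assert x.read(ts=1) == "abort"              # T1 is too old: T2 already wrote x
assert x.read(ts=3) == "a"
assert x.write(ts=2, value="b") == "abort"  # x was already read at timestamp 3
```

Note that there is no waiting anywhere, which is why the scheme is deadlock free; the price is the aborts.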

[5.22] Pessimistic Timestamp Ordering (2)

[Figure: timelines (a)-(h) comparing ts(T2) with tsRD(x), tsWR(x), and a tentative write timestamp tstent(x); depending on the ordering, T2's operation is performed (OK), performed as a tentative write, or aborted.]

(a)-(d) T 2 is trying to write an item, (e)-(f) T 2 is trying to read an item.

[5.23] Optimistic Timestamp Ordering


Assumptions:
• conflicts are relatively rare,
• go ahead and do whatever you want, solve conflicts later on,
• keep track of which data items have been read and written (private workspaces,
shadow copies),
• check possible conflicts at the time of committing.

Features:
• deadlock free with maximum parallelism,
• under conditions of heavy load, the probability of failure (and abort) goes
up substantially,

• mostly studied in the context of nondistributed systems,

• hardly ever implemented in commercial or prototype systems.
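The commit-time check can be sketched as backward validation: compare the committing transaction's read set against the write sets of transactions that committed while it was running. The set-based interface is my own simplification:

```python
def validate(read_set, committed_write_sets):
    """Backward-validation sketch: the committing transaction may proceed
    only if nothing it read was overwritten by a concurrently committed
    transaction; otherwise it must abort and restart."""
    for ws in committed_write_sets:
        if read_set & ws:     # someone overwrote an item we read
            return False
    return True

# T read {x, y}; a transaction that committed during T's run wrote {z}: fine.
assert validate({"x", "y"}, [{"z"}])
# If the concurrent commit wrote {x} instead, T's reads are stale: abort.
assert not validate({"x", "y"}, [{"x"}])
```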

[5.24] MySQL: Transactions (1)


By default, MySQL runs with autocommit mode enabled. This means that as
soon as you execute a statement that updates (modifies) a table, MySQL stores
the update on disk.


• SET AUTOCOMMIT = {0 | 1}

Start and stop transaction:

• START TRANSACTION | BEGIN [WORK]

• COMMIT [WORK] [AND [NO] CHAIN] [[NO] RELEASE]

• ROLLBACK [WORK] [AND [NO] CHAIN] [[NO] RELEASE]

[5.25] MySQL: Transactions (2)

• If you issue a ROLLBACK statement after updating a non-transactional table within a transaction, a warning occurs. Changes to transaction-safe tables are rolled back, but not changes to non-transaction-safe tables.

• InnoDB – transaction-safe storage engine,

• MySQL uses table-level locking for MyISAM and MEMORY tables, page-
level locking for BDB tables, and row-level locking for InnoDB tables.

• Some statements cannot be rolled back. In general, these include data definition language (DDL) statements, such as those that create or drop databases, those that create, drop, or alter tables or stored routines.

• Transactions cannot be nested. This is a consequence of the implicit COMMIT performed for any current transaction when you issue a START TRANSACTION statement or one of its synonyms.

[5.26] MySQL: Savepoints


The savepoints syntax:

• SAVEPOINT identifier

• ROLLBACK [WORK] TO SAVEPOINT identifier

• RELEASE SAVEPOINT identifier

Description:


• The ROLLBACK TO SAVEPOINT statement rolls back a transaction to the named savepoint. Modifications that the current transaction made to rows after the savepoint was set are undone in the rollback, but InnoDB does not release the row locks that were stored in memory after the savepoint.

• All savepoints of the current transaction are deleted if you execute a COMMIT, or a ROLLBACK that does not name a savepoint.
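These semantics can be mimicked with a toy in-memory model. This is a sketch of the savepoint behavior described above, not MySQL code, and it ignores locking entirely:

```python
class SavepointTx:
    """Toy savepoint semantics: ROLLBACK TO SAVEPOINT undoes only the
    changes made after the savepoint was set; COMMIT discards all
    savepoints of the transaction."""

    def __init__(self, row):
        self.row = dict(row)
        self.savepoints = {}            # name -> snapshot of the row

    def savepoint(self, name):
        self.savepoints[name] = dict(self.row)

    def update(self, key, value):
        self.row[key] = value

    def rollback_to(self, name):
        self.row = dict(self.savepoints[name])  # undo work after the savepoint

    def commit(self):
        self.savepoints.clear()         # all savepoints of the tx are deleted
        return self.row

tx = SavepointTx({"balance": 100})
tx.update("balance", 80)
tx.savepoint("sp1")
tx.update("balance", 0)
tx.rollback_to("sp1")
assert tx.row["balance"] == 80   # only the post-savepoint change was undone
assert tx.commit() == {"balance": 80} and tx.savepoints == {}
```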

[5.27] MySQL: Isolation Levels in InnoDB (1)


Isolation levels:

• SET [SESSION | GLOBAL] TRANSACTION ISOLATION LEVEL


{READ UNCOMMITTED | READ COMMITTED | REPEATABLE READ | SE-
RIALIZABLE}

• SELECT @@global.tx_isolation;

• SELECT @@tx_isolation;

• Suppose that you are running in the default REPEATABLE READ isola-
tion level. When you issue a consistent read (that is, an ordinary SELECT
statement), InnoDB gives your transaction a timepoint according to which
your query sees the database. If another transaction deletes a row and
commits after your timepoint was assigned, you do not see the row as
having been deleted. Inserts and updates are treated similarly.

[5.28] MySQL: Isolation Levels in InnoDB (2)

READ UNCOMMITTED SELECT statements are performed in a non-locking fashion, but a possible earlier version of a record might be used. Thus, using this isolation level, such reads are not consistent. This is also called a dirty read. Otherwise, this isolation level works like READ COMMITTED.

READ COMMITTED Consistent reads behave as in other databases: Each consistent read, even within the same transaction, sets and reads its own fresh snapshot.


REPEATABLE READ This is the default isolation level of InnoDB. All con-
sistent reads within the same transaction read the snapshot established by
the first such read in that transaction. You can get a fresher snapshot for
your queries by committing the current transaction and after that issuing
new queries.

SERIALIZABLE This level is like REPEATABLE READ, but InnoDB implicitly converts all plain SELECT statements to SELECT ... LOCK IN SHARE MODE.

Chapter 6

Consistency and Replication

[6.1] Consistency and Replication


Consistency and replication

1. Introduction

2. Data-centric consistency models

3. Client-centric consistency models

4. Consistency protocols

[6.2] Introduction
Two primary reasons for replicating data:

• reliability – to increase reliability of a system,

• performance – to scale in numbers and geographical area.

Reliability corresponds to fault tolerance, performance/scalability corresponds to


high availability.

The cost of replication:

• modifications have to be carried out on all copies to ensure consistency,

• when and how modifications need to be carried out, determines the price
of replication.

87
CHAPTER 6. CONSISTENCY AND REPLICATION

[6.3] Performance and Scalability


Main issue: To keep replicas consistent, we generally need to ensure that all conflicting operations are done in the same order everywhere.

Conflicting operations:
read–write conflict a read operation and a write operation act concurrently,
write–write conflict two concurrent write operations.
Guaranteeing global ordering on conflicting operations may be a costly operation,
downgrading scalability.

Solution: weaken consistency requirements so that hopefully global synchro-


nization can be avoided.

[6.4] Data-Centric Consistency Models (1)

[Figure: several processes, each operating on a local copy of the same distributed data store.]

The general organization of a logical data store, physically distributed and repli-
cated across multiple processes.
Consistency model
A contract between a (distributed) data store and processes, in which the data
store specifies precisely what the results of read and write operations are in the
presence of concurrency.

[6.5] Data-Centric Consistency Models (2)


Strong consistency models: Operations on shared data are synchronized:


• strict consistency (related to time),

• sequential consistency (what we are used to),

• causal consistency (maintains only causal relations),

• FIFO consistency (maintains only individual ordering).

Weak consistency models: Synchronization occurs only when shared data is


locked and unlocked:

• general weak consistency,

• release consistency,

• entry consistency.

Observation: The weaker the consistency model, the easier it is to build a


scalable solution.

[6.6] Strict Consistency


Strict consistency

Any read to a shared data item X returns the value stored by the
most recent write operation on X.

(a) P1: W(x)a
    P2: R(x)a

(b) P1: W(x)a
    P2: R(x)NIL R(x)a

Behavior of two processes, operating on the same data item.

a. a strictly consistent store,

b. a store that is not strictly consistent.

[6.7] Linearizability and Sequential Consistency (1)


Sequential Consistency


The result of any execution is the same as if the operations of all processes were executed in some sequential order, and the operations of each individual process appear in this sequence in the order specified by its program.

(a) P1: W(x)a
    P2: W(x)b
    P3: R(x)b R(x)a
    P4: R(x)b R(x)a

(b) P1: W(x)a
    P2: W(x)b
    P3: R(x)b R(x)a
    P4: R(x)a R(x)b

All processes should see the same interleaving of operations.


a. a sequentially consistent data store,
b. a data store that is not sequentially consistent.
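For tiny histories like those in the figure, the definition can be tested by brute force: search for one legal total order that respects every process's program order. The tuple encoding and function names below are my own, and the search is exponential, so this is purely didactic:

```python
from itertools import permutations

def legal(schedule):
    """A total order is legal if every read returns the value written by
    the most recent preceding write to that item (or NIL if none)."""
    val = {}
    for _proc, kind, x, v in schedule:
        if kind == "W":
            val[x] = v
        elif val.get(x, "NIL") != v:
            return False
    return True

def sequentially_consistent(histories):
    """Brute-force search for one legal interleaving that preserves each
    process's program order."""
    ops = [op for h in histories for op in h]
    for perm in permutations(ops):
        in_order = all(
            [op for op in perm if op[0] == h[0][0]] == h for h in histories
        )
        if in_order and legal(perm):
            return True
    return False

# The two panels of the figure, encoded as (process, kind, item, value):
a = [[("P1", "W", "x", "a")],
     [("P2", "W", "x", "b")],
     [("P3", "R", "x", "b"), ("P3", "R", "x", "a")],
     [("P4", "R", "x", "b"), ("P4", "R", "x", "a")]]
b = [h[:] for h in a]
b[3] = [("P4", "R", "x", "a"), ("P4", "R", "x", "b")]
assert sequentially_consistent(a)      # (a): one order satisfies everyone
assert not sequentially_consistent(b)  # (b): P3 and P4 disagree on the write order
```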

[6.8] Linearizability and Sequential Consistency (3)


linearizable = sequential + operations ordered according to a global time.


Four valid execution sequences for the presented processes. The vertical axis is
time.

[6.9] Causal Consistency (1)


Causal consistency
Writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines.

P1: W(x)a W(x)c


P2: R(x)a W(x)b
P3: R(x)a R(x)c R(x)b
P4: R(x)a R(x)b R(x)c

This sequence is allowed with a causally-consistent store, but not with a sequentially or strictly consistent store.

[6.10] Causal Consistency (2)

(a) P1: W(x)a
    P2: R(x)a W(x)b
    P3: R(x)b R(x)a
    P4: R(x)a R(x)b

(b) P1: W(x)a
    P2: W(x)b
    P3: R(x)b R(x)a
    P4: R(x)a R(x)b

a. a violation of a causally-consistent store,


b. a correct sequence of events in a causally-consistent store.

[6.11] FIFO Consistency (1)


FIFO consistency
Writes done by a single process are seen by all other processes in the
order in which they were issued, but writes from different processes
may be seen in a different order by different processes.


P1: W(x)a
P2: R(x)a W(x)b W(x)c
P3: R(x)b R(x)a R(x)c
P4: R(x)a R(x)b R(x)c

A valid sequence of events of FIFO consistency.

• PRAM consistency = pipelined RAM, writes from a single process can be pipelined,

• easy to implement by tagging each write operation with a (process, sequence number) pair.
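A sketch of that tagging idea, under the assumption that a replica buffers writes that arrive out of order from a given process (the class and names are invented for illustration):

```python
class FifoReceiver:
    """Writes arrive tagged with a (process, sequence number) pair; writes
    from the same process are applied in sequence order, buffering any that
    arrive early. Writes from different processes may still interleave
    arbitrarily, which is exactly FIFO (PRAM) consistency."""

    def __init__(self):
        self.next_seq = {}   # process -> next expected sequence number
        self.pending = {}    # (process, seq) -> buffered write
        self.applied = []    # order in which writes were applied locally

    def receive(self, proc, seq, write):
        self.pending[(proc, seq)] = write
        # apply as many in-order writes from this process as possible
        while (proc, self.next_seq.get(proc, 0)) in self.pending:
            n = self.next_seq.get(proc, 0)
            self.applied.append(self.pending.pop((proc, n)))
            self.next_seq[proc] = n + 1

r = FifoReceiver()
r.receive("P2", 1, "W(x)c")   # arrives early: buffered, not applied
r.receive("P2", 0, "W(x)b")   # fills the gap: both applied, in order
r.receive("P1", 0, "W(x)a")
assert r.applied == ["W(x)b", "W(x)c", "W(x)a"]
```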

[6.12] FIFO Consistency (2)

Statement execution as seen by the three earlier presented processes. The state-
ments in bold are the ones that generate the output shown.

[6.13] FIFO Consistency (3)


Two concurrent processes.


Sequential vs. FIFO consistency:

• FIFO consistency: counterintuitive results – both processes can be killed,

• sequential consistency: none of interleavings results in both processes


being killed,

• in sequential consistency, although the order is non-deterministic, at least


all processes agree what it is. This is not the case in FIFO consistency.

[6.14] Weak Consistency (1)


Weak consistency models

Introduction of explicit synchronization variables. Changes of local


replica content propagated only when an explicit synchronization
takes place.

Properties:

• accesses to synchronization variables associated with a data store are sequentially consistent,

• no operation on a synchronization variable is allowed to be performed until all previous writes have been completed everywhere,

• no read or write operation on data items is allowed to be performed until all previous operations on synchronization variables have been performed.

[6.15] Weak Consistency (2)


(a) P1: W(x)a W(x)b S
    P2: R(x)a R(x)b S
    P3: R(x)b R(x)a S

(b) P1: W(x)a W(x)b S
    P2: S R(x)a

a. a valid sequence of events for weak consistency,

b. an invalid sequence for weak consistency.

Issue: what is the simplest way to implement the weak consistency model when replication uses full replicas?

[6.16] Release Consistency (1)

P1: Acq(L) W(x)a W(x)b Rel(L)


P2: Acq(L) R(x)b Rel(L)
P3: R(x)a

A valid event sequence for release consistency.

[6.17] Release Consistency (2)


Release consistency properties:

• before a read or write operation on shared data is performed, all previous acquires done by the process must have completed successfully,

• before a release is allowed to be performed, all previous reads and writes by the process must have completed,

• accesses to synchronization variables are FIFO consistent (sequential consistency is not required).

Additional issues:

• lazy release consistency versus eager release consistency,


• barriers instead of critical regions possible.

[6.18] Entry Consistency (1)

• with release consistency, all local updates are propagated to other copies/servers
during release of shared data.

• with entry consistency, each shared data item is associated with a synchro-
nization variable.

• when acquiring the synchronization variable, the most recent values of its
associated shared data item are fetched.

Note: Where release consistency affects all shared data, entry consistency affects
only those shared data associated with a synchronization variable.
Question: What would be a convenient way of making entry consistency more
or less transparent to programmers?

[6.19] Entry Consistency (2)

P1: Acq(Lx) W(x)a Acq(Ly) W(y)b Rel(Lx) Rel(Ly)


P2: Acq(Lx) R(x)a R(y)NIL
P3: Acq(Ly) R(y)b

A valid event sequence for entry consistency.

[6.20] Summary of Consistency Models


a. Strong consistency models.

b. Weak consistency models.

[6.21] Client-Centric Consistency Models (1)

1. System model

2. Coherence models

• monotonic reads,
• monotonic writes,
• read-your-writes,
• write-follows-reads.

[6.22] Client-Centric Consistency Models (2)


Goal: Avoiding system-wide consistency, by concentrating on what specific
clients want, instead of what should be maintained by servers.
Background: Most large-scale distributed systems (i.e., databases) apply repli-
cation for scalability, but can support only weak consistency:

DNS updates are propagated slowly, and inserts may not be immediately visible.


NEWS articles and reactions are pushed and pulled throughout the Internet,
such that reactions can be seen before postings.

Lotus Notes geographically dispersed servers replicate documents, but make no


attempt to keep (concurrent) updates mutually consistent.

WWW caches all over the place, but there need be no guarantee that you are
reading the most recent version of a page.

[6.23] Consistency for Mobile Users

Example: Consider a distributed database to which you have access through


your notebook. Assume your notebook acts as a front end to the database.

• at location A you access the database doing reads and updates.

• at location B you continue your work, but unless you access the same
server as the one at location A, you may detect inconsistencies:

– your updates at A may not have yet been propagated to B


– you may be reading newer entries than the ones available at A
– your updates at B may eventually conflict with those at A

Note: The only thing you really want is that the entries you updated and/or read
at A, are in B the way you left them in A. In that case, the database will appear
to be consistent to you.

[6.24] Eventual Consistency

Eventual consistency

Consistency model in large-scale distributed replicated databases that tolerate a relatively high degree of inconsistency. If no updates take place for a long time, all replicas gradually become consistent.

[Figure: a portable computer issues read and write operations against a distributed and replicated database over a wide-area network; when the client moves to another location and (transparently) connects to another replica, the replicas need to maintain client-centric consistency.]

The principle of a mobile user accessing different replicas of a distributed


database.

[6.25] Monotonic Reads (1)


Monotonic reads

If a process reads the value of a data item x, any successive read


operation on x by that process will always return that same or a
more recent value.

(a) L1: WS(x1) R(x1)
    L2: WS(x1;x2) R(x2)

(b) L1: WS(x1) R(x1)
    L2: WS(x2) R(x2) WS(x1;x2)

The read operations performed by a single process P at two different local copies
of the same data store.

a. a monotonic-read consistent data store,

b. a data store that does not provide monotonic reads.


[6.26] Monotonic Reads (2)


Example

Automatically reading your personal calendar updates from different


servers. Monotonic reads guarantees that the user sees all updates,
no matter from which server the automatic reading takes place.

Example

Reading (not modifying) incoming mail while you are on the move.
Each time you connect to a different e-mail server, that server fetches
(at least) all the updates from the server you previously visited.

[6.27] Monotonic Writes (1)


Monotonic writes

A write operation by a process on a data item x is completed before


any successive write operation on x by the same process.

(a) L1: W(x1)
    L2: W(x1) W(x2)

(b) L1: W(x1)
    L2: W(x2)

The write operations performed by a single process P at two different local


copies of the same data store

a. a monotonic-write consistent data store.

b. a data store that does not provide monotonic-write consistency.

[6.28] Monotonic Writes (2)


Example

Updating a program at server S2, and ensuring that all components


on which compilation and linking depends, are also placed at S2.

Example


Maintaining versions of replicated files in the correct order every-


where (propagate the previous version to the server where the newest
version is installed).

[6.29] Read Your Writes


Read your writes

The effect of a write operation by a process on data item x, will


always be seen by a successive read operation on x by the same
process.

(a) L1: W(x1)
    L2: WS(x1;x2) R(x2)

(b) L1: W(x1)
    L2: WS(x2) R(x2)

a. a data store that provides read-your-writes consistency.

b. a data store that does not.

[6.30] Writes Follow Reads


Writes follow reads

A write operation by a process on a data item x following a previous


read operation on x by the same process, is guaranteed to take place
on the same or a more recent value of x that was read.

(a) L1: WS(x1) R(x1)
    L2: WS(x1;x2) W(x2)

(b) L1: WS(x1) R(x1)
    L2: WS(x2) W(x2)

a. a writes-follow-reads consistent data store,

b. a data store that does not provide writes-follow-reads consistency.


[6.31] Examples

Read-your-writes example

Updating your Web page and guaranteeing that your Web browser
shows the newest version instead of its cached copy.

Writes-follow-reads example

See reactions to posted articles only if you have the original posting
(a read “pulls in” the corresponding write operation).

[6.32] Consistency Protocols

Consistency protocol
Describes the implementation of a specific consistency model. We will concentrate only on sequential consistency.

• Primary-based protocols

– remote-write protocols,
– local-write protocols.

• Replicated-write protocols

– active replication,
– quorum-based protocols.

• Cache-coherence protocols (write-through, write-back)

[6.33] Remote-Write Protocols (1)


[Figure: a client's requests all go to a single (fixed) server for item x, with a backup server. Steps: W1 write request; W2 forward request to server for x; W3, W4 acknowledge write completed; R1 read request; R2 forward request to server for x; R3, R4 return response.]

Primary-based remote-write protocol with a fixed server to which all read and
write operations are forwarded.
[6.34] Remote-Write Protocols (2)

[Figure: a primary server for item x with backup servers. Steps: W1 write request; W2 forward request to primary; W3 tell backups to update; W4 acknowledge update; W5 acknowledge write completed; R1 read request, served from a local copy; R2 response to read.]

The principle of primary-backup protocol: read operations allowed on a locally


available copy, write operations forwarded to a fixed primary copy.

[6.35] Local-Write Protocols (1)


[Figure: a current and a new server for item x. Steps: 1. read or write request; 2. forward request to current server for x; 3. move item x to client's server; 4. return result of operation on client's server.]

Primary-based local-write protocol in which a single copy is migrated between


processes.

[6.36] Local-Write Protocols (2)

[Figure: an old and a new primary for item x, plus backup servers. Steps: W1 write request; W2 move item x to new primary; W3 acknowledge write completed; W4 tell backups to update; W5 acknowledge update; R1 read request; R2 response to read.]

Primary-backup protocol in which the primary migrates to the process wanting


to perform an update.

[6.37] Active Replication (1)


[Figure: client A replicates its invocation request; each of the three replicas B1, B2, and B3 forwards the invocation, so object C receives the same invocation three times.]

The problem of replicated invocations.

[6.38] Active Replication (2)

[Figure: (a) a coordinator of object B forwards the invocation request once to C's replicas C1 and C2; (b) a coordinator of object C returns a single result to B's replicas.]

a. forwarding an invocation request from a replicated object,

b. returning a reply to a replicated object.


[6.39] Quorum-Based Protocols

[Figure: twelve replicas A-L with the read and write quorums marked: (a) NR = 3, NW = 10; (b) NR = 7, NW = 6; (c) NR = 1, NW = 12.]

Three examples of the voting algorithm:

a. a correct choice of read and write set,

b. a choice that may lead to write-write conflicts,

c. a correct choice, known as ROWA (read one, write all).

Constraints: NR + NW > N and NW > N/2
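The two constraints are a one-liner to check, and the figure's three example choices fall out directly:

```python
def valid_quorum(n_r, n_w, n):
    """NR + NW > N prevents read-write conflicts (every read quorum overlaps
    every write quorum); NW > N/2 prevents write-write conflicts (any two
    write quorums overlap)."""
    return n_r + n_w > n and n_w > n / 2

# The three examples from the figure, with N = 12 replicas:
assert valid_quorum(3, 10, 12)      # (a) a correct choice
assert not valid_quorum(7, 6, 12)   # (b) NW = 6 is not a majority: ww conflicts possible
assert valid_quorum(1, 12, 12)      # (c) ROWA: read one, write all
```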

[6.40] Cache-Coherence Protocols


Cache coherence strategies:

• coherence detection strategy - when inconsistencies are detected,

• coherence enforcement strategy - how caches are kept consistent with the copies stored at servers.

When processes modify data:

• read-only cache - updates can be performed only by servers,

• write-through cache - clients directly modify cached data and forward


updates to servers,

• write-back cache - propagation of updates may be delayed by allowing


multiple writes to take place before informing servers.


Chapter 7

Fault Tolerance

[7.1] Fault Tolerance

1. Basic concepts - terminology

2. Process resilience

• groups and failure masking

3. Reliable communication

• reliable client-server communication


• reliable group communication

4. Distributed commit

• two-phase commit (2PC)


• three-phase commit (3PC)

[7.2] Dependability
A component provides services to clients. To provide services, the component
may require the services from other components ⇒ a component may depend
on some other component.
Dependability
A component C depends on C∗ if the correctness of C’s behavior depends on
the correctness of C∗’s behavior.
Properties of dependability:

107
CHAPTER 7. FAULT TOLERANCE

• availability readiness for usage,

• reliability continuity of service delivery,

• safety very low probability of catastrophes,

• maintainability how easily a failed system can be repaired.

For distributed systems, components can be either processes or channels.

[7.3] Fault Terminology

• Failure: When a component is not living up to its specifications, a failure


occurs.

• Error: That part of a component’s state that can lead to a failure.

• Fault: The cause of an error.

Different fault management techniques:


• fault prevention: prevent the occurrence of a fault,

• fault tolerance: build a component in such a way that it can meet its
specifications in the presence of faults (i.e., mask the presence of faults),

• fault removal: reduce the presence, number, seriousness of faults,

• fault forecasting: estimate the present number, future incidence, and the
consequences of faults.

[7.4] Different Types of Failures


Different types of failures. Crash failures are the least severe, arbitrary failures
are the worst.

[7.5] Failure Masking by Redundancy

[Figure: (a) three components A, B, and C connected in series; (b) the same circuit with each component triplicated (A1-A3, B1-B3, C1-C3) and voters V1-V9 inserted between the stages.]

Triple modular redundancy (TMR).

[7.6] Process Resilience


Process groups: Protect yourself against faulty processes by replicating and
distributing computations in a group.

[Figure: (a) a flat group; (b) a hierarchical group with a coordinator and workers.]


a. flat groups: good for fault tolerance as information exchange immediately


occurs with all group members. May impose more overhead as control is
completely distributed (hard to implement).

b. hierarchical groups: all communication through a single coordinator ⇒


not really fault tolerant and scalable, but relatively easy to implement.

[7.7] Groups and Failure Masking (1)

Group tolerance
When a group can mask any k concurrent member failures, it is said to be k-fault
tolerant (k is called degree of fault tolerance).

Assume that all members are identical and process all input in the same order. How large does a k-fault tolerant group need to be?

• assume crash/performance failure semantics ⇒ a total of k + 1 members


are needed to survive k member failures.

• assume arbitrary failure semantics, and group output defined by voting ⇒


a total of 2k + 1 members are needed to survive k member failures.

[7.8] Groups and Failure Masking (2)

Assumption: Group members are not identical, i.e., we have a distributed computation.

Problem: Nonfaulty group members should reach agreement on the same value.

Assuming arbitrary failure semantics, we need 3k + 1 group members to survive


the attacks of k faulty members.

We are trying to reach a majority vote among the group of loyalists, in the
presence of k traitors ⇒ we need 2k+1 loyalists. This is also known as Byzantine
failures.
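The three sizing rules can be collected into one helper; the model names are this sketch's own labels for the cases discussed above:

```python
def group_size(k, failure_model):
    """Minimum group size needed to survive k faulty members."""
    if failure_model == "crash":       # fail-silent members: k + 1 suffice
        return k + 1
    if failure_model == "arbitrary":   # identical members, output by voting
        return 2 * k + 1
    if failure_model == "byzantine":   # agreement in a distributed computation
        return 3 * k + 1
    raise ValueError(failure_model)

assert group_size(1, "crash") == 2
assert group_size(1, "arbitrary") == 3
assert group_size(1, "byzantine") == 4   # matches the figure: 4 generals, 1 traitor
```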

[7.9] Groups and Failure Masking (3)


[Figure: the message exchange among four generals, of which general 3 is the faulty process; x, y, and z are the arbitrary values it reports.]

The Byzantine generals problem for 3 loyal generals and 1 traitor.


a. the generals announce their troop strengths (in thousands of soldiers),

b. the vectors that each general assembles based on (a),

c. the vectors that each general receives in step 3.

[7.10] Groups and Failure Masking (4)

[Figure: the same exchange for three generals, of which general 3 is the faulty process.]

The same as before, except now with 2 loyal generals and one traitor.

[7.11] Reliable Communication


So far concentrated on process resilience (by means of process groups). What
about reliable communication channels?
Error detection:
• framing of packets to allow for bit error detection,

• use of frame numbering to detect packet loss.


Error correction:

• add so much redundancy that corrupted packets can be automatically corrected,

• request retransmission of lost, or last N packets.

Most of this work assumes point-to-point communication.

[7.12] Reliable RPC (1)


What can go wrong during RPC?

1. client cannot locate server

2. client request is lost

3. server crashes

4. server response is lost

5. client crashes

Notes:

1: relatively simple - just report back to client,

2: just resend message,

3: server crashes are harder, as no one knows what the server had already done.

[7.13] Reliable RPC (2)


If the server crashes, no one knows what the server had already done. We need to decide what we expect from the server.

[Figure: three message sequence charts of a REQ/REP exchange between a client and a server.]


(a) normal case (b) crash after execution (c) crash before execution.

Possible different RPC server semantics:


• at-least-once-semantics: the server guarantees it will carry out an operation at least once, no matter what.

• at-most-once-semantics: the server guarantees it will carry out an operation at most once.

[7.14] Reliable RPC (3)

4: Detecting lost replies can be hard, because it can also be that the server had
crashed. You don’t know whether the server has carried out the operation.

Possible solution: None, except that one can try to make your operations
idempotent – repeatable without any harm done if it happened to be
carried out before.

5: Problem: The server is doing work and holding resources for nothing
(called doing an orphan computation).
Possible solutions:

– orphan killed (or rolled back) by client when it reboots,

– broadcasting a new epoch number when recovering ⇒ servers kill orphans,

– requiring computations to complete within T time units; old ones are simply removed.

[7.15] Reliable Multicasting (1)


Basic model: There is a multicast channel c with two (possibly overlapping)
groups:
• the sender group S ND(c) of processes that submit messages to channel c,
• the receiver group RCV(c) of processes that can receive messages from
channel c.

Simple reliability If process P ∈ RCV(c) at the time message m was submitted


to c, and P does not leave RCV(c), m should be delivered to P.


Atomic multicast How to ensure that a message m submitted to channel c is


delivered to process P ∈ RCV(c) only if m is delivered to all members of
RCV(c).

[7.16] Reliable Multicasting (2)


If one can stick to a local-area network, reliable multicasting is "easy".
Let the sender log messages submitted to channel c:
• if P sends message m, m is stored in a history buffer,

• each receiver acknowledges the receipt of m, or requests retransmission at


P when noticing message lost,

• sender P removes m from history buffer when everyone has acknowledged


receipt.
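The three steps above can be sketched as a sender object (a minimal illustrative Python sketch; class and method names are not part of any real protocol): messages stay in the history buffer until every known receiver has acknowledged them, and NACKs are answered by retransmitting from that buffer.

```python
class ReliableMulticastSender:
    """Sketch of the basic history-buffer scheme for reliable multicast."""

    def __init__(self, receivers):
        self.receivers = set(receivers)
        self.history = {}        # seqno -> message, kept until all ACK
        self.acked = {}          # seqno -> set of receivers that ACKed
        self.next_seq = 0

    def send(self, message, transmit):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = message          # store in history buffer
        self.acked[seq] = set()
        for r in self.receivers:
            transmit(r, seq, message)
        return seq

    def on_ack(self, receiver, seq):
        self.acked[seq].add(receiver)
        if self.acked[seq] == self.receivers:
            del self.history[seq]            # everyone ACKed: safe to drop
            del self.acked[seq]

    def on_nack(self, receiver, seq, transmit):
        transmit(receiver, seq, self.history[seq])   # retransmit from buffer
```

Note how the sketch exposes the two scaling problems named below: `on_ack` is called once per receiver per message, and the constructor must know the full receiver set.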

Why doesn’t the basic algorithm scale?


• if RCV(c) is large, P will be swamped with feedback (ACKs and NACKs),

• sender P has to know all members of RCV(c).

[7.17] Basic Reliable-Multicasting Schemes

[Figure: a sender with a history buffer multicasts message M25 to four
receivers; three have Last = 24, one has Last = 23 (it missed message #24).
(a) Transmission of M25. (b) Feedback: three receivers send ACK 25, while the
fourth reports "Missed 24".]


A simple solution to reliable multicasting when all receivers are known and are
assumed not to fail: (a) message transmission and (b) reporting feedback.

[7.18] Scalable RM: Feedback Suppression

Idea: Let a process P suppress its own feedback when it notices another
process Q is already asking for a retransmission.
Assumptions:

• all receivers listen to a common feedback channel to which feedback mes-


sages are submitted,

• process P schedules its own feedback message randomly, and suppresses


it when observing another feedback message.

• random schedule needed to ensure that only one feedback message is even-
tually sent.

[Figure: four receivers schedule NACK timers T=3, T=4, T=1, T=2; the receiver
with T=1 fires first and multicasts its NACK on the feedback channel, the
others suppress theirs, so the sender receives only one NACK.]
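The scheme can be simulated in a few lines. This is an illustrative Python sketch under the assumptions stated above (a shared feedback channel, random back-off); the class and timer values are invented for illustration.

```python
import random

class Receiver:
    """Feedback-suppression sketch: schedule a NACK at a random delay and
    cancel it if another receiver's NACK is observed first."""

    def __init__(self, name):
        self.name = name
        self.nack_due = None     # scheduled time of our NACK, or None

    def detect_loss(self, now):
        self.nack_due = now + random.uniform(0, 4)   # random back-off

    def observe_nack(self):
        self.nack_due = None                         # suppress our own NACK

def run(receivers):
    # All receivers detect the loss at t=0; the earliest timer fires, its
    # NACK is multicast on the feedback channel, and the rest suppress.
    for r in receivers:
        r.detect_loss(0)
    first = min(receivers, key=lambda r: r.nack_due)
    for r in receivers:
        if r is not first:
            r.observe_nack()
    return [r for r in receivers if r.nack_due is not None]
```

Whatever the timer values, only one NACK survives; this is exactly why the random schedule matters — with identical deterministic timers, all NACKs would fire simultaneously.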

[7.19] Scalable RM: Hierarchical Solutions

Idea: Construct a hierarchical feedback channel in which all submitted messages


are sent only to the root. Intermediate nodes aggregate feedback messages before
passing them on.

Main challenge: Dynamic construction of the hierarchical feedback channels.


[Figure: a sender S at the root of a tree of local-area networks; each network
has a coordinator C connected by a (long-haul) connection, and coordinators
forward to local receivers R, aggregating feedback on the way back up.]

[7.20] Atomic Multicast


Idea: Formulate reliable multicasting in the presence of process failures in terms
of process groups and changes to group membership.
Guarantee: A message is delivered only to the non-faulty members of the
current group. All members should agree on the current group membership.
Keyword: Virtually synchronous multicast.

[Figure: processes P1–P4 over time. P1 joins the group (G = {P1,P2,P3,P4});
reliable multicast is done by multiple point-to-point messages; P3 crashes
(G = {P1,P2,P4}) and its partial multicast is discarded; later P3 rejoins
(G = {P1,P2,P3,P4}).]

[7.21] Virtual Synchrony (1)


[Figure: a message comes in from the network to the local OS, is received by
the communication layer, and is finally delivered to the application.]

The logical organization of a distributed system to distinguish between message receipt and message delivery.

[7.22] Virtual Synchrony (2)


Idea: We consider views V ⊆ RCV(c) ∪ SND(c).
Processes are added to or deleted from a view V through view changes to V∗. A
view change is to be executed locally by each P ∈ V ∩ V∗.

1. for each consistent state, there is a unique view on which all its members
agree. Note: implies that all non-faulty processes see all view changes in
the same order,

2. if message m is sent to V before a view change vc to V∗, then either all
P ∈ V that execute vc receive m, or no process P ∈ V that executes vc
receives m. Note: all non-faulty members in the same view get to see the
same set of multicast messages,

3. a message sent to view V can be delivered only to processes in V, and is


discarded by successive views.

A reliable multicast algorithm satisfying 1. – 3. is virtually synchronous.

[7.23] Virtual Synchrony (3)


A sender to a view V need not be member of V,


If a sender S ∈ V crashes, its multicast message m is flushed before S is removed
from V: m will never be delivered after the point that S ∉ V.
Note: Messages from S may still be delivered to all, or none, of the (non-faulty)
processes in V before they all agree on a new view to which S does not belong.
If a receiver P fails, a message m may be lost but can be recovered as we know
exactly what has been received in V. Alternatively, we may decide to deliver m
to members in V − P
Observation: Virtually synchronous behavior can be seen independent from the
ordering of message delivery. The only issue is that messages are delivered to
an agreed upon group of receivers.

[7.24] Virtually Synchronous Reliable Multicasting

Different versions of virtually synchronous reliable multicasting.

[7.25] Implementing Virtual Synchrony

[Figure: eight processes 0–7 in three stages (a)–(c): an unstable message, a
view change, flush messages, and installation of the new view, matching the
steps listed below.]

a. process 4 notices that process 7 has crashed and sends a view change.

b. process 6 sends out all its unstable messages, followed by a flush message.


c. process 6 installs the new view when it has received a flush message from
everyone else.

[7.26] Distributed Commit

• Two-phase commit (2PC)

• Three-phase commit (3PC)

Essential issue: Given a computation distributed across a process group, how


can we ensure that either all processes commit to the final result, or none of
them do (atomicity)?

[7.27] Two-Phase Commit (1)


Model: The client who initiated the computation acts as coordinator; processes
required to commit are the participants.

Phase 1a Coordinator sends VOTE_REQUEST to participants (also called a


pre-write).

Phase 1b When participant receives VOTE_REQUEST it returns either YES or


NO to coordinator. If it sends NO, it aborts its local computation.

Phase 2a Coordinator collects all votes; if all are YES, it sends COMMIT to
all participants, otherwise it sends ABORT.

Phase 2b Each participant waits for COMMIT or ABORT and handles accordingly.
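The coordinator side of these phases can be sketched in Python (an illustrative sketch only — message names follow the text, but the class, callbacks and string encoding are invented; real implementations also need timeouts and logging):

```python
class Coordinator:
    """2PC coordinator sketch following the states in the text
    (INIT, WAIT, COMMIT, ABORT); `send(participant, msg)` is a
    caller-supplied transport callback."""

    def __init__(self, participants):
        self.participants = participants
        self.state = "INIT"
        self.votes = {}

    def start(self, send):
        # Phase 1a: ask everyone to vote.
        self.state = "WAIT"
        for p in self.participants:
            send(p, "VOTE_REQUEST")

    def on_vote(self, participant, vote, send):
        # Phase 2a: decide once a NO arrives or all votes are in.
        self.votes[participant] = vote
        if vote == "NO":
            self.state = "ABORT"
            for p in self.participants:
                send(p, "GLOBAL_ABORT")
        elif len(self.votes) == len(self.participants):
            self.state = "COMMIT"
            for p in self.participants:
                send(p, "GLOBAL_COMMIT")
```

A single NO vote is enough to abort, and the coordinator blocks in WAIT until every vote has arrived — which is precisely the window in which a coordinator crash leaves participants stuck, as discussed for 3PC below.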

[7.28] Two-Phase Commit (2)

[Figure: (a) coordinator FSM: INIT --Commit / send Vote-request--> WAIT;
WAIT --Vote-abort / send Global-abort--> ABORT;
WAIT --Vote-commit / send Global-commit--> COMMIT.
(b) participant FSM: INIT --Vote-request / send Vote-abort--> ABORT, or
INIT --Vote-request / send Vote-commit--> READY;
READY --Global-abort / ACK--> ABORT;
READY --Global-commit / ACK--> COMMIT.]


a. the finite state machine for the coordinator in 2PC,

b. the finite state machine for a participant.

[7.29] 2PC – Failing Participant (1)


Consider participant crash in one of its states, and the subsequent recovery to
that state:

initial state no problem, as participant was unaware of the protocol,

ready state participant is waiting to either commit or abort. After recovery,


participant needs to know which state transition it should make → log the
coordinator’s decision,

abort state merely make entry into abort state idempotent, e.g., removing the
workspace of results,

commit state also make entry into commit state idempotent, e.g., copying workspace
to storage.

When distributed commit is required, having participants use temporary workspaces


to keep their results allows for simple recovery in the presence of failures.

[7.30] 2PC – Failing Participant (2)


Alternative: When a recovery is needed to the Ready state, check what the other
participants are doing. This approach avoids having to log the coordinator’s
decision.
Assume recovering participant P contacts another participant Q:


Result: If all participants are in the ready state, the protocol blocks. Apparently,
the coordinator is failing.

[7.31] 2PC – Coordinator

[7.32] 2PC – Participant


[7.33] 2PC – Handling Decision Requests

Actions for handling decision requests executed by separate thread.

[7.34] Three-Phase Commit (1)


Problem: with 2PC when the coordinator crashed, participants may not be able
to reach a final decision and may need to remain blocked until the coordinator
recovers.
Solution: three-phase commit protocol (3PC). The states of the coordinator
and each participant satisfy the following conditions:


• there is no single state from which it is possible to make a transition


directly to either a COMMIT or ABORT state,
• there is no state in which it is not possible to make a final decision, and
from which a transition to a COMMIT state can be made.

Note: not often applied in practice as the conditions under which 2PC blocks
rarely occur.
[7.35] Three-Phase Commit (2)

Phase 1a Coordinator sends VOTE_REQUEST to participants.

Phase 1b When participant receives VOTE_REQUEST it returns either YES or NO to coordinator. If it sends NO, it aborts its local computation.

Phase 2a Coordinator collects all votes; if all are YES, it sends PREPARE to all participants, otherwise it sends ABORT, and halts.

Phase 2b Each participant waits for PREPARE, or waits for ABORT after which it halts.

Phase 3a (Prepare to commit) Coordinator waits until all participants have ACKed receipt of the PREPARE message, and then sends COMMIT to all.

Phase 3b (Prepare to commit) Participant waits for COMMIT.

[7.36] Three-Phase Commit (3)

[Figure: (a) coordinator FSM for 3PC: INIT --Commit / Vote-request--> WAIT;
WAIT --Vote-abort / Global-abort--> ABORT;
WAIT --Vote-commit / Prepare-commit--> PRECOMMIT;
PRECOMMIT --Ready-commit / Global-commit--> COMMIT.
(b) participant FSM: INIT --Vote-request / Vote-abort--> ABORT, or
--Vote-request / Vote-commit--> READY;
READY --Global-abort / ACK--> ABORT;
READY --Prepare-commit / Ready-commit--> PRECOMMIT;
PRECOMMIT --Global-commit / ACK--> COMMIT.]

a. finite state machine for the coordinator in 3PC,


b. finite state machine for the participant.


Chapter 8

Distributed File System

[8.1] Distributed File System

1. Sun Network File System

2. The Coda File System

3. Plan 9: Resources Unified to Files

[8.2] Network File System (NFS)


NFS, basic idea: each file server provides a standardized view of its local file
system.
History of NFS:

• the 1st version internal to Sun,

• the 2nd version incorporated into SunOS 2.0,

• the 3rd (current) version – now undergoing major revisions.

NFS – not so much a true file system but a collection of protocols.

[8.3] NFS Architecture (1)

125
CHAPTER 8. DISTRIBUTED FILE SYSTEM

[Figure: (a) the remote access model: the file stays on the server and the
client's requests access the remote file there. (b) the upload/download model:
1. the file is moved to the client, 2. accesses are done on the client,
3. when the client is done, the file is returned to the server.]

a. the remote access model,

b. the upload/download model.

[8.4] NFS Architecture (2)

[Figure: on both client and server, a system call layer sits on top of a
virtual file system (VFS) layer; on the client, the VFS dispatches to the
local file system interface or to the NFS client and its RPC client stub; on
the server, the RPC server stub feeds the NFS server, which uses the local
file system interface. Client and server communicate over the network.]

The basic NFS architecture for UNIX systems.

[8.5] NFS Features

• NFS largely independent of local file system,

• supports hard and symbolic links,

• files named, accessed by means of Unix-like file handles,


• version 4

– create used for creating non-regular files,


– regular files created by open,
– server generally maintains state between operations on the same file,
– lookup attempts to resolve the entire name, also if it means crossing
mount points,
– supports compound procedures.

[8.6] File System Model

An incomplete list of file system operations supported by NFS.

[8.7] Communication


[Figure: (a) NFS version 3: separate LOOKUP (lookup name) and READ (read file
data) round trips. (b) NFS version 4: one compound procedure carries LOOKUP,
OPEN and READ, so the server looks up the name, opens the file and reads file
data in a single exchange.]

a. Reading data from a file in NFS version 3.

b. Reading data using a compound procedure in version 4.

[8.8] Stateless vs. Stateful Server

• NFS version 3:

– simplicity as the main advantage of the stateless approach,


– locking a file cannot be easily done,
– certain authentication protocols require maintaining state of clients.

• NFS version 4:

– expected to work across wide area network,


– clients can make effective use of caches requiring cache consistency
protocol,
– support for callback procedures by which a server can do an RPC to
a client.

[8.9] NFS - Naming (1)


[Figure: the server exports a directory (users, containing steen/mbox);
client A mounts it under remote/vu, client B under work/me, so the same
remote file mbox has a different path name on each client.]
Mounting (part of) a remote file system in NFS.

[8.10] NFS - Naming (2)

[Figure: server A exports a directory containing packages/draw, having itself
imported the subdirectory install from server B. The client imports the
directory from server A, but must explicitly import the subdirectory from
server B as well: an exported directory that contains an imported
subdirectory is not mounted transitively.]

[8.11] Automounting (1)


[Figure: 1. a lookup of "/home/alice" is intercepted by the automounter
through the local file system interface; 2. the automounter creates the
subdirectory "alice"; 3. it issues a mount request to the server; 4. the
server's subdirectory is mounted on "alice".]
A simple automounter for NFS.

[8.12] Automounting (2)

[Figure: the automounter mounts directories under /tmp_mnt/home and makes
/home/alice a symbolic link to "/tmp_mnt/home/alice".]


Whenever the command ls -l /home/alice is executed, the NFS server is contacted directly without involvement of the automounter.

[8.13] File Attributes

Some general mandatory (a) and recommended (b) file attributes in NFS.
Moreover one may have named attributes – an array of pairs (attribute, value).

[8.14] Semantics of File Sharing (1)


[Figure: (a) on a single machine with original file "ab": process A writes "c"
(1), and process B's subsequent read gets "abc" (2). (b) in a distributed
system with caching: client machine #1 reads "ab" (1) and writes "c" into its
cached copy (2), while a read on client machine #2 still gets "ab" from the
server (3).]

• On a single processor, when a read follows a write, the value returned by


the read is the value just written.

• In a distributed system with caching, obsolete values may be returned.

[8.15] Semantics of File Sharing (2)

Four ways of dealing with the shared files in a distributed system.

• NFS implements session semantics.


[8.16] File Locking in NFS

NFS version 4 operations related to file locking.

• v4: file locking integrated into file access protocol,

• lock failed ⇒

– error message and polling or


– client can request to be put on a FIFO-ordered list maintained by the
server (and still polling).

[8.17] Client Caching (1)

[Figure: the client application uses a memory cache and a disk cache on the
client machine; cache misses go over the network to the NFS server.]

Client-side caching in NFS.

[8.18] Client Caching (2)


[Figure: 1. the client asks for a file; 2. the server delegates the file and
the client keeps a local copy; 3. the server recalls the delegation;
4. the client returns the (updated) file.]
Using the NFS version 4 callback mechanism to recall file delegation.


• open delegation takes place when the client machine is allowed to locally
handle open and close operations from other clients on the same machine,

• recalling delegation requires callback support,

• NFS uses leases on cached attributes, file handles and directories.

[8.19] RPC Failures

[Figure: three timelines for a retransmitted request with XID = 1234:
(a) the original request is still being processed when the retransmission
arrives; (b) the reply has just been returned and cached; (c) the reply was
returned some time ago but was lost. In each case the server can answer the
retransmission from its reply cache instead of re-executing the request.]

Three situations for handling retransmissions (XID = transaction identifier).

a. the request is still in progress,

b. the reply has just been returned,


c. the reply was returned some time ago, but was lost.
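The XID-based duplicate handling behind all three cases can be sketched in a few lines. This is an illustrative Python sketch (the class, the `upper()` stand-in operation and the counter are invented; a real NFS server also bounds and ages its cache):

```python
class RpcServer:
    """Sketch of XID-based duplicate filtering: replies are cached per
    transaction identifier, so a retransmission is answered from the
    cache without re-executing the operation."""

    def __init__(self):
        self.reply_cache = {}   # XID -> cached reply
        self.executions = 0     # counts real executions, for illustration

    def handle(self, xid, request):
        if xid in self.reply_cache:          # retransmission: resend reply
            return self.reply_cache[xid]
        self.executions += 1                 # fresh request: execute once
        reply = request.upper()              # stand-in for the real work
        self.reply_cache[xid] = reply
        return reply
```

The cache makes a non-idempotent operation behave idempotently toward the client: however many times XID 1234 is retransmitted, the operation runs once.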

[8.20] File Locking in the Presence of Failures


If the server crashes and subsequently recovers, then:

• grace period:

– a client can reclaim locks that were previously granted to it,


– normal lock requests may be refused until the grace period is over.

Notice: leasing requires synchronization of client’s and server’s clocks.

[8.21] Security

[Figure: NFS version 3 security: client and server each run a virtual file
system layer with access control; the NFS client's RPC client stub talks to
the NFS server's RPC server stub over a secure channel.]

The NFS security architecture (version 3).

• system authentication,

• Diffie-Hellman key exchange (a public-key cryptosystem), but with only 192-bit keys in NFS,

• Kerberos.

[8.22] Secure RPCs


[Figure: NFS version 4 secure RPC: on both client and server machines, the
RPC stub uses RPCSEC_GSS on top of the GSS-API, under which Kerberos, LIPKEY
or other security mechanisms can be plugged in.]

Secure RPC in NFS version 4 (GSS - general security framework):

• LIPKEY - a public key system,

• clients to be authenticated using passwords,

• servers can be authenticated using a public key.

[8.23] Access Control


The classification of operations recognized by NFS with respect to access control.

[8.24] Users/ Processes by Access Control

The various kinds of users and processes distinguished by NFS with respect to
access control.

[8.25] The Coda File System

• developed at Carnegie Mellon University, main goal: high availability,

• advanced caching allows a client to continue operation despite being disconnected from a server,


• descendant of version 2 of the Andrew File System (AFS),

• Vice file servers and Virtue workstations with Venus processes,

• both Vice file server processes and Venus processes run as user-level processes,

• a user-level RPC system on top of UDP providing at-most-once semantics,

• trusted Vice machines run authentication servers,

• Coda appears as a traditional UNIX-based file system.

[8.26] Overview of Coda (1)

[Figure: the overall organization of AFS: Virtue clients have transparent
access to a shared collection of Vice file servers.]
The overall organization of AFS.

[8.27] Overview of Coda (2)


[Figure: inside a Virtue client machine, user processes and the Venus process
sit above the virtual file system layer in the local OS; Venus talks to Vice
servers through an RPC client stub over the network, while local files go
through the local file system interface.]

The internal organization of a Virtue workstation.

[8.28] Coda - communication

• RPC2, a system different from the ONC RPC used by NFS,

• offers reliable communication on top of the UDP protocol,

• a thread per RPC request,

• back messages are regularly sent by the server to the client,

• support for side effects – mechanisms for communication using application-specific protocols,

• support for multicasting; parallel RPC implemented by means of MultiRPC, fully transparent to callees,

• threads in Coda non-preemptive and entirely in user space,

• a separate thread handles all I/O operations, with low-level asynchronous I/O emulating synchronous I/O without blocking an entire process.

[8.29] Communication (1)


[Figure: the client application and the server communicate through RPC client
and server stubs over the RPC protocol, while client-side and server-side
"side effect" components exchange data directly over an application-specific
protocol.]

Side effects in Coda’s RPC2 system.

[8.30] Communication (2)

[Figure: (a) the server sends an invalidation message to each client one at a
time, waiting for each reply; (b) the server sends invalidation messages to
all clients in parallel and then collects the replies.]

a. sending an invalidation message one at a time,

b. sending invalidation messages in parallel.

[8.31] Naming


[Figure: naming is inherited from the server's name space: clients A and B
both mount the exported directory (with bin and pkg) under /afs, so all
clients see the same shared name space.]

Clients in Coda have access to a single shared name space.

[8.32] Volumes and File Identifiers

• volumes,

• only root nodes can act as mounting points,

• shared name space,

• file identifiers,

• RVID – replicated volume identifier,

• VID – volume identifier,

• volume replication database,

• volume location database,

• 64-bit handle identifying the file within the volume.

[8.33] File Identifiers


[Figure: resolving a Coda file identifier (RVID, file handle): the RVID is
looked up in the volume replication database, yielding volume identifiers
VID1 and VID2; each VID is looked up in the volume location database to find
the file server holding that volume; the file handle then identifies the file
within the volume at that server.]

The implementation and resolution of a Coda file identifier.

[8.34] Sharing Files in Coda

[Figure: one client opens file f for reading (session S_A) while another
opens it for writing (session S_B); when the writer closes, the server sends
an invalidate to the reader, giving the transactional behavior of sharing
files in Coda.]

The transactional behavior in sharing files in Coda.


[8.35] Transactional Semantics

• partition - part of the network isolated from the rest,

• recognition of different types of session (like the store session type),

• usage of versioning scheme,

• an update from a client is accepted only when it leads to the next version of the file,

• when a conflict occurs, the updates from the client’s session are undone and the client is forced to save its local version of the file for manual reconciliation,

• cache coherence maintained by means of callbacks,

• callback promise,

• callback break.

[8.36] Client Caching

[Figure: client A's first Open(RD) transfers file f from the server; client
B's write sessions (S_B) cause the server to send an invalidate (callback
break), so A's next Open(RD) transfers the file again; when no callback break
has occurred, a later open is answered "OK (no file transfer)" from the local
copy.]

The use of local copies when opening a session in Coda.

[8.37] Server Replication

• file servers may be replicated,

• Volume Storage Group (VSG),


• client’s Accessible VSG (AVSG),

• if the AVSG is empty, the client is said to be disconnected,

• consistency: Read-One, Write-All (ROWA),

• optimistic strategy for file replication,

• version vectors for conflicts detection.
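Version-vector conflict detection can be sketched directly. An illustrative Python sketch (function and server names invented): a vector maps each server in the VSG to the number of updates it has seen; a conflict is exactly the case where neither vector dominates the other, which arises when clients in different partitions updated the same file.

```python
def compare(v1, v2):
    """Compare two version vectors (server -> update count).
    Returns 'equal', 'first-newer', 'second-newer', or 'conflict'."""
    servers = set(v1) | set(v2)
    dom1 = all(v1.get(s, 0) >= v2.get(s, 0) for s in servers)
    dom2 = all(v2.get(s, 0) >= v1.get(s, 0) for s in servers)
    if dom1 and dom2:
        return "equal"
    if dom1:
        return "first-newer"
    if dom2:
        return "second-newer"
    return "conflict"   # concurrent updates in different partitions
```

When the partition heals, comparable vectors can be reconciled automatically (the dominated replica is brought up to date); an incomparable pair is the case Coda must hand to the user for manual reconciliation.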

[8.38] Server Replication

[Figure: servers S1, S2 and S3 replicate a file; the network between them is
broken, so client A's AVSG is {S1, S2} while client B's AVSG is {S3}.]

Two clients with different AVSG for the same replicated file.

[8.39] Coda - Hoarding

• hoarding – filling the cache in advance with the appropriate files,

• priority mechanism to ensure caching of useful data:

– user may store paths in hoard database (one per workstation),


– priority for each file based on the hoard database and last references,

• hoard walk invoked once every 10 minutes,

• cache in equilibrium, if:

– no uncached file with a higher priority,


– cache full or no uncached files with nonzero priority,


– each cached file is a copy of the one from client’s AVSG.

• anyway no guarantee.

[8.40] Disconnected Operation

[Figure: HOARDING --disconnection--> EMULATION --reconnection-->
REINTEGRATION --reintegration completed--> HOARDING; a disconnection during
reintegration returns the client to EMULATION.]

The state-transition diagram of a Coda client with respect to a volume.

• http://www.coda.cs.cmu.edu/

[8.41] Access Control


Classification of file and directory operations recognized by Coda with respect to access control.

• also: useful support for the listing of negative rights.

[8.42] Plan 9

• bringing back the idea of having a few centralized servers and numerous
client machines,

• developed by the Unix team at Bell Labs,


• file-based distributed system,
• all resources accessed in the same way (as files), including processes and
network interfaces,
• each server offers a hierarchical name space to the resources it controls,
• communication through the protocol 9P, tailored to file-oriented opera-
tions,

• for LAN Internet Link (IL) reliable datagram protocol, TCP for WAN.

[8.43] Plan 9: Resources Unified to Files

[Figure: general organization of Plan 9: a gateway (name space NS3, with a
network interface to the Internet), a file server (NS1) and a CPU server
(NS2, running processes); each client mounts the name spaces it needs — one
client, for instance, has mounted NS1 and NS2.]


General organization of Plan 9.

[8.44] Communication

Files associated with a single TCP connection in Plan 9.

• opening a telnet connection requires writing a special string such as "connect 192.31.231.42!23" to the ctl file.

[8.45] Processes

[Figure: the Plan 9 file server machine has an in-memory cache and a disk
cache in front of a WORM device; only the WORM contains the actual file
system.]
The Plan 9 file server.

[8.46] Resource Management

• let /net/inet denote the network interface,

• if M exports /net, a client can use M as a gateway by locally mounting


/net and subsequently opening /net/inet.


• multiple name spaces can be mounted at the same mount point, leading to a union directory,

• file systems appear to be Boolean or-ed,

• mounting order is important.

• Plan 9 implements UNIX file sharing semantics,

• all update operations always forwarded to the server.

[8.47] Naming

[Figure: a union directory in Plan 9: /remote merges file systems FS A and
FS B, so directories such as bin, src, lib, home and usr from both appear
under one mount point.]

A union directory in Plan 9.

• http://cm.bell-labs.com/plan9/

• http://www.vitanuova.com/inferno/

Chapter 9

Naming

[9.1] Names - Introduction


Usage of names:

• to share resources,

• to uniquely identify entities,

• to refer to locations.

A name can be resolved to the entity it refers to.


Name – a string of bits or characters that is used to refer to an entity.

• typical entities: hosts, printers, disks, files, processes, users, mailboxes,


newsgroups, Web pages, messages, network connections,

• an access point – special entity to access another entity,

• an address – the name of an access point,

• the address of an entity access point simply called an address of the entity.

[9.2] Names, Identifiers, Addresses

• if an entity offers more than one access point, it is not clear which address to use as a reference,

• location independent names,

149
CHAPTER 9. NAMING

An identifier – a name that is used to uniquely identify an entity.


An identifier has the following properties:

1. An identifier refers to at most one entity.

2. Each entity is referred to by at most one identifier.

3. An identifier always refers to the same entity (it is never reused).

Remarks:

• identifiers make it much easier to refer unambiguously to entities,

• human-friendly names in contrast to addresses and identifiers.

[9.3] Name Spaces (1)

• names in distributed systems organized into name spaces,

• name space may be represented as a labeled, directed graph,

– a leaf node represents a named entity,


– a directory node stores a table representing its outgoing edges as pairs (edge label, node identifier) – the directory table,

• root node of the naming graph,

• path name: N:<label-1, label-2, ..., label-n>,

• absolute path name (starts with root) vs. relative path name,

• global name – denotes the same entity in the whole system,

• local name – with interpretation depending on where the name is being


used.

[9.4] Name Spaces (2)


[Figure: a naming graph with root node n0 (edges home → n1 and keys → n5);
directory node n1 stores the table n2: "elke", n3: "max", n4: "steen";
node n4 has leaf entries .twmrc and mbox and an edge keys → n5, so both
"/keys" and "/home/steen/keys" refer to n5, while "/home/steen/mbox" names
the mbox leaf.]

A general naming graph with a single root node

• n5 can be referred to by /home/steen/keys as well as /keys,

• the idea of directed acyclic graph,

• in Plan9 all resources (processes, hosts, I/O devices, network interfaces)


named as files – single naming graph for all resources in a distributed
system,
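Resolution over such a graph is just repeated directory-table lookup. The following illustrative Python sketch encodes the graph from the figure above (node and edge names taken from it; the leaf identifier for mbox is invented):

```python
def lookup(graph, start, path):
    """Naming-graph lookup sketch: `graph` maps a directory node id to its
    directory table of (edge label -> node id) pairs."""
    node = start
    for label in path:
        node = graph[node][label]   # follow one labeled edge per step
    return node

# The naming graph from the figure: root n0, directory n1, etc.
GRAPH = {
    "n0": {"home": "n1", "keys": "n5"},
    "n1": {"elke": "n2", "max": "n3", "steen": "n4"},
    "n4": {"keys": "n5", "mbox": "mbox-leaf"},   # leaf id is illustrative
}
```

Both path names for n5 resolve to the same node, which is the point the figure makes: a node may be reachable via several absolute path names.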

[9.5] Name Resolution


Name resolution – the process of looking up a name, given a path name.

• a name lookup returns the identifier of a node from where the name resolution process continues,

• closure mechanism – knowing how and where to start name resolution,

– Unix file system: the inode of the root directory is the first inode in
the logical disk,
– ”000312345654” not recognizable as string, but recognizable as a
phone number,

• alias another name for the same entity,

• hard links versus symbolic links.

[9.6] Linking and Mounting (1)


[Figure: the same naming graph, but /home/steen/keys now leads to leaf node
n6, whose stored data is the string "/keys"; resolving the symbolic link
continues with that absolute path name and ends at n5.]

The concept of a symbolic link explained in a naming graph.

[9.7] Linking and Mounting (2)

• mount point – the directory node storing the node identifier,

• mounting point – the directory node in the foreign name space.

To mount a foreign name space in a distributed system the following information is required:

1. The name of an access protocol.

2. The name of the server.

3. The name of the mounting point in the foreign name space.

Remarks:

• each of these names has to be resolved,

• NFS as an example.

[9.8] Linking and Mounting (3)


[Figure: machine A's name server holds remote/vu as a mount point storing the
reference "nfs://flits.cs.vu.nl//home/steen"; resolving it crosses over the
network to machine B's name server for the foreign name space, where
home/steen contains mbox and keys.]

Mounting remote name spaces through a specific access protocol.

[9.9] Linking and Mounting (4)

• in DEC's GNS (Global Name Service) a new root is added, making all existing root nodes its children,

• names in GNS always (implicitly) include the identifier of the node from
where resolution should normally start,

• /home/steen/keys in NS1 expanded to n0:/home/steen/keys,

• hidden expansion,

• assumed that node identifiers universally unique,

• therefore: all nodes have different identifiers.

[9.10] Linking and Mounting (5)


[Figure: two name spaces NS1 (root n0, with home/elke, max, steen) and NS2
(root m0) are combined under a new GNS root with children vu and oxford;
expanded names such as "n0:/home/steen/keys" and "m0:/mbox" include the
identifier of the node where resolution should start.]

Organization of the DEC Global Name Service.

[9.11] Name Space Distribution (1)

• name space distribution,

• large scale name spaces partitioned into logical layers

– global layer,
– administrational layer,
– managerial layer.

• the name space may be divided into zones,

• a zone is a part of the name space that is implemented by a separate name server,

• availability and performance requirements are met through caching and replication.

[9.12] Name Space Distribution (2)


[Figure: the DNS name space partitioned into a global layer (top-level and
organization domains such as gov, mil, org, net, com, edu, jp, us, nl, sun,
yale, acm, ieee, vu, oce, keio, nec), an administrational layer (departments
such as cs, csl, eng, ai) and a managerial layer (hosts and files such as
ftp, www, pc24, robot, pub/globe/index.txt); a zone is indicated as a subtree
handled by one name server.]

An example partitioning of the DNS name space, including Internet-accessible files, into three layers.

[9.13] Name Space Distribution (3)

A comparison between name servers for implementing nodes from a large-scale name space partitioned into a global layer, an administrational layer, and a managerial layer.

[9.14] Implementation of Name Resolution (1)


Each client has access to a local name resolver:

• ftp://ftp.cs.vu.nl/pub/globe/index.txt


• root:<nl, vu, cs, ftp, pub, globe, index.txt>

Iterative name resolution:

• the name resolver hands the complete name to the root name server, but the root resolves only nl and returns the address of the associated name server,

• caching is restricted to the client’s name resolver; as a compromise, an intermediate name server can be shared by all clients.

Recursive name resolution:

• a name server passes the result to the next name server it finds,

• drawback: puts a higher performance demand on each name server,

• caching of results is more effective compared to iterative name resolution,

• communication costs may be reduced.
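The iterative variant can be sketched as a resolver loop. This is an illustrative Python sketch (server names and referral encoding invented): the client's resolver keeps control, each server resolves only the first remaining label and returns a referral to the next server, and the last server returns the address.

```python
def resolve_iterative(path, servers, root):
    """Iterative resolution sketch: `servers` maps a name server id to its
    table (label -> next server id, or the final address for the last
    label). Returns the address and the list of servers contacted."""
    contacted = []
    server, remaining = root, list(path)
    while True:
        contacted.append(server)
        result = servers[server][remaining[0]]   # server resolves one label
        remaining = remaining[1:]
        if not remaining:
            return result, contacted             # result is the address
        server = result                          # referral to next server
```

Counting `contacted` makes the communication cost visible: one round trip from the client per label, which is the cost the figure below compares against the recursive variant.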

[9.15] Implementation of Name Resolution (2)

[Figure: iterative resolution of <nl,vu,cs,ftp>: 1. the client's name
resolver sends <nl,vu,cs,ftp> to the root name server; 2. it returns #<nl>
and the name server for the nl node; 3–8. the resolver repeats with
<vu,cs,ftp>, <cs,ftp> and <ftp> against the nl, vu and cs name servers,
obtaining #<vu>, #<cs> and #<ftp>; the last nodes are managed by the same
server.]

The principle of iterative name resolution

[9.16] Implementation of Name Resolution (3)


[Figure: recursive resolution of <nl,vu,cs,ftp>: the client's name resolver
sends the full name to the root (1); the root forwards <vu,cs,ftp> to the nl
name server (2), which forwards <cs,ftp> to the vu server (3), which forwards
<ftp> to the cs server (4); the results #<ftp>, #<cs,ftp>, #<vu,cs,ftp> and
#<nl,vu,cs,ftp> propagate back (5–8).]

The principle of recursive name resolution.

[9.17] Implementation of Name Resolution (4)

Recursive name resolution of <nl, vu, cs, ftp>. Name servers cache intermediate
results for subsequent lookups.

[9.18] Implementation of Name Resolution (5)


[Figure: with recursive resolution the client performs a single long-distance exchange (R1) with the first name server, which communicates with the vu and cs name servers over shorter distances (R2, R3); with iterative resolution the client performs a long-distance exchange (I1, I2, I3) with each name server in turn.]

The comparison between recursive and iterative name resolution with respect to
communication costs.

[9.19] The DNS Name Space

The most important types of resource records forming the contents of nodes in
the DNS name space.

[9.20] DNS Implementation (1)


An excerpt from the DNS database for the zone cs.vu.nl.

[9.21] DNS Implementation (2)

Part of the description for the vu.nl domain which contains the cs.vu.nl domain.

[9.22] X.500


• directory service – special kind of naming service in which a client can look for an entity based on a description of properties instead of a full name,

• OSI X.500 directory service,

• Directory Information Base (DIB) – the collection of all directory entries in an X.500 directory service,

• each record in DIB uniquely named,

• unique name as a sequence of naming attributes,

• each attribute called a Relative Distinguished Name (RDN),

• Directory Information Tree (DIT) – a hierarchy of the collection of directory entries,

• DIT forms a naming graph in which each node represents a directory entry,

• each node in DIT may act as a directory in the traditional sense.

[9.23] The X.500 Name Space (1)

A simple example of an X.500 directory entry using X.500 naming conventions.

[9.24] The X.500 Name Space (2)


[Figure: part of the directory information tree: C=NL → O=Vrije Universiteit → OU=Math. & Comp. Sc. → CN=Main server, with leaves Host_Name=star and Host_Name=zephyr.]

Part of the directory information tree.

[9.25] X.500 Implementation

• DIT usually partitioned and distributed across several servers, known as Directory Service Agents (DSA),

• each DSA implements advanced search operations,

• clients represented by Directory User Agents (DUA),

• example: list of all main servers at the VU:

– answer=search(&(C=NL)(O=Vrije Universiteit)(OU=*)(CN=Main server))

• searching is generally an expensive operation.


[9.26] GNS Names


( /... or /.: ) + ( X.500 or DNS name ) + ( local name )

• /.../ENG.IBM.COM.US/nancy/letters/to/lucy

• /.../Country=US/OrgType=COM/OrgName=IBM/Dept=ENG/nancy/letters/to/lucy

• /.:/nancy/letters/to/lucy

[9.27] LDAP

• Lightweight Directory Access Protocol (LDAP),

• an application-level protocol implemented on top of TCP,

• LDAP servers as specialized gateways to X.500 servers,

• parameters of lookup and update simply passed as strings,

• LDAP contains:

– defined security model based on SSL,


– defined API,
– universal data exchange format, LDIF,

• http://www.openldap.org

[9.28] Naming versus Locating Entities (1)


Three types of names distinguished:

• human-friendly names,

• identifiers,

• addresses.


What happens if a machine ftp.cs.vu.nl is to move to ftp.cs.unisa.edu.au?

• recording the address of the new machine in the DNS for cs.vu.nl: whenever it moves again, its entry in the DNS for cs.vu.nl has to be updated as well,

• recording the name of the new machine in the DNS for cs.vu.nl: the lookup operation becomes less efficient.

Better solution: separate naming from locating entities by introducing identifiers.

[9.29] Naming versus Locating Entities (2)

[Figure: (a) names map directly to addresses; (b) names map through a naming service to an entity ID, and the ID maps through a location service to the current addresses.]

a. direct, single level mapping between names and addresses,

b. two-level mapping using identifiers.

• locating an entity is handled by means of a separate location service.
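The two-level mapping in (b) can be sketched in a few lines (a minimal model; the host name, identifier and addresses are illustrative). A move updates only the location service, and every name bound to the identifier keeps working:

```python
naming = {"ftp.cs.vu.nl": "id-42"}        # name -> stable entity ID (naming service)
location = {"id-42": {"130.37.24.11"}}    # ID -> current addresses (location service)

def lookup(name):
    """Two-level resolution: name -> ID -> addresses."""
    return location[naming[name]]

def move(entity_id, new_addr):
    """Relocation touches only the location service; names are untouched."""
    location[entity_id] = {new_addr}

move("id-42", "129.127.8.5")              # the machine moved to another network
```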

[9.30] Location service implementations

1. simple solutions

• broadcasting and multicasting,


– ARP to find the data-link address given only an IP address
• forwarding pointers,
– when an entity moves, it leaves behind a reference to its new
location

163
CHAPTER 9. NAMING

2. home-Based approaches,
3. hierarchical approaches.

[9.31] Forwarding Pointers (1)

[Figure: processes P1–P4; the proxy p in process P2 refers to the same skeleton as an identical proxy in P3; in P4, a local invocation on proxy p reaches the object through interprocess communication with the skeleton in the object's process.]

The principle of forwarding pointers using (proxy, skeleton) pairs.


• skeletons act as entry items for remote references, proxies as exit items,
• whenever an object moves from A to B, a proxy is installed on A referring to a skeleton on B.

[9.32] Forwarding Pointers (2)

[Figure: (a) an invocation request follows the chain of forwarding pointers and is sent to the object; the skeleton at the object's current process returns the current location; (b) the client proxy sets a shortcut, after which the intermediate skeleton is no longer referenced by any proxy.]

Redirecting a forwarding pointer, by storing a shortcut in a proxy.
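The redirection can be sketched as pointer-chain traversal with shortcutting (a toy model of the chain only, not the proxy/skeleton machinery; the locations "A", "B", "C" are illustrative):

```python
# forward[x] is the location x forwards to; None marks the object's current home.
forward = {"A": "B", "B": "C", "C": None}   # object currently at C

def invoke(start):
    """Follow forwarding pointers to the object, then store a shortcut."""
    node = start
    while forward[node] is not None:        # walk the chain A -> B -> C
        node = forward[node]
    if node != start:
        forward[start] = node               # shortcut: start now points at C
    return node

loc = invoke("A")
```

After the first invocation the chain from A is one hop long, mirroring the figure's "client proxy sets a shortcut" step.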

[9.33] Home-Based Approaches

[Figure: 1. the client sends a packet to the host at its home location; 2. the home agent returns the address of the host's current location; 3. the packet is tunnelled to the current location; 4. successive packets are sent directly to the current location.]

The principle of Mobile IP.


• home location with a home agent in the home LAN and a fixed IP address,
• whenever the mobile host enters another network, it requests a temporary care-of address, which is then registered at the home agent.

[9.34] Hierarchical Approaches (1)

[Figure: the root directory node dir(T) is associated with the top-level domain T; a subdomain S of T (S is contained in T) has directory node dir(S); leaf domains are contained in S.]

Hierarchical organization of a location service into domains, each having an associated directory node.
[9.35] Hierarchical Approaches (2)

[Figure: node M holds a location record for entity E with one field per child domain dom(N) that contains E (empty fields carry no data); the leaf nodes of domains D1 and D2 hold location records with only one field, containing E's address in that domain.]

An example of storing information of an entity having two addresses in different leaf domains.
[9.36] Hierarchical Approaches (3)

[Figure: a look-up request for entity E starts at the directory node of leaf domain D; a node that has no record for E forwards the request to its parent; the first node M that knows about E forwards the request down to the child holding E's record.]

Looking up a location in a hierarchically organized location service.

[9.37] Hierarchical Approaches (4)

[Figure: (a) an insert request for entity E travels up from domain D until it reaches the first node M that knows about E, where forwarding stops; (b) each node on the path down creates a record and stores a pointer, the leaf node storing the address itself.]

1. An insert request is forwarded to the first node that knows about entity E.

2. A chain of forwarding pointers to the leaf node is created.

Chapter 10

Peer-to-Peer Systems

[10.1] P2P Systems - Goals and Definition


Goal: to enable sharing of data and resources on a very large scale by eliminating any requirement for separately-managed servers and their associated infrastructure.
Goal: to support useful distributed services and applications using data and
computing resources present on the Internet in ever-increasing numbers.

• the scalability of standard services is limited when all the hosts must be owned and managed by a single service provider,
• administration and fault recovery costs tend to dominate.

Peer-to-peer systems: applications that exploit resources available at the edges


of the Internet - storage, cycles, content, human presence.

[10.2] P2P Systems - Features


Characteristics shared by the P2P systems:
• design ensures that each user contributes resources to the system,
• all the nodes in a P2P system have the same functional capabilities and
responsibilities, although they may differ in the resources that they con-
tribute,
• correctness of any operation does not depend on the existence of any
centrally-administered systems,
• often designed to offer a limited degree of anonymity to the providers and
users of resources,


• the key issues for P2P system efficiency:

– algorithms for data placement across many hosts and subsequent access to it,
– within those algorithms: workload balancing and ensuring availability without adding undue overheads.

[10.3] P2P Systems - History


Antecedents of P2P systems: distributed algorithms for placement or location of information, and early Internet-based services with multi-server, scalable and fault-tolerant architectures: DNS, Netnews/Usenet, classless inter-domain IP routing.
The potential for the deployment of P2P services emerged when a significant number of users had acquired always-on broadband connections (around 1999 in the USA).

Three generations of P2P systems:

1. launched by the Napster music exchange service,

2. file-sharing applications offering greater scalability, anonymity and fault


tolerance (Freenet, Gnutella, Kazaa, BitTorrent).

3. P2P middleware layers for application-independent management of dis-


tributed resources (Pastry, Tapestry, CAN, Chord, Kademlia).

[10.4] P2P Middleware Introduction


Middleware platforms for distributed resources management:

• designed to place resources and to route messages to them on behalf of


clients,

• relieve clients of decisions about placing resources and of holding resource address information,

• provide guarantee of delivery for requests in a bounded number of network


hops,

• resources identified by globally unique identifiers (GUIDs), usually derived as a secure hash from the resource's state,

• secure hashes make resources "self-certifying": clients receiving a resource can check the validity of the hash.


• inherently best suited to storage of immutable objects,

• usage for objects with dynamic state more challenging, usually addressed
by addition of trusted servers for session management and identification.

[10.5] IP and P2P Overlay Routing (1)

• Scale:

IP: IPv4 limited to 2^32 addressable nodes (in IPv6 to 2^128); addresses hierarchically structured, with much of the space preallocated according to administrative requirements.
OR: The GUID name space very large and flat (>2^128), allowing it to be much more fully occupied.

• Load balancing:

IP: Loads on routers are determined by network topology and associated


traffic patterns.
OR: Object locations can be randomized and hence traffic patterns are
divorced from the network topology.

• Network dynamics (addition/deletion of objects/nodes):

IP: IP routing tables are updated asynchronously on a best-efforts basis


with time constants on the order of 1 hour.
OR: Routing tables can be updated synchronously or asynchronously with
fractions of a second delays.

[10.6] IP and P2P Overlay Routing (2)

• Fault tolerance:

IP: Redundancy is designed into the IP network by its managers, ensuring


tolerance of a single router or network connectivity failure. n-fold
replication is costly.
OR: Routes and object references can be replicated n-fold, ensuring tol-
erance of n failures of nodes or connections.

• Target identification:


IP: Each IP address maps to exactly one target node.


OR: Messages can be routed to the nearest replica of a target object.

• Security and anonymity:

IP: Addressing is only secure when all nodes are trusted. Anonymity for
the owners of addresses is not achievable.
OR: Security can be achieved even in environments with limited trust. A
limited degree of anonymity can be provided.

[10.7] Distributed Computation (1)

• work with the first personal computers at Xerox PARC showed the feasibility of performing loosely-coupled compute-intensive tasks by running background processes on about 100 computers linked by a local network,

• Piranha/Linda and adaptive parallelism,

• SETI@home - most widely known project

– part of a wider project Search for Extra-Terrestrial Intelligence,


– stream of data partitioned into 107-second work units, each of about 350KB,
– each work unit distributed redundantly to 3-4 personal computers,
– distribution and coordination handled by a single server,
– 3.91 million computers participated by August 2002, resulting in the processing of 221 million work units,
– on average 27.36 teraflops of computational power,

[10.8] Distributed Computation (2)

• SETI@home didn't involve any communication or coordination between computers while processing the work units,

• although often described as P2P, such systems are really based on a client-server architecture,

• BOINC – Berkeley Open Infrastructure for Network Computing.


Similar scientific tasks:

• search for large prime numbers,

• attempts at brute-force decryption,

• climate prediction.

Grid projects - distributed platforms that support data sharing and the coordina-
tion of computation between participating computers on a large scale. Resources
are located in different organizations and are supported by heterogeneous com-
puter hardware, operating systems, programming languages and applications.

[10.9] Napster – Music Files P2P (1)

• launched in 1999 became very popular for music exchange,

• architecture: centralized replicated indexes, but users supplied the files


stored and accessed on their personal computers,

• locality – minimizing number of hops between client and server when


allocating a server to a client requesting a file,

• took advantage of special characteristics of the application:

– music files never updated, no need for consistency management,


– no guarantees required concerning availability of individual files (mu-
sic temporarily unavailable may be downloaded later).

• key to success: large, widely-distributed set of files available to users,

• Napster shut down as a result of legal proceedings instituted against Napster


service operators by the owners of the copyright in some of the material.

[10.10] Napster – Music Files P2P (2)

[Figure: 1. a peer sends a file location request to a Napster server; 2. the server returns a list of peers offering the file; 3. the peer sends a file request to one of them; 4. the file is delivered; 5. the new holder sends an index update to the Napster servers, which maintain replicated indexes.]

Napster: P2P file sharing with a centralized, replicated index. In step 5, clients are expected to add their own files to the pool of shared resources.

[10.11] P2P Middleware Requirements (1)


Function of the P2P middleware: to simplify the construction of services implemented across many hosts in a widely distributed network.

Expected functional requirements:

• enabling clients to locate and communicate with any individual resource


made available to a service,

• ability to add new resources and to remove them at will,

• ability to add hosts to the service and to remove them,

• offering simple programming interface independent of types of managed


distributed resources.

[10.12] P2P Middleware Requirements (2)


Expected non-functional requirements:

• global scalability,

• load balancing - random placement and usage of replicas,

• optimization for local interactions between neighbouring peers,

• accommodating to highly dynamic host availability,


• security of data in an environment with heterogeneous trust,

• anonymity, deniability and resistance to censorship.

[10.13] Routing Overlays


Routing overlay
A distributed algorithm which takes responsibility for locating nodes and objects
in P2P networks.
Randomly distributed identifiers (GUIDs) used to determine placement of ob-
jects and to retrieve them, thus overlay routing systems sometimes described as
distributed hash tables (DHT).
General tasks of a routing overlay layer:

• routing a request to an object, given its GUID,

• publishing a resource under a given GUID,

• servicing requests to remove a resource,

• reallocating responsibility among nodes as the set of participating peers changes.

[10.14] Routing Overlay – Identifiers


GUIDs – opaque identifiers that reveal nothing about the locations of the objects to which they refer. They are computed with a hash function (such as SHA-1) from all or part of the state of an object and are effectively unique; uniqueness can be verified by searching for another object with the same GUID.

Prefix routing - narrowing the search for the next node along the route by
applying a binary mask that selects an increasing number of hexadecimal digits
from the destination GUID after each hop.
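The GUID derivation and self-certification check can be sketched directly with a standard SHA-1 implementation (the object contents here are illustrative):

```python
import hashlib

def guid(data: bytes) -> str:
    """Derive a GUID as the SHA-1 hash of the object's state (160 bits, hex)."""
    return hashlib.sha1(data).hexdigest()

def verify(data: bytes, claimed_guid: str) -> bool:
    """Self-certification: recompute the hash and compare with the claimed GUID."""
    return guid(data) == claimed_guid

g = guid(b"Phil's Books")
```

A client that fetched the object and its GUID separately can detect any tampering with the content, since a modified object no longer hashes to the claimed GUID.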

[10.15] Routing Overlay – DHT

put(GUID, data)
The data is stored in replicas at all nodes responsible for the object identified by
GUID.
remove(GUID)
Deletes all references to GUID and the associated data.


value = get(GUID)
The data associated with GUID is retrieved from one of the nodes responsible
for it.

Basic programming interface for a distributed hash table (DHT) as implemented by the PAST API over Pastry.
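A toy in-memory model of this interface shows the core DHT idea, storing each item at the node whose GUID is numerically closest to the item's GUID (node GUIDs are small integers for clarity; replication and the overlay network itself are omitted):

```python
class ToyDHT:
    """Not Pastry/PAST: a single-process sketch of the put/get/remove API."""

    def __init__(self, node_guids):
        self.nodes = {g: {} for g in node_guids}   # per-node local store

    def _responsible(self, guid):
        # the node with GUID numerically closest to the item's GUID
        return min(self.nodes, key=lambda n: abs(n - guid))

    def put(self, guid, data):
        self.nodes[self._responsible(guid)][guid] = data

    def get(self, guid):
        return self.nodes[self._responsible(guid)].get(guid)

    def remove(self, guid):
        self.nodes[self._responsible(guid)].pop(guid, None)

dht = ToyDHT([0, 100, 200])
dht.put(90, "song.mp3")          # stored at node 100, the closest to 90
```

In a real deployment `put` would also replicate the data at the r numerically closest nodes, as described in the next section.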

[10.16] Routing Overlay – DOLR

publish(GUID)
GUID can be computed from the object (or some part of it, e.g. its name). This
function makes the node performing a publish operation the host for the object
corresponding to GUID.
unpublish(GUID)
Makes the object corresponding to GUID inaccessible.
sendToObj(msg, GUID, [n])
Following the object-oriented paradigm, an invocation message is sent to an
object in order to access it. This might be a request to open a TCP connection
for data transfer or to return a message containing all or part of the object’s state.
The final optional parameter [n], if present, requests the delivery of the same
message to n replicas of the object.

Basic programming interface for distributed object location and routing (DOLR)
as implemented by Tapestry.

[10.17] Routing Overlay – Routing and Location


DHT:

• when data is submitted to be stored with its GUID, the DHT layer takes responsibility for choosing a location, storing it (with replicas) and providing access,

• a data item with GUID X is stored at the node whose GUID is numerically closest to X, and moreover at the r hosts with GUIDs numerically closest to it, where r is a replication factor chosen to ensure high availability.

DOLR:

• locations for the replicas of data objects decided outside the routing layer,


• host address of each replica notified to DOLR using the publish() operation.

[10.18] Routing Overlay – Prefix Routing


Prefix routing:

• both Pastry and Tapestry employ prefix routing to determine routes,

• prefix routing is based on applying a binary mask that selects an increasing number of hexadecimal digits from the destination GUID after each hop (similar to CIDR in IP).

Other possible routing schemes:

• based on numerical difference between the GUIDs of the selected node


and the destination node (Chord),

• usage of distance in a d-dimensional hyperspace into which nodes are


placed (CAN),

• usage of the XOR of pairs of GUIDs as a metric for distance between


nodes (Kademlia).
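The different distance notions are easy to make concrete. A short sketch of two of them, the shared hex-prefix length (as in Pastry/Tapestry-style prefix routing) and the XOR metric (as in Kademlia), over GUIDs written as hex strings:

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading hexadecimal digits two GUIDs share (prefix routing)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def xor_distance(a: str, b: str) -> int:
    """Kademlia's metric: XOR of the GUIDs, interpreted as an integer."""
    return int(a, 16) ^ int(b, 16)

p = common_prefix_len("65a1fc", "65b2ae")   # the GUIDs share the prefix "65"
```

Each hop of prefix routing tries to increase `common_prefix_len` with the destination; Kademlia instead always forwards to a known node with a smaller `xor_distance` to the target.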

[10.19] P2P - Human-readable Names

• GUIDs are not human-readable, some form of indexing service using


human-readable names or search requests required,

• weakness of centralized indexes evidenced by Napster,

• example: indices on web pages in BitTorrent. Definitions: seed – peers


with complete copy of the torrent still offering upload; swarm – all peers
including seeds sharing a torrent,

• in BitTorrent a web search index leads to a stub file containing details of the desired resource. The torrent file contains metadata about all the files it makes downloadable, including: names, sizes, checksums of all pieces in the torrent, and the address of a tracker that coordinates communication between the peers in the swarm,

• tracker – server that keeps track of which seeds and peers are in the
swarm, not directly involved in the data transfer, does not have copies of
data files.


• clients report information to the tracker periodically and in exchange re-


ceive information about other clients that they can connect to.

[10.20] Pastry - Introduction


Pastry: message routing infrastructure deployed in several applications including
PAST, an archival (immutable) file storage system implemented as a distributed
hash table with DHT API and in Squirrel, a P2P web caching service.

• 128-bit GUIDs (from a hash function such as SHA-1) randomly distributed in the range 0 to 2^128 − 1,

• in a network with N participating nodes, the Pastry routing algorithm correctly routes a message addressed to any GUID in O(log N) steps,

• if a target node is active, message is delivered, otherwise message delivered


to active node which is numerically closest to it.

• active nodes take responsibility for processing requests addressed to all


objects in their numerical neighbourhood,

• moreover Pastry uses a locality metric based on network distance in the


underlying network to select appropriate neighbours,

[10.21] Pastry - Routing


Routing, simplified approach:

• each active node stores a leaf set – a vector L (of size 2l) containing
the GUIDs and IP addresses of the nodes whose GUIDs are numerically
closest on either side of its own (above and below),

• leaf sets are maintained by Pastry as nodes join and leave,

• any node A that receives a message M with destination address D routes


the message by comparing D with its own GUID A and with each of the
GUIDs in its leaf set and forwards M to the node amongst them that is
numerically closest to D,

• inefficient, requires about N/2l hops to deliver a message.

[10.22] Circular Routing


[Figure] Black colour depicts live nodes. The space is considered circular: node 0 is adjacent to node (2^128 − 1). The diagram illustrates the routing of a message from node 65A1FC to D46A1C using leaf set information alone, assuming leaf sets of size 8 (l = 4; in Pastry, usually 8). This is a degenerate type of routing that would scale very poorly; it is not used in practice.
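Leaf-set-only routing can be simulated directly (a toy simulation: a 2^16 ring instead of 2^128, node GUIDs as plain integers, and global knowledge of the sorted node list standing in for the per-node leaf sets):

```python
SPACE = 2**16          # small circular GUID space, for illustration

def ring_dist(a, b):
    """Distance on the circular GUID space."""
    d = abs(a - b)
    return min(d, SPACE - d)

def route(nodes, start, dest, l=4):
    """Hop sequence from `start` to the live node closest to `dest`,
    forwarding each time to the leaf-set member closest to the target."""
    nodes = sorted(nodes)
    hops, current = [start], start
    while True:
        i = nodes.index(current)
        # leaf set: l neighbours on either side, wrapping around the ring
        leaf = {nodes[(i + k) % len(nodes)] for k in range(-l, l + 1)}
        nxt = min(leaf, key=lambda n: ring_dist(n, dest))
        if nxt == current:
            return hops
        hops.append(nxt)
        current = nxt
```

Every hop advances at most l positions around the ring, which is exactly why this scheme needs on the order of N/2l hops and is only a fallback in real Pastry.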
[10.23] Pastry Routing

• efficient routing due to routing tables,


• each node maintains a tree-structured routing table of nodes spread
throughout the entire address range, with increased density of coverage for
GUIDs numerically close to,
• the routing process at any node uses the information in its routing table
and leaf set to handle each request from an application and each incoming
message from another node,


• new nodes use a joining protocol: they compute a suitable GUID (typically by applying SHA-1 to the node's public key) and then make contact with a nearby (in network distance) Pastry node.

[10.24] Pastry’s Routing Table

First four rows of a Pastry routing table located in a node whose GUID begins with 65A1.

• each "n" element represents a [GUID, IP address] pair specifying the next hop to be taken by messages addressed to GUIDs that match each given prefix,

• grey-shaded entries indicate that the prefix matches the current GUID up to the given value of p: the next row down or the leaf set should be examined to find a route,

• although there are a maximum of 128 rows in the table, only log16 N rows will be populated on average in a network with N active nodes.

[10.25] Pastry’s Routing Algorithm


Let R[p, i] denote the element at column i in row p of the routing table, and let L denote the leaf set, with L−l and Ll its extreme members. To handle a message M addressed to a node D:

if (L−l ≤ D ≤ Ll) {
    // D is within the range of the leaf set
    forward M to the element Li of the leaf set with GUID closest to D
    (possibly the current node A).
} else {
    find p, the length of the longest common prefix of D and A;
    find i, the (p + 1)th hexadecimal digit of D;
    if (R[p, i] ≠ null) {
        forward M to R[p, i];
    } else {
        // rare case: the routing table entry is empty
        forward M to any node in L or R with a common prefix of length p,
        but a GUID that is numerically closer to D.
    }
}
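The same decision procedure can be written as runnable code (a simplified sketch: GUIDs are equal-length lowercase hex strings, the leaf-set range test ignores the wrap-around of the circular GUID space, and the routing table is modelled as a dict keyed by (row, digit)):

```python
def shl(a: str, b: str) -> int:
    """Length of the longest common prefix of two GUIDs, in hex digits."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(A, D, leaf_set, table):
    """Choose where node A forwards a message addressed to GUID D."""
    num = lambda g: int(g, 16)
    lo, hi = min(leaf_set), max(leaf_set)
    if lo <= D <= hi:                       # D falls within the leaf-set range
        return min(leaf_set | {A}, key=lambda n: abs(num(n) - num(D)))
    p = shl(A, D)                           # longest common prefix with D
    entry = table.get((p, D[p]))            # R[p, i], i = (p+1)th digit of D
    if entry is not None:
        return entry
    # rare case: any known node with prefix >= p that is numerically closer
    candidates = [n for n in leaf_set | set(table.values())
                  if shl(n, D) >= p and abs(num(n) - num(D)) < abs(num(A) - num(D))]
    return min(candidates, key=lambda n: abs(num(n) - num(D))) if candidates else A

hop = next_hop("65a1", "d46a", {"64ff", "65b0"}, {(0, "d"): "d13a"})
```

Here "65a1" shares no prefix with "d46a", so the table entry for row 0, digit "d" supplies the next hop.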

[10.26] Pastry Routing Example

[Figure] Routing a message from node 65A1FC to D46A1C. With the aid of a well-populated routing table the message can be delivered in log16(N) hops.

[10.27] Pastry - Host Failure and Fault Tolerance


• nodes may fail or depart without warning, node considered failed when its
immediate neighbours (in GUID space) can no longer communicate with
it,

• to repair leaf set, the node looks for a live node close to the failed one
and requests a copy of its leaf set (one value to replace),

• repairs to routing tables made on a ’when discovered’ basis,

• moreover all nodes send heartbeat messages to neighbouring nodes in


their leaf sets,

• to deal with any remaining failures or malicious nodes, a small degree of randomness is introduced into the route selection algorithm: a route may be taken from an earlier row of the table, which is less optimal but different.

[10.28] Tapestry

• nodes holding resources periodically use the publish(GUID) primitive to


make them known to Tapestry, holders responsible for storing resources,
replicated resources published with the same GUID,

• 160-bit identifiers used to refer both to objects and to nodes that perform
routing actions,

• for any resource with GUID G there is a unique root node with GUID RG numerically closest to G,

• on each invocation of publish(G), a publish message is routed towards RG,

• on receipt, RG enters the mapping (G, IPH) between G and the sending host's IP address in its routing table; the same mapping is cached along the publication path.

[10.29] Tapestry Routing


[Figure] Replicas of the file Phil's Books (G=4378), hosted at nodes 4228 and AA93. Node 4377 is the root node for object 4378. The routings shown are some of the entries in routing tables. The location mappings (cached while servicing publish messages) are subsequently used to route messages sent to 4378.

[10.30] Squirrel Web Cache (1)

• a P2P web caching service for use in local networks, developed by the authors of Pastry,
Web caching in general:

• browser cache, proxy cache, origin web server,

• metadata stored with an object in a cache: date of last modification T, time-to-live t or an eTag (a hash computed from the object contents),

• conditional GET (cGET) request issued to the next level for validation,

• cGET request types: If-Modified-Since, If-None-Match,

• in response either the entire object or not-modified message.
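The validation step can be sketched as a small function modelling the origin's side of a conditional GET (a toy model, not an HTTP implementation; the object fields and timestamps are illustrative):

```python
def serve_cget(origin_obj, if_modified_since=None, if_none_match=None):
    """origin_obj: dict with 'body', 'last_modified' (int timestamp), 'etag'.
    Returns (status, body): 304 with no body when the cached copy is valid."""
    if if_none_match is not None:                       # If-None-Match (eTag)
        if if_none_match == origin_obj["etag"]:
            return ("304 Not Modified", None)
    elif if_modified_since is not None:                 # If-Modified-Since
        if origin_obj["last_modified"] <= if_modified_since:
            return ("304 Not Modified", None)
    return ("200 OK", origin_obj["body"])               # send the entire object

obj = {"body": b"<html>...</html>", "last_modified": 1000, "etag": "abc1"}
status, _ = serve_cget(obj, if_none_match="abc1")
```

In Squirrel this exchange happens between a client and the object's home node, which itself validates against the origin web server when its own copy may be stale.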

[10.31] Squirrel Web Cache (2)


• SHA-1 hash function applied to the URL of each cached object to produce
a 128-bit Pastry GUID, GUID not used to validate content,

• in the simplest implementation: the node whose GUID numerically closest


to the GUID of an object becomes the object’s home node, responsible for
holding any cached copy of the object,

• Squirrel routes a Get or a cGet request via Pastry to the home node.

Evaluation, two real working environments within Microsoft, 105 active clients
(Cambridge), 36000 active clients (Redmond):

• reduction in total external bandwidth: caches 100MB, 37% (Cambridge),


28% (Redmond), hit ratio for centralized servers: 38% and 29% respec-
tively,

• local latency perceived by users accessing web objects: negligible,

• computational and storage load: low and likely to be imperceptible to users.

[10.32] OceanStore File Store

• OceanStore – unlike PAST, supports the storage of mutable files,

• goal: very large scale, scalable persistent storage facility for mutable data
objects with long-term persistence and reliability in changing network and
computing resources environment,

• privacy and integrity achieved through encryption of data and use of a


Byzantine agreement protocol for updates to replicated objects – because
trustworthiness of individual hosts cannot be assumed,

• Pond – OceanStore prototype implemented in Java, uses Tapestry routing


overlay to place blocks of data at distributed nodes and to dispatch requests
to them,

• data stored in a set of blocks, data blocks organized and accessed through
a metadata block called root block,

• each object represented as an ordered sequence of immutable versions kept forever; versions share unchanged blocks (copy-on-write technique).


[10.33] Ocean Store - Storage Organization (1)

• several replicas of each block stored at peer nodes selected according to locality and storage availability criteria,
• data block GUIDs are published (with publish()) by each of the nodes that holds a replica, so Tapestry can be used by clients to access the blocks,
• an AGUID (the permanent GUID identifying all versions of an object) is stored in directories against each file name,
• the association between an AGUID and the sequence of versions of the object is recorded in a signed certificate, stored and replicated by a primary-copy replication scheme,
• the trust model for P2P requires each new certificate to be agreed amongst a small set of hosts called the inner ring.

[10.34] Ocean Store - Storage Organization (2)


Version i + 1 has been updated in blocks d1, d2 and d3. The certificate and the
root blocks include some data not shown. All unlabelled arrows are BGUIDs.

[10.35] Pond Performance

Times in seconds to run different phases of the Andrew benchmark. (1) recursive
subdirectory creation, (2) source tree copying, (3) status only examining of all
the files in the tree, (4) every data byte examining in all the files, (5) compiling
and linking the files.

[10.36] Ivy File System

• read/write file system emulating a Sun NFS server,

• stores the state of files as logs of the file update requests issued by Ivy
clients,

• log records held in DHash, a distributed hash-addressed storage service (160-bit SHA-1 keys),

• version vectors to impose a total order on log entries when reading from
multiple logs,

• potentially very long read time reduced by use of a combination of local


caches and snapshots,

• the shared file system is seen as the result of merging all the updates performed by a dynamically selected set of participants (a view),


• operation can continue during network partitions; conflicting updates to shared files are resolved similarly to the Coda file system.
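The version vectors mentioned above are simple to illustrate (a generic sketch of the mechanism, not Ivy's actual log-record format; participant names are illustrative):

```python
# Each participant keeps one counter per participant. Vector u happened
# before vector v iff every component of u is <= the corresponding
# component of v and the vectors differ; otherwise the updates may conflict.

def before(u, v):
    keys = set(u) | set(v)                 # missing entries count as 0
    return all(u.get(k, 0) <= v.get(k, 0) for k in keys) and u != v

def concurrent(u, v):
    """Neither precedes the other: a potential conflict to resolve."""
    return not before(u, v) and not before(v, u) and u != v

a = {"p1": 2, "p2": 0}
b = {"p1": 2, "p2": 1}   # b has seen everything a has, plus one update by p2
```

Log entries whose vectors are totally ordered can be replayed in that order; concurrent entries are the ones needing Coda-style conflict resolution.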

[10.37] Ivy Architecture

Ivy system architecture.

[10.38] Ivy – Performance


Each participant maintains a mutable DHash block (called log-head) that points
to a participant’s most recent log record. Mutable blocks are assigned a cryp-
tographic public key pair by their owner. The contents of the block are signed
with the private key. Any participant that has the public key can retrieve the
log-head and use it to access all the records in the log.
Performance:
• execution times mostly two times (for some operations three times) larger than for NFS,
• in a WAN about 10 times slower than in a LAN, similar to NFS; still, NFS was not designed for use in a WAN,

Primary contribution of Ivy: novel approach to the management of security


and integrity in an environment of partial trust (in networks spanning many
organizations and jurisdictions).

[10.39] P2P – Summary


The benefits of P2P:


• ability to exploit unused resources (storage, processing) in the host computers,

• ability to support large numbers of clients and hosts with adequate bal-
ancing of the loads on network links and host computer resources,

• self-organizing properties of the middleware platforms lead to costs largely independent of the numbers of clients and hosts deployed.

Weaknesses and subjects of research:

• relatively costly as a storage solution for mutable data, compared to trusted centralized service solutions,

• still lack of strong guarantees for client and host anonymity.

Chapter 11

Web Services

[11.1] XML – Introduction


The Extensible Markup Language (XML) is a W3C-recommended general-purpose
markup language for creating special-purpose markup languages, capable of
describing many different kinds of data.

• a way of describing data,

• a simplified subset of Standard Generalized Markup Language (SGML),

• primary purpose: to facilitate the sharing of data across different systems,
particularly systems connected via the Internet.

[11.2] XML – Main Features


XML as a well-suited medium for data transfer:

• simultaneously human- and machine-readable format,

• support for Unicode, allowing almost any information in any human lan-
guage to be communicated,

• ability to represent the most general computer science data structures:
records, lists and trees,

• the self-documenting format that describes structure and field names as
well as specific values,

• the strict syntax and parsing requirements that allow the necessary parsing
algorithms to remain simple, efficient, and consistent.
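The point about records, lists and trees can be illustrated with a short sketch (Python and its standard xml.etree module are used here purely for illustration; the "person" record is invented for the example):

```python
import xml.etree.ElementTree as ET

# a record ("person") with a scalar field and a list, encoded as an XML tree
person = ET.Element("person")
ET.SubElement(person, "name").text = "Alice"
languages = ET.SubElement(person, "languages")
for code in ("en", "pl"):
    ET.SubElement(languages, "language").text = code

xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)
# <person><name>Alice</name><languages><language>en</language><language>pl</language></languages></person>
```

Note how the result is simultaneously machine-parsable and human-readable, and how the tags document the structure and field names alongside the values.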


[11.3] XML and Correctness


For an XML document to be correct, it must be:

1. well-formed: conforming to all of XML’s syntax rules.

2. valid: conforming to some XML schema. An XML schema is a description
of a type of XML document, typically expressed in terms of constraints
on the structure and content of documents of that type.
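Well-formedness can be checked with any XML parser; a minimal sketch (using Python's standard xml.etree parser, purely as an example — validity against a schema would require a separate validator):

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Check conformance to XML's syntax rules (not validity against a schema)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<greeting><to>World</to></greeting>"))  # True
print(is_well_formed("<greeting><to>World</greeting>"))       # False: <to> never closed
```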

DTD Document Type Definition, inherited from SGML, included in the XML
1.0 standard,

XSD XML Schema Definition, schema with rich datatyping system and XML
syntax,

Relax NG proposed by OASIS, now part of the ISO DSDL (Document Schema
Definition Languages) standard.

• two formats: an XML-based syntax and a compact syntax,

• the compact syntax aims to increase readability and writability, with a
strict way to translate the compact syntax to the XML syntax and back
again.

[11.4] Web Service – Definition


Web service definition (W3C):
A Web service is a software system designed to support interoperable machine-
to-machine interaction over a network.

• It has an interface described in a machine-processable format (specifically
WSDL).

• Other systems interact with the Web service in a manner prescribed by its
description using SOAP messages, typically conveyed using HTTP with
an XML serialization in conjunction with other Web-related standards.

[11.5] Web Service – Introduction

• a web service provides a service interface enabling clients to interact with
servers in a more general way than web browsers do,


• clients access the operations in the interface of a web service by means of
requests and replies formatted in XML and usually transmitted over HTTP,

• like CORBA and Java, the interface of web services can be described in an
IDL. But for web services, additional information including the encoding
and communication protocols in use and the service location need to be
described,

• the secure channels of TLS do not meet all of the necessary requirements;
XML security is intended to bridge this gap.

[11.6] Web Services - Core Components


Web services - core components:

XML All data to be exchanged is formatted with XML tags. The encoded
message may conform to a messaging standard such as SOAP or the older
XML-RPC. The XML-RPC scheme calls functions remotely, whilst SOAP
favours a more modern (object-oriented) approach based on the Command
pattern.

SOAP Lightweight protocol for exchange of information in a decentralized,
distributed environment.

WSDL Web Services Description Language, an XML-based language for
describing the public interface to web services. It describes how to
communicate using the web service.

UDDI a protocol for publishing web service information. It enables applications
to look up web services in order to determine whether to use them.

[11.7] Web Services - Other Components

Web Services Protocol Stack standards and protocols used to consume a web
service.

Common protocols protocols for data transport such as HTTP, FTP and SMTP.

ebXML a set of specifications enabling a modular electronic business framework.
The vision of ebXML is to enable a global electronic marketplace for
conducting business through the exchange of XML-based messages.


WS-Security a specification that allows authentication of actors and
confidentiality of the messages sent (OASIS standard).

WS-ReliableMessaging a SOAP-based specification that fulfills reliable
messaging requirements critical to some applications of Web Services
(OASIS standard).

WS-Management a specification which describes a SOAP-based protocol for
systems management of personal computers, servers, devices, and other
manageable hardware, as well as Web services and other applications.

[11.8] Web Services Infrastructure

The infrastructure layers, from top to bottom (each line lists the components
at one layer):

Applications
Directory service | Security | Choreography
Web Services | Service descriptions (in WSDL)
SOAP
URIs (URLs or URNs) | XML | HTTP, SMTP or other transport

Web services infrastructure and components.

[11.9] WS Features (1)

• an XML-based data representation model,

• SOAP protocol specifies the rules for using XML to package messages,
for example to support a request-reply protocol,

• SOAP used to encapsulate these messages and transmit them over HTTP
or another protocol,

• a Web service provides a service description, which includes an interface
definition and other information, such as the server's URL,

• XML security: documents or parts of documents may be signed or
encrypted,


• Web services do not provide means for coordinating their operations with
one another.

[11.10] WS Features (2)


The main differences from the distributed object model:

• remote objects cannot be instantiated; effectively a web service consists
of a single remote object,

– remote object references are irrelevant,

• although the interaction is similar to that in RMI, remote object references
are not very similar to URIs,

• web services cannot create instances of remote objects, so garbage
collection is irrelevant.

[11.11] SOAP (Simple Object Access Protocol)


SOAP
XML-based lightweight protocol for exchange of information in a decentralized,
distributed environment.

SOAP message structure:

Envelope the top-level root element of a SOAP message, which contains the
header and body elements.

Header a collection of zero or more SOAP header blocks each of which might
be targeted at any SOAP receiver within the SOAP message path.

Body a collection of zero or more element information items targeted at an
ultimate SOAP receiver in the SOAP message path.
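The envelope/header/body structure above can be sketched programmatically. This is an illustrative construction only (Python's standard xml.etree module; the SOAP 1.2 envelope namespace URI is used, and the `exchange` application element is hypothetical):

```python
import xml.etree.ElementTree as ET

ENV = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
ET.register_namespace("env", ENV)

envelope = ET.Element(f"{{{ENV}}}Envelope")       # top-level root element
ET.SubElement(envelope, f"{{{ENV}}}Header")       # zero or more header blocks
body = ET.SubElement(envelope, f"{{{ENV}}}Body")  # items for the ultimate receiver

# hypothetical application-level element carried in the body
exchange = ET.SubElement(body, "{urn:example}exchange")
ET.SubElement(exchange, "{urn:example}arg1").text = "Hello"

print(ET.tostring(envelope, encoding="unicode"))
```

A real SOAP toolkit would generate such envelopes from the WSDL description rather than building them by hand.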

[11.12] SOAP Specification


The SOAP specification states:

• how XML is to be used to represent the contents of individual messages,

• how a pair of single messages can be combined to produce a request-reply
pattern,


• the rules as to how the recipients of messages should process the XML
elements that they contain,

• how HTTP and SMTP should be used to communicate SOAP messages.
It is expected that future versions of the specification will define how to
use other transport protocols, for example, TCP.

[11.13] SOAP Message in an Envelope

envelope
  header
    header element    header element
  body
    body element      body element

[11.14] SOAP Example (1)

<env:Envelope xmlns:env="namespace URI for SOAP envelopes">
  <env:Body>
    <m:exchange xmlns:m="namespace URI of the service description">
      <m:arg1>Hello</m:arg1>
      <m:arg2>World</m:arg2>
    </m:exchange>
  </env:Body>
</env:Envelope>

Example of a simple request without headers (in the original figure each XML
element is represented by a shaded box).

[11.15] SOAP Example (2)


<env:Envelope xmlns:env="namespace URI for SOAP envelopes">
  <env:Body>
    <m:exchangeResponse xmlns:m="namespace URI of the service description">
      <m:res1>World</m:res1>
      <m:res2>Hello</m:res2>
    </m:exchangeResponse>
  </env:Body>
</env:Envelope>

Example of a reply corresponding to the previous request.

[11.16] SOAP and HTTP POST

POST /examples/stringer HTTP/1.1                        <- endpoint address
Host: www.cdk4.net                                      <- HTTP headers
Content-Type: application/soap+xml
Action: http://www.cdk4.net/examples/stringer#exchange  <- action

<env:Envelope xmlns:env="namespace URI for SOAP envelope">
  <env:Header></env:Header>                             <- SOAP message
  <env:Body></env:Body>
</env:Envelope>

Use of an HTTP POST request in SOAP client-server communication.
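Such a request can be assembled by hand; a minimal Python sketch (the endpoint and host are taken from the example above; nothing is actually sent over the network):

```python
soap_message = (
    '<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">'
    "<env:Header></env:Header><env:Body></env:Body></env:Envelope>"
)

headers = {
    "Host": "www.cdk4.net",
    "Content-Type": "application/soap+xml",
    "Content-Length": str(len(soap_message)),
    "Action": "http://www.cdk4.net/examples/stringer#exchange",
}

# an HTTP request is the request line, the headers, a blank line, then the body
request = "POST /examples/stringer HTTP/1.1\r\n"
request += "".join(f"{name}: {value}\r\n" for name, value in headers.items())
request += "\r\n" + soap_message

print(request)
```

In practice an HTTP client library would be used; the sketch only makes the layering visible: the SOAP envelope travels as the body of an ordinary HTTP POST.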

[11.17] REST (Representational State Transfer)


Roy Fielding’s explanation of the meaning of Representational State Transfer:
Representational State Transfer is intended to evoke an image of how a well-
designed Web application behaves: a network of web pages (a virtual state-
machine), where the user progresses through an application by selecting links
(state transitions), resulting in the next page (representing the next state of the
application) being transferred to the user and rendered for their use.

REST (common meaning): any simple web-based interface that uses XML
and HTTP without the extra abstractions of MEP-based approaches like the
web services SOAP protocol. It is possible to design web service systems in
accordance with Fielding's REST architectural style (RESTful systems).
REST is an architectural style and not a standard.


[11.18] The use of SOAP with Java


The service interface
The Java interface of a web service must conform to the following rules:

• must extend the Remote interface,

• must not have constant declarations, such as public final static,

• the methods must throw java.rmi.RemoteException or one of its
subclasses,

• method parameters and return types must be permitted JAX-RPC types,

• no main method, no constructor,

• wscompile and wsdeploy are used to generate the skeleton class and the
service description (in WSDL),

• the service implementation runs as a servlet inside a servlet container
(like Tomcat),

• the client program may use static proxies, dynamic proxies or a dynamic
invocation interface.

[11.19] WSDL (Web Services Description Language)


WSDL
The Web Services Description Language (WSDL) is an XML format published
for describing Web services.

• an XML-based description of how to communicate using the web service,
namely, the protocol bindings and message formats required to interact
with it,

• supported operations and messages are described abstractly, and then bound
to a concrete network protocol and message format.

[11.20] The main elements in a WSDL description


definitions
  types      (target namespaces)      \
  message    (document style)          |  abstract part
  interface  (request-reply style)    /
  bindings   (how)                    \   concrete part
  services   (where)                  /

[11.21] WSDL Example
WSDL request and reply messages for the newShape operation:

<message name="ShapeList_newShape">
  <part name="GraphicalObject_1" type="tns:GraphicalObject"/>
</message>

<message name="ShapeList_newShapeResponse">
  <part name="result" type="xsd:int"/>
</message>

tns - target namespace, xsd - XML schema definitions
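A client-side tool can extract the part names and types from such a description; a small sketch (Python's standard xml.etree module, with the WSDL fragment inlined and the WSDL namespaces omitted for brevity):

```python
import xml.etree.ElementTree as ET

wsdl_fragment = """
<definitions>
  <message name="ShapeList_newShape">
    <part name="GraphicalObject_1" type="tns:GraphicalObject"/>
  </message>
  <message name="ShapeList_newShapeResponse">
    <part name="result" type="xsd:int"/>
  </message>
</definitions>
"""

root = ET.fromstring(wsdl_fragment)
# map each message name to its list of (part name, part type) pairs
messages = {
    msg.get("name"): [(p.get("name"), p.get("type")) for p in msg.findall("part")]
    for msg in root.findall("message")
}
print(messages["ShapeList_newShapeResponse"])  # [('result', 'xsd:int')]
```

This is exactly the kind of processing that tools like wscompile perform when generating client proxies from a WSDL description.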

[11.22] Message exchange patterns for WSDL operations

Name               Client sends   Server sends   Delivery     Fault message
In-Out             Request        Reply                       may replace Reply
In-Only            Request                                    no fault message
Robust In-Only     Request                       guaranteed   may be sent
Out-In             Reply          Request                     may replace Reply
Out-Only                          Request                     no fault message
Robust Out-Only                   Request        guaranteed   may send fault


Bibliography

[GCDT05] G. Coulouris, J. Dollimore, and T. Kindberg. Distributed Systems.
Concepts and Design, fourth edition. Addison Wesley, 2005.

[TvS02] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems.
Principles and Paradigms. Prentice Hall, 2002.

[TvS05] Andrew S. Tanenbaum and Maarten van Steen. Systemy rozproszone.
Zasady i paradygmaty. WNT, 2005.
