COURSE OUTLINE
REFERENCES
1. Operating Systems: Design and Implementation by Andrew Tanenbaum (2005), Prentice Hall
2. Distributed Computing: Principles and Applications by Liu M. L. (2004), Pearson Addison-Wesley
3. Schaum's Outline of Operating Systems by Archer J. H. (2002), McGraw-Hill
4. Operating System Projects Using Windows NT by Nutt Gary (2001), Addison-Wesley
5. Distributed Operating Systems: Concepts and Design by Pradeep K. S. (2001), Prentice Hall
6. Other resources: the Internet, papers, handouts, lecture notes, etc.
Figure: the layers of a computer system, from top to bottom: machine language, microprogramming, physical devices (hardware).
At the bottom is the hardware, which in many cases is itself composed of two or more layers.
The lowest layer contains physical devices such as IC chips, wires, network cards and cathode
ray tubes. The next layer, which may be absent in some machines, is a layer of primitive software
that directly controls the physical devices and provides a clean interface to the layer above. This
software, called the microprogram, is normally located in ROM. It is an interpreter: it fetches
machine language instructions such as ADD, MOVE and JUMP and carries them out as a series of
little steps. The set of instructions that the microprogram can interpret defines the machine
language. The machine language typically has between 50 and 300 instructions, mostly for
moving data around the machine, doing arithmetic and comparing values.
The program that hides the truth about hardware from the programmer and presents a nice,
simple view of named files that can be read and written is, of course, the operating system. The
operating system also conceals a lot of unpleasant business concerning interrupts, timers,
memory management and other low level features. In this view, the function of the operating
system is to present the user with the equivalent of an extended machine, or virtual machine, that
is easier to program than the underlying hardware.
Resource management
Modern computers consist of processors, memories, timers, disks, network interface cards,
printers etc. The job of the operating system is to provide for an orderly and controlled allocation
of processors, memories and I/O devices among the various programs competing for them. For
example, if several programs running on the same computer sent print jobs to the same printer
at the same time and printing were not controlled, the output would be interleaved, with say the
first line of the printout belonging to the first program, the second line to the second program
and so on. The operating system brings order to such situations by buffering all output destined
for the printer on disk. When one program has finished, the operating system can then copy its
output from the disk to the printer. In this view, the operating system keeps track of who is using
which resource, grants resource requests, accounts for usage and mediates conflicting requests
from different programs and users.
A process is basically a program in execution. Associated with each process is its address space:
the set of memory locations which the process can read and write. The address space contains the
executing program, its data and stack. Also associated with each process is some set of registers,
including the program counter, stack pointer and hardware registers and all information needed
to run the program. In a time-sharing system, the operating system decides to stop running one
process and start running another. When a process is suspended temporarily, it must later be
restarted in exactly the same state it had when it was stopped. This means that the context of the
process must be explicitly saved during suspension. In many operating systems, the information
about each process, apart from the contents of its address space, is stored in a table called the
process table.
Therefore, a suspended process consists of its address space, usually referred to as the core
image, and its process table entry. The key process management system calls are those dealing
with the creation and termination of processes. For example, a command interpreter (shell) reads
commands from a terminal, for instance a request to compile a program. The shell must create a
new process that will run the compiler and when the process has finished the compilation, it
executes a system call to terminate itself. A process can create other processes known as child
processes and these processes can in turn create other child processes. Related processes that are
cooperating to get some job done often need to communicate with one another and synchronize
their activities. This communication is known as Inter-Process Communication (IPC). Other
systems calls are available to request more memory or release unused memory, wait for a child
process to terminate and overlay its program with a different one.
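The shell's create-and-wait pattern described above can be sketched with Python's subprocess module; the child here just prints a line, standing in for the compiler a shell would launch.

```python
import subprocess
import sys

# Create a child process that runs a short program (a Python one-liner
# standing in for an invoked compiler), then wait for it to terminate.
child = subprocess.Popen([sys.executable, "-c", "print('compiling a program')"])

status = child.wait()   # block until the child terminates
print("child exited with status", status)
```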
Files - A file is a collection of related information defined by its creator. Commonly, files
represent programs and data. Data files may be numeric, alphabetic or alphanumeric. System
calls are needed to create, delete, move, copy, read and write files. Before a file can be read, it
must be opened, and after reading it should be closed. System calls are provided to do all these
things. Files are normally organized into logical clusters or directories, which make them easier
to locate and access. For example, you can have directories for keeping all your program files,
word processing documents, database files, spreadsheets, electronic mail etc. System calls are
available to create and remove directories. Calls are also provided to put an existing file in a
directory and to remove a file from a directory. Every file within a directory hierarchy can be specified
by giving its path name from the root directory. Such absolute path names consist of the list of
directories that must be traversed from the root directory to get to the file with slashes separating
the components.
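The file and directory system calls described above can be sketched in Python; the directory and file names here are purely illustrative, and a temporary directory keeps the example self-contained.

```python
import os
import tempfile

# Work in a scratch directory so the example is self-contained.
root = tempfile.mkdtemp()

# Create a directory, then create, write, read and finally delete a file in it.
os.mkdir(os.path.join(root, "docs"))
path = os.path.join(root, "docs", "notes.txt")   # an absolute path name

with open(path, "w") as f:                       # open for writing, then close
    f.write("hello, file system")

with open(path) as f:                            # open for reading, then close
    contents = f.read()

os.remove(path)                                  # delete the file
os.rmdir(os.path.join(root, "docs"))             # remove the directory
```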
Batch Systems - The early operating systems were batch systems. The common input devices
were card readers and tape drives. The common output devices were line printers, tape drives
and card punches. The users did not interact with the system, but would rather prepare a job and
submit it to the computer operator, who would feed the job into the computer and later on the
output appeared. The major task of the operating system was to transfer control automatically
from one job to the next. To speed processing, jobs with similar needs were batched together and
run through the computer as a group. Programmers would leave their jobs with the operator, who
would sort them into batches with similar requirements.
Multiprogramming - Spooling will result in several jobs that have already been read waiting on
disk, ready to run. This allows the operating system to select which job to put in memory next,
ready for execution. This is referred to as job scheduling. The most important aspect of job
scheduling is the ability to multiprogram. The operating system keeps several jobs in memory at
the same time, which is a subset of jobs kept in the job spool. The operating system picks and
starts executing one of the jobs in memory. Eventually, the job may have to wait for some task
such as an I/O operation to complete. In multiprogramming, when this happens the operating
system simply switches to and executes another job. If several jobs are ready to be brought from
the job spool into memory and there is no room for all of them, the system must choose among
them; making this decision is job scheduling. Having several jobs in memory at the same time,
ready for execution, also requires some memory management. In addition, if several jobs in
memory are ready to run at the same time, the system must choose among them; making this
decision is known as CPU scheduling.
Parallel Systems - Most systems are single-processor systems, that is, they have only one main
CPU. However, there is a trend towards multiprocessing systems. Such systems have more than
one processor in close communication, sharing the computer bus, clock and sometimes memory
and peripheral devices. These systems are referred to as tightly coupled systems. The motivation
for having such systems is to improve the throughput and reliability of the system.
Real-Time Systems - A real-time system is used when there are rigid time requirements on the
operation of a processor or the flow of data, and is thus often used as a control device in a dedicated
application. Sensors bring data to the computer. The computer must analyze the data and
possibly adjust control to modify the sensor inputs. Systems that control scientific experiments,
medical imaging systems, industrial control systems and some display systems are examples of
real-time systems.
Tightly Coupled Systems - Tightly coupled software on loosely coupled hardware. Components
are processors, memory, bus and I/O, e.g. the Meiko Computing Surface. The operating system
tries to maintain a single global view of the resources it manages. Single global inter-process
communication mechanism: any process can talk to any other process, regardless of which
processor the process is running on. Global protection scheme: the security system (e.g. passwords,
access rights) must look the same everywhere. The file system must look the same everywhere:
every file should be visible at every location (subject to protection/security constraints). Every
node runs the same operating system.
PROCESSING CONFIGURATIONS
Concerns processing configurations with two characteristics:
The number of instruction streams; and the number of data streams.
SISD: A computer with a Single Instruction stream and a Single Data stream: All
traditional uni-processor computers.
SIMD: Single Instruction, Multiple Data: Array processors with one instruction unit that
fetches an instruction and then commands many data units to carry it out in parallel, each
with its own data. Good for vector processing.
MISD: Multiple Instruction Single Data: Pipelined computers: Fetch and process
multiple instructions simultaneously, operating on one data at a time. (Book differs here).
MIMD: Multiple Instruction Multiple Data: A group of independent computers, each
with own program counter, program and data. All distributed systems are MIMD.
SOFTWARE CONCEPTS
Distributed Operating System - A distributed operating system (DOS) is an operating system
that is built, from the ground up, to provide distributed services. As such, a DOS integrates key
distributed services into its architecture. These services may include distributed shared memory,
assignment of tasks to processors, masking of failures, distributed storage and inter-process
communication.
Although some forms of middleware focus on adding support for distributed computing directly
into a language, middleware is generally implemented as a set of libraries and tools that enable
distributed applications to be built on top of existing operating systems and networking services.
Fault Tolerance – Since failures are inevitable, a computer system can be made more reliable by
making it fault tolerant. A fault tolerant system is one designed to fulfill its specified purposes
despite the occurrence of component failures (machine and network). Fault tolerant systems are
designed to mask component failures i.e. attempt to prevent the failure of a system in spite of the
failure of some of its components. Fault tolerance can be achieved through hardware and
software.
Although fault tolerance improves the system availability and reliability, it brings some
overheads in terms of:
Cost - increased system costs
Software development – recovery mechanisms and testing
Performance – makes system slower in updates of replicas
Scalability – Each component of a distributed system has a finite capacity. Designing for
scalability involves calculating the capacity of each of these elements and the extent to which the
capacity can be increased. Good distributed systems design minimizes the utilization of
components that are not scalable. Also, the element that is weakest in terms of available capacity
(and the extent to which that capacity can be increased) should be of prime importance in the
design. There are four principal components to be considered when designing for scalability:
client workstations, the LAN, servers and the WAN.
Performance – There are several common measures of performance for distributed systems:
Response time – the average elapsed time from the moment the user is ready to transmit
until the entire response is received.
Throughput – the number of requests handled per unit time.
Latency – the delay between the start of a message’s transmission from one process and
the beginning of its receipt by another
Bandwidth – the total amount of information that can be transmitted over given time unit
Jitter – the variation in time taken to deliver a series of messages
Client Server Model – This is the most widely used paradigm for structuring distributed systems.
A client requests a particular service. One or more processes, called servers, are responsible for
providing that service.
Peer-to-peer Model – This model is related to the client/server model. The use of a small,
manageable number of servers (i.e. increased centralization of resources) simplifies system
management compared to a case where potentially every computer is configured as both a client
and a server. The latter arrangement is known as the peer-to-peer model because every process
has the same functionality as its peer processes.
Layered Protocols - Due to the absence of shared memory, all communication in distributed
systems is based on exchanging (low level) messages. When process A wants to communicate
with process B, it first builds a message in its own address space. Then it executes a system call
that causes the operating system to send the message over the network to B. To make it easier to
deal with the numerous levels and issues involved in communication, the International Standards
Organization (ISO) developed a reference model that clearly identifies the various levels
involved, gives them standard names and points out which level should do which job. This
model is called the OSI model, and it consists of seven layers.
Client-Server TCP
Client-Server interaction in distributed systems is often done using the transport protocols of the
underlying network. With the increasing popularity of the Internet, it is now common to build
client-server applications and systems using TCP. The benefit of TCP compared to UDP is that it
works reliably over any network. The obvious drawback is that TCP introduces considerably
more overhead, especially for short request/reply interactions.
Middleware Protocols
Middleware logically lives in the application layer, but contains many general-purpose protocols
that warrant their own layers, independent of other, more specific applications.
Message Passing
Message passing in a distributed system is similar to communication using messages in a non-
distributed system. The main difference is that the only mechanism available for the passing
of messages is network communication. At its core, message passing involves two operations,
send() and receive(). Although these are very simple operations, there are many variations on
the basic model. For example, the communication can be connectionless or connection oriented.
Connection oriented communication requires that the sender and receiver first create a
connection before send( ) and receive( ) can be used. Communication operations can also be
synchronous or asynchronous. In the first case the operations block until a message has been
delivered (or received). In the second case the operations return immediately. Yet another
possible variation involves the buffering of communication. In the buffered case, a message will
be stored if the receiver is not able to pick it up right away. In the unbuffered case the message
will be lost. There are also varying degrees of reliability of the communication. With reliable
communication errors are discovered and fixed transparently. This means that the processes can
assume that a message that is sent will actually arrive at the destination (as long as the
destination process is there to receive it).
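As a minimal sketch of the send()/receive() pair, the following Python fragment uses a connected socket pair; a real distributed system would use sockets over a network rather than a local pair.

```python
import socket

# A connected pair of sockets stands in for a network connection between
# a sender and a receiver (connection oriented communication).
sender, receiver = socket.socketpair()

sender.sendall(b"hello")        # send(): transmit a message
message = receiver.recv(1024)   # receive(): block until a message arrives

sender.close()
receiver.close()
```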
Communication Models
Client-Server
The client-server model is the most common and widely used model for communication between
processes. In this model one process takes on the role of a server, while all other processes take
on the roles of clients. The server process provides a service (e.g., a time service, a database
service, a banking service, etc.) and the clients are customers of that service. A client sends a
request to a server, the request is processed at the server and a reply is returned to the client. A
typical client-server application can be decomposed into three logical parts: the interface part, the
application logic part, and the data part. Implementations of the client-server model vary with
regards to how the parts are separated over the client and server roles. A thin client
implementation will provide a minimal user interface layer, and leave everything else to the
server. A fat client implementation, on the other hand, will include all of the user interface and
application logic in the client, and only rely on the server to store and provide access to data.
Implementations in between will split up the interface or application logic parts over the clients
and server in different ways.
Peer to Peer
Whereas the previous models have all assumed that different processes take on different roles in
the communication model, the peer to peer (P2P) model takes the opposite approach and
assumes that all processes play the same role and are therefore peers of each other. Each process
acts as both a client and a server, both sending out requests and processing incoming requests.
Group Communication
The group communication model provides a departure from the point to point style of
communication assumed so far. In this model of communication a process can send a single
message to a group of other processes. Group communication is often referred to as broadcast
(when a single message is sent out to everyone) and multicast (when a single message is sent out
to a predefined group of recipients). Group communication can be applied in any of the
previously discussed models. It is often used to send requests to a group of replicas, or to send
updates to a group of servers containing the same data. It is also used for service discovery (e.g.,
broadcasting a request saying “who offers this service?”) as well as event notification (e.g., to
tell everyone that the printer is on fire). Issues involved with implementing and using group
communication are similar to those involved with regular point-to-point communication. This
includes reliability and ordering. The issues are made more complicated because now there are
multiple recipients of a message and different combinations of problems may occur. A widely
implemented (but not as widely used) example of group communication is IP multicast.
Communication Abstractions
In the previous topic it was assumed that all processes explicitly send and receive messages (e.g.,
using send ( ) and receive ( )). Although this style of programming is effective and works, it is
not always easy to write correct programs using explicit message passing. In this section we will
discuss a number of communication abstractions that make writing distributed applications
easier. In the same way that higher level programming languages make programming easier by
providing abstractions above assembly language, so do communication abstractions make
programming in distributed systems easier. Some of the abstractions discussed attempt to
completely hide the fact that communication is taking place. While other abstractions do not
attempt to hide communication, all abstractions have in common that they hide the details of the
communication taking place. For example, the programmers using any of these abstractions do
not have to know what the underlying communication protocol is, nor do they have to know how
to use any particular operating system communication primitives. The abstractions discussed
below are often used as core foundations of most middleware systems. Using these abstractions,
therefore, generally involves using some sort of middleware framework. This brings with it a
number of the benefits of middleware, in particular the various services associated with the
middleware that tend to make a distributed application programmer’s life easier.
CLIENT-SERVER STUBS
Remote Procedure Call (RPC)
The idea behind a remote procedure call (RPC) is to replace the explicit message passing model
with the model of executing a procedure call on a remote node. A programmer using RPC simply
performs a procedure call, while behind the scenes messages are transferred between the client
and server machines.
In theory the programmer is unaware of any communication taking place.
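This transparency can be sketched with Python's standard xmlrpc module, which hides the message exchange behind an ordinary method call; the add() procedure here is a hypothetical service.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: register an ordinary procedure and serve it over the network.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
host, port = server.server_address
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: what looks like a local procedure call is carried out by
# marshalling the arguments, sending a request message and awaiting the reply.
proxy = ServerProxy(f"http://{host}:{port}")
result = proxy.add(2, 3)        # a remote procedure call

server.shutdown()
```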
An important part of marshalling is converting data into a format that can be understood by the
receiver. Generally, differences in format can be handled by defining a standard network format
into which all data is converted. However, this may be wasteful if the two communicating
machines already use the same internal format, since the data is then converted needlessly.
Here handle is some physical address (IP address, process ID, etc.) and UID is used to
distinguish between servers offering the same service. Moreover, it is important to include
version information, since the flexibility requirement for distributed systems means we must
deal with different versions of the same software in a heterogeneous environment.
Message Passing Interface (MPI) is an example of a MOM that is geared toward high
performance transient message passing. MPI is a message passing library that was designed for
parallel computing. It makes use of available networking protocols, and provides a huge array of
functions that basically perform synchronous and asynchronous send() and receive(). Another
example of MOM is MQ Series from IBM. This is an example of a message queuing system. Its
main characteristic is that it provides persistent communication.
In a message queuing system, messages are sent to other processes by placing them in queues. The
queues hold messages until an intended receiver extracts them from the queue and processes
them.
Communication in a message queuing system is largely asynchronous. The basic queue interface
is very simple. There is a primitive to append a message onto the end of a specified queue, and a
primitive to remove the message at the head of a specific queue. These can be blocking or
non-blocking. All messages contain the name or address of a destination queue. Messages can
only be added to and retrieved from local queues. Senders place messages in source queues,
while receivers retrieve messages from destination queues.
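The basic queue interface can be sketched with Python's standard queue module; here a single local queue stands in for the source and destination queues of a real message queuing system, and threads stand in for the communicating processes.

```python
import queue
import threading

mailbox = queue.Queue()   # holds messages until the receiver extracts them
results = []

def receiver():
    # Blocking remove: wait until a message is at the head of the queue.
    msg = mailbox.get()
    results.append(msg)

t = threading.Thread(target=receiver)
t.start()

mailbox.put("update #1")  # append a message onto the end of the queue
t.join()
```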
Stream abstraction
Whereas the previous communication abstractions dealt with discrete communication (that is
they communicated chunks of data), the Stream abstraction deals with continuous
communication, and in particular with the sending and receiving of continuous media. In
continuous media, data is represented as a single stream of data rather than discrete chunks (for
example, an email is a discrete chunk of data, a live radio program is not). The main
characteristic of continuous media is that besides a spatial relationship (i.e., the ordering of the
data), there is also a temporal relationship between the data. Film is a good example of
continuous media. Not only must the frames of a film must be played in the right order, they
must also be played at the right time, otherwise the result will be incorrect.
A stream is a communication channel that is meant for transferring continuous media. Streams
can be set up between two communicating processes, or possibly directly between two devices
(e.g., a camera and a TV). Streams of continuous media are examples of isochronous
communication that is communication that has minimum and maximum end-to-end time delay
requirements. When dealing with isochronous communication, quality of service is an important
issue. In this case quality of service is related to the time dependent requirements of the
communication. These requirements describe what is required of the underlying distributed
system so that the temporal relationships in a stream can be preserved. This generally involves
timeliness and reliability.
Quality of service requirements are often specified in terms of the parameters of a token bucket
model. In this model, tokens (permission to send a fixed number of bytes) are regularly generated
and stored in a bucket. An application wanting to send data removes the required number of
tokens from the bucket and then sends the data. If the bucket is empty, the application must wait
until more tokens are available. If the bucket is full, newly generated tokens are discarded. It is
often necessary to synchronize two or more separate streams. For example, when sending stereo
audio it is necessary to synchronize the left and right channels; likewise, when streaming video it
is necessary to synchronize the audio with the video.
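The token bucket model can be sketched as a small Python class; the rate and capacity values are illustrative, and token generation is driven by an explicit tick() rather than a real clock.

```python
class TokenBucket:
    """Token bucket: tokens accumulate at a fixed rate up to a capacity;
    sending data consumes tokens, and surplus tokens are discarded."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens generated per tick
        self.capacity = capacity    # bucket size
        self.tokens = 0

    def tick(self):
        # Regularly generated tokens; overflow beyond capacity is discarded.
        self.tokens = min(self.capacity, self.tokens + self.rate)

    def try_send(self, nbytes):
        # Remove the required tokens and send, or report that we must wait.
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

bucket = TokenBucket(rate=100, capacity=300)
bucket.tick()                       # 100 tokens available
sent = bucket.try_send(80)          # succeeds, 20 tokens left
blocked = bucket.try_send(50)       # fails: too few tokens, must wait
```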
The other approach is for the server to synchronize the streams. By multiplexing the sub streams
into a single data stream, the client simply has to demultiplex them and perform some
rudimentary synchronization.
Distributed processing can be loosely defined as the execution of co-operating processes which
communicate by exchanging messages across an information network. It means that the
infrastructure consists of distributed processors, enabling parallel
execution of processes and message exchanges. Communication and data exchange can be
implemented in several ways: shared memory, message exchange/passing and Remote
Procedure Call (RPC).
DEADLOCKS IN PROCESSING
In centralized deadlock detection, a central coordinator maintains the resource graph. If a cycle
is detected, the coordinator kills off a process to break the deadlock.
False deadlock: Delays in transmission in distributed systems can cause the system to think that
a cycle exists when the resources have in fact already been released. With strict two-phase
locking, a transaction cannot release data items and then obtain more, so this cannot occur. The
only remaining possibility is a transaction aborted while in deadlock.
Edge Chasing: Distributed approach to deadlock detection. Process 0 sends probe message
when waiting on Process 1. If Process 1 is waiting on resource(s) it forwards probe
message(s) to processes it is waiting on. If probe message returns to original sender, a cycle
is detected.
A probe message contains: the process that just blocked, the process sending the probe message
and the process to whom the probe is sent. In practice this happens in two steps: the transaction
coordinator indicates what the transaction waits for, and the server indicates who holds the data
item. Which transaction has high priority? The oldest, because it has run the longest.
Deadlock Prevention: Wound-wait (where -> means "waits on"). Old process -> young process:
the young process is killed. Young process -> old process: the young process waits.
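The wound-wait rule above can be sketched as a small decision function, using transaction timestamps to order processes (a smaller timestamp means older); the function name is illustrative.

```python
def wound_wait(requester_ts, holder_ts):
    """Decide what happens when the requester waits on the holder.
    A smaller timestamp means an older transaction.
    Returns "wound" (kill the young holder) or "wait"."""
    if requester_ts < holder_ts:
        # Old process waits on young process: the young holder is killed.
        return "wound"
    # Young process waits on old process: the young requester simply waits.
    return "wait"

print(wound_wait(1, 5))   # old requester, young holder
print(wound_wait(7, 2))   # young requester, old holder
```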
SYNCHRONIZATION OF PROCESSES
There are two main reasons why there is need for synchronization mechanisms:
Two or more processes may need to co-operate in order to accomplish a given task. This
implies that the operating mechanism must provide facilities for identifying co-operating
processes and synchronizing them.
Two or more processes may need to compete for access to shared services or resources.
The implication is that the synchronization mechanism must provide facilities for a
process to wait for a resource to become available and another process to signal the
release of that resource.
When processes are running on the same computer, synchronization is straightforward
since all processes use the same physical clock and can share memory. This can be done
using well-known techniques such as
1. Semaphores - used to provide mutually exclusive access to a non-sharable resource by
preventing concurrent execution of the critical region of a program through which the
non-sharable resource is accessed.
2. A Monitor is a collection of procedures, which may be executed by a collection of
concurrent processes. It protects its internal data from the users, and is a mechanism for
synchronizing access to the resources the procedures use. Since only the monitor can
access its private data, it automatically provides mutual exclusion between client
processes. Entry to the monitor by one process excludes entry by others.
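A monitor can be sketched in Python as a class whose public methods all acquire a private lock, so entry by one thread excludes entry by others; threads stand in here for the concurrent processes.

```python
import threading

class Counter:
    """A monitor-style object: private data plus a lock that gives
    mutually exclusive access to the procedures that touch it."""

    def __init__(self):
        self._lock = threading.Lock()   # entry by one excludes others
        self._value = 0                 # private data

    def increment(self):
        with self._lock:
            self._value += 1

    def value(self):
        with self._lock:
            return self._value

counter = Counter()
threads = [threading.Thread(target=lambda: [counter.increment() for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```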
Berkeley Algorithm: The Berkeley UNIX time daemon polls every machine periodically. The time
server computes the average time from stable machines, taking propagation time into account. The
time server then returns a delta (+ or -) to each machine, telling it how to adjust its clock. Requires
no external time source such as a WWV receiver.
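The averaging step of the Berkeley algorithm can be sketched as follows; the clock readings are illustrative and propagation delay is ignored in this simplified version.

```python
def berkeley_deltas(server_time, client_times):
    """Sketch of the Berkeley averaging step: the time daemon averages
    all clocks (its own included) and returns, per machine, the signed
    adjustment (+ or -) that moves that clock to the average."""
    clocks = [server_time] + client_times
    average = sum(clocks) / len(clocks)
    return [average - t for t in clocks]

# Server reads 180 s past the hour; clients read 170 s and 205 s.
# The average is 185, so the deltas are +5, +15 and -20.
deltas = berkeley_deltas(180.0, [170.0, 205.0])
```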
Decentralized Algorithm - All processors broadcast their current time every interval R. Each
processor discards the extreme values it receives and averages the remainder to obtain the
corrected time value.
ELECTION ALGORITHMS
Many distributed algorithms require one process to act as coordinator. An election selects the
coordinator. Elect a timeserver. Elect a mutual exclusion server. In general: elected coordinator
is process with highest process number.
Bully Algorithm:
The coordinator is ALWAYS the process with the highest process number. Process P notices the
coordinator is no longer responding to requests. P sends an ELECTION message to all processes
with higher numbers. Any higher process responds with OK and sends ELECTION messages to
all higher numbered processes; P then drops out. If no one responds, P becomes the coordinator
and sends a COORDINATOR message to all other processes announcing itself.
Ring Algorithm:
The idea is to see who has the highest process number in the ring. Process P notices the
coordinator is no longer responding to requests. P sends an ELECTION message, containing P's
process number, to the next process on the ring. Each process adds its own process number to
the ELECTION message and forwards it. When P receives its ELECTION message back, P sends
a COORDINATOR message around the ring listing the highest process number as the winner.
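The ring election can be sketched as a simulation in which the ELECTION message circulates once and collects every process number; the ring contents are illustrative.

```python
def ring_election(ring, initiator):
    """Sketch of the ring election algorithm. `ring` lists process numbers
    in ring order; the initiator circulates an ELECTION message that
    collects every process number, and the highest number wins."""
    n = len(ring)
    start = ring.index(initiator)
    election = []                       # the circulating ELECTION message
    for step in range(n):
        process = ring[(start + step) % n]
        election.append(process)        # each process adds its number
    coordinator = max(election)         # announced in COORDINATOR message
    return coordinator

print(ring_election([3, 7, 2, 9, 4], initiator=2))
```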
Pipes / Named Pipes - Perhaps the most primitive example is a synchronous filter mechanism,
for example the pipe mechanism in UNIX: ls -l | more. The commands ls and more run as two
concurrent processes, with the output of ls connected to the input of more; the overall effect is
to list the contents of the current directory one screen at a time.
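The same connection can be sketched with Python's subprocess module; here two Python one-liners stand in for ls and more so that the example is portable.

```python
import subprocess
import sys

# First process: emit some lines (standing in for `ls -l`).
producer = subprocess.Popen(
    [sys.executable, "-c", "print('a.txt'); print('b.txt')"],
    stdout=subprocess.PIPE)

# Second process: read those lines from its input (standing in for `more`).
consumer = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"],
    stdin=producer.stdout, stdout=subprocess.PIPE, text=True)

producer.stdout.close()          # let the consumer own the read end
output, _ = consumer.communicate()
```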
File sharing - An alternative mechanism is the use of a local file. This has the advantage that it
can handle large volumes of data and is well understood. This is the basis on which on-line
database systems are built. The major drawback is that there are no inherent synchronization
mechanisms between communicating processes to avoid corruption of state data, so
synchronization mechanisms such as file and record locking are used to allow concurrent
processes to communicate while preserving data consistency. Secondly, communication is
inefficient since it uses a relatively slow medium.
Shared Memory - Since all processes are local, the computer’s RAM can be used to implement a
shared memory facility. A common region of memory addressable by all concurrent processes is
used to define shared variables which are used to pass data or for synchronization purposes.
Processes must use semaphores, monitors or other techniques for synchronization purposes. A
good example of a shared memory mechanism is the clipboard facility.
SYNCHRONIZATION MODELS
Unicasting – This involves sending a separate copy of the message to each member. An implicit
assumption is that the sender knows the address of every member in the group. This may not be
possible in some systems. In the absence of more sophisticated mechanisms, a system may resort
to unicasting if member addresses are known. The number of network transmissions is
proportional to the number of members in the group.
Multicasting – In this model a single message with a group address can be used for routing
purposes. When a group is first created it is assigned a unique group address. When a member is
added to the group, it is instructed to listen for messages stamped with the group address as well
as for its own unique address. This is an efficient mechanism since the number of network
transmissions is significantly less than for unicasting.
Broadcasting – Broadcast the message by sending a single message with a broadcast address.
The message is sent to every possible entity on the network, and every entity must read the
message and determine whether to act on it or discard it. This may be appropriate when the
addresses of members are not known, since most network protocols implement a broadcast
facility. However, if messages are broadcast frequently and there is no efficient network
broadcast mechanism, the network becomes saturated.
In some cases a group message must be received by all group members or by none at all; group
communication in this case is said to be atomic. Achieving atomicity in the presence of failures
is difficult, resulting in many more messages being sent. Another aspect of group communication
is the ordering of group messages. For example, in a computer conferencing system a user would
expect to receive the original news item before any response to that item. This is known as
ordered multicast, and the requirement that all multicasts be received in the same order by all
group members is common in distributed systems. Atomic multicasting does not guarantee that
all messages will be received by the group members in the order they were sent.
Indirect communication - Here the destination and source identifiers are not process identifiers.
Instead, a port (also known as a mailbox) is specified, which represents an abstract object at
which messages are queued. Potentially, any process can write to or read from a port. To send a
message to a process, the sending process simply issues a send operation specifying a
well-known port number that is associated with the destination process. To receive the message,
the recipient simply issues a receive specifying the same port number. For example:
Send (message, destination_port)
Receive (message, source_port)
Security constraints can be introduced by allowing the owning process to specify access control
rights on a port. Messages are not lost provided the queue size is adequate for the rate at which
messages are being queued and de-queued.
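The port/mailbox abstraction can be sketched with in-process queues (the port table, the port number 80, and the message contents are all invented for the example; a real system would route between machines):

```python
import queue

# Hypothetical port table: each well-known port number maps to a message
# queue, the "mailbox" at which messages wait until they are received.
ports = {}

def create_port(port_number, capacity=16):
    ports[port_number] = queue.Queue(maxsize=capacity)

def send(message, destination_port):
    # Messages are queued; they are lost only if the queue overflows.
    ports[destination_port].put(message, block=False)

def receive(source_port):
    # Messages are delivered in FIFO order.
    return ports[source_port].get(block=False)

create_port(80)
send("GET /index.html", 80)
send("GET /about.html", 80)
print(receive(80))
```

Note how the sender never names the receiving process, only the port; this is what allows access-control rights to be attached to the port rather than to processes.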
RPC is popular for developing distributed systems because it looks and behaves like a well-
understood, conventional procedure call in high-level languages. A remote procedure call
proceeds in the following ten steps:
1. The client procedure calls the client stub in the normal way.
2. The client stub builds a message and calls the local operating system.
3. The client’s OS sends the message to the remote OS.
4. The remote OS gives the message to the server stub.
5. The server stub unpacks the parameters and calls the server.
6. The server does the work and returns the result to the stub.
7. The server stub packs it in a message and calls its local OS.
8. The server’s OS sends the message to the client’s OS.
9. The client’s OS gives the message to the client stub.
10. The stub unpacks the result and returns to the client.
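The ten steps above can be condensed into a sketch. Everything here (the `add` procedure, the JSON wire format, the stub names) is invented for illustration, and the "network" is a direct function call standing in for the two operating systems:

```python
import json

# --- server side ---
def add(a, b):                      # the actual remote procedure
    return a + b

def server_stub(message):
    # Steps 5-7: unpack the parameters, call the server, pack the result.
    request = json.loads(message)
    result = add(*request["args"])
    return json.dumps({"result": result})

# --- client side ---
def send_to_server(message):
    # Steps 3-4 and 8-9: stand-in for the operating systems moving the
    # message across the network (a direct call in this sketch).
    return server_stub(message)

def client_stub(*args):
    # Steps 2 and 10: build a request message, then unpack the reply.
    reply = send_to_server(json.dumps({"proc": "add", "args": args}))
    return json.loads(reply)["result"]

# Step 1: the client calls the stub exactly like a local procedure.
print(client_stub(2, 3))  # 5
```

The point of the stubs is that neither the client procedure nor the server procedure contains any communication code; all of it is hidden in the stubs.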
Stub Generation - Once the RPC protocol has been completely defined, the client and server
stubs need to be implemented. Fortunately, stubs for the same protocol but different procedures
generally differ only in their interface to the applications. An interface consists of a collection of
procedures that can be called by a client, and which are implemented by a server. An interface is
generally available in the same programming language as the one in which the client or server is
written (although this is, strictly speaking, not necessary). To simplify matters, interfaces are
often specified by means of an Interface Definition Language (IDL).
MARSHALLING
Marshalling is the process of converting data from a machine's internal representation to a
standard representation before transmission, and converting it at the other end from the standard
representation back to the receiving machine's internal representation. Marshalling is
complicated by the use of global variables and pointers, as these only have meaning in the
client's address space; client and server processes run in different address spaces on separate
machines. One solution is to pass the data values held by global variables or pointed to by the
pointers. However, there are cases where this will not work out, for example when a linked-list
data structure is being passed to a procedure that manipulates the list. Differences in the
representation of data can be overcome by use of an agreed external data representation.
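A small marshalling sketch using Python's `struct` module, where network (big-endian) byte order plays the role of the agreed standard representation (the record layout of an integer plus a float is invented for the example):

```python
import struct

# "!" selects the agreed network (big-endian) representation,
# independent of either machine's native byte order.
WIRE_FORMAT = "!id"   # a 32-bit integer followed by a 64-bit float

def marshal(count, price):
    # Convert from the local representation to the wire representation.
    return struct.pack(WIRE_FORMAT, count, price)

def unmarshal(data):
    # Convert from the wire representation back to local values.
    return struct.unpack(WIRE_FORMAT, data)

wire = marshal(42, 9.99)
print(unmarshal(wire))  # (42, 9.99)
```

Note that only values cross the wire: a pointer or a linked list would first have to be flattened into values like these, which is exactly the complication described above.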
If the client’s message gets lost, the client will wait forever unless a time-out error-detection
mechanism is employed. If the client process fails, the server will carry out the remote
operation unnecessarily. If the operation involves updating a data value, this can lead to a
loss of data integrity. Furthermore, the server would generate a reply to a client process that no
longer exists; this reply must be discarded by the client’s machine. When the client restarts, it
may send the request again, causing the server to execute the operation more than once. A
similar situation arises when the server crashes. The server could crash just prior to the
execution of the remote operation, or just after execution completes but before a reply to the
client is generated. In either case, clients will time out and continually generate retries until
either the server restarts or the retry limit is met.
RMI Applications - RMI is the equivalent of RPC commonly used in middleware based on
distributed objects model. RMI applications are often comprised of two separate programs: a
server and a client. A typical server application creates some remote objects, makes references to
them accessible, and waits for clients to invoke methods on these remote objects. A typical client
application gets a remote reference to one or more remote objects in the server and then invokes
methods on them. RMI provides the mechanism by which the server and the client communicate
and pass information back and forth. Such an application is sometimes referred to as a
distributed object application.
CONSISTENCY MODELS
Strict Consistency: The ideal programming model: any read of a memory location X returns
the value stored by the most recent write operation to X. Nearly impossible to implement
in a distributed system; easy on a parallel system or a single system with multiple
threads/processes.
Sequential Consistency: Any valid interleaving is acceptable, but all processes must see
the same sequence of memory references.
Causal Consistency: Based on the happens-before order. All processors agree on the order of
writes issued by processor X, and a read by processor Y must occur after the write by
processor X whose value it observes. Concurrent (non-causal) writes may be seen in a
different order on different machines.
Pipelined RAM (PRAM) Consistency: Writes from different processes may be seen in a
different order, but all processors agree on the order of writes issued by any single
processor X.
Weak Consistency: Programmer uses synchronization method to update data.
Synchronization methods may include: critical section, mutual exclusion -mutex, or
barrier: all processes must arrive at barrier before any can continue
Release Consistency: Shared data are made consistent when a critical region is exited.
Multiple data may be associated with one critical section
Entry Consistency: Shared data is made consistent upon entering a critical region. One
synchronization variable associated with each data object, Multiple data variables can be
updated at a time by different processes
OBJECT-BASED DSM
An object includes attributes such as the object state (internal data) and methods (operations),
and uses information hiding. The shared memory is treated as a collection of separate objects
instead of a linear address space.
MEMO - MEMO is a filing package or organizational package which coordinates data
and tasks between processes. Perfect for job jar allocation schemes.
Caching - Used when CPUs share the same physical memory. Since the cache is
faster than memory, it reduces bus accesses for the CPUs that share memory; this works
with fewer than 64 CPUs.
NUMA (Non-Uniform Memory Access) Multiprocessors: all memories are glued
together to create one real address space. Access to remote memory is possible,
though accessing remote memory is slower than accessing local memory and no
caching of remote data is allowed.
One of the central and unique features of RMI is its ability to download the bytecodes (or simply
code) of an object's class if the class is not defined in the receiver's virtual machine. The types
and the behavior of an object, previously available only in a single virtual machine, can be
transmitted to another, possibly remote, virtual machine. RMI passes objects by their true type,
so the behavior of those objects is not changed when they are sent to another virtual machine.
This allows new types to be introduced into a remote virtual machine, thus extending the
behavior of an application dynamically.
DISTRIBUTED PROCESSING
Distributed processing can be loosely defined as the execution of co-operating processes which
communicate by exchanging messages across an information network. It means that the
infrastructure consists of distributed processors, enabling parallel execution of processes.
[Figure: the general organization of an Internet search engine into three different layers]
BASIC CONCEPTS
A name is the fundamental concept underlying naming. We define a name as a string of
bits or characters that is used to refer to an entity. An entity in this case is any resource,
user, process, etc. in the system.
Entities are accessed by performing operations on them; the operations are performed at
an entity’s access point. An access point is also referred to by a name, we call an access
point’s name an address. Entities may have multiple access points and may therefore
have multiple addresses. Furthermore an entity’s access points may change over time
(that is an entity may get new access points or lose existing ones), which means that the
set of an entity’s addresses may also change.
A pure name is a name that consists of an uninterpreted bit pattern that does not encode
any of the named entity’s attributes.
A non-pure name, on the other hand, does encode entity attributes (such as an access
point address) in the name.
An identifier is a name that uniquely identifies an entity. An identifier refers to at most
one entity and an entity is referred to by at most one identifier. Furthermore an identifier
can never be reused, so that it will always refer to the same entity. Identifiers allow for
easy comparison of entities; if two entities have the same identifier then they are the same
entity. Pure names that are also identifiers are called pure identifiers.
Location independent names are names that are independent of an entity’s address. They
remain valid even if an entity moves or otherwise changes its address. Note that pure
names are always location independent, though location independent names do not have
to be pure names.
System-Oriented Names
System-oriented names are usually implemented as one or more fixed-sized numerals to facilitate
efficient handling. Moreover, they typically need to be unique identifiers and may be sparse to
convey access rights (e.g., capabilities). Depending on whether they are globally or locally
unique, we also call them structured or unstructured.
Human-Oriented Names
In many systems, the most important attribute bound to a human-oriented name is the system-
oriented name of the object. All further information about the entity is obtained via the system-
oriented name. This enables the system to perform the usually costly resolution of the human-
oriented name just once and implement all further operations on the basis of the system-oriented
name (which is more efficient to handle). Often a whole set of human-oriented names is mapped
to a single system-oriented name (symbolic links, relative addressing, and so on).
As an example of all this, consider the naming of files in UNIX. A pathname is a human-oriented
name that, by means of the directory structure of the file system, can be resolved to an inode
number, which is a system-oriented name. All attributes of a file are accessible via the inode
(i.e., the system-oriented name). By virtue of symbolic and hard links, multiple human-oriented
names may refer to the same inode, which makes equality testing of files merely by their
human-oriented names impossible. The design space for human-oriented names is considerably
wider than that for system-oriented names; as such, naming systems for human-oriented names
usually require considerably greater implementation effort.
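The UNIX example can be demonstrated directly: two human-oriented names (paths) resolve to the same system-oriented name (inode number). The file names below are made up for the sketch, and it assumes a file system that supports hard links:

```python
import os
import tempfile

# Two human-oriented names (paths) resolving to one system-oriented
# name (an inode number), which is why testing file equality by
# path alone fails.
d = tempfile.mkdtemp()
original = os.path.join(d, "notes.txt")
alias = os.path.join(d, "alias.txt")
with open(original, "w") as f:
    f.write("hello")
os.link(original, alias)            # create a hard link

print(os.stat(original).st_ino == os.stat(alias).st_ino)  # True
```

The two paths differ as strings, yet `st_ino` shows they name the same underlying file.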
NAME SPACES
Names are grouped and organized into name spaces. A structured name space is
represented as a labeled directed graph, with two types of nodes. A leaf node represents a
named entity and stores information about entity. The information could include the
entity itself, or a reference to the entity (e.g., an address).
A directory node (also called a context) is an inner node and does not represent any
single entity. Instead it stores a directory table, containing (node - id, edge - label) pairs,
that describes the node’s children. A leaf node only has incoming edges, while a directory
node has both incoming and outgoing edges. A third kind of node, a root node, is a
directory node with only outgoing edges.
A structured name space can be strictly hierarchical or can form a directed acyclic graph
(DAG). In a strictly hierarchical name space a node will only have one incoming edge. In
a DAG name space any node can have multiple incoming edges. It is also possible to
have name spaces with multiple root nodes.
Scalable systems usually use hierarchically structured name spaces. A sequence of edge
labels leading from one node to another is called a path name.
A path name is used to refer to a node in the graph. An absolute path name always starts
from a root node, a relative path name is any path name that does not start at the root
node.
In this case the leaf node implicitly refers to the file named by the pathname. Ideally we would
have a global, homogeneous name space that contains names for all entities used. However, we
are often faced with the situation where we already have a collection of name spaces that have to
be combined into a larger name space. One approach is to simply create a new name that
combines names from the other name spaces. For example, a Web URL
http://www.raiuniversity.edu/-cs9243/naming-slides.ps globalizes the local name
~cs9243/naming-slides.ps by adding the context www.raiuniversity.edu. Unfortunately, this
approach often compromises location transparency—as is the case in the example of URLs.
Another example of the composition of name spaces is mounting a name space onto a mount
point in a different (external) name space. This approach is often applied to merge file systems
(e.g., mounting a remote file system onto a local mount point). In terms of a name space graph,
mounting requires one directory node to contain information about another directory node in the
external name space. This is similar to the concept of soft linking, except that in this case the link
is to a node outside of the name space. The information contained in the mount point node must,
therefore, include information about where to find the external name space.
NAME RESOLUTION
The process of determining what entity a name refers to is called name resolution. Resolving a
name results in a reference to the entity that the name refers to. Resolving a name in a name
space often results in a reference to the node that the name refers to. Path name resolution is a
process that starts with the resolution of the first element in the path name, and ends with
resolution of the last element in the name. There are two approaches to this process, iterative
resolution and recursive resolution.
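The two approaches can be sketched on a toy name space (the name space contents, labels, and address below are all invented for the example; a real resolver would contact a different name server at each step):

```python
# A toy hierarchical name space: each directory node maps an edge label
# to a child node; leaf nodes store the entity's address.
name_space = {
    "edu": {"raiuniversity": {"www": "192.0.2.17"}},
}

def resolve_iterative(path):
    # Iterative resolution: the resolver itself walks the chain,
    # contacting one node at a time.
    node = name_space
    for label in path:
        node = node[label]
    return node

def resolve_recursive(node, path):
    # Recursive resolution: each node resolves the remainder of the
    # name on the caller's behalf and returns the final result.
    if not path:
        return node
    return resolve_recursive(node[path[0]], path[1:])

print(resolve_iterative(["edu", "raiuniversity", "www"]))
print(resolve_recursive(name_space, ["edu", "raiuniversity", "www"]))
```

Both calls return the same address; the difference lies in who does the walking, which matters for caching and for load on the name servers.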
NAMING SERVICE
A naming service is a service that provides access to a name space allowing clients to perform
operations on the name space. These operations include adding and removing directory or leaf
nodes, modifying the contents of nodes and looking up names. The naming service is
implemented by name servers. Name resolution is performed on behalf of clients by resolvers. A
resolver can be implemented by the client itself, in the kernel, by the name server, or as a
separate service.
Typically, a client does not directly converse with a name server, but delegates this to a local
resolver that may use caching to improve performance. Each of the name servers stores one or
more naming contexts, some of which may be replicated. We call the name servers storing
attributes of an object this object’s authoritative name servers.
In the case of a hierarchical name space, partial sub trees (often called zones) may be maintained
by a single server. In the case of the Internet Domain Name Service (DNS), this distribution also
matches the physical distribution of the network. Each zone is associated with a name prefix that
leads from the root
to the zone. Now, each node maintains a prefix table (essentially, a hint cache for name servers
corresponding to zones) and, given a name, the server corresponding to the zone with the longest
matching prefix is contacted. If it is not the authoritative name server, the next zone’s prefix is
broadcast to obtain the corresponding name server (and update the prefix table). As an
alternative to broadcasting, the contacted name server may be able to provide the address of the
authoritative name server for this zone. This scheme can be efficiently implemented, as the
prefix table can be relatively small and, on average, only a small number of messages are needed
for name resolution. Consistency of the prefix table is checked on use, which removes the need
for explicit update messages. For smaller systems, a simpler structure-free distribution scheme
may be used. In this scheme contexts can be freely placed on the available name servers (usually,
however, some distribution policy is in place). Name resolution starts at the root and has to
traverse the complete resolution chain of contexts. This is easy to reconfigure and, for example,
used in the standard naming service of CORBA.
A name cache can be implemented as a process-local cache, which lives in the address space of
the client process. Such a cache does not need many resources, as it typically will be small in
size, but much of the information may be duplicated in other processes. More seriously, it is a
short-lived cache and incurs a high rate of start-up misses, unless a scheme such as cache
inheritance is used, which propagates cache information from parent to child processes. The
alternative is a kernel cache, which avoids duplicate entries and excessive start-up misses, but
access to a kernel cache is slower and it takes up valuable kernel memory. Alternatively, a shared
cache can be located in a user space cache process that is utilized by clients directly or by
redirection of queries via the kernel (the latter is used in the
CODA file system).
ATTRIBUTE-BASED NAMING
Whereas names as described above encode at most one attribute of the named entity (e.g., a
domain name encodes the entity’s administrative or geographical location) in attribute-based
naming an entity’s name is composed of multiple attributes. An example of an attribute-based
name is given below:
/C=AU/O=UNSW/OU=CSE/CN=WWW.server/Hardware=Sparc/OS=Solaris/Server=Apache
The name not only encodes the location of the entity (/C=AU/O=UNSW/OU=CSE, where C is
the attribute country, O is organization, and OU is organizational unit - these are standard
attributes in X.500 and LDAP), it also identifies it as a Web server, and provides information
about the hardware that it runs on, the operating system running on it, and the software used.
Although an entity’s attribute-based name contains information about all attributes, it is common
to also define a distinguished name (DN), which consists of a subset of the attributes and
uniquely identifies the entity.
Caching / Buffering
There are four places to store files: the server’s disk, the server’s memory, the client’s disk, and
the client’s memory.
Server disk - Advantages: plenty of space; files are accessible to all clients; no consistency
problems, since there is only one copy.
Disadvantage: every read requires a transfer from the server’s disk to the client’s memory.
Client Cache:
Advantage: reduces network traffic and the delays in accessing files.
Disadvantages: more complex; potential for different versions of files in client nodes.
Implementation: there are three locations for caching: within the process (no shared cache, e.g.
a database client); in the kernel (processes share the cache, but a kernel call is needed to access
it); or in a separate user-level cache process shared by clients.
REPLICATION
Goal: replication transparency, i.e. provide backup and split the workload.
Architecture:
Client program: performs reads and/or writes.
Front end: communicates with replica managers and hides from the client program how
replication is maintained. Implemented either as a user package executed in each client or as a
separate process; talks with one or multiple replica managers.
Replica Manager: holds a copy of the data, and performs direct reads/writes on it.
METHODS OF REPLICATION:
Explicit file replication: copy
Lazy file replication (=Gossip): updates occur in background
Group communication: WRITES occur simultaneously to all servers.
Primary copy: the primary updates the secondary replicated files.
Advantage: simple for the programmer. Disadvantage: recovery upon primary failure.
Implementation: read from a secondary or the primary, write to the primary only, and elect a
secondary as the new primary upon primary failure. Example: Network Information Service (NIS)
Totally ordered updates: solves the problem of updates arriving out of order.
All requests are sent to a sequencer process, which assigns consecutive sequence numbers
and forwards the requests to the replica managers. All replica managers then process requests
in the same order.
Problem: the sequencer can fail or become a bottleneck.
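A sketch of the sequencer scheme (the class names, the queues standing in for network channels, and the update strings are all invented for the example):

```python
import itertools
import queue

class Sequencer:
    # Assigns consecutive sequence numbers to incoming write requests so
    # that every replica manager applies them in the same total order.
    def __init__(self, replica_queues):
        self.counter = itertools.count(1)
        self.replicas = replica_queues

    def submit(self, update):
        stamped = (next(self.counter), update)
        for q in self.replicas:       # forward to every replica manager
            q.put(stamped)

class ReplicaManager:
    def __init__(self, inbox):
        self.inbox, self.log = inbox, []

    def drain(self):
        # Apply requests in sequence-number order.
        while not self.inbox.empty():
            self.log.append(self.inbox.get())

queues = [queue.Queue() for _ in range(3)]
seq = Sequencer(queues)
for update in ["set x=1", "set x=2"]:
    seq.submit(update)
managers = [ReplicaManager(q) for q in queues]
for m in managers:
    m.drain()
print(managers[0].log == managers[1].log == managers[2].log)  # True
```

All three replica logs are identical because every update passed through the single sequencer, which is also why that sequencer is the scheme's single point of failure and potential bottleneck.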
ATOMIC TRANSACTIONS
Atomic Transaction - The effect of performing any single operation is free from interference
from concurrent operations being performed in other threads.
If the transaction does not complete, all previous operations within the transaction are backed
out.
Aspects of Atomicity:
All-or-nothing: all operations in an AT are completed or rolled back to the initial state.
Failure atomicity: Effects are atomic even when server fails.
Durability: Completed transactions are saved in permanent storage.
Isolation: Each transaction performed w/o interference from other transactions.
TRANSACTION IMPLEMENTATION
Clients may use a server to share resources. Good design techniques include: the server holds
requests for service until the resource becomes available, and the server uses a new thread for
each client request.
Lamport Timestamps
To synchronize logical clocks, Lamport defined a relation called “happens-before”. The
expression a → b is read “a happens before b” and means that all processes agree that first event
a occurs, then afterward, event b occurs. The “happens-before” relation can be observed directly
in two situations:
1. If a and b are events in the same process, and a occurs before b, then a → b is true.
2. If a is the event of a message being sent by one process, and b is the event of that message
being received by another process, then a → b is also true.
A message cannot be received before it is sent, or even at the same time it is sent, since it takes a
finite, nonzero amount of time to arrive. Happens-before is a transitive relation, so if a → b
and b → c, then a → c. If two events, x and y, happen in different processes that do not exchange
messages (not even indirectly via third parties), then x → y is not true, but neither is y → x.
These events are said to be concurrent, meaning that nothing can be said about when they
happened or which happened first.
What we need is a way of measuring time such that for every event, a, we can assign it a time
value C(a) on which all processes agree. These time values must have the property that if a → b,
then C(a) < C(b). To rephrase the conditions we stated earlier: if a and b are two events within
the same process and a occurs before b, then C(a) < C(b). Similarly, if a is the sending of a
message by one process and b is the reception of that message by another process, then C(a) and
C(b) must be assigned in such a way that everyone agrees on the values of C(a) and C(b) with
C(a) < C(b). In addition, the clock time, C, must always go forward (increasing), never backward
(decreasing). Corrections to time can be made by adding a positive value, never by subtracting
one.
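The rules above translate into a short Lamport clock implementation (the class and method names are invented for the sketch; the correction rule in `receive` is the "add a positive value, never subtract" requirement):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                 # a local event
        self.time += 1
        return self.time

    def send(self):                 # stamp an outgoing message
        return self.tick()

    def receive(self, msg_time):
        # Corrections only ever move the clock forward: jump past the
        # sender's timestamp so that C(send) < C(receive) always holds.
        self.time = max(self.time, msg_time) + 1
        return self.time

p, q = LamportClock(), LamportClock()
p.tick()                # event a on P: C(a) = 1
t = p.send()            # event b on P: C(b) = 2, message stamped 2
c = q.receive(t)        # event c on Q: C(c) = max(0, 2) + 1 = 3
print(t < c)            # True: the send is timestamped before the receive
```

Even though Q's clock had never ticked, the `max(...) + 1` step forces C(c) past the message's timestamp, preserving the property C(a) < C(b) for the send/receive pair.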
Global State
Determining global properties in a distributed system is often difficult, but crucial for some
applications. For example, in distributed garbage collection, we need to be able to determine for
some object whether it is referenced by any other objects in the system. Deadlock detection
requires detection of cycles of processes infinitely waiting for each other. To detect the
termination of a distributed algorithm we need to obtain simultaneous knowledge of all involved
process as well as take account of messages that may still traverse the network. In other words, it
is not sufficient to check the activity of all processes. Even if all processes appear to be passive,
there may be messages in transition that, upon arrival, trigger further
activity. In the following, we are concerned with determining stable global states or properties
that, once they occur, will not disappear without outside intervention. For example, once an
object is no longer referenced by any other object (i.e., it may be garbage collected), no reference
to the object can appear at a later time.
The only hurdle to scalability is the use of multicasts (i.e., all processes have to be contacted in
order to enter a critical section). More scalable variants of this algorithm require each individual
process to only contact subsets of its peers when wanting to enter a critical section.
Unfortunately, failure of any peer process can deny all other processes entry to the critical
section.
TRANSACTIONS
A transaction can be regarded as a set of server operations that are guaranteed to appear atomic
in the presence of multiple clients and partial failure. The concept of a transaction originates
from the database community as a mechanism to maintain the consistency of databases.
Transaction management is built around two basic operations: BeginTransaction and
EndTransaction.
TRANSACTION IMPLEMENTATION
Two general strategies exist for the implementation of transactions:
Private Workspace - All tentative operations are performed on a shadow copy of the
server state, which is atomically swapped with the main copy on Commit or discarded on
abort.
Write ahead Log - Updates are performed in-place, but all updates are logged and
reverted when a transaction aborts.
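The write-ahead log strategy can be sketched as follows (the `Transaction` class and the key/value store are invented for the example; a real log would be written to stable storage before the in-place update):

```python
class Transaction:
    # Write-ahead log sketch: updates are performed in place, but the
    # old value is logged first so that an Abort can revert them.
    def __init__(self, store):
        self.store, self.log = store, []

    def write(self, key, value):
        self.log.append((key, self.store.get(key)))  # log the old value first
        self.store[key] = value                      # then update in place

    def abort(self):
        # Undo the logged updates in reverse order.
        for key, old in reversed(self.log):
            if old is None:
                self.store.pop(key, None)            # key did not exist before
            else:
                self.store[key] = old

store = {"x": 1}
t = Transaction(store)
t.write("x", 99)
t.write("y", 7)
t.abort()
print(store)  # {'x': 1}
```

After the abort the store is back to its initial state, illustrating the all-or-nothing property; on a Commit the log would simply be discarded.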
Concurrency in Transactions
It is often necessary to allow transactions to occur simultaneously (for example, to allow
multiple travel agents to simultaneously reserve seats on the same flight). Due to the consistency
and isolation properties of transactions concurrent transaction must not be allowed to interfere
with each other. Concurrency control algorithms for transactions guarantee that multiple
transactions can be executed simultaneously while providing a result that is the same as if they
were executed one after another. A key concept when discussing concurrency control for
transactions is the serialization of conflicting operations. Recall that conflicting operations are
those operations that operate on the same data item and whose combined effects depend on the
order they are executed in. We define a schedule of operations as
an interleaving of the operations of concurrent transactions. A legal schedule is one that provides
results that are the same as though the transactions were serialized (i.e., performed one after
another). This leads to the concept of serial equivalence. A schedule is serially equivalent if all
conflicting operations are performed in the same order on all data items. For example, given two
transactions T1 and T2 in a serially equivalent schedule, of all the pairs of conflicting
operations the first operation will be performed by T1 and the second by T2 (or vice versa: of all
the pairs the first is performed by T2 and the second by T1). There are three types of concurrency
control algorithms for transactions: those using locking, those using timestamps, and those using
optimistic algorithms.
Locking
The locking algorithms require that each transaction obtains a lock from a scheduler process
before performing a read or a write operation. The scheduler is responsible for granting and
releasing locks in such a way that legal schedules are produced. The most widely used locking
approach is two-phase locking (2PL). In this approach a lock for a data item is granted to a
process if no conflicting locks are held by other processes (otherwise the process requesting the
lock is blocked until the conflicting locks are released).
Timestamp Ordering
A different approach to creating legal schedules is to timestamp all operations and ensure that
operations are ordered according to their timestamps. In this approach each transaction receives a
unique timestamp and each operation receives its transaction’s timestamp. Each data item also
has three timestamps – the timestamp of the last committed write, the timestamp of the last read,
and the timestamp of the last tentative (noncommittal) write. Before executing a write operation
the scheduler ensures that the operation’s time stamp is both greater than the data item’s write
timestamp and greater than or equal to the data item’s read timestamp. For read operations the
operation’s time stamp must be greater than the data item’s write timestamps (both committed
and tentative). When scheduling conflicting operations the operation with a lower timestamp is
always executed first.
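The scheduler's rules can be sketched as below. This is a simplified sketch that tracks only the committed-write and read timestamps, omitting the tentative-write timestamp described above; the class and function names are invented:

```python
class DataItem:
    def __init__(self):
        self.write_ts = 0    # timestamp of the last committed write
        self.read_ts = 0     # timestamp of the last read

def try_write(item, ts):
    # A write is accepted only if its timestamp is greater than the
    # item's write timestamp and >= its read timestamp; otherwise the
    # writing transaction arrived "too late" and must be rejected.
    if ts > item.write_ts and ts >= item.read_ts:
        item.write_ts = ts
        return True
    return False

def try_read(item, ts):
    # A read is accepted only if its timestamp exceeds the item's
    # write timestamp.
    if ts > item.write_ts:
        item.read_ts = max(item.read_ts, ts)
        return True
    return False

x = DataItem()
print(try_write(x, 2))   # True: transaction 2 writes x
print(try_read(x, 1))    # False: transaction 1 is too old to read it
```

A rejected operation typically causes the transaction to abort and restart with a fresh (larger) timestamp, which is how conflicting operations end up executing in timestamp order.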
Optimistic Control
Both locking and time stamping incur significant overhead. The optimistic approach to
concurrency control assumes that no conflicts will occur, and therefore only tries to detect and
resolve conflicts at commit time. In this approach a transaction is split into three phases, a
working phase (using shadow copies), a validation phase, and an update phase. In the working
phase operations are carried out on shadow copies with no attempt to detect or order conflicting
operations. In the validation phase the scheduler attempts to detect conflicts with other
transactions that were in progress during the working phase. If conflicts are detected, then one of
the conflicting transactions is aborted. In the update phase, assuming that the transaction was
not aborted, all the updates made on the shadow copy are made permanent.
DISTRIBUTED TRANSACTIONS
In contrast to transactions in the sequential database world, transactions in a distributed setting
are complicated because a single transaction will usually involve multiple servers. Multiple
servers may involve multiple services and files stored on different servers. To ensure the
atomicity of transactions, all servers involved must agree whether to Commit or Abort.
Moreover, the use of multiple servers and services may require nested transactions, where a
transaction is implemented by way of multiple other transactions, each of which can
independently Commit or Abort.
Transactions that span multiple hosts include one host that acts as the coordinator, which is the
host that handles the initial BeginTransaction. This coordinator maintains a list of workers, the
other servers taking part in the transaction.
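The agreement between the coordinator and its workers is commonly realized with a two-phase commit protocol; a sketch, assuming reliable messaging and no failures (the `Worker` class and method names are invented for the example):

```python
def two_phase_commit(workers):
    # Phase 1 (voting): the coordinator asks every worker to prepare;
    # each replies with its vote to commit or abort.
    votes = [w.prepare() for w in workers]
    decision = "commit" if all(votes) else "abort"
    # Phase 2 (completion): the unanimous decision is sent to everyone.
    for w in workers:
        w.finish(decision)
    return decision

class Worker:
    def __init__(self, can_commit=True):
        self.can_commit, self.state = can_commit, "active"

    def prepare(self):
        return self.can_commit       # vote: ready to commit?

    def finish(self, decision):
        self.state = "committed" if decision == "commit" else "aborted"

print(two_phase_commit([Worker(), Worker()]))                   # commit
print(two_phase_commit([Worker(), Worker(can_commit=False)]))   # abort
```

A single "no" vote in phase 1 forces every worker to abort, which is exactly the all-or-nothing agreement the text requires; handling coordinator failure between the two phases is the hard part that this sketch omits.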
COORDINATION ELECTIONS
Various algorithms require a set of peer processes to elect a leader or coordinator. In the
presence of failure, it can be necessary to determine a new leader if the present one fails to
respond. Provided that all processes have a unique identification number, leader election can be
reduced to finding the non-crashed process with the highest identifier. Any algorithm to
determine this process needs to meet the
following two requirements:
Safety: A process either does not know the coordinator, or it knows the identifier of the
process with the largest identifier.
Liveness: Eventually, a process crashes or knows the coordinator.
BULLY ALGORITHM
The following algorithm was proposed by Garcia-Molina and uses three types of messages:
Election: Announce election
Answer: Response to an election
Coordinator: Elected coordinator announces itself.
A process begins an election when it notices through a timeout that the coordinator has failed or
receives an Election message. When starting an election, a process sends an Election message to
all higher-numbered processes. If it receives no Answer within a predetermined time bound, the
process considers itself the winner and announces this by sending a Coordinator message to all
other processes. If an Answer does arrive, it waits for a Coordinator message from the eventual
winner.
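The election logic can be simulated synchronously in a few lines. This sketch abstracts away real messages, timeouts, and concurrent elections, and assumes the set of live process identifiers is known.

```python
# Minimal synchronous simulation of the bully algorithm's outcome.
def bully_election(starter, alive_ids):
    """Return the elected coordinator: the highest-numbered live process.

    starter   -- id of the process that noticed the coordinator failed
    alive_ids -- set of ids of processes that have not crashed
    """
    candidate = starter
    while True:
        higher = [p for p in alive_ids if p > candidate]
        if not higher:
            # No Answer arrives from any higher-numbered process, so the
            # candidate wins and multicasts a Coordinator message.
            return candidate
        # A higher-numbered live process answers and takes over the
        # election; the highest of them repeats the same steps.
        candidate = max(higher)
```

For example, if process 2 starts an election while processes {1, 2, 3, 5} are alive, process 5 ends up as coordinator.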
RING ALGORITHM
An alternative to the bully algorithm is to use a ring algorithm. In this approach all processes are
ordered in a logical ring and each process knows the structure of the ring. There are only two
types of messages involved: Election and Coordinator. A process starts an election when it
notices that the current coordinator has failed (e.g., because requests to it have timed out). An
election is started by sending an Election message to the first neighbor on the ring. The Election
message contains the node’s process identifier and is forwarded on around the ring, with each
process adding its own identifier to the message. When the Election message reaches the
originator, the election is complete. Based on the contents of the message, the originator
determines the highest-numbered process and sends out a Coordinator message announcing this
process as the winner of the election.
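The circulation of the Election message can be sketched as follows; the ring layout and the treatment of crashed nodes (simply skipped by their neighbors) are assumptions of this toy simulation.

```python
# Sketch of the ring election: the Election message collects identifiers
# as it travels once around the logical ring, and the originator then
# picks the highest and announces it with a Coordinator message.
def ring_election(ring, start, alive):
    """ring  -- list of process ids in ring order
    start -- index of the (live) process that starts the election
    alive -- set of ids that have not crashed
    """
    n = len(ring)
    message = [ring[start]]            # the originator adds its own id
    i = (start + 1) % n
    while i != start:
        if ring[i] in alive:           # each live process appends its id
            message.append(ring[i])
        i = (i + 1) % n
    return max(message)                # winner announced in a Coordinator message
```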
PROCESS MIGRATION
Process migration moves a process already in progress to a remote site. The motivations include:
Load sharing: move a process from a heavily to a lightly loaded system to improve
performance.
Communications performance: move the process to the data to minimize communications
overhead.
Availability: survive a scheduled downtime.
Special capabilities: take advantage of unique hardware or software on a particular node.
Commonly, migration is triggered when the owner returns to a workstation; an alternative is to
lower the priority of the foreign process instead. A typical migration proceeds as follows: select a
target machine; send part of the process image and open-file information; the receiving kernel
forks a child with the passed information; the new process pulls over data, environment,
register/stack information, and modified program text (the rest of the program is demand paged);
the new process sends a migration-completed message; and the old process destroys itself.
Multiprocessor Scheduling
Effects of Scheduling in Multiprocessors: In a multiprogrammed system, single applications run
better: traditional priority, FCFS, and round-robin algorithms matter less because other processes
can be served by other processors. In a multithreaded system, threads run faster if scheduled
together. Application speedup on a multiprocessor often exceeds expectations because threads
share disk caches and threads share compiled code.
Classes of Multiprocessor OS
Separate Supervisor: Each processor has its own copy of the kernel, data structures, I/O
devices, and file systems, with a minimum of shared data structures (e.g., for semaphores).
Disadvantages: it is difficult to perform parallel execution of a single task, and the scheme is
inefficient because much is replicated for each processor.
BASIC CONCEPTS
To understand the role of fault tolerance in distributed systems we first need to take a closer look
at what it actually means for a distributed system to tolerate faults. Being fault tolerant is
strongly related to what are called “dependable” systems. Dependability is a term that covers a
number of useful requirements for distributed systems, including the following:
Availability - is defined as the property that a system is ready to be used immediately. In
general, it refers to the probability that the system is operating correctly at any given
moment and is available to perform its functions on behalf of its users. In other words, a
highly available system is one that will most likely be working at a given instant in time.
Reliability - refers to the property that a system can run continuously without failure. In
contrast to availability, reliability is defined in terms of a time interval instead of an
instant in time. A highly reliable system is one that will most likely continue to work
without interruption during a relatively long period of time. This is a subtle but important
difference when compared to availability. If a system goes down for one millisecond
every hour, it has an availability of over 99.9999 percent, but is still highly unreliable.
Similarly, a system that never crashes but is shut down for two weeks every August has
high reliability but only 96 percent availability. The two are not the same.
Safety - refers to the situation in which, when a system temporarily fails to operate correctly,
nothing catastrophic happens. For example, many process control systems, such as those
used for controlling nuclear power plants or sending people into space, are required to
provide a high degree of safety. If such control systems temporarily fail for only a very
brief moment, the effects could be disastrous. Many examples from the past (and
probably many more yet to come) show how hard it is to build safe systems.
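The availability figures quoted above are easy to verify with a few lines of arithmetic:

```python
# Checking the availability figures quoted above.
def availability(downtime, period):
    """Fraction of a period during which the system is up."""
    return 1 - downtime / period

# One millisecond of downtime every hour (both in seconds):
per_hour = availability(0.001, 3600)
# Two weeks of downtime every year (both in weeks):
per_year = availability(2, 52)

assert per_hour > 0.999999              # "over 99.9999 percent"
assert round(per_year * 100) == 96      # "96 percent availability"
```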
A distinction is made between preventing, removing, and forecasting faults. For our purposes,
the most important issue is “fault tolerance”, meaning that a system can provide its services even
in the presence of faults. Faults are generally classified as transient, intermittent, or permanent.
“Transient faults” occur once and then disappear. If the operation is repeated, the fault goes
away. A bird flying through the beam of a microwave transmitter may cause lost bits on some
network (not to mention a roasted bird). If the transmission times out and is retried, it will
probably work the second time.
An intermittent fault occurs, then vanishes of its own accord, then reappears, and so on.
A loose contact on a connector will often cause an intermittent fault. Intermittent faults
cause a great deal of aggravation because they are difficult to diagnose. Typically,
whenever the fault doctor shows up, the system works fine.
A permanent fault is one that continues to exist until the faulty component is repaired.
Burnt-out chips, software bugs, and disk head crashes are examples of permanent faults.
FAILURE MODELS
A system that fails is not adequately providing the services it was designed for. If we consider a
distributed system as a collection of servers that communicate with each other and with their
clients, not adequately providing services means that servers, communication channels, or
possibly both, are not doing what they are supposed to do. However, a malfunctioning server
itself may not always be the fault we are looking for. If such a server depends on other servers to
adequately provide its services, the cause of an error may need to be searched for somewhere
else.
Such dependency relations appear in abundance in distributed systems. A failing disk may make
life difficult for a file server that is designed to provide a highly available file system. If such a
file server is part of a distributed database, the proper working of the entire database may be at
stake, as only part of its data may actually be accessible. To get a better grasp on how serious a
failure actually is, several
classification schemes have been developed. One such scheme is shown in Fig. below
A crash failure occurs when a server prematurely halts, but was working correctly until it
stopped. An important aspect with crash failures is that once the server has halted, nothing is
heard from it anymore. A typical example of a crash failure is an operating system that comes to
a grinding halt, and for which
there is only one solution: reboot. Many personal computer systems suffer from crash failures so
often that people have come to expect them to be normal. In this sense, moving the reset button
from the back of a cabinet to the front was done for good reason. Perhaps one day it can be
moved to the back again, or even removed altogether.
An omission failure occurs when a server fails to respond to a request. Several things might go
wrong. In the case of a receive omission failure, the server perhaps never got the request in the
first place. Note that it may well be the case that the connection between a client and a server has
been correctly established, but that there was no thread listening to incoming requests. Also, a
receive omission failure will generally not affect the current state of the server, as the server is
unaware of any message sent to it.
State transition failure. This kind of failure happens when the server reacts unexpectedly to an
incoming request. For example, if a server receives a message it cannot recognize, a state
transition failure happens if no measures have been taken to handle such messages. In particular,
a faulty server may incorrectly take default actions it should never have initiated.
Arbitrary / Byzantine failures. In effect, when arbitrary failures occur, clients should be
prepared for the worst. In particular, it may happen that a server is producing output it should
never have produced, but which cannot be detected as being incorrect. Worse yet a faulty server
may even be maliciously working together with other servers to produce intentionally wrong
answers. This situation illustrates why security is also considered an important requirement when
talking about dependable systems
Failure Masking by Redundancy - If a system is to be fault tolerant, the best it can do is to try
to hide the occurrence of failures from other processes. The key technique for masking faults is
to use redundancy. Three kinds are possible: information redundancy, time redundancy, and
physical redundancy. With information redundancy, extra bits are added to allow recovery from
garbled bits.
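Physical redundancy can be illustrated with triple modular redundancy (TMR), in which three replicas compute the same result and a voter masks a single fault. The replica functions below are invented for illustration.

```python
# Illustrative sketch of physical redundancy: triple modular redundancy.
from collections import Counter

def tmr(replicas, *args):
    """Run the replica functions and return the majority result."""
    results = [f(*args) for f in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: more than one replica failed")
    return value

ok = lambda x: x * 2
faulty = lambda x: x * 2 + 1       # a transient fault in one replica

assert tmr([ok, ok, faulty], 21) == 42   # the single fault is masked
```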
Process Resilience - Now that the basic issues of fault tolerance have been discussed, let us
concentrate on how fault tolerance can actually be achieved in distributed systems. The first
topic we discuss is protection against process failures, which is achieved by replicating processes
into groups.
Design Issues - The key approach to tolerating a faulty process is to organize several identical
processes into a group. The key property that all groups have is that when a message is sent to
the group itself, all members of the group receive it. In this way, if one process in a group fails,
hopefully some other process can take over for it. Process groups may be dynamic. New groups
can be created and old groups can be destroyed. A process can join a group or leave one during
system operation. A process can be a member of several groups at the same time. Consequently,
mechanisms are needed for managing groups and group membership. Groups are roughly
analogous to social organizations.
Agreement in Faulty Systems - Before considering the case of faulty processes, let us look at
the “easy” case of perfect processes but where communication lines can lose messages. There is
a famous problem, known as the two-army problem, which illustrates the difficulty of getting
even two perfect processes to reach agreement about 1 bit of information.
CLIENT-SERVER COMMUNICATION
Reliable Client-server Communication
In many cases, fault tolerance in distributed systems concentrates on faulty processes. However,
we also need to consider communication failures. Most of the failure models discussed
previously apply equally well to communication channels. In particular, a communication
channel may exhibit crash, omission,
timing, and arbitrary failures. In practice, when building reliable communication channels, the
focus is on masking crash and omission failures. Arbitrary failures may occur in the form of
duplicate messages, resulting from the fact that in a computer network messages may be buffered
for a relatively long time, and are re-injected into the network after the original sender has
already issued a retransmission.
Point-to-Point Communication
In many distributed systems, reliable point-to-point communication is established by making use
of a reliable transport protocol, such as TCP. TCP masks omission failures, which occur in the
form of lost messages, by using acknowledgements and retransmissions. Such failures are
completely hidden from a
TCP client. However, crash failures of connections are often not masked. A crash failure may
occur when, for whatever reason, a TCP connection is abruptly broken so that no more messages
can be transmitted through the channel.
Lost Request Messages - The second item on the list is dealing with lost request messages. This
is the easiest one to deal with: just have the operating systems or client stub start a timer when
sending the request. If the timer expires before a reply or acknowledgement comes back, the
message is sent again. If the message was truly lost, the server will not be able to tell the
difference between the retransmission and the original, and everything will work fine. Unless, of
course, so many request messages are lost that the client gives up and falsely concludes that the
server is down, in which case we are back to “Cannot locate server.” If the request was not lost,
the only thing we need to do is let the server be able to detect it is dealing with a retransmission.
Unfortunately, doing so is not so simple, as we explain when discussing lost replies.
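The client-side timer described above can be sketched as follows; `send_request`, the timeout value, and the retry count are hypothetical names and parameters, not part of any particular RPC system.

```python
# Hedged sketch of the client stub's retransmission timer for lost requests.
def call_with_retries(send_request, timeout=1.0, max_tries=3):
    """Resend the request until a reply arrives, or give up and conclude
    'cannot locate server'.

    send_request -- callable taking a timeout; returns the reply, or None
                    if the timer expired before a reply arrived
    """
    for _ in range(max_tries):
        reply = send_request(timeout)
        if reply is not None:
            return reply
        # Timer expired: retransmit. The server must be able to detect
        # that a retransmission duplicates the original request.
    raise TimeoutError("cannot locate server")
```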
Server Crashes - The next failure on the list is a server crash. Assume that the server crashes
and subsequently recovers. It announces to all clients that it has just crashed but is now up and
running again. The problem is that the client does not know whether its request to print some text
will actually be carried out. There are four strategies the client can follow. First, the client can
decide to never reissue a request, at the risk that the text will not be printed. Second, it can decide
to always reissue a request, although the text may then be printed twice. Third, it can reissue a
request only if it did not yet receive an acknowledgement that its request had been delivered.
Fourth, it can reissue a request only when it has received such an acknowledgement. Which
strategy is best depends on the order in which the server sends the completion message (M),
prints the text (P), and crashes (C). Six orderings are possible:
1. M→P→C: a crash occurs after sending the completion message and printing the text.
2. M→C(→P): a crash happens after sending the completion message, but before the text could
be printed.
3. P→M→C: a crash occurs after printing the text and sending the completion message.
4. P→C(→M): the text is printed, after which a crash occurs before the completion message
could be sent.
5. C(→P→M): a crash happens before the server could do anything.
6. C(→M→P): a crash happens before the server could do anything.
Now consider a request to a banking server asking to transfer a million dollars from one account
to another. If the request arrives and is carried out, but the reply is lost, the client will not know
this and will retransmit the message. The bank server will interpret this request as a new one, and
will carry it out too. Two million dollars will be transferred. Heaven forbid that the reply is lost
10 times. Transferring money is not idempotent. One way of solving this problem is to try to
structure all requests in an idempotent way.
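One common way to make such a request idempotent is to tag each request with a unique identifier and have the server cache the reply, replaying it on retransmission. The request-id scheme and class below are an illustrative assumption, not the only way to achieve idempotence.

```python
# Sketch of an idempotent transfer using per-request identifiers.
class BankServer:
    def __init__(self):
        self.accounts = {}
        self.processed = {}        # request_id -> cached reply

    def transfer(self, request_id, src, dst, amount):
        if request_id in self.processed:
            # Retransmission after a lost reply: replay the cached reply
            # instead of moving the money a second time.
            return self.processed[request_id]
        self.accounts[src] -= amount
        self.accounts[dst] += amount
        reply = "ok"
        self.processed[request_id] = reply
        return reply
```

With this scheme, retransmitting the million-dollar transfer after a lost reply moves the money only once.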
Client Crashes
The final item on the list of failures is the client crash. What happens if a client sends a request to
a server to do some work and crashes before the server replies? At this point a computation is
active and no parent is waiting for the result. Such an unwanted computation is called an
orphan. Orphans can cause a variety of problems. As a bare minimum, they waste CPU cycles.
They can also lock files or otherwise tie up valuable resources. Finally, if the client reboots and
does the RPC again, but the reply from the orphan comes back immediately afterward, confusion
can result.
TWO-PHASE COMMIT
The original two-phase commit protocol (2PC) is due to Gray (1978). Without loss of
generality, consider a distributed transaction involving the participation of a number of processes
each running on a different machine. Assuming that no failures occur, the protocol consists of
the following two phases, each consisting of two steps
1. The coordinator sends a Vote_Request message to all participants.
2. When a participant receives a Vote_Request message, it returns either a Vote_Commit
message to the coordinator telling the coordinator that it is prepared to locally commit its
part of the transaction, or otherwise a Vote_Abort message.
3. The coordinator collects all votes from the participants. If all participants have voted to
commit the transaction, then so will the coordinator. In that case, it sends a
Global_Commit message to all participants. However, if one participant had voted to
abort the transaction, the coordinator will also decide to abort the transaction and
multicasts a Global_Abort message.
4. Each participant that voted for a commit waits for the final reaction by the coordinator. If
a participant receives a Global_Commit message, it locally commits the transaction.
Otherwise, when receiving a Global_Abort message, the transaction is locally aborted as
well.
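The failure-free exchange above can be condensed into a small sketch; the vote-function interface is an assumption made for illustration, and all of 2PC's failure handling (timeouts, logging, coordinator crashes) is deliberately omitted.

```python
# Failure-free sketch of the two-phase commit decision.
def two_phase_commit(participants):
    """participants: mapping of name -> vote function returning
    'Vote_Commit' or 'Vote_Abort'. Returns the global decision."""
    # Phase 1: the coordinator requests and collects the votes.
    votes = {name: vote() for name, vote in participants.items()}
    # Phase 2: commit only if every participant voted to commit;
    # a single Vote_Abort forces a Global_Abort.
    if all(v == "Vote_Commit" for v in votes.values()):
        return "Global_Commit"
    return "Global_Abort"
```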
THREE-PHASE COMMIT
A problem with the two-phase commit protocol is that when the coordinator has crashed,
participants may not be able to reach a final decision. Consequently, participants may need to
remain blocked until the coordinator recovers. Skeen (1981) developed a variant of 2PC, called
the three-phase commit
protocol (3PC), that avoids blocking processes in the presence of fail-stop crashes. Although
3PC is widely referred to in the literature, it is not applied often in practice, as the conditions
under which 2PC actually blocks rarely occur.
RECOVERY
So far, we have mainly concentrated on algorithms that allow us to tolerate faults. However,
once a failure has occurred, it is essential that the process where the failure happened can recover
to a correct state. In what follows, we first concentrate on what it actually means to recover to a
correct state, and subsequently when and how the state of a distributed system can be recorded
and recovered, by means of checkpointing and message logging. Fundamental to fault
tolerance is the recovery from an error. Recall that an error is that part of a system that may lead
to a failure. The whole idea of error recovery is to replace an erroneous state with an error-free
state. There are essentially two forms of error recovery.
In backward recovery, the main issue is to bring the system from its present erroneous
state back into a previously correct state. To do so, it will be necessary to record the
system’s state from time to time, and to restore such a recorded state when things go
wrong. Each time (part of) the system’s present state is recorded, a checkpoint is said to
be made.
In forward recovery, when the system has just entered an erroneous state, an attempt is
made, instead of moving back to a previous, checkpointed state, to bring the system into a
correct new state from which it can continue to execute. The main problem
with forward error recovery mechanisms is that it has to be known in advance which
errors may occur. Only in that case is it possible to correct those errors and move to a
new state.
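Backward recovery can be sketched as a checkpoint/rollback pair; the class below is a toy in-memory illustration, not a real checkpointing system (which must also record messages and coordinate across processes).

```python
# Toy sketch of backward recovery: checkpoint the state from time to
# time, and roll back to the last checkpoint when an error is detected.
import copy

class Recoverable:
    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Record (part of) the system's present state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Replace the erroneous state with the previously recorded one.
        self.state = copy.deepcopy(self._checkpoint)
```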
STABLE STORAGE
To be able to recover to a previous state, it is necessary that information needed to enable
recovery is safely stored. Safely in this context means that recovery information survives process
crashes and site failures, but possibly also various storage media failures. Stable storage plays an
important role when it comes to recovery in distributed systems. Stable storage can be
implemented with a pair of
ordinary disks. Storage comes in three categories.
First there’s RAM memory, which is wiped out when power fails or a machine crashes.
Next is disk storage, which survives CPU failures but which can be lost in disk head
crashes.
Finally, there is also stable storage, which is designed to survive anything except major
calamities such as floods and earthquakes.
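One common way to implement stable storage over a pair of ordinary disks is to write each block to disk 1, verify it, and only then write it to disk 2; on recovery, any block on which the disks disagree is repaired from disk 1. The in-memory dictionaries below stand in for real disks.

```python
# Sketch of stable storage built from a pair of ordinary disks.
class StableStorage:
    def __init__(self):
        self.disk1 = {}
        self.disk2 = {}

    def write(self, block, data):
        self.disk1[block] = data              # first write to disk 1...
        assert self.disk1[block] == data      # ...and verify it...
        self.disk2[block] = data              # ...before touching disk 2

    def recover(self):
        # A crash between the two writes leaves the disks inconsistent;
        # repair by copying the (newer) disk-1 version onto disk 2.
        for block, data in self.disk1.items():
            if self.disk2.get(block) != data:
                self.disk2[block] = data
```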
The other part concerns authorization, which deals with ensuring that a process gets only those
access rights to the resources in a distributed system it is entitled to. Authorization is covered in a
separate section dealing with access control. In addition to traditional access control
mechanisms, we also focus
on access control when we have to deal with mobile code such as agents. Secure channels and
access control require mechanisms to hand out cryptographic keys, but also mechanisms to add
and remove users from a system. These topics are covered by what is known as security
management.
Integrity is the characteristic that alterations to a system’s assets can be made only in an
authorized way. In other words, improper alterations in a secure computer system should be
detectable and recoverable. Major assets of any computer system are its hardware, software, and
data. Another way of looking at security in computer systems is that we attempt to protect the
services and data it offers against security threats. There are four types of security threats to
consider
1. Interception refers to the situation that an unauthorized party has gained access to a
service or data. A typical example of interception is where communication between two
parties has been overheard by someone else. Interception also happens when data are
illegally copied, for example, after breaking into a person’s private directory in a file
system.
2. Interruption refers to the situation in which services or data become unavailable,
unusable, destroyed, and so on.
3. Modification involves unauthorized changing of data or tampering with a service so
that it no longer adheres to its original specifications.
4. Fabrication refers to the situation in which additional data or activity are generated that
would normally not exist. For example, an intruder may attempt to add an entry into a
password file or database.
Note that interruption, modification, and fabrication can each be seen as a form of data
falsification.
Simply stating that a system should be able to protect itself against all possible security threats is
not the way to actually build a secure system. What is first needed is a description of security
requirements, that is, a security policy. A security policy - describes precisely which actions the
entities in a system are allowed to take and which ones are prohibited. Entities include users,
services, data, machines, and so on. Once a security policy has been laid down, it becomes
possible to concentrate on the security mechanisms by which a policy can be enforced.
SECURITY MECHANISMS
1. Encryption is fundamental to computer security. Encryption transforms data into
something an attacker cannot understand. In other words, encryption provides a means to
implement confidentiality. In addition, encryption allows us to check whether data have
been modified. It thus also provides support for integrity checks.
2. Authentication is used to verify the claimed identity of a user, client, server, and so on.
In the case of clients, the basic premise is that before a service will do work for a client,
the service must learn the client’s identity. Typically, users are authenticated by means of
passwords, but there are many other ways to authenticate clients.
3. Authorization is used to check whether a client, once authenticated, is actually
authorized to perform the action it requests.
4. Auditing - tools are used to trace which clients accessed what, and in which way. Although
auditing does not really provide any protection against security threats, audit logs can be
extremely useful for the analysis of a security breach, and subsequently taking measures
against intruders. For this reason, attackers are generally keen not to leave any traces that
could eventually lead to exposing their identity. In this sense, logging accesses makes
attacking sometimes a riskier business.
Design Issues - A distributed system, or any computer system for that matter, must provide
security services by which a wide range of security policies can be implemented. There are a
number of important design issues that need to be taken into account when implementing
general-purpose security services. In the following pages, we discuss three of these issues: focus
of control, layering of security mechanisms, and simplicity.
Focus of Control - When considering the protection of a (possibly distributed) application, there
are essentially three different approaches that can be followed.
Middleware-based distributed systems thus require trust in the existing local operating systems
they depend on. If such trust does not exist, then part of the functionality of the local operating
systems may need to be incorporated into the distributed system itself. Consider a microkernel
operating system, in which most operating-system services run as normal user processes. In this
case, the file system, for instance, can be entirely replaced by one tailored to the specific needs of
a distributed system, including its various security measures. Consistent with this approach is to
separate security services from other types of services by distributing services across different
machines depending on the required security. For example, for a secure distributed file system, it
may be possible to isolate the file server from clients by placing the server on a machine with a
trusted operating system, possibly running a dedicated secure file system. Clients and their
applications are placed on untrusted machines.
This separation effectively reduces the TCB to a relatively small number of machines and
software components. By subsequently protecting those machines against security attacks from
the outside, overall trust in the security of the distributed system can be increased. Preventing
clients and their applications from having direct access to critical services is followed in the Reduced
Interfaces for Secure System Components (RISSC) approach, as described in (Neumann, 1995).
In the RISSC approach, any security-critical server is placed on a separate machine isolated from
end-user systems using low-level secure network interfaces
Simplicity - Another important design issue related to deciding in which layer to place a security
mechanism is that of simplicity. Designing a secure computer system is generally considered a
difficult task. Consequently, the fewer and simpler the security mechanisms a designer uses,
and the more easily they are understood and trusted to work, the better.
First, an intruder may intercept the message without either the sender or receiver being
aware that eavesdropping is happening. Of course, if the transmitted message has been
encrypted in such a way that it cannot be easily decrypted without having the proper key,
interception is useless: the intruder will see only unintelligible data
The second type of attack that needs to be dealt with is that of modifying the message.
Modifying plaintext is easy; modifying cipher text that has been properly encrypted is
much more difficult because the intruder will first have to decrypt the message before it
can meaningfully modify it. In addition, he will also have to properly encrypt it again or
otherwise the receiver may notice that the message has been tampered with.
The third type of attack is when an intruder inserts encrypted messages into the
communication system, attempting to make the receiver believe these messages came from
the legitimate sender. Again
encryption can help protect against such attacks. Note that if an intruder can modify
messages, he can also insert messages. There is a fundamental distinction between
different cryptographic systems, based on whether or not the encryption and decryption
key are the same.
In a symmetric cryptosystem, the same key is used to encrypt and decrypt a message. In other
words, P = D_K(E_K(P)). Symmetric cryptosystems are also referred to as secret-key or shared-key
systems, because the sender and receiver are required to share the same key, and to ensure that
protection works, this shared key must be kept secret; no one else is allowed to see the key. We
will use the notation K_A,B to denote a key shared by A and B.
In an asymmetric cryptosystem, the keys for encryption and decryption are different, but
together form a unique pair. In other words, there is a separate key K_E for encryption and one for
decryption, K_D, such that P = D_KD(E_KE(P)). One of the keys in an asymmetric cryptosystem is
kept private, the other is made public. For this reason, asymmetric cryptosystems are also
referred to as public-key systems. In what follows, we use the notation K+_A to denote a public
key belonging to A, and K-_A as its corresponding private key.
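The symmetric property P = D_K(E_K(P)) can be demonstrated with a toy XOR cipher. This is for illustration only and is not secure; real systems use ciphers such as AES, and the key value here is invented.

```python
# Toy XOR cipher illustrating the symmetric property P = D_K(E_K(P)).
from itertools import cycle

def xor_crypt(data: bytes, key: bytes) -> bytes:
    # XOR is its own inverse, so the very same function (and the very
    # same key) both encrypts and decrypts.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

plaintext = b"transfer $100 to B"
key = b"shared-secret-KAB"
ciphertext = xor_crypt(plaintext, key)
assert ciphertext != plaintext
assert xor_crypt(ciphertext, key) == plaintext   # P = D_K(E_K(P))
```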
Attacks
Passive attacks are mainly based on observation without altering data or compromising services;
they represent the interception and interruption forms of security threats. The simplest form of
attack is browsing, which implies the nondestructive examination of all accessible data. This
leads to the need for confidentiality and the need-to-know principle. Related is the leaking of
information via authorized accomplices, which leads to the confinement problem. More indirect
are attempts to infer information from traffic analysis, code breaking, and so on. In contrast,
active attacks alter or delete data and may cause service to be denied to authorized users. They
represent the modification and fabrication forms of security threats. Typical active attacks
attempt to modify or destroy files. Communication-related active attacks attempt to modify the
data sent over a communication channel.
AUTHENTICATION
Authentication involves verifying the claimed identity of an entity (or principal). Authentication
requires a representation of identity (i.e., some way to represent a principal’s identity, such as, a
user name, a bank account, etc.) and some way to verify that identity (e.g., a password, a
passport, a PIN, etc.). Depending on the system’s requirements, different strengths of
authentication may be required. For example, in some cases it is enough to simply present a user
id, while in other cases a certificate signed by a trusted authority may be required to prove a
principal’s identity. A comprehensive logic of authentication has been developed by Lampson et
al.
Authentication based on a Shared Secret Key - A naive challenge-response protocol based on
a single shared secret key can easily be defeated by what is known as a reflection attack, in
which the attacker opens a second session and tricks the verifier into answering its own
challenge.
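A minimal challenge-response exchange over a shared secret key can be sketched with Python's standard `hmac` module; the key value and nonce size are invented for illustration. The responder proves knowledge of the key without ever sending it.

```python
# Sketch of challenge-response authentication with a shared secret key.
import hashlib
import hmac
import os

SHARED_KEY = b"KAB-secret"   # hypothetical key shared by A and B

def respond(challenge: bytes, key: bytes = SHARED_KEY) -> bytes:
    # The responder returns a keyed digest of the fresh challenge.
    return hmac.new(key, challenge, hashlib.sha256).digest()

def verify(challenge: bytes, response: bytes, key: bytes = SHARED_KEY) -> bool:
    return hmac.compare_digest(respond(challenge, key), response)

nonce = os.urandom(16)                   # fresh challenge from the verifier
assert verify(nonce, respond(nonce))
assert not verify(nonce, respond(nonce, b"wrong-key"))
```

Note that even a keyed scheme like this remains open to reflection if both directions use the same key and message format; standard defenses include using a different key per direction or binding the parties' identities into the digest.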
PROTECTION SYSTEM
Evaluate the implementations based on these design considerations
Propagation of Rights: Can someone act as an agent’s proxy? That is, can one subject’s
access rights be delegated to another subject?
Restriction of Rights: Can a subject propagate a subset of their rights (as opposed to all
of their rights)?
Amplification of Rights: Can an unprivileged subject perform some privileged
operations (i.e., (temporarily) extend their protection domain)?
Revocation of Rights: Can a right, once granted, be removed from a subject?
Determination of Object Accessibility: Who has which access rights on a particular
object?
Determination of a Subject’s Protection Domain: What is the set of objects that a
particular subject can access?
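The last two questions above correspond to reading an access matrix by column (who can access an object) and by row (a subject's protection domain). The subjects, objects, and rights below are invented examples.

```python
# Sketch of an access matrix, stored sparsely as (subject, object) -> rights.
matrix = {
    ("alice", "file1"): {"read", "write"},
    ("bob",   "file1"): {"read"},
    ("bob",   "file2"): {"read", "write"},
}

def accessors(obj):
    """Object accessibility: who has which rights on a particular object
    (a column of the matrix, i.e., the object's access control list)."""
    return {s: r for (s, o), r in matrix.items() if o == obj}

def protection_domain(subject):
    """The set of objects a particular subject can access
    (a row of the matrix, i.e., the subject's capability list)."""
    return {o for (s, o) in matrix if s == subject}
```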
FIREWALLS
A different form of protection that can be employed in distributed systems is that offered by
firewalls. A firewall is generally used when communicating with external untrusted clients and
servers, and serves to disconnect parts of the system from the outside world, allowing inbound
(and possibly outbound) communication only on predefined ports. Besides simply blocking
communication, firewalls can also inspect incoming (or outgoing) communication and filter only
suspicious messages. Two main types of firewalls are packet-filtering and application-level
firewalls. Packet-filtering firewalls work at the packet level, filtering network packets based on
the contents of headers.
Application-level firewalls, on the other hand, filter messages based on their contents. They are
capable of spotting and filtering malicious content arriving over otherwise innocuous
communication channels (e.g., virus filtering email gateways).
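A packet filter's "allow inbound communication only on predefined ports" rule can be sketched in a few lines; the rule set here is an invented example, and a real firewall would also match on addresses, protocols, and connection state.

```python
# Toy packet-filtering rule check based only on header information.
ALLOWED_INBOUND_PORTS = {80, 443}     # hypothetical predefined ports

def filter_packet(direction, dst_port):
    """Return True if the packet may pass."""
    if direction == "inbound":
        return dst_port in ALLOWED_INBOUND_PORTS
    return True                       # this sketch lets all outbound traffic pass
```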