
CMT 412: DISTRIBUTED OPERATING SYSTEMS – LECTURE NOTES

COURSE OUTLINE

1. Introduction to DOS
 Introduction to OS
 HW / SW Architectures
 SW Concepts
 NOS Systems
 DOS Systems

2. Distributed Systems
 Introduction to DS
 Characteristics of DS
 Design Issues of DS
 Systems Models

3. Communications in DS
 Process & Thread Creation
 Networks and Protocols
 Communication Models
- Inter Process Communication
- Remote Procedure Call
- Message Passing

4. Distributed Processing
 Synchronization
 Distributed Shared Memory
 Resource Management
 Process Migration
 Layering

5. System Naming Schemes
 Basic Concepts
 System Oriented Names
 File Oriented Names
 Name Spaces and Resolution
 Name Caches & Schemes

6. Distributed File Systems
 File Service Models
 Shared Semantics
 Network File Systems

7. Distributed Transactions
 Transaction Implementation
 Concurrency Control Approaches
 Transaction Synchronization

8. Distributed Fault Tolerance
 Basic Concepts
 Failure Models & Masking
 Replication Models
 Communication Models
 Distributed Commit

9. Distributed Security
 Security Threats
 Policies and Mechanisms
 Security Design Issues
 Security Mechanisms
 Distributing Security Mechanisms
 Simplicity & Cryptography

REFERENCES
1. Operating Systems: Design and Implementation by Andrew Tanenbaum (2005), Prentice Hall
2. Distributed Computing: Principles and Applications by Liu M. L. (2004), Pearson Addison-Wesley
3. Schaum’s Outline of Operating Systems by Archer J. H. (2002), McGraw-Hill
4. Operating System Projects Using Windows NT by Gary Nutt (2001), Addison-Wesley
5. Distributed Operating Systems: Concepts and Design by Pradeep K. S. (2001), Prentice Hall
6. Other resources: the Internet, papers, handouts, lecture notes, etc.



INSTRUCTIONAL MATERIALS
 Computers and Projectors
 Writing Boards and Mark Pens
 Well Ventilated Lecture Theaters

COURSE ASSESSMENT
a) Student Performance
 Two Assignments contributing 10%
 Three Sitting CATs contributing 20%
 End Unit Exam contributing 70%
b) Lecturer Performance
 Based on Student Evaluation
 Head of Department Evaluation
 Lecturer / Self Evaluation

TEACHING METHODOLOGY
DELIVERY:
 Lectures, Tutorials and Tests
 Readings on Relevant Materials
 Group Work, Discussions and Reporting
 Research on the field of study and others

LECTURER:
 Name: Mr. Josphat K. Kyalo
 Cells: 0724 577772, 0732 307 288
 Email: joskkyalo@yahoo.co.uk, joskyalo@gmail.com

COURSE OBJECTIVES
1. To understand the software that implements networking
2. To explain the principles of a Network Operating System, e.g. Unix, Win NT, etc.
3. To discuss DOS, process management and distributed file systems
4. To analyze distributed transactions, replication, naming and security
5. To apply the knowledge of DOS to sampled case studies



INTRODUCTION TO OPERATING SYSTEM
Software makes a computer useful. With software a computer can store, process and retrieve
information. Computer software can roughly be divided into two forms: system programs and
application programs. System programs manage the operations of the computer itself, while
application programs perform the work that the user wants. The most fundamental system
program is the Operating System, which controls all computer resources and provides the base
upon which application programs run. Long ago, there was no such thing as an operating system.
Computers ran one program at a time. Programmers would load the programs they had written
and run them. If there was a bug in a program, the programmer had to start over. Even if the
program did run correctly, the programmer probably never got to work on the machine directly.
The program (on punched cards) was fed into the computer by an operator, who then passed the
printed output to the programmer later on.
As technology advanced, many such programs, or jobs, were all loaded onto a single tape. This
tape was then loaded and manipulated by another program, which was the ancestor of today's
operating systems. This program (also known as a monitor) would monitor the behavior of the
running program and if it misbehaved (crashed), the monitor could then immediately load and
run another. The process of loading and monitoring programs was cumbersome, and with time
it became apparent that some way had to be found to shield
programmers from the complexity of the hardware and allow for smooth sharing of the relatively
vast computer resources. The way that has evolved gradually is to put a layer of software on top
of the bare hardware, to manage all the components of the system and present the user with a
virtual machine that is easier to understand and program. This layer of software is the operating
system.
The concept of an operating system can be illustrated using the following diagram:

    Application programs:  Banking system | Airline reservation | Web browser
    System programs:       Compilers | Editors | Command interpreter
    Operating system
    Hardware:              Machine language
                           Microprogramming
                           Physical devices

At the bottom layer is the hardware, which in many cases is composed of two or more layers.
The lowest layer contains physical devices such as IC chips, wires, network cards, cathode ray
tubes etc. The next layer, which may be absent in some machines, is a layer of primitive software
that directly controls the physical devices and provides a clean interface to the next layer. This
software, called the micro-program, is normally located in ROM. It is an interpreter, fetching
machine language instructions such as ADD, MOVE and JUMP and carrying them out as a series of
little steps. The set of instructions that the micro-program can interpret defines the machine
language. The machine language typically has between 50 and 300 instructions, mostly for



moving data around the machine, doing arithmetic and comparing values. In this layer, I/O
devices are controlled by loading values into special device registers.
A major function of the operating system is to hide all this complexity and give programmers
and users a more convenient set of instructions to work with. For example, COPY FILE1 FILE2
is conceptually simpler than having to worry about the location of the original file on disk, the
location of the new file and the movement of the disk heads to effect the copying. On top of the
operating system is the rest of the system software, for example compilers, command
interpreters and editors. These are not part of the operating system. The operating system runs in
kernel or supervisor mode, meaning it is protected from user tampering. Compilers and editors run
in user mode, meaning that users are free to write their own compiler or editor if they so wish.
Finally, above the system programs come the application programs. These are programs purchased
or written by the users to solve particular problems, such as word-processing, spreadsheets,
databases, etc.
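
As a small illustration of this abstraction in Python (the file names are invented for the
example), a single high-level call hides all the disk-level detail the operating system manages
underneath:

    import shutil

    # one high-level operation; the OS hides disk layout, head movement, buffering
    shutil.copyfile("FILE1", "FILE2")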

FUNCTIONS OF OPERATING SYSTEM


Provision of a virtual machine
A virtual machine is software that creates an environment between the computer platform and
the end-user. A programmer does not want to get too intimately involved with programming
hardware devices like floppy disks, hard disks and memory. Instead, the programmer wants a
simple, high-level abstraction to deal with. In the case of disks, a typical abstraction would be that
the disk contains a collection of named files. Each file can be opened for reading or writing, then
read or written, and finally closed.

The program that hides the truth about hardware from the programmer and presents a nice,
simple view of named files that can be read and written is, of course, the operating system. The
operating system also conceals a lot of unpleasant business concerning interrupts, timers,
memory management and other low-level features. In this view, the function of the operating
system is to present the user with the equivalent of an extended machine, or virtual machine, that
is easier to program than the underlying hardware.

Resource management
Modern computers consist of processors, memories, timers, disks, network interface cards,
printers etc. The job of the operating system is to provide for an orderly and controlled allocation
of processors, memories and I/O devices among the various programs competing for them. For
example, if several programs running on the same computer sent print jobs to the same printer at
the same time and printing were not controlled, the output would be interleaved, with say the
first line of the printout belonging to the first program, the second line to the second program,
and so on. The operating system brings order to such situations by buffering all output destined for
the printer on disk. When one program is finished, the operating system can then copy its output
from the disk to the printer. In this view, the operating system keeps track of who is using which
resource, grants resource requests, accounts for usage and mediates conflicting requests from
different programs and users.

OPERATING SYSTEM CONCEPTS


The interface between the operating system and user programs is defined by the set of ‘extended
instructions’ that the operating system provides. These instructions are referred to as system



calls. The calls available in the interface vary from one operating system to another although the
underlying concept is similar.

A process is basically a program in execution. Associated with each process is its address space:
memory locations, which the process can read and write. The address space contains the
executing program, its data and stack. Also associated with each process is some set of registers,
including the program counter, stack pointer and hardware registers and all information needed
to run the program. In a time-sharing system, the operating system decides to stop running one
process and start running another. When a process is suspended temporarily, it must later be
restarted in exactly the same state it had when it was stopped. This means that the context of the
process must be explicitly saved during suspension. In many operating systems, the information
about each process, apart from the contents of its address space, is stored in a table called the
process table.

Therefore, a suspended process consists of its address space, usually referred to as the core
image, and its process table entry. The key process management system calls are those dealing
with the creation and termination of processes. For example, a command interpreter (shell) reads
commands from a terminal, for instance a request to compile a program. The shell must create a
new process that will run the compiler and when the process has finished the compilation, it
executes a system call to terminate itself. A process can create other processes known as child
processes and these processes can in turn create other child processes. Related processes that are
cooperating to get some job done often need to communicate with one another and synchronize
their activities. This communication is known as Inter-Process Communication (IPC). Other
systems calls are available to request more memory or release unused memory, wait for a child
process to terminate and overlay its program with a different one.
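
A minimal sketch of these process-management calls, using the POSIX wrappers in Python's os
module (the compiler command and file name are invented for the example); this mimics what a
shell does when asked to compile a program:

    import os

    pid = os.fork()                     # create a child process
    if pid == 0:
        # child: overlay its core image with the compiler program
        os.execvp("cc", ["cc", "program.c"])   # does not return on success
    else:
        # parent (the shell): wait for the child process to terminate
        _, status = os.waitpid(pid, 0)
        print("child exited with status", status)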

Files - A file is a collection of related information defined by its creator. Commonly, files
represent programs and data. Data files may be numeric, alphabetic or alphanumeric. System
calls are needed to create, delete, move, copy, read and write files. Before a file can be read, it
must be opened, and after reading it should be closed. System calls are provided to do all these
things. Files are normally organized into logical clusters or directories, which make them easier
to locate and access. For example, you can have directories for keeping all your program files,
word processing documents, database files, spreadsheets, electronic mail etc. System calls are
available to create and remove directories. Calls are also provided to put an existing file in a
directory and to remove a file from a directory. Every file within a directory hierarchy can be specified
by giving its path name from the root directory. Such absolute path names consist of the list of
directories that must be traversed from the root directory to get to the file with slashes separating
the components.
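
A brief illustration of these file and directory calls in Python (the names are invented for the
example; an absolute path such as /home/user/reports/notes.txt would name the same file from
the root directory):

    import os

    os.mkdir("reports")                          # create a directory
    with open("reports/notes.txt", "w") as f:    # create and open a file
        f.write("quarterly figures")             # write to the open file
    with open("reports/notes.txt") as f:         # open again for reading
        data = f.read()                          # read; closed when the block ends
    os.remove("reports/notes.txt")               # remove the file from its directory
    os.rmdir("reports")                          # remove the now-empty directory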

Batch Systems - The early operating systems were batch systems. The common input devices
were card readers and tape drives. The common output devices were line printers, tape drives
and card punches. The users did not interact with the system directly; rather, they would prepare
a job and submit it to the computer operator, who would feed the job into the computer, and later
on the output appeared. The major task of the operating system was to transfer control automatically
from one job to the next. To speed processing, jobs with similar needs were batched together and
run through the computer as a group. Programmers would leave their jobs with the operator who



would then sort them out into batches and as the computer became available, would run each
batch. The output would then be sent to the appropriate programmer. The delay between job
submission and completion also referred to, as the turnaround time was high in these systems. In
this execution environment the CPU, is often idle because of the disparity in speed between the
I/O devices and the CPU. To reduce the turnaround time and CPU idle time in these systems, the
spool (simultaneous peripheral operation on-line) concept was introduced. Spooling, in essence
uses the disk as a huge buffer, for reading as far ahead as possible on input device and for storing
output files until the output device is able to accept them.
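
A toy sketch of output spooling (the spool directory is a stand-in, and printing to the screen
stands in for driving a real printer): jobs write finished output into the spool directory on disk,
and a separate loop feeds completed files to the "printer" one at a time:

    import os, time

    SPOOL_DIR = "spool"                          # disk buffer for finished output
    os.makedirs(SPOOL_DIR, exist_ok=True)

    def spooler():
        while True:
            for name in sorted(os.listdir(SPOOL_DIR)):   # oldest job first
                path = os.path.join(SPOOL_DIR, name)
                with open(path) as job:
                    print(job.read(), end="")    # stand-in for the printer
                os.remove(path)                  # job printed; free the buffer
            time.sleep(1)                        # poll for newly spooled jobs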

Multiprogramming - Spooling will result in several jobs that have already been read waiting on
disk, ready to run. This allows the operating system to select which job to put in memory next,
ready for execution. This is referred to as job scheduling. The most important aspect of job
scheduling is the ability to multiprogram. The operating system keeps several jobs in memory at
the same time, which is a subset of jobs kept in the job spool. The operating system picks and
starts executing one of the jobs in memory. Eventually, the job may have to wait for some task
such as an I/O operation to complete. In multiprogramming, when this happens the operating
system simply switches to and executes another job. If several jobs are ready to be brought from
the job spool into memory and there is no room for all of them, then the system must choose
among them; making this decision is job scheduling. Having several jobs in memory at the same
time, ready for execution, also requires some memory management. In addition, if several jobs in
memory are ready to run, the system must choose one among them; making this decision is known
as CPU scheduling.

Time Sharing Systems - Time-sharing, or multitasking, is a logical extension of
multiprogramming. In time-sharing, multiple jobs are executed by the CPU switching between
them, but the switches occur so frequently that the users may interact with each program while it
is running. An interactive computer system provides on-line communication between the user
and the system. Time-sharing systems were developed to provide interactive use of the computer
system at a reasonable cost. A time-shared operating system uses CPU scheduling and
multiprogramming to provide each user or program with a small portion of a time-shared
computer. It allows many users to share the computer simultaneously. As the system switches
rapidly from one user to another, each user is given the impression of having a dedicated
computer, whereas in reality one computer is being shared among many users.

Parallel Systems - Most systems are single-processor systems, that is, they have only one main
CPU. However, there is a trend towards multiprocessing systems. Such systems have more than
one processor in close communication, sharing the computer bus, clock and sometimes memory
and peripheral devices. These systems are referred to as tightly coupled systems. The motivation
for having such systems is to improve the throughput and reliability of the system.

Real-Time Systems - A real time system is used when there are rigid time requirements on the
operation of a processor or the flow of data and thus often used as a control device in a dedicated
application. Sensors bring data to the computer. The computer must analyze the data and
possibly adjust control to modify the sensor inputs. Systems that control scientific experiments,
medical imaging systems, industrial control systems and some display systems are examples of
real-time systems.



TYPES OF OPERATING SYSTEMS
Network Operating System
A network operating system assumes loosely coupled software on loosely coupled hardware, such
as a network of workstations and servers. All commands are run locally on the workstation. To
access a remote machine the user logs in using a remote login command. Servers are used for
client/server functions such as file storage and printer management. Examples: Solaris, Windows NT.

Distributed Operating System


A distributed operating system creates the illusion in the minds of the users that the entire
network of computers is a single timesharing system, rather than a collection of distinct machines.
No current system fulfills this requirement entirely yet. Examples: Amoeba, Argus, Cronus and
V-System. Distributed systems are often broadly classified into two extremes of a spectrum:

Tightly Coupled Systems - Tightly coupled software on loosely coupled hardware. Components
are processors, memory, bus and I/O, e.g. the Meiko Computing Surface. The operating system tries to
maintain a single global view of the resources it manages. Single global interprocess
communication mechanism: any process can talk to any other process (regardless of which
processor the process is running on). Global protection scheme: the security system (e.g. passwords,
access rights) must look the same everywhere. The file system must look the same everywhere:
every file should be visible at every location (subject to protection/security constraints). Every node
runs the same operating system.

 Loosely Coupled Systems – can be thought of as a collection of computers, each running
its own O/S. However, these operating systems work together to make their services and
resources available to others (Network Operating Systems). Components are
workstations, a LAN, and servers, e.g. V-System, BSD Unix.
 Cooperative Autonomous System - In between a Distributed Operating System and a
Network Operating System, with an interface for service integration and cooperating
processes. Example: CORBA, the Common Object Request Broker Architecture.
 Multiprocessor Timesharing System - Tightly coupled software on tightly coupled
hardware, used for parallel processing. Example: a UNIX timesharing system with
multiple CPUs instead of one. There is a single run-time queue for all processors and common
memory (possibly a single memory module) for all processors. Mutual exclusion is achieved with
monitors and semaphores. A processor may switch to another process or busy-wait for an I/O
interrupt to occur, for cache-efficiency reasons. Traditional file system with a single unified block
cache. Examples: Solaris, Windows NT.



DISTRIBUTED COMPUTING
Symptoms of a Distributed System
 Interconnection hardware that connects the processing elements
 Multiple processing elements that run independently
 Shared state, maintained in order to recover from failures
 Processing elements fail independently, giving rise to partial failures

HARDWARE AND SOFTWARE ARCHITECTURES


A key characteristic of our definition of distributed systems is that it includes both a hardware
aspect (independent computers) and a software aspect (performing a task and providing a
service). From a hardware point of view distributed systems are generally implemented on multi-
computers. From a software point of view they are generally implemented as distributed
operating systems or middleware.
Multiprocessors - Multiprocessor systems all share a single key property: All the CPUs have
direct access to the shared memory. Bus-based multiprocessors consist of some number of CPUs
all connected to a common bus, along with a memory module.
Multi-computers - A multicomputer consists of separate computing nodes connected to each
other over a network.
 Node Resources - This includes the processors, amount of memory, amount of secondary
storage, etc. available on each node.
 Network Connection - The network connection between the various nodes can have a large
impact on the functionality and applications that such a system can be used for. A
multicomputer with a very high bandwidth network is more suitable for applications that
actively share data over the nodes and modify large amounts of that shared data. A lower
bandwidth network, however, is sufficient for applications where there is less intense
sharing of data.
 Homogeneity - A homogeneous multicomputer is one where all the nodes are
the same, that is they are based on the same physical architecture (e.g. processor, system
bus, memory, etc.). A heterogeneous multicomputer is one where the nodes are not
expected to be the same. One common characteristic of all types of multi-computers is
that the resources on any particular node cannot be directly accessed by any other node.
All access to remote resources takes the form of requests sent over the network to the
node where that resource resides.

PROCESSING CONFIGURATIONS
Processing configurations are classified by two characteristics:
the number of instruction streams and the number of data streams.
 SISD: A computer with a Single Instruction stream and a Single Data stream: All
traditional uni-processor computers.
 SIMD: Single Instruction, Multiple Data: Array processors with one instruction unit that
fetches an instruction and then commands many data units to carry it out in parallel, each
with its own data. Good for vector processing.
 MISD: Multiple Instruction Single Data: pipelined computers, which fetch and process
multiple instructions simultaneously, operating on one data item at a time (textbooks differ on this classification).
 MIMD: Multiple Instruction Multiple Data: A group of independent computers, each
with own program counter, program and data. All distributed systems are MIMD.



Processors can be subdivided further:
Multiprocessors: CPUs share memory. If one CPU writes to location 44, all will see the new
value.
Multi-computers: CPUs do not share memory and each machine has own private memory.
CONNECTION CONFIGURATIONS
Two Multiprocessor Architecture types (based on architecture of the interconnection network):
1. Bus: A single network, backplane or bus cable connects all machines (300 Mbps and
faster for a backplane bus). With as few as 4-5 CPUs the bus becomes overloaded and
performance drops drastically. Solution: add a cache between CPU and bus. If the cache is large,
the hit rate will be high and bus traffic will drop dramatically, allowing many more CPUs to be added.
Supports 32-64 CPUs.
 Write-through cache: Cache hits for reads do not cause bus traffic, but cache misses for
reads, and all writes, cause bus traffic.
 Snoopy cache: When a snoopy cache sees a write to a memory address that it holds,
it either removes or updates the entry in its cache; it 'snoops' on the bus.
2. Switch: Individual wires connect machines to other machines. Used to build a multiprocessor
with more than 64 processors.
 Crossbar switch: Each CPU and each memory has a connection coming out of it. A
cross-point switch at every intersection can be opened and closed in hardware.
Processors can access memory simultaneously, but if two CPUs try to access the same
memory at the same time, one of them will have to wait.
 Omega switch: The network contains 2x2 switches, each having two inputs and two
outputs. The switches can be set in nanoseconds or less. The omega network requires
log2 n switching stages, each containing n/2 switches; for example, with n = 8 CPUs it
needs log2 8 = 3 stages of 4 switches each, 12 switches in total.
 NUMA (Non-Uniform Memory Access): Each CPU can access its own local memory
quickly, but accessing anybody else's memory is slower. A hierarchical system where some
memory is associated with each CPU.
NB: Multi-computers are also configured in a bus or switch type configuration.
Each CPU has a direct connection to its own local memory, so there is much less traffic on the interconnect.

SYSTEM COUPLING TYPES


Software and hardware can be loosely or tightly coupled.
 Tightly coupled: The delay to send a message from one computer to another is short and the
data rate is high. Associated with multiprocessors and parallel systems (which work on one problem).
 Loosely coupled: Inter-machine message delay is large and the data rate is low. Associated with
multi-computers and distributed processing (working on many unrelated problems).
 Distributed system: designed to allow many users to work together.
 Parallel system: the goal is to achieve maximum speedup on a single problem (e.g. 500,000
MIPS). Allocates multiple processors to a single problem and divides the work.

SOFTWARE CONCEPTS
Distributed Operating System - A distributed operating system (DOS) is an operating system
that is built, from the ground up, to provide distributed services. As such, a DOS integrates key
distributed services into its architecture. These services may include distributed shared memory,
assignment of tasks to processors, masking of failures, distributed storage, inter-process



communication, transparent sharing of resources, distributed resource management, etc. A key
property of a distributed operating system is that it strives for a very high level of transparency,
ideally providing a single system image. That is, with an ideal DOS users would not be aware
that they are, in fact, working on a distributed system.
Distributed operating systems generally assume a homogeneous multicomputer. They are also
generally more suited to LAN environments than to wide-area network environments. In the
earlier days of distributed systems research, distributed operating systems were the main topic
of interest. Most research focused on ways of integrating distributed services into the operating
system, or on ways of distributing traditional operating system services. Currently, however, the
emphasis has shifted more toward middleware systems. The main reason for this is that
middleware is more flexible (i.e., it does not require that users install and run a particular
operating system), and is more suitable for heterogeneous and wide-area multi-computers.
Multicomputer operating systems that do not provide a notion of shared memory can offer only
message-passing facilities to applications. Unfortunately, the semantics of message-passing
primitives may vary widely between different systems. It is easiest to explain their differences by
considering whether or not messages are buffered. In addition, we need to take into account
when, if ever, a sending or receiving process is blocked.

NETWORK OPERATING SYSTEMS


In contrast to distributed operating systems, network operating systems do not assume that the
underlying hardware is homogeneous or that it should be managed as if it were a single system.
Instead, they are generally constructed from a collection of uniprocessor systems, each with its
own operating system, as shown in the figure below.

Figure: Network Operating System


Difference between DOS and NOS
Figure below shows a shared, global file system accessible from all the workstations in a
Network Operating System. The file system is supported by one or more machines called file
servers. The file servers accept requests from user programs running on the other machines,
called clients, to read and write files.



THE MIDDLEWARE
Whereas a DOS attempts to create a specific system for distributed applications, the goal of
middleware is to create system-independent interfaces for distributed applications.

Figure: Middleware


As shown in the figure above, middleware consists of a layer of services added between those of a
regular network OS and the actual applications. These services facilitate the implementation of
distributed applications and attempt to hide the heterogeneity of the underlying system
architectures (both hardware and software). The principal aim of middleware, namely raising the
level of abstraction for distributed programming, is achieved in three ways: communication
mechanisms that are more convenient and less error-prone than message passing; independence
from OS, network protocol, programming language, etc.; and standard services (such as a
naming service).
To make the integration of these various services easier, and to improve transparency and system
independence, middleware is usually based on a particular paradigm, or model, for describing
distribution and communication. This often manifests itself in a particular programming model
such as ‘everything is a file’, remote procedure call, or distributed objects. Providing such a
paradigm automatically gives programmers an abstraction to follow, and provides direction
for how to design and set up distributed applications.

Although some forms of middleware focus on adding support for distributed computing directly
into a language, middleware is generally implemented as a set of libraries and tools that enable



retrofitting of distributed computing capabilities to existing programming languages. Such
systems typically use a central mechanism of the host language (such as the procedure call and
method invocation) and dress remote operations up such that they use the same syntax as that
mechanism, resulting for example in remote procedure calls and remote method invocation.
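
A small sketch of this idea using Python's built-in xmlrpc modules (the function name and port
are invented for the example); note that the remote call on the last line uses exactly the same
syntax as a local method invocation:

    # server side
    from xmlrpc.server import SimpleXMLRPCServer

    def add(a, b):
        return a + b

    server = SimpleXMLRPCServer(("localhost", 8000))
    server.register_function(add)        # expose add() as a remote procedure
    server.serve_forever()

    # client side (run in another process)
    import xmlrpc.client
    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    print(proxy.add(2, 3))               # remote call, local-call syntax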
Because an important goal of middleware is to hide the heterogeneity of the underlying systems
(and in particular of the services offered by the underlying OS), middleware systems often try to
offer a complete set of services so that clients do not have to rely on underlying OS services
directly. This provides transparency for programmers writing distributed applications using the
given middleware. Unfortunately, this ‘everything but the kitchen sink’ approach often leads to
highly bloated systems. As such, current systems exhibit an unhealthy tendency to include more
and more functionality in basic middleware and its extensions, which leads to a jungle of bloated
interfaces.



INTRODUCTION TO DISTRIBUTED SYSTEMS
Definitions
 A distributed system consists of a number of components, which are by themselves
computer systems. These components are connected by some communication medium,
usually a sophisticated network. Applications execute by using a number of processes in
different component systems. These processes communicate and interact to achieve
productive work within the application.
 A distributed system in this context is simply a collection of autonomous computers
connected by a computer network to enable resource sharing and co-operation between
applications to achieve a given task.
 A Distributed System is one that runs a collection of machines that do not have shared
memory, yet it looks to its users as a single computer.
 A Distributed System is a collection of independent computers that appear to the users of
the system as a single computer.

Goals of Distributed Systems


 A state-of-the-art distributed system is one that combines the accessibility, coherence and
manageability advantages of centralized systems
 Has the sharing, growth, cost and autonomy advantages of networked systems
 Has the added advantage of security, availability and reliability.
 Distribution should be concealed, giving the users the illusion that all available resources
are located at the user’s workstation.

NEED FOR A DISTRIBUTED SYSTEM


 Resource sharing - People / processes can share scarce hardware and software resources.
 Flexibility – concerns adding new resource-sharing services without disruption /
duplication of existing services
 Concurrency – ability to run several processes simultaneously while in different
components of the system
 Scalability – the need to have the system grow as requirements increase. System and
application software should not need to change when the scale of the system increases. As
demand for a resource grows, it should be possible to extend the system to meet it.
 Reliability: If a machine goes down during processing, some other machine takes over
the job. When one of the components in a distributed system fails, only the work that was
using the failed component is affected.
 Transparency: Concealment from the user and application programmer of the separation
of components in a distributed system so that the system is perceived as a whole rather
than as a collection of independent components.

REASONS FOR USE OF DISTRIBUTED SYSTEMS


The alternative to using a distributed system is usually to have a huge centralized system, such as
a mainframe. For many applications there are a number of economic and technical reasons that
make distributed systems much more attractive than their centralized counterparts.
 Cost - Better price/performance as long as commodity hardware is used for the
component computers



 Performance - By using the combined processing and storage capacity of many nodes,
performance levels can be reached that are beyond the range of centralized machines
 Scalability - Resources such as processing and storage capacity can be increased
incrementally.
 Transparency - An important goal of a distributed system is to hide the fact that its
processes and resources are physically distributed across multiple computers. A
distributed system that is able to present itself to users and applications as if it were only
a single computer system is said to be transparent. Figure below shows different forms of
transparency in a distributed system.
 Reliability - By having redundant components, the impact of hardware and software
faults on users can be reduced. However, these advantages are often offset by the
following problems encountered during the use and development of distributed systems:
 Limited Software - As will become clear throughout this course distributed software is
harder to develop than conventional software; hence, it is more expensive to develop and
there is less such software available
 New Components - Networks are needed to connect independent nodes and are subject
to performance limitations. Besides these limitations, networks also constitute new
potential points of failure
 Security - Because a distributed system consists of multiple components there are more
elements that can be compromised and must, therefore, be secured. This makes it easier
to compromise distributed systems.

CHARACTERISTICS OF DISTRIBUTED SYSTEMS


 Multiple autonomous processing elements – A distributed system is composed of
several independent components each with processing ability. There is no master-slave
relationship between processing elements. Thus, it excludes traditional centralized
mainframe based systems.
 Information exchange over a network – the network connects autonomous processing
elements that communicate using various protocols
 Processes interact via non-shared local memory – a DS may assume a hybrid configuration
involving separate computers that emulate a distributed shared memory. Multiple-processor
computer systems can be classified into those that share memory (multiprocessor
computers) and those without shared memory (multi-computers).
 Multiple Computers – there is more than one physical computer, each consisting of a
processor, local memory, a stable storage module and input-output paths that
connect it with other components in the distributed system environment
 Interconnections – there are mechanisms and configurations for communicating with the
other nodes via the network
 Shared state – subsets of nodes cooperate to provide services which are distributed or
replicated among the participants or users
 Transparency – A distributed system is designed to conceal from the users the fact that
they are operating over a wide spread geographical area and provide the illusion of a
single desktop environment. It should allow every part of the system to be viewed the
same way regardless of the system size and provide services the same way to every part
of the system. Some aspects of transparency include:



 Global names – the same name works everywhere. Machines, users, files, control
groups and services have full names that mean the same thing regardless of where
in the system the name is used.
 Global access – the same functions are usable everywhere with reasonable
performance. A program can run anywhere and get the same results. All the
services and objects required by a program to run are available to the program
regardless of where in the system the program is executing.
 Global security – the same user authentication and access control work everywhere,
e.g. the same mechanism lets the person next door and someone at another site
read one's files, and authentication works to any computer in the system.
 Global management – the same person can administer system components
anywhere. System management tools perform the same actions, e.g. the
configuration of workstations.

ADVANTAGES OF DISTRIBUTED SYSTEMS


 It can be more fault-tolerant. It can be designed so that if one component of the system
fails then the others will continue to work. Such a system will provide useful work in the
face of quite a large number of failures in individual component systems.
 It is more flexible - A distributed system can be made up from a number of different
components. Some of these components may be specialized for a specific task while others
may be general purpose. Components can be added, upgraded, moved and removed without
impacting upon other components.
 It is easier to extend – more processing, storage or other capacity can be obtained
by increasing the number of components.
 It is easier to upgrade - A distributed system may be upgraded in increments by replacing
individual components without a major disruption, or a large cash injection. When a single
large computer system becomes obsolete all of it has to be replaced in a costly and
disruptive operation.
 Local autonomy – by allowing domains of control to be defined where decisions are made
relating to purchasing, ownership, operating priorities, IS development and management,
etc. Each domain decides where resources under its control are located.
 Increased Reliability and Availability – In a distributed system, multiple components of
the same type can be configured to fail independently. This aspect of replication of
components improves the fault tolerance in distributed systems, consequently, the
reliability and availability of the system is enhanced. In a centralized system, a component
failure can mean that the whole system is down, stopping all users from getting services.
 Improved Performance – A distributed system can have a service that is partitioned over
many server computers, each supporting a smaller set of users, and users' access to local data and
resources results in faster access. Another performance advantage is the support for parallel
access to distributed data across the organization. Large centralized systems can be slow
performers due to the sheer volume of data and transactions being handled.
 Security breaches are localized – In distributed systems with multiple security control
domains, a security breach in one domain does not compromise the whole system. Each
security domain has varying degree of security authentication, access control and auditing.



DISADVANTAGES OF DISTRIBUTED SYSTEMS
 It’s more difficult to manage and secure – Centralized systems are inherently easier to
secure and easier to manage because control is done from a single point. Distributed
systems require more complex procedures for security, administration, maintenance and
user support due to greater levels of co-ordination and control required.
 Lack of skilled support and development staff – Since the equipment and software in a
DS can be sourced from different vendors, unlike in traditional systems where everything
is sourced from the same vendor, it is difficult to find personnel with a wide enough range
of skills to offer comprehensive support.
 Introduce problems of maintaining consistency of data.
 Introduce problems of synchronization between processes.
 Significantly more complex in structure and implementation
 Communication network can lose messages and become overloaded.
 Security can become a problem, since a computer is most secure if it minimizes network
connections and is kept in a locked room.

DESIGN ISSUES OF DISTRIBUTED SYSTEMS


There are key design issues that people building distributed systems must deal with, with the goal
of ensuring that they are attained. These are:
Transparency – It is described as “the concealment from the user and the application
programmer of the separation of components in a distributed system so that the system is
perceived as a whole rather than a collection of independent components.” Transparency therefore
involves hiding all the distribution from human users and application programs:
 Human users – In terms of the commands issued from the terminal and the results
displayed on the terminal. The distributed system can be made to look just like a single
processor system.
 Programs – At the lower level, the distribution should be hidden from programs.
Transparency minimizes the difference to the application developer between
programming for a distributed system and programming for a single machine. In other
words, the system call interface should be designed such that the existence of multiple
processors is not visible. A file should be accessed the same way whether it is local or
remote. A system in which remote files are accessed by explicitly setting up a network
connection to a remote server and then sending messages to it is not transparent, because
remote services are being accessed differently from local ones.

Fault Tolerance – Since failures are inevitable, a computer system can be made more reliable by
making it fault tolerant. A fault tolerant system is one designed to fulfill its specified purposes
despite the occurrence of component failures (machine and network). Fault tolerant systems are
designed to mask component failures i.e. attempt to prevent the failure of a system in spite of the
failure of some of its components. Fault tolerance can be achieved through hardware and
software.
Although fault tolerance improves the system availability and reliability, it brings some
overheads in terms of:
 Cost - increased system costs
 Software development – recovery mechanisms and testing
 Performance – makes system slower in updates of replicas



 Consistency – maintaining data consistency is not trivial
Concurrency – Concurrency arises in a system when several processes run in parallel. If these
processes are not controlled, inconsistencies may arise in the system. This is a central issue in
distributed systems, because designers must control concurrent access carefully to avoid
inconsistency and conflicts. The end result is to achieve a serial-access illusion. Concurrency
control is important to achieve proper resource sharing and co-operation of processes.
Uncontrolled interleaving of the sub-operations of concurrent transactions can lead to well-known
problems such as lost updates and inconsistent retrievals, as the sketch below illustrates.
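
A minimal sketch of a lost update, using two threads and a shared balance (the figures are
arbitrary); without the lock, the read-modify-write steps of the two threads can interleave and
some increments are lost:

    import threading

    balance = 0
    lock = threading.Lock()

    def deposit(times):
        global balance
        for _ in range(times):
            with lock:               # comment this out and updates may be lost
                balance += 1         # read, add, write back: must not interleave

    t1 = threading.Thread(target=deposit, args=(100000,))
    t2 = threading.Thread(target=deposit, args=(100000,))
    t1.start(); t2.start(); t1.join(); t2.join()
    print(balance)                   # 200000 with the lock; possibly less without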
Openness – the ability of the system to accommodate different technology (hardware and
software components) without changing the underlying structure of the system. For example, the
ability to accommodate a 64-bit processor where a 32-bit processor was being used, or to
accommodate a machine running Mac OS in a predominantly Windows system, without changing
the underlying system structure.

Scalability – Each component of a distributed system has a finite capacity. Designing for
scalability involves calculating the capacity of each of these elements and the extent to which that
capacity can be increased. Good distributed systems design minimizes the use of components
that are not scalable. Also, the element that is weakest in terms of available capacity (and the
extent to which the capacity can be increased) should be of prime importance in the design.
There are four principal components to be considered when designing for scalability: client
workstations, the LAN, servers and the WAN.

Performance – Common measures of performance for distributed systems include:
 Response time – the average elapsed time from the moment the user is ready
to transmit until the entire response is received.
 Throughput – the number of requests handled per unit time.
 Latency – the delay between the start of a message’s transmission from one process and
the beginning of its receipt by another.
 Bandwidth – the total amount of information that can be transmitted over a given time unit.
 Jitter – the variation in the time taken to deliver a series of messages.

Performance improvements can be made in a distributed systems environment by migrating much
of the processing onto the user’s client workstation. This reduces the processing on the server per
client request, which leads to faster and more predictable response times. Data-intensive
applications can improve performance by avoiding I/O operations that read from disk storage;
reading from buffer areas in memory is much faster. Applications invoking remote operations
offered by remote servers can improve performance by avoiding the need to access a remote
server to satisfy a request. A caching system reduces the performance cost of I/O and remote
operations by storing the results of recently executed I/O or remote operations in memory and re-
using the same data whenever the same operation is re-invoked and it can be ascertained
that the data is still valid.
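
A toy sketch of such a result cache (the remote_fetch callback and the 30-second validity window
are invented for this example):

    import time

    CACHE_TTL = 30.0                  # assume results stay valid for 30 seconds
    _cache = {}                       # request -> (result, time stored)

    def cached_call(request, remote_fetch):
        entry = _cache.get(request)
        if entry and time.time() - entry[1] < CACHE_TTL:
            return entry[0]           # still valid: skip the remote operation
        result = remote_fetch(request)    # the expensive I/O or remote call
        _cache[request] = (result, time.time())
        return result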

COMPUTING SYSTEM MODELS

Client Server Model – This is the most widely used paradigm for structuring distributed systems.
A client requests a particular service. One or more processes called servers are responsible for



the provision of services to clients. Services are accessed via a well-defined interface that is
made known to the clients. On the receipt of a request the server executes the appropriate
operation and sends a reply back to the client. The interaction is known as request/reply or
interrogation. Both clients and servers are run as user processes. A single computer may run a
single client or server process or may run multiple client or server processes. A server process is
normally persistent (non-terminating) and provides services to more than one client process. The
main distinction between master –slave and client/server models is in the fact that client and
server processes are on equal footing but with distinct roles.
 Master –Slave model – It may not be an appropriate model for structuring a distributed
system. In this model, a master process initiates and controls any dialogue with the slave
processes. Slave processes exhibit very little intelligence, responding to commands from
a single master process and exchange messages only when invited by the master process.
The slave process merely complies with the dialogue rules set by the master. This is the
model on which centralized systems were based and has limited applications in
distributed systems because it does not make the best use of distributed resources and is a
single point of failure.

 Peer-to-peer Model – This model is quite similar to the client/server model. Using a
small, manageable number of servers (i.e. increased centralization of resources) eases
system management compared to a case where potentially every computer is configured
as both a client and a server. The latter arrangement is known as a peer-to-peer model
because every process has the same functionality as its peer processes.

 Group Model – In many circumstances, a set of processes need to co-operate in such a
way that one process may need to send a message to all other processes in the group and
receive responses from one or more members. For example, in a video conference
involving multiple participants and a whiteboard facility, when someone writes to the
board, every other participant must receive the new image. In this model a set of group
members are conveniently modeled to behave as a single unit called a group. When a
message is sent to the group interface, all the members receive it.

There are different approaches to routing a ‘group’ message to every member:
 Unicasting: point-to-point sending of a message from a single sender to a single receiver.
 Broadcasting: sending a message to all of the computers in a given network
environment.
 Multicasting: sending a message to the members of a specified group of processes. A
single group send operation will (hopefully) result in a receive operation performed by
each member of the process group (see the sketch below).
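
A sketch of UDP multicast in Python (the group address 224.1.1.1 and the port are arbitrary
examples; multicast loopback lets a single host receive its own message, so this runs as one script):

    import socket, struct

    GROUP, PORT = "224.1.1.1", 5007

    # receiver: bind to the port and join the multicast group
    r = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    r.bind(("", PORT))
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    r.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    # sender: one send operation reaches every member of the group
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    s.sendto(b"the time is 18:01", (GROUP, PORT))

    print(r.recvfrom(1024)[0])       # every joined receiver gets the message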

REASONS FOR MULTICASTING:


 Locating an object - the client multicasts a message containing the name of a file
directory to a group of file-server processes. Only the one which holds the relevant directory
replies to the request.
 Fault tolerance: A client multicasts its requests to a group of server processes, all of
which process the requests identically and one or more of which reply to them.
 Replicated data: data is replicated to increase the performance of a service



 Multiple updates: an event such as 'the time is 18:01' can be multicast to interested
processes.

PROBLEMS WITH MULTICASTING:


What if some processes receive the message and some don't?
 Atomic multicast: the message is received by all processes or else by none of
them. Acknowledgements are required and retransmissions occur; the originator retransmits
until it can assume the message is no longer outstanding in the network. Alternatively, everyone
forwards a received message once to everyone else (unless retransmissions are required to get an
acknowledgement).
 Reliable multicast: a best effort to deliver to all members of a group, but no guarantee.
What if the messages are not received at the same time and in the same order at all nodes?
 Synchronous system: events happen strictly sequentially, with each event taking
essentially zero time to complete. Impossible to build.
 Loosely synchronous system: events take a finite amount of time, but all events appear
in the same order to all parties.
 Virtually synchronous system: since the exact ordering of messages is not always important,
the ordering constraint is relaxed. Example: ISIS, a toolkit from Cornell for building
distributed applications.

Client Server Application Example


 Supports a GUI which uses windows and a mouse
 Presentation Layer Services: GUI (Graphical User Interface)
 Application Logic: data analysis / number crunching, SQL
 Database Management System: search/sort/validate/access the database
 Communications Software: the protocol between server and client

EXAMPLES OF DISTRIBUTED SYSTEMS


 Probably the simplest and most well-known example of a distributed system is the
collection of Web servers - or, more precisely, servers implementing the HTTP protocol -
that jointly provide the distributed database of hypertext and multimedia documents that
we know as the World-Wide Web.
 The computers of a local network that provide a uniform view of a distributed file system,
and the collection of computers on the Internet that implement the Domain Name Service
(DNS).
 Another example of a distributed system is the T3E series of parallel computers by Cray. These
are high-performance machines consisting of a collection of computing nodes linked by a
high-speed network.
 The operating system, UNICOS, presents users with a standard UNIX environment upon
login, but transparently schedules login sessions over a number of available login nodes.
 Despite the fact that the systems in these examples are all similar (because they fulfill the
definition of a distributed system), there are also many differences between them.
 The World-Wide Web and DNS, for example, both operate on a global scale. The
distributed file system, on the other hand, operates on the scale of a LAN, while the Cray
supercomputer operates on an even smaller scale, making use of a specially designed high-speed
network to connect all of its nodes.



COMMUNICATION IN DISTRIBUTED SYSTEM
Inter-process communication is at the heart of all distributed systems. Communication in
distributed systems is always based on low-level message passing as offered by the underlying
network. In this unit we will discuss the rules that communicating processes must adhere to,
known as protocols, and concentrate on structuring those protocols in the form of layers.

Layered Protocols - Due to the absence of shared memory, all communication in distributed
systems is based on exchanging (low-level) messages. When process A wants to communicate
with process B, it first builds a message in its own address space. Then it executes a system call
that causes the operating system to send the message over the network to B. To make it easier to
deal with the numerous levels and issues involved in communication, the International Standards
Organization (ISO) developed a reference model that clearly identifies the various levels
involved, gives them standard names, and points out which level should do which job. This
model is called the OSI model. The figure below shows the seven layers of OSI.

Figure: The Seven Layers of OSI


A message travels from the application layer down to the physical layer, with each layer adding
its own header, so that at the receiving end the message can be reconstructed layer by layer.

Figure: A Typical Message as It Appears on the Network
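
A schematic sketch of this encapsulation (the layer names are simplified to three for
illustration): each layer on the sending side prepends its header, and each layer on the receiving
side strips its own header off again:

    LAYERS = ["transport", "network", "data-link"]   # simplified subset of OSI

    def send_down(message):
        for layer in LAYERS:                 # headers added on the way down
            message = f"[{layer}-hdr]" + message
        return message                       # what appears on the network

    def receive_up(frame):
        for layer in reversed(LAYERS):       # outermost header stripped first
            hdr = f"[{layer}-hdr]"
            assert frame.startswith(hdr)
            frame = frame[len(hdr):]
        return frame                         # original message recovered

    print(receive_up(send_down("hello")))    # prints: hello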

Client-Server TCP
Client-Server interaction in distributed systems is often done using the transport protocols of the
underlying network. With the increasing popularity of the Internet, it is now common to build
client-server applications and systems using TCP. The benefit of TCP compared to UDP is that it
works reliably over any network. The obvious drawback is that TCP introduces considerably



more overhead, especially compared to those cases in which the underlying network is highly
reliable, such as in local-area systems. The figure below shows the normal operation of TCP and
UDP, and transactional TCP.

Middleware Protocols
Middleware logically lives in the application layer, but contains many general-purpose protocols
that warrant their own layers, independent of other, more specific applications.

Communication in Distributed System


While the discussion of communication between processes has, so far, explicitly assumed a
uniprocessor (or multiprocessor) environment, the situation for a distributed system (i.e., a
multicomputer environment) remains similar. The main difference is that in a distributed system
processes running on separate computers cannot directly access each other's memory.
Nevertheless, processes in a distributed system can still communicate through either shared
memory or message passing.

Distributed Shared Memory


Because distributed processes cannot access each other’s memory directly, using shared memory in
a distributed system requires special mechanisms that emulate the presence of directly accessible
shared memory. This is called distributed shared memory (DSM). The idea behind DSM is that
processes on separate computers all have access to the same virtual address space. The memory
pages that make up this address space actually reside on separate computers. Whenever a process
on one of the computers needs to access a particular page it must find the computer actually
hosting that page and request the data from it.
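
A toy sketch of the page-lookup idea behind DSM (the page-owner table and the fetch callback
are invented for illustration; real DSM systems intercept accesses at the memory-management
level):

    PAGE_SIZE = 4096
    page_owner = {0: "hostA", 1: "hostB"}    # page number -> hosting computer
    local_pages = {}                         # pages already fetched to this host

    def read_byte(address, fetch_page):
        page, offset = divmod(address, PAGE_SIZE)
        if page not in local_pages:          # local "page fault"
            owner = page_owner[page]         # find the computer hosting the page
            local_pages[page] = fetch_page(owner, page)   # request over network
        return local_pages[page][offset]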

Message Passing
Message passing in a distributed system is similar to communication using messages in a non-
distributed system. The main difference being that the only mechanism available for the passing
of messages is network communication. At its core message passing involves two operations
send( ) and receive( ). Although these are very simple operations, there are many variations on
the basic model. For example, the communication can be connectionless or connection oriented.
Connection oriented communication requires that the sender and receiver first create a
connection before send( ) and receive( ) can be used. Communication operations can also be
synchronous or asynchronous. In the first case the operations block until a message has been
delivered (or received). In the second case the operations return immediately. Yet another
possible variation involves the buffering of communication. In the buffered case, a message will
be stored if the receiver is not able to pick it up right away. In the unbuffered case the message
will be lost. There are also varying degrees of reliability of the communication. With reliable
communication errors are discovered and fixed transparently. This means that the processes can
assume that a message that is sent will actually arrive at the destination (as long as the
destination process is there to receive it).
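A minimal sketch of these send()/receive() variations using connectionless UDP datagrams in
Python (the port is an illustrative assumption):

    import socket

    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("localhost", 9001))            # illustrative port

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b"hello", ("localhost", 9001))  # connectionless send()

    receiver.setblocking(True)                    # synchronous: recvfrom() blocks
    data, addr = receiver.recvfrom(1024)

    receiver.setblocking(False)                   # asynchronous: return immediately
    try:
        data, addr = receiver.recvfrom(1024)
    except BlockingIOError:
        pass                                      # no message waiting: do other work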

Communication Models
There are numerous ways that communicating processes can be arranged. This section will
discuss some of the most common communication models. These models are distinguished from
each other by the roles that the communicating processes take on.

Client-Server
The client-server model is the most common and widely used model for communication between
processes. In this model one process takes on the role of a server, while all other processes take
on the roles of clients. The server process provides a service (e.g., a time service, a database
service, a banking service, etc.) and the clients are customers of that service. A client sends a
request to a server, the request is processed at the server and a reply is returned to the client. A
typical client-server application can be decomposed into three logical parts: the interface part, the
application logic part, and the data part. Implementations of the client-server model vary with
regards to how the parts are separated over the client and server roles. A thin client
implementation will provide a minimal user interface layer, and leave everything else to the
server. A fat client implementation, on the other hand, will include all of the user interface and
application logic in the client, and only rely on the server to store and provide access to data.
Implementations in between will split up the interface or application logic parts over the clients
and server in different ways.

Vertical Distribution (Multi-Tier) - An extension of the client-server model, the vertical


distribution, or multi-tier, model (see Figure below) distributes the traditional server
functionality over multiple servers. A client request is sent to the first server. During processing
of the request this server will request the services of the next server, who will do the same, until
the final server is reached. In this way the various servers become clients of each other.
Communication in a Multi-tier System
Each server is responsible for a different step (or tier) in the fulfillment of the original client
request. Splitting up the server functionality in this way is beneficial to a system’s scalability as
well as its flexibility. Scalability is improved because the processing load on each individual
server is reduced, and the whole system can therefore accommodate more users. With regards to
flexibility this model allows the internal functionality of each server to be modified as long as the
interfaces provided remain the same.

Horizontally Distributed Web Server


While vertical distribution focuses on splitting up a server’s functionality over multiple
computers, horizontal distribution involves replicating a server’s functionality over multiple
computers. In this case each server machine contains a complete copy of all hosted Web pages
and client requests are passed on to the servers in a round robin fashion. The horizontal
distribution model is generally used to improve scalability (by reducing the load on individual
servers) and reliability (by providing redundancy). Note that it is also possible to combine the
vertical and horizontal distribution models. For example, each of the servers in the vertical
decomposition can be horizontally distributed. Another approach is for each of the replicas in the
horizontal distribution model to themselves be vertically distributed.
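The front end of a horizontally distributed server can be sketched as a round-robin dispatcher
(a minimal sketch; the replica names and the forward() stand-in are illustrative assumptions):

    from itertools import cycle

    replicas = cycle(["server1", "server2", "server3"])   # hypothetical replicas

    def forward(request, replica):
        # stand-in for the real network call to the chosen replica
        return f"{replica} handled {request}"

    def dispatch(request):
        return forward(request, next(replicas))  # replicas used in round-robin order

    print(dispatch("GET /index.html"))   # -> server1 handled GET /index.html
    print(dispatch("GET /index.html"))   # -> server2 handled GET /index.html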

Peer to Peer
Whereas the previous models have all assumed that different processes take on different roles in
the communication model, the peer to peer (P2P) model takes the opposite approach and
assumes that all processes play the same role, and are therefore peers of each other. In the figure
below each process acts as both a client and a server, both sending out requests and processing
incoming requests.

Group Communication
The group communication model provides a departure from the point to point style of
communication assumed so far. In this model of communication a process can send a single
message to a group of other processes. Group communication is often referred to as broadcast
(when a single message is sent out to everyone) and multicast (when a single message is sent out
to a predefined group of recipients). Group communication can be applied in any of the
previously discussed models. It is often used to send requests to a group of replicas, or to send
updates to a group of servers containing the same data. It is also used for service discovery (e.g.,
broadcasting a request saying "who offers this service?") as well as event notification (e.g., to
tell everyone that the printer is on fire). Issues involved with implementing and using group
communication are similar to those involved with regular point-to-point communication,
including reliability and ordering. The issues are more complicated because there are now
multiple recipients of a message, and different combinations of problems may occur. A widely
implemented (but not as widely used) example of group communication is
IP multicast.
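A minimal sketch of IP multicast group communication with Python sockets (the group address
and port are illustrative assumptions; sender and receiver would run as separate processes):

    import socket, struct

    GROUP, PORT = "224.0.0.250", 9002    # illustrative multicast group and port

    # Group member (process 1): join the group, then wait for group messages.
    member = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    member.bind(("", PORT))
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    member.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    data, addr = member.recvfrom(1024)   # blocks until a group message arrives

    # Sender (process 2): one sendto() reaches every member of the group.
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b"update", (GROUP, PORT))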

Communication Abstractions
In the previous topic it was assumed that all processes explicitly send and receive messages (e.g.,
using send ( ) and receive ( )). Although this style of programming is effective and works, it is
not always easy to write correct programs using explicit message passing. In this section we will
discuss a number of communication abstractions that make writing distributed applications
easier. In the same way that higher level programming languages make programming easier by
providing abstractions above assembly language, so do communication abstractions make
programming in distributed systems easier. Some of the abstractions discussed attempt to
completely hide the fact that communication is taking place, while others make no such attempt;
all of them, however, have in common that they hide the details of the communication taking
place. For example, the programmers using any of these abstractions do
not have to know what the underlying communication protocol is, nor do they have to know how
to use any particular operating system communication primitives. The abstractions discussed
below are often used as core foundations of most middleware systems. Using these abstractions,
therefore, generally involves using some sort of middleware framework. This brings with it a
number of the benefits of middleware, in particular the various services associated with the
middleware that tend to make a distributed application programmer’s life easier.
Communication Modes
Before discussing the details of the various abstractions, it is important to make a distinction
between two modes of communication: data-oriented communication and control-oriented
communication. In the first mode, communication serves solely to exchange data between
processes. Although the data might trigger an action at the receiver, there is no explicit transfer
of control implied in this mode. The second mode, control oriented communication, explicitly
associates a transfer of control with every data transfer. Data-oriented communication is clearly
the type of communication used in communication via shared address space and shared memory,
as well as message passing. Control-oriented communication is the mode used by abstractions
such as remote procedure call, remote method invocation, active messages, etc. Note that low-
level communication mechanisms are generally data-oriented while the higher-level ones (e.g.,
middleware) are control-oriented. This is not always the case, however: MPI is a data-oriented
mode of communication implemented at a higher level, while some operating systems provide
RPC (control-oriented) at a low level.

CLIENT-SERVER STUBS
Remote Procedure Call (RPC)
The idea behind a remote procedure call (RPC) is to replace the explicit message passing model
with the model of executing a procedure call on a remote node. A programmer using RPC simply
performs a procedure call, while behind the scenes messages are transferred between the client
and server machines.
In theory the programmer is unaware of any communication taking place.

Client and Server Stubs


The figure below shows the steps taken when an RPC is invoked. The numbers in the figure correspond to
the following steps.
1. Client program calls client stub routine (normal procedure call)
2. Client stub packs parameters into message data structure (marshalling)
3. Client stub performs send( ) syscall and blocks
4. Kernel transfers message to remote kernel
5. Remote kernel delivers to server stub procedure, blocked in receive ( )
6. Server stub unpacks message, calls service procedure (normal proc call)
7. Service procedure returns to stub, which packs result into message
8. Server stub performs send ( ) syscall
9. Kernel delivers to client stub, which unpacks and returns
A server that provides remote procedure call services defines the available procedures in a
service interface. A service interface is generally defined in an interface definition language
(IDL), which is a simplified programming language, sufficient for defining data types and
procedure signatures but not for writing executable code. The IDL service interface definition is
used to generate client and server stub code. The stub code is then compiled and linked in with
the client program and service procedure implementations respectively.
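The whole arrangement can be illustrated with Python's standard xmlrpc package, where the
ServerProxy object plays the role of the client stub and the server generates the skeleton for us
(the port and the add procedure are illustrative assumptions):

    # Server side: register an ordinary procedure in the service interface.
    from xmlrpc.server import SimpleXMLRPCServer

    def add(a, b):                        # a normal local procedure
        return a + b

    server = SimpleXMLRPCServer(("localhost", 9003), logRequests=False)
    server.register_function(add, "add")
    server.serve_forever()                # blocks, serving incoming RPCs

    # Client side (separate process): the proxy marshals the arguments, sends
    # the request, blocks, and unmarshals the result; it looks like a local call.
    import xmlrpc.client
    proxy = xmlrpc.client.ServerProxy("http://localhost:9003")
    print(proxy.add(2, 3))                # -> 5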

An important part of marshalling is converting data into a format that can be understood by the
receiver. Generally, differences in format can be handled by defining a standard network format
into which all data is converted. However, this may be wasteful if two communicating machines
use the same internal format, but that format differs from the network format. To avoid this
problem, an alternative is to indicate the format used in the transmitted message and rely on the
receiver to apply conversion where required. Because pointers cannot be shared between remote
processes (i.e., addresses cannot be transferred verbatim as they are usually meaningless in
another address space), it is necessary to flatten, or serialise, all pointer-based data structures
when they are passed to the RPC client stub. At the server stub, these serialised data structures
must be unpacked and recreated in the recipient's address space. Unfortunately this approach
presents problems with aliasing and cyclic structures.
Another approach to dealing with pointers involves the server sending a request for the
referenced data to the client every time a pointer is encountered. In general the RPC abstraction
assumes synchronous, or blocking, communication. This means that clients invoking RPCs are
blocked until the procedure has been executed remotely and a reply returned. Although this is
often the desired behaviour, sometimes the waiting is not necessary. For example, if the
procedure does not return any values it is not necessary to wait for a reply. In this case it is better
for the RPC to return as soon as the server acknowledges receipt of the message. This is called
an asynchronous RPC.

a) Interaction between a client and server in a traditional RPC
b) The interaction using asynchronous RPC
It is also possible that a client does require a reply, but does not need it right away and does not
want to block for it either. An example of this is a client that pre-fetches the network addresses
of hosts that it expects to contact later. The information is important to the client, but as it is not
needed right away the client does not want to wait for it. In this case it is best if the server
performs an asynchronous call back to the client when the results are available. This is known as
deferred synchronous RPC.

A Client and Server interaction through two Asynchronous RPCs
A final issue that has been silently ignored so far is how a client stub knows where to send the
RPC message. In a regular procedure call the address of the procedure is determined at compile
time, and the call is then made directly. In RPC this information is acquired from a binding
service; a service that allows registration and lookup of services. A binding service typically
provides an interface similar to the following:
 Register (name, version, handle, UID)
 Deregister (name, version, UID)
 Lookup (name, version) → (handle, UID)

Here handle is some physical address (IP address, process ID, etc.) and UID is used to
distinguish between servers offering the same service. Moreover, it is important to include
version information as the flexibility requirement for distributed system requires us to deal with
different versions of the same software in a heterogeneous environment.
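A binding service of this kind can be sketched as a simple in-memory registry (a minimal sketch
under assumed data structures, not a real binder implementation):

    binder = {}   # (name, version) -> list of (handle, uid) entries

    def register(name, version, handle, uid):
        binder.setdefault((name, version), []).append((handle, uid))

    def deregister(name, version, uid):
        entries = binder.get((name, version), [])
        binder[(name, version)] = [e for e in entries if e[1] != uid]

    def lookup(name, version):
        entries = binder.get((name, version))
        if not entries:
            raise LookupError(f"no server offers {name} version {version}")
        return entries[0]   # (handle, uid); a real binder might load-balance here

    register("time_service", "1.0", "10.0.0.5:9000", 42)   # illustrative values
    print(lookup("time_service", "1.0"))                   # -> ('10.0.0.5:9000', 42)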

Remote Method Invocation (RMI)


When using RPC, programmers must explicitly specify the server on which they want to perform
the call (possibly using information retrieved from a binding service). Furthermore, it is
complicated for a server to keep track of the different state belonging to different clients and
their invocations. These problems with RPC led to the remote method invocation (RMI)
abstraction. The transition from RPC to RMI is, at its core, a transition from the server metaphor
to the object metaphor. When using RMI, programmers invoke methods on remote objects. The
object metaphor associates all operations with the data that they operate on, meaning that state is
encapsulated in the remote object and much easier to keep track of. Furthermore, the concept of
a remote object improves location transparency: once a client is bound to a remote object, it no
longer has to worry about where that object is located. Also, objects are first-class citizens in an
object-based model, meaning that they can be passed as arguments or received as results in RMI.
This helps to relieve many of the problems associated with passing pointers in RPC. Although,
technically, RMI is a small evolutionary step from RPC, the model of remote and distributed
objects is very powerful.

The Danger of Transparency


Unfortunately, the illusion of a procedure call is not perfect for RPCs, and that of a method
invocation is not perfect for RMI. The reason for this is that an RPC or RMI can fail in ways that
a “real” procedure call or method invocation cannot. This is due to the problems such as not
being able to locate a service (e.g., it may be down or have the wrong version), messages getting
lost, servers crashing while executing a procedure, etc. As a result, the client code has to handle
error cases that are specific to RPCs. Furthermore, RPC and RMI involve many more software
layers than local system calls
and also incur network latencies. Both form potential performance bottlenecks. The code must,
therefore, be carefully optimized and should use lightweight network protocols. Moreover, as
copying often dominates the overhead, hardware support can help. This includes DMA directly
to/from user buffers and scatter-gather network interfaces that can compose a message from data
at different addresses on the fly. Finally, issues of concurrency control can show up in subtle
ways that, again, break the illusion of executing a local operation.
 Synchronous communication: the sender of a message blocks until the message has been
received by the intended recipient. Synchronous communication is usually even stronger
than this, in that the sender often blocks until the receiver has processed the message and
the sender has received a reply. In asynchronous communication, on the other hand, the
sender continues execution immediately after sending a message.
 Transient communication: a message will only be delivered if a receiver is active. If
there is no active receiver process (i.e., no one interested in receiving messages) then an
undeliverable message will simply be dropped. In persistent communication, however, a
message will be stored in the system until it can be delivered to the intended recipient.

Message - Oriented Communication


Due to the dangers of RPC and RMI, and the fact that those models are generally limited to
synchronous (and transient) communication, alternative abstractions are often needed. The
message-oriented communication abstraction is one of these and does not attempt to hide the fact
that communication is
taking place. Instead its goal is to make the use of flexible message passing easier. Message-
oriented communication is provided by message-oriented middleware (MOM). Besides
providing many variations of the send( ) and receive( ) primitives, MOM also provides the
infrastructure required to support persistent communication. The send( ) and receive( )
primitives offered by MOM also abstract from the underlying operating system or hardware
primitives. As such, MOM allows programmers to use message passing without having to be
aware of what platforms their software will run on, and what services those platforms provide.
As part of this abstraction MOM also provides marshalling services. Furthermore, as with most
middleware, MOM also provides other services that make building distributed applications
easier.

Message-oriented communication is based around the model of processes sending messages to
each other. Message-oriented communication has two orthogonal properties.
Communication can be synchronous or asynchronous, and it can be transient or persistent.
Whereas RPC and RMI are generally synchronous and transient, message oriented
communication systems make many other options available to programmers.

Message Passing Interface (MPI) is an example of a MOM that is geared toward high
performance transient message passing. MPI is a message passing library that was designed for
parallel computing. It makes use of available networking protocols, and provides a huge array of
functions that basically perform synchronous and asynchronous send ( ) and receive( ). Another
example of MOM is MQ Series from IBM. This is an example of a message queuing system. Its
main characteristic is that it provides persistent communication.
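With the mpi4py binding (an assumption: mpi4py installed and the script launched under mpirun
with at least two processes), MPI's transient message passing looks roughly like this:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()                 # this process's identity in the group

    if rank == 0:
        comm.send({"x": 1}, dest=1, tag=0)            # blocking send
        req = comm.isend("more data", dest=1, tag=1)  # asynchronous send
        req.wait()                                    # complete it later
    elif rank == 1:
        data = comm.recv(source=0, tag=0)  # blocks until the message arrives
        more = comm.recv(source=0, tag=1)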

In a message queuing system, messages are sent to other processes by placing them in queues.
The queues hold messages until an intended receiver extracts them from the queue and processes
them. Communication in a message queuing system is largely asynchronous. The basic queue
interface is very simple: there is a primitive to append a message onto the end of a specified
queue, and a primitive to remove the message at the head of a specific queue. These can be
blocking or non-blocking. All
messages contain the name or address of a destination queue. Messages can only be added and
retrieved from local queues. Senders place messages in source queues, while receivers retrieve
messages from destination queues. The underlying system is responsible for transferring
messages from source queues to destination queues. This can be done simply by fetching
messages from source queues and directly sending them to machines responsible for the
appropriate destination queues. Or it can be more complicated and involve relaying messages to
their destination queues through an overlay network of routers. An example of such a system is
shown in the figure below.
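The basic queue interface can be sketched with Python's thread-safe queue module, a local
stand-in for the named, possibly persistent queues a real message queuing system provides:

    import queue

    inbox = queue.Queue()             # stand-in for a named destination queue

    inbox.put("message 1")            # append a message to the end of the queue
    msg = inbox.get(block=True)       # blocking remove from the head

    try:
        msg = inbox.get(block=False)  # non-blocking variant
    except queue.Empty:
        pass                          # queue empty: nothing to process yet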

Stream abstraction
Whereas the previous communication abstractions dealt with discrete communication (that is
they communicated chunks of data), the Stream abstraction deals with continuous
communication, and in particular with the sending and receiving of continuous media. In
continuous media, data is represented as a single stream of data rather than discrete chunks (for
example, an email is a discrete chunk of data, a live radio program is not). The main
characteristic of continuous media is that besides a spatial relationship (i.e., the ordering of the
data), there is also a temporal relationship between the data. Film is a good example of
continuous media. Not only must the frames of a film be played in the right order, they
must also be played at the right time, otherwise the result will be incorrect.

A stream is a communication channel that is meant for transferring continuous media. Streams
can be set up between two communicating processes, or possibly directly between two devices
(e.g., a camera and a TV). Streams of continuous media are examples of isochronous
communication, that is, communication that has minimum and maximum end-to-end time delay
requirements. When dealing with isochronous communication, quality of service is an important
issue. In this case quality of service is related to the time dependent requirements of the
communication. These requirements describe what is required of the underlying distributed
system so that the temporal relationships in a stream can be preserved. This generally involves
timeliness and reliability.

Quality of service requirements are often specified in terms of the parameters of a token bucket
model. In this model, tokens (permission to send a fixed number of bytes) are regularly generated
and stored in a bucket. An application wanting to send data removes the required number of
tokens from the bucket and then sends the data. If the bucket is empty the application must wait
until more tokens are available. If the bucket is full, newly generated tokens are discarded. It is
often necessary to synchronize two or more separate streams. For example, when sending stereo
audio it is necessary to synchronize the left and right channels; likewise, when streaming video it
is necessary to synchronize the audio with the video.
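A minimal token-bucket sketch (the rate and capacity values are illustrative):

    import time

    class TokenBucket:
        def __init__(self, rate, capacity):
            self.rate = rate              # tokens (bytes) generated per second
            self.capacity = capacity      # full bucket: new tokens are discarded
            self.tokens = capacity
            self.last = time.monotonic()

        def try_send(self, nbytes):
            now = time.monotonic()
            # add the tokens generated since the last call, up to capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if nbytes <= self.tokens:
                self.tokens -= nbytes     # remove tokens and send the data
                return True
            return False                  # bucket too empty: caller must wait

    bucket = TokenBucket(rate=1000, capacity=4000)   # 1 KB/s, 4 KB bursts
    print(bucket.try_send(2000))                     # -> True (within the burst)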

Formally, synchronization involves maintaining temporal relationships between sub-streams.


There are two basic approaches to synchronization. The first is the client-based approach, where
it is up to the client receiving the sub-streams to synchronize them. The client uses a
synchronization profile that details how the streams should be synchronized. One possibility is to
base the synchronization on timestamps that are sent along with the stream. A problem with
client-side synchronization is that, if the sub-streams come in as separate streams, the individual
streams may encounter different communication delays. If the difference in delays is significant
the client may be unable to synchronize the streams.

The other approach is for the server to synchronize the streams. By multiplexing the sub-streams
into a single data stream, the client simply has to demultiplex them and perform some
rudimentary synchronization.

DEADLOCKS IN PROCESSING
In centralized deadlock detection, a central coordinator maintains the resource graph. If a cycle
is detected, the coordinator kills off a process to break the deadlock.
 False deadlock: transmission delays in distributed systems can cause the system to think
that a cycle exists when the resources have in fact already been released. With strict
two-phase locking, a transaction cannot release data items and then obtain more, so this
cannot occur; the only remaining case is a transaction that is aborted while in deadlock.
 Edge chasing: a distributed approach to deadlock detection. Process 0 sends a probe
message when it blocks waiting on process 1. If process 1 is itself waiting on resources,
it forwards the probe message(s) to the processes it is waiting on. If a probe message
returns to its original sender, a cycle has been detected.
 A probe message contains: the process that just blocked, the process sending the probe,
and the process to whom the probe is sent. In practice this happens in two steps: the
transaction coordinator indicates what the transaction waits for, and the server indicates
who holds the data item. The oldest transaction has the highest priority, because it has
run the longest.
 Deadlock Prevention: wound-wait (where -> means "waits on"). If an old process -> a
young process, the young process is killed. If a young process -> an old process, the
young process waits.
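A sketch of edge chasing over an assumed wait-for graph (a real implementation would pass
probe messages between machines and also guard against revisiting processes):

    # waits_for[p] lists the processes that p is currently blocked on.
    waits_for = {0: [1], 1: [2], 2: [0]}   # illustrative cycle 0 -> 1 -> 2 -> 0

    def probe(initiator, current):
        # Forward a probe along wait-for edges; True means a cycle (deadlock).
        for nxt in waits_for.get(current, []):
            if nxt == initiator:           # probe returned to its sender
                return True
            if probe(initiator, nxt):      # forward the probe onwards
                return True
        return False

    print(probe(0, 0))   # -> True: process 0 is part of a deadlock cycle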

PROCESSES AND PROCESSORS


Microkernel Architecture Components include:
 Process manager: Creates and performs low-level operations upon processes. Enhanced
by applications: O.S. emulation, language support.
 Thread manager: Thread creation, synchronization, scheduling (across processors).
Scheduling can occur in user-level modules.
 Communication manager: Communication between threads. May include
communication to other processors; otherwise this is provided by an additional service.
 Memory manager: Memory management units, hardware caches.
 Supervisor: handles interrupts, traps and exceptions.
RPC Threads
 Local RPC Call Implementation. Kernel recognizes local RPC during binding. Server and
client share argument stack. No copying or marshaling needed. Calling thread handles
system call, context switch, and upcalls into server code.
 Remote RPC Call Implementation - Kernel creates a pop-up thread when an RPC message
is received. Less copying is required and no context restoration is needed. In the original
implementation the thread blocks waiting for the next call. Example: Sun Solaris threads.

SYNCHRONIZATION OF PROCESSES
There are two main reasons why there is need for synchronization mechanisms:
 Two or more processes may need to co-operate in order to accomplish a given task. This
implies that the operating system must provide facilities for identifying co-operating
processes and synchronizing them.
 Two or more processes may need to compete for access to shared services or resources.
The implication is that the synchronization mechanism must provide facilities for a
process to wait for a resource to become available and another process to signal the
release of that resource.
 When processes are running on the same computer, synchronization is straightforward
since all processes use the same physical clock and can share memory. This can be done
using well-known techniques such as
1. Semaphores - used to provide mutually exclusive access to a non-sharable resource by
preventing concurrent execution of the critical region of a program through which the
non-sharable resource is accessed.
2. A Monitor is a collection of procedures, which may be executed by a collection of
concurrent processes. It protects its internal data from the users, and is a mechanism for
synchronizing access to the resources the procedures use. Since only the monitor can
access its private data, it automatically provides mutual exclusion between client
processes. Entry to the monitor by one process excludes entry by others.
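On a single computer these techniques map directly onto library primitives. A minimal sketch
with a binary semaphore guarding a critical region, using Python threads (the shared counter is
illustrative):

    import threading

    mutex = threading.Semaphore(1)   # binary semaphore protecting the region
    counter = 0                      # the shared, non-sharable resource

    def worker():
        global counter
        for _ in range(100000):
            with mutex:              # wait (P) on entry, signal (V) on exit
                counter += 1         # critical region: one thread at a time

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)                   # always 400000, thanks to mutual exclusion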

Synchronization can either be synchronous (blocking) or asynchronous (non-blocking). A
synchronous process is delayed until it receives a response from the destination process. A
primitive is non-blocking if its execution never delays the invoking process. Non-blocking
primitives must buffer messages to maintain synchronization. This makes programs flexible but
increases their complexity. When blocking versions of message passing are used, programs are
easier to write and synchronization easier to maintain. When send() operation is invoked, the
invoking process blocks until the message is received. A subsequent receive() operation again
blocks the invoking process until a message is actually received.

(a) Synchronous communication (b) Asynchronous communication

SYNCHRONIZATION OF CLOCKS
Logical vs. Physical Clocks:
 Logical clocks: processes must agree on the order in which events occur; it is not necessary
that clocks are synchronized. Physical clocks: all clocks must not deviate from real time by
more than a given tolerance.
 Computer clock: counts the oscillations occurring in a quartz crystal and divides the count.
 Clock drift: oscillators' frequencies vary. A quartz crystal typically drifts by about one part
in 10^5 to 10^6, i.e., roughly one second every 10^5 to 10^6 seconds.
 International Atomic Time (TAI): based on Cesium-133, with an accuracy of about one part
in 10^13. TAI time begins on January 1, 1958.
 Universal Coordinated Time (UTC): based on atomic time, but adds leap seconds.
 The National Institute of Standards and Technology (NIST) Automated Computer Time
Service (ACTS) provides time to clients by telephone modem; it is designed for infrequent
access.
 Satellites: the Geo-stationary Operational Environmental Satellites (GOES) and the Global
Positioning System (GPS). Propagation speed varies with atmospheric conditions; accuracy
is 0.1-10 ms.

External vs. Internal Clocks


External: Synch computer clock with an authoritative, external source of time.
Internal: Synch computer clock with other computers' clocks to a known degree of accuracy.

PHYSICAL CLOCK SYNCHRONIZATION ALGORITHMS


Cristian's Algorithm: use a clock server synchronized with UTC. Each machine issues a message
to the time server requesting the time, and the time server responds with the current time C. The
propagation time is P = (Reply_time - Request_time) / 2; values exceeding a threshold are
discarded and the others averaged. The estimated actual time is C + P. However, time can never
go backwards, so instead of jumping, the clock is slowed down or sped up as appropriate (for
example, adding a few milliseconds per clock interrupt until the time matches). Problem:
dependence on a single time server. Solution: have a number of time servers, broadcast the time
request to all of them, and take the first returned value.
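A sketch of Cristian's algorithm, where get_server_time() is a hypothetical helper that performs
the network round trip to the time server:

    import time

    def cristian_sync(get_server_time):
        # Estimate the server's time, compensating for propagation delay.
        request_time = time.monotonic()
        server_time = get_server_time()       # hypothetical network round trip
        reply_time = time.monotonic()
        propagation = (reply_time - request_time) / 2   # assume symmetric delay
        return server_time + propagation      # best estimate of the server's time

    # demo with a fake server clock that is 5 seconds ahead (illustrative)
    print(cristian_sync(lambda: time.monotonic() + 5.0))

The result is used to adjust the local clock rate rather than to set the clock directly, since time
must never run backwards.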

Berkeley Algorithm: the Berkeley UNIX time daemon polls every machine periodically. The
time server computes the average time from stable machines, taking propagation time into
account, and returns a delta (+ or -) to each machine so that it can adjust its clock. No external
UTC time source is required.

Network Time Protocol (NTP): synchronizes clocks across the Internet using a hierarchy called
the synchronization subnet. Primary servers are connected to a UTC clock source; secondary
servers are synchronized from primary servers; stratum 3 servers are synchronized from stratum
2 servers, and so on. Synchronization is most accurate at the higher levels (lower strata).

Decentralized Algorithm - all processors broadcast their current time every R time units. Each
processor discards the extreme values it receives and averages the rest to obtain the new time
value.

Logical Clock Synchronization Algorithm - Lamport’s Algorithm


Based on the happens-before relation: if A and B are events in the same process and A occurs
before B, then A happens before B (A -> B). If A is the sending of a message by one process and
B is the receipt of that message by another process, then A -> B. The relation is transitive: if
A -> B and B -> C, then A -> C. If a message would be received before it was sent (according to
the receiver's clock), the receiver fast-forwards its clock to the message's send time + 1.
Concurrent: if two events happen in different processes that exchange no messages, their
ordering is unknown. Between every two events in a process, the clock must tick at least once.
No two events ever occur at the same logical time: attach the process number to the timestamp
to break ties if necessary.
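A minimal Lamport logical clock sketch:

    class LamportClock:
        def __init__(self):
            self.time = 0

        def tick(self):               # a local event: the clock must advance
            self.time += 1
            return self.time

        def send(self):               # timestamp attached to an outgoing message
            return self.tick()

        def receive(self, msg_time):  # fast-forward past the sender's timestamp
            self.time = max(self.time, msg_time) + 1
            return self.time

    a, b = LamportClock(), LamportClock()
    t = a.send()      # a.time == 1
    b.receive(t)      # b.time == 2: the receive is ordered after the send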

Mutual Exclusion (mutex)


At most one process may execute in the critical region at a time, and a process requesting entry
is eventually granted it (no starvation). Entry to the critical region should happen in
happened-before ordering. Central server mutual exclusion: a server grants permission to enter a
critical section. A process sends a request message to the server and awaits a reply; the grant
reply gives permission to enter the critical region managed by the server. If the token is held by
another process, the server queues the request until the token becomes available. When the
critical region is exited, the process sends a release message to the central server. Problem: what
if the server fails? A new server must be elected while processes are still waiting for replies, and
a single server can become a bottleneck. But the scheme is simple and efficient.

Ricart and Agrawala Distributed Algorithm:


Uses distributed agreement. A process sends a request to enter the CR to all other processes; the
request contains the name of the critical region, the process number and the time.
 If the receiver does not want to enter the CR, it returns OK.
 If the receiver is in the CR, it queues the request.
 If the receiver also wants to enter the CR, it sends OK if the incoming request is earlier
than its own, or queues the request if it is later.
 When a process exits the CR, it sends OK to all processes on its queue.
Problem: if any process crashes it will not respond to requests. Solution: the receiver always
sends a reply, either granting or denying permission; the sender resends requests periodically and
eventually assumes the destination is dead. Drawbacks: many messages, every process must
track all group members, time consuming, and less efficient than the centralized scheme.
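The receiver's decision rule can be sketched as follows (a simplified single-process view;
message transport is abstracted into a print, and requests are (timestamp, process_id) pairs so
that ties are broken by process number):

    RELEASED, WANTED, HELD = "released", "wanted", "held"

    state = RELEASED      # this process's relationship to the critical region
    my_request = None     # (timestamp, process_id) of our own pending request
    deferred = []         # requests queued while we are in, or ahead of, the CR

    def send_ok(request):
        print("OK ->", request[1])     # stand-in for a real network reply

    def on_request(request):
        # Decide whether to reply OK now or to queue the request.
        if state == HELD or (state == WANTED and my_request < request):
            deferred.append(request)   # we have priority: make the sender wait
        else:
            send_ok(request)           # not interested, or their request is earlier

    def on_exit_critical_region():
        for request in deferred:       # release everyone we made wait
            send_ok(request)
        deferred.clear()

    on_request((3, 2))    # state is RELEASED, so we reply at once: OK -> 2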

Token Ring Algorithm:


Processes take turns as the token circulates around the ring. When a process holds the token it
has the option to enter the critical region; upon exit it passes the token on. There is inefficiency
when no process wants to enter the CR, yet the token message keeps circulating. Processes do
not have to be in a ring configuration physically.
Problems: the token is not obtained in happened-before order, and if the token is lost an election
must occur so that the token can be regenerated.

ELECTION ALGORITHMS
Many distributed algorithms require one process to act as coordinator, and an election selects
that coordinator (for example, electing a time server or a mutual exclusion server). In general,
the elected coordinator is the process with the highest process number.

Bully Algorithm:
The coordinator is always the process with the highest process number. When process P notices
that the coordinator is no longer responding to requests, P sends an ELECTION message to all
processes with higher numbers. Any higher process that responds with OK takes over, sending
ELECTION messages to all processes with numbers higher than its own, and P drops out. If no
one responds, P becomes coordinator and sends a COORDINATOR message to all running
processes. If a higher-numbered process ever boots, it holds an election by sending an
ELECTION message.
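The core of the bully algorithm can be sketched by simulating the message exchange with direct
calls (the process list and liveness table are illustrative assumptions):

    processes = [1, 2, 3, 4, 5]        # process numbers: highest live one wins
    alive = {1: True, 2: True, 3: True, 4: True, 5: False}   # 5 has crashed

    def election(p):
        # Process p holds an election; returns the new coordinator's number.
        higher = [q for q in processes if q > p and alive[q]]
        if not higher:
            return p                   # nobody higher answered: p takes over
        return election(max(higher))   # a live higher process runs its own election

    print(election(2))   # -> 4: the highest live process becomes coordinator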

Ring Algorithm:
The ring algorithm elects the process with the highest number on the ring. When process P
notices that the coordinator is no longer responding to requests, P sends an ELECTION message,
containing its own process number, to the next process on the ring. Each process adds its process
number to the ELECTION message as it passes. When P receives its own ELECTION message
back, it sends a COORDINATOR message around the ring announcing the highest process
number as the winner.

INTER-PROCESS COMMUNICATION (IPC)


When processes in the same local computer wish to interact they make use of an inter-process
communication (IPC) mechanism that is usually provided by the O/S. The most common mode
of communication is via shared memory, since the processes reside on the same computer.
A number of mechanisms are available:

Pipes / Named Pipes - perhaps the most primitive example is a synchronous filter mechanism,
for example the pipe mechanism in UNIX: ls -l | more. The commands ls and more run as two
concurrent processes, with the output of ls connected to the input of more; the overall effect is to
list the contents of the current directory one screen at a time.

File sharing - An alternative mechanism is the use of a local file. This has the advantage that it
can handle large volumes of data and is well understood. This is the basis on which on-line
database systems are built. The major drawback is that there are no inherent synchronization
mechanisms between communicating processes to avoid data corruption, so mechanisms such as
file and record locking are used to allow concurrent processes to communicate while preserving
data consistency. Secondly, communication is inefficient since it uses a relatively slow medium.

Shared Memory - Since all processes are local, the computer’s RAM can be used to implement a
shared memory facility. A common region of memory addressable by all concurrent processes is
used to define shared variables which are used to pass data or for synchronization purposes.
Processes must use semaphores, monitors or other techniques for synchronization purposes. A
good example of a shared memory mechanism is the clipboard facility.

Message Queuing - A common asynchronous linkage mechanism is a message queuing


mechanism that provides the ability for any process to read/write from a named queue.
Synchronization is inherent in the read/write operations and the message queue which together
support asynchronous communication between many different processes. Messages are identified
by a unique identifier, and security is implemented by granting read/write permissions to processes.
IPC mechanisms can be broadly classified into:
 Reliable communication - channels fail only with the end system (e.g. if a central
computer bus fails, usually the entire machine (stable storage/memory/CPU access) fails.
 Unreliable communication - channels exhibit various different types of fault. Messages
may be lost, re-ordered, duplicated, changed to apparently correct but different messages,
and even created as if from nowhere by the channel. All of these problems may have to
be overcome by the IPC mechanism.

SYNCHRONIZATION MODELS
Unicasting – This involves sending a separate copy of the message to each member. An implicit
assumption is that the sender knows the address of every member in the group. This may be not
possible in some systems. In the absence of more sophisticated mechanisms, a system may resort
to unicasting if member addresses are known. The number of network transmissions is
proportional to the number of members in the group.

Multicasting – In this model a single message with a group address can be used for routing
purposes. When a group is first created it is assigned a unique group address. When a member is
added to the group, it is instructed to listen for messages stamped with the group address as well
as for its own unique address. This is an efficient mechanism since the number of network
transmissions is significantly less than for unicasting.

Broadcasting – Broadcast the message by sending a single message with a broadcast address.
The message is sent to every possible entity on the network. Every entity must read the message
and determine whether they should take action or discard it. This may be appropriate in the case
where the address of members is not known since most network protocols implement broadcast
facility. However, if messages are broadcast frequently and there is no efficient network
broadcast mechanism, the network becomes saturated. In some cases, either all group members
must receive a group message or none must receive it; group communication in this case is said
to be atomic. Achieving atomicity in the presence of failures is difficult, resulting in many more
messages being sent. Another aspect of group communication is the ordering of group messages.
For example, in a computer conferencing system a user would expect to receive the original
news item before any response to that item. This is known as ordered multicast, and the
requirement that all multicasts be received in the same order by all group members is common in
distributed systems. Note that atomic multicasting alone does not guarantee that messages are
received by the group members in the order they were sent.

REMOTE INTER - PROCESS COMMUNICATION (IPC)


In a distributed system, processes interact in a logical sense by exchanging messages across a
communication network. This is referred to as remote IPC. As with local processes, remote
processes are either co-operating to complete a defined task or competing for the use of a
resource. Remote IPC can be implemented using the message passing, remote procedure call or
the shared memory paradigm. Remote IPC functions are:
 Process registration for the purpose of identifying communicating processes
 Hide differences between local and remote communication
 Establishing communication channels between processes
 Routing messages to the destination process
 Synchronizing concurrent processes
 Shutting down communication channels
 Enforcing a clean and simple interface, providing a natural environment for modular
structuring of distributed applications.

BINDING
At some point, a process needs to determine the identity of the process with which it is
communicating. This is known as binding. There are two major ways of binding:
 Static binding – destination processes are identified explicitly at program compile time.
Static binding is the most efficient approach and is most appropriate when a client almost
always binds to the same server, although in some systems it is often not possible to
identify all potential destination processes.
 Dynamic binding – source to destination binding are created, modified and deleted at
program run-time. Dynamic binding facilitates location and migration transparency when
processes are referred to indirectly (by name) and mapped to the location address at run-
time. This is normally facilitated by a service known as a directory service.

Binding a Client to a Server
Binding involves two steps:
 Locate the server's machine
 Locate the service on that machine

RPC Semantics in the Presence of Failures
Five classes of failure can occur:
 The client is unable to locate the server.
 The request message from the client to the server is lost.
 The server crashes after receiving a request.
 The reply message from the server to the client is lost.
 The client crashes after sending a request.

MESSAGE PASSING
Direct communication - message passing is a low-level form of remote IPC in which the
developer is explicitly aware of the messages used in communication and of the underlying
message transport mechanism. Processes interact directly, using send and receive (or equivalent)
language primitives to initiate message transmission and reception, explicitly naming the
recipient or sender, for example:
 Send (message, destination_process)
 Receive (message, source_process)
Message passing is the most flexible remote IPC mechanism that can be used to support all types
of process interactions and the underlying transport protocols can be configured by the
application according to the needs of the application. The above example is known as direct
communication.

Indirect communication - here the destination and source identifiers are not process identifiers;
instead, a port (also known as a mailbox) is specified,
which represents an abstract object at which messages are queued. Potentially, any process can
write or read from a port. To send a message to a process, the sending process simply issues a
send operation specifying a well-known port number that is associated with the destination
process. To receive the message, the recipient simply issues a receive specifying the same port
number. For example:
 Send (message, destination_port)
 Receive (message, source_port)
Security constraints can be introduced by allowing the owning process to specify access control
rights on a port. Messages are not lost provided the queue size is adequate for the rate at which
messages are being queued and de-queued.

REMOTE PROCEDURE CALL


Many distributed systems have been based on explicit message exchange between processes.
However, the procedures send and receive do not conceal communication, which is important to
achieve access transparency in distributed systems. The RPC interaction is very similar to a
traditional procedure call in high-level programming languages, except that the caller and the
procedure to be executed are on different computers. A procedure call mechanism that allows the
calling and the called procedures to be running on different computers is known as remote
procedure call (RPC). When a process on machine A calls a procedure on machine B, the calling
process on A is suspended, and execution of the called procedure takes place on B. Information
can be transported from the sender to the recipient in the parameters and can come back in the
procedure result. No message passing at all is visible to the programmer. While the basic idea
sounds simple and elegant, subtle problems exist. To start with, because the calling and called
procedures run on different machines, they execute in different address spaces, which causes
complications. Parameters and results also have to be passed, which can be complicated,
especially if the machines are not identical. Finally, both machines can crash and each of the
possible failures causes different problems. Still, most of these can be dealt with, and RPC is a
widely used technique that underlies many distributed systems.

RPC is popular for developing distributed systems because it looks and behaves like a well-
understood, conventional procedure call in high-level languages. A procedure call is a very
effective tool for implementing abstraction since to use it all one needs to know is the name of
the procedure and arguments associated with it. Packing parameters into a message is called
parameter marshaling. RPC is a remote operation with semantics similar to a local procedure
call and can provide a degree of:
 Access transparency – since a call to a remote procedure may be similar to a local
procedure.
 Location transparency – since the developer can refer to the procedure by name,
unaware of where exactly the remote procedure is located.
 Synchronization – since the process invoking the RPC remains suspended (blocked) until
the remote procedure is completed, just as with a call to a local procedure.
A remote procedure call occurs in the following steps:

1. The client procedure calls the client stub in the normal way.
2. The client stub builds a message and calls the local operating system.
3. The client’s OS sends the message to the remote OS.
4. The remote OS gives the message to the server stub.
5. The server stub unpacks the parameters and calls the server.
6. The server does the work and returns the result to the stub.
7. The server stub packs it in a message and calls its local OS.
8. The server’s OS sends the message to the client’s OS.
9. The client’s OS gives the message to the client stub.
10. The stub unpacks the result and returns to the client.

Stub Generation - Once the RPC protocol has been completely defined, the client and server
stubs need to be implemented. Fortunately, stubs for the same protocol but different procedures
generally differ only in their interface to the applications. An interface consists of a collection of
procedures that can be called by a client, and which are implemented by a server. An interface is
generally available in the same programming language as the one in which the client or server is
written (although this is, strictly speaking, not necessary). To simplify matters, interfaces are
often specified by means of an Interface Definition Language (IDL). An interface specified in
such an IDL is then subsequently compiled into a client stub and a server stub, along with the
appropriate compile-time or run-time
interfaces. Practice shows that using an interface definition language considerably simplifies
client-server applications based on RPCs. Because it is easy to fully generate client and server
stubs, all RPC-based middleware systems offer an IDL to support application development.

MARSHALLING
Marshalling is the process of converting data types from the machine's representation to a
standard representation before transmission, and converting them at the other end from the
standard back to the machine's internal representation. Marshalling is complicated by the use of
global variables and pointers, as they only have meaning in the client's address space; client and
server processes run in different address spaces on separate machines. One solution would be to
pass the data values held by global variables or pointed to by the pointers. However, there are
cases where this will not work, for example when a linked list data structure is being passed to a
procedure that manipulates the list. Differences in the representation of data can be overcome by
use of an agreed language for representing data between client and server processes. For
example, a common syntax for describing and encoding data, known as Abstract Syntax
Notation One (ASN.1), has been defined as an international standard by the International
Organization for Standardization (ISO). ASN.1 is similar to the data declaration statements in a
high-level programming language.
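A minimal marshalling sketch using Python's struct module, which converts values into an
agreed network (big-endian) byte format regardless of either machine's internal representation:

    import struct

    def marshal(x, y):          # pack two 32-bit ints in network byte order
        return struct.pack("!ii", x, y)

    def unmarshal(message):     # the receiver converts back to its own format
        return struct.unpack("!ii", message)

    wire = marshal(2, 3)        # b'\x00\x00\x00\x02\x00\x00\x00\x03'
    print(unmarshal(wire))      # -> (2, 3), whatever the receiver's byte order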
FAILURE HANDLING
RPC failures can be difficult to handle. There are four generalized types of failures that can
occur when an RPC call is made:
 The Client’s request message is lost.
 The client process fails while the server is processing the request.
 The server process fails while servicing the request.
 The reply message is lost.

If the client's message gets lost, the client will wait forever unless a time-out error detection
mechanism is employed. If the client process fails, the server will carry out the remote operation
unnecessarily; if the operation involves updating a data value, this can lead to a loss of data
integrity. Furthermore, the server would generate a reply to a client process that no longer exists,
which must be discarded by the client's machine. When the client restarts, it may send the
request again, causing the server to execute the operation more than once. A similar situation
arises when the server crashes. The server could crash just prior to the execution of the remote
operation, or just after execution completes but before a reply to the client is generated. In either
case, clients will time out and continually generate retries until either the server restarts or the
retry limit is reached.

REMOTE METHOD INVOCATION (RMI)


Remote method invocation allows applications to call object methods located remotely, sharing
resources and processing load across systems. Unlike other systems for remote execution that
require that only simple data types or defined structures be passed to and from methods, RMI
allows any object type to be used - even if the client or server has never encountered it before.
RMI allows both client and server to dynamically load new object types as required.

RMI Applications - RMI is the equivalent of RPC in middleware based on the distributed
objects model. RMI applications are often composed of two separate programs: a
server and a client. A typical server application creates some remote objects, makes references to
them accessible, and waits for clients to invoke methods on these remote objects. A typical client
application gets a remote reference to one or more remote objects in the server and then invokes
methods on them. RMI provides the mechanism by which the server and the client communicate
and pass information back and forth. Such an application is sometimes referred to as a
distributed object application.

THE DISTRIBUTED MEMORY


The main idea is to provide a mechanism for a set of networked workstations to share a single,
paged virtual address space. A reference to local memory is handled in hardware, emulating
multiprocessor caches using the MMU and OS. An attempt to reference an address that is not
local causes a page fault and a trap to the OS, which sends a message to the remote node to fetch
the page and then restarts the faulting instruction. The idea is similar to traditional virtual
memory systems. Distributed Shared Memory (DSM) is an abstraction used for sharing data
between processes in computers that do not share physical memory. The processes appear to
access a single shared memory, which makes DSM a convenient tool for parallel applications
since there is no explicit message passing and no marshalling of data. It is also scalable to a large
number of computers. The main approaches to DSM are hardware-based, page-based, and
library- or object-based.
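The page-based approach can be sketched as a page table that satisfies an access locally when
possible and otherwise fetches the page from the node hosting it (the owners map and the
fetch_remote() stand-in are illustrative assumptions):

    class DsmNode:
        def __init__(self, node_id, owners):
            self.node_id = node_id
            self.owners = owners     # page number -> node hosting that page
            self.local_pages = {}    # pages currently resident on this node

        def read(self, page):
            if page not in self.local_pages:      # the "page fault"
                owner = self.owners[page]         # find the hosting computer
                self.local_pages[page] = self.fetch_remote(owner, page)
            return self.local_pages[page]         # restart the faulting access

        def fetch_remote(self, owner, page):
            # stand-in for the network request to the owning node
            return f"contents of page {page} from node {owner}"

    node = DsmNode(node_id=0, owners={7: 2})
    print(node.read(7))   # faults, fetches page 7 from node 2, then succeeds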

CONSISTENCY MODELS
 Strict Consistency: the ideal programming model: any read of a memory location X
returns the value stored by the most recent write to X. Nearly impossible to implement in
a distributed system; easy on a parallel system or a single system with multiple
threads/processes.
 Sequential Consistency: any valid interleaving is acceptable, but all processes must see
the same sequence of memory references.
 Causal Consistency: happens-before order. All processors agree on the order of writes
issued by processor X, and a read by processor Y that follows receipt of a write by
processor X must occur after that write. Concurrent (non-causal) writes may be seen in a
different order on different machines.
 Pipelined RAM (PRAM) Consistency: writes from different processes may be seen in a
different order; all processors agree only on the order of writes issued by the same
processor X.
 Weak Consistency: the programmer uses a synchronization method to update data.
Synchronization methods may include a critical section, mutual exclusion (mutex), or a
barrier (all processes must arrive at the barrier before any can continue).
 Release Consistency: shared data are made consistent when a critical region is exited;
multiple data items may be associated with one critical section. This exploits the fact
that programmers use synchronization objects.
 Entry Consistency: shared data are made consistent upon entering a critical region. One
synchronization variable is associated with each data object, so multiple data variables
can be updated at a time by different processes.

OBJECT-BASED DSM
An object includes attributes such as object state (internal data) and methods (operations), and
uses information hiding. The shared memory is treated as a collection of separate objects instead
of a linear address space.
 MEMO - a filing or organizational package which coordinates data and tasks between
processes; well suited to job-jar (bag-of-tasks) allocation schemes.
 Caching - used when CPUs share the same physical memory. Since the cache is faster
than memory, it reduces access to the bus for CPUs that share memory, and works with
fewer than about 64 CPUs.
 NUMA (Non-Uniform Memory Access) multiprocessors: all memories are glued
together to create one real address space. Access to remote memory is possible, but
accessing remote memory is slower than accessing local memory, and no caching of
remote memory is allowed.

Distributed object applications need to:


 Locate remote objects: Applications can use one of two mechanisms to obtain references to
remote objects. An application can register its remote objects with RMI's simple naming
facility or the application can pass and return remote object references as part of its normal
operation.
 Communicate with remote objects: Details of communication between remote objects are
handled by RMI; to the programmer, remote communication looks like a standard method
invocation.
 Load class byte codes for objects that are passed around: Because RMI allows a caller to
pass objects to remote objects, RMI provides the necessary mechanisms for loading an
object's code, as well as for transmitting its data.

One of the central and unique features of RMI is its ability to download the bytecodes (or simply
code) of an object's class if the class is not defined in the receiver's virtual machine. The types
and the behavior of an object, previously available only in a single virtual machine, can be
transmitted to another, possibly remote, virtual machine. RMI passes objects by their true type,
so the behavior of those objects is not changed when they are sent to another virtual machine.
This allows new types to be introduced into a remote virtual machine, thus extending the
behavior of an application dynamically.

Creating Distributed Applications Using RMI


When you use RMI to develop a distributed application, you follow these general steps.
 Design and implement the components of your distributed application.
 Compile sources and generate stubs.
 Make classes network accessible.
 Start the application.

IMPLEMENTING APPLICATION COMPONENTS


First, decide on your application architecture and determine which components are local objects
and which ones should be remotely accessible. This step includes:
 Defining the remote interfaces: A remote interface specifies the methods that can be
invoked remotely by a client. Clients program to remote interfaces, not to the
implementation classes of those interfaces. Part of the design of such interfaces is the
determination of any local objects that will be used as parameters and return values for
these methods; if any of these interfaces or classes do not yet exist, you need to define
them as well.
 Implementing the remote objects: Remote objects must implement one or more remote
interfaces. The remote object class may include implementations of other interfaces
(either local or remote) and other methods (which are available only locally). If any local
classes are to be used as parameters or return values to any of these methods, they must
be implemented as well.
 Implementing the clients: Clients that use remote objects can be implemented at any time
after the remote interfaces are defined, including after the remote objects have been
deployed.
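As an illustration of the design, implementation, and registration steps, here is a minimal Java sketch assuming a hypothetical Hello service; the names Hello, HelloImpl, and HelloServer are illustrative only, and error handling is omitted.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    // The remote interface: clients program against this, not the implementation.
    interface Hello extends Remote {
        String sayHello(String caller) throws RemoteException;
    }

    // The remote object implements the remote interface.
    class HelloImpl implements Hello {
        public String sayHello(String caller) {
            return "Hello, " + caller;
        }
    }

    public class HelloServer {
        public static void main(String[] args) throws Exception {
            // Export the remote object and register it with RMI's simple naming facility.
            Hello stub = (Hello) UnicastRemoteObject.exportObject(new HelloImpl(), 0);
            Registry registry = LocateRegistry.createRegistry(1099);
            registry.rebind("Hello", stub);
        }
    }

A client would then obtain a reference via LocateRegistry.getRegistry(host).lookup("Hello") and invoke sayHello as though it were a local method.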
DISTRIBUTED PROCESSING
Distributed processing can be loosely defined as the execution of co-operating processes which
communicate by exchanging messages across an information network. It means that the
infrastructure consists of distributed processors, enabling parallel execution of processes and
message exchanges. Communication and data exchange can be implemented in two main ways:
shared memory, or message exchange/passing (including Remote Procedure Call, RPC).
PROCESSES AND THREADS
A process is a logical representation of a physical processor that executes program code and has
associated state and data. Sometimes described as a virtual processor. A process is the unit of
resource allocation and so is defined by the resources it uses and by the location at which it’s
executing. A process can run either in a separate (private) address space or may share the same
address space with other processes. Processes are created either implicitly (e.g. by the operating
system) or explicitly using an appropriate language construct or O/S construct such as fork( ). In
uni-processor computer systems the illusion of many programs running at the same time is
created using the time slicing technique, but in actual sense there is only one program utilizing
the CPU at any given time. Processes are switched in and out of the CPU so rapidly that each
process appears to be executing continuously. Switching involves saving the state of the
currently active process and setting up the state of another process, an operation known as
context switching.
Threads - Some operating systems allow additional 'child processes' to be created, each
competing for the CPU and other resources with the other processes. All resources belonging to
the 'parent process' are duplicated, thus making them available to the 'child processes'. It is
common for a program to create multiple processes that are required to share memory and other
resources, where a process may wait for a particular event to occur. Some operating systems
support this situation efficiently by allowing a number of processes to share a single address
space. Processes in this context are referred to as threads and the O/S is said to support
multi-threading. The terms processes and threads are often used interchangeably, and a
multithreaded application can be organized in layers as shown.
The general organization of an Internet search engine into three different layers
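As a small, hypothetical illustration of multi-threading (not from the course text), the Java sketch below creates two threads that share a single address space and increment the same counter; the synchronized methods prevent the concurrent increments from interfering.

    // Two threads sharing one address space: both update the same counter object.
    public class SharedCounter {
        private int count = 0;

        // synchronized so that concurrent increments do not interfere
        public synchronized void increment() { count++; }
        public synchronized int get() { return count; }

        public static void main(String[] args) throws InterruptedException {
            SharedCounter counter = new SharedCounter();
            Runnable work = () -> {
                for (int i = 0; i < 100_000; i++) counter.increment();
            };
            Thread t1 = new Thread(work);
            Thread t2 = new Thread(work);
            t1.start();
            t2.start();
            t1.join();   // wait for both threads to finish
            t2.join();
            System.out.println(counter.get());  // prints 200000: both saw the same memory
        }
    }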
SYSTEM NAMES AND NAMING TECHNIQUES
Introduction
Most computer systems (in particular operating systems) manage wide collections of entities
(such as, files, users, hosts, networks, and so on). These entities are referred to by users of the
system and other entities by various kinds of names. Examples of names in UNIX systems
include the following:
 Devices: /dev/hda, /dev/ttyS1
 Files: /boot/vmlinuz, /lectures/DS/notes/tex/naming.tex
For largely historical reasons, different entities are often named using different naming schemes.
We say that they exist in different name spaces. From time to time a new system design attempts
to integrate a variety of entities into a homogeneous name space, and then also attempts to
provide a uniform interface to these entities. For example, a central concept of UNIX systems is
the uniform treatment of files, devices, sockets, and so on. Some systems also introduce a /proc
file system, which maps processes to names in the file system and supports access to process
information through this file interface. In addition, Linux provides access to a variety of kernel
data structures via the /proc file system.
BASIC CONCEPTS
 A name is the fundamental concept underlying naming. We define a name as a string of
bits or characters that is used to refer to an entity. An entity in this case is any resource,
user, process, etc. in the system.
 Entities are accessed by performing operations on them; the operations are performed at
an entity’s access point. An access point is also referred to by a name, we call an access
point’s name an address. Entities may have multiple access points and may therefore
have multiple addresses. Furthermore an entity’s access points may change over time
(that is an entity may get new access points or lose existing ones), which means that the
set of an entity’s addresses may also change.
 A pure name is a name that consists of an uninterpreted bit pattern that does not encode
any of the named entity’s attributes.
 A non-pure name, on the other hand, does encode entity attributes (such as an access
point address) in the name.
 An identifier is a name that uniquely identifies an entity. An identifier refers to at most
one entity and an entity is referred to by at most one identifier. Furthermore an identifier
can never be reused, so that it will always refer to the same entity. Identifiers allow for
easy comparison of entities; if two entities have the same identifier then they are the same
entity. Pure names that are also identifiers are called pure identifiers.
 Location independent names are names that are independent of an entity’s address. They
remain valid even if an entity moves or otherwise changes its address. Note that pure
names are always location independent, though location independent names do not have
to be pure names.
SYSTEM NAMES VERSUS HUMAN NAMES
Related to the purity of names is the distinction between system-oriented and human-oriented
names. Human-oriented names are usually chosen for their mnemonic value, whereas system-
oriented names are a means for efficient access and identification of objects. Taking into account
the desire for transparency human-oriented names would ideally be pure. In contrast, system-
oriented names are often non-pure, which speeds up access to repeatedly used object attributes.
We can characterize these two kinds of names as follows:
System-Oriented Names
System-oriented names are usually implemented as one or more fixed-sized numerals to facilitate
efficient handling. Moreover, they typically need to be unique identifiers and may be sparse to
convey access rights (e.g., capabilities). Depending on whether they are globally or locally
unique, we also call them unstructured or structured: a globally unique integer forms an
unstructured node identifier, whereas a locally unique identifier qualified by further components
forms a structured name. The structuring may be over multiple levels. Note
that a structured name is not pure. Global uniqueness without further mechanism requires a
centralized generator with the usual drawbacks regarding scalability and reliability. In contrast,
distributed generation without excessive communication usually leads to structured names. For
example, a globally unique structured name can be constructed by combining the local time with
a locally unique identifier. Both values can be generated locally and do not require any
communication.
Human-Oriented Names
In many systems, the most important attribute bound to a human-oriented name is the system-
oriented name of the object. All further information about the entity is obtained via the system-
oriented name. This enables the system to perform the usually costly resolution of the human-
oriented name just once and implement all further operations on the basis of the system-oriented
name (which is more efficient to handle). Often a whole set of human-oriented names is mapped
to a single system-oriented name (symbolic links, relative addressing, and so on).
As an example of all this, consider the naming of files in UNIX. A pathname is a human-oriented
name that, by means of the directory structure of the file system, can be resolved to an inode
number, which is a machine-oriented name. All attributes of a file are accessible via the inode
(i.e., the machine-oriented name). By virtue of symbolic and hard links, multiple human-oriented
names may refer to the same inode, which makes equality testing of files merely by their
human-oriented name
impossible. The design space for human-oriented names is considerably wider than that for
system-oriented names. As such naming systems for human-oriented names usually require
considerably greater implementation effort.
NAME SPACES
 Names are grouped and organized into name spaces. A structured name space is
represented as a labeled directed graph with two types of nodes. A leaf node represents a
named entity and stores information about that entity. The information could include the
entity itself, or a reference to the entity (e.g., an address).
 A directory node (also called a context) is an inner node and does not represent any
single entity. Instead it stores a directory table, containing (node - id, edge - label) pairs,
that describes the node’s children. A leaf node only has incoming edges, while a directory
node has both incoming and outgoing edges. A third kind of node, a root node is a
directory node with only outgoing edges.
 A structured name space can be strictly hierarchical or can form a directed acyclic graph
(DAG). In a strictly hierarchical name space a node will only have one incoming edge. In
a DAG name space any node can have multiple incoming edges. It is also possible to
have name spaces with multiple root nodes.
 Scalable systems usually use hierarchically structured name spaces. A sequence of edge
labels leading from one node to another is called a path name.
 A path name is used to refer to a node in the graph. An absolute path name always starts
from a root node, a relative path name is any path name that does not start at the root
node.
Many name spaces support aliasing, in which case an entity may be reachable by multiple paths
from a root node and will therefore be named by numerous path names. There are two types of
aliases.
 A hard link exists when there are two or more paths that directly lead to that entity.
 A soft link occurs when a leaf node holds a pathname that refers to another node.
In this case the leaf node implicitly refers to the file named by the pathname. Ideally we would
have a global, homogeneous name space that contains names for all entities used. However, we
are often faced with the situation where we already have a collection of name spaces that have to
be combined into a larger name space. One approach is to simply create a new name that
combines names from the other name spaces. For example, a Web URL
http://www.raiuniversity.edu/~cs9243/naming-slides.ps globalizes the local name
~cs9243/naming-slides.ps by adding the context www.raiuniversity.edu. Unfortunately, this
approach often compromises location transparency—as is the case in the example of URLs.
Another example of the composition of name spaces is mounting a name space onto a mount
point in a different (external) name space. This approach is often applied to merge file systems
(e.g., mounting a remote file system onto a local mount point). In terms of a name space graph,
mounting requires one directory node to contain information about another directory node in the
external name space. This is similar to the concept of soft linking, except that in this case the link
is to a node outside of the name space. The information contained in the mount point node must,
therefore, include information about where to find the external name space.
NAME RESOLUTION
The process of determining what entity a name refers to is called name resolution. Resolving a
name results in a reference to the entity that the name refers to. Resolving a name in a name
space often results in a reference to the node that the name refers to. Path name resolution is a
process that starts with the resolution of the first element in the path name, and ends with
resolution of the last element in the name. There are two approaches to this process, iterative
resolution and recursive resolution.
Iterative Name Resolution
In iterative resolution the resolver contacts each node directly to resolve each individual element
of the path name.

Recursive Name Resolution

In recursive resolution the resolver only contacts the first node and asks it to resolve the whole
name. This node looks up the node referred to by the first element of the name and then passes
the rest of the name on to that node. The process is repeated until the last element is resolved,
after which the result is returned back through the chain of nodes to the resolver.
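The difference between the two approaches can be sketched as follows; the NameServer interface and its lookup operation are hypothetical illustrations, not part of any real naming service.

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical directory-node interface: resolves ONE path element and
    // returns either the next directory node or the leaf node's contents.
    interface NameServer {
        Object lookup(String element);
    }

    class IterativeResolver {
        // Iterative resolution of a path such as /a/b/c: the resolver itself
        // contacts each directory node in turn for the next step.
        static Object resolve(NameServer root, String pathName) {
            List<String> elements = Arrays.asList(pathName.split("/"));
            Object current = root;
            for (String element : elements) {
                if (element.isEmpty()) continue;          // skip the leading "/"
                current = ((NameServer) current).lookup(element);
            }
            return current;                               // the leaf node's contents
        }
    }

In recursive resolution, by contrast, the loop above would live inside the name server: the resolver would hand the entire path to the root node and receive the final result back.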
A problem with name resolution is how to determine which node to start resolution at. Knowing
how and where to start name resolution is referred to as the closure mechanism. One approach is
to keep an external reference (e.g., in a file) to the root node of the name space. Another
approach is to keep a reference to the ‘current’ directory node for dealing with relative names.
Note that the actual closure mechanism is always implicit, that is it is never explicitly defined in
a name. The reason for this is that if a closure mechanism was defined in a name there would
have to be a way to resolve the name used for that closure mechanism. This would require the
use of a closure mechanism to bootstrap the original closure mechanism. Because this could be
repeated indefinitely, at a certain point an implicit mechanism will always be required.
NAMING SERVICE
A naming service is a service that provides access to a name space allowing clients to perform
operations on the name space. These operations include adding and removing directory or leaf
nodes, modifying the contents of nodes and looking up names. The naming service is
implemented by name servers. Name resolution is performed on behalf of clients by resolvers. A
resolver can be implemented by the client itself, in the kernel, by the name server, or as a
separate service.
Distributed Naming Service
As with most other system services, naming becomes more involved in a distributed
environment. A distributed naming service is implemented using multiple name servers over
which the name space is partitioned and/or replicated. The goal of a distributed naming service is
to distribute both the management and name resolution load over these name servers. Before
discussing implementation aspects of distributed naming services it is useful to split a name
space up into several layers according to the role the nodes play in the name space. These layers
help to determine how and where to partition and replicate that part of the name space. The
highest level nodes belong to the global layer.
A name space partitioned into three layers
A main characteristic of nodes in this layer is that they are stable, meaning that they do not
change much. As such, replicating these nodes is relatively easy because consistency does not
cause much of a problem. The next layer is the administrational layer. The nodes in this layer
generally represent a part of the name space that is associated to a single organizational entity
(e.g., a company or a university). They are relatively stable (but not as stable as the nodes in the
global layer). Finally the lowest layer is
the managerial layer. This layer sees much change. Nodes may be added or removed as well as
have their contents modified. The nodes in the top layers generally see the most traffic and,
therefore, require more effort to keep their performance at an acceptable level.
Typically, a client does not directly converse with a name server, but delegates this to a local
resolver that may use caching to improve performance. Each of the name servers stores one or
more naming contexts, some of which may be replicated. We call the name servers storing
attributes of an object this object’s authoritative name servers.
A comparison between name servers for implementing nodes from a large-scale name space
partitioned into a global layer, an administrational layer, and a managerial layer.
Directory nodes are the smallest unit of distribution and replication of a name space. If they are
all on one host, we have one central server, which is simple, but does not scale and does not
provide fault tolerance. Alternatively, there can be multiple copies of the whole name space,
which is called full replication. Again, this is simple and access may be fast. However, the
replicas will have to be kept consistent and this may become a bottleneck as the system grows.

In the case of a hierarchical name space, partial sub trees (often called zones) may be maintained
by a single server. In the case of the Internet Domain Name Service (DNS), this distribution also
matches the physical distribution of the network. Each zone is associated with a name prefix that
leads from the root
to the zone. Now, each node maintains a prefix table (essentially, a hint cache for name servers
corresponding to zones) and, given a name, the server corresponding to the zone with the longest
matching prefix is contacted. If it is not the authoritative name server, the next zone’s prefix is
broadcast to obtain the corresponding name server (and update the prefix table). As an
alternative to broadcasting, the contacted name server may be able to provide the address of the
authoritative name server for this zone. This scheme can be efficiently implemented, as the
prefix table can be relatively small and, on average, only a small number of messages are needed
for name resolution. Consistency of the prefix table is checked on use, which removes the need
for explicit update messages. For smaller systems, a simpler structure-free distribution scheme
may be used. In this scheme contexts can be freely placed on the available name servers (usually,
however, some distribution policy is in place). Name resolution starts at the root and has to
traverse the complete resolution chain of contexts. This is easy to reconfigure and, for example,
used in the standard naming service of CORBA.
IMPLEMENTATION OF NAMING SERVICES
In the following, we consider a number of issues that must be addressed by implementations of
name services. First, a starting point for name resolution has to be fixed. This essentially means
that the resolver must have a list of name servers that it can contact. This list will usually not
include the root name server to avoid overloading it. Instead, physically close servers are
normally chosen. For example, in the BIND (Berkeley Internet Name Domain) implementation
of DNS, the resolver is implemented as a library linked to the client program. It expects the
file /etc/resolv.conf to contain a list of name servers. Moreover, it facilitates relative naming in
form of the search option.
Name Caches
Name resolution is expensive. For example, studies found that a large proportion of UNIX
system calls (and network traffic in distributed systems) is due to name-mapping operations.
Thus, caching of the results of name resolution on the client is attractive:
 High degree of locality of name lookup thus a reasonably sized name cache can give
good hit ratio.
 Slow update of name information database; thus, the cost for maintaining consistency is
low.
 On-use consistency of cached information is possible; thus, no invalidation on update:
stale entries are detected on use.
There are three types of name caches:
 Directory cache: directory node data is cached. Directory caches are normally used with
iterative name resolution. They require large caches, but are useful for directory listings
etc.
 Prefix cache: path name prefix and zone information is cached. Prefix caching is
unsuitable with structure-free context distribution.
 Full-name cache: full path name information is cached. Full-name caching is mostly
used with structure-free context distribution and tends to require larger cache sizes than
prefix caches.
A name cache can be implemented as a process-local cache, which lives in the address space of
the client process. Such a cache does not need many resources, as it typically will be small in
size, but much of the information may be duplicated in other processes. More seriously, it is a
short-lived cache and incurs a high rate of start-up misses, unless a scheme such as cache
inheritance is used, which propagates cache information from parent to child processes. The
alternative is a kernel cache, which avoids duplicate entries and excessive start-up misses, but
access to a kernel cache is slower and it takes up valuable kernel memory. Alternatively, a shared
cache can be located in a user space cache process that is utilized by clients directly or by
redirection of queries via the kernel (the latter is used in the
CODA file system).
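A minimal sketch of such a process-local full-name cache, using Java's LinkedHashMap in access order for least-recently-used eviction (the class name and capacity are illustrative assumptions):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Maps a full path name to a previously resolved address; the least
    // recently used entry is evicted when the cache grows past capacity.
    class NameCache<A> extends LinkedHashMap<String, A> {
        private final int capacity;

        NameCache(int capacity) {
            super(16, 0.75f, true);   // accessOrder = true gives LRU behaviour
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, A> eldest) {
            return size() > capacity;
        }
    }

Consistency is checked on use: if a cached address no longer works, the entry is dropped and the name is resolved afresh.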
ATTRIBUTE-BASED NAMING
Whereas names as described above encode at most one attribute of the named entity (e.g., a
domain name encodes the entity’s administrative or geographical location) in attribute-based
naming an entity’s name is composed of multiple attributes. An example of an attribute-based
name is given below: /C=AU/O=UNSW/OU=CSE/CN=WWW.server/Hardware=Sparc/OS=Solaris/Server=Apache.
The name not only encodes the location of the entity (/C=AU/O=UNSW/OU=CSE, where C is
the attribute country, O is organization, OU is organizational unit - these are standard attributes
in X.500 and LDAP), it also identifies it as a Web server, and provides information about the
hardware that it runs on, the operating system running on it, and the software used.
Although an entity’s attribute-based name contains information about all attributes, it is common
to also define a distinguished name (DN), which consists of a subset of the attributes and is
sufficient to uniquely identify the entity. In attribute-based naming systems the names are stored
in directories, and each distinguished name refers to a directory entry. Attribute-based naming
services are normally called directory services. Similar to a naming service, a directory service
implements a name space that can be flat or hierarchical. With a hierarchical name space its
structure mirrors the structure of distinguished names. The structure of the name space (i.e., the
naming graph) is defined by a directory information tree (DIT). The actual contents of the
directory (that is the collection of all directory entries) are stored in the directory information
base (DIB).
DISTRIBUTED FILE SYSTEMS
State: File Server maintains information about clients between requests.
    Information retained includes: Who opened which files. Where to read from next in file.
    Advantages:
        Shorter request messages: Internal file name only required.
        Better performance: Open file information is retained in memory for each request.
        File locking possible: Restrict access to one user.
Stateless: File Server replies to requests but does not keep client information between requests.
Each request includes: Full file name and offset into file.
    Advantages:
        Fault tolerance: File server reboots.
        No OPEN/CLOSE calls needed: fewer messages.
        No server space wasted on tables.
        No limits on number of open files.
        No problems if a client crashes: No open files which are never closed.
Dealing with Shared Files:
1. UNIX Semantics: Every operation on a file is instantly visible to all processes.
    Desired: When READ follows WRITE, READ gets value just written.
    Easy Implementation: one file server and no cache files.
    Distributed system: File server cache gives good performance.
    WRITE must be immediately updated.
2. Session Semantics: No changes are visible to other processes until the file is closed.
    WRITEs are updated when file is closed.
3. Immutable Files: No updates are possible.
    Allowable operations include: CREATE and READ.
    WRITEs are not allowed. Simplifies sharing and replication.
    Always create newer versions of the same file with a new version number
4. Atomic Transactions: All changes have the all or nothing property.
    BEGIN TRANSACTION and END TRANSACTION executed indivisibly.
Caching / Buffering
Four places to store files: Server disk, Server memory, Client disk, and Client memory.
Advantages: Plenty of space. Files accessible to all clients. No consistency problems with one
copy.
Disadvantages: Read time: transfer from Server disk to Client memory.
Server Cache:
Advantages: Performance gain. One copy: no consistency problems.
Implementation: Main memory contains an array of blocks the size of disk blocks. Read from
the cache if available, else from disk. Upon a read with a cache miss, release the least recently
used block. Requires time stamping of each read and write.
Handling updates:
 Dirty flag: indicates if the cache has been updated and needs to be written to disk.
Write-through cache: updates go to disk immediately.
Client Cache:
Advantage: Reduces network traffic & delays accessing files.
Disadvantage: More complex, Potential for different version of files in client nodes.
Implementation: Three locations for caching: within the process (no shared cache, e.g. for a
database); in the kernel (processes share the cache, but a kernel call is needed for access); or in
a shared user-space cache process.
REPLICATION
Goal: Replication transparency, i.e. provide backup and split the workload. The architecture includes:
Client Program: does a read and/or write.
Front End: communicates with replica managers and hides the implementation of how replication
is maintained from the client program. Implemented as a user package executed in each client,
or as a separate process. Talks with one or multiple replica managers.
Replica Manager: holds a copy of the data, and performs direct reads/writes to it.
METHODS OF REPLICATION:
Explicit file replication: copies are made explicitly.
Lazy file replication (gossip): updates occur in the background.
Group communication: WRITEs occur simultaneously to all servers.
Primary copy: the primary updates the secondary replicated files.
Advantage: simple for the programmer. Disadvantage: recovery upon primary failure.
Implementation: read from secondary or primary, write to the primary only, elect a slave upon
primary failure. Example: Network Information Service (NIS).
Totally ordered updates: solves the problem of updates arriving out of order.
All requests are sent to a sequencer process, which assigns consecutive sequence numbers and
sends them to the replica managers. All replica managers process requests in the same order.
Problem: sequencer failure or bottleneck.
Sun Network File System (NFS)
Sun NFS was introduced in 1985 and is widely adopted in industry as a de facto standard.
Clients and servers can run different operating systems on different h/w. Implementation: uses
Remote Procedure Call. File and directory services are integrated. The client parses and controls
path name translation. Stateless: authentication info is required on each request.
Server caching:
 Read-ahead. Writes occur immediately to stable storage: disk, nonvolatile memory
(NVRAM), or an uninterruptible power supply (UPS). May use write gathering: delays
and groups similar writes together.
 Caches all requests: checks the cache before processing a new request. RequestID
contains clientID, transactionID, procedure number, state, and timestamp.
Network Information Service: NIS
    Translates a key into a value used in authenticating parties.
    Translates user names to encrypted passwords.
    Maps machine names to network addresses.
    Supports primary copy replication method.
ATOMIC TRANSACTIONS
Atomic Transaction - The effect of performing any single operation is free from interference
from concurrent operations being performed in other threads.
If the transaction does not complete, all previous operations within the transaction are backed
out.
Aspects of Atomicity:
 All-or-nothing: All operations in an AT are completed or rolled back to the initial state.
 Failure atomicity: Effects are atomic even when server fails.
 Durability: Completed transactions are saved in permanent storage.
 Isolation: Each transaction performed w/o interference from other transactions.
Need for Transactions:
A banking operation to transfer money is done in two steps: Withdraw(amt, account);
Deposit(amt, account). In Windows NT, transactions are also used to ensure all file data is
consistent so that a disk is recoverable if a system crash occurs.
Problems with Simultaneous Transactions:
 Lost Update: Two writes happen for same read.
 Inconsistent Retrievals: Process B reads midway through Process A's transaction.
 Dirty Read: Transaction B reads after Transaction A writes - but Transaction A aborts.
 Over-written Uncommitted Values: Later transaction backs out to aborted earlier
transaction.
Transaction Primitives include:
      BEGIN_TRANSACTION: marks start of transaction
      END_TRANSACTION: commit transaction.
      ABORT-TRANSACTION: kill transaction: restore old values.
      READ: Read data in file (or other object).
      WRITE: Write data to file.
Note: Each transaction has an identifier, which is included on all operations.
Nested Transactions
       If parent transaction aborts, all children transactions must abort.
       If child transaction aborts, parent transaction may decide to commit or abort.
       Child transactions may run concurrently on different servers.
TRANSACTION IMPLEMENTATION
Clients may use a server to share resources. Good design techniques include: the server holds
requests for service until the resource becomes available; the server uses a new thread for each
request; a thread that cannot continue execution uses the Wait operation; a thread causes a
suspended thread to resume using the Signal operation.
 Fault Tolerance - Transactions should survive Server processor failures. Multiple
replicas run on different computers - Replicas maintain recovery files. Can recover if disk
block failure or if all replicas not updated before processor failure.
 Fault tolerant: Replicas monitor each other and may continue operation if one fails.
Server may back out of partially completed transactions after restart.
 Private Workspace: Uncommitted records are written in temporary location. File's index
(UNIX i-node) is copied into private workspace. Private workspace index is updated with
new/modified records. All other processes continue to see original file. If transaction
aborts, private workspace is deleted and private blocks put on free list. If transaction
commits, private workspace index replaces previous index and old blocks put on free list.
DISTRIBUTED CONCURRENCY CONTROL
Logical Clocks
For many purposes, it is sufficient that all machines agree on the same time. It is not essential
that this time also agrees with the real time as announced on the radio every hour. For running
make, for example, it is adequate that all machines agree that it is 10:00, even if it is really 10:02.
Thus for a certain class of algorithms, it is the internal consistency of the clocks that matters, not
whether they are particularly close to the real time.
For these algorithms, it is conventional to speak of the clocks as logical clocks. In a classic
paper, Lamport (1978) showed that although clock synchronization is possible, it need not be
absolute. If two processes do not interact, it is not necessary that their clocks be synchronized
because the lack of synchronization would not be observable and thus could not cause problems.
Furthermore, he pointed out that what usually matters is not that all processes agree on exactly
what time it is, but rather that they agree on the order in which events occur. In the make
example given in the previous section, what counts is whether input.c is older or newer than
input.o, not their absolute creation times. In this section we will discuss Lamport’s algorithm,
which synchronizes logical clocks.
Lamport Timestamps
To synchronize logical clocks, Lamport defined a relation called “happens-before”. The
expression a → b is read “a happens before b” and means that all processes agree that first event a
occurs, then afterward, event b occurs. The “happens-before” relation can be observed directly in
two situations:
1. If a and b are events in the same process, and a occurs before b, then a → b is true.
2. If a is the event of a message being sent by one process, and b is the event of the message
being received by another process, then a → b is also true.
A message cannot be received before it is sent, or even at the same time it is sent, since it takes a
finite, nonzero amount of time to arrive. Happens-before is a transitive relation, so if a → b and
b → c, then a → c. If two events, x and y, happen in different processes that do not exchange
messages (not even indirectly via third parties), then x → y is not true, but neither is y → x.
These events are said to be concurrent, which
simply means that nothing can be said (or need be said) about when the events happened or
which event happened first.
What we need is a way of measuring time such that for every event a, we can assign it a time
value C(a) on which all processes agree. These time values must have the property that if a → b,
then C(a) < C(b). To rephrase the conditions we stated earlier, if a and b are two events within
the same process and a occurs before b, then C(a) < C(b). Similarly, if a is the sending of a
message by one process and b is the reception of that message by another process, then C(a) and
C(b) must be assigned in such a way that everyone agrees on the values of C(a) and C(b) with
C(a) < C(b). In addition, the clock time C must always go forward (increasing), never backward
(decreasing). Corrections to time can be made by adding a positive value, never by subtracting
one.
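A minimal sketch of a Lamport logical clock implementing these rules (the class and method names are illustrative):

    // The clock ticks on every local event and send; on a receive it jumps
    // past the sender's timestamp so that C(send) < C(receive) always holds.
    class LamportClock {
        private long time = 0;

        // Local event or message send: simply advance the clock.
        synchronized long tick() {
            return ++time;
        }

        // Message receive: merge with the timestamp carried by the message.
        synchronized long onReceive(long senderTimestamp) {
            time = Math.max(time, senderTimestamp) + 1;
            return time;
        }
    }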
Global State
Determining global properties in a distributed system is often difficult, but crucial for some
applications. For example, in distributed garbage collection, we need to be able to determine for
some object whether it is referenced by any other objects in the system. Deadlock detection
requires detection of cycles of processes infinitely waiting for each other. To detect the
termination of a distributed algorithm we need to obtain simultaneous knowledge of all involved
process as well as take account of messages that may still traverse the network. In other words, it
is not sufficient to check the activity of all processes. Even if all processes appear to be passive,
there may be messages in transit that, upon arrival, trigger further
activity. In the following, we are concerned with determining stable global states or properties
that, once they occur, will not disappear without outside intervention. For example, once an
object is no longer referenced by any other object (i.e., it may be garbage collected), no reference
to the object can appear at a later time.
Distributed Concurrency Control
Some of the issues encountered when looking at concurrency in distributed systems are familiar
from the study of operating systems and multithreaded applications. In particular dealing with
race conditions that occur when concurrent processes access shared resources. In non-distributed
systems these problems are solved by implementing mutual exclusion using local primitives such
as locks, semaphores, and monitors. In distributed systems, dealing with concurrency becomes
more complicated due to the lack of directly shared resources (such as memory, CPU registers,
etc.), the lack of a global clock, the lack of a single global program state, and the presence of
communication delays.
Distributed Mutual Exclusion
When concurrent access to distributed resources is required, we need to have mechanisms to
prevent race conditions while processes are within critical sections. These mechanisms must
fulfill the following three requirements:
 Safety: At most one process may execute the critical section at a time
 Liveness: Requests to enter and exit the critical section eventually succeed
 Ordering: Requests are processed in happened-before ordering
Method 1: Central Server
The simplest approach is to use a central server that controls the entering and exiting of critical
sections. Processes must send requests to enter and exit a critical section to a lock server (or
coordinator), which grants permission to enter by sending a token to the requesting process.
Upon leaving the critical section, the token is returned to the server. Processes that wish to enter
a critical section while another process is holding the token are put in a queue. When the token is
returned the process at the head of the queue is given the token and allowed to enter the critical
section. This scheme is easy to implement, but it does not scale well due to the central authority.
Moreover, it is vulnerable to failure of the central server.
Method 2: Token Ring
More sophisticated is a setup that organizes all processes in a logical ring structure, along which
a token message is continuously forwarded. Before entering the critical section, a process has to
wait until the token comes by and then retain the token until it exits the critical section. A
disadvantage of this approach is that the ring imposes an average delay of N/2 hops, which again
limits scalability. Moreover, the token messages consume bandwidth and failing nodes or
channels can break the ring. Another problem is that failures may cause the token to be lost. In
addition, if new processes join the network or wish to leave, further management logic is needed.
Method 3: Using Multicast and Logical Clocks
Ricart & Agrawala proposed an algorithm for distributed mutual exclusion that makes use of
logical clocks. Each participating process pi maintains a Lamport clock and all processes must be
able to communicate pairwise. At any moment, each process is in one of three states:
1. Released: Outside of critical section
2. Wanted: Waiting to enter critical section
3. Held: Inside critical section
If a process wants to enter a critical section, it multicasts a message and waits until it has
received a reply from every other process. The processes operate as follows:
 If a process is in Released state, it immediately replies to any request to enter the critical
section.
 If a process is in Held state, it delays replying until it is finished with the critical section.
 If a process is in Wanted state, it replies to a request immediately only if the requesting
timestamp is smaller than the one in its own request.

The only hurdle to scalability is the use of multicasts (i.e., all processes have to be contacted in
order to enter a critical section). More scalable variants of this algorithm require each individual
process to only contact subsets of its peers when wanting to enter a critical section.
Unfortunately, failure of any peer process can deny all other processes entry to the critical
section.
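The reply decision at the heart of the algorithm can be sketched as follows; message transport is assumed to exist and is omitted, and ties between equal timestamps are broken by process identifier.

    import java.util.ArrayDeque;
    import java.util.Queue;

    class MutexProcess {
        enum State { RELEASED, WANTED, HELD }

        State state = State.RELEASED;
        long myRequestTimestamp;     // Lamport timestamp of our own pending request
        int myId;                    // process id used to break timestamp ties
        final Queue<Integer> deferred = new ArrayDeque<>();

        // Called when a request with timestamp ts arrives from senderId.
        synchronized void onRequest(long ts, int senderId) {
            boolean ourRequestIsOlder =
                state == State.WANTED &&
                (myRequestTimestamp < ts ||
                 (myRequestTimestamp == ts && myId < senderId));
            if (state == State.HELD || ourRequestIsOlder) {
                deferred.add(senderId);  // reply only after leaving the critical section
            } else {
                sendReply(senderId);     // RELEASED, or the sender's request is older
            }
        }

        // Called when this process exits the critical section.
        synchronized void onExit() {
            state = State.RELEASED;
            while (!deferred.isEmpty()) sendReply(deferred.remove());
        }

        void sendReply(int toId) { /* transport omitted in this sketch */ }
    }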
TRANSACTIONS
A transaction can be regarded as a set of server operations that are guaranteed to appear atomic
in the presence of multiple clients and partial failure. The concept of a transaction originates
from the database community as a mechanism to maintain the consistency of databases.
Transaction management is built around two basic operations:
 Begin Transaction
 End Transaction.
An EndTransaction operation causes the whole transaction to either Commit or Abort. For this
discussion, the operations performed in a transaction are Read and Write. Transactions have the
ACID property.
 Atomic - All-or-nothing: once committed the full transaction is performed; if aborted,
there is no trace left.
 Consistent - Concurrent transactions will not produce inconsistent results.
 Isolated - Transactions do not interfere with each other, i.e., no intermediate state of a
transaction is visible outside; (this is also called the serialisable property)
 Durable - All-or-nothing property must hold even if server or hardware fails.
TRANSACTION IMPLEMENTATION
Two general strategies exist for the implementation of transactions:
 Private Workspace - All tentative operations are performed on a shadow copy of the
server state, which is atomically swapped with the main copy on Commit or discarded on
abort.
 Write-ahead Log - Updates are performed in-place, but all updates are logged and
reverted when a transaction aborts.
Concurrency in Transactions
It is often necessary to allow transactions to occur simultaneously (for example, to allow
multiple travel agents to simultaneously reserve seats on the same flight). Due to the consistency
and isolation properties of transactions, concurrent transactions must not be allowed to interfere
with each other. Concurrency control algorithms for transactions guarantee that multiple
transactions can be executed simultaneously while providing a result that is the same as if they
were executed one after another. A key concept when discussing concurrency control for
transactions is the serialization of conflicting operations. Recall that conflicting operations are
those operations that operate on the same data item and whose combined effects depend on the
order they are executed in. We define a schedule of operations as
an interleaving of the operations of concurrent transactions. A legal schedule is one that provides
results that are the same as though the transactions were serialized (i.e., performed one after
another). This leads to the concept of serial equivalence. A schedule is serially equivalent if all
conflicting operations are performed in the same order on all data items. For example, given two
transactions T1 and T2 in a serially equivalent schedule, then of all the pairs of conflicting
operations the first operation will be performed by T1 and the second by T2 (or vice versa: of all
the pairs the first is performed by T2 and the second by T1). There are three types of concurrency
control algorithms for transactions: those using locking, those using timestamps, and those using
optimistic algorithms.
Locking
The locking algorithms require that each transaction obtains a lock from a scheduler process
before performing a read or a write operation. The scheduler is responsible for granting and
releasing locks in such a way that legal schedules are produced. The most widely used locking
approach is two-phase locking (2PL). In this approach a lock for a data item is granted to a
process if no conflicting locks are held by other processes (otherwise the process requesting the
lock blocks until the lock is available again). A lock is held by a process until the operation it
was requested for has been completed. Furthermore once a process has released a lock, it can no
longer request any new locks until its current transaction has been completed. This results in a
growing phase of the transaction where locks are acquired and a shrinking phase where locks are
released. While this approach results in legal schedules, it can also result in deadlock when
conflicting locks are requested in reverse orders. This problem can be solved either by detecting
and breaking deadlocks or by adding timeouts to the locks (when a lock times out then the
transaction holding the lock is aborted). Another problem is that 2PL can lead to cascaded
aborts. If a transaction (T1) reads the results of a write of another transaction (T2) that is
subsequently aborted, then the first transaction (T1) will also have to be aborted. The solution to
this problem is called strict two-phase locking and allows locks to be released only at commit or
abort time.
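A toy sketch of a strict two-phase-locking scheduler with exclusive locks only; deadlock detection and timeouts are omitted, and all names are illustrative.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class LockManager {
        private final Map<String, Long> lockHolder = new HashMap<>(); // item -> txn
        private final Map<Long, Set<String>> held = new HashMap<>();  // txn -> items

        // Growing phase: block until the lock for this item is free.
        synchronized void lock(long txn, String item) throws InterruptedException {
            while (lockHolder.containsKey(item) && lockHolder.get(item) != txn) {
                wait();  // a real scheduler would also detect deadlock or time out
            }
            lockHolder.put(item, txn);
            held.computeIfAbsent(txn, t -> new HashSet<>()).add(item);
        }

        // Strict 2PL: the shrinking phase happens all at once at commit or abort,
        // so no other transaction ever reads an uncommitted write.
        synchronized void commitOrAbort(long txn) {
            Set<String> items = held.remove(txn);
            if (items != null) items.forEach(lockHolder::remove);
            notifyAll();  // wake transactions waiting on the released locks
        }
    }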
Timestamp Ordering
A different approach to creating legal schedules is to timestamp all operations and ensure that
operations are ordered according to their timestamps. In this approach each transaction receives a
unique timestamp and each operation receives its transaction’s timestamp. Each data item also
has three timestamps – the timestamp of the last committed write, the timestamp of the last read,
and the timestamp of the last tentative (noncommittal) write. Before executing a write operation
the scheduler ensures that the operation’s time stamp is both greater than the data item’s write
timestamp and greater than or equal to the data item’s read timestamp. For read operations the
operation’s time stamp must be greater than the data item’s write timestamps (both committed
and tentative). When scheduling conflicting operations the operation with a lower timestamp is
always executed first.
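The per-item checks just described can be sketched as follows (the field and method names are illustrative):

    // Each data item carries a committed-write, a read, and a tentative-write
    // timestamp; an operation failing its check aborts its transaction.
    class DataItem {
        long writeTs;           // timestamp of the last committed write
        long readTs;            // timestamp of the last read
        long tentativeWriteTs;  // timestamp of the last tentative write

        // A write must be newer than the last committed write and not older
        // than the last read.
        boolean writeAllowed(long ts) {
            return ts > writeTs && ts >= readTs;
        }

        // A read must be newer than both the committed and tentative writes.
        boolean readAllowed(long ts) {
            return ts > writeTs && ts > tentativeWriteTs;
        }
    }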
Optimistic Control
Both locking and time stamping incur significant overhead. The optimistic approach to
concurrency control assumes that no conflicts will occur, and therefore only tries to detect and
resolve conflicts at commit time. In this approach a transaction is split into three phases, a
working phase (using shadow copies), a validation phase, and an update phase. In the working
phase operations are carried out on shadow copies with no attempt to detect or order conflicting
operations. In the validation phase the scheduler attempts to detect conflicts with other
transactions that were in progress during the working phase. If conflicts are detected then one of
the conflicting transactions is aborted. In the update phase, assuming that the transaction was
not aborted, all the updates made on the shadow copy are made permanent.
DISTRIBUTED TRANSACTIONS
In contrast to transactions in the sequential database world, transactions in a distributed setting
are complicated because a single transaction will usually involve multiple servers. Multiple
servers may involve multiple services and files stored on different servers. To ensure the
atomicity of transactions, all servers involved must agree whether to Commit or Abort.
Moreover, the use of multiple servers and services may require nested transactions, where a
transaction is implemented by way of multiple other transactions, each of which can
independently Commit or Abort.
Transactions that span multiple hosts include one host that acts as the coordinator, which is the
host that handles the initial BeginTransaction. This coordinator maintains a list of workers,
which are the other servers involved in the transaction. Each worker must be aware of the
identity of the coordinator. The responsibility for ensuring the atomicity of an entire transaction
lies with the coordinator, which needs to rely on a distributed commit protocol.
TWO PHASE COMMIT
This protocol ensures that a transaction commits only when all workers are ready to commit,
which, for example, corresponds to validation in optimistic concurrency. As a result a distributed
commit protocol requires at least two phases:
1. Voting phase: all workers vote on commit; then the coordinator decides whether to
commit or abort.
2. Completion phase: all workers commit or abort according to the decision of the
coordinator. This basic protocol is called two-phase commit (2PC)
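A minimal sketch of the coordinator side of 2PC, assuming a hypothetical Worker interface; logging, timeouts, and crash recovery are deliberately omitted.

    import java.util.List;

    interface Worker {
        boolean canCommit(long txn);  // voting phase: true means vote commit
        void doCommit(long txn);      // completion phase
        void doAbort(long txn);
    }

    class Coordinator {
        // Returns true if the transaction committed.
        boolean twoPhaseCommit(long txn, List<Worker> workers) {
            // Phase 1: collect votes; a single abort vote aborts the transaction.
            boolean allYes = true;
            for (Worker w : workers) {
                if (!w.canCommit(txn)) { allYes = false; break; }
            }
            // Phase 2: broadcast the decision to every worker.
            for (Worker w : workers) {
                if (allYes) w.doCommit(txn); else w.doAbort(txn);
            }
            return allYes;
        }
    }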
DISTRIBUTED NESTED TRANSACTIONS
Distributed nested transactions are realized by letting sub transactions commit provisionally,
whereby they report a provisional commit list that contains all provisionally committed sub-
transactions to the parent. If the parent aborts, it aborts all transactions on the provisional commit
list. Otherwise, if the parent is ready to commit, it lets all sub-transactions commit. The actual
transition from provisional to final commit needs to be via a 2PC protocol, as a worker may
crash after it has already provisionally committed. Essentially, when a worker receives a
CanCommit message, there are two alternatives:
 If it has no recollection of the sub-transactions involved in the committing transaction, it
votes abort, as it must have recently crashed.
 Otherwise, it saves the information about the provisionally committed sub-transaction to
a persistent store and votes yes.
COORDINATION ELECTIONS
Various algorithms require a set of peer processes to elect a leader or coordinator. In the
presence of failure, it can be necessary to determine a new leader if the present one fails to
respond. Provided that all processes have a unique identification number, leader election can be
reduced to finding the non-crashed process with the highest identifier. Any algorithm to
determine this process needs to meet the
following two requirements:
 Safety: A process either doesn’t know the coordinator or it knows the identifier of the
process with largest identifier.
 Liveness: Eventually, a process crashes or knows the coordinator.
BULLY ALGORITHM
The following algorithm was proposed by Garcia-Molina and uses three types of messages:
 Election: Announce election
 Answer: Response to an election
 Coordinator: Elected coordinator announces itself.
A process begins an election when it notices through a timeout that the coordinator has failed or
receives an Election message. When starting an election, a process sends Election message to all
higher-numbered processes. If it receives no Answer within a predetermined time bound, the
process that started the election decides that it must be coordinator and sends a Coordinator
message to all other processes. If an Answer arrives, the process that triggered an election waits a
pre-determined period of time for a Coordinator message. A process that receives an Election
message can immediately announce that it is the coordinator if it knows that it is the highest
numbered process. Otherwise, it itself starts a sub-election by sending Election message to
higher-numbered processes. This algorithm is called the bully algorithm because the highest
numbered process will always be the coordinator.
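The algorithm can be sketched from the viewpoint of a single process as follows; the transport operations are hypothetical placeholders and a timeout is reduced to a boolean result.

    import java.util.List;

    class BullyProcess {
        final int myId;
        final List<Integer> allIds;  // identifiers of all peer processes

        BullyProcess(int myId, List<Integer> allIds) {
            this.myId = myId;
            this.allIds = allIds;
        }

        void startElection() {
            boolean anyAnswer = false;
            for (int id : allIds) {
                if (id > myId) {
                    anyAnswer |= sendElection(id);  // true if an Answer arrives in time
                }
            }
            if (!anyAnswer) {
                // No higher-numbered process responded: announce ourselves.
                broadcastCoordinator(myId);
            }
            // Otherwise wait for a Coordinator message and restart the
            // election if none arrives within the time bound.
        }

        boolean sendElection(int toId) { /* transport omitted */ return false; }
        void broadcastCoordinator(int id) { /* transport omitted */ }
    }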
RING ALGORITHM
An alternative to the bully algorithm is to use a ring algorithm. In this approach all processes are
ordered in a logical ring and each process knows the structure of the ring. There are only two
types of messages involved: Election and Coordinator. A process starts an election when it
notices that the current coordinator has failed (e.g., because requests to it have timed out). An
election is started by sending an Election message to the first neighbor on the ring. The Election
message contains the node’s process identifier and is forwarded on around the ring, with each
process adding its own identifier to the message. When the Election message reaches the
originator, the election is complete. Based on the contents of the message that originator process
determines the highest numbered process and sends out a coordinator message specifying this
process as the winner of the election.
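A highly simplified model of one circuit of the ring: the Election message accumulates identifiers as it is forwarded, and the originator picks the maximum (all names are illustrative; real message transport is omitted).

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class RingElection {
        // Models the Election message travelling once around the ring,
        // with each process appending its own identifier.
        static int elect(List<Integer> ringOrder) {
            List<Integer> message = new ArrayList<>();
            for (int id : ringOrder) {
                message.add(id);
            }
            // Back at the originator: the highest identifier wins and would
            // be announced in a Coordinator message (not shown).
            return Collections.max(message);
        }
    }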
ADVANCED OPERATING SYSTEMS
Workstation Models
 Workstation-server model: Workstation's processor performance and memory capacity
determine largest task that can be performed on behalf of user. Fixed amount of dedicated
computing power and guaranteed response time.
 Diskless workstation: low-cost computers with: processor, memory, network interface.
Cheap: Few large disks cheaper than many little ones. Ease of maintenance: Centralized
s/w installation: Flexibility: Access from any node Example: X Terminal runs X.11
server software.
 Disk workstations can include: Paging and temporary files.
 System binaries: s/w installation is broadcast to all machines that are up (or when they come up).
 Explicit caches: local working copy copied centrally when completed.
 Local file system: loss of transparency.
Processor pool model:
Dynamically allocates processors to users. Centralized file server & processor pool
Graphics workstations or dumb terminals, Assigns processors to users as needed.
Supports incremental growth, Example use: makes, simulations, processing-intensive
applications
 Hybrid Model: Personal workstation and processor pool. More expensive, simple design.
Fast interactive response from the workstation; heavy computing performed using the
processor pool.
 Load Distribution: Transfer load from heavily loaded computers to idle or lightly loaded
computers.
 Load Balancing: Equalize load at all computers. Goal: Minimize response time or
maximize CPU utilization. Design Issues: Deterministic versus heuristic (=dynamic):
 Transfer Policy: When does a node become a sender? Use thresholds Swap info with
other machines (e.g. periodically)
 Selection Policy: How does sender choose a process to transfer? Newly originated Least
overhead: small, location independent.
 Location Policy: Which node should be the target receiver? Polling is used to determine
processor load:
      Solution 1: Count the number of processes on each machine (running or in ready state).
      Solution 2: An idle process or a periodic interrupt determines the amount of time the processor is busy.
      Goal: measure the fraction of time the CPU is busy.
PROCESS MIGRATION:
Move a process already in progress to remote site. Motivation: Load sharing: Move from heavy
to lightly loaded system to improve performance. Communications performance: Move the
process to the data to minimize communications overhead. Availability: Survive a scheduled
downtime. Utilize special capabilities: Take advantage of unique h/w or s/w on a particular node.
Commonly, migration is triggered when the owner returns to the workstation; alternately, the
priority of the foreign process is lowered. Steps: select a target machine; send part of the process
image and open file information; the receiving kernel forks a child with the passed information;
the new process pulls over data, environment, register/stack information, and modified program
text (the rest of the program is demand paged); the new process sends a migration-completed
message; the old process destroys itself.
Characteristics of real time O.S
Fast process/thread switch. Small size with minimum functionality. Responds to external
interrupts quickly and minimizes intervals where interrupts are disabled. Supports multitasking
with inter-process communication tools (semaphores, signals, events). Accumulates data in
sequential files at a fast rate. Uses preemptive scheduling based on priority. Pauses/resumes
tasks for fixed intervals using clocks/timers. Supports special alarms and timeouts.
Multiprocessor Scheduling
Effects of scheduling in multiprocessors: In a multiprogrammed system, single applications run
better: traditional priority, FCFS, and round-robin algorithms matter less because other processes
can be served by other processors. Multithreaded: threads run faster if scheduled together.
Application speedup on a multiprocessor often exceeds expectations because threads share disk
caches and threads share compiler code.
Classes of Multiprocessor OS
 Separate Supervisor: Each processor has own copy of kernel, data structures, I/O
devices, file systems. Minimum shared data structures (e.g. for semaphores).
Disadvantage: Difficult to perform parallel execution of a single task. Inefficient since
much replication for each processor
 Master Slave: Master assigns work to slaves. Master runs O.S. Slaves run applications
Advantage: Simplified O.S. Disadvantage: master can fail or become bottleneck.
 Symmetric Multi-Processor (SMP): All processors are treated equally, with identical h/w
capabilities and all h/w available to all processors. One copy of the kernel can be executed
by all processors concurrently. Floating Master: the O.S. is treated as a critical section and
only one processor can be active in each segment of the O.S. at a time. Advantage: most
flexible. Disadvantage: most difficult to implement.
FAULT TOLERANCE AND FAILURE MODELS
A characteristic feature of distributed systems that distinguishes them from single-machine
systems is the notion of partial failure. A partial failure may happen when one component in a
distributed system fails. This failure may affect the proper operation of other components, while
at the same time leaving
yet other components totally unaffected. In contrast, a failure in non distributed systems is often
total in the sense that it affects all components, and may easily bring down the entire application.
An important goal in distributed systems design is to construct the system in such a way that it
can automatically recover from partial failures without seriously affecting the overall
performance. In particular, whenever a failure occurs, the distributed system should continue to
operate in an acceptable way while repairs are being made, that is, it should tolerate faults and
continue to operate to some extent even in their presence.

BASIC CONCEPTS
To understand the role of fault tolerance in distributed systems we first need to take a closer look
at what it actually means for a distributed system to tolerate faults. Being fault tolerant is
strongly related to what are called “dependable” systems. Dependability is a term that covers a
number of useful
requirements for distributed systems including the following
 Availability - is defined as the property that a system is ready to be used immediately. In
general, it refers to the probability that the system is operating correctly at any given
moment and is available to perform its functions on behalf of its users. In other words, a
highly available system is one that will most likely be working at a given instant in time.
 Reliability - refers to the property that a system can run continuously without failure. In
contrast to availability, reliability is defined in terms of a time interval instead of an
instant in time. A highly reliable system is one that will most likely continue to work
without interruption during a relatively long period of time. This is a subtle but important
difference when compared to availability. If a system goes down for one millisecond
every hour, it has an availability of over 99.9999 percent, but is still highly unreliable.
Similarly, a system that never crashes but is shut down for two weeks every August has
high reliability but only 96 percent availability. The two are not the same.
 Safety - refers to the situation that when a system temporarily fails to operate correctly,
nothing catastrophic happens. For example, many process control systems, such as those
used for controlling nuclear power plants or sending people into space, are required to
provide a high degree of safety. If such control systems temporarily fail for only a very
brief moment, the effects could be disastrous. Many examples from the past (and
probably many more yet to come) show how hard it is to build safe systems.

 Maintainability - refers to how easily a failed system can be repaired. A highly
maintainable system may also show a high degree of availability, especially if failures
can be detected and repaired automatically. However, as we shall see later in this chapter,
automatically recovering from failures is easier said than done. Often, dependable
systems are also required to provide a high degree of security, especially when it comes
to issues such as integrity. In particular, if a distributed system is designed to provide its
users with a number of services, the system has failed when one or more of those services
cannot be (completely) provided.

A distinction is made between preventing, removing, and forecasting faults. For our purposes,
the most important issue is “fault tolerance”, meaning that a system can provide its services even
in the presence of faults. Faults are generally classified as transient, intermittent, or permanent.
“Transient faults” occur once and then disappear. If the operation is repeated, the fault goes
away. A bird flying through the beam of a microwave transmitter may cause lost bits on some
network (not to mention a roasted bird). If the transmission times out and is retried, it will
probably work the second time.
 An intermittent fault occurs, then vanishes of its own accord, then reappears, and so on.
A loose contact on a connector will often cause an intermittent fault. Intermittent faults
cause a great deal of aggravation because they are difficult to diagnose. Typically,
whenever the fault doctor shows up, the system works fine.
 A permanent fault is one that continues to exist until the faulty component is repaired.
Burnt-out chips, software bugs, and disk head crashes are examples of permanent faults.

FAILURE MODELS
A system that fails is not adequately providing the services it was designed for. If we consider a
distributed system as a collection of servers that communicate with each other and with their
clients, not adequately providing services means that servers, communication channels, or
possibly both, are not doing what they are supposed to do. However, a malfunctioning server
itself may not always be the fault we are looking for. If such a server depends on other servers to
adequately provide its services, the cause of an error may need to be searched for somewhere
else.

Such dependency relations appear in abundance in distributed systems. A failing disk may make
life difficult for a file server that is designed to provide a highly available file system. If such a
file server is part of a distributed database, the proper working of the entire database may be at
stake, as only part of its data may actually be accessible. To get a better grasp on how serious a
failure actually is, several
classification schemes have been developed. One such scheme is shown in Fig. below

Figure: Different types of failures.

A crash failure occurs when a server prematurely halts, but was working correctly until it
stopped. An important aspect with crash failures is that once the server has halted, nothing is
heard from it anymore. A typical example of a crash failure is an operating system that comes to
a grinding halt, and for which
there is only one solution: reboot. Many personal computer systems suffer from crash failures so
often that people have come to expect them to be normal. In this sense, moving the reset button
from the back of a cabinet to the front was done for good reason. Perhaps one day it can be
moved to the back again, or even removed altogether.

An omission failure occurs when a server fails to respond to a request. Several things might go
wrong. In the case of a receive omission failure, the server perhaps never got the request in the
first place. Note that it may well be the case that the connection between a client and a server has
been correctly established, but that there was no thread listening to incoming requests. Also, a
receive omission failure will generally not affect the current state of the server, as the server is
unaware of any message sent to it.

State transition failure. This kind of failure happens when the server reacts unexpectedly to an
incoming request. For example, if a server receives a message it cannot recognize, a state
transition failure happens if no measures have been taken to handle such messages. In particular,
a faulty server may incorrectly take default actions it should never have initiated.

Arbitrary / Byzantine failures. In effect, when arbitrary failures occur, clients should be
prepared for the worst. In particular, it may happen that a server is producing output it should
never have produced, but which cannot be detected as being incorrect. Worse yet, a faulty server
may even be maliciously working together with other servers to produce intentionally wrong
answers. This situation illustrates why security is also considered an important requirement when
talking about dependable systems.
Failure Masking by Redundancy - If a system is to be fault tolerant, the best it can do is to try
to hide the occurrence of failures from other processes. The key technique for masking faults is
to use redundancy. Three kinds are possible: information redundancy, time redundancy, and
physical redundancy. With information redundancy, extra bits are added to allow recovery from
garbled bits. For example, a Hamming code can be added to transmitted data to recover from
noise on the transmission line. With time redundancy, an action is performed and then, if need
be, performed again; retransmitting a request after a timeout is an example. With physical
redundancy, extra equipment or processes are added to make it possible for the system as a
whole to tolerate the loss or malfunctioning of some components. Physical redundancy can thus
be done either in hardware or in software.
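For instance, physical redundancy with majority voting (triple modular redundancy) can be sketched as follows; the three-replica setup is an illustrative assumption:

```python
from collections import Counter

def majority(votes):
    """Mask a single fault by taking the value most replicas agree on."""
    value, count = Counter(votes).most_common(1)[0]
    if count <= len(votes) // 2:
        raise RuntimeError("no majority: too many faulty replicas")
    return value

# Three replicas compute the same result; one replica is faulty.
print(majority([7, 7, 9]))   # -> 7: the single fault is masked
```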

Process Resilience - Now that the basic issues of fault tolerance have been discussed, let us
concentrate on how fault tolerance can actually be achieved in distributed systems. The first
topic we discuss is protection against process failures, which is achieved by replicating processes
into groups.

Design Issues - The key approach to tolerating a faulty process is to organize several identical
processes into a group. The key property that all groups have is that when a message is sent to
the group itself, all members of the group receive it. In this way, if one process in a group fails,
hopefully some other process can take over for it. Process groups may be dynamic. New groups
can be created and old groups can be destroyed. A process can join a group or leave one during
system operation. A process can be a member of several groups at the same time. Consequently,
mechanisms are needed for managing groups and group membership. Groups are roughly
analogous to social organizations.

Agreement in Faulty Systems - Before considering the case of faulty processes, let us look at
the “easy” case of perfect processes but where communication lines can lose messages. There is
a famous problem, known as the two-army problem, which illustrates the difficulty of getting
even two perfect processes to reach agreement about 1 bit of information.

CLIENT-SERVER COMMUNICATION
Reliable Client-server Communication
In many cases, fault tolerance in distributed systems concentrates on faulty processes. However,
we also need to consider communication failures. Most of the failure models discussed
previously apply equally well to communication channels. In particular, a communication
channel may exhibit crash, omission,
timing, and arbitrary failures. In practice, when building reliable communication channels, the
focus is on masking crash and omission failures. Arbitrary failures may occur in the form of
duplicate messages, resulting from the fact that in a computer network messages may be buffered
for a relatively long time, and are re-injected into the network after the original sender has
already issued a retransmission

Point-to-Point Communication
In many distributed systems, reliable point-to-point communication is established by making use
of a reliable transport protocol, such as TCP. TCP masks omission failures, which occur in the
form of lost messages, by using acknowledgements and retransmissions. Such failures are
completely hidden from a
TCP client. However, crash failures of connections are often not masked. A crash failure may
occur when, for whatever reason, a TCP connection is abruptly broken so that no more messages
can be transmitted through the channel. In most cases, the client is informed that the channel has
crashed by raising an exception. The only way to mask such failures is to let the distributed
system attempt to automatically set up a new connection.

RPC Semantics in the Presence of Failures


Let us now take a closer look at client-server communication when using high-level
communication facilities such as Remote Procedure Calls (RPCs) or Remote Method Invocations
(RMIs). In the following pages, we focus on RPCs, but the discussion is equally applicable to
communication with remote objects. The goal of RPC is to hide communication by making
remote procedure calls look just like local ones. With a few exceptions, so far we have come
fairly close. Indeed, as long as both client and server are functioning perfectly, RPC does its job
well. The problem comes about when errors occur. It is then that the differences between local
and remote calls are not always easy to mask. To structure our discussion, let us distinguish
between five different classes of failures that can occur in RPC systems, as follows
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.
Each of these categories poses different problems and requires different solutions.

Client Cannot Locate the Server


To start with, it can happen that the client cannot locate a suitable server. The server might be
down, for example. Alternatively, suppose that the client is compiled using a particular version
of the client stub, and the binary is not used for a considerable period of time. In the meantime,
the server evolves and a new version of the interface is installed; new stubs are generated and put
into use. When the client is finally run, the binder will be unable to match it up with a server and
will report failure. While this mechanism is used to protect the client from accidentally trying to
talk to a server that may not agree with it in terms of what parameters are required or what it is
supposed to do, the problem remains of how should this failure be dealt with.

Lost Request Messages - The second item on the list is dealing with lost request messages. This
is the easiest one to deal with: just have the operating system or client stub start a timer when
sending the request. If the timer expires before a reply or acknowledgement comes back, the
message is sent again. If the message was truly lost, the server will not be able to tell the
difference between the retransmission and the original, and everything will work fine. Unless, of
course, so many request messages are lost that the client gives up and falsely concludes that the
server is down, in which case we are back to “Cannot locate server.” If the request was not lost,
the only thing we need to do is let the server be able to detect it is dealing with a retransmission.
Unfortunately, doing so is not so simple, as we explain when discussing lost replies.
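A minimal client-stub sketch of this timer-and-retransmit scheme; the send callback, its signature, and the retry limit are illustrative assumptions:

```python
import random, time

def send_with_retry(request, send, max_retries=5, timeout=1.0):
    """Client-stub sketch: start a timer per attempt, retransmit on expiry."""
    for _ in range(max_retries):
        try:
            return send(request, timeout)          # reply arrived before the timer expired
        except TimeoutError:
            continue                               # timer expired: send the request again
    raise ConnectionError("server presumed down")  # back to "cannot locate server"

def flaky_send(request, timeout):
    """Simulated transport that loses roughly half of all requests/replies."""
    if random.random() < 0.5:
        time.sleep(timeout)                        # nothing came back in time
        raise TimeoutError
    return f"reply to {request}"

random.seed(42)                                    # deterministic for the example
print(send_with_retry("read(block=7)", flaky_send))
```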

Server Crashes - The next failure on the list is a server crash. Assume that the server crashes
and subsequently recovers. It announces to all clients that it has just crashed but is now up and
running again. The problem is that the client does not know whether its request to print some text
will actually be carried out. There are four strategies the client can follow. First, the client can
decide to never reissue a request, at the risk that the text will not be printed. Second, it can decide
to always reissue a request, but this may lead to its text being printed twice. Third, it can decide
to reissue a request only if it did not yet receive an acknowledgement that its print request had
been delivered to the server. In that case, the client is counting on the fact that the server crashed
before the print request could be delivered. The fourth and last strategy is to reissue a request
only if it has received an acknowledgement for the print request. With two strategies for the
server, and four for the client, there are a total of eight combinations to consider. Unfortunately,
no combination is satisfactory. To explain, note that there are three events that can happen at the
server: send the completion message (M), print the text (P), and crash (C). These events can
occur in six different orderings.

1. M→P→C: A crash occurs after sending the completion message and printing the text.
2. M→C(→P): A crash happens after sending the completion message, but before the text
could be printed.
3. P→M→C: A crash occurs after printing the text and sending the completion message.
4. P→C(→M): The text is printed, after which a crash occurs before the completion message
could be sent.
5. C(→P→M): A crash happens before the server could do anything.
6. C(→M→P): A crash happens before the server could do anything.
Events shown in parentheses never take place because of the crash.

Lost Reply Messages


Lost replies can also be difficult to deal with. The obvious solution is just to rely on a timer again
that has been set by the client’s operating system. If no reply is forthcoming within a reasonable
period, just send the request once more. The trouble with this solution is that the client is not
really sure why there was no answer. Did the request or reply get lost, or is the server merely
slow? It may make a difference. In particular, some operations can safely be repeated as often as
necessary with no damage being done. A request such as asking for the first 1024 bytes of a file
has no side effects and can be executed as often as necessary without any harm being done. A
request that has this property is said to be idempotent.

Now consider a request to a banking server asking to transfer a million dollars from one account
to another. If the request arrives and is carried out, but the reply is lost, the client will not know
this and will retransmit the message. The bank server will interpret this request as a new one, and
will carry it out too. Two million dollars will be transferred. Heaven forbid that the reply is lost
10 times. Transferring money is not idempotent. One way of solving this problem is to try to
structure all requests in an idempotent way.
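A server-side sketch of that idea, assuming clients tag each request with a unique identifier and the server caches the first reply (in a real system this table would be kept in stable storage):

```python
completed = {}                              # request_id -> cached reply
balance = {"A": 2_000_000, "B": 0}

def transfer(request_id, src, dst, amount):
    """Perform the transfer at most once; replay the reply for duplicates."""
    if request_id in completed:             # retransmission detected
        return completed[request_id]
    balance[src] -= amount                  # the non-idempotent operation itself
    balance[dst] += amount
    completed[request_id] = "OK"
    return "OK"

transfer("req-17", "A", "B", 1_000_000)
transfer("req-17", "A", "B", 1_000_000)     # duplicate: no second transfer happens
print(balance)                              # {'A': 1000000, 'B': 1000000}
```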

Client Crashes
The final item on the list of failures is the client crash. What happens if a client sends a request to
a server to do some work and crashes before the server replies? At this point a computation is
active and no parent is waiting for the result. Such an unwanted computation is called an
orphan. Orphans can cause a variety of problems. As a bare minimum, they waste CPU cycles.
They can also lock files or otherwise tie up valuable resources. Finally, if the client reboots and
does the RPC again, but the reply from the orphan comes back immediately afterward, confusion
can result.

DISTRIBUTED COMMIT
The atomic multicasting problem discussed in the previous section is an example of a more
general problem, known as distributed commit. The distributed commit problem involves
having an operation being performed by each member of a process group, or none at all. In the
case of reliable multicasting, the operation is the delivery of a message. With distributed
transactions, the operation may be the commit of a transaction at a single site that takes part in
the transaction.

Distributed commit is often established by means of a coordinator. In a simple scheme, this
coordinator tells all other processes that are also involved, called participants, whether or not to
(locally) perform the operation in question. This scheme is referred to as a one-phase commit
protocol. It has the obvious drawback that if one of the participants cannot actually perform the
operation, there is no way to tell the coordinator. For example, in the case of distributed
transactions, a local commit may not be possible because this would violate concurrency control
constraints. In practice, more sophisticated schemes are needed, the most common one being the
two-phase commit protocol, which is discussed in detail below. The main drawback of this
protocol is that it cannot efficiently handle the failure of the coordinator.

TWO-PHASE COMMIT
The original two-phase commit protocol (2PC) is due to Gray (1978). Without loss of
generality, consider a distributed transaction involving the participation of a number of processes
each running on a different machine. Assuming that no failures occur, the protocol consists of
the following two phases, each consisting of two steps
1. The coordinator sends a VOTE_REQUEST message to all participants.
2. When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT
message to the coordinator, telling the coordinator that it is prepared to locally commit its
part of the transaction, or otherwise a VOTE_ABORT message.
3. The coordinator collects all votes from the participants. If all participants have voted to
commit the transaction, then so will the coordinator. In that case, it sends a
GLOBAL_COMMIT message to all participants. However, if one participant has voted to
abort the transaction, the coordinator will also decide to abort the transaction and
multicasts a GLOBAL_ABORT message.
4. Each participant that voted for a commit waits for the final reaction of the coordinator. If
the participant receives a GLOBAL_COMMIT message, it locally commits the transaction.
Otherwise, when receiving a GLOBAL_ABORT message, the transaction is locally aborted as
well.
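Assuming no failures, the four steps above can be sketched as follows; the Participant class and the way votes are gathered are illustrative simplifications:

```python
def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator requests and collects votes (steps 1-2).
    votes = [p.vote_request() for p in participants]
    # Step 3: decide GLOBAL_COMMIT only if every participant voted to commit.
    decision = ("GLOBAL_COMMIT" if all(v == "VOTE_COMMIT" for v in votes)
                else "GLOBAL_ABORT")
    # Phase 2 (decision): step 4, every participant applies the global decision.
    for p in participants:
        p.decide(decision)
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
    def vote_request(self):
        return "VOTE_COMMIT" if self.can_commit else "VOTE_ABORT"
    def decide(self, decision):
        self.state = decision

print(two_phase_commit([Participant(True), Participant(True)]))   # GLOBAL_COMMIT
print(two_phase_commit([Participant(True), Participant(False)]))  # GLOBAL_ABORT
```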

THREE-PHASE COMMIT
A problem with the two-phase commit protocol is that when the coordinator has crashed,
participants may not be able to reach a final decision. Consequently, participants may need to
remain blocked until the coordinator recovers. Skeen (1981) developed a variant of 2PC, called
the three-phase commit protocol (3PC), that avoids blocking processes in the presence of
fail-stop crashes. Although 3PC is widely referred to in the literature, it is not applied often
in practice, as the conditions
under which 2PC blocks rarely occur. We discuss the protocol, as it provides further insight into solving fault-
tolerance problems in distributed systems.

RECOVERY
So far, we have mainly concentrated on algorithms that allow us to tolerate faults. However,
once a failure has occurred, it is essential that the process where the failure happened can recover
to a correct state. In what follows, we first concentrate on what it actually means to recover to a
correct state, and subsequently when and how the state of a distributed system can be recorded
and recovered to, by means of checkpointing and message logging. Fundamental to fault
tolerance is the recovery from an error. Recall that an error is that part of a system that may lead
to a failure. The whole idea of error recovery is to replace an erroneous state with an error-free
state. There are essentially two forms of error recovery.

 In backward recovery, the main issue is to bring the system from its present erroneous
state back into a previously correct state. To do so, it will be necessary to record the
system’s state from time to time, and to restore such a recorded state when things go
wrong. Each time (part of) the system’s present state is recorded, a checkpoint is said to
be made (a minimal sketch follows this list).
 In forward recovery, when the system has just entered an erroneous state, instead of
moving back to a previous, checkpointed state, an attempt is made to bring the
system in a correct new state from which it can continue to execute. The main problem
with forward error recovery mechanisms is that it has to be known in advance which
errors may occur. Only in that case is it possible to correct those errors and move to a
new state.
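A minimal sketch of backward recovery by checkpointing, as promised above; serializing state with pickle and keeping checkpoints in a list are illustrative simplifications (real systems write checkpoints to stable storage):

```python
import pickle

checkpoints = []                        # in practice kept in stable storage

def take_checkpoint(state):
    checkpoints.append(pickle.dumps(state))     # record the present state

def backward_recovery():
    return pickle.loads(checkpoints[-1])        # roll back to the last checkpoint

state = {"step": 0, "data": []}
take_checkpoint(state)
state = {"step": 3, "data": ["partial", "garbage"]}   # an error corrupts the state
state = backward_recovery()                           # restore the recorded state
print(state)                                          # {'step': 0, 'data': []}
```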

STABLE STORAGE
To be able to recover to a previous state, it is necessary that information needed to enable
recovery is safely stored. Safely in this context means that recovery information survives process
crashes and site failures, but possibly also various storage media failures. Stable storage plays an
important role when it comes to recovery in distributed systems. Stable storage can be
implemented with a pair of
ordinary disks (a sketch follows the list below). Storage comes in three categories.
 First there’s RAM memory, which is wiped out when power fails or a machine crashes.
 Next is disk storage, which survives CPU failures but which can be lost in disk head
crashes.
 Finally, there is also stable storage, which is designed to survive anything except major
calamities such as floods and earthquakes.
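The pair-of-disks idea can be sketched as follows; the dictionaries standing in for the two disks and the omitted read-back verification are illustrative simplifications:

```python
disk1, disk2 = {}, {}        # two ordinary disks modelled as dictionaries

def stable_write(block, data):
    disk1[block] = data      # write disk 1 first (read back to verify, omitted)
    disk2[block] = data      # write disk 2 only once disk 1 is known to be good

def recover(block):
    if disk1.get(block) != disk2.get(block):   # a crash hit between the two writes
        disk2[block] = disk1[block]            # copy the good block over
    return disk1[block]

stable_write(7, b"account table")
disk2[7] = b"garbage"        # simulate a crash/bad block on the second disk
print(recover(7))            # b'account table'
```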

SECURITY IN DISTRIBUTED SYSTEMS


Security is by no means the least important principle. However, one could argue that it is one of
the most difficult principles, as security needs to be pervasive throughout a system. A single
design flaw with respect to security may render all security measures useless. We concentrate on
the various mechanisms that are generally incorporated in distributed systems to support
security. Security in distributed systems can roughly be divided into two parts. One part
concerns the communication between users or processes, possibly residing on different
machines. The principal mechanism for ensuring secure communication is the secure channel;
secure channels cover authentication, message integrity, and confidentiality.

The other part concerns authorization, which deals with ensuring that a process gets only those
access rights to the resources in a distributed system it is entitled to. Authorization is covered in a
separate section dealing with access control. In addition to traditional access control
mechanisms, we also focus
on access control when we have to deal with mobile code such as agents. Secure channels and
access control require mechanisms to hand out cryptographic keys, but also mechanisms to add
and remove users from a system. These topics are covered by what is known as security
management.

SECURITY THREATS, POLICIES, AND MECHANISMS


Security in computer systems is strongly related to the notion of dependability. Informally, a
dependable computer system is one that we justifiably trust to deliver its services. Dependability
includes availability, reliability, safety, and maintainability. However, if we are to put our trust in
a computer system, then confidentiality and integrity should also be taken into account.
Confidentiality refers to the property of a computer system whereby its information is disclosed
only to authorized parties.

Integrity is the characteristic that alterations to a system’s assets can be made only in an
authorized way. In other words, improper alterations in a secure computer system should be
detectable and recoverable. Major assets of any computer system are its hardware, software, and
data. Another way of looking at security in computer systems is that we attempt to protect the
services and data it offers against security threats. There are four types of security threats to
consider

1. Interception refers to the situation that an unauthorized party has gained access to a
service or data. A typical example of interception is where communication between two
parties has been overheard by someone else. Interception also happens when data are
illegally copied, for example, after breaking into a person’s private directory in a file
system.

2. Interruption is when a file is corrupted or lost. In general, interruption refers to the
situation in which services or data become unavailable, unusable, destroyed, and so on. In
this sense, denial of service attacks by which someone maliciously attempts to make a
service inaccessible to other parties is a security threat that classifies as interruption.

3. Modifications involve unauthorized changing of data or tampering with a service so that
it no longer adheres to its original specifications. Examples of modifications include
intercepting and subsequently changing transmitted data, tampering with database entries,
and changing a program so that it secretly logs the activities of its user.

4. Fabrication refers to the situation in which additional data or activity are generated that
would normally not exist. For example, an intruder may attempt to add an entry into a
password file or database. Likewise, it is sometimes possible to break into a system by
replaying previously sent messages.

Note that interruption, modification, and fabrication can each be seen as a form of data
falsification.
Simply stating that a system should be able to protect itself against all possible security threats is
not the way to actually build a secure system. What is first needed is a description of security
requirements, that is, a security policy. A security policy - describes precisely which actions the
entities in a system are allowed to take and which ones are prohibited. Entities include users,
services, data, machines, and so on. Once a security policy has been laid down, it becomes
possible to concentrate on the security mechanisms by which a policy can be enforced.

SECURITY MECHANISMS
1. Encryption is fundamental to computer security. Encryption transforms data into
something an attacker cannot understand. In other words, encryption provides a means to
implement confidentiality. In addition, encryption allows us to check whether data have
been modified. It thus also provides support for integrity checks.

2. Authentication is used to verify the claimed identity of a user, client, server, and so on.
In the case of clients, the basic premise is that before a service will do work for a client,
the service must learn the client’s identity. Typically, users are authenticated by means of
passwords, but there are many other ways to authenticate clients.

3. Authorization - After a client has been authenticated, it is necessary to check whether
that client is authorized to perform the action requested. Access to records in a medical
database is a typical example. Depending on who accesses the database, permission may
be granted to read records, to modify certain fields in a record, or to add or remove a
record.

4. Auditing - tools are used to trace which clients accessed what, and in which way. Although
auditing does not really provide any protection against security threats, audit logs can be
extremely useful for the analysis of a security breach, and subsequently taking measures
against intruders. For this reason, attackers are generally keen not to leave any traces that
could eventually lead to exposing their identity. In this sense, logging accesses makes
attacking sometimes a riskier business.

Design Issues - A distributed system, or any computer system for that matter, must provide
security services by which a wide range of security policies can be implemented. There are a
number of important design issues that need to be taken into account when implementing
general-purpose security services. In the following pages, we discuss three of these issues: focus
of control, layering of security mechanisms, and simplicity

Focus of Control - When considering the protection of a (possibly distributed) application, there
are essentially three different approaches that can be followed: protect the data associated with
the application directly; specify exactly which operations may be invoked, and by whom,
protecting the application through its interfaces; or focus directly on the users, by restricting
which people may access the application in the first place.

Distribution of Security Mechanisms
Dependencies between services regarding trust lead to the notion of a Trusted Computing Base
(TCB). A TCB is the set of all security mechanisms in a (distributed) computer system that are
needed to enforce a security policy. The smaller the TCB, the better. If a distributed system is
built as middleware on an existing network operating system, its security may depend on the
security of the underlying local operating systems. In other words, the TCB in a distributed
system may include the local operating systems at various hosts. Consider a file server in a
distributed file system. Such a server may need to rely on the various protection mechanisms
offered by its local operating system. Such mechanisms include not only those for protecting
files against accesses by processes other than the file server, but also mechanisms to protect the
file server from being maliciously brought down.

Middleware-based distributed systems thus require trust in the existing local operating systems
they depend on. If such trust does not exist, then part of the functionality of the local operating
systems may need to be incorporated into the distributed system itself. Consider a microkernel
operating system, in which most operating-system services run as normal user processes. In this
case, the file system, for instance, can be entirely replaced by one tailored to the specific needs of
a distributed system, including its various security measures. Consistent with this approach is to
separate security services from other types of services by distributing services across different
machines depending on the required security. For example, for a secure distributed file system, it
may be possible to isolate the file server from clients by placing the server on a machine with a
trusted operating system, possibly running a dedicated secure file system. Clients and their
applications are placed on untrusted machines.

This separation effectively reduces the TCB to a relatively small number of machines and
software components. By subsequently protecting those machines against security attacks from
the outside, overall trust in the security of the distributed system can be increased. Preventing
clients and their applications direct access to critical services is followed in the Reduced
Interfaces for Secure System Components (RISSC) approach, as described in (Neumann, 1995).
In the RISSC approach, any security-critical server is placed on a separate machine isolated from
end-user systems using low-level secure network interfaces

Simplicity - Another important design issue related to deciding in which layer to place a security
mechanism is that of simplicity. Designing a secure computer system is generally considered a
difficult task. Consequently, if a system designer can use a few simple mechanisms that are
easily understood and trusted to work, so much the better.

Cryptography - Fundamental to security in distributed systems is the use of cryptographic
techniques. The basic idea of applying these techniques is simple. Consider a sender S wanting
to transmit message m to a receiver R. To protect the message against security threats, the sender
first encrypts it into an unintelligible message m’, and subsequently sends m’ to R. R, in turn,
must decrypt the received message into its original form. Encryption and decryption are
accomplished by using cryptographic methods parameterized by keys.

INTRUDERS AND EAVESDROPPERS IN COMMUNICATION

To describe the various security protocols that are used in building security services for
distributed systems, it is useful to have a notation to relate plaintext, cipher text, and keys.
Following the common notational conventions, we will use C = EK(P) to denote that the cipher
text C is obtained by encrypting the plaintext P using key K. Likewise, P = DK(C) is used to
express the decryption of the cipher text C using key K, resulting in the plaintext P.

 First, an intruder may intercept the message without either the sender or receiver being
aware that eavesdropping is happening. Of course, if the transmitted message has been
encrypted in such a way that it cannot be easily decrypted without having the proper key,
interception is useless: the intruder will see only unintelligible data
 The second type of attack that needs to be dealt with is that of modifying the message.
Modifying plaintext is easy; modifying cipher text that has been properly encrypted is
much more difficult because the intruder will first have to decrypt the message before it
can meaningfully modify it. In addition, he will also have to properly encrypt it again or
otherwise the receiver may notice that the message has been tampered with.
 The third type of attack is when an intruder inserts encrypted messages into the
communication system, attempting to make R believe these messages came from S. Again,
encryption can help protect against such attacks. Note that if an intruder can modify
messages, he can also insert messages. There is a fundamental distinction between
different cryptographic systems, based on whether or not the encryption and decryption
keys are the same.

In a symmetric cryptosystem, the same key is used to encrypt and decrypt a message. In other
words, P = DK(EK(P)). Symmetric cryptosystems are also referred to as secret-key or shared-key
systems, because the sender and receiver are required to share the same key, and to ensure that
protection works, this shared key must be kept secret; no one else is allowed to see the key. We
will use the notation KAB to denote a key shared by A and B.
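As a concrete illustration of P = DK(EK(P)), the following sketch uses the third-party Python cryptography package (pip install cryptography); its Fernet recipe is one symmetric scheme standing in for the generic EK/DK of these notes, and the message content is an illustrative assumption:

```python
from cryptography.fernet import Fernet

K_AB = Fernet.generate_key()            # the secret key shared by A and B
f = Fernet(K_AB)

C = f.encrypt(b"transfer $100 to B")    # C = EK(P): unintelligible to an intruder
P = f.decrypt(C)                        # P = DK(C): only the key holder recovers P
assert P == b"transfer $100 to B"       # P = DK(EK(P))
```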

In an asymmetric cryptosystem, the keys for encryption and decryption are different, but
together form a unique pair. In other words, there is a separate key KE for encryption and one for
decryption, KD, such that P = DKD(EKE(P)). One of the keys in an asymmetric cryptosystem is
kept private, the other is made public. For this reason, asymmetric cryptosystems are also
referred to as public-key systems. In what follows, we use the notation K+A to denote a public
key belonging to A, and K-A as its corresponding private key.
Attacks
Passive attacks are mainly based on observation without altering data or compromising services,
they represent the interception and interruption forms of security threats. The simplest form of
attack is browsing, which implies the nondestructive examination of all accessible data. This
leads to the need for confidentiality and the need-to-know principle. Related is the leaking of
information via authorized accomplices, which leads to the confinement problem. More indirect
are attempts to infer information from traffic analysis, code breaking, and so on. In contrast,
active attacks alter or delete data and may cause service to be denied to authorized users. They
represent the modification and fabrication forms of security threats. Typical active attacks
attempt to modify or destroy files. Communication-related active attacks attempt to modify the
data sent over a communication channel.

ATTACK OF COMMUNICATION CHANNEL


Because of its networked nature, the communication channel presents a particularly important
vulnerability in distributed systems. As such, many of the threats faced by distributed systems
come in the form of attacks on their communication channels. We distinguish between five
different types of attacks on communication channels.

 Eavesdropping - Attacks involve obtaining copies of messages without authorization.
This could, for example, involve sniffing passwords being sent over the network.
 Masquerading - Attacks involve sending or receiving messages using the identity of
another principal without their authority. In a typical masquerading attack the attacker
sends messages to the victim with the headers modified so that it looks like the messages
are being sent by a trusted third party.
 Message Tampering - Attacks involve the interception and modification of messages so
that they have a different effect than what was originally intended. One form of the
message tampering attack is called the man-in-the-middle attack. In this attack the
attacker intercepts the first message in an exchange of keys and is able to establish a
secure channel with both the original sender and intended receiver. By placing itself in
the middle the attacker can view and modify all communication over that channel.
 Replay - Attacks involve resending intercepted messages at a later time in order to
cause an action to be repeated. This kind of attack can be effective even if
communication is authenticated and encrypted.
 Denial of Service - Attacks involve the flooding of a channel with messages so that
access is denied to others.

AUTHENTICATION
Authentication involves verifying the claimed identity of an entity (or principal). Authentication
requires a representation of identity (i.e., some way to represent a principal’s identity, such as a
user name, a bank account, etc.) and some way to verify that identity (e.g., a password, a
passport, a PIN, etc.). Depending on the system’s requirements, different strengths of
id, while in other cases a certificate signed by a trusted authority may be required to prove a
principal’s identity. A comprehensive logic of authentication has been developed by Lampson et
al.

A verified identity is represented by a credential. A certificate signed by a trusted authority
stating that the bearer of the certificate has been successfully authenticated is an example of a
credential. A credential has the property that it speaks for a principal. In some cases it is
necessary for more than one principal to authorize an action. In that case multiple credentials
could be combined to speak for those principals. A credential can also be made to represent a
role (e.g., a system administrator) rather than an individual. Roles can be assigned to specific
principals as needed.

 Authentication Based on a Shared Secret Key - A naive challenge-response protocol
based on a shared secret key can easily be defeated by what is known as a reflection
attack (a code sketch of the shared-key idea follows this list).
 The Reflection Attack - A second approach improves on the shared-key protocol by
storing all keys at a key distribution centre (KDC). The KDC stores a copy of each
entity’s secret key, and can, therefore, communicate securely with each entity.
 The Principle of Using a KDC - A drawback of the KDC approach is that it requires a
centralized and trusted service (the KDC).
 Using a Ticket and Letting Alice Set Up a Connection to Bob - The third approach
makes use of public keys to securely authenticate a principal. By sending a message
encrypted with its private key a principal can prove its identity to the authenticator. A
problem with this approach is that the authenticator must have the principal’s public key
(and trust that it does indeed belong to that principal).
 Mutual Authentication in a Public-Key Cryptosystem - A different approach
combines the public-key and shared secret key approaches. In this approach (which is
used by the secure shell (ssh) protocol) the two parties first establish a secure channel by
exchanging a session key encrypted using their public keys, and then exchange their
authentication information over this secure channel.
 Protection (and Authorisation and Access Control) - Once a principal has been
authenticated it is necessary to determine what actions that principal is allowed to
perform and to enforce any restrictions placed on it. Restricting actions and enforcing
those restrictions allows resources to be protected against abuse.
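As referenced in the first bullet above, here is a minimal challenge-response sketch over a shared secret key, using Python’s standard hmac and hashlib modules; the message layout is an illustrative assumption, not a standard protocol:

```python
import hmac, hashlib, secrets

K_AB = secrets.token_bytes(32)   # the secret key Alice and Bob already share

def respond(key, challenge):
    """Alice's side: prove knowledge of the key without ever sending it."""
    return hmac.new(key, challenge, hashlib.sha256).digest()

# Bob's side: issue a fresh nonce so that replaying an old answer is useless,
# then verify Alice's response against his own computation.
nonce = secrets.token_bytes(16)
answer = respond(K_AB, nonce)
expected = hmac.new(K_AB, nonce, hashlib.sha256).digest()
print(hmac.compare_digest(answer, expected))   # True: Alice is authenticated
```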

PROTECTION SYSTEM
Evaluate implementations of a protection system based on the following design considerations:
 Propagation of Rights: Can someone act as an agent’s proxy? That is, can one subject’s
access rights be delegated to another subject?
 Restriction of Rights: Can a subject propagate a subset of their rights (as opposed to all
of their rights)?
 Amplification of Rights: Can an unprivileged subject perform some privileged
operations (i.e., (temporarily) extend their protection domain)?
 Revocation of Rights: Can a right, once granted, be removed from a subject?
 Determination of Object Accessibility: Who has which access rights on a particular
object?
 Determination of a Subject’s Protection Domain: What is the set of objects that a
particular subject can access?

Implementation of the Access Matrix
An efficient representation of the access matrix can be achieved by representing it either by
column or row. A column-wise representation that associates access rights with objects is called
an access control list (ACL). Row-wise representations that associate access rights with subjects
are based on capabilities.
Access Control Lists - Each object may be associated with an access control list (ACL), which
corresponds to one column of the access matrix, and is represented as a list of subject-rights
pairs. When a subject tries to access an object, the set of rights associated with that subject is
used to determine whether access should be granted. This comparison of the accessing subject’s
identity with the subjects mentioned in the ACL requires prior authentication of the subject.
ACL-based systems usually support a concept of group rights (granted to each agent belonging
to the group) or domain classes (a code sketch follows the list of properties below).
The properties of ACLs, with respect to the previously listed design considerations, are as
follows
 Propagation: the owner of an object can add to or modify entries to the ACL
 Restriction: anyone who has the right to modify the ACL can restrict access
 Amplification: ACL entries can include protected invocation rights (e.g., setuid)
 Revocation: access rights can be revoked by removing or modifying ACL entries
 Object Accessibility: the ACL itself represents an object’s accessibility
 Protection Domain: hard (if not impossible) because the ACLs of all objects in the
system must be inspected.
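A small sketch of ACL checking with group rights, as referenced above; the dictionary layout and names are illustrative assumptions:

```python
# Column-wise representation: each object carries (subject, rights) entries.
acl = {
    "payroll.db": {"alice": {"read", "write"}, "auditors": {"read"}},
}
groups = {"auditors": {"bob", "carol"}}

def allowed(subject, obj, right):
    entries = acl.get(obj, {})
    if right in entries.get(subject, set()):                     # direct entry
        return True
    return any(right in rights and subject in groups.get(g, set())
               for g, rights in entries.items())                 # group rights

print(allowed("bob", "payroll.db", "read"))    # True  (via the auditors group)
print(allowed("bob", "payroll.db", "write"))   # False (no such right granted)
```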

FIREWALLS
A different form of protection that can be employed in distributed systems is that offered by
firewalls. A firewall is generally used when communicating with external untrusted clients and
servers, and serves to disconnect parts of the system from the outside world, allowing inbound
(and possibly outbound) communication only on predefined ports. Besides simply blocking
communication, firewalls can also inspect incoming (or outgoing) communication and filter out
suspicious messages. The two main types of firewalls are packet-filtering and application-level
firewalls. Packet-filtering firewalls work at the packet level, filtering network packets based on
the contents of their headers.

Application-level firewalls, on the other hand, filter messages based on their contents. They are
capable of spotting and filtering malicious content arriving over otherwise innocuous
communication channels (e.g., virus filtering email gateways).
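A toy packet filter illustrating the header-based, first-match-wins rules of a packet-filtering firewall; the rule table and the packet representation are assumptions made for the example:

```python
# Rules match header fields only; the first matching rule decides the action.
RULES = [
    {"dst_port": 25,   "action": "drop"},    # block inbound SMTP
    {"dst_port": 80,   "action": "accept"},  # allow web traffic
    {"dst_port": None, "action": "drop"},    # default deny (matches everything)
]

def filter_packet(packet):
    for rule in RULES:
        if rule["dst_port"] in (None, packet["dst_port"]):
            return rule["action"]

print(filter_packet({"src": "10.0.0.5", "dst_port": 80}))  # accept
print(filter_packet({"src": "10.0.0.5", "dst_port": 25}))  # drop
```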
