
(Telugu Minority Institution) (Approved by AICTE, NBA accredited and affiliated to AU)

(ISO 9001-2000 Certified Institution)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

LECTURE NOTES
CS6703 – GRID AND CLOUD COMPUTING

(2013 Regulation)

Year/Semester: IV/VII CSE

Prepared by
Dr. R. Suguna, Dean - CSE
Mrs. T. Kujani, AP - CSE

CS6703 Grid and Cloud Computing 1


CS6703 GRID AND CLOUD COMPUTING L T P C 3 0 0 3

OBJECTIVES: The student should be made to:

Understand how Grid computing helps in solving large scale scientific problems.
Gain knowledge on the concept of virtualization that is fundamental to cloud computing.
Learn how to program the grid and the cloud.
Understand the security issues in the grid and the cloud environment.

UNIT I INTRODUCTION 9

Evolution of Distributed computing: Scalable computing over the Internet – Technologies for
network based systems – clusters of cooperative computers - Grid computing Infrastructures –
cloud computing - service oriented architecture – Introduction to Grid Architecture and
standards – Elements of Grid – Overview of Grid Architecture.

UNIT II GRID SERVICES 9

Introduction to Open Grid Services Architecture (OGSA) – Motivation – Functionality Requirements – Practical & Detailed view of OGSA/OGSI – Data intensive grid service models – OGSA services.

UNIT III VIRTUALIZATION 9

Cloud deployment models: public, private, hybrid, community – Categories of cloud computing:
Everything as a service: Infrastructure, platform, software - Pros and Cons of cloud computing –
Implementation levels of virtualization – virtualization structure – virtualization of CPU, Memory
and I/O devices – virtual clusters and Resource Management – Virtualization for data center
automation.

UNIT IV PROGRAMMING MODEL 9

Open source grid middleware packages – Globus Toolkit (GT4) Architecture , Configuration –
Usage of Globus – Main components and Programming model - Introduction to Hadoop
Framework - Mapreduce, Input splitting, map and reduce functions, specifying input and output
parameters, configuring and running a job – Design of Hadoop file system, HDFS concepts,
command line and java interface, dataflow of File read & File write.

UNIT V SECURITY 9

Trust models for Grid security environment – Authentication and Authorization methods – Grid
security infrastructure – Cloud Infrastructure security: network, host and application level –
aspects of data security, provider data and its security, Identity and access management
architecture, IAM practices in the cloud, SaaS, PaaS, IaaS availability in the cloud, Key privacy
issues in the cloud.

CS6703 Grid and Cloud Computing 2


OUTCOMES: At the end of the course, the student should be able to:
Apply grid computing techniques to solve large scale scientific problems.
Apply the concept of virtualization.
Use the grid and cloud tool kits.
Apply the security models in the grid and the cloud environment.

TEXT BOOK:
1. Kai Hwang, Geoffrey C. Fox and Jack J. Dongarra, "Distributed and Cloud Computing: Clusters, Grids, Clouds and the Future of Internet", First Edition, Morgan Kaufmann, an Imprint of Elsevier, 2012.

REFERENCES:

1. Jason Venner, "Pro Hadoop - Build Scalable, Distributed Applications in the Cloud", Apress, 2009.

2. Tom White, "Hadoop: The Definitive Guide", First Edition, O'Reilly, 2009.

3. Bart Jacob (Editor), "Introduction to Grid Computing", IBM Redbooks, Vervante, 2005.

4. Ian Foster, Carl Kesselman, "The Grid: Blueprint for a New Computing Infrastructure", 2nd Edition, Morgan Kaufmann.

5. Frederic Magoules and Jie Pan, "Introduction to Grid Computing", CRC Press, 2009.

6. Daniel Minoli, "A Networking Approach to Grid Computing", John Wiley Publication, 2005.

7. Barry Wilkinson, "Grid Computing: Techniques and Applications", Chapman and Hall/CRC, Taylor and Francis Group, 2010.



CONTENTS
UNIT I INTRODUCTION
1.1 Evolution of Distributed computing: Scalable computing over the Internet 9
1.1.1 The Age of Internet Computing 9
1.1.2 Scalable Computing Trends and New Paradigms 12
1.1.3 The Internet of Things and Cyber-Physical Systems 13
1.2 TECHNOLOGIES FOR NETWORK-BASED SYSTEMS 14
1.2.1 System Components and Wide-Area Networking 14
1.3 SYSTEM MODELS FOR DISTRIBUTED AND CLOUD COMPUTING 22
1.3.1 Clusters of Cooperative Computers 22
1.3.2 Grid Computing Infrastructures 24
1.3.3 Cloud Computing 26
1.3.4 Service-Oriented Architectures (SOA) 31
1.4 Grid Computing 35
1.4.1 Introduction to Grid Architecture and Standards 35
1.4.2 Grid Computing Systems 36
1.4.3 Grid Architecture 36
1.4.4 Layered Grid Architecture 37
1.4.5 Grid Standards 38
1.4.6 Elements of Grid 40
1.4.7 Overview of Grid Architecture 42
UNIT II GRID SERVICES
2.1 Introduction to Open Grid Services Architecture (OGSA) 51
2.2 MOTIVATIONS FOR STANDARDIZATION 53
2.3 Functional Requirements for OGSA 56
2.3.1 Basic Functionality Requirements 56
2.3.2 Security Requirements 57
2.3.3 Resource Management Requirements 58
2.3.4 System Properties Requirements 59
2.3.5 Other Functionality Requirements 60
2.4 OGSA: A PRACTICAL VIEW 61
2.4.1 Objectives of OGSA 61
2.5 OGSA: A MORE DETAILED VIEW 64
2.5.1 Introduction 64
2.5.2 Setting the Context 65
2.5.3 The Grid Service 70
2.5.4 WSDL Extensions and Conventions 71
2.5.5 Service Data 71
2.5.6 Core Grid Service Properties 74
2.6 Data intensive grid service models 76
2.6.1 Data Replication and Unified Namespace 77
2.6.2 Grid Data Access Models 77
2.6.3 Parallel versus Striped Data Transfers 79



2.7 OGSA services 79
2.7.1 Handle Resolution 79
2.7.2 Virtual Organization Creation and Management 79
2.7.3 Service Groups and Discovery Services 80
2.7.4 Choreography, Orchestrations and Workflow 80
2.7.5 Transactions 81
2.7.6 Metering Service 81
2.7.7 Rating Service 81
2.7.8 Accounting Service 81
2.7.9 Billing and Payment Service 82
2.7.10 Installation, Deployment, and Provisioning 82
2.7.11 Distributed Logging 82
2.7.12 Messaging and Queuing 82
2.7.13 Event 83
2.7.14 Policy and Agreements 84
2.7.15 Base Data Services 85
2.7.16 Other Data Services 86
2.7.17 Discovery Services 86
2.7.18 Job Agreement Service 87
2.7.19 Reservation Agreement Service 87
2.7.20 Data Access Agreement Service 87
2.7.21 Queuing Service 88
2.7.22 Open Grid Services Infrastructure 88
2.7.23 Common Management Model 89
UNIT III VIRTUALIZATION
3.1 Cloud Deployment Models 94
3.1.1 Types of cloud computing 94
3.2 Pros and Cons of Cloud Computing 100
3.3 Implementation Levels of Virtualization 103
3.3.1 Levels of virtualization implementation 103
3.3.2 VMM design requirements and providers 107
3.3.3 Virtualization support at the OS level 109
3.3.4 Middleware support for virtualization 111
3.4 Virtualization Structures/Tools and Mechanisms 113
3.4.1 Hypervisor and XEN Architecture 113
3.4.2 Binary Translation with Full Virtualization 114
3.4.3 Para-Virtualization With Compiler Support 117
3.5 Virtualization of CPU, Memory, and I/O Devices 119
3.5.1 Hardware Support for Virtualization 119
3.5.2 CPU Virtualization 121
3.5.3 Memory Virtualization 122
3.5.4 I/O Virtualization 124
3.5.5 Virtualization in Multi-Core Processors 126
3.6 Virtual Clusters and Resource Management 127



3.6.1 Physical Versus Virtual Clusters 127
3.6.2 Live VM Migration Steps And Performance Effects 130
3.6.3 Migration of Memory, Files, and Network Resources 131
3.7 Virtualization for Data-Center Automation 134
3.7.1 Server consolidation in data centers 134
3.7.2 Virtual Storage Management 135
3.7.3 Cloud OS for Virtualized Data Centers 137
3.7.4 Trust Management in Virtualized Data Centers 139
UNIT IV PROGRAMMING MODEL
4.1 Open source grid middleware packages 147
4.1.1 Basic Functional Grid Middleware Packages 148
4.1.2 Globus 150
4.1.3 Legion 151
4.1.4 Gridbus 152
4.2 Globus Toolkit (GT4) Architecture 153
4.2.1 Primary GT4 Components 155
4.3 Configuration 156
4.3.1 GT4 Configuration 158
4.4 Usage of Globus 160
4.4.1 Definition of Job 160
4.4.2 Staging Files 160
4.4.3 Job Submission 161
4.5 Main components and Programming model 163
4.5.1 Security(GSI) 163
4.5.2 Data management 163
4.5.3 Data Replica Components 165
4.5.3.1 Replica Location Service (RLS) 165
4.5.3.2 Data Replication Service (DRS) 165
4.5.4 Execution management 165
4.5.5 Monitoring and Discovery Services 165
4.5.6 Server Programming Model 166
4.5.7 Client Programming Model 168
4.6 Introduction to Hadoop Framework 168
4.6.1 The Parts of a Hadoop MapReduce Job 170
4.7 Input Splitting 177

4.7.1 MapReduce Inputs And Splitting 177


4.8 Map Reduce Input and Output Formats 179
4.8.1 Input Formats 179
4.8.2 Output Formats 187
4.9 Configuring a Job 190
4.10 Running a Job 196
4.11 Design of Hadoop file system 197
4.11.1 Features 197
4.11.2 HDFS Concepts 198



4.11.3 Hadoop Filesystems 199
4.12 The Java Interface 200
4.13 Data Flow 204
4.13.1 Anatomy of a File Read 204
4.13.2 Anatomy of a File Write 206
UNIT V SECURITY
5.1 Trust models for Grid security environment 212
5.1.1 A Generalized Trust Model 213
5.1.2 Reputation-Based Trust Model 214
5.1.3 A Fuzzy-Trust Model 214
5.2 AUTHENTICATION AND AUTHORIZATION METHODS 214
5.2.1 Authorization for Access Control 215
5.2.2 Three Authorization Models 216
5.3 GRID SECURITY INFRASTRUCTURE (GSI) 216
5.3.1 GSI Functional Layers 217
5.3.2 Transport-Level Security 218
5.3.3 Message-Level Security 218
5.3.4 Authentication and Delegation 218
5.3.5 Trust Delegation 220
5.4 Cloud Infrastructure security: network, host and application level 221
5.5 Aspects of Data Security in the Cloud 225
5.6 Identity Management and Access Control 231
5.6.1 Identity Management 231
5.6.2 Access Control 235
5.6.3 IAM practices in the cloud 238
5.6.4 Achieving availability in the cloud (SaaS, PaaS, IaaS) 241
5.6.5 Key privacy issues in the cloud 243



UNIT I
INTRODUCTION




1.1 Evolution of Distributed computing: Scalable computing over the Internet

Computing technology has undergone a series of platform and environment changes. Instead of using a centralized computer to solve computational problems, a parallel and distributed computing system uses multiple computers to solve large-scale problems over the Internet. Hence, distributed computing becomes data-intensive and network-centric.

1.1.1 The Age of Internet Computing


Large numbers of users access the Internet every day. Therefore, supercomputer sites and large data centers must provide high-performance computing services to huge numbers of Internet users concurrently. There is a need to upgrade data centers using fast servers, storage systems, and high-bandwidth networks. The purpose is to advance network-based computing and web services with the emerging new technologies.

1.1.1.1 The Platform Evolution


Computer technology has gone through five generations of development, with each generation
lasting from 10 to 20 years.
 From 1950 to 1970, mainframes such as the IBM 360 and CDC 6400 were built to satisfy the demands of large businesses and government organizations.
 From 1960 to 1980, lower-cost minicomputers such as the DEC PDP 11 and VAX series became popular among small businesses and on college campuses.
 From 1970 to 1990, personal computers built with VLSI microprocessors came into widespread use.
 From 1980 to 2000, massive numbers of portable computers and pervasive devices appeared in both wired and wireless applications.
 Since 1990, the use of both HPC and HTC systems hidden in clusters, grids, or Internet clouds has flourished. The figure below illustrates the evolution of HPC and HTC systems.



Figure1.1 Evolutionary trend

On the HPC side, supercomputers (massively parallel processors, or MPPs) are gradually being replaced by clusters of cooperative computers out of a desire to share computing resources. A cluster is often a collection of homogeneous compute nodes that are physically connected in close range to one another.

On the HTC side, peer-to-peer (P2P) networks are formed for distributed file sharing and
content delivery applications. A P2P system is built over many client machines. Peer
machines are globally distributed in nature. P2P, cloud computing, and web service
platforms are more focused on HTC applications than on HPC applications.

1.1.1.2 High-Performance Computing


For many years, HPC systems have emphasized raw speed performance. The increase in the speed of HPC systems has been driven mainly by demands from the scientific, engineering, and manufacturing communities. Today, however, the majority of computer users use desktop computers or large servers when they conduct Internet searches and market-driven computing tasks.
1.1.1.3 High-Throughput Computing
The development of market-oriented high-end computing systems is undergoing a strategic change from an HPC paradigm to an HTC paradigm. The HTC paradigm pays more attention to high-flux computing, whose main application is in Internet searches and web services accessed by millions or more users simultaneously. The performance goal thus shifts to measuring high throughput, that is, the number of tasks completed per unit of time. HTC technology needs not only to improve batch processing speed, but also to address the acute problems of cost, energy savings, security, and reliability at many data and enterprise computing centers.

1.1.1.4 The Three New Computing Paradigms


Advances in virtualization make it possible to see the growth of Internet clouds as a new
computing paradigm.



The maturity of radio-frequency identification (RFID), Global Positioning System (GPS), and
sensor technologies has triggered the development of the Internet of Things (IoT).

1.1.1.5 Computing Paradigm Distinctions


The following list defines the various computing paradigms:

i) Centralized computing - This is a computing paradigm by which all computer


resources are centralized in one physical system. All resources are fully shared
and tightly coupled within one integrated OS. Many data centers and
supercomputers are centralized systems, but they are used in parallel,
distributed, and cloud computing applications.
ii) Parallel computing - In parallel computing, all processors are either tightly
coupled with centralized shared memory or loosely coupled with distributed
memory. A computer system capable of parallel computing is commonly known
as a parallel computer. The process of writing parallel programs is often referred
to as parallel programming.
iii) Distributed computing - A distributed system consists of multiple autonomous
computers, each having its own private memory, communicating through a
computer network. Information exchange in a distributed system is accomplished
through message passing. A computer program that runs in a distributed system
is known as a distributed program. The process of writing distributed programs is
referred to as distributed programming.
iv) Cloud computing - An Internet cloud of resources can be either a centralized or a
distributed computing system. The cloud applies parallel or distributed
computing, or both. Clouds can be built with physical or virtualized resources
over large data centers that are centralized or distributed. Some authors consider
cloud computing to be a form of utility computing or service computing.
v) Ubiquitous computing refers to computing with pervasive devices at any place
and time using wired or wireless communication.
vi) The Internet of Things (IoT) is a networked connection of everyday objects
including computers, sensors, humans, etc.
vii) Internet computing is even broader and covers all computing paradigms over the
Internet.

1.1.1.6 Distributed System Families


Technologies for building P2P networks and networks of clusters have been
consolidated into many national projects designed to establish wide area computing
infrastructures, known as computational grids or data grids.
Both HPC and HTC systems will demand multicore or many-core processors that can
handle large numbers of computing threads per core. Both HPC and HTC systems emphasize
parallelism and distributed computing. Future HPC and HTC systems must be able to satisfy
this huge demand in computing power in terms of throughput, efficiency, scalability, and
reliability.
The system efficiency is decided by speed, programming, and energy factors. The following
are the design objectives:
 Efficiency measures the utilization rate of resources in an execution model by exploiting
massive parallelism in HPC.
 Dependability measures the reliability and self-management from the chip to the system
and application levels.



 Adaptation in the programming model measures the ability to support billions of job
requests over massive data sets and virtualized cloud resources under various workload
and service models.
 Flexibility in application deployment measures the ability of distributed systems to run
well in both HPC and HTC applications.

1.1.2 Scalable Computing Trends and New Paradigms


Several predictable trends in technology are known to drive computing applications. In fact,
designers and programmers want to predict the technological capabilities of future systems.

1.1.2.1 Degrees of Parallelism


a) Bit-level parallelism (BLP) - Most computers were designed in a bit-serial fashion, when
hardware was huge and expensive. Bit-level parallelism (BLP) converts bit-serial processing to
word-level processing gradually.
b) Instruction-level parallelism (ILP) - Over the past years, users graduated from 4-bit
microprocessors to 8, 16, 32, and 64-bit CPUs. This led to instruction-level parallelism (ILP), in
which the processor executes multiple instructions simultaneously rather than only one
instruction at a time.
c) Data-level parallelism (DLP) was made popular through SIMD (single instruction, multiple
data) and vector machines using vector or array types of instructions. DLP requires even more
hardware support and compiler assistance to work properly.
d) Task-level parallelism (TLP)- the introduction of multicore processors and chip
multiprocessors (CMPs) explored TLP.
e) Job-level parallelism (JLP) – the increase in computing granularity arises as we move from
parallel processing to distributed processing.
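The step up to task-level parallelism (TLP) can be illustrated with a short sketch: independent tasks are dispatched to a pool of workers instead of being executed one after another. The word-counting task and pool size below are our own illustrative choices, not from the textbook.

```python
# Illustrative sketch of task-level parallelism (TLP): independent
# tasks run concurrently on a thread pool rather than serially.
from concurrent.futures import ThreadPoolExecutor

def count_words(text):
    return len(text.split())

documents = ["grid and cloud computing",
             "high throughput computing",
             "internet of things"]

# Serial execution: one task at a time.
serial_counts = [count_words(d) for d in documents]

# Task-level parallel execution: tasks dispatched to worker threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel_counts = list(pool.map(count_words, documents))

assert serial_counts == parallel_counts   # same answers, higher throughput
print(parallel_counts)                    # [4, 3, 3]
```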

1.1.2.2 Innovative Applications

Table 1.1 Applications of High-Performance and High-Throughput Systems

Domain: Science and engineering
Applications: Scientific simulations, genomic analysis, etc.; earthquake prediction, global warming, weather forecasting, etc.

Domain: Business, education, services industry, and health care
Applications: Telecommunication, content delivery, e-commerce, etc.; banking, stock exchanges, transaction processing, etc.; air traffic control, electric power grids, distance education, etc.; health care, hospital automation, telemedicine, etc.

Domain: Internet and web services, and government applications
Applications: Internet search, data centers, decision-making systems, etc.; traffic monitoring, worm containment, cyber security, etc.

Domain: Mission-critical applications
Applications: Military command and control, intelligent systems, crisis management, etc.



1.1.2.3 The Trend toward Utility Computing
The following figure identifies major computing paradigms to facilitate the study of distributed
systems and their applications. These paradigms share some common characteristics.

Figure 1.2 The vision of computer utilities


Utility computing - focuses on a business model in which customers receive computing
resources from a paid service provider.
Distributed cloud applications run on any available servers in some edge networks.

1.1.2.4 The Hype Cycle of New Technologies


The hype cycle shows the technology status. It is observed from the figure that the cloud
technology had just crossed the peak of the expectation stage in 2010, and it was expected to
take two to five more years to reach the productivity stage.

Figure1.3 Hype cycle for Emerging Technologies

1.1.3 The Internet of Things and Cyber-Physical Systems


Two Internet development trends are discussed here: the Internet of Things and cyber-physical systems.



1.1.3.1 The Internet of Things
The IoT refers to the networked interconnection of everyday objects, tools, devices, or
computers. One can view the IoT as a wireless network of sensors that interconnect all things in
our daily life. The dynamic connections will grow exponentially into a new dynamic network of
networks, called the Internet of Things (IoT).

1.1.3.2 Cyber-Physical Systems:


A cyber-physical system (CPS) is the result of interaction between computational
processes and the physical world. A CPS integrates "cyber" with "physical" objects. A CPS
merges the "3C" technologies of computation, communication, and control into an intelligent
closed feedback system between the physical world and the information world, a concept which
is actively explored in the United States. The IoT emphasizes various networking connections
among physical objects, while the CPS emphasizes exploration of virtual reality (VR)
applications in the physical world.

1.2 TECHNOLOGIES FOR NETWORK-BASED SYSTEMS

We will discuss the viable approaches to build distributed operating systems for handling
massive parallelism in a distributed environment.
1.2.1 System Components and Wide-Area Networking
In recent years, component and network technologies have been crucial to building HPC or HTC systems. Processor speed is measured in MIPS (million instructions per second) and network bandwidth in Mbps or Gbps (mega- or gigabits per second).
1.2.1.1 Advances in Processors:
 CPUs today assume a multi-core architecture with dual, quad, six, or more processing cores. By Moore's law, processor speed doubles every 18 months; this doubling effect held accurately over the past 30 years.
 The clock rate increased from 10 MHz for the Intel 286 to 4 GHz for the Pentium 4 in 30 years. However, the clock rate has reached its limit on CMOS chips due to power limitations: clock speeds cannot continue to increase because of excessive heat generation and current leakage.
 ILP (instruction-level parallelism) is exploited in modern processors. ILP mechanisms include multiple-issue superscalar architecture, dynamic branch prediction, speculative execution, etc. These ILP techniques are all hardware- and compiler-supported. In addition, DLP (data-level parallelism) and TLP (thread-level parallelism) are also highly explored in today's processors.
 Many processors have now been upgraded to multi-core and multithreaded microarchitectures. The architecture of a typical multicore processor is shown in Fig. 1.4. Each core is essentially a processor with its own private cache (L1 cache). Multiple cores are housed in the same chip.
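The Moore's law figure quoted above can be checked with back-of-the-envelope arithmetic: one doubling every 18 months, sustained over 30 years, gives 20 doublings, i.e. roughly a million-fold increase.

```python
# Back-of-the-envelope check of Moore's law as stated above:
# one doubling every 18 months, sustained over 30 years.
months = 30 * 12
doublings = months / 18        # 20 doublings
growth = 2 ** doublings        # about a million-fold increase
print(doublings, growth)       # 20.0 1048576.0
```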



Fig. 1.4 Multicore Processor
The L1 cache is private to each core, the L2 cache is shared, and the L3 cache or DRAM is off the chip. Examples of multi-core CPUs include the Intel i7, Xeon, and AMD Opteron. Each core can also be multithreaded, e.g. the Niagara II has 8 cores, with each core handling 8 threads, for a maximum of 64 threads.
1.2.1.2 Multithreading Technology
Multithreading is the ability of a program or an operating system process to serve more than one user at a time and to manage multiple requests by the same user without needing multiple copies of the program running in the computer.
Consider the dispatch of five independent threads of instructions to four pipelined data paths
(functional units) in each of the following five processor categories
 Four-issue superscalar (e.g. Sun UltraSPARC I)
 Implements instruction level parallelism (ILP) within a single processor.
 Executes more than one instruction during a clock cycle by sending multiple
instructions to redundant functional units.

 Fine-grain multithreaded processor


 Switch threads after each cycle
 Interleave instruction execution
 If one thread stalls, others are executed
 Coarse-grain multithreaded processor
 Executes a single thread until it reaches certain situations
 Simultaneous multithread processor (SMT)
 Instructions from more than one thread can execute in any given pipeline stage at a
time.



Figure 1.5 Five micro-architectures in modern CPU processors
Each row represents the issue slots for a single execution cycle: a filled box indicates that the processor found an instruction to execute in that issue slot on that cycle; an empty box denotes an unused slot.
1.2.1.3 Multicore Architecture:
 The number of working cores on the same CPU chip could reach hundreds in the next few
years.
 Graphics processing units (GPUs) appeared in HPC systems. A GPU is a graphics coprocessor or accelerator mounted on a computer's graphics card or video card. A GPU offloads the CPU from tedious graphics tasks in video editing applications.
 Traditional CPUs are structured with only a few cores. For example, the Xeon X5670 CPU
has six cores. However, a modern GPU chip can be built with hundreds of processing cores.
Unlike CPUs, GPUs have a throughput architecture that exploits massive parallelism by
executing many concurrent threads slowly, instead of executing a single long thread in a
conventional microprocessor very quickly.

 GPUs are designed to handle large numbers of floating-point operations in parallel. In a


way, the GPU offloads the CPU from all data-intensive calculations, not just those that are
related to video processing.
 Conventional GPUs are widely used in mobile phones, game consoles, embedded systems, PCs, and servers. The NVIDIA CUDA Tesla or Fermi is used in GPU clusters or in HPC systems for parallel processing of massive floating-point data.
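The CPU-versus-GPU contrast above is about aggregate throughput, not single-thread speed. The arithmetic below uses the six-core CPU figure from the text but assumes illustrative clock-rate and core-count numbers (4 GHz-class CPU cores, 1 GHz-class GPU cores, 512 GPU cores) that are our own, not vendor specifications.

```python
# Throughput arithmetic for the few-fast-cores (CPU) versus
# many-slow-cores (GPU) design contrast. Clock figures are assumed.
cpu_cores, cpu_ops_per_core = 6, 4e9     # few fast cores
gpu_cores, gpu_ops_per_core = 512, 1e9   # many slower cores

cpu_throughput = cpu_cores * cpu_ops_per_core   # 2.4e10 ops/s
gpu_throughput = gpu_cores * gpu_ops_per_core   # 5.12e11 ops/s
print(gpu_throughput / cpu_throughput)          # ~21x more raw throughput
```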



Figure. 1.6 Use of a GPU along with a CPU for massively parallel execution in hundreds or
thousands of processing cores.
 In the future, Exa-scale (EFlops or 10^18 Flops) systems could be built with a large number
of multi-core CPUs and GPUs. Four challenges are identified for exascale computing: (1)
energy and power, (2) memory and storage, (3) concurrency and locality, and (4) system
resiliency.
1.2.1.4 Wide-Area Networking :
There has been a rapid growth of Ethernet bandwidth, from 10 Mbps in 1979 to 1 Gbps in 1999 and 40 GE in 2007. High-bandwidth networking increases the capability of building massively distributed systems.
1.2.1.5 Memory, SSD, and Disk Arrays:
 DRAM chip capacity increased from 16 KB in 1976 to 64 GB in 2011, about a 4x increase in capacity every 3 years. Memory access time did not improve as much.
 For hard drives, capacity increased from 260 MB in 1981 to 3 TB for the Seagate Barracuda XT in 2011.

 The "memory wall" is the growing disparity of speed between CPU and memory outside the
CPU chip. An important reason for this disparity is the limited communication bandwidth
beyond chip boundaries. From 1986 to 2000, CPU speed improved at an annual rate of 55%
while memory speed only improved at 10%.
 Faster processor speed and larger memory capacity result in a wider performance gap
between processors and memory. The memory wall may become an even worse problem
limiting CPU performance.
 The rapid growth of flash memory and solid-state drives (SSDs) also impacts the future of HPC and HTC systems. Power increases linearly with the clock frequency and quadratically with the voltage applied to chips. We cannot increase the clock rate indefinitely, so lowering the supply voltage is very much in demand.
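The memory-wall figures above (55% versus 10% annual improvement from 1986 to 2000) imply a rapidly widening processor-memory gap, as a quick calculation shows:

```python
# The processor-memory gap implied by the growth rates quoted above:
# CPU speed improving 55% per year, memory only 10% per year.
years = 2000 - 1986            # 14 years
cpu_growth = 1.55 ** years     # cumulative CPU speedup
mem_growth = 1.10 ** years     # cumulative memory speedup
gap = cpu_growth / mem_growth  # roughly a 120-fold relative gap
print(round(gap))
```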



1.2.1.6 System-Area Interconnects
A LAN is typically used to connect clients to big servers.
• SAN (storage area network) - connects servers with disk arrays
• LAN (local area network) – connects clients, hosts, and servers
• NAS (network attached storage) – connects clients with large storage systems

1.2.1.7 Virtual Machines and Virtualization Middleware


To build clouds, we need to aggregate large amounts of computing, storage, and networking resources in a virtualized manner. Specifically, clouds rely on the dynamic virtualization of CPU, memory, and I/O.
Virtual machines (VMs) offer novel solutions to underutilized resources, application inflexibility, software manageability, and security concerns in existing physical machines.
Virtual Machines (VMs)
 Eliminate real machine constraint
o Increases portability and flexibility
 Virtual machine adds software to a physical machine to give it the appearance of a
different platform or multiple platforms.
 Benefits
o Cross platform compatibility
o Increase Security
o Enhance Performance
o Simplify software migration
Initial Hardware Model
 All applications access hardware resources (i.e. memory, i/o) through system calls to
operating system (privileged instructions)
 Advantages
 Design is decoupled (i.e. OS people can develop OS separate of Hardware
people developing hardware)
 Hardware and software can be upgraded without notifying the Application
programs



 Disadvantages
 An application compiled for one ISA will not run on another ISA.
 ISAs must support old software.
 Since software is developed separately from hardware, the software is not necessarily optimized for the hardware.
Virtual Machine Basics
 Virtual software placed between underlying machine and conventional software
 Conventional software sees different ISA from the one supported by the
hardware
 Virtualization process involves:
 Mapping of virtual resources (registers and memory) to real hardware resources
 Using real machine instructions to carry out the actions specified by the virtual
machine instructions

Three VM Architectures
The concept of virtual machines is illustrated in Fig.1.7
 The host machine is equipped with the physical hardware shown at the bottom; for example, a desktop with x86 architecture running its installed Windows OS, as shown in Fig. 1.7(a).
 The VM can be provisioned to any hardware system. The VM is built with virtual resources managed by a guest OS to run a specific application. Between the VMs and the host platform, we need to deploy a middleware layer called a virtual machine monitor (VMM).
 Figure 1.7(b) shows a native VM installed using a VMM called a hypervisor running in privileged mode. For example, the hardware has an x86 architecture running the Windows system, the guest OS could be a Linux system, and the hypervisor is the XEN system developed at Cambridge University. This hypervisor approach is also called a bare-metal VM, because the hypervisor handles the bare hardware (CPU, memory, and I/O) directly.
• Another architecture is the host VM shown in Fig. 1.7(c). Here the VMM runs in non-
  privileged mode, and the host OS need not be modified.
• The VM can also be implemented in a dual mode, as shown in Fig. 1.7(d). Part of the
  VMM runs at the user level and another portion runs at the supervisor level; in this case,
  the host OS may have to be modified to some extent. Multiple VMs can be ported to one
  given hardware system to support the virtualization process.

Fig 1.7 Three ways of constructing a virtual machine (VM) embedded in a physical machine.
Virtualization Operations:

The VMM provides the VM abstraction to the guest OS. With full virtualization, the VMM exports
a VM abstraction identical to the physical machine, so that a standard OS such as Windows
2000 or Linux can run just as it would on the physical hardware.

Low-level VMM operations are illustrated in Fig. 1.8. First, VMs can be multiplexed between
hardware machines, as shown in Fig. 1.8(a). Second, a VM can be suspended and stored in
stable storage, as shown in Fig. 1.8(b). Third, a suspended VM can be resumed or provisioned
on a new hardware platform, as shown in Fig. 1.8(c). Finally, a VM can be migrated from one
hardware platform to another, as shown in Fig. 1.8(d).
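The four operations above can be modeled as a small state machine. The sketch below is purely illustrative (the class, hosts, and state names are invented for this example); it is a toy model, not a real hypervisor API:

```python
class VirtualMachine:
    """Toy model of the low-level VMM operations (illustrative only)."""

    def __init__(self, name, host):
        self.name = name
        self.host = host        # physical platform currently holding the VM
        self.state = "running"

    def suspend(self):
        # Operation (b): freeze the VM and save its state to stable storage.
        self.state = "suspended"

    def resume(self, host=None):
        # Operation (c): resume a suspended VM, optionally provisioning it
        # on a new hardware platform.
        if host is not None:
            self.host = host
        self.state = "running"

    def migrate(self, new_host):
        # Operation (d): move the VM from one hardware platform to another.
        self.host = new_host


vm = VirtualMachine("vm1", host="server-A")
vm.suspend()
vm.resume(host="server-B")   # resumed on a different platform
vm.migrate("server-C")
print(vm.host, vm.state)     # server-C running
```

Multiplexing, operation (a), corresponds simply to running several such VM objects on the same host.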
Figure 1.8 Low-level VMM operations

These VM operations enable a virtual machine to be provisioned on any available hardware
platform. They also make it flexible to port distributed application executions.

The VM approach significantly enhances the utilization of server resources. Multiple server
functions can be consolidated on the same hardware platform to achieve higher system
efficiency. According to a claim by VMware, server utilization could be increased from the
current 5-15% to 60-80%.
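The utilization gain from consolidation can be sanity-checked with simple arithmetic. The sketch below picks representative values from the ranges quoted above (10% average utilization before, 70% target after); the server count is invented for illustration:

```python
import math

servers = 12            # physical servers before consolidation (hypothetical)
util_before = 0.10      # ~10% average utilization (within the 5-15% range)
util_target = 0.70      # ~70% target utilization (within the 60-80% range)

# Total useful work, measured in "fully busy server" units:
total_work = servers * util_before                 # 1.2 server-equivalents

# Hosts needed after consolidating that work onto well-utilized machines:
hosts_after = math.ceil(total_work / util_target)
print(hosts_after)      # 2 hosts replace 12
```

The same workload that kept 12 lightly loaded servers running fits on 2 well-utilized hosts, which is where the efficiency claim comes from.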
Virtual Infrastructures:

Physical resources for compute, storage, and networking at the bottom are mapped to the
needy applications embedded in various VMs at the top. Hardware and software are then
separated. A virtual infrastructure is what connects resources to distributed applications; it is a
dynamic mapping of system resources to specific applications. The result is decreased
costs and increased efficiency and responsiveness.
Data Center Virtualization for Cloud Computing

Storage and energy efficiency are more important than sheer speed performance. Data center
design emphasizes the performance/price ratio over speed performance alone.

Data Center Growth and Cost Breakdown

A large data center may be built with thousands of servers. Smaller data centers are typically
built with hundreds of servers. The cost to build and maintain data center servers has increased
over the years.

About 60 percent of the cost to run a data center is allocated to management and maintenance.
The server purchase cost did not increase much with time. The cost of electricity and cooling did
increase from 5 percent to 14 percent in 15 years.

Low-Cost Design Philosophy

Commodity switches and networks are more desirable in data centers. Similarly, commodity
x86 servers are preferred over expensive mainframes. The software layer handles
network traffic balancing, fault tolerance, and expandability.
Convergence of Technologies

Essentially, cloud computing is enabled by the convergence of technologies in four areas:

(1) hardware virtualization and multi-core chips,
(2) utility and grid computing,
(3) SOA, Web 2.0, and WS mashups, and
(4) autonomic computing and data center automation.

1.3 SYSTEM MODELS FOR DISTRIBUTED AND CLOUD COMPUTING

1.3.1 Clusters of Cooperative Computers

A computing cluster is built from a collection of interconnected stand-alone computers, which
work cooperatively as a single integrated computing resource. Clustered computer systems are
preferred for handling heavy workloads with large datasets.

Cluster Architecture

Figure 1.9 shows the architecture of a typical cluster built around a low-latency and high-
bandwidth interconnection network.

Figure1.9 architecture of a typical cluster

A cluster of servers (S1, S2, …, Sn) is interconnected by a high-bandwidth system-area or local-
area network with shared I/O devices and disk arrays. The cluster acts as a single computing
node attached to the Internet through a gateway. The gateway IP address can be used to
locate the cluster in cyberspace.

Single-System Image:

The system image of a computer is decided by the way the OS manages the shared cluster
resources. Most clusters have loosely coupled node computers, where all resources of a server
node are managed by its own OS. Thus, most clusters have multiple system images coexisting
simultaneously.

An ideal cluster should merge multiple system images into a single-system image (SSI) at
various operational levels. We need an idealized cluster operating system or some middleware
to support SSI at various levels, including the sharing of CPUs, memory, and I/O across all
computer nodes attached to the cluster.
A single system image is the illusion, created by software or hardware, that presents a collection
of resources as one integrated, powerful resource. SSI makes the cluster appear as a single
machine to the user, applications, and network. A cluster with multiple system images is nothing
but a collection of independent computers.

Figure 1.10 shows the hardware and software architecture of a typical cluster system. Each
node computer has its own operating system. On top of all the operating systems, we deploy
two layers of middleware at the user space to support high availability and some SSI
features for shared resources or fast MPI communications.

Figure 1.10 The architecture of a working cluster with full hardware, software, and middleware
support for availability and single system image.

For example, since memory modules are distributed at different server nodes, they are
managed independently over disjoint address spaces. This implies that the cluster has multiple
images at the memory-reference level.

On the other hand, we may want all distributed memories to be shared by all servers by forming
a distributed shared memory (DSM) with a single address space. A DSM cluster thus has a
single-system image (SSI) at the memory-sharing level. A cluster exploits data parallelism at
the job level with high system availability.

Cluster Design Issues:

Unfortunately, a cluster-wide OS for complete resource sharing is not available yet. Middleware
or OS extensions were developed at the user space to achieve SSI at selected functional levels.
Without the middleware, the cluster nodes cannot work together effectively to achieve
cooperative computing.

The software environments and applications must rely on the middleware to achieve high
performance. The cluster benefits come from scalable performance, efficient message passing,
high system availability, seamless fault tolerance, and cluster-wide job management, as
summarized in Table 1.2.
Table 1.2 Critical Cluster Design Issues and Feasible Implementations
1.3.2 Grid Computing Infrastructures

Over the past 30 years, there has been a natural growth path from the Internet to web and grid computing services.

• An Internet service such as the Telnet command enables a connection from one computer
  to a remote computer.
• A web service such as the HTTP protocol enables remote access of remote web pages.
• Grid computing is the collection of computer resources from multiple locations to reach a
  common goal. Grid computing allows close interactions among applications running on
  distant computers simultaneously.
• A computing grid offers an infrastructure that couples computers, software/middleware,
  special instruments, people, and sensors together. The grid is often constructed across
  LAN, WAN, or Internet backbone networks at a regional, national, or global scale.
• Enterprises or organizations present grids as integrated computing resources. The
  computers used in a grid are primarily workstations, servers, clusters, and supercomputers.
  Personal computers, laptops, and PDAs can be used as access devices to a grid system.

Figure 1.11 shows an example computational grid built over multiple resource sites owned by
different organizations.

• The resource sites offer complementary computing resources, including workstations,
  large servers, a mesh of processors, and Linux clusters, to satisfy a chain of
  computational needs.
• The grid is built across various IP broadband networks, including LANs and WANs
  already used by enterprises or organizations over the Internet. The grid is presented to
  users as an integrated resource pool, as shown in the upper half of the figure.
• At the server end, the grid is a network. At the client end, wired or wireless terminal
  devices are present. The grid integrates computing, communication, content, and
  transactions as rented services.
• Large computational grids such as the NSF TeraGrid, EGEE, and ChinaGrid have built
  national infrastructures to run distributed scientific grid applications.

Figure 1.11 Computational grid or data grid providing computing utility, data, and information
services through resource sharing and cooperation among participating organizations

Grid Families:

Grid technology demands new distributed computing models, software/middleware support,
network protocols, and hardware infrastructures.

National grid projects have been followed by industrial grid platform development by IBM,
Microsoft, Sun, HP, Dell, Cisco, EMC, Platform Computing, and others.

New grid service providers (GSPs) and new grid applications have emerged rapidly, similar to
the growth of Internet and web services in the past two decades.

Generally, grid systems are classified into two families: computational or data grids, and P2P
grids.

A data grid is an architecture or set of services that gives individuals or groups of users the
ability to access, modify, and transfer extremely large amounts of geographically
distributed data for research purposes.

A P2P grid consists of peer groups, managed locally, arranged into a global system supported
by servers. Grids control the central servers, while services at the edge are grouped into
"middleware peer groups". P2P technologies are part of the services of the middleware.

Table 1.3 Design Issues of Computational, Data, and P2P Grids
1.3.3 Cloud Computing

Cloud

• A cloud is a pool of virtualized computer resources. A cloud can host a variety of different
  workloads.
• A cloud infrastructure provides a framework to manage scalable, reliable, on-demand
  access to applications.
• A cloud is the "invisible" backend to many applications.
• A cloud is a model of computation and data storage based on "pay as you go" access to
  "unlimited" remote data center capabilities.
• A cloud supports redundant, self-recovering, highly scalable programming models that
  allow workloads to recover from hardware/software failures. Clouds monitor resource use
  in real time to enable rebalancing of allocations when needed.

Internet Clouds:

• Cloud computing applies a virtualized platform with elastic resources on demand by
  provisioning hardware, software, and datasets dynamically.
• Cloud computing leverages its low cost and simplicity to benefit both users and
  providers. Machine virtualization has enabled such cost-effectiveness.
• Cloud computing intends to satisfy many heterogeneous user applications
  simultaneously. The cloud ecosystem must be designed to be secure, trustworthy, and
  dependable.

Cloud Computing
Ian Foster defined cloud computing as follows: "A large-scale distributed computing paradigm
that is driven by economies of scale, in which a pool of abstracted, virtualized, dynamically
scalable, managed computing power, storage, platforms, and services are delivered on demand
to external customers over the Internet".
• Cloud computing is the use of computing resources (hardware and software) that are
delivered as a service over a network (typically the Internet).
• Cloud computing entrusts remote services with a user's data, software and computation.
• The utilization of cloud computing involves three steps:
• Subscribe
• Use
• Pay for what you use, based on QoS
The six common characteristics of Internet clouds are:

(1) The cloud platform offers a scalable computing paradigm built around datacenters.
(2) Cloud resources are dynamically provisioned by datacenters upon user demand.
(3) The cloud system provides computing power, storage space, and flexible platforms for
upgraded web-scale application services.
(4) Cloud computing relies heavily on the virtualization of all sorts of resources.
(5) Cloud computing defines a new paradigm for collective computing, data consumption, and
delivery of information services over the Internet.
(6) Clouds stress the reduction of ownership cost in mega datacenters.

Basic Cloud Models

Traditional distributed systems have encountered several performance bottlenecks: constant
system maintenance, poor utilization, and increasing costs associated with hardware/software
upgrades. Cloud computing as an on-demand computing paradigm resolves or relieves
these problems.

Cloud Service Models

• Infrastructure as a Service (IaaS)
• Platform as a Service (PaaS)
• Software as a Service (SaaS)
Figure 1.12 Cloud Service Models

Infrastructure as a Service (IaaS):

• The most basic cloud service model.
• Cloud providers offer computers, as physical or (more often) virtual machines, and
  other resources.
• Virtual machines are run as guests by a hypervisor, such as Xen or KVM.
• Cloud users deploy their applications by installing operating system images on the
  machines as well as their application software.
• Cloud providers typically bill IaaS services on a utility computing basis; that is, the cost
  reflects the amount of resources allocated and consumed.
• Examples of IaaS include Amazon CloudFormation (and underlying services such as
  Amazon EC2), Rackspace Cloud, Terremark, and Google Compute Engine.
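Utility-style billing means the bill is a direct function of the resources allocated and consumed. A minimal sketch follows; the rates and the `iaas_bill` helper are hypothetical, not any provider's real prices or API:

```python
# Hypothetical per-unit rates, for illustration only.
RATES = {
    "vcpu_hour": 0.04,         # per vCPU-hour
    "gb_ram_hour": 0.005,      # per GB-hour of RAM
    "gb_storage_month": 0.02,  # per GB-month of storage
}

def iaas_bill(vcpu_hours, ram_gb_hours, storage_gb_months):
    """Utility billing: cost reflects resources allocated and consumed."""
    return (vcpu_hours * RATES["vcpu_hour"]
            + ram_gb_hours * RATES["gb_ram_hour"]
            + storage_gb_months * RATES["gb_storage_month"])

# A 2-vCPU, 4 GB RAM virtual machine running for 720 hours,
# plus 50 GB of storage for the month:
print(round(iaas_bill(2 * 720, 4 * 720, 50), 2))   # 73.0
```

Halving the hours halves the compute portion of the bill, which is exactly the pay-per-use property the model promises.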
Platform as a Service (PaaS):

• Cloud providers deliver a computing platform, typically including an operating system,
  programming language execution environment, database, and web server.
• Application developers develop and run their software on the cloud platform without the
  cost and complexity of buying and managing the underlying hardware and software
  layers.
 Examples of PaaS include: Amazon Elastic Beanstalk, Cloud Foundry, Heroku,
Force.com, EngineYard, Mendix, Google App Engine, Microsoft Azure and
OrangeScape.
Software as a Service (SaaS):

• Cloud providers install and operate application software in the cloud, and cloud users
  access the software from cloud clients.
• The SaaS model applies to business processes, industry applications, CRM (customer
  relationship management), ERP (enterprise resource planning), HR (human resources),
  and collaborative applications.
• The pricing model for SaaS applications is typically a monthly or yearly flat fee per user,
  so the price is scalable and adjustable if users are added or removed at any point.
• Examples of SaaS include Google Apps, innkeypos, Quickbooks Online, Limelight
  Video Platform, Salesforce.com, and Microsoft Office 365.
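The flat per-user fee described above makes the price scale linearly with headcount. A minimal sketch (the fee value is invented for illustration):

```python
def monthly_saas_cost(users, fee_per_user=12.50):
    """Flat monthly fee per user: the bill adjusts as users are added or removed."""
    return users * fee_per_user

print(monthly_saas_cost(40))   # 500.0 for 40 users
print(monthly_saas_cost(45))   # 562.5 after 5 users are added
```

Contrast this with the utility-style IaaS model, where the bill tracks resource consumption rather than seat count.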
Cloud Deployment Models
Cloud hosting deployment models represent the category of cloud environment and are mainly
distinguished by ownership, size, and access. The deployment model indicates the purpose and
nature of the cloud.

Internet clouds offer four deployment modes: private, public, managed, and hybrid. The different
service-level agreements and service deployment modalities imply that security is a shared
responsibility of the cloud providers, the cloud resource consumers, and the third-party
cloud-enabled software providers.

Public Cloud: a type of cloud hosting in which the cloud services are delivered over a
network that is open for public use. The service provider renders services and
infrastructure to various clients. Customers have no visibility into, or control over,
the location of the infrastructure.

Private Cloud: also known as an internal cloud; the cloud computing platform is implemented
in a secure, cloud-based environment safeguarded by a firewall under the governance of the
organization's own IT department. Because a private cloud permits only authorized users, it
gives the organization greater and more direct control over its data.

Hybrid Cloud: an integrated type of cloud computing. It is an arrangement of two or more
clouds, i.e., private, public, or community clouds, that are bound together but remain individual
entities.

Managed cloud hosting is a process in which organizations share and access resources,
including databases, hardware and software tools, across a remote network via multiple servers
in another location.

Benefits of Outsourcing to The Cloud:

Outsourcing local workloads and/or resources to the cloud has become an appealing alternative
in terms of operational efficiency and cost effectiveness. This outsourcing practice particularly
gains momentum with the flexibility of cloud services, from no lock-in contracts with the
provider to the use of a pay-as-you-go pricing model.

From the consumer's perspective, this pricing model has relieved many issues in IT practice,
such as the burden of new equipment purchases and the ever-increasing costs of operating
computing facilities (e.g., salaries for technical support personnel and electricity bills).

From the provider's perspective, charges imposed for processing consumers' service
requests, often using otherwise underutilized resources, are an additional source of revenue.

Listed below are eight motivations for adopting the cloud for upgrading Internet applications and
web services in general.

(1) Desired location in areas with protected space and better energy efficiency.

(2) Sharing of peak-load capacity among a large pool of users, improving overall utilization.

(3) Separation of infrastructure maintenance duties from domain-specific application
development.

(4) Significant reduction in cloud computing cost, compared with traditional computing
paradigms.

(5) Cloud computing programming and application development.

(6) Service and data discovery and content/service distribution.

(7) Privacy, security, copyright, and reliability issues.

(8) Service agreements, business models, and pricing policies.

Representative Cloud Providers:

Table 1.4 summarizes the features of three cloud platforms.
1.3.4 Service-Oriented Architectures (SOA)

• SOA is an evolution of distributed computing based on the request/reply design
  paradigm for synchronous and asynchronous applications.
• An application's business logic or individual functions are modularized and presented as
  services for consumer/client applications.
• Key to these services is their loosely coupled nature;
  i.e., the service interface is independent of the implementation.
• Application developers or system integrators can build applications by composing one or
  more services without knowing the services' underlying implementations.
  o For example, a service can be implemented either in .NET or J2EE, and the
    application consuming the service can be on a different platform or language.
1.3.4.1 SOA key characteristics:
• SOA services have self-describing interfaces in platform-independent XML documents.
  o Web Services Description Language (WSDL) is the standard used to describe
    the services.
• SOA services communicate with messages formally defined via XML Schema (also
  called XSD).
  o Communication among consumers and providers or services typically happens in
    heterogeneous environments, with little or no knowledge about the provider.
  o Messages between services can be viewed as key business documents
    processed in an enterprise.
• SOA services are maintained in the enterprise by a registry that acts as a directory
  listing.
  o Applications can look up the services in the registry and invoke them.
  o Universal Description, Definition, and Integration (UDDI) is the standard used for
    the service registry.
• Each SOA service has a quality of service (QoS) associated with it.
  o Some of the key QoS elements are security requirements, such as authentication
    and authorization, reliable messaging, and policies regarding who can invoke
    services.
1.3.4.2 Layered Architecture for Web Services and Grids

These architectures build on the traditional seven Open Systems Interconnection (OSI) layers
that provide the base networking abstractions.

On top of this we have a base software environment, which would be .NET or Apache Axis for
web services, the Java Virtual Machine for Java, and a broker network for CORBA.

On top of this base environment one would build a higher level environment reflecting the
special features of the distributed computing environment. This starts with entity interfaces and
inter-entity communication.
Figure 1.13 Layered Architecture for Web Services and Grid

The entity interfaces correspond to the Web Services Description Language (WSDL), Java
method, and CORBA interface definition language (IDL) specifications in the distributed
systems. These interfaces are linked with customized, high-level communication systems:
SOAP, RMI, and IIOP in the three examples.

These communication systems support features including particular message patterns (such as
Remote Procedure Call or RPC), fault recovery, and specialized routing.

In the case of fault tolerance, the features in the Web Services Reliable Messaging (WSRM)
framework mimic the OSI layer capability (as in TCP fault tolerance) modified to match the
different abstractions (such as messages versus packets, virtualized addressing) at the entity
levels.

Security is a critical capability that either uses or reimplements the capabilities seen in concepts
such as Internet Protocol Security (IPsec) and secure sockets in the OSI layers.

JNDI (Jini and Java Naming and Directory Interface) illustrates different approaches used within
the Java distributed object model. The CORBA Trading Service, UDDI (Universal Description,
Discovery, and Integration), LDAP (Lightweight Directory Access Protocol), and ebXML
(Electronic Business using eXtensible Markup Language) are other examples of discovery and
information services.

Management services include service state and lifetime support; examples include the CORBA
Life Cycle and Persistent states, the different Enterprise JavaBeans models, Jini's lifetime
model, and a suite of web services specifications.
CORBA and Java approaches were used in earlier distributed systems, rather than today's
SOAP, XML, or REST (Representational State Transfer).

1.3.4.3 Web Services and Tools

Loose coupling and support of heterogeneous implementations make services more attractive
than distributed objects. There are two choices of service architecture: web services or REST
systems. Both web services and REST systems have very distinct approaches to building
reliable interoperable systems.

In web services, one aims to fully specify all aspects of the service and its environment. This
specification is carried with communicated messages using Simple Object Access Protocol
(SOAP). The hosting environment then becomes a universal distributed operating system with
fully distributed capability carried by SOAP messages.

In the REST approach, one adopts simplicity as the universal principle and delegates most of
the hard problems to application (implementation-specific) software. In web services terms,
REST keeps minimal information in the header, and the message body (which is opaque to
generic message processing) carries all needed information. REST architectures are clearly
more appropriate for rapidly evolving technology environments.

REST can use XML schemas, but not those that are part of SOAP; "XML over HTTP" is a
popular design choice.
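The "XML over HTTP" style can be illustrated with the Python standard library. The endpoint URL below is hypothetical (no such service is assumed to exist), so the parsing step runs against a canned response body:

```python
import xml.etree.ElementTree as ET
from urllib import request

def fetch_resource(url):
    # In REST, the URL itself identifies the resource and the HTTP header
    # stays minimal; the body carries all the needed information.
    with request.urlopen(url) as resp:
        return ET.fromstring(resp.read())

# Parsing works identically on a canned XML body, e.g. for a
# hypothetical http://example.com/api/orders/42 resource:
body = "<order><id>42</id><status>shipped</status></order>"
order = ET.fromstring(body)
print(order.find("status").text)   # shipped
```

Note the contrast with SOAP: there is no envelope and no service description, just a URL, a plain HTTP GET, and an XML payload.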

Above the communication and management layers, we have the capability to compose new
entities or distributed programs by integrating several entities together.

In CORBA and Java, the distributed entities are linked with remote procedure calls and the
simplest way to build composite applications is to view the entities as objects and use the
traditional ways of linking them together. For Java, this could be as simple as writing a Java
program with method calls replaced by RMI (Remote Method Invocation) while CORBA
supports a similar model with a syntax reflecting the C++ style of its entity (object) interfaces.
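Java's RMI has a close analogue in Python's standard-library XML-RPC, which is used here as an illustrative stand-in for the same remote-procedure-call idea: the client invokes `add` as if it were a local method, but it executes on the server.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(a, b):
    return a + b

# Bind to an OS-chosen free port to keep the example self-contained.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(add, "add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The remote call reads like an ordinary method call:
proxy = ServerProxy(f"http://localhost:{port}")
result = proxy.add(2, 3)   # executed on the server, not locally
print(result)              # 5
server.shutdown()
```

As with RMI and CORBA, the composition model stays the same: the entity is treated as an object, and only the call transport changes.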

1.3.4.4 The Evolution of SOA

As shown in Figure 1.14, service-oriented architecture (SOA) has evolved over the years. SOA
applies to building grids, clouds, grids of clouds, clouds of grids, clouds of clouds (also known
as interclouds), and systems of systems in general.

A large number of sensors provide data-collection services, denoted in the figure as SS (sensor
service). A sensor can be a ZigBee device, a Bluetooth device, a WiFi access point, a personal
computer, a GPS device, or a wireless phone, among other things. Raw data is collected by
sensor services. All the SS devices interact with large or small computers, many forms of grids,
databases, the compute cloud, the storage cloud, the filter cloud, the discovery cloud, and so
on.

Filter services (fs in the figure) are used to eliminate unwanted raw data, in order to respond to
specific requests from the web, the grid, or web services. A collection of filter services forms a
filter cloud.
Figure 1.14 The evolution of SOA: grids of clouds and grids, where "SS" refers to a sensor
service and "fs" to a filter or transforming service.
SOA aims to search for, or sort out, the useful data from the massive amounts of raw data
items. Processing this data will generate useful information, and subsequently, the knowledge
for our daily use.

In fact, wisdom or intelligence is sorted out of large knowledge bases. Finally, we make
intelligent decisions based on both biological and machine wisdom.

Most distributed systems require a web interface or portal. For raw data collected by a large
number of sensors to be transformed into useful information or knowledge, the data stream may
go through a sequence of compute, storage, filter, and discovery clouds.

Finally, the inter-service messages converge at the portal, which is accessed by all users. Two
example portals, OGFCE and HUBzero, are built using both web service (portal) and
Web 2.0 (gadget) technologies. Many distributed programming models are also built on top of
these basic constructs.

1.3.4.5 Grids versus Clouds

• Cloud computing refers to a client-server architecture where the servers (called "the
  cloud") typically reside remotely and are accessed via the Internet, usually through a web
  browser. Grid computing refers to a distributed computing architecture where a set of
  networked computers is utilized for large computational tasks.
• A grid system applies static resources, while a cloud emphasizes elastic resources.
• For some researchers, the differences between grids and clouds lie only in dynamic
  resource allocation based on virtualization and autonomic computing.
• One can build a grid out of multiple clouds. This type of grid can do a better job than a
  pure cloud, because it can explicitly support negotiated resource allocation. Thus one may
  end up building a system of systems: a cloud of clouds, a grid of clouds, a cloud of grids,
  or inter-clouds as a basic SOA architecture.
• Server computers are still needed to distribute the pieces of data and collect the results
  from participating clients on the grid.
• A cloud offers more services than grid computing. In fact, almost all the services on the
  Internet can be obtained from the cloud, e.g., web hosting, multiple operating systems,
  database support, and much more.
• Grids tend to be more loosely coupled, heterogeneous, and geographically dispersed
  than conventional cluster computing systems.

1.4 Grid Computing


Grid computing is a virtualized distributed computing environment. Such an environment
aims at enabling the dynamic ―runtime‖ selection, sharing, and aggregation of (geographically)
distributed autonomous resources based on the availability, capability, performance, and cost of
these computing resources, and, simultaneously, also based on an organization‘s specific
baseline and/or burst processing requirements.
Table 1.5 Definition of a grid

1. A grid integrates and coordinates resources and users that are not subject to
   centralized control: it integrates and coordinates resources and users that exist within
   different control domains, and also addresses the issues of security, policy, payment,
   and membership that arise in these settings.
2. A grid uses standard, open, general-purpose protocols and interfaces: it is built from
   multipurpose protocols and interfaces that address such issues as authentication,
   authorization, resource discovery, and resource access. It is crucial that these
   protocols and interfaces be standard and open.
3. A grid delivers nontrivial qualities of service: it allows its constituent resources to be
   used in a coordinated fashion to deliver various qualities of service.

1.4.1 Introduction to Grid Architecture and Standards


The Grid:

"Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual
organizations"

Main characteristics of Grids


The main characteristics of a grid computing environment can be listed as follows:
1) Large scale: A grid must be able to deal with a number of resources ranging from just a
few to millions.
2) Geographical distribution: Grid resources may be spread geographically.
3) Heterogeneity: A grid hosts both software and hardware resources, ranging
from data, files, software components, or programs to sensors, scientific instruments,
display devices, personal digital organizers, computers, supercomputers, and networks.
4) Resource sharing and coordination: Resources in a grid belong to different
organizations that allow other organizations to access them. The resources must be
coordinated in order to provide aggregated computing capabilities.
5) Multiple administrations: Each organization may establish different security and
administrative policies under which resources can be accessed and used.
6) Accessibility attributes: Transparency, dependability, consistency, and pervasiveness
are attributes typical to grid resource access. A grid should be seen as a single virtual
computing environment and must assure the delivery of services under established
Quality of Service requirements. A grid must grant access to available resources by
adapting to dynamic environments where resource failure is common.

1.4.2 Grid Computing Systems


A grid is an environment that allows service-oriented, flexible, and seamless sharing of
heterogeneous networks of resources for compute-intensive and data-intensive tasks, and it
provides faster throughput and scalability at lower cost. The distinct benefits of using grids
include performance with scalability, resource utilization, management and reliability, and
virtualization.
A grid computing environment provides more computational capability and helps to
increase the efficiency and scalability of the infrastructure. Grid computing provides a single
interface for managing heterogeneous resources.

1.4.3 Grid Architecture


The architecture of a grid system is often described in terms of "layers", each providing a specific function. Higher layers are user-centric, whereas the lower layers are hardware-centric. The purpose of a grid architecture is to offer users an infrastructure to execute complex applications and at the same time hide the complexity of the resources.

The following diagram depicts the generic grid architecture showing the functionality of
each layer:



Figure1.15 Grid Architecture

Resource layer: is made up of actual resources that are part of the grid, such as computers,
storage systems, electronic data catalogues, and even sensors such as telescopes or other
instruments, which can be connected directly to the network.
Middleware layer: provides the tools that enable the various elements, such as servers, storage, and networks, to participate in a unified grid environment.
Application layer: includes different user applications (science, engineering, business, financial), portals, and development toolkits supporting applications.

The application layer is where the users describe the applications to be submitted to the grid. The resource layer is a widely distributed infrastructure, composed of different resources linked via the Internet. The main purpose of the resources is to host data and execute jobs. The middleware is in charge of allocating resources to jobs, and of other management issues.

The application layer sends job descriptions to the middleware, together with the
locations of the required input data. It then waits for a message saying whether the job was
finished or it was canceled. When it receives a job description the middleware tries to find a
resource to execute this job. If a suitable resource is found, it is first claimed and then the job is
sent to it.

The middleware monitors the status of the job and reacts to state changes. The resource
layer sends to the middleware the acknowledgments, and the information on the current state of
resources, new data elements, and finished jobs. When instructed by the application layer, the
middleware removes the data that is no longer needed from the resources.
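The application-middleware-resource flow described above can be reduced to a toy simulation. All class and method names below are invented for illustration; real grid middleware is far more elaborate.

```python
# Toy simulation of the job flow: the application submits a job
# description to the middleware, which claims a suitable resource,
# runs the job, and reports the result back.

class Resource:
    def __init__(self, name, cpus):
        self.name, self.cpus, self.busy = name, cpus, False

    def execute(self, job):
        # The resource layer hosts data and executes jobs.
        return f"{job['id']} done on {self.name}"

class Middleware:
    def __init__(self, resources):
        self.resources = resources

    def submit(self, job):
        # Find a suitable, unclaimed resource; claim it, run the
        # job, release the claim, and return the final status.
        for r in self.resources:
            if not r.busy and r.cpus >= job["cpus"]:
                r.busy = True
                result = r.execute(job)
                r.busy = False
                return result
        return f"{job['id']} rejected: no suitable resource"

mw = Middleware([Resource("nodeA", 4), Resource("nodeB", 16)])
print(mw.submit({"id": "job1", "cpus": 8}))   # job1 done on nodeB
print(mw.submit({"id": "job2", "cpus": 32}))  # job2 rejected: no suitable resource
```

Real middleware additionally monitors running jobs, reacts to state changes, and removes data from the resources when instructed by the application layer.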



1.4.4 Layered Grid Architecture

Coordinating multiple resources: ubiquitous infrastructure services, application-specific distributed services.
Sharing single resources: negotiating access, controlling use.
Talking to things: communication (Internet protocols) and security.
Controlling things locally: access to, and control of, resources.

Figure1.16 Layered Grid Architecture

1.4.5 Grid Standards

The Global Grid Forum is a community-initiated forum of researchers and practitioners working
on grid computing, and a number of working groups are producing technical specs,
documenting user experiences, and writing implementation guidelines.
The need for open standards that define the interactions and foster interoperability between components supplied from different sources has been the motivation for the Open Grid Services Architecture/Open Grid Services Infrastructure (OGSA/OGSI) milestone documentation published by the Forum.
The following describes the Grid standards:
a) OGSA (Open Grid Service Architecture)
OGSA is a service-oriented architecture (SOA). The aim of OGSA is to standardize grid
computing and to define a basic framework of a grid application structure.
OGSA main goals are:
- Resources must be handled in distributed and heterogeneous environments
- Support of QoS-orientated (Quality of Service) Service Level Agreement
- Partly autonomic management
- Definition of open, published interfaces and protocols that provide interoperability of
diverse resources
- Integration of existing and established standards
b) OGSA Services
The OGSA specifies services which occur within a wide variety of grid systems. They
can be divided into 4 broad groups: core services, data services, program execution
services, and resource management services.
i) The core services: This deals with Service Communication, Service
Management, Service Interaction and Security.
ii) Data Services: The wide range of different data types, usability and transparency
involve a large variety of different interfaces:
- Interfaces for caching
- Interfaces for data replication
- Interfaces for data access
- Interfaces for data transformation and filtering
- Interfaces for file and DBMS services
- Interfaces for grid storage services
iii) Program execution :
Main goal of this category is to enable applications to have coordinated access to
underlying VO resources, regardless of their physical location or access mechanisms.



iv) Resource Management: Resources need to be reserved and scheduled,
orchestrated, and controlled. This group also maintains administration and deployment
services for software deployment, change and identity management.

c) OGSI (Open Grid Service Infrastructure)


OGSI specifications define the standard interfaces and behaviors of a grid service,
building on a Web services base. OGSI defines mechanisms for creating, managing,
and exchanging information among grid services. OGSA defines a Grid Application and
what a Grid Service should be able to do. OGSI specifies Grid Services in detail. Grid Services are Web Services with special additions:
- Lifecycle Management
- Service Data:
State information
Service metadata
- Notifications
- Service Groups
- PortType Extensions

d) WSRF (Web Service Resource Framework)


WSRF is a derivative of OGSI. WSRF describes how resources can be handled by Web Services. The framework combines 6 different WS specifications "that define what is termed the WS-Resource approach to modeling and managing state in a Web services context". The 6 specifications are:
- WS-ResourceLifetime: mechanisms for WS-Resource destruction
- WS-ResourceProperties: manipulation and definition of WS-Resource properties
- WS-Notification: event management
- WS-RenewableReference: defines the procedure for renewing references
- WS-ServiceGroup: interface for by-reference collections of WSs
- WS-BaseFaults: standardization of possible failures

e) GridFTP
GridFTP is a secure and reliable data transfer protocol providing high performance and
optimized for wide-area networks that have high bandwidth. As one might guess from its
name, it is based upon the Internet FTP protocol and includes extensions that make it a
desirable tool in a grid environment. GridFTP uses basic Grid security on both control
(command) and data channels. Features include multiple data channels for parallel
transfers, partial file transfers, third-party transfers, and more.
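Two of the features just listed, parallel data channels and partial file transfers, can be illustrated with ordinary threads and byte ranges. This sketch is not the GridFTP protocol itself, only the idea behind striping one transfer across several channels:

```python
import threading

def transfer(source: bytes, channels: int) -> bytes:
    """Copy `source` using several concurrent 'data channels'."""
    dest = bytearray(len(source))
    chunk = -(-len(source) // channels)  # ceiling division

    def worker(offset):
        # Each channel moves only its own byte range -- in effect,
        # a partial file transfer.
        dest[offset:offset + chunk] = source[offset:offset + chunk]

    threads = [threading.Thread(target=worker, args=(i * chunk,))
               for i in range(channels)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(dest)

data = b"x" * 1000
assert transfer(data, channels=4) == data
```

In real GridFTP the channels are separate TCP connections, which lets one transfer fill a high-bandwidth wide-area pipe that a single connection could not.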



Figure1.17 Grid Standards

1.4.6 Elements of Grid

Grid computing combines elements such as distributed computing, high-performance computing, and disposable computing, depending on the application of the technology and the scale of operation.
The key components of grid computing include the following:
1) Resource management - a grid must be aware of what resources are available for different tasks.
2) Security management - the grid must ensure that only authorized users can access and use the available resources.
3) Data management - data must be transported, cleansed, parceled, and processed.
4) Services management - users and applications must be able to query the grid in an effective and efficient manner.

The following major components are necessary to form a grid:

Figure1.18 Components of Grid



The major constituents of a grid computing system can be identified in various categories from different perspectives as follows:
1) Functional view
2) Physical view
3) Service view

1.4.6.1 BASIC CONSTITUENT ELEMENTS: FUNCTIONAL VIEW


Some of the functional constituents of a grid are:
 Grid portal
 Security (grid security infrastructure)
 Broker (along with directory)
 Scheduler
 Data management
 Job and resource management
 Resources

Grid Portal :
A portal/user interface functional block usually exists in the grid environment. The user interaction mechanism (specifically, the interface) can take a number of forms. The interaction mechanism is typically application specific.
The Grid Security Infrastructure:
A user security functional block usually exists in the grid environment and, as noted
above, a key requirement for grid computing is security. In a grid environment, there is a need
for mechanisms to provide authentication, authorization, data confidentiality, data integrity, and
availability, particularly from a user‘s point of view. When a user‘s job executes, typically it
requires confidential message-passing services.
Broker Function:
A broker functional block usually exists in the grid environment. After the user is authenticated
by the user security functional block, the user is allowed to launch an application. At this
juncture, the grid system needs to identify appropriate and available resources that can/should
be used within the grid, based on the application and application-related parameters provided
by the user of the application. This task is carried out by a broker function.
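The matching step performed by the broker can be sketched as a simple filter over a resource directory. The field names and directory contents below are hypothetical, not a real broker API:

```python
# Illustrative broker: match the application-related parameters
# supplied by the (already authenticated) user against a directory
# of available resources.

def broker(requirements, directory):
    """Return the names of resources satisfying every requirement."""
    return [name for name, attrs in directory.items()
            if all(attrs.get(key, 0) >= need
                   for key, need in requirements.items())]

directory = {
    "cluster1": {"cpus": 64, "mem_gb": 256},
    "desktop7": {"cpus": 4,  "mem_gb": 8},
}
print(broker({"cpus": 16, "mem_gb": 32}, directory))  # ['cluster1']
```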

Scheduler Function/Functional Block


A scheduler functional block usually exists in the grid environment. If a set of stand-alone jobs
without any interdependencies needs to execute, then a scheduler is not necessarily required.

In the situation where the user wishes to reserve a specific resource or to ensure that different
jobs within the application run concurrently, then a scheduler is needed to coordinate the
execution of the jobs.

Data Management Function/Functional Block


A data management functional block usually exists in a grid environment. There typically needs
to be a reliable (and secure) method for moving files and data to various nodes within the grid.
This functionality is supported by the data management functional block.

Job Management and Resource Management Function/Functional Block


A job management and resource management functional block usually exists in a grid
environment. This functionality is also known as the grid resource allocation manager (GRAM).
The job management and resource management function provides the services to actually
launch a job on a particular resource, to check the job‘s status, and to retrieve the results when
the job is complete. The management component keeps track of the resources available to the
grid and which users are members of the grid.



1.4.6.2 BASIC CONSTITUENT ELEMENTS—A PHYSICAL VIEW
A grid is a collection of networks, processors, storage, and other resources.
Networks - the network infrastructure that connects the elements of the shared storage
environment. This network may be a network that is primarily used for storage access, or one
that is also shared with other uses. The important requirement is that it must provide an
appropriately rich, high performance, scalable connectivity upon which a shared storage
environment can be based.
Computation - The next most common resource on a grid is obviously computing cycles
provided by the processors on the grid. The processors can vary in speed, architecture,
software platform, and storage apparatus. There are efforts underway to develop very high-speed supercomputers.
Storage – The next most common resource used in a grid is data storage. In a grid
environment, a file or database can span several physical storage devices and processors, bypassing size restrictions often imposed by the file systems bundled with operating systems. Storage capacity available to an application can be increased by making use of the storage on multiple processors with a unifying file system.
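The idea of a file spanning several physical storage devices behind a unifying file system can be sketched as follows; the classes are purely illustrative and ignore replication and failure handling.

```python
# Toy "unifying file system": a logical file is split into blocks that
# are placed on whichever storage node still has capacity, bypassing
# the size limit of any single node.

class StorageNode:
    def __init__(self, capacity):
        self.capacity, self.blocks = capacity, []

class SpanningFile:
    def __init__(self, nodes, block_size=4):
        self.nodes, self.block_size = nodes, block_size
        self.layout = []  # (node, block index) per block, in order

    def write(self, data: bytes):
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            # The first node with free capacity receives the block.
            node = next(n for n in self.nodes
                        if len(n.blocks) < n.capacity)
            node.blocks.append(block)
            self.layout.append((node, len(node.blocks) - 1))

    def read(self) -> bytes:
        return b"".join(node.blocks[i] for node, i in self.layout)

nodes = [StorageNode(capacity=2), StorageNode(capacity=2)]
f = SpanningFile(nodes)
f.write(b"0123456789abcdef")  # 4 blocks spread across 2 nodes
print(f.read())               # the original 16 bytes, reassembled
```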

1.4.6.3 BASIC CONSTITUENT ELEMENTS—SERVICE VIEW


Standards such as the OGSA provide the necessary stable framework. OGSA is a
proposed grid service architecture based on the integration of grid and Web services concepts
and technologies. The OGSI specification is a companion specification to OGSA that defines
the interfaces and protocols to be used between the various services in a grid environment; it is
the mechanism that provides the interoperability between grids designed using OGSA. Key
constructs for the architecture are functional blocks, protocols, grid services, APIs, and software
development kits (SDKs).

1.4.7 Overview of Grid Architecture


Grids have to be designed so as to serve different communities with varying
characteristics and requirements. Grid computing embodies a combination of a decentralized
architecture for resource management, and a layered hierarchical architecture for
implementation of various constituent services.
Grids can be built ranging from just a few processors to large groups of processors
organized as a hierarchy that spans a continent or the globe. The simplest grid consists of just a
few processors, all of which have the same hardware architecture and utilize the same
operating system.
The three primary types of grids are as follows:

Table 1.3 Types of Grid


Computational grid This grid is used to allocate resources
specifically for computing power.

Scavenging grid This grid is used to "locate processor cycles": grid nodes are exploited for available machine cycles and other resources. Nodes typically equate to desktop computers; a large number of processors is generally involved.



Data grid This grid is used for housing and
providing access to data across
multiple organizations. Users are not
focused on where this data is located
as long as they have access to the
data.
Market-oriented grids These deal with price setting and negotiation, grid economy management, and utility-driven scheduling and resource allocation.

1.4.7.1 Model of the Grid Architecture


The "hour-glass" model of the Grid architecture:

Thin center: few standards.

Wide top: many high-level behaviors can be mapped.

Wide bottom: many underlying technologies and systems.

Figure1.19 Hour glass Model Architecture

Role of layers:

1) Fabric: interfaces to local control; a resource needs to be managed locally. The grid fabric layer provides standardized access to local, resource-specific operations, and software is provided to discover resources. This layer provides the resources, which could comprise computers, storage devices, and databases. A resource could also be a logical entity, such as a distributed file system or a computer pool.
2) Connectivity: secure communications. Need secure connectivity to resources.
Assumes trust is based on user, not service providers. Use public key infrastructure
(PKI). Integrate and obey local security policies in global view. This layer consists of the
core communication and authentication protocols required for transactions.
Communication protocols enable the exchange of data between fabric layer resources.
3) Resource: This layer builds on the Connectivity layer communication and authentication
protocols to define Application Program Interfaces (API) and Software Development Kit
(SDK) for secure negotiation, initiation, monitoring, control, accounting and payment of
sharing operations.
4) Collective: coordinated sharing of multiple resources; the sharing of resources needs to be coordinated. This layer differs from the resource layer in that, while the resource layer concentrates on interactions with a single resource, this layer helps coordinate multiple resources. Its tasks can be varied: directory services, co-allocation and scheduling, monitoring, diagnostic services, and software discovery services.

1.4.7.2 Basic Grid Types and its Architecture


Cluster/Local Grid
Local grids rely on LANs and SANs. The availability of powerful workstations, processors, servers, and "blade technology", along with high-speed networks as commodity components, has led to the emergence of local clusters for high-performance computing. The availability of such clusters within many organizations has fostered a growing interest in aggregating distributed resources to solve large-scale problems of multi-institutional interest.



Figure1.20 Local Grid
IntraGrid
Intragrids rely on WANs. Supported by innovations in optics, the theoretical performance of WANs has increased significantly in recent years. "Affordable" bandwidth has also grown in the past 5-10 years. Furthermore, the integration of intelligent services into the network helps simplify data access across the grid, as well as resource sharing and management.

Figure1.21 Intra Grid

InterGrid
Intergrids often rely on the Internet and cross organization boundaries. Generally, an intergrid may be used to collaborate on "large" projects of common scientific interest. The intergrid offers the opportunity for sharing, trading, or brokering resources over widespread pools; computational "processor-cycle" resources may also be obtained, as needed, from a utility for a specified fee.



Figure1.22 Inter Grid


2 marks Questions and answers

IV Year CSE Academic Year: 2016- 2017 (Odd Semester)

UNIT I INTRODUCTION

1. Define HPC Computers.


On the HPC side, supercomputers (massively parallel processors or MPPs) are gradually being replaced by clusters of cooperative computers out of a desire to share computing resources. The cluster is often a collection of homogeneous compute nodes that are physically connected in close range to one another.

2. Define HTC Computers.


On the HTC side, peer-to-peer (P2P) networks are formed for distributed file sharing and
content delivery applications. A P2P system is built over many client machines. Peer machines
are globally distributed in nature. P2P, cloud computing, and web service platforms are more
focused on HTC applications than on HPC applications.

3. Mention the Computing Paradigms Distinctions.


Centralized computing
Parallel computing
Distributed computing
Cloud computing
The Internet of Things
Internet computing
Ubiquitous computing
4. Mention the Degrees of Parallelism
 Bit-level parallelism
 Instruction-level parallelism
 Data-level parallelism
 Task-level parallelism
 Job-level parallelism

5. What is IoT?
The IoT refers to the networked interconnection of everyday objects, tools, devices, or
computers. One can view the IoT as a wireless network of sensors that interconnect all things in
our daily life. The dynamic connections will grow exponentially into a new dynamic network of
networks, called the Internet of Things (IoT).

6. Define Multithreading.
Multithreading is the ability of a program or an operating system process to manage its use by more than one user at a time, and even to manage multiple requests by the same user, without having multiple copies of the program running in the computer.
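The definition above can be illustrated with a minimal Python sketch: a single copy of the program serves several requests at once, each in its own thread, with a lock protecting the shared state.

```python
import threading

results = []
lock = threading.Lock()

def handle(request_id):
    # Each request runs in its own thread of the same process.
    with lock:  # protect the shared list from concurrent appends
        results.append(request_id)

threads = [threading.Thread(target=handle, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 2, 3, 4]
```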
7. What are Cloud Service Models?
 Infrastructure as a Service (IaaS)
 Platform as a Service (PaaS)
 Software as a Service (SaaS)



8. What is Single System Image?
The system image of a computer is decided by the way the OS manages the shared cluster
resources. A single system image is the illusion, created by software or hardware that presents
a collection of resources as an integrated powerful resource. SSI makes the cluster appear like
a single machine to the user, applications, and network. A cluster with multiple system images is
nothing but a collection of independent computers.

9. Define DataGrid.
A data grid is an architecture or set of services that gives individuals or groups of users the
ability to access, modify and transfer extremely large amounts of geographically
distributed data for research purposes.

10. Define P2Pgrid.


A P2P grid consists of peer groups, managed locally, that are arranged into a global system supported by servers. Grids would control the central servers, while services at the edge are grouped into "middleware peer groups". P2P technologies are part of the services of the middleware.

11. Mention the characteristics of Internet Cloud.


The six common characteristics of Internet clouds are
(1) Cloud platform offers a scalable computing paradigm built around the datacenters.
(2) Cloud resources are dynamically provisioned by datacenters upon user demand.
(3) Cloud system provides computing power, storage space, and flexible platforms for
upgraded web-scale application services.
(4) Cloud computing relies heavily on the virtualization of all sorts of resources.
(5) Cloud computing defines a new paradigm for collective computing, data consumption
and delivery of information services over the Internet.
(6) Clouds stress the cost of ownership reduction in mega datacenters.

12. Define Public Cloud.


Public Cloud: is a type of cloud hosting in which the cloud services are delivered over a network that is open for public usage. Here the service provider renders services and infrastructure to various clients. Customers have no visibility into, or control over, the location of the infrastructure.

13. Define Private Cloud.


Private Cloud: is also known as internal cloud; the platform for cloud computing is implemented in a cloud-based secure environment that is safeguarded by a firewall under the governance of the organization's IT department. As a private cloud permits only authorized users, it gives the organization greater and more direct control over its data.

14. Define Hybrid Cloud.


Hybrid Cloud: is an integrated type of cloud computing. It can be an arrangement of two or more clouds, i.e. private, public or community, that are bound together but remain individual entities.

15. What is SOA?


SOA is an evolution of distributed computing based on the request/reply design paradigm for
synchronous and asynchronous applications.
An application's business logic or individual functions are modularized and presented as
services for consumer/client applications



16. Mention SOA key characteristics.
 SOA services have self-describing interfaces in platform-independent XML documents

 SOA services communicate with messages formally defined via XML Schema (also
called XSD).
 SOA services are maintained in the enterprise by a registry that acts as a directory
listing.
 Each SOA service has a quality of service (QoS) associated with it.

17. Define Grid Computing.

Grid Computing is the concept of using distributed computing technologies for sharing computing resources among participants in a virtualized collection of organizations.

18. What is business on demand?

BOD is not just about utility computing; it involves a much broader set of ideas about the transformation of business practices, process transformation, and technology implementations.
The essential characteristics of on-demand business are responsiveness to the dynamics of
business, adapting to variable cost structures, focusing on core business competency and
resiliency for consistent availability.

19. What is the definition of Grid Computing concept by Foster?

A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.

20. What are the grid computing applications?


i) Application partitioning, which involves breaking the problem into discrete pieces.
ii) Discovery and scheduling of tasks and workflows.
iii) Data communication, distributing the problem data where and when it is required.

21. List the main characteristics of Grid.


Large scale
Geographical distribution
Heterogeneity
Resource sharing and coordination
Multiple administrations
Accessibility attributes

22. List the various Grid Standards.

OGSA (Open Grid Service Architecture)


OGSA Services
OGSI (Open Grid Service Infrastructure)
WSRF (Web Service Resource Framework)
GridFTP



23. Mention the key components of Grid Computing.
1) Resource management - a grid must be aware of what resources are available for different tasks.
2) Security management - the grid must ensure that only authorized users can access and use the available resources.
3) Data management - data must be transported, cleansed, parceled, and processed.
4) Services management - users and applications must be able to query the grid in an effective and efficient manner.

24. List the functional constituents of a Grid.

 Grid portal
 Security (grid security infrastructure)
 Broker (along with directory)
 Scheduler
 Data management
 Job and resource management
 Resources

25. What is Grid Portal?

A portal/user interface functional block usually exists in the grid environment. The user interaction mechanism (specifically, the interface) can take a number of forms. The interaction mechanism is typically application specific.

26. What is the use of broker function?

A broker functional block usually exists in the grid environment. After the user is authenticated
by the user security functional block, the user is allowed to launch an application. At this
juncture, the grid system needs to identify appropriate and available resources that can/should
be used within the grid, based on the application and application-related parameters provided
by the user of the application.

27. Define Intergrid.

Intergrids often rely on the Internet. This crosses organization boundaries. Generally, an
intergrid may be used to collaborate on ―large‖ projects of common scientific interest. The
intergrid offers the opportunity for sharing, trading, or brokering resources over widespread
pools; computational ―processor-cycle‖ resources may also be obtained as needed, from a utility
for a specified fee.

16 Marks Questions

1. Discuss the technologies for Computer based system.


2. Explain in detail about the system models for distributed and cloud computing.
3. Discuss the various Cloud models.
4. List the characteristics of Grid and Grid Architecture with neat diagram.
5. Explain the elements of grids and basic constituent elements in functional, physical and
service view.



UNIT II
GRID SERVICES
UNIT II GRID SERVICES 9

Introduction to Open Grid Services Architecture (OGSA) – Motivation – Functionality


Requirements – Practical & Detailed view of OGSA/OGSI – Data intensive grid service models
– OGSA services.

TEXT BOOK:
1. Kai Hwang, Geoffrey C. Fox and Jack J. Dongarra, "Distributed and Cloud Computing: Clusters, Grids, Clouds and the Future of Internet", First Edition, Morgan Kaufmann, an Imprint of Elsevier, 2012.

REFERENCES:
1. Jason Venner, "Pro Hadoop: Build Scalable, Distributed Applications in the Cloud", Apress, 2009.
2. Tom White, "Hadoop: The Definitive Guide", First Edition, O'Reilly, 2009.
3. Bart Jacob (Editor), "Introduction to Grid Computing", IBM Redbooks, Vervante, 2005.
4. Ian Foster, Carl Kesselman, "The Grid: Blueprint for a New Computing Infrastructure", 2nd Edition, Morgan Kaufmann.
5. Frederic Magoules and Jie Pan, "Introduction to Grid Computing", CRC Press, 2009.
6. Daniel Minoli, "A Networking Approach to Grid Computing", John Wiley, 2005.
7. Barry Wilkinson, "Grid Computing: Techniques and Applications", Chapman and Hall/CRC, Taylor and Francis Group, 2010.

STAFF IN-CHARGE HOD




2.1 Introduction to Open Grid Services Architecture (OGSA)


The architecture for grid computing is defined in the Open Grid Services Architecture (OGSA), which describes the overall structure and the services to be provided in grid environments. Figure 2.1 depicts the network's role in supporting a grid. Figure 2.2 is the reference diagram that illustrates the OGSA.

Figure 2.1 Networking role.

Figure 2.2 Basic functional model for grid environment.

OGSI, in effect, is the base infrastructure on which the OGSA is built, as illustrated pictorially in
Figure 2.3.
Figure 2.3 OGSA reliance on OGSI.

The running of an individual service is called a service instance. Services and service instances can be "lightweight" and transient, or they can be long-term tasks that require "heavy-duty" support from the grid. Services and service instances can be dynamic or interactive, or they can be batch processed.
Grid services include:
Discovery
Lifecycle
State management
Service groups
Factory
Notification
Handle map
A "layering" approach is used to the extent possible in the definition of grid architecture because it is advantageous for higher-level functions to use common lower-level functions. Grid functionality can include the following, among others: information queries, network bandwidth allocation, data management/extraction, processor requests, managing data sessions, and workload balancing.
OGSA-related GGF groups include:
The Open Grid Services Architecture Working Group (OGSA-WG)
The Open Grid Services Infrastructure Working Group (OGSI-WG)
The Open Grid Service Architecture Security Working Group (OGSA-SECWG)
Database Access and Integration Services Working Group (DAIS-WG)
OGSA has introduced the concept of a Grid service as a building block of the service-
oriented framework. A Grid service is an enhanced Web service that extends the conventional
Web service functionality into the Grid domain. A Grid service handles issues such as state
management, global service naming, reference resolution, and Grid-aware security, which are
the key requirements for a seamless, universal Grid architecture.
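The contrast with a plain stateless Web service can be sketched as follows. The factory, handle, and method names are illustrative only, not the actual OGSA interfaces:

```python
import itertools

class GridServiceInstance:
    """A 'grid service': stateful, globally named, with a lifecycle."""
    _ids = itertools.count(1)

    def __init__(self):
        self.handle = f"service-{next(self._ids)}"  # global service name
        self.state = {}                             # managed state
        self.alive = True

    def invoke(self, key, value):
        self.state[key] = value    # state survives across invocations
        return self.state

    def destroy(self):             # explicit lifecycle management
        self.alive = False

class Factory:
    """Clients create transient service instances through a factory."""
    def create(self):
        return GridServiceInstance()

svc = Factory().create()
svc.invoke("step", 1)
svc.invoke("result", 42)
print(svc.handle, svc.state)  # state persisted between the two calls
```

A conventional Web service, by contrast, would treat each invocation independently and keep no per-client state between calls.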



2.2 MOTIVATIONS FOR STANDARDIZATION
An effective grid relies on making use of computing power, whether via a LAN, over an extranet, or through the Internet. To use computing power efficiently, one needs to support a gamut of computing platforms; one also needs a flexible mechanism for distributing and allocating the work to individual clients.
GRID STANDARDS
A closed, proprietary environment limits the ease with which one can distribute work,
who can become a service provider, who can be a service requester, and how one can find out
about the available grid resources. The reader can grasp the limitations of a nonstandard
approach by considering any of the better-known grid computing projects on the Internet. For
example, consider distributed.net. This organization is a "loosely knit" group of computer users located all over the world that takes up challenges and runs projects requiring a great deal of computing power. It solves these by utilizing the combined idle processing cycles of its members' computers.
To illustrate the point about standards, to become a service provider for the
distributed.net grid, one must download a specific client that is capable of processing the work
units from the corresponding servers. However, with a distributed.net client installed, one can
only process work units supplied by distributed.net.
For example, if distributed.net wanted to allow its service providers to process SETI@Home
work units, it would be problematic: distributed.net would have to redeploy its service-provider
functionality, redesign many of its discovery and distribution systems so that different work
units could be deployed to service providers, and update its statistical analysis on completed
units to track it all properly.
As this example illustrates, standards are critical to making the computing utility concept
a reality. On the other hand, a corporate user just looking to secure better utilization of its
platforms and internal resources could start with a vendor-based solution and then move up to a
standards-based solution in due course.
Some specific areas where a lack of grid standards limit deployment are
Data management. For a grid to work effectively, there is a need to store information and
distribute it. Without a standardized method for describing the work and how it should be
exchanged, one quickly encounters limits related to the flexibility and interoperability of the grid.



Dispatch management. There are a number of approaches that can be used to handle
brokering of work units and to distribute these work units to client resources. Again, not having a
standard method for this restricts the service providers that can connect to the grid and accept
units of work from the grid; this also restricts the ability of grid services users to submit work.
Information services. Metadata about the grid service helps the system to distribute
information. The metadata is used to identify requesters (grid users), providers, and their
respective requirements and resource availability. Again, without a standard, one can only use
specific software and solutions to support the grid applications.
Scheduling. Work must be scheduled across the service providers to ensure they are kept
busy. To accomplish this, information about remote loads must be collected and administered. A
standardized method of describing the grid service enables grid implementations to specify how
work is to be scheduled.
Security. Without a standard for the security of a grid service and for the secure distribution of
work units, one runs the risk of distributing information to the "wrong" clients. Although
proprietary methods can provide a level of security, they limit accessibility.
Work unit management. Grid services require management of the distribution of work units to
ensure that the work is uniformly distributed over the service providers. Without a standard way
of advertising and managing this process, efficiencies are degraded.
Looking from the perspective of the grid applications developer, a closed environment is
similarly problematic because to make use of the computing resources across a grid, the
developer must utilize a specific tool kit or environment to build, distribute, and process work
units. The closed environment limits the choice of grid-resident platforms on which work units
can be executed and, at the same time, also limits how one uses and distributes work and/or
requests to the grid. It also means that one cannot combine or adopt other grid solutions for use
within an organization's grid without redeploying the grid software.
The generic advantages of the standardized approach are well known, since they apply
across any number of disciplines. In the context of grid computing, they all reduce to one basic
advantage: the extension and expansion of the resources available to the user for grid
computing. From an end user's perspective, standardization translates into the ability to
purchase middleware and grid-enabled applications from a variety of suppliers in an off-the-
shelf, shrink-wrapped fashion. Figure 2.4 depicts an example of the environment that one aims
to achieve.



Figure 2.4 Example of a service-oriented architecture

For example, standard APIs enable application portability; without standard APIs,
application portability is difficult to accomplish. Standards enable cross-site interoperability;
without standard protocols, interoperability is difficult to achieve. Standards also enable the
deployment of a shared infrastructure.
Use of the OGSI standard, therefore, provides the following benefits:
 Increased effective computing capacity. When the resources utilize the same
conventions, interfaces, and mechanisms, one can transparently switch jobs among grid
systems, both from the perspective of the server as well as from the perspective of the
client. This allows grid users to use more capacity and allows clients a more extensive
choice of projects that can be supported on the grid. Hence, with a gamut of platforms
and environments supported, along with the ability to more easily publish the services
available, there will be an increase in the effective computing capacity.
 Interoperability of resources. Grid systems can be more easily and efficiently
developed and deployed when utilizing a variety of languages and a variety of platforms.
For example, it is desirable to mix service-provider components, work-dispatch tracking
systems, and systems management; this makes it easier to dispatch work to service
providers and for service providers to support grid services.
 Speed of application development. Using middleware based on a standard expedites
the development of grid-oriented applications supporting a business environment.
Rather than spending time developing communication and management systems to help
support the grid system, the planner can, instead, spend time optimizing the
business/algorithmic logic related to processing the data.



For useful applications to be developed, a rich set of grid services needs to be
implemented and delivered, both by open source efforts and by middleware software
companies. In a way, OGSI and the extensions it provides for Web services are
necessary but insufficient for the maturation of the service-oriented architecture; the next
required step is that these standards be fully implemented and widely adopted. Figure 2.5
depicts a simple environment to put the network-based services in context.

Figure 2.5 Network-based grid services.

2.3 Functional Requirements for OGSA

The functional requirements include fundamental, security and resource management functions.
The basic functionalities are as follows:
Basic Functionality Requirements: The basic functionalities include Discovery and
brokering, Metering and accounting, Data sharing, Deployment, Monitoring, Policy and Virtual
organizations.
Security Requirements: The security functions include Multiple security infrastructures,
Perimeter security solutions, Authentication, Authorization, and Accounting, Encryption,
Application and Network-Level Firewalls and Certification.
Resource Management Requirements: The resource management functions include
Provisioning, Resource virtualization, Optimization of resource usage, Transport management,
Access, Management and monitoring, Processor scavenging, Scheduling of service tasks, Load
balancing, Advanced reservation, Notification and messaging, Logging, Workflow management
and Pricing.
System Properties Requirements: The system properties functions include Fault
tolerance, Disaster recovery, Self-healing capabilities, Strong monitoring, Legacy application



management, Administration, Agreement-based interaction, and Grouping/aggregation of
services.
2.3.1 Basic Functionality Requirements
The following basic functions are universally fundamental:
 Discovery and brokering. Mechanisms are required for discovering and/or allocating
services, data, and resources with desired properties.
 Metering and accounting. Applications and schemas for metering, auditing, and billing for
IT infrastructure and management use cases. The metering function records the usage and
duration, especially metering the usage of licenses. The auditing function audits usage and
application profiles on machines, and the billing function bills the user based on metering.
 Data sharing. Data sharing and data management are common as well as important grid
applications. Mechanisms are required for accessing and managing data archives, for
caching data and managing its consistency, and for indexing and discovering data and
metadata.
 Deployment. Data is deployed to the hosting environment that will execute the job (or
made available in or via a high-performance infrastructure). Also, applications (executables)
are migrated to the computers that will execute them.
 Virtual organizations (VOs). The need to support collaborative VOs introduces a need for
mechanisms to support VO creation and management, including group membership
services.
 Monitoring. A global, cross-organizational view of resources and assets for project and
fiscal planning, troubleshooting, and other purposes. The users want to monitor their
applications running on the grid.
 Policy. An error and event policy guides self-controlling management, including failover and
provisioning. It is important to be able to represent policy at multiple stages in hierarchical
systems, with the goal of automating the enforcement of policies that might otherwise be
implemented as organizational processes or managed manually. There may be policies at
every level of the infrastructure: from low-level policies that govern how the resources are
monitored and managed, to high-level policies that govern how business processes such as
billing are managed. High-level policies are sometimes decomposable into lower-level
policies.
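As a toy illustration of decomposing a high-level policy into lower-level ones, consider the following sketch. The policy names and field structure here are invented for illustration; OGSA does not prescribe any particular policy representation.

```python
# Hypothetical sketch: a high-level billing policy decomposed into the
# lower-level metering rules that would enforce it. All field names are
# invented; OGSA does not prescribe a policy representation.

high_level = {"billing": {"currency": "USD", "granularity_min": 60}}

def decompose(policies):
    """Derive low-level metering rules from high-level billing policies."""
    rules = []
    for name, cfg in policies.items():
        # billing at a given granularity implies metering at that granularity
        rules.append({"meter": name, "sample_every_min": cfg["granularity_min"]})
    return rules

print(decompose(high_level))  # [{'meter': 'billing', 'sample_every_min': 60}]
```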

2.3.2 Security Requirements


Grids also introduce a rich set of security requirements; some of these requirements are:



 Multiple security infrastructures. Distributed operation implies a need to interoperate
with and manage multiple security infrastructures.
 Perimeter security solutions. Many use cases require applications to be deployed on
the other side of firewalls from the intended user clients. Intergrid collaboration often
requires crossing institutional firewalls. OGSA needs standard, secure mechanisms that
can be deployed to protect institutions while also enabling cross-firewall interaction.
 Authentication, Authorization, and Accounting. Obtaining application programs and
deploying them into a grid system may require authentication/authorization.
 Encryption. The IT infrastructure and management use case requires encryption of the
communications, at least of the payload.
 Application and Network-Level Firewalls. This is a long-standing problem; it is made
particularly difficult by the many different policies one is dealing with and the particularly
harsh restrictions at international sites.
 Certification. A trusted party certifies that a particular service has certain semantic
behavior.
2.3.3 Resource Management Requirements
Resource management is another multilevel requirement, encompassing SLA negotiation,
provisioning, and scheduling for a variety of resource types and activities:
 Provisioning. Computer processors, applications, licenses, storage, networks, and
instruments are all grid resources that require provisioning. OGSA needs a framework
that allows resource provisioning to be done in a uniform, consistent manner.
 Resource virtualization. Dynamic provisioning implies a need for resource virtualization
mechanisms that allow resources to be transitioned flexibly to different tasks as required.
 Optimization of resource usage while meeting cost targets (i.e., dealing with finite
resources). Mechanisms to manage conflicting demands from various organizations,
groups, projects, and users and implement a fair sharing of resources and access to the
grid.
 Transport management. For applications that require some form of real-time
scheduling, it can be important to be able to schedule or provision bandwidth
dynamically for data transfers or in support of the other data sharing applications. In
many commercial applications, reliable transport management is essential to obtain the
end-to-end QoS required by the application.
 Access. Usage models that provide for both batch and interactive access to resources.
 Management and monitoring. Support for the management and monitoring of resource
usage and the detection of SLA or contract violations by all relevant parties. Also,



conflict management is necessary; it resolves conflicts between management disciplines
that may differ in their optimization objectives (availability goals versus performance
goals, for example).
 Processor scavenging is an important tool that allows an enterprise or VO to
aggregate computing power that would otherwise go to waste.
 Scheduling of service tasks. Long recognized as an important capability for any
information processing system, scheduling becomes extremely important and difficult for
distributed grid systems. In general, dynamic scheduling is an essential component.
 Load balancing. In many applications, it is necessary to make sure deadlines are met
or resources are used uniformly. These are both forms of load balancing that must be
made possible by the underlying infrastructure. For example, for the commercial data
center use case, monitoring job performance, adjusting allocated resources to match
the load, and fairly distributing end users' requests to all the resources are necessary.
 Advanced reservation. This functionality may be required in order to execute the
application on reserved resources.
 Notification and messaging. Notification and messaging are critical in most dynamic
scientific problems. Notification and messaging are event driven.
 Logging. It may be desirable to log processes such as obtaining/deploying application
programs because, for example, the information might be used for accounting. This
functionality is represented as "metering and accounting."
 Workflow management. Many applications can be wrapped in scripts or processes that
require licenses and other resources from multiple sources. Applications coordinate
using the file system based on events.
 Pricing. Mechanisms for determining how to render appropriate bills to users of a grid.
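The load-balancing requirement above can be sketched with a minimal greedy dispatcher. The provider names and cost model below are hypothetical, not part of any OGSA interface; the sketch only illustrates the "resources used uniformly" form of load balancing.

```python
# Hypothetical sketch: each work unit goes to the currently least-loaded
# provider. Provider names and costs are invented for illustration.

def dispatch(work_units, providers):
    """Greedy least-loaded assignment of work units to service providers."""
    loads = {p: 0 for p in providers}
    assignment = {}
    for unit, cost in work_units:
        target = min(loads, key=loads.get)  # pick least-loaded (first on ties)
        assignment[unit] = target
        loads[target] += cost
    return assignment, loads

units = [("u1", 4), ("u2", 1), ("u3", 2), ("u4", 3)]
assignment, loads = dispatch(units, ["siteA", "siteB"])
print(loads)  # {'siteA': 4, 'siteB': 6}
```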
2.3.4 System Properties Requirements
A number of grid-related capabilities can be thought of as desirable system properties rather
than functions:
 Fault tolerance. Support is required for failover, load redistribution, and other
techniques used to achieve fault tolerance. Fault tolerance is particularly important for
long running queries that can potentially return large amounts of data, for dynamic
scientific applications, and for commercial data center applications.
 Disaster recovery. Disaster recovery is a critical capability for complex distributed grid
infrastructures. For distributed systems, failure must be considered one of the natural
behaviors and disaster recovery mechanisms must be considered an essential



component of the design. In the case of commercial data center applications, if the data
center becomes unavailable due to a disaster such as an earthquake or fire, the remote
backup data center needs to take over the application systems.
 Self-healing capabilities of resources, services and systems are required. Significant
manual effort should not be required to monitor, diagnose, and repair faults. There is a
need for the ability to integrate intelligent self-aware hardware such as disks, networking
devices, and so on.
 Strong monitoring for defects, intrusions, and other problems. Ability to migrate attacks
away from critical areas.
 Legacy application management. Legacy applications are those that cannot be
changed, but they are too valuable to give up or too complex to rewrite. Grid infrastructure
has to be built around them so that they can continue to be used.
 Administration. Be able to "codify" and "automate" the normal practices used to
administer the environment. The goal is that systems should be able to self-organize and
self-describe to manage low-level configuration details based on higher-level
configurations and management policies specified by administrators.
 Agreement-based interaction. Some initiatives require agreement-based interactions
capable of specifying and enacting agreements between clients and servers and then
composing those agreements into higher-level end-user structures.
 Grouping/aggregation of services. The ability to instantiate (compose) services using
some set of existing services is a key requirement.
There are two main types of composition techniques:
 Selection: choosing to use a particular service among many services with the same
operational interface.
 Aggregation: orchestrating a functional flow (workflow) between services.
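The two composition techniques can be sketched as follows. The service names, costs, and selection criterion are hypothetical; selection picks one service among several sharing an interface, while aggregation chains services into a workflow.

```python
# Illustrative only: the services, costs, and selection policy below are
# invented; OGSA does not define them.

def make_solver(name, cost):
    """Stand-in for a grid service exposing a common 'solve' interface."""
    return {"name": name, "cost": cost, "solve": lambda x: x * 2}

services = [make_solver("siteA", 5), make_solver("siteB", 2)]

# Selection: choose one service among many with the same operational interface.
cheapest = min(services, key=lambda s: s["cost"])

# Aggregation: orchestrate a functional flow (workflow) between services.
def workflow(x, stages):
    for stage in stages:
        x = stage["solve"](x)
    return x

print(cheapest["name"], workflow(3, services))  # siteB 12
```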

2.3.5 Other Functionality Requirements


Although some use cases involve highly constrained environments, it is clear that in
general grid environments tend to be heterogeneous and distributed:
 Platforms. The platforms themselves are heterogeneous, including a variety of
operating systems, hosting environments, and devices.
 Mechanisms. Grid software can need to interoperate with a variety of distinct
implementation mechanisms for core functions such as security.



 Administrative environments. Geographically distributed environments often
feature varied usage, management, and administration policies (including
policies applied by legislation) that need to be honored and managed.
A wide variety of application structures are encountered and must be supported by other
system components, including the following:
 Both single-process and multiprocess applications covering a wide range of
resource requirements.
 Flows, that is, multiple interacting applications that can be treated as a single
transient service instance working on behalf of a client or set of clients.
 Workloads comprising potentially large numbers of applications with a number of
characteristics just listed.

2.4 OGSA: A PRACTICAL VIEW

 OGSA aims at addressing standardization by defining the basic framework of a grid
application structure.
 OGSA standard defines what grid services are, what they should be capable of, and
what technologies they are based on.
 OGSA, however, does not go into specifics of the technicalities of the specification;
instead, the aim is to help classify what is and is not a grid system.
 It is called an architecture because it is mainly about describing and building a well-
defined set of interfaces from which systems can be built, based on open standards
such as WSDL.
2.4.1 Objectives of OGSA:
 Manage resources across distributed heterogeneous platforms.
 Support QoS-oriented Service Level Agreements (SLAs). The topology of grids is
often complex; the interactions between/among grid resources are almost invariably
dynamic. It is critical that the grid provide robust services such as authorization,
access control, and delegation.
 Provide a common base for autonomic management. A grid can contain a plethora of
resources, along with an abundance of combinations of resource configurations,
conceivable resource-to-resource interactions, and a litany of changing state and
failure modes. Intelligent self-regulation and autonomic management of these
resources is highly desirable.
 Define open, published interfaces and protocols for the interoperability of diverse
resources. OGSA is an open standard managed by a standards body.



 Exploit industry standard integration technologies and leverage existing solutions
where appropriate. The foundation of OGSA is rooted in Web services; for example,
SOAP and WSDL are a major part of this specification.
 OGSA's companion OGSI document consists of specifications on how work is
managed and distributed, and on how service providers and grid services are described.
 The Web services component is utilized to facilitate the distribution and the
management of work across the grid.
 WSDL provides a simple method of describing and advertising the Web services that
support the grid's application.
 OGSA is the blueprint, OGSI is a technical specification, and Globus Toolkit is an
implementation of the framework.
 OGSA describes and defines a Web-services-based architecture composed of a set of
interfaces and their corresponding behaviors to facilitate distributed resource sharing
and accessing in heterogeneous dynamic environments.
 A set of services based on open and stable protocols can hide the complexity of service
requests by users or by other elements of a grid.
 Grid services enable virtualization; virtualization, in turn, can transform computing into
a ubiquitous infrastructure that is more like an electric or water utility.
 OGSA relies on the definition of grid services in WSDL, which, as noted, defines, for this
context, the operation names, parameters, and their types for grid service access.
 Based on the OGSI specification, a grid service instance is a Web service that conforms
to a set of conventions expressed by the WSDL as service interfaces, extensions, and
behaviors.
 Specifically, the grid service interface (see Table 2.1) is described by WSDL, which
defines how to use the service.
Table 2.1 Proposed OGSA grid service interfaces



 A new tag, gsdl, has been added to the WSDL document for grid service description.
 The UDDI registry and WSIL document are used to locate grid services.
 The transport protocol SOAP is used to connect data and applications for accessing grid
services.
 All services adhere to specified grid service interfaces and behaviors. Grid service
interfaces correspond to portTypes in WSDL used in current Web services solutions.
 The interfaces of grid services address discovery, dynamic service-instance creation,
lifetime management, notification, and manageability; the conventions of grid services
address naming and upgrading issues.
 The standard interface of a grid service includes multiple bindings and implementations.
A grid service can therefore be deployed on different hosting environments, even on
different operating systems.
 OGSA also provides a grid security mechanism to ensure that all the communications
between services are secure.
 OGSI services use WSDL as a service description mechanism.
There are two fundamental requirements for describing Web services based on the
OGSI:



1. The ability to describe interface inheritance, a basic concept in most distributed
object systems.
2. The ability to describe additional information elements with the interface
definitions.
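The first requirement, interface inheritance, can be illustrated with ordinary class inheritance. This is a Python analogy only; OGSI actually expresses inheritance through GWSDL portType extension, and the operation names below merely echo the GridService and Factory interfaces of Table 2.1.

```python
# Python analogy only: class inheritance standing in for the portType
# inheritance that OGSI adds via GWSDL.

from abc import ABC, abstractmethod

class GridService(ABC):
    """Base interface that every grid service implements."""
    @abstractmethod
    def find_service_data(self, query):
        ...

class Factory(GridService):
    """Extended interface: inherits the base operations and adds one."""
    @abstractmethod
    def create_service(self):
        ...

class SimpleFactory(Factory):
    """A concrete service implementing both inherited interfaces."""
    def find_service_data(self, query):
        return {"interfaces": ["GridService", "Factory"]}

    def create_service(self):
        return "new-service-instance"

f = SimpleFactory()
print(f.find_service_data("interfaces"))  # {'interfaces': ['GridService', 'Factory']}
```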

2.5 OGSA: A MORE DETAILED VIEW


2.5.1 Introduction
The OGSA integrates key grid technologies with Web services mechanisms to create a
distributed system framework based on the OGSI. A grid service instance is a service that
conforms to a set of conventions, expressed as WSDL interfaces, extensions, and behaviors,
for such purposes as lifetime management, discovery of characteristics, and notification. Grid
services provide for the controlled management of the distributed and often long-lived state that
is commonly required in sophisticated distributed applications. OGSI also introduces standard
factory and registration interfaces for creating and discovering grid services.
OGSI defines a component model that extends WSDL and XML schema definition to
incorporate the concepts of
 Stateful Web services
 Extension of Web services interfaces
 Asynchronous notification of state change
 References to instances of services
 Collections of service instances
 Service state data that augment the constraint capabilities of XML schema definition
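Three of these concepts (stateful services, factory creation, and notification of state change) can be sketched as follows. The class and method names are illustrative only, not taken from the OGSI specification.

```python
# Illustrative sketch only: names are invented, not from the OGSI spec.
import itertools

class GridServiceInstance:
    """A stateful service instance that notifies subscribers on state change."""
    _ids = itertools.count(1)

    def __init__(self):
        self.instance_id = next(self._ids)  # identity within the grid
        self.state = "created"
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def set_state(self, new_state):
        self.state = new_state
        for cb in self._subscribers:  # delivery would be asynchronous in practice
            cb(self.instance_id, new_state)

class Factory:
    """Factory interface: creates new service instances on request."""
    def create_service(self):
        return GridServiceInstance()

events = []
svc = Factory().create_service()
svc.subscribe(lambda iid, state: events.append((iid, state)))
svc.set_state("running")
print(events)  # [(1, 'running')]
```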
The OGSI specification defines the minimal, integrated set of extensions and interfaces
necessary to support definition of the services that will compose OGSA. The OGSI V1.0
specification proposes detailed specifications for the conventions that govern how clients create,
discover, and interact with a grid service instance. That is, it specifies
(1) how grid service instances are named and referenced;
(2) the base, common interfaces that all grid services implement; and
(3) the additional interfaces and behaviors associated with factories and service groups.
The specification does not address how grid services are created, managed, and destroyed
within any particular hosting environment. Thus, services that conform to the OGSI specification
are not necessarily portable to various hosting environments, but any client program that follows
the conventions can invoke any grid service instance conforming to the OGSI specification. The
term hosting environment is used in the OGSI specification to denote the server in which one or
more grid service implementations run. Such servers are typically language or platform specific;



examples include native Unix and Windows processes, J2EE application servers, and Microsoft
.NET.
2.5.2 Setting the Context
GGF calls OGSI the "base for OGSA." Specifically, there is a relationship between OGSI and
distributed object systems and also a relationship between OGSI and the existing Web services
framework. One needs to examine both the client-side programming patterns for grid services
and a conceptual hosting environment for grid services. The patterns described in this section
are enabled but not required by OGSI.

2.5.2.1 Relationship to Distributed Object Systems


A given grid service implementation is an addressable and potentially stateful instance
that implements one or more interfaces described by WSDL portTypes. Grid service factories
can be used to create instances implementing a given set of portType(s). Each grid service
instance has a notion of identity with respect to the other instances in the distributed grid. Each
instance can be characterized as state coupled with behavior published through type-specific
operations. The architecture also supports introspection in that a client application can ask a
grid service instance to return information describing itself, such as the collection of portTypes
that it implements.
Grid service instances are made accessible to client applications through the use of a
grid service handle and a grid service reference (GSR). These constructs are basically network-
wide pointers to specific grid service instances hosted in execution environments. A client
application can use a grid service reference to send requests, represented by the operations
defined in the portType(s) of the target service description directly to the specific instance at the
specified network-attached service endpoint identified by the grid service reference.
In many situations, client stubs and helper classes isolate application programmers from
the details of using grid service references. Some client-side infrastructure software assumes
responsibility for directing an operation to a specific instance that the GSR identifies.
The characteristics introduced above are frequently also cited as fundamental
characteristics of distributed object-based systems. There are, however, also various other
aspects of distributed object models that are specifically not required or prescribed by OGSI.
For this reason, OGSI does not adopt the term distributed object model or distributed object
system when describing these concepts, but instead uses the term "open grid services
infrastructure," thus emphasizing the connections that are established with both Web services
and grid technologies.



Among the object-related issues that are not addressed within OGSI are implementation
inheritance, service instance mobility, development approach, and hosting technology. The grid
service specification does not require, nor does it prevent, implementations based upon object
technologies that support inheritance at either the interface or the implementation level. There is
no requirement in the architecture to expose the notion of implementation inheritance either at
the client side or at the service provider side of the usage contract.
In addition, the grid service specification does not prescribe, dictate, or prevent the use
of any particular development approach or hosting technology for grid service instances. Grid
service providers are free to implement the semantic contract of the service description in any
technology and hosting architecture of their choosing. OGSI envisions implementations in J2EE,
.NET, traditional commercial transaction management servers, traditional procedural Unix
servers, and so forth. It also envisions service implementations in a wide variety of both object-
oriented and nonobject-oriented programming languages.

2.5.2.2 Client-Side Programming Patterns.


Another important issue is how OGSI interfaces are likely to be invoked from client applications.
OGSI exploits an important component of the Web services framework: the use of WSDL to
describe multiple protocol bindings, encoding styles, messaging styles (RPC versus document
oriented), and so on, for a given Web service. The Web Services Invocation Framework (WSIF)
and Java API for XML RPC (JAX-RPC) are among the many examples of infrastructure
software that provide this capability.
Figure 2.6 depicts a possible (but not required) client-side architecture for OGSI. In this approach,
a clear separation exists between the client application and the client-side representation of the
Web service (proxy), including components for marshaling the invocation of a Web service over
a chosen binding. In particular, the client application is insulated from the details of the Web
service invocation by a higher-level abstraction: the client-side interface.



Figure 2.6: Possible client-side runtime architecture.
Various tools can take the WSDL description of the Web service and generate interface
definitions in a wide range of programming-language-specific constructs. This interface is a front
end to specific parameter marshaling and message routing that can incorporate various binding
options provided by the WSDL. Further, this approach allows certain efficiencies, for example,
detecting that the client and the Web service exist on the same network host, therefore avoiding
the overhead of preparing for and executing the invocation using network protocols.
Within the client application runtime, a proxy provides a client-side representation of a
remote service instance's interface. Proxy behaviors specific to a particular encoding and
network protocol are encapsulated in a protocol-specific stub. Details related to the binding
specific access to the grid service instance, such as correct formatting and authentication
mechanics, happen here; thus, the application is not required to handle these details itself.
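The proxy/stub separation described above can be sketched as follows. The binding and operation are simulated, and all names are hypothetical; the point is only that the application talks to an ordinary interface while binding details stay inside the stub.

```python
# Hypothetical sketch: a real stub would marshal the call into SOAP and
# send it over the network; here the round trip is simulated.

class HttpSoapStub:
    """Protocol-specific stub for one particular binding (simulated)."""
    def invoke(self, operation, payload):
        # pretend the remote service upper-cases whatever it receives
        return {"operation": operation, "result": payload.upper()}

class ServiceProxy:
    """Client-side representation of the remote service's interface.
    The application calls ordinary methods; binding details stay in the stub."""
    def __init__(self, stub):
        self._stub = stub

    def greet(self, name):
        return self._stub.invoke("greet", name)["result"]

proxy = ServiceProxy(HttpSoapStub())
print(proxy.greet("grid"))  # GRID
```

Swapping in a stub for a different binding would leave the application code unchanged, which is exactly the insulation the text describes.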
It is possible, but not recommended, for developers to build customized code that
directly couples client applications to fixed bindings of a particular grid service instance.
Although certain circumstances may demand the potential efficiencies gained by this style of
customization, it introduces significant inflexibility into a system and therefore
should only be used under extraordinary circumstances.
The developers of the OGSI specification expect the stub and client-side infrastructure
model that we describe to be a common approach to enabling client access to grid services.
This includes both application-specific services and common infrastructure services that are
defined by OGSA. Thus, for most software developers using grid services, the infrastructure and
application-level services appear in the form of a class library or programming language



interface that is natural to the caller. WSDL and the GWSDL extensions provide support for
enabling heterogeneous tools and enabling infrastructure software.
2.5.2.3 Client Use of Grid Service Handles and References.
A client gains access to a grid service instance through grid service handles and grid
service references. A grid service handle (GSH) can be thought of as a permanent network
pointer to a particular grid service instance. The GSH does not provide sufficient information to
allow a client to access the service instance; the client needs to "resolve" a GSH into a grid
service reference (GSR). The GSR contains all the necessary information to access the service
instance. The GSR is not a "permanent" network pointer to the grid service instance because a
GSR may become invalid for various reasons; for example, the grid service instance may be
moved to a different server.
OGSI provides a mechanism, the HandleResolver to support client resolution of a grid
service handle into a grid service reference. Figure 2.7 shows a client application that needs to
resolve a GSH into a GSR.

Figure 2.7: Resolving a GSH


The client resolves a GSH into a GSR by invoking a HandleResolver grid service
instance identified by some out-of-band mechanism. The HandleResolver can use various
means to perform the resolution; some of these are depicted in Figure 2.7. The
HandleResolver may have the GSR stored in a local cache. The HandleResolver may need to
invoke another HandleResolver to resolve the GSH. The HandleResolver may use a handle
resolution protocol, specified by the particular kind (or scheme) of the GSH, to resolve to a GSR.
The HandleResolver protocol is specific to the kind of GSH being resolved. For example, one



kind of handle may suggest the use of HTTP GET to a URL encoded in the GSH in order to
resolve to a GSR.
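The resolution chain just described can be sketched as follows. The cache-then-scheme-rule-then-delegate ordering, the class name, and the GSR string format are illustrative assumptions for this sketch, not behavior mandated by OGSI.

```python
class HandleResolver:
    """Resolves a GSH (stable name) into a GSR (usable address).

    Tries, in order: a local cache, a scheme-specific resolution rule,
    and finally delegation to another HandleResolver.
    """
    def __init__(self, scheme_rules, parent=None):
        self.cache = {}                   # GSH -> GSR, possibly stale
        self.scheme_rules = scheme_rules  # scheme -> resolution function
        self.parent = parent              # another HandleResolver, or None

    def resolve(self, gsh):
        if gsh in self.cache:
            return self.cache[gsh]
        scheme = gsh.split(":", 1)[0]     # e.g. 'http' in the GSH URL
        rule = self.scheme_rules.get(scheme)
        if rule is not None:
            gsr = rule(gsh)               # scheme-specific handle protocol
        elif self.parent is not None:
            gsr = self.parent.resolve(gsh)  # delegate to another resolver
        else:
            raise LookupError("cannot resolve " + gsh)
        self.cache[gsh] = gsr
        return gsr

# Example: one resolver knows the 'http' scheme; a second delegates to it.
root = HandleResolver(
    {"http": lambda gsh: "GSR(endpoint=%s, binding=SOAP/HTTP)" % gsh})
local = HandleResolver({}, parent=root)
print(local.resolve("http://example.org/instances/42"))
```

Note that a GSR obtained this way may later become invalid (e.g., after the instance migrates), in which case the client simply re-resolves the GSH, discarding the cached entry.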

2.5.2.4 Relationship to Hosting Environment


OGSI does not dictate a particular service-provider-side implementation architecture. A
variety of approaches are possible, ranging from implementing the grid service instance directly
as an operating system process to a sophisticated server-side component model such as J2EE.
In the former case, most or even all support for standard grid service behaviors (invocation,
lifetime management, registration, etc.) is encapsulated within the user process; for example,
via linking with a standard library. In the latter case, many of these behaviors are supported by
the hosting environment.
Figure 2.8 illustrates these differences by showing two different approaches to the
implementation of argument demarshaling functions. One can assume that, as is the case for
many grid services, the invocation message is received at a network protocol termination point
(e.g., an HTTP servlet engine) that converts the data in the invocation message into a format
consumable by the hosting environment. The top part of Figure 2.8 illustrates two grid service
instances associated with container-managed components (e.g., EJBs within a J2EE
container). Here, the message is dispatched to these components, with the container frequently
providing facilities for demarshaling and decoding the incoming message from a format (such as
an XML/SOAP message) into an invocation of the component in the native programming language.
In some circumstances (the oval), the entire behavior of a grid service instance is completely
encapsulated within the component.

Figure 2.8 Two approaches to the implementation of argument demarshaling functions in a grid service hosting environment.



In the other case, a component collaborates with other server-side executables, perhaps
through an adapter layer, to complete the implementation of the grid service behavior. The
bottom part of Figure 2.8 depicts another scenario, wherein the entire behavior of the grid service
instance, including the demarshaling/decoding of the network message, has been
encapsulated within a single executable. Although this approach may have some efficiency
advantages, it provides little opportunity for reuse of functionality between grid service
implementations.
A container implementation may provide a range of functionality beyond simple
argument demarshaling. For example, it may provide automatic support for authorization and
authentication, request logging, and lifetime management functions such as intercepting lifetime
management requests and terminating service instances when a service lifetime expires or an
explicit destruction request is received. Thus, one avoids the need to reimplement these
common behaviors in different grid service implementations.

2.5.3 The Grid Service


The purpose of the OGSI document is to specify the interfaces and behaviors that define a
grid service. In brief, a grid service is a WSDL-defined service that conforms to a set of
conventions relating to its interface definitions and behaviors. Thus, every grid service is a Web
service, though the converse of this statement is not true. The OGSI document expands upon
this brief statement by
 Introducing a set of WSDL conventions that one uses in the grid service specification;
these conventions have been incorporated in WSDL 1.2.
 Defining service data that provide a standard way for representing and querying
metadata and state data from a service instance
 Introducing a series of core properties of a grid service, including:
1. Defining grid service description and grid service instance, as organizing
principles for their extension and their use
2. Defining how OGSI models time
3. Defining the grid service handle and grid service reference constructs that are
used to refer to grid service instances
4. Defining a common approach for conveying fault information from operations.
This approach defines a base XML schema definition and associated
semantics for WSDL fault messages to support a common interpretation; the



approach simply defines the base format for fault messages, without modifying
the WSDL fault message model.
5. Defining the life cycle of a grid service instance

2.5.4 WSDL Extensions and Conventions


As should be clear by now, OGSI is based on Web services; in particular, it uses WSDL
as the mechanism to describe the public interfaces of grid services. However, WSDL 1.1 is
deficient in two critical areas: lack of interface (portType) extension and the inability to describe
additional information elements on a portType (lack of open content). These deficiencies have
been addressed by the W3C Web Services Description Working Group.
Because WSDL 1.2 is a "work in progress," OGSI cannot directly incorporate the entire
WSDL 1.2 body of work. Instead, OGSI defines an extension to WSDL 1.1, isolated to the
wsdl:portType element, which provides the minimal required extensions to WSDL 1.1. These
extensions to WSDL 1.1 match equivalent functionality agreed to by the W3C Web Services
Description Working Group. Once WSDL 1.2 [150] is published as a recommendation by the
W3C, the Global Grid Forum is committed to defining a follow-on version of OGSI that exploits
WSDL 1.2, and to defining a translation from this OGSI v1.0 extension to WSDL 1.2.

2.5.5 Service Data


The approach to stateful Web services introduced in OGSI identified the need for a
common mechanism to expose a service instance's state data to service requestors for query,
update, and change notification. Since this concept is applicable to any Web service, including
those used outside the context of grid applications, one can propose a common approach to
exposing Web service state data called "serviceData." The GGF is endeavoring to introduce this
concept to the broader Web services community.
In order to provide a complete description of the interface of a stateful Web service, it is
necessary to describe the elements of its state that are externally observable. By externally
observable, one means that the state of the service instance is exposed to clients making use of
the declared service interface, where those clients are outside of what would be considered the
internal implementation of the service instance itself. The need to declare service data as part of
the service's external interface is roughly equivalent to the idea of declaring attributes as part of
an object-oriented interface described in an object-oriented interface-definition language.
Service data can be exposed for read, update, or subscription purposes. Since WSDL
defines operations and messages for portTypes, the declared state of a service must be
externally accessed only through service operations defined as part of the service interface. To



avoid the need to define serviceData-specific operations for each serviceData element, the grid
service portType provides base operations for manipulating serviceData elements by name.
Consider an example. Interface alpha introduces operations op1, op2, and op3. Also
assume that the alpha interface consists of publicly accessible data elements de1, de2, and
de3. One uses WSDL to describe alpha and its operations. The OGSI serviceData construct
extends WSDL so that the designer can further define the interface to alpha by declaring the
public accessibility of certain parts of its state: de1, de2, and de3. This declaration then
facilitates the execution of operations on the service data of a stateful service instance
implementing the alpha interface.
Put simply, the serviceData declaration is the mechanism used to express the elements
of the publicly available state exposed by the service's interface. ServiceData elements are
accessible through operations of the service interfaces, such as those defined in this
specification. The private internal state of the service instance is not part of the service interface
and is therefore not represented through a serviceData declaration.
2.5.5.1 Motivation and Comparison to JavaBean Properties.
The OGSI specification introduces the serviceData concept to provide a flexible,
properties- style approach to accessing state data of a Web service. The serviceData concept is
similar to the notion of a public instance variable or field in object-oriented programming
languages such as Java, Smalltalk, and C++. ServiceData is similar to JavaBean™ properties.
The JavaBean model defines conventions for method signatures (getXXX/setXXX) to access
properties, and helper classes (BeanInfo) to document properties. The OGSI model uses the
serviceData elements and XML schema types to achieve a similar result.
The OGSI specification has chosen not to require getXXX and setXXX WSDL operations
for each serviceData element, although service implementers may choose to define such safe
get and set operations themselves. Instead, OGSI defines extensible operations for querying
(get), updating (set), and subscribing to notification of changes in serviceData elements. OGSI
requires these operations to support simple expressions that access serviceData elements by
their names, relative to a service instance. This by-name approach gives functionality roughly
equivalent to the getXXX and setXXX approach familiar to JavaBean and Enterprise JavaBean
programmers. However, these OGSI operations may be extended by other service interfaces to
support richer query, update, and subscription semantics, such as complex queries that span
multiple serviceData elements in a service instance.
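The by-name access pattern can be sketched as follows. The method names echo OGSI's findServiceData/setServiceData/subscribe operations, but this is a simplification: the real operations take extensible query and update expressions carried in SOAP messages, not bare Python strings, and the class here is purely illustrative.

```python
class GridServiceState:
    """By-name access to serviceData elements (SDEs), in the spirit of
    OGSI's findServiceData / setServiceData / subscribe operations."""
    def __init__(self, **elements):
        self._sde = dict(elements)   # the instance's serviceData set
        self._subscribers = {}       # SDE name -> list of callbacks

    def find_service_data(self, name):
        # Query an SDE value by name (the simplest query expression).
        return self._sde[name]

    def set_service_data(self, name, value):
        # Update an SDE value by name, then deliver change notifications.
        self._sde[name] = value
        for notify in self._subscribers.get(name, []):
            notify(name, value)

    def subscribe(self, name, callback):
        # Register interest in changes to one SDE.
        self._subscribers.setdefault(name, []).append(callback)

state = GridServiceState(de1="idle", de2=0, de3=None)
events = []
state.subscribe("de1", lambda n, v: events.append((n, v)))
state.set_service_data("de1", "busy")
print(state.find_service_data("de1"), events)  # busy [('de1', 'busy')]
```

The point of the sketch is that one generic triple of operations serves every SDE, whereas the JavaBean style would require a getDe1/setDe1 pair per element.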
The serviceDataName element in a GridService portType definition corresponds to the
BeanInfo class in JavaBeans. However, OGSI has chosen an XML (WSDL) document that



provides information about the serviceData, instead of using a serializable implementation class
as in the BeanInfo model.

2.5.5.2 Extending portType with serviceData.


OGSI defines a new portType child element named serviceData, used to define
serviceData elements, or SDEs, associated with that portType. These serviceData element
definitions are referred to as serviceData declarations, or SDDs. Initial values for those
serviceData elements may be specified using the staticServiceDataValues element within
portType. The values of any serviceData element, whether declared statically in the portType or
assigned during the life of the Web service instance, are called serviceData element values, or
SDE values.
2.5.5.3 serviceDataValues.
Each service instance is associated with a collection of serviceData elements: those
serviceData elements defined within the various portTypes that form the service's interface, and
also, potentially, additional serviceData elements added at runtime. OGSI calls the set of
serviceData elements associated with a service instance its "serviceData set." A serviceData
set may also refer to the set of serviceData elements aggregated from all serviceData elements
declared in a portType interface hierarchy.
Each service instance must have a "logical" XML document, with a root element of
serviceDataValues, that contains the serviceData element values. A service implementation is
free to choose how the SDE values are stored; for example, it may store the SDE values not as
XML but as instance variables that are converted into XML or other encodings as necessary.
The wsdl:binding associated with various operations manipulating serviceData elements
will indicate the encoding of that data between service requestor and service provider. For
example, a binding might indicate that the serviceData element values are encoded as
serialized Java objects.
2.5.5.4 SDE Aggregation within a portType Interface Hierarchy.
WSDL 1.2 has introduced the notion of multiple portType extension, and one can model
that construct within the GWSDL namespace. A portType can extend zero or more other
portTypes. There is no direct relationship between a wsdl:service and the portTypes supported
by the service modeled in the WSDL syntax. Rather, the set of portTypes implemented by the
service is derived through the port element children of the service element and binding elements



referred to from those port elements. This set of portTypes, and all portTypes they extend,
defines the complete interface to the service.
The serviceData set defined by the service's interface is the set union of the serviceData
elements declared in each portType in the complete interface implemented by the service
instance. Because serviceData elements are uniquely identified by QName, the set union
semantic implies that a serviceData element can appear only once in the set. For example, if a
portType named "pt1" and a portType named "pt2" both declare a serviceData named
"tns:sd1," and a portType named "pt3" extends both "pt1" and "pt2," then "pt3" has one (not
two) serviceData element named "tns:sd1."
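The set-union semantic over the pt1/pt2/pt3 example can be sketched directly; the dictionary layout is an illustrative stand-in for a parsed GWSDL portType hierarchy.

```python
# Each portType declares serviceData elements identified by QName strings,
# and may extend other portTypes (WSDL 1.2 multiple portType extension).
port_types = {
    "pt1": {"sde": {"tns:sd1", "tns:sd2"}, "extends": []},
    "pt2": {"sde": {"tns:sd1"},            "extends": []},
    "pt3": {"sde": {"tns:sd3"},            "extends": ["pt1", "pt2"]},
}

def service_data_set(name):
    """Union of the serviceData declared by a portType and by everything
    it extends; duplicates by QName collapse automatically."""
    pt = port_types[name]
    result = set(pt["sde"])
    for parent in pt["extends"]:
        result |= service_data_set(parent)
    return result

# pt3 inherits tns:sd1 from both pt1 and pt2, yet holds it only once.
print(sorted(service_data_set("pt3")))  # ['tns:sd1', 'tns:sd2', 'tns:sd3']
```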

2.5.5.5 Dynamic serviceData Elements.


Although many serviceData elements are most naturally defined in a service's interface
definition, situations can arise in which it is useful to add serviceData elements to, or remove
them from, an instance dynamically. The means by which such updates are achieved are
implementation specific; for example, a service instance may implement operations for adding a
new serviceData element.
The grid service portType illustrates the use of dynamic SDEs. It contains a
serviceData element named "serviceDataName" that lists the serviceData elements currently
defined. This property of a service instance may return a superset of the serviceData elements
declared in the GWSDL defining the service interface; the requestor can use the
subscribe operation to learn when this serviceData set changes, and the findServiceData
operation to determine the current serviceData set value.

2.5.6 Core Grid Service Properties


This subsection discusses a number of properties and concepts common to all grid services.
2.5.6.1 Service Description and Service Instance.
One can distinguish in OGSI between the description of a grid service and an instance of
a grid service:
A grid service description describes how a client interacts with service instances. This
description is independent of any particular instance. Within a WSDL document, the grid service
description is embodied in the most derived portType of the instance (i.e., the portType
referenced by the wsdl:service element's port children, via referenced binding elements,
describing the service), along with its associated portTypes (including serviceData
declarations), bindings, messages, and types definitions.



A grid service description may be simultaneously used by any number of grid service
instances, each of which
 Embodies some state with which the service description describes how to interact
 Has one or more grid service handles
 Has one or more grid service references to it
A service description is used primarily for two purposes. First, as a description of a service
interface, it can be used by tooling to automatically generate client interface proxies, server
skeletons, and so forth. Second, it can be used for discovery, for example, to find a service
instance that implements a particular service description, or to find a factory that can create
instances with a particular service description.
The service description is meant to capture both interface syntax and semantics. Interface
syntax is described by WSDL portTypes. Semantics may be inferred through the name
assigned to the portType. For example, when defining a grid service, one defines zero or more
uniquely named portTypes. Concise semantics can be associated with each of these names in
specification documents, and, perhaps in the future, through Semantic Web or other more
formal descriptions. These names can then be used by clients to discover services with desired
semantics, by searching for service instances and factories with the appropriate names. The
use of namespaces to define these names also provides a vehicle for assuring globally unique
names.
2.5.6.2 Modeling Time in OGSI
The need arises at various points throughout this specification to represent time that is
meaningful to multiple parties in the distributed Grid. For example, information may be tagged
by a producer with timestamps in order to convey that information's useful lifetime to
consumers. Clients need to negotiate service instance lifetimes with services, and multiple
services may need a common understanding of time in order for clients to be able to manage
their simultaneous use and interaction.
The GMT global time standard is assumed for grid services, allowing operations to refer
unambiguously to absolute times. However, assuming the GMT time standard to represent time
does not imply any particular level of clock synchronization between clients and services in the
grid. In fact, no specific accuracy of synchronization is specified or expected by OGSI, as this is
a service-quality issue.
Grid service hosting environments and clients should utilize the Network Time Protocol
(NTP) or equivalent function to synchronize their clocks to the global standard GMT time.
However, clients and services must accept and act appropriately on messages containing time
values that are out of range because of inadequate synchronization, where "appropriately" may



include refusing to use the information associated with those time values. Furthermore, clients
and services requiring global ordering or synchronization at a finer granularity than their clock
accuracies or resolutions allow for must coordinate through the use of additional
synchronization service interfaces, such as through transactions or synthesized global clocks.
In some cases, it is required to represent both zero time and infinite time. Zero time
can be represented by any time in the past. Infinite time, however, requires an extended notion
of time. OGSI therefore introduces a type in its namespace that may be used
in place of xsd:dateTime when a special value of "infinity" is appropriate.
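The idea of extending xsd:dateTime with an "infinity" value can be sketched as follows. The class name and comparison semantics are illustrative assumptions about how such a type behaves; OGSI defines the type only as an XML schema construct.

```python
from datetime import datetime, timezone
from functools import total_ordering

@total_ordering
class ExtendedTime:
    """xsd:dateTime extended with a special 'infinity' value, so that
    'never expires' can be expressed. A sketch of the OGSI idea, not
    the specification's actual schema type."""
    def __init__(self, value):
        # value is either the string "infinity" or a concrete datetime
        assert value == "infinity" or isinstance(value, datetime)
        self.value = value

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        if self.value == "infinity":
            return False          # nothing comes after infinity
        if other.value == "infinity":
            return True           # every concrete time precedes infinity
        return self.value < other.value

never = ExtendedTime("infinity")  # e.g., a lifetime with no expiry
now = ExtendedTime(datetime.now(timezone.utc))
print(now < never, never < now)   # True False
```

With such a type, a termination time of "infinity" simply means the instance's lifetime is unbounded unless explicitly destroyed.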
2.5.6.3 XML Element Lifetime Declaration Properties
Since serviceData elements may represent instantaneous observations of the dynamic
state of a service instance, it is critical that consumers of serviceData be able to understand the
valid lifetimes of these observations. The client may use this time-related information to reason
about the validity and availability of the serviceData element and its value, though the client is
free to ignore the information.
One can define three XML attributes that together describe the lifetimes associated with
an XML element and its subelements. These attributes may be used in any XML element that
allows for extensibility attributes, including the serviceData element.
The three lifetime declaration properties are:
1. ogsi:goodFrom. Declares the time from which the content of the element is said to be valid.
This is typically the time at which the value was created.
2. ogsi:goodUntil. Declares the time until which the content of the element is said to be valid.
This property must be greater than or equal to the goodFrom time.
3. ogsi:availableUntil. Declares the time until which this element itself is expected to be
available, perhaps with updated values. Prior to this time, a client should be able to obtain an
updated copy of this element. After this time, a client may no longer be able to get a copy of this
element. This property must be greater than or equal to the goodFrom time.
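The three lifetime attributes induce a simple classification of an SDE value at any observation time, which can be sketched as follows; the dict standing in for an XML element, the function name, and the returned labels are all illustrative.

```python
from datetime import datetime, timedelta, timezone

def classify(element, at):
    """Interpret the OGSI lifetime attributes for an element observed
    at time 'at'. 'element' is a dict standing in for an XML element
    carrying ogsi:goodFrom / ogsi:goodUntil / ogsi:availableUntil."""
    if at < element["goodFrom"]:
        return "not yet valid"
    if at <= element["goodUntil"]:
        return "valid"
    if at <= element["availableUntil"]:
        return "stale: refetch an updated copy"
    return "gone: element may no longer be obtainable"

t0 = datetime(2003, 6, 1, tzinfo=timezone.utc)   # arbitrary example times
sde = {
    "goodFrom": t0,
    "goodUntil": t0 + timedelta(minutes=5),     # value valid for 5 minutes
    "availableUntil": t0 + timedelta(hours=1),  # element itself lives 1 hour
}
print(classify(sde, t0 + timedelta(minutes=3)))   # valid
print(classify(sde, t0 + timedelta(minutes=30)))  # stale: refetch an updated copy
```

As the text notes, a client is free to ignore this information, but a well-behaved client uses it to decide whether a cached observation is still trustworthy.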

2.6 Data intensive grid service models

Applications in the grid are normally grouped into two categories: computation-intensive
and data-intensive. For data-intensive applications, we may have to deal with massive
amounts of data.

The grid system must be specially designed to discover, transfer, and manipulate these
massive data sets. Transferring massive data sets is a time-consuming task. Efficient data



management demands low-cost storage and high-speed data movement. The following
paragraphs describe several common methods for solving data movement problems.

2.6.1 Data Replication and Unified Namespace

This data access method is also known as caching, which is often applied to enhance
data efficiency in a grid environment. By replicating the same data blocks and scattering them in
multiple regions of a grid, users can access the same data with locality of references. Some
key data will not be lost in case of failures. However, data replication may demand periodic
consistency checks. The increase in storage requirements and network bandwidth may cause
additional problems.

Replication strategies can be classified into two types: static and dynamic. For
the static method, the locations and number of replicas are determined in advance and will not
be modified.

Dynamic strategies can adjust locations and number of data replicas according to
changes in conditions.

The most common replication strategies include preserving locality, minimizing update costs,
and maximizing profits.

2.6.2 Grid Data Access Models

Multiple participants may want to share the same data collection. To retrieve any piece
of data, we need a grid with a unique global namespace. Similarly, we desire to have unique file
names. To achieve these, we have to resolve inconsistencies among multiple data objects
bearing the same name. Access restrictions may be imposed to avoid confusion. Also, data
needs to be protected to avoid leakage and damage. Users who want to access data have to be
authenticated first and then authorized for access. There are four access models for organizing
a data grid, as listed here and shown in Figure2.9.



Figure 2.9 Four architectural models for building a data grid.

Monadic model: This is a centralized data repository model, shown in Figure 2.9(a). All the
data is saved in a central data repository. When users want to access some data, they
have to submit requests directly to the central repository. No data is replicated for
preserving data locality. This model is the simplest to implement for a small grid, but for
a large grid it is not efficient in terms of performance and reliability. Data
replication is permitted in this model only when fault tolerance is demanded.

Hierarchical model: The hierarchical model, shown in Figure 2.9(b), is suitable for building a
large data grid which has only one large data access directory. The data may be
transferred from the source to a second-level center. Then some data in the regional
center is transferred to the third-level center. After being forwarded several times,
specific data objects are accessed directly by users.

Federation model: This data access model shown in Figure 2.9(c) is better suited for designing
a data grid with multiple sources of data supplies. Sometimes this model is also known
as a mesh model. The data sources are distributed to many different locations.
Although the data is shared, the data items are still owned and controlled by their
original owners.

Hybrid model: This data access model is shown in Figure 2.9(d). The model combines the best
features of the hierarchical and mesh models. Traditional data transfer technology,
such as FTP, applies to networks with lower bandwidth. Network links in a data grid
often have fairly high bandwidth, which is exploited by high-speed data transfer tools
such as GridFTP, developed with the Globus library.



2.6.3 Parallel versus Striped Data Transfers

Compared with traditional FTP data transfer, parallel data transfer opens multiple data streams
for passing subdivided segments of a file simultaneously. Although the speed of each stream is
the same as in sequential streaming, the total time to move data in all streams can be
significantly reduced compared to FTP transfer.

In striped data transfer, a data object is partitioned into a number of sections, and each section
is placed on an individual site in a data grid. When a user requests this piece of data, a data
stream is created for each site, and all the sections of the data object are transferred
simultaneously. Striped data transfer can utilize the bandwidths of multiple sites more
efficiently to speed up data transfer.
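Striped transfer can be sketched as follows. The in-memory dictionary of "sites," the stripe order, and the function names are illustrative stand-ins; a real implementation would open one GridFTP (or similar) stream per site instead of calling a local function.

```python
from concurrent.futures import ThreadPoolExecutor

# Striped layout: each 'site' holds one section of the data object.
sites = {
    "site-A": b"The quick ",
    "site-B": b"brown fox ",
    "site-C": b"jumps over the lazy dog",
}
STRIPE_ORDER = ["site-A", "site-B", "site-C"]  # reassembly order

def fetch_section(site):
    """Stand-in for opening a data stream to one site."""
    return sites[site]

# One stream per site, all sections transferring concurrently; this is
# where striping wins: each site's bandwidth is used at the same time.
with ThreadPoolExecutor(max_workers=len(STRIPE_ORDER)) as pool:
    sections = list(pool.map(fetch_section, STRIPE_ORDER))

data = b"".join(sections)   # reassemble the object in stripe order
print(data.decode())        # The quick brown fox jumps over the lazy dog
```

Plain parallel transfer differs only in that all the streams come from the same source site, so it divides one site's bandwidth rather than aggregating several sites' bandwidths.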

2.7 OGSA services

2.7.1 Handle Resolution


OGSI defines a two-level naming scheme for grid service instances based on abstract,
long-lived grid service handles (GSHs) that can be mapped by HandleMapper services to
concrete, but potentially less long lived, grid service references (GSRs). These constructs are
basically network-wide pointers to specific grid service instances hosted in (potentially remote)
execution environments. A client application can use a grid service reference to send requests
directly to the specific instance at the network-attached service endpoint identified by that GSR.
The format of the GSH is a URL, where the scheme directive indicates the naming
scheme used to express the handle value. Based on the GSH naming scheme, the application
should find an associated naming-scheme-specific HandleMapper service that knows how to
resolve that name to the associated GSR. OGSI defines the basic GSH format and a portType
for the HandleMapper service that resolves a GSH to a GSR.
The OGSI Working Group decided to leave the registration of GSHs and associated
GSRs undefined, as a topic for possible future standardization. Another unspecified issue is how
the bootstrap mechanism should work. Currently, this is left to the implementation, which may
decide to use custom configuration data and external naming services; for example, DNS or the
Handle System. Handle resolutions require service invocations and could, therefore, affect
overall performance.

2.7.2 Virtual Organization Creation and Management


VOs are a concept that supplies a "context" for operation of the grid that can be used to
associate users, their requests, and resources. VO contexts permit grid resource providers
to associate appropriate policy and agreements with their resources. Users associated with a
VO can then exploit those resources consistent with those policies and agreements. VO
creation and management functions include mechanisms for associating users/groups with a
VO, manipulation of user roles within the VO, association of services with the VO, and
attachment of agreements and policies to the VO as a whole or to individual services within the
VO.
2.7.3 Service Groups and Discovery Services
GSHs and GSRs together realize a two-level naming scheme, with HandleResolver
services mapping from handles to references; however, GSHs are not intended to contain
semantic information and indeed may be viewed for most purposes as opaque. Thus, other
entities (both humans and applications) need other means for discovering services with
particular properties, whether relating to interface, function, availability, location, policy, or other
criteria.
Two types of such semantic name spaces are common: naming by attribute and naming by
path.
 Attribute naming schemes associate various metadata with services and support
retrieval via queries on attribute values. A registry implementing such a scheme allows
service providers to publish the existence and properties of the services that they
provide, so that service consumers can discover them.
 Path naming or directory schemes represent an alternative approach that organizes
services into a hierarchical name space that can be navigated.
The two approaches can be combined, as in LDAP. Directory path naming can be
accomplished by defining a PathName interface that maps strings to GSHs.
2.7.4 Choreography, Orchestration and Workflow
Over these interfaces, OGSA provides a rich set of behaviors, and associated operations and
attributes, for business process management:
 Definition of a job flow, including associated policies
 Assignment of resources to a grid flow instance
 Scheduling of grid flows (and associated grid services)
 Execution of grid flows (and associated grid services)
 Common context and metadata for grid flows (and associated services)
 Management and monitoring for grid flows (and associated grid services)
 Failure handling for grid flows (and associated grid services); more generally, managing
the potential transiency of grid services



 Business transaction and coordination services
2.7.5 Transactions
Transaction services are important in many grid applications, particularly in industries such as
financial services and in application domains such as supply chain management. Transaction
management in a widely distributed, high-latency, heterogeneous RDBMS environment is more
complicated than in a single data center running a single vendor's software. Traditional
distributed transaction algorithms, such as two-phase distributed commit, may be too expensive
in a wide-area grid, and other techniques, such as optimistic protocols, may be more appropriate.
2.7.6 Metering Service
It is a quasi-universal requirement that resource utilization can be monitored, whether for
purposes of cost allocation, capacity and trend analysis, dynamic provisioning, grid-service
pricing, fraud and intrusion detection, and/or billing. OGSA must address this requirement by
defining standard monitoring, metering, rating, accounting, and billing interfaces.
A grid service may consume multiple resources and a resource may be shared by
multiple service instances. Ultimately, the sharing of underlying resources is managed by
middleware and operating systems. All modern operating systems and many middleware
systems have metering subsystems for measuring resource consumption and for aggregating
the results of those measurements.
A metering interface provides access to a standard description of such aggregated data.
A key parameter is the time window over which measurements are aggregated. Finally, in
addition to metering resource consumption, metering systems must also accommodate the
measurement and aggregation of application-related resources.
2.7.7 Rating Service
A rating interface needs to address two types of behaviors. Once the metered
information is available, it has to be translated into financial terms. That is, for each unit of
usage, a price has to be associated with it. This step is accomplished by the rating interfaces,
which provide operations that take the metered information and a rating package as input and
output the usage in terms of chargeable amounts.
Furthermore, when a business service is developed, a rating service is used to
aggregate the costs of the components used to deliver the service, so that the service owner
can determine the pricing, terms, and conditions under which the service will be offered to
subscribers.
2.7.8 Accounting Service
Once the rated financial information is available, an accounting service can manage
subscription users and accounts information, calculate the relevant monthly charges and
maintain the invoice information. This service can also generate and present invoices to the
user. Account-specific information is also applied at this time.
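The metering, rating, and accounting steps described above form a pipeline. The following sketch makes that flow concrete (the function names and record formats are assumptions for illustration, not OGSA-defined interfaces): metered samples are aggregated over a time window, rated against a price table, and rolled into an invoice.

```python
# Hypothetical metering -> rating -> accounting pipeline.
def meter(records, window_start, window_end):
    """Metering: aggregate raw (time, resource, amount) samples per resource
    over the given time window."""
    usage = {}
    for t, resource, amount in records:
        if window_start <= t < window_end:
            usage[resource] = usage.get(resource, 0) + amount
    return usage

def rate(usage, price_table):
    """Rating: translate metered units into chargeable amounts."""
    return {r: amount * price_table[r] for r, amount in usage.items()}

def invoice(account, charges):
    """Accounting: combine rated charges into a single invoice."""
    return {"account": account,
            "total": sum(charges.values()),
            "lines": charges}
```

Note the key parameter from the metering discussion, the aggregation window, appears explicitly as `window_start`/`window_end`.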
2.7.9 Billing and Payment Service
Billing and payment service refers to the financial service that actually carries out the transfer of
money; for example, a credit card authorization service.
2.7.10 Installation, Deployment, and Provisioning
Computer processors, applications, licenses, storage, networks, and instruments are all
grid resources that require installation, deployment, and provisioning. OGSA affords a
framework that allows resource provisioning to be done in a uniform, consistent manner.
2.7.11 Distributed Logging
Distributed logging can be viewed as a typical messaging application in which message
producers generate log artifacts that may or may not be used at a later time by other
independent message consumers.
Logging services provide the extensions needed to deal with the following issues:
1. Decoupling. The logical separation of logging artifact creation from logging artifact
consumption. The ultimate usage of the data is determined by the message
consumer; the message producer should not be concerned with this.
2. Transformation and common representation. Logging packages commonly
annotate the data that they generate with useful common information such as
category, priority, time stamp, and location.
3. Filtering and aggregation. The amount of logging data generated can be large,
whereas the amount of data actually consumed can be small. Therefore, it can be
desirable to have a mechanism for controlling the amount of data generated and
for filtering out what is actually kept and where.
4. Configurable persistency. Depending on consumer needs, data may have different
durability characteristics. Hence, there is a need for a mechanism to create different
data repositories, each with its own persistency characteristics.
5. Consumption patterns. Consumption patterns differ according to the needs of the
consumer application. For example, a real-time monitoring application needs to be
notified whenever a particular event occurs, whereas a postmortem problem
determination program queries historical data, trying to find known patterns.
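These issues can be made concrete with a minimal sketch of a decoupled logging service (the class and field names below are assumptions for illustration): producers publish annotated artifacts into a repository, and independent consumers filter what they actually keep.

```python
# Sketch of a decoupled logging service: producers append annotated log
# artifacts; independent consumers filter them later.
import time

class LogRepository:
    def __init__(self):
        self.artifacts = []

    def publish(self, category, priority, message, location="unknown"):
        # Transformation/common representation: annotate each artifact
        # with category, priority, time stamp, and location.
        self.artifacts.append({"category": category, "priority": priority,
                               "timestamp": time.time(),
                               "location": location, "message": message})

    def consume(self, min_priority=0, category=None):
        # Filtering: each consumer decides what it actually keeps;
        # the producer is unaware of how the data is used.
        return [a for a in self.artifacts
                if a["priority"] >= min_priority
                and (category is None or a["category"] == category)]
```

A real-time monitor would call `consume` with a narrow filter on every new event, while a postmortem tool would query the whole repository, matching the consumption patterns above.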
2.7.12 Messaging and Queuing

OGSA extends the scope of the base OGSI Notification Interface to allow grid services to
produce a range of event messages, not just notifications that a serviceData element has
changed. Several terms related to this work are:

Figure 2.10 Schematic of a messaging service architecture

Event— Some occurrence within the state of the grid service or its environment that may be of
interest to third parties. This could be a state change or it could be environmental, such as a
timer event.
Message— An artifact of an event, containing information about an event that some entity
wishes to communicate to other entities.
Topic— A "logical" communications channel and matching mechanism to which a requestor
may subscribe to receive asynchronous messages and to which publishers may publish messages.
2.7.13 Event
Events are generally used as asynchronous signaling mechanisms. The most common form is
"publish/subscribe," in which a service "publishes" the events that it exports. There is also a
distinction between the reliability of an event being raised and its delivery to a client. A service
may attempt to deliver every occurrence of an event (reliable posting), but not be able to
guarantee delivery.
An event can be anything that the service decides it will be: a change in a state variable, entry
into a particular code segment, an exception such as a security violation or floating-point
overflow, or the failure of some other expected event to occur.
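The publish/subscribe form and the posting-versus-delivery distinction can be sketched as follows (illustrative Python, not an OGSI interface): the topic attempts delivery to every subscriber but cannot guarantee that each callback succeeds.

```python
# Minimal publish/subscribe sketch: a topic is a logical channel; a service
# publishes events and subscribers receive them via callbacks.
class Topic:
    def __init__(self, name):
        self.name = name
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event):
        # Reliable posting: every occurrence is offered to every subscriber,
        # but delivery to any one client may still fail.
        delivered = 0
        for cb in self.subscribers:
            try:
                cb(event)
                delivered += 1
            except Exception:
                pass        # client unreachable: posting happened, delivery did not
        return delivered
```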

The basic idea is simple: inside the SOAP message invoking a service method is an Event
Interest Set (EIS). The EIS specifies the events in which the caller is interested and a callback
associated with each event.
An event is a representation of an occurrence in a system or application component that may be
of interest to other parties. Standard means of representing, communicating, transforming,
reconciling, and recording events are important for interoperability.
A detailed set of services include:
 Standard interface(s) for communicating events with specified QoS. These may be
based directly on the Messaging interfaces.
 Standard interface(s) for transforming (mediating) events in a manner that is transparent
to the endpoints. This should include aggregation of multiple events into a single event.
 Standard interface(s) for reconciling events from multiple sources.
 Standard interface(s) for recording events. These may be based directly on the Message
logging interface(s).
 Standard interface(s) for batching and queuing events.

2.7.14 Policy and Agreements


These services create a general framework for creation, administration, and
management of policies and agreements for system operation, security, resource allocation,
and so on, as well as an infrastructure for "policy-aware" services to use the set of defined and
managed policies to govern their operation. These services do not actually enforce policies but
permit policies to be managed and delivered to resource managers that can interpret and
operate on them.
One can expect that many grid services will use policies to direct their actions. Thus,
grids need to support the definition, discovery, communication, and enforcement of policies for
such purposes as resource allocation, workload management, security, automation, and
qualities of services. Some policies need to be expressed at the operational level, that is, at the
level of the devices and resources to be managed, whereas higher-level policies express
business goals and SLAs within and across administrative domains. Higher-level policies are
hard to enforce without a canonical representation for their meaning to lower-level resources.

Figure 2.11 A set of potential policy service components.
2.7.15 Base Data Services
OGSA data interfaces are intended to enable a service-oriented treatment of data so
that data can be treated in the same way as other resources within the Web/grid services
architecture.
OGSA data services are intended to allow for the definition, application, and management of
diverse abstractions—what can be called data virtualizations—of underlying data sources.
A data virtualization is represented by, and encapsulated in, a data service, an OGSI
grid service with SDEs that describe key parameters of the virtualization, and with operations
that allow clients to inspect those SDEs, access the data using appropriate operations, derive
new data virtualizations from old, and/or manage the data virtualization.
Four base data interfaces can be used to implement a variety of different data service
behaviors:
1. Data Description defines OGSI service data elements representing key parameters of the
data virtualization encapsulated by the data service.
2. DataAccess provides operations to access and/or modify the contents of the data
virtualization encapsulated by the data service.
3. DataFactory provides an operation to create a new data service with a data virtualization
derived from the data virtualization of the parent data service.
4. DataManagement provides operations to monitor and manage the data service's data
virtualization, including the data sources that underlie the data service.
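The four base interfaces can be sketched as abstract Python classes (a hypothetical rendering; OGSI actually defines such interfaces as WSDL portTypes), with a toy data service implementing all four over an in-memory dictionary as its data virtualization.

```python
# Sketch of the four base data interfaces as abstract classes.
from abc import ABC, abstractmethod

class DataDescription(ABC):
    @abstractmethod
    def describe(self):
        """Return service data elements describing the virtualization."""

class DataAccess(ABC):
    @abstractmethod
    def read(self, query): ...
    @abstractmethod
    def write(self, query, value): ...

class DataFactory(ABC):
    @abstractmethod
    def derive(self, spec):
        """Create a new data service derived from this one."""

class DataManagement(ABC):
    @abstractmethod
    def status(self):
        """Monitor and manage the underlying data sources."""

class DictDataService(DataDescription, DataAccess, DataFactory, DataManagement):
    """Toy data service whose virtualization is an in-memory dictionary."""
    def __init__(self, data):
        self.data = dict(data)
    def describe(self):
        return {"type": "dict", "size": len(self.data)}
    def read(self, query):
        return self.data[query]
    def write(self, query, value):
        self.data[query] = value
    def derive(self, spec):
        # DataFactory: a child service over a filtered view of the parent.
        return DictDataService({k: v for k, v in self.data.items() if spec(k)})
    def status(self):
        return "available"
```

A concrete data service would mix in only the behaviors it supports; here all four are combined for brevity.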

2.7.16 Other Data Services
A variety of higher-level data interfaces can and must be defined on top of the base data
interfaces, to address functions such as:
 Data access and movement
 Data replication and caching
 Data and schema mediation
 Metadata management and lookup
Data replication, data caching, and schema transformation subservices are described below.
Data Replication. Data replication can be important as a means of meeting
performance objectives by allowing local computer resources to have access to local data.
Services that may consume data replication are group services for clustering and failover, utility
computing for dynamic resource provisioning, policy services ensuring various qualities of
service, metering and monitoring services, and also higher-level workload management and
disaster recovery solutions.
Data Caching. In order to improve performance of access to remote data items, caching
services will be employed. At the minimum, caching services for traditional flat file data will be
employed. Caching of other data types, such as views on RDBMS data, streaming data, and
application binaries, are also envisioned.
Issues that arise include:
Consistency—Is the data in the cache the same as in the source? If not, what is the coherence
window? Different applications have very different requirements.
Cache invalidation protocols—How and when is cached data invalidated? Write through or
write back? When are writes to the cache committed back to the original data source?
Security—How will access control to cached items be handled? Will access control
enforcement be delegated to the cache, or will access control be somehow enforced by the
original data source?
Integrity of cached data—Is the cached data kept in memory or on disk? How is it protected
from unauthorized access? Is it encrypted?
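A write-through policy, one of the options raised above, can be sketched as follows (illustrative Python; the original data source is modeled as a plain dictionary): writes update both copies so the source never lags the cache, while reads fill the cache on a miss and an `invalidate` hook models the cache invalidation protocol.

```python
# Illustrative write-through cache over a data source.
class WriteThroughCache:
    def __init__(self, source):
        self.source = source       # the original data store (a dict here)
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = self.source[key]   # fill on miss
        return self.cache[key]

    def write(self, key, value):
        # Write through: both copies updated, so the coherence window
        # for locally written data is zero.
        self.cache[key] = value
        self.source[key] = value

    def invalidate(self, key):
        # Invalidation protocol hook: called when the source changes
        # externally, forcing a re-fetch on the next read.
        self.cache.pop(key, None)
```

A write-back variant would defer `self.source[key] = value` until eviction, trading consistency for write performance.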
Schema Transformation. Schema transformation interfaces support the transformation of data
from one schema to another. For example, XML transformations as specified in XSLT.

2.7.17 Discovery Services


Discovery interfaces address the need to be able to organize and search for information about
various sorts of entities in various ways. Different interface definitions and different
implementation behaviors may vary according to how user requests are expressed, the
information used to answer requests, and the mechanisms used to propagate and access that
information.
2.7.18 Job Agreement Service
The job agreement service is created by the agreement factory service with a set of job terms,
including command line, resource requirements, execution environment, data staging, job
control, scheduler directives, and accounting and notification terms. The job agreement service
provides an interface for placing jobs on a resource manager, and for interacting with the job
once it has been dispatched to the resource manager.
The interfaces provided by the job agreement service are:
1. Manageability interface
– Supported job terms: defines a set of service data used to publish the job terms
supported by this job service, including the job definition, resource requirements,
execution environment, data staging, job control, scheduler directives, and accounting
and notification terms.
– Workload status: the total number of jobs and their statuses, such as the number of
running, pending, and suspended jobs.
2. Job control: control the job after it has been instantiated. This would include the ability
to suspend/resume, checkpoint, and kill the job.
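The factory/agreement pattern and the two interfaces can be sketched in Python (class names and job terms below are illustrative): the factory creates agreements from a set of job terms and publishes workload status, while each agreement exposes the job-control operations.

```python
# Sketch of an agreement factory and job agreement with job control.
class JobAgreement:
    def __init__(self, terms):
        self.terms = terms            # command line, resources, staging, ...
        self.status = "pending"

    def dispatch(self):               # hand the job to a resource manager
        self.status = "running"

    def suspend(self):
        if self.status == "running":
            self.status = "suspended"

    def resume(self):
        if self.status == "suspended":
            self.status = "running"

    def kill(self):
        self.status = "killed"

class AgreementFactory:
    def __init__(self):
        self.jobs = []

    def create_job(self, terms):
        job = JobAgreement(terms)
        self.jobs.append(job)
        return job

    def workload_status(self):
        # Manageability interface: counts of jobs per status.
        counts = {}
        for j in self.jobs:
            counts[j.status] = counts.get(j.status, 0) + 1
        return counts
```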

2.7.19 Reservation Agreement Service


The reservation agreement service is created by the agreement factory service with a set of
terms including time duration, resource requirement specification, and authorized user/project
agreement terms.
The reservation agreement service provides one interface, manageability, which defines a set of
service data that describe the details of a particular reservation, including resource terms, start
time, end time, amount of the resource reserved, and the authorized users.

2.7.20 Data Access Agreement Service


The data access agreement service is created by the agreement factory service with a set of
terms, including source and destination file path, bandwidth requirements, and fault-tolerance
terms. The data access agreement service allows end users or a job agreement service to
stage application or required data.

2.7.21 Queuing Service
The queuing service provides scheduling capability for jobs. Given a set of policies defined at
the VO level, a queuing service will map jobs to resource managers based on the defined
policies.
The manageability interface defines a set of service data for accessing QoS terms supported by
the queuing services. QoS terms for the queuing service can include whether the service
supports on-line or batch capabilities, average turn-around time for jobs, throughput guarantees,
the ability to meet deadlines, and the ability to meet certain economic constraints.
The following terms apply to the queuing service:
Enqueue—add a job to a queue
Dequeue—remove a job from a queue
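The mapping of jobs to resource managers under a VO-level policy can be sketched as follows (illustrative Python; the least-loaded rule stands in for whatever policy the VO actually defines):

```python
# Sketch of a queuing service mapping jobs to resource managers by policy.
class QueuingService:
    def __init__(self, resource_managers):
        self.queues = {rm: [] for rm in resource_managers}

    def enqueue(self, job):
        # Policy: send the job to the least-loaded resource manager.
        target = min(self.queues, key=lambda rm: len(self.queues[rm]))
        self.queues[target].append(job)
        return target

    def dequeue(self, rm):
        # Remove the next job for the given resource manager, if any.
        return self.queues[rm].pop(0) if self.queues[rm] else None
```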

2.7.22 Open Grid Services Infrastructure


The OGSI defines fundamental mechanisms on which OGSA is constructed. These
mechanisms address issues relating to the creation, naming, management, and exchange of
information among entities called grid services.
The following list recaps the key OGSI features and briefly discusses their relevance to OGSA.
1. Grid service descriptions and instances. OGSI introduces the twin concepts of the
grid service description and grid service instance as organizing principles of distributed
systems. A grid service instance is an addressable, potentially stateful, and potentially
transient instantiation of such a description. These concepts provide the basic building
blocks used to build OGSA-based distributed systems. Grid service descriptions define
interfaces and behaviors, and a distributed system comprises a set of grid service
instances that implement those behaviors.
2. Service state, metadata, and introspection. OGSI defines mechanisms for
representing and accessing metadata and state data from a service instance, as well as
providing uniform mechanisms for accessing state.
3. Naming and name resolution. OGSI defines a two-level naming scheme for grid
service instances based on abstract, long-lived grid service handles that can be mapped
by HandleMapper services to concrete but potentially less long-lived grid service
references. These constructs are basically network-wide pointers to specific grid service
instances hosted in execution environments.
4. Fault model. OGSI defines a common approach for conveying fault information from
operations.

5. Life cycle. OGSI defines mechanisms for managing the life cycle of a grid service
instance, including both explicit destruction and soft-state lifetime management functions
for grid service instances, and grid service factories that can be used to create instances
implementing specified interfaces.
6. Service groups. OGSI defines a means of organizing groups of service instances.
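Two of the features above, two-level naming (item 3) and soft-state lifetime management (item 5), can be combined in one illustrative Python sketch (all names are assumptions, not the OGSI wire format): handles stay stable while references change, and instances whose lifetime is not extended are reaped.

```python
# Sketch of a HandleMapper with soft-state lifetime management: long-lived
# handles resolve to transient references, and each binding carries a
# termination time that clients must periodically extend.
class HandleMapper:
    def __init__(self):
        self.refs = {}        # handle -> current service reference
        self.expiry = {}      # handle -> termination time (soft state)

    def register(self, handle, reference, now, lifetime):
        self.refs[handle] = reference        # (re)bind handle -> reference
        self.expiry[handle] = now + lifetime

    def keepalive(self, handle, now, lifetime):
        # Soft-state renewal: extend, never shorten, the lifetime.
        self.expiry[handle] = max(self.expiry[handle], now + lifetime)

    def resolve(self, handle, now):
        if self.expiry.get(handle, 0) <= now:
            raise KeyError(handle)           # expired or destroyed
        return self.refs[handle]

    def reap(self, now):
        # Instances whose lifetime lapsed are destroyed implicitly,
        # handling clients that vanish without an explicit destroy.
        dead = [h for h, t in self.expiry.items() if t <= now]
        for h in dead:
            del self.expiry[h], self.refs[h]
        return dead
```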
2.7.23 Common Management Model

The Common Management Model specification defines the base behavioral model for all
resources and resource managers in the grid management infrastructure. A mechanism is
defined by which resource managers can make use of detailed manageability information for a
resource that may come from existing resource models and instrumentation, such as those
expressed in CIM, JMX, SNMP, and so on, combined with a set of canonical operations
introduced by base CMM interfaces.
The CMM specification defines
 The base manageable resource interface, which a resource or resource manager must
provide to be manageable
 Canonical lifecycle states—the transitions between the states, and the operations
necessary for the transitions that complement OGSI lifetime service data
 The ability to represent relationships among manageable resources, including a
canonical set of relationship types
 Life cycle metadata - common to all types of managed resources for monitoring and
control of service data and operations based on life cycle state
 Canonical services factored out from across multiple resources or domain specific
resource managers, such as an operational port type.
2 marks Questions and answers
UNIT II GRID SERVICES
1. Mention the Grid Services.
Discovery
Lifecycle
State management
Service groups
Factory
Notification
Handle map
2. Mention the OGSA-related GGF groups.
The Open Grid Services Architecture Working Group (OGSA-WG)
The Open Grid Services Infrastructure Working Group (OGSI-WG)
The Open Grid Service Architecture Security Working Group (OGSA-SECWG)
Database Access and Integration Services Working Group (DAIS-WG)

3. Mention the areas that lack grid standards.
Data management
Dispatch management
Information services
Scheduling
Security
Work unit management

4. Mention the benefits of OGSI Standards.


Increased effective computing capacity
Interoperability of resources
Speed of application development

5. Mention the basic functional requirements of OGSA.

The basic functionalities include Discovery and brokering, Metering and accounting, Data
sharing, Deployment, Monitoring, Policy and Virtual organizations.
The security functions include Multiple security infrastructures, Perimeter security solutions,
Authentication, Authorization, and Accounting, Encryption, Application and Network-Level
Firewalls and Certification.

6. List the various security requirements.


Multiple security infrastructures
Perimeter security solutions
Authentication, Authorization, and Accounting
Encryption
Application and Network-Level Firewalls
Certification

7. List the objectives of OGSA.


Manage resources across distributed heterogeneous platforms.
Support QoS-oriented Service Level Agreements (SLAs).
Provide a common base for autonomic management.
Define open, published interfaces and protocols for the interoperability of diverse
resources.

8. Mention the core properties of grid service.


Defining grid service description and grid service instance, as organizing principles for
their extension and their use
Defining how OGSI models time
Defining the grid service handle and grid service reference constructs that are used to
refer to grid service instances

9. Define Monadic Model.


This is a centralized data repository model. All the data is saved in a central data repository.
When users want to access some data they have to submit requests directly to the central
repository. No data is replicated for preserving data locality. This model is the simplest to
implement for a small grid. For a large grid, this model is not efficient in terms of performance
and reliability.

10. Define Federation model.


This data access model is better suited for designing a data grid with multiple sources of data
supplies. Sometimes this model is also known as a mesh model. The data sources are
distributed to many different locations.

11. Define Striped data Transfer.


In striped data transfer, a data object is partitioned into a number of sections, and each section
is placed in an individual site in a data grid. When a user requests this piece of data, a
data stream is created for each site, and all the sections of data objects are transferred
simultaneously. Striped data transfer can utilize the bandwidths of multiple sites more
efficiently to speed up data transfer.
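The idea can be sketched in Python (illustrative only; a real data grid would open network streams to the sites rather than call local functions): the object is partitioned into sections, all sections are fetched simultaneously, and the result is reassembled in order.

```python
# Sketch of striped data transfer: partition, fetch in parallel, reassemble.
from concurrent.futures import ThreadPoolExecutor

def stripe(data, n_sites):
    """Partition a byte string into n roughly equal sections."""
    size = -(-len(data) // n_sites)              # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n_sites)]

def fetch_striped(fetchers):
    """One data stream per site: all sections transferred simultaneously,
    then joined back in order."""
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        parts = list(pool.map(lambda fetch: fetch(), fetchers))
    return b"".join(parts)
```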

12. Define Event.


Events are generally used as asynchronous signaling mechanisms. The most common form is
"publish/subscribe," in which a service "publishes" the events that it exports. There is also a
distinction between the reliability of an event being raised and its delivery to a client.

13. Mention four base data interfaces can be used to implement a variety of different data
service behaviors.
Data Description
DataAccess
DataFactory
DataManagement
14. Discuss the use of Data caching.
Data Caching. In order to improve performance of access to remote data items, caching
services will be employed. At the minimum, caching services for traditional flat file data will be
employed. Caching of other data types, such as views on RDBMS data, streaming data, and
application binaries, are also envisioned.

15. What are the issues that arise in data caching?


Consistency
Cache invalidation protocols
Security
Integrity of cached data
Schema Transformation
16. List the interfaces provided by job agreement service.
Manageability interface
Job control

17. What is Reservation Agreement Service?


The reservation agreement service is created by the agreement factory service with a set of
terms including time duration, resource requirement specification, and authorized user/project
agreement terms. They provide one interface, manageability, which defines a set of service data
that describe the details of a particular reservation, including resource terms, start time, end
time, amount of the resource reserved, and the authorized users.

18. Define Queuing Service
The queuing service provides scheduling capability for jobs. Given a set of policies defined at
the VO level, a queuing service will map jobs to resource managers based on the defined
policies.

19. Mention the CMM Specifications.


The base manageable resource interface, which a resource or resource manager must provide
to be manageable Canonical lifecycle states—the transitions between the states, and the
operations necessary for the transitions that complement OGSI lifetime service data. The ability
to represent relationships among manageable resources, including a canonical set of
relationship types
20. Name some representation use cases from OGSA architecture working group.
Commercial Data Center (Commercial grid)
National Fusion Collaboratory (Science grid)
Online Media and Entertainment (Commercial grid)
21. What are the functional requirements of CDC on OGSA?
 Discovery of the available resources
 Scheduling of resources for specific tasks
 Provisioning of resources based on need.
 Use static and dynamic policies.
22. What are the major goals of OGSA?
 Identify the use cases that can drive the OGSA platform components.
 Identify and define the core OGSA platform components
 Define hosting and platform specific bindings
 Define resource models and resource profiles with interoperable solutions.
23. What are the OGSA basic services?
Common Management Model (CMM)
Service domains
Distributed data access and replication
Policy, security
Provisioning and resource management.

16 marks Questions
1. Explain Open Grid Services Architecture (OGSA) in detail with suitable diagram.
2. Explain about OGSA: A PRACTICAL VIEW and DETAILED VIEW
3. Explain about Data intensive grid service models
4. List the Various OGSA Services in detail.

UNIT III
VIRTUALIZATION

UNIT III VIRTUALIZATION


Cloud deployment models: public, private, hybrid, community ,Categories of cloud computing:
Everything as a service: Infrastructure, platform, software ,Pros and Cons of cloud computing –
Implementation levels of virtualization – virtualization structure ,virtualization of CPU, Memory
and I/O devices ,virtual clusters and Resource Management ,Virtualization for data center
automation.

3.1 Cloud Deployment Models


The concept of cloud computing has evolved from cluster, grid, and utility computing. Cluster
and grid computing leverage the use of many computers in parallel to solve problems of any
size. Utility and Software as a Service (SaaS) provide computing resources as a service with
the notion of pay per use.

 Cloud computing leverages dynamic resources to deliver large numbers of services to


end users.
 Cloud computing is a high-throughput computing (HTC) paradigm whereby the
infrastructure provides the services through a large data center or server farms.
 The cloud computing model enables users to share access to resources from anywhere
at any time through their connected devices.
 In this scenario, the computations (programs) are sent to where the data is located,
rather than copying the data to millions of desktops as in the traditional approach.
 Cloud computing avoids large data movement, resulting in much better network
bandwidth utilization. Furthermore, machine virtualization has enhanced resource
utilization, increased application flexibility, and reduced the total cost of using virtualized
data-center resources.
 The cloud offers significant benefit to IT companies by freeing them from the low-level
task of setting up the hardware (servers) and managing the system software.
 Cloud computing applies a virtual platform with elastic resources put together by
on-demand provisioning of hardware, software, and data sets, dynamically. The main idea
is to move desktop computing to a service-oriented platform using server clusters and
huge databases at data centers.
 Cloud computing leverages its low cost and simplicity to both providers and users. Cloud
computing intends to leverage multitasking to achieve higher throughput by serving
many heterogeneous applications, large or small, simultaneously.

3.1.1 Types of cloud computing


Cloud computing is typically classified in two ways:

 Location of the cloud computing


 Type of services offered

3.1.1.1 Based on Location of the cloud

1. Public Clouds

 A public cloud is built over the Internet and can be accessed by any user who has paid
for the service.
 Public clouds are owned by service providers and are accessible through a subscription.
 The providers of clouds are commercial providers that offer a publicly accessible remote
interface for creating and managing VM instances within their proprietary infrastructure.
 A public cloud delivers a selected set of business processes. The application and
infrastructure services are offered on a flexible price-per-use basis.
 Many public clouds are available, including Google App Engine (GAE), Amazon Web
Services (AWS), Microsoft Azure, IBM Blue Cloud, and Salesforce.com's Force.com.

Main characteristics of the public cloud:

 Easy to use: Some developers prefer the public cloud due to its ease of access.
Generally, the public cloud operates at high speed, which is also attractive to
some enterprises.
 Typically a pay-per-use model (cost-effective): Often, public clouds operate on an
elastic pay-as-you-go model, so users only need to pay for what they use.
 Operated by a third party: The public cloud isn't specific to a single business,
person or enterprise; it is constructed with shared resources and operated by third-
party providers.
 Flexible: Public clouds allow users to easily add or drop capacity, and are typically
accessible from any Internet-connected device — users don't need to jump through
many hurdles in order to access.
 Can be unreliable: Public cloud outages have repeatedly made headlines, leading
to downtime and headaches for users.
 Less secure: Public cloud often has a lower level of security and may be more
susceptible to hacks. Some public cloud providers also reserve the right to shift data
around from one region to another without notifying the user, which may cause
issues, legal and otherwise, for a company with strict data security policies.
2. Private Clouds

 A private cloud is built within the domain of an intranet owned by a single organization.
 It is client owned and managed, and its access is limited to the owning clients and their
partners.
 Its deployment was not meant to sell capacity over the Internet through publicly
accessible interfaces.
 Private clouds give local users a flexible and agile private infrastructure to run service
workloads within their administrative domains.
 A private cloud is supposed to deliver more efficient and convenient cloud services.
 It may impact the cloud standardization, while retaining greater customization and
organizational control.
 Examples of Private Cloud:
o Eucalyptus
o Ubuntu Enterprise Cloud - UEC (powered by Eucalyptus)
o Amazon VPC (Virtual Private Cloud)
o VMware Cloud Infrastructure Suite
o Microsoft ECI data center.

Main features of private cloud computing:

 Organization-specific: Private clouds are developed specifically for one organization or
enterprise; unlike the public cloud, they aren't shared among many users.
 More control and reliability: Private cloud services and infrastructure are maintained
onsite, or in a privately hosted environment such as a third-party data center. This gives
an enterprise the utmost control over access — IT can know where information is
deployed and can keep an eye on the boundaries that surround that data. Additionally,
managed private clouds allow for strong service level agreements, which can increase
reliability.
 Customizable: IT can customize storage and networking components so that the cloud
is a perfect fit for the specific organization and its needs.
 More costly (arguably): Proponents of public cloud computing often tout its
cost-effectiveness as one of the primary advantages. While private cloud may rack up
costs due to increased management responsibilities and smaller economies of scale, it
is worth weighing those costs against the security risks.
 Requires IT expertise: Some companies may not have the infrastructure to completely
build out and manage a custom private cloud within their own IT department, as it can
require a good deal of upkeep. In these cases, a managed private cloud may be a
viable option.
3. Hybrid Clouds

 The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).
 Organizations may host critical applications on private clouds and applications with relatively fewer security concerns on the public cloud. Using private and public clouds together is called a hybrid cloud.
 A hybrid cloud provides access to clients, the partner network, and third parties.
 Enterprise cloud providers often support a hybrid cloud approach, focusing on choosing the right destination for each application to suit individual business needs.
 Examples of Hybrid Cloud:
o Windows Azure (capable of Hybrid Cloud)
o VMware vCloud (Hybrid Cloud Services)

Main Characteristics
 Flexible and scalable: Since the hybrid cloud, as its name suggests, employs facets
of both private and public cloud services, enterprises have the ability to mix and match
for the ideal balance of cost and security.
 Cost effective: Businesses can take advantage of the cost-effectiveness of public
cloud computing, while also enjoying the security of a private cloud.
 Becoming widely popular: More and more enterprises are adopting this type of
model.

In summary, public clouds promote standardization, preserve capital investment, and offer application flexibility. Private clouds attempt to achieve customization and offer higher efficiency, resiliency, security, and privacy. Hybrid clouds operate in the middle, with many compromises in terms of resource sharing.
4. Community Cloud
 Community Cloud is a type of cloud hosting in which the setup is mutually shared between many organisations that belong to a particular community, e.g., banks and trading firms.
 It is a multi-tenant setup shared among several organisations that belong to a specific group with similar computing concerns. The community members generally share similar privacy, performance, and security requirements.
 A community cloud may be managed internally or by a third-party provider, and it can be hosted externally or internally. Because the cost is shared by the organisations within the community, a community cloud offers cost savings.
 A community cloud is appropriate for organisations and businesses that work on joint ventures, tenders, or research that needs a centralised cloud computing capability for managing, building, and implementing similar projects.
 Various state-level government departments requiring access to the same data relating to the local population, or to infrastructure such as hospitals, roads, and electrical stations, can utilize a community cloud to manage these applications.
 Government departments, universities, central banks etc. often find this type of cloud
useful.
 Examples of Community Cloud:
o Google Apps for Government
o Microsoft Government Community Cloud

Figure 3.1 Public, Private and Hybrid Cloud

Public Clouds vs. Private Clouds:



3.1.1.2 Based on Services offered

 The services provided over the cloud can be generally categorized into three different
service models: namely Infrastructure as a service (IaaS), Platform as a Service
(PaaS), and Software as a Service (SaaS).
 All three models allow users to access services over the Internet, relying entirely on the
infrastructures of cloud service providers.
 These models are offered based on various SLAs between providers and users. In a
broad sense, the SLA for cloud computing is addressed in terms of service availability,
performance, and data protection and security.
 The figure below illustrates the three cloud models at different service levels of the cloud.
o SaaS is applied at the application end using special interfaces by users or
clients.
o At the PaaS layer, the cloud platform must perform billing services and handle
job queuing, launching, and monitoring services.
o At the bottom layer of the IaaS services, databases, compute instances, the file
system, and storage must be provisioned to satisfy user demands.
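The availability component of an SLA mentioned above can be turned into a concrete downtime budget with a little arithmetic; the sketch below assumes a 30-day month:

```python
def allowed_downtime_minutes(availability, period_minutes=30 * 24 * 60):
    """Permitted downtime per period (default: a 30-day month) for a
    given availability fraction, e.g. 0.999 for 'three nines'."""
    return (1.0 - availability) * period_minutes

three_nines = allowed_downtime_minutes(0.999)  # about 43 minutes per month
two_nines = allowed_downtime_minutes(0.99)     # about 432 minutes per month
```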



Figure 3.2 Cloud models at different service levels of the cloud

 Infrastructure as a Service
 This involves offering hardware-related services using the principles of cloud computing.
 This model allows users to use virtualized IT resources for computing, storage, and
networking. In short, the service is performed by rented cloud infrastructure.
 The user can deploy and run his applications over his chosen OS environment. The user
does not manage or control the underlying cloud infrastructure, but has control over the
OS, storage, deployed applications, and possibly select networking components.
 This IaaS model encompasses storage as a service, compute instances as a service, and communication as a service. For example, Amazon's Virtual Private Cloud (VPC) provides EC2 clusters and S3 storage to multiple users. GoGrid, FlexiScale, and Aneka are other good examples. Table 4.1 summarizes the IaaS offerings by five public cloud providers.

 Platform as a Service (PaaS)


 This involves offering a development platform on the cloud. The PaaS model enables users to develop and deploy their applications.



 The platform cloud is an integrated computer system consisting of both hardware and
software infrastructure.
 The user application can be developed on this virtualized cloud platform using some
programming languages and software tools supported by the provider (e.g., Java,
Python, .NET).
 The user does not manage the underlying cloud infrastructure. The cloud provider
supports user application development and testing on a well-defined service platform.
 This PaaS model enables a collaborative software development platform for users from different parts of the world.
 This model also encourages third parties to provide software management, integration,
and service monitoring solutions.

 Software as a service (SaaS)


 This includes a complete software offering on the cloud.
 Users can access a software application hosted by the cloud vendor on a pay-per-use basis. The pioneer in this field has been Salesforce.com's offering in the online Customer Relationship Management (CRM) space.
 The SaaS model provides software applications as a service. As a result, on the
customer side, there is no upfront investment in servers or software licensing. On the
provider side, costs are kept rather low, compared with conventional hosting of user
applications. Customer data is stored in the cloud that is either vendor proprietary or
publicly hosted to support PaaS and IaaS.
 Other examples are web-based offerings such as
o Google's Gmail and Microsoft's Hotmail
o Google Docs and Microsoft SharePoint
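The division of responsibility across the three service models can be sketched as a small lookup. The layer names and the exact boundaries below are a simplified assumption for illustration (e.g., storage responsibilities are more nuanced in practice):

```python
# Who manages each layer of the stack under the three service models.
# Layer names and boundaries are a simplified, illustrative assumption.

STACK = ["application", "data", "runtime", "os", "virtualization",
         "servers", "storage", "networking"]

# Index into STACK below which the *user* is responsible.
USER_MANAGED_DEPTH = {"saas": 0, "paas": 2, "iaas": 4}

def who_manages(model, layer):
    """Return 'user' or 'provider' for a given service model and layer."""
    depth = USER_MANAGED_DEPTH[model.lower()]
    return "user" if STACK.index(layer) < depth else "provider"
```

Under this sketch, an IaaS user manages everything down to the OS, a PaaS user manages only applications and data, and a SaaS user manages nothing in the stack.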

3.2 Pros and Cons of Cloud Computing


Advantages of Cloud Computing

1. Lower-Cost Computers for Users


There is no need for a high-powered (and accordingly high-priced) computer to run cloud computing's web-based applications. Because the application runs in the cloud, the client computers in cloud computing can be lower priced, with smaller hard disks, less memory, more efficient processors, and the like. In fact, a client computer wouldn't even need a CD or DVD drive, because no software programs have to be loaded and no document files need to be saved.



2. Improved Performance
Computers in a cloud computing system will boot up faster and run faster, because they'll have fewer programs and processes loaded into memory.

3. Lower IT Infrastructure Costs


Instead of investing in large numbers of powerful servers, a larger organization can use the computing power of the cloud to supplement or replace internal computing resources. Companies with peak needs no longer have to purchase equipment to handle the peaks; peak computing needs are easily handled by computers and servers in the cloud.
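The peak-handling argument can be made concrete with a toy capacity calculation; the request rates and per-server capacity below are assumed numbers for illustration:

```python
import math

def servers_for(load_rps, per_server_rps):
    """Servers required to handle a given request rate."""
    return math.ceil(load_rps / per_server_rps)

# Assumed workload: 800 requests/s on average, 5,000 requests/s at peak,
# and each server handles 200 requests/s.
average, peak, cap = 800, 5000, 200

own_for_peak = servers_for(peak, cap)    # buy for the peak: 25 servers, mostly idle
own_for_avg = servers_for(average, cap)  # buy for the average: 4 servers
burst = peak - own_for_avg * cap         # 4,200 requests/s spill to the cloud at peak
```

Owning only for the average load and bursting the difference to the cloud avoids paying for 21 servers that sit idle outside the peak.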

4. Fewer Maintenance Issues


Cloud computing greatly reduces both hardware and software maintenance for organizations of all sizes. With less hardware (fewer servers) necessary in the organization, maintenance costs are immediately lowered. As to software maintenance, all cloud apps are based elsewhere, so there's no software on the organization's computers for the IT staff to maintain.

5. Lower Software Costs


Instead of purchasing separate software packages for each computer in the organization, only those employees actually using an application need access to that application in the cloud. The cost of installing and maintaining those programs on every desktop in the organization is saved. In fact, many companies (such as Google) offer their web-based applications for free—which, to both individuals and large organizations, is much more attractive than the high costs charged by Microsoft and similar desktop software suppliers.

6. Instant Software Updates


When the app is web-based, updates happen automatically and are available the next time the user logs in to the cloud. Whenever you access a web-based application, you get the latest version—without needing to pay for or download an upgrade.

7. Increased Computing Power


When we're tied into a cloud computing system, the power of the entire cloud is at our disposal. We can perform supercomputing-like tasks utilizing the power of thousands of computers and servers.

8. Unlimited Storage Capacity


The cloud offers virtually limitless storage capacity. Hundreds of petabytes (a petabyte is a million gigabytes) are available in the cloud, so running out of space is not a concern.

9. Increased Data Safety


Unlike desktop computing, where a hard disk crash can destroy all your valuable data, a computer crashing in the cloud doesn't affect the storage of your data. That's because data in the cloud is automatically duplicated, so nothing is ever lost. Even if your personal computer crashes, all your data is still out there in the cloud, still accessible.
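The duplication claim can be sketched as a simple replication model. The independence assumption below is an idealization (real failures are often correlated), and the numbers are illustrative:

```python
def survives_failures(replicas, failed):
    """Data survives as long as at least one replica remains."""
    return failed < replicas

def loss_probability(replicas, p_fail):
    """Probability that every replica is lost, assuming independent
    failures with per-replica loss probability p_fail (an idealization;
    real failures are often correlated)."""
    return p_fail ** replicas

# With 3 replicas and an assumed 1% per-replica loss probability,
# the chance of losing every copy is about one in a million.
```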



10. Improved Compatibility Between Operating Systems
In the cloud, operating systems simply don't matter. You can connect your Windows computer to the cloud and share documents with computers running Apple's Mac OS, Linux, or UNIX. In the cloud, the data matters, not the operating system.

11. Improved Document Format Compatibility


All documents created by web-based applications can be read by any other user accessing that
application. There are no format incompatibilities when everyone is sharing docs and apps in
the cloud.

12. Easier Group Collaboration


Sharing documents leads directly to collaborating on documents. Cloud computing allows multiple users to easily collaborate on documents and projects. Each user can access the project's documents simultaneously; the edits one user makes are automatically reflected in what the other users see onscreen. With cloud computing, anyone anywhere can collaborate in real time. It's an enabling technology.

13. Universal Access to Documents


With cloud computing, all your documents stay in the cloud, where you can access them from
anywhere with a computer and an Internet connection. All your documents are instantly
available from wherever you are.

14. Latest Version Availability


The cloud always hosts the latest version of your documents; you're never in danger of having an outdated version on the computer you're working on.

15. Removes the Tether to Specific Devices


You're no longer tethered to a single computer or network. There's no need to buy a special version of a program for a particular device, or save your document in a device-specific format. Your documents and the programs that created them are the same no matter what computer you're using.

Disadvantages of Cloud Computing

Let's examine a few of the risks related to cloud computing.

1. Requires a Constant Internet Connection


Cloud computing is impossible if you can't connect to the Internet. Because you use the Internet to connect to both your applications and documents, if you don't have an Internet connection you can't access anything, even your own documents. A dead Internet connection means no work, period—and in areas where Internet connections are few or inherently unreliable, this could be a deal breaker. When you're offline, cloud computing just doesn't work.

2. Doesn't Work Well with Low-Speed Connections


Web-based apps often require a lot of bandwidth to download, as do large documents. Cloud computing isn't for the slow or broadband-impaired.

3. Can Be Slow



Even on a fast connection, web-based applications can sometimes be slower than desktop software, because the document you're working on has to be sent back and forth between your computer and the computers in the cloud. If the cloud servers happen to be backed up at that moment, or if the Internet is having a slow day, you won't get instantaneous access.

4. Features Might Be Limited


Today many web-based applications aren't as full-featured as their desktop-based versions, although many web-based apps add more advanced features over time. This has certainly been the case with Google Docs and Spreadsheets. Make sure that the cloud-based application can do everything you need it to do before you give up on your traditional software.

5. Stored Data Might Not Be Secure


With cloud computing, all your data is stored on the cloud. That's all well and good, but how secure is the cloud? Can other, unauthorized users gain access to your confidential data? These are important questions that require further examination.

6. If the Cloud Loses Your Data, You're Screwed


Theoretically, data stored in the cloud is unusually safe, replicated across multiple machines. But if the provider does lose your data, there may be no physical or local backup to fall back on.

3.3 Implementation Levels of Virtualization


Virtualization is a computer architecture technology by which multiple virtual machines (VMs)
are multiplexed in the same hardware machine. The purpose of a VM is to enhance resource
sharing by many users and improve computer performance in terms of resource utilization and
application flexibility. Hardware resources (CPU, memory, I/O devices, etc.) or software
resources (operating system and software libraries) can be virtualized in various functional
layers.

A virtual machine is a software representation of a real machine that provides an operating environment which can run or host a guest operating system. Virtual machines are created and managed by virtual machine monitors.

A guest operating system is an operating system running inside the created virtual machine.

Virtual Machine Monitor (Hypervisor): Software that runs in a layer between the host operating system and one or more virtual machines, providing the virtual machine abstraction to the guest operating systems. Examples: Xen, KVM, VMware.

The idea of virtualization is to separate the hardware from the software to yield better system
efficiency. Virtualization techniques can be applied to enhance the use of compute engines,
networks, and storage. With sufficient storage, any computer platform can be installed in
another host computer, even if they use processors with different instruction sets and run with
distinct operating systems on the same hardware.

3.3.1 LEVELS OF VIRTUALIZATION IMPLEMENTATION


A traditional computer runs with a host operating system specially tailored for its hardware
architecture, as shown in Figure 3.3(a).



After virtualization, different user applications managed by their own operating systems (guest
OS) can run on the same hardware, independent of the host OS. This is often done by adding
additional software, called a virtualization layer as shown in Figure 3.3(b).

This virtualization layer is known as hypervisor or virtual machine monitor (VMM) . The VMs are
shown in the upper boxes, where applications run with their own guest OS over the virtualized
CPU, memory, and I/O resources.

Figure 3.3 Hardware Architecture

The virtualization software creates the abstraction of VMs by interposing a virtualization layer at
various levels of a computer system. Common virtualization layers include the instruction set
architecture (ISA) level, hardware level, operating system level, library support level, and
application level

Figure 3.4 Virtualization Levels



 Virtualization at ISA (Instruction Set Architecture) level:
• Emulating a given ISA by the ISA of the host machine.
• e.g., MIPS binary code can run on an x86-based host machine with the help of ISA emulation.
• Typical systems: Bochs, Crusoe, QEMU, BIRD, Dynamo
Advantage:

• It can run a large amount of legacy binary codes written for various processors
on any given new hardware host machines
• best application flexibility
Shortcoming & limitation:

• One source instruction may require tens or hundreds of native target instructions to perform its function, which is relatively slow.
• V-ISA requires adding a processor-specific software translation layer to the compiler.
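The slowdown noted above can be illustrated with a toy interpreter for a hypothetical guest ISA. The instruction format and the assumed cost of five host operations per guest instruction are illustrative only:

```python
# Toy guest ISA with one instruction form: (op, src1, src2, dst).
# Each guest instruction costs several host-level steps (fetch, decode,
# operand reads, execute, write-back), which is why pure ISA emulation
# runs much slower than native execution.

def emulate(program, regs):
    steps = 0
    for instr in program:               # fetch
        op, src1, src2, dst = instr     # decode
        a, b = regs[src1], regs[src2]   # operand reads
        if op == "ADD":                 # execute + write-back
            regs[dst] = a + b
        else:
            raise ValueError("unknown opcode: " + op)
        steps += 5                      # ~5 host operations per guest instruction
    return regs, steps

regs, steps = emulate([("ADD", "r1", "r2", "r3")], {"r1": 2, "r2": 3, "r3": 0})
```

A single guest ADD here costs five host-level steps; real emulators amortize this with techniques such as dynamic binary translation.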
 Virtualization at Hardware Abstraction level:

• Virtualization is performed right on top of the hardware.


• It generates virtual hardware environments for VMs, and manages the underlying
hardware through virtualization.
• Typical systems: VMware, Virtual PC, Denali, Xen
Advantage:

• Has higher performance and good application isolation


Shortcoming & limitation:

• Very expensive to implement (complexity)


 Virtualization at Operating System (OS) level:

It is an abstraction layer between the traditional OS and user applications.

• This virtualization creates isolated containers on a single physical server and allows the OS instances to utilize the hardware and software in data centers.
• Typical systems: Jail / Virtual Environment / Ensim's VPS / FVM
Advantage:

• Has minimal startup/shutdown cost, low resource requirements, and high scalability; can synchronize VM and host state changes.
Shortcoming & limitation:

• All VMs at the operating system level must have the same kind of guest OS
• Poor application flexibility and isolation.



Figure 3.5 The virtualization layer is inserted inside an OS to partition the hardware resource for
multiple VMs to run their applications in virtual environments

 Library Support level:

It creates execution environments for running alien programs on a platform rather than creating
VM to run the entire operating system.

• It is done by API call interception and remapping.


• Typical systems: Wine, WABI, LxRun, Visual MainWin
Advantage:

• It has very low implementation effort


Shortcoming & limitation:

• poor application flexibility and isolation


 User-Application level:

It virtualizes an application as a virtual machine.

• This layer sits as an application program on top of an operating system and exports an
abstraction of a VM that can run programs written and compiled to a particular abstract
machine definition.
• Typical systems: JVM, .NET CLR, Panot
Advantage:

• has the best application isolation


Shortcoming & limitation:

• low performance, low application flexibility and high implementation complexity.



Overall, hardware and OS support will yield the highest performance. However, the hardware
and application levels are also the most expensive to implement. User isolation is the most
difficult to achieve. ISA implementation offers the best application flexibility.
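The qualitative comparison above can be encoded as a small table. The 1-to-5 scores below paraphrase the text's rankings; they are judgments for illustration, not benchmark results:

```python
# Qualitative comparison of the virtualization levels discussed above,
# scored 1 (worst) to 5 (best). Scores paraphrase the text's rankings.

MERITS = {
    #               performance / flexibility / isolation / cost (5 = cheap to implement)
    "isa":         {"perf": 1, "flex": 5, "isol": 3, "cost": 3},
    "hardware":    {"perf": 5, "flex": 3, "isol": 4, "cost": 1},
    "os":          {"perf": 5, "flex": 2, "isol": 2, "cost": 3},
    "library":     {"perf": 3, "flex": 2, "isol": 2, "cost": 5},
    "application": {"perf": 1, "flex": 2, "isol": 5, "cost": 1},
}

def best(criterion):
    """Level that scores highest on a criterion (ties: first listed)."""
    return max(MERITS, key=lambda level: MERITS[level][criterion])
```

Consistent with the summary: ISA emulation wins on flexibility, hardware and OS levels on performance, the application level on isolation, while the hardware and application levels are the most expensive to implement.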

3.3.2 VMM DESIGN REQUIREMENTS AND PROVIDERS


Hardware-level virtualization inserts a layer between real hardware and traditional operating
systems. This layer is commonly called the Virtual Machine Monitor (VMM) and it manages the
hardware resources of a computing system.

There are three requirements for a VMM.

 First, a VMM should provide an environment for programs which is essentially identical to
the original machine.
 Second, programs run in this environment should show, at worst, only minor decreases in
speed.
 Third, a VMM should be in complete control of the system resources. Any program run under a VMM should behave exactly as it does when running directly on the original machine.
A VMM should demonstrate efficiency in using the VMs. To guarantee the efficiency of a VMM, a statistically dominant subset of the virtual processor's instructions needs to be executed directly by the real processor, with no software intervention by the VMM.
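This efficiency requirement can be quantified with a simple weighted-cost model. The 98% direct-execution fraction and the 50x trap penalty below are assumed numbers, chosen only to illustrate why the directly executed subset must dominate:

```python
def effective_slowdown(direct_fraction, trap_penalty):
    """Average cost per instruction when a fraction runs natively
    (cost 1) and the rest traps into the VMM (cost trap_penalty)."""
    return direct_fraction * 1.0 + (1.0 - direct_fraction) * trap_penalty

# If 98% of instructions execute directly and a trapped instruction
# costs 50x a native one, the average cost is about 1.98x native.
slowdown = effective_slowdown(0.98, 50)
```

Even a small trapped fraction with a large penalty can double the average instruction cost, which is why VMMs work hard to keep trap rates low.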

Complete control of these resources by a VMM includes the following aspects:

(1) The VMM is responsible for allocating hardware resources for programs;
(2) it is not possible for a program to access any resource not explicitly allocated to it;
and
(3) it is possible under certain circumstances for a VMM to regain control of resources
already allocated.
A VMM is tightly related to the architecture of the processor. It is difficult to implement a VMM for some types of processors, such as the x86; specific limitations include the inability to trap on some privileged instructions. If a processor is not primarily designed to support virtualization, it is necessary to modify the hardware to satisfy the three requirements for a VMM. This is known as hardware-assisted virtualization.

Comparison of Four VMM and Hypervisor Software Packages

Hypervisor

A hypervisor is a hardware virtualization technique allowing multiple operating systems, called guests, to run on a host machine. The hypervisor is also called the Virtual Machine Monitor (VMM).

Type 1: bare metal hypervisor

• sits on the bare metal computer hardware like the CPU, memory, etc.
• All guest operating systems are a layer above the hypervisor.
• The original CP/CMS hypervisor developed by IBM was of this kind.
Type 2: hosted hypervisor

• Runs over a host operating system.
• The hypervisor is the second layer above the hardware.
• Guest operating systems run a layer above the hypervisor.
• The host OS is usually unaware of the virtualization.



Figure 3.6 Types of Hypervisor

3.3.3 VIRTUALIZATION SUPPORT AT THE OS LEVEL


Cloud computing has at least two challenges.

 The first is the ability to use a variable number of physical machines and VM instances
depending on the needs of a problem. For example, a task may need only a single CPU
during some phases of execution but may need hundreds of CPUs at other times.
 The second challenge concerns the slow operation of instantiating new VMs. Currently,
new VMs originate either as fresh boots or as replicates of a template VM, unaware of
the current application state.
Need for OS Virtualization

In a cloud computing environment, thousands of VMs need to be initialized simultaneously.


Besides slow operation, storing the VM images also becomes an issue. There is considerable
repeated content among VM images.

Moreover, full virtualization at the hardware level also has the disadvantages of slow
performance and low density, and the need for para-virtualization to modify the guest OS.

To reduce the performance overhead of hardware-level virtualization, even hardware modification is needed.

OS-level virtualization provides a feasible solution for these hardware-level virtualization issues.

 Operating system virtualization inserts a virtualization layer inside an operating system to partition a machine's physical resources.
 It enables multiple isolated VMs within a single operating system kernel. This kind of
VM is often called a virtual execution environment (VE), Virtual Private System
(VPS), or simply container.



 From the user's point of view, VEs look like real servers. This means a VE has its own set of processes, file system, user accounts, network interfaces with IP addresses, routing tables, firewall rules, and other personal settings.
 Although VEs can be customized for different people, they share the same operating
system kernel. Therefore, OS-level virtualization is also called single-OS image
virtualization.
Figure 3.7 illustrates operating system virtualization from the point of view of a machine stack.

Figure 3.7 Operating system virtualization from the point of view of a machine stack

 The OpenVZ virtualization layer sits inside the host OS and provides OS images to create VMs quickly.
 The virtualization layer is inserted inside the OS to partition the hardware resources for
multiple VMs to run their applications in multiple virtual environments.
 To implement OS-level virtualization, isolated execution environments (VMs) should be created based on a single OS kernel. Furthermore, the access requests from a VM need to be redirected to the VM's local resource partition on the physical machine. For example, the chroot command in a UNIX system can create several virtual root directories within a host OS. These virtual root directories are the root directories of all VMs created.
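The redirection of a VM's access requests to its local resource partition can be sketched as pure path arithmetic. This is a chroot-like confinement sketch; the directory names are hypothetical, and a real implementation enforces this in the kernel:

```python
import os.path

def resolve(vm_root, vm_path):
    """Redirect a VM's absolute path into that VM's resource partition
    on the host, refusing '..' escapes (a chroot-like confinement sketch)."""
    root = os.path.normpath(vm_root)
    target = os.path.normpath(os.path.join(root, vm_path.lstrip("/")))
    if target != root and not target.startswith(root + os.sep):
        raise ValueError("path escapes the VM's root")
    return target
```

Each VM sees `/` as its own root, while the host transparently maps every request into that VM's private directory tree.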
Advantages of OS Extension for Virtualization

 VMs at the OS level have minimal startup/shutdown costs
 An OS-level VM can easily synchronize with its environment
Disadvantage of OS Extension for Virtualization



 All VMs in the same OS container must have the same or a similar guest OS, which restricts the application flexibility of different VMs on the same physical machine.
Virtualization on Linux or Windows Platforms

Most reported OS-level virtualization systems are Linux-based. Virtualization support on the
Windows-based platform is still in the research stage. The Linux kernel offers an abstraction
layer to allow software processes to work with and operate on resources without knowing the
hardware details.

Examples of OS-level virtualization tools

Two OS tools (Linux vServer and OpenVZ) support Linux platforms to run other
platform-based applications through virtualization.

The third tool, FVM, is an attempt specifically developed for virtualization on the
Windows NT platform.

 Linux vServer (for Linux platforms; http://linux-vserver.org/): Extends Linux kernels to implement a security mechanism that helps build VMs by setting resource limits and file attributes and changing the root environment for VM isolation.

 OpenVZ (for Linux platforms): Supports virtualization by creating virtual private servers (VPSes); a VPS has its own files, users, process tree, and virtual devices, which can be isolated from other VPSes; checkpointing and live migration are supported.

 FVM (Feather-weight Virtual Machines, for virtualizing the Windows NT platform): Uses system call interfaces to create VMs in the NT kernel space; multiple VMs are supported by a virtualized namespace and copy-on-write.
3.3.4 MIDDLEWARE SUPPORT FOR VIRTUALIZATION

Library-level virtualization is also known as user-level Application Binary Interface (ABI) or API
emulation. This type of virtualization can create execution environments for running alien
programs on a platform rather than creating a VM to run the entire operating system.

The following provides an overview of several library-level virtualization systems:

 WABI (Windows Application Binary Interface): Middleware that converts Windows system calls running on x86 PCs to Solaris system calls running on SPARC workstations.

 Lxrun (Linux Run): A system call emulator that enables Linux applications written for x86 hosts to run on UNIX systems such as the SCO OpenServer.

 WINE: A library support system for virtualizing x86 processors to run Windows applications under Linux, FreeBSD, and Solaris.

 Visual MainWin: A compiler support system to develop Windows applications using Visual Studio to run on Solaris, Linux, and AIX hosts.

 vCUDA: Virtualization support for using general-purpose GPUs to run data-intensive applications under a special guest OS.

Figure 3.8 vCUDA Architecture

CUDA is a programming model and library for general-purpose GPUs. It leverages the high
performance of GPUs to run compute-intensive applications on host operating systems.

vCUDA employs a client-server model to implement CUDA virtualization.

It consists of three user space components: the vCUDA library, a virtual GPU in the guest OS
(which acts as a client), and the vCUDA stub in the host OS (which acts as a server).

The vCUDA library resides in the guest OS as a substitute for the standard CUDA library. It is
responsible for intercepting and redirecting API calls from the client to the stub. Besides these
tasks, vCUDA also creates vGPUs and manages them.
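The interception-and-redirection idea can be sketched in a few lines. The class names and the in-process `vector_add` call below are illustrative; real vCUDA intercepts CUDA API calls in the guest and forwards them across the VM boundary to the host:

```python
# Client-server interception sketch in the spirit of vCUDA.
# Names and the in-process call are illustrative assumptions.

class Stub:
    """Server side (host OS): owns the real resource and executes calls."""
    def handle(self, call, args):
        if call == "vector_add":
            x, y = args
            return [a + b for a, b in zip(x, y)]
        raise NotImplementedError(call)

class SubstituteLibrary:
    """Client side (guest OS): stands in for the real library."""
    def __init__(self, stub):
        self.stub = stub
        self.log = []                    # record of intercepted calls

    def vector_add(self, x, y):
        self.log.append("vector_add")    # intercept the API call ...
        return self.stub.handle("vector_add", (x, y))  # ... and redirect it

lib = SubstituteLibrary(Stub())
result = lib.vector_add([1, 2], [3, 4])  # caller is unaware of the forwarding
```

Because the substitute library exposes the same interface as the real one, the guest application needs no changes.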



3.4 Virtualization Structures/Tools and Mechanisms
Depending on the position of the virtualization layer, there are several classes of VM architectures:

 hypervisor architecture
 full virtualization and host-based virtualization
 para-virtualization

3.4.1 HYPERVISOR AND XEN ARCHITECTURE

The hypervisor supports hardware-level virtualization on bare metal devices like CPU, memory,
disk and network interfaces. The hypervisor software sits directly between the physical
hardware and its OS. This virtualization layer is referred to as either the VMM or the hypervisor.

The hypervisor provides hypercalls for the guest OSes and applications. Depending on the functionality, a hypervisor can assume a micro-kernel architecture like Microsoft Hyper-V, or a monolithic hypervisor architecture like VMware ESX for server virtualization.

Figure 3.9 Hypervisor architecture

A micro-kernel hypervisor includes only the basic and unchanging functions (such as physical
memory management and processor scheduling). The device drivers and other changeable
components are outside the hypervisor.

A monolithic hypervisor implements all the aforementioned functions, including those of the
device drivers.

Therefore, the size of the hypervisor code of a micro-kernel hypervisor is smaller than that of a
monolithic hypervisor. Essentially, a hypervisor must be able to convert physical devices into
virtual resources dedicated for the deployed VM to use.

The Xen Architecture

Xen is an open source hypervisor program developed at Cambridge University.



Xen is a micro-kernel hypervisor, which separates the policy from the mechanism. The Xen
hypervisor implements all the mechanisms, leaving the policy to be handled by Domain 0, as
shown in Figure 3.10.

Xen does not include any device drivers natively. It just provides a mechanism by which a
guest OS can have direct access to the physical devices. As a result, the size of the Xen
hypervisor is kept rather small.

Xen provides a virtual environment located between the hardware and the OS.

The core components of a Xen system are the hypervisor, kernel, and applications. The
organization of the three components is important.

Like other virtualization systems, many guest OSes can run on top of the hypervisor. However,
not all guest OSes are created equal, and one in particular controls the others.

The guest OS, which has control ability, is called Domain 0, and the others are called Domain
U.

Domain 0 is a privileged guest OS of Xen. It is first loaded when Xen boots without any file
system drivers being available. Domain 0 is designed to access hardware directly and manage
devices. Therefore, one of the responsibilities of Domain 0 is to allocate and map hardware
resources for the guest domains (the Domain U domains).

For example, Xen is based on Linux and its security level is C2. Its management VM is named
Domain 0, which has the privilege to manage other VMs implemented on the same host. If
Domain 0 is compromised, the hacker can control the entire system.

So, in the VM system, security policies are needed to improve the security of Domain 0. Domain
0, behaving as a VMM, allows users to create, copy, save, read, modify, share, migrate, and roll
back VMs as easily as manipulating a file, which flexibly provides tremendous benefits for users.
Unfortunately, it also brings a series of security problems during the software life cycle and data
lifetime.

Figure 3.10 XEN Architecture

3.4.2 BINARY TRANSLATION WITH FULL VIRTUALIZATION

Depending on implementation technologies, hardware virtualization can be classified into two
categories:

 full virtualization
 host-based virtualization.
 Full virtualization does not need to modify the guest OS. It relies on binary translation
to trap and to virtualize the execution of certain sensitive, nonvirtualizable
instructions. The guest OSes and their applications consist of noncritical and critical
instructions.
 In a host-based system, both a host OS and a guest OS are used. A virtualization
software layer is built between the host OS and guest OS.
 Full Virtualization

 With full virtualization, noncritical instructions run on the hardware directly while critical
instructions are discovered and replaced with traps into the VMM to be emulated by
software.
 Both the hypervisor and VMM approaches are considered full virtualization.
 Critical instructions are trapped into the VMM because binary translation can incur a large
performance overhead.
 Noncritical instructions do not control hardware or threaten the security of the system, but
critical instructions do. Therefore, running noncritical instructions on hardware not only can
promote efficiency, but also can ensure system security

Figure. 3.11 Full Virtualization using a hypervisor / VMM on top of bare hardware device

Binary Translation of Guest OS Requests Using a VMM

 This approach was implemented by VMware and many other software companies.
 VMware puts the VMM at Ring 0 and the guest OS at Ring 1.
 The VMM scans the instruction stream and identifies the privileged, control- and
behavior-sensitive instructions. When these instructions are identified, they are trapped
into the VMM, which emulates the behavior of these instructions. The method used in
this emulation is called binary translation.
 Therefore, full virtualization combines binary translation and direct execution.
 The guest OS is completely decoupled from the underlying hardware. Consequently, the
guest OS is unaware that it is being virtualized.
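As a rough illustration of this scanning step, the sketch below classifies a stream of "instructions" the way a binary-translating VMM would: noncritical instructions pass through for direct execution on hardware, while critical ones are rewritten into traps that the VMM emulates in software. The instruction names are made up for illustration, not real x86 opcodes.

```python
# Toy sketch (not real x86): a binary-translating VMM scans a basic block,
# runs noncritical instructions directly, and rewrites critical ones into
# traps to be emulated by the VMM.

CRITICAL = {"cli", "sti", "mov_cr3", "in", "out"}   # hypothetical sensitive ops

def translate(block):
    """Return a translated basic block: critical ops become VMM traps."""
    out = []
    for insn in block:
        if insn in CRITICAL:
            out.append(("trap_to_vmm", insn))   # emulated in software
        else:
            out.append(("direct", insn))        # runs on hardware unchanged
    return out

translated = translate(["add", "cli", "load", "mov_cr3", "store"])
```

This mirrors the text above: full virtualization combines binary translation (for the critical instructions) with direct execution (for everything else), so the guest OS needs no modification.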

 Host-Based Virtualization

This approach installs a virtualization layer on top of the host OS. This host OS is still
responsible for managing the hardware. The guest OSes are installed and run on top of the
virtualization layer. Dedicated applications may run on the VMs.

Figure 3.12 Hosted VM installs a guest OS on top of host OS

This host-based architecture has some distinct advantages:

 First, the user can install this VM architecture without modifying the host OS. The virtualizing
software can rely on the host OS to provide device drivers and other low-level services. This
will simplify the VM design and ease its deployment.
 Second, the host-based approach appeals to many host machine configurations.
Compared to the hypervisor/VMM architecture, the performance of the host-based architecture
may also be low. When an application requests hardware access, it involves four layers of
mapping, which downgrades performance significantly. When the ISA of a guest OS is different
from the ISA of the underlying hardware, binary translation must be adopted. Although the
host-based architecture has flexibility, the performance is too low to be useful in practice.

3.4.3 PARA-VIRTUALIZATION WITH COMPILER SUPPORT


Para-virtualization needs to modify the guest operating systems.

A para-virtualized VM provides special APIs requiring substantial OS modifications in user
applications.

Performance degradation is a critical issue of a virtualized system. No one wants to use a VM if
it is much slower than using a physical machine.

The virtualization layer can be inserted at different positions in a machine software stack.
However, para-virtualization attempts to reduce the virtualization overhead, and thus improve
performance by modifying only the guest OS kernel.

Para-virtualized VM architecture:

The guest operating systems are para-virtualized.

They are assisted by an intelligent compiler to replace the nonvirtualizable OS instructions with
hypercalls.

The traditional x86 processor offers four instruction execution rings: Rings 0, 1, 2, and 3. The
lower the ring number, the higher the privilege of instruction being executed.

The OS is responsible for managing the hardware and the privileged instructions to execute at
Ring 0, while user-level applications run at Ring 3.

Figure 3.13 Para-virtualized VM architecture

A para-virtualized guest OS is assisted by an intelligent compiler to replace nonvirtualizable OS
instructions with hypercalls that communicate directly with the hypervisor or VMM.

Although para-virtualization reduces the overhead, it has incurred other problems.

 First, its compatibility and portability may be in doubt, because it must support the
unmodified OS as well.
 Second, the cost of maintaining para-virtualized OSes is high, because they may require
deep OS kernel modifications.
 Finally, the performance advantage of para-virtualization varies greatly due to workload
variations.
Compared with full virtualization, para-virtualization is relatively easy and more practical. The
main problem in full virtualization is its low performance in binary translation. To speed up binary
translation is difficult.

Many virtualization products employ the para-virtualization architecture. The popular Xen, KVM,
and VMware ESX are good examples.

 KVM (Kernel-Based VM)

This is a Linux para-virtualization system—a part of the Linux version 2.6.20 kernel.

Memory management and scheduling activities are carried out by the existing Linux kernel. The
KVM does the rest, which makes it simpler than the hypervisor that controls the entire machine.

KVM is a hardware-assisted para-virtualization tool, which improves performance and supports
unmodified guest OSes such as Windows, Linux, Solaris, and other UNIX variants.

 Para-Virtualization with Compiler Support

Unlike the full virtualization architecture which intercepts and emulates privileged and sensitive
instructions at runtime, para-virtualization handles these instructions at compile time.

The guest OS kernel is modified to replace the privileged and sensitive instructions with
hypercalls to the hypervisor or VMM. Xen assumes such a para-virtualization architecture.

The guest OS running in a guest domain may run at Ring 1 instead of at Ring 0. This implies
that the guest OS may not be able to execute some privileged and sensitive instructions.

The privileged instructions are implemented by hypercalls to the hypervisor. After replacing the
instructions with hypercalls, the modified guest OS emulates the behavior of the original guest
OS.

On a UNIX system, a system call involves an interrupt or service routine. Hypercalls apply
a dedicated service routine in Xen.
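The replacement of privileged instructions with hypercalls can be pictured as a dispatch table inside the hypervisor: the para-virtualized guest kernel invokes a numbered hypercall, and the hypervisor routes it to the corresponding service routine. The hypercall numbers and routine names below are illustrative, not Xen's real ABI.

```python
# Minimal sketch of a hypercall dispatch table. A para-virtualized guest
# kernel issues numbered hypercalls instead of executing privileged
# instructions; the hypervisor maps each number to a service routine.

HYPERCALLS = {}

def hypercall(number):
    """Decorator registering a hypervisor service routine under a number."""
    def register(fn):
        HYPERCALLS[number] = fn
        return fn
    return register

@hypercall(1)
def update_page_table(args):
    return f"mapped {args['gva']:#x} for guest"

@hypercall(2)
def set_timer(args):
    return f"timer armed for {args['ns']} ns"

def do_hypercall(number, args):
    """Hypervisor entry point: dispatch to the registered service routine."""
    return HYPERCALLS[number](args)
```

Just as a system call transfers control to a kernel service routine, `do_hypercall` here stands in for the dedicated hypercall entry point in the hypervisor.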

 VMware ESX Server for Para-Virtualization

ESX is a VMM or a hypervisor for bare-metal x86 symmetric multiprocessing (SMP) servers. It
accesses hardware resources such as I/O directly and has complete resource management
control.

An ESX-enabled server consists of four components: a virtualization layer, a resource manager,
hardware interface components, and a service console.

To improve performance, the ESX server employs a para-virtualization architecture in which the
VM kernel interacts directly with the hardware without involving the host OS.

Figure 3.14 VMware ESX Server for Para-Virtualization

Full virtualization vs Para virtualization

Full virtualization

• Does not need to modify guest OS, and critical instructions are emulated by software
through the use of binary translation.
• VMware Workstation applies full virtualization, which uses binary translation to
automatically modify x86 software on-the-fly to replace critical instructions.
• Advantage: no need to modify OS.
• Disadvantage: binary translation slows down the performance.
Para virtualization

• Reduces the overhead, but cost of maintaining a paravirtualized OS is high.
• The improvement depends on the workload.
• Para virtualization must modify guest OS, non-virtualizable instructions are replaced by
hypercalls that communicate directly with the hypervisor or VMM.
• Para virtualization is supported by Xen, Denali and VMware ESX

3.5 Virtualization of CPU, Memory, and I/O Devices


To support virtualization, processors such as the x86 employ a special running mode and
instructions, known as hardware-assisted virtualization.

In this way, the VMM and guest OS run in different modes and all sensitive instructions of
the guest OS and its applications are trapped in the VMM. Mode switching is completed by
hardware.

3.5.1 HARDWARE SUPPORT FOR VIRTUALIZATION


 Modern operating systems and processors permit multiple processes to run
simultaneously. If there is no protection mechanism in a processor, all instructions from
different processes will access the hardware directly and cause a system crash.
Therefore, all processors have at least two modes, user mode and supervisor mode, to
ensure controlled access of critical hardware.
 Instructions running in supervisor mode are called privileged instructions. Other
instructions are unprivileged instructions. In a virtualized environment, it is more difficult
to make OSes and applications run correctly because there are more layers in the
machine stack.
VMware Workstation

This is a VM software suite for x86 and x86-64 computers.

This allows users to set up multiple x86 and x86-64 virtual computers and to use one or
more of these VMs simultaneously with the host operating system.

VMware Workstation assumes host-based virtualization. Xen, by contrast, is a hypervisor for use
in IA-32, x86-64, Itanium, and PowerPC 970 hosts.

KVM (Kernel-based Virtual Machine)

This is a Linux kernel virtualization infrastructure. One or more guest OSes can run on top of
the hypervisor.

KVM can support hardware-assisted virtualization and para-virtualization by using Intel VT-x or
AMD-V and the VirtIO framework, respectively.

Hardware Support for Virtualization in the Intel x86 Processor

Adopts full virtualization

For processor virtualization, Intel offers the VT-x or VT-i technique.

VT-x adds a privileged mode (VMX Root Mode) and some instructions to processors.
This enhancement traps all sensitive instructions in the VMM automatically.

For memory virtualization, Intel offers the EPT, which translates virtual addresses to
the machine's physical addresses to improve performance.

For I/O virtualization, Intel implements VT-d and VT-c to support this.

Figure 3.15 Hardware Support for Virtualization in the Intel x86 Processor

3.5.2 CPU VIRTUALIZATION


 VM instructions are executed on the host processor in native mode. Thus, unprivileged
instructions run directly on the host machine for higher efficiency. Other critical
instructions should be handled carefully for correctness and stability.
 The critical instructions are divided into three categories: privileged instructions, control-
sensitive instructions, and behavior-sensitive instructions.
o Privileged instructions execute in a privileged mode and will be trapped if
executed outside this mode.
o Control-sensitive instructions attempt to change the configuration of resources
used.
o Behavior-sensitive instructions have different behaviors depending on the
configuration of resources, including the load and store operations over the
virtual memory.
 A CPU architecture is virtualizable if it supports the ability to run the VM's privileged and
unprivileged instructions in the CPU's user mode while the VMM runs in supervisor
mode.
 When the privileged instructions including control- and behavior-sensitive instructions of
a VM are executed, they are trapped in the VMM. In this case, the VMM acts as a unified
mediator for hardware access from different VMs to guarantee the correctness and
stability of the whole system.
 However, not all CPU architectures are virtualizable. RISC CPU architectures can be
naturally virtualized, but the x86 architecture cannot, because some of its sensitive
instructions are not privileged.
 On a UNIX-like system, a system call triggers the 80h interrupt and passes control to the
OS kernel. The interrupt handler in the kernel is then invoked to process the system call.
 On a para-virtualization system such as Xen, a system call in the guest OS first triggers
the 80h interrupt normally. Almost at the same time, the 82h interrupt in the hypervisor is
triggered. Incidentally, control is passed on to the hypervisor as well. When the
hypervisor completes its task for the guest OS system call, it passes control back to the
guest OS kernel.
Hardware-Assisted CPU Virtualization

 Intel and AMD add an additional mode called privilege mode level (some people call it
Ring -1) to x86 processors.
 Operating systems can run at Ring 0 and the hypervisor can run at Ring -1. All the
privileged and sensitive instructions are trapped in the hypervisor automatically.
 This technique removes the difficulty of implementing binary translation of full
virtualization. It also lets the operating system run in VMs without modification.
Intel Hardware-Assisted CPU Virtualization

 Virtualization of x86 processors: Intel's VT-x technology is an example of hardware-
assisted virtualization, as shown in Figure 3.16.
 Intel calls the privilege level of x86 processors the VMX Root Mode. In order to control
the start and stop of a VM and allocate a memory page to maintain the CPU state for
VMs, a set of additional instructions is added.

Figure 3.16 Intel hardware-assisted CPU virtualization

3.5.3 MEMORY VIRTUALIZATION


 All modern x86 CPUs include a memory management unit (MMU) and a translation
lookaside buffer (TLB) to optimize virtual memory performance.
 However, in a virtual execution environment, virtual memory virtualization involves sharing
the physical system memory in RAM and dynamically allocating it to the physical memory of
the VMs.
 A two-stage mapping process should be maintained by the guest OS and the VMM,
respectively: virtual memory to physical memory and physical memory to machine memory.
 MMU virtualization should be supported, which is transparent to the guest OS. The guest
OS continues to control the mapping of virtual addresses to the physical memory addresses
of VMs, but cannot directly access the actual machine memory. The VMM is responsible for
mapping the guest physical memory to the actual machine memory.
Figure shows the two-level memory mapping procedure.

Figure 3.17 Two-level memory mapping procedure

 Each page table of the guest OSes has a separate page table in the VMM corresponding to
it. The VMM page table is called the shadow page table. Nested page tables add another
layer of indirection to virtual memory.
 The MMU already handles virtual-to-physical translations as defined by the OS.
 Then the physical memory addresses are translated to machine addresses using another
set of page tables defined by the hypervisor.
 Since modern operating systems maintain a set of page tables for every process, the
shadow page tables will get flooded. Consequently, the performance overhead and cost of
memory will be very high.
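The two-stage mapping described above can be sketched in a few lines: the guest OS maps virtual pages to guest-physical pages, the VMM maps guest-physical pages to machine pages, and the shadow page table caches the composed virtual-to-machine mapping so the MMU can use it directly. The page numbers below are made up for illustration.

```python
# Sketch of the two-level memory mapping: guest virtual -> guest physical
# (maintained by the guest OS) and guest physical -> machine (maintained by
# the VMM). The shadow page table is the composition of the two.

guest_pt = {0x1: 0x10, 0x2: 0x11}    # guest virtual page -> guest physical page
vmm_pt   = {0x10: 0x7A, 0x11: 0x7B}  # guest physical page -> machine page

def build_shadow(guest_pt, vmm_pt):
    """Compose the two mappings into a direct virtual -> machine table."""
    return {gva: vmm_pt[gpa] for gva, gpa in guest_pt.items()}

shadow = build_shadow(guest_pt, vmm_pt)
```

Because every guest page table needs such a composed shadow, and modern OSes keep one page table per process, the number of shadow tables grows quickly, which is exactly the overhead the text points out.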
Extended Page Table by Intel for Memory Virtualization

 To improve the efficiency of the software shadow page table technique Intel developed a
hardware-based EPT technique.
 When a virtual address needs to be translated, the CPU will first look for the L4 page table
pointed to by Guest CR3.

Figure 3.18 Extended Page Table by Intel for Memory Virtualization
 Since the address in Guest CR3 is a physical address in the guest OS, the CPU needs to
convert the Guest CR3 GPA to the host physical address (HPA) using EPT.
 In this procedure, the CPU will check the EPT TLB to see if the translation is there. If there is
no required translation in the EPT TLB, the CPU will look for it in the EPT. If the CPU cannot
find the translation in the EPT, an EPT violation exception will be raised.
 When the GPA of the L4 page table is obtained, the CPU will calculate the GPA of the L3
page table by using the GVA and the content of the L4 page table.
 If the entry corresponding to the GVA in the L4 page table is a page fault, the CPU will
generate a page fault interrupt and will let the guest OS kernel handle the interrupt.
 When the GPA of the L3 page table is obtained, the CPU will look for the EPT to get the
HPA of the L3 page table, as described earlier.
 To get the HPA corresponding to a GVA, the CPU needs to look for the EPT five times, and
each time, the memory needs to be accessed four times. Therefore, there are 20 memory
accesses in the worst case, which is still very slow. To overcome this shortcoming, Intel
increased the size of the EPT TLB to decrease the number of memory accesses.
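The worst-case count quoted above is simple arithmetic, made explicit in this short sketch: each of the five guest-side lookups (guest CR3 plus the four guest page-table levels) needs its guest-physical address translated through the EPT, which is itself a four-level walk.

```python
# Worst-case memory-access count for a guest page walk under EPT, following
# the figures given in the text.

EPT_LOOKUPS = 5          # guest CR3 + four guest page-table levels (L4..L1)
ACCESSES_PER_LOOKUP = 4  # one memory access per EPT level

worst_case = EPT_LOOKUPS * ACCESSES_PER_LOOKUP   # 20 memory accesses
```

A larger EPT TLB reduces how often this full 20-access walk is actually taken, which is Intel's mitigation described above.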

3.5.4 I/O VIRTUALIZATION


I/O virtualization involves managing the routing of I/O requests between virtual devices and the
shared physical hardware.

There are three ways to implement I/O virtualization:

 full device emulation,
 para-virtualization and
 direct I/O.
 Full device emulation

 This approach emulates well-known, real-world devices.
 All the functions of a device or bus infrastructure, such as device enumeration,
identification, interrupts, and DMA, are replicated in software. This software is located in
the VMM and acts as a virtual device.
 The I/O access requests of the guest OS are trapped in the VMM which interacts with
the I/O devices.
 Device emulation for I/O virtualization is implemented inside the middle layer, which maps
real I/O devices to the virtual devices for the guest device driver to use.

Figure.3.19 I/O virtualization

 Para-virtualization method of I/O virtualization


 Used in Xen; also known as the split driver model, consisting of a frontend driver and
a backend driver.
 The frontend driver runs in Domain U and the backend driver runs in Domain 0. They
interact with each other via a block of shared memory.
 The frontend driver manages the I/O requests of the guest OSes, and the backend driver
is responsible for managing the real I/O devices and multiplexing the I/O data of different
VMs.
 However, this approach has higher CPU overhead.
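The split driver model can be sketched with a shared queue standing in for the block of shared memory: the frontend in Domain U posts I/O requests into the ring, and the backend in Domain 0 drains them, touches the real device, and returns responses. All names here are illustrative, not Xen's actual ring interface.

```python
# Sketch of Xen's split driver model: frontend (Domain U) and backend
# (Domain 0) communicate through shared memory, modeled here as a deque.

from collections import deque

shared_ring = deque()          # stands in for the shared memory block

def frontend_submit(request):
    """Domain U side: queue an I/O request for the backend."""
    shared_ring.append(request)

def backend_service():
    """Domain 0 side: drain requests, perform device I/O, return responses."""
    responses = []
    while shared_ring:
        req = shared_ring.popleft()
        responses.append(f"completed {req}")   # real driver work goes here
    return responses

frontend_submit("read block 7")
frontend_submit("write block 9")
results = backend_service()
```

Because the backend multiplexes requests from many frontends onto one real device, Domain 0's CPU does work on behalf of every guest, which is where the extra CPU overhead noted above comes from.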
 Direct I/O virtualization

 VM access devices directly.


 Since software-based I/O virtualization requires a very high overhead of device
emulation, hardware-assisted I/O virtualization is critical.
 Intel VT-d supports the remapping of I/O DMA transfers and device-generated interrupts.
The architecture of VT-d provides the flexibility to support multiple usage models that
may run unmodified, special-purpose, or "virtualization-aware" guest OSes.
 Self-virtualized I/O (SV-IO)

 The goal of SV-IO is to harness the rich resources of a multicore processor. All tasks
associated with virtualizing an I/O device are encapsulated in SV-IO.

 It provides virtual devices and an associated access API to VMs and a management API
to the VMM.
 SV-IO defines one virtual interface (VIF) for every kind of virtualized I/O device, such as
virtual network interfaces, virtual block devices (disk), virtual camera devices, and
others.
 The guest OS interacts with the VIFs via VIF device drivers. Each VIF consists of two
message queues. One is for outgoing messages to the devices and the other is for
incoming messages from the devices. Each VIF has a unique ID for identifying it in SV-
IO.

3.5.5 VIRTUALIZATION IN MULTI-CORE PROCESSORS


 Virtualizing a multi-core processor is relatively more complicated. There are mainly two
difficulties: application programs must be parallelized to use all cores fully, and
software must explicitly assign tasks to the cores, which is a very complex problem.
 New programming models, languages, and libraries are needed to make parallel
programming easier.
 Research efforts involve scheduling algorithms and resource management policies.
3.5.5.1 Physical versus Virtual Processor Cores

A multicore virtualization method is proposed to allow hardware designers to get an abstraction
of the low-level details of the processor cores.

It is located under the ISA and remains unmodified by the operating system or VMM
(hypervisor).

Figure illustrates the technique of a software-visible VCPU moving from one core to another and
temporarily suspending execution of a VCPU when there are no appropriate cores on which it
can run.

3.5.5.2 Virtual Hierarchy

A virtual hierarchy is a cache hierarchy that can adapt to fit the workload or mix of workloads.

The hierarchy's first level locates data blocks close to the cores needing them for faster access,
establishes a shared-cache domain, and establishes a point of coherence for faster
communication.

When a miss leaves a tile, it first attempts to locate the block (or sharers) within the first level.

The first level can also provide isolation between independent workloads. A miss at the L1
cache invokes an L2 access.

Space sharing is applied to assign three workloads to three clusters of virtual cores:

VM0 and VM3 for database workload,

VM1 and VM2 for web server workload, and

VM4–VM7 for middleware workload.

The basic assumption is that each workload runs in its own VM. However, space sharing
applies equally within a single operating system.

Statically distributing the directory among tiles can do much better, provided operating systems
or hypervisors carefully map virtual pages to physical frames.

Figure 3.20 Virtual Hierarchy

3.6 Virtual Clusters and Resource Management

3.6.1 PHYSICAL VERSUS VIRTUAL CLUSTERS


 Virtual clusters are built with VMs installed at distributed servers from one or more
physical clusters. The VMs in a virtual cluster are interconnected logically by a virtual
network across several physical networks.
 The virtual cluster nodes can be either physical or virtual machines. Multiple VMs
running with different OSs can be deployed on the same physical node.
 A VM runs with a guest OS, which is often different from the host OS that manages the
resources of the physical machine on which the VM is implemented.

Figure 3.21 Physical Vs Virtual Cluster

 VMs consolidate multiple functionalities on the same server to enhance server
utilization and application flexibility.
 VMs can be colonized (replicated) in multiple servers to promote distributed parallelism,
fault tolerance, and disaster recovery.
 The size (number of nodes) of a virtual cluster can grow or shrink dynamically.
 Failure of any physical node may disable some VMs installed on the failing node, but
the failure of VMs will not pull down the host system.
Figure shows the concept of a virtual cluster based on application partitioning or
customization.

Figure 3.22 Virtual cluster based on application partitioning or customization.

3.6.1.1 Fast Deployment and Effective Scheduling

 Deployment refers to
o constructing and distributing software stacks (OS, libraries, applications) to a physical
node inside the cluster as fast as possible, and
o quickly switching runtime environments from one user's virtual cluster to another
user's virtual cluster.
 If one user finishes using his system, the corresponding virtual cluster should shut down
or suspend quickly to save the resources to run other VMs for other users.
 The live migration of VMs allows workloads of one node to transfer to another node.
However, VMs cannot randomly migrate among themselves, and the potential overhead
caused by live migration may have serious negative effects on cluster utilization,
throughput, and QoS.
 Load balancing of applications can be achieved using the load index and frequency of
user logins. The automatic scale-up and scale-down mechanism of a virtual cluster can
be implemented based on this model.
 Dynamically adjusting loads among nodes by live migration of VMs is desired, when the
loads on cluster nodes become quite unbalanced.
3.6.1.2 High-Performance Virtual Storage

 Template VM can be distributed to several physical hosts in the cluster to customize the
VMs. To efficiently manage the disk spaces occupied by template software packages,
some storage architecture design can be applied to reduce duplicated blocks in a
distributed file system of virtual clusters. Hash values are used to compare the contents
of data blocks.
 Users have their own profiles which store the identification of the data blocks for
corresponding VMs in a user-specific virtual cluster. New blocks are created when users
modify the corresponding data.
 There are four steps to deploy a group of VMs onto a target cluster:
o preparing the disk image,
o configuring the VMs,
o choosing the destination nodes, and
o executing the VM deployment command on every host.
 To simplify the disk image preparation process, a template is used. A template is a disk
image that includes a preinstalled operating system with or without certain application
software. Users choose a proper template according to their requirements and make a
duplicate of it as their own disk image. Templates can use the COW (Copy on Write)
format. A new COW backup file is very small and easy to create and transfer, which
reduces disk space consumption.
 Every VM is configured with a name, disk image, network setting, and allocated CPU
and memory. VMs with the same configurations could use preedited profiles to simplify
the process.
 Normally, users do not care which host is running their VM. A strategy to choose the
proper destination host for any VM is needed.
 The deployment principle is to fulfill the VM requirement and to balance workloads
among the whole host network.
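The hash-based deduplication of template blocks mentioned above can be sketched directly: identical blocks across user images hash to the same digest, so only one physical copy is stored, and each user's profile records the block digests that make up their image. The data and names are made up for illustration.

```python
# Sketch of hash-based block deduplication for virtual cluster storage:
# one stored copy per unique block content, referenced by per-user profiles.

import hashlib

store = {}                       # digest -> block data (one copy per content)

def add_block(profile, block):
    """Record a block in a user's profile, storing its data only if new."""
    digest = hashlib.sha256(block).hexdigest()
    store.setdefault(digest, block)      # stored once, shared by all users
    profile.append(digest)

alice, bob = [], []
add_block(alice, b"kernel image")
add_block(bob,   b"kernel image")        # duplicate: no new storage used
add_block(bob,   b"bob's data")

unique_blocks = len(store)               # 2 stored blocks for 3 references
```

New blocks are created only when a user modifies data, matching the copy-on-write behavior of templates described above.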

3.6.2 LIVE VM MIGRATION STEPS AND PERFORMANCE EFFECTS
 When a VM fails, it can be replaced by another VM on a different node, provided
both run with the same guest OS.
 During live migration of a VM from host A to host B, the migration copies the VM state file
from the storage area to the host machine.

Figure 3.23 VM Migration Steps

There are four ways to manage a virtual cluster.

 Using a guest-based manager:

o The cluster manager resides on a guest system. Multiple VMs form a virtual cluster.

o Examples: openMosix (a Linux cluster running different guest systems on top of the
Xen hypervisor) and Sun's cluster Oasis (a cluster of VMs supported by a VMware
VMM).

 Build a cluster manager on the host systems.

o Host-based manager supervises the guest systems.

o Example: VMware HA system that can restart a guest system after failure.

 Use an independent cluster manager on both the host and guest systems.

 Use an integrated cluster manager on the guest and host systems.

Live migration of a VM consists of the following six steps:


Steps 0 and 1: Start migration.
Prepares for the migration, including determining the migrating VM and the destination
host. Migration is automatically started by strategies such as load balancing and server
consolidation.
Step 2: Transfer memory.
Since the whole execution state of the VM is stored in memory, all of the memory data is
transferred in the first round, and then the migration controller recopies the memory data
changed in the previous round. Precopying memory is performed iteratively, without
interrupting the execution of programs.
Step 3: Suspend the VM and copy the last portion of the data.
When the last round's memory data is transferred, non-memory data such as the CPU and
network states should also be transferred. During this step, the VM is stopped. This
"service unavailable" time is called the "downtime" of migration, which should be as
short as possible.
Steps 4 and 5: Commit and activate the new host.
After all the needed data is copied, on the destination host, the VM reloads the states
and recovers the execution of programs in it, and the service provided by this VM
continues. The network connection is redirected to the new VM and the original VM from
the source host is removed.
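The precopy rounds above can be simulated in a few lines: round 0 sends every page, later rounds resend only pages dirtied in the meantime, and when the dirty set falls below a threshold the VM is suspended and the remainder is sent during the downtime. The dirty-page pattern below is invented purely for illustration.

```python
# Toy simulation of precopy live migration (steps 2-3 above): iterative
# recopying of dirty pages, then stop-and-copy of the final working set.

def precopy(total_pages, dirty_per_round, stop_threshold):
    """Return (pages sent while VM runs, pages sent during downtime)."""
    sent_live = total_pages               # round 0: transfer all memory
    dirty = dirty_per_round(0)
    round_no = 1
    while dirty > stop_threshold:
        sent_live += dirty                # recopy pages dirtied last round
        dirty = dirty_per_round(round_no)
        round_no += 1
    return sent_live, dirty               # remaining dirty pages = downtime copy

# assumed workload: the writable working set halves every round
sent, downtime_pages = precopy(
    total_pages=1000,
    dirty_per_round=lambda r: 100 // (2 ** r),
    stop_threshold=10,
)
```

Note the trade-off the simulation makes visible: more precopy rounds shrink the downtime copy but increase the total data sent over the network while the VM is still running.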

3.6.3 MIGRATION OF MEMORY, FILES, AND NETWORK RESOURCES


When one system migrates to another physical node, we should consider the following issues.

3.6.3.1 Memory Migration


 Memory migration can be in a range of hundreds of megabytes to a few gigabytes.
 The Internet Suspend-Resume (ISR) technique exploits temporal locality as memory
states are likely to have considerable overlap in the suspended and the resumed
instances of a VM.
 To exploit temporal locality, each file in the file system is represented as a tree of small
subfiles. A copy of this tree exists in both the suspended and resumed VM instances.
 The advantage of using a tree-based representation of files is that the caching ensures
the transmission of only those files which have been changed.

3.6.3.2 File System Migration


 To support VM migration, a system must provide each VM with a consistent, location-
independent view of the file system that is available on all hosts.
 A simple way to achieve this is to provide each VM with its own virtual disk and mapping
the file system to this virtual disk along with the other states of the VM. However, due to
high-capacity disks, migration of the contents of an entire disk over a network is not a
viable solution.
 Another way is to have a global file system across all machines where a VM could be
located. This way removes the need to copy files from one machine to another because
all files are network-accessible.

 A distributed file system is used in ISR serving as a transport mechanism for
propagating a suspended VM state. Here the VMM only accesses its local file system.
The relevant VM files are explicitly copied into the local file system for a resume
operation and taken out of the local file system for a suspend operation. This approach
relieves developers from the complexities of implementing several different file system
calls for different distributed file systems. However, the VMM has to store the contents
of each VM's virtual disks in its local files, which have to be moved around with the other
state information of that VM.
 In smart copying, the VMM exploits spatial locality. The idea is to transmit only the
difference between the two file systems at suspending and resuming locations. This
technique significantly reduces the amount of actual physical data that has to be moved.
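Smart copying can be sketched as a content diff between the suspend-site and resume-site views of the file system: only files whose hashes differ need to be transferred. The paths and contents below are illustrative.

```python
# Sketch of "smart copying": transmit only the difference between the file
# systems at the suspending and resuming locations, compared by content hash.

import hashlib

def digest(data):
    return hashlib.sha256(data).hexdigest()

def smart_copy(suspended_fs, resumed_fs):
    """Return the set of paths whose contents must actually be transferred."""
    return {
        path for path, data in suspended_fs.items()
        if digest(data) != digest(resumed_fs.get(path, b""))
    }

src = {"/etc/conf": b"v2", "/bin/app": b"same", "/var/log": b"new entries"}
dst = {"/etc/conf": b"v1", "/bin/app": b"same"}
to_send = smart_copy(src, dst)    # unchanged /bin/app is skipped
```

The spatial locality assumption is that most files are identical at both sites, so the transmitted set is small compared to the whole disk.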

3.6.3.3 Network Migration


 A migrating VM should maintain all open network connections without relying on
forwarding mechanisms on the original host or on support from mobility or redirection
mechanisms.
 To enable remote systems to locate and communicate with a VM, each VM must be
assigned a virtual IP address known to other entities. This address can be distinct from
the IP address of the host machine where the VM is currently located. Each VM can also
have its own distinct virtual MAC address.
 The VMM maintains a mapping of the virtual IP and MAC addresses to their
corresponding VMs.
 In general, a migrating VM includes all the protocol states and carries its IP address with
it.
 If the source and destination machines of a VM migration are connected to a single
switched LAN, an unsolicited ARP reply from the migrating host advertises that the IP
has moved to a new location. This solves the open network connection problem by
reconfiguring all the peers to send future packets to the new location.
 Although a few packets that have already been transmitted might be lost, there are no
other problems with this mechanism. Alternatively, on a switched network, the migrating
OS can keep its original Ethernet MAC address and rely on the network switch to detect
its move to a new port.
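The VMM's mapping of virtual IP and MAC addresses to VMs can be pictured as a small lookup table. The toy Python class below (all names hypothetical) shows how the virtual addresses stay fixed across a migration while only the VM's host location changes; in a real switched LAN, peers learn of the move through the unsolicited ARP reply described above, not a dictionary update:

```python
class VMMNetworkMap:
    """Toy VMM-side table mapping virtual IP/MAC addresses to VMs."""

    def __init__(self):
        self.by_ip = {}  # virtual IP -> {"mac", "vm", "host"}

    def register(self, ip, mac, vm_id, host):
        self.by_ip[ip] = {"mac": mac, "vm": vm_id, "host": host}

    def migrate(self, ip, new_host):
        # The VM keeps its virtual IP and MAC; only its location changes.
        # On a real LAN, a gratuitous ARP reply would now advertise the
        # unchanged IP at the new switch port.
        self.by_ip[ip]["host"] = new_host

    def locate(self, ip):
        return self.by_ip[ip]["host"]


net = VMMNetworkMap()
net.register("10.0.0.5", "52:54:00:aa:bb:cc", "vm1", "hostA")
net.migrate("10.0.0.5", "hostB")
assert net.locate("10.0.0.5") == "hostB"
```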
 Live migration means moving a VM from one physical node to another while keeping its
OS environment and applications unbroken. It provides desirable features to satisfy
requirements for computing resources in modern computing systems, including server
consolidation, performance isolation, and ease of management.
 Traditional migration suspends a VM before the transportation and then resumes it at
the end of the process. With the precopy mechanism, a VM can be live-migrated without
being stopped, keeping its applications running during the migration.
 Live migration is a key feature of system virtualization technologies. In a cluster
environment with a network-accessible storage system, such as a storage area network
(SAN) or network-attached storage (NAS), only memory and CPU state need to be
transferred from the source node to the target node.
 Live migration techniques mainly use the precopy approach, which first transfers all
memory pages and then iteratively copies only the pages modified during the previous
round. The VM service downtime is expected to be minimal. When the applications'
writable working set becomes small, the VM is suspended and only the CPU state and
the dirty pages of the last round are sent to the destination.
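The precopy loop can be modeled in a few lines. The sketch below is a toy simulation (the page counts and dirty-rate function are invented, and this is not Xen's implementation); it shows why the approach works only when the writable working set shrinks:

```python
def precopy_migration(total_pages, dirty_per_round, threshold, max_rounds=30):
    """Toy model of iterative precopy.

    Round 0 sends every page; later rounds resend only pages dirtied in
    the previous round. When the writable working set drops below
    `threshold` (or rounds run out), the VM is suspended and the last
    dirty set plus CPU state is sent during downtime.
    """
    sent_live = total_pages            # first full pass, VM still running
    dirty = dirty_per_round(0)
    rounds = 1
    while dirty > threshold and rounds < max_rounds:
        sent_live += dirty             # resend last round's dirty pages
        dirty = dirty_per_round(rounds)
        rounds += 1
    sent_downtime = dirty              # stop-and-copy phase
    return sent_live, sent_downtime, rounds


# A workload whose writable working set shrinks each round converges
# to a small stop-and-copy transfer.
live, downtime, rounds = precopy_migration(
    total_pages=1000, dirty_per_round=lambda r: 200 // (r + 1), threshold=20)
assert downtime <= 20    # small final transfer => short downtime
assert live >= 1000      # every page is sent at least once
```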
 In the precopy phase, although the VM service is still available, significant performance
degradation occurs because the migration daemon continually consumes network
bandwidth to transfer dirty pages in each round. An adaptive rate-limiting approach can
mitigate this issue, but it prolongs total migration time by nearly 10 times. Moreover, a
maximum number of iterations must be set, because not all applications' dirty pages are
guaranteed to converge to a small writable working set over multiple rounds.
 A checkpointing/recovery and trace/replay approach (CR/TR-Motion) has been proposed
to provide fast VM migration. This approach transfers in iterations the execution trace
file logged by a trace daemon, rather than dirty pages. The total size of all log files is
much smaller than that of the dirty pages, so both total migration time and downtime
are drastically reduced. However, CR/TR-Motion is valid only when the log replay rate is
higher than the log growth rate.
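That convergence condition can be written down directly. A hedged back-of-envelope model (assuming constant rates, which real workloads will not have):

```python
def trace_migration_converges(log_growth_rate, replay_rate):
    """CR/TR-Motion works only while logs replay faster than they grow."""
    return replay_rate > log_growth_rate


def estimated_catchup_time(backlog, log_growth_rate, replay_rate):
    """Toy estimate: time for the target to drain the outstanding log
    backlog, assuming linear rates in the same units (e.g. MB and MB/s)."""
    assert trace_migration_converges(log_growth_rate, replay_rate)
    return backlog / (replay_rate - log_growth_rate)


assert trace_migration_converges(log_growth_rate=5.0, replay_rate=8.0)
assert not trace_migration_converges(log_growth_rate=9.0, replay_rate=8.0)
# 30 MB of backlog drained at a net 3 MB/s takes 10 s.
assert estimated_catchup_time(30.0, 5.0, 8.0) == 10.0
```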
 Postcopy is another strategy introduced for live migration of VMs. Here, all memory
pages are transferred only once during the whole migration process, reducing the
baseline total migration time. However, the downtime is much higher than that of
precopy, due to the latency of fetching pages from the source node before the VM can
be resumed on the target.
 CPU resources can be used to compress page frames, significantly reducing the amount
of transferred data. Memory compression algorithms typically have little memory
overhead; decompression is simple, very fast, and requires no extra memory.
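A minimal sketch of this idea using Python's standard zlib (real migration systems use specialized memory-compression algorithms, so this is only illustrative):

```python
import zlib

PAGE = 4096  # assumed page size


def send_compressed(page: bytes) -> bytes:
    """Compress a memory page before putting it on the wire."""
    return zlib.compress(page)


def receive(compressed: bytes) -> bytes:
    """Decompress the page on the target host."""
    return zlib.decompress(compressed)


# A mostly-zero page, common in guest memory, compresses very well.
page = bytes(PAGE)
wire = send_compressed(page)
assert receive(wire) == page        # lossless round trip
assert len(wire) < PAGE // 10       # large reduction in transferred data
```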

3.6.3.4 Live Migration of VM Using Xen


Figure 3.24 shows the compression scheme for VM migration between two Xen-enabled
host machines.
 Domain 0 (or Dom0) performs tasks to create, terminate, or migrate a VM to another
host. Xen uses a send/recv model to transfer states across VMs.
 Migration daemons running in the management VMs are responsible for performing
migration.
 Shadow page tables in the VMM layer trace modifications to the memory pages of
migrated VMs during the precopy phase. Corresponding flags are set in a dirty bitmap.
 At the start of each precopy round, the bitmap is sent to the migration daemon. Then,
the bitmap is cleared and the shadow page tables are destroyed and re-created in the
next round.
 The compression system resides in Xen's management VM. Memory pages denoted by
the bitmap are extracted and compressed before they are sent to the destination. The
compressed data is then decompressed on the target.



Figure 3.24 Live migration of VM from the Dom0 domain to a Xen-enabled target host.

3.7 Virtualization for Data-Center Automation


 Data-center automation means that huge volumes of hardware, software, and database
resources in these data centers can be allocated dynamically to millions of Internet users
simultaneously, with guaranteed QoS and cost-effectiveness.
 This automation process is triggered by the growth of virtualization products and cloud
computing services.
 Virtualization is moving towards enhancing mobility, reducing planned downtime (for
maintenance), and increasing the number of virtual clients.
 The latest virtualization development highlights high availability (HA), backup services,
workload balancing, and further increases in client bases.
We will discuss server consolidation, virtual storage, OS support, and trust management in
automated data-center designs.

3.7.1 SERVER CONSOLIDATION IN DATA CENTERS


 In data centers, a large number of heterogeneous workloads can run on servers at
various times. These heterogeneous workloads are divided into two categories: chatty
workloads and noninteractive workloads.
 Chatty workloads may burst at some point and return to a silent state at some other
point. A web video service is an example of this, whereby a lot of people use it at night
and few people use it during the day.
 Noninteractive workloads do not require people's efforts to make progress after they are
submitted. High-performance computing is a typical example. The resource requirements
of these workloads differ dramatically at various stages. However, to guarantee that a
workload will always be able to cope with all demand levels, the workload is statically
allocated enough resources so that peak demand is satisfied.
 Therefore, it is common that most servers in data centers are underutilized, wasting a
large amount of hardware, space, power, and management cost. Server consolidation is
an approach to improve this low utilization of hardware resources by reducing the
number of physical servers.
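The consolidation decision itself is a bin-packing problem. The sketch below uses first-fit decreasing on a single CPU dimension; an actual consolidation planner must also consider memory, I/O, and QoS constraints, so treat this only as an illustration of the core idea:

```python
def consolidate(vm_loads, server_capacity=1.0):
    """First-fit-decreasing packing of VM utilizations onto servers.

    Returns the number of physical servers needed after consolidation.
    """
    servers = []  # remaining capacity per server
    for load in sorted(vm_loads, reverse=True):
        for i, free in enumerate(servers):
            if load <= free:
                servers[i] -= load   # place VM on an existing server
                break
        else:
            servers.append(server_capacity - load)  # open a new server
    return len(servers)


# Ten VMs averaging 15% utilization fit on two servers instead of ten.
assert consolidate([0.15] * 10) == 2
```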
 Among server consolidation techniques such as centralized and physical consolidation,
virtualization-based server consolidation is the most powerful. Data centers need to
optimize their resource management, and server virtualization enables resource
allocation at a finer granularity than a physical machine.
 In general, the use of VMs increases resource management complexity. This causes a
challenge in terms of how to improve resource utilization as well as guarantee QoS in
data centers.
Server virtualization has the following side effects:
• Consolidation enhances hardware utilization. Many underutilized servers are
consolidated into fewer servers to enhance resource utilization. Consolidation also
facilitates backup services and disaster recovery.
• This approach enables more agile provisioning and deployment of resources. In a virtual
environment, the images of the guest OSes and their applications are readily cloned and
reused.
• The total cost of ownership is reduced. Server virtualization causes deferred purchases
of new servers, a smaller data-center footprint, lower maintenance costs, and lower
power, cooling, and cabling requirements.
• This approach improves availability and business continuity. The crash of a guest OS
has no effect on the host OS or any other guest OS. It becomes easier to transfer a VM
from one server to another, because virtual servers are unaware of the underlying
hardware.
To automate data-center operations, one must consider resource scheduling, architectural
support, power management, automatic or autonomic resource management, performance of
analytical models, and so on.
 In virtualized data centers, an efficient, on-demand, fine-grained scheduler is one of the key
factors to improve resource utilization.
 Scheduling and reallocation can be done at a wide range of levels in a set of data centers:
at least the VM level, server level, and data-center level. Ideally, scheduling and resource
reallocation should be done at all levels. However, due to the complexity of this, current
techniques focus on only a single level or, at most, two levels.
 Dynamic CPU allocation is based on VM utilization and application-level QoS metrics.
o One method considers both CPU and memory flowing as well as automatically
adjusting resource overhead based on varying workloads in hosted services.
o Another scheme uses a two-level resource management system to handle the
complexity involved. A local controller at the VM level and a global controller at the
server level implement autonomic resource allocation through their interaction.
 Multicore and virtualization are two cutting-edge techniques that can enhance each
other.
 However, the use of CMP (chip multiprocessors) is far from well optimized. The memory
system of CMP is a typical example.
o One can design a virtual hierarchy on a CMP in data centers.
o One can consider protocols that minimize memory access time and inter-VM
interference, facilitate VM reassignment, and support inter-VM sharing.
o One can also consider a VM-aware power budgeting scheme using multiple
managers integrated to achieve better power management. One must address the
trade-off of power saving and data-center performance.

3.7.2 VIRTUAL STORAGE MANAGEMENT


 In system virtualization, virtual storage includes the storage managed by VMMs and
guest OSes.
 Generally, the data stored in this environment can be classified into two categories: VM
images and application data.
 The VM images are special to the virtual environment, while application data includes all
other data which is the same as the data in traditional OS environments.
 The most important aspects of system virtualization are encapsulation and isolation.
Traditional operating systems and the applications running on them can be encapsulated
in VMs. Only one operating system runs in each VM, while many applications can run on
that operating system.
 System virtualization allows multiple VMs to run on a physical machine, and the VMs are
completely isolated. To achieve encapsulation and isolation, both the system software
and the hardware platform, such as CPUs and chipsets, are rapidly updated. However,
storage is lagging behind, and storage systems have become the main bottleneck of VM
deployment.
 In virtualization environments, a virtualization layer is inserted between the hardware
and traditional operating systems or a traditional operating system is modified to support
virtualization. This procedure complicates storage operations.
 On the one hand, the storage management of a guest OS behaves as though it were
operating on a real hard disk, even though guest OSes cannot access the hard disk
directly.
 On the other hand, many guest OSes contend for the hard disk when many VMs are
running on a single physical machine. Therefore, storage management in the underlying
VMM is much more complex than in guest OSes (traditional OSes).



 In addition, the storage primitives used by VMs are not nimble. Hence, operations such
as remapping volumes across hosts and checkpointing disks are frequently clumsy and
esoteric, and sometimes simply unavailable.
 In data centers, there are often thousands of VMs, which causes the storage system to
be flooded with VM images.
 Many researchers tried to solve these problems in virtual storage management. The
main purposes of their research are to make management easy while enhancing
performance and reducing the amount of storage occupied by the VM images.
 Parallax is a distributed storage system customized for virtualization environments.
Content Addressable Storage (CAS) is a solution to reduce the total size of VM images,
and therefore supports a large set of VM-based systems in data centers.
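The CAS idea can be sketched in a few lines: blocks are keyed by a hash of their content, so identical blocks shared by many VM images are stored only once. Class and method names below are hypothetical, not a real CAS system's API:

```python
import hashlib


class CASStore:
    """Content-addressable store: identical blocks are stored once."""

    def __init__(self):
        self.blocks = {}  # sha256 digest -> block bytes

    def put(self, block: bytes) -> str:
        """Store a block and return its content address."""
        key = hashlib.sha256(block).hexdigest()
        self.blocks.setdefault(key, block)  # no-op if already present
        return key

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())


# Two VM images sharing a common base-OS block deduplicate to 3 blocks.
store = CASStore()
base, app1, app2 = b"base-os" * 100, b"app-1" * 100, b"app-2" * 100
image_a = [store.put(base), store.put(app1)]
image_b = [store.put(base), store.put(app2)]
assert len(store.blocks) == 3
assert store.stored_bytes() == len(base) + len(app1) + len(app2)
```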
 Parallax introduces a novel architecture in which storage features that have traditionally
been implemented directly on high-end storage arrays and switches are relocated into a
federation of storage VMs. These storage VMs share the same physical hosts as the
VMs that they serve.
 The architecture of Parallax is scalable and especially suitable for use in cluster-based
environments.
 Figure 3.25 shows a high-level view of the structure of a Parallax-based cluster. A
cluster-wide administrative domain manages all storage appliance VMs, which makes
storage management easy.
 Parallax itself runs as a user-level application in the storage appliance VM. It
provides virtual disk images (VDIs) to VMs. A VDI is a single-writer virtual disk which
may be accessed in a location-transparent manner from any of the physical hosts in the
Parallax cluster.
 The VDIs are the core abstraction provided by Parallax. Parallax uses Xen's block tap
driver to handle block requests, and it is implemented as a tapdisk library. This library
acts as a single block virtualization service for all client VMs on the same physical host.
 In the Parallax system, it is the storage appliance VM that connects the physical
hardware device for block and network access.



FIGURE 3.25 Parallax is a set of per-host storage appliances that share access to a common
block device and present virtual disks to client VMs.

3.7.3 CLOUD OS FOR VIRTUALIZED DATA CENTERS


Nimbus, Eucalyptus, and OpenNebula are all open source VI (virtual infrastructure) managers
available to the general public. Only vSphere 4 is a proprietary OS for cloud resource
virtualization and management over data centers.

VI Managers and Operating Systems for Virtualizing Data Centers

Eucalyptus for Virtual Networking of Private Cloud


 Eucalyptus is an open source software system intended mainly for supporting
Infrastructure as a Service (IaaS) clouds. The system primarily supports virtual
networking and the management of VMs; virtual storage is not supported.
 Its purpose is to build private clouds that can interact with end users through Ethernet or
the Internet. The system also supports interaction with other private clouds or public
clouds over the Internet. The system is short on security and other desired features for
general-purpose grid or cloud applications.
 The designers of Eucalyptus implemented each high-level system component as a
stand-alone web service. Each web service exposes a well-defined language-agnostic
API in the form of a WSDL document containing both operations that the service can
perform and input/output data structures.
 Furthermore, the designers leverage existing web-service features such as WS-Security
policies for secure communication between components. The three resource managers
in Figure 3.26 are specified below:
o Instance Manager controls the execution, inspection, and terminating of VM
instances on the host where it runs.

o Group Manager gathers information about and schedules VM execution on
specific instance managers, as well as manages the virtual instance network.



o Cloud Manager is the entry-point into the cloud for users and administrators. It
queries node managers for information about resources, makes scheduling
decisions, and implements them by making requests to group managers.

 In terms of functionality, Eucalyptus works like AWS APIs. Therefore, it can interact with
EC2. It does provide a storage API to emulate the Amazon S3 API for storing user data
and VM images. It is installed on Linux-based platforms, is compatible with EC2 with
SOAP and Query, and is S3-compatible with SOAP and REST. CLI and web portal
services can be applied with Eucalyptus.

FIGURE 3.26 Eucalyptus for building private clouds by establishing virtual networks over the
VMs linking through Ethernet and the Internet. Courtesy of D. Nurmi, et al. [45]

VMware vSphere 4 as a Commercial Cloud OS

Figure 3.27 shows vSphere's overall architecture. The system interacts with user applications
via an interface layer called vCenter.

vSphere is primarily intended to offer virtualization support and resource management of data-
center resources in building private clouds. VMware claims the system is the first cloud OS that
supports availability, security, and scalability in providing cloud computing services.

The vSphere 4 is built with two functional software suites: infrastructure services and application
services.

It also has three component packages intended mainly for virtualization purposes:

vCompute is supported by ESX, ESXi, and DRS virtualization libraries from VMware;

vStorage is supported by VMS and thin provisioning libraries; and

vNetwork offers distributed switching and networking functions.



These packages interact with the hardware servers, disks, and networks in the data center.
These infrastructure functions also communicate with other external clouds.

The application services are also divided into three groups: availability, security, and scalability.

Availability support includes VMotion, Storage VMotion, HA, Fault Tolerance, and Data
Recovery from VMware.

The security package supports vShield Zones and VMsafe.

The scalability package was built with DRS and Hot Add.

FIGURE 3.27 vSphere 4, a cloud operating system that manages compute, storage, and
network resources over virtualized data centers. Courtesy of VMware, April 2010 [72]

3.7.4 TRUST MANAGEMENT IN VIRTUALIZED DATA CENTERS


 A VMM changes the computer architecture. It provides a layer of software between the
operating systems and system hardware to create one or more VMs on a single physical
platform.



 A VM entirely encapsulates the state of the guest operating system running inside it.
Encapsulated machine state can be copied and shared over the network and removed
like a normal file, which poses a challenge to VM security.
 In general, a VMM can provide secure isolation and a VM accesses hardware resources
through the control of the VMM, so the VMM is the base of the security of a virtual
system.
 Normally, one VM is taken as a management VM to have some privileges such as
creating, suspending, resuming, or deleting a VM.

 Once a hacker successfully enters the VMM or management VM, the whole system is in
danger.

3.7.4.1 VM-Based Intrusion Detection


 An intrusion detection system (IDS) is built on operating systems, and is based on the
characteristics of intrusion actions.
 A typical IDS can be classified as a host-based IDS (HIDS) or a network-based IDS
(NIDS), depending on the data source.
 A HIDS can be implemented on the monitored system. When the monitored system is
attacked by hackers, the HIDS also faces the risk of being attacked.
 A NIDS is based on the flow of network traffic, and it cannot detect fake actions.
 Figure 3.28 illustrates the concept.

FIGURE 3.28 The architecture of livewire for intrusion detection using a dedicated VM.

 The VM-based IDS contains a policy engine and a policy module. The policy framework
can monitor events in different guest VMs through an operating system interface library,
and PTrace traces actions according to the security policy of the monitored host.
 It is difficult to predict and prevent all intrusions without delay. Therefore, analysis of
the intrusion action is extremely important after an intrusion occurs. Most computer
systems use logs to analyze attack actions, but it is hard to ensure the credibility and
integrity of a log.
 The IDS log service is based on the operating system kernel. Thus, when an operating
system is invaded by attackers, the log service should be unaffected.
 Besides IDS, honeypots and honeynets are also prevalent in intrusion detection. They
attract and provide a fake system view to attackers in order to protect the real system. In
addition, the attack action can be analyzed, and a secure IDS can be built.



 A honeypot is a purposely defective system that simulates an operating system to cheat
and monitor the actions of an attacker. A honeypot can be divided into physical and
virtual forms. A guest operating system and the applications running on it constitute a
VM. In a virtual honeypot, the host operating system and VMM must be protected
against attacks from the VM.

FIGURE 3.29 Techniques for establishing trusted zones for virtual cluster insulation and VM
isolation. Courtesy of L. Nick, EMC [40]

 The arrowed boxes on the left and the brief description between the arrows and the
zoning boxes are security functions and actions taken at the four levels from the users to
the providers.
 The small circles between the four boxes refer to interactions between users and
providers and among the users themselves.

 The arrowed boxes on the right are those functions and actions applied between the
tenant environments, the provider, and the global communities.

 Almost all available countermeasures, such as anti-virus, worm containment, intrusion
detection, and encryption/decryption mechanisms, are applied here to insulate the
trusted zones and isolate the VMs for private tenants.



 The main innovation here is to establish the trust zones among the virtual clusters. The
end result is to enable an end-to-end view of security events and compliance across the
virtual clusters dedicated to different tenants.
2 Marks Questions and Answers

UNIT III VIRTUALIZATION

1. What is virtual organization?


A virtual organization is a dynamic, multi-institutional organization formed to coordinate
resource sharing and problem solving.

2. What are the facilities provided by virtual organization?

The formation of virtual task forces, or groups, to solve specific problems associated with the
virtual organization, and the dynamic provisioning and management capabilities of the
resources required to meet the SLAs.

3. What are the main characteristics of the public cloud?


Easy to use
Typically a pay-per-use model
Operated by a third party
Flexible
Can be unreliable
Less secure

4. List some examples of Private Cloud.


a. Eucalyptus
b. Ubuntu Enterprise Cloud - UEC (powered by Eucalyptus)
c. Amazon VPC (Virtual Private Cloud)
d. VMware Cloud Infrastructure Suite
e. Microsoft ECI data center.

5. What are the main features of private cloud computing?


Organization-specific
More control and reliability
Customizable
More costly
Requires IT expertise
6. List the main characteristics of Hybrid Cloud.
Flexible and scalable
Cost effective
Becoming widely popular

7. What is community Cloud?


Community Cloud is a type of cloud hosting in which the setup is mutually shared between
many organizations that belong to a particular community, e.g., banks and trading firms. It is a
multi-tenant setup shared among several organizations that belong to a specific group with
similar computing concerns.
8. Mention the features of Paas.
This involves offering a development platform on the cloud. The PaaS model enables
users to develop and deploy their applications.
The platform cloud is an integrated computer system consisting of both hardware and
software infrastructure.
The PaaS model enables a collaborative software development platform for users from
different parts of the world.

9. Discuss the advantages of Cloud Computing.


Lower-Cost Computers for Users
Increased Data Safety
Improved Performance
Lower IT Infrastructure Costs
Fewer Maintenance Issues
Latest Version Availability
Lower Software Costs
Universal Access to Documents
Instant Software Updates
Removes the Tether to Specific Devices
Increased Computing Power
Unlimited Storage Capacity
Improved Compatibility Between Operating Systems

10. List the disadvantage of Cloud Computing.


Requires a Constant Internet Connection
Doesn't Work Well with Low-Speed Connections
Can Be Slow
Features Might Be Limited
Stored Data Might Not Be Secure
If the Cloud Loses Your Data, You're Screwed

11. List the virtualization Levels.


Application Level
Library Level
Operating System Level
Hardware abstraction level
Instruction set Architecture level

12. List the advantages and disadvantages of ISA level.

Advantage:
• It can run a large amount of legacy binary codes written for various processors
on any given new hardware host machines
• best application flexibility
Limitation:
• One source instruction may require tens or hundreds of native target instructions
to perform its function, which is relatively slow.
• V-ISA requires adding a processor-specific software translation layer in the
compiler.



13. Mention the three requirements for a VMM

 First, a VMM should provide an environment for programs which is essentially identical to
the original machine.
 Second, programs run in this environment should show, at worst, only minor decreases in
speed.
 Third, a VMM should be in complete control of the system resources. Any program run
under a VMM should exhibit behavior identical to that of running directly on the original
machine.

14. What is Hypervisor?


A hypervisor is a hardware virtualization technique allowing multiple operating systems,
called guests, to run on a host machine. It is also called the Virtual Machine Monitor
(VMM).

15. What is Xen Architecture?

Xen is an open source hypervisor program developed by Cambridge University. Xen is a micro-
kernel hypervisor, which separates the policy from the mechanism. The Xen hypervisor
implements all the mechanisms, leaving the policy to be handled by Domain 0. Xen does not
include any device drivers natively. It just provides a mechanism by which a guest OS can have
direct access to the physical devices.

Mention the advantages of host-based architecture

 First, the user can install this VM architecture without modifying the host OS. The
virtualizing software can rely on the host OS to provide device drivers and other low-
level services. This will simplify the VM design and ease its deployment.
 Second, the host-based approach appeals to many host machine configurations.
16. Define Full Virtualization with its pros and cons.

• Does not need to modify guest OS, and critical instructions are emulated by software
through the use of binary translation.
• VMware Workstation applies full virtualization, which uses binary translation to
automatically modify x86 software on-the-fly to replace critical instructions.
• Advantage: no need to modify the OS.
• Disadvantage: binary translation slows down performance.

17. Define para Virtualization.


 Reduces the overhead, but cost of maintaining a paravirtualized OS is high.
 The improvement depends on the workload.
 Para virtualization must modify guest OS, non-virtualizable instructions are replaced by
hypercalls that communicate directly with the hypervisor or VMM.
 Para virtualization is supported by Xen, Denali and VMware ESX



18. List the three ways to implement IO virtualization.
 full device emulation,
 para-virtualization and
 direct I/O.

19. Mention the four steps to deploy a group of VMs onto a target cluster
a. preparing the disk image,
b. configuring the VMs,
c. choosing the destination nodes, and
d. executing the VM deployment command on every host.

20. What are the four ways to manage a virtual cluster?


 Use a guest-based manager
 Build a cluster manager on the host systems
 Use an independent cluster manager on both the host and guest systems
 Use an integrated cluster manager on the guest and host systems

21. List the steps of live migration of VM.


 Start migration
 Transfer memory
 Suspend the VM and copy the last portion of the data
 Commit and activate the new host

22. What are the side effects of Server virtualization?

Consolidation enhances hardware utilization. Many underutilized servers are consolidated into
fewer servers to enhance resource utilization. Consolidation also facilitates backup services and
disaster recovery. This approach enables more agile provisioning and deployment of resources.
In a virtual environment, the images of the guest OSes and their applications are readily cloned
and reused.

23. Define Eucalyptus

Eucalyptus is an open source software system intended mainly for supporting Infrastructure as
a Service (IaaS) clouds. The system primarily supports virtual networking and the management
of VMs; virtual storage is not supported. Its purpose is to build private clouds that can interact
with end users through Ethernet or the Internet.
16 marks Questions
1. Discuss about cloud deployment models in detail.
2. Explain in detail about the levels of virtualization implementation.
3. Explain CPU, Memory and I/O virtualization in detail.
4. Explain about the Live migration steps
5. Discuss about Server consolidation in data centers



UNIT IV
PROGRAMMING MODEL
UNIT IV PROGRAMMING MODEL 9
Open source grid middleware packages – Globus Toolkit (GT4) Architecture , Configuration –
Usage of Globus – Main components and Programming model - Introduction to Hadoop
Framework - Mapreduce, Input splitting, map and reduce functions, specifying input and output
parameters, configuring and running a job – Design of Hadoop file system, HDFS concepts,
command line and java interface, dataflow of File read & File write.
TEXT BOOK:
1. Kai Hwang, Geoffrey C. Fox and Jack J. Dongarra, "Distributed and Cloud Computing:
Clusters, Grids, Clouds and the Future of Internet", First Edition, Morgan Kaufmann Publishers,
an Imprint of Elsevier, 2012.

REFERENCES:
1. Jason Venner, "Pro Hadoop - Build Scalable, Distributed Applications in the Cloud", Apress,
2009.

2. Tom White, "Hadoop: The Definitive Guide", First Edition, O'Reilly, 2009.

3. Bart Jacob (Editor), "Introduction to Grid Computing", IBM Redbooks, Vervante, 2005.

4. Ian Foster, Carl Kesselman, "The Grid: Blueprint for a New Computing Infrastructure", 2nd
Edition, Morgan Kaufmann.

5. Frederic Magoules and Jie Pan, "Introduction to Grid Computing", CRC Press, 2009.

6. Daniel Minoli, "A Networking Approach to Grid Computing", John Wiley Publication, 2005.

7. Barry Wilkinson, "Grid Computing: Techniques and Applications", Chapman and Hall/CRC,
Taylor and Francis Group, 2010.

STAFF IN-CHARGE HOD



UNIT IV PROGRAMMING MODEL 9
Open source grid middleware packages – Globus Toolkit (GT4) Architecture , Configuration –
Usage of Globus – Main components and Programming model - Introduction to Hadoop
Framework - Mapreduce, Input splitting, map and reduce functions, specifying input and output
parameters, configuring and running a job – Design of Hadoop file system, HDFS concepts,
command line and java interface, dataflow of File read & File write.

4.1 Open source grid middleware packages


A grid system includes an aggregation of computational resources, storage resources,
network resources, and scientific instruments. These resources constitute the so-called grid
fabric.
Grid middleware provides users with seamless computing ability and uniform access to
resources in a heterogeneous grid environment. Grid middleware should be developed to be
shared, reusable, and extensible. Figure 4.1 illustrates the position of grid middleware in the grid
architecture.

APPLICATIONS

Applications and Portals

USER LEVEL MIDDLEWARE


Development Environment and Tools

CORE MIDDLEWARE

Distributed Resources
Coupling Services

FABRIC
Local resource management

Figure 4.1 Grid Middleware in the grid architecture


The Grid Fabric level consists of distributed resources such as computers, networks,
storage devices and scientific instruments. The computational resources represent multiple
architectures such as clusters, supercomputers, servers and ordinary PCs which run a variety of
operating systems.

CS6703 Grid and Cloud Computing 147


Core Grid middleware offers services such as remote process management, co-
allocation of resources, storage access, information registration and discovery, security, and
aspects of Quality of Service (QoS) such as resource reservation and trading. These services
abstract the complexity and heterogeneity of the fabric level by providing a consistent method
for accessing distributed resources.
User-level Grid middleware utilizes the interfaces provided by the low-level middleware
to provide higher level abstractions and services. These include application development
environments, programming tools and resource brokers for managing resources and scheduling
application tasks for execution on global resources.

Grid applications and portals are typically developed using Grid-enabled languages and utilities
such as HPC++ or MPI.
4.1.1 Basic Functional Grid Middleware Packages
Several significant grid middleware packages have been designed and implemented; representative examples are described below.
UNICORE Middleware
UNICORE is a vertically integrated Grid computing environment that facilitates the following:
 A seamless, secure and intuitive access to resources in a distributed environment – for
end users.
 Solid authentication mechanisms integrated into their administration procedures,
reduced training effort and support requirements – for Grid sites.
 Easy relocation of computer jobs to different platforms – for both end users and Grid
sites.

UNICORE follows a three-tier architecture, as shown in Figure 4.2. It consists of a client
that runs on a Java-enabled user workstation or PC, a gateway, multiple instances of
Network Job Supervisors (NJS) that execute on dedicated, securely configured servers, and
multiple instances of Target System Interfaces (TSI) executing on different nodes, which
provide interfaces to the underlying local resource management systems such as operating
systems and batch subsystems.
UNICORE is a client-server system based on a three-tier model:
User tier - The user is running the UNICORE Client on a local workstation or PC.
Server tier - On the top level, each participating computer center defines one or several
UNICORE Grid sites (Usites) that Clients can connect to.



Target System tier - A Usite offers access to computing or data resources. They are organized
as one or several virtual sites (Vsites) which can represent the execution and/or storage
systems at the computer centers.
The UNICORE Client interface consists of two components: JPA (Job Preparation Agent) and
JMC (Job Monitor Component). The UNICORE Gateway is the single entry point for all
UNICORE connections into a Usite. It provides an Internet address and a port that users can
use to connect to the gateway using SSL.

Figure 4.2 The UNICORE Architecture


A UNICORE Vsite is made up of two components: NJS (Network Job Supervisor) and TSI
(Target System Interface). The NJS Server manages all submitted UNICORE jobs and performs
user authorization by looking for a mapping of the user certificate to a valid login in the UUDB
(UNICORE User Data Base). UNICORE TSI accepts incarnated job components from the NJS,
and passes them to the local batch systems for execution.
UNICORE's features and functions can be summarized as follows:
1. User driven job creation and submission: A graphical interface assists the user in creating
complex and interdependent jobs that can be executed on any UNICORE site without job
definition changes.
2. Job management: The job management system provides the user with full control over jobs and
data.



3. Data management: During the creation of a job, the user can specify which data sets have to
be imported into or exported from the USpace (set of all files that are available to a UNICORE
job), and also which datasets have to be transferred to a different USpace. UNICORE performs
all data movement at run time, without user intervention.
4. Application support: Since scientists and engineers use specific scientific applications, the
user interface is built in a pluggable manner so that it can be extended with plugins that allow
the user to prepare application-specific input.
5. Flow control: A user job can be described as a set of one or more directed acyclic graphs.
6. Single sign-on: UNICORE provides a single sign-on through X.509V3 certificates.
7. Support for legacy jobs: UNICORE supports traditional batch processing by allowing users
to include their old job scripts as part of a UNICORE job.
8. Resource management: Users select the target system and specify the required resources.
The UNICORE client verifies the correctness of jobs and alerts users to correct errors
immediately.

4.1.2 Globus
The Globus project provides open source software toolkit that can be used to build
computational grids and grid based applications. It allows sharing of computing power,
databases, and other tools securely online across corporate, institutional and geographic
boundaries without sacrificing local autonomy. The core services, interfaces and protocols in the
Globus toolkit allow users to access remote resources seamlessly while simultaneously
preserving local control over who can use resources and when.



Figure 4.3 The Globus Architecture

4.1.3 Legion

Legion is a middleware system that combines very large numbers of independently


administered heterogeneous hosts, storage systems, databases, legacy codes and user objects
distributed over wide-area-networks into a single coherent computing platform. Legion provides
the means to group these scattered components together into a single, object-based
metacomputer that accommodates high degrees of flexibility and site autonomy.

Figure 4.4 Legion Architecture

Legion defines a set of core object types that support basic system services, such as
naming and binding, and object creation, activation, deactivation, and deletion. These objects provide



the mechanisms that help classes to implement policies appropriate for their instances. Legion
also allows users to define and build their own class objects. Some core objects are:
Host objects: represent processors in Legion.
Vault objects: represent persistent storage.
Context objects: map context names to LOIDs (Legion Object Identifiers).
Binding agents: map LOIDs to LOAs (Legion Object Addresses).
Implementation objects: maintained as executable files that a host object can execute when it
receives a request to activate or create an object.
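The two-step name resolution above (context name to LOID, LOID to LOA) can be illustrated with a short sketch. This is a conceptual illustration only, not Legion's actual API; the class names, LOID value, and host address below are invented.

```python
# Conceptual sketch (not Legion's real API): how context objects and
# binding agents cooperate to resolve a name to an object address.
class ContextObject:
    """Maps human-readable context names to LOIDs (Legion Object Identifiers)."""
    def __init__(self):
        self._names = {}

    def register(self, name, loid):
        self._names[name] = loid

    def lookup(self, name):
        return self._names[name]


class BindingAgent:
    """Maps LOIDs to LOAs (Legion Object Addresses), i.e. current locations."""
    def __init__(self):
        self._bindings = {}

    def bind(self, loid, loa):
        self._bindings[loid] = loa

    def resolve(self, loid):
        return self._bindings[loid]


# Resolving a name takes two hops: context name -> LOID -> LOA.
ctx = ContextObject()
agent = BindingAgent()
ctx.register("/home/user/matrix-solver", "1.01.07.x01")
agent.bind("1.01.07.x01", ("host7.example.org", 4001))
loa = agent.resolve(ctx.lookup("/home/user/matrix-solver"))
print(loa)  # ('host7.example.org', 4001)
```

The indirection is the point: an object can migrate (its LOA changes) without its name or LOID changing.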

4.1.4 Gridbus

The Gridbus Project is an open-source, multi-institutional project led by the GRIDS Lab
at the University of Melbourne. It is engaged in the design and development of service-oriented
cluster and grid middleware technologies to support eScience and eBusiness applications. It
extensively leverages related software technologies and provides an abstraction layer to hide
idiosyncrasies of heterogeneous resources and low-level middleware technologies from
application developers.
Gridbus supports commoditization of grid services at various levels:
- Raw resource level (e.g., selling CPU cycles and storage resources)
- Application level (e.g., molecular docking operations for drug design application )
- Aggregated services (e.g., brokering and reselling of services across multiple domains)

Figure 4.5 The Gridbus Architecture



Table 4.1: Comparison of Grid Middleware Systems

4.2 Globus Toolkit (GT4) Architecture


Globus Toolkit has emerged as the de facto standard for several important connectivity,
resource, and collective protocols. The toolkit, having a "middleware plus" capability, addresses
issues of security, information discovery, resource management, data management,
communication, fault detection, and portability.
One well-known example of an implementation of OGSA/OGSI is the Globus Toolkit,
maintained by the Globus Alliance. The Globus Toolkit grid-enables a wide range



of computing environments. It is a software tool kit addressing key technical issues in the
development of grid-enabled environments, services, and applications.
The Globus Toolkit is a community-based, open-architecture, open-source set of
services and software libraries that supports grids and grid applications.
It is viewed as the reference software system for generic grid. With the publication of
OGSI and OGSA, the Globus Toolkit is now moving even more fully toward implementation of
standard grid protocols and APIs.
Globus Architecture
The Figure below illustrates the various aspects of GT4 Architecture:
It depicts three sets of components as follows:
1) A set of service implementations that implement useful infrastructure services. These
services address such concerns as execution management (GRAM), data access and movement
(GridFTP), replica management (RLS, DRS), monitoring and discovery (Index, Trigger),
credential management (MyProxy, Delegation) and instrument management (GTCP).
2) Three containers can be used to host user developed services written in Java, Python
and C, respectively. These containers provide implementation of security, management,
discovery, state management and other mechanisms frequently required when building
services. These containers extend open source service hosting environments with support
for useful Web services (WS) specifications, including the WS-Resource Framework (WSRF),
WS-Notification and WS-Security.
3) A set of class libraries allow client programs in Java, C and Python to invoke operations
on both GT4 and user developed services.

Figure 4.6 GT4 Architecture


4.2.1 Primary GT4 Components



1. Execution and Resource Management- The resource management package enables
resource allocation through job submission, staging of executable files, job monitoring
and result gathering. The components of Globus within this package are:
The Grid Resource Allocation and Management (GRAM) service addresses these issues by
providing a Web service interface for initiating, monitoring and managing the execution
of arbitrary computations on remote computers. This interface allows a client to express
such things as the type and quantity of resources desired, the data to be staged to and from
the execution site, the executable and its arguments, the credentials to be used, and job
persistence requirements.
GRAM is often used as a service deployment and management service. It is first used to
start the service and then to control its resource consumption and provide for restart in
the event of resource or service failure.
Globus Access to Secondary Storage (GASS): GASS is a file-access mechanism that
allows applications to pre-fetch and open remote files and write them back. GASS is
used for staging-in input files and executables for a job and for retrieving output once it is
done.
2. Security: The Grid Security Infrastructure (GSI) provides methods for authentication
of Grid users and secure communication. It is based on SSL (Secure Sockets Layer),
PKI (Public Key Infrastructure) and X.509 Certificate Architecture. The GSI provides
services, protocols and libraries to achieve the following aims for Grid security:
 Single sign-on for using Grid services through user certificates
 Resource authentication through host certificates
 Data encryption
 Authorization
 Delegation of authority and trust through proxies and certificate chain of trust for
Certificate Authorities (CAs)
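The chain-of-trust idea behind single sign-on and delegation can be sketched as follows. This is a toy illustration only: the subject and issuer names are invented, and the simplified validity rule (issuer of each certificate must be the subject of the next) stands in for real X.509 signature verification.

```python
# Illustrative sketch of a GSI-style certificate chain of trust: a proxy is
# signed by the user certificate, which is signed by a CA. Names and the
# "validity" check here are invented; real GSI verifies X.509 signatures.
TRUSTED_CAS = {"ExampleGrid CA"}

def chain_is_trusted(chain):
    """chain = [leaf, ..., last]; each entry is {'subject':..., 'issuer':...}.
    Trusted if each cert's issuer matches the next cert's subject and the
    final cert is issued by a trusted CA."""
    for cert, issuer_cert in zip(chain, chain[1:]):
        if cert["issuer"] != issuer_cert["subject"]:
            return False
    return chain[-1]["issuer"] in TRUSTED_CAS

proxy = {"subject": "CN=Alice/CN=proxy", "issuer": "CN=Alice"}
user = {"subject": "CN=Alice", "issuer": "ExampleGrid CA"}
print(chain_is_trusted([proxy, user]))  # True
```

This is why a proxy certificate enables single sign-on: a service only needs the chain and the CA's public trust anchor, never the user's long-term private key.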
3. Data Management - The data management package provides utilities and libraries for
transmitting, storing and managing the massive data sets that are part and parcel of many
scientific computing applications. The elements of this package are:
GridFTP: It is an extension of the standard FTP protocol that provides secure, efficient
and reliable data movements in grid environments. In addition to standard FTP
functions, GridFTP provides GSI support for authenticated data transfer, third-party
transfer invocation and striped, parallel and partial data transfer support.
4. Replica Location and Management: This component supports multiple locations
for the same file throughout the grid. Using the replica management functions, a file
can be registered with the Replica Location Service (RLS) and its replicas can be



created and deleted. Within RLS, a file is identified by its Logical File Name (LFN)
and is registered within a logical collection.

Figure 4.7 Primary GT4 Components


4.3 Configuration
Several items need to be configured to complete the installation of GT4, including:
A. Request and sign certificates
B. Install MMJFS
C. Change ownership and access permissions
D. Create a database
E. Create two Grid security files
A. Requesting and signing certificates
A certificate is necessary for each user (known as a user certificate) who will use the Grid, as
well as for each machine (known as a host certificate) that is serving in the Grid. The following
procedure explains the necessary steps to create and sign the certificates:
1. Request the user certificate for the end user.
2. Request the host certificate for the machines.
3. Send the certificates to the CA.



4. Have the CA sign the certificates.
5. Retrieve the signed certificates from the CA.
1. Request the user certificate
A user certificate is requested by issuing the grid-cert-request command. This needs to be
done just once, regardless of the number of Grid servers involved in the Grid.
Issue the following command for this purpose:
$ grid-cert-request
2. Request the host certificate
A host certificate is requested by issuing the grid-cert-request command with the –service and
-host parameters. The host certificate is requested in a similar fashion to the way the user
certificate was requested.
As root, issue the following command for this purpose:
grid-cert-request -service host -host x1.itso-xingu.com
3. Send the certificates to the CA
The certificate request is made by copying the unsigned certificate to the /CA/IN directory of the
CA machine.
4. Sign the certificates
Now the CA needs to sign the certificate.
5. Retrieve the signed certificates
The signed certificate is stored by the camgr tool in the /CA/OUT directory of the CA machine.
B. Install MMJFS
After configuring the certificates, you are able to install the Master Managed Job Factory
Service (MMJFS) by issuing the install-gt3-mmjfs command script. MMJFS is the single point
for submitting jobs. It is responsible for exposing the virtual GRAM service to the outside world.

C. Change the ownership and access permission


Upon completion, run the setperms.sh script located in the /usr/local/globus/bin directory to
change the ownership of some Globus files under $GLOBUS_LOCATION/bin directory.
As root, issue the following commands for this purpose:
# cd $GLOBUS_LOCATION/bin
# ./setperms.sh
D. Create the PostgreSQL database
Before creating the database, make sure the postgres daemon is running.
You can use the ps command for verification:
# ps -aux | grep postgres



Example:
# su - postgres
$ createuser globus

E. Create Grid security files


Two security files need to be created from scratch using any text editor. These files essentially
contain a listing of users that are to be given access to the Globus environment and are located
in the /etc/grid-security directory.

grid-mapfile - A flat file that maps Grid users to the distinguished names of their user
certificates. Each line represents an authorized user of the resource.
grim-port-type.xml - An XML document that maps the user ID to the Grid service name.
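A sketch of how grid-mapfile entries might be parsed. The distinguished name and the local account jdoe below are made-up examples; the quoted-DN-then-username line format follows the description above.

```python
# Sketch: parsing grid-mapfile lines of the form
#   "<distinguished name>" <local user>
# The DN and the local account name below are invented examples.
import shlex

def parse_grid_mapfile(text):
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        dn, local_user = shlex.split(line)  # respects the quoted DN
        mapping[dn] = local_user
    return mapping

sample = '"/O=Grid/OU=ExampleGrid/CN=John Doe" jdoe\n'
print(parse_grid_mapfile(sample))
# {'/O=Grid/OU=ExampleGrid/CN=John Doe': 'jdoe'}
```

Each authenticated certificate DN is thereby mapped onto an ordinary local account, which is how Globus enforces local control over who can use a resource.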

4.3.1 GT4 Configuration


GridFTP Configuration
When you install GT4, GridFTP is installed by default and can function without specific
configuration. Run the following command to start the GridFTP server, with -S to make it
run as a daemon and -p 2811 to specify the listening port for client connections:
$GLOBUS_LOCATION/sbin/globus-gridftp-server -S -p 2811
RFT Configuration
RFT is used to perform third-party transfers across GridFTP servers. GridFTP records the
transfer status through RFT so that transfers can be recovered from failures. A database
should be installed on the system, since these records are stored in a database; PostgreSQL
v8.1.4 is used here. PostgreSQL installation and configuration documentation is available at
http://www.postgresql.org/docs/manuals/

1. Run the following command as root to configure the postmaster daemon to accept TCP
connections by adding the -o -i options:
postmaster -o -i -D PostgreSQL_INSTALL_PATH/data
2. Run createdb to create the database used for RFT. Then create the database tables with
the SQL script offered by GT4:
psql -d rftDatabase -f $GLOBUS_LOCATION/share/globus_wsrf_rft/rft_schema.sql
3. Modify the file $GLOBUS_LOCATION/etc/globus_wsrf_rft/jndi-config.xml as the globus
user. Find the item connectionString in the <resource> section and change the values of
driverName, connectionString, userName and password.
GRAM Configuration
WS GRAM is installed as part of a default GT4.0 installation.
WS GRAM executes and manages jobs through the local scheduler.
Configuration
Most users should only have to follow the steps below to finish setting up WS GRAM.
1. sudo configuration: sudo needs to be available on the node running WS GRAM, with
sudoers entries like the ones below:
# Globus GRAM entries
Runas_Alias GLOBUSUSERS = user1, user2
globus ALL=(GLOBUSUSERS) NOPASSWD:
$TG_APPS_PREFIX/globus-wsrf-1.1.1-r4/libexec/globus-gridmap-and-execute -g /etc/grid-
security/grid-mapfile
$TG_APPS_PREFIX/globus-wsrf-1.1.1-r4/libexec/globus-job-manager-script.pl …*

globus ALL=(GLOBUSUSERS) NOPASSWD:


$TG_APPS_PREFIX/globus-wsrf-1.1.1-r4/libexec/globus-gridmap-and-execute -g /etc/grid-
security/grid-mapfile
$TG_APPS_PREFIX/globus-wsrf-1.1.1-r4/libexec/globus-gram-local-proxy-tool… *

2. Make the host credentials accessible by the container


% cd /etc/grid-security
% cp hostcert.pem containercert.pem
% cp hostkey.pem containerkey.pem
% chown globus:globus containercert.pem containerkey.pem

3. Local scheduler adapter configuration


Install PBS adapter
% cd $GLOBUS_LOCATION/gt4.0.0-all-source-installer
% make gt4-gram-pbs
% make install
Configure the remote shell for rsh:



% cd $GLOBUS_LOCATION/setup/globus
% ./setup-globus-job-manager-pbs -remote-shell=rsh

4.4 Usage of Globus


4.4.1 Definition of Job
A job can be defined as a process or set of processes created as an outcome of a job request.
4.4.2 Staging Files
File staging allows executables and data files to be automatically transferred to the
required destination without user intervention. For file staging, specific elements have to be
added to the provided job description XML file during the job submission. Each file transmission
must provide a URL source and a URL destination.
The URLs for remote files are specified as GridFTP URLs, while files local to the service
are specified as file URLs. The file URLs are internally converted to GridFTP URLs by
the service.

Example of staging:
<fileStageIn>
<transfer>
<sourceURL>
gsiftp://host1.examplegrid.org:2811/home…user1/userDataFile
</sourceURL>
<destinationURL>
file://${GLOBUS_USER_HOME}/…transferred_files
</destinationURL>
</transfer>
</fileStageIn>
GLOBUS_USER_HOME refers to the home directory of the user on the remote host.
In the job description file we can specify where the standard output and error files are to be
directed. These files can then be staged out from the remote server. To redirect standard
output and standard error we add the following to the job description file:
<stdout>${GLOBUS_USER_HOME}/test.out</stdout>
<stderr>${GLOBUS_USER_HOME}/test.err</stderr>
After the job completes, these files can be transferred to the submission node by adding the
following to the job description file:
<fileStageOut>
<transfer>



<sourceURL>file:///${GLOBUS_USER_HOME}/test.out</sourceURL>
<destinationURL>
gsiftp://host1.examplegrid.org:2811/home/user1/user</destinationURL>
</transfer>
<transfer>
<sourceURL>file:///${GLOBUS_USER_HOME}/test.err</sourceURL>
<destinationURL>
gsiftp://host1.examplegrid.org:2811/home/user1/user</destinationURL>
</transfer>
</fileStageOut>
The job description file can also be used to automatically clean up the files transmitted to the
remote node. Suppose we want to remove the file userDataFile from the remote host; we add
the following lines to the job description file:
<fileCleanUp>
<deletion>
<file>file:///${GLOBUS_USER_HOME}/transferred_files/userDataFile</file>
</deletion>
</fileCleanUp>
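The staging elements shown above can also be assembled programmatically. The following sketch builds an equivalent description with Python's xml.etree; the URLs are placeholders and the enclosing <job> element is assumed for illustration, since the real enclosing document is the full GRAM job description.

```python
# Sketch: assembling fileStageIn/fileStageOut elements like the examples
# above. URLs are placeholders; the <job> wrapper is assumed.
import xml.etree.ElementTree as ET

def add_transfer(parent, source, dest):
    """Append one <transfer> with its source and destination URLs."""
    t = ET.SubElement(parent, "transfer")
    ET.SubElement(t, "sourceURL").text = source
    ET.SubElement(t, "destinationURL").text = dest

job = ET.Element("job")
stage_in = ET.SubElement(job, "fileStageIn")
add_transfer(stage_in,
             "gsiftp://host1.examplegrid.org:2811/home/user1/userDataFile",
             "file://${GLOBUS_USER_HOME}/transferred_files")
stage_out = ET.SubElement(job, "fileStageOut")
add_transfer(stage_out,
             "file:///${GLOBUS_USER_HOME}/test.out",
             "gsiftp://host1.examplegrid.org:2811/home/user1/user")
print(ET.tostring(job, encoding="unicode"))
```

Generating the XML this way guarantees well-formed elements, which matters because GRAM rejects malformed job descriptions at submission time.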

4.4.3 Job Submission


4.4.3.1 Data transfer
Globus provides two ways for reliable and secure file transfer during remote job execution:
files can be transferred using the GASS protocol or the GridFTP protocol. When using the
GASS server we have the option to choose between a secure or insecure communication
channel. For secure connections, GSI is used for authentication, with proxy credentials
generated by the command grid-proxy-init. The GASS protocol should be used when we need
to exchange small files or support real-time system information exchange.
A GASS server can be started using the following command:
globus-gass-server
For large files, we can use the GridFTP protocol. GridFTP provides a command,
globus-url-copy, to transfer files between two nodes. To start the GridFTP server, use the
following command:
globus-gridftp-server -p 2811 -S
This starts the server on the default port 2811. The -S option starts the server in the
background.
To put a file on the GridFTP server, we issue the following command:



globus-url-copy -vb file:///home/user1/temp
gsiftp://host1.examplegrid.org/usr/local/temp

4.4.3.2 Job Submission


The Globus Resource Allocation Manager (GRAM) provides a set of client tools for submitting
jobs remotely.
The command globus-job-run provides a job submission interface and allows staging of data
and executable files using the GASS server. The command can be executed as
globus-job-run machine_name:port/jobmanager-name command
The default port number is 2119 and the default jobmanager-name is jobmanager. For example:
globus-job-run localhost:2119/jobmanager /bin/date
Using the -s flag with the command gives access to the staging functionality by starting a GASS
server on the localhost.
globus-job-run localhost -s exampleProg
This will start a GASS server on your machine. The jobmanager will contact the GASS server to
read the staged file and finally submit the job to the scheduler.
The globusrun command gives access to the complete features of the Resource Specification
Language (RSL). You can provide an RSL file by specifying the -f option to the globusrun
command:
globusrun -r host1.examplegrid.org/jobmanager-pbs -f example.rsl
The example.rsl file contains the complete job description, including the executable file, input
and output files, and the details about file staging.
The GRAM component of the Globus Toolkit supports two approaches:
1. A Web Service GRAM approach
2. Pre-web service GRAM approach
4.4.3.3 Job Monitoring
Finding Status of Submitted Jobs
The globus-job-status command displays the status of a job submitted by the command
globus-job-submit. globus-job-submit returns a 'contact string' which can be used to query the
status of the submitted job. The contact string returned is of the form
http://host2.examplegrid.org:5678/12340/1176891.
The states of a submitted job can be one of the following:
(i) Unsubmitted
(ii) StageIn
(iii) Pending
(iv) Active
(v) Suspended
(vi) StageOut
(vii) Done or Failed
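One way to read this lifecycle is as a transition table. The sketch below encodes plausible transitions inferred from the ordering of the states above; the exact transition graph is an assumption for illustration, not taken from the GRAM specification.

```python
# Illustrative job-lifecycle transition table; the allowed transitions are
# inferred from the state ordering in the text, not from the GRAM spec.
TRANSITIONS = {
    "Unsubmitted": {"StageIn"},
    "StageIn": {"Pending"},
    "Pending": {"Active"},
    "Active": {"Suspended", "StageOut", "Failed"},
    "Suspended": {"Active"},
    "StageOut": {"Done", "Failed"},
}

def is_legal_progression(path):
    """Check that each consecutive pair of states is an allowed transition."""
    return all(b in TRANSITIONS.get(a, set()) for a, b in zip(path, path[1:]))

print(is_legal_progression(
    ["Unsubmitted", "StageIn", "Pending", "Active", "StageOut", "Done"]))  # True
print(is_legal_progression(["Pending", "Done"]))  # False
```

A monitoring client polling globus-job-status effectively observes one walk through such a graph, ending in Done or Failed.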

Collecting Output and Cleaning Files


To get the output of the job, run the following command:
globus-job-get-output 'contact string'
The globus-job-clean command cleans up the cached copy of the output generated by the job.

4.5 Main components and Programming model


4.5.1 Security (GSI)
Grid security is used to establish authentication for users and services. It provides secure
communication, authorization of permitted actions for grid users, and management of users'
credentials and membership information. GT4's security mechanism is based on X.509 end-
entity certificates and proxy certificates.
The Grid Security Infrastructure (GSI) is the security component of the Globus Toolkit. It
provides authentication and message protection for the communication between users and grid
resources or services. The motivations of GSI are:
 Establishing secure communication between grid elements.
 Supporting security across organizational boundaries without using a centralized
security system.
 Supporting single sign-on for grid users.
GSI has the following functions:
1) Mutual Authentication
2) Confidential Communication
3) Securing Private keys
4) Delegation, Single Sign-On and Proxy Certificates

4.5.2 Data management


Globus Toolkit 4 provides various tools that enable data management in a grid environment.
Data Movement Components:
4.5.2.1 GridFTP
The GridFTP facility provides secure and reliable data transfer between grid hosts. Its protocol
extends the well-known FTP standard to provide additional features, including support for
authentication through GSI. One of the major features of GridFTP is that it enables third-party



transfer. Third-party transfer is suitable for an environment where there is a large file in remote
storage and the client wants to copy it to another remote server, as illustrated in the figure below.

Figure 4.8 GridFTP third-party transfer


4.5.2.2 Reliable File Transfer (RFT)
Reliable File Transfer provides a Web service interface for transfer and deletion of files. RFT
receives requests via SOAP messages over HTTP and utilizes GridFTP. RFT also uses a
database to store the list of file transfers and their states, and is capable of recovering a transfer
request that was interrupted.
The figure below shows how RFT and GridFTP work.

Figure 4.9 How RFT and GridFTP work
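The recovery idea behind RFT can be sketched in a few lines. This is a toy illustration, not RFT's implementation: a dict stands in for the PostgreSQL state database, and a raised ConnectionError simulates a dropped link.

```python
# Sketch of RFT-style recovery: transfer state is checkpointed after every
# chunk, so an interrupted request resumes instead of restarting. The dict
# `state` stands in for the PostgreSQL database RFT really uses.
def transfer(data, state, chunk=4, fail_after=None):
    """Copy `data` into state['received'], checkpointing progress; optionally
    simulate a crash after `fail_after` chunks."""
    sent = 0
    while state["offset"] < len(data):
        if fail_after is not None and sent == fail_after:
            raise ConnectionError("link dropped")
        state["received"] += data[state["offset"]:state["offset"] + chunk]
        state["offset"] += chunk  # checkpoint persisted to the "database"
        sent += 1
    return state["received"]

db = {"offset": 0, "received": ""}
try:
    transfer("abcdefghij", db, fail_after=1)  # fails part-way through
except ConnectionError:
    pass
print(transfer("abcdefghij", db))  # resumes from the checkpoint
# abcdefghij
```

Because the offset survives the failure, the second call only moves the remaining bytes, which is exactly the property RFT's database-backed transfer records provide.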



4.5.3 Data Replica Components:
4.5.3.1 Replica Location Service (RLS)
The Replica Location Service maintains and provides access to information about the physical
locations of replicated data. This component can map multiple physical replicas to one single
logical file, and enables data redundancy in a grid environment.

4.5.3.2 Data Replication Service (DRS)


Data Replication Service provides a system for making replicas of files in the grid environment,
and registering them to RLS. DRS uses RFT and GridFTP to transfer the files, and it uses RLS
to locate and register the replicas. Currently, DRS is a technical preview component.

4.5.4 Execution management


Globus Toolkit 4 provides various tools that enable execution management in a grid
environment.

4.5.4.1 WS GRAM
WS GRAM is the Grid service that provides the remote execution and status
management of jobs. When a job is submitted by a client, the request is sent to the remote host
as a SOAP message, and handled by WS GRAM service located in the remote host. The WS
GRAM service can collaborate with the RFT service for staging files required by jobs. In order to
enable staging with RFT, valid credentials should be delegated to the RFT service by the
Delegation service.
4.5.4.2 Globus Teleoperations Control Protocol (GTCP)
Globus Teleoperations Control Protocol is the WSRF version of the NEESgrid Teleoperations
Control Protocol (NTCP). Currently, GTCP is a technical preview component.

4.5.5 Monitoring and Discovery Services


The Monitoring and Discovery Services (MDS) are mainly concerned with the collection,
distribution, indexing, archival, and other processing of information about the state of various
resources, services, and system configurations. The information collected is used either to
discover new services or resources, or to enable monitoring of system status.
4.5.5.1 Information Service : Index service and Trigger service



Index service
The Index service is the central component of the GT4 MDS implementation. Every
instance of a GT4 container has a default indexing service (DefaultIndexService) exposed as a
WSRF service. The Index service interacts with data sources via standard WS-RF resource
property and subscription/notification interfaces (WS-ResourceProperties and WS-
BaseNotification). The contents of the Index service can be queried via XPath queries.
The following are some of the key features of an Index service:
- Index services can be configured in hierarchies, but there is no single global index that
provides information about every resource on the Grid.
- The presence of a resource in an Index service makes no guarantee about the
availability of the resource for users of that Index.
- Information published with MDS is recent but not the absolute latest.
- Each registration into an Index service has a lifetime and requires periodic renewal of
registrations to indicate the continued existence of a resource or a service.
Trigger service
The MDS Trigger service collects information and compares that data against a set of
conditions defined in a configuration file. When a condition is met, an action is executed. The
condition is specified as an XPath expression that, for example, may compare the value of a
property to a threshold and send an alert e-mail to an administrator by executing a script.
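The threshold-and-action idea can be sketched as follows. This is an illustration only, not the Trigger service's real configuration format: the element name diskUsagePercent, the threshold, and the alert text are all invented.

```python
# Illustration of a Trigger-style check: evaluate a simple path expression
# against resource-property XML and fire an action when the condition holds.
# Element names and the alert text are invented for this sketch.
import xml.etree.ElementTree as ET

def check_trigger(xml_text, path, threshold, action):
    """Run `action` on the value at `path` if it exceeds `threshold`."""
    value = float(ET.fromstring(xml_text).findtext(path))
    if value > threshold:
        action(value)
        return True
    return False

alerts = []
props = "<host><diskUsagePercent>93</diskUsagePercent></host>"
check_trigger(props, "diskUsagePercent", 90,
              lambda v: alerts.append(f"disk usage at {v}%"))
print(alerts)
```

In the real Trigger service the condition would be an XPath expression over resource properties and the action would typically execute an administrator-supplied script, such as one that sends an e-mail.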

4.5.5.2 Aggregator Framework


The MDS-Index service and the MDS-Trigger service are specializations of a general
Aggregator Framework. The Aggregator Framework is a software framework for building
software services that collect and aggregate data. These services are also known as aggregator
services.
An aggregator service collects information from one of three types of aggregator source: a
query source that uses WS-ResourceProperties mechanisms to collect data, a subscription
source that uses a WS-Notification subscription/notification mechanism to collect data, or an
execution source that executes an administrator-provided application to collect information in
XML format.

4.5.6 Server Programming Model


The core Grid service server programming model is depicted in Figure below.



Figure 4.10 Server Programming Model
Grid Service Base
A GridServiceBase object is the base of all Grid services and implements the standard
OGSI GridService PortType. It also provides APIs to modify instance specific properties, as well
as APIs for querying and modifying service data. The base functionality can be seen as the
functionality known at development time.
Operation Providers
A service can be created by simply extending GridServiceImpl or PersistentGridServiceImpl,
but this is not recommended because of its limited flexibility; implementing the service logic
as operation providers is the preferred, more composable alternative.
Grid Service Callback

The GridServiceCallback interface defines a number of lifecycle management callbacks


that you can optionally implement to manage the state of your service.
Factory Callback

A factory callback can be implemented to provide custom factories for your services. It can, for
instance, be used to create services in remote hosting environments. Most implementations are,



however, likely to use the dynamic factory callback implementation we provide, which allows
you to, through configuration, specify the implementation class that the factory should create.

4.5.7 Client Programming Model


The Grid Service Client programming model is depicted in Figure below.

Figure 4.11 Client Programming Model

A Grid service client can be written directly on top of the JAX-RPC client APIs. The handle is
passed to a ServiceLocator, which constructs a proxy, or stub, responsible for making the call
using the network binding format defined in the WSDL for the service. The proxy is exposed
through a standard JAX-RPC generated PortType interface.

4.6 Introduction to Hadoop Framework


Introducing the MapReduce Model
Hadoop supports the MapReduce model, which was introduced by Google as a method of
solving a class of petascale problems with large clusters of inexpensive machines. The model is
based on two distinct steps for an application:
• Map: An initial ingestion and transformation step, in which individual input records can be
processed in parallel.
• Reduce: An aggregation or summarization step, in which all associated records must be
processed together by a single entity.
The core concept of MapReduce in Hadoop is that input may be split into logical chunks,
and each chunk may be initially processed independently, by a map task. The results of these
individual processing chunks can be physically partitioned into distinct sets, which are then



sorted. Each sorted chunk is passed to a reduce task. Figure below illustrates how the
MapReduce model works.

Figure 4.12 The MapReduce model


MapReduce is a programming model for data processing. Hadoop can run MapReduce
programs written in various languages. MapReduce programs are inherently parallel, thus
putting very large-scale data analysis into the hands of anyone with enough machines at their
disposal. MapReduce comes into its own for large datasets.
A map task may run on any compute node in the cluster, and multiple map tasks may be
running in parallel across the cluster. The map task is responsible for transforming the input
records into key/value pairs. The output of all of the maps will be partitioned, and each partition
will be sorted. There will be one partition for each reduce task. Each partition's sorted keys and
the values associated with the keys are then processed by the reduce task. There may be
multiple reduce tasks running in parallel on the cluster.
Hadoop Core MapReduce
The Hadoop MapReduce environment, built on top of the Hadoop Distributed File System
(HDFS), provides the user with a sophisticated framework to manage the execution of map and
reduce tasks across a cluster of machines. The user is required to tell the framework the
following:
• The location(s) in the distributed file system of the job input
• The location(s) in the distributed file system for the job output
• The input format
• The output format
• The class containing the map function



• Optionally, the class containing the reduce function
• The JAR file(s) containing the map and reduce functions and any support classes
MapReduce is oriented around key/value pairs. The framework will convert each record
of input into a key/value pair, and each pair will be input to the map function once. The map
output is a set of key/value pairs—nominally one pair that is the transformed input pair, but it is
perfectly acceptable to output multiple pairs. The map output pairs are grouped and sorted by
key. The reduce function is called one time for each key, in sort sequence, with the key and the
set of values that share that key.
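The flow just described — map each record to key/value pairs, group and sort the pairs by key, then call reduce once per key — can be sketched in plain Java. The class below is an illustrative stand-in for the framework (not the Hadoop API), using word count as the classic example:

```java
import java.util.*;

// Illustrative stand-in for the MapReduce flow described above, not the
// Hadoop framework itself.
class MiniMapReduce {
    static Map<String, Integer> wordCount(List<String> records) {
        // Map phase: one record in, one (word, 1) pair out per word.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String record : records) {
            for (String word : record.split("\\s+")) {
                mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        // Shuffle/sort phase: group the values by key, keys in sorted order.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce phase: called once per key, in sort sequence, with all
        // values that share that key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) sum += v;
            result.put(entry.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In real Hadoop the three phases run on different machines, and the "grouped" structure is materialized by the shuffle; the logical contract, however, is the same.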
The framework provides two processes that handle the management of MapReduce jobs:
• TaskTracker manages the execution of individual map and reduce tasks on a compute node in
the cluster.
• JobTracker accepts job submissions, provides job monitoring and control, and manages the
distribution of tasks to the TaskTracker nodes.
4.6.1 The Parts of a Hadoop MapReduce Job
The user configures and submits a MapReduce job (or just job for short) to the framework,
which will decompose the job into a set of map tasks, shuffles, a sort, and a set of reduce
tasks.The framework will then manage the distribution and execution of the tasks, collect the
output, and report the status to the user.
The job consists of the parts shown in Figure 4.13 and listed in Table 4.2.
Table 4.2. Parts of a MapReduce Job

Part                                                              Handled By
Configuration of the job                                          User
Input splitting and distribution                                  Hadoop framework
Start of the individual map tasks with their input split          Hadoop framework
Map function, called once for each input key/value pair           User
Shuffle, which partitions and sorts the per-map output            Hadoop framework
Sort, which merge sorts the shuffle output for each partition
  of all map outputs                                              Hadoop framework
Start of the individual reduce tasks, with their input partition  Hadoop framework
Reduce function, which is called once for each unique input
  key, with all of the input values that share that key           User
Collection of the output and storage in the configured job
  output directory, in N parts, where N is the number of
  reduce tasks                                                    Hadoop framework

Figure 4.13 Parts of a MapReduce job

The user is responsible for handling the job setup, specifying the input location(s),
and ensuring the input is in the expected format and location. The
framework is responsible for distributing the job among the TaskTracker nodes of the cluster;
running the map, shuffle, sort, and reduce phases; placing the output in the output directory;
and informing the user of the job-completion status.
The job created by the code in MapReduceIntro.java will read all of its textual input line
by line, and sort the lines based on that portion of the line before the first tab character. If there
are no tab characters in the line, the sort will be based on the entire line. The
MapReduceIntro.java file is structured to provide a simple example of configuring and running



a MapReduce job.

Source Code. MapReduceIntro.java


package com.apress.hadoopbook.examples.ch2;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.log4j.Logger;
/** A very simple MapReduce example that reads textual input where
* each record is a single line, and sorts all of the input lines into
* a single output file.
* The records are parsed into Key and Value using the first TAB
* character as a separator. If there is no TAB character, the entire
* line is the Key.
*/
public class MapReduceIntro
{
protected static Logger logger = Logger.getLogger(MapReduceIntro.class);
public static void main(final String[] args) {
try
{
/** Construct the JobConf object that will be used to submit this job to the
Hadoop framework. Ensure that the jar or directory that contains
MapReduceIntroConfig.class is made available to all of the TaskTracker nodes
that will run maps or reduces for this job.
*/
final JobConf conf = new JobConf(MapReduceIntro.class);
/**
* Take care of some housekeeping to ensure that this simple example
* job will run.
*/
MapReduceIntroConfig.
exampleHouseKeeping(conf,
MapReduceIntroConfig.getInputDirectory(),
MapReduceIntroConfig.getOutputDirectory());
/** This section is the actual job configuration portion. */
/**
* Configure the inputDirectory and the type of input. In this case we are
* stating that the input is text, each record is a single line, and the first
* TAB is the separator between the key and the value of the record.
*/
conf.setInputFormat(KeyValueTextInputFormat.class);
FileInputFormat.setInputPaths(conf,
MapReduceIntroConfig.getInputDirectory());

/** Inform the framework that the mapper class will be the {@link
IdentityMapper}. This class simply passes the input Key Value pairs directly to its
output, which in our case will be the shuffle.
*/
conf.setMapperClass(IdentityMapper.class);
FileOutputFormat.setOutputPath(conf, MapReduceIntroConfig.getOutputDirectory());
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setNumReduceTasks(1);
conf.setReducerClass(IdentityReducer.class);
logger.info("Launching the job.");
/** Send the job configuration to the framework and request that the job be run.
*/
final RunningJob job = JobClient.runJob(conf);
logger.info("The job has completed.");
if (!job.isSuccessful())
{
logger.error("The job failed.");



System.exit(1);
}
logger.info("The job completed successfully.");
System.exit(0);
}
catch (final IOException e)
{
logger.error("The job has failed due to an IO Exception", e);
e.printStackTrace();
}
}
}

A Simple Map Function: IdentityMapper


The Hadoop framework provides a very simple map function, called IdentityMapper. It is used in
jobs that only need to reduce the input, and not transform the raw input.

Program 2-2. IdentityMapper.java


package org.apache.hadoop.mapred.lib;
import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.MapReduceBase;
/** Implements the identity function, mapping inputs directly to outputs. */
public class IdentityMapper<K, V>
extends MapReduceBase implements Mapper<K, V, K, V>
{
/** The identity function. Input key/value pair is written directly to
* output. */
public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter)
throws IOException
{
output.collect(key, val);
}



}

The magic piece of code is the line output.collect(key, val), which passes a key/value pair back
to the framework for further processing.

Common Mappers
One common mapper drops the values and passes only the keys forward:
public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter)
throws IOException
{
output.collect(key, null); /** Note, no value, just a null */
}

Another common mapper converts the key to lowercase:


/** Put the keys in lower case. */
public void map(Text key, V val, OutputCollector<Text, V> output, Reporter reporter)
throws IOException
{
Text lowerCaseKey = new Text(key.toString().toLowerCase());
output.collect(lowerCaseKey, val);
}

A Simple Reduce Function: IdentityReducer


Program 2-3. IdentityReducer.java
package org.apache.hadoop.mapred.lib;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.MapReduceBase;
/** Performs no reduction, writing all input values directly to the output. */
public class IdentityReducer<K, V> extends MapReduceBase implements
Reducer<K, V, K, V>
{



/** Writes all keys and values directly to output. */
public void reduce(K key, Iterator<V> values,
OutputCollector<K, V> output, Reporter reporter) throws IOException
{
while (values.hasNext())
{
output.collect(key, values.next());
}
}
}

Common Reducers
A common reducer drops the values and passes only the keys forward:
public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output, Reporter reporter)
throws IOException
{
output.collect(key, null);
}
Another common reducer provides count information for each key:
protected Text count = new Text();
/** Writes all keys and values directly to output. */
public void reduce(K key, Iterator<V> values, OutputCollector<K, V> output, Reporter reporter)
throws IOException
{
int i = 0;
while (values.hasNext())
{
values.next();
i++;
}
count.set( "" + i );
output.collect(key, count);
}

4.7 Input Splitting



For the framework to be able to distribute pieces of the job to multiple machines, it needs
to fragment the input into individual pieces, which can in turn be provided as input to the
individual distributed tasks. Each fragment of input is called an input split.
An input split will normally be a contiguous group of records from a single input file, and
in this case, there will be at least N input splits, where N is the number of input files. If the
number of requested map tasks is larger than this number, or the individual files are larger than
the suggested fragment size, there may be multiple input splits constructed from each input file.

4.7.1 MapReduce Inputs And Splitting


Input split: It is the part of the input processed by a single map; each split is processed by a
single map task. In other words, an InputSplit represents the data to be processed by an
individual Mapper. Each split is divided into records, and the map processes each record, which
is a key-value pair. Informally, a split is a group of rows, and a record is one row within that group.

The length of an InputSplit is measured in bytes. Every InputSplit has storage locations, which
the MapReduce system uses to place map tasks as close to the split's data as possible. Tasks
are processed in order of split size, largest first, in order to minimize the overall job runtime.
One important thing to remember is that an InputSplit does not contain the input data itself, but
only a reference to the data.

public abstract class InputSplit


{
public abstract long getLength() throws IOException, InterruptedException;
public abstract String[] getLocations() throws IOException,InterruptedException;
}
As a user, we don't have to use InputSplits directly; the InputFormat does that job. An
InputFormat is a class that provides the following functionality:
 Selects the files or other objects that should be used for input.
 Defines the InputSplits that break a file into tasks.
 Provides a factory for RecordReader objects that read the file.



The overall process can be explained in following points:
 The client which runs the job calculates the splits for the job by calling getSplits().
 Client then sends the splits to the jobtracker, which uses their storage locations to
schedule map tasks that will process them on the tasktrackers.
 On a tasktracker, the map task passes the split to the createRecordReader() method on
InputFormat to obtain a RecordReader for that split.
 Map task uses the RecordReader to generate record key-value pairs, which it passes to
the map function, as can be seen in the Mapper's run() method.
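For reference, the run() method of the new-API org.apache.hadoop.mapreduce.Mapper is essentially the following loop (paraphrased from the Hadoop source; not runnable on its own, since Context is supplied by the framework):

```java
// Paraphrase of Mapper.run(): the framework calls setup() once, then map()
// once for every key-value pair the RecordReader delivers via the context,
// then cleanup().
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
```

This makes the contract between the RecordReader and the map function explicit: one call to map() per record.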

How to prevent splitting?


Some applications don't want files to be split, so that a single mapper can process
each input file in its entirety. For example, a simple way to check if all the records in a file are
sorted is to go through the records in order, checking whether each record is not less than the
preceding one. Implemented as a map task, this algorithm will work only if one map processes
the whole file. There are a couple of ways to ensure that an existing file is not split. The first
(quick and dirty) way is to increase the minimum split size to be larger than the largest file in
your system. Setting it to its maximum value, Long.MAX_VALUE, has this effect. The second is
to subclass the concrete subclass of FileInputFormat that you want to use, to override the
isSplitable() method to return false.
For example, here's a nonsplittable TextInputFormat:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;



import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
public class NonSplittableTextInputFormat extends TextInputFormat {
@Override
protected boolean isSplitable(JobContext context, Path file)
{
return false;
}
}

4.8 Map Reduce Input and Output Formats


4.8.1 Input Formats
• Hadoop can process many different types of data formats, from flat text files to
databases.
– Input Splits and Records
– Text Input
– Binary Input
– Multiple Inputs
– Database Input (and Output)
1) Input Splits and Records
• An input split is a chunk of the input that is processed by a single map.
• Each map processes a single split.
• Each split is divided into records, and the map processes each record—a key-value pair-
in turn.
• Splits and records are logical: there is nothing that requires them to be tied to files
In a database context, a split might correspond to a range of rows from a table and a record
to a row in that range.
Input splits are represented by the Java class, InputSplit
public abstract class InputSplit
{
public abstract long getLength() throws IOException, InterruptedException;
public abstract String[] getLocations() throws IOException,
InterruptedException;
}
An InputFormat is responsible for creating the input splits and dividing them into
records.



public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context) throws IOException,
InterruptedException;
public abstract RecordReader<K, V>
createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException,
InterruptedException;
}
FileInputFormat
• FileInputFormat is the base class for all implementations of InputFormat that use files as
their data source.
• It provides two things: a place to define which files are included as the input to a job,
and an implementation for generating splits for the input files.
FileInputFormat input paths
• The input to a job is specified as a collection of paths, which offers great flexibility in
constraining the input to a job. FileInputFormat offers four static convenience methods
for setting a Job's input paths:
public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)
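The difference between the add and set variants is cumulative versus replacing behavior: addInputPath appends a path to the job's existing input list, while setInputPaths replaces the whole list. A toy model (the class and method names mirror the real ones, but this is not the Hadoop implementation):

```java
import java.util.*;

// Toy model of FileInputFormat's add/set input-path semantics; hypothetical
// stand-in, not the Hadoop class.
class InputPathsModel {
    private final List<String> paths = new ArrayList<>();

    // addInputPath: appends to the current list of input paths.
    void addInputPath(String path) { paths.add(path); }

    // setInputPaths: replaces the list in one call.
    void setInputPaths(String... inputPaths) {
        paths.clear();
        paths.addAll(Arrays.asList(inputPaths));
    }

    List<String> getPaths() { return paths; }
}
```

A driver typically calls setInputPaths once, or addInputPath repeatedly when building the input set incrementally.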

InputFormat class hierarchy



Figure 4.14 InputFormat class hierarchy

• The split size is calculated by the formula


max(minimumSize, min(maximumSize, blockSize))
by default:
minimumSize < blockSize < maximumSize
• so the split size is blockSize.
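The split-size formula can be checked directly. The sketch below applies the rule quoted above with hypothetical sizes (128 MB block, default minimum of 1 and maximum of Long.MAX_VALUE):

```java
// Sketch of FileInputFormat's split-size rule:
//   max(minimumSize, min(maximumSize, blockSize))
class SplitSizeDemo {
    static long splitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024; // a 128 MB HDFS block (assumed size)
        // With the defaults (minimumSize = 1, maximumSize = Long.MAX_VALUE),
        // the formula collapses to the block size.
        System.out.println(splitSize(1, Long.MAX_VALUE, block) == block); // prints true
        // Raising the minimum above the block size forces larger splits.
        System.out.println(splitSize(256L * 1024 * 1024, Long.MAX_VALUE, block)); // prints 268435456
    }
}
```

This also shows why setting the minimum split size to Long.MAX_VALUE prevents splitting entirely: the max() term then dominates any file length.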
Preventing splitting
• Some applications don't want files to be split, so that a single mapper can process each
input file in its entirety.
• For example, a simple way to check if all the records in a file are sorted is to go through
the records in order, checking whether each record is not less than the preceding one.
• There are a couple of ways to ensure that an existing file is not split.
1) The first (quick and dirty) way is to increase the minimum split size to be larger than the
largest file in your system.
• The second is to subclass the concrete subclass of FileInputFormat that you want to
use, overriding the isSplitable() method to return false.
Example:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
public class NonSplittableTextInputFormat extends TextInputFormat
{
@Override



protected boolean isSplitable(JobContext context, Path file)
{
return false;
}
}
Processing a whole file as a record
An InputFormat for reading a whole file as a record.
public class WholeFileInputFormat
extends FileInputFormat<NullWritable, BytesWritable>
{
@Override
protected boolean isSplitable(JobContext context, Path file)
{
return false;
}
@Override
public RecordReader<NullWritable, BytesWritable> createRecordReader(
InputSplit split, TaskAttemptContext context) throws IOException,
InterruptedException
{
WholeFileRecordReader reader = new WholeFileRecordReader();
reader.initialize(split, context);
return reader;
}
}
The RecordReader used by WholeFileInputFormat for reading a whole file as a record
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable>
{
private FileSplit fileSplit;
private Configuration conf;
private BytesWritable value = new BytesWritable();
private boolean processed = false;
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException
{



this.fileSplit = (FileSplit) split;
this.conf = context.getConfiguration();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException
{
if (!processed) {
byte[] contents = new byte[(int) fileSplit.getLength()];
Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try
{
in = fs.open(file);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
}
finally
{
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}

@Override
public NullWritable getCurrentKey() throws IOException, InterruptedException
{
return NullWritable.get();
}
@Override
public BytesWritable getCurrentValue() throws IOException,
InterruptedException {



return value;
}
@Override
public float getProgress() throws IOException
{
return processed ? 1.0f : 0.0f;
}
@Override
public void close() throws IOException
{
// do nothing
}
}

• WholeFileRecordReader is responsible for taking a FileSplit and converting it into a


single record, with a null key and a value containing the bytes of the file.

2) Text Input
• Hadoop excels at processing unstructured text.
TextInputFormat:
 The key is the byte offset of the line within the file, and the value is the line itself.
• TextInputFormat is the default InputFormat. Each record is a line of input.
• The key, a LongWritable, is the byte offset within the file of the beginning of the line.
• The value is the contents of the line, excluding any line terminators (newline, carriage
return), and is packaged as a Text object.
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
• This sample is divided into one split of four records. The records are interpreted as the
following key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
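The offset keys 0, 33, 57, and 89 arise because each key is the cumulative byte length of the preceding lines, counting one byte per character plus one for the newline (assuming single-byte characters and '\n' line endings). A short sketch verifies this:

```java
// Sketch of how TextInputFormat's byte-offset keys are derived.
class OffsetDemo {
    static long[] offsets(String[] lines) {
        long[] keys = new long[lines.length];
        long position = 0;
        for (int i = 0; i < lines.length; i++) {
            keys[i] = position;                 // key = offset of the line's first byte
            position += lines[i].length() + 1;  // +1 for the trailing newline
        }
        return keys;
    }

    public static void main(String[] args) {
        String[] poem = {
            "On the top of the Crumpetty Tree",
            "The Quangle Wangle sat,",
            "But his face you could not see,",
            "On account of his Beaver Hat."
        };
        System.out.println(java.util.Arrays.toString(offsets(poem))); // prints [0, 33, 57, 89]
    }
}
```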



Logical records and HDFS blocks for TextInputFormat
KeyValueTextInputFormat
• TextInputFormat's keys, being simply the offset within the file, are not normally very
useful.
• It is common for each line in a file to be a key-value pair, separated by a delimiter such
as a tab character.

• Consider the following input file, where → represents a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
• Like in the TextInputFormat case, the input is in a single split comprising four records,
although this time the keys are the Text sequences before the tab in each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
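The parsing rule behind these pairs is easy to state: the key is the text before the first tab, the value is everything after it, and a line with no tab becomes a key with an empty value. A minimal sketch (not the actual KeyValueTextInputFormat implementation):

```java
// Sketch of KeyValueTextInputFormat-style line parsing: split at the first TAB.
class KeyValueSplit {
    static String[] parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) return new String[] { line, "" }; // no TAB: whole line is the key
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("line1\tOn the top of the Crumpetty Tree");
        System.out.println(kv[0] + " -> " + kv[1]); // prints line1 -> On the top of the Crumpetty Tree
    }
}
```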

NLineInputFormat
 Similar to KeyValueTextInputFormat, but the splits are based on N lines of input
rather than Y bytes of input.
• With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable
number of lines of input.
• The number depends on the size of the split and the length of the lines.
• If you want your mappers to receive a fixed number of lines of input, then
NLineInputFormat is the InputFormat to use.
• N refers to the number of lines of input that each mapper receives.



• With N set to one (the default), each mapper receives exactly one line of input.

On the top of the Crumpetty Tree


The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
If, for example, N is two, then each split contains two lines. One mapper will receive
the first two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
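The grouping rule — every N consecutive lines form one split — can be illustrated with a toy method (an illustration of the rule only, not the Hadoop NLineInputFormat implementation):

```java
import java.util.*;

// Toy illustration of NLineInputFormat's rule: each split holds N lines.
class NLineSplits {
    static List<List<String>> splits(List<String> lines, int n) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            out.add(lines.subList(i, Math.min(i + n, lines.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> poem = List.of(
            "On the top of the Crumpetty Tree",
            "The Quangle Wangle sat,",
            "But his face you could not see,",
            "On account of his Beaver Hat.");
        System.out.println(splits(poem, 2).size()); // prints 2
    }
}
```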

XML
• Most XML parsers operate on whole XML documents, so if a large XML document is
made up of multiple input splits, then it is a challenge to parse these individually.
• Large XML documents that are composed of a series of "records" can be broken into
these records using simple string or regular-expression matching to find the start and end
tags of records.
3) Binary Input
• Hadoop MapReduce is not just restricted to processing textual data—it has support for
binary formats, too.

SequenceFileInputFormat
 The input file is a Hadoop sequence file, containing serialized key/value pairs.
• Hadoop's sequence file format stores sequences of binary key-value pairs. Sequence
files are well suited as a format for MapReduce data since they are splittable, they
support compression as a part of the format, and they can store arbitrary types using a
variety of serialization frameworks.
• To use data from sequence files as the input to MapReduce, you use
SequenceFileInputFormat.
• The keys and values are determined by the sequence file, and you need to make sure
that your map input types correspond.
SequenceFileAsTextInputFormat



• SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts
the sequence file's keys and values to Text objects.
• The conversion is performed by calling toString() on the keys and values.
• This format makes sequence files suitable input for Streaming.
SequenceFileAsBinaryInputFormat
• SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that
retrieves the sequence file's keys and values as opaque binary objects.
• They are encapsulated as BytesWritable objects, and the application is free to interpret
the underlying byte array as it pleases.
4) Multiple Inputs
MultiFileInputFormat is an abstract class that lets the user implement an input format
that aggregates multiple files into one split.

• Although the input to a MapReduce job may consist of multiple input files, all of the input
is interpreted by a single InputFormat and a single Mapper.
• As the data format evolves, you have to write your mapper to cope with all of your
legacy formats.
• Or, you have data sources that provide the same type of data but in different formats.
• These cases are handled elegantly by using the MultipleInputs class, which allows you
to specify the InputFormat and Mapper to use on a per-path basis.
5) Database Input (and Output)
• DBInputFormat is an input format for reading data from a relational database, using
JDBC
• The corresponding output format is
• DBOutputFormat, which is useful for dumping job outputs (of modest size) into a
database.
4.8.2 Output Formats
• Hadoop has output data formats that correspond to the input formats covered in the
previous section.
Formats:
• Text Output
• Binary Output
• Multiple Outputs
• Lazy Output
• Database Output



OutputFormat class hierarchy

Figure 4.15 OutputFormat class hierarchy


1) Text Output
• The default output format, TextOutputFormat, writes records as lines of text.
• Its keys and values may be of any type, since TextOutputFormat turns them to strings by
calling toString() on them.
• Each key-value pair is separated by a tab character, although that may be changed
using the mapreduce.output.textoutputformat.separator property.
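The per-record formatting just described is simple enough to sketch (a minimal stand-in for the behavior, not the actual TextOutputFormat implementation):

```java
// Sketch of TextOutputFormat's record formatting: toString() on key and
// value, joined by the separator (TAB by default).
class TextOutputSketch {
    static String formatRecord(Object key, Object value, String separator) {
        return key.toString() + separator + value.toString();
    }

    public static void main(String[] args) {
        // With the default TAB separator, a (Text, IntWritable)-style pair
        // becomes one tab-delimited output line.
        System.out.println(formatRecord("apple", 3, "\t")); // prints apple<TAB>3
    }
}
```

Because toString() is all that is required, any key and value types work, which is why TextOutputFormat pairs naturally with KeyValueTextInputFormat in a follow-on job.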

2) Binary Output
• SequenceFileOutputFormat
• SequenceFileAsBinaryOutputFormat
• MapFileOutputFormat
SequenceFileOutputFormat
• As the name indicates, SequenceFileOutputFormat writes sequence files for its output.
• This is a good choice of output if it forms the input to a further MapReduce job, since it is
compact and is readily compressed.



SequenceFileAsBinaryOutputFormat
• SequenceFileAsBinaryOutputFormat is the counterpart to
SequenceFileAsBinaryInputFormat, and it writes keys and values in raw binary format
into a SequenceFile container.
MapFileOutputFormat
• MapFileOutputFormat writes MapFiles as output. The keys in a MapFile must be added
in order, so you need to ensure that your reducers emit keys in sorted order.
3) Multiple Outputs
• FileOutputFormat and its subclasses generate a set of files in the output directory.
• There is one file per reducer, and files are named by the partition number: part-r-00000,
part-r-00001, and so on.
• There is sometimes a need to have more control over the naming of the files or to
produce multiple files per reducer. MapReduce comes with the MultipleOutputs class to
help you do this.
Zero reducers
This is a vacuous case: there are no partitions, as the application needs to run only map
tasks.
One reducer
It can be convenient to run small jobs to combine the output of previous jobs into a single
file. This should only be attempted when the amount of data is small enough to be
processed comfortably by one reducer.
4) MultipleOutputs
• MultipleOutputs allows you to write data to files whose names are derived from the
output keys and values, or in fact from an arbitrary string.
• This allows each reducer to create more than a single file.
• File names are of the form name-m-nnnnn for map outputs and name-r-nnnnn for reduce
outputs, where name is an arbitrary name that is set by the program, and nnnnn is an
integer designating the part number, starting from zero.
5) Lazy Output
• FileOutputFormat subclasses will create output files, even if they are empty.
• Some applications prefer that empty files not be created, which is where
LazyOutputFormat helps. It is a wrapper output format that ensures that the output file is
created only when the first record is emitted for a given partition.
6) Database Output



• The output formats for writing to relational databases and to HBase are mentioned in
Database Input.
4.9 Configuring a Job
All Hadoop jobs have a driver program that configures the actual MapReduce job and submits it
to the Hadoop framework. This configuration is handled through the JobConf object. The
sample class MapReduceIntro provides a walk-through for using the JobConf object to
configure and submit a job to the Hadoop framework for execution.
Program. MapReduceIntroConfig.java

package com.apress.hadoopbook.examples.ch2;
import java.io.IOException;
import java.util.Formatter;
import java.util.Random;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.log4j.Logger;
public class MapReduceIntroConfig
{
// Log4j is the recommended way to provide textual information to the user about the job.
protected static Logger logger = Logger.getLogger(MapReduceIntroConfig.class);
protected static Path inputDirectory = new Path("file:///tmp/MapReduceIntroInput");
// This is the directory that the job output will be written to. It must not exist at job
// submission time.
protected static Path outputDirectory = new Path("file:///tmp/MapReduceIntroOutput");
protected static void exampleHouseKeeping(final JobConf conf, final Path inputDirectory, final
Path outputDirectory) throws IOException
{
conf.set("mapred.job.tracker", "local");
conf.setInt("io.sort.mb", 1);
generateSampleInputIf(conf, inputDirectory);
if (!removeIf(conf, outputDirectory))
{
logger.error("Unable to remove " + outputDirectory + "job aborted");



System.exit(1);
}
}
protected static void generateRandomFiles(final FileSystem fs,
final Path inputDirectory, final int fileCount, final int maxLines)
throws IOException
{
final Random random = new Random();
logger.info("Generating " + fileCount + " input files of random data,"
+ " each record is a random number TAB the input file name");
for (int file = 0; file < fileCount; file++)
{
final Path outputFile = new Path(inputDirectory, "file-" + file);
final String qualifiedOutputFile = outputFile.makeQualified(fs).toUri().toASCIIString();
FSDataOutputStream out = null;
try
{
/**
* This is the standard way to create a file using the Hadoop framework.
* An error will be thrown if the file already exists.
*/
out = fs.create(outputFile);
final Formatter fmt = new Formatter(out);
final int lineCount = (int) (Math.abs(random.nextFloat())* maxLines + 1);
for (int line = 0; line < lineCount; line++)
{
fmt.format("%d\t%s%n", Math.abs(random.nextInt()),
qualifiedOutputFile);
}
fmt.flush();
}
finally
{
if (out != null)
{
out.close();
}



}
}
}

protected static void generateSampleInputIf(final JobConf conf, final Path inputDirectory)
throws IOException
{
boolean inputDirectoryExists;
final FileSystem fs = inputDirectory.getFileSystem(conf);
if ((inputDirectoryExists = fs.exists(inputDirectory))
&& !isEmptyDirectory(fs, inputDirectory))
{
if (logger.isDebugEnabled())
{
logger.debug("The inputDirectory " + inputDirectory + " exists and is either a"
+ " file or a non empty directory");
}
return;
}

if (!inputDirectoryExists)
{
if (!fs.mkdirs(inputDirectory))
{
logger.error("Unable to make the inputDirectory "
+ inputDirectory.makeQualified(fs) + " aborting job");
System.exit(1);
}
}
final int fileCount = 3;
final int maxLines = 100;
generateRandomFiles(fs, inputDirectory, fileCount, maxLines);
}
public static Path getInputDirectory()
{
return inputDirectory;
}
public static Path getOutputDirectory()
{
return outputDirectory;
}
private static boolean isEmptyDirectory(final FileSystem fs, final Path inputDirectory) throws
IOException
{
final FileStatus[] statai = fs.listStatus(inputDirectory);
if ((statai == null) || (statai.length == 0))
{
if (logger.isDebugEnabled()) {
logger.debug(inputDirectory.makeQualified(fs).toUri()
+ " is empty or missing");
}
return true;
}
if (logger.isDebugEnabled())
{
logger.debug(inputDirectory.makeQualified(fs).toUri()
+ " is not empty");
}

for (final FileStatus status : statai)
{
if (!status.isDir() && (status.getLen() != 0))
{
if (logger.isDebugEnabled())
{
logger.debug("A non empty file "
+ status.getPath().makeQualified(fs).toUri() + " was found");
}
return false;
}
}
for (final FileStatus status : statai)
{
if (status.isDir() && isEmptyDirectory(fs, status.getPath()))
{
continue;
}
if (status.isDir())
{
return false;
}
}
return true;
}
protected static boolean removeIf(final JobConf conf, final Path outputDirectory) throws
IOException
{
final FileSystem fs = outputDirectory.getFileSystem(conf);
if (!fs.exists(outputDirectory))
{
if (logger.isDebugEnabled())
{
logger.debug("The output directory does not exist," + " no removal needed.");
}
return true;
}
final FileStatus status = fs.getFileStatus(outputDirectory);
logger.info("The job output directory " + outputDirectory.makeQualified(fs) + " exists"
+ (status.isDir() ? "" : " and is not a directory")
+ " and will be removed");
if (!fs.delete(outputDirectory, true))
{
logger.error("Unable to delete the configured output directory "
+ outputDirectory);



return false;
}
/** The outputDirectory did exist, but has now been removed. */
return true;
}
public static void setInputDirectory(final Path inputDirectory)
{
MapReduceIntroConfig.inputDirectory = inputDirectory;
}
public static void setOutputDirectory(final Path outputDirectory)
{
MapReduceIntroConfig.outputDirectory = outputDirectory;
}
}
Table 4.3: Map Phase Configuration

Element                                         Required?   Default
----------------------------------------------- ----------- ---------------------
Input path(s)                                   Yes         -
Class to read and convert the input path        Yes         -
  elements to key/value pairs
Map output key class                            No          Job output key class
Map output value class                          No          Job output value class
Class supplying the map function                Yes         -
Suggested minimum number of map tasks           No          Cluster default
Number of threads to run in each map task       No          1

Configuring the Reduce Phase


To configure the reduce phase, the user must supply the framework with five pieces of
information:



• The number of reduce tasks; if zero, no reduce phase is run
• The class supplying the reduce method
• The input key and value types for the reduce task; by default, the same as the reduce
output
• The output key and value types for the reduce task
• The output file type for the reduce task output
The configured number of reduce tasks determines the number of output files for a job that will
run the reduce phase. Tuning this value will have a significant impact on the overall
performance of your job.
The number of reduce tasks is commonly set in the configuration phase of a job.
conf.setNumReduceTasks(1);
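Since each reduce task writes one output file, the mapping from map output keys to those files is decided by the partitioner. The following standalone sketch illustrates the default hash-partitioning scheme; the class here is only illustrative (the real logic lives in Hadoop's HashPartitioner), but the formula shown is the standard one:

```java
// A simplified, self-contained illustration of hash partitioning; it mirrors
// the contract of Hadoop's default partitioner but is not the Hadoop class.
public class HashPartitionerSketch {
    // Mask off the sign bit of the hash code, then take the remainder,
    // so the result is always in [0, numReduceTasks).
    public static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"alpha", "beta", "gamma"}) {
            // Every record with a given key lands in the same partition, so all
            // of its values reach one reduce task and thus one output file.
            System.out.println(key + " -> part-" + getPartition(key, reducers));
        }
    }
}
```

Because the assignment depends only on the key, increasing the number of reduce tasks spreads keys over more output files without ever splitting one key across two reducers.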

4.10 Running a Job


The ultimate aim of all your MapReduce job configuration is to actually run that job. The
MapReduceIntro.java example demonstrates a common and simple way to run a job:
logger.info("Launching the job.");
/** Send the job configuration to the framework
* and request that the job be run.
*/
final RunningJob job = JobClient.runJob(conf);
logger.info("The job has completed.");
The method runJob() submits the configuration information to the framework and waits for the
framework to finish running the job. The response is provided in the job object. The RunningJob
class provides a number of methods for examining the response. Perhaps the most useful is
job.isSuccessful().
Run MapReduceIntro.java as follows
hadoop jar DOWNLOAD_PATH/ch2.jar com.apress.hadoopbook.examples.ch2.MapReduceIntro

The response should be as follows:

ch2.MapReduceIntroConfig: Generating 3 input files of random data, each record
is a random number TAB the input file name
ch2.MapReduceIntro: Launching the job.
jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=



mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
Applications should implement Tool for the same.
mapred.FileInputFormat: Total input paths to process : 3
mapred.FileInputFormat: Total input paths to process : 3
mapred.FileInputFormat: Total input paths to process : 3
mapred.FileInputFormat: Total input paths to process : 3
mapred.JobClient: Running job: job_local_0001
…..
…..
mapred.JobClient: Map input bytes=8068
mapred.JobClient: Combine input records=0
mapred.JobClient: Map output records=170
mapred.JobClient: Reduce input records=170
ch2.MapReduceIntro: The job has completed.
ch2.MapReduceIntro: The job completed successfully.

Congratulations, you have run a MapReduce job.

4.11 Design of Hadoop file system


Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem. HDFS is a filesystem designed for storing very large files with streaming data
access patterns, running on clusters of commodity hardware.
4.11.1 Features:
Very large files - "Very large" in this context means files that are hundreds of megabytes,
gigabytes, or terabytes in size.
Streaming data access - HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern.
Commodity hardware - Hadoop doesn't require expensive, highly reliable hardware to run on.
It's designed to run on clusters of commodity hardware for which the chance of node failure
across the cluster is high, at least for large clusters.
Areas where HDFS is not a good fit today:
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not
work well with HDFS.
Lots of small files



Since the namenode holds filesystem metadata in memory, the limit to the number of files in a
filesystem is governed by the amount of memory on the namenode.
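As a rough illustration of why small files strain the namenode: a commonly quoted rule of thumb is that each file object and each block object costs on the order of 150 bytes of namenode memory. The constant below is that assumption, used only for back-of-the-envelope arithmetic, not an exact value from the Hadoop source:

```java
public class NamenodeMemoryEstimate {
    // Assumed per-object overhead on the namenode; an approximate rule of
    // thumb for estimation, not a value taken from the Hadoop code.
    static final long BYTES_PER_OBJECT = 150;

    // Worst case for small files: every file occupies exactly one block,
    // so each file costs one file object plus one block object.
    public static long estimateBytes(long files) {
        return files * 2 * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // Ten million small files already need on the order of gigabytes
        // of namenode heap, regardless of how little data they hold.
        System.out.println(estimateBytes(10_000_000L) / (1024 * 1024) + " MB");
    }
}
```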
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file.
There is no support for multiple writers, or for modifications at arbitrary offsets in the file.

4.11.2 HDFS Concepts


1) Blocks - A disk has a block size, which is the minimum amount of data that it can read
or write. File systems for a single disk build on this by dealing with data in blocks,
which are an integral multiple of the disk block size. File system blocks are typically a
few kilobytes in size, while disk blocks are normally 512 bytes. In HDFS, files are
broken into block-sized chunks, which are stored as independent units.
A Block in HDFS So Large: Reason:

HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk can
be made to be significantly larger than the time to seek to the start of the block. Thus the
time to transfer a large file made of multiple blocks operates at the disk transfer rate.
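The argument can be checked with simple arithmetic. Assuming a 10 ms seek time and a 100 MB/s transfer rate (illustrative numbers, not measured values), the fraction of read time lost to seeking falls sharply as the block grows:

```java
public class BlockSizeMath {
    // Fraction of total read time spent seeking, for one seek per block.
    // seekMs: average seek time; rateMBps: sustained transfer rate;
    // blockMB: block size. All inputs are illustrative assumptions.
    public static double seekOverhead(double seekMs, double rateMBps, double blockMB) {
        double transferMs = blockMB / rateMBps * 1000.0;
        return seekMs / (seekMs + transferMs);
    }

    public static void main(String[] args) {
        // A 128 MB block keeps seek overhead around 1%, while a 4 KB
        // filesystem-sized block would be dominated by seeking.
        System.out.printf("128 MB block: %.3f%n", seekOverhead(10, 100, 128));
        System.out.printf("4 KB block:   %.3f%n", seekOverhead(10, 100, 4.0 / 1024));
    }
}
```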
2) Namenodes and Datanodes- An HDFS cluster has two types of node operating in a
master-worker pattern: a namenode (the master) and a number of datanodes
(workers). The namenode manages the file system namespace. It maintains the
filesystem tree and the metadata for all the files and directories in the tree. This
information is stored persistently on the local disk in the form of two files: the
namespace image and the edit log.
A client accesses the filesystem on behalf of the user by communicating with the
namenode and datanodes. Datanodes are the workhorses of the filesystem. They store
and retrieve blocks when they are told to (by clients or the namenode), and they report
back to the namenode periodically with lists of blocks that they are storing. Without the
namenode, the filesystem cannot be used. It is also possible to run a secondary
namenode, which despite its name does not act as a namenode. Its main role is to
periodically merge the namespace image with the edit log to prevent the edit log from
becoming too large.

3) The Command-Line Interface- There are many interfaces to HDFS, but the command
line is one of the simplest and, to many developers, the most familiar. There are two
properties that we set in the pseudo-distributed configuration that deserve further



explanation. The first is fs.default.name, set to hdfs://localhost/, which is used to set a
default filesystem for Hadoop. We set the second property, dfs.replication, to 1 so that
HDFS doesn't replicate filesystem blocks by the default factor of three.
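In the site configuration files, these two properties would appear roughly as follows. This is a sketch using the standard Hadoop property syntax; the file names assume the pseudo-distributed layout described above (older releases keep both properties in a single conf/hadoop-site.xml):

```xml
<!-- core-site.xml: the default filesystem -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost/</value>
</property>

<!-- hdfs-site.xml: store a single copy of each block in pseudo-distributed mode -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```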

4) Basic Filesystem Operations - The filesystem is ready to be used, and we can do all
of the usual filesystem operations such as reading files, creating directories, moving
files, deleting data, and listing directories. We can type hadoop fs -help to get detailed
help on every command.

HDFS Key features:


1) Highly fault tolerant
2) High Throughput
3) Designed to work with systems with very large files.
4) Provides streaming access to file system data.
5) Can be built out of commodity hardware.

4.11.3 Hadoop Filesystems


Table 4.4: Hadoop filesystems

Filesystem        URI scheme  Java implementation         Description
----------------- ----------- --------------------------- -----------------------------------
Local             file        fs.LocalFileSystem          A filesystem for a locally connected
                                                          disk with client-side checksums. Use
                                                          RawLocalFileSystem for a local
                                                          filesystem with no checksums.
HDFS              hdfs        hdfs.DistributedFileSystem  Hadoop's distributed filesystem.
                                                          HDFS is designed to work efficiently
                                                          in conjunction with MapReduce.
HFTP              hftp        hdfs.HftpFileSystem         A filesystem providing read-only
                                                          access to HDFS over HTTP.
FTP               ftp         fs.ftp.FTPFileSystem        A filesystem backed by an FTP
                                                          server.
HAR               har         fs.HarFileSystem            A filesystem layered on another
                                                          filesystem for archiving files.
                                                          Hadoop Archives are typically used
                                                          for archiving files in HDFS to
                                                          reduce the namenode's memory usage.
KFS (CloudStore)  kfs         fs.kfs.KosmosFileSystem     CloudStore (formerly Kosmos
                                                          filesystem) is a distributed
                                                          filesystem like HDFS or Google's
                                                          GFS, written in C++.

HDFS Interfaces
1. HTTP- HDFS defines a read-only interface for retrieving directory listings and data over
HTTP. This protocol is not tied to a specific HDFS version, making it possible to write
clients that can use HTTP to read data from HDFS clusters that run different versions
of Hadoop.
2. FTP- There is an FTP interface to HDFS, which permits the use of the FTP protocol to
interact with HDFS. This interface is a convenient way to transfer data into and out of
HDFS using existing FTP clients.

4.12 The Java Interface


1. Reading Data from a Hadoop URL:
One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL
object to open a stream to read the data from. The general idiom is:



InputStream in = null;
try {
in = new URL("hdfs://host/path").openStream();
// process in
}
finally
{
IOUtils.closeStream(in);
}
Example. Displaying files from a Hadoop filesystem on standard output using a
URLStreamHandler
public class URLCat {
static {
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}

public static void main(String[] args) throws Exception


{
InputStream in = null;
try
{
in = new URL(args[0]).openStream();
IOUtils.copyBytes(in, System.out, 4096, false);
}
finally {
IOUtils.closeStream(in);
}
}
}
We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the
finally clause, and also for copying bytes between the input stream and the output stream. The
last two arguments to the copyBytes method are the buffer size used for copying and whether to
close the streams when the copy is complete.

Here's a sample run:


% hadoop URLCat hdfs://localhost/user/tom/quangle.txt



On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
FSDataInputStream
The open() method on FileSystem actually returns a FSDataInputStream rather than a standard
java.io class. This class is a specialization of java.io.DataInputStream with support for random
access, so you can read from any part of the stream:

Example. Displaying files from a Hadoop filesystem on standard output twice, by using
seek
public class FileSystemDoubleCat
{
public static void main(String[] args) throws Exception
{
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FSDataInputStream in = null;
try
{
in = fs.open(new Path(uri));
IOUtils.copyBytes(in, System.out, 4096, false);
in.seek(0); // go back to the start of the file
IOUtils.copyBytes(in, System.out, 4096, false);
}
finally
{
IOUtils.closeStream(in);
}
}
}
Here's the result of running it on a small file:
% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree



The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the method
that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of this method that allow you to specify whether to forcibly
overwrite existing files, the replication factor of the file, the buffer size to use when writing the
file, the block size for the file, and file permissions.
Example. Copying a local file to a Hadoop filesystem
public class FileCopyWithProgress
{
public static void main(String[] args) throws Exception
{
String localSrc = args[0];


String dst = args[1];
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst), new Progressable()
{
public void progress()
{
System.out.print(".");
}
}
);
IOUtils.copyBytes(in, out, 4096, true);
}



}
Typical usage:
% hadoop FileCopyWithProgress input/docs/1400-8.txt hdfs://localhost/user/tom/1400-8.txt

FSDataOutputStream
The create() method on FileSystem returns an FSDataOutputStream, which, like
FSDataInputStream, has a method for querying the current position in the file:

package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
// implementation elided
}
// implementation elided
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is
because HDFS allows only sequential writes to an open file or appends to an already written
file.

4.13 Data Flow


4.13.1 Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the namenode and
the datanodes, consider the below figure, which shows the main sequence of events when
reading a file.



Figure 4.16 A client reading data from HDFS
Steps of Operation:
1) The client opens the file it wishes to read by calling open() on the FileSystem object,
which for HDFS is an instance of DistributedFileSystem.
2) DistributedFileSystem calls the namenode, using RPC, to determine the locations of
the blocks for the first few blocks in the file.
3) The client then calls read() on the stream. DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then connects to the first
(closest) datanode for the first block in the file.
4) Data is streamed from the datanode back to the client, which calls read() repeatedly on
the stream
5) When the end of the block is reached, DFSInputStream will close the connection to the
datanode, then find the best datanode for the next block.
6) When the client has finished reading, it calls close() on the FSDataInputStream.



4.13.2 Anatomy of a File Write
Let us examine the case of creating a new file, writing data to it, then closing the file.

Figure4.17 Anatomy of a File Write


Steps of Operation
1) The client creates the file by calling create() on DistributedFileSystem.
2) DistributedFileSystem makes an RPC call to the namenode to create a new file in the
filesystem's namespace, with no blocks associated with it. The namenode performs
various checks to make sure the file doesn't already exist, and that the client has the
right permissions to create the file. If these checks pass, the namenode makes a
record of the new file; otherwise, file creation fails and the client is thrown an
IOException.
3) As the client writes data, DFSOutputStream splits it into packets, which it writes to an
internal queue, called the data queue. The data queue is consumed by the
DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by
picking a list of suitable datanodes to store the replicas.
4) The DataStreamer streams the packets to the first datanode in the pipeline, which
stores the packet and forwards it to the second datanode in the pipeline. Similarly, the
second datanode stores the packet and forwards it to the third (and last) datanode in
the pipeline.
5) DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack
queue only when it has been acknowledged by all the datanodes in the pipeline.



6) When the client has finished writing data, it calls close() on the stream
7) This action flushes all the remaining packets to the datanode pipeline and waits for
acknowledgments before contacting the namenode to signal that the file is complete.
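The data-queue/ack-queue bookkeeping in steps 3 to 5 can be sketched with plain Java queues. This is a toy single-threaded model of the accounting only; the real DFSOutputStream does this across threads and live datanode connections:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of DFSOutputStream's two queues: packets waiting to be sent,
// and packets sent but not yet acknowledged by the whole pipeline.
public class AckQueueSketch {
    private final Deque<Integer> dataQueue = new ArrayDeque<>();
    private final Deque<Integer> ackQueue = new ArrayDeque<>();

    // Client side: writing data enqueues packets on the data queue.
    public void write(int packet) { dataQueue.addLast(packet); }

    // DataStreamer side: a packet moves from the data queue to the ack
    // queue as it is streamed down the datanode pipeline.
    public Integer stream() {
        Integer p = dataQueue.pollFirst();
        if (p != null) ackQueue.addLast(p);
        return p;
    }

    // A packet leaves the ack queue only once every datanode in the
    // pipeline has acknowledged it (in order, oldest first).
    public void ackedByPipeline(int packet) {
        if (!ackQueue.isEmpty() && ackQueue.peekFirst() == packet) {
            ackQueue.pollFirst();
        }
    }

    public int unacked() { return ackQueue.size(); }
}
```

On close(), step 7 corresponds to draining both queues: everything remaining is streamed, and the client waits until unacked() reaches zero before telling the namenode the file is complete.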

UNIT IV PROGRAMMING MODEL

1. Mention the services offered by Core Grid middleware

Core Grid middleware offers services such as remote process management, co-allocation of
resources, storage access, information registration and discovery, security, and aspects of
Quality of Service (QoS) such as resource reservation and trading.

2. Mention the services offered by User-level Grid middleware

User-level Grid middleware utilizes the interfaces provided by the low-level middleware to
provide higher level abstractions and services. These include application development
environments, programming tools and resource brokers for managing resources and scheduling
application tasks for execution on global resources.
3. List the features of UNICORE.
 User driven job creation and submission
 Job management
 Data management
 Application support
 Flow control
 Single sign-on
 Support for legacy jobs
 Resource management

4. What are the two aspects involved in GRAM?


a. Job Submission – a user starts the job scheduling with the creation of a managed job
service.
b. Resource Management – a client knows about the master host environment and the
master managed factory service.

5. Write notes on Grid Container.


 The Globus container model is derived from the J2EE managed container model, where
components are freed from complex resource management concerns.
 Lightweight service introspection and discovery.
 Dynamic deployment and soft-state management of stateful grid services.

6. What are the problems with operation providers?


i) Due to the unavailability of multiple inheritance in Java, service developers utilize the
default interface hierarchy, as provided by the framework.
ii) Dynamic configuration of service behaviors is not possible.



7. What are the most common security handlers?
Authentication Service handler
WS security handler
Security policy handler
Authentication handler
X509sign handler
GSS handler

8. Discuss about Legion.


Legion is a middleware system that combines very large numbers of independently
administered heterogeneous hosts, storage systems, databases legacy codes and user
objects distributed over wide-area-networks into a single coherent computing platform.
Legion defines a set of core object types that support basic system services, such as
naming and binding, object creation, activation, deactivation, and deletion.

9. What is Gridbus Project?


The Gridbus Project is an open-source, multi-institutional project led by the GRIDS Lab
at the University of Melbourne. It is engaged in the design and development of service-
oriented cluster and grid middleware technologies to support eScience and eBusiness
applications.

10. List the primary GT4 components.


Execution and Resource Management
Security
Data Management
Replica Location and Management

11. What does staging files mean?


File staging allows executable and data files to be automatically transferred to the
required destination without user intervention. For file staging, specific elements have to
be added to the provided job description XML file during the job submission. Each file
transmission must provide a URL source and a URL destination.

12. Mention the use of GridFTP.


The GridFTP facility provides secure and reliable data transfer between grid hosts. Its
protocol extends the well-known FTP standard to provide additional features, including
support for authentication through GSI. One of the major features of GridFTP is that it
enables third-party transfer.

13. What is WS GRAM?


WS GRAM is the Grid service that provides the remote execution and status
management of jobs. When a job is submitted by a client, the request is sent to the
remote host as a SOAP message, and handled by WS GRAM service located in the
remote host.



14. Discuss the Map and Reduce functions.
• Map: An initial ingestion and transformation step, in which individual input records can
be processed in parallel.
• Reduce: An aggregation or summarization step, in which all associated records must
be processed together by a single entity.

15. Define mapreduce.


MapReduce is a programming model for data processing. Hadoop can run MapReduce
programs written in various languages. MapReduce programs are inherently parallel,
thus putting very large-scale data analysis into the hands of anyone with enough
machines at their disposal. MapReduce comes into its own for large datasets.
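The model itself can be illustrated without Hadoop at all. Below is a minimal in-memory word-count sketch in plain Java, where map emits (word, 1) pairs from each record independently, and the reduce step sums all values that share a key. It shows only the conceptual shape of the model, not the Hadoop API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniMapReduce {
    // Map phase: each input record is processed independently,
    // emitting a (word, 1) pair for every word it contains.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : record.split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle + reduce phase: all values emitted for a given key
    // are brought together and aggregated by a single summation.
    public static Map<String, Integer> run(List<String> records) {
        Map<String, Integer> counts = new HashMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                counts.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick fox", "the lazy dog")));
    }
}
```

Because each map call depends only on its own record, the map phase parallelizes trivially; only the per-key aggregation in reduce requires records to meet at a single place.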

16. Define Input Splitting.


An input split will normally be a contiguous group of records from a single input file, and
in this case, there will be at least N input splits, where N is the number of input files. If
the number of requested map tasks is larger than this number, or the individual files are
larger than the suggested fragment size, there may be multiple input splits constructed
of each input file.
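The relationship between file sizes and split count reduces to simple arithmetic. In the sketch below, splitSize stands in for the framework's computed fragment size; the real FileInputFormat calculation also accounts for block boundaries and configured minimum/maximum split sizes:

```java
public class SplitCount {
    // Number of input splits for one file: every file yields at least one
    // split, and large files yield one split per splitSize-sized chunk.
    public static long splitsForFile(long fileLen, long splitSize) {
        if (fileLen == 0) return 1;  // zero-length files still get one (empty) split
        return (fileLen + splitSize - 1) / splitSize;  // ceiling division
    }

    public static void main(String[] args) {
        long splitSize = 64L * 1024 * 1024;  // assume a 64 MB split size
        long[] files = {10L * 1024 * 1024, 200L * 1024 * 1024};
        long total = 0;
        for (long len : files) total += splitsForFile(len, splitSize);
        // Two input files, one small enough for a single split and one
        // spanning several, so N files produce at least N splits.
        System.out.println("total splits = " + total);
    }
}
```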

17. Mention the various Input Formats.


a. Input Splits and Records
b. Text Input
c. Binary Input
d. Multiple Inputs
e. Database Input (and Output)

18. Discuss the various HDFS concepts.


Blocks
Namenodes and Datanodes
The Command-Line Interface
Basic Filesystem Operations

19. List the HDFS key Features.


1) Highly fault tolerant
2) High throughput
3) Designed to work with systems with very large files.
4) Provides streaming access to file system data.
5) Can be built out of commodity hardware.



20. What are the HDFS Interfaces?
i) HTTP- HDFS defines a read-only interface for retrieving directory listings and data over
HTTP. This protocol is not tied to a specific HDFS version, making it possible to write
clients that can use HTTP to read data from HDFS clusters that run different versions
of Hadoop.
ii) FTP- There is an FTP interface to HDFS, which permits the use of the FTP protocol to
interact with HDFS. This interface is a convenient way to transfer data into and out of
HDFS using existing FTP clients.
16 Marks Questions
1. Explain Globus Toolkit (GT4) Architecture in detail.
2. Explain about the usage and configuration of GT4
3. What is input splitting? Discuss specifying the input and output parameters.
4. Explain about the dataflow of File read & File write of HDFS.
5. Discuss about Mapreduce with suitable programming example.



UNIT V SECURITY

Trust models for Grid security environment – Authentication and Authorization methods – Grid
security infrastructure – Cloud Infrastructure security: network, host and application level –
aspects of data security, provider data and its security, Identity and access management
architecture, IAM practices in the cloud, SaaS, PaaS, IaaS availability in the cloud, Key privacy
issues in the cloud

TEXT BOOK:
1. Kai Hwang, Geoffrey C. Fox and Jack J. Dongarra, "Distributed and Cloud Computing:
Clusters, Grids, Clouds and the Future of Internet", First Edition, Morgan Kaufmann, an
Imprint of Elsevier, 2012.

REFERENCES:
1. Jason Venner, "Pro Hadoop - Build Scalable, Distributed Applications in the
Cloud", Apress, 2009.

2. Tom White, "Hadoop: The Definitive Guide", First Edition, O'Reilly, 2009.

3. Bart Jacob (Editor), "Introduction to Grid Computing", IBM Redbooks, Vervante, 2005.

4. Ian Foster, Carl Kesselman, "The Grid: Blueprint for a New Computing Infrastructure", 2nd
Edition, Morgan Kaufmann.

5. Frederic Magoules and Jie Pan, "Introduction to Grid Computing", CRC Press, 2009.

6. Daniel Minoli, "A Networking Approach to Grid Computing", John Wiley, 2005.

7. Barry Wilkinson, "Grid Computing: Techniques and Applications", Chapman and Hall/CRC,
Taylor and Francis Group, 2010.

STAFF IN-CHARGE HOD



UNIT V

SECURITY

Trust models for Grid security environment – Authentication and Authorization methods – Grid
security infrastructure – Cloud Infrastructure security: network, host and application level –
aspects of data security, provider data and its security, Identity and access management
architecture, IAM practices in the cloud, SaaS, PaaS, IaaS availability in the cloud, Key privacy
issues in the cloud

5.1 Trust models for Grid security environment


Many potential security issues may occur in a grid environment if qualified security mechanisms
are not in place. These issues include network sniffers, out-of-control access, faulty operation,
malicious operation, integration of local security mechanisms, delegation, dynamic resources
and services, attack provenance, and so on.
Computational grids are motivated by the desire to share processing resources among many
organizations to solve large-scale problems. Indeed, grid sites may exhibit unacceptable
security conditions and system vulnerabilities.
On the one hand, a user job demands the resource site to provide security assurance by issuing
a security demand (SD). On the other hand, the site needs to reveal its trustworthiness, called
its trust index (TI). These two parameters must satisfy a security-assurance condition: TI ≥ SD
during the job mapping process.
When determining its security demand, users usually care about some typical attributes. These
attributes and their values are dynamically changing and depend heavily on the trust model,
security policy, accumulated reputation, self-defense capability, attack history, and site
vulnerability.
Three challenges are outlined below to establish the trust among grid sites.
1. Integration with existing systems and technologies: The resources sites in a grid are
usually heterogeneous and autonomous. It is unrealistic to expect that a single type of
security can be compatible with and adopted by every hosting environment. At the same
time, existing security infrastructure on the sites cannot be replaced overnight. Thus, to
be successful, grid security architecture needs to step up to the challenge of integrating
with existing security architecture and models across platforms and hosting
environments.
2. Interoperability with different hosting environments: Services are often invoked
across multiple domains, and need to be able to interact with one another. The
interoperation is demanded at the protocol, policy, and identity levels. For all these
levels, interoperation must be protected securely.
3. Constructing trust relationships among interacting hosting environments: Grid
service requests can be handled by combining resources on multiple security domains.
Trust relationships are required by these domains during the end-to-end traversals. A
service needs to be open to friendly and interested entities so that they can submit
requests and access securely.



Resource sharing among entities is one of the major goals of grid computing. A trust
relationship must be established before the entities in the grid interoperate with one another.
The entities have to choose other entities that can meet the requirements of trust to coordinate
with.
The entities that submit requests should believe the resource providers will try to process their
requests and return the results with a specified QoS.
To create the proper trust relationship between grid entities, two kinds of trust models are often
used.
 PKI-based model, which mainly exploits the PKI to authenticate and authorize entities
 Reputation-based model.
The grid aims to construct a large-scale network computing system by integrating distributed,
heterogeneous, and autonomous resources. The security challenges faced by the grid are much
greater than other computing systems. Before any effective sharing and cooperation occurs, a
trust relationship has to be established among participants. Otherwise, not only will participants
be reluctant to share their resources and services, but also the grid may cause a lot of damage.
5.1.1 A Generalized Trust Model
Figure 5.1 shows a general trust model.
 At the bottom, we identify three major factors which influence the trustworthiness of a
resource site. An inference module is required to aggregate these factors.
 The following are some existing inference or aggregation methods. An intra-site fuzzy
inference procedure is called to assess defense capability and direct reputation.
 Defense capability is decided by the firewall, intrusion detection system (IDS), intrusion
response capability, and anti-virus capacity of the individual resource site.
 Direct reputation is decided based on the job success rate, site utilization, job turnaround
time, and job slowdown ratio measured.
 Recommended trust is also known as secondary trust and is obtained indirectly over the
grid network.

Figure 5.1 A general trust model for grid computing

5.1.2 Reputation-Based Trust Model
 In a reputation-based model, jobs are sent to a resource site only when the site is
trustworthy enough to meet users' demands.
 The site trustworthiness is usually calculated from the following information: the defense
capability, direct reputation, and recommendation trust.
 The defense capability refers to the site‘s ability to protect itself from danger. It is
assessed according to such factors as intrusion detection, firewall, response capabilities,
anti-virus capacity, and so on. Direct reputation is based on experiences of prior jobs
previously submitted to the site.
 The reputation is measured by many factors such as prior job execution success rate,
cumulative site utilization, job turnaround time, job slowdown ratio, and so on. A positive
experience associated with a site will improve its reputation. On the contrary, a negative
experience with a site will decrease its reputation.
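The aggregation described above can be sketched in Python. The metric names, weights, and learning rate are illustrative assumptions, not values given in the text; the point is only that direct reputation is a weighted combination of normalized site metrics, nudged up or down by each new experience.

```python
# Hypothetical weights/metrics; a minimal sketch of direct-reputation scoring.

def direct_reputation(success_rate, utilization, turnaround_norm, slowdown_norm,
                      weights=(0.4, 0.2, 0.2, 0.2)):
    """Aggregate site metrics (each normalized to [0, 1], higher = better)."""
    metrics = (success_rate, utilization, turnaround_norm, slowdown_norm)
    return sum(w * m for w, m in zip(weights, metrics))

def update_reputation(old, experience, learning_rate=0.1):
    """A positive experience (near 1) raises reputation; a negative one lowers it."""
    return (1 - learning_rate) * old + learning_rate * experience
```

Running the update repeatedly shows the "on the contrary" behavior in the text: a stream of good experiences drives the score toward 1, a stream of bad ones toward 0.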
5.1.3 A Fuzzy-Trust Model
 In this model, the job security demand (SD) is supplied by the user programs. The trust
index (TI) of a resource site is aggregated through the fuzzy-logic inference process over
all related parameters.
 Specifically, one can use a two-level fuzzy logic to estimate the aggregation of numerous
trust parameters and security attributes into scalar quantities that are easy to use in the
job scheduling and resource mapping process.
 The TI is normalized as a single real number with 0 representing the condition with the
highest risk at a site and 1 representing the condition which is totally risk-free or fully
trusted.
 The fuzzy inference is accomplished through four steps: fuzzification, inference,
aggregation, and defuzzification.
 The second salient feature of the trust model is that if a site's trust index cannot match
the job security demand (i.e., SD > TI), the trust model can deduce detailed security
features to guide the site's security upgrade by tuning the fuzzy system.
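The two-level aggregation and the SD/TI admission test can be sketched as follows. A plain weighted average stands in for the fuzzy inference steps (fuzzification through defuzzification), and all weights are assumptions for illustration, not part of the model as published.

```python
def aggregate(factors, weights):
    """Weighted average of factors in [0, 1] -- a stand-in for fuzzy inference."""
    return sum(w * f for w, f in zip(weights, factors)) / sum(weights)

def trust_index(defense_attrs, reputation_attrs, recommended_trust):
    # Level 1: intra-site aggregation (firewall, IDS, response, anti-virus /
    # job success rate, utilization, turnaround, slowdown).
    defense = aggregate(defense_attrs, [1, 1, 1, 1])
    reputation = aggregate(reputation_attrs, [1, 1, 1, 1])
    # Level 2: combine with recommendation trust obtained over the grid network.
    return aggregate([defense, reputation, recommended_trust], [0.4, 0.4, 0.2])

def admit_job(security_demand, ti):
    """A job is mapped to the site only when TI meets its security demand."""
    return ti >= security_demand
```

The resulting TI is a scalar in [0, 1], as in the text, so the scheduler's check reduces to a single comparison against SD.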

5.2 AUTHENTICATION AND AUTHORIZATION METHODS


The major authentication methods in the grid include passwords, PKI, and Kerberos.
The password is the simplest method to identify users, but the most vulnerable one to use.
The PKI is the most popular method supported by GSI. To implement PKI, we use a trusted
third party, called the certificate authority (CA). Each user holds a unique pair of public and
private keys. The public key is issued by the CA in the form of a certificate, after the CA
recognizes a legitimate user. The private key is exclusive to each user, and is unknown to any
other user.
A digital certificate in IEEE X.509 format consists of the user name, user public key, CA name,
and the digital signature of the CA. The following example illustrates the use of a PKI service in
a grid environment.
Trust Delegation Using the Proxy Credential in GSI

The PKI alone is not strong enough for user authentication in a grid. Figure 5.2 shows a
scenario where a sequence of trust delegation is necessary.
Bob and Charlie both trust Alice, but Charlie does not trust Bob. Now, Alice submits a task Z to
Bob. Task Z demands many resources, which Bob cannot supply independently, so Bob forwards a
subtask Y of Z to Charlie. Because Charlie does not trust Bob and is not sure whether Y was really
originally requested by Alice, he rejects the subtask Y from Bob and denies it resources.

Figure 5.2 Interactions among multiple parties in a sequence of trust delegation operations
using the PKI services in a GT4-enabled grid environment.
For Charlie to accept the subtask Y, Bob needs to show Charlie some proof of entrust from
Alice. A proxy credential is the solution proposed by GSI.
A proxy credential is a temporary certificate generated by a user. Two benefits are seen by
using proxy credentials.
 First, the proxy credential is used by its holder to act on behalf of the original user or the
delegating party. A user can temporarily delegate his right to a proxy.
 Second, single sign-on can be achieved with a sequence of credentials passed along
the trust chain. The delegating party (Alice) need not verify the remote intermediate
parties in a trust chain.
The only difference between the proxy credential and a digital certificate is that the proxy
credential is not signed by a CA. We need to know the relationship among the certificates of the
CA and Alice, and proxy credential of Alice.
 The CA certificate is signed first with its own private key.
 Second, the certificate Alice holds is signed with the private key of the CA.
 Finally, the proxy credential sent to her proxy (Bob) is signed with her private key.
The procedure delegates the rights of Alice to Bob by using the proxy credential.
 First, the generation of the proxy credential is similar to the procedure of generating a
user certificate in the traditional PKI.
 Second, when Bob acts on behalf of Alice, he sends the request together with Alice‘s
proxy credential and the Alice certificate to Charlie.
 Third, after obtaining the proxy credential, Charlie finds out that the proxy credential is
signed by Alice. So he tries to verify the identity of Alice and finds Alice trustable. Finally,
Charlie accepts Bob‘s requests on behalf of Alice. This is called a trust delegation chain.
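The chain verification Charlie performs can be modeled with a toy signing scheme. SHA-256 of key-plus-payload stands in for real X.509 signatures, and a single key per party collapses the public/private key pair, so this is only a sketch of the chain-walking logic, not of real cryptography; the names follow the Alice/Bob/Charlie example above.

```python
import hashlib

# Toy stand-in for public-key signatures: sign(key, data) = SHA-256(key || data).
def sign(private_key: bytes, payload: bytes) -> str:
    return hashlib.sha256(private_key + payload).hexdigest()

def make_cert(subject, issuer, issuer_key):
    payload = f"{subject}:{issuer}".encode()
    return {"subject": subject, "issuer": issuer,
            "payload": payload, "sig": sign(issuer_key, payload)}

def verify(cert, issuer_key):
    # In real PKI this check uses the issuer's PUBLIC key; the toy uses one key.
    return cert["sig"] == sign(issuer_key, cert["payload"])

ca_key, alice_key = b"ca-secret", b"alice-secret"          # illustrative keys
ca_cert = make_cert("CA", "CA", ca_key)                    # CA self-signs
alice_cert = make_cert("Alice", "CA", ca_key)              # Alice's cert, signed by CA
proxy_cert = make_cert("Alice/proxy", "Alice", alice_key)  # proxy, signed by Alice, not a CA

# Charlie verifies the delegation chain: proxy <- Alice <- CA
chain_ok = verify(proxy_cert, alice_key) and verify(alice_cert, ca_key)
```

The key structural point survives the simplification: the proxy credential's signature traces back to Alice, and Alice's certificate traces back to the CA, forming the trust delegation chain.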
5.2.1 Authorization for Access Control

Authorization is the process of exercising access control over shared resources. Decisions can be
made either at the access point of service or at a centralized place. Typically, the resource is a
host that provides processors and storage for services deployed on it. Based on a set of
predefined policies or rules, the resource enforces access control for local services.
The central authority is a special entity capable of issuing and revoking policies of
access rights granted for remote accesses.
The authority can be classified into three categories: attribute authorities, policy authorities, and
identity authorities.
 Attribute authorities issue attribute assertions;
 policy authorities issue authorization policies;
 identity authorities issue certificates.
 The authorization server makes the final authorization decision.
5.2.2 Three Authorization Models
Figure 5.3 shows three authorization models. The subject is the user and the resource refers to
the machine side.
 The subject-push model is shown at the top diagram. The user conducts handshake with
the authority first and then with the resource site in a sequence.
 The resource-pulling model puts the resource in the middle. The user checks the
resource first. Then the resource contacts its authority to verify the request, and the
authority authorizes at step 3. Finally the resource accepts or rejects the request from
the subject at step 4.
 The authorization agent model puts the authority in the middle. The subject checks with
the authority at step 1, and the authority makes decisions on access to the requested
resources. The authorization process is completed at steps 3 and 4 in the reverse
direction.

Figure 5.3 Three authorization models: the subject-push model, resource-pulling model, and the
authorization agent model.
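The subject-push sequence can be simulated with a signed token: the subject first obtains an assertion from the authority, then pushes it to the resource. The HMAC token format and the key shared between authority and resource are assumptions made for illustration; they are not part of any grid standard.

```python
import hmac, hashlib

AUTHORITY_KEY = b"authority-secret"   # assumed shared by authority and resource

def authority_issue(subject: str):
    """Steps 1-2: the subject handshakes with the authority and gets a token."""
    tag = hmac.new(AUTHORITY_KEY, subject.encode(), hashlib.sha256).hexdigest()
    return (subject, tag)

def resource_accept(token) -> bool:
    """Steps 3-4: the subject pushes the token; the resource verifies it."""
    subject, tag = token
    expected = hmac.new(AUTHORITY_KEY, subject.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)
```

The resource never contacts the authority at request time, which is exactly what distinguishes subject-push from the resource-pulling model, where the resource performs that check itself.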

5.3 GRID SECURITY INFRASTRUCTURE (GSI)


Although the grid is a common approach to constructing dynamic, interdomain, distributed
computing and data collaborations, "lack of security/trust between different services" is still
an important challenge of the grid.
The grid requires a security infrastructure with the following properties:

 easy to use;
 conforms with security needs while working well with the site policies of each resource
provider;
 provides appropriate authentication and encryption of all interactions.
The GSI is an important step toward satisfying these requirements.
GSI is a portion of the Globus Toolkit and provides the fundamental security services needed to
support grids, including support for message protection, authentication and delegation, and
authorization.
GSI enables secure authentication and communication over an open network, and permits
mutual authentication across and among distributed sites with single sign-on capability.
GSI supports both message-level security, which supports the WS-Security standard and the
WS-Secure Conversation specification to provide message protection for SOAP messages, and
transport-level security, which means authentication via TLS with support for X.509 proxy
certificates.
5.3.1 GSI Functional Layers
GT4 provides distinct WS and pre-WS authentication and authorization capabilities. Both build
on the same base, namely the X.509 standard and entity certificates and proxy certificates,
which are used to identify persistent entities such as users and servers and to support the
temporary delegation of privileges to other entities, respectively.
GSI may be thought of as being composed of four distinct functions:
❶ message protection, ❷ authentication, ❸ delegation, and ❹ authorization.

Figure 5.4 GSI functional layers at the message and transport levels.
TLS (transport-level security) or WS-Security and WS-Secure Conversation (message-level) are
used as message protection mechanisms in combination with SOAP.
X.509 End Entity Certificates or Username and Password are used as authentication
credentials.
X.509 Proxy Certificates and WS-Trust are used for delegation.

An Authorization Framework allows for a variety of authorization schemes, including a
"grid-mapfile" ACL, an ACL defined by a service, a custom authorization handler, and access to an
authorization service via the SAML protocol.
In addition, associated security tools provide for the storage of X.509 credentials (MyProxy and
Delegation services), the mapping between GSI and other authentication mechanisms (e.g.,
KX509 and PKINIT for Kerberos, MyProxy for one-time passwords), and maintenance of
information used for authorization (VOMS, GUMS, PERMIS).
The web services portions of GT4 use SOAP as their message protocol for communication.
Message protection can be provided either by transport-level security, which transports SOAP
messages over TLS, or by message-level security, which is signing and/or encrypting portions
of the SOAP message using the WS-Security standard.
5.3.2 Transport-Level Security
Transport-level security necessitates SOAP messages conveyed over a network connection
protected by TLS. TLS provides for both integrity protection and privacy (via encryption).
Transport-level security is normally used in conjunction with X.509 credentials for
authentication, but can also be used without such credentials to provide message protection
without authentication, often referred to as "anonymous transport-level security." In this mode of
operation, authentication may be done by username and password in a SOAP message.
5.3.3 Message-Level Security
GSI also provides message-level security for message protection for SOAP messages by
implementing the WS-Security standard and the WS-Secure Conversation specification. The
WS-Security standard from OASIS defines a framework for applying security to individual SOAP
messages; WS-Secure Conversation is a proposed standard from IBM and Microsoft that allows
for an initial exchange of messages to establish a security context which can then be used to
protect subsequent messages in a manner that requires less computational overhead (i.e., it
allows the trade-off of initial overhead for setting up the session for lower overhead for
messages).
GSI conforms to this standard. GSI uses these mechanisms to provide security on a per-
message basis, that is, to an individual message without any preexisting context between the
sender and receiver.
GSI allows three additional protection mechanisms.
 Integrity protection, by which a receiver can verify that messages were not altered in
transit from the sender.
 Encryption, by which messages can be protected to provide confidentiality.
 Replay prevention, by which a receiver can verify that it has not received the same
message previously.
These protections are provided by both WS-Security and WS-Secure Conversation. The
former applies the keys associated with the sender's and receiver's X.509 credentials directly;
the latter uses the X.509 credentials to establish a session key that then provides the message
protection.
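The integrity and replay-prevention mechanisms just listed can be sketched as follows. An HMAC stands in for the WS-Security signature, the session key stands in for the key established from the X.509 credentials, and the message format is hypothetical; this is only a sketch of the per-message checks, not of the SOAP/WS-Security wire format.

```python
import hmac, hashlib, os

def protect(session_key: bytes, body: bytes) -> dict:
    """Sender side: attach a fresh nonce and a MAC over nonce + body."""
    nonce = os.urandom(16).hex()
    mac = hmac.new(session_key, nonce.encode() + body, hashlib.sha256).hexdigest()
    return {"nonce": nonce, "body": body, "mac": mac}

class Receiver:
    def __init__(self, session_key: bytes):
        self.key, self.seen = session_key, set()

    def accept(self, msg) -> bool:
        expected = hmac.new(self.key, msg["nonce"].encode() + msg["body"],
                            hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, msg["mac"]):
            return False                      # integrity check failed
        if msg["nonce"] in self.seen:
            return False                      # replayed message rejected
        self.seen.add(msg["nonce"])
        return True
```

Encryption of the body (the confidentiality protection) is omitted here; adding it would mean encrypting `body` under the same session key before computing the MAC.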
5.3.4 Authentication and Delegation

GSI has traditionally supported authentication and delegation through the use of X.509
certificates and public keys. As a new feature in GT4, GSI also supports authentication through
plain usernames and passwords as a deployment option. GSI uses X.509 certificates to identify
persistent users and services.
As a central concept in GSI authentication, a certificate includes four primary pieces of
information:
(1) a subject name, which identifies the person or object that the certificate represents;
(2) the public key belonging to the subject;
(3) the identity of a CA that has signed the certificate to certify that the public key and the
identity both belong to the subject;
(4) the digital signature of the named CA.
X.509 provides each entity with a unique identifier (i.e., a distinguished name) and a method
to assert that identifier to another party through the use of an asymmetric key pair bound to
the identifier by the certificate.
Grid deployments around the world have established their own CAs based on third-party
software to issue the X.509 certificate for use with GSI and the Globus Toolkit.
GSI also supports delegation and single sign-on through the use of standard X.509 proxy
certificates. Proxy certificates allow bearers of X.509 certificates to delegate their privileges
temporarily to another entity. For the purposes of authentication and authorization, GSI treats
certificates and proxy certificates equivalently.
Authentication with X.509 credentials can be accomplished either via TLS, in the case of
transport-level security, or via signature as specified by WS-Security, in the case of message-
level security.
Mutual Authentication between Two Parties
Mutual authentication is a process by which two parties with certificates signed by CAs prove
to each other that they are who they say they are, based on the certificates and the trust of the
CAs that signed them. GSI uses the Secure Sockets Layer (SSL) for its
mutual authentication protocol, which is described in Figure 5.5.

Figure 5.5 Multiple handshaking in a mutual authentication scheme.
To mutually authenticate,
 the first person (Alice) establishes a connection to the second person (Bob) to start the
authentication process.
 Alice gives Bob her certificate. The certificate tells Bob who Alice is claiming to be (the
identity), what Alice‘s public key is, and what CA is being used to certify the certificate.
 Bob will first make sure the certificate is valid by checking the CA‘s digital signature to
make sure the CA actually signed the certificate and the certificate hasn‘t been tampered
with. Once Bob has checked out Alice‘s certificate, Bob must make sure Alice really is
the person identified in the certificate.
 Bob generates a random message and sends it to Alice, asking Alice to encrypt it.
 Alice encrypts the message using her private key, and sends it back to Bob.
 Bob decrypts the message using Alice‘s public key. If this results in the original random
message, Bob knows Alice is who she says she is.
 Now that Bob trusts Alice‘s identity, the same operation must happen in reverse. Bob
sends Alice his certificate, and Alice validates the certificate and sends a challenge
message to be encrypted.
 Bob encrypts the message and sends it back to Alice, and Alice decrypts it and
compares it with the original. If it matches, Alice knows Bob is who he says he is.
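The challenge-response at the heart of these steps can be shown with textbook RSA numbers. Real GSI uses X.509 certificates over SSL; the tiny keypair here (N = 3233) is only for illustrating how Bob verifies that Alice holds the private key matching the public key in her certificate.

```python
import random

# Textbook RSA keypair: N = 61 * 53 = 3233, public exponent E, private exponent D.
N, E, D = 3233, 17, 2753

def encrypt_with_private(m: int) -> int:
    """Alice 'encrypts' (signs) the challenge with her private key."""
    return pow(m, D, N)

def decrypt_with_public(c: int) -> int:
    """Bob recovers the challenge using Alice's public key from her certificate."""
    return pow(c, E, N)

# Bob generates a random challenge, Alice responds, Bob compares:
challenge = random.randrange(2, N)
response = encrypt_with_private(challenge)
alice_authenticated = decrypt_with_public(response) == challenge
```

The same exchange then runs in the reverse direction with Bob's keypair, which is the "same operation must happen in reverse" step in the list above.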
5.3.5 Trust Delegation
To reduce or even avoid the number of times the user must enter his passphrase (when several
grids are used, or when agents, local or remote, request services on behalf of a user), GSI
provides a delegation capability: a delegation service that offers an interface to allow
clients to delegate (and renew) X.509 proxy certificates to a service.
The interface to this service is based on the WS-Trust specification. A proxy consists of a new
certificate and a private key. The key pair that is used for the proxy, that is, the public key
embedded in the certificate and the private key, may either be regenerated for each proxy or be
obtained by other means. The new certificate contains the owner‘s identity, modified slightly to
indicate that it is a proxy. The new certificate is signed by the owner, rather than a CA.

Figure 5.6 A sequence of trust delegations in which new certificates are signed by the owners
rather than by the CA.
The certificate also includes a time notation after which the proxy should no longer be accepted
by others. Proxies have limited lifetimes. Because the proxy isn‘t valid for very long, it doesn‘t
have to stay quite as secure as the owner‘s private key, and thus it is possible to store the
proxy‘s private key in a local storage system without being encrypted, as long as the
permissions on the file prevent anyone else from looking at them easily.
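The limited-lifetime property can be sketched as a validity-window check; the field names and the default 12-hour lifetime are illustrative assumptions, not values mandated by GSI.

```python
from datetime import datetime, timedelta, timezone

def make_proxy(owner: str, lifetime_hours: int = 12) -> dict:
    """A proxy credential carries a short expiry (the 'time notation')."""
    now = datetime.now(timezone.utc)
    return {"owner": owner, "not_after": now + timedelta(hours=lifetime_hours)}

def proxy_valid(proxy: dict) -> bool:
    """Verifiers reject the proxy once its lifetime has elapsed."""
    return datetime.now(timezone.utc) < proxy["not_after"]
```

The short window is what justifies storing the proxy's private key unencrypted on disk: even if it leaks, it expires quickly.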

Once a proxy is created and stored, the user can use the proxy certificate and private key for
mutual authentication without entering a password. When proxies are used, the mutual
authentication process differs slightly. The remote party receives not only the proxy‘s certificate
(signed by the owner), but also the owner‘s certificate.
During mutual authentication, the owner‘s public key (obtained from her certificate) is used to
validate the signature on the proxy certificate. The CA‘s public key is then used to validate the
signature on the owner‘s certificate. This establishes a chain of trust from the CA to the last
proxy through the successive owners of resources.
The GSI uses WS-Security with textual usernames and passwords. When using usernames and
passwords as opposed to X.509 credentials, the GSI provides authentication, but no advanced
security features such as delegation, confidentiality, integrity, and replay prevention. However,
one can use usernames and passwords with anonymous transport-level security such as
unauthenticated TLS to ensure privacy.

5.4 Cloud Infrastructure security: network, host and application level

The foundational infrastructure for a cloud must be inherently secure, whether it is a private or
public cloud and whether the service is SaaS, PaaS, or IaaS. It will require:

 Inherent component-level security: The cloud needs to be architected to be secure, built
with inherently secure components, deployed and provisioned securely with strong
interfaces to other components, and supported securely with vulnerability-assessment and
change-management processes that produce management information and service-level
assurances that build trust.
 Stronger interface security: The points in the system where interaction takes place (user-
to-network, server-to-application) require stronger security policies and controls that ensure
consistency and accountability.
 Resource lifecycle management: The economics of cloud computing are based on multi-
tenancy and the sharing of resources. As customers' needs and requirements change, a
service provider must correspondingly provision and decommission those resources -
bandwidth, servers, storage, and security. This lifecycle process must be managed in
order to build trust.

Infrastructure security can be viewed, assessed, and implemented according to its building
levels: the network, host, and application levels.

Infrastructure Security - The Network Level

When looking at the network level of infrastructure security, it is important to distinguish
between public clouds and private clouds. With private clouds, there are no new attacks,
vulnerabilities, or changes in risk specific to this topology that information security personnel
need to consider. If public cloud services are chosen, changing security requirements will
require changes to the network topology, and the manner in which the existing network
topology interacts with the cloud provider's network topology should be taken into account.
There are four significant risk factors in this use case:

 Ensuring the confidentiality and integrity of an organization's data-in-transit to and from a
public cloud provider;
 Ensuring proper access control (authentication, authorization, and auditing) to whatever
resources are used at the public cloud provider;
 Ensuring the availability of the Internet-facing resources in a public cloud that are being
used by an organization, or have been assigned to an organization by public cloud
providers;
 Replacing the established model of network zones and tiers with domains.

Infrastructure Security - The Host Level

When reviewing host security and assessing risks, the context of cloud service delivery models
(SaaS, PaaS, and IaaS) and deployment models (public, private, and hybrid) should be
considered. The host security responsibilities in SaaS and PaaS services are transferred to the
cloud service provider. IaaS customers are primarily responsible for securing the hosts
provisioned in the cloud (virtualization software security, customer guest OS or virtual server
security).

Infrastructure Security - The Application Level

Application or software security should be a critical element of a security program. Most
enterprises with information security programs have yet to institute an application security
program to address this realm. Designing and implementing applications aimed at deployment
on a cloud platform will require existing application security programs to reevaluate current
practices and standards. The application security spectrum ranges from standalone single-user
applications to sophisticated multiuser e-commerce applications. This level is responsible for
managing:

 Application-level security threats;
 End user security;
 SaaS application security;
 PaaS application security;
 Customer-deployed application security;
 IaaS application security;
 Public cloud security limitations.

Different types of attacks and preventive methods at various levels

Network Level

 DNS attack: Traffic between a sender and a receiver gets rerouted through a malicious
host. Preventive method: Domain Name System Security Extensions (DNSSEC) reduce
the effects of DNS threats.
 Eavesdropping: An attacker monitors network traffic in transit and interprets all
unprotected data. Preventive methods: Internet Protocol Security (IPsec); implement
security policies and procedures; install anti-virus software.
 DoS attack: Prevents authorized users from accessing services on the network.
Preventive methods: a firewall, provided it is configured properly; enforce strong
password policies.
 Distributed Denial of Service (DDoS): Attack against a single network from multiple
computers or systems. Preventive methods: limit the number of ICMP and SYN packets
on router interfaces; filter private IP addresses using router access control lists.
 Sniffer attack: Unencrypted data flowing in the network gives an attacker the chance to
read vital information. Preventive methods: detect sniffers based on ARP and RTT;
implement IPsec to encrypt network traffic; the system administrator can tighten security,
e.g., with one-time password or ticketing authentication.
 Issues of reused IP addresses: An IP address is reassigned and reused by another
customer while the address still exists in the DNS cache, violating the privacy of the
original user. Preventive method: clear old ARP addresses from the cache.
 BGP prefix hijacking: A network attack in which a wrong announcement is made for IP
addresses associated with an autonomous system (AS). Preventive methods: filtering
and MD5/TTL protection (preventing the source of most attacks).

Host Level

 Security concerns with the hypervisor: A single hardware unit running multiple operating
systems is difficult to monitor; malicious code may get control of the system and block
other guest OSes. Preventive method: Hooksafe, which can provide generic protection
against kernel-mode rootkits.
 Securing virtual servers: Self-provisioning new virtual servers on an IaaS platform
creates a risk of insecure virtual servers. Preventive method: operational security
procedures need to be followed.

Application Level

 Cookie poisoning: An unauthorized person can change or modify the content of cookies.
Preventive methods: cookies should be avoided, or regular cookie cleanup is necessary.
 Backdoor and debug options: Debug options left enabled unnoticed provide an easy
entry for a hacker into the website and let him make changes at the website level.
Preventive methods: scan the system periodically for SUID/SGID files; check
permissions and ownership of important files and directories periodically.
 Hidden field manipulation: Certain fields in the website are hidden and used by the
developers; a hacker can easily modify them on the web page. Preventive method:
avoid putting parameters into a query string.
 DoS attack: Services used by authorized users become unavailable to them. Preventive
methods: an Intrusion Detection System (IDS) is the most popular method of defence
against this type of attack; preventive tools include firewalls, switches, and routers.
 Distributed Denial of Service (DDoS): Makes the service unavailable to authorized
users, similar to a DoS attack but different in the way it is launched. Preventive methods:
firewalls, switches, routers, application front-end hardware, IPS-based prevention, etc.
 Google hacking: The Google search engine is the best option for a hacker to access
sensitive information. Preventive methods: prevent sharing of any sensitive information;
software solutions such as a web vulnerability scanner.
 SQL injection: Malicious code is inserted into standard SQL code to gain unauthorized
access to a database. Preventive method: avoid the usage of dynamically generated
SQL in the code.
 Cross-site scripting: Malicious scripts are injected into web content. Preventive methods:
various techniques to detect security flaws, such as active content filtering, content-
based data leakage prevention technology, and web application vulnerability detection
technology.

5.5 Aspects of Data Security in the Cloud


This refers to security for:

a. Data in transit
b. Data at rest
c. Processing of data, including multi-tenancy
d. Data lineage
e. Data provenance
f. Data remanence

a. Data in Transit

 Data in transit is data being accessed over the network, and therefore could be
intercepted by someone else on the network or with access to the physical media the
network uses.
 On an Ethernet network, this could be someone with the ability to tap a cable, configure
a switch to mirror traffic, or fool the client or a router into directing traffic to them before it
moves on to the final destination.
 On a wireless network, all they need is to be within range. Wireless networks can be
protected from unauthorized snooping by encrypting all traffic.
 When protocols like TELNET, HTTP, FTP, SMTP, POP, IMAP, or LDAP are used, and if
someone has access to your network traffic and a readily available tool like Wireshark,
they can intercept your traffic and read your email, copy your credentials, or even
duplicate files.
 You need to protect your data‘s confidentiality and your own privacy by encrypting this
traffic using SSL/TLS, or switching to an encrypted equivalent. TELNET can be replaced
by SSH. FTP can be replaced by SFTP. The rest can use encrypted transport with SSL
or TLS.
 When data is encrypted in transit, it can only be compromised if the session key can be
compromised.
 Some encryption in transit will use symmetric encryption and a set session key, but most
will use a certificate and asymmetric encryption to securely exchange a session key and
then use that session key for symmetric encryption to provide the fastest
encryption/decryption.
 In any protocol that uses SSL or TLS, certificates are used to exchange public keys,
and the public keys are then used to securely exchange session keys, making it very
difficult for an attacker to defeat.
 Most encrypted protocols include a hashing algorithm to ensure no data was altered in
transit. This can also help defeat "man in the middle" (MitM) attacks, because by
decrypting and re-encrypting data, the attacker will alter the signature even if they don't
change any of the key data.
 Always use certificates from a third-party Certificate Authority, and never accept a
certificate when your client software warns about an untrusted certificate.
 Encryption in transit should be mandatory for any network traffic that requires
authentication or includes data that is not publicly accessible. It is not necessary to
encrypt your public-facing website, but if customers log on to view things, use
encryption to protect both the logon data and their privacy.
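Python's standard `ssl` module illustrates the advice above: the default context both encrypts the connection and refuses untrusted certificates rather than accepting them. The host and request format below are illustrative; only the `ssl`/`socket` calls are standard library behavior.

```python
import socket, ssl

def fetch_https(host: str, path: str = "/") -> bytes:
    """Fetch a page over TLS, verifying the server certificate against trusted CAs."""
    # create_default_context() sets CERT_REQUIRED and enables hostname checking,
    # so an untrusted or mismatched certificate aborts the handshake.
    context = ssl.create_default_context()
    with socket.create_connection((host, 443)) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
            return tls.recv(4096)
```

Because verification is on by default, this client fails closed: it behaves like client software that warns about (and here, rejects) an untrusted certificate, rather than silently proceeding.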
b. Data at rest

 Encryption of data stored on media is used to protect the data from unauthorized
access should the media ever be stolen. Physical access can get past file system
permissions, but if the data is stored in encrypted form and the attacker does not have
the decryption key, they can't access the data.
 Most encryption at rest uses a symmetric algorithm so that data can be very quickly
encrypted and decrypted. However, since the symmetric key itself needs to be
protected, they can use a PIN, password, or even a PKI certificate on a smart card to
secure the symmetric key, making it very difficult for an attacker to compromise.

 Hashing algorithms can be used on files at rest to calculate their hash value and
compare it later, to quickly and easily detect any changes to the data. Checksums or
hashes are commonly used to validate a downloaded file.
 Encryption at rest should be mandatory for any media that can possibly leave the
physical boundaries of your infrastructure. USB keys, external drives, backup tapes,
and the hard drives of all laptops should be encrypted without exception. Encrypt the
hard drives of all servers too.
 Examples of encryption at rest include the AES-encrypted portable media, some of
which include a fingerprint reader for two-factor authentication, and Bitlocker in
Windows operating systems to secure both the system drives and external media.
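The change-detection use of hashing can be sketched with the standard `hashlib` module: compute a SHA-256 checksum of a file at rest, store it, and compare later to detect any modification.

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """SHA-256 checksum of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Any change to the file, even a single byte, yields a completely different digest, which is what makes checksums effective for validating downloads and detecting tampering.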

c. Multi-tenancy Issues in the Cloud


Multi-tenancy means that tenants share a pool of resources while having opposing goals,
which creates conflicts of interest between them.
The issues to be addressed are :

• How does multi-tenancy deal with conflicts of interest?
– Can tenants get along together and 'play nicely'?
– If they can't, can we isolate them?
• How do we provide separation between tenants?
Multi-tenancy issues in the cloud can be minimized:
 A tenant cannot really force the provider to accept fewer tenants, but can try to increase
isolation between tenants by using strong isolation techniques. QoS requirements need
to be met with clear policy specification.
 A tenant can try to increase trust by defining who the insider is, where the security
boundary lies, and who can be trusted, and by providing SLAs to enforce trusted
behavior.
d. Data lineage
Data lineage is the visualization of the path data follows from the time it is transferred
to the cloud. Establishing lineage is time-consuming, and in the case of a public cloud
service it may not be possible. Even when data lineage can be established for a public
cloud service, the more challenging task is to provide integrity as well as provenance.
Integrity of data means that the data has not been accessed or modified by an
unauthorized person.
e. Data Provenance
Data provenance refers to computational accuracy in addition to integrity: not only that the
data is intact, but also an accurate record of how it was produced.
• Cloud provenance can be
– Data provenance:
Who created, modified, deleted data stored in a cloud (external entities change
data)
– Process provenance:
What happened to data once it was inside the cloud (internal entities change
data)
• Cloud provenance should give a record of who accessed the data at different times



• Auditors should be able to trace an entry (and associated modification) back to the
creator

The Secure Provenance (SPROV) scheme automatically collects data provenance at the
application layer. It provides security assurances of confidentiality and integrity of the data
provenance. In this scheme, confidentiality is ensured by employing state-of-the-art
encryption techniques, while integrity is preserved using the digital signature of the user who
takes any action. Each record in the data provenance includes the signed checksum of the
previous record in the chain.
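The chained-checksum idea can be sketched in Python. SPROV itself uses per-user digital signatures; an HMAC stands in here as a simpler integrity primitive, and the key and field names are illustrative.

```python
import hashlib
import hmac
import json

def append_record(chain, action, user, key):
    """Append a provenance record whose MAC covers the previous record's MAC,
    so tampering with any entry breaks every later link in the chain."""
    prev = chain[-1]["mac"] if chain else ""
    body = json.dumps({"action": action, "user": user, "prev": prev}, sort_keys=True)
    mac = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    chain.append({"body": body, "mac": mac})

def verify_chain(chain, key):
    """Recompute every MAC and check each record points at its predecessor."""
    prev = ""
    for rec in chain:
        expected = hmac.new(key, rec["body"].encode(), hashlib.sha256).hexdigest()
        if expected != rec["mac"] or json.loads(rec["body"])["prev"] != prev:
            return False
        prev = rec["mac"]
    return True
```

Because each record embeds the MAC of its predecessor, an auditor can trace any entry back to the creator, and a modified record invalidates the whole suffix of the chain.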

f. Data Remanence.
Data remanence is the residual representation of data that remains even after the data has
been deleted or erased. It may lead to disclosure of sensitive information when storage
media is released into an uncontrolled environment (e.g., thrown in the trash, or lost).
Various techniques have been developed to counter data remanence. These techniques are
classified as clearing, purging/sanitizing or destruction.
Specific methods include overwriting, degaussing, encryption, and media destruction.
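A crude overwriting ("clearing") pass can be sketched in Python. This is illustration only: on SSDs, journaling file systems, and cloud block storage, in-place overwrites do not guarantee the removal of remanent data, which is why purging, degaussing, or destruction exist as stronger classes.

```python
import os

def overwrite_file(path: str, passes: int = 3) -> None:
    """Overwrite a file's contents with random bytes, sync to disk each pass,
    then delete it. A sketch of 'clearing' -- not a guarantee on modern media."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))
            f.flush()
            os.fsync(f.fileno())  # push the overwrite past OS buffers
    os.remove(path)
```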

Challenges/Issues in Data Security

 The need to protect confidential business, government, or regulatory data


 Cloud service models with multiple tenants sharing the same infrastructure
 Data mobility and legal issues relative to such government rules as the EU Data Privacy
Directive
 Lack of standards about how cloud service providers securely recycle disk space and
erase existing data
 Auditing, reporting, and compliance concerns
 Loss of visibility to key security and operational intelligence that no longer is available to
feed enterprise IT security intelligence and risk management
 A new type of insider who does not even work for your company, but may have control
and visibility into your data

Handling Data Security

When using a public cloud, the provider must be capable of ensuring both confidentiality
and integrity. Secure transfer protocols such as FTPS, HTTPS, and SCP protect data in transit;
simply encrypting the data and sending it over a non-secure protocol such as FTP or HTTP
provides confidentiality, but the integrity of the data is not ensured.

In IaaS, encrypting stored data is feasible and strongly recommended. In PaaS- and SaaS-based
applications, however, encryption is often not a viable compensating control because it may
prevent the application from searching or indexing the data.

Cloud applications can be designed with data tagging to prevent unauthorized access to user
data. All data should be encrypted when transferred to or received from the cloud, but there
is no general method for processing encrypted data. IBM developed a fully homomorphic
encryption scheme; using this scheme, we can process the data without



decryption. The limitation of this scheme is that it requires immense computational effort.
Cloud clients should always know exactly where their data is stored and when it has been
updated.
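The homomorphic idea can be illustrated with textbook (unpadded) RSA, which is only *multiplicatively* homomorphic: multiplying two ciphertexts multiplies the underlying plaintexts. Fully homomorphic schemes such as IBM's support arbitrary computation, at far greater cost. The tiny primes here are for demonstration and are wholly insecure.

```python
# Toy unpadded RSA with small textbook primes (insecure -- illustration only).
p, q, e = 61, 53, 17
n = p * q                            # modulus, 3233
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (Python 3.8+)

enc = lambda m: pow(m, e, n)
dec = lambda c: pow(c, d, n)

# Multiply the *ciphertexts*: the cloud operates on encrypted data only.
c = (enc(6) * enc(7)) % n
assert dec(c) == 42                  # 6 * 7 computed without decrypting inputs
```

The cloud never sees 6, 7, or 42 in the clear, yet the client recovers the product on decryption, which is the essence of processing data without decryption.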

An effective cloud security solution should incorporate three key capabilities:

 Data lockdown
 Access policies
 Security intelligence

First, make sure that data is not readable and that the solution offers strong key management.
Second, implement access policies that ensure only authorized users can gain access to
sensitive information, so that even privileged users such as root user cannot view sensitive
information. Third, incorporate security intelligence that generates log information, which can be
used for behavioral analysis to provide alerts that trigger when users are performing actions
outside of the norm.

Meanwhile, conventional security considerations must be addressed in the cloud environment.


These include implementing best practices and real-time security intelligence, protecting data
security, and preventing advanced persistent threats (APTs) or attacks that exploit social
engineering. It's also critical to plan for the added risks posed by big data mined across different
cloud environments and mobile devices that store information in the cloud infrastructure.

Cloud Service Provider – Data Security Issues

Security assessment in an organization must be carried out periodically by persons who can
identify and fix problems efficiently. Shared risk arises in multi-tier service arrangements,
where a service provider acquires the infrastructure it needs from another service provider;
an incident at any tier potentially affects all parties.

Staff security screening is important because cloud providers commonly employ contractors,
who should undergo the same background investigations that your own employee policy requires.

Distributed data centers are needed to make the cloud service provider less prone to
geographical disasters, thereby reducing reliance on a periodically tested disaster recovery
plan.

Physical security means that clients must have good knowledge of the security levels of each
cloud service provider. The provider's coding practices should follow standard methods that
can be documented and demonstrated to the client, to assure a secure coding process. Data
leakage is a drawback of every cloud provider, so it is always recommended that data be
transmitted and received in encrypted form.

While many organizations have implemented encryption for data security, they often overlook
inherent weaknesses in key management, access control, and monitoring of data access. If
encryption keys are not sufficiently protected, they are vulnerable to theft by malicious hackers.



Vulnerability also lies in the access control model; thus, if keys are appropriately protected but
access is not sufficiently controlled or robust, malicious or compromised personnel can attempt
to access sensitive data by assuming the identity of an authorized user.

The encryption implementation must incorporate a robust key management solution to provide
assurance that the keys are sufficiently protected. It's critical to audit the entire encryption and
key management solution.

Therefore, any data-centric approach must incorporate encryption, key management, strong
access controls, and security intelligence to protect data in the cloud and provide the requisite
level of security. By implementing a layered approach that includes these critical elements,
organizations can improve their security posture more effectively and efficiently than by focusing
exclusively on traditional network-centric security methods.

The strategy should incorporate a blueprint approach that addresses compliance requirements
and actual security threats. Best practices should include securing sensitive data, establishing
appropriate separation of duties between IT operations and IT security, ensuring that the use of
cloud data conforms to existing enterprise policies, as well as strong key management and strict
access policies.

Database Integrity Issues

Database integrity requires the following three goals:

 Prevention of the modification of information by unauthorized users

 Prevention of the unauthorized or unintentional modification of information by authorized users

 Preservation of both internal and external consistency:

Internal consistency — Ensures that internal data is consistent. For example, assume that an
internal database holds the number of units of a particular item in each department of an
organization. The sum of the number of units in each department should equal the total number
of units that the database has recorded internally for the whole organization.

External consistency — Ensures that the data stored in the database is consistent with the real
world. Using the preceding example, external consistency means that the number of items
recorded in the database for each department is equal to the number of items that physically
exist in that department.

Security challenges in Cloud service models

Specific security challenges pertain to each of the three cloud service models—Software as a
Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).

◗ SaaS deploys the provider's applications running on a cloud infrastructure; it offers anywhere
access, but it also increases security risk. With this service model it's essential to implement
policies for identity management and access control to applications. For example, with
Salesforce.com, only certain salespeople may be authorized to access and download
confidential customer sales information.



◗ PaaS is a shared development environment, such as Microsoft™ Windows Azure, where the
consumer controls deployed applications but does not manage the underlying cloud
infrastructure. This cloud service model requires strong authentication to identify users, an audit
trail, and the ability to support compliance regulations and privacy mandates.

◗ IaaS lets the consumer provision processing, storage, networks, and other fundamental
computing resources and controls operating systems, storage, and deployed applications. As
with Amazon Elastic Compute Cloud (EC2), the consumer does not manage or control the
underlying cloud infrastructure. Data security is typically a shared responsibility between the
cloud service provider and the cloud consumer. Data encryption without the need to modify
applications is a key requirement in this environment to remove the custodial risk of IaaS
infrastructure personnel accessing sensitive data.

5.6 Identity Management and Access Control


Identity management and access control are fundamental functions required for secure cloud
computing. The simplest form of identity management is logging on to a computer system with a
user ID and password. However, true identity management, such as is required for cloud
computing, requires more robust authentication, authorization, and access control.

An identity management system should determine which resources a user or process is
authorized to access, using technology such as biometrics or smart cards, and should detect
when a resource has been accessed by unauthorized entities.

5.6.1 Identity Management


Identification and authentication are the keystones of most access control systems.

Identification is the act of a user professing an identity to a system, usually in the form of a
username or user logon ID to the system. Identification establishes user accountability for the
actions on the system. User IDs should be unique and not shared among different individuals. In
many large organizations, user IDs follow set standards, such as first initial followed by last
name, and so on. In order to enhance security and reduce the amount of information available
to an attacker, an ID should not reflect the user‘s job title or function.

Authentication is verification that the user‘s claimed identity is valid, and it is usually
implemented through a user password at logon. Authentication is based on the following three
factor types:

 Type 1 — Something you know, such as a personal identification number (PIN) or


password
 Type 2 — Something you have, such as an ATM card or smart card
 Type 3 — Something you are (physically), such as a fingerprint or retina scan

Sometimes a fourth factor, something you do, is added to this list. Something you do might be
typing your name or other phrases on a keyboard.

Two-factor authentication requires two of the three factors to be used in the authentication
process. For example, withdrawing funds from an ATM requires two-factor


authentication in the form of the ATM card (something you have) and a PIN (something
you know).

 Passwords
o Because passwords can be compromised, they must be protected. In the ideal case,
a password should be used only once. This "one-time password," or OTP, provides
maximum security because a new password is required for each new logon.
o A password that is the same for each logon is called a static password. A password
that changes with each logon is termed a dynamic password. The changing of
passwords can also fall between these two extremes.
o Passwords can be required to change monthly, quarterly, or at other intervals,
depending on the criticality of the information needing protection and the password‘s
frequency of use. Obviously, the more times a password is used, the more chance
there is of it being compromised.
o A passphrase is a sequence of characters that is usually longer than the allotted
number for a password. The passphrase is converted into a virtual password by the
system.
o In all these schemes, a front-end authentication device or a back-end authentication
server, which services multiple workstations or the host, can perform the
authentication.
o Passwords can be provided by a number of devices, including tokens, memory
cards, and smart cards.
 Tokens

Tokens, in the form of small, hand-held devices, are used to provide passwords.

The following are the four basic types of tokens:

1. Static password tokens


 Owners authenticate themselves to the token by typing in a secret
password.
 If the password is correct, the token authenticates the owner to an
information system.

2. Synchronous dynamic password tokens, clock-based


 The token generates a new, unique password value at fixed time
intervals that is synchronized with the same password on the
authentication server (this password is the time of day encrypted with
a secret key).
 The unique password is entered into a system or workstation along
with an owner‘s PIN.
 The authentication entity in a system or workstation knows an owner‘s
secret key and PIN, and the entity verifies that the entered password
is valid and that it was entered during the valid time window.

3. Synchronous dynamic password tokens, counter-based



 The token increments a counter value that is synchronized with a
counter in the authentication server.
 The counter value is encrypted with the user‘s secret key inside the
token and this value is the unique password that is entered into the
system authentication server.
 The authentication entity in the system or workstation knows the
user's secret key, and the entity verifies that the entered password is
valid by performing the same encryption on its identical counter value.
4. Asynchronous tokens, challenge-response
 A workstation or system generates a random challenge string, and the
owner enters the string into the token along with the proper PIN.
 The token performs a calculation on the string using the PIN and
generates a response value that is then entered into the workstation
or system.
 The authentication mechanism in the workstation or system performs
the same calculation as the token using the owner‘s PIN and
challenge string and compares the result with the value entered by the
owner. If the results match, the owner is authenticated.
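The counter-based dynamic password of token type 3 can be sketched with Python's standard library, following the HOTP construction (RFC 4226); the demonstration secret below is the RFC's published test key, not a production value.

```python
import hashlib
import hmac
import struct

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Counter-based one-time password (HOTP, RFC 4226 style).
    Token and server both hold `secret` and a synchronized counter."""
    msg = struct.pack(">Q", counter)                 # 8-byte big-endian counter
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 4226 test vector: counter 0 yields "755224" for this secret.
print(hotp(b"12345678901234567890", 0))
```

Each login, both sides increment the counter and compare codes; a captured code is useless for the next counter value, which is what makes the password "dynamic."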
 Memory Cards

Memory cards provide nonvolatile storage of information, but they do not have
any processing capability. A memory card stores encrypted passwords and other
related identifying information. A telephone calling card and an ATM card are
examples of memory cards.

 Smart Cards

Smart cards provide even more capability than memory cards by incorporating
additional processing power on the cards. These credit-card-size devices
comprise microprocessor and memory and are used to store digital signatures,
private keys, passwords, and other personal information.

 Biometrics

An alternative to using passwords for authentication in logical or technical access control is


biometrics. Biometrics is based on the Type 3 authentication mechanism — something you are.

Biometrics is defined as an automated means of identifying or authenticating the identity of a


living person based on physiological or behavioral characteristics.

In biometrics, identification is a one-to-many search of an individual‘s characteristics from a


database of stored images.

Authentication is a one-to-one search to verify a claim to an identity made by a person.


Biometrics is used for identification in physical controls and for authentication in logical controls.

There are three main performance measures in biometrics:



 False rejection rate (FRR) or Type I Error — The percentage of valid subjects that are
falsely rejected.
 False acceptance rate (FAR) or Type II Error — The percentage of invalid subjects that are
falsely accepted.
 Crossover error rate (CER) — The percentage at which the FRR equals the FAR. The
smaller the CER, the better the device is performing.
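These error rates can be illustrated with a small Python sketch over hypothetical matcher scores (assuming a higher score means a better match); sweeping the threshold until FRR equals FAR would locate the CER.

```python
def frr_far(genuine_scores, impostor_scores, threshold):
    """FRR: fraction of genuine attempts scoring below the threshold (rejected).
       FAR: fraction of impostor attempts scoring at/above it (accepted).
       Assumes higher score = stronger match."""
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far

# Hypothetical matcher outputs: raising the threshold trades FAR for FRR.
genuine = [0.9, 0.8, 0.4]
impostor = [0.3, 0.6]
print(frr_far(genuine, impostor, 0.5))
```

Raising the threshold lowers FAR but raises FRR, and vice versa; the device's CER is the operating point where the two curves cross.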

In addition to the accuracy of the biometric systems, other factors must be considered, including
enrollment time, throughput rate, and acceptability.

Enrollment time is the time that it takes to initially register with a system by providing samples
of the biometric characteristic to be evaluated. An acceptable enrollment time is around two
minutes. For example, in fingerprint systems the actual fingerprint is stored and requires
approximately 250KB per finger for a high-quality image. This level of information is required for
one-to-many searches in forensics applications on very large databases.

In finger-scan technology, a full fingerprint is not stored; rather, the features extracted from this
fingerprint are stored by using a small template that requires approximately 500 to 1,000 bytes
of storage. The original fingerprint cannot be reconstructed from this template. Finger-scan
technology is used for one-to-one verification by using smaller databases.

Updates of the enrollment information might be required because some biometric


characteristics, such as voice and signature, might change over time.

The throughput rate is the rate at which the system processes and identifies or authenticates
individuals. Acceptable throughput rates are in the range of 10 subjects per minute.

Acceptability refers to considerations of privacy, invasiveness, and psychological and physical


comfort when using the system. For example, a concern with retina scanning systems might be
the exchange of body fluids on the eyepiece. Another concern would be disclosing the retinal
pattern, which could reveal changes in a person‘s health, such as diabetes or high blood
pressure.

Collected biometric images are stored in an area referred to as a corpus. The corpus is stored in
a database of images. Potential sources of error include the corruption of images during
collection, and mislabeling or other transcription problems associated with the database.
Therefore, the image collection process and storage must be performed carefully with constant
checking.

The following are typical biometric characteristics that are used to uniquely authenticate an
individual‘s identity:

 Fingerprints — Fingerprint characteristics are captured and stored. Typical CERs are 4–
5%.
 Retina scans — The eye is placed approximately two inches from a camera and an
invisible light source scans the retina for blood vessel patterns. CERs are approximately
1.4%.
 Iris scans — A video camera remotely captures iris patterns and characteristics. CER
values are around 0.5%.



 Hand geometry — Cameras capture three-dimensional hand characteristics. CERs are
approximately 2%.
 Voice — Sensors capture voice characteristics, including throat vibrations and air
pressure, when the subject speaks a phrase. CERs are in the range of 10%.
 Handwritten signature dynamics — The signing characteristics of an individual making a
signature are captured and recorded. Typical characteristics include writing pressure
and pen direction. CERs are not published at this time.
 Other types of biometric characteristics include facial and palm scans.

Implementing Identity Management


Identity management includes the following:

 Establishing a database of identities and credentials


 Managing users‘ access rights
 Enforcing security policy
 Developing the capability to create and modify accounts
 Setting up monitoring of resource accesses
 Installing a procedure for removing access rights
 Providing training in proper procedures

An identity management effort can be supported by software that automates many of the
required tasks.

The Open Group and the World Wide Web Consortium (W3C) are working toward a standard
for a global identity management system that would be interoperable, provide for privacy,
implement accountability, and be portable.

Identity management is also addressed by the XML-based eXtensible Name Service (XNS)
open protocol for universal addressing. XNS provides the following capabilities:

 A permanent identification address for a container of an individual‘s personal data and


contact information
 Means to verify whether an individual‘s contact information is valid
 A platform for negotiating the exchange of information among different entities

5.6.2 Access Control


Access control is intrinsically tied to identity management and is necessary to preserve the
confidentiality, integrity, and availability of cloud data.

These and other related objectives flow from the organizational security policy. This policy is a
high-level statement of management intent regarding the control of access to information and
the personnel who are authorized to receive that information.

Three things that must be considered for the planning and implementation of access control
mechanisms are threats to the system, the system‘s vulnerability to these threats, and the risk
that the threats might materialize. These concepts are defined as follows:

 Threat — An event or activity that has the potential to cause harm to the information
systems or networks



 Vulnerability — A weakness or lack of a safeguard that can be exploited by a threat,
causing harm to the information systems or networks
 Risk — The potential for harm or loss to an information system or network; the
probability that a threat will materialize

Controls
Controls are implemented to mitigate risk and reduce the potential for loss.

Two important control concepts are separation of duties and the principle of least privilege.

Separation of duties requires an activity or process to be performed by two or more entities for
successful completion. Thus, the only way that a security policy can be violated is if there is
collusion among the entities. For example, in a financial environment, the person requesting that
a check be issued for payment should not also be the person who has authority to sign the
check.

Least privilege means that the entity that has a task to perform should be provided with the
minimum resources and privileges required to complete the task for the minimum necessary
period of time.

Control measures can be administrative, logical (also called technical), and physical in their
implementation.

 Administrative controls include policies and procedures, security awareness training,


background checks, work habit checks, a review of vacation history, and increased
supervision.
 Logical or technical controls involve the restriction of access to systems and the
protection of information. Examples of these types of controls are encryption, smart
cards, access control lists, and transmission protocols.
 Physical controls incorporate guards and building security in general, such as the locking
of doors, the securing of server rooms or laptops, the protection of cables, the
separation of duties, and the backing up of files.

Controls provide accountability for individuals who are accessing sensitive information in a cloud
environment. This accountability is accomplished through access control mechanisms that
require identification and authentication, and through the audit function.

These controls must be in accordance with and accurately represent the organization‘s security
policy. Assurance procedures ensure that the control mechanisms correctly implement the
security policy for the entire life cycle of a cloud information system.

In general, a group of processes that share access to the same resources is called a protection
domain, and the memory space of these processes is isolated from other running processes.

Models for Controlling Access


Controlling access by a subject (an active entity such as an individual or process) to an object (a
passive entity such as a file) involves setting up access rules. These rules can be classified into
three categories or models.

1. Mandatory Access Control



 The authorization of a subject‘s access to an object depends upon labels, which
indicate the subject‘s clearance, and the classification or sensitivity of the object. For
example, the military classifies documents as unclassified, confidential, secret, and
top secret.
 An individual can receive a clearance of confidential, secret, or top secret and can
have access to documents classified at or below his or her specified clearance level.
Thus, an individual with a clearance of "secret" can have access to secret and
confidential documents with a restriction.
 This restriction is that the individual must have a need to know relative to the
classified documents involved. Therefore, the documents must be necessary for that
individual to complete an assigned task.
 Even if the individual is cleared for a classification level of information, the individual
should not access the information unless there is a need to know.
 Rule-based access control is a type of mandatory access control because rules
determine this access (such as the correspondence of clearance labels to
classification labels), rather than the identity of the subjects and objects alone.
2. Discretionary Access Control
 With discretionary access control, the subject has authority, within certain limitations, to
specify what objects are accessible. For example, access control lists (ACLs) can be
used.
 An access control list is a list denoting which users have what privileges to a particular
resource. For example, a tabular listing would show the subjects or users who have
access to the object, e.g., file X, and what privileges they have with respect to that file.
 An access control triple consists of the user, program, and file, with the corresponding
access privileges noted for each user. This type of access control is used in local,
dynamic situations in which the subjects must have the discretion to specify what
resources certain users are permitted to access.
 When a user within certain limitations has the right to alter the access control to certain
objects, this is termed a user-directed discretionary access control.
 An identity-based access control is a type of discretionary access control based on an
individual‘s identity. In some instances, a hybrid approach is used, which combines the
features of user-based and identity-based discretionary access control.

3. Nondiscretionary Access Control


 A central authority determines which subjects can have access to certain objects based
on the organizational security policy.
 The access controls might be based on the individual‘s role in the organization (role-
based) or the subject‘s responsibilities and duties (task-based).
 In an organization with frequent personnel changes, nondiscretionary access control is
useful because the access controls are based on the individual‘s role or title within the
organization. Therefore, these access controls don‘t need to be changed whenever a
new person assumes that role.
 Access control can also be characterized as context-dependent or content-dependent.



 Context-dependent access control is a function of factors such as location, time
of day, and previous access history. It is concerned with the environment or
context of the data.
 In content-dependent access control, access is determined by the information
contained in the item being accessed.
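The access control list idea from the discretionary model can be sketched as a small Python structure; the object names, user names, and privileges below are hypothetical.

```python
# Hypothetical ACL: for each object, the privileges held by each user.
acl = {
    "fileX": {
        "alice": {"read", "write"},
        "bob": {"read"},
    },
}

def is_allowed(user: str, obj: str, privilege: str) -> bool:
    """Discretionary check: does the ACL grant this user this privilege
    on this object? Unknown users/objects default to denial."""
    return privilege in acl.get(obj, {}).get(user, set())

print(is_allowed("alice", "fileX", "write"))  # alice may write fileX
print(is_allowed("bob", "fileX", "write"))    # bob holds read only
```

Under discretionary control, the object's owner edits these entries directly; a nondiscretionary (role-based) system would instead derive the rows from centrally assigned roles.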

Single Sign-On (SSO)


 Single sign-on (SSO) addresses the cumbersome situation of logging on multiple times
to access different resources. When users must remember numerous passwords and
IDs, they might take shortcuts in creating them that could leave them open to
exploitation.
 In SSO, a user provides one ID and password per work session and is automatically
logged on to all the required applications. For SSO security, the passwords should not
be stored or transmitted in the clear. SSO applications can run either on a user‘s
workstation or on authentication servers.
 The advantages of SSO include having the ability to use stronger passwords, easier
administration of changing or deleting the passwords, and less time to access resources.
 The major disadvantage of many SSO implementations is that once users obtain access
to the system through the initial logon, they can freely roam the network resources
without any restrictions.
 Authentication mechanisms include items such as smart cards and magnetic badges.
Strict controls must be enforced to prevent a user from changing configurations that
another authority sets.
 SSO can be implemented by using scripts that replay the users‘ multiple logins or by
using authentication servers to verify a user‘s identity, and encrypted authentication
tickets to permit access to system services.
 Enterprise access management (EAM) provides access control management services to
Web-based enterprise systems that include SSO.
 SSO can be provided in a number of ways. For example, SSO can be implemented on
Web applications residing on different servers in the same domain by using non-
persistent, encrypted cookies on the client interface. This task is accomplished by
providing a cookie to each application that the user wishes to access.
 Another solution is to build a secure credential for each user on a reverse proxy that is
situated in front of the Web server. The credential is then presented each time a user
attempts to access protected Web applications.
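The authentication-ticket approach above can be sketched in Python: an authentication server issues a MAC-protected, expiring ticket that each application verifies without re-prompting for credentials. This is a simplified stand-in (real deployments use signed/encrypted tokens such as Kerberos tickets or SAML assertions), and the secret key is a placeholder.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"hypothetical-auth-server-key"  # shared with verifying applications

def issue_ticket(user: str, ttl: int = 300) -> str:
    """Issue a MAC-protected ticket carrying the user name and expiry."""
    payload = base64.urlsafe_b64encode(
        json.dumps({"user": user, "exp": time.time() + ttl}).encode())
    mac = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + mac

def verify_ticket(ticket: str):
    """Return the user name if the ticket is authentic and unexpired, else None."""
    payload, _, mac = ticket.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return None  # tampered or forged
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return claims["user"] if time.time() < claims["exp"] else None
```

The user logs on once, receives the ticket, and presents it to each application; any tampering breaks the MAC, and the expiry bounds how long a stolen ticket is useful.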

5.6.3 IAM Practices in the Cloud


There are seven best practices for implementing IAM policies:
1. Create and use IAM roles instead of the root account
When a new AWS account is set up, the IT team or managed service provider will create an
Admin role, institute MFA on the root account, and lock away the account token in a high-
security safe. Admin roles are usually restricted to two or three accounts.

Choosing not to use the root account improves security in a number of ways.



 First, every user is restricted under IAM policies, so that if account keys are inadvertently
shared or stolen, damage is limited to some degree and can be disabled quickly.
 Secondly, IAM roles ensure non-repudiation; no team member can claim to have or have
not done something to affect the environment. IAM users "sign" each action, so every
individual is personally accountable for the work that they do.
2. Grant least privileges
In a large organization, users often require access to only a small portion of an AWS
environment. However, many businesses overlook the process of investigating the minimum
functions required by staff members and third-party consultants, which means that excessive IAM
privileges are granted out of lack of knowledge or simple convenience.

IAM administrators with this mindset introduce a significant degree of risk into an organization‘s
security policy. Greater entitlement than necessary opens the door for human error and
introduces the need for more complex audits; IAM policies greatly simplify an auditor‘s
investigation into who has access to which resources.

Best practice is to grant least privilege — and then grant more privileges on a granular level if
needed.

It is useful to note that S3 is a special service in that one can restrict access both through IAM
and through S3 Bucket Policies; one can further lock down access to an S3 bucket by
stipulating the actions the user can take in that bucket. For example, a user can be granted IAM
access to the bucket, but be denied if they are accessing it from an IP address outside of an IP
range set out in a bucket policy.
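The IP-range restriction described above is normally expressed as an S3 bucket policy document. The sketch below builds such a policy as a Python dictionary; the bucket name and CIDR range are placeholders, while the `NotIpAddress` / `aws:SourceIp` condition keys follow standard AWS policy syntax:

```python
import json

def ip_restricted_bucket_policy(bucket, allowed_cidr):
    """Deny all S3 actions on the bucket unless the request originates
    from the allowed IP range (expressed with the standard AWS
    'Condition' block in a bucket policy document)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideAllowedRange",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}",
                         f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"NotIpAddress": {"aws:SourceIp": allowed_cidr}},
        }],
    }

# Example: print the JSON that would be attached to the bucket.
policy = ip_restricted_bucket_policy("example-bucket", "203.0.113.0/24")
print(json.dumps(policy, indent=2))
```

An explicit Deny such as this overrides any IAM Allow, which is why combining IAM policies with bucket policies gives the layered lock-down described above.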
In a complex healthcare enterprise, the ongoing investigation that entitlement definition requires
often necessitates tasking one or several IT staff with constantly updating, removing, and
re-auditing IAM policies.

What are the requirements of each application? Which S3 buckets need to be accessed by which
teams? Who was fired and who was hired? These new permissions and the reasons for them must be
well documented; this is one form of administrative work that is well worth the red tape.

Work done proactively here will save hours of forensic time if something goes wrong, and if
auditors come in, there is much less that needs to be gathered. The organization can prove
that each function can only be performed by certain people from a central location, so auditing is
fairly simple.
3. MFA everywhere + federated access
Multi-Factor Authentication provides an important level of security in any environment. Even if a
password is shared or gets inadvertently released, malicious users still cannot access the
account. This is particularly important in HIPAA-compliant environments.



However, in all but the smallest environment, requiring MFA access to every resource would be
burdensome. Instead it is best to configure a few entry points such as Bastion hosts or a jump
box, where access is limited and sessions can be logged.
Accounts are often established in Microsoft's Active Directory (AD), and AWS is configured to
respect identity tokens from the AD server; Active Directory Federation Services (ADFS)
issues the identity token. When logging into AD, users are prompted for an MFA token. Once a
user is authenticated against this central authority, the token is respected by AWS. MFA is
required at these entry points, and the MFA token can be provided by a hardware fob or by a
software token installed on a user's phone, whether a commercial product such as Duo or a free
one such as Google Authenticator.
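Software tokens such as Google Authenticator implement the TOTP algorithm (RFC 6238): an HMAC over the current 30-second time step, dynamically truncated to a short decimal code. A minimal standard-library sketch of the HMAC-SHA1 variant follows (for study only; real deployments should use a vetted library):

```python
import hmac, hashlib, struct, time

def totp(secret: bytes, for_time=None, step=30, digits=6):
    """Time-based one-time password (RFC 6238, HMAC-SHA1 variant),
    as generated by software authenticator apps."""
    now = time.time() if for_time is None else for_time
    counter = int(now // step)                       # 30-second time step
    msg = struct.pack(">Q", counter)                 # 8-byte big-endian counter
    digest = hmac.new(secret, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

Because both sides derive the code from a shared secret and the clock, the server can verify the code without the token ever crossing the network in reusable form.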

4. Rotate credentials regularly


AWS recommends as a best practice that all credentials, API access keys, and passwords are
rotated regularly. Any credential left active for too long increases the risk of compromise.
There are a few ways to do this. AWS outlines a manual process using the AWS CLI to create a
second set of credentials, programmatically create and distribute the access key, and disable
the old key. One could also do this by creating two keys and having them overlap (the first is
active from the 1st to the 16th, the other from the 15th to the 30th of the month), then
programmatically disabling old keys.
Creating system roles will simplify the process, as the SDKs (such as Boto for Python) and the
CLI automatically retrieve temporary keys from the AWS instance metadata. However, some key
rotation still needs to be maintained.
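The overlap-and-disable scheme above can be driven by a simple age check over key metadata of the shape that AWS returns (for example, from boto3's list_access_keys). This is a sketch: the 90-day window is an assumed policy, and the field names mirror the AWS metadata format for illustration:

```python
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)  # assumed rotation window

def keys_due_for_rotation(access_keys, now=None):
    """Given key metadata records (dicts with 'AccessKeyId', 'CreateDate',
    and 'Status'), return the IDs of active keys old enough that a
    replacement should be created and the old key disabled."""
    now = now or datetime.now(timezone.utc)
    return [k["AccessKeyId"] for k in access_keys
            if k["Status"] == "Active" and now - k["CreateDate"] > MAX_KEY_AGE]
```

Running such a check on a schedule, creating the replacement key, distributing it, and then disabling the flagged key reproduces programmatically the manual AWS CLI process described above.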
5. Maintain audit logs with CloudTrail and Config
CloudTrail logs every action taken by any IAM user or resource in your AWS environment.
AWS Config is also useful in that it tracks and reports on changes to your AWS environment
itself. Ensuring that these logs are kept is high on every auditor‘s checklist.
Of course, it is not enough merely to enable logs; engineers must also protect the logs
themselves. One key role of IAM is to lock down those logs.
In HIPAA-compliant environments, logs are usually hosted in an S3 bucket that only Admins
have write access to. Logs are further protected by requiring MFA – even for Admin access. It
is also wise to set alarms in CloudWatch for when CloudTrail is disabled and enable SNS
notifications of CloudTrail log delivery. Logicworks uses CloudCheckr to set up sophisticated
alerts with CloudTrail and Config. Another option is to set up a separate AWS account solely for
logs.
6. Regularly re-certify employees
IAM can provide scheduled and ad-hoc compliance reports, including automated violation
notifications, audit assessments, role changes, and hierarchies. Along with ongoing IAM policy
changes, team leaders should perform monthly or quarterly reviews of IAM roles within each
environment.

Managers should re-approve their direct reports' access to resources and applications. With IAM,
this is a fairly straightforward process, where IAM can create a report that outlines each
employee‘s access rights. As long as these roles are meaningful and specific, and business
leaders are given appropriate business context, they can accurately audit and restrict or expand
these roles.
7. Set up and lock down automation tools
Tools like CloudFormation can automate the implementation and configuration of certain
security policies as well as infrastructure. This can act like configuration management for IAM
policies much like Puppet or Chef provides configuration management for Operating Systems.

Since CloudFormation is made to create and destroy infrastructure, it is that much more
important that IAM policies are managed effectively. CloudFormation only runs under the
context of the user running it, or else this powerful tool could become a powerful weapon.
CloudFormation will fail if a user tries to automate a function beyond its IAM role.

IAM puts you in a position of always having control over your environment. This is essential not
only in HIPAA-compliant environments, but in any environment that hosts sensitive or
proprietary data. Through the correct implementation of IAM policies, AWS is fully capable of
hosting sensitive data and may even provide a more granular level of user management
security than a traditional hosting environment.

5.6.4 Achieving availability in the cloud (SaaS, PaaS, IaaS)


Cloud computing is, by all means, a third-party service, and consumers rely heavily on the
service providers for their computing needs. These computing needs range from research to
business to high-performance computing.

Researchers are heavily involved in finding new technologies that can make cloud computing
more reliable from a security, performance, and availability perspective.

Traditionally the resources required for businesses have been locally installed, setup and
maintained by the organizations.

The organizations interact with each other in a very controlled and secured environment. They
often sign the service level agreements (SLAs) that hold each party engaged with certain
accountabilities.

In some situations a downtime of a few hours can lead to a loss of hundreds of thousands of
dollars. Establishing robust monitoring tools and practices will bring long-term benefits in terms
of achieving high availability in the cloud.

Technically there are several levels where high availability can be achieved. These levels
include application level, data center level, infrastructure level and geographic location level.

One of the very basic goals of high availability is to avoid single point of failures as much as
possible to achieve operational continuity, redundancy and fail-over capability.



At the infrastructure level the basic configuration might look like the one in Figure 5.7.
This configuration has two or more load balancers, two or more web servers, and two or more
database servers. The consumer accesses the cloud via the Internet. At each level both active
and passive nodes are provisioned to provide high availability. If one node fails, the
second node takes over the load, reducing downtime. This is replicated at each level of the
configuration, as shown in Figure 5.7.

Figure 5.7 Infrastructure level high availability configuration

Dynamic scalability of the services is one of the very important features of the cloud. This goes
a long way in achieving high availability.

Amazon‘s EC2 scales up the services by provisioning additional servers very easily and in a
short amount of time. It provides dynamic scalability capabilities which help in load balancing
and effective handling of the sudden and unexpected increase in the network traffic.

This dynamic scalability can be programmatically controlled via cloud servers API.
Programmatically controlled environments provide near real time scalability capabilities. With a
single API call several virtual machine instances can be added to a cluster.

And since resources need not be fixed at the beginning of a computation, applications can be
scaled up or scaled down as the requirement to adjust the workload arises.

These adjustments can be in the form of requesting more machines or terminating the ones that
are not needed. Amazon's EC2 provides capabilities to control and manage the resources per
user needs. This in turn helps the web services to provide high availability.
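At its core, a programmatic scaling decision compares measured load against per-instance capacity and clamps the result for redundancy and cost. The sketch below is hypothetical: the capacity figure and bounds are illustrative policy inputs, not EC2 API calls; the returned count would feed an API call that adds or terminates instances:

```python
import math

def desired_instance_count(requests_per_sec, capacity_per_instance,
                           min_instances=2, max_instances=20):
    """Return how many instances a cluster should run for the current
    load, clamped so there is always redundancy (a floor of two nodes,
    matching the active/passive pattern) and a cost ceiling."""
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))
```

Evaluating this on a timer against live traffic metrics is the essence of the near real-time scalability that programmatically controlled environments provide.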



Figure 5.8 illustrates a typical Windows Azure environment and the roles of some of the core
components that are responsible for scalability and availability.

In this environment the physical hardware resources are abstracted away and are exposed as
compute resources for the cloud applications to use them.

The Windows Azure Fabric is controlled by a Fabric Controller. The fabric is responsible
for exposing storage and compute resources by abstracting the underlying hardware.

In addition, the instances of the applications are monitored for availability and scalability;
this is done automatically by the environment. If one instance of the application goes down for
some reason, the Fabric Controller is notified and the application is instantiated in another
virtual machine. This process ensures that application availability is maintained with minimal
downtime, in a consistent manner.

Figure 5.8 Windows Azure and roles.

Open source private cloud vendor Eucalyptus has targeted high-availability in Eucalyptus 3.0.

In Eucalyptus 3.0 there are now multiple controllers for high-availability. The controllers are web
services that help to orchestrate the real time operation of the cloud.

In terms of deployment, Eucalyptus can be spread across two or more racks with separate
controllers on each. The high-availability feature will detect networking, compute, memory, and
hardware failures and then fail over to a working, stable node.

5.6.5 Key privacy issues in the cloud


Privacy and Its Relation to Cloud-Based Information Systems
Information privacy or data privacy is the relationship between collection and dissemination of
data, technology, the public expectation of privacy, and the legal issues surrounding them.



The challenge in data privacy is to share data while protecting personally identifiable
information. The fields of data security and information security design and utilize software,
hardware, and human resources to address this issue.

The ability to control what information one reveals about oneself over the Internet, and who can
access that information, has become a growing concern. These concerns include whether email
can be stored or read by third parties without consent, or whether third parties can track the web
sites someone has visited. Another concern is whether web sites which are visited collect, store,
and possibly share personally identifiable information about users.

Personally identifiable information

(PII), as used in information security, refers to information that can be used to uniquely identify,
contact, or locate a single person or can be used with other sources to uniquely identify a single
individual.

Privacy is an important business issue focused on ensuring that personal data is protected from
unauthorized and inappropriate collection, use, and disclosure, ultimately preventing the loss of
customer trust and inappropriate fraudulent activity such as identity theft, email spamming, and
phishing.

Adhering to privacy best practices is simply good business but is typically ensured by legal
requirements. Many countries have enacted laws to protect individuals‘ right to have their
privacy respected, such as Canada‘s Personal Information Protection and Electronic
Documents Act (PIPEDA), the European Commission‘s directive on data privacy, the Swiss
Federal Data Protection Act (DPA), and the Swiss Federal Data Protection Ordinance.

In the United States, individuals‘ right to privacy is also protected by business-sector regulatory
requirements such as the Health Insurance Portability and Accountability Act (HIPAA), The
Gramm-Leach- Bliley Act (GLBA), and the FCC Customer Proprietary Network Information
(CPNI) rules.

Customer information may be "user data" and/or "personal data."

User data is information collected from a customer, including:

 Any data that is collected directly from a customer (e.g., entered by the customer via an
application‘s user interface)
 Any data about a customer that is gathered indirectly (e.g., metadata in documents)
 Any data about a customer‘s usage behavior (e.g., logs or history)
 Any data relating to a customer's system (e.g., system configuration, IP address)

Personal data (sometimes also called personally identifiable information) is any piece of data
which can potentially be used to uniquely identify, contact, or locate a single person or can be
used with other sources to uniquely identify a single individual.

Examples of personal data include:

 Contact information (name, email address, phone, postal address)


 Forms of identification (Social Security number, driver‘s license, passport, fingerprints)



 Demographic information (age, gender, ethnicity, religious affiliation, sexual orientation,
criminal record)
 Occupational information (job title, company name, industry)
 Health care information (plans, providers, history, insurance, genetic information)
 Financial information (bank and credit/debit card account numbers, purchase history,
credit records)
 Online activity (IP address, cookies, flash cookies, log-in credentials)

A subset of personal data is defined as sensitive and requires a greater level of controlled
collection, use, disclosure, and protection. Sensitive data includes some forms of identification
such as Social Security number, some demographic information, and information that can be
used to gain access to financial accounts, such as credit or debit card numbers and account
numbers in combination with any required security code, access code, or password. Finally, it is
important to understand that user data may also be personal data.
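A common safeguard when handling such sensitive fields is to mask identifiers before data leaves a controlled system. The sketch below redacts US-SSN-shaped and 16-digit card-shaped numbers, keeping only the last four digits; these are toy regular expressions for illustration, and production PII detection requires far more care:

```python
import re

def mask_pii(text):
    """Redact SSN-shaped (ddd-dd-dddd) and 16-digit card-shaped numbers,
    preserving only the last four digits of each match."""
    # Social Security numbers: keep the last four digits only.
    text = re.sub(r"\b\d{3}-\d{2}-(\d{4})\b", r"***-**-\1", text)
    # 16-digit card numbers with optional '-' or ' ' separators.
    text = re.sub(r"\b(?:\d{4}[- ]?){3}(\d{4})\b",
                  r"****-****-****-\1", text)
    return text
```

Masking of this kind addresses the core data-privacy challenge stated earlier: sharing data while protecting the personally identifiable information inside it.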

Privacy Risks and the Cloud


Cloud computing has significant implications for the privacy of personal information as well as
for the confidentiality of business and governmental information. Any information stored locally
on a computer can be stored in a cloud, including email, word processing documents,
spreadsheets, videos, health records, photographs, tax or other financial information, business
plans, PowerPoint presentations, accounting information, advertising campaigns, sales
numbers, appointment calendars, address books, and more.

The entire contents of a user‘s storage device may be stored with a single cloud provider or with
many cloud providers. Whenever an individual, a business, a government agency, or other
entity shares information in the cloud, privacy or confidentiality questions may arise.

A user‘s privacy and confidentiality risks vary significantly with the terms of service and privacy
policy established by the cloud provider. For some types of information and some categories of
cloud computing users, privacy and confidentiality rights, obligations, and status may change
when a user discloses information to a cloud provider.

Disclosure and remote storage may have adverse consequences for the legal status of or
protections for personal or business information. The location of information in the cloud may
have significant effects on the privacy and confidentiality protections of information and on the
privacy obligations of those who process or store the information. Information in the cloud may
have more than one legal location at the same time, with differing legal consequences.

Laws could oblige a cloud provider to examine user records for evidence of criminal activity and
other matters. Legal uncertainties make it difficult to assess the status of information in the
cloud as well as the privacy and confidentiality protections available to users.

Protecting Privacy Information


The Federal Trade Commission is educating consumers and businesses about the importance
of personal information privacy, including the security of personal information. Under the FTC
Act, the Commission guards against unfairness and deception by enforcing companies‘ privacy
promises about how they collect, use, and secure consumers‘ personal information.



The FTC publishes a guide that is a great educational tool for consumers and businesses alike,
titled "Protecting Personal Information: A Guide for Business." In general, the basics for
protecting data privacy are as follows, whether in a virtualized environment, the cloud, or on a
static machine:

Collection: You should have a valid business purpose for developing applications and
implementing systems that collect, use or transmit personal data.

Notice: There should be a clear statement to the data owner of the company's/provider's intended
collection, use, retention, disclosure, transfer, and protection of personal data.

Choice and consent: The data owner must provide clear and unambiguous consent to the
collection, use, retention, disclosure, and protection of personal data.

Use: Once it is collected, personal data must only be used (including transfers to third parties)
in accordance with the valid business purpose and as stated in the Notice.

Security: Appropriate security measures must be in place (e.g., encryption) to ensure the
confidentiality, integrity, and authentication of personal data during transfer, storage, and use.

Access: Personal data must be available to the owner for review and update. Access to
personal data must be restricted to relevant and authorized personnel.

Retention: A process must be in place to ensure that personal data is only retained for the
period necessary to accomplish the intended business purpose or that which is required by law.

Disposal: The personal data must be disposed of in a secure and appropriate manner (e.g.,
using encryption, disk erasure, or paper shredders).

Particular attention should be paid to the privacy of personal information in a SaaS and
managed services environment when

(1) transferring personally identifiable information to and from a customer‘s system,


(2) storing personal information on the customer‘s system,
(3) transferring anonymous data from the customer‘s system,
(4) installing software on a customer‘s system,
(5) storing and processing user data at the company, and
(6) deploying servers.

There should be an emphasis on notice and consent, data security and integrity, and enterprise
control for each of the events above as appropriate.

The Future of Privacy in the Cloud


There has been a good deal of public discussion of the technical architecture of cloud
computing and the business models that could support it; however, the debate about the legal
and policy issues regarding privacy and confidentiality raised by cloud computing has not kept
pace.

The following observations are made on the future of policy and confidentiality in the cloud
computing environment:



 Responses to the privacy and confidentiality risks of cloud computing include better
policies and practices by cloud providers, more vigilance by users, and changes to laws.
 The cloud computing industry could establish standards that would help users to analyze
the difference between cloud providers and to assess the risks that users face.
 Users should pay more attention to the consequences of using a cloud provider and,
especially, to the provider‘s terms of service.
 For those risks not addressable solely through policies and practices, changes in laws
may be needed.
 Users of cloud providers would benefit from greater transparency about the risks and
consequences of cloud computing, from fairer and more standard terms, and from better
legal protections. The cloud computing industry would also benefit.

2 Marks Questions and Answers


UNIT V SECURITY
1. What are the three challenges to establish the trust among grid sites?
 Integration with existing systems and technologies
 Interoperability with different ―hosting environments
 Constructing trust relationships among interacting hosting environments

2. What does the Reputation-Based Trust Model mean?


In a reputation-based model, jobs are sent to a resource site only when the site is trustworthy to
meet users‘ demands. The site trustworthiness is usually calculated from the following
information: the defense capability, direct reputation, and recommendation trust.
3. What does the Fuzzy-Trust Model mean?
In this model, the job security demand (SD) is supplied by the user programs. The trust index
(TI) of a resource site is aggregated through the fuzzy-logic inference process over all related
parameters. The fuzzy inference is accomplished through four steps: fuzzification, inference,
aggregation, and defuzzification.

4. List the categories of authorities.


 Attribute authorities issue attribute assertions;
 Policy authorities issue authorization policies;
 Identity authorities issue certificates.

5. What are the three authorization models?


Subject-push model - The user conducts a handshake with the authority first and then with the
resource site, in sequence.
Resource-pulling model - Puts the resource in the middle; the user checks the resource first.
Authorization agent model - Puts the authority in the middle.
6. List the properties of grid security infrastructure.
 easy to use;



 conforms with security needs while working well with site policies of each resource
provider site
 provides appropriate authentication and encryption of all interactions.

7. List the function of GSI.


 message protection
 authentication
 delegation
 authorization.

8. List the attacks at Network Level.


 DNS attack
 Eavesdropping
 DoS (Denial of Service) attack
 Distributed Denial of Service (DDoS) attack
 Issues of reused IP addresses
 Sniffer attack
 BGP prefix hijacking

9. List the attacks at Host Level.


 Security concerns with the hypervisor
 Securing virtual server

10. List the attacks at Application Level


 Cookie poisoning
 Backdoor and debug options
 Hidden field manipulation
 DoS (Denial of Service) attack
 Distributed Denial of Service (DDoS) attack
 Google hacking
 SQL injection
 Cross-site scripting attacks

11. List the aspects of security in Cloud.


 Data in transit
 Data at rest
 Processing of data, including multi-tenancy
 Data lineage
 Data provenance
 Data remanence
12. List some challenges in data security.
 The need to protect confidential business, government, or regulatory data
 Cloud service models with multiple tenants sharing the same infrastructure



 Lack of standards about how cloud service providers securely recycle disk space and
erase existing data
 Auditing, reporting, and compliance concerns
 Loss of visibility to key security and operational intelligence that no longer is available to
feed enterprise IT security intelligence and risk management
13. Describe Internal Consistency.
Ensures that internal data is consistent. For example, assume that an internal database holds
the number of units of a particular item in each department of an organization. The sum of the
number of units in each department should equal the total number of units that the database
has recorded internally for the whole organization.

14. Describe External Consistency.


Ensures that the data stored in the database is consistent with the real world. Using the
preceding example, external consistency means that the number of items recorded in the
database for each department is equal to the number of items that physically exist in that
department.

15. Discuss about Identification and Authentication.


Identification is the act of a user professing an identity to a system, usually in the form of a
username or user logon ID. Identification establishes user accountability for actions on
the system. User IDs should be unique and not shared among different individuals.
Authentication is verification that the user‘s claimed identity is valid, and it is usually
implemented through a user password at logon.

16. Define Smart Cards.


Smart cards provide even more capability than memory cards by incorporating additional
processing power on the cards. These credit-card-size devices comprise a microprocessor and
memory and are used to store digital signatures, private keys, passwords, and other personal
information.

17. Describe Biometrics.


Biometrics is defined as an automated means of identifying or authenticating the identity of a
living person based on physiological or behavioral characteristics. In biometrics, identification is
a one-to-many search of an individual‘s characteristics from a database of stored images.

18. What does Identity Management include?


 Establishing a database of identities and credentials
 Managing users‘ access rights
 Enforcing security policy
 Developing the capability to create and modify accounts
 Setting up monitoring of resource accesses
 Installing a procedure for removing access rights



 Providing training in proper procedures

19. Define Access Control.


Access control is intrinsically tied to identity management and is necessary to preserve the
confidentiality, integrity, and availability of cloud data. Concepts under Access control are as
follows:

 Threat — An event or activity that has the potential to cause harm to the information
systems or networks
 Vulnerability — A weakness or lack of a safeguard that can be exploited by a threat,
causing harm to the information systems or networks
 Risk — The potential for harm or loss to an information system or network; the
probability that a threat will materialize
20. Define PII.
PII, as used in information security, refers to information that can be used to uniquely identify,
contact, or locate a single person or can be used with other sources to uniquely identify a single
individual.

16 Marks Questions
1. Explain Cloud Infrastructure in detail.
2. Explain Identity and Access Management Architecture in detail.
3. List the key privacy issues in the cloud and explain each in detail.
4. Explain the trust models for the Grid security environment and their challenges.

