Abstract—Cloud computing is the new trend in service delivery, and promises large cost savings and agility for the customers. However, some challenges still remain to be solved before widespread use can be seen. This is especially relevant for enterprises, which currently lack the necessary assurance for moving their critical data and applications to the cloud. The cloud SLAs are simply not good enough.

This paper focuses on the availability attribute of a cloud SLA, and develops a complete model for cloud data centers, including the network. Different techniques for increasing the availability in a virtualized system are investigated, quantifying the resulting availability. The results show that, depending on the failure rates, different deployment scenarios and fault-tolerance techniques can be used for achieving availability differentiation. However, large differences can be seen from using different priority levels for restarting of virtual machines.

Keywords—Availability, cloud, differentiation, SLA

I. INTRODUCTION

Cloud computing presents a new computing paradigm that has attracted a lot of attention lately. It enables on-demand access to a shared pool of highly scalable computing resources that can be rapidly provisioned and released [1]. This is achieved by offering computing resources and services from large data centers, where the physical resources (servers, network, storage) are virtualized and offered as services over a network.

A large part of cloud applications has so far been targeted at consumers with low willingness to pay and low expectations of the service QoS (dependability, performance and security). Recently, more and more enterprises are also investigating how to leverage cloud computing advantages such as the pay-per-use model and rapid elasticity. However, major challenges have to be faced in order for enterprises to trust cloud providers with their core business applications. These challenges are mainly related to QoS, in our view covering dependability, performance and security, and a comprehensive Service Level Agreement (SLA) is needed to cover all these aspects. This is in contrast to the insufficient SLAs offered today.

In this paper, we focus on dependability, and more specifically the availability attribute. Availability is defined in [2] as the readiness for correct service, which can be interpreted as the probability of providing service according to defined requirements. In order to avoid costly downtimes contributing to service unavailability, fault avoidance and fault tolerance are used in the design of dependable systems. Traditionally, fault tolerance has been implemented in hardware, resulting in expensive systems, or using cluster software that is often very specific to each application [3]. In cloud computing, the approach to fault tolerance has mainly been to use cheap, off-the-shelf hardware, allowing failures and then tolerating these in software. The reason for this is partly the large size of cloud data centers, which means that hardware will fail constantly. Adding additional hardware resources to account for failures and letting the failover be handled by software is then a more cost-effective approach than using special-built hardware. Another advantage of virtualization is that virtual instances can be migrated to arbitrary physical machines, sharing redundant capacity among a large number of Virtual Machines (VMs). The standby resources needed are thus much smaller than for a traditional system. However, also with virtualization, fault tolerance means adding additional resources to the system, which adds cost. Fault tolerance should therefore be targeted to specific needs.

Differentiating with respect to fault-tolerance techniques and physical deployment for different applications gives better resource utilization, being cost effective for the provider while still delivering service according to user expectations. In particular, stateful applications require synchronized/updated replicas for tolerating failures, while stateless applications that tolerate short downtimes can be implemented using non-updated replicas. In addition, adding replicas at different physical locations increases the fault tolerance, tolerating failures that may affect specific parts of a cloud data center.

In related work, hardware reliability for cloud data centers has been characterized in [4], and reliability/availability models for cloud data centers are stated as important ongoing work. The availability of a service running in VMs on two physical hosts is modeled in [5], and a non-virtualized system is compared with a virtualized system. A very simple cloud computing availability model is used in [6], combined with performance models to give Quality of Experience (QoE) measures for online services. The authors state the need for availability models for complete cloud data centers.
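The availability attribute defined above is commonly quantified from the mean time to failure (MTTF) and mean time to repair (MTTR). As a minimal numerical sketch (the rates below are assumed example values, not taken from the models developed in this paper), the following Python fragment computes steady-state availability and shows how series and parallel composition of components change it:

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability of a single repairable component."""
    return mttf_hours / (mttf_hours + mttr_hours)

def series(*avails):
    """All components must be up: availabilities multiply."""
    result = 1.0
    for a in avails:
        result *= a
    return result

def parallel(*avails):
    """Up if at least one replica is up (assuming independent failures)."""
    unavail = 1.0
    for a in avails:
        unavail *= (1.0 - a)
    return 1.0 - unavail

# Assumed example: a server with MTTF of 1000 h and MTTR of 8 h.
a_server = availability(1000, 8)            # about 0.9921
a_two_replicas = parallel(a_server, a_server)  # replication raises availability
a_chain = series(a_server, a_server)           # a series chain lowers it
```

The composition rules are the building blocks used later when combining network elements and data center components into an end-to-end model.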
The main contribution of this paper is the effort of modeling a complete cloud system, including the network between the cloud data centers and the customers. The availability resulting from deployment in different physical locations can thus be studied with respect to different failure rates, both in the network and in the cloud infrastructure itself. In addition, these deployment options will be influenced by management software failure rates, an important aspect to include in the analysis.

The rest of this paper is organized as follows. In Section II, we focus on the SLAs offered by commercial cloud providers today and the missing pieces. In Section III, principles for achieving fault tolerance are described, as well as their application in cloud computing. An availability model of a cloud service deployment is described in Section IV, and the different types of failures are discussed. In Section V, different scenarios for VM deployment are described. Numerical results are presented in Section VI. Finally, we conclude the paper in Section VII, together with some thoughts on future work.

II. SLAS IN CLOUD COMPUTING

Cloud computing gives the customers less control of the service delivery, and they need to take precautions in order not to suffer low performance, long downtimes or loss of critical data. Service Level Agreements (SLAs) have therefore become an important part of the cloud service delivery model. An SLA is a binding agreement between the service provider and the service customer, used to specify the level of service to be delivered as well as how measuring, reporting and violation handling should be done. Today, most of the major cloud service providers include QoS guarantees in their SLA proposals, specified in a Service Level Specification (SLS), as seen in Figure 1. The focus in most cases is on dependability, measured as service availability, usually covering a time period of a month or a whole year. Credits are issued if the SLA is violated, e.g. the Amazon EC2 SLA includes an annual uptime percentage of 99.99% and issues 10% service credits.¹

¹ http://aws.amazon.com/ec2-sla/

[Figure 1: The structure of an SLA — the SLA covers measuring, reporting and violation handling, and contains a Service Level Specification (SLS) with SLS parameters and SLS thresholds]

Recently, we have seen many examples where cloud services have been unavailable for the customer, but where the unavailability has not been covered by the SLA.² One reason is that the cloud SLAs are not specific enough when defining availability. From the customer's point of view this is a major drawback. Performance (e.g. response time) above a certain threshold will be perceived by the customer as service unavailability and should be credited accordingly. This issue is covered in [7], where the throughput of a load-balanced application is studied under the events of failures. It is clear that the availability parameter alone is not enough to ensure a satisfactory service delivery.

² http://cloudcomputingfuture.wordpress.com/2011/04/24/why-amazons-cloud-computing-outage-didnt-violate-its-sla

The on-demand characteristic of cloud computing is one aspect that complicates QoS provisioning and SLA management. The cloud infrastructure needs to adjust to changing user demands, resource conditions and environmental issues. Hence, the cloud management system needs to automatically allocate resources to match the SLAs, and also detect possible violations and take appropriate action in order to avoid paying credits. Several challenges for autonomic SLA management still remain. First, resources need to be allocated according to a given SLA. Next, measurements and monitoring are needed to detect possible violations and react accordingly, e.g. by allocating more resources. For availability violations this may require adding more standby resources to handle a given number of failures, and for performance violations this may require moving a VM to another physical machine if the current machine is overloaded. All these actions require a mapping between low-level resource metrics and high-level SLA parameters. One proposal on how to do this mapping is given in [8], where the amount of allocated resources is adjusted on the fly to avoid an SLA violation. Another proposal, for dynamic resource allocation using a feedback control system, is given in [9]. Here, the allocation of physical resources (e.g. physical CPU, memory and input/output) to VMs is adjusted based on measured performance.

With deployment of widely different services in the cloud, there is clearly a need for cloud providers to offer differentiated SLAs with respect to dependability, performance and security. Core business functions such as production systems and billing need a higher availability than applications targeted at consumers, such as email and document handling. Also, different user groups may have different requirements. One example is Gmail, where the SLAs for email services for consumers and business users are differentiated, offering the business users an availability of 99.9% at a fixed price, while consumers have a free offering without any SLA.³

³ http://www.google.com/apps/intl/en/business/features.html
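The credit mechanism discussed in this section can be illustrated with a short calculation. The 99.99% threshold and 10% credit follow the Amazon EC2 example above, but the function names and the flat-credit billing logic are our own simplification for illustration, not any provider's actual SLA terms:

```python
def uptime_percentage(total_minutes, downtime_minutes):
    """Measured uptime over the billing period, in percent."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

def service_credit(bill, uptime_pct, threshold=99.99, credit_rate=0.10):
    """Issue a flat credit if measured uptime falls below the SLA threshold."""
    return bill * credit_rate if uptime_pct < threshold else 0.0

# A 99.99% annual uptime target allows only about 52.6 minutes of
# downtime per year.
minutes_per_year = 365 * 24 * 60
allowed_downtime_min = minutes_per_year * (1 - 0.9999)  # ~52.56 minutes
```

Even this toy version makes the paper's point visible: whether one hour of degraded-but-not-down performance counts as "downtime" is entirely a matter of how the SLS defines the availability KPI.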
III. FAULT TOLERANCE IN CLOUD COMPUTING

In the design of dependable systems, a combination of fault avoidance (called fault prevention in [2]) and fault tolerance is used to increase availability. Fault avoidance aims at avoiding faults being introduced, through the use of better components (e.g. SSD instead of HDD), debugging of software, or protecting the system against environmental faults. Fault tolerance is often used in addition to fault avoidance, allowing a fault to lead to an error but preventing errors from leading to service failure. Fault tolerance thus uses redundancy in order to remove or compensate for errors. This section gives a short overview of general fault tolerance techniques used in the design of dependable communication systems, and then looks at how fault tolerance is achieved in virtualized environments.

A. Fault Tolerance Principles

Cloud infrastructure is built using off-the-shelf hardware, and standby redundancy is the preferred fault tolerance technique. With standby redundancy, there are two or more replicas of the system. Only the active replica produces the result presented to the receiver, while the standby replicas are ready to take over should the active replica fail. Hot and cold standbys are possible. Hot standbys are powered standbys, capable of taking over service execution with no downtime (as long as the state is updated). Cold standbys are non-powered and need some time to be started in case of failure of the active replica.

Different levels of synchronization are possible for the hot standbys (updated/not updated), and the backup resources can be dedicated or shared, for both the hot and cold standbys. This gives the overall classification shown in Figure 2.

[Figure 2: Standby redundancy classification — standbys are either hot (updated or not updated) or cold (dedicated or shared); updated hot standbys are further divided into dedicated (VMware FT) and shared (Remus), while VMware HA appears as a cold standby technique]

The choice between hot or cold standbys will decide the service restoration time, but more importantly the choice should depend upon the application's need for an updated state space, as described in the next section.

B. Fault Tolerance in Cloud Computing

Cloud computing uses virtualization of computing resources, made available as VMs, virtual storage and virtual networks. We concentrate here on computation services and the use of VMs. In this case, backup is made easy with virtualization, since the virtual image contains everything that is needed to run the application and can be transparently migrated between physical machines. One of the downsides of virtualization, though, is that one single hardware fault in a physical server can affect several VMs and hence many applications. Replicas of the same application must therefore always be deployed on different physical machines. The standby resources must also be dimensioned to handle the high number of failed VMs in case of a physical server failure.

Virtualization facilitates live migration of VMs, where a running VM instance can be transferred between physical machines. Live migration has been implemented both for the Xen hypervisor [10] and for VMware with its VMotion [11], and ensures zero downtime in case of planned migrations due to resource optimization or planned maintenance. In case of failures in the physical host running the VM, live migration is not possible. The configuration file or the VM image should then be available on possible new host machines in order to restart the application. In addition, the configuration file should be stored at a centralized location should all replicas fail. How this is performed depends on the type of standby redundancy, as described next.

1) Hot Standby: For stateful applications, state must be stored on the standby virtual machine in order to allow failover. In traditional fault-tolerance terminology, this requires the use of updated hot standbys. Different levels of updating/synchronization between the active and standby replica are possible; either the input is evaluated at each replica, or the state information is transferred at specified checkpoints. The former method consumes more compute resources than the latter, and is denoted dedicated in our classification (Figure 2). The latter method allows many replicas to share backup resources and is hence denoted shared.

Examples of hot standby techniques are VMware's Fault Tolerance [3] and Remus for the Xen hypervisor [12], as seen in Figure 2. VMware Fault Tolerance is designed for mission-critical workloads, using a technique called virtual lockstep, and ensures no data or state loss, including all active network connections. Both active and standby replicas execute all instructions, but the output from the standby replica is suppressed by the hypervisor. The hypervisor thus hides the complexity from both the application and the underlying hardware. This scheme is classified as updated and dedicated, since the standby replica is fully synchronized and consumes resources equal to the active replica.

In Remus, fault tolerance is achieved by transmitting state information to the standby VM at frequent checkpoints, and buffering intermediate inputs between checkpoints. The standby can hence be up and running with a complete state space in case of failures, with only a short downtime needed for catching up with the input buffer. The standby does not execute any inputs, which means that less resources are consumed compared to VMware Fault Tolerance, while a short downtime and loss of ongoing transactions are experienced in case of failure of the active replica. This scheme is classified as updated and shared, since the standby only consumes a small amount of resources compared to the active.
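The checkpoint-and-buffer principle behind Remus can be sketched as a toy class: state is snapshotted to the standby at checkpoints, and outputs produced since the last checkpoint are held back until the snapshot has been transferred. This is only an illustration of the principle; it has none of the hypervisor mechanics of the real system, and all names are hypothetical:

```python
class CheckpointedPrimary:
    """Toy model of checkpoint-based replication. Outputs produced after
    a checkpoint are buffered and released only once the standby holds
    that checkpoint, so the outside world never sees output that a
    failover would roll back."""

    def __init__(self):
        self.state = {}          # live state on the active replica
        self.standby_state = {}  # last checkpointed state on the standby
        self.output_buffer = []  # outputs held back since last checkpoint
        self.released = []       # outputs visible to the outside world

    def handle_input(self, key, value):
        self.state[key] = value
        self.output_buffer.append((key, value))  # held until checkpoint

    def checkpoint(self):
        # Transfer a snapshot of the state to the standby...
        self.standby_state = dict(self.state)
        # ...and only then release the buffered outputs.
        self.released.extend(self.output_buffer)
        self.output_buffer.clear()

    def failover(self):
        # On failure the standby resumes from the last checkpoint; work
        # since then is lost, matching the short-downtime behaviour.
        return dict(self.standby_state)

node = CheckpointedPrimary()
node.handle_input("x", 1)
node.handle_input("y", 2)
node.checkpoint()           # standby now holds x and y; outputs released
node.handle_input("z", 3)   # not yet checkpointed
recovered = node.failover() # z is lost; x and y survive
```

The design choice the buffer represents is exactly the trade-off described above: the standby consumes few resources, at the price of a short output delay and the loss of in-flight work on failover.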
Hot standbys can be used for both stateless and stateful applications, but since all replicas consume resources, they are most often used for stateful applications. For stateless services, the not-updated hot standby is a possibility if high availability is important.

2) Cold Standby: The cold standby solution requires less resources and should in general be used for stateless applications that allow short downtimes. The same is true in [...] needs. Dedicated standby resources are still possible for cold standbys, and should be used for stateless applications with high availability requirements. In practice, the dedicated [...] enough resources for restarting a hot standby, since these should host the highest priority applications.

[...] at different physical locations, connected to the customers via the Internet. Following [13], we model the data centers with racks of servers that are organized into clusters. Each cluster shares some infrastructure elements such as power distribution elements and network switches. The overall network architecture is simplified (inspired by [14]), and consists of two levels of switches (L1 and L2) in addition to the gateway routers. The overall model is then as shown in Figure 4.

[Figure 4: Overall cloud model — two data centers, each with clusters of servers hosting VMs and shared power (PWR) and cooling (COL) infrastructure, connected through L2 switches and gateways (GW1, GW2) over the Internet to the customer]

[Figure 6: Network model — redundant network elements (L1-B, L2-B, GW2) between data center and customer, with uncovered failure branches marked (1−c), and a three-state diagram: 1 Both OK, 2 One down, 3 Both down]
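The three-state pattern in the network model (state 1: both redundant elements OK, state 2: one down, state 3: both down, with uncovered failures branching directly to state 3 with weight 1−c) can be solved for its steady state by hand from the balance equations. The rates below are assumed illustrative values, and the derivation is our own sketch, not the paper's parameterization:

```python
# Assumed rates (per hour): lam = failure rate of one element,
# mu = repair rate, c = coverage (probability a failure in the
# "both OK" state is handled without losing both elements).
lam, mu, c = 0.001, 0.1, 0.95

# Unnormalized steady-state weights from the balance equations of the
# chain (1: both OK, 2: one down, 3: both down):
#   2*lam*w1 = mu*w2                       (balance around state 1)
#   mu*w3    = 2*lam*(1 - c)*w1 + lam*w2   (balance around state 3)
w1 = 1.0
w2 = 2 * lam / mu * w1
w3 = (2 * lam * (1 - c) * w1 + lam * w2) / mu

total = w1 + w2 + w3
pi = [w1 / total, w2 / total, w3 / total]

# The service is up whenever at least one element is up.
availability = pi[0] + pi[1]
```

Lowering the coverage c feeds probability mass directly into the "both down" state, which is why imperfect failure detection erodes the benefit of redundancy.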
3) Management Software Failures: Cloud computing re- [...]

[...] dimensioned to handle these restarts for the lowest priority [...] probability p), and the operational aspects of the data center [...]

[Figure: failure classes — Mngmt, Power, Network]

The resulting availability for the different deployment scenarios (I: same cluster, II: same data center, III: same cloud [...]) [...] (scenario III), but the effect is clearly not very big. The [...] more prominent, at least for the scenario with all replicas in one cluster (scenario I). For the hot standbys, there are small differences between the dedicated and shared standbys. [...] decrease when the load increases. We also see that the cold [...]

[Figure: plots of Aservice against Apower; (a) management software availability]
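The fragmentary results above concern cold standbys and restart priorities. As a back-of-envelope illustration of why both matter: if every failure costs one VM restart, availability follows directly from the MTTF and the restart time, and a low-priority restart that can be preempted by higher-priority restarts effectively sees a longer restart time. The numbers and the simple inflation factor below are assumed for illustration only; this is not the paper's model:

```python
def cold_standby_availability(mttf_hours, restart_minutes):
    """Steady-state availability when every failure costs one VM restart."""
    restart_hours = restart_minutes / 60.0
    return mttf_hours / (mttf_hours + restart_hours)

# Assumed numbers: failures every 1000 h on average; an unhindered
# restart takes 5 minutes.
a_high = cold_standby_availability(1000, 5)

# A low-priority restart that can be preempted waits longer; here the
# restart time is simply inflated by an assumed preemption factor
# (a crude approximation, not the paper's preemption model).
preemption_factor = 3.0
a_low = cold_standby_availability(1000, 5 * preemption_factor)
```

Even with identical hardware, the two priority classes end up with measurably different availability, which is the differentiation mechanism the results in this section quantify.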
[Figure 15: Availability for high and low priority cold standbys with increasing preemption rate (0.00–0.20)]

[...] However, there are some important improvements to be made. First, the SLAs must become more detailed with respect to the actual KPIs used to define availability. Next, in order to deploy also important enterprise services in clouds, different levels of availability should be offered, depending on the actual user requirements. Finally, the SLAs should be available on demand, which also means that they should be adjustable on demand.

This paper has proposed an overall availability model for a cloud system, including the network. We have shown how deploying replicas in different physical locations affects the resulting availability, and also how different applications need different fault tolerance schemes. These are two possible dimensions for differentiating cloud applications.

Future work includes modeling more complex services, e.g. a tiered web service. Also, the server models should be made more detailed, taking into account characteristics of the different failures and repairs. We have discussed the need for well-defined KPIs for availability; the next step is to also include performance measures in the availability models. Finally, the network availability strongly influences the total availability of a cloud service, and should optimally be included in the cloud service SLA.

REFERENCES

[2] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, Jan. 2004.

[3] VMware White Paper, "Protecting Mission-Critical Workloads with VMware Fault Tolerance," 2009.

[4] K. V. Vishwanath and N. Nagappan, "Characterizing Cloud Computing Hardware Reliability," in Proceedings of the ACM Symposium on Cloud Computing (SOCC), 2010.

[7] D. Menascé, "Performance and Availability of Internet Data Centers," IEEE Internet Computing, vol. 8, no. 3, pp. 94-96, May 2004.

[8] I. Brandic, V. C. Emeakaroha, M. Maurer, S. Dustdar, S. Acs, A. Kertesz, and G. Kecskemeti, "LAYSI: A Layered Approach for SLA-Violation Propagation in Self-manageable Cloud Infrastructures," in Proceedings of the 2010 34th Annual IEEE Computer Software and Applications Conference Workshops, Jul. 2010, pp. 365-370.

[9] Q. Li, Q. Hao, L. Xiao, and Z. Li, "Adaptive Management of Virtualized Resources in Cloud Computing Using Feedback Control," in Proc. of the 1st International Conference on Information Science and Engineering (ICISE'09), Dec. 2010.

[10] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live Migration of Virtual Machines," in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (NSDI '05), 2005, pp. 273-286.

[11] M. Nelson, B.-H. Lim, and G. Hutchins, "Fast Transparent Migration for Virtual Machines," in Proceedings of USENIX '05, Anaheim, California, 2005, pp. 5-9.

[12] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield, "Remus: High Availability via Asynchronous Virtual Machine Replication," in NSDI '08: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, 2008.

[13] L. A. Barroso and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1-108, Jan. 2009.

[16] VMware White Paper, "VMware High Availability: Concepts, Implementation and Best Practices," 2007.

[17] J. Dean, "Designs, Lessons and Advice from Building Large Distributed Systems," Keynote Presentation at LADIS 2009, the 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware, 2009.

[18] M. Dahlin, B. B. V. Chandra, L. Gao, and A. Nayate, "End-to-End WAN Service Availability," IEEE/ACM Transactions on Networking, vol. 11, no. 2, pp. 300-313, 2003.