
Business Continuity and Disaster Recovery
Data storage and management aim to improve performance,
availability, and recovery.
There are four critical service areas that each application
requires:
Primary Storage, with a focus on key performance
Operational Recovery, with a focus on recovery time and recovery point
objectives for infrastructure failures and data corruption
Disaster Recovery, with a focus on recovery time and recovery point
objectives for complete data center loss or failure
Archival Management, with a focus on compliance, access, and recall
requirements
There are numerous situations, both planned and unplanned, that
will affect the availability of your system.
Planned Downtime
1. Scheduled, routine server maintenance
2. Hardware and/or software upgrades
3. Physical server re-location
Unplanned Downtime
1. System crash caused by hardware and/or software failure
2. Database corruption
3. System compromised by virus(es), malicious attacks, or careless misuse

But with a little planning, you can maximize
system uptime.
This is the essence of Disaster Recovery
Planning and Business Continuity Planning.
Business Continuity and Disaster
Recovery

A disaster is defined as a sudden, unplanned
catastrophic event that compromises the
organization's ability to perform mission-
critical and critical processes, including the
ability to do normal production processing of
systems that support critical business
processes.
A disaster could be the result of significant
damage to a portion of the operations, a total
loss of a facility, or the inability of the
employees to access that facility.


Enterprise Operations Cycle of Disaster
Recovery

Business Continuity

Business Continuance ensures availability of the
functions that are critical to the day-to-day business
environment in the event of a natural disaster, hardware
failure, or application failure.
Disaster Recovery is a reactive plan to resume IT
operations after a failure.
Business continuance is a proactive plan to maintain
business operations despite planned or unplanned
outages.
Planning for business continuance requires careful
analysis of network requirements and typically requires
developing some or all of the following documentation:
Business continuance plan: defines the resources, actions, tasks,
and data required to manage the business recovery process in the
event of a business interruption.
Data (or application) classification study: determines the value of
each application to the business and assigns a service level to each,
such as mission critical, business critical, and critical.
Applications that supply data to critical applications must also be
identified.
Business impact analysis (or assessment): defines the effect on the
organization when an application becomes unavailable.

A business impact assessment identifies the risks and potential
costs associated with application downtime, data loss, and
disrupted user access.


Disaster Recovery vs. Business
Continuity

The terms are often used interchangeably, but they are
different, complementary components of a
business's overall recovery and continuity planning.
Disaster Recovery Planning (DRP) is concerned with the
recovery of systems and infrastructure components.
Business Continuity Planning has a larger scope,
namely the determination of which business
components and functions need to be recovered and
which can be ignored.
Briefly, Disaster Recovery is the coordinated process of
restoring systems, data, and infrastructure required to
support key on-going business operations.
DISASTER RECOVERY PLANNING (DRP): The technological
aspect of business continuity planning. The advance
planning and preparations that are necessary to minimize
loss and ensure continuity of the critical business functions
of an organization in the event of disaster.
BUSINESS CONTINUITY PLANNING (BCP): Process of
developing advance arrangements and procedures that
enable an organization to respond to an event in such a
manner that critical business functions continue with
planned levels of interruption or essential change. (Source:
www.drii.org)
DRP
A DR plan is usually limited in scope to a set of
defined IT systems and infrastructure, with the
ultimate goal being the complete recovery of those
systems and infrastructure within a defined
timeframe and with minimal data loss. The DR plan
may exclude non-IT business units such as
Accounting, Marketing, Sales, etc., except in terms of
recovery of the defined applications used by these
business units.
Contrast this with a BC plan that has as its scope
the entire enterprise, with the ultimate goal
being recovery of mission-critical/core business
functions to ensure the survival of the enterprise.
The business functions to be recovered in a BC plan
extend beyond IT systems; for example, BC
concerns itself not only with the recovery of
applications that support sales, but also with the
recovery of the related infrastructure (such as
office space), supplies (such as marketing
materials and manual forms), etc.

Disaster Recovery Planning
Concerned with the recovery of a key set of IT
Systems and infrastructure components.
The process of Business Continuity Planning is
concerned with the enterprise as a whole,
dealing with business functions rather than
application systems. Disaster Recovery (DR) can
be thought of as the foundation of Business
Continuity (BC); a working DRP is a core
component of any Business Continuity effort.
Since both DR and BC address the same basic problem,
recovery from an unforeseen event and continuation of the
business, there are a number of areas of overlap
between the two processes.
Many companies charter their DR effort as part of their
Business Continuity effort.
However, just as many will first build a working DRP before
taking the next logical step and constructing an overall BCP.
With an understanding of the relationship between the two
processes, knowledge gained in one area can be leveraged
in the other.
Defining Recovery

Define what is meant by Recovery.
From the IT perspective, recovery will usually
mean establishing support for the processing and
communications functions considered critical by
the business community, and then establishing
support for ancillary systems.
From the business perspective, recovery will mean
being able to execute the business functions that
are at the core of the business, and then being able
to execute ancillary functions.
Another key factor is the timeframe. Is the goal to have all
systems up and running within a day? A week?
Is it enough to only bring up a few key systems within the
first week, while taking longer to restore others? This factor
is often expressed as either the Recovery Time Objective
(RTO) or the Service Delivery Objective (SDO). This refers
to the amount of time that can elapse from the failure to
the time when the systems or services are available for use.
Most often this factor can vary by system or service; for
example, a company's order processing system may have an
SDO of 24 hours, while the company's intranet has an SDO
of 1 week.
RTO and SDO
What is the difference between RTO and SDO? It is
possible to recover one or more systems in a short
period of time and still, because of an unplanned dependency
(often unknown until the DRP is tested) or a recovery
failure, be unable to provide the full and necessary
functionality to restore service. Hence, while having
a good RTO is important, understanding SDO and
planning around that goal is even more important.
A single failure of a key component that other systems
are dependent on can result in failures of many other
systems. Because of this it is important to understand
the dependencies within your environment.
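The effect of dependencies on service delivery can be sketched in a few lines of Python. This is a minimal illustration only: the system names, recovery times, and dependency map are hypothetical, and it assumes that all recoveries start at the same time and that a service is usable only once it and everything it depends on have been recovered.

```python
from functools import lru_cache

# Hypothetical per-system recovery times in hours (illustrative values only).
RECOVERY_HOURS = {"order_processing": 4, "database": 10, "auth": 2, "network": 6}

# Hypothetical dependency map: each system lists the systems it needs in order to function.
DEPENDS_ON = {
    "order_processing": ["database", "auth"],
    "database": ["network"],
    "auth": ["network"],
    "network": [],
}


@lru_cache(maxsize=None)
def service_delivery_hours(system: str) -> int:
    """Hours until the system is actually usable.

    This is the longer of the system's own recovery time and the delivery
    time of its slowest dependency. Assumes parallel recovery and no cycles.
    """
    own = RECOVERY_HOURS[system]
    deps = [service_delivery_hours(dep) for dep in DEPENDS_ON[system]]
    return max([own, *deps])
```

In this made-up example the order processing system is recovered in 4 hours, but it cannot deliver service until the database is back at 10 hours.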
RPO
How much data can you afford to lose? This is
expressed as the Recovery Point Objective
or RPO.
Depending on the environment, the loss of
any data could have a significant impact.
Which is better: a lower or a higher RPO?
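As a rough illustration of how RPO relates to backup frequency, the sketch below assumes periodic backups that complete instantly, so the worst-case data loss is one full backup interval. The function names and the intervals used are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses up to one full interval of data."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case loss under this schedule stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo
```

Under these assumptions, a nightly backup cannot satisfy a 4-hour RPO, while a 15-minute backup interval satisfies a 1-hour RPO.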
It is also important to have some way of
tracking the various components of a recovery
effort, such as a Critical Success Factors
spreadsheet that breaks a recovery effort
down by key area and is weighted by system
criticality, downstream dependencies, and
impact of failure.
As systems are recovered, the spreadsheet is
updated.
Benefits:

In a DR test scenario, this spreadsheet provides the
overall score based on the recovery test.
This score can then be used as a tangible way to track
progress and identify key deficiencies in the overall
Disaster Recovery Plan, both for the current test and as
part of a baseline metric to track improvement over time.
In an actual recovery effort, the spreadsheet provides a
checklist that can be reviewed with management to see
the overall progress of the DR effort.
It also clearly draws out the ramifications of a failed
system through the weighted scoring and dependency
lists, and helps identify timings with regard to
RTO/SDO and RPO.
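One possible way to compute such a weighted score is sketched below. The field names, weights, and systems are hypothetical, and the slides do not specify a formula; the sketch simply reports the share of total weight that has been recovered.

```python
from dataclasses import dataclass


@dataclass
class RecoveryItem:
    system: str
    key_area: str
    weight: int       # hypothetical weight combining criticality, dependencies, and impact
    recovered: bool


def recovery_score(items: list) -> float:
    """Percentage of the total weight that has been recovered so far."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    done = sum(item.weight for item in items if item.recovered)
    return 100.0 * done / total


# Illustrative rows only.
ITEMS = [
    RecoveryItem("order_processing", "applications", 5, True),
    RecoveryItem("database", "data", 3, True),
    RecoveryItem("intranet", "applications", 2, False),
]
```

With these illustrative rows, `recovery_score(ITEMS)` is 80.0, and the unrecovered intranet row is the remaining deficiency.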
Disaster Recovery Planning

A Disaster Recovery Planning project cannot be
completed in a week or even a month.
In many ways, a DRP is never completed: the plan
must be tested and updated at least once per year, if
not more frequently. A plan that does not keep pace
with the changes in your organization is a disaster in
itself, providing a false sense of security.
The primary objectives of a Disaster Recovery Plan are
to guide an organization in the event of a disaster and
to effectively re-establish critical business operations
within the shortest possible period of time with a
minimal loss of data.
The goals of the planning project are to assess
current and anticipated vulnerabilities, define
the requirements of the business and IT
communities, design and implement risk
mitigation procedures, and provide the
organization with a plan that will enable it to
react quickly and efficiently at the time of a
disaster.

A DRP first needs to be created, then tested and
refined, and finally implemented and tested on a
periodic basis.
While this can be an expensive and time-consuming
effort, it can prove to be a bargain in the long run
if some type of disaster occurs.
Most plans will experience some type of failure during
their first execution so it is very important to test the
plan.
Systems and personnel change, so it is also important
to retest the plan on a periodic basis, ideally rotating
staff.

A DRP addresses 3 main functional areas

1. Recovery: Once the infrastructure is in place it will be
necessary to recover production data. Since recovery may
not be up to the point of failure, it is important to identify
any processing that needs to be redone. Can all of the data
feeds to the system be identified? How many of them can be
redone with 100% certainty of success? It is important to
minimize "holes" in data (especially in a distributed
processing environment where one step could be dependent
on one or more predecessor steps or actions), and then to
identify the action to be taken when data inconsistencies are
detected.
There should be an audit trail for all work performed during
this phase. Once the data is recovered there should be some
type of validation process to ensure that the recovery was
complete, leaving a consistent work environment.
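One common form of validation is to compare checksums recorded before the disaster with checksums of the recovered data. The sketch below is only an illustration of that idea: the item names are hypothetical, and it assumes the expected digests were stored somewhere that survived the failure.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a block of data."""
    return hashlib.sha256(data).hexdigest()


def validate_recovery(expected: dict, recovered: dict) -> list:
    """Return the names of items that are missing or whose checksum does not match.

    expected maps item name -> digest recorded before the failure.
    recovered maps item name -> recovered bytes.
    """
    problems = []
    for name, digest in expected.items():
        blob = recovered.get(name)
        if blob is None or sha256_of(blob) != digest:
            problems.append(name)
    return problems
```

An empty result means every listed item was recovered intact; any names returned would go into the audit trail for follow-up.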
2. Restoring / Sustaining Business
Operations:
All processing requirements and service level
agreements need to be defined and documented.
Dependencies between processes also need to be
defined. It is important to document the existing
process and then build the plan accordingly.
Anything that ran before (in production) will
probably need to run again (at the hot site), so
scheduling and dependency information is critical.
Remember that routine maintenance (including
backups) should still be performed (it too is an asset
that requires protection).
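Documented dependencies can be turned into a run order mechanically. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the job names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs, each mapped to the set of jobs that must finish before it starts.
JOB_PREDECESSORS = {
    "restore_database": set(),
    "load_daily_feeds": {"restore_database"},
    "run_billing": {"load_daily_feeds"},
    "nightly_backup": {"run_billing"},
}


def run_order(predecessors: dict) -> list:
    """Return the jobs in an order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(predecessors).static_order())
```

For this simple chain the order is restore_database, load_daily_feeds, run_billing, nightly_backup. A circular dependency is reported as an error rather than silently mis-scheduled.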

3. Transferring Data back to Production
Machines
This is one area that is generally omitted from a DRP, but which is
very important.
Eventually production will need to go back to normal.
A process needs to be defined to manage this migration.
Often the Client will elect to execute the DRP on the production
machines in order to synchronize the machines to a specific point
in time.
It is one of the more difficult tasks to test.


The DRP should address 3 main
technical areas

1. Hardware Issues: This includes machine type (especially
an issue when using equipment that is older or not as
popular), configuration (disk capacity, peripheral devices,
device names, RAM, file systems and volume groups, OS
users, etc.) and operating system version and patch level
(hopefully it is a current version in case vendor support is
required).
Another issue is deciding whether to use an existing
preconfigured machine or to completely configure a
machine (load the OS, initialize and configure disks, TCP/IP
configuration, SCSI addresses, everything). There are pros
and cons to each scenario. It is recommended that
organizations plan for the worst case (i.e., the complete
rebuild).
Note: It may be possible to reconstruct the production
machine on a new machine using a tape backup.
This method does not leave much room for flexibility
relative to hardware configuration, but is very fast when
compared to a manual system reconstruction.
The key to success is to ensure that DRP machines have at
least as much capacity as the production machines that
they are replacing, that they are compatible architectures,
and that "someone" has the installation media for the OS.
These machines usually either reside at a remote location
(in which case they can be pre-configured) or are provided
as-needed by a company that specializes in providing
facilities and equipment.

2. Networking Issues: The environment most likely consists of
heterogeneous machines.
Is any special type of LAN or VPN software required?
How do the machines communicate with one
another?
Are there requirements for connections to an
external network (WAN, Internet, Extranet)?
Is there any other type of Client/Server or n-tier
activity that will need to be supported?
All networking requirements and issues need to be
identified, documented, and then addressed in the
DRP.
3. Software Issues: Software includes the operating system, user-written
applications, and third-party software (RDBMS, report
writers, GUI products, backup/recovery products,
scheduling software, etc.).
A comprehensive inventory of currently used
software, including current version, license
information, and support contact information is
essential. This is what runs your business. Working
hardware and an accessible network are worthless if
your critical applications are not working!

Procedures/Methods in planning
disaster recovery.

1. Identification and Analysis of Disaster
Risks/Threats
The first step is to identify the threats or risks.
Risk analysis (sometimes called business impact
analysis) involves evaluating existing physical and
environmental security and control systems, and
assessing their adequacy with respect to the
potential threats. The risk analysis process begins
with a list of the essential functions of the
business. This list will set priorities for addressing
the risks.
2. Classification of Risks Based on
Relative Weights

Categorize risks into different classes so that they can be
prioritized accurately.
In general, risks can be classified in the following five
categories.
3. Evaluation of Disaster Recovery Mechanisms
Determine the most suitable recovery method for each risk.
The main factors that need to be considered are:
Cost of deployment, maintenance, and operation
Recovery time
Ease of recovery activation and operation
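One simple way to compare candidate mechanisms on these three factors is a weighted score. The sketch below is an illustration only: the weights, the 1 to 5 ratings (higher is better), and the mechanism names are all hypothetical and would come from the organization's own analysis.

```python
# Hypothetical integer weights for the three factors listed above.
WEIGHTS = {"cost": 4, "recovery_time": 4, "ease": 2}

# Hypothetical ratings from 1 (worst) to 5 (best) for each candidate mechanism.
MECHANISMS = {
    "tape_restore": {"cost": 5, "recovery_time": 1, "ease": 2},
    "warm_standby": {"cost": 3, "recovery_time": 4, "ease": 4},
    "hot_site": {"cost": 1, "recovery_time": 5, "ease": 5},
}


def score(ratings: dict) -> int:
    """Weighted sum of the ratings for one mechanism."""
    return sum(WEIGHTS[factor] * ratings[factor] for factor in WEIGHTS)


def best_mechanism(mechanisms: dict) -> str:
    """Name of the mechanism with the highest weighted score."""
    return max(mechanisms, key=lambda name: score(mechanisms[name]))
```

With these made-up numbers the scores are 28, 36, and 34, so the warm standby comes out ahead. Different weights would change the outcome.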

4. Setting up a Disaster Recovery Committee
This committee should have representation
from all the different company units with a
role in the disaster recovery process, typically
management, finance, IT, electrical
department, security department, human
resources, vendor management, and so on.
The Disaster Recovery Committee creates the
disaster recovery plan and maintains it.

5. Disaster Recovery Phases

Disaster recovery happens in the following sequential phases:
1. Activation Phase: Quick and precise detection and
communication are the key to reducing the effects of the
incoming emergency; in some cases they may give enough time
to allow system personnel to implement actions gracefully,
thus reducing the impact of the disaster. In this phase, the
disaster effects are assessed and announced.
a. Notification procedures
b. Damage assessment
c. Disaster recovery activation planning
2. Execution Phase: In this phase, the actual procedures to
recover each of the disaster affected entities are executed.
Business operations are restored on the recovery system.
3. Reconstitution Phase: In this phase the original system is
restored and execution phase procedures are stopped.

The Disaster Recovery Plan Document

The outcome of the disaster recovery planning
process is the disaster recovery plan document, which will
be the source of information for disaster recovery
procedures. The disaster recovery plan document is
the only reliable source of information for
disaster recovery during an emergency. It should be
easy to read, with simple and detailed
instructions, and it should be kept up to date with the
current organization environment. Periodic mock
drills and periodic updates are recommended for
the maintenance of the plan documentation.

Disaster Recovery Planning Project Steps


Step I Project Initiation
The objectives of the disaster recovery planning
project initiation are to gain an understanding of
the existing and planned future IT environment of
the organization, define the scope of the project,
develop the project schedule, and identify risks to
the project. Input from the BCP team will be
required as part of the requirements analysis
phase.
In addition, a Project Sponsor/Champion and
Steering Committee should be established during
this phase.
Step II Assessment of Disaster Risk

This should include, but not be limited to, an
assessment of geographical location, building
composition, computing environment/physical
plant security, installed security devices (including
automated fire extinguishers and automated
shut-down devices), computing
environment/physical plant access control
systems and software, personnel practices,
operating practices, and backup practices. This is
a good time to perform an IT Assessment,
Practices and Procedures Audit, and Single Points
of Failure Analysis.
Step III Business Impact Analysis

A business impact analysis of the key business units
that are supported by IT is undertaken to identify
which systems and functions are critical to the
continuation of business, and to determine the length
of time that those units can survive without the
critical systems in operation.
This analysis is essential to making decisions
about how to implement disaster recovery.
Step IV Definition of Requirements

All requirements of, and relating to, the Plan
must be defined and detailed. These include the
recovery requirements of the business and IT
communities, the requirements generated by the
business impact analysis, and the requirements
generated by the assessment and mitigation of
disaster risk.
Step V Project Planning

It is important here to distinguish between the
Project Plan and the Disaster Recovery Plan.
The Project Plan in this case will define the
project that is being executed and as one of its
objectives will develop the Disaster Recovery
Plan. An additional objective of this project is
to mitigate as much disaster risk as possible.
Step VI Project Execution
The project should proceed according to standard practices of Project
Management. During the project the identified methods of mitigating risk
will be executed, and the Disaster Recovery Plan will be constructed
and tested.
Step VII BCP Integration
The DR Plan needs to integrate back into the organization's overall
Business Continuity efforts. For an organization that has run the DR effort
as part of an overall BC effort, this has likely already been done.
However, for an organization that builds its DRP first and then creates a
BCP from that foundation, it is important to align the two.

Step VIII On-going Maintenance and
Integration
Part of the plan will include the on-going
maintenance and testing efforts required to
keep the Plan up to date, as well as processes
to identify and mitigate future risks as they
are encountered.

What is BCP?

Disasters test the contingency plans of businesses to an
unanticipated degree. Companies that have business
continuity plans and contracts in place with vendors of
recovery services are able to continue business at
alternate sites with minimum downtime and minimum
loss of data. Recovery itself must be speedy (under 24
hours) for high-availability systems and the facilities
must provide continuity not only of the data centre
but also of all critical aspects of its clients' businesses.
At the most basic level, Business Continuity
Planning (BCP) can be defined as an iterative
process that is designed to identify mission
critical business functions and enact policies,
processes, plans and procedures to ensure the
continuation of these functions in the event of
an unforeseen event. All activity surrounding
the creation, testing, deployment, and
maintenance of a BCP can be viewed in terms
of this definition.
The other main difference between a DR plan and a BC plan
concerns the definition of what to recover and what to
exclude. Business Continuity requires the definition
and determination of response to risk (Risk Analysis
and Response), the definition of possible failure areas
(a Single Points of Failure, or SPOF, analysis), and the
determination of the impact of these areas on the
business as a whole (Business Impact Analysis, or BIA).
This analysis will result in the determination of which
business functions are "core" or "mission critical";
these are the business functions that are essential for
the survival of the enterprise, and will by necessity be
the focus of the BC effort.
Another key area within the scope of the BCP is
concerned with data, system, and application
dependencies. The failure to identify, plan for,
and properly recover systems and processes in
light of these dependencies could very well keep
a business unit from operating properly. This
scenario highlights the importance of these
issues, and underscores why they need to be
identified early in the BC process and addressed
accordingly.

If an enterprise has a working DR plan or
plans, it makes more sense to leverage that
knowledge into the creation of the BC plan. In
that case, the integration of the DR
plan into the new BC plan is of utmost
importance. There is also the chance that
portions of the existing DR plan or plans may
need to be reworked or may become obsolete
based on the overall allocation of resources
determined by the BC effort.

Unlike a DRP (which can be owned and managed at the
department level), the scale, cost, and impact of a
true BCP are at the enterprise level and need to be
managed as such. Because of its logical connection to
DRP, many companies' BCPs fall under the
responsibility of the IT department; however, since
Business Recovery comprises many areas that are
outside of the scope of the IT department (such as
Legal Counsel, Supply Chain, and external agency
liaisons), this is often a flawed arrangement.
The Phases of Business Continuity
Planning, Implementation, and
Management

The significance of each major phase of
continuity planning merits attention because
each phase contributes to building all four areas
of business continuity: disaster recovery,
business recovery, business resumption, and
contingency planning:
Phase 1: Establish the foundation. These alignment and analysis
steps are necessary to obtain executive sponsorship and the
commitment of resources from all stakeholders. Without a basis of
business impact analysis and risk assessment, the plan cannot
succeed and may not even be developed.
Phase 2: Develop and implement the plan. Here, attention to
detail and active participation by all stakeholders ensure the
development of a plan worth implementing. The plan itself must
include the recovery strategy with all of its detailed components
and the test plan.
Phase 3: Maintain the plan. The best plan is only as effective as it
is current. Every tactic of business resumption and recovery must
be kept up to date and tested regularly.
Types of Plans

The separate plans that make up a business continuity
plan include:
Disaster recovery plan: to recover mission-critical
technology and applications at an alternate site.
Business resumption plan: to continue mission-critical
functions at the production site through
workarounds until the application is restored.
Business recovery plan: to recover mission-critical
business processes at an alternate site (sometimes
called workspace recovery).
Contingency plan: to manage an external event
that has far-reaching impact on the business.

Business Use of BCP

The need for high availability of systems today
approaches 24x365 across all industries, for
both service and manufacturing organizations.
Despite that fact, the relatively high percentage
of companies without business continuity plans
indicates that strategic planners may be relying
on a combination of insurance and outsourced
physical recovery sites to take the required
steps independently, absent of any cohesion.
Organizations need to identify and prioritize
their resources in terms of:
Mission Critical
o Mission critical to accomplishing the mission of the organization
o Can be performed only by computers
o No alternative manual processing capability exists
o Must be restored within 36 hours
Critical
o Critical in accomplishing the work of the organization
o Primarily performed by computers
o Can be performed manually for a limited time period
o Must be restored starting at 36 hours and within 5 days
Essential
o Essential in completing the work of the organization
o Performed by computers
o Can be performed manually for an extended time period
o Can be restored as early as 5 days, however it can take
longer.

Non-essential/Non-critical
o Systems and resources that can be recovered within several
days or weeks of the disaster, or deferred until the crisis
has passed
o Non-critical to accomplishing the mission of the organization
o Can be delayed until the damaged site is restored and/or a
new computer system is purchased
o Can be performed manually
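The restore windows in this classification can be captured in a small lookup table, for example to check a DR test result against the tier of each system. This is a sketch only: the tier keys and function name are hypothetical, and tiers for which the slides state no fixed upper limit are treated as having no deadline.

```python
# Restore deadlines in hours, taken from the tiers above.
# None means the slides give no fixed upper limit for that tier.
RESTORE_DEADLINE_HOURS = {
    "mission_critical": 36,       # must be restored within 36 hours
    "critical": 5 * 24,           # must be restored within 5 days
    "essential": None,            # from 5 days, but it can take longer
    "non_essential": None,        # days or weeks, or after the crisis
}


def within_deadline(tier: str, hours_to_restore: float) -> bool:
    """True if a measured restore time is acceptable for the given tier."""
    deadline = RESTORE_DEADLINE_HOURS[tier]
    return deadline is None or hours_to_restore <= deadline
```

A 40-hour restore, for instance, would fail a mission-critical system but be acceptable for a critical one.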

What if the organization's DR involves
the use of backup cloud services?

Backup cloud services have become a popular trend for
consumers and even some smaller businesses. While
cloud services do address some reliability and accessibility
issues quite well, their inherent design limits their
ability to provide the needed level of immediate accessibility.
1. Potential High Reliability - Depending on the cloud service
and actual infrastructure, backup cloud services hold the
promise of being able to provide reliable backup and restore.
In terms of integrity, a cloud service should be able to provide
the same level of integrity as a disk-to-disk solution, but
without the need to go to tape for off-site archival.
2. Automated Offsite Storage - One of the major
benefits of a backup cloud service is its transparent,
automated nature. Data is automatically backed up and
archived at an offsite, hosted data centre without any
user intervention.
3. Slow Time to Recovery - The main disadvantage of a
cloud service for backup is the slow time to recovery.
While it might be fine for recovering a few files,
transferring the backup files for entire servers over the
Internet can take days to weeks, depending on an
organization's Internet bandwidth and connection speeds.
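The "days to weeks" estimate follows from simple arithmetic on backup size and bandwidth. The sketch below is a back-of-the-envelope calculation with a hypothetical function name and example figures; it ignores protocol overhead, throttling by the provider, and restore processing time unless an efficiency factor is supplied.

```python
def transfer_days(backup_bytes: float, bandwidth_bits_per_sec: float,
                  efficiency: float = 1.0) -> float:
    """Estimated days needed to download a backup over a network link.

    efficiency is the assumed fraction of nominal bandwidth actually achieved.
    """
    seconds = backup_bytes * 8 / (bandwidth_bits_per_sec * efficiency)
    return seconds / 86400  # seconds per day
```

For example, restoring 10 TB (10e12 bytes) over a fully used 100 Mbit/s link takes about 9.3 days, and roughly ten times longer at 10 Mbit/s.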

Question
What key items should a good SLA, used to
evaluate a cloud service provider,
contain?
