Introduction

September 11, 2001 and other terrorist incidents, devastating hurricanes such as
Katrina, Wilma, and Frances, Californian and Hawaiian earthquakes, mudslides,
tornadoes, forest fires, and similar events provide lessons for all of us.
Unfortunately, many organizations do not prepare for this type of devastating event.
Every organization needs to prepare for events that might inhibit the ability of its
employees or customers to continue operations.
As our dependence on information technology (IT) grows, so does the importance of
detailed planning to restore operations after an event that reduces our ability to use
those IT assets in support of organizational business requirements. Manual operations
are a thing of the past, as we have evolved to depend on IT resources. IT systems are
vulnerable to a variety of disruptions. These range from mild (for example, a short-term
power outage, disk drive failure, or software interruption) to severe, which might
result in complete equipment destruction (for example, a natural disaster or terrorist
action). Many vulnerabilities (technical and non-technical security weaknesses in our
operation) can be minimized or eliminated through technical, management, or
operational solutions as part of the organization's risk management effort; however, it
is impossible to eliminate all risks completely and still be able to use our critical
IT assets.

The Need for Contingency Plans


Organizational business objectives can be disrupted by events that affect any type of
organizational requirement, so planning for all types of disruptions, beyond those that
are IT-specific, is critical to any organization. Such planning is necessary to ensure
that business functions continue, including those necessary for personnel health and
welfare. These broad contingency plans are addressed in business resumption plans
(BRPs) or continuity of operations plans (COOPs). They are not specific to information
technology; they aim to minimize the effects of unexpected and undesired events, and
might include advance planning, service contracting, and contract execution for events
that might impact services. Although the overall scenario planning is far reaching,
examples of COOP planning considerations include sewage interruptions, inadequate
water supply, road work that might interrupt receipt of raw materials or customer
delivery, labor strikes at the site or at a contracted service, and loss of heating,
ventilation, or air conditioning. In addition, a company must always prepare for IT
network outages. Concise descriptions that outline the differences among these plans
can be found in Section 2.2 and Table 2-1 of the National Institute of Standards and
Technology's June 2002 Special Publication 800-34, entitled "Contingency Planning
Guide for Information Technology Systems."

Planning for Disruptions


Before developing a contingency plan, an organization needs to understand the types of
threats that might impact its operations. Often, a business impact analysis is performed,
in which a company examines its assets and how much downtime each can survive. IT
security risk analyses that focus on threats to network or system operations provide a
research basis for such impact analyses. The weaknesses identified in these analyses,
and the threats that might exploit them, should be the focus when considering
contingency response scenarios during planning and later testing.
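As a minimal sketch of how the results of a business impact analysis might be recorded
and ranked, the Python below sorts assets by how much downtime each can survive, so
that the least tolerant assets are planned for first. The asset names, the downtime
figures, and the names Asset and recovery_priority are hypothetical and chosen only for
illustration; they do not come from any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str                  # hypothetical asset label
    max_downtime_hours: float  # assumed survivable downtime from the analysis


def recovery_priority(assets):
    """Order assets so that those tolerating the least downtime come first."""
    return sorted(assets, key=lambda a: a.max_downtime_hours)


# Illustrative figures only; a real analysis would supply its own.
assets = [
    Asset("payroll system", 72.0),
    Asset("order-entry database", 4.0),
    Asset("internal wiki", 168.0),
]

for asset in recovery_priority(assets):
    print(f"{asset.name}: restore within {asset.max_downtime_hours} hours")
```

A ranking like this is only a starting point; the threats and weaknesses found in the
risk analysis still decide which response scenarios are worth planning and testing.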
IT contingency plans must include several important sections: Emergency Response,
Backup Operations, Short-Term Recovery Procedures, Long-Term Loss Recovery
Procedures, Roles and Responsibilities, and Testing. Contingency site location and
contacts, emergency staff contact rosters, vendor/supplier contacts, access control
procedures, and user/customer notification procedures are also typically included. IT
contingency plans that include a remote location must also address the regular
management and operation of that site.

Data and Software Backups


For both short- and long-term losses, the ability to restore software and data to the point
when the detrimental event occurred is critical to the organization's operational
resumption. A company should strive to restore operations as quickly as possible. A
short-term loss that requires restoration from backup might involve replacing corrupted
files, recovering an accidentally deleted or overwritten file, recovering from a disk
crash, or any other event that requires the installation of software or data files from
the most recent backup. Likewise, activating a remote contingency site in response to
long-term losses that affect significant network or system resources requires that the
most recent backups be available to the users to reduce rework. A Redundant Array of
Independent Disks (RAID) can be configured for data storage and recovery in the event
of a disk crash, either through mirroring or by recreating the lost disk from the others
in the array. Backups can be sent over a broadband network connection to a remote
location (electronic vaulting); similarly, remote journaling can be used to recover
transactions.
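Recreating a lost disk from the others in the array can be illustrated with XOR parity,
the idea behind parity-based RAID levels such as RAID 5. The sketch below is a toy
model that treats each disk as a short byte string; the disk contents and the function
name xor_blocks are assumptions made for illustration, and real arrays operate on
stripes of blocks in the controller or operating system.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks plus one parity disk computed from them.
data_disks = [b"ACCT", b"PAYR", b"MAIL"]
parity = xor_blocks(data_disks)

# Simulate the loss of the second disk and rebuild it from the survivors and parity.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # prints b'PAYR'
```

Because XOR is its own inverse, combining the surviving disks with the parity cancels
their contributions and leaves only the missing disk's data. Mirroring, by contrast,
simply keeps a second full copy.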
RAID, electronic vaulting, and remote journaling are more advanced than the previous
method, in which tapes served as backup and were rotated through off-site locations to
protect them from an event that impacted the main operational site. Backup methods
require that periodic "full backups" be performed. To reduce the backup duration and
number of tapes, subsequent differential backups can be performed, each of which backs
up the files changed since the last full backup. Alternatively, incremental backups can
be performed, each of which backs up only the files changed since the previous
incremental backup (or, for the first incremental, since the full backup). This
incremental process continues until the next full backup, after which it begins again.
With incremental backups, recovery from tape can be slow because many tapes are needed
to restore all of the data: first the full backup must be restored, followed by every
incremental tape in order.
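The difference in restore effort between the two schemes can be sketched in Python. The
function below returns the backups that must be restored, in order, to recover data as
of a given day. The Backup class, the restore_chain function, and the day numbering are
hypothetical; the sketch also assumes a schedule uses either differential or incremental
backups after each full backup, not a mix of both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # day on which the backup was taken (hypothetical numbering)
    kind: str  # "full", "differential", or "incremental"


def restore_chain(backups, target_day):
    """Return, in restore order, the backups needed to recover data as of target_day."""
    usable = sorted((b for b in backups if b.day <= target_day), key=lambda b: b.day)
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup exists on or before the target day")
    last_full = full_positions[-1]
    later = usable[last_full + 1:]
    differentials = [b for b in later if b.kind == "differential"]
    incrementals = [b for b in later if b.kind == "incremental"]
    if differentials and incrementals:
        raise ValueError("mixed schedules are outside the scope of this sketch")
    chain = [usable[last_full]]
    if differentials:
        # Only the newest differential is needed: it holds every change since the full.
        chain.append(differentials[-1])
    else:
        # Every incremental since the full backup is needed, oldest first.
        chain.extend(incrementals)
    return chain


full = Backup(0, "full")
incremental_week = [full] + [Backup(d, "incremental") for d in range(1, 5)]
differential_week = [full] + [Backup(d, "differential") for d in range(1, 5)]

print([b.day for b in restore_chain(incremental_week, 3)])   # [0, 1, 2, 3]
print([b.day for b in restore_chain(differential_week, 3)])  # [0, 3]
```

The trade-off is visible in the output: incremental backups are smaller to take but need
a longer restore chain, while a differential restore needs only two backups at the cost
of larger nightly backups.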

Short-Term and Long-Term Resumptions


Many people think of IT contingency plans only in terms of the long-term disruption
portion, which is exercised after high-magnitude events. However, the probability of a
detrimental event is normally inversely related to the magnitude of its impact on the
organization. Thus, IT operations are disrupted more frequently by small events than by
large, devastating ones.
Power failures, communications failures, environmental equipment failures (HVAC),
equipment misconfigurations, drive crashes, loss of connection to power or data cables,
and other easily remedied occurrences cause short-term disruptions. These occurrences
can require a roll-back, troubleshooting, or other procedure to quickly restore
operations. Documenting the procedure step by step helps operations personnel carry out
the restoration calmly and correctly. In the absence of a good plan, minor events can
lead to increased downtime. If the short-term portion of the plan provides good
procedures, however, the disruption can be kept short, even by inexperienced system
administration staff.
Long-term restoration plans and their implementation procedures come in varying types,
depending on the resources available to the organization and the costs that the
organization is willing to pay. Just as there are options for insurance plans and
different riders, an organization should consider the different options for long-term
resumption. Some options can be rented under contract from outside providers. However,
this might require organizational trust in the contingency vendor if data is maintained
at the vendor's site.
Hot Sites
Rapid long-term resumption can occur through the implementation of a "hot site". Hot
sites are prepared with equipment configurations similar to those at the primary site
and can become the primary site relatively quickly, such as within one day, once the
latest backups are loaded, domain name addressing is updated, and any other switchover
requirements are met. This option is expensive, as it requires the organization to
maintain another site that exists only for contingency operations. The most
sophisticated hot site is the mirror site, which maintains automated near-real-time
copies of the data, so that switchover to the contingency site takes place in moments.
Mirror-site contingency plans are crucial to many federal organizations; the Federal
Aviation Administration is a good example of an organization that maintains mirror
sites, as it must ensure that aircraft are continuously tracked and directed by air
traffic controllers.
Cold Sites
Cold sites are bare shells or organizational spaces that are designated as the
contingency site. Cold site contingency plans spell out the procedures for emergency
upgrades and procurements to obtain the equipment, communications, cable plant, and so
on that are necessary to restore operations. This process is slow, and there can be
competition for speedy procurements if the event affects a large number of
organizations (such as a hurricane that devastates a large city). Although this option
is inexpensive to maintain, the activation costs are high if it is ever needed. This
option also allows the organization to procure state-of-the-art equipment if and when
the site is activated. It is difficult to test this type of plan actively. A cold site
that has been upgraded with some equipment is known as a "warm site". A warm site that
always has the latest data is known as a hot site.
Alternate Site Option
There is another "alternate site" option, which is often provided through a reciprocal
agreement with a site that is trusted or within the organization. It is low cost, yet it
can provide rapid resumption of operations, though often in degraded modes. With this
"alternate" or "reciprocal" site, the primary site makes an agreement or contract with a
similarly configured site elsewhere (within the organization or at a partner). If the
primary (victim) site suffers a catastrophic event, it can carry out pre-arranged and
agreed operations using resources supplied by the gaining site. The agreements must be
detailed, because the gaining site might also have to run in degraded mode to some
extent. When performed under reciprocal agreement with a partner, there are normally no
maintenance costs, but there are pre-arranged fees to be paid upon activation by one of
the partners.
This type of service can also be contracted out to organizations that provide this
type of "insurance" service. These service organizations guarantee to have the
appropriate network resources available within a specified short period of time from
notification, and as such, there are maintenance, activation, and operation fees, as per
the contract. The benefit is fairly rapid resumption of critical processes as per the
agreement. And, unless the agreement is activated, maintenance costs remain low.

Contingency Plan Testing Is Training


There are two critical reasons for testing a contingency plan. The first is to ensure
that the plan is still valid and does not need to be modified because of system,
personnel, or process changes. The second is that the involved personnel must read
their copy of the contingency plan before a disaster happens. It doesn't help to read
the plan as the fire department is hosing down the smoldering ashes where your office
building once stood. The more you test the plan, the more comfortable key personnel will
be with their roles and responsibilities during emergencies. Contingency plan
development and testing is mandatory for all federal organizations, and businesses
should require it as well.
A plan can be tested by two primary methods:

• Active test method

• Scenario-based test method

The best testing option, but the most operationally disruptive, is the active test
method. In this method, the contingency plan or portions of it are actually executed as
if there were a real emergency. Often, the contingency site in this option is brought up
in parallel with the primary site.
The scenario-based method takes less planning, is short in duration, and causes little
disruption; however, it does not provide the same level of training or update benefits.
In this method, the key players are quizzed on what actions to take in a given
emergency scenario. With either method, testing a contingency plan includes a training
component and offers the chance to update the plan if errors are found.

Summary
Proper resumption of business processes must be considered for all business-critical
resources within an organization so that it can recover quickly and efficiently from the
consequences of a devastating event. The focus of this paper, however, was on
maintaining the stability of organization-critical IT assets and recovering from
disruptions to them. IT contingency (or IT disaster recovery) plans are necessary to
minimize the impacts that unexpected losses of organizational IT services cause. IT
contingency planning provides documented processes designed to sustain and recover IT
services following an event detrimental to normal operations. A well-researched process
mitigates the risk of system or service unavailability by focusing on effective and
efficient procedures for recovering from events that might impact operations.
