EXECUTIVE OVERVIEW
A much-discussed aspect of ‘The internet changing everything’ has been the increasing focus on
availability in open systems applications. Seemingly overnight, application developers and support
staffs were faced with the latest buzzword ‘24x7’ and were given sometimes ill-defined requirements
to have systems and applications available at all hours. IT management, developers, support staff,
DBAs, system and network administrators were then faced with the problem of translating those
requirements into an achievable reality, integrating solutions from multiple hardware and software
vendors.
Oracle’s philosophy with our latest generation of products is to provide a set of components that
will present to the users of a system a seamless picture of application availability, even though any
one component may be experiencing a failure. Throughout the remainder of this paper, we will
review the justifications for developing a highly available solution, justifying the expense and trouble,
explore some of the key components available in the Oracle product set, and illustrate their use with
a sample system architecture.
Oracle has been serious about providing high application availability to customers since it first
introduced support for a clustered database in Oracle version 6.2 on VMS in 1990. Even back then
Oracle consulting would write scripts to provide a simple standby database running recovery at a
backup site. Today Oracle is setting the pace in open systems High Availability with its new tagline:
Unbreakable.
Paper 541
Database Administration
The following table provides a guideline for defining high availability requirements. Downtime for
hardware and software upgrades also counts when considering availability requirements.
Implementing and maintaining a highly available system can require a large investment in hardware,
software and a further investment of staff time to construct the recovery procedures, test the
solution and monitor the system to reduce the overall recovery time. Business users, management
and IT staff should all understand at design time the investment required in setting up and
maintaining a 24x7 system.
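The uptime targets discussed here can be made concrete by computing the downtime budget each level of availability implies. The following query is a minimal sketch using Oracle's DUAL table; the availability percentages shown are common examples, not requirements from any particular system:

```sql
-- Downtime budget implied by an availability target:
-- (100 - availability%) / 100 * 525,600 minutes in a (non-leap) year.
SELECT avail_pct,
       ROUND((100 - avail_pct) / 100 * 365 * 24 * 60, 1) AS downtime_min_per_year
FROM  (SELECT 99.0   AS avail_pct FROM dual UNION ALL
       SELECT 99.9   AS avail_pct FROM dual UNION ALL
       SELECT 99.99  AS avail_pct FROM dual UNION ALL
       SELECT 99.999 AS avail_pct FROM dual);
-- 99.999% ("five nines") leaves roughly 5.3 minutes of downtime per year,
-- while 99% leaves over 87 hours.
```

Remember that planned downtime for hardware and software upgrades draws from the same budget as unplanned outages.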
System designers often discuss availability in terms of 100% uptime. Our goal is for users to
perceive the system as available all the time, or 24x7, even though an individual component may
have failed. We want to do this while keeping the cost of implementing such a system affordable
and the management of the overall solution straightforward.
A business needs to balance the cost of providing a given level of availability against the cost of downtime.
According to a survey by Contingency Research Planning, Livingston, N.J., the leading causes of
computer downtime lasting more than 12 hours were power-related problems (surges, etc.) at 31%,
storm damage at 20%, burst pipes at 16%, fire and bombing at 9%, earthquakes at 7%, and other
causes at 4%. A study conducted by the University of Texas found that of companies that suffer
major data loss and extended downtime, only 6% survive: 43% never reopen and 51% close
within 2 years (source: CIO Magazine, April 1998). It is interesting to note that, although 98 percent
of CIOs polled believe it is important to have a disaster recovery plan, 25 percent do not have one in
place, according to a poll conducted by RHI Consulting and published at
http://www.cio.com/archive/040198_disaster.html.
The results of a survey on reasons for data loss, as published in the Disaster Recovery Journal, are
summarized in the chart below.
From this data, we can identify the two most likely causes of an outage: hardware/software
errors and human errors. For equipment errors, this data allows us to understand and
justify the cost of appropriate fault-tolerant hardware and software solutions. It is interesting to note
that the second largest cause of outage and data loss is human error. This only reinforces the point
that appropriate and detailed procedures must be in place and reviewed (and practiced) regularly to
ensure the highest levels of uptime. In addition to the categories of errors described here, it would
be prudent to review the particular application environment for additional potential sources of
outage and develop appropriate plans.
TEST IT!
While discussing human error, it is worthwhile to mention an often overlooked best practice for a
true HA environment: the creation and maintenance of an adequate test environment that is
capable of producing load representative of the production environment. As discussed earlier,
human error was the second largest cause of data loss in a survey of companies that experienced an
outage. Along with automation, scripting, and clear procedures, testing is key to reducing the chance
that human failure will cause or extend an outage. Oracle considers maintaining an appropriate test
environment a key best practice for assuring continued high availability for any system.
An example of a site where testing really paid off is Merrill Lynch. On September 11th, Merrill Lynch,
one of the world's leading financial management and advisory companies, which tests critical
applications for disaster readiness quarterly, was challenged with a previously unthinkable
disaster when terrorists struck the Twin Towers in New York. A core data processing center within
two blocks of ground zero made the decision to redirect operations to a backup data center in the City
of London, England. Within 11 minutes of making the decision, London was able to take over
operations by activating its Oracle standby databases with no loss of data.
In addition to defining the uptime requirements, the recovery procedure for each type of outage,
and the command structure behind it, should also be defined. Issues such as who should be notified,
and how they should be contacted, should be clearly spelled out. The question of who should declare
an outage and start the failover process should also be settled. For a system with very high availability
requirements, much of the outage window could be lost if there is no clear decision to start the
failover process or if that process is not completely automated.
These procedures should also define the exact steps for activating the failover plan. As many of
those steps as possible should be scripted and tested in advance. These scripts and plans should be
tested before they are needed in a real failure and should form the basis of an ongoing test
methodology.
• Prevent failures before they happen through tested and certified configurations
• Detect obvious and subtle failures quickly with fault detection agents
• Capture diagnostic information
• Resume services to end users quickly, and with minimal disruption
• Analyze failures offline to prevent recurrence
Once you have a common definition of uptime, a common understanding of what any given
system’s uptime should be, and an understanding of the errors you may encounter, choose from
the array of solutions that Oracle provides. The next section of this document reviews some of
the new and key High Availability features with these goals in mind.
BASE FEATURES
The Oracle9i release of Data Guard has built upon some features in the core database server product
that contribute directly to the potential availability of any system.
• Online Redefinition – This new package allows the DBA to modify the storage parameters of
any given object, move a table from one tablespace to another, partition an existing table,
recreate or reorganize a table, or change the structure of an existing table by adding, dropping,
or redefining a column.
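A typical use of this capability is the DBMS_REDEFINITION procedure sequence. The sketch below uses illustrative names (the SCOTT.EMP table and an interim table EMP_INTERIM, assumed to have been created already with the desired new structure); it is not from any particular deployment:

```sql
-- Check that the table qualifies for online redefinition (by default it
-- needs a primary key).
EXEC DBMS_REDEFINITION.CAN_REDEF_TABLE('SCOTT', 'EMP');

-- Begin redefinition: rows are copied into the interim table while the
-- original table remains fully available for DML.
EXEC DBMS_REDEFINITION.START_REDEF_TABLE('SCOTT', 'EMP', 'EMP_INTERIM');

-- Optionally catch up on changes made during the copy, then swap the tables;
-- the original is locked only briefly during the final switch.
EXEC DBMS_REDEFINITION.SYNC_INTERIM_TABLE('SCOTT', 'EMP', 'EMP_INTERIM');
EXEC DBMS_REDEFINITION.FINISH_REDEF_TABLE('SCOTT', 'EMP', 'EMP_INTERIM');
```

Throughout the process, end users continue reading and writing the original table, which is the point of the feature from an availability perspective.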
RMAN greatly simplifies recovery by automatically identifying the appropriate backups and archived
logs required to recover the database.
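A complete recovery can therefore be reduced to a few commands. The following is a sketch, assuming a control file or recovery catalog that already records current backups:

```sql
RMAN> STARTUP MOUNT;
RMAN> RESTORE DATABASE;   -- RMAN locates the appropriate backup sets itself
RMAN> RECOVER DATABASE;   -- RMAN applies the archived and online redo it needs
RMAN> SQL 'ALTER DATABASE OPEN';
```

No manual bookkeeping of file names or log sequence numbers is required, which removes a common source of human error during a stressful recovery.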
The HARD initiative includes several technologies that can be embedded in storage devices to
prevent all these classes of corruption. Oracle’s storage partners will roll out these technologies over
time.
STORAGE CONFIGURATION
Configuring the storage subsystem optimally for the database is a very important task for most
system and database administrators. A poorly configured storage subsystem can result in I/O
bottlenecks and reduced protection from device failures. The SAME configuration model offers a
scheme that addresses the availability and performance challenges administrators face. SAME, an
acronym for Stripe And Mirror Everything, was developed by Oracle experts who have done
significant work researching optimal storage configuration for Oracle database systems. The
model is based on four simple proposals: (i) stripe all files across all disks using a 1 megabyte stripe
width, (ii) mirror data for high availability, (iii) place frequently accessed data on the outside half of
the disk drives, and (iv) subset data by partition, not by disk.
STRIPE ALL FILES ACROSS ALL DISKS USING A 1 MEGABYTE STRIPE WIDTH
Striping all files across all disks ensures that the full bandwidth of all the disks is available for any
operation. This equalizes the load across disk drives and eliminates hot spots. Parallel execution and
other I/O intensive operations do not get unnecessarily bottlenecked because of disk configuration.
Since storage cannot be reconfigured without a great deal of effort, striping across all disks is the
safest, most optimal option. Another benefit of striping across all disks is that it is very easy to
manage. Administrators no longer need to move files around to reduce long disk queues, which
frequently has been a non-trivial drain on administrative resources. The easiest way to do such
striping is to use volume level striping using volume managers. Volume managers can stripe across
hundreds of disks with negligible I/O overhead and are, therefore, the best available option at the
present time. The recommendation of using a stripe size of one megabyte is based on transfer rates
and throughputs of modern disks. If the stripe size is very small, the disk head has to move a lot
more to access data. This means that more time is spent positioning the disk head on the data than
in the actual transfer of data. It has been observed that a stripe size of one megabyte achieves
reasonably good throughput and that larger stripe sizes produce only modest improvements.
However, given the current trend in advances in disk technology the stripe size will have to be
gradually increased.
PLACE FREQUENTLY ACCESSED DATA ON THE OUTSIDE HALF OF THE DISK DRIVES
The transfer rate of a disk drive varies for different portions of the disk. Outer sectors have a higher
transfer rate than inner sectors. Also outer portions of a disk drive can store more data. For this
reason, datafiles that are accessed more frequently should be placed on the outside half of the disk.
Redo logs and archive logs can undergo significant I/O activity during updates and hence should
also be placed on the outside portion of the disks. Since this placement can impose an administrative
overhead on the database administrator, a simple solution is to leave the inside half of each disk drive
empty. This is not as wasteful an option as it might appear: due to the circular shape of the
disks, typically more than 60% of capacity is on the outside half of the disk. The current trend
of increasing disk drive capacities makes it an even more viable proposition.
Oracle9i Data Guard - Standby database was first delivered as part of the initial release of the
Oracle9i Database in the summer of 2001. A Standby database maintains one or more bit-for-bit
replicas of the Primary database it protects. A SQL Maintained Standby database is an integral part of
a future release of Oracle9i, and is used to maintain a logical replica of a Primary database. A Data
Guard configuration is comprised of a collection of loosely connected systems, consisting of a single
Primary database and a number of Standby databases, which can be a mix of both traditional
Standbys and SQL Maintained Standby databases. The databases in a Data Guard configuration can
be in the same data center (LAN attached) or geographically dispersed (over a WAN) and connected
by Oracle Network Services.
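On the primary, shipping redo to a standby is configured through the archive destination parameters. A minimal sketch, assuming an Oracle Net alias named stby (illustrative) that resolves to the standby system:

```sql
-- The second archive destination points at the standby service; LGWR ASYNC
-- ships redo as it is generated rather than waiting for log switches.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 = 'SERVICE=stby LGWR ASYNC';
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_2 = ENABLE;
```

The first destination remains the local archive destination, so redo continues to be logged locally as described below.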
When OLTP users add data or change information stored within an Oracle database, the database
buffers changes in memory pending a request to make those changes permanent (when an
application or interactive user transaction issues a COMMIT statement). Before the COMMIT
operation is acknowledged by the database, that record is written to database redo log files in the
form of a Redo Log record, which contains just enough information to redo the transaction in case
the database must restart after a system crash or perform data recovery following data loss.
When using a standby database, as transactions make changes to the Primary database, the redo log
data generated by those changes, in addition to being logged locally, is sent to the Standby database.
These changes are applied to the standby database, which runs in managed recovery mode in the case
of a traditional standby database, or are applied using SQL regenerated from the archived log files.
Whilst the Primary database is open and active, a traditional Standby database is either performing
recovery (by applying logs) or open for reporting access. In the case of a SQL Maintained
Standby database, changes from the Primary database can be applied concurrently with end-user
access. The tables maintained by SQL generated from the Primary database log will, of course, be
read-only to users of the SQL Maintained Standby. These tables can have different indexes and
physical characteristics from their Primary database peers, but they must maintain logical consistency
from an application access perspective in order to fulfill their role as a Standby data source.
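For a traditional Standby, the modes described above map to a handful of statements issued on the standby instance. The following is a sketch of the typical lifecycle:

```sql
-- Normal operation: apply redo continuously in the background.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

-- To use the standby for reporting: pause recovery and open it read only.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE OPEN READ ONLY;

-- At failover time: activate the standby as the new primary.
ALTER DATABASE ACTIVATE STANDBY DATABASE;
```

Note that activation is a one-way operation for a traditional standby; after failover, a new standby must be established to restore protection.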
Oracle Real Applications Clusters supports additional Oracle features that enhance recovery time
and minimize the disruption of service to end-users. These features provide fast and bounded
recovery, enforce primary node /secondary node access if desired, automatically reconnect failed
sessions, and capture diagnostics data after a failure. In addition, Oracle Real Applications Clusters
can be integrated with the cluster framework of the platform to provide enhanced monitoring for a
wide range of failures, including those external to the database.
Oracle Real Applications Clusters can recover more quickly from failures than traditional cold
failover solutions. Because both instances are started and the database is concurrently mounted and
opened by both instances, there is no need to fail over volume groups and file systems; they
are already available to all nodes as a requirement of Real Applications Clusters. There is also no
need to start the Oracle instance, mount the database, and open all the data files.
If there are many connections that must be reestablished after failover, this task can be time
consuming. Oracle Real Applications Clusters supports pre-connections to the secondary instance.
In this case, the memory structures needed to support a connection are established in advance,
speeding reconnections after a failure. Lastly, since both instances are running, the cache can be
warm in both instances. In an active/active configuration, the cache on both nodes is warmed
automatically.
Real Application Clusters Guard is built upon Real Applications Clusters running in a primary
node/secondary node configuration. Much of the complexity in installation, configuration and
management of the various components of traditional high availability solutions is avoided with the
use of Real Application Clusters Guard. In this configuration, all connections to the database are
through the primary node. The secondary node serves as a backup, ready to provide services should
an outage at the primary occur. Also, Real Application Clusters Guard is tightly integrated with the
cluster framework provided by the system's vendor to offer enhanced monitoring and failover for
all types of outages.
The result of this integration of the cluster framework with Real Applications Clusters is the look
and feel of traditional cluster failover. This is important as it means the solution works with all types
of third-party applications that are designed to work with a single Oracle instance. However, the Real
Applications Cluster Guard solution provides much faster detection and bounded failover in the
event of a failure. It also provides improved performance after failover thanks to pre-connected
secondary connections provided by Transparent Application Failover (TAF) and a pre-warmed
cache on the secondary node.
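Pre-connection and failover behavior are declared on the client side through the FAILOVER_MODE clause of an Oracle Net alias in tnsnames.ora. The sketch below uses illustrative host, service, and alias names:

```
# SALES_SECONDARY is assumed to be another alias pointing at the secondary node.
# TYPE=SELECT lets in-flight queries resume after failover;
# METHOD=PRECONNECT establishes the backup connection in advance.
SALES =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = sales)
      (FAILOVER_MODE =
        (BACKUP = SALES_SECONDARY)
        (TYPE = SELECT)
        (METHOD = PRECONNECT))))
```

Because the failover policy lives in the connect descriptor, applications gain this behavior without code changes.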
• OCI Failover Callback - On a failure, a client application callback can be provided that will be
executed after instance failure. This callback can be used at the application level to check for the
presence of a failure and execute failover logic (retry the transaction, present a message to the
user, etc.)
While all these components add capabilities to the DBA's and developer's toolkit for building highly
available solutions, the beauty of these components is how they can be brought together to develop a
highly available architecture. Before tying these features together, it is important to note a few
remaining general concepts:
Hardware – Any Oracle system is only as available as its underlying hardware components. While it
is beyond the scope of this paper to review the many solutions available from different hardware
vendors, redundant disks and clustered systems are key to any HA solution.
Manageability – Keep DBAs free to focus on adding high-level business value by automating
mundane tasks, such as shuffling disks for performance, which can be eliminated using SAME.
Administrators should use modern tools such as Enterprise Manager to ease or eliminate many
tasks through alerts, advisories, capacity planning, and monitoring tools. Administrators should
also exploit manageability features such as automatic undo management and automatic space
management within database files.
Standby (local and remote) – The standby database functionality provides multiple functions in this
architecture. Running locally, it provides a robust backup solution, taking the place of the traditional
hot backup. By applying the logs from the production database with a relatively short delay, a
complete copy of the production database can be maintained locally, allowing for object level
recovery as well as file-based recovery. Using a remote standby provides protection from a critical
failure of production hardware. In addition, running standby databases allows us some flexibility in
applying the logs, providing protection from human error as described earlier and providing a copy
of the database for read only processing away from the production instance.
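The delayed log apply mentioned above is configured with the DELAY attribute of the archive destination on the primary. A sketch, again assuming an illustrative Oracle Net alias stby, delaying application by 30 minutes:

```sql
-- Redo is still shipped immediately, but the standby waits 30 minutes before
-- applying it, leaving a window in which a damaging statement on the primary
-- (a mistaken DROP or mass UPDATE) can be kept out of the standby copy.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 = 'SERVICE=stby DELAY=30';
```

The delay chosen is a trade-off: a longer delay widens the protection window against human error but lengthens the catch-up time during a failover.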
RMAN – In this architecture, RMAN provides an Oracle-aware method for completing our
architecture with an offline copy of the database. In addition to automating this process, RMAN
checks the integrity of the backup as it is being taken, catching any corruption at the time of the
backup instead of at the restore. This is crucial: if this is the last line of defense for the
database and the backup is corrupted, the database may not be recoverable.
TAF – Finally, using the facilities available with Transparent Application Failover, we can speed
application recovery time by defining in advance the resources available to an application in the
event of an outage. With the failover callback, we can also hide many potential failures from the
user.
HARD and SAME – Using the features described above makes a configuration more robust and
eases administration while making optimal use of available disk storage.
SUMMARY
Oracle9i provides a broad array of features that ease the challenges of maintaining high availability for
any application. While we have provided a sample architecture and described its benefits, each
application deployment will have its own unique HA requirements, constraints, and infrastructure. We
hope this discussion better arms the reader to more fully exploit the capabilities Oracle provides to
achieve high application availability. More information on Oracle high availability projects can be
found at http://otn.oracle.com/deploy/availability.