
Cache Fusion



Objectives

After this lesson, you should be able to:


• Describe the benefits of cache fusion
• Differentiate lock modes and roles
• Explain how cache fusion transfers locks and
blocks in different scenarios
• Describe the basic Distributed Lock Manager (DLM)
changes to support cache fusion



Real Application Clusters Concepts

[Slide graphic: a cluster of four nodes, A through D, each running one instance, 1 through 4.]

Real Application Cluster Concepts


Oracle9i Real Application Clusters allow multiple instances to execute against the same
database. The typical installation involves a cluster of servers with access to the same disks.
The nodes that actually run instances form a proper subset of the cluster. The cluster nodes
are connected by an interconnect that allows them to share disks and run the Cluster Group
Services (CGS) and the Distributed Lock Manager (DLM).
On the IBM RS/6000 SP platform, the entire SP is considered to be a single cluster (unless
it is partitioned, in which case each partition is a cluster). On Windows, which nodes are in
the cluster is controlled by the vendor’s cluster manager.
A node is defined as the collection of processors, shared memory and disks that runs an
instance. A node may have more than one CPU, as in an IBM RS/6000 SP SMP node. The
node monitor is the vendor-provided software that monitors the health of processes running
in the cluster, and is used by CGS to control the membership of instances in the Real
Application Cluster.
The node-to-instance mapping is the definition of which instances run on which nodes. For
example, it might record that instance RAC1 runs on host1, and instance RAC2 runs on
host2. This mapping is stored in text configuration files on UNIX, and in the registry on
Windows.



Real Application Clusters Concepts

[Slide graphic: the software stack: operating system, Oracle RDBMS, distributed lock manager, cluster management software, and shared disk software, above the data files, online redo log files, and control files.]

Real Application Cluster Concepts


An Oracle Real Application Cluster Relational Database Management System (RDBMS)
contains the same elements as a normal Oracle RDBMS. It runs under an operating system
and uses disks to store the database files, data files, online redo log files, and control files.
The difference for an Oracle Real Application Cluster is that each database is managed by
multiple, concurrent instances, each running on a different node in a cluster. This requires
some additional software, much of which has evolved over time. The original clustered
version of Oracle was called Oracle Parallel Server and released as Version 6.2. This
version relied on cluster management software and a Distributed Lock Manager (DLM)
provided by the cluster hardware vendor.
The cluster management software is responsible for ensuring that each active node knows
the status of the other nodes. The DLM is responsible for handling special locks which
prevent two nodes from making conflicting changes to the same data. Oracle Parallel
Server relied on the vendor’s DLM to manage special locks called Parallel Cache
Management locks. These allowed only one instance to have exclusive access to a block in
order to change its contents while multiple instances could share a lock in order to query a
block. Ownership or status changes of PCM locks between instances required frequent
block writes and reads, an action known as pinging. Poorly designed applications could suffer
from reduced performance due to excessive pinging.
During the various releases of Oracle7, the number of hardware platforms and operating
systems that could support Oracle Parallel Server increased. As Oracle Parallel Server was
ported to these systems, some of the functionality of the cluster manager was incorporated
into the Oracle Parallel Server code. The interface between the vendor and the Oracle
portions of the cluster manager was known as Group Management Services. An Oracle-
developed DLM also became available on some platforms during the life of Oracle7.
With the Oracle8 releases of Oracle Parallel Server, the DLM and many of the features of
the cluster management software were incorporated into the Oracle code. The interface
between the vendor’s cluster code and the Oracle code was renamed to Cluster Group
Services. The space required for the DLM was made available in the Shared Pool of the
active instances using special parameters prefixed with LM_. As instances started up or shut
down, the DLM was rebuilt across the active instances to balance the workload and space
required.



Real Application Clusters Concepts
[Slide graphic: two panels, each showing Node A and Node B running Instances 1 and 2. The upper panel, "Block transfer with disk ping," shows the block moving between the instances through disk; the lower panel, "Cache fusion block transfer," shows the block moving directly between the instances.]

Real Application Cluster Concepts


In Oracle8i, a Read Consistent Cache Fusion server was introduced to Parallel Server. This
removed the need to force block pinging to satisfy queries and greatly simplified
application design for Oracle Parallel Server. Read Consistent Cache Fusion allowed a read
consistent block image, required to satisfy a query, to be transferred from the buffer cache
of the instance holding the current block using the cluster interconnect to the instance
performing the query. This contrasted with previous releases which required the current
block to be pinged across the disks to the querying instance followed, in most cases, by the
pinging of rollback segment blocks to produce a read consistent copy of the block. Read
Consistent Cache Fusion is also known as Cache Fusion, Phase 1, or Write-Read cache
fusion. The latter name refers to the fact that blocks changed in one instance (write activity)
have a read consistent image sent across the interconnect to satisfy a query (read activity).
In Oracle9i, full Cache Fusion is introduced and the product is renamed from Oracle
Parallel Server to Oracle Real Application Clusters. With Cache Fusion, modified blocks
currently in the buffer cache of one instance are transferred to another instance across the
cluster interconnect rather than by pinging the blocks. This is true for blocks required for
changes by the second instance (write-write transfers) as well as for queries on the second
instance (write-read transfers). The mechanism also allows read-read and read-write
transfers, which reduces the need to read blocks from disk.
Note that you can continue to use the traditional PCM locking mechanisms, if you wish,
which bypasses the Cache Fusion mechanisms and uses block pinging to transfer blocks
between instances. This can be beneficial if you have already designed a well-partitioned
database and application which does not involve much block sharing between the instances.



Benefits of Cache Fusion

[Slide graphic: bar chart of block access time in milliseconds, from a typical study: roughly 0.01 for a block in the local cache; roughly 100 for a block in a remote cache without cache fusion, but roughly 1 with cache fusion; and roughly 20 for a block on disk.]

Benefits of Cache Fusion


Cache Fusion removes the pinging (disk writes and reads) required to transfer blocks being
modified by multiple instances between those instances. It extends and replaces the
functionality of the Consistent Read Server, introduced in Oracle8i, which serviced
requests to read blocks changed in another instance, by allowing any block being held by
any instance to be copied to another instance across the cluster interconnect. As studies on
typical clusters with a high-speed interconnect demonstrated, blocks can be transferred directly
between instances faster than they can be read from disk. The results of a typical study are
shown in the slide.
By providing fast transfers of blocks, Cache Fusion allows Oracle Real Application
Clusters to perform data access as fast as, or faster than, a single instance that must access
multiple blocks by reading them from disk. Even with the overhead of the DLM, this
allows any application to run without degradation on Oracle Real Application Clusters.
Under previous releases of Oracle Parallel Server, some applications would not scale well
because the block pinging algorithm caused too many disk writes and reads between
instances. This could be avoided by partitioning the data, the application, or the users across
instances, but this was not always possible (for example, with third-party applications) or
desirable, due to the costs associated with modifying the data or the application. By
providing block transfers without pinging, Cache Fusion avoids the need for partitioning
and allows any application to take advantage of the scalability offered by Oracle Real
Application Clusters without requiring any changes.
The simplified messaging allowed by Cache Fusion, discussed on the next slide, reduces
the work required to rebuild the DLM as new instances are started or existing instances are
stopped. This, in turn, reduces the time it takes an Oracle Real Application Cluster to return
to normal processing following an instance failure or cluster reconfiguration. Additionally,
it allows the DLM to migrate a lock master from its original location: the DLM will
migrate a lock master to the requesting instance when a single instance appears to make
exclusive, but repeated, use of the lock.



Cache Fusion Model

• PCM locks and the DLM record


– Lock modes
– Lock roles
– Past image history
• Data transfer between instances uses the cluster
interconnect

Cache Fusion Model


To ensure that the current block image can be located and that recovery mechanisms can
determine the order of changes, PCM locks and the DLM contain additional information
compared to earlier versions. The locks contain the traditional information about the lock
mode (NULL, Shared, or Exclusive), as well as the new lock role status (local or global) and
past image history. The DLM is responsible for assigning PCM lock modes and roles,
updating their status, locating the most current block image when necessary, and informing
holders of past images when those images are no longer needed. Past images can be
discarded when an instance writes a current block image to disk. With Cache Fusion, such
writes occur to satisfy checkpoint requests and similar events, not to ping blocks.
To ensure that Cache Fusion provides the highest level of block transfer rates, the cluster
interconnect should be based on fast interconnect technologies with low latency. This
includes high-speed Ethernet and technologies based on the Virtual Interface Architecture
(VIA) messaging API and hardware, such as GigaNet and ServerNet.



Cache Fusion Model

• Cache fusion requires fewer messages and steps to convert lock status
– Requesting instance to instance mastering the
lock
– Mastering instance to the instance holding the
lock
– Holding instance to the requesting instance
– Holding instance to the mastering instance
• Traditional pinging can coexist with Cache Fusion

Cache Fusion Model


Cache fusion requires fewer round trip messages between instances than traditional PCM
locking when lock statuses change. Lock requests no longer involve DBWR and require the
following messages:
• Requesting instance (requester) to instance mastering the lock (master)
• Master to the instance holding the lock (holder)
• Holder to the requester
• Holder to the master (which can occur asynchronously with lock usage on the
requester)
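
The exchange can be sketched as follows; this is a minimal illustration only, in which the
instance names and message strings are invented and stand in for the real DLM processes.
(The final message is shown from the holder, as in the list above; in the worked examples
later in this lesson, the requesting instance sends the corresponding lock assumption
message.)

def send(src, dst, msg):
    print(f"{src} -> {dst}: {msg}")

def convert_lock(requester, master, holder, mode):
    # 1. Requesting instance to the instance mastering the lock
    send(requester, master, f"request {mode} lock")
    # 2. Mastering instance to the instance holding the lock
    send(master, holder, f"transfer block to {requester} for {mode} access")
    # 3. Holding instance to the requesting instance
    send(holder, requester, "block image plus lock disposition")
    # 4. Holding instance to the mastering instance; this can occur
    #    asynchronously while the requester already uses the block
    send(holder, master, "disposition and status update")

convert_lock(requester="B", master="D", holder="C", mode="shared")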
These changes also simplify the overall DLM strategy. In particular, less overhead is
required to rebuild the DLM following the addition or deletion of an active instance and
locks can be dynamically remastered. This allows them to be relocated to instances where
they are used, reducing the internode traffic when these locks are required by the local
instance.
Tablespaces can be defined to use traditional hash locks. Blocks in tablespaces using hash
locking do not use the Cache Fusion features or their reduced messaging. Changed
blocks covered by hash locks are pinged when required by other instances.



PCM Lock Modes

PCM locks retain their modes from previous releases:
• Exclusive (X)
• Shared (S)
• NULL (N)

PCM Lock Modes


PCM locks still support the three modes they supported in earlier releases:
• Exclusive (X)
This lock mode allows an instance to change the contents of a block covered by the
lock.
• Shared (S)
This lock mode is required for an instance to retrieve the contents of a block,
typically to satisfy a query. Multiple instances can hold the same PCM lock in
shared mode.
• NULL
The NULL mode of a PCM lock is the default status for each instance. It indicates
that the instance does not currently hold the lock in either X or S status.
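
As the following pages explain, these modes behave as in earlier releases while the lock is
held in the local role. A short sketch of the local-role compatibility rules, using invented
Python names rather than Oracle code:

from enum import Enum

class Mode(Enum):
    N = "NULL"       # default status: the lock is not really held
    S = "shared"     # read access; many instances may hold S together
    X = "exclusive"  # change access; only one instance may hold X

def compatible(held: Mode, requested: Mode) -> bool:
    """Local role only: can `requested` be granted alongside `held`?"""
    if Mode.N in (held, requested):
        return True                # NULL conflicts with nothing
    return held is Mode.S and requested is Mode.S   # only S with S

assert compatible(Mode.S, Mode.S)
assert not compatible(Mode.X, Mode.S)

Under the global role, described on the following pages, these restrictions can be broken;
for example, an instance can hold the lock in shared mode while another instance holds it
in exclusive mode.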



PCM Lock Roles

• A new element of PCM locking, called a lock role, supports Cache Fusion
• A lock can be held in one of two roles within an
instance
– Local: the blocks associated with the lock can
be manipulated without further reference to
DLM or other instances
– Global: the blocks covered by the lock might
not be usable without further information from
the DLM or other nodes

PCM Lock Roles


PCM locks are held in either a local or a global role. When a PCM lock is held with a local
role, it behaves just as it did in previous releases. This means that an exclusive mode lock
can only be held in one instance at a time and that no other instance can hold that lock in
any mode. A local lock, however, can be held in shared mode by multiple instances at the
same time. Global roles allow these restrictions to be broken. Specifically, an instance can
hold a global lock in shared mode while it is concurrently held in exclusive mode by
another instance.
Also, global lock information may be stored in the DLM to manage the history block
transfers even if the lock mode is NULL. With local locking, the DLM discards the lock
allocation information for instances which downgrade a PCM lock to NULL mode.



PCM Lock Roles: Local

• May be X or S mode
• Can serve a copy of blocks to other instances
• Can read blocks from disk
• In X mode
– No other instance has the lock in X mode
– All unwritten changes are in local cache
– Can write changed blocks to disk,
asynchronously informing DLM of the write
• In S mode, block cannot be dirty so no disk writes
allowed

PCM Lock Roles: Local


A PCM lock held in a local role corresponds to the way PCM locks were held in
prior releases. The local role implies that the instance can manipulate the associated blocks
independently of the DLM or the other nodes. This leads to the following characteristics of
local roles:
• Locks with the local role can be in either X or S mode. NULL locks should not be
held but closed.
• Covered blocks can be served to other instances when they are requested. If they
had been dirtied prior to the request, the role is converted to global.
• Any block not in memory can be read from disk while covered by a lock with the
local role.
• In X mode:
– the contents of the block are either identical to the disk copy or changed only
in the local cache; in other words, a block under a local role can be dirty
in at most one instance
– the other instances do not have the lock open, that is, they hold it in NULL
mode
– the block can be written to disk without confirmation from the DLM,
although the DLM should be notified of the write; this notification can be
sent asynchronously after the block is written
• In S mode, the block cannot contain any changes since it was read from disk. Such a
block is not dirty and hence is never written to disk.



PCM Lock Roles: Global

• May be in any mode


• Held under X or S mode while its associated block was
current in a cache
• Covered a globally dirty block when it became global
• With X mode, the associated block can be modified in
the current cache
• Status (current or not) of the associated block on disk
not known
• The associated block can be served to another instance
when required
• The associated block can only be written to disk when
directed by the DLM

PCM Lock Roles: Global


Global lock management is used to manage blocks that are dirty in more than one instance
concurrently. Further, the dirty block image of a globally managed block may be different
from its image in another instance. The global role is initially assigned when a changed
block image is served to another instance without an intervening disk write. To ensure that
an instance is working with the correct image of a globally-managed block, the instance
cannot manipulate the block without coordination, through the DLM, with other instances
holding the lock globally.
When a block is globally managed
• the mode of its PCM lock can be X, S, or NULL
• it was, or is, a current block in an instance under an X or S mode lock
• it was globally dirty when the global role was assigned
• if the mode is X, it can be modified in its current cache
• its image on disk may or may not be current
• it can be served to another instance when required
• it can only be written to disk when
– a request to write from the current instance has been approved by the DLM
– a directive has been received from the DLM



Past Image

• Copy of a dirty block that has been served to another instance
• Maintained until a write covering that version is
recorded
• A BWR is logged, but not flushed to disk, when the
buffer is released

Past Image
A Past Image (PI) is a copy of a globally dirty block image maintained in cache. It is saved
when a dirty block is served to another instance after setting the lock role to global (if it
was not already set). A PI must be maintained by an instance until it, or a later version of
the block, is written to disk. The DLM is responsible for informing an instance when a PI is
no longer needed because another instance wrote the block.
When an instance needs to write a block, to satisfy a checkpoint request, for example, the
instance checks the role of the lock covering the block. If the role is global, the instance
must inform the DLM of the write requirement. The DLM is responsible for finding the
most current block image and informing the instance holding that image to perform the
block write. The DLM then informs all holders of the global lock that they can release the
lock on any PI copies of the block. This allows instances to free buffers holding the PI.
A block written record (BWR) is placed in the instance's redo log buffer when the instance
is told it can free a PI buffer. This record is used by recovery processes to indicate that redo for the
block is not needed prior to this point. Although the BWR makes recovery more efficient,
the instance does not force a flush of the log buffer after creating it because it is not
essential for accurate recovery.
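
The PI lifecycle just described can be modeled in a few lines; a simplified sketch with
invented names, not Oracle's buffer cache code:

class PastImage:
    def __init__(self, scn):
        self.scn = scn

def serve_dirty_block(pi_list, current_scn):
    # Serving a dirty block leaves a PI behind; it must be kept until
    # a write covering this version of the block is recorded.
    pi_list.append(PastImage(current_scn))

def on_write_notification(pi_list, redo_log, written_scn):
    # The DLM reports that a version at written_scn reached disk; PIs
    # at or below that SCN log a BWR (log buffer not flushed) and go.
    for pi in [p for p in pi_list if p.scn <= written_scn]:
        redo_log.append(("BWR", pi.scn))
        pi_list.remove(pi)

pis, redo = [], []
serve_dirty_block(pis, current_scn=1009)
on_write_notification(pis, redo, written_scn=1013)
print(len(pis), redo)   # 0 [('BWR', 1009)]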



Past Image

• When a block is served and the instance has a PI and a current image
– If no write is in progress, a merge on exit
operation replaces the PI with the current image
– If a write is in progress, the current image
becomes a new PI
• A write-in-progress bit is set if a served block
is held under an exclusive mode PCM lock

Past Image
When an instance receives a current copy of a block for which it already has a PI copy, it
must keep both copies. If the receiving instance then has to serve the block to another
instance, the DLM indicates, along with the transfer request, whether a write is in progress
that would free the PI (that is, whether a later version of the block is being written).
If such a write is not occurring, the instance replaces the old PI with a new PI created from
the current image. This is called a merge on exit because it results in an apparent single
string of redo (from this instance) for the block, terminated by just one BWR when the
block is finally written to disk.
If such a write is occurring, the instance creates a new PI from the current image. It is
possible, given the asynchronous messaging protocol, that an instance may have more than
one PI of a single block. An instance maintains a maximum of two PIs associated with a
given block.
When the current image is served, a write-in-progress bit is set in the block if the block is
held under an exclusive mode PCM lock. This is required to synchronize block writes when
the serving instance held the original local role lock.
Note that a clean block, regardless of its PCM lock status, does not need a PI generated
when it is served to another instance.
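
The serve-time decision for an instance that already holds a PI can be summarized in a
small sketch; the function and flag names are invented for illustration:

def on_serve_with_pi(write_in_progress):
    # The instance is serving its current image and already holds a PI;
    # the DLM's transfer request says whether a covering write is active.
    if write_in_progress:
        # The old PI will be freed by the in-flight write; keep both
        # (an instance holds at most two PIs for a given block).
        return "create a second PI from the current image"
    # Merge on exit: one PI, hence a single string of redo ended by
    # one BWR when the block is finally written to disk.
    return "replace the old PI with one built from the current image"

print(on_serve_with_pi(write_in_progress=False))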



Cache Fusion Block Transfers:
Example Set Up
[Slide graphic: instances A, B, C, and D share a disk holding the example block at SCN 1008; the block's lock master is on instance D.]

Cache Fusion Block Transfers: Example Set Up


The following examples show the messages, lock status changes, and block transfers
involved in moving blocks between disk and memory and in cache fusion transfers of block
images between instances. This slide shows the setup used for these examples.
There are four instances, A, B, C, and D, and a shared disk. For simplicity, the examples
use just one block, which is initially shown on the disk with an SCN of 1008. This particular
block has its PCM lock mastered on instance D throughout the examples.
The examples show the lock statuses as seen by the instances using a three-character
notation, for example, SG0. Interpret this notation as follows:
• The first character indicates the mode
– N for NULL
– S for shared
– X for exclusive.
• The second character indicates the role
– L for local
– G for global
• The third character indicates whether the DLM knows about a PI
– 0 for no
– 1 for yes (second PI block images are not covered in these examples)
Locks not held by an instance are in an NL0 (NULL, local, no PI) status. To simplify the
examples, an NL0 status is only shown when the lock is involved in a transition to or from
another state on a particular instance, for example, NL0→ XL0.
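
The notation can also be made concrete with a small sketch; the class and helper names
below are invented for this illustration:

from dataclasses import dataclass

MODES = {"N": "NULL", "S": "shared", "X": "exclusive"}
ROLES = {"L": "local", "G": "global"}

@dataclass
class LockState:
    mode: str   # N, S, or X
    role: str   # L or G
    pi: int     # 0 = no PI known to the DLM, 1 = one PI known

    @classmethod
    def parse(cls, text):
        mode, role, pi = text[0], text[1], int(text[2])
        assert mode in MODES and role in ROLES and pi in (0, 1)
        return cls(mode, role, pi)

    def __str__(self):
        return f"{self.mode}{self.role}{self.pi}"

# Example 1, step 2, reads as: instance C converts NL0 to SL0.
print(LockState.parse("NL0"), "->", LockState.parse("SL0"))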



Example 1: Read with No Transfer

[Slide graphic, step 1: instance C (NL0) sends a request to obtain a shared lock to the lock master on instance D; the block is on disk at SCN 1008.]

Example 1: Read with No Transfer


In this example, instance C requests the block with a shared lock in order to execute a query
against it. The block is currently not available in the cache of any of the instances.
Step 1
Instance C sends a request for a shared lock to the DLM. This request is sent to instance D
where the lock is mastered.



Example 1: Read with No Transfer

[Slide graphic, step 2: the lock master on instance D grants the request and instance C converts its lock status (NL0 to SL0).]

Step 2
The DLM grants the lock in shared mode with the local role and the mastering instance
sends a message with the grant to instance C. Instance C converts the NULL status on the
lock to shared mode, local role, with no past images (NL0→ SL0).



Example 1: Read with No Transfer

[Slide graphic, step 3: instance C (SL0) sends a read request to disk for the block at SCN 1008.]

Step 3
Instance C initiates the I/O with a read request message to the disk for the block.



Example 1: Read with No Transfer

[Slide graphic, step 4: the block image at SCN 1008 is delivered from disk to instance C (SL0).]

Step 4
The I/O completes in step 4 with the delivery of the block to instance C. Instance C now
holds the block with SCN 1008 using an SL0 lock.



Example 2: Read to Read Transfer

[Slide graphic, step 1: instance B (NL0) sends a request to obtain a shared lock to the lock master on instance D; instance C holds the block at SCN 1008 under an SL0 lock.]

Example 2: Read to Read Transfer


In this example, instance B requests the block with a shared lock in order to execute a query
against it. The block is currently available in the cache of instance C, just as it was at the
end of the previous example (Example 1).
Step 1
Instance B sends a request for a shared lock to the DLM. This request is sent to instance D
where the lock is mastered as in the previous example (Example 1).



Example 2: Read to Read Transfer

[Slide graphic, step 2: the lock master on instance D sends instance C an instruction to transfer the block to B for shared access.]

Step 2
The DLM discovers that the block is being held by instance C under an SL0 lock. The
DLM sends a request to instance C to transfer the block, for shared access, to instance B.



Example 2: Read to Read Transfer

[Slide graphic, step 3: instance C (SL0) sends the block at SCN 1008 to instance B, indicating a shared mode lock on B and C.]

Step 3
Instance C ships a copy of the block to instance B with headers indicating that it is retaining
its lock in SL0 mode and that instance B should take out the same type of lock.
Note: In earlier releases, the read request from instance B would have been granted by the
DLM issuing a shared lock, but instance B would have needed to read the block from disk.



Example 2: Read to Read Transfer

[Slide graphic, step 4: instance B converts its lock (NL0 to SL0) and sends a lock assumption and status message to the lock master on instance D.]

Step 4
Instance B converts the lock to shared mode, local role, with no PI and sends a message to
the DLM (that is, to instance D, where the lock is mastered) to inform it of the newly
converted lock status. This message includes the lock status (SL0) on both of the
instances involved in the process, B and C.
The process would have been slightly different had the required block no longer been
available in the cache of the instance receiving the instruction to send the block. In this
case, the message sent in step 3 would simply have contained the lock information,
informing the receiving instance that it is free to obtain the required lock. After performing
the lock conversion, the receiving instance would have to read the block from disk.
In the example, this would result in instance C dropping the lock and in instance B
performing the disk I/O as shown in the earlier example (Example 1).



Example 3: Read to Write Transfer

[Slide graphic, step 1: instance B (NL0) sends a request to obtain an exclusive lock to the lock master on instance D; instance C holds the block at SCN 1008 under an SL0 lock.]

Example 3: Read to Write Transfer


This example starts with instance C holding an SL0 lock on the block, which is still at SCN
1008. Note that this is the same status at which the previous example, Example 2, started.
Instance B wants the same block as before but, unlike in Example 2, it wishes to change
the block's contents rather than just read them.
Step 1
Instance B sends a request for an exclusive lock to the DLM. This request is sent to
instance D where the lock is mastered as in the previous examples (Examples 1 and 2).



Example 3: Read to Write Transfer

[Slide graphic, step 2: the lock master on instance D sends instance C an instruction to transfer the block to B for exclusive access.]

Step 2
The DLM discovers that the block is being held by instance C and sends a request to that
instance. The request asks instance C to transfer the block, for exclusive access, to instance
B.
In a more complex situation, more than one instance could be holding SL0 locks on the
block for which an instance is requesting an exclusive lock. In such cases, the DLM sends,
to all but one of the holding instances, a message to transfer the block to a null location.
This effectively tells these instances to close their shared locks on the block and to release
the buffers holding it. Once this is done, the last remaining shared lock holder is
equivalent to instance C in this example: it is the only instance holding an SL0 lock on the
requested block. At this point, the actions performed by the DLM and the remaining lock
holder are identical to the steps shown in this example, with instance C playing the role of
the last holder of the requested shared lock.



Example 3: Read to Write Transfer

[Slide graphic, step 3: instance C sends instance B the block and lock status (including C's plan to close its lock) and converts its own lock (SL0 to NL0).]

Step 3
On receipt of the transfer message sent in step 2, instance C does the following:
1. Sends the block to B, as requested, along with an indicator that it is closing its own lock
and supplying an exclusive lock for use by the receiving instance
2. Closes its own lock by converting it to NL0. This also marks the buffer holding the
block image as Consistent Read (CR), identifying it as available for reuse.

Note: In earlier releases, the request from instance B would have been granted by the
DLM issuing an exclusive lock to instance B after forcing instance C to release its shared
lock. However, instance B would then have needed to read the block from disk, and the
copy in instance C would have gone unused.



Example 3: Read to Write Transfer

[Slide graphic, step 4: instance B converts its lock (NL0 to XL0), changes the block to SCN 1009, and sends the lock status on instances B and C to the lock master on instance D.]

Step 4
On receipt of the block message, instance B converts its lock and sends a message to the
DLM. The message includes information about the assumption of lock mode and role
(XL0) on instance B and the closure of the lock on instance C. Instance B can now modify
the block. In this example, the block SCN becomes 1009 following the changes.
The process would have been slightly different had the required block no longer been
available in the cache of the instance receiving the instruction to send the block. In this
case, the message sent in step 3 would simply have contained the lock information,
informing the receiving instance that it is free to obtain the required lock. After performing
the lock conversion, the receiving instance would have to read the block from disk.
In the example, this would result in instance C dropping the lock and in instance B
performing the disk I/O as shown in the earlier example (Example 1).



Example 4: Write to Write Transfer

[Slide graphic, step 1: instance A (NL0) sends a request for the lock in exclusive mode to the lock master on instance D; instance B holds the block at SCN 1009 under an XL0 lock; the disk copy is at SCN 1008.]

Example 4: Write to Write Transfer


This example (Example 4), begins where the previous example (Example 3) ended.
Instance B currently holds an exclusive lock, in the local role, for the block being used in
these examples. The block is currently in the cache of instance B and it is at SCN 1009.
The copy of the block on disk is still at SCN 1008.
Instance C does not participate in this example. Further, the copy of the block that could
still be in its memory from previous examples is not shown. This is because its buffer is
marked CR, that is, the buffer is available for reuse. The lock for this instance was dropped
in the previous example (Example 3).
This example begins when instance A requests an exclusive lock on the block so that it can
modify its contents.
Step 1
Instance A sends a request to the DLM, at instance D where the required lock is mastered,
for an exclusive lock on the block.



Example 4: Write to Write Transfer

[Slide graphic, step 2: the lock master on instance D sends instance B an instruction to transfer the exclusive lock.]

Step 2
The DLM tells instance B to give up the block to satisfy the request from instance A for an
exclusive lock. This message will be sent immediately if the DLM has completed recording
the lock transactions from the previous example (Example 3). If these transactions are not
complete, the request from instance A will be queued until the DLM can process the
request.



Example 4: Write to Write Transfer

[Slide graphic, step 3: instance B converts its lock (XL0 to NG1), keeps a past image at SCN 1009, and sends an exclusive-keep copy of the buffer to instance A.]

Step 3
Instance B completes its work on the block when it receives the message to transfer the
block to instance A. This involves
• Logging any changes to the block and forcing a log flush if this has not already
occurred.
• Converting its lock to NG1, indicating that the buffer now contains a PI, that is, a
history, level 1, copy of the block.
• Sending an exclusive-keep copy of the block buffer to instance A. This includes the
block image at SCN 1009, information that instance B is holding a past image of
the block, and notification that the exclusive lock is available in global mode.
If there had been no changes to the block’s contents when the message was received by
instance B, the instance would simply send the block image to instance A and close its lock
(XL0→ NL0). This would allow the receiving instance to assume the exclusive lock in local
mode, just as in the read to write transfer shown in the previous example (Example 3).



Example 4: Write to Write Transfer

[Slide graphic, step 4: instance A converts its lock (NL0 to XG0), sends the lock assumption information to the lock master on instance D, and changes the block to SCN 1013; instance B keeps its PI at SCN 1009 (NG1).]

Step 4
After instance A has received the block with the lock dispositions, it sends a lock
assumption message, including the lock information from instance B, to the DLM (in this
case, the mastering instance D). This tells the DLM that instance A holds the lock with an
XG0 status and that instance B, the previous holder of the exclusive lock, is now a PI
holder of version 1009. Instance A is able to obtain the block SCN from the copy sent by
instance B because that copy contains all of the changes made by instance B.
Once this is done, instance A can modify the block. In the example, the modification
converts the block to SCN 1013. Note that because it no longer has an exclusive lock,
instance B cannot make any further changes to the block even though it is required to
maintain a PI copy in the buffer cache.
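
The whole exchange can be restated as a compact trace in the lesson's notation; the script
below simply prints the steps and is illustrative only:

steps = [
    ("1", "A -> D", "request exclusive lock (disk copy at SCN 1008)"),
    ("2", "D -> B", "instruction to give up the block to instance A"),
    ("3", "B",      "flush redo; convert XL0 to NG1; keep PI at SCN 1009"),
    ("3", "B -> A", "exclusive-keep copy of the buffer at SCN 1009"),
    ("4", "A -> D", "lock assumption: A now XG0, B a PI holder of 1009"),
    ("4", "A",      "modify the block; its SCN becomes 1013"),
]
for number, actor, action in steps:
    print(f"step {number}: {actor:7} {action}")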



Example 5: Write to Read Transfer

[Slide graphic, step 1: instance C sends a request for the lock in shared mode to the lock master on instance D; instance A holds the current block at SCN 1013 (XG0) and instance B a PI at SCN 1009 (NG1); the disk copy is at SCN 1008.]

Example 5: Write to Read Transfer


Example 5 continues where the previous example (Example 4) ended. That is, the block
image on disk is still at the original SCN, 1008. There is a PI copy of it in instance B at
SCN 1009 and the current copy, at SCN 1013, is in instance A, held with an exclusive
mode lock. The existence of the current and past images in two different instances requires
the PCM locks associated with the block to be held in the global role.
In this example, the block being held with an exclusive lock in instance A is requested by
instance C for a query. Note that the Consistent Read Server in Oracle8i would handle this
request by having instance A build a read consistent image of the block, based on the SCN
current at the start of the query on instance C. The read consistent block image would then
be shipped across the cluster interconnect from instance A to instance C. This example
shows how the request is handled with the Cache Fusion algorithm.
Step 1
The first step is the request by instance C to the DLM for the necessary shared lock. As
before, this request is directed to instance D where the lock is mastered.



Example 5: Write to Read Transfer

[Slide graphic, step 2: the lock master on instance D sends instance A an instruction to transfer a shared lock to C; instance C is at NL0.]

Step 2
The DLM instructs instance A to transfer a shared lock to satisfy the request from instance
C. As before, this message is sent immediately if the DLM has no lock transactions in
progress; otherwise the request is queued.



Example 5: Write to Read Transfer

[Slide graphic, step 3: instance A converts its lock (XG0 to SG1) and sends a shared-keep copy of the buffer at SCN 1013 to instance C (NL0).]

Step 3
On receipt of the message to transfer the block, instance A completes its work on the block
and sends a copy of the block image to instance C. As in the previous example (Example
4), this may involve logging changes and flushing the log buffer on instance A before
sending the block. In this case, an exclusive lock is not needed by the receiving instance, so
instance A downgrades its lock to a shared lock, but keeps its global role in order to
preserve the past image of the block.
After this is done, the instance sends a shared-keep copy of the block to instance C. As well
as the current block contents, the message identifies the type of locks at each end of
the transfer: shared, global, with a PI in instance A and shared, global, without a PI in instance C.



Example 5: Write to Read Transfer

[Slide graphic, step 4: instance C converts its lock (NL0 to SG0) and sends the lock assumption information to the lock master on instance D; instance A keeps the block at SCN 1013 under SG1 and instance B its PI at SCN 1009 under NG1.]

Step 4
Instance C extracts the SCN from the block it received from instance A and constructs a
lock assumption message for the DLM. This contains sufficient information for the DLM to
record the new status of the lock on each of the instances, along with the SCN of the PI on
instance A. Instance C sends the completed message to instance D, the instance mastering
the lock for the DLM.
Note that, at the end of the transfer, instance A has the most recent PI for the block, and
the lock on instance C is held in the global role because of the dirty PI block image still
held in instance A's buffer.



Example 6: Writing Dirty Blocks

[Slide graphic, step 1: instance B (NG1, PI at SCN 1009) sends a request to write the block at SCN 1009 or greater to the lock master on instance D; instance A holds the current block at SCN 1013 under XG1; the disk copy is at SCN 1008.]

Example 6: Writing Dirty Blocks


This example shows how a block write may occur when multiple buffers contain different
images of the block. When this example begins, the current state of the database is similar,
but not identical, to the way it looked at the end of Example 4. Specifically:
• The block image on disk is still at the original SCN, 1008.
• Instance A contains the most recent copy of the block, SCN 1013, in a past image
buffer using an XG1 lock (exclusive mode, global role, PI known to the DLM).
• Instance B contains an older past image of the block, at SCN 1009, using an NG1
lock (no PCM mode, global role, PI known to the DLM).
• The block is mastered by the DLM on instance D.
• To simplify the graphics, instance C, which is not involved in this example, is not
shown on the slides.
The scenario covered by this example is precipitated when instance B determines it is
necessary to write the block. This could be caused by a checkpoint request, for example,
one resulting from its redo thread performing a log switch.
Step 1
Instance B sends a write request to the instance mastering the lock (instance D) with the
necessary SCN. Instance B remembers the write that it has requested. The DLM marks all
existing PI holders as needing notification.
The DLM also selects the node to perform the actual write: either the current node or the
latest holding node for the requested write. In the example, instance A is the most recent
node to hold a lock on the block, so instance A is selected as the node to perform the write.
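
The selection rule can be sketched as follows; the `holders` map and the function name are
invented, assuming only that the DLM knows the SCN of each instance's image:

def select_writer(holders, requested_scn):
    # Any image at or above requested_scn satisfies the request; the
    # master picks the instance holding the most recent image.
    candidates = {inst: scn for inst, scn in holders.items()
                  if scn >= requested_scn}
    return max(candidates, key=candidates.get)

holders = {"A": 1013, "B": 1009}   # A: current copy (XG1), B: PI (NG1)
print(select_writer(holders, requested_scn=1009))   # -> A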



Example 6: Writing Dirty Blocks

[Slide graphic, step 2: the lock master on instance D forwards the request to write at SCN 1009 or greater to instance A (XG1).]

Step 2
The DLM mastering node, instance D, sends a write request to the instance selected in step
1, instance A. The master also remembers that a write at the requested SCN is outstanding,
and does not allow another write to be requested until the current one is satisfied.



Example 6: Writing Dirty Blocks

[Slide graphic, steps 3 and 4: instance A writes the block at SCN 1013 to disk, and the write notification is returned to instance A.]

Step 3
Instance A initiates the I/O with a write of the block to disk.
Step 4
The I/O completes with a notify message back to instance A.
Having received the completion notification, instance A will log the completion and the
version written with a BWR and advance its checkpoint, but not force the log.



Example 6: Writing Dirty Blocks

[Slide graphic, step 5: instance A sends a write notification to the lock master on instance D and converts its lock (XG1 to XL0); the disk copy is now at SCN 1013.]

Step 5
Instance A sends a notification to the DLM master node (instance D). The notification also
includes an assertion that the lock is converting to the local role because instance A wrote
the current block image.
Note: The order in which the two operations (writing the BWR mentioned in step 4 and
sending the notification message) are performed is not critical; they can be done in parallel
and in any order.



Example 6: Writing Dirty Blocks

[Slide graphic, step 6: the lock master on instance D sends a flush PI instruction to instance B, which converts its lock (NG1 to NL0) and releases its PI buffer.]

Step 6
On receipt of the write notification, the DLM master (instance D) sends each instance
holding a PI, which it recorded earlier, an instruction to flush the PI. It also sends a
notification, potentially redundant, to the current holder of the X mode lock (which may
have moved), even if it has no PIs. If no PIs remain, the instance holding the current X lock
is told to go to local role, and the flush to this instance will set a go-local flag. This will be
redundant if the current X holder did the write.
In the example, B is the only instance, other than the writing instance (instance A), that
holds a PI. When instance B receives the flush instruction from the DLM, it logs a BWR
recording that the block has been written, without flushing the log buffer. Instance B also
releases the block buffer and clears the record it kept of the write it initiated.
At the completion of this step, A holds the buffer under an XL0 lock, and all other past
images have been purged.



Example 7: Read Transfer After Buffer Write

[Slide graphic, step 1: instance C (NL0) sends a shared lock request to the DLM lock master on instance D; instance A holds the block at SCN 1013 under an XL0 lock; the disk copy is also at SCN 1013.]

Example 7: Read Transfer After Buffer Write


This example repeats the case used in Example 5, but starting with the status at the end
of the previous example (Example 6) instead of Example 4. That is, the current copy of the
block is held with a local role, exclusive lock on instance A, which has just flushed this
copy to disk. The other instances have freed any buffers containing this block and have no
locks on it. Note that instance C has been included in the graphic again and the NL0 lock
on instance B is no longer included on the slide because instance B is not used in this
example.
The process starts when instance C requests a readable copy of the block.
Step 1
The requesting instance, C, sends a request for a shared lock to the DLM. As before, the
block’s lock is mastered on instance D.



Example 7: Read Transfer After Buffer Write

[Slide graphic, step 2: the lock master on instance D passes the shared transfer command to the lock holder, instance A (XL0).]

Step 2
The DLM forwards the request for the shared lock to the current holder, instance A, as a
command for a shared transfer.



Example 7: Read Transfer After Buffer Write

[Slide graphic, step 3: instance A converts its lock (XL0 to SL0) and sends a shared-keep copy of the buffer at SCN 1013 to the requester, instance C (NL0 to SL0).]

Step 3
The sending instance, A, has to convert its exclusive lock to shared. After converting its
lock, instance A sends the block to the requesting instance, C, with a shared-keep lock
status. Because the block is globally clean, the lock mode can remain local on both
instances.



Example 7: Read Transfer After Buffer Write

[Slide graphic, step 4: instance C sends a shared-keep message to the DLM lock master on instance D; instances A and C both hold the block at SCN 1013 under SL0 locks.]

Step 4
Instance C sends a shared-keep message to inform the DLM that the sender and recipient
instances (A and C) now both hold the lock as shared local. Once again, the DLM master
for the lock is on instance D.



Lock Holder Responsibilities

                                NG1  SL0  SG0  SG1             XL0  XG0  XG1
May read disk if not in cache   no   yes  no   no              no   no   no
May age out                     no   yes  yes  S: yes, PI: no  no   no   no
Can write current               no   no   yes  yes             yes  yes  yes
Can write PI                    yes  no   no   yes             no   no   yes
Gets notified on write          yes  no   yes  yes             yes  yes  yes
Serves PI                       yes  no   no   yes             no   no   yes
Serves for recovery             yes  no   yes  yes             yes  yes  yes
Can modify current              no   no   no   no              yes  yes  yes

Lock Holder Responsibilities


The table shows some of the properties of each possible combination of PCM lock mode,
role, and PI. The column headings, showing the lock states, use the same notation as
the locks shown in Examples 1 through 7:
• The first character indicates the mode
– N for NULL
– S for shared
– X for exclusive.
• The second character indicates the role
– L for local
– G for global
• The third character indicates whether the DLM knows about a PI
– 0 for no
– 1 for yes (second PI block images are not covered in these examples)
Locks not held by an instance are in an NL0 (NULL, local, no PI) status; this state is not
shown as a column in the table.
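
For reference, the table can also be encoded for programmatic lookup; a transcription of
the slide using invented names, with the split SG1 entry for aging out kept as a special
value:

STATES = ("NG1", "SL0", "SG0", "SG1", "XL0", "XG0", "XG1")
TABLE = {
    "may read disk if not in cache": (0, 1, 0, 0, 0, 0, 0),
    "may age out":                   (0, 1, 1, None, 0, 0, 0),  # SG1: S yes, PI no
    "can write current":             (0, 0, 1, 1, 1, 1, 1),
    "can write PI":                  (1, 0, 0, 1, 0, 0, 1),
    "gets notified on write":        (1, 0, 1, 1, 1, 1, 1),
    "serves PI":                     (1, 0, 0, 1, 0, 0, 1),
    "serves for recovery":           (1, 0, 1, 1, 1, 1, 1),
    "can modify current":            (0, 0, 0, 0, 1, 1, 1),
}

def allowed(action, state):
    value = TABLE[action][STATES.index(state)]
    return "S: yes, PI: no" if value is None else bool(value)

print(allowed("serves PI", "SG1"))     # True
print(allowed("may age out", "SG1"))   # S: yes, PI: no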



Special Considerations

• Consistent Read (CR) request handling


• Lost block
– No acknowledgment message when block
received
– Reliable side channel sends separate message
on systems with unreliable primary channel
• Transfer race

Consistent Read (CR) Request Handling


A CR request looks just like a lock operation in its message flow, except that a “CR
request” message is sent to the selected target instead of a Transfer message. When the
requester receives the block, and if it was not a “Best Mode” request that granted a lock,
then the requesting instance calls Cancel to clear the state in the master. If the requesting
instance successfully obtained a lock, then the instance needs to send the received state (the
sending disposition is unchanged) to the DLM. This is done by calling the Assume function.
Lost Block
In some transports, block transfer is not reliable. The high-level protocol requires that block
transfers be reliable, but it does not pay the latency penalty of a protocol with
acknowledgments in the normal case. In environments where this is an issue, the DLM uses
a reliable mechanism, called the reliable side channel, to indicate that a block has been
sent. Even when a reliable side channel is used, the block will usually arrive first,
making the reliable message redundant. If the reliable message arrives first, the DLM
assumes the block was lost and goes into a cancel-retry operation internal to the DLM. The
mechanism of the separate channels is hidden from the higher level code, and is private to
the DLM messaging system and its OSDs.
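
The arrival-order logic can be sketched in a few lines; the function and argument names
are invented for this illustration:

def on_reliable_side_channel(block_already_arrived, cancel_retry):
    # The side channel reliably reports that a block was sent.
    if block_already_arrived:
        return "block beat the message: side channel is redundant"
    cancel_retry()   # DLM-internal; hidden from higher level code
    return "block presumed lost: cancel-retry started"

print(on_reliable_side_channel(False, lambda: print("cancel-retry issued")))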
Transfer Race
A normal race condition occurs when an instance has aged a block out of its cache while a
request to transfer the block is on its way from the DLM. Although the instance had sent a
message to the DLM informing it of the lock downgrade (or close), that message did not
reach the master before the transfer request was sent.
Because of this and other types of races, the DLM keeps the state of the request in
progress and knows that it is waiting for one of the following:
• A completed call to the Assume function by the requester
• A spontaneous lock down convert or close from the current holder
• A cancel/retry from the requester.
In the event of the spontaneous lock down convert or close from the instance to which it
has sent a transfer, the DLM must reprocess the request in light of the changed state.



Special Considerations

• Cancellation
– Control-C from user
– Time out on lock request
– Process failure
• Instance holding XL lock is transferred while
writing
• Write race
• Write and notification errors
• Messages out of order

There are other situations that can cause problems but are anticipated by the cache
fusion code. The main categories of these problems include:
• Cancellation of a pending request caused by
– Control-C sent by a user
– Expiration of a timeout on a lock request
– Failure of a foreground process
• The DLM sending a transfer request to an instance holding an exclusive, local lock
while that instance is writing the requested block
• The DLM sending an instruction to write a block that is no longer available in the
selected instance because of a change that was signaled but not received by the
DLM in time
• Messages reaching the DLM out of the logical order in which they were sent
• Various failures that could prevent a write, or a notification of a write, from being
properly completed or recorded



DLM Changes for Cache Fusion

• Tighter integration between DLM and buffer cache


• Past images and the global role
• Hash locking and DBA locking
• Page contents during transfer
– The current SCN of the block
– The disposition of the lock on the sender
– The SCN for the Lamport clock

Past Images and the Global Role


Each block past image (PI) is tagged with the system commit number (SCN). The SCN is
guaranteed to be later than the latest modification performed on this block and earlier than
any modification performed by the next instance. (The Lamport scheme ensures that this is
possible.) When a write completes, the writer notifies the master of the write completion
and the SCN of the block written. The master sends flush instructions to all relevant PI
holders; the relevant PIs are those whose SCN is earlier than that of the copy written to
disk. Since the PI SCN can change while the down convert is in flight, the down convert
message carries the SCN of the disk copy.
Hash (1:N) Locking and DBA (1:1) Locking
Fusion is invoked at the individual lock level and assumes 1:1 (DBA) locking, meaning
that each PCM lock covers exactly one database block; a 1:N (hash) lock covers many
blocks. The first lock on a resource is always acquired in the local role. When a request for
a block comes to a holder, the holder checks whether the lock is a 1:N lock or a 1:1 lock. If
it is a 1:N lock, the block is not shipped. Further, if the block is dirty, it is written to disk
before the lock is converted, as in earlier releases. Since the block is then clean when read
by the requester, the requester gets the lock in the local role.
The fusion code path is taken only when the lock is a 1:1 lock.
Page Contents During Transfer
Fused block transfers carry the data in the block, as well as the data necessary to complete
the lock transaction. This includes:
• The SCN of the block, that is, its CR status
• The disposition of the lock on the sender, for example, whether it down converted,
changed modes, closed, or kept a PI
• The SCN for the Lamport clock, needed because blocks sent for fusion processing
function like the transaction messages that also carry SCNs under the Lamport
scheme
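
The payload can be pictured as a simple record; the field names below are invented for
illustration and are not Oracle structures:

from dataclasses import dataclass

@dataclass
class FusedBlockTransfer:
    block_image: bytes       # the data in the block itself
    block_scn: int           # SCN of the image, that is, its CR status
    sender_disposition: str  # e.g. "down converted to SG1, kept a PI"
    lamport_scn: int         # advances the receiver's Lamport clock

msg = FusedBlockTransfer(block_image=b"...", block_scn=1013,
                         sender_disposition="down converted to SG1, kept a PI",
                         lamport_scn=1013)
print(msg.block_scn, msg.sender_disposition)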



Summary

In this lesson, you should have learned how to:


• Describe the benefits of cache fusion
• Differentiate lock modes and roles
• Explain how cache fusion transfers locks and
blocks in different scenarios
• Describe the basic DLM changes to support cache
fusion



Database Recovery for Cache Fusion



Objectives

After this lesson, you should be able to:


• Explain how dynamic lock remastering improves
database availability
• Identify tablespaces with remastered locks
• Recognize the implications of instance recovery
with cache fusion
• Describe the steps taken in two-pass recovery with
Oracle9i Real Application Clusters
• Identify lock and buffer requirements necessary for
instance recovery
• Differentiate recovery requirements following
single and multiple instance failures



Real Application Clusters:
High Availability Enhancements
• Real Application Clusters is Oracle’s premier high
availability solution
• Work from failed instances can be redistributed to
other nodes in the cluster
• New features in Oracle9i specific to Real
Application Clusters make recovery faster and
provide higher levels of availability
– Faster detection and reconfiguration
– Dynamic lock remastering

Real Application Clusters: High Availability Enhancements


In the past, Oracle has provided multi-instance database capability for scalability
and high availability. Cache Fusion, introduced in Oracle9i Real Application Clusters,
improves scalability. (Cache Fusion is covered in Lesson 1 of this course.) Other
features introduced in Oracle9i address availability improvements in Real Application
Clusters. These changes are covered in the remainder of this lesson.



Faster Detection and Reconfiguration

• The Real Application Cluster can now detect and address network problems on its own
• This reduces downtime due to dependencies at the
operating system level
• When a cluster problem arises:
– The Real Application Cluster determines which
members are affected
– It quickly reconfigures itself to circumvent the
problem

Faster Detection and Reconfiguration


In previous releases of Oracle multi-instance databases, failures in the cluster had to be
detected by the operating system and reported to Oracle. With Oracle9i, cluster aware code
is included in the database software. This allows the Oracle instances to detect cluster
problems without having to coordinate with the operating system’s cluster management
software.
Once a problem is detected within a cluster supporting an Oracle9i Real Application
Cluster, the instances communicate among themselves to determine the status. They decide
which instances are going to remain active members of the database and reconfigure the
database resources around them.



Fast Real Application Clusters
Reconfiguration
• Oracle9i Real Application Clusters implement a
disk-based heartbeat and a voting procedure
• Each member gives its impression of the other
members' availability
• One member arbitrates among the different
membership configurations
• The chosen configuration is published
• All members then examine the published
configuration, and if necessary, terminate their
sessions



Dynamic Lock Remastering

• Dynamic remastering allows lock masters to be changed without complete reconfiguration
• During processing the DLM may use dynamic
remastering to move locks to the most active
instance
• This optimization occurs in the background while
users are accessing the system
• Should an instance leave the group, the DLM only
remasters resources mastered by the departing
member
• Similarly, when a new instance joins the group, the
DLM gradually remasters locks, adapting to cluster
workload

Dynamic Lock Remastering


In the past, all DLM resources were evenly distributed among the available instances.
When a node transition took place, when an instance either joined or left the group, a
complete resource redistribution occurred. This, of course, could take considerable time
depending on the amount of resources being mastered and the frequency of node transitions
in the system. While the remastering occurred, the DLM was frozen and no new lock
requests could be processed.
Oracle9i Real Application Clusters provide dynamic lock remastering. With this feature,
only the minimum number of resources are remastered while the lock database is frozen.
Most of the redistribution work is done online. There is no need to remaster all the locks
anymore. This allows new lock requests to be processed much sooner after a node
transition than in previous releases.



Dynamic Lock Remastering
Instance Leaving
• Locks are now hashed into a constant number of
buckets
• The number of active instances is irrelevant
• Only resources with a master hash value mapped
to the departing instance have to be remastered
• Other resources remain unchanged

Dynamic Lock Remastering— Instance Leaving


Previously, the hash function that locates a resource master used the number of available
instances. Now, irrespective of the number of active instances, resources are hashed to a
range of values from 1 to M. M is a multiple of the maximum number of instances as
determined by the parameter PARALLEL_SERVER_INSTANCES. Only resources with a
master hash value mapped to the departing instance have to be remastered.
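For example, with PARALLEL_SERVER_INSTANCES set to 4 and, say, a multiple of 8, resources would hash into M = 32 buckets. If one of three active instances departs, only the buckets currently mapped to that instance, roughly a third of the 32, are remapped to the survivors; all other buckets, and the resources hashing to them, keep their current masters. (The exact multiple used is internal; the figures here are purely illustrative.)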

Real Application Clusters Features 2-7


Dynamic Lock Remastering
Instance Joining
• Previously, a complete redistribution of resources
took place whenever a new instance started up
• Now, only a portion of the locks are distributed
directly from each instance to the joining instance

Dynamic Lock Remastering: Instance Joining


In previous releases, the DLM resources were assigned to a mastering instance using a hash
value derived from the resource’s identification number and the number of active instances.
When a new instance joined the database, the locks had to be remastered based on the new
value for the number of active instances. This involved remastering, and probable
relocation, of every DLM resource.
In Oracle9i, DLM resources are assigned to the new instance from the active instances
without having to remaster any of the other resources.

Real Application Clusters Features 2-8


Example

[Slide diagram: N1 masters R1 and R4; N2 masters R2 and R5; N3 masters R3 and R6.]

Resource Hashed Value Master Node Open Locks


R1 HV1 N1 N1, N2, N3
R2 HV2 N2 N1, N2
R3 HV3 N3 N3
R4 HV4 N1 N1, N3
R5 HV5 N2 N3
R6 HV6 N3 N1, N2, N3

Example
This slide shows an example of a Real Application Cluster environment consisting of three
instances, one per node. There are six open resources, which could be PCM locks on data
blocks, for example. These resources are hashed to six different hash values, and these
values are then evenly mapped to the three instances.
The open locks column represents the locks that each instance has on each resource.

Real Application Clusters Features 2-9


Example

[Slide diagram: after N2 fails, N1 masters R1, R4, and R2; N3 masters R3, R6, and R5.]
Resource Hashed Value Master Node Open Locks
R1 HV1 N1 N1, N3
R2 HV2 N2 -> N1 N1
R3 HV3 N3 N3
R4 HV4 N1 N1, N3
R5 HV5 N2 -> N3 N3
R6 HV6 N3 N1, N3

Example
If the instance in N2 crashes, the values HV2 and HV5 need to be remapped. Resources R2
and R5 need new master nodes. During reconfiguration, the DLM will map HV2 and HV5 to
N1 and N3 respectively. Hence, R2 will be mastered at N1 and R5 at N3.
The slide shows the hashed values and master mapping after the reconfiguration. It also shows
how the locks from the lost instance (N2) are cleared by the distributed lock manager.

Real Application Clusters Features 2-10


Instance Transition and Recovery Domains

• Previously, any kind of shutdown caused the loss


of master resource information residing in the
departing instance
• During reconfiguration, this information needed to
be completely rebuilt
• An instance shutdown (excluding abort) should not
cause any loss of master resource information
• The recovery domain should remain valid

Instance Transition and Recovery Domains


In previous releases, when an instance departs the cluster, the DLM resources associated with
that instance had to be rebuilt and remastered. This process occurred as part of the remastering
of all of the locks based on the hashing algorithm described earlier.

Real Application Clusters Features 2-11


Instance Transition and Recovery Domains

• In Oracle9i, when an instance does a shutdown


normal, the DLM performs the following steps:
1. Close all locks owned by the departing
instance
2. Re-master all the resources currently mastered
at the departing instance
3. Deregister from the cluster group service
(CGS) which triggers reconfiguration
• This will reduce reconfiguration time

Instance Transition and Recovery Domains


In Oracle9i, the DLM resources being mastered by a departing instance are redistributed to the
remaining instances before the information is released by the departing instance. Thus the
DLM does not have to reconstruct the resource information before remastering the resources,
reducing the time to complete the node transition. Of course, if the instance does not complete
a normal shutdown, the DLM resource information is lost and has to be reconstructed as
before.
Note: In this discussion, the term shutdown normal implies the NORMAL, IMMEDIATE,
and TRANSACTIONAL shutdown options.

Real Application Clusters Features 2-12


Interaction between DLM Reconfiguration and
Cache Recovery
• Issue
– In Oracle8i, instance recovery could only be
started after DLM reconfiguration was complete
– The DLM had to restore the whole lock database
before crash recovery could begin
• Solution
– In Oracle9i, DLM reconfiguration and instance
recovery proceed in parallel
– This reduces overall recovery time

Interaction between DLM Reconfiguration and Cache Recovery


The DLM divides the locks into two groups: enqueue resources and buffer cache
resources. At reconfiguration, the DLM first tries to rebuild the enqueue lock group. During
this period of time, opening new resources and lock operations on resources without a master
are temporarily halted. These resources include both enqueue and buffer cache resources. The
recovering SMON may be able to get an X lock on the IR (instance recovery) enqueue at this
point, depending on whether recovery is needed for the IR enqueue.
When the DLM has finished recovering the enqueue lock group, the DLM on a particular node
(the node with the lowest ID) posts SMON to begin recovery. The recovering SMON can get the
IR enqueue and start the first-pass redo log read. At the same time, the DLM continues to
recover the buffer cache lock group.
If SMON finishes reading the redo log and building the recovery set before the DLM finishes
recovering cache locks, SMON has to wait. Otherwise, SMON can start claiming locks for
buffers it needs to recover. During this time, all cache lock operations are temporarily frozen.
Once SMON finishes claiming all necessary locks, it notifies the DLM that the lock claim is
done and the recovery domain is then validated, provided there is no additional
reconfiguration. All lock operations are then resumed and reconfiguration completes.

Real Application Clusters Features 2-13


Automatic Lock Re-Mastering

• Tablespace access by instance is tracked


automatically
• Tablespaces used by only one instance are flagged
• PCM locks associated with the data files belonging
to such tablespaces are re-mastered on the
instance using them

Automatic Lock Remastering


An internal mechanism, introduced in Oracle9i, automatically determines if a tablespace is
being accessed by only one instance. Once this is determined, the lock masters for that
tablespace’s files are lazily moved to that instance. This reduces the overhead for that
instance to open these locks because no messages need be sent to the Distributed Lock
Manager on another node.
It is relatively easy to determine if a tablespace is being accessed by only one instance, and
moving lock masters from various nodes in a cluster to a single instance requires little
overhead. As a result, automatic re-mastering is robust and transparent.
The new dynamic performance view, V$DBA_TABLESPACES, indicates which
tablespaces have been selected for lock re-mastering through this algorithm. Such
tablespaces are good candidates for less granular hash locks to reduce the lock re-mastering
overhead even further.
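For example, assuming the view is visible to your administrative account, a simple query such as the following lists the tablespaces selected by the algorithm (the available columns may vary by release):
SQL> SELECT * FROM V$DBA_TABLESPACES;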

Real Application Clusters Features 2-14


Recovery Issues with Cache Fusion

• Crash and instance recovery


– Multiple threads of redo are merged
– Only redo from failed threads is included
• Online block recovery: predecessor blocks may be
– in a past image block in a different instance
– on disk
• Media recovery remains unaffected by cache
fusion

Recovery Issues with Cache Fusion


Cache Fusion addresses the hard ping problem in pre-Oracle9i multiple-instance databases,
where Oracle instances could only communicate data block contents via the shared disk
subsystem, that is by writing to disk and reading back. In a Fusion environment, block
contents are shipped between instance caches over the cluster interconnect.
Recovery mechanisms can no longer depend on the shared disk concept to obtain the most
current image of a block. The most current image may be in a Past Image (PI) buffer of a
surviving instance since PI buffers allow instances to hold different versions of a dirty
block.
Crash and Instance Recovery
A crash is defined as the failure of all instances accessing a database. Typically, instance
failure in an Oracle9i Real Application Cluster database involves only one instance,
although the same recovery issues apply when more than one instance has failed. During crash
recovery, there are no surviving instances. This is not the case with instance recovery where
other instances are still active and the needs of active sessions must be considered.
Redo thread recovery of each failed instance is no longer independent because a single data
block may have been dirty in more than one instance. As a result, recovery processing may
need to merge redo streams from multiple threads to recover a block. Cache Fusion’s use of
PI buffers guarantees that only the redo threads of failed instances need to be merged when
doing crash or instance recovery.
Online Block Recovery
When a data buffer becomes corrupt in an instance’s cache, PMON or the foreground
attempts online block recovery. This involves finding the block’s predecessor and applying
to it redo records from the online logs of the local (corrupted) instance. In a Fusion system,
the predecessor is the most recent Past Image for that buffer that exists in another instance’s
cache. When there is no PI for the corrupted buffer, we revert to the traditional mechanism
of reading in the disk data block (as predecessor) before applying changes from the online
redo logs.
Media Recovery
Cache Fusion does not impact the mechanisms used for media recovery.

Real Application Clusters Features 2-15


Traditional and Cache Fusion Issues

Traditional Assumptions Cache Fusion


Assumptions
The starting point for The starting point for
recovery is always the on- recovery may be the on-
disk block image disk block image or the
most recent PI version in
the cache of a surviving
instance
Only changes from a Redo threads of all failed
single redo thread are instances are merged
applied at a time to the
disk version

Traditional and Cache Fusion Assumptions


In a pre-Cache Fusion Oracle system, when a buffer modified by an instance A was requested
by another instance B, A had to write its dirty buffer to disk before B could read it. This
disk-based coherency mechanism allowed instance/crash and online block recovery to assume
that only redo changes from the last instance that modified a block would need to be applied
to the on-disk copy. The two assumptions made in recovering a data block when an instance
died or the block was corrupted in memory, with no permanent media loss, were:
1. The on-disk version of a block was always the starting point for recovery.
2. Only changes from a single redo thread needed to be applied to the disk version.
Both these assumptions are invalid in a Cache Fusion environment. In the example above,
instance A will directly ship the contents of its current buffer to instance B after doing a log
force, but without writing the block to disk. A's buffer becomes a PI and cannot be modified
further. B now has the current buffer and is able to modify it. The on-disk version of the
block does not contain the changes made by instance A or B, so the block is dirty in both
caches. Fusion requires the two assumptions made above to be restated as:
1. The starting point for recovery of a block is its most recent PI version, held in some
instance's cache. The on-disk version is only used if no PI is available for the block.
2. Redo threads of all failed instances need to be merged for instance/crash recovery.
Cache Fusion does not affect media recovery, which starts at the restored backup and applies
changes from the merged redo threads of all instances in the Oracle9i Real Application
Cluster.

Real Application Clusters Features 2-16


Recovery in Oracle9i
[Slide diagram: the Oracle9i recovery timeline]
1. Instance dies
2. Failure detected
3. Enqueue reconfiguration (total freeze, then enqueue thaw)
4. PCM reconfiguration (write thaw, PCM release thaw)
5. Pass 1 recovery
6. Locks claimed for recovery
7. Claims done, PCM locks thaw
8. Partially available
9. Pass 2 recovery
10. Individual block availability
11. Recovery enqueue released

Recovery in Oracle9i
The recovery path in Oracle9i involves the following steps:
1. The instance, or instances, dies.
2. The failure is detected by the cluster manager or cluster group services.
3. Parallel Cache Management (PCM) locks are frozen for a time, as are write requests.
Enqueue locks are reconfigured quickly and become available.
4. The DLM commences its recovery and remastering of the PCM locks, which involves
rebuilding, on surviving instances, the lock masters lost due to the instances failures.
When this is complete, pending activities are processed after which PCM lock releases
and down converts are allowed.
5. At the same time, the recovery code grabs the enqueue lock, and does its first pass
recovery read of the log. It identifies the locks of the blocks that need to be recovered.
6. On completion of pass 1 and the DLM reconfiguration, recovery continues by
• obtaining buffer space for the recovery set, possibly doing writes to make room;
• claiming locks identified by pass 1;
• obtaining a source buffer, either from an instance’s buffer cache or by a disk read.
7. After the necessary locks are obtained, and the recovering instance has all the resources
it needs to complete pass 2 with no further intervention, the PCM lock space is
unfrozen.
8. The system is partially available, as blocks not in recovery may be operated on as
before. Blocks being recovered will be blocked by the locks held in the recovering
instance.
9. The cache marches through the second phase of its recovery, taking care of all blocks
identified in pass 1, recovering and writing each block, then releasing recovery locks.
10. Blocks become individually available as they are recovered, not all at once.
11. When all the blocks have been recovered, written, and recovery locks released, the
system is completely available, and the recovery enqueue is released.

Real Application Clusters Features 2-17


Overview of Fusion Lock States

Valid lock states


Lock Mode    Valid (Lock Role, PI Count) Combinations
X L0, G0, G1
S L0, G0, G1
N G1

Overview of Fusion Lock States


The first two phases of instance recovery involve identifying and locking the blocks that
need recovery. Following this, the DLM can allow lock operations to proceed on resources
covering all other blocks, even before redo application begins. A review of the new DLM
lock states introduced for Cache Fusion is presented here.
A PCM Fusion lock has 3 dimensions: lock mode, lock role and past-image count.
Together these dimensions are used to maintain cache coherency in a Fusion environment.
The set of lock modes remains unchanged: Exclusive (X), Shared (S) and Null (N). Lock
roles describe local or global interest in the resource. The past-image count indicates the
number of PI buffers maintained under the lock. The set of valid lock states is a subset of
the total combination space: XL0, XG0, XG1, SL0, SG0, SG1, and NG1.
• Null (N): no examine or modify rights.
• Share (S): may examine block.
• Exclusive (X): may modify and create new version of the block.
• Local (L) : Locally managed lock. Block can only be dirty in this cache.
• Global (G): Globally managed lock, may be dirty in more than one cache. Must
coordinate with DLM for write.
• PI count 0: no past-image.
• PI count 1: past-image present.
Note: We represent lock state by 2 letters and a digit for mode, role, and number of past-
images respectively. For example, XG1 is an Exclusive mode, Global role, 1 past-image
lock.
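As an illustration of how these states arise (a sketch of one common case, not a complete protocol description): if an instance holding a dirty block under XL0 ships the current image to another instance that intends to modify it, the sender retains a past image, so its lock might become NG1, while the requester would then hold the current block under a global-role exclusive lock such as XG0.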

Real Application Clusters Features 2-18


Instance and Crash Recovery

• Cache fusion recovery relies on two changes in the


recovery processing introduced in Oracle9i
– SMON performs all instance recovery, not
foreground processes
– Two pass log read scheme
• Cache fusion, in conjunction with these changes,
enhances the availability of Oracle9i Real
Application Cluster databases

Instance and Crash Recovery


Thread recovery of a failed instance is done by a surviving instance’s SMON (Instance
Recovery) or by a foreground process when all instances are dead (Crash Recovery). If a
foreground detects the need for instance recovery (IR), then it will post SMON. This is a
change from Oracle8i behavior, where a foreground could perform IR.
The other change in recovery processing introduced in Oracle9i that is relevant to Cache
Fusion recovery is the two-pass log read scheme for thread recovery. The first pass
determines the set of data blocks that were modified but not proven to have been
successfully written out of the buffer cache. This eliminates blocks that were modified by
the failed instance(s) but later written out to disk, and therefore not in need of recovery. The
second pass limits redo application to the set of blocks compiled by the first pass. The
scheme guarantees that by the end of the first pass all data blocks needing recovery are
known.
Fusion Recovery builds on the framework of this two-pass log read mechanism to allow
enhanced availability of the Oracle9i Real Application Cluster.
Note: For the most efficient recovery and processing following fail over, you should use
clusters consisting of homogeneous, rather than heterogeneous, nodes.

Real Application Clusters Features 2-19


First Pass Log Read

• Redo threads of failed instances are merged


• During the merge
– A recovery set data structure stores data about
blocks found:
– Data block address (dba)
– First-dirty SCN
– Last-dirty SCN
– A Block Written Record (BWR) with a higher
version than the last-dirty version causes the
block to be removed from the recovery set

First Pass Log Read


Redo threads of failed instances are read and merged by SCN, beginning at the redo block
address of the last incremental checkpoint for each thread. Thread merge for instance or
crash recovery is very similar to media recovery. When the first change to a data block is
encountered in the merged redo stream, a block entry is added to the recovery set data
structure. Internally, recovery set entries are organized in a hash table by data block address
(dba), with each hash chain sorted by dba for efficient lookup during the second pass. Each
block entry stores the first-dirty SCN encountered for the block and updates a last-dirty
version (SCN, sequence#) as subsequent changes for the block are read from the redo
stream.
The redo can also contain Block Written Records (BWRs). The logging of BWRs leads to a
persistent notion of block class for each written block. The instance that writes a block
(the Owner) must log a BWR. Every instance with a Past Image (a Holder),
invalidated as a result of this write, must also log a BWR to indicate that the changes
represented by the PI have been written. A BWR contains the version (SCN, sequence#) of
the written block image.
When a BWR is read from the merged log stream, the recovering process checks the
version. If the BWR version is greater than the last-dirty version in the recovery set, then
the block does not need recovery. The block entry is dropped from the recovery set. This
avoids unnecessary reads of these data blocks during second pass redo application.
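For example, suppose the merged stream shows block 100 first dirtied at SCN 1000 and last dirtied at SCN 1500. If a BWR for block 100 with version SCN 1600 is read later in the stream, that write is known to have captured every change to the block, so its entry is dropped from the recovery set.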
By the end of first pass the recovery set only contains blocks that were modified by the
failed instances and had no subsequent BWR to indicate that they were later written.
Further, each block entry has a first-dirty to last-dirty SCN range of changes that need to be
applied. A recovery list is maintained, made up of recovery set entries sorted by increasing
first-dirty SCN in a doubly-linked list, to specify the order in which to acquire IR locks.
This order minimizes paging when the buffer cache is not large enough to hold a buffer for
every recovery set entry.
Note: BWRs are logged by the owner instance that did the write and all holder instances.
Because every instance that modified the buffer logs a BWR following write of the buffer,
the first pass is more likely to find the BWR when any one of these instances fails and hence
exclude the block from the recovery set.

Real Application Clusters Features 2-20


First Pass Log Read

• Until locks are obtained by the recovery process


for all blocks in the recovery set, the DLM remains
frozen for normal PCM lock activity
• After IR locks are acquired, the DLM cleans up
orphaned blocks

First Pass Log Read


The tables on the next three slides show how the DLM responds to a RecoveryClaimLock
message with various combinations of locks and buffers already in use across different
instances. Once the recovery process has acquired the necessary IR lock, the shipped block
is copied into a recovery buffer covered by the granted lock.
After locks have been acquired on every block in the recovery set, the recovering process
issues a RecoveryDoneClaiming message to all DLM master nodes. The DLM will
complete reconfiguration (if it has not already) and initiate orphan resolution. An orphan is
a PI Holder that does not have an Owner; a dead instance was Owner, but the block does
not need recovery. Orphans occur in two ways:
• An instance became Owner, but died before modifying the block or having the redo
for its modification forced to disk. This is a common condition.
• An instance became Owner, modified and wrote the block, did a log force to write
the BWR, but died before invalidating past-image Holders. This is a less common,
race condition.
The orphaned PI Holders for such a resource will not be cleaned up by IR. After IR lock
acquisition, all resources needing recovery have been identified and locked, so any
remaining dubious resources must be orphans requiring DLM cleanup. Following orphan
cleanup, only resources locked for recovery need to remain unavailable to foreground lock
requests, so the DLM can validate the PCM lock space. Until the RecoveryDoneClaiming
message is received from recovery, the PCM lock database must remain frozen, suspending
PCM lock operations cluster-wide.

Real Application Clusters Features 2-21


First Pass Log Read

• Blocks must remain locked until the underlying


block is recovered
• Large recovery sets can cause the recovering
instance’s buffer cache to fill with locked buffers
awaiting recovery

First Pass Log Read


Once it is allocated, an IR buffer cannot be replaced or aged out, except by another recovery
buffer request. In the common case, when the recovering instance’s buffer cache can hold
all recovery buffers, IR buffers must remain in cache until they are released individually
during the redo application phase. The IR locks must be held until the underlying data
block is recovered, otherwise user lock operations would be allowed on partially recovered
blocks.
Large recovery sets (relative to buffer cache size) will result in the recovering instance’s
buffer cache being full of non-reusable buffers, leaving little room for foregrounds. User
activity on the recovering instance may therefore be heavily degraded, depending on the
fraction of the buffer cache that is allowed to hold recovery buffers. The recovery
algorithms are written with a target of using approximately half the default buffer pool, that
is, the cold half of the least recently used (LRU) buffer chain.
Lock down-convert requests for recovery buffers, using callback functions known as
blocking asynchronous traps (BASTs), need to be deferred and serviced only after the IR
lock is released. Locked IR buffers are marked “in-recovery” to inform the cache layer that
the current lock holder (SMON) will be able to release its lock only when recovery is
complete for the block.

Real Application Clusters Features 2-22


Second Pass Log Read

• Redo threads read again and merged by SCN


• Redo records applied
• After a block is recovered
– It is written to disk
– SMON’s lock is converted to XL
– It becomes available for users

Second Pass Log Read


Redo threads of failed instances are read again and merged by SCN. For each redo record
in the merged redo stream, the recovery set hash table is looked up to decide if the change
is for a recovery set block. In the common case, complete IR lock acquisition was able to
occur, so redo changes are applied to recovery buffers that are guaranteed to be in the
buffer cache. With partial IR lock acquisition, the block may need to be read into the buffer
cache and its IR lock acquired. This will require replacement of an existing recovery buffer.
After applying a redo record, if the resulting recovery buffer matches its last-dirty version
(SCN & sequence#) in the recovery set, then recovery is complete for that block. The block
can be immediately released for normal operations, even before second-pass is completed.
The recovering process requests a write of the recovery buffer and allows lock operations to
resume on the block:
• Message DBWR to write the recovery buffer and clear the "in-recovery" state of the
buffer. A recovery buffer can become current only after the write, unlike a regular buffer.
The cache layer resumes processing of lock down-converts, blocking ASTs (BASTs), for
this buffer, once it has been made current.
• After write completion, SMON's recovery lock goes from XG to XL and any past-image
Holders are invalidated by the DLM master as usual. The recovering instance may itself
have a Holder. If the recovery lock was XL to begin with, no lock transition occurs after the
write.
Recovery locks only differ from regular PCM locks in their response to BASTs, hence they
are not distinguished at the DLM level.

Real Application Clusters Features 2-23


Second Pass Log Read

• Blocks can be removed from the recovery set if the


recovery buffer version is greater than the last-
dirty version because we already have a current
image of the block
• Redo application is complete when the last
recovery buffer is released, even if all the redo has
not been read

Second Pass Log Read


After IR lock acquisition, when the contents of locked recovery buffers have been received
(either from disk or another instance), it is possible to further trim the recovery set. If a
recovery buffer version is greater than or equal to its last-dirty version (stored in the block
entry), no redo needs to be applied. An IR Resource Complete message is issued to the
DLM master and the block is removed from the recovery set without having to wait for the
completion of the second pass.
When the last recovery buffer is released, redo application is complete even if all the redo
has not been read. Potentially this allows the second pass to read fewer log records than the
first pass. Recovered threads are checkpointed and closed, requiring a wait for write
completions on outstanding requests issued during IR lock acquisition (for Owner writes)
or second pass (for recovery buffer writes). IR is complete when all dead threads have been
checkpointed and closed.

Real Application Clusters Features 2-24


Recovery from Single Instance Failure

• DLM must reconstruct states for these locks


– Non-PCM
– PCM-fusion
– PCM-non-fusion
• Each lock can be granted or still be on the convert
queue
• At the start of cache recovery
– Normal PCM lock activity is frozen
– Recovery process takes a recovery lock
(instance recovery enqueue)
• The current version of any given block may have
been in the buffer cache of the crashed instance

Recovery from Single Instance Failure


Following a single instance failure, the DLM layer has to reconstruct the lock states on
instance recovery. Recall the following categorization of locks:
• Non-PCM locks (enqueue locks)
• PCM-fusion locks
• PCM-non-fusion locks
Each of these categories is further subdivided into two states:
• granted
• requested but not granted yet (on convert queue)
At the beginning of cache recovery, the recovering instance takes a recovery lock (instance
recovery enqueue). All normal PCM lock activity in the cache is frozen at that time. Therefore,
recovery is able to rebuild the DLM structures and to modify the buffers and blocks. For a
given block in a single-instance failure scenario, two outcomes are possible: either the
crashed instance contained the current version of the block, or it did not.

Real Application Clusters Features 2-25


Recovery from Single Instance Failure

• If the current copy of the block was in the failed


instance, the recovery process
– Claims the best surviving PI
– Applies redo from the failed instance logs
– Writes the recovered block
• If the current copy of the block was not in the failed
instance
– The DLM notes this during reconfiguration
– A copy of the current buffer is sent to the
recovery instance

Instance with Current Block Image Fails


If the instance holding the current copy fails, the DLM, while recovering the lock database,
identifies the surviving PIs, if any. When the recovery instance claims the lock, it will be
acquired as current, if possible, or as a PI if another PI is already present. The best surviving
PI will be claimed (recovery pinged) by the recovery instance. Any redo for the block will
be replayed. Only the log of the failed instance needs to be applied to this block. No log
merge is required. (Note that the most common case of two-node clusters will always fall
into this no-log-merge category.) When the recovery is complete, the block will be written
to disk using the normal protocol before the checkpoint is advanced and recovery
completes.
Information about PI copies and current (SCUR, XCUR) blocks is acquired at the DLM
reconfiguration and recovery stage. Master node failure can cause the loss of write
notifications, which must be resolved initially.
Instance with Current Block Image Survives
If the instance with the current (and therefore latest) copy of the block survives the crash,
the reconfigured cluster still has the current copy of the block available. The DLM
discovers this at the stage of DLM reconfiguration. After the DLM recovery completes, the
claim process will send the buffer to the recovery instance as the best available past image.
If it is not needed by recovery, no action is taken. If it is needed by recovery, the block will
be written using the normal write protocol before pass two completes. The current holder
will be selected as the writing instance.

Real Application Clusters Features 2-26


Recovery from Multiple Instance Failures

• When multiple instances fail, the redo from them is


merged prior to recovery
• The cost of the merge is proportional to
number of failed instances × size of log per instance
• This is no worse than the pre-fusion cost, where all redo
must be applied from the failed instances, also
proportional to
number of failed instances × size of log per instance

In the case of a multiple failure, when neither the latest PI copy nor any current copy has
survived, the changes made to the block may be spread over multiple logs of
the failed instances. To ensure complete recovery, the logs must be merged. Because only
the logs of the failed instances are required, the potential performance penalty for the log
merge is proportional to
Number of failed instances × Size of log per instance
The size of the logs can be controlled by checkpoint features. This calculation shows that
the multi-instance recovery performance penalty is similar to the price that pre-Oracle9i
multi-instance databases paid without cache fusion, which required the successive
application of all logs of failed instances. Therefore the total performance penalty of
recovery prior to cache fusion is also proportional to
Number of failed instances × Size of log per instance
The additional requirement of the Cache Fusion design compared to the pre-Cache Fusion
design is to merge the logs of the failed instances. The number of operations required for
that scales linearly with the size of the merged data sets.
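As a worked illustration (figures hypothetical): if two instances fail and each has roughly 100 MB of redo to recover, both the pre-fusion scheme (apply each log in turn) and the cache fusion scheme (merge, then apply) do work proportional to 2 × 100 MB; cache fusion adds only the merge step itself, which scales linearly with that same total.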

Real Application Clusters Features 2-27


Summary

In this lesson, you should have learned how to:


• Explain how dynamic lock remastering improves
database availability
• Identify tablespaces with re-mastered locks
• Recognize the implications of instance recovery
with cache fusion
• Describe the steps taken in two-pass recovery with
Oracle9i Real Application Clusters
• Identify lock and buffer requirements necessary for
instance recovery
• Differentiate recovery requirements following
single and multiple instance failures

Real Application Clusters Features 2-28


Oracle9i Real Application Cluster Management

Real Application Clusters Features 3-1


Objectives

After this lesson, you should be able to:


• Describe a cluster configuration for Oracle Parallel
Fail Safe (OPFS)
• Configure an SPFILE for your initialization
parameters
• Identify the GC_* parameters that were changed or
made obsolete in Oracle9i
• Use INSTANCE_NAME and INSTANCE_NUMBER
parameters correctly
• Manage Oracle9i Real Application Cluster database
instances with Enterprise Manager tools
• Use new OPSCTL options

Real Application Clusters Features 3-2


Oracle Parallel Fail Safe

• High availability database solution for mission


critical systems
• Integrates technology from Oracle and its partners
• Includes tested configurations designed to quickly
recover from all types of system faults
• Formerly a separate product, now an integrated
feature of Oracle9i Real Application Clusters

Oracle Parallel Fail Safe


In Oracle9i, OPFS is part of the standard distribution and is installed automatically when Real
Application Clusters is installed.
However, you need to run a special OPFS installation program to enable and configure OPFS
on a cluster.
NOTE:
Oracle Parallel Fail Safe will not be available in the beta release of Oracle9i.

Real Application Clusters Features 3-3


Oracle Parallel Fail Safe

• Provides the fastest failure detection, failover, and


reconnect
• Automatically installs and configures high
availability features such as:
– Connect time failover
– Transparent application failover
– Pre-connections to secondary instance
• Integration of Real Application Clusters and the
system’s vendor cluster manager

Oracle Parallel Fail Safe


The integration of Real Application Clusters and the system's vendor cluster manager to
migrate IP addresses reduces TCP/IP timeouts on reconnects and provides a mechanism to
externally monitor the database and detect failures.

Real Application Clusters Features 3-4


Oracle Parallel Fail Safe: Architecture

[Slide diagram: primary and secondary nodes, each a clustered system running a cluster
framework, Oracle HA Packs, and Real Application Clusters, connected by a system
management infrastructure and sharing RAID/mirrored storage.]

Oracle Parallel Fail Safe: Architecture


The foundation of the high availability database is a cluster of servers, reliable and intelligent
storage, and the Oracle database with Real Application Clusters. To reduce downtime it has
the ability to quickly and automatically recover from failures should they occur.
Real Application Clusters enable the Oracle database to run concurrently on both systems in
the cluster. The high availability solution uses this technology to connect all clients to one
Oracle instance in normal operation (called the primary instance). In the event of a failure,
Oracle's failover and monitoring software together with the hardware vendor's cluster
framework will detect the problem and gracefully switch over clients to the alternate Oracle
instance (called the secondary instance), ensuring continued data access. OPFS configurations
have been designed to quickly and automatically recover from any single common failure,
including hardware, operating system, or Oracle instance faults.

Real Application Clusters Features 3-5


Common Initialization Parameter File

• Parameters for different instances can be mixed in


a single initialization parameter file
– Only one file to maintain and propagate
– Having all parameter values available in a single
location helps reduce errors
• Use a dot notation with the instance name for
instance-specific parameter values

Common Initialization Parameter File


In previous releases of Oracle, you had to have a separate initialization file for each
instance in a multi-instance database in order to assign different parameter values to the
instances. However, certain parameters had to have the same value for every instance in a
database. To simplify the management of the instance-specific and common parameters, the
IFILE parameter was commonly used. This parameter would point to a file containing the
common parameter values and was included in each of the individual instance parameter
files.
In Oracle9i, you can store the parameters for all the instances belonging to an Oracle Real
Application Cluster database in a single file. This simplifies the management of the
instances because you only have one file to maintain. It is also easier to avoid making
mistakes, such as changing a value in one instance’s file but not in another’s, if all the
parameters are listed in one place.
To allow values for different instances to share the same file, parameter entries that are
specific to a particular instance are prefixed with the instance name using a dot notation.
For example, to assign different sort area sizes to two instances, PROD1 and PROD2, you
could include the following entries in your parameter file:
prod1.sort_area_size = 1048576
prod2.sort_area_size = 524288
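Entries without an instance prefix apply to every instance. A fragment mixing common and instance-specific parameters might therefore look like this (values illustrative):
db_name = prod
db_block_size = 8192
prod1.sort_area_size = 1048576
prod2.sort_area_size = 524288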

Real Application Clusters Features 3-6


Shared Initialization Parameter File

Node 1 Raw device Node 2


ORACLE_HOME= ORACLE_HOME=
/hdisk1 /hdisk1
ORACLE_SID=orac1 ORACLE_SID=orac2

Local Local
disk disk
/hdisk1/dbs/initorac1.ora /hdisk1/dbs/initorac2.ora

SPFILE=/dev/rdisk1/spfile SPFILE=/dev/rdisk1/spfile

/dev/rdisk1/spfile

orac1.instance_name=orac1
orac2.instance_name=orac2

Shared Initialization Parameter File


Even if you put all the parameter values for all the instances in a single initialization file,
you need this file to be available to the process that starts up each instance. If your
instances are started automatically as part of the system startup routines, you would need a
copy of the file on each node in the cluster.
However, in Oracle9i, you can store the parameters for all the instances in a special binary
file known as a server parameter file (SPFILE). This file is stored in a raw partition on one
of the shared cluster devices and so it is available to any node in the cluster. Therefore, you
only need to keep and maintain one copy of this file.
Note: On Windows clusters, SPFILE shares a partition with the OSD's quorum disk.
The cluster in the slide consists of two nodes. Node 1 has assigned /hdisk1 to be its
$ORACLE_HOME directory and set the name for the instance it supports to be ORAC1.
Node 2 is using /hdisk1 for its $ORACLE_HOME and has assigned ORAC2 for its
instance’s name. A partition on the raw device, called /dev/rdisk1/spfile,
holds the binary file, SPFILE.
Each node has an instance-specific initialization file configured. This file contains just one
entry
SPFILE = /dev/rdisk1/spfile
This entry simply points to the raw partition holding the shared parameter file. The shared
file, SPFILE, contains entries that are common to both instances as well as instance-
specific entries.
The example shows two of the entries in SPFILE. These contain parameters that follow
the suggested naming standard for instances: use the SID, defined at the operating system
level, for the INSTANCE_NAME. Because instance names are unique to each instance, they
use dot notation to include the instance name with the parameter name:
orac1.instance_name = orac1
orac2.instance_name = orac2
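With this arrangement, each instance can be started with its local pointer file, which in turn directs Oracle to the shared SPFILE. A minimal sketch, assuming the file names shown in the slide:
SQL> CONNECT / AS SYSDBA
SQL> STARTUP PFILE=/hdisk1/dbs/initorac1.ora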

Real Application Clusters Features 3-7


Managing Server Parameter Files

SQL> CREATE SPFILE
  2  ='/u01/oracle/dbs/test_spfile.ora'
  3  FROM PFILE='/u01/oracle/dbs/test_init.ora';

SQL> ALTER SYSTEM SET


2 JOB_QUEUE_PROCESSES=50
3 COMMENT='Changed from 30'
4 SCOPE=SPFILE;

Managing Server Parameter Files


The server parameter file must initially be created from a traditional text initialization
parameter file. It must be created prior to its use in the STARTUP command. The CREATE
SPFILE SQL statement is used to create a server parameter file. You must have the
SYSDBA or the SYSOPER role to execute this statement.
The following example creates a server parameter file from the initialization parameter file
/u01/oracle/dbs/init.ora. In this example no SPFILE name is specified, so the
file is created in a platform-specific default location and is named spfile.ora.
SQL> CREATE SPFILE FROM PFILE='/u01/oracle/dbs/init.ora';
Another example, below, illustrates creating a server parameter file and supplying a name.
SQL> CREATE SPFILE='/u01/oracle/dbs/test_spfile.ora'
  2  FROM PFILE='/u01/oracle/dbs/test_init.ora';
The server parameter file is always created on the machine running the database server. If a
server parameter file of the same name already exists on the server, it is overwritten with
the new information. Multiple server parameter files can be created on the server and you
can select an appropriate one at startup.
You can change the values of parameters stored in an SPFILE by using a SQL ALTER
SYSTEM command. For example, you can change the job_queue_processes
parameter value, including a comment, using the following command:
SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES=50
2 COMMENT='Changed from 30'
3 SCOPE=SPFILE;
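When several instances share an SPFILE, a change can also be scoped to a single instance with the SID clause. A sketch, using an illustrative parameter and instance name:
SQL> ALTER SYSTEM SET OPEN_CURSORS=300
  2  SCOPE=SPFILE
  3  SID='prod1';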

Real Application Clusters Features 3-8


GC_FILES_TO_LOCKS

The GC_FILES_TO_LOCKS parameter assigns 1:N


PCM locks to data files
• In the earliest releases of multi-instance Oracle
databases, all locks were fixed, 1:N locks
• Releasable locks were introduced in later releases
and eventually became the default lock type
• By default, 1:N locks were fixed in previous
releases but could be defined as releasable with
the GC_FILES_TO_LOCKS parameter
• All PCM locks are releasable in Oracle9i, so the
releasable option of GC_FILES_TO_LOCKS has been
discontinued

GC_FILES_TO_LOCKS Parameter
Fixed locks were the only types of PCM locks available in the early days of multi-instance
Oracle databases. They were allocated at instance start up and persisted for the life of the
instance. To reduce the overhead of the DLM, fixed locks were 1:N locks: each lock
covered multiple blocks rather than just a single block.
In later versions, releasable locks were introduced. These locks were acquired from a pool
of locks when required and released back to the pool when they are no longer needed by the
instance. For the past few releases, releasable locks have been the default PCM locking
method. By default, releasable locks are 1:1, that is one lock covers exactly one block.
For data which could benefit from having more blocks to a lock, 1:N locks are the preferred
strategy. This includes data in files which are accessed by only one instance or are accessed
by several instances for read-mostly activity. In these cases, 1:N locks reduce the locking
overhead for these files and improve performance. The initialization parameter,
GC_FILES_TO_LOCKS, is used to assign 1:N locks to files. In previous releases, 1:N
locks assigned with the GC_FILES_TO_LOCKS could be defined as releasable, but
defaulted to fixed.
Releasable locks have a number of advantages over fixed locks:
• Instance startup times are faster because there are no fixed locks to open
• You have more flexibility in assigning hash locks: because all the 1:N locks are not
allocated at startup but are created on demand, more hash locks can be specified.
For these reasons, fixed locks are eliminated from Oracle9i: 1:1 and 1:N locks are all
releasable. The option to define 1:N as releasable is therefore no longer needed and has
been dropped from the GC_FILES_TO_LOCKS parameter syntax.
Note: To avoid performance problems caused by pre-Oracle9i pinging, you should only use
GC_FILES_TO_LOCKS to assign 1:N PCM locks on:
• Read-only or read-mostly files and tablespaces
• Files containing data that is modified only, or mainly, by just one instance
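A hedged example of the remaining syntax, assuming files 1 and 2 are read-mostly and file 5 is modified mainly by one instance:
GC_FILES_TO_LOCKS = "1-2=400:5=1000!8"
Here files 1 and 2 share 400 locks between them, while file 5 gets 1000 locks, each covering 8 contiguous blocks.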

Real Application Clusters Features 3-9


GC_DEFER_TIME

The GC_DEFER_TIME parameter has become


obsolete because
• The default value is adequate for most
environments
• There are no good tools to help make appropriate
changes to its value
• Even when it was tuned, only minimal gains were
achieved

GC_DEFER_TIME Parameter
The GC_DEFER_TIME parameter defined a number of one-hundredths of a second that an
instance would wait before responding to a request to release or downgrade a PCM lock.
The intent was to give the instance an opportunity to finish any current activity on the block
(or blocks) covered by the lock before taking action on the lock request. The benefits of
setting this parameter were minimal because, in general, only a few blocks were in use
when a lock request was received and the delay impacted all of the lock requests for non-
active blocks. Tuning this parameter was difficult because of the lack of good guidelines
and barely-measurable performance improvements. For these reasons, GC_DEFER_TIME
has been made obsolete.
Note: GC_DEFER_TIME is being retained as an underscore (hidden) parameter,
_GC_DEFER_TIME.

Real Application Clusters Features 3-10


GC_RELEASABLE_LOCKS

GC_RELEASABLE_LOCKS has become obsolete in


Oracle9i
• In earlier releases
– It defaulted to DB_BLOCK_BUFFERS
– It could be less than DB_BLOCK_BUFFERS if
GC_FILES_TO_LOCKS assigned hash locks to all
data files to save space in the DLM
• Hash locks are only used in Oracle9i for read-mostly
tablespaces
• Locks are smaller in the current release

GC_RELEASABLE_LOCKS Parameter
In Oracle8i and earlier releases, if GC_FILES_TO_LOCKS was used to assign hash locks
to all the files in the database, then GC_RELEASABLE_LOCKS was occasionally set to be
less than DB_BLOCK_BUFFERS. This was done to save memory because fewer DLM locks
were needed.
However, in Oracle9i, hash locks are used only for read-mostly tablespaces and the DLM
locks are much smaller. For these reasons, there is no requirement to reduce the number of
releasable locks. The number of releasable locks is fixed at DB_BLOCK_BUFFERS and the
GC_RELEASABLE_LOCKS parameter has been made obsolete.
Note: GC_RELEASABLE_LOCKS is being retained as an underscore (hidden) parameter,
_GC_RELEASABLE_LOCKS.

Real Application Clusters Features 3-11


GC_ROLLBACK_LOCKS

• Need for the parameter was significantly reduced


with the introduction of the CR Server
• Has become obsolete in Oracle9i
• Rollback segments are covered by locks covering
16 contiguous blocks
– Efficient for sequential creation and use
– Not too large to cause problems when rollback
needed by another instance to build read
consistent images

GC_ROLLBACK_LOCKS Parameter
This parameter was used to specify the lock mapping for rollback segments. If there was a
lot of pinging of UNDO blocks from a rollback segment, it needed fine-grain locks; if there
was not much pinging, it needed coarse-grain locks. The idea was to balance the cost of
pinging with the cost of getting locks, to achieve maximum performance.
The value of this parameter was reduced considerably by the introduction of the Consistent
Read (CR) Server in Oracle8i. The CR Server created read consistent block images on the
instance holding the rollback blocks and sends them to the requesting instance through the
interconnect. This eliminated the need to send rollback blocks to the requesting instance.
This parameter is obsolete in Oracle9i. Internally the rollback segments are protected by
locks with a grouping of 16. Since the UNDO blocks are created sequentially, this large
grouping should provide the best performance. The grouping is not too large, however,
because query requests may be sent to a node which has aged out the pertinent blocks. In
this case, any rollback block needed to build a read consistent image is read into the cache
of the querying instance under a shared PCM lock. Until the shared lock is released, the
rollback blocks covered by the lock cannot be modified by the instance to which the
rollback segment is assigned.

Real Application Clusters Features 3-12


Instance Naming

• Oracle9i: Real Application Cluster instances


must have unique names
– To ensure that tools can correctly and
consistently identify an instance
– To simplify instance management for
database administrators
– To differentiate values for each instance,
use initialization parameters of the form
sid.parameter_name = value
• A second instance attempting to start with a
duplicate SID will fail with an error

Instance Naming
Prior to the release of Oracle9i, instances were identified internally by number. Instance-
specific database objects, such as redo threads and free list groups, were also associated
with numeric initialization parameters, such as thread and instance_number. Instances in
those earlier releases had names, which were assigned using the ORACLE_SID
environment variable at the operating system level and the optional instance_name
initialization parameter. However, there were limitations with these naming techniques:
• INSTANCE_NAME values did not have to be unique in different instances of the
same database
• On some platforms, the ORACLE_SID could be the same for all instances of the
same database
This meant that instance names could not be used by management tools to identify
instances. Also, if the thread and instance_number parameters were not specified in an
instance’s initialization file, values were assigned based on startup order: the first instance
to start was assigned thread 1, the second was assigned thread 2, and so on. Thus there was
never a guaranteed assignment of these database objects based on instance names.
In Oracle9i, each instance of an Oracle Real Application Cluster database is required to
have a unique name assigned with the SID. The use of unique instance names enable the
system management tools to use instance names to identify instances to the user, with the
assurance that these names are unique.
Unique names also allow the instances associated with the same database to share an
initialization file through the use of the SID as a parameter prefix, as described earlier.

Real Application Clusters Features 3-13


Unique Instance Numbers

• An Oracle9i Real Application Cluster instance must


have an assigned instance number
– To ensure that tools can correctly and
consistently identify an instance
– For internal algorithms to manage space and
DLM access
• Instance numbers do not default to the next
unused instance number
– Unique instance numbers are defined with the
instance_number initialization parameter
– The instance_number parameter defaults to 1
• A second instance attempting to start with a
duplicate instance number will fail with an error

Unique Instance Numbers


Prior to Oracle9i, the RDBMS chose an unused instance number if a multi-instance
database instance started up without specifying a value for the INSTANCE_NUMBER
parameter.
In Oracle9i, the default value for the INSTANCE_NUMBER parameter is set to 1. A second
instance attempting to start up without specifying a value for the INSTANCE_NUMBER
parameter will also attempt to start as instance number 1. Such an attempt fails and returns
an error to the user. To start successfully, instances are required to specify unique numbers
in the INSTANCE_NUMBER parameter.

Real Application Clusters Features 3-14


Unique Instance Names and Numbers

• The Oracle Database Configuration Assistant


assigns unique instance names and numbers
automatically or manually
• Two databases sharing a node may have the same
names for their instances if they use separate
Oracle Homes
• The combination of unique names and associated
instance numbers
– Enable database administrators to identify
instances reliably when using database
management tools
– Ensure that redo logs and extents are mapped
to the same instances each time

Unique Instance Names and Numbers


In Oracle9i, instance names and numbers are configured by the Database Configuration
Assistant, or manually by the user just as in earlier releases of Oracle9i Real Application
Cluster. The instance name, specified by INSTANCE_NAME, must be unique within the
database where the instance runs. Instances in other databases may use the same instance
name, even if they run on the same node.
When starting up and joining the Real Application Cluster, each instance checks that there
is no other currently running instance using its name. If it finds that there is another
instance already running using the same name, it fails to start up, and returns an error to the
user. There is also a pre-existing restriction that instance names must be unique on a node.
On Windows, there can be only one OracleService<SID> with a particular name on a node.
On UNIX, it may be possible from an RDBMS point of view to run multiple instances with
the same name on a node as long as they’re running in different ORACLE_HOMEs.
However, the management tools won’t support this configuration, even for single-instance
databases, because there can only be one /etc/oratab entry for a particular name. Currently,
the restriction that instance names be unique on a node is not enforced. If you run multiple
Real Application Cluster databases on the same nodes, you will need to be aware of the
possible confusion if you use duplicate names and instance numbers for instances in
different databases.
Unique names allow you to use the instance name to identify instances when using Oracle9i
Real Application Cluster management tools. Also, you are assured that the database objects
associated with Oracle9i Real Application Cluster instances, such as extents, will be
mapped to the same instances each time they are restarted.

Real Application Clusters Features 3-15


Instance Names and Numbers

Node A Node B Node C Node D

PROD1 PROD2 PROD3 PROD4


Thread 1 Thread 2 Thread 3 Thread 4

DEV1 TEST1 DEV2 TEST2


Thread 1 Thread 1 Thread 2 Thread 2

Instance Names and Numbers


There are three different databases in the example: PROD, DEV, and TEST. PROD has four
instances, PROD1, PROD2, PROD3, and PROD4, each running on one of the four available
nodes. DEV has two instances, DEV1 and DEV2, running on Node A and Node C
respectively. Similarly, TEST has two instances, TEST1 on Node B and TEST2 on Node D.
The recommended naming and numbering for these instances would be as follows:
1. Number the redo threads for each instance 1, 2, 3, 4, and so on. That is, your redo
thread numbers should start at 1 and increment by 1. The slide shows the thread
numbers to each of the eight instances used in the example.
2. Set the ORACLE_SID to be the database name plus its redo thread number as a suffix.
For example, on Node A, you would use ORACLE_SID = PROD1 for the PROD
database and ORACLE_SID = DEV1 for the DEV database.
3. Set the THREAD parameter to match the thread number you chose for the instance and
which you should have reflected in the instance’s ORACLE_SID value. For example,
the THREAD value for the PROD database instance on Node C should be 3.
4. Set the INSTANCE_NUMBER parameter value the same as the THREAD parameter
value for each instance. For example, the INSTANCE_NUMBER for the PROD database
instance on Node C should also be set to 3, the value assigned to THREAD in step 3.
5. Set the INSTANCE_NAME parameters for each instance to match the SID name in the
parameter file. For example, the INSTANCE_NAME for the TEST database instance on
Node D should be TEST2.
Following these recommendations, the parameter file for the DEV database would contain
the following entries:
dev1.thread = 1
dev2.thread = 2
dev1.instance_number = 1
dev2.instance_number = 2
dev1.instance_name = dev1
dev2.instance_name = dev2

Real Application Clusters Features 3-16


Real Application Clusters and Enterprise
Manager
Enterprise Manager simplifies the management
of Oracle9i Real Application Cluster databases:
• OPSM consists of a set of management tools for
Oracle Real Application Clusters
• Additional management tasks can be
accomplished through Enterprise Manager
• Greater similarity between single instance and Real
Application Cluster database configurations
• SPFILE provides a single-image, server side copy
of the initialization file that can be manipulated by
Enterprise Manager

Real Application Clusters and Enterprise Manager


Oracle Parallel Server Management (OPSM) consists of a set of tools and utilities to help
you manage Real Application Clusters. OPSM is installed on the Real Application Cluster
server and consists of OPSCTL, OPSD, and so on.
The Intelligent Agent, used by Oracle Enterprise Manager (EM), is extended for Oracle
Real Application Clusters. These extensions used to be installed separately for Oracle
Parallel Server, but are now always contained within the Intelligent Agent. The EM
Console also has extensions, called Oracle Parallel Server Manager, which address Real
Application Cluster issues.
The Console extensions allow EM to discover, start up, and shut down Oracle Real Application
Clusters. In Oracle9i, the Console also manages the configuration of the clustered servers
(which instances run on which nodes). This enables the EM front end to include additional
functionality for managing Oracle Real Application Cluster databases.

Real Application Clusters Features 3-17


New Enterprise Manager Features

• Configuration data (node to instance matching) managed by the Agent
• Autodiscovery of all nodes running an instance
• Automatic startup of the daemon to manage cluster
nodes
• Graphical display support for NT
• Improved cluster-wide instance startups and
shutdowns on NT

New Enterprise Manager Features for Oracle Real Application Clusters


Currently, users must manage Oracle9i Real Application Cluster databases using manual
methods to add, remove, or rename instances.
The new features extend OPSM so that the management of these tasks can be automated
through EM. This enhances the single system image of Oracle9i Real Application Clusters
by eliminating some of the differences between how database configuration is managed and
how Oracle Real Application Clusters configuration is managed.
These features are described briefly on the next few slides.

Real Application Clusters Features 3-18


Real Application Cluster Configuration

[Diagram: Prior to Oracle9i, a Windows cluster stored the configuration in the registry
and a UNIX cluster stored it in a text file; in Oracle9i, both Windows and UNIX
clusters store it on a shared raw device]

Real Application Cluster Configuration


Prior to Oracle9i, OPSM stored configuration information in the registry on Windows and
in a flat text file on each node under UNIX. Because you cannot predict which node will be
requested by EM to start up or shut down a Real Application Cluster instance, you had to
maintain one copy of this file on each node and ensure the consistency between these
copies.
In Oracle9i, OPSM provides a portable mechanism for storing the Real Application Cluster
configuration that removes the requirement for you to manually synchronize the
configuration file across all nodes of the Oracle9i Real Application Cluster. Also, it
provides an API for storing and retrieving configuration variables for use by
Oracle’s Database Configuration Assistant (DBCA). The node-to-instance mapping is
stored on a raw device, so that it can be shared amongst the nodes. At any time, this raw
device contains the currently configured list of instances and which nodes they should run
on.
Note: For those of you familiar with the internal structures of Oracle, this can be compared
to information available via the clusterware interface skgxn about which instances are
running in a particular database on all operating systems except Windows. Windows uses a
shared raw device to store static, physical information about instances and skgxn contains
only active instance information.
With the configuration data available to all nodes, OPSM enables EM and DBCA to
discover which instances exist and also to add, remove, and rename instances. For example,
you can add a new node to an existing Real Application Cluster database by using a wizard
available in the DBCA. The wizard leads you through steps to specify the new instance
name (SID), the node name on which it runs, the names of the raw devices to be used for
the redo log files it will use, and so on. DBCA then updates the configuration files, such as
tnsnames.ora and listener.ora, on all the nodes to reflect the presence of the new
instance. DBCA also creates the database objects (thread, redo log group, rollback
segments) required by the new instance. Finally, DBCA will start up the new instance.
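As an illustration of the kind of change DBCA makes, a tnsnames.ora entry for a new
instance might look like the following sketch; the instance name PROD5, host name
host5, and port number are hypothetical:
PROD5 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = host5)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = PROD)(INSTANCE_NAME = PROD5))
  )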

Real Application Clusters Features 3-19


Windows Feature Improvements

• Real Application Clusters display is available for Windows
– Shows startup and shutdown progress
– Graphical display of Real Application Cluster
status
• Remote administration on Windows
– OPSM starts up nodes locally
– Remote administration no longer required by
OPSCTL
– EM may still require remote administration

Status Details Display on Windows


The Enterprise Manager Console has a graphical status display that shows the status of the
Real Application Cluster and the progress of Oracle9i Real Application Cluster startup and
shutdown operations. Prior to Oracle9i, this display worked when Oracle Parallel Server
was running on a UNIX system, but not on an NT cluster. The additional code to generate
the status details has now been ported to NT, and OPSM in Oracle9i displays the status
details for both UNIX and Windows Real Application Clusters.
Remote Administration on Windows
OPSCTL previously required remote administration for performing instance startup and
shutdown operations on remote nodes in Windows clusters. OPSM in Oracle9i starts up
nodes locally, eliminating the requirement for remote administration by OPSCTL.
Enterprise Manager may still require remote administration.

Real Application Clusters Features 3-20


Requirements

• Pre-installation includes building the raw partition to store the configuration information
• Command line tools include extensions to the
OPSCTL options
• Windows Cluster Setup Wizard

Pre-Installation
The OPSM software is installed as part of the Real Application Cluster Option. It is not
listed as a separately installable item in the Oracle Universal Installer.
As part of installation, DBCA already creates the Real Application Cluster configuration
according to the database name, SID prefix and list of nodes entered by the user. Users also
have to provide a raw device on which to store the OPSM configuration. Creating such a
raw device will be a pre-installation step.
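A minimal sketch of this pre-installation step on a UNIX system follows; the device
path and the oracle owner and group are assumptions that vary by platform and site:
# give the Oracle software owner access to an unused raw partition
chown oracle:dba /dev/rdsk/c1t2d0s5
chmod 660 /dev/rdsk/c1t2d0s5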
Extended OPSCTL Commands
Configuration information is shared amongst the nodes by being stored as a binary file on a
shared raw disk, like the database files. You will no longer be able to change this mapping
by using a text editor. This configuration will need to be alterable via a command line
interface as well as by GUI tools such as DBCA.
New OPSCTL sub-commands allow you to configure Real Application Clusters from the
command line.
Windows Pre-Installation Tool
The Windows pre-installation tool is a wizard called the Cluster Setup Wizard. This tool
incorporates OLM functionality, and users can create the symbolic links before installing
OLM and the OSDs. A help system is integrated with this wizard.

Real Application Clusters Features 3-21


New and Modified OPSCTL Commands

Start all instances and listeners in OPS opsma
opsctl start -p opsma
Start the instance opsma1 and its listeners
opsctl start -p opsma -i opsma1
Start only instance opsma1
opsctl start -p opsma -i opsma1 -s inst
Start only listeners for instance opsma1
opsctl start -p opsma -i opsma1 -s lsnr
Start all instances with debug output enabled
opsctl start -p opsma -x lsnr -D 3

New and Modified OPSCTL Commands


OPSCTL supported the START and STOP sub-commands in previous releases. These sub-
commands have been modified and new ones have been added to support more extensive
Oracle9i Real Application Cluster management from the command line:
opsctl start -p <ops_name> [-i <inst,...>] [-n <node,...>]
[-s <stage,...>] [-x <stage,...>]
[-c <connstr>] [-o <options>] [-S <level>] [-D <dbglvl>]
[-h]
where
-p <ops_name> start specified Oracle9i Real Application Cluster
-i <inst,...> start named instances if specified, otherwise entire Oracle9i Real
Application Cluster
-n <node,...> start instances on named nodes
-s <stage,...> list of stages to start (stage=inst,lsnr)
-x <stage,...> except these stages
-c <connstr> connect string (default: / as sysdba)
-o <options> options to startup command (e.g. force, nomount, ...)
-S <level> intermediate status level for Console
-D <debuglvl> debug level
-h print usage
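For example, a sketch of a start command that combines several of these flags (the node
names ez1 and ez2 are hypothetical) would start only the instances, not the listeners, on
two named nodes with debug output enabled:
opsctl start -p opsma -n ez1,ez2 -s inst -D 3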

Real Application Clusters Features 3-22


New and Modified OPSCTL Commands: STOP
Sub-Command
Stop instances and listeners in OPS opsma
opsctl stop -p opsma

Stop only listeners
opsctl stop -p opsma -s lsnr

Stop instance opsma1 with an option and with debug output
opsctl stop -p opsma -i opsma1 -s inst -o immediate -D 3

Stop the instance on node ez1
opsctl stop -p opsma -n ez1 -x lsnr

Stop OPS opsma with connection string 'system/manager'
opsctl stop -p opsma -c 'system/manager'

OPSCTL STOP Sub-Command


opsctl stop -p <ops_name> [-i <inst,...>] [-n <node,...>]
[-s <stage,...>] [-x <stage,...>] [-c <connstr>] [-o
<options>] [-S <level>] [-D <dbglvl>] [-h]
where
-p <ops_name> Stop specified Oracle9i Real Application Cluster
-i <inst,...> Stop named instances if specified, otherwise entire Oracle9i Real
Application Cluster
-n <node,...> Stop instances on named nodes
-s <stage,...> list of stages to stop (stage=inst,lsnr)
-x <stage,...> except these stages
-c <connstr> connect string (default: / as sysdba)
-o <options> options to shutdown command (e.g. abort, transactional)
-S <level> Intermediate status level for Console
-D <debuglvl> debug level
-h print usage
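As a further sketch, the -o flag passes its value through to the SHUTDOWN command, so
aborting every instance in opsma while leaving the listeners alone might look like:
opsctl stop -p opsma -s inst -o abort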

Real Application Clusters Features 3-23


New and Modified OPSCTL Commands:
STATUS Sub-Command
Get status of OPS opsma
opsctl status -p opsma
Get status of instance opsma1
opsctl status -p opsma -i opsma1 -s inst
Get status of all listeners in OPS opsma
opsctl status -p opsma -s lsnr

OPSCTL STATUS Sub-Command


opsctl status -p <ops_name> [-i <inst,...>] [-n
<node,...>] [-s <stage,...>] [-x <stage,...>]
[-c <connstr>] [-S <level>] [-D <dbglvl>] [-h]
where
-p <ops_name> Check specified Oracle9i Real Application Cluster
-i <inst,...> Check named instances if specified, otherwise entire Oracle9i
Real Application Cluster
-n <node,...> Check instances on named nodes
-s <stage,...> list of stages to check status (stage=inst,lsnr)
-x <stage,...> except these stages
-c <connstr> connect string (default: / as sysdba)
-S <level> Intermediate status level for Console
-D <debuglvl> debug level
-h print usage
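For instance, a sketch of checking only the instances on a single named node (the node
name ez1 is hypothetical):
opsctl status -p opsma -n ez1 -s inst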

Real Application Clusters Features 3-24


New and Modified OPSCTL Commands:
CONFIG Sub-Command
Get a list of OPSes on the raw device
opsctl config
Get configuration of OPS opsma
opsctl config -p opsma
Get configuration of node ez1 (entries from all
OPS configurations will apply)
opsctl config -n ez1
Get configuration of node ez1 in OPS opsma
opsctl config -p opsma -n ez1
Display version and exit
opsctl config -V

OPSCTL CONFIG Sub-Command


opsctl config [-p <ops_name>] [-n <node>] [-D <dbglvl>] [-
V] [-v] [-h]
where
-p <ops_name> Show configuration for specified Oracle9i Real Application Cluster;
otherwise list all Real Application Clusters
-n <node> Only show services on named node
-D <debuglvl> debug level
-V show version
-h print usage

Real Application Clusters Features 3-25


New and Modified OPSCTL Commands:
Additional Sub-Commands
Add an OPS myops to raw device
opsctl add ops -p myops -o /disk1/ora9

Add an instance to myops
opsctl add instance -p myops -i myops1 -n mynode1

Delete instance oldinst from myops
opsctl delete instance -p myops -i oldinst

Delete OPS myops
opsctl delete ops -p myops

Rename instance myops1 to newmyops1
opsctl rename instance -p myops -i myops1 -e newmyops1

Additional OPSCTL Sub-Commands


opsctl add instance -p <ops_name> [-i <inst> -n <node>] [-
D dbglvl] [-h]
where
-p <ops_name> Name of Oracle9i Real Application Cluster to add to
-i <inst> Name of instance to add
-n <node> Name of node on which to add instance
-D <dbglvl> debug level
-h print usage
opsctl add ops -p <ops_name> -o <oracle_home> [-D dbglvl]
[-h]
where
-p <ops_name> Name of Oracle9i Real Application Cluster to add
-o <oracle_home> value of ORACLE_HOME
opsctl delete instance -p <ops_name> [-i <inst>] [-D
dbglvl] [-h]
where
-p <ops_name> Name of Oracle9i Real Application Cluster to delete from
-i <inst> Name of instance to delete
opsctl delete ops -p <ops_name> [-D dbglvl] [-h]
where
-p <ops_name> Name of Oracle9i Real Application Cluster to delete
opsctl rename instance -p <ops_name> [-i <oldinst> -e
<newinst>] [-D dbglvl] [-h]
where
-p <ops_name> Name of Oracle9i Real Application Cluster to rename
-i <oldinst> Name of instance to rename
-e <newinst> New name for instance

Real Application Clusters Features 3-26


New and Modified OPSCTL Commands:
Additional Sub-Commands
Move an instance opsma1 to new node ez2
opsctl move instance -p opsma -i opsma1 -n ez2

Set OPS environment
opsctl set env -p opsma -t NLS_LANGUAGE=english

Set instance environment
opsctl set env -p opsma -i opsma1 -t NLS_LANGUAGE=american

Unset OPS environment
opsctl unset env -p opsma -t NLS_LANGUAGE

Additional OPSCTL Sub-Commands


opsctl move instance -p <ops_name> -i <inst> -n <newnode> [-
D <dbglvl>] [-h]
where
-p <ops_name> Name of Oracle9i Real Application Cluster in which instance should
be moved
-i <inst> Name of instance to move
-n <newnode> New node for instance
-D <dbglvl> Debug level
-h print usage
opsctl set env -p <ops_name> -t <name>=<value> [-i <inst>]
[-D <dbglvl>] [-h]
where
-p <ops_name> Name of Oracle9i Real Application Cluster in which to set
environment
-t <name>=<value> Name and value of environment variable
-i <inst> Instance for which env should be set
opsctl unset env -p <ops_name> -t <name> [-i <inst>]
where
-p <ops_name> Name of Oracle9i Real Application Cluster in which to unset
environment
-t <name> Name of environment variable to unset
-i <inst> Instance for which env variable should be unset

Real Application Clusters Features 3-27


New and Modified OPSCTL Commands:
Additional Sub-Commands
Display all OPS environment settings
opsctl get env -p opsma
Display all instance environment settings
opsctl get env -p opsma -i opsma1

Additional OPSCTL Sub-Commands


opsctl get env -p <ops_name> [-i <inst>] [-D <dbglvl>] [-h]
where
-p <ops_name> Name of Oracle9i Real Application Cluster from which to get
environment
-i <inst> Instance for which env variables should be displayed
opsctl -V
where
-V print version

Real Application Clusters Features 3-28


Summary

In this lesson, you should have learned how to:


• Use Oracle Parallel Fail Safe
• Set up a shared initialization parameter file
• List GC_* parameters that have become obsolete
in Oracle9i
• Set INSTANCE_NAME and INSTANCE_NUMBER
parameters to identify instances uniquely
• Configure and use Enterprise Manager to manage
Oracle9i Real Application Cluster database
instances
• Execute new OPSCTL options

Reference: For information on the tools used to diagnose Oracle Real Application Clusters,
refer to the course Improved Diagnosability Features.

Real Application Clusters Features 3-29


Real Application Clusters Features 3-30
Integration with Microsoft’s Cluster Server

Real Application Clusters Features 4-1


Objectives

After this lesson, you should be able to:


• Explain the benefits of using MSCS and Oracle9i
Real Application Clusters together
• Describe basic Microsoft Cluster Server (MSCS)
functionality
• Identify the Oracle9i Real Application Clusters and
MSCS interface

Real Application Clusters Features 4-2


MSCS Benefits

• Greater flexibility in configuring Oracle9i Real Application Clusters on Windows
– Use MSCS IP fail-over
– Use Microsoft’s cluster management facilities
• Simplifies migration from Oracle Fail Safe to
Oracle9i Real Application Clusters
• Oracle9i Real Application Cluster applications can
still run in active-active mode even with MSCS on
the cluster

Real Application Clusters Features 4-3


MSCS Concepts

• Oracle9i Real Application Clusters use multiple nodes in a cluster to provide a scalable database
– Each node can run an active instance against the
database
– Any remaining instance can provide failover, to a
certain extent, for a failed instance
• MSCS provides clustering capability for virtually any
Windows application
– A single application can only be active on one node
at a time
– Cluster is used solely for failover

MSCS Concepts
Oracle9i Real Application Clusters use an active-active shared storage cluster. That is,
all the nodes in the cluster can be online and capable of processing transactions at the
same time.
Microsoft’s Windows has functionality to support clustering through the Microsoft Cluster
Server (MSCS), previously known as Wolfpack. MSCS is a generic clustering solution
that can be used to cluster virtually any Windows application. It is based on an
active/passive design that emphasizes high availability rather than scalability or fault
tolerance. MSCS works by having an application active on only one node at any given
time. MSCS monitors the availability of the application and restarts the application on a
standby node in the case of failure.
MSCS requires that all of the nodes in the cluster share at least one disk, which is used as
a quorum. The shared disk is visible only to the active node in the cluster and is failed over
if the active node fails. Clustered applications can put data, such as currently open
documents, on the shared disk. This allows the data to be recovered on the secondary node
in the case of an active node failure. Clustered applications can also use their local disks as
long as the data is not required to fail over. Currently, MSCS supports only two-node
clustering, but Microsoft has stated its intention to release an n-node version in the future.

Real Application Clusters Features 4-4


MSCS and Oracle9i Real Application Clusters

• MSCS can be extended using a DLL


• The DLL provides an API interface to MSCS
• The DLL is visible to Oracle9i Real Application
Clusters
• MSCS works with groups of resources
• Oracle9i Real Application Clusters can be grouped
under MSCS with related resources

MSCS and Oracle9i Real Application Clusters


Vendors who want their applications to be made highly available by MSCS can extend
MSCS by implementing a resource type DLL. This provides an API that MSCS uses to
monitor the health of the application. By making calls to the DLL, MSCS is able to bring
an application online or take it offline and to periodically validate its health with calls
to the IsAlive or LookAlive functions. This resource DLL is now available to Oracle9i Real
Application Clusters customers so they can create Oracle9i Real Application Clusters
resources in MSCS.
MSCS users can create groups of resources. All resources in a group must be “Alive” for
the group to be considered online. If one of the group members fails, MSCS restarts all the
members of the group on an active node.
The Oracle9i Real Application Clusters resource DLL enables users to make other
resources, such as IP addresses, a web server, or a client application, depend on Oracle9i
Real Application Clusters being active. An example is an application that uses
Microsoft’s Internet Information Server to serve the HTML but relies on Oracle9i Real
Application Clusters for the back end.

Real Application Clusters Features 4-5


MSCS Configuration

• MSCS is configured with wizards or the cluster administrator GUI tool
• The Oracle9i Real Application Clusters resource
type uses an extension to provide an integrated
configuration
• Hardware vendors supplying MSCS enabled OSDs
must either
– Ship the reference OSDs provided by Oracle
Corporation
– Integrate changes into their own cluster
management modules

MSCS Configuration
MSCS provides a way to configure the extended resource types in the initial creation
wizard or after the resource is created in the cluster administrator GUI. Configuration is
done in dialog boxes which are implemented in a separate module called the cluster
administrator extension DLL. The Oracle9i Real Application Clusters resource type uses a
cluster administrator extension DLL to allow proper configuration.
The Oracle9i Real Application Clusters architecture is designed to allow hardware vendors
to provide system dependent clusterware in modules collectively called the Oracle System
Dependent modules (OSD). Currently, most Windows modules are shipped with the
reference implementation supplied by Oracle. The OSD cluster manager (CM) is a module
which monitors the health of the instances and dependent processes in the cluster. An
Oracle9i Real Application Clusters MSCS resource interfaces with the CM. Hardware
vendors who wish to supply MSCS-enabled OSDs must either ship the Oracle reference
modules or integrate the changes into their CM modules.

Real Application Clusters Features 4-6


Configuration Parameters

• The name of the database


• The instance designator
• The behavior of the resource when the database is
brought online
• The Oracle Net connect information for OCI queries
used by the IsAlive function
• The behavior of the Oracle9i Real Application
Clusters resource when the database is brought
offline

Configuration Parameters
The cluster administrator extension DLL allows the user to configure the following
parameters:
• The name of the database. This is needed to distinguish between multiple databases
on the cluster.
• The instance designator, which uniquely and globally differentiates instances in the
cluster.
• The behavior of the resource when the database is brought online: either to start the
Oracle9i Real Application Clusters instance and mount it when MSCS calls the
Online function or to override this functionality.
• The Oracle Net connect information for OCI queries used by the IsAlive function.
• The behavior of the Oracle9i Real Application Clusters resource when the database
is brought offline. The options are
– To stop the service
– To issue a SHUTDOWN command
– To do nothing.
The default behavior is to start the database service when the database is brought online
and to stop it when the database is taken offline.

Real Application Clusters Features 4-7


Cluster Resource Type DLL

The key functions in the interface include


• Online
• Offline
• LookAlive
• IsAlive
• Terminate

Cluster Resource Type DLL


The cluster resource type DLL complies with the resource type API required by MSCS.
The most important functions in the interface are the Online, Offline, LookAlive, IsAlive
and Terminate functions.

Real Application Clusters Features 4-8


Online and Offline Functions

• The Online function


– Starts an Oracle instance, including the
underlying service, if configured by the user to
do this
– Confirms the need to start the service and
instance before trying
– Continues attempting to validate a running
instance until exceeding a user-configured time
out
• The Offline function stops the database instance if
configured

Online Function
The Online function either does nothing or else starts the database instances, depending on
the configuration specified by the user via the cluster administrator extension DLL.
If configured to bring up the database, the function first checks to see if the database service
is up. If it is not, the function starts the service and then mounts the database. It does not
return until confirming that the new database instance is functioning correctly or validating
the already-running instance.
Should the function not be able to bring up the instance, it returns an error. The amount of
time the function waits before concluding that there is a problem can be configured by the
cluster administrator extension DLL.
Offline Function
The Offline function either does nothing or stops the database instance depending on the
configuration specified by the user in the cluster administrator extension DLL. When
shutting down the database, it is possible to specify whether the service is “shutdown”
(process alive but not mounted) or completely stopped (process terminated).

Real Application Clusters Features 4-9


LookAlive and IsAlive Functions

• The LookAlive function validates the health of a database instance
– Completes the check in less than 30
milliseconds
– Depends on Oracle9i Real Application Clusters
and MSCS recognizing the same cluster
members
• The IsAlive function also validates the health of a
database instance
– Can spend more time than LookAlive
– Performs more extensive testing

LookAlive Function
The MSCS specifications require the LookAlive function to take less than 30 milliseconds
to make a best-guess assessment of the database instance’s health. This function makes use of the
CM’s knowledge of group membership.
To avoid split-brain problems, the set of serving nodes recognized by Oracle9i Real
Application Clusters and by MSCS must be the same if the communication between nodes
is severed. The LookAlive function relies on this consistency between the two products.
IsAlive Function
The IsAlive function can take more time than the LookAlive function to determine if the
instance is available.
In addition to the IsAlive and LookAlive functions, the MSCS resource type DLL registers
an event with MSCS which it uses to signal a failure more quickly. The Oracle9i Real
Application Clusters resource type DLL uses this functionality to reduce the latency of
error detection.

Real Application Clusters Features 4-10


Terminate Function

• Attempts to stop the database


• Uses a normal shutdown, if possible
• Kills the instance by some other means, such as
killing background processes, if shutdown
commands fail

Terminate Function
The Terminate function attempts to stop the database, using a normal shutdown to avoid the
overhead of instance recovery. If it is unable to do this, it uses more drastic measures such
as killing the instance process.

Real Application Clusters Features 4-11


Summary

In this lesson, you should have learned how to:


• Explain the benefits of using MSCS and Oracle9i
Real Application Clusters together
• Describe basic MSCS functionality
• Identify the Oracle9i Real Application Clusters and
MSCS interface

Real Application Clusters Features 4-12
