
RAC Architecture

Oracle Real Application Clusters (RAC) allows multiple instances, running on separate nodes, to access a single database. In a standard Oracle configuration a database can be mounted by only one instance, but in a RAC environment many instances can access the same database concurrently.

Oracle RAC is heavily dependent on an efficient, highly reliable, high-speed private network called the interconnect; when designing a RAC system, make sure you get the best interconnect you can afford.
The table below describes the differences between a standard Oracle database (single instance) and a RAC environment.
Component: SGA
  Single instance: Instance has its own SGA
  RAC: Each instance has its own SGA

Component: Background processes
  Single instance: Instance has its own set of background processes
  RAC: Each instance has its own set of background processes

Component: Datafiles
  Single instance: Accessed by only one instance
  RAC: Shared by all instances (shared storage)

Component: Control files
  Single instance: Accessed by only one instance
  RAC: Shared by all instances (shared storage)

Component: Online redo logfiles
  Single instance: Dedicated for write/read to only one instance
  RAC: Only one instance can write, but other instances can read them during recovery and archiving. If an instance is shut down, log switches by other instances can force the idle instance's redo logs to be archived

Component: Archived redo logfiles
  Single instance: Dedicated to the instance
  RAC: Private to the instance, but other instances will need access to all required archive logs during media recovery

Component: Flash Recovery Area
  Single instance: Accessed by only one instance
  RAC: Shared by all instances (shared storage)

Component: Alert log and trace files
  Single instance: Dedicated to the instance
  RAC: Private to each instance; other instances never read or write to those files

Component: ORACLE_HOME
  Single instance: Multiple instances on the same server accessing different databases can use the same executable files
  RAC: Same as single instance, plus it can be placed on a shared file system allowing a common ORACLE_HOME for all instances in a RAC environment
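To see the multiple instances sharing one database for yourself, you can query gv$instance from any node; this is a standard view and the instance and host names in the output will of course be your own.

SQL> select inst_id, instance_name, host_name, status from gv$instance;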

RAC Components
The major components of an Oracle RAC system are:

  Shared disk system
  Oracle Clusterware
  Cluster interconnect
  Oracle kernel components

The diagram below describes the basic architecture of the Oracle RAC environment.

Here is a list of the processes running on a freshly installed RAC.

Disk architecture

With today's SAN and NAS disk storage systems, sharing storage is fairly easy, and shared storage is required for a RAC environment. You can use the following storage setups:

  SAN (Storage Area Network) - generally uses fibre channel to connect to the SAN
  NAS (Network Attached Storage) - generally uses a network to connect to the NAS, using either NFS or iSCSI
  JBOD - direct attached storage, the old traditional way, still used by many companies as a cheap option

All of the above solutions can offer multi-pathing to reduce single points of failure (SPOFs) within the RAC environment, and there is little reason not to configure it: most of the expense is incurred when configuring the first path, so adding an additional path usually only requires an extra controller card and network/fibre cables.
The last thing to think about is how to set up the underlying disk structure, known as the RAID level; there are about 12 different RAID levels that I know of, and here are the most common ones.
raid 0 (Striping)
  A number of disks are concatenated together to give the appearance of one very large disk.
  Advantages: improved performance; can create very large volumes
  Disadvantages: not highly available (if one disk fails, the volume fails)

raid 1 (Mirroring)
  A single disk is mirrored by another disk; if one disk fails the system is unaffected as it can use its mirror.
  Advantages: improved performance; highly available (if one disk fails the mirror takes over)
  Disadvantages: expensive (requires double the number of disks)

raid 5
  RAID stands for Redundant Array of Inexpensive Disks; the disks are striped with parity across 3 or more disks, and the parity is used in the event that one of the disks fails to reconstruct the data on the failed disk.
  Advantages: improved performance (reads); not expensive
  Disadvantages: slow write operations (caused by having to calculate the parity)

There are many other RAID levels that are tied to a particular hardware environment; for example, EMC storage uses RAID-S and HP storage uses Auto RAID, so check with the manufacturer for the solution that will give you the best performance and resilience.
Once you have your storage attached to the servers, you have three choices on how to set up the disks:

  Raw volumes - normally used for performance benefits, however they are hard to manage and back up
  Cluster file system - used to hold all the Oracle datafiles; can be used by Windows and Linux, but it is not widely used
  Automatic Storage Management (ASM) - Oracle's choice of storage management; a portable, dedicated and optimized cluster file system

I will only be discussing ASM, which I have already covered in a topic called Automatic Storage Management.
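As a quick illustration only (the disk group name and device paths below are assumptions, not a recommended layout), a small mirrored ASM disk group for a RAC database could be created like this:

SQL> create diskgroup DATA normal redundancy
       failgroup fg1 disk '/dev/raw/raw1'
       failgroup fg2 disk '/dev/raw/raw2';
SQL> alter system set asm_diskgroups='DATA' scope=spfile;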
Oracle Clusterware
Oracle Clusterware is the software that allows Oracle to run in cluster mode; it can support up to 64 nodes and can even be used alongside a vendor clusterware such as Sun Cluster.
The Clusterware software allows the nodes to communicate with each other and forms the cluster that makes the nodes work as a single logical server. The software is run by Cluster Ready Services (CRS), using the Oracle Cluster Registry (OCR), which records and maintains the cluster and node membership information, and the voting disk, which acts as a tiebreaker during communication failures. Consistent heartbeat information travels across the interconnect to the voting disk when the cluster is running.
CRS has four components:

  OPROCd - Process Monitor daemon
  CRSd - CRS daemon; the failure of this daemon results in the node being rebooted to avoid data corruption
  OCSSd - Oracle Cluster Synchronization Services daemon (updates the registry)
  EVMd - Event Manager daemon

The OPROCd daemon provides the I/O fencing for the Oracle cluster; it uses the hangcheck timer or a watchdog timer to protect cluster integrity. It is locked into memory and runs as a realtime process, and failure of this daemon results in the node being rebooted. Fencing is used to protect the data: if a node has problems, fencing presumes the worst, protects the data and restarts the node in question; it is better to be safe than sorry.
The CRSd process manages resources, such as starting and stopping services and failing over application resources; it also spawns separate processes to manage application resources. CRSd manages the OCR and stores the current known state of the cluster, and it requires a public, a private and a VIP interface in order to run.

OCSSd provides synchronization services among nodes; it provides access to the node membership and enables basic cluster services, including cluster group services and locking. Failure of this daemon causes the node to be rebooted to avoid split-brain situations.
The following functions are covered by OCSSd:

  CSS provides basic Group Services support; it is a distributed group membership system that allows applications to coordinate activities to achieve a common result. Group Services uses vendor clusterware group services when they are available.
  Lock Services provide the basic cluster-wide serialization locking functions; they use a First In, First Out (FIFO) mechanism to manage locking.
  Node Services uses the OCR to store data and updates the information during reconfiguration; it also manages the OCR data, which is otherwise static.

The last component is the Event Manager, which runs as the EVMd process. The daemon spawns a process called evmlogger and generates events when things happen. The evmlogger spawns new child processes on demand and scans the callout directory to invoke callouts. Death of the EVMd daemon will not halt the instance, and the daemon will be restarted automatically.
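A quick way to sanity-check these daemons on a running node is to look at the process list and ask Clusterware to check itself; these are standard commands and only the output will differ per system.

# ps -ef | egrep 'crsd.bin|ocssd|evmd|oprocd'
# $ORA_CRS_HOME/bin/crsctl check crs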

Quick recap

CRS Process: OPROCd - Process Monitor
  Functionality: provides basic cluster integrity services
  Failure of the process: node restart
  Runs as: root

CRS Process: EVMd - Event Management
  Functionality: spawns a child process event logger and generates callouts
  Failure of the process: daemon automatically restarted, no node restart
  Runs as: oracle

CRS Process: OCSSd - Cluster Synchronization Services
  Functionality: basic node membership, group services, basic locking
  Failure of the process: node restart
  Runs as: oracle

CRS Process: CRSd - Cluster Ready Services
  Functionality: resource monitoring, failover and node recovery
  Failure of the process: daemon automatically restarted, no node restart
  Runs as: root

Cluster Ready Services (CRS) is a new component in 10g RAC; it is installed in a separate home directory called ORACLE_CRS_HOME. It is a mandatory component, but it can be used alongside a third-party clusterware (Veritas, Sun Cluster); by default it manages the node membership functionality along with the regular RAC-related resources and services.

RAC uses a membership scheme: any node wanting to join the cluster has to become a member. RAC can evict any member that it sees as a problem; its primary concern is protecting the data. You can add and remove nodes from the cluster and the membership increases or decreases accordingly; when network problems occur, membership becomes the deciding factor in which part stays as the cluster and which nodes get evicted, using the voting disk, which I will talk about later.
The resource management framework manages the resources of the cluster (disks, volumes), so you can have only one resource management framework per resource. Multiple frameworks are not supported, as they can lead to undesirable effects.
Oracle Cluster Ready Services (CRS) uses a registry to keep the cluster configuration; it must reside on shared storage and be accessible to all nodes within the cluster. This registry is known as the Oracle Cluster Registry (OCR) and is a major part of the cluster; it is automatically backed up by the daemons every 4 hours, and you can also back it up manually. The OCSSd uses the OCR extensively and writes changes to the registry.
The OCR keeps details of all resources and services; it stores name-and-value pairs of information, such as the resources that are used to manage the resource equivalents by the CRS stack. Resources within the CRS stack are components that are managed by CRS and hold information on their good/bad state and their callout scripts. The OCR is also used to supply bootstrap information such as ports and nodes, and it is a binary file.
The OCR is loaded as a cache on each node; each node updates its cache, but only one node, called the master, is allowed to write the cache back to the OCR file. Enterprise Manager also uses the OCR cache. The OCR should be at least 100MB in size, and the CRS daemon will update it with the status of the nodes in the cluster during reconfigurations and failures.
The voting disk (or quorum disk) is shared by all nodes within the cluster; information about the cluster is constantly written to the disk, and this is known as the heartbeat. If for any reason a node cannot access the voting disk it is immediately evicted from the cluster; this protects the cluster from split-brain situations (the Instance Membership Recovery (IMR) algorithm is used to detect and resolve split-brains), as the voting disk decides which part is really the cluster. The voting disk manages the cluster membership and arbitrates cluster ownership during communication failures between nodes. Voting is often confused with quorum; they are similar but distinct, and the definitions below detail what each means.
Voting - a vote is usually a formal expression of opinion or will in response to a proposed decision

Quorum - the number, usually a majority of members of a body, that, when assembled, is legally competent to transact business

The only vote that counts is the quorum member vote; the quorum member vote defines the cluster. If a node or group of nodes cannot achieve a quorum, they should not start any services because they risk conflicting with an established quorum.

The voting disk has to reside on shared storage; it is a small file (around 20MB) that can be accessed by all nodes in the cluster. In Oracle 10g R1 you can have only one voting disk, but in R2 you can have up to 32 voting disks, allowing you to eliminate any SPOFs.
The original virtual IP mechanism in Oracle was Transparent Application Failover (TAF), which had limitations; it has now been replaced with cluster VIPs. The cluster VIPs will fail over to working nodes if a node should fail, and these public IPs are configured in DNS so that users can access them. The cluster VIPs are different from the cluster interconnect IP addresses and are used only to access the database.
The cluster interconnect is used to synchronize the resources of the RAC cluster and also to transfer some data from one instance to another. The interconnect should be private, highly available and fast with low latency; ideally it should be at minimum a private gigabit network. Whatever hardware you are using, the NICs should use multi-pathing (bonding on Linux, IPMP on Solaris). You can use crossover cables in a QA/DEV environment, but they are not supported in a production environment; crossover cables also limit you to a two-node cluster.
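As an illustration of NIC multi-pathing on Linux, a minimal bonding setup for the private interconnect might look like the sketch below; the RHEL-style file locations, device names, IP address and bonding mode are assumptions to adapt to your own environment.

## /etc/sysconfig/network-scripts/ifcfg-bond0 (the interconnect interface)
DEVICE=bond0
IPADDR=192.168.2.1
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=active-backup miimon=100"

## /etc/sysconfig/network-scripts/ifcfg-eth1 (one of the slave NICs, repeat for eth2)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes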
Oracle Kernel Components
The kernel components relate to the background processes, the buffer cache and the shared pool; managing these resources without conflicts and corruptions requires special handling.
In RAC, because more than one instance accesses the same resources, the instances require better coordination at the resource management level. Each node has its own set of buffers but is able to request and receive data blocks currently held in another instance's cache. The management of this data sharing and exchange is done by the Global Cache Services (GCS).
All the resources in the cluster form a central repository called the Global Resource Directory (GRD), which is distributed. Each instance masters some set of resources, and together all the instances form the GRD. The resources are equally distributed among the nodes based on their weight. The GRD is managed by two services, the Global Cache Services (GCS) and the Global Enqueue Services (GES); together they form and manage the GRD. When a node leaves the cluster, the GRD portion of that instance needs to be redistributed to the surviving nodes, and a similar action is performed when a new node joins.
RAC Background Processes
Each node has its own background processes and memory structures; there are additional processes beyond the norm to manage the shared resources, and these additional processes maintain cache coherency across the nodes.
Cache coherency is the technique of keeping multiple copies of a buffer consistent between
different Oracle instances on different nodes. Global cache management ensures that access
to a master copy of a data block in one buffer cache is coordinated with the copy of the block
in another buffer cache.
The sequence of an operation would go as below:

1. When instance A needs a block of data to modify, it reads the block from disk; before reading it must inform the GCS (DLM). GCS keeps track of the lock status of the data block by keeping an exclusive lock on it on behalf of instance A.
2. Now instance B wants to modify that same data block; it too must inform GCS. GCS will then request that instance A release the lock; in this way GCS ensures that instance B gets the latest version of the data block (including instance A's modifications) and then locks it exclusively on instance B's behalf.
3. At any one point in time, only one instance has the current copy of the block, thus keeping the integrity of the block.
GCS maintains data coherency and coordination by keeping track of the lock status of each block that can be read or written by any node in the RAC. GCS is an in-memory database that contains information about current locks on blocks and about instances waiting to acquire locks; this is known as Parallel Cache Management (PCM). The Global Resource Manager (GRM) helps to coordinate and communicate the lock requests from Oracle processes between instances in the RAC. Each instance has a buffer cache in its SGA, and to ensure that each RAC instance obtains the block it needs to satisfy a query or transaction, RAC uses two services, GCS and GES, which maintain records of the lock status of each datafile and each cached block using the GRD.
So what is a resource? It is an identifiable entity; it basically has a name or a reference, and it can be an area in memory, a disk file or an abstract entity. A resource can be owned or locked in various states (exclusive or shared). Any shared resource is lockable, and if it is not shared no access conflict will occur.
A global resource is a resource that is visible to all the nodes within the cluster. Data buffer cache blocks are the most obvious and most heavily used global resource; transaction enqueues and database data structures are other examples. GCS handles the data buffer cache blocks and GES handles all the non-data-block resources.
All caches in the SGA are either global or local: the dictionary and buffer caches are global, while the large and Java pools are local. Cache Fusion is used to read the data buffer cache of another instance instead of getting the block from disk; it moves current copies of data blocks between instances (hence why you need a fast private network), and GCS manages the block transfers between the instances.
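To get a rough feel for how much Cache Fusion traffic is flowing, you can query the global cache statistics from all instances; the statistic names below are the 10g ones (9i prefixes them with "global cache") and the exact set varies by release.

SQL> select inst_id, name, value
       from gv$sysstat
       where name in ('gc cr blocks received', 'gc current blocks received',
                      'gc cr blocks served', 'gc current blocks served')
       order by inst_id, name;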
Finally we get to the processes.
Oracle RAC Daemons and Processes

LMSn - Lock Manager Server process (GCS)
  This is the Cache Fusion part and the most active process; it handles the consistent copies of blocks that are transferred between instances. It receives requests from LMD to perform lock requests and rolls back any uncommitted transactions. There can be up to ten LMS processes running, and they can be started dynamically if demand requires it.
  As a performance gain you can increase this process's priority to make sure CPU starvation does not occur.
  You can see the statistics of this daemon by looking at the view X$KJMSDP.

LMON - Lock Monitor Process (GES)
  This process manages the GES; it maintains consistency of the GCS memory structures in case of process death. It is also responsible for cluster reconfiguration and lock reconfiguration (node joining or leaving), and it checks for instance deaths and listens for local messaging.
  A detailed log file is created that tracks any reconfigurations that have happened.

LMD - Lock Manager Daemon (GES)
  This manages the enqueue manager service requests for the GCS and sends them to a service queue to be handled by the LMSn processes. It also handles global deadlock detection and remote resource requests from other instances.
  You can see the statistics of this daemon by looking at the view X$KJMDDP.

LCK0 - Lock Process (GES)
  Manages instance resource requests and cross-instance call operations for shared resources. It builds a list of invalid lock elements and validates lock elements during recovery.

DIAG - Diagnostic Daemon
  This is a lightweight process; it uses the DIAG framework to monitor the health of the cluster. It captures information for later diagnosis in the event of failures, and it will perform any necessary recovery if an operational hang is detected.

RAC Administration
I am only going to talk about RAC administration; if you need general Oracle administration then see my Oracle section.
It is recommended that the spfile (binary parameter file) is shared between all nodes within the cluster, but it is possible for each instance to have its own spfile. The parameters can be grouped into three categories:

Unique parameters
  These parameters are unique to each instance; examples would be instance_name and undo_tablespace.

Identical parameters
  Parameters in this category must be the same for each instance; an example would be control_files.

Neither unique nor identical parameters
  Parameters that are not in either of the above categories; examples would be db_cache_size, local_listener and gcs_server_processes.

The main unique parameters that you should know about are:

  instance_name - defines the name of the Oracle instance (the default is the value of the ORACLE_SID variable)
  instance_number - a unique number for each instance; it must be greater than 0 but smaller than the MAXINSTANCES value
  thread - specifies the set of redolog files to be used by the instance
  undo_tablespace - specifies the name of the undo tablespace to be used by the instance
  rollback_segments - only needed if you are not using Automatic Undo Management, which you should be using
  cluster_interconnects - use only if Oracle has trouble picking the correct interconnect
The identical parameters that you should know about are below; you can use the following query to view all of them:

select name, isinstance_modifiable from v$parameter where isinstance_modifiable = 'FALSE' order by name;

  cluster_database - options are true or false; it mounts the control file in either shared (cluster) or exclusive mode. Use false in the following cases:
    o converting from noarchivelog mode to archivelog mode and vice versa
    o enabling the flashback database feature
    o performing media recovery on a system tablespace
    o maintenance of a node
  active_instance_count - used for primary/secondary RAC environments
  cluster_database_instances - specifies the number of instances that will be accessing the database (set to the maximum number of nodes)
  dml_locks - specifies the number of DML locks for a particular instance (only change it if you get ORA-00055 errors)
  gc_files_to_locks - specifies the number of global locks to a datafile; changing this disables Cache Fusion
  max_commit_propagation_delay - influences the mechanism Oracle uses to synchronize the SCN among all instances
  instance_groups - specifies multiple parallel query execution groups and assigns the current instance to those groups
  parallel_instance_group - specifies the group of instances to be used for parallel query execution
  gcs_server_processes - specifies the number of lock manager server (LMS) background processes used by the instance for Cache Fusion
  remote_listener - registers the instance with listeners on remote nodes

syntax for parameter file
  <instance_name>.<parameter_name>=<parameter_value>

example
  inst1.db_cache_size = 1000000
  *.undo_management=auto
  alter system set db_2k_cache_size=10m scope=spfile sid='inst1';
  Note: use the sid option to specify a particular instance
Starting and Stopping Instances
The srvctl command is used to start and stop instances; you can also use SQL*Plus to start and stop an instance, as shown after the table below.

start all instances
  srvctl start database -d <database> -o <option>
  Note: starts the listeners if they are not already running; you can use the -o option to specify startup options: force, open, mount, nomount

stop all instances
  srvctl stop database -d <database> -o <option>
  Note: the listeners are not stopped; you can use the -o option to specify shutdown options: immediate, abort, normal, transactional

start/stop particular instances
  srvctl [start|stop] database -d <database> -i <instance>,<instance>
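For completeness, a single instance can also be started manually with SQL*Plus on its own node; the SID below (RACDB1) is just an illustrative name.

$ export ORACLE_SID=RACDB1
$ sqlplus / as sysdba
SQL> startup
SQL> shutdown immediate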
Undo Management
To recap on undo management you can see my undo section. Instances in a RAC do not share undo; they each have a dedicated undo tablespace, and using the undo_tablespace parameter each instance can point to its own undo tablespace:

undo tablespace
  instance1.undo_tablespace=undo_tbs1
  instance2.undo_tablespace=undo_tbs2

With today's Oracle you should be using Automatic Undo Management (AUM); again, I have a detailed discussion on AUM in my undo section.
Temporary Tablespace

I have already discussed temporary tablespaces; in a RAC environment you should set up a temporary tablespace group, which is then used by all instances of the RAC. Each instance creates a temporary segment in the temporary tablespace it is using. If an instance is running a large sort, temporary segments can be reclaimed from the segments of other instances in that tablespace.

useful views
  gv$sort_segment - explore current and maximum sort segment usage statistics (check the columns freed_extents and free_requests; if they grow, increase the tablespace size)
  gv$tempseg_usage - explore temporary segment usage details such as name, SQL, etc.
  v$tempfile - identify the temporary datafiles being used for the temporary tablespace
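As a minimal sketch of setting up a temporary tablespace group (the tablespace names, group name, sizes and the +DATA disk group are assumptions):

SQL> create temporary tablespace temp1 tempfile '+DATA' size 512m tablespace group temp_grp;
SQL> create temporary tablespace temp2 tempfile '+DATA' size 512m tablespace group temp_grp;
SQL> alter database default temporary tablespace temp_grp;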
Redologs
I have already discussed redologs; in a RAC environment every instance has its own set of redologs. Each instance has exclusive write access to its own redologs, but each instance can read the other instances' redologs, which is used for recovery. Redologs are located on the shared storage so that all instances can access each other's redologs. The process for changing the archive mode is a little different from standard Oracle:

archive mode (RAC)
  SQL> alter system set cluster_database=false scope=spfile sid='prod1';
  srvctl stop database -d <database>
  SQL> startup mount
  SQL> alter database archivelog;
  SQL> alter system set cluster_database=true scope=spfile sid='prod1';
  SQL> shutdown;
  srvctl start database -d prod

Flashback
Again, I have already talked about flashback; there is no difference in a RAC environment apart from the setup:

flashback (RAC)
  ## Make sure that the database is running in archive log mode
  SQL> archive log list
  ## Set up flashback
  SQL> alter system set cluster_database=false scope=spfile sid='prod1';
  SQL> alter system set DB_RECOVERY_FILE_DEST_SIZE=200M scope=spfile;
  SQL> alter system set DB_RECOVERY_FILE_DEST='/ocfs2/flashback' scope=spfile;
  srvctl stop database -d prod
  SQL> startup mount
  SQL> alter database flashback on;
  SQL> alter system set cluster_database=true scope=spfile sid='prod1';
  SQL> shutdown;
  srvctl start database -d prod


SRVCTL command
We have already come across srvctl above; this command is called the Server Control utility. Its tasks can be divided into two categories:

  Database configuration tasks
  Database instance control tasks

Oracle stores the database configuration in a repository; the configuration is kept in the Oracle Cluster Registry (OCR) that was created when RAC was installed, and it is located on the shared storage. srvctl uses CRS to communicate with and perform startup and shutdown commands on other nodes.
I suggest that you look up the command in full, but I will provide a few examples below.
display the registered databases
  srvctl config database

status
  srvctl status database -d <database>
  srvctl status instance -d <database> -i <instance>
  srvctl status nodeapps -n <node>
  srvctl status service -d <database>
  srvctl status asm -n <node>

stopping/starting
  srvctl stop database -d <database>
  srvctl stop instance -d <database> -i <instance>,<instance>
  srvctl stop service -d <database> [-s <service>,<service>] [-i <instance>,<instance>]
  srvctl stop nodeapps -n <node>
  srvctl stop asm -n <node>
  srvctl start database -d <database>
  srvctl start instance -d <database> -i <instance>,<instance>
  srvctl start service -d <database> -s <service>,<service> -i <instance>,<instance>
  srvctl start nodeapps -n <node>
  srvctl start asm -n <node>

adding/removing
  srvctl add database -d <database> -o <oracle_home>
  srvctl add instance -d <database> -i <instance> -n <node>
  srvctl add service -d <database> -s <service> -r <preferred_list>
  srvctl add nodeapps -n <node> -o <oracle_home> -A <name|ip>/netmask
  srvctl add asm -n <node> -i <asm_instance> -o <oracle_home>
  srvctl remove database -d <database>
  srvctl remove instance -d <database> -i <instance>
  srvctl remove service -d <database> -s <service>
  srvctl remove nodeapps -n <node>
  srvctl remove asm -n <node>
Services
Services are used to manage the workload in Oracle RAC; the important features of services are:

  used to distribute the workload
  can be configured to provide high availability
  provide a transparent way to direct workload

The view v$services contains information about the services that have been started on an instance; here is a list from a fresh RAC installation.

The columns of that view are described below:

  Goal - allows you to define a service goal using service time, throughput or none
  Connect Time Load Balancing Goal - listeners and mid-tier servers contain current information about service performance
  Distributed Transaction Processing - used for distributed transactions
  AQ_HA_Notifications - information about nodes being up or down will be sent to mid-tier servers via the advanced queuing mechanism
  Preferred and Available Instances - the preferred instances for a service; the available ones are the backup instances

You can administer services using the following tools:

  DBCA
  EM (Enterprise Manager)
  DBMS_SERVICE
  Server Control (srvctl)

Two services are created when the database is first installed; these services are running all the time and cannot be disabled:

  sys$background - used by an instance's background processes only
  sys$users - used when users connect to the database without specifying a service
add
  srvctl add service -d D01 -s BATCH_SERVICE -r node1,node2 -a node3
  Note: the options are described below
    -d - the database
    -s - the service
    -r - the service will run on these (preferred) nodes
    -a - if the nodes in the -r list are not running then run on this (available) node

remove
  srvctl remove service -d D01 -s BATCH_SERVICE

start
  srvctl start service -d D01 -s BATCH_SERVICE

stop
  srvctl stop service -d D01 -s BATCH_SERVICE

status
  srvctl status service -d D01 -s BATCH_SERVICE


service (example)
  ## create the job class
  BEGIN
    DBMS_SCHEDULER.create_job_class(
      job_class_name => 'BATCH_JOB_CLASS',
      service        => 'BATCH_SERVICE');
  END;
  /

  ## grant the privilege to execute jobs in the job class
  grant execute on sys.BATCH_JOB_CLASS to vallep;

  ## create a job associated with the job class (the job_action below is just a placeholder PL/SQL block)
  BEGIN
    DBMS_SCHEDULER.create_job(
      job_name        => 'my_user.batch_job_test',
      job_type        => 'PLSQL_BLOCK',
      job_action      => 'BEGIN NULL; END;',
      start_date      => SYSTIMESTAMP,
      repeat_interval => 'FREQ=DAILY;',
      job_class       => 'SYS.BATCH_JOB_CLASS',
      end_date        => NULL,
      enabled         => TRUE,
      comments        => 'Test batch job to show RAC services');
  END;
  /

  ## assign a job class to an existing job
  exec dbms_scheduler.set_attribute('MY_BATCH_JOB', 'JOB_CLASS', 'BATCH_JOB_CLASS');
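To actually direct client connections at a service such as BATCH_SERVICE, you would normally reference the service name in the connect descriptor; the sketch below is a hypothetical tnsnames.ora entry (the VIP host names and port are assumptions).

BATCH_SERVICE =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = BATCH_SERVICE)
    )
  )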

Cluster Ready Services (CRS)

CRS is Oracle's clusterware software; you can use it with other third-party clusterware software, though that is not required (apart from on HP Tru64).

CRS starts automatically when the server starts; you should only stop this service in the following situations:

  Applying a patch set to $ORA_CRS_HOME
  O/S maintenance
  Debugging CRS problems


CRS Administration
Starting
  ## Starting CRS using Oracle 10g R1
  not possible
  ## Starting CRS using Oracle 10g R2
  $ORA_CRS_HOME/bin/crsctl start crs

Stopping
  ## Stopping CRS using Oracle 10g R1
  srvctl stop database -d <database>
  srvctl stop asm -n <node>
  srvctl stop nodeapps -n <node>
  /etc/init.d/init.crs stop
  ## Stopping CRS using Oracle 10g R2
  $ORA_CRS_HOME/bin/crsctl stop crs

disabling/enabling
  ## stop CRS restarting after a reboot, i.e. make the change permanent over reboots
  ## Oracle 10g R1
  /etc/init.d/init.crs [disable|enable]
  ## Oracle 10g R2
  $ORA_CRS_HOME/bin/crsctl [disable|enable] crs

Checking
  $ORA_CRS_HOME/bin/crsctl check crs
  $ORA_CRS_HOME/bin/crsctl check evmd
  $ORA_CRS_HOME/bin/crsctl check cssd
  $ORA_CRS_HOME/bin/crsctl check crsd
  $ORA_CRS_HOME/bin/crsctl check install -wait 600
Resource Applications (CRS Utilities)

Status
  $ORA_CRS_HOME/bin/crs_stat
  $ORA_CRS_HOME/bin/crs_stat -t
  $ORA_CRS_HOME/bin/crs_stat -ls
  $ORA_CRS_HOME/bin/crs_stat -p
  Note: -t gives a more readable display, -ls a permission listing, -p the parameters

create profile
  $ORA_CRS_HOME/bin/crs_profile

register/unregister application
  $ORA_CRS_HOME/bin/crs_register
  $ORA_CRS_HOME/bin/crs_unregister

start/stop an application
  $ORA_CRS_HOME/bin/crs_start
  $ORA_CRS_HOME/bin/crs_stop

resource permissions
  $ORA_CRS_HOME/bin/crs_getperm
  $ORA_CRS_HOME/bin/crs_setperm

relocate a resource
  $ORA_CRS_HOME/bin/crs_relocate
Nodes
member number/name
  olsnodes -n
  Note: the olsnodes command is located in $ORA_CRS_HOME/bin

local node name
  olsnodes -l

activate logging
  olsnodes -g
Oracle Interfaces

display
  oifcfg getif

delete
  oifcfg delif -global

set
  oifcfg setif -global <interface name>/<subnet>:public
  oifcfg setif -global <interface name>/<subnet>:cluster_interconnect
Global Services Daemon Control

starting
  gsdctl start

stopping
  gsdctl stop

status
  gsdctl stat
Cluster Configuration (clscfg is used during installation)
create a new configuration
  clscfg -install
  Note: the clscfg command is located in $ORA_CRS_HOME/bin

upgrade or downgrade an existing configuration
  clscfg -upgrade
  clscfg -downgrade

add or delete a node from the configuration
  clscfg -add
  clscfg -delete

create a special single-node configuration for ASM
  clscfg -local

brief listing of the terminology used by the other options
  clscfg -concepts

used for tracing
  clscfg -trace

help
  clscfg -h
Cluster Name Check

print cluster name
  cemutlo -n
  Note: in Oracle 9i the utility was called "cemutls"; the command is located in $ORA_CRS_HOME/bin

print the clusterware version
  cemutlo -w
  Note: in Oracle 9i the utility was called "cemutls"
Node Scripts
add node
  addnode.sh
  Note: see adding and deleting nodes

delete node
  deletenode.sh
  Note: see adding and deleting nodes
Oracle Cluster Registry (OCR)
As you already know, the OCR is the registry that contains information on the following:

  Node list
  Node membership mapping
  Database instance, node and other mapping information
  Characteristics of any third-party applications controlled by CRS

The file location is specified during installation; the file pointer indicating the OCR device location is ocr.loc, which can be found in one of the following locations:

  Linux - /etc/oracle
  Solaris - /var/opt/oracle

The file contents look something like the below, which was taken from my installation:

ocr.loc
  ocrconfig_loc=/u02/oradata/racdb/OCRFile
  ocrmirrorconfig_loc=/u02/oradata/racdb/OCRFile_mirror
  local_only=FALSE

The OCR is important to the RAC environment and any problems must be actioned immediately; the commands below can be found in $ORA_CRS_HOME/bin.
OCR Utilities
log file
  $ORA_HOME/log/<hostname>/client/ocrconfig_<pid>.log

checking
  ocrcheck
  Note: returns the OCR version, total space allocated, space used, free space, the location of each device and the result of the integrity check

dump contents
  ocrdump
  Note: by default it dumps the contents into a file named OCRDUMPFILE in the current directory

export/import
  ocrconfig -export <file>
  ocrconfig -import <file>

backup/restore
  # show backups
  ocrconfig -showbackup
  # change the location of the backups (you can even specify an ASM disk group)
  ocrconfig -backuploc <path|+asm>
  # perform a manual backup (uses the location specified by -backuploc)
  ocrconfig -manualbackup
  # perform a restore
  ocrconfig -restore <file>
  # delete a backup
  ocrconfig -delete <file>
  Note: there are many more options, so see the ocrconfig man page

add/remove/replace
  ## add/relocate the ocrmirror file to the specified location
  ocrconfig -replace ocrmirror '/ocfs2/ocr2.dbf'
  ## relocate an existing OCR file
  ocrconfig -replace ocr '/ocfs1/ocr_new.dbf'
  ## remove the OCR or OCR mirror file
  ocrconfig -replace ocr
  ocrconfig -replace ocrmirror
Voting Disk
The voting disk, as I mentioned in the architecture section, is used to resolve membership issues in the event of a partitioned cluster; the voting disk protects data integrity.

querying
  crsctl query css votedisk

adding
  crsctl add css votedisk <file>

deleting
  crsctl delete css votedisk <file>

RAC Backups and Recovery


Backup and recovery are very similar to a single-instance database. This article covers only the specific issues that surround RAC backups and recovery; I have already written an article on standard Oracle backup and recovery.

Backups can differ depending on the size of the company:

  small company - may use tools such as tar, cpio or rsync
  medium/large company - Veritas NetBackup, RMAN
  enterprise company - SAN mirroring with a backup option such as NetBackup or RMAN

Oracle RAC can use all of the above backup technologies, but Oracle prefers you to use RMAN, Oracle's own backup solution.
Backup Basics
Oracle backups can be taken hot or cold; a backup will comprise the following:

  Datafiles
  Control files
  Archived redolog files
  Parameter files (init.ora or SPFILE)

Databases have now grown to very large sizes, well over a terabyte in some cases, so tape backups are not always used and sophisticated disk mirroring has taken their place. RMAN can be used with either a tape or a disk solution, and it can even work with third-party solutions such as Veritas NetBackup.
In an Oracle RAC environment it is critical to make sure that all archived redolog files are located on shared storage; this is required when trying to recover the database, as you need access to all of them. RMAN can use parallelism when recovering; the node that performs the recovery must have access to all archived redologs, however during recovery only one node applies the archived logs, as in a standard single-instance configuration.
Oracle RAC also supports Oracle Data Guard, so you can have a primary database configured as a RAC and a standby database also configured as a RAC.
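As a simple illustrative sketch (the parallelism setting is an assumption, not a recommendation), a disk-based RMAN backup of a RAC database that also picks up the archived logs from shared storage could look like this:

$ rman target /
RMAN> configure device type disk parallelism 2;
RMAN> backup database plus archivelog delete input;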
Instance Recovery
In a RAC environment there are two types of recovery:

  Crash recovery - all instances have failed, so they all need to be recovered
  Instance recovery - one or more instances have failed; these instances can then be recovered by the surviving instances

Redo information generated by an instance is called a thread of redo. All log files for that instance belong to this thread; an online redolog file belongs to a group and the group belongs to a thread. Details about the log group files and their thread associations are stored in the control file. RAC databases have multiple threads of redo; each instance has one active thread, and the threads are parallel timelines that together form a stream. A stream consists of all the threads of redo information ever recorded, and the stream forms the timeline of changes performed to the database.
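To see the redo threads and which log groups belong to each, you can query the standard views; the output will depend on your own configuration.

SQL> select thread#, status, enabled, groups, instance from v$thread;
SQL> select thread#, group#, members, bytes from v$log order by thread#, group#;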
Oracle records the changes made to the database as change vectors. Each vector is a description of a single change, usually to a single block. A redo record contains one or more change vectors and is located by its Redo Byte Address (RBA), which points to a specific location in the redolog file (or thread). The RBA consists of three components:

  log sequence number
  block number within the log
  byte number within the block

Checkpoints are the same in a RAC environment as in a single-instance environment (I have already discussed checkpoints): when a checkpoint needs to be triggered, Oracle will look for the thread checkpoint that has the lowest checkpoint SCN, and all blocks in memory that contain changes made prior to this SCN, across all instances, must be written out to disk. I have discussed how to control recovery in my Oracle section and this applies to RAC as well.
Crash Recovery
Crash recovery is basically the same for a single instance and a RAC environment; I have a complete recovery section in my Oracle section, and here is a note detailing the differences.
For a single instance, the recovery process is as follows:
1. The on-disk block is the starting point for the recovery; Oracle will only consider the block on disk, so the recovery is simple. Crash recovery will happen automatically using the online redo logs that are current or active.
2. The starting point is the last full checkpoint. It is provided by the control file and compared against the same information in the datafile headers; only the changes need to be applied.
3. The block specified in the redolog is read into the cache; if the block has the same timestamp as the redo record (SCN match), the redo is applied.
For a RAC instance, the recovery process is as follows:
1. A foreground process in a surviving instance detects an "invalid block lock" condition when an attempt is made to read a block into the buffer cache. This is an indication that an instance has failed (died).
2. The foreground process sends a notification to its instance's system monitor (SMON), which begins to search for dead instances. SMON maintains a list of all the dead instances and invalid block locks. Once the recovery and cleanup have finished, this list is updated.
3. The death of another instance is detected if the current instance is able to acquire that instance's redo thread locks, which are usually held by an open and active instance.
Oracle RAC uses a two-pass recovery, because a data block could have been modified in any of the instances (dead or alive); it needs to obtain the latest version of each dirty block, and it uses the Past Image (PI) and the Block Written Record (BWR) to achieve this in a quick and timely fashion.

Block Written Record (BWR)
  The cache aging and incremental checkpoint system writes a number of blocks to disk; with each data block write operation the instance also adds a redo record that states the block has been written. DBWn can write block written records (BWRs) in batches, in a lazy fashion. In RAC, an instance writes a BWR when it writes a block covered by a global resource or when it is told that its past image (PI) is no longer necessary.

Past Image (PI)
  This is what makes RAC cache fusion work: it eliminates the write/write contention problem of sharing a database. A PI is a copy of a globally dirty block and is maintained in the database buffer cache; it is saved when a dirty block is shipped across to another instance after setting the resource role to global. GCS is responsible for informing an instance that its PI is no longer needed after another instance writes a newer (current) version of the same block. PIs are discarded when GCS posts all the holding instances that a new and consistent version of that particular block is now on disk.
  I go into more detail about PIs in my cache fusion section.
The first pass does not perform the actual recovery; it reads and merges the redo threads to create a hash table of the blocks that need recovery and that are not known to have been written back to the datafiles. The checkpoint SCN is needed as the starting point for the recovery, and all modified blocks are added to the recovery set (an organized hash table). A block will not be recovered if its BWR version is greater than the latest PI in any of the buffer caches.
In the second pass, SMON rereads the merged redo stream (by SCN) from all threads needing recovery; the redolog entries are then compared against the recovery set built in the first pass, and any matches are applied to the in-memory buffers as in a single-pass recovery. The buffer cache is flushed and the checkpoint SCN for each thread is updated upon successful completion.
Cache Fusion Recovery
I have a detailed section on cache fusion; this section covers recovery. Cache fusion recovery is used only in RAC environments, as additional steps are required, such as GRD reconfiguration, internode communication, etc. There are two types of recovery:

  Crash recovery - all instances have failed
  Instance recovery - one instance has failed

In both cases the threads from the failed instances need to be merged; in an instance recovery SMON performs the recovery, whereas in a crash recovery a foreground process performs the recovery.
The main features (advantages) of cache fusion recovery are:

  Recovery cost is proportional to the number of failures, not the total number of nodes
  It eliminates disk reads of blocks that are present in a surviving instance's cache
  It prunes the recovery set based on the global resource lock state
  The cluster is available after an initial log scan, even before recovery reads are complete

In cache fusion recovery the starting point for the recovery of a block is its most current PI version; this could be located on any of the surviving instances, and multiple PI blocks for a particular buffer can exist.
Remastering is the term that describes the operation whereby a node attempting recovery tries to own or master the resources that were mastered by another instance prior to the failure. When one instance leaves the cluster, the GRD portion of that instance needs to be redistributed to the surviving nodes. RAC uses an algorithm called lazy remastering to remaster only a minimal number of resources during a reconfiguration. The entire Parallel Cache Management (PCM) lock space remains invalid while the DLM and SMON complete the following steps:
1. The IDLM master node discards locks that are held by dead instances; the space reclaimed by this operation is used to remaster locks that are held by the surviving instances and were previously mastered by a dead instance.
2. SMON issues a message saying that it has acquired the necessary buffer locks to perform recovery.
Let's look at an example of what happens during remastering; let's presume the following:

  Instance A masters resources 1, 3, 5 and 7
  Instance B masters resources 2, 4, 6 and 8
  Instance C masters resources 9, 10, 11 and 12

If instance B is removed from the cluster, only the resources from instance B are evenly remastered across the surviving nodes (no resources on instances A and C are affected); this reduces the amount of work the RAC has to perform. Likewise, when an instance joins a cluster, only a minimal number of resources are remastered to the new instance.
[Diagrams: before remastering / after remastering]

You can control the remastering process with a number of parameters:

_gcs_fast_config
  enables fast reconfiguration for GCS locks (true|false)

_lm_master_weight
  controls which instance will hold or (re)master more resources than others

_gcs_resources
  controls the number of resources an instance will master at a time

You can also force a dynamic remastering (DRM) of an object using oradebug:

force dynamic remastering (DRM)
  ## Obtain the OBJECT_ID from the below table
  SQL> select * from v$gcspfmaster_info;
  ## Determine who masters it
  SQL> oradebug setmypid
  SQL> oradebug lkdebug -a <OBJECT_ID>
  ## Now remaster the resource
  SQL> oradebug setmypid
  SQL> oradebug lkdebug -m pkey <OBJECT_ID>

The steps of a GRD reconfiguration are as follows:

  Instance death is detected by the cluster manager
  Requests for PCM locks are frozen
  Enqueues are reconfigured and made available
  DLM recovery
  GCS (PCM locks) are remastered
  Pending writes and notifications are processed
  First-pass recovery
    o The instance recovery (IR) lock is acquired by SMON
    o The recovery set is prepared and built; memory space is allocated in the SMON PGA
    o SMON acquires locks on the buffers that need recovery
  Second-pass recovery
    o Second-pass recovery is initiated; the database is partially available
    o Blocks are made available as they are recovered
    o The IR lock is released by SMON; recovery is then complete
    o The system is available

Graphically it looks like below

RAC Performance
I have already discussed basic Oracle tuning; in this section I will mainly discuss Oracle RAC tuning. First let's review the best practices of Oracle design regarding the application and database:

  Optimize connection management: ensure that the middle tier and the programs that connect to the database are efficient in connection management and do not log on or off repeatedly
  Tune the SQL using the available tools, such as ADDM and the SQL Tuning Advisor
  Ensure that applications use bind variables; cursor_sharing was introduced to help with this problem
  Use packages and procedures (because they are compiled) in place of anonymous PL/SQL blocks and big SQL statements
  Use locally managed tablespaces and automatic segment space management to help performance and simplify database administration
  Use automatic undo management and temporary tablespaces to simplify administration and increase performance
  Ensure you use a large cache value when using sequences, unless you cannot afford to lose sequence numbers during a crash (see the example after this list)
  Avoid using DDL in production; it increases invalidations of already parsed SQL statements, which then need to be recompiled
  Partition tables and indexes to reduce index leaf contention (buffer busy global cr problems)
  Optimize contention on data blocks (hot spots) by avoiding small tables with too many rows in a block
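A minimal sketch of the sequence recommendation above (the sequence name and cache size are illustrative; NOORDER is the usual RAC-friendly choice because strict ordering forces extra cross-instance coordination):

SQL> create sequence order_seq cache 1000 noorder;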

Now we can review the RAC-specific best practices:

  Consider using application partitioning (see below)
  Consider restricting DML-intensive users to one instance, thus reducing cache contention
  Keep read-only tablespaces away from DML-intensive tablespaces; they only require minimal resources, thus optimizing Cache Fusion performance
  Avoid auditing in RAC; it causes more shared library cache locks
  Use full table scans sparingly; they cause the GCS to service lots of block requests (see the v$sysstat statistic "table scans (long tables)")
  If the application performs lots of logins, increase the cache value of the sys.audses$ sequence

Partitioning Workload
Workload partitioning means that a certain type of workload is executed on only one instance; that is, partitioning allows users who access the same set of data to log on to the same instance. This limits the amount of data that is shared between instances, saving the resources used for messaging and Cache Fusion data block transfers.
You should consider the following when deciding whether to implement partitioning:

  If the CPUs and the private interconnect are of high performance then there is no need to partition
  Partitioning adds complexity, so if you can instead increase CPU and interconnect performance, so much the better
  Only partition if performance is being impacted
  Test both partitioning and non-partitioning to see what difference it makes, then decide if partitioning is worth it

RAC Wait Events

An event is an operation or particular function that the Oracle kernel performs on behalf of a user or an Oracle background process; events have specific names, like database events. Whenever a session has to wait for something, the wait time is tracked and charged to the event that was associated with that wait. Events that are associated with all such waits are known as wait events. There are a number of wait classes:

  Commit, Scheduler, Application, Configuration, User I/O, System I/O, Concurrency, Network, Administrative, Cluster, Idle, Other

There are over 800 different events spread across the above classes; however, you will probably only deal with about 50 or so that can improve performance.
When a session requests access to a data block, it sends a request to the lock master for proper authorization; the request does not know whether it will receive the block via Cache Fusion or permission to read it from disk. Two placeholder events:

  global cache cr request (consistent read - cr)
  global cache curr request (current - curr)

keep track of the time a session spends in this state. There are a number of types of wait events regarding access to a data block:
gc current block 2-way (write/write contention)
  An instance requests authorization for a block to be accessed in current mode, and the instance mastering the resource receives the request. The master has the current version of the block, sends it to the requestor via Cache Fusion and keeps a Past Image (PI).
  If you get this then do the following:
    Analyze the contention; check the segments in the "current blocks received" section of the AWR report
    Use an application partitioning scheme
    Make sure the system has enough CPU power
    Make sure the interconnect is as fast as possible
    Ensure that socket send and receive buffers are configured correctly

gc current block 3-way (write/write contention)
  An instance requests authorization for a block to be accessed in current mode; the instance mastering the resource receives the request and forwards it to the current holder of the block. The holding instance sends a copy of the current version of the block to the requestor via Cache Fusion and transfers the exclusive lock to the requesting instance. It also keeps a Past Image (PI).
  Use the actions listed above to increase performance.

gc current block 2-way / gc current block 3-way (write/read contention)
  The difference from the events above is that a copy of the block is sent, so the holding instance keeps the current block.

gc current block busy (write/write contention)
  The requestor will eventually get the block via Cache Fusion, but it is delayed, for example because the block was being used by another session on another instance, or the send was delayed as the holding instance could not immediately write the corresponding redo record.
  If you get this then ensure the log writer is tuned.

gc current buffer busy (local contention)
  This is the same as above (gc current block busy); the difference is that another session on the same instance has also requested the block (hence local contention).

gc current block congested (no contention)
  This is caused by heavy congestion on the GCS, meaning CPU resources are stretched.
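To see which of these cluster waits dominate on your system, you can query the system-level wait statistics; the wait_class column exists from 10g R2 (earlier releases need the event names listed explicitly).

SQL> select event, total_waits, time_waited
       from v$system_event
       where wait_class = 'Cluster'
       order by time_waited desc;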

Enqueue Tuning
Oracle RAC uses a queuing mechanism to ensure proper use of shared resources; it is called Global Enqueue Services (GES). Enqueue wait is the time spent by a session waiting for a shared resource. Here are some examples of enqueues:

  updating the control file (CF enqueue)
  updating an individual row (TX enqueue)
  exclusive lock on a table (TM enqueue)

Some enqueues are managed by the instance itself, while others are used globally; GES is responsible for coordinating the global resources. The formula used to calculate the number of enqueue resources is below:

GES Resources = DB_FILES + DML_LOCKS + ENQUEUE_RESOURCES + PROCESSES + TRANSACTIONS x (1 + (N - 1)/N)

where N = number of RAC instances

displaying enqueue stats
  SQL> column current_utilization heading current
  SQL> column max_utilization heading max_usage
  SQL> column initial_allocation heading initial
  SQL> column resource_name format a23
  SQL> select * from v$resource_limit;

AWR and RAC

I have already discussed AWR in a single-instance environment, so for a quick refresher take a look at that section and then come back here to see how you can use it in a RAC environment.

From a RAC point of view there are a number of RAC-specific sections that you need to look at in the AWR report; the sections below come from an AWR report of my home RAC environment (you can view the whole report here).
RAC AWR Section: Number of Instances
  Report section: instances
  Description: lists the number of instances at the beginning and end of the AWR report period.

RAC AWR Section: Instance global cache load profile
  Report section: global cache load profile
  Description: information about the inter-instance Cache Fusion data block and messaging traffic. My example is lightweight; here is a more heavily used RAC example:

    Global Cache Load Profile
    ~~~~~~~~~~~~~~~~~~~~~~~~~
                                       Per Second   Per Transaction
                                       ----------   ---------------
    Global Cache blocks received:          315.37             12.82
    Global Cache blocks served:            240.30              9.67
    GCS/GES messages received:             525.16             20.81
    GCS/GES messages sent:                 765.32             30.91

  The first two statistics indicate the number of blocks transferred to and from this instance, so with an 8KB block size:

    Sent:     240 x 8,192 = 1,966,080 bytes/sec (about 2.0 MB/sec)
    Received: 315 x 8,192 = 2,580,480 bytes/sec (about 2.6 MB/sec)

  To determine the amount of network traffic generated by messaging, first find the average message size (this was 193 on my system):

    select sum(kjxmsize * (kjxmrcv + kjxmsnt + kjxmqsnt)) / sum(kjxmrcv + kjxmsnt + kjxmqsnt) "avg message size"
    from x$kjxm
    where kjxmrcv > 0 or kjxmsnt > 0 or kjxmqsnt > 0;

  then calculate the amount of messaging traffic on this network:

    193 x (765 + 525) = 248,970 bytes/sec (about 0.25 MB/sec)

  To calculate the total network traffic generated by Cache Fusion:

    = 2.0 + 2.6 + 0.25 = approximately 5 MB/sec
    = 5 x 8 = approximately 40 Mbit/sec

  The DBWR Fusion writes statistic indicates the number of times the local DBWR was forced to write a block to disk due to remote instances; this number should be low.

RAC AWR Section: Global cache efficiency percentages
  Report section: global cache efficiency
  Description: this section shows where the instance is getting its data blocks from: local cache, remote cache or disk. The first two give the cache hit ratio for the instance; if the remote cache and disk percentages are high you may consider application partitioning.

RAC AWR Section: GCS and GES - workload characteristics
  Report section: GCS and GES workload
  Description: this section contains timing statistics for global enqueue and global cache processing; all timings related to CR (consistent read) block processing and to CURRENT block processing should be low.

RAC AWR Section: Messaging statistics
  Report section: messaging
  Description: the first section relates to sending a message and the times should be low. The second section details the breakup of direct and indirect messages; direct messages are sent by a foreground or user process to remote instances, while indirect messages are pooled and sent later.

RAC AWR Section: Service statistics
  Report section: service stats
  Description: shows the resources used by all the services the instance supports.

RAC AWR Section: Service wait class statistics
  Report section: service wait class
  Description: summarizes waits in different categories for each service.

RAC AWR Section: Top 5 CR and current block segments
  Report section: top 5 CR and current blocks
  Description: contains the names of the top 5 contentious segments (table or index); if a table or index has a very high percentage of CR and current block transfers, you need to investigate it.

Cluster Interconnect
As I stated above, the interconnect is a critical part of the RAC; you must make sure that it is on the best hardware you can buy. You can confirm which interconnect is being used in Oracle 9i and 10g by using the oradebug command to dump the information to a trace file; in Oracle 10g R2 the cluster interconnect is also recorded in the alert.log file (you can view my information from here).

interconnect
  SQL> oradebug setmypid
  SQL> oradebug ipc
  Note: look in the user_dump_dest directory; the trace file will be there
