
Oracle Real Application Cluster (Oracle RAC)

Session 1: Oracle 10g/11gR2 RAC Architecture

Oracle Clusterware
Oracle Clusterware is software that enables servers to operate together as if they were one server. Each
server looks like a standalone server; however, each server runs additional processes that communicate
with the other servers, so the separate servers appear as one server to applications and end
users.
Starting with version 10g Release 1, Oracle introduced its own portable cluster software, Cluster Ready
Services. The product was renamed Oracle Clusterware in 10g Release 2, and from 11g
Release 2 it is part of the Oracle Grid Infrastructure software.

Ahmed Fathi - Senior Oracle Consultant


Email: ahmedf.dba@gmail.com Blog: http://ahfathi.blogspot.com LinkedIn: http://linkedin.com/in/ahmedfathieg


The benefits of using a cluster include:

Scalability: multiple nodes allow the cluster database to scale beyond the capacity of a single-node database.
Availability: if any node fails, clients connected through the other nodes in the cluster can continue working without interruption.
Manageability: more than one database can be handled by Oracle Clusterware.
Ability to monitor processes and restart them if they stop.
Elimination of unplanned downtime due to hardware or software malfunctions.
Reduction or elimination of planned downtime for software maintenance.

There are two kinds of cluster active/active and active/passive:


Active/passive
In this setup there are usually two nodes: one node is available (active) and the other is not (passive). The
Oracle software resides on shared storage and runs on only one node (the active one). In case of failure,
the cluster moves the shared storage over to the passive node, so the formerly active node becomes
passive and the passive node becomes active.
A number of third-party vendors, such as Microsoft and various Linux distributions, support this kind of
cluster; it is usually called an OS cluster.
Active/Active
In this kind of setup an Oracle instance runs concurrently on each server, and clients access both servers at
the same time. Each instance must communicate with the other node (heartbeat) to ensure that both
servers are available; if either server goes down, the other server can take over its workload. The benefit
of active/active is that the workload can be shared between the servers.

Voting Disk and Cluster Registry


Voting Disk: A voting disk is a shared disk that is accessed by all member nodes of the
cluster. It stores cluster membership information and keeps the heartbeat information between the
nodes. If any node is unable to ping the voting disk, the cluster immediately recognizes the
communication failure and evicts that node from the cluster.
The voting disk is also used to determine which instance takes control of the cluster in case of node failure, in order to avoid split-brain.

Oracle Cluster Registry (OCR): Stores and manages configuration information about the cluster resources
managed by Oracle Clusterware, such as Oracle RAC databases, database instances, listeners, VIPs,
servers, and applications.
Oracle Local Registry (OLR, 11gR2): Similar to the OCR and introduced in 11gR2, but it only stores information about the
local node. It is not shared by the other nodes of the cluster and is used by OHASD while starting or joining a cluster.


RAC Components

Shared Disk System


Oracle Clusterware Stack
Cluster Interconnects
Oracle Kernel Components

Shared Disk System


Below are the three major types of shared storage used in RAC:
Raw volumes: A raw logical volume is an area of physical and logical disk space that is under the direct
control of an application, such as a database, rather than under the direct control of the operating
system or a file system.
Cluster file system: This option is not widely used; here a cluster file system, such as Oracle Cluster
File System (OCFS) for MS Windows and Linux, holds all the datafiles of the RAC database.
Automatic Storage Management (ASM): The Oracle-recommended storage option, introduced in Oracle 10g,
which provides cluster-aware volume management and file system functionality optimized for Oracle database files.

Oracle Clusterware Stack 10g/11gR1


Oracle Clusterware comprises several background processes that facilitate cluster operations. The
Cluster Synchronization Services (CSS), Event Management (EVM), and Cluster Ready Services components
communicate with the corresponding component layers in the other instances within the same cluster database
environment. These components are also the main communication links between the Oracle Clusterware
high availability components and the Oracle Database. In addition, these components monitor and manage
database operations.

Cluster Synchronization Services (CSS): Manages the cluster configuration by controlling which nodes
are members of the cluster and by notifying members when a node joins or leaves the cluster.

Cluster Ready Services (CRS): The primary program for managing high availability operations within a
cluster. Anything that the crs process manages is known as a cluster resource which could be a
database, an instance, a service, a Listener, a virtual IP (VIP) address, an application process, and so on.
The crs process manages cluster resources based on the resource's configuration information that is
stored in the OCR. This includes start, stop, monitor and failover operations. The crs process generates
events when a resource's status changes. When you have installed Oracle RAC, crs monitors the Oracle
instance, Listener, and so on, and automatically restarts these components when a failure occurs. By
default, the crs process makes five attempts to restart a resource and then makes no further
restart attempts if the resource does not restart.

Event Management (EVM): A background process that publishes events that crs creates.

Oracle Notification Service (ONS): Allows clusterware events to be propagated to the nodes in the
cluster, middle-tier application servers, and clients. EVMD publishes events through ONS.

RACG: Extends clusterware to support Oracle-specific requirements and complex resources. Runs server
callout scripts when FAN events occur.

Process Monitor Daemon (OPROCD): This process is locked in memory to monitor the cluster and
provide I/O fencing. OPROCD performs its check, sleeps, and if the wakeup takes longer than the
expected time, OPROCD resets the processor and reboots the node. An OPROCD failure results in
Oracle Clusterware restarting the node. On Linux platforms this protection is provided by the hangcheck-timer kernel module.

Oracle Clusterware Stack 11gR2


Oracle Clusterware consists of two separate stacks: an upper stack anchored by the Cluster Ready Services
(CRS) daemon (crsd) and a lower stack anchored by the Oracle High Availability Services daemon (ohasd).
These two stacks have several processes that facilitate cluster operations. The following sections describe
these stacks in more detail:

The Cluster Ready Services Stack

The list in this section describes the processes that comprise CRS. The list includes components that are processes on
Linux and UNIX operating systems, or services on Windows.

Cluster Ready Services (CRS): The primary program for managing high availability operations in a
cluster. The CRS daemon (crsd) manages cluster resources based on the configuration information that
is stored in OCR for each resource. This includes start, stop, monitor, and failover operations. The crsd
process generates events when the status of a resource changes. When you have Oracle RAC installed,
the crsd process monitors the Oracle database instance, listener, and so on, and automatically restarts
these components when a failure occurs.

Cluster Synchronization Services (CSS): Manages the cluster configuration by controlling which nodes
are members of the cluster and by notifying members when a node joins or leaves the cluster.
The cssdagent process monitors the cluster and provides I/O fencing. This service was formerly
provided by the Oracle Process Monitor Daemon (oprocd), also known as OraFenceService on Windows. A
cssdagent failure may result in Oracle Clusterware restarting the node.

Oracle ASM: Provides disk management for Oracle Clusterware and Oracle Database.

Cluster Time Synchronization Service (CTSS): Provides time management in a cluster for Oracle
Clusterware.


Event Management (EVM): A background process that publishes events that Oracle Clusterware
creates.

Oracle Notification Service (ONS): A publish and subscribe service for communicating Fast Application
Notification (FAN) events.

Oracle Agent (oraagent): Extends clusterware to support Oracle-specific requirements and complex
resources. This process runs server callout scripts when FAN events occur. This process was known as
RACG in Oracle Clusterware 11g release 1 (11.1).

Oracle Root Agent (orarootagent): A specialized oraagent process that helps crsd manage resources
owned by root, such as the network, and the Grid virtual IP address.

The Oracle High Availability Services Stack

This section describes the processes that comprise the Oracle High Availability Services stack. The list
includes components that are processes on Linux and UNIX operating systems, or services on Windows.

Cluster Logger Service (ologgerd): Receives information from all the nodes in the cluster and persists it in
a Cluster Health Monitor (CHM) repository. This service runs on only two nodes in a cluster.

System Monitor Service (osysmond): The monitoring and operating system metric collection service
that sends the data to the cluster logger service. This service runs on every node in a cluster.

Grid Plug and Play (GPNPD): Provides access to the Grid Plug and Play profile, and coordinates updates
to the profile among the nodes of the cluster to ensure that all of the nodes have the most recent
profile.

Grid Interprocess Communication (GIPC): A support daemon that enables Redundant Interconnect
Usage.

Multicast Domain Name Service (mDNS): Used by Grid Plug and Play to locate profiles in the cluster,
as well as by GNS to perform name resolution. The mDNS process is a background process on Linux and
UNIX and a service on Windows.

Oracle Grid Naming Service (GNS): Handles requests sent by external DNS servers, performing name
resolution for names defined by the cluster.

Cluster Interconnects
The cluster interconnect is the communication path used by the cluster for the synchronization of resources, and in
some cases it is also used for the transfer of data from one instance to another. Typically, the interconnect is a
network connection dedicated to the server nodes of the cluster (and is therefore sometimes referred to as the
private interconnect).

Oracle Kernel Components


The set of additional background processes in each instance is known as the Oracle kernel components in a
RAC environment. Since the buffer cache and shared pool become global in RAC, special handling is required to
manage these resources and avoid conflicts and corruption. The additional RAC background processes and the
single-instance background processes work together to achieve this.
Global Cache and Global Enqueue Services
A RAC database system has two important services: the Global Cache Service (GCS) and the Global Enqueue
Service (GES). These are basically collections of background processes and memory structures. Together,
GCS and GES manage the entire Cache Fusion process, resource transfers, and resource
acquisition among the instances.
In Oracle RAC each instance has its own cache, but an instance sometimes needs to access data
blocks currently residing in another instance's cache. This management and data sharing is done by the Global
Cache Service (GCS). Resources other than data blocks, such as locks and enqueue details shared across the instances,
are managed by the Global Enqueue Service (GES).
The Global Cache Service employs various background processes, such as the Global Cache Service Processes (LMSn)
and the Global Enqueue Service Daemon (LMD).
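The Cache Fusion flow just described can be sketched in Python. This is purely a conceptual model (the instance names, block keys, and data below are invented for illustration); the real GCS logic lives inside the Oracle kernel:

```python
# Conceptual sketch of a Cache Fusion block access (illustrative only).
# If another instance's buffer cache holds the current version of a
# block, GCS ships it over the interconnect instead of forcing a
# disk read on shared storage.

def read_block(block_id, requesting_instance, caches, disk):
    """Return (source, block) where source is 'interconnect' or 'disk'."""
    for name, cache in caches.items():
        if name != requesting_instance and block_id in cache:
            # Cache Fusion: transfer the block image between caches
            caches[requesting_instance][block_id] = cache[block_id]
            return "interconnect", cache[block_id]
    # No other instance holds the block: read it from shared storage
    caches[requesting_instance][block_id] = disk[block_id]
    return "disk", disk[block_id]

disk = {("users01.dbf", 42): "row data"}
caches = {"inst1": {}, "inst2": {("users01.dbf", 42): "row data"}}

print(read_block(("users01.dbf", 42), "inst1", caches, disk))  # via interconnect
print(read_block(("users01.dbf", 99), "inst1", caches, {("users01.dbf", 99): "x"}))  # from disk
```

The design point is that an interconnect transfer avoids a round trip to shared storage, which is exactly what makes the private interconnect latency-critical.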

Global Resource Directory


The Global Resource Directory (GRD) is an internal in-memory database, distributed across all of the running
instances, that records the current status of resources and enqueues (data blocks). The GRD is
maintained by GES and GCS. Whenever a block is transferred out of a local cache to another instance's
cache, the GRD is updated. The following resource information is available in the GRD:

- Data block identity, such as file # and block #
- Location of the most current version
- Modes of the data blocks: (N) Null, (S) Shared, (X) Exclusive
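The GRD bookkeeping above can be modelled as a tiny in-memory map. This is a toy sketch only (the class and method names are invented); the real directory is a distributed kernel structure, not a Python dictionary:

```python
# Toy model of Global Resource Directory entries (illustrative only).
# Each entry tracks which instance currently holds a block and in
# which mode: (N)ull, (S)hared, or e(X)clusive.

class GRD:
    def __init__(self):
        # (file_no, block_no) -> {"holder": instance, "mode": N/S/X}
        self.entries = {}

    def register(self, file_no, block_no, holder, mode):
        assert mode in ("N", "S", "X")
        self.entries[(file_no, block_no)] = {"holder": holder, "mode": mode}

    def current_holder(self, file_no, block_no):
        entry = self.entries.get((file_no, block_no))
        return entry["holder"] if entry else None

grd = GRD()
grd.register(4, 120, "inst2", "X")   # inst2 holds block 120 exclusively
grd.register(4, 120, "inst1", "X")   # block transferred: GRD records new holder
print(grd.current_holder(4, 120))    # inst1
```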

Oracle RAC Background Processes

LMS: Global Cache Service Process
LMD: Global Enqueue Service Daemon
LMON: Global Enqueue Service Monitor
LCK0: Instance Enqueue Process
DIAG: Diagnosability Daemon


Global Cache Service Processes (LMSn)

LMS (Lock Manager Server Process) is used in Cache Fusion. It enables consistent copies of blocks to be
transferred from a holding instance's buffer cache to a requesting instance's buffer cache without a disk
write, under certain conditions.
It rolls back any uncommitted transactions for blocks that are requested for a consistent read by
a remote instance.
Global Enqueue Service Daemon (LMD)

LMD (Lock Manager Daemon) manages enqueue service requests for the GCS. It also handles deadlock
detection and remote resource requests.
Global Enqueue Service Monitor (LMON)

LMON (Lock Monitor Process) is responsible for managing the Global Enqueue Services (GES).

It maintains consistency of GCS memory in case of any process death. LMON is also responsible for
cluster reconfiguration when an instance joins or leaves the cluster. It also checks for instance death
and listens for local messages.
Instance Enqueue Process (LCK)

The LCK0 process manages non-Cache Fusion resource requests such as library and row cache requests.
Diagnosability Daemon (DIAG)

This background process monitors the health of the instance and captures diagnostic data about process
failures within instances. The operation of this daemon is automated, and it updates an alert log file to record
the activity that it performs.
Clusterware and heartbeat mechanism
A cluster needs to know who is a member at all times. Oracle Clusterware has two (2) types of heartbeats:
1. Network heartbeat
- Performed once per second.
- A node will be evicted from the cluster when it fails to send a network heartbeat within the misscount time frame (a maximum time in seconds).
2. Disk (voting disk) heartbeat
- Each node of the cluster writes a disk heartbeat to the voting disk every second.
- A node is evicted from the cluster if its heartbeat is not updated within the I/O timeout (misscount/disktimeout).
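The two heartbeat checks can be sketched as a single eviction decision. This is an illustrative simplification; the default values used here (30 seconds for misscount, 200 seconds for the disk timeout) are common Linux defaults, but verify them for your release, e.g. with `crsctl get css misscount`:

```python
# Sketch of the CSS eviction decision (illustrative, simplified).
# A node is evicted when its last network heartbeat is older than
# misscount seconds, or its last voting-disk heartbeat is older
# than the disk timeout.

def should_evict(now, last_net_hb, last_disk_hb,
                 misscount=30, disktimeout=200):
    """Return True if the node must be evicted from the cluster."""
    if now - last_net_hb > misscount:
        return True        # missed network heartbeats for too long
    if now - last_disk_hb > disktimeout:
        return True        # stopped updating the voting disk
    return False

print(should_evict(now=100, last_net_hb=95, last_disk_hb=95))  # False: both fresh
print(should_evict(now=100, last_net_hb=60, last_disk_hb=95))  # True: network stale
```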


What is misscount in Oracle RAC?

The Cluster Synchronization Services (CSS) component has a misscount parameter. This value represents the
maximum time, in seconds, that a network heartbeat can be missed before the cluster enters a
reconfiguration in order to evict a node. On Linux the default value is 60 seconds in 10g Release 1; in 10g
Release 2 and 11g it is 30 seconds.
I/O Fencing
There are situations where leftover write operations from database instances reach the storage system
after the cluster function on a node has failed, while the node itself is still running at the OS level.
Since these operations are no longer in serial order, they can damage the consistency of the stored
data. Therefore, when a cluster node fails, the failed node needs to be fenced off from all shared disk
devices or disk groups. This methodology is called I/O fencing, disk fencing, or failure fencing.
Functions of I/O fencing
I/O fencing prevents updates by failed instances, detects failure, and prevents split-brain in the cluster.
The cluster volume manager and cluster file system play a significant role in preventing failed nodes from
accessing shared devices. Oracle uses an algorithm common to STONITH ("shoot the other node in the
head") implementations to determine which nodes need to be fenced. This normally means the healthy
nodes kill the sick node; Oracle Clusterware, however, does not do this directly. Instead, it simply gives
the message "please reboot" to the sick node, which bounces itself and rejoins the cluster.
There are other methods of fencing used by different hardware and software vendors. When using
Veritas Storage Foundation for RAC (VxSF RAC), you can implement I/O fencing instead of node
fencing: instead of asking a server to reboot, you simply cut it off from shared storage.
In versions before 11.2.0.2, Oracle Clusterware tried to prevent a split-brain with a fast reboot (more
precisely, a reset) of the server(s), without waiting for ongoing I/O operations or synchronization of the file systems.


This mechanism was changed in version 11.2.0.2 (the first 11g Release 2 patch set). After deciding which
node to evict, the Clusterware:
- attempts to shut down all Oracle resources/processes on the server (especially processes generating I/O)
- stops itself on the node
- afterwards, the Oracle High Availability Services daemon (OHASD) tries to start the Cluster Ready
Services (CRS) stack again; once the cluster interconnect is back online, all relevant cluster resources
on that node start automatically
- kills the node if stopping the resources or the processes generating I/O is not possible (hanging in kernel
mode, I/O path, etc.)
Generally, Oracle Clusterware uses two rules to choose which nodes should leave the cluster to assure
cluster integrity:
- In configurations with two nodes, the node with the lowest ID survives (the first node that joined the
cluster); the other one is asked to leave the cluster.
- With more cluster nodes, the Clusterware tries to keep the largest sub-cluster running.
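The two survival rules above can be condensed into a one-line selection function. This is an illustrative sketch (the partition representation is invented); note that a tie between equal-sized sub-clusters, including the two-node case, is decided by the lowest node ID:

```python
# Sketch of the sub-cluster survival rules (illustrative only):
# the largest sub-cluster survives; on a tie (including the
# two-node case), the sub-cluster containing the lowest node
# ID survives.

def surviving_subcluster(partitions):
    """partitions: list of lists of node IDs that can still see each other."""
    return max(partitions, key=lambda sub: (len(sub), -min(sub)))

print(surviving_subcluster([[1], [2]]))           # [1]: lowest node ID wins the tie
print(surviving_subcluster([[3, 4, 5], [1, 2]]))  # [3, 4, 5]: largest sub-cluster wins
```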
When does a node reboot?
- Network failure on the interconnect
- Slow interconnect (latency): must fail 30 consecutive times!
- Voting disk I/O: cannot read or write
- CPU-bound: the CPU is too busy to maintain the heartbeat
- Files moved, deleted, changed, or some other human error
- Configuration error: wrong network used for the private interconnect
- The ocssd process died
- Some Oracle Clusterware bug


Split-Brain scenario
The term "Split-Brain" is often used to describe the scenario when two or more co-operating processes in a
distributed system, typically a high availability cluster, lose connectivity with one another but then continue
to operate independently of each other, including acquiring logical or physical resources, under the
incorrect assumption that the other process(es) are no longer operational or using the said resources.

Fast Application Notification (FAN)


Notifying clients about RAC availability and instance (actually service) performance is the purpose of
FAN (Fast Application Notification) events. The client does not actively check the availability or load of an
instance, and is no longer glued to one instance once connected. The nodes directly inform the application
server about which instance is able to provide a defined quality of service.
FAN is a method, introduced in Oracle 10.1, by which applications can be informed of changes in cluster
status, enabling fast node-failure detection and workload balancing.


FAN is advantageous because it prevents applications from waiting for TCP/IP timeouts when a node fails,
from trying to connect to a database service that is currently down, and from processing data received
from a failed node.
Clients can be notified using server-side callouts, Fast Connection Failover (FCF), or the ONS API.
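A FAN consumer can be sketched as a small event handler that maintains a connection pool. This is an illustrative sketch only; real clients use the ONS API or a driver's built-in FCF support, and the event fields shown here are simplified:

```python
# Sketch of a FAN-style subscriber (illustrative only). On a
# "service down" event the pool drops connections to the failed
# instance immediately, instead of letting clients discover the
# failure through TCP timeouts.

class ConnectionPool:
    def __init__(self, instances):
        self.live = set(instances)   # instances we hand out connections to

    def handle_fan_event(self, event):
        # event: simplified dict, e.g. {"service": "orcl",
        #         "instance": "orcl2", "status": "down"}
        if event["status"] == "down":
            self.live.discard(event["instance"])  # purge dead connections
        elif event["status"] == "up":
            self.live.add(event["instance"])      # rebalance onto it again

pool = ConnectionPool(["orcl1", "orcl2"])
pool.handle_fan_event({"service": "orcl", "instance": "orcl2", "status": "down"})
print(sorted(pool.live))   # ['orcl1']
```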
Why Use Virtual IP?
The goal is application availability.
When a node fails, the VIP associated with it is automatically failed over to another node. When this
occurs, the following things happen:
- The VIP detects the public network failure, which generates a FAN event.
- The new node announces to the network a new MAC address for the VIP.
- Clients connected through the VIP immediately receive an ORA-3113 error or equivalent.
- New connection requests rapidly traverse the tnsnames.ora address list, skipping over the dead
nodes, instead of having to wait on TCP/IP timeouts.

Without a VIP, clients connected to a node that died will often wait for the TCP/IP timeout period (which
can be up to 10 minutes) before getting an error. As a result, you do not really have a good high-availability
solution without using VIPs.
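The difference can be sketched with a toy connection loop. The addresses, states, and timing figures below are invented for illustration; the point is that a failed-over VIP refuses connections immediately, while a dead public IP forces the client to sit through a TCP timeout:

```python
# Sketch of address-list traversal with and without VIPs
# (illustrative only; timings are made up).

def try_connect(address, state):
    # state: "up" -> connect succeeds; "vip_failed_over" -> the new
    # VIP holder refuses at once; "dead_public_ip" -> the client
    # hangs until the TCP timeout expires.
    if state == "up":
        return "connected", 0.0
    if state == "vip_failed_over":
        return "refused", 0.1      # instant refusal from the new VIP holder
    return "timeout", 600.0        # classic TCP timeout, up to minutes

def connect(address_list, states):
    """Walk the tnsnames-style address list; return (address, seconds waited)."""
    waited = 0.0
    for addr in address_list:
        result, cost = try_connect(addr, states[addr])
        waited += cost
        if result == "connected":
            return addr, waited
    return None, waited

states = {"node1-vip": "vip_failed_over", "node2-vip": "up"}
print(connect(["node1-vip", "node2-vip"], states))  # ('node2-vip', 0.1)
```

With VIPs the client reaches the surviving node almost immediately; with plain public IPs the same traversal would first burn the whole TCP timeout on the dead address.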
Connecting with Public IP Scenario


Connecting with Virtual IP scenario

