
What Are Oracle Real Application Clusters?

• Multiple instances accessing the same database
• One instance per node
• Physical or logical access to each database file
• Software-controlled data access spread across nodes

[Diagram: cluster nodes, one instance per node, sharing a cache over the interconnect and accessing the same database files]
RAC Architecture

[Diagram: each node runs Oracle Clusterware, an ASM instance, a database instance, a listener, and a VIP on the public network, with services defined on top; nodes communicate over the cluster interconnect and share storage holding the database and control files and the redo/archive logs of all instances (managed by ASM), plus the OCR and voting disks on raw devices]

Global Resources Coordination

[Diagram: each instance holds part of the Global Resource Directory (GRD) and its cache and runs the LMON, LMD0, LMSx, LCK0, and DIAG background processes; global resources are coordinated across the interconnect by the Global Cache Services (GCS) and the Global Enqueue Services (GES)]


RAC Software

[Diagram: on each node, the instance (its cache and the LMON, LMD0, LMSx, LCK0, and DIAG processes) runs on top of Oracle Clusterware; the Clusterware processes (CRSD & RACGIMON, EVMD, OCSSD & OPROCD) communicate over the cluster interface and manage the node applications (ASM, database, services, OCR, VIP, ONS, EMD, listener), with global management through SRVCTL, DBCA, and OEM]
RAC Software Storage

[Diagram: two software storage layouts. In the first, each node keeps ORACLE_HOME, CRS_HOME, and ASM_HOME on local storage, with only the voting files and OCR files on shared storage; in the second, CRS_HOME stays local while ORACLE_HOME and ASM_HOME reside on shared storage.]

• Local software homes permit rolling patch upgrades
• Software is not a single point of failure
RAC Database Storage

[Diagram: each node keeps its archived log files on local storage; shared storage holds the data files, temp files, control files, flash recovery area files, change tracking file, SPFILE, TDE wallet, and, per instance, the undo tablespace files and online redo log files]
Automatic Storage Management

• Eliminates the need for a conventional file system and volume manager
• Capacity on demand
  • Add/drop disks online
• Automatic I/O load balancing
  • Stripes data across disks to balance load
  • Best I/O throughput
• Automatic mirroring
• Easy to manage
Automatic Storage Management

• Simplifies and automates database storage management (see the sketch below)
  • A fraction of the time is needed to manage database files
• Increases storage utilization
  • Eliminates over-provisioning and maximizes storage resource utilization
• Predictably delivers on service level agreements
  • Never gets out of tune, delivering higher performance than raw devices and file systems over time
  • Uncompromised availability enables reliable deployment on low-cost storage
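A minimal sketch of how an ASM disk group is created and grown online; the disk group name and disk paths are hypothetical:

    -- Run in the ASM instance: create a mirrored disk group from candidate disks.
    CREATE DISKGROUP data NORMAL REDUNDANCY
      DISK '/dev/raw/raw1', '/dev/raw/raw2', '/dev/raw/raw3', '/dev/raw/raw4';

    -- Capacity on demand: add a disk online; ASM rebalances data automatically.
    ALTER DISKGROUP data ADD DISK '/dev/raw/raw5';

    -- Database files are then placed in the disk group, for example:
    --   CREATE TABLESPACE dw_data DATAFILE '+DATA' SIZE 100G;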
Clusters and Scalability

[Diagram: in the SMP model, CPUs with private caches share memory and rely on cache coherency; in the RAC model, instances (SGAs and their background processes, BGP) share storage and rely on Cache Fusion]


Real Application Clusters Benefits

• Highest availability
• On-demand, flexible scalability
• Lower computing costs
• World-record performance
Levels of Scalability

• Hardware: Disk input/output (I/O)
• Internode communication: High bandwidth and low latency
• Operating system: Number of CPUs
• Database management system: Synchronization
• Application: Design
Scaleup and Speedup

[Diagram: the original system completes 100% of the task on one set of hardware in a given time. With scaleup, the cluster uses additional hardware to handle up to 200% or 300% of the task in the same time; with speedup, the cluster completes 100% of the task in half the time]
Speedup/Scaleup and Workloads

Workload                    Speedup     Scaleup
--------------------------  ----------  -------
OLTP and Internet           No          Yes
DSS with parallel query     Yes         Yes
Batch (mixed)               Possible    Yes


Definition of a Data Warehouse

“An enterprise structured repository of subject-oriented, time-variant, historical data used for information retrieval and decision support. The data warehouse stores atomic and summary data.”
Data Warehouse - Characteristics

• What is data warehousing today?
  • Not a simple batch query and analytical engine anymore
  • Large user population with diverse query and analytical needs
    • Thousands of users accessing data both internally and externally
  • Large size: 10 TB and upwards of 100 TB
  • Not a simple schema with a few tables
    • Multiple applications sharing a common copy of enterprise data
  • Strict performance and operational SLAs
• Adaptable to growing business needs
  • Constantly evolving with more business units and functionality
  • Constant requirement to scale users and data
Data Warehouse - Characteristics

• Large, complex database operations
  • Complex SQL and calculations
• Updated through a controlled process
  • Extract, Transform, Load (ETL)
• Heterogeneous workload
  • ETL processing
  • Scheduled reporting
  • Ad hoc queries
  • Aggregations, etc.
• Peak usage of different workload patterns at different times
  • Systems have to be sized appropriately
Data Warehouse - Requirements

• High availability and reliability
• Deliver real-time data for real-time queries
  • Get more in-time, accurate data
  • Stay informed
  • Have the ability to make decisions and take action
  • Have a lag time of hours or minutes
• High performance and throughput
• Capability to scale quickly as the business grows
• Flexibility to meet diverse, shifting demands
RAC and Data Warehouse
Physical Considerations
Configure for a Balanced System
• "The weakest link" defines the performance
• Balance these components:
  • CPU
  • HBA (host bus adapter)
  • NICs and interconnect protocol
  • Switch speed
  • Controllers
  • Disks

[Diagram: four nodes, each with two HBAs, connected through FC-Switch1 and FC-Switch2 to disk arrays 1 through 8; the nodes are linked by the cluster interconnects]
Grid Component* Dependencies

Rules of thumb:
• 200 MB/s per CPU
• Number of HBAs per node = number of CPUs per node
• Number of controllers = number of HBAs
• Maximal number of switch ports = number of HBAs + number of controllers
• Minimum number of disks = number of controllers x 4
• Interconnect: GigE for up to 8 nodes, otherwise InfiniBand

[Diagram: dependencies between the CPU/node, host bus adapter, switch, controller, disk, and interconnect layers]

* 2Gbit based
I/O Design

• Optimal storage design
  • Support workloads that perform sequential I/O
    • Expressed as bandwidth (MB/sec)
    • Large multi-block I/Os
    • Table/index scans
  • Support workloads that perform random I/O
    • Expressed as I/O operations per second (IOPS)
    • Single-block I/O requests
  • Estimation should include requirements for both normal and backup I/O
I/O Design

1. Estimate the aggregated throughput and IOPS
   (e.g., 2 GB/sec, or 30,000 IOPS)

2. Calculate the total bandwidth requirement per node
   (e.g., 2 GB/sec across 16 nodes = 128 MB/sec per node,
   or 30,000 / 16 = 1875 IOPS per node)

3. Choose the appropriate storage class and build the configuration
   (e.g., at 120 IOPS per spindle, 16-way striped = 1920 IOPS per LUN;
   16 LUNs)
I/O Design

• DW-specific best practices
  • Plan for 50-60% utilization per HBA
  • Target 30-50 MB/sec per CPU core
  • Use ASM
    • Managing ultra-large databases becomes fairly simple
    • Eliminates contention by evenly spreading I/O
    • Expanding storage needs are addressed easily
    • Rebalancing keeps I/O performance constant
  • Create optimally sized LUNs
    • Small LUNs for multi-terabyte databases are sub-optimal
    • Pay attention to the initial storage layout when increasing cluster nodes exponentially
  • Offset the partition table to the stripe width of the storage array
Interconnect Design

• In a DW environment, the primary users of the interconnect are:
  • Inter-node parallel query
    • Typical message size: PARALLEL_EXECUTION_MESSAGE_SIZE, default 2 KB
  • Global Cache Fusion
    • Two types of messages:
      • Short 256-byte messages
      • Block transfers of DB_BLOCK_SIZE
Interconnect Design

• Interconnect bandwidth estimation
  • Messages received (M)
    • 256 * (GES messages + GCS messages)
  • Blocks received (B)
    • (db_block_size * (CR blocks received + current blocks received)) / MTU size
  • PQ messages received (P)
    • (parallel_execution_message_size * number of PQ remote messages received) / MTU size
  • Total bandwidth required
    • (messages received + blocks received + PQ messages received) / maximum network transmit capacity
    • (M + B + P) / 85000
Interconnect design – Cache Traffic

• Example from AWR:


Global Cache Load Profile Per Sec Per Trans
------------------------------- ---------- ---------
Global Cache blocks received: 2.70 2.23
Global Cache blocks served: 2.84 2.36
GCS/GES messages received: 164.07 136.03
GCS/GES messages sent: 136.96 113.56
DBWR Fusion writes: 0.22 0.18
Estd Interconnect traffic (KB): 103.08

• This DW system primarily uses PQ
  • Global cache traffic is minimal
  • Mostly dictionary blocks
Interconnect Design – IPQ traffic

• Example from AWR:


Statistic Total per Sec per Trans
--------------------------- --------- -------- ----------
PX local messages recv'd 104 0.1 0.1
PX local messages sent 104 0.1 0.1
PX remote messages recv'd 200271 200.2 151.1
PX remote messages sent 213267 213.2 156.1

• Per second, this system receives about 200 remote PX messages
  • The PQ message size is 8182 bytes
  • Usage is about 1.5 MB/sec
  • For this workload, GigE should be optimal
Interconnect Design

• DW-specific best practices
  • Plan for 50-70% utilization of the network bandwidth
  • GigE performs very well when IPQ usage is low
    • Multiplexed GigE is the choice for many customers
  • For high IPQ usage
    • InfiniBand, if available on your platform
    • RDS on Linux offers good performance over IB
Temporary Tablespace Design

• Large sorts in a data warehouse use temp space
  • For performance reasons, temp space allocation is managed through the SGA
  • Unless requested, space allocated in one instance is not returned to the common pool
  • Space reclamation is done under the SS and CI enqueues
    • This can cause a slowdown if space is reclaimed constantly
  • A few queries with excessive temp space requirements can cause an imbalance of usage among instances
Temporary Tablespace Design

• DW-specific best practices
  • Make sure enough temp space is allocated, combining all instances' usage
  • Allocate a separate temp tablespace for users who perform large sorts (see the sketch below)
  • For each temp tablespace, create as many temp files as the number of instances
    • This eliminates 'buffer busy' waits on the temp file header
  • If an imbalance is found, use the following command to release the excessive allocation:
    • alter session set events 'immediate trace name drop_segments level <TS number + 1>';
  • See Metalink Note 465840.1 for more details
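A minimal sketch of the suggested tempfile layout, assuming a four-instance cluster, an ASM disk group named +DATA, and a user etl_user (all hypothetical):

    -- Dedicated temp tablespace for heavy sort users, one tempfile per instance.
    CREATE TEMPORARY TABLESPACE temp_dw
      TEMPFILE '+DATA' SIZE 32G;

    ALTER TABLESPACE temp_dw ADD TEMPFILE '+DATA' SIZE 32G;
    ALTER TABLESPACE temp_dw ADD TEMPFILE '+DATA' SIZE 32G;
    ALTER TABLESPACE temp_dw ADD TEMPFILE '+DATA' SIZE 32G;

    -- Assign the tablespace to the users running large sorts.
    ALTER USER etl_user TEMPORARY TABLESPACE temp_dw;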
RAC and Data Warehouse
Database Technologies
Automatic Workload Management: Services

• Application workloads can be defined as services
  • Individually managed and controlled
  • Assigned to instances during normal startup
  • On instance failure, automatically re-assigned
  • Service performance individually tracked
  • Finer-grained control with Resource Manager
  • Integrated with other Oracle tools and facilities (e.g., Scheduler, Streams)
  • Managed by Oracle Clusterware
Many Services, One Database

[Diagram: six nodes hosting one database, with the Queries, Aggregations, ETL1, ETL2, and Backup services spread across different nodes]
How to define a service

• 1. SRVCTL
     srvctl add service -d ORA -s APP1 -r INSTANCE1,INSTANCE2
     srvctl add service -d ORA -s APP2 -r INSTANCE3,INSTANCE4
                         (db)   (service)  (preferred instances)
• 2. Using OEM Grid Control
• 3. DBMS_SERVICE (for single instance)
     EXEC DBMS_SERVICE.CREATE_SERVICE(service_name => 'APP1', network_name => 'APP1');
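To verify where the services defined above are currently running, a query along these lines can be used (gv$active_services is the cluster-wide view of active services):

    -- Show which instance is offering each service across the cluster.
    SELECT inst_id, name
    FROM   gv$active_services
    ORDER  BY inst_id, name;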
Partitioning

• Powerful functionality for partitioning objects into smaller pieces
• Beneficial for any environment with large volumes of data
• A business decision, not hardware-based (a top-down design approach, NOT bottom-up)
Partitioning Strategies

• Range Partitioning
• Hash Partitioning
• List Partitioning
• Composite Partitioning
  • Composite Range-Range Partitioning
  • Composite Range-Hash Partitioning
  • Composite Range-List Partitioning
  • Composite List-Range Partitioning
  • Composite List-Hash Partitioning
  • Composite List-List Partitioning
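A minimal sketch of a range strategy and a composite range-hash strategy; the table and column names are hypothetical and echo the pruning example on the next slide:

    -- Range partitioning by month on a hypothetical sales table.
    CREATE TABLE sales (
      sales_id     NUMBER,
      cust_id      NUMBER,
      sales_date   DATE,
      sales_amount NUMBER
    )
    PARTITION BY RANGE (sales_date) (
      PARTITION p_2005_01 VALUES LESS THAN (TO_DATE('01-FEB-2005','DD-MON-YYYY')),
      PARTITION p_2005_02 VALUES LESS THAN (TO_DATE('01-MAR-2005','DD-MON-YYYY')),
      PARTITION p_2005_03 VALUES LESS THAN (TO_DATE('01-APR-2005','DD-MON-YYYY'))
    );

    -- Composite range-hash: range by date, sub-partitioned by hash on cust_id.
    CREATE TABLE sales_rh (
      sales_id     NUMBER,
      cust_id      NUMBER,
      sales_date   DATE,
      sales_amount NUMBER
    )
    PARTITION BY RANGE (sales_date)
    SUBPARTITION BY HASH (cust_id) SUBPARTITIONS 4 (
      PARTITION p_2005_q1 VALUES LESS THAN (TO_DATE('01-APR-2005','DD-MON-YYYY')),
      PARTITION p_2005_q2 VALUES LESS THAN (TO_DATE('01-JUL-2005','DD-MON-YYYY'))
    );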
Query Performance: Partition Pruning

Only the relevant partitions are accessed:

    select sum(sales_amount)
    from   sales
    where  sales_date between
           to_date('01-MAR-2005','DD-MON-YYYY')
           and
           to_date('31-MAY-2005','DD-MON-YYYY')

[Diagram: the sales table partitioned by month (05-Jan through 05-Jun); only the 05-Mar, 05-Apr, and 05-May partitions are scanned]
Partition-wise Joins

• Partition-wise joins may provide significant performance improvements
• Partition-wise joins are supported for range, hash, and composite partitioning
• The optimizer chooses partition-wise joins whenever possible
• The degree of parallelism is not correlated to the number of partitions
Full Partition-wise Joins

When joining two tables that are partitioned on the join key, Oracle may choose to join on a per-partition basis.

[Diagram: Lineitem and Orders are equi-partitioned on the join key; each pair of matching partitions (Sub-1, Sub-2, Sub-3) is joined on its own node (Node 1, Node 2, Node 3)]
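A minimal sketch of tables laid out for a full partition-wise join; the table names, the join key order_id, and the parallel degree are hypothetical:

    -- Both tables hash-partitioned on the join key with the same partition
    -- count, so the join can proceed partition pair by partition pair.
    CREATE TABLE orders (
      order_id   NUMBER,
      order_date DATE
    )
    PARTITION BY HASH (order_id) PARTITIONS 16;

    CREATE TABLE lineitem (
      order_id NUMBER,
      item_id  NUMBER,
      amount   NUMBER
    )
    PARTITION BY HASH (order_id) PARTITIONS 16;

    -- A join on the partitioning key is eligible for a full partition-wise join.
    SELECT /*+ PARALLEL(o, 4) PARALLEL(l, 4) */ o.order_id, SUM(l.amount)
    FROM   orders o JOIN lineitem l ON o.order_id = l.order_id
    GROUP  BY o.order_id;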
Partial Partition-wise Joins
Partial partition-wise join: if Lineitem is partitioned by the join key, then Orders can be redistributed to enable a partition-wise join.

[Diagram: Lineitem is partitioned on the join key; Orders is redistributed so that each piece is sent to the node holding the matching Lineitem partition (Sub-1 on Node 1, Sub-2 on Node 2, Sub-3 on Node 3)]
What is Parallelism

• Breaking a single task into multiple smaller, distinct units
• Instead of one process doing all the work, multiple processes work concurrently on the smaller units
• Independent of the number of nodes
How Parallel Execution Works?

• With serial execution, only one process is used
• With parallel execution:
  • One parallel execution coordinator process
  • Many parallel execution servers
  • The table may be dynamically partitioned

[Diagram: SELECT COUNT(*) FROM sales executed by a single serial process, versus a coordinator dispatching the scan of SALES to multiple parallel execution servers]
Parallel Operations

SELECT cust_last_name, cust_first_name
FROM   customers
ORDER  BY cust_last_name;

[Diagram: with DOP=3, the coordinator dispatches the work to two sets of execution servers: three producers scan granules of the table on disk (the table's dynamic partitioning), three consumers sort the ranges A-K, L-S, and T-Z, and the sorted results return to the coordinator. Each set exhibits intra-operation parallelism; the producer-to-consumer flow is inter-operation parallelism]
How Parallel Execution Servers Communicate

• Row distribution methods:
  • PARTITION
  • HASH
  • RANGE
  • ROUND-ROBIN
  • BROADCAST
  • QC (ORDER)
  • QC (RANDOM)

[Diagram: parallel execution server set 1 redistributes rows to parallel execution server set 2 (DOP=3), which returns results to the query coordinator (QC)]
Degree of Parallelism (DOP)

• The number of parallel execution servers used by one parallel operation
• Applies only to intra-operation parallelism
• If inter-operation parallelism is used, the number of parallel execution servers can be twice the DOP
• No more than two sets of parallel execution servers can be used for one parallelized statement
• When using partition granules, use a relatively high number of partitions
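A few common ways to request a DOP, sketched with hypothetical object names; the optimizer and the adaptive multiuser feature may still adjust the DOP actually used:

    -- Set a default DOP on the table; parallel scans of sales will use it.
    ALTER TABLE sales PARALLEL 8;

    -- Or request a DOP for a single statement with a hint.
    SELECT /*+ PARALLEL(s, 8) */ COUNT(*)
    FROM   sales s;

    -- Or revert the table to serial execution.
    ALTER TABLE sales NOPARALLEL;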
Parallel Execution with RAC

• Execution slaves have node affinity with the execution coordinator, but will expand to other nodes if needed

[Diagram: four nodes on shared disks; the execution coordinator runs on one node and its parallel execution servers run there first, spilling over to the other nodes when more are needed]
Adaptive Parallelism

• The adaptive multiuser feature adjusts the DOP based on user load
• Enabled by default: PARALLEL_ADAPTIVE_MULTI_USER=TRUE

[Diagram: a two-node cluster, initially with no workload. The 1st user logs on and issues a query that runs at parallel 8; a 2nd user logs on and their query runs at parallel 4; when the 3rd and 4th users log on, their queries also run at parallel 4]
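If the automatic downgrade is not desired for a given system, the parameter can be changed dynamically; a minimal sketch, assuming an spfile-managed RAC database:

    -- Check the current setting.
    SHOW PARAMETER parallel_adaptive_multi_user

    -- Disable adaptive DOP adjustment on all instances (TRUE is the default).
    ALTER SYSTEM SET parallel_adaptive_multi_user = FALSE SID='*';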
Inter-node Parallel Query – Oracle 10g

• Parallel execution slaves are allocated on instances without regard for services
• The benefits of services are greatly reduced when using parallel execution
• Workaround: instance groups (see the sketch below)
  • instance_groups=ig1,ig2,ig3 (not dynamic)
  • parallel_instance_group=ig2 (dynamic)
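A sketch of the 10g workaround with hypothetical group and instance names; instance_groups is static per instance (spfile change plus restart), while parallel_instance_group can be switched per session:

    -- Tag each instance with the groups it belongs to (static, spfile only).
    ALTER SYSTEM SET instance_groups = 'ig1','ig3' SCOPE=SPFILE SID='rac1';
    ALTER SYSTEM SET instance_groups = 'ig2','ig3' SCOPE=SPFILE SID='rac2';

    -- A session then restricts its PQ slaves to the instances in one group.
    ALTER SESSION SET parallel_instance_group = 'ig2';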
Inter-node Parallel Query – Oracle 11g

• Parallel execution slaves are only allocated on instances offering the service that the user session is connected to
• All services have equivalent, dynamic instance groups
• Services can be created for different IPQ user groups
• The preferred and available characteristics of services can be exploited
• IPQ SLAs can be guaranteed through service failover
Overview: Parallel Join Execution

• EMP and DEPT joined on deptno
• Repartition EMP and DEPT on deptno
• Join each partition

[Diagram: data flow operator (DFO) tree: each table scan feeds a hash redistribution send; the corresponding receives feed the hash join, whose result is sent to the query coordinator (QC)]
Parallel Hash-Join with 8 Slaves

[Diagram: the eight slaves are spread across Node 1 and Node 2, so the redistributed rows cross the interconnect]

Interconnect Can Become a Bottleneck

Pre-filtering can reduce communication.

[Diagram: DFO tree for the hash join with a shared Bloom filter: one input's scan creates the filter (Filter Create), the filter is shared across the slaves, and the other input's scan tests its rows against the filter (Filter Use) before sending them over the interconnect to the join]
11gR1: Extended to Serial Execution

[Diagram: serial plan with a local Bloom filter: the hash join's build side (Scan Dept, feeding a view with a group by) creates the filter (Filter Create), and the probe side (Scan Emp) tests its rows against the filter (Filter Use)]
Parallel Execution on RAC

• The Bloom filter needs to be merged across the cluster
  • A potentially costly operation
• Prior to merging, each node contains a private, incomplete Bloom filter
• Merging is done in parallel
  • Each producer splits its Bloom filter into pieces
  • Each piece is sent to a single consumer on each other node
  • Each consumer merges the received pieces into its local Bloom filter
• After merging, the Bloom filter is complete and can be used for filtering
Two Approaches to Parallelism and Partitioning

Shared everything:
• Parallel degree independent of the number of nodes
• Data partitioning independent of the number of nodes

Shared nothing:
• Static parallel degree dependent on the number of nodes
• Static data partitioning dependent on the number of nodes

[Diagram: shared everything: all nodes access data A-Z; shared nothing: each node owns one hash partition (Hash 1 through Hash 4)]
Oracle Advanced Compression

• Oracle 9i compresses data only during bulk load; useful for DW and ILM
• Oracle 11g also compresses with inserts and updates
• Trades some CPU for disk and I/O efficiency
• Compress large application tables
  • Transaction processing, data warehousing
• Compress all data types: structured and unstructured
• Savings cascade to all database copies: test, dev, standby, mirrors, archiving, backup, etc.
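A minimal sketch of both forms, using a hypothetical sales table; the 11gR1 keyword for OLTP compression was COMPRESS FOR ALL OPERATIONS (later releases call it COMPRESS FOR OLTP):

    -- Bulk-load (direct-path) compression, as available before 11g.
    CREATE TABLE sales_hist COMPRESS
    AS SELECT * FROM sales;

    -- 11g Advanced Compression: rows stay compressed through
    -- conventional inserts and updates as well.
    CREATE TABLE sales_oltp (
      sales_id     NUMBER,
      sales_date   DATE,
      sales_amount NUMBER
    ) COMPRESS FOR ALL OPERATIONS;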
Let’s Talk About
RAC & Data Warehouse
The Key Question

How should I design and configure my Oracle Data Warehouse?

Answer: It depends…
Few Large Nodes or Many Small Nodes?
Manageability

• Many nodes are more difficult to manage:
  • Increased maintenance
  • Performance problems are harder to diagnose
  • Statistics gathering is more challenging
• However, computing power lost during planned and unplanned outages has less impact:
  • 16 x 2 grid → 6% less power
  • 4 x 8 grid → 25% less power
• Many nodes are more flexible for distributing different workloads
Scalability: Scale-Out

• Easy scale-out
  • Simply add nodes with no reconfiguration of the database, but:
    • Keep a balanced system
    • Watch out for the number of slots in the switch
• We recommend adding only nodes with similar performance characteristics: CPUs, HBAs, NICs, etc.
• The scale-out increment is one node
  • 16 x 2 grid → 6% increase in computing power
  • 4 x 8 grid → 25% increase in computing power
How can I run different workload types?

• Managing and partitioning the workload using Services
  - Services provide a single system image for managing the workload
  - A service spans one or more instances of the database; an instance can support multiple services. The number of instances offering a service is managed by the DBA, independent of the application
  - How many services do I need to define?
  - How many instances will offer a service?
  - Are there services that should run on one instance for performance reasons (contention on resources, for example)?

• Managing the workload using Resource Manager (see the sketch below)
  - Oracle Database Resource Manager facilitates meeting SLAs and provides effective control of the system resources used by an Oracle database instance
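A minimal Resource Manager sketch; the consumer group, plan name, and CPU percentages are hypothetical and only illustrate the mechanism:

    BEGIN
      DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

      -- A consumer group for ETL sessions and a plan that caps its CPU share.
      DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('ETL_GROUP', 'ETL sessions');
      DBMS_RESOURCE_MANAGER.CREATE_PLAN('DW_PLAN', 'Daytime DW plan');

      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        plan => 'DW_PLAN', group_or_subplan => 'ETL_GROUP',
        comment => 'ETL limited to 30% CPU', cpu_p1 => 30);
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        plan => 'DW_PLAN', group_or_subplan => 'OTHER_GROUPS',
        comment => 'Everything else', cpu_p1 => 70);

      DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
      DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
    END;
    /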
What is the optimal partitioning strategy?

• Partitions are the foundation for achieving effective performance in a large or very large data warehouse, and other features depend on partitioning to deliver their benefits
• Important criteria when choosing a partitioning strategy:
  - Performance (the primary motivation)
  - Ease of administration/management
  - Data purge
  - Data archiving
  - Data movement
  - Data lifecycle management
  - Efficiency of backup
What is the optimal partitioning strategy?

• Grouping data by value for pruning (range, list)
• Balancing data distribution (hash partitioning)
• Dividing data across parallel processes to balance the workload (partition-wise joins)
• Combining different partitioning mechanisms (composite partitioning)
Which degree of parallelism?

• Different scenarios can be used for parallel query:
  - Standard use of parallel query for large data sets. In this scenario, the degree of parallelism can be defined to utilize all of the available resources across the cluster.
  - Use of restricted parallel query. This scenario restricts the processing to specific nodes in the cluster, so nodes can be logically grouped for specific types of operations. This can be done by using services and/or PARALLEL_INSTANCE_GROUP.
Which degree of parallelism?

• The downside of parallel operations is the exhaustion of server resources:
  - If I/O bottlenecks already exist, parallel operations may exacerbate them
  - If CPU utilization is relatively high on one node, using more instances for parallel query may help

• Parallel operations in a RAC environment provide the flexibility to utilize all of the server hardware that is part of the cluster architecture. Using instance groups, database administrators can further control the allocation of these resources based on application requirements or service level agreements.
Summary – DW on RAC Best Practices

• Design to support Business Needs
• Implement and test
• Partition the Data
• Partition the Workload
• Configure Parallel Query
• Measure and Monitor
