RAC Troubleshooting Diagnosability

Troubleshooting and Diagnosing Oracle
Database 12.2 and Oracle RAC

https://www.linkedin.com/in/raosandesh/
sandeshr
Sandesh Rao, Senior Director , RAC Development
Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracles products remains at the sole discretion of Oracle.
Copyright 2017, Oracle and/or its affiliates. All rights reserved. | Confidential Oracle Restricted 2
Common Questions
How do I contact you ?
Linkedin Sandesh Rao
Email Sandesh.rao@oracle.com
Where do I get your presentation ?
http://otnyathra.in/downloads/
Which books on RAC do I read for basics or internals ?

Oracle Database 11g Oracle Real Application Clusters Handbook, 2nd Edition (Oracle Press) 2nd Edition
Pro Oracle Database 11g RAC on Linux (Expert's Voice in Oracle) 2nd ed. Edition
Oracle 10g RAC Grid, Services and Clustering 1st Edition
Pro Oracle Database 10g RAC on Linux: Installation, Administration, and Performance (Expert's Voice in
Oracle) 1st Corrected ed., Corr. 3rd printing Edition
Oracle Database 12c Release 2 Oracle Real Application Clusters Handbook: Concepts, Administration, Tuning &
Troubleshooting (Oracle Press) 1st Edition
Documentation Autonomous Computing Guide , RAC Admin guide
Copyright 2017, Oracle and/or its affiliates. All rights reserved. | Confidential Oracle Internal/Restricted/Highly Restricted 3
Agenda
Architectural Overview
Troubleshooting Scenarios
Proactive and Reactive tools
Q&A

Grid Infrastructure Overview
Grid Infrastructure is the name for the combination of
Oracle Cluster Ready Services (CRS)
Oracle Automatic Storage Management (ASM)
The Grid Home contains the software for both products
CRS can also be Standalone for ASM and/or Oracle Restart
CRS can run by itself or in combination with other vendor clusterware
Grid Home and RDBMS home must be installed in different locations
The installer locks the Grid Home path by setting root permissions.

CRS requires shared Oracle Cluster Registry (OCR) and Voting files
Must be in ASM or CFS
OCR backed up every 4 hours automatically GIHOME/cdata
Kept 4,8,12 hours, 1 day, 1 week
Restored with ocrconfig
Voting file backed up into OCR at each change.
Voting file restored with crsctl

For network CRS requires
One/multiple high speed, low latency, redundant private network for inter node
communications
Think of interconnect as a memory backplane for the cluster
Should be a separate physical network or managed converged network
VLANS are supported
Used for :-
Clusterware messaging
RDBMS messaging and block transfer
ASM messaging
HANFS for block traffic

Only one set of Clusterware daemons can run on each node
The CRS stack is spawned from Oracle HA Services Daemon (ohasd)
On Unix ohasd runs out of inittab with respawn
A node can be evicted when deemed unhealthy
May require reboot but at least CRS stack restart (rebootless restart)
IPMI integration or diskmon in case of Exadata
CRS provides Cluster Time Synchronization services
Always runs but in observer mode if ntpd configured

Grid Infrastructure Processes
Agents change everything
Multi-threaded Daemons
Manage multiple resources and types
Implements entry points for multiple resource types
Start,stop check,clean,fail
oraagent, orarootagent, application agent, script agent, cssdagent
Single process started from init on Unix (ohasd)
Diagram below shows all core resources

Grid Infrastructure Processes Level 4a
Level 2a
Level 3
Level 0
Level 4b
Level 1
Level 2b

Init Scripts
/etc/init.d/ohasd ( location O/S dependent )
RC script with start and stop actions
Initiates Oracle Clusterware autostart
Control file coordinates with CRSCTL
/etc/init.d/init.ohasd ( location O/S dependent )
OHASD Framework Script runs from init/upstart
Control file coordinates with CRSCTL
Named pipe syncs with OHASD

Level 1: OHASD Spawns:

cssdagent - Agent responsible for spawning CSSD
orarootagent - Agent responsible for managing all root owned ohasd resources
oraagent - Agent responsible for managing all oracle owned ohasd resources
cssdmonitor - Monitors CSSD and node health (along with the cssdagent)
Level 2a: OHASD rootagent spawns:
CRSD - Primary daemon responsible for managing cluster resources.
CTSSD - Cluster Time Synchronization Services Daemon
Diskmon ( Exadata )
ACFS (ASM Cluster File System) Drivers

Level 2b: OHASD oraagent spawns:
MDNSD Multicast DNS daemon
GIPCD Grid IPC Daemon
GPNPD Grid Plug and Play Daemon
EVMD Event Monitor Daemon
ASM ASM instance started here as may be required by CRSD
Level 3: CRSD spawns:
orarootagent - Agent responsible for managing all root owned crsd resources.
oraagent - Agent responsible for managing all nonroot owned crsd resources.
One is spawned for every user that has CRS resources to manage.

Startup Sequence
Level 4: CRSD oraagent spawns:
ASM Resouce - ASM Instance(s) resource (proxy resource)
Diskgroup - Used for managing/monitoring ASM diskgroups.
DB Resource - Used for monitoring and managing the DB and instances
SCAN Listener - Listener for single client access name, listening on SCAN VIP
Listener - Node listener listening on the Node VIP
Services - Used for monitoring and managing services
ONS - Oracle Notification Service
eONS - Enhanced Oracle Notification Service ( pre 11.2.0.2 )
GSD - For 9i backward compatibility
GNS (optional) - Grid Naming Service - Performs name resolution

Oracle Flex Cluster
The standard going forward

(every Oracle 12c Rel. 2 cluster
is a Flex Cluster by default.)
Copyright 2017, Oracle and/or its affiliates. All rights reserved. | 15

Under the Hood: Any New Install Ends Up in a Flex Cluster
[GRID]> crsctl get cluster name

CRS-6724: Current cluster name is 'SolarCluster'
[GRID]> crsctl get cluster class

CRS-41008: Cluster class is 'Standalone Cluster'
[GRID]> crsctl get cluster type

CRS-6539: The cluster type is 'flex'.

1 2 Cluster Domain 3 4
Database Application Database Database
Member Cluster Member Cluster Member Cluster Member Cluster
Uses IO & ASM Uses ASM

Private Uses local ASM GI only Service of DSC Service
Network
SAN
NAS Domain Services Cluster

Mgmt Trace File Rapid Home Additional
Repository Analyzer Provisioning ASM
Optional IO Service
(GIMR) (TFA) (RHP) Service
Service Services
Service Service
Shared ASM

ASM Flex Diskgroups 1
Database-oriented Storage Management for more flexibility and availability
Pre-12.2 diskgroup Organization 12.2 Flex Diskgroup Organization
Shared resource Database-oriented
File Group resource management
management
Diskgroup Flex Diskgroup
DB1 : File 1 DB3 : File 3
DB1 DB2 DB3
File 1 File 1 File 1
File 2 File 2 File 2
DB3 : File 2 DB2 : File 3 File 3 File 3 File 3
DB2 : File 4 DB1 : File 2 File 4
Copyright 2017, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential Internal/Restricted/Highly Restricted 18
ASM Flex Diskgroups 2
Database-oriented Storage Management for more flexibility and availability
12.2 Flex Diskgroup Organization
Flex Diskgroups enable
Quota Management - limit the space
Flex Diskgroup databases can allocate in a diskgroup and
thereby improve the customers ability to
DB1 DB2 DB3 consolidate databases into fewer DGs
File 1 File 1 File 1 Redundancy Change utilize lower
redundancy for less critical databases
Quota File 2 File 2 File 2
File 3 Shadow Copies (split mirrors) to easily
File 3 File 3 DB3
and dynamically create database clones
File 4 File 1
for test/dev or production databases
File 2
File 3
Copyright 2017, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential Internal/Restricted/Highly Restricted 19
Node Weighting in Oracle RAC 12c Release 2
Idea: Everything equal, let the majority of work survive
Node Weighting is a new feature that considers
the workload hosted in the cluster during fencing
1 2 The idea is to let the majority of work survive,

if everything else is equal
Example: In a 2-node cluster, the node hosting the
majority of services (at fencing time) is meant to survive

CSS_CRITICAL Fencing with Manual Override
Node eviction
despite WL; WL Conflict.
will failover.

srvctl modify database -help
|grep critical

-css_critical {YES | NO}
Define whether the database
or service is CSS critical
crsctl set server

css_critical {YES|NO}
+ server restart
CSS_CRITICAL CSS_CRITICAL will be honored

can be set on various levels / if no other technical reason prohibits A fallback scheme is applied if
components to mark them as survival of the node which has at CSS_CRITICAL settings do not lead to
critical so that the cluster will try to least one critical component at the an actionable outcome.
preserve them in case of a failure. time of failure.

Proven Features Even More Beneficial on the DSC
Autonomous Health Framework

The DSC is the ideal hosting Oracle ASM 12c Rel. 2 based storage
(powered by machine learning)
environment for Rapid Home consolidation is best performed on
works more efficiently for you on the
Provisioning (RHP) enabling software the DSC, as it enables numerous
DSC, as continuous analysis is taken
fleet management. additional features and use cases.
off the production cluster.

Node Eviction Basics

Basic RAC Cluster with Oracle Clusterware
Public Lan Public Lan
Private Lan /
Interconnect
CSSD CSSD CSSD
SAN SAN
Network Voting Network
Disk

What does CSSD do?
CSSD monitors and evicts nodes
Monitors nodes using 2 communication channels:
Private Interconnect Network Heartbeat
Voting Disk based communication Disk Heartbeat
Evicts (forcibly removes nodes from a cluster)
nodes dependent on heartbeat feedback (failures)
CSSD Ping CSSD
Ping

Network Heartbeat
Interconnect basics
Each node in the cluster is pinged every second
Nodes must respond in css_misscount time (defaults to 30 secs.)
Reducing the css_misscount time is generally not supported
Network heartbeat failures will lead to node evictions

CSSD-log: [date / time] [CSSD][1111902528]clssnmPollingThread: node
mynodename (5) at 75% heartbeat fatal, removal in 6.770 seconds
CSSD Ping CSSD

Disk Heartbeat
Voting Disk basics Part 1
Each node in the cluster pings (r/w) the Voting Disk(s) every second
Nodes must receive a response in (long / short) diskTimeout time
I/O errors indicate clear accessibility problems timeout is irrelevant
Disk heartbeat failures will lead to node evictions

CSSD-log: [CSSD] [1115699552] >TRACE: clssnmReadDskHeartbeat:
node(2) is down. rcfg(1) wrtcnt(1) LATS(63436584) Disk lastSeqNo(1)
CSSD CSSD
Ping

Voting Disk Structure
Voting Disks contain dynamic and static data:
Dynamic data: disk heartbeat logging
Static data: information about the nodes in the cluster
With 11.2.0.1 Voting Disks got an identity:

E.g. Voting Disk serial number: [GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
Voting Disks must therefore not be copied using dd or cp anymore
Node information Disk Heartbeat Logging

Simple Majority Rule
Oracle supports redundant Voting Disks for disk failure protection
Simple Majority Rule applies:
Each node must see the simple majority of configured Voting Disks
at all times in order not to be evicted (to remain in the cluster)
trunc(n/2+1) with n=number of voting disks configured and n>=1
CSSD CSSD

Insertion 1: Simple Majority Rule
In extended Oracle clusters
http://www.oracle.com/goto/rac
Using standard NFS to support
a third voting file for extended
cluster configurations (PDF)
CSSD CSSD
Same principles apply

Voting Disks are just
geographically dispersed

Insertion 2: Voting Disk in Oracle ASM
The way of storing Voting Disks doesnt change its use
[GRID]> crsctl query css votedisk
1. 2 1212f9d6e85c4ff7bf80cc9e3f533cc1 (/dev/sdd5) [DATA]
2. 2 aafab95f9ef84f03bf6e26adc2a3b0e8 (/dev/sde5) [DATA]
3. 2 28dd4128f4a74f73bf8653dabd88c737 (/dev/sdd6) [DATA]
Located 3 voting disk(s).
Oracle ASM auto creates 1/3/5 Voting Files

Based on Ext/Normal/High redundancy
and on Failure Groups in the Disk Group
Per default there is one failure group per disk
ASM will enforce the required number of disks
New failure group type: Quorum Failgroup

Why are nodes evicted?
To prevent worse things from happening
Evicting (fencing) nodes is a preventive measure (a good thing)!
Nodes are evicted to prevent consequences of a split brain:
Shared data must not be written by independently operating nodes
The easiest way to prevent this is to forcibly remove a node from the cluster
1 2
CSSD CSSD

How are nodes evicted?
EXAMPLE: Heartbeat failure
The network heartbeat between nodes has failed
It is determined which nodes can still talk to each other
A kill request is sent to the node(s) to be evicted
Using all (remaining) communication channels Voting Disk(s)
A node is requested to kill itself; executer: typically CSSD
CSSD CSSD

Re-bootless Node
Fencing (restart)

Re-bootless Node Fencing (restart)
Fence the cluster, do not reboot the node
Until Oracle Clusterware 11.2.0.2, fencing meant re-boot
With Oracle Clusterware 11.2.0.2, re-boots will be seen less, because:
Re-boots affect applications that might run an a node, but are not protected
Customer requirement: prevent a reboot, just stop the cluster implemented...
Standalone Standalone
App X App Y
Oracle RAC Oracle RAC
DB Inst. 1 DB Inst. 2
CSSD CSSD

How it works
With Oracle Clusterware 11.2.0.2, re-boots will be seen less:
Instead of fast re-booting the node, a graceful shutdown of the stack is attempted
Then IO issuing processes are killed; it is made sure that no IO process remains
For a RAC DB mainly the log writer and the database writer are of concern
App X App Y
Oracle RAC
DB Inst. 1
CSSD CSSD

EXCEPTIONS
With Oracle Clusterware 11.2.0.2, re-boots will be seen less, unless:
IF the check for a successful kill of the IO processes fails reboot
IF CSSD gets killed during the operation reboot
IF cssdmonitor is not scheduled reboot
IF the stack cannot be shutdown in short_disk_timeout-seconds reboot
App X App Y
Oracle RAC Oracle RAC
DB Inst. 1 DB Inst. 2
CSSD CSSD

Cluster Startup Problem Triage (11.2+)
Startup ps ef|grep init.ohasd NO crsctl config crs NO Engage Oracle Support
Sequence ps ef|grep ohasd.bin Running? ohasd.log Obvious? TFA Collector
Engage Sysadmin Team
YES
YES
Engage Sysadmin Team
ps ef|grep cssdagent
Cluster Startup ps ef|grep ocssd.bin

ps ef|grep orarootagent NO
ohasd.log YES Engage
ps ef|grep ctssd.bin Running? agent logs Obvious?
Diagnostic Flow ps ef|grep crsd.bin
ps ef|grep cssdmonitor
process logs
Sysadmin Team
ps ef|grep oraagent NO
YES
ps ef|grep ora.asm
Engage
ps ef|grep gpnpd.bin
TFA Collector Oracle Support
ps ef|grep mdnsd.bin ohasd.log Sysadmin Team
ps ef|grep evmd.bin OLR perms
Crsctl check crs Compare reference system
Crsctl check cluster
Engage NO YES Engage

Oracle Support TFA Collector Obvious?
Sysadmin Team
Sysadmin Team

Cluster Startup Problem Triage
Multicast Domain Name Service Daemon (mDNS(d))

Used by Grid Plug and Play to locate profiles in the cluster, as well as by GNS to perform
name resolution. The mDNS process is a background process on Linux and UNIX and on
Windows.
Uses multicast for cache updates on service advertisement arrival/departure.
Advertises/serves on all found node interfaces.
Log is GI_HOME/log/<node>/mdnsd/mdnsd.log

<?xml version="1.0" encoding="UTF-8"?>
<gpnp:GPnP-Profile Version="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:gpnp="http://www.grid-
pnp.org/2005/11/gpnp-profile" xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile
gpnp-profile.xsd" ProfileSequence="6" ClusterUId="b1eec1fcdd355f2bbf7910ce9cc4a228" ClusterName="staij-cluster"
PALocation="">
<gpnp:Network-Profile><gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP=192.168.1.0" Adapter="eth0" Use="public"/>
<gpnp:Network id="net2" IP=192.168.2.0" Adapter="eth1 Use="cluster_interconnect"/>
</gpnp:HostNetworkcss"></gpnp:Network-Profile>
<orcl:CSS-Profile id=" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="" SPFile="+SYSTEM/staij-cluster/asmparameterfile/registry.253.693925293"/>
<ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><ds:SignedInfo><ds:CanonicalizationMethod
Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/><ds:SignatureMethod Algorithm="http://www.w3.org/2001/10/xml-
exc-c14n#"> <InclusiveNamespaces xmlns="http://www.w3.org/2001/10/xml-exc-c14n#" PrefixList="gpnp orcl
xsi"/></ds:Transform></ds:Transforms><ds:DigestMethod
Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/><ds:DigestValue>x1H9LWjyNyMn6BsOykHhMvxnP8U=</ds:DigestValue
></ds:Reference></ds:SignedInfo><ds:SignatureValue>N+20jG4=</ds:SignatureValue></ds:Signature>
</gpnp:GPnP-Profile>

cssd agent and monitor
Same functionality in both agent and monitor
Functionality of several pre-11.2 daemons consolidated in both
OPROCD system hang
OMON oracle clusterware monitor
VMON vendor clusterware monitor
Run realtime with locked down memory, like CSSD
Provides enhanced stability and diagnosability
Logs are
GI_HOME/log/<node>/agent/oracssdagent_root/oracssdagent_root.log
GI_HOME/log/<node>/agent/oracssdmonitor_root/oracssdmonitor_root.log
12c ORACLE_BASE/diag/node/agent/..

Node Evictions
NHB?
1050693.1 Engage
1534949.1 YES YES
Eviction 1531223.1 Resource NO Cluster alert
1546004.1
Obvious? networking team
1328466.1 ocssd.log
Scenario System log
Starvation?
NO
NO
YES TFA Collector
Free memory? Engage storage

CPU load? team
Node Response? Engage DHB? NO
appropriate 1549428.1
team 1466639.1 YES YES
Obvious?
Engage
NO
Node Eviction Resolved?
NO
Oracle
Support
Diagnostic Flow Fenced? YES

NO
YES
YES Resource starvation

Engage
sysadmin
NO team
TFA Collector

Missing Network Heartbeat (1)
ocssd.log from node 1
===> sending network heartbeats other nodes. Normally, this message is output once every 5 messages (seconds)
2016-08-13 17:00:20.023: [ CSSD][4096109472]clssnmSendingThread: sending status msg to all nodes
2016-08-13 17:00:20.023: [ CSSD][4096109472]clssnmSendingThread: sent 5 status msgs to all nodes
===> The network heartbeat is not received from node 2 (drrac2) for 15 consecutive seconds.
===> This means that 15 network heartbeats are missing and is the first warning (50% threshold).
2016-08-13 17:00:22.818: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) at 50% heartbeat fatal, removal in 14.520
seconds
2016-08-13 17:00:22.818: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) is impending reconfig, flag 132108,
misstime 15480
===> continuing to send the network heartbeats and log messages once every 5 messages
===> 75% threshold of missing network heartbeat is reached. This is second warning.
seconds

===> continuing to send the network heartbeats and log messages once every 5 messages
===> continuing to send the network heartbeats, but the message is logged after 4 messages
===> Last warning shows that 90% threshold of the missing network heartbeat is reached.
===> The eviction will occur in 2.49 seconds.
2016-08-13 17:00:34.841: [ CSSD][4106599328]clssnmPollingThread: node drrac2 (2) at 90% heartbeat fatal, removal in
2.490 seconds, seedhbimpd 1
===> Eviction of node 2 (drrac2) started
2016-08-13 17:00:37.337: [ CSSD][4106599328]clssnmPollingThread: Removal started for node drrac2 (2), flags 0x2040c,
state 3, wt4c 0
===> This shows that the node 2 is actively updating the voting disks
2016-08-13 17:00:37.340: [ CSSD][4085619616]clssnmCheckSplit: Node 2, drrac2, is alive, DHB (1281744040, 1396854)
more than disk timeout of 27000 after the last NHB (1281744011, 1367154)

===> Evicting node 2 (drrac2)
2016-08-13 17:00:37.340: [ CSSD][4085619616](:CSSNM00007:)clssnmrEvict: Evicting node 2, drrac2, from the cluster in
incarnation 169934272, node birth incarnation 169934271, death incarnation 169934272, stateflags 0x24000
===> Reconfigured the cluster without node 2

2016-08-13 17:01:07.705: [ CSSD][4043389856]clssgmCMReconfig: reconfiguration successful, incarnation 169934272 with 1
nodes, local node number 1, master node number 1

ocssd.log from node 2:
===> Logging the message to indicate 5 network heartbeats are sent to other nodes
===> First warning of reaching 50% threshold of missing network heartbeats
seconds
2016-08-13 17:00:26.213: [ CSSD][4073040800]clssnmPollingThread: node drrac1 (1) is impending reconfig, flag 394254,
misstime 15460
===> Second warning of reaching 75% threshold of missing network heartbeats
seconds

===> Logging the message to indicate 4 network heartbeats are sent
===> Third warning of reaching 90% threshold of missing network heartbeats
2016-08-13 17:00:38.236: [ CSSD][4073040800]clssnmPollingThread: node drrac1 (1) at 90% heartbeat fatal, removal in
2.460 seconds, seedhbimpd 1
===> Eviction started for node 1 (drrac1)
2016-08-13 17:00:40.702: [ CSSD][4073040800]clssnmPollingThread: Removal started for node drrac1 (1), flags 0x6040e,
state 3, wt4c 0
===> Node 1 is actively updating the voting disk, so this is a split brain condition
2016-08-13 17:00:40.706: [ CSSD][4052061088]clssnmCheckSplit: Node 1, drrac1, is alive, DHB (1281744036, 1243744)
more than disk timeout of 27000 after the last NHB (1281744007, 1214144)
2016-08-13 17:00:40.706: [ CSSD][4052061088]clssnmCheckDskInfo: My cohort: 2
2016-08-13 17:00:40.707: [ CSSD][4052061088]clssnmCheckDskInfo: Surviving cohort: 1
===> Node 2 is aborting itself to resolve the split brain and ensure the cluster integrity
2016-08-13 17:00:40.707: [ CSSD][4052061088](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
Cohort of 1 nodes with leader 2, drrac2, is smaller than cohort of 1 nodes led by node 1, drrac1, based on map type 2
2016-08-13 17:00:40.707: [ CSSD][4052061088]###################################
2016-08-13 17:00:40.707: [ CSSD][4052061088]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2016-08-13 17:00:40.707: [ CSSD][4052061088]###################################

Observations
1. Both nodes reported missing heartbeats at the same time
2. Both nodes sent heartbeats to other nodes all the time
3. Node 2 aborted itself to resolve split brain
Conclusion
1. This is likely a network problem, engage network team
2. Check OSWatcheroutput (netstat and traceroute)
1. Configure private.net file, not configured by default
3. Check CHM
4. Check system log

Voting Disk Access Problem (1)
ocssd.log:
===> The first error indicating that it could not read voting disk -- first message to indicate a
problem accessing the voting disk
2016-08-13 18:31:19.787: [ SKGFD][4131736480]ERROR: -9(Error 27072, OS Error (Linux
Error: 5: Input/output error
Additional information: 4
Additional information: -1)
)
2016-08-13 18:31:19.787: [ CSSD][4131736480](:CSSNM00060:)clssnmvReadBlocks: read
failed at offset 529 of /dev/sdb8
2016-08-13 18:31:19.802: [ CSSD][4131736480]clssnmvDiskAvailabilityChange: voting file
/dev/sdb8 now offline
====> The error message that shows a problem accessing the voting disk repeats once every 4 seconds
2016-08-13 18:31:23.782: [ CSSD][150477728]clssnmvDiskOpen: Opening /dev/sdb8
2016-08-13 18:31:23.782: [ SKGFD][150477728]Handle 0xf43fc6c8 from lib :UFS:: for disk :/dev/sdb8:
2016-08-13 18:31:23.782: [ CLSF][150477728]Opened hdl:0xf4365708 for dev:/dev/sdb8:
2016-08-13 18:31:23.787: [ SKGFD][150477728]ERROR: -9(Error 27072, OS Error (Linux Error: 5:
Input/output error
)
2016-08-13 18:31:23.787: [ CSSD][150477728](:CSSNM00060:)clssnmvReadBlocks: read failed at offset 17
of /dev/sdb8

====> The last error that shows a problem accessing the voting disk.
====> Note that the last message is 200 seconds after the first message
====> because the long disktimeout is 200 seconds
2016-08-13 18:34:37.423: [ CSSD][150477728]clssnmvDiskOpen: Opening /dev/sdb8
2016-08-13 18:34:37.423: [ CLSF][150477728]Opened hdl:0xf4336530 for dev:/dev/sdb8:
2016-08-13 18:34:37.429: [ SKGFD][150477728]ERROR: -9(Error 27072, OS Error (Linux Error: 5:
Input/output error
)
2016-08-13 18:34:37.429: [ CSSD][150477728](:CSSNM00060:)clssnmvReadBlocks: read failed at offset 17
of /dev/sdb8

====> This message shows that ocssd.bin tried accessing the voting disk for 200 seconds
2016-08-13 18:34:38.205: [ CSSD][4110736288](:CSSNM00058:)clssnmvDiskCheck: No I/O completions for
200880 ms for voting file /dev/sdb8)
====> ocssd.bin aborts itself with an error message that the majority of voting disks are not available. In
this case, there was only one voting disk, but if three voting disks were available, as long as two
voting disks are accessible, ocssd.bin will not abort.
2016-08-13 18:34:38.206: [ CSSD][4110736288](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 1
configured voting disks available, need 1
2016-08-13 18:34:38.206: [ CSSD][4110736288]###################################
2016-08-13 18:34:38.206: [ CSSD][4110736288]clssscExit: CSSD aborting from thread
clssnmvDiskPingMonitorThread
2016-08-13 18:34:38.206: [ CSSD][4110736288]###################################
Conclusion
The voting disk was not available, engage storage team

Node Eviction Triage
Time synchronisation issue

Cluster Time Synchronisation Services daemon
Provides time management in a cluster for Oracle.
Observer mode when Vendor time synchronisation s/w is found
Logs time difference to the CRS alert log
Active mode when no Vendor time sync s/w is found

Cluster Ready Services Daemon
The CRSD daemon is primarily responsible for maintaining the availability of application
resources, such as database instances. CRSD is responsible for starting and stopping these
resources, relocating them when required to another node in the event of failure, and
maintaining the resource profiles in the OCR (Oracle Cluster Registry). In addition, CRSD is
responsible for overseeing the caching of the OCR for faster access, and also backing up the
OCR.
Log file is GI_HOME/log/<node>/crsd/crsd.log
Rotation policy 10-50M
Retention policy 10 logs
Dynamic in 12.1 and can be changed

CRSD oraagent
CRSDs oraagent manages
all database, instance, service and diskgroup resources
node listeners
SCAN listeners, and ONS
If the Grid Infrastructure owner is different from the RDBMS home owner then you would
have 2 oraagents each running as one of the installation owners. The database, and service
resources would be managed by the RDBMS home owner and other resources by the Grid
Infrastructure home owner.
Log file is
GI_HOME/log/<node>/agent/crsd/oraagent_<user>/oraagent_<user>.log

CRSD orarootagent
CRSDs rootagent manages
GNS and its VIP
Node VIP
SCAN VIP
network resources.
Log file is
GI_HOME/log/<node>/agent/crsd/orarootagent_root/oraagent_root.log

Agent return codes
Check entry must return one of the following return codes:
ONLINE
UNPLANNED_OFFLINE
Target=online, may be recovered failed over
PLANNED_OFFLINE
UNKNOWN
Cannot determine, if previously online, partial then monitor
PARTIAL
Some of a resources services are available. Instance up but not open.
FAILED
Requires clean action

Automatic Diagnostic Repository (ADR)
Important logs and traces

11.2 Databases only use ADR
Grid Infrastructure files in $GI_HOME/log/<node_name>/<component_name>
$GI_HOME/log/myHost/cssd
$GI_HOME/log/myHost/alertmyHost.log
12c Grid Infrastructure and Database use ADR
Different locations for Grid Infrastructure and Databases
Grid Infrastructure
Alert.log, cssd.log, csrd.log, etc
Databases
Alert.log, background process traces, foreground process traces

Oracles Database and Clusterware Tools
What if issues were detected before they
had an impact? Hang
Manager
Trace File
Analyzer
What if you were notified with a specific Quality of
Service
diagnosis and corrective actions? Management
Cluster
What if resource bottlenecks threatening Health
SLAs were identified early? EXAchk Advisor
What if bottlenecks could be Memory

Guard
automatically relieved just in time?
Cluster ORAchk
What if database hangs and node reboots Health
Monitor
could be eliminated? Cluster
Verification
Utility
Maintains Compliance
with Best Practices and
Alerts Vulnerabilities to
Known Issues
Oracle 12c ORAchk & EXAchk

Why Oracle ORAchk & EXAchk
Automatic proactive warning Health checks for most impactful Runs in your environment
of problems before they reoccurring problems with no need to send
impact you anything to Oracle
Get scheduled health reports Findings can be integrated

sent to you in email Engineered into other tools of choice
EXAchk
Systems
Common Framework
Non
Engineered ORAchk
Systems

Oracle Stack Coverage
Oracle Engineered Systems Oracle Database Oracle E-Business Suite
Oracle Database Appliance Standalone Database Oracle Payables
o Oracle Exadata Database Machine Grid Infrastructure & RAC Oracle Workflow
o Oracle SuperCluster / MiniCluster Maximum Availability Architecture (MAA) Oracle Purchasing
Scorecard Oracle Order Management
o Oracle Private Cloud Appliance
Upgrade Readiness Validation Oracle Process Manufacturing
o Oracle Big Data Appliance
Golden Gate Oracle Receivables
o Oracle Exalogic Elastic Cloud Oracle Restart Oracle Fixed Assets
o Oracle Exalytics In-Memory Machine Oracle Enterprise Manager Cloud Control Oracle HCM
o Oracle Zero Data Loss Recovery Appliance Repository Oracle CRM
Oracle ASR Agent Oracle Project Billing
OMS Oracle Siebel
Oracle Systems
Oracle Solaris Oracle Middleware Database best practices
Cross stack checks Application Continuity Oracle PeopleSoft
Oracle Identify and Access Management Database best practices
Solaris Cluster
Suite (Oracle IAM)
OVN Oracle SAP
EXAdata best practices

Profiles Profile
asm ASM Checks
Description
avdf Audit Vault Configuration checks

Profiles provide logical grouping of clusterware
control_VM
Oracle clusterware checks
Checks only for Control VM(ec1-vm, ovmm, db, pc1, pc2).
checks which are about similar topics No cross node checks
corroborate Exadata checks needs further review by user to determine
Run only checks in a specific profile pass or fail
dba DBA Checks
./exachk profile <profile> ebs Oracle E-Business Suite checks
eci_healthchecks Enterprise Cloud Infrastructure Healthchecks
Run everything except checks in a specific ecs_healthchecks Enterprise Cloud System Healthchecks
profile goldengate Oracle GoldenGate checks
hardware Hardware specific checks for Oracle Engineered systems
./exachk excludeprofile <profile> maa Maximum Availability Architecture Checks
ovn Oracle Virtual Networking
platinum Platinum certification checks
preinstall Pre-installation checks
prepatch Checks to execute before patching
security Security checks
solaris_cluster Solaris Cluster Checks
storage Oracle Storage Server Checks
switch Infiniband switch checks
sysadmin Sysadmin checks
user_defined_checks Run user defined checks from user_defined_checks.xml

Profiles Profile
asm ASM Checks
Description
bi_middleware Oracle Business Intelligence checks

Profiles provide logical grouping of clusterware
dba
Oracle clusterware checks
DBA Checks
checks which are about similar topics ebs Oracle E-Business Suite checks
emagent Cloud control agent checks
Run only checks in a specific profile emoms Cloud Control management server
em Cloud control checks
./orachk profile <profile> goldengate Oracle GoldenGate checks
hardware Hardware specific checks for Oracle Engineered systems
Run everything except checks in a specific oam Oracle Access Manager checks
profile oim Oracle Identify Manager checks
oud Oracle Unified Directory server checks
./orachk excludeprofile <profile> ovn Oracle Virtual Networking
peoplesoft Peoplesoft best practices
preinstall Pre-installation checks
prepatch Checks to execute before patching
security Security checks
siebel Siebel Checks
solaris_cluster Solaris Cluster Checks
storage Oracle Storage Server Checks
switch Infiniband switch checks
sysadmin Sysadmin checks
user_defined_checks Run user defined checks from user_defined_checks.xml

Keep Track of Changes to the Attributes of Important Files
Track changes to the attributes of important files with fileattr
Looks at all files & directories within Grid Infrastructure and Database homes by default
The list of monitored directories and their contents can be configured to your specific requirements
Use fileattr start to start the first snapshot ./orachk fileattr start
$ ./orachk -fileattr start

CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to
/u01/app/11.2.0.4/grid?[y/n][y]
Checking ssh user equivalency settings on all nodes in cluster
Node mysrv22 is configured for ssh user equivalency for oradb user
Node mysrv23 is configured for ssh user equivalency for oradb user
List of directories(recursive) for checking file attributes:
/u01/app/oradb/product/11.2.0/dbhome_11203
/u01/app/oradb/product/11.2.0/dbhome_11204
orachk has taken snapshot of file attributes for above directories at:
/orahome/oradb/orachk/orachk_mysrv21_20170504_041214

Keep Track of Changes to the Attributes of Important Files
Compare current attributes against first snapshot using fileattr check
./orachk fileattr check
$ ./orachk -fileattr check -includedir "/root/myapp/config" -excludediscovery
CRS stack is running and CRS_HOME is not set. Do you want to set CRS_HOME to
/u01/app/12.2.0/grid?[y/n][y]
Checking for prompts on myserver18 for oragrid user...
Checking ssh user equivalency settings on all nodes in cluster
Node myserver17 is configured for ssh user equivalency for root user
List of directories(recursive) for checking file attributes:
Results of snapshot comparison will also
/root/myapp/config be shown in the HTML report output
Checking file attribute changes...
.
"/root/myapp/config/myappconfig.xml" is different:
Baseline : 0644 oracle root /root/myapp/config/myappconfig.xml
Current : 0644 root root /root/myapp/config/myappconfig.xml
etc
etc
Note:
Use the same arguments with check that you used with start
Will proceed to perform standard health checks after attribute checking
File Attribute Changes will also show in HTML report output

Improve performance of SQL queries
Many new checks focus on known issues in 12c All contained in the dba profile:
Optimizer as well as SQL Plan Management -profile dba
These checks target problems such as:

Wrong results returned
High memory & CPU usage
Errors such as ORA-00600 or ORA-07445
Issues with cursor usage
Other general SQL plan management problems

Oracle Database Security Assessment Tool (DBSAT) included
DBSAT analyzes
database
configurations and
security policies
Uncovers security
risks
Improves the security
posture of Oracle
Databases
All results included within report output under the check:

Validate database security configuration using database security assessment tool

Upgrade to Database 12.2 with confidence
New checks to help when upgrading the database
to 12.2
Both pre and post upgrade verification to prevent
problems related to:
OS configuration
Grid Infrastructure & Database patch prerequisites
Database configuration
Cluster configuration
Pre upgrade -u o pre
Post upgrade -u o post

Oracle Health Checks Collection Manager
New Collection Manager
app built on APEX 5
theme
Tabs replaced with drop
down menus for easier
navigation
ORAchk & EXAchk
continue to ship with
APEX 4 app too
No more new
functionality in the APEX
4 app, all new features
will go into the APEX 5
app

Enterprise Manager Integration
Related checks grouped into View targets checked, violations &

compliance standards average score
Check results integrated into EM

compliance framework via plugin
View results in native EM
compliance dashboards
Drill down into compliance standard View break down by target
to see individual check results

Provision
Use Enterprise Manager provisioning After selected this will launch the
feature and select ORAchk/EXAchk provisioning wizard, choose the system
type

View Results by Compliance Standard
Drill into applicable standard and view
individual checks & target status
Filter by Exachk%
Click individual checks for

recommendation details

JSON Output to Integrate with Kibana, Elastic Search etc
The JSON provides many tags to
allow dashboard filtering based on
facts such as:
Engineered System type
Engineered System version
Hardware type
Node name
OS version
Rack identifier
Rack type
Database version
And more...
Kibana can be used to view health
check compliance across your data
center
Results can also be filtered based
on any combination of exposed
system attributes

JSON Output to Integrate with Kibana, Elastic Search etc

Speeds Issue Diagnosis,
Triage and Resolution
Oracle 12c Trace File Analyzer
Copyright 2017, Oracle and/or its affiliates. All rights reserved. | Confidential Oracle Internal 77
Why TFA?
Collects data across the

Provides one interface for
cluster and consolidates it
all diagnostic needs
in one place
Reduces time required to

Collects all relevant
obtain diagnostic data,
diagnostic data at the time
which saves your business
of the problem
money
Copyright 2017, Oracle and/or its affiliates. All rights reserved. | Confidential Oracle Internal 78
Supported Platforms and Versions
All major Operating Systems are All Oracle Database & Grid versions
supported 10.2+ are supported
Linux (OEL, RedHat, SUSE, Itanium &
zLinux) You probably already have TFA
Oracle Solaris (SPARC & x86-64) installed as it is included with:
AIX Oracle Grid
Oracle Database
Infrastructure
HPUX (Itanium & PA-RISC) 11.2.0.4+
Windows 12.1.0.2+ 12.2.0.1+
12.2.0.1+
Updated quarterly via 1513912.1

OS versions supported are the same as those supported by the Database
Java Runtime Edition 1.8 required

Linux / Unix Installation
Root / Daemon Install Non root / Non Daemon Install
1. Download from 1513912.1 1. Download from 1513912.1
2. Copy to one required machine and unzip 2. Copy to every required machine and unzip
3. Run ./installTFA<platform> 3. Run ./installTFA<platform>
Will : -extractto <install_dir>

Will: -javahome <jre_home>
Install on all nodes
Only install on current host
Auto discover relevant Oracle Software & Exadata
Storage Servers Not do automatic collections
Start monitoring for problems & perform auto Not collect from remote hosts
collections Not collect files unreadable by install user
Recommended install location: /opt/oracle.tfa

Architecture
TFA daemon runs on each cluster
node
Remote
TFA Cluster
Node
n
Remote
Daemon TFA
Or single instance when no
Node
Daemon
Remote Grid Infrastructure is used
2 TFA
Node TFA
Scripts 1 Daemon Daemon Command line communication is
via tfactl command
Scripts
Alerts &
Scripts Log files TFA Daemons on all nodes
Alerts &
Log files
coordinate:
Scripts Script execution
tfactl Collection of diagnostics
Cluster Trimming of log contents
wide
Initiator Node Collection
( Where command originated) Cluster wide collection output is
consolidated on one node
The daemon is only used when installed as root

Automatic Diagnostic Collections
Oracle Trace File Analyzer
DBA(s) / Sys Admin(s)
1
Automatically
detect event
Oracle Grid Infrastructure
& Database(s)
2 4
Collect & package Upload collection
relevant to Oracle Support
diagnostics for further help
Significant 3 Notify
problem occurs
relevant DBA and
or Sys Admin by
email

Command Interfaces
Command line Shell Menu
Specify all command options at 1. Set and change context 1. Select menu navigation
the command line options then choose the
2. Run commands from within command you want to run
tfactl <command> the shell
tfactl menu
tfactl
tfaclt > database MyDB
MyDB tfactl > oratop

Maintain
Option 1 Option 2
Applying standard PSUs will To update with latest TFA & Support
automatically update TFA Tools Bundle
PSUs do not contain Support Tools 1. Download latest version: 1513912.1
Bundle updates 2. Repeat the same installation steps
Upgrade to the latest version whenever possible to include bug fixes, new features & optimizations

View System & Cluster Summary
Choose an option to drill

down further
Quick summary of status of

key components

Summary ASM Drill Down Example
ASM Overview
ASM cluster wide summary
Problems found
ASM Cluster wide status Problems found on myserver69
Also disk space warning on both servers

Summary ASM Drill Down Example
View ASM problems for myserver69
View node wise & drill into

myserver69
View ASM status summary

for myserver69
View recent problems detected

View component status

Investigate Logs & Look for Errors
Analyze all important recent log entries: Search recent log entries:
tfactl analyze last 1d tfactl analyze -search ora-00600" -last 8h
Searching for
ora-00600

Perform Analysis Using the Included Tools
Tool Description Tool Description
orachk or Provides health checks for the Oracle stack. grep Search alert or trace files with a given database and file name pattern, for
exachk Oracle Trace File Analyzer will install either a search string.
Oracle EXAchk for Engineered Systems, see document 1070954.1 for
more details summary Provides high level summary of the configuration
or vi Opens alert or trace files for viewing a given database and file name
Oracle ORAchk for all non-Engineered Systems, see document pattern in the vi editor
1268927.2 for more details
tail Runs a tail on an alert or trace files for a given database and file name
oswatcher Collects and archives OS metrics. These are useful for instance or node pattern
evictions & performance Issues. See document 301137.1 for more details
param Shows all database and OS parameters that match a specified pattern
procwatcher Automates & captures database performance diagnostics and session level
dbglevel Sets and unsets multiple CRS trace levels with one command
hang information. See document 459694.1 for more details
history Shows the shell history for the tfactl shell
oratop Provides near real-time database monitoring. See document 1500864.1
for more details. changes Reports changes in the system setup over a given time period. This
sqlt Captures SQL trace data useful for tuning. See document 215187.1 for includes database parameters, OS parameters and patches applied
more details. calog Reports major events from the Cluster Event log
alertsummary Provides summary of events for one or more database or ASM alert files events Reports warnings and errors seen in the logs
from all nodes
managelogs Shows disk space usage and purges ADR log and trace files
ls Lists all files TFA knows about for a given file name pattern across all nodes
pstack Generate process stack for specified processes across all nodes ps Finds processes
triage Summarize oswatcher/exawatcher data
Not all tools are included in Grid or Database install.
Download from 1513912.1 to get full collection of tools Verify which tools you have installed: tfactl toolstatus

OS Watcher (Support Tools Bundle)
Collect & Archive OS Metrics

Executes standard UNIX utilities (e.g. vmstat, iostat, ps,
etc) on regular intervals
Built in Analyzer functionality to summarize, graph and
report upon collected metrics
Output is Required for node reboot and performance
issues
Simple to install, extremely lightweight
Runs on ALL platforms (Except Windows)
MOS Note: 301137.1 OS Watcher Users Guide

Procwatcher (Support Tools Bundle)
Monitor & Examine Database Processes

Single instance & RAC
Generates session wait, lock and latch reports as well as call stacks
from any problem process(s)
Ability to collect stack traces of specific processes using Oracle Tools
and OS Debuggers
Typically reduces SR resolution for performance related issues
Runs on ALL major UNIX Platforms
MOS Note: 459694.1 Procwatcher Install Guide

oratop (Support Tools Bundle)
Near Real-Time Database Monitoring

Single instance & RAC
Monitoring current database activities
Database performance
Identifying contentions and bottleneck

Analyze
Each tool can be run using tfactl in shell mode
Start tfactl shell with tfactl
Run a tool with the tool name tfactl > orachk
1. Where necessary set context with database <dbname> tfactl > database MyDB
2. Then run tool MyDB tfactl > oratop
3. Clear context with database MyDB tfactl > database

One Command SRDCs
For certain types of problems
Oracle Support will ask you to
run a Service Request Data
Collection (SRDC)
Previously this would have
involved:
Reading many different
support documents
Collecting output from
many different tasks
Gathering lots of different
diagnostics
Packaging & uploading
Now just run:
tfactl diagcollect -srdc <srdc_type>

Faster & Easier SR Data Collection
tfactl diagcollect srdc <srdc_type>
Type of Problem SRDC Types Collection Scope

ORA-00600
ORA-00700 ORA-27300
ORA Errors ORA-04030 ORA-27301 Local only
ORA-04031 ORA-27302
ORA-07445
Other internal database errors internalerror Local only
Database performance problems dbperf Cluster wide
dbpatchinstall New
Database patching problems Local only
dbpatchconflict New
dbinstall New
Database install / upgrade problems Local only
dbupgrade New
Enterprise Manager tablespace usage metric problems emtbsmetrics New Local only (on EM Agent target)
emdebugon New
Enterprise Manager general metrics page or threshold Local only (on EM Agent target & OMS)
emdebugoff New
problems - Run all three SRDCs
emmetricalert New Local only (on EM Agent target & Repository DB)

One Command SRDCs Examples of Whats Collected
ORA4031: Database Performance
tfactl diagcollect srdc ora4031 tfactl diagcollect srdc dbperf
1. IPS Package 1. ADDM report

2. Patch Listing 2. AWR for good and problem period
3. AWR report 3. AWR Compare Period report
4. Memory information 4. ASH report for good and problem period
5. RDA 5. OS Watcher
6. IPS Package (if errors during problem
period)
7. ORAchk (performance related checks)

Manual Data Gathering vs One Command SRDC
Manual Data Gathering TFA SRDC
1. Generate ADDM reviewing Document 1680075.1 1. Run tfactl diagcollect srdc dbperf
2. Identify good and problem periods and gather AWR 2. Upload resulting zip file to SR
reviewing Document 1903158.1
3. Generate AWR compare report (awrddrpt.sql) using good
and problem periods
4. Generate ASH report for good and problem periods
reviewing Document 1903145.1
5. Collect OSWatcher data reviewing Document 301137.1
6. Check alert.log if there are any errors during the problem
period
7. Find any trace files generated during the problem period
8. Collate and upload all the above files/outputs to SR

One Command SRDC
Interactive Mode
tfactl diagcollect srdc <srdc_type>
4. All required files are

identified
5. Trimmed where
applicable
6. Package in a zip ready

to provide to support
1. Enter default for event date/time and database name
2. Scans system to identify recent 10 events in the system (ORA600

example shown)
3. Once the relevant event is chosen, proceeds with diagnostic

collection

One Command SRDC
Silent Mode tfactl diagcollect srdc <srdc_type> -database <db> -for <time>
1. Parameters(date/time, DB name) are provided

in the command
2. Does not prompt for any more information
3. All required files are identified
4. Trimmed where applicable
5. Package in a zip ready to provide to support

Default Collection
Run a default diagnostic
collection if there is not
yet an SRDC about your
problem:
tfactl diagcollect
Will trim & collect all

important log files
updated in the past 12
hours:
Collections stored in the
repository directory
Change diagcollect
timeframe with
last <n>h|d
Automatic Database Log Purge
TFA can automatically purge database logs
OFF by default
Except on a Domain Service Cluster (DSC),
which it is ON by default
Turn auto purging on or off: tfactl set manageLogsAutoPurge=<ON|OFF>
Will remove logs older than 30 days

configurable with: tfactl set manageLogsAutoPurgePolicyAge=<n><d|h>
Purging runs every 60 minutes

configurable with: tfactl set manageLogsAutoPurgeInterval=<minutes>
Manual Database Log Purge
TFA can manage ADR log and trace files
Show disk space usage of individual diagnostic destinations
Purge these file types based on diagnostic location and or age:
"ALERT, "INCIDENT, "TRACE, "CDUMP, "HM, "UTSCDMP, "LOG
tfactl managelogs <options>
Option Description
Runs as the ADR home
show usage Shows disk space usage per diagnostic directory for both GI and database logs owner. So will only be able
-show variation older <n><m|h|d> Use to determine per directory disk space growth. to purge files this owner
Shows the disk usage variation for the specified period per directory. has permission to delete
-purge older <n><m|h|d> Remove all ADR files under the GI_BASE directory, which are older than the time specified
gi Restrict command to only diagnostic files under the GI_BASE
database [all | dbname] Restrict command to only diagnostic files under the database directory. Defaults to all,
alternatively specify a database name
-dryrun Use with purge to estimate how many files will be affected and how much disk space will be May take a while for a
freed by a potential purge command. large number of files
tfactl managelogs show usage tfactl managelogs show variation older <n><m|h|d>
Use -gi to only

show grid
infrastructure
Use database to only

show database
tfactl managelogs purge older n<m|h|d> -dryrun tfactl managelogs purge older n<m|h|d>
Use dryrun
for a what if
Disk Usage Snapshots
TFA will track disk usage and record snapshots to:
tfa/repository/suptools/<node>/managelogs/usage_snapshot/
Snapshot happens every 60 minutes, configurable with:
tfactl set diskUsageMonInterval=<minutes>
Disk usage monitoring is ON by default, configurable with:

tfactl set diskUsageMon=<ON|OFF>
Collect
Trim & collect all important log files updated in Collect a problem specific Service Request Data
the past 12 hours: tfactl diagcollect Collection (SRDC): tfactl diagcollect -srdc ora600
Collections stored in the repository directory

Change diagcollect timeframe with since <n>h|d
For list of types of srdc collections use tfactl diagcollect -srdc help
TFA dbglevel profiles
Example
tfactl dbglevel -set node_eviction
would be used for enhancing diagnostics when node evictions are the being
investigated and would perform the following operation internally
crsctl set log css "CSSD=4"
crsctl set log css "CSSDNMC=4"
crsctl set log css "CLSF=4"
crsctl set log css "CSSDGMCC=4"
crsctl set log css "CSSDGMPC=4"
To revert to the original or default logging levels the following command

$ tfactl dbglevel -unset node_eviction
would perform the following operations internally
crsctl set log css "CSSD=2"
crsctl set log css "CSSDNMC=2"
crsctl set log css "CLSF=0"
crsctl set log css "CSSDGMCC=2"
crsctl set log css "CSSDGMPC=2" Copyright 2017, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential Internal/Restricted/Highly Restricted 107
Incident Based Collections with SRDC
Incident Type Description For dbperf use these parameters to
ora4030 For ORA-04030 errors
ora4031 For ORA-04031 errors specify the good & bad performance
dbperf For basic db performance problems periods to compare:
Parameter Description
perf_base_sd Start date for a good performance period
perf_base_st Start time for a good performance period
perf_base_ed End date for a good performance period
Use srdc <incident type>: tfactl srdc ora4030 perf_base_et End time for a good performance period
To specify sid use sid <oracle sid> perf_comp_sd Start date for a bad performance period
To specify database use db <dbname> perf_comp_st Start time for a bad performance period
perf_comp_ed End date for a bad performance period
To specify incident date & time use perf_comp_et End time for a bad performance period
inc_date <YYYY-MM-DD> -inc_time <HH:MM:SS>
To upload directly to the SR use sr<SR#> tfactl srdc dbperf db RDBMS121 \
perf_base_sd 2016-06-15 perf_base_st 01:30:00 \
tfactl srdc ora4030 -sid orcl db RDBMS121 \ perf_base_ed 2016-06-15 perf_base_et 02:00:00 \
-inc_date 2016-06-15 -inc_time 02:48:23 \ perf_comp_sd 2016-06-16 perf_comp_st 09:30:00 \
-sr 3-123456789 perf_comp_ed 2016-06-16 perf_comp_et 10:00:00
Copyright 2017, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential Internal 108
Generates Diagnostic
Metrics View of Cluster
and Databases
Oracle 12c Cluster Health Monitor
Copyright 2017, Oracle and/or its affiliates. All rights reserved. | Confidential
Confidential Oracle Restricted
Oracle Internal/Restricted/Highly Restricted 109
Cluster Health Monitor (CHM)
Generates Diagnostic Metrics View of Cluster and Databases
Always on - Enabled by default

OS Data OS Data
Provides Detailed OS Resource Metrics
osysmond
Assists Node eviction analysis osysmond
OS Data
Locally logs all process data osysmond

ologgerd
(master)
User can define pinned processes
Listens to CSS and GIPC events osysmond OS Data
Categorizes processes by type GIMR
Supports plug-in collectors (ex. 12c Grid Infrastructure

traceroute, netstat, ping, etc.) Management Repository
New CSV output for ease of analysis

Oclumon CLI or Full Integration with EM Cloud Control
Discovers Potential Cluster
& DB Problems - Notifies
with Corrective Actions
Oracle 12c Cluster Health Advisor
Generates Diagnostic Metrics View of Cluster and Databases

OS Data OS Data
Provides Detailed OS Resource Metrics
osysmond
Assists Node eviction analysis osysmond
OS Data
Locally logs all process data osysmond

ologgerd
(master)
User can define pinned processes
Listens to CSS and GIPC events osysmond OS Data
Categorizes processes by type GIMR
Supports plug-in collectors (ex. 12c Grid Infrastructure

traceroute, netstat, ping, etc.) Management Repository
New CSV output for ease of analysis

CHA has detected a service degradation due to higher than expected I/O latencies.
CHA has detected a service degradation due to higher than expected I/O latencies.
Cluster Health Advisor
CHA/DB Health
CHA detected a for service degradation due to higher than expected I/O latencies.
CHA/DB Health: I/O problem

Problem The degradation is caused by a higher than expected utilization of shared storage devices for this
database. No evidence of significant increase in I/O demand on the local node.
Confidence 95.17%
Action Validate whether there is increase in I/O demand on other nodes than the local and find I/O intensive SQL .
Add more disks to disk group or move database to faster disks.
proddb_1
proddb_2

Confidential Oracle Restricted 115
Cluster Health Advisor Daemon
Dependencies to the Grid Infrastructure

Management Repository (GIMR)
Command Line Tool - chactl
Will only monitor cluster

initially
Tell it to monitor the

database
chactl monitor database db <db_name>
Cluster Health Advisor - diagnosis Query a specific database for
diagnosis
Query the cluster diagnosis for

incidents and recommendations chactl query diagnosis chactl query diagnosis db <db_name>
Query the repository footprint
chactl query repository
Autonomously Preserves
Database Availability and
Performance
Oracle 12c Database Hang Manager
Debugging Live Systems: Hangs
Parsing the system state dump can be very time consuming.
To debug a hang more quickly you could query v$session.
blocking_session:
select sess.sid sid,substr(proc.program,0,25)
prog,substr(sw.event,0,15) event,sw.wait_time wt,
sess.blocking_session bsid from v$process proc, v$session sess,
v$session_wait sw where proc.addr=sess.paddr and
sess.status='ACTIVE and sw.sid=sess.sid order by prog;
SID Program Event WT BSID

----- ------------------------- --------------- --- -----
2836 oracle@fstsun002 (S000) enq: TM - conte 0 2979
2979 oracle@fstsun002 (TNS V1- enq: TM - conte 0 2853

Debugging Live Systems: Hangs
sqlplus prelim / as sysdba is useful because it avoids a
process state object creation which requires various
resources such as latches.
Trying to acquire those resources may cause your debugger
session to hang.
Some dumps/commands may require a PSO therefore you
can execute those dumps/commands in an existing process
that already has a PSO
$ sqlplus -prelim "/ as sysdba"

SQL> oradebug setorapid 9
SQL> oradebug dump systemstate 3

Oracle 12c Hang Manager
Autonomously Preserves Database Availability and Performance Session

Reliably detects database hangs and DETECT
deadlocks
Autonomously resolves them EVALUATE
Hung?
Supports QoS Performance Classes, Ranks
and Policies to maintain SLAs ANALYZE
QoS
Logs all detections and resolutions Policy
DIA0 VERIFY
New SQL interface to configure sensitivity
(Normal/High) and trace file sizes
Victim
Oracle 12c Hang Manager
Full Resolution Dump Trace File and DB Alert Log Audit Reports
Dump file /diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
Oracle Database 12c Enterprise Edition Release 12.2.0.0.0 - 64bit Beta
With the Partitioning, Real Application Clusters, OLAP, Advanced Analytics 2015-10-13T16:47:59.435039+17:00
and Real Application Testing options Errors in file /oracle/log/diag/rdbms/hm6/hm6/trace/hm6_dia0_12433.trc (incident=7353):
Build label: RDBMS_MAIN_LINUX.X64_151013 ORA-32701: Possible hangs up to hang ID=1 detected
ORACLE_HOME: /3775268204/oracle Incident details in: /diag/rdbms/hm6/hm6/incident/incdir_7353/hm6_dia0_12433_i7353.trc
System name: Linux 2015-10-13T16:47:59.506775+17:00
Node name: slc05kyr DIA0 requesting termination of session sid:40 with serial # 43179 (ospid:13031) on instance 2
Release: 2.6.39-400.211.1.el6uek.x86_64 due to a GLOBAL, HIGH confidence hang with ID=1.
Version: #1 SMP Fri Nov 15 13:39:16 PST 2013 Hang Resolution Reason: Automatic hang resolution was performed to free a
Machine: x86_64 significant number of affected sessions.
VM name: Xen Version: 3.4 (PVM) DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=1.
Instance name: hm62
Redo thread mounted by this instance: 2 In the alert log on the instance local to the session (instance 2 in this case),
Oracle process number: 19 we see the following:
Unix process pid: 12656, image: oracle@slc05kyr (DIA0)
2015-10-13T16:47:59.538673+17:00
Errors in file /diag/rdbms/hm6/hm62/trace/hm62_dia0_12656.trc (incident=5753):
*** 2015-10-13T16:47:59.541509+17:00 ORA-32701: Possible hangs up to hang ID=1 detected
*** SESSION ID:(96.41299) 2015-10-13T16:47:59.541519+17:00 Incident details in: /diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
*** CLIENT ID:() 2015-10-13T16:47:59.541529+17:00
*** SERVICE NAME:(SYS$BACKGROUND) 2015-10-13T16:47:59.541538+17:00 2015-10-13T16:48:04.222661+17:00
*** MODULE NAME:() 2015-10-13T16:47:59.541547+17:00 DIA0 terminating blocker (ospid: 13031 sid: 40 ser#: 43179) of hang with ID = 1
*** ACTION NAME:() 2015-10-13T16:47:59.541556+17:00 requested by master DIA0 process on instance 1
*** CLIENT DRIVER:() 2015-10-13T16:47:59.541565+17:00 Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
by terminating session sid:40 with serial # 43179 (ospid:13031)
Deploys with Minimum
Footprint and Maximum
Manageability
Oracle Domain Services Cluster (DSC)
Oracle 12c Domain Services Cluster (DSC)
Deploys with Minimum Footprint and Maximum Manageability
ORACLE CLUSTER DOMAIN
Application Database
Hosts Framework as Services Member
Cluster
Member
Cluster
Reduces local resource footprint Application

Member
Database
Member
Cluster Cluster
Centralizes management
Speeds deployment and patching
Oracle Domain Services Cluster
Optional Shared Storage Database Database
Member Member
Supports multiple versions and Cluster Cluster
platforms going forward Management Repository Service

Trace File Analyzer Receiver
ORAchk Collection Service
Grid Names Service
Storage Services
Rapid Home Provisioning Service
Oracle Cluster Domain
Database Application Database Database
Member Cluster Member Cluster Member Cluster Member Cluster
Uses IO & ASM Uses ASM

Private Uses local ASM GI only Service of DSC Service
Network
SAN
NAS Oracle Domain Services Cluster

Mgmt Trace File Rapid Home Additional
Repository Analyzer Provisioning ACFS ASM
Optional IO Service
(GIMR) (TFA) (RHP) Services Service
Service Services
Service Service
Shared ASM
Oracle 12c Domain Services Cluster (DSC)
Deploys with Minimum Footprint and Maximum Manageability
ORACLE CLUSTER DOMAIN
Application Database
Hosts Framework as Services Member
Cluster
Member
Cluster
Reduces local resource footprint Application

Member
Database
Member
Cluster Cluster
Centralizes management
Speeds deployment and patching
Oracle Domain Services Cluster
Optional Shared Storage Database Database
Member Member
Supports multiple versions and Cluster Cluster
platforms going forward Management Repository Service

Trace File Analyzer Receiver
ORAchk Collection Service
Grid Names Service
Storage Services
Rapid Home Provisioning Service
Compare Database Status Before & After Upgrade
Download dbupgdiag.sql from doc 556610.1
Run both before and after the upgrade:
cd <location of the script>
$ sqlplus / as sysdba
sql> alter session set

nls_language='American';
sql> @dbupgdiag.sql
sql> exit

RAC Troubleshooting Diagnosability

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

RAC Troubleshooting Diagnosability

Uploaded by

Copyright:

Available Formats

Troubleshooting and Diagnosing Oracle

Database 12.2 and Oracle RAC

Sandesh Rao, Senior Director , RAC Development

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Which books on RAC do I read for basics or internals ?

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Level 1: OHASD Spawns:

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

The standard going forward

Copyright 2017, Oracle and/or its affiliates. All rights reserved. | 15

[GRID]> crsctl get cluster name

[GRID]> crsctl get cluster class

[GRID]> crsctl get cluster type

Copyright 2017, Oracle and/or its affiliates. All rights reserved. | 16

Uses IO & ASM Uses ASM

NAS Domain Services Cluster

Copyright 2017, Oracle and/or its affiliates. All rights reserved. | 17

Node Weighting is a new feature that considers

the workload hosted in the cluster during fencing

1 2 The idea is to let the majority of work survive,

Copyright 2017, Oracle and/or its affiliates. All rights reserved. | 20

crsctl set server

CSS_CRITICAL CSS_CRITICAL will be honored

Copyright 2017, Oracle and/or its affiliates. All rights reserved. | 21

Autonomous Health Framework

Copyright 2017, Oracle and/or its affiliates. All rights reserved. | 22

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Public Lan Public Lan

CSSD CSSD CSSD

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

CSSD Ping CSSD

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Network heartbeat failures will lead to node evictions

CSSD Ping CSSD

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Disk heartbeat failures will lead to node evictions

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

With 11.2.0.1 Voting Disks got an identity:

Voting Disks must therefore not be copied using dd or cp anymore

Node information Disk Heartbeat Logging

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

trunc(n/2+1) with n=number of voting disks configured and n>=1

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Same principles apply

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Oracle ASM auto creates 1/3/5 Voting Files

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

A node is requested to kill itself; executer: typically CSSD

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Copyright 2017, Oracle and/or its affiliates. All rights reserved. |

Engage Sysadmin Team