
CLOUG Meeting 2

June 8, 2016

Network/Clusterware/RAC
Troubleshooting Without ADDM
César Sáez León

• Member of the CLOUG board since 2012
• OCP since Oracle 8i
• OCE RAC Administrator
• OCE Performance Tuning
• Oracle University instructor since 2001
• Business and Marketing Manager at Datactiva
Agenda
• Three real cases of problem analysis and resolution on clustered platforms are covered:

– Clusterware
– Network
– RAC (database)

• Only free tools are used (STATSPACK, logs, dynamic views, CVU, MOS, etc.).

• The platforms are RAC 10gR2 and RAC 11gR2. Oracle 12cR1 still has very low adoption in the Chilean market, so there are no cases to show yet; the methodology, however, is the same.
Slowness - 10gR2 - STATSPACK

"Customer reports slowness in a business process"


Release 10.2.0.4.0 - 64bit

RAC YES

Num CPUs 24

Phys Memory (MB) 60,413

Platform RHEL AS release 4 (Nahant Update 8)

Case 1: RAC
STATSPACK
• Instance 1

Snapshot       Snap Id     Snap Time      Sessions Curs/Sess
~~~~~~~~    ---------- ------------------ -------- ---------
Begin Snap:       2030 18-Jan-11 14:00:03      125      34.4
  End Snap:       2043 18-Jan-11 17:00:01      171      38.2
   Elapsed:     179.97 (mins)

• Instance 2

Snapshot       Snap Id     Snap Time      Sessions Curs/Sess
~~~~~~~~    ---------- ------------------ -------- ---------
Begin Snap:       2031 18-Jan-11 14:00:01      119      36.3
  End Snap:       2034 18-Jan-11 17:00:02      169      41.8
   Elapsed:     180.02 (mins)

Instance Efficiency Percentages

• Instance 1

Buffer Nowait %:              99.72   Redo NoWait %:     100.00
Buffer Hit %:                 97.86   In-memory Sort %:  100.00
Library Hit %:                97.75   Soft Parse %:       89.50
Execute to Parse %:           87.27   Latch Hit %:        99.83
Parse CPU to Parse Elapsd %:  38.79   % Non-Parse CPU:    97.54

• Instance 2

Buffer Nowait %:             100.00   Redo NoWait %:     100.00
Buffer Hit %:                 99.76   In-memory Sort %:  100.00
Library Hit %:                97.00   Soft Parse %:       90.05
Execute to Parse %:           80.69   Latch Hit %:       100.05
Parse CPU to Parse Elapsd %:  85.75   % Non-Parse CPU:    95.53

Top 5 Timed Events
• Instance 1

Top 5 Timed Events                                             Avg %Total
~~~~~~~~~~~~~~~~~~                                            wait   Call
Event                                     Waits    Time (s)   (ms)   Time
---------------------------------- ------------ ----------- ------ ------
CPU time                                              4,105          60.3
db file sequential read                 293,699       1,123      4   16.5
db file scattered read                   91,186         374      4    5.5
SQL*Net break/reset to client           710,360         316      0    4.6
db file parallel read                    26,613         107      4    1.6

Top 5 Timed Events
• Instance 2

Top 5 Timed Events                                              Avg %Total
~~~~~~~~~~~~~~~~~~                                             wait   Call
Event                                      Waits    Time (s)   (ms)   Time
------------------------------------- ---------- ----------- ------ ------
CPU time                                               8,929          39.8
gc cr multi block request             58,289,682       4,735      0   21.1
db file scattered read                   717,216       4,203      6   18.8
db file sequential read                  311,881       1,510      5    6.7
gc buffer busy                         1,497,373       1,348      1    6.0

Time Model System Stats
• Instance 1

Statistic                                       Time (s) % of DB time
----------------------------------- -------------------- ------------
sql execute elapsed time                         5,668.7         91.7
DB CPU                                           4,198.4         67.9
parse time elapsed                                 427.6          6.9
hard parse elapsed time                            238.3          3.9
PL/SQL execution elapsed time                      144.6          2.3
failed parse elapsed time                          116.9          1.9

• Instance 2

Statistic                                       Time (s) % of DB time
----------------------------------- -------------------- ------------
sql execute elapsed time                        19,082.0         95.5
DB CPU                                           9,051.5         45.3
parse time elapsed                                 803.7          4.0
hard parse elapsed time                            280.1          1.4
PL/SQL execution elapsed time                      243.4          1.2

This indicates that the SQL execution phase is what consumes the most time.
Data (block) access takes place in this phase.

Global Cache Efficiency
Percentages
• Instance 1

Buffer access - local cache %:  99.64
Buffer access - remote cache %:  0.12
Buffer access - disk %:          0.24

• Instance 2

Buffer access - local cache %:  86.21
Buffer access - remote cache %: 11.65
Buffer access - disk %:          2.14

More than 11% of the blocks accessed by instance 2 come from instance 1.

Global Cache Load Profile
• Instance 1

Global Cache Load Profile
~~~~~~~~~~~~~~~~~~~~~~~~~              Per Second  Per Transaction
                                  --------------- ---------------
Global Cache blocks received:               60.26            5.63
Global Cache blocks served:              5,779.27          540.19
GCS/GES messages received:              11,659.76        1,089.83
GCS/GES messages sent:                     279.52           26.13
DBWR Fusion writes:                          1.74            0.16
Estd Interconnect traffic (KB):         49,048.19

• Instance 2

Global Cache Load Profile
~~~~~~~~~~~~~~~~~~~~~~~~~              Per Second  Per Transaction
                                  --------------- ---------------
Global Cache blocks received:            5,779.04          282.22
Global Cache blocks served:                 60.26            2.94
GCS/GES messages received:                 279.57           13.65
GCS/GES messages sent:                  11,660.55          569.44
DBWR Fusion writes:                          1.95            0.10
Estd Interconnect traffic (KB):         49,046.51
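
The "Estd Interconnect traffic (KB)" line can be cross-checked from the per-second rates above it. The sketch below is only an approximation: it assumes an 8 KB database block size and roughly 200 bytes per GCS/GES message, neither of which is stated in the report. With those assumptions the result lands very close to the reported figures for both instances.

```python
# Sketch: approximate STATSPACK's "Estd Interconnect traffic (KB)" figure
# from the per-second rates in the Global Cache Load Profile.
# ASSUMPTIONS (not stated in the report): 8 KB database block size and
# roughly 200 bytes per GCS/GES message.

BLOCK_SIZE_BYTES = 8192   # assumed db_block_size
MESSAGE_BYTES = 200       # assumed average GCS/GES message size


def estimated_interconnect_kb(blocks_received, blocks_served,
                              msgs_received, msgs_sent):
    """Return the estimated interconnect traffic in KB per second."""
    block_bytes = (blocks_received + blocks_served) * BLOCK_SIZE_BYTES
    message_bytes = (msgs_received + msgs_sent) * MESSAGE_BYTES
    return (block_bytes + message_bytes) / 1024


# Per-second rates copied from the Case 1 report above.
instance1 = estimated_interconnect_kb(60.26, 5779.27, 11659.76, 279.52)
instance2 = estimated_interconnect_kb(5779.04, 60.26, 279.57, 11660.55)

print(f"Instance 1: {instance1:,.2f} KB/s")
print(f"Instance 2: {instance2:,.2f} KB/s")
```

Almost all of the volume comes from blocks that instance 1 serves to instance 2, which matches the remote cache percentage seen on instance 2.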

Global Cache Services -
Workload Characteristics
• Instance 1

Avg global enqueue get time (ms): 0.0

Avg global cache cr block receive time (ms): 0.6
Avg global cache current block receive time (ms): 0.4

Avg global cache cr block build time (ms): 0.0
Avg global cache cr block send time (ms): 0.0
Avg global cache cr block flush time (ms): 0.5
Global cache log flushes for cr blocks served %: 7.5

Avg global cache current block pin time (ms): 0.0
Avg global cache current block send time (ms): 0.0
Avg global cache current block flush time (ms): 1.6
Global cache log flushes for current blocks served %: 0.0

Global Cache Services -
Workload Characteristics
• Instance 2

Avg global enqueue get time (ms): 0.0

Avg global cache cr block receive time (ms): 0.5
Avg global cache current block receive time (ms): 0.9

Avg global cache cr block build time (ms): 0.0
Avg global cache cr block send time (ms): 0.0
Avg global cache cr block flush time (ms): 2.2
Global cache log flushes for cr blocks served %: 5.6

Avg global cache current block pin time (ms): 0.0
Avg global cache current block send time (ms): 0.0
Avg global cache current block flush time (ms): 1.7
Global cache log flushes for current blocks served %: 0.2

SQL
• Instance 2

On instance 2, the same query is the one that:

• Consumes the most CPU
• Takes the longest
• Has the most waits on cluster events

This query ran 1,309 times during the period.

a) SQL ordered by CPU DB/Inst: instancia/instancia2

    CPU                  CPU per             Elapsd                     Old
  Time (s)   Executions  Exec (s)  %Total   Time (s)     Buffer Gets Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
   4388.08        1,309       3.35   48.5    8870.58      88,555,035 4280002066

SQL
b) SQL ordered by Elapsed DB/Inst: instancia/instancia2

  Elapsed                Elap per              CPU                      Old
  Time (s)   Executions  Exec (s)  %Total   Time (s)  Physical Reads Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
   8870.58        1,309       6.78   44.4    4388.08           3,799 4280002066

c) SQL ordered by Cluster Wait Time DB/Inst: instancia/instancia2

   Cluster      CWT % of      Elapsd         CPU                       Old
Wait Time (s) Elapsd Time    Time (s)    Time (s)     Executions Hash Value
------------- ----------- ----------- ----------- -------------- ----------
     6,053.21        68.2    8,870.58    4,388.08          1,309 4280002066
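
The derived columns in these three listings follow directly from the totals. As a quick sanity check, the sketch below recomputes the per-execution figures and the share of elapsed time spent on cluster waits, using only numbers copied from the report; the variable names are illustrative.

```python
# Totals for old hash value 4280002066 on instance 2, copied from the report.
executions = 1309
elapsed_s = 8870.58
cpu_s = 4388.08
cluster_wait_s = 6053.21
buffer_gets = 88_555_035

# Per-execution averages and the cluster wait share of elapsed time.
elapsed_per_exec = elapsed_s / executions
cpu_per_exec = cpu_s / executions
gets_per_exec = buffer_gets / executions
cluster_wait_pct = 100 * cluster_wait_s / elapsed_s

print(f"Elapsed per execution:     {elapsed_per_exec:.2f} s")
print(f"CPU per execution:         {cpu_per_exec:.2f} s")
print(f"Buffer gets per execution: {gets_per_exec:,.0f}")
print(f"Cluster wait share:        {cluster_wait_pct:.1f} % of elapsed time")
```

Roughly two thirds of each execution is spent waiting on the cluster, which is why the query stands out in all three rankings.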

Additional Information about the Statement

SQL_ID        | SQL_FULLTEXT                                                                     | HASH_VALUE | OLD_HASH_VALUE
------------- | -------------------------------------------------------------------------------- | ---------- | --------------
gpzfhxtdgzzmx | SELECT IMAGEN.IMAG_CODIGO, IMAG_EXPEDICION, IMAG_ORIGEN, IMAG_DESTINO, IMAG_REME | 1526726269 | 4280002066

SQL_ID gpzfhxtdgzzmx
--------------------

SELECT IMAGEN.IMAG_CODIGO, IMAG_EXPEDICION, IMAG_ORIGEN, IMAG_DESTINO, IMAG_REMESA, IMAG_FECHA_EXPEDICION,
       IMAG_NUMERO_ENVIO, IMAG_RUTA_PRIVADA, DOCUMENTO.DOCU_NOMBRE
FROM IMAGEN, DOCUMENTO
WHERE IMAG_FECHA_BAJA IS NULL AND
      DOCU_FECHA_BAJA IS NULL AND
      (IMAGEN.DOCU_CODIGO = :CODIGO_DOCUMENTO OR :CODIGO_DOCUMENTO IS NULL) AND
      (IMAGEN.DOCU_CODIGO IN (SELECT DOCU_CODIGO FROM ACCESO_DOCUMENTO WHERE ACCD_FECHA_BAJA IS NULL AND USUA_CODIGO = :CODIGO_USUARIO)) AND
      (IMAG_ORIGEN = :ORIGEN_EXPEDICION OR :ORIGEN_EXPEDICION IS NULL) AND
      (IMAG_DESTINO = :DESTINO_EXPEDICION OR :DESTINO_EXPEDICION IS NULL) AND
      (IMAG_REMESA = :REMESA_EXPEDICION OR :REMESA_EXPEDICION IS NULL) AND
      (IMAG_EXPEDICION = :EXPEDICION OR :EXPEDICION IS NULL) AND
      (IMAGEN.DELE_CODIGO = :CODIGO_DELEGACION OR :CODIGO_DELEGACION = :"SYS_B_0") AND
      (IMAG_NUMERO_ENVIO = :NUMERO_ENVIO OR :NUMERO_ENVIO IS NULL) AND
      (IMAGEN.DOCU_CODIGO = DOCUMENTO.DOCU_CODIGO) AND
      (DOCUMENTO.GRUP_CODIGO = :CODIGO_GRUPO OR :CODIGO_GRUPO IS NULL) AND
      (IMAG_FECHA_EXPEDICION >= :EXPEDICION_DESDE OR :EXPEDICION_DESDE IS NULL) AND
      (IMAG_FECHA_EXPEDICION <= :EXPEDICION_HASTA OR :EXPEDICION_HASTA IS NULL) AND
      (IMAG_FECHA_ALTA >= :ESCANEO_DESDE OR :ESCANEO_DESDE IS NULL) AND
      (IMAG_FECHA_ALTA <= :ESCANEO_HASTA OR :ESCANEO_HASTA IS NULL)

Additional Information about the Statement

Id  Operation                       Name              Rows  Bytes  Cost (%CPU)  Time
--  ------------------------------  ----------------  ----  -----  -----------  --------
 0  SELECT STATEMENT                                     1         25014 (100)
 1   NESTED LOOPS SEMI                                   1    123  25014   (2)  00:05:01
 2    NESTED LOOPS                                       1    113  25013   (2)  00:05:01
 3     TABLE ACCESS FULL            IMAGEN               1     84  25012   (2)  00:05:01
 4     TABLE ACCESS BY INDEX ROWID  DOCUMENTO            1     29      1   (0)  00:00:01
 5      INDEX UNIQUE SCAN           DOCU_PK              1             0   (0)
 6    TABLE ACCESS BY INDEX ROWID   ACCESO_DOCUMENTO     4     40      1   (0)  00:00:01
 7     INDEX RANGE SCAN             ACCD_USUARIO         4             0   (0)

Reads vs Changes
CONCLUSIONS AND RECOMMENDATIONS
The problem occurs in the w3wp.exe module, with the query SQL_ID=gpzfhxtdgzzmx:

• This query runs on one node of the cluster while the base tables are being modified from the other node (or from both).

• It must be determined whether it is normal for this query to run at the same time as modifications to the base tables.

• If it is normal, the course of action would be to run the query and the modifications from the same node.

• Raise STATSPACK to level 7 to obtain per-segment statistics.

Slowness - 10gR2 - STATSPACK

"Customer reports slowness on their platform"

Release 10.2.0.4.0 - 64bit
RAC YES
Num CPUs 24
Phys Memory (MB) 60,413
Platform RHEL AS release 4 (Nahant Update 8)

STATSPACK, MOS, documentation

Case 2: Network
STATSPACK
• Instance 1

Snapshot       Snap Id     Snap Time      Sessions Curs/Sess
~~~~~~~~    ---------- ------------------ -------- ---------
Begin Snap:      10240 15-Jan-16 09:00:02      216      31.3
  End Snap:      10252 15-Jan-16 11:00:05      294      35.4
   Elapsed:     120.05 (mins)

• Instance 2

Snapshot       Snap Id     Snap Time      Sessions Curs/Sess
~~~~~~~~    ---------- ------------------ -------- ---------
Begin Snap:      10241 15-Jan-16 09:00:05      209      41.0
  End Snap:      10243 15-Jan-16 11:00:04      283      38.4
   Elapsed:     119.98 (mins)

Instance Efficiency
Percentages
• Instance 1

Buffer Nowait %:              99.99   Redo NoWait %:     100.00
Buffer Hit %:                 99.88   In-memory Sort %:  100.00
Library Hit %:                98.53   Soft Parse %:       92.56
Execute to Parse %:           89.02   Latch Hit %:        99.92
Parse CPU to Parse Elapsd %:  87.19   % Non-Parse CPU:    97.23

• Instance 2

Buffer Nowait %:             100.00   Redo NoWait %:     100.00
Buffer Hit %:                 99.73   In-memory Sort %:  100.00
Library Hit %:                98.40   Soft Parse %:       91.24
Execute to Parse %:           89.75   Latch Hit %:        99.87
Parse CPU to Parse Elapsd %:  73.60   % Non-Parse CPU:    97.82

Top 5 Timed Events
• Instance 1

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
enq: TX - row lock contention                   34,746      16,783    483   26.9
gc cr multi block request                    5,519,578      12,631      2   20.2
CPU time                                                     5,941           9.5
gc buffer busy                                 123,818       5,105     41    8.2
gc cr grant 2-way                              315,907       4,103     13    6.6

• Instance 2

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
enq: TX - row lock contention                   38,470      18,611    484   26.6
db file sequential read                      2,248,154      13,128      6   18.8
gc current block 2-way                         640,072       9,328     15   13.3
CPU time                                                     9,246          13.2
gc cr grant 2-way                              300,460       4,068     14    5.8
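
The "Avg wait (ms)" column is simply total wait time divided by the number of waits. The sketch below recomputes it for the instance 1 wait events, using the figures from the report, so the global cache latencies can be compared directly with those of Case 1; the helper name `avg_wait_ms` is illustrative.

```python
# (event, waits, total wait time in seconds) from the instance 1 Top 5 list.
events = [
    ("enq: TX - row lock contention", 34_746, 16_783),
    ("gc cr multi block request", 5_519_578, 12_631),
    ("gc buffer busy", 123_818, 5_105),
    ("gc cr grant 2-way", 315_907, 4_103),
]


def avg_wait_ms(waits, time_s):
    """Average wait per event occurrence, in milliseconds."""
    return 1000 * time_s / waits


averages = {name: avg_wait_ms(waits, time_s) for name, waits, time_s in events}

for name, ms in averages.items():
    print(f"{name:<32} {ms:8.1f} ms")
```

In Case 1 the gc events averaged 0 to 1 ms; here a grant message alone takes about 13 ms, which points at the interconnect rather than at the SQL.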

Time Model System Stats
• Instance 1

Statistic                                       Time (s) % of DB time
----------------------------------- -------------------- ------------
sql execute elapsed time                        57,878.2         92.0
inbound PL/SQL rpc elapsed time                  7,002.8         11.1
DB CPU                                           6,107.4          9.7
PL/SQL execution elapsed time                    3,871.2          6.2

• Instance 2

Statistic                                       Time (s) % of DB time
----------------------------------- -------------------- ------------
sql execute elapsed time                        65,571.7         92.0
DB CPU                                           9,456.3         13.3
inbound PL/SQL rpc elapsed time                  6,290.4          8.8
PL/SQL execution elapsed time                    4,208.1          5.9
parse time elapsed                                 703.4          1.0

Global Cache Load Profile
• Instance 1

Global Cache Load Profile
~~~~~~~~~~~~~~~~~~~~~~~~~              Per Second  Per Transaction
                                  --------------- ---------------
Global Cache blocks received:              839.07           43.55
Global Cache blocks served:                153.10            7.95
GCS/GES messages received:                 726.71           37.72
GCS/GES messages sent:                   2,049.10          106.36
DBWR Fusion writes:                          5.01            0.26
Estd Interconnect traffic (KB):          8,479.47

• Instance 2

Global Cache Load Profile
~~~~~~~~~~~~~~~~~~~~~~~~~              Per Second  Per Transaction
                                  --------------- ---------------
Global Cache blocks received:              152.83            6.45
Global Cache blocks served:                841.05           35.50
GCS/GES messages received:               2,047.36           86.42
GCS/GES messages sent:                     725.80           30.64
DBWR Fusion writes:                          5.08            0.21
Estd Interconnect traffic (KB):          8,492.64

The estimated interconnect traffic of the two instances differs by 13.17 KB/s.

Global Cache Efficiency
Percentages
• Instance 1

Buffer access - local cache %:  99.28
Buffer access - remote cache %:  0.60
Buffer access - disk %:          0.12

• Instance 2

Buffer access - local cache %:  99.67
Buffer access - remote cache %:  0.06
Buffer access - disk %:          0.27

Global Cache Services -
Workload Characteristics
• Instance 1

Avg global enqueue get time (ms): 6.2

Avg global cache cr block receive time (ms): 17.5
Avg global cache current block receive time (ms): 17.5

Avg global cache cr block build time (ms): 0.0
Avg global cache cr block send time (ms): 0.0
Avg global cache cr block flush time (ms): 5.3
Global cache log flushes for cr blocks served %: 17.8

Avg global cache current block pin time (ms): 2612.1
Avg global cache current block send time (ms): 0.0
Avg global cache current block flush time (ms): 89.0
Global cache log flushes for current blocks served %: 0.7

Global Cache Services -
Workload Characteristics
• Instance 2

Avg global enqueue get time (ms): 6.7

Avg global cache cr block receive time (ms): 16.9
Avg global cache current block receive time (ms): 14.5

Avg global cache cr block build time (ms): 0.0
Avg global cache cr block send time (ms): 0.0
Avg global cache cr block flush time (ms): 6.7
Global cache log flushes for cr blocks served %: 19.5

Avg global cache current block pin time (ms): 0.0
Avg global cache current block send time (ms): 0.0
Avg global cache current block flush time (ms): 3.3
Global cache log flushes for current blocks served %: 0.1

Typical Latencies for RAC
Operations
gc blocks lost
Statistic                                      Total     per Second    per Trans
--------------------------------- ------------------ -------------- ------------
gc blocks lost                                 7,525            1.0          0.1
gc blocks lost                                     0            0.0          0.0

Reference: Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)

"Any block loss indicates a problem in network packet processing and should be investigated"

Global Cache Block Loss
Diagnostic Guide
• Poorly sized UDP receive (rx) buffer sizes / UDP buffer socket overflows

(RAC01)
[root@rac1 ~]# netstat -su
Udp:
123437667 packets received
171908 packets to unknown port received.
6828 packet receive errors
2081851993 packets sent

(RAC02)
[root@rac2 ~]# netstat -su
Udp:
387660084 packets received
155302 packets to unknown port received.
0 packet receive errors
468987970 packets sent

• UDP packet loss is found on node 1 (6828 packet receive errors), which translates into higher transfer latencies and, consequently, delays in Oracle's processing and work.
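
When this check has to be repeated across nodes or over time, the counters can be pulled out of the `netstat -su` text programmatically. The sketch below parses the two outputs shown above, embedded as strings; `udp_counters` is a hypothetical helper, and it assumes the Linux net-tools output layout seen in this case.

```python
import re

# UDP sections of "netstat -su", copied from the two nodes above.
NODE1 = """\
Udp:
    123437667 packets received
    171908 packets to unknown port received.
    6828 packet receive errors
    2081851993 packets sent
"""

NODE2 = """\
Udp:
    387660084 packets received
    155302 packets to unknown port received.
    0 packet receive errors
    468987970 packets sent
"""


def udp_counters(text):
    """Return {counter description: value} for the 'Udp:' section only."""
    counters = {}
    in_udp = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):            # section header such as "Udp:"
            in_udp = stripped == "Udp:"
            continue
        if in_udp:
            match = re.match(r"(\d+)\s+(.*?)\.?$", stripped)
            if match:
                counters[match.group(2)] = int(match.group(1))
    return counters


for node, text in (("rac1", NODE1), ("rac2", NODE2)):
    stats = udp_counters(text)
    print(node, "packet receive errors:", stats["packet receive errors"])
```

Because these counters are cumulative since boot, comparing two samples taken a few minutes apart shows whether the errors are still growing.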

Global Cache Block Loss
Diagnostic Guide
• Interconnect LAN non-dedicated

[root@rac1 ~]# cat /etc/hosts


# RED PRIVADA
10.180.23.26
10.180.23.27

(RAC01)
eth3 inet addr:10.180.23.26 Bcast:10.180.23.255 Mask:255.255.255.0
RX bytes:52287574010569 (47.5 TiB) TX bytes:65881057016247 (59.9 TiB)

(RAC02)
eth3 inet addr:10.180.23.27 Bcast:10.180.23.255 Mask:255.255.255.0
RX bytes:580766305300 (540.8 GiB) TX bytes:2772380340803 (2.5 TiB)

Global Cache Block Loss
Diagnostic Guide
• Limited capacity and over-saturated bandwidth
(RAC01)
[root@rac1 ~]# ethtool eth3
Settings for eth3:
Speed: 100Mb/s
Duplex: Full
Auto-negotiation: on

(RAC02)
[root@rac2 ~]# ethtool eth3
Settings for eth3:
Speed: 100Mb/s
Duplex: Full
Auto-negotiation: on
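
To put the 100 Mb/s link speed in context, the estimated interconnect traffic from the Global Cache Load Profile can be converted to megabits per second. This is a rough sketch: it assumes the STATSPACK figure is in units of 1024 bytes, and it compares the sum of sent and received traffic against the nominal one-direction line rate, so treat the percentage as an indicator rather than a measurement.

```python
LINK_SPEED_MBPS = 100      # "Speed: 100Mb/s" reported by ethtool on eth3
GIGABIT_MBPS = 1000        # the recommended minimum, for comparison

# "Estd Interconnect traffic (KB)" per second, from the Case 2 load profile.
estimated_kb_per_s = {"instance 1": 8479.47, "instance 2": 8492.64}


def utilization_pct(kb_per_s, link_mbps):
    """Estimated traffic as a percentage of the nominal link speed."""
    bits_per_s = kb_per_s * 1024 * 8          # assumes 1 KB = 1024 bytes
    return 100 * bits_per_s / (link_mbps * 1_000_000)


utilization = {
    name: utilization_pct(kb, LINK_SPEED_MBPS)
    for name, kb in estimated_kb_per_s.items()
}

for name, pct in utilization.items():
    on_gigabit = utilization_pct(estimated_kb_per_s[name], GIGABIT_MBPS)
    print(f"{name}: {pct:.1f} % of 100 Mb/s, {on_gigabit:.1f} % of 1000 Mb/s")
```

Global cache traffic alone already accounts for well over half of the nominal capacity, before counting the non-interconnect traffic that shares the same LAN.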

CONCLUSIONS AND RECOMMENDATIONS
CONCLUSIONS

• The UDP configuration is inadequate
• The private network is being used for traffic other than the cluster interconnect
• The private network is not configured correctly

RECOMMENDATIONS

• The private network must be used only for the interconnect
• The network interfaces must be configured for at least 1000 Mb/s

Grid does not start on node 2 - 11gR2 - Logs, CVU
"Clusterware does not come up on node 2"
[root@server2 ~]# crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

Release 11.2.0.1.0 - 64bit


RAC YES
Num CPUs 16
Phys Memory (GB) 35
Platform Linux x86 64-bit

• OS logs, Oracle Grid Infrastructure logs, CVU, MOS

Case 3: Clusterware
OS Log

• The /var/log/messages file was reviewed from the maintenance date (the weekend) up to the present, confirming that the problems associated with the motherboard failure had indeed been corrected.

• In other words, the OS boots without any errors.

Oracle Grid Infrastructure Logs
• 11gR2 Clusterware and Grid Home - What You Need to Know (Doc ID 1053147.1)

• Important Log Locations

• Clusterware daemon logs are all under <GRID_HOME>/log/<nodename>. Structure under <GRID_HOME>/log/<nodename>:

• alert<NODENAME>.log - look here first for most clusterware issues

alertnode.log
• /u01/app/11.2.0/grid_1/log/server2/alertserver2.log

• The first step was to review the node's alert file, where the following entry stands out:
2015-12-14 17:27:44.559:
[cssd(20301)]CRS-1656:The CSS daemon is terminating due to a fatal error;
Details at (:CSSSC00012:) in
/u01/app/11.2.0/grid_1/log/server2/cssd/ocssd.log

• This points to a serious problem with the Cluster Synchronization Services daemon.

ocssd.log
• A review of the Cluster Synchronization Services daemon log shows the following entries, which point to a problem with the voting disk file in ASM:
2015-12-14 18:07:54.474: [ CSSD][3058187552]clssnmReadDiscoveryProfile: voting file discovery string(ORCL:*,/voting_file)

2015-12-14 18:07:54.477: [ SKGFD][1101998400]OSS discovery with :ORCL:*:

2015-12-14 18:07:54.477: [ SKGFD][1101998400]Discovery skipping bad asmlib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:

2015-12-14 18:07:54.477: [ SKGFD][1101998400]Discovery advancing to nxt string :/voting_file:

2015-12-14 18:07:54.477: [ SKGFD][1101998400]UFS discovery with :/voting_file:

2015-12-14 18:07:54.508: [ CSSD][1101998400]clssnmvDiskVerify: discovered a potential voting file


2015-12-14 18:07:54.523: [ SKGFD][1101998400]Handle 0x120368a0 from lib :UFS:: for disk :/voting_file/voting_file3:

2015-12-14 18:07:54.525: [ CSSD][1101998400]clssnmvDiskVerify: Successful discovery for disk /voting_file/voting_file3, UID 883ff2bf-f1a94f85-bfcfd521-c908cb0b, Pending CIN 0:1425783924:0, Committed CIN 0:1425783924:0
2015-12-14 18:07:54.526: [ SKGFD][1101998400]Lib :UFS:: closing handle 0x120368a0 for disk :/voting_file/voting_file3:

2015-12-14 18:07:55.070: [ CSSD][3058187552]ASSERT clssnml.c 453


2015-12-14 18:07:55.070: [ CSSD][3058187552]clssnmlgetleasehdls: Do not have sufficient voting files, found 1 of 2 configured files, needed at least 2
2015-12-14 18:07:55.070: [ CSSD][3058187552]###################################
2015-12-14 18:07:55.070: [ CSSD][3058187552]clssscExit: CSSD aborting from thread Main
2015-12-14 18:07:55.070: [ CSSD][3058187552]###################################
2015-12-14 18:07:55.070: [ CSSD][3058187552](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2015-12-14 18:07:55.070: [ CSSD][3058187552]

• The log indicates that no voting disk inside ASM is recognized, although the one outside ASM is.
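
The fatal message "found 1 of 2 configured files, needed at least 2" is consistent with a strict-majority rule: CSS must be able to access more than half of the configured voting files. The sketch below illustrates that arithmetic under this assumption; the function names are hypothetical and are not part of any Oracle tool.

```python
def voting_files_needed(configured):
    """Smallest number of voting files that is a strict majority."""
    return configured // 2 + 1


def css_can_start(configured, found):
    """True when enough voting files were discovered to form a majority."""
    return found >= voting_files_needed(configured)


# The situation in the log: 2 configured, only the file outside ASM found.
print(voting_files_needed(2))     # files required with 2 configured
print(css_can_start(2, 1))        # the failing case from ocssd.log

# With an odd number of files, losing one still leaves a majority.
print(css_can_start(3, 2))
```

This also shows why an even number of voting files gives no extra tolerance: with two files configured, losing access to either one stops CSS.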

Voting Disk
• From the node that is working correctly, the voting disks defined in the cluster configuration are reviewed:

[root@server1 ~]# crsctl query css votedisk
##  STATE    File Universal Id                File Name                    Disk group
--  -----    -----------------                ---------                    ---------
 1. ONLINE   754ab903282a4f88bf48eb7090d42b4c (ORCL:OCR2_2)                [OCRVOD]
 2. ONLINE   883ff2bff1a94f85bfcfd521c908cb0b (/voting_file/voting_file3)  [OCRVOD]
Located 2 voting disk(s).

• This confirms that there is a problem with ASM, since the system cannot read the copy of the voting disk stored there.

CVU
• To detect problems beyond those in ASM, a full analysis of both cluster machines was run to check that they meet the requirements for hosting an Oracle Grid Infrastructure installation:

[oracle@server1 ~]$ cd /u01/app/11.2.0/grid_1/bin/
[oracle@server1 bin]$ ./cluvfy stage -pre crsinst -n server1,server2

• Everything passed except the ASMLib configuration:

Checking ASMLib configuration.

ERROR:
PRVF-10109 : ASMLib is not configured correctly on the nodes:
Check failed on nodes:
server2,server1
Check for ASMLib configuration failed.

ASM
• The ASM review followed the note "ASMLib Devices Not Discovered with Diskstring as 'ORCL:*' (Doc ID 1444115.1)"
• The problem shows up in the following output:
[root@server2 sbin]# ./oracleasm configure
ORACLEASM_ENABLED=true
ORACLEASM_UID=oraacle
ORACLEASM_GID=oinstall
ORACLEASM_SCANBOOT=true
ORACLEASM_SCANORDER="dm-"
ORACLEASM_SCANEXCLUDE="sd"

[oracle@server1 sbin]$ ./oracleasm configure
ORACLEASM_ENABLED=true
ORACLEASM_UID=oracle
ORACLEASM_GID=dba
ORACLEASM_SCANBOOT=true
ORACLEASM_SCANORDER="dm-"
ORACLEASM_SCANEXCLUDE="sd"
• On server2, the user and group that own ASM are not configured correctly.
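
A misspelled user name like this is easy to miss by eye. The sketch below compares the two `oracleasm configure` outputs shown above, embedded as strings, and reports only the keys that differ; `parse_config` and `diff_configs` are hypothetical helpers written for this illustration.

```python
# Output of "oracleasm configure" on each node, copied from above.
SERVER2 = """
ORACLEASM_ENABLED=true
ORACLEASM_UID=oraacle
ORACLEASM_GID=oinstall
ORACLEASM_SCANBOOT=true
ORACLEASM_SCANORDER="dm-"
ORACLEASM_SCANEXCLUDE="sd"
"""

SERVER1 = """
ORACLEASM_ENABLED=true
ORACLEASM_UID=oracle
ORACLEASM_GID=dba
ORACLEASM_SCANBOOT=true
ORACLEASM_SCANORDER="dm-"
ORACLEASM_SCANEXCLUDE="sd"
"""


def parse_config(text):
    """Turn KEY=value lines into a dict, dropping surrounding quotes."""
    config = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip().strip('"')
    return config


def diff_configs(left, right):
    """Return {key: (left value, right value)} for keys that differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k))
            for k in keys if left.get(k) != right.get(k)}


differences = diff_configs(parse_config(SERVER2), parse_config(SERVER1))
for key, (on_server2, on_server1) in differences.items():
    print(f"{key}: server2={on_server2!r} server1={on_server1!r}")
```

Only the owner user and group differ, which narrows the fix to re-running the interactive configuration on server2.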

SOLUTION
• ASMLib is reconfigured on node 2:

[root@server2 sbin]# ./oracleasm configure -i
Configuring the Oracle ASM library driver.

This will configure the on-boot properties of the Oracle ASM library
driver. The following questions will determine whether the driver is
loaded on boot and what permissions it will have. The current values
will be shown in brackets ('[]'). Hitting <ENTER> without typing an
answer will keep that current value. Ctrl-C will abort.

Default user to own the driver interface [oraacle]: oracle
Default group to own the driver interface [oinstall]: dba
Start Oracle ASM library driver on boot (y/n) [y]: y
Scan for Oracle ASM disks on boot (y/n) [y]: y
Writing Oracle ASM library driver configuration: done

FINAL REVIEW
• server2 is rebooted and everything is confirmed to start correctly and automatically:
[root@server2 server2]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

[root@server2 ~]# ps -fea| grep pmon


oracle 22944 1 0 16:26 ? 00:00:00 asm_pmon_+ASM2
oracle 24432 1 0 16:26 ? 00:00:00 ora_pmon_instancia2
root 24622 21665 0 16:26 pts/1 00:00:00 grep pmon
[root@server2 ~]# su - oracle
[oracle@server2 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.1.0 Production on Tue Dec 15 16:28:06 2015

Copyright (c) 1982, 2009, Oracle. All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Data Mining and Real Application Testing options

SQL> select status from gv$instance;

STATUS
------------
OPEN
OPEN

CONCLUSIONS AND RECOMMENDATIONS

• It is highly unusual to have a copy of the voting disk outside ASM. Only file systems compatible with an Oracle RAC solution should be used, namely:

• ASM (Automatic Storage Management)
• OCFS (Oracle Cluster File System)

• It is recommended to review and change the current configuration to bring it in line with the vendor's best practices.

References
• Oracle® Database Performance Tuning Guide, 11g Release 2 (11.2), Part Number E16638-05
• Oracle® Real Application Clusters Administration and Deployment Guide, 11g Release 2 (11.2), Part Number E16795-08
• Oracle® Database Reference, 11g Release 2 (11.2), Part Number E17110-07
• Statistics Package (STATSPACK) Guide (Doc ID 394937.1)
• How to Use AWR Reports to Diagnose Database Performance Issues (Doc ID 1359094.1)
• Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)
• 11gR2 Clusterware and Grid Home - What You Need to Know (Doc ID 1053147.1)
• ASMLib Devices Not Discovered with Diskstring as 'ORCL:*' (Doc ID 1444115.1)
Questions
