
BARE METAL RESTORE PROCEDURE - KNOWN ISSUES

Introduction

The goal of this document is to provide clarification for different sections of the main MOS
Document 1084360.1 (bare metal restore). The notes clarify the execution of certain commands or
provide workarounds for problems that may be encountered.

Section Remove the Failed Database Server from the Cluster

The following commands are executed from the RDBMS Oracle Home. If different owners are used, log in as
the correct owner. A sketch of the corresponding commands follows the list.

1. Delete the instances that run on the failed db server

2. Disable the listener that runs on the failed db server

3. Delete the Oracle HOME from the Oracle Inventory
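
A minimal sketch of these steps, assuming dbm is the database, dbm3 the instance that ran on the failed server, sclcbdb03 the failed node and sclcbdb01/sclcbdb02 the surviving nodes (all names are illustrative; replace them with your own values). The last command updates the inventory so that the Oracle Home lists only the surviving nodes:

srvctl remove instance -d dbm -i dbm3

srvctl disable listener -l LISTENER -n sclcbdb03

$ORACLE_HOME/oui/bin/runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={sclcbdb01,sclcbdb02}"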

The following commands are executed from the GI Oracle Home. A sketch of the corresponding commands follows the list.

1. Verify that the failed database server is unpinned

2. Stop and delete the VIP resources for the failed database server

3. Delete the node from the cluster

4. Update the Oracle Inventory

5. Verify the node deletion is successful
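
A minimal sketch of these steps, assuming sclcbdb03 is the failed node, sclcbdb03-vip its VIP, sclcbdb01/sclcbdb02 the surviving nodes and /u01/app/11.2.0/grid the GI home (all names and paths are illustrative). The crsctl commands are run as root from a surviving node:

olsnodes -s -t

#crsctl unpin css -n sclcbdb03 (only if the node is reported as pinned)

srvctl stop vip -i sclcbdb03-vip

srvctl remove vip -i sclcbdb03-vip -f

#crsctl delete node -n sclcbdb03

/u01/app/11.2.0/grid/oui/bin/runInstaller -updateNodeList ORACLE_HOME=/u01/app/11.2.0/grid "CLUSTER_NODES={sclcbdb01,sclcbdb02}" CRS=TRUE

cluvfy stage -post nodedel -n sclcbdb03 -verbose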


Section Prepare the USB Flash Drive for Imaging

The image file used to reimage the compute node can be downloaded from Oracle eDelivery, an external,
password-protected site. The images can also be provided by Oracle Support.

The most common methods to reimage the compute node are using an ISO image file or using a USB flash
image.

Using ISO file Image

Oracle Support can provide the ISO image file already generated, along with details of where the file can be
obtained, normally the external FTP site ftp.oracle.com.

Using USB image

To generate the USB image, Oracle Support will provide the required files, and the process is completed
by executing the following steps (a consolidated sketch follows the list):

• How to generate the USB image

o Insert a blank USB flash drive into a working database server in the cluster.

o Log in as the root user.

o Obtain from Oracle Support the tar file computeImageMaker_Exadata_release_LINUX.X64_release_date.platform.tar

o Execute the command tar -xvf computeImageMaker_Exadata_release_LINUX.X64_release_date.platform.tar

o #cd dl360

o Note: After 11.2.2.2.X, to make sure dualboot is forced to no, modify the file makeImageMedia.sh. Search for a line like 'dualboot=' and set it to no, like this:

dualboot=no

o #./makeImageMedia.sh
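
A consolidated sketch of the sequence above, run as root on a surviving database server. The sed edit assumes makeImageMedia.sh contains a line starting with dualboot=, and the tar file name follows the pattern of the file provided by Oracle Support:

#tar -xvf computeImageMaker_Exadata_release_LINUX.X64_release_date.platform.tar

#cd dl360

#sed -i 's/^dualboot=.*/dualboot=no/' makeImageMedia.sh

#./makeImageMedia.sh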

Continue with the steps from the main document 1084360.1


Section Configure Replacement Database Server
This document describes two methods for the re-image: using the ISO image file or using a USB flash
drive.

Using the ISO image file for imaging


• The ISO image file needs to be transferred to the desktop from which the web ILOM interface will be used
for the reimage process.

o Log in to the ILOM via the web interface and enable the remote console.

o Attach the ISO image to the CD-ROM.

o In the ILOM web interface, go to the Remote Control tab, then the Host Control tab. From the Next Boot
Device list, select CDROM. The next time the server is rebooted, it will use the attached ISO image. This is
valid for one boot only, after which the default BIOS boot order settings remain in effect.

o Reboot the box and let the process pick up the ISO image and start the re-image process.

Using USB Flash Drive for Imaging

1. Insert the USB flash drive into the USB port on the replacement database server.
2. Log in to the console through the service processor, or by using the KVM switch to monitor
progress.
3. Power on the database server using either the service processor interface or by physically
pressing the power button.
4. Press F2 during BIOS and select BIOS Setup to configure boot order, or press F8 and select the
one-time boot selection menu to choose the USB flash drive.
5. Configure the BIOS boot order if the motherboard was replaced. The boot order should be USB
flash drive, then RAID controller.
6. Allow the system to boot. As the system boots, it detects the CELLUSBINSTALL media. The
imaging process has two phases. Let each phase complete before proceeding to the next step.

The first phase of the imaging process identifies any BIOS or firmware that is out of date, and
upgrades the components to the expected level for the image. If any components need to be
upgraded or downgraded, then the system automatically reboots.

The second phase of the imaging process installs the factory image on the replacement
database server.

7. Remove the USB flash drive when prompted by the system.


8. Press Enter to power off the server after removing the USB flash drive.
Following either re-imaging procedure, the server will go through a first boot.

The first boot will ask for all the IPs, NTP, ILOM, and other settings. To get the information about the IPs,
the file /opt/oracle.cellos/cell.conf from a surviving node can be used.

Also, all IPs are normally registered in DNS, so the nslookup command can be used to discover the IPs
assigned to the node.
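
A short sketch of gathering that information from a surviving node (host names are illustrative):

#cat /opt/oracle.cellos/cell.conf

#nslookup sclcbdb03

#nslookup sclcbdb03-vip

#nslookup sclcbdb03-ilom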

If mistakes are made, ipconf can be used to modify the settings. Since 11.2.2.X, ipconf can be used with the
-nocodes option.

Section Prepare Replacement Database Server for the Cluster

• Before copying the cellinit.ora and cellip.ora files, the directories must be created:

#mkdir -p /etc/oracle/cell/network-config
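
A sketch of the copy, assuming sclcbdb01 is a surviving database server (the host name is illustrative). Note that cellinit.ora contains the local InfiniBand IP, so adjust it to the addresses of the restored node:

#mkdir -p /etc/oracle/cell/network-config

#scp sclcbdb01:/etc/oracle/cell/network-config/cellip.ora /etc/oracle/cell/network-config/

#scp sclcbdb01:/etc/oracle/cell/network-config/cellinit.ora /etc/oracle/cell/network-config/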

• Step 2.f Set up SSH within the oracle user account.

setssh.sh is not available on 11.2.0.2

The new script is setssh-Linux.sh and the syntax is: -s -p password -n N -h dbs_group
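
For example, assuming the script is run from the directory that contains it and dbs_group lists all database servers (names are illustrative):

./setssh-Linux.sh -s -p <oracle_password> -n N -h dbs_group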

Section Cloning Oracle Grid Infrastructure to the Replacement Database Server

• cluvfy commands are executed from the GI Oracle Home

• dbca commands are executed from the RDBMS Oracle Home

• How to identify the names of the VIPs on the surviving nodes

Set ORACLE_HOME to the GI home

srvctl status vip -n <nodename>
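
For example, with ORACLE_HOME pointing to the GI home (paths and node names are illustrative):

export ORACLE_HOME=/u01/app/11.2.0/grid

$ORACLE_HOME/bin/srvctl status vip -n sclcbdb01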

• Before running addNode.sh, remove files from $GI/rdbms/audit

MOS note 1298957.1 is available to create a cron job that removes older files.
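
A sketch of the cleanup on the node where addNode.sh will be run (the GI home path is illustrative):

#find /u01/app/11.2.0/grid/rdbms/audit -type f -name "*.aud" -delete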

Known Issues

1. Error PRVF-5449 while executing the cluvfy stage -pre nodeadd command.

#cluvfy stage -pre nodeadd -n <lost node> -fixup -fixupdir <directory>

The output returned is:

Checking Oracle Cluster Voting Disk configuration...

ERROR:

PRVF-5449 : Check of Voting Disk location
"o/192.168.73.102/DATA_CD_00_sclcbcel07(o/192.168.73.102/DATA_CD_00_sclcbcel07)" failed on the following nodes:

Check failed on nodes:

sclcbdb03

sclcbdb03:No such file or directory

ERROR:

PRVF-5449 : Check of Voting Disk location
"o/192.168.73.103/DATA_CD_00_sclcbcel08(o/192.168.73.103/DATA_CD_00_sclcbcel08)" failed on the following nodes:

Check failed on nodes:

sclcbdb03

sclcbdb03:No such file or directory

Solution:

Set the environment variable IGNORE_PREADDNODE_CHECKS="Y". This will not prevent the error reported by
cluvfy stage -pre nodeadd, but it will make addNode.sh skip the failing pre-add node checks when it is executed.
The cause is bug 11719563, which applies when the Voting Disks are stored in ASM.
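
A sketch of the workaround, assuming a bash shell and that addNode.sh is run from the GI home of a surviving node (the node and VIP names are taken from the example output and are illustrative):

export IGNORE_PREADDNODE_CHECKS=Y

cd /u01/app/11.2.0/grid/oui/bin

./addNode.sh -silent "CLUSTER_NEW_NODES={sclcbdb03}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={sclcbdb03-vip}"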

INTERNAL PROBLEM DESCRIPTION:

When collecting file ownership/group/permission data, the path being checked was not correctly filtered
out if it was an ASM path. This resulted in errors being displayed to the user.

2. When cloning GI, during execution of addNode.sh, it fails with permission problems.

Errors:

dmorldb08:

PRCF-2023 : The following contents are not transferred as they are non-readable.

Directories:

Files:

1) /u01/app/11.2.0/grid/ccr/hosts/dmorldb07.us.oracle.com/log/collector.log

2) /u01/app/11.2.0/grid/ccr/hosts/dmorldb07.us.oracle.com/log/upgrade.log

3) /u01/app/11.2.0/grid/ccr/hosts/dmorldb07.us.oracle.com/log/sched.log

4) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/JDBCDataSource.pl

5) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/JDBCMultiDataSource.pl

6) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/JMSConnectionFactory.pl

7) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/JMSQueue.pl

8) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/JMSTopic.pl

9) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/JoltConnectionPool.pl

10) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/ResourceConfig.pl
11) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/Server.pl

12) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/ServerConfigUtil.pm

13) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/StartupShutdownClasses.pl

14) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/VirtualHost.pl

15) /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/WorkManager.pl

----------------------------------------------------------------------------------

Solution:

Set correct permissions for the files listed above. Enable read and execute for owner, group and others, and retry the command.
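
A sketch of the fix, using the paths from the example output above (adjust to the files reported on your system):

#chmod a+rx /u01/app/11.2.0/grid/ccr/hosts/dmorldb07.us.oracle.com/log/*.log

#chmod a+rx /u01/app/11.2.0/grid/ccr/sysman/admin/scripts/ias/weblogic_j2eeserver/*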

3. Another set of files with incorrect permissions is also reported by the addNode.sh command.

Copying to remote nodes (Monday, April 11, 2011 3:45:17 PM EDT)

.WARNING:Error while copying directory /u01/app/11.2.0/grid with exclude file list
'/tmp/OraInstall2011-04-11_03-44-48PM/installExcludeFile.lst' to nodes 'dmorldb08'. [PRCF-2028 :
Failed to get system information for /u01/app/11.2.0/grid/network/admin/samples/listener.ora:
Permission denied

PRCF-2028 : Failed to get system information for
/u01/app/11.2.0/grid/network/admin/samples/listener.ora: Permission denied]

Refer to '/u01/app/oraInventory/logs/addNodeActions2011-04-11_03-44-48PM.log' for details. You may
fix the errors on the required remote nodes. Refer to the install guide for error recovery.

Solution:

Change permissions to allow read and execute on the samples directory, recursively (-R).
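
For example, using the GI home path from the error output:

#chmod -R a+rx /u01/app/11.2.0/grid/network/admin/samples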

4. After running root.sh, the ASM instance does not start.


From the alert.log file:

Fri Apr 29 11:41:41 2011

PMON (ospid: 14100): terminating the instance due to error 481

Fri Apr 29 11:41:41 2011

System state dump requested by (instance=2, osid=14100 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_14110.trc

Fri Apr 29 11:41:41 2011

ORA-1092 : opitsk aborting process

The information in the diag trace file:

*** 2011-04-29 11:39:44.019

I'm the voting node

Group reconfiguration cleanup

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

DIAGNOSTIC

Run rds-ping from the node being restored to the other compute nodes and validate connectivity:

#rds-ping -c 10 <ib-address of other nodes in the cluster>

If RDS can establish connectivity, the output of the command will be:

rds-ping -c10 192.168.10.2


1: 51 usec
2: 24 usec
3: 25 usec
4: 31 usec
5: 25 usec
6: 23 usec
7: 27 usec
8: 21 usec
9: 35 usec
10: 19 usec

If the output is null, the RDS connection is broken.

SOLUTION

• Reboot the node and test rds-ping again

• Validate rds-ping against all the compute nodes and storage cells

• If the problem is still present after the reboot, collect:

o rds-info -I output from all nodes/cells

o LMON, LMD, alert, diskmon and cssd log files from all compute nodes
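
A sketch of the collection using dcli, assuming the dbs_group and cell_group files list the compute nodes and storage cells respectively (file names are illustrative):

#dcli -g dbs_group -l root "rds-info -I" > rds_info_dbs.out

#dcli -g cell_group -l root "rds-info -I" > rds_info_cells.out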
