
- Introduction
- Troubleshooting
- Backup/PTI Boxes
- Hands On

- Configuring a new Minnow (will include a step to assign hot spares)
- Configuring HBA Cards
- Backup Server
- OCRT Specific

4 Internal Disks, 2 Logical Drives

# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t40d0 <SUN-StorEdge3510-413C cyl 35021 alt 2 hd 64 sec 127>
          /pci@1e,600000/SUNW,jfca@2/fp@0,0/ssd@w216000c0ff887677,0
       1. c1t46d0 <SUN-StorEdge3510-413C cyl 35021 alt 2 hd 64 sec 127>
          /pci@1e,600000/SUNW,jfca@3/fp@0,0/ssd@w266000c0ffe87677,0
       2. c3t0d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@1f,700000/scsi@2/sd@0,0
       3. c3t1d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@1f,700000/scsi@2/sd@1,0
       4. c3t2d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@1f,700000/scsi@2/sd@2,0
       5. c3t3d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
          /pci@1f,700000/scsi@2/sd@3,0

Disks 0 and 1 are the logical drives; disks 2, 3, 4 and 5 are the physical drives inside the Netra 440. How are the disks configured? See the next page.

Each of the four internal drives is partitioned as shown in the table below. Disk 0 and Disk 1 are mirrored; Disk 2 and Disk 3 are mirrored. What does that mean and what are the implications?

146 GB internal disk layout:

Slice       Mount Point      Size (MB)
c3t0d0s0    /                10240
c3t0d0s1    swap             12360
c3t0d0s2    overlap          135254
c3t0d0s3    /zonehome        25600
c3t0d0s4    /usr/gsm/logs    20480
c3t0d0s5    /home            56320
c3t0d0s6    /usr             10240
c3t0d0s7    metadb           14

[Diagram: internal drives Disk 0, Disk 1, Disk 2 and Disk 3, each partitioned per the layout above]
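On a running system you can confirm this layout directly. A minimal sketch, assuming the disk name from the table above (output will differ per site):

# prtvtoc /dev/rdsk/c3t0d0s2     (prints the slice/partition map for the disk)
# metastat -p | grep c3t0d0      (shows which metadevices are built on those slices)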

At any given point, one set of disks is active. Upgrades are always done on the redundant pack; hence there is no need to break mirrors. When booting up, always boot from disk0 or disk2.

The concept of a logical drive is to use a cluster of physical drives as one drive. This is used in the Minnow (StorEdge 3510). There are two logical drives, each consisting of five physical drives. How are the logical drives configured?

          Netra 1   Netra 2   Netra 3   Netra 4
Port 0      40        43        44        47
Port 1      46        45        42        41

- The Minnow has a total of 12 drives. Drives 0 through 4 are configured as the primary logical drive and drives 6 through 10 are configured as the secondary logical drive. Drives 5 and 11 are configured as global spares.
- Each logical drive is divided into 4 LUNs. Each Netra is assigned a primary and a secondary LUN as described in the table above. Furthermore, each LUN is partitioned into 8 dblinks and a ne_data partition. Refer to the disk layout diagram and SR 15 new install MOP Appendix F for more details.
- What does this mean for OCRT/CNRC troubleshooting purposes? Pretty much nothing, but it is presented for an understanding of the system.

- ALOM
- Upgrades
- How to figure out which pack is active
- What to do if a disk goes bad (Minnow vs. internal)
- Mirroring issues (in the Minnow vs. internal)
- How to reboot (from the ok prompt and the # prompt)
- How to run fsck
- When the HBA card is replaced
- Assigning a global spare after a Minnow disk replacement

The ALOM is simply a port that gives you access to the machine irrespective of the state of the machine. It is configured with an IP address at the new-install phase of the NGO migration. Any reboots should be performed via the ALOM. Once you get the sc> prompt, just type console -f and hit <enter> to gain access to the console login. FYI: the default ALOM login/password is admin/N440alom.
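A minimal sketch of an ALOM session, assuming the ALOM IP configured at install (the IP and console host name are illustrative only):

$ telnet 10.1.1.50               (hypothetical ALOM IP)
login: admin
Password: N440alom
sc> console -f                   (use #. to escape back to the sc> prompt)
chic_sys2 console login: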

As mentioned above, Disk0 and Disk1 are mirrored and Disk2 and Disk3 are mirrored. At any given point, one of the two packs is active; the other one just holds the /pre_netra and /omcbackup directories and is called the redundant pack. Upgrades are always done on the redundant pack, hence mirrors don't need to be broken. When booting the machine from the ok prompt, always use the first disk on the active pack.

To find out which pack is active, run the metastat -p command as root and view the last three lines of output, for example:
# metastat -p
d0 -m d1 d2 1
d1 1 1 c3t2d0s0
d2 1 1 c3t3d0s0

In this example, d0 is the main mirror with d1 and d2 attached as submirrors. The physical disk for d1 is t2 and for d2 is t3; hence we know disk2 and disk3 are the active mirrored side. Use boot disk2 to boot this machine from the ok prompt.
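For comparison, a hedged sketch of how the same check might look when the other pack is active (target numbers assumed for illustration):

# metastat -p
...
d0 -m d1 d2 1
d1 1 1 c3t0d0s0
d2 1 1 c3t1d0s0

Here the submirrors sit on t0 and t1, so disk0 and disk1 are the active side and you would use boot disk0 from the ok prompt.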

If a disk goes bad in the Minnow, one of the hot spares will take over. If another disk goes bad, the other hot spare will take over. However, if a third disk goes bad and the previous two disks have not been replaced, you will have major issues. Also, after a failed disk is replaced, the replacement must be configured as a global hot spare. These steps are extremely important and should be noted well!

How do you figure it out? Log into the Minnow and select "view and edit Drives"; you will see a screen like the following picture. As shown, drive ID 3 is bad and global spare ID 5 is online and has taken over. You will also see an amber flashing light on the front panel of the Minnow. The drive errors will also be in /var/adm/messages.
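A minimal sketch of checking the messages file for drive errors from the N440 side (exact error strings vary by failure mode; the Minnow LUNs appear as ssd devices, the internal disks as sd):

# tail -100 /var/adm/messages | grep -i error
# grep -i ssd /var/adm/messages | tail -20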

What are the next steps? Open a case with SUN, give them all the data they require, and request a hard-drive replacement. A SUN tech will come and replace the bad drive. MOTO CNRC will then have to configure the newly replaced hard drive as a global spare. ***This is an extremely important step and will be demonstrated while configuring the Minnow.

You will see errors in /var/adm/messages and the file system on that drive will be inaccessible. This is the same as before: turn pack 1 (Disk 0 or 2) or pack 2 (Disk 1 or Disk 3) off depending on which disk is bad, replace the disk, and sync the mirrors back.

Mirroring issues can happen in two places: on the internal drives and on the Minnow drives. How do I know whether the mirroring issue is with an internal drive or a Minnow drive? Simply run a metastat command and find out which mirrors have issues. Mirrors labeled d0 through d29 belong to the internal drives; the rest belong to the drives in the Minnow.
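A minimal sketch of that check (metadevice names are illustrative):

# metastat | grep -i maint        (lists submirrors that need maintenance)
# metastat -p | grep d1           (maps a suspect metadevice to its physical slice)
d1 1 1 c3t0d0s0                   (d1 is in the d0-d29 range and sits on an internal disk)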

After you've figured out that the drives with issues are internal drives, verify that the drives are not bad by checking the /var/adm/messages log file. Fix mirroring issues only after verifying that the disk is good. Mirroring troubleshooting is the same as in previous releases.
# metadetach -f mirror submirror
# metaclear submirror
# metainit submirror
# metattach mirror submirror

Get OCRT support if both submirrors of a mirror need maintenance.

Repeat the meta commands for all submirrors in maintenance mode, then check whether the problem is resolved with the metastat command.
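A concrete, hedged sketch of that sequence for a hypothetical submirror d11 of mirror d10 built on slice c3t0d0s3 (names and slice are illustrative; confirm the real layout with metastat -p first):

# metastat -p | grep d10          (confirm the mirror/submirror layout)
# metadetach -f d10 d11
# metaclear d11
# metainit d11 1 1 c3t0d0s3       (recreate the submirror with its original stripe definition)
# metattach d10 d11               (reattach; a resync starts automatically)
# metastat d10                    (watch the resync; repeat for other submirrors in maintenance)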

After verifying that the drives in need of maintenance are in the Minnow, verify that the physical drives are not bad by checking /var/adm/messages and by logging into the Minnow. If any physical disks show bad, get SUN support to replace them. For any minnow mirroring issues, verify with customer that they have a recent backup and call OCRT.

From the ok prompt:


ok> boot disk0     (or disk2)
or simply
ok> boot -r
ok> boot cdrom -s  (to boot from the CD-ROM in single-user mode)

From the root prompt:


# reboot
# halt             (to bring it down to the ok prompt)

To get to the ok prompt from the sc prompt:


sc> break
sc> console -f     (to get back to the ok prompt)
ok>

# cd /
# metastat -p | grep d2 | tail -1
d2 1 1 c3t1d0s0    (target to fsck is t0 and disk0, x=0)
or
d2 1 1 c3t3d0s0    (target to fsck is t2 and disk2, x=2)
Note: replace x with the actual disk number below.

# fsck -y /dev/dsk/c3txd0s0
# fsck -y /dev/dsk/c3txd0s3
# fsck -y /dev/dsk/c3txd0s4
# fsck -y /dev/dsk/c3txd0s5
# fsck -y /dev/dsk/c3txd0s6
If any of the above fsck commands states that the file system was modified, re-run that particular command until it no longer shows it was modified.
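As a hedged illustration of that loop on one slice (slice name assumed; the banner is the standard UFS fsck message):

# fsck -y /dev/dsk/c3t0d0s4
...
***** FILE SYSTEM WAS MODIFIED *****
# fsck -y /dev/dsk/c3t0d0s4
...
(no FILE SYSTEM WAS MODIFIED banner this time, so this slice is clean)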

If Informix is down (or not responsive) and cd'ing to /usr/gsm/ne_data gives I/O errors, then there is likely an issue with the connection from the Netra 440 to the Minnow. It could be the HBA card on the N440, the GBIC connectors, or the connection on the Minnow itself. Get SUN involved to isolate the issue. However, if it is the HBA card that needs to be replaced, the SUN tech will do so. There are a few things Motorola will need to do before and after the HBA cards are replaced; details are in the Hands On session.
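A minimal sketch of a quick symptom check for this condition (paths from above; exact error text will vary):

# cd /usr/gsm/ne_data
# ls                              (I/O errors here point at the N440-to-Minnow path)
# metastat | grep -i maint        (Minnow metadevices in maintenance are another symptom of a lost path)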

After a hot-swappable disk is replaced in the Minnow, it is imperative that it be declared as a global spare. The hands-on training will cover the procedure in detail.

- Consists of a Netra 240 and the C4 Jukebox (tape library)
- Automated backups (usually set for 6:20 in the morning; no cron jobs start the backup)
- A single backup process captures the system processor, MMI and Informix backups
- Issues usually arise from communication problems between the Netra 240 and the C4

1. At the beginning of the backup, the Informix data is backed up by performing a binary dump to a file on the system processor.
2. The OMC builds a backup image including the Informix data. Because the MMI has been merged onto the OMC platform, a backup of the system processor's file systems includes the MMI file systems.
3. The backup is sent to the backup server (Netra 240 and StorEdge C4) over the E0 LAN.

/usr/gsm/logs/platform/OMCbackup<time>
Look for the following:
URBAN_BACKUP_STARTED
BACKUP_STARTED_ON_chic_sys2
Informix backup is good!
12/03/06 06:54:17 pstclntsave: All command(s) ran successfully.
12/03/06 06:54:17 pstclntsave: Exited.
12/04/06 06:22:16 preclntsave: All command(s) ran successfully.
12/04/06 06:55:17 pstclntsave: All savesets on the worklist are done.
BACKUP_COMPLETED_ON_chic_sys2
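A quick, hedged one-liner for confirming that the most recent run reached completion (the log path follows the pattern above; the timestamp suffix is site-specific):

# grep BACKUP_COMPLETED /usr/gsm/logs/platform/OMCbackup* | tail -1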

You can also check the event logs:


#0 - NOT APPL - *NONE*. backupStartedEvent - OMC - OMC - Dec 4, 2006 06:20:56. OMC-R backup started on chic_sys2
#0 - NOT APPL - *NONE*. backupCompletedEvent - OMC - OMC - Dec 4, 2006 06:55:17. OMC-R backup completed successfully on chic_sys2

Troubleshooting is essentially a process of elimination. The main issue we've seen is communication failure between the Netra 240 and the C4; hardware issues are also plentiful. The problem usually shows up as backups not completing for that day.

1. Ask the MSO if the C4 tape library is online by opening the C4 browser GUI. Follow the steps in the sysadmin manual (Chapter 10) to open the GUI and see whether the status is online or offline. If offline, change it back to online and proceed with backups.
2. Establish connectivity between the N240 and the C4 by running jbverify. (Page 21 of the Installation MOP.)
3. If jbverify passes, backups should work. If it fails, recreate the device tree and run jbverify again (a quick sanity check for the recreated device nodes is sketched after this list):
# cd /dev/rmt
# rm *.*
# devfsadm
Rerun jbverify.

4. If backups still fail, reboot the N240:
# halt
ok> reset-all
Run jbverify once more to establish connectivity.
5. After the box is up, rerun the backup. If it still fails, bring down the box and run probe-scsi-all:
# halt
ok> probe-scsi-all
If you see bus fault errors, it is likely a HW issue. Get SUN support at this time while continuing to isolate the issue.
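The sanity check referenced in step 3, as a minimal sketch: after devfsadm finishes, confirm that tape device nodes for both C4 drives were recreated before rerunning jbverify (node names depend on what is detected):

# ls /dev/rmt
(expect device nodes for both drives, including the ...cbn handles that jbverify reports)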

[Diagram: Netra 240 (SCSI ports S3, S4, S5) cabled to the StorEdge C4 (SCSI ports S1, S2; tape drives D1, D2)]

This diagram is a simplified view of the connection between the N240 and the C4 (same as the cabling slide). Use it while troubleshooting issues. D1 and D2 are drives 1 and 2 in the C4 that hold the backup tapes. S1 and S2 are SCSI ports on the C4; S3, S4 and S5 are SCSI ports on the N240.

If jbverify is still failing after you have rebooted the N240, you can try the following steps to isolate the problem. SUN should already have been engaged in the resolution as well.

Disconnect all SCSI cables from N240 (S4 and S3 are disconnected):
# halt
ok> reset-all
ok> probe-scsi-all

If you see bus fault errors, the likely culprit is the cards/ports on the C4 (S1/S2) or the SCSI cables. If you don't see errors, start plugging the N240 cables back in one by one, as in the following steps.

Connect the top-left cable on the N240 (S4):


ok> reset-all
ok> probe-scsi-all
If there are errors, the culprit is the controller card or the SCSI cable (S4). If no errors, follow step 3.

Disconnect the top cable (S4) and connect the bottom cable on the N240 (S3):
ok> reset-all
ok> probe-scsi-all
If there are errors, the culprit is the controller card or the SCSI cable (S3). If no errors, follow step 4.

Connect all cables


ok> reset-all
ok> probe-scsi-all
ok> boot -r
Run jbverify to verify connectivity.

If jbverify is good, we're set. If not, we will need to delete the C4 and add it again. From the nwadmin GUI:
1. Media > Jukeboxes: delete the C4.
2. Media > Devices: delete the devices. Make sure they're not mounted.
3. Run jbconfig.
4. Run jbverify.

- Configure a brand-new Minnow; know how to connect it to the N440
- How to map HBA cards
- Backup Server
- PTI

Configuring the Minnow and HBA cards: see Appendix F of the new install MOP.

- How to open the nwadmin and C4 GUIs
- Differences
- Perform an immediate backup
- Delete and re-add the C4
- Running jbverify and jbconfig

This section applies only to mirroring issues with the Minnow drives. It can, however, also be used when both submirrors of the main mirror are in need of maintenance. **********Caution: this procedure should be used only after it has been confirmed that the disks in the Minnow are not bad (by checking /var/adm/messages and logging into the Minnow) and after verifying that the internal disks are not bad. Please verify with the customer that a good backup exists before proceeding.

Link will be provided for OCRT.

# jbverify -j
Jbverify is running on host omcbackup
Processing jukebox devices...
Processing jukebox C4:
Testing drive 1 (/dev/rmt/<A>cbn) of JB C4
Testing drive 2 (/dev/rmt/<B>cbn) of JB C4
Jukebox C4 on omcbackup successfully processed.
Finished processing jukebox devices.
**********************************************************************
Summary report of jbverify
======= ====== == ========
Hostname    Device Handle      Blocksize   Jukebox   Drv No.   Status
---------   ----------------   ---------   -------   -------   ------
omcbackup   /dev/rmt/<A>cbn    65536       C4        1         Pass
omcbackup   /dev/rmt/<B>cbn    65536       C4        2         Pass
**********************************************************************
Exiting jbverify successfully.
