5 General TS and Navisphere (Important)

EMC / CLARiiON Troubleshooting Strictly Confidential
General Array Troubleshooting & Navisphere Section Five

Copyright 2004 EMC Corporation. All rights reserved. Revision A02

Section Five - General Troubleshooting / Navisphere

NOTICE: This document contains sensitive technical information which is for use
solely by EMC employees and authorized service partners of EMC Corporation.
Any use, duplication or distribution outside the Corporation is strictly prohibited.


Architectural differences

To begin any troubleshooting process, one must understand the product they are working on.
The purpose of this section is to provide fundamentals and better understanding.

FC4700 SP-based hard drive

There is a 6GB hard drive resident on each storage processor. This storage area is unseen by
users; end-user data is never stored here; service personnel may access it via SymmRemote.
This drive is not a field replaceable unit. A failure of this drive will require the entire SP to be
replaced. This is the same policy CLARiiON has always maintained: if any component of the SP
should fail, the entire SP would be replaced.

The operating system of the SP resides on this drive and with it the services and layered drivers
that comprise the FC4700 software stack.

Figure: FC4700 SP showing the IDE drive

PSM - Persistent Storage Manager

The Persistent Storage Manager (PSM) is a hidden LUN that records configuration information
specific to the CLARiiON's environment on disk. This PSM LUN is what allows an SP to be
replaced and come up running the correct software with the correct information on hosts, LUNs,
storage groups, etc.


Both SPs access a single PSM so that their environmental records are always in sync. If one SP
needs to be replaced, the new one can find the unique environmental information on the PSM.

If one SP receives new configuration information, that data is written to the PSM and the other
SP instantaneously updates itself. The PSM is created at initialization of the array via Navisphere,
and currently occupies 512MB. Upon managing an array that does not have a PSM, the
Navisphere client software (Navisphere Manager or NaviCLI) will warn the user that the array is
currently in an un-initialized state, and allow the user to perform the initialization.

Once created, destruction of the PSM will result in loss of all host information on the array. This
is why the installer must determine the types of RAID groups that the end-user will employ. For
example: if you have placed the PSM LUN on a five-drive RAID5 group (the default PSM setting),
the customer will be forced to use those disks as RAID5.

Assume we're using 18GB drives: that's approximately 90GB of raw storage that the customer most
likely will want to use. The PSM will occupy less than 1% of that RAID group. Make sure the
customer can use the RAID type selected for the PSM LUN. Note also that, for performance
reasons, the LUs selected for inclusion in the PSM Raid Group should NOT be subject to
heavy I/O.
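
The "less than 1%" figure can be checked with a short calculation. This is only a sketch in Python: the drive count, drive size and PSM size come from the text above, while treating 1GB as 1024MB is an assumption made for the arithmetic.

```python
# Rough check of how much of a five-drive RAID group the PSM occupies.
# Figures are taken from the text; 1 GB = 1024 MB is an assumption.
DRIVES = 5
DRIVE_SIZE_GB = 18
PSM_SIZE_MB = 512

raw_gb = DRIVES * DRIVE_SIZE_GB               # raw capacity before RAID overhead
psm_fraction = PSM_SIZE_MB / (raw_gb * 1024)  # PSM share of the raw capacity

print(f"Raw capacity: {raw_gb} GB")
print(f"PSM share of raw capacity: {psm_fraction:.2%}")
```

The share is computed against raw capacity; against usable RAID5 capacity it would be somewhat higher but still small.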

The PSM is used by non-disruptive upgrade to store new and previous version driver software
during the installation process. This allows the installation to occur within the array removing
the issues associated with a host's failure or a lost connection during installation. After
installation, the drivers are run from cache on the SP-based hard drive for better response time
to the OS.

The software packages for the current & previous versions for each component are stored in
PSM. That way, when an SP of arbitrary software revision, with an arbitrary set of layered
drivers, is inserted, the array software can install the currently valid set on that SP.

What host information is stored in the PSM?

The security provided by the FC4700's PSM has been featured prominently as an important step
forward: it moves critical host configuration data off the host's agent.config file and on to a RAID
protected hidden LUN on the array.

It allows hosts to be taken offline and brought back online and still regain access to their
storage. But just what is being stored there?

Drive mapping
The drive letter in Windows or device name in Unix that the OS has assigned to a
particular LUN will be noted by the host agent and pushed to the array. This information
is determined dynamically by the HOST agent, and is reported to any clients. This is why
a user must manage the host agents for hosts attached to FC4700s, in order to get this
mapping information.

Host information
The host agent reports: hostname, OS, version of ATF, versions of Agent, IP address.



Privileged users
Only users listed in the host's agent.config file may manage that host. This is the prime
host-based security available in the Navisphere environment.

Polling rates

All AccessLogix Host information
Initiator records (associating a host name with an HBA WWN) and storage group mapping.
The association between HBA and hostname is collected by the array agent and
stored in the PSM. AccessLogix uses this association to ensure that each host sees
only the storage groups assigned to it, and that unauthorized hosts do not see into
other groups.
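
The relationship between initiator records and storage groups can be pictured with a small data model. The sketch below is hypothetical: the class, the field names, the WWNs and the host names are invented for illustration and do not reflect how the PSM actually stores these records.

```python
# Hypothetical model of initiator records and storage groups.
# This is NOT the array's real on-disk format; all names are invented.
from dataclasses import dataclass, field


@dataclass
class StorageGroup:
    name: str
    luns: set = field(default_factory=set)
    hosts: set = field(default_factory=set)


# Initiator records: HBA WWN -> host name (WWNs here are made up).
initiators = {
    "10:00:00:00:c9:00:00:01": "hostA",
    "10:00:00:00:c9:00:00:02": "hostB",
}

groups = [
    StorageGroup("sg_hostA", luns={0, 1}, hosts={"hostA"}),
    StorageGroup("sg_hostB", luns={2}, hosts={"hostB"}),
]


def visible_luns(hba_wwn):
    """Return the LUNs an HBA may see; unregistered HBAs see nothing."""
    host = initiators.get(hba_wwn)
    if host is None:
        return set()
    visible = set()
    for group in groups:
        if host in group.hosts:
            visible |= group.luns
    return visible


print(visible_luns("10:00:00:00:c9:00:00:01"))
```

An HBA with no initiator record maps to no host, so it is given an empty set of LUNs.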

What array data is stored in the PSM?

In addition to host information, the PSM stores the following array agent information.

All AccessLogix information
Storage groups, current default storage group, physical array (private/public LUs), the
user-defined name of the array

SP IP address
One of the first steps in a FC4700 installation is to use a serial connection to gain PPP
access to the SP. Then you may set the IP address and complete the installation from a
remote Management station connected via the LAN.

Privileged users (array)
SP authorized users. At initialization, anyone able to access the SP may configure it.
After the first privileged user is entered the SP becomes secure and allows only users
from the privileged users list to modify the configuration.

ALPA
A prerequisite for remote mirroring is that the Arbitrated Loop Physical Addresses
(AL_PAs) for each SP must be unique.
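
The privileged users rule for the array described above (anyone may configure the SP until the first privileged user is entered) amounts to a simple check. A minimal Python sketch, in which the function name and the list representation are assumptions made for illustration:

```python
def may_configure(user, privileged_users):
    """Anyone may configure an SP whose list is empty; afterwards only listed users."""
    if not privileged_users:
        return True  # SP not yet secured
    return user in privileged_users


print(may_configure("installer", []))         # list empty, SP still open
print(may_configure("installer", ["admin"]))  # list populated, user not on it
```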

Vault private space layout

The first nine drives in the DPE have space set aside to accommodate cache de-staging in the
event of a component failure in the write caching subsystem. This allows for an orderly way to
protect the user data in memory. These drives are configured into a nine-drive RAID3 group.

Note: A faulted condition in the DPE will automatically disable write caching.

CLARiiONs have Standby Power Supplies, designed to maintain power to the DPE long enough to
allow data stored in memory to be securely written to disk (the vault drives), before the system
powers off.


Database drives

The database drives hold the information that the Core operating system that is running the
storage processor needs in order to track array-specific data on the:

LUNs
RAID groups
SP PROM code and BIOS
Chassis ID of the array

The space used by the database is trivial, no larger than one MB. It is triple mirrored between
the first three drives in the DPE.

FC4700 Private Space Layout

CX Series fibre boot - Boot from fibre, picture of SP, note there is no on-board disk.



The CX600 is a fibre boot based storage processor. The boot image exists on fibre channel
drives in the first DAE2 chassis, also referred to as the DAE2 O/S. The PSM LUN functions in
a similar fashion to that of the FC4700.

The PSM LUN is integrated and hidden in the CX600; unlike the FC4700, PSM LUN configuration
is not required. The differences between PSM usages will not be discussed in this document.

The partitioning of the disk drives in this first chassis is shown below. Please note that the size
and usage of the partitions changed slightly between pre-Release 11 software and Release 11
going forward. It is important to remember this as once you've committed to the new software,
there is no going back.

Data Directory Boot Service - 2 MB - all disks in array - Fixed space for boot service.

Data Directory - 2 MB - all disks in array - Each disk contains a data directory that maintains a
map of the database entries for that disk.

Flare Database - 28.3 MB - all disks in array - The traditional database is triple mirrored on
drives 0, 1 & 2. This area is used in other drives for FRU signature, clean/dirty flags, HW/FRU
verify, etc. and a large "reserved for future use" area.

External Database - 35 MB - drives 0, 1 & 2 - Contains persistent information outside the
purview of Flare such as: BIOS code image, PROM code image, Chameleon Kernel software,
Chameleon volume manager, and Chameleon file system database.

NT Boot Partitions - 2826.2 MB - drives 0, 1, 2 & 3 - Each SP will have a mirrored NT boot
partition. SPA will use drives 0 & 2, SPB will use drives 1 & 3.

Reserved Space - 300 MB - Set aside for future NT growth.

PSM - 1024 MB - drives 0, 1 & 2 - Triple mirrored private LUN for storage of persistent SP data.

Vault - 2176 MB - drives 0 through 4 - RAID 4+1 area used for vaulting cache data in a power fail
emergency.

Core Dump Partition - 1 GB - disk 4 - Reserved for Chameleon II NAS software core dumps.

Total private space, drives 0-4 = 6393.5 MB
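
The stated total can be reproduced by adding the listed areas. The Python sketch below uses only the sizes given above; the dictionary name is chosen for illustration. The sum matches 6393.5 MB when the 1 GB core dump partition is left out, which suggests that partition is not counted in the stated total.

```python
# Private space areas in MB, as listed above (core dump partition excluded).
private_space_mb = {
    "Data Directory Boot Service": 2.0,
    "Data Directory": 2.0,
    "Flare Database": 28.3,
    "External Database": 35.0,
    "NT Boot Partition": 2826.2,
    "Reserved Space": 300.0,
    "PSM": 1024.0,
    "Vault": 2176.0,
}

# Round to one decimal place to absorb floating-point noise.
total_mb = round(sum(private_space_mb.values()), 1)
print(f"Total private space: {total_mb} MB")
```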



Figure: CX Series Private Space, Release 10 & Prior (not drawn to scale). Disks 0 through 4.
Areas labeled: User Space; Data Directory Boot Service (2MB/disk); Data Directory (2MB/disk);
Flare Db (28.3MB/disk); FRU Signature (28.3MB/disk); External Db (35MB/disk); PSM (1024MB);
SPA and SPB NT Boot, Primary and Secondary (2826.2MB each); N/U regions; NAS Core Dump
Area (1GB); Vault Area (2176MB); Reserve Area.

Figure: CX Series Private Space, Release 11, after Utility Partition NDU and first Utility
Partition boot (not drawn to scale). Disks 0 through 4, and 5 to end of array. Areas labeled:
User Space; Data Directory Boot Service (2MB/disk); Data Directory (2MB/disk);
Flare Db (28.3MB/disk); FRU Signature (28.3MB/disk); External Db (35MB/disk);
PSM (1024MB/disk); SPA and SPB NT Boot, Primary and Secondary (2826.2MB each); N/U regions;
NAS Core Dump Area (1GB); Vault Area (2176MB, or 3200MB with Release 12);
Reserve Area (100MB/disk); Image Repository (with two 1GB labels); SPA and SPB Utility,
Primary and Secondary (200MB each).


What is the difference between an SPE (CX600) and a DPE (FC4700)?

FC4700 has an OS based on NT, which resides in an onboard IDE drive. The PSM is a hidden
LUN that is on a Raid Group selected during the initialization process. Note that the Raid Group
is out on the fibre channel drives, separate from the SP-IDE drive.

If an SP is replaced, a process called newSP will run and allow the new SP to get its software
packages from the PSM. Both SPs have access to this single LUN and will always keep their
environmental records in sync.

The PSM can exist on as few as two drives and as many as 10 drives. As noted before, the
Vault area is on the first nine drives of the DPE chassis and the database drives are a triple
mirror on the first three drives.

FC4700 Array back view (figure labels: SPS, DPE, DAE)


CX600, CX400 and CX200 - The CX series arrays use a fibre boot based SP. The NT boot
image exists on fibre channel drives in the first DAE. This first DAE, on Bus 0, is also known as
the DAEOS chassis.

The PSM and Vault areas are also part of a private area that is reserved on the first five disk
drives. The database area is still a triple mirror but is now part of the private area. See the
private space layout earlier in this section for more information on which disks contain these areas.

CX600 Array back view

CX600 improvements over FC4700

The CX600 array provides for enhanced Storage Processors which consist of a motherboard with
two Pentium 4 processors and a minimum of 2 GB of cache memory. There is an option of 2 GB
of additional memory, but it is not field-upgradeable. For the additional 2 GB of cache upon SP
replacement, the DIMMs are ordered separately. The SAN personality card consists of four fibre
optic connections. Other items of interest include:

Drives per Storage System - 240
Drive Cache Vault - 5
Maximum LUN Counts - See Primus article emc70491 for details
Max LUN Size - 2TB
Max RAID Groups - 240
Array Boots from first DAE2 (DAE2 O/S) chassis and contains a factory bound PSM LUN

Figure labels (CX600 back view): SPS, SPE, DAE2 O/S


Navisphere Block diagram and data flow

The above diagram is a reminder of your previous CLARiiON training. It shows how the legacy
arrays were managed over the fibre channel. Starting with the FC4700, management was taken
from being host based to being array based. The Navisphere Agent was moved down into the
array with management occurring over the IP network. There still remains a host agent which is
used to register the host HBAs with the array. It is also used to provide file system information
to the LUN listing within Navisphere.

Figure: Legacy versus array-based management. Pre-FC4700: Navi Manager sends commands to the
Host Agent over IP; the Host Agent reaches the Core OS of the pre-FC4700 storage system via
in-line fiber (FC). FC4700 & CX: Navi Manager connects over IP to the FC4700 or CX-series
array (SP A and SP B, each with ports p0 and p1), where the Array Agent reaches the Core OS
inside the stack.

Figure: Navisphere 6 components. Browsers on Linux, Solaris, Windows 2000 and NT connect
across the intranet to the Management Server on the array. Modules labeled in the array:
Directory, CLARiiON, Security, Legacy, Analyzer, Persistence, and two marked Futures.
GUI: Management, SnapView, and MirrorView.


The illustration above shows how the software components of Navi 6 interact. The cloud shown
represents the client's subnet (not the internet yet, as we're still trying to understand firewall and
security issues) and each circle represents one of the four operating systems on which you can
open a browser and connect to the array's IP address to manage it with Navisphere 6.

The green box is the array and it contains the providers that process various calls made by the
client browser for changes to security, etc. Also within the array are the modules for future
support.

The user on the NT browser is issuing a request to make a change on the array. The command
goes over the blue arrow (the LAN) to the Management Server, which routes the call to the
correct provider. The CLARiiON provider then translates the call to the array agent, which passes
the command to Core software.
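
The request path just described (browser to Management Server, to the provider, to the array agent, to Core) can be pictured as a simple dispatch chain. The Python sketch below is purely illustrative: every class name, method name and request field is hypothetical and is not taken from the Navisphere software.

```python
# Hypothetical sketch of the request path:
# browser -> Management Server -> provider -> array agent -> Core.
class ArrayAgent:
    def send_to_core(self, command):
        # Stand-in for the agent handing the command to the Core software.
        return f"core executed: {command}"


class ClariionProvider:
    def __init__(self, agent):
        self.agent = agent

    def handle(self, request):
        # Translate the client call into an agent command (format invented).
        command = f"{request['action']} {request['target']}"
        return self.agent.send_to_core(command)


class SecurityProvider:
    def handle(self, request):
        return f"security handled: {request['action']}"


class ManagementServer:
    def __init__(self):
        self.providers = {}

    def register(self, name, provider):
        self.providers[name] = provider

    def route(self, request):
        # Route the call to the provider named in the request.
        provider = self.providers.get(request["provider"])
        if provider is None:
            raise KeyError(f"no provider named {request['provider']!r}")
        return provider.handle(request)


server = ManagementServer()
server.register("clariion", ClariionProvider(ArrayAgent()))
server.register("security", SecurityProvider())

result = server.route({"provider": "clariion", "action": "bind", "target": "LUN 5"})
print(result)
```

The point of the structure is that the client only ever talks to the Management Server; which provider does the work is decided on the array.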

CIMOM Architecture (also known as ManagementServer)

The CIMOM is comprised of several layers which include a web server to provide HTTP access, an
encoding layer to translate CIM/XML and a CIMOM object manager. The providers are used to
collect data and feed that data into CIMOM as well as execute methods.

RAID ++ Provider

The Raid++ provider is at the core of the Navisphere CIMOM architecture. It is responsible
for handling all RAID-specific get/set operations.



Directory Provider

This provider is responsible for caching the list of arrays found on the selected subnets. It
will periodically ping arrays in the list to verify their state. It will also maintain the heartbeat
connection to all arrays within the management domain. A certain array (or arrays) will be
designated as the directory provider master to minimize heartbeat pings on the network.



Event Monitor Provider

In a centralized notification model, one of the arrays is designated to process and forward critical
events via the following mechanisms:

Modem, Pager or Email
Launch executable

Security Provider

This provider is responsible for authenticating users to the array and allowing a user based on a
security ID to make requests on objects within the CIMOM.



Admin Provider

This provider is responsible for managing all configuration aspects of the Navisphere Manager 6.X
infrastructure. This includes the web server, the CIMOM and the providers.


Boot issues (array)

To troubleshoot a boot issue effectively, one must understand some of the basics of the boot
process: from the point of power up, through the operating system boot sequence, and finally to
the point when the array is ready to process host I/O.

The storage processor (SP) has an operating system and other software components which
replace FLARE as the sole base code. Under this base operating system reside Layered Drivers
which are software components which provide storage-oriented functionality.

As such, the SP goes through a boot process that is very similar to a standard NT server
boot sequence. What follows is a description of the boot process from power up. You will be
able to see the BIOS portion of the boot but not the actual NT process. Portions of the NT
boot sequence will be visible in the SP event log.

Local or FC Disk/Booting

When Windows/NT is booted, the BIOS finds a disk based on a search pattern in NVRAM. The
disk is assumed to be partitioned with a FAT or NTFS file system on the first partition. In the
root directory there is a file called boot.ini which is read to determine which partition to actually
boot from. The kernel is then loaded and the file system in that partition is mounted. That
partition must contain a paging file, along with other files. A normal NT Workstation install
takes 200-300MB of disk space.
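
To make the role of boot.ini concrete, the Python sketch below reads the partition number from the default entry of a sample file. The sample content is illustrative only (real files vary by installation, and the SP's own boot.ini is not shown in this document), and the function name is invented for this example.

```python
import re

# Illustrative boot.ini content; not copied from an SP.
SAMPLE_BOOT_INI = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Workstation Version 4.00"
"""


def default_boot_partition(text):
    """Return the partition number named by the default= entry, or None."""
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith("default="):
            match = re.search(r"partition\((\d+)\)", line)
            return int(match.group(1)) if match else None
    return None


print(default_boot_partition(SAMPLE_BOOT_INI))
```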

Before we ever get to the NT boot sequence, we first must look at the BIOS boot sequence.

Here is a power up of SPA; messages seen are similar to the following as viewed from
a hyperterminal connection.

Phoenix ServerBIOS 3 Release 6.0.
Copyright 1985-2001 Phoenix Technologies Ltd. All Rights Reserved
Copyright 1999-2002 by EMC Corporation, All Rights Reserved.
EMC BIOS Release 3.26
CPU = 2 Intel(R) XEON(TM) CPU 2.00GHz
637K System RAM Passed
173M Extended RAM Passed
Press <F2> to enter SETUP

Hard Disk 0 : None
Hard Disk 1 : None
Hard Disk 2 : None
Hard Disk 3 : None

Press Any Key to Continue

PhoenixBIOS Setup Utility
CPU Type : Intel(R) XEON(TM)
System ROMz : E9D9 - FFFF
CPU Speed : 2000 MHz
BIOS Date : 05/22/03
System Memory : 640 KB
COM Ports : 03F8 02F8 0300 0308
Extended Memory : 2096128 KB
LPT Ports : 03BC
Shadow Ram : 384 KB


Display Type : EGA \ VGA
Cache Ram : 512 KB
PS/2 Mouse : Not Installed

Hard Disk 0 : None
Hard Disk 1 : None
Hard Disk 2 : None
Hard Disk 3 : None

Copyright (c) EMC Corporation , 2003 <- This is the start of FLARE
Disk Array Subsystem Controller
Model: CX600
DiagName: Extended POST
DiagRev: Rev. 02.99
Build Date: Tue Jul 22 14:45:46 2003
StartTime: 10/20/2003 21:16:18
SaSerialNo: LKE00022706003
AabcdeBCDabEabcdFGHabIabcJabcKabcLabcMabcNabOabPabQabRabSabTabUabVabWabXYZ <- FLARE POST testing; hit ESC at any point here to enter debug mode.
Initializing back end FIBRE...

PCI Config Reg: 2.4.1 0x0157
FCDMTL 0 [2.4.1] Dual Mode Fibre init - OSW DB PTR 0x20000000
FCDMTL 0 [2.4.1] Cached memory - 0xF77B9 bytes @ 0x200006B0
FCDMTL 0 [2.4.1] Noncached memory - 0xC037F bytes @ 0x200F7E69 (0x200F7E69 phys)
FCDMTL 0 [2.4.1] DVM Initialized
FCDMTL 0 [2.4.1] IMQ base ptr = 20170000; IMQ length = 8000
Dualmode fibre init completed
FCDMTL 0 [2.4.1] TPM Notify: st=0xA000000, flg=0x4, cmd=0x1
FCDMTL 0 [2.4.1] TPM Hndle API Event: cntx=0x200004C4, evnt=0x4002, info=0x0
FCDMTL 0 [2.4.1] TPM Lnk Up: state=0xA000000, flg=0x84
Link Event: 0x00030005
FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: EF
FCDMTL 0 [2.4.1] DVM Duplicate address id already in list: E4
Device Event (0xE4): 0x00030012, tach_ptr: 0x08491854
Device Event (0xEF): 0x00030012, tach_ptr: 0x08491854
DL waited 1s for discovery
Target 0 is online
Target 1 is online
Target 2 is online
Target 3 is online
Target 4 is online
Relocating Data Directory Boot Service (DDBS)...
Autoflash POST?
POST/DIAG image located at sector LBA 0x00012048
Autoflash BIOS?
BIOS image located at sector LBA 0x00011048
EndTime: 10/20/2003 21:16:52


int13 - RESET (1) <- System BIOS using int13 reads the master boot
record (MBR) and loads it into memory. The system
BIOS then transfers the execution of the startup
process to the MBR. After the MBR loads a copy
of the active partition's boot sector into memory,
the boot sector code starts the operating system
as defined by the operating system.
DDBS: MDB read from both disks.
DDBS: Chassis and disk WWN seeds match.
DDBS: First disk is valid for boot.
DDBS: Second disk is valid for boot.
NT FLARE image (0x00400007) located at sector LBA 0x0002284B
Disk Set: 0 2 <- Found boot location
Total Sectors: 0x005821A1 <- Boot disk drive 0_0_0 information
Relative Sectors: 0x0000003F
Calculated mirror drive geometry:
Sectors: 63
Heads: 240
Cylinders: 382
Capacity: 5775840 sectors
Total Sectors: 0x005821A1 <- Boot disk drive 0_0_2 information
Sectors: 63
Heads: 240
Cylinders: 382
int13 - READ PARAMETERS (19)
int13 - DRIVE TYPE (59)
Error : Invalid Drive ID - 0x81
int13 - CHECK EXTENSIONS PRESENT (63)
int13 - GET DRIVE PARAMETERS (Extended) (64)
int13 - READ PARAMETERS (1744) <- NT load continues and is being handed over to the
hba driver. This number shown will not be the same
in all cases. What follows is the unseen sequence of
an NT boot process.
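
The mirror drive geometry reported in the output above can be cross-checked with simple arithmetic. The Python sketch below uses only values copied from that listing; the variable names are chosen for illustration. For this sample, the capacity equals sectors x heads x cylinders, and it also equals Total Sectors plus Relative Sectors.

```python
# Values copied from the console output above.
SECTORS_PER_TRACK = 63
HEADS = 240
CYLINDERS = 382
TOTAL_SECTORS = 0x005821A1
RELATIVE_SECTORS = 0x0000003F

# Capacity from the calculated cylinder/head/sector geometry.
capacity = SECTORS_PER_TRACK * HEADS * CYLINDERS
print(f"Capacity from geometry: {capacity} sectors")
print(f"Total + relative sectors: {TOTAL_SECTORS + RELATIVE_SECTORS} sectors")
```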


Two layers of software are interacting during the NT boot process. The first layer to start is the
kernel layer. In simplified terms, here is what occurs. Due to NT design, the miniport
drivers come up first, but do NOT expose themselves to the fabric until instructed by user-space
software. This is so the WWN, which is dependent on the array SSN, can be set.

The reboot driver then checks its reboot count (registry). If the counter is >=3, a failure will be
reported to the Service Control Manager. No other drivers will be started because they depend
upon the reboot driver. This prevents a bad component from causing a reboot loop.
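
The reboot counter acts as a guard against reboot loops. The Python sketch below is a simplified model: a plain dictionary stands in for the NT registry, the key name "RebootCount" is borrowed from the registry entry named later in this section, and the check-before-increment order is an assumption, chosen because it yields three normal attempts followed by a refused fourth.

```python
REBOOT_LIMIT = 3  # threshold stated in the text


def reboot_driver_start(registry):
    """Return True if dependent drivers may start, False if the limit is hit.

    `registry` is a dict standing in for the NT registry (an assumption).
    """
    count = registry.get("RebootCount", 0)
    if count >= REBOOT_LIMIT:
        return False  # failure reported; dependent drivers do not start
    registry["RebootCount"] = count + 1
    return True


def boot_succeeded(registry):
    """Model of the counter being reset after a good boot."""
    registry["RebootCount"] = 0


# Four consecutive boots that never reach the reset.
registry = {}
results = [reboot_driver_start(registry) for _ in range(4)]
print(results)
```

A boot that completes normally would call the reset, so the limit is only reached when several boots in a row fail.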

The drivers dependent upon the miniport driver start up next. These are the scsitarg, CMI and
SMIScd drivers. Scsitarg reads miniport WWN from the registry and sets the WWNs of the
miniports, which can then be enabled. SCSITarg does not yet allow I/O (it returns busy). This
is followed by drivers dependent upon CMI, which are disktarg, MPS, DLS and Flare.
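
The driver dependencies stated in the paragraph above form a small dependency graph, and a valid start order can be derived from it with a topological sort. The Python sketch below models only those stated dependencies; the lowercase driver names are used as plain labels, and the standard library graphlib module (Python 3.9 or later) does the ordering.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Driver -> set of drivers it depends on, as stated in the paragraph above.
depends_on = {
    "scsitarg": {"miniport"},
    "cmi": {"miniport"},
    "smiscd": {"miniport"},
    "disktarg": {"cmi"},
    "mps": {"cmi"},
    "dls": {"cmi"},
    "flare": {"cmi"},
}

# static_order() yields each driver only after everything it depends on.
start_order = list(TopologicalSorter(depends_on).static_order())
print(start_order)
```

The exact order among drivers at the same level is not fixed; only the constraint that a driver starts after its dependencies is guaranteed.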

User-space processes are started next:

PPP, eventlog, etc. are started as part of the OS.
KTCons (tracing) is an anomaly: it is not controlled by Governor.
K10Governor starts. It has a list of processes which will:
o Check miniport WWNs vs. Array SSN
o Check installed SW, and make sure it all is working
o Tell scsitarg to drop the gate and allow IO
o Start external admin services (Navisphere)

K10Governor (NT Service autostart)

NDUApp
1) Create DeviceMap object, rebuild map, report if successful. Set registry flag to
IOInhibited if not.

NewSP
1) Check installed SW vs PSM.
2) Check ArraySSN, use it to generate miniport WWNs.
3) Reboot, if 1 or 2.

Registry
RebootCount
Inhibited Mode State
Degraded Mode State
(The diagram also carries the label "Set by RebootDriver".)

DumpMgr
1) Look for dump, copy, report.

NDUMon
1) Check reboot count in Registry, reset, or set degraded.
2) Check IOInhibited flag (set by NDUApp), if OK, tell
hostside software to allow IO. Else log the failure.
3) Wait for NDU requests.

K10_DGSSP
1) Get all events from NVRAM and log in NT Event Log
2) Clear NVRAM event Log
3) Poll with high frequency for new events.

MessageDispatcher
1) Ping on MPS channel and wait for peer to respond.
2) After handshake, ping and detect peer death. Set named event.

Navisphere
1) Poll Array. Redirector will read Degraded Mode State from Registry.


The basic boot sequence

REBOOT driver begins and checks the reboot count.
NT EVENT log starts
SPID checks the id of the SP that is booting
NTMIRROR driver begins
DLS driver begins (distributed lock service)
DLU driver begins (disk logical unit)
SCSITARG starts and claims the ports for the FE (frontend) and CMI
CMISCD driver begins
CMI driver establishes contact with its peer
BE (backend) starts
PSM driver begins
DISKTARG begins
SCSITARG activates the TDD (target disk driver) allowing flare to communicate with NT
SAFETYNET starts and it then starts the K10governor
NEWSP begins and runs the ndu sync process
NDUAPP begins
DUMPMANAGER begins
NDUMON begins and will unquiesce the frontend (allow host log in) if all is okay. If there is
a problem, it will skip the unquiesce. This starts the reboot count and three more
reboots will be attempted. On the fourth reboot, the SP will come up in a degraded
mode with no drivers started.
NDUMON also will check the PSM for the ndu-cache-settings
MESSAGE DISPATCHER begins
SCSITARG starts the FE (frontend) if it has received a good status from ndumon
LOCKWATCH begins
KTCONSERVICE starts
K10GOVERNOR process count checked
K10_DGSSP begins
NAVISPHERE AGENT (sp agent) starts

What you will see in the SP event log

SP Shutting down

Timestamp (1776)The Event log service was stopped. EventLog

SP Starting up

Timestamp (71200002)Compiled at Aug 19 2003. Reboot
Timestamp (1779)Microsoft (R) Windows NT (R) 4.0 1381
Service Pack 5 Uniprocessor Free. EventLog
Timestamp (1775)The Event log service was started. EventLog
Timestamp (71200006)Current (incremented) reboot count is 1. Reboot
Timestamp (71200007)Found package Base02.05.1.40.5.008. Reboot
Timestamp (71200003)DriverEntry() returned 0. Reboot
Timestamp (71190002)My SP ID is 0x3f23209060010650:0, signature is 0xca4c0. spid
Timestamp (71320002)Compiled on Aug 19 2003. SMBus
Timestamp (71320003)DriverEntry() returned 0. SMBus
Timestamp (7124000f)NT Mirror Driver Compiled on Aug 19 2003 12:21:43
Free (Retail) Build 02_05_08. ntmirror
Timestamp (71240014)Creating root partition
\Device\Harddisk0\Partition0 P=0 S=2. ntmirror



Timestamp (71240016)Internal information only.
Unit State: ENABLED P=READY (3) S=READY (3). ntmirror
Timestamp (71240014)Creating root partition
\Device\Utility\UtilityPartition1 P=1 S=3. ntmirror
Timestamp (71240016)Internal information only.
Unit State: ENABLED P=READY (3) S=READY (3). ntmirror
Timestamp (71240010)DriverEntry() exiting with status 0. ntmirror
Timestamp (71110002)Compiled on Aug 19 2003 at 11:50:13, Free (Retail) Build. dls
Timestamp (71110003)DriverEntry() returned 0. dls
Timestamp (71120002)Compiled on Aug 19 2003 at 11:50:24, Free (Retail) Build. dlu
Timestamp (71120003)DriverEntry() returned at 0. dlu
Timestamp (71170000)ScsiTarg (TCD) starting. scsitarg
Timestamp (71170002)TCD0 claimed LogPort 1 for FE. scsitarg
Timestamp (71170002)TCD1 claimed LogPort 0 for FE. scsitarg
Timestamp (71170002)TCD2 claimed LogPort 3 for CMI. scsitarg
Timestamp (71170002)TCD3 claimed LogPort 2 for CMI. scsitarg
Timestamp (71230002)Compiled on Aug 19 2003 at 11:48:36, Free (Retail) Build. cmiscd
Timestamp (71170003)CMI linked with ScsiTarg. scsitarg
Timestamp (71230003)DriverEntry() returned 0. cmiscd
Timestamp (3) User configuration data for parameter COM1 overriding
firmware configuration data. serial
Timestamp (71180002)Calling DriverEntry(). cmi
Timestamp (71180003)My SP ID is 3f23209060010650:0. cmi
Timestamp (71180004)Heartbeat interval is 10 1/10-second ticks. cmi
Timestamp (71180005)Peer SP timeout interval is 100 1/10-second ticks. cmi
Timestamp (71180006)Remote SP timeout interval is 100 1/10-second ticks. cmi
Timestamp (71180009)CMI Transport Device 0: 0 gate(s) found. cmi
Timestamp (71180009)CMI Transport Device 1: 0 gate(s) found. cmi
Timestamp SP A (63f) Resume PROM information was read successfully. [0x00] 0 403
Timestamp Enclosure 0 SPS A (698) Battery Testing In Progress [0x00] 0 80
Timestamp (71150005)Read and processed default Persistent Container
\Device\CLARiiON_PSM psm
Timestamp (71150003)DriverEntry() returned 0. psm
Timestamp (71160000)DiskTarg (TDD) starting. disktarg
Timestamp (71170003)TDD linked with ScsiTarg. scsitarg
Timestamp (71170004)TDD activated with ScsiTarg. scsitarg
Timestamp (12f530)Safety net starting SafetyNet
Timestamp (12f530)Starting K10Governor SafetyNet
Timestamp (1b72)The following boot-start or system-start driver(s)
failed to load: atapi Hpt366 Service Control Manager
Timestamp (41000000)Starting service: K10Governor K10Governor
Timestamp (41000001)K10Monitor process started, executable = K10Monitor. K10Governor
Timestamp (71510000)Informational message. File: newSP.cpp
Line: 998 Details: Starting. newSP
Timestamp (76000001)newSP inhibits I/O. newSP
Timestamp (71510000)Informational message. File: K10NDUAdmin.cpp
Line: 496 Details: Processing sync NDU
Timestamp (71510000)Informational message. File: K10NDUAdmin.cpp
Line: 509 Details: Completed sync NDU
Timestamp (71510000)Informational message. File: newSP.cpp
Line: 1302 Details: Normal Exit. newSP
Timestamp (41000002)Starting NduApp
Timestamp (40000001)NduApp normal exit. NduApp
Timestamp (41000100)DumpManager started DumpManager
Timestamp (41000101)No new dump found DumpManager
Timestamp (71510000)Informational message. File: NDUmon.cpp
Line: 1142 Details: NDUMon starting ndumon
Timestamp (71510000)Informational message. File: NDUmon.cpp
Line: 1201 Details: SP Unquiesce succeeded ndumon


Timestamp (71510000)Informational message. File: NDUmon.cpp Line: 1263
Details: PSM file ndu-cache-settings does not exist, skipping cache restoration ndumon
Timestamp (40000001)Message Dispatcher has started MessageDispatcher
Timestamp (71170009)Fibre Channel loop up on logical port 1 scsitarg
Timestamp (71170008)Fibre Channel loop down on logical port 1. scsitarg
Timestamp (71170008)Fibre Channel loop down on logical port 0. scsitarg
Timestamp (41000300)LockWatch started LockWatch
Timestamp (71214000)ktconsService log: Waiting for signal from the Governor
to take ktrace dump. ktconsService
Timestamp (41000001)All processes started, process count = 11. K10Governor
Timestamp (76000100)K10_DGSSP Starting K10_DGSSP
Timestamp (1) Navisphere Agent, version 6.5.0.3.7, has started Navisphere Agent
Timestamp (2000)Application Starting Up
Timestamp (4700)'10.5.43.206' was managed successfully.
Timestamp Enclosure 0 SPS A (637) SPS Recharging [0x00] 0 0

Note: There may be other events unrelated to the boot process displayed. The above list is only
a sample representation.
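
When reviewing a long SP event log, it can help to confirm that key boot milestones appear in the expected order. The Python sketch below is a minimal example under stated assumptions: it assumes each line has the form "Timestamp (code)message" as in the sample above, the helper names are invented, and lines in other formats (for example the enclosure and SPS entries) are simply skipped.

```python
import re

# Matches lines of the form "Timestamp (71200006)message text source".
EVENT_RE = re.compile(r"^Timestamp \(([0-9A-Fa-f]+)\)\s*(.*)$")

# A few lines copied from the sample log above.
SAMPLE = [
    "Timestamp (71200006)Current (incremented) reboot count is 1. Reboot",
    "Timestamp (71170000)ScsiTarg (TCD) starting. scsitarg",
    "Timestamp (41000000)Starting service: K10Governor K10Governor",
    "Timestamp (76000001)newSP inhibits I/O. newSP",
    "Timestamp (1) Navisphere Agent, version 6.5.0.3.7, has started Navisphere Agent",
]


def parse_event(line):
    """Return (event_code, text) for a matching line, else None."""
    match = EVENT_RE.match(line)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)


def milestones_in_order(lines, milestones):
    """True if every milestone substring appears, in the given order."""
    position = 0
    for wanted in milestones:
        for i in range(position, len(lines)):
            event = parse_event(lines[i])
            if event and wanted in event[1]:
                position = i + 1
                break
        else:
            return False  # milestone not found after the previous one
    return True


print(milestones_in_order(SAMPLE, ["reboot count", "K10Governor", "Navisphere Agent"]))
```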

What you will see in the ktrace_usr file (see sp_collect files)

> !ktrace -T -r user
rtc_freq 799860000
ti_slot 135; ti_size 4096; ti_cirbuf 0x80baf000; ti_altbuf 0x80baf000
Boot 2003/10/21 07:37:17.187 stamp 0039ccd6a9
DATE: 2003/10/21 07:38:02.565
07:38:02.565 0 81ec72e0 NDU: Found package Navisphere
07:38:02.600 34309 81ec72e0 NDU: Found package SANCopyUI
07:38:02.637 37410 81ec72e0 NDU: Found package SnapCloneProvider
07:38:02.679 42148 81ec72e0 NDU: Found package SnapViewUI
07:38:02.742 62861 81ec72e0 NDU: Exit ToC::Mirror() 0
07:38:02.747 5044 81ec72e0 NDU: Clearing Autorevert Flag
07:38:02.903 155419 81ec72e0 NDU: Synchronizing ToC
07:38:02.904 1754 81ec72e0 NDU: Synchronize complete
07:38:02.905 1093 81ec72e0 NDU: Dropping lock
07:38:02.918 12728 81ec72e0 NDU: SP::sync no reboot required
07:38:02.918 37 81ec72e0 newSP: newSP sync complete
07:38:02.919 367 81ec72e0 newSP: Calling TerminateThread to cancel HangTimer : 3c
07:38:02.919 35 81ec72e0 newSP: Hang Timer canceled
07:38:02.920 919 81ec72e0 newSP: newSP normal exit.
07:38:03.992 1072054 81ec0020 NduApp: FlareData mutex count inc 1
07:38:04.005 13712 81ec0020 NduApp: FlareData mutex count dec 0
07:38:04.006 208 81ec0020 NduApp: FlareData mutex count inc 1
07:38:04.006 57 81ec0020 NduApp: Wait on mutex
07:38:04.006 37 81ec0020 NduApp: Got Devmap mutex
07:38:04.107 100937 81ec0020 NduApp: release mutex
07:38:04.107 41 81ec0020 NduApp: FlareData mutex count dec 0
07:38:07.394 3287056 81ea7480 ndumon: NDUMon starting
07:38:07.396 2646 81ea7480 ndumon: Degraded mode 0
07:38:07.396 128 81ea7480 ndumon: IO Inhibit 0
07:38:07.397 102 81ea7480 ndumon: Disk is partitioned correctly
07:38:07.397 30 81ea7480 ndumon: Checking NDU status
07:38:07.409 12648 81ea7480 ndumon: Scheduling peer sync
07:38:07.409 184 81ea7480 ndumon: Clearing SafeRevision
07:38:07.410 252 81ea7480 ndumon: Unquiescing I/O

07:38:07.413 3566 81ea5020 ndumon: DelaySyncPeer waiting
07:38:07.420 7072 81ea7480 ndumon: Pre-unquiesce device map build
07:38:07.421 601 81ea7480 ndumon: FlareData mutex count inc 1
07:38:07.435 13834 81ea7480 ndumon: FlareData mutex count dec 0
07:38:07.435 205 81ea7480 ndumon: FlareData mutex count inc 1
07:38:07.435 58 81ea7480 ndumon: Wait on mutex
07:38:07.435 37 81ea7480 ndumon: Got Devmap mutex
07:38:07.458 22842 81ea7480 ndumon: release mutex
07:38:07.458 41 81ea7480 ndumon: FlareData mutex count dec 0
07:38:07.461 2763 81ea7480 ndumon: Unquiesce of K10AggDrvAdmin
07:38:07.463 2623 81ea7480 ndumon: Hostside unquiesce
07:38:07.463 163 81ea7480 ndumon: HostAdmin quiesce opcode 0
07:38:07.522 58147 81ea7480 ndumon: SP Unquiesce succeeded
07:38:07.560 38220 81ea7480 ndumon: PSM File OPEN FAILED 0x00000002 ndu-cache-settings
07:38:07.560 77 81ea7480 ndumon: PSM file ndu-cache-settings does not exist, skipping cache restoration 2
07:38:07.561 1310 81ea7480 ndumon: No post command pending
07:38:07.561 30 81ea7480 ndumon: Creating locks
07:38:07.562 820 81ea7480 ndumon: Creating server thread
07:38:07.562 154 81ea7480 ndumon: Wait for Termination Event
07:38:07.563 818 81ea5780 NDU: Starting server loop
07:38:07.874 310963 81ea8d40 MessageDispatch: #THREADI: Entering Run
07:39:07.406 59531896 81ea5020 ndumon: Acquiring operation lock
07:39:07.422 15844 81ea5020 ndumon: Releasing operation lock
07:39:07.422 482 81ea5020 ndumon: Synchronizing SP times
07:39:09.719 2296380 81ea5020 NDU: peer returned 0
07:39:09.719 456 81ea8960 MessageDispatch: #CXN (outg): SendPacket failed: 0x0000006d0
07:39:09.728 9131 81ea5020 NDU: Time difference of 15 seconds is within threshold of 60 seconds
07:39:09.728 89 81ea5020 ndumon: DelaySyncPeer running
07:39:09.728 31 81ea5020 ndumon: Acquiring peer sync lock
07:39:12.280 2551995 81ea5020 NDU: peer returned 0
07:39:12.280 37 81ea5020 ndumon: DSP sync peer returned 0
07:39:12.280 31 81ea5020 ndumon: Releasing peer sync lock
07:39:12.281 211 81ea8960 MessageDispatch: #CXN (outg): SendPacket failed: 0x0000006d1
07:39:12.281 679 81ea5020 ndumon: DelaySyncPeer quitting
08:29:22.700 -1284548591 81db3020 NaviCimom: PSM File OPEN FAILED 0x00000002 PersistenceProviderTOC

Note: There may be other events unrelated to the boot process displayed. The above list is only
a sample representation.
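
When reviewing a long capture like the one above, it can help to pull out only the entries that report a failure. The following is a minimal sketch, not an EMC tool. It assumes each entry has the layout "time, elapsed count, thread id, component: message", which is inferred from the sample and not documented here. The names LINE_RE, Entry, parse_line and failures are hypothetical. A flagged entry is only a candidate for review: in the sample, ndumon carries on normally after the PSM open failure for ndu-cache-settings.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed entry layout (inferred from the sample, not from EMC documentation):
#   HH:MM:SS.mmm <elapsed> <thread id> <component>: <message>
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<delta>-?\d+)\s+"
    r"(?P<thread>[0-9a-f]{8})\s+"
    r"(?P<component>\w+):\s+"
    r"(?P<message>.*)$"
)


class Entry(NamedTuple):
    time: str
    delta: int
    thread: str
    component: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Return the parsed entry, or None for lines that do not fit the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return Entry(m["time"], int(m["delta"]), m["thread"], m["component"], m["message"])


def failures(lines: List[str]) -> List[Entry]:
    """Keep only entries whose message mentions a failure (case-insensitive)."""
    found = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and "failed" in entry.message.lower():
            found.append(entry)
    return found


sample = [
    "DATE: 2003/10/21 07:38:02.565",
    "07:38:07.522 58147 81ea7480 ndumon: SP Unquiesce succeeded",
    "07:38:07.560 38220 81ea7480 ndumon: PSM File OPEN FAILED 0x00000002 ndu-cache-settings",
    "07:39:09.719 456 81ea8960 MessageDispatch: #CXN (outg): SendPacket failed: 0x0000006d0",
]

for entry in failures(sample):
    print(entry.time, entry.component, entry.message)
```

Header lines such as Boot and DATE do not match the assumed layout and are skipped without being reported as errors.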


A few hints for troubleshooting either an FC4700 or a CX-series array.

FC4700 - Watch the boot output on the VGA port, or attach to the serial port via Hyperterm, and look for
failures in BIOS or POST. When AC power is first applied to a DPE, the disk drives power up and spin up in a
specified sequence. The maximum delay is 48 seconds for the last drive to start spinning in a DPE, and
84 seconds for the last drive to start spinning in a DAE. The same delays occur when you insert a drive
while a DPE is powered up.

Status lights on the DPE and its CRUs indicate error conditions. These lights are visible outside the DPE.
Some lights are visible from the front, and some are visible from the back. The check status light is located
behind the SP fan pack. It is partially visible from the front if you look between the slats on the front panel.
If you have difficulty seeing these lights, simply remove the fan pack cover using appropriate methods
described in manuals.

LIGHT                   QUANTITY            COLOR   MEANING
Enclosure Address       2                   Green   ON - indicates enclosure address 0.
Disk Active             1 per disk module   Green   OFF - module slot is empty or contains a filler.
                                                    FLASHING (mostly off) - drive is powered up but not spinning;
                                                    this is a normal part of the spin-up sequence and occurs
                                                    during the spin-up delay of a disk drive slot.
                                                    FLASHING (at a constant rate) - disk drive is spinning up or
                                                    spinning down normally.
                                                    ON - drive is spinning but not handling any I/O activity
                                                    (the ready state).
                                                    FLASHING (mostly on) - disk drive is spinning and handling
                                                    I/O activity.
Disk Check              1 per disk slot     Amber   ON - disk module is faulty, or as an indication to remove
                                                    the disk module.
DPE Active              1                   Green   ON - DPE is powered up.
DPE Check               1                   Amber   ON - any fault condition exists. If the fault is not obvious
                                                    from another fault light on the front, look at the back of
                                                    the DPE.
SP Fan Pack Check       1                   Amber   ON - SP fan pack is faulty. Not visible with the fan pack
                                                    cover on.
SP Active               1 per SP            Green   ON - SP is operating normally.
                                                    FLASHING - firmware is being loaded.
SP Check                1 per SP            Amber   ON - an SP fault condition exists.
LAN Link                1 per SP            Green   ON - there is a valid Ethernet connection.
LAN Activity            1 per SP            Amber   BLINKING - blinks during Ethernet activity.
LCC Active              1 per LCC           Green   ON - LCC is powered up.
LCC Check               1 per LCC           Amber   ON - either the LCC or the FCAL connection is faulty.
Power Supply Active     1 per supply        Green   ON - power supply is operating.
Power Supply Check      1 per supply        Amber   ON - power supply is faulty or is not receiving AC line
                                                    voltage.
Cooling Check           1 per supply        Amber   FLASHING - multiple fans in the drive fan pack are faulty or
                                                    the drive fan pack is removed. The DPE powers down the SPs
                                                    and disk drives when the fault persists for more than about
                                                    two minutes.
Drive Fan Pack Check    1 per fan pack      Yellow  ON - a fan in the drive fan pack is faulty.

If the DPE Check light is on, you should look at the other Check lights to determine which CRU(s) are faulty.
If the Check light for a CRU remains on, replace the CRU as soon as possible.

If a CRU fails in a DPE, the DPE's high availability will be compromised until you replace the faulty CRU. The
write cache function (if any) will be disabled.


CX Series - Watch the serial port output via Hyperterm, since the VGA connection is no longer available.
Check the logs on the other SP, if available, for backend issues. Remember that the CX uses the backend fibre
to boot the SP, so it is important that you do not replace an SP for a boot problem without direction. Also
consider the cables from the SP to the DAEOS; a cable could be bad. (Check for the presence of
Amphenol type cables.)

CX600 Status Indications

CX600 Storage Processor (SP) Status Lights

LIGHT                   QUANTITY            COLOR   MEANING
BE 1, BE 0, AUX 0,      1 per port          Green   ON - indicates auxiliary or backend activity.
AUX 1 Link LEDs
LAN Link                1 per LAN port      Green   ON - there is a valid Ethernet connection.
LAN Activity            1 per LAN port      Amber   FLASHES - indicates LAN activity.
Power                   1 per SP            Green   ON - indicates +12 volt power.
Fault                   1 per SP            Amber   FLASHING - indications:
                                                      Once / 4 seconds - BIOS activity
                                                      Once / second - POST activity
                                                      Four / second - booting
                                                    STEADY - indicates a fault condition.
Link LEDs 0, 1, 2, 3    1 per port          Green   ON - indicates I/O with the host.
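
The Fault LED flash rates listed above can be written as a small lookup for anyone timing the LED by eye or from a video capture. This is an illustrative sketch only. The function name classify_fault_led and the 20 percent timing tolerance are assumptions. The rates themselves come from the table.

```python
def classify_fault_led(steady: bool, flashes_per_second: float = 0.0) -> str:
    """Map an observed CX600 SP Fault LED pattern to the state listed in the table."""
    if steady:
        return "fault condition"

    # Flash rates from the table, expressed in flashes per second.
    rates = {
        0.25: "BIOS activity",   # once every 4 seconds
        1.0: "POST activity",    # once per second
        4.0: "booting",          # four per second
    }
    for rate, meaning in rates.items():
        # Assumed tolerance: accept readings within 20% of the nominal rate.
        if abs(flashes_per_second - rate) <= 0.2 * rate:
            return meaning
    return "unknown"


print(classify_fault_led(False, 1.0))
print(classify_fault_led(True))
```

A reading that matches none of the listed rates returns "unknown" so that it is never mistaken for a normal boot stage.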



CX600 Power Supply Status Lights

LIGHT                   QUANTITY            COLOR   MEANING
Power Supply Active     1 per supply        Green   ON - power supply is operating.
Power Supply Fault      1 per supply        Amber   ON - the power supply is faulty, or one of the two is not
                                                    receiving AC line voltage.
                                                    FLASHING - system has been shut down due to a multiple fan
                                                    fault or ambient over-temperature.
SPS Active              1 per SPS           Green   ON - the SPS is ready and operating normally.
                                                    FLASHING - the SPS is recharging.
SPS On Battery          1 per SPS           Amber   ON - indicates the AC power line is no longer available and
                                                    the SPS is supplying DC output power from battery.
SPS Replace Battery     1 per SPS           Amber   ON - indicates the SPS battery pack can no longer support
                                                    loads. Replace the SPS as soon as possible.
SPS Fault               1 per SPS           Amber   ON - indicates the SPS has an internal fault. Replace the
                                                    SPS as soon as possible.

CX600 Status Lights



LIGHT                   QUANTITY            COLOR   MEANING
CX600 Power OK          1                   Green   ON - indicates the SPE is powered up.
CX600 System Fault      1                   Amber   ON - any fault condition exists. If the fault is not obvious
                                                    from another fault light on the front, look at the back.

If the System Fault LED is on, you should look at the other Status LEDs to identify the faulty FRU(s). If the
Status LED for a FRU remains on, replace the FRU as soon as possible.

If a FRU fails in a CX600 SPE, the write cache function is disabled and high availability is compromised until
you replace the faulty FRU.

Each fan module includes one amber cooling check (fan fault) LED that indicates a faulty module. These
lights are visible with the front bezel removed.

CX400 and CX200 Status Indications

See the following manuals:

CX400-Series Hardware Reference 014003049-Axx
CX200-Series Initialization Guide 014003117-Axx

NOTE:

For any boot issues or power up issues, do not consider that re-imaging the system is
the proper or correct step to take. Consider all other possibilities before performing a
re-image of the base operating system.


Utility Partition

The Utility Partition is a tool, available starting with Release 11 code, that is used to re-image SPs, reset
SPs to a factory-fresh state, and perform conversions. To enter the utility menu, attach a serial cable to the
storage processor and make a HyperTerminal connection. Reboot the storage processor and, when you see the
FLARE POST testing (ABC..), strike the ESC key. FLARE will stop with an error, at which point you type
DB_key. A diagnostic menu will then appear. See EMC document CLAR-PSP-078, Recovering a Boot
Image on a CX System Using Recovery Drives or the CLARiiON Utility Partition, for more detailed
information.

Diagnostic Menu
1) Reset Controller 3) DDBS Service Sub-Menu
2) Display Warnings/Errors 4) FCC Boot Sub-Menu

DDBS Service Sub-Menu
1) Drive Slot ID Check 2) Utility Partition Boot
0) Exit

Which Back End Loop?
0 - BE Loop 0
1 - BE Loop 1
Enter number (0-1) [0]: 0

"FCDMTL 0 [2.4.1] DVM address IDs will be shown"
"Device Events will be shown"
"FCDMTL 0 [2.4.1] DVM address IDs will be shown"
"Device Events will be shown"
"Targets found will be shown and their state"
Drive Slot Check Report for Back End Loop 0
-------------------------------------------
LOOP: 0
Summary:
Total Disks in the Correct Slots: 30
Total Disks in the WRONG Slots: 0
Total Slots Checked: 30

DDBS Service Sub-Menu
1) Drive Slot ID Check 2) Utility Partition Boot
0) Exit

int13 - RESET (1)

FCDMTL 1 [2.4.1] DVM Duplicate address id already in list: EF
Device Event (0xEF): 0x00030012, tach_ptr: 0x08491854
DL waited 1s for discovery

Target 0 is online
Target 1 is online
Target 2 is online
Target 3 is online
Target 4 is online


DDBS: MDB read from both disks.
DDBS: Chassis and disk WWN seeds match.
DDBS: First disk is valid for boot.
DDBS: Second disk is valid for boot.

NT Utility image (0x0040000F) located at sector LBA 0x00BE804C
Disk Set: 1 3

Total Sectors: 0x0005FF61
Sectors: 63
Heads: 240
Cylinders: 26

Total Sectors: 0x0005FF61
Sectors: 63
Heads: 240
Cylinders: 26
Error : Invalid Drive ID - 0x81 -----------------------------this is normal
int13 - CHECK EXTENSIONS PRESENT (61)

int13 - GET DRIVE PARAMETERS (Extended) (62)

CLARiiON Utility Toolkit
(c) EMC Corporation 2001-2003 All Rights Reserved
DiagName: UtilityToolkit
DiagRev: 1.04.03
StartTime: 10/12/03 21:32:38
SPID.......................... Running
FCDMTL........................ Running
NTMIRROR...................... Running
ASIDC......................... Running
ASIRAMDISK.................... Running
ICA........................... Running
Connecting to ICA............. Success
SP Type....................... CX600
SP ID......................... A
Checking Disk 4............... Present
Searching for Image RepositoryFound Volume
Sizing Image Repository....... 1024 MB
Checking Image Repository..... Done
Sizing RAM Disk............... 381 MB
Checking LAN Port State....... Not Started
Checking LAN Port Config...... Not Found
Starting FTP Server........... Success
Loading Plugins............... Done
Finding incompatible images... Done

=========================================================
!!! WARNING !!!
=========================================================

Installing a Release 11 (02.04.X.XX.X.XXX) or earlier Recovery Image or Conversion Image on an array
running Release 12 (02.05.X.XX.X.XXX) or higher Core Array Software will result in permanent,
unrecoverable loss of configuration information and customer data.

The following images have been automatically removed from this array's Image Repository to prevent
accidental installation: SAN_Image-02.04.0.60.5.001.mif (SAN Image 02.04.0.60.5.001)

Have you read and understood the warning above? [y/N] : y          (Note that N is the default)
Checking for Upgrade Wizard...Not Found

EndTime: 10/12/03 21:32:51
Press the Enter key to continue
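
The warning above amounts to a comparison of revision prefixes: Release 11 images carry 02.04, Release 12 arrays carry 02.05. The sketch below only illustrates that rule and is not part of the Utility Toolkit. The function names are hypothetical, the array revision used in the example is invented for illustration, and the revision format is assumed to be dot-separated numeric fields as shown in the warning text.

```python
RELEASE_11 = (2, 4)   # 02.04.X.XX.X.XXX per the warning text
RELEASE_12 = (2, 5)   # 02.05.X.XX.X.XXX per the warning text


def release_prefix(revision: str) -> tuple:
    """Return the first two numeric fields of a revision such as '02.04.0.60.5.001'."""
    parts = revision.split(".")
    return int(parts[0]), int(parts[1])


def install_would_destroy_data(image_revision: str, array_revision: str) -> bool:
    """True when a Release 11 or earlier image targets a Release 12 or higher array."""
    return (release_prefix(image_revision) <= RELEASE_11
            and release_prefix(array_revision) >= RELEASE_12)


# Image revision taken from the capture above; the array revision is hypothetical.
print(install_would_destroy_data("02.04.0.60.5.001", "02.05.0.60.5.001"))
```

The toolkit already removes incompatible images from the repository automatically, so a check like this is useful mainly before copying an image to the array in the first place.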


=========================================================
CLARiiON Utility Toolkit Main Menu
=========================================================
1) About the Utility Toolkit
2) Reset Storage Processor
3) Wizard Sub-Menu
4) Image Repository Sub-Menu
5) Plugin Sub-Menu
6) Enable LAN Service Port
7) Enable Engineering Mode
8) Install Images
Enter Option:

=========================================================
CLARiiON Utility Toolkit Image Repository Menu
=========================================================
1) Back to the Main Menu
2) List Image Repository Contents
3) Delete Files from the Image Repository
4) Copy Files from the RAM Disk to the Image Repository
5) Copy Files from the Image Repository to the RAM Disk
Enter Option:

FCC Boot Sub-Menu
1) Restore Def Port Settings 4) BE1 FCC Boot
2) Display Port Settings 5) AUX0 FCC Boot
3) BE0 FCC Boot 6) AUX1 FCC Boot
0) Exit

PORT SETTINGS
Port B/E WWN Port WWN Primary
Num B/E WWN Node Name::Bus:Dev:Func WWN Secondary Port Settings
-------------------------------------------------------------------------------------------------------------------
000 00000000:00000000 BE0 FCC::02:04:01 00000000:00000000 2Gb, ENA, WWN
00000000:00000000 00000000:00000000
00000000:00000000 00000000:00000000
001 00000000:00000000 BE1 FCC::02:04:00 00000000:00000000 2Gb, ENA, WWN
00000000:00000000 00000000:00000000
00000000:00000000 00000000:00000000
002 00000000:00000000 AUX0 FCC::01:04:01 00000000:00000000 2Gb, ENA, WWN
00000000:00000000 00000000:00000000
00000000:00000000 00000000:00000000
003 00000000:00000000 AUX1 FCC::01:06:01 00000000:00000000 2Gb, ENA, WWN
00000000:00000000 00000000:00000000
00000000:00000000 00000000:00000000

Diagnostic Menu
1) Reset Controller 3) DDBS Service Sub-Menu
2) Display Warnings/Errors 4) FCC Boot Sub-Menu

Requesting System Reset
Copyright 1985-2001 Phoenix Technologies Ltd.
All Rights Reserved
Copyright 1999-2002 by EMC Corporation, All Rights Reserved.
EMC BIOS Release 3.26
CPU = 2 Intel(R) XEON(TM) CPU 2.00GHz
637K System RAM Passed
power up messages will continue.



Unmanaged SPs

Is the customer data still accessible from the hosts?

If yes, DO NOT restart the K10 governor or reboot the SP in any way. Instead:
Try the navicli getagent command.
See if the cimom is the source of the problem.
Try pinging the SP on the customer network.
Failing those, try to establish a PPP connection to the serial port.

If the customer cannot access data on the fabric, then the SP may in fact be hung.

Try the NMI switch to see if a dump can be obtained.
Connect to the serial port via Hyperterm to watch for a reboot.
If there is no response from the NMI, then a reset on the FC4700 would be in order.
On a CX series, you will need to reseat the SP.
In either case, collect all logs to attempt to determine the cause.

If this is the first instance of a hang, keep all information from this failure on hand.
If a second hang of the same type occurs, then an SP replacement may be in order.
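
The triage above branches on two facts: whether hosts can still reach their data, and which platform is involved. A minimal sketch of that branching follows, intended only as a memory aid. The function name unmanaged_sp_steps and the platform labels "FC4700" and "CX" are assumptions made for the example. The step wording is taken from the list above.

```python
from typing import List


def unmanaged_sp_steps(data_accessible: bool, platform: str) -> List[str]:
    """Return the ordered triage steps for an unmanaged SP."""
    if platform not in ("FC4700", "CX"):
        raise ValueError("platform must be 'FC4700' or 'CX'")

    if data_accessible:
        # The SP is still serving I/O: investigate without disturbing it.
        return [
            "Do NOT restart the K10 governor or reboot the SP",
            "Try the navicli getagent command",
            "Check whether the cimom is the source of the problem",
            "Ping the SP on the customer network",
            "Failing those, establish a PPP connection to the serial port",
        ]

    # Hosts cannot reach data: the SP may be hung.
    steps = [
        "Press the NMI switch to try to obtain a dump",
        "Connect to the serial port via Hyperterm and watch for a reboot",
    ]
    if platform == "FC4700":
        steps.append("If the NMI gets no response, reset the SP")
    else:
        steps.append("If the NMI gets no response, reseat the SP")
    steps.append("Collect all logs to determine the cause")
    return steps


for step in unmanaged_sp_steps(data_accessible=False, platform="CX"):
    print(step)
```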



SP Failures

FC4700

Hangs - Unmanaged (covered later)
Hard hang - no response from any attempt to communicate (Navi, ping, PPP, NMI)

Misconceptions for SP Failures
NDU failures that happen when starting with no faults.
Unmanaged SP (Almost always)
Panics (especially when layered products are involved)

Real SP Failures
Memory errors
IDE faults and panics - A message that the internal drive is corrupt is not an IDE failure
and as such is not an SP failure but an NT issue.
Boot failures (watch the VGA and serial port during power up to determine the fault)



CX Series

Hangs - Unmanaged (covered later)
Hard hang - no response from any attempt to communicate (Navi, ping, PPP, NMI)

NMI switch is located at the dot being pointed to by the arrow.

Misconceptions for SP Failures
NDU failures that happen when starting with no faults.
Unmanaged SP - (Almost always)
Panics - (especially when layered products are involved)
SP will not reboot - (Watch the power up via the serial port for actual SP
failures that would require an SP replacement).
Real SP Failures
Memory errors
See above for the limited boot failures caused by a failed SP.
Almost every instance of a boot failure is caused by a backend failure or some
misguided troubleshooting step taken.

Replacing SPs (Don't)

FC4700 - All software needs to be loaded on the new SP by the ndumon process.
This can and will take several reboots.

All logs and troubleshooting information will go with the SP that was replaced.
Save it in case this information is needed.

If an SP is replaced the SP that was inserted can NEVER be put back into stock.
It MUST be returned to the repair center to be reimaged.

CX Series - All software and logs remain with the image on the NT drives for the SP.
SPs that were replaced and then removed can be returned to stock.

Panics

Check panic against list of known panics to see if a fix has been identified

Check Primus and DIMs (internal to EMC only) for any instances of the same panic to see if a
solution exists or if the dump requires submittal for further collateral information.

NOTE: Always refer to the latest available document.

The goal of this Support Procedure is to reduce the number of CLARiiON Storage Processors
replaced unnecessarily. There are numerous failure modes in a CLARiiON array that appear to
indicate a faulty Storage Processor. Many of these failure modes are related to software faults or
other components in the array and may incorrectly appear to be Storage Processor failures.
Proper service action requires careful diagnosis before replacing a Storage Processor. If you
have any question about the advisability or need to replace a Storage Processor, please
contact the Call Center.

This document will cover the replacement of an SP in a FC4700 and a CX-series array and will
note when there are differences. The table at the end of this document lists several resources
that offer direction when deciding if an SP replacement is necessary.

When is it not OK to remove or replace a Storage Processor (SP)?

If the SP considered for replacement is servicing active I/O to a LUN.
If the other SP does not appear healthy.
If there is a second problem on the array. Navisphere should indicate no problem other
than a single SP failure.
If an SP panics and the event log entry indicates an Internal SW Error. Never replace an
SP just because it has had a panic.
Simply because of infrequent single-bit ECC errors (see Primus case emc65498).
If the SP has been replaced recently for the same or similar symptom. You should
consider whether a second replacement is appropriate.
If it fails to boot. See clar-psp-093 to determine if there is a problem with the SP or with
the NT Image which the SP is trying to boot.

Identifying a BAD SP

Unmanaged SP (U over a single SP icon) - Regardless of what Navisphere indicates
regarding the Storage Processor, you should always verify if the Storage Processor is still
handling I/O from attached servers to their LUNs via the Storage Processor in question.

Navisphere could indicate that an SP is Unmanaged when it is running or when it has stopped.
You must determine if the SP is running I/O or just not currently being managed via Navisphere.

1. Monitor I/O activity of LUNs owned by that SP, from the other SP.
2. Are any LUNs trespassed to the other SP?
3. Has PowerPath failed over because the server cannot use the SP?
4. Can the SP LAN address be pinged from a server on the same subnet as the array?



If investigation proves that the SP is running I/O but Navisphere can not manage it, there are
options other than replacing the SP. Reference Primus emc52543 in addition to this information.

If the SP does respond to a ping, this means that the OS is running on the SP.
If a navicli command of any type addressed to the LAN address of the SP yields a positive
response, this indicates that the Navisphere Agent on the array is running but the
Management Server on the SP may not be running. Restart the Management Server via a
connection to the /setup page over the SP serial connection. Example: 192.168.1.1/setup
If the SP does not answer a ping but I/O is running through the SP, check the cable
connection where the SP connects to the LAN.
If the SP does answer a ping but will not answer a navicli command, this means that the
OS on the SP is running but the SP agent is not. Call EMC/CLARiiON Support.
If the above steps do not work, a reboot of the SP may be required. See below for SP
reboot directions.
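
The ping and navicli observations above combine into a small decision table. The sketch below restates that table and is not an EMC utility. The function name diagnose_unmanaged_sp and its three boolean inputs are hypothetical. The results of ping, navicli and I/O monitoring are gathered by hand and passed in. The final branch draws on the guidance elsewhere in this section that an SP which appears hung should have a panic dump attempted before replacement is considered.

```python
def diagnose_unmanaged_sp(pings: bool, navicli_ok: bool, io_running: bool) -> str:
    """Interpret ping / navicli / I/O observations for an SP shown as Unmanaged."""
    if pings and navicli_ok:
        return ("Agent is running; the Management Server may not be. "
                "Restart the Management Server from the SP /setup page.")
    if pings and not navicli_ok:
        return ("OS is running but the SP agent is not. "
                "Call EMC/CLARiiON Support.")
    if io_running:
        # No ping response, yet the SP is still serving host I/O.
        return "SP is serving I/O; check the cable connecting the SP to the LAN."
    return ("No ping response and no I/O: the SP may be hung. "
            "Attempt to retrieve a panic dump before considering replacement.")


print(diagnose_unmanaged_sp(pings=True, navicli_ok=False, io_running=True))
```

A navicli response without a ping response is not expected in practice, so the sketch checks ping first and lets that combination fall through to the I/O test.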

When is restarting a Storage Processor recommended before replacement?

If an SP appears to be HUNG, it is advisable to attempt to retrieve a Panic dump from the SP.

FC4700 - The FC4700 has two buttons: an NMI button, which will cause a panic dump/reboot, and a
Reset button accessible through the air-dam. The Reset button will cause just a reboot.
CX array - The CX Series SPs have a RESET button accessible through the air-dam. This is
actually an NMI button which, when pushed, will cause a panic dump and a reboot. It could take up
to 45 minutes for the SP to respond.

If a Storage Processor is non-responsive after the use of the switches noted above, it is always
advisable to try to restart the Storage Processor before replacing it.

FC4700 & CX array - Never attempt to restart an SP by cycling power or by disconnecting
power cords; a cache dirty (data loss) condition may result.
FC4700 - Do not simply RESEAT an FC4700 SP to induce a reboot. The FC4700 has an IDE
drive which is VERY susceptible to damage if physically removed from its slot and reseated in
order to cause a reboot. For FC4700 Storage Processors (SP) always use the NMI button to
attempt to restart the SP. It could take up to 45 minutes for the SP to respond. If the NMI
restart does not work try using the Reset button before replacement.
CX array - The CX Series SP can also be Removed and re-inserted to cause a complete power-
up and reboot of that SP.

When removing an FC4700 SP that will be replaced, special care should be taken to ensure that
the IDE drive is not damaged in the process.

Press the Reset switch and wait approximately 10 seconds, then remove the SP. Waiting 10
seconds allows time for the heads of the IDE drive to land properly before the jerky motion of an SP
removal. If you wait too long (1 minute) after pressing Reset, the drive heads will be out of the
landing area and damage could result if the SP is moved violently.

Never Re-use an SP from another FC4700. Once an SP is inserted and a boot sequence has
begun, the FC4700 SP has taken on properties of the Array it was plugged into. Any future use in
another array is completely unpredictable. An SP that is inserted into a running array must
remain in THAT array or be returned to the factory to be re-imaged. See Primus case emc71665.




Commonly used tools

ktcons (K10 trace console, k10 is a codename)

This is a tool that can be used to examine the KTRACE buffers (engineering level info) on an SP.
Information relating directly to flare can be obtained and engineering level commands can be
performed. It is accessed and executed through a SymmRemote session directly into the SP. This tool
can be run remotely against the SP or locally while connected to the storage processor.

Caution must be taken when using this tool. Use under direction of Technical Support only.

C:\>ktcons

Remote IP address is required

USAGE:
ktcons -h [-i <invocationType>] [-p tcpPort] [-r remoteHost][-d <bitMask>|s ] [-s <a|d>] [-n]

where:
-h: Display Help // display this text
-i: Invocation Type {l|L|r|R|s|S} // local/remote/service
-t: tcpPort // unused port number
-r: Remote_HostName // name or IP address
-f: sourceFileName // initial source file name
-d: Debug Level 1,2,4, s // init/data transfer/timing mask bits;
// s - Signal ktcons to take a dump of ktrace buffer
-s: Service {a|d} // add/delete service
-q: Queue Mode // ktcons starts and runs in queue mode.
// When signaled by K10Governor, it takes a dump of
// ktrace buffer.
-n: // Do not Reconnect, when connection is lost

if -i omitted and running as ktconsService.exe -- run as service (-s)
if -i omitted and running as ktcons.exe -- run as remote observer (-r)
if -il or -iL specified, run on target as server and observer

-t is TCP/IP port number used by KtCon. Server gets default from registry
Observer gets value from command line or uses KTCONS_DEF_TCP_PORT
-r is TCP/IP address of the server. No default
-s valid if run as ktconsService.exe and add/deletes it as a service.



psmtool (persistent storage manager tool)

psmtool is used for accessing information related to the PSM data areas. Information relating directly
to flare can be obtained and engineering level commands can be performed. It is accessed and executed
through a SymmRemote session directly into the SP. Basic commands are list, show and del.


C:\>psmtool

Usage: psmtool op ...

put file dataArea
get dataArea file
del dataArea
list
show dataArea
status
enum
layout



flarecons (flare console)

This is an internal SP tool, available only to EMC personnel, that allows engineering access to the
fcli (flare cli) prompt. You can obtain information relating directly to flare and perform
engineering level commands. It is accessed and executed through a SymmRemote session directly into
the SP. The command to enter flarecons will be provided by Technical Support when
needed. This tool is used primarily for clearing resume PROMs, performing functions on the
vault LUN, etc.


fcli> ?

Notes: command full name/abbreviation - summary

clearlog/cl - Destroy contents of RAID storage controller's error log
access/acc - access -m [1 | 2]
eccerr/ecc - eccerr <-mb> [-bit [all | value]]
lrucmd/lru - lrucmd <-r | -w> [offset] [value]

getlog/l - returns specified portions of the storage processor unsolicited log
getwwn/gw - get current World Wide Name Seed
getdropevtcnt/gdec - get drop event messages count
getprominfo/gp - Displays the resume Prom information for a particular Device
lccupgrade/lcc - Controls and monitors the upgrading of the LCC firmware.
lccdebugcmd/ld - Issue a LCC Debug command to simulate faults on the specified enclosure

help/? - list all available commands with summary
lustat/ls - Logical Unit Status -- summary info for all LU's
setcache/c - modify cache configuration and state information
setdate/da - set the Storage Processor date and time
setdisk/di - Set disk configuration parameters
seterr/e - set/display periodic error reporting
setunit/u - sets unit parameters not associated with cache
spstat/sp - Show summary of various statistics/revisions
trespass/tr - trespass
zero_disk/zd - Initiate/abort factory-zeroing of disks



Admintool

Admintool is an SP-resident tool that provides a utility to handle LUs. This tool is primarily used for
clearing dirty cache. Use only at the direction of Technical Support.

C:\>admintool

== Main Menu ==
0: Exit
1: Test _____________
2: Recovery |
Selection[0]: 1 |
|
V
== Test Menu ==
0: Exit
1: Dump DeviceMap
2: Build DeviceMap
3: Test PSM
4: Dump TransactLog
5: Compare luns of StorageCentric and Flare (N/A)
6: List Raid Groups
7: CMI enumerate arrays
8: List physical arrays
Selection[0]:

== Main Menu ==
0: Exit
1. Test
2: Recovery _____________
Selection[0]: 2 V
== Recovery Menu ==
0: Exit
1: Clear TransactLog
2: Fix up transaction
3: Test Layered Driver
4: Scrub lu
5: SP Control
6: Fix DeviceMap (N/A)
7: Clear CacheDirty LU
8: Make Flare LUN Public
9: Execute Work List
Selection[0]: 5



Less commonly used tools

ktr (used for obtaining performance information)

How to enable and disable host-traffic tracing in SPs

1. Log in to the SP either directly or using the Symm-Remote Client.
2. Bring up a DOS command window.
3. Create a separate directory for your trace files. While in directory C:, give the command
mkdir tracefiles to create C:\tracefiles
4. To enable tracing, start rba by typing rba
5. At rba's prompt, enter the following if you want to create a trace file named mytrace.ktr
rba> -o \??\C:\tracefiles\mytrace.ktr -r traffic

This will open a tracefile in the named path for tracing host-traffic. Note carefully the \??\ at
the start of the path. This is necessary because the internal software needs this in order to
find the root directory. Also notice the -r traffic at the end of the command. All of the
commands you give to rba to control host-traffic tracing, should contain -r traffic.

6. Tracing is now enabled for this SP. From this point onward any host-originated I/Os
done through this SP will result in Trace Records being written to the internal buffers for
this file. Each internal buffer is 1 Megabyte long, enough for 32,768 Trace Records.
When full the buffer is physically written to the file.
7. Note that you can now quit rba by issuing a q command: rba> q You can re-enter
rba by typing rba again. Exiting and re-entering rba has no effect on the tracing. If
tracing has been enabled, it keeps going until you explicitly disable it as described below.
8. Ending your tracing and closing the trace file usually requires two steps:
rba> -f -r traffic
rba> -c -r traffic

The first command flushes the current (that is, final) one-megabyte buffer; the second
command actually closes the file. Note that if you don't care about the final records,
you do not need to give the first command above.

9. At the end of the above, you have a completed file named mytrace.ktr, but it is in the
SP's disk space. To get a copy down to a host computer, you can invoke File Transfer
from the FILE menu of Symm-Remote Client.

Once the file has been copied as in Step 9 above, you can use the ktrcutil utility to examine its
contents. This utility is available from the Performance Engineering group and can also extract
Trace Records, converting them to the traditional trace file format, thereby creating a file for
you that can be used with the existing Excel Trace Tools.
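
The buffer figures given in Step 6 imply a fixed record size and explain why the flush command in Step 8 matters. The short calculation below is a sketch under one assumption: "1 Megabyte" is taken to mean 2**20 bytes, which the text does not state. The function name split_records is hypothetical.

```python
BUFFER_BYTES = 1024 * 1024     # "1 Megabyte" per Step 6, assumed here to be 2**20 bytes
RECORDS_PER_BUFFER = 32_768    # Trace Records per buffer, from Step 6

# Size of one Trace Record implied by the two figures above.
RECORD_BYTES = BUFFER_BYTES // RECORDS_PER_BUFFER


def split_records(total_records: int) -> tuple:
    """Return (records already written to the file, records still in the buffer)."""
    full_buffers, pending = divmod(total_records, RECORDS_PER_BUFFER)
    return full_buffers * RECORDS_PER_BUFFER, pending


print(RECORD_BYTES)            # 32
print(split_records(100_000))  # (98304, 1696)
```

Under that assumption each Trace Record is 32 bytes. After 100,000 records, 98,304 are already in the file and 1,696 are still buffered. Those buffered records are the "final records" that are lost if the file is closed without the flush command.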

luntool

luntool is an SP-resident tool that provides a utility that operates on LUs and Admin libs. It supports
the list, add and remove commands. As with many of the internal SP utilities/tools, use
only at the direction of Technical Support.



hostconfcli - provides the ability to perform various configuration options

C:\>hostconfcli

Host Configuration CLI menu
CX Series - Jul 27 2003
0 - Exit.
1 - System Options Menu.
2 - Port Menu.
3 - XLU Menu.
4 - Virtual Array Menu.
5 - Initiator Menu.
6 - Engineering Menu.
7 - Statistics Menu.
8 - HostConfCLI Display Options Menu.

Selection (0 - 8) [0] in decimal:

hfon/hfoff
Setting this to hands free off (hfoff) will cause the SP to boot without the drivers. You must set this
back to hfon after completing your work, as the setting will survive a power cycle.

flarestart.bat
Used to start the drivers after you have come up in hfoff mode.

getspids

C:\>getspids

K10 -- User-space Message Passing Service (UMps)
[Checked (Debug) Build] Compiled: May 15 2003 01:17:26

Array 0 % Success Sec/IO
-------------------------------------------------------------------------
* 9203608060010650:0 [0009c773]
9203608060010650:1 [0009b5c5] 100.00% 0.00017



END OF SECTION FIVE
