Brocade SAN Switch Troubleshooting

Description : 5 MINUTES INITIAL TROUBLESHOOTING ON BROCADE
Platform    : SAN SWITCH
Model       : BROCADE
Prepared By : F. MOHAIDEEN ABDUL KADER
For basic hardware troubleshooting on a Brocade switch, use the commands below. Each command is
explained briefly and followed by an example that shows how to identify an issue with the switch.
1. switchstatusshow
2. switchshow
3. sfpshow
4. porterrshow
BASIC TROUBLESHOOTING
COMMAND : switchstatusshow
EXPLANATION :
Use this command to display the overall status for a switch. In addition, users with a Fabric Watch license
are able to view a listing of unhealthy ports that includes the port index number, the port name, and the
port status.
This command displays the following information: the overall switch status, and the status of the following
contributors:
Report Time
Switch Name
IP address
Switch State: HEALTHY, MARGINAL, or DOWN
Duration: hours and minutes (HH:MM) the switch has been in the current state
Power supplies
Temperatures
Fans
WWN servers (dual-CP systems only)
Standby CP (dual-CP systems only with HA enabled)
Blades (bladed systems only)
Flash
Marginal ports
Faulty ports
Error Ports
Status values are HEALTHY, MARGINAL, or DOWN, depending on whether thresholds established by
switchStatusPolicySet have been exceeded. The overall status is based on the most severe status of
all contributors.
EXAMPLES :
Admin> switchstatusshow
Switch Health Report
Report time: 06/20/2013 06:19:17 AM
Switch Name: Sydney_ILAB_DCX-4S_LS128
IP address: 10.129.2.143
SwitchState: MARGINAL
Duration: 214:29
Power supplies monitor   MARGINAL
Temperatures monitor     HEALTHY
Fans monitor             HEALTHY
WWN servers monitor      HEALTHY
CP monitor               HEALTHY
Blades monitor           HEALTHY
Core Blades monitor      HEALTHY
Flash monitor            HEALTHY
Marginal ports monitor   HEALTHY
Faulty ports monitor     HEALTHY
Missing SFPs monitor     HEALTHY
Error ports monitor      HEALTHY
All ports are healthy
======
switch:user> switchstatusshow
Switch Health Report
Report time: 03/12/2011 08:48:00 PM
Switch Name: ras220
IP address: 10.20.10.220
SwitchState: MARGINAL
Duration: 47:42
Power supplies monitor   HEALTHY
Temperatures monitor     HEALTHY
Fans monitor             HEALTHY
Flash monitor            MARGINAL
Marginal ports monitor   HEALTHY
Faulty ports monitor     HEALTHY
Missing SFPs monitor     HEALTHY
Error ports monitor      HEALTHY
Port 032 port32 is FAULTY
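The health report can be checked by a small script. The following is a minimal Python sketch, assuming the report has been captured as text with one monitor and its state per line. The function names (monitor_states, unhealthy_monitors, overall_state) are illustrative and are not part of FOS, and the severity ordering is an assumption derived from the three documented status values.

```python
import re

# Assumed severity order for the three documented states.
SEVERITY = {"HEALTHY": 0, "MARGINAL": 1, "DOWN": 2}

# Matches lines such as "Power supplies monitor MARGINAL".
MONITOR_RE = re.compile(r"^(.+?) monitor\s+(HEALTHY|MARGINAL|DOWN)$")


def monitor_states(report):
    """Return {monitor name: state} for every monitor line in the report."""
    states = {}
    for line in report.splitlines():
        match = MONITOR_RE.match(line.strip())
        if match:
            states[match.group(1)] = match.group(2)
    return states


def unhealthy_monitors(report):
    """Return only the monitors whose state is not HEALTHY."""
    return {n: s for n, s in monitor_states(report).items() if s != "HEALTHY"}


def overall_state(report):
    """Most severe state of all monitors, or None if no monitor was found."""
    states = monitor_states(report).values()
    return max(states, key=SEVERITY.__getitem__, default=None)


# Shortened copy of the second example above.
SAMPLE = """\
Switch Name: ras220
SwitchState: MARGINAL
Power supplies monitor HEALTHY
Flash monitor MARGINAL
Faulty ports monitor HEALTHY
"""

if __name__ == "__main__":
    print(unhealthy_monitors(SAMPLE))
    print(overall_state(SAMPLE))
```

Lines such as "SwitchState: MARGINAL" are ignored on purpose, so the overall state is derived from the contributors only and can be compared with what the switch itself reports.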
COMMAND : switchshow
EXPLANATION :
This command provides a general overview of the logical switch status (no physical components)
plus a list of ports and their status.
The switchState should always be Online.
The switchDomain should have a unique ID in the fabric.
If zoning is configured, it should be in the "ON" state.
The port status should be "Online" for all ports that are connected and operational.
If a port shows "No_Sync" while it is not disabled, there is likely a cable or SFP/HBA problem.
If you have configured Fabric Watch to enable port fencing, you will see indications like the one
on port 75 in the example, where the port was disabled because a threshold was exceeded.
For any port to work, it must be enabled.
EXAMPLE
Admin> switchshow
switchName: Sydney_ILAB_DCX-4S_LS128
switchType: 77.3
switchState: Online
switchMode: Native
switchRole: Principal
switchDomain: 143
switchId: fffc8f
switchWwn: 10:00:00:05:1e:52:af:00
zoning: ON (Brocade)
switchBeacon: OFF
FC Router: OFF
Fabric Name: FID 128
Allow XISL Use: OFF
LS Attributes: [FID: 128, Base Switch: No, Default Switch: Yes, Address Mode 0]
Index Slot Port Address Media Speed State    Proto
============================================================
    0    1    0 8f0000  id    4G    Online   FC  E-Port 10:00:00:05:1e:36:02:bc "BR48000_1_IP146" (downstream)(Trunk master)
    1    1    1 8f0100  id    N8    Online   FC  F-Port 50:06:0e:80:06:cf:28:59
    2    1    2 8f0200  id    N8
    3    1    3 8f0300  id    N8
    4    1    4 8f0400  id    4G    No_Sync  FC  Disabled (Persistent)
    5    1    5 8f0500  id    N2    Online   FC  F-Port 50:06:0e:80:14:39:3c:15
    6    1    6 8f0600  id    4G
    7    1    7 8f0700  id    4G
    8    1    8 8f0800  id    N8    Online   FC  F-Port 50:06:0e:80:13:27:36:30
   75    2   11 8f4b00  id    N8    No_Sync  FC  Disabled (FOP Port State Change threshold exceeded)
   76    2   12 8f4c00  id    N4    No_Light FC  Disabled (Persistent)
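The port checks described above can be sketched in Python. This is a minimal example, assuming each port row of the switchshow output is available as a single line with the columns Index, Slot, Port, Address, Media, Speed, State and Proto followed by free text. The function name classify_port and the result labels are hypothetical and chosen only for illustration.

```python
def classify_port(row):
    """Classify one switchshow port row; returns (port index, label)."""
    fields = row.split(None, 8)  # 8 columns, then the free-text remainder
    index = int(fields[0])
    if len(fields) < 7:
        return index, "unknown"  # row carries no State column
    state = fields[6]
    comment = fields[8] if len(fields) > 8 else ""
    if comment.startswith("Disabled"):
        if "Persistent" in comment:
            return index, "disabled-persistent"
        # Any other reason, e.g. a port-fencing threshold that was exceeded.
        return index, "disabled-other"
    if state == "Online":
        return index, "ok"
    if state == "No_Sync":
        # Not disabled but without sync: suspect cable, SFP or HBA.
        return index, "check-cable-sfp-hba"
    return index, "not-online"


ROWS = [
    "1 1 1 8f0100 id N8 Online FC F-Port 50:06:0e:80:06:cf:28:59",
    "4 1 4 8f0400 id 4G No_Sync FC Disabled (Persistent)",
    "75 2 11 8f4b00 id N8 No_Sync FC Disabled (FOP Port State Change threshold exceeded)",
    "9 1 9 8f0900 id N8 No_Sync FC",  # hypothetical row, not from the example
]

if __name__ == "__main__":
    for row in ROWS:
        print(classify_port(row))
```

Only the last row, which is made up for this sketch, would be reported as a likely cable or SFP/HBA problem; the No_Sync ports from the example are disabled and are therefore reported with their disable reason instead.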
COMMAND : sfpshow <slot>/<port>
EXPLANATION :
One of the most important parts of a link, irrespective of mode and distance, is the SFP. On
newer hardware and software, sfpshow provides a lot of information on the overall health of the link.
With older FOS codes there could be a discrepancy between what this output displayed and what
was actually plugged into the port. The reason was that the SFPs are polled only periodically
for status and update information. If a port was persistently disabled it was not updated at all,
so in theory you could plug in another SFP and sfpshow would still display the old information.
With FOS 7.0.1 and up this has been corrected, and you can now also see the latest polling
time per SFP.
The question we often get is: "What should these values be?" The answer is "It depends".
As you can imagine, a shortwave 4G SFP draws less current than a longwave 100 km SFP, so the
SFP specifications should be consulted. As a rule of thumb, signal quality depends on the TX
power value minus the link-loss budget. The result should be within the RX power
specifications of the receiving SFP.
Also check the Current and Voltage of the SFP. If an SFP is broken, it often draws no power at
all, and you will see these two values drop to zero.
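The rule of thumb for signal quality can be written as a small calculation. The sketch below is a minimal Python example, assuming power levels in dBm and link loss in dB. The function names are illustrative, and the link loss and receiver window used in the example calls are hypothetical values, not SFP specifications.

```python
import math


def uw_to_dbm(microwatts):
    """Convert optical power from microwatts to dBm (0 dBm = 1 mW)."""
    return 10 * math.log10(microwatts / 1000.0)


def dbm_to_uw(dbm):
    """Convert optical power from dBm to microwatts."""
    return 10 ** (dbm / 10.0) * 1000.0


def expected_rx_dbm(tx_dbm, link_loss_db):
    """Rule of thumb: power arriving at the receiver is TX power minus link loss."""
    return tx_dbm - link_loss_db


def rx_within_spec(tx_dbm, link_loss_db, rx_min_dbm, rx_max_dbm):
    """True if the expected RX power falls inside the receiver's window."""
    return rx_min_dbm <= expected_rx_dbm(tx_dbm, link_loss_db) <= rx_max_dbm


if __name__ == "__main__":
    # TX power of -3.3 dBm with an assumed 2 dB link loss and an assumed
    # receiver window of -18 dBm to 0 dBm.
    print(expected_rx_dbm(-3.3, 2.0))
    print(rx_within_spec(-3.3, 2.0, -18.0, 0.0))
    # The same check with a hypothetical 10 dB loss and a -12 dBm lower limit.
    print(rx_within_spec(-4.0, 10.0, -12.0, 0.0))
```

The conversion helpers are included because sfpshow prints RX and TX power in both dBm and uW, while SFP data sheets may use either unit.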
EXAMPLE :
Admin> sfpshow 1/1
Identifier: 3 SFP
Connector: 7 LC
Transceiver: 540c404000000000 2,4,8_Gbps M5,M6 sw Short_dist
Encoding: 1 8B10B
Baud Rate: 85 (units 100 megabaud)
Length 9u: 0 (units km)
Length 9u: 0 (units 100 meters)
Length 50u (OM2): 5 (units 10 meters)
Length 50u (OM3): 0 (units 10 meters)
Length 62.5u:2 (units 10 meters)
Length Cu: 0 (units 1 meter)
Vendor Name: BROCADE
Vendor OUI: 00:05:1e
Vendor PN: 57-1000012-01
Vendor Rev: A
Wavelength: 850 (units nm)
Options: 003a Loss_of_Sig,Tx_Fault,Tx_Disable
BR Max: 0
BR Min: 0
Serial No: UAF110480000NYP
Date Code: 101125
DD Type: 0x68
(it's an 8G shortwave SFP)
Enh Options: 0xfa
Status/Ctrl: 0x80
Alarm flags[0,1] = 0x5, 0x0
Warn Flags[0,1] = 0x5, 0x0
                                     Alarm                    Warn
                                low         high         low         high
Temperature: 25 Centigrade      -10         90           -5          85
Current: 6.322 mAmps            1.000       17.000       2.000       14.000
Voltage: 3290.2 mVolts          2900.0      3700.0       3000.0      3600.0
RX Power: -3.2 dBm (476.2 uW)   10.0 uW     1258.9 uW    15.8 uW     1000.0 uW
TX Power: -3.3 dBm (472.9 uW)   125.9 uW    631.0 uW     158.5 uW    562.3 uW
State transitions: 1
Last poll time: 06-20-2013 EST Thu 06:48:28
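Each reading in the sfpshow output can be compared with its alarm and warning thresholds. The following is a minimal Python sketch, assuming the threshold columns are ordered low alarm, high alarm, low warning, high warning. The function names and the READINGS dictionary are illustrative; the numbers are copied from the example above.

```python
def threshold_status(value, low_alarm, high_alarm, low_warn, high_warn):
    """Return 'alarm', 'warn' or 'ok' for one reading."""
    if value < low_alarm or value > high_alarm:
        return "alarm"
    if value < low_warn or value > high_warn:
        return "warn"
    return "ok"


# (value, low alarm, high alarm, low warn, high warn)
READINGS = {
    "Temperature (C)": (25, -10, 90, -5, 85),
    "Current (mA)": (6.322, 1.000, 17.000, 2.000, 14.000),
    "Voltage (mV)": (3290.2, 2900.0, 3700.0, 3000.0, 3600.0),
    "RX Power (uW)": (476.2, 10.0, 1258.9, 15.8, 1000.0),
    "TX Power (uW)": (472.9, 125.9, 631.0, 158.5, 562.3),
}


def check_all(readings):
    """Return {reading name: status} for every entry."""
    return {name: threshold_status(*vals) for name, vals in readings.items()}


if __name__ == "__main__":
    for name, status in check_all(READINGS).items():
        print(name, status)
```

A broken SFP that draws no power would show a current of zero, which falls below the low alarm threshold and is reported as "alarm".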
COMMAND : porterrshow
EXPLANATION :
For link state counters this is the most useful command on the switch. However, there is a perception
that it provides a "silver bullet" to solve port and link issues, and that is not the case. It provides
a snapshot of the content of the LESB (Link Error Status Block) of a port at that particular point
in time. It does not tell us when these counters accumulated or over which time frame. So to
create a sensible picture of the status of the ports we need a baseline. Create this baseline by
resetting all counters so they start from zero. To do this, issue the "statsclear" command on the CLI.
There are 7 columns you should pay attention to from a physical perspective.
enc_in - Encoding errors inside frames. These are errors that occur at the FC-1 layer when
encoding 8 bits to 10 and back or, with 10G and 16G FC, 64 bits to 66 and back. Since they affect
bits that are part of a data frame, they are counted in this column.
crc_err - An enc_in error might lead to a CRC error. However, this column shows frames that were
marked as invalid because of a CRC error earlier in the datapath. According to the FC
specifications it is up to the implementation whether to discard the frame right away or to mark
it as invalid and send it to the destination anyway. There are pros and cons to both approaches.
So if you see crc_err in this column, it means the port has received a frame with an incorrect
CRC, but the corruption occurred further upstream.
crc_g_eof - This column is the same as crc_err, except that the incoming frames are NOT marked
as invalid. When you see these, the enc_in counter usually increases as well, but not necessarily.
If the enc_in and/or enc_out column increases as well, there is a physical link issue which could
be resolved by cleaning connectors, replacing a cable or (in rare cases) replacing the SFP and/or
HBA. If the enc_in and enc_out columns do NOT increase, there is an issue between the SERDES
chip and the SFP which causes the CRC to mismatch the frame. This is a firmware issue which
could be resolved by upgrading to the latest FOS code. There are a couple of defects listed to
track these.
enc_out - Similar to enc_in, this is the same encoding error, but it occurred outside normal
frame boundaries, i.e. no host IO frame was impacted. This may seem harmless, but be aware that
many primitive signals and sequences travel in between normal data frames and are paramount for
Fibre Channel operations. Primitives which regulate credit flow (R_RDY and VC_RDY) and signal
clock synchronization are especially important. If this column increases on any port, you will
likely run into performance problems sooner or later, or you will see a problem with link
stability and sync errors (see below).
Link_Fail - This means a port has received a NOS (Not Operational) primitive from the remote
side and needs to change the port operational state to LF1 (Link Fail 1), after which the
recovery sequence needs to commence (see the FC-FS standards specification).
Loss_Sync - Loss of synchronization. The transmitter and receiver sides of the link maintain
clock synchronization based on primitive signals which start with a certain bit pattern (K28.5).
If the receiver is not able to sync its baud rate to the point where it can distinguish between
these primitives, it will lose sync and hence cannot determine when a data frame starts.
Loss_Sig - Loss of Signal. This column shows a drop of light, i.e. no light (or insufficient RX
power) is observed for over 100 ms, after which the port goes into a non-active state. This
counter often increases when the link-loss budget is overdrawn. Suppose, for instance, that the
TX side sends out light at -4 dBm and the receiver's lower sensitivity threshold is -12 dBm. If
the cable quality degrades the signal to a value below that threshold, the port will bounce very
often and this counter will increase. Other frequent culprits are unclean connectors,
patch panels and badly made fibre splices. These ports should be shut down immediately and the
cabling plant checked. Replacing cables and/or bypassing patch panels is often a quick way to
find out where the problem is.
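The way these columns relate to each other can be expressed as a short decision routine. The sketch below is a minimal Python example, assuming the input is a dictionary holding the increase of each counter since the last baseline. The function name diagnose and the finding labels are hypothetical; the rules only restate the descriptions above.

```python
def diagnose(delta):
    """Map counter increases for one port to a list of likely causes."""

    def up(name):
        return delta.get(name, 0) > 0

    findings = []
    if up("crc_g_eof"):
        if up("enc_in") or up("enc_out"):
            # CRC errors together with encoding errors: physical link issue.
            findings.append("physical-link")
        else:
            # CRC errors without encoding errors: SERDES/SFP, check firmware.
            findings.append("serdes-sfp-firmware")
    if up("crc_err"):
        findings.append("corruption-upstream")
    if up("enc_out"):
        findings.append("primitives-at-risk")
    if up("link_fail"):
        findings.append("nos-received")
    if up("loss_sync"):
        findings.append("sync-lost")
    if up("loss_sig"):
        findings.append("check-light-budget-and-cabling")
    return findings


if __name__ == "__main__":
    print(diagnose({"crc_g_eof": 12, "enc_in": 40}))
    print(diagnose({"crc_g_eof": 3}))
    print(diagnose({"loss_sig": 5, "loss_sync": 5}))
```

An empty list means none of the seven physical columns increased, which is the state every port should be in.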
The other columns relate more to protocol issues and/or performance problems, which could be
the result of a physical problem but not its cause. In short, look at the 7 columns mentioned above
and check that no port shows an increasing value.
too_short/too_long - Indicates a protocol error where SOF or EOF is observed too soon or too
late. These two columns rarely increase.
bad_eof - Bad End-of-Frame. This column indicates that the sender observed an abnormality in a
frame or in its transceiver while the frame header and portions of the payload had already been
sent to the destination. The only way for a transceiver to notify the destination is to
invalidate the frame. It truncates the frame and adds an EOFni or EOFa to the end. This
signals to the destination that the frame is corrupt and should be discarded.
F_Rjt and F_Bsy - Often seen in FICON environments where control frames could not be
processed in time or are rejected based on fabric configuration or fabric status.
c3timeout (tx/rx) - These counters indicate that a port is not able to forward a frame to its
destination in time. They show either a problem downstream of this port (tx) or a problem on
this port, where it has received a frame meant to be forwarded to another port inside the same
switch (rx). Frames are ALWAYS discarded at the RX side (since that is where the buffers hold
the frame). The tx column is an aggregate over all rx ports that need to send frames via this
port according to the routing tables created by FSPF.
pcs_err - Physical Coding Sublayer. These values represent encoding errors on 16G platforms
and above. Since 16G speeds use 64b/66b encoding/decoding, a separate control structure takes
care of this.
As a best practice, keep a record of these port errors and create a new baseline every week.
This allows you to quickly identify errors and solve them before they become a problem with a
prolonged resolution time. Make sure you do this fabric-wide to maintain consistency across all
switches in that fabric.
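Comparing a new snapshot against the baseline can be automated. The following is a minimal Python sketch, assuming each snapshot has already been collected into a dictionary that maps a port number to its counters. The function name increased_counters, the counter key names and the sample data are hypothetical.

```python
# The seven columns that matter from a physical perspective.
PHYSICAL = (
    "enc_in", "crc_err", "crc_g_eof", "enc_out",
    "link_fail", "loss_sync", "loss_sig",
)


def increased_counters(baseline, current):
    """Return {port: {counter: increase}} for physical counters that grew."""
    report = {}
    for port, counters in current.items():
        before = baseline.get(port, {})
        grown = {
            name: counters.get(name, 0) - before.get(name, 0)
            for name in PHYSICAL
            if counters.get(name, 0) > before.get(name, 0)
        }
        if grown:
            report[port] = grown
    return report


if __name__ == "__main__":
    # Hypothetical snapshots taken one week apart.
    last_week = {0: {"enc_out": 0, "loss_sig": 0}, 1: {"enc_out": 0}}
    today = {0: {"enc_out": 15, "loss_sig": 2}, 1: {"enc_out": 0}}
    print(increased_counters(last_week, today))
```

Ports that do not appear in the result had no growth in any of the seven columns during the interval.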
EXAMPLE
Admin> porterrshow
              frames     enc     crc     crc     too     too     bad     enc    disc    link    loss    loss    frjt    fbsy       c3timeout     pcs
          tx      rx      in     err   g_eof    shrt    long     eof     out      c3    fail    sync     sig                      tx      rx     err
  0:  100.1m   53.4m       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
  1:  466.6k  154.5k       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
  2:  476.9k  973.7k       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
  3:  474.2k  155.0k       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
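To compare counters numerically, the abbreviated values in this output first have to be expanded. The sketch below is a minimal Python example, assuming that the suffixes k, m and g stand for thousands, millions and billions and that each row starts with the port number followed by a colon. The function names parse_count and parse_row are illustrative.

```python
# Assumed meaning of the suffixes used in abbreviated counter values.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_count(text):
    """Expand an abbreviated counter such as '466.6k' to an integer."""
    text = text.strip().lower()
    if text and text[-1] in MULTIPLIERS:
        return round(float(text[:-1]) * MULTIPLIERS[text[-1]])
    return int(text)


def parse_row(row):
    """Split a row such as '0: 100.1m 53.4m 0 0' into (port, [counts])."""
    port, _, rest = row.partition(":")
    return int(port), [parse_count(token) for token in rest.split()]


if __name__ == "__main__":
    print(parse_row("1: 466.6k 154.5k 0 0 0 0 0"))
```

Abbreviated values are rounded by the switch, so a small increase on a counter that is already in the thousands or millions will not be visible in this output.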
