TR1050-Monitoring Your PS Series SAN With SAN HeadQuarters

TECHNICAL REPORT
MONITORING YOUR PS SERIES SAN WITH SAN HEADQUARTERS
ABSTRACT
Provides detailed information and best practices
for monitoring a Dell EqualLogic PS Series
storage environment using SAN HeadQuarters
Version 2.1
TR1050
V2.1
Copyright 2010 Dell Inc. All Rights Reserved.
Dell EqualLogic is a trademark of Dell Inc.
All trademarks and registered trademarks mentioned herein are the property of their respective owners.
Possession, use, or copying of the documentation or the software described in this publication is authorized
only under the license agreement specified herein.
Dell, Inc. will not be held liable for technical or editorial errors or omissions contained herein. The
information in this document is subject to change.
November 2010
WWW.DELL.COM/PSseries
TABLE OF CONTENTS
Revision Information
Preface
Introduction
Overview
Planning the installation
    Monitor Service System Requirements
    Monitor Client System Requirements
    Permissions Needed to Access Group Data
    SNMP Polling
Installing SAN HeadQuarters
Adding Groups to be Monitored
    Using SANHQ on Groups with the Management Port Enabled
Information provided by SANHQ
    Capacity
    Alerts and Events
    IO and Experimental Analysis Data
    Hardware and Firmware
    Network and Port Data
Receiving Email Alerts
Exporting Data
Archiving Data
Data Analysis
    Launching the Monitor Client for Offline Use
    Importing Data
    Understanding Your Applications and What is Normal
Solving Performance Problems
    Check for Damaged Hardware
    Check the Volume IO Latencies
    Determining Overload of SAN Resources
    Random IOPS
    Experimental Analysis and IOPS
    Group IO Load Space Distribution
    Network Performance
    Network Bandwidth
    Storage Pool Capacity
    Low Performance on Thin Provisioned Volumes
    iSCSI Connections
    MPIO Connections
    Queue Depths
EXAMPLES
    Example 1: A Stable System
    Example 2: High Latency Caused by High Random IOPS
    Example 3: TCP Retransmit Errors
    Example 4: Network Bandwidth Saturation
    Example 5: Sudden Rise in Capacity Used
    Example 6: Low Storage Pool Free Space
Summary
For More Information
Technical Support and Customer Service
REVISION INFORMATION
The following table describes the release history of this Technical Report.
Report | Date | Document Revision
1.0 | September 2009 | Initial Release
2.0 | June 2010 | Updated for v2 of the SANHQ tool
2.1 | November 2010 | Updated for v2.1 of the SANHQ tool
The following table shows the software and firmware used for the preparation of this Technical Report.
Vendor | Model | Software Revision
Microsoft | Windows Server 2008 R2 | Enterprise Edition
Microsoft | Windows 7 | Enterprise Edition
Dell | EqualLogic PS Series array firmware | Version 5.0.2
Dell | EqualLogic Host Integration Tools for Windows | Version 3.4.2
Dell | EqualLogic SAN HeadQuarters | Version 2.1.0
The following table lists the documents referred to in this Technical Report. All PS Series Technical Reports
are available on the EqualLogic website: http://www.equallogic.com/resourcecenter/documentcenter.aspx
Vendor | Document Title
Dell | EqualLogic PS Series Arrays SAN HeadQuarters User Guide Version 2.1
Dell | EqualLogic PS Series Arrays Group Administration, PS Series Firmware Version 5.0
PREFACE
Thank you for your interest in Dell EqualLogic PS Series storage products. We hope you will find the PS Series
products intuitive and simple to configure and manage.
PS Series arrays optimize resources by automating volume and network load balancing. Additionally, PS Series arrays
offer all-inclusive array management software, host software, and free firmware updates. The following value-add
features and products integrate with PS Series arrays and are available at no additional cost:
Note: The highlighted text denotes the focus of this document.
PS Series Array Software
    o Firmware: Installed on each array, this software allows you to manage your storage environment and
      provides capabilities such as volume snapshots, clones, and replicas to ensure data hosted on the
      arrays can be protected in the event of an error or disaster.
    o Group Manager GUI: Provides a graphical user interface for managing your array.
    o Group Manager CLI: Provides a command line interface for managing your array.
    o Manual Transfer Utility (MTU): Runs on Windows and Linux host systems and enables secure transfer of
      large amounts of data to a replication partner site when configuring disaster tolerance. You use
      portable media to eliminate network congestion, minimize downtime, and quick-start replication.
    o SAN HeadQuarters (SANHQ): Provides centralized monitoring, historical performance trending, and
      event reporting for multiple PS Series groups.
Host Software for Windows
    o Host Integration Tools
        - Remote Setup Wizard (RSW): Initializes new PS Series arrays, configures host connections to PS
          Series SANs, and configures and manages multipathing.
        - Multipath I/O Device Specific Module (MPIO DSM): Includes a connection-awareness module
          that understands PS Series network load balancing and facilitates host connections to PS Series
          volumes.
        - VSS and VDS Provider Services: Allows 3rd party backup software vendors to perform off-host
          backups.
        - Auto-Snapshot Manager/Microsoft Edition (ASM/ME): Provides point-in-time SAN protection
          of critical application data using PS Series snapshots, clones, and replicas of supported
          applications such as SQL Server, Exchange Server, Hyper-V, and NTFS file shares.
Host Software for VMware
    o Storage Adapter for Site Recovery Manager (SRM): Allows SRM to understand and recognize PS Series
      replication for full SRM integration.
    o Auto-Snapshot Manager/VMware Edition (ASM/VE): Integrates with VMware Virtual Center and PS
      Series snapshots to allow administrators to enable Smart Copy protection of Virtual Center folders,
      datastores, and virtual machines.
    o EqualLogic Multipathing Extension Module for VMware ESX: Provides enhancements to existing
      VMware multipathing functionality.
Current Customers Please Note: You may not be running the latest versions of the tools and software listed above. If you
are under valid warranty or support agreements for your PS Series array, you are entitled to obtain the latest updates and
new releases as they become available.
To learn more about any of these products, contact your local sales representative or visit the Dell EqualLogic site at
http://www.equallogic.com. To set up a Dell EqualLogic support account to download the latest available PS Series
firmware and software kits visit: https://www.equallogic.com/secure/login.aspx?ReturnUrl=%2fsupport%2fDefault.aspx
INTRODUCTION
According to IDC and Gartner, Dell is the leader in iSCSI SAN deployments. Dell EqualLogic PS
Series storage arrays offer great benefits including:
Simplified management of storage resources
Comprehensive protection of data resources
Load balancing and performance optimization of storage resources
Server and host integration tools
SAN HeadQuarters (SANHQ) provides comprehensive monitoring of performance and health
statistics for one or more EqualLogic PS Series groups.
The purpose of this technical report is to help storage administrators and other IT professionals use
SANHQ to monitor an EqualLogic SAN. In addition, it provides basic troubleshooting tips to help
administrators diagnose some common SAN problems.
OVERVIEW
SANHQ monitors one or more PS Series groups. The tool is a client/server application that runs
on a Microsoft Windows server and uses SNMP to query the groups. Acting like a flight data
recorder on an aircraft, SANHQ collects data over time and stores it on the server for later
retrieval and analysis. Client systems connect to the server and format and display the data in the
SANHQ GUI.
SANHQ enables the storage administrator to:
Monitor one or more PS Series groups and store operational data for up to a year
Obtain a centralized view of the health and status of multiple groups
Allow the same performance data to be viewed by multiple clients simultaneously
Monitor performance for a specific time period
Monitor and analyze capacity usage for groups
View IO rates, throughput, and latency for each volume, member, pool, or group
View estimated maximum IO capabilities
Generate alerts and email notifications based on the health status of the groups
Use the GUI or a script to archive group performance data for later offline analysis
View archived group data offline
Launch the Group Manager GUI directly from SANHQ
Use Single Sign On for quick group login
Use the built-in syslog server to consolidate all events and alerts into a single view
Generate reports, including Top10 Volume, Configuration, Performance and Alerts reports
Customize the SANHQ user interface
An overview of the architecture is shown in Figure 1.
FIGURE 1: SANHQ ARCHITECTURE
TABLE 1: DESCRIPTION OF ELEMENTS IN FIGURE 1

Figure 1 Element Description
The computer running the Monitor Service issues a series of SNMP requests (polls) to each group
for configuration, status, and performance information. The Monitor Service also includes a syslog
server to which a PS Series group can log events.
When the first set of SNMP requests returns from a group, the Monitor Service stores this baseline
information in the log files for that group. The Monitor Service issues subsequent SNMP requests at
regular intervals of time (by default, two minutes).
To obtain a data point, the Monitor Service averages the data from consecutive polling operations.
Each computer running a Monitor Client accesses the log files maintained by the Monitor Service
and displays the group data in the SAN HeadQuarters GUI.
Note: The computer running the Monitor Service also has a Monitor Client installed.
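The way consecutive polls become data points can be illustrated with a short sketch. It assumes, purely for illustration, that each poll returns a cumulative IO counter; the names PollSample and data_points are hypothetical and are not part of SANHQ, and the actual Monitor Service calculation may differ.

```python
from dataclasses import dataclass


@dataclass
class PollSample:
    """One hypothetical poll result: a timestamp and a cumulative IO counter."""
    time_s: float   # seconds since monitoring started
    io_count: int   # cumulative IO operations reported by the group


def data_points(samples):
    """Turn consecutive polls into average IOPS values, one per interval."""
    points = []
    for prev, curr in zip(samples, samples[1:]):
        elapsed = curr.time_s - prev.time_s
        if elapsed <= 0:
            continue  # skip out-of-order or duplicate polls
        points.append((curr.io_count - prev.io_count) / elapsed)
    return points


# The first poll is the baseline; later polls arrive every 120 seconds.
polls = [PollSample(0, 0), PollSample(120, 60_000), PollSample(240, 180_000)]
print(data_points(polls))
```

Because each value is an average over the whole interval, short bursts inside a two-minute window are smoothed out rather than shown as peaks.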
PLANNING THE INSTALLATION

As shown in Figure 1, implementation of SANHQ requires a Windows system to run the Monitor
Service that conducts the SNMP polling and stores the group performance data in log files. Other
Windows systems run the Monitor Client and format and display the data in the SANHQ GUI.
To minimize the overhead of SNMP polling on a PS Series group, it is important that only one
system monitor a specific group. You can have multiple servers running the Monitor Service (for
example, a central hosting facility that monitors the SANs of multiple customers) if the list of
groups monitored by each Monitor Service is unique. Each Monitor Service can support many
systems running the Monitor Client.
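When several Monitor Service hosts are in use, the rule that each group is polled by only one of them can be checked from an inventory of assignments. The sketch below is a minimal illustration; the host and group names are made up, and SANHQ itself does not provide this function.

```python
from collections import defaultdict


def groups_monitored_twice(assignments):
    """Return {group: [services]} for groups polled by more than one service."""
    seen = defaultdict(list)
    for service, groups in assignments.items():
        for group in groups:
            seen[group].append(service)
    return {g: sorted(s) for g, s in seen.items() if len(s) > 1}


# Hypothetical layout: two Monitor Service hosts and the groups they poll.
assignments = {
    "monitor-a": ["customer1-grp", "customer2-grp"],
    "monitor-b": ["customer2-grp", "customer3-grp"],
}
print(groups_monitored_twice(assignments))
```

An empty result means every group is monitored by exactly one Monitor Service.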
Monitor Service System Requirements

The SANHQ Monitor Service can be installed on Windows 7, Windows Vista, Windows 2003, or
Windows 2008. See the SANHQ User Guide for detailed requirements.
Either a physical or virtual system can be used to host the Monitor Service. However, when using
a virtual server, time is not always accurately tracked, so the polling interval might be adversely
affected if the hypervisor is particularly busy.
Be sure that the network permissions and routing allow the Monitor Service access to all the group
member Ethernet ports. Be sure that the log file directory is accessible to all the systems running
the Monitor Client.
Monitor Client System Requirements
The SANHQ Monitor Client can be installed on Windows 7, Windows Vista, Windows 2003, or
Windows 2008. See the SANHQ User Guide for detailed requirements.
Minimally, a system running the Monitor Client must have network read access to the Windows
file share in which the Monitor Service stores the group log files. See Permissions Needed to
Access Group Data for more information.
Note that any Monitor Client can open an archive file and view the archived group data.
Permissions Needed to Access Group Data
The data collected by the Monitor Service can be stored in log files on the system running the
Monitor Service and made available to the Monitor Clients through a Windows File Share.
Alternatively, you can store the log files on a network device.
Standard Windows permissions are used to control access to the network share and the directory
where the data is stored. Monitor Clients with read-only access can view data, but read-write
access is required to manage the list of monitored groups, configure SNMP information for the
monitored groups, or change the email notification settings for the monitored groups. Use standard
Windows tools to manage the permissions.
SNMP Polling
The SNMP polling interval is designed to minimize the impact of SANHQ on the performance of
the groups being monitored. If the SNMP poll is unable to return a full set of data during the
default two-minute interval, the Monitor Service will automatically adjust the polling frequency to
ensure the data can be collected.
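This report does not describe how the Monitor Service chooses an adjusted polling frequency. The sketch below shows one plausible policy, stated as an assumption: keep the default interval when a poll completes in time, otherwise stretch it to the next whole multiple of the default. The function name next_interval is hypothetical.

```python
DEFAULT_INTERVAL_S = 120  # default two-minute polling interval from the text


def next_interval(poll_duration_s, default_s=DEFAULT_INTERVAL_S):
    """Pick the next polling interval (assumed policy, not SANHQ's actual one).

    If a full poll finished within the default interval, keep the default.
    Otherwise stretch the interval to the next whole multiple of the default
    so the poll has time to complete.
    """
    if poll_duration_s <= default_s:
        return default_s
    multiples = -(-poll_duration_s // default_s)  # ceiling division
    return int(multiples * default_s)


for duration in (45, 150, 240):
    print(duration, "->", next_interval(duration))
```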
INSTALLING SAN HEADQUARTERS

To install SANHQ, run SANHQSetup32and64.exe on the computer. You can choose to install
the Monitor Service or only the Monitor Client.
An easy-to-follow wizard will guide you through the installation. See the SANHQ User Guide for
more details.
ADDING GROUPS TO BE MONITORED

Multiple groups can be monitored by a single SANHQ Monitor Service installation. You can add
groups to be monitored, as described in the SANHQ User Guide.
Of particular importance is ensuring that the SNMP community name is the same in the Monitor
Service group configuration and in the monitored group.
Optionally, you can change the default size of the log files when first setting up monitoring for a
group. Generally, the default log file size of 5 MB is an appropriate size.
Larger log files allow you to maintain more detailed data for longer periods of time, but at the
expense of disk space and potentially slower performance from the Monitor Client.
Smaller log files require less disk space and might improve Monitor Client performance, but at the
expense of more detailed data.
See the SANHQ User Guide for more details.
Using SANHQ on Groups with the Management Port Enabled
SANHQ can be used with PS Series groups that have a dedicated management network enabled.
The management IP address should be used to add a group, instead of the normal iSCSI VLAN
group IP address. Note that the Monitor Service must be able to communicate with each network
interface on each group member to properly gather data.
INFORMATION PROVIDED BY SANHQ

There is a broad spectrum of information that SANHQ provides. Data is categorized into key
areas, as discussed briefly below.
Capacity
A key component of the health of your PS Series group is capacity. To fully understand the
capacity available for new applications or to support the growth of existing servers, you must
examine the overall group and pool capacity, storage utilization statistics, thin provisioned space,
and space used for replication. To ensure a healthy SAN, it is important to detect any sudden or
unexpected changes in capacity utilization.
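A sudden change in capacity utilization can be spotted by comparing each sample with the one before it. The following sketch assumes a list of used-capacity readings in GB and an arbitrary 5 percent threshold; both the function name and the threshold are illustrative choices, not SANHQ settings.

```python
def sudden_capacity_changes(used_gb, max_change_pct=5.0):
    """Return indexes of samples whose used capacity changed by more than
    max_change_pct relative to the previous sample (threshold is assumed)."""
    flagged = []
    for i in range(1, len(used_gb)):
        prev, curr = used_gb[i - 1], used_gb[i]
        if prev > 0 and abs(curr - prev) / prev * 100 > max_change_pct:
            flagged.append(i)
    return flagged


# Hypothetical daily readings: steady growth, then a jump on the fourth day.
daily_used_gb = [4000, 4010, 4025, 4600, 4610]
print(sudden_capacity_changes(daily_used_gb))
```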
Alerts and Events
Monitoring SANHQ alerts and PS Series group events can help you correlate specific performance
issues with group activity. Dell recommends setting up email notification in SANHQ and also
configuring the monitored groups to log events to the SANHQ syslog server, which is part of the
Monitor Service, to help troubleshoot problems if they occur.
IO and Experimental Analysis Data
For the PS Series group and its storage pools and members, IO performance indicates whether the
storage system can handle the load from the servers. IOPS (IO operations per second), throughput
in KB/sec, and latency all are factors to consider. Latency is particularly important because it helps
determine if you are exceeding the maximum IOPS or throughput.
Obtaining IO data for individual volumes can help determine the source of the load and can assist
you in determining if more resources or dedicated resources are needed to support a particular
application.
The Experimental Analysis data provides an estimate of how busy the SAN is and can help
identify problems caused by an excessive load on the available resources. The estimated
Experimental Analysis data is based on a specific workload (small, random IOPS) that may not
resemble the actual group workload. Therefore, the data should not be used as the sole measure of
group performance.
Hardware and Firmware
Because hardware failures can be a source of performance problems, SANHQ alerts you when
there is a failure.
In addition, hardware and firmware can affect performance. For example, different disk types will
have different performance characteristics. Also, if hardware is added or removed from the
environment, viewing this information at different points in time will tell you what was in use at
that time.
SANHQ tracks:
Array model, service tag, and serial number
RAID status and RAID policy
Firmware version
In addition, information about individual disks is available, including disk IO data and individual
disk queue depth (requires PS Series Firmware Version 4.2 or higher).
Network and Port Data
A key to understanding the overall load on the SAN is network data and member network interface
(port) data. Network retransmits are a critical indicator of network problems that can affect SAN
performance. SANHQ shows the network link throughput and high- and low-use network ports.
RECEIVING EMAIL ALERTS

SANHQ can proactively inform you of performance problems and PS Series group events that
require your attention, such as hardware failures or high latencies.
To receive alerts from SANHQ, configure the email settings appropriate for your environment.
See the SANHQ User Guide for details.
EXPORTING DATA
SANHQ enables you to export the data for one or more monitored groups in CSV format for use
with an external analysis tool, such as Microsoft Excel. You specify the groups and the time range
for the desired data.
See the SANHQ User Guide for details.
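Exported data can also be processed with a script instead of a spreadsheet. The sketch below averages a latency column per volume; the column headings (timestamp, volume, avg_latency_ms) and the sample rows are invented for illustration, so check the headings in an actual SANHQ export before adapting it.

```python
import csv
import io
from statistics import mean

# Stand-in for an exported file; the real column headings will differ.
SAMPLE_EXPORT = """timestamp,volume,avg_latency_ms
2010-11-01 10:00,vol-sql,12.0
2010-11-01 10:02,vol-sql,18.0
2010-11-01 10:00,vol-exchange,35.0
2010-11-01 10:02,vol-exchange,45.0
"""


def mean_latency_by_volume(csv_text):
    """Average the latency column for each volume in the exported rows."""
    per_volume = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        per_volume.setdefault(row["volume"], []).append(float(row["avg_latency_ms"]))
    return {vol: mean(vals) for vol, vals in per_volume.items()}


print(mean_latency_by_volume(SAMPLE_EXPORT))
```

To read a real export, replace io.StringIO(csv_text) with an opened file object.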
ARCHIVING DATA
SANHQ allows you to archive data for one or more monitored groups. This enables you to use
SANHQ to view and analyze the archived data offline, without needing access to the Monitor
Service.
You can also archive data to preserve detailed data for a particular period in time.
A data archive is more valuable than exported data when working with Dell support to resolve
issues because the archive can be viewed by anyone running the Monitor Client.
DATA ANALYSIS
Analyzing data gathered by SANHQ can be a mixture of art and science. When looking for the
source of performance issues, it is important to carefully consider all the performance data before
drawing conclusions. The use of the SAN may change over time, so a tool such as SANHQ, which
provides historical perspective, is often able to provide the insight that a simple performance
snapshot cannot.
Launching the Monitor Client for Offline Use
Normally, each Monitor Client connects to the Monitor Service to obtain and format the latest
group performance data. If desired, you can archive group data for later analysis (for example, if
you do not have access to the Monitor Service).
For example, if you start SANHQ, but do not have access to the Monitor Service, simply choose
the Ignore option when launching SANHQ, as shown in Figure 2. This allows SANHQ to start
in offline mode, after which you may import archive files.
FIGURE 2: LAUNCHING THE SANHQ CLIENT FOR OFFLINE USE
Importing Data
To open an archive, select Monitor > Open Archive from the SANHQ menu bar. A new SANHQ
session will appear with the data from the selected archive file. The data can be viewed and
analyzed just like it would be if you were connected to the Monitor Service.
FIGURE 3: OPENING AN ARCHIVE FILE FOR ANALYSIS
Understanding Your Applications and What is Normal

Key to knowing whether the storage system is performing optimally for your environment is
understanding the applications that are using the PS Series SAN. For example, normal business
applications behave very differently than video editing applications. Furthermore, you might be
running high-impact operations only at certain times. Therefore, what is normal at 10AM may
be very different from what is normal at 10PM, and what is normal for most days may not be
normal at the end of the month.
Using multiple monitoring tools including SANHQ, server monitoring with tools such as Windows
PerfMon or Linux iostat, and network monitoring tools can provide insight into the overall
behavior of the storage system, under normal and abnormal conditions.
Obtaining a baseline when things are working well often helps you to identify the source of the
problem when problems arise. The baseline must be reestablished after major system
reconfigurations, upgrades, or significant changes in application use patterns.
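Because "normal" depends on the time of day, a baseline is more useful when it is kept per hour. The sketch below builds such a baseline from (hour, IOPS) samples and flags readings that stray from it; the function names and the 50 percent tolerance are assumptions chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def hourly_baseline(samples):
    """Average (hour, iops) samples into a per-hour baseline."""
    by_hour = defaultdict(list)
    for hour, iops in samples:
        by_hour[hour].append(iops)
    return {hour: mean(vals) for hour, vals in by_hour.items()}


def deviates(baseline, hour, iops, tolerance=0.5):
    """True if iops differs from that hour's baseline by more than tolerance
    (a fraction; the 50% default is an arbitrary illustrative choice)."""
    expected = baseline[hour]
    return abs(iops - expected) > tolerance * expected


# Hypothetical history: busy at 10AM, quiet at 10PM.
history = [(10, 2000), (10, 2200), (22, 400), (22, 600)]
baseline = hourly_baseline(history)
print(deviates(baseline, 10, 2100))
print(deviates(baseline, 22, 2100))
```

The same reading of 2100 IOPS is unremarkable at 10AM in this data but well outside the baseline at 10PM.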
SOLVING PERFORMANCE PROBLEMS

There are many potential causes of performance problems. Resolving problems requires a
methodical approach. You must consider all possible solutions and the effect of any changes and
make sure changes can be reversed if they exacerbate the problem. Use of a methodical approach
can also help avoid "analysis paralysis," where nothing is tried for fear of causing irreversible
damage and thus nothing improves.
Check for Damaged Hardware

Performance problems are often caused by malfunctioning hardware. The basic troubleshooting
steps that are key to solving any IT problem also apply to a SAN. Table 2 lists some problems that
you should watch for and immediately correct. Other factors unique to a particular SAN also might
be important.
If you correct any hardware problems and the problem still exists, you must investigate further.
TABLE 2: TYPICAL HARDWARE ISSUES AFFECTING SAN PERFORMANCE
Damaged Hardware | Typical Symptom | Detected By | Possible Corrective Actions
Server NIC | Malformed Packets | Monitor Errors at Switch | Update NIC Drivers; Replace NIC
Bad Patch Cable | Visible Damage | Visual Inspection | Replace Cable
Wrong Class of Patch Cables | Malformed Packets | Monitor Errors at Switch | Replace Cable
Defective Switch | Spontaneous Restarts; Random Lock-up | Monitor Switch with Appropriate Network Tools | Update Switch Firmware; Replace Switch
Defective Array Hardware | Alerts | Monitor EqualLogic Group; Monitor SANHQ | Contact Dell Support to Replace the Malfunctioning Component; Set Up Email Alerts on Group and SANHQ
Check the Volume IO Latencies
One of the leading indicators of the health of a SAN is latency. In SANHQ, latency is the time
from the receipt of the IO request to the time that the IO is returned to the server. Volume latencies
are easily observed using SANHQ.
Table 3 provides some typical guidelines for interpreting the observed latencies and possible
corrective actions.
Many applications will begin to exhibit significant performance degradation when latencies in the
storage system are consistently above 50 ms. If the servers show high latency (for example, using
PerfMon) but the storage does not, the issue is not with the storage system but with the server
configuration or the SAN network. Consult your operating system, server, or switch vendor for the
appropriate actions to troubleshoot these components.
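The comparison between server-side and storage-side latency can be written as a simple decision rule. This is a minimal sketch that reuses the 50 ms level mentioned above; the function name and return strings are illustrative only.

```python
HIGH_LATENCY_MS = 50  # level above which the text says applications degrade


def likely_latency_source(server_ms, storage_ms, high_ms=HIGH_LATENCY_MS):
    """Suggest where to look first, given server- and storage-side latency."""
    if storage_ms > high_ms:
        return "storage system"
    if server_ms > high_ms:
        return "server configuration or SAN network"
    return "no high latency observed"


# Server (for example, PerfMon) reports 80 ms while SANHQ reports 12 ms.
print(likely_latency_source(server_ms=80, storage_ms=12))
```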
TABLE 3: VOLUME LATENCIES GUIDELINES

Observed Value | Indicative Of | When To Be Concerned | Possible Corrective Actions
Less than 20ms | Normal Operations | N/A | None Required
20ms to 50ms | Possible Misconfiguration of SAN Components. If OK: Possible Overload of SAN Resources | When Condition is Sustained | Check Configuration of Server NICs and SAN Switches. If OK: Add Additional Hardware to the Storage Pool; Migrate Volumes to other Storage Pools
Above 50ms | Possible Misconfiguration of SAN Components. If OK: Probable Overload of SAN Resources | When Condition is Frequently Repeated or Sustained | Check Configuration of Server NICs and SAN Switches. If OK: Add Additional Hardware to the Storage Pool; Migrate Volumes to other Storage Pools
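The latency bands in Table 3 can be applied to a series of readings with a few lines of code. In this sketch the band boundaries come from the table, while the definition of "sustained" (five consecutive readings, roughly ten minutes at the default polling interval) is an assumption, as are the function names.

```python
def classify_latency(latency_ms):
    """Map a volume latency reading to the bands used in Table 3."""
    if latency_ms < 20:
        return "normal"
    if latency_ms <= 50:
        return "check configuration; possible overload"
    return "check configuration; probable overload"


def is_sustained(readings_ms, threshold_ms=20, min_consecutive=5):
    """True if at least min_consecutive readings in a row reach threshold_ms.

    min_consecutive is an assumed definition of "sustained"; with two-minute
    polling, 5 readings cover roughly ten minutes.
    """
    run = 0
    for value in readings_ms:
        run = run + 1 if value >= threshold_ms else 0
        if run >= min_consecutive:
            return True
    return False


readings = [12, 25, 31, 28, 40, 33, 15]
print([classify_latency(r) for r in readings])
print(is_sustained(readings))
```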
Determining Overload of SAN Resources

When high latency cannot be attributed to incorrect configuration of the SAN infrastructure (server
NICs and switches) it is possible that SAN resources are overloaded. SANHQ can provide
additional information to help determine if this is the situation. Areas that could become
overloaded are shown in Table 4, along with possible corrective actions.
TABLE 4: DETERMINING IF SAN RESOURCES ARE OVERLOADED

Overloaded Resource | Indicated By | Possible Corrective Actions
Random IOPS | High IOPS Values, Not Attributable to Sequential Operations Such as Backup; When Condition is Sustained | Add Additional Hardware to the Storage Pool; Migrate Volumes to other Storage Pools
Network Performance | High TCP Retransmits and Alerts; When Condition is Sustained; Unable to Support High Throughput for Sequential Operations | Check Flow Control Settings on Servers and Switches; Check Jumbo Frames Settings on Servers and Switches; Disable Jumbo Frame Support; Check Receive Buffers on Servers and Switches; Disable Unicast Storm Control; Enable Broadcast and Multicast Storm Control
Network Bandwidth | High Network Utilization Values and Alerts; When Condition is Sustained | Activate Additional Network Ports (if available); Add Additional Hardware to the Storage Pool; Migrate Volumes to other Storage Pools
Storage Pool Capacity | Low Storage Pool Free Space Values and Alerts; When Condition is Sustained | Reduce Storage Utilization; Reduce Overallocated Snapshot Reserve Space; Convert Underutilized Volumes to Thin Provisioned Volumes; Migrate Volumes to other Storage Pools; Add Additional Hardware to the Storage Pool
Low Performance on Thin Provisioned Volumes | Low Storage Pool Free Space Values and Alerts; When Condition is Sustained; Volume Approaching Maximum In Use Space Values and Alerts | Reduce Storage Utilization; Add Additional Hardware to the Storage Pool; Migrate Volumes to other Storage Pools; Increase the Maximum In Use Space Value; Convert Thin Provisioned Volumes to Standard Volumes
iSCSI Connections | Unable to Attach Servers to Volumes and Alerts; When Condition is Sustained | Disconnect from Unused Volumes and Snapshots; Modify MPIO Settings to Reduce the Number of Connections Per Volume; Migrate Volumes to another Storage Pool; Create a New Storage Pool and Migrate Volumes to the New Storage Pool
MPIO Connections | Unable to Establish Multiple Connections; When Condition is Encountered | Check the Number of Active iSCSI Connections in the Storage Pool; Check ACLs on Volumes; Use EqualLogic AutoMPIO on Supported Operating Systems; Ensure that MPIO is Supported and Configured on Other Operating Systems
High Queue Depth | >10; When Condition is Sustained, Especially if Accompanied by High Latency | Add Additional Hardware to the Storage Pool; Migrate Volumes to other Storage Pools
12
Random IOPS
A PS Series group uses storage virtualization to distribute workloads over many disks and
generally provides much higher performance than other storage systems with a similar number of
like disks. However, the disks do have a finite ability to do work, as measured in IOPS. Faster
disks, such as SSD or those spinning at 15K RPM, are able to do more random work than slower
disks.
Like the statement "it snows in Florida," the estimated maximum IOPS number is valid under the
right conditions but is not realistic to expect in everyday operation. The degradation from the maximum is
greatly affected by the read/write ratio and RAID level, with RAID 10 experiencing the least
degradation and RAID 6 the most, due to its greater write penalty. When random IO is high,
latencies in the SAN are generally the best indication that the maximum IOPS have been exceeded.
The best way to address high IO issues is to redistribute the load to other storage pools within the
group (if possible) or add PS Series arrays to the storage pool that is experiencing the high random
workload. Sequential workloads, such as backup, will generate much higher IOPS numbers than
random workloads; however, the IOPS during sequential operations are less relevant than network
bandwidth utilization, as discussed below.
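The effect of the read/write ratio and RAID level on random IOPS can be illustrated with a common rule-of-thumb calculation. The sketch below is not how SANHQ derives its estimates. The write penalty values (2 for RAID 10, 4 for RAID 5 and RAID 50, 6 for RAID 6) are widely used industry approximations, and the function name and the 1,000 raw IOPS figure are illustrative assumptions.

```python
# Rule-of-thumb back-end disk I/Os generated per host write.
# Assumed values; not taken from SANHQ or PS Series documentation.
WRITE_PENALTY = {"RAID 10": 2, "RAID 5": 4, "RAID 50": 4, "RAID 6": 6}


def effective_random_iops(raw_disk_iops, read_fraction, raid_level):
    """Estimate host-visible random IOPS for a given read/write mix."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    write_fraction = 1.0 - read_fraction
    penalty = WRITE_PENALTY[raid_level]
    # Each host read costs one disk I/O; each host write costs `penalty`.
    return raw_disk_iops / (read_fraction + write_fraction * penalty)


if __name__ == "__main__":
    # Hypothetical pool whose disks deliver 1,000 raw random IOPS in total,
    # serving a 50/50 read/write workload.
    for level in ("RAID 10", "RAID 50", "RAID 6"):
        print(level, round(effective_random_iops(1000, 0.5, level)))
```

Under these assumptions, the same disks deliver noticeably fewer host IOPS on RAID 6 than on RAID 10 as the write share grows, which is consistent with the ordering described above.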
When determining if high IOPS numbers are generated by random or sequential functions, consider
the applications in use at that time. For example, there will be few OLTP users active at 2:00 a.m. in
most environments, but backup is most likely occurring. Also look at the IO size.
Large IOs are often indicative of sequential operations, while small IOs indicate random
operations. Latencies are usually higher if a high IOPS value is caused by random operations.
Overall, high random IOPS are not a problem if applications are performing satisfactorily.
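The reasoning above can be sketched as a small helper that takes values read from the SANHQ graphs. This is only an illustration: the function name is hypothetical, the 64 KB boundary between "small" and "large" IOs is an assumed value that this report does not define, and the 20 ms latency limit is borrowed from the examples later in this report.

```python
def describe_high_iops(avg_io_size_kb, avg_latency_ms,
                       large_io_kb=64.0, high_latency_ms=20.0):
    """Suggest whether a period of high IOPS looks random or sequential.

    large_io_kb is an assumed cut-off; tune it to what is normal
    for your environment.
    """
    if avg_io_size_kb >= large_io_kb:
        # Large IOs usually point to sequential work such as backup.
        return "likely sequential (large IOs): check network bandwidth utilization"
    if avg_latency_ms >= high_latency_ms:
        # Small IOs combined with high latency suggest an IOPS-bound pool.
        return "likely random (small IOs) with high latency: pool may be IOPS bound"
    return "likely random (small IOs): not a problem if applications perform satisfactorily"


if __name__ == "__main__":
    print(describe_high_iops(256, 5))   # nightly backup window
    print(describe_high_iops(8, 35))    # busy OLTP period
```

The time of day and the applications in use remain the deciding context; the helper only encodes the IO size and latency hints.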
Experimental Analysis and IOPS
SANHQ provides Experimental Analysis windows that display an estimate of the maximum
workload that can be sustained, as well as an estimate of the workload that can be sustained if a RAID
set becomes degraded due to a drive failure. There is also a related graph that shows the estimated
workload on a scale of 0% to 100%.
Note that these graphs are estimates based on a small, random IO workload, which is prevalent in a
typical business environment. Large IO sizes and more sequential workloads may not match the
estimated workload.
Consequently, it is possible for the group workload to exceed the maximum estimated IOPS. This
is not cause for concern unless it is accompanied by high latencies. Similarly, some workloads can
result in high latencies, while remaining below the estimated maximum IOPS.
Understanding what is normal for your environment will help you determine the best use of the
estimated workload graphs.
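As a rough illustration of how to read the estimated workload together with latency, the sketch below computes the observed load as a percentage of the estimated maximum and flags a concern only when latency is high. The function name is hypothetical and the 20 ms limit is an assumption taken from the examples later in this report; SANHQ itself does not expose this function.

```python
def assess_estimated_load(observed_iops, estimated_max_iops,
                          avg_latency_ms, high_latency_ms=20.0):
    """Return (load as % of estimated maximum, whether latency is a concern)."""
    if estimated_max_iops <= 0:
        raise ValueError("estimated_max_iops must be positive")
    load_pct = 100.0 * observed_iops / estimated_max_iops
    # Exceeding the estimate alone is not a concern; high latency is.
    concern = avg_latency_ms >= high_latency_ms
    return load_pct, concern


if __name__ == "__main__":
    # Above the estimate but with low latency: no action needed.
    print(assess_estimated_load(6000, 5000, 8))
    # Below the estimate but with high latency: worth investigating.
    print(assess_estimated_load(3000, 5000, 30))
```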
Group IO Load Space Distribution
Starting with v2.1 of the SANHQ software and v5 of the EqualLogic firmware, SANHQ is able to
provide additional information on the IO load distribution within a group or pool. This distribution
can be helpful in determining if the IO activity observed is attributable to a relatively small amount
of the total dataset, or if it is a generally uniform distribution across the entire dataset. Knowledge
of the distribution of activity can be useful in making decisions about whether or not tiering data
for performance will be effective. If data activity is concentrated in a relatively small portion of
the capacity, providing higher performing media for that capacity may prove effective in increasing
the performance of an application. For example, many databases have a core portion of the data
that is frequently accessed, such as indexes and reference data accessed by all users. This
portion of the data, if accelerated, will often improve the overall performance of the database. The
Group IO Load Space distribution graph shows the amount of data that falls into one of three
categories: high IO, medium IO and low IO.
In addition, if SSD media is present in the EqualLogic group, the Group IO Load Space
distribution graph will reflect this capacity relative to the amount of very active data. If one of
the EqualLogic tiered array models is present (PS6000XVS or PS6010XVS), an additional graph
appears that shows how much of the Enhanced Write Cache unique to these array models is in
use, as shown in Figure 4.
FIGURE 4: GROUP I/O LOAD SPACE DISTRIBUTION AND ENHANCED WRITE CACHE USAGE
Network Performance
Network performance is dependent on a number of components working in conjunction with each
other.
The most critical function that affects network performance is Flow Control, which allows network
devices to signal the next device that the data stream should be reduced to prevent dropped packets
and retransmissions.
Flow Control is typically disabled by default and must be enabled on both the server NICs and the
network switches in order to be effective. Consult your switch vendor or server NIC driver
documentation to determine how to configure Flow Control. If Flow Control cannot be configured
on a network device such as a NIC or switch, either upgrade the device to a version of firmware
that supports Flow Control or replace the device with one that does. Flow Control is automatically
supported by PS Series arrays.
The second item that must be properly configured is Jumbo Frames. Jumbo Frames use a larger
frame size than standard Ethernet frames and allow large amounts of data to be efficiently
transmitted between the server and storage.
In environments with small average IO sizes, Jumbo Frames provide limited benefits. In general,
Jumbo Frames support is disabled by default on switches and server NICs. Enabling Jumbo
Frames requires that the switch use a VLAN other than the default VLAN (usually VLAN 1). PS Series
arrays will automatically negotiate the use of Jumbo Frames when the iSCSI connection is
established by the server. Consult your switch vendor or server NIC driver documentation to
determine if Jumbo Frames can be configured.
Note that some network devices run more slowly with Jumbo Frames enabled, do not properly
support Jumbo Frames, or cannot support them simultaneously with Flow Control. In these cases,
Jumbo Frames should be disabled, or the switches should be upgraded or replaced.
When attempting to troubleshoot network performance problems, disable Jumbo Frames and
determine whether the network is performing properly with standard Ethernet frames.
Another area that can cause problems is a lack of receive buffers. Low-end switches often have
limited memory and suffer from performance issues related to insufficient buffers. A
recommended buffer level in switches is 1 MB per port. Dedicated buffers are preferred to shared
buffers.
In addition, server performance can often be improved by increasing the number of buffers
allocated to the server NICs. Consult your switch vendor or server NIC driver documentation to
determine if you can increase the buffers.
Network Bandwidth
Network bandwidth may become fully utilized during highly sequential operations, such as backup.
This is not indicative of a problem in most cases but simply a case of a fully utilized system.
Using all the available bandwidth on one or more member Ethernet interfaces will generate an
alert.
Make sure you connect and enable all the member Ethernet interfaces to maximize the available
SAN bandwidth.
If all interfaces are enabled, but bandwidth is still insufficient, increasing the number of arrays in
the storage pool may provide additional throughput, if the servers have not reached their maximum
bandwidth.
If only one interface is completely utilized on a member or on a server with multiple NICs, ensure
that MPIO is properly configured. If all of the server NICs exceed their capacity (use host-based
tools to determine this), but the PS Series group has excess network capacity, add server
NICs. Also, configure MPIO for those operating systems that support MPIO.
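The bandwidth guidance above can be summarized as a simple decision helper. This is a minimal sketch with hypothetical names and inputs: the port and NIC counts are assumed to be gathered by the administrator from SANHQ and from host-based tools, and "saturated" means whatever utilization level you treat as fully used.

```python
def bandwidth_recommendations(member_ports_enabled, member_ports_total,
                              member_ports_saturated,
                              server_nics_total, server_nics_saturated):
    """Map observed port and NIC saturation to the actions described above."""
    recs = []
    if member_ports_enabled < member_ports_total:
        recs.append("Connect and enable all member Ethernet interfaces")
    # A single busy interface next to idle ones usually points to MPIO.
    one_member_port_busy = member_ports_saturated == 1 and member_ports_enabled > 1
    one_server_nic_busy = server_nics_saturated == 1 and server_nics_total > 1
    if one_member_port_busy or one_server_nic_busy:
        recs.append("Verify that MPIO is properly configured")
    servers_maxed = server_nics_total > 0 and server_nics_saturated == server_nics_total
    members_maxed = (member_ports_enabled > 0
                     and member_ports_saturated == member_ports_enabled)
    if servers_maxed and not members_maxed:
        recs.append("Add server NICs and configure MPIO")
    if (member_ports_enabled == member_ports_total
            and members_maxed and not servers_maxed):
        recs.append("Add arrays to the storage pool")
    return recs


if __name__ == "__main__":
    # All four member ports enabled and saturated, servers have headroom.
    print(bandwidth_recommendations(4, 4, 4, 2, 0))
```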
Storage Pool Capacity
Low storage pool capacity is a problem that generates an alert in SANHQ. If a pool has less than
5% free space (or less than 100 GB per member, whichever is less), a PS Series group may not
have sufficient free space to efficiently perform the virtualization functions required for automatic
optimization of the SAN. In addition, when storage pool free space is low, write performance on
thin provisioned volumes is automatically reduced in order to slow the consumption of free space.
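The free space rule above (5% of capacity or 100 GB per member, whichever is less) can be worked through with a short calculation. The function names and the example pool sizes are illustrative assumptions, not part of SANHQ.

```python
def low_free_space_threshold_gb(pool_capacity_gb, member_count):
    """Return the lesser of 5% of pool capacity and 100 GB per member."""
    five_percent = pool_capacity_gb * 5 / 100
    per_member = 100 * member_count
    return min(five_percent, per_member)


def pool_free_space_is_low(free_gb, pool_capacity_gb, member_count):
    """True when free space is below the threshold described above."""
    return free_gb < low_free_space_threshold_gb(pool_capacity_gb, member_count)


if __name__ == "__main__":
    # Hypothetical pool: three members, 12,000 GB total capacity.
    # 5% is 600 GB and 100 GB per member is 300 GB, so the threshold is 300 GB.
    print(low_free_space_threshold_gb(12000, 3))
    print(pool_free_space_is_low(250, 12000, 3))
```

For a small single-member pool of 1,000 GB, the 5% figure (50 GB) is the lesser value and becomes the threshold instead.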
To increase free space in a storage pool, you can:
Reduce the amount of in-use storage space by deleting unused volumes.

Reduce the amount of in-use storage space by reducing the amount of snapshot reserve.
Identify large volumes that have low utilization and convert them to thin provisioned volumes.
Migrate volumes to storage pools with excess capacity.
Add additional hardware to the storage pool.
Low Performance on Thin Provisioned Volumes

If free space in a storage pool falls below 5% of total capacity (or 100 GB per member, whichever
is less), write performance for thin provisioned volumes decreases in order to reduce usage of the
declining free space. Using the options for alleviating low storage pool capacity, listed above, will
correct the problem and permit the thin provisioned volumes to once again operate at full speed.
In addition, thin provisioned volume performance decreases when the allocated space for the
volume approaches the maximum in-use space value for the volume. Increasing the maximum
in-use space value or the reported volume size, or converting the volume from a thin provisioned
volume to a fully-provisioned volume, will restore performance to normal.
Note that there must be sufficient free space to reserve the remaining unreserved volume space
when converting to a fully-provisioned volume. Do not convert a thin provisioned volume if doing
so will reduce the storage pool free space below 5%.
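The conversion check described above can be sketched as follows. The function and parameter names are hypothetical, and the values are assumed to be read from Group Manager or SANHQ; the sketch only applies the 5% rule stated in this section.

```python
def can_convert_to_full(pool_capacity_gb, pool_free_gb,
                        reported_size_gb, reserved_gb):
    """Check whether converting a thin volume keeps pool free space at 5% or more."""
    # Space that must still be reserved to back the full reported size.
    unreserved_gb = max(reported_size_gb - reserved_gb, 0)
    free_after_gb = pool_free_gb - unreserved_gb
    minimum_free_gb = pool_capacity_gb * 5 / 100
    return free_after_gb >= 0 and free_after_gb >= minimum_free_gb


if __name__ == "__main__":
    # Hypothetical 10,000 GB pool with 1,200 GB free; the volume reports
    # 1,000 GB and has 300 GB reserved, so 700 GB more must be reserved.
    print(can_convert_to_full(10000, 1200, 1000, 300))
```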
iSCSI Connections
Large, complex environments can utilize many iSCSI connections. A storage pool in a PS Series
group can support numerous simultaneous connections, as outlined in the release notes for the
particular EqualLogic firmware release in use. These connections can be used for fully-provisioned
volumes, thin provisioned volumes, and snapshots. Attempting to exceed the
supported number of connections will result in an error message.
You can reduce the number of iSCSI connections to the volumes and snapshots in a storage pool in
several ways:
Disconnect from unused volumes and snapshots.

Modify MPIO settings to reduce the number of connections per volume.
Move volumes to another storage pool.
Create a new storage pool and move volumes to the new storage pool.
MPIO Connections
MPIO provides additional performance capabilities and network path failover between servers and
volumes. For certain operating systems (Windows 2003 and 2008), the connections can be
automatically managed.
If MPIO is not creating multiple connections, you should:
Check that the storage pool does not have the maximum number of iSCSI connections for the
release in use (see the release notes).
Check the access control records for the volume. Using the iSCSI initiator name instead of an
IP address can make access controls easier to manage and more secure.
Ensure that EqualLogic MPIO extensions are properly installed on the supported operating
systems. See the EqualLogic Host Integration Tools documentation for details.
Ensure that MPIO is supported and properly configured, according to the documentation for
the operating system.
Queue Depths
Queue depth is a measure of how much work is pending for a resource, such as an array or a disk
drive.
A high queue depth might indicate that a resource is overloaded, particularly if high latencies exist.
A low queue depth might indicate that a resource has sufficient unused capacity to absorb new
workloads.
Understanding what resource has a high queue depth can be helpful in deciding what workload to
move or what type of resources should be added to the SAN. Note that queue depth reporting
requires PS Series Firmware Version 4.2 or higher.
TABLE 5: QUEUE DEPTH GUIDELINES
Observed Value | Indicative Of | Possible Corrective Actions
< 10 | Low queue depth; normal | N/A
10-25 | Moderate queue depth; normal | N/A
> 25 | High queue depth when sustained, especially if accompanied by high latency | Consider moving some workloads to another pool or adding resources if sustained
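The guidelines in Table 5 can be expressed as a small classification helper. The function name is hypothetical; the boundaries are those listed in the table, with a value of exactly 10 or 25 treated as moderate.

```python
def classify_queue_depth(depth):
    """Classify an observed queue depth using the Table 5 boundaries."""
    if depth < 0:
        raise ValueError("queue depth cannot be negative")
    if depth < 10:
        return "low"
    if depth <= 25:
        return "moderate"
    # Only a concern when sustained, especially with high latency.
    return "high"


if __name__ == "__main__":
    for sample in (3, 18, 40):
        print(sample, classify_queue_depth(sample))
```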
EXAMPLES
Troubleshooting SAN issues using SANHQ is easily demonstrated using examples. The following
examples show a variety of issues and some solutions that can be used to resolve them.
Example 1: A Stable System
The combined graphs view gives an overview of the overall health of the group. A healthy, stable
system is shown in Figure 5. Latencies are consistently below 20 ms, network bandwidth is well
below the maximum that could be sustained by multiple Gigabit Ethernet ports, and TCP
retransmits are virtually zero.
FIGURE 5: A STABLE SYSTEM
Example 2: High Latency Caused by High Random IOPS

A system overloaded with a random IOPS workload is shown in Figure 6. The PS Series group has
high IOPS with a low IO size at times when users are performing normal daily activities.
The small IO size, a 50/50 read/write ratio, and latencies that are 20 ms and above indicate that the
group performance is bound by random IOPS. The group displayed high latencies that negatively
affected application performance prior to January 31, at which time an additional array was added.
Because of the inherent scalability of a PS Series group, the online expansion and automatic
distribution of the load over more spindles occurred without any disruption to running applications.
Once the group expansion was complete, latencies dropped to acceptable levels and application
performance improved dramatically.
FIGURE 6: HIGH LATENCIES CAUSED BY HIGH RANDOM IOPS
Example 3: TCP Retransmit Errors

A group that exhibits high TCP retransmit errors is shown in Figure 7. TCP retransmit errors can
be an indicator of a number of problems, ranging from defective cables to overloaded SAN
switches or servers.
The frequency of the errors might indicate the source of the problem. If they are frequent,
regardless of load, the cause may be a bad cable, NIC, HBA, or a switch that is unable to process
traffic properly. Less frequent errors, such as those shown in Figure 7, are typical of switches or
server components that fail only under load and are harder to diagnose. Very infrequent errors
might indicate a temporary, but normal condition that may not adversely affect application
performance.
FIGURE 7: TCP RETRANSMIT ERRORS
Example 4: Network Bandwidth Saturation

A group that has saturated the available network bandwidth is shown in Figure 8. Figure 9 shows
the associated alerts.
Saturation of the network bandwidth is not a problem if it occurs during sequential operations and
if application performance remains acceptable. Resolving problems with network bandwidth might
require redistribution of the workload over multiple arrays. The EqualLogic architecture permits
the bandwidth to scale, either by enabling additional array interface ports or by adding another
array to the group and thus adding more controllers and network ports.
FIGURE 8: SATURATED NETWORK BANDWIDTH
FIGURE 9: NETWORK BANDWIDTH SATURATION ALERTS
Example 5: Sudden Rise in Capacity Used

Most groups will exhibit a rise in capacity utilization over time, due to normal activities such as
adding volumes, increasing snapshot reserve, expanding thin provisioned volumes, and other
normal storage activities. A sudden rise in in-use capacity, as shown in Figure 10, could indicate a
problem, such as an improperly sized volume (for example, you specified TB when GB was
intended) or a heavily used thin provisioned volume. Use the volume data prior to and after the
increase to diagnose the sudden increase in utilization.
FIGURE 10: SUDDEN RISE IN CAPACITY IN USE
Example 6: Low Storage Pool Free Space

As with any computing resource, high utilization is desirable for the sake of efficiency, but a small
buffer must be maintained for overhead. In the case of a PS Series group, the recommended
minimum free space in a storage pool is 5% of total capacity or 100 GB per member, whichever is
less.
A group without this buffer can have difficulty with various operations, such as snapshots, load
balancing, member removal, and replication. A low pool free space condition is indicated by a
warning message such as the one in Figure 11.
Several steps can be taken to increase free pool space, such as converting mostly empty volumes to
thin provisioning, reducing snapshot reserve for volumes that do not need the amount currently
allocated, reducing the replication reserve for a volume if it is not needed, suspending replication
for some volumes, deleting unneeded volumes, and moving volumes to another pool.
Ultimately, you might need to add more resources to the pool, either by redistributing existing
group resources or by adding an array to the pool.
FIGURE 11: LOW POOL FREE SPACE WARNING
SUMMARY
Acting as a flight data recorder for your PS Series group, SAN HeadQuarters is a powerful
monitoring and analysis tool designed to provide SAN administrators with valuable insight into the
health of their storage environment. The easy-to-use graphical interface provides information on
PS Series group capacity, IO performance, network data, member hardware and configuration, and
volume data. With the ability to show trends and export metrics for further reporting and analysis,
SAN HeadQuarters is a key component in the constant battle that administrators face daily to do
more with fewer resources.
FOR MORE INFORMATION

For detailed information about PS Series arrays, groups, and volumes, see the following
documentation:
Release Notes. Provides the latest information about PS Series arrays and groups.
Installation and Setup. Describes how to install the array hardware and configure the software.
The manual also describes how to create and connect to a volume.
Group Administration. Describes how to use the Group Manager graphical user interface
(GUI) to manage a PS Series group. This manual provides comprehensive information about
product concepts and procedures.
CLI Reference. Describes how to use the Group Manager command line interface (CLI) to
manage a PS Series group and individual arrays.
Hardware Maintenance. Describes how to maintain the array hardware. Be sure to use the
manual for your array model.
Online help. In the Group Manager GUI, expand Tools in the far left panel and then click
Online Help for help on both the GUI and the CLI.
See support.dell.com/EqualLogic and log in to your customer support site for the latest
documentation.
TECHNICAL SUPPORT AND CUSTOMER SERVICE

Dell's support service is available to answer your questions about PS Series arrays. If you have an
Express Service Code, have it ready when you call. The code helps Dell's automated support
telephone system direct your call more efficiently.
Contacting Dell
Dell provides several online and telephone-based support and service options. Availability varies
by country and product, and some services may not be available in your area.
For customers in the United States, call 800-945-3355.
Note: If you do not have an Internet connection, you can find contact information on your
purchase invoice, packing slip, bill, or Dell product catalog.
To contact Dell for sales, technical support, or customer service issues:
1. Visit support.dell.com.
2. Verify your country or region in the Choose A Country/Region drop-down menu at the bottom
of the window.
3. Click Contact Us on the left side of the window.
4. Select the appropriate service or support link based on your need.
5. Choose the method of contacting Dell that is convenient for you.
Online Services
You can learn about Dell products and services using the following procedure:
1. Visit www.dell.com (or the URL specified in any Dell product information).
2. Use the locale menu or click on the link that specifies your country or region.