TR1050
V2.1
WWW.DELL.COM/PSseries
TABLE OF CONTENTS
Revision Information
Preface
Introduction
Overview
Planning the Installation
  Monitor Service System Requirements
  Monitor Client System Requirements
  Permissions Needed to Access Group Data
  SNMP Polling
Installing SAN HeadQuarters
Adding Groups to be Monitored
  Using SANHQ on Groups with the Management Port Enabled
Information Provided by SANHQ
  Capacity
  Alerts and Events
  IO and Experimental Analysis Data
  Hardware and Firmware
  Network and Port Data
Receiving Email Alerts
Exporting Data
Archiving Data
Data Analysis
  Launching the Monitor Client for Offline Use
  Importing Data
  Understanding Your Applications and What is Normal
Solving Performance Problems
  Check for Damaged Hardware
  Check the Volume IO Latencies
  Determining Overload of SAN Resources
  Random IOPS
  Experimental Analysis and IOPS
  Group IO Load Space Distribution
  Network Performance
  Network Bandwidth
  Storage Pool Capacity
  Low Performance on Thin Provisioned Volumes
  iSCSI Connections
  MPIO Connections
  Queue Depths
Examples
  Example 1: A Stable System
  Example 2: High Latency Caused by High Random IOPS
  Example 3: TCP Retransmit Errors
  Example 4: Network Bandwidth Saturation
  Example 5: Sudden Rise in Capacity Used
  Example 6: Low Storage Pool Free Space
Summary
For More Information
Technical Support and Customer Service
REVISION INFORMATION
The following table describes the release history of this Technical Report.
Report    Date              Document Revision
1.0       September 2009    Initial Release
2.0       June 2010
2.1       November 2010
The following table shows the software and firmware used for the preparation of this Technical Report.
Vendor       Model        Software Revision
Microsoft                 Enterprise Edition
Microsoft    Windows 7    Enterprise Edition
Dell                      Version 5.0.2
Dell                      Version 3.4.2
Dell                      Version 2.1.0
The following table lists the documents referred to in this Technical Report. All PS Series Technical Reports
are available on the EqualLogic website: http://www.equallogic.com/resourcecenter/documentcenter.aspx
Vendor
Document Title
Dell
Dell
PREFACE
Thank you for your interest in Dell EqualLogic PS Series storage products. We hope you will find the PS Series
products intuitive and simple to configure and manage.
PS Series arrays optimize resources by automating volume and network load balancing. Additionally, PS Series arrays
offer all-inclusive array management software, host software, and free firmware updates. The following value-add
features and products integrate with PS Series arrays and are available at no additional cost:
Note: The highlighted text denotes the focus of this document.
Group Manager GUI: Provides a graphical user interface for managing your array.
Group Manager CLI: Provides a command line interface for managing your array.
Manual Transfer Utility (MTU): Runs on Windows and Linux host systems and enables secure transfer of
large amounts of data to a replication partner site when configuring disaster tolerance. You use portable
media to eliminate network congestion, minimize downtime, and quick-start replication.
SAN HeadQuarters (SANHQ): Provides centralized monitoring, historical performance trending, and
event reporting for multiple PS Series groups.
Firmware: Installed on each array, this software allows you to manage your storage environment and
provides capabilities such as volume snapshots, clones, and replicas to ensure data hosted on the arrays can
be protected in the event of an error or disaster.
Remote Setup Wizard (RSW): Initializes new PS Series arrays, configures host connections to PS
Series SANs, and configures and manages multipathing.
Multipath I/O Device Specific Module (MPIO DSM): Includes a connection awareness-module
that understands PS Series network load balancing and facilitates host connections to PS Series
volumes.
VSS and VDS Provider Services: Allows 3rd party backup software vendors to perform off-host
backups.
Storage Adapter for Site Recovery Manager (SRM): Allows SRM to understand and recognize PS Series
replication for full SRM integration.
Auto-Snapshot Manager/VMware Edition (ASM/VE): Integrates with VMware Virtual Center and PS
Series snapshots to allow administrators to enable Smart Copy protection of Virtual Center folders,
datastores, and virtual machines.
EqualLogic Multipathing Extension Module for VMware ESX: Provides enhancements to existing
VMware multipathing functionality.
Current Customers Please Note: You may not be running the latest versions of the tools and software listed above. If you
are under valid warranty or support agreements for your PS Series array, you are entitled to obtain the latest updates and
new releases as they become available.
To learn more about any of these products, contact your local sales representative or visit the Dell EqualLogic site at
http://www.equallogic.com. To set up a Dell EqualLogic support account to download the latest available PS Series
firmware and software kits visit: https://www.equallogic.com/secure/login.aspx?ReturnUrl=%2fsupport%2fDefault.aspx
INTRODUCTION
According to IDC and Gartner, Dell is the leader in iSCSI SAN deployments. Dell EqualLogic PS
Series storage arrays offer great benefits including:
OVERVIEW
SANHQ monitors one or more PS Series groups. The tool is a client/server application that runs
on a Microsoft Windows server and uses SNMP to query the groups. Acting like a flight data
recorder on an aircraft, SANHQ collects data over time and stores it on the server for later
retrieval and analysis. Client systems connect to the server and format and display the data in the
SANHQ GUI.
SANHQ enables the storage administrator to:
Monitor one or more PS Series groups and store operational data for up to a year
Obtain a centralized view of the health and status of multiple groups
Allow the same performance data to be viewed by multiple clients simultaneously
Monitor performance for a specific time period
Monitor and analyze capacity usage for groups
View IO rates, throughput, and latency for each volume, member, pool, or group
View estimated maximum IO capabilities
Generate alerts and email notifications based on the health status of the groups
Use the GUI or a script to archive group performance data for later offline analysis
View archived group data offline
Launch the Group Manager GUI directly from SANHQ
Use Single Sign On for quick group login
Use the built-in syslog server to consolidate all events and alerts into a single view
Generate reports, including Top10 Volume, Configuration, Performance and Alerts reports
Customize the SANHQ user interface
Of particular importance is ensuring that the SNMP community name is the same in the Monitor
Service group configuration and in the monitored group.
Optionally, you can change the default size of the log files when first setting up monitoring for a
group. Generally, the default log file size of 5 MB is an appropriate size.
Larger log files allow you to maintain more detailed data for longer periods of time, but at the
expense of disk space and potentially slower performance from the Monitor Client.
Smaller log files require less disk space and might improve Monitor Client performance, but at the
expense of more detailed data.
See the SANHQ User Guide for more details.
Using SANHQ on Groups with the Management Port Enabled
SANHQ can be used with PS Series groups that have a dedicated management network enabled.
The management IP address should be used to add a group, instead of the normal iSCSI VLAN
group IP address. Note that the Monitor Service must be able to communicate with each network
interface on each group member to properly gather data.
resemble the actual group workload. Therefore, the data should not be used as the sole measure of
group performance.
Hardware and Firmware
Because hardware failures can be a source of performance problems, SANHQ alerts you when
there is a failure.
In addition, hardware and firmware can affect performance. For example, different disk types will
have different performance characteristics. Also, if hardware is added or removed from the
environment, viewing this information at different points in time will tell you what was in use at
that time.
SANHQ tracks:
In addition, information about individual disks is available, including disk IO data and individual
disk queue depth (requires PS Series Firmware Version 4.2 or higher).
Network and Port Data
A key to understanding the overall load on the SAN is network data and member network interface
(port) data. Network retransmits are a critical indicator of network problems that can affect SAN
performance. SANHQ shows the network link throughput and high- and low-use network ports.
EXPORTING DATA
SANHQ enables you to export the data for one or more monitored groups in CSV format for use
with an external analysis tool, such as Microsoft Excel. You specify the groups and the time range
for the desired data.
See the SANHQ User Guide for details.
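As an illustration of working with exported data outside of Excel, the following Python sketch averages one numeric column per group key. The column names (Volume, AvgLatencyMs) and the sample rows are hypothetical; they are not taken from an actual SANHQ export, so check the header row of your own file before adapting it.

```python
import csv
import io
from collections import defaultdict


def mean_by_key(csv_text, key_col, value_col):
    """Return {key: mean of value_col} for CSV content supplied as text."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sums[row[key_col]] += float(row[value_col])
        counts[row[key_col]] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical sample data; a real export will have different columns.
SAMPLE = """Volume,AvgLatencyMs
vol1,10
vol1,30
vol2,5
"""

if __name__ == "__main__":
    print(mean_by_key(SAMPLE, "Volume", "AvgLatencyMs"))
```

To use it on a real export, read the file into a string (or wrap the open file in csv.DictReader directly) and pass the actual column names.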
ARCHIVING DATA
SANHQ allows you to archive data for one or more monitored groups. This enables you to use
SANHQ to view and analyze the archived data offline, without needing access to the Monitor
Service.
You can also archive data to preserve detailed data for a particular period in time.
A data archive is more valuable than exported data when working with Dell support to resolve
issues because the archive can be viewed by anyone running the Monitor Client.
DATA ANALYSIS
Analyzing data gathered by SANHQ can be a mixture of art and science. When looking for the
source of performance issues, it is important to carefully consider all the performance data before
drawing conclusions. The use of the SAN may change over time, so a tool such as SANHQ, which
provides historical perspective, is often able to provide the insight that a simple performance
snapshot cannot.
Launching the Monitor Client for Offline Use
Normally, each Monitor Client connects to the Monitor Service to obtain and format the latest
group performance data. If desired, you can archive group data for later analysis (for example, if
you do not have access to the Monitor Service).
For example, if you start SANHQ, but do not have access to the Monitor Service, simply choose
the Ignore option when launching SANHQ, as shown in Figure 2. This allows SANHQ to start
in offline mode, after which you may import archive files.
FIGURE 2: LAUNCHING THE SANHQ CLIENT FOR OFFLINE USE
Importing Data
To open an archive, select Monitor > Open Archive from the SANHQ menu bar. A new SANHQ
session will appear with the data from the selected archive file. The data can be viewed and
analyzed just like it would be if you were connected to the Monitor Service.
Typical Symptom
Malformed Packets
Detected By
Possible Corrective
Actions
Update NIC Drivers
Replace NIC
Visible Damage
Visual Inspection
Wrong Class of
Patch Cables
Malformed Packets
Defective Switch
Spontaneous Restarts
Random Lock-up
Replace Cable
Update Switch
Firmware
Replace Switch
Defective Array
Hardware
Alerts
Monitor EqualLogic
Group
Monitor SANHQ
Latency | Indicative Of | When To Be Concerned | Possible Corrective Actions
Less than 20 ms | Normal Operations | N/A | None Required
20 ms to 50 ms | Possible Overload of SAN Resources | When Condition is Sustained | Check Configuration of Server NICs and SAN Switches *** If OK *** *** If OK ***
Above 50 ms | Probable Overload of SAN Resources | When Condition is Frequently Repeated or Sustained | Check Configuration of Server NICs and SAN Switches *** If OK *** *** If OK ***
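The latency guidance above (less than 20 ms, 20 ms to 50 ms, above 50 ms) can be sketched as a small Python helper for use on exported samples. The band labels and the notion of "sustained" as a run of consecutive samples are assumptions made for illustration; SANHQ does not define a specific sample count.

```python
def classify_latency(avg_ms):
    """Map an average latency in milliseconds to one of the three bands."""
    if avg_ms < 20:
        return "normal"
    if avg_ms <= 50:
        return "possible overload"
    return "probable overload"


def is_sustained(samples_ms, threshold_ms, min_consecutive):
    """True if at least min_consecutive samples in a row reach the threshold.

    min_consecutive is a hypothetical tuning knob, not a SANHQ setting.
    """
    run = 0
    for sample in samples_ms:
        run = run + 1 if sample >= threshold_ms else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    samples = [10, 25, 30, 22, 5]
    print([classify_latency(s) for s in samples])
    print(is_sustained(samples, 20, 3))
```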
Network
Performance
Indicated By
When To Be Concerned
When Condition is
Sustained
High TCP
Retransmits and
Alerts
When Condition is
Sustained
Possible Corrective
Actions
Add Additional Hardware
to the Storage Pool
Migrate Volumes to other
Storage Pools
Unable to Support
High Throughput
for Sequential
Operations
Network
Bandwidth
High Network
Utilization Values
and Alerts
When Condition is
Sustained
Activate Additional
Network Ports (if
available)
Add Additional Hardware
to the Storage Pool
Migrate Volumes to other
Storage Pools
Overloaded
Resource
Storage Pool
Capacity
Indicated By
When To Be Concerned
When Condition is
Frequently Repeated or
Sustained
Possible Corrective
Actions
Reduce Storage Utilization
Reduce Overallocated
Snapshot Reserve Space
Convert Underutilized
Volumes to Thin
Provisioned Volumes
Migrate Volumes to other
Storage Pools
Add Additional Hardware
to the Storage Pool
Low
Performance
on Thin
Provisioned
Volumes
When Condition is
Frequently Repeated or
Sustained
Volume
Approaching
Maximum In Use
Space Values and
Alerts
iSCSI
Connections
Unable to Attach
Servers to Volumes
and Alerts
When Condition is
Frequently Repeated or
Sustained
Overloaded
Resource
Indicated By
MPIO
Connections
Unable to Establish
Multiple
Connections
When To Be Concerned
When Condition is
Encountered
Possible Corrective
Actions
Check the Number of
Active iSCSI Connections
in the Storage Pool
Check ACLs on Volumes
Use EqualLogic AutoMPIO on Supported
Operating Systems
Ensure that MPIO is
Supported and Configured
on Other Operating
Systems
High Queue
Depth
>10
When Condition is
Frequently Repeated or
Sustained, Especially if
Accompanied by High
Latency
Random IOPS
A PS Series group uses storage virtualization to distribute workloads over many disks and
generally provides much higher performance than other storage systems with a similar number of
like disks. However, the disks do have a finite ability to do work, as measured in IOPS. Faster
disks, such as SSD or those spinning at 15K RPM, are able to do more random work than slower
disks.
As with the statement "it snows in Florida," the estimated maximum IOPS number is valid under the
right conditions but is not realistic to expect in practice. Write scaling from the maximum is
greatly affected by the read/write ratio and RAID level, with RAID 10 experiencing the least
degradation and RAID 6 the most, due to the greater write penalty. When random IO is high,
latencies in the SAN are generally the best indication that the maximum IOPS have been exceeded.
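To illustrate why the read/write ratio and RAID level matter, the following Python sketch applies the commonly cited write-penalty rule of thumb. The penalty values (2 for RAID 10, 4 for RAID 5, 6 for RAID 6) and the formula are general storage-sizing conventions assumed here for illustration; they are not the method SANHQ uses to compute its estimates.

```python
# Assumed back-end disk IOs generated per host write (rule of thumb only).
WRITE_PENALTY = {"RAID 10": 2, "RAID 5": 4, "RAID 6": 6}


def host_iops(raw_disk_iops, read_fraction, raid_level):
    """Estimate host-visible IOPS from raw disk IOPS.

    Each read costs one back-end IO; each write costs WRITE_PENALTY IOs.
    """
    write_fraction = 1.0 - read_fraction
    penalty = WRITE_PENALTY[raid_level]
    return raw_disk_iops / (read_fraction + write_fraction * penalty)


if __name__ == "__main__":
    # Hypothetical pool: 1000 raw disk IOPS, 70% reads.
    for level in WRITE_PENALTY:
        print(level, round(host_iops(1000, 0.7, level)))
```

With a read-only workload the three RAID levels give the same result; as the write share grows, RAID 6 falls away from the maximum fastest and RAID 10 slowest.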
The best way to address high IO issues is to redistribute the load to other storage pools within the
group (if possible) or add PS Series arrays to the storage pool that is experiencing the high random
workload. Sequential workloads, such as backup, will generate much higher IOPS numbers than
random workloads; however, the IOPS during sequential operations are less relevant than network
bandwidth utilization, as discussed below.
When determining if high IOPS numbers are generated by random or sequential functions, consider
the applications in use at that time. For example, there will be few OLTP users active at 2:00 a.m. in
most environments, but backup is most likely occurring. Therefore, be sure to look at the IO size.
Large IOs are often indicative of sequential operations, while small IOs indicate random
operations. Latencies are usually higher if a high IOPS value is caused by random operations.
Overall, high random IOPS are not a problem if applications are performing satisfactorily.
Experimental Analysis and IOPS
SANHQ provides Experimental Analysis windows that display an estimate of the maximum
workload that can be sustained, as well as an estimate of the workload that can be sustained if a
RAID set becomes degraded due to a drive failure. There is also a related graph that shows the estimated
workload on a scale of 0% to 100%.
Note that these graphs are estimates based on a small, random IO workload, which is prevalent in a
typical business environment. Large IO sizes and more sequential workloads may not match the
estimated workload.
Consequently, it is possible for the group workload to exceed the maximum estimated IOPS. This
is not cause for concern unless it is accompanied by high latencies. Similarly, some workloads can
result in high latencies, while remaining below the estimated maximum IOPS.
Understanding what is normal for your environment will help you determine the best use of the
estimated workload graphs.
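The reasoning in this section can be sketched in Python as follows. The function name and the 20 ms default latency limit are assumptions for illustration (the limit borrows the "less than 20 ms is normal" guidance from the latency discussion); the inputs would come from values you read off the SANHQ graphs.

```python
def assess_workload(current_iops, estimated_max_iops, avg_latency_ms,
                    latency_limit_ms=20.0):
    """Return (percent of estimated maximum, verdict).

    Latency, not the percentage, decides whether there is cause for concern.
    """
    percent = 100.0 * current_iops / estimated_max_iops
    if avg_latency_ms >= latency_limit_ms:
        verdict = "investigate"
    elif percent > 100.0:
        verdict = "above estimate, but latency is acceptable"
    else:
        verdict = "ok"
    return percent, verdict


if __name__ == "__main__":
    print(assess_workload(1200, 1000, 8))   # exceeds estimate, low latency
    print(assess_workload(500, 1000, 35))   # below estimate, high latency
```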
Group IO Load Space Distribution
Starting with v2.1 of the SANHQ software and v5 of the EqualLogic firmware, SANHQ is able to
provide additional information on the IO load distribution within a group or pool. This distribution
can be helpful in determining if the IO activity observed is attributable to a relatively small amount
of the total dataset, or if it is a generally uniform distribution across the entire dataset. Knowledge
of the distribution of activity can be useful in making decisions about whether or not tiering data
for performance will be effective. If data activity is concentrated in a relatively small portion of
the capacity, providing higher performing media for that capacity may prove effective in increasing
the performance of an application. For example, many databases have a core portion of the data
that is frequently accessed, such as indexes and reference data used by all users. Accelerating this
portion of the data will often improve the overall performance of the database. The
Group IO Load Space distribution graph shows the amount of data that falls into one of three
categories: high IO, medium IO and low IO.
In addition, if SSD media is present in the EqualLogic group, the Group IO Load Space
distribution graph will reflect this capacity relative to the amount of very active data. And if one of
the EqualLogic tiered array models is present (PS6000XVS or PS6010XVS) an additional graph
appears to demonstrate how much of the Enhanced Write Cache unique to these array models is in
use as shown below in Figure 4.
FIGURE 4: GROUP I/O LOAD SPACE DISTRIBUTION AND ENHANCED WRITE CACHE USAGE
Network Performance
Network performance is dependent on a number of components working in conjunction with each
other.
The most critical function that affects network performance is Flow Control, which allows network
devices to signal the next device that the data stream should be reduced to prevent dropped packets
and retransmissions.
Flow Control is typically disabled by default and must be enabled on both the server NICs and the
network switches in order to be effective. Consult your switch vendor or server NIC driver
documentation to determine how to configure Flow Control. If Flow Control cannot be configured
on a network device such as a NIC or switch, either upgrade the device to a version of firmware
that supports Flow Control or replace the device with one that does. Flow Control is automatically
supported by PS Series arrays.
The second item that must be properly configured is Jumbo Frames. Jumbo Frames use a larger
frame size than standard Ethernet frames and allow large amounts of data to be efficiently
transmitted between the server and storage.
In environments with small average IO sizes, Jumbo Frames provide limited benefits. In general,
Jumbo Frames support is disabled by default on switches and server NICs. Enabling Jumbo
Frames requires that the switch use a VLAN other than the default VLAN (usually VLAN 1). PS Series
arrays will automatically negotiate the use of Jumbo Frames when the iSCSI connection is
established by the server. Consult your switch vendor or server NIC driver documentation to
determine if Jumbo Frames can be configured.
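A rough way to see the effect of frame size is to count the frames needed to carry one IO. The Python sketch below assumes a 1500-byte standard MTU, a 9000-byte jumbo MTU, IPv4 and TCP headers without options, and the usual per-frame Ethernet overhead; iSCSI header bytes are ignored for simplicity, so treat the numbers as approximations.

```python
import math

IP_TCP_HEADERS = 40     # IPv4 (20 bytes) + TCP (20 bytes), no options
ETHERNET_OVERHEAD = 38  # header 14 + FCS 4 + preamble/SFD 8 + inter-frame gap 12


def frames_needed(io_bytes, mtu):
    """Number of Ethernet frames needed to carry io_bytes of payload."""
    payload_per_frame = mtu - IP_TCP_HEADERS
    return math.ceil(io_bytes / payload_per_frame)


def wire_efficiency(mtu):
    """Fraction of on-the-wire bytes that is payload for a full-size frame."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)


if __name__ == "__main__":
    for io_size in (4096, 65536):
        print(io_size, frames_needed(io_size, 1500), frames_needed(io_size, 9000))
    print(round(wire_efficiency(1500), 3), round(wire_efficiency(9000), 3))
```

Under these assumptions a 64 KB IO takes 45 standard frames but only 8 jumbo frames, which mainly reduces per-frame processing on NICs and switches.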
Note that some network devices run more slowly with Jumbo Frames enabled, do not properly
support Jumbo Frames, or cannot support them simultaneously with Flow Control. In these cases,
Jumbo Frames should be disabled, or the switches should be upgraded or replaced.
When attempting to troubleshoot network performance problems, disable Jumbo Frames and
determine whether the network is performing properly with standard Ethernet frames.
Another area that can cause problems is a lack of receive buffers. Low-end switches often have
limited memory and suffer from performance issues related to insufficient buffers. A
recommended buffer level in switches is 1 MB per port. Dedicated buffers are preferred to shared
buffers.
In addition, server performance can often be improved by increasing the number of buffers
allocated to the server NICs. Consult your switch vendor or server NIC driver documentation to
determine if you can increase the buffers.
Network Bandwidth
Network bandwidth may become fully utilized during highly sequential operations, such as backup.
This is not indicative of a problem in most cases but simply a case of a fully utilized system.
Using all the available bandwidth on one or more member Ethernet interfaces will generate an
alert.
Make sure you connect and enable all the member Ethernet interfaces to maximize the available
SAN bandwidth.
If all interfaces are enabled, but bandwidth is still insufficient, increasing the number of arrays in
the storage pool may provide additional throughput, if the servers have not reached their maximum
bandwidth.
If only one interface is completely utilized on a member or on a server with multiple NICs, ensure
that MPIO is properly configured. If all of the server NICs exceed their capacity (use host based
tools to determine this), but the PS Series group has excess network capacity, add additional server
NICs. Also, configure MPIO for those operating systems that support MPIO.
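The checks described above can be sketched in Python. The sketch assumes Gigabit Ethernet ports (a theoretical 125 MB/s each, using decimal megabytes) and a hypothetical 90% cutoff for calling an interface saturated; neither the cutoff nor the function names come from SANHQ.

```python
def port_utilization(throughput_mb_per_s, link_gbps=1.0):
    """Fraction of theoretical link capacity in use (1 Gbps = 125 MB/s)."""
    capacity_mb_per_s = link_gbps * 1000 / 8
    return throughput_mb_per_s / capacity_mb_per_s


def diagnose(utilizations, saturated=0.9):
    """Suggest a next step from per-interface utilization fractions.

    The 0.9 cutoff is an assumption chosen for illustration.
    """
    busy = [u >= saturated for u in utilizations]
    if busy and all(busy):
        return "all interfaces saturated"
    if any(busy):
        return "uneven load: check MPIO configuration"
    return "bandwidth available"


if __name__ == "__main__":
    ports = [port_utilization(t) for t in (119.0, 12.5)]
    print(ports, diagnose(ports))
```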
Storage Pool Capacity
Low storage pool capacity is a problem that generates an alert in SANHQ. If a pool has less than
5% free space (or less than 100 GB per member, whichever is less), a PS Series group may not
have sufficient free space to efficiently perform the virtualization functions required for automatic
optimization of the SAN. In addition, when storage pool free space is low, write performance on
thin provisioned volumes is automatically reduced in order to slow the consumption of free space.
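The free-space rule can be written as a short Python check. This sketch reads "whichever is less" as taking the smaller of 5% of the pool capacity and 100 GB multiplied by the number of members; that interpretation and the function names are assumptions, so confirm the exact rule for your firmware release.

```python
def low_free_space_threshold_gb(pool_capacity_gb, member_count):
    """Smaller of 5% of pool capacity and 100 GB per member."""
    return min(0.05 * pool_capacity_gb, 100.0 * member_count)


def is_pool_low(free_gb, pool_capacity_gb, member_count):
    """True when free space is below the low-space threshold."""
    return free_gb < low_free_space_threshold_gb(pool_capacity_gb, member_count)


if __name__ == "__main__":
    # Hypothetical pool: 10,000 GB across two members.
    print(low_free_space_threshold_gb(10000, 2))
    print(is_pool_low(150, 10000, 2))
```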
MPIO Connections
MPIO provides additional performance capabilities and network path failover between servers and
volumes. For certain operating systems (Windows 2003 and 2008), the connections can be
automatically managed.
If MPIO is not creating multiple connections, you should:
Check that the storage pool does not have the maximum number of iSCSI connections for the
release in use (see the release notes).
Check the access control records for the volume. Using the iSCSI initiator name instead of an
IP address can make access controls easier to manage and more secure.
Ensure that EqualLogic MPIO extensions are properly installed on the supported operating
systems. See the EqualLogic Host Integration Tools documentation for details.
Ensure that MPIO is supported and properly configured, according to the documentation for
the operating system.
Queue Depths
Queue depth is a measure of how much work is pending for a resource, such as an array or a disk
drive.
A high queue depth might indicate that a resource is overloaded, particularly if high latencies exist.
A low queue depth might indicate that a resource has sufficient unused capacity to absorb new
workloads.
Understanding what resource has a high queue depth can be helpful in deciding what workload to
move or what type of resources should be added to the SAN. Note that queue depth reporting
requires PS Series Firmware Version 4.2 or higher.
Indicative Of
When To Be Concerned
Possible Corrective
Actions
< 10
Normal
N/A
10-25
Moderate queue
depth
Normal
N/A
>25
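The queue depth bands in the table above (below 10, 10 to 25, above 25) can be expressed as a small Python helper. The band labels are assumptions, and the overload check simply combines a queue depth above 10 with a caller-supplied high-latency flag, following the guidance that a high queue depth matters most when latencies are also high.

```python
def queue_depth_band(depth):
    """Classify a queue depth into the three bands used in the table."""
    if depth < 10:
        return "low"
    if depth <= 25:
        return "moderate"
    return "high"


def overload_suspected(depth, high_latency):
    """True when queue depth exceeds 10 and latency is also high.

    high_latency is decided by the caller, for example from latency graphs.
    """
    return depth > 10 and high_latency


if __name__ == "__main__":
    for depth in (4, 18, 40):
        print(depth, queue_depth_band(depth), overload_suspected(depth, True))
```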
EXAMPLES
Troubleshooting SAN issues using SANHQ is easily demonstrated using examples. The following
examples show a variety of issues and some solutions that can be used to resolve them.
Example 1: A Stable System
The combined graphs view gives an overview of the overall health of the group. A healthy, stable
system is shown in Figure 5. Latencies are consistently below 20 ms, network bandwidth is well
below the maximum that could be sustained by multiple Gigabit Ethernet ports, and TCP
retransmits are virtually zero.
FIGURE 5: A STABLE SYSTEM
SUMMARY
Acting as a flight data recorder for your PS Series group, SAN HeadQuarters is a powerful
monitoring and analysis tool designed to provide SAN administrators with valuable insight into the
health of their storage environment. The easy-to-use graphical interface provides information on
PS Series group capacity, IO performance, network data, member hardware and configuration, and
volume data. With the ability to show trends and export metrics for further reporting and analysis,
SAN HeadQuarters is a key component in the constant battle that administrators face daily to do
more with fewer resources.
Release Notes. Provides the latest information about PS Series arrays and groups.
Installation and Setup. Describes how to install the array hardware and configure the software.
The manual also describes how to create and connect to a volume.
Group Administration. Describes how to use the Group Manager graphical user interface
(GUI) to manage a PS Series group. This manual provides comprehensive information about
product concepts and procedures.
CLI Reference. Describes how to use the Group Manager command line interface (CLI) to
manage a PS Series group and individual arrays.
Hardware Maintenance. Describes how to maintain the array hardware. Be sure to use the
manual for your array model.
Online help. In the Group Manager GUI, expand Tools in the far left panel and then click
Online Help for help on both the GUI and the CLI.
See support.dell.com/EqualLogic and log in to your customer support site for the latest
documentation.
Contacting Dell
Dell provides several online and telephone-based support and service options. Availability varies
by country and product, and some services may not be available in your area.
For customers in the United States, call 800-945-3355.
Note: If you do not have an Internet connection, you can find contact information on your
purchase invoice, packing slip, bill, or Dell product catalog.
To contact Dell for sales, technical support, or customer service issues:
1. Visit support.dell.com.
2. Verify your country or region in the Choose A Country/Region drop-down menu at the bottom
of the window.
3. Click Contact Us on the left side of the window.
4. Select the appropriate service or support link based on your need.
5. Choose the method of contacting Dell that is convenient for you.
Online Services
You can learn about Dell products and services using the following procedure:
1. Visit www.dell.com (or the URL specified in any Dell product information).
2. Use the locale menu or click on the link that specifies your country or region.