
vSphere 5.x support with NetApp MetroCluster


http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2031038

Purpose
This article provides information about deploying a vSphere Metro Storage Cluster (vMSC) across two
datacenters or sites using the NetApp MetroCluster solution with vSphere 5.0, 5.1, or 5.5. For ESXi 5.0, 5.1,
or 5.5, the article applies to FC, iSCSI, and NFS implementations of both Stretch and Fabric MetroCluster.
Resolution

What is vMSC?
vSphere Metro Storage Cluster (vMSC) is a new certified configuration for NetApp MetroCluster storage
architectures. A vMSC configuration is designed to maintain data availability beyond a single physical or
logical site. A storage device configured for vMSC is supported after successful vMSC certification. All
supported storage devices are listed in the VMware Storage Compatibility Guide.

What is a NetApp MetroCluster?


NetApp MetroCluster is a synchronous replication solution between two NetApp controllers that provides
storage high availability and disaster recovery in a campus or metropolitan area. A MetroCluster (MC)
configuration consists of two NetApp controllers clustered together, residing either in the same data
center or in two different physical locations. MC handles any single failure in the storage configuration,
and certain multiple failures, without disrupting data availability, and it provides single-command
recovery in case of a complete site disaster.

What is MetroCluster TieBreaker?


The MetroCluster TieBreaker (MCTB) solution is a plug-in that runs in the background as a Windows service
or Unix daemon on an OnCommand Unified Manager (OC UM) host. The OC UM host can be a physical
machine or a virtual machine. MCTB provides automated failover for a MetroCluster solution in scenarios
where the controllers' built-in automatic failover is not possible, such as an entire site failure.
MCTB continuously monitors the MetroCluster controllers and the corresponding network gateways from
an OnCommand server at a third location. When MCTB detects conditions that require a Cluster Failover
on Disaster (CFOD), it issues the necessary commands to initiate the CFOD. Log messages and
OnCommand events are generated as needed to keep the operator informed about the state of the
MetroCluster and MCTB.
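
To make the decision logic concrete, here is a minimal sketch (not the NetApp TieBreaker itself) of a third-site monitor written in Python: it treats a site as failed, and therefore as a CFOD candidate, only when both that site's controller and its network gateway stop responding. The addresses, the ping-based reachability test (Linux ping syntax), and the polling interval are assumptions made for the example.

```python
import subprocess
import time

# Hypothetical management addresses used for this example only.
SITES = {
    "Site 1": {"controller": "10.0.1.10", "gateway": "10.0.1.1"},
    "Site 2": {"controller": "10.0.2.10", "gateway": "10.0.2.1"},
}


def reachable(address, count=3, timeout_s=2):
    """Return True if the address answers at least one ICMP echo (Linux ping syntax)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def site_down(site):
    """Treat a site as failed only when its controller AND its gateway are unreachable."""
    return not reachable(site["controller"]) and not reachable(site["gateway"])


while True:
    for name, site in SITES.items():
        if site_down(site):
            # The real TieBreaker would issue the CFOD commands against the
            # surviving controller here; this sketch only reports the condition.
            print(f"CFOD condition detected for {name}: controller and gateway unreachable")
    time.sleep(30)  # polling interval chosen arbitrarily for the example
```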

Configuration Requirements
These requirements must be satisfied to support this configuration:

- For distances under 500 m, a Stretch MetroCluster configuration can be used. For distances over 500 m and up to 160 km, on systems running Data ONTAP 8.1.1, a Fabric MetroCluster configuration can be used.
- The maximum round-trip latency between the two sites must be less than 10 ms for Ethernet networks and less than 3 ms for SyncMirror replication (a rough latency check is sketched after this list).
- The storage network must provide a minimum of 1 Gbps of throughput between the two sites for ISL connectivity.
- ESXi hosts in the vMSC configuration should be configured with at least two different IP networks: one for storage, and the other for management and virtual machine traffic. The storage network handles NFS and iSCSI traffic between the ESXi hosts and the NetApp controllers. The second network (VM network) supports virtual machine traffic as well as management functions for the ESXi hosts. End users can choose to configure additional networks for other functionality such as vMotion and Fault Tolerance; VMware recommends this as a best practice, but it is not a strict requirement for a vMSC configuration.
- FC switches are used in vMSC configurations where datastores are accessed via the FC protocol, and ESX management traffic is carried on an IP network. End users can choose to configure additional networks for other functionality such as vMotion and Fault Tolerance; this is recommended as a best practice but is not a strict requirement for a vMSC configuration.
- For NFS/iSCSI configurations, a minimum of two uplinks per controller must be used, and an interface group (ifgroup) should be created from the two uplinks in a multimode configuration.
- The VMware datastores and NFS volumes configured for the ESX servers must be provisioned on mirrored aggregates.
- vCenter Server must be able to connect to the ESX servers at both sites.
- An HA cluster must not exceed 32 hosts.
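
As a rough illustration of validating the latency requirement above, the following Python sketch measures TCP connect times to hypothetical controller addresses at both sites and compares the median against the 10 ms round-trip limit. It is only an approximation; a real deployment would verify latency and throughput with the network team's own tooling.

```python
import socket
import statistics
import time

# Hypothetical controller management addresses; any TCP port the target listens on will do.
TARGETS = {"Site 1 controller": "10.0.1.10", "Site 2 controller": "10.0.2.10"}
PORT = 443
SAMPLES = 10
MAX_RTT_MS = 10.0  # Ethernet round-trip requirement from the list above


def tcp_rtt_ms(host, port, timeout=2.0):
    """Time one TCP connection setup in milliseconds (a rough stand-in for round-trip latency)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


for name, host in TARGETS.items():
    rtts = [tcp_rtt_ms(host, PORT) for _ in range(SAMPLES)]
    median = statistics.median(rtts)
    status = "OK" if median <= MAX_RTT_MS else "EXCEEDS 10 ms LIMIT"
    print(f"{name}: median RTT {median:.2f} ms ({status})")
```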

Notes:

- A MetroCluster TieBreaker machine should be deployed at a third site and must be able to access the storage controllers at Site 1 and Site 2 to initiate a CFOD in case of an entire site failure.
- vMSC certification testing was conducted on vSphere 5.0 with NetApp Data ONTAP 8.1 operating in 7-Mode. For ESXi 5.5, vMSC certification testing was successfully completed on vSphere 5.5 with NetApp Data ONTAP 8.2 operating in 7-Mode.
- For more information on NetApp MetroCluster design and implementation, see the NetApp Technical Report, Best Practices for MetroCluster Design and Implementation. For information about NetApp in a vSphere environment, see NetApp Storage Best Practices for VMware vSphere.

Solution Overview
The NetApp Unified Storage Architecture offers an agile and scalable storage platform. All NetApp
storage systems use the Data ONTAP operating system to provide SAN (FC, iSCSI) and NFS storage.
MetroCluster leverages NetApp HA cluster failover (CFO) functionality to automatically protect against controller failures.
Additionally, MetroCluster layers local SyncMirror, cluster failover on disaster (CFOD), hardware
redundancy, and geographical separation to achieve extreme levels of availability. Local SyncMirror
synchronously mirrors data across the two halves of the MetroCluster configuration by writing data to
two plexes: the local plex (on the local shelf), which actively serves data, and the remote plex (on the remote
shelf), which normally does not serve data. On a local shelf failure, the remote shelf seamlessly takes over
data-serving operations. No data loss occurs because of the synchronous mirroring. Hardware redundancy is in
place for all MetroCluster components: controllers, storage, cables, switches (Fabric MetroCluster),
and adapters are all redundant.
A VMware HA/DRS cluster is created across the two sites using ESXi 5.x hosts and managed by vCenter
Server 5.x. The vSphere Management, vMotion, and virtual machine networks are connected using a
redundant network between the two sites. It is assumed that the vCenter Server managing the HA/DRS
cluster can connect to the ESXi hosts at both sites.
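
That last assumption can be spot-checked through the vSphere API. The sketch below uses pyVmomi with hypothetical vCenter credentials and simply lists every host in each cluster together with the connection state vCenter reports for it; it is an illustration, not part of the certified configuration.

```python
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical connection details used for this example only.
si = SmartConnect(
    host="vcenter.example.com",
    user="administrator@vsphere.local",
    pwd="password",
    sslContext=ssl._create_unverified_context(),
)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True
    )
    for cluster in view.view:
        print(f"Cluster: {cluster.name}")
        for host in cluster.host:
            # connectionState is 'connected', 'disconnected', or 'notResponding'.
            print(f"  {host.name}: {host.runtime.connectionState}")
finally:
    Disconnect(si)
```
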
Based on the distance considerations, NetApp MetroCluster can be deployed in two different
configurations:

- Stretch MetroCluster
- Fabric MetroCluster

Stretch MetroCluster
This is a Stretch MetroCluster configuration:

Fabric MetroCluster
This is a Fabric MetroCluster configuration:

Note: These illustrations are simplified representations and do not show the redundant front-end
components, such as Ethernet and Fibre Channel switches.
The vMSC configuration used in this certification program was configured with Uniform Host Access
mode. In this configuration, the ESX hosts from a single site are configured to access storage from both
sites.
In cases where RDMs are configured for virtual machines residing on NFS volumes, a separate LUN must
be configured to hold the RDM mapping files. Ensure you present this LUN to all the ESX hosts.
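
Uniform host access can be spot-checked in a similar way: if every ESXi host in the stretched cluster is presented the storage from both sites, including any LUN that holds RDM mapping files, all hosts should report the same datastore set. The Python sketch below assumes an existing pyVmomi session (si, as in the earlier example) and flags any host that is missing datastores seen elsewhere in the cluster.

```python
from pyVmomi import vim

# Assumes an existing pyVmomi session 'si', as in the earlier sketch.
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True
)

for cluster in view.view:
    per_host = {h.name: {ds.name for ds in h.datastore} for h in cluster.host}
    all_datastores = set.union(*per_host.values()) if per_host else set()
    print(f"Cluster: {cluster.name}")
    for host_name, datastores in sorted(per_host.items()):
        missing = all_datastores - datastores
        if missing:
            print(f"  {host_name} is missing: {sorted(missing)}")
        else:
            print(f"  {host_name} sees all {len(all_datastores)} datastores")
```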

vMSC test scenarios


These are the vMSC test scenarios, with the observed NetApp controller behavior and VMware HA behavior for each:

Scenario: Controller single path failure
NetApp Controllers Behavior: Controller path failover occurs; all LUNs and volumes remain connected. For FC datastores, path failover is triggered from the host and the next available path to the same controller becomes active. All ESXi iSCSI/NFS sessions remain active in multimode configurations of two or more network interfaces.
VMware HA Behavior: No impact.

Scenario: ESXi single storage path failure
NetApp Controllers Behavior: No impact on LUN and volume availability. The ESXi storage path fails over to the alternative path, and all sessions remain active.
VMware HA Behavior: No impact.

Scenario: Site 1 controller failure
NetApp Controllers Behavior: LUN availability remains unaffected. FC datastores fail over to the alternate available path of the surviving controller. ESXi iSCSI sessions affected by the node failure fail over to the surviving controller. ESXi NFS volumes continue to be accessible through the surviving node.
VMware HA Behavior: No impact.

Scenario: Site 2 controller failure
NetApp Controllers Behavior: LUN availability remains unaffected. FC datastores fail over to the alternate available path of the surviving controller. ESXi iSCSI sessions affected by the node failure fail over to the surviving controller. ESXi NFS volumes continue to be accessible through the surviving node.
VMware HA Behavior: No impact.

Scenario: MCTB VM failure
NetApp Controllers Behavior: No impact on LUN and volume availability. All sessions remain active.
VMware HA Behavior: No impact.

Scenario: MCTB VM single link failure
NetApp Controllers Behavior: No impact. Controllers continue to function normally.
VMware HA Behavior: No impact.

Scenario: Complete Site 1 failure, including ESXi and controller
NetApp Controllers Behavior: LUN and volume availability remains unaffected. FC datastores fail over to the alternate available path of the surviving controller. iSCSI sessions to surviving ESXi nodes remain active. After the failed controller comes back online and giveback is initiated, all affected aggregates resync automatically.
VMware HA Behavior: Virtual machines on the failed Site 1 ESXi nodes fail. HA restarts the failed virtual machines on ESXi hosts at Site 2.

Scenario: Complete Site 2 failure, including ESXi and controller
NetApp Controllers Behavior: LUN and volume availability remains unaffected. FC datastores fail over to the alternate available path of the surviving controller. iSCSI sessions to surviving ESXi nodes remain active. After the failed controller comes back online and giveback is initiated, all affected aggregates resync automatically.
VMware HA Behavior: Virtual machines on the failed Site 2 ESXi nodes fail. HA restarts the failed virtual machines on ESXi hosts at Site 1.

Scenario: Single ESXi failure (shutdown)
NetApp Controllers Behavior: No impact. Controllers continue to function normally.
VMware HA Behavior: Virtual machines on the failed ESXi node fail. HA restarts the failed virtual machines on surviving ESXi hosts.

Scenario: Multiple ESXi host management network failure
NetApp Controllers Behavior: No impact. Controllers continue to function normally.
VMware HA Behavior: A new master is selected within the network partition. Virtual machines remain running; there is no need to restart them.

Scenario: Site 1 and Site 2 simultaneous failure (shutdown) and restoration
NetApp Controllers Behavior: Controllers boot up and resync. All LUNs and volumes become available, all iSCSI sessions and FC paths to the ESXi hosts are re-established, and virtual machines restart successfully. As a best practice, power on the NetApp controllers first and allow the LUNs/volumes to become available before powering on the ESXi hosts.
VMware HA Behavior: No impact.

Scenario: ESXi management network all ISL links failure
NetApp Controllers Behavior: No impact to the controllers. LUNs and volumes remain available.
VMware HA Behavior: If the HA host isolation response is set to Leave Powered On, virtual machines at each site continue to run because the storage heartbeat is still active. Partitioned hosts at the site that does not have a Fault Domain Manager elect a new master.

Scenario: All storage ISL links failure
NetApp Controllers Behavior: No impact to the controllers. LUNs and volumes remain available. When the ISL links come back online, the aggregates resync.
VMware HA Behavior: No impact.

Scenario: System Manager (management server) failure
NetApp Controllers Behavior: No impact. Controllers continue to function normally. The NetApp controllers can be managed using the command line.
VMware HA Behavior: No impact.

Scenario: vCenter Server failure
NetApp Controllers Behavior: No impact. Controllers continue to function normally.
VMware HA Behavior: No impact on HA; however, DRS rules cannot be applied.
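
Several of the HA behaviors above depend on the cluster's HA settings, in particular the host isolation response. The hedged pyVmomi sketch below assumes an existing session (si, as in the earlier examples) and only prints the current values; verify against the vSphere API documentation which isolation response string corresponds to Leave Powered On in the release you are running.

```python
from pyVmomi import vim

# Assumes an existing pyVmomi session 'si', as in the earlier sketches.
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True
)

for cluster in view.view:
    das = cluster.configurationEx.dasConfig
    defaults = das.defaultVmSettings
    print(f"Cluster: {cluster.name}")
    print(f"  HA enabled:         {das.enabled}")
    print(f"  Host monitoring:    {das.hostMonitoring}")
    # isolationResponse is an API enum string; check which value your vSphere
    # release uses for the 'Leave Powered On' setting before relying on it.
    print(f"  Isolation response: {defaults.isolationResponse if defaults else 'not set'}")
```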
