Module Objectives
By the end of this module, you should be able to:
Describe the high-availability solutions
Discuss how high availability increases the reliability of storage
Define the high-availability controller configuration
Describe the three modes of high-availability operation with a high-availability pair
Analyze the effect on client protocols during failover and giveback operations
Loss of Cable
Loss of a cable between the storage system and a shelf can be overcome by shelf multipathing
Shelf Multipathing
[Diagram: a storage controller cabled with two paths to the A and B ESH4 modules of a disk shelf]
Adding a second cable provides availability even if a single cable goes bad.
Loss of Shelf
Loss of a shelf can be overcome by implementing SyncMirror:
An implementation in which the aggregate's RAID groups are mirrored across two plexes
SyncMirror
SyncMirror may be configured:
On a standalone storage system or, most commonly, in a high-availability pair (discussed next)
[Diagram: a mirrored aggregate with two plexes, each holding an identical copy of /vol/vol0 and its /etc directory]
SyncMirror (Cont.)
To control which disks are in which pool:
With software disk ownership, administrators may assign disks to pools by using the disk assign command:
system> disk assign {disk_list ...} [-p pool]
disk_list is the disk IDs of the unassigned disks
pool is either 0 or 1
system> disk assign 0a.21 -p 1
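To verify the result, the disk ownership listing shows the pool for each disk. A sketch only; the exact columns and output vary by Data ONTAP release:

system> disk show -v
  DISK     OWNER     POOL    SERIAL NUMBER
  0a.21    system    Pool1   ...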
[Diagram: a storage controller cabled to two disk shelves, one assigned to Pool0 and one to Pool1]
SyncMirror (Cont.)
To implement SyncMirror:
Add the syncmirror_local license (the license is available at no cost)
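With the license in place, a mirrored aggregate can be created, or an existing aggregate can be mirrored. A minimal sketch, assuming spare disks are available in both pools; the aggregate names are hypothetical:

system> aggr create aggrA -m 4
(creates a new mirrored aggregate with two disks in each plex)
system> aggr mirror aggrB
(adds a second plex to the existing aggregate aggrB)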
SyncMirror (Cont.)
If a shelf goes bad, such as a shelf in pool0, the data is still available from pool1
Nondisruptive shelf replacement (NDSR) is now available in Data ONTAP 7.3.2 and later
[Diagram: the Pool0 shelf has failed (marked with an X); data remains available from the Pool1 plex]
If multiple disks fail in an aggregate, the data is still available by way of the alternate pool.
SyncMirror (Cont.)
Administrators may perform additional maintenance of the mirror, such as:
Splitting the mirrored aggregate
Rejoining a split aggregate
Removing a plex from a mirrored aggregate
Comparing the plexes of a mirrored aggregate
NOTE: For more information about SyncMirror, please see the High Availability Web-based courses
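These maintenance tasks map to aggr subcommands in Data ONTAP 7-mode. A sketch using a hypothetical mirrored aggregate aggrA:

system> aggr split aggrA/plex0 aggrNew
(splits the mirror; plex0 becomes the independent aggregate aggrNew)
system> aggr mirror aggrA -v aggrNew
(rejoins the split aggregate aggrNew to aggrA as its second plex)
system> aggr destroy aggrA/plex1
(removes one plex from the mirrored aggregate)
system> aggr verify start aggrA
(compares the plexes of the mirrored aggregate)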
Loss of Controller
Loss of a controller may be overcome by configuring a high-availability pair
[Diagram: a high-availability pair — two storage controllers connected to each other, each cabled to its own and to its partner's disk shelves]
Each controller is:
Connected to its own disk shelves
Connected to the other controller's disk shelves
The storage controllers are connected to each other
If a storage controller fails, the surviving partner serves the data of the failed controller
2009 NetApp. All rights reserved.
High-Availability Features
[Diagram: high-availability pair hardware layout with redundant power and shelf connections]
Partner Communication
In a high-availability (HA) controller configuration, partners communicate through the interconnect with a heartbeat.
System state is written to disk in a mailbox. Data not yet committed to disk is written to both the local and the partner nonvolatile RAM (NVRAM).
Configuring High-Availability
License the high-availability service called cf:
system> license add xxxxxx
Reboot:
system> reboot
Settings for:
date, rdate
NDMP (on or off)
route
Time zone
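Taken together, a minimal configuration sequence on each node might look like the following. The license code is a placeholder, and the cf status output is indicative only:

system> license add XXXXXXX
system> reboot
system> cf enable
system> cf status
Cluster enabled, system2 is up.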
Normal Operation
[Diagram: normal operation — each controller in the HA pair serves its own disk shelves]
Takeover Operation
system> cf takeover
[Diagram: takeover — system has taken over system2 and serves both controllers' disk shelves]
The surviving partner has two identities, with each identity able to access only the appropriate volumes and networks.
You can access the failed node by using console commands.
Takeover Events
Takeover occurs on the following events:
A node undergoes a software or system failure that leads to a panic
A node undergoes a system failure (for example, a loss of power) and cannot reboot
There is a mismatch between the disks that one node believes it owns and the disks that the other node believes it owns
One or more network interfaces that are configured to support failover become unavailable
A node cannot send heartbeat messages to its partner, and no other mechanism is available
A node is halted with the halt command
A takeover is manually initiated
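A manual takeover is issued from the node that will survive. A brief sketch; the -f option forces the takeover even when Data ONTAP would otherwise refuse it, so use it with care:

system> cf takeover
system> cf takeover -f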
The partner Command
To access the failed storage controller:
system(takeover)> partner
system2/system>
The prompt shows the failed controller and the takeover controller. Enter partner again to return:
system2/system> partner
system(takeover)>
Giveback Operation
system> cf giveback
[Diagram: giveback — each controller again serves its own disk shelves]
The cf giveback command terminates the emulated node
The failed node resumes normal operation
The high-availability configuration resumes normal operation
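A typical giveback sequence from the console might look like this; the messages are indicative only and vary by release:

system(takeover)> cf status
system has taken over system2.
system(takeover)> cf giveback
system> cf status
Cluster enabled, system2 is up.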
The storage system must be removed from System Manager and then re-added after HA is configured
Reboot:
system> reboot system2> reboot
Check status:
system> cf status
Add one of the storage systems to System Manager and the partner is automatically identified
HA Configuration Problems
Do the same task for the other partner; remember to enable the interface
To perform a giveback
Giveback complete
Best Practices
Test failover and giveback operations before placing high-availability controllers into production
Monitor:
Network performance
Performance of disks and storage shelves
CPU utilization of each controller, to ensure that it does not exceed 50%
Enable AutoSupport
NOTE: For more information about high availability, please see the High Availability Web-based course
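CPU utilization can be spot-checked from the console with the sysstat command. A sketch taking samples at one-second intervals:

system> sysstat 1
(press Ctrl-C to stop sampling; the CPU column should stay below 50% on each controller)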
Loss of Building
Loss of an entire building can be overcome by implementing stretch MetroCluster
Stretch MetroCluster
[Diagram: stretch MetroCluster — one controller in Building 1, its partner in Building 2, each with mirrored disk shelves]
Stretch MetroCluster extends high availability to distances of up to 300 m. See the High Availability Web-based course for more information.
Loss of Site
Loss of a site can be overcome by implementing fabric-attached MetroCluster
Fabric-Attached MetroCluster
[Diagram: fabric-attached MetroCluster — controllers at Site 1 and Site 2 connected through Fibre Channel switches over an ISL trunk]
Fabric-attached MetroCluster extends high availability to distances of up to 100 km. See the High Availability Web-based course for more information.
Loss of Region
Loss of a region can be overcome by implementing SnapMirror
SnapMirror
[Diagram: SnapMirror — a source system in Region 1 replicating to a destination system in Region 2]
SnapMirror allows mirroring of volumes or qtrees. See the NetApp Protection Software Administration ILT course for more information.
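A volume SnapMirror relationship between the two regions can be sketched as follows. The system and volume names are hypothetical; the destination volume must be restricted before initialization, and the commands are run on the destination system:

region2> vol restrict vol1_mirror
region2> snapmirror initialize -S region1:vol1 region2:vol1_mirror
region2> snapmirror status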
Possibilities Abound
These high-availability techniques do not have to be used in isolation; often they are combined.
[Diagram: combined configuration — a high-availability pair (system and partner) with multipath cabling to mirrored disk shelves]
Module Summary
In this module, you should have learned to:
Describe the high-availability solutions
Discuss how high availability increases the reliability of storage
Define the high-availability controller configuration
Describe the three modes of high-availability operation with a high-availability pair
Analyze the effect on client protocols during failover and giveback operations
Exercise
Module 13: High Availability
Estimated Time: 30 minutes