40% of companies that suffer a massive data loss will never reopen (1).
93% of companies that lost their data center for 10 days or more due to a disaster filed for bankruptcy within one year of the disaster (2).
Reference: (1) Disaster Recovery Plans and Systems Are Essential, Gartner Group, 2001
Reference: (2) US National Archives and Records Administration
(Diagram: recovery continuum from High Availability to Metro Distance Recovery and Global Distance Recovery, trading data loss against compliance requirements.)
Single system failure or corruption:
Human error
Software error
Component failures
Single system failures

Local disaster:
Human error
Electric grid failure
HVAC or power failures
Burst water pipe
Building fire
Architectural failures
Gas explosion

Regional disaster:
Electric grid failure
Floods
Hurricanes
Earthquakes
Tornados
Tsunamis
Warfighting
Terrorist attack
Find the SPOF
A single point of failure exists when a critical service function is provided by a single component.
(Diagram: enterprise environment layers where a SPOF can hide, including ISP (external), MAN, WAN, switches, network, SAN, application, middleware, hardware (cores, cache, nest), and storage.)
http://publib.boulder.ibm.com/infocenter/aix/v6r1/topic/com.ibm.aix.powerha.plangd/ha_plan_over_ppg.htm
Accept as-is
Decide that the risk of partitioning occurring is unlikely and that the cost of redundancy is too high, accepting longer downtime by relying on backup restore in case of data inconsistency.
NOTE: External access to cluster nodes can still be available, even if site interconnects fail between the cluster nodes.
PowerHA SystemMirror
PowerHA SystemMirror Edition basics
PowerHA SystemMirror for AIX Standard Edition
Cluster management for the data center
Monitors, detects and reacts to events
Multiple heartbeat channels between the systems:
> Network
> SAN
> Central Repository
Enables automatic switch-over
SAN shared storage clustering
Smart Assists
HA agent support to discover, configure, and manage applications
Resource Group Management with advanced relationships
Support for custom resource management
Out-of-the-box support for DB2, WebSphere, Oracle, SAP, TSM, LDAP, IBM HTTP Server, etc.
PowerHA SystemMirror for AIX Enterprise Edition
Cluster management for the Enterprise (Disaster Tolerance)
Multi-site cluster management
Automated or manual confirmation of swap-over
Third site tie-breaker support
Separate storage synchronization
Metro Mirror, Global Mirror, GLVM, HyperSwap with DS8800 (<100 km)
Basic clmgr operations for day-to-day management of either edition are sketched below.
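A minimal sketch of common clmgr operations; the resource group and node names are hypothetical, and attribute spellings should be checked against the clmgr documentation linked later in this deck:

# show cluster-wide configuration and state
clmgr query cluster
# start cluster services on all nodes
clmgr online cluster
# move a resource group to another node (names illustrative)
clmgr move resource_group rg_db NODE=ha2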
Baseline
Ordinary run-of-the-mill dual node cluster.
(Diagram: HA1 and HA2 LPARs forming a PowerHA cluster, LVM mirroring across single or dual enterprise storage.)
Using Mirror Pools for LVM mirroring.
Single Virtual Ethernet adapter per node, backed by the same VIOS SEA LAGG.
Set "Communication Path to Node" to the cluster node's hostname network interface (using the IP address and symbolic hostname from /etc/hosts).
netmon.cf configured for ping outside the box from the partition (cluster file):
/usr/es/sbin/cluster/netmon.cf
rhosts configured with the cluster nodes (cluster file):
/etc/cluster/rhosts
netsvc.conf configured with DNS (system file):
/etc/netsvc.conf
Single or dual SAN fabric.
If dual sites, within a few km distance for minimal LVM latency and throughput degradation.
Single LAN with ISL; if dual sites, use VLAN spanning.
If the cluster node (partition) has multiple Virtual Ethernet adapters, set the "Communication Path to Node" to the IP address and Virtual Ethernet network interface device which maps to the hostname.
Sample contents for the three cluster files are sketched after the links below.
http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.trgd/ha_trgd_test_multicast.htm
http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.admngd/clmgr_cmd.htm
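For illustration only, minimal contents of these three files might look like the following; all hostnames and addresses are hypothetical:

# /etc/cluster/rhosts: one entry per cluster node, fully qualified
ha1.example.com
ha2.example.com

# /etc/netsvc.conf: resolve hostnames locally first, then via DNS
hosts = local, bind

# /usr/es/sbin/cluster/netmon.cf: ping targets outside the box
192.168.100.1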
PowerHA 7.1 with dual node single/dual site
Repository Disk
The cluster repository disk is used as the central repository for the cluster configuration data.
When CAA is configured with repos_loss mode set to assert and CAA loses access to the repository disk, the system automatically shuts down.
Access from all nodes and paths.
Start with ~10 GB for up to 32 nodes (min=512 MB, max=460 GB; thin provisioning is supported).
Direct access by CAA only, raw disk I/O.
Define a spare for the repository disk.
Verify the disk reserve attribute is set to no_reserve (a sketch follows the links below).
Do not manually write to the repository disk!
Check repository disk status:
/usr/es/sbin/cluster/utilities/clmgr query repository
/usr/lib/cluster/clras lsrepos
/usr/lib/cluster/clras dumprepos
/usr/lib/cluster/clras dumprepos -r <reposdisk>
/usr/lib/cluster/clras dpcomm_status
If IP heartbeating fails, cluster nodes will stay alive as long as the repository disk is accessible from all nodes.
http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.clusteraware/claware_repository.htm
https://www.ibm.com/developerworks/community/blogs/6eaa2884-e28a-4e0a-a158-7931abe2da4f/entry/powerha_caa_repository_disk_management
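A minimal sketch for checking and setting the reserve attribute on a candidate repository disk; the hdisk number is hypothetical, and on some multipath drivers the attribute name differs:

# display the current reservation policy of the candidate disk
lsattr -El hdisk3 -a reserve_policy
# set it to no_reserve if it is anything else
chdev -l hdisk3 -a reserve_policy=no_reserve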
PowerHA 7.1 with dual node single/dual site
Storage Framework
Fibre Channel adapters with target mode support only.
On fcsX: tme=yes
On fscsiX: dyntrk=yes and fc_err_recov=fast_fail
Enable the new settings (reboot); a sketch follows the links below.
All physical FC adapters' WWPNs zoned (TM-ZONE).
One fabric supported with SFWcomm.
For dual fabrics, it is supposed to work; if it does not work with your implementation and system software levels, please open a PMR with IBM Support.
LPM does not migrate the SFWcomm configuration.
It is recommended that SAN communication be reconfigured after LPM is performed.
Uses datalink layer communication over VLAN between the AIX cluster node and the VIOS with the physical FC adapters.
Check SFWcomm status:
lscluster -i
sfwinfo -a
clras sancomm_status
If the IP heartbeat and repository disk are not sufficient to meet heartbeat requirements, also enable SFWcomm.
http://www-01.ibm.com/support/knowledgecenter/ssw_aix_71/com.ibm.aix.clusteraware/claware_comm_setup.htm
http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.concepts/ha_concepts_ex_san.htm
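A minimal sketch of enabling these attributes; the adapter instance numbers are hypothetical, and -P defers each change until the next boot:

# enable target mode on the physical FC adapter
chdev -l fcs0 -a tme=yes -P
# enable dynamic tracking and fast fail on the protocol device
chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P
# reboot to activate the deferred attribute changes
shutdown -Fr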
PowerHA IP heartbeating over VIOS SEA
For single adapter PowerHA cluster node network adapters, use the netmon.cf configuration file:
/usr/es/sbin/cluster/netmon.cf
When netmon needs to stimulate the network to ensure adapter function, it sends ICMP ECHO requests to each IP address. After sending the request to every address, netmon checks the inbound packet count before determining whether an adapter has failed or not.
Specify remote hosts that are not in the cluster configuration, that can be accessed from PowerHA interfaces, and that reply consistently to ICMP ECHO without delay, such as the default gateway.
netmon.cf entry format:
<owner>: The interface this line is intended to be used by; that is, the code monitoring the adapter specified here will determine its own up/down status by whether it can ping any of the targets specified in these lines. The owner can be specified as a hostname, IP address, or interface name. In the case of a hostname or IP address, it *must* refer to the boot name/IP (no service aliases). In the case of a hostname, it must be resolvable to an IP address or the line will be ignored. The string "!ALL" will specify all adapters.
<target>: The IP address or hostname you want the owner to try to ping. As with normal netmon.cf entries, a hostname target must be resolvable to an IP address in order to be usable.
A sample file is sketched after the link below.
http://www-01.ibm.com/support/docview.wss?uid=isg1IZ01332
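A minimal netmon.cf sketch using the !REQD owner/target format described in the APAR above; the interface name and addresses are hypothetical:

# /usr/es/sbin/cluster/netmon.cf
# en0 is only considered up if it can ping the default gateway
!REQD en0 192.168.100.1
# any adapter may also prove itself by pinging the DNS server
!REQD !ALL 192.168.100.53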
Basic PowerHA cluster functionality verification
Number of nodes in a GPFS cluster:
Up to 1530 (AIX)
Up to 9620 (Linux/x86)
Up to 64 (Windows)
Typical 2-8 GPFS server nodes for commercial high availability clusters.
(Diagram: GPFS cluster with NSD disks in failure groups #1 and #2 on SAN storage; GPFS servers either direct attached or SAN attached; network protocol file serving of CIFS, NFS, HTTP, FTP to LAN-attached application clients.)
http://www-01.ibm.com/software/support/aix/lifecycle/index.html
http://www-01.ibm.com/support/knowledgecenter/SSFKCN/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.html
root@stglbs1:/: mmstartup -a
Sat Nov 1 02:40:46 GST 2014: 6027-1642 mmstartup: Starting GPFS ...
root@stglbs1:/: mmgetstate -a
root@stglbs1:/: mmlscluster
root@stglbs1:/: mmlsconfig
Configuration data for cluster gpfscl1.stglbs1:
-----------------------------------------
myNodeConfigNumber 1
clusterName gpfscl1.stglbs1
clusterId 5954771470676922314
autoload no
dmapiFileHandleSize 32
minReleaseLevel 3.5.0.11
maxMBpS 1200
prefetchThreads 150
worker1Threads 96
pagepool 6g
adminMode central
GPFS: 6027-531 The following disks of gdata10 will be formatted on node stglbs1:
gpfs42nsd: size 1073741824 KB
gpfs43nsd: size 1073741824 KB
GPFS: 6027-540 Formatting file system ...
GPFS: 6027-535 Disks up to size 8.8 TB can be added to storage pool system.
Creating Inode File
Creating Allocation Maps
Creating Log Files
Clearing Inode Allocation Map
Clearing Block Allocation Map
Formatting Allocation Map for storage pool system
GPFS: 6027-572 Completed creation of file system /dev/gdata10.
mmcrfs: 6027-1371 Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
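For context, a sketch of the command sequence that typically precedes file system creation output like the above; the stanza file name and option values are hypothetical:

# create the NSDs from a disk stanza file
mmcrnsd -F /tmp/gdata10.stanza
# create the file system with two-way replication and automatic mount
mmcrfs gdata10 -F /tmp/gdata10.stanza -m 2 -r 2 -A yes
# mount it on all nodes
mmmount gdata10 -a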
Know why
Business and regulatory requirements
Services, Risks, Costs
Key Performance Indicators (KPIs)
Understand how
Architect, Design, Plan
Can implement
Build, verify, inception, monitor, maintain, skill-up
Will govern
Service and Availability management
Change, Incident and problem management
Security and Performance management
Capacity planning
Migrate, replace and decommission
Bjørn Rodén
roden@ae.ibm.com
http://www.linkedin.com/in/roden
Continue growing your IBM skills
ibm.com/training provides a
comprehensive portfolio of skills and career
accelerators that are designed to meet all
your training needs.
(Chart: Total Cost Balance over time; solution costs (CAPEX/OPEX) are weighed against business down time costs and risk, with a buffer between the curves for increasing degrees of availability.)
(1) Quick Total Cost Balance (TCB) = TCO or TCA + Business Down Time Costs
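As a rough worked example of the balance (all figures hypothetical): estimated down time cost per year can be taken as (1 - availability) x 8760 hours x cost per hour of outage. With 99.9% availability and an outage cost of $10,000 per hour, that is 0.001 x 8760 x $10,000 = $87,600 per year, which is then added to the TCO or TCA of each candidate solution to compare alternatives on Total Cost Balance.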
Application
Application restart after node failure (stop-start)
active / standby (automatic/manual)
Application concurrency (scale out)
active / active (separate or shared transaction tracking)
Data
Single site, single or dual storage
Storage based, controlled by host (HyperSwap)
Host based (LVM mirroring/GPFS)
Database based (transaction replication)
Dual site, dual storage
Storage based (Metro/Global mirror)
Host based (GLVM/GPFS)
Database based (transaction replication)
Access
Primary site entry
Automated or manual redirection
Multi site concurrent entry
Automated or manual load balancing
Can use BCI Good Practice or similar, or just start with:
1. Develop contingency planning policy
2. Perform Business Impact Analysis
3. Identify preventive controls
4. Develop recovery strategies
5. Develop IT contingency plan
Focus on business purpose.

Note that Business Continuity Management (BCM) encompasses much more than IT Continuity. Some national and international standards and organizational recommendations:
(1) BCI, Good Practice, http://www.thebci.org/
(2) DRII, Professional Practices, http://www.drii.org/
(3) ITIL IT Service Continuity: Continuity management is the process by which plans are put in place and managed to ensure that IT Services can recover and continue should a serious incident occur.
(4) ISO Information Security and Continuity, ISO 17799/27001
(5) US NIST Contingency Planning Guide for Information Technology Systems, NIST 800-34
(6) British Standard for Business Continuity Management, BS 25999-1:2006
(7) British Standard for Information and Communications Technology Continuity Management, BS 25777:2008
(8) BITS Basnivå för informationssäkerhet (baseline information security), https://www.msb.se/RibData/Filer/pdf/24855.pdf

Note:
- ITIL Availability Management: to optimize the capability of the IT infrastructure, services and supporting organization to deliver a cost effective and sustained level of availability, enabling the business to meet its objectives.
- COBIT DS4 Ensure Continuous Service: control over the IT process of ensuring continuous service that satisfies the business requirement for IT of ensuring minimal business impact in the event of an IT service interruption.
Architecting for IT Service Continuity
Can use TOGAF ADM to bring clarity and understanding, from an enterprise perspective, on the availability/continuity requirements for different IT services.
Can use COBIT (1) DS4 to bring clarity and understanding, from an enterprise perspective, on the availability/continuity requirements for different IT services.
http://www.itgi.org/
(Diagram: COBIT IT governance focus areas, including resource management.)
(1) The IT Governance Institute (ITGI) Control Objectives for Information and related Technology (COBIT) is an international unifying framework that integrates all of the main global IT standards, including ITIL, CMMI and ISO 17799; it provides good practices, representing the consensus of experts, across a domain and process framework, and presents activities in a manageable and logical structure, focused on control.
http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.plangd/ha_plan_eliminate_spf.htm
http://www-01.ibm.com/support/knowledgecenter/SSPHQG_7.1.0/com.ibm.powerha.insgd/ha_install_required_aix.htm
To do before migration
Verify cluster conditions and settings
Use clstat to review the cluster state and to make certain that the cluster is in a stable state
Review the /etc/hosts file on each node to make certain it is correct
Review the /etc/netsvc.conf (equiv) file on each node to make certain it is correct
Review the /usr/es/sbin/cluster/netmon.cf file on each node to make certain it is correct
After AIX Version 6.1.6 or later is installed, enter the fully qualified host name of every node in the cluster in
the /etc/cluster/rhosts file
Take a snapshot of the cluster configuration and save off customized scripts, such as start, stop,
monitor and event script files (a sketch follows this section).
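A minimal sketch of saving a cluster snapshot before migration; the snapshot name, description, and script paths are hypothetical:

# create a named cluster snapshot with a description
/usr/es/sbin/cluster/utilities/clsnapshot -c -n premig713 -d "before PowerHA 7.1.3 migration"
# copy custom start/stop/monitor scripts somewhere safe
cp -p /usr/local/ha/*.sh /root/premig713-scripts/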
Remove configurations which can't be migrated
Configurations with IPAT via replacement or hardware address takeover (MAC address)
Configurations with heartbeat via IP aliasing
Configurations with non-IP networking, such as RS232, TMSCSI/SSA, DISKHB or MNDHB
Configurations which use other than Ethernet for network communication, such as FDDI, ATM, X25, TokenRing
Note that clmigcheck doesn't flag an error if a DISKHB network is found; the PowerHA migration utility automatically
takes care of removing that network
SAN storage for Repository Disk and Target Mode
The repository is stored on a disk that must be SAN attached, zoned to be shared by every node in the cluster (and
only the nodes in the cluster), and must not be part of a volume group
SAN zoning of FC adapters WWPN for Target Mode communication
Multicast IP address for the monitoring technology (optional)
You can explicitly specify multicast addresses, or one will be assigned by CAA
Ensure that multicast communication is functional in your network topology before migration
Note that from PowerHA 7.1.3, unicast is the default (a multicast test sketch follows)
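Multicast connectivity can be verified with the AIX mping utility; a minimal sketch with a hypothetical group address (start the receiver first):

# on the second node: receive multicast packets
mping -r -v -a 228.168.101.43
# on the first node: send multicast packets to the same group
mping -s -v -a 228.168.101.43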
clmigcheck tool is part of base AIX from 6.1 TL6 or 7.1 (/usr/sbin/clmigcheck)
An interactive tool that verifies the current cluster configuration, checks for unsupported elements, and collects
additional information required for migration
Saves migration check to file /tmp/clmigcheck/clmigcheck.log
You must run this command on all cluster nodes, one node at a time, before installing PowerHA 7.1.3
When the clmigcheck command is run on the last node of the cluster before installing PowerHA 7.1.3, the CAA
infrastructure will be started (check with lscluster -m command).
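An illustrative run order on a two-node cluster before installing the PowerHA 7.1.3 filesets:

# run interactively on each node, one node at a time
/usr/sbin/clmigcheck
# on the last node, confirm that the CAA infrastructure has started
lscluster -m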
Option 1
Checks configuration data (/etc/es/objrepos) and provides errors and warnings if there are any elements
in the configuration that must be removed manually.
In that case, the flagged elements must be removed, cluster configuration verified and synchronized, and
clmigcheck must be rerun until the configuration data check completes without errors.
Option 2
Checks a snapshot (present in /usr/es/sbin/cluster/snapshots) and provides error information if there are
any elements in the configuration that will not migrate.
Errors checking the snapshot indicate that the snapshot cannot be used as is for migration, and
PowerHA does not provide tools to edit a snapshot.
Option 3
Queries for additional configuration needed and saves it in a file in /var on every node in the cluster.
When option 3 is selected from the main screen, you will be prompted for repository
disk and multicast dotted decimal IP addresses.
Newer versions of AIX have an updated /usr/sbin/clmigcheck command that asks you to select "Unicast" or
"Multicast".
Use either option 1 or option 2 successfully before running option 3, which collects and
stores configuration data in the file /var/clmigcheck/clmigcheck.txt on each node; this file is used
when PowerHA 7.1.3 is installed.
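After clmigcheck has completed on a node, the PowerHA 7.1.3 filesets can be installed. A minimal sketch, assuming the installation images are in /mnt/powerha713 (the path and fileset selection are illustrative):

# install the base PowerHA server fileset and its prerequisites
installp -agXYd /mnt/powerha713 cluster.es.server.rte
# confirm the installed level
lslpp -L cluster.es.server.rte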