
HACMP

Elements in System Availability:


- A well planned and implemented concept
- "No single point of failure"
- Recovery without user intervention, using scripting

High availability is:
- The masking or elimination of both planned and unplanned downtime.
- The elimination of single points of failure (SPOFs).
- Fault resilience, but not fault tolerance.
The failure of any component of the solution, be it hardware, software or system management, will not cause the application and its data to be inaccessible to the user community. High availability solutions do fail; fault tolerant solutions should not fail. The goal of a high availability solution is to come as close as possible to continuous availability, i.e., no downtime. We must not only implement a high availability solution, but also reduce planned downtime through disciplined and documented change management.

The causes of downtime:


Planned downtime:
- Hardware upgrades
- Repairs
- Software updates
- Backups
- Testing
- Development

Unplanned downtime:
- User error
- Application failure
- Hardware faults
- Environmental disasters

The standalone system may offer limited availability benefits:
- Journalled Filesystem
- Dynamic CPU Deallocation
- Service Processor
- Redundant Power
- Redundant Cooling
- ECC Memory
- Hot Swap Adapters
- Dynamic Kernel

Single points of failure (standalone system):
o Operating System
o Network
o Network Adapter
o Node
o Disk
o Application
o Site Failure

The enhanced system may offer increased availability benefits:
- Redundant Data Paths
- Data Mirroring
- Hot Swap Storage
- Redundant Power for Storage Arrays
- Redundant Cooling for Storage Arrays
- Hot Spare Storage

Single points of failure (enhanced system):
o Operating System
o Application
o Network
o Network Adapter
o Node
o Site Failure

Clustering technologies offer High Availability:
- Redundant Servers
- Redundant Networks
- Redundant Network Adapters
- Heartbeat Monitoring
- Failure Detection
- Failure Diagnosis
- Automated Fallover
- Automated Reintegration

Single points of failure (clustered system):
o Site Failure
o Application

Benefits of High Availability Solutions:
- Standard components (no specialized hardware)
- Can be built from existing hardware (no need to invest in new kit)
- Works with just about any application
- Works with a wide range of disk and network types
- No specialized operating system or microcode
- Excellent availability at low cost

HACMP is largely independent of the disk type, network and application chosen.

High Availability solutions require the following:
o Thorough design and detailed planning
o Selection of appropriate hardware
o Disciplined system administration practices
o Documented operational procedures
o Comprehensive testing

A High Availability solution based upon HACMP provides automated failure detection, diagnosis, recovery and reintegration. The highly available solution will include the AIX operating system, HACMP for AIX, customized enhancements, ClusterProven applications and, of course, a plan for design and testing.

AIX's contribution to High Availability :


Object Data Manager (ODM)
System Resource Controller (SRC)
Logical Volume Manager (LVM)
Journalled File System (JFS)
Online JFS Backup (splitlvcopy)
Work Load Manager (WLM)
Quality of Service (QoS)
External Boot
Software Installation Management (installp)
Reliable Scalable Cluster Technology (RSCT)

Hardware Prerequisites:
All pSeries systems will work with high availability, in any combination of nodes within a cluster; however, a minimum of 4 free adapter slots is recommended (2 for network adapters and 2 for disk adapters). Any other adapters (e.g., graphics adapters) will occupy additional slots. The internal Ethernet adapter should not be included in the calculations. Even with 4 free adapter slots, there will still be a single point of failure, as the cluster will only be able to accommodate a single TCP/IP local area network between the nodes.

HACMP Features:
1) Availability using:
   - Cluster concept
   - Redundancy at component level (standby adapters)
   - AIX: LVM (JFS, disk mirroring), SRC, Error Notify
2) Event (fault) detection - network adapter, network, or node
3) Event recovery, automatically triggered or customized - adapter swap, fallover or notification of network down
4) C-SPOC: tools for global changes across all nodes - create AIX users, passwords, LVM components (VG, LV, JFS)
5) DARE (Dynamic Automatic Reconfiguration Event) - make HACMP changes without stopping the application
6) Monitoring using HACMP commands, HAView or HATivoli, plus pager support

HACMP is Not the Right Solution If...
1) You cannot suffer any downtime
   - Fault tolerance is required
   - 7 x 24 operation is required
   - Life critical systems
2) Your environment is insecure
   - Users have access to the root password
   - Network security has not been implemented
3) Your environment is unstable
   - Change management is not respected
   - You do not have trained administrators
   - Environment is prone to 'user fiddle factor'

HACMP will never be an out-of-the-box solution to availability. A certain degree of skill will always be required.

HACMP Basic Terms:


Cluster

HACMP's Resource Components:


Resources are logical components of the cluster configuration that can be moved from one node to another. Because they are logical components, they can be moved without human intervention. Resource components include: Resource Groups, IP labels, filesystems, NFS exports and mounts, Application Servers, Volume Groups and other items. All the logical resources are collected together into a resource group. All components in a resource group move together from one node to another in the event of a node failure. The difference between topology and resources is that the topology components are physical, i.e., nodes, networks and network adapters, which would require manual intervention to move from one place to another.

Failures detected by HACMP


1) Node failures - processor hardware or OS failure
2) Network adapter failures - moves the IP address to a standby adapter
3) Network failures - message displayed on the console, event is logged

HACMP/ES can also monitor applications, processor load and available disk capacity.

Other Failures:
1) Disk drive failure - LVM mirroring, RAID
2) Other hardware failures - no direct HACMP support. HACMP for AIX provides a SMIT interface to the AIX Error Notification Facility: trap on specific errors and execute a command in response (a sketch follows this list).
3) Application failures
4) HACMP failure - promoted to node failure
5) Power failure - avoid common power supplies across replicated devices / use a UPS
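A minimal sketch of driving the AIX Error Notification Facility directly with odmadd rather than through the HACMP SMIT menus; the resource name hdisk2, the stanza file name and the mail-based notification command are illustrative assumptions.

# cat /tmp/diskfail.add
errnotify:
        en_name = "diskfail_notify"
        en_persistenceflg = 1
        en_class = "H"
        en_type = "PERM"
        en_resource = "hdisk2"
        en_method = "errpt -a -l $1 | mail -s 'disk error on hdisk2' root"

# odmadd /tmp/diskfail.add          (adds the stanza to the errnotify ODM class)

The error daemon then runs en_method when a matching permanent hardware error is logged against hdisk2, passing the error log sequence number as $1.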

Cluster Resources :
Applications
Disk Drives
Volume Groups
File Systems
NFS File Systems
IP Addresses

Disk Crash
1) Data replicated through LVM mirroring.
2) Data replicated on RAID.

Disk adapter Failure:


1) Data replicated through LVM mirroring across buses.
2) If the RAID array is attached via multiple buses, the data remains available.

Network Adapter Failure:


1) Move the IP address to the standby adapter.
   - The standby adapter takes over the IP and, where applicable, the MAC address.
   - Duration is between 5 and about 25 seconds.
   - If the standby adapter fails, no fallover action is taken.
   - If the failed adapter comes back to life, it rejoins as a standby.
   - No significant effect on applications.

Network Fails:
1) HACMP provides notification and runs any user-defined scripts.
   - HACMP detects the fault.
   - The standard event script does not contain any actions.
   - Network takeover is only possible with customizing.
   - Behavior of the application depends on the infrastructure.

Machine Fails:
1) Workload (resources) moved to the surviving node.
2) TCP/IP address moved from the failed node to the surviving node.
3) Users log in again, using the same host name.

What you lose:
1) Work in progress.
2) Any data not yet written to disk.
3) All process state.

Types of HACMP Resource Groups


1) Cascading
   - Computers have a fixed priority.
   - Resources move through an ordered list of nodes in the event that a node fails.
   - Resources automatically revert to the highest priority node that is active.
   A cascading resource group can be modified to prevent the resources moving back to a higher priority node upon reintegration of the previously failed node; this is called cascading without fallback.
2) Rotating
   - All computers have equal priority.
   - Takeover only happens when there is a failure, not with reintegration.
   - When a previously failed node rejoins the cluster, the resources do not move back.
   A limitation of rotating resource groups is that n-1 rotating resource groups are supported for each cluster, where n is the number of nodes in the cluster. Another consideration is that all nodes should ideally be equally sized.
3) Concurrent
   - All computers work on the data simultaneously.

Resource Groups, points to ponder :


- You may have hundreds of resource groups in a single cluster.
- You may mix and match all three types of resource group in a single cluster.
- Simply because you have 32 nodes in a cluster does not mean that all nodes must service all resource groups.
- The only requirement is that each resource group is serviced by at least two nodes.
- Any given node may service one or more resource groups (of the same or differing types).
- You may manually move resource groups about the cluster to perform a type of load balancing.
- You may choose which node has which resource group at any point in time or following a failure.

LVM and HACMP Considerations:


- All LVM constructs must have unique names in the cluster, e.g., httplv, httploglv, httpfs and httpvg.
- Mirror all critical logical volumes - don't forget the jfslog (see the sketch below).
- The VG major device numbers should be the same on all nodes - mandatory for clusters exporting NFS filesystems, but a good habit for any cluster.
- Shared data on internal disks is a bad idea.
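A minimal sketch of mirroring the logical volumes named above, assuming a second disk is already in the volume group; copy counts and names should be adapted to your layout.

# mklvcopy httplv 2          (add a second copy of the data logical volume)
# mklvcopy httploglv 2       (mirror the jfslog as well)
# syncvg -v httpvg           (synchronise the new copies)
# lsvg -l httpvg             (verify that each LV now has two physical partitions per logical partition)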

HACMP and Networks:


HACMP clusters must contain two or more nodes and two or more networks, one being IP based and one being non-IP based. The non-IP networks are there to allow HACMP to determine whether a failure is a network failure or a node failure.

LANs carry the following traffic:
- HACMP heartbeat or keepalive packets.
- HACMP messages, used for cluster communications.
- HACMP Lock Manager traffic, used in concurrent access configurations.
- Client communications, e.g., Telnet, FTP, NFS, sqlnet.

Serial networks carry:
- HACMP heartbeats.
- HACMP messages.

HACMP Network Components Terminology:
Public Network - any TCP/IP LAN that supports HACMP client network traffic.
Private Network - any TCP/IP LAN that carries only HACMP traffic.
Serial Network - an RS232/RS422 or tmscsi/tmssa network used only for HACMP traffic.
Adapter - the HACMP ODM definition associated with a TCP/IP or serial network interface.
Adapter IP Label - the name in /etc/hosts that maps to an IP address.
Nodename - the name associated with a cluster node, not to be confused with the hostname.
Network Type - identifies the physical media type to HACMP, e.g., FDDI, ether, ATM, HPS.
Adapter Function - Service, Standby or Boot.
Adapter Hardware Address - the LAN adapter's Locally Administered Address (LAA).

All nodes must have at least one standby and one service adapter per network. Keep the following items in mind when you do your network planning (an illustrative /etc/hosts layout follows):
- All service and boot adapters must be in the same subnet.
- All standby adapters must be in the same subnet.
- Service and standby adapters must be in different logical subnets.
- All adapters in the cluster must have the same subnet mask.
- Do not edit the route entries for service and standby adapters in SMIT.
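An illustrative /etc/hosts layout that satisfies the subnet rules above; node1's addresses are taken from the exercise later in this document, while the node3 addresses and the 255.255.255.0 netmask are assumptions.

193.9.200.225   node1_boot      # boot    - same subnet as the service address
193.9.200.226   node1_svc       # service
193.9.201.1     node1_stby      # standby - different logical subnet
193.9.200.227   node3_boot
193.9.200.228   node3_svc
193.9.201.2     node3_stby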

IP Address Takeover (IPAT):


Service IP address of failed node may be taken over by a surviving node. This behaviour is known as IP Address Takeover or IPAT for short. Standby adapter on a surviving node acquires the service address of failed node. This is an optional behaviour that must be configured. Requires the configuration of a 'boot' adapter. If more than one standby adapter is available the surviving node may takeover the workload of one or more failed nodes.

When a cluster is configured to use IPAT , an additional network adapter must be defined. This is known as a Boot Adapter. When a failed node recovers, it cannot boot on the Service IP address if this has been acquired by another node in the cluster. For this reason, the failed node needs to boot on a unique IP address which is not used elsewhere in the cluster. This ensures that there is no IP address duplication during reintegration.

Configuring IPAT:
IPAT is required for rotating resource groups and is optional for cascading resource groups. It is not supported for concurrent resource groups.

On all nodes, prepare security and name resolution:
1. Add an entry for the boot IP label to /etc/hosts on each node.
2. Add the boot IP label to /.rhosts on each node.
3. Use FTP or rdist to keep these files in sync and minimise human error.

On the node that will have its service IP address taken over:
4. Change the IP address that is held in the ODM to the boot IP address by using smit chinet (see the command-line sketch below). This causes cfgmgr to configure the boot address at system startup.

On any node, update the cluster configuration:
5. Add the boot adapter definition to the cluster topology for the node that will have its service IP address taken over.
6. Synchronise the topology (you will get a warning message).
7. Add the service IP label of the node to be taken over to a resource group.
8. Take a snapshot of your modified topology and update your cluster planning worksheets.
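A command-line sketch of step 4, as an alternative to smit chinet; the interface name en0 and the addresses are illustrative. The -P flag records the change in the ODM only, so the boot address is picked up at the next system restart, which matches the intent of the step.

# chdev -l en0 -a netaddr=193.9.200.225 -a netmask=255.255.255.0 -P
# odmget -q "name=en0 and attribute=netaddr" CuAt      (confirm the ODM now holds the boot address)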

Configuring Hardware Address Takeover:


Do not enable the "ALTERNATE" hardware address field in smit devices. This would cause the adapter to boot on its locally administered address rather than the burned-in ROM address, which causes serious communications problems and will put the cluster into an unstable state. The correct method is to enter your chosen LAA into the smit hacmp menus.

Some adapter types are very specific about the leading digits of an LAA, Token-Ring and FDDI in particular: the LAA must start with 42 for Token-Ring, and the first digit of the first byte must be 4, 5, 6 or 7 for FDDI. Always check the documentation provided with the adapter and the HACMP manuals. Token-Ring adapters will not release the LAA if AIX crashes, so AIX must be set to reboot automatically after a system crash (smit environment).

Installation of the software

Install the HACMP software from the HACMP CD. The filesets include:
- cluster.adt
- cluster.base
- cluster.cspoc
- cluster.man
- cluster.taskguides
- cluster.vsm

The installed cluster daemons are:
1) clstrmgr
2) clsmuxpd (works with SNMP)
3) cllockd - for concurrent access
4) clinfo - required for IPAT with hardware (MAC) address takeover

HACMP Daemons:

The clstrmgr and clsmuxpd daemons are mandatory; the other two are optional.
1) Cluster Manager (clstrmgr):
   - Runs on all cluster nodes.
   - Tracks cluster topology.
   - Tracks network status.
   - Externalizes failure events.

The Cluster Manager has four functional pieces: the Cluster Controller (CC), the Event Manager (EM), the NIM Interface Layer (NIL) and the Network Interface Modules (NIMs).


CC, EM and NIL are all part of the clstrmgr executable; the NIMs are separate executables, one for each network type. At startup the Cluster Controller reads the cluster information out of the ODM, the NIM Interface Layer controls the hardware through the Network Interface Modules, and the Event Manager handles the event scripts and communicates with clsmuxpd and cllockd.

The Cluster Controller performs a number of coordinating functions:
- Retrieves cluster configuration information from the HACMP ODM object classes at startup and during a refresh or DARE operation.
- Establishes the ordering of cluster neighbours for the purpose of sending keep-alive packets.
- Tracks changes to the cluster topology.
- Receives information about cluster status changes from the NIMs via the NIL.
- Queues events in response to status changes in the cluster.
- Handles node isolation and partitioned clusters.

The NIL provides a common interface between the Cluster Controller and one or more NIMs. This allows NIMs to be developed for new adapter hardware without rewriting the cluster manager. The NIL:
- Tells the NIMs the appropriate keep-alive and failure detection rates for each network type, as defined in the ODM.
- Starts the appropriate NIMs for the network types that have been defined in the HACMP classes of the ODM.
- Gives the NIMs a list of the IP addresses or /dev files to send keep-alives to.
- Restarts the NIMs if they hang or exit.


The NIMs are the contact point between HACMP and the network interfaces. The NIMs:
- Send and receive keep-alive and message information.
- Detect network related failures.
- Are provided for each supported network type, including a generic one.

The Event Manager performs the following functions:
- Starts the appropriate event scripts in response to status changes in the cluster.
- Sets the required environment variables.
- Communicates with clsmuxpd and cllockd when required.
- Starts the config_too_long event if any event does not exit 0 within 6 minutes.

The Event Manager causes event scripts to execute. Primary events (such as node_up, node_up_complete, node_down, node_down_complete, etc.) are called directly by the cluster manager. Sub-events (such as node_up_local, node_up_remote, node_down_remote, node_down_local, etc.) are called by primary events.

2) Cluster SNMP Agent (clsmuxpd):
   - Receives information from the Cluster Manager.
   - Maintains the HACMP enterprise specific MIB.
   - Provides information to SNMP.
3) Cluster Lock Manager (cllockd):
   - Cluster wide advisory level locking.
   - CLM locking API.
   - Unix locking API.
   - Only for processes running on the cluster.
4) Cluster Information Services (clinfo):
   - Optional on both cluster nodes and cluster clients.
   - Provides cluster status information to clients.
   - The clinfo API allows for cluster aware applications.

Characteristics of keep-alive (KA) packets:
- Transmitted over all interfaces known to HACMP.
- Direct I/O (non-IP networks) or UDP packets.
- Three adjustable transmission rates (fast, normal, slow).
- If a failure rate is exceeded, an event is triggered.

Dead Man's Switch (DMS):


If the LED on one of your cluster nodes flashes 888, you may have experienced a DMS time-out. The cause is that clstrmgr could not send a heartbeat to itself, usually due to excessive I/O traffic. The other nodes in the cluster would then start the node_down event while the sick node is still running, which could cause data corruption on the shared disks. The DMS exists to prevent this: the operating system (AIX) panics (888) first. The deadman switch is a kernel extension to AIX. The cluster manager resets the DMS frequently (every 0.5 seconds). If the DMS is not reset for n-1 seconds the node panics, where n = (KA rate) x (missed KAs) for the slowest network.
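As an illustration (the numbers are assumed, not product defaults): if the slowest network in the cluster has a KA rate of 2 seconds and 6 missed KAs are required to declare a failure, then n = 2 x 6 = 12, so the cluster manager must reset the deadman switch at least once every 11 seconds or AIX panics.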

Cluster Single Point of Control (C-SPOC):


C-SPOC provides facilities for performing common cluster-wide administration tasks from any node within the cluster.
- Requires either /.rhosts or Kerberos to be configured on all nodes.
- C-SPOC operations fail if any target node is down at the time of execution or the selected resource is not available.
- Any change to a shared VGDA is synchronised automatically if C-SPOC is used to change shared LVM components (VGs, LVs, JFS).
- C-SPOC uses a script parser called the "command execution language".

The cluster snapshot utility:


HACMP's cluster snapshot utility records the HACMP ODM configuration information, both cluster topology and resources. When a new snapshot is created, two files are generated:
- <snapshotname>.odm (contains ALL cluster topology and resource information)
- <snapshotname>.info (a printable report, which can be extended)
By default, snapshots are stored in the directory /usr/sbin/cluster/snapshots. The SNAPSHOTPATH environment variable can be used to specify an alternative location for storing snapshots (see the example below). The documentary report that a snapshot creates can be customised to include information specific to a given cluster (e.g., application configuration). Snapshots can be applied to a running cluster.
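A minimal sketch of redirecting snapshot storage before taking a snapshot through SMIT; the directory name is an assumption.

# mkdir -p /hacmp_snapshots
# export SNAPSHOTPATH=/hacmp_snapshots
# smit hacmp            (Cluster Snapshots - Add a Cluster Snapshot)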


HACMP Log Files :


1) /usr/adm/cluster.log - output of cluster scripts and daemons.
2) /usr/sbin/cluster/history/cluster.<mmdd> - history files, created every day.
3) /tmp/cspoc.log - output of all commands executed by C-SPOC.
4) /tmp/cm.log - log of the Cluster Manager daemon.
5) /tmp/emuhacmp.out - output of emulation scripts.
6) /tmp/hacmp.out - detailed output of all event scripts (see the example below).
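A few commands that are useful while testing failovers; the date suffix on the history file and the grep pattern are illustrative.

# tail -f /tmp/hacmp.out                        (follow event script output during a takeover test)
# grep -i error /tmp/hacmp.out                  (quick check for script problems)
# cat /usr/sbin/cluster/history/cluster.0315    (review the history file for 15 March)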

Dynamic Reconfiguration :
HACMP provides a facility that allows changes to cluster topology and resources to be made while the cluster is active. This facility is known as DARE or, to give it its full name, "Dynamic Automatic Reconfiguration Event". It requires three copies of the HACMP ODM:
- Default Configuration Directory (DCD), which is updated by SMIT/command line: /etc/objrepos
- Staging Configuration Directory (SCD), which is used during reconfiguration: /usr/sbin/cluster/etc/objrepos/staging
- Active Configuration Directory (ACD), from which clstrmgr reads the cluster configuration: /usr/sbin/cluster/etc/objrepos/active
DARE allows changes to be made to most cluster topology and nearly all resource group components without the need to stop HACMP, take the application offline or reboot a node. All changes must be synchronised in order to take effect.

Pre configuration steps


1) Add the boot, service and standby entries to the /etc/hosts file.
2) Make the same entries in the .rhosts file (required at synchronisation time; HACMP uses rsh to update the remote machines).
3) Create the volume group on node1 with a major number which is free on both nodes (use the lvlstmajor command to check the free major numbers).
4) Turn off the auto varyon feature on that node:
   chvg -a n <vg name>
5) Create a jfslog logical volume:
   mklv -t jfslog -y <lv name> <vg name> <size>
6) Format the log logical volume:
   logform /dev/<lv name>
7) Create a logical volume:
   mklv -t jfs -y <lv name> <vg name> <size>
8) Create the file system:
   crfs -v jfs -d /dev/<lv name> -m /<mount point>
9) Vary off the VG:
   varyoffvg <vgname>
10) Import the VG on the other node with the same major number:
   importvg -V 44 -y <vgname> <pv name>
11) Turn off the auto varyon feature:
   chvg -a n <vg name>
12) Vary off the VG:
   varyoffvg <vgname>

A worked example with concrete names follows.
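A worked example of the steps above with concrete names; the volume group name, disk, logical volume names, sizes and the major number 60 are all assumptions, so substitute values that suit your cluster.

On node1:
# lvlstmajor                                  (run on both nodes and pick a major number free on both, e.g. 60)
# mkvg -V 60 -y sharedvg hdisk2
# chvg -a n sharedvg
# mklv -t jfslog -y sharedloglv sharedvg 1
# logform /dev/sharedloglv
# mklv -t jfs -y sharedlv sharedvg 10
# crfs -v jfs -d /dev/sharedlv -m /shared
# varyoffvg sharedvg

On node3:
# importvg -V 60 -y sharedvg hdisk2
# chvg -a n sharedvg
# varyoffvg sharedvg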

Configuration steps ( short )


1) Define the cluster ID and cluster name
2) Configure nodes
3) Configure network adapters
4) Synchronise the topology
5) Create the resource group
6) Create the resources
7) Synchronise the resources


Steps in Configuring Clusters:


Step 1 - Plan your cluster (use the planning worksheets and the documentation)
Step 2 - Configure TCP/IP and LVM (/etc/hosts, /.rhosts, jfs mirroring and layout)
Step 3 - Install the HACMP software (select the necessary filesets)
Step 4 - Define the cluster topology (nodes, networks and network adapters)
Step 5 - Synchronize the cluster topology (verification performed automatically)
Step 6 - Configure the application start and stop scripts (application servers)
Step 7 - Define the cluster resources and resource groups (file systems, IP addresses, exports and NFS mounts)
Step 8 - Synchronize the cluster resources (verification performed automatically)
Step 9 - Test the cluster (including application tests)


HACMP Enhanced Scalability (HACMP/ES):


- Extended functionality compared with classic HACMP.
- Up to 32 nodes in one HACMP cluster.
- Based on the IBM High Availability Infrastructure.
- Includes all features of classic HACMP.

Difference between HACMP and HACMP/ES:
- HACMP/ES uses RSCT (Reliable Scalable Cluster Technology)
- User defined events based on RSCT
- Application Monitoring
- Recovery from resource group acquisition failure
- Dynamic node policy
- Selective Fallover
- Plugins

In 4.4.1, plugins are provided to help configure the following services:
- DNS
- DHCP
- Print services
The plugins add the server application and an application monitor to an existing resource group.
- Process Monitoring via the provided plugin scripts
- Users must create their own scripts/programs for Custom Monitoring

What Plugins provide:


- The resource group now contains an application server to start and stop the relevant daemons.
- The resource group also contains the shared volume group and filesystem needed by the daemons.
- An application monitor is configured to watch the daemons.
- When the resource group comes online, the daemon(s) will be activated.
- If a daemon should fail, the application monitor will detect it; the daemon will be restarted or the resource group will move to another node.
- Note that the limitations of application monitoring still exist:
  - Only one monitored application per resource group.
  - The monitored application can be included in at most one resource group.
A custom monitor sketch follows.
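A minimal sketch of the kind of custom monitor script mentioned above; the process name dsmserv and the script location are assumptions, and the script follows the usual convention that exit 0 means the application is healthy and a non-zero exit means it has failed.

#!/bin/ksh
# /usr/local/bin/check_dsmserv - custom application monitor sketch
# Exit 0 if the monitored daemon is running, 1 otherwise.
COUNT=`ps -ef | grep dsmserv | grep -v grep | wc -l`
if [ $COUNT -ge 1 ]
then
        exit 0
else
        exit 1
fi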


HACMP Commands
rdist -b -f /etc/disfile1 : To distribute the files listed in disfile1 to all the nodes listed in disfile1, in binary mode

Sample entry for disfile1:
HOSTS = ( root@node1 root@node3 )
FILES = ( /etc/passwd /etc/security/passwd )
${FILES} -> ${HOSTS}

clstart -m -s -b -i -l : To start the cluster daemons (-m clstrmgr, -s clsmuxpd, -b broadcast message, -i clinfo, -l cllockd)
clstop -f -N : To force shutdown of the cluster immediately without releasing resources
clstop -g -N : To do a graceful shutdown immediately with no takeover
clstop -gr -N : To do a graceful shutdown immediately with takeover
cldare -t : To sync the cluster topology
cldare -t -f : To do a mock sync of the topology
cldare -r : To sync the cluster resources
cldare -r -f : To do a mock sync of the resources
clverify : Cluster verification utility
cllscf : To list cluster topology information
cllsclstr : To list the name and security level of the cluster
cllsnode : To list info about the cluster nodes
cllsnode -i node1 : To list info about node1
cllsdisk -g shrg : To list the PVID of the shared hard disk for resource group shrg
cllsnw : To list all cluster networks
cllsnw -n ether1 : To list the details of network ether1
cllsif : To list the details by network adapter
cllsif -n node1_service : To list the details of network adapter node1_service
cllsvg : To list the shared VGs which can be accessed by all nodes
cllsvg -g sh1 : To list the shared VGs in resource group sh1
cllslv : To list the shared LVs
cllslv -g sh1 : To list the shared LVs in resource group sh1
cllsdisk -g sh1 : To list the PVID of disks in resource group sh1
cllsfs : To list the shared file systems
cllsfs -g sh1 : To list the shared file systems in resource group sh1
cllsnim : Show info about all network modules
cllsnim -n ether : Show info about the ether network module
cllsparam -n node1 : To list the runtime parameters for node node1
cllsserv : To list all the application servers
claddclstr -i 3 -n dcm : To add a cluster definition with name dcm and id 3
claddnode : To add an adapter
claddnim : To add a network interface module


claddgrp -g sh1 -r cascading -n n1 n2 : To create resource group sh1 with nodes n1, n2 in cascading mode
claddserv -s ser1 -b /usr/start -e /usr/stop : Creates an application server ser1 with start script /usr/start and stop script /usr/stop
clchclstr -i 2 -n dcmds : To change the cluster definition name to dcmds and id to 2
clchclstr -s enhanced : To change the cluster security to enhanced
clchnode : To change the adapter parameters
clchgrp : To change the resource group name or node relationship
clchparam : To change the run time parameters (like verbose logging)
clchserv : To change the name of an application server or change its start/stop scripts
clrmclstr : To remove the cluster definition
clrmgrp -g sh1 : To delete the resource group sh1 and related resources
clrmnim ether : To remove the network interface module ether
clrmnode -n node1 : To remove the node node1
clrmnode -a node1_svc : To remove the adapter named node1_svc
clrmres -g sh1 : To remove all resources from resource group sh1
clrmserv app1 : To remove the application server app1
clrmserv ALL : To remove all application servers
clgetactivenodes -n node1 : To list the nodes with active cluster manager processes, as seen from the cluster manager on node node1
clgetaddr node1 : Returns a pingable address for node node1
clgetgrp -g sh1 : To list the info about resource group sh1
clgetgrp -g sh1 -f nodes : To list the participating nodes in resource group sh1
clgetif : To list the interface name / interface device name / netmask associated with a specified IP label or IP address of a specific node
clgetip sh1 : To get the IP label associated with the resource group
clgetnet 193.9.200.2 255.255.255.0 : To list the network for IP 193.9.200.2, netmask 255.255.255.0
clgetvg -l nodelv : To list the VG of LV nodelv
cllistlogs : To list the logs
clnodename -a node5 : To add node5 to the cluster
clnodename -o node5 -n node3 : To change the cluster node name node5 to node3
clshowres : Lists the resources defined for all resource groups
clfindres : To find the resource groups within a cluster
xclconfig : X utility for cluster configuration
xhacmpm : X utility for HACMP management
xclstat : X utility for cluster status
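A short sequence, built only from the commands above, that is handy for checking a cluster after configuration changes; the HACMP utilities are typically installed under /usr/sbin/cluster, so add the relevant subdirectories to PATH if necessary.

# clverify                        (check that the cluster definition is consistent)
# cllscf                          (review the cluster topology)
# clshowres                       (list the resources defined for every resource group)
# clfindres                       (see which node currently holds each resource group)
# clgetactivenodes -n node1       (list the nodes running the cluster manager)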

Setup for Cascading mode :


HACMP Configuration Exercise

Scenario: Connecting machines in a Cascading Resource Group, so as to operate IP address takeover, NFS availability and application takeover.

1. Create a .rhosts file on all nodes which are going to be part of HACMP. The file should be in the root directory and contain the names of the boot, standby and service adapters.
# cat .rhosts
cws
node3
node1
# End of generated entries by updauthfiles script
node1_boot
node1_svc
node1_stby
node3_boot
node3_svc
node3_stby

2. # smit hacmp
A.) Define a cluster

HACMP for AIX
Move cursor to desired item and press Enter.
  Cluster Configuration
  Cluster Services
  Cluster System Management
  Cluster Recovery Aids
  RAS Support

Cluster Configuration
Move cursor to desired item and press Enter.
  Cluster Topology
  Cluster Security
  Cluster Resources
  Cluster Snapshots
  Cluster Verification
  Cluster Custom Modification
  Restore System Default Configuration from Active Configuration


Cluster Topology
Move cursor to desired item and press Enter.
  Configure Cluster
  Configure Nodes
  Configure Adapters
  Configure Network Modules
  Show Cluster Topology
  Synchronize Cluster Topology

Configure Cluster
Move cursor to desired item and press Enter.
  Add a Cluster Definition
  Change / Show Cluster Definition
  Remove Cluster Definition

Add a Cluster Definition
Type or select values in entry fields. Press Enter AFTER making all desired changes.
                                  [Entry Fields]
* Cluster ID                      [10]
* Cluster Name                    [dcm]

B.) Add participating nodes
# smit hacmp
- Cluster Configuration - Cluster Topology - Configure Nodes - Add Cluster Nodes
Make entries for the participating nodes. The names need not be related to the /etc/hosts file; they can be any name.


Add Cluster Nodes
Type or select values in entry fields. Press Enter AFTER making all desired changes.
                                  [Entry Fields]
* Node Names                      [node1 node3]

C.) Create entries for all IP addresses
First confirm that the system has booted on the boot IP address. Check using
# lsattr -El en0
Make entries for all adapters: node1_boot, node1_svc, node1_stby, node3_boot, node3_svc and node3_stby.
# smit hacmp
- Cluster Configuration - Cluster Topology - Configure Adapters - Add an adapter

Add an Adapter
Type or select values in entry fields. Press Enter AFTER making all desired changes.
                                  [Entry Fields]
* Adapter IP Label                [node1_boot]
* Network Type                    [ether]
* Network Name                    [ether1]
* Network Attribute               public
* Adapter Function                boot
  Adapter Identifier              [193.9.200.225]
  Adapter Hardware Address        []
  Node Name                       [node1]

D.) Check the Cluster Topology
# smit hacmp
- Cluster Configuration - Cluster Topology - Show Cluster Topology
Check the cluster topology.

E.) Synchronize Cluster Topology
# smit hacmp
- Cluster Configuration - Cluster Topology - Synchronize Cluster Topology
The topology is copied to all participating nodes.

F.) Create a Resource Group
# smit hacmp
- Cluster Configuration - Cluster Resources - Define Resource Group - Add a Resource Group

Add a Resource Group
Type or select values in entry fields. Press Enter AFTER making all desired changes.
                                  [Entry Fields]
* Resource Group Name             [rg1]
* Node Relationship               cascading
* Participating Node Names        [node1 node3]

Give the Resource Group Name, Node Relationship and Participating Node Names.

G.) Define Resources for a Resource Group
# smit hacmp
- Cluster Configuration - Cluster Resources - Change/Show Resources for a Resource Group

Select a Resource Group
Move cursor to desired item and press Enter.
  rg1


Configure a Resource Group
Type or select values in entry fields. Press Enter AFTER making all desired changes.
[TOP]                                            [Entry Fields]
  Resource Group Name                            rg1
  Node Relationship                              cascading
  Participating Node Names                       node1 node3
  Service IP Label                               [node1_svc]
  Filesystems                                    []
  Filesystems Consistency Check                  fsck
  Filesystems Recovery Method                    sequential
  Filesystems to Export                          []
  Filesystems to NFS Mount                       []
  Volume Groups                                  []
  Concurrent Volume Groups                       []
  Raw Disk PVIDs                                 []
  AIX Connections Services                       []
  AIX Fast Connect Services                      []
  Application Servers                            []
  Highly Available Communication Links           []
  Miscellaneous Data                             []
  Inactive Takeover Activated                    false
  9333 Disk Fencing Activated                    false
  SSA Disk Fencing Activated                     false
  Filesystems mounted before IP configured       false
[BOTTOM]

An entry is made for node1_svc in rg1 so that it can be taken over in case of an adapter or network failure. Similarly, create another Resource Group rg2 for node3_svc. Entries for NFS and Application Servers, if required, also have to be made in the above screen.


H.) Copy the resource information to all participating nodes
# smit hacmp
- Cluster Configuration - Cluster Resources - Synchronize Cluster Resources
The resource configuration gets copied to all participating nodes.

I.) Start HACMP on all participating nodes
# smit hacmp
- Cluster Services - Start Cluster Services
Started on each individual machine. We can use C-SPOC (Cluster Single Point of Control) for all machines; however, it has not been enabled on the SP due to security reasons.
For C-SPOC:
# smit hacmp
- Cluster System Management - HACMP for AIX Cluster Services - Start Cluster Services (it takes time)

J.) Check that the cluster services are started
# lssrc -g cluster

Now the system has been configured for high availability of the IP address. It can be tested by stopping the HACMP services on one of the nodes. For this, follow these steps:

a.) On both nodes run the following command and check that the service IP is being used on en1. en2 should be using the standby IP.
# netstat -in
Name  Mtu    Network    Address            Ipkts  Ierrs  Opkts  Oerrs  Coll
lo0   16896  link#1                        210    0      229    0      0
lo0   16896  127        127.0.0.1          210    0      229    0      0
lo0   16896  ::1                           210    0      229    0      0
en0   1500   link#2     0.60.94.e9.56.e3   40117  0      38081  0      0
en0   1500   192.9.200  192.9.200.2        40117  0      38081  0      0
en1   1500   link#3     0.6.29.ac.ca.66    63612  0      1136   0      0
en1   1500   193.9.200  193.9.200.226      63612  0      1136   0      0
en2   1500   link#4     0.6.29.ac.f2.f6    0      0      3      3      0
en2   1500   193.9.201  193.9.201.1        0      0      3      3      0
The above display was obtained on node1.

b.) Stop the cluster services on node1.
# smit hacmp
- Cluster Services - Stop Cluster Services


Stop Cluster Services
Type or select values in entry fields. Press Enter AFTER making all desired changes.
                                           [Entry Fields]
* Stop now, on system restart or both      now
  BROADCAST cluster shutdown?              true
* Shutdown mode                            graceful with takeover
  (graceful, graceful with takeover, forced)

c.) Now check on node3 that the node1_svc IP has shifted to its standby adapter (en2), using the netstat command.

Adding serial links obtained through SSA:
Check the device addresses on each node using
# lsdev -C | grep tmssa
# smit hacmp
- Cluster Configuration - Cluster Topology - Configure Adapters - Add an adapter

Add an Adapter
Type or select values in entry fields. Press Enter AFTER making all desired changes.
                                  [Entry Fields]
* Adapter IP Label                [node1_tmssa]
* Network Type                    [tmssa]
* Network Name                    [ssa1]
* Network Attribute               serial
* Adapter Function                service
  Adapter Identifier              [/dev/tmssa3]
  Adapter Hardware Address        []
  Node Name                       [node1]
Similarly, make the entry for node3 (/dev/tmssa2).

Create an Application Server:


# smit hacmp
- Cluster Configuration - Cluster Resources - Define Application Servers

Define Application Servers
Move cursor to desired item and press Enter.
  Add an Application Server
  Change / Show an Application Server
  Remove an Application Server

Add Application Server
Type or select values in entry fields. Press Enter AFTER making all desired changes.
                                  [Entry Fields]
* Server Name                     [tsm]
* Start Script                    [/usr/bin/tsmstart]
* Stop Script                     [/usr/bin/tsmstop]

The same scripts have to be copied to all participating nodes. The entry for the Application Server has to be made in the Resource Group (Step 2.G).


Sample tsmstart script:
#!/bin/ksh
SERVICEIP=1
while [ $SERVICEIP -ne 0 ]
do
        x=`netstat -i | grep -c node1_svc`
        if [ $x -eq 1 ]
        then
                SERVICEIP=0
                echo "Exiting with SERVICEIP"
        else
                echo "Executing IP Take over"
                sleep 2
        fi
done
sleep 15
/usr/tivoli/tsm/server/bin/rc.adsmserv

Sample tsmstop script:
#!/bin/ksh
cd /usr/tivoli/tsm/client/ba/bin
dsmadmc -id=admin -password=support halt
sleep 15

Cluster snapshot creation:
# smit hacmp
- Cluster Snapshots - Add a Cluster Snapshot
We are required to provide the Cluster Snapshot Name, a Custom Defined Snapshot Method and a Cluster Snapshot Description. The snapshot is created in the directory /usr/sbin/cluster/snapshots. Two files are created: <snapshotname>.odm and <snapshotname>.info.

Testing the non-IP serial communication (SSA):
Node1: # cat < /dev/tmssa3.tm
Node3: # cat <filename> > /dev/tmssa2.im

Scenario: Connecting the nodes in a Rotating Resource Group.
In rotating mode, a resource does not belong to any node. Therefore, when creating a resource such as an IP address, a node name is not required.


Setup in Rotating mode:

The configuration of the resource group is exactly the same as was done for cascading mode. However, while adding the adapter for node1_svc, do not provide the Node Name. The other adapters (node1_boot, node1_stby, node3_boot and node3_stby) are added as in Step 2.C.
# smit hacmp
- Cluster Configuration - Cluster Topology - Configure Adapters - Add an adapter


Add an Adapter
Type or select values in entry fields. Press Enter AFTER making all desired changes.
                                  [Entry Fields]
* Adapter IP Label                [node1_svc]
* Network Type                    [ether]
* Network Name                    [ether1]
* Network Attribute               public
* Adapter Function                service
  Adapter Identifier              [193.9.200.226]
  Adapter Hardware Address        []
  Node Name                       [ ]

After adding the adapter definitions, synchronize the topology. Then create a resource group rg3 using a node relationship of rotating. After making the entries for the resource group, synchronize the RG. Start HACMP on each node. The service IP is allocated on the boot adapter of node1. To test IP takeover, stop the HACMP services on node1 (Step 2.J b). The service IP label moves to the boot adapter of node3 (a quick verification sequence follows). Note: It has been observed that the tmssa serial link should be configured in the rotating RG; the IP takeover test goes through successfully only when tmssa is configured.
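A quick way to verify the takeover, reusing commands from the cascading exercise; the grep string is the service address used in this example.

On node3, after stopping cluster services on node1:
# netstat -in | grep 193.9.200.226        (the service address should now appear on a node3 interface)
# tail -f /tmp/hacmp.out                  (watch the takeover events run to completion)
From a client:
# ping node1_svc                          (the service IP label should still answer)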
