
OPTIMIZING MIRRORING PERFORMANCE USING HACMP/XD FOR GEOGRAPHIC LVM

March 2006

WHITE PAPER

Contents
Overview
Understanding Performance: Three Areas
Running the Performance Optimization Cycle
Tuning for Optimized Performance
Case Study: Massive Two Shoes
Summary
References
About the Authors
Appendix A

For more information or to comment on this document, please email:

hafeedbk@us.ibm.com

ABSTRACT

The newest addition to IBM's HACMP/XD family, HACMP/XD for Geographic Logical Volume Manager (GLVM), is easier to configure and manage than the existing IP-based disaster recovery solutions. To ensure that the mirroring of critical data between two remote locations meets performance requirements at low cost, system administrators must plan for that performance. This paper discusses the characteristics of a data-intensive application that requires data mirroring, and explains what must be done to plan, provision, and tune for optimal performance in a cluster spanning two remote sites.


Overview
The mirroring of data between two remote sites has become significantly easier with the arrival of the new HACMP/XD for Geographic Logical Volume Manager (GLVM) solution. As an IP-based solution, GLVM sites can span an almost unlimited distance. In GLVM, the AIX 5L LVM (Logical Volume Manager) itself is responsible for mirroring the data; thus several complex and time-consuming manual tasks have been eliminated. The configuration and management of GLVM is therefore simpler, and subject to fewer geographical limitations, than other remote mirroring solutions.

But, as with any remote mirroring solution, it is still important to design and tune for optimal performance. The enemy of remote data mirroring is delay: you want to reduce the time it takes to write changes to your data to the disks at the remote site. An optimal mirroring solution is one where both the delay and the cost of the solution are minimized. If you overestimate the amount of bandwidth needed for the network carrying the disk I/O operations, you could end up with an expensive mirroring solution. Yet an under-performing mirroring solution can create a performance bottleneck for your application, thereby lowering productivity, putting your Service Level Agreements at risk, or otherwise reducing your potential revenue.

This paper helps you understand how HACMP/XD for GLVM works and where to anticipate performance bottlenecks. You will then be able to accurately estimate the network bandwidth required and target areas for tuning, which will help you minimize both delay and cost. A typical HACMP/XD for GLVM implementation is displayed below:
[Figure: two sites, Waltham and Burlington, each with one node and a disk array (PV hdisk1 at Waltham, PV hdisk2 at Burlington) holding a data mirror. The sites are connected by an XD_data TCP/IP WAN network, with XD_data service and standby IP labels on each node, and by a serial XD_rs232 network.]

Figure 1: Burlington-Waltham GLVM Cluster, a Two-Node, Two-Site Cluster

In this typical mirroring configuration we have a two-node cluster with each node at a different site. Waltham is the primary node where the application runs, and Burlington is a node at the secondary site, which uses GLVM to mirror the data used by the application. In developing this white paper we tested the above scenario within one geographic facility; the exact configuration is listed in Appendix A. The tool diskio2 was used to generate disk I/O for testing. It provides many options to read or write data of any size or frequency, and to time the results of the I/O operations. Although this white paper refers to HACMP/XD for GLVM V5.3, GLVM is also available for HACMP/XD V5.2, and all the information provided here applies to V5.2 as well.

Note: Throughout this paper (except where otherwise noted) the following abbreviations are used:
Kb = kilobits (1,024 bits)
Mb = megabits (1,048,576 bits)
KB = kilobytes (1,024 bytes)
MB = megabytes (1,048,576 bytes)

Understanding Performance: Three Areas


As mentioned earlier, the key to optimizing mirroring performance is minimizing network delay. In order to minimize this delay, we need to understand the three performance areas associated with a mirroring solution:

- Application I/O performance
- Disk and file system performance
- Network performance

Application I/O Performance


It is important to understand the frequency and size of the data your business applications write to disk, because the same volume of information has to be communicated across the network and written to the disks at the remote site. This volume of information can be used to help estimate the network bandwidth needed to carry it. When you estimate the volume of data that needs to be sent to the remote node, consider not only the average volume, but also the peak usage. It is also important to plan for growth in the volume of application data as more people use the applications. For instance, an online shopping application may see its heaviest use during the winter, which may be months after the deployment of the application. One year later, the application may experience two or more times the initial I/O activity.

Measuring Application I/O Performance


To estimate the volume of data that needs to be sent to the remote node, you can:

- Examine the formal requirements for your application, or
- Empirically study the disk I/O performance for typical and peak periods of time.

The formal requirement specifications for your application may directly or indirectly indicate the amount of I/O activity that your application must be able to support. An empirical study of the disk I/O is good not only for estimating the bandwidth needed to communicate the disk I/O, but also for creating a baseline for evaluating disk performance in the future, should disk performance become an issue in your mirroring solution. To ensure the bandwidth is adequate for the peak periods, choose a sample period that covers a representative period of peak activity. You should also evaluate the steady-state or typical performance; this is useful if, for example, your application has short-lived and/or only occasional spikes. If you have an application already running, you can use the iostat command for the disks being examined, and examine the value for Kb_wrtn (Note: iostat uses Kb to mean kilobytes). Figure 2 gives an example of running iostat on a disk used at the remote node for mirroring the application data, giving a summary of activity every 60 seconds. The value of Kbps indicates disk throughput; this will later be used to calculate the network bandwidth required for the mirroring solution.
waltham$ iostat -d hdisk1 60

System configuration: lcpu=2 drives=17 paths=1 vdisks=0

Disks:     % tm_act     Kbps      tps     Kb_read    Kb_wrtn
hdisk1        98.0     2134.2    533.6        0       128320
hdisk1        98.0     2133.7    533.3        0       128020
hdisk1        98.0     2126.1    531.6        0       127564
hdisk1        98.3     2115.1    528.6        0       126904
hdisk1        97.8     2145.9    536.5        0       128752
hdisk1        98.3     2213.5    553.5        0       132808
hdisk1        98.0     2216.7    554.1        0       132992
hdisk1        81.2     1840.2    460.0        0       110412
hdisk1         0.0        0.0      0.0        0            0

Figure 2: Example of the iostat command
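For example, averaging the Kbps column over the active intervals in Figure 2 gives roughly 2,100 KB/sec, or about 2.1 MB/sec of write throughput that the mirroring network will later have to carry.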

Alternatively, you can use the filemon command if you are using a file system (/gmfs1 in this example) on top of logical volumes. Figure 3 shows an example of using the filemon command to illustrate the disk activity (Note: filemon also uses Kb to mean kilobytes).


waltham$ filemon -o /tmp/lv_filemon_$(date +%Y%m%d%T) -O lv; sleep 60; trcstop
Enter the "trcstop" command to complete filemon processing
[filemon: Reporting started]
[filemon: Reporting completed]
[filemon: 60.964 secs in measured interval]
waltham$ head -15 /tmp/lv_filemon_2005090615:35:31
Tue Sep  6 15:49:42 2005
System: AIX wolverine Node: 5 Machine: 00088C9A4C00
Cpu utilization: 44.2%

Most Active Logical Volumes
------------------------------------------------------------------------
util    #rblk    #wblk    Kb/s     volume           description
------------------------------------------------------------------------
0.99        0   842496   6918.4    /dev/lvGMVG1     /gmfs1
0.00        0      936      7.7    /dev/hd3         /tmp
0.00       40      176      1.8    /dev/hd2         /usr
0.00        0      128      1.1    /dev/hd4         /
0.00        0       64      0.5    /dev/hd8         jfslog

Figure 3: Example of the filemon command.

In Figure 3, the number of write blocks can be used to determine disk throughput. Since each block is half a kilobyte (512 bytes), divide the #wblk amount by two to get kilobytes, then by the elapsed time in seconds to get the disk throughput.
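Applying this to the /dev/lvGMVG1 line in Figure 3, for example:

Throughput = 842,496 blocks / 2 = 421,248 KB
           = 421,248 KB / 60.964 sec ≈ 6,910 KB/sec

which agrees closely with the 6,918.4 Kb/s that filemon itself reports.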

The Size of Writes


Testing has shown that applications that issue fewer, larger writes tend to have better I/O performance than applications that issue many small writes. This holds whether the data is on a raw logical volume or on a file system, although the presence of a file system does affect throughput. This is illustrated in the following tables. Writing to a raw logical volume with no file system over a 100 Mbps (megabits per second) network yielded the following results:
Number of Writes    Average Size per Write (bytes)    Throughput (MB/sec)
100                 1,000                             0.101965
100                 10,240                            0.812464
100                 102,400                           1.05944
100                 1,024,000                         1.455486
100                 10,240,000                        1.483685

Table 1: I/O Throughput with a Raw GLVM Volume


The increase in throughput shows that performance is better with large blocks of data. Similarly, with a JFS2 file system, the performance of I/O operations improves as the size of the block being written grows, although sizes greater than 100KB perform equivalently to raw volume I/O:
Number of Writes    Average Size per Write (bytes)    Throughput (MB/sec)
1,000,000           10,000                            47.04641
100,000             100,000                           91.798174
100,000             102,400                           105.295252
50,000              200,000                           1.746223
10,000              1,000,000                         1.340575
10,000              1,024,000                         1.767009

Table 2: I/O Throughput on a GLVM Volume with a JFS2 File System

Both tables demonstrate that fewer, larger writes perform better than smaller, more frequent writes.

Disk/File System Performance


HACMP/XD for Geographic LVM V5.3 writes information to the remote node synchronously. In other words, for every write operation performed by the application writing to mirrored storage, the system waits until the write is complete both at the local node and at the remote node before returning control to the application. This adds some delay to the original application, but more importantly it means that slow disk and system I/O on the remote side will affect the overall performance of the cluster application. Although GLVM places no requirement on the type of storage used, fast storage media on the remote side will benefit overall application performance. Since changes to the file system are not sent to the remote node until they are written to the volume group, it is recommended, for purposes of availability, to tune the JFS/JFS2 file system so that writes of changed (dirty) pages happen more frequently. Shortening the interval syncd uses to flush pages to disk is the preferred method, although it is possible to achieve finer control over the flushing of pages to disk by specifying explicit tuning parameters for random and sequential write-behind. For more information about these parameters and the effects of changing them, see the File System Performance Tuning topic in the Performance Management Guide found in the pSeries and AIX Information Center.
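As a sketch of the finer-grained approach (assuming a JFS2 file system; tunable names and defaults vary by AIX level, so verify them with ioo -a on your system), the write-behind parameters can be examined and adjusted with the ioo command:

# list current I/O tuning parameters and their values
ioo -a
# trigger random write-behind once 32 random dirty pages accumulate for a file
ioo -o j2_maxRandomWrite=32
# number of pages per cluster for sequential write-behind
ioo -o j2_nPagesPerWriteBehindCluster=32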

Measuring Disk I/O Performance


For the purposes of this white paper, measuring the volume of disk activity is the principal concern. The tools for measuring this are iostat and filemon, introduced in the Measuring Application I/O Performance section above. To get a more complete picture of disk performance, use the tools described in the Monitoring Disk I/O topic in the Performance Management Guide found in the pSeries and AIX Information Center.

Raw Volumes vs. JFS2


Some applications do not give the administrator a choice as to whether data should be written to a volume with or without a file system. However, some database vendors do give the administrator this choice. For these types of applications, it is worthwhile to compare the I/O performance when mirroring a JFS2 file system and mirroring a raw volume. The JFS2 file system is optimized for writing smaller blocks of data, sized less than 100KB. The write-cache mechanism associated with the file system produced a better data throughput both on the local node and at the remote mirror, as shown in Table 3 below:
Size of Write Blocks    File System    Average Throughput (MB/sec)
100KB                   No (raw)       1.4941
100KB                   JFS2           6.8938
1 MB                    No (raw)       1.455486
1 MB                    JFS2           1.340575

Table 3: File System Type vs. Average Throughput at the Remote Site

When the size of the blocks of data was greater than the 100KB boundary, the performance and throughput of a raw logical volume versus a JFS2 file system was nearly identical. If you have a choice of whether to use a file system or a raw disk, and the writes are small (less than 100KB), consider using a file system like JFS2.

Network Performance
Three factors can cause a delay in the transmission of changes to the data: inadequate bandwidth, excessive latency, and saturation.

Network Bandwidth
If there is one facet of system performance that stands out over the rest, it is the size of the network pipe, or the network bandwidth, between sites. The network bandwidth must be sufficient and robust enough to carry the I/O traffic for mirroring. Since HACMP/XD for GLVM does not support multilinking (using multiple networks), we recommend that you accurately estimate the bandwidth needed between the sites before procuring the network. If you fail to make a fair estimate during the planning stage of your geographically dispersed cluster, it may be costly to upgrade or to install a replacement network with greater capacity.

Analyzing the Effect of Inadequate Network Bandwidth


The following example illustrates the effect of inadequately provisioned network bandwidth. The effect is significant: delays will be experienced both in the writing of mirrored data and in the originating application. When we set up testing for an extreme load, we saw that mirrored write throughput was throttled. For each network speed, the peak throughput is comparable to 75% of the theoretical maximum (to avoid network saturation), but the average throughput was less (see Table 4 below). It is worth noting that the average performance is due more to the behavior of the application writing data than to network conditions.
Network Speed    Theoretical Maximum    75% of Maximum    Average Throughput    Peak Throughput
(Mb/sec)         (MB/sec)               (MB/sec)          (MB/sec)              (MB/sec)
10               1.220                  0.91552           0.7742                0.7888
100              12.207                 9.1552            6.8938                9.6872
1000             122.07                 91.552            67.454                89.472

Table 4: Network Throughput under Different Network Speeds

The corresponding time to write data to disk was also affected, slowing down when a 10 Mbps data network was used to carry too much data. Table 5 illustrates this:
Number of Operations    Size (bytes)    Time (MB/sec)
1,000,000               10,000          57.85429
100,000                 100,000         52.83522
100,000                 102,400         108.7942

Table 5: The Effect of Over-saturating a 10 Mbps Network

It is worth noting several things observed during the testing procedure:

- The choice of a file system doesn't matter; even a raw logical volume exhibits this poor performance for write sizes greater than 100KB.
- It is possible to write the same volume of information with more, smaller blocks of data. However, this can lead to unpredictable or disastrous performance issues if there are unexpected peaks in the amount of data being written.

To summarize, in order to have a robust geographic mirroring solution for your applications, plan and provision the correct network bandwidth taking into account peaks and growth in network activity.

Measuring Network Throughput


To validate the network throughput, we recommend that you use the netstat -v or entstat commands to display the number of bytes being received by network adapters on your data network. The commands produce similar results; however, entstat shows information for a single adapter (see Figure 4), whereas netstat -v shows information for all adapters (see Figure 5).


burlington$ entstat ent0
-------------------------------------------------------------
ETHERNET STATISTICS (ent0) :
Device Type: 10/100/1000 Base-TX PCI-X Adapter (14106902)
Hardware Address: 00:11:25:08:18:43
Elapsed Time: 0 days 0 hours 0 minutes 2 seconds

Transmit Statistics:                Receive Statistics:
--------------------                -------------------
Packets: 184                        Packets: 2804
Bytes: 12686                        Bytes: 4215556
Interrupts: 0                       Interrupts: 411
Transmit Errors: 0                  Receive Errors: 0
Packets Dropped: 0                  Packets Dropped: 0
                                    Bad Packets: 0
[further output omitted]

Figure 4: Example of the entstat command


burlington$ netstat -v
-------------------------------------------------------------
ETHERNET STATISTICS (ent0) :
Device Type: 10/100/1000 Base-TX PCI-X Adapter (14106902)
Hardware Address: 00:11:25:08:18:43
Elapsed Time: 0 days 0 hours 7 minutes 55 seconds

Transmit Statistics:                Receive Statistics:
--------------------                -------------------
Packets: 181848                     Packets: 2394203
Bytes: 11110492                     Bytes: 3615161998
Interrupts: 0                       Interrupts: 364649
Transmit Errors: 0                  Receive Errors: 0
Packets Dropped: 0                  Packets Dropped: 0
                                    Bad Packets: 0
[further output omitted]

Figure 5: Example of the netstat -v command

If you are trying to estimate throughput for your adapters, it is helpful to first run entstat -r on the adapter used on the data network; this resets the counters. For example, to measure network throughput, sample the amount of traffic over a period of 60 seconds on the adapter on the data network:

entstat -r ent0; sleep 60; entstat ent0

The number of bytes under the Receive Statistics column is the amount of traffic received during that minute; divide this number by 60 to get a per-second rate. This is your network throughput, which you can compare against what you should be getting for your network.
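For example (with illustrative numbers rather than values from the figures above), suppose the second entstat report shows 314,572,800 bytes received over the 60-second sample:

Throughput = 314,572,800 bytes / 60 sec = 5,242,880 bytes/sec (5 MB/sec)
           = 5 MB/sec * 8 ≈ 41.9 Mb/sec

On a 100 Mbps data network this leaves ample headroom; on a 45 Mbps link it would be approaching saturation.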

Network Latency
The network latency, or delay, can be a surprise factor in estimating the efficiency of the mirroring performance of your cluster. In general, each network technology has a delay for each packet that is sent from one network device to another. Each network hop, such as a router or gateway on the way from the source to the destination, introduces its own delay. The total network latency is the accumulation of the delays from each network device. Therefore, to ensure the best mirroring performance for your application, you should provision a network with a guaranteed maximum latency.

Measuring Network Latency


You can use the ping and traceroute commands to determine network latency; however, the results from these commands are only helpful if ICMP traffic (used by ping and traceroute) is routed the same way as TCP/IP traffic. ISPs and network carriers can route each protocol separately; through bandwidth shaping, they can choose to degrade the performance of one protocol for the benefit of other protocols. Therefore, use these commands only after checking that ICMP traffic is routed in the same way as the TCP/IP traffic. The following is an example of using ping to reach a remote site:
waltham$ ping burlington
PING burlington: (192.168.3.5): 56 data bytes
64 bytes from 192.168.3.5: icmp_seq=0 ttl=125 time=372 ms
64 bytes from 192.168.3.5: icmp_seq=1 ttl=125 time=396 ms
64 bytes from 192.168.3.5: icmp_seq=2 ttl=125 time=394 ms
64 bytes from 192.168.3.5: icmp_seq=3 ttl=125 time=397 ms
64 bytes from 192.168.3.5: icmp_seq=4 ttl=125 time=230 ms
64 bytes from 192.168.3.5: icmp_seq=5 ttl=125 time=95 ms
64 bytes from 192.168.3.5: icmp_seq=6 ttl=125 time=80 ms

----burlington PING Statistics----
7 packets transmitted, 7 packets received, 0% packet loss
round-trip min/avg/max = 80/280/397 ms
$

Figure 6: Example of ping command to Illustrate Network Latency

The two things to note about the results of the ping command in Figure 6 are that the average round-trip time is 280ms (or 0.28 seconds) and that no packets were lost. This shows that although the network is reliable (no packets lost), there was about a quarter of a second of delay between sending a request and receiving a response. On systems on the same local network, you would see round-trip times near 0ms; on remote systems, you would tend to see larger numbers. Clearly you'd like to have as little delay as possible, to ensure the data changes are mirrored at the remote site as close to real time as possible. traceroute is a network debugging tool used to indicate where on the network delays may occur. Since network traffic is routed through routers and other network devices between the source and destination nodes, each step or hop on the way can introduce delay. traceroute allows you to see the delay at each hop on the way to the destination, as shown in Figure 7.

waltham$ traceroute burlington
trying to get source for burlington
source should be 10.10.10.8
traceroute to burlington (192.168.3.5) from 10.10.10.8 (10.10.10.8), 30 hops max
outgoing MTU = 1500
 1  10.10.1.1 (10.10.1.1)  13 ms  16 ms  16 ms
 2  10.10.0.1 (10.10.0.1)  1 ms  1 ms  1 ms
 3  * * *
 4  192.168.1.1 (192.168.1.1)  100 ms  91 ms  75 ms
 5  burlington (192.168.3.5)  87 ms  79 ms  77 ms
$

Figure 7: Example of the traceroute command to Illustrate Network Latency

What is notable about the results of the traceroute command in Figure 7 is that a significant delay of about 100ms occurs between network hops 3 (unknown IP address) and 4 (192.168.1.1). This accounts for the majority of the delay, but because hop 3 is not responding to ICMP requests (hence the * * *), it is difficult to determine which of the two hops is causing it. Your network administrators or ISP should be able to help you determine where the network delays occur and address them.

Network Saturation
Ethernet media has a theoretical limit on the amount of data that can be sent through it. When the number of packets sent over the network reaches this maximum, the network is saturated. In special cases where two machines operate on a closed network with synchronized I/O, the throughput can approach this limit. But Ethernet traffic is rarely synchronized this way, and rarely are there only two machines on the wire. As a result, network collisions occur as the throughput approaches saturation, and a large number of collisions degrades the performance of the network. It is important to build a certain amount of overhead into the calculations for throughput, usually 25%, to avoid network saturation.

Measuring the Effects of Saturation


Network administrators use LAN analyzers or similar specialized devices to measure network utilization. However, the effects of saturation can also be detected and measured by examining the collision rate on network devices associated with the XD_data network. The command to examine the collision count is entstat <device>, where the device is the one on the XD_data network, as in:


burlington$ entstat ent0
-------------------------------------------------------------
ETHERNET STATISTICS (ent0) :
Device Type: IBM 10/100 Mbps Ethernet PCI Adapter (23100020)
Hardware Address: 00:04:ac:5e:4d:3e
Elapsed Time: 90 days 2 hours 52 minutes 4 seconds

Transmit Statistics:                Receive Statistics:
--------------------                -------------------
Packets: 360716570                  Packets: 361676332
Bytes: 128358396889                 Bytes: 104279932996
[output omitted]
Max Collision Errors: 327           No Resource Errors: 0
Late Collision Errors: 0            Receive Collision Errors: 0
[output omitted]

Figure 8: Example of collisions in entstat output

netstat -v produces similar output, but for all devices. If the collision rate (collisions divided by packets transmitted) is greater than 0.10 (10%), the network has become saturated and has to be either reorganized or partitioned. In this example, the Max Collision Errors count divided by the transmit packet count is much less than 10%, which means the network is not exhibiting effects of saturation.
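Working through the numbers in Figure 8:

Collision rate = 327 collisions / 360,716,570 packets transmitted
              ≈ 0.0000009 (0.00009%)

which is far below the 10% threshold.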

Effect of Mirroring on Application Performance


In an ideal environment, when the local and remote nodes of a mirror are on the same network, there should be an imperceptible difference in the time it takes to write data with mirroring when compared to just writing to the local disk. Table 6 demonstrates that when the network is adequate for carrying the I/O load, there is almost no difference; at 100 Mbps and 1000 Mbps it takes nearly the same time as writing to the local disk. However, when the network bandwidth is inadequate, then response time is much slower: on a 10 Mbps network, the response time is nearly nine times that of writing to the local disk.
Network Speed                    Response Time for 1,000MB
(local disk only, no mirror)     129.10 seconds
1000 Mbps                        131.18 seconds
100 Mbps                         131.09 seconds
10 Mbps                          1110.76 seconds

Table 6: Response Time for 1,000MB vs. Network Speed


Running the Performance Optimization Cycle


Developing an optimized mirroring solution is an iterative process that begins with specifying your performance requirements: how much data needs to be transferred to the remote site? From this answer, you provision a network that can carry the volume of disk I/O, and then set up GLVM to mirror your logical volumes. The next step is to measure the performance of your mirroring solution. If the performance does not meet the requirements, the solution has to be tuned. Tuning involves making changes to the mirrored solution and re-measuring the performance to check the effectiveness of each change.
[Figure: the performance optimization cycle. Develop Performance Requirements, then Measure Performance; if the requirements are not met, Tune Performance and measure again; exit the cycle when the Performance Requirements are Met.]

Figure 9: The Performance Optimization Cycle

Determining Mirroring Performance Requirements


When you provision a network to carry the mirrored disk I/O traffic, you will need to give your network provider or ISP two estimates: one for the average network bandwidth, and one for the peak network bandwidth. Calculating the bandwidth necessary to carry the mirroring traffic is straightforward because it is based on the volume of disk I/O generated by your application. The number of bytes that must be transferred across the network per second is the same as the number written to disk, plus some overhead to avoid network saturation. Use the following formula for this estimate:
Bandwidth (bits per second) = disk writes (bytes per second) * 8 / 0.75

To provide these figures, you will have to either estimate your bandwidth based on your application's requirements, or measure your disk activity using iostat or filemon as described in Measuring Application I/O Performance. For example, suppose a database application at its peak writes out 4 MB per second; the bandwidth necessary to carry the traffic is:

Bandwidth = (4,096,000 bytes/sec) * 8 / 0.75
          = 43,690,666 bits/sec
          = 43.691 Mbps

This speed falls within the threshold of a T3, which can handle 44.736 Mbps. Keep in mind that with HACMP/XD for GLVM V5.3 you can set up a mutual takeover cluster configuration, where applications can exist at both sites, and therefore mirroring occurs in both directions at the same time. In this case, the network bandwidth has to be sufficient to accommodate both streams of data writes. You should estimate performance measurements for the disk I/O for both applications, and add them together before calculating an overall bandwidth for the network. Once you have your estimates, you can discuss with your network provider what kind of network solution you must choose so that it meets the requirements for bandwidth and cost. There will usually be several choices, depending on cost and expansion or growth capabilities.
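For instance (with illustrative numbers, not measurements from this paper's tests), if the application at one site peaks at 4 MB per second of writes and the application at the other site peaks at 2.5 MB per second, size the link for the sum of the two mirroring streams:

Bandwidth = (4,096,000 + 2,560,000 bytes/sec) * 8 / 0.75
          = 70,997,333 bits/sec
          ≈ 71 Mbps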

Establish a Performance Baseline


After you finish setting up your mirroring solution, take a first measurement of its performance; this will be the baseline for further performance measurements. To establish the performance baseline, generate disk I/O that matches your estimate for a peak load. You can do this by using your application to generate the load, or by using a tool to write large volumes of data to the mirrored volume. While generating this I/O load, monitor the disk activity on the remote end to determine whether it is writing the same volume of information as the source end. This can be done using the iostat or filemon commands mentioned earlier.
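As a simple sketch (diskio2, used for this paper's tests, or your own application can serve the same purpose; the file system and disk names are illustrative), a sustained write load can be generated with dd on the local node while iostat runs at the remote site:

waltham$ dd if=/dev/zero of=/gmfs1/baseline.tmp bs=1m count=1024
burlington$ iostat -d hdisk1 60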

Remove Performance Bottlenecks


If you find that the current mirroring solution is not meeting the performance objectives, the next steps are to find the performance bottlenecks and remove them. The pSeries and AIX Information Center has excellent information on how to determine AIX 5L performance bottlenecks and tune to remove them. For a remote mirroring solution, finding the bottlenecks involves answering the following questions and addressing the issues they raise.

Has the disk I/O activity of your application exceeded the throughput requirements for the network? By measuring the volume of disk activity on the local system, you can determine whether the volume has exceeded the initial performance requirements. This can happen if the application is used more heavily than initially expected.

Is the network adequately supporting the volume of disk I/O operations? Network throughput can deteriorate due to unexpected changes in service, routing changes, or saturation caused by a higher-than-anticipated traffic load. By using the network commands netstat, entstat, ping, and traceroute mentioned earlier, you should be able to see where the network bottlenecks lie.

Are the remote disks performing adequately? If the remote disks are slow, perhaps due to contention with other applications, tuning the disk operations may improve performance. By using the iostat or filemon utilities on the remote node, you can determine whether there are bottlenecks.

Re-measure for Effectiveness


After every attempt to improve performance, it is important to retest the performance of the system. Tuning can be tricky because a change intended to improve one parameter may enhance or worsen others. If the performance objectives are met, the result becomes the new baseline for performance, and your mirroring solution is ready to be put into production. If a change worsens the performance, back it out and try another approach instead.

Tuning for Optimized Performance


Overview
Performance tuning alone will never rescue a mirroring solution whose network bandwidth is inadequate. It is crucial that the network be provisioned to match the volume of disk activity. However, if your solution falls just short of its performance objectives, it is possible to squeeze out some additional performance. The following sections describe tuning options to try.

Network Performance Options


A series of network options devoted to bulk data transfers should be examined to see whether changing them would improve performance. These are: MTU size, tcp_recvspace, tcp_sendspace, sb_max, tcp_nodelayack, and tcp_pmtu_discover. These and other network options are discussed more fully in the TCP and UDP Performance Tuning topic in the Performance Management Guide found in the pSeries and AIX Information Center. AIX provides reasonable defaults for these values, but the addition of network adapters to the system may cause these values to change. Increase the MTU size and change the other related options only if the adapters and intervening network gear can handle the larger MTU size. If the network option tcp_recvspace is greater than 64KB (kilobytes), set rfc1323 to enable the TCP window scaling option. You can use either the chdev command or the ifconfig command for each adapter on the XD_data network as follows:


ifconfig en0 rfc1323 1

or
chdev -l en0 -a rfc1323=1

If you use the chdev command, the values are set permanently in the ODM, but the change does not take effect until after a reboot. If you use the ifconfig command, the change takes effect immediately but lasts only for the current session; to make it persistent across reboots, add the ifconfig command shown above to the /etc/rc.net file. The network option tcp_nodelay, normally considered along with rfc1323, is set explicitly by GLVM; setting this option globally or on the interface will have no effect on mirroring performance.
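As a sketch (the values are illustrative; appropriate sizes depend on your network's bandwidth-delay product), the bulk-transfer options listed above can be set system-wide with the no command:

no -o sb_max=1048576       # must be at least as large as the socket buffer sizes below
no -o tcp_sendspace=262144
no -o tcp_recvspace=262144
no -o rfc1323=1            # TCP window scaling, needed when buffer sizes exceed 64KB

Values set this way last only until the next reboot.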

Disk Performance Options


Network performance issues have a much greater impact than disk performance issues in a mirroring solution. As an example, in a test where a pair of disks mirrored using GLVM was compared against striped disks mirrored with GLVM, the difference in performance was negligible. The following suggestions may be helpful, but only after all the network performance tuning options have been tried:

- When using disk technologies such as SSA and SCSI, consider disk adapters with fast write cache on the remote nodes to improve disk I/O performance. Multiple adapters work better to distribute the load. This is discussed more fully in the Using Fast Write Cache topic in the Performance Management Guide found in the pSeries and AIX Information Center.
- Use striping to force the LVM to alternate writes evenly across the disks. If applicable, stripe across as many disk adapters as possible; IBM suggests a stripe size of 64KB (kilobytes). If you are using multiple disks, give them their own adapters if possible. This is discussed more fully in the Changing Logical Volume Attributes That Affect Performance topic in the Performance Management Guide found in the pSeries and AIX Information Center.

File System Performance Options


If you are not using raw volumes, adjust the syncd interval to maintain a continuous data load. HACMP/XD for GLVM is best suited to handling a continuous load, as opposed to bursts of data. So if you have to use file systems, experiment with a syncd interval of around 30 to 40 seconds. You can change this value in /sbin/rc.boot where it invokes the syncd daemon, then reboot the system for it to take effect; for the current session, kill the syncd daemon and restart it with the new interval. This increases the rate at which data is written and prevents the file systems from becoming I/O bottlenecks. Depending on the frequency, the size, and the location of the write operations performed by your application, it may also be possible to tune the sequential and random write-behind options for the file system you are using in order to commit pending I/O to disk faster. Like other performance tuning options, this has the potential of degrading your I/O performance, so it should be considered carefully. The File System Performance Tuning topic in the Performance Management Guide found in the pSeries and AIX Information Center discusses the write-behind options.
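A sketch of the restart procedure just described (the process ID is a placeholder; the interval invoked in /sbin/rc.boot is typically 60 seconds by default):

# find the running syncd and note its PID and current interval
ps -ef | grep syncd
# stop it, then restart it with a 30-second interval
kill <syncd-pid>
nohup /usr/sbin/syncd 30 > /dev/null 2>&1 &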

HACMP Performance Options


The use of HACMP does not affect the performance of the mirrored solution; it simply improves the availability and the speed of recovery upon a site failure. These considerations are explained in the HACMP/XD for Geographic LVM: Planning and Administration Guide.

Case Study: Massive Two Shoes


To explain the process of planning and implementing optimal performance in a mirroring solution with GLVM, a scenario was developed and tested featuring a fictional shoe company that needs to make its data highly available. Massive Two Shoes (MTS), a foot apparel retail outlet located on the Route 128 beltway in Massachusetts, has a data center in Burlington, MA. The company wishes to add another data center in Waltham, MA to mirror its customer and sales databases. Activity is anticipated to be steady (little growth) during the year, but recent history shows that it can double during peak times such as holidays and right before the Boston Marathon.

Following the steps in Running the Performance Optimization Cycle, the data administrator for MTS measures the amount of disk activity during an average period of sales using iostat, and receives the following information:
burlington$ iostat hdisk10 60

Disks:     % tm_act     Kbps      tps     Kb_read    Kb_wrtn
hdisk2        9.8      1985.9     2.0         0       119172

Figure 10: Massive Two Shoes Disk Activity

This shows that the applications write 1.985MB per second on average. Since this volume historically doubles, the network must carry at most 3.97MB per second of disk activity. Using the formula in Determining Mirroring Performance Requirements, the data administrator calculates 21.06 Mbps on average, with peaks up to 42.13 Mbps. He takes these numbers to the network administrator, who performs the next steps. The network administrator talks with a few network providers and decides that a fractional T3 meets their needs, since a full T3 can carry 44.736 Mbps. The network provider installs the network and helps to establish communication between the two sites. He runs a traceroute and finds an unexpected network delay, as shown in Figure 11.

burlington$ traceroute waltham
trying to get source for waltham
source should be 10.70.28.1
traceroute to waltham (10.70.28.2) from 10.70.28.1 (10.70.28.1), 30 hops max
outgoing MTU = 1500
 1  burl-fw (10.70.28.1)  2 ms  1 ms  0 ms
 2  waltham-fw (10.60.1.1)  * * *
 3  waltham (10.60.1.1)  247 ms  149 ms  183 ms
$

Figure 11: Network Delays to Waltham

After much wrangling, the network provider fixes the issues and the delay is reduced to below 10ms. The data administrator configures HACMP/XD for GLVM to mirror the volumes used for application data storage, and lowers the syncd interval to 30 seconds. Measurements show that the average disk I/O meets the needs of the application, but the data administrator wonders whether he can shoehorn additional performance into the solution to anticipate the peak periods.
waltham$ iostat hdisk12 60

Disks:     % tm_act     Kbps       tps     Kb_read    Kb_wrtn
hdisk12       9.8      2018.97     2.0         0       121138

Figure 12: Disk Performance at the Remote Mirror

The first target is application performance. Since the application is a legacy database developed by a third party, there was no money budgeted for changing the application's I/O characteristics. The next target is network performance. Since the network provider has already minimized the network delay, the next step is to tune the network options to see whether they have an impact on performance. The following parameters were set:

tcp_sendspace = 640K
tcp_recvspace = 640K
rfc1323 = 1

After making these changes to the network settings, the performance was tested again and showed little difference, as Figure 13 illustrates. Any differences can be attributed to variation in the load from the application. The fact that there are no substantial differences is a result of the network being provisioned correctly from the beginning.
waltham$ iostat hdisk12 60

Disks:     % tm_act     Kbps       tps     Kb_read    Kb_wrtn
hdisk12       9.8      2007.08     2.0         0       120425

Figure 13: Disk Performance at the Remote Mirror with Tuning

The data administrator can rest assured that this solution is optimal.

Summary
The key to optimal mirroring performance is planning: understanding what volume of disk activity will be generated by the applications, and procuring the appropriate network bandwidth to carry that volume. In order to achieve maximum mirroring performance with HACMP/XD for GLVM V5.3, always adequately plan and provision the network architecture for carrying the disk I/O traffic between the two sites. To achieve optimal mirroring performance, use these recommended approaches:

- Estimate the disk activity performed by the applications at one site, or at both sites if mirroring occurs in both directions.
- Calculate the network bandwidth required for the mirroring.
- If there is a choice between using a raw logical volume or a JFS2 file system, use JFS2 if the sizes of the writes are less than 100KB.
- Generate a peak load and capture a baseline for performance.
- Tune using the techniques provided, test whether they were effective, and retune until the performance goals have been met.

With these performance guidelines, you can build an optimal mirroring solution using HACMP/XD for GLVM.

References

IBM Publications:
HACMP/XD for Geographic LVM: Planning and Administration Guide, August 2005, SA23-1338-02
High Availability Geographic Cluster for AIX: Planning and Administration Guide, Version 2 Release 4, September 2003, SC23-1886-04

IBM White Papers:
Geomirror Performance for HAGEO and GeoRM: A Commentary, September 10, 2001
Planning Considerations for Geographically Dispersed Clusters using IBM HACMP/XD: HAGEO Technology, June 2004

IBM Resource Centers:
pSeries and AIX Information Center, http://publib.boulder.ibm.com/infocenter/pseries/index.jsp


About the Authors

Wayne Wylupski
Wayne Wylupski is a Senior Engineer for Availant, Inc. He has over 20 years of experience in the IT field as a software engineer, network engineer, and systems designer. He has helped develop HACMP for the past three years while at Availant.

Chris Cox
Chris Cox is a Principal Quality Assurance Engineer at Availant, Inc. She has 20 years of experience in the IT field, initially in system administration and customer support, and then in quality assurance. Chris has worked for the past seven years on assuring the quality of HACMP.

Appendix A
Description of the hardware and software used in the test case:

- Two nodes running AIX 5L V5.3, one designated as the local node and the other as the remote node.
- One private TCP/IP network connecting the local and remote nodes, supporting 10/100/1000 Mbps.
- Ten disks connected to the local node and five disks connected to the remote node. These disks are configured with five local volume groups and five GLVM volume groups; each volume group has five logical volumes with two mirror copies.

The utility diskio2 used to generate disk I/O for testing can be found in the Appendix of Geomirror Performance for HAGEO and GeoRM: A Commentary, available from the IBM Web site at: www.ibm.com/servers/eserver/pseries/software/whitepapers/gmdperf.html


© IBM Corporation 2006
IBM Corporation, Marketing Communications, Systems Group
Route 100, Somers, New York 10589
Produced in the United States of America, March 2006
All Rights Reserved

This document was developed for products and/or services offered in the United States. IBM may not offer the products, features, or services discussed in this document in other countries. The information may be subject to change without notice. Consult your local IBM business contact for information on the products, features and services available in your area. All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only.

IBM, the IBM logo, the e-business logo, AIX 5L, and HACMP are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml. Other company, product, and service names may be trademarks or service marks of others.

Information concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of the non-IBM products should be addressed with the suppliers.

The IBM home page on the Internet can be found at http://www.ibm.com. The System p home page on the Internet can be found at http://www.ibm.com/systems/p.

