
Introduction

AIX 5L Version 5.3 is the latest version of the AIX operating system. It offers simultaneous multi-threading (SMT) on eServer p5 systems to deliver industry-leading throughput and performance, and its support for advanced virtualization helps you dramatically increase server utilization and consolidate workloads for more efficient management. A review of computing history and operating systems shows that computer scientists have developed many CPU scheduling policies: first-in, first-out (FIFO), shortest job first, and round robin are just a few. Scheduling policies are important because a single policy might not be best suited to all applications. Some applications in certain workloads run well under the default scheduling policy, but the same applications with a different workload might require a scheduling-policy adjustment to achieve optimal performance.

Note: This article is an update for AIX 5.3 performance. It contains enhancements and updates that emphasize AIX 5L Version 5.3 features, tools, and capabilities; advanced virtualization is not discussed in this article.

What is SMT?

SMT is the ability of a single physical processor to concurrently dispatch instructions from more than one hardware thread. In AIX 5L Version 5.3, a dedicated partition created with one physical processor is configured as a logical two-way by default, so two hardware threads can run on one physical processor at the same time. SMT is a good choice when overall throughput is more important than the throughput of an individual thread; Web servers and database servers, for example, are good candidates for SMT.

Viewing processor and attribute information

By default, SMT is enabled, as shown in Listing 1 below.

Listing 1. SMT
# smtctl

This system is SMT capable.

SMT is currently enabled.

SMT threads are bound to the same physical processor.

Proc0 has 2 SMT threads
Bind processor 0 is bound with proc0
Bind processor 2 is bound with proc0

Proc2 has 2 SMT threads
Bind processor 1 is bound with proc2
Bind processor 3 is bound with proc2

# lsattr -El proc0
frequency   1656376000      Processor Speed        False
smt_enabled true            Processor SMT enabled  False
smt_threads 2               Processor SMT threads  False
state       enable          Processor state        False
type        PowerPC_POWER5  Processor type         False

The smtctl command gives privileged users and applications the ability to control the utilization of processors with SMT support. With this command, you can turn SMT on or off. The smtctl command syntax is:
smtctl [-m off | on [ -w boot | now] ]
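For example (a minimal illustration of the syntax above, run as root), the first command below turns SMT off immediately on the running system, while the second turns it back on but defers the change until the next boot:

# smtctl -m off -w now
# smtctl -m on -w boot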

What are shared processors?

Shared processors are physical processors that are allocated to partitions on a timeslice basis. Any physical processor in the shared processor pool can be used to meet the execution needs of any partition using the shared processor pool. An eServer p5 system can contain a mix of shared and dedicated partitions. A partition must be all shared or all dedicated, and you cannot use dynamic LPAR (DLPAR) commands to change between the two; you need to bring down the partition and switch it from dedicated to shared, or vice versa.

Processing units

After a partition is configured, you can assign it an amount of processing units. A partition must have a minimum of 1/10 of a processor; once that requirement is met, you can configure processing units at a granularity of 1/100 of a processor. A partition that uses shared processors is often called a shared partition; a dedicated partition is one that uses dedicated processors. Each partition is configured with a percentage of execution dispatch time for each 10 millisecond (ms) timeslice. For example:

A partition with 0.2 processing units is entitled to 20 percent capacity during each timeslice. A partition with 1.8 processing units is entitled to 18ms processing time for each 10ms timeslice (using multiple processors).

There is no accumulation of unused cycles. If a partition does not use the entitled processing capacity, the excess processing time is ceded back to the shared processing pool.

Partitions with shared processors are either capped or uncapped. A capped partition is assigned a hard capacity limit. An uncapped partition that needs extra CPU cycles (more than its configured processing units) can utilize unused capacity in the shared pool.

Scheduling algorithms

AIX 5 implements the following scheduling policies: FIFO, round robin, and a fair round robin. The FIFO policy has three different implementations: FIFO, FIFO2, and FIFO3. The round robin policy is named SCHED_RR in AIX, and the fair round robin is called SCHED_OTHER. We discuss these policies in greater detail in the upcoming sections.

Scheduling policies can have a major impact on system performance (response time and throughput), depending on how you assign and manage them. For example, FIFO is a good choice for a job that uses a lot of CPU, but it can also choke out all of the other jobs waiting in line. A basic round robin gives a "timeslice" or "quantum" to each job in a time-shared manner. As a result, it tends to discriminate against I/O-intensive tasks, since those tasks often give up the CPU voluntarily due to I/O wait. The fair round robin is "fair" because scheduling priorities change as jobs accumulate quanta of CPU time during execution. This allows the operating system to demote a CPU hugger so that an I/O-bound job has a fair chance to use the CPU resource.

Let's go over two important concepts before getting into the scheduling details: the nice value and the AIX priority and run queue structure.

The nice and renice commands

AIX has two important scheduling commands: nice and renice. A user job in AIX carries a base priority level of 40 and a default nice value of 20. Together, these two numbers form the default priority level of 60. This value applies to most of the jobs you see in a system. When you start a job with a nice command, such as nice -n 10 myjob, the number 10 becomes the delta_NICE. This number is added to the default 20 to create the new nice value of 30. In AIX, the higher this number, the lower the priority. Using this example, your job now starts with a priority of 70, which is 10 levels worse in priority than the default. The renice command applies to a job that has already started. For example, the renice -n 5 -p 2345 command causes process 2345 to have a nice value of 25. Note that the renice value is always applied to a base nice of 20, regardless of the current nice value of the process.

AIX priority and run queue structure

A thread carries a priority in the range from 0 to 255 (the range is from 0 to 127 on systems prior to AIX 5). Priority 0 is the highest or most favorable, and 255 is the lowest or least favorable. AIX maintains a run queue in the form of a 256-level priority queue to efficiently support the 256 priority levels of threads.

AIX also implements a 256-bit array to map to the 256 levels of the queue. If a particular queue level is empty, the corresponding bit is set to 0. This design allows the AIX scheduler to quickly identify the first non-empty level and start the first ready-to-run job in that level. See the AIX run queue structure in Figure 1 below.

Figure 1. Scheduler run queue

In Figure 1, the scheduler maintains a run queue of all the threads that are ready to be dispatched. All dispatchable threads of a given priority occupy consecutive positions in the run queue. AIX 5L implements one run queue for each CPU plus a global queue. For example, there are 32 run queues and one global queue in an eServer pSeries p590 machine. With a per-CPU run queue, a thread has a better chance of going back to the same CPU after a preemption, which is an affinity enhancement. Also, contention among CPUs to lock the run queue structure is much reduced with multiple run queues. However, in some situations, a multiple run queue structure might not be desirable. Exporting the system environment variable RT_GRQ=ON causes a thread to be placed on the global run queue when it becomes runnable; this can improve performance for threads that are interrupt-driven and running SCHED_OTHER. If schedo -o fixed_pri_global=1 is run on AIX 5L Version 5.2 and later, threads running with fixed priority are placed on the global run queue. For local run queues, the dispatcher picks the best-priority thread in the run queue when a CPU is available. When a thread has been running on a CPU, it tends to stay on that CPU's run queue. If that CPU is busy, the thread can be dispatched to another idle CPU and assigned to that CPU's run queue.
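The following toy sketch (purely illustrative; it is not the actual kernel data structure, and all names are made up) shows the idea behind the 256-bit array described above: one bit per priority level marks the non-empty queue levels, so the dispatcher can find the best runnable priority by locating the first set bit.

#include <stdio.h>
#include <stdint.h>

#define NUM_LEVELS 256

static uint32_t level_bitmap[NUM_LEVELS / 32];   /* 256-bit "level occupied" map */

static void mark_runnable(int priority)          /* a thread became runnable at this level */
{
    level_bitmap[priority / 32] |= 1u << (priority % 32);
}

static int best_priority(void)                   /* first non-empty level; 0 is most favored */
{
    for (int word = 0; word < NUM_LEVELS / 32; word++)
        if (level_bitmap[word])
            for (int bit = 0; bit < 32; bit++)
                if (level_bitmap[word] & (1u << bit))
                    return word * 32 + bit;
    return -1;                                   /* no runnable threads */
}

int main(void)
{
    mark_runnable(60);                           /* a default user job */
    mark_runnable(100);                          /* a heavily penalized job */
    printf("dispatch priority level %d first\n", best_priority());
    return 0;
}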

FIFO

Although the FIFO policy is the simplest, it is rarely used because of its non-preemptive nature. A thread with this scheduling policy runs all the way to completion, unless one of the following happens:

It gives up the CPU voluntarily by executing a function that puts the thread to sleep, such as sleep() or select().
It gets blocked due to resource contention.
It has to wait for I/O completion.

The checkout lane at a grocery store uses a typical FIFO policy. Imagine yourself in the checkout lane with only one TV dinner (and you're hungry), but the person in front has a full cart. What can you do? Not much. Since this is a FIFO, you must wait patiently for your turn. Similarly, job response time can suffer severely if several tasks are running in FIFO mode in AIX. Consequently, FIFO is rarely used in AIX, and only a process owned by root can set itself or another thread to FIFO with the thread_setsched() system call. There are two variations of the FIFO policy: FIFO2 and FIFO3. With FIFO2, a thread is put at the head of its run queue if it was asleep for only a short period of time, less than a predefined number of ticks (affinity_lim ticks, tunable with the schedo command). This gives the thread a good chance to reuse its cache content. With FIFO3, a thread is always put at the head of the queue when it becomes runnable.

Round robin

The well-known round robin scheduling policy is even older than UNIX itself. AIX 5L implements round robin on top of its multilevel priority queue of 256 levels. At a given priority level, a round robin thread shares CPU timeslices with all other entries of the same priority. A thread is scheduled to run until one of the following occurs:

It yields the CPU to other tasks.
It is blocked for I/O.
It uses up its timeslice.

When the timeslice is exhausted, if a thread of equal or better priority is available to run on that CPU, the thread that is currently running is then placed at the end of the queue for the next turn to own the processor. A thread can be preempted because of a higher priority job waking up or a device interrupt (for example, after an I/O is done). For a round robin task only, this preempted thread is placed at the beginning of its queue level, because AIX wants to ensure that a round robin job has a full timeslice before it is moved to the end of the round robin chain. It is important to note that the priority of a round robin thread is fixed and does not change over time. This makes the priority of a round robin task persistent (as opposed to the changing priorities in fair round robin) and more predictable. Since a round robin thread has special status, only root can set a thread to run with the round robin scheduling policy. To set SCHED_RR for a thread, use one of the following application programming interfaces (APIs): thread_setsched() or setpri().
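To make the API usage concrete, here is a minimal sketch (assuming the AIX thread_self() and thread_setsched() kernel services and their usual headers; the program structure is illustrative and not taken from the article) in which a root-owned program places its own kernel thread under SCHED_RR at a fixed priority of 50:

#include <stdio.h>
#include <sys/types.h>
#include <sys/sched.h>     /* thread_setsched(), SCHED_RR (assumed header) */
#include <sys/thread.h>    /* thread_self() (assumed header)               */

int main(void)
{
    tid_t tid = thread_self();              /* kernel thread ID of this thread */

    if (thread_setsched(tid, 50, SCHED_RR) == -1) {
        perror("thread_setsched");          /* fails with EPERM if not root    */
        return 1;
    }
    printf("thread %ld now runs SCHED_RR at fixed priority 50\n", (long)tid);
    return 0;
}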

SCHED_OTHER

This last scheduling policy is also the default. While trying to establish the fairest policy among tasks, this innovative SCHED_OTHER algorithm was created with a not-so-innovative POSIX-defined name. The AIX SCHED_OTHER is a priority-queue round robin design at the core, with one major difference: the priority is no longer fixed. If a task is using an excessive amount of CPU time, its priority level is downgraded to allow other jobs an opportunity to access the CPU. If a task is at a priority level so low (a high number) that it does not have an opportunity to run, its priority is upgraded to a higher level (a lower number) so it can run to completion. A new concept was also implemented to further enhance the effectiveness of the nice value: if a task is nice (the UNIX nice value) at the beginning, the system forces it to be nice all the time. I discuss this feature later.

Traditional CPU utilization

Prior to AIX 5.3, or with SMT disabled, AIX processor utilization uses a sample-based approach to approximate:

Percentage of processor time spent executing user programs
Percentage spent executing system code
Percentage spent waiting for disk I/O
Idle time

AIX produces 100 interrupts per second to take samples. At each interrupt, a local timer tick (10ms) is charged to the current running thread that is preempted by the timer interrupt. One of the following utilization categories is chosen based on the state of the interrupted thread:

If the thread was executing kernel code through a system call, the entire tick is charged to the process's system time.
If the thread was executing application code, the entire tick is charged to the process's user time.
Otherwise, if the currently running thread was the operating system's idle process, the tick is charged to a separate variable.

The problem with this method is that the process receiving the tick most likely did not run for the entire timer period; it just happened to be executing when the timer expired. With AIX 5.3 and SMT enabled, the traditional utilization metrics become misleading because each physical processor appears as two logical processors. If one hardware thread is 100 percent busy and the other is idle, the traditional method would report 50 percent utilization. In reality, if one SMT thread is using all of the CPU resources, that CPU is 100 percent busy, and that is what the new Processor Utilization Resource Register (PURR) based method reports.

PURR

Beginning in AIX 5.3, the number of dispatch cycles for each hardware thread can be measured using a new register called the PURR. Each physical processor has two PURR registers (one for each hardware thread). The PURR is a new register provided by the POWER5 processor that gives an actual count of the physical processing time units a logical processor has used. All performance tools and APIs use this PURR value to report CPU utilization metrics on SMT systems. The PURR is a special-purpose register that can be read or written by the POWER Hypervisor but is read-only to the operating system. The hardware increments each PURR based on how its thread is using the resources of the processor, including the dispatch cycles that are allocated to each thread. For a cycle in which no instructions are dispatched, the PURR of the thread that last dispatched an instruction is incremented. The register advances automatically, so the operating system can always read the current, up-to-date value. When the processor is in single-thread mode, the PURR increments by one every eight processor clock cycles. When the processor is in SMT mode, the thread that dispatches a group of instructions in a cycle increments its counter by 1/8 in that cycle; if no group dispatch occurs in a given cycle, both threads increment their PURR by 1/16. Over a period of time, the sum of the two PURR registers, when running in SMT mode, should be very close to, but not greater than, the number of timebase ticks.

AIX 5.3 CPU utilization

In AIX 5L V5.3, the kernel collects new metrics that are state-based rather than sample-based. State-based means the information is collected on PURR increments rather than at a set time of 10ms. AIX 5.3 uses the PURR for process accounting. Instead of charging the entire 10ms clock tick to the interrupted process as before, processes are charged based on the PURR delta for the hardware thread since the last interval. At each interrupt:

The elapsed PURR is calculated for the current sample period.
This value is added to the appropriate utilization category (user, sys, iowait, or idle), instead of the fixed-size increment (10 ms) that was previously added.

There are two different quantities to measure: the thread's processor time and the elapsed time. To measure the elapsed time, the timebase register (TB) is still used. The physical resource utilization metrics for a logical processor are:

(delta PURR/delta TB) represents the fraction of the physical processor consumed by a logical processor.
(delta PURR/delta TB) * 100 over an interval represents the percentage of dispatch cycles given to a logical processor.

CPU utilization example

Assume two threads are running on one physical processor with SMT enabled, and both SMT threads are busy. Using the old tick-based method, each SMT thread would be reported as 100 percent busy, even though the two threads are actually sharing the CPU resources evenly. Using the PURR-based method, each logical processor instead reports a utilization of 50 percent, representing the proportion of physical processor resources it used (assuming an equal distribution of physical processor resources to both hardware threads).

Additional CPU utilization metrics

The following metrics use the per-thread PURR method to measure each thread's processor time and the TB register to measure elapsed time.

Table 1. Additional CPU utilization metrics (per-thread PURR method)

Metric: %sys = (delta PURR in system mode / entitled PURR) * 100, where entitled PURR = ENT * delta TB, and ENT is the entitlement in number of processors (entitlement/100).
Information provided: Physical CPU utilization metrics calculated using the PURR-based samples and the entitlement.

Metric: Sum of (delta PURR / delta TB) for each logical processor in a partition over an interval.
Information provided: The Physical Processors Consumed (PPC) by the partition.

Metric: (PPC / ENT) * 100.
Information provided: The percentage of entitlement consumed.

Metric: (delta PIC / delta TB), where PIC is the Pool Idle Count, which represents the clock ticks where the POWER Hypervisor was idle.
Information provided: The available pool of processors.

Metric: Sum of the traditional 10ms tick-based %sys and %user.
Information provided: Logical processor utilization, which helps you determine whether more virtual processors should be added to a partition.
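To show how these ratios combine, here is a small worked sketch using made-up sample deltas (nothing below reads real registers; all values and variable names are hypothetical):

#include <stdio.h>

int main(void)
{
    double ent       = 0.80;        /* entitlement in processors (entitlement/100)    */
    double delta_tb  = 1000000.0;   /* timebase ticks in the interval (hypothetical)  */
    double purr_user = 150000.0;    /* PURR ticks charged to user mode (hypothetical) */
    double purr_sys  = 50000.0;     /* PURR ticks charged to system mode              */

    double entitled_purr = ent * delta_tb;                    /* ENT * delta TB       */
    double pct_user = purr_user / entitled_purr * 100.0;      /* 18.75 percent        */
    double pct_sys  = purr_sys  / entitled_purr * 100.0;      /*  6.25 percent        */
    double ppc      = (purr_user + purr_sys) / delta_tb;      /* 0.20 processors      */
    double pct_entc = ppc / ent * 100.0;                      /* 25 percent of ent.   */

    printf("%%user=%.2f %%sys=%.2f physc=%.2f %%entc=%.1f\n",
           pct_user, pct_sys, ppc, pct_entc);
    return 0;
}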

AIX 5.3 command changes

When AIX is running with SMT enabled, commands that display CPU information, such as vmstat, iostat, topas, and sar, display the PURR-based statistics rather than the traditional sample-based statistics. In SMT mode, additional columns of information are displayed, as shown in Table 2 below.

Table 2. Additional columns in SMT mode

Column         Description
pc or physc    Physical Processors Consumed by the partition
pec or %entc   Percentage of Entitlement Consumed by the partition

Another tool that needed modification was trace/trcrpt, along with several other tools that are based on the trace utility. In an SMT environment, trace can optionally collect PURR register values at each trace hook, and trcrpt can display elapsed PURR. Table 3 below shows the arguments to use for SMT.

Table 3. Arguments for SMT

Argument                  Description
trace -r PURR             Collects the PURR register values. Only valid for a trace run on a 64-bit kernel.
trcrpt -O PURR=[on|off]   Tells trcrpt to show the PURR, along with any timestamps.
netpmon -r PURR           Uses the PURR time instead of timebase in percent and CPU calculations. Elapsed time calculations are unaffected.
pprof -r PURR             Uses the PURR time instead of timebase in percent and CPU calculations. Elapsed time calculations are unaffected.
gprof                     GPROF is the new environment variable to support SMT.
curt -r PURR              Specifies the use of the PURR register to calculate CPU times.
splat -p                  Specifies the use of the PURR register to calculate CPU times.

Thread priority formulas

You can calculate the priority of a thread using the formulas shown in Listing 2 below. Priority is a function of the nice value, the CPU usage charge C, and a tuning factor r.

How AIX calculates the new priority

The clock timer interrupt occurs every 10ms, or 1 tick, on each CPU. The timers are staggered so that one CPU's clock timer does not go off at the same time as another CPU's clock timer. When the CPU clock timer interrupt occurs (even before the thread has run for a full 10ms), the thread has its CPU usage value (the CPU charge) incremented by one, up to a maximum of 120. If a job does not get a full 10ms slice and is running the RR policy, the system dispatcher adjusts the thread's position in the run queue to allow it to run again soon. The priority of most user processes varies with the amount of CPU time the process has used recently. The CPU scheduler's priority calculations are based on two parameters that are set with the schedo command: sched_R and sched_D. The sched_R and sched_D values are expressed in units of 1/32. The scheduler uses the following formula to calculate the amount to add to a process's priority value as a penalty for recent CPU use:
CPU penalty = (recently used CPU value of the process) * (r/32)

The recalculation (once per second) of the recently used CPU value of each process is:
New recently used CPU value = (old recently used CPU value of the process) * (d/32)

Both r (the sched_R parameter) and d (the sched_D parameter) have default values of 16. The recent CPU charge C is then used to determine the priority penalty and to recalculate the new thread priority. Using the first formula as a reference (see Listing 2), you can see that a newly started user task, which carries a base priority of 40, a default nice value of 20, and no CPU charge so far (C=0), begins with a priority level of 60. Also, in the first formula, the value r determines the penalty ratio, with a range from zero to 32. An r value of zero means no CPU penalty charge, since the term (C*r/32) is always zero. If r=32, it yields the highest possible penalty charge: each tick (10ms) of CPU usage translates to one priority-level downgrade. In most cases, the value of r lies near the middle between zero and 32. AIX defaults r to 16; that is, every two ticks of CPU charge become one level of priority penalty. When the r value is high, the impact of the nice value becomes less important, since the CPU usage penalty prevails. A smaller r, on the contrary, makes the effect of the nice value more obvious. Based on this discussion, the effectiveness of the nice value diminishes after a while, because the CPU charge grows over time and gradually becomes the main factor in determining the new priority. This formula has been modified in AIX 5L to increase the weight of the nice value in calculating the priority level. Two new factors have been introduced: x_nice and x_nice_factor ("extra nice" and "extra nice factor"). See the second formula in Listing 2 below.

Listing 2. Thread priority formulas


<Formula 1: The basic formula>
Priority = p_nice + (C * r/32)                              (1)

<Formula 2: For AIX 5L>
Priority = x_nice + (C * r/32 * x_nice_factor)              (2)

Where:
p_nice = base_PRIORITY + NICE                               (3)
base_PRIORITY = 40
NICE = 20 + delta_NICE   (20 is the default nice value)
That is, p_nice = 60 + delta_NICE                           (3a)
C is the CPU usage charge; the maximum value of C is 120
If NICE <= 20 then x_nice = p_nice
If NICE >  20 then x_nice = p_nice * 2 - 60
   (equivalently, x_nice = p_nice + delta_NICE, or x_nice = 60 + (2 * delta_NICE))
x_nice_factor = (x_nice + 4)/64                             (4)
Priority has a maximum value of 255

As you can see from Formula 2 and Formula 3, x_nice now doubles the effect of any increase in the nice value. The x_nice_factor further strengthens the r ratio. For example, an initial nice increment of 16, which gives a nice value of 36, results in an x_nice of 92 and an x_nice_factor of 1.5. That amounts to a 50 percent higher CPU usage penalty over the lifetime of the thread.
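To make the arithmetic concrete, here is a minimal user-space sketch of Formula 2 (the function and variable names are illustrative only; this is not the kernel code):

#include <stdio.h>

#define BASE_PRIORITY 40
#define DEFAULT_NICE  20
#define MAX_CHARGE    120
#define MAX_PRIORITY  255

/* Formula 2: Priority = x_nice + (C * r/32 * x_nice_factor) */
static int calc_priority(int delta_nice, int cpu_charge, int r)
{
    int nice   = DEFAULT_NICE + delta_nice;
    int p_nice = BASE_PRIORITY + nice;                             /* Formula 3 */
    int x_nice = (nice > DEFAULT_NICE) ? p_nice * 2 - 60 : p_nice;
    double x_nice_factor = (x_nice + 4) / 64.0;                    /* Formula 4 */
    double pri;

    if (cpu_charge > MAX_CHARGE)
        cpu_charge = MAX_CHARGE;
    pri = x_nice + (cpu_charge * r / 32.0 * x_nice_factor);
    return pri > MAX_PRIORITY ? MAX_PRIORITY : (int)pri;
}

int main(void)
{
    printf("default job, no CPU charge: %d\n", calc_priority(0, 0, 16));    /*  60 */
    printf("default job, 100 ticks:     %d\n", calc_priority(0, 100, 16));  /* 110 */
    printf("nice -n 16 job, 100 ticks:  %d\n", calc_priority(16, 100, 16)); /* 167 */
    return 0;
}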

Decaying the CPU usage

It is possible for a thread to get a priority so low that it never has a chance to run. This would occur if you used only Formulas 1 and 2 without a mechanism to push a thread's priority level back up. When a thread runs with SCHED_OTHER, its priority is degraded for its use of CPU time; when it is not running and is waiting for its turn, AIX helps it regain priority by "decaying" its CPU charge about once a second. The rule is simple: a CPU-bound job should be assigned a lower priority to allow other jobs to run, but it should not be discriminated against to the point that it can never finish. The CPU charge of every thread is decayed by a predefined factor once per second, as follows:
New Charge C = (Old Charge C) * d / 32 (5)

A kernel process, swapper, does this job. Once every second, swapper wakes up and decays the CPU charge of all threads. The default decay factor is 0.5 (d=16), which "discounts" or "waives" half of the CPU charge. With this mechanism, a CPU-intensive job accumulates CPU charge and drops to a lower priority, and then jumps back to a much better priority level at the end of a second. On the other hand, an I/O-intensive job does not vary its priority up and down as much, since it generally accumulates less CPU time.

Have you exhausted your CPU?

Now that you understand how the AIX scheduler prioritizes the workload, let's look at several commonly used commands. If AIX seems to take too long to finish your workload or does not respond quickly enough, try these commands to investigate whether your system is CPU-bound: vmstat, iostat, and sar. We do not discuss all the possible ways to use these commands, but instead emphasize the information they convey. For a detailed description of these commands, see your AIX publications or visit the IBM System p and AIX Information Center at http://publib16.boulder.ibm.com/pseries/index.htm. Scroll down, if necessary, and click AIX 5L Version 5.3 information center to start using the AIX 5 publications.

The priority change history of a thread

Listing 3 shows how the CPU charge can change the priority of a thread:

Listing 3. Change of CPU charge and the priority of a thread


Base priority is 40
Default NICE value is 20; assume the task was run using the default nice value
p_nice = base_priority + NICE = 40 + 20 = 60
Assume r = 2 to slow down the penalty increase (default r value is 16)
Priority = p_nice + C * r / 32 = 60 + C * r / 32

Tick 0    P = 60 +   0 * 2 / 32 = 60
Tick 1    P = 60 +   1 * 2 / 32 = 60
Tick 2    P = 60 +   2 * 2 / 32 = 60
...
Tick 15   P = 60 +  15 * 2 / 32 = 60
Tick 16   P = 60 +  16 * 2 / 32 = 61
Tick 17   P = 60 +  17 * 2 / 32 = 61
...
Tick 100  P = 60 + 100 * 2 / 32 = 66

At tick 100, swapper decays the CPU usage charges of all threads:
New CPU charge C = (Current CPU charge) * d / 32
Assume d = 16 (the default)
For the test thread, new C = 100 * 16 / 32 = 50

Tick 101  P = 60 +  51 * 2 / 32 = 63

Listing 4 shows a pair of trivial test programs, a CPU-bound (fast) job and a mostly sleeping (slow) job, whose priorities evolve very differently:

Listing 4. Priority change of a typical CPU-bound job (fast versus slow)


fast.c:  int main() { for (;;); }                      /* CPU-bound loop     */
slow.c:  #include <unistd.h>
         int main() { sleep(80); return 0; }           /* sleeps, little CPU */
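One way to observe the difference (a suggested experiment, not part of the original listing) is to compile and start both programs in the background and then watch the C and PRI columns of ps -l: the CPU-bound fast job accumulates CPU charge and its PRI value climbs away from 60, while the sleeping slow job stays at the default priority.

# cc -o fast fast.c ; ./fast &
# cc -o slow slow.c ; ./slow &
# ps -l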

Common commands

The vmstat, iostat, and sar commands are used frequently for CPU monitoring. You should be familiar with the usage and the meaning of the reports each command generates.

vmstat

The vmstat command provides an overview of resource utilization through a report of CPU, disk, and memory activity in a one-line-per-report format. The sample output in Listing 5 was generated on an AIX 5L Version 5.3 system running "vmstat 1 6". The report was generated every second, as requested. Since a count of six was specified following the interval, reporting stops after the sixth report. One popular way to run the vmstat command is to leave out the count parameter; vmstat then generates reports continuously until the command is terminated.

Except for the avm and fre columns, the first report contains average statistics per second since system startup. Subsequent reports contain statistics collected during the interval since the previous report. Beginning with AIX 5L Version 5.3, the vmstat command reports the number of physical processors consumed (pc) and the percentage of entitlement consumed (ec) in Micro-Partitioning and SMT environments; these metrics are displayed only in Micro-Partitioning and SMT environments. AIX 5L also adds a useful new option, -I, to vmstat that shows the number of threads waiting for raw I/O to complete (p column) and the number of file pages paged in and out per second (fi/fo columns). The following detailed descriptions of the columns convey useful information about CPU utilization. Listing 5 shows the output of the vmstat 1 6 command:

Listing 5. Output of the vmstat 1 6 command from a p520 system (two CPUs)
vmstat 1 6

System configuration: lcpu=4 mem=15808MB

kthr    memory              page                       faults         cpu
----- ------------- ------------------------- ------------- -----------
 r  b    avm    fre  re  pi  po  fr  sr  cy   in    sy   cs  us sy id wa
 1  1 110996 763741   0   0   0   0   0   0  231    96   91   0  0 99  0
 0  0 111002 763734   0   0   0   0   0   0  332  2365  179   0  1 99  0
 0  0 111002 763734   0   0   0   0   0   0  330  2283  139   0  5 93  1
 0  0 111002 763734   0   0   0   0   0   0  310  2212  153   0  0 99  0
 1  0 111002 763734   0   0   0   0   0   0  314  2259  173   0  0 99  0
 0  0 111002 763734   0   0   0   0   0   0  321  2261  177   0  1 99  0

Figure 2 shows the output of the command vmstat -I 1 (issued during a software installation):

Figure 2. Output of the vmstat -I 1 command (image not reproduced here)

See Table 4 below for a listing of relevant columns with descriptions.

Table 4. Description of relevant columns

Column   Description
kthr     Kernel thread state changes per second over the sampling interval.
r        Number of kernel threads placed in the run queue.
b        Number of kernel threads placed in the Virtual Memory Manager (VMM) wait queue (awaiting resource, awaiting input/output).
p        Number of threads waiting on raw I/Os (bypassing the journaled file system (JFS)) to complete. This column is available only on AIX 5 and later.
fi/fo    Number of file pages paged in/out per second. This column is available only on AIX 5 and later.
cpu      Breakdown of percentage usage of CPU time. For multiprocessor systems, CPU values are global averages among all processors. Also, the I/O wait state is defined system-wide and not per processor.
us       Average percentage of CPU time executing in user mode.
sy       Average percentage of CPU time executing in system mode.
id       Average percentage of time that the CPUs were idle and the system did not have an outstanding disk I/O request.
wa       CPU idle time during which the system had outstanding disk/NFS I/O request(s). If there is at least one outstanding I/O to a disk when wait is running, the time is classified as waiting for I/O. Unless asynchronous I/O is being used by the process, an I/O request to disk causes the calling process to block (or sleep) until the request has been completed. Once an I/O request for a process completes, it is placed on the run queue. If the I/Os were completing faster, more CPU time could be used.
pc       Number of physical processors consumed. Displayed only if the partition is running with shared processors.
ec       The percentage of entitled capacity consumed. Displayed only if the partition is running with shared processors.

A CPU is marked wio at the time of a clock interrupt (every 1/100 of a second) if the CPU is idling and an outstanding I/O was initiated on that CPU. If a CPU is idling with no outstanding I/O from that CPU, it is marked as id instead of wa. For example, a system with four CPUs and one thread doing I/O reports a maximum of 25 percent wio time; a system with 12 CPUs and one thread doing I/O reports a maximum of 8.3 percent wio time. To be precise, wio measures the percentage of time the CPU is idle while it waits for an I/O to complete. The four CPU columns should total 100 percent, or very close to it. If the sum of the user and system (us and sy) CPU-utilization percentages consistently approaches 100 percent, the system might be encountering a CPU bottleneck.

iostat

The iostat command is used primarily to monitor system input and output devices, but it can also provide CPU utilization data. Beginning with AIX 5.3, the iostat command reports the number of physical processors consumed (physc) and the percentage of entitlement consumed (%entc) in Micro-Partitioning and SMT environments; these metrics are displayed only in Micro-Partitioning and SMT environments. When SMT is enabled, iostat automatically uses new PURR-based data and formulas for:

%user %sys %wait %idle

Listing 6 is generated on an AIX 5L Version 5.3 system by entering "iostat 5 3", as follows:

Listing 6. iostat report


System configuration: lcpu=4 drives=9

tty:   tin    tout   avg-cpu:  %user   %sys   %idle   %iowait
       0.0     4.3              0.2     0.6    98.8     0.4

Disks:     %tm_act    Kbps    tps    Kb_read   Kb_wrtn
hdisk0       0.0       0.2    0.0       7993      4408
hdisk1       0.0       0.0    0.0       2179      1692
hdisk2       0.4       1.5    0.3      67548     59151
cd0          0.0       0.0    0.0          0         0

tty:   tin    tout   avg-cpu:  %user   %sys   %idle   %iowait
       0.0    30.3              8.8     7.2    83.9     0.2

Disks:     %tm_act    Kbps    tps    Kb_read   Kb_wrtn
hdisk0       0.2       0.8    0.2          4         0
hdisk1       0.0       0.0    0.0          0         0
hdisk2       0.0       0.0    0.0          0         0
cd0          0.0       0.0    0.0          0         0

tty:   tin    tout   avg-cpu:  %user   %sys   %idle   %iowait
       0.0     8.4              0.2     5.8     0.0    93.8

Disks:     %tm_act    Kbps    tps    Kb_read   Kb_wrtn
hdisk0       0.0       0.0    0.0          0         0
hdisk1       0.0       0.0    0.0          0         0
hdisk2      98.4      75.6   61.9        396      2488
cd0          0.0       0.0    0.0          0         0

Example iostat output with a shared-processor LPAR configuration:

# iostat -t 2 3

System configuration: lcpu=4 ent=0.80

avg-cpu:  %user   %sys   %idle   %iowait   physc   %entc
            0.1    0.2    99.7      0.0      0.0     0.9
            0.1    0.4    99.5      0.0      0.0     1.1
            0.1    0.2    99.7      0.0      0.0     0.9

Just like the vmstat command report, the first report contains statistic averages since the system started up; subsequent reports contain statistics collected during the interval since the previous report. The four columns that show the breakdown of CPU usage time convey the same information as the vmstat command, and they should total approximately 100 percent. If the sum of the user and system (us and sy) CPU-utilization percentages consistently approaches 100 percent, the system might be encountering a CPU bottleneck. On systems running one application, a high I/O wait percentage might be related to the workload. On systems with many processes, some will be running while others wait for I/O; in this case, the %iowait can be small or zero because running processes "hide" some wait time. Even though %iowait is low, a bottleneck can still limit application performance. If the iostat command indicates that a CPU-bound situation does not exist and %iowait time is greater than 20 percent, you might have an I/O-bound or disk-bound situation.

sar

The sar command has two forms: the first form samples, displays, and/or saves system statistics, and the second form processes and displays previously captured data. The sar command can provide queue and processor statistics just like the vmstat and iostat commands. However, it has two additional features:

Each sample has a leading time stamp, and an overall average appears at the end of the samples.
The -P option can be used to generate per-processor statistics, in addition to the global averages among all processors.

The sample output below comes from a four-way symmetric multiprocessor (SMP) system and resulted from entering two commands:
sar -o savefile 5 3 > /dev/null &

Note: This command collects the data three times at five-second intervals, saves the collected data in savefile, and redirects the report to null so that no report is written to the terminal.
sar -P ALL -u -f savefile

Note: The -P ALL flag is specified to get per-processor statistics for each individual processor, and -u requests CPU usage data. In addition, -f savefile tells sar to generate the report using the data saved in savefile. The sar -P ALL output for all logical processors with SMT enabled shows the physical processors consumed, physc (delta PURR/delta TB). This column shows the relative SMT split between processors; in other words, it measures the fraction of time a logical processor was getting physical processor cycles. Whenever the percentage of entitled capacity consumed is under 100 percent, a line beginning with U is added to represent the unused capacity. When running in shared mode, sar displays the percentage of entitlement consumed, %entc, which is ((PPC/ENT)*100).

Listing 7. A typical sar report from a 2-way p520 system with dedicated LPAR configuration
AIX nutmeg 3 5 00CD241F4C00    06/14/05

System configuration: lcpu=4

11:51:33  cpu  %usr  %sys  %wio  %idle  physc
11:51:34    0     0     0     0    100   0.30
            1     1     1     1     98   0.69
            2     2     1     0     96   0.69
            3     0     0     0    100   0.31
            -     1     1     0     98   1.99
11:51:35    0     0     0     0    100   0.31
            1     0     0     0    100   0.69
            2     0     0     0    100   0.73
            3     0     0     0    100   0.31
            -     0     0     0    100   2.04
11:51:36    0     0     0     0    100   0.31
            1     0     0     0    100   0.69
            2     0     0     0    100   0.70
            3     0     0     0    100   0.31
            -     0     0     0    100   2.01
11:51:37    0     0     0     0    100   0.31
            1     0     0     0    100   0.69
            2     0     0     0    100   0.69
            3     0     0     0    100   0.31
            -     0     0     0    100   2.00
Average     0     0     0     0    100   0.31
            1     0     0     0     99   0.69
            2     1     0     0     99   0.70
            3     0     0     0    100   0.31
            -     0     0     0     99   2.01

mpstat

The mpstat command collects and displays performance statistics for all logical CPUs in the system. If SMT is enabled, the mpstat -s command displays physical as well as logical processor usage, as shown in Listing 8 below.

Listing 8. A typical mpstat report from a 2-way p520 system with a shared-processor LPAR configuration
System configuration: lcpu=4

    Proc0                    Proc1
   63.65%                   63.65%
 cpu0      cpu2           cpu1      cpu3
 5.50%    58.15%         61.43%    2.22%

lparstat

The lparstat command provides a report of LPAR-related information and utilization statistics. This command displays current LPAR-related parameters and hypervisor information, as well as utilization statistics for the LPAR. An interval mechanism retrieves a number of reports at a given interval. The following statistics are displayed only when the partition type is shared:

physc   Shows the number of physical processors consumed.
%entc   Shows the percentage of the entitled capacity consumed.
lbusy   Shows the percentage of logical processor utilization that occurred while executing at the user and system level.
app     Shows the available physical processors in the shared pool.
phint   Shows the number of phantom interruptions received (interrupts targeted to another shared partition in this pool).

The following statistics are displayed only when the -h flag is specified:

%hypv   Shows the percentage of time spent in the hypervisor.
hcalls  Shows the number of hypervisor calls executed.

Listing 9. A typical lparstat report from a 2-way p520 machine
System configuration: type=Dedicated mode=Capped smt=On lcpu=4 mem=15808

%user   %sys   %wait   %idle
-----   ----   -----   -----
  0.0    0.1     0.0    99.9
  0.0    0.1     0.0    99.9
  0.4    0.2     0.1    99.3

# lparstat 1 3

System configuration: type=Shared mode=Uncapped smt=On lcpu=2 mem=2560 ent=0.50

%user   %sys   %wait   %idle   physc   %entc   lbusy   app   vcsw   phint
-----   ----   -----   -----   -----   -----   -----   ---   ----   -----
  0.3    0.4     0.0    99.3    0.01     1.1     0.0          346       0
 43.2    6.9     0.0    49.9    0.29    58.4    12.7          389       0
  0.1    0.4     0.0    99.5    0.00     0.9     0.0          312       0

Improving system performance

For a CPU-bound system, you can improve system performance by manipulating the thread and process priorities of a specific process, or by tuning the scheduler algorithm to set a different system-wide scheduling policy.

Changing user-process priority

The commands to change or set user task priority include the nice and renice commands, plus two system calls that allow thread priority and scheduling policy to be changed through API calls.

Using the nice command

The standard nice value of a foreground process is 20; the standard nice value of a background process is 24 if started from ksh or csh (20 if started by tcsh or bsh). The system uses the nice value to calculate the priority of all threads associated with the process. Using the nice command, a user can specify an increment or decrement to the standard nice value so that a process starts with a different priority. The thread priority is still non-fixed and takes different values based on the thread's CPU usage. By using nice, any user can run a command at a lower priority than normal, but only root can use nice to run commands at a priority higher than normal. For example, the command nice -5 iostat 10 3 >iostat.out causes the iostat command to start with a nice value of 25 (instead of 20), resulting in a lower starting priority. The values of nice and priority can be viewed using the ps command with the -l flag. Listing 10 shows typical output of the ps -l command:

Listing 10. Using ps -l to observe process priority


       F S UID   PID  PPID  C PRI NI ADDR   SZ WCHAN  TTY   TIME CMD
  240001 A   0 15396  5746  1  60 20 393ce  732        pts/3 0:00 ksh
  200001 A   0 15810 15396  3  70 25 793fe  524        pts/3 0:00 iostat

As root, you can run iostat at a higher priority with # nice --5 iostat 10 3 >io.out. The iostat command then runs with a nice value of 15, resulting in a higher starting priority.

Using the renice command

If a process is already running, you can use the renice command to alter the nice value, and thus the priority. The processes are identified by process ID, process group ID, or the name of the user who owns the processes. The renice command cannot be used on fixed-priority processes.

Using the setpri() and thread_setsched() subroutines

Two system calls allow individual processes or threads to be scheduled with a fixed priority: the setpri() system call is process-oriented, and thread_setsched() is thread-oriented. Use caution when calling these two subroutines, since improper use might cause the system to hang. An application that runs under the root user ID can invoke the setpri() subroutine to set its own priority or the priority of another process. The target process is then scheduled using the SCHED_RR scheduling policy with a fixed priority, and the change is applied to all the threads in the process. Note the following two examples:
retcode = setpri(0, 45);

Gives the calling process a fixed priority of 45.


retcode = setpri(1234, 35);

Gives the process with PID of 1234 a fixed priority of 35. If the change is intended for a specific thread, the thread_setsched() subroutine can be used:
retcode = thread_setsched(thread_id, priority_value, scheduling_policy);

The parameter scheduling_policy can be one of the following:

SCHED_OTHER, SCHED_FIFO, or SCHED_RR.

When SCHED_OTHER is specified as the scheduling policy, the second parameter (priority_value) is ignored.

Changing the scheduling algorithm globally

AIX allows users to change the priority calculation formula using the schedo command.

Adjusting r and d

As mentioned earlier, the formula for calculating the priority value is as follows:
Priority = x_nice + (C * r/32 * x_nice_factor)

The recent CPU usage value is displayed as the C column in the ps command output. The maximum value of recent CPU usage is 120. Once every second, the CPU usage value for each thread is degraded using the following formula:
New Charge C = (Old Charge C) * d / 32

The default value of r is 16; therefore, the thread priority is penalized by recent CPU usage * 0.5. The d also has a default value of 16, which means the recent CPU usage value of every process is reduced to half of its original value once every second. For some users, the default values of sched_R and sched_D do not allow enough distinction between foreground and background processes. These two values can be tuned using sched_R and sched_D options to the schedo command. Note the following two examples:
# schedo -o sched_R=0

(R=0, D=.5) indicates that the CPU penalty was always 0. The priority value of the process would effectively be fixed, although it is not treated like an RR process.
# schedo -o sched_D=32

(R=0.5, D=1) indicates that long-running processes would reach a C value of 120 and stay there. The recent CPU usage value would not be reduced once every second, and the priority of long-running processes would never fluctuate back to low numbers (higher importance) to compete with new processes.

Changing the timeslice

Although the schedo command can modify the length of the scheduler timeslice, the timeslice change applies only to RR threads; it does not affect threads running with other scheduling policies. The syntax for this command is:
schedo -o timeslice=n

n is the number of 10ms clock ticks to be used as the timeslice. schedo -p -o timeslice=2 would set the timeslice length to 20ms. You must log on as root to make changes using the schedo command.

Using additional techniques

Other techniques that can help a CPU-bound system include the following.

Scheduling

Depending on the relative importance of applications, you could schedule less important ones for off-shift hours using the at, cron, or batch commands.

Using the mkpasswd command

If your system has thousands of entries in the /etc/passwd file, you could use the mkpasswd command to create a hashed or indexed version of the /etc/passwd file to save the CPU time spent looking up a user ID.

Tuning individual applications

The following techniques can help you diagnose and improve the performance of specific applications running under AIX.

Using the ps command

The ps command or profiling can identify an application that is consuming large fractions of CPU time. This information can then be used to narrow the search for a CPU bottleneck.

After you find the problem area, you can tune or improve the application. You might need to recompile the application or change the source code.

Using the schedo command

The schedo command is used to set or display current or next-boot values for all CPU scheduler tuning parameters. This command can only be executed by the root user. The schedo command can make permanent changes or defer changes until the next reboot. Beginning with AIX 5L Version 5.3, several tuning parameters have been added to the schedo command. Listing 11 shows all of the CPU scheduler parameters.

Listing 11. CPU scheduler parameters


# schedo -a
                %usDelta = 100
            affinity_lim = 7
           big_tick_size = 1
        fixed_pri_global = 0
               force_grq = 0
         hotlocks_enable = 0
  idle_migration_barrier = 4
      krlock_confer2self = n/a
    krlock_conferb4alloc = n/a
           krlock_enable = n/a
      krlock_spinb4alloc = n/a
     krlock_spinb4confer = n/a
                 maxspin = 16384
      n_idle_loop_vlopri = 100
                pacefork = 10
                 sched_D = 16
                 sched_R = 16
   search_globalrq_mload = 256
    search_smtrunq_mload = 256
    setnewrq_sidle_mload = 384
     shed_primrunq_mload = 64
      sidle_S1runq_mload = 64
      sidle_S2runq_mload = 134
      sidle_S3runq_mload = 134
      sidle_S4runq_mload = 4294967040
      slock_spinb4confer = 1024
        smt_snooze_delay = 0
       smtrunq_load_diff = 2
               timeslice = 1
          unboost_inflih = 1
           v_exempt_secs = 2
           v_min_process = 2
             v_repage_hi = 0
           v_repage_proc = 4
              v_sec_wait = 1

Upgrading

Upgrading the system to a faster CPU or more CPUs might be necessary if tuning does not improve the performance.

Case studies

Two real-world examples show how performance experts from IBM implemented these theories and techniques.

Case 1

Symptoms: The user has a batch script that starts up 500 other batch scripts, and each of these scripts queries and updates a database. Each script also starts as a client request from another machine, and each client request creates a database user thread on the database server machine. The response time began at less than 10 seconds for a period of time. Then the response time gradually became worse; at times it was more than a minute, sometimes two minutes.

Diagnosis: The run queue began growing until it reached into the hundreds. Another symptom was that the CPU was 100 percent utilized (this was an eight-way SMP system), with 99 percent in user mode. By examining an AIX trace sample collected for a few seconds, we saw a pattern emerge. While a thread was using the CPU, a network packet would arrive and cause a network adapter interrupt. This would take the currently running thread off its CPU so the interrupt could be serviced. After servicing the interrupt, the scheduler checks whether any other threads are runnable and have a better priority than the currently running thread. Since the currently running thread had run for a few timeslices already, its priority value had increased (become less favorable) as it accumulated CPU ticks. Each of the 500 scripts began with priority 60; if they were runnable, they would preempt any currently running thread with a priority value higher than 60. The preempted thread would then be put at the end of the run queue and have to wait for the CPU until its priority rose again. One effect of this preemption was that sometimes a thread would be preempted while holding a database lock. Since this type of lock is implemented at the application layer within the database software, the kernel does not know that the thread is holding a lock. If the lock had been a kernel-level lock or a pthread library mutex lock, the kernel could have performed priority boosting and raised the lock holder's priority to the same level as that of a running thread requesting the lock; this way, the requesting thread does not have to wait long for the lock holder to get the CPU again and release the lock. Since the lock in this scenario was a user lock, the database thread would spin on the lock until it exhausted its spin count (a tunable database parameter), and then go to sleep. So the 99 percent of used CPU was mostly due to threads spinning on database locks.

Prescription: After determining that priority preemption was having a negative effect, we tuned the scheduler formula, which calculates the thread priority. This particular formula is:
pri = base_pri + NICE + (C * r/32)

pri is the new priority, base_pri is 40, NICE is the nice value (20 in this case), C is the CPU usage in ticks, and r is 16. As a thread accumulates CPU ticks, its priority value becomes larger, thereby making its priority lower. The schedo command provides a way to change the value of r by using the sched_R option. Running schedo -p -o sched_R=0 causes r to be 0, which then causes the CPU penalty factor (C * r/32) to be 0. This prevents priorities from changing unless the nice value is changed. If the nice value is the same for all threads, then threads can complete their timeslices without being preempted due to priority changes. This allows the thread that is currently running and holding the database lock to keep running and then release the lock.

Results: These changes had an instantaneous impact on performance. The response time, which was over two minutes by this point, started getting better until all of the scripts were completing in just a few seconds. The C value in the priority formula is recalculated once a second by a CPU usage decay factor (C = C*d/32). Setting the d value to 0 with the schedo command would have accomplished the same result: if d=0, then C*d/32 = 0, and since the CPU penalty factor is C*r/32, it also becomes 0, so the priority is just 40 + NICE.

Case 2

Symptoms: A pSeries machine was used as both a database and an application server. Users would input requests into a forms-based application and then submit the transactions. They noticed that at certain times the forms would take longer to get updated on their screens and their usual short-running queries would take longer to return.

Diagnosis: When this slowness was observed, some long-running database batch jobs had also been submitted to the system. Normally, such batch jobs would run at night, but near the end of the month additional batch jobs were run during the day while the users were on the system. The batch jobs were CPU-intensive and constantly on the run queue, so the users' threads had to compete with the threads of the batch jobs for the CPU. With priorities degrading as CPU usage increased, the batch jobs' priorities became worse and allowed the users' threads to run. However, the kernel decays the CPU usage value C by half once a second, which allowed the priorities of the batch jobs to improve in a short time period, so the batch jobs would again compete for the CPU with the users' threads.

Prescription: By changing the decay factor (d/32) used to reduce CPU usage once a second, we improved performance for the users. We used the schedo command to set the d value to 31. The higher the value of d, the higher the value of C (C=C*d/32). Since C is used to calculate priorities (pri=40+NICE+C*r/32), the priority gets worse as C becomes larger. By setting the d value to a higher number, the C value is reduced at a slower-than-usual rate.

Results: The users' threads got the CPU more often than the batch threads, and the users saw an immediate improvement in performance. Of course, the batch jobs were slowed down somewhat, but they still got the CPU whenever the users had any "think" time or had to wait on I/O. The impact on the batch jobs was minimal, but the performance improvement for the users was dramatic.

Case study notes: Tracing a pattern

A final tip describes some odd things that can impact performance. During one of our benchmarks, we noticed that the CPU usage reached 100 percent, with most of the time being charged to "system". At that time, the application performance degraded noticeably. After we collected an AIX trace, we noticed a repeating pattern. One application process would encounter a page fault on an address. That page fault caused a protection exception in the VMM, which in turn caused the kernel to send this process a SIGSEGV (segmentation violation) signal. When the process resumed, it page faulted on the same address again, which caused yet another protection exception and another SIGSEGV signal to be sent to the process. The default signal disposition for SIGSEGV is to kill the process and generate a core dump, but in this case the application continued on and stayed in this loop; most of the CPU time was spent in it. After investigation, we discovered the problem: a developer for another component had installed a signal handler to catch the SIGSEGV signal in the code during testing. After the testing was completed, the developer had forgotten to remove the signal handler. That component was then linked with the rest of the application and, during the benchmark, another unrelated component of the application caused a segmentation fault. The old signal handler caught the exception, ignored it, and caused the process to resume. The current instruction (the one that caused the exception) was then restarted, causing an infinite loop.

Resources

The AIX 5L Support for Micro-Partitioning and Simultaneous Multi-threading white paper describes the new simultaneous multi-threading and Micro-Partitioning technologies and the AIX 5L support for them. The article Operating system exploitation of the POWER5 system discusses how new performance features deliver improved system scalability and performance. The AIX 5L Differences Guide Version 5.3 Edition Redbook focuses on the differences introduced in AIX 5L Version 5.3 when compared to AIX 5L Version 5.2. The Capped and Uncapped Partitions in IBM POWER5 white paper introduces and explains the concepts of capped and uncapped partitions and discusses priority weighting and CPU utilization by memory pools.

The AIX 5L Practical Performance Tools and Tuning Guide Redbook is a comprehensive guide to the performance monitoring and tuning tools that are provided with AIX 5L Version 5.3. Want more? The developerWorks AIX and UNIX zone hosts hundreds of informative articles and introductory, intermediate, and advanced tutorials. Get involved in the developerWorks community by participating in developerWorks blogs.
