
Automating a System Performance Check Using the checkperf Utility

Victor Feng, October 2008

Introduction

This tech tip provides two scripts: checkperf and runqueue.d. The checkperf script
works on systems that run the Solaris 9 or Solaris 10 Operating System. The runqueue.d
script works on systems that run the Solaris 10 OS.

• Here's the source code for the checkperf script. Please rename this file so it has a
.sh extension instead of a .txt extension.
• Here's the source code for the runqueue.d script. Please rename this file so it has
a .d extension instead of a .txt extension.

The checkperf utility checks system performance in terms of CPU, memory, I/O, and
network TCP. The default warning threshold for each of these items can be changed.
Whenever one of the thresholds is reached, checkperf sends a warning email to a
specified recipient. The email might include suggestions about how to improve system
performance.

You can schedule checkperf with cron so that you don't have to log in to each server to
check its system performance manually; for example, checkperf can be scheduled to run
during business hours. checkperf itself places little load on the system: by default, it
uses sar to collect statistics every 5 seconds for 5 minutes.
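As a rough illustration of the cron setup described above, the sketch below builds a crontab entry that runs checkperf at the top of each hour from 08:00 to 17:00, Monday through Friday. The install path, the .sh file name, and the choice of hours are assumptions made for this example, not values taken from the script.

```shell
#!/bin/sh
# Sketch only: the path and the schedule are assumptions.
CHECKPERF="/home/username/bin/checkperf.sh"   # hypothetical install location

# Print a crontab line: minute 0, hours 8-17, any day of month,
# any month, Monday through Friday.
cron_entry() {
    echo "0 8-17 * * 1-5 $1"
}

cron_entry "$CHECKPERF"
```

The printed line can be added to the crontab of the user that should run the check (crontab -e). Because each run samples for about five minutes by default, an hourly schedule leaves ample room between runs.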

The minimum interval at which sar can collect statistics is one second. If a system
has many processes that take a couple of milliseconds to run, sar will not know that they
are in the run queue. Therefore, if DTrace is installed on the system (for example, if the
system runs the Solaris 10 OS), checkperf calls runqueue.d, which collects run queue
information every millisecond for 30 seconds.

The remaining sections of this tech tip demonstrate how checkperf reacts when a system
has various performance issues. Before we continue, you need to set a few variables in
checkperf:

• DIR: Specifies the directory where checkperf and runqueue.d are located (for
example, /home/<username>/bin)
• LOG: Specifies the file that will contain generated warning messages (for example,
/home/<username>/bin/perf_msg)
• RECEIVER: Specifies the email address of the person who should receive warning
messages (for example, <username@domain.com>)
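The three settings might look like the following inside checkperf. This is a sketch that reuses the placeholder values from the list above; defining LOG relative to DIR is an assumption of this example, and the real script may assign the variables differently.

```shell
# Illustrative values only; replace <username> and the address with your own.
DIR="/home/<username>/bin"          # directory holding checkperf and runqueue.d
LOG="$DIR/perf_msg"                 # file that collects warning messages
RECEIVER="<username@domain.com>"    # recipient of warning email
```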

Note: My system has 32 CPUs. For testing purposes, I took 30 of them offline using
psradm -f 2-31.

CPU Performance Warning

The CPU_UTIL_WARN parameter in checkperf sets the CPU utilization warning threshold, in
percent. It is set to 80 by default.

If CPU utilization exceeds 80%, checkperf checks the threads in the run queue, checks
whether any CPUs are offline, and sends a warning email.
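The threshold comparison can be sketched as follows. This is not the checkperf source: it assumes that utilization is taken as 100 minus the averaged %idle column of sar -u, and the function names cpu_util and check_cpu are hypothetical.

```shell
#!/bin/sh
CPU_UTIL_WARN=80   # default threshold from the article, in percent

# Read `sar -u` output on stdin and print 100 minus the averaged %idle.
# Columns on the Average line: Average %usr %sys %wio %idle
cpu_util() {
    awk '$1 == "Average" { print 100 - $5; exit }'
}

# Print a warning line when utilization exceeds the threshold.
check_cpu() {
    util=$(cpu_util)
    if [ -n "$util" ] && [ "$util" -gt "$CPU_UTIL_WARN" ]; then
        echo "CPU average utilization: ${util}%(>${CPU_UTIL_WARN}%)"
    fi
}

# On a live system (5-second samples for 5 minutes):
#   sar -u 5 60 | check_cpu
```

Reading the sar output from standard input keeps the comparison separate from the sampling, so the same function can be tried against saved output.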

We can run dd if=/dev/zero of=/dev/null & to consume CPU resources:

root@host # dd if=/dev/zero of=/dev/null &
[1] 1571
root@host # dd if=/dev/zero of=/dev/null &
[2] 1572
root@host # dd if=/dev/zero of=/dev/null &
[3] 1573
root@host # dd if=/dev/zero of=/dev/null &
[4] 1574

root@host # sar -uq 5 3
15:21:25    %usr    %sys    %wio   %idle
         runq-sz %runocc swpq-sz %swpocc
Average       69      31       0       0
Average      2.0      99     0.0       0

root@host # ./checkperf

root@host # more perf_msg

CPU average utilization: 100%(>80%)

There are 30 CPUs offline and use psradm to bring them online
Threads (per second) waiting for CPU to run: 2.0.
Recommend adding 2.0 CPUs to your system. Use prstat -L to see
if running processes have multiple threads so that you may switch to
thread-based-processor machine, such as the Sun Fire T2000 server.
The accurate threads waiting for CPU: 2.1

The "accurate threads waiting for CPU: 2.1" text is generated by runqueue.d, which
provides more accurate information about the run queue.

Memory Shortage Warning

There are two parameters in checkperf for checking memory:

• MEM_FREEPHY_WARN_PERCENT: This is the warning threshold for available physical
memory, in percent. It is set to 20 by default.
• MEM_FREESWAP_WARN_PERCENT: This is the warning threshold for available swap
space (virtual memory), in percent. It is set to 20 by default.

If the available swap space is less than 20%, checkperf also checks whether the total
size of physical swap devices is less than 1.5 times the size of physical memory. As I
demonstrated in a previous article, Impact of Swap Space on System Performance for the
Solaris 9 and 10 OS, the lack of physical swap space affects system performance when a
system is low on physical memory.

Here we will use the myfilltmp.sh script (which is shown in the previous article) to
consume memory:

root@host # ./myfilltmp.sh

root@host # sar -r 5 3
15:34:39 freemem freeswap
Average 122536 6180453

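sar reports freemem in pages (8 Kbytes each on this system) and freeswap in 512-byte
disk blocks. So, free memory is 122536*8/1024, which is about 957 Mbytes, and free swap
space is 6180453*512/1024/1024, which is about 3017 Mbytes.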
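The same conversions can be written as two small helper functions. This is a sketch with hypothetical function names; it needs a shell that supports $(( )) arithmetic, such as ksh or bash, and it takes the page size as an argument because the 8-Kbyte value applies to the system shown here and may differ elsewhere.

```shell
#!/bin/sh
# Convert a sar "freemem" value (pages) to megabytes.
# $1 = pages, $2 = page size in kilobytes (8 on the system shown here)
pages_to_mb() {
    echo $(( $1 * $2 / 1024 ))
}

# Convert a sar "freeswap" value (512-byte blocks) to megabytes.
blocks_to_mb() {
    echo $(( $1 * 512 / 1024 / 1024 ))
}

pages_to_mb 122536 8     # prints 957
blocks_to_mb 6180453     # prints 3017
```

Integer division truncates the fractional part, which matches the rounded figures quoted in the text.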

root@host # ./checkperf

root@host # more perf_msg

Available physical memory: 937 MB(<3275 MB)

Available swap space: 2956 MB(<3552 MB)

Recommend adding 20465 MB swap device. The total size
of physical swap devices should be 1.5 times physical memory.

I/O Performance Warning

The IO_UTIL_WARN parameter in checkperf sets the utilization warning threshold for I/O
devices, in percent. It is set to 80 by default.

Let's generate some heavy I/O load:

root@host # cp myusr.tar myusr.tar2

root@host # sar -d 5 5
Average nfs1 0 0.0 0 0 0.0 0.0
sd1 99 6.8 134 80513 0.0 51.0

root@host # ./checkperf

root@host # more perf_msg

IO utilization on sd1: 100%(>80%)
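A per-device check of this kind can be sketched as follows. The sketch assumes that the second field of each row in the Average section of sar -d is the device's %busy value, as in the output shown earlier in this section; the function name check_io is hypothetical, and the message format simply mirrors the one in perf_msg. On Solaris, the -v option requires nawk rather than the older /usr/bin/awk.

```shell
#!/bin/sh
IO_UTIL_WARN=80   # default threshold from the article, in percent

# Read `sar -d` output on stdin and report devices whose averaged %busy
# exceeds the threshold. In the Average section, the first row starts with
# "Average"; the remaining rows start with the device name.
check_io() {
    awk -v warn="$IO_UTIL_WARN" '
        $1 == "Average" { in_avg = 1; dev = $2; busy = $3 }
        $1 != "Average" { dev = $1; busy = $2 }
        in_avg && busy + 0 > warn + 0 {
            printf "IO utilization on %s: %d%%(>%d%%)\n", dev, busy, warn
        }
    '
}

# On a live system:
#   sar -d 5 60 | check_io
```

Rows that appear before the Average section are ignored, so the individual samples do not trigger warnings on their own.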


Network Performance Warning

The following factors degrade TCP performance:

• Retransmission: Messages that are lost must be retransmitted.
• Duplicate packets: The local host might receive duplicate packets if it times out
on the original request, issues another request, and then receives the original
packet.
• Listen queues: A listen queue grows as the arrival rate of client requests to a
server exceeds the server's processing rate.

In checkperf, the warning threshold for the retransmission rate is 15% and the warning
threshold for the duplicate packet rate is 15%. The warning threshold for listen queue
drop is 100. Because the testing server does not have any retransmitted messages or any
duplicate packets, and listen queue drop is not greater than 100, the perf_msg file is
empty.
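The three comparisons can be sketched as follows. This tip does not show how checkperf derives the rates, so the sketch assumes that each rate is a simple percentage of two counters (for example, values such as tcpRetransBytes and tcpOutDataBytes reported by netstat -s). The function names, argument order, and message wording are all hypothetical.

```shell
#!/bin/sh
RETRANS_WARN=15        # percent, from the article
DUP_WARN=15            # percent, from the article
LISTEN_DROP_WARN=100   # count, from the article

# Integer percentage of $1 relative to $2; prints 0 when $2 is 0.
pct() {
    if [ "$2" -eq 0 ]; then
        echo 0
    else
        echo $(( $1 * 100 / $2 ))
    fi
}

# $1 = retransmitted, $2 = sent, $3 = duplicates, $4 = received,
# $5 = listen queue drops
check_tcp() {
    retrans=$(pct "$1" "$2")
    dup=$(pct "$3" "$4")
    if [ "$retrans" -gt "$RETRANS_WARN" ]; then
        echo "TCP retransmission rate: ${retrans}%(>${RETRANS_WARN}%)"
    fi
    if [ "$dup" -gt "$DUP_WARN" ]; then
        echo "TCP duplicate packet rate: ${dup}%(>${DUP_WARN}%)"
    fi
    if [ "$5" -gt "$LISTEN_DROP_WARN" ]; then
        echo "TCP listen queue drops: $5(>${LISTEN_DROP_WARN})"
    fi
}
```

With no retransmissions, no duplicates, and a drop count of 100 or less, check_tcp prints nothing, which corresponds to the empty perf_msg file described above.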

Putting It All Together

Finally, let's perform CPU, memory, and I/O performance checks all together:

root@host # dd if=/dev/zero of=/dev/null &
root@host # dd if=/dev/zero of=/dev/null &
root@host # dd if=/dev/zero of=/dev/null &
root@host # dd if=/dev/zero of=/dev/null &

root@host # ./myfilltmp.sh

root@host # cp myusr.tar myusr.tar2

root@host # more perf_msg

CPU average utilization: 100%(>80%)

There are 30 CPUs offline and use psradm to bring them online
Threads (per second) waiting for CPU to run: 3.1.
Recommend to add 3.1 CPUs to your system. Use prstat -L to see
if running processes have multiple threads so that you may switch to
thread-based-processor machine, such as Sun Fire T2000 server.
The accurate threads waiting for CPU: 3.1

Available physical memory: 778 MB(<3275 MB)

Available swap space: 2821 MB(<3517 MB)

Recommend to add 20465 MB swap device. The total size of physical
swap devices should be 1.5 times physical memory.

IO utilization on sd1: 51%(>30%)


Because of the high CPU utilization and the memory shortage, the average disk utilization
could not reach 80% in this run, so I lowered IO_UTIL_WARN to 30 to trigger the warning.
This example shows that CPU and memory pressure can affect I/O performance too.
