Microsoft Corporation
1/31/2003
Microsoft® Windows® 2000 provides a rich set of performance counters for both the
operating system and applications. This paper provides an introduction to the best
practices that utilize these counters to conduct effective performance monitoring for
Windows 2000 servers. It also covers key performance thresholds to assist
in troubleshooting some of the more common performance issues in systems using
Windows 2000.
1. Introduction
Monitoring performance is a critical process in the management of computer systems. Through monitoring,
system administrators obtain performance data that can be analyzed in real time or collected for processing at
a future point in time. The data is used in locating possible performance issues as well as planning for the
growth in demand for system resources. The steps and procedures for monitoring, however, widely vary
depending on the target environment.
Performance monitoring can generally be classified into three sets of activities: regular monitoring,
troubleshooting and resource planning. Regular performance monitoring ensures that administrators always
have up-to-date information about how their system is operating. Performance data for a specific system over
a range of activities and loads can be used to establish a baseline—a range of measurements that represent
acceptable performance under normal operating conditions. When administrators troubleshoot performance-
related system issues, collecting performance data and comparing it against the baseline gives them
important information about system resource utilization at the time the problem
occurred. Finally, monitoring system performance provides the data with which to project future growth and to
plan for changes in system configurations.
The Microsoft Windows 2000 operating system is one of the leading operating systems in use for many
different types of servers. As documented in [1], the Windows 2000 system is well instrumented and provides
abundant utilities for performance monitoring and analysis. In this paper, we present our guidelines for taking
advantage of these features and the practices required in managing real-life environments. In doing so, we
categorize server environments and point out useful features that can apply to different configurations.
The target audience of this paper is system engineers and administrators (“administrators” hereafter) who are
interested in monitoring the performance of small to medium size server environments. For these
administrators, managing performance is part of their responsibility for managing the system itself. Larger
environments typically call for dedicated performance experts with detailed knowledge of monitoring and
analysis features.
The organization of this paper is as follows. Section 2 gives an overview of the monitoring architecture and brief
descriptions of its sub-components. Section 3 presents practices that we suggest for planning, configuring, and
using performance features on Windows 2000. A summary of important counters and their uses is also
given in Section 3. Section 4 concludes the paper.
[Figure 1: Windows 2000 Advanced Server (figure not reproduced in this extraction)]
3. Monitoring Practices
As shown in Table 1, there are a large number of counters and counter instances available on Windows 2000
systems. So how does one determine what counter data to monitor and collect? The answer depends on
the target environment and its intended functionality. Devising general guidelines for selecting appropriate
counters and monitoring policies for a wide range of systems is a difficult task. In this paper, we start by
categorizing the workloads for common configurations.
System configurations and server usages vary dramatically based on the requirements of the business,
ranging from the need for a small number of servers to support a small business to monitoring and collecting data
from thousands of servers in an enterprise environment. Regardless, based on the Windows architecture and the
performance subsystem, there are key items to take into consideration that can assist greatly in determining the
best practices for monitoring a particular environment. Ultimately, the key is to monitor the system resources
being consumed by the application or service of interest. Therefore, understanding the workload of the target
environment should precede all other steps.
Component and aspect monitored, with counters to monitor:

Disk - Usage:
    Physical Disk\ Avg. Disk sec/Read
    Physical Disk\ Avg. Disk sec/Write
    Physical Disk\ Disk Reads/sec
    Physical Disk\ Disk Writes/sec
    Physical Disk\ Avg. Disk Read Queue Length
    Physical Disk\ Avg. Disk Write Queue Length
    Physical Disk\ % Idle Time
    Logical Disk\ % Free Space
    Note: interpret the % Disk Time counter carefully. Because the _Total
    instance of this counter may not accurately reflect utilization on
    multiple-disk systems, it is important to use the % Idle Time counter
    instead.

Disk - Bottlenecks:
    Physical Disk\ (all counters)
    System\ File Control Operations/sec
    System\ File Data Operations/sec

Memory - Usage:
    Memory\ Available Bytes
    Memory\ Cache Bytes

Memory - Bottlenecks or leaks:
    Memory\ % Committed Bytes in Use
    Memory\ Pages/sec
    Memory\ Pages Input/sec
    Memory\ Pages Output/sec
    Memory\ Transition Faults/sec
    Memory\ Pool Paged Bytes
    Memory\ Pool Paged Resident Bytes
    Memory\ Pool Nonpaged Bytes
    Although not specifically Memory object counters, the following are
    also useful for memory analysis:
    Paging File\ % Usage (all instances)
    Cache\ Data Map Hits %

Processor - Usage:
    Processor\ % Processor Time (all instances)
    Processor\ % DPC Time
    Processor\ % Interrupt Time
    Processor\ % Privileged Time
    Processor\ % User Time

Processor - Bottlenecks:
    Processor\ % Processor Time (all instances)
    Processor\ % DPC Time
    Processor\ % Interrupt Time
    Processor\ % Privileged Time
    Processor\ % User Time
    System\ Processor Queue Length
    Processor\ Interrupts/sec
    System\ Context Switches/sec
    System\ System Calls/sec
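The counter list above can be turned into a settings file for a batch collection tool. A minimal sketch in Python, assuming the one-counter-path-per-line format used by typeperf -cf (typeperf ships with Windows XP and Windows Server 2003); the file name and counter subset here are illustrative, not prescribed by the paper:

```python
# Sketch: write a subset of the counters above to a settings file that a
# batch collection tool (for example, typeperf -cf) can consume.

COUNTERS = [
    r"\PhysicalDisk(*)\Avg. Disk sec/Read",
    r"\PhysicalDisk(*)\Avg. Disk sec/Write",
    r"\PhysicalDisk(*)\% Idle Time",
    r"\LogicalDisk(*)\% Free Space",
    r"\Memory\Available Bytes",
    r"\Memory\Pages/sec",
    r"\Processor(*)\% Processor Time",
    r"\System\Processor Queue Length",
]

def write_counter_file(path):
    """Write one counter path per line, the format typeperf -cf expects."""
    with open(path, "w") as f:
        for counter in COUNTERS:
            f.write(counter + "\n")
    return path

write_counter_file("baseline_counters.txt")
```

The resulting file would then be used on the collecting system, for example with `typeperf -cf baseline_counters.txt -o log.csv -si 15`.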
3.3. Collection-monitoring policy
Depending on the configuration, collecting performance data can be done in two ways.
Centralized data collection (that is, collecting performance data from remote systems to a
centralized repository) is simple to implement because only one logging service needs to be
running in the system hosting the centralized repository. However, this scheme may be
constrained by available memory on the logging system, and frequent updates add
undesirable network traffic. Hence, centralized monitoring is best suited to a small number of
servers (25 or fewer).
Distributed data collection (that is, data collection that occurs locally on the individual
computers) does not incur the memory and network-traffic problems of the centralized
scheme. However, it does delay availability of the data, requiring that the collected
data be transferred separately to the administrator's system for review. This type of monitoring is
useful if the network is likely to be part of the problem because it isolates the computers from the
network during data collection. However, it should be noted that local monitoring creates more
disk traffic on each monitored computer.
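The trade-off above can be made concrete with a rough sizing estimate of the network traffic that centralized collection generates. A sketch; the 250-byte-per-sample figure is an assumption for illustration, not a measured Windows 2000 value:

```python
# Rough sizing sketch for centralized collection: estimate the network
# traffic generated by polling counters on remote servers.

def centralized_traffic_bytes_per_sec(servers, counters_per_server,
                                      sample_interval_sec,
                                      bytes_per_sample=250):
    # Each server contributes counters_per_server samples per interval.
    samples_per_sec = servers * counters_per_server / sample_interval_sec
    return samples_per_sec * bytes_per_sample

# 25 servers, 50 counters each, sampled every 15 seconds:
rate = centralized_traffic_bytes_per_sec(25, 50, 15)
print(f"{rate / 1024:.1f} KB/s")
```

Even at the suggested 25-server limit, the steady-state traffic is modest; the point is that it scales linearly with server count and sampling rate, which is why large farms favor distributed collection.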
© 2001 Microsoft Corporation. All rights reserved. Microsoft, Windows and Windows NT
are either registered trademarks or trademarks of Microsoft Corporation in the United States
and/or other countries. Other product and company names mentioned herein may be the
trademarks of their respective owners.
NOTES:
First, NEVER open an NT 4.0 Performance Monitor log with the Windows 2000 System Monitor
(sysmon.exe). Use System Monitor from Windows XP or Windows Server 2003 instead. There are
also clear benefits in using the latest version of System Monitor to read even Windows 2000 logs,
so use Windows XP or Windows Server 2003 to display log data, regardless of format.
Also, it is assumed that the system professional knows all of the concepts and fundamentals
discussed in this troubleshooter. For further in-depth background, consult Inside Windows NT,
Third Edition. While the entire book is highly recommended, several of its chapters are
particularly relevant to this document.
It is very important to note that while many of the old troubleshooting techniques remain valid on
Windows 2000, many of the threshold values observed on NT 4.0 are too low to indicate any
issue when applied to Windows 2000. The most effective way of developing accurate threshold
values is to develop your own baseline data. The threshold values reported here were developed
by observing internal testing of Windows 2000. Baseline information is important, so always keep
logs of well-performing systems as a baseline.
Second, System Monitor is an external view of the machine; it provides no insight into kernel
operation, although it does show some information about each user process. Simplifying the
state of the machine makes troubleshooting an easier task. The simple method of replacing or
running without specific drivers may lead to a quicker resolution than trying to interpret several
logs. IIS and content indexing are examples of services that should be disabled if not needed.
Other third-party components that should be disabled are screen savers, virus scanners, file
replication agents, and any other file system or network services whose function you do not
understand and cannot establish through research.
Third, a System Monitor log is an overview of the internal state of the machine. It does not tell
the engineer what the end users observe. Without a detailed explanation of exactly what the end
users see and what the administrators observe directly in event or error logs, it is difficult to reach
any conclusion. Example: the end user's complaint is that the machine is slow. Without other
information, it is impractical to tell whether the CPU, memory, or disk is the source of the
bottleneck. Without an idea of exactly which actions are slow and exactly when the slowness
occurs, the engineer will not know what to look for. All one can do is look for obvious bottlenecks
in the system, such as the ones outlined in the overview below.
• Memory
o % Committed bytes in use
This value should be stable over the course of a day of normal use
(except on Terminal Server).
Any value over 80% is something to look into, especially if the Commit
Limit changes even slightly.
• If the Commit Limit changes at all, the system has run out of
page file space and has attempted to expand the page file.
However, the system is generally too slow to react to the
expansion.
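The two checks above can be sketched as a small function over logged samples of % Committed Bytes In Use and the Commit Limit. The thresholds follow the text; the function name and sample data are invented for illustration:

```python
# Sketch of the commit-charge checks: flag sustained use over 80%, and
# flag ANY change in the Commit Limit (it means the page file expanded).

def check_commit(samples):
    """samples: list of (pct_committed, commit_limit_bytes) tuples."""
    warnings = []
    if any(pct > 80.0 for pct, _ in samples):
        warnings.append("committed bytes over 80% of the commit limit")
    limits = {limit for _, limit in samples}
    if len(limits) > 1:
        # Any change in the Commit Limit means the page file was expanded.
        warnings.append("commit limit changed: page file space exhausted")
    return warnings

log = [(62.0, 2_147_483_648), (85.5, 2_147_483_648), (86.0, 2_415_919_104)]
print(check_commit(log))
```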
o Available bytes
Any value less than 4 MB is an issue. Very short dips (about 10
seconds) that stay above 2.5 MB are usually acceptable but still need to
be investigated. Extended periods below 4 MB generally mean that the
system is out of physical memory; extended periods below 2.5 MB
certainly mean that the system is out of physical memory.
Do not blindly add memory as a solution. Further investigation is
needed to determine what is consuming physical memory.
Remember that extensive paging will occur at this point and the system
will slow down. Paging is the result, not the cause, of this activity.
The engineer will next need to investigate the cache, paged pool,
nonpaged pool, and then EACH process, in that order.
The system tends to cache everything, so on a server with many
megabytes of free memory, that memory will actually be used as cache.
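The Available Bytes rules above reduce to a small classifier. A sketch, assuming a fixed sampling interval; the function name and the choice of 5-second samples are illustrative:

```python
# Sketch of the Available Bytes rules: below 2.5 MB at any point, or an
# extended period below 4 MB, means the system is out of physical
# memory; a short dip between 2.5 and 4 MB merits investigation.

def classify_available_bytes(samples_mb, sample_interval_sec=5):
    """samples_mb: consecutive Available Bytes samples, in megabytes."""
    if any(s < 2.5 for s in samples_mb):
        return "out of physical memory"
    low_run = 0
    for s in samples_mb:
        low_run = low_run + 1 if s < 4.0 else 0
        # Treat anything longer than ~10 seconds below 4 MB as extended.
        if low_run * sample_interval_sec > 10:
            return "extended period below 4 MB: likely out of memory"
    if any(s < 4.0 for s in samples_mb):
        return "short dip below 4 MB: acceptable, but investigate"
    return "ok"

print(classify_available_bytes([12.0, 3.8, 3.9, 12.5]))
```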
o Cache bytes
We use this counter not to look at memory issues but at disk and
process issues. The system will rob pages from the cache to service
other memory requests, but high numbers here usually mean a disk
bottleneck. It is important to think of this as an available pool the system
can rob from if necessary.
Low values are 30 to 100 MB; values over 400 MB are high. High
numbers raise a flag and will need to be investigated.
The limit is 960 MB on Windows 2000, except in the following cases,
where the limit is 512 MB:
• Terminal Server, the /3GB boot flag, or over 16 GB of physical
memory.
• /PAE will allow the use of almost ALL memory; 6 GB caches
have been reported.
This counter NEVER indicates a memory leak. On file servers we would
expect this counter to rise when the CPU counter goes up and decrease
when the CPU goes down.
o Commit limit
This counter should never change; any change indicates a page file that
is too small and has been extended. If this happens, investigate by
looking at all of the process memory counters.
Many processes do not wait for page file extension to occur and do not
gracefully handle the rejection of a memory request.
Disk space is cheap, so recommend a major expansion of the page file
and relog the data. Note that this is the complete opposite of what is
recommended when the engineer sees limits in the physical memory
case.
o Committed Bytes
This counter is the sum of each process's committed bytes of virtual
space, i.e., the total number of bytes that must be backed by the page
file or physical memory. It should be relatively stable on a multiday log,
going up each day and down each night. If the counter continues to rise
day over day, start looking for memory leaks. Another possible
explanation is new user sessions connecting each day while old ones
never go away. Possible causes of inflated numbers are bad system or
application design as well as memory leaks.
The next step is to investigate each process by using the Process
object in a detailed chart.
o Free System page table entries
Ignore this counter unless it drops below 10,000; consult Inside
Windows NT for further explanation. Exhaustion of free system page
table entries usually generates a blue screen but can also produce a
mysterious hang.
o Generally, the system will not run out of system pages,
due to a major enhancement of the available system
pages in Windows 2000. Note that this enhancement
applies only to systems with more than 256 MB of
memory; systems with less than 256 MB exhibit exactly
the same behavior found on Windows NT 4.0.
o You also get NT 4.0 behavior with the /3GB switch.
o Page faults/second
It is important not to confuse this counter with the following two counters.
This is a soft-fault counter and is generally not an issue for the system.
Soft faults are memory references resolved by another page already in
memory; they are quite fast and do not result in any performance
penalty. High numbers need to be investigated but usually do not mean
anything unless Pages/sec or Pages Input/sec is also high. Page faults
can range all over the spectrum with normal application behavior; values
from 0 to 1,000 per second can be normal. This is where a baseline is
essential to determine the expected behavior. The event logs are also
useful.
Look at Context Switches/sec for supporting behavior. If this counter is
high, look for a specific process demonstrating high CPU or other
unusual behavior.
o Pages/second
Investigate if this counter exceeds 40 pages per second on a system
with a slow disk; even 200 pages per second may not be an issue on a
system with a fast disk subsystem. Note that the values of 5 to 20
pages that appear in many other sources of documentation are out of
date.
Always break this counter up into pages input and pages output
separately if it is above 40 per second.
Pages/sec is the number of pages read from the disk or written to the
disk to resolve memory references to pages that were not in memory at
the time of the reference. This is the sum of Pages Input/sec and Pages
Output/sec. This counter includes paging traffic on behalf of the system
Cache to access file data for applications. This is the primary counter to
observe if you are concerned about excessive memory pressure (that is,
thrashing), and the excessive paging that may result. This counter,
however, also accounts for such activity as the sequential reading of
memory mapped files, whether cached or not. The typical indication of
this is when you see high number of Memory: Pages/sec, a "normal"
(average, relative to the system being monitored) or high number of
Memory: Available Bytes, and a normal or small amount of Paging File:
% Usage. In the case of a non-cached memory mapped file, you also
see normal or low cache (cache fault) activity.
Pages Output/sec is an issue only when disk load becomes an issue.
Remember, these are pages written to by an application that must be
backed out to the page file. This is not resource intensive, and as long
as disk write time for the logical partition does not exceed 30%, you
should not see any system impact. The correct method of observing
disk write time is to look at its inverse counter: disk idle time should be
70% or greater.
o Pages input /sec
Separate Pages Output/sec from Pages Input/sec. Pages output will
stress the system, but the application will not see this. An application
waits only on an input page, so the engineer will need to know what the
application's tolerance for waiting on input pages is. For example, SQL
Server and most applications tolerate very few input pages, while
Exchange does much better. Again, you will need a good baseline to
compare against.
If you suspect paging is the issue, the best threshold value to use is disk
read time on the logical disk holding the page file. Disk read time below
15% and transfer times below 20 msec tend to indicate that paging is
not an issue.
If you have no other information, use the general rule that 20 pages
input per second per spindle will not slow down most applications.
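The paging rules above (the disk-speed-dependent Pages/sec threshold, the input/output split, and the per-spindle input rule) can be sketched as one function. The thresholds are the ones quoted in this section, not universal constants, and the function name is illustrative:

```python
# Sketch of the paging assessment: the overall Pages/sec threshold
# depends on disk speed, and ~20 pages input/sec per spindle is
# tolerable for most applications.

def paging_concern(pages_input_sec, pages_output_sec,
                   spindles=1, fast_disk=False):
    total = pages_input_sec + pages_output_sec
    threshold = 200 if fast_disk else 40
    if total <= threshold:
        return "paging within threshold"
    if pages_input_sec <= 20 * spindles:
        # Output paging stresses the disk, but applications only wait
        # on page *input*, so this is likely tolerable.
        return "high paging, but input paging tolerable per spindle"
    return "investigate: input paging likely slowing applications"

print(paging_concern(15, 10))
print(paging_concern(35, 30, spindles=2))
print(paging_concern(90, 10, spindles=2))
```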
o Pool Nonpaged bytes
Here the engineer is looking for two separate behaviors. First, memory-
leaking behavior, i.e., steadily increasing nonpaged pool usage. The
assumption is that nonpaged pool usage should reach a stable value
after some operating time, generally two or three hours. This is
generally true but not always. For example, adding many new users on
a file server will increase nonpaged pool usage, but this is cyclic
behavior, and we would expect the usage to drop as the users log off.
Be very suspicious of any deltas (sudden changes or spikes) in either
pool counter. This could be normal behavior, but usually it is not. Try to
find any process, or thread in the System process, that increases CPU
when these deltas occur.
Also, compare the deltas against process memory usage, I/O usage,
and handle usage.
Pool usage much higher than 30 MB needs to be investigated. Typical
pool usage is 3 to 30 MB, except for terminal servers or streaming
video or audio.
If usage is excessive, look at drivers, services, and then the System
process as sources of the problem.
The limit is 256 MB, although even a debug build can show larger
values. /3GB and Terminal Server limit nonpaged pool to 128 MB.
o Pool Paged bytes
Look for the same behaviors as above, but remember that applications
as well as services and drivers use paged pool.
Although the maximum amount of paged pool is approximately 335 MB
by default, the system will allocate only half of that as a maximum, i.e.,
approximately 165 MB. KB article Q247094, as well as a public website
on memory tuning for terminal servers, applies to any machine with
over 256 MB of memory. Setting the paged pool size to FFFFFFFF in
the registry doubles the available paged pool bytes to the maximum for
the system (approximately 335 MB).
The limit is 192 MB with /3GB.
The largest prime user of paged pool is the registry, which can be quite
large on domain controllers.
The next prime users of paged pool are file servers, especially when
serving out profiles. Terminal servers with many connections and print
servers with tens of thousands of print jobs pending are also major
users of system pool. Print queues use little space; it is the printing
itself that takes the space.
Expect to see 10 to 30 MB used, plus whatever is needed for the above
tasks.
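The registry change described above can be expressed as a .reg fragment. This is a sketch based on KB Q247094: the value name PagedPoolSize under the Memory Management key is the standard location, but verify against the KB article before applying it to a production system.

```
Windows Registry Editor Version 5.00

; Sketch per KB Q247094: request the maximum paged pool size.
; Verify against the KB article before applying in production.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management]
"PagedPoolSize"=dword:ffffffff
```

A reboot is required for the Memory Manager to recompute the pool sizes.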
• Disk
o Disk transfers/sec
Current disk drive technology shows the following limits:
• 180 sequential transfers per second per 10,000 RPM of disk
drive.
o Some drives with good predictive read-ahead will reach
this rate.
• 60 random transfers per second per 10,000 RPM of disk drive.
o It is necessary to know the disk speed and the type of
I/O in order to determine the maximum throughput.
o This rule is hard and fast for database work.
• 60-80 transfers per second as a general rule of thumb, as most
I/O is pseudo-random.
• For database applications, be VERY conservative.
• Caching disk controllers nullify these limits for writes only, and
yield a gain of 4x to 10x in write transfers only.
• The above limits are per spindle, not an overall limit for a RAID
set. Due to RAID set design, the limits on RAID set throughput
are somewhat difficult to calculate. Below is a summary of the
disk I/Os per second generated for each type of RAID
configuration, based on a given number of reads and writes per
second:
o RAID 0: READS + WRITES = I/Os / sec
o RAID 1: READS + (2*WRITES) = I/Os / sec
o RAID 5: READS + (4*WRITES) = I/Os / sec
o RAID 0+1: READS + (2*WRITES) = I/Os per second.
o See whitepapers from your preferred hardware RAID
vendor for a detailed explanation of how to observe disk
bottlenecks and to calculate disk I/O limits.
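The RAID formulas above can be expressed as a function: given front-end reads and writes per second, compute back-end disk I/Os per second, then check the result against the per-spindle limit from this section. The function names and the 60-transfers-per-spindle default are taken from the text; treat this as a sketch, not a vendor sizing tool:

```python
# Back-end I/O cost per front-end write, by RAID level (from the
# formulas above; e.g. one RAID 5 write becomes four disk I/Os:
# read data, read parity, write data, write parity).
RAID_WRITE_COST = {"0": 1, "1": 2, "5": 4, "0+1": 2}

def backend_ios(reads_per_sec, writes_per_sec, raid_level):
    """Back-end disk I/Os per second for the given RAID level."""
    return reads_per_sec + RAID_WRITE_COST[raid_level] * writes_per_sec

def within_spindle_limit(reads, writes, raid_level, spindles,
                         per_spindle_limit=60):
    # 60 random transfers/sec per 10,000 RPM spindle, per the text.
    return backend_ios(reads, writes, raid_level) <= spindles * per_spindle_limit

print(backend_ios(100, 50, "5"))        # 100 + 4*50 = 300
print(within_spindle_limit(100, 50, "5", spindles=6))
```

For example, a workload of 100 reads and 50 writes per second on RAID 5 generates 300 back-end I/Os, which six 10,000 RPM spindles (360 I/Os/sec) can absorb but four cannot.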
o Split I/Os
Should be close to ZERO. If not:
• The stripe unit for the RAID set is too small.
o For example, statistical variation means that a 12 KB
average I/O is about the largest a 16 KB stripe unit can
accommodate.
• The disk is heavily fragmented, or free space is too small.
o Percent Free Space
Never less than 15% free
Try to maintain at least 25% free
You will need 30%-40% free on highly dynamic systems.
• System
o Context Switches/sec
A context switch occurs when a processor switches from running one
thread to another thread.
Do not forget to divide by the number of processors.
1,000 to 3,000 per processor is the range from excellent to fair.
6,000 or greater is considered poor. The upper limit is about 20,000 at
90% CPU.
3,000 to 9,000 per processor is allowed for Terminal Server, due to the
blocking that naturally occurs there.
Abnormally high rates can be caused by page cache faults due to
memory starvation, or by an application memory issue around heap
allocations or blocking on another resource. Further determination of
the cause requires a good baseline to compare against.
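The arithmetic above can be sketched as a small classifier. The thresholds are the ones quoted in this section, not official limits, and the function name is illustrative:

```python
# Sketch of the context-switch thresholds. The raw
# System\Context Switches/sec value is system-wide and must be
# divided by the processor count before applying per-processor ranges.

def rate_context_switches(switches_per_sec, processors,
                          terminal_server=False):
    per_cpu = switches_per_sec / processors
    if terminal_server:
        # 3,000-9,000 per processor is allowed on Terminal Server.
        return "acceptable" if per_cpu <= 9000 else "poor"
    if per_cpu <= 3000:
        return "excellent to fair"
    if per_cpu < 6000:
        return "fair to poor"
    return "poor"

print(rate_context_switches(8000, 4))   # 2,000 per CPU
print(rate_context_switches(28000, 4))  # 7,000 per CPU
print(rate_context_switches(28000, 4, terminal_server=True))
```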
o Processes
Sanity check: track the number of processes to determine whether it is
increasing or decreasing.
o Processor queue length
Only one queue for all processors.
On standard servers with long quantums (see the KB article on
quantums in Windows 2000):
• 4 or less per CPU is excellent
• < 8 per CPU is good
• < 12 per CPU is fair
On terminal servers, which have short quantums:
• 10 or less per CPU is excellent
• < 15 per CPU is good
• < 20 per CPU is fair
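The queue-length guidance above can be sketched the same way; since there is a single queue for all processors, the value is divided by the CPU count first. The cutoffs are the ones quoted above, and the function name is illustrative:

```python
# Sketch of the Processor Queue Length guidance: one queue serves all
# processors, so normalize per CPU before comparing to the cutoffs,
# which differ with quantum length (standard server vs Terminal Server).

def rate_queue_length(queue_length, cpus, terminal_server=False):
    per_cpu = queue_length / cpus
    excellent, good, fair = (10, 15, 20) if terminal_server else (4, 8, 12)
    if per_cpu <= excellent:
        return "excellent"
    if per_cpu < good:
        return "good"
    if per_cpu < fair:
        return "fair"
    return "poor"

print(rate_queue_length(6, 2))   # 3 per CPU on a standard server
print(rate_queue_length(30, 2))  # 15 per CPU on a standard server
print(rate_queue_length(30, 2, terminal_server=True))
```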
o Threads
Sanity check: track the number of threads, especially to determine
whether it is increasing or decreasing.
Appendix – Definition of counters referenced in this paper
The information in this appendix is a subset of that provided in the Windows 2000 Server
Resource Kit: Supplement 1 Performance Counters Reference guide. It is given here to
provide context for the topics discussed in the paper.
Process Object
The Process performance object consists of counters that monitor running application programs
and system processes. All the threads in a process share the same address space and have
access to the same data.
Counter Name: Private Bytes
Description: Shows the current number of bytes that this process has
allocated that cannot be shared with other processes.

Counter Name: Handle Count
Description: Shows the total number of handles currently open by this
process. This number is equal to the sum of the handles currently open by
each thread in this process.

Counter Name: Virtual Bytes
Description: Shows the current size, in bytes, of the virtual address space
that the process is using. Use of virtual address space does not necessarily
imply corresponding use of either disk or main memory pages. Virtual space
is finite, and by using too much, the process can limit its ability to load
libraries.

Counter Name: Virtual Bytes Peak
Description: Shows the maximum size, in bytes, of virtual address space the
process has used at any one time. Use of virtual address space does not
necessarily imply corresponding use of either disk or main memory pages.
However, virtual space is finite, and the process might limit its ability to load
libraries by using too much.
Processor Object
The Processor performance object consists of counters that measure aspects of processor
activity. The processor is the part of the computer that performs arithmetic and logical
computations, initiates operations on peripherals, and runs the threads of processes. A computer
can have multiple processors. The processor object represents each processor as an instance of
the object.
Memory Object
The Memory performance object consists of counters that describe the behavior of physical and
virtual memory on the computer. Physical memory is the amount of random-access memory
(RAM) on the computer. Virtual memory consists of space in physical memory and on disk. Many
of the memory counters monitor paging, which is the movement of pages of code and data
between disk and physical memory. Excessive paging, a symptom of a memory shortage, can
cause delays which interfere with all system processes.
Counter Name: % Committed Bytes In Use
Description: Shows the ratio of Memory\ Committed Bytes to the Memory\
Commit Limit. Committed memory is physical memory in use for which space
has been reserved in the paging file so that it can be written to disk. The
commit limit is determined by the size of the paging file. If the paging file is
enlarged, the commit limit increases, and the ratio is reduced.

Counter Name: Available Bytes
Description: Shows the amount of physical memory, in bytes, available to
processes running on the computer. It is calculated by summing the amount
of space on the zeroed, free, and standby memory lists. Free memory is
ready for use; zeroed memory consists of pages of memory filled with zeros
to prevent later processes from seeing data used by a previous process;
standby memory is memory that has been removed from a process's working
set (its physical memory) en route to disk but is still available to be recalled.

Counter Name: Cache Bytes
Description: Shows the sum of the values of System Cache Resident Bytes,
System Driver Resident Bytes, System Code Resident Bytes, and Pool
Paged Resident Bytes.

Counter Name: Page Reads/sec
Description: Shows the rate at which the disk is read to resolve hard page
faults. It shows the number of read operations, without regard to the number
of pages retrieved in each operation. Hard page faults occur when a process
references a page in virtual memory that is not in its working set or
elsewhere in physical memory, and must be retrieved from disk. This counter
is a primary indicator of the kinds of faults that cause system-wide delays. It
includes read operations to satisfy faults in the file system cache (usually
requested by applications) and in non-cached mapped memory files.
Compare the value of Page Reads/sec to the value of Pages Input/sec to
find the average number of pages read during each read operation.

Counter Name: Pages Input/sec
Description: Shows the rate at which pages are read from disk to resolve
hard page faults. Hard page faults occur when a process refers to a page in
virtual memory that is not in its working set or elsewhere in physical memory,
and must be retrieved from disk. When a page is faulted, the system tries to
read multiple contiguous pages into memory to maximize the benefit of the
read operation. Compare Pages Input/sec to Page Reads/sec to find the
average number of pages read into memory during each read operation.

Counter Name: Pages/sec
Description: Shows the rate at which pages are read from or written to disk
to resolve hard page faults. This counter is a primary indicator of the kinds of
faults that cause system-wide delays. It is the sum of Memory\ Pages
Input/sec and Memory\ Pages Output/sec. It is counted in numbers of pages,
so it can be compared to other counts of pages, such as Memory\ Page
Faults/sec, without conversion. It includes pages retrieved to satisfy faults in
the file system cache (usually requested by applications) and non-cached
mapped memory files.

Counter Name: Pool Nonpaged Bytes
Description: Shows the size, in bytes, of the nonpaged pool. Memory\ Pool
Nonpaged Bytes is calculated differently than Process\ Pool Nonpaged
Bytes, so it might not equal Process(_Total)\ Pool Nonpaged Bytes.

Counter Name: Pool Paged Bytes
Description: Shows the size, in bytes, of the paged pool. Memory\ Pool
Paged Bytes is calculated differently than Process\ Pool Paged Bytes, so it
might not equal Process(_Total)\ Pool Paged Bytes.

Counter Name: Transition Faults/sec
Description: Shows the rate at which page faults are resolved by recovering
pages that were being used by another process sharing the page, or were on
the modified page list or the standby list, or were being written to disk at the
time of the page fault. The pages are recovered without additional disk
activity. Transition faults are counted in numbers of faults; because only one
page is faulted in each operation, this is also equal to the number of pages
faulted.
Physical Disk Object
The Physical Disk performance object consists of counters that monitor hard or fixed disk drives
on a computer. Disks are used to store file, program, and paging data, are read to retrieve these
items, and are written to record changes to them. The values of physical disk counters are sums
of the values of the logical disks (or partitions) into which they are divided.
System Object
The System performance object consists of counters that apply to more than one component of
the computer.