vSphere Performance Monitoring and Troubleshooting

Overview

What?
CPU, Memory, Disk, Network
How?
Use available tools and a systematic methodology
Why?
Need to build confidence in virtualizing critical and high-demand applications

Top Issues

Top Issues:
Storage "performance capacity" oversubscription
Memory oversubscription
SMP overuse
Firmware & driver issues

What tools do we have at our disposal?

Top tools for information collection:


vCenter - Performance charts and alarms
Guest OS* - Task Manager/Resource Monitor and PerfMon
ESX Host - esxtop and vscsiStats
vSphere PowerCLI

*Guest-based monitoring can be inaccurate: the guest OS has no view of hypervisor-level contention

Prepare vCenter Settings

Prepare custom vCenter alerts:


Host Console Swap In Rate: Warning at 512 KBps, Alert at 1024 KBps
Host Console Swap Out Rate: Warning at 512 KBps, Alert at 1024 KBps
VM CPU Ready: Warning at 1000 ms, Alert at 2000 ms
VM Disk Latency: Warning at 20 ms, Alert at 50 ms
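
The warning and alert levels above can be written down as a simple lookup, which is handy when checking exported samples outside vCenter. This is only a sketch: `THRESHOLDS` and `classify` are hypothetical names, not part of vCenter, and treating a value equal to a threshold as having reached it is an assumption.

```python
# Hypothetical helper, not a vCenter API.
# (warning, alert) pairs copied from the custom alarm list above.
THRESHOLDS = {
    "host_swap_in_kbps": (512, 1024),
    "host_swap_out_kbps": (512, 1024),
    "vm_cpu_ready_ms": (1000, 2000),
    "vm_disk_latency_ms": (20, 50),
}


def classify(metric, value):
    """Return 'ok', 'warning' or 'alert' for one sampled value."""
    warning, alert = THRESHOLDS[metric]
    if value >= alert:        # assumption: reaching the level counts
        return "alert"
    if value >= warning:
        return "warning"
    return "ok"


print(classify("vm_cpu_ready_ms", 1500))
print(classify("vm_disk_latency_ms", 75))
```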

Prepare esxtop

ESXTOP realtime monitoring:


esxtop (run from an SSH session or Tech Support Mode)
s 2 (refresh the view every 2 seconds)
V (show VMs only)
h (quick in-tool command reference)

Batch mode for a 5-minute capture of all stats (150 samples, 2 seconds apart):
esxtop -b -a -d 2 -n 150 > esxtop_capture.csv
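
A batch capture like the one above (150 samples at 2-second intervals, so 300 seconds) is a wide CSV file with one column per counter. The sketch below pulls the peak value of every column whose header contains a given counter name. The perfmon-style header layout in the sample (`\\host\Group Cpu(id:name)\% Ready`) is an assumption about the file, and `peak_by_counter` is a hypothetical helper name; check the header row of a real capture before relying on it.

```python
import csv
import io


def peak_by_counter(csv_text, counter):
    """Return {column header: peak value} for headers containing `counter`."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if counter in name]
    peaks = {}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or non-numeric cells
            name = header[i]
            peaks[name] = max(peaks.get(name, value), value)
    return peaks


# Made-up two-sample capture; a real file has far more columns and rows.
sample = (
    '"Time","\\\\esx01\\Group Cpu(1234:vm-a)\\% Ready",'
    '"\\\\esx01\\Group Cpu(5678:vm-b)\\% Ready"\n'
    '"10:00:00","3.5","12.0"\n'
    '"10:00:02","4.1","9.8"\n'
)
for name, peak in peak_by_counter(sample, "% Ready").items():
    print(name, peak)
```

For a real capture, read `esxtop_capture.csv` into a string and pass it in place of `sample`.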

Prepare PowerCLI

Run PowerCLI:
Tip: Run as Administrator
Set-ExecutionPolicy RemoteSigned
Connect-VIServer -Server <host> -Protocol https -User <user> -Password <pass>
<host> can be the IP address or name of an ESX server or vCenter

Get-VM
Get-Stat -Entity <VM> -Common -Realtime

Where do we get started?

Network Overview

[Diagram: the network path: Virtual Machine, vNIC, vSwitch or dvSwitch, pNIC, Core Switch]

Network

Troubleshooting Guidance:
1. Physical Issues - A bad cable, a failing switch port or NIC, or an incompatible/flawed firmware or device driver (use VMXNET3 whenever possible)
2. Configuration Issues - Inconsistent configuration of vSwitches, Port Groups, or upstream VLAN trunks
3. Capacity Issues - Too many VMs on a single NIC; inadequate switch backplane or uplink capacity; sharing unmanaged network infrastructure for storage and data
4. Thresholds - Bandwidth saturation, dropped packets

Network - What can we see?

[Diagram: Virtual Machine, vNIC, dvSwitch (only), pNIC and Core Switch, marked with the coverage of Systems Management Tools and VMware Monitoring Tools]

Network

vCenter Metrics:
Receive packets dropped
Transmit packets dropped

Network

ESXTOP Metrics:

Display  Metric  Threshold  Explanation
NETWORK  %DRPTX  -          Dropped packets transmitted, hardware overworked. Possible cause: very high network utilization.
NETWORK  %DRPRX  -          Dropped packets received, hardware overworked. Possible cause: very high network utilization.

Network

ESXTOP Commands:
esxtop
s 2 (refresh every 2 seconds)
n (network view)
f (add or remove fields)

Network

ESXTOP Example:

Network

PowerCLI Commands:
Get-Stat -Entity <Host> -net -realtime
Get-Stat -Entity <Host> -stat net.droppedRx.summation
Get-Stat -Entity <Host> -stat net.droppedTx.summation

Network - What can't we see?

[Diagram: Virtual Machine, vNIC, virtual switch (Cisco 1000v only), pNIC and Core Switch, marked with the coverage of Network Monitoring Tools]

Network

Possible resources for external monitoring:


Native Telnet/SSH/HTTP-based interface counters and stats
Third-party SNMP, NetFlow and ICMP tools

CPU Overview

[Diagram: Virtual Machine, vCPU, Physical CPU]

CPU

Troubleshooting Guidance:
1. Physical Issues - Rare and always catastrophic (i.e. obvious)
2. Configuration Issues - Too many / too few vCPUs per VM; SMP/HAL mismatch; incorrect CPU affinity settings
3. Capacity Issues - CPU saturation at the guest or host level; CPU starvation due to high IO or other system-level ops
4. Thresholds - Waiting for CPU cycles (due to co-scheduling, swapping, high IO)

CPU - What can we see?

[Diagram: Virtual Machine, vCPU and Physical CPU, marked with the coverage of Systems Management Tools and VMware Monitoring Tools]

CPU

vCenter Metrics:
Host/Guest Saturation
Stacked Graph (per VM)
Usage

CPU

vCenter Metrics:
Guest
Ready (value in ms / 200 = n% for the 20-second real-time interval)
Swap Wait
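
The Ready counter in vCenter is a summation in milliseconds, so it has to be related to the length of the sampling interval before it can be compared with a percentage threshold. A minimal sketch, assuming the 20-second real-time interval (other chart intervals need a different `interval_s`); `ready_percent` is a hypothetical helper name.

```python
def ready_percent(ready_ms, interval_s=20):
    """Convert a CPU Ready summation (ms) into a percentage of the interval."""
    return ready_ms * 100 / (interval_s * 1000)


print(ready_percent(1000))   # 5.0
print(ready_percent(2000))   # 10.0
```

With these numbers the 1000 ms and 2000 ms alarm levels correspond to 5% and 10% ready time per 20-second sample.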

CPU

ESXTOP Metrics:

Display  Metric  Threshold  Explanation
CPU      %RDY    10         Overprovisioning of vCPUs, excessive usage of vSMP, or a limit has been set (check %MLMTD).
CPU      %CSTP   -          Excessive usage of vSMP. Decrease the number of vCPUs for this particular VM. This should lead to increased scheduling opportunities.
CPU      %SYS    20         The percentage of time spent by system services on behalf of the world. Most likely caused by a high-IO VM. Check other metrics and the VM for a possible root cause.
CPU      %MLMTD  -          The percentage of time the vCPU was ready to run but deliberately wasn't scheduled because that would violate the CPU limit settings. If larger than 0, the world is being throttled due to the limit on CPU.
CPU      %SWPWT  -          VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment.
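
The table above translates into a short checklist per VM. The sketch below applies only the rules the table actually states (%RDY above 10, %SYS above 20, %MLMTD larger than 0); `cpu_findings` and the dictionary layout are hypothetical, and the sample values are made up.

```python
def cpu_findings(sample):
    """Return findings for one VM's esxtop CPU counters (dict of name -> value)."""
    findings = []
    if sample.get("%RDY", 0) > 10:
        findings.append("%RDY above 10: check vCPU overprovisioning, vSMP use, or a CPU limit")
    if sample.get("%SYS", 0) > 20:
        findings.append("%SYS above 20: look for a high-IO VM")
    if sample.get("%MLMTD", 0) > 0:
        findings.append("%MLMTD above 0: the world is throttled by a CPU limit")
    return findings


# Made-up sample for illustration.
for line in cpu_findings({"%RDY": 14.2, "%SYS": 3.0, "%MLMTD": 0.0}):
    print(line)
```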

CPU

ESXTOP Commands:
esxtop
s 2 (refresh every 2 seconds)
V (show VMs only)
c (CPU view)
e GID (expand/contract a VM world)

CPU

ESXTOP Example:
Excessive vCPUs

CPU

ESXTOP Example:
Now with fewer vCPUs

CPU

ESXTOP Example:
SMP impacting multiple VMs

CPU

PowerCLI Example
Get-Stat -Entity <VM> -cpu
Get-Stat -Entity <VM> -stat cpu.ready.summation -realtime
Very cool script code at:
http://www.peetersonline.nl/index.php/vmware/examine-vmware-cpu-ready-times-with-powershell/

CPU - Not much else to see

[Diagram: Virtual Machine, vCPU and Physical CPU, marked with the coverage of Offline Diagnostics & Systems Management Tools]

CPU

Possible resources for external monitoring:


Vendor-specific systems management tools, MS System Center, etc.
http://www.peetersonline.nl/index.php/vmware/examine-vmware-cpu-ready-times-with-powershell/

Memory Overview

[Diagram: Virtual Machine, vRAM, Physical RAM]

Memory

Troubleshooting Guidance:
1. Physical Issues - Rare and usually catastrophic
2. Configuration Issues - Memory overcommit; incorrect
configuration of shares, reservations or limits
3. Capacity Issues - Physical memory exhaustion
4. Thresholds - Active memory swapping

Memory - What can we see?

[Diagram: Virtual Machine, vRAM and Physical RAM, marked with the coverage of Systems Management Tools and VMware Monitoring Tools]

Memory

vCenter Metrics
Swap in rate
Swap out rate
Swap used

Memory

ESXTOP Metrics:

Display  Metric    Threshold  Explanation
MEM      MCTLSZ    -          If larger than 0, the host is forcing VMs to inflate the balloon driver to reclaim memory because the host is overcommitted.
MEM      SWCUR     -          If larger than 0, the host has swapped memory pages in the past. Possible cause: overcommitment.
MEM      SWR/s     -          If larger than 0, the host is actively reading from swap (vswp). Possible cause: excessive memory overcommitment.
MEM      SWW/s     -          If larger than 0, the host is actively writing to swap (vswp). Possible cause: excessive memory overcommitment.
MEM      CACHEUSD  -          If larger than 0, the host has compressed memory. Possible cause: memory overcommitment.
MEM      ZIP/s     -          If larger than 0, the host is actively compressing memory. Possible cause: memory overcommitment.
MEM      UNZIP/s   -          If larger than 0, the host is accessing compressed memory. Possible cause: the host was previously overcommitted on memory.
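
Every row in the memory table uses the same "larger than 0" rule, so the interpretation can be sketched as one pass over the counters. `memory_pressure` and the dictionary layout are hypothetical, and the sample values are made up; the wording of each condition paraphrases the table.

```python
def memory_pressure(sample):
    """Map esxtop memory counters (dict of name -> value) to the table's conditions."""
    checks = [
        ("MCTLSZ", "ballooning in effect"),
        ("SWCUR", "pages were swapped in the past"),
        ("SWR/s", "actively reading from swap"),
        ("SWW/s", "actively writing to swap"),
        ("CACHEUSD", "compressed memory present"),
        ("ZIP/s", "actively compressing memory"),
        ("UNZIP/s", "actively accessing compressed memory"),
    ]
    return [text for counter, text in checks if sample.get(counter, 0) > 0]


# Made-up sample: balloon inflated, old swap present, no active swapping.
print(memory_pressure({"MCTLSZ": 512.0, "SWCUR": 128.0, "SWR/s": 0.0, "SWW/s": 0.0}))
```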

Memory

ESXTOP Commands:
esxtop
s 2 (refresh every 2 seconds)
V (show VMs only)
m (memory view)
f (add or remove fields)

Memory

ESXTOP Example:
m - Heavy swapping and ballooning

Memory

PowerCLI Commands:
Get-Stat -Entity <VM> -mem
Get-Stat -Entity <VM> -stat mem.swapoutRate.average -realtime
Get-Stat -Entity <VM> -stat mem.swapinRate.average -realtime
Get-Stat -Entity <VM> -stat mem.vmmemctl.average -realtime
Get-Stat -Entity <Host> -stat mem.swapused.average -realtime

Memory - The occasional DIMM failure

[Diagram: Virtual Machine, vRAM and Physical RAM, marked with the coverage of Offline Diagnostics & Systems Management Tools]

Memory

Possible external monitoring options:


Vendor-specific systems management tools, MS System Center, etc.
Don't forget vCenter Hardware Status reporting

Storage Overview

[Diagram: the storage path, with these labels: Virtual Machine (vmdk files and an RDM), SCSI Controller, Datastore, HBA, Switch, Controller, LUN, Disk]

Storage

Troubleshooting Guidance:
1. Physical Issues - A bad cable, a failing switch port or HBA/NIC, or an incompatible/flawed firmware or device driver (use LSI Logic Parallel/SAS as appropriate)
2. Configuration Issues - Inconsistent or incorrect configuration of LUN masking, zoning, or multipathing; inappropriate resource provisioning; queue depth not aligned with the storage type
3. Capacity Issues - Too many VMs or VMDKs on a LUN; too much IO load for an array or RAID group
4. Thresholds - Latency and queuing

Storage - What can we see?

[Diagram: the storage path (Virtual Machine with vmdk files and an RDM, SCSI Controller, Datastore, HBA, Switch, Controller, LUN, Disk), marked with the coverage of Systems Management Tools and VMware Monitoring Tools]

Storage

vCenter Metrics:
Datastore
Read latency
Write latency

Storage

ESXTOP Metrics:

Display  Metric    Threshold  Explanation
DISK     GAVG      20         Look at DAVG and KAVG, as the sum of both is GAVG.
DISK     DAVG      20         Disk latency most likely to be caused by the array.
DISK     KAVG      -          Disk latency caused by the VMkernel; high KAVG usually means queuing. Check QUED.
DISK     QUED      -          Queue maxed out. Possibly queue depth set too low. Check with the array vendor for the optimal queue depth value.
DISK     ABRTS/s   -          Aborts issued by the guest (VM) because storage is not responding. Can be caused by failed paths.
DISK     RESETS/s  -          The number of commands reset per second.
DISK     CONS/s    20         SCSI reservation conflicts per second. Can be caused by too many VMDKs on a datastore.
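
Because GAVG is the sum of DAVG and KAVG, a high guest latency can be attributed to whichever component dominates. The sketch below uses the 20 ms threshold from the table; `latency_source` is a hypothetical helper, the "larger component wins" rule is a simplifying assumption, and the sample values are made up.

```python
def latency_source(davg_ms, kavg_ms, threshold_ms=20):
    """Return (GAVG, verdict), where GAVG = DAVG + KAVG in milliseconds."""
    gavg_ms = davg_ms + kavg_ms
    if gavg_ms <= threshold_ms:
        return gavg_ms, "within threshold"
    if davg_ms >= kavg_ms:   # assumption: blame the larger component
        return gavg_ms, "mostly device latency: look at the array"
    return gavg_ms, "mostly kernel latency: check QUED for queuing"


print(latency_source(28.0, 1.5))
print(latency_source(6.0, 19.0))
```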

Storage
[Diagram: the I/O latency stack. Guest layers: Application, Filesystem, I/O Drivers. VMkernel layers: Virtual SCSI, Filesystem, Device Queue. Legend: R = Physical Disk "Disk Secs/Transfer" (application latency), G = Guest Latency, K = ESX Kernel, D = Device Latency]

Storage

ESXTOP Commands (HBA/LUN):


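esxtop
s 2 (refresh every 2 seconds)
V (show VMs only)
d (disk adapter view)
f (add or remove fields)
e vmhba# (expand/contract an adapter)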

Storage

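ESXTOP Commands (LUN/Datastore):
esxtop
s 2 (refresh every 2 seconds)
V (show VMs only)
u (disk device view)
L 38 (set the device name field width to 38 characters)
f (add or remove fields)
e <devname> (expand/contract a device)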

Storage

ESXTOP Commands (VM/VMDK):


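esxtop
s 2 (refresh every 2 seconds)
V (show VMs only)
v (disk VM view)
f (add or remove fields)
e GID (expand/contract a VM)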

Storage

ESXTOP Examples:
d - Multipathing / Expand adapter to view targets

Storage

ESXTOP Examples:
u - Queuing, Disk or Kernel?

Storage

ESXTOP Examples:
v - Identify the IO consumer

Storage

vscsiStats Command:
[root@host ~]# cd /usr/lib/vmware/bin
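./vscsiStats -l (list VMs and their world IDs)
./vscsiStats -s -w <worldid> (start collection for that world)
./vscsiStats -w <worldid> -p all -c > /path/vscsistats.csv (print all histograms as CSV)
./vscsiStats -x (stop collection)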

Storage

vscsiStats Example:
http://dunnsept.wordpress.com/2010/03/11/new-vscsistats-excel-macro/

Storage

vscsiStats histograms:
IO lengths of commands
IO lengths of Read commands
IO lengths of Write commands
distance (in LBNs) between successive commands
distance (in LBNs) between successive Read commands
distance (in LBNs) between successive Write commands
distance (in LBNs) between each command from the closest of previous 16
latency of IOs in Microseconds (us)
latency of Read IOs in Microseconds (us)
latency of Write IOs in Microseconds (us)
number of outstanding IOs when a new IO is issued
number of outstanding Read IOs when a new Read IO is issued
number of outstanding Write IOs when a new Write IO is issued
latency of IO interarrival time in Microseconds (us)
latency of IO interarrival time for Reads in Microseconds (us)
latency of IO interarrival time for Writes in Microseconds (us)
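
Each histogram is a set of buckets with a count of IOs per bucket, so questions like "what share of IOs completed within 15 ms?" reduce to summing buckets. A minimal sketch, assuming each bucket is identified by its upper limit; `fraction_at_or_below` is a hypothetical helper and the bucket values are made up rather than taken from a real capture.

```python
def fraction_at_or_below(histogram, limit):
    """histogram: list of (bucket upper limit, count). Share of IOs in buckets <= limit."""
    total = sum(count for _, count in histogram)
    if total == 0:
        return 0.0
    below = sum(count for upper, count in histogram if upper <= limit)
    return below / total


# Made-up latency histogram in microseconds: (bucket upper limit, number of IOs)
latency_us = [(100, 50), (500, 200), (1000, 400), (5000, 250), (15000, 80), (30000, 20)]
share = fraction_at_or_below(latency_us, 15000)
print(f"{share:.1%} of IOs completed within 15 ms")
```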

Storage

PowerCLI Commands:
Get-Stat -Entity <Host> -disk
Get-Stat -Entity <Host> -stat disk.totalLatency.average -realtime
Get-Stat -Entity <Host> -stat disk.deviceLatency.average -realtime
Get-Stat -Entity <Host> -stat disk.kernelLatency.average -realtime

Storage - What can't we see?

[Diagram: the storage path (Virtual Machine with vmdk files and an RDM, SCSI Controller, Datastore, HBA, Switch, Controller, LUN, Disk), marked with the coverage of Storage Monitoring Tools]

Storage - More of what we can't see

[Diagram: the storage path (Virtual Machine with vmdk files and an RDM, SCSI Controller, Datastore, HBA, Switch, Controller, LUN, Disk), marked with the coverage of Network/Fabric Monitoring Tools]

Storage

Possible external monitoring solutions:


Vendor-specific SAN and fabric/network tools, native Telnet/SSH/HTTP-based tools for most networks, third-party SNMP-based tools

Working with PowerCLI

PowerCLI Tips:
For a complete list of stat objects:
Get-StatType -Entity <Host/VM>

Pipe the output to a file:
Get-Stat -Entity <Host/VM> -stat <stat> -realtime | ft -autosize > c:\temp\<filename>.csv

Import the file into a spreadsheet with fixed-width parameters (ft writes fixed-width text, not comma-separated values)
Build pretty graphs

Way More Information

ESXTOP / vscsiStats / PowerCLI:


http://www.yellow-bricks.com/esxtop/ Special thanks to Duncan Epping!
http://communities.vmware.com/docs/DOC-3930
http://communities.vmware.com/docs/DOC-9279
http://communities.vmware.com/docs/DOC-10095
http://www.vmware.com/support/developer/PowerCLI/PowerCLI41/html/Get-Stat.html
http://www.lucd.info/2009/12/30/powercli-vsphere-statistics-part-1-the-basics/
http://simongreaves.co.uk/blog/esxtop-guide
http://dunnsept.wordpress.com/2010/03/11/new-vscsistats-excel-macro/

Easy button?

What is the problem with these tools?


Limited alerting mechanisms, no collection
automation or historical data for comparison,
and no correlation of events!
vCenter Operations Standard / Enterprise
