Overview
What?
CPU, Memory, Disk, Network
How?
Use available tools and a systematic methodology
Why?
Need to build confidence in virtualizing critical and high-demand applications
Top Issues:
Storage "performance capacity" oversubscription
Memory oversubscription
SMP overuse
Run PowerCLI:
Tip: Run as Administrator
Set-ExecutionPolicy RemoteSigned
Connect-VIServer -Server <host> -Protocol https -User <user> -Password <pass>
<host> can be the IP address or name of an ESX server or vCenter
Get-VM
Get-Stat -Entity <VM> -Common -Realtime
Network
(Diagram: Virtual Machine, vNIC, vSwitch or dvSwitch, pNIC, Core Switch)
Troubleshooting Guidance:
1. Physical Issues - A bad cable, a failing switch port or NIC, or
an incompatible/flawed firmware or device driver (use
VMXNET3 whenever possible)
(Diagram: Virtual Machine, vNIC, dvSwitch (only), pNIC, Core Switch)
VMware Monitoring Tools
vCenter Metrics:
Receive packets dropped
Transmit packets dropped
ESXTOP Metrics:
Display  Metric  Threshold  Explanation
NETWORK  %DRPTX
NETWORK  %DRPRX
ESXTOP Commands:
esxtop
s 2 (set the refresh interval to 2 seconds)
n (network view)
f (add or remove fields)
ESXTOP Example:
PowerCLI Commands:
Get-Stat -Entity <Host> -Network -Realtime
Get-Stat -Entity <Host> -stat net.droppedRx.summation
Get-Stat -Entity <Host> -stat net.droppedTx.summation
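The dropped-packet counters above are raw counts, so they are easier to judge as a share of all traffic. The following is a minimal sketch, assuming the samples have already been exported from Get-Stat into plain lists and that a matching packet counter (for example net.packetsRx.summation) counts only packets that were not dropped. The function name and that assumption are illustrative, not part of the original slides.

```python
def drop_percent(dropped_samples, packet_samples):
    """Return dropped packets as a percentage of all packets seen.

    dropped_samples -- per-interval values of net.droppedRx.summation
                       (or net.droppedTx.summation)
    packet_samples  -- per-interval values of the matching packet counter;
                       ASSUMPTION: these exclude the dropped packets
    """
    dropped = sum(dropped_samples)
    delivered = sum(packet_samples)
    total = dropped + delivered
    if total == 0:
        # No traffic at all in the window, so nothing was dropped.
        return 0.0
    return 100.0 * dropped / total


if __name__ == "__main__":
    # Three 20-second realtime samples for one host (made-up numbers).
    print(drop_percent([0, 2, 3], [400, 300, 295]))
```

Any sustained non-zero result is worth comparing against the %DRPTX and %DRPRX columns in esxtop.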
Net Mon Tools
CPU
(Diagram: Virtual Machine, vCPU, Physical CPU)
Troubleshooting Guidance:
1. Physical Issues - Rare and always catastrophic (i.e. obvious)
2. Configuration Issues - Too many / too few vCPUs per VM;
SMP/HAL mismatch; incorrect CPU affinity settings
3. Capacity Issues - CPU saturation at the guest or host level;
CPU starvation due to high IO or other system level ops
VMware Monitoring Tools
vCenter Metrics:
Host/Guest Saturation
Stacked Graph (per VM)
Usage
vCenter Metrics:
Guest
Ready (realtime: value in ms / 200 = n%)
Swap Wait
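The Ready counter in vCenter (cpu.ready.summation) is reported in milliseconds accumulated over one sample interval, and realtime samples span 20 seconds. Converting it to a percentage therefore means dividing by the interval length in milliseconds. The sketch below shows that arithmetic; the function name and the optional per-vCPU normalization are assumptions added for illustration.

```python
REALTIME_INTERVAL_S = 20  # realtime charts sample every 20 seconds


def ready_percent(ready_ms, interval_s=REALTIME_INTERVAL_S, vcpus=1):
    """Convert a cpu.ready.summation sample (milliseconds) to a percentage.

    ready_ms   -- summation value for one sample interval
    interval_s -- length of the sample interval in seconds
    vcpus      -- ASSUMPTION: divide by the vCPU count when the value is the
                  VM-level aggregate and a per-vCPU figure is wanted
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return 100.0 * ready_ms / (interval_s * 1000.0 * vcpus)


if __name__ == "__main__":
    print(ready_percent(1000))           # one realtime sample of 1000 ms
    print(ready_percent(4000, vcpus=4))  # same load spread over 4 vCPUs
```

For other chart intervals, pass the matching interval length in seconds instead of the 20-second default.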
ESXTOP Metrics:
Display  Metric  Threshold  Explanation
CPU      %RDY    10         Overprovisioning of vCPUs, excessive usage of vSMP, or a limit has been set (check %MLMTD).
CPU      %CSTP              Excessive usage of vSMP. Decrease the number of vCPUs for this particular VM; this should lead to increased scheduling opportunities.
CPU      %SYS    20         The percentage of time spent by system services on behalf of the world. Most likely caused by a high-IO VM. Check other metrics and the VM for a possible root cause.
CPU      %MLMTD             The percentage of time the vCPU was ready to run but deliberately wasn't scheduled because that would violate the CPU limit settings. If larger than 0, the world is being throttled due to the CPU limit.
CPU      %SWPWT             VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment.
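Once esxtop values have been captured, the checks above can be applied mechanically. This is a minimal sketch, assuming the values for one VM have been copied into a dictionary keyed by metric name. It uses only the limits that appear in this section (%RDY 10, %SYS 20, and "larger than 0" for %MLMTD); metrics without a stated threshold are left out rather than guessed. The names CPU_THRESHOLDS and cpu_warnings are hypothetical.

```python
# Limits as stated in this section; nothing else is assumed.
CPU_THRESHOLDS = {
    "%RDY": 10.0,   # ready time
    "%SYS": 20.0,   # system services time
    "%MLMTD": 0.0,  # any value above 0 means the CPU limit is throttling
}


def cpu_warnings(sample):
    """Return the names of metrics in `sample` that exceed their limit.

    sample -- dict mapping an esxtop metric name to its observed value;
              metrics missing from the dict are treated as 0.
    """
    return sorted(
        name
        for name, limit in CPU_THRESHOLDS.items()
        if sample.get(name, 0.0) > limit
    )


if __name__ == "__main__":
    observed = {"%RDY": 12.5, "%SYS": 3.0, "%MLMTD": 0.0}
    print(cpu_warnings(observed))
```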
ESXTOP Commands:
esxtop
s 2 (set the refresh interval to 2 seconds)
V (show VM worlds only)
c (CPU view)
e GID (expand/contract a VM world)
ESXTOP Example:
Excessive vCPUs
ESXTOP Example:
Now with fewer vCPUs
ESXTOP Example:
SMP impacting multiple VMs
PowerCLI Example
Get-Stat -Entity <VM> -Cpu
Get-Stat -Entity <VM> -stat cpu.ready.summation -realtime
Very cool script code at:
http://www.peetersonline.nl/index.php/vmware/examine-vmware-cpuready-times-with-powershell/
Offline Diagnostics & Systems Management Tools
Memory
(Diagram: Virtual Machine, vRAM, Physical RAM)
Troubleshooting Guidance:
1. Physical Issues - Rare and usually catastrophic
2. Configuration Issues - Memory overcommit; incorrect
configuration of shares, reservations or limits
3. Capacity Issues - Physical memory exhaustion
4. Thresholds - Active memory swapping
VMware Monitoring Tools
vCenter Metrics
Swap in rate
Swap out rate
Swap used
ESXTOP Metrics:
Display  Metric    Threshold  Explanation
MEM      MCTLSZ               If larger than 0, the host is forcing VMs to inflate the balloon driver to reclaim memory because the host is overcommitted.
MEM      SWCUR                If larger than 0, the host has swapped memory pages in the past. Possible cause: overcommitment.
MEM      SWR/s                If larger than 0, the host is actively reading from swap (vswp). Possible cause: excessive memory overcommitment.
MEM      SWW/s                If larger than 0, the host is actively writing to swap (vswp). Possible cause: excessive memory overcommitment.
MEM      CACHEUSD             If larger than 0, the host has compressed memory. Possible cause: memory overcommitment.
MEM      ZIP/s                If larger than 0, the host is actively compressing memory. Possible cause: memory overcommitment.
MEM      UNZIP/s              If larger than 0, the host is accessing compressed memory. Possible cause: the host was previously overcommitted on memory.
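Every memory counter above follows the same rule: any value larger than 0 is a finding. A minimal sketch of that rule follows, assuming the esxtop values for one VM or host have been copied into a dictionary. The names MEMORY_CHECKS and memory_findings are hypothetical, and the notes are shortened from the explanations in this section.

```python
# Counter name -> what a non-zero value indicates, per this section.
MEMORY_CHECKS = [
    ("MCTLSZ", "balloon driver inflated; host is reclaiming memory"),
    ("SWCUR", "host has swapped memory pages in the past"),
    ("SWR/s", "host is actively reading from swap"),
    ("SWW/s", "host is actively writing to swap"),
    ("CACHEUSD", "host has compressed memory"),
    ("ZIP/s", "host is actively compressing memory"),
    ("UNZIP/s", "host is accessing compressed memory"),
]


def memory_findings(sample):
    """Return (counter, note) pairs for every counter larger than 0.

    sample -- dict mapping an esxtop memory counter to its observed value;
              counters missing from the dict are treated as 0.
    """
    return [
        (name, note)
        for name, note in MEMORY_CHECKS
        if sample.get(name, 0) > 0
    ]


if __name__ == "__main__":
    for name, note in memory_findings({"MCTLSZ": 512, "SWR/s": 1.5}):
        print(name, "-", note)
```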
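ESXTOP Commands:
esxtop
s 2 (set the refresh interval to 2 seconds)
V (show VM worlds only)
m (memory view)
f (add or remove fields)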
ESXTOP Example:
m - Heavy swapping and ballooning
PowerCLI Commands:
Get-Stat -Entity <VM> -Memory
Get-Stat -Entity <VM> -stat mem.swapoutRate.average -realtime
Get-Stat -Entity <VM> -stat mem.swapinRate.average -realtime
Get-Stat -Entity <VM> -stat mem.vmmemctl.average -realtime
Get-Stat -Entity <Host> -stat mem.swapused.average -realtime
Offline Diagnostics & Systems Management Tools
Storage
(Diagram: Virtual Machine with vmdk and RDM disks, SCSI Controller, Datastore, HBA, Switch, Controller, LUN, Disk)
Troubleshooting Guidance:
1. Physical Issues - A bad cable, a failing switch port or
HBA/NIC, or an incompatible/flawed firmware or device driver
(use LSI Logic Parallel/SAS as appropriate)
VMware Monitoring Tools
vCenter Metrics:
Datastore
Read latency
Write latency
ESXTOP Metrics:
Display  Metric    Threshold  Explanation
DISK     GAVG      20
DISK     DAVG      20
DISK     KAVG
DISK     QUED
DISK     ABRTS/s
DISK     RESETS/s
DISK     CONS/s    20
(Diagram: storage latency stack with layers Guest Filesystem, Guest I/O Drivers, Virtual SCSI, VMkernel Filesystem, Device Queue)
Application Latency
R = Physical Disk "Disk Secs/Transfer"
G = Guest Latency
K = ESX Kernel
D = Device Latency
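The legend maps onto the esxtop latency columns: guest latency (GAVG) is the sum of kernel latency (KAVG) and device latency (DAVG). The sketch below uses that relationship to say which side of the stack contributes most once guest latency is above the GAVG threshold of 20 listed in this section. The function names and the "largest component wins" rule are illustrative assumptions, not a VMware-defined procedure.

```python
def kernel_latency(gavg_ms, davg_ms):
    """Derive KAVG from GAVG = KAVG + DAVG (never below zero)."""
    return max(gavg_ms - davg_ms, 0.0)


def latency_bottleneck(gavg_ms, davg_ms, threshold_ms=20.0):
    """Classify one sample as 'ok', 'device' or 'kernel'.

    threshold_ms -- GAVG limit taken from the metrics table in this section
    ASSUMPTION: the larger of DAVG and KAVG is reported as the bottleneck.
    """
    if gavg_ms <= threshold_ms:
        return "ok"
    kavg_ms = kernel_latency(gavg_ms, davg_ms)
    return "device" if davg_ms >= kavg_ms else "kernel"


if __name__ == "__main__":
    print(latency_bottleneck(12.0, 10.0))  # below the threshold
    print(latency_bottleneck(30.0, 27.0))  # array or fabric side
    print(latency_bottleneck(30.0, 5.0))   # queuing inside the host
```

A "device" result points at the array, fabric or HBA; a "kernel" result points at queuing inside the host, which the QUED column can confirm.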
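ESXTOP Commands (LUN/Datastore):
esxtop
s 2 (set the refresh interval to 2 seconds)
V (show VM worlds only)
u (disk device view)
L 38 (widen the device name column to 38 characters)
f (add or remove fields)
e <devname> (expand/contract a device)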
ESXTOP Examples:
d - Multipathing / Expand adapter to view targets
ESXTOP Examples:
u - Queuing, Disk or Kernel?
ESXTOP Examples:
v - Identify the IO consumer
vscsiStats Command:
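[root@host ~]# cd /usr/lib/vmware/bin
./vscsiStats -l   # list VMs and their world IDs
./vscsiStats -s -w <worldid>   # start collection for one VM
./vscsiStats -w <worldid> -p all -c > /path/vscsistats.csv   # export all histograms as CSV
./vscsiStats -x   # stop collection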
vscsiStats Example:
http://dunnsept.wordpress.com/2010/03/11/new-vscsistats-excel-macro/
vscsiStats histograms:
IO lengths of commands
IO lengths of Read commands
IO lengths of Write commands
distance (in LBNs) between successive commands
distance (in LBNs) between successive Read commands
distance (in LBNs) between successive Write commands
distance (in LBNs) between each command from the closest of previous 16
latency of IOs in Microseconds (us)
latency of Read IOs in Microseconds (us)
latency of Write IOs in Microseconds (us)
number of outstanding IOs when a new IO is issued
number of outstanding Read IOs when a new Read IO is issued
number of outstanding Write IOs when a new Write IO is issued
latency of IO interarrival time in Microseconds (us)
latency of IO interarrival time for Reads in Microseconds (us)
latency of IO interarrival time for Writes in Microseconds (us)
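A histogram is easier to compare across VMs when it is reduced to a percentile. The following is a minimal sketch, assuming the bucket upper limits and their counts have already been pulled out of the exported data into (limit, count) pairs; it does not parse the vscsiStats CSV itself, and the function name is hypothetical. The result is the upper limit of the bucket that contains the requested percentile, so it is an upper bound rather than an exact value.

```python
def histogram_percentile(buckets, pct):
    """Return the bucket limit that contains the pct-th percentile.

    buckets -- list of (upper_limit, count) pairs, sorted by upper_limit
    pct     -- percentile to look up, from 0 to 100
    Returns None when the histogram is empty.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        return None
    target = total * pct / 100.0
    running = 0
    for limit, count in buckets:
        running += count
        if running >= target:
            return limit
    return buckets[-1][0]


if __name__ == "__main__":
    # Made-up latency histogram: limits in microseconds, counts of IOs.
    latency_us = [(100, 10), (500, 60), (1000, 20), (5000, 10)]
    print(histogram_percentile(latency_us, 50))
    print(histogram_percentile(latency_us, 95))
```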
PowerCLI Commands:
Get-Stat -Entity <Host> -Disk
Get-Stat -Entity <Host> -stat disk.totalLatency.average -realtime
Get-Stat -Entity <Host> -stat disk.deviceLatency.average -realtime
Get-Stat -Entity <Host> -stat disk.kernelLatency.average -realtime
Storage Monitoring Tools
PowerCLI Tips:
For a complete list of stat objects: