vSphere Troubleshooting

vSphere Performance Monitoring and Troubleshooting
Overview
What?
CPU, Memory, Disk, Network
How?
Use available tools and a systematic methodology
Why?
Need to build confidence in virtualizing critical and
high demand applications

Top Issues
Top Issues:
Storage "performance capacity" oversubscription
Memory oversubscription
SMP overuse
Firmware & driver issues

What tools do we have at our disposal?
Top tools for information collection:

vCenter - Performance charts and alarms
Guest OS* - Task Manager/Resource Monitor and PerfMon
ESX Host - esxtop and vscsiStats
vSphere PowerCLI
*Guest based monitoring is subject to inaccuracy

Prepare vCenter Settings


Prepare custom vCenter alerts:

Host Console Swap In Rate: 512 KBps Warning, 1024 KBps Alert
Host Console Swap Out Rate: 512 KBps Warning, 1024 KBps Alert
VM CPU Ready: 1000 ms Warning, 2000 ms Alert
VM Disk Latency: 20 ms Warning, 50 ms Alert
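
The alert levels above can also be applied to values collected outside vCenter. The following is a minimal sketch, not a vCenter feature: the dictionary keys and the function name are hypothetical, and it assumes a value equal to a threshold already triggers that level.

```python
# Warning/alert pairs copied from the custom alert list above.
# The key names are hypothetical labels, not vCenter identifiers.
ALERT_THRESHOLDS = {
    "host_swap_in_kbps": (512, 1024),
    "host_swap_out_kbps": (512, 1024),
    "vm_cpu_ready_ms": (1000, 2000),
    "vm_disk_latency_ms": (20, 50),
}


def classify(metric, value):
    """Return 'ok', 'warning' or 'alert' for one sampled value."""
    warning, alert = ALERT_THRESHOLDS[metric]
    if value >= alert:  # assumption: reaching the threshold counts
        return "alert"
    if value >= warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(classify("vm_cpu_ready_ms", 1500))
    print(classify("vm_disk_latency_ms", 55))
```

Keeping the thresholds in one table makes it easy to tune them per environment without touching the comparison logic.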



Prepare esxtop
ESXTOP realtime monitoring:

esxtop (run command from SSH or tech-support mode)
s 2 (refresh view every 2 seconds)
V (View VMs only)
h (for quick in-tool command reference)
Batch Mode for a 5 minute capture of all stats:
esxtop -b -a -d 2 -n 150 > esxtop_capture.csv
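
The batch command above uses a 2 second delay and 150 iterations, which gives the 5 minute capture. A minimal sketch of that arithmetic, plus a helper that pulls matching columns out of the resulting CSV, might look like this. The column header layout in SAMPLE (host, group, counter separated by backslashes) and the names esx01 and vm01 are assumptions for illustration; check the header row of a real capture before relying on them.

```python
import csv
import io


def samples_for(duration_s, delay_s):
    """Iterations (-n) needed to cover duration_s with a delay (-d) of delay_s seconds."""
    return -(-duration_s // delay_s)  # ceiling division


def columns_matching(csv_text, needle):
    """Return {header: [float, ...]} for every column whose header contains needle."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    picked = [i for i, name in enumerate(header) if needle in name]
    series = {header[i]: [] for i in picked}
    for row in reader:
        for i in picked:
            series[header[i]].append(float(row[i]))
    return series


# Hypothetical two-sample capture; a real "-a" capture has thousands of columns.
SAMPLE = (
    '"(PDH-CSV 4.0) (UTC)(0)","\\\\esx01\\Group Cpu(1234:vm01)\\% Ready"\n'
    '"01/01/2011 10:00:00","4.50"\n'
    '"01/01/2011 10:00:02","12.25"\n'
)

if __name__ == "__main__":
    print(samples_for(300, 2))
    for name, values in columns_matching(SAMPLE, "% Ready").items():
        print(name, max(values))
```

Filtering by a substring of the header keeps the helper independent of how many VMs or devices the capture contains.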

Prepare PowerCLI
Run PowerCLI:
Tip: Run as Administrator
Set-ExecutionPolicy RemoteSigned
Connect-VIServer -Server <host> -Protocol https -User <user> -Password <pass>
<host> can be the IP address or name of an ESX server or vCenter
Get-VM
Get-Stat -Entity <VM> -Common -Realtime

Where do we get started?

Network Overview
[Diagram: Virtual Machine -> vNIC -> vSwitch or dvSwitch -> pNIC -> Core Switch]

Network
Troubleshooting Guidance:
1. Physical Issues - A bad cable, a failing switch port or NIC, or
an incompatible/flawed firmware or device driver (use
VMXNET3 whenever possible)
2. Configuration Issues - Inconsistent configuration of vSwitches,
Port Groups, or upstream VLAN trunks
3. Capacity Issues - Too many VMs on a single NIC; inadequate
switch backplane or uplink capacity; sharing unmanaged
network infrastructure for storage and data
4. Thresholds - Bandwidth saturation, dropped packets

Network: What can we see?
[Diagram labels: Virtual Machine, vNIC, dvSwitch (only), pNIC, Core Switch; coverage: Systems Management Tools, VMware Monitoring Tools]

Network
vCenter Metrics:
Receive packets dropped
Transmit packets dropped

Network
ESXTOP Metrics:
Display   Metric   Threshold   Explanation
NETWORK   %DRPTX   -           Dropped packets transmitted, hardware overworked. Possible cause: very high network utilization
NETWORK   %DRPRX   -           Dropped packets received, hardware overworked. Possible cause: very high network utilization

Network
ESXTOP Commands:
esxtop
s 2 (refresh view every 2 seconds)
n (network view)
f (add or remove displayed fields)

Network
ESXTOP Example:

Network
PowerCLI Commands:
Get-Stat -Entity <Host> -Network -Realtime
Get-Stat -Entity <Host> -stat net.droppedRx.summation
Get-Stat -Entity <Host> -stat net.droppedTx.summation
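
The dropped-packet counters are summations, so they are easier to judge as a share of traffic. A minimal sketch, assuming you also collect a delivered-packet count for the same interval (for example from net.packetsRx.summation) and that the delivered count does not already include the drops:

```python
def drop_percent(dropped, delivered):
    """Dropped packets as a percentage of all packets seen in the interval.

    Assumption: 'delivered' excludes the dropped packets, so the total
    is the sum of both counters.
    """
    total = dropped + delivered
    if total == 0:
        return 0.0  # idle interval, nothing to report
    return 100.0 * dropped / total


if __name__ == "__main__":
    # Hypothetical counters for one sample interval.
    print(drop_percent(dropped=5, delivered=995))
```

Any sustained non-zero result is worth comparing against the physical switch port counters for the same uplink.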

Network: What can't we see?
[Diagram labels: Virtual Machine, vNIC, Cisco 1000v only, pNIC, Core Switch; coverage: Network Monitoring Tools]

Network
Possible resources for external monitoring:

Native Telnet/SSH/HTTP-based interface
counters and stats
Third-party SNMP, NetFlow and ICMP tools

CPU Overview
[Diagram: Virtual Machine vCPU -> Physical CPU]

CPU
1. Physical Issues - Rare and always catastrophic (i.e. obvious)
2. Configuration Issues - Too many / too few vCPUs per VM;
SMP/HAL mismatch; incorrect CPU affinity settings
3. Capacity Issues - CPU saturation at the guest or host level;
CPU starvation due to high IO or other system level ops
4. Thresholds - Waiting for CPU cycles (due to co-scheduling,
swapping, high IO)

CPU: What can we see?
[Diagram: Virtual Machine vCPU -> Physical CPU]

CPU
vCenter Metrics:
Host/Guest Saturation
Stacked Graph (per VM)
Usage

CPU
vCenter Metrics:
Guest
Ready (realtime chart: value in ms / 200 = n%, because each sample covers 20 seconds)
Swap Wait
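
vCenter reports CPU Ready as milliseconds accumulated over the sample interval, so it has to be divided by the interval length to get a percentage. A minimal sketch, assuming the realtime chart's 20 second interval as the default; the function name is hypothetical, and for a multi-vCPU VM the result is the total across its vCPUs unless you query a single instance.

```python
REALTIME_INTERVAL_S = 20  # realtime charts sample every 20 seconds


def cpu_ready_percent(ready_ms, interval_s=REALTIME_INTERVAL_S):
    """Convert a cpu.ready.summation value (ms) into percent of the interval."""
    return ready_ms * 100.0 / (interval_s * 1000.0)


if __name__ == "__main__":
    for ms in (1000, 2000):
        print(ms, "ms ->", cpu_ready_percent(ms), "%")
```

With these defaults, 1000 ms and 2000 ms of ready time work out to 5% and 10% of a 20 second sample.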

CPU
ESXTOP Metrics:
Display   Metric   Threshold   Explanation
CPU       %RDY     10          Overprovisioning of vCPUs, excessive usage of vSMP, or a limit has been set (check %MLMTD).
CPU       %CSTP    -           Excessive usage of vSMP. Decrease the number of vCPUs for this particular VM. This should lead to increased scheduling opportunities.
CPU       %SYS     20          The percentage of time spent by system services on behalf of the world. Most likely caused by a high-IO VM. Check other metrics and the VM for a possible root cause.
CPU       %MLMTD   -           The percentage of time the vCPU was ready to run but deliberately wasn't scheduled because that would violate the CPU limit settings. If larger than 0, the world is being throttled due to the limit on CPU.
CPU       %SWPWT   -           VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment.
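
The %RDY and %MLMTD explanations above combine into a simple first check: high ready time with a non-zero %MLMTD points at a CPU limit, otherwise at contention. A minimal sketch of that reasoning, with a hypothetical function name and the threshold of 10 taken from the table; it is a triage aid, not a full diagnosis.

```python
def explain_ready(rdy, mlmtd, rdy_threshold=10.0):
    """Classify a %RDY reading using %MLMTD as the tie-breaker."""
    if rdy <= rdy_threshold:
        return "ok"
    if mlmtd > 0:
        # The world was ready but held back by its CPU limit.
        return "limit"
    # No limit involved: too many vCPUs competing for the physical CPUs.
    return "contention"


if __name__ == "__main__":
    print(explain_ready(rdy=14.2, mlmtd=9.8))
    print(explain_ready(rdy=14.2, mlmtd=0.0))
    print(explain_ready(rdy=3.1, mlmtd=0.0))
```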

CPU
ESXTOP Commands:
esxtop
s 2 (refresh view every 2 seconds)
V (view VMs only)
c (CPU view)
e GID (expand/contract a VM world)

CPU
ESXTOP Example:
Excessive vCPUs

CPU
ESXTOP Example:
Now with fewer vCPUs

CPU
ESXTOP Example:
SMP impacting multiple VMs

CPU
PowerCLI Example
Get-Stat -Entity <VM> -Cpu
Get-Stat -Entity <VM> -stat cpu.ready.summation -realtime
Very cool script code at:
http://www.peetersonline.nl/index.php/vmware/examine-vmware-cpu-ready-times-with-powershell/

CPU: Not much else to see
[Diagram: Virtual Machine vCPU -> Physical CPU; coverage: Offline Diagnostics & Systems Management Tools]

CPU
Possible resources for external monitoring:

Vendor specific systems management tools,
MS System Center, etc.
http://www.peetersonline.nl/index.php/vmware/examine-vmware-cpu-ready-times-with-powershell/

Memory Overview
[Diagram: Virtual Machine vRAM -> Physical RAM]

Memory
1. Physical Issues - Rare and usually catastrophic
2. Configuration Issues - Memory overcommit; incorrect
configuration of shares, reservations or limits
3. Capacity Issues - Physical memory exhaustion
4. Thresholds - Active memory swapping

Memory: What can we see?
[Diagram: Virtual Machine vRAM -> Physical RAM]

Memory
vCenter Metrics
Swap in rate
Swap out rate
Swap used

Memory
ESXTOP Metrics:
Display   Metric     Threshold   Explanation
MEM       MCTLSZ     -           If larger than 0, the host is forcing VMs to inflate the balloon driver to reclaim memory because the host is overcommitted.
MEM       SWCUR      -           If larger than 0, the host has swapped memory pages in the past. Possible cause: overcommitment.
MEM       SWR/s      -           If larger than 0, the host is actively reading from swap (vswp). Possible cause: excessive memory overcommitment.
MEM       SWW/s      -           If larger than 0, the host is actively writing to swap (vswp). Possible cause: excessive memory overcommitment.
MEM       CACHEUSD   -           If larger than 0, the host has compressed memory. Possible cause: memory overcommitment.
MEM       ZIP/s      -           If larger than 0, the host is actively compressing memory. Possible cause: memory overcommitment.
MEM       UNZIP/s    -           If larger than 0, the host is accessing compressed memory. Possible cause: the host was previously overcommitted on memory.
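
Every memory counter above follows the same "larger than 0" rule, so one sample can be screened in a single pass. A minimal sketch: the counter names come from the table, while the dictionary layout, the shortened messages and the function name are assumptions for illustration.

```python
# Counter name -> short reading of what a non-zero value means (from the table above).
MEMORY_SIGNALS = {
    "MCTLSZ": "balloon driver inflated, host is reclaiming memory",
    "SWCUR": "pages were swapped at some point in the past",
    "SWR/s": "actively reading from swap",
    "SWW/s": "actively writing to swap",
    "CACHEUSD": "host holds compressed memory",
    "ZIP/s": "actively compressing memory",
    "UNZIP/s": "accessing compressed memory",
}


def memory_findings(counters):
    """Return (name, meaning) for every known counter that is larger than 0."""
    return [
        (name, meaning)
        for name, meaning in MEMORY_SIGNALS.items()
        if counters.get(name, 0) > 0
    ]


if __name__ == "__main__":
    # Hypothetical values for one VM.
    sample = {"MCTLSZ": 512.0, "SWCUR": 10.0, "SWR/s": 0.0, "SWW/s": 0.0}
    for name, meaning in memory_findings(sample):
        print(name, "-", meaning)
```

Counters missing from the sample are treated as zero, so the same function works on partial field selections.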

Memory
ESXTOP Commands:
esxtop
s 2 (refresh view every 2 seconds)
V (view VMs only)
m (memory view)
f (add or remove displayed fields)

Memory
ESXTOP Example:
m - Heavy swapping and ballooning

Memory
PowerCLI Commands:
Get-Stat -Entity <VM> -Memory
Get-Stat -Entity <VM> -stat mem.swapoutRate.average -realtime
Get-Stat -Entity <VM> -stat mem.swapinRate.average -realtime
Get-Stat -Entity <VM> -stat mem.vmmemctl.average -realtime
Get-Stat -Entity <Host> -stat mem.swapused.average -realtime

Memory: The occasional DIMM failure
[Diagram: Virtual Machine vRAM -> Physical RAM; coverage: Offline Diagnostics & Systems Management Tools]

Memory
Possible external monitoring options:

Vendor specific systems management tools, MS
System Center, etc.
Don't forget vCenter Hardware Status reporting

Storage Overview
[Diagram labels: Virtual Machine, SCSI Controller, vmdk (x3), RDM, Datastore, HBA, Switch, Controller, LUN, Disk]

Storage
1. Physical Issues - A bad cable, a failing switch port or
HBA/NIC, or an incompatible/flawed firmware or device driver
(use LSI Logic Parallel/SAS as appropriate)
2. Configuration Issues - Inconsistent or incorrect configuration
of LUN masking, zoning, or multi-pathing; inappropriate
resource provisioning; aligning queue depth with storage type
3. Capacity Issues - Too many VMs or VMDKs on a LUN; too
much IO load for an array or RAID group
4. Thresholds - Latency and queuing

Storage: What can we see?
[Diagram labels: Virtual Machine, SCSI Controller, vmdk (x3), RDM, Datastore, HBA, Switch, Controller, LUN, Disk]

Storage
vCenter Metrics:
Datastore
Read latency
Write latency

Storage
ESXTOP Metrics:
Display   Metric     Threshold   Explanation
DISK      GAVG       20          Look at DAVG and KAVG, as the sum of both is GAVG.
DISK      DAVG       20          Disk latency most likely to be caused by the array.
DISK      KAVG       -           Disk latency caused by the VMkernel; high KAVG usually means queuing. Check QUED.
DISK      QUED       -           Queue maxed out. Possibly queue depth set too low. Check with the array vendor for the optimal queue depth value.
DISK      ABRTS/s    -           Aborts issued by the guest (VM) because storage is not responding. Can be caused by failed paths.
DISK      RESETS/s   -           The number of commands reset per second.
DISK      CONS/s     20          SCSI reservation conflicts per second. Can be caused by too many VMDKs on a datastore.
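
Because GAVG is the sum of DAVG and KAVG, the two parts show where high guest latency comes from. A minimal sketch of that decomposition: the function name is hypothetical, the 20 ms limit is the GAVG threshold from the table, and "larger contributor wins" is a simplifying assumption rather than a rule from the tools.

```python
def diagnose_latency(davg_ms, kavg_ms, threshold_ms=20.0):
    """Return (gavg_ms, verdict) for one device sample.

    verdict is 'ok', 'array' (device latency dominates) or
    'kernel' (VMkernel latency dominates, usually queuing: check QUED).
    """
    gavg_ms = davg_ms + kavg_ms  # guest latency = device + kernel latency
    if gavg_ms <= threshold_ms:
        return gavg_ms, "ok"
    if davg_ms >= kavg_ms:
        return gavg_ms, "array"
    return gavg_ms, "kernel"


if __name__ == "__main__":
    # Hypothetical readings in milliseconds.
    print(diagnose_latency(davg_ms=25.0, kavg_ms=1.0))
    print(diagnose_latency(davg_ms=5.0, kavg_ms=18.0))
```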

Storage
[Diagram: storage latency stack]
Guest layers: Application, Filesystem, I/O Drivers
VMkernel layers: Virtual SCSI, Filesystem, Device Queue
Legend: R = Physical Disk "Disk Secs/Transfer" (application latency), G = Guest Latency, K = ESX Kernel, D = Device Latency

Storage
ESXTOP Commands (HBA/LUN):

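esxtop
s 2 (refresh view every 2 seconds)
V (view VMs only)
d (disk adapter view)
f (add or remove displayed fields)
e vmhba# (expand/contract an adapter)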

Storage
ESXTOP Commands(LUN/Datastore):
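esxtop
s 2 (refresh view every 2 seconds)
V (view VMs only)
u (disk device view)
L 38 (set the device name column width to 38 characters)
f (add or remove displayed fields)
e <devname> (expand/contract a device)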

Storage
ESXTOP Commands (VM/VMDK):

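esxtop
s 2 (refresh view every 2 seconds)
V (view VMs only)
v (disk VM view)
f (add or remove displayed fields)
e GID (expand/contract a VM)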

Storage
ESXTOP Examples:
d - Multipathing / Expand adapter to view targets

Storage
ESXTOP Examples:
u - Queuing, Disk or Kernel?

Storage
ESXTOP Examples:
v - Identify the IO consumer

Storage
vscsiStats Command:
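[root@host ~]# cd /usr/lib/vmware/bin
./vscsiStats -l                                              # list VMs and their world IDs
./vscsiStats -s -w <worldid>                                 # start collection for one VM
./vscsiStats -w <worldid> -p all -c > /path/vscsistats.csv   # print all histograms, comma delimited
./vscsiStats -x                                              # stop collection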

Storage
vscsiStats Example:

Storage
vscsiStats Example:

Storage
vscsiStats Example:
http://dunnsept.wordpress.com/2010/03/11/new-vscsistats-excel-macro/

Storage
vscsiStats histograms:
IO lengths of commands
IO lengths of Read commands
IO lengths of Write commands
distance (in LBNs) between successive commands
distance (in LBNs) between successive Read commands
distance (in LBNs) between successive Write commands
distance (in LBNs) between each command from the closest of previous 16
latency of IOs in Microseconds (us)
latency of Read IOs in Microseconds (us)
latency of Write IOs in Microseconds (us)
number of outstanding IOs when a new IO is issued
number of outstanding Read IOs when a new Read IO is issued
number of outstanding Write IOs when a new Write IO is issued
latency of IO interarrival time in Microseconds (us)
latency of IO interarrival time for Reads in Microseconds (us)
latency of IO interarrival time for Writes in Microseconds (us)
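
Once a latency histogram has been exported, a quick question to ask is what share of IOs fell into the slow buckets. The following is a minimal sketch that assumes the histogram has already been reduced to (upper bound in microseconds, count) pairs. That structure, the function name and the sample numbers are hypothetical, and parsing the vscsiStats output into this form is left out.

```python
def share_slower_than(buckets, limit_us):
    """Fraction of IOs in buckets whose upper bound exceeds limit_us.

    buckets: list of (upper_bound_us, count) pairs.
    This is an upper estimate: the first bucket above the limit may also
    contain IOs that completed just under it.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        return 0.0
    slow = sum(count for bound, count in buckets if bound > limit_us)
    return slow / total


if __name__ == "__main__":
    # Hypothetical latency histogram for one virtual disk.
    histogram = [(1000, 50), (15000, 30), (30000, 15), (100000, 5)]
    print(share_slower_than(histogram, limit_us=20000))
```

The 20000 microsecond limit in the usage example mirrors the 20 ms latency threshold used elsewhere in this deck.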

Storage
PowerCLI Commands:
Get-Stat -Entity <Host> -Disk
Get-Stat -Entity <Host> -Stat disk.totalLatency.average -Realtime
Get-Stat -Entity <Host> -Stat disk.deviceLatency.average -Realtime
Get-Stat -Entity <Host> -Stat disk.kernelLatency.average -Realtime

Storage: What can't we see?
[Diagram labels: Virtual Machine, SCSI Controller, vmdk (x3), RDM, Datastore, HBA, Switch, Controller, LUN, Disk; coverage: Storage Monitoring Tools]

Storage: More of what we can't see
[Diagram labels: Virtual Machine, SCSI Controller, vmdk (x3), RDM, Datastore, HBA, Switch, Controller, LUN, Disk; coverage: Network/Fabric Monitoring Tools]

Storage
Possible external monitoring solutions:

Vendor specific SAN and fabric/network tools,
native Telnet/SSH/HTTP-based tools for most
networks, third-party SNMP-based tools

Working with PowerCLI
PowerCLI Tips:
For a complete list of stat objects:
Get-StatType -Entity <Host/VM>

Pipe the outputs to a file:
Get-Stat -Entity <Host/VM> -Stat <stat> -Realtime | ft -autosize > c:\temp\<filename>.csv
Import the file into a spreadsheet using fixed-width parameters
(ft writes aligned columns, not comma-separated values)
Build pretty graphs

Working with PowerCLI

Way More Information
ESXTOP / vscsiStats / PowerCLI:

http://www.yellow-bricks.com/esxtop/ Special thanks to Duncan Epping!
http://communities.vmware.com/docs/DOC-3930
http://www.vmware.com/support/developer/PowerCLI/PowerCLI41/html/Get-Stat.html
http://www.lucd.info/2009/12/30/powercli-vsphere-statistics-part-1-the-basics/
http://simongreaves.co.uk/blog/esxtop-guide
http://dunnsept.wordpress.com/2010/03/11/new-vscsistats-excel-macro/

Easy button?
What is the problem with these tools?

Limited alerting mechanisms, no collection
automation or historical data for comparison,
and no correlation of events!
vCenter Operations Standard / Enterprise
