
vSphere APIs for performance monitoring

London Workshop October 2010

Balaji Parimi, Staff Engineer, Ecosystem Performance, VMware, Inc.
Ravi Soundararajan, Senior Staff Engineer, Performance, VMware, Inc.

Motivation
To debug performance, why deal with this...?

Motivation
When you can deal with this instead?

More motivation
Why look at data like this?

Before memhog: no guest swapping

After memhog, guest swaps, but Host does not!

More motivation
When you can look at it like this?

Even more motivation


Why compare resource pool performance like this?

Even more motivation


When you can compare them like this?

Why?
vSphere gives you awesome, helpful charts
But you don't have to rely solely on these charts
Do you want to learn how to make your own charts?
  Keep watching

Goal
Teach you how to use our APIs for performance monitoring

Agenda
What sorts of stats are useful?
How does vSphere retrieve them?
How can you get these stats and use them yourself?

Useful stats
Basics of performance monitoring in virtual infrastructure
  Find underperforming resources
  Find overcommitted resources
  Identify issues due to resource sharing among VMs

Resources we will look at


CPU Memory Disk Network

Resources that we often look at


CPU Memory Disk Network

CPU basics
[Figure: VM0 to VM3 running on CPU0 to CPU3 of an ESX host; VM4, VM5, and VM6 shown in the Ready and Wait/Idle states]

Run: accumulating used time
Ready: wants to run, no physical CPU available
Wait: blocked on I/O or voluntarily descheduled

Why is my VM slow?
CPU saturated? (cpu.usage.average)
Ready time? (cpu.ready.summation)
Latency to be swapped in? (cpu.swapwait.summation)

CPU saturation

2 vCPUs at 2.2 GHz per CPU: ~4.4 GHz used (look at left y-axis)

Small ready time

Ready time vCPU1: 150 ms
Real-time chart: refresh 20 s
150 ms / 20 s = 0.75% (no big deal)
Right y-axis is relevant

Now, turn on CPU burner on same host

CPU burner ~100% of 1 vCPU

And see what happens to the original VM's ready time

SpecJBB ready time ~2000 ms / 20 s = 10% (P.S. SpecJBB performance dropped by 10%)
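
The two ready-time figures above (150 ms and ~2000 ms per 20 s real-time sample) use the same conversion. A minimal sketch of that arithmetic follows; the function name is hypothetical and the 20 s default is the real-time refresh interval quoted on the slides.

```python
def ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a cpu.ready.summation sample (milliseconds of ready time
    accumulated during one sample) into a percentage of that sample interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return 100.0 * ready_ms / (interval_s * 1000.0)


if __name__ == "__main__":
    print(ready_percent(150))   # 0.75  (small ready time, no big deal)
    print(ready_percent(2000))  # 10.0  (after the CPU burner was started)
```

For historical charts the interval is longer (for example 300 s for past-day data), so the same summation value means a much smaller percentage.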

Latency to load in VM: cpu.swapwait.summation


Sometimes there is a latency to load VM data from disk: cpu swapwait

CPU takes 20s to load in data before VM can run!

CPU issues: Summary


CPU saturated?
High Ready time
  Problematic if it is sustained for long periods
  Sample rule of thumb: > 20% per vCPU, investigate further
  Possible contention for CPU resources among VMs
    Workload variability? Fix with VMotion/DRS
    Resource limits on VMs? Check limits, reservations, and shares
    Actual overcommitment? Fix with VMotion/DRS/more CPUs
High SwapWait time
  Consider setting memory reservation (see next section, Memory)
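
The "sustained" part of the ready-time rule of thumb can be checked mechanically. The sketch below is one possible reading, not a VMware-defined algorithm: it treats "sustained" as a run of consecutive samples above the threshold, and the run length of 3 is an assumption chosen only for illustration.

```python
from typing import Sequence


def sustained_high_ready(ready_ms: Sequence[float],
                         interval_s: float = 20.0,
                         threshold_pct: float = 20.0,
                         min_consecutive: int = 3) -> bool:
    """Return True if ready time stays above threshold_pct for at least
    min_consecutive samples in a row (min_consecutive is an assumed value)."""
    run = 0
    for sample in ready_ms:
        pct = 100.0 * sample / (interval_s * 1000.0)
        run = run + 1 if pct > threshold_pct else 0
        if run >= min_consecutive:
            return True
    return False
```

A single spike resets nothing by itself; only an unbroken run of high samples triggers the check, which matches the slide's point that brief ready time is not a problem.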

Resources that we often look at


CPU Memory Disk Network

Memory
ESX must balance memory usage
  Page sharing to reduce memory footprint of virtual machines
  Ballooning to relieve memory pressure in a graceful way
  Host swapping to relieve memory pressure when ballooning is insufficient
  Compression to relieve memory pressure without host-level swapping
ESX allows overcommitment of memory
  Sum of configured memory sizes of virtual machines can be greater than physical memory if working sets fit
Memory also has limits, shares, and reservations
Host swapping can cause performance degradation

Ballooning, compression, and swapping (1)


Ballooning: memctl driver grabs pages and gives them to ESX
  Guest OS chooses pages to give to memctl (avoids hot pages if possible): either free pages or pages to swap
  Unused pages are given directly to memctl
  Pages to be swapped are first written to swap partition within guest OS and then given to memctl

[Figure: VM1 (with memctl and a swap partition within the guest OS) and VM2 on ESX. Steps: 1. Balloon, 2. Reclaim, 3. Redistribute]

Ballooning, swapping, and compression (2)


Swapping: ESX reclaims pages forcibly
  Guest doesn't pick pages: ESX may inadvertently pick hot pages (possible VM performance implications)
  Pages written to VM swap file

[Figure: VM1 and VM2 on ESX; swap partition (within guest) vs. VSWP file (external to guest). Steps: 1. Force swap, 2. Reclaim, 3. Redistribute]

Ballooning, swapping and compression (3)


Compression: ESX reclaims pages, writes to in-memory cache
  Guest doesn't pick pages: ESX may inadvertently pick hot pages (possible VM performance implications)
  Pages written to in-memory cache: faster than host-level swapping

[Figure: VM1 and VM2 on ESX with a compression cache; swap partition within guest. Steps: 1. Write to compression cache, 2. Give pages to VM2]

Ballooning, swapping, and compression


Bottom line:
  Ballooning may occur even when there is no memory pressure, just to keep memory proportions under control
  Ballooning is preferable to compression and vastly preferable to swapping
    Guest can surrender unused/free pages
    With host swapping, ESX cannot tell which pages are unused or free and may accidentally pick hot pages
    Even if balloon driver has to swap to satisfy the balloon request, guest chooses what to swap
      Can avoid swapping hot pages within guest
    Compression: reading from compression cache is faster than reading from disk

Swapping in Guest != Swapping in Host


DVDstore benchmark: SQL DB benchmark uses lots of memory

About to start memory hogger program in guest

Force Guest swapping: No Host-level swapping

Before memhog: no guest swapping

After memhog, guest swaps, but Host does not!

Viewing Host-level swapping with performance charts

Setup: 2 VMs (one dvdstore, one memhog) competing for host memory
Host swaps out dvdstore VM memory to fulfill memhog VM requests
Host swaps in dvdstore VM memory to fulfill dvdstore VM requests

Using Swap Rate Counters: Remember CPU SwapWait?

cpu.swapwait.summation: CPU is waiting for memory to be swapped in

Absolute Swap Counters

Swapin, swapout (KB) show some activity but hard to detect

And Swap Rate Counters

SwapinRate, SwapoutRate (KBps) show activity much more clearly
Rule of thumb: host swapping > 1 MBps is cause for concern
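
One way to see why the rate counters are easier to read: a cumulative KB counter only shows swapping as a change in slope, while a rate makes each burst visible directly. The sketch below derives a rate from successive cumulative readings and applies the 1 MBps rule of thumb. It assumes the absolute counters are cumulative KB values sampled every 20 s and treats 1 MBps as 1024 KBps; both the helper names and that conversion are assumptions.

```python
from typing import List, Sequence


def swap_rates_kbps(cumulative_kb: Sequence[float],
                    interval_s: float = 20.0) -> List[float]:
    """Turn successive cumulative swap-in (or swap-out) readings in KB
    into per-interval rates in KBps."""
    rates = []
    for prev, cur in zip(cumulative_kb, cumulative_kb[1:]):
        delta = max(cur - prev, 0)  # guard against a counter reset
        rates.append(delta / interval_s)
    return rates


def is_concerning(rates_kbps: Sequence[float],
                  threshold_kbps: float = 1024.0) -> List[bool]:
    """Flag each interval whose swap rate exceeds the rule-of-thumb threshold."""
    return [rate > threshold_kbps for rate in rates_kbps]
```

With readings of 0, 4096, 4096, and 104096 KB, only the last interval (5000 KBps) crosses the threshold, even though the absolute counter rises in the first interval as well.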

Resources that we often look at


CPU Memory Disk Network

ESX storage stack


Different latencies for local disk vs. SAN (caching, switches, etc.)
Queuing within kernel and in hardware
vSphere shows
  Total Command Latency
  Kernel Latency
  Device Latency
  Bandwidth/IOPS

Disk performance problems 101


What should I look for to figure out if disk is an issue?
  Am I getting the IOPS I expect?
  Am I getting the bandwidth (read/write) I expect?
  Are the latencies higher than I expect?
  Where is time being spent?

What are some things I can do?
  Make sure devices are configured properly (caches, queue depths)
  Use multiple adapters and multipathing
  Check networking settings (for iSCSI/NAS)

Another disk example: Slow VM power on


Trying to power on a VM
  Sometimes, powering on VM would take 5 seconds
  Other times, powering on VM would take 5 minutes!
Where to begin?
  Powering on a VM requires disk activity on host
  Check disk metrics for host

Let's look at the vSphere client

Rule of thumb: latency > 20ms is Bad. Here: 1,100ms REALLY BAD!!!

Max disk latencies range from 100 ms to 1100 ms: very high! Why? (counter name: disk.maxTotalLatency.latest)

High disk latency: Mystery solved

Host events: disk has connectivity issues, hence high latencies!

Bottom line: monitor disk latencies; issues may not be related to virtualization!

Resources that we often look at


CPU Memory Disk Network

Network performance problems 101


What should I look for to figure out if network is an issue?
  Am I getting the packet rate that I expect?
  Am I getting the bandwidth (read/write) I expect?
  Is all traffic on one NIC, or spread across many NICs?
  [More advanced, not available through counters]: out-of-order packets?

What are some things I can do?
  Check host networking settings
    Full-duplex/half-duplex
    10Gig network vs. 100Mb network?
    Firewall settings
  Check VM settings: all VMs on proper networks?

Network performance troubleshooting


Customer complains about slow network
  She's running netperf on a GigE link
  She sees only 200 Mbps
  Why? "I bet it's that VMware stuff!!"
  Note to reader: Please don't blame VMware first

Where do we start?

All VMs using same NIC (VM network)

All VMs using VM Network and sharing 1 physical NIC

Where do we begin? Check VM bandwidth


Measure VM Bandwidth (net.transmitted.average)
200 Mb/s Screenshot from the vSphere client

Check Host Bandwidth


Measure Host Bandwidth (net.transmitted.average)
Host sees around 900 Mbps: why is VM at 200 Mbps?

Hmm... are we sharing this NIC with multiple VMs?

All traffic is going through one NIC!


Measure per-physical-NIC traffic

All traffic through one NIC on this host

Hmm... all VM traffic is going through 1 NIC
Let's split the VMs across NICs

Split VMs across multiple NICs. Bingo!
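
The per-physical-NIC check in this walkthrough can be scripted once the per-NIC transmit figures are in hand. The sketch below only does the comparison; the NIC names, sample numbers, and the 80% cutoff are illustrative assumptions, not values from the slides.

```python
from typing import Dict, List


def nic_shares(throughput_by_nic: Dict[str, float]) -> Dict[str, float]:
    """Return each physical NIC's fraction of the host's total traffic.
    Values may be in any unit as long as it is the same for every NIC."""
    total = sum(throughput_by_nic.values())
    if total == 0:
        return {nic: 0.0 for nic in throughput_by_nic}
    return {nic: value / total for nic, value in throughput_by_nic.items()}


def overloaded_nics(throughput_by_nic: Dict[str, float],
                    max_share: float = 0.8) -> List[str]:
    """List NICs carrying more than max_share of all traffic (assumed cutoff)."""
    shares = nic_shares(throughput_by_nic)
    return [nic for nic, share in shares.items() if share > max_share]
```

For a host where one NIC carries everything, `overloaded_nics({"vmnic0": 900, "vmnic1": 0})` returns `["vmnic0"]`; after splitting the VMs evenly it returns an empty list.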

Network issues: Configuration woes

Network adapter set to autonegotiate: 90 Mbps

Network adapter set to full duplex, 100 Mbps: < 0.1 Mbps!
Specific combo of switch and adapter caused this performance degradation!
Lesson: Check specs & configuration!

Agenda
What sorts of stats are useful?
How does vSphere retrieve them?
How can you get these stats and use them yourself?

Stats infrastructure in vSphere


[Figure: ESX hosts running VMs, vCenter Server (vpxd, tomcat), and DB]

1. Collect 20s and 5-min host and VM stats (on each ESX host)
2. Send 5-min stats to vCenter
3. Send 5-min stats to DB
4. Rollups (in DB)

Rollups

Rollups performed in the DB:
1. Past-Day (5-minutes) -> Past-Week
2. Past-Week (30-minutes) -> Past-Month
3. Past-Month (2-hours) -> Past-Year
4. (Past-Year = 1 data point per day)

DB only archives historical data


Real-time (i.e., past hour) NOT archived at DB
Past-day, past-week, etc.: Stats Interval
Stats Levels ONLY APPLY TO HISTORICAL DATA
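
To make the rollup idea concrete: each coarser interval is built by collapsing a fixed number of finer samples (15 real-time 20 s samples per 5 minutes, 6 five-minute samples per 30 minutes, 4 thirty-minute samples per 2 hours, 12 two-hour samples per day). The sketch below shows that collapsing step only. Averaging for "average" counters and summing for "summation" counters is an assumption made for illustration, not a description of the vCenter stored procedures.

```python
from typing import List, Sequence


def rollup(samples: Sequence[float], group_size: int,
           kind: str = "average") -> List[float]:
    """Collapse each full group of group_size samples into one data point.
    A trailing partial group is dropped because its interval is not complete."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if kind not in ("average", "summation"):
        raise ValueError("kind must be 'average' or 'summation'")
    points = []
    for start in range(0, len(samples) - group_size + 1, group_size):
        total = sum(samples[start:start + group_size])
        points.append(total / group_size if kind == "average" else total)
    return points
```

For example, `rollup(five_minute_samples, 6)` would turn past-day data into 30-minute points in the style of the past-week interval.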

Anatomy of a stats query: Past-hour (RealTime) Stats

[Figure: Client, vCenter Server (vpxd, tomcat), DB, and ESX hosts running VMs]

1. Query: client sends request to vCenter Server
2. Get stats from host: vCenter asks the ESX host
3. Response: vCenter returns stats to client

No calls to DB
Note: Same code path for past-day stats within last 30 minutes

Anatomy of a stats query: Archived stats

[Figure: Client, vCenter Server (vpxd, tomcat), DB, and ESX hosts running VMs]

1. Query: client sends request to vCenter Server
2. Get stats: vCenter reads them from the DB
3. Response: vCenter returns stats to client

No calls to ESX host (caveats apply)
Stats Level = Store this stat in the DB

Agenda
What sorts of stats are useful?
How does vSphere retrieve them?
How can you get these stats and use them yourself?

Phew! Ok, How do I get these stats?


You want a chart like this?

PowerCLI: CPU usage for a VM for the last hour:
  $vm = Get-VM -Name Foo
  Get-Stat -Entity $vm -Realtime -MaxSamples 180 -Stat cpu.usagemhz.average

Grab appropriate fields from output, use graphing program, etc.
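
Once the samples are out of PowerCLI (180 samples at 20 s each cover the last hour), the "grab appropriate fields" step is ordinary post-processing. The sketch below assumes the Get-Stat output was saved as CSV with Timestamp, MetricId, Value, and Unit columns; the sample rows are invented for illustration and timestamps are kept as plain strings because their format depends on how the export was made.

```python
import csv
import io
from typing import Dict, List, Tuple

# Hypothetical export; real files would be read from disk instead.
SAMPLE_EXPORT = """Timestamp,MetricId,Value,Unit
10/05/2010 10:00:00,cpu.usagemhz.average,4400,MHz
10/05/2010 10:00:20,cpu.usagemhz.average,4350,MHz
10/05/2010 10:00:40,cpu.usagemhz.average,4410,MHz
"""


def load_series(text: str, metric: str) -> List[Tuple[str, float]]:
    """Return (timestamp, value) pairs for one metric from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["Timestamp"], float(row["Value"]))
            for row in reader if row["MetricId"] == metric]


def summarize(series: List[Tuple[str, float]]) -> Dict[str, float]:
    """Basic statistics to feed a chart title or a quick sanity check."""
    values = [value for _, value in series]
    if not values:
        return {"samples": 0}
    return {"samples": len(values), "min": min(values),
            "max": max(values), "avg": sum(values) / len(values)}
```

The resulting pairs can be handed to any graphing program or plotting library.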

Looks simple... What's going on behind the scenes?


To get stats, this is what is going on FOR EACH GET-STAT CALL:
  Retrieve PerformanceManager
  QueryPerfProviderSummary $vm: says what intervals are supported
  QueryAvailablePerfMetric $vm: describes available metrics
  QueryPerfCounter: verbose description of counters
  Create PerfQuerySpec: query specification to get the stats
  QueryPerf: get stats

Bottom line: The PowerCLI toolkit spares you the details. Easy to use!

PowerCLI is so easy... Why use Java / C#?


PowerCLI is great for scripting
  Stateless
  Hides details
But with Java / C#
  You can squeeze out more performance!
  Much higher scalability

Pseudo code
PowerCLI:
  Get MOREF
  for each Get-Stat {
    QueryAvailablePerfMetric();
    QueryPerfCounter();
    QueryPerfProviderSummary();
    create PerfQuerySpec();
    QueryPerf();
  }

Java:
  Get MOREF
  QueryAvailablePerfMetric();
  QueryPerfCounter();   (perfCounter property of PerformanceManager)
  QueryPerfProviderSummary();
  create PerfQuerySpec();
  for each Get-Stat {
    QueryPerf();
  }
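
The difference between the two columns is how many round trips each stats request costs. The sketch below counts calls against a stand-in object; `CountingPerfManager` and its methods are hypothetical placeholders that mirror the pseudo code's step names, not the real vSphere SDK.

```python
class CountingPerfManager:
    """Stand-in for PerformanceManager that only counts round trips."""

    def __init__(self) -> None:
        self.calls = 0

    def query_available_perf_metric(self, entity: str) -> list:
        self.calls += 1
        return ["cpu.usagemhz.average"]

    def query_perf_counter(self, metric_names: list) -> dict:
        self.calls += 1
        return {name: 101 for name in metric_names}

    def query_perf_provider_summary(self, entity: str) -> dict:
        self.calls += 1
        return {"refresh_rate_s": 20}

    def query_perf(self, spec: dict) -> list:
        self.calls += 1
        return []


def build_spec(pm: CountingPerfManager, entity: str) -> dict:
    """Run the three setup calls and assemble a query spec (a plain dict here)."""
    metrics = pm.query_available_perf_metric(entity)
    counter_ids = pm.query_perf_counter(metrics)
    summary = pm.query_perf_provider_summary(entity)
    return {"entity": entity, "counter_ids": counter_ids,
            "interval_s": summary["refresh_rate_s"]}


def per_call_setup(pm: CountingPerfManager, entity: str, polls: int) -> int:
    """PowerCLI-style column: repeat the setup for every stats request."""
    for _ in range(polls):
        pm.query_perf(build_spec(pm, entity))
    return pm.calls


def hoisted_setup(pm: CountingPerfManager, entity: str, polls: int) -> int:
    """Java-style column: do the setup once, then only call QueryPerf."""
    spec = build_spec(pm, entity)
    for _ in range(polls):
        pm.query_perf(spec)
    return pm.calls
```

For 10 requests the first pattern makes 40 calls and the second makes 13, and the gap widens linearly with the number of requests.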

Performance implications: Need to write scalable scripts!


Entities (cpu.usagemhz.average)   PowerCLI (time in secs)   Java (time in secs)
1 VM                              9.2                       14
6 VMs                             11                        14.5
39 VMs                            101                       16
363 VMs                           2580 (43 minutes)         50
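
Dividing the two timing columns makes the scaling trend explicit. The sketch below uses only the four measurements reported in the table; the helper names are hypothetical.

```python
# (PowerCLI seconds, Java seconds) keyed by number of VMs, from the table.
TIMINGS = {1: (9.2, 14.0), 6: (11.0, 14.5), 39: (101.0, 16.0), 363: (2580.0, 50.0)}


def speedup(vm_count: int) -> float:
    """How many times faster the Java collector was for this inventory size."""
    powercli_s, java_s = TIMINGS[vm_count]
    return powercli_s / java_s


def seconds_per_vm(vm_count: int) -> tuple:
    """Per-VM cost of each approach for this inventory size."""
    powercli_s, java_s = TIMINGS[vm_count]
    return powercli_s / vm_count, java_s / vm_count


if __name__ == "__main__":
    for count in TIMINGS:
        print(count, round(speedup(count), 2))
```

At 1 VM the ratio is below 1 (the Java collector's fixed startup cost dominates), while at 363 VMs it is about 51.6.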

Highly-tuned Java Stats Collector

A naive script that works for small environments may not be suitable for large environments
Java provides opportunities for scalable, ongoing stats collection
Let's examine Java code in more detail

GetPerfStats Main method

Get MOREF
QueryAvailablePerfMetric();
QueryPerfCounter();   (perfCounter)
QueryPerfProviderSummary();
create PerfQuerySpec();
for each Get-Stat {
  QueryPerf();
}

[Callouts on the code screenshot: Get MOREF, Get CounterIds, QueryAvailablePerfMetric, QueryProviderSummary, create PerfQuerySpec, QueryPerf]

GetPerfStats

Get MOREF

Get the entity MOREF

GetPerfStats

Get CounterIds

Get available counterIDs from perfCounter property of PerformanceManager

Map human-readable stat name to counterID (e.g., cpu.usagemhz.average -> 101)
QueryPerf() requires counterID
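
A name-to-ID lookup like the one described here is typically built once and reused. The sketch below uses plain dictionaries as stand-ins for the counter descriptions returned by the server; the field names and every ID other than the slide's 101 example are invented, and real IDs should always be read from the target vCenter rather than hard-coded.

```python
from typing import Dict, List

# Hypothetical counter descriptions (group, name, rollup type, numeric key).
SAMPLE_COUNTERS = [
    {"key": 101, "group": "cpu", "name": "usagemhz", "rollup": "average"},
    {"key": 202, "group": "cpu", "name": "ready", "rollup": "summation"},
    {"key": 303, "group": "mem", "name": "swapinRate", "rollup": "average"},
]


def build_counter_map(counters: List[dict]) -> Dict[str, int]:
    """Map 'group.name.rollup' strings to numeric counter IDs."""
    return {f"{c['group']}.{c['name']}.{c['rollup']}": c["key"] for c in counters}


def counter_ids(names: List[str], counter_map: Dict[str, int]) -> List[int]:
    """Translate stat names into IDs, failing loudly on unknown names."""
    missing = [name for name in names if name not in counter_map]
    if missing:
        raise KeyError(f"unknown counters: {missing}")
    return [counter_map[name] for name in names]
```

Failing on unknown names early avoids sending a query that the server would reject or silently return empty.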

GetPerfStats

QueryPerfProviderSummary

All VMs have same value
All hosts have same value, etc.
Call once for a given entity type and store result
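
"Call once per entity type and store the result" is a small cache keyed by type. In the sketch below the fetch function is injected so the example stays self-contained; `fake_fetch` and the summary field name are placeholders, not the SDK's own.

```python
from typing import Callable, Dict


class ProviderSummaryCache:
    """Remember one provider summary per entity type (VM, host, ...)."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch
        self._by_type: Dict[str, dict] = {}

    def get(self, entity_type: str, entity: str) -> dict:
        # Only the first entity of each type triggers a server call.
        if entity_type not in self._by_type:
            self._by_type[entity_type] = self._fetch(entity)
        return self._by_type[entity_type]


fetch_log = []


def fake_fetch(entity: str) -> dict:
    """Stand-in for the real QueryPerfProviderSummary round trip."""
    fetch_log.append(entity)
    return {"refresh_rate_s": 20}
```

Asking the cache about three VMs and two hosts results in two fetches instead of five.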

GetPerfStats

Create PerfQuerySpec

Use wild card

CSV output format
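
With CSV output the client receives compact strings and does the splitting itself. The sketch below assumes the sample-info string alternates interval and timestamp ("interval,timestamp,interval,timestamp,...") and that each series is a comma-separated list of integer values; check both layouts against the API reference for your SDK version before relying on them. The function names are hypothetical.

```python
from typing import List, Tuple


def parse_sample_info(sample_info_csv: str) -> List[Tuple[int, str]]:
    """Split 'interval,timestamp,...' into (interval_s, timestamp) pairs."""
    parts = sample_info_csv.split(",")
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating interval,timestamp fields")
    return [(int(parts[i]), parts[i + 1]) for i in range(0, len(parts), 2)]


def parse_series(value_csv: str) -> List[int]:
    """Split one metric series into integer samples."""
    return [int(value) for value in value_csv.split(",")]


def join_samples(sample_info_csv: str, value_csv: str) -> List[Tuple[str, int]]:
    """Pair every timestamp with the value recorded for it."""
    info = parse_sample_info(sample_info_csv)
    values = parse_series(value_csv)
    if len(info) != len(values):
        raise ValueError("sample info and values differ in length")
    return [(timestamp, value) for (_, timestamp), value in zip(info, values)]
```

Because timestamps are sent once per entity rather than once per value, the payload stays small when many counters are requested for the same entity.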

GetPerfStats

QueryPerf

So, what is Java / C# buying us?


Avoiding redundant work
More compact return format (CSV vs. objects)
Low-overhead tracking of ongoing inventory changes
Etc.

If we dig deeper, we can optimize even more

Digging deeper: The PerfQuerySpec architecture


To grab counters: QueryPerf(PerfQuerySpec[] querySpec)
PerfQuerySpec: specifies which counters to grab
  Entity (host, VM)
  Format (CSV, normal)
  MetricId
  StartTime
  EndTime
  IntervalID (20s, 300s)
  maxSample

PerfQuerySpec[]: [pQs1, pQs2, pQs3, ...]
  Array of PerfQuerySpec objects pQs1, pQs2, pQs3
  Can grab multiple stats using single QueryPerf call
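
The field list above maps naturally onto a small record type, and the array is just a list of such records. The sketch below is a local model for planning queries, not the SDK class: `QuerySpec` and `specs_for` are hypothetical names, and the entity identifiers are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuerySpec:
    """Local stand-in mirroring the PerfQuerySpec fields listed on the slide."""
    entity: str
    metric_ids: List[str]
    fmt: str = "csv"
    interval_id: int = 20          # seconds: 20 for real-time, 300 for past-day
    max_sample: Optional[int] = None
    start_time: Optional[str] = None
    end_time: Optional[str] = None


def specs_for(entities: List[str], metric_ids: List[str], **options) -> List[QuerySpec]:
    """Build one spec per entity so a single query call can carry them all."""
    return [QuerySpec(entity=entity, metric_ids=list(metric_ids), **options)
            for entity in entities]
```

For example, `specs_for(["vm-1", "vm-2", "host-3"], ["cpu.*"], max_sample=180)` yields three specs that could be submitted together.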

Complexities of QueryPerf
How does vSphere process QueryPerf(querySpec[])?
1. vCenter receives queryPerf request with querySpec[]
2. vCenter takes each querySpec one at a time
3. vCenter gets data for each querySpec before processing next one

Options for querySpec[]:
1. 1 entry: 1 stat or set of stats for a single entity (e.g., all CPU)
     [VM1,cpu.*]
2. Multiple entries. Examples:
     Each entry for a different entity: [VM1,cpu.*] [VM2,cpu.*] [H3,mem.*]
     Each entry for a different stat type, same entity: [VM1,cpu.*] [VM1,net.*] [VM1,mem.*]

Implications of QuerySpec
Format of QuerySpec allows multiple client options
1. Grab each stat one at a time
2. Grab a group of stats per entity at once
3. Grab all stats for all entities at once
4. Grab stats for a subset of entities at once

Some tradeoffs:
1. Network processing (large result sets vs. small result sets)
2. Client aggregation overhead
3. vCenter processing (each QueryPerf handled in a single thread)

What about in-guest stats?


Using VIX APIs:
  Create a script that can get whatever stats you are interested in
  Make the script write the stats to a file
  Copy file from the guest
Session covering this topic: PPC-15, Guest Operations using VMware VIX APIs and Beyond

Back to the Future (1)


Now I know how to convert this (many metrics on different charts)

Back to the Future (2)


To This (CPU, Memory, Disk, and Network on the same chart)

Combining metrics across VMs & Hosts

Comparing resource pools

Use VIX API + vSphere counters to get RP performance data

What about VMs running on a Host?


Memory usage of VMs on a Host

Summary, Part 1: Some useful Counters to monitor


Resource | Metric | Host or VM? | Description
CPU | Usage | Both | CPU % used
CPU | Ready | VM | Ready to run, but a limit applies or no physical CPU is available
CPU | SwapWait | VM | CPU time spent waiting for host-level swap-in
Memory | Swapin, swapinrate | Both | Memory ESX host swaps in from disk (per VM, or cumulative over host)
Memory | Swapout, swapoutrate | Both | Memory ESX host swaps out to disk (per VM, or cumulative over host)
Disk | commands | Both | Operations done during stats refresh interval
Disk | totalLatency | Host | End-to-end disk latency (available for reads & writes)
Disk | Usage | Both | Disk bandwidth utilized (available for reads & writes)
Network | Packets received, transmitted | Both | Operations done during stats refresh interval
Network | Usage | Both | Network bandwidth used (available for reads & writes)

For completeness: VM memory metrics


Metric | Description
Memory Active (KB) | Physical pages touched recently by a virtual machine
Memory Usage (%) | Active memory / configured memory
Memory Consumed (KB) | Machine memory mapped to a virtual machine, including its portion of shared pages. Does NOT include overhead memory.
Memory Granted (KB) | VM physical pages backed by machine memory. May be less than configured memory. Includes shared pages. Does NOT include overhead memory.
Memory Shared (KB) | Physical pages shared with other virtual machines
Memory Balloon (KB) | Physical memory ballooned from a virtual machine
Memory Swapped (KB) | Physical memory in swap file (approx. swap out - swap in). Swap out and swap in are cumulative.
Overhead Memory (KB) | Machine pages used for virtualization

Host memory metrics


Metric | Description
Memory Active (KB) | Physical pages touched recently by the host
Memory Usage (%)* | Active memory / configured memory
Memory Consumed (KB) | Total host physical memory - free memory on host. Includes overhead and Service Console memory.
Memory Granted (KB) | Sum of memory granted to all running virtual machines. Does NOT include overhead memory.
Memory Shared (KB) | Sum of memory shared for all running VMs
Shared common (KB) | Total machine pages used by shared pages
Memory Balloon (KB) | Machine pages ballooned from virtual machines
Memory Swap Used (KB) | Physical memory in swap files (approx. swap out - swap in). Swap out and swap in are cumulative.
Overhead Memory (KB) | Machine pages used for virtualization

*For a cluster, mem.usage.average = (consumed + overhead)/total mem
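
The cluster footnote is a one-line formula. The sketch below writes it out as a percentage, assuming all three inputs are in the same unit (KB here); the function name and sample figures are illustrative only.

```python
def cluster_mem_usage_pct(consumed_kb: float, overhead_kb: float,
                          total_kb: float) -> float:
    """mem.usage.average for a cluster: (consumed + overhead) / total memory,
    expressed as a percentage."""
    if total_kb <= 0:
        raise ValueError("total_kb must be positive")
    return 100.0 * (consumed_kb + overhead_kb) / total_kb
```

For instance, 6,000,000 KB consumed plus 400,000 KB overhead on a 16,000,000 KB cluster gives 40%.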

Summary, Part 2: Cheat sheet


Rules of thumb
  Ready time > 20% sustained is undesirable
  Host-level swapping is bad, > 1 MBps is especially bad
  Disk latencies > 20 ms: BAD
    Use IOmeter to assess disk bandwidth and latency
  Network
    Run netperf to get network baselines

Summary, Part 3: SDK/API Tips and tricks


Collect static data once
  CounterIDs, metricIDs, MOREFs, etc.
  Use Views to keep this data up to date
Reuse PerfQuerySpec as much as possible
Use CSV format
  Reduces serialization cost and the size of metadata
Choose metrics and query intervals carefully
  Query the real-time stats at a slower rate than the refresh rate
  Choose correct stats levels
Use parallelism (multi-threaded clients)
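
Querying real-time stats less often than the 20 s refresh rate means each poll has to ask for several samples at once. The sketch below computes how many; the function name is hypothetical and the 20 s default is the real-time refresh interval used throughout the slides.

```python
import math


def samples_per_poll(poll_interval_s: int, refresh_rate_s: int = 20) -> int:
    """Number of samples to request (maxSample) so that polling every
    poll_interval_s seconds still covers every refresh-rate sample."""
    if refresh_rate_s <= 0:
        raise ValueError("refresh_rate_s must be positive")
    if poll_interval_s < refresh_rate_s:
        raise ValueError("polling faster than the refresh rate gains nothing")
    return math.ceil(poll_interval_s / refresh_rate_s)
```

Polling every 5 minutes needs 15 samples per call, and polling once an hour needs 180, the same figure used in the PowerCLI example earlier in the deck.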

Conclusion
vSphere gives a bunch of awesome charts
If you want to see the data differently, use the API
PowerCLI is great for simple scripts
When designing for scalability, consider Java / C#

Resources
Developer Support
  Dedicated support for your organization when building solutions using vSphere APIs, PowerCLI, vSphere Web Services SDKs and many more VMware SDKs
  http://vmware.com/go/sdksupport
PowerCLI Training
  2-day instructor-led training, 40% lecture, 60% lab
  http://vmware.com/go/vsphereautomation
VMware Developer Community
  SDK downloads, documentation, sample code, forums, blogs
  http://developer.vmware.com
Technology Alliance Partner (TAP) Program
  Updated partner benefits
  http://www.vmware.com/partners/alliances/programs/

Disclaimer
This session may contain product features that are currently under development. This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined.

These features are representative of feature areas under development. Feature commitments are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery.

Backup slides

What about VMs across resource pools?

