
Performance Report

VMware View linked clone performance


on Sun’s Unified Storage

Author: Erik Zandboer


Date: 02-04-2010
Version 1.00
Table of contents

1 Management Summary ................................................................................................................. 6


1.1 Introduction ........................................................................................................................ 6
1.2 Objectives ........................................................................................................................... 6
1.3 Results ................................................................................................................................ 6
2 Initial objective ............................................................................................................................. 7
2.1 VMware View ....................................................................................................................... 7
2.2 Storage requirements .......................................................................................................... 7
3 Technical overview of the solutions .............................................................................................. 8
3.1 VMware View linked cloning ................................................................................................ 8
3.2 Sun Unified Storage ............................................................................................................. 8
3.3 Linked cloning technology combined with Unified Storage ................................................... 9
4 Performance test setup............................................................................................................... 10
4.1 VMware ESX setup ............................................................................................................. 10
4.2 VMware View setup ........................................................................................................... 11
4.3 Windows XP vDesktop setup .............................................................................................. 11
4.4 Unified Storage setup ........................................................................................................ 12
5 Tests performed ......................................................................................................................... 13
5.1 Test 1: 1500 idle vDesktops .............................................................................................. 13
5.2 Test 2: User load simulated linked clone desktops ............................................................. 13
5.3 Test 2a: Rebooting 100 vDesktops in parallel .................................................................... 13
5.4 Test 2b: Recovering all vDesktops after storage appliance reboot ...................................... 13
5.5 Test 3: User load simulated full clone desktops ................................................................. 14
6 Test results ................................................................................................................................ 15
6.1 Test Results 1: 1500 idle vDesktops .................................................................................. 15
6.1.1 Measured Bandwidth and IOP sizes ................................................................................ 16
6.1.2 Caching in the ARC and L2ARC ...................................................................................... 20
6.1.3 I/O Latency ................................................................................................................... 22
6.2 Test Results 2: User load simulated linked clone desktops ................................................ 24
6.2.1 Deploying the initial 500 user load-simulated vDesktops ............................................... 25
6.2.2 Impact of 500 vDesktop deployment on VMware ESX ..................................................... 31
6.2.3 Impact of 500 vDesktop deployment on VMware vCenter and View ................................ 34
6.2.4 Deploying vDesktops beyond 500 .................................................................................. 36
6.2.5 Performance figures at 1300 vDesktops ......................................................................... 40
6.2.6 Extrapolating performance figures ................................................................................. 47
6.3 Test Results 2a: Rebooting 100 vDesktops ........................................................................ 54
6.4 Test Results 2b: Recovering all vDesktops after storage appliance reboot........................... 58
6.5 Test Results 3: User load simulated full clone desktops ..................................................... 62

7 Conclusions ............................................................................................................................... 65
7.1 Conclusions on scaling VMware ESX ................................................................................... 65
7.2 Conclusions on scaling networking between ESX and Unified Storage ................................. 66
7.3 Conclusions on scaling Unified Storage CPU power ............................................................ 67
7.4 Conclusions on scaling Unified Storage Memory and L2ARC ............................................... 68
7.5 Conclusions on scaling Unified Storage LogZilla SSDs ........................................................ 68
7.6 Conclusions on scaling Unified Storage SATA storage ........................................................ 69
8 Conclusions in numbers ............................................................................................................. 70
9 References ................................................................................................................................. 72
Appendix 1: Hardware test setup ...................................................................................................... 73
Appendix 2: Table of derived constants ............................................................................................ 74

People involved

Name Company Responsibility E-Mail

Erik Zandboer Dataman B.V. Sr. Technical Consultant erik.zandboer@dataman.nl

Simon Huizenga Dataman B.V. Technical Consultant simon.huizenga@dataman.nl

Kees Pleeging Sun Project leader kees.pleeging@sun.com

Cor Beumer Sun Storage Solution Architect cor.beumer@sun.com

Version control

Version Date Status Description

0.01 11-02-2010 Initial draft Initial draft for internal (Dataman / Sun) review

0.02 12-03-2010 Final draft Addressed minor review comments; added conclusions and derived
constants

1.0 02-04-2010 Release Addressed final minor comments, including those on the items added in 0.02

Abbreviations and definitions

Abbreviation Description

VM Virtual Machine. Virtualized workload on a virtualization platform (such as VMware ESX)

GbE Gigabit Ethernet. Physical network connection at Gigabit speed.

IOPS I/O operations Per Second. The number of both read- and write commands from and to a
storage device per second. Take note that the ratio between reads and writes cannot be
extracted from these values, only the sum of the two. Also see ROPS and WOPS.

OPS Operations Per Second. More general term, and closely related to IOPS.

ROPS Read Operations Per Second. The number of read commands performed on a storage
device per second.

WOPS Write Operations Per Second. The number of write commands performed on a storage
device per second.

TPS Transparent Page Sharing. A feature unique to VMware ESX, where several memory pages
can be identified as containing equal data and then stored only once in physical memory,
effectively saving physical memory. It is in most respects comparable to data deduplication.

SSD Solid State Drive. This is normally indicated as a non-volatile storage device with no
moving parts. It can be a Flash Drive (like the ReadZilla device), but it can also be a
battery-backed (plus optionally flash-backed) RAM drive (like the LogZilla device).

KB KBytes. Also seen in conjunction with "/s" or ".sec-1", which indicates KBytes-per-second.

MB MBytes. Also seen in conjunction with "/s" or ".sec-1", which indicates MBytes-per-second.

Mb Mbits. Also seen in conjunction with "/s" or ".sec-1", which indicates Mbits-per-second.

vDesktop Virtualized Desktop. A Virtual Machine (VM) running a client operating system such as
Windows XP.

ave Average. Shorthand used in graphs to indicate the value is an averaged value.

HT, HTx Hyper Transport bus. High bandwidth connection between CPUs and I/O devices on
mainboards. Often indicated with numbers (HT0, HT1) to indicate specific connections.

UFS Unified Storage (Device). Storage device which is capable of delivering the same data using
multiple protocols.

1 Management Summary

1.1 Introduction

Running virtual desktops (vDesktops) puts a lot of stress on storage systems. Conventional storage systems are easily scaled to the right size: a number of disks delivers a certain capacity and performance.

In an effort to tackle the need for a lot of disks in a virtualized desktop (vDesktop) environment, Dataman started to analyze the basic needs of a vDesktop storage solution based on VMware linked cloning technology. The new Sun Unified Storage (UFS) solution (see reference [4]) appeared to have a significant head start in delivering high vDesktop performance with a small number of disks.

Because this storage solution works so differently from conventional arrays, it is next to impossible to calculate performance numbers on paper; how the Unified Storage performs depends heavily on the workload offered. This is why Dataman teamed up with Sun to run performance tests on these storage devices.

1.2 Objectives

The performance test had several goals:

- To measure performance impact on the Unified Storage Array as more vDesktops were deployed on
the environment;
- To examine impact on vDesktop reboots;
- To extrapolate the measured performance data;
- To project (and avoid) performance bottlenecks;
- To define scaling constants for scaling the environment to a projected number of vDesktops.

The tests were performed in Sun's datacenter in Linlithgow, Scotland. Hardware and housing were generously made available to Dataman for a period of two months, during which all necessary tests were performed.

1.3 Results

The performance tests proved to be very effective: during the final stages of the test, the environment stopped at 1319 user-simulated vDesktops because the VMware environment, having "only" eight nodes, could not handle any more virtual machines (VMs). At that stage, all vDesktops still performed without any issues or noticeable latency on a single-headed UFS device. Even more remarkably, the environment could have run on only 16 SATA spindles in a mirrored setup! It is the underlying ZFS file system and the intelligent use of memory and Solid State Drives (SSDs) that make all the difference here.

2 Initial objective

After virtualization practically conquered the world of server workloads, it is now moving on to the desktop. Virtualizing a large number of desktops on a small set of servers has proven to pose its own set of challenges. The one most often encountered is the performance requirement of the underlying storage array. Scaling disks just to satisfy capacity needs has always been bad practice, but it works out especially badly in a virtual desktop environment. Today's large disk capacities do not help either.

2.1 VMware View

One of the leading platforms for delivering virtualized desktops is VMware ESX in combination with VMware
View. VMware View is able to deliver virtual desktops using linked cloning technology. This technology delivers very fast desktop image duplication and is more efficient in terms of storage capacity needs.

Calculating the number of ESX nodes (cores and memory) is not too hard; it is no different from sizing for fully cloned desktops. But what are the requirements of the underlying storage array?

2.2 Storage requirements

The structure of linked clones poses some challenges to the storage. For reasons explained in the next paragraphs, Sun's 7000 series Unified Storage (see reference [4]) was selected as THE platform to drive linked clone loads most efficiently.

The objective of this performance test is to prove that Sun’s 7000 series Unified Storage in combination with
linked clones gives great performance at little cost.

3 Technical overview of the solutions

In order to understand the performance test setup and its results better, it is important to have some
knowledge about the underlying technologies.

3.1 VMware View linked cloning

VMware View is basically a broker between the clients and the virtualized desktops (vDesktops) in the
datacenter. The idea is that a single Windows XP image can be used to clone thousands of identical desktops.
The broker controls the cloning and customization of these desktops.

VMware View enables an extra feature: linked cloning. When using linked cloning technology, only a small
number of fully cloned desktop images exist. All virtual desktops that are actually used are derivatives of
these full clone images. In order to be able to differentiate the desktops, all writes to the virtual desktop’s
disk are captured in a separate file, much like VMware snapshot technology. The result of this is that many
read operations are performed from the few full clones within the environment.

Following the VMware best practices, it is recommended to have a maximum of 64 linked clones under every
full clone (called a replica).

3.2 Sun Unified Storage

Sun's Unified Storage uses the ZFS file system internally, which differs in some very specific ways from just about any other file system. A deep dive into ZFS is far beyond the scope of this document, so only a few features of these appliances will be discussed.

Sun's Unified Storage appliances have a lot of CPU power and memory compared to most competitors. The CPU power is required to drive the ZFS file system properly, and the memory is used for caching data. This caching is a key part of the appliance's extreme performance, even with relatively slow SATA disks. The use of Solid State Drives (SSDs) further enhances the performance of the appliance: read SSDs (called ReadZillas) basically extend the appliance's memory, and logging SSDs (called LogZillas) allow synchronous writes to be acknowledged faster (the effect appears somewhat similar to write caching, but the technology is very different).

3.3 Linked cloning technology combined with Unified Storage

The basic idea of using Sun's Unified Storage for linked-cloned desktops came from two directions: first, a storage device with a lot of cache was needed, in order to be able to store the replicas (full clone images). Second, the barrier of 64 linked clones per replica limited the effectiveness of the cache, since one replica is needed for every 64 linked clones. This limit applies to storage devices having LUNs with VMFS (the VMware file system for storing VMs) on them; LUN queuing, LUN locking and some other artifacts come into play here.

But when using NFS for storage instead of iSCSI or FC, the "64 linked clones per replica" barrier could possibly be broken. NFS has no issues having a thousand or more open files accessed in parallel. Since Sun's Unified Storage is also able to deliver NFS, Sun's storage device appeared to be the right choice.

4 Performance test setup

The performance test was set up in Sun's test laboratory in Linlithgow, Scotland. Sun made a number of servers, a Sun 7410 Unified Storage device and the necessary switching components available. The total hardware setup can be viewed in appendix 1.

4.1 VMware ESX setup

A total of nine servers were available for VMware ESX. Eight were used for virtual desktop loads; the ninth server was used for all other required VMs, such as vCenter, SQL, View and Active Directory. The specifications of the servers used:

- 8x Sun X4450 with 4x 6-core Intel CPU (2.6GHz), 64GB memory
- 1x Sun X4450 with 4x 4-core Intel CPU (2.6GHz), 16GB memory

All nodes were connected with a single GbE NIC to the management network, a single NIC to a VMotion network, and with a third Ethernet NIC to an isolated "client" network where the Windows XP virtual desktops could connect to Active Directory / file services.

The eight nodes performing virtual desktop loads were also connected to an NFS storage network using two
GbE interfaces. All these interfaces were connected to a single GbE switch.

ESX 3.5 update 5 was used to perform the tests. The setup was kept at its defaults; only the console memory was increased to 800MB (the maximum). In order to make sure both GbE connections to the storage array would be used, two different subnets were used towards the array, each subnet accessed by its own VMkernel interface. Each VMkernel interface was in turn bound to one of the two GbE interfaces, guaranteeing static load balancing across both interfaces for every host.

To be able to house the maximum number of VMs possible on a single vSwitch, the port-count of the vSwitch
was increased to 248 ports.

4.2 VMware View setup

For managing the desktops, a Windows 2003 64-bit Enterprise Edition template was created. From this template, five VMs were derived:

1) Microsoft SQL 2005 Standard server with SP3;
2) Domain controller with DNS and file sharing enabled;
3) VMware vCenter 2.5 update 5;
4) VMware View 3.1.2;
5) VMware Update Manager.

During the tests, all these VMs were constantly monitored to guarantee that any limits found in the
performance tests were not due to limitations within these VMs.

All ESX nodes involved in carrying vDesktops were put in a single VMware cluster, which was kept at default.
A single Resource Pool was created within the cluster (at default) to hold all vDesktops during the tests.

4.3 Windows XP vDesktop setup

The Windows XP image used was a standard Windows XP install with SP2 integrated. PSTools was installed inside the image in order to be able to start and stop applications in batches, to simulate a simple user load on the vDesktops. No further tuning was done to the image.

Within VMware the images were configured with an 8GB disk, a single vCPU and 512MB of memory.

User load was simulated by using autologon of the vDesktop, after which a batch file was started. This batch
file performed standard tasks with built-in delays. Examples of the tasks were:

- Starting of MSpaint which loads an image from the Domain Controller/File server;
- Starting Internet Explorer;
- Starting MSinfo32;
- Unzipping putty.zip to a local directory, then deleting it again;
- Starting solitaire;
- Stopping all applications again.

These actions were fixed in order and delay. The delays were tuned until the vDesktop delivered an average load of 300MHz and just about 6 IOPS (this is accepted as being a lightweight user). This user load deliberately introduced a rather high write load (of every 6 I/Os, 5 are writes). This is considered a worst-case I/O distribution for a vDesktop, making it a perfect setup for storage performance testing.
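For illustration, the sketch below shows the kind of fixed-order, fixed-delay load loop that was driven inside each vDesktop. This is only a minimal Python rendition under stated assumptions: the actual tests used autologon plus a Windows batch file with PSTools, and the application paths, the file-server share and the delay values shown here are assumptions, not the real script.

# Minimal sketch (Python, illustration only) of the fixed-order, fixed-delay user load
# loop described above. The real tests used a Windows batch file driven by PSTools;
# the application paths, the file share and the delays below are assumptions.
import shutil
import subprocess
import time
import zipfile

APPS = [
    ["mspaint.exe", r"\\fileserver\share\test.bmp"],  # MSpaint loading an image from the file server
    ["iexplore.exe"],                                 # Internet Explorer
    ["msinfo32.exe"],                                 # MSinfo32
    ["sol.exe"],                                      # solitaire
]
DELAY = 60  # seconds between actions; in the tests the delays were tuned to ~300MHz and ~6 IOPS

def run_cycle():
    started = []
    for cmd in APPS:
        started.append(subprocess.Popen(cmd))        # start the application
        time.sleep(DELAY)                            # fixed built-in delay

    # unzip putty.zip to a local directory, then delete it again
    with zipfile.ZipFile(r"C:\temp\putty.zip") as archive:
        archive.extractall(r"C:\temp\putty")
    time.sleep(DELAY)
    shutil.rmtree(r"C:\temp\putty")

    for proc in started:                             # stop all applications again
        proc.terminate()
    time.sleep(DELAY)

while True:  # the cycle repeats for the duration of the test
    run_cycle()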

Checking the performance of the XP desktops was not a primary objective of the performance tests; however, after each test a few randomly chosen vDesktops were accessed and the "introduction to Windows XP" animation was started to check its fluidity and make sure the desktops were still responsive.

4.4 Unified Storage setup

The Sun 7410 Unified Storage device was connected to the storage switch using two 10GbE interfaces. Only a single head was used in the performance test, connected to 136 underlying SATA disks in six trays. In four of the trays a LogZilla was present; in total, two LogZillas (2x 18[GB]) were assigned to the 7410 head. Inside the 7410 head itself, two ReadZillas were available (2x 100[GB]). All SATA storage (apart from some hot spares) was mirrored (at the ZFS level). With a drive size of 1TB, this effectively delivers 60TB of total storage.

The 7410 itself was configured with two Quad-Core AMD Opteron 2356 processors and 64[GB] of memory. A single dual-port 10GbE interface was added to the system for connection to the storage network. A third link (1GbE) was used for management on the management network.

During configuration, two shares were created, each having its own IP address on its own 10GbE uplink. This ensures static load balancing for the ESX nodes, and also ensures the load is evenly spread over both 10GbE links on the storage unit. Jumbo frames were not enabled anywhere in the tests.

In order to be able to measure the usage of the HyperTransport busses inside the 7410, a script that measures these loads was inserted into the unit.

5 Tests performed

A total of three tests were performed. The first test loaded 1500 idle vDesktops in linked clone mode onto the storage. In the second test, an attempt was made to load as many user-load-simulated vDesktops as possible onto the testing environment, in steps of 100 vDesktops. The third and final test was equal to the second test, but using full clones from VMware View.

For all tests, both NFS shares were used. VMware View automatically balances the number of VMs equally across all available stores.

5.1 Test 1: 1500 idle vDesktops

In the first test, VMware View was simply instructed to deploy 1500 Windows XP images from a single source image. The resulting images did not perform any user load simulation; they were booted and then left idle. This test was performed to get a general idea of the load on ESX and the storage required for this number of VMs.

5.2 Test 2: User load simulated linked clone desktops

After the initial test mentioned in 5.1, the test was repeated, now with user-load-simulated desktops. The test was performed in steps, adding 100 vDesktops in every step. The steps were repeated until a limitation in storage, ESX and/or the external environment was met.

5.3 Test 2a: Rebooting 100 vDesktops in parallel

As test 2 (5.2) reached the 1000 vDesktop mark, a hundred vDesktops were rebooted in parallel. This test was performed to simulate a real-life scenario, where a group of desktops is rebooted in a live environment. The impact on the storage device in particular was monitored.

5.4 Test 2b: Recovering all vDesktops after storage appliance reboot

As test 2 (5.2) reached its maximum, the storage array was forcibly rebooted. This was not really part of the performance test, yet it was interesting to see the recovery process of the storage array and the recovery of the VMs on it.

5.5 Test 3: User load simulated full clone desktops

Using full clones on a Sun 7000 storage device was not expected to work as efficiently as a linked cloning configuration. In this test a number of full clone desktops were deployed, 25 vDesktops per step.

6 Test results

The test results are described on a per-test basis. The initial 1500 idle-running vDesktop test is also used as a general introduction to the behavior of the storage device, the solid state drives and the observed latencies.

6.1 Test Results 1: 1500 idle vDesktops

As an initial test, 1500 idle-running, linked-cloned vDesktops were deployed onto the test environment. After the system had settled, this provided a first proof that the storage device was able to cope with at least 1500 idle vDesktop loads.

6.1.1 Measured Bandwidth and IOP sizes

The NFS bandwidth used while running this workload is shown in figure 6.1.1:

[Chart: NFS read and write MB/s (1500 idle-running vDesktops); x-axis: Time [sec], y-axis: NFS rate [MB.sec-1]; series: NFS writes ave [MB/sec], NFS reads ave [MB/sec]]

Figure 6.1.1: Running 1500 idle desktops, about 22MB/s writes and 10MB/sec reads are observed.

The fact that about twice as much data is written as read is probably due to the vDesktops running idle (few reads taking place) while having only 512[MB] of memory each, causing them to use their local swap files and write out to the storage device.

As both bandwidth and number of IOPS have been measured, it is easy to derive the average block size
of the NFS reads and writes:

[Chart: Average NFS read and write blocksizes (1500 idle-running vDesktops); x-axis: Time [sec], y-axis: Average NFS blocksize [KB]; series: NFS write blocksize [KB], NFS read blocksize [KB]]

Figure 6.1.2: Average NFS read- and write block sizes observed

Since VMware ESX will try to concatenate sequential reads and writes whenever possible, it is very likely that the writes are completely random (the NTFS 4K block size appears to dominate here). The read operations are bigger on average, probably meaning that some "quasi-sequential" reads are going on.
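The derivation behind figure 6.1.2 is simply the measured bandwidth divided by the measured number of operations per second. A minimal sketch, where the 4000 WOPS value is an illustrative number in the range of figure 6.1.3 rather than an exact measurement:

# Average block size = bandwidth / operations per second. The 4000 WOPS input below is
# an illustrative value in the range of figure 6.1.3, not an exact measurement.
def avg_blocksize_kb(bandwidth_mb_per_s, ops_per_s):
    """Average block size in KB for a given bandwidth [MB/s] and operation rate [ops/s]."""
    return bandwidth_mb_per_s * 1024.0 / ops_per_s

print(avg_blocksize_kb(22.0, 4000.0))  # ~5.6 KB, close to the ~5.5 KB write block size in figure 6.1.2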

Since all writes to the storage device are synchronous and have very small block sizes, all writes will be put
into the LogZilla devices before they pass on to SATA. As the data to be written traverse through these stages,
the number of WOPS becomes smaller with every step:

[Chart: Comparing write OPS through stages (1500 idle-running vDesktops); x-axis: Time [sec], y-axis: Write Operations [sec-1]; series: LogZilla WOPS [/sec], SATA WOPS [/sec], NFS WOPS [/sec]]
Figure 6.1.3: Number of Write operations observed through the three stages

Here it becomes obvious how effective the underlying ZFS file system is. The completely random write load, which consists of nearly 5000 write operations per second, is converted in the last stage (SATA) to just over 30 write operations per second. ZFS effectively converts the tiny random NFS writes into large sequential blocks, dealing with the relatively poor seek times of the physical SATA drives.
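To put that reduction in numbers, a quick back-of-the-envelope calculation using the approximate values read from figure 6.1.3 (illustrative, not exact measurements):

# Rough write-reduction factor achieved by ZFS coalescing, using approximate values from figure 6.1.3.
nfs_wops = 5000.0    # nearly 5000 random NFS write operations per second arrive...
sata_wops = 30.0     # ...while just over 30 write operations per second reach the SATA drives
print(f"reduction factor: ~{nfs_wops / sata_wops:.0f}x")  # about 167x fewer physical writes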

The write operations are effectively being dealt with. On reads, the following is observed on the SATA drives:

[Chart: SATA ROPS ave [/sec] (1500 idle-running vDesktops); x-axis: Time [sec], y-axis: SATA Read Operations [sec-1]; series: SATA ROPS [/sec]]

Figure 6.1.4: Observed SATA read operations per second.

At an average read bandwidth of 10[MB.sec-1] (see figure 6.1.1), less than 0.3 read operations per second (ROPS) are observed on the SATA drives. This raises the suspicion that most (in fact almost all) read operations are served by the read cache (ARC or L2ARC) and only very few reads actually originate from the SATA drives, effectively boosting the overall read performance of the Sun 7000 storage device.

6.1.2 Caching in the ARC and L2ARC

Zooming in on the read performance, we need to look closer at the read caching going on. In figure 6.1.5 it is obvious that the ARC (64[GB] minus overhead) was saturated while the L2ARC (200[GB]) was only filled up to about 70[GB]:

[Chart: ARC / L2ARC size (1500 idle-running vDesktops); x-axis: Time [sec], y-axis: ARC / L2ARC size [MB]; series: ARC datasize [MB], L2ARC datasize [MB]]

Figure 6.1.5: Running 1500 idle desktops, the ARC shows fully filled while the L2ARC flash drives
vary in usage around 64[GB].

The ARC/L2ARC combination not being saturated should mean that all actively read data still fits into memory (ARC) or ReadZilla (L2ARC). This is clearly shown in figure 6.1.6, where the number of ARC hits is much larger than the number of ARC misses:

[Chart: ARC hits / misses (1500 idle-running vDesktops); x-axis: Time [sec], y-axis: ARC hits / misses [sec-1]; series: ARC hits [/sec], ARC misses [/sec]]

Figure 6.1.6: Running 1500 idle desktops, the ARC hits show around 7000 per second while the
ARC misses show up at about 250. This is an indication of the effectiveness of the
(L2)ARC while running this specific workload.
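Expressed as a cache hit ratio, using the approximate values quoted in the caption above (a small sketch, not an exact measurement):

# ARC hit ratio from the approximate values in figure 6.1.6.
arc_hits = 7000.0    # ARC hits per second
arc_misses = 250.0   # ARC misses per second (served from L2ARC or SATA instead)
print(f"ARC hit ratio: {arc_hits / (arc_hits + arc_misses):.1%}")  # ~96.6% of reads served from memory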

While read operations appear to be properly drawn from ARC or L2ARC, write operations must be committed
to the disks at some point. The NFS writes are synchronous, meaning that each write operation must be
guaranteed to be saved by the storage device before acknowledging the operation. This would mean a rather
bad write performance, since the underlying disks are relatively slow SATA drives.

This problem is countered by the use of LogZilla devices. These devices are write-optimized solid state disks (SSDs), which constantly store the write operation metadata and acknowledge the write back immediately, before it is actually committed to disk. As soon as the write is actually committed to SATA storage, the metadata entry is removed from the LogZilla (this is the reason it is called a LogZilla and not a write cache; the LogZilla is only there to make sure the dataset does not end up in an inconsistent state when, for example, a power outage occurs).

The underlying ZFS file system flushes the writes to disk at least every 30 seconds. The ZFS file system is able to perform the random writes to the SATA disks very effectively, turning them into big sequential writes whenever possible. This can be verified in the graph in figure 6.1.3.

6.1.3 I/O Latency

Besides read and write performance, it is also necessary to look at storage latency. Latency is the delay between a request to the storage and the answer back. For a read, it is typically the time from the read request to the delivery of the data. For a write, it is typically the time from the write request to the write acknowledgement.

Best performance is achieved when latency is minimal. To be able to graph latency over time, a three-dimensional graph is required. The functions of the different axes are:

- Horizontal Axis: Time;
- Vertical Axis: Number of Read and/or Write Operations;
- Depth Axis: Latency.

Latency is grouped into ranges instead of unique values. This enables the creation of 3D graphs,
because it is now possible to see groups of IOPS which conform to a certain latency range.
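A minimal sketch of how such a grouping can be produced from raw per-operation latencies; the 20[ms] bucket width mirrors the 0-20[ms] groups used in the graphs, and the sample latencies are made up for illustration:

# Minimal sketch of grouping per-operation latencies into fixed ranges, as done for
# the 3D latency graphs (20 ms wide buckets). The sample latencies are made up.
from collections import Counter

def bucket_latencies(latencies_ms, bucket_ms=20):
    """Count operations per latency range, e.g. 0-20 ms, 20-40 ms, ..."""
    counts = Counter(int(lat // bucket_ms) for lat in latencies_ms)
    return {(b * bucket_ms, (b + 1) * bucket_ms): n for b, n in sorted(counts.items())}

sample = [1.2, 3.5, 0.8, 14.0, 95.0, 2.2, 41.0]   # latencies in milliseconds (fabricated sample)
print(bucket_latencies(sample))  # {(0, 20): 5, (40, 60): 1, (80, 100): 1}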

Since on many occasions almost all latency falls within the lowest group of 0-20[ms], graphs are often zoomed in, with the number of IOPS (vertical axis) clipped to a low number. As a result, the peaks of the 0-20[ms] latency group go "off the chart". This gives a clearer view of the higher latency groups. Please note that these graphs do not give a total overview of the number of IOPS performed; they merely give insight into the tiny details which are almost invisible in the original (non-zoomed) graph.

In figure 6.1.7a (with its zoomed counterpart 6.1.7b) the latency graph is displayed for NFS read operations with 1500 idle-running vDesktops. Almost all operations fall within the 0-20[ms] latency group. Only when looking at the zoomed graph (figure 6.1.7b) can some higher latencies be observed. However, these are so small in number compared to the number of IOPS within the 0-20[ms] latency group that very little impact is to be expected from them.

The read operations that required more time to complete are probably the ARC/L2ARC cache misses, which had to be read from SATA. These SATA reads are the reads observed in figure 6.1.4.

[Chart: NFS read latency (1500 idle-running vDesktops); y-axis: NFS Read Operations [sec-1], grouped by latency range]

Figure 6.1.7a: Observed latency in NFS reads. Most read operations are served within 20[msec]

[Chart: NFS read latency, ZOOMED (1500 idle-running vDesktops); y-axis: NFS Read Operations [sec-1], clipped at 20]

Figure 6.1.7b: Detail of latency in NFS read operations. Clipped at only 20 OPS to visualize higher
latency read operations.

6.2 Test Results 2: User load simulated linked clone desktops

After the initial test with idle-running desktops, the environment was reset. A new Windows XP image was
introduced, which delivers a lightweight user pattern:

- 200[MHz] CPU load;
- 300[MB] active memory;
- 7 observed NFS IOPS.

The memory and CPU load were deliberately held at a low level, so a maximum number of VMs would fit onto the virtualization platform. The number of IOPS was matched to the accepted industry average of 5 - 5.6 IOPS, with a calculated 150% overhead for linked cloning technology (see reference [1] for an explanation of the 150% factor).

6.2.1 Deploying the initial 500 user load-simulated vDesktops

When deploying the initial 500 vDesktops, the effect of the deployment was clearly reflected in several graphs. In figure 6.2.1 the ARC and L2ARC sizes grow almost linearly during deployment:

[Chart: ARC / L2ARC datasize (0 - 500 user-loaded vDesktops); x-axis: Time, y-axis: ARC / L2ARC datasize [MB]; series: ARC datasize [MB], L2ARC datasize [MB]]

Figure 6.2.1: Observed ARC/L2ARC data size growth when deploying the first 500 desktops.

During the deployment of the very first vDesktops, the ARC immediately fills with both replicas (a replica is
the full-clone image from which the linked clones are derived). There are two replicas, because two NFS
shares were used, and VMware View places one replica on each share. In the leftmost part of the graph it is
actually identifiable that both replicas are put into the ARC one by one.

After this initial action, the ARC starts to fill. This is because the created linked clones are also being read
back. Since every vDesktop behaves the same, the read back performed on the linked clones is also identical,
which explains the near-linear growth.

At the right of figure 6.2.1, the ARC fills up to its memory limit of 64[GB] minus the Storage 7000 overhead. It
is not until this time that the L2ARC starts to fill in the same linear manner as the ARC did. It becomes clear
that the L2ARC behaves as a direct (though somewhat slower) extension of the ARC (which resides in
memory).

When looking at ARC hits and misses in figure 6.2.2, it becomes clear that more and more read operations
are performed throughout the deployment:

[Chart: ARC hits / misses (0 - 500 user-loaded vDesktops); y-axis: ARC hits / misses [sec-1]; series: ARC hits [/sec], ARC misses [/sec]]

Figure 6.2.2: Observed ARC hits and misses while deploying the initial 500 user loaded vDesktops.

The graph in figure 6.2.2 clearly shows the growing number of ARC hits. The ARC misses hardly increase at
all. This means that as more vDesktops are deployed, the effectiveness of the read cache mechanism
increases.

[Chart: NFS bandwidth consumed (0 - 500 user-loaded vDesktops); y-axis: NFS read/write [MB.sec-1]; series: NFS write ave [MB/sec], NFS read ave [MB/sec]]

Figure 6.2.3: Consumed NFS bandwidth during deployment of the initial 500 vDesktops

In figure 6.2.3 it is clearly visible that the first 500 vDesktops were deployed in batches of 100. During the
linked cloning deployment, consumed NFS bandwidth is clearly higher than during normal running periods.

[Chart: SATA read and write operations (0 - 500 user-loaded vDesktops); y-axis: Read / Write Operations [sec-1]; series: SATA WOPS [/sec], SATA ROPS [/sec]]

Figure 6.2.4: SATA Read- and Write Operations observed during the deployment of the initial 500
vDesktops. Note that the vertical scale has been extended to -2 in order to clearly
display the Read Operations, which run over the vertical axis itself.

Figure 6.2.4 shows that the SATA Write Operations increase with the number of vDesktops running. The Read
Operations remain at a minimum level, without any measurable increase. This is in line with figure 6.2.2
showing that the read cache gets more effective with a growing number of deployed vDesktops.

The NFS write operations are synchronous and get accelerated by the LogZillas. The graph in figure 6.2.5 shows the WOPS to the LogZilla devices:

[Chart: LogZilla WOPS ave [/sec] (0 - 500 user-loaded vDesktops); y-axis: LogZilla WOPS [sec-1]]

Figure 6.2.5: Write Operations to the LogZilla device(s).

The ZFS file system is able to deliver this workload using a very limited number of SATA write operations. A possible downside of the ZFS file system is the large amount of CPU overhead it imposes. See figure 6.2.6 for details on the CPU usage of the Sun 7000 storage device:

[Chart: Sun Storage 7000 CPU load ave [%] (0 - 500 user-loaded vDesktops); y-axis: CPU load ave [%]]

Figure 6.2.6: CPU usage in the Sun 7000 storage during deployment of 500 user load-simulated
vDesktops

6.2.2 Impact of 500 vDesktop deployment on VMware ESX

As the number of vDesktops increases, the load on VMware ESX and vCenter also increases. See figure 6.2.7,
6.2.8 and 6.2.9 for more details:

Figure 6.2.7: CPU usage within one of the eight VMware ESX hosts during the deployment of the
initial 250 vDesktops. The topmost grey graph is the CPU overhead of VMware ESX.

In figure 6.2.7 the deployment of vDesktops is clearly visible. Each time a vDesktop is deployed and started, a "ribbon" is added to the graph. Each vDesktop uses the same amount of CPU power, which is slightly higher just after deployment (when the VM is booting its operating system).

Figure 6.2.8: Active Memory used by the vDesktops on one of the ESX nodes during the
deployment of the initial 250 vDesktops. The lower red ribbon is ESX memory
overhead due to the Service Console.

Figure 6.2.8 shows the active memory consumed as the vDesktops are deployed on one of the ESX nodes.
After each batch of 100 vDesktops, the memory consumption stops increasing, then slightly decreases. This
effect is caused by two things:

1) Freeing up tested memory within the VMs (Windows VMs touch all memory during memory test);
2) VMware’s Transparent Page Sharing technology.

As the VMs settle on the ESX server, ESX starts to detect identical memory pages, effectively deduplicating
them (item 2 on the list above). This feature can save a lot of physical memory usage, especially when
deploying many (almost) identical VM workloads.

Figure 6.2.9: Physical memory shared between vDesktops thanks to VMware’s Transparent Page
Sharing (TPS) function within VMware ESX.

Transparent Page Sharing (TPS) effects become clearer when looking at the graph in figure 6.2.9. As VMs are
added to the ESX server, more memory pages are identified as being duplicates, saving more and more
physical memory.

6.2.3 Impact of 500 vDesktop deployment on VMware vCenter and View

VMware vCenter and VMware View are not directly involved in delivering the vDesktop workloads, but they play an important role during the deployment of new vDesktops. The CPU loads on these machines clearly show the deployment of the batches of vDesktops:

Figure 6.2.10: Observed CPU load on the (dual vCPU) vCenter server during vDesktop deployment.
Note the dual y-axis descriptions; some values are percents, others are [MHz].

In figure 6.2.10, the deployment batches can clearly be distinguished. After each batch, the vCenter server settles at a slightly higher CPU load. This is caused by the growing number of VMs to manage and monitor within the entire ESX cluster.

Figure 6.2.11: Observed CPU load on the VMware View server during vDesktop deployment.
Note the dual y-axis descriptions; some values are percents, others are [MHz].

The VMware View server shows pretty much the same characteristics as the VMware vCenter server: higher CPU loads during the batch deployment of vDesktops, settling somewhat higher after each batch.

6.2.4 Deploying vDesktops beyond 500

After the successful deployment of the initial 500 vDesktops, further batches of 100 vDesktops were deployed. The goal was to fit as many vDesktops onto the testbed as possible, keeping track of all potential boundaries (performance-wise).

The largest number of vDesktops that could be deployed was 1319. At this point VMware stopped deploying more vDesktops because the ESX servers were running out of vCPUs; within ESX version 3.5, the number of VMs that can run on a single node is fixed at 170. This maximum was reached just before ESX physical memory ran out:

[Chart: VMware ESX node resource usage (0 to 1300 vDesktops); x-axis: number of deployed user-simulated vDesktops, y-axis: CPU / memory usage (average) [%]; series: Node CPU, Node mem]

Figure 6.2.12: ESX node resource usage when deploying 1300 vDesktops

As the graph in figure 6.2.12 shows, memory and CPU usage grew at almost the same rate. The limit on the number of running VMs, the memory limit and the CPU power limit were all reached almost simultaneously.

Due to the nature of the ZFS file system, the CPU load on the storage device was a concern. The
measured values can be found in figure 6.2.13:

[Chart: 7000 Storage CPU resources (0 to 1300 vDesktops); x-axis: number of deployed user-simulated vDesktops, y-axes: CPU consumed [%] and HyperTransport bus bandwidth usage [GB.sec-1]; series: 7410 CPU load, HT0/socket1]

Figure 6.2.13: CPU load on the 7000 storage during deployment of 1300 vDesktops. Note the HT0
value. This is the HT-bus between the two quad core CPUs inside the storage device.
The “relaxation” points at 600-700 and 1200 vDesktops were due to settling of the
environment during weekends.

As shown, the CPU load on the storage device is quite high, but not near saturation yet. The HT0 bus
displayed here was the one HT-bus having the biggest bandwidth usage. This is due to the fact that a
single, dual-channel PCI-e 10GbE card was used in the environment. The result of this was that the
second CPU had to transport all of its data to the first CPU in order to be able to get its data in and out
of the 10GbE interfaces. Note that the design could have been optimized here to use two separate
10GbE cards, each on PCI-e lanes that use a different HT-bus. This would have resulted in a better
load balancing across CPUs and HyperTransport busses. See figure 7.3.1 for a graphical representation
of this.

The memory consumption of the 7000 storage is directly linked to the amount of read cache used. As the number of vDesktops increases, the ARC (memory cache) fills up. At about 450 vDesktops, the ARC reaches its 64[GB] limit and the L2ARC (solid state drives) starts to fill (see figure 6.2.14):

[Chart: 7000 Storage memory usage (0 to 1300 vDesktops); x-axis: number of deployed user-simulated vDesktops, y-axis: memory usage [GB]; series: 7410 L2ARC [GB], 7410 ARC [GB], 7410 Kernel use [GB]]

Figure 6.2.14: Memory usage on the 7000 storage during the deployment of 1300 vDesktops. Note
the L2ARC (SSD drive) starting to fill as the ARC (memory) saturates. The relaxation
between 600 and 700 vDesktops is due to a stop of deploying during a weekend
(ARC flushing occurred through time as the vDesktops settled in their workload).

The L2ARC finally settled at just about 100[GB] of used space (on the testbed there was a total of
200[GB] of ReadZilla available).

The network bandwidth and IOPS used by the testbed are displayed in figure 6.2.15:

[Chart: NFS traffic (0 to 1300 vDesktops); x-axis: number of deployed user-simulated vDesktops, y-axes: I/O operations [sec-1] and network traffic [MB.sec-1]; series: NFS IOPS, NFS reads, NFS writes]

Figure 6.2.15: NFS traffic observed during the deployment of 1300 vDesktops.

The "dips" in the graph at 600/700 and 900/1000 vDesktops are actually weekends, during which the vDesktops settled in their behavior; this shows in the graph in figure 6.2.15.

6.2.5 Performance figures at 1300 vDesktops

The system saturated at 1300 vDesktops, due to the limit on the maximum number of running VMs inside the ESX servers. Performance of the vDesktops at this number was still very acceptable, even though both memory and CPU power were almost at their maximum.

The VMs were still very responsive. Random vDesktops were accessed through the console, and responsiveness was tested by starting the "welcome to Windows XP" introduction animation. Neither frame rate nor animation speed deteriorated significantly through the entire range of 0 to 1300 vDesktops.

A good technical measure for this is the CPU ready time: the time that a VM is ready to execute on a physical CPU core, but ESX cannot manage to schedule it onto one:

Figure 6.2.16: CPU ready time measured on a vDesktop on a 30 minute interval.

Note that these values are summed up between samples, and all millisecond values should be divided by
1800 (30 minutes) in order to obtain the number of milliseconds ready time per second (instead of per 30
minutes). In the leftmost part of the graph, vDesktops are still being deployed and booted up, impacting
performance (ready time is about 12.5 [ms.sec-1]). After the deployment is complete, ready time drops to
about 4.2 [ms.sec-1]. These values are very acceptable from a CPU performance point of view.
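A small sketch of that normalization; the summed input values are back-calculated from the per-second figures quoted above, purely for illustration:

# Normalize the CPU ready time summed per 30-minute sample to milliseconds per second.
SAMPLE_INTERVAL_S = 1800  # 30 minutes

def ready_ms_per_sec(summed_ready_ms):
    return summed_ready_ms / SAMPLE_INTERVAL_S

print(ready_ms_per_sec(22500))  # 12.5 ms/sec: the level seen while vDesktops are deployed and booted
print(ready_ms_per_sec(7560))   # 4.2 ms/sec: the level after the environment has settled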

Next to CPU ready times, NFS latency also greatly influences the responsiveness of the vDesktops. The graphs in the following figures were made at a load of 1300 vDesktops:

[Charts: NFS read latency and NFS read latency ZOOMED (1300 user-loaded vDesktops); y-axis: Read IOPS [sec-1], zoomed version clipped at 20]

Figure 6.2.17a and 6.2.17b: Observed NFS read latency at 1300 user simulated vDesktops

Graph 6.2.17a shows that almost all read operations are served within 20[ms], which is quite impressive at this load.

When we look at the read latency in more detail (figure 6.2.17b), there are some read operations which take longer to be served. To put this in numbers: between 1 and 2 read operations every second take up to about 100[ms] to be served. Note that this is only about 0.2% of the read operations performed.

Next to read latency, write latency was also measured. The write latency appears to be a little worse than the read latency:

[Charts: NFS write latency and NFS write latency ZOOMED (1300 user-loaded vDesktops); y-axis: NFS Write Operations [sec-1], zoomed version clipped at 1000]

Figure 6.2.18a and 6.2.18b: Observed NFS write latency at 1300 user simulated vDesktops

About 150 write operations require more than the base 0-40[ms] window to complete. Since the total number of write operations is about 6000, this is about 2.5% of the operations performed. The only explanation for these high latency numbers can be that some writes are not committed to the LogZilla, but are flushed to disk directly. This is normal behavior for ZFS.

Within ZFS, larger blocks are not committed to the LogZilla. This is controlled by a parameter called zfs_immediate_write_sz. This parameter is actually a constant within ZFS, and is set to 32768 (see reference [2]).

VMware ESX will concatenate writes if possible, up to 64[KB]. It is safe to assume that the majority of writes equal the vDesktop's NTFS block size (4[KB]). However, some blocks do get concatenated within VMware.

Looking at figure 6.1.2, we can see that the average write size is 5.5[KB]. If we calculate the projected average block size from the behavior seen above (about 2.5% of the writes concatenated to 64[KB], the remaining 97.5% being 4[KB] blocks), we can conclude that:

0.025 x 64[KB] + 0.975 x 4[KB] = 1.6[KB] + 3.9[KB] = 5.5[KB]

This is a perfect match, so it is safe to assume that the large write latency observed is in fact due to this behavior. Tuning the zfs_immediate_write_sz constant could help in this case (increasing it to 65537, which is 2^16+1, to make sure 64[KB] writes are also stored in the LogZillas). Unfortunately, adjusting this parameter is not supported on the Sun 7000 storage arrays (nor is it in ZFS to my knowledge).
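The same check can be done the other way around: given the observed 5.5[KB] average and the two write sizes involved, solve for the fraction of concatenated 64[KB] writes. A small sketch:

# Solve for the fraction of 64 KB (concatenated) writes needed to reach the observed 5.5 KB
# average write size, assuming all remaining writes are 4 KB NTFS-sized blocks.
avg_kb, small_kb, large_kb = 5.5, 4.0, 64.0
large_fraction = (avg_kb - small_kb) / (large_kb - small_kb)
print(f"{large_fraction:.1%} of writes are ~64 KB")  # 2.5%, matching the ~2.5% of high-latency
                                                     # writes that bypass the LogZilla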

VMware ESX has a feature called "Transparent Page Sharing" (TPS). This allows VMware ESX to map several identical virtual memory pages to the same physical memory page. VMware performs this "memory deduplication" either in hardware (vSphere 4 plus supported CPUs) or using spare CPU cycles (both ESX 3.x and, optionally, vSphere 4), so the positive effect of TPS gets bigger over time (also see figure 6.2.9).

At a total of 1300 deployed vDesktops, ESX saves a large amount of memory:

Figure 6.2.19: Memory shared between vDesktops within a single ESX server.

As shown in figure 6.2.19, there are 170 VMs running (the graph shows only one of the eight ESX nodes). Each ribbon in this graph represents a VM. In total, 22.5[GB] of memory is shared and thus saved between vDesktops per ESX node. Without TPS, each ESX server would have required 64 + 22.5 = 86.5[GB] of memory (a saving of roughly 30%!).

When looking at the entire ESX cluster, each ESX server saves about the same amount of memory thanks to TPS, saving 8 x 22.5 = 180[GB] of memory in total.
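A small sketch of the arithmetic above:

# TPS memory savings per node and across the cluster, using the figures quoted above.
shared_gb = 22.5      # memory shared (deduplicated) per ESX node
installed_gb = 64.0   # physical memory per ESX node
nodes = 8

required_without_tps = installed_gb + shared_gb   # 86.5 GB would be needed per node without TPS
cluster_saving = shared_gb * nodes                # 180 GB saved across the eight nodes
print(required_without_tps, cluster_saving)
print(shared_gb / required_without_tps, shared_gb / installed_gb)  # ~0.26 and ~0.35: roughly the
                                                                   # 30% saving mentioned above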

6.2.6 Extrapolating performance figures

In order to be able to predict the maximum number of vDesktops which can be placed on a certain environment, it is important to take note of all limiting factors. By extrapolating the measurements performed on these factors, we can determine how to scale the different resources (like CPU, memory and SSD drives) to match the number of vDesktops we need to deploy.

For scaling VMware ESX CPU and memory, we set the maximum allowable load to 85%. The extrapolated graph can be found in figure 6.2.20:

[Chart: VMware ESX node resource usage (extrapolated); x-axis: number of deployed user-simulated vDesktops (0 - 1800), y-axis: CPU / memory usage (average) [%]; series: Node CPU, Node mem]

Figure 6.2.20: Extrapolation of figure 6.2.12: ESX node resource usage

In figure 6.2.20, memory is the limiting factor at 1300 desktops (which actually was the limit we ran into during the test). CPU had some room to spare: if we pushed CPU consumption to 85%, we could deploy 1650 vDesktops.
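The extrapolation itself is a straight linear fit through the measured points, solved for the chosen ceiling. A minimal sketch of that approach; the sample points below are illustrative values chosen to be consistent with the quoted result, not the actual measurements:

# Minimal sketch of the linear extrapolation used for figure 6.2.20: fit a straight line
# through the measured (vDesktops, load%) points and solve for the 85% ceiling.
# The sample points are illustrative, chosen to be consistent with the quoted result.
def max_vdesktops(points, ceiling_pct=85.0):
    """Least-squares line load = a*n + b through (n, load) points, solved for the ceiling."""
    n = len(points)
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sxx = sum(p[0] ** 2 for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return (ceiling_pct - b) / a

cpu_points = [(500, 26), (900, 46), (1300, 67)]   # illustrative (vDesktops, CPU%) samples
print(round(max_vdesktops(cpu_points)))           # roughly 1650 vDesktops at the 85% CPU ceiling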

[Chart: 7000 Storage CPU resources (extrapolated); x-axis: number of deployed user-simulated vDesktops (0 - 2000), y-axes: CPU consumed [%] and HyperTransport bus throughput [GB.sec-1]; series: 7410 CPU load, HT0/socket1]

Figure 6.2.21: Extrapolation of figure 6.2.13: CPU load on the 7000 storage

Looking at figure 6.2.21, the extrapolated value for the 7000 storage CPU usage would put the maximum number of vDesktops at 1900. The theoretical maximum of the HT bus is 4[GB.sec-1], but a generally accepted value is around 2.5[GB.sec-1]. This would mean the HT-bus limits the number of vDesktops to 1950.

For read caching, the 7000 storage relies on memory and solid state drives (SSDs). Both extrapolate in basically the same way; only memory is much faster than SSD. For extrapolating memory usage, using the ARC values is sufficient:

[Chart: 7000 Storage memory usage (extrapolated); x-axis: number of deployed user-simulated vDesktops (0 - 2500), y-axis: memory usage [GB]; series: 7410 ARC [GB]]

Figure 6.2.22: Extrapolation of figure 6.2.14: Memory usage on the 7000 storage

Extrapolation of the ARC size shows that 256[GB] of memory, minus some overhead for the kernel, could support up to 2400 deployed vDesktops. Beyond this point, SSDs (ReadZilla) would have to be used in order to extend the cache beyond 256[GB], which is the maximum amount of RAM that can be installed in the biggest 7000 series array at the time of writing.

An important note is that the measured range of the ARC is rather short. A slight variation in the measurement could have quite a dramatic effect on the final number of vDesktops that can be deployed in a given environment.

Finally, the NFS traffic is extrapolated in order to be able to see projected network bandwidth and
number of IOPS required for a given number of vDesktops:

[Figure: NFS traffic (extrapolated): NFS read and write bandwidth [MB.sec-1] and NFS IOPS [sec-1] plotted against the number of deployed user-simulated vDesktops.]

Figure 6.2.23: Extrapolation of figure 6.2.15: NFS traffic observed

The extrapolation in figure 6.2.23 is bounded by some limits. In the network bandwidth projection a maximum of 2x 1 GbE is used, with usage limited to 50% per link in order to avoid possible saturation and packet dropping on the links.

The total number of IOPS in this projection is limited to 12.000 [sec-1]. The reason for choosing this number is that, at the measured I/O distribution, this corresponds to about 10.000 Write Operations Per Second (WOPS), which is the maximum for a LogZilla device.

According to this graph, these maximums come into play above 1800 vDesktops. For the NFS read bandwidth, the maximum is not reached in this graph, but would be hit somewhere near 4000 vDesktops (!).
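
Taken together, each resource yields its own ceiling, and the deployable maximum is set by the most restrictive one. The following minimal sketch repeats the extrapolated limits quoted in this section purely for illustration:

# Extrapolated per-resource ceilings from section 6.2.6 (in vDesktops).
ceilings = {
    "ESX memory (85%)":       1300,
    "ESX CPU (85%)":          1650,
    "7000 storage CPU (85%)": 1900,
    "HT bus (2,5 GB/s)":      1950,
    "NFS IOPS / LogZilla":    1800,
    "ARC within 256 GB RAM":  2400,
}

# The deployable maximum is set by the most restrictive resource.
binding = min(ceilings, key=ceilings.get)
print(f"Binding limit: {binding} at {ceilings[binding]} vDesktops")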

In close relation to the NFS IOPS performed, SATA ROPS and WOPS can also be extrapolated:

[Figure: SATA Read and Write Operations (extrapolated): SATA ROPS and WOPS [sec-1] plotted against the number of deployed vDesktops.]

Figure 6.2.24: Extrapolation of figure 6.2.4: SATA Read- and Write-Operations

The graph in figure 6.2.24 clearly shows that hardly any SATA ROPS are performed, while the SATA WOPS steadily increase as the number of running vDesktops increases. Note that at 1500 running vDesktops the number of WOPS is projected to be only 68 [sec-1]; the ROPS remain near zero.

The write acceleration through the LogZilla device(s) can also be extrapolated:

[Figure: LogZilla average WOPS (extrapolated): LogZilla WOPS [sec-1] plotted against the number of vDesktops deployed.]

Figure 6.2.25: Extrapolation of figure 6.2.5: LogZilla WOPS performed

Latency is more complex to extrapolate. By extrapolating each latency-group, a 3D graph can be
recreated to show projected NFS read latencies:

[Figure: Extrapolated NFS read latencies: 3D view of NFS read operations [sec-1] per latency group as more vDesktops are deployed.]

Figure 6.2.26: Extrapolation of NFS read latency, clipped at 100 read operations per second.

Figure 6.2.26 is an extreme zoom of an extrapolated NFS read latency graph. The graph has been cut into segments, with gaps inserted between them, to give a clear view of the latency distribution as more vDesktops are deployed on the environment.

As the number of vDesktops grows, more latency is introduced; this was already established. What this graph makes clear, though, is that the distribution of latency changes as the load increases.

6.3 Test Results 2a: Rebooting 100 vDesktops

An impact that should never be underestimated is the load placed on the storage when a large number of VMs has to be rebooted. The reboot process uses far more resources than the regular workload, and rebooting a lot of vDesktops in parallel in particular can mean a large increase in the number of I/O operations performed.

As a subtest, we shut down and then restarted a hundred vDesktops with a total of 800 vDesktops deployed. The impact is best seen in the latency graphs:

[Figures: NFS read latency (800 vDesktops, 100 rebooting): number of read OPS [sec-1] over time, shown at full scale and zoomed in.]

Figures 6.3.1a and 6.3.1b: NFS read latency while rebooting 100 vDesktops (at 800 deployed).

As can be seen in graphs 6.3.1a and 6.3.1b, the reboot took about one hour in total. The restart was issued through VMware View, which schedules the restarts, spread over time, through vCenter. In the "a" (unzoomed) graph, the peaks above 4000 ROPS indicate the higher number of read operations caused by the restarting vDesktops. The zoomed graph (graph b) shows in more detail how the read latency gets worse during the restarts. This is because the linked-clone files that were previously written by the VMs are now read back and first have to be introduced into the ARC/L2ARC read cache, meaning these reads have to come from the relatively slow SATA drives. A second restart might have had less impact in this respect (untested).

The filling of the L2ARC (from SATA) when rebooting the vDesktops can clearly be seen in the graph in figure 6.3.2:

[Figure: (L2)ARC growth during the restart of 100 vDesktops in an 800-vDesktop environment: ARCdata and L2ARCdata size [GB] over time.]

Figure 6.3.2: (L2)ARC growth during the reboot of 100 vDesktops.

The hundred rebooting vDesktops caused the L2ARC to grow by about 50 [GB]. Since all common reads come from only two replicas, which are already stored in the ARC, apparently each VM reads about 0,5 [GB] of unique data (from its own linked clone).

Network bandwidth used is also clearly higher during the reboot of the vDesktops:

[Figure: NFS bandwidth used during the restart of 100 vDesktops in an 800-vDesktop environment: average NFS read and write bandwidth [MB.sec-1] over time.]

Figure 6.3.3: NFS bandwidth used during reboot of 100 vDesktops.

At the left of the graph above, the regular I/O workload can be observed; the remainder of the graph covers the reboot of the 100 vDesktops.

6.4 Test Results 2b: Recovering all vDesktops after storage appliance
reboot

While running 1000 vDesktops, the storage array was forcibly rebooted. This subtest was performed to see the impact on the storage array, on the data and on the vDesktops.

At the time of the forced shutdown of the storage device, all VMs froze. After the storage appliance was rebooted, the ZFS file system had to perform some resilvering (checking and making sure the data is consistent, a very reliable feature of ZFS) before normal NFS communication to the ESX servers could resume. Once communication resumed, the VMs simply unfroze and returned to their normal behavior almost instantly.

In graph 6.4.1 the effects of the forced reboot can clearly be seen:

Figure 6.4.1: Network and CPU load behavior during the reboot of the storage appliance. The red striped bars indicate that no measurements were made (during the reboot of the storage device itself).

The red bar in figure 6.4.1 indicates the time required to perform the (re)boot of the storage device. The "silent" period after that is the so-called resilvering of the ZFS file system. No I/O is performed at this stage, but as can be seen the CPU is quite busy during the resilvering.

After resilvering is done, the storage device immediately resumes performing I/O and settles quite fast. After a reboot of the appliance, the ARC is empty (being RAM), and the L2ARC data is forcibly discarded and has to be rebuilt as read operations start to occur. Initially, the read operations have to come from SATA, filling up the ARC first and after that the L2ARC. In figure 6.4.2 the refilling of the ARC and L2ARC is clearly visible:

Figure 6.4.2: Filling of the ARC and the L2ARC after a forced reboot of the storage appliance.

The graph in figure 6.4.2 clearly shows the rapid filling of the ARC. It appears to fill a little during resilvering, then shoots up quickly (probably as the two replicas are pulled into the ARC). From there on, the filling of the ARC slows its pace, and the L2ARC starts to fill as well. The third graph in figure 6.4.2 shows the (L2)ARC misses. For a few minutes there are quite a lot of misses, but this resolves rather quickly.

All in all, the device is up and running again within 15 minutes. Take note that the setup used here did not make use of the clustering features available for the 7000 series; all tests were performed on a single storage processor.

6.5 Test Results 3: User load simulated full clone desktops

A limited test was added to the original linked-clone test scenario. In this test the same (user-simulated) Windows XP images were deployed, but this time in full-clone instead of linked-clone mode. Only 150 full-clone desktops were deployed, in order to observe the behavior of the ARC and L2ARC in this scenario.

Figure 6.5.1: Filling of the ARC and the L2ARC during the deployment of 150 full-clone vDesktops. At the far left some test vDesktops (full clones) are deployed. At (1) the first batch of 25 vDesktops is deployed; at (2) the rest of the vDesktops are deployed.

See figure 6.5.1. After the start of the test (at the far left) some full-clone vDesktops are deployed. At marker (1), the first batch of 25 vDesktops is deployed. Shortly after marker (1), the ARC and L2ARC sizes settle around 25 [GB]. This indicates that the vDesktops perform around 1 [GB] of reads per vDesktop. Because the ARC is not yet saturated, the L2ARC remains (almost) empty at this stage. Beyond marker (2) the rest of the vDesktops are deployed, quickly filling the ARC and the L2ARC.

[Figure: 7410 CPU load, average percent (extrapolated): 7410 average CPU load [%] plotted against the number of deployed full-clone vDesktops.]

Figure 6.5.2: Extrapolation of the 7000 storage CPU usage in the full-clone scenario.

Figure 6.5.2 contains an extrapolation of the CPU load on the 7000 storage. The extrapolation is extensive and leaves room for error. However, it appears to be largely in line with the CPU figures measured in the linked-cloning setup (see figure 6.2.21).

More interesting is the number of IOPS performed in the full-clone scenario compared to the linked-clone scenario:

[Figure: NFS IOPS performed, full versus linked clones (0 to 100 vDesktops): NFS IOPS [sec-1] for the full-clone and linked-clone deployments plotted against the number of vDesktops deployed.]


Figure 6.5.3: NFS IOPS comparison of full-clone versus linked-clone vDesktop deployment.

Figure 6.5.3 shows that linked-clone vDesktops use more IOPS than full-clone vDesktops. This effect
can be explained by the way linked-clones function within VMware ESX. This behavior is much like
VMware snapshotting (see reference [1] for more details).

Another thing that can be seen in figure 6.5.3 is that the deployment of linked clones itself appears to have a greater IOPS impact than a full-clone deployment. Take note, though, that this is not the case: in figure 6.5.3 the time scale has been adjusted in order to fit both graphs into a single figure. In fact, the speed of deployment is very different:

- Linked clones deploy at a rate of 100 vDesktops per hour;
- Full clones deploy at a rate of 10 vDesktops per hour.

This factor of 10 is not visible in the graph, but in reality the full-clone vDesktop deployment uses far more IOPS. This makes sense: in the full-clone scenario every vDesktop gets its boot drive fully copied, while linked clones only incur some IOPS overhead when creating an empty linked clone (plus some other administrative actions on disk).

7 Conclusions

From all tests conducted, some very interesting conclusions can be drawn. First of all, the fact that the environment managed to run over 1300 vDesktops without performance issues is in itself a great accomplishment. Looking deeper into the measured values yields a wealth of information on best practices for configuring Sun Unified Storage 7000 in combination with VMware View linked clones.

7.1 Conclusions on scaling VMware ESX

It proves to be very important to scale your VMware ESX nodes correctly. There are basically three things to keep in mind:

1) The number of CPU cores inside an ESX server;
2) The amount of memory inside an ESX server;
3) The number of vCPUs/VMs the ESX server can deliver.

The first and second are obvious ones: put in too much CPU power and you run out of memory, leaving the CPU cores underutilized; put in too much memory and you run out of CPU power, leaving memory underutilized.

The third is sometimes forgotten, but proved to be the culprit in our test setup: if you use ESX servers with too much CPU and memory, you will run out of vCPUs and VMs will simply not start anymore beyond a certain point. Luckily, with each release of VMware ESX these limits get higher:

- ESX 3.0.1 / ESX 3.5: 128 vCPUs, 128 VMs;
- ESX 3.5 U2 and later: 192 vCPUs, 170 VMs;
- vSphere (ESX 4): 512 vCPUs, 320 VMs.

As shown, using vSphere as a basis will allow for much bigger ESX servers.
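
The sketch below is a minimal illustration of this balancing exercise. The host configuration passed in is illustrative; the per-vDesktop figures of 200 [MHz] and 323 [MB] are the test-derived constants from chapter 8, and one vCPU per vDesktop is assumed.

def vdesktops_per_host(cores, core_mhz, mem_gb, vcpu_limit, vm_limit,
                       mhz_per_vd=200, mb_per_vd=323, ceiling=0.85):
    """Return the binding constraint and the resulting vDesktop count per ESX host."""
    limits = {
        "CPU":    int(cores * core_mhz * ceiling / mhz_per_vd),
        "memory": int(mem_gb * 1000 * ceiling / mb_per_vd),
        "vCPUs":  vcpu_limit,   # assuming one vCPU per vDesktop
        "VMs":    vm_limit,
    }
    binding = min(limits, key=limits.get)
    return binding, limits[binding]

# Illustrative host: 4x six-core 2.6 GHz, 128 GB RAM, on the ESX 3.5 U2 limits.
print(vdesktops_per_host(cores=24, core_mhz=2600, mem_gb=128,
                         vcpu_limit=192, vm_limit=170))   # -> ('VMs', 170)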

7.2 Conclusions on scaling networking between ESX and Unified Storage

The network did not really prove to be an issue during the performed tests. Bandwidth usage to any single
ESX node proved to be far within the capabilities of a single GbE connection.

Bandwidth to the storage also remained far within the designed bandwidth. The two 10 GbE connections
remained underutilized throughout all tests.

Load balancing was forcibly introduced into the test environment, but it could have been skipped without issue in this case. If the 7000 storage had been driven over 1 GbE links, load balancing would be recommended.

7.3 Conclusions on scaling Unified Storage CPU power

During the tests, the CPUs inside the 7000 storage were not saturated. At 1300 user-simulated vDesktops, the load on the two CPUs reached 85%, which should be considered near its maximum performance. In order to scale up further, four CPUs (or six-core CPUs) would be required.

The HyperTransport bus between the two CPUs showed quite large values (in the order of 1.7 [GByte.sec-1]). This was partially due to the fact that the two 10 GbE ports both reside on a single PCIe card. This caused all traffic to be forcibly sent through the HyperTransport bus of CPU0, instead of being load-balanced between CPU0 and CPU1:

[Diagram: Sun 7410 Unified Storage HyperTransport bus technology: two CPUs, each with its own memory bus and a HyperTransport-to-I/O bridge with a PCIe bus, interconnected via HT-buses; the dual-port 10 GbE card used in the tests sits on a single PCIe bus.]

Figure 7.3.1: Sun 7410 Unified Storage HyperTransport bus architecture. In the performance tests a single PCIe card with dual 10GbE ports was used. Best practice would be to use two single-port 10GbE PCIe cards, each on a different HT-bus (shown in semi-transparency).

7.4 Conclusions on scaling Unified Storage Memory and L2ARC

In order to obtain the best performance from the 7000 Unified Storage, read cache is very important. This type of storage was even primarily selected for its large read cache capabilities. Using linked clones, all replicas (the full-cloned "mother" of the linked clones) were directly committed to read cache. For each linked clone deployed, a small additional amount of read cache was required. The amount of read cache should be carefully matched to the projected number of vDesktops on the storage device. See chapter 8 for more details.

The L2ARC presents itself in the form of one or more read-optimized Solid State Drives (SSDs). It can be seen as a direct extension of internal memory. It is important to note, though, that L2ARC storage is roughly a factor of 1000 slower than memory. Best practice is to match internal memory to the required read cache. If (and only if) the read cache requirements exceed the physical maximum amount of internal memory, L2ARC can be used to reach the required amount.
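
A minimal read-cache sizing sketch along these lines follows. The 155 [MB] per linked-clone vDesktop and the 8 [GB] replica size are values from chapters 8 and 7.6, the 75% effectiveness factor mirrors Appendix 2, and the helper itself is illustrative rather than a definitive sizing tool.

def read_cache_needed_gb(vdesktops, replicas=2, replica_gb=8.0,
                         mb_per_vdesktop=155, effectiveness=0.75):
    """Estimate the ARC(+L2ARC) needed: the replicas plus a per-vDesktop
    increment, divided by an effectiveness factor to leave some headroom."""
    cache_gb = replicas * replica_gb + vdesktops * mb_per_vdesktop / 1000.0
    return cache_gb / effectiveness

need_gb = read_cache_needed_gb(2000)
ram_max_gb = 256   # largest RAM configuration of the 7000 series at the time of writing
l2arc_gb = max(0.0, need_gb - ram_max_gb)
print(f"~{need_gb:.0f} GB of read cache, of which ~{l2arc_gb:.0f} GB would land on L2ARC SSDs")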

7.5 Conclusions on scaling Unified Storage LogZilla SSDs

The LogZilla devices enable the 7000 Unified Storage to quickly acknowledge synchronous writes to the storage device. The metadata of the write is stored in the LogZilla, while the write itself is held in the ARC. Finally, the write is committed to disk from the ARC and the metadata in the LogZilla is flagged as handled.

In normal operation, the LogZilla is never read from. Only on recovery (such as after a power loss) is the LogZilla read, and the ZFS file system is returned to a consistent state using the metadata in the LogZilla that was not yet flagged as handled.

In effect, the addition of a LogZilla greatly improves (lowers) the write latency of the storage device. The performed tests show that the LogZilla really helps to keep write latency to a minimum.

Each LogZilla is able to perform 10.000 [WOPS]. When the projected number of writes is larger than 10.000 [WOPS], adding LogZillas could help. Note that adding a second LogZilla will not help performance-wise: the Unified Storage will place both LogZillas in a RAID1 configuration. This RAID1 configuration does help to safeguard performance: a LogZilla may fail and the storage device will keep working normally. With a single LogZilla, the synchronous writes have to be written to disk directly if the LogZilla fails, clipping performance.

Using four LogZilla devices could increase the number of WOPS that can be performed by a single storage device (the Unified Storage will put four LogZillas into a RAID10 configuration, effectively being able to perform 20.000 [WOPS]).
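
A minimal sketch of this scaling behavior, assuming the mirrored-pair layout described above (10.000 [WOPS] per device is the figure quoted in the text; the helper itself is illustrative):

WOPS_PER_LOGZILLA = 10000

def logzilla_wops(devices):
    """LogZillas are grouped into mirrored pairs; only whole pairs add write
    throughput, so two devices still deliver a single device's WOPS."""
    pairs = max(1, devices // 2)   # a single device is used unmirrored
    return pairs * WOPS_PER_LOGZILLA

for n in (1, 2, 4):
    print(n, "LogZilla(s) ->", logzilla_wops(n), "[WOPS]")
# 1 -> 10000, 2 -> 10000 (RAID1), 4 -> 20000 (RAID10)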

7.6 Conclusions on scaling Unified Storage SATA storage

Throughout the tests, the number of SATA ROPS and WOPS remained consistently low. This is due to the way ZFS works: ZFS aims to read most (if not all) data from the ARC and L2ARC, and it combines and reorders small random writes into very large blocks, thus converting the small random writes into large sequential writes. This way of working minimizes ROPS and results in only a few large sequential writes to SATA (also see reference [3]).

Given that a single SATA spindle can handle around 70 [IOPS], only very few SATA spindles would be required to drive even the larger linked-clone loads.

Capacity is also not really an issue when using linked-clone technology. Each replica is as large as the full vDesktop boot drive, but each linked clone remains much smaller than this (2 [GByte] is considered a realistic value here).

Furthermore, it is advisable not to fill up the ZFS file system completely, since this would impact the effectiveness of ZFS (it would no longer be able to concatenate all small writes into a single big sequential write).

If we use 1000 vDesktops in our test setup as a target number, the ZFS capacity required would be:

Two replicas:            2 x 8 [GB]      =   16 [GB]
1000 linked clones:      1000 x 2 [GB]   = 2000 [GB]
vDesktop swap files:     1000 x 0,5 [GB] =  500 [GB] +
                                           ---------
Total space required:                      2516 [GB]
20% ZFS reservation:     20% x 2516 [GB] =  500 [GB] +
                                           ---------
Total storage required:                    3016 [GB]

In a ZFS mirror configuration (much like RAID10), a total of 6 disks would just about suffice capacity-wise. Since the environment performs almost exclusively writes to SATA, this setup would deliver 3 x 70 = 210 [WOPS] at 100% writes (which is very close to reality with ZFS), while according to the graph in figure 6.2.24 only around 44 WOPS are required at 1000 vDesktops. This should be more than enough to run 1000 vDesktops.

The writes to SATA are very large though, mostly 1 [MByte.write-1]. At 44 WOPS, this means a write throughput of 44 [MByte.sec-1]. This number is also well within the capabilities of a set of 6 SATA drives in a RAID10 configuration.
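
A minimal sketch of the capacity and write-load reasoning above follows. The replica, linked-clone and swap sizes, the 20% ZFS reservation and the 44 WOPS per 1000 vDesktops are the values used in this chapter; the helper is illustrative, and small rounding differences with the worked example are expected.

def sata_sizing(vdesktops, replicas=2, replica_gb=8, clone_gb=2, swap_gb=0.5,
                zfs_reserve=0.20, wops_per_1000=44):
    """Estimate usable ZFS capacity and SATA write load for a linked-clone pool."""
    data_gb = replicas * replica_gb + vdesktops * (clone_gb + swap_gb)
    total_gb = data_gb * (1 + zfs_reserve)          # keep 20% free for ZFS
    wops_needed = wops_per_1000 * vdesktops / 1000  # large sequential writes only
    return total_gb, wops_needed

cap_gb, wops = sata_sizing(1000)
print(f"~{cap_gb:.0f} GB usable capacity, ~{wops:.0f} SATA WOPS required")
# -> roughly 3000 GB and ~44 WOPS: six mirrored SATA drives (3 x 70 = 210 WOPS) suffice.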

8 Conclusions in numbers

Looking at the conclusions in chapter 7 and the measured values throughout the tests, the following constants can be derived. These constants can act as a guideline when scaling linked-clone based designs onto the 7000 series of storage. In Appendix 2, these calculated results are shown in a table.

ESX CPU SIZING

The vDesktops used in these tests consumed 40% CPU load per 1000 vDesktops (see figure 6.2.11). A 40% CPU load over 24 cores running at 2600 [MHz] means a load of (2600 x 24) x 40% = 24960 [MHz] per node. The 1000 vDesktops were divided over 8 nodes, so each node was running 1000/8 = 125 vDesktops. Each vDesktop was therefore using 24960 [MHz] / 125 = 200 [MHz] on average (which is quite low compared to best-practice standards of around 400-500 [MHz] per vDesktop).

ESX MEMORY SIZING

The vDesktops used in these tests consumed 63% memory load per 1000 vDesktops (see figure 6.2.11). An ESX node had 64 [GB] of memory, so 125 vDesktops were using (64 [GB] x 63%) / 125 = 323 [MB] on average per vDesktop. This number is related to the amount of memory each vDesktop had (512 [MB]). If the vDesktops had been given 1024 [MB], this number would increase.
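
A minimal sketch reproducing these two per-vDesktop constants (node configuration and measured loads as quoted above; 1 [GB] is treated as 1000 [MB], as the report's own arithmetic does):

# ESX node as used in the tests: 24 cores at 2600 MHz, 64 GB RAM, 125 vDesktops.
cores, core_mhz, mem_gb, vds_per_node = 24, 2600, 64, 125
cpu_load, mem_load = 0.40, 0.63   # measured at 1000 vDesktops across 8 nodes

mhz_per_vd = cores * core_mhz * cpu_load / vds_per_node
mb_per_vd = mem_gb * 1000 * mem_load / vds_per_node
print(f"~{mhz_per_vd:.0f} MHz and ~{mb_per_vd:.0f} MB per vDesktop")   # ~200 MHz, ~323 MB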

STORAGE NETWORK SIZING

From figure 6.2.15 we see that running 1000 vDesktops results in 29 [MB.sec-1] of writes and 13 [MB.sec-1] of reads. A single GbE interface saturates around 60 [MB.sec-1], which is set as 100%. Bandwidth usage is limited to 40%, which appears to be a value at which little to no packets get dropped (thus minimizing latency). So 40% of 60 [MB.sec-1] = 24 [MB.sec-1] is the maximum allowable transfer rate per link.

Using the worst-case value of 29 [MB.sec-1] (writes used more bandwidth in our environment), two GbE links would be required as a minimum on the storage side. When eight ESX nodes are present, each ESX node performs at only 1/8th of these figures, so a single GbE link per node would suffice (although at least two links are recommended in order to have a failover link). Load balancing is not required when running 1000 vDesktops in this scenario.

Dividing down we get: 29 [MB.sec-1] / 1000 = 0,029 [MB.sec-1], or 29 [KB.sec-1] per vDesktop.
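
A minimal sketch of this link-count estimate (the 60 [MB.sec-1] usable per GbE link and the 40% utilization cap are the values quoted above; the helper is illustrative):

import math

def gbe_links_needed(write_mb_s, read_mb_s, link_mb_s=60, max_util=0.40):
    """Number of GbE links needed, sized on the worst-case traffic direction."""
    worst_case = max(write_mb_s, read_mb_s)
    return math.ceil(worst_case / (link_mb_s * max_util))

print(gbe_links_needed(29, 13))          # storage side at 1000 vDesktops -> 2
print(gbe_links_needed(29 / 8, 13 / 8))  # a single ESX node (1/8th of the traffic) -> 1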

STORAGE CPU SIZING

At 1000 vDesktops, the CPU utilization of the 7410 (with two quad-core CPUs) was 45%. The busiest HT bus measured 1.18 [GByte.sec-1] (see figure 6.2.13).

A 45% CPU load on two quad-core CPUs running at 2.30 [GHz] means a total of ( ( 2300 [MHz] x 8 ) x 45% ) / 1000 = 8,28 [MHz] per vDesktop.

The HT bus is loaded with 1180 [MByte.sec-1] / 1000 = 1.18 [MByte.sec-1] per vDesktop.

STORAGE ARC/L2ARC MEMORY SIZING

As seen during the tests, the read cache fills almost linearly as more vDesktops are deployed. We can calculate how much memory on average is required as read cache per vDesktop:

See figure 6.2.14. At 1000 vDesktops, the combined (ARC + L2ARC) usage was ( 60 [GB] + 95 [GB] ) = 155 [GB]. Per vDesktop this amounts to 155 [GB] / 1000 = 0,155 [GB], or 155 [MB] per vDesktop.

STORAGE LOGZILLA SIZING

The LogZilla accelerates synchronous writes to the 7000 storage device. At 1000 vDesktops, 1250 [WOPS] were performed (see figure 6.2.25). So each vDesktop is responsible for 1250 / 1000 = 1.25 [WOPS].

The specified maximum for a single LogZilla device is 10.000 [WOPS], so a single LogZilla would be able to support an impressive 10.000 / 1.25 = 8000 vDesktops!

STORAGE SATA PERFORMANCE SIZING

The 7000 storage uses SATA storage as its backend. At 1000 vDesktops, 44 [WOPS] to SATA are performed. This means that every vDesktop is responsible for 44 / 1000 = 0,044 [WOPS].

In terms of throughput, the 7000 storage needed a sustained stream of 30 [MB.sec-1] to NFS (and thus to the SATA disks) at 1000 vDesktops (see figure 6.2.15). This means that each vDesktop requires 30 [KB.sec-1] of write throughput to SATA.
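
Pulling the chapter together, the sketch below turns these per-vDesktop constants into the per-unit maxima listed in Appendix 2. The capacities and the 85%/75%/40% headroom factors are those stated in this chapter and in the appendix; the structure of the helper is illustrative, and minor rounding differences with the appendix figures are expected.

# (cost per vDesktop, capacity of one unit, headroom factor)
constants = {
    "ESX CPU [MHz]":           (200,  24 * 2600,  0.85),
    "ESX memory [MB]":         (323,  128 * 1000, 0.85),
    "NFS per GbE link [KB/s]": (29,   60 * 1000,  0.40),
    "UFS CPU [MHz]":           (8.28, 8 * 2300,   0.85),
    "UFS HT bus [MB/s]":       (1.18, 2500,       1.00),
    "UFS read cache [MB]":     (155,  256 * 1000, 0.75),
    "UFS LogZilla [WOPS]":     (1.25, 10000,      1.00),
}

for name, (per_vd, capacity, headroom) in constants.items():
    print(f"{name:25s} -> {int(capacity * headroom / per_vd):5d} vDesktops per unit")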

9 References

1) Performance impact when using VMware snapshots: http://www.vmdamentals.com/?p=332
2) A quick guide to the ZFS Intent Log (ZIL): http://blogs.sun.com/realneel/entry/the_zfs_intent_log
3) What is ZFS?: http://hub.opensolaris.org/bin/view/Community+Group+zfs/whatis
4) Sun 7000 Unified Storage: http://www.sun.com/storage/disk_systems/unified_storage/index.jsp

Appendix 1: Hardware test setup

[Diagram: Hardware test setup. A Gbit management switch connects the Service Console (1x Gbit), VM network (1x Gbit) and VMotion (1x Gbit) links of the VMware ESX server hosting the supporting VMs and of the VMware ESX vDesktop servers. Each ESX server also has dual Gbit IP-storage links into a 24-port Gbit switch, which uplinks over 2x 10 GbE to the SUN Amberroad (7000 Unified Storage) appliance.]

Appendix 2: Table of derived constants

The table below contains some measured constants. New VMware View designs that use 7000 storage as a backend could be scaled using these values. Take note that these are measured values from one unique, simulated case. For example, the vDesktop load simulation was kept very basic, only simulating a representative disk I/O load. Choices like these might affect real-world numbers.

Derived constant                   | Value per vDesktop | Unit          | vDesktop limit | Limit unit   | Maximization specification
-----------------------------------+--------------------+---------------+----------------+--------------+--------------------------------------------------------
ESX CPU power required (test)      | 200                | [MHz]         | 265            | per ESX      | 4x six-core CPUs (2.6 [GHz]) @85%
ESX CPU power required (general**) | 500                | [MHz]         | 106            | per ESX      | 4x six-core CPUs (2.6 [GHz]) @85%
ESX memory required (test)         | 323                | [MByte]       | 336            | per ESX      | 128 [GB] @85%; 512 MB vDesktops
ESX memory required (general**)    | 500                | [MByte]       | 108            | per ESX      | 64 [GB] @85%; 1024 MB vDesktops
ESX vCPUs maximum                  | -                  | -             | 170            | per ESX      | ESX 3.5 U2 and up
ESX vCPUs maximum                  | -                  | -             | 320            | per ESX      | vSphere 4.0
NFS networking                     | 29                 | [KB.sec-1]    | 827            | per Gbit     | Dual GbE balanced @40% load
UFS CPU required                   | 8,28               | [MHz]         | 1910           | per UFS      | 2x quad-core AMD 2356 CPUs @85%
UFS HT-bus required                | 1.18               | [MByte.sec-1] | 2118           | per UFS      | 2,5 [GByte.sec-1] assumed as maximum
UFS memory (ARC+L2ARC)             | 155                | [MByte]       | 1238           | per UFS      | 256 GByte memory; 75% effectiveness*)
UFS LogZilla                       | 1,25               | [WOPS]        | 8000           | per UFS      | 1x LogZilla or 2x in a RAID1 config.
UFS SATA required (WOPS)           | 0,044              | [WOPS]        | 7950           | per UFS      | 7410 with one mirrored tray (11 disks)
#SATA required (IOPS - WOPS)***)   | 0,00125            | [#SATApairs]  | 1590           | per SATApair | SATA mirror; 70 IOPS per spindle
#SATA required (throughput)***)    | 30                 | [KB.sec-1]    | 1333           | per SATApair | SATA mirror; 40 MB/s worst-case throughput
#SATA required (capacity)***)      | 2                  | [GByte]       | 2000           | per UFS      | 7410 with one mirrored tray (11 disks) @80% capacity use
*) As vDesktops run different workloads, it is assumed they need to read more data (from cache if possible). This is why the 75% effectiveness is used as a guideline;
**) These values are considered general values seen in real-life VMware View deployments;
***) When determining the number of SATA spindles required, it is important to note that this number is determined from both a performance and a capacity perspective. Always take the worst-case value of the IOPS, throughput and capacity requirements.
