Statistically Rigorous Java Performance Evaluation

Andy Georges - Dries Buytaert - Lieven Eeckhout


ELIS Department - Ghent University
Belgium

OOPSLA - October 23 2007 - Montréal


Whetting the appetite with db

[Figure: execution time (s) of db, best run out of 30, for CopyMS, GenCopy, GenMS, MarkSweep, and SemiSpace; y-axis from 9.0 to 12.5]

Best run out of 30:
we conclude there is no difference between CopyMS and GenCopy
we conclude SemiSpace is 10.8% faster than GenCopy
[Figure: execution time (s) of db for CopyMS, GenCopy, GenMS, MarkSweep, and SemiSpace; the best run out of 30 shown next to the 95% confidence interval for 30 runs; y-axis from 9.0 to 12.5]

95% confidence interval for 30 runs:
CopyMS and GenCopy differ significantly, while GenCopy and SemiSpace do not
Non-determinism is the problem

(Note: move and scale the figure, and explain what causes the non-determinism.)

30 measurements with Jikes RVM using GenMS

[Figure: normalized execution time (y-axis, 0.95 to 1.05) of the 30 measurements per benchmark; SPECjvm98: compress, jess, db, javac, mpegaudio, mtrt, jack; DaCapo: antlr, bloat, fop, hsqldb, jython, luindex, pmd]
Contributions
Pitfall associated with current prevalent
data analysis techniques
We advocate a statistically rigorous Java
performance evaluation
Define approaches for both start-up and
steady-state and provide a tool to
automate this:
http://www.elis.ugent.be/JavaStats
Current situation

Surveyed 50 papers from 2000-2006 from OOPSLA, VEE, CGO, PLDI, ISMM
A lot of variation in experimental setup
Various data analysis approaches
Often not clearly described

Experimental Design
Number of VM invocations
Number of benchmark iterations
Fixed heap size?
With JIT (re)compilation?
VM
Number of heap sizes
Number of hardware platforms
Application input size

Data Analysis
First or single run
Best of n runs
Second best of n runs
Worst of n runs
Mean of n runs
Median of n runs
Confidence interval of n runs

Defining the terms
Start-up execution: a single benchmark
iteration in one VM invocation, e.g.,
java -cp dacapo.jar Harness -n 1 antlr
Steady-state execution: multiple
benchmark iterations in one VM
invocation, e.g.,
java -cp dacapo.jar Harness -n 30 antlr

Dealing with start-up
Execute p+1 invocations, each time with a single iteration, drop the first invocation
Invocations: 0 (dropped), 1, ..., p

$$\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i$$

$$\bar{x} \pm t_{1-\alpha/2;\,p-1} \frac{s}{\sqrt{p}}$$
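The start-up procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the JavaStats tool: the function name, the sample timings, and the choice to pass the Student t critical value in by hand (2.776 is the tabulated value of t for a 95% interval with 4 degrees of freedom) are all assumptions of the sketch.

```python
import math
import statistics


def startup_confidence_interval(times, t_critical):
    """Confidence interval for start-up performance.

    times: one execution time per VM invocation, invocations 0..p.
    t_critical: the value t_{1-alpha/2; p-1}, looked up by the caller
                (an assumption of this sketch, to stay within the stdlib).
    """
    kept = times[1:]  # drop the first invocation (invocation 0)
    p = len(kept)
    if p < 2:
        raise ValueError("need at least two retained invocations")
    mean = statistics.mean(kept)
    s = statistics.stdev(kept)  # sample standard deviation, divisor p-1
    half_width = t_critical * s / math.sqrt(p)
    return mean - half_width, mean + half_width


# Hypothetical timings in seconds for p+1 = 6 invocations.
timings = [12.9, 10.0, 10.2, 9.8, 10.1, 9.9]
low, high = startup_confidence_interval(timings, t_critical=2.776)
print(f"95% confidence interval: [{low:.3f}, {high:.3f}]")
```

Taking the critical value as a parameter keeps the sketch free of third-party dependencies; a statistics library would normally supply it.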
Dealing with steady-state
Execute p+1 invocations (numbered 0 to p)
At most q iterations per invocation, retain at least k iterations
Iterations of invocation i: 1 ... s_i-k ... s_i ... q
The coefficient of variation (CoV) over the iterations s_i-k to s_i determines s_i; the mean over that window is

$$\bar{x}_i = \frac{1}{k} \sum_{j=s_i-k}^{s_i} x_{ij}$$
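A minimal sketch of the per-invocation steady-state selection follows. Several details are assumptions of the sketch rather than statements from the slides: the CoV threshold of 0.02, the use of a window of exactly k iterations ending at s_i, the fallback to the window with the lowest CoV when the threshold is never reached within q iterations, and all function names.

```python
import statistics


def cov(values):
    """Coefficient of variation: sample standard deviation over mean."""
    return statistics.stdev(values) / statistics.mean(values)


def steady_state_mean(iteration_times, k, cov_threshold=0.02):
    """Return (s_i, mean) for one VM invocation.

    iteration_times: times of iterations 1..q of this invocation.
    k: window size, at least 2.
    s_i is reported 1-based, matching the slide's numbering.
    """
    q = len(iteration_times)
    if k < 2 or q < k:
        raise ValueError("need k >= 2 and at least k iterations")
    best_end, best_cov = None, float("inf")
    for end in range(k, q + 1):
        window = iteration_times[end - k:end]  # exactly k iterations
        c = cov(window)
        if c < cov_threshold:
            return end, statistics.mean(window)
        if c < best_cov:
            best_end, best_cov = end, c
    # Assumed fallback: threshold never reached, use the steadiest window.
    window = iteration_times[best_end - k:best_end]
    return best_end, statistics.mean(window)


# Hypothetical iteration times in seconds for one invocation.
times = [20.0, 15.0, 12.0, 10.0, 10.1, 9.9, 10.0]
s_i, mean_i = steady_state_mean(times, k=3)
print(s_i, round(mean_i, 3))
```

Running this once per invocation yields the p mean values that feed the confidence interval across invocations.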
Dealing with steady-state
We have p mean values, one per invocation: $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p$
We compute the confidence interval for their mean

$$\frac{1}{p} \sum_{i=1}^{p} \bar{x}_i \pm t_{1-\alpha/2;\,p-1} \frac{s}{\sqrt{p}}$$

Making comparisons

Two alternatives: Student t-test


Multiple alternatives: ANOVA
Important: use a post-hoc test to obtain
simultaneous confidence intervals!
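For the two-alternative case, the comparison can be sketched as below. Note the substitution: this uses Welch's unequal-variance form of the t statistic rather than the pooled-variance Student t-test, and the caller supplies the critical value from a t table for the returned degrees of freedom. The function names and sample data are hypothetical. The ANOVA and post-hoc steps for multiple alternatives are not covered here.

```python
import math
import statistics


def welch_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two independent samples."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("need at least two measurements per alternative")
    var_a = statistics.variance(a) / n_a  # squared standard error of mean a
    var_b = statistics.variance(b) / n_b  # squared standard error of mean b
    if var_a + var_b == 0:
        raise ValueError("both samples have zero variance")
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


def differs_significantly(a, b, t_critical):
    """True if |t| exceeds the caller-supplied critical value."""
    t, _ = welch_t_statistic(a, b)
    return abs(t) > t_critical


# Hypothetical execution times in seconds for two alternatives.
gc_a = [10.0, 10.2, 9.8, 10.1, 9.9]
gc_b = [11.0, 11.2, 10.8, 11.1, 10.9]
t, df = welch_t_statistic(gc_a, gc_b)
print(round(t, 2), round(df, 1))
# 2.306 is the tabulated t value for a 95% interval with 8 degrees of freedom.
print(differs_significantly(gc_a, gc_b, t_critical=2.306))
```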

Comparison categories

Statistical approach                       | Prevalent methodology:     | Prevalent methodology:
(ANOVA + confidence intervals)             | performance difference < θ | performance difference ≥ θ
overlapping intervals                      | indicative                 | misleading
non-overlapping intervals, same order      | misleading but correct     | correct
non-overlapping intervals, different order | misleading and incorrect   | incorrect
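The categories can be expressed as a small decision function. This is a sketch under assumptions: the performance difference is taken relative to the slower of the two prevalent numbers, the statistical order is read from the interval midpoints, and a tie in the prevalent numbers is treated as not contradicting the statistical order. The function name and inputs are hypothetical.

```python
def classify_comparison(ci_a, ci_b, prevalent_a, prevalent_b, theta):
    """Classify one comparison between alternatives A and B.

    ci_a, ci_b: (low, high) confidence intervals for execution time.
    prevalent_a, prevalent_b: single numbers from a prevalent method,
        for example the best of n runs.
    theta: decision threshold as a fraction, e.g. 0.01 for 1%.
    """
    overlapping = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
    # Assumption: difference relative to the slower alternative.
    difference = abs(prevalent_a - prevalent_b) / max(prevalent_a, prevalent_b)
    below_threshold = difference < theta

    if overlapping:
        return "indicative" if below_threshold else "misleading"

    statistical_order = sum(ci_a) / 2 - sum(ci_b) / 2
    prevalent_order = prevalent_a - prevalent_b
    same_order = statistical_order * prevalent_order >= 0
    if same_order:
        return "misleading but correct" if below_threshold else "correct"
    return "misleading and incorrect" if below_threshold else "incorrect"


# Hypothetical numbers: intervals overlap, yet best-of-n differs by over 1%.
print(classify_comparison((9.8, 10.2), (10.0, 10.4), 9.9, 10.5, 0.01))
```

Counting the returned labels over all pairwise comparisons gives the kind of percentage breakdown reported in the evaluation slides.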
Experimental setup
AMD Athlon XP @ 2.1GHz, 2 GiB RAM,
Linux 2.6.18, idle
Jikes RVM svn head of February 12 2007
5 GCs from MMTk: CopyMS, GenCopy,
GenMS, MarkSweep, SemiSpace
SPECjvm98 and DaCapo
Heap sizes from the minimal heap size up to 6 times the minimum
10 invocations on Athlon, decision threshold θ=1%

[Figure: percentage of all comparisons (y-axis, 0 to 25) per prevalent data analysis method (best, second best, worst, mean, median), shown separately for SPECjvm98 and DaCapo; legend: indicative, misleading, misleading but correct, misleading and incorrect, incorrect]
Athlon GenMS vs. other GCs, 10 invocations, θ=1%

[Figure: percentage of all comparisons (y-axis, 0 to 30) per prevalent data analysis method (best, second best, worst, mean, median), shown separately for SPECjvm98 and DaCapo; legend: indicative, misleading, misleading but correct, misleading and incorrect, incorrect]
Raise the θ threshold?
javac, best-of-30, θ ∈ [0;3]

[Figure: percentage of all comparisons (y-axis, 0 to 70) versus θ-threshold (x-axis, 0 to 3); legend: indicative, misleading, misleading but correct, misleading and incorrect, incorrect]
SPECjvm98 steady-state, 3/5 invocations, 10/30 iterations

[Figure: percentage of all comparisons (y-axis, 0 to 50) for best median of, best of, and second best of (3,10), (3,30), (5,10), and (5,30); legend: indicative, misleading, misleading but correct, misleading and incorrect, incorrect]
Confidence interval width for jess start-up execution

[Figure: confidence interval width as percentage of the mean (y-axis) versus number of VM invocations (x-axis, 2 to 30), one curve each for CopyMS, GenCopy, GenMS, MarkSweep, and SemiSpace]
Conclusion

Methodology can (and should) be applied to other managed runtime systems
Prevalent approaches can lead to
incorrect or misleading conclusions
One should use a rigorous statistical
approach to deal with non-determinism

Correction to the paper

Table 1 on page XXX contains two errors


Reference [22] uses a confidence
interval
Reference [3] uses a mean
performance number

Questions?