Statistically Rigorous Java Performance Evaluation

Andy Georges - Dries Buytaert - Lieven Eeckhout

ELIS Department - Ghent University
Belgium
OOPSLA - October 23 2007 - Montréal

Whetting the appetite with db

[Figure: execution time (s) of db under five garbage collectors (MarkSweep, GenCopy, CopyMS, SemiSpace, GenMS), shown once as the best run out of 30 and once as the 95% confidence interval for 30 runs; y-axis from 9.0 to 12.5 s.]

Best run out of 30: we conclude there is no difference between CopyMS and GenCopy, and we conclude SemiSpace is 10.8% faster than GenCopy.

95% confidence interval for 30 runs: CopyMS and GenCopy differ significantly, while GenCopy and SemiSpace do not.

Non-determinism is the problem

[Speaker note: move and scale the figure, and explain what causes the non-determinism.]

[Figure: 30 measurements with Jikes RVM using GenMS. Normalized execution time (y-axis from 0.95 to 1.05) for javac, mpegaudio, luindex, compress, jess, db, mtrt, jack, antlr, bloat, fop, hsqldb, jython, pmd.]

Contributions
Pitfall associated with current prevalent
data analysis techniques
We advocate a statistically rigorous Java
performance evaluation
Define approaches for both start-up and
steady-state and provide a tool to
automate this:
http://www.elis.ugent.be/JavaStats

Current situation
Surveyed 50 papers from 2000-2006, from OOPSLA, VEE, CGO, PLDI, ISMM
A lot of variation in experimental setup
Various data analysis approaches
Often not clearly described

Experimental design
Number of VM invocations
Number of benchmark iterations
Fixed heap size?
With JIT (re)compilation?
Number of VM heap sizes
Number of hardware platforms
Application input size

Data analysis
First or single run
Worst of n runs
Best of n runs
Second best of n runs
Mean of n runs
Median of n runs
Confidence interval of n runs

Defining the terms
Start-up execution: a single benchmark
iteration in one VM invocation, e.g.,
java -cp dacapo.jar Harness -n 1 antlr
Steady-state execution: multiple
benchmark iterations in one VM
invocation, e.g.,
java -cp dacapo.jar Harness -n 30 antlr

Dealing with start-up

Execute p+1 invocations (numbered 0 to p), each time with a single iteration, and drop the first invocation.

Mean of the p retained measurements x_1, ..., x_p:

$$\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i$$

Confidence interval for the mean, where s is the sample standard deviation of the measurements:

$$\bar{x} \pm t_{1-\alpha/2;\,p-1} \frac{s}{\sqrt{p}}$$
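As a minimal sketch of the start-up computation, the following Python drops invocation 0 and computes the mean and its 95% confidence interval. The function names, the hard-coded t-table (α = 0.05 only), and the 1.96 fallback above 30 degrees of freedom are assumptions of this sketch; they are not taken from the slides or from the JavaStats tool.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t distribution for
# 1..30 degrees of freedom, copied from a standard t-table.
T_95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
        2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
        2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042]


def t_critical_95(df):
    """t_{0.975; df}; approximated by the normal quantile 1.96 above df = 30."""
    if df < 1:
        raise ValueError("need at least one degree of freedom")
    return T_95[df - 1] if df <= 30 else 1.96


def startup_confidence_interval(times):
    """95% confidence interval for the mean start-up execution time.

    `times` holds p+1 execution times, one per VM invocation; the first
    invocation (index 0) is dropped before any statistics are computed.
    """
    xs = times[1:]
    p = len(xs)
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)  # sample standard deviation (divides by p - 1)
    half_width = t_critical_95(p - 1) * s / math.sqrt(p)
    return mean - half_width, mean + half_width
```

For example, `startup_confidence_interval([12.9, 10.0, 10.2, 9.8, 10.1, 9.9])` ignores the 12.9 s first invocation and returns an interval of roughly (9.80, 10.20) around the mean of 10.0 s.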
Dealing with steady-state

Execute p+1 invocations (numbered 0 to p). Each invocation runs at most q iterations, and we retain at least k iterations.

Within invocation i, s_i is the iteration at which the coefficient of variation (CoV) of the iterations s_i-k to s_i falls below a threshold δ.

The mean over those retained iterations is the steady-state measurement for invocation i:

$$\bar{x}_i = \frac{1}{k} \sum_{j=s_i-k}^{s_i} x_{ij}$$
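The steady-state detection for one invocation can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the window is taken as exactly k consecutive iterations ending at s_i, and the fallback to the lowest-CoV window when no window reaches δ is a choice made for this sketch.

```python
import statistics


def coefficient_of_variation(xs):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(xs) / statistics.fmean(xs)


def steady_state_mean(iteration_times, k, delta):
    """Mean of the first window of k consecutive iterations with CoV < delta.

    `iteration_times` holds the (at most q) iteration times of one VM
    invocation. Returns (s_i, mean), where s_i is the 1-based index of the
    last iteration in the chosen window. If no window has CoV < delta, the
    window with the lowest CoV is used (an assumption of this sketch).
    """
    q = len(iteration_times)
    if k < 2 or q < k:
        raise ValueError("need 2 <= k <= number of iterations")
    best = None
    for end in range(k, q + 1):
        window = iteration_times[end - k:end]
        cov = coefficient_of_variation(window)
        if cov < delta:
            return end, statistics.fmean(window)
        if best is None or cov < best[0]:
            best = (cov, end, statistics.fmean(window))
    return best[1], best[2]
```

Calling this once per invocation yields the p per-invocation means that feed the confidence interval computation.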
We have p mean values, one per invocation: $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p$.

We compute the confidence interval for their mean, where s is the sample standard deviation of the p mean values:

$$\frac{1}{p} \sum_{i=1}^{p} \bar{x}_i \pm t_{1-\alpha/2;\,p-1} \frac{s}{\sqrt{p}}$$

Making comparisons
Two alternatives: Student t-test

Multiple alternatives: ANOVA
Important: use a post-hoc test to obtain
simultaneous confidence intervals!
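For the two-alternative case, one way to apply a t-test is to build a confidence interval for the difference of the two means and check whether it contains zero. The sketch below uses Welch's unequal-variance interval; that choice, the function names, and the 95%-only t-table are assumptions of this example rather than something the slides prescribe. The multi-alternative case (ANOVA with a post-hoc test) is not covered here.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t distribution for
# 1..30 degrees of freedom, copied from a standard t-table.
T_95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
        2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
        2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042]


def t_critical_95(df):
    """t_{0.975; df}; approximated by the normal quantile 1.96 above df = 30."""
    return T_95[df - 1] if df <= 30 else 1.96


def difference_interval(a, b):
    """95% confidence interval for mean(a) - mean(b) (Welch's t interval)."""
    n1, n2 = len(a), len(b)
    v1 = statistics.variance(a) / n1
    v2 = statistics.variance(b) / n2
    diff = statistics.fmean(a) - statistics.fmean(b)
    se = math.sqrt(v1 + v2)
    if se == 0.0:
        return diff, diff  # no variation at all in either sample
    # Welch-Satterthwaite degrees of freedom, rounded down (conservative).
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    half_width = t_critical_95(max(1, math.floor(df))) * se
    return diff - half_width, diff + half_width


def differ_significantly(a, b):
    """True if the 95% interval for the difference of means excludes zero."""
    lo, hi = difference_interval(a, b)
    return not (lo <= 0.0 <= hi)
```

Here `a` and `b` would each hold the per-invocation measurements of one alternative, for instance two garbage collectors on the same benchmark and heap size.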

Comparison categories

Statistical approach                        | Prevalent methodology:     | Prevalent methodology:
(ANOVA + confidence intervals)              | performance difference < θ | performance difference ≥ θ
overlapping intervals                       | indicative                 | misleading
non-overlapping intervals, same order       | misleading but correct     | correct
non-overlapping intervals, different order  | misleading and incorrect   | incorrect

[Figures: one sketch per category, contrasting the performance difference and θ under the prevalent methodology with the confidence intervals under the rigorous approach.]
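The comparison categories can be expressed as a small decision function. This is a sketch under assumptions: the function name and parameters are hypothetical, the relative difference is normalized by the smaller of the two prevalent numbers, and the confidence intervals are passed in ready-made (the talk obtains them from ANOVA with a post-hoc test).

```python
def classify(prevalent_a, prevalent_b, mean_a, mean_b, ci_a, ci_b, theta):
    """Category of one A-versus-B comparison.

    prevalent_a, prevalent_b: numbers reported by the prevalent method
        (for example best-of-n execution times).
    mean_a, mean_b: means from the statistically rigorous approach.
    ci_a, ci_b: (low, high) confidence intervals for those means.
    theta: decision threshold as a fraction, e.g. 0.01 for 1%.
    """
    # Prevalent methodology: is the reported difference below the threshold?
    rel_diff = abs(prevalent_a - prevalent_b) / min(prevalent_a, prevalent_b)
    below_threshold = rel_diff < theta

    # Statistical approach: do the two confidence intervals overlap?
    overlapping = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
    if overlapping:
        return "indicative" if below_threshold else "misleading"

    # Non-overlapping: does the prevalent ranking match the ranking of the means?
    same_order = (prevalent_a < prevalent_b) == (mean_a < mean_b)
    if same_order:
        return "misleading but correct" if below_threshold else "correct"
    return "misleading and incorrect" if below_threshold else "incorrect"
```

For instance, a best-of-n difference of 0.5% with θ = 1% and overlapping intervals is "indicative": both methods agree that no difference can be claimed.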
Experimental setup
AMD Athlon XP @ 2.1GHz, 2 GiB RAM,
Linux 2.6.18, idle
Jikes RVM svn head of February 12 2007
5 GCs from MMTk: CopyMS, GenCopy,
GenMS, MarkSweep, SemiSpace
SPECjvm98 and DaCapo
Heap sizes from the minimum up to 6 times the minimum

[Figure: 10 invocations on Athlon, decision threshold θ=1%. Percentage of all comparisons (y-axis from 0 to 25) that are indicative, misleading, misleading but correct, misleading and incorrect, or incorrect, for SPECjvm98 and DaCapo, per prevalent method (mean, median, best, second best, worst).]

[Figure: Athlon, GenMS vs. other GCs, 10 invocations, θ=1%. Percentage of all comparisons (y-axis from 0 to 30) per category, for SPECjvm98 and DaCapo, per prevalent method (mean, median, best, second best, worst).]

Raise the θ threshold?
[Figure: javac, best-of-30, θ in [0;3]. x-axis: θ-threshold from 0 to 3; y-axis from 0 to 70.]

[Figure: SPECjvm98 steady-state, 3/5 invocations, 10/30 iterations. Comparisons per category (indicative, misleading, misleading but correct, misleading and incorrect, incorrect), axis from 0 to 50, for best, second best, and best median of (3,10), (3,30), (5,10), and (5,30).]

[Figure: Confidence interval width for jess start-up execution, for CopyMS, GenCopy, GenMS, MarkSweep, and SemiSpace. x-axis: number of VM invocations (2 to 30); y-axis: width as percentage of the mean.]

Conclusion
Methodology can (and should) be applied to other managed runtime systems
Prevalent approaches can lead to
incorrect or misleading conclusions
One should use a rigorous statistical
approach to deal with non-determinism

Correction to the paper
Table 1 on page XXX contains two errors

Reference [22] uses a confidence
interval
Reference [3] uses a mean
performance number

?
