
Bootstrap methods and their application

Cambridge Series on Statistical and Probabilistic Mathematics


Editorial Board:
R. Gill (Utrecht)
B.D. Ripley (Oxford)
S. Ross (Berkeley)
M. Stein (Chicago)
D. Williams (Bath)
This series of high-quality upper-division textbooks and expository monographs covers all areas of stochastic applicable mathematics. The topics range from pure and applied statistics to probability theory, operations research, mathematical programming, and optimization. The books contain clear presentations of new developments in the field and also of the state of the art in classical methods. While emphasizing rigorous treatment of theoretical methods, the books contain important applications and discussions of new techniques made possible by advances in computational methods.

Bootstrap methods and their application

A. C. Davison
Professor of Statistics, Department of Mathematics,
Swiss Federal Institute of Technology, Lausanne

D. V. Hinkley
Professor of Statistics, Department of Statistics and Applied Probability,
University of California, Santa Barbara

CAMBRIDGE UNIVERSITY PRESS

PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge CB2 1RP, United Kingdom

CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, United Kingdom
40 West 20th Street, New York, NY 10011-4211, USA
10 Stamford Road, Oakleigh, Melbourne 3166, Australia

© Cambridge University Press 1997

This book is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 1997
Printed in the United States of America
Typeset in TeX Monotype Times

A catalogue record for this book is available from the British Library

Library of Congress Cataloguing in Publication data
Davison, A. C. (Anthony Christopher)
Bootstrap methods and their application / A.C. Davison, D.V. Hinkley.
p. cm.
Includes bibliographical references and index.
ISBN 0 521 57391 2 (hb). ISBN 0 521 57471 4 (pb)
1. Bootstrap (Statistics) I. Hinkley, D. V. II. Title.
QA276.8.D38 1997
519.5'44 dc21 96-30064 CIP

ISBN 0 521 57391 2 hardback
ISBN 0 521 57471 4 paperback

Contents

Preface  ix

1  Introduction  1

2  The Basic Bootstraps  11
   2.1  Introduction  11
   2.2  Parametric Simulation  15
   2.3  Nonparametric Simulation  22
   2.4  Simple Confidence Intervals  27
   2.5  Reducing Error  31
   2.6  Statistical Issues  37
   2.7  Nonparametric Approximations for Variance and Bias  45
   2.8  Subsampling Methods  55
   2.9  Bibliographic Notes  59
   2.10 Problems  60
   2.11 Practicals  66

3  Further Ideas  70
   3.1  Introduction  70
   3.2  Several Samples  71
   3.3  Semiparametric Models  77
   3.4  Smooth Estimates of F  79
   3.5  Censoring  82
   3.6  Missing Data  88
   3.7  Finite Population Sampling  92
   3.8  Hierarchical Data  100
   3.9  Bootstrapping the Bootstrap  103
   3.10 Bootstrap Diagnostics  113
   3.11 Choice of Estimator from the Data  120
   3.12 Bibliographic Notes  123
   3.13 Problems  126
   3.14 Practicals  131

4  Tests  136
   4.1  Introduction  136
   4.2  Resampling for Parametric Tests  140
   4.3  Nonparametric Permutation Tests  156
   4.4  Nonparametric Bootstrap Tests  161
   4.5  Adjusted P-values  175
   4.6  Estimating Properties of Tests  180
   4.7  Bibliographic Notes  183
   4.8  Problems  184
   4.9  Practicals  187

5  Confidence Intervals  191
   5.1  Introduction  191
   5.2  Basic Confidence Limit Methods  193
   5.3  Percentile Methods  202
   5.4  Theoretical Comparison of Methods  211
   5.5  Inversion of Significance Tests  220
   5.6  Double Bootstrap Methods  223
   5.7  Empirical Comparison of Bootstrap Methods  230
   5.8  Multiparameter Methods  231
   5.9  Conditional Confidence Regions  238
   5.10 Prediction  243
   5.11 Bibliographic Notes  246
   5.12 Problems  247
   5.13 Practicals  251

6  Linear Regression  256
   6.1  Introduction  256
   6.2  Least Squares Linear Regression  257
   6.3  Multiple Linear Regression  273
   6.4  Aggregate Prediction Error and Variable Selection  290
   6.5  Robust Regression  307
   6.6  Bibliographic Notes  315
   6.7  Problems  316
   6.8  Practicals  321

7  Further Topics in Regression  326
   7.1  Introduction  326
   7.2  Generalized Linear Models  327
   7.3  Survival Data  346
   7.4  Other Nonlinear Models  353
   7.5  Misclassification Error  358
   7.6  Nonparametric Regression  362
   7.7  Bibliographic Notes  374
   7.8  Problems  376
   7.9  Practicals  378

8  Complex Dependence  385
   8.1  Introduction  385
   8.2  Time Series  385
   8.3  Point Processes  415
   8.4  Bibliographic Notes  426
   8.5  Problems  428
   8.6  Practicals  432

9  Improved Calculation  437
   9.1  Introduction  437
   9.2  Balanced Bootstraps  438
   9.3  Control Methods  446
   9.4  Importance Resampling  450
   9.5  Saddlepoint Approximation  466
   9.6  Bibliographic Notes  485
   9.7  Problems  487
   9.8  Practicals  494

10 Semiparametric Likelihood Inference  499
   10.1 Likelihood  499
   10.2 Multinomial-Based Likelihoods  500
   10.3 Bootstrap Likelihood  507
   10.4 Likelihood Based on Confidence Sets  509
   10.5 Bayesian Bootstraps  512
   10.6 Bibliographic Notes  514
   10.7 Problems  516
   10.8 Practicals  519

11 Computer Implementation  522
   11.1 Introduction  522
   11.2 Basic Bootstraps  525
   11.3 Further Ideas  531
   11.4 Tests  534
   11.5 Confidence Intervals  536
   11.6 Linear Regression  537
   11.7 Further Topics in Regression  540
   11.8 Time Series  543
   11.9 Improved Simulation  545
   11.10 Semiparametric Likelihoods  549

Appendix A. Cumulant Calculations  551
Bibliography  555
Name Index  568
Example index  572
Subject index  575

Preface

The publication in 1979 of Bradley Efron's first article on bootstrap methods was a major event in Statistics, at once synthesizing some of the earlier resampling ideas and establishing a new framework for simulation-based statistical analysis. The idea of replacing complicated and often inaccurate approximations to biases, variances, and other measures of uncertainty by computer simulations caught the imagination of both theoretical researchers and users of statistical methods. Theoreticians sharpened their pencils and set about establishing mathematical conditions under which the idea could work. Once they had overcome their initial skepticism, applied workers sat down at their terminals and began to amass empirical evidence that the bootstrap often did work better than traditional methods. The early trickle of papers quickly became a torrent, with new additions to the literature appearing every month, and it was hard to see when would be a good moment to try to chart the waters. Then the organizers of COMPSTAT '92 invited us to present a course on the topic, and shortly afterwards we began to write this book.

We decided to try to write a balanced account of resampling methods, to include basic aspects of the theory which underpinned the methods, and to show as many applications as we could in order to illustrate the full potential of the methods, warts and all. We quickly realized that in order for us and others to understand and use the bootstrap, we would need suitable software, and producing it led us further towards a practically oriented treatment. Our view was cemented by two further developments: the appearance of two excellent books, one by Peter Hall on the asymptotic theory and the other on basic methods by Bradley Efron and Robert Tibshirani; and the chance to give further courses that included practicals. Our experience has been that hands-on computing is essential in coming to grips with resampling ideas, so we have included practicals in this book, as well as more theoretical problems.

As the book expanded, we realized that a fully comprehensive treatment was beyond us, and that certain topics could be given only a cursory treatment because too little is known about them. So it is that the reader will find only brief accounts of bootstrap methods for hierarchical data, missing data problems, model selection, robust estimation, nonparametric regression, and complex data. But we do try to point the more ambitious reader in the right direction.
No project of this size is produced in a vacuum. The majority of work on the book was completed while we were at the University of Oxford, and we are very grateful to colleagues and students there, who have helped shape our work in various ways. The experience of trying to teach these methods in Oxford and elsewhere, at the Université de Toulouse I, Université de Neuchâtel, Università degli Studi di Padova, Queensland University of Technology, Universidade de São Paulo, and University of Umeå, has been vital, and we are grateful to participants in these courses for prompting us to think more deeply about the material. Readers will be grateful to these people also, for unwittingly debugging some of the problems and practicals. We are also grateful to the organizers of COMPSTAT '92 and CLAPEM V for inviting us to give short courses on our work.

While writing this book we have asked many people for access to data, copies of their programs, papers or reprints; some have then been rewarded by our bombarding them with questions, to which the answers have invariably been courteous and informative. We cannot name all those who have helped in this way, but D. R. Brillinger, P. Hall, M. P. Jones, B. D. Ripley, H. O'R. Sternberg and G. A. Young have been especially generous. S. Hutchinson and B. D. Ripley have helped considerably with computing matters.

We are grateful to the mostly anonymous reviewers who commented on an early draft of the book, and to R. Gatto and G. A. Young, who later read various parts in detail. At Cambridge University Press, A. Woollatt and D. Tranah have helped greatly in producing the final version, and their patience has been commendable.

We are particularly indebted to two people. V. Ventura read large portions of the book, and helped with various aspects of the computation. A. J. Canty has turned our version of the bootstrap library functions into reliable working code, checked the book for mistakes, and has made numerous suggestions that have improved it enormously. Both of them have contributed greatly, though of course we take responsibility for any errors that remain in the book. We hope that readers will tell us about them, and we will do our best to correct any future versions of the book; see its WWW page, at URL

http://dmawww.epfl.ch/davison.mosaic/BMA/

The book could not have been completed without grants from the UK Engineering and Physical Sciences Research Council, which in addition to providing funding for equipment and research assistantships, supported the work of A. C. Davison through the award of an Advanced Research Fellowship. We also acknowledge support from the US National Science Foundation.

We must also mention the Friday evening sustenance provided at the Eagle and Child, the Lamb and Flag, and the Royal Oak. The projects of many authors have flourished in these amiable establishments.

Finally, we thank our families, friends and colleagues for their patience while this project absorbed our time and energy. Particular thanks are due to Claire Cullen Davison for keeping the Davison family going during the writing of this book.

A. C. Davison and D. V. Hinkley
Lausanne and Santa Barbara
May 1997

1
Introduction

The explicit recognition of uncertainty is central to the statistical sciences. Notions such as prior information, probability models, likelihood, standard errors and confidence limits are all intended to formalize uncertainty and thereby make allowance for it. In simple situations, the uncertainty of an estimate may be gauged by analytical calculation based on an assumed probability model for the available data. But in more complicated problems this approach can be tedious and difficult, and its results are potentially misleading if inappropriate assumptions or simplifications have been made.

For illustration, consider Table 1.1, which is taken from a larger tabulation (Table 7.4) of the numbers of AIDS reports in England and Wales from mid-1983 to the end of 1992. Reports are cross-classified by diagnosis period and length of reporting delay, in three-month intervals. A blank in the table corresponds to an unknown (as yet unreported) entry. The problem was to predict the states of the epidemic in 1991 and 1992, which depend heavily on the values missing at the bottom right of the table.
The data support the assumption that the reporting delay does not depend on the diagnosis period. In this case a simple model is that the number of reports in row j and column k of the table has a Poisson distribution with mean μ_jk = exp(α_j + β_k). If all the cells of the table are regarded as independent, then the total number of unreported diagnoses in period j has a Poisson distribution with mean

$$\sum_k \mu_{jk} = \exp(\alpha_j) \sum_k \exp(\beta_k),$$

where the sum is over columns with blanks in row j. The eventual total of as yet unreported diagnoses from period j can be estimated by replacing α_j and β_k by estimates derived from the incomplete table, and thence we obtain the predicted total for period j.
Table 1.1  Numbers of AIDS reports in England and Wales to the end of 1992 (De Angelis and Gilks, 1994), extracted from Table 7.4. A † indicates a reporting delay less than one month.

Diagnosis           Reporting delay interval (quarters)                Total reports
period                                                                 to end of 1992
Year  Quarter   0†     1    2    3    4    5    6  ···  ≥14
1988     1      31    80   16    9    3    2    8  ···    6               174
         2      26    99   27    9    8   11    3  ···    3               211
         3      31    95   35   13   18    4    6  ···    3               224
         4      36    77   20   26   11    3    8  ···    2               205
1989     1      32    92   32   10   12   19   12  ···    2               224
         2      15    92   14   27   22   21   12  ···    1               219
         3      34   104   29   31   18    8    6                         253
         4      38   101   34   18    9   15    6                         233
1990     1      31   124   47   24   11   15    8                         281
         2      32   132   36   10    9    7    6                         245
         3      49   107   51   17   15    8    9                         260
         4      44   153   41   16   11    6    5                         285
1991     1      41   137   29   33    7   11    6                         271
         2      56   124   39   14   12    7   10                         263
         3      53   175   35   17   13   11                              306
         4      63   135   24   23   12                                   258
1992     1      71   161   48   25                                        310
         2      95   178   39                                             318
         3      76   181                                                  273
         4      67                                                        133

Such predictions are shown by the solid line in Figure 1.1, together with the observed total reports to the end of 1992. How good are these predictions?

It would be tedious but possible to put pen to paper and estimate the prediction uncertainty through calculations based on the Poisson model. But in fact the data are much more variable than that model would suggest, and by failing to take this into account we would believe that the predictions are more accurate than they really are. Furthermore, a better approach would be to use a semiparametric model to smooth out the evident variability of the increase in diagnoses from quarter to quarter; the corresponding prediction is the dotted line in Figure 1.1. Analytical calculations for this model would be very unpleasant, and a more flexible line of attack is needed. While more than one approach is possible, the one that we shall develop based on computer simulation is both flexible and straightforward.

[Figure 1.1  Predicted quarterly diagnoses from a parametric model (solid) and a semiparametric model (dots) fitted to the AIDS data, together with the actual totals to the end of 1992 (+). Horizontal axis: time.]

Purpose of the Book

Our central goal is to describe how the computer can be harnessed to obtain reliable standard errors, confidence intervals, and other measures of uncertainty for a wide range of problems. The key idea is to resample from the original data, either directly or via a fitted model, to create replicate datasets, from

which the variability of the quantities of interest can be assessed without long-winded and error-prone analytical calculation. Because this approach involves repeating the original data analysis procedure with many replicate sets of data, these are sometimes called computer-intensive methods. Another name for them is bootstrap methods, because to use the data to generate more data seems analogous to a trick used by the fictional Baron Munchausen, who when he found himself at the bottom of a lake got out by pulling himself up by his bootstraps. In the simplest nonparametric problems we do literally sample from the data, and a common initial reaction is that this is a fraud. In fact it is not. It turns out that a wide range of statistical problems can be tackled this way, liberating the investigator from the need to oversimplify complex problems. The approach can also be applied in simple problems, to check the adequacy of standard measures of uncertainty, to relax assumptions, and to give quick approximate solutions. An example of this is random sampling to estimate the permutation distribution of a nonparametric test statistic.

It is of course true that in many applications we can be fairly confident in a particular parametric model and the standard analysis based on that model. Even so, it can still be helpful to see what can be inferred without particular parametric model assumptions. This is in the spirit of robustness of validity of the statistical analysis performed. Nonparametric bootstrap analysis allows us to do this.

Table 1.2  Service hours between failures of the air-conditioning equipment in a Boeing 720 jet aircraft (Proschan, 1963).

3   5   7   18   43   85   91   98   100   130   230   487

Despite its scope and usefulness, resampling must be carefully applied. Unless certain basic ideas are understood, it is all too easy to produce a solution to the wrong problem, or a bad solution to the right one. Bootstrap methods are intended to help avoid tedious calculations based on questionable assumptions, and this they do. But they cannot replace clear critical thought about the problem, appropriate design of the investigation and data analysis, and incisive presentation of conclusions.

In this book we describe how resampling methods can be used, and evaluate their performance, in a wide range of contexts. Our focus is on the methods and their practical application rather than on the underlying theory, accounts of which are available elsewhere. This book is intended to be useful to the many investigators who want to know how and when the methods can safely be applied, and how to tell when things have gone wrong. The mathematical level of the book reflects this: we have aimed for a clear account of the key ideas without an overload of technical detail.

Examples

Bootstrap methods can be applied both when there is a well-defined probability model for data and when there is not. In our initial development of the methods we shall make frequent use of two simple examples, one of each type, to illustrate the main points.

Example 1.1 (Air-conditioning data)  Table 1.2 gives n = 12 times between failures of air-conditioning equipment, for which we wish to estimate the underlying mean or its reciprocal, the failure rate. A simple model for this problem is that the times are sampled from an exponential distribution. The dotted line in the left panel of Figure 1.2 is the cumulative distribution function (CDF)

$$\hat F(y) = 1 - \exp(-y/\hat\mu), \qquad y > 0,$$

for the fitted exponential distribution with mean μ̂ set equal to the sample average, ȳ = 108.083. The solid line on the same plot is the nonparametric equivalent, the empirical distribution function (EDF) for the data, which places equal probabilities n⁻¹ = 0.083 at each sample value. Comparison of the two curves suggests that the exponential model fits reasonably well.

[Figure 1.2  Summary displays for the air-conditioning data. The left panel shows the EDF for the data, F̂ (solid), and the CDF of a fitted exponential distribution (dots), against failure time y. The right panel shows a plot of the ordered failure times against quantiles of the standard exponential distribution, with the fitted exponential model shown as the dotted line.]

An alternative view of this is shown in the right panel of the figure, which is an exponential

Q-Q plot, a plot of the ordered data values y_(j) against the standard exponential quantiles

$$-\log\left(1 - \frac{j}{n+1}\right), \qquad j = 1, \ldots, n.$$
Although these plots suggest reasonable agreement with the exponential model, the sample is rather too small to have much confidence in this. In the data source the more general gamma model with mean μ and index κ is used; its density is

$$f_{\mu,\kappa}(y) = \frac{1}{\Gamma(\kappa)} \left(\frac{\kappa}{\mu}\right)^{\kappa} y^{\kappa-1} \exp(-\kappa y/\mu), \qquad y > 0, \quad \mu, \kappa > 0. \qquad (1.1)$$

For our sample the estimated index is κ̂ = 0.71, which does not differ significantly (P = 0.29) from the value κ = 1 that corresponds to the exponential model. Our reason for mentioning this will become apparent in Chapter 2.
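As a concrete sketch in R (not the book's code), the exponential Q-Q plot and the gamma index for these data can be obtained as follows; the last lines solve the gamma likelihood equation log κ − ψ(κ) = log ȳ − n⁻¹Σ log y_j and should reproduce κ̂ ≈ 0.71:

    y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)   # Table 1.2
    n <- length(y); ybar <- mean(y)                           # ybar = 108.083

    ## Ordered failure times against exponential quantiles -log(1 - j/(n+1));
    ## the fitted exponential model is the dotted line with slope ybar.
    plot(-log(1 - (1:n)/(n + 1)), sort(y),
         xlab = "Quantiles of standard exponential", ylab = "Ordered failure times")
    abline(0, ybar, lty = 2)

    ## Maximum likelihood estimate of the gamma index kappa.
    kappa <- uniroot(function(k) log(k) - digamma(k) - log(ybar) + mean(log(y)),
                     c(0.01, 100))$root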
Basic properties of the estimator T = Ȳ for μ are easy to obtain theoretically under the exponential model. For example, it is easy to show that T is unbiased and has variance μ²/n. Approximate confidence intervals for μ can be calculated using these properties in conjunction with a normal approximation for the distribution of T, although this does not work very well: we can tell this because Ȳ/μ has an exact gamma distribution, which leads to exact confidence limits. Things are more complicated under the more general gamma model, because the index κ is only estimated, and so in a traditional approach we would use approximations, such as a normal approximation for the distribution of T, or a chi-squared approximation for the log likelihood ratio statistic. The parametric simulation methods of Section 2.2 can be used alongside these approximations, to diagnose problems with them, or to replace them entirely. ■

Example 1.2 (City population data)  Table 1.3 reports n = 49 data pairs, each corresponding to a city in the United States of America, the pair being the 1920 and 1930 populations of the city, which we denote by u and x. The data are plotted in Figure 1.3. Interest here is in the ratio of means, because this would enable us to estimate the total population of the USA in 1930 from the 1920 figure. If the cities form a random sample, with (U, X) denoting the pair of population values for a randomly selected city, then the total 1930 population is the product of the total 1920 population and the ratio of expectations θ = E(X)/E(U). This ratio is the parameter of interest.

In this case there is no obvious parametric model for the joint distribution of (U, X), so it is natural to estimate θ by its empirical analog, T = X̄/Ū, the ratio of sample averages. We are then concerned with the uncertainty in T. If we had a plausible parametric model (for example, that the pair (U, X) has a bivariate lognormal distribution), then theoretical calculations like those in Example 1.1 would lead to bias and variance estimates for use in a normal approximation, which in turn would provide approximate confidence intervals for θ. Without such a model we must use nonparametric analysis. It is still possible to estimate the bias and variance of T, as we shall see, and this makes normal approximation still feasible, as well as more complex approaches to setting confidence intervals. ■

Example 1.1 is special in that an exact distribution is available for the statistic of interest and can be used to calculate confidence limits, at least under the exponential model. But for parametric models in general this will not be true. In Section 2.2 we shall show how to use parametric simulation to obtain approximate distributions, either by approximating moments for use in normal approximations, or, when these are inaccurate, directly.

In Example 1.2 we make no assumptions about the form of the data distribution. But still, as we shall show in Section 2.3, simulation can be used to obtain properties of T, even to approximate its distribution. Much of Chapter 2 is devoted to this.

Table 1.3  Populations in thousands of n = 49 large US cities in 1920 (u) and in 1930 (x) (Cochran, 1977, p. 152).

   u    x       u    x       u    x
 138  143      76   80      67   67
  93  104     381  464     120  115
  61   69     387  459     172  183
 179  260      78  106      66   86
  48   75      60   57      46   65
  37   63     507  634     121  113
  29   50      50   64      44   58
  23   48      77   89      64   63
  30  111      64   77      56  142
   2   50      40   60      40   64
  38   52     136  139     116  130
  46   53     243  291      87  105
  71   79     256  288      43   61
  25   57      94   85      43   50
 298  317      36   46     161  232
  74   93      45   53      36   54
  50   58

[Figure 1.3  Populations of 49 large United States cities (in 1000s) in 1920 and 1930: 1930 population plotted against 1920 population.]

Layout of the Book

Chapter 2 describes the properties of resampling methods for use with single samples from parametric and nonparametric models, discusses practical matters such as the numbers of replicate datasets required, and outlines delta methods for variance approximation based on different forms of jackknife. It

also contains a basic discussion of confidence intervals and of the ideas that underlie bootstrap methods.

Chapter 3 outlines how the basic ideas are extended to several samples, semiparametric and smooth models, simple cases where data have hierarchical structure or are sampled from a finite population, and to situations where data are incomplete because censored or missing. It goes on to discuss how the simulation output itself may be used to detect problems, so-called bootstrap diagnostics, and how it may be useful to bootstrap the bootstrap.

In Chapter 4 we review the basic principles of significance testing, and then describe Monte Carlo tests, including those using Markov chain simulation, and parametric bootstrap tests. This is followed by discussion of nonparametric permutation tests, and the more general methods of semi- and nonparametric bootstrap tests. A double bootstrap method is detailed for improved approximation of P-values.

Confidence intervals are the subject of Chapter 5. After outlining basic ideas, we describe how to construct simple confidence intervals based on simulations, and then go on to more complex methods, such as the studentized bootstrap, percentile methods, the double bootstrap and test inversion. The main methods are compared empirically in Section 5.7, then there are brief accounts of confidence regions for multivariate parameters, and of prediction intervals.

The three subsequent chapters deal with more complex problems. Chapter 6 describes how the basic resampling methods may be applied in linear regression problems, including tests for coefficients, prediction analysis, and variable selection. Chapter 7 deals with more complex regression situations: generalized linear models, other nonlinear models, semi- and nonparametric regression, survival analysis, and classification error. Chapter 8 details methods appropriate for time series, spatial data, and point processes.

Chapter 9 describes how variance reduction techniques such as balanced simulation, control variates, and importance sampling can be adapted to yield improved simulations, with the aim of reducing the amount of simulation needed for an answer of given accuracy. It also shows how saddlepoint methods can sometimes be used to avoid simulation entirely.

Chapter 10 describes various semiparametric versions of the likelihood function, the ideas underlying which are closely related to resampling methods. It also briefly outlines a Bayesian version of the bootstrap.

Chapters 2-10 contain problems intended to reinforce the reader's understanding of both methods and theory, and in some cases problems develop topics that could not be included in the text. Some of these demand a knowledge of moments and cumulants, basic facts about which are sketched in the Appendix.

The book also contains practicals that apply resampling routines written in the S language to sets of data. The practicals are intended to reinforce the ideas in each chapter, to supplement the more theoretical problems, and to give examples on which readers can base analyses of their own data. It would be possible to give different sorts of course based on this book. One would be a theoretical course based on the problems and another an applied course based on the practicals; we prefer to blend the two.

Although a library of routines for use with the statistical package S-Plus is bundled with it, most of the book can be read without reference to particular software packages. Apart from the practicals, the exception to this is Chapter 11, which is a short introduction to the main resampling routines, arranged roughly in the order with which the corresponding ideas appear in earlier chapters. Readers intending to use the bundled routines will find it useful to work through the relevant sections of Chapter 11 before attempting the practicals.

Notation

Although we believe that our notation is largely standard, there are not enough letters in the English and Greek alphabets for us to be entirely consistent. Greek letters such as θ, β and ν generally denote parameters or other unknowns, while α is used for error rates in connection with significance tests and confidence sets. English letters X, Y, Z, and so forth are used for random variables, which take values x, y, z. Thus the estimator T has observed value t, which may be an estimate of the unknown parameter θ. The letter V is used for a variance estimate, and the letter p for a probability, except for regression models, where p is the number of covariates. Script letters are used to denote sets.

Probability, expectation, variance and covariance are denoted Pr(·), E(·), var(·) and cov(·, ·), while the joint cumulant of Y₁, Y₁Y₂ and Y₃ is denoted cum(Y₁, Y₁Y₂, Y₃). We use I{A} to denote the indicator random variable, which takes values one if the event A is true and zero otherwise. A related function is the Heaviside function

$$H(u) = \begin{cases} 0, & u < 0, \\ 1, & u \ge 0. \end{cases}$$

We use #{A} to denote the number of elements in the set A, and #{A_r} for the number of events A_r that occur in a sequence A₁, A₂, …. We use ≐ to mean 'is approximately equal to', usually corresponding to asymptotic equivalence as sample sizes tend to infinity, ∼ to mean 'is distributed as' or 'is distributed according to', ∼̇ to mean 'is distributed approximately as', and ∼ with 'iid' to mean 'is a sample of independent identically distributed random variables from', while ≡ has its usual meaning of 'is equivalent to'.

The data values in a sample of size n are typically denoted by y₁, …, yₙ, the observed values of the random variables Y₁, …, Yₙ; their average is ȳ = n⁻¹ Σ yⱼ.

We mostly reserve Z for random variables that are standard normal, at least approximately, and use Q for random variables with other (approximately) known distributions. As usual, N(μ, σ²) represents the normal distribution with mean μ and variance σ², while z_α is often the α quantile of the standard normal distribution, whose cumulative distribution function is Φ(·).

The letter R is reserved for the number of replicate simulations. Simulated copies of a statistic T are denoted T*_r, r = 1, …, R, whose ordered values are T*_(1) ≤ ··· ≤ T*_(R). Expectation, variance and probability calculated with respect to the simulation distribution are written Pr*(·), E*(·) and var*(·).

Where possible we avoid boldface type, and rely on the context to make it plain when we are dealing with vectors or matrices; aᵀ denotes the matrix transpose of a vector or matrix a.

We use PDF, CDF, and EDF as shorthand for probability density function, cumulative distribution function, and empirical distribution function. The letters F and G are used for CDFs, and f and g are generally used for the corresponding PDFs. An exception to this is that f*_j denotes the frequency with which y_j appears in the rth resample.

We use MLE as shorthand for maximum likelihood estimate or sometimes maximum likelihood estimation.

The end of each example is marked ■, and the end of each algorithm is marked •.

2
The Basic Bootstraps

2.1 Introduction

In this chapter we discuss techniques which are applicable to a single, homogeneous sample of data, denoted by y₁, …, yₙ. The sample values are thought of as the outcomes of independent and identically distributed random variables Y₁, …, Yₙ whose probability density function (PDF) and cumulative distribution function (CDF) we shall denote by f and F, respectively. The sample is to be used to make inferences about a population characteristic, generically denoted by θ, using a statistic T whose value in the sample is t. We assume for the moment that the choice of T has been made and that it is an estimate for θ, which we take to be a scalar.

Our attention is focused on questions concerning the probability distribution of T. For example, what are its bias, its standard error, or its quantiles? What are likely values under a certain null hypothesis of interest? How do we calculate confidence limits for θ using T?

There are two situations to distinguish, the parametric and the nonparametric. When there is a particular mathematical model, with adjustable constants or parameters ψ that fully determine f, such a model is called parametric and statistical methods based on this model are parametric methods. In this case the parameter of interest θ is a component of or function of ψ. When no such mathematical model is used, the statistical analysis is nonparametric, and uses only the fact that the random variables Yⱼ are independent and identically distributed. Even if there is a plausible parametric model, a nonparametric analysis can still be useful to assess the robustness of conclusions drawn from a parametric analysis.

An important role is played in nonparametric analysis by the empirical distribution, which puts equal probabilities n⁻¹ at each sample value yⱼ. The corresponding estimate of F is the empirical distribution function (EDF) F̂,

which is defined as the sample proportion

$$\hat F(y) = \frac{\#\{y_j \le y\}}{n}.$$

More formally,

$$\hat F(y) = \frac{1}{n} \sum_{j=1}^{n} H(y - y_j), \qquad (2.1)$$

where H(u) is the unit step function which jumps from 0 to 1 at u = 0. (Here #{A} means the number of times the event A occurs.) Notice that the values of the EDF are fixed (0, 1/n, 2/n, …, 1), so the EDF is equivalent to its points of increase, the ordered values y_(1) ≤ ··· ≤ y_(n) of the data. An example of the EDF was shown in the left panel of Figure 1.2.

When there are repeat values in the sample, as would often occur with discrete data, the EDF assigns probabilities proportional to the sample frequencies at each distinct observed value y. The formal definition (2.1) still applies.

The EDF plays the role of fitted model when no mathematical form is assumed for F, analogous to a parametric CDF with parameters replaced by their estimates.
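In R the EDF is available as a step function through ecdf; for the air-conditioning data of Table 1.2, for example (a minimal sketch):

    y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
    Fhat <- ecdf(y)      # places probability 1/n on each observation
    Fhat(100)            # proportion of sample values <= 100, here 9/12
    plot(Fhat)           # the step function in the left panel of Figure 1.2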

2.1.1 Statistical functions

Many simple statistics can be thought of in terms of properties of the EDF. For example, the sample average ȳ = n⁻¹ Σ yⱼ is the mean of the EDF; see Example 2.1 below. More generally, the statistic of interest t will be a symmetric function of y₁, …, yₙ, meaning that t is unaffected by reordering the data. This implies that t depends only on the ordered values y_(1) ≤ ··· ≤ y_(n), or equivalently on the EDF F̂. Often this can be expressed simply as t = t(F̂), where t(·) is a statistical function, essentially just a mathematical expression of the algorithm for computing t from F̂. Such a statistical function is of central importance in the nonparametric case because it also defines the parameter of interest θ through the 'algorithm' θ = t(F). This corresponds to the qualitative idea that θ is a characteristic of the population described by F. Simple examples of such functions are the mean and variance of Y, which are respectively defined as

$$t(F) = \int y \, dF(y), \qquad t(F) = \int y^2 \, dF(y) - \left\{ \int y \, dF(y) \right\}^2. \qquad (2.2)$$

The same definition of θ applies in parametric problems, although then θ is more usually defined explicitly as one of the model parameters ψ.

The relationship between the estimate t and F̂ can usually be expressed as t = t(F̂), corresponding to the relation θ = t(F) between the characteristic of interest and the underlying distribution. The statistical function t(·) defines
both the parameter and its estimate, but we shall use t(·) to represent the function, and t to represent the estimate of θ based on the observed data y₁, …, yₙ.

Example 2.1 (Average)  The sample average, ȳ, estimates the population mean

$$\mu = \int y \, dF(y).$$

To show that ȳ = t(F̂), we substitute F̂ for F in the defining function at (2.2) to obtain

$$t(\hat F) = \int y \, d\hat F(y) = \frac{1}{n} \sum_{j=1}^{n} y_j = \bar y,$$

because ∫ a(y) dH(y − x) = a(x) for any continuous function a(·). ■

Example 2.2 (City population data)  For the problem outlined in Example 1.2, the parameter of interest is the ratio of means θ = E(X)/E(U). In this case F is the bivariate CDF of Y = (U, X), and the bivariate EDF F̂ puts probability n⁻¹ at each of the data pairs (uⱼ, xⱼ). The statistical function version of θ simply uses the definition of mean for both numerator and denominator, so that

$$\theta = t(F) = \frac{\int x \, dF(u, x)}{\int u \, dF(u, x)}.$$

The corresponding estimate of θ is

$$t = t(\hat F) = \frac{\int x \, d\hat F(u, x)}{\int u \, d\hat F(u, x)} = \frac{\bar x}{\bar u},$$

with x̄ = n⁻¹ Σ xⱼ and ū = n⁻¹ Σ uⱼ. ■

(A quantity A is said to be O(n^d) if lim_{n→∞} n^{−d}A = a for some finite a, and o(n^d) if lim_{n→∞} n^{−d}A = 0.)

It is quite straightforward to show that (2.1) implies convergence of F̂ to F as n→∞ (Problem 2.1). Then if t(·) is continuous in an appropriate sense, the definition T = t(F̂) implies that T converges to θ as n→∞, which is the property of consistency.

Not all estimates are exactly of the form t(F̂). For example, if t(F) = var(Y) then the usual unbiased sample variance is nt(F̂)/(n − 1). Also the sample median is not exactly F̂⁻¹(½). Such small discrepancies are fairly unimportant as far as applying the bootstrap techniques discussed in this book is concerned. In a very formal development we could write T = tₙ(F̂) and require that tₙ → t as n→∞, possibly even that tₙ − t = O(n⁻¹). But such formality would be excessive here, and we shall assume in general discussion that T = t(F̂). (One case that does

require special treatment is nonparametric density estimation, which we discuss in Example 5.13.)

The representation θ = t(F) defines the parameter and its estimator T in a robust way, without any assumption about F, other than that θ exists. This guarantees that T estimates the right thing, no matter what F is. Thus the sample average ȳ is the only statistic that is generally valid as an estimate of the population mean μ: only if Y is symmetrically distributed about μ will statistics such as trimmed averages also estimate μ. This property, which guarantees that the correct characteristic of the underlying distribution is estimated, whatever that distribution is, is sometimes called robustness of specification.

2.1.2 Objectives

Much of statistical theory is devoted to calculating approximate distributions for particular statistics T, on which to base inferences about their estimands θ. Suppose, for example, that we want to calculate a (1 − 2α) confidence interval for θ. It may be possible to show that T is approximately normal with mean θ + β and variance v; here β is the bias of T. If β and v are both known, then we can write

$$\Pr(T \le t \mid F) \doteq \Phi\left(\frac{t - \theta - \beta}{v^{1/2}}\right), \qquad (2.3)$$

where Φ(·) is the standard normal integral. (Here ≐ means 'is approximately equal to'.) If the α quantile of the standard normal distribution is z_α = Φ⁻¹(α), then an approximate (1 − 2α) confidence interval for θ has limits

$$t - \beta - v^{1/2} z_{1-\alpha}, \qquad t - \beta - v^{1/2} z_{\alpha}, \qquad (2.4)$$

as follows from

$$\Pr(\beta + v^{1/2} z_\alpha \le T - \theta \le \beta + v^{1/2} z_{1-\alpha}) = 1 - 2\alpha.$$

There is a catch, however, which is that in practice the bias β and variance v will not be known. So to use the normal approximation we must replace β and v with estimates. To see how to do this, note that we can express β and v as

$$\beta = b(F) = E(T \mid F) - t(F), \qquad v = v(F) = \operatorname{var}(T \mid F), \qquad (2.5)$$

thereby stressing their dependence on the underlying distribution. We use expressions such as E(T | F) to mean that the random variables from which T is calculated have distribution F; here a pedantic equivalent would be E{t(F̂) | Y₁, …, Yₙ ∼ F}. Suppose that F is estimated by F̂, which might be the empirical distribution function, or a fitted parametric distribution. Then estimates of bias and variance are obtained simply by substituting F̂ for F in

(2.5), that is,

$$B = b(\hat F) = E(T \mid \hat F) - t(\hat F), \qquad V = v(\hat F) = \operatorname{var}(T \mid \hat F). \qquad (2.6)$$

These estimates B and V are used in place of β and v in equations such as (2.4).

Example 2.3 (Air-conditioning data)  Under the exponential model for the data in Example 1.1, the mean failure time μ is estimated by the average T = Ȳ, which has a gamma distribution with mean μ and shape parameter κ = n. Therefore the bias and variance of T are b(F) = 0 and v(F) = μ²/n, and these are estimated by 0 and ȳ²/n. Since n = 12, ȳ = 108.083, and z₀.₀₂₅ = −1.96, a 95% confidence interval for μ based on the normal approximation (2.3) is ȳ ± 1.96n^{−1/2}ȳ = (46.93, 169.24). ■
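In R this interval is one line once B = 0 and V = ȳ²/n are plugged into (2.4) (a minimal sketch under the exponential model):

    y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
    n <- length(y); ybar <- mean(y)
    v <- ybar^2 / n                                # estimated variance of T
    alpha <- 0.025
    ybar - sqrt(v) * qnorm(c(1 - alpha, alpha))    # limits (46.93, 169.24)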

Estimates such as those in (2.6) are bootstrap estimates. Here they have been used in conjunction with a normal approximation, which sometimes will be adequate. However, the bootstrap approach of substituting estimates can be applied more ambitiously, to improve upon the normal approximation and other first-order theoretical approximations. The elaboration of the bootstrap approach is the purpose of this book.

2.2 Parametric Simulation

In the previous section we pointed out that theoretical properties of T might be hard to determine with sufficient accuracy. We now describe the sound practical alternative of repeated simulation of datasets from a fitted parametric model, and empirical calculation of relevant properties of T.

Suppose that we have a particular parametric model for the distribution of the data y₁, …, yₙ. We shall use F_ψ(y) and f_ψ(y) to denote the CDF and PDF respectively. When ψ is estimated by ψ̂, often but not invariably its maximum likelihood estimate, its substitution in the model gives the fitted model, with CDF F̂(y) = F_ψ̂(y), which can be used to calculate properties of T, sometimes exactly. We shall use Y* to denote the random variable distributed according to the fitted model F̂, and the superscript * will be used with E, var and so forth when these moments are calculated according to the fitted distribution. Occasionally it will also be useful to write ψ* = ψ̂ to emphasize that this is the parameter value for the simulation model.

Example 2.4 (Air-conditioning data)  We have already calculated the mean and variance under the fitted exponential model for the estimator T = Ȳ of Example 1.1. Our sample estimate for the mean μ is t = ȳ. So here Y* is exponential with mean ȳ. In the notation just introduced, we have by

theoretical calculation with this exponential distribution that

$$E^*(\bar Y^*) = \bar y, \qquad \operatorname{var}^*(\bar Y^*) = \bar y^2 / n.$$

Note that the estimated bias of Ȳ is zero, being the difference between E*(Ȳ*) and the value μ̂ = ȳ for the mean of the fitted distribution. These moments were used to calculate an approximate normal confidence interval in Example 2.3.

If, however, we wished to calculate the bias and variance of T = log Ȳ under the fitted model, i.e. E*(log Ȳ*) − log ȳ and var*(log Ȳ*), exact calculation is more difficult. The delta method of Section 2.7.1 would give approximate values −(2n)⁻¹ and n⁻¹. But more accurate approximations can be obtained using simulated samples of Y*'s.

Similar results and comments would apply if instead we chose to use the more general gamma model (1.1) for this example. Then Y* would be a gamma random variable with mean ȳ and index κ̂. ■
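A small simulation of this kind is sketched below in R (the seed and the simulation size are arbitrary choices, not the authors'); it checks the delta method values for T = log Ȳ under the fitted exponential model:

    set.seed(1)
    y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
    n <- length(y); ybar <- mean(y)
    tstar <- replicate(10000, log(mean(rexp(n, rate = 1/ybar))))
    mean(tstar) - log(ybar)    # bias: compare with -(2n)^(-1) = -0.0417
    var(tstar)                 # variance: compare with n^(-1) = 0.0833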

2.2.1 Moment estimates

So now suppose that theoretical calculation with the fitted model is too complex. Approximations may not be available, or they may be untrustworthy, perhaps because the sample size is small. The alternative is to estimate the properties we require from simulated datasets. We write such a dataset as Y₁*, …, Yₙ*, where the Yⱼ* are independently sampled from the fitted distribution F̂. When the statistic of interest is calculated from a simulated dataset, we denote it by T*. From R repetitions of the data simulation we obtain T₁*, …, T_R*. Properties of T − θ are then estimated from T₁*, …, T_R*. For example, the estimator of the bias b(F) = E(T | F) − θ of T is

$$B = b(\hat F) = E(T \mid \hat F) - t = E^*(T^*) - t,$$

and this in turn is estimated by

$$B_R = R^{-1} \sum_{r=1}^{R} T_r^* - t = \bar T^* - t. \qquad (2.7)$$

Note that in the simulation t is the parameter value for the model, so that T* − t is the simulation analogue of T − θ. The corresponding estimator of the variance of T is

$$V_R = \frac{1}{R-1} \sum_{r=1}^{R} (T_r^* - \bar T^*)^2, \qquad (2.8)$$

with similar estimators for other moments.


These em pirical approxim ations are justified by the law o f large num bers.
F or exam ple, B r converges to B, the exact value under the fitted model, as R

2.2 Parametric Simulation


Figure 2.1 Empirical
biases and variances of
Y* for the
air-conditioning data
from four repetitions of
parametric simulation.
Each line shows how the
estimated bias and
variance for R ~ 10
initial simulations
change when further
simulations are
successively added. Note
how the variability
decreases as the
simulation size
increases, and how the
simulated values
converge to the exact
values under the fitted
exponential model,
given by the horizontal
dotted lines.

17

cC/>
O
in

increases. We usually d ro p the subscript R from B R, VR, and so forth unless


we are explicitly discussing the effect o f R. How to choose R will be illustrated
in the exam ples th a t follow, and discussed in Section 2.5.2.
It is im p o rtan t to recognize th a t we are not estim ating absolute properties o f
T , b u t ra th e r o f T relative to 9. Usually this involves the estim ation erro r T 9,
b u t we should n o t ignore the possibility th at T / 0 (equivalently log T log 9)
o r som e o th er relevant m easure o f estim ation error m ight be m ore appropriate,
depending u p o n the context. B ootstrap sim ulation m ethods will apply to any
such measure.
Example 2.5 (Air-conditioning data)  Consider Example 1.1 again. As we have seen, simulation is unnecessary in practice for this problem because the moments are easy to calculate theoretically, but the example is useful for illustration. Here the fitted model is an exponential distribution for the failure times, with mean estimated by the sample average ȳ = 108.083. All simulated failure times Y* are generated from this distribution.

Figure 2.1 shows the results from several simulations, four for each of eight values of R, in each of which the empirical biases and variances of T* = Ȳ* have been calculated according to (2.7) and (2.8). On both panels the correct values, namely zero and ȳ²/n = (108.083)²/12 = 973.5, are indicated by horizontal dotted lines.

Evidently the larger is R, the closer is the simulation calculation to the right answer. How large a value of R is needed? Figure 2.1 suggests that for some purposes R = 100 or 200 will be adequate, but that R = 10 will not be large enough. In this problem the accuracy of the empirical approximations is quite easy to determine from the fact that nȲ*/ȳ has a gamma distribution with

index n. The simulation variances of B_R and V_R are

$$\operatorname{var}(B_R) = \frac{t^2}{nR}, \qquad \operatorname{var}(V_R) = \frac{t^4}{n^2}\left(\frac{2}{R-1} + \frac{6}{nR}\right),$$

and we can use these to say how large R should be in order that the simulated values have a specified accuracy. For example, the coefficients of variation of V_R at R = 100 and 1000 are respectively 0.16 and 0.05. However, for a complicated problem where simulation was really necessary, such calculations could not be done, and general rules are needed to suggest how large R should be. These are discussed in Section 2.5.2. ■
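In R the estimates (2.7) and (2.8) for this example can be computed as follows (an illustrative sketch; R = 200 is one of the simulation sizes shown in Figure 2.1):

    y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
    n <- length(y); t0 <- mean(y)
    R <- 200
    tstar <- replicate(R, mean(rexp(n, rate = 1/t0)))   # parametric resamples
    bR <- mean(tstar) - t0     # empirical bias; exact value is 0
    vR <- var(tstar)           # empirical variance; exact value is 973.5

    ## Coefficient of variation of V_R from the formula above.
    sqrt(2/(R - 1) + 6/(n * R))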

2.2.2 Distribution and quantile estimates

The simulation estimates of bias and variance will sometimes be of interest in their own right, but more usually would be used with normal approximations for T, particularly for large samples. For situations like those in Examples 1.1 and 1.2, however, the normal approximation is intrinsically inaccurate. This can be seen from a normal Q-Q plot of the simulated values t₁*, …, t_R*, that is, a plot of the ordered values t*_(1) ≤ ··· ≤ t*_(R) against expected normal order statistics. It is the empirical distribution of these simulated values which can provide a more accurate distributional approximation, as we shall now see.

If, as is often the case, we are approximating the distribution of T − θ by that of T* − t, then cumulative probabilities are estimated simply by the empirical distribution function of the simulated values t* − t. More formally, if G(u) = Pr(T − θ ≤ u), then the simulation estimate of G(u) is

$$\hat G_R(u) = \frac{\#\{t_r^* - t \le u\}}{R} = \frac{1}{R} \sum_{r=1}^{R} I\{t_r^* - t \le u\},$$

where I{A} is the indicator of the event A, equal to 1 if A is true and 0 otherwise. As R increases, this estimate will converge to Ĝ(u), the exact CDF of T* − t under sampling from the fitted model. Just as with the moment approximations discussed earlier, the approximation Ĝ_R to G contains two sources of error: that between Ĝ and G due to data variability, and that between Ĝ_R and Ĝ due to finite simulation.

We are often interested in quantiles of the distribution of T − θ, and these are approximated using ordered values of t* − t. The underlying result used here is that if X₁, …, X_N are independently distributed with CDF K and if X_(j) denotes the jth ordered value, then

$$E(X_{(j)}) \doteq K^{-1}\left(\frac{j}{N+1}\right).$$

This implies that a sensible estimate of K⁻¹(p) is X_((N+1)p), assuming that

(N + 1)p is an integer. So we estimate the p quantile of T − θ by the (R + 1)p-th ordered value of t* − t, that is t*_((R+1)p) − t. We assume that R is chosen so that (R + 1)p is an integer.

The simulation approximation Ĝ_R and the corresponding quantiles are in principle better than results obtained by normal approximation, provided that R is large enough, because they avoid the supposition that the distribution of T* − t has a particular form.

Example 2.6 (Air-conditioning data)  The simulation experiments described in Example 2.5 can be used to study the simulation approximations to the distribution and quantiles of Ȳ − μ. First, Figure 2.2 shows normal Q-Q plots of t* values for R = 99 (top left panel) and R = 999 (top right panel). Clearly a normal approximation would not be accurate in the tails, and this is already fairly clear with R = 99. For reference, the lower half of Figure 2.2 shows corresponding Q-Q plots with exact gamma quantiles.

The nonnormality of T* is also reasonably clear on histograms of t* values, shown in Figure 2.3, at least at the larger value R = 999. Corresponding density estimate plots provide smoother displays of the same information.

We look next at the estimated quantiles of Ȳ − μ. The p quantile is approximated by y*_((R+1)p) − ȳ for p = 0.05 and 0.95. The values of R are 19, 39, 99, 199, …, 999, chosen to ensure that (R + 1)p is an integer throughout. Thus at R = 19 the 0.05 quantile is approximated by y*_(1) − ȳ, and so forth. In order to display the magnitude of simulation error, we ran four independent simulations at R = 19, 39, 99, …, 999. The results are plotted in Figure 2.4. Also shown by dotted lines are the exact quantiles under the model, which the simulations approach as R increases. There is large variability in the approximate quantiles for R less than 100, and it appears that 500 or more simulations are required to get accurate results.

The same simulations can be used in other ways. For example, we might want to know about log Ȳ − log μ, in which case the empirical properties of log y* − log ȳ are relevant. ■

[Figure 2.2  Normal (upper) and gamma (lower) Q-Q plots of t* values based on R = 99 (left) and R = 999 (right) simulations from the fitted exponential model for the air-conditioning data.]

[Figure 2.3  Histograms of t* values based on R = 99 (left) and R = 999 (right) simulations from the fitted exponential model for the air-conditioning data.]

[Figure 2.4  Empirical quantiles (p = 0.05, 0.95) of T* − t under resampling from the fitted exponential model for the air-conditioning data. The horizontal dotted lines are the exact quantiles under the model.]
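The quantile calculation of this example is short in R (a sketch; the seed is an arbitrary choice):

    set.seed(2)
    y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
    n <- length(y); t0 <- mean(y)
    R <- 999
    tstar <- replicate(R, mean(rexp(n, rate = 1/t0)))   # parametric resamples
    p <- c(0.05, 0.95)
    sort(tstar)[(R + 1) * p] - t0    # estimated p quantiles of T - mu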

The illustration used here is very simple, but essentially the same methods can be used in arbitrarily complicated parametric problems. For example, distributions of likelihood ratio statistics can be approximated when large-sample approximations are inaccurate or fail entirely. In Chapters 4 and 5 respectively we show how parametric bootstrap methods can be used to calculate significance tests and confidence sets.

It is sometimes useful to be able to look at the density of T, for example to see if it is multimodal, skewed, or otherwise differs appreciably from normality. A rough idea of the density g(u) of U = T − θ, say, can be had from a histogram of the values of t* − t. A somewhat better picture is offered by a kernel density

estimate, defined by

$$\hat g_h(u) = \frac{1}{Rh} \sum_{r=1}^{R} w\left(\frac{u - (t_r^* - t)}{h}\right),$$

where w is a symmetric PDF with zero mean and h is a positive bandwidth that determines the smoothness of ĝ_h. The estimate ĝ_h is non-negative and has unit integral. It is insensitive to the choice of w(·), for which we use the standard normal density. The choice of h is more important. The key is to produce a smooth result, while not flattening out significant modes. If the choice of h is quite large, as it may be if R < 100, then one should rescale the density


estimate to make its mean and variance agree with the estimated mean b_R and variance v_R of T − θ; see Problem 3.8.

As a general rule, good estimates of density require at least R = 1000: density estimation is usually harder than probability or quantile estimation. Note that the same methods of estimating density, distribution function and quantiles can be applied to any transformation of T. We shall discuss this further in Section 2.5.
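In R, the function density implements exactly this kind of kernel estimate, its argument bw playing the role of h; continuing the sketch above, with simulated values tstar and observed value t0 (an illustration, not the book's code):

    gh <- density(tstar - t0, kernel = "gaussian")   # normal kernel w
    plot(gh)      # smoothed display of the distribution of t* - t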


2.3 Nonparametric Simulation


Suppose that we have no parametric model, but that it is sensible to assume that Y₁, …, Yₙ are independent and identically distributed according to an unknown distribution function F. We use the EDF F̂ to estimate the unknown CDF F. We shall use F̂ just as we would a parametric model: theoretical calculation if possible, otherwise simulation of datasets and empirical calculation of required properties. In only very simple cases are exact theoretical calculations possible, but we shall see in Section 9.5 that good theoretical approximations can be obtained in many problems involving sample moments.
Example 2.7 (Average)  In the case of the average, exact moments under sampling from the EDF are easily found. For example,

E^*(\bar{Y}^*) = E^*(Y^*) = \frac{1}{n}\sum_{j=1}^{n} y_j = \bar{y},

and similarly

\operatorname{var}^*(\bar{Y}^*) = \frac{1}{n}\operatorname{var}^*(Y^*) = \frac{1}{n} E^*\{Y^* - E^*(Y^*)\}^2 = \frac{1}{n} \times \frac{1}{n}\sum_{j=1}^{n} (y_j - \bar{y})^2.

Apart from the factor (n − 1)/n, this is the usual result for the estimated variance of Ȳ.

Other simple statistics such as the sample variance and sample median are also easy to handle (Problems 2.3, 2.4).

To apply simulation with the EDF is very straightforward. Because the EDF puts equal probabilities on the original data values y_1, ..., y_n, each Y* is independently sampled at random from those data values. Therefore the simulated sample Y*_1, ..., Y*_n is a random sample taken with replacement from the data. This simplicity is special to the case of a homogeneous sample, but many extensions are straightforward. This resampling procedure is called the nonparametric bootstrap.
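In code, this resampling is a one-line operation. The following minimal sketch (Python with numpy; our illustration, using the air-conditioning data of Example 1.1) draws R bootstrap samples and checks the exact moments derived in Example 2.7:

    import numpy as np

    rng = np.random.default_rng(7)
    y = np.array([3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487.0])
    n = len(y)

    R = 9999
    ystar = rng.choice(y, size=(R, n), replace=True)    # each row is a resample from the EDF
    tstar = ystar.mean(axis=1)

    print(tstar.mean(), y.mean())                       # E*(Ybar*) = ybar
    print(tstar.var(), (n - 1) * y.var(ddof=1) / n**2)  # var*(Ybar*) = (n-1)s^2/n^2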
Example 2.8 (City population data)  Here we look at the ratio estimate for the problem described in Example 1.2. For convenience we consider a subset of the data in Table 1.3, comprising the first ten pairs. This is an application with no obvious parametric model, so nonparametric simulation makes good sense. Table 2.1 shows the data and the first simulated sample, which has been drawn by randomly selecting the subscript j* from the set {1, ..., n} with equal probability and taking (u*, x*) = (u_{j*}, x_{j*}). In this sample j* = 1 never occurs and j* = 2 occurs three times, so that the first data pair is never selected, the second is selected three times, and so forth.



Table 2.1  The dataset for ratio estimation, and one synthetic sample. The values j* are chosen randomly with equal probability from {1, 2, ..., 10} with replacement; the simulated pairs are (u_{j*}, x_{j*}).

     j     u     x   |   j*    u*    x*
     1   138   143   |    6    37    63
     2    93   104   |    7    29    50
     3    61    69   |    2    93   104
     4   179   260   |    2    93   104
     5    48    75   |    3    61    69
     6    37    63   |    3    61    69
     7    29    50   |   10     2    50
     8    23    48   |    7    29    50
     9    30   111   |    2    93   104
    10     2    50   |    9    30   111
[Table 2.2: Frequencies with which each original data pair appears in each of R = 9 nonparametric bootstrap samples for the data on US cities, together with the statistic for each replicate. The data give t = 1.520; the replicate values of the ratio are t*_1 = 1.466, t*_2 = 1.761, t*_3 = 1.951, t*_4 = 1.542, t*_5 = 1.371, t*_6 = 1.686, t*_7 = 1.378, t*_8 = 1.420, t*_9 = 1.660.]



Table 2.2 shows the same simulated sample, plus eight more, expressed in terms of the frequencies of original data pairs. The ratio t* for each simulated sample is recorded in the last column of the table. After the R sets of calculations, the bias and variance estimates are calculated according to (2.7) and (2.8). The results are, for the R = 9 replicates shown,

b = 1.582 − 1.520 = 0.062,    v = 0.03907.

A simple approximate distribution for T − θ is N(b, v). With the results so far, this is N(0.062, 0.0391), but this is unlikely to be accurate enough and a larger value of R should be used. In a simulation with R = 999 we obtained b = 1.5755 − 1.5203 = 0.0552 and v = 0.0601. The latter is appreciably bigger than the value 0.0325 given by the delta method variance estimate

v_L = n^{-2} \sum_{j=1}^{n} (x_j - t u_j)^2 / \bar{u}^2,


[Figure 2.5: City population data. Histograms of t* (left) and z* (right) under nonparametric resampling for the sample of size n = 10, R = 999 simulations. Note the skewness of both t* and z*.]

which is based on an expansion that is explained in Section 2.7.2; see also Problem 2.9. The discrepancy between v and v_L is due partly to a few extreme values of t*, an issue we discuss in Section 2.3.2.

The left panel of Figure 2.5 shows a histogram of t*, whose skewness is evident: use of a normal approximation here would be very inaccurate.

We can use the same simulations to estimate distributions of related statistics, such as transformed estimates or studentized estimates. The right panel of Figure 2.5 shows a histogram of studentized values z* = (t* − t)/v_L^{*1/2}, where v_L^* is the delta method variance estimate based on a simulated sample. That is,

v_L^* = n^{-2} \sum_{j=1}^{n} (x_j^* - t^* u_j^*)^2 / \bar{u}^{*2}.

The corresponding theoretical approximation for Z is the N(0, 1) distribution, which we would judge also inaccurate in view of the strong skewness in the histogram. We shall discuss the rationale for the use of z* in Section 2.4.
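The calculations of Example 2.8 and the studentized values above are easily reproduced. A sketch in Python (numpy assumed; this is our illustration rather than the authors' code, and results will vary slightly with the random seed):

    import numpy as np

    rng = np.random.default_rng(42)
    u = np.array([138, 93, 61, 179, 48, 37, 29, 23, 30, 2.0])
    x = np.array([143, 104, 69, 260, 75, 63, 50, 48, 111, 50.0])
    n = len(u)
    t = x.mean() / u.mean()                      # t = 1.5203

    def vL(uu, xx):                              # delta method variance estimate
        tt = xx.mean() / uu.mean()
        return np.sum((xx - tt * uu) ** 2) / (n ** 2 * uu.mean() ** 2)

    R = 999
    idx = rng.integers(0, n, size=(R, n))        # the resampled subscripts j*
    tstar = x[idx].mean(axis=1) / u[idx].mean(axis=1)
    zstar = np.array([(x[j].mean() / u[j].mean() - t) / vL(u[j], x[j]) ** 0.5
                      for j in idx])             # studentized values z*

    b = tstar.mean() - t                         # bias estimate, about 0.055
    v = tstar.var(ddof=1)                        # variance estimate, about 0.06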
One natural question to ask here is what effect the small sample size has on the accuracy of normal approximations. This can be answered in part by plotting density estimates. The left panel of Figure 2.6 shows three estimated densities for T* − t with our sample of n = 10: a kernel density estimate based on our simulations, the N(b, v) approximation with moments computed from the same simulations, and the N(0, v_L) approximation. The right panel shows corresponding density approximations for the full data with n = 49; the empirical bias and variance of T* are b = 0.00118 and v = 0.001290, and the delta method variance approximation is v_L = 0.001166. At the larger sample size the normal approximations seem very accurate.

[Figure 2.6: Density estimates for T* − t based on 999 nonparametric simulations for the city population data. The left panel is for the sample of size n = 10 in Table 2.1, and the right panel shows the corresponding estimates for the entire dataset of size n = 49. Each plot shows a kernel density estimate (solid), the N(b, v) approximation (dashes), with these moments computed from the same simulations, and the N(0, v_L) approximation (dots).]

2.3.1 Comparison with parametric methods

A natural question to ask is how well the nonparametric resampling methods might compare to parametric methods, when the latter are appropriate. Equally important is the question as to which parametric model would produce results like those for nonparametric resampling: this is another way of asking just what the nonparametric bootstrap does. Some insight into these questions can be gained by revisiting Example 1.1.
Example 2.9 (Air-conditioning data)  We now look at the results of applying nonparametric resampling to the air-conditioning data. One might naively expect to obtain results similar to those in Example 2.5, where exponential resampling was used, since we found in Example 1.1 that the data appear compatible with an exponential model.

Figure 2.7 is the nonparametric analogue of Figure 2.4, and shows quantiles of T* − t. It appears that R = 500 or so is needed to get reliable quantile estimates; R = 100 is enough for the corresponding plot for bias and variance. Under nonparametric resampling there is no reason why the quantiles should approach the theoretical quantiles under the exponential model, and it seems that they do not do so. This suggestion is confirmed by the Q-Q plots in Figure 2.8. The first panel compares the ordered values of t* from R = 999 nonparametric simulations with theoretical quantiles under the fitted exponential model, and the second panel compares the t* with theoretical quantiles under the best-fitting gamma model with index κ = 0.71. The agreement in the second panel is strikingly good. On reflection this is natural, because the EDF is closer to the larger gamma model than to the exponential model.

[Figure 2.7: Empirical quantiles (p = 0.05, 0.95) of T* − t under nonparametric resampling from the air-conditioning data. The horizontal lines are the exact quantiles based on the fitted exponential model.]

[Figure 2.8: Q-Q plots of t* under nonparametric resampling from the air-conditioning data, first against theoretical quantiles under the fitted exponential model (left panel) and then against theoretical quantiles under the fitted gamma model (right panel).]

2.3.2 Effects of discreteness

For intrinsically continuous data, a major difference between parametric and nonparametric resampling lies in the discreteness of the latter. Under nonparametric resampling, T* and related quantities will have discrete distributions, even though they may be approximating continuous distributions. This makes results somewhat fuzzy compared to their parametric counterparts.

Example 2.10 (Air-conditioning data)  For the nonparametric simulation discussed in the previous example, the right panels of Figure 2.9 show the scatter plots of sample standard deviation versus sample average for R = 99 and R = 999 simulated datasets. Corresponding plots for the exponential simulation are shown in the left panels. The qualitative feature to be read from any one of these plots is that the data standard deviation is proportional to the data average. The discreteness of the nonparametric model (the EDF) adds noise whose peculiar banded structure is evident at R = 999, although the qualitative structure is still apparent.

For a statistic that is symmetric in the data values, there are up to

m_n = \binom{2n-1}{n-1} = \frac{(2n-1)!}{n!\,(n-1)!}

possible values of t*, depending upon the smoothness of the statistical function t(·). Even for moderately small samples the support of the distribution of T* will often be fairly dense: the values of m_n for n = 7 and 11 are 1716 and 352 716 (Problem 2.5). It would therefore usually be harmless to think of there being a PDF for T*, and to approximate it, either using simulation results as in Figure 2.6 or theoretically (Section 9.5). There are exceptions, however, most notably when T is a sample quantile. The case of the sample median is discussed in Example 2.16; see also Problem 2.4 and Example 2.15.
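The count m_n grows very rapidly, as a quick computation confirms (Python; our illustration):

    from math import comb

    for n in (7, 11, 15):
        print(n, comb(2 * n - 1, n - 1))   # 1716, 352716, 77558760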
For m any practical applications o f the sim ulation results, the effects o f
discreteness are likely to be fairly m inim al. However, one possible problem is
th at outliers are m ore likely to occur in the sim ulation output. F or example,
in Exam ple 2.8 there were three outliers in the sim ulation, and these inflated
the estim ate v o f the variance o f T*. Such outliers should be evident on a
norm al Q -Q plot (or com parable relevant plot), and when found they should be
om itted. M ore generally, a statistic th at depends heavily on a few quantiles can
be sensitive to the repeated values th a t occur under nonparam etric sampling,
an d it can be useful to sm ooth the original d a ta when dealing with such
statistics; see Section 3.4.

2.4 Simple Confidence Intervals


The major application for distributions and quantiles of an estimator T is in the calculation of confidence limits. There are several ways of using bootstrap simulation results in this context, most of which will be explored in Chapter 5. Here we describe briefly two basic methods.

[Figure 2.9: Scatter plots of sample standard deviation versus sample average for samples generated by parametric simulation from the fitted exponential model (left panels) and by nonparametric resampling (right panels). The top row is for R = 99 and the bottom row for R = 999.]

The simplest approach is to use a normal approximation to the distribution of T. As outlined in Section 2.1.2, this means estimating the limits (2.4), which require only bootstrap estimates of bias and variance. As we have seen in previous sections, a normal approximation will not always suffice. Then if we use the bootstrap estimates of quantiles for T − θ as described in Section 2.2.2, an equitailed (1 − 2α) confidence interval will have limits

t - (t^*_{((R+1)(1-\alpha))} - t), \qquad t - (t^*_{((R+1)\alpha)} - t).    (2.10)

This is based on the probability implication

\Pr(a \le T - \theta \le b) = 1 - 2\alpha \quad\Longrightarrow\quad \Pr(T - b \le \theta \le T - a) = 1 - 2\alpha.


We shall refer to the limits (2.10) as the basic bootstrap confidence limits. Their accuracy depends upon R, of course, and one would typically take R > 1000 to be safe. But accuracy also depends upon the extent to which the distribution of T* − t agrees with that of T − θ. Complete agreement will occur if T − θ has a distribution not depending on any unknowns. This special property is enjoyed by quantities called pivots, which we discuss in more detail in Section 2.5.1.
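In code, the limits (2.10) are simply order statistics of the simulated t* values; a minimal sketch (Python, function and variable names ours):

    import numpy as np

    def basic_interval(t, tstar, alpha=0.025):
        # basic bootstrap limits (2.10): 2t - t*_((R+1)(1-alpha)) and 2t - t*_((R+1)alpha)
        R = len(tstar)
        srt = np.sort(tstar)
        lower = 2 * t - srt[int((R + 1) * (1 - alpha)) - 1]   # -1 for zero-based indexing
        upper = 2 * t - srt[int((R + 1) * alpha) - 1]
        return lower, upper

With R = 999 and α = 0.025 this picks out the 975th and 25th ordered values.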
If, as is usually the case, the distribution of T − θ does depend on unknowns, then we can try alternative expressions contrasting T and θ, such as differences of transformed quantities, or studentized comparisons. For the latter, we define the studentized version of T − θ as

Z = \frac{T - \theta}{V^{1/2}},    (2.11)

where V is an estimate of var(T | F): we give a fairly general form for V in Section 2.7.2. The idea is to mimic the Student-t statistic, which has this form, and which eliminates the unknown standard deviation when making inference about a normal mean. Throughout this book we shall use Z to denote a studentized statistic.

Recall that the Student-t (1 − 2α) confidence interval for a normal mean μ has limits

\bar{y} - v^{1/2} t_{n-1}(1 - \alpha), \qquad \bar{y} - v^{1/2} t_{n-1}(\alpha),

where v is the estimated variance of the mean and t_{n-1}(α), t_{n-1}(1 − α) are quantiles of the Student-t distribution with n − 1 degrees of freedom, the distribution of the pivot Z. More generally, when Z is defined by (2.11), the (1 − 2α) confidence interval limits for θ have the analogous form

t - v^{1/2} z_{1-\alpha}, \qquad t - v^{1/2} z_{\alpha},

where z_p denotes the p quantile of Z. One simple approximation, which can often be justified for large sample size n, is to take Z as being N(0, 1). The result would be no different in practical terms from using a normal approximation for T − θ, and we know that this is often inadequate. It is more accurate to estimate the quantiles of Z from replicates of the studentized bootstrap statistic, Z* = (T* − t)/V*^{1/2}, where T* and V* are based on a simulated random sample Y*_1, ..., Y*_n.

If the model is parametric, the Y*_j are generated from the fitted parametric distribution, and if the model is nonparametric, they are generated from the EDF F̂, as outlined in Section 2.3. In either case we use the (R + 1)α-th order statistic of the simulated values z*_1, ..., z*_R, namely z*_{((R+1)α)}, to estimate z_α. Then the studentized bootstrap confidence interval for θ has limits

t - v^{1/2} z^*_{((R+1)(1-\alpha))}, \qquad t - v^{1/2} z^*_{((R+1)\alpha)}.    (2.12)
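A corresponding sketch for the studentized limits (2.12), assuming that replicates t*_r and variance estimates v*_r are available (Python, names ours):

    import numpy as np

    def studentized_interval(t, v, tstar, vstar, alpha=0.025):
        # studentized bootstrap limits (2.12)
        R = len(tstar)
        zstar = np.sort((tstar - t) / np.sqrt(vstar))
        lower = t - np.sqrt(v) * zstar[int((R + 1) * (1 - alpha)) - 1]
        upper = t - np.sqrt(v) * zstar[int((R + 1) * alpha) - 1]
        return lower, upper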


This studentized bootstrap method is most likely to be of use in nonparametric problems. One reason for this is that with parametric models we can sometimes find exact solutions (as with the exponential model for Example 1.1), and otherwise we have available methods based on the likelihood function. This does not necessarily rule out the use of parametric simulation, of course, for approximating the distribution of the quantity used as the basis for the confidence interval.

Example 2.11 (Air-conditioning data)  Under the exponential model for the data of Example 1.1, we have T = Ȳ, and since var(T | F_μ) = μ²/n, we would take V = Ȳ²/n. This gives

Z = (T - \mu)/V^{1/2} = n^{1/2}(1 - \mu/\bar{Y}),

which is an exact pivot because Q = Ȳ/μ has the gamma distribution with index n and unit mean. Simulation to construct confidence intervals is unnecessary because the quantiles of the gamma distribution are available from tables. Parametric simulation would be based on Q* = Ȳ*/t, where Ȳ* is the average of a random sample Y*_1, ..., Y*_n from the exponential distribution with mean t. Since Q* has the same distribution as Q, the only error incurred by simulation would be due to the randomness of the simulated quantiles. For example, the estimates of the 0.025 and 0.975 quantiles of Q* based on R = 999 simulations are 0.504 and 1.608, compared to the exact values 0.517 and 1.640; these lead to estimated and exact 95% confidence intervals (67.2, 214.6) and (65.9, 209.2) respectively. We shall discuss these intervals more fully in Chapter 5.
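The exact quantiles and interval quoted here can be verified directly (Python with scipy; our illustration, with n = 12 and ȳ = 108.083 for the air-conditioning data):

    from scipy.stats import gamma

    n, ybar = 12, 108.083
    # Q = Ybar/mu has a gamma distribution with index n and unit mean
    q_lo, q_hi = gamma.ppf([0.025, 0.975], a=n, scale=1 / n)   # 0.517 and 1.640
    print(ybar / q_hi, ybar / q_lo)                            # exact 95% interval (65.9, 209.2)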

Example 2.12 (City population data)  For the sample of n = 10 pairs analysed in Example 2.8, our estimate of the ratio θ is t = x̄/ū = 1.52. The 0.025 and 0.975 quantiles of the 999 values of t* are 1.236 and 2.059, so the 95% basic bootstrap confidence interval (2.10) for θ is (0.981, 1.804).

To apply the studentized interval, we use the delta method approximation to the variance of T, which is (Problem 2.9)

v_L = n^{-2} \sum_{j=1}^{n} (x_j - t u_j)^2 / \bar{u}^2,

and base confidence intervals for θ on (T − θ)/V_L^{1/2}, using simulated values of z* = (t* − t)/v_L^{*1/2}. The simulated values in the right panel of Figure 2.5 show that the density of the studentized bootstrap statistic Z* is not close to normal. The 0.025 and 0.975 quantiles of the 499 simulated z* values are −3.063 and 1.447, and since v_L = 0.0325, an approximate 95% equitailed confidence interval based on (2.12) is (1.260, 2.072). This is quite different from the interval above.
The usefulness of these confidence intervals will depend on how well F̂ estimates F, and the extent to which the distributions of T − θ and of Z depend on F. We cannot judge the former, but we can check the latter using the methods outlined in Section 3.9.2; see Examples 3.20 and 9.11.

2.5 Reducing Error


The error in resampling methods is generally a combination of statistical error and simulation error. The first of these is due to the difference between F̂ and F, and the magnitude of the resulting error will depend upon the choice of T. The simulation error is wholly due to the use of empirical estimates of properties under sampling from F̂, rather than exact properties.

Figure 2.7 illustrates these two sources of error in quantile estimation. The decreasing simulation error shows as reduced scatter of the quantile estimates for increased R. Statistical error due to an inappropriate model for T is reflected by the difference between the simulated nonparametric quantiles for large R and the dotted lines that indicate the quantiles under the exponential model. The further statistical error due to the difference between F̂ and F cannot be illustrated, because we do not know the true model underlying the data. However, other samples of the same size from that model would yield different estimates of the true quantiles, quite apart from the variability of the quantile estimates obtained from each specific dataset by simulation.

2.5.1 Statistical error


The basic bootstrap idea is to approximate a quantity c(F) such as var(T | F) by the estimate c(F̂), where F̂ is either a parametric or a nonparametric estimate of F based on data y_1, ..., y_n. The statistical error is then the difference between c(F̂) and c(F), and as far as possible we wish to minimize this or remove it entirely. This is sometimes possible by careful choice of c(·). For example, in Example 1.1 with the exponential model, we have seen that working with T/θ removes statistical error completely.

For both confidence interval and significance test calculation, we usually have a choice as to what T is and how to use it. Significance testing raises special issues, because we then have to deal with a null hypothesis sampling distribution, so here it is best to focus on confidence interval calculation. For simplicity we also assume that the estimate T is decided upon. Then the quantity c(F) will be a quantile or a moment of some quantity Q = q(F̂, F) derived from T, such as h(T) − h(θ) or (T − θ)/V^{1/2} where V is an estimated variance, or something more complicated such as a likelihood ratio. The statistical problem is to choose among these possible quantities so that the resulting Q is as nearly pivotal as possible, that is, it has (at least approximately) the same distribution under sampling from both F and F̂.


Provided that Q is a monotone function of θ, it will be straightforward to obtain confidence limits. For example, if Q = h(T) − h(θ) with h(t) increasing in t, and if a_α is an approximate lower α quantile of h(T) − h(θ), then

1 - \alpha = \Pr\{h(T) - h(\theta) \ge a_\alpha\} = \Pr\left[\theta \le h^{-1}\{h(T) - a_\alpha\}\right],    (2.13)

where h^{-1}(·) is the inverse transformation. So h^{-1}{h(T) − a_α} is an upper (1 − α) confidence limit for θ.
Parametric problems
In parametric problems F = F_ψ and F̂ = F_ψ̂ have the same form, differing only in parameter values. The notion of a pivot is quite simple here, meaning constant behaviour under all values of the model parameters. More formally, we define a pivot as a function Q = q(T, θ) whose distribution does not depend on the value of ψ: for all q,

\Pr\{q(T, \theta) \le q \mid \psi\}

is independent of ψ. (In general Q may also depend on other statistics, as when Q is the studentized form of T.) One can check, sometimes theoretically and always empirically, whether or not a particular quantity Q is exactly or nearly pivotal, by examining its behaviour under the model form with varying parameter values. For example, in the context of Example 1.1 we could simultaneously examine properties of T − θ, log T − log θ and the studentized version of the former, by simulation under several exponential models close to the fitted model. This might result in plots of variance or selected quantiles versus parameter values, from which we could diagnose the nonpivotal behaviour of T − θ and the pivotal behaviour of log T − log θ.
A special role for the transformation h(T) arises because sometimes it is relatively easy to choose h(·) so that the variance of h(T) is approximately or exactly independent of θ, and this stability of variance is the primary feature of stability of distribution. Suppose that T has variance v(θ). Then provided the function h(·) is well behaved at θ, Taylor series expansion as described in Section 2.7.1 leads to

\operatorname{var}\{h(T)\} \doteq \{\dot{h}(\theta)\}^2 v(\theta),

where ḣ(θ) is the first derivative dh(θ)/dθ. This in turn implies that the variance is made approximately constant (equal to 1) if

h(t) = \int^{t} \frac{du}{\{v(u)\}^{1/2}}.    (2.14)

This is known as the variance-stabilizing transformation. Any constant multiple of h(T) will be equally effective: often in one-sample problems where v(θ) = n^{-1}σ²(θ), equation (2.14) would be applied with σ(u) in place of {v(u)}^{1/2}, in which case h(·) is independent of n and var{h(T)} ≈ n^{-1}.
[Figure 2.10: Log-log plot of estimated variance of Ȳ against θ for the air-conditioning data with an exponential model. The plot suggests strongly that var(Ȳ | θ) ∝ θ².]

For a problem where v(θ) varies strongly with θ, use of this transformation in conjunction with (2.13) will typically give more accurate confidence limits than would be obtained using direct approximations of quantiles for T − θ. If such use of the transformation is appropriate, it will sometimes be clear from theoretical considerations, as in the exponential case. Otherwise the transformation would have to be identified from a scatter plot of simulation-estimated variance of T versus θ for a range of values of θ.
Example 2.13 (Air-conditioning data)  Figure 2.10 shows a log-log plot of the empirical variances of t* = ȳ* based on R = 50 simulations for each of a range of values of θ. That is, for each value of θ we generate R values t*_r corresponding to samples y*_1, ..., y*_n from the exponential distribution with mean θ, and then plot log{(R − 1)^{-1} Σ_r (t*_r − t̄*)²} against log θ. The linearity and slope of the plot confirm that var(T | F) ∝ θ², where θ = E(T | F).
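A sketch of this diagnostic (Python with numpy; the grid of θ values and the seed are arbitrary choices of ours):

    import numpy as np

    rng = np.random.default_rng(3)
    n, R = 12, 50
    thetas = np.linspace(50, 200, 20)
    logvar = []
    for th in thetas:
        tstar = rng.exponential(scale=th, size=(R, n)).mean(axis=1)
        logvar.append(np.log(tstar.var(ddof=1)))

    # a least-squares slope near 2 confirms var(T | theta) proportional to theta^2
    slope = np.polyfit(np.log(thetas), logvar, 1)[0]
    print(slope)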
Nonparametric problems
In nonparametric problems the situation is more complicated. It is now unlikely (but not strictly impossible) that any quantity can be exactly pivotal. Also we cannot simulate data from a distribution with the same form as F, because that form is unknown. However, we can simulate data from distributions near to and similar to F̂, and this may be enough since F̂ is near F. A rough idea of what is possible can be had from Example 2.10. In the right-hand panels of Figure 2.9 we plotted sample standard deviation versus sample average for a series of nonparametrically resampled datasets. If the EDFs of those datasets are thought of as models near both F̂ and F, then although the pattern is obscured by the banding, the plots suggest that the true model has standard deviation proportional to its mean, which is indeed the case for the most

likely true model. There are conceptual difficulties with this argument, but there is little question that the implication drawn is correct, namely that log Ȳ will have approximately the same variance under sampling from both F̂ and F.

A more thorough discussion of these ideas for nonparametric problems will be given in Section 3.9.2.

A major focus of research on resampling methods has been the reduction of statistical error. This is reflected particularly in the development of accurate confidence limit methods, which are described in Chapter 5. In general it is best to remove as much of the statistical error as possible in the choice of procedure. However, it is possible to reduce statistical error by a bootstrap technique described in Section 3.9.1.

2.5.2 Simulation error


Simulation error arises when Monte Carlo simulations are performed and properties of statistics are approximated by their empirical properties in these simulations. For example, we approximate the estimate B = E*(T* | F̂) − t of the bias β = E(T) − θ by the average B_R = R^{-1} Σ_r (T*_r − t) = T̄* − t, using the independent replications T*_1, ..., T*_R, each based on a random sample from our data EDF F̂. The Monte Carlo variability in R^{-1} Σ_r T*_r can only be removed entirely by an infinite simulation, which seems both impossible and unnecessary in practice. The practical question is, how large does R need to be to achieve reasonable accuracy, relative to the statistical accuracy of the quantity (bias, variance, etc.) being approximated by simulation? While it is not possible to give a completely general and firm answer, we can get a fairly good sense of what is required by considering the bias, variance and quantile estimates in simple cases. This we now do.
Suppose that we have a sample y_1, ..., y_n from the N(μ, σ²) distribution, and that the parameter of interest θ = μ is estimated by the sample average t = ȳ. Suppose that we use nonparametric simulation to approximate the bias, variance and the p quantile a_p of T − θ = Ȳ − μ. Then the first step, as described in Section 2.3, is to take R independent replicate samples from y_1, ..., y_n, and calculate their means Ȳ*_1, ..., Ȳ*_R. From these we calculate the bias, variance and quantile estimators as described earlier. Of course the problem is so simple that we know the real answers, namely 0, n^{-1}σ² and n^{-1/2}σ z_p, where z_p is the p quantile of the standard normal distribution. So the corresponding (infinite simulation) estimates of bias and variance are 0 and n^{-1}σ̂², where σ̂² = n^{-1} Σ_j (y_j − ȳ)². The corresponding estimate â_p of the p quantile a_p is approximately n^{-1/2}σ̂ z_p under nonparametric resampling, ignoring O(n^{-1}) terms. We now compare the finite-simulation approximations to these estimates.


First consider the bias estimator

B_R = R^{-1} \sum_{r=1}^{R} \bar{Y}^*_r - \bar{y}.

Conditional on the particular sample y_1, ..., y_n, or equivalently its EDF F̂, the mean and variance of the bias estimator across all possible simulations are

E^*(B_R) = 0, \qquad \operatorname{var}^*(B_R) = \frac{\hat{\sigma}^2}{nR},    (2.15)

because E*(Ȳ*_r) = ȳ and var*(Ȳ*_r) = n^{-1}σ̂². The unconditional variance of B_R, taking into account the variability between different samples from the underlying distribution, is

\operatorname{var}(B_R) = \operatorname{var}_Y\left\{E^*\left(R^{-1}\sum_r \bar{Y}^*_r - \bar{Y}\right)\right\} + E_Y\left\{\operatorname{var}^*\left(R^{-1}\sum_r \bar{Y}^*_r - \bar{Y}\right)\right\},

where E_Y(·) and var_Y(·) denote the mean and variance taken with respect to the joint distribution of Y_1, ..., Y_n. From (2.15) this gives

\operatorname{var}(B_R) = \operatorname{var}_Y(0) + E_Y\left(\frac{\hat{\sigma}^2}{nR}\right) = \frac{n-1}{n} \times \frac{\sigma^2}{nR}.    (2.16)

This result does not depend on normality of the data. A similar expression holds for any smooth statistic T with a linear approximation (Section 2.7.2), except for an O(n^{-2}) term.

Next consider the variance estimator V_R = (R − 1)^{-1} Σ_r (Ȳ*_r − Ȳ̄*)², where Ȳ̄* = R^{-1} Σ_r Ȳ*_r. The mean and variance of V_R across all possible simulations, conditional on the data, are

E^*(V_R) = \frac{\hat{\sigma}^2}{n}, \qquad \operatorname{var}^*(V_R) = \frac{\hat{\sigma}^4}{n^2}\left(\frac{\hat{\gamma}_2}{nR} + \frac{2}{R-1}\right),

where γ̂₂ is the standardized fourth cumulant (the standardized kurtosis) of the data (Appendix A). Note that γ̂₂ would be zero for a parametric simulation but not for our nonparametric simulation, although in general γ̂₂ = O(n^{-1}) here because the data are normally distributed. The unconditional variance of V_R, averaging over all possible datasets, is

\operatorname{var}(V_R) = \operatorname{var}_Y\left(\frac{\hat{\sigma}^2}{n}\right) + E_Y\left\{\operatorname{var}^*(V_R)\right\},

which reduces to

\operatorname{var}(V_R) \doteq \frac{2\sigma^4}{n^3} + \frac{2\sigma^4}{n^2 R}.    (2.17)

The first term on the right of (2.17) is due to data variation, the second to

simulation variation. The implication is that to make the simulation variance as small as 10% of that due to data variation, one must take R = 10n. The corresponding result for general data distributions would include an additional term from the kurtosis of the Y_j. A similar result holds for a general smooth statistic T.
Finally consider the estimator of the p quantile a_p of Ȳ − μ, which is â_{p,R} = Ȳ*_{((R+1)p)} − ȳ, with Ȳ*_{((R+1)p)} the (R + 1)p-th order statistic of the simulated values Ȳ*_1, ..., Ȳ*_R. The general calculation of simulation properties of â_{p,R} is complicated, so we make the simplifying assumption that the N(ȳ, n^{-1}σ̂²) approximation for Ȳ* is exact. With this assumption, standard properties of order statistics give

E(\hat{a}_{p,R}) \doteq \hat{a}_p = n^{-1/2}\hat{\sigma} z_p,

and

\operatorname{var}(\hat{a}_{p,R}) \doteq \frac{p(1-p)}{R\, g^2(\hat{a}_p)} = \frac{2\pi p(1-p)\,\hat{\sigma}^2 \exp(z_p^2)}{nR},    (2.18)

where g(·) is the density of Ȳ* − ȳ conditional on F̂, here taken to be the N(0, n^{-1}σ̂²) density. (Note that the middle term of (2.18) applies for any T and any data distribution, with g(·) the density of T − θ.) The unconditional variance of â_{p,R} over all datasets can then be reduced to

\operatorname{var}(\hat{a}_{p,R}) \doteq \frac{\sigma^2}{n}\left\{\frac{z_p^2}{2n} + \frac{2\pi p(1-p)\exp(z_p^2)}{R}\right\}.    (2.19)

The implication of (2.19) is that the variance inflation due to simulation is approximately

\frac{4\pi n p(1-p)\exp(z_p^2)}{z_p^2 R} = \frac{n\,d(p)}{R},

say. Some values of d(p) are as follows.

    p or 1 − p    0.01    0.025    0.05    0.10    0.25
    d(p)          5.15    3.72     3.30    3.56    8.16

So to make the variance inflation factor 10% for the 0.025 quantile, for example, we would need R = 40n. Equation (2.19) may not be useful in the centre of the distribution, where d(p) is very large because z_p is small.
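The entries in the small table above follow directly from the formula for d(p) (Python with scipy for the normal quantile; our illustration):

    import numpy as np
    from scipy.stats import norm

    def d(p):
        z = norm.ppf(p)
        return 4 * np.pi * p * (1 - p) * np.exp(z ** 2) / z ** 2

    for p in (0.01, 0.025, 0.05, 0.10, 0.25):
        print(p, round(d(p), 2))   # 5.15, 3.72, 3.30, 3.56, 8.16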
Example 2.14 (Air-conditioning data)  To see how well this discussion applies in practice, we look briefly at results for the data in Example 1.1. The statistic of interest is T = log Ȳ, which estimates θ = log μ. The true model for Y is taken to be the gamma distribution with index κ = 0.71 and mean μ = 108.083; these are the data estimates. Effects due to simulation error are approximated



Table 2.3  Components of variance (×10⁻³) in bootstrap estimation of the p quantile of log Ȳ − log μ, due to data variation and simulation variation, based on nonparametric simulation applied to the data of Example 1.1.

                                             p
    Source                 Type          0.01    0.99    0.05    0.95    0.10    0.90
    Data                   actual        31.0     6.9    14.0     3.6     8.3     2.2
                           theoretical   26.6    26.6    13.3    13.3     8.1     8.1
    Simulation, R = 100    actual        53.6     9.4     8.5     3.2     3.8     2.6
                           theoretical   32.9    32.9    10.5    10.5     6.9     6.9
    Simulation, R = 500    actual         4.3     2.4     2.0     0.6     1.2     0.4
                           theoretical    6.6     6.6     2.1     2.1     1.4     1.4
    Simulation, R = 1000   actual         2.2     0.8     1.5     0.1     0.8     0.2
                           theoretical    3.3     3.3     1.0     1.0     0.7     0.7

by taking sets of R simulations from one long nonparametric simulation of 9999 datasets. Table 2.3 shows the actual components of variation due to simulation and data variation, together with the theoretical components in (2.19), for estimates of quantiles of log Ȳ − log μ. On the whole the theory gives a fairly accurate prediction of performance.

It is not necessarily best to choose R solely on the basis of the variance inflation factor. For example, if we had been discussing the studentized statistic Z defined by (2.11) and its quantiles, then the component of variation due to data variance would be approximately zero to the accuracy used in (2.18), based on the N(0, 1) approximation. So the variance inflation factor would be enormous. What really counts is the effect of the simulation on the final result, say the length and coverage of the confidence interval. This presents a much more delicate question (Problem 5.5).

Another way to estimate quantiles for T − θ is by normal approximation with bootstrap estimates of bias and variance. Similar calculations of simulation error are possible; see Problem 2.7. In general the normal approximation is suspect, although its applicability can be assessed by a normal Q-Q plot of the simulated t* values.

2.6 Statistical Issues


2.6.1 When does the bootstrap work?
Consistency
There are two senses in which resampling methods might work. First, do they give reliable results when used with the sort of data encountered in practice? This question is crucial in applications, and is a major focus of this book. It leads one to consider how the resamples themselves can be used to tell


when and how a bootstrap calculation might fail, and ideally how it should be amended to yield useful answers. This topic of bootstrap diagnostics is discussed more fully in Section 3.10.

A second question is: under what idealized conditions will a resampling procedure produce results that are in some sense mathematically correct? Answers to questions of this sort involve an asymptotic framework in which the sample size n→∞. Although such asymptotics are ultimately intended to guide practical work, they often act only as a backstop, by removing from consideration procedures that do not have appropriate large-sample properties, and are usually not subtle enough to discriminate among competing procedures according to their finite-sample characteristics. Nevertheless it is essential to appreciate when a naive application of the bootstrap will fail.

To put the theoretical basis for the bootstrap in simple terms, suppose that we have a random sample y_1, ..., y_n, or equivalently its EDF F̂, from which we wish to estimate properties of a standardized quantity Q = q(Y_1, ..., Y_n; F). For example, we might take

Q(Y_1, \ldots, Y_n; F) = n^{1/2}\left\{\bar{Y} - \int y\, dF(y)\right\} = n^{1/2}(\bar{Y} - \theta),

say, and want to estimate the distribution function

G_{F,n}(q) = \Pr\{Q(Y_1, \ldots, Y_n; F) \le q \mid F\},    (2.20)

where the conditioning on F indicates that Y_1, ..., Y_n is a random sample from F. The bootstrap estimate of (2.20) is

G_{\hat{F},n}(q) = \Pr\{Q(Y^*_1, \ldots, Y^*_n; \hat{F}) \le q \mid \hat{F}\},    (2.21)

where in this case Q(Y*_1, ..., Y*_n; F̂) = n^{1/2}(Ȳ* − ȳ). In order for G_{F̂,n} to approach G_{F,n} as n→∞, three conditions must hold. Suppose that the true distribution F is surrounded by a neighbourhood 𝒩 in a suitable space of distributions, and that as n→∞, F̂ eventually falls into 𝒩 with probability one. Then the conditions are:

1  for any A ∈ 𝒩, G_{A,n} must converge weakly to a limit G_{A,∞};
2  this convergence must be uniform on 𝒩; and
3  the function mapping A → G_{A,∞} must be continuous.

Here weak convergence of G_{A,n} to G_{A,∞} means that as n→∞,

\int h(u)\, dG_{A,n}(u) \to \int h(u)\, dG_{A,\infty}(u)

for all integrable functions h(·). Under these conditions the bootstrap is consistent, meaning that for any q and ε > 0, Pr{|G_{F̂,n}(q) − G_{F,∞}(q)| > ε} → 0 as n→∞.


The first condition ensures that there is a limit for G_{F̂,n} to converge to, and would be needed even in the happy situation where F̂ equalled F for every n ≥ n′, for some n′. Now as n increases, F̂ changes, so the second and third conditions are needed to ensure that G_{F̂,n} approaches G_{F,∞} along every possible sequence of F̂s. If any one of these conditions fails, the bootstrap can fail.

Example 2.15 (Sample maximum)  Suppose that Y_1, ..., Y_n is a random sample from the uniform distribution on (0, θ). Then the maximum likelihood estimate of θ is the largest sample value, T = Y_{(n)}, where Y_{(1)} ≤ ··· ≤ Y_{(n)} are the sample order statistics. Consider nonparametric resampling. The limiting distribution of Q = n(θ − T)/θ is standard exponential, and this suggests that we take our standardized quantity to be Q* = n(t − T*)/t, where t is the observed value of T, and T* is the maximum of a bootstrap sample of size n taken from y_1, ..., y_n. As n→∞, however,

\Pr(Q^* = 0 \mid \hat{F}) = \Pr(T^* = t \mid \hat{F}) = 1 - (1 - n^{-1})^n \to 1 - e^{-1},

and consequently the limiting distribution of Q* cannot be standard exponential. The problem here is that the second condition fails: the distributional convergence is not uniform on useful neighbourhoods of F. Any fixed order statistic Y_{(k)} suffers from the same difficulty, but a statistic like a sample quantile, where we would take k = pn for some fixed 0 < p < 1, does not.

Asymptotic accuracy
Here and below we write X_n = O_p(n^d) when n^{-d}X_n is bounded in probability as n→∞, and X_n = o_p(n^d) when Pr(n^{-d}|X_n| > ε) → 0 as n→∞, for every ε > 0.

Consistency is a weak property, for example guaranteeing only that the true probability coverage of a nominal (1 − 2α) confidence interval is 1 − 2α + o_p(1). Standard normal approximation methods are consistent in this sense. Once consistency is established, meaning that the resampling method is valid, we need to know whether the method is good relative to other possible methods. This involves looking at the rate of convergence to nominal properties. For example, does the coverage of the confidence interval deviate from (1 − 2α) by O_p(n^{-1/2}) or by O_p(n^{-1})? Some insight into this can be obtained by expansion methods, as we now outline. More detailed calculations are made in Section 5.4.

Suppose that the problem is one where the limiting distribution of Q is standard normal, and where an Edgeworth expansion applies. Then the distribution of Q can be written in the form

\Pr(Q \le q \mid F) = \Phi(q) + n^{-1/2} a(q) \phi(q) + O(n^{-1}),    (2.22)

where Φ(·) and φ(·) are the CDF and PDF of the standard normal distribution, and a(·) is an even quadratic polynomial. For a wide range of problems it can be shown that the corresponding approximation for the bootstrap version of Q is

\Pr(Q^* \le q \mid \hat{F}) = \Phi(q) + n^{-1/2} \hat{a}(q) \phi(q) + O_p(n^{-1}),    (2.23)


where â(·) is obtained by replacing unknowns in a(·) by estimates. Now typically â(q) = a(q) + O_p(n^{-1/2}), so

\Pr(Q^* \le q \mid \hat{F}) - \Pr(Q \le q \mid F) = O_p(n^{-1}).    (2.24)

Thus the estimated distribution for Q differs from the true distribution by a term that is O_p(n^{-1}), provided that Q is constructed in such a way that it is asymptotically pivotal. A similar argument will typically hold when Q has a different limiting distribution, provided it does not depend on unknowns.

Suppose that we choose not to standardize Q, so that its limiting distribution is normal with variance v. An Edgeworth expansion still applies, now with form

\Pr(Q \le q \mid F) = \Phi\left(\frac{q}{v^{1/2}}\right) + n^{-1/2} a'\left(\frac{q}{v^{1/2}}\right) \phi\left(\frac{q}{v^{1/2}}\right) + O(n^{-1}),    (2.25)

where a′(·) is a quadratic polynomial that is different from a(·). The corresponding expansion for Q* is

\Pr(Q^* \le q \mid \hat{F}) = \Phi\left(\frac{q}{\hat{v}^{1/2}}\right) + n^{-1/2} \hat{a}'\left(\frac{q}{\hat{v}^{1/2}}\right) \phi\left(\frac{q}{\hat{v}^{1/2}}\right) + O_p(n^{-1}).    (2.26)

Typically v̂ = v + O_p(n^{-1/2}), which would imply that

\Pr(Q^* \le q \mid \hat{F}) - \Pr(Q \le q \mid F) = O_p(n^{-1/2}),    (2.27)

because the leading terms on the right-hand sides of (2.25) and (2.26) are different.

The difference between (2.24) and (2.27) explains our insistence on working with approximate pivots whenever possible: use of a pivot will mean that a bootstrap distribution function is an order of magnitude closer to its target. It also gives a cogent theoretical motivation for using the bootstrap to set confidence intervals, as we now outline.

We can obtain the α quantile of the distribution of Q by inverting (2.22), giving the Cornish-Fisher expansion

q_\alpha = z_\alpha + n^{-1/2} a''(z_\alpha) + O(n^{-1}),

where z_α is the α quantile of the standard normal distribution, and a″(·) is a further polynomial. The corresponding bootstrap quantile q̂_α has the property that q̂_α − q_α = O_p(n^{-1}). For simplicity take Q = (T − θ)/V^{1/2}, where V estimates the variance of T. Then an exact one-sided confidence interval for θ based on Q would be I_α = [T − V^{1/2} q_α, ∞), and this contains the true θ with probability α. The corresponding bootstrap interval is Î_α = [T − V^{1/2} q̂_α, ∞), where q̂_α is the α quantile of the distribution of Q*, which would often be estimated by simulation, as we have seen. Since q̂_α − q_α = O_p(n^{-1}), we have

\Pr(\theta \in I_\alpha) = \alpha, \qquad \Pr(\theta \in \hat{I}_\alpha) = \alpha + O(n^{-1}),


so that the actual probability that Î_α contains θ differs from the nominal probability by only O(n^{-1}). In contrast, intervals based on inverting (2.25) will contain θ with probability α + O(n^{-1/2}). Such an interval is in principle no more accurate than the interval [T − V^{1/2} z_α, ∞) obtained by assuming that the distribution of Q is standard normal. Thus one-sided confidence intervals based on quantiles of Q* have an asymptotic advantage over the use of a normal approximation. Similar comments apply to two-sided intervals.

The practical usefulness of such results will depend on the numerical value of the difference (2.24) at the values of q of interest, and it will always be wise to try to decrease this statistical error, as outlined in Section 2.5.1.

The results above based on Edgeworth expansions apply to many common statistics: smooth functions of sample moments, such as means, variances, higher moments, and eigenvalues and eigenvectors of covariance matrices; smooth functions of solutions to smooth estimating equations, such as most maximum likelihood estimators, estimators in linear and generalized linear models, and some robust estimators; and many statistics calculated from time series.

2.6.2 Rough statistics: unsmooth and unstable


What typically validates the bootstrap is the existence of an Edgeworth expansion for the statistic of interest, as would be the case when that statistic is a differentiable function of sample moments. Some statistics, such as sample quantiles, depend on the sample in an unsmooth or unstable way such that standard expansion theory does not apply. Often the nonparametric resampling method will still be valid, in the sense that it is consistent, but for finite samples it may not work very well. Part of the reason for this is that the set of possible values for T* may be very small, and very vulnerable to unusual data points. A case in point is that of sample quantiles, the most familiar of which, the sample median, is discussed in the next example. Example 2.15 gives a case where naive resampling fails completely.
Example 2.16 (Sample median)  Suppose that the sample size is odd, n = 2m + 1, so that the sample median is ỹ = y_{(m+1)}. In large samples the median is approximately normally distributed about the population median μ, but standard nonparametric methods of variance estimation (jackknife and delta method) do not work here (Example 2.19, Problem 2.17). Nonparametric resampling does work to some extent, provided the sample size is quite large and the data are not too dirty. Crucially, bootstrap confidence limits work quite well.

Note first that the bootstrap statistic Ỹ* is concentrated on the sample values y_{(k)}, which makes the estimated distribution of the median very discrete and very vulnerable to unusual observations. Problem 2.4 shows that the exact


distribution of Ỹ* is

\Pr(\tilde{Y}^* = y_{(k)}) = \sum_{j=0}^{m} \binom{n}{j} p_{k-1}^{\,j} (1 - p_{k-1})^{n-j} - \sum_{j=0}^{m} \binom{n}{j} p_k^{\,j} (1 - p_k)^{n-j},    (2.28)

for k = 1, ..., n, where p_k = k/n; simulation is not needed in this case.

Table 2.4  Theoretical, empirical and mean bootstrap estimates of variance (×10⁻²) of the sample median, based on 10 000 datasets of sizes n = 11, 21. The effective degrees of freedom of the bootstrap variances uses a χ² approximation to their distribution.

                          Normal              t₃               Cauchy
                      n = 11   n = 21   n = 11   n = 21   n = 11     n = 21
    Theoretical         14.3      7.5     16.8      8.8      22.4      11.7
    Empirical           13.9      7.3     19.1      9.5      38.3      14.6
    Mean bootstrap      17.2      8.8     25.9     11.4     14000      22.8
    Effective df         4.3      5.4      3.2      4.9      0.002      0.5

The
moments of this bootstrap distribution, including its mean and variance, converge to the correct values as n increases. However, the convergence can be very slow. To illustrate this, Table 2.4 compares the average bootstrap variance with the empirical variance of the median for data samples of sizes n = 11 and 21 from the standard normal distribution, the Student-t distribution with three degrees of freedom, and the Cauchy distribution; also shown are the theoretical variance approximations, which are incalculable when the true distribution F is unknown. We see that the bootstrap variance can be very poor for n = 11 when distributions are long-tailed. The value 1.4 × 10⁴ for average bootstrap variance with Cauchy data is not a mistake: the bootstrap variance exceeds 100 for about 1% of datasets, and for some samples the bootstrap variance is huge. The situation stabilizes when n reaches 40 or more.
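Formula (2.28) is straightforward to evaluate; a sketch in Python (scipy's binomial CDF assumed; names ours):

    import numpy as np
    from scipy.stats import binom

    def bootstrap_median_pmf(n):
        # Pr(median* = y_(k)) for odd n = 2m + 1, from (2.28)
        m = (n - 1) // 2
        k = np.arange(1, n + 1)
        return binom.cdf(m, n, (k - 1) / n) - binom.cdf(m, n, k / n)

    p = bootstrap_median_pmf(11)
    print(p.sum())                   # 1.0: the atoms y_(1), ..., y_(n) carry all the mass

    rng = np.random.default_rng(9)
    y = np.sort(rng.standard_cauchy(11))
    print(np.sum(p * y ** 2) - np.sum(p * y) ** 2)   # exact bootstrap variance of the median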
The gross discreteness of Ỹ* could also affect the simple confidence limit method described in Section 2.4. But provided the inequalities used to justify (2.10) are taken to be ≤ and ≥ rather than < and >, the method works well. For example, for Cauchy samples of size n = 11 the coverage of the 90% basic bootstrap confidence interval (2.10) is 90.8% in 1000 samples; see Problem 2.4. We suggest adopting the same practice for all problems where t* is supported on a small number of values.

The statistic T will certainly behave wildly under resampling when t(F) does not exist, as happens for the mean when F is a Cauchy distribution. Quite naturally, over repeated samples the bootstrap will produce silly and useless results in such cases. There are two points to make here. First, if data are taken from a real population, then such mathematical difficulties cannot arise. Secondly, the standard approaches to data analysis include careful screening of data for outliers, nonnormality, and so forth, which leads either to deletion of disruptive data elements or to sensible and reliable choices of estimators

T. In short, the mathematical pathology of nonexistence is unlikely to be a practical problem.

2.6.3 Conditional properties


Resampling calculations are based on the observed data, and in that sense resampling methods are conditional on the data. This is especially so in the nonparametric case, where nothing but the data is used. Because of this, the question is sometimes asked: are resampling methods therefore conditional in the inferential sense? The short answer is no, at least not in any useful way, unless the relevant conditioning can be made explicit.

Conditional inference arises in parametric inference when the sufficient statistic includes an ancillary statistic A whose distribution is free of parameters. Then we argue that inferences about parameters (e.g. confidence intervals) should be based on sampling distributions conditional on the observed value of A; this brings inference more into line with Bayesian inference. Two examples are the configuration of residuals in location models, and the values of explanatory variables in regression models. The first cannot be accommodated in nonparametric bootstrap analysis because the effect depends upon the unknown F. The second can be accommodated (Chapter 6) because the effect does not depend upon the stochastic part of the model. It is certainly true that the bootstrap distribution of T* will reflect ancillary features of the data, as in the case of the sample median (Example 2.16), but the reflection is pale to the point of uselessness.

There are situations where it is possible explicitly to condition the resampling so as to provide conditional inference. Largely these situations are those where there is an experimental ancillary statistic, as in regression. One other situation is discussed in Example 5.17.

2.6.4 When might the bootstrap fail?


Incomplete data
So far we have assumed that F is the distribution of interest and that the sample y_1, ..., y_n drawn from F has nothing removed before we see it. This might be important in several ways, not least in guaranteeing statistical consistency of our estimator T. But in some applications the observation that we get may not always be y itself. For example, with survival data the ys might be censored, meaning that we may only learn that y was greater than some cut-off c, because observation of the subject ceased before the event which determines y. Or, with multiple measurements on a series of patients it may be that for some patients certain measurements could not be made because the patient did not consent, or the doctor forgot.


Under certain circumstances the resampling methods we have described will work, but in general it would be unwise to assume this without careful thought. Alternative methods will be described in Section 3.6.

Dependent data
In general the nonparametric resampling method that we have described will not work for dependent data. This can be illustrated quite easily in the case where the data y_1, ..., y_n form one realization of a correlated time series. For example, consider the sample average ȳ, and suppose that the data come from a stationary series {Y_j} whose marginal variance is σ² = var(Y_j) and whose autocorrelations are ρ_h = corr(Y_j, Y_{j+h}) for h = 1, 2, .... In Example 2.7 we showed that the nonparametric bootstrap estimate of the variance of Ȳ is approximately s²/n, and for large n this will approach σ²/n. But the actual variance of Ȳ is

\operatorname{var}(\bar{Y}) = \frac{\sigma^2}{n}\left\{1 + 2\sum_{h=1}^{n-1}\left(1 - \frac{h}{n}\right)\rho_h\right\}.

The sum here would often differ considerably from zero, and then the bootstrap estimate of variance would be badly wrong.
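The failure is easy to demonstrate by simulation; a sketch (Python, our illustration) with a stationary AR(1) series, for which ρ_h = ρ^h:

    import numpy as np

    rng = np.random.default_rng(11)
    n, rho = 200, 0.6

    e = rng.normal(size=n)
    y = np.empty(n)
    y[0] = e[0] / np.sqrt(1 - rho ** 2)              # stationary start
    for j in range(1, n):
        y[j] = rho * y[j - 1] + e[j]

    v_boot = y.var() / n                             # naive bootstrap estimate of var(Ybar)
    h = np.arange(1, n)
    factor = 1 + 2 * np.sum((1 - h / n) * rho ** h)  # about 4 when rho = 0.6
    print(v_boot, factor * y.var() / n)              # the true variance is about 4 times larger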
Similar problems arise with other forms of dependent data. The essence of the problem is that simple bootstrap sampling imposes mutual independence on the Y_j, effectively assuming that their joint CDF is F(y_1) × ··· × F(y_n), and thus sampling from its estimate F̂(y_1) × ··· × F̂(y_n). This is incorrect for dependent data. The difficulty is that there is no obvious way to estimate a general joint density for Y_1, ..., Y_n given one realization. We shall explore this important subject further in Chapter 8.

Weakly dependent data occur in the altogether different context of finite population sampling. Here the basic nonparametric resampling methods work reasonably well. More will be said about this in Section 3.7.
Dirty data
What if simulated resampling is used when there are outliers in the data? There is no substitute for careful data scrutiny in this or any other statistical context, and if obvious outliers are found, they should be removed or corrected. When there is a fitted parametric model, it provides a benchmark for plots of residuals and the panoply of statistical diagnostics, and this helps to detect poor model fit. When there is no parametric model, F is estimated by the EDF, and the benchmark is swept away because the data and the model are one and the same. It is then vital to look closely at the simulation output, in order to see whether the conclusions depend crucially on particular observations. We return to this question of sensitivity analysis in Section 3.10.


2.7 Nonparametric Approximations for Variance and Bias

2.7.1 Delta methods

In parametric analysis it is often possible to represent estimators T in terms of fundamental statistics U_1, ..., U_m, such as sample moments, for which exact or approximate distributional calculations are relatively easy. Then we can take advantage of the delta method to obtain distributional approximations for T itself.

Consider first the case of a scalar estimator T which is a function of the scalar statistic U based on a sample of size n, say T = g(U). Suppose that it is known that

U \mathrel{\dot\sim} N(\zeta, n^{-1}\sigma^2(\zeta)),

where ∼̇ means 'is approximately distributed as'.

Two formal expressions are U = ζ + o_p(1) and U = ζ + n^{-1/2}σ(ζ)Z + O_p(n^{-1}), where Z is a N(0, 1) variable. (In some cases the O_p(n^{-1}) remainder term in the second expression would be o_p(n^{-1/2}), but this would not affect the principal result of the delta method below.) The first of these corresponds to a statement of the consistency property of U, and the second amplifies this to state both the rate of convergence and the normal approximation in an alternative form.

Now consider T = g(U), where g(·) is a smooth function. We shall see below that provided ġ(ζ) ≠ 0,

T \mathrel{\dot\sim} N(\theta, n^{-1}\{\dot{g}(\zeta)\}^2 \sigma^2(\zeta)),

where θ = g(ζ); the dot indicates differentiation with respect to ζ. This result is what is usually meant by the delta method result, the principal feature being the delta method variance approximation

\operatorname{var}\{g(U)\} \doteq \{\dot{g}(\zeta)\}^2 \operatorname{var}(U).    (2.29)

To see why (2.29) should be true, note that if g(·) is smooth then T is consistent for θ = g(ζ), since

g(U) = g(\zeta + o_p(1)) = g(\zeta) + o_p(1).

Further, by Taylor series expansion we can write

T = g(U) = g(\zeta) + (U - \zeta)\dot{g}(\zeta) + \tfrac{1}{2}(U - \zeta)^2 \ddot{g}(\zeta) + O_p(n^{-3/2}),    (2.30)

since the remainder is proportional to (U − ζ)³. A truncated version of the series expansion is

T = g(U) = g(\zeta) + (U - \zeta)\dot{g}(\zeta) + o_p(n^{-1/2}).    (2.31)

From the latter, we can see that the normal approximation for U implies that

T = g(\zeta) + n^{-1/2}\dot{g}(\zeta)\sigma(\zeta)Z + o_p(n^{-1/2}),

which in turn entails (2.29).


Nothing has yet been said about the bias of T, which would usually be hidden in the O_p(n^{-1}) term. If we take the larger expansion (2.30), ignore the remainder term, and take expectations, we obtain

E(T) \doteq g(\zeta) + \dot{g}(\zeta) E(U - \zeta) + \tfrac{1}{2}\ddot{g}(\zeta) E\{(U - \zeta)^2\},

or, if U is unbiased for ζ,

E(T) \doteq \theta + \frac{1}{2n}\ddot{g}(\zeta)\sigma^2(\zeta).
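As a concrete numerical check of these approximations (our example, not the book's): take U = Ȳ from an exponential sample with mean μ, so that ζ = μ and σ²(ζ) = μ², and let g(u) = log u, so that ġ(ζ) = 1/μ and g̈(ζ) = −1/μ²:

    import numpy as np

    rng = np.random.default_rng(5)
    n, mu, reps = 12, 108.083, 50000
    ybar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
    T = np.log(ybar)

    print(T.var(), 1 / n)                       # (2.29): g'(mu)^2 var(U) = 1/n
    print(T.mean() - np.log(mu), -1 / (2 * n))  # bias: g''(mu) sigma^2(mu)/(2n) = -1/(2n)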
These results extend quite easily to the case of vector U and vector T, as outlined in Problem 2.9. The extension includes the case where U is the set of observed frequencies f_1, ..., f_m when Y is discrete with probabilities π_1, ..., π_m on m possible values. Then the analogue of (2.31) is

T = g(\pi_1, \ldots, \pi_m) + \sum_{j=1}^{m} \left(\frac{f_j}{n} - \pi_j\right) \frac{\partial g(\pi_1, \ldots, \pi_m)}{\partial \pi_j} + o_p(n^{-1/2}).    (2.32)

In this case the normal approximation for f_1, ..., f_m is easy to derive, but is singular because of the constraints Σ f_j = n and Σ π_j = 1. In effect (2.32) provides a version of the nonparametric delta method, restricted to discrete data problems. In the next subsection we extend the expansion method to the general nonparametric case.

2.7.2 Influence function and nonparametric delta method
There is a simple variance approximation for many statistics T with the representation t(F). The key idea is an extension of the Taylor series expansion to statistical functions, which allows us to extend (2.32) to continuous distributions. The linear form of the expansion is

t(G) = t(F) + \int L_t(y; F)\, dG(y),    (2.33)

where L_t, the first derivative of t(·) at F, is defined by

L_t(y; F) = \lim_{\varepsilon \to 0} \frac{t\{(1-\varepsilon)F + \varepsilon H_y\} - t(F)}{\varepsilon} = \left.\frac{\partial t\{(1-\varepsilon)F + \varepsilon H_y\}}{\partial \varepsilon}\right|_{\varepsilon=0},    (2.34)

with H_y(u) = H(u − y) the Heaviside or unit step function jumping from 0 to 1 at u = y. In this form the derivative satisfies ∫ L_t(y; F) dF(y) = 0, as is seen on setting G = F in (2.33). Often the function L_t(y) = L_t(y; F) is called the influence function of T, and its empirical approximation l(y) = L_t(y; F̂) is called the empirical influence function. The particular values l_j = l(y_j) are called the empirical influence values.

2.7 Nonparametric Bias and Variance

47

T he nonparametric delta method comes from applying the first-order approx


im ation (2.33) w ith G = F,
t(F) = r(F) + / L ((y ; F)dF(y) = t(F) + - L , ( y j ; F).
J
H j-i

(2.35)

The right-hand side o f (2.35) is also know n as the linear approximation. We


apply the central lim it theorem to the sum on the right-hand side o f (2.35) and
obtain
T 9

N ( 0 , vl (F))

because f L , ( y ; F ) d F ( y ) = 0, where
vl (F) = n - 'v a r jL ^ Y )} = n~l

L 2{y)dF{y).

In practice vL( F) is approxim ated by substituting F for F in the result, th at


is by using the sam ple version
n

vL = vL(F) = n - 2 Y , j,

(2-36)

j=i

which is know n as the nonparametric delta method variance estimate. N ote th a t


(2.35) implies th at

L , ( y ; F) d F{ y ) = n-1 ^

lj = 0.

In some cases it m ay be difficult to evaluate the derivative (2.34) theoretically.


T hen a num erical approxim ation to the derivative can be m ade, th at is
,2.37)

w ith a sm all value o f e such as (100n)-1 . The same m ethod can be used
for em pirical influence values lj = L,(yj;F). A lternative approxim ations to
the em pirical influence values lj, which are all th a t are needed in (2.36), are
described in the following sections.
Example 2.17 (Average) Let t = y, corresponding to the statistical function
t(F) = f ydF(y). To apply (2.34) we write
{(1 e)F + eHy} = (1 e)fi + sy,
and differentiate to obtain
d{(l - e)n + ey}
M y) =

de

= y-H.
e=0

The em pirical influence function is therefore /(y) = y y, with lj = y j y . Thus


the delta m ethod variance approxim ation (2.36) is Vi = (n 1)s2/ n 2, where s2

48

2 The Basic Bootstraps

is the unbiased sam ple variance o f the yj. This differs by the factor (n 1)/n
from the m ore usual n o n p aram etric variance estim ate for y.
m
The m ean is an exam ple o f a linear statistic, whose general form is
/ a(y) dF(y). As the term inology suggests, linear statistics have zero derivatives
beyond the first; they have influence function a{y) E{a(Y)}. This applies to
all m om ents ab o u t zero; see Problem 2.10.
C om plicated statistics w hich are functions o f simple statistics can be dealt
w ith using the chain rule. So if t(F) = a { t i ( F ) ,...,r m(F)}, then
m

=E

<2-38)

i= l o ti

This can also be used to find the influence function for a transform ed statistic,
given the influence function for the statistic itself.
Example 2.18 (Correlation) The sam ple correlation is the sam ple version o f
the prod u ct m om ent correlation, w hich for the p air Y = ( U , X ) can be defined
in term s o f p rs = E ( U rX s) by

__ / r>\ ___

P n PioPoi
{(^ 2 0 -

P ? 0 )(W>2 -

/* o i)} l / r

T he influence functions o f m eans are given in Exam ple 2.17. F o r second


m om ents we have L ^ J u , x) = urx s /xrs, w hen r + s = 2, because the [irs are
linear statistics (Problem 2.10). The p artial derivatives o f p(-) with respect to
the ps are straightforw ard, and (2.38) leads to the influence function for the
correlation coefficient,
L p(u, x) = usx s \p{u] + x ] ),

(2.40)

where us = (u p io )/(P 20 - Pi0)l/2> an d xs = (x - pm )/(P 02 - Poi)l/2 are


standardized variates.
If we w anted to w ork w ith the stan d ard tran sfo rm ation = | log
whose derivative is ( l p 2)_ l, then an o th er application o f (2.38) shows th at
the influence function would be
L d u>x ) = j Z T p i {MsXs ~

+ xs)}

Example 2.19 (Quantile) The p quantile qp o f a distribution F w ith density


/ is defined as the solution to the equation F{qp(F)} = p. I f we set Fe(x) =
( l e)F(x) + e H( x y), we have
p = Fe{qp(Fe)} = ( l - e)F{qp(Fe)} + s H{ q p(Fe) - y}.

2.7 Nonparametric Bias and Variance


Table 2.5 Exact
em pirical influence
values and their
regression estimates for
the ratio applied to the
city population data
with n 10.

49

Ca s e
Exact

1
2
-1 .0 4
-0.58

3
-0.37

4
-0.19

R egression

-1.11

-0 .4 4

-0.38

-0.65

5
0.03
-0.04

6
0.11
0.12

7
0.09
0.13

8
0.20
0.27

9
1.02
1.16

10
0.73
0.94

Figure 2.11 Empirical


influence values for city
population example.
The left panel shows the
lj for the n = 10 cases;
the line has slope
t = 1.52. The right
panels show 999 values
o f t plotted against
jittered values o f
for j = 1,2,9,4
(clockwise from top
left); the lines have
slope lj and pass
through t when / ' = 0.

50

100 150 200 250 300

O n differentiating this w ith respect to e and setting e = 0, we find th at

H{qp - y ) ~ p
f (qP)

L q, {y ;F ) =

Evidently this has m ean zero.


T he approxim ate variance o f qp( F) is
vl

{F)

tT

_ P(1 ~ P )
/ LUy;F)dF(y) = n f{ q v
ip)

the em pirical version o f which requires an estim ate o f f ( q p). But since n o n
p aram etric density estim ates converge m uch m ore slowly th an estim ates o f
m eans, variances, and so forth, estim ation o f variance for quantile estim ates is
h ard er and requires m uch larger samples.

Example 2.20 (City population data) For the ratio estim ate t = x / u , calcu
lations in Problem 2.16 lead to em pirical influence values lj = (xj tuj)/u.
N um erical values for the city population d a ta o f size 10 are given in Table 2.5;
the regression estim ates are discussed in Exam ple 2.23. T he variance estim ate
is Vl =
= 0.182.

50

2 - The Basic Bootstraps

The lj are plotted in the left panel o f Figure 2.11. Values o f yj = (uj, x j )
close to the line x = tu have little effect on the ratio t. C hanging the d ata
by giving m ore weight to those yj w ith negative influence values, for which
(uj, Xj) lies below the line, would result in sm aller values o f t th a n th a t actually
observed, and conversely. We discuss the right panels in Exam ple 2.23.

In some applications the estim ator T will be defined by an estim ating


equation, the sim plest form being ^2 c(yj, t) = 0 such th at f c(y, 8)dF(y) = 0.
T hen the influence function for scalar t is (Problem 2.12)
L

( v

) _

,(y)

E { - c ( y ,0 ) }

where c = dc/dd. The corresponding em pirical influence values are therefore


nc(yj, t)
j

Z nyjJY

and the nonparam etric delta m ethod variance estim ate is


_ E ( to 0 1 1

{>;. Of2'

A simple illustration is Exam ple 2.20, where t is determ ined by the estim ating
function c(y, 6) = x 6u.
For som e purposes it is useful to go beyond the first derivative term in the
expansion o f t(F) and o btain the quad ratic approxim ation
t(F) = t(F) + j L t( y; F) dF(y) +

\jj

Qt(y, 2; F) dF(y)dF(z),

(2.41)

where the second derivative Qt( y , z ; F ) is defined by

dl d2

,=82=0

This derivative satisfies / Qt( x , y , F ) d F ( x ) = / Qt( x ,y ;F) dF{y ) = 0, b u t in


general J Q, ( x, x; F) dF ( x) ^ 0. T he values qjk = Qt(yj,yk',F) are em pirical
second derivatives o f t(-) analogous to the em pirical influence values lj. In
principle (2.41) will be m ore accurate th an (2.35).

2.7.3 Jackknife estimates


A n other ap p ro ach to approxim ating the influence function, b u t only a t the
sam ple values y \ , . . . , y themselves, is the jackknife. H ere lj is approxim ated by
ljackj = { n - W - t - j ) ,

(2.42)

where t - j is the estim ate calculated w ith y; om itted from the data. In effect
this corresponds to num erical approxim ation (2.37) using e = (n I)- 1 ; see
Problem 2.18.

2.7 Nonparametric Bias and Variance

51

The jackknife approxim ations to the bias and variance o f T are


1

bjack = ~ ~

Ijack,j,

Vjack = ^ ackj ~

It is reasonably straightforw ard to apply (2.33) w ith F - j and F in place o f G


an d F, respectively, to show th a t
IjackJ lj 5
see Problem 2.15.
Example 2.21 (Average) F or the sam ple average t = y and the case deletion
values are
= (ny y j ) / ( n 1) and so ljack,j = }j ~ V- This is the same as the
em pirical influence function because t is linear. The variance approxim ation in
(2.43) reduces to {n{n l )}-1 ^2(yj y)2 because bjack = 0; the denom inator
n 1 in the form ula for vjack was chosen to ensure th at this happens.

O ne application o f (2.43) is to show th a t in large sam ples the jackknife bias


approxim ation gives
n

bjack = E*(T") t = \ n ~ 2

Qjj'i
j=i

see Problem 2.15.


So far we have seen two ways to approxim ate the bias and variance o f T
using approxim ations to the influence function, nam ely the nonparam etric delta
m ethod and the jackknife m ethod. O ne can generalize the basic approxim ation
by using alternative num erical derivatives in these two m ethods.

2.7.4 Empirical influence values via regression


T he approxim ation (2.35) can also be applied to the b o o tstrap estim ate T*. If
the E D F o f the b o o tstra p sam ple is denoted by F*, then the analogue o f (2.35)
is
t(F*) = t(F) + - V L t(y*;F),
n
J
7=1

o r in sim pler n o tatio n


=

(2.44)

j- 1

say, where /* is the nu m b er o f times th a t y* equals yj, for j = 1, . . . , n . The


linear ap proxim ation (2.44) will be used several times in future chapters.
U nder the n o n p aram etric b o o tstrap the jo in t distribution o f the /* is m ulti
nom ial (Problem 2.19). It is easy to see th a t var(T *) = n~2
= vl , showing

2 The Basic Bootstraps

52

Figure 2.12 Plots of


linear approxim ation t*L
against r* for the ratio
applied to the city
population data, with
n = 10 (left panel), and
n = 49 (right panel).

th a t the b o o tstrap estim ate o f variance should be sim ilar to the nonparam etric
delta m ethod approxim ation.
Example 2.22 (City population data) The right panels o f Figure 2.11 show
how 999 resam pled values o f f* depend on -1 / j for four values o f j, for the
d ata w ith n = 10. T he lines w ith slope lj sum m arize fairly well how t depends
on /* , b u t the correspondence is n o t ideal.
A different way to see this is to p lo t t* against the corresponding t'L.
Figure 2.12 shows this for 499 replicates. The line shows where the values
for an exactly linear statistic would fall. The linear approxim ation is poor
for n = 10, b u t it is m ore accurate for the full dataset, where n = 49. In
Section 3.10 we outline how such plots m ay be used to find a suitable scale on
which to set confidence limits.

Expression (2.44) suggests a way to approxim ate the /,-s using the results o f
a b o o tstrap sim ulation. Suppose th a t we have sim ulated R sam ples from F as
described in Section 2.3. Define /* to be the frequency with which the d a ta
value yj occurs in the rth b o o tstrap sample. T hen (2.44) implies th a t
t; = t + ^

r = l,...,R.

j=i

This can be viewed as a linear regression equation for responses t* with


covariate values
and coefficients lj. We should, however, adjust for
the facts th a t E*(7 ) =f= t in general, th a t J2j h = 0, and th at J 2 j f r j = n- F r the
first o f these we add a general intercept term , or equivalently replace t with T .

2.7 Nonparametric Bias and Variance


F or the second two we d ro p the term

53
resulting in the regression equation

So the vector I = ( /j,___


i ) o f approxim ate values o f the lj is obtained with
the least-squares regression form ula
/ = (F*TF*)_1F*r d*,

(2.46)

where F* is the R x ( n 1) m atrix w ith (r,j) elem ent n-1 /*;, and the rth row
o f the R x 1 vector d* is t* f*. In fact (2.45) is related to an alternative,
o rthogonal expansion o f T in which the rem ainder term is uncorrelated
with the linear piece.
The several different versions o f influence produce different estim ates o f
v ar(T ). In general vl is an underestim ate, w hereas use o f the jackknife values
or the regression estim ates o f the Is will typically produce an overestim ate. We
illustrate this in Section 2.7.5.
Example 2.23 (City population data) For the previous exam ple o f the ratio
estim ator, Table 2.5 gives regression estim ates o f em pirical influence values,
obtained from R = 1000 samples. The exact estim ate v l for v a r(T ) is 0.036,
com pared to the value 0.043 obtained from the regression estimates. The
b o o tstrap variance is 0.042. For n = 49 the corresponding values are 0.00119,
0.00125 an d 0.00125.
O u r experience is th a t R m ust be in the hundreds to give a good regression
approxim ation to the em pirical influence values.

2.7.5 Variance estimates


In previous sections we have outlined the m erits o f studentized quantities

where V = v{F) is an estim ate o f v a r(T | F). O ne general way to obtain a


value for V is to set
M
v = (M - 1) 1
- 0 2>
m=1
where t ], . . . ,t 'M are calculated by b o o tstrap sam pling from F. Typically we
would take M in the range 50-200. N ote th at resam pling is needed to produce
a stan d ard erro r for the original value t o f T.

54

2 The Basic Bootstraps

Now suppose th a t we wish to estim ate the quantiles o f Z , using em pirical


quantiles o f b o o tstrap sim ulations
r=

(2-48)

Since M b o o tstrap sam ples from F were needed to obtain v, M bo o tstrap


sam ples from F ' are needed to produce v". T hus w ith R = 999 and M = 50,
we would require R ( M + 1) = 50949 sam ples in all, which seems prohibitively
large for m any applications. This suggests th a t we should replace u1/2 with a
standard error th a t involves no resam pling, as follows.
W hen a linear approxim ation (2.44) applies, we have seen th a t var(T* | F)
can be estim ated by v l = n~2 ^ l], where the lj = L ((y; ;F ) are the em pirical
influence values for t based on the E D F F o f y \ , . . . , y n- T he corresponding
variance estim ate for v a r(T | F ' ) is vLr = ri~2 ^ L 2{yy, F'), based on the
em pirical influence values for t at the E D F F o f y r l, . . . , y' rn. A lthough this
requires no furth er sim ulation, the L t( y \ F *) m ust be calculated for each o f
the R samples. If an analytical expression is know n for the em pirical influence
values, it will typically be straightforw ard to calculate the VLr- If not, num erical
differentiation can be used, though this is m ore tim e-consum ing. I f neither o f
these is feasible, we can use the furth er approxim ation
2

(2.49)
which is exact for a linear statistic. In effect this uses the usual form ula, with
lj replaced by L t(y*j\F) n-1 J 2 L t(y*k ;F) in the rth resam ple. However, the
right-hand side o f (2.49) can badly underestim ate v'Lr if the statistic is not close
to linear. A n im proved approxim ation is outlined in Problem 2.20.
Example 2.24 (City population data) Figure 2.13 com pares the variance a p
proxim ations for n = 10. T he top left panel shows v" with M = 50 plotted
against the values
n

for R = 200 b o o tstrap samples. T he top right panel shows the values o f the
approxim ate variance on the right o f (2.49), also plotted against v'L. T he lower
panels show Q -Q plots o f the corresponding z* values, with (t* t ) / v ^ /2 on
the horizontal axis. Plainly vL underestim ates v', though not so severely as to
have a big effect on the studentized b o o tstrap statistic. But the right o f (2.49)
underestim ates v'L to an extent th a t greatly changes the distribution o f the
corresponding studentized b o o tstrap statistics.

2.8 Subsampling Methods

55

Figure 2.13 Variance


approxim ations for the
city population data,
n 10. The top panels
com pare the bootstrap
variance v* calculated
with M = 50 and the
right o f (2.49) with v*L
for R = 200 samples.
The bottom panels
com pare the
corresponding
studentized bootstrap
statistics.

co
>

Q_

2
2o
o

CO

vL*

T he rig h t-h an d panels o f the corresponding plots for the full d a ta show m ore
nearly linear relationships, so it appears th a t (2.49) is a b etter approxim ation
at sample size n = 49. In practice the sam ple size cannot be increased, and
it is necessary to seek a tran sfo rm ation o f t to attain approxim ate linearity.
T he tran sfo rm atio n outlined in Exam ple 3.25 greatly increases the accuracy o f
(2.49), even w ith n = 10.

2.8 Subsampling Methods


Before and after the developm ent o f nonparam etric b o o tstrap m ethods, other
m ethods based on subsam ples were developed to deal with special problems.

56

2 The Basic Bootstraps

We briefly review three such m ethods here. The first two are in principle
superior to resam pling for certain applications, although their com petitive
m erits in practice are largely untested. T he third m ethod provides an alternative
to the nonparam etric delta m ethod for variance approxim ation.

2.8.1 Jackknife methods


In Section 2.7.3 we m entioned briefly the jacknife m ethod in connection with
estim ating the variance o f T, using the values
o f t obtained when each case
is deleted in turn. G eneralized versions o f the jackknife have also been proposed
for estim ating the distribution o f T 0, as alternatives to the bootstrap. For
this to work, the jackknife m ust be generalized to m ultiple case deletion. For
example, suppose th a t we delete d observations rath er th an one, there being
N = (j) ways o f doing this; this is the sam e thing as taking all subsets o f size
n d. The full set o f group-deletion estim ates is t{,. . . , tfN , say. The em pirical
distribution o f
t will approxim ate the distribution o f T 6 only if we
renorm alize to rem ove the discrepancy in sam ple sizes, n d versus n. So if
T 6 = Op(n~a), we take the em pirical distribution o f
z f = (n - d)a{S - t)

(2.50)

as the delete-^ jackknife approxim ation to the distribution o f Z = na( T 6).


In practice we would n o t use all N subsam ples o f size n d, b u t rath er R
random subsam ples, ju st as with ordinary resampling.
In principle this m ethod will apply m uch m ore generally th an b o o tstrap
resam pling. But to w ork in practice it is necessary to know a and to choose d
so th at n d>oo and d /n >1 as n increases. T herefore the m ethod will work
only in rath er special circum stances.
N ote th a t if n d is small relative to n, then the m ethod is not very different
from a generalized b o o tstrap th a t takes sam ples o f size n d ra th er th an n.

Example 2.25 (Sample maximum) We referred earlier to the failure o f the


boo tstrap w hen applied to the largest o rd er statistic t = y(n), which estim ates
the upper lim it o f a distribution on [0,0]. The jackknife m ethod applies here
w ith a = 1, as n(9 T ) is approxim ately exponential w ith m ean 6 for uniform ly
distributed ys. However, em pirical evidence suggests th a t the jackknife m ethod
requires a very large sam ple size in o rd er to give good results. For example,
if we take sam ples o f n = 100 uniform variables, for values o f d in the range
80-95 the distrib u tio n o f (n d)(t T +) is close to exponential, but the m ean
is w rong by a factor th a t can vary from 0.6 to 2.

2.8 Subsampling M ethods

57

2.8.2 All-subsamples method


A different type o f subsam pling consists o f taking all N = 2" 1 non-em pty
subsets o f the data. This can be applied to a lim ited type o f problem , including
M -estim ation where m ean /i is estim ated by the solution t to the estim ating
equation ^ c(yj t) = 0. If the ordered estim ates from subsets are denoted
by tJj ),. . . , f[N), then rem arkably fi is equally likely to be in any o f the N + 1
intervals

Hence confidence intervals for fi can be determ ined. In practice one w ould take
a ran d o m selection o f R such subsets, and attach equal probability ( R + I)-1
to the R + 1 intervals defined by the R ff values. It is unclear how efficient
this m ethod is, and to w hat extent it can be generalized to o th er estim ation
problems.

2.8.3 Half-sampling methods


T he jackknife m ethod for estim ating v a r(T ) can be extended to deal with
estim ates based on m any samples, b u t in one special circum stance there is
another, sim pler subsam pling m ethod. O riginally this was proposed for samplesurvey d a ta consisting o f stratified sam ples o f size 2. To fix ideas, suppose th at
we have sam ples o f size 2 from each o f m strata, and th a t we estim ate the
p o pulation m ean n by the w eighted average t = Y27=i wifi^ these weights
reflect stratu m sizes. The usual estim ate for v a r(T ) is v = J 2 wf sf with sj the
sam ple variance for the ith stratum . The half-sam pling m ethod is designed to
reproduce this variance estim ate using only subsam ple values o f t, ju st as the
jackknife does. T hen the m ethod can be applied to m ore com plex problems.
In the present context there are N = 2m half-sam ples form ed by taking one
elem ent from each stratu m sample. If ft denotes the estim ator calculated on
such a half-sam ple, then clearly ft t equals \
~ y a ) c ] , where cj = +1
according to which o f yn and y,%is in the half-sam ple. D irect calculation shows
th a t for a ran d o m half-sam ple E (T t T )2 = jv a r(T ), so th a t an unbiased
estim ate o f v a r(T ) is obtained by doubling the average o f (ft t)2 over all
N half-sam ples: this average equals the usual estim ate given earlier. But it is
unnecessary to use all N half-sam ples. If, say, we use R half-sam ples, then we
require th at

2 The Basic Bootstraps

58
From the earlier representation for
.

[ 1

s
r= l

i E
I

- 1 we see th a t this implies th at


1

m m

wf ( yn - y a )1 +

i= 1

j(yn - y a ) { y n - yj i)
i= l j = 1

equals
1 m
4 E
i=l

>2)2-

For this to hold for all d a ta values we m ust have


= 0 for all i j.
This is a stan d ard problem arising in factorial design, and is solved by w hat
are know n as P lackett-B urm an designs. If the rth half-sam ple coefficients cfrj
form the rth row o f the R x m m atrix C +, and if every observation occurs in
exactly | R half-sam ples, then C +TC f = rnlmxm. In general the ith colum n o f C +
can be expressed as ( c y, .
1) w ith the first R 1 elem ents obtained
by i 1 cyclic shifts o f c i j , . . . , For exam ple, one solution for m = 7 with
R = 8 is
-1
-1
+1 - 1 +1
+ 1 ni
( +l
+1 +1 - 1 - 1 +1 - 1 +1
+1 +1 +1 - 1 - 1 +1 - 1
-1
+1 +1 +1 - 1 - 1 - 1
+1 - 1 +1 +1 +1 - 1 - 1
-1
+1 - 1 +1 +1 +1 - 1
-1
-1
+1 - 1 +1 +1 +1
U i
-1
-1
-1
-1
-1
1)
This solution requires th a t R be the first m ultiple o f 4 greater th a n or equal
to m. The half-sam ple designs for m = 4 ,5 ,6 ,7 are the first in colum ns o f this
C + m atrix.
In practice it would be com m on to double the half-sam pling design by
adding its com plem ent C \ which adds furth er balance.
It is fairly clear th a t the half-sam pling m ethod extends to stratum sample
sizes k larger th a n 2. The basic idea can be seen clearly for linear statistics o f
the form
m

t= n + X
i= 1

k~l E

= ^ + E

7=1

i= l

a,>
j= l

say. Suppose th a t in the rth subsam ple we take one observation from each
stratum , as specified by the zero -o n e indicator c jy . T hen
'! - , = E

cl,,j(aU - a,),

which is a linear regression m odel w ithout erro r in which the atj a, are
coefficients and the
are covariate values to be determ ined. If the ay a,

2.9 Bibliographic Notes

59

can be calculated, then the usual estim ate o f v ar(T ) can be calculated. The
choice o f
- values corresponds to selection o f a fractional factorial design,
w ith only m ain effects to be calculated, and this is solved by a Plackett-B urm an
design. O nce the subsam pling design is obtained, the estim ate o f v a r(T ) is a
form ula in the subsam ple values tj. The same form ula w orks for any statistic
th a t is approxim ately linear.
The same principles apply for unequal stratum sizes, although then the
solution is m ore com plicated and m akes use o f orthogonal arrays.

2.9 Bibliographic Notes


T here are two key aspects to the m ethods described in this chapter. The first
is th a t in o rd er for statistical inference to proceed, an unknow n distribution
F m ust be replaced by an estim ate. In a param etric m odel, the estim ate is a
p aram etric distribution F$, w hereas in a nonparam etric situation the estim ate
is the em pirical distribution function or som e m odification o f it (Section 3.3).
A lthough the use o f the E D F to estim ate F m ay seem novel a t first sight, it is
a n atu ral developm ent o f replacing F by a param etric estim ate. We have seen
th a t in essence the E D F will produce results sim ilar to those for the nearest
param etric model.
The second aspect is the use o f sim ulation to estim ate quantities o f interest.
The w idespread availability o f fast cheap com puters has m ade this a practical
alternative to analytical calculation in m any problem s, because com puter time
is increasingly plentiful relative to the num ber o f hours in a researchers day.
T heoretical approxim ations based on large samples can be tim e-consum ing to
obtain for each new problem , and there m ay be d o u b t about their reliability in
small samples. C ontrariw ise, sim ulations are tailored to the problem at hand
an d a large enough sim ulation m akes the num erical erro r negligible relative to
the statistical erro r due to the inescapable uncertainty ab o u t F.
M onte C arlo m ethods o f inference had already been used for m any years
when E fron (1979) m ade the connection to standard m ethods o f param etric
inference, drew the atten tio n o f statisticians to their potential for nonparam etric
inference, and originated the term b o o tstra p . This work and subsequent
developm ents such as his 1982 m onograph m ade strong connections with the
jackknife, which had been introduced by Q uenouille (1949) and Tukey (1958),
and w ith o th er subsam pling m ethods (H artigan, 1969, 1971, 1975; M cC arthy,
1969). M iller (1974) gives a good review o f jackknife m ethods; see also G ray
an d Schucany (1972).
Y oung and D aniels (1990) discuss the bias in the nonparam etric boo tstrap
introduced by using the em pirical distribution function in place o f the true
distribution.
H all (1988a, 1992a) strongly advocates the use o f the studentized b o o tstrap

60

2 The Basic Bootstraps

statistic for confidence intervals an d significance tests, and m akes the connec
tion to E dgew orth expansions for sm ooth statistics. The em pirical choice o f
scale for resam pling calculations is discussed by C h apm an and H inkley (1986)
and T ibshirani (1988).
H all (1986) analyses the effect o f discreteness on confidence intervals. Efron
(1987) discusses the num bers o f sim ulations needed for bias and quantile
estim ation, while D iaconis an d H olm es (1994) describe how sim ulation can be
avoided com pletely by com plete en um eration o f b o o tstrap sam ples; see also
the bibliographic notes for C h ap ter 9.
Bickel and F reedm an (1981) were am ong the first to discuss the conditions
under which the b o o tstrap is consistent. T heir w ork was followed by Bretagnolle (1983) and others, and there is a grow ing theoretical literature on
m odifications to ensure th a t the b o o tstra p is consistent for different classes o f
aw kw ard statistics. T he m ain m odifications are sm oothing o f the d ata (Sec
tion 3.4), which can im prove m atters for nonsm ooth statistics such as quantiles
(D e Angelis and Young, 1992), subsam pling (Politis and R om ano, 1994b), and
rew eighting (B arbe and Bertail, 1995). H all (1992a) is a key reference to Edgew orth expansion theory for the b o o tstrap , while M am m en (1992) describes
sim ulations intended to help show when the b o o tstrap works, and gives the
oretical results for various situations. Shao and Tu (1995) give an extensive
theoretical overview o f the b o o tstrap an d jackknife.
A threya (1987) has show n th a t the b o o tstra p can fail for long-tailed distri
butions. Some o th er exam ples o f failure are discussed by Bickel, G otze and
van Zwet (1996).
T he use o f linear approxim ations an d influence functions in the context
o f robust statistical inference is discussed by H am pel et al. (1986). Fernholtz
(1983) describes the expansion theory th a t underlies the use o f these approx
im ation m ethods. A n alternative and o rthogonal expansion, sim ilar to th at
used in Section 2.7.4, is discussed by E fron and Stein (1981) and E fron (1982).
Tail-specific approxim ations are described by H esterberg (1995a).
The use o f m ultiple-deletion jackknife m ethods is discussed by H inkley
(1977), Shao and W u (1989), W u (1990), and Politis and R om ano (1994b), the
last w ith num erous theoretical exam ples. T he m ethod based on all non-em pty
subsam ples is due to H artig an (1969), an d is nicely p u t into context in C h apter 9
o f Efron (1982). H alf-sam ple m ethods for survey sam pling were developed by
M cC arthy (1969) an d extended by W u (1991). The relevant factorial designs
for half-sam pling were developed by Plackett and B urm an (1946).

2.10 Problems
1

Let F denote the E D F (2.1). Show that E {f(y )} = F(y) and that var{F(y)} =
f (3'){l F(y)}/ n. Hence deduce that provided 0 < F(y) < 1, F(y) has a limiting

61

2.10 Problems

normal distribution for large n, and that Pr(|F(y) F(y)| < e)>1 as ntoo for any
positive e. (In fact the much stronger property s u p ^ ^ ^ ^ |F(y) F (y )|>0 holds
with probability one.)
(Section 2.1)
2

Suppose that Y ],..., Y are independent exponential with mean


Y=n~' E

their average is

Yj .

(a) Show that Y has the gamma density (1.1) with k = n, so its mean and variance
are n and fi2/n.
(b) Show that log Y is approximately normal with mean log^i and variance n~'.
(c) Compare the normal approximations for Y and for log Y in calculating 95%
confidence intervals for /z. Use the exact confidence interval based on (a) as the
baseline for the comparison, which can be illustrated with the data o f Example 1.1.
(Sections 2.1, 2.5.1)
3

Under nonparametric simulation from a random sample y [ , . . . , y in which T =


nr1
Yj Y) 2 takes value t, show that
E '(T ') = (n l)t/n,

var(7 ) = (n l ) 2 [m4/ n + (3 - n)t2/ {n(n 1)}] / n2,

where w 4 = n- 1 E / X ; - f ) 4(Section 2.3; Appendix A)


4

Let t be the median o f a random sample o f size n = 2m + 1 with ordered values


>>(i) < < y(); t = y(m+i).
(a) Show that T" >
if and only if fewer than m + 1 o f the Y are less than or
equal to y ^ .
(b) Hence show that

This specifies the exact resampling density (2.28) o f the sample median. (The result
can be used to prove that the bootstrap estimate o f var(T ) is consistent as n>oo.)
(c) Use the resampling distribution to show that for n = 11
P r * ( r < y,3 j) = Pr( T > y(9)) = 0.051,
and apply (2.10) to deduce that the basic bootstrap 90% confidence interval for
the population median 6 is (2 y(6) y(9 ), 2 y(6)
(d) Examine the coverage o f the confidence interval in (c) for samples from normal
and Cauchy distributions.
(Sections 2.3, 2.4; Efron, 1979, 1982)
5

Consider nonparametric simulation o f Y* based on distinct linearly independent


observations y i,...,y .
(a) Show that there are m = (^"T,1) ways that n 1 red balls can be put in a line
with n white balls. Explain the connection to the number o f distinct values taken
by Y '.
(b) Suppose that the value y" taken by Y* is n~l J 2 f j y j < where / can be one o f
0
and J 2 j f j ~ n- Find Pr(Y = y), and deduce that the most likely value o f
Y is y, with probability p = n'./n".
(c) Use Stirlings approximation, i.e. n \ ~ (27r)l/2e~"n"+1//2 as n>oo, to find approx
imate formulae for m and p.
(d) For the correlation coefficient T calculated from distinct pairs (i, x j ) ,. . . , (u,x),

62

2 The Basic Bootstraps


show that T* is indeterminate with probability
W hat is the probability that
17 | = 1? Discuss the implications o f this when n < 10.
(Section 2.3; Hall, 1992a, Appendix I)
Suppose that
are independently distributed with a two-parameter density
W hat simulation experiment would you perform to check whether or not
Q = q ( Y u . . . , Y n;6) is a pivot?
If / is the gamma density (1.1), let fi be the M LE o f n, let

feAy)-

tpin) = max Y

lg//vc(y; )

j=i
be the profile log likelihood for n and let Q = 2 { /p(/i) /?p(n)}. In theory Q should
be approximately a x] variable for large n. Use simulation to examine whether or
not Q is approximately pivotal for n = 10 when k is in the range (0.5,2).
(Section 2.5.1)
7

The bootstrap normal approximation for T 9 is N ( b R, v R), so that the p quantile


ap for T 96 can be approximated
appro
by ap = bR + zpvR 2. Show that the simulation
variance o f this estimate is

i* \ v I . ,
*3 , l 2 / t , k4
K ) - R { ' + Z ' , ^ + i2' ( 2 + <
where k 3 and k4 are the third and fourth cumulants o f T" under bootstrap
resampling. If T is asymptotically normal, k ^ / v U2 = 0 ( n ~ l/2) and k 4/ v1^ = 0 (n ).
Compare this variance to that o f the bootstrap quantile estimate
t in
the special case T = Y .
(Sections 2.2.1, 2.5.2; Appendix A)
8

Suppose that estimator T has expectation equal to 0(1 + y ) , so that the bias is 9y.
The bias factor y can be estimated by C = E( T ' ) / T 1. Show that in the case
o f the variance estimate T = ri [ ^ 2(Yj Y ) 2, C is exactly equal to y. I f C were
approximated from R resamples, what would be the simulation variance o f the
approximation?
(Section 2.5)
Suppose that the random variables U = (Ui, .. . , Um) have means C i,...,( m and
covariances cov(Uk,Ui) = n-1 cow( 0 , and that Ti = g t ( U ) , . . . , T q = gq(U). Show
that
E(T,)

g , . ( 0 + i n - > f > w( 0 | ^ ,

cov(Tj, Tj)

/r f >

w(

How are these estimated in practice?


Show that
2

\ " (x i tuj)2

" - 2
i=i
is a variance estimate for t = x / u , based on independent pairs (u i, Xi) ,...,( ,x n).
(Section 2.7.1)

63

2.10 Problems
10

(a) Show that the influence function for a linear statistic t(F) = / a(x) dF(x)
is a ( y ) t(F). Hence obtain the influence functions for a sample mom ent fir
f x r dF(x), for the variance /1 2 (F) {/ti(F)}2, and for the correlation coefficient
(Example 2.18).
(b) Show that the influence function for {t(F) 6 } / v ( F ) i/2 evaluated at 9 = t{F) is
v(F)~l/2L, (y; F) . Hence obtain the empirical influence values lj for the studentized
quantity {t{F) t ( F) } / v L( F ) l/2, and show that they have the properties E O = 0
and n~2 E I2 = 1 .
(Section 2.7.2; Hinkley and Wei, 1984)

11

The pairs ( U [ , X i ) , . . . , { U , X n) are independent bivariate normal with correlation


9. Use the influence function o f Example 2.18 to show that the sample correlation
T has approximate variance n~l { 1 92)2. Then apply the delta method to show
that \ log ( j r ) , called Fishers z-transform, has approximate variance n~].
(Section 2.7.1; Appendix A)

12

Suppose that a parameter 0 = t(F) is determined implicitly through the estimating


equation

J u { y, 9 ) d F ( y ) = 0

(a) Write the estimating equation as

J u { y J ( F ) } dF(y) = 0,
u(x;0) = du(x-,6)/d8

replace F by (1 e)F + eH y, and differentiate with respect to e to show that the


influence function for f(-) is

,(-V *

f U(x;9)dF(x) '

Hence show that with 9 = t{F) the y'th empirical influence value is
t =

u ( y j ; 6)

- n ~ l E L i (w ;

(b) Let {p be the maximum likelihood estimator o f the (possibly vector) parameter
o f a regular parametric m odel / v (y) based on a random sample y u . ..,y. Show
that the j \ h empirical influence value for \p at yj may be written as n I ~ lSj, where
y-v g 2 l o g / v-,(y; )

dxpdip7

d\ogjjiyj)

dxp

Hence show that the nonparametric delta method variance estimate for ip is the
so-called sandwich estimator

/-> ( X s A r ) ' - ' Compare this to the usual parametric approximation when y \ , . . . , y is a random
sample from the exponential distribution with mean tp .
(Section 2.7.2; Royall, 1986)

64
13

2 The Basic Bootstraps


The a trimmed average is defined by

t { F) =r h a [
computed at the E D F F. Express t(F) in terms o f order statistics, assuming that
na is an integer. How would you extend this to deal with non-integer values o f not?
Suppose that F is a distribution symmetric about its mean, p.. By rewriting t(F) as

rii-(f)

- /
1 2 a

udF(u),

where qa(F) is the a quantile o f F, use the result o f Example 2.19 to show that the
influence function o f t(F) is

L t(y,F )= l

l - 2 r I,
1 2a) ',
{{q '(F )-p }(l-2 * )-\

y<q(F),
q(F) < y < <ji_a(F ),
q t - x( F ) < y .

Hence show that the variance o f t ( F) is approximately

n(1 _

r r<i\-AF)

^ >*)2dF(y) + te(F) -

2ay [J {F)

n}2 + a{qi-x(F) - n}2 .

Evaluate this at F = F.
(Section 2.7.2)
14

Let Y have a p-dimensional multivariate distribution with mean vector p and


covariance matrix fi. Suppose that Q has eigenvalues Aj > > Xp and corre
sponding orthogonal eigenvectors ej, where e j e y = 1. Let Fc = (1 s)F + eHy.
Show that the influence function for Q is
L a ( y ',F) = { y - n)(y - p ) T fl,
and by considering the identities
Q(Fc)ej(Fs) = Xj(Fe)ej(Fc),

ej (F) t e j(Fc) = 1,

or otherwise, show that the influence function for l j is { e j ( y p)}2 Xj.


(Section 2.7.2)
15

Consider the biased sample variance t n_ 1 J2(yj ~ J')2(a) Show that the empirical influence values and second derivatives are
lj = (yj - y ) 2 - U

qjk = - 2 ( y j - y)(yk - y).

(b) Show that the exact case-deletion values o f t are

Compare these with the result o f the general approximation


t - t-j = ( n -

y 'lj -

)~2qjj,

which is obtained from (2.41) by substituting F for F and


for F.
(c) Calculate jackknife estimates o f the bias and variance o f T. Are these sensible
estimates?
(Section 2.7.3; Appendix A)

2.10 Problems
16

65

The empirical influence values lj can also be defined in terms o f distributions


supported on the data values. Suppose that the support o f F is restricted to
y i , . . . , y n, with probabilities p = ( p i , . . . , p n) on those values. For such distributions
t(F) can be re-expressed as t{p).
(a) Show that
h = j Rt{(l - e)p + s l j }
e=0

where P = ( $ , ->%) and 1 j is the vector with


Hence or otherwise show that

in the y'th position and

elsewhere.

0 = Mp) -

X Mp)>
k=\

where 'tj(p) = 8t(p)/dpj.


(b) Apply this result to derive the empirical influence values lj = (xj tuj )/ u for
the estimate t = J2 Pjx j ! 5Z Pjuj o f the ratio o f two means.
(c) The empirical second derivatives qtj can be defined similarly. Show that

d2

qtj = g~ ^

t{(l - El - E 2)p + Ell, + E 2 ly}

| =2=0
Hence deduce that
<2.7 = 'iij(P) ~ n

5Z
k=l

tik ( P ) -

n '

+ n 2

k=l

Y1ikl

k,l=1

(Section 2.7.2)
17

Suppose that t =
+ }W i)) is the median o f a sample o f even size n = 2m
from a distribution with continuous C D F F and P D F / whose median is p. Show
that the case-deletion values
are either y lmj or
and that the jackknife
variance estimate is

1 (,y<m
( +i) - y(m))'i2
vjack = "
By writing Yu> = F '{1 exp(y))}, where
is the 7 th order statistic o f a
random sample
from the standard exponential distribution, and recalling
properties o f exponential order statistics, show that
nVJadl~

J ___ / I

?\2

( X2 )
Pin)

as n*oo. This confirms that the jackknife variance estimate is not consistent.
(Section 2.7.3)
18

A generalized form o f jackknife can be defined by estimating the influence function


at yj by
t { ( l - e ) F + eHyj} - t ( F )
e
for some value e. D iscuss the effects o f (a) e>0, (b) e = ( n l ) -1 , (c) e = ( n + 1) - 1
which respectively give the infinitesimal jackknife, the ordinary jackknife, and the
positive jackknife.

66

2 The Basic Bootstraps

Show that in (b) and (c) the squared distance (dF dFe)T(dF dFc) from F to
Fe = (1 s)F + eH Vj is o f order 0 ( n ~ 2), but that if F* is generated by bootstrap
sampling, E* j( d F d F ) T {dF dF) j = 0 ( n ~ l ). Hence discuss the results you
would expect from the butcher knife, which uses e = n~l/2. How would you
calculate it?
(Section 2.7.3; Efron, 1982; Hesterberg, 1995a)
19

The cumulant generating function o f a multinomial random variable


with denominator n and probability vector ( 7 1 1 , . . . , n) is
K ( ) = n log

ty e x p ( ^ -)|,

where =
(a) Show that with Kj = n~l, the first four cumulants o f the /* are
E(/D
co v '( / ' , / * )

=
=

1,
dij-n~\

cum ' ( f i J j J k )

n~2{n2Sijk-<5ft[3] + 2 } ,

cum (/,',/* , f l J J )

n }{n}dijki - n2 (c5ft<5y,[3] + SJkl[4]) + 2nSit [6 ] -

},

where S:J = 1 when i = j and zero otherwise, and so on, and d:k [3] = d,k + S,j + Sjk,
and so forth.
(b) N ow consider t Q = f + n ~ ' J 2 f j h + \ n~2 H
Show that E*(tg) =
t + \ n~2 ^2 qjj and that t'g has variance

1j + ^ E

{E 4 - 1 ( E ) * + E

(2-51)

(Section 2.7.2; Appendix A ; D avison, Hinkley and Schechtman, 1986; McCullagh,


1987)
20

Show that the difference between the second derivative Q , ( x , y ) and the first
derivative o f L,(x) is equal to L,(y). Hence show that the empirical influence value
can be written as

n
lj = L t( y j ) + n~l ^ { 2 ,( y y ,y ( c ) ~ L ,(yk)}k= 1
Use the resampling version o f this result to discuss the accuracy o f approximation
(2.49)
for vL .
(Sections 2.7.2, 2.7.5)

2.11 Practicals
1

Consider parametric simulation to estimate the distribution o f the ratio when a


bivariate lognormal distribution is fitted to the data in Table 2.1:
ml < - m e a n ( lo g ( c it y $ u ) ) ; m2 < - m e a n ( lo g ( c it y $ x ) )
s i <- s q r t ( v a r ( lo g ( c it y $ u ) ) ) ; s 2 <- s q r t(v a r (lo g (c it y $ x )))
rho < - c o r ( l o g ( c i t y ) ) [ 1 , 2 ]
c it y .m le < - c (m l, m2 , s i , s 2 , rho)

2.11 Practicals

67

city.sim <- function(city, mle)


{ n <- nrow(city)
zl <- rnorm(n) ; z2 <- rnonn(n)
z2 <- m l e [5]*zl+sqrt(1-mle[5]2)*z2
data, frame(u=exp(mle [1] +mle [3] *zl) , x=exp(mle[2]+m l e [4]*z2)) }
city.fun <- function(data, i=l:nrow(data))
{ d <- data[i,]
tstar <- sum(d$x)/sum(d$u)
ubar <- mean(d$u)
c(tstar, sum((d$x-tstar*d$u)~2/(nrow(d)*ubar)"2))
}
city.para <- boot (city,city.fun, 11=999,
sim="parametric",r a n .gen=city.sim,mle=city.mle)
Are histograms o f t and z ' similar to those for nonparametric simulation, shown
in Figure 2.5?

tstar <- city.para$t[,l]


zstar <- (tstar-city.para$tO[1])/sqrt(city.para$t[,2])
split.screen(c(1,2))
screen(l); hist(tstar)
screen(2); hist(zstar)
screen(l); qqnorm(tstar,pch=".")
screen(2); qqnorm(zstar,pch="."); abline(0,1,lty=2)
Use (2.10) and (2.12) to give 95% confidence intervals for the true ratio under this
model:

city.para$tO[l] - sort (tstar-city .para$tO [1] ) [c (975,25)]


city.para$tO [1] - sqrt(city.para$t0[2])*sort(zstar) [c(975,25)]
Compare these intervals with those given in Example 2.12.
Repeat this with R = 199 and R = 399.
(Sections 2.2, 2.3, 2.4)
2

c o .t r a n s f e r contains data on the carbon monoxide transfer factor for seven


smokers with chickenpox, measured on admission to hospital and after a stay o f
one week. The aim is to estimate the average change in the factor.
To display the data:

attach(co.transfer)
plot(0.5*(entry+week),week-entry)
t .test(week-entry)
Are the differences normal? Is the Student-t confidence interval reliable?
For a bootstrap approach:

co.fun <- function(data, i)


{ d <- data[i,]
y <- d$week-d$entry
c(mean(y), var(y)/nrow(d)) }
co.boot <- boot(co.transfer, co.fun, R=999)
Compare the variance o f the bootstrap estimate t" with the estimated variance
o f t, in c o .b o o t $ t 0 [ 2 ]. Compare normal-based and studentized bootstrap 95%
confidence intervals.
To display the bootstrap output:

2 The Basic Bootstraps


split.screen(c(l,2))
screen(l); split.screen(c(2,1))
screen(3); qqnonn(co,boot$t[,1],ylab="t*",pch=".")
abline(co.boot$tO[l],sqrt(co.boot$t0[2]) ,lty=2)
screen(2)
plot(co.boot$t[,1],sqrt(co.boot$t[,2]),xlab="t*",ylab="SE*",pch=".")
screen(4); z <- (co,boot$t[,1]- co.boot$tO[1])/sqrt(co.boot$t[,2])
qqnorm(z); abline(0,1 ,lty=2)
What is going on here? Is the normal interval useful? What difference does
dropping the simulation outliers make to the studentized bootstrap confidence
interval?
(Sections 2.3, 2.4; Hand et ai , 1994, p. 228)
cd4 contains the C D 4 counts in hundreds for 20 HIV-positive patients at baseline
and after one year o f treatment with an experimental anti-viral drug. We attempt
to set a confidence interval for the correlation between the baseline and later
counts, using the nonparametric bootstrap.

corr.fun <- function(d, w = rep(l, nrow(d))/nrow(d))


{ w <- w/sum(w)
n <- nrow(d)
ml <- sum(d[, 1] *
w)
m2 <- sum(d[, 2] *
w)
vl <- sum(d[, 1]
2 * w) - ml"2
v2 <- sum(d[, 2]
2 * w) - m2~2
rho <- (sum(d[, 1] * d[, 2] * w) - ml * m2)/sqrt(vl * v2)
i <- rep(l:n,round(n*w))
us <- (d[i, 1] - ml)/sqrt(vl)
xs <- (d[i, 2] - m2)/sqrt (v2)
L <- us * xs - 0.5 * rho * (us~2 + xs'2)
c(rho, sum(L"2)/n"2) >
cd4.boot <- boot(cd4, corr.fun, R=999, stype="w")
Is the variance independent o f t? Is z* pivotal? Should we transform the correlation
coefficient?

t0 <- cd4.boot$t0[l]
tstar <- cd4.boot$t[,1]
vL <- cd4.boot$t[,2]
zstar <- (tstar-tO)/sqrt(vL)
fisher <- function( r ) 0.5*log( (l+r)/(l-r) )
split.screen(c(1,2))
screen(l); plot(tstar,vL)
screen(2); plot(fisher(tstar),vL/(l-tstar"2)~2)
For a studentized bootstrap confidence interval on transformed scale:

zstar <- (fisher(tstar)-fisher(tO))/sqrt(vL/(l-tstar~2)~2)


vO <- cd4.boot$t0[2]/(l-t0
2)"2
fisher(tO) - sqrt(v0)*sort(zstar) [c(975,25)]
W hat are these on the correlation scale? How do they compare to intervals obtained
without the transformation?
If there are simulation outliers, delete them and recalculate the intervals.
(Sections 2.3, 2.4, 2.5; D iC iccio and Efron, 1996)

2.11 Practicals
4

69

How many simulations are required for quantile estimation? To get som e idea, we
make four replicate plots with 39, 99, 399 and 999 simulations.

split.screen(c(4,4))
quantiles <- matrix(NA,16,4)
n <- c (39,99,399,999)
p <- c(0.025,0.05,0.95,0.975)
for (i in 1:4)
{ y <- rnorm(999)
for (j in 1:4) {
quantiles[(j-1)*4+i,] <- quantile(y [1 :n[j]] , probs=p)
screen((i-1)*4+j)
qqnorm(y [1 :n[j] ] ,ylab="y" ,main=paste("R = ",n[j]))
abline(h=quantile(y[l :n[j]] ,p) ,lty=2) } }
Repeat the loop a few times. How large a simulation is required to get reasonable
estimates o f the 0.05 and 0.95 quantiles? O f the 0.025 and 0.975 quantiles?
(Section 2.5.2)
5

Following on from Practical 2.3, we compare variance approximations for the


correlation in cd4:

L.inf <- empinf(data=cd4,statistic=corr.fun)


L.jack <- empinf(data=cd4,statistic=corr.fun,type="jack")
L.reg <- empinf(boot.out=cd4.boot,type="reg")
split.screen(c (1,2))
screen(l); plot(L.inf,L.jack); screen(2); plot(L.inf,L.reg)
v.inf <- sum(L.inf2)/nrow(cd4)'2
v.jack <- var(L.jack)/nrow(cd4)
v.reg <- sum(L.reg~2)/nrow(cd4)~2
v.boot <- var(cd4.boot$t[,l])
c (v.inf,v .r eg,v .j a ck,v .boot)
Discuss the different variance approximations in relation to the values o f the
influence values. Compare with results for the transformed correlation coefficient.
To see the accuracy o f the linear approxim ation:

close.screen(all=T);plot(tstar.linear.approx(cd4.boot,L.reg))
Find the correlation between t and its linear approximation. M ake the corre
sponding plots for the other empirical influence values. Are the plots better on the
transformed scale?
(Section 2.7)

3
Further Ideas

3.1 Introduction
In the previous chap ter we laid out the basic elem ents o f resam pling or
b o o tstrap m ethods, in the context o f the analysis o f a single hom ogeneous
sam ple o f data. This ch ap ter deals w ith how those ideas are extended to some
m ore com plex situations, an d then tu rn s to uses for variations and elaborations
o f simple b o o tstrap schemes.
In Section 3.2 we describe how to construct resam pling algorithm s for
several independent sam ples, and then in Section 3.3 we discuss briefly the use
o f partial m odelling, either qualitative or sem iparam etric, a topic explored m ore
fully in the later chapters on regression m odels (C hapters 6 and 7). Section 3.4
exam ines w hen it is w orthw hile to m odify the statistic by using a sm oothed
em pirical distribution function. In Sections 3.5 and 3.6 we tu rn to situations
where d a ta are censored or missing and therefore are incom plete. One relatively
simple situation where the stan d ard b o o tstrap m ust be modified to succeed is
finite population sampling, which we consider in Section 3.7. In Section 3.8 we
deal with simple situations o f hierarchical variation. Section 3.9 is an account
o f nested b ootstrapping, where we outline how to overcome som e o f the
shortcom ings o f a single b o o tstrap calculation by a fu rther level o f sim ulation.
Section 3.10 describes b o o tstrap diagnostics, which are concerned w ith the
assessm ent o f sensitivity o f resam pling analysis to individual observations, as
well as the use o f bo o tstrap o u tp u t to suggest m odifications to the calculations.
Finally, Section 3.11 describes the use o f nested b o o tstrapping in selecting an
estim ator from the data.

70

71

3.2 Several Samples

3.2 Several Samples


Suppose th a t we are interested in a param eter th a t depends on the populations
F \ , . . . , F k , and th a t the d a ta consist o f independent random sam ples from

these populations. The ith sam ple is >'n,


and arises from p o pulation F t,
for i = 1
If there is no further inform ation ab o u t the populations, the
nonparam etric estim ate o f F t is the E D F o f the ith sample,
Recall that the
Heaviside function H(u)
jumps from 0 to 1 at
u = 0.

Since each o f the k p opulations is separate, nonparam etric sim ulation from
their respective E D F s F i , . . . , F k leads to datasets

where
is generated by sam pling n,- tim es w ith equal probabilities,
n ', from the ith original sample, independently o f all other sim ulated samples.
This am ounts to stratified sam pling in which each o f the original sam ples
corresponds to a stratum , an d nt observations are taken w ith equal probability
from the ith stratum . W ith this extension o f the resam pling algorithm , we
proceed as o u tlined in C h ap ter 2. F or example, if v = v(Fi,...,F/c) is an
estim ated variance for t, confidence intervals for 6 could be based on sim ulated
values o f z* = (t* t ) / v ' ll2 ju st as described in Section 2.4, where now t* and
v' are form ed from sam ples generated by the sim ulation algorithm described
above.
Example 3.1 (Difference of population means) Suppose we are interested in
the difference o f two p o p u latio n m eans, 6 = t(Fi,F 2 ) = f y d F i ( y ) f
The corresponding estim ate o f f(F], F 2) based on independent sam ples from
the two distributions is the difference o f the two sam ple averages,

for which the usual unbiased estim ate o f variance is

This differs slightly from the delta m ethod variance approxim ation, which we
describe in Section 3.2.1.
A sim ulated value o f T w ould be f* = yj y 2 > where yj is the average
o f n\ observations generated w ith equal probability from the first sample,
y ii ,- - -, yi m, and

is the average o f n2 observations generated with equal

72

3 Further Ideas

Series
4
5

105
83

95
90

76
76

78
78

82
79

84
86

76
75
51
76
93
75
62

76
76
87
79
77
71

78
79
72
68
75
78

78
86
87
81
73
67
75
82
83

81
79
77
79
79
78
79
82
76
73
64

85
82
77
76
77
80
83
81
78
78
78

76
82

87
95

83
54
35
46
87
68

98
100
109
109
100
81
75
68
67

probability from the second sample, y 2 i , - - - , y 2n2- T he corresponding unbiased


estim ate o f variance for t* based on these sam ples would be
1

>

Example 3.2 (Gravity data)


Between M ay 1934 and July 1935, a series o f
experim ents to determ ine the acceleration due to gravity, g, was perform ed at
the N atio n al B ureau o f S tan d ard s in W ashington D C. T he experim ents, m ade
with a reversible pendulum , led to eight successive series o f m easurem ents. The
d ata are given in Table 3.1. Figure 3.1 suggests th a t the variance decreases
from one series to the next, th a t there is a possible change in location, and
th a t mild outliers m ay be present.
T he m easurem ents for the later series seem m ore reliable, and although
we would wish to estim ate g from all the data, it seems in ap p ropriate to
pool the series. We suppose th a t each o f the series is taken from a separate
population, F i,...,F g , b u t th a t each pop u latio n has m ean g; for a check on
this see Exam ple 4.14. T hen the ap p ro p riate form o f estim ator is a weighted
com bination

r = Ef=i V(Fi)/<r2(Fi)
E l i IM A )
where F, is the E D F o f the ith series, fi(Fi) is an estim ate o f g from F and

Table 3.1 Eight series


of measurements of the
acceleration due to
gravity, g, given as
deviations from 980000
xlO-3 cm s"2, in units
of cms-2 x 10-3.
(Cressie, 1982)

73

3.2 Several Samples

Figure 3.1 Gravity


series box plots, showing
a reduction in variance,
a shift in location, and
possible outliers.

O
00
O
CD
o

<j2(Fj) is an estim ated variance for n(Fi). The estim ated variance o f T is

v =

j E 1/ ^
1 1=1

If the d a ta were tho u g h t to be norm ally distributed with m ean g but different
variances, we w ould take
KFi) = yh

v 2(Fi) = {n,(n, - l)}-1 E ^ 'V ^ ) 2


j

to be the average o f the ith series and its estim ated variance. The resulting
estim ator T is then an em pirical version o f the optim al weighted average. For
o u r d a ta t = 78.54 w ith stan d ard error uI/2 = 0.59.
Figure 3.2 shows sum m ary plots for R = 999 nonparam etric sim ulations
from this model. The to p panels show norm al plots for the replicates t ' and
for the corresponding studentized b o o tstrap statistics z* = (f* t ) / v ' l/2. Both
are m ore dispersed th a n norm al. There is one large negative value o f z*, and
the lower panels show w h y : on the left we see th a t the u* for the smallest value
o f t* is very small, w hich inflates the corresponding z*. We would certainly
om it this value on the grounds th at it is a sim ulation outlier.
The average an d variance o f the * are 78.51 and 0.371, so the bias estim ate
for t is 78.51 78.54 = 0.03, and a 95% confidence interval for g based on a
norm al approxim ation is (77.37,79.76). The 0.025 x (R + 1) and 0.975 x (R + 1)
order statistics o f the z* are -3.03 and 2.50, so the 95% studentized boo tstrap
confidence interval for g is (77.07,80.32), slightly wider th an th at based on
the norm al approxim ation, as the top right panel o f Figure 3.2 w ould suggest.

74

3 Further Ideas

10
o00
O)

r^

*
N

GO

hr-.

/ ' y

. .. y

o
V

in

-2

-2

Quantiles of standard normal

Quantiles of standard normal

ho
CD

in

o
tr

o
co
d
c\j

o
77

78

79

80

81

t*

A p art from the resam pling algorithm , this mimics exactly the studentized
bo o tstrap procedure described in Section 2.4.

O ther, constrained resam pling plans m ay be suggested by stronger assum p


tions ab o u t the populations, as discussed in Section 3.3. T he advantage o f the
resam pling plan described here is th a t it is robust.

3.2.1 Influence functions and variance approximations


The discussion in Section 2.7.2 generalizes quite easily to the case o f m ultiple
independent samples, w ith separate influence functions corresponding to each

Figure 3.2 Summary


plots for 999
nonparametric
simulations of the
weighted average for
the gravity data and its
estimated variance v.
The top panels show
normal quantile plots of
t* and the studentized
bootstrap statistic
z* = (t* t)/v*^2. The
line on the top left has
intercept t and slope
vl/2, and on the top
right the line has
intercept zero and unit
slope. The bottom
panels show that the
smallest t* also has the
smallest v*, leading to
an outlying value of z*.

75

3.2 Several Samples

population represented. W hen T has the representation f(FI ;. .. , Fk), the an a


logue o f the linear approxim ation (2.35) is
k J
t(Fu . . . , F k) = t(Fu . .., Fk) + E ~
,=i

Hi
t1 1 )
j=

where the influence functions L t i are defined by


Lv(y;F) =

St (Fu . . . , ( l - e ) F i + eHy, . . . , F k)
ds

(3.2)
6=0

and for brevity we w rite F = (Fi, . . . , Fk). A s in the single sam ple case, the influ
ence functions have m ean zero, E { Ltii( y;F)} = 0 for each i. T hen the im m ediate
consequence o f (3.1) is the nonparam etric delta m ethod approxim ation
T 6

N ( 0, vL),

for large
where the variance approxim ation vL is given by the variance o f
the second term on the right-hand side o f (3.1), th a t is
k 1
v l = V - v a r { L M(Y ;F) \ F}.
f n

(3.3)

By analogy with the single sam ple case, em pirical influence values are
obtained by substituting the E D F s F = ( F i , . . . , F k) for the C D F s F in (3.2) to
give
h j = Lt Ay j i f )These values satisfy E y = i kj ~ f r eac^ * S ubstitution o f em pirical variances
o f the em pirical influence values in (3.3) gives the variance approxim ation

' E i D S -

<1 4 >

i= i n i j = i

which generalizes (2.36).


Example 3.3 (Difference of population means)  For the difference between sample averages in Example 3.1, the first influence function is

$$L_{t,1}(y; F) = \left. \frac{\partial}{\partial \varepsilon} \left[ \int x_1 \{(1-\varepsilon)\, dF_1(x_1) + \varepsilon\, dH_y(x_1)\} - \int x_2\, dF_2(x_2) \right] \right|_{\varepsilon=0} = y - \mu_1,$$

just as in Example 2.17. Similarly L_{t,2}(y; F) = −(y − μ_2). In this case the linear approximation (3.1) is exact. The variance approximation formula (3.3) gives

$$v_L = \frac{\mathrm{var}(Y_1)}{n_1} + \frac{\mathrm{var}(Y_2)}{n_2},$$

and the empirical version (3.4) is

$$v_L = \frac{1}{n_1^2} \sum_{j=1}^{n_1} (y_{1j} - \bar y_1)^2 + \frac{1}{n_2^2} \sum_{j=1}^{n_2} (y_{2j} - \bar y_2)^2.$$

As usual this differs slightly from the unbiased variance approximation. Note that if we could assume that the two population variances were equal, then it would be appropriate to replace v_L by

$$\left\{ \sum_{j=1}^{n_1} (y_{1j} - \bar y_1)^2 + \sum_{j=1}^{n_2} (y_{2j} - \bar y_2)^2 \right\} \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \frac{1}{n_1 + n_2},$$

similar to the usual pooled variance formula.

The various comments made about calculation in Section 2.7 apply here with obvious modifications. Thus the empirical influence values can be approximated accurately by numerical differentiation, which here means

$$l_{ij} \doteq \frac{t\{\hat F_1, \dots, (1-\varepsilon)\hat F_i + \varepsilon H_{y_{ij}}, \dots, \hat F_k\} - t}{\varepsilon}$$

for small ε. We can also use the generalization of (2.44), namely

$$t^* \doteq t + \sum_{i=1}^{k} \frac{1}{n_i} \sum_{j=1}^{n_i} f_{ij}^*\, l_{ij}, \qquad (3.5)$$

where f_{ij}* denotes the frequency of data value y_{ij} in the bootstrap sample. Then given simulated values t_1*, ..., t_R* we can approximate the l_{ij} by regression, generalizing the method outlined in Section 2.7.4. Alternative ways to calculate the l_{ij} and v_L are described in Problems 3.6 and 3.7.
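As an illustration of these calculations, the following minimal Python sketch (ours; it assumes the statistic is coded as a hypothetical function t(samples, weights) of per-sample probability weights) obtains the l_{ij} by the numerical differentiation above and then applies (3.4):

```python
import numpy as np

def empirical_influence(t, samples, eps=1e-4):
    # approximate l_ij = L_{t,i}(y_ij; F_hat) by tilting the weights
    # of sample i towards y_ij, as in the formula above
    weights = [np.full(len(s), 1.0 / len(s)) for s in samples]
    t0 = t(samples, weights)
    l = []
    for i, s in enumerate(samples):
        li = np.empty(len(s))
        for j in range(len(s)):
            w = [wi.copy() for wi in weights]
            w[i] = (1 - eps) * w[i]      # (1 - eps) * F_hat_i
            w[i][j] += eps               # + eps * H_{y_ij}
            li[j] = (t(samples, w) - t0) / eps
        l.append(li)
    return l

def v_L(l):
    # nonparametric delta method variance approximation (3.4)
    return sum((li ** 2).sum() / len(li) ** 2 for li in l)

# example statistic: difference of sample averages (Example 3.3)
def diff_means(samples, weights):
    return np.sum(weights[0] * samples[0]) - np.sum(weights[1] * samples[1])
```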
The multisample analogue of the jackknife method of Section 2.7.3 involves the case deletion estimates

$$l_{\mathrm{jack},ij} = (n_i - 1)(t - t_{-ij}),$$

where t_{−ij} is the estimate obtained by omitting the jth case in the ith sample. Then

$$v_{\mathrm{jack}} = \sum_{i=1}^{k} \frac{1}{n_i(n_i - 1)} \sum_{j=1}^{n_i} (l_{\mathrm{jack},ij} - \bar l_{\mathrm{jack},i})^2.$$
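A sketch of the corresponding computation, under the same caveats, with the statistic now a plain function of the list of samples:

```python
import numpy as np

def jackknife_variance(t_func, samples):
    # multisample jackknife: l_jack,ij = (n_i - 1)(t - t_{-ij}),
    # combined across samples as in the formula above
    samples = [np.asarray(s) for s in samples]
    t0 = t_func(samples)
    v = 0.0
    for i, s in enumerate(samples):
        n_i = len(s)
        l_jack = np.empty(n_i)
        for j in range(n_i):
            reduced = list(samples)
            reduced[i] = np.delete(s, j)   # drop the jth case of sample i
            l_jack[j] = (n_i - 1) * (t0 - t_func(reduced))
        v += np.sum((l_jack - l_jack.mean()) ** 2) / (n_i * (n_i - 1))
    return v

# e.g. for the difference of sample averages:
# jackknife_variance(lambda ss: ss[0].mean() - ss[1].mean(), [y1, y2])
```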

One can also generalize the discussion of bias approximation in Section 2.7.3. However, the extension of the quadratic approximation (2.41) is not straightforward, because there are cross-population terms.

The same approximation (3.1) could be used even when the samples, and hence the F̂_i, are correlated. But this would have to be taken into account in (3.3), which as stated assumes mutual independence of the samples. In general it would be safer to incorporate dependence through the use of appropriate multivariate EDFs.

3.3 Semiparametric Models

In a semiparametric model, some aspects of the data distribution are specified in terms of a small number of parameters, but other aspects are left arbitrary. A simple example would be the characterization Y = μ + σε, with no assumption on the distribution of ε except that it has centre zero and scale one. Usually a semiparametric model is useful only when we have nonhomogeneous data, with only the differences characterized by parameters, common elements being nonparametric.

In the context of Section 3.2, and especially Example 3.2, we might for example be fairly sure that the distributions F_i differ only in scale or, more cautiously, scale and location. That is, Y_{ij} might be expressed as

$$Y_{ij} = \mu_i + \sigma_i \varepsilon_{ij},$$

where the ε_{ij} are sampled from a common distribution with CDF F_0, say. The normal distribution is a parametric model of this form. The form can be checked to some extent by plotting standardized residuals such as

$$e_{ij} = \frac{y_{ij} - \hat\mu_i}{\hat\sigma_i}$$

for appropriate estimates μ̂_i and σ̂_i, to verify homogeneity across samples. The common F_0 will be estimated by the EDF of all Σ n_i of the e_{ij}s, or better by the EDF of the standardized residuals e_{ij}/(1 − n_i^{-1})^{1/2}. The resampling algorithm will then be

$$Y_{ij}^* = \hat\mu_i + \hat\sigma_i \varepsilon_{ij}^*, \qquad j = 1, \dots, n_i, \quad i = 1, \dots, k,$$

where the ε*_{ij}s are randomly sampled from the EDF F̂_0, i.e. randomly sampled with replacement from the standardized e_{ij}s; see Problem 3.1.
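A minimal sketch of this resampling algorithm in Python (ours; sample means and standard deviations stand in for the estimates μ̂_i and σ̂_i):

```python
import numpy as np

rng = np.random.default_rng(2)

def location_scale_bootstrap(samples, R=999):
    # semiparametric bootstrap for Y_ij = mu_i + sigma_i * eps_ij
    # with a common error distribution across samples
    samples = [np.asarray(s, float) for s in samples]
    mu = [s.mean() for s in samples]
    sigma = [s.std(ddof=1) for s in samples]
    # pooled residuals, standardized and rescaled by (1 - 1/n_i)^{-1/2}
    resid = np.concatenate([
        (s - m) / (sg * np.sqrt(1 - 1 / len(s)))
        for s, m, sg in zip(samples, mu, sigma)
    ])
    replicates = []
    for _ in range(R):
        boot = [m + sg * rng.choice(resid, size=len(s), replace=True)
                for s, m, sg in zip(samples, mu, sigma)]
        replicates.append(boot)
    return replicates
```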
In another context, with positive data such as lifetimes, it might be appropriate to think of distributions as differing only by multiplicative effects, i.e. Y_{ij} = μ_i ε_{ij}, where the ε_{ij} are randomly sampled from some baseline distribution with unit mean. The exponential distribution is a parametric model of this form. The principle here would be essentially the same: estimate the ε_{ij} by residuals such as e_{ij} = y_{ij}/μ̂_i, then define Y*_{ij} = μ̂_i ε*_{ij} with the ε*_{ij} randomly sampled with replacement from the e_{ij}s.

Similar ideas apply in regression situations. The parametric part of the model concerns the systematic relationship between the response y and explanatory variables x, e.g. through the mean, and the nonparametric part concerns the random variation. We consider this in detail in Chapters 6 and 7.

Resampling plans such as those just outlined will give more accurate answers when their assumptions about the relationships between the F_i are correct, but they are not robust to failure of these assumptions. Some pooling of information

across samples may be essential in order to avoid difficulties when the samples are small, but otherwise it is usually unnecessary.

If we widen the meaning of semiparametric to include any partial modelling, then features less tangible than parameters come into play. The following two examples illustrate this.

Example 3.4 (Symmetric distribution)  Suppose that with our simple random sample it was appropriate to assume that the distribution was symmetric about its mean or median. Using this assumption could be critical to correct statistical analysis; see Example 3.26. Without a parametric model it is hard to see a clear choice for F̂. But we can argue as follows: under F the distributions of Y − μ and −(Y − μ) are the same, so under F̂ the distributions of Y* − μ̂ and −(Y* − μ̂) should be the same. This will be true if we symmetrize the EDF about μ̂, meaning that we take F̂ to be the EDF of y_1, ..., y_n, 2μ̂ − y_1, ..., 2μ̂ − y_n. A robust choice for μ̂ would be the median. (For discrete distributions we could equivalently average sample proportions for appropriate pairs of data values.) The mean, median and other symmetrically defined location estimates of the resulting estimated distribution are all equal.
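A minimal sketch of resampling from the symmetrized EDF (ours; the median is used as the robust centre μ̂):

```python
import numpy as np

rng = np.random.default_rng(3)

def symmetrized_bootstrap(y, R=999, stat=np.mean):
    # F_hat puts mass 1/(2n) on each of y_1,...,y_n and 2*mu - y_1,...,2*mu - y_n
    y = np.asarray(y, float)
    mu = np.median(y)                       # robust centre
    sym = np.concatenate([y, 2 * mu - y])   # symmetrized support
    return np.array([stat(rng.choice(sym, size=len(y), replace=True))
                     for _ in range(R)])
```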

Example 3.5 (Equal marginal distributions)  Suppose that Y is bivariate, say Y = (U, X), and that it is appropriate from the context to assume that U and X have the same marginal distribution. Then F̂ can be forced to have the same margins by defining it as the EDF of the 2n pairs (u_1, x_1), ..., (u_n, x_n), (x_1, u_1), ..., (x_n, u_n).

In both of these examples the resulting estimate will be more efficient than the EDF. This may be less important than producing a model which satisfies the practical assumptions and makes intuitive sense.

Example 3.6 (Mixed discrete-continuous distributions)  There will be situations where the raw EDF is not suitable for resampling because it is not a credible model. Such a situation arises in classification, where we have a binary response y and covariates x which are used to predict y. If the observed covariate values x_1, ..., x_n are distinct, then the conditional probabilities π(x) = Pr(Y = 1 | x) estimated from the EDF are all 0 or 1. This is clearly not credible, so the EDF should not be used as a resampling model if the focus of interest is a property that depends critically on the conditional probabilities π(x). A natural modification of the EDF is to keep the marginal EDF of x, but to replace the 0–1 values of the conditional distribution by a smooth estimate of π(x). This is discussed further in Example 7.9.

3.4 Smooth Estimates of F

For nonparametric situations we have so far mostly assumed that the EDF F̂ is a suitable estimate of F. But F̂ is discrete, and it is natural to ask if a smooth estimate of F might be better. The most likely situation for improvement is where the effects of discreteness (Section 2.3.2) are severe, as in the case of the sample median (Example 2.16) or other sample quantiles.

When it is reasonable to suppose that F has a continuous PDF, one possibility is to use kernel density estimation. For scalar y we take

$$\hat f_h(y) = \frac{1}{nh} \sum_{j=1}^{n} w\!\left( \frac{y - y_j}{h} \right), \qquad (3.6)$$

where w(·) is a continuous and symmetric PDF with mean zero and unit variance, and do calculations or simulations based on the corresponding CDF F̂_h, rather than on the EDF F̂. This corresponds to simulation by setting

$$Y_j^* = y_{I_j} + h \varepsilon_j, \qquad j = 1, \dots, n,$$

where the I_j are independent and uniformly distributed on the integers 1, ..., n and the ε_j are a random sample from w(·), independent of the I_j. This is the smoothed bootstrap. Note that h = 0 recovers the EDF.

The variance of an observation generated from (3.6) is n^{-1} Σ_j (y_j − ȳ)² + h², and it may be preferable for the samples to have the same variance as for the unsmoothed bootstrap. This is implemented via the shrunk smoothed bootstrap, under which F̂_h smooths between F̂ and a model in which data are generated from density w(·) centred at the mean and rescaled to have the variance of F̂; see Problem 3.8.
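A minimal sketch of both schemes (ours; a standard normal kernel stands in for w):

```python
import numpy as np

rng = np.random.default_rng(4)

def smoothed_sample(y, h):
    # one smoothed bootstrap sample: y*_j = y_{I_j} + h * eps_j
    y = np.asarray(y, float)
    n = len(y)
    return rng.choice(y, size=n, replace=True) + h * rng.standard_normal(n)

def shrunk_smoothed_sample(y, h):
    # rescale about the mean so sampled values keep the variance of F_hat
    y = np.asarray(y, float)
    n = len(y)
    v = y.var()                          # n^{-1} sum (y_j - ybar)^2
    shrink = np.sqrt(v / (v + h ** 2))   # Var{shrink*(centred + h*eps)} = v
    centred = rng.choice(y, size=n, replace=True) - y.mean()
    return y.mean() + shrink * (centred + h * rng.standard_normal(n))
```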
Having decided which smoothed bootstrap is to be used, we estimate the required property of F, a(F), by a(F̂_h) rather than a(F̂). So if T is an estimator of θ = t(F), and we intend to estimate a(F) = var(T | F) by simulation, we would obtain values t_1*, ..., t_R* calculated from samples generated from F̂_h, and then estimate a(F) by (R − 1)^{-1} Σ_r (t_r* − t̄*)². Notice that it is a(F), not t(F), that is estimated using smoothing.

To see when a(F̂_h) is better than a(F̂), suppose that a(F̂) has linear approximation (2.35). Then

$$a(\hat F_h) - a(F) \doteq n^{-1} \sum_{j=1}^{n} \int L_a(Y_j + h\varepsilon; F)\, w(\varepsilon)\, d\varepsilon + \cdots \doteq n^{-1} \sum_{j=1}^{n} L_a(Y_j; F) + \tfrac{1}{2} h^2 n^{-1} \sum_{j=1}^{n} L_a''(Y_j; F) + \cdots,$$

for large n and small h, where L_a''(u; F) = ∂²L_a(u; F)/∂u².
[Table 3.2  Root mean squared error (×10⁻²) for estimation of n^{1/2} times the standard deviation of the transformed correlation coefficient for bivariate normal data with correlation 0.7, for usual and smoothed bootstraps with R = 200 and smoothing parameter h.

        Usual        Smoothed, h
  n     h = 0    0.1    0.25    0.5    1.0
  20    18.9    18.6    16.6   11.9    6.6
  80    11.4    11.2    10.4    8.5    6.4]

It follows that the mean squared error of a(F̂_h), MSE(h) = E[{a(F̂_h) − a(F)}²], roughly equals

$$n^{-1} \int L_a(y; F)^2\, dF(y) + h^2 n^{-1} \int L_a(y; F)\, L_a''(y; F)\, dF(y) + \tfrac{1}{4} h^4 \left\{ \int L_a''(y; F)\, dF(y) \right\}^2. \qquad (3.7)$$

Smoothing is not beneficial if the coefficient of h² is positive, but if it is negative (3.7) can be reduced by choosing a positive value of h that trades off the last two terms. The leading term in (3.7) is unaffected by the choice of h, which suggests that in large samples any effect of smoothing will be minor for such statistics.
Example 3.7 (Sample correlation)  To illustrate the discussion above, we take a(F) to be the scaled standard deviation of T = ½ log{(1 + C)/(1 − C)}, where C is the correlation coefficient for bivariate normal data. We extend (3.6) to bivariate y by taking w(·) to be the bivariate normal density with mean zero and variance matrix equal to the sample variance matrix. For each of 200 samples, we applied the smoothed bootstrap with different values of h and R = 200 to estimate a(F).

Table 3.2 shows results for two sample sizes. For n = 20 there is a reduction in root mean squared error by a factor of about three, whereas for n = 80 the factor is about two. Results for the shrunk smoothed bootstrap are the same, because of the scale invariance of C and the form of w(·).

Smoothing is potentially more valuable when the quantity of interest depends on the local behaviour of F, as in the case of a sample quantile.
Example 3.8 (Sample median)  Suppose that t(F̂) is the sample median, and that we wish to estimate its variance a(F). In Example 2.16 we saw that the discreteness of the median posed problems for the ordinary, unsmoothed, bootstrap. Does smoothing improve matters?

Under regularity conditions on F and h, detailed calculations show that the mean squared error of n·a(F̂_h) is proportional to

$$(nh)^{-1} c_1 + h^4 c_2, \qquad (3.8)$$

where c_1 and c_2 depend on F and w(·) but not on n. Provided that c_1 and c_2 are non-zero, (3.8) is minimized at h ∝ n^{-1/5}, and (3.8) is then of order n^{-4/5},

[Table 3.3  Root mean squared error for estimation of n times the variance of the median of samples of size n from the t₃ and exponential densities, for usual, smoothed and shrunk smoothed bootstraps with R = 200 and smoothing parameter h.

              Usual        Smoothed, h                  Shrunk smoothed, h
         n    h = 0    0.1    0.25    0.5     1.0     0.1    0.25    0.5    1.0
  t₃    11    2.27    2.08    2.17   3.59   10.63    2.06    2.00   2.72   4.91
        81    0.97    0.76    0.77   1.81    6.07    0.75    0.67   1.17   2.30
  Exp   11    1.32    1.15    1.02   1.18    7.53    1.13    0.92   0.76   0.93
        81    0.57    0.48    0.37   0.41    1.11    0.47    0.34   0.27   0.27]

whereas it is O(n^{-1/2}) in the unsmoothed case. Thus there are advantages to smoothing here, at least in large samples. Similar results hold for other quantiles.

Table 3.3 shows results of simulation experiments where 1000 samples were taken from the exponential and t₃ distributions. For each sample smoothed and shrunk smoothed bootstraps were performed with R = 200 and several values of h. Unlike in Table 3.2, the advantage due to smoothing increases with n, and the shrunk smoothed bootstrap improves on the smoothed bootstrap, particularly at larger values of h.

As predicted by the theory, as n increases the root mean squared error decreases more rapidly for smoothed than for unsmoothed bootstraps; it decreases fastest for shrunk smoothing. For the t₃ data the root mean squared error is not much reduced. For the exponential data smoothing was performed on the log scale, leading to reduction in root mean squared error by a factor two or so. Too large a value of h can lead to large increases in root mean squared error, but choice of h is less critical for shrunk smoothing. Overall, a small amount of shrunk smoothing seems worthwhile here, provided the data are well-behaved. But similar experiments with Cauchy data gave very poor results made worse by smoothing, so one must be sure that the data are not pathological. Furthermore, the gains in precision are not large enough to be critical, at least for these sample sizes.

The discussion above begs the important question of how to choose the smoothing parameter for use with a particular dataset. One possibility is to treat the problem as one of choosing among possible estimators a(F̂_h) and use the nested bootstrap, as in Example 3.26. However, the use of an estimated h is not sure to give improvement. When the rate of decrease of the optimal value of h is known, another possibility is to use subsampling, as in Example 8.6.


3.5 Censoring
3.5.1 Censored data
Censoring is present when data contain a lower or upper bound for an observation rather than the value itself. Such data often arise in medical and industrial reliability studies. In the medical context, the variable of interest might represent the time to death of a patient from a specific disease, with an indicator of whether the time recorded is exact or a lower bound due to the patient being lost to follow-up or to death from other causes.

The commonest form of censoring is right-censoring, in which case the value observed is Y = min(Y⁰, C), where C is a censoring value, and Y⁰ is a non-negative failure time, which is known only if Y⁰ ≤ C. The data themselves are pairs (Y, D), where D is a censoring indicator, which equals one if Y⁰ is observed and equals zero if C is observed. Interest is usually focused on the distribution F of Y⁰, which is obscured if there is censoring.

The survivor function and the cumulative hazard function are central to the study of survival data. The survivor function corresponding to F(y) is Pr(Y⁰ > y) = 1 − F(y), and the cumulative hazard function is A(y) = −log{1 − F(y)}. The cumulative hazard function may be written as ∫₀^y dA(u), where for continuous y the hazard function dA(y)/dy measures the instantaneous rate of failure at time y, conditional on survival to that point. A constant hazard λ leads to an exponential distribution of failure times with survivor and cumulative hazard functions exp(−λy) and λy; departures from these simple forms are often of interest.

The simplest model for censoring is random censorship, under which C is a random variable with distribution function G, independent of Y⁰. In this case the observed variable Y has survivor function

$$\Pr(Y > y) = \{1 - F(y)\}\{1 - G(y)\}.$$

Other forms of censoring also arise, and these are often more realistic for applications.

Suppose that the data available are a homogeneous random sample (y_1, d_1), ..., (y_n, d_n), and that censoring occurs at random. Let y_1 < ⋯ < y_n, so there are no tied observations. A standard estimate of the failure-time survivor function, the product-limit or Kaplan–Meier estimate, may then be written as

$$1 - \hat F(y) = \prod_{j:\, y_j \le y} \left( \frac{n - j}{n - j + 1} \right)^{d_j}. \qquad (3.9)$$

If there is no censoring, all the d_j equal one, and F̂(y) reduces to the EDF of y_1, ..., y_n (Problem 3.9). The product-limit estimate changes only at successive failures, by an amount that depends on the number of censored observations


between them. Ties between censored and uncensored data are resolved by assuming that censoring happens instantaneously after a failure might have occurred; the estimate is unaffected by other ties. A standard error for 1 − F̂(y) is given by Greenwood's formula,

$$\{1 - \hat F(y)\} \left\{ \sum_{j:\, y_j \le y} \frac{d_j}{(n - j)(n - j + 1)} \right\}^{1/2}. \qquad (3.10)$$

In setting confidence intervals this is usually applied on a transformed scale. Both (3.9) and (3.10) are unreliable where the numbers at risk of failure are small.

Since 1 − d_j is an indicator of censoring, the product-limit estimate of the censoring survivor function 1 − G is

$$1 - \hat G(y) = \prod_{j:\, y_j \le y} \left( \frac{n - j}{n - j + 1} \right)^{1 - d_j}. \qquad (3.11)$$

The cumulative hazard function may be estimated by the Nelson–Aalen estimate

$$\hat A(y) = \sum_{j=1}^{n} \frac{H(y - y_j)\, d_j}{n - j + 1}, \qquad (3.12)$$

where H(u) is the Heaviside function, which equals zero if u < 0 and equals one otherwise. Since y_1 < ⋯ < y_n, the increase in Â(y) at y_j is dÂ(y_j) = d_j/(n − j + 1). The interpretation of (3.12) is that at each failure the hazard function is estimated by the number observed to fail, divided by the number of individuals at risk (i.e. available to fail) immediately before that time. In large samples the increments of Â, the dÂ(y_j), are approximately independent binomial variables with denominators (n − j + 1) and probabilities d_j/(n − j + 1). The product-limit estimate may be expressed as

$$1 - \hat F(y) = \prod_{j:\, y_j \le y} \{1 - d\hat A(y_j)\} \qquad (3.13)$$

in terms of the components of (3.12).
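For concreteness, a minimal Python sketch (ours) of (3.9), (3.10) and the increments of (3.12), assuming distinct ordered observations:

```python
import numpy as np

def product_limit(y, d):
    # y: sorted times; d: 1 = failure observed, 0 = censored
    y = np.asarray(y, float)
    d = np.asarray(d, int)
    n = len(y)
    j = np.arange(1, n + 1)
    at_risk = n - j + 1                    # number at risk just before y_j
    dA = d / at_risk                       # Nelson-Aalen increments (3.12)
    surv = np.cumprod(((n - j) / (n - j + 1.0)) ** d)   # 1 - F_hat(y_j), (3.9)
    # Greenwood terms (3.10); undefined at the largest y_j when d_n = 1,
    # so that term is set to zero here
    denom = (n - j) * (n - j + 1.0)
    gw = np.divide(d, denom, out=np.zeros(n), where=denom > 0)
    greenwood = surv * np.sqrt(np.cumsum(gw))
    return surv, greenwood, dA
```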


Example 3.9 (AML data)  Table 3.4 contains data from a clinical trial conducted at Stanford University to assess the efficacy of maintenance chemotherapy for the remission of acute myelogeneous leukaemia (AML). After reaching a state of remission through treatment by chemotherapy, patients were divided randomly into two groups, one receiving maintenance chemotherapy and the other not. The objective of the study was to see if maintenance chemotherapy lengthened the time of remission, which ends when the symptoms recur. The data in the table were gathered for preliminary analysis before the study ended.

[Table 3.4  Remission times (weeks) for two groups of patients with acute myelogeneous leukaemia (AML), one receiving maintenance chemotherapy (Group 1) and the other not (Miller, 1981, p. 49). > indicates right-censoring.

  Group 1:   9   13   >13   18   23   >28   31   34   >45   48   >161
  Group 2:   5    5     8    8   12   >16   23   27    30   33    43   45]

The left panel of Figure 3.3 shows the estimated survivor functions for the times of remission. A plus on one of the lines indicates a censored observation. There is some suggestion that maintenance prolongs the time to remission, but the samples are small and the evidence is not overwhelming. The right panel shows the estimated survivor functions for the censoring times. Only one observation in the non-maintained group is censored, but the censoring distributions seem similar for both groups.

The estimated probabilities that remission will last beyond 20 weeks are respectively 0.71 and 0.59 for the groups, with standard errors from (3.10) both equal to 0.14.

3.5.2 Resampling plans

Cases
When the data are a homogeneous sample subject to random censorship, the most direct way to bootstrap is to set Y* = min(Y⁰*, C*), where Y⁰* and C* are independently generated from F̂ and Ĝ respectively. This implies that

$$\Pr(Y^* > y) = \{1 - \hat F(y)\}\{1 - \hat G(y)\} = \prod_{j:\, y_j \le y} \left( \frac{n - j}{n - j + 1} \right),$$

which corresponds to the EDF that places mass n^{-1} on each of the n cases (y_j, d_j). That is, ordinary bootstrap sampling under the random censorship model is equivalent to resampling cases from the original data.

Conditional bootstrap
A second sampling scheme starts from the premise that since the censoring variable C is unrelated to Y⁰, knowledge of the quantities C_1, ..., C_n alone would tell us nothing about F. They would in effect be ancillary statistics. This suggests that simulations should be conditional on the pattern of censorship, so far as practicable. To allow for the censoring pattern, we argue that although the only values of c_j known exactly are those with d_j = 0, the observed values of the remaining observations are lower bounds for the censoring variables, because C_j > y_j when d_j = 1. This suggests the following algorithm.


[Figure 3.3  Product-limit survivor function estimates for two groups of patients with AML, one receiving maintenance chemotherapy (solid) and the other not (dots). The left panel shows estimates for the time to remission, and the right panel shows the estimates for the time to censoring. In the left panel, + indicates times of censored observations; in the right panel + indicates times of uncensored observations.]

Algorithm 3.1 (Conditional bootstrap for censored data)

For r = 1, ..., R:
1. generate Y_1⁰*, ..., Y_n⁰* independently from F̂;
2. for j = 1, ..., n, make simulated censoring variables by setting C_j* = y_j if d_j = 0, and if d_j = 1, generating C_j* from {Ĝ(y) − Ĝ(y_j)}/{1 − Ĝ(y_j)}, which is the estimated distribution of C_j conditional on C_j > y_j; then
3. set Y_j* = min(Y_j⁰*, C_j*), for j = 1, ..., n.

If the largest observation is censored, it is given a notional failure time to the right of the observed value, and conversely if the largest observation is uncensored, it is given a notional censoring time to the right of the observed value. This ensures that the observation can appear in bootstrap resamples.
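A sketch of Algorithm 3.1 in Python (ours; for simplicity the masses of F̂ and Ĝ are renormalized rather than assigning notional times beyond the data, so this is an approximation to the convention just described, and it assumes the data contain at least one failure):

```python
import numpy as np

rng = np.random.default_rng(5)

def km_masses(y, d):
    # masses placed by the product-limit estimate (3.9) on each y_j
    n = len(y)
    j = np.arange(1, n + 1)
    surv = np.cumprod(((n - j) / (n - j + 1.0)) ** d)
    prev = np.concatenate([[1.0], surv[:-1]])
    return prev - surv

def conditional_bootstrap(y, d, R=499):
    y = np.asarray(y, float); d = np.asarray(d, int)
    n = len(y)
    pF = km_masses(y, d)        # masses of F_hat (failure times)
    pG = km_masses(y, 1 - d)    # masses of G_hat (censoring times)
    big = y.max() + 1.0         # notional time beyond the data
    samples = []
    for _ in range(R):
        y0 = rng.choice(y, size=n, p=pF / pF.sum())   # step 1
        c = np.where(d == 0, y, 0.0)                  # step 2: fixed if d_j = 0
        for j in np.flatnonzero(d == 1):
            tail = pG * (y > y[j])                    # G_hat given C > y_j
            c[j] = (rng.choice(y, p=tail / tail.sum())
                    if tail.sum() > 0 else big)
        ystar = np.minimum(y0, c)                     # step 3
        samples.append((ystar, (ystar == y0).astype(int)))
    return samples
```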
Both the above sampling plans can accommodate more complicated patterns of censoring, provided it is uninformative. For example, it might be decided at the start of a reliability experiment on independent and identical components that if they have not already failed, items will be censored at fixed times c_1, ..., c_n. In this situation an appropriate resampling plan is to generate failure times Y_j⁰* from F̂, and then to take Y_j* = min(Y_j⁰*, c_j), for j = 1, ..., n. This amounts to having separate censoring distributions for each item, with the jth putting mass one at c_j. Or in a medical study the jth individual might be subject to random censoring up to a time c_j, corresponding to a fixed calendar date for the end of the study. In this situation, Y_j = min(Y_j⁰, C_j, c_j), with the indicator D_j equalling zero, one, or two according to whether C_j, Y_j⁰, or c_j was observed. Then an appropriate conditional sampling plan would generate Y_j⁰* and C_j* as in the conditional plan above, but take Y_j* = min(Y_j⁰*, C_j*, c_j) and make D_j* accordingly.
Weird bootstrap
The sampling plans outlined above mimic how the data are thought to arise, by generating individual failure and censoring times. When interest is focused on the survival or hazard functions, a third and quite different approach uses direct simulation from the Nelson–Aalen estimate (3.12) of the cumulative hazard. The idea is to treat the numbers of failures at each observed failure time as independent binomial variables with denominators equal to the numbers of individuals at risk, and means equal to the numbers that actually failed. Thus when y_1 < ⋯ < y_n, we take the simulated number to fail at time y_j, N_j*, to be binomial with denominator n − j + 1 and probability of failure d_j/(n − j + 1). A simulated Nelson–Aalen estimate is then

$$\hat A^*(y) = \sum_{j=1}^{n} \frac{H(y - y_j)\, N_j^*}{n - j + 1}, \qquad (3.14)$$

which can be used to estimate the uncertainty of the original estimate Â(y). In this weird bootstrap the failures at different times are unrelated, the number at risk does not depend on previous failures, there are no individuals whose simulated failure times underlie Â*(y), and no explicit assumption is made about the censoring mechanism. Indeed, under this scheme the censored individuals are held fixed, but the number of failures is a sum of binomial variables (Problem 3.10).

The simulated survivor function corresponding to (3.14) is obtained by substituting dÂ*(y_j) = N_j*/(n − j + 1) into (3.13) in place of dÂ(y_j).
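The weird bootstrap is particularly easy to program; a minimal sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(6)

def weird_bootstrap(y, d, R=499):
    # simulate N_j* ~ Binomial(n - j + 1, d_j / (n - j + 1)) at each
    # ordered observation and rebuild the survivor function via (3.13)
    y = np.asarray(y, float); d = np.asarray(d, int)
    n = len(y)
    at_risk = n - np.arange(1, n + 1) + 1          # n - j + 1
    surv_star = np.empty((R, n))
    for r in range(R):
        N = rng.binomial(at_risk, d / at_risk)     # simulated failures
        dA = N / at_risk                           # increments of A*
        surv_star[r] = np.cumprod(1 - dA)          # 1 - F*(y_j)
    return surv_star
```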


Example 3.10 (AML data)  Figure 3.3 suggests that the censoring distributions for both groups of data in Table 3.4 are similar, but that the survival distributions themselves are not. To compare the resampling schemes described above, we consider estimates of two parameters, the probability of remission beyond 20 weeks and the median survival time, both for Group 1. These estimates are 1 − F̂(20) = 0.71 and inf{t : F̂(t) ≥ 0.5} = 31.

Table 3.5 compares results from 499 simulations using the ordinary, conditional, and weird bootstraps. For the survival probabilities, the ordinary and conditional bootstraps give similar results, and both standard errors are similar to that from Greenwood's formula; the weird bootstrap probabilities are significantly higher and are less variable.

[Table 3.5  Results for 499 replicates of censored data bootstraps of Group 1 of the AML data: average (standard deviation) for estimated probability of remission beyond 20 weeks, average (standard deviation) for estimated median survival time, and the number of resamples in which case 3 occurs 0, 1, 2 and 3 or more times.

                                              Frequency of case 3
                Probability     Median        0     1     2    ≥3
  Cases         0.72 (0.14)   32.5 (8.5)    180   182    95    42
  Conditional   0.72 (0.14)   32.8 (8.5)     75   351    71     3
  Weird         0.73 (0.12)   33.3 (7.2)      0   499     0     0]

[Figure 3.4  Comparison of distributions of differences in median survival times for censored data bootstraps applied to the AML data. The dotted line is the line x = y.]

The schemes give infinite estimates of the median 21, 19, and 2 times respectively. The weird bootstrap results for the median are less variable than the others.

The last columns of the table show the numbers of samples in which the smallest censored observation appears 0, 1, 2, and 3 or more times. Under the conditional scheme the observation appears more often than under the ordinary bootstrap, and under the weird bootstrap it occurs once in each resample.

Figure 3.4 compares the distributions of the difference of median survival times between the two groups, under the three schemes. Results for the conditional and ordinary bootstraps are similar, but the weird bootstrap again gives results that are less variable than the others. This set of data gives an extreme test of methods for censored data, because quantiles of the product-limit estimate are very discrete.

The weird bootstrap also gave results less variable than the other schemes for a larger set of data. In general it seems that case resampling and conditional resampling give quite similar and reliable results, both differing from the weird bootstrap.


3.6 Missing Data

The expression "missing data" relates to datasets of a standard form for which some entries are missing or incomplete. This happens in a variety of different ways. For example, censored data as described in Section 3.5 are incomplete when the censoring value c is reported instead of y. Or in a factorial experiment a few factor combinations may not have been used. In such cases estimates and inferences would take a simple form if the dataset were complete. But because part of the standard form is missing, we have two problems: how to estimate the quantities of interest, and how to make inferences about them. We have already discussed ways of dealing with censored data. Now we examine situations where each response has several components, some of which are missing for some cases.

Suppose, then, that the fictional or potential complete data are y⁰ and that corresponding observed data are y, with some components taking the value NA to represent "not available".

Parametric problems
For parametric problems the situation is relatively straightforward, at least in principle. First, in defining estimators there is a general framework within which complete-data MLE methods can be applied using the iterative EM (expectation–maximization) algorithm, which is widely used in incomplete data problems and essentially works by estimating missing values. Formulae exist for computing approximate standard errors of estimators, but simulation will often be required to obtain accurate answers. One extra component that must be specified is the mechanism which takes complete data y⁰ into observed data y, i.e. f(y | y⁰). The methodology is simplest when data are missing at random.

The corresponding Bayesian methodology is also relatively straightforward in principle, and numerous general algorithms exist for using complete-data forms of posterior distribution. Such algorithms, although they involve simulation, are somewhat removed from the general context of bootstrap methods and will not be discussed here.

Nonparametric problems
Nonparametric analysis is somewhat more complicated, in part because of the difficulty of defining appropriate estimators. The following artificial example illustrates some of the key ideas.
Example 3.11 (Mean with missing data)  Suppose that responses y had been obtained from n randomly chosen individuals, but that m randomly selected values were then lost. So the observed data are

$$y_1, \dots, y_n = y_1^0, \dots, y_{n-m}^0, \mathrm{NA}, \dots, \mathrm{NA}.$$



To estimate the population mean μ we should of course use the average response ȳ = (n − m)^{-1} Σ_{j=1}^{n-m} y_j, whose variance we would estimate by

$$v = \frac{1}{(n-m)^2} \sum_{j=1}^{n-m} (y_j - \bar y)^2.$$

But think of this as a prototype missing data problem, to which resampling methods are to be applied. Consider the following two approaches:

1. First estimate μ by t = ȳ, the average of the non-missing data. Then
   (a) simulate samples y_1*, ..., y_n* by sampling with replacement from the n observations y_1⁰, ..., y⁰_{n−m}, NA, ..., NA; then
   (b) calculate t* as the average of non-missing values.

2. First estimate the missing values y⁰_{n−m+1}, ..., y⁰_n by ŷ_j = ȳ for j = n − m + 1, ..., n and estimate μ as the mean of y_1⁰, ..., y⁰_{n−m}, ŷ_{n−m+1}, ..., ŷ_n. Then
   (a) sample with replacement from y_1⁰, ..., y⁰_{n−m}, ŷ_{n−m+1}, ..., ŷ_n to get y_1*, ..., y_n*;
   (b) duplicate the data-loss procedure by replacing a randomly chosen m of the y_j* with NA; finally
   (c) duplicate the data estimation of μ to get t*.

In the first approach, we choose the form of t to take account of the missing data. Then in the resampling we get a random number of missing values, M* say, whose mean is m. The effect of this is to make the variance of T* somewhat larger than the variance of T. Assuming that we discard all resamples with m* = n (all data missing), the bootstrap variance will overestimate var(T) by a factor which ranges from 15% for n = 10, m = 5 to 4% for n = 30, m = 15.

In the second approach, the first step was to "fix" the data so that the complete-data estimation formula μ̂ = n^{-1} Σ_{j=1}^n y_j for t could be used. Then we attempted to simulate data according to the two steps in the original data-generation process. Unfortunately the EDF of y_1⁰, ..., y⁰_{n−m}, ŷ_{n−m+1}, ..., ŷ_n is an underdispersed estimate of the true CDF F. Even though the estimate t is not affected in this particularly simple problem, the bootstrap distribution certainly is. This is illustrated by the bootstrap variance.

Both approaches can be repaired. In the first, we can stratify the sampling with complete and incomplete data as strata. In the second approach, we can add variability to the estimates of missing values. This device, called multiple


imputation, replaces the single estimate ŷ_j = ȳ by the set ȳ + e_1, ..., ȳ + e_{n−m}, where e_k = y_k − ȳ for k = 1, ..., n − m. Where the estimate ŷ_j was previously given weight 1, the n − m imputed values for the jth case are now given equal weights (n − m)^{-1}. The implication is that F̂ is modified to put weight n^{-1} on each complete-data value, and n^{-1} × (n − m)^{-1} on the m(n − m) values ȳ + e_k. In this simple case ȳ + e_k = y_k, so F̂ reduces to the EDF of the non-missing data y_1⁰, ..., y⁰_{n−m}, as a consequence of which t(F̂) = ȳ and the bootstrap distribution of T* is correct.

This example suggests two lessons. First, if the complete-data estimator can be modified to work for incomplete data, then resampling cases will work reasonably well provided the proportion of missing data is small: stratified resampling would reduce variation in the amount of missingness. Secondly, the complete-data estimator and full simulation of data observation (including the data-loss step) cannot be based on single imputation estimation of missing values, but may work if we use multiple imputation appropriately.

One further point concerns the data-loss mechanism, which in the example we assumed to be completely random. If data loss is dependent upon the response value y, then resampling cases should still be valid: this is somewhat similar to the censored-data problem. But the other approach via multiple imputation will become complicated because of the difficulty of defining appropriate multiple imputations.
Example 3.12 (Bivariate missing data)  A more realistic example concerns the estimation of bivariate correlation when some cases are incomplete. Suppose that Y is bivariate with components U and X. The parameter of interest is θ = corr(U, X). A random sample of n cases is taken, such that m cases have x missing, but no cases have both u and x missing or just u missing. If it is safe to assume that X has a linear regression on U, then we can use fitted regression to make single imputations of missing values. That is, we estimate each missing x_j by

$$\hat x_j = \bar x + b(u_j - \bar u),$$

where x̄, ū and b are the averages and the slope of linear regression of x on u from the n − m complete pairs.

It is easy to see that it would be wrong to substitute these single imputations in the usual formula for sample correlation. The result would be biased away from zero if b ≠ 0. Only if we can modify the sample correlation formula to remove this effect will it be sensible to use simple resampling of cases.
The other strategy is to begin with multiple imputation to obtain a suitable bivariate F̂, next estimate θ with the usual sample correlation t(F̂), and then resample appropriately. Multiple imputation uses the regression residuals from

[Figure 3.5  Scatter plot of bivariate sample and multiple imputation values. Left panel shows observed pairs (o) and cases where only u is observed (•). Right panel shows observed pairs (o) and multiple imputation values (+). Dotted line is imputation regression line obtained from observed pairs.]

complete pairs,

$$e_j = x_j - \hat x_j = x_j - \{\bar x + b(u_j - \bar u)\},$$

for j = 1, ..., n − m. Then each missing x_j is x̂_j plus a randomly selected e_k. Our estimate F̂ is the bivariate distribution which puts weight n^{-1} on each complete pair, and weight n^{-1} × (n − m)^{-1} on each of the n − m multiple imputations for each incomplete case. There are two strong, implicit assumptions being made here. First, as throughout our discussion, it is assumed that values are missing at random. Secondly, homogeneity of conditional variances is being assumed, so that pooling of residuals makes sense.
As an illustration, the left panel of Figure 3.5 shows a scatter plot for a sample of n = 20 where m = 5 cases have x components missing. Complete cases appear as open circles, and incomplete cases as filled circles: only the u components are observed. In the right panel, the dotted line is the imputation line which gives x̂_j for j = 16, ..., 20, and the multiple imputation values are plotted with symbol +. The multiple imputation EDF will put probability 1/20 on each open circle, and probability 1/(20 × 15) on each +.

The results in Table 3.6 illustrate the effectiveness of the multiple imputation EDF. The table shows simulation averages and standard deviations for estimates of correlation θ and σ_X² = var(X) using the standard complete-data forms of the estimators, when half of the x values are missing in a sample of size n = 20 from the bivariate normal distribution. In this problem there would be little gain from using incomplete cases, but in more complex situations there might be so few complete cases that multiple imputation would be highly effective or even essential.

[Table 3.6  Average (standard deviation) of estimators for variance σ_X² and correlation θ from bivariate normal data (u, x) with sample size n = 20 and m = 10 x values missing at random. True values σ_X² = 1 and θ = 0.7. Results from 1000 simulated datasets.

                      Observed data estimates
          Full data    Complete case only   Single imputation   Multiple imputation
  σ_X²   1.00 (0.33)      1.01 (0.49)          0.79 (0.44)          0.96 (0.46)
  θ      0.69 (0.13)      0.68 (0.20)          0.79 (0.18)          0.70 (0.19)]

Having set up an appropriate multiple imputation EDF F̂, resampling proceeds in an obvious way, first creating a full set of n pairs by random sampling from F̂, and then selecting m cases randomly without replacement for which the x values are "lost". The first stage is equivalent to random sampling with replacement from n − m copies of the complete data plus all m × (n − m) possible multiple imputation values.
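A minimal sketch of this two-stage resampling (ours; estimation on each resample is left to the caller):

```python
import numpy as np

rng = np.random.default_rng(7)

def mi_resample(u, x, missing, R=999):
    # sample n pairs from the multiple-imputation EDF F_hat, then mark
    # m randomly chosen x values as lost; returns (u*, x*, missing*)
    u, x = np.asarray(u, float), np.asarray(x, float)
    missing = np.asarray(missing, bool)
    obs = ~missing
    uo, xo = u[obs], x[obs]
    b = np.cov(uo, xo)[0, 1] / uo.var(ddof=1)     # imputation slope
    e = xo - xo.mean() - b * (uo - uo.mean())     # residuals, complete pairs
    # support of F_hat: complete pairs (weight 1 each) plus the
    # m*(n-m) pairs (u_j, xhat_j + e_k) (weight 1/(n-m) each)
    xhat = xo.mean() + b * (u[missing] - uo.mean())
    pairs = np.vstack([np.column_stack([uo, xo]),
                       np.column_stack([np.repeat(u[missing], len(e)),
                                        (xhat[:, None] + e).ravel()])])
    w = np.concatenate([np.ones(len(uo)),
                        np.full(len(pairs) - len(uo), 1.0 / len(e))])
    w /= w.sum()
    n, m = len(u), int(missing.sum())
    out = []
    for _ in range(R):
        idx = rng.choice(len(pairs), size=n, replace=True, p=w)
        miss_star = np.zeros(n, bool)
        miss_star[rng.choice(n, size=m, replace=False)] = True
        out.append((pairs[idx, 0], pairs[idx, 1], miss_star))
    return out
```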

3.7 Finite Population Sampling

Basics
The simplest form of finite population sampling is when a sample y_1, ..., y_n is taken randomly without replacement from a population 𝒴 with values 𝒴_1, ..., 𝒴_N, with N > n known. The statistic t(y_1, ..., y_n) is used to estimate the corresponding population quantity θ = t(𝒴_1, ..., 𝒴_N). The data are one of the $\binom{N}{n}$ possible samples Y_1, ..., Y_n from the population, and the without-replacement sampling means that the Y_j are exchangeable but not independent; the sampling fraction is defined to be f = n/N. If n ≪ N, f is very small and correlation among the Y_1, ..., Y_n will have little effect, but in practice f often lies in the range 0.1–0.5 and cannot be ignored. Dependence among the Y_j complicates inference for θ, as the following example indicates.
Example 3.13 (Sample average)  Suppose that the 𝒴_j are scalar and that we want a confidence interval for the population average θ = N^{-1} Σ_{j=1}^N 𝒴_j. Although the sample average Ȳ = n^{-1} Σ Y_j is an unbiased estimator of θ, when sampling with and without replacement we find

$$\mathrm{var}(\bar Y) = \begin{cases} (1 - N^{-1})\, n^{-1}\gamma, & \text{with replacement}, \\ (1 - f)\, n^{-1}\gamma, & \text{without replacement}, \end{cases} \qquad (3.15)$$

where γ = (N − 1)^{-1} Σ_{j=1}^N (𝒴_j − θ)². The sample variance c = (n − 1)^{-1} Σ (y_j − ȳ)² is an unbiased estimate of γ, and the usual standard error for ȳ under without-replacement sampling is obtained from the second line of (3.15) by replacing γ with c. Normal approximation to the distribution of Ȳ then gives approximate (1 − 2α) confidence limits ȳ + (1 − f)^{1/2} c^{1/2} n^{-1/2} z_α for θ, where z_α is the α

(standard deviation) o f
estim ators for variance
and correlation 6
from bivariate normal
da ta (u,x) with sample
size n = 20 and m = 10
x values missing at
random. True values
o^ l and B 0.7.
Results from 1000
simulated datasets.


quantile of the standard normal distribution. Such confidence intervals are a factor (1 − f)^{1/2} shorter than for sampling with replacement.

The lack of independence affects possible resampling plans, as is seen by applying the ordinary bootstrap to Ȳ. Suppose that Y_1*, ..., Y_n* is a random sample taken with replacement from y_1, ..., y_n. Their average Ȳ* has variance var*(Ȳ*) = n^{-2} Σ (y_j − ȳ)², and this has expected value n^{-2}(n − 1)γ over possible samples y_1, ..., y_n. This only matches the second line of (3.15) if f = n^{-1}. Thus for the larger values of f generally met in practice, ordinary bootstrap standard errors for ȳ are too large and the confidence intervals for θ are systematically too wide.

Modified sample size
The key difficulty with the ordinary bootstrap is that it involves with-replacement samples of size n and so does not capture the effect of the sampling fraction, which is to shrink the variance of an estimator. One way to deal with this is to take resamples of size n′, resampling with or without replacement. The value of n′ is chosen so that the estimator variance is matched, at least approximately.

For with-replacement resamples the average Ȳ* of Y_1*, ..., Y_{n′}* has variance var*(Ȳ*) = (n − 1)c/(n′n), which is only an unbiased estimate of (1 − f)γ/n when n′ = (n − 1)/(1 − f); this usually exceeds n.

For without-replacement resampling, a similar argument implies that we should take n′ = fn. One obvious difficulty with this is that if f ≪ 1, the resample size is much smaller than n, and then the resampled statistics may be much less stable than those based on samples of size n. This suggests that we mirror the dependence induced by sampling without replacement but try to match the original sample size, by resampling as follows. Suppose first that m = nf and k = n/m are both integers, and that to form our resample we concatenate k without-replacement samples of size m taken independently from y_1, ..., y_n. Then our resample has size n′ = mk, and the same sampling fraction as the original data. This is known as the mirror-match bootstrap. When m and k are not integers we choose m to be the positive integer closest to nf and take k so that km ≤ n < (k + 1)m. We then select randomly either k or k + 1 without-replacement samples from y_1, ..., y_n with probabilities chosen to match the original sampling fraction. If randomization is used it is important that it be incorporated correctly into the resampling scheme (Problem 3.15).
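A minimal sketch of the mirror-match resample (ours), restricted to the simple case in which m = nf and k = n/m are integers:

```python
import numpy as np

rng = np.random.default_rng(8)

def mirror_match_sample(y, f):
    # concatenate k independent without-replacement samples of size m
    y = np.asarray(y)
    n = len(y)
    m = round(n * f)
    assert m > 0 and n % m == 0, "sketch assumes m = n*f and k = n/m integral"
    k = n // m
    return np.concatenate(
        [rng.choice(y, size=m, replace=False) for _ in range(k)])
```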
Population and superpopulation bootstraps
Suppose for the moment that N/n is an integer, k. Then one obvious idea is to form a fake population 𝒴* of size N by concatenating k copies of y_1, ..., y_n. The natural next step, which mimics how the data were sampled, is to generate a bootstrap replicate of y_1, ..., y_n by taking a sample of size n without replacement from 𝒴*. So the bootstrap sample, Y_1*, ..., Y_n*, is one of the $\binom{N}{n}$ possible without-replacement samples from 𝒴*, and the corresponding bootstrap value is T* = t(Y_1*, ..., Y_n*).

If N/n is not an integer, we write N = kn + l, where 0 < l < n, and form 𝒴* by taking k copies of y_1, ..., y_n and adding to them a sample of size l taken without replacement from y_1, ..., y_n. Bootstrap samples are formed as when N = kn, but a different 𝒴* is used for each. We call this the population bootstrap.

Under a superpopulation model, the members of the population 𝒴 are themselves a random sample from an underlying distribution, 𝒫. The nonparametric maximum likelihood estimate of 𝒫 is the EDF of the sample, which suggests the following resampling plan.

Algorithm 3.2 (Superpopulation bootstrap)

For r = 1, ..., R:
1. generate a replicate population 𝒴* = (𝒴_1*, ..., 𝒴_N*) by sampling N times with replacement from y_1, ..., y_n; then
2. generate a bootstrap sample Y_1*, ..., Y_n* by sampling n times without replacement from 𝒴*, and set T* = t(Y_1*, ..., Y_n*).

As one would expect, this gives results similar to the population bootstrap.
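Minimal sketches of both plans (ours):

```python
import numpy as np

rng = np.random.default_rng(9)

def population_bootstrap(y, N, t_func, R=999):
    # rebuild a fake population of size N = k*n + l from copies of the
    # sample (a fresh completion each replicate), then draw n without
    # replacement from it
    y = np.asarray(y)
    n = len(y)
    k, l = divmod(N, n)
    t_star = np.empty(R)
    for r in range(R):
        extra = rng.choice(y, size=l, replace=False)
        fake_pop = np.concatenate([np.tile(y, k), extra])
        t_star[r] = t_func(rng.choice(fake_pop, size=n, replace=False))
    return t_star

def superpopulation_bootstrap(y, N, t_func, R=999):
    # Algorithm 3.2: replicate population with replacement, then draw
    # n without replacement from it
    y = np.asarray(y)
    n = len(y)
    t_star = np.empty(R)
    for r in range(R):
        pop = rng.choice(y, size=N, replace=True)
        t_star[r] = t_func(rng.choice(pop, size=n, replace=False))
    return t_star
```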
Example 3.14 (Sample average)  Suppose that y_1, ..., y_n are scalars, that N = kn, and that interest focuses on θ = N^{-1} Σ 𝒴_j, as in Example 3.13. Then under the population bootstrap,

$$\mathrm{var}^*(\bar Y^*) = \frac{N(n-1)}{(N-1)n}\,(1 - f)\,\frac{c}{n},$$

and this is the correct formula apart from the first factor on the right, which is typically close to one. Under the superpopulation bootstrap a straightforward calculation establishes that the mean variance of Ȳ* is (n − 1)/n × (1 − f) n^{-1} c (Problem 3.12).

These sampling schemes make almost the right allowance for the sampling fraction, at least for the average.

For the mirror-match scheme we suppose that n = km for integer m, and write Ȳ* = n^{-1} Σ_{i=1}^{k} Σ_{j=1}^{m} Y_{ij}*, where (Y_{i1}*, ..., Y_{im}*) is the ith without-replacement resample, independent of the other without-replacement resamples. Then we can use (3.15) to establish that var*(Ȳ*) = (km)^{-1}(1 − m/n)c. Because our assumptions imply that f = m/n, this is an unbiased estimate of var(Ȳ), but it would be biased if m ≠ nf.


Studentized confidence intervals
Suppose that v = v(y_1, ..., y_n) is an estimated variance for the statistic t = t(y_1, ..., y_n), based on the without-replacement sample y_1, ..., y_n, and that some bootstrap scheme is used to form replicates t_r* and v_r* of t and v, for r = 1, ..., R. Then the studentized bootstrap can be used to form confidence intervals for θ, based on the values of z_r* = (t_r* − t)/v_r*^{1/2}. As outlined in Section 2.4, a (1 − 2α) confidence interval has limits

$$t - v^{1/2} z^*_{((R+1)(1-\alpha))}, \qquad t - v^{1/2} z^*_{((R+1)\alpha)},$$

where z*_{((R+1)p)} is the empirical p quantile of the z_r*. If the population or superpopulation bootstraps are used, and N, n → ∞ in such a way that f = n/N → π, where 0 < π < 1, these intervals can be shown to have the same good properties as when the data are a random sample from an infinite population; see Section 5.4.1.
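These limits are easily computed from the replicates; a minimal sketch (ours):

```python
import numpy as np

def studentized_ci(t, v, t_star, v_star, alpha=0.025):
    # studentized bootstrap limits from replicates (t_r*, v_r*)
    z = np.sort((t_star - t) / np.sqrt(v_star))
    R = len(z)
    z_hi = z[int((R + 1) * (1 - alpha)) - 1]   # z*_((R+1)(1-alpha))
    z_lo = z[int((R + 1) * alpha) - 1]         # z*_((R+1)alpha)
    return t - np.sqrt(v) * z_hi, t - np.sqrt(v) * z_lo
```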
Example 3.15 (City population data)  For a numerical assessment of the schemes outlined above, we consider again the data in Example 1.2, on 1920 and 1930 populations (in thousands) of N = 49 US cities. Table 2.1 contains populations y_j = (u_j, x_j) for a sample of n = 10 cities taken without replacement from the 49, and we use them to estimate the mean 1930 population θ = N^{-1} Σ_{j=1}^{N} 𝒳_j for the 49 cities.

Two standard estimators of θ are the ratio and regression estimators. The ratio estimate and its estimated variance are given by

$$t_{\mathrm{rat}} = \bar u_N\, \frac{\sum_{j=1}^{n} x_j}{\sum_{j=1}^{n} u_j}, \qquad v_{\mathrm{rat}} = \frac{1-f}{n(n-1)} \left( \frac{\bar u_N}{\bar u} \right)^{2} \sum_{j=1}^{n} \left( x_j - \frac{t_{\mathrm{rat}}}{\bar u_N}\, u_j \right)^{2}, \qquad (3.16)$$

where ū_N is the known population mean of the u values and ū the sample mean. For our data t_rat = 156.8 and v_rat = 10.85². The regression estimate is based on the straight-line regression x = β₀ + β₁u, fitted to the data (u_1, x_1), ..., (u_n, x_n) using least squares estimates β̂₀ and β̂₁. The regression estimate of θ and its estimated variance are

$$t_{\mathrm{reg}} = \hat\beta_0 + \hat\beta_1 \bar u_N, \qquad v_{\mathrm{reg}} = \frac{1-f}{n(n-2)} \sum_{j=1}^{n} (x_j - \hat\beta_0 - \hat\beta_1 u_j)^2; \qquad (3.17)$$

for our data t_reg = 138.3 and v_reg = 8.32².

Table 3.7 contains 95% confidence intervals for θ based on normal approximations to t_rat and t_reg, and on the studentized bootstrap applied to (3.16) and (3.17). Normal approximations to the distributions of t_rat and t_reg are poor, and intervals based on them are considerably shorter than the other intervals. The population and superpopulation bootstraps give rather similar intervals.

The sampling fraction is f = 10/49, so the estimate of the distribution of Ȳ using modified sample size and without-replacement resampling uses

[Table 3.7  City population data: 95% confidence limits for the mean population per city in 1930 based on the ratio and regression estimates, using normal approximation and various resampling methods with R = 999.

                                 Ratio               Regression
  Scheme                     Lower    Upper        Lower    Upper
  Normal                     137.8    174.7        123.7    152.0
  Modified size, n′ = 2       58.9    298.6           —        —
  Modified size, n′ = 11     111.9    196.2          …      258.2
  Mirror-match, m = 2        115.6    196.0        112.8    258.7
  Population                 118.9    193.3        116.1    240.7
  Superpopulation            120.3    195.9        114.0    255.4]

[Table 3.8  City population data. Empirical coverages (%) and average and standard deviation of length of 90% confidence intervals based on the ratio estimate of the 1930 total, based on 1000 samples of size 10 from the population of size 49. The nominal lower, upper and overall coverages are 5, 95 and 90.

                               Coverage                 Length
  Scheme                  Lower   Upper   Overall   Average    SD
  Normal                    7       89       82        23      8.2
  Modified size, n′ = 2     1       98       98       151    142
  Modified size, n′ = 11    2       91       89        34     19
  Mirror-match, m = 2       3       91       88        33     19
  Population                2       91       89        36     21
  Superpopulation           1       92       91        41     24]

samples of size nf = 2. Not surprisingly, without-replacement resamples of size n′ = 2 from 10 observations give a very poor idea of what happens when samples of size 10 are taken without replacement from 49 observations, and the corresponding confidence interval is very wide. Studentized bootstrap confidence limits cannot be based on t_reg, because with n′ = 2 we have v*_reg = 0. For with-replacement resampling, we take (n − 1)/(1 − f) = n′ = 11, giving intervals quite close to those for the mirror-match, population and superpopulation bootstraps.

Figure 3.6 shows why the upper endpoints of the ratio and regression confidence intervals differ so much. The variance estimate v*_reg is unstable because of resamples in which case 4 does not appear and case 9 appears just once or not at all; then z* takes large negative values. The right panel of the figure explains this; the regression slope changes markedly when case 4 is deleted. Exclusion of case 9 further reduces the regression sum of squares and hence v*_reg. The ratio estimate is much less sensitive to case 4. If we insisted on using t_reg, one solution would be to exclude from the simulation samples in which case 4 does not appear. Then the 0.025 and 0.975 quantiles of z*_reg using the population bootstrap are −1.30 and 3.06, and the corresponding confidence interval is [112.9, 149.1].


[Figure 3.6  Population bootstrap results for the regression estimator based on the city data with n = 10. The left panel shows values of z*_reg and v*_reg^{1/2} for resamples in which case 4 appears at least once (dots), and in which case 4 does not appear and case 9 appears zero times (0), once (1), or more times (+). The right panel shows the sample and the regression lines fitted to the data with case 4 (dashes) and without it (dots); the vertical line shows the value ū_N at which θ is estimated.]

To compare the performances of the various methods in setting confidence intervals, we conducted a numerical experiment in which 1000 samples of size n = 10 were taken without replacement from the population of size N = 49. For each sample we calculated 90% confidence intervals [L, U] for θ using R = 999 bootstrap samples. Table 3.8 contains the empirical values of Pr(θ ≤ L), Pr(θ ≤ U), and Pr(L < θ ≤ U). The normal intervals are short and their coverages are much too small, while the modified intervals with n′ = 2 have the opposite problem. Coverages for the modified sample size with n′ = 11 and for the population and superpopulation bootstrap are close to their nominal levels, though their endpoints seem to be slightly too far left. The 80% and 95% intervals and those for the regression estimator have similar properties. In line with other studies in the literature, we conclude that the population and superpopulation bootstraps are the best of those considered here.

Stratified sampling
In most applications the population is divided into k strata, the ith of which contains N_i individuals from which a sample of size n_i is taken without replacement, independent of other strata. The ith sampling fraction is f_i = n_i/N_i and the proportion of the population in the ith stratum is w_i = N_i/N, where N = N_1 + ⋯ + N_k. The estimate of θ and its standard error are found by combining quantities from each stratum.

Two different setups can be envisaged for mathematical discussion. In the first — the small-k case — there is a small number of large strata: the asymptotic regime takes k fixed and n_i, N_i → ∞ with f_i → π_i, where 0 < π_i < 1.


Apart from there being k strata, the same ideas and results will apply as above, with the chosen resampling scheme applied separately in each stratum. The second setup — the large-k case — is where there are many small strata; in mathematical terms we suppose that k → ∞ but that N_i and n_i are bounded. This situation is more complicated, because biases from each stratum can combine in such a way that a bootstrap fails completely.
Example 3.16 (Average)  Suppose that the population 𝒴 comprises k strata, and that the jth item in the ith stratum is labelled 𝒴_{ij}; the average for that stratum is 𝒴̄_i. Then the population average is θ = Σ_i w_i 𝒴̄_i, which is estimated by T = Σ_i w_i Ȳ_i, where Ȳ_i is the average of the sample Y_{i1}, ..., Y_{in_i} from the ith stratum. The variance of T is

$$V = \sum_{i=1}^{k} w_i^2 (1 - f_i) \times \frac{1}{n_i} \times \frac{1}{N_i - 1} \sum_{j=1}^{N_i} (\mathscr{Y}_{ij} - \bar{\mathscr{Y}}_i)^2, \qquad (3.18)$$

an unbiased estimate of which is

$$v = \sum_{i=1}^{k} w_i^2 (1 - f_i) \times \frac{1}{n_i} \times \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)^2. \qquad (3.19)$$

Suppose for sake of simplicity that each N_i/n_i is an integer, and that the population bootstrap is applied to each stratum independently. Then the variance of the bootstrap version of T is

$$\mathrm{var}^*(T^*) = \sum_{i=1}^{k} w_i^2 (1 - f_i)\, \frac{N_i(n_i - 1)}{(N_i - 1)n_i} \times \frac{1}{n_i} \times \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar y_i)^2, \qquad (3.20)$$

the mean of which is obtained by replacing the last term on the right by (N_i − 1)^{-1} Σ_j (𝒴_{ij} − 𝒴̄_i)². If k is fixed and n_i, N_i → ∞ while f_i → π_i, (3.20) will converge to v, but this will not be the case if n_i, N_i are bounded and k → ∞. The bootstrap bias estimate also may fail for the same reason (Problem 3.12).
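A minimal sketch of the stratum-by-stratum population bootstrap for T (ours; it assumes each N_i ≥ n_i):

```python
import numpy as np

rng = np.random.default_rng(10)

def stratified_population_bootstrap(samples, N_sizes, R=999):
    # apply the population bootstrap independently within each stratum
    # and return replicates of T = sum_i w_i * ybar_i
    N = sum(N_sizes)
    w = [Ni / N for Ni in N_sizes]
    t_star = np.empty(R)
    for r in range(R):
        total = 0.0
        for s, Ni, wi in zip(samples, N_sizes, w):
            s = np.asarray(s)
            ni = len(s)
            k, l = divmod(Ni, ni)
            fake = np.concatenate([np.tile(s, k),
                                   rng.choice(s, size=l, replace=False)])
            total += wi * rng.choice(fake, size=ni, replace=False).mean()
        t_star[r] = total
    return t_star
```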

For setting confidence intervals using the studentized bootstrap the key issue is not the performance of bias and variance estimates, but the extent to which the distribution of the resampled quantity Z* = (T* − t)/V*^{1/2} matches that of Z = (T − θ)/V^{1/2}. Detailed calculations show that when the population and superpopulation bootstraps are used, Z and Z* have the same limiting distribution under both asymptotic regimes, and that under the fixed-k setup the approximation is better than that using the other resampling plans.

Example 3.17 (Stratified ratio)  For empirical comparison of the more promising of these finite population resampling schemes with stratified data, we generated a population with N pairs (u, x) divided into strata of sizes N_1, ..., N_k

[Table 3.9  Empirical coverages (%) of nominal 90% confidence intervals using the ratio estimate for a population average, based on 1000 stratified samples from populations with k strata of size N_i, from each of which a sample of size n_i = N_i/3 was taken without replacement. The nominal lower (L), upper (U) and overall (O) coverages are 5, 95 and 90.

                     k = 20, N_i = 18    k = 5, N_i = 72    k = 3, N_i = 18
                      L    U    O         L    U    O        L    U    O
  Normal              5   93   88         4   94   90        7   93   86
  Modified size       6   94   89         4   94   90        6   96   90
  Mirror-match        9   92   83         8   90   82        6   94   88
  Population          6   95   89         5   95   90        6   95   89
  Superpopulation     3   97   95         2   98   96        3   98   96]

according to the ordered values of u. The aim was to form 90% confidence intervals for

    θ = N^{-1} \sum_{i=1}^{k} \sum_{j=1}^{N_i} x_{ij},

where x_{ij} is the value of x for the jth element of stratum i.

We took independent samples (u_{ij}, x_{ij}) of sizes n_i without replacement from the ith stratum, and used these to form the ratio estimate of θ and its estimated variance, given by

    t = \sum_{i=1}^{k} w_i \bar{U}_i t_i,    v = \sum_{i=1}^{k} w_i^2 (1 - f_i) \frac{1}{n_i(n_i - 1)} \sum_{j=1}^{n_i} (x_{ij} - t_i u_{ij})^2,

where

    t_i = \frac{\sum_{j=1}^{n_i} x_{ij}}{\sum_{j=1}^{n_i} u_{ij}},    \bar{U}_i = N_i^{-1} \sum_{j=1}^{N_i} u_{ij},    i = 1, ..., k;

these extend (3.16) to stratified sampling. We used bootstrap resamples with R = 199 to compute studentized bootstrap confidence intervals for θ based on 1000 different samples from the simulated populations. Table 3.9 shows the empirical coverages of these confidence intervals in three situations: a large-k case with k = 20, N_i = 18 and n_i = 6; a small-k case with k = 5, N_i = 72 and n_i = 24; and a small-k case with k = 3, N_i = 18 and n_i = 6. The modified sampling method used sampling with replacement, giving samples of size n' = 7 when n = 6 and size n' = 34 when n = 24, while the corresponding values of m for the mirror-match method were 3 and 8. Throughout, f_i = 1/3.

In all three cases the coverages for normal, population and modified sample size intervals are close to nominal, while the mirror-match method does poorly. The superpopulation method also does poorly, perhaps because it was applied to separate strata rather than used to construct a new population to be stratified at each replicate. Similar results were obtained for nominal 80% and 95% confidence limits. Overall the population bootstrap and modified sample size methods do best in this limited comparison, and coverage is not improved by using the more complicated mirror-match method.

3.8 Hierarchical Data


In some studies the variation in responses may be hierarchical or multilevel, as happens in repeated-measures experiments and the classical split-plot experiment. Depending upon the nature of the parameter being estimated, it may be important to take careful account of the two (or more) sources of variation when setting up a resampling scheme. In principle there should be no difficulty with parametric resampling: having fitted the model parameters, resample data will be generated according to a completely defined model. Nonparametric resampling is not straightforward: certainly it will not make sense to use simple nonparametric resampling, which treats all observations as independent. Here we discuss some of the basic points about nonparametric resampling in a relatively simple context.

Perhaps the most basic problem involving hierarchical variation can be formulated as follows. For each of a groups we obtain b responses y_{ij} such that

    y_{ij} = x_i + z_{ij},    i = 1, ..., a,  j = 1, ..., b,    (3.21)

where the x_i are randomly sampled from F_x and independently the z_{ij} are randomly sampled from F_z, with E(Z) = 0 to force uniqueness of the model. Thus there is homogeneity of variation in Z between groups, and the structure is additive. The feature of this model that complicates resampling is the correlation between observations within a group,

    var(Y_{ij}) = σ_x² + σ_z²,    cov(Y_{ij}, Y_{ik}) = σ_x²,    j ≠ k.    (3.22)

For data having this nested structure, one might be interested in parameters of F_x or F_z or some combination of both. For example, when testing for presence of variation in X the usual statistic of interest is the ratio of between-group and within-group sums of squares.

How should one resample nonparametrically for such a data structure? There are two simple strategies, for both of which the first stage is to randomly sample groups with replacement. At the second stage we randomly sample within the groups selected at the first stage, either without replacement (Strategy 1) or with replacement (Strategy 2). Note that Strategy 1 keeps selected groups intact.
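The two strategies are easy to express in code. The following Python sketch is illustrative (the names are ours, not from the text); it produces one resample of an a × b array under either strategy.

    import numpy as np

    rng = np.random.default_rng(2)

    def resample_hierarchical(y, strategy=1):
        """One bootstrap resample of an a x b array of grouped data y[i, j].
        Stage 1: sample a group indices with replacement.  Stage 2: within
        each selected group, keep the whole group (strategy 1, sampling b
        from b without replacement) or sample b values with replacement
        (strategy 2)."""
        a, b = y.shape
        groups = rng.integers(0, a, size=a)          # I*_1, ..., I*_a
        out = np.empty_like(y)
        for i, g in enumerate(groups):
            if strategy == 1:
                out[i] = y[g]                         # selected group kept intact
            else:
                out[i] = rng.choice(y[g], size=b, replace=True)
        return out

    # toy usage: group effects plus within-group noise
    y = rng.normal(size=(8, 4)) + rng.normal(size=(8, 1))
    ystar = resample_hierarchical(y, strategy=1)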
To see which strategy is likely to work better, we look at the second moments of resampled data y*_{ij} to see how well they match (3.22). Consider selecting y*_{i1}, ..., y*_{ib}. At the first stage we select a random integer I* from {1, 2, ..., a}. At the second stage, we select random integers J*_1, ..., J*_b from {1, 2, ..., b}, either without replacement (Strategy 1) or with replacement (Strategy 2): the


sampling without replacement is equivalent to keeping the I*th group intact. Under both strategies

    E*(Y*_{ij} | I* = i') = \bar{y}_{i'}    and    E*(Y*_{ij}^2 | I* = i') = b^{-1} \sum_{l=1}^{b} y_{i'l}^2.

However,

    E*(Y*_{ij} Y*_{ik} | I* = i') = \{b(b-1)\}^{-1} \sum_{l \ne m} y_{i'l} y_{i'm},    Strategy 1,
    E*(Y*_{ij} Y*_{ik} | I* = i') = b^{-2} \sum_{l=1}^{b} \sum_{m=1}^{b} y_{i'l} y_{i'm},    Strategy 2,

for j ≠ k. Therefore

    E*(Y*_{ij}) = \bar{y},    var*(Y*_{ij}) = \frac{SS_B}{a} + \frac{SS_W}{ab},    (3.23)

and

    cov*(Y*_{ij}, Y*_{ik}) = \frac{SS_B}{a} - \frac{SS_W}{ab(b-1)},    Strategy 1,
    cov*(Y*_{ij}, Y*_{ik}) = \frac{SS_B}{a},    Strategy 2,    (3.24)

where \bar{y} = a^{-1} \sum_i \bar{y}_i, SS_B = \sum_{i=1}^{a} (\bar{y}_i - \bar{y})^2 and SS_W = \sum_{i=1}^{a} \sum_{j=1}^{b} (y_{ij} - \bar{y}_i)^2. To see how well the resampling variation mimics (3.22), we calculate expectations of (3.23) and (3.24), using

    E(SS_B) = (a - 1)(σ_x² + b^{-1}σ_z²),    E(SS_W) = a(b - 1)σ_z².

This gives

    E{var*(Y*_{ij})} = (1 - a^{-1})σ_x² + \{1 - (ab)^{-1}\}σ_z²,

and

    E{cov*(Y*_{ij}, Y*_{ik})} = (1 - a^{-1})σ_x² - (ab)^{-1}σ_z²,    Strategy 1,
    E{cov*(Y*_{ij}, Y*_{ik})} = (1 - a^{-1})σ_x² + (a - 1)(ab)^{-1}σ_z²,    Strategy 2.

On balance, therefore, Strategy 1 more closely mimics the variation properties of the data, and so is the preferable strategy. Resampling should work well so long as a is moderately large, say at least 10, just as resampling homogeneous data works well if n is moderately large. Of course both strategies would work well if both a and b were very large, but this is rarely the case.
An application of these results is given in Example 6.9.

The preceding discussion would apply to balanced data structures, but not to more complex situations, for which a more general approach is required. A direct, model-based approach would involve resampling from suitable estimates of the two (or more) data distributions, generalizing the resampling from F̂ in Chapter 2. Here we outline how this might work for the data structure (3.21).


Estimates of the two CDFs F_x and F_z can be formed by first estimating the xs and zs, and then using their EDFs. A naive version of this, which parallels standard linear model theory, is to define

    x̂_i = \bar{y}_i,    ẑ_{ij} = y_{ij} - \bar{y}_i.    (3.25)

The resulting way to obtain a resampled dataset is to

choose x*_1, ..., x*_a by randomly sampling with replacement from x̂_1, ..., x̂_a; then

choose z*_{11}, ..., z*_{ab} by randomly sampling ab times with replacement from ẑ_{11}, ..., ẑ_{ab}; and finally

set y*_{ij} = x*_i + z*_{ij},    i = 1, ..., a,  j = 1, ..., b.

Straightforward calculations (Problem 3.17) show that this approach has the same second-moment properties for the y*_{ij} as Strategy 2 earlier, shown in (3.23) and (3.24), which are not satisfactory. Somewhat predictably, Strategy 1 is mimicked by choosing z*_{i1}, ..., z*_{ib} randomly with replacement from one group of residuals ẑ_{k1}, ..., ẑ_{kb}, either a randomly selected group or the group corresponding to x*_i (Problem 3.17).

What has gone wrong here is that the estimates x̂_i in (3.25) have excess variation, namely E{(a - 1)^{-1}SS_B} = σ_x² + b^{-1}σ_z², relative to the target σ_x². The estimates ẑ_{ij} defined in (3.25) will be satisfactory provided b is reasonably large, although in principle they should be standardized to

    ẑ_{ij}/(1 - b^{-1})^{1/2}.    (3.26)

The excess variation in x̂_i can be corrected by using the shrinkage estimate

    x̃_i = c\bar{y} + (1 - c)\bar{y}_i,

where c is given by

    (1 - c)^2 = 1 - \frac{(a - 1)SS_W}{ab(b - 1)SS_B},

with c = 1 if the right-hand side is negative. A straightforward calculation shows that this choice for c makes the sample variance of the x̃_i equal to the components-of-variance estimator of σ_x²; see Problem 3.18. Note that the wisdom of matching first and second moments may depend upon θ being a function of such moments.
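As a rough illustration, here is a Python sketch of the model-based scheme with the standardization (3.26) and the shrinkage correction as reconstructed above; because it resamples residuals from the pooled set, it mimics Strategy 2, as noted in the text. All names are illustrative.

    import numpy as np

    rng = np.random.default_rng(3)

    def resample_model_based(y):
        """Model-based resample for y[i, j] = x_i + z_ij, following
        (3.25)-(3.26) with the shrinkage correction for group means."""
        a, b = y.shape
        ybar_i = y.mean(axis=1)
        ybar = ybar_i.mean()
        SSB = ((ybar_i - ybar) ** 2).sum()
        SSW = ((y - ybar_i[:, None]) ** 2).sum()

        # shrinkage constant (as reconstructed in the text):
        # (1 - c)^2 = 1 - (a - 1) SSW / {a b (b - 1) SSB}, with c = 1 if negative
        rhs = 1 - (a - 1) * SSW / (a * b * (b - 1) * SSB)
        c = 1.0 if rhs < 0 else 1 - np.sqrt(rhs)
        x_tilde = c * ybar + (1 - c) * ybar_i

        # residuals standardized as in (3.26)
        z_tilde = (y - ybar_i[:, None]) / np.sqrt(1 - 1 / b)

        x_star = rng.choice(x_tilde, size=a, replace=True)
        z_star = rng.choice(z_tilde.ravel(), size=(a, b), replace=True)
        return x_star[:, None] + z_star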


3.9 Bootstrapping the Bootstrap


3.9.1 Bias correction of bootstrap calculations
As with most statistical methods, the bootstrap does not provide exact answers. For example, the basic confidence interval methods outlined in Section 2.4 do not have coverage exactly equal to the target, or nominal, coverage. Similarly the bias and variance estimates B and V of Section 2.2.1 are typically biased. In many cases the discrepancies involved are not practically important, or there is some specific remedy, as with the improved confidence limit methods of Chapter 5. Nevertheless it is useful to have available a general technique for making a bias correction to a bootstrap calculation. That technique is the bootstrap itself. Here we describe how to apply the bootstrap to improve estimation of the bias of an estimator in the simple situation of a single random sample.

In the notation of Chapter 2, the estimator T = t(F̂) has bias

    β = b(F) = E(T) - θ = E{t(F̂) | F} - t(F).

The bootstrap estimate of this bias is

    B = b(F̂) = E*(T*) - T = E*{t(F̂*) | F̂} - t(F̂),    (3.27)

where F̂* denotes either the EDF of the bootstrap sample Y*_1, ..., Y*_n drawn from F̂, or the parametric model fitted to that sample. Thus the calculation applies to both parametric and nonparametric situations. There is both random variation and systematic bias in B in general: it is the bias with which we are concerned here.

As with T itself, so with B: the bias can be estimated using the bootstrap. If we write γ = c(F) = E(B | F) - b(F), then the simple bootstrap estimate according to the general principle laid out in Chapter 2 is C = c(F̂). From the definition of c(F) this implies

    C = E*(B* | F̂) - B,

the bootstrap estimate of the bias of B. To see just what C involves, we use the definition of B in (3.27) to obtain

    C = E*[E**{t(F̂**) | F̂*} - t(F̂*) | F̂] - [E*{t(F̂*) | F̂} - t(F̂)];    (3.28)

or more simply, after combining terms,

    C = E*{E**(T**)} - 2E*(T* | F̂) + T.    (3.29)

Here F̂** denotes the EDF of a sample drawn from F̂*, or the parametric model fitted to that sample; T** is the estimate computed with that sample; and E** denotes expectation over the distribution of that sample conditional on F̂*. There are two levels of bootstrapping in this procedure, which is therefore


called the nested or double bootstrap. In principle a nested bootstrap might involve more than two levels, but in practice the computational burden would ordinarily be too great for more than two levels to be worthwhile, and we shall assume that a nested bootstrap has just two levels.

The adjusted estimate of the bias of T is

    B_adj = B - C.

Since typically bias is of order n^{-1}, the adjustment C is typically of order n^{-2}. The following example gives a simple illustration of the adjustment.

Example 3.18 (Sample variance) Suppose that T = n^{-1} \sum (Y_j - \bar{Y})^2 is used to estimate var(Y) = σ². Since E{\sum (Y_j - \bar{Y})^2} = (n - 1)σ², the bias of T is easily seen to be β = -n^{-1}σ², which the bootstrap estimates by B = -n^{-1}T. The bias of this bias estimate is E(B) - β = n^{-2}σ², which the bootstrap estimates by C = n^{-2}T. Therefore the adjusted bias estimate is

    B - C = -n^{-1}T - n^{-2}T.

That this is an improvement can be checked by showing that it has expectation β(1 - n^{-2}), whereas B has expectation β(1 - n^{-1}).

In most applications bootstrap calculations are approximated by simulation. So, as explained in Chapter 2, for most estimators T we would approximate the bias B by R^{-1} \sum_r t*_r - t, using the resampled values t*_1, ..., t*_R and the data value t of the estimator. Likewise the expectations involved in the bias adjustment C will usually be approximated by simulation. The calculation is as follows.

Algorithm 3.3 (Double bootstrap for bias adjustment)
For r = 1, ..., R:
1  generate the rth original bootstrap sample y*_1, ..., y*_n and from it t*_r, by sampling at random from y_1, ..., y_n (nonparametric case) or sampling parametrically from the fitted model (parametric case);
2  obtain M second-level bootstrap samples y**_1, ..., y**_n, either sampling with replacement from y*_1, ..., y*_n (nonparametric case) or sampling from the model fitted to y*_1, ..., y*_n (parametric case);
3  evaluate the estimator T for each of the M second-level samples to give t**_{r1}, ..., t**_{rM}.
Then approximate the bias adjustment C in (3.29) by

    C = \frac{1}{RM} \sum_{r=1}^{R} \sum_{m=1}^{M} t^{**}_{rm} - \frac{2}{R} \sum_{r=1}^{R} t^{*}_r + t.    (3.30)
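Here is a minimal nonparametric sketch of Algorithm 3.3 in Python, applied to the biased variance estimator of Example 3.18; all names are illustrative. As discussed below, M = 1 with a somewhat larger R is often adequate for the bias adjustment.

    import numpy as np

    rng = np.random.default_rng(4)

    def double_bootstrap_bias(y, tfun, R=500, M=1):
        """Nonparametric Algorithm 3.3: returns (B, C), the bootstrap bias
        estimate and its adjustment, approximating (3.30)."""
        n = len(y)
        t = tfun(y)
        tstar = np.empty(R)
        tdstar = np.empty((R, M))
        for r in range(R):
            ystar = rng.choice(y, size=n, replace=True)
            tstar[r] = tfun(ystar)
            for m in range(M):
                tdstar[r, m] = tfun(rng.choice(ystar, size=n, replace=True))
        B = tstar.mean() - t
        C = tdstar.mean() - 2 * tstar.mean() + t      # equation (3.30)
        return B, C

    # biased variance estimator, as in Example 3.18
    tfun = lambda y: np.mean((y - y.mean()) ** 2)
    y = rng.normal(size=20)
    B, C = double_bootstrap_bias(y, tfun)
    print(B, B - C)   # raw and adjusted bias estimates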


At first sight it would seem that to apply (3.30) successfully would involve a vast amount of computation. If a general rule is to use at least 100 samples when bootstrapping, this would imply a total of RM + R = 10100 simulated samples and evaluations of t. But this is unnecessary, because of theoretical and computational techniques that can be used, as explained in Chapter 9. For the case of the bias B discussed here, the simulation variance of B - C would be no greater than it was for B if we used M = 1 and increased R by a factor of about 5, so that a total of about 500 samples would seem reasonable; see Problem 3.19.

More complicated applications of the technique are discussed in Example 3.26 and in Chapters 4 and 5.
Theory
It may be intuitively clear that bootstrapping the bootstrap will reduce the order of bias in the original bootstrap calculation, at least in simple situations such as Example 3.18. However, in some situations the order of the reduction may not be clear. Here we outline a general calculation which provides the answer, so long as the quantity being estimated by the bootstrap can be expressed in terms of an estimating equation. For simplicity we focus on the single-sample case, but the calculations extend quite easily.

Suppose that the quantity β = b(F) being estimated by the bootstrap is defined by the estimating equation

    E{h(F, F̂; β) | F} = 0,    (3.31)

where h(G, F; β) is chosen to be of order one. The bootstrap solution is β̂ = b(F̂), which therefore solves

    E*{h(F̂, F̂*; β̂) | F̂} = 0.

In general β̂ has a bias of order n^{-a}, say, where typically a is 1 or 2. Therefore, for some e(F) that is of order one, we can write

    E{h(F, F̂; β̂) | F} = e(F)n^{-a}.    (3.32)

To correct for this bias we introduce the ideal perturbation γ = c_n(F), which modifies b(F̂) to b(F̂, γ) in order to achieve

    E[h{F, F̂; b(F̂, γ)} | F] = 0.    (3.33)

There is usually more than one way to define b(F̂, γ), but we shall assume that γ is defined to make b(F̂, 0) = b(F̂). The bootstrap estimate for γ is γ̂ = c_n(F̂), which is the solution to

    E*[h{F̂, F̂*; b(F̂*, γ)} | F̂] = 0,

and the adjusted value of β̂ is then β̂_adj = b(F̂, γ̂); it is b(F̂*, γ) that requires the second level of resampling.


What we want to see is the effect of substituting β̂_adj for β̂ in (3.32). First we approximate the solution to (3.33). Taylor expansion about γ = 0, together with (3.32), gives

    E[h{F, F̂; b(F̂, γ)} | F] = e(F)n^{-a} + d_n(F)γ,    (3.34)

where

    d_n(F) = \frac{\partial}{\partial γ} E[h{F, F̂; b(F̂, γ)} | F] \Big|_{γ=0}.

Typically d_n(F) → d(F) ≠ 0, so that if we write r(F) = e(F)/d(F) then (3.33) and (3.34) together imply that

    γ = c_n(F) ≐ -r(F)n^{-a}.

This, together with the corresponding approximation for γ̂ = c_n(F̂), gives

    γ̂ - γ ≐ -n^{-a}{r(F̂) - r(F)} = -n^{-a-1/2}X_n,

say. The quantity X_n = n^{1/2}{r(F̂) - r(F)} is O_p(1) because F̂ and F differ by O_p(n^{-1/2}). It follows that, because γ = O(n^{-a}),

    h{F, F̂; b(F̂, γ̂)} ≐ h{F, F̂; b(F̂, γ)} - n^{-a-1/2} X_n \frac{\partial}{\partial γ} h{F, F̂; b(F̂, γ)} \Big|_{γ=0}.    (3.35)

[Note: if the next term in expansion (3.34) were O(n^{-a-c}), then the right-hand side of (3.35) would strictly include a further term of order n^{-a-c-1/2}; in almost all cases this leads to the same conclusion.]

We can now assess the effect of the adjustment from β̂ to β̂_adj. Define the conditional quantity

    k_n(X_n) = \frac{\partial}{\partial γ} E[h{F, F̂; b(F̂, γ)} | X_n, F] \Big|_{γ=0},

which is O_p(1). Then taking expectations in (3.35) we deduce that, because of (3.33),

    E[h{F, F̂; b(F̂, γ̂)} | F] = -n^{-a-1/2} E{X_n k_n(X_n) | F}.    (3.36)

In most applications E{X_n k_n(X_n) | F} = O(n^{-b}) for b = 0 or 1/2, so comparing (3.36) with (3.32) we see that the adjustment does reduce the order of bias by at least 1/2.

Example 3.19 (Adjusted bias estimate) In the case of the bias β = E(T | F) - θ, we take h(F, F̂; β) = t(F̂) - t(F) - β and b(F̂, γ) = b(F̂) - γ. In regular problems the bias and its estimate are of order n^{-1}, and in (3.32) a = 2. It is easy to check that d(F) = 1, so that X_n = n^{1/2}{e(F̂) - e(F)} and

    k_n(X_n) = \frac{\partial}{\partial γ} E{t(F̂) - t(F) - (β̂ - γ) | e(F̂), F} \Big|_{γ=0} = 1.



This implies that

    E{X_n k_n(X_n) | F} = n^{1/2} E{e(F̂) - e(F) | F} = O(n^{-1/2}).

Equation (3.36) then becomes E{T - θ - (β̂ - γ̂)} = O(n^{-3}). This generalizes the conclusion of Example 3.18, that the adjusted bootstrap bias estimate β̂ - γ̂ is correct to second order.

Further applications of the double bootstrap to significance tests and confidence limits are described in Sections 4.5 and 5.6 respectively.

3.9.2 Variation of properties of T

A somewhat different application of bootstrapping the bootstrap concerns assessment of how the distribution of T depends on the parameters of F. Suppose, for example, that we want to know how the variance of T depends upon θ and other unknown model parameters, but that this variance cannot be calculated theoretically. One possible application is to the search for a variance-stabilizing transformation.

The parametric case does not require nested bootstrap calculations. However, it is useful to outline the approach in a form that can be mimicked in the nonparametric case. The basic idea is to approximate var(T | ψ) = v(ψ) from simulated samples for an appropriately broad range of parameter values. Thus we would select a set of parameter values ψ_1, ..., ψ_K, for each of which we would simulate R samples from the corresponding parametric model, and compute the corresponding R values of T. This would give t*_{k1}, ..., t*_{kR}, say, for the model with parameter value ψ_k. Then the variance v(ψ_k) = var(T | ψ_k) would be approximated by

    v̂(ψ_k) = R^{-1} \sum_{r=1}^{R} (t^{*}_{kr} - \bar{t}^{*}_k)^2,    (3.37)

where \bar{t}^{*}_k = R^{-1} \sum_{r=1}^{R} t^{*}_{kr}. Plots of v̂(ψ_k) against components of ψ_k can then be used to see how var(T) depends on ψ. Example 2.13 shows an application of this. The same simulation results can also be used to approximate other properties, such as the bias or quantiles of T, or the variance of transformed T.

As described here the number of simulated datasets will be RK, but in fact this number can be reduced considerably, as we shall show in Section 9.4.4. The simulation can be bypassed completely if we estimate v(ψ_k) by a delta-method variance approximation v_L(ψ_k), based on the variance of the influence function under the parametric model. However, this will often be impossible.
In the nonparametric case there appears to be a major obstacle to performing calculations analogous to (3.37), namely the unavailability of models corresponding to a series of parameter values ψ_1, ..., ψ_K. But this obstacle can


be overcome, at least partially. Suppose for simplicity th at we have a single


sam ple problem , so th a t the E D F F is the fitted m odel, and im agine th at we
have draw n R independent b o o tstrap sam ples from this model. These b o o t
strap sam ples can be represented by their E D F s F , which can be thought o f as
the analogues o f param etric m odels defined by R different values o f param eter
ip. Indeed the corresponding values o f 9 = t(F) are simply t(F*) = (*, and other
com ponents o f ip can be defined sim ilarly using the representation ip = p(F).
This gives us the same fram ew ork as in the p aram etric case above. F or ex
am ple consider variance estim ation. To approxim ate v a r(T ) under param eter
value tp* = p(F'), we sim ulate M sam ples from the corresponding m odel F *;
calculate the corresponding values o f T , which we denote by
, m = 1 ,..., M ;
and then calculate the analogue o f (3.37),
M
K = v(Wr) = M ~ l

fr*)2,

(3.38)

m=1
with t = M ~ l E m =i Cm- T he scatter plot o f v against t* will then be a proxy
for the ideal plot o f v a r(T | ip) against 6, an d sim ilarly for o ther plots.
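In code, the nested calculation of (3.38) might look as follows; this Python sketch (names illustrative) returns the pairs (t*_r, v*_r) whose scatter plot stands proxy for var(T | ψ) against θ.

    import numpy as np

    rng = np.random.default_rng(5)

    def variance_function(y, tfun, R=999, M=50):
        """For each first-level resample, estimate var(T) from M
        second-level resamples, giving pairs (t*_r, v*_r) as in (3.38)."""
        n = len(y)
        t_star = np.empty(R)
        v_star = np.empty(R)
        for r in range(R):
            ystar = rng.choice(y, size=n, replace=True)
            t_star[r] = tfun(ystar)
            tt = np.array([tfun(rng.choice(ystar, size=n, replace=True))
                           for _ in range(M)])
            v_star[r] = tt.var()     # M^{-1} sum (t** - tbar**)^2, as in (3.38)
        return t_star, v_star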
Example 3.20 (City population data) Figure 3.7 shows the results of the double bootstrap procedure outlined above, for the ratio estimator applied to the data in Table 2.1, with n = 10. The left panel shows the bias b* estimated using M = 50 second-level bootstrap samples from each of R = 999 first-level bootstrap samples. The right panel shows the corresponding standard errors v*^{1/2}_r. The lines from applying a locally weighted robust smoother confirm the clear increase with the ratio in each panel.

The implication of Figure 3.7 is that the bias and variance of the ratio are not stable with n = 10. Confidence intervals for the true ratio θ based on normal approximations to the distribution of T - θ will therefore be poor, as will basic bootstrap confidence intervals, and those based on related quantities such as the studentized bootstrap are suspect. A reasonable interpretation of the right panel is that var(T) ∝ θ², so that log T should be more stable.

The p articu lar application o f variance estim ation can be handled in a


sim pler way, a t least approxim ately. I f the n o nparam etric delta m ethod variance
approxim ation vL (Sections 2.7.2 an d 3.2.1) is fairly accurate, which is to say
if the linear ap proxim ation (2.35) or (3.1) is accurate, then v'r = v(tp') can be
estim ated by v l = vl ( f ;).
Example 3.21 (Transformed correlation) A n exam ple where simple b o otstrap
m ethods tend to perform badly w ithout the (explicit o r im plicit) use o f tran s
form ation is the correlation coefficient. F or a sam ple o f size n = 20 from a
bivariate norm al distribution, w ith sam ple correlation t = 0.74, the left panel


[Figure 3.7 Bias and standard error estimates for the ratio applied to the city population data, n = 10. For each of R = 999 bootstrap samples from the data, M = 50 second-level samples were drawn, and the resulting bias and standard error estimates b* and v*^{1/2} plotted against the bootstrapped ratio t*. The lines are from a robust nonparametric curve fit to the simulations.]

[Figure 3.8 Scatter plot of v*_L versus t* for nonparametric simulation from a bivariate normal sample of size n = 20 with R = 999. The left panel is for t the sample correlation, with dotted line showing the theoretical relationship. The right panel is for the transformed sample correlation.]

of Figure 3.8 contains a scatter plot of v*_L versus t* from R = 999 nonparametric simulations: the dotted line is the approximate normal-theory relationship var(T) = n^{-1}(1 - θ²)². The plot correctly shows strong instability of variance. The right panel shows the corresponding plot for bootstrapping the transformed estimate ½ log{(1 + t)/(1 - t)}, whose variance is approximately n^{-1}: here v*_L is computed as in Example 2.18. The plot correctly suggests quite stable variance.


As presented here the selection of parameter values ψ* is completely random, and R would need to be moderately large (at least 50) to get a reasonable spread of values of ψ*. The total number of samples, RM + R, will then be very large. It is, however, possible to improve upon the algorithm; see Section 9.4.4. Another important problem is the roughness of variance estimates, apparent in both of the preceding examples. This is due not just to the size of M, but also to the noise in the EDFs F̂*_r being used as models.
Frequency smoothing
One major difference between the parametric and nonparametric cases is that the parametric models vary smoothly with parameter values. A simple way to inject such smoothness into the nonparametric models F̂*_r is to smooth them. For simplicity we consider the one-sample case.

Let w(·) be a symmetric density with mean zero and unit variance, and consider the smoothed frequencies

    f_j(θ, ε) ∝ \sum_{r=1}^{R} f^{*}_{rj} w\left(\frac{t^{*}_r - θ}{ε}\right),    j = 1, ..., n.    (3.39)

Here ε > 0 is a smoothing parameter that determines the effective range of values of t* over which the frequencies are smoothed. As is common with kernel smoothing, the value of ε is more important than the choice of w(·), which we take to be the standard normal density. Numerical experimentation suggests that close to θ = t, values of ε in the range 0.2v^{1/2} to 1.0v^{1/2} are suitable, where v is an estimated variance for t. We choose the constant of proportionality in (3.39) to ensure that \sum_j f_j(θ, ε) = n. For a given ε, the relative frequencies n^{-1} f_j(θ, ε) determine a distribution F̂_θ, for which the parameter value is θ* = t(F̂_θ); in general θ* is not equal to θ, although it is usually very close.
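A minimal sketch of (3.39) in Python, assuming the frequency matrix f*_{rj} and the statistics t*_r have been saved from the simulation; the standard normal kernel is used, as in the text, and all names are illustrative.

    import numpy as np

    def smoothed_frequencies(freq, t_star, theta, eps):
        """Smoothed expected frequencies (3.39): f_j(theta, eps) proportional
        to sum_r f*_{rj} w{(t*_r - theta)/eps}, scaled so the f_j sum to n.
        freq is the R x n matrix of resampling frequencies f*_{rj}."""
        u = (t_star - theta) / eps
        w = np.exp(-0.5 * u ** 2)    # standard normal kernel; constants cancel
        f = w @ freq                  # length-n vector of smoothed frequencies
        n = freq.shape[1]
        return n * f / f.sum()

Dividing the result by n gives the probabilities of the multinomial distribution F̂_θ from which samples can then be drawn.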
Example 3.22 (City population data) In continuation of Example 3.20, the top panels of Figure 3.9 show the frequencies f*_{rj} for four samples with values of t* very close to 1.6. The variation in the f*_{rj} leads to the variability in both b* and v* that shows so clearly in Figure 3.7.

The lower panels show the smoothed frequencies (3.39) for distributions F̂_θ with θ = 1.2, 1.52, 1.6, 1.9 and ε = 0.2v^{1/2}. The corresponding values of the ratio are θ* = 1.23, 1.51, 1.59, and 1.89. The observations with the smallest empirical influence values are more heavily weighted when θ is less than the original value of the statistic, t = 1.52, and conversely. The third panel, for θ = 1.6, results from averaging frequencies including those shown in the upper panels, and the distribution is much smoother than those. The results are not very sensitive to the value of ε, although the tilting of the frequencies is less marked for larger ε.
The smoothed frequencies can be used to assess how the bias and variance



[Figure 3.9 Frequencies for city population data. The upper panels show frequencies f*_j for four samples with values of t* close to 1.6, plotted against empirical influence values l_j for the ratio. The lower panels show smoothed frequencies f_j(θ, ε) for distributions F̂_θ with θ = 1.2, 1.52, 1.6, 1.9 and ε = 0.2v^{1/2}.]

of T depend on θ. For each of a range of values of θ, we generate samples from the multinomial distribution F̂_θ with expected frequencies (3.39), and calculate the corresponding values of t*, t*_r(θ) say. We then estimate the bias for sampling from F̂_θ by \bar{t}*(θ) - θ*, where \bar{t}*(θ) is the average of the t*_r(θ). The variance is estimated similarly.

The top panel of Figure 3.10 shows values of t*(θ) plotted against jittered values of θ for 100 samples generated from F̂_θ at θ = 1.2, ..., 1.9; we took ε = 0.2v^{1/2}. The lower panels show that the corresponding biases and standard deviations, which are connected by the rougher solid lines, compare well with the double bootstrap results. The amount of computation is much less, however. The smoothed estimates are based on 1000 samples to estimate the F̂_θ, and then 100 samples at each of the eight chosen values of θ, whereas the double bootstrap required about 25 000 samples.

Other applications of (3.39) are described in Chapters 9 and 10.


Variance stabilization
Experience suggests that bootstrap methods for confidence limits and significance tests based on estimators T are most effective when θ is essentially a location parameter, which is approximately induced by a variance-stabilizing transformation. Ideally such a transformation would be derived theoretically from (2.14) with variance function v(θ) = var(T | F).

In a nonparametric setting a suitable transformation may sometimes be suggested by analogy with a parametric problem, as in Example 3.21. If not, a transformation can be obtained empirically using the double bootstrap estimates of variance discussed earlier in the section. Suppose that we have bootstrap samples F̂*_r = (y*_{r1}, ..., y*_{rn}) and the corresponding statistics t*_r, for


[Figure 3.10 Use of smoothed nonparametric distributions to estimate bias and standard deviation functions for the ratio of the city population data. The top panel shows 100 bootstrapped ratios calculated from samples generated from F̂_θ, for each of θ = 1.2, ..., 1.9; for clarity the θ values are jittered. The lower panels show 200 of the points from Figure 3.7 and the estimated bias and standard deviation functions from that figure (smooth curves), with the biases and standard deviations estimated from the top panel (rougher curves).]

r = 1, ..., R. Without loss of generality, suppose that t*_1 < ··· < t*_R. One way to implement empirical variance-stabilization is to choose R_1 of the t*_r that are roughly evenly spaced and that include t*_1 and t*_R. For each of the corresponding F̂*_r we then generate M bootstrap values t**, from which we estimate the variance of t* to be v*_r as defined in (3.38). We now smooth a plot of the v*_r against the t*_r, giving an estimate v̂(θ) of the variance var(T | F) as a function of the parameter θ = t(F), and integrate numerically to obtain the estimated variance-stabilizing transformation

    h(t) = \int^{t} \frac{dθ}{\{v̂(θ)\}^{1/2}}.    (3.40)


In general, but especially for small R_1, it will be better to fit a smooth curve to values of log v*_r, in part to avoid negative estimates v̂(θ). Provided that a suitable smoothing method is used, inclusion of t*_1 and t*_R in the set for which the v*_r are estimated implies that all the transformed values h(t*_r) can be calculated. The transformed estimator h(T) should have approximately unit variance.

Any of the common smoothers can be used to obtain v̂(θ), and simple integration algorithms can be used for the integral (3.40). If the nested bootstrap is used only to obtain the variances of R_1 of the t*_r, the total number of bootstrap samples required is R + MR_1. Values of R_1 and M in the ranges 50-100 and 25-50 will usually be adequate, so if R = 1000 the overall number of bootstrap samples required will be 2250-6000. If variance estimates for all the t*_r are available, for example nonparametric delta method estimates, then the delta method shows that approximate standard errors for the h(t*_r) will be v*_{Lr}^{1/2}/{v̂(t*_r)}^{1/2}; a plot of these against t*_r will provide a check on the adequacy of the transformation.

The same procedure can be applied with second-level resampling done from smoothed frequencies, as in Example 3.22.
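The following Python sketch carries out the smoothing and numerical integration in (3.40); a simple quadratic fit on the log scale stands in for whichever smoother is preferred, and all names are illustrative.

    import numpy as np

    def variance_stabilizer(t_sub, v_sub, grid=200):
        """Empirical variance-stabilizing transformation (3.40) from the R_1
        selected t*_r and their nested-bootstrap variances v*_r of (3.38).
        Returns a function h with h(t) = integral of v(u)^{-1/2} du."""
        # crude smoother on the log scale (avoids negative variance estimates);
        # a spline or loess-type smoother would usually be preferable
        coef = np.polyfit(t_sub, np.log(v_sub), deg=2)
        u = np.linspace(t_sub.min(), t_sub.max(), grid)
        vhat = np.exp(np.polyval(coef, u))
        # cumulative trapezoidal integral of v(u)^{-1/2}
        g = 1 / np.sqrt(vhat)
        H = np.concatenate([[0.0],
                            np.cumsum(0.5 * (g[1:] + g[:-1]) * np.diff(u))])
        return lambda t: np.interp(t, u, H)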
Example 3.23 (City population data) For the city population data of Example 2.8 the parameter of interest is the ratio θ, which is estimated by t = x̄/ū. Figure 3.7 shows that the variance of T depends strongly on θ. We used the procedure outlined above to estimate a transformation based on R = 999 bootstrap samples, with R_1 = 50 and M = 25. The transformation is shown in the left panel of Figure 3.11; the right panel shows the standard errors v*_{Lr}^{1/2}/{v̂(t*_r)}^{1/2} of the h(t*_r). The transformation has been largely successful in stabilizing the variance.

In this case the variances v*_{Lr} based on the linear approximation are readily calculated, and the transformation could have been estimated from them rather than from the nested bootstrap.

3.10 Bootstrap Diagnostics


3.10.1 Jackknife-after-bootstrap
Sensitivity analysis is important in understanding the implications of a statistical calculation. A conclusion that depended heavily on just a few observations would usually be regarded as more tentative than one supported by all the data. When a parametric model is fitted, difficulties can be detected by a wide range of diagnostics, careful scrutiny of which is part of a parametric bootstrap analysis, as of any parametric modelling. But if a nonparametric bootstrap is used, the EDF F̂ is in effect the model, and there is no baseline against which


to compare outliers, for example. In this situation we must focus on the effect of individual observations on bootstrap calculations, to answer questions such as "would the confidence interval differ greatly if this point were removed?", or "what happens to the significance level when this observation is deleted?"

Nonparametric case
Once a nonparametric resampling calculation has been performed, a basic question is how it would have been different if an observation, y_j say, had been absent from the original data. For example, it might be wise to check whether or not a suspicious case has affected the quantiles used in a confidence interval calculation. The obvious way to assess this is to do a further simulation from the remaining observations, but this can be avoided. This is because a resample in which y_j does not appear can be thought of as a random sample from the data with y_j excluded. Expressed formally, if J* is sampled uniformly from {1, ..., n}, then the conditional distribution of J* given that J* ≠ j is the same as the distribution of I*, where I* is sampled uniformly from {1, ..., j-1, j+1, ..., n}. The probability that y_j is not included in a bootstrap sample is (1 - n^{-1})^n ≈ e^{-1}, so the number of simulations R_{-j} that do not include y_j is roughly equal to Re^{-1} = 0.368R.

So we can measure the effect of y_j on the calculations by comparing the full simulation with the subset of t*_1, ..., t*_R obtained from bootstrap samples where y_j does not occur. In terms of the frequencies f*_{rj}, which count the number of times y_j appears in the rth simulation, we simply restrict attention to replicates with f*_{rj} = 0. For example, the effect of y_j on the bias estimate B can be
[Figure 3.11 Variance-stabilization for the city population ratio. The left panel shows the empirical transformation h(·), and the right panel shows the standard errors v*_{Lr}^{1/2}/{v̂(t*)}^{1/2} of the h(t*), with a smooth curve.]



Table 3.10 Measurements on the head breadth and length of the first two adult sons in 25 families (Frets, 1921).

          First son      Second son               First son      Second son
          Len    Brea    Len    Brea              Len    Brea    Len    Brea
    1     191    155     179    145         14    190    159     195    157
    2     195    149     201    152         15    188    151     187    158
    3     181    148     185    149         16    163    137     161    130
    4     183    153     188    149         17    195    155     183    158
    5     176    144     171    142         18    186    153     173    148
    6     208    157     192    152         19    181    145     182    146
    7     189    150     190    149         20    175    140     165    137
    8     197    159     189    152         21    192    154     185    152
    9     188    152     197    159         22    174    143     178    147
   10     192    150     187    151         23    176    139     176    143
   11     179    158     186    148         24    197    167     200    158
   12     183    147     174    147         25    190    163     187    150
   13     174    150     185    152

measured by the scaled difference

    n(B_{-j} - B) = n\left\{ R_{-j}^{-1} \sum_{r: f^{*}_{rj} = 0} (t^{*}_r - t_{-j}) - R^{-1} \sum_{r=1}^{R} (t^{*}_r - t) \right\},    (3.41)

where B_{-j} is the bias estimate from the resamples in which y_j does not appear, and t_{-j} is the value of t when y_j is excluded from the original data. Such calculations are applications of the jackknife method described in Section 2.7.3, so the technique applied to bootstrap results is called the jackknife-after-bootstrap. The scaling factor n in (3.41) is not essential.
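A minimal Python sketch of (3.41), assuming the frequencies f*_{rj} were recorded during the original simulation; names are illustrative.

    import numpy as np

    def jackknife_after_bootstrap(y, t, t_star, freq, tfun):
        """Scaled jackknife-after-bootstrap bias differences (3.41).
        freq is the R x n matrix of resampling frequencies f*_{rj};
        roughly 0.368R rows of freq omit each observation."""
        n = len(y)
        B = (t_star - t).mean()
        out = np.empty(n)
        for j in range(n):
            keep = freq[:, j] == 0                 # resamples omitting y_j
            t_del = tfun(np.delete(y, j))          # t_{-j}
            B_del = (t_star[keep] - t_del).mean()  # B_{-j}
            out[j] = n * (B_del - B)
        return out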
A useful diagnostic is the plot of jackknife-after-bootstrap measures such as (3.41) against empirical influence values, possibly standardized. For this purpose any of the approximations to empirical influence values described in Section 2.7 can be used. The next example illustrates a related plot that shows how the distribution of t* - t changes when each observation is excluded.
Example 3.24 (Frets' heads) Table 3.10 contains data on the head breadth and length of the first two adult sons in 25 families.

The correlations among the log measurements are given below the diagonal in Table 3.11. The values above the diagonal are the partial correlations. For example, the value 0.13 in the second row is the correlation between the log head breadth of the first son, b_1, and the log head length of the second son, l_2, after allowing for the other variables. In effect, this is the correlation between the residuals from separate regressions of b_1 and l_2 on the other two variables. The correlations are all large, but four of the partial correlations are small, which suggests the simple interpretation that each of the four pairs of measurements for first and second sons is independent conditionally on the values of the other two measurements.


Table 3.11 Correlations (below diagonal) and partial correlations (above diagonal) for log measurements on the head breadth and length of the first two adult sons in 25 families.

                                First son           Second son
                              Length  Breadth     Length  Breadth
    First son    Length         ·      0.43        0.21    0.17
                 Breadth       0.75     ·          0.13    0.22
    Second son   Length        0.72    0.70         ·      0.64
                 Breadth       0.72    0.72        0.85     ·
We focus on the partial correlation t = 0.13 between log b_1 and log l_2. The top panel of Figure 3.12 shows a jackknife-after-bootstrap plot for t, based on 999 bootstrap samples. The points at the left-hand end show the empirical 0.05, 0.1, 0.16, 0.5, 0.84, 0.9, and 0.95 quantiles of the values of t*_r - \bar{t}*_{-2} for the 368 bootstrap samples in which case 2 was not selected; \bar{t}*_{-2} is the average of t*_r for those samples. The dotted lines are the corresponding quantiles for all 999 values of t*_r - t. The distribution is clearly much more peaked when case 2 is left out. The panel also contains the corresponding quantiles when other cases are excluded. The horizontal axis shows the empirical influence values for t: clearly putting more weight on case 2 sharply decreases the value of t.

The lower left panel of the figure shows that case 2 lies somewhat away from the rest, and the plot of residuals for the regressions of log b_1 and log l_2 on (log b_2, log l_1) in the lower right panel accounts for the jackknife-after-bootstrap results. Case 2 seems outlying relative to the others: deleting it will clearly increase t substantially. The overall average and standard deviation of the t*_r are 0.14 and 0.23, changing to 0.34 and 0.17 when case 2 is excluded. The evidence against zero partial correlation depends heavily on case 2.

Another version of the diagnostic plot uses case-deletion averages of the t*_r, i.e. \bar{t}*_{-j} = R_{-j}^{-1} \sum_{r: f^{*}_{rj} = 0} t^{*}_r, instead of the empirical influence values. This more clearly reveals how the quantity of interest varies with parameter values.
Parametric case
In the parametric case different calculations are needed, because random samples from a case-deletion model are not simply an unweighted subset of the original bootstrap samples. Nevertheless, those original bootstrap samples can still be used if we make use of the following identity relating expectations under two different parameter values:

    E{h(Y) | ψ'} = E\left\{ h(Y) \frac{f(Y | ψ')}{f(Y | ψ)} \,\Big|\, ψ \right\}.    (3.42)

Suppose that the full-data estimate (e.g. maximum likelihood estimate) of the model parameter is ψ̂, and that when case j is deleted the corresponding estimate is ψ̂_{-j}. The idea is to use (3.42) with ψ̂ and ψ̂_{-j} in place of ψ and ψ',


[Figure 3.12 Jackknife-after-bootstrap analysis for the partial correlation between log b_1 and log l_2 for Frets' heads data. The top panel shows 0.05, 0.1, 0.16, 0.5, 0.84, 0.9 and 0.95 empirical quantiles of t*_r - \bar{t}*_{-j} when each of the cases is dropped from the bootstrap calculation in turn, plotted against infinitesimal jackknife values. The lower panels show scatter plots of the raw values of log b_1 and log l_2, and of their residuals when regressed on the other two variables.]

respectively. F or example,

Therefore the param etric analogue o f (3.41) is


/d

di _

f l W .*

} ~ "\

. \ f ( y * I V-y)

j) f ( y ;

Iv)

1 V~V**

(r

}J

w here the sam ples y* are draw n from the full-data fitted model, th at is with
p aram eter value ip. Sim ilar w eighted calculations apply to o ther features o f the


distribution of T* - t; see Problem 3.20. Other applications of the importance reweighting identity (3.42) will be discussed in Chapter 9.

3.10.2 Linearity
Statistical analysis is simplified when the statistic of interest T is close to linear. In this case the variance approximation v_L will be an accurate estimate of the bootstrap variance var(T | F̂), and saddlepoint methods (Section 9.5) can be applied to obtain accurate estimates of the distribution of t*, without recourse to simulation. A linear statistic is not necessarily close to normally distributed, as Example 2.3 illustrates. Nor does linearity guarantee that T is directly related to a pivot and therefore useful in finding confidence intervals. On the other hand, experience from other areas in statistics suggests that these three properties will often occur together.

This suggests that we aim to find a transformation h(·) such that h(T) is well described by the linear approximation that corresponds to (2.35) or (3.1). For simplicity we focus on the single-sample case here. The shape of h(·) would be revealed by a plot of h(t) against t, but of course this is not available because h(·) is unknown. However, using Taylor approximation and (2.44) we do have

    h(t^*) ≐ h(t^{*}_L) = h(t) + \dot{h}(t) n^{-1} \sum_{j=1}^{n} f^{*}_j l_j = h(t) + \dot{h}(t)(t^{*}_L - t),

where \dot{h}(t) = dh(t)/dt, which shows that t*_L = c + d h(t*) with appropriate definitions of constants c and d. Therefore a plot of the values of t*_L = t + n^{-1} \sum_j f*_j l_j against the t* will look roughly like h(·), apart from a location and scale shift. We can now estimate h(·) from this plot, either by fitting a particular parametric form, or by nonparametric curve estimation.
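Computing the t*_L is straightforward once the empirical influence values l_j and the frequencies f*_{rj} are available; a small Python sketch, with illustrative names:

    import numpy as np

    def t_linear(t, l, freq):
        """Linear approximations t*_L = t + n^{-1} sum_j f*_{rj} l_j, one per
        resample, from the empirical influence values l_j and the R x n
        frequency matrix freq."""
        n = freq.shape[1]
        return t + (freq @ l) / n

    # plotting t_linear(t, l, freq) against the replicates t* then sketches
    # the shape of h(.), up to location and scale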
Example 3.25 (City population data) The top left panel of Figure 3.13 shows t*_L plotted against t* for 499 bootstrap replicates of the ratio t = x̄/ū for the data in Table 2.1. The plot is highly nonlinear, and the logarithmic transformation, or one even more extreme, seems appropriate. Note that the plot has shape similar to that for the empirical variance-stabilizing transformation in Figure 3.11.

For a parametric transformation, we try a Box-Cox transformation, h(t) = (t^λ - 1)/λ, with the value of λ estimated by maximizing the log likelihood for the regression of the h(t*_r) on the t*_{Lr}. This strongly suggests that we use λ = -2, for which the fitted curve is shown as the solid line on the plot. This is close to the result for a smoothing spline, shown as the dotted line. The top right panel shows the linear approximation for h(t), i.e. h(t) + \dot{h}(t)n^{-1} \sum_{j=1}^{n} f*_j l_j, plotted against h(t*). This plot is close to the line with unit gradient, and confirms the results of the analysis of transformations.



[Figure 3.13 Linearity transformation for the ratio applied to the city population data. The top left panel shows linear approximations t*_L plotted against bootstrap replicates t*, with the estimated parametric transformation (solid) and a transformation estimated by a smoothing spline (dots). The top right panel shows the same plot on the transformed scale. The lower left panel shows the plot for the studentized bootstrap statistic. The lower right panel shows a normal Q-Q plot of the studentized bootstrap statistic for the transformed values h(t*).]

The lower panels show related plots for the studentized bootstrap statistics on the original scale and on the new scale,

    z^* = \frac{t^* - t}{v_L^{*1/2}},    z^{*}_h = \frac{h(t^*) - h(t)}{\dot{h}(t) v_L^{*1/2}},

where v*_L = n^{-2} \sum_j f*_j l_j^2. The left panel shows that, like t*, z* is far from linear. The lower right panel shows that the distribution of z*_h is fairly close to standard normal, though there are some outlying values. The distribution of z* is far from normal, as shown by the right panel of Figure 2.5. It seems that, here, the transformation that gives approximate linearity of t* also


makes the corresponding studentized bootstrap statistic roughly normal. The transformation based on the smoothing spline would give similar results.

3.11 Choice of Estimator from the Data


In some applications we may want to choose an estimator or other procedure after looking at the data, especially if there is considerable prior uncertainty about the nature of random variation or of the form of relationship among variables. The simplest example with homogeneous data involves the choice of estimator for a population mean μ, when empirical evidence suggests that the underlying distribution F has long, non-normal tails.

Suppose that T(1), ..., T(K) can all be considered potentially suitable estimators for μ, and for the moment assume that all are unbiased, which means that the underlying data distribution is symmetric. Then one natural criterion for choice among these estimators is variance or, since their exact variances will be unknown, estimated variance. So if the estimated variance of T(i) is V(i), a natural procedure is to select as estimate for a given dataset that t(i) whose estimated variance is smallest. This defines the adaptive estimator T by

    T = T(i)    if    V(i) = \min_{1 \le k \le K} V(k).
For most simple estimators we can use the nonparametric delta method variance estimates. But in general, and for more complicated problems, we use the bootstrap to implement this procedure. Thus we generate R bootstrap samples, compute the estimates t*(1), ..., t*(K) for each sample, and then choose t to be that t(i) for which the bootstrap estimate of variance

    v(i) = (R - 1)^{-1} \sum_{r=1}^{R} \{t^{*}_r(i) - \bar{t}^{*}(i)\}^2

is smallest; here \bar{t}*(i) = R^{-1} \sum_r t*_r(i).

How we generate the bootstrap samples is important here. Having assumed symmetry of the data distribution, the resampling distribution should be symmetric so that the t*(i) are unbiased for μ. Otherwise selection based on variance alone is questionable. Further discussion of this is postponed to Example 3.26.
So far the procedure is straightforward. But now suppose that we want to estimate the variance of T, or quantiles of T - μ. For the variance, the minimum estimate v(i) used to select t = t(i) will tend to be too low: if I is the random index corresponding to the selected estimator, then E{V(I)} < var{T(I)} = var(T). Similarly the resampling distribution of T* = T*(I*) will be artificially concentrated relative to that of T, so that empirical quantiles of the t*(i) values will tend to be too close to t. Whether or not this selection bias


is serious depends on the context. However, the bias can be adjusted for by bootstrapping the whole procedure, as follows.

Let y*_1, ..., y*_n be one of the R simulated samples. Suppose that we apply the procedure for choosing among T(1), ..., T(K) to this bootstrap sample. That is, we generate M samples with equal probability from y*_1, ..., y*_n, and calculate the estimates t**_m(1), ..., t**_m(K) for the mth such sample. Then choose the estimator with the smallest estimated variance

    v^{*}(i) = (M - 1)^{-1} \sum_{m=1}^{M} \{t^{**}_m(i) - \bar{t}^{**}(i)\}^2,

where \bar{t}**(i) = M^{-1} \sum_m t**_m(i). That is,

    t^* = t^*(i)    if    v^*(i) = \min_{1 \le k \le K} v^*(k).

Doing this for each of the R samples y*_1, ..., y*_n gives t*_1, ..., t*_R, and the empirical distribution of the t*_r - t values approximates the distribution of T - μ. For example, v = (R - 1)^{-1} \sum (t*_r - \bar{t}*)^2 estimates the variance of T, and by accounting for the selection bias should be more accurate than v(i).

There are two byproducts of this double bootstrap procedure. One is information on how well-determined the choice of estimator is, if this is of interest, simply by examining the relative frequency with which each estimator is chosen. Secondly, the bias of v(i) can be approximated: on the log scale the bias is estimated by R^{-1} \sum_r \log v*_r - \log v, where v*_r is the smallest value of the v*(i)s in the rth bootstrap sample.
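A Python sketch of the whole procedure for trimmed averages follows; for brevity it resamples from the EDF rather than from a symmetrized version, and all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(11)

    def trimmed_mean(y, k):
        ys = np.sort(y)
        return ys[k:len(y) - k].mean()

    def choose_estimator(y, ks, R=200):
        """Pick the trimming level with the smallest bootstrap variance."""
        n = len(y)
        tstar = np.empty((R, len(ks)))
        for r in range(R):
            ystar = rng.choice(y, size=n, replace=True)
            tstar[r] = [trimmed_mean(ystar, k) for k in ks]
        v = tstar.var(axis=0, ddof=1)
        i = int(v.argmin())
        return ks[i], v[i]

    def double_bootstrap_adaptive(y, ks, R=200, M=100):
        """Distribution of the adaptive estimator T, allowing for selection."""
        n = len(y)
        t_adapt = np.empty(R)
        for r in range(R):
            ystar = rng.choice(y, size=n, replace=True)
            k, _ = choose_estimator(ystar, ks, R=M)  # selection redone on y*
            t_adapt[r] = trimmed_mean(ystar, k)
        return t_adapt

    y = rng.standard_t(df=3, size=81)
    ks = [0, 4, 8, 12, 16, 20]
    k_sel, v_sel = choose_estimator(y, ks)
    t_adapt = double_bootstrap_adaptive(y, ks)
    print(k_sel, v_sel, t_adapt.var(ddof=1))  # naive and selection-adjusted variance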
Example 3.26 (Gravity data) Suppose that the data in Table 3.1 were only available as a combined sample of n = 81 measurements. The different dispersions of the ingredient series make the combined sample very non-normal, so that the simple average is a poor estimator of the underlying mean μ. One possible approach is to consider trimmed average estimates

    T(k) = \frac{1}{n - 2k} \sum_{j=k+1}^{n-k} Y_{(j)},

which are averages after dropping the k smallest and k largest order statistics Y_{(1)} ≤ ··· ≤ Y_{(n)}. The usual average and sample median correspond respectively to k = 0 and k = ½(n - 1). The left panel of Figure 3.14 plots the trimmed averages against k. The mild downward trend in the plot suggests slight asymmetry of the data distribution. Our aim is to use the bootstrap to choose among the trimmed averages.

The trimmed averages will all be unbiased if the underlying data distribution is symmetric, and estimator variance will then be a sensible criterion on which to base choice. The bootstrap procedure must build in the assumed symmetry,
and this can be done (cf. Example 3.4) by simulating samples from a symmetrized version of F̂ such as

    F̂_sym(y) = ½\{F̂(y) + 1 - F̂(2μ̂ - y - 0)\},

which is simply the EDF of y_1, ..., y_n, 2μ̂ - y_1, ..., 2μ̂ - y_n, with μ̂ an estimate of μ which for this purpose we take to be the sample median. The centre panel of Figure 3.14 shows bootstrap estimates of variance for eleven trimmed averages based on R = 1000 samples drawn from F̂_sym. We conclude from this that k = 36 is best, but that there is little to choose among trimmed averages with k = 24, ..., 40. A similar conclusion emerges if we sample from F̂, although the bootstrap variances are noticeably higher for k > 24.

If symmetry of the underlying distribution were in doubt, then we should take the biases of the estimators into account. One natural criterion then would be mean squared error. In this case our bootstrap samples would be drawn from F̂, and we would select among the trimmed averages on the basis of bootstrap mean squared error

    mse(i) = R^{-1} \sum_{r=1}^{R} \{t^{*}_r(i) - \bar{y}\}^2.

Note that mean squared error is measured relative to the mean ȳ of the bootstrap population. The right panel of Figure 3.14 shows the bootstrap mean squared errors for our trimmed averages, and we see that the estimated biases do have an effect: now a value of k nearer 20 would appear to be best. Under the symmetric bootstrap, when the mean of F̂_sym is the sample median because we symmetrized about this point, bootstrap mean squared error equals bootstrap variance.

To focus the rest of the discussion, we shall assume symmetry and therefore choose t to be the trimmed average with k = 36. The value of t is 78.33, and the minimum bootstrap variance based on 1000 simulations is 0.321.

We now use the double bootstrap procedure to estimate the variance for t, and to determine appropriate quantiles for t. First we generate R = 1000

[Figure 3.14 Trimmed averages and their estimated variances and mean squared errors for the pooled gravity data, based on R = 1000 bootstrap samples, using the ordinary bootstrap and the symmetric bootstrap (o).]


samples y*_1, ..., y*_81 from F̂_sym. To each of these samples we then apply the original symmetric bootstrap procedure, generating M = 100 samples of size n = 81 from the symmetrized EDF of y*_1, ..., y*_81, choosing t* to be that one of the 11 trimmed averages with smallest value of v*(i). The variance v of t*_1, ..., t*_R equals 0.356, which is 10% larger than the original minimum variance. If we use this variance with a normal approximation to calculate a 95% confidence interval centred on t, the interval is [77.16, 79.50]. This is very similar to the intervals obtained in Example 3.2.

The frequencies with which the different trimming proportions are chosen are:

    k            12   16   20   24   28   32   36   40
    Frequency     1   25   54   96  109  131  498   86

Thus when symmetry of the underlying distribution is assumed, a fairly heavy degree of trimming seems desirable for these data, and the value k = 36 actually chosen seems reasonably well-determined.

The general features of this discussion are as follows. We have a set of estimators T(a) = t(a, F̂) for a ∈ A, and for each estimator we have an estimated value C(a, F̂) for a criterion C(a, F) = E{c(T(a), θ) | F}, such as variance or mean squared error. The adaptive estimator is T = t(â, F̂), where â = a(F̂) minimizes C(a, F̂) with respect to a. We want to know about the distribution of T, including for example its bias and variance. The distribution of T - θ = t(â, F̂) - t(F) under sampling from F will be approximated by evaluating it under sampling from F̂. That is, it will be approximated by the distribution of

    T^* - t = t(F̂^*) - t(F̂) = t(â^*, F̂^*) - t(â, F̂)

under sampling from F̂. Here F̂* is the analogue of F̂ based on y*_1, ..., y*_n: if F̂ is the EDF of the data, then F̂* is the EDF of y*_1, ..., y*_n sampled from F̂. Whether or not the allowance for selection bias is numerically important will depend upon the density of a values and the variability of C(a, F̂).

3.12 Bibliographic Notes


The extension of bootstrap methods to several unrelated samples has been used by several authors, including Hayes, Perl and Efron (1989) for a special contrast-estimation problem in particle physics; the application is discussed also in Efron (1992) and in Practical 3.4.

A general theoretical account of estimation in semiparametric models is given in the book by Bickel et al. (1993). The majority of applications of semiparametric models are in regression; see the references for Chapters 6 and 7.


Efron (1979, 1982) suggested and studied empirically the use of smooth versions of the EDF, but the first systematic investigation of smoothed bootstraps was by Silverman and Young (1987). They studied the circumstances in which smoothing is beneficial for statistics for which there is a linear approximation. Hall, DiCiccio and Romano (1989) show that when the quantity of interest depends on a local property of the underlying CDF, as do quantiles, smoothing can give worthwhile theoretical reductions in the size of the mean squared error. Similar ideas apply to more complex situations such as L_1 regression (De Angelis, Hall and Young, 1993); see however the discussion in Section 6.5. De Angelis and Young (1992) give a useful review of bootstrap smoothing, and discuss the empirical choice of how much smoothing to apply. See also Wang (1995). Romano (1988) describes a problem, estimation of the mode of a density, where the estimator is undefined unless the EDF is smoothed; see also Silverman (1981). In a spatial data problem, Kendall and Kendall (1980) used a form of bootstrap that jitters the observed data, in order to keep the rough configuration of points constant over the simulations; this amounts to sampling without replacement when applying the smoothed bootstrap. Young (1990) concludes that although this approach can outperform the unsmoothed bootstrap, it does not perform so well as the smoothed bootstrap described in Section 3.4.

General discussions of survival data can be found in the books by Cox and Oakes (1984) and Kalbfleisch and Prentice (1980), while Fleming and Harrington (1991) and Andersen et al. (1993) give more mathematical accounts. The product-limit estimator was derived by Kaplan and Meier (1958): it and variants are widely used in practice.

Efron (1981a) proposed the first bootstrap methods for survival data, and discussed the relation between traditional and bootstrap standard errors for the product-limit estimator. Akritas (1986) compared variance estimates for the median survival time from Efron's sampling scheme and a different approach of Reid (1981), and concluded that Efron's scheme is superior. The conditional method outlined in Section 3.5 was suggested by Hjort (1985), and subsequently studied by Kim (1990), who concluded that it estimates the conditional variance of the product-limit estimator somewhat better than does resampling cases. Doss and Gill (1992) and Burr and Doss (1993) give weak convergence results leading to confidence bands for quantiles of the survival time distribution. The asymptotic behaviour of parametric and nonparametric bootstrap schemes for censored data is described by Hjort (1992), while Andersen et al. (1993) discuss theoretical aspects of the weird bootstrap.

The general approach to missing-data problems via the EM algorithm is discussed by Dempster, Laird and Rubin (1977). Bayesian methods using multiple imputation and data augmentation are described by Tanner and Wong (1987)


and Tanner (1996). A detailed treatment of multiple imputation techniques for missing-data problems, with special emphasis on survey data, is given by Rubin (1987). The principal reference for resampling in missing-data problems is Efron (1994), together with the useful, cautionary discussion by D. B. Rubin. The account in Section 3.6 puts more emphasis on careful choice of estimators.
Cochran (1977) is a standard reference on finite population sampling. Variance estimation by balanced subsampling methods was discussed in this context as early as McCarthy (1969), but the first attempt to apply the bootstrap directly was by Gross (1980), who describes what we have termed the population bootstrap, but restricted to cases where N/n is an integer. This approach was subsequently developed by Bickel and Freedman (1984), while Chao and Lo (1994) also make a case for this approach. Booth, Butler and Hall (1994) describe the construction of studentized bootstrap confidence limits in this context. Presnell and Booth (1994) give a critical discussion of earlier literature and describe the superpopulation bootstrap. The use of modified sample sizes was proposed by McCarthy and Snowden (1985) and the mirror-match method by Sitter (1992). A different approach based on rescaling was introduced by Rao and Wu (1988). A comprehensive theoretical discussion of the jackknife and bootstrap in sample surveys is given in Chapter 6 of Shao and Tu (1995), with later developments described by Presnell and Booth (1994) and Booth, Butler and Hall (1994), on which the account in Section 3.7 is largely based.
Little has been written about resampling hierarchical data, although two relevant references are given in the bibliographic notes for Chapter 7. Related methods for bootstrapping empirical Bayes estimates in hierarchical Bayes models are described by Laird and Louis (1987). Nonparametric estimation of the CDF for a random effect is discussed by Laird (1978).
Bootstrapping the bootstrap is described by Chapman and Hinkley (1986), and was applied to estimation of variance-stabilizing transformations by Tibshirani (1988). Theoretical aspects of adjustment of bootstrap calculations were developed by Hall and Martin (1988). See also the bibliographic notes for Chapters 4 and 5. Milan and Whittaker (1995) give a parametric bootstrap analysis of the data in Table 3.10, and discuss the difficulties that can arise when resampling in problems with a singular value decomposition.
Efron (1992) introduced the jackknife-after-bootstrap, and described a variety of ingenious uses for related calculations. Different graphical diagnostics for bootstrap reliability are developed in an asymptotic framework by Beran (1997). The linearity plot of Section 3.10.2 is due to Cook and Weisberg (1994).
Theoretical aspects of the empirical choice of estimator are discussed by Léger and Romano (1990a,b) and Léger, Politis and Romano (1992). Efron (1992) gives an example of choice of level of trimming of a robust estimator, without double bootstrapping. Some of the general issues, with examples, are discussed by Faraway (1992).


3.13 Problems
1  In a two-sample problem, with data y_{ij}, j = 1, ..., n_i, i = 1, 2, giving sample averages \bar{y}_i and variances v_i, describe models for which it would be appropriate to resample the following quantities:
(a) e_{ij} = y_{ij} - \bar{y}_i,
(b) e_{ij} = (y_{ij} - \bar{y}_i)/(1 + n_i^{-1})^{1/2},
(c) e_{ij} = (y_{ij} - \bar{y}_i)/\{v_i(1 + n_i^{-1})\}^{1/2},
(d) e_{ij} = \pm(y_{ij} - \bar{y}_i)/\{v_i(1 + n_i^{-1})\}^{1/2}, where the signs are allocated with equal probabilities,
(e) e_{ij} = y_{ij}/\bar{y}_i.
In each case say how a simulated dataset would be constructed.
What difficulties, if any, would arise from replacing \bar{y}_i and v_i by more robust estimates of location and scale?
(Sections 3.2, 3.3)
2  A slightly simplified version of the weighted mean of k samples, as used in Example 3.2, is defined by

    T = \sum_{i=1}^k \hat{w}_i \bar{y}_i \Big/ \sum_{i=1}^k \hat{w}_i,

where \hat{w}_i = n_i/\hat{\sigma}_i^2, with \bar{y}_i = n_i^{-1}\sum_j y_{ij} and \hat{\sigma}_i^2 = n_i^{-1}\sum_j (y_{ij} - \bar{y}_i)^2 estimates of the mean \mu_i and variance \sigma_i^2 of the ith distribution. Show that the influence functions for T are

    L_{t,i}(y_i; F) = \frac{w_i}{\sum_j w_j}\left[ y_i - \mu_i - (\mu_i - \theta)\left\{\frac{(y_i - \mu_i)^2}{\sigma_i^2} - 1\right\} \right],

where w_i = n_i/\sigma_i^2. Deduce that the first-order approximation under the constraint \mu_1 = \cdots = \mu_k for the variance of T is v_L = 1/\sum w_i, with empirical analogue \hat{v}_L = 1/\sum \hat{w}_i. Compare this to the corresponding formula based on the unconstrained empirical influence values.
(Section 3.2.1)
3  Suppose that Y is bivariate with polar representation (X, \omega), so that Y^T = (X \cos\omega, X \sin\omega). If it is known that \omega has a uniform distribution on [0, 2\pi), independent of X, what would be an appropriate resampling algorithm based on the random sample y_1, ..., y_n?
(Section 3.3)

4  Spherical data y_1, ..., y_n are points on the sphere of unit radius. Suppose that it is assumed that these data come from a distribution that is symmetric about the unknown mean direction \mu. In light of the symmetry assumption, what would be an appropriate resampling algorithm for simulating data y_1^*, ..., y_n^*?
(Section 3.3; Ducharme et al., 1985)

5  Two independent random samples y_{11}, ..., y_{1n_1} and y_{21}, ..., y_{2n_2} of positive data are obtained, and the ratio of sample means t = \bar{y}_2/\bar{y}_1 is used to estimate the corresponding population ratio \theta = \mu_2/\mu_1.
(a) Show that the influence functions for t are

    L_{t,1}(y_1; F) = -(y_1 - \mu_1)\theta/\mu_1,    L_{t,2}(y_2; F) = (y_2 - \mu_2)/\mu_1.

Hence obtain the formula

    v_L = \left\{ n_1^{-2} t^2 \sum_j (y_{1j} - \bar{y}_1)^2 + n_2^{-2} \sum_j (y_{2j} - \bar{y}_2)^2 \right\} \Big/ \bar{y}_1^2

for the approximate variance of T.
(b) Describe an appropriate resampling algorithm. How could this be modified if one could assume a multiplicative model, i.e. Y_{1j} = \mu_1\varepsilon_{1j} and Y_{2j} = \mu_2\varepsilon_{2j} with all \varepsilon s sampled from a common distribution of positive random variables?
(c) Show that under the multiplicative model the approximate variance formula can be changed to v_L = t^2 \sum_i \sum_j (e_{ij} - 1)^2/(n_1 n_2), where e_{ij} = y_{ij}/\bar{y}_i.
(Section 3.2.1)
6  The empirical influence values can be calculated more directly as follows. Consider only distributions supported on the data values, with probabilities p_i = (p_{i1}, ..., p_{in_i}) on the values in the ith sample for i = 1, ..., k. Then write T = t(p_1, ..., p_k), so that t = t(\hat{p}_1, ..., \hat{p}_k) with \hat{p}_i = (n_i^{-1}, ..., n_i^{-1}). Show that the empirical influence value l_{ij} corresponding to the jth case in sample i is given by

    l_{ij} = \frac{\partial}{\partial\varepsilon}\, t\{\hat{p}_1, ..., (1-\varepsilon)\hat{p}_i + \varepsilon 1_j, ..., \hat{p}_k\}\Big|_{\varepsilon=0},

where 1_j is the vector with 1 in the jth position and zeroes elsewhere.
(Section 3.2.1)
7  Following on from the previous problem, re-express t(p_1, ..., p_k) as a function u(\pi) of a single probability vector \pi = (\pi_{11}, ..., \pi_{1n_1}, ..., \pi_{kn_k}). For example, for the ratio of means of two independent samples, t = \bar{y}_2/\bar{y}_1,

    u(\pi) = \left( \sum_j \pi_{2j} y_{2j} \Big/ \sum_j \pi_{2j} \right) \Big/ \left( \sum_j \pi_{1j} y_{1j} \Big/ \sum_j \pi_{1j} \right).

The observed value t is then equal to u(\hat{\pi}), where \hat{\pi} = (n^{-1}, ..., n^{-1}) with n = \sum_i n_i. Show that

    l_{ij}' = \frac{d}{d\varepsilon}\, u\{(1 - \varepsilon)\hat{\pi} + \varepsilon 1_{ij}\}\Big|_{\varepsilon=0},

where 1_{ij} is the vector with n/n_i in the (n_0 + \cdots + n_{i-1} + j)th position, with n_0 = 0, and zeroes elsewhere. One consequence of this is that v_L = n^{-2} \sum_i \sum_j l_{ij}'^2.
Apply these calculations to the ratio t = \bar{y}_2/\bar{y}_1.
(Section 3.2.1)
8  If x_1, ..., x_n is a random sample from some distribution G with density g, suppose that this density is estimated by the kernel density estimate

    \hat{g}_h(x) = \frac{1}{nh} \sum_{j=1}^n w\left(\frac{x - x_j}{h}\right),

where w is a symmetric PDF with mean zero and variance \tau^2.
(a) Show that this density estimate has mean \bar{x} and variance n^{-1}\sum(x_j - \bar{x})^2 + h^2\tau^2.
(b) Show that the random variable x^* = x_J + h\varepsilon has PDF \hat{g}_h, where J is uniformly distributed on \{1, ..., n\} and \varepsilon has PDF w. Hence describe an algorithm for bootstrap simulation from a smoothed version of the EDF.
(c) Show that the rescaled density

    \hat{g}_{h,b}(x) = \frac{1}{nhb} \sum_{j=1}^n w\left(\frac{x - a - bx_j}{hb}\right)

will have the same first two moments as the EDF if a = (1 - b)\bar{x} and b = \{1 + nh^2\tau^2/\sum(x_j - \bar{x})^2\}^{-1/2}. What algorithm simulates from this smoothed EDF?
(d) Discuss the special problems that arise from using \hat{g}_h(x) when the range of x is [0, \infty) rather than (-\infty, \infty).
(e) Extend the algorithms in (b) and (c) to multivariate x.
(Section 3.4; Silverman and Young, 1987; Wand and Jones, 1995)
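A minimal R sketch of the simulation algorithms in parts (b) and (c), assuming a standard normal kernel w (so that \tau = 1):

    # Smoothed bootstrap sample: x* = x_J + h*eps, optionally rescaled
    # as in part (c) to preserve the first two moments of the EDF.
    smooth.sample <- function(x, h, shrink = F)
    { n <- length(x)
      xstar <- x[sample(n, n, replace = T)] + h*rnorm(n)
      if (!shrink) xstar
      else { # rescaled version of part (c): a + b*(x_J + h*eps)
        b <- 1/sqrt(1 + n*h^2/sum((x - mean(x))^2))
        (1 - b)*mean(x) + b*xstar } }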

9  Consider resampling cases from censored data (y_1, d_1), ..., (y_n, d_n), where y_1 < \cdots < y_n. Let f_j^* denote the number of times that (y_j, d_j) occurs in an ordinary bootstrap sample, and let S_j^* = f_j^* + \cdots + f_n^*.
(a) Show that when there is no censoring, the product-limit estimate puts mass n^{-1} on each observed failure y_1 < \cdots < y_n, so that it equals the EDF \hat{F}.
(b) Show that if B(m, p) denotes the binomial distribution with index m and probability p, then conditional on S_j^* the count f_j^* has the B\{S_j^*, (n - j + 1)^{-1}\} distribution, so that

    \log\{1 - \hat{F}^*(y)\} = \sum_{j: y_j \le y} d_j \log(1 - f_j^*/S_j^*),

and deduce that this has variance \sum_{j: y_j \le y} d_j \mathrm{var}\{\log(1 - f_j^*/S_j^*)\}.
(d) Use the delta method to show that \mathrm{var}^*[\log\{1 - \hat{F}^*(y)\}] \doteq \sum_{j: y_j \le y} d_j/(n - j + 1)^2, and infer that

    \mathrm{var}^*\{1 - \hat{F}^*(y)\} \doteq \{1 - \hat{F}(y)\}^2 \sum_{j: y_j \le y} \frac{d_j}{(n - j + 1)^2}.

This equals the variance from Greenwood's formula, (3.10), apart from replacement of (n - j + 1)^2 by (n - j)(n - j + 1).
(Section 3.5; Efron, 1981a; Cox and Oakes, 1984, Section 4.3)
10  Consider the weird bootstrap applied to a homogeneous sample of censored data (y_1, d_1), ..., (y_n, d_n), in which y_1 < \cdots < y_n. Let d\hat{A}_0^*(y_j) = N_j^*/(n - j + 1), where the N_j^* are independent binomial variables with denominators n - j + 1 and probabilities d_j/(n - j + 1).
(a) Show that the total number of failures under this resampling scheme is distributed as a sum of independent binomial observations.
(b) Show that \cdots and that if d_n = 1 then d\hat{A}_0^*(y_n) always equals one.
(Section 3.5; Andersen et al., 1993)
11  Suppose that Y_j = (U_j, X_j), j = 1, ..., n, are bivariate normal with mean vector \mu and variance matrix \Omega. When the Ys are observed, a random m cases have x missing. Obtain formulae for the maximum likelihood estimators of \mu and \Omega. Verify that these formulae agree with the multiple-imputation estimators constructed by the method of Section 3.6.

12  (a) Establish (3.15), and show that the sample variance c is an unbiased estimate of \gamma.
(b) Now suppose that N = kn for some integer k. Show that under the population bootstrap,

    E^*(\bar{Y}^*) = \bar{y},    \mathrm{var}^*(\bar{Y}^*) = \frac{N - k}{N - 1} \times (1 - f)n^{-1}c.

(c) In the context of Example 3.16, suppose that the parameter of interest is a nonlinear function of \theta, say \eta = g(\theta), which is estimated by g(T). Use the delta method to show that the bias of g(T) is roughly \frac{1}{2}g''(\theta)\mathrm{var}(T), and that the bootstrap bias estimate is roughly \frac{1}{2}g''(t)\mathrm{var}^*(T^*). Under what conditions on n and N does the bootstrap bias estimate converge to the true bias?
(Section 3.7; Bickel and Freedman, 1984; Booth, Butler and Hall, 1994)
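The moment formulae in (b) are easy to check numerically; the sketch below (plain R, with var(y) playing the role of c) builds the pseudo-population of k copies and compares the simulation variance of \bar{Y}^* with the value stated in (b).

    # Population bootstrap check for N = kn: k copies of the sample form
    # the pseudo-population, from which n values are drawn without replacement.
    pop.boot.check <- function(y, k, R = 9999)
    { n <- length(y); N <- k*n; f <- n/N
      Ybar <- replicate(R, mean(sample(rep(y, k), n)))  # WOR sample of size n
      c(var(Ybar), (N - k)/(N - 1)*(1 - f)*var(y)/n) }  # simulated vs stated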
13  To model the superpopulation bootstrap, suppose that the original data are y_1, ..., y_n and that \mathcal{Y}^* contains M_1, ..., M_n copies of y_1, ..., y_n; the joint distribution of the M_j is multinomial with probabilities n^{-1} and denominator N. If Y_1^*, ..., Y_n^* are sampled without replacement from \mathcal{Y}^* and if \bar{Y}^* = n^{-1}\sum Y_j^*, show that

    E^*(\bar{Y}^*) = \bar{y},    E_M\{\mathrm{var}^*(\bar{Y}^* \mid M)\} = \frac{n - 1}{n} \times (1 - f)n^{-1}c.

(Section 3.7; Presnell and Booth, 1994)


14  Suppose we wish to perform mirror-match resampling with k independent without-replacement samples of size m, but that k = \{n(1 - m/n)\}/\{m(1 - f)\} is not an integer. Let K^* be the random variable such that

    \Pr(K^* = k') = 1 - \Pr(K^* = k' + 1) = k'(1 + k' - k)/k,

where k' = \lfloor k \rfloor is the integer part of k. Show that if the mirror-match algorithm is applied for an average \bar{Y}^* with this distribution for K^*, \mathrm{var}^*(\bar{Y}^*) = (1 - m/n)c/(mk).
Show also that under mirror-match resampling with the simplifying assumption that randomization is not required because k is an integer,

    E^*(C^*) = c\left\{1 - \frac{m(k - 1)}{n(mk - 1)}\right\},

where C^* is the sample variance of the Y_j^*.
What implications are there for variance estimation for more complex statistics?
(Section 3.7; Sitter, 1992)
15  Suppose that n is a large even integer and that N = 5n/2, and that instead of applying the population bootstrap we choose a population from which to resample according to

    \mathcal{Y}^* = { y_1, ..., y_n, y_1, ..., y_n                 with probability 1/2,
                      y_1, ..., y_n, y_1, ..., y_n, y_1, ..., y_n  with probability 1/2;

here \#\{A\} denotes the number of elements in the set A. Having selected \mathcal{Y}^* we take a sample Y_1^*, ..., Y_n^* from it without replacement and calculate Z^* = (\bar{Y}^* - \bar{y})\{(1 - f^*)n^{-1}c\}^{-1/2}. Show that if f^* = n/N the approximate distribution of Z^* is the normal mixture \frac{1}{2}N(0, \frac{5}{6}) + \frac{1}{2}N(0, \frac{10}{9}), but that if f^* = n/\#\{\mathcal{Y}^*\} the approximate distribution of Z^* is N(0, 1). Check that in the first case, E^*(Z^*) = 0 and \mathrm{var}^*(Z^*) \approx 1.
Comment on the implications for the use of randomization in finite population resampling.
(Section 3.7; Bickel and Freedman, 1984; Presnell and Booth, 1994)

16  Suppose that we have data y_1, ..., y_n, and that the bootstrap sample is taken to be

    Y_j^* = \bar{y} + d(y_{I_j} - \bar{y}),    j = 1, ..., n',

where I_1, ..., I_{n'} are independently chosen at random from \{1, ..., n\}. Show that when d = \{n'(1 - f)/(n - 1)\}^{1/2}, we have E^*(\bar{Y}^*) = \bar{y} and \mathrm{var}^*(\bar{Y}^*) = (1 - f)n^{-1}c. How might the value of n' be chosen?
Discuss critically this resampling scheme.
(Section 3.7; Rao and Wu, 1988)
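Again the moments are easy to verify by simulation; a minimal R sketch, with the number of draws n' passed as nn:

    # Rescaling bootstrap of the problem above: Y*_j = ybar + d(y_I - ybar).
    rescale.check <- function(y, f, nn = length(y), R = 9999)
    { n <- length(y); d <- sqrt(nn*(1 - f)/(n - 1))
      Ybar <- replicate(R,
        mean(mean(y) + d*(sample(y, nn, replace = T) - mean(y))))
      c(mean(Ybar), var(Ybar), (1 - f)*var(y)/n) }  # compare last two values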
17  Suppose that y_{ij} = x_i + z_{ij}, i = 1, ..., a and j = 1, ..., b, where the x_i are independent with mean \mu and variance \sigma_x^2, and the z_{ij} are independent with mean 0 and variance \sigma_z^2. Consider the resampling schemes

    y_{ij}^* = \bar{y}_{K_i} + (y_{K_i J_j} - \bar{y}_{K_i}),

where K_1, ..., K_a are randomly sampled with replacement from \{1, ..., a\}, and J_1, ..., J_b are randomly sampled from \{1, ..., b\} either with or without replacement. Show that the second-moment properties of the Y_{ij}^* are given by (3.23) and (3.24).
(Section 3.8)
18  For the model of Problem 3.17, define estimates of the x_i and z_{ij} by

    \hat{x}_i = c\bar{y}_i + (1 - c)\bar{y},    \hat{z}_{ij} = d(y_{ij} - \bar{y}_i).

Show that the EDFs of the \hat{x}_i and \hat{z}_{ij} have first two moments which are unbiased for the corresponding moments of the Xs and Zs for suitable choices of c and d, and find those choices.
(Section 3.8)
19  Consider the double bootstrap procedure for adjusting the estimated bias of T, as described in Section 3.9, when T is the average \bar{Y}. Show that the variance of the simulation error for the adjusted bias estimate B - C is \cdots, with s^2 the sample variance.
Hence deduce that for fixed RM the best choice for M is 1. How would the results change for a statistic other than the average?
Derive the corresponding result for the bias correction of the bootstrap estimate of var(T).
(Sections 2.5.2, 3.9)
20  Extend the discussion following (3.42) to jackknife-after-bootstrap calculations for parametric simulation.
Describe the calculation in detail when parametric simulation is performed from the exponential density.
(Section 3.10; Efron, 1992)

21  Let t_p(F) denote the p \times 100\% trimmed average of distribution F, i.e.

    t_p(F) = \frac{1}{1 - 2p} \int_{F^{-1}(p)}^{F^{-1}(1-p)} y \, dF(y).

(a) If F_\kappa denotes the gamma distribution with index \kappa and unit mean, show that t_p(F_\kappa) = \kappa(1 - 2p)^{-1}\{F_{\kappa+1}(y_{\kappa,1-p}) - F_{\kappa+1}(y_{\kappa,p})\}, where y_{\kappa,p} is the p quantile of F_\kappa. Hence evaluate t_p(F_\kappa) for \kappa = 1, 2, 5, 10 and p = 0, 0.1, 0.2, 0.3, 0.4, 0.5.
(b) Suppose that the parameter of interest, \theta = \sum_{i=1}^k c_i t_p(F_{\kappa_i}), depends on several gamma distributions F_{\kappa_i}. Let \hat{F}_i denote the EDF of a sample of size n_i from F_{\kappa_i}. Under what circumstances is T = \sum_{i=1}^k c_i t_p(\hat{F}_i) (i) unbiased, (ii) nearly unbiased, as an estimate of \theta? Test your conclusions by a small simulation experiment.
(Section 3.11)

3.14 Practicals
1  To perform the analysis for the gravity data outlined in Example 3.2:

    grav.fun <- function(data, i)
    { d <- data[i,]
      m <- tapply(d$g, d$series, mean)
      v <- tapply(d$g, d$series, var)
      n <- table(d$series)
      c(sum(m*n/v)/sum(n/v), 1/sum(n/v)) }
    grav.boot <- boot(gravity, grav.fun, R=200, strata=gravity$series)

Plot the estimate and its variance. Is the simulation well-behaved? How normal are the bootstrapped estimates and studentized bootstrap statistics?
Now for a semiparametric analysis, as suggested in Section 3.3:

    attach(gravity)
    n <- table(series)
    m <- rep(tapply(g, series, mean), n)
    s <- rep(sqrt(tapply(g, series, var)), n)
    res <- (g - m)/s
    qqnorm(res); abline(0, 1, lty=2)
    grav <- data.frame(m, s, series, res)
    grav.fun <- function(data, i)
    { e <- data$res[i]
      y <- data$m + data$s*e
      m <- tapply(y, data$series, mean)
      v <- tapply(y, data$series, var)
      n <- table(data$series)
      c(sum(m*n/v)/sum(n/v), 1/sum(n/v)) }
    grav1.boot <- boot(grav, grav.fun, R=200)

Do the residuals res for the different series look similar? Compare the values of t and v for the two sampling schemes. Compare also 80% confidence intervals for g.
(Section 3.2)
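One possibility for the 80% intervals is boot.ci, whose studentized intervals use the variance estimate returned as the second component of grav.fun:

    boot.ci(grav.boot, conf = 0.80, type = c("norm", "basic", "stud"))
    boot.ci(grav1.boot, conf = 0.80, type = c("norm", "basic", "stud"))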
2  Dataframe channing contains data on the survival of 97 men and 365 women in a retirement home in California. The variables are sex, ages in months at which individuals entered and left the home, the time in months they spent there, and a censoring indicator (0/1 denoting censored due to leaving the home/died there). For details see Hyde (1980). We compare the variability of the survival probabilities at 75 and 85 years (900 and 1020 months), and of the estimated 0.75 and 0.5 quantiles of the survival distribution.

    chan <- channing[1:97,]    # men only
    chan$age <- chan$entry + chan$time
    attach(chan)
    chan.F <- survfit(Surv(age, cens))
    chan.F
    max(chan.F$surv[chan.F$time > 900])
    max(chan.F$surv[chan.F$time > 1020])
    chan.G <- survfit(Surv(age - 0.01*cens, 1 - cens))
    split.screen(c(2,1))
    screen(1); plot(chan.F, xlim=c(760,1200), main="survival")
    screen(2); plot(chan.G, xlim=c(760,1200), main="censoring")
    chan.fun <- function(data)
    { s <- survfit(Surv(age, cens), data=data)
      c(max(s$surv[s$time > 900]), max(s$surv[s$time > 1020]),
        min(s$time[s$surv <= 0.75]), min(s$time[s$surv <= 0.5])) }
    chan.boot1 <- censboot(chan, chan.fun, R=99, sim="ordinary")
    chan.boot2 <- censboot(chan, chan.fun, R=99, F.surv=chan.F,
                           G.surv=chan.G, sim="cond", index=c(6,5))
    chan.boot3 <- censboot(chan, chan.fun, R=99, F.surv=chan.F,
                           sim="weird", index=c(6,5))

Give normal-approximation confidence limits for each of the survival probabilities, transformed if necessary, and compare them with those from chan.F. How do the intervals for the different bootstraps compare?
(Section 3.5; Efron, 1981a)
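One possible way to obtain the normal-approximation limits is boot.ci applied to each component in turn, for example:

    boot.ci(chan.boot1, type = "norm", index = 1)  # survival probability at 900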
3  To study the performance of censored data resampling schemes when the censoring pattern is fixed, we perform a small simulation study. We apply a fixed censoring pattern to samples of size 50 from the unit exponential distribution, and for each sample we calculate t = (t_1, t_2), where t_1 is the maximum likelihood estimate of the distribution mean and t_2 is the number of censored observations. We apply each bootstrap scheme to the sample, and record the mean and standard deviation of t from the bootstrap simulation. (This is quite time-consuming: take nreps and R as big as you dare.)

    exp.fun <- function(d)
    { d.s <- survfit(Surv(y, cens), data=d)
      prob <- min(d.s$surv[d.s$time < 1])
      med <- min(d.s$time[(1 - d.s$surv) >= 0.5])
      c(sum(d$y)/sum(d$cens), sum(1 - d$cens)) }
    results <- NULL; nreps <- 100; n <- 50; R <- 25
    cens <- 3*runif(n)
    for (i in 1:nreps)
    { y0 <- rexp(n)
      junk <- data.frame(y = pmin(y0, cens), cens = as.numeric(y0 < cens))
      junk.F <- survfit(Surv(y, cens), data=junk)
      junk.G <- survfit(Surv(y, 1 - cens), data=junk)
      ord.boot <- censboot(junk, exp.fun, R=R)
      con.boot <- censboot(junk, exp.fun, R=R,
                           F.surv=junk.F, G.surv=junk.G, sim="cond")
      wei.boot <- censboot(junk, exp.fun, R=R,
                           F.surv=junk.F, sim="weird")
      res <- c(exp.fun(junk),
               apply(ord.boot$t, 2, mean),
               apply(con.boot$t, 2, mean),
               apply(wei.boot$t, 2, mean),
               sqrt(apply(ord.boot$t, 2, var)),
               sqrt(apply(con.boot$t, 2, var)),
               sqrt(apply(wei.boot$t, 2, var)))
      results <- rbind(results, res) }
The estimated bias and standard deviation of t_1, and the bootstrap bias estimates, are

    mean(results[,1]) - 1
    sqrt(var(results[,1]))
    bias.o <- results[,3] - results[,1]
    bias.c <- results[,5] - results[,1]
    bias.w <- results[,7] - results[,1]

How do they compare? What about the estimated standard deviations? How do the numbers of censored observations vary under the schemes?
(Section 3.5; Efron, 1981a; Burr, 1994)
4  The tau particle is a heavy electron-like particle which decays into various collections of other charged particles shortly after its production. The decay usually involves one charged particle, in which case it can happen in a number of modes, the main four of which are labelled \rho, \pi, e, and \mu. It takes a major research project to measure the rate of occurrence of single-particle decay, decay_1, or any of its component rates decay_\rho, decay_\pi, decay_e, and decay_\mu, and just one of these can be measured in any one experiment. Thus the data in dataframe tau on decay rates for 60 experiments represent several years of work. Here we use them to estimate and form a confidence interval for the parameter

    \theta = decay_1 - decay_\rho - decay_\pi - decay_e - decay_\mu.

Suppose that we had thought of using the 0, 12.5, 25, 37.5 and 50% trimmed averages to estimate the difference. To calculate these and to obtain bootstrap confidence intervals for the estimates of \theta:
    tau.diff <- function(data)
    { y0 <- tapply(data[,1], data[,2], mean)
      y1 <- tapply(data[,1], data[,2], mean, trim=0.125)
      y2 <- tapply(data[,1], data[,2], mean, trim=0.25)
      y3 <- tapply(data[,1], data[,2], mean, trim=0.375)
      y4 <- tapply(data[,1], data[,2], median)
      y <- rbind(y0, y1, y2, y3, y4)
      y[,1] - apply(y[,-1], 1, sum) }
    tau.diff(tau)
    tau.fun <- function(data, i) tau.diff(data[i,])
    tau.boot <- boot(tau, tau.fun, R=999, strata=tau$decay)
    boot.ci(tau.boot, type=c("norm","basic"), index=1)
    boot.ci(tau.boot, type=c("norm","basic"), index=2)

and so forth, with index=3, 4, 5 for the remaining degrees of trim. Does the degree of trimming affect the interval much?
To see the jackknife-after-bootstrap plot when \theta is estimated using the average:

    jack.after.boot(tau.boot, index=1)

How does the degree of trim affect the bootstrap distributions of the different estimators of \theta?
Now suppose that we want to choose the estimator from the data, by taking the trimmed average with smallest variance. For the original data this is the 25% trimmed average, so the estimate is 16.87. Its variance can be estimated by a double bootstrap, which we can implement as follows:

    tau.nest <- function(data, i)
    { d <- data[i,]
      d.trim <- tau.diff(d)
      v.trim <- apply(boot(d, tau.fun, R=25, strata=d$decay)$t, 2, var)
      c(d.trim, v.trim) }
    tau.boot2 <- boot(tau, tau.nest, R=100, strata=tau$decay)

To see what degrees of trimming give the smallest variances, and to calculate the corresponding estimates and obtain their variance:

    i <- matrix(1:5, 5, tau.boot2$R)
    i <- i[t(tau.boot2$t[,6:10] == apply(tau.boot2$t[,6:10], 1, min))]
    table(i)
    t.best <- tau.boot2$t[cbind(1:tau.boot2$R, i)]
    var(t.best)

Is the optimal degree of trimming well-determined?
How would you use the results of Problems 2.13 and 2.4 to avoid the second level of bootstrapping?
(Section 3.11; Efron, 1992)
5  We apply the jackknife-after-bootstrap to the correlation coefficient between plumage and behaviour in cross-bred ducks.

    ducks.boot <- boot(ducks, corr, R=999, stype="w")
    ducks.L <- empinf(data=ducks, statistic=corr)
    split.screen(c(1,2))
    screen(1)
    split.screen(c(2,1))
    screen(4)
    attach(ducks)
    plot(plumage, behaviour, type="n")
    text(plumage, behaviour, round(ducks.L, 2))
    screen(3)
    plot(plumage, behaviour, type="n")
    text(plumage, behaviour, 1:nrow(ducks))
    screen(2)
    jack.after.boot(boot.out=ducks.boot, useJ=F, stinf=F, L=ducks.L)

(a) The value of the correlation is t = 0.83. Will it increase or decrease if observation 7 is deleted from the sample? (Be careful.) What is the effect on t of deleting observation 6?
(b) What happens to the bootstrap distribution of t^* - t when observation 8 is deleted from the sample? What about observation 6?
(c) Show that the probability that neither observation 5 nor observation 6 is in a bootstrap sample is (1 - 2/n)^n = 0.11. Now suppose that observation 5 is deleted, and calculate the probability that observation 6 is not in a bootstrap sample. Does this explain what happens in (b)?
6  Suppose that we are interested in the largest eigenvalue of the covariance matrix between the baseline and one-year CD4 counts in cd4; see Practical 2.3. To calculate this and its approximate variance using the nonparametric delta method (Problem 2.14), and to bootstrap it:

    eigen.fun <- function(d, w = rep(1, nrow(d))/nrow(d))
    { w <- w/sum(w)
      n <- nrow(d)
      m <- crossprod(w, d)
      m2 <- sweep(d, 2, m)
      v <- crossprod(diag(sqrt(w)) %*% m2)
      eig <- eigen(v, symmetric=T)
      stat <- eig$values[1]
      e <- eig$vectors[,1]
      i <- rep(1:n, round(n*w))
      ds <- sweep(d[i,], 2, m)
      L <- (ds %*% e)^2 - stat
      c(stat, sum(L^2)/n^2) }
    cd4.boot <- boot(cd4, eigen.fun, R=999, stype="w")
Some diagnostic plots:

    split.screen(c(1,2))
    screen(1); split.screen(c(2,1))
    screen(3)
    plot(cd4.boot$t[,1], cd4.boot$t[,2], xlab="t*", ylab="vL*", pch=".")
    screen(4)
    plot(cd4[,1], cd4[,2], type="n", xlab="baseline",
         ylab="one year", xlim=c(1,7), ylim=c(1,7))
    text(cd4[,1], cd4[,2], c(1:20), cex=0.7)
    screen(2); jack.after.boot(cd4.boot, useJ=F, stinf=F)

What is going on here?
(Section 3.10.1; Canty, Davison and Hinkley, 1996)

4
Tests

4.1 Introduction
Many statistical applications involve significance tests to assess the plausibility of scientific hypotheses. Resampling methods are not new to significance testing, since randomization tests and permutation tests have long been used to provide nonparametric tests. Also Monte Carlo tests, which use simulated datasets, are quite commonly used in certain areas of application. In this chapter we describe how resampling methods can be used to produce significance tests, in both parametric and nonparametric settings. The range of ideas is somewhat wider than the direct bootstrap approach introduced in the preceding two chapters. To begin with, we summarize some of the key ideas of significance testing.
The simplest situation involves a simple null hypothesis H_0 which completely specifies the probability distribution of the data. Thus, if we are dealing with a single sample y_1, ..., y_n from a population with CDF F, then H_0 specifies that F = F_0, where F_0 contains no unknown parameters. An example would be "exponential with mean 1". The more usual situation in practice is that H_0 is a composite null hypothesis, which means that some aspects of F are not determined and remain unknown when H_0 is true. An example would be "normal with mean 1", the variance of the normal distribution being unspecified.
P-values
A statistical test is based on a test statistic T which measures the discrepancy between the data and the null hypothesis. In general discussion we shall follow the convention that large values of T are evidence against H_0. Suppose for the moment that this null hypothesis is simple. If the observed value of the test statistic is denoted by t, then the level of evidence against H_0 is measured by the significance probability

    p = \Pr(T \ge t \mid H_0),    (4.1)

often called the P-value. A corresponding notion is that of a critical value t_p for t, associated with testing at level p: if t \ge t_p then H_0 is rejected at level p, or 100p%. Necessarily t_p is defined by \Pr(T \ge t_p \mid H_0) = p. The level p is also called the error rate or the size of the test, and \{(y_1, ..., y_n) : t \ge t_p\} is called the level p critical region of the test. The distribution of T under H_0 is called the null distribution of T.
Under H_0 the P-value (4.1) has a uniform distribution on [0, 1], if T is continuous, so that the corresponding random variable P has distribution

    \Pr(P \le p \mid H_0) = p.    (4.2)

This yields the error rate interpretation of the P-value, namely that if the observed test statistic were regarded as just decisive against H_0, then this is equivalent to following a procedure which rejects H_0 with error rate p. The same is not exactly true if T is discrete, and for this reason modifications to (4.1) are sometimes suggested for discrete data problems: we shall not worry about the distinction here.
It is important in applications to give a clear idea of the degree of discrepancy between data and null hypothesis, if not giving the P-value itself then at least indicating how it compares to several levels, say p = 0.10, 0.05, 0.01, rather than just testing at the 0.05 level.
Choice of test statistic
In the parametric setting, we have an explicit form for the sampling distribution of the data, with a finite number of unknown parameters. Often the null hypothesis specifies numerical values for, or relationships between, some or all of these parameters. There is also an alternative hypothesis H_A which describes what alternatives to H_0 it is most important to detect, or what is thought likely to be true if H_0 is not. This alternative hypothesis guides the specific choice of T, usually through use of the likelihood function

    L(\theta) = f_{Y_1, ..., Y_n}(y_1, ..., y_n \mid \theta),

i.e. the joint density of the observations. For example, when H_0 and H_A are both simple, say H_0 : \theta = \theta_0 and H_A : \theta = \theta_A, then the best test statistic is the likelihood ratio

    T = L(\theta_A)/L(\theta_0).    (4.3)

A rather different situation is where we wish to test the goodness of fit of the parametric model. Sometimes this can be done by embedding the model into a larger model, with one or a few additional parameters corresponding to departure from the original model. We would then test those additional parameters. Otherwise general purpose goodness of fit tests will be used, for example chi-squared tests.
In the nonparametric setting, no particular forms are specified for the distributions. Then the appropriate choice of T is less clear, but it should be based on at least a qualitative notion of what is of concern should H_0 not be true. Usually T would be based on a statistical function s(F) that reflects the characteristic of physical interest and for which the null hypothesis specifies a value. For example, suppose that we wish to test the null hypothesis H_0 that X and Y are independent, given the random sample (X_1, Y_1), ..., (X_n, Y_n). The correlation s(F) = \mathrm{corr}(X, Y) = \rho is a convenient measure of dependence, and \rho = 0 under H_0. If the alternative hypothesis is positive dependence, then a natural test statistic is T = s(\hat{F}), the raw sample correlation; if the alternative hypothesis is just dependence, then the two-sided test statistic T = s^2(\hat{F}) could be used.
Conditional tests
In most parametric problems and all nonparametric problems, the null hypothesis H_0 is composite, that is it leaves some parameters unknown and therefore does not completely specify F. Therefore the P-value (4.1) is not generally well-defined, because \Pr(T \ge t \mid F) may depend upon which F satisfying H_0 is taken. There are two clean solutions to this difficulty. One is to choose T carefully so that its distribution is the same for all F satisfying H_0: examples include the Student-t test for a normal mean with unknown variance, and rank tests for nonparametric problems. The second and more widely applicable solution is to eliminate the parameters which remain unknown when H_0 is true by conditioning on the sufficient statistic under H_0. If this sufficient statistic is denoted by S, then we define the conditional P-value by

    p = \Pr(T \ge t \mid S = s, H_0).    (4.4)

Familiar examples include the Fisher exact test for a 2 x 2 table and the Student-t test mentioned earlier. Other examples will be given in the next two sections.
A less satisfactory approach, which can nevertheless give good approximations, is to estimate F by a CDF \hat{F}_0 which satisfies H_0 and then calculate

    p = \Pr(T \ge t \mid \hat{F}_0).    (4.5)

Typically this value will not satisfy (4.2) exactly, but will deviate by an amount which may be practically negligible.
Pivot tests
When the null hypothesis concerns a particular parameter value, the equivalence between significance tests and confidence sets can be used. This equivalence is that if the value \theta_0 is outside a 1 - \alpha confidence set for \theta, then \theta differs from \theta_0 with P-value less than \alpha. The particular alternative hypothesis for which this applies is determined by the type of confidence set: for example, if the confidence set is all values to the right of a lower confidence limit, then the implied alternative is H_A : \theta > \theta_0. A specific form of test based on this equivalence is the pivot test. For example, suppose that T is an estimator for scalar \theta, with estimated variance V. Suppose further that the studentized form Z = (T - \theta)/V^{1/2} is a pivot, meaning that its distribution is the same for all relevant F, and in particular for all \theta. The Student-t statistic is a familiar instance of this. For the one-sided test of H_0 : \theta = \theta_0 versus H_A : \theta > \theta_0, the P-value attached to the observed studentized test statistic z_0 = (t - \theta_0)/v^{1/2} is

    p = \Pr\{(T - \theta_0)/V^{1/2} \ge (t - \theta_0)/v^{1/2} \mid H_0\}.

But because Z is a pivot,

    \Pr\{Z \ge (t - \theta_0)/v^{1/2} \mid H_0\} = \Pr\{Z \ge (t - \theta_0)/v^{1/2} \mid F\},

and therefore

    p = \Pr(Z \ge z_0 \mid F).    (4.6)

The particular advantage of this, in the resampling context, is that we do not
have to construct a special null hypothesis sampling distribution.
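In resampling form the calculation is equally direct. As a sketch, suppose that tstar and vstar hold the estimator and its variance estimate computed from R resamples drawn under \hat{F} (for example by boot); then with t, v and theta0 the observed quantities,

    z0 <- (t - theta0)/sqrt(v)           # observed studentized statistic
    zstar <- (tstar - t)/sqrt(vstar)     # resampled copies of the pivot Z
    p <- (1 + sum(zstar >= z0))/(R + 1)  # P-value based on (4.6)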
In parametric problems it is usually possible to express the model in terms of the parameter of interest \psi and other (nuisance) parameters \lambda, so that the null hypothesis concerns only \psi. In the above discussion of conditional tests, (4.4) would be independent of \lambda. One general approach to construction of a test statistic T is to generalize the simple likelihood ratio (4.3), and to define

    LR = \frac{\max_{H_A} L(\psi, \lambda)}{\max_{H_0} L(\psi, \lambda)}.    (4.7)

For testing H_0 : \psi = \psi_0 versus H_A : \psi \ne \psi_0, this generalized likelihood ratio is equivalent to the more convenient expression

    LR = \frac{L(\hat{\psi}, \hat{\lambda})}{L(\psi_0, \hat{\lambda}_0)} = \frac{\max_{\psi, \lambda} L(\psi, \lambda)}{\max_{\lambda} L(\psi_0, \lambda)}.

Of course this also applies when there is no nuisance parameter. For many models it is possible to show that T = 2 \log LR has approximately the \chi^2_d distribution under H_0, where d is the dimension of \psi, so that

    p = \Pr(\chi^2_d \ge t),    (4.8)

independently of \lambda. Thus the likelihood ratio LR is an approximate pivot.
There is a variety of related statistics, including the score statistic, and the signed likelihood ratio for one-parameter problems. With each likelihood-based


statistic there is a simple approximation to the null distribution, and modifications to improve approximation in moderate-sized samples. The likelihood ratio method appears limited to parametric problems, but as we shall see in Chapter 10 it is possible to define analogues in the nonparametric case.
With all of the P-value calculations introduced thus far, simple approximations for p exist in many cases by appealing to limiting results as n increases. Part of the purpose of this chapter is to provide resampling alternatives to such approximations when they either fail to give appropriate accuracy or do not exist at all. Section 4.2 discusses ways in which resampling and simulation can help with parametric tests, starting with exact Monte Carlo tests. Section 4.3 briefly reviews permutation and randomization tests. This leads on to the wider topic of nonparametric bootstrap tests in Section 4.4. Section 4.5 describes a simple method for improving P-values when these are biased. Most of the examples in this chapter involve relatively simple applications. Chapters 6 and beyond contain more substantial applications.

4.2 Resampling for Parametric Tests

Broadly speaking, parametric resampling may be useful in any testing problem where either standard approximations do not apply or where the accuracy of such approximations is suspect. There is a wide range of such problems, including hypotheses with order constraints, hypotheses involving separate models, and graphical tests. In all of these problems, the basic method is to use a parametric resampling scheme as outlined in Section 2.2, except that here the simulation model must satisfy the relevant null hypothesis.

4.2.1 Monte Carlo tests

One special situation is when the null hypothesis distribution of T does not involve any nuisance parameters. Occasionally this happens directly, but more often it is induced, either by standardizing some initial statistic, or by conditioning on a sufficient statistic, as explained earlier. In the latter case the exact P-value is given by (4.4) rather than (4.1). In practice the exact P-value may be difficult or impossible to calculate, and Monte Carlo tests provide convenient approximations to the full tests. As we shall see, Monte Carlo tests are exact in their own right, and among bootstrap tests are special in this way.
The basic Monte Carlo test compares the observed statistic t to R independent values of T which are obtained from corresponding samples independently simulated under the null hypothesis model. If these simulated values are denoted by t_1^*, ..., t_R^*, then under H_0 all R + 1 values t, t_1^*, ..., t_R^* are equally


likely values of T. That is, assuming T is continuous,

    \Pr(T \le T^*_{(r)} \mid H_0) = \frac{r}{R + 1},    (4.9)

where as usual T^*_{(r)} denotes the rth ordered value. If exactly k of the simulated t^* values exceed t and none equal it, then

    p = \Pr(T \ge t \mid H_0) = p_{mc} = \frac{k + 1}{R + 1}.    (4.10)

The right-hand side is referred to as the Monte Carlo P-value. If T is continuous, then it follows from (4.9) that under H_0 the distribution of the corresponding random variable P_{mc} is uniform on \{\frac{1}{R+1}, \frac{2}{R+1}, ..., 1\}. This result is the discrete analogue of (4.2), and guarantees that P_{mc} has the error rate interpretation. In this sense the Monte Carlo test is exact. It differs from the full test, which corresponds to R = \infty, by blurring the critical region of the full test for any attainable level.
If T is discrete, then repeat values of t^* can occur. If exactly l of the t^* values equal t, then it is sometimes advocated that one bounds the significance probability,

    \frac{k + 1}{R + 1} \le p_{mc} \le \frac{k + l + 1}{R + 1}.

Our strict interpretation of (4.1) would have us use the upper bound, and so we adopt the general definition

    p = \frac{1 + \#\{t_r^* \ge t\}}{R + 1},    (4.11)

where \#\{A\} means the number of times the event A occurs.
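In code the definition is immediate: if t holds the observed statistic and tstar the R simulated values, then (4.11) is, for example,

    p <- (1 + sum(tstar >= t))/(R + 1)   # Monte Carlo P-value (4.11)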

Example 4.1 (Logistic regression)  Suppose that y_1, ..., y_n are independent binary outcomes, with corresponding scalar covariate values x_1, ..., x_n, and that we wish to test whether or not x influences y. If our chosen model is the logistic regression model

    \log \frac{\Pr(Y_j = 1 \mid x_j)}{\Pr(Y_j = 0 \mid x_j)} = \lambda + \psi x_j,

then the null hypothesis is H_0 : \psi = 0. Under H_0 the sufficient statistic for \lambda is S = \sum Y_j, and T = \sum x_j Y_j is the natural test statistic; T is in fact optimal for the logistic model, but is also effective for monotone transformations of the odds ratio other than logarithm. The significance is to be calculated according to (4.4).
The null distribution of Y_1, ..., Y_n given S = s is uniform over all \binom{n}{s} permutations of y_1, ..., y_n. Rather than generate all of these permutations to compute (4.4) exactly, we can generate R random permutations and apply (4.11). A simulated sample will then be (x_1, y_1^*), ..., (x_n, y_n^*), where y_1^*, ..., y_n^* is a random permutation of y_1, ..., y_n, and the associated test statistic will be t^* = \sum x_j y_j^*.


Table 4.1  n = 50 counts of balsam-fir seedlings in five feet square quadrats.

    0  1  2  3  4  3  4  2  2  1
    0  2  0  2  4  2  3  3  4  2
    1  1  1  1  4  1  5  2  2  3
    4  1  2  5  2  0  3  2  1  1
    3  1  4  3  1  0  0  2  7  0

In some applications there will be repeats among the x values, or equivalently m_i binomial trials with a_i occurrences of y = 1 at the ith distinct value of x. If the data are expressed in the latter form, then the same random permutation procedure can be applied to the original expanded form of data with n = \sum m_i individual ys.

Example 4.2 (Overdispersed counts)  The data in Table 4.1 are n = 50 counts of fir seedlings in small quadrats, part of a larger dataset. The actual spatial layout is preserved, although we are not concerned with this here. Rather we wish to test the null hypothesis that these data are a random sample from a Poisson distribution with unknown mean. The concern is that the data are overdispersed relative to the Poisson distribution, which strongly suggests that we take as test statistic the dispersion index T = \sum(Y_j - \bar{Y})^2/\bar{Y}. Under the Poisson model S = \sum Y_j is sufficient for the common mean, so we carry out a conditional test and apply (4.4). For the data, t = 55.15 and s = 107.
Now under the null hypothesis Poisson model, the conditional distribution of Y_1, ..., Y_n given \sum Y_j = s is multinomial with denominator s and n categories each having probability n^{-1}. It is easy to simulate from this distribution. In the first R = 99 simulated values t^*, 24 are larger than t = 55.15. So the Monte Carlo P-value (4.11) is equal to 0.25, and we conclude that the data dispersion is consistent with Poisson dispersion. Increasing R to 999 makes little difference, giving p = 0.235. The left panel of Figure 4.1 shows a histogram of all 999 values of t^* - t; the unshaded part of the histogram corresponds to values t^* \ge t which count toward significance.
For this simple problem the null distribution of T given S = s is approximately \chi^2_{n-1}. That this approximation is accurate for our data is illustrated in the right panel of Figure 4.1, which plots the ordered values of t^* against quantiles of the \chi^2_{49} distribution. The P-value obtained with this approximation is 0.253, close to the exact value. There are two points to make about this. First, the simulation results enable us to check on the accuracy of the theoretical approximation: if the approximation is good, then we can use it; but if it isn't, then we have the Monte Carlo P-value. Secondly, the Monte Carlo method does not require knowledge of a theoretical approximation, which may not even exist in more complicated problems, such as spatial analysis of these data.
The Monte Carlo method applies very generally.
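The whole test takes a few lines of R; the sketch below reproduces the calculation, with the counts taken from Table 4.1 and rmultinom the standard R multinomial sampler.

    y <- c(0,0,1,4,3, 1,2,1,1,1, 2,0,1,2,4, 3,2,1,5,3, 4,4,4,2,1,
           3,2,1,0,0, 4,3,5,3,0, 2,3,2,2,2, 2,4,2,1,7, 1,2,3,1,0)
    disp <- function(y) sum((y - mean(y))^2)/mean(y)   # dispersion index
    t0 <- disp(y)                                      # 55.15; sum(y) = 107
    R <- 999
    tstar <- replicate(R, disp(rmultinom(1, sum(y), rep(1, length(y)))[, 1]))
    (1 + sum(tstar >= t0))/(R + 1)                     # Monte Carlo P-value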


Figure 4.1  Simulation results for dispersion test. Left panel: histogram of R = 999 values of the dispersion statistic t^* obtained under multinomial sampling; the data value is t = 55.15 and p_{mc} = 0.235. Right panel: chi-squared plot of ordered values of t^*, dashed line corresponding to \chi^2_{49} approximation to null conditional distribution.

It seems intuitively clear that the sensitivity of the Monte Carlo test increases with R. We shall discuss this issue later, but for now we note that it is advisable to take R to be at least 99.
There are two important aspects of the Monte Carlo test which make it widely useful. The first is that we only need to be able to simulate data under the null hypothesis, this being relatively simple even in some very complicated problems, such as those involving spatial processes (Chapter 8). Secondly, t_1^*, ..., t_R^* do not need to be independent outcomes: the method remains valid so long as they are exchangeable outcomes, which is to say that the joint density of T, T_1^*, ..., T_R^* under H_0 is invariant under permutation of its arguments. This allows us to apply Monte Carlo tests to quite complicated problems, as we see next.

4.2.2 Markov chain Monte Carlo tests

In some applications of the exact conditional test, with P-value given by (4.4), the conditional probability calculation is difficult or impossible to do directly. The Monte Carlo test is in principle appropriate here, since the null distribution (given s) does not depend upon unknown parameters. A practical obstacle is that in complicated problems it may be difficult to simulate independent samples directly from that conditional null distribution. However, as we observed before, the Monte Carlo test only requires exchangeable samples. This opens up a new possibility, the use of Markov chain Monte Carlo simulation, in which only the unconditional null distribution is needed.
The basic idea is to represent data y = (y_1, ..., y_n) as the result of N steps of a Markov chain with some initial state x = (x_1, ..., x_n), and to


generate each y^* by an independent simulation of N steps with the same initial state x. If the Markov chain has equilibrium distribution equal to the null hypothesis distribution of Y = (Y_1, ..., Y_n), then y and the R replicates of y^* are exchangeable outcomes under H_0 and (4.11) applies.
Suppose that under H_0 the data have joint density f_0(y) for y \in \mathcal{S}, where both f_0 and \mathcal{S} are conditioned on sufficient statistic s if we are dealing with a conditional test. For simplicity suppose that \mathcal{S} has |\mathcal{S}| elements, which we now regard as possible states labelled (1, 2, ..., |\mathcal{S}|) of a Markov chain \{Z_t, t = ..., -1, 0, 1, ...\} in discrete time. Consider the data y to be one realization of Z_N. We then have to fix an appropriate value or state for Z_0, and with this initial state simulate the R independent values of Z_N which are the R values of Y^*. The Markov chain is defined so that f_0 is the equilibrium distribution, which can be enforced by appropriate choice of the one-step forward transition probability matrix Q, say, with elements

    q_{uv} = \Pr(Z_{t+1} = v \mid Z_t = u),    u, v \in \mathcal{S}.

For the moment suppose that Q is already known.
The first part of the simulation is to produce a value for Z_0. Starting from state y at time N, we simulate N backward steps of the Markov chain using the one-step backward transition probabilities

    \Pr(Z_t = u \mid Z_{t+1} = v) = f_0(u)q_{uv}/f_0(v).    (4.12)

Let the final state, the realized value of Z_0, be x. Note that if H_0 is true, so that y was indeed sampled from f_0, then \Pr(Z_0 = x) = f_0(x). In the second part of the simulation, which we repeat independently R times, we simulate N forward steps of the Markov chain, starting in state x and ending up in state y^* = (y_1^*, ..., y_n^*). Since under H_0 the chain starts in equilibrium,

    \Pr(Y^* = y^* \mid H_0) = \Pr(Z_N = y^*) = f_0(y^*).

That is, if H_0 is true, then the R replicates y_1^*, ..., y_R^* and data y are all sampled from f_0, as we require. Moreover, the R replicates of y^* are jointly exchangeable with the data under H_0. To see this, we have first that

    f(y, y_1^*, ..., y_R^* \mid H_0) = f_0(y) \sum_x \Pr(Z_0 = x \mid Z_N = y) \prod_{r=1}^R \Pr(Z_N = y_r^* \mid Z_0 = x),

using the independence of the replicate simulations from x. But by the definition of the first part of the simulation, where (4.12) applies,

    f_0(y)\Pr(Z_0 = x \mid Z_N = y) = f_0(x)\Pr(Z_N = y \mid Z_0 = x),


and so

    f(y, y_1^*, ..., y_R^* \mid H_0) = \sum_x f_0(x)\left\{ \Pr(Z_N = y \mid Z_0 = x) \prod_{r=1}^R \Pr(Z_N = y_r^* \mid Z_0 = x) \right\},

which is a symmetric function of y, y_1^*, ..., y_R^* as required. Given that the data vector and simulated data vectors are exchangeable under H_0, the associated test statistic values (t, t_1^*, ..., t_R^*) are also exchangeable outcomes under H_0. Therefore (4.11) applies for the P-value calculation.
To complete the description of the method, it remains to define the transition probability matrix Q so that the chain is irreducible with equilibrium distribution f_0(y). There are several ways to do this, all of which use ratios f_0(v)/f_0(u). For example, the Metropolis algorithm starts with a carrier Markov chain on state space \mathcal{S} having any symmetric one-step forward transition probability matrix M, and defines one-step forward transition from state u in the desired Markov chain as follows:

    given we are in state u, select state v with probability m_{uv};
    accept the transition to v with probability \min\{1, f_0(v)/f_0(u)\}, otherwise reject it and stay in state u.

It is easy to check that the induced Markov chain has transition probabilities

    q_{uv} = \min\{1, f_0(v)/f_0(u)\} m_{uv},    u \ne v,

and

    q_{uu} = m_{uu} + \sum_{v \ne u} \max\{0, 1 - f_0(v)/f_0(u)\} m_{uv},

and from this it follows that f_0 is indeed the equilibrium distribution of the Markov chain, as required. In applications it is not necessary to calculate the probabilities m_{uv} explicitly, although the symmetry and irreducibility of the carrier chain must be checked. If the matrix M is not symmetric, then the acceptance probability in the Metropolis algorithm must be modified to \min[1, f_0(v)m_{vu}/\{f_0(u)m_{uv}\}].
The crucial feature of the Markov chain method is that f_0 itself is not needed, only ratios f_0(v)/f_0(u) being involved. This means that for conditional tests, where f_0 is the conditional density for Y given S = s, only ratios of the unconditional null density for Y are needed:

    \frac{f_0(v)}{f_0(u)} = \frac{\Pr(Y = v \mid S = s, H_0)}{\Pr(Y = u \mid S = s, H_0)} = \frac{\Pr(Y = v \mid H_0)}{\Pr(Y = u \mid H_0)}.

This greatly simplifies many applications.
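To make the construction concrete, here is a minimal R sketch of a single Metropolis step for a chain on binary sequences with fixed sum, as used in Example 4.3 below; f0.ratio(v, u) is a hypothetical function returning f_0(v)/f_0(u) (identically 1 in that example):

    metropolis.step <- function(u, f0.ratio)
    { ij <- sample(length(u), 2)                  # carrier chain: random pair
      v <- u; v[ij] <- u[rev(ij)]                 # propose switching the pair
      if (runif(1) <= min(1, f0.ratio(v, u))) v   # accept the move...
      else u }                                    # ...or stay at u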
The realizations of the Markov chain are symmetrically tied to the artificial starting value x, and this induces a symmetric correlation among (t, t_1^*, ..., t_R^*).

This correlation depends upon the particular construction of Q, and reduces to zero at a rate which depends upon Q as N increases. While the correlation does not affect the validity of the P-value calculation, it does affect the power of the test: the higher the correlation, the lower the power.

Example 4.3 (Logistic regression)  We return to the problem of Example 4.1, which provides a very simple if artificial illustration. The data y are a binary sequence of length n with s ones, and calculations are to be conditional on \sum Y_j = s. Recall that direct Monte Carlo simulation is possible, since all \binom{n}{s} possible data sequences are equally likely under the null hypothesis of constant probability of a unit response.
One simple Markov chain has one-step transitions which select a pair of subscripts i, j at random, and switch y_i and y_j. Clearly the chain is irreducible, since one can progress from any one binary sequence with s ones to any other. All ratios of null probabilities f_0(v)/f_0(u) are equal to one, since all binary sequences with s ones are equally probable. Therefore if we run the Metropolis algorithm, all switches are accepted. But note that this Markov chain, while simple to implement, is inefficient and will require a large number of steps to induce approximate independence of the t^*s. The most effective Markov chain would have one-step transitions which are random permutations, and for this only one step would be required.

Example 4.4 (AML data)  For data such as those in Example 3.9, consider testing the null hypothesis of proportional hazard functions. Denote the failure times by z_1 < z_2 < \cdots < z_n, assuming no ties for the moment, and define r_{ij} to be the number in group i who were at risk just prior to z_j. Further, let y_j be 0 or 1 according as the failure at z_j is in group 1 or 2, and denote the hazard function at time z for group i by h_i(z). Then

    \Pr(Y_j = 1) = \frac{r_{2j}h_2(z_j)}{r_{1j}h_1(z_j) + r_{2j}h_2(z_j)} = \frac{\theta_j}{a_j + \theta_j},

where a_j = r_{1j}/r_{2j} and \theta_j = h_2(z_j)/h_1(z_j) for j = 1, ..., n. The null hypothesis of proportional hazards implies the hypothesis H_0 : \theta_1 = \cdots = \theta_n.
For the data of Example 3.9, where n = 18, the values of y and a are given in Table 4.2; one tie has been randomly split. Note that censored data contribute only to the rs: their times are not used.

Table 4.2  Ingredients of the conditional test for proportional hazards. Failure times as in Table 3.4; at time z = 23 the failure in group 2 is taken to occur first.

    z    5      5      8      8     9     12    13    18    23
    y    1      1      1      1     0      1     0     0     1
    a  11/12  11/11  11/10  11/9  11/8  10/8  10/7   8/6   7/6

    z   23    27    30    31    33    34    43    45    48
    y    0     1     1     0     1     0     1     1     0
    a   7/5   6/5   5/4   5/3   4/3   4/2   3/2   3/1    \infty

Of course the Y_j s are not independent, because a_j depends upon the outcomes of Y_1, ..., Y_{j-1}. However, for the purposes of illustration here we shall pretend that the a_j s are fixed, as well as the survival times and censoring times. That is, we shall treat the Y_j s as independent Bernoulli variables with probabilities as given above. Under this pretence the conditional likelihood is simply

    \prod_{j=1}^{18} \left( \frac{\theta_j}{a_j + \theta_j} \right)^{y_j} \left( \frac{a_j}{a_j + \theta_j} \right)^{1 - y_j}.

Note that because a_{18} = \infty, y_{18} must be 0 whatever the value of \theta_{18}, and so this final response is uninformative. We therefore drop y_{18} from the analysis. Having done this, we see that under H_0 the sufficient statistic for the common hazard ratio \theta is S = \sum_{j=1}^{17} Y_j, whose observed value is s = 11.
Whatever the test statistic T, the exact conditional P-value (4.4) must be approximated. Direct simulation appears impossible, but a simple Markov chain simulation is possible. First, the state space of the chain is 𝒮 = {x = (x_1, ..., x_n) : ∑ x_j = s}, that is, all permutations of y_1, ..., y_n. For any two vectors x and x′ in the state space, the ratio of null conditional joint probabilities is

    p(x′ | s, θ) / p(x | s, θ) = ∏_{j=1}^{n} a_j^{x_j − x′_j}.

We take the carrier Markov chain to have one-step transitions which are random permutations: this guarantees fast movement over the state space. A step which moves from x to x′ is then accepted with probability min{1, ∏_{j=1}^{n} a_j^{x_j − x′_j}}. By symmetry the reverse chain is defined in exactly the same way.

The test statistic must be chosen to match the particular alternative hypothesis thought relevant. Here we suppose that the alternative is a monotone ratio of hazards, for which T = ∑_j Y_j log(z_j) seems to be a reasonable choice. The Markov chain simulation is applied with N = 100 steps back to give the initial state x′, and 100 steps forward to state y*, the latter repeated R = 99 times. Of the resulting t* values, 48 are less than or equal to the observed value t = 17.75, so the P-value is (1 + 48)/(1 + 99) = 0.49. Thus there appears to be no evidence against the proportional hazards model.

Average acceptance probability in the Metropolis algorithm is approximately 0.7, and results for N = 10 and N = 1000 appear indistinguishable from those for N = 100. This indicates unusually fast convergence for applications of the Markov chain method.
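To make the sampler concrete, the following minimal Python sketch (ours, not the book's) implements the Metropolis scheme just described; the vectors y, a and z of Table 4.2, with the uninformative final case dropped so that all a_j are finite, are assumed to be supplied by the caller, and the function names are our own.

import numpy as np

rng = np.random.default_rng(1)

def chain_step(x, a):
    # Propose a random permutation of x; accept with probability
    # min{1, prod_j a_j^(x_j - x'_j)}, the ratio of null conditional
    # probabilities derived above.  All a_j must be finite.
    prop = rng.permutation(x)
    if rng.uniform() < np.prod(a ** (x - prop)):
        return prop
    return x

def forward_run(x0, a, N, tstat):
    x = x0.copy()
    for _ in range(N):
        x = chain_step(x, a)
    return tstat(x)

def parallel_method_pvalue(y, a, z, N=100, R=99):
    # Run N steps "back" from the data to an initial state (the same kernel
    # serves, by symmetry of the proposal), then R independent N-step
    # forward runs; compare t* with t as in the text.
    tstat = lambda x: np.sum(x * np.log(z))
    x0 = y.copy()
    for _ in range(N):
        x0 = chain_step(x0, a)
    t = tstat(y)
    tstar = np.array([forward_run(x0, a, N, tstat) for _ in range(R)])
    return (1 + np.sum(tstar <= t)) / (R + 1)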


The use of R conditionally independent realizations of the Markov chain is sometimes referred to as the parallel method. In contrast is the series method, where only one realization is used. Since the successive states of the chain are dependent, a randomization device is needed to induce exchangeability. For details see Problem 4.2.

4.2.3 Parametric bootstrap tests

In many problems of course the distribution of T under H_0 will depend upon nuisance parameters which cannot be conditioned away, so that the Monte Carlo test method does not apply exactly. Then the natural approach is to fit the null model F̂_0 and use (4.5) to compute the P-value, i.e. p = Pr(T ≥ t | F̂_0). For example, for the parametric model where we are testing H_0 : ψ = ψ_0 with λ a nuisance parameter, F̂_0 would be the CDF of f(y | ψ_0, λ̂_0), with λ̂_0 the maximum likelihood estimator (MLE) of the nuisance parameter when ψ is fixed equal to ψ_0. Calculation of the P-value by (4.5) is referred to as a bootstrap test.

If (4.5) cannot be computed exactly, or if there is no satisfactory approximation (normal or otherwise), then we proceed by simulation. That is, R independent replicate samples y*_1, ..., y*_n are drawn from F̂_0, and for the rth such sample the test statistic value t*_r is calculated. Then the significance probability (4.5) will be approximated by

    p_boot = {1 + #(t*_r ≥ t)} / (R + 1).    (4.13)

Ordinarily one would use a simple proportion here, but we have chosen to make the definition match that for the Monte Carlo test in (4.11).
Example 4.5 (Separate families test)  Suppose that we wish to choose between the alternative model forms f_0(y | η) and f_1(y | ζ) for the PDF of the random sample y_1, ..., y_n. In some circumstances it may make sense to take one model, say f_0, as a null hypothesis, and to test this against the other model as alternative hypothesis. In the notation of Section 4.1, the nuisance parameter is λ = (η, ζ) and ψ is the binary indicator of model, with null value ψ_0 = 0 and alternative value ψ_A = 1. The likelihood ratio statistic (4.7) is equivalent to the more convenient form

    T = n^{−1} log{L_1(ζ̂)/L_0(η̂)} = n^{−1} ∑_{j=1}^{n} log{ f_1(y_j | ζ̂) / f_0(y_j | η̂) },    (4.14)

where η̂ and ζ̂ are the MLEs and L_0 and L_1 the likelihoods under f_0 and f_1 respectively. If the two families are strictly separate, then the chi-squared approximation (4.8) does not apply. There is a normal approximation for the


null distribution of T, but this is often quite unreliable except for very large n. The parametric bootstrap provides a more reliable and simple option.

The parametric bootstrap works as follows. We generate R samples of size n by random sampling from the fitted null model f_0(y | η̂). For each sample we calculate estimates η̂* and ζ̂* by maximizing the simulated log likelihoods

    ℓ_0(η) = ∑_j log f_0(y*_j | η),    ℓ_1(ζ) = ∑_j log f_1(y*_j | ζ),

and compute the simulated log likelihood ratio statistic t* = n^{−1}{ℓ_1(ζ̂*) − ℓ_0(η̂*)}. Then we calculate p using (4.13).


As a particular illustration, consider the failure-time data in Table 1.2. Two plausible models for this type of data are gamma and lognormal, that is

    f_0(y | η) = κ(κy)^{κ−1} exp(−κy/μ) / {μ^κ Γ(κ)},    f_1(y | ζ) = (βy)^{−1} φ{(log y − α)/β},    y > 0,

with η = (μ, κ) and ζ = (α, β), where φ is the standard normal density. For these data the MLEs of the gamma mean and index are μ̂ = ȳ = 108.083 and κ̂ = 0.707, the latter being the solution to

    log(κ) − h(κ) = log(ȳ) − \overline{log y},

with h(κ) = d log Γ(κ)/dκ, the digamma function. (Here \overline{log y} and s²_{log y} denote the average and sample variance of the log y_j.) The MLEs of the mean and variance of the normal distribution for log Y are α̂ = \overline{log y} = 3.829 and β̂² = (n − 1)s²_{log y}/n = 2.339. The test statistic (4.14) is

    t = −κ̂ log(κ̂/ȳ) − κ̂α̂ + κ̂ + log Γ(κ̂) − ½ log(2πβ̂²),

whose value for the data is t = 0.465. The left panel of Figure 4.2 shows a histogram of R = 999 values of t* under sampling from the fitted gamma model: of these, 619 are greater than t and so p = 0.62.

Note that the histogram has a fairly non-normal shape in this case, suggesting that a normal approximation will not be very accurate. This is true also for the (rather complicated) studentized version Z of T: the right panel of Figure 4.2 shows the normal plot of bootstrap values z*. The observed value of z is 0.4954, for which the bootstrap P-value is 0.34, somewhat smaller than that computed for t, but not changing the conclusion that there is no evidence to change from a gamma to a lognormal model for these data. There are good general reasons to studentize test statistics; see Section 4.4.1.

It should perhaps be mentioned that significance tests of this kind are not always helpful in distinguishing between models, in the sense that we could find evidence against either both or neither of them. This is especially true with small samples such as we have here. In this case the reverse test shows no evidence against the lognormal model.
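As an illustration of the computations, the following Python sketch (ours, not the book's) carries out the gamma-versus-lognormal test for a supplied data vector y. The version of t computed here is n^{−1}{ℓ_1(ζ̂) − ℓ_0(η̂)}, which may differ from the displayed expression by a constant; any such constant is common to t and the t*_r and so leaves p unchanged.

import numpy as np
from scipy import optimize, special, stats

def fit_gamma(y):
    # MLE of (mu, kappa): mu = ybar, and kappa solves
    # log(kappa) - digamma(kappa) = log(ybar) - mean(log y)
    c = np.log(y.mean()) - np.log(y).mean()
    kappa = optimize.brentq(lambda k: np.log(k) - special.digamma(k) - c,
                            1e-4, 1e4)
    return y.mean(), kappa

def log_lik_ratio(y):
    # t = n^{-1} sum_j log{ f1(y_j | zeta-hat) / f0(y_j | eta-hat) }, cf. (4.14)
    mu, kappa = fit_gamma(y)
    alpha, beta = np.log(y).mean(), np.log(y).std()   # divisor n, as in the text
    l1 = stats.norm.logpdf(np.log(y), alpha, beta) - np.log(y)
    l0 = stats.gamma.logpdf(y, kappa, scale=mu / kappa)
    return np.mean(l1 - l0)

def separate_families_pvalue(y, R=999, seed=1):
    rng = np.random.default_rng(seed)
    mu, kappa = fit_gamma(y)
    t = log_lik_ratio(y)
    tstar = np.array([log_lik_ratio(rng.gamma(kappa, mu / kappa, y.size))
                      for _ in range(R)])
    return (1 + np.sum(tstar >= t)) / (R + 1)         # (4.13)

For the data of Table 1.2 this should reproduce κ̂ = 0.707 and a P-value close to that quoted above, up to Monte Carlo variation.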

Figure 4.2  Null hypothesis resampling for failure data. Left panel: histogram of t* under gamma sampling. Right panel: normal plot of z*; R = 999 and gamma parameters μ̂ = 108.0833, κ̂ = 0.7065; dotted line is theoretical N(0,1) approximation.

4.2.4 Graphical tests

Graphical methods are popular in model checking: examples include normal and half-normal plots of residuals in regression, plots of Cook distances in regression, plots of nonparametric hazard function estimates, and plots of intensity functions in spatial analysis (Section 8.3). In many cases the nominal shape of the plot is a straight line, which aids the detection of deviation from a null model. Whatever the situation, informed interpretation of the plot requires some notion of its probable variation under the model being checked, unless the sample size is so large that deviation is obvious (cf. the plot of resampling results in Figure 4.2). The simplest and most common approach is to superimpose a probable envelope, to which the original data plot is compared. This probable envelope is obtained by Monte Carlo or parametric resampling methods.

Graphical tests are not usually appropriate when a single specific alternative model is of interest. Rather they are used to suggest alternative models, depending upon the manner in which such a plot deviates from its null expected behaviour, or to find suspect data. (Indeed graphical tests are not tests in the usual sense, because there is usually no simple notion of rejectable behaviour: we comment more fully on this below.)

Suppose that the graph plots T(a) versus a for a ∈ 𝒜, a bounded set. The observed plot is {t(a) : a ∈ 𝒜}. For example, in a normal plot 𝒜 is a set of normal quantiles and the values of t(a) are the ordered values of a sample, possibly studentized. The idea of the plot is to compare t(a) with the probable behaviour of T(a) for all a ∈ 𝒜 when H_0 is true.
Figure 4.3  Normal plot of n = 13 studentized values for final sample in Table 3.1.

Example 4.6 (Normal plot)  Consider the data in Table 3.1, and suppose in

particular that we want to assess whether or not the last sample of n = 13 measurements can be assumed normal. A normal plot of the data is shown in Figure 4.3, which plots the ordered studentized values z_(i) = (y_(i) − ȳ)/s against the quantiles a_i = Φ^{−1}{i/(n + 1)} of the N(0, 1) distribution. In the general notation, 𝒜 is the set of normal quantiles, and t(a_i) = z_(i). The dotted line is the expected pattern, approximately, and the question is whether or not the points deviate sufficiently from this to suggest that the sample is non-normal.

Assume for the moment that the null hypothesis joint distribution of {T(a) : a ∈ 𝒜} involves no unknown nuisance parameters. This is true for a normal plot if we use studentized sample values z_i, as in the previous example. Then for any fixed a we can subject t(a) to a Monte Carlo test. For each of R independent sets of data y*_1, ..., y*_n, which are obtained by sampling from the null model, we compute the simulated plot

    t*(a),    a ∈ 𝒜.

Under the null hypothesis, T(a), T*_1(a), ..., T*_R(a) are independent and identically distributed for any fixed a, so that (4.9) applies with T = T(a). That is,

    Pr{T(a) ≤ T*_(k)(a) | H_0} = k/(R + 1).    (4.15)

This leads to (4.11) as the one-sided P-value at the given value of a, if large values of t(a) are evidence against the null model. There are obvious


modifications if we want to test for small values of t(a), or if we want a two-sided test.

The test as described applies for any single value of a. However, the graphical test does not look just at one fixed a, but rather at all a ∈ 𝒜 simultaneously. In principle the Monte Carlo test could be applied at all values of a ∈ 𝒜, but this would be time-consuming and difficult to interpret. To simplify matters, at each value of a we compute lower and upper critical values corresponding to fixed one-sided levels p, and plot these critical values against a to provide critical curves against which to compare the whole data plot {t(a), a ∈ 𝒜}. So the method is to choose integers R and k so that k/(R + 1) = p, the desired one-sided test level, and then compute the critical values

    t*_(k)(a),    t*_(R+1−k)(a)

from the R simulated plots. If t(a) exceeds the upper value, or falls below the lower value, then the corresponding one-sided P-value is at most p; the two-sided test which rejects H_0 if t(a) falls outside the interval [t*_(k)(a), t*_(R+1−k)(a)] has level equal to 2p. The set of all upper and lower critical values defines the test envelope

    ℰ^{1−2p} = {[t*_(k)(a), t*_(R+1−k)(a)] : a ∈ 𝒜}.    (4.16)

Excursions of t(a) outside ℰ^{1−2p} are regarded as evidence against H_0, and this simultaneous comparison across all values of a is what is usually meant by the graphical test.
Example 4.7 (Normal plot, continued)  For the normal plot of the previous example, suppose we set p = 0.05. The smallest simulation size that works is R = 19, and then we take k = 1 in (4.16). The test envelope will therefore be the lines connecting the maxima and the minima. Because we are plotting studentized sample values, which eliminates mean and variance parameters, the simulation can be done with the N(0, 1) distribution. Each simulated sample y*_1, ..., y*_13 is studentized to give z*_i = (y*_i − ȳ*)/s*, i = 1, ..., 13, whose ordered values are then plotted against the same normal quantiles a_i = Φ^{−1}{i/(n + 1)}. The left panel of Figure 4.4 shows a set of R = 19 normal plots (plotted as connecting dashed lines) and their envelope (solid curves) for studentized values of simulated samples of n = 13 N(0, 1) data. The right panel shows the envelope of these plots together with the original data plot. Note that one of the inner points falls just outside the envelope: this might be taken as mild evidence against normality of the data, but such an interpretation may be premature, in light of the discussion below.

The discussion so far assumes either that the null model involves no unknown parameters, or that it is possible to eliminate unknown parameters by standardization, as in the previous example. In the latter case simulated samples can be generated from any null model F_0. When unknown model parameters cannot be eliminated, we would simulate from F̂_0: then (4.15) will be approximately true provided n is not too small.

Figure 4.4  Graphical test of normality. Left panel: normal plots (dashed lines) of studentized values for R = 19 samples of n = 13 simulated from the N(0, 1) distribution, together with their envelope (solid line). Right panel: envelope of the simulated plots superimposed on the original data plot.
There are two aspects of the graphical test which need careful thought, namely the choice of R and the interpretation of the resulting plot. It seems clear from earlier discussion that for p = 0.05, say, R = 19 is too small: the test envelope is too random. R = 99 would seem to be a more sensible choice, provided this is not computationally difficult. But we should consider how formal is to be the interpretation of the graph. As it stands the notional one-sided significance levels p hold pointwise, and certainly the chance that the envelope captures an entire plot will be far less than 1 − 2p. So it would not make sense to infer evidence against the null model if one arbitrarily placed point falls outside the envelope, as happened in Example 4.7. In fact in that example the chance is about 0.5 that some point will fall outside the simulation envelope, in contrast to the pointwise chance 0.1.

For some purposes it will be useful to know the overall error rate, i.e. the chance of a point falling outside the envelope, or even to control this rate. While this is difficult to do exactly, there is a simple empirical approach which works satisfactorily. Given the R simulated plots which were used to calculate the test envelope, we can simulate the graphical test by comparing {t*_r(a), a ∈ 𝒜} to the envelope ℰ^{1−2p}_{−r} that is obtained from the other R − 1 simulated plots. If we repeat this simulated test for r = 1, ..., R, then we obtain a resample estimate of the overall two-sided error rate

    #{r : {t*_r(a), a ∈ 𝒜} exits ℰ^{1−2p}_{−r}} / R.    (4.17)

Figure 4.5  Normal plot of n = 13 studentized values for final sample in Table 3.1, together with simultaneous (solid lines) and pointwise (dashed lines) two-sided 0.10 test envelopes; R = 199.

This is easy to calculate, since {t*_r(a), a ∈ 𝒜} exits ℰ^{1−2p}_{−r} if and only if

    rank{t*_r(a)} ≤ k    or    rank{t*_r(a)} ≥ R + 1 − k

for at least one value of a, where as before k = p(R + 1). Thus if the R plots are represented by an R × N array, we first compute columnwise ranks. Then we calculate the proportion of rows in which either the minimum rank is less than or equal to k, or the maximum rank is greater than or equal to R + 1 − k, or both. The corresponding one-sided error rates are estimated in the obvious way.
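The envelope computation and the rank-based error-rate estimate (4.17) are easily coded. The Python sketch below (ours) does both for the normal-plot setting of Examples 4.6 and 4.7; the function name and defaults are illustrative.

import numpy as np
from scipy import stats

def studentized_order(y):
    return np.sort((y - y.mean()) / y.std(ddof=1))

def graphical_test(n=13, R=199, k=1, seed=1):
    # R simulated plots, stored as an R x n array (rows: plots, columns: a-values)
    rng = np.random.default_rng(seed)
    plots = np.array([studentized_order(rng.normal(size=n)) for _ in range(R)])
    order = np.sort(plots, axis=0)
    lower, upper = order[k - 1], order[R - k]   # t*_(k)(a) and t*_(R+1-k)(a)
    # empirical simultaneous two-sided error rate (4.17): a plot exits its
    # leave-one-out envelope iff some columnwise rank is <= k or >= R+1-k
    ranks = stats.rankdata(plots, axis=0)
    exits = (ranks.min(axis=1) <= k) | (ranks.max(axis=1) >= R + 1 - k)
    return lower, upper, exits.mean()

lower, upper, err = graphical_test()
print(err)   # close to 0.1 for R = 199, k = 1, as in Example 4.8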
Example 4.8 (Normal plot, continued)  For the normal plot of Example 4.6, an overall two-sided error rate of approximately 0.1 requires R = 199. Figure 4.5 shows a graphical test plot for R = 199 with outer envelope corresponding to overall two-sided error rate 0.1 and inner envelope corresponding to pointwise two-sided error rate 0.1; the empirical error rate (4.17) for the outer envelope is 0.10.

In practice one might rather be looking for trends, manifested by sequences of points going outside the test envelope. Alternatively one might be focusing attention on particular regions of the plot, such as the tails of a probability plot. Because such plots may be used to detect several possible deviations from a hypothetical model, and hence be interpreted in several possible ways, it is not possible to make a single recommendation that will induce a controlled error rate. In the absence of a single criterion by which the plot is to be judged, it seems wise to plot envelopes corresponding to both pointwise one-sided error rate p and simultaneous one-sided error rate p, say with p = 0.05. This is relatively easy to do using (4.17). For a further illustration see Example 8.9.


4.2.5 Choice of R

In any simulation-based test, relatively few samples could be used if it quickly became clear that p was so large as to not be regarded as evidence against H_0. For example, if the event t* ≥ t occurred 50 times in the first 100 samples, then it is reasonably certain that p will exceed 0.25, say, for much larger R, so there is little point in simulating further. On the other hand, if we observed t* ≥ t only five times, then it would be worth sampling further to more accurately determine the level of significance.

One effect of not computing p exactly is to weaken the power of the test, essentially because the critical region of a fixed-level test has been randomly displaced. The effect can be quantified approximately as follows. Consider testing at level α, which is to say reject H_0 if p ≤ α. If the integer k is chosen equal to (R + 1)α, then the test rejects H_0 when t*_(R+1−k) ≤ t. For the alternative hypothesis H_A, the power of the test is

    π_R(α, H_A) = Pr(reject H_0 | H_A) = Pr(T*_(R+1−k) ≤ T | H_A).

To evaluate this probability, suppose for simplicity that T has a continuous distribution, with PDF g_0(t) and CDF G_0(t) under H_0, and density g_A(t) under H_A. Then from the standard result for the PDF of an order statistic we have

    π_R(α, H_A) = ∫∫_{x ≤ t} R (R−1 choose k−1) {G_0(x)}^{R−k} g_0(x) {1 − G_0(x)}^{k−1} g_A(t) dx dt.

After a change of variable and some rearrangement of the integral, this becomes

    π_R(α, H_A) = ∫_0^1 π_∞(u, H_A) h_R(u; α) du,    (4.18)

where π_∞(u, H_A) is the power of the test using the exact P-value, and h_R(u; α) is the beta density on [0, 1] with indices (R + 1)α and (R + 1)(1 − α).

The next part of the calculation relies on π_∞(α, H_A) being a concave function of α, as is usually the case. Then a lower bound for π_∞(u, H_A) is π_min(u, H_A), which equals u π_∞(α, H_A)/α for u ≤ α and π_∞(α, H_A) for u > α. It follows by applying (4.18) to π_min(u, H_A), and some manipulation, that

    π_∞(α, H_A) − π_R(α, H_A) ≤ (2α)^{−1} π_∞(α, H_A) ∫_0^1 |u − α| h_R(u; α) du
        = π_∞(α, H_A) α^{(R+1)α} (1 − α)^{(R+1)(1−α)} Γ(R + 1) / {(R + 1)α Γ((R + 1)α) Γ((R + 1)(1 − α))}.

We apply Stirling's approximation Γ(x) ≈ (2π)^{1/2} x^{x−1/2} exp(−x) for large x to the right-hand side and obtain the approximate bound

    π_R(α, H_A) / π_∞(α, H_A) ≳ 1 − {(1 − α) / (2π(R + 1)α)}^{1/2}.


The following table gives some numerical values of this approximate bound.

    simulation size R           19     39     99    199    499    999   9999
    power ratio for α = 0.05   0.61   0.73   0.83   0.88   0.92   0.95   0.98
    power ratio for α = 0.01                 0.60   0.72   0.82   0.87   0.96

These values suggest that the loss of power with R = 99 is not serious for α ≥ 0.05, and that R = 999 should generally be safe. In fact the values can be quite conservative. For example, for testing a normal mean the power ratios for α = 0.05 are usually above 0.85 and 0.97 for R = 19 and R = 99 respectively.
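The approximate bound is trivial to evaluate numerically; this short Python sketch (ours) reproduces the table above, assuming the closed form just derived.

import numpy as np

def power_ratio_bound(R, alpha):
    # 1 - {(1 - alpha)/(2 pi (R+1) alpha)}^{1/2}, via Stirling's approximation
    return 1 - np.sqrt((1 - alpha) / (2 * np.pi * (R + 1) * alpha))

for alpha in (0.05, 0.01):
    print(alpha, [round(power_ratio_bound(R, alpha), 2)
                  for R in (19, 39, 99, 199, 499, 999, 9999)])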

4.3 Nonparametric Permutation Tests

In many practical situations it is useful to have available statistical methods which do not depend upon specific parametric models, if only in order to provide backup to results of parametric methods. So, with significance testing, it is useful to have nonparametric tests such as the sign test and the signed-rank test for analysing paired data, either to confirm the results of applying the parametric paired t test, or to deal with evident non-normality of the paired differences.

Nonparametric tests in general compute significance without assuming forms for the data distributions. The choice of test statistic will usually be based firmly on the physical context of the problem, possibly reinforced by what we know would be a good choice if a plausible parametric model were applicable. So, in a comparison of two treatments where we believe that treatment effects are additive, it would be reasonable to choose as test statistic the difference of means, especially if we thought that the data distributions were not far from normal; for long-tailed data distributions the difference of medians would be more reasonable from a statistical point of view. If we are concerned about the nonrobustness of means, then we might first convert data values to relative ranks and then use an appropriate rank test.

There is a vast literature on various kinds of nonparametric tests, such as rank tests, U-statistic tests, and distance tests which compare EDFs in various ways. We shall not attempt to review these here. Rather our concern in this chapter is with resampling tests, and the simplest form of nonparametric resampling test is the permutation test.

Essentially a permutation test is a comparative test, where the test statistic involves some sort of comparison between EDFs. The special feature of the permutation test is that the null hypothesis implies a reduction of the nonparametric MLE of the data distributions to EDFs which play the role of sufficient statistic S in equation (4.4). The conditional probability distribution

Figure 4.6  Scatter plot of n = 37 pairs of measurements in a study of handedness (provided by Dr Gordon Claridge, University of Oxford).

used in (4.4) is then a uniform distribution over a set of permutations of the data structure. The following example illustrates this.

Example 4.9 (Correlation test)  Suppose that Y = (U, X) is a random pair and that n such pairs are observed. The objective is to see if U and X are independent, this being the null hypothesis H_0. An illustrative dataset is plotted in Figure 4.6, where u = dnan is a genetic measure and x = hand is an integer measure of left-handedness. The alternative hypothesis is that x tends to be larger when u is larger. These data are clearly non-normal.

One simple test statistic is the sample correlation, T = ρ(F̂) say. Note that here the EDF F̂ puts probabilities n^{−1} on each of the n data pairs (u_i, x_i). The correlation is zero for any distribution that satisfies H_0. The correlation coefficient for the data in Figure 4.6 is 0.509.

When the form of F is unspecified, F̂ is minimal sufficient for F. Under the null hypothesis, however, the minimal sufficient statistic is comprised of the ordered u's and ordered x's, s = (u_(1), ..., u_(n), x_(1), ..., x_(n)), equivalent to the two marginal EDFs. So here a conditional test can be applied, with (4.4) defining the P-value, which will therefore be independent of the underlying marginal distributions of U and X. Now when S is constrained to equal s, the random sample (U_1, X_1), ..., (U_n, X_n) is equivalent to (u_(1), X*_1), ..., (u_(n), X*_n) with (X*_1, ..., X*_n) a random permutation of x_(1), ..., x_(n). Further, when H_0 is true all such permutations are equally likely, and there are n! of them. Therefore the one-sided P-value is

    p = #{permutations such that T* ≥ t} / n!.

Figure 4.7  Histogram of correlation t* values for R = 999 random permutations of data in Figure 4.6.

In evaluating p, we can use the fact that all marginal sample moments

are constant across permutations. This implies that T* ≥ t is equivalent to

    ∑_i u_i x*_i ≥ ∑_i u_i x_i.

As a practical matter, it is rarely possible or necessary to compute the permutation P-value exactly. Typically a very large number of permutations is involved, for example more than 3 million in Example 4.9 when n = 10. In special cases involving linear statistics there will be theoretical approximations, such as normal approximations or improved versions of these: see Section 9.5. But for general use the most reliable approach is to make use of the Monte Carlo method of Section 4.2.1. That is, we take a large number R of random permutations, calculate the corresponding values t*_1, ..., t*_R of T*, and approximate p by

    p_mc = {1 + #(t*_r ≥ t)} / (R + 1).

At least 99 and at most 999 random permutations should suffice.


Example 4.10 (Correlation test, ctd)  For the dataset shown in Figure 4.6, the test of Example 4.9 was implemented by simulation, that is, by generating random permutations of the x-values, with R = 999. Figure 4.7 is a histogram of the correlation values. The unshaded part corresponds to the 4 t* values which are greater than the observed correlation t = 0.509: the P-value is p = (1 + 4)/(1 + 999) = 0.005.
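In code the Monte Carlo permutation test is only a few lines. The Python sketch below (ours) applies to any paired sample (u, x); the function name is illustrative.

import numpy as np

def permutation_corr_test(u, x, R=999, seed=1):
    # One-sided Monte Carlo permutation test for positive association.
    # Permuting x leaves the marginal moments fixed, so comparing
    # correlations is equivalent to comparing sum(u x*) with sum(u x).
    rng = np.random.default_rng(seed)
    t = np.corrcoef(u, x)[0, 1]
    tstar = np.array([np.corrcoef(u, rng.permutation(x))[0, 1]
                      for _ in range(R)])
    return t, (1 + np.sum(tstar >= t)) / (R + 1)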

One feature of permutation tests is that any test statistic is as easy to use as any other, at least in principle. So in the previous example it is just as easy to use the rank correlation (in which the u's and x's are replaced by their relative ranks), a robust measure of correlation, or a complicated measure of distance between the bivariate EDF F̂ and its null hypothesis version F̂_0 which is the product of the EDFs of u and x. All that is required is that we be able to compute the test statistic for all permutations of the x's.

In the previous example the null hypothesis of independence led unambiguously to a sufficient statistic s and a permutation distribution. More generally the explicit null hypothesis may not be strong enough to do this, unless it can be taken to imply a stronger hypothesis. This depends upon the practical context, as we see in the following example.
Example 4.11 (Comparison of two means)  Suppose that we want to compare the means of two populations, given random samples from each which are denoted by (y_11, ..., y_{1n_1}) and (y_21, ..., y_{2n_2}). The explicit null hypothesis is H_0 : μ_1 = μ_2, where μ_1 and μ_2 are the means for the respective populations. Now H_0 alone does not reduce the sufficient statistics from the two sets of ordered sample values. However, suppose we believe that the CDFs F_1 and F_2 have either of the special forms

    F_1(y) = G(y − μ_1),    F_2(y) = G(y − μ_2),

or

    F_1(y) = G(y/μ_1),    F_2(y) = G(y/μ_2),

for some unknown G. Then the null hypothesis implies a common CDF F for the two populations. In this case, the null hypothesis sufficient statistic s is the set of order statistics for the pooled sample

    w_1 = y_11, ..., w_{n_1} = y_{1n_1}, w_{n_1+1} = y_21, ..., w_{n_1+n_2} = y_{2n_2},

that is, s = (w_(1), ..., w_(n_1+n_2)).

Situations where the special forms for F_1 and F_2 apply would include comparisons of two treatments which were both applied to a random selection of units from a common pool. The special forms would not necessarily apply to sets of physical measurements taken under different experimental conditions or using different apparatus, since then the samples could have unequal variability even though H_0 were true.

Suppose that we test H_0 by comparing the sample means using test statistic t = ȳ_2 − ȳ_1, and suppose that the one-sided alternative H_A : μ_2 > μ_1 is appropriate. If we assume that H_0 implies a common distribution for the Y_{1j} and Y_{2j}, then the exact significance probability is given by (4.4), i.e.

    p = Pr(T ≥ t | S = s, H_0).
Now when S is constrained to equal s, the concatenation of the two random samples (Y_11, ..., Y_{1n_1}, Y_21, ..., Y_{2n_2}) must form a permutation of s. The first n_1 components of a permutation will give the first sample and the last n_2 components will give the second sample. Further, when H_0 is true all such permutations are equally likely, and there are (n_1 + n_2)! of them. Therefore

    p = #{permutations such that T* ≥ t} / (n_1 + n_2)!.    (4.21)

As in the previous example, this exact probability would usually be approximated by taking R random permutations of the type described, and applying (4.11).

A somewhat more complicated two-sample test problem is provided by the following example.

Example 4.12 (AML data)  Figure 3.3 shows the product-limit estimates of the survivor function for times to remission of two groups of patients with acute myelogenous leukaemia (AML), with one of the groups receiving maintenance chemotherapy. Does this treatment make a difference to survival?

A common test for comparison of estimated survivor functions is based on the log-rank statistic, which compares the actual number of failures in group 1 with its expected value at each time a failure is observed, under the null hypothesis that the survival distributions of the two groups are equal. To be more explicit, suppose that we pool the two groups and obtain ordered failure times y_1 < ... < y_m, with m < n if there is censoring. Let f_1j and r_1j be the number of failures and the number at risk of failure in group 1 at time y_j, and similarly for group 2. Then the log-rank statistic is

    T = ∑_{j=1}^{m} (f_1j − m_1j) / {∑_{j=1}^{m} v_1j}^{1/2},

where

    m_1j = (f_1j + f_2j) r_1j / (r_1j + r_2j),
    v_1j = (f_1j + f_2j) r_1j r_2j (r_1j + r_2j − f_1j − f_2j) / {(r_1j + r_2j)² (r_1j + r_2j − 1)}

are the conditional mean and variance of the number in group 1 to fail at time y_j, given the values of f_1j + f_2j, r_1j and r_2j. For the AML data t = −1.84. Is this evidence that chemotherapy lengthens survival times?

For a suitable null distribution we simply treat the observations in the rows of Table 3.4 as a single group and permute them, effectively randomly allocating group labels to the observations. For each of R permutations, we recalculate t, obtaining t*_1, ..., t*_R. Figure 4.8 shows the t*_r plotted against order statistics from the N(0, 1) distribution, which is the asymptotic null distribution of T. The asymptotic P-value is 0.033, in reasonable agreement with the P-value 26/(999 + 1) = 0.026 from the permutation test.

Figure 4.8  Results of a Monte Carlo permutation test for differences between the survivor functions for the two groups of AML data, R = 499. The dashed horizontal line shows the observed value of the statistic, and values of t* that exceed it are hollow. The dotted line is the line x = y.

4.4 Nonparametric Bootstrap Tests

The permutation tests described in the previous section are special nonparametric resampling tests, in which resampling is done without replacement. In this section we discuss the direct application of nonparametric resampling methods, as introduced in Chapters 2 and 3. For tightly structured problems such as those in the previous section, this means resampling with replacement rather than without, which makes little difference. But bootstrap tests apply to a much wider class of testing problems.

The special nature of significance tests requires that probability calculations be done under a null hypothesis model. In this way the bootstrap calculations must differ from those in earlier chapters. For example, where in Chapter 2 we introduced the idea of resampling from the EDF F̂, now we must resample from a distribution F̂_0, say, which satisfies the relevant null hypothesis H_0. This has been illustrated already for parametric bootstrap tests in Section 4.2.

Once the null resampling distribution F̂_0 is decided, the basic bootstrap test will be to compute the P-value as

    p_boot = Pr*(T* ≥ t | F̂_0),

or to approximate this by

    p̂ = {1 + #(t*_r ≥ t)} / (R + 1),

using the results t*_1, ..., t*_R from R bootstrap samples.

Figure 4.9  Histogram of test statistic values t* = ȳ*_2 − ȳ*_1 from R = 999 resamples of the two samples in Example 4.13. The data value of the test statistic is t = 2.84.

Example 4.13 (Comparison of two means, continued)  Consider the last two series of measurements in Example 3.1, which are reproduced here labelled samples 1 and 2:

    sample 1   82  79  81  79  77  79  79  78  79  82  76  73  64
    sample 2   84  86  85  82  77  76  77  80  83  81  78  78  78

Suppose that we want to compare the corresponding population means, μ_1 and μ_2, say with test statistic t = ȳ_2 − ȳ_1. If, as seems plausible, the shapes of the underlying distributions are identical, then under H_0 : μ_2 = μ_1 the two distributions are the same. It would then be sensible to choose for F̂_0 the pooled EDF of the two samples. The resampling test will be the same as the permutation test of Example 4.11, except that random permutations will be replaced by random samples of size n_1 + n_2 = 26 drawn with replacement from the pooled data.

Figure 4.9 shows the results from applying this procedure to our two samples with R = 999. The unshaded area of the histogram corresponds to the 48 values of t* larger than the observed value t = 80.38 − 77.54 = 2.84. The one-sided P-value for alternative H_A : μ_2 > μ_1 is p = (48 + 1)/(999 + 1) = 0.049. Application of the permutation test gave the same result.

It is worth stressing again that because the resampling method is wholly computational, any sensible test statistic is as easy to use as any other. So here, if outliers were present, it would be just as easy, and perhaps more sensible, to choose t to be the difference of trimmed means.


The question is: do we gain or lose anything by assuming that the two distributions have the same shape?
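The pooled-EDF bootstrap version of the test is equally short in code. The Python sketch below (ours) uses the two samples above; any other statistic, such as a difference of trimmed means, could be substituted for stat.

import numpy as np

def pooled_bootstrap_test(y1, y2, stat=lambda a, b: b.mean() - a.mean(),
                          R=999, seed=1):
    # Resample n1 + n2 values with replacement from the pooled data,
    # split them into two pseudo-samples, and recompute the statistic.
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([y1, y2])
    n1, t = y1.size, stat(y1, y2)
    tstar = []
    for _ in range(R):
        star = rng.choice(pooled, size=pooled.size)   # replace=True by default
        tstar.append(stat(star[:n1], star[n1:]))
    return (1 + np.sum(np.array(tstar) >= t)) / (R + 1)

y1 = np.array([82, 79, 81, 79, 77, 79, 79, 78, 79, 82, 76, 73, 64.0])
y2 = np.array([84, 86, 85, 82, 77, 76, 77, 80, 83, 81, 78, 78, 78.0])
print(pooled_bootstrap_test(y1, y2))   # near 0.05, cf. p = 0.049 above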

The particular null fitted model used in the previous example was suggested in part by the permutation test, and is clearly not the only possibility. Indeed, a more reasonable null model in the context would be one which allowed different variances for the two populations sampled: an analogous model is used in Example 4.14 below. So in general there can be many candidates for null model in the nonparametric case, each corresponding to different restrictions imposed in addition to H_0. One must judge which is most appropriate on the basis of what makes sense in the practical context.

Semiparametric null models

If data are described by a semiparametric model, so that some features of underlying distributions are described by parameters, then it may be relatively easy to specify a null model. The following example illustrates this.
Example 4.14 (Comparison of several means)  For the gravity data in Example 3.2, one point that we might check before proceeding with an aggregate estimation is that the underlying means for all eight series are in fact the same. One plausible model for the data, as mentioned in Section 3.2, is

    y_ij = μ_i + σ_i ε_ij,    j = 1, ..., n_i,    i = 1, ..., 8,

where the ε_ij come from a single distribution G. The null hypothesis to be tested is H_0 : μ_1 = ... = μ_8, with general alternative. For this an appropriate test statistic is given by

    t = ∑_{i=1}^{8} w_i (ȳ_i − μ̂_0)²,    w_i = n_i / s_i²,

with μ̂_0 = ∑ w_i ȳ_i / ∑ w_i the null estimate of the common mean. (Here ȳ_i and s_i² are the average and sample variance for the ith series.) The null distribution of T would be approximately χ²_7 were it not for the effect of small sample sizes. So a bootstrap approach is sensible.
The null model fit includes μ̂_0 and the estimated variances

    v̂_i = (n_i − 1) s_i² / n_i + (ȳ_i − μ̂_0)².

The null model studentized residuals

    e_ij = (y_ij − μ̂_0) / {v̂_i − (∑_l w_l)^{−1}}^{1/2},

when plotted against normal quantiles, suggest mild non-normality. So, to be safe, we apply a nonparametric bootstrap. Datasets are simulated under the null model

    y*_ij = μ̂_0 + v̂_i^{1/2} ε*_ij,

with the ε*_ij randomly sampled from the pooled residuals {e_ij, i = 1, ..., 8, j = 1, ..., n_i}. For each such simulated dataset we calculate sample averages and variances, then weights, the pooled mean, and finally t*.

Table 4.3 contains a summary of the null model fit, from which we calculate μ̂_0 = 78.6 and t = 21.275.
A set of R = 999 bootstrap samples gave the histogram of t* values in the left panel of Figure 4.10. Only 29 values exceed t = 21.275, so p = 0.030. The right panel of the figure plots ordered t* values against quantiles of the χ²_7 approximation, which is off by a factor of about 1.24 and gives the distorted P-value 0.0034. A normal-error parametric bootstrap gives results very similar to the nonparametric bootstrap.

Table 4.3  Summary statistics for eight samples in gravity data, plus ingredients for significance test. The weighted mean is μ̂_0 = 78.6.

    i     ȳ_i     s_i²     v̂_i     w_i
    1    66.4    370.6    474.4   0.022
    2    89.9    233.9    339.9   0.047
    3    77.3    248.3    222.3   0.036
    4    81.4     68.8     67.8   0.116
    5    75.3     13.4     23.1   0.599
    6    78.9     34.1     31.1   0.323
    7    77.5     22.4     21.9   0.579
    8    80.4     11.3     13.5   1.155

Figure 4.10  Resampling results for comparison of the means of the eight series of gravity data. Left panel: histogram of R = 999 values of t* under nonparametric resampling from the null model with pooled studentized residuals; the unshaded area to the right of the observed value t = 21.275 gives p = 0.029. Right panel: ordered t* values versus χ²_7 quantiles; the dotted line is the theoretical approximation.
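A Python sketch of the whole procedure (ours, assuming the null-model formulas reconstructed above) for k samples supplied as a list of arrays:

import numpy as np

def equal_means_test(samples, R=999, seed=1):
    rng = np.random.default_rng(seed)

    def tstat(ys):
        ybar = np.array([y.mean() for y in ys])
        w = np.array([y.size / y.var(ddof=1) for y in ys])
        mu0 = np.sum(w * ybar) / w.sum()
        return np.sum(w * (ybar - mu0) ** 2), mu0, w

    t, mu0, w = tstat(samples)
    # null fitted variances and pooled studentized residuals
    v = [(y.size - 1) * y.var(ddof=1) / y.size + (y.mean() - mu0) ** 2
         for y in samples]
    resid = np.concatenate([(y - mu0) / np.sqrt(vi - 1 / w.sum())
                            for y, vi in zip(samples, v)])
    tstar = []
    for _ in range(R):
        ys = [mu0 + np.sqrt(vi) * rng.choice(resid, size=y.size)
              for y, vi in zip(samples, v)]
        tstar.append(tstat(ys)[0])
    return t, (1 + np.sum(np.array(tstar) >= t)) / (R + 1)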


Example 4.15 (Ratio test)  Suppose that, as in Example 1.2, each observation y is a pair (u, x), and that we are interested in the ratio of means θ = E(X)/E(U). In particular suppose that we wish to test the null hypothesis H_0 : θ = θ_0. This problem could arise in a variety of contexts, and the context would help to determine the relevant null model. For example, we might have a paired-comparison experiment where the multiplicative effect θ is to be tested. Here θ_0 would be 1, and the marginal distributions of U and X should be the same under H_0. One natural null model F̂_0 would then be the symmetrized EDF, i.e. the EDF of the expanded data (u_1, x_1), ..., (u_n, x_n), (x_1, u_1), ..., (x_n, u_n).

Fully nonparametric null models

In those few situations where the context of the problem does not help identify a suitable semiparametric null model, it is in principle possible to form a wholly nonparametric null model F̂_0. Here we look at one general way to do this.

Suppose the test involves k distributions F_1, ..., F_k for which the null hypothesis imposes a constraint, H_0 : τ(F_1, ..., F_k) = 0. Then we can obtain a null model by nonparametric maximum likelihood, or a similar method, by adding the constraint to the usual derivation of the EDFs as MLEs. To be specific, suppose that we force the estimates of F_1, ..., F_k to be supported on the corresponding sample values, as the EDFs are. Then the estimate for F_i will attach probabilities p_i = (p_i1, ..., p_{in_i}) to sample values y_i1, ..., y_{in_i}; the unconstrained EDF F̂_i corresponds to p̂_i = n_i^{−1}(1, ..., 1). Now measure the discrepancy between a possible F_i and the EDF F̂_i by d(p̂_i, p_i), say, such that the EDF probabilities p̂_i minimize this when no constraints other than ∑_{j=1}^{n_i} p_ij = 1 are imposed. Then a nonparametric null model is given by the probabilities which minimize the aggregate discrepancy subject to τ(F_1, ..., F_k) = 0. That is, the null model minimizes the Lagrange expression

    ∑_{i=1}^{k} d(p̂_i, p_i) − λ t(p_1, ..., p_k) − ∑_{i=1}^{k} α_i (∑_{j=1}^{n_i} p_ij − 1),    (4.22)

where t(p_1, ..., p_k) is a re-expression of the original constraint function τ(F_1, ..., F_k). We denote the solutions of this constrained minimization problem by p̂_i,0, i = 1, ..., k.

The choice of discrepancy function d(·, ·) that corresponds to maximum likelihood estimation is the aggregate information distance

    ∑_{i=1}^{k} ∑_{j=1}^{n_i} p̂_ij log(p̂_ij / p_ij),    p̂_ij = n_i^{−1},    (4.23)

and a useful alternative is the reverse information distance

    ∑_{i=1}^{k} ∑_{j=1}^{n_i} p_ij log(p_ij / p̂_ij).    (4.24)

Both are minimized by the set of EDFs when no constraints are imposed. The second measure has the advantage of automatically providing non-negative solutions. The following example illustrates the method and some of its implications.
Example 4.16 (Comparison of two means, continued)  For the two-sample problem considered in Examples 4.11 and 4.13, we apply (4.22) with the discrepancy measure (4.24). The null hypothesis constraint is that the two means are equal, that is ∑_j y_1j p_1j = μ_1 = μ_2 = ∑_j y_2j p_2j, so that (4.22) becomes

    ∑_{i=1}^{2} ∑_{j=1}^{n_i} p_ij log(n_i p_ij) − λ (∑_{j=1}^{n_1} y_1j p_1j − ∑_{j=1}^{n_2} y_2j p_2j) − ∑_{i=1}^{2} α_i (∑_{j=1}^{n_i} p_ij − 1).

Setting derivatives with respect to p_ij equal to zero gives the equations

    1 + log p_1j − α_1 − λ y_1j = 0,    1 + log p_2j − α_2 + λ y_2j = 0,

which together with the initial constraints give the solutions

    p_1j,0 = exp(λ y_1j) / ∑_{k=1}^{n_1} exp(λ y_1k),    p_2j,0 = exp(−λ y_2j) / ∑_{k=1}^{n_2} exp(−λ y_2k).    (4.25)

The specific value of λ is uniquely determined by the null hypothesis constraint, which becomes

    ∑_j y_1j exp(λ y_1j) / ∑_k exp(λ y_1k) = ∑_j y_2j exp(−λ y_2j) / ∑_k exp(−λ y_2k),

whose solution must be determined numerically. Distributions of the form (4.25) are usually called exponential tilts of the EDFs.

For our data λ̂ = 0.130. The resulting null model probabilities are shown in the left panel of Figure 4.11. The right panel will be discussed later.
Having determined these null probabilities, the bootstrap test algorithm is as follows.

Algorithm 4.1 (Tilted bootstrap two-sample comparison)

For r = 1, ..., R:

1. Generate (y*_11, ..., y*_{1n_1}) by randomly sampling n_1 times from (y_11, ..., y_{1n_1}) with weights (p̂_11,0, ..., p̂_{1n_1},0).
2. Generate (y*_21, ..., y*_{2n_2}) by randomly sampling n_2 times from (y_21, ..., y_{2n_2}) with weights (p̂_21,0, ..., p̂_{2n_2},0).
3. Calculate the test statistic t*_r = ȳ*_2 − ȳ*_1.

Calculate

    p̂ = {1 + #(t*_r ≥ t)} / (R + 1).
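Solving for λ and implementing Algorithm 4.1 takes little code. In the Python sketch below (ours), the tilted weights are computed in a numerically stable way, and the root-finding bracket is chosen for data on the scale of Example 4.13, for which this reproduces λ̂ = 0.130.

import numpy as np
from scipy import optimize

def tilt(y, lam):
    a = lam * y
    w = np.exp(a - a.max())          # stabilized exponential tilt weights (4.25)
    return w / w.sum()

def tilted_null(y1, y2):
    # lambda is fixed by equating the two tilted means, solved numerically;
    # the bracket (-1, 1) is an assumption suited to data on this scale
    gap = lambda lam: np.sum(y1 * tilt(y1, lam)) - np.sum(y2 * tilt(y2, -lam))
    lam = optimize.brentq(gap, -1.0, 1.0)
    return lam, tilt(y1, lam), tilt(y2, -lam)

def tilted_bootstrap_test(y1, y2, R=999, seed=1):
    rng = np.random.default_rng(seed)
    lam, p1, p2 = tilted_null(y1, y2)
    t = y2.mean() - y1.mean()
    tstar = np.array([rng.choice(y2, y2.size, p=p2).mean()
                      - rng.choice(y1, y1.size, p=p1).mean()
                      for _ in range(R)])
    return lam, (1 + np.sum(tstar >= t)) / (R + 1)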

Figure 4.11  Null distributions for comparison of two means. Left panel: null probability distributions p̂_1,0 (1) and p̂_2,0 (2) with equal means (λ̂ = 0.130); observations are marked +. Right panel: smooth densities corresponding to the null probability distributions for population 1 (dotted curve) and population 2 (dashed curve), and smooth density corresponding to the pooled EDF (solid curve).

Table 4.4  Resampling P-values for one-sided comparison of two means. The entries are explained in Examples 4.11, 4.13, 4.16, 4.19 and 4.20.

    Null model          Statistic   P-value
    pooled EDF          t and z     0.045
    null variances      t           0.053
    exponential tilt    t           0.006
                        z           0.025
    MLE                 t           0.019
                        z           0.017
    (pivot)             z           0.015

Numerical results for R = 999 are given in Table 4.4 in the line labelled 'exponential tilt, t'. Results for other resampling tests are also given for comparison: z refers to a studentized version of t, 'MLE' refers to use of constrained maximum likelihood (see Problem 4.8), and 'null variances' refers to the semiparametric method of Example 4.14. Clearly the choice of null model can have a strong effect on the P-value, as one might expect. The studentized test statistics z are discussed in Section 4.4.1.

The method as illustrated here has strong similarity to the use of empirical likelihood methods, as described in Chapter 10. In practice it seems wise to

check the null model produced by the method, since resulting P-values are generally sensitive to the model. Thus, in the previous example, we should look at Figure 4.11 to see if it makes practical sense. The smoothed versions of the null distributions in the right panel, which are obtained by kernel smoothing, are perhaps easier to interpret. One might well judge in this case that the two null distributions are more different than seems plausible. Despite this reservation about this example, the general method is a valuable tool to have in case of need.

There are, of course, situations where even this quite general approach will not work. Nevertheless the basic idea behind the approach can still be applied, as the following examples show.
Example 4.17 (Test for unimodality)  One of the difficulties with nonparametric curve estimation is knowing whether particular features are real. For example, suppose that we compute a density estimate f̂(y) and find that it has two modes. How do we tell if the minor mode is real? Bootstrap methods can be helpful in such problems. Suppose that a kernel density estimate is used, so that

    f̂(y; h) = (nh)^{−1} ∑_{j=1}^{n} φ{(y − y_j)/h},    (4.26)

where φ is the standard normal density. It is possible to show that the number of modes of f̂(·; h) decreases as h increases. So one way to test unimodality is to see if an unusually large h is needed to make f̂ unimodal. This suggests that we take as test statistic

    t = min{h : f̂(y; h) is unimodal}.

A natural candidate for the null sampling distribution is f̂(y; t), since this is the least smoothed version of the EDF which satisfies the null hypothesis of unimodality. By the convolution property of f̂, random sample values from f̂(y; t) are given by

    y*_j = y_{I_j} + t ε_j,    (4.27)

where the ε_j are independent N(0, 1) variates and the I_j are random integers from {1, 2, ..., n}. On general grounds it seems wise to modify f̂ so as to have its first two moments agree with the data (Problem 3.8), but this modification would have no effect here.

For any such sample y*_1, ..., y*_n generated from the null distribution, we can check whether or not t* > t by checking whether or not the particular density estimate f̂*(y; t) is unimodal.
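The test is easy to implement by bisection on h. The Python sketch below (ours) counts modes of (4.26) on a grid, which is adequate for illustration; grid size and bracketing values are assumptions.

import numpy as np

def n_modes(y, h, npts=512):
    # count local maxima of the kernel estimate (4.26) on a grid
    g = np.linspace(y.min() - 3 * h, y.max() + 3 * h, npts)
    dens = np.exp(-0.5 * ((g[:, None] - y[None, :]) / h) ** 2).sum(axis=1)
    d = np.diff(dens)
    return int(np.sum((d[:-1] > 0) & (d[1:] <= 0)))

def critical_bandwidth(y, hlo=1e-3, hhi=None, tol=1e-4):
    # bisect for t = min{h : estimate is unimodal}; hhi is assumed
    # large enough that the estimate is unimodal there
    hhi = (y.max() - y.min()) if hhi is None else hhi
    while hhi - hlo > tol:
        mid = 0.5 * (hlo + hhi)
        hlo, hhi = (mid, hhi) if n_modes(y, mid) > 1 else (hlo, mid)
    return hhi

def unimodality_pvalue(y, R=199, seed=1):
    rng = np.random.default_rng(seed)
    t = critical_bandwidth(y)
    exceed = 0
    for _ in range(R):
        ystar = rng.choice(y, y.size) + t * rng.normal(size=y.size)  # (4.27)
        exceed += n_modes(ystar, t) > 1     # t* > t iff f*(.; t) is multimodal
    return (1 + exceed) / (R + 1)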

The next example applies a variation of this test.


Table 4.5  Perpendicular distances (miles) from an aerial line transect to schools of Southern Bluefin Tuna in the Great Australian Bight (Chen, 1996).

    0.19   0.28   0.29   0.45   0.64   0.65   0.78   0.85
    1.00   1.16   1.17   1.29   1.31   1.34   1.55   1.60
    1.83   1.91   1.97   2.05   2.10   2.17   2.28   2.41
    2.46   2.51   2.89   2.89   2.90   2.92   3.03   3.19
    3.48   3.79   3.83   3.94   3.95   4.11   4.14   4.19
    4.36   4.53   4.97   5.02   5.13   5.75   6.03   6.19
    6.19   6.45   7.13   7.35   7.77   7.80   8.81   9.22
    9.29   9.78  10.15  11.32  13.21  13.27  14.39  16.26

Example 4.18 (Tuna density estimate)  One method for estimating the abundance of a species in a region is to traverse a straight line of length L through the region, and to record the perpendicular distances from the line to positions where there are sightings. If there are n independent sightings and their (unsigned) distances y_1, ..., y_n are presumed to have PDF f(y), y > 0, the abundance density can be estimated by n f̂(0)/(2L), where f̂(0) is an estimate of the density at distance y = 0. The PDF f(y) is proportional to a detection function that is assumed to decline monotonically with increasing distance, with non-monotonic decline suggesting that the assumptions that underlie line transect sampling must be questioned.

Table 4.5 gives data from an aerial survey of schools of Southern Bluefin Tuna in the Great Australian Bight. Figure 4.12 shows a histogram of the data. The figure also shows kernel density estimates

    f̂(y; h) = (nh)^{−1} ∑_{j=1}^{n} [φ{(y − y_j)/h} + φ{(y + y_j)/h}],    y ≥ 0,    (4.28)

with h = 0.75, 1.5125, and 3. This seemingly unusual density estimate is used because the probability of detection, and hence the distribution of signed distances, should be symmetric about the transect. The estimate is obtained by first calculating the EDF of the reflected distances ±y_1, ..., ±y_n, then applying the kernel smoother, and finally folding the result about the origin.

Although the estimated density falls monotonically for h greater than 1.5125, the estimate for smaller values suggests non-monotonic decline. Since we consider f̂(y; h) for positive values of y only, we are interested in whether the underlying density falls monotonically or not. We take the smallest h such that f̂(y; h) is unimodal to be the value of our test statistic t. This corresponds to monotonic decline of f̂(y; h) for y > 0, giving no modes for y > 0. The observed value of the test statistic is t = 1.5125, and we are interested in the significance probability

    Pr(T ≥ t | F̂_0),

for data arising from F̂_0, an estimate of F that satisfies the null hypothesis of

Figure 4.12  Histogram of the tuna data, and kernel density estimates (4.28) with bandwidths h = 1.5125 (solid), 0.75 (dashes), and 3 (dots).

monotone decline but is otherwise as close to the data as possible. That is, the null model is f̂(y; t).

To generate replicate datasets from the null model we use the convolution property of (4.28), which implies

    y*_j = |±y_{I_j} + t ε_j|,    j = 1, ..., n,

where the signs ± are assigned randomly, the I_j are random integers from {1, 2, ..., n}, and the ε_j are independent N(0, 1) variates; cf. (4.27). The kernel density estimate based on the y*_j is f̂*(y; h). We now calculate the test statistic as outlined in the previous example, and repeat the process R = 999 times to obtain an approximate significance probability. We restrict the hunt for modes to 0 < y < 10, because it does not seem sensible to use so small a smoothing parameter in the density tails.

When the simulations were performed for these data, the frequencies of the number of modes of f̂*(y; t) for 0 < y < 10 were as follows.

    Modes        0     1    2   3
    Frequency   536   411   50  2

Like the fitted null distribution, a replicate where the full f̂*(y; t) is unimodal will have no modes for y > 0. If we assume that the event t* = t is impossible, bootstrap datasets with no modes have t* < t, so the significance probability is (411 + 50 + 2 + 1)/(999 + 1) = 0.464. There is no evidence against monotonic decline, giving no cause to doubt the assumptions underlying line transect methods.


4.4.1 Studentized bootstrap method

For testing problems which involve parameter values, it is possible to obtain more stable significance tests by studentizing comparisons. One version of this is analogous to calculating a 1 − p confidence set by the studentized bootstrap method (Section 5.2.1), and concluding that the P-value is less than p if the null hypothesis parameter value is outside the confidence set. Section 4.1 outlined the application of this idea. Here we describe two possible resampling implementations.

For simplicity suppose first that θ is a scalar with estimator T, and that we want to test H_0 : θ = θ_0 versus H_A : θ > θ_0. The method suggested in Section 4.1 applies when

    Z = (T − θ) / V^{1/2}

is approximately a pivot, meaning that its distribution is approximately independent of unknown parameters. Then, with z_0 = (t − θ_0)/v^{1/2} denoting the observed studentized test statistic, the resampling analogue of (4.6) is

    p = Pr*(Z* ≥ z_0 | F̂),    (4.29)

which we can approximate by simulation without having to decide on a null model F̂_0. The usual choice for v would be the nonparametric delta method estimate v_L of Section 2.7.2. The theoretical support for the use of Z is given in Section 5.4; in certain cases it will be advantageous to studentize a transformed estimate (Sections 5.2.2 and 5.7). In practice it would be appropriate to check on whether or not Z is approximately pivotal, using techniques described in Section 3.10.

Applications of this method are described in Sections 6.2.5 and 6.3.2. The modifications for the other one-sided alternative and for the two-sided alternative are simply p = Pr*(Z* ≤ z_0 | F̂) and p = Pr*(Z*² ≥ z_0² | F̂).
Example 4.19 (Comparison of two means, continued)  For the application considered in Examples 4.11, 4.13 and 4.16, where we compared two means using t = ȳ_2 − ȳ_1, it would be reasonable to suppose that the usual two-sample t statistic

    Z = {Ȳ_2 − Ȳ_1 − (μ_2 − μ_1)} / (S_2²/n_2 + S_1²/n_1)^{1/2}

is approximately pivotal. Here F̂ in (4.29) represents the EDFs of the two samples, given that no assumptions are made connecting the two distributions. We calculate the observed value of the test statistic,

    z_0 = (ȳ_2 − ȳ_1) / (s_2²/n_2 + s_1²/n_1)^{1/2},

whose value for these data is 2.846/1.610 = 1.768. Then R values of

    z* = {ȳ*_2 − ȳ*_1 − (ȳ_2 − ȳ_1)} / (s_2*²/n_2 + s_1*²/n_1)^{1/2}

are generated, with each simulated dataset containing n_1 values sampled with replacement from sample 1 and n_2 values sampled with replacement from sample 2.

In R = 999 simulations we found 14 values in excess of 1.768, so the P-value is 0.015. This is entered in Table 4.4 in the row labelled '(pivot)'.
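The pivot method needs no null model at all: each sample is resampled from its own EDF. A minimal Python sketch (ours, with an illustrative function name):

import numpy as np

def studentized_pivot_test(y1, y2, R=999, seed=1):
    rng = np.random.default_rng(seed)
    se = lambda a, b: np.sqrt(b.var(ddof=1) / b.size + a.var(ddof=1) / a.size)
    t = y2.mean() - y1.mean()
    z0 = t / se(y1, y2)
    zstar = []
    for _ in range(R):
        b1 = rng.choice(y1, y1.size)        # resample each sample separately
        b2 = rng.choice(y2, y2.size)
        zstar.append((b2.mean() - b1.mean() - t) / se(b1, b2))
    return z0, (1 + np.sum(np.array(zstar) >= z0)) / (R + 1)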

If θ is a vector with estimator T, and the null hypothesis is simple, H_0 : θ = θ_0, with general alternative H_A : θ ≠ θ_0, then the analogous pivot is

    Q = (T − θ)ᵀ V^{−1} (T − θ),

with observed test statistic value

    q_0 = (t − θ_0)ᵀ v^{−1} (t − θ_0).

Again v_L is a standard choice for v, and again it may be beneficial first to transform T (Section 5.8). Test statistics for more complicated alternatives can be defined similarly; see Problem 4.10.

Studentized test statistics can also be used when Z or Q is not a pivot. The definitions will be slightly different,

    Z_0 = (T − θ_0) / V_0^{1/2}    (4.30)

for the scalar case and

    Q_0 = (T − θ_0)ᵀ V_0^{−1} (T − θ_0)

for the vector case, where V_0 is an estimated variance under the null model. If Z_0 is used the bootstrap P-value will simply be

    p = Pr*(Z_0* ≥ z_0 | F̂_0),    (4.31)

with the obvious changes for a test based on Q_0. Even though the statistic is not pivotal, its use is likely to reduce the effects of nuisance parameters, and to give a P-value that is more nearly uniformly distributed under the null hypothesis than that calculated from T alone.
Example 4.20 (Comparison of two means, continued)  In Table 4.4 all the entries for z, except for the row labelled '(pivot)', were obtained using (4.30) with t = ȳ_2 − ȳ_1 and v_0 depending on the null model. For example, for the null models discussed in Example 4.16,

    v_0 = ∑_{i=1}^{2} n_i^{−1} ∑_{j=1}^{n_i} (y_ij − μ̂_i,0)² p̂_ij,0,

where μ̂_i,0 = ∑_{j=1}^{n_i} y_ij p̂_ij,0. For the two samples in question, under the exponential tilt null model both means equal 79.17 and v_0 = 1.195, the latter differing considerably from the variance estimate 2.59 used in the pivot method (Example 4.19).

The associated P-values computed from (4.31) are shown in Table 4.4 for all null models. These P-values are less dependent upon the particular model than those obtained with t unstudentized.

4.4.2 Conditional bootstrap tests

In parametric testing, conditioning plays an important role both in eliminating nuisance parameters and in fixing the information content of the data. In nonparametric testing the situation is less clear, because of the absence of a full model. Some aspects of conditioning are illustrated in Examples 5.16 and 5.17.

One simple example which does illustrate the possibility and effect of conditioning is the nonparametric bootstrap test for independence. In Example 4.9 we described an exact permutation test for this problem. The analogous bootstrap test would set the null model F̂_0 to be the product of the marginal EDFs. Simulation under this model is equivalent to creating x*'s by random sampling with replacement from the x's, and independently creating u*'s by random sampling with replacement from the u's. However, we could view the marginal CDFs G and H as nuisance parameters and attempt to remove them from the analysis by conditioning on Ĝ* = Ĝ and Ĥ* = Ĥ. This turns out to be exactly equivalent to using the permutation test, which does indeed completely eliminate G and H.

Adaptive tests

Conditioning occurs in a somewhat different way in the adaptive choice of test statistic. Suppose that we have possible test statistics T_1, ..., T_k for which efficiency measures can be defined and estimated by e_1, ..., e_k: for example, if the T_i are alternative estimators for a scalar parameter θ and H_0 concerns θ, then e_i might be the reciprocal of the estimated variance of T_i. The idea of the adaptive test is to use that T_i which is estimated to be most efficient for the observed data, and to condition on this fact.

We first partition the set 𝒴 of all possible null model resamples (y*_1, ..., y*_n) into 𝒴_1, ..., 𝒴_k such that

    𝒴_i = {(y*_1, ..., y*_n) : e*_i = max_{1≤j≤k} e*_j}.

Then if y_1, ..., y_n is in 𝒴_i, so that t_i is preferred, the adaptive test computes the P-value as

    p = Pr*(T*_i ≥ t_i | (y*_1, ..., y*_n) ∈ 𝒴_i).

For an example of this, see Problem 4.13. In the case of exact tests, such as permutation tests, the adaptive test is also exact.

4.4.3 Multiple testing

In some applications multiple tests of a hypothesis are based on a single set of data. This happens, for example, when pairwise comparisons of means are carried out for a several-sample analysis where the null hypothesis is equality of all means. In such situations the smallest of all test P-values is used, and it is clearly incorrect to interpret this smallest value in the usual way. Bootstrapping can be used to find the true significance level of the smallest P-value, as follows. Departing from our general notation, suppose that the test statistics are S_1, \ldots, S_k, with observed values s_1, \ldots, s_k, and that the null distribution of S_i is known to be G_i(\cdot). Then the observed significance levels are 1 - G_i(s_i). The incorrect procedure would be to treat the smallest P-value \min\{1 - G_1(s_1), \ldots, 1 - G_k(s_k)\} as uniform on the interval [0, 1]. If the tests were exact and independent, the corresponding random variable would have distribution 1 - (1-p)^k on [0, 1], but in general we should take into account their (unknown) dependence. We can allow for the multiple testing by taking t = \min\{1 - G_1(s_1), \ldots, 1 - G_k(s_k)\} to be the test statistic, and then the procedure is as follows. We generate data from the null hypothesis distribution, calculate the bootstrap statistics s_1^*, \ldots, s_k^*, and then take t^* = \min\{1 - G_1(s_1^*), \ldots, 1 - G_k(s_k^*)\}. We repeat this R times to get t_1^*, \ldots, t_R^*, and then obtain the P-value in the usual way. Notice that if all the G_i(\cdot) equal G(\cdot), say, the test is tantamount to bootstrapping t = \max(s_1, \ldots, s_k), and then G(\cdot) need not be known. If the G_i(\cdot) are unequal, the procedure requires them to be known, in order to put the test statistics on a scale where they can be compared. If the G_i(\cdot) are unknown, they can be estimated, but then a nested bootstrap (Section 3.9) is needed to obtain the P-value. The algorithm is the following.
Algorithm 4.2 (Multiple testing)

For r = 1, \ldots, R:

1 Generate y_1^*, \ldots, y_n^* independently from the fitted null distribution \hat F_0, and from them calculate s_1^*, \ldots, s_k^*.
2 Fit the null distribution \hat F_0^* to y_1^*, \ldots, y_n^*.
3 For m = 1, \ldots, M, generate y_1^{**}, \ldots, y_n^{**} independently from the fitted null distribution \hat F_0^*, and from them calculate s_1^{**}, \ldots, s_k^{**}.
4 Calculate

t_r^* = \min_{1 \le i \le k} \frac{1 + \#\{ s_{im}^{**} \ge s_{ir}^* \}}{M + 1}.

Finally, calculate p = (1 + \#\{ t_r^* \le t \})/(R + 1).


The procedure is analogous to that used in Section 4.5, but in this case adjustment would require three levels of nested bootstrapping.
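When the G_i are known, only a single level of simulation is needed. The following minimal sketch, written in the style of the practicals (it is not from the original text), shows the minimum-P-value calculation; gen.null (a data generator under the fitted null model), the list stats of k test statistic functions, and the list G of their null CDFs are hypothetical placeholders to be supplied for the application at hand.

min.p.test <- function(y, gen.null, stats, G, R = 999)
{ k <- length(stats)
  # t = min over i of the observed significance levels 1 - G_i(s_i)
  tval <- function(d)
    min(sapply(1:k, function(i) 1 - G[[i]](stats[[i]](d))))
  t0 <- tval(y)
  t.star <- replicate(R, tval(gen.null(y)))
  # small t is the evidence against the null hypothesis
  (1 + sum(t.star <= t0))/(R + 1) }

Because all k statistics are recomputed from the same simulated dataset, their dependence is captured automatically.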

4.5 Adjusted P-values

So far we have described tests which compute P-values as p = \Pr^*(T^* \ge t \mid \hat F_0), with \hat F_0 the working null sampling model. Ideally P should be uniformly distributed on [0,1] if the usual error rate interpretation is to be valid. This will be exactly or approximately correct for permutation and permutation-like bootstrap tests, but for other tests it can be far from correct. Preventive measures we can take are to transform t or studentize it, or both. However, these are not guaranteed to work. Here we describe a general method of adjustment, simple in principle but potentially very computer-intensive.
The idea behind adjusting P-values is simply to treat p as the observed test statistic: it is after all just a transformation of t. We estimate the distribution of the corresponding random variable P by resampling under the null model, of course. Since small values of p are of interest, the adjusted P-value is defined by

p_{adj} = \Pr^*( P^* \le p \mid \hat F_0 ),    (4.32)

where p is the observed P-value defined above. This requires bootstrapping the algorithm for computing P-values, another instance of increasing the accuracy of a bootstrap method by bootstrapping it, an idea introduced in Section 3.9.
The problem can be explained theoretically in either of two ways, perturbing the critical value of t for a fixed nominal error rate a, or adjusting for the bias in the P-value. We take the second approach, and since we are dealing with statistical error rather than simulation error (Section 2.5), we ignore the latter.
The P-value computed for the data is written p_0(\hat F), where the function p_0(\cdot) depends on the method used to obtain \hat F_0 from \hat F. When the null hypothesis is true, suppose that the particular null distribution F_0 obtains. Then the null distribution function for the P-value is

G(u, F_0) = \Pr\{ p_0(\hat F) \le u \mid F_0 \},    (4.33)

which with u = a is the true error rate corresponding to nominal error rate a. Now (4.33) implies that

\Pr\{ G(p_0(\hat F), F_0) \le a \mid F_0 \} = a,

and so G(p_0(\hat F), F_0) would be the ideal adjusted P-value, having actual error rate equal to the nominal error rate. Next notice that by substituting \hat F_0 for F_0 in (4.33) we can estimate G(u, F_0) by

\Pr^*\{ p_0(\hat F^*) \le u \mid \hat F_0 \}.


Finally, setting u = p_0(\hat F) we obtain

G(p_0(\hat F), \hat F_0) = \Pr^*\{ p_0(\hat F^*) \le p_0(\hat F) \mid \hat F_0 \}.    (4.34)

This we define to be the adjusted P-value, so when p_0(\hat F) = p,

p_{adj} = \Pr^*\{ p_0(\hat F^*) \le p \mid \hat F_0 \},

which is a more precise version of (4.32).
One must be careful to interpret P^* = p_0(\hat F^*) properly in (4.34). Since the outer probability relates to sampling from \hat F_0, \hat F^* in (4.34) denotes the EDF of a sample drawn from \hat F_0.

The adjusted P-value can be applied to advantage in both parametric and nonparametric testing, the key point being that it is more nearly uniformly distributed than the unadjusted P-value. Before discussing simulation implementation of the adjustment, we look at a simple example which illustrates the basic method.
Example 4.21 (Comparison of exponential means) Suppose that x_1, \ldots, x_m and y_1, \ldots, y_n are respectively random samples from exponential distributions with means \mu_1 and \mu_2, and that we wish to test H_0 : \mu_1 = \mu_2. For this problem there is an exact test based on U = \bar X / \bar Y, but we consider instead the test statistic T = \bar X - \bar Y, for which we show that the adjusted P-value automatically produces the P-value for the exact test.
For the parametric bootstrap test the null model sets the two sampling distributions equal to a common fitted exponential distribution with pooled mean

\bar v = \frac{m \bar x + n \bar y}{m + n}.

If \bar X^* and \bar Y^* denote averages of random samples of sizes m and n respectively from this exponential distribution, then the bootstrap P-value is p = \Pr^*(\bar X^* - \bar Y^* \ge \bar x - \bar y). This can be rewritten as

p = \Pr\left\{ \frac{G_m}{m} - \frac{G_n}{n} \ge \frac{(m+n)(u-1)}{mu+n} \right\},    (4.35)

where u = \bar x / \bar y, and G_m and G_n are independent gamma random variables with indices m and n respectively and unit scale parameters.
The bootstrap P-value (4.35) does not have a uniform distribution under the null hypothesis, so P = p does not correspond to error rate p. This is fully corrected using the adjustment (4.34). To see this, write (4.35) as p = h(u), so that p_0(\hat F^*) equals

\Pr^{**}( T^{**} \ge T^* \mid \hat F_0^* ) = h(U^*),

where U^* = \bar X^* / \bar Y^*. Since h(\cdot) is decreasing, it follows that

p_{adj} = \Pr^*\{ h(U^*) \le h(u) \mid \bar x, \bar y \} = \Pr^*( U^* \ge u \mid \bar x, \bar y ) = \Pr( F_{2m,2n} \ge u ),

which is the P-value of the exact test. Therefore p_{adj} is exactly uniform and the adjustment is perfectly successful.

In the previous example, the same result for p_{adj} would be achieved if the bootstrap distribution of T were replaced by a normal approximation. This might suggest that bootstrap calculation of p could be replaced by a rough theoretical approximation, thus removing one level of bootstrap sampling from calculation of p_{adj}. Unfortunately this is not always true, as is clear from the fact that if an approximate null distribution of T is used which does not depend upon \hat F at all, then p_{adj} is just the ordinary bootstrap P-value.
In most applications it will be necessary to use simulation to approximate the adjusted P-value (4.34). Suppose that we have drawn R resamples from the null model \hat F_0, with corresponding test statistic values t_1^*, \ldots, t_R^*. The rth resample has EDF \hat F_r^* (possibly a vector of EDFs), to which we fit the null model \hat F_{r0}^*. Resampling M times from \hat F_{r0}^* gives samples from which we calculate t_{rm}^{**}, m = 1, \ldots, M. Then the Monte Carlo approximation for the adjusted P-value is

p_{adj} = \frac{1 + \#\{ p_r^* \le p \}}{R + 1},    (4.36)

where for each r

p_r^* = \frac{1 + \#\{ t_{rm}^{**} \ge t_r^* \}}{M + 1}.    (4.37)

If p is calculated from the same R resamples, then a total of RM samples is generated. We can summarize the algorithm as follows:
Algorithm 4.3 (Double bootstrap test)

For r = 1, \ldots, R:

1 Generate y_1^*, \ldots, y_n^* independently from the fitted null distribution \hat F_0 and calculate the test statistic t_r^* from them.
2 Fit the null distribution to y_1^*, \ldots, y_n^*, thereby obtaining \hat F_{r0}^*.
3 For m = 1, \ldots, M,
(a) generate y_1^{**}, \ldots, y_n^{**} independently from the fitted null distribution \hat F_{r0}^*; and
(b) calculate from them the test statistic t_{rm}^{**}.
4 Calculate p_r^* as in (4.37).

Finally, calculate p_{adj} as in (4.36).
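For a nonparametric problem in which fitting the null model amounts to transforming the data themselves, Algorithm 4.3 can be coded directly. The following is a minimal sketch in the style of the practicals, not the original text's code; stat and fit.null are hypothetical placeholders for the test statistic and for the operation that turns a sample into fitted-null-model values.

double.boot.p <- function(y, stat, fit.null, R = 999, M = 249)
{ t0 <- stat(y)
  y0 <- fit.null(y)                 # values representing the fitted null model
  n <- length(y0)
  t.star <- p.star <- numeric(R)
  for (r in 1:R)
  { y.star <- sample(y0, n, replace = TRUE)
    t.star[r] <- stat(y.star)
    y.star0 <- fit.null(y.star)     # refit the null model to the resample
    t2 <- replicate(M, stat(sample(y.star0, n, replace = TRUE)))
    p.star[r] <- (1 + sum(t2 >= t.star[r]))/(M + 1) }   # (4.37)
  p <- (1 + sum(t.star >= t0))/(R + 1)
  c(p = p, p.adj = (1 + sum(p.star <= p))/(R + 1)) }    # (4.36)

For instance, to test a mean against a null value mu0 one might take stat = mean and fit.null = function(y) y - mean(y) + mu0, which recentres each sample at the null mean.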

We discuss the choice of M after the following example.


Example 4.22 (Two-way table) Table 4.6 contains a set of observed multinomial counts, for which we wish to test the null hypothesis of row-column independence, or additive loglinear model.

Table 4.6. Two-way table of counts (Newton and Geyer, 1994):

1  2  2  1  1  0  1
2  0  0  2  3  0  0
0  1  1  1  2  7  3
1  1  2  0  0  0  1
0  1  1  1  1  0  0

If the count in row i and column j is y_{ij}, then the null fitted values are \hat\mu_{ij,0} = y_{i+} y_{+j} / y_{++}, where y_{i+} = \sum_j y_{ij} and so forth. The log likelihood ratio test statistic is

t = 2 \sum_{i,j} y_{ij} \log( y_{ij} / \hat\mu_{ij,0} ).

According to standard theory, T is approximately distributed as \chi^2_d under the null hypothesis with d = (7-1) \times (5-1) = 24. Since t = 38.52, the approximate P-value is \Pr(\chi^2_{24} \ge 38.52) = 0.031. However, the chi-squared approximation is known to be quite poor for such a sparse table, so we apply the parametric bootstrap.
The model \hat F_0 is the fitted multinomial model, with sample size n = y_{++} and (i,j)th cell probability \hat\mu_{ij,0}/n. We generate R tables from this model and calculate the corresponding log likelihood ratio statistics t_1^*, \ldots, t_R^*. With R = 999 we obtain 47 statistics larger than the observed value t = 38.52, so the bootstrap P-value is (1 + 47)/(1 + 999) = 0.048. The inaccuracy of the chi-squared approximation is illustrated by Figure 4.13, which is a plot of ordered values of \Pr(\chi^2_{24} \ge t_r^*) versus expected uniform order statistics: the straight line corresponds to the theoretical chi-squared approximation for T.
The bootstrap P-value turns out to be quite non-uniform. A double bootstrap calculation with R = M = 999 gives p_{adj} = 0.076.
Note that the test applied here conditions only on the total y_{++}, whereas in principle one would prefer to condition on all row and column sums, which are sufficient statistics under the null hypothesis: this would require more complex simulation methods, such as those of Section 4.2.1; see Problem 4.3.
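The single-level calculation for this example is easily coded; the sketch below (our notation, assuming the counts are held in a matrix yy) uses the convention 0 log 0 = 0 by dropping empty cells.

lr.stat <- function(y)
{ mu0 <- outer(rowSums(y), colSums(y))/sum(y)    # null fitted values
  2*sum(y*log(y/mu0), na.rm = TRUE) }            # cells with y = 0 contribute 0
boot.lr.p <- function(yy, R = 999)
{ n <- sum(yy)
  p0 <- outer(rowSums(yy), colSums(yy))/n^2      # null cell probabilities
  t0 <- lr.stat(yy)
  t.star <- replicate(R,
    lr.stat(matrix(rmultinom(1, n, as.vector(p0)), nrow(yy))))
  (1 + sum(t.star >= t0))/(R + 1) }

Applying boot.lr.p to the counts of Table 4.6 reproduces the first-level test; wrapping it in a further loop as in Algorithm 4.3 gives the double bootstrap adjustment.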

Choice of M
The general application of the double bootstrap algorithm involves simulation at two levels, with a total of RM samples. If we follow the suggestion to use as many as 1000 samples for calculation of probabilities, then here we would need as many as 10^6 samples, which seems impractical for other than simple problems. As in Section 3.9, we can determine approximately what a sensible choice for M would be. The calculation below of simulation mean squared error suggests that M = 99 would generally be satisfactory, and M = 249 would be safe. There are also ways of reducing considerably the total size of the simulation, as we shall show in Chapter 9.


[Figure 4.13: ordered values of \Pr(\chi^2_{24} \ge t_r^*) versus expected uniform order statistics from R = 999 bootstrap simulations under the null fitted model for the two-way table; the dotted line is the theoretical approximation.]

To calculate the simulation mean squared error, we begin with equation (4.37), which we rewrite in the form

p_r^* = \frac{1 + \sum_{m=1}^{M} I\{ t_{rm}^{**} \ge t_r^* \}}{M + 1},

where I\{A\} is the indicator function of the event A. In order to simplify the calculations, we suppose that, as M \to \infty, p_r^* \to u_r, such that the u_r's are a random sample from the uniform distribution on [0, 1]. In this case there is no need to adjust the bootstrap P-value, so p_{adj} = p. Under this assumption (M+1)p_r^* is almost a \mathrm{Binom}(M, u_r) random variable, so that equation (4.36) can be approximated by

p_{adj} = \frac{1 + \sum_{r=1}^{R} X_r}{R + 1},

where X_r = I\{ \mathrm{Binom}(M, u_r) < (M+1)p \}. We can now calculate the simulation mean and variance of p_{adj} by using the fact that

E(X_r^k \mid u_r) = \Pr\{ \mathrm{Binom}(M, u_r) < (M+1)p \}

for k = 1, 2. First we have that for all r,

E(X_r^k) = \int_0^1 \sum_{j=0}^{[(M+1)p]-1} \binom{M}{j} u^j (1-u)^{M-j} \, du = \frac{[(M+1)p]}{M+1},

where [z] is the integer part of z. Since p_{adj} is proportional to the average of independent X_r's, it follows that

E(p_{adj}) = \frac{1 + R[(M+1)p]/(M+1)}{R + 1},


which tends to the correct answer p as R, M \to \infty, and

\mathrm{var}(p_{adj}) = \frac{R \, [(M+1)p] \{ M+1-[(M+1)p] \}}{(R+1)^2 (M+1)^2}.

A simple aggregate measure of simulation error is the mean squared error relative to p,

\mathrm{MSE}(p_{adj}) = \frac{[(M+1)p] \{ M+1-[(M+1)p] \}}{R (M+1)^2}.

Numerical evaluations of this result suggest that M = 249 would be a safe choice. If 0.01 \le p \le 0.10 then M = 99 would be satisfactory, while M = 49 would be adequate for larger p. Note that two assumptions were made in the calculation, both of which are harmless. First, we assumed that p was independent of the t_r^*, whereas in fact it would likely be calculated from the same values. Secondly, our main interest is in cases where P-values are not exactly uniformly distributed. Problem 4.12 suggests a more flexible calculation, from which very similar conclusions emerge.
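The numerical evaluation is easily reproduced; the small sketch below (function name ours) computes the root mean squared error relative to p for several values of M, with R fixed at 999.

mse.padj <- function(p, M, R = 999)
{ k <- floor((M + 1)*p)
  k*(M + 1 - k)/(R*(M + 1)^2) }
for (M in c(49, 99, 249))
  cat("M =", M, " relative RMSE at p = 0.05:",
      round(sqrt(mse.padj(0.05, M))/0.05, 3), "\n")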

4.6 Estimating Properties of Tests

A statistical test involves two steps, collection of data and application of a particular test statistic to those data. Both steps involve choice, and resampling methods can have a role to play in such choices by providing estimates of test power.
Estimation of power
As regards collection of data, in simple problems of the kind under discussion in this chapter, the statistical contribution lies in recommendation of sample sizes via considerations of test power. If it is proposed to use test statistic T, and if the particular alternative H_A to the null hypothesis H_0 is of primary interest, then the power of the test is

\pi(p, H_A) = \Pr( T > t_p \mid H_A ),

where t_p is defined by \Pr( T > t_p \mid H_0 ) = p. In the simplified language of testing theory, if we fix p and decide to reject H_0 when t > t_p, then \pi(p, H_A) is the chance of rejection when H_A is true. An alternative specification is in terms of E(P \mid H_A), the expected P-value. In many problems hypotheses are expressed in terms of parameters, and then power can be evaluated for arbitrary parameter values to give a power function. What is of interest to us here is the use of resampling to assess the power of a test, either as an aid to determination of appropriate sample sizes for a particular test, or as a way to choose from a set of possible tests.


Suppose, then, that a pilot set of data y_1, \ldots, y_n is in hand, and that the model description is semiparametric (Section 3.3). The pilot data can be used to estimate the nonparametric component of the model, and to this can be added arbitrary values of the parametric component. This provides a family of alternative hypothesis models from which to simulate data and test statistic values. From these simulations we obtain approximations of test power, provided we have critical values t_p for the test statistic. This last condition will not always be met, but in many problems there will at least be a simple approximation, for example N(0,1) if we are using a studentized statistic. For many nonparametric tests, such as those based on ranks, critical values are distribution-free, and so are available. The following example illustrates this idea.
Example 4.23 (Maize height data) The EDFs plotted in the left panel of Figure 4.14 are for heights of maize plants growing in two adjacent rows, and differing only in a pollen sterility factor. The two samples can be modelled approximately by a semiparametric model with an unspecified baseline distribution F and one median-shift parameter \delta. For analysis of such data it is proposed to test H_0 : \delta = 0 using the Wilcoxon test. Whether or not there are enough data can be assessed by estimating the power of this test, which does depend upon F.
Denote the observations in sample i by y_{ij}, j = 1, \ldots, n_i. The underlying distributions are assumed to have the forms F(y) and F(y - \delta), where \delta is estimated by the difference in sample medians \hat\delta. To estimate F we subtract \hat\delta from the second sample to give \tilde y_{2j} = y_{2j} - \hat\delta. Then \hat F is the pooled EDF of the y_{1j}'s and \tilde y_{2j}'s. For these data n_1 = n_2 = 12 and \hat\delta = 4.5. The right panel of Figure 4.14 plots EDFs of the y_{1j}'s and \tilde y_{2j}'s.
The next step is to simulate data for selected values of \delta and selected sample sizes N_1 and N_2 as follows. For group 1, sample data y_{11}^*, \ldots, y_{1N_1}^* from \hat F(y), i.e. randomly with replacement from

y_{11}, \ldots, y_{1n_1}, \tilde y_{21}, \ldots, \tilde y_{2n_2},

and for group 2, sample data y_{21}^*, \ldots, y_{2N_2}^* from \hat F(y - \delta), i.e. randomly with replacement from

y_{11} + \delta, \ldots, y_{1n_1} + \delta, \tilde y_{21} + \delta, \ldots, \tilde y_{2n_2} + \delta.

Then calculate test statistic t^*. With R repetitions of this, the power of the test at level p is the proportion of times that t^* > t_p, where t_p is the critical value of the Wilcoxon test for specified N_1 and N_2.
In this particular case, the simulations show that the Wilcoxon test at level p = 0.01 has power 0.26 for \delta = 8 and the observed sample sizes.


[Figure 4.14: power comparison for maize height data (Hand et al., 1994, p. 130). Left panel: EDFs of plant height for the two groups. Right panel: EDFs for group 1 (unadjusted) and group 2 (adjusted by estimated median-shift \hat\delta = 4.5).]

Additional calculations show that both sample sizes need to be increased from 12 to at least 33 to have power 0.8 for \delta = 8.
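In the style of the practicals, the power calculation might be sketched as follows; here pool is assumed to hold the pooled values y_{1j} and \tilde y_{2j}, and wilcox.test is used in place of tabulated critical values, rejecting when its P-value is at most p (an equivalent rule).

wilcox.power <- function(pool, delta, N1, N2, p = 0.01, R = 1000)
{ rej <- replicate(R,
  { y1 <- sample(pool, N1, replace = TRUE)
    y2 <- sample(pool, N2, replace = TRUE) + delta
    wilcox.test(y2, y1, alternative = "greater",
                exact = FALSE)$p.value <= p })
  mean(rej) }   # estimated power at level p

Calling wilcox.power for a grid of sample sizes gives a rough guide to the sizes needed for a target power.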

If the proposed test uses the pivot method of Section 4.4.1, then calculations of sample size can be done more simply. For example, for a scalar \theta consider a two-sided test of H_0 : \theta = \theta_0 with level 2\alpha based on the pivot Z. The power function can be written

\pi(2\alpha, \theta) = 1 - \Pr\left( z_{\alpha,N} + \frac{\theta_0 - \theta}{v_N^{1/2}} \le Z_N \le z_{1-\alpha,N} + \frac{\theta_0 - \theta}{v_N^{1/2}} \right),    (4.39)

where the subscript N indicates sample size. A rough approximation to this power function can be obtained as follows. First simulate R samples of size N from \hat F, and use these to approximate the quantiles z_{\alpha,N} and z_{1-\alpha,N}. Next set v_N^{1/2} = n^{1/2} v^{1/2} / N^{1/2}, where v is the variance estimate calculated from the pilot data. Finally, approximate the probability (4.39) using the same R bootstrap samples.
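For the sample mean, say, this approximation might be coded as in the sketch below (our notation, not the book's), with y the pilot data and theta the alternative of interest.

pivot.power <- function(y, theta0, theta, N, alpha = 0.025, R = 999)
{ n <- length(y); t <- mean(y); v <- var(y)/n    # pilot estimate and variance
  z.star <- replicate(R,
  { y.star <- sample(y, N, replace = TRUE)
    (mean(y.star) - t)/sqrt(var(y.star)/N) })
  zq <- quantile(z.star, c(alpha, 1 - alpha))    # z_{alpha,N} and z_{1-alpha,N}
  shift <- (theta0 - theta)/sqrt(n*v/N)          # v_N = n v / N
  1 - mean(z.star >= zq[1] + shift & z.star <= zq[2] + shift) }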
Sequential tests
Similar sorts of calculations can be done for sequential tests, where one important criterion is terminal sample size. In this context simulation can also be used to assess the likely eventual sample size, given data y_1, \ldots, y_m at an interim stage of a test, with a specified protocol for termination. This can be done by simulating data continuations y_{m+1}^*, y_{m+2}^*, \ldots up to termination, by sampling from fitted models or EDFs, as appropriate. From repetitions of this simulation one obtains an approximate distribution for terminal sample size N.


4.7 Bibliographic Notes

The standard theory of significance tests is described in Chapters 3-5 and 9 of Cox and Hinkley (1974). For detailed treatment of the mathematical theory see Lehmann (1986). In recent years much work has been done on obtaining improved distributional approximations for likelihood-based statistics, and most of this is covered by Barndorff-Nielsen and Cox (1994).
Randomization and permutation tests have long histories. R. A. Fisher (1935) introduced randomization tests as a device for explaining and justifying significance tests, both in simple cases and for complicated experimental designs: the randomization used in selecting a design can be used as the basis for inference, without appeal to specific error models. For a recent account see Manly (1991). A general discussion of how to apply randomization in complex problems is given by Welch (1990).
Permutation tests, which are superficially similar to randomization tests, are specifically nonparametric tests designed to condition out the unknown sampling distribution. The theory was developed by Pitman (1937a,b,c), and is summarized by Lehmann (1986). More recently Romano (1989, 1990) has examined properties of permutation tests and their relation to bootstrap tests for a variety of problems.
Monte Carlo tests were first suggested by Barnard (1963) and are particularly popular in spatial statistics, as described by Ripley (1977, 1981, 1987) and Besag and Diggle (1977). Graphical tests for regression diagnostics are described by Atkinson (1985), and Ripley (1981) applies them to model-checking in spatial statistics. Markov chain Monte Carlo methods for conditional tests were introduced by Besag and Clifford (1989); applications to contingency table analysis are given by Forster, McDonald and Smith (1996) and Smith, Forster and McDonald (1996), who give additional references. Gilks et al. (1996) is a good general reference on Markov chain Monte Carlo methods, including design of simulation.
The effect of simulation size R on power for Monte Carlo tests (with independent simulations) has been considered by Marriott (1979), Jöckel (1986) and by Hall and Titterington (1989); the discussion in Section 4.2.5 follows Jöckel. Sequential calculation of P-values is described by Besag and Clifford (1991) and Jennison (1992).
The use of tilted EDFs was introduced by Efron (1981b), and has subsequently had a strong impact on confidence interval methods; see Chapters 5 and 10.
Double bootstrap adjustment of P-values is discussed by Beran (1988), Loh (1987), Hinkley and Shi (1989), and Hall and Martin (1988). Applications are described by Newton and Geyer (1994). Geyer (1995) discusses tests for inequality-constrained hypotheses, which sheds light on possible inconsistency


of bootstrap tests and suggests remedies. For references to discussions of improved simulation methods, see Chapter 9.
A variety of methods and applications for resampling in multiple testing are covered in the books by Noreen (1989) and Westfall and Young (1993).
Various aspects of resampling in the choice of test are covered in papers by Collings and Hamilton (1988), Hamilton and Collings (1991), and Samawi (1994). A general theoretical treatment of power estimation is given by Beran (1986). The brief discussion of adaptive tests in Section 4.4.2 is based on Donegani (1991), who refers to previous work on the topic.

4.8 Problems

1 For the dispersion test of Example 4.2, y_1, \ldots, y_n are hypothetically sampled from a Poisson distribution. In the Monte Carlo test we simulate samples from the conditional distribution of Y_1, \ldots, Y_n given \sum Y_j = s, with s = \sum y_j. If the exact multinomial simulation were not available, a Markov chain method could be used. Construct a Markov chain Monte Carlo algorithm based on one-step transitions from (u_1, \ldots, u_n) to (v_1, \ldots, v_n) which involve only adding and subtracting 1 from two randomly selected u's. (Note that zero counts must not be reduced.)
Such an algorithm might be slow. Suggest a faster alternative.
(Section 4.2)

2 Suppose that X_1, \ldots, X_n are continuous and have the same marginal CDF F, although they are not independent. Let I be a random integer between 1 and n. Show that rank(X_I) has a uniform distribution on \{1, 2, \ldots, n\}.
Explain how to apply this result to obtain an exact Monte Carlo test using one realization of a suitable Markov chain.
(Section 4.2.2; Besag and Clifford, 1989)

3 Suppose that we have an m x m contingency table with entries y_{ij} which are counts.
(a) Consider the null hypothesis of row-column independence. Show that the sufficient statistic s_0 under this hypothesis is the set of row and column marginal totals. To assess the significance of the likelihood ratio test statistic conditional on these totals, a Markov chain Monte Carlo simulation is used. Develop a Metropolis-type algorithm using one-step transitions which modify the contents of a randomly selected tetrad y_{ik}, y_{il}, y_{jk}, y_{jl}, where i \ne j, k \ne l.
(b) Now consider the null hypothesis of quasi-symmetry, which implies that in the loglinear model for mean cell counts, \log E(Y_{ij}) = \mu + \alpha_i + \beta_j + \gamma_{ij}, the interaction parameters satisfy \gamma_{ij} = \gamma_{ji} for all i, j. Show that the sufficient statistic s_0 under this hypothesis is the set of totals y_{ij} + y_{ji}, i \ne j, together with the row and column totals and the diagonal entries. Again a conditional test is to be applied. Develop a Metropolis-type algorithm for Markov chain Monte Carlo simulation using one-step transitions which involve pairs of symmetrically placed tetrads.
(Section 4.2.2; Smith et al., 1996)

4 Suppose that a one-sided bootstrap test at level \alpha is to be applied with R simulated samples. Then the null hypothesis will be rejected if and only if the number of t^*'s exceeding t is at most k = (R+1)\alpha - 1. If k_r is the number of t^*'s exceeding t in the first r simulations, for what values of k_r would it be unnecessary to continue simulation?
(Section 4.2.5; Jennison, 1992)


5 (a) Consider the following rule for choosing the number of simulations in a Monte Carlo test. Choose k, and generate simulations t_1^*, t_2^*, \ldots, t_l^* until the first l for which k of the t^*'s exceed the observed value t; then declare P-value p = (k+1)/(l+1). Let the random variables corresponding to l and p be L and P. Show that

\Pr\{ P \le (k+1)/(l+1) \} = \Pr( L \ge l ) = k/l, \quad l = k, k+1, \ldots,

and deduce that L has infinite mean. Show that P has the distribution of a U(0,1) random variable rounded to the nearest achievable significance level 1, k/(k+1), k/(k+2), \ldots, and deduce that the test is exact.
(b) Consider instead stopping immediately if k of the t^* exceed t at any l \le R, and anyway stopping when l = R, at which point m values exceed t. Show that this rule gives achievable significance levels

p = \begin{cases} (k+1)/(l+1), & m = k, \\ (m+1)/(R+1), & m < k. \end{cases}

Show that under this rule the null expected value of L is

E(L) = \sum_{l=1}^{R} \Pr( L \ge l ) = k + k \sum_{l=k+1}^{R} l^{-1},

and evaluate this with k = 49 and 9 for R = 999.
(Section 4.2.5; Besag and Clifford, 1991)
6 Suppose that n subjects are allocated randomly to each of two treatments, A and B. In fact each subject falls in one of two relevant groups, such as gender, and the treatment allocation frequencies differ between groups. The response y_{ij} for the jth subject in the ith group is modelled as y_{ij} = \gamma_i + \tau_{k(i,j)} + \varepsilon_{ij}, where \tau_A and \tau_B are treatment effects and k(i,j) is A or B according to which treatment was allocated to the subject. Our interest is in testing H_0 : \tau_A = \tau_B with alternative that \tau_A < \tau_B, and the test statistic chosen is

T = \sum_{i,j : k(i,j)=B} r_{ij} - \sum_{i,j : k(i,j)=A} r_{ij},

where r_{ij} is the residual from regression of the y's on the group indicators.
(a) Describe how to calculate a permutation P-value for the observed value t using the method described above Example 4.12.
(b) A different calculation of the P-value is possible which conditions on the observed covariates, i.e. on the treatment allocation frequencies in the two groups. The idea is to first eliminate the group effects by reducing the data to differences d_{ij} = y_{ij} - y_{i,j+1}, and then to note that the joint probability of these differences under H_0 is constant under permutations of data within groups. That is, the minimal sufficient statistic s_0 under H_0 is the set of differences Y_{i(j)} - Y_{i(j+1)}, where Y_{i(1)} \le Y_{i(2)} \le \cdots are the ordered values within the ith group. Show carefully how to calculate the P-value for t conditional on s_0.
(c) Apply the unconditional and conditional permutation tests to the following data: [table of responses by treatment within the two groups is not recoverable from the source]
(Sections 4.3, 6.3.2; Welch and Fahey, 1994)

7 A randomized matched-pair experiment to compare two treatments produces paired responses (y_{1j}, y_{2j}), from which the paired differences d_j = y_{2j} - y_{1j} are calculated for j = 1, \ldots, n. The null hypothesis H_0 of no treatment difference implies that the d_j's are sampled from a distribution that is symmetric with mean zero, whereas the alternative hypothesis implies a positive mean difference. For any test statistic t, such as \bar d, the exact randomization P-value \Pr(T^* \ge t \mid H_0) is calculated under the null resampling model

d_j^* = s_j d_j, \quad j = 1, \ldots, n,

where the s_j are independent and equally likely to be +1 and -1. What would be the corresponding nonparametric bootstrap sampling model \hat F_0? Would the resulting bootstrap P-value differ much from the randomization P-value?
See Practical 4.4 to apply the randomization and bootstrap tests to the following data, which are differences of measurements in eighths of an inch on cross- and self-fertilized plants grown in the same pot (taken from R. A. Fisher's famous discussion of Darwin's experiment):

49  -67  8  16  6  23  28  41  14  29  56  24  75  60  -48

(Sections 4.3, 4.4; Fisher, 1935, Table 3)


8 For the two-sample problem of Example 4.16, consider fitting the null model by maximum likelihood. Show that the solution probabilities are given by

p_{1j,0} = \frac{1}{n_1(\alpha + \lambda y_{1j})}, \qquad p_{2j,0} = \frac{1}{n_2(\beta - \lambda y_{2j})},

where \alpha, \beta and \lambda are the solutions to the equations \sum p_{1j,0} = 1, \sum p_{2j,0} = 1, and \sum y_{1j} p_{1j,0} = \sum y_{2j} p_{2j,0}. Under what conditions does this solution not exist, or give negative probabilities? Compare this null model with the one used in Example 4.16.
9 For the ratio-testing problem of Example 4.15, obtain the nonparametric MLE of the joint distribution of (U, X). That is, if p_j is the probability attached to the data pair (u_j, x_j), maximize \prod p_j subject to \sum p_j (x_j - \theta_0 u_j) = 0. Verify that the resulting distribution is the EDF of (U, X) when \theta_0 = \bar x / \bar u. Hence develop a numerical algorithm for calculating the p_j's for general \theta_0.
Now choose probabilities p_1, \ldots, p_n to minimize the distance

d(p, q) = \sum p_j \log p_j - \sum p_j \log q_j

with q = (1/n, \ldots, 1/n), subject to \sum (x_j - \theta_0 u_j) p_j = 0. Show that the solution is the exponential tilted EDF

p_j \propto \exp\{ \eta (x_j - \theta_0 u_j) \}.

Verify that for small values of \theta_0 - \bar x / \bar u these p_j's are approximately the same as those obtained by the MLE method.
(Section 4.4; Efron, 1981b)

10 Suppose that we wish to test the reduced-rank model H_0 : g(\theta) = 0, where g(\cdot) is a p_1-dimensional reduction of p-dimensional \theta. For the studentized pivot method we take Q = \{ g(T) - g(\theta) \}^T V_g^{-1} \{ g(T) - g(\theta) \}, with data test value q_0 = g(t)^T v_g^{-1} g(t), where v_g estimates \mathrm{var}\{ g(T) \}. Use the nonparametric delta method to show that \mathrm{var}\{ g(T) \} = \dot g(t) V_L \dot g(t)^T, where \dot g(\theta) = \partial g(\theta) / \partial \theta^T.
Show how the method can be applied to test equality of p means given p independent samples, assuming equal population variances.
(Section 4.4.1)

11 In a parametric situation, suppose that an exact test is available with test statistic U, that S is sufficient under the null hypothesis, but that a parametric bootstrap test is carried out using T rather than U. Will the adjusted P-value p_{adj} always produce the exact test?
(Section 4.5)

12 In calculating the mean squared error for the simulation approximation to the adjusted P-value, it might be more reasonable to assume that P-values u_r follow a beta distribution with parameters a and b which are close to, but not equal to, one. Show that in this case

E(X_r^k) = \sum_{j=0}^{[(M+1)p]-1} \frac{\Gamma(M+1)\,\Gamma(a+j)\,\Gamma(b+M-j)\,\Gamma(a+b)}{\Gamma(j+1)\,\Gamma(M-j+1)\,\Gamma(a+b+M)\,\Gamma(a)\,\Gamma(b)},

where X_r = I\{ \mathrm{Binom}(M, u_r) < (M+1)p \}. Use this result to investigate numerically the choice of M.
(Section 4.5)
13 For the matched-pair experiment of Problem 4.7, suppose that we choose between the two test statistics t_1 = \bar d and t_2 = (n-2m)^{-1} \sum_{j=m+1}^{n-m} d_{(j)}, for some m in the range 2, \ldots, [n/2], on the basis of their estimated variances v_1 and v_2, where

v_1 = \frac{\sum (d_j - t_1)^2}{n^2},
\qquad
v_2 = \frac{\sum_{j=m+1}^{n-m} (d_{(j)} - t_2)^2 + m(d_{(m+1)} - t_2)^2 + m(d_{(n-m)} - t_2)^2}{n(n-2m)}.

Give a detailed description of the adaptive test as outlined in Section 4.4.2. To apply it to the data of Problem 4.7 with m = 2, see Practical 4.4.
(Section 4.4.2; Donegani, 1991)
14 Suppose that we want critical values for a size \alpha one-sided test of H_0 : \theta = \theta_0 versus H_A : \theta > \theta_0. The ideal value is the 1-\alpha quantile t_{0,1-\alpha} of the distribution of T under H_0, and this is estimated by the solution \hat t_{0,1-\alpha} to \Pr( T^* \ge \hat t_{0,1-\alpha} \mid \hat F_0 ) = \alpha. Typically \hat t_{0,1-\alpha} is biased. Consider an adjusted critical value \hat t_{0,1-\alpha-\gamma}. Obtain the double bootstrap algorithm for choosing \gamma, and compare the resulting test to use of the adjusted P-value (4.34).
(Sections 4.5, 3.9.1; Beran, 1988)

4.9 Practicals

1 The data in dataframe dogs are from a pharmacological experiment. The two variables are cardiac oxygen consumption (MVO) and left ventricular pressure (LVP). Data for n = 7 dogs are

MVO  78  92  116  90  106  78  99
LVP  32  33   45  30   38  24  44

Apply a bootstrap test for the hypothesis of zero correlation between MVO and LVP. Use R = 499 simulations.
(Sections 4.3, 4.4)
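One possible sketch (ours, not a unique solution): take the null model to be the product of the marginal EDFs, so that the two columns are resampled independently. The column names mvo and lvp are assumptions; the packaged dataframe may differ.

library(boot)
dogs <- data.frame(mvo = c(78, 92, 116, 90, 106, 78, 99),
                   lvp = c(32, 33, 45, 30, 38, 24, 44))
dogs.gen <- function(data, mle)
  # independent resampling of the two variables: the fitted null model
  # is the product of the marginal EDFs
  data.frame(mvo = sample(data$mvo, replace = T),
             lvp = sample(data$lvp, replace = T))
dogs.fun <- function(data) cor(data$mvo, data$lvp)
dogs.boot <- boot(dogs, dogs.fun, R = 499, sim = "parametric",
                  ran.gen = dogs.gen)
(1 + sum(abs(dogs.boot$t) >= abs(dogs.boot$t0)))/(1 + dogs.boot$R)

The absolute values give a two-sided test; dropping them gives the one-sided version.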
2 For the permutation test outlined in Example 4.12,

aml.fun <- function(data, i)
{ d <- data[i, ]
  temp <- survdiff(Surv(d$time, d$cens) ~ data$group)
  s <- sign(temp$obs[2] - temp$exp[2])
  s*sqrt(temp$chisq) }
aml.perm <- boot(aml, aml.fun, R = 499, sim = "permutation")
(1 + sum(aml.perm$t0 < aml.perm$t))/(1 + aml.perm$R)
o <- rank(aml.perm$t)
less <- (1:aml.perm$R)[aml.perm$t < aml.perm$t0]
o <- o/(1 + aml.perm$R)
qqnorm(aml.perm$t, ylab = "Log-rank statistic", type = "n")
points(qnorm(o[less]), aml.perm$t[less])
points(qnorm(o[-less]), aml.perm$t[-less], pch = 1)
abline(0, 1, lty = 2)
abline(h = aml.perm$t0, lty = 3)

Compare this with the corresponding bootstrap test.
(Section 4.3)
3 For a graphical test of suitability of the exponential model for the data in Table 1.2, we generate data from the exponential distribution, and plot an envelope.

expqq.fun <- function(data, q) sort(data)/mean(data)
exp.gen <- function(data, mle) rexp(length(data), mle)
n <- nrow(aircondit)
qq <- qexp((1:n)/(n + 1))
exp.boot <- boot(aircondit$hours, expqq.fun, R = 999, sim = "parametric",
                 ran.gen = exp.gen, mle = 1/mean(aircondit$hours), q = qq)
env <- envelope(exp.boot$t)
plot(qq, exp.boot$t0, xlab = "Exponential quantiles",
     ylab = "Scaled order statistics", xlim = c(0, max(qq)),
     ylim = c(0, max(c(exp.boot$t0, env$overall[2,]))), pch = 1)
lines(qq, env$overall[1,]); lines(qq, env$overall[2,])
lines(qq, env$point[1,], lty = 2); lines(qq, env$point[2,], lty = 2)

Discuss the adequacy of the model. Check whether the gamma model is a better fit.
(Section 4.2.4)
4 To apply the permutation test outlined in Problem 4.7,

darwin.gen <- function(data, mle)
{ sign <- sample(c(-1, 1), mle, replace = T)
  data*sign }
darwin.rand <- boot(darwin$y, mean, R = 999, sim = "parametric",
                    ran.gen = darwin.gen, mle = nrow(darwin))
(1 + sum(darwin.rand$t > darwin.rand$t0))/(1 + darwin.rand$R)

Can you see how to modify darwin.gen to produce the bootstrap test?
To implement the adaptive test described in Problem 4.13, with m = 2:

darwin.f <- function(d)
{ n <- length(d); m <- 2
  t1 <- mean(d)
  v1 <- sum((d - t1)^2)/n^2
  d <- sort(d)[(m + 1):(n - m)]
  t2 <- mean(d)
  v2 <- (sum((d - t2)^2) + m*(min(d) - t2)^2 + m*(max(d) - t2)^2)/(n*(n - 2*m))
  c(t1, v1, t2, v2) }
darwin.ad <- boot(darwin$y, darwin.f, R = 999, sim = "parametric",
                  ran.gen = darwin.gen, mle = nrow(darwin))
darwin.ad$t0
i <- c(1:999)[darwin.ad$t[,2] > darwin.ad$t[,4]]
(1 + sum(darwin.ad$t[i,3] > darwin.ad$t0[3]))/(1 + length(i))

Is a different result obtained with the adaptive version of the bootstrap test?
(Sections 4.3, 4.4)
5 Dataframe paulsen contains data collected as part of an investigation into the quantal nature of neurotransmission in the brain, by Dr O. Paulsen of the Department of Pharmacology, University of Oxford, in collaboration with Professor P. Heggelund of the Department of Neurophysiology, University of Oslo. Two models have been proposed to explain such data. The first model suggests that the data are drawn from an underlying skewed unimodal distribution. The alternative model suggests that the data are drawn from a series of distributions with modes equal to integer multiples of a unit size. To distinguish between the two models, a bootstrap test of multimodality may be carried out, with the null hypothesis that the underlying distribution is unimodal.
To plot the data and a kernel density estimate with a Gaussian kernel and bandwidth h = 1.5, and to count its local maxima:

h <- 1.5
hist(paulsen$y, probability = T, breaks = c(0:30))
lines(density(paulsen$y, width = 4*h, from = 0, to = 30))
peak.test <- function(y, h)
{ dens <- density(y, width = 4*h, n = 100)
  sum(peaks(dens$y[(dens$x >= 0) & (dens$x <= 20)])) }
peak.test(paulsen$y, h)

Check that h = 1.87 is the smallest value giving just one peak.
For bootstrap analysis,

peak.gen <- function(d, mle)
{ n <- mle[1]; h <- mle[2]
  i <- sample(n, n, replace = T)
  d[i] + h*rnorm(n) }
paulsen.boot <- boot(paulsen$y, peak.test, R = 999, sim = "parametric",
                     ran.gen = peak.gen, mle = c(nrow(paulsen), 1.87), h = 1.87)

What is the significance level?
To repeat with a shrunk smoothed density estimate:

shrunk.gen <- function(d, mle)
{ n <- mle[1]; h <- mle[2]
  v <- var(d)
  (d[sample(n, n, replace = T)] + h*rnorm(n))/sqrt(1 + h^2/v) }
paulsen.boot <- boot(paulsen$y, peak.test, R = 999, sim = "parametric",
                     ran.gen = shrunk.gen, mle = c(nrow(paulsen), 1.87), h = 1.87)

Bootstrap to obtain the P-value. Discuss your results.
(Section 4.4; Paulsen and Heggelund, 1994; Silverman, 1981)

6 For the cd4 data of Practicals 2.3 and 3.6, test the hypothesis that the distribution of CD4 counts after one year is the same as the baseline distribution. Test also whether the treatment affects the counts for each individual. Discuss your conclusions.

5
Confidence Intervals

5.1 Introduction

The assessment of uncertainty about parameter values is made using confidence intervals or regions. Section 2.4 gave a brief introduction to the ways in which resampling can be applied to the calculation of confidence limits. In this chapter we undertake a more thorough discussion of such methods, including more sophisticated ideas that are potentially more accurate than those mentioned previously.
Confidence region methods all focus on the same target properties. The first is that a confidence region with specified coverage probability \gamma should be a set C_\gamma(y) of parameter values which depends only upon the data y and which satisfies

\Pr\{ \theta \in C_\gamma(Y) \} = \gamma.    (5.1)

Implicit in this definition is that the probability does not depend upon any nuisance parameters that might be in the model. The confidence coefficient, or coverage probability, \gamma, is the relative frequency with which the confidence region would include, or cover, the true parameter value \theta in repetitions of the process that produced the data y. In principle the coverage probability should be conditional on the information content of y as measured by ancillary statistics, but this may be difficult in practice without a parametric model; see Section 5.9.
The second important property of a confidence region is its shape. The general principle is that any value in C_\gamma should be more likely than all values outside C_\gamma, where "likely" is measured by a likelihood or similar function. This is difficult to apply in nonparametric problems, where strictly a likelihood function is not available; see, however, Chapter 10. In practice the difficulty is

not serious for scalar \theta, which is the major focus in this chapter, because in most applications the confidence region will be a single interval.
A confidence interval will be defined by limits \hat\theta_{\alpha_1} and \hat\theta_{1-\alpha_2}, such that for any \alpha

\Pr( \hat\theta_\alpha \ge \theta ) = \alpha.

The coverage of the interval [\hat\theta_{\alpha_1}, \hat\theta_{1-\alpha_2}] is \gamma = 1 - (\alpha_1 + \alpha_2), and \alpha_1 and \alpha_2 are respectively the left- and right-tail error probabilities. For some applications only one limit is required, either a lower confidence limit \hat\theta_\alpha or an upper confidence limit \hat\theta_{1-\alpha}, these both having coverage 1 - \alpha. If a closed interval is required, then in principle we can choose \alpha_1 and \alpha_2, so long as they sum to the overall error probability 2\alpha. The simplest way to do this, which we adopt for general discussion, is to set \alpha_1 = \alpha_2 = \alpha. Then the interval is equi-tailed with coverage probability 1 - 2\alpha. In particular applications, however, one might well want to choose \alpha_1 and \alpha_2 to give approximately the shortest interval: this would be analogous to having the likelihood property mentioned earlier.
A single confidence region cannot give an adequate summary of the uncertainty about \theta, so in practice one should give regions for three or four confidence levels between 0.50 and 0.99, say, together with the point estimate for \theta. One benefit from this is that any asymmetry in the uncertainty about \theta will be fairly clear.
So far we have assumed that a confidence region can be found to satisfy (5.1) exactly, but this is not possible except in a few special parametric models. The methods developed in this chapter are based on approximate probability calculations, and therefore involve a discrepancy between the nominal or target coverage, and the actual coverage probability.
In Section 5.2 we review briefly the standard approximate methods for parametric and nonparametric models, including the basic bootstrap methods already described in Section 2.4. More sophisticated methods, based on what is known as the percentile method, are the subject of Section 5.3. Section 5.4 compares the various methods from a theoretical viewpoint, using asymptotic expansions, and introduces the ABC method as an alternative to simulation methods. The use of significance tests to obtain confidence limits is outlined in Section 5.5. A nested bootstrap algorithm is introduced in Section 5.6. Empirical comparisons between methods are made in Section 5.7.
Confidence regions for vector parameters are described in Section 5.8. The possibility of conditional confidence regions is explored in Section 5.9 through discussion of two examples. Prediction intervals are discussed briefly in Section 5.10.
The discussion in this chapter is about how to use the results of bootstrap simulation algorithms to obtain confidence regions, irrespective of what the resampling algorithm is. The presentation supposes for the most part that we

are in the simple situation of Chapter 2, where we have a single, complete homogeneous sample. Most of the methods described can be applied to more complex data structures, provided that appropriate resampling algorithms are used, but for most sorts of highly dependent data the theoretical properties of the methods are largely unknown.

5.2 Basic Confidence Limit Methods

5.2.1 Parametric models
One general approach to calculating a confidence interval is to make it surround a good point estimate of the parameter, which for parametric models will often be taken to be the maximum likelihood estimator. We begin by discussing various simple ways in which this approach can be applied.
Suppose that T estimates a scalar \theta and that we want an interval with left- and right-tail errors both equal to \alpha. For simplicity we assume that T is continuous. If the quantiles of T - \theta are denoted by a_p, then

\Pr( T - \theta \le a_\alpha ) = \alpha = \Pr( T - \theta > a_{1-\alpha} ).    (5.2)

Rewriting the events T - \theta \le a_\alpha and T - \theta > a_{1-\alpha} as \theta \ge T - a_\alpha and \theta < T - a_{1-\alpha} respectively, we see that the 1 - 2\alpha equi-tailed interval has limits

\hat\theta_\alpha = t - a_{1-\alpha}, \qquad \hat\theta_{1-\alpha} = t - a_\alpha.    (5.3)

This ideal solution rarely applies, because the distribution of T - \theta is usually unknown. This leads us to consider various approximate methods, most of which are based on approximating the quantiles of T - \theta.
Normal approximation
The simplest approach is to apply a N(0, v) approximation for T - \theta. This gives the approximate confidence limits

\hat\theta_\alpha = t + v^{1/2} z_\alpha, \qquad \hat\theta_{1-\alpha} = t + v^{1/2} z_{1-\alpha},    (5.4)

where as usual z_{1-\alpha} = \Phi^{-1}(1-\alpha). If T is a maximum likelihood estimator, then the approximate variance v can be computed directly from the log likelihood function \ell(\theta), with \dot\ell(\theta) = d\ell(\theta)/d\theta and \ddot\ell(\theta) = d^2\ell(\theta)/d\theta \, d\theta^T. If there are no nuisance parameters, then we can use the reciprocal of either the observed Fisher information, v = 1/\{-\ddot\ell(\hat\theta)\}, or the estimated expected Fisher information v = 1/i(\hat\theta), where i(\theta) = E\{-\ddot\ell(\theta)\} = \mathrm{var}\{\dot\ell(\theta)\}. The former is usually preferable. When there are nuisance parameters, we use the relevant element of the inverse of either -\ddot\ell(\hat\theta) or i(\hat\theta). More generally, if T is given by an estimating equation, then v can be calculated by the delta method; see Section 2.7.2. Equation (5.4) is the standard form for normal approximation confidence limits, although it is sometimes augmented by a bias correction which is based on the third derivative of the log likelihood function.


In problems where the variance approximation v is hard to obtain theoretically, or is thought to be unreliable, the parametric bootstrap of Section 2.2 can be used. This requires simulation from the fitted model with parameter value \hat\theta. If the resampling estimates of bias and variance are denoted by b_R and v_R, then (5.4) is replaced by

\hat\theta_\alpha = t - b_R + v_R^{1/2} z_\alpha, \qquad \hat\theta_{1-\alpha} = t - b_R + v_R^{1/2} z_{1-\alpha}.    (5.5)

Whether or not a normal approximation method will work can be assessed by making a normal Q-Q plot of the simulated estimates t_1^*, \ldots, t_R^*, as illustrated in Section 2.2.2. If such a plot suggests that normal approximation is poor, then we can either try to improve the normal approximation in some way, or replace it completely. The basic resampling confidence interval methods of Section 2.4 do the latter, and we review them first.
Basic and studentized bootstrap methods
If we start again at the general confidence limit formula (5.3), we can estimate the quantiles a_\alpha and a_{1-\alpha} by the corresponding quantiles of T^* - t. Assuming that these are approximated by simulation, as in Section 2.2.2, the argument given in Section 2.4 leads to the confidence limits

\hat\theta_\alpha = 2t - t^*_{((R+1)(1-\alpha))}, \qquad \hat\theta_{1-\alpha} = 2t - t^*_{((R+1)\alpha)}.    (5.6)

These we refer to as the basic bootstrap confidence limits for \theta.
A modification of this is to use the form of the normal approximation confidence limit, but to replace the N(0,1) approximation for Z = (T - \theta)/V^{1/2} by a bootstrap approximation. Each simulated sample is used to calculate t^*, the variance estimate v^*, and hence the bootstrap version z^* = (t^* - t)/v^{*1/2} of Z. The R simulated values of z^* are ordered, and the p quantile of Z is estimated by the (R+1)p-th of these. Then the confidence limits (5.4) are replaced by

\hat\theta_\alpha = t - v^{1/2} z^*_{((R+1)(1-\alpha))}, \qquad \hat\theta_{1-\alpha} = t - v^{1/2} z^*_{((R+1)\alpha)}.    (5.7)

These we refer to as studentized bootstrap confidence limits. They are also known as bootstrap-t limits, by analogy with the Student-t confidence limits for the mean of a normal distribution, to which they are equal under infinite simulation in that problem. In principle this method is superior to the previous basic method, for reasons outlined in Section 2.6.1 and discussed further in Section 5.4.
An empirical bias adjustment could be incorporated into the numerator of Z, but this is often difficult to calculate and is usually not worthwhile, because the effect is implicitly adjusted for in the bootstrap distribution.
For both (5.6) and (5.7) to apply exactly it is necessary that (R+1)\alpha be an integer. This can usually be arranged: with R = 999 we can handle most conventional values of \alpha. But if for some reason (R+1)\alpha is not an integer, then interpolation can be used. A simple method that works well for approximately normal estimators is linear interpolation on the normal quantile scale. For example, if we are trying to apply (5.6) and the integer part of (R+1)\alpha is k, then we define

t^*_{((R+1)\alpha)} = t^*_{(k)} + \frac{ \Phi^{-1}(\alpha) - \Phi^{-1}\left(\frac{k}{R+1}\right) }{ \Phi^{-1}\left(\frac{k+1}{R+1}\right) - \Phi^{-1}\left(\frac{k}{R+1}\right) } \left( t^*_{(k+1)} - t^*_{(k)} \right), \qquad k = [(R+1)\alpha].    (5.8)

The same interpolation can be applied to the z^*'s. Clearly such interpolations fail if k = 0, R or R + 1.
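In R the boot package's boot.ci function provides these limits (types "basic" and "stud"); the following sketch of our own shows the calculation directly from simulated values, including the interpolation (5.8).

q.interp <- function(x, alpha)
{ # the ((R+1)*alpha)th order statistic, interpolated as in (5.8)
  x <- sort(x); R <- length(x)
  k <- floor((R + 1)*alpha)
  if (k < 1 || k >= R) stop("alpha too extreme for this R")
  w <- (qnorm(alpha) - qnorm(k/(R + 1))) /
       (qnorm((k + 1)/(R + 1)) - qnorm(k/(R + 1)))
  x[k] + w*(x[k + 1] - x[k]) }
basic.ci <- function(t0, t.star, alpha = 0.025)          # (5.6)
  c(2*t0 - q.interp(t.star, 1 - alpha), 2*t0 - q.interp(t.star, alpha))
student.ci <- function(t0, v0, z.star, alpha = 0.025)    # (5.7)
  c(t0 - sqrt(v0)*q.interp(z.star, 1 - alpha),
    t0 - sqrt(v0)*q.interp(z.star, alpha))

When (R+1)\alpha is an integer, q.interp reduces to the exact order statistic.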
Parameter transformation
The normal approximation method may fail to work well because it is being applied on the wrong scale, in which case it should help to apply the approximation on an appropriately transformed scale. Skewness in the distribution of T is often associated with \mathrm{var}(T) varying with \theta. For this reason the accuracy of normal approximation is often improved by transforming the parameter scale to stabilize the variance of the estimator, especially if the transformed scale is the whole real line. The accuracy of the basic bootstrap confidence limits (5.6) will also tend to be improved by use of such a transformation.
Suppose that we make a monotone increasing transformation of the parameter scale from \theta to \eta = h(\theta), and then transform t correspondingly to u = h(t). Any confidence limit method can be applied for \eta, and untransforming the results will give confidence limits for \theta. For example, consider applying the normal approximation limits (5.4) for \eta. By the delta method (Section 2.7.1) the variance approximation v for T transforms to

\mathrm{var}(U) = \mathrm{var}\{ h(T) \} = \{ \dot h(t) \}^2 v = v_\eta,

say, where \dot h(\theta) = dh(\theta)/d\theta. Then the confidence limits for \eta are h(t) + v_\eta^{1/2} z_\alpha and h(t) + v_\eta^{1/2} z_{1-\alpha}, which transform back to the limits

\hat\theta_\alpha, \hat\theta_{1-\alpha} = h^{-1}\{ h(t) + v_\eta^{1/2} z_\alpha \}, \; h^{-1}\{ h(t) + v_\eta^{1/2} z_{1-\alpha} \}.    (5.9)

Similarly the basic bootstrap confidence limits (5.6) become

\hat\theta_\alpha = h^{-1}\{ 2h(t) - h(t^*_{((R+1)(1-\alpha))}) \}, \qquad \hat\theta_{1-\alpha} = h^{-1}\{ 2h(t) - h(t^*_{((R+1)\alpha)}) \}.    (5.10)

Whether or not the normal approximation is improved by transformation can be judged from a normal Q-Q plot of simulated h(t^*) values.
How do we determine an appropriate transformation h(\cdot)? If \mathrm{var}(T) is exactly or approximately equal to the known function v(\theta), then the variance-stabilizing transformation is defined by (2.14); see Problem 5.2 for an example. If no theoretical approximation exists for \mathrm{var}(T), then we can apply the empirical method outlined in Section 3.9.2. A simpler empirical approach which sometimes works is to make normal Q-Q plots of h(t^*) for candidate transformations.
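As a small illustration, the transformed basic limits (5.10) are easily coded for a monotone increasing h; the sketch below (ours) reuses q.interp from the earlier sketch, with the logarithm as the default transformation.

basic.ci.h <- function(t0, t.star, alpha = 0.025, h = log, hinv = exp)
  # transformed basic bootstrap limits (5.10)
  c(hinv(2*h(t0) - q.interp(h(t.star), 1 - alpha)),
    hinv(2*h(t0) - q.interp(h(t.star), alpha)))

For instance, with t0 a sample mean and t.star its parametric bootstrap replicates, basic.ci.h performs the log-scale calculation used in Example 5.1 below.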
It is important to stress that the use of transformation can improve the basic bootstrap method considerably. Nevertheless it may still be beneficial to use the studentized method, after transformation. Indeed there is strong empirical evidence that the studentized method is improved by working on a scale with stable approximate variance. The studentized transformed estimator is

Z_h = \frac{ h(T) - h(\theta) }{ |\dot h(T)| V^{1/2} }.

Given R values of the bootstrap quantity z_h^* = \{ h(t^*) - h(t) \} / \{ |\dot h(t^*)| v^{*1/2} \}, the analogue of (5.10) is given by

\hat\theta_\alpha = h^{-1}\{ h(t) - |\dot h(t)| v^{1/2} z^*_{h((R+1)(1-\alpha))} \}, \qquad \hat\theta_{1-\alpha} = h^{-1}\{ h(t) - |\dot h(t)| v^{1/2} z^*_{h((R+1)\alpha)} \}.    (5.11)

Note that if h(\cdot) is given by (2.14) with no constant multiplier and V = v(T), then the denominator of z_h^* and the multiplier |\dot h(t)| v^{1/2} in (5.11) are both unity.
Likelihood ratio methods
When likelihood estimation is used, in principle the normal approximation confidence limits (5.4) are inferior to likelihood ratio limits. Suppose that the scalar \theta is the only unknown parameter in the model, and define the log likelihood ratio statistic

w(\theta) = 2\{ \ell(\hat\theta) - \ell(\theta) \},

where \ell(\theta) is the log likelihood function. Quite generally the distribution of W(\theta) is approximately chi-squared, with one degree of freedom since \theta is a scalar. So a 1 - 2\alpha approximate confidence region is

C_{1-2\alpha} = \{ \theta : w(\theta) \le c_{1,1-2\alpha} \},    (5.12)

where c_{1,p} is the p quantile of the \chi^2_1 distribution. This confidence region need not be a single interval, although usually it will be, and the left- and right-tail errors need not be even approximately equal. Separate lower and upper confidence limits can be defined using

z(\theta) = \mathrm{sgn}(\theta - \hat\theta) \sqrt{w(\theta)},

where \mathrm{sgn}(u) = u/|u| is the sign function; z(\theta) is approximately N(0,1). The resulting confidence limits are defined implicitly by

z(\hat\theta_\alpha) = z_\alpha, \qquad z(\hat\theta_{1-\alpha}) = z_{1-\alpha}.    (5.13)

When the model includes other unknown parameters \lambda, also estimated by maximum likelihood, w(\theta) is calculated by replacing \ell(\theta) with the profile log likelihood \ell_{prof}(\theta) = \sup_\lambda \ell(\theta, \lambda).
These methods are invariant with respect to use of parameter transformation.


In most applications the accuracy will be very good, provided the model is correct, but it may nevertheless be sensible to consider replacing the theoretical quantiles by bootstrap approximations. Whether or not this is worthwhile can be judged from a chi-squared Q-Q plot of simulated values of

w^*(\hat\theta) = 2\{ \ell^*(\hat\theta^*) - \ell^*(\hat\theta) \},

where \ell^* is the log likelihood for a set of data simulated using \hat\theta, for which the MLE is \hat\theta^*, or from a normal Q-Q plot of the corresponding values of z^*(\hat\theta).
Example 5.1 (Air-conditioning data) The data of Example 1.1 were used to illustrate various features of parametric resampling in Chapter 2. Here we look at confidence limit calculations for the underlying mean failure time \mu under the exponential model for these data. The example is convenient in that there is an exact solution against which to compare the various approximations.
For the normal approximation method we use an estimate of the exact variance of the estimator T = \bar Y, namely v = n^{-1} \bar y^2. The observed value of \bar y is 108.083 and n = 12, so v = (31.20)^2. Then the 95% confidence interval limits given by (5.4) with \alpha = 0.025 are

108.083 \mp 31.20 \times 1.96 = 46.9 and 169.2.

These contrast sharply with the exact limits 65.9 and 209.2.
Transformation to the variance-stabilizing logarithmic scale does improve the normal approximation. Application of (2.14) with v(\mu) = n^{-1}\mu^2 gives h(t) = \log t, if we drop the multiplier n^{1/2}, and the approximate variance transforms to n^{-1}. The 95% confidence interval limits given by (5.9) are

\exp\{ \log(108.083) \mp 12^{-1/2} \times 1.96 \} = 61.4 and 190.3.

While a considerable improvement, the results are still not very close to the exact solution. A partial explanation for this is that there is a bias in \log(T) and the variance approximation is no longer equal to the exact variance. Use of bootstrap estimates for the bias and variance of \log(T), with R = 999, gives limits 58.1 and 228.8.
For the basic bootstrap confidence limits we use R = 999 simulations under the fitted exponential model, samples of size n = 12 being generated from the exponential distribution with mean 108.083; see Example 2.6. The relevant ordered values of \bar y^* are the (999+1) \times 0.025th and (999+1) \times 0.975th, i.e. the 25th and 975th, which in our simulation were 53.3 and 176.4. The 95% confidence limits obtained from (5.6) are therefore

2 \times 108.083 - 176.4 = 39.8, \qquad 2 \times 108.083 - 53.3 = 162.9.

These are no better than the normal approximation limits. However, application of the same method on the logarithmic scale gives much better results:

using the same ordered values of ȳ* in (5.10) we obtain the limits

exp{2 log(108.083) − log(176.4)} = 66.2,  exp{2 log(108.083) − log(53.4)} = 218.8.

In fact these are simulation approximations to the exact limits, which are based on the exact gamma distribution of Ȳ/μ. The same results are obtained using the studentized bootstrap limits (5.7) in this case, because z = n^{1/2}(ȳ − μ)/ȳ is a monotone function of log(ȳ) − log(μ) = log(ȳ/μ). Equation (5.11) also gives these results.

Note that if we had used R = 99, then the bootstrap confidence limits would have required interpolation, because (99+1)×0.025 = 2.5, which is not an integer. The application of (5.8) would be

t*_(2.5) = t*_(2) + {Φ⁻¹(0.025) − Φ⁻¹(0.020)}/{Φ⁻¹(0.030) − Φ⁻¹(0.020)} × (t*_(3) − t*_(2)).

This involves quite extreme ordered values and so is somewhat unstable. The likelihood ratio method gives good results here, even using the chi-squared approximation.
Broadly similar comparisons among the methods apply under the more complicated gamma model for these data. As the comparisons made in Example 2.9 would predict, results for the gamma model are similar to those for nonparametric resampling, which are discussed in the next example.

5.2.2 Nonparametric models

When no model is assumed for the data distribution, we are in the situation of Section 2.3, provided the data form a single homogeneous sample. Initially we assume that this is the case. Most of the methods just discussed for parametric models extend to this nonparametric situation with little difficulty. The major exception is the likelihood ratio method, which we postpone to Chapter 10.
Normal approximation
The simplest method is again to use a normal approximation, now with a nonparametric estimate of variance such as that provided by the nonparametric delta method described in Section 2.7.2. If l_j represents the empirical influence value for the jth case y_j, then the approximate variance is v_L = n⁻² Σ l_j², so the nonparametric analogue of (5.4) for the limits of a 1 − 2α confidence interval for θ is

t + v_L^{1/2} z_α,  t + v_L^{1/2} z_{1−α}.   (5.14)

Section 2.7 outlines various ways of calculating or approximating the influence values.
If a small nonparametric bootstrap has been run to produce bias and variance estimates b_R and v_R, as described in Section 2.3, then the corresponding approximate 1 − 2α confidence interval is

t − b_R + v_R^{1/2} z_α,  t − b_R + v_R^{1/2} z_{1−α}.   (5.15)

In general we should expect this to be more accurate, provided R was large enough.
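As a concrete illustration, the following short Python sketch (our own code, written for this account rather than taken from any library) computes the limits (5.14) when t is the sample mean, for which the empirical influence values are simply l_j = y_j − ȳ; for other statistics the l_j would be obtained by one of the methods of Section 2.7.

```python
import numpy as np
from scipy.stats import norm

def normal_approx_ci(y, alpha=0.025):
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = y.mean()
    l = y - t                        # empirical influence values for the mean
    vL = np.sum(l**2) / n**2         # nonparametric delta method variance
    half = np.sqrt(vL) * norm.ppf(1 - alpha)
    return t - half, t + half

y = [3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487]   # data of Example 1.1
print(normal_approx_ci(y))          # roughly (34.3, 181.9), cf. Example 5.2
```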
Basic and studentized bootstrap methods
For the basic bootstrap method, the only change from the parametric case is that the simulation model is the EDF F̂. Otherwise equation (5.6) still applies. Whether or not the bootstrap method is likely to give improvement over the normal approximation method can again be judged from a normal Q-Q plot of the t* values. Simulated resample values do give estimates of bias and variance which provide the more accurate normal approximation limits (5.15).
The studentized bootstrap method with confidence limits (5.7) likewise applies here. If the nonparametric delta method variance estimate v_L is used for v, then those confidence limits become

t − v_L^{1/2} z*_{((R+1)(1−α))},  t − v_L^{1/2} z*_{((R+1)α)},   (5.16)

where now z* = (t* − t)/v_L^{*1/2}. Note that the influence values must be recomputed for each bootstrap sample, because in expanded notation l_j = l(y_j; F̂) depends upon the EDF of the sample. Therefore

v_L* = n⁻² Σ_{j=1}^n l(y_j*; F̂*)²,

where F̂* is the EDF of the bootstrap sample. A simple approximation to v_L* can be made by substituting the approximation

l(y_j*; F̂*) ≈ l(y_j*; F̂) − n⁻¹ Σ_{k=1}^n l(y_k*; F̂),

but this is unreliable unless t is approximately linear; see Section 2.7.5 and Problem 2.20.
As in the parametric case, one might consider making a bias adjustment in the numerator of z*, for example based on the empirical second derivatives of t. However, this rarely seems effective, and in any event an approximate adjustment is implicitly made in the bootstrap distribution of Z*.
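The following Python sketch, again for the sample mean only, implements the basic limits (5.6) and the studentized limits (5.16), recomputing the delta method variance within each bootstrap sample as just described; the index arithmetic assumes (R+1)α is an integer, and the seed is arbitrary.

```python
import numpy as np

def basic_and_studentized_ci(y, R=999, alpha=0.025, seed=1):
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n, t = len(y), np.mean(y)
    vL = np.sum((y - t)**2) / n**2
    tstar = np.empty(R)
    zstar = np.empty(R)
    for r in range(R):
        ys = rng.choice(y, size=n, replace=True)
        tstar[r] = ys.mean()
        vs = np.sum((ys - tstar[r])**2) / n**2   # v*_L, recomputed each time
        zstar[r] = (tstar[r] - t) / np.sqrt(vs)
    tstar.sort(); zstar.sort()
    lo = round((R + 1) * alpha) - 1              # ((R+1)alpha)th ordered value
    hi = round((R + 1) * (1 - alpha)) - 1
    basic = (2*t - tstar[hi], 2*t - tstar[lo])
    studentized = (t - np.sqrt(vL)*zstar[hi], t - np.sqrt(vL)*zstar[lo])
    return basic, studentized
```

Applied to the air-conditioning data this gives intervals comparable to those found in Example 5.2 below, although of course a different simulation gives slightly different limits.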
Example 5.2 (Air-conditioning data, continued)  For the data of Example 1.1, confidence limits for the mean were calculated under an exponential model in Example 5.1. Here we apply nonparametric methods, simulated datasets being obtained by sampling with replacement from the data.

For the normal approximation, we use the nonparametric delta method estimate v_L = n⁻² Σ(y_j − ȳ)², whose data value is 1417.715 = (37.65)². So the approximate 95% confidence interval is

108.083 ∓ 37.65 × 1.96 = 34.3 and 181.9.

This, as with most of the numerical results here, is very similar to what is obtained under parametric analysis with the best-fitting gamma model; see Example 2.9.

For the basic bootstrap method with R = 999 simulated datasets, the 25th and 975th ordered values of ȳ* are 43.92 and 192.08, so the limits of the 95% confidence interval are

2(108.083) − 192.08 = 24.1 and 2(108.083) − 43.92 = 172.3.

This is not obviously a poor result, unless compared with results for the gamma model (likelihood ratio limits 57 and 243), but the corresponding 99% interval has lower limit −27.3, which is clearly very bad! The studentized bootstrap fares better: the 25th and 975th ordered values of z* are −5.21 and 1.66, so that application of (5.7) gives 95% interval limits

108.083 − 37.65 × 1.66 = 45.7 and 108.083 − 37.65 × (−5.21) = 304.2.
But are these last results adequate, and how can we tell? The first part of this question we can answer both by comparison with the gamma model results, and by applying methods on the logarithmic scale, which we know is appropriate here. The basic bootstrap method gives 95% limits 66.2 and 218.8 when the log scale is used. So it would appear that the studentized bootstrap method limits are too wide here, but otherwise are adequate. If the studentized bootstrap method is applied in conjunction with the logarithmic transformation, the limits become 50.5 and 346.9.

How would we know in practice that the logarithmic transformation of T is appropriate, other than from experience with similar data? One way to answer this is to plot v_L* versus t*, as a surrogate for a variance-parameter plot, as suggested in Section 3.9.2. For this particular dataset, the equivalent plot of standard errors v_L^{*1/2} is shown in the left panel of Figure 5.1 and strongly suggests that variance is approximately proportional to squared parameter, as it is under the parametric model. From this we would deduce, using (2.14), that the logarithmic transformation should approximately stabilize the variance. The right panel of the figure, which gives the corresponding plot for log-transformed estimates, shows that the transformation is quite successful.

Parameter transformation
For suitably smooth statistics, the consistency of the studentized bootstrap method is essentially guaranteed by the consistency of the variance estimate V. In principle the method is more accurate than the basic bootstrap method,

[Figure 5.1  Air-conditioning data: nonparametric delta method standard errors for t = ȳ (left panel) and for log(t) (right panel) in R = 999 nonparametric bootstrap samples.]

as we shall see in Section 5.4. However, variance approximations such as v_L can be somewhat unstable for small n, as in the previous example with n = 12. Experience suggests that the method is most effective when θ is essentially a location parameter, which is approximately induced by the variance-stabilizing transformation (2.14). However, this requires knowing the variance function v(θ) = var(T | F), which is never available in the nonparametric case.

A suitable transformation may sometimes be suggested by analogy with a parametric problem, as in the previous example. Then equations (5.10) and (5.11) will apply without change. Otherwise, a transformation can be obtained empirically using the technique described in Section 3.9.2, using either nested bootstrap estimates v_r* or delta method estimates v_{L,r}* with which to estimate values of the variance function v(θ). Equation (5.10) will then apply with the estimated transformation ĥ(·) in place of h(·). For the studentized bootstrap interval (5.11), if the transformation is determined empirically by (3.40), then studentized values of the transformed estimates ĥ(t_r*) are

z̃_r* = v̂(t_r*)^{1/2} {ĥ(t_r*) − ĥ(t)} / v_r^{*1/2}.

On the original scale the (1 − 2α) studentized interval has endpoints

ĥ⁻¹[ĥ(t) − v^{1/2} v̂(t)^{−1/2} z̃*_{((1−α)(R+1))}],  ĥ⁻¹[ĥ(t) − v^{1/2} v̂(t)^{−1/2} z̃*_{(α(R+1))}].   (5.17)

In general it is wise to use the studentized interval even after transformation.
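To make the empirical transformation concrete, here is one possible Python sketch: the variance function is estimated by a quadratic regression of log v* on t* (a convenient assumption, standing in for the smoothing of Section 3.9.2), ĥ is obtained by numerical integration of v̂^{−1/2}, and the endpoints are those of (5.17). None of the names here come from a published routine.

```python
import numpy as np

def transformed_studentized_ci(t, vL, tstar, vstar, alpha=0.025):
    # Smooth variance function: regress log v* on a quadratic in t*.
    coef = np.polyfit(tstar, np.log(vstar), 2)
    vhat = lambda u: np.exp(np.polyval(coef, u))
    # hhat(u) = integral of vhat(x)^(-1/2) dx, by the trapezoidal rule.
    grid = np.linspace(tstar.min(), tstar.max(), 400)
    dh = vhat(grid) ** -0.5
    H = np.concatenate(([0.0], np.cumsum(0.5*(dh[1:] + dh[:-1])*np.diff(grid))))
    hhat = lambda u: np.interp(u, grid, H)
    hinv = lambda w: np.interp(w, H, grid)       # inverse of hhat on the grid
    # Studentized transformed estimates, sorted, and the interval (5.17).
    ztil = np.sort(vhat(tstar)**0.5 * (hhat(tstar) - hhat(t)) / vstar**0.5)
    R = len(tstar)
    lo, hi = round((R+1)*alpha) - 1, round((R+1)*(1-alpha)) - 1
    scale = vL**0.5 / vhat(t)**0.5
    return hinv(hhat(t) - scale*ztil[hi]), hinv(hhat(t) - scale*ztil[lo])
```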
Example 5.3 (City population data)  For the data of Example 2.8, with ratio θ estimated by t = x̄/ū, we discussed empirical choice of transformation in Example 3.23. Application of the empirical transformation illustrated in Figure 3.11 with the studentized bootstrap limits (5.17) leads to the 95% interval [1.23, 2.25]. This is similar to the 95% interval based on the ĥ(t*) − ĥ(t), namely [1.27, 2.21], while the studentized bootstrap interval on the original scale is [1.12, 1.88]. The effect of the transformation is to make the interval more like those from the percentile methods described in the following section.

To compare the studentized methods, we took 500 samples of size 10 without replacement from the full city population data in Table 1.3. Then for each sample we calculated 90% studentized bootstrap intervals on the original scale, and on the transformed scale with and without using the transformed standard error; this last interval is the basic bootstrap interval on the transformed scale. The coverages were respectively 90.4, 88.2, and 86.4%, to be compared to the ideal 90%. The first two are not significantly different, but the last is rather smaller, suggesting that it can be worthwhile to studentize on the transformed scale, when this is possible. The drawback is that studentized intervals that use the transformed scale tend to be longer than on the original scale, and their lengths are more variable.

5.2.3 Choice of R
What has been said about simulation size in earlier chapters, especially in Section 4.2.5, applies here. In particular, if confidence levels 0.95 and 0.99 are to be used, then it is advisable to have R = 999 or more, if practically feasible. Problem 5.5 outlines some relevant theoretical calculations.

5.3 Percentile Methods

We have seen in Section 5.2 that simple confidence limit methods can be made more accurate by working on a transformed scale. In many cases it is possible to use simulation results to get a reasonable idea of what a sensible transformation might be. A quite different approach is to find a method which implicitly uses the existence of a good transformation, but does not require that the transformation be found. This is what the percentile method and its modifications try to do.

5.3.1 Basic percentile method

Suppose that there is some unknown transformation of T, say U = h(T), which has a symmetric distribution. Imagine that we knew h and calculated a 1 − 2α confidence interval for φ = h(θ) by applying the basic bootstrap method (5.6), except that we first use the symmetry to write a_α = −a_{1−α} in the basic equation (5.3) as it applies to U = h(T). This would mean that in applying (5.3) we would take û − u*_{((R+1)(1−α))} instead of u*_{((R+1)α)} − û, and û − u*_{((R+1)α)} instead of u*_{((R+1)(1−α))} − û, to estimate the α and 1 − α quantiles of U − φ. This swap would change the confidence interval limits (5.6) to

u*_{((R+1)α)},  u*_{((R+1)(1−α))},

whose transformation back to the θ scale is

t*_{((R+1)α)},  t*_{((R+1)(1−α))}.   (5.18)

Remarkably this 1 − 2α interval for θ does not involve h at all, and so can be computed without knowing h. The interval (5.18) is known as the bootstrap percentile interval, and was initially recommended in place of (5.6).

As with most bootstrap methods, the percentile method applies for both parametric and nonparametric bootstrap sampling. Perhaps surprisingly, the method turns out not to work very well with the nonparametric bootstrap even when a suitable transformation h does exist. However, adjustments to the percentile method described below are successful for many statistics.
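Computationally the percentile interval could hardly be simpler; a short Python sketch, assuming (R+1)α is an integer:

```python
import numpy as np

def percentile_ci(tstar, alpha=0.025):
    tstar = np.sort(np.asarray(tstar))
    R = len(tstar)
    return tstar[round((R+1)*alpha) - 1], tstar[round((R+1)*(1-alpha)) - 1]
```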
Example 5.4 (Air-conditioning data, continued)  For the air-conditioning data discussed in Examples 5.1 and 5.2, the percentile method gives 95% intervals [70.8, 148.4] under the exponential model and [43.9, 192.1] under the nonparametric model. Neither is satisfactory, compared to accurate intervals such as the basic bootstrap interval using the logarithmic transformation.

5.3.2 Adjusted percentile method

For the percentile method to work well, it would be necessary that T be unbiased on the transformed scale, so that the swap of quantile estimates be correct. This does not usually happen. Also the method carries the defect of the basic bootstrap method, that the shape of the distribution of T* changes as the sampling distribution changes from F to F̂, even after transformation. In particular, the implied symmetrizing transformation often will not be quite the same as the variance-stabilizing transformation; this is the cause of the poor performance of the percentile method in Example 5.4. These difficulties need to be overcome if the percentile method is to be made accurate.

Parametric case with no nuisance parameters
We assume to begin with that the data are described by a parametric model with just the single unknown parameter θ, which is estimated by the maximum likelihood estimate t = θ̂. In order to develop the adjusted percentile method we make the simplifying assumption that for some unknown transformation h(·), unknown bias correction factor w and unknown skewness correction factor a, the transformed estimator U = h(T) for φ = h(θ) is normally distributed,

U ∼ N(φ − w σ(φ), σ²(φ)),  with σ(φ) = 1 + aφ.   (5.19)


In fact this is an improved normal approximation, after applying the (unknown) normalizing transformation which eliminates the leading term in a skewness approximation. The usual factor n⁻¹ has been taken out of the variance by scaling h(·) appropriately, so that both a and w will typically be of order n^{−1/2}. The use of a and w is analogous to the use of Bartlett correction factors in likelihood inference for parametric models.

The essence of the method is to calculate confidence limits for φ and then transform these back to the θ scale using the bootstrap distribution of T. To begin with, suppose that a and w are known, and write

U = φ + (1 + aφ)(Z − w),

where Z has the N(0, 1) distribution with α quantile z_α. It follows that

log(1 + aU) = log(1 + aφ) + log{1 + a(Z − w)},

which is monotone increasing in φ. Therefore substitution of z_α for Z and û for U in this equation identifies the α confidence limit for φ, which is

φ̂_α = û + σ(û) (w + z_α)/{1 − a(w + z_α)}.

Now the α confidence limit for θ is θ̂_α = h⁻¹(φ̂_α), but h(·) is unknown. However, if we denote the distribution function of T* by Ĝ, then

Ĝ(θ̂_α) = Pr*(T* ≤ θ̂_α | t) = Pr*(U* ≤ φ̂_α | û) = Φ(w + (w + z_α)/{1 − a(w + z_α)}),

which is known. Therefore the α confidence limit for θ is

θ̂_α = Ĝ⁻¹[Φ(w + (w + z_α)/{1 − a(w + z_α)})],   (5.20)

which expressed in terms of simulation values is

θ̂_α = t*_{((R+1)α̃)},  α̃ = Φ(w + (w + z_α)/{1 − a(w + z_α)}).   (5.21)

These limits are usually referred to as BCa confidence limits. Note that they share the transformation invariance property of percentile confidence limits. The use of Ĝ overcomes lack of knowledge of the transformation h. The values of a and w are unknown, of course, but they can be easily estimated. For w we can use the initial normal approximation (5.19) for U to write

Pr*(T* ≤ t | t) = Pr*(U* ≤ û | û) = Pr(U ≤ φ | φ) = Φ(w),

so that

w = Φ⁻¹{Ĝ(t)}.   (5.22)

In terms of simulation values,

w = Φ⁻¹( #{t_r* ≤ t} / (R + 1) ),

where # denotes the number of times the event occurs.
The value of a can be determined informally using (5.19). Thus if ℓ(φ) denotes the log likelihood defined by (5.19), with derivative ℓ̇(φ), then it is easy to show that

E{ℓ̇(φ)³} / var{ℓ̇(φ)}^{3/2} = 6a,

ignoring terms of order n⁻¹. But the ratio on the left of this equation is invariant under parameter transformation. So we transform back from φ to θ and deduce that, still ignoring terms of order n⁻¹,

a = (1/6) E{ℓ̇(θ)³} / var{ℓ̇(θ)}^{3/2}.

To calculate a we approximate the moments of ℓ̇(θ) by those of ℓ̇*(θ̂) under the fitted model with parameter value θ̂, so that the skewness correction factor is

a = (1/6) E*{ℓ̇*(θ̂)³} / var*{ℓ̇*(θ̂)}^{3/2},   (5.23)

where ℓ* is the log likelihood of a set of data simulated from the fitted model. More generally a is one-sixth the standardized skewness of the linear approximation to T.
One potential problem with the BCa method is that if α̃ in (5.21) is much closer to 0 or 1 than α, then (R+1)α̃ could be less than 1 or greater than R, so that even with interpolation the relevant quantile cannot be calculated. If this happens, and if R cannot be increased, then it would be appropriate to quote the extreme value of t* and the implied value of α. For example, if (R+1)α̃ > R, then the upper confidence limit t*_{(R)} would be given, with implied right-tail error α₂ equal to one minus the solution of α̃ = R/(R+1).
Example 5.5 (Air-conditioning data, continued)  Returning to the problem of Example 5.4 and the exponential bootstrap results for R = 999, we find that the number of ȳ* values below ȳ = 108.083 is 535, so by (5.22) w = Φ⁻¹(0.535) = 0.0878. The log likelihood function is ℓ(μ) = −n log μ − μ⁻¹ Σ y_j, whose derivative is

ℓ̇(μ) = Σ y_j / μ² − n/μ.

The second and third moments of ℓ̇(μ) are nμ⁻² and 2nμ⁻³, so by (5.23) a = (1/3) n^{−1/2} = 0.0962.


The calculation of the adjusted percentile limits (5.21) is illustrated in Table 5.1. The values of r = (R+1)α̃ are not integers, so we have applied the interpolation formula (5.8).

Table 5.1  Calculation of adjusted percentile bootstrap confidence limits for μ with the data of Example 1.1, under the parametric exponential model with R = 999; a = 0.0962, w = 0.0878.

  α        z̃_α = w + z_α    α̃ = Φ(w + z̃_α/(1 − a z̃_α))    r = (R+1)α̃    t*_(r)
  0.025    −1.872           0.067                           67.00          65.26
  0.975     2.048           0.996                          995.83         199.41
  0.050    −1.557           0.103                          102.71          71.19
  0.950     1.733           0.985                          984.89         182.42
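The arithmetic of Table 5.1 is easily automated. The Python sketch below takes w and a as given and returns the BCa limit (5.21), applying the interpolation (5.8) when r = (R+1)α̃ is not an integer; with w = 0.0878, a = 0.0962 and the ordered t* values of the exponential simulation it reproduces the limits in the table.

```python
import numpy as np
from scipy.stats import norm

def bca_limit(tstar, w, a, alpha):
    tstar = np.sort(np.asarray(tstar))
    R = len(tstar)
    zt = w + norm.ppf(alpha)
    atilde = norm.cdf(w + zt / (1 - a * zt))     # adjusted level, as in (5.21)
    r = (R + 1) * atilde
    k = int(np.floor(r))
    if k < 1 or k >= R:
        raise ValueError("adjusted level too extreme for this R")
    # interpolation (5.8) between the kth and (k+1)th ordered values
    frac = (norm.ppf(atilde) - norm.ppf(k / (R + 1))) / \
           (norm.ppf((k + 1) / (R + 1)) - norm.ppf(k / (R + 1)))
    return tstar[k - 1] + frac * (tstar[k] - tstar[k - 1])
```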
Had we tried to calculate a 99% interval, we should have had to calculate the 999.88th ordered value of t*, which does not exist. The implied right-tail error for t*_(999) is the value α₂ which solves

999/1000 = Φ(0.0878 + (0.0878 + z_{1−α₂})/{1 − 0.0962(0.0878 + z_{1−α₂})}),

namely α₂ = 0.0125.

Parametric case with nuisance parameters

When θ is one of several unknown parameters, the previous development applies to a derived distribution called the least-favourable family. As usual we denote the nuisance parameters by λ and write ψ = (θ, λ). If the log likelihood function for ψ based on all the data is ℓ(ψ), then the expected Fisher information matrix is i(ψ) = E{−ℓ̈(ψ)}, where ℓ̈(ψ) is ∂²ℓ(ψ)/∂ψ∂ψ^T. Now define δ = i⁻¹(ψ̂)(1, 0, ..., 0)^T. Then the least-favourable family of distributions is the one-parameter family obtained from the original model by restricting ψ to the curve ψ̂ + ξδ. With this restriction, the log likelihood is

ℓ_LF(ξ) = ℓ(ψ̂ + ξδ),

akin to the profile log likelihood for θ. The MLE of ξ is ξ̂ = 0.

The bias-corrected percentile method is now applied to the least-favourable family. Equations (5.21) and (5.22) still apply. The only change in the calculations is to the skewness correction factor a, which becomes

a = (1/6) E*{ℓ̇*_LF(0)³} / var*{ℓ̇*_LF(0)}^{3/2}.   (5.24)

In this expression the parameter estimates ψ̂ are regarded as fixed, and the moments are calculated under the fitted model.

A somewhat simpler expression for a can be obtained by noting that ℓ̇_LF(0) is proportional to the influence function for t. The result in Problem 2.12 shows that

L_t(y_j; F_ψ) = n i¹(ψ) ℓ̇(ψ; y_j),


where i¹(ψ) is the first row of the inverse of i(ψ) and ℓ̇(ψ; y_j) is the contribution to ℓ̇(ψ) from the jth case. We can then rewrite (5.24) as

a = (1/6) E*(L*³) / var*(L*)^{3/2},   (5.25)

where

L* = n i¹(ψ̂) ℓ̇(ψ̂; Y*)

and Y* follows the fitted distribution with parameter value ψ̂. As before, to first order a is one-sixth the estimated standardized skewness of the linear approximation to t. In the form given, (5.25) will apply also to nonhomogeneous data.

The BCa method can be extended to any smooth function of the original model parameters ψ; see Problem 5.7.
Example 5.6 (Air-conditioning data, continued)  We now replace the exponential model used in the previous example, for the air-conditioning data of Example 1.1, with the two-parameter gamma model. The parameters are θ = μ and λ = κ, the first still being the parameter of interest. The log likelihood function is

ℓ(μ, κ) = nκ log(κ/μ) + (κ − 1) Σ log y_j − κ Σ y_j/μ − n log Γ(κ).

The information matrix is diagonal, so that the least-favourable family is the original gamma family with κ fixed at κ̂ = 0.7065. It follows quite easily that

ℓ̇*_LF(0) ∝ Ȳ* − μ̂,

and so a is one-sixth of the skewness of the sample average under the fitted gamma model, that is a = (1/3)(nκ̂)^{−1/2}. The same result is obtained somewhat more easily via (5.25), since we know that the influence function for the mean is L_t(y; F) = y − μ.

The numerical values of a and w for these data are 0.1145 and 0.1372 respectively, the latter from R = 999 simulated samples. Using these we compute the adjusted percentile bootstrap confidence limits as in Table 5.2.

Table 5.2  Calculation of adjusted percentile bootstrap confidence limits for μ with the data of Example 1.1, under the parametric gamma model with μ̂ = 108.0833, κ̂ = 0.7065 and R = 999; a = 0.1145, w = 0.1372.

  α        z̃_α = w + z_α    α̃ = Φ(w + z̃_α/(1 − a z̃_α))    r = (R+1)α̃    t*_(r)
  0.025    −1.823           0.085                           85.20          62.97
  0.975     2.097           0.998                          998.11         226.00
  0.050    −1.508           0.125                          125.36          67.25
  0.950     1.782           0.991                          991.25         208.00


Just how flexible is the BCa method? The following example presents a difficult challenge for all bootstrap methods, and illustrates how well the studentized bootstrap and BCa methods can compensate for weaknesses in the more primitive methods.
Example 5.7 (Normal variance estimation)  Suppose that we have independent samples (y_{i1}, ..., y_{im}), i = 1, ..., k, from normal distributions with different means λ_i but common variance θ, the latter being the parameter of interest. The maximum likelihood estimator of the variance is t = n⁻¹ Σ_{i=1}^k Σ_{j=1}^m (y_{ij} − ȳ_i)², where n = mk and ȳ_i is the average of y_{i1}, ..., y_{im}. In practice the more usual estimate would be the pooled mean square, with denominator d = k(m − 1) rather than n, but here we leave the bias of T intact to see how well the bootstrap methods can cope.

The distribution of T is n⁻¹θχ²_d. This exact result allows us both to avoid the use of simulation, and to calculate exact coverages for all the confidence limit methods. Denote the α quantile of the χ²_d distribution by c_{d,α}. Using the fact that T* = n⁻¹tχ²_d, we see that the upper α confidence limits for θ under the basic bootstrap and percentile methods are respectively

2t − n⁻¹t c_{d,1−α},   n⁻¹t c_{d,α}.

The coverages of these limits are calculated using the exact distribution of T. For example, for the basic bootstrap confidence limit

Pr(θ ≤ 2T − n⁻¹T c_{d,1−α}) = Pr{χ²_d ≥ n²/(2n − c_{d,1−α})}.

For the BCa method, (5.22) gives w = Φ⁻¹{Pr(χ²_d ≤ n)} and (5.24) gives a = (1/3)·2^{1/2}n^{−1/2}. The upper α confidence limit, calculated by (5.20) with the exact distribution for T*, is n⁻¹t c_{d,α̃}. The exact coverage of this limit is Pr(χ²_d ≥ n²/c_{d,α̃}).

Finally, for the studentized bootstrap upper α confidence limit (5.7), we first calculate the variance approximation v = 2n⁻¹t² from the expected Fisher information matrix, and then the confidence limit is nt/c_{d,1−α}. The coverage of this limit is exactly α.

Table 5.3 shows numerical values of coverages for the four methods in the case k = 10 and m = 2, where d = n/2 = 10. The results show quite dramatically first how bad the basic and percentile methods can be if used without careful thought, and secondly how well the studentized and adjusted percentile methods can do in a moderately difficult situation. Of course use of a logarithmic transformation would improve the basic bootstrap method, which would then give correct answers.
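The exact coverages in Table 5.3 can be checked numerically from the chi-squared formulae above; a short Python sketch using scipy (the studentized coverage equals α exactly, so it is not computed):

```python
from scipy.stats import chi2, norm

n, d = 20, 10                       # k = 10 samples of size m = 2
w = norm.ppf(chi2.cdf(n, d))        # bias correction for the BCa method
a = (2 ** 0.5 / 3) / n ** 0.5       # skewness correction
for alpha in (0.01, 0.025, 0.05, 0.95, 0.975, 0.99):
    c = chi2.ppf(1 - alpha, d)
    basic = chi2.sf(n**2 / (2*n - c), d) if 2*n > c else 0.0
    percentile = chi2.sf(n**2 / chi2.ppf(alpha, d), d)
    zt = w + norm.ppf(alpha)
    atilde = norm.cdf(w + zt / (1 - a * zt))
    bca = chi2.sf(n**2 / chi2.ppf(atilde, d), d)
    print(alpha, basic, percentile, bca)
```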

Table 5.3  Exact coverages (%) of confidence limits for normal variance based on the maximum likelihood estimator, for 10 samples each of size two.

  Nominal    Basic    Studentized    Percentile    BCa
   1.0        0.8        1.0           0.0           1.0
   2.5        2.5        2.5           0.0           2.5
   5.0        4.8        5.0           0.0           5.0
  95.0       35.0       95.0           1.6          91.5
  97.5       36.7       97.5           3.4         100.0
  99.0       38.3       99.0           6.9         100.0

Nonparametric case: single sample

The adjusted percentile method for the nonparametric case is developed by applying the method for the parametric case with no nuisance parameters to a specially constructed nonparametric exponential family with support on the data values, the least-favourable family derived from the multinomial distribution for frequencies of the data values under nonparametric resampling. Specifically, if l_j denotes the empirical influence value for t at y_j, then the resampling model for an individual Y* is the exponential tilted distribution

Pr(Y* = y_j) = p_j = exp(η l_j) / Σ_{k=1}^n exp(η l_k).   (5.26)

The parameter of interest θ is a monotone function of η with inverse η(θ), say. The MLE of η is η̂ = η(t) = 0, which corresponds to the EDF F̂ being the nonparametric MLE of the sampling distribution F.

The bias correction factor w is calculated as before from (5.22), but using nonparametric bootstrap simulation to obtain values of t*. The skewness correction a is given by the empirical analogue of (5.23), where now the relevant derivative is η̇(θ) = dη(θ)/dθ. When the moments needed in (5.23) are evaluated at θ̂, or equivalently at η̂ = 0, two simplifications occur. First we have E*(L*) = 0, and secondly the multiplier η̇(t) cancels when (5.23) is applied. The result is that

a = (1/6) Σ l_j³ / (Σ l_j²)^{3/2},   (5.27)

which is the direct analogue of (5.25).
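Putting (5.21), (5.22) and (5.27) together gives a compact recipe for the nonparametric BCa interval. Here is a Python sketch for the sample mean, whose influence values are l_j = y_j − ȳ; for other statistics the l_j would be computed numerically, for instance by the differencing in (5.47) below. No interpolation is applied, and extreme adjusted levels are simply truncated.

```python
import numpy as np
from scipy.stats import norm

def bca_ci(y, R=999, alpha=0.025, seed=2):
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n, t = len(y), np.mean(y)
    tstar = np.sort([rng.choice(y, n, replace=True).mean() for _ in range(R)])
    w = norm.ppf(np.sum(tstar <= t) / (R + 1))      # (5.22)
    l = y - t                                        # influence values
    a = np.sum(l**3) / (6 * np.sum(l**2)**1.5)       # (5.27)
    def limit(alph):
        zt = w + norm.ppf(alph)
        atilde = norm.cdf(w + zt / (1 - a * zt))     # (5.21)
        r = min(max(round((R + 1) * atilde), 1), R)
        return tstar[r - 1]
    return limit(alpha), limit(1 - alpha)
```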


Example 5.8 (Air-conditioning data, continued)  The nonparametric version of the calculations in the preceding example involves the same formula (5.21), but now with a = 0.0938 and w = 0.0728. The former constant is calculated from (5.27) with l_j = y_j − ȳ. The confidence limit calculations are shown in Table 5.4 for 90% and 95% intervals.

Table 5.4  Calculation of adjusted percentile bootstrap confidence limits for μ in Example 1.1 using the nonparametric bootstrap with R = 999; a = 0.0938, w = 0.0728.

  α        z̃_α = w + z_α    α̃ = Φ(w + z̃_α/(1 − a z̃_α))    r = (R+1)α̃    t*_(r)
  0.025    −1.8872          0.0629                          62.93          55.33
  0.975     2.0327          0.9951                         995.12         243.50
  0.050    −1.5721          0.0973                          97.26          61.50
  0.950     1.7176          0.9830                         983.01         202.08

If t is a smooth function of sample moments, say t = t(s̄) where s̄_i = n⁻¹ Σ_{j=1}^n s_i(y_j) for i = 1, ..., k, then (5.26) is a one-dimensional reduction of a k-dimensional exponential family for s̄₁(Y*), ..., s̄_k(Y*). By equation (2.38) the influence values l_j for t are given simply by l_j = ṫ^T{s(y_j) − s̄} with ṫ = ∂t/∂s̄.

The method as described will apply as given to any single-sample problem, and to most regression problems (Chapters 6 and 7), but not exactly to problems where statistics are based on several independent samples, including stratified samples.
Nonparametric case: several samples
In the parametric case the BCa method as described applies quite generally through the unifying likelihood function. In the nonparametric case, however, there are predictable changes in the BCa method. The background approximation methods are described in Section 3.2.1, which defines an estimator in terms of the EDFs of k samples, t = t(F̂₁, ..., F̂_k). The empirical influence values l_{ij} for j = 1, ..., n_i and i = 1, ..., k and the variance approximation v_L are defined in (3.2) and (3.3).

If we return to the origin and development of the BCa method, we see that the definition of the bias correction w in (5.22) will remain the same. The skewness correction a will again be one-sixth the estimated standardized skewness of the linear approximation to t, which here is

a = (1/6) Σ_i n_i⁻³ Σ_j l_{ij}³ / (Σ_i n_i⁻² Σ_j l_{ij}²)^{3/2}.   (5.28)

This can be verified as an application of the parametric method by constructing the least-favourable joint family of k distributions from the k multinomial distributions on the data values in the k samples.

Note that (5.28) can be expressed in the same form as (5.27) by defining l̃_{ij} = n l_{ij}/n_i, where n = Σ n_i, so that

a = Σ l̃_{ij}³ / {6 (Σ l̃_{ij}²)^{3/2}},  v_L = n⁻² Σ l̃_{ij}²;   (5.29)

see Problem 3.7. This can be helpful in writing an all-purpose algorithm for the BCa method; see also the discussion of the ABC method in the next section. An example is given at the end of the next section.

5.4 Theoretical Comparison of Methods

The studentized bootstrap and adjusted percentile methods for calculating confidence limits are inherently more accurate than the basic bootstrap and percentile methods. This is quite clear from empirical evidence. Here we look briefly at the theoretical side of the story for statistics which are approximately normal. Some aspects of the theory were discussed in Section 2.6.1. For simplicity we shall restrict most of the detailed discussion to the single-sample case, but the results generalize without much difficulty.

5.4.1 Second-order accuracy

To assess the accuracies of the various bootstrap confidence limits we calculate coverage probabilities up to the n^{−1/2} terms in series approximations, these based on corresponding approximations for the CDFs of U = (T − θ)/v^{1/2} and Z = (T − θ)/V^{1/2}. Here v is var(T) or any approximation which agrees to first order with v_L, the variance of the linear approximation to T. Similarly V is assumed to agree to first order with V_L. For example, in the scalar parametric case where T is the maximum likelihood estimator, v is the inverse of the expected Fisher information. In all of the equations in this section equality is correct to order n^{−1/2}, i.e. ignoring errors of order n⁻¹.

The relevant approximations for CDFs are the one-term Cornish-Fisher approximations

Pr(U ≤ u) = G(θ + v^{1/2}u) = Φ(u − n^{−1/2}{m₁ − (1/6)m₃ + (1/6)m₃u²}),   (5.30)

where G is the CDF of T, and

K(z) = Pr(Z ≤ z) = Φ(z − n^{−1/2}[m₁ − (1/6)m₃ − {(1/2)m₁₁ − (1/6)m₃}z²]),   (5.31)

with the constants defined by

E(U) = n^{−1/2}m₁,  E(U³) = n^{−1/2}(m₃ + 3m₁),  E{(V − v)(T − θ)} = n^{−1/2}m₁₁v^{3/2};   (5.32)

note that the skewness of U is n^{−1/2}m₃. The corresponding approximations for quantiles of T (rather than U) and Z are

G⁻¹(α) = θ + v^{1/2}z_α + n^{−1/2}v^{1/2}{m₁ − (1/6)m₃ + (1/6)m₃z_α²}   (5.33)

and

K⁻¹(α) = z_α + n^{−1/2}[m₁ − (1/6)m₃ − {(1/2)m₁₁ − (1/6)m₃}z_α²].   (5.34)


The analogous approximations apply to bootstrap distributions and quantiles, ignoring simulation error, by substituting appropriate estimates for the various constants. In fact, provided the estimates for m₁, m₃ and m₁₁ have errors of order n^{−1/2} or less, the n^{−1/2} terms in the approximations will not change. This greatly simplifies the calculations that we are about to do.

Studentized bootstrap method
Consider the studentized bootstrap confidence limit θ̂_α = t − v^{1/2}K̂⁻¹(1 − α). Since the right-hand side of (5.34) holds also for K̂⁻¹, we see that to order n^{−1/2}

θ̂_α = t − v^{1/2}z_{1−α} − n^{−1/2}v^{1/2}[m₁ − (1/6)m₃ − {(1/2)m₁₁ − (1/6)m₃}z²_{1−α}].   (5.35)

It then follows by applying (5.31) and expanding by Taylor series that

Pr(θ ≤ θ̂_α) = α + O(n⁻¹).

This property is referred to as second-order accuracy, as distinct from the first-order accuracy of the normal approximation confidence limit, whose coverage is α + O(n^{−1/2}).
Adjusted percentile method
For the adjusted percentile limit (5.20) we use (5.33) with estimated constants as the approximation for Ĝ⁻¹(·). For the normal integral inside (5.20) we can use the approximation Φ(z_α + 2w + az_α²), because a and w are of order n^{−1/2}. Then the α confidence limit (5.20) is approximately

θ̂_α = t + v^{1/2}z_α + n^{−1/2}v^{1/2}{2n^{1/2}w + m₁ − (1/6)m₃ + (n^{1/2}a + (1/6)m₃)z_α²}.   (5.36)

This will also be second-order accurate if it agrees with (5.35), which requires that to order n^{−1/2},

a = n^{−1/2}{(1/2)m₁₁ − (1/3)m₃},  w = −n^{−1/2}{m₁ − (1/6)m₃}.   (5.37)

To verify these identities we use expressions for m₁, m₃ and m₁₁ derived from the quadratic approximation for T described in Section 2.7.2. In slightly simplified notation, the quadratic approximation (2.41) for T is

T = θ + n⁻¹ Σ_j L_t(Y_j) + (1/2)n⁻² Σ_j Σ_k Q_t(Y_j, Y_k).   (5.38)

It will be helpful here and later to define the constants

a = (1/6) n⁻³ v^{−3/2} Σ_j E{L_t³(Y_j)},
b = (1/2) n⁻² Σ_j E{Q_t(Y_j, Y_j)},   (5.39)
c = (1/2) n⁻⁴ v^{−3/2} Σ_{j≠k} E{L_t(Y_j)L_t(Y_k)Q_t(Y_j, Y_k)}.

Then calculations of the first and third moments of T − θ from the quadratic approximation show that

m₁ = n^{1/2}v^{−1/2}b,  m₃ = n^{1/2}(6a + 6c).   (5.40)

For m₁₁ it is enough to take V to be the delta method estimator, which in the full notation is V_L = n⁻² Σ L_t²(Y_j; F̂). Then using the approximation

L_t(Y_j; F̂) = L_t(Y_j) − n⁻¹ Σ_{k=1}^n L_t(Y_k) + n⁻¹ Σ_{k=1}^n Q_t(Y_j, Y_k)

given in Problem 2.20, detailed calculation leads to

m₁₁ = n^{1/2}(6a + 4c).   (5.41)

The results in (5.40) and (5.41) imply the identity for a in (5.37), after noting that the definitions of a in (5.23), (5.25) and (5.27) used in the adjusted percentile method are obtained by substituting estimates for moments of the influence function. The identity for w in (5.37) is confirmed by noting that the original definition w = Φ⁻¹{Ĝ(t)} approximates Φ⁻¹{G(θ)}, which by applying (5.30) with u = 0 agrees with (5.37).
Basic and percentile methods
Similar calculations show that the basic bootstrap and percentile confidence limits are only first-order accurate. However, they are both superior to the normal approximation limits, in the sense that their equi-tailed confidence intervals are second-order accurate. For example, consider the 1 − 2α basic bootstrap confidence interval with limits

θ̂_α, θ̂_{1−α} = 2t − Ĝ⁻¹(1 − α), 2t − Ĝ⁻¹(α).

It follows from the estimated version of (5.33) that

2t − Ĝ⁻¹(1 − α) = t − v^{1/2}z_{1−α} − n^{−1/2}v^{1/2}{m₁ − (1/6)m₃ + (1/6)m₃z²_{1−α}},

and by (5.31) the error of this lower limit is

Pr{θ ≤ 2T − Ĝ⁻¹(1 − α)} = 1 − Φ(z_{1−α} + n^{−1/2}m₁₁z²_{1−α}).

Correspondingly the error of the upper limit is

Pr{θ > 2T − Ĝ⁻¹(α)} = Φ(z_α + n^{−1/2}m₁₁z_α²).

Therefore the combined coverage error of the confidence interval is

1 − Φ(z_{1−α} + n^{−1/2}m₁₁z²_{1−α}) + Φ(z_α + n^{−1/2}m₁₁z_α²),

which, after expanding in Taylor series and dropping n⁻¹ terms, and then noting that z_α² = z²_{1−α} and φ(z_α) = φ(z_{1−α}), turns out to equal

2α + n^{−1/2}m₁₁{z_α²φ(z_α) − z²_{1−α}φ(z_{1−α})} = 2α.


These results are suggestive of the behaviour that we observe in specific examples, that bootstrap methods in general are superior to normal approximation, but that only the adjusted percentile and studentized bootstrap methods correctly adjust for the effects of bias, nonconstant variance, and skewness. It would take an analysis including n⁻¹ terms to distinguish between the preferred methods, and to see the effect of transformation prior to use of the studentized bootstrap method.

5.4.2 The ABC method

It is fairly clear that, to the order n^{−1/2} considered above, there are many equivalent confidence limit methods. One of these, the ABC method, is of particular interest. The method rests on the approximation (5.35), which by using (5.40) and (5.41) can be re-expressed as

θ̂_α = t + v^{1/2}{z_α + a + c − v^{−1/2}b + (2a + c)z_α²};   (5.42)

here v has been approximated by v̂ in the definition of m₁, and we have used z_{1−α} = −z_α.

The constants a, b and c in (5.42) are defined by (5.39), in which the expectations will be estimated. Special forms of the ABC method correspond to special-case estimates of these expectations. In all cases we take v to be v_L.
Parametric case
If the estimate t is a smooth function of sample moments, as is the case for an exponential family, then the constants in (5.39) are easy to estimate. With a temporary change of notation, suppose that t = t(s̄) where s̄ = n⁻¹ Σ s(y_j) has p components, and define μ = E(S), so that θ = t(μ). Then with ṫ = ∂t(s̄)/∂s̄ and ẗ = ∂²t(s̄)/∂s̄∂s̄^T,

L_t(Y_j) = ṫ(μ)^T{s(Y_j) − μ},  Q_t(Y_j, Y_k) = {s(Y_j) − μ}^T ẗ(μ) {s(Y_k) − μ}.   (5.43)

Estimates for a, b and c can therefore be calculated using estimates for the first three moments of s(Y).

For the particular case where the distribution of S has the exponential family PDF

f(s; η) = exp{η^T s − ξ(η)},

the calculations can be simplified. First, define Σ(η) = var(S) = ξ̈(η), where ξ̈(η) is ∂²ξ(η)/∂η∂η^T. Then

v_L = ṫ(s̄)^T Σ(η̂) ṫ(s̄).

Substitution from (5.43) in (5.39), and estimation of the expectations, gives estimated constants which can be expressed simply as

â = (1/6) v_L^{−3/2} (d³/dε³) ξ(η̂ + ε ṫ(s̄)) |_{ε=0},  b̂ = (1/2) tr{ẗ(s̄)Σ(η̂)},
ĉ = (1/(2v_L^{1/2})) (d²/dε²) t(s̄ + εk) |_{ε=0},   (5.44)

where k = Σ(η̂)ṫ(s̄)/v_L^{1/2} and tr(A) denotes the trace of the square matrix A.
The confidence limit (5.42) can also be approximated by an evaluation of the statistic t, analogous to the BCa confidence limit (5.20). This follows by equating (5.42) with the right-hand side of the approximation t(s̄ + v^{1/2}e) = t(s̄) + v^{1/2}e^T ṫ(s̄), with appropriate choice of e. The result is

θ̂_α = t( s̄ + z̃_α/(1 − a z̃_α)² k ),   (5.45)

where

z̃_α = w + z_α = a + c − b v_L^{−1/2} + z_α.

In this form the ABC confidence limit is an explicit approximation to the BCa confidence limit.

If the several derivatives in (5.44) are calculated by numerical differencing, then only 4p + 4 evaluations of t are necessary, plus one for every confidence limit calculated in the final step (5.45). Algorithms also exist for exact numerical calculation of derivatives.
Nonparametric case: single sample
If the estimate t is again a smooth function of sample moments, t = t(s̄), then (5.43) still applies, and substitution of empirical moments leads to

â = Σ l_j³ / {6 (Σ l_j²)^{3/2}},  b̂ = (1/(2n²)) Σ_j {s(y_j) − s̄}^T ẗ(s̄) {s(y_j) − s̄},
ĉ = (1/(2v_L^{1/2})) (d²/dε²) t(s̄ + εk̂) |_{ε=0},   (5.46)

with l_j = ṫ(s̄)^T{s(y_j) − s̄}, v_L = n⁻² Σ l_j² and k̂ = n⁻² Σ_j {s(y_j) − s̄} l_j / v_L^{1/2}.

An alternative, more general formulation is possible in which s̄ is replaced by the multinomial proportions n⁻¹(f₁, ..., f_n) attaching to the data values. Correspondingly μ is replaced by the probability vector p, and with distributions F restricted to the data values, we re-express t(F) as t(p); cf. Section 4.4. Now F̂ is equivalent to p̂ = (n⁻¹, ..., n⁻¹) and t = t(p̂). In this notation the empirical influence values and second derivatives are defined by

l_j = (d/dε) t{(1 − ε)p̂ + ε1_j} |_{ε=0}   (5.47)

and

q_jk = (d²/dε dε′) t{(1 − ε − ε′)p̂ + ε1_j + ε′1_k} |_{ε=ε′=0},   (5.48)

where 1_j is the vector with 1 in the jth position and 0 elsewhere. Let us set ṫ_j(p) = ∂t(p)/∂p_j and ẗ_jk(p) = ∂²t(p)/∂p_j∂p_k; see Section 2.7.2 and Problem 2.16. Then alternative forms for the vector l and the full matrix q are

l = (I − n⁻¹J)ṫ(p̂),  q = (I − n⁻¹J)ẗ(p̂)(I − n⁻¹J),


where J = 11^T and 1 is a vector of ones. For each derivative the first form is convenient for approximation by numerical differencing, while the second form is often easier for theoretical calculation.

Estimates for a and b can be calculated directly as empirical versions of their definitions in (5.39), while for c it is simplest to use the analogue of the representation in (5.44). The resulting estimates are

â = Σ l_j³ / {6 (Σ l_j²)^{3/2}},  b̂ = (1/(2n²)) Σ_j q_jj,
ĉ = (1/(2v_L^{1/2})) (d²/dε²) t(p̂ + εk̂) |_{ε=0} = l^T q l / (2n⁴ v_L^{3/2}),   (5.49)

where k̂ = n⁻² v_L^{−1/2} l and ṫ, ẗ are evaluated at p̂.


The approximation (5.45) can also be used here, but now in the form

θ̂_α = t( p̂ + z̃_α/(1 − a z̃_α)² k̂ ).   (5.50)

If the several derivatives are calculated by numerical differencing, then the number of evaluations of t(p) needed is only 2n + 2, plus one for each confidence limit and the original value t. Note that the probability vector argument in (5.50) is not constrained to be proper, or even positive, so it is possible for ABC confidence limits to be undefined.
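Since everything in (5.49) and (5.50) can be obtained by numerical differencing, the nonparametric ABC method needs nothing more than a function t(p) defined for (possibly improper) probability vectors. The Python sketch below, our own code and not a library routine, follows that route; applied to the mean of the air-conditioning data it gives limits close to the nonparametric ABC entries of Table 5.5.

```python
import numpy as np
from scipy.stats import norm

def abc_ci(tfun, n, alpha=0.025, eps=1e-4):
    phat = np.full(n, 1.0 / n)
    t0 = tfun(phat)
    l = np.empty(n); qjj = np.empty(n)
    for j in range(n):                   # numerical versions of (5.47), (5.48)
        e = np.zeros(n); e[j] = 1.0
        tp = tfun((1 - eps) * phat + eps * e)
        tm = tfun((1 + eps) * phat - eps * e)
        l[j] = (tp - tm) / (2 * eps)
        qjj[j] = (tp - 2 * t0 + tm) / eps**2
    vL = np.sum(l**2) / n**2
    a = np.sum(l**3) / (6 * np.sum(l**2)**1.5)
    b = np.sum(qjj) / (2 * n**2)
    k = l / (n**2 * np.sqrt(vL))
    c = (tfun(phat + eps*k) - 2*t0 + tfun(phat - eps*k)) / (2*np.sqrt(vL)*eps**2)
    w = a + c - b / np.sqrt(vL)
    def limit(alph):
        zt = w + norm.ppf(alph)
        return tfun(phat + zt / (1 - a * zt)**2 * k)   # (5.50)
    return limit(alpha), limit(1 - alpha)

y = np.array([3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487.0])
print(abc_ci(lambda p: np.sum(p * y) / np.sum(p), len(y)))  # about (57.2, 226.7)
```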
Example 5.9 (Air-conditioning data, continued)  The adjusted percentile method was applied to the air-conditioning data in Example 5.6 under the gamma model and in Example 5.8 under the nonparametric model. Here we examine how well the ABC method approximates the adjusted percentile confidence limits. For the mean parameter, calculations are simple under all models. For example, in the gamma case the exponential family is two-dimensional with

s = (y, log y)^T,  η₁ = −nκ/μ,  η₂ = nκ,  ξ(η) = −η₂ log(−η₁/n) + n log Γ(η₂/n),

and t(s̄) = s̄₁. The last implies that ṫ = (1, 0)^T and ẗ = 0. It then follows straightforwardly that the constant a is given by (1/3)(nκ̂)^{−1/2} as in Example 5.6, that b = c = 0, and that k = v_L^{1/2}(1, 0)^T. Similar calculations apply for the nonparametric model, except that a is given by the corresponding value in Example 5.8. So under both models

θ̂_{1−α} = 108.083 + v_L^{1/2} (a + z_{1−α}) / {1 − a(a + z_{1−α})}².

Numerical comparisons between the adjusted percentile confidence limits and
ABC limits are shown in Table 5.5. The ABC method appears to give reasonable approximations, except for the 99% interval under the gamma model.

Table 5.5  Adjusted percentile (BCa) and ABC confidence intervals for mean failure time μ for the air-conditioning data; R = 999 simulated samples for the BCa methods.

                          Nominal confidence 1 − 2α
                     0.99           0.95           0.90
  Gamma model
    BCa          51.5, 241.6    63.0, 226.0    67.2, 208.0
    ABC          52.5, 316.6    61.4, 240.5    66.9, 210.5
  Nonparametric model
    BCa          44.6, 268.8    55.3, 243.5    61.5, 202.1
    ABC          46.6, 287.0    57.2, 226.7    63.6, 201.5

Nonparametric case: several samples

The estimated constants (5.49) for the single-sample case can be applied to several samples by using a single artificial probability vector π of length n = Σ n_i, as follows. The estimator will originally be defined by a function t(p₁, ..., p_k), where p_i = (p_{i1}, ..., p_{in_i}) is the vector of probabilities on the ith sample values y_{i1}, ..., y_{in_i}. The artificial representation of the estimator in terms of the single probability vector

π = (π₁₁, ..., π_{1n₁}, π₂₁, ..., π_{kn_k})

of length n is u(π) = t(p₁, ..., p_k), where p_i has elements

p_{ij} = π_{ij} / Σ_{j′=1}^{n_i} π_{ij′}.   (5.51)

The set of EDFs is equivalent to π̂ = (n⁻¹, ..., n⁻¹) and the observed value of the estimate is t = u(π̂). This artificial representation leads to expressions such as (5.29), in which the definition of l̃_{ij} is obtained by applying (5.47) to u(p). (Note that the real influence values l_{ij} and second derivatives q_{ij,i′j′} derived from t(p₁, ..., p_k) should not be used.) That this method produces correct results is quite easy to verify using the several-sample extension of the quadratic approximation (5.38); see Section 3.2.1 and Problem 3.7.
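In code, the artificial representation amounts to nothing more than a wrapper that renormalizes the within-sample probabilities, after which a single-sample ABC routine such as the sketch above can be applied unchanged. Here it is written out for the two-sample ratio of Example 5.10 below:

```python
import numpy as np

y1 = np.array([3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487.0])
y2 = np.array([3, 5, 5, 13, 14, 15, 22, 22, 23, 30, 36, 39, 44, 46,
               50, 72, 79, 88, 97, 102, 139, 188, 197, 210.0])

def u(pi):
    # (5.51): within-sample probabilities p_ij = pi_ij / sum_j' pi_ij'
    p1, p2 = pi[:len(y1)], pi[len(y1):]
    m1 = np.sum(p1 * y1) / np.sum(p1)
    m2 = np.sum(p2 * y2) / np.sum(p2)
    return m2 / m1              # equals t = 0.593 at the uniform vector
```

Passing u and n = 36 to the abc_ci sketch above then produces the artificial influence values l̃_{ij} and the ABC limits by pure numerical differencing.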
Example 5.10 (Air-conditioning data failure ratio)  The data of Example 1.1 form one of several samples corresponding to different aircraft. The previous sample (n₁ = 12) and a second sample (n₂ = 24) are given in Table 5.6. Suppose that we want to estimate the ratio of failure rates for the two aircraft, and give confidence intervals for this ratio.

To set notation, let the mean failure times be μ₁ and μ₂ for the first and second aircraft, with θ = μ₂/μ₁ the parameter of interest. The corresponding


Table 5.6  Failure times for air-conditioning equipment in two aircraft (Proschan, 1963).

  First aircraft:   3  5  7  18  43  85  91  98  100  130  230  487
  Second aircraft:  3  5  5  13  14  15  22  22  23  30  36  39  44  46  50  72  79  88  97  102  139  188  197  210
sample means are ȳ₁ = 108.083 and ȳ₂ = 64.125, so the estimate for θ is t = ȳ₂/ȳ₁ = 0.593. The empirical influence values are (Problem 3.5)

l_{1j} = −t(y_{1j} − ȳ₁)/ȳ₁,  l_{2j} = (y_{2j} − ȳ₂)/ȳ₁.

We use (5.29) to calculate v_L = 0.05614 and a = −0.0576. In R = 999 nonparametric simulations the proportion of t* values below t gives, by (5.22), w = −0.0954. With these values we can calculate the BCa confidence limits (5.21). For example, for α = 0.025 and 0.975 the values of α̃ are 0.0076 and 0.944 respectively, so the limits of the 95% interval are t*_(7.6) = 0.227 and t*_(944) = 1.306; the first value is interpolated using (5.8).
The studentized bootstrap method gives 95% confidence interval [0.131, 1.255] on the original scale. The distribution of the t* values is highly skewed here, and the logarithmic scale is strongly indicated by diagnostic plots. Figure 5.2 shows the normal Q-Q plot of the t* values, the variance-parameter plots for the original and logarithmic scales, and the normal Q-Q plot of z* values after logarithmic transformation. Application of the studentized bootstrap method on the logarithmic scale leads to the 95% confidence interval [0.183, 1.318] for θ, much closer to the BCa limits.
For the ABC method, the original definition of the estimator is t = t(p₁, p₂) = Σ y_{2j}p_{2j} / Σ y_{1j}p_{1j}. The artificial definition in terms of a single probability vector π is

u(π) = ( Σ_{j=1}^{n₂} y_{2j}π_{2j} / Σ_{j=1}^{n₂} π_{2j} ) / ( Σ_{j=1}^{n₁} y_{1j}π_{1j} / Σ_{j=1}^{n₁} π_{1j} ).

Application of (5.47) shows that the artificial empirical influence values are

l̃_{1j} = −(nt/n₁)(y_{1j} − ȳ₁)/ȳ₁  and  l̃_{2j} = (n/n₂)(y_{2j} − ȳ₂)/ȳ₁.

[Figure 5.2  Diagnostic plots for air-conditioning data confidence intervals, based on R = 999 nonparametric simulations. Top left: normal Q-Q plot of t*, dotted line is the N(t, v_L) approximation. Top right: variance-parameter plot, v_L* versus t*. Bottom left: variance-parameter plot after logarithmic transformation. Bottom right: normal Q-Q plot of z* after logarithmic transformation.]
This leads to formulae in agreement with (5.29), which gives the values of a and v_L already calculated. It remains to calculate b and c.
For b, application of (5.48) gives

q̃_{1j,1j} = 2t (n/n₁)² { (y_{1j} − ȳ₁)²/ȳ₁² + (1 − n₁/n)(y_{1j} − ȳ₁)/ȳ₁ }

and

q̃_{2j,2j} = −2(1 − n₂/n)(n/n₂)² (y_{2j} − ȳ₂)/ȳ₁,

so by (5.49), the terms linear in the deviations summing to zero, we have

b̂ = t n₁⁻² ȳ₁⁻² Σ_j (y_{1j} − ȳ₁)²,

whose value is b̂ = 0.0720. (The corresponding bootstrap estimates b_R and v_R are respectively 0.104 and 0.1125.) Finally, for c we apply the second form in (5.49) to u(π), that is

ĉ = (1/2) n⁻⁴ v_L^{−3/2} l̃^T ü(π̂) l̃,

and calculate ĉ = 0.3032. The implied value of w is −0.0583, quite different from the bootstrap value −0.0954. The ABC formula (5.50) is now applied to u(π) with k̂ = n⁻² v_L^{−1/2} l̃. The resulting 95% confidence interval is [0.250, 1.283], which is fairly close to the BCa interval.
It seems possible that the approximation theory does not work well here, which would explain the larger-than-usual differences between the BCa, ABC and studentized bootstrap confidence limits; see Section 5.7. One practical point is that the theoretical calculation of derivatives is quite time-consuming, compared to application of numerical differencing in (5.47)-(5.49).

5.5 Inversion of Significance Tests

There is a duality between significance tests for parameters and confidence sets for those parameters, in the sense that for a prescribed level a confidence region includes those parameter values not rejected by an appropriate significance test. This can provide another option for calculating confidence limits.

Suppose that θ is an unknown scalar parameter, and that the model includes no other unknown parameters. If R_α(θ₀) is a size α critical region for testing the null hypothesis H₀: θ = θ₀, which means that

Pr{(Y₁, ..., Y_n) ∈ R_α(θ₀) | θ₀} = α,

then the set

C_{1−α}(Y₁, ..., Y_n) = {θ : (Y₁, ..., Y_n) ∉ R_α(θ)}

is a 1 − α confidence region for θ. The shape of the region will be determined by the form of the test, including the alternative hypothesis for which the test is designed. In particular, an interval would usually be obtained if the alternative is two-sided, H_A: θ ≠ θ₀; an upper limit if H_A: θ < θ₀; and a lower limit if H_A: θ > θ₀.


For definiteness, suppose that we want to calculate a lower 1 − α confidence limit, which we denote by θ̂_α. The associated test of H₀: θ = θ₀ versus H_A: θ > θ₀ will be based on a test statistic t(θ₀) for which large values are evidence in favour of H_A: for example, t(θ₀) might be an estimate of θ minus θ₀. We will have an algorithm for approximating the P-value, which we can write as p(θ₀) = Pr{T(θ₀) ≥ t(θ₀) | F₀}, where F₀ is the null hypothesis distribution with parameter value θ₀. The 1 − α confidence set is all values of θ such that p(θ) > α, so the lower confidence limit θ̂_α is the smallest solution of p(θ) = α. A simple way to solve this is to evaluate p(θ) over a grid of, say, 20 values, and to interpolate via a simple curve fit. The grid can sometimes be determined from the normal approximation confidence limits (5.4). For the curve fit, a simple general method is to fit a logistic function to p(θ) using either a simple polynomial in θ or a spline. Once the curve is fitted, solutions to p(θ) = α can be computed: usually there will be one solution, which is θ̂_α.

For an upper 1 − α confidence limit θ̂_{1−α}, note that this is identical to a lower α confidence limit, so the same procedure as above with the same t(θ₀) can be used, except that we solve p(θ) = 1 − α. The combination of lower and upper 1 − α confidence limits defines an equi-tailed 1 − 2α confidence interval.
The following example illustrates this procedure.
Example 5.11 (Hazard ratio)  For the AML data in Example 3.9, also analysed in Example 4.4, assume that the ratio of hazard functions h₂(z)/h₁(z) for the two groups is a constant θ. As before, let r_{ij} be the number in group i who were at risk just prior to the jth failure time z_j, and let y_j be 0 or 1 according as the failure at z_j is in group 1 or 2. Then a suitable statistic for testing H₀: θ = θ₀ is

t(θ₀) = Σ_j { y_j − θ₀ r_{2j} / (r_{1j} + θ₀ r_{2j}) };

this is the score test statistic in the Cox proportional hazards model. Large values of t(θ₀) are evidence that θ > θ₀.

There are several possible resampling schemes that could be used here, including those described in Section 3.5 but modified to fix the constant hazard ratio θ₀. Here we use the simpler conditional model of Example 4.4, which holds fixed the survival and censoring times. Then for any fixed θ₀ the simulated values y₁*, ..., y_n* are generated by

Pr(y_j* = 1) = θ₀ r*_{2j} / (r*_{1j} + θ₀ r*_{2j}),

[Figure 5.3  Bootstrap P-values p(θ₀) for testing constant hazard ratio θ₀, with R = 199 at each point, plotted against log θ. Solid curve is a spline fit on the logistic scale. Dotted lines interpolate the solutions to p(θ₀) = 0.05, 0.95, which are the endpoints of the 90% confidence interval.]

where the numbers at risk just prior to z_j are given by

r*_{1j} = max{ 0, r₁₁ − Σ_{k=1}^{j−1} (1 − y_k*) − c_{1j} },  r*_{2j} = max{ 0, r₂₁ − Σ_{k=1}^{j−1} y_k* − c_{2j} },

with c_{ij} the number of censoring times in group i before z_j.


F o r the A M L d a ta we sim ulated R = 199 sam ples in this way, and calculated
the corresponding values t*(90) for a grid o f 21 values o f 90 in the range
0.5 < 0o ^ 10. F or each Go we com puted the one-sided P-value
Pieo) =

#{t*(0o) > t(0o)}


200

then on the logit scale we fitted a spline curve (in log 6), and interpolated the
solutions to p(9o) = a, 1a to determ ine the endpoints o f the (12a) confidence
interval for 9. Figure 5.3 illustrates this procedure for a = 0.05, which gives
the 90% confidence interval [1.07,6.16]; the 95% interval is [0.86,7.71] and the
p o int estim ate is 2.52. T hus there is m ild evidence th a t 6 > 1.
A m ore efficient ap proach w ould be to use R = 99 for the initial grid to
determ ine rough values o f the confidence limits, n ear which further sim ulation
with R = 999 w ould provide accurate interp o latio n o f the confidence limits.
Yet m ore efficient algorithm s are possible.

In a more systematic development of the method, we must allow for a nuisance parameter λ, say, which also governs the data distribution but is not constrained by H₀. Then both R_α(θ) and C_{1−α}(Y₁, ..., Y_n) must depend upon λ to make the inversion method work exactly. Under the bootstrap approach λ is replaced by an estimate.


Suppose, for example, that we want a lower 1 − α confidence limit, which is obtained via the critical region for testing H₀: θ = θ₀ versus the alternative hypothesis H_A: θ > θ₀. Define ψ = (θ, λ). If the test statistic is T(θ₀), then the size α critical region has the form

R_α(θ₀) = { (y₁, ..., y_n) : Pr{T(θ₀) ≥ t(θ₀) | ψ = (θ₀, λ)} ≤ α },

and the exact lower confidence limit is the value u_α = u_α(y, λ) such that

Pr{T(u_α) ≥ t(u_α) | ψ = (u_α, λ)} = α.

We replace λ by an estimate s, say, to obtain the lower 1 − α bootstrap confidence limit û_α = u_α(y, s). The solution is found by solving for u the equation

Pr*{T*(u) ≥ t(u) | ψ = (u, s)} = α,

where T*(u) follows the distribution under ψ = (u, s). This requires application of an interpolation method such as the one illustrated in the previous example.

The simplest test statistic is the point estimate T of θ, and then T(θ₀) = T. The method will tend to be more accurate if the test statistic is the studentized estimate. That is, if var(T) = σ²(θ, λ), then we take Z = (T − θ₀)/σ(θ₀, s); for further details see Problem 5.11. The same remark would apply to score statistics, such as that in the previous example, where studentization would involve the observed or expected Fisher information.

Note that for the particular alternative hypothesis used to derive an upper limit, it would be standard practice to define the P-value as Pr{T(θ₀) ≤ t(θ₀) | F₀}, for example if T(θ₀) were an estimator for θ or its studentized form. Equivalently one can retain the general definition and solve p(θ₀) = 1 − α for an upper limit.

In principle these methods can be applied to both parametric and semiparametric problems, but not to completely nonparametric problems.

5.6 Double Bootstrap Methods


W hether the basic or percentile b o o tstrap m ethod is used to calculate con
fidence intervals, there is a possibly non-negligible difference betw een the
nom inal 1 a coverage an d the actual probability coverage o f the interval
in repeated sam pling, even if R is very large. The difference represents a bias
in the m ethod, an d as indicated in Section 3.9 the b o o tstrap can be used to
estim ate and correct for such a bias. T h a t is, by b o otstrapping a b o o tstrap
confidence interval m ethod it can be m ade m ore accurate. This is analogous
to the b o o tstrap adjustm ent for b o o tstra p P-values described in Section 4.5.
O ne straightforw ard application o f this idea is to the norm al-approxim ation
confidence interval (5.4), which produces the studentized b o o tstra p interval;

5 Confidence Intervals

224

see Problem 5.12. A m ore am bitious application is b o o tstrap adjustm ent o f the
basic b o o tstrap confidence limit, which we develop here.
First we recall the full n o tatio n s for the quantities involved in the basic
bo o tstrap confidence interval m ethod. The ideal u p per 1 a confidence limit
is t(F) ax(F), where
Pr { T - 6 < ax(F) | F j = Pr{f(F) - t(F) < aa(F) \ F} = a.
W h at is calculated, ignoring sim ulation error, is the confidence lim it t(F)ax(F).
The bias in the m ethod arises from the fact th a t aa(F) ^ a a(F) in general, so
th at
Pr{f(F) < t(F) - aa( F) | F} 1 - a.

(5.52)
A

We could try to elim inate the bias by adding a correction to ax(F), b u t a m ore
successful approach is to adjust the subscript a. T h a t is, we replace ax(F) by
Oq(a)(F) an d estim ate w hat the adjusted value q(a) should be. This is in the
sam e spirit as the B C a m ethod.
Ideally we want q(α) to satisfy

    Pr{t(F) ≤ t(F̂) − a_{q(α)}(F̂) | F} = 1 − α.    (5.53)

The solution q(α) will depend upon F, i.e. q(α) = q(α, F). Because F is unknown, we estimate q(α) by q̂(α) = q(α, F̂). This means that we obtain q̂(α) by solving the bootstrap version of (5.53), namely

    Pr*{t(F̂) ≤ t(F̂*) − a_{q(α)}(F̂*) | F̂} = 1 − α.    (5.54)

This looks intimidating, but from the definition of a_α(F) we see that (5.54) can be rewritten as

    Pr*{Pr**(T** ≤ 2T* − t | F̂*) ≥ q(α) | F̂} = 1 − α.    (5.55)

The same method of adjustment can be applied to any bootstrap confidence limit method, including the percentile method (Problem 5.13) and the studentized bootstrap method (Problem 5.14).
To verify that the nested bootstrap reduces the order of coverage error made by the original bootstrap confidence limit, we can apply the general discussion of Section 3.9.1. In general we find that coverage 1 − α + O(n^{−a}) is corrected to 1 − α + O(n^{−a−1/2}) for one-sided confidence limits, whether a = ½ or 1. However, for equi-tailed confidence intervals coverage 1 − 2α + O(n^{−1}) is corrected to 1 − 2α + O(n^{−2}); see Problem 5.15.
Before discussing how to solve equation (5.55) using simulated samples, we look at a simple illustrative example where the solution can be found theoretically.
Example 5.12 (Exponential mean)  Consider the parametric problem of exponential data with unknown mean μ. The data estimate for μ is t = ȳ, F̂ is the fitted exponential CDF with mean ȳ, and F̂* is the fitted exponential CDF with mean ȳ*, the mean of a parametric bootstrap sample y₁*, …, yₙ* drawn from F̂. A result that we use repeatedly is that if X₁, …, Xₙ are independent exponential with mean μ, then 2nX̄/μ has the χ²_{2n} distribution.
The basic bootstrap upper 1 − α confidence limit for μ is

    2ȳ − ȳ c_{2n,α}/(2n),

where Pr(χ²_{2n} ≤ c_{2n,α}) = α. To evaluate the left-hand side of (5.55), for the inner probability we have

    Pr**(Ȳ** ≤ 2ȳ* − ȳ | F̂*) = Pr{χ²_{2n} ≤ 2n(2 − ȳ/ȳ*)},

which exceeds q if and only if 2n(2 − ȳ/ȳ*) ≥ c_{2n,q}. Therefore the outer probability on the left-hand side of (5.55) is

    Pr*{2n(2 − ȳ/Ȳ*) ≥ c_{2n,q}} = Pr{χ²_{2n} ≥ 2n/(2 − c_{2n,q}/(2n))},    (5.56)

with q = q(α). Setting the probability on the right-hand side of (5.56) equal to 1 − α, we deduce that

    2n/(2 − c_{2n,q(α)}/(2n)) = c_{2n,α}.

Using q(α) in place of α in the basic bootstrap confidence limit gives the adjusted upper 1 − α confidence limit 2nȳ/c_{2n,α}, which has exact coverage 1 − α. So in this case the double bootstrap adjustment is perfect.
Figure 5.4 shows the actual coverages of nominal 1 − α bootstrap upper confidence limits when n = 10. There are quite large discrepancies for both basic and percentile methods, which are completely removed using the double bootstrap adjustment; see Problem 5.13.
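The algebra of this example is easily checked numerically; the sketch below (ours) uses scipy's χ² quantiles to compute q(α) and to confirm that the adjusted basic limit coincides with the exact limit 2nȳ/c_{2n,α}. The values of n, α and ȳ are illustrative.

    import numpy as np
    from scipy.stats import chi2

    n, alpha, ybar = 10, 0.1, 1.7                 # illustrative values
    c = lambda p: chi2.ppf(p, df=2 * n)           # c_{2n,p}

    # solve 2n / {2 - c_{2n,q}/(2n)} = c_{2n,alpha} for c_{2n,q}, then q
    c_q = 2 * n * (2 - 2 * n / c(alpha))
    q = chi2.cdf(c_q, df=2 * n)                   # adjusted level q(alpha)

    basic = 2 * ybar - ybar * c(alpha) / (2 * n)  # unadjusted basic limit
    adjusted = 2 * ybar - ybar * c_q / (2 * n)    # alpha replaced by q(alpha)
    exact = 2 * n * ybar / c(alpha)
    print(q, basic, adjusted, exact)              # adjusted equals exact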

In general, and especially for nonparametric problems, the calculations in (5.55) cannot be done exactly and simulation or approximation methods must be used. A basic simulation algorithm is as follows. Suppose that we draw R samples from F̂, and denote the model fitted to the rth sample by F̂*_r (the EDF for one-sample nonparametric problems). Define

    u_r = Pr**(T** ≤ 2t*_r − t | F̂*_r).

This will be approximated by drawing M samples from F̂*_r, calculating the estimator values t**_{rm} for m = 1, …, M, and computing the estimate

    u_{M,r} = M^{−1} Σ_{m=1}^{M} I{t**_{rm} ≤ 2t*_r − t},

where I{A} is the zero-one indicator function of the event A.


[Figure 5.4  Actual coverages of percentile (dotted line) and basic bootstrap (dashed line) upper confidence limits for exponential mean when n = 10. Solid line is attained by nested bootstrap confidence limits. Horizontal axis: nominal coverage, from 0.0 to 1.0.]

Then the Monte Carlo version of (5.55) is

    R^{−1} Σ_{r=1}^{R} I{u_{M,r} ≥ q̂(α)} = 1 − α,

which is to say that q̂(α) is the α quantile of the u_{M,r}. The simplest way to obtain q̂(α) is to order the values u_{M,r} into u_{M,(1)} ≤ ⋯ ≤ u_{M,(R)} and then set q̂(α) = u_{M,((R+1)α)}. What this amounts to is that the (R+1)αth ordered value is read off from a Q-Q plot of the u_{M,r} against quantiles of the U(0,1) distribution, and that ordered value is then used to give the required quantile of the t*_r − t. We illustrate this in the next example.
The total number of samples involved in this calculation is RM. Since we always think of simulating as many as 1000 samples to approximate probabilities, here this would suggest as many as 10⁶ samples overall. The calculations of Section 4.5 would suggest something a bit smaller, say M = 249 to be safe, but this is still rather impractical. However, there are ways of greatly reducing the overall number of simulations, two of which are described in Chapter 9.
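For reference, here is a minimal sketch (ours) of the algorithm just described for the one-sample nonparametric case, with deliberately small R and M; in serious use the variance-reduction methods of Chapter 9 would replace the brute-force double loop, and interpolation of the ordered values would refine the quantile read-offs.

    import numpy as np

    rng = np.random.default_rng(2)
    y = rng.exponential(2.0, size=25)             # illustrative data
    t, n = y.mean(), len(y)
    R, M, alpha = 199, 50, 0.05                   # small, for illustration only

    t_star, u = np.empty(R), np.empty(R)
    for r in range(R):
        y_star = rng.choice(y, size=n)            # outer resample
        t_star[r] = y_star.mean()
        t_2star = rng.choice(y_star, size=(M, n)).mean(axis=1)   # inner level
        u[r] = np.mean(t_2star <= 2 * t_star[r] - t)             # u_{M,r}

    q_hat = np.sort(u)[int((R + 1) * alpha) - 1]  # q(alpha), alpha quantile

    # basic bootstrap upper limits, unadjusted and adjusted
    quant = lambda p: np.sort(t_star)[max(int((R + 1) * p) - 1, 0)]
    print("unadjusted:", 2 * t - quant(alpha))
    print("adjusted:  ", 2 * t - quant(q_hat))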
Example 5.13 (Kernel density estimate)  Bootstrap confidence intervals for the value of a density raise some awkward issues, which we now discuss, before outlining the use of the nested bootstrap in this context.
The standard kernel estimate of the PDF f(y) given a random sample y₁, …, yₙ is

    f̂(y; h) = (nh)^{−1} Σ_{j=1}^{n} w{h^{−1}(y − yⱼ)},
where w(·) is a symmetric density with mean zero and unit variance, and h is the bandwidth. One source of difficulty is that if we consider the estimator to be t(F̂), as we usually do, then t(F) = h^{−1} ∫ w{h^{−1}(y − x)} f(x) dx is being estimated, not f(y). The mean and variance of f̂(y; h) are approximately

    f(y) + ½h²f″(y),    (nh)^{−1} f(y) ∫ w²(u) du,    (5.57)

for small h and large n. In general one assumes that as n → ∞, h → 0 in such a way that nh → ∞, and this makes both bias and variance tend to zero as n increases. The density estimate then has the form tₙ(F̂), such that tₙ(F) → f(y) as n → ∞.
Because the variance in (5.57) is approximately proportional to the mean, it makes sense to work with the square root of the estimate. That is, we take T = {f̂(y; h)}^{1/2} as estimator of θ = {f(y)}^{1/2}. By the delta method of Section 2.7.1 we have from (5.57) that the approximate mean and variance of T are

    {f(y)}^{1/2} + ¼{f(y)}^{−1/2}{h²f″(y) − ½(nh)^{−1}K},    ¼(nh)^{−1}K,    (5.58)

where K = ∫ w²(u) du.


There remains the problem of choosing h. For point estimation of f(y) it is usually suggested, on the grounds of minimizing mean squared error, that one take h ∝ n^{−1/5}. This makes both bias and standard error of order n^{−2/5}. But there is no reason to do the same for setting confidence intervals, and in fact h ∝ n^{−1/5} turns out to be a poor choice, particularly for standard bootstrap methods, as we now show.
Suppose that we resample y₁*, …, yₙ* from the EDF F̂. Then the bootstrap version of the density estimate, that is

    f̂*(y; h) = (nh)^{−1} Σ_{j=1}^{n} w{h^{−1}(y − yⱼ*)},

has mean exactly equal to f̂(y; h); the approximate variance is the same as in (5.57) except that f̂(y; h) replaces f(y). It follows that T* = {f̂*(y; h)}^{1/2} has approximate mean and variance

    {f̂(y; h)}^{1/2} − ⅛{f̂(y; h)}^{−1/2}(nh)^{−1}K,    ¼(nh)^{−1}K.    (5.59)
Now consider the studentized estimates

    Z = [{f̂(y; h)}^{1/2} − {f(y)}^{1/2}] / {¼(nh)^{−1}K}^{1/2},    Z* = [{f̂*(y; h)}^{1/2} − {f̂(y; h)}^{1/2}] / {¼(nh)^{−1}K}^{1/2}.

From (5.58) and (5.59) we see that if h ∝ n^{−1/5}, then as n increases

    Z ≐ ε + ½{f(y)}^{−1/2}K^{−1/2}f″(y),    Z* ≐ ε*,


[Figure 5.5  Studentized quantities for density estimation. The left panels show values of Z when h = n^{−1/5} for 500 standard normal samples of sizes n and 500 bootstrap values for one sample at each n. The right panels show the corresponding values when h = n^{−1/3}. Horizontal axes: n, from 20 to 1000.]

where both ε and ε* are N(0, 1). This means that quantiles of Z cannot be well approximated by quantiles of Z*, no matter how large n is. The same thing happens for the untransformed density estimate.
There are several ways in which we can try to overcome this problem. One of the simplest is to change h to be of order n^{−1/3}, when calculations similar to those above show that Z ≐ ε and Z* ≐ ε*. Figure 5.5 illustrates the effect. Here we estimate the density at y = 0 for samples from the N(0, 1) distribution, with w(·) the standard normal density. The first two panels show box plots of 500 values of z and z* when h = n^{−1/5}, which is near-optimal for estimation in this case, for several values of n; the values of z* are obtained by resampling from one dataset. The last two panels correspond to h = n^{−1/3}. The figure confirms the key points of the theory sketched above: that Z is biased away from zero when h = n^{−1/5}, but not when h = n^{−1/3}; and that the distributions of Z and Z* are quite stable and similar when h = n^{−1/3}.
Under resampling from F̂, the studentized bootstrap applied to {f̂(y; h)}^{1/2} should be consistent if h ∝ n^{−1/3}. From a practical point of view this means considerable undersmoothing in the density estimate, relative to standard practice for estimation. A bias in Z of order n^{−1/3} or worse will remain, and this suggests a possibly useful role for the double bootstrap.
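The prescription is easy to try out; the sketch below (ours) computes a studentized bootstrap interval for θ = {f(0)}^{1/2} with a standard normal kernel and the undersmoothed bandwidth h = n^{−1/3}, the standard error being taken from (5.58); the unit constant in the bandwidth is an arbitrary illustrative choice.

    import numpy as np

    rng = np.random.default_rng(3)
    n, y0 = 200, 0.0
    y = rng.normal(size=n)                        # data from N(0,1)
    h = n ** (-1 / 3)                             # undersmoothed bandwidth
    K = 1 / (2 * np.sqrt(np.pi))                  # K = integral of phi^2

    phi = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    fhat = lambda x: phi((y0 - x) / h).mean() / h # kernel estimate at y0

    t = np.sqrt(fhat(y))                          # variance-stabilized estimate
    se = 0.5 * np.sqrt(K / (n * h))               # sd of T, from (5.58)

    R, alpha = 999, 0.025
    z = np.array([(np.sqrt(fhat(rng.choice(y, size=n))) - t) / se
                  for _ in range(R)])             # studentized replicates

    lo, hi = np.quantile(z, [alpha, 1 - alpha])
    # interval for sqrt(f(0)); squaring gives an interval for f(0)
    print((t - hi * se) ** 2, (t - lo * se) ** 2)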
For a numerical example of nested bootstrapping in this context we revisit Example 4.18, where we discussed the use of a kernel density estimate in estimating species abundance. The estimated PDF is

    f̂(y; h) = (2nh)^{−1} Σ_{j=1}^{n} [φ{h^{−1}(y − yⱼ)} + φ{h^{−1}(y + yⱼ)}],

where φ(·) is the standard normal density, and the value of interest is f̂(0; h), which is used to estimate f(0). In light of the previous discussion, we base


[Figure 5.6  Adjusted bootstrap procedure for variance-stabilized density estimate t = {f̂(0; 0.5)}^{1/2} for the tuna data. The left panel shows the EDF of 1000 values of t* − t. The right panel shows a plot of the ordered u*_{M,r} against quantiles r/(R + 1) of the U(0, 1) distribution. The dashed line shows how the quantiles of the u* are used to obtain improved confidence limits, by using the right panel to read off the estimated coverage q̂(α) corresponding to the required nominal coverage α, and then using the left panel to read off the q̂(α) quantile of t* − t. Horizontal axes: t* − t (left) and nominal coverage (right).]

confidence intervals on the variance-stabilized estimate t = {f̂(0; h)}^{1/2}. We also use a value of h considerably smaller than the value (roughly 1.5) used to estimate f in Example 4.18.
The right panel of Figure 5.6 shows the quantiles of the u*_{M,r} obtained when the double bootstrap bias adjustment is applied with R = 1000 and M = 250, for the estimate with bandwidth h = 0.5. If T* − t were an exact pivot, the distribution of the u* would lie along the dotted line, and nominal and estimated coverage would be equal. The distribution is close to uniform, confirming our decision to use a variance-stabilized statistic.
The dashed line shows how the distribution of the u* is used to remove the bias in coverage levels. For an upper confidence limit with nominal level 1 − α = 0.9, so that α = 0.1, the estimated level is q̂(0.1) = 0.088. The 0.088 quantile of the values of t*_r − t is t*_{(88)} − t = −0.091, while the 0.10 quantile is t*_{(100)} − t = −0.085. The corresponding upper 10% confidence limits for {f(0)}^{1/2} are t − (t*_{(88)} − t) = 0.356 − (−0.091) = 0.447 and t − (t*_{(100)} − t) = 0.356 − (−0.085) = 0.441. For this value of α the adjustment has only a small effect.
Table 5.7 compares the 95% limits for f(0) for different methods, using bandwidth h = 0.5, for which f̂(0; 0.5) = 0.127. The longer upper tail for the double bootstrap interval is a result of adjusting the nominal α = 0.025 to q̂(0.025) = 0.004; at the upper tail we obtain q̂(0.975) = 0.980. The lower tail of the interval agrees well with the other second-order correct methods.
For larger values of h the density estimates are higher and the confidence intervals narrower.


Table 5.7  Upper and lower endpoints of 95% confidence limits for f(0) for the tuna data, with bandwidth h = 0.5; † indicates use of square-root transformation.

             Basic   Basic†  Student  Student†  Percentile  BCa     Double
    Upper    0.204   0.240   0.273    0.266     0.218       0.240   0.301
    Lower    0.036   0.060   0.055    0.058     0.048       0.058   0.058
In Example 9.14 we describe how saddlepoint methods can greatly reduce the time taken to perform the double bootstrap in this problem. It might be possible to avoid the difficulties caused by the bias of the kernel estimate by using a clever resampling scheme, but it would be more complicated than the direct approach described above.

5.7 Empirical Comparison of Bootstrap Methods

The several bootstrap confidence limit methods can be compared theoretically on the basis of first- and second-order accuracy, as in Section 5.4, but this really gives only suggestions as to which methods we would expect to be good. The theory needs to be bolstered by numerical comparisons. One rather extreme comparison was described in Example 5.7. In this section we consider one moderately complicated application, estimation of a ratio of means, and assess through simulation the performances of the main bootstrap confidence limit methods. The conclusions appear to agree qualitatively with the results of other simulation studies involving applications of similar complexity: references to some of these are given in the bibliographic notes at the end of the chapter.
The application here is similar to that in Example 5.10, and concerns the ratio of means for data from two different gamma distributions. The first sample of size n₁ is drawn from a gamma distribution with mean μ₁ = 100 and index 0.7, while the second independent sample of size n₂ is drawn from the gamma distribution with mean μ₂ = 50 and index 1. The parameter θ = μ₁/μ₂, whose value is 2, is estimated by the ratio of sample means t = ȳ₁/ȳ₂. For particular choices of sample sizes we simulated 10000 datasets and to each applied several of the nonparametric bootstrap confidence limit methods discussed earlier, always with R = 999. We did not include the double bootstrap method. As a control we added the exact parametric method when the gamma indexes are known: this turns out not to be a strong control, but it does provide a check on simulation validity.
The results quoted here are for two cases, n₁ = n₂ = 10 and n₁ = n₂ = 25. In each case we assess the left- and right-tail error rates of confidence intervals, and their lengths.
Table 5.8 shows the empirical error rates for both cases, as percentages, for nominal rates between 1% and 10%: simulation standard errors are the rates divided by 100.

Table 5.8  Empirical error rates (%) for nonparametric bootstrap confidence limits in ratio estimation: rates for sample sizes n₁ = n₂ = 10 are given above those for sample sizes n₁ = n₂ = 25. R = 999 for all bootstrap methods. 10000 datasets generated from gamma distributions.

                                Nominal error rate (%)
                           Lower limit              Upper limit
    Method              1    2.5    5    10      10    5    2.5    1
    Exact              1.0   2.8   5.5  10.5    9.8   4.8   2.6   1.0
                       1.0   2.3   4.8   9.9   10.2   4.9   2.5   1.1
    Normal approx.     0.1   0.5   1.7   6.3   20.6  15.7  12.5   9.6
                       0.1   0.5   2.1   6.4   16.3  11.5   8.2   5.5
    Basic              0.0   0.0   0.2   1.8   24.4  21.0  18.6  16.4
                       0.0   0.1   0.4   3.0   19.2  15.0  12.5  10.3
    Basic, log scale   2.6   4.9   8.1  12.9   13.1   7.5   4.8   2.5
                       1.6   3.2   6.0  11.4   11.5   6.3   3.3   1.7
    Studentized        0.6   2.1   4.6   9.9   11.9   6.7   4.0   2.0
                       0.8   2.3   4.6   9.9   10.9   5.9   3.0   1.4
    Studentized, log   1.1   2.8   5.6  10.7   11.6   6.3   3.5   1.7
                       1.1   2.5   5.0  10.1   10.8   5.7   2.9   1.3
    Percentile         1.8   3.6   6.5  11.6   14.6   8.9   5.9   3.3
                       1.2   2.6   5.1  10.1   12.6   7.1   4.2   2.1
    BCa                1.9   4.0   6.9  12.3   14.0   8.3   5.3   3.0
                       1.4   3.0   5.6  10.9   11.8   6.8   3.8   1.9
    ABC                1.9   4.2   7.4  12.7   14.6   8.7   5.5   3.1
                       1.3   3.0   5.7  11.0   12.1   6.8   3.7   1.9
The normal approximation method uses the delta method variance approximation. The results suggest that the studentized method gives the best results, provided the log scale is used. Otherwise, the studentized method and the percentile, BCa and ABC methods are comparable but only really satisfactory at the larger sample sizes.
Figure 5.7 shows box plots of the lengths of 1000 confidence intervals for both sample sizes. The most pronounced feature for n₁ = n₂ = 10 is the long, sometimes very long, lengths for the two studentized methods, which helps to account for their good error rates. This feature is far less prominent at the larger sample sizes. It is noticeable that the normal, percentile, BCa and ABC intervals are short compared to the exact ones, and that taking logs improves the basic intervals. Similar comments apply when n₁ = n₂ = 25, but with less force.

5.8 Multiparameter Methods


When we want a confidence region for a vector parameter, the question of shape arises. Typically a rectangular region formed from intervals for each component parameter will not have high enough coverage probability, although a Bonferroni argument can be used to give a conservative confidence coefficient,

[Figure 5.7  Box plots of confidence interval lengths for the first 1000 simulated samples in the numerical experiment with gamma data. Top panel: n₁ = n₂ = 10 (vertical scale from 10 to 1000); bottom panel: n₁ = n₂ = 25 (vertical scale from 5 to 10).]

as follows. Suppose that θ has d components, and that the confidence region C_α is rectangular, with interval C_{α,i} = (θ_{L,i}, θ_{U,i}) for the ith component θᵢ. Then

    Pr(θ ∉ C_α) = Pr(⋃ᵢ {θᵢ ∉ C_{α,i}}) ≤ Σ_{i=1}^{d} Pr(θᵢ ∉ C_{α,i}) = Σᵢ αᵢ,

say. If we take αᵢ = α/d then the region C_α has coverage at least equal to 1 − α. For certain applications this could be useful, in part because of its simplicity. But there are two potential disadvantages. First, the region could be very conservative: the true coverage could be considerably more than the nominal 1 − α. Secondly, the rectangular shape could be quite at odds with plausible likelihood contours. This is especially true if the estimates for parameter components are quite highly correlated, when also the Bonferroni method is more conservative.
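In code the Bonferroni recipe is a one-line adjustment of the per-component level; the sketch below (ours) uses equi-tailed basic bootstrap limits for the mean of illustrative bivariate data.

    import numpy as np

    rng = np.random.default_rng(4)
    y = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=40)
    t = y.mean(axis=0)                    # component estimates
    d, alpha, R = 2, 0.05, 999

    idx = rng.integers(0, len(y), size=(R, len(y)))
    t_star = y[idx].mean(axis=1)          # R x d bootstrap estimates

    a = alpha / d                         # Bonferroni-adjusted level
    lo = 2 * t - np.quantile(t_star, 1 - a / 2, axis=0)
    hi = 2 * t - np.quantile(t_star, a / 2, axis=0)
    print(list(zip(lo, hi)))              # rectangle with coverage >= 1 - alpha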
One simple possibility for a joint bootstrap confidence region when T is approximately normal is to base it on the quadratic form

    Q = (T − θ)ᵀ V^{−1} (T − θ),    (5.60)

where V is the estimated variance matrix of T. Note that Q is the multivariate extension of the square of the studentized statistic of Section 5.2. If Q had exact p quantiles a_p, say, then a 1 − α confidence set for θ would be

    {θ : (T − θ)ᵀ V^{−1} (T − θ) ≤ a_{1−α}}.    (5.61)


The elliptical shape of this set is correct if the distribution of T has elliptical contours, as the multivariate normal distribution does. So if T is approximately multivariate normal, then the shape will be approximately correct. Moreover, Q will be approximately distributed as a χ²_d variable. But as in the scalar case such distributional approximations will often be unreliable, so it makes sense to approximate the distribution of Q, and in particular the required quantiles a_{1−α}, by resampling. The method then becomes completely analogous to the studentized bootstrap method for scalar parameters. The bootstrap analogue of Q will be

    Q* = (T* − t)ᵀ V*^{−1} (T* − t),

which will be calculated for each of R simulated samples. If we denote the ordered bootstrap values by q*_{(1)} ≤ ⋯ ≤ q*_{(R)}, then the 1 − α bootstrap confidence region is the set

    {θ : (t − θ)ᵀ v^{−1} (t − θ) ≤ q*_{((R+1)(1−α))}}.    (5.62)

As in the scalar case, a common and useful choice for v is the delta method variance estimate v_L.
The same method can be applied on any scales which are monotone transformations of the original parameter scales. For example, if h(θ) has ith component hᵢ(θᵢ), say, and if ḋ is the diagonal matrix with elements ∂hᵢ/∂θᵢ evaluated at θ = t, then we can apply (5.62) with the revised definition

    Q = {h(t) − h(θ)}ᵀ (ḋᵀ v ḋ)^{−1} {h(t) − h(θ)}.

If corresponding ordered bootstrap values are again denoted by q*_{(r)}, then the bootstrap confidence region will be

    {θ : {h(t) − h(θ)}ᵀ (ḋᵀ v ḋ)^{−1} {h(t) − h(θ)} ≤ q*_{((R+1)(1−α))}}.    (5.63)

A particular choice for h(·) would often be based on diagnostic plots of components of t* and v*, the objectives being to attain approximate normality and approximately stable variance for each component.
This method will be subject to the same potential defects as the studentized bootstrap method of Section 5.2. There is no vector analogue of the adjusted percentile methods, but the nested bootstrap method can be applied.
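The sketch below (ours) illustrates the region (5.62) for a bivariate mean. For brevity it reuses a single variance matrix v, the empirical covariance of the t*, for all quadratic forms, rather than recomputing a delta method estimate v* within each resample as the full studentized method requires; this simplification weakens the studentization but keeps the shape of the calculation.

    import numpy as np

    rng = np.random.default_rng(5)
    y = rng.exponential(2.0, size=(50, 2))        # illustrative bivariate data
    t = y.mean(axis=0)
    R, alpha = 999, 0.05

    idx = rng.integers(0, 50, size=(R, 50))
    t_star = y[idx].mean(axis=1)
    v_inv = np.linalg.inv(np.cov(t_star.T))       # single variance estimate

    # bootstrap quadratic forms and their (R+1)(1-alpha)th ordered value
    q_star = np.einsum('ri,ij,rj->r', t_star - t, v_inv, t_star - t)
    q_crit = np.sort(q_star)[int((R + 1) * (1 - alpha)) - 1]

    def in_region(theta):                         # membership test for (5.62)
        u = t - np.asarray(theta)
        return float(u @ v_inv @ u) <= q_crit

    print(in_region(t), in_region(t * 3))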
Example 5.14 (Air-conditioning data)  For the air-conditioning data of Example 1.1, consider setting a confidence region for the two parameters θ = (μ, κ) in a gamma model. The log likelihood function is

    ℓ(μ, κ) = n{κ log(κ/μ) − log Γ(κ) + (κ − 1)\overline{log y} − κȳ/μ},

where ȳ and \overline{log y} are the averages of the data and the log data, and from which we calculate the maximum likelihood estimators T = (μ̂, κ̂). The numerical values are μ̂ = 108.083 and κ̂ = 0.7065. A straightforward calculation shows that the delta method variance approximation, equal to the inverse of the expected information matrix as in Section 5.2, is

    v_L = n^{−1} diag{κ^{−1}μ², (d² log Γ(κ)/dκ² − κ^{−1})^{−1}}.    (5.64)

The standard likelihood ratio 1 − α confidence region is the set of values of (μ, κ) for which

    2{ℓ(μ̂, κ̂) − ℓ(μ, κ)} ≤ c_{2,1−α},

where c_{2,1−α} is the 1 − α quantile of the χ²₂ distribution. The top left panel of Figure 5.8 shows the 0.50, 0.95 and 0.99 confidence regions obtained in this way. The top right panel is the same, except that c_{2,1−α} is replaced by a bootstrap estimate obtained from R = 999 samples simulated from the fitted gamma model. This second region is somewhat larger than, but of course has the same shape as, the first.
From the bootstrap simulation we have estimators t* = (μ̂*, κ̂*) from each sample, from which we calculate the corresponding variance approximations v*_L using (5.64), and hence the quadratic forms q* = (t* − t)ᵀ v*_L^{−1} (t* − t). We then apply (5.62) to obtain the studentized bootstrap confidence regions shown in the bottom left panel of Figure 5.8. This is clearly nothing like the likelihood-based confidence regions above, partly because it fails completely to take account of the mild skewness in the distribution of μ̂* and the heavy skewness in the distribution of κ̂*. These features are clear in the histogram plots of Figure 5.9.
Logarithmic transformation of both μ and κ improves matters considerably: the bottom right panel of Figure 5.8 comes from applying the studentized bootstrap method after dual logarithmic transformation. Nevertheless, the solution is not completely satisfactory, in that the region is too wide on the κ axis and slightly narrow on the μ axis. This could be predicted to some extent by plotting v*_L versus t*, which shows that the log transformation of κ is not quite strong enough. Perhaps more important is that there is a substantial bias in κ̂: the bootstrap bias estimate is 0.18.
One lesson from this example is that where a likelihood is available and usable, it should be used with parametric simulation to check on, and if necessary replace, standard approximations for quantiles of the log likelihood ratio statistic.

Example 5.15 (Laterite data)  The data in Table 5.9 are axial data consisting of 50 pole positions, in degrees of latitude and longitude, from a palaeomagnetic study of New Caledonian laterites. The data take values only in the lower unit half-sphere, because an axis is determined by a single pole.

[Figure 5.8  Bootstrap confidence regions for the parameters μ, κ of a gamma model for the air-conditioning data, with levels 0.50, 0.95 and 0.99. Top left: likelihood ratio region with χ²₂ quantiles; top right: likelihood ratio region with bootstrap quantiles; bottom left: studentized bootstrap on original scales; bottom right: studentized bootstrap on logarithmic scales. R = 999 bootstrap samples from fitted gamma model with μ̂ = 108.083 and κ̂ = 0.7065. + denotes MLE. Axes: κ (kappa) against μ (mu).]

Let Y denote a unit vector on the lower half-sphere with cartesian coordinates (cos X cos Z, cos X sin Z, sin X)ᵀ, where X and Z are degrees of latitude and longitude. The population quantity of interest is the mean polar axis, a(θ, φ) = (cos θ cos φ, cos θ sin φ, sin θ)ᵀ, defined as the axis given by the eigenvector corresponding to the largest eigenvalue of E(Y Yᵀ). The sample value of this is given by the corresponding eigenvector of the matrix n^{−1} Σⱼ yⱼyⱼᵀ, where yⱼ is the vector of cartesian coordinates of the jth pole position. The sample mean polar axis has latitude θ̂ = −76.3 and longitude φ̂ = 83.8. Figure 5.10 shows the original data in an equal-area projection onto a plane tangential to the South Pole, at θ = −90; the hollow circle represents the sample mean polar axis.


[Figure 5.9  Histograms of μ̂* and κ̂* from R = 999 bootstrap samples from gamma model with μ̂ = 108.083 and κ̂ = 0.7065, fitted to air-conditioning data. Horizontal axes: mu (50 to 300) and kappa (0.5 to 3.0).]

Table 5.9  Latitude (°) and longitude (°) of pole positions determined from the palaeomagnetic study of New Caledonian laterites (Fisher et al., 1987, p. 278).

      Lat    Long     Lat    Long     Lat    Long     Lat    Long
    -26.4   324.0   -52.1    83.2   -80.5   108.4   -74.3    90.2
    -32.2   163.7   -77.3   182.1   -77.7   266.0   -81.0   170.9
    -73.1    51.9   -68.8   110.4    -6.9    19.1   -12.7   199.4
    -80.2   140.5   -68.4   142.2   -59.4   281.7   -75.4   118.6
    -71.1   267.2   -29.2   246.3    -5.6   107.4   -85.9    63.7
    -58.7    32.0   -78.5   222.6   -62.6   105.3   -84.8    74.9
    -40.8    28.1   -65.4   247.7   -74.7   120.2    -7.4    93.8
    -14.9   266.3   -49.0    65.6   -65.3   286.6   -29.8    72.8
    -66.1   144.3   -67.0   282.6   -71.6   106.4   -85.2   113.2
     -1.8   256.2   -56.7    56.2   -23.3    96.5   -53.1    51.5
    -38.3   146.8   -72.7   103.1   -60.2    33.2   -63.4   154.8
    -17.2    89.9   -81.6   295.6   -40.4    41.0
    -56.2    35.6   -75.1    70.7   -53.6    59.1

In order to set a confidence region for the mean polar axis, or equivalently (θ, φ), we let

    b(θ, φ) = (−sin θ cos φ, −sin θ sin φ, cos θ)ᵀ,    c(θ, φ) = (−sin φ, cos φ, 0)ᵀ

denote the unit vectors orthogonal to a(θ, φ). The sample values of these vectors are â, b̂ and ĉ, and the sample eigenvalues are l̂₁ ≤ l̂₂ ≤ l̂₃. Let A denote the 2 × 3 matrix (b̂, ĉ)ᵀ and B the 2 × 2 matrix with (j, k)th element

    {(l̂₃ − l̂ⱼ)(l̂₃ − l̂ₖ)}^{−1} n^{−1} Σᵢ (êⱼᵀyᵢ)(êₖᵀyᵢ)(âᵀyᵢ)²,

where ê₁ = b̂ and ê₂ = ĉ.


[Figure 5.10  Equal-area projection of the laterite data onto the plane tangential to the South Pole (+). The sample mean polar axis is the hollow circle, and the square region is for comparison with Figures 5.11 and 10.3.]

Then the analogue of (5.60) is

    Q = n a(θ, φ)ᵀ Aᵀ B^{−1} A a(θ, φ),    (5.65)

which is approximately distributed as a χ²₂ variable in large samples. In the bootstrap analogue of Q, a is replaced by â, and A and B are replaced by the corresponding quantities calculated from the bootstrap sample.
Figure 5.11 shows results from setting confidence regions for the mean polar axis based on Q. The panels show the 0.5, 0.95 and 0.99 contours, using χ²₂ quantiles and those based on R = 999 nonparametric bootstrap replicates q*. The contours are elliptical in this projection. For this sample size it would not be misleading to use the asymptotic 0.5 and 0.95 quantiles, though the 0.99 quantiles differ by more. However, simulations with a random subset of size n = 20 gave dramatically different quantiles, and it seems to be essential to use the bootstrap quantiles for smaller sample sizes.
A different approach is to set T = (θ̂, φ̂)ᵀ, and then to base a confidence region for (θ, φ) on (5.60), with V taken to be the nonparametric delta method estimate of the covariance matrix. This approach does not take into account the geometry of spherical data and works very poorly in this example, partly because the estimate t is close to the South Pole, which limits the range of φ̂*.

[Figure 5.11  The 0.5, 0.95 and 0.99 confidence regions for the mean polar axis of the laterite data based on (5.65), using χ²₂ quantiles (left) and bootstrap quantiles (right). The boundary of each panel is the square region in Figure 5.10; also shown are the South Pole (+) and the sample mean polar axis (○).]

5.9 Conditional Confidence Regions

In parametric inference the probability calculations for confidence regions should in principle be made conditional on the ancillary statistics for the model, when these exist, the basic reason being to ensure that the inference accounts for the actual information content in the observed data. In parametric models what is ancillary is often specific to the mathematical form of F, and there is no nonparametric analogue. However, there are situations where there is a model-free ancillary indicator of the experiment, as with the design of a regression experiment (Chapter 6). In fact there is such an indicator in one of our earlier examples, and we now use this to illustrate some of the points which arise with conditional bootstrap confidence intervals.
Example 5.16 (City population data)  For the ratio estimation problem of Example 1.2, the statistic D = Ū would often be regarded as ancillary. The reason rests in part on the notion of a model for linear regression of X on U with variation proportional to u. The left panel of Figure 5.12 shows the scatter plot of t* versus d* for the R = 999 nonparametric bootstrap samples used earlier. The observed value of d is 103.1. The middle and right panels of the figure show trends in the conditional mean and variance, E*(T* | d*) and var*(T* | d*), these being approximated by crude local averaging in the scatter plot on the left.
The calculation of confidence limits for the ratio θ = E(X)/E(U) is to be made conditional on d* = d, the observed mean of u. Suppose, for example, that we want to apply the basic bootstrap method. Then we need to approximate the conditional quantiles a_p(d) of T − θ given D = d for p = α and 1 − α, and

[Figure 5.12  City population data, n = 49. Scatter plot of bootstrap ratio estimates t* versus d*, and conditional means and variances of t* given d*. R = 999 nonparametric samples. Horizontal axes: d* (80 to 160).]

Table 5.10  City population data, n = 49. Comparison of unconditional and conditional cumulative probabilities for bootstrap ratio T*. R = 9999 nonparametric samples, R_d = 499 used for conditional probabilities.

    Unconditional   0.010  0.025  0.050  0.100  0.900  0.950  0.975  0.990
    Conditional     0.006  0.020  0.044  0.078  0.940  0.974  0.988  1.000

use these in (5.3). The bootstrap estimate of a_p(d) is the value â_p(d) defined by

    Pr*{T* − t ≤ â_p(d) | D* = d} = p,

and the simplest way to use our simulated samples to approximate this is to use only those samples for which d* is near d. For example, we could take the R_d = 99 samples whose d* values are closest to d and approximate â_p(d) by the 100pth ordered value of t* − t in those samples.
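A sketch of this nearest-samples device (ours, with synthetic data standing in for the city populations):

    import numpy as np

    rng = np.random.default_rng(6)
    n = 49
    u = rng.lognormal(4.5, 0.5, size=n)
    x = 1.2 * u + rng.normal(0, np.sqrt(u))       # variation proportional to u
    t, d = x.mean() / u.mean(), u.mean()

    R, R_d = 9999, 499
    idx = rng.integers(0, n, size=(R, n))
    t_star = x[idx].mean(axis=1) / u[idx].mean(axis=1)
    d_star = u[idx].mean(axis=1)

    # keep the R_d resamples whose d* lie closest to the observed d
    near = np.argsort(np.abs(d_star - d))[:R_d]
    a_lo, a_hi = np.quantile(t_star[near] - t, [0.025, 0.975])
    print("conditional basic limits:", t - a_hi, t - a_lo)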
Certainly stratification of the simulation results by intervals of d* values shows quite strong conditional effects, as evidenced in Figure 5.12. The difficulty is that R_d = 99 samples is not enough to obtain good estimates of conditional quantiles, and certainly not to distinguish between unconditional quantiles and the conditional quantiles given d* = d, which is near the mean. Only with an increase of R to 9999, and using strata of R_d = 499 samples, does a clear picture emerge. Figure 5.13 shows plots of conditional quantile estimates from this larger simulation.
How different are the conditional and unconditional distributions? Table 5.10 shows bootstrap estimates of the cumulative conditional probabilities Pr(T ≤ a_p | D = d), where a_p is the unconditional p quantile, for several values of p. Each estimate is the proportion of times in R_d = 499 samples that t* is less than or equal to the unconditional quantile estimate t*_{(10000p)}. The comparison suggests that conditioning does not have a large effect in this case.
A more efficient use of bootstrap samples, which takes advantage of the smoothness of quantiles as a function of d, is to estimate quantiles for interval strata of R_d samples and then for each level p to fit a smooth curve. For example, if the kth such stratum gives quantile estimates â_{p,k} and average

[Figure 5.13  City population data, n = 49. Conditional 0.025 and 0.975 quantiles of bootstrap ratio t* from R = 9999 samples, with strata of size R_d = 499. The horizontal dotted lines are unconditional quantiles, and the vertical dotted line is at d* = d. Horizontal axis: ancillary d*.]

[Figure 5.14  City population data, n = 49. Smooth spline fits to 0.025 and 0.975 conditional quantiles of bootstrap ratio t* from R = 9999 samples, using overlapping strata of size R_d = 199. Horizontal axis: ancillary d*.]

value d̄ₖ for d*, then we can fit a smoothing spline to the points (d̄ₖ, â_{p,k}) for each p and interpolate the required value â_p(d) at the observed d. Figure 5.14 illustrates this for R = 9999 and non-overlapping strata of size R_d = 199, with p = 0.025 and 0.975. Note that interpolation is only needed at the centre of the curve. Use of non-overlapping intervals seems to give the best results.

An alternative smoothing method is described in Problem 5.16. In Chapter 9 we shall see that in some cases, including the preceding example, it is possible to get accurate approximations to conditional quantiles using theoretical methods.


[Figure 5.15  Annual discharge of River Nile at Aswan, 1871-1970 (Cobb, 1978). Horizontal axis: year; vertical axis: annual volume.]

Just as with unconditional analysis, so with conditional analysis there is a choice of bootstrap confidence interval methods. From our earlier discussion the studentized bootstrap and adjusted percentile methods are likely to work best for statistics that are approximately normal, as in the previous example. The adjusted percentile method requires constants a, v_L and w, all of which must now be conditional; see Problem 5.17. The studentized bootstrap method can be applied as before with Z = (T − θ)/V^{1/2}, except that now conditional quantiles will be needed. Some simplification may occur if it is possible to standardize with a conditional standard error.
The next example illustrates another way of overcoming the paucity of bootstrap samples which satisfy the conditioning constraint.
Example 5.17 (Nile data)  The data plotted in Figure 5.15 are annual discharges y of the River Nile at Aswan from 1871 to 1970. Interest lies in the year 1870 + θ in which the mean discharge drops from μ₁ = 1100 to μ₂ = 870; these mean values are estimated, but it is reasonable to ignore this fact and we shall do so.
The least squares estimate θ̂ of the integer θ maximizes

    S(θ) = Σ_{j=1}^{θ} {yⱼ − ½(μ₁ + μ₂)}.

Standard normal-theory likelihood analysis suggests that differences in S(θ) for θ near θ̂ are ancillary statistics. We shall reduce these differences to two particular statistics which measure skewness and curvature of S(·) near θ̂,


Table 5.11  Nile data. Part of the table of proportions (%) of bootstrap samples for which |θ̂* − θ̂| ≤ 1, for interval values of b* and c*. R = 10000 samples.

    c* \ b*     ..   -0.62  -0.37  -0.17   0.17   0.37   0.62   0.87
    1.64        59     62     52     88     53     81     71     83
    2.44        68     79     62     82     50     68     53     81
    2.45         .     50     76     76     81     85     86    100
    4.62        92     84     93     93     95     97     87     93
    4.87        91     91     91     95     89     92     92     95
    5.12        92     96    100     95     86     97    100     97
    5.49        97     96     89     98     96     95     97     96
    6.06        94    100    100    100     97     96     95     95
    6.94        93    100    100    100    100    100    100    100

namely

    B = S(θ̂ + 5) − S(θ̂ − 5),    C = S(θ̂ + 5) − 2S(θ̂) + S(θ̂ − 5);

for numerical convenience we rescale B and C by 0.0032. It is expected that B and C respectively influence the bias and variability of θ̂. We are interested in the conditional confidence that should be attached to the set θ̂ ± 1, that is

    Pr(|θ̂ − θ| ≤ 1 | b, c).
The data analysis gives θ̂ = 28 (year 1898), b = 0.75 and c = 5.5.
With no assumption on the shape of the distribution of Y, except that it is constant, the obvious bootstrap sampling scheme is as follows. First calculate the residuals eⱼ = yⱼ − μ₁, j = 1, …, 28 and eⱼ = yⱼ − μ₂, j = 29, …, 100. Then simulate data series by y*ⱼ = μ₁ + ε*ⱼ, j = 1, …, 28 and y*ⱼ = μ₂ + ε*ⱼ, j = 29, …, 100, where ε*ⱼ is randomly sampled from e₁, …, e₁₀₀. Each such sample series then gives θ̂*, b* and c*.
From R = 10000 bootstrap samples we find that the proportion of samples with |θ̂* − θ̂| ≤ 1 is 0.862, which is the unconditional bootstrap confidence. But when these samples are partitioned according to b* and c*, strong effects show up. Table 5.11 shows part of the table of proportions for outcome |θ̂* − θ̂| ≤ 1 for a 16 × 15 partition, 201 of these partitions being non-empty and most of them having at least 50 bootstrap samples. The proportions are consistently higher than 0.95 for (b*, c*) near (b, c), which strongly suggests that the conditional confidence Pr(|θ̂ − θ| ≤ 1 | b = 0.75, c = 5.5) exceeds 0.95.
The conditional probability Pr(|θ̂ − θ| ≤ 1 | b, c) will be smooth in b and c, so it makes sense to assume that the estimate

    p(b*, c*) = Pr*(|θ̂* − θ̂| ≤ 1 | b*, c*)



is smooth in b*, c*. We fitted a logistic regression to the proportions in the 201 non-empty cells of the complete version of Table 5.11, the result being

    logit p̂(b*, c*) = 0.51 − 0.20b*² + 0.68c*.

The residual deviance is 223 on 198 degrees of freedom, which indicates an adequate fit for this simple model. The conditional bootstrap confidence is the fitted value of p at b* = b, c* = c, which is 0.972 with standard error 0.009. So the conditional confidence attached to θ̂ = 28 ± 1 is much higher than the unconditional value.
The value of the standard error for the fitted value corresponds to a binomial standard error for a sample of size 3500, or 35% of the whole bootstrap simulation, which indicates high efficiency for this method of estimating conditional probability.

5.10 Prediction

Closely related to confidence regions for parameters are confidence regions for future outcomes of the response Y, more usually called prediction regions. Applications are typically in more complicated contexts involving regression models (Chapters 6 and 7) and time series models (Chapter 8), so here we give only a brief discussion of the main ideas.
In the simplest situation we are concerned with prediction of one future response Y_{n+1} given observations y₁, …, yₙ from a distribution F. The ideal upper γ prediction limit is the γ quantile of F, which we denote by a_γ(F). The simplest approach to calculating a prediction limit is the plug-in approach, that is substituting the estimate F̂ for F to give â_γ = a_γ(F̂). But this is clearly biased in the optimistic direction, because it does not allow for the uncertainty in F̂. Resampling is used to correct for, or remove, this bias.
Parametric case
Suppose first that we have a fully parametric model, F = F_θ, say. Then the prediction limit a_γ(F) can be expressed more directly as a_γ(θ). The true coverage of this limit over repetitions of both data and predictand will not generally be γ, but rather

    Pr{Y_{n+1} ≤ a_γ(θ̂) | θ} = h(γ),    (5.66)

say, where h(·) is unknown except that it must be increasing. (The coverage also depends on θ in general, but we suppress this from the notation for simplicity.) The idea is to estimate h(·) by resampling. So, for data Y₁*, …, Yₙ* and predictand Y*_{n+1} all sampled from F̂ = F_θ̂, we estimate (5.66) by

    ĥ(γ) = Pr*{Y*_{n+1} ≤ a_γ(θ̂*)},    (5.67)

where as usual θ̂* is the estimator calculated for data Y₁*, …, Yₙ*. In practice it would usually be necessary to use R simulated repetitions of the sampling and approximate (5.67) by

    ĥ(γ) = R^{−1} Σ_{r=1}^{R} I{y*_{n+1,r} ≤ a_γ(θ̂*_r)}.    (5.68)

Once ĥ(γ) has been calculated, the adjusted γ prediction limit is taken to be â_γ = a_{g(γ)}(θ̂), where ĥ{g(γ)} = γ.
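The calibration (5.67)-(5.68) is direct to implement; the sketch below (ours) does it for a normal model, estimating ĥ(γ) on a grid and inverting by interpolation (an illustrative substitute for the smooth logit regression used in the next example).

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(8)
    y = rng.normal(10, 3, size=10)
    n, R, gamma = len(y), 4999, 0.95
    mu, sig = y.mean(), y.std()                  # plug-in fit (divisor n)

    a = lambda g, m, s: m + s * norm.ppf(g)      # plug-in prediction limit

    # simulate data and predictand from the fitted model, as in (5.68)
    ys = rng.normal(mu, sig, size=(R, n))
    y_new = rng.normal(mu, sig, size=R)
    m_s, s_s = ys.mean(axis=1), ys.std(axis=1)

    grid = np.linspace(0.5, 0.999, 200)
    h = np.array([np.mean(y_new <= a(g, m_s, s_s)) for g in grid])

    g_gamma = np.interp(gamma, h, grid)          # solve h{g(gamma)} = gamma
    print("plug-in:", a(gamma, mu, sig), "adjusted:", a(g_gamma, mu, sig))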
Example 5.18 (Normal prediction limit)  Suppose that Y₁, …, Y_{n+1} are independently sampled from the N(μ, σ²) distribution, where μ and σ are unknown, and that we wish to predict Y_{n+1} having observed y₁, …, yₙ. The plug-in method gives the basic γ prediction limit

    â_γ = ȳ + s Φ^{−1}(γ),

where ȳ = n^{−1} Σ yⱼ and s² = n^{−1} Σ (yⱼ − ȳ)². If we write Yⱼ = μ + σεⱼ, so that the εⱼ are independent N(0, 1), then (5.66) becomes

    h(γ) = Pr{Y_{n+1} ≤ Ȳ + SΦ^{−1}(γ)} = Pr[Z_{n−1} ≤ {(n − 1)/(n + 1)}^{1/2} Φ^{−1}(γ)],    (5.69)

where ε̄ is the average of ε₁, …, εₙ and Z_{n−1} = (ε_{n+1} − ε̄)/{s_ε ((n + 1)/(n − 1))^{1/2}} has the Student-t distribution with n − 1 degrees of freedom. This leads directly to the Student-t prediction limit

    ȳ + s{(n + 1)/(n − 1)}^{1/2} t_{n−1,γ},

where t_{n−1,γ} is the γ quantile of the Student-t distribution with n − 1 degrees of freedom.
In this particular case, then, h(·) does not need to be estimated. But if we had not recognized the occurrence of the Student-t distribution, then the first probability in (5.69) would have been estimated by applying (5.68) with samples generated from the N(ȳ, s²) distribution. Such an estimate (corresponding to infinite R) is plotted in Figure 5.16 for sample size n = 10. The plot has logit scales to emphasize the discrepancy between h(γ) and γ. Given values of the estimate ĥ(γ), a smooth curve can be obtained by quadratic regression of their logits on logits of γ; this is illustrated in the figure, where the solid line is the regression fit. The required value g(γ) can be read off from the curve.

The preceding example suggests a more direct method for special cases involving means, which makes use of a point prediction ŷ_{n+1} and the distribution of prediction error Y_{n+1} − ŷ_{n+1}: resampling can be used to estimate this distribution directly. This method will be applied to linear regression models in Section 6.3.3.


[Figure 5.16  Adjustment function ĥ(γ) for prediction with sample size n = 10 from N(μ, σ²), with quadratic logistic fit (solid), and line giving h(γ) = γ (dots). Horizontal axis: logit of γ.]

Nonparametric case
Now consider the nonparametric context, where F̂ is the EDF of a single sample. The calculations outlined for the parametric case apply here also. First, if r/n ≤ γ < (r + 1)/n then the plug-in prediction limit is a_γ(F̂) = y_{(r)}; equivalently, a_γ(F̂) = y_{([nγ])}, where [·] means integer part. Straightforward calculation shows that

    Pr(Y_{n+1} ≤ y_{(r)}) = r/(n + 1),

which means that (5.66) becomes h(γ) = [nγ]/(n + 1). Therefore [ng(γ)]/(n + 1) = γ, so that the adjusted prediction limit is y_{([(n+1)γ])}: this is exact if (n + 1)γ is an integer.
It seems intuitively clear that the efficiency of this nonparametric prediction limit relative to a parametric prediction limit would be considerably lower than would be the case for confidence limits on a parameter. For example, a comparison between the normal-theory and nonparametric methods for samples from a normal distribution shows the efficiency to be about ½ for α = 0.05.
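In code the adjustment amounts to shifting one order statistic (a sketch, with illustrative data):

    import numpy as np

    rng = np.random.default_rng(9)
    y = np.sort(rng.exponential(2.0, size=19))
    n, gamma = len(y), 0.95

    plug_in = y[int(n * gamma) - 1]        # y_([n*gamma]): optimistically biased
    adjusted = y[int((n + 1) * gamma) - 1] # y_([(n+1)*gamma]): exact coverage
    print(plug_in, adjusted)               # when (n+1)*gamma is an integer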
For semiparametric problems similar calculations apply. One general approach which makes sense in certain applications, as mentioned earlier, bases prediction limits on point predictions, and uses resampling to estimate the distribution of prediction error. For further details see Sections 6.3.3 and 7.2.4.


5.11 Bibliographic Notes

Standard methods for obtaining confidence intervals are described in Chapters 7 and 9 of Cox and Hinkley (1974), while more recent developments in likelihood-based methods are outlined by Barndorff-Nielsen and Cox (1994). Corresponding methods based on resample likelihoods are described in Chapter 10.
Bootstrap confidence intervals were introduced in the original bootstrap paper by Efron (1979); bias adjustment and studentizing were discussed by Efron (1981b). The adjusted percentile method was developed by Efron (1987), who gives detailed discussion of the bias and skewness adjustment factors b and a. In part this development responded to issues raised by Schenker (1985). The ABC method and its theoretical justification were laid out by DiCiccio and Efron (1992). Hall (1988a, 1992a) contain rigorous developments of the second-order comparisons between competing methods, including the studentized bootstrap methods, and give references to earlier work dating back to Singh (1981). DiCiccio and Efron (1996) give an excellent review of the BCa and ABC methods, together with their asymptotic properties and comparisons to likelihood-based methods. An earlier review, with discussion, was given by DiCiccio and Romano (1988).
Other empirical comparisons of the accuracy of bootstrap confidence interval methods are described in Section 4.4.4 of Shao and Tu (1995), while Lee and Young (1995) make comparisons with iterated bootstrap methods. Their conclusions and those of Canty, Davison and Hinkley (1996) broadly agree with those reached here.
Tibshirani (1988) discussed empirical choice of a variance-stabilizing transformation for use with the studentized bootstrap method.
Choice of simulation size R is investigated in detail by Hall (1986). See also the related references for Chapter 4 concerning choice of R to maintain high test power.
The significance test method has been studied by Kabaila (1993a) and discussed in detail by Carpenter (1996). Buckland and Garthwaite (1990) and Garthwaite and Buckland (1992) describe an efficient algorithm to find confidence limits in this context. The particular application discussed in Example 5.11 is a modified version of Jennison (1992). One intriguing application, to phylogenetic trees, is described by Efron, Halloran and Holmes (1996).
The double bootstrap method of adjustment in Section 5.6 is similar to that developed by Beran (1987) and Hinkley and Shi (1989); see also Loh (1987). The method is sometimes called bootstrap calibration. Hall and Martin (1988) give a detailed analysis of the reduction in coverage error. Lee and Young (1995) provide an efficient algorithm for approximating the method without simulation when the parameter is a smooth function of means. Booth and Hall


(1994) discuss the numbers of samples required when the nested bootstrap is used to calibrate a confidence interval.
Conditional methods have received little attention in the literature. Example 5.17 is taken from Hinkley and Schechtman (1987). Booth, Hall and Wood (1992) describe kernel methods for estimating the conditional distribution of a bootstrap statistic.
Confidence regions for vector parameters are almost untouched in the literature. There are no general analogues of adjusted percentile methods. Hall (1987) discusses likelihood-based shapes for confidence regions.
Geisser (1993) surveys several approaches to calculating prediction intervals, including resampling methods such as cross-validation.
References to confidence interval and prediction interval methods for regression models are given in the notes for Chapters 6 and 7; see also Chapter 8 for time series.

5.12 Problems

1  Suppose that we have a random sample y₁, …, yₙ from a distribution F whose mean μ is unknown but whose variance is known and equal to σ². Discuss possible nonparametric resampling methods for obtaining confidence intervals for μ, including the following: (i) use z = √n(ȳ − μ)/σ and resample from the EDF; (ii) use z = √n(ȳ − μ)/s and resample from the EDF; (iii) as in (ii) but replace the EDF of the data by the EDF of the values ȳ + σ(yᵢ − ȳ)/s; (iv) as in (ii) but replace the EDF by a distribution on the data values whose mean and variance are ȳ and σ². Here s² is the usual sample variance of y₁, …, yₙ.

2  Suppose that θ is the correlation coefficient for a bivariate distribution. If this distribution is bivariate normal, show that the MLE θ̂ is approximately N(θ, (1 − θ²)²/n). Use the delta method to show that the transformed correlation parameter ζ for which ζ̂ is approximately N(ζ, n^{−1}) is ζ = ½ log{(1 + θ)/(1 − θ)}.
Compare the use of normal approximations for θ̂ and ζ̂ with use of a parametric bootstrap analysis to obtain confidence intervals for θ: see Practical 5.1.
(Section 5.2)
3  Independent measurements y₁, …, yₙ come from a distribution with range [0, θ]. Suppose that we resample by taking samples of size m from the data, and base confidence intervals on Q = m(t − T*)/t, where T* = max{Y₁*, …, Y_m*}. Show that this works provided that m/n → 0 as n → ∞, and use simulation to check its performance when n = 100 and Y has the U(0, θ) distribution.
(Sections 2.6.1, 5.2)

4  The gamma model (1.1) with mean μ and index κ can be applied to the data of Example 1.1. For this model, show that the profile log likelihood for μ is

    ℓ_prof(μ) = nκ̂_μ log(κ̂_μ/μ) + (κ̂_μ − 1) Σ log yⱼ − κ̂_μ Σ yⱼ/μ − n log Γ(κ̂_μ),

where κ̂_μ is the solution to the estimating equation

    n log(κ/μ) + n + Σ log yⱼ − Σ yⱼ/μ − nψ(κ) = 0,

with ψ(κ) the derivative of log Γ(κ).
Describe an algorithm for simulating the distribution of the log likelihood ratio statistic W(μ) = 2{ℓ_prof(μ̂) − ℓ_prof(μ)}, where μ̂ is the overall maximum likelihood estimate.
(Section 5.2)

5  Consider simulation to estimate the distribution of Z = (T − θ)/V^{1/2}, using R independent replicates with ordered values z*_{(1)} ≤ ⋯ ≤ z*_{(R)}, where z* = (t* − t)/v*^{1/2} is based on nonparametric bootstrapping of a sample y₁, …, yₙ. Let α = (r + 1)/(R + 1), so that a one-sided confidence interval for θ with nominal coverage α is I_R = [t − v^{1/2}z*_{(r+1)}, ∞).
(a) Show that

    Pr(θ ∈ I_R | F̂) = Pr*(z ≤ z*_{(r+1)}) = Σ_{s=0}^{r} (R choose s) p^s (1 − p)^{R−s},

where p = p(F̂) = Pr*(Z* ≤ z | F̂). Let P be the random variable corresponding to p(F̂), with CDF G(·). Hence show that the unconditional probability is

    Pr(θ ∈ I_R) = Σ_{s=0}^{r} (R choose s) ∫₀¹ u^s (1 − u)^{R−s} dG(u).

Note that Pr(P ≤ α) = Pr{θ ∈ [T − V^{1/2}Z*_α, ∞)}, where Z*_α is the α quantile of the distribution of Z*, conditional on Y₁, …, Yₙ.
(b) Suppose that it is reasonable to approximate the distribution of P by the beta distribution with density u^{a−1}(1 − u)^{b−1}/B(a, b), 0 < u < 1; note that a, b → ∞ as n → ∞. For some representative values of R, α, a and b, compare the coverage error of I_R with that of the interval [T − V^{1/2}Z*_α, ∞).
(Section 5.2.3; Hall, 1986)
6  Capability or precision indices are used to indicate whether a process satisfies a specification of form (L, U), where L and U are the lower and upper specification limits. If the process is in control, observations y₁, …, yₙ on it are taken to have mean μ and standard deviation σ. Two basic capability indices are then θ = (U − L)/σ and η = 2 min{(U − μ)/σ, (μ − L)/σ}, with precision regarded as low if θ < 6, medium if 6 ≤ θ ≤ 8, and high if θ > 8, and similarly for η, which is intended to be sensitive to the possibility that μ ≠ ½(L + U). Estimates of θ and η are obtained by replacing μ and σ with sample estimates, such as
(i) the usual estimates μ̂ = ȳ = n^{−1} Σ yⱼ and σ̂ = {(n − 1)^{−1} Σ (yⱼ − ȳ)²}^{1/2};
(ii) μ̂ = ȳ and σ̂ = r̄ₖ/dₖ, where r̄ₖ = b^{−1} Σ r_{k,i} and r_{k,i} is the range max yⱼ − min yⱼ of the ith block of k observations, namely y_{k(i−1)+1}, …, y_{ki}, where n = kb. Here dₖ is a scaling factor chosen so that r̄ₖ/dₖ estimates σ.
(a) When estimates (i) are used, and the yⱼ are independent N(μ, σ²) variables, show that an exact (1 − 2α) confidence interval for θ has endpoints

    θ̂ {c_{n−1,α}/(n − 1)}^{1/2},    θ̂ {c_{n−1,1−α}/(n − 1)}^{1/2},

where c_{n−1,α} is the α quantile of the χ²_{n−1} distribution.
(b) With the set-up in (a), suppose that parametric simulation from the fitted normal distribution is used to generate replicate values θ̂₁*, …, θ̂_R* of θ̂. Show that for R = ∞, the true coverage of the percentile confidence interval with nominal coverage (1 − 2α) is

    Pr{(n − 1)²/c_{n−1,1−α} ≤ C ≤ (n − 1)²/c_{n−1,α}},

where C has the χ²_{n−1} distribution. Give also the coverages of the basic bootstrap confidence intervals based on θ̂ and log θ̂.
Calculate these coverages for n = 25, 50, 75 and α = 0.05, 0.025, and 0.005. Which of these intervals is preferable?
(c) See Practical 5.4, in which we take d₅ = 2.236.
(Section 5.3.1)
7  Suppose that we have a parametric model with parameter vector ψ, and that θ = h(ψ) is the parameter of interest. The adjusted percentile (BCₐ) method is found by applying the scalar parameter method to the least-favourable family, for which the log likelihood ℓ(ψ) is replaced by ℓ_LF(λ) = ℓ(ψ̂ + λδ), with δ = i^{−1}(ψ̂)ḣ(ψ̂), where ḣ(·) is the vector of partial derivatives of h. Equations (5.21), (5.22) and (5.24) still apply.
Show in detail how to apply this extension of the BCₐ method to the problem of calculating confidence intervals for the ratio θ = μ₂/μ₁ of the means of two exponential distributions, given independent samples from those distributions. Use a numerical example (such as Example 5.10) to compare the BCₐ method to the exact method, which is based on the fact that θ̂/θ has an F distribution.
(Sections 5.3.2, 5.4.2; Efron, 1987)

8  For the ratio of independent means in Example 5.10, show that the matrix of second derivatives ü(μ) has elements

    ü_{1i,1j} = (n²t/(n₁²ȳ₁)) {2(y_{1i} − ȳ₁)(y_{1j} − ȳ₁)/ȳ₁ + (y_{1i} − ȳ₁) + (y_{1j} − ȳ₁)},

    ü_{1i,2j} = −(n²/(n₁n₂ȳ₁²)) {(y_{1i} − ȳ₁)(y_{2j} − ȳ₂)},

and

    ü_{2i,2j} = −(n²/(n₂²ȳ₁)) {(y_{2i} − ȳ₂) + (y_{2j} − ȳ₂)}.

Use these results to check the value of the constant c used in the ABC method in that example.
9  For the data of Example 1.2 we are interested in the ratio of means θ = E(X)/E(U). Define μ = (E(U), E(X))ᵀ and write θ = t(μ), which is estimated by t = t(s) with s = (ū, x̄)ᵀ. Show that

    ṫ(μ) = (−μ₂/μ₁², 1/μ₁)ᵀ,    ẗ(μ) = [2μ₂/μ₁³, −1/μ₁²; −1/μ₁², 0].

From Problem 2.16 we have lⱼ = eⱼ/ū with eⱼ = xⱼ − tuⱼ. Derive expressions for the constants a, b and c in the nonparametric ABC method, and note that b = c v_L^{1/2}. Hence show that the ABC confidence limit is given by

    θ̂_α = {x̄ + d_α Σ xⱼeⱼ/(n²v_L^{1/2}ū)} / {ū + d_α Σ uⱼeⱼ/(n²v_L^{1/2}ū)},

where d_α = (a + z_α)/{1 − a(a + z_α)}².
Apply this result to the full dataset with n = 49, for which ū = 103.14, x̄ = 127.80, t = 1.239, v_L = 0.0119, and a = 0.0205.
(Section 5.4.2)
10  Suppose that the parameter θ is estimated by solving the monotone estimating equation S_Y(θ) = 0, with unique solution T. If the random variable c(Y, θ) has (approximately or exactly) the known, continuous distribution function G, and if U ∼ G, then define t_U to be the solution to c(y, t_U) = U for a fixed observation vector y. Show that for suitable A, t − t_U ≐ A^{−1}c(y, t_U) = A^{−1}U has roughly the same distribution as A^{−1}c(Y, θ) ≐ T − θ, and deduce that the distributions of t − t_U and T − θ are roughly the same.
The distribution of t − t_U can be approximated by simulation, and this provides a way to approximate the distribution of T − θ. Comment critically on this resampling confidence limit method.
(Parzen, Wei and Ying, 1994)

11  Consider deriving an upper confidence limit for θ by test inversion. If T is an estimator for θ, and S is an estimator for the nuisance parameter λ, and if var(T | θ, λ) = σ²(θ, λ), then define Z = (T − θ₀)/σ(θ₀, S). Show that an exact upper 1 − α confidence limit is u₁₋α = u₁₋α(t, s, λ), which satisfies

    Pr{(T − u₁₋α)/σ(u₁₋α, S) ≤ (t − u₁₋α)/σ(u₁₋α, s) | ψ = (u₁₋α, λ)} = α.

The bootstrap confidence limit is û₁₋α = u₁₋α(t, s, s). Show that if S is a consistent estimator for λ, that is S = λ + o_p(1) as n → ∞, then the method is consistent in the sense that Pr(θ ≤ û₁₋α) = 1 − α + o(1). Further show that under certain conditions the coverage differs from 1 − α by O(n^{−1}).
(Section 5.5; Kabaila, 1993a; Carpenter, 1996)
12

The normal approximation method for an upper 1 − α confidence limit gives
θ̂ + z₁₋α v^{1/2}. Show that bootstrap adjustment of the nominal level 1 − α in
z₁₋α leads to the studentized bootstrap method.
(Section 5.6; Beran, 1987)

13

The bootstrap method of adjustment can be applied to the percentile method.
Show that the analogue of (5.55) is
$$ \Pr{}^*\{\Pr{}^{**}(T^{**} \le t \mid \hat F^*) \le 1 - q(\alpha) \mid \hat F\} = 1 - \alpha. $$
The adjusted 1 − α upper confidence limit is then the 1 − q(α) quantile of T*.
In the parametric bootstrap analysis for a single exponential mean, show that the
percentile method gives upper 1 − α limit ȳ c_{2n,1−α}/(2n), where c_{ν,α} denotes
the α quantile of the χ²_ν distribution. Verify that the bootstrap adjustment of
this limit gives the exact upper 1 − α limit 2nȳ/c_{2n,α}.
(Section 5.6; Beran, 1987; Hinkley and Shi, 1989)
14

Show how to make a bootstrap adjustment of the studentized bootstrap confidence
limit method for a scalar parameter.
(Section 5.6)


15

For an equi-tailed (1 − 2α) confidence interval, the ideal endpoints are t + β with
values of β solving (3.31) with
$$ h_1(\hat F, F; \beta) = I\{t(\hat F) - t(F) \le \beta\} - \alpha, $$
$$ h_2(\hat F, F; \beta) = I\{t(\hat F) - t(F) \le \beta\} - (1 - \alpha). $$
Suppose that the bootstrap solutions are denoted by β̂_α and β̂_{1−α}, and that in
the language of Section 3.9.1 the adjustments b(F̂, γ) are β̂_{α+γ₁} and β̂_{1−α+γ₂}. Show
how to estimate γ₁ and γ₂, and verify that these adjustments modify coverage
1 − 2α + O(n⁻¹) to 1 − 2α + O(n⁻²).
(Sections 3.9.1, 5.6; Hall and Martin, 1988)
16

Suppose that D is an approximate ancillary statistic and that we want to estimate
the conditional probability G(u | d) = Pr(T − θ ≤ u | D = d) using R simulated
values (t*_r, d*_r). One smooth estimate is the kernel estimate
$$ \hat G(u \mid d) = \frac{\sum_{r=1}^R I\{t_r^* - t \le u\}\, w\{h^{-1}(d_r^* - d)\}}{\sum_{r=1}^R w\{h^{-1}(d_r^* - d)\}}, $$
where w(·) is a density symmetric about zero and h is an adjustable bandwidth.
Investigate the bias and variance of this estimate in the case where (T, D) is
approximately bivariate normal and w(·) = φ(·). Show that h = R^{−1/2} is a reasonable
choice.
(Section 5.9; Booth, Hall and Wood, 1992)
17

Suppose that (T, D) are approximately bivariate normal, with D an ancillary
statistic upon whose observed value d we wish to condition when calculating
confidence intervals. If the adjusted percentile method is to be used, then we need
conditional evaluations of the constants a, v_L and w. One approach to this is based
on selecting the subset of the R bootstrap samples for which d* = d. Then w
can be calculated in the usual way, but restricted to this subset. For a and v_L we
need empirical influence values, and these can be approximated by the regression
method of Section 2.7.4, but using only the selected subset of samples.
Investigate whether or not this approach makes sense.
(Section 5.9)

18

Suppose that y₁, ..., y_n are sampled from an unknown distribution, which is known
to be symmetric about its median. Then to calculate a 1 − α upper prediction
limit for a further observation Y_{n+1}, the plug-in approach would use the 1 − α
quantile of the symmetrized EDF (Example 3.4). Develop a resampling algorithm
for obtaining a bias-corrected prediction limit.
(Section 5.10)

19

For estimating the mean μ of a population with unknown variance, we want to
find a (1 − 2α) confidence interval with specified length ℓ. Given data y₁, ..., y_n,
consider the following approach. Create bootstrap samples of sizes N = n, n + 1, ...
and calculate confidence intervals (e.g. by the studentized bootstrap method) for
each N. Then choose as total sample size that N for which the interval length is
ℓ or less. An additional N − n data values are then obtained, and a bootstrap
confidence interval applied. Discuss this approach, and investigate it numerically
for the case where the data are sampled from a N(μ, σ²) distribution.

5.13 Practicals
1

Suppose that we wish to calculate a 90% confidence interval for the correlation
θ between the two counts in the columns of cd4; see Practical 2.3. To obtain
confidence intervals for θ under nonparametric resampling, using the empirical
influence values to calculate v_L:

cd4.boot <- boot(cd4, corr.fun, stype="w", R=999)
boot.ci(cd4.boot, conf=0.9)
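Here corr.fun is the weighted correlation statistic from Practical 2.3, returning the
estimate and its nonparametric delta method variance. If it is not already defined, a
sketch along the following lines can be used (this is our reconstruction for illustration,
not the book's own listing; the influence values are the standard ones for a correlation):

corr.fun <- function(d, w = rep(1, nrow(d))/nrow(d))
{  w <- w/sum(w)
   n <- nrow(d)
   # weighted means, variances and covariance of the two columns
   m1 <- sum(w*d[,1]); m2 <- sum(w*d[,2])
   v1 <- sum(w*(d[,1]-m1)^2); v2 <- sum(w*(d[,2]-m2)^2)
   v12 <- sum(w*(d[,1]-m1)*(d[,2]-m2))
   rho <- v12/sqrt(v1*v2)
   # empirical influence values for the correlation, then v_L
   L <- (d[,1]-m1)*(d[,2]-m2)/sqrt(v1*v2) -
        rho*((d[,1]-m1)^2/v1 + (d[,2]-m2)^2/v2)/2
   c(rho, sum(w*L^2)/n) }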
To obtain intervals on the variance-stabilized scale, i.e. based on
t = ½ log{(1 + θ̂)/(1 − θ̂)}:

fisher <- function(r) 0.5*log((1+r)/(1-r))
fisher.dot <- function(r) 1/(1-r^2)
fisher.inv <- function(z) (exp(2*z)-1)/(exp(2*z)+1)
boot.ci(cd4.boot, h=fisher, hdot=fisher.dot, hinv=fisher.inv, conf=0.9)
How well do the intervals compare? Is the normal approximation reliable here?
To compare intervals under parametric simulation from a fitted bivariate normal
distribution:

cd4.rg <- function(data, mle)
{ d <- matrix(rnorm(2*nrow(data)), nrow(data), 2)
  d[,2] <- mle[5]*d[,1] + sqrt(1-mle[5]^2)*d[,2]
  d[,1] <- mle[1] + mle[3]*d[,1]
  d[,2] <- mle[2] + mle[4]*d[,2]
  d }
n <- nrow(cd4)
cd4.mle <- c(apply(cd4,2,mean), sqrt(apply(cd4,2,var)*(n-1)/n),
             corr(cd4))
cd4.para <- boot(cd4, corr.fun, R=999, sim="parametric",
                 ran.gen=cd4.rg, mle=cd4.mle)
boot.ci(cd4.para, type=c("norm","basic","stud","perc"), conf=0.9)
boot.ci(cd4.para, h=fisher, hdot=fisher.dot, hinv=fisher.inv,
        type=c("norm","basic","stud","perc"), conf=0.9)
To obtain the corresponding interval using the nonparametric ABC method:

abc.ci(cd4, corr, conf=0.9)


Do the differences among the various intervals reflect what you would expect?
(Sections 5.2, 5.3, 5.4.2; DiCiccio and Efron, 1996)

2

Suppose that we wish to calculate a 90% confidence interval for the largest
eigenvalue θ of the covariance matrix of the two counts in the columns of cd4; see
Practicals 2.3 and 5.1. To obtain confidence intervals for θ under nonparametric
resampling, using the empirical influence values to calculate v_L:

eigen.fun <- function(d, w = rep(1, nrow(d))/nrow(d))
{ w <- w/sum(w)
  n <- nrow(d)
  m <- crossprod(w, d)
  m2 <- sweep(d, 2, m)
  v <- crossprod(diag(sqrt(w)) %*% m2)
  eig <- eigen(v, symmetric=T)
  stat <- eig$values[1]
  e <- eig$vectors[,1]
  i <- rep(1:n, round(n*w))
  ds <- sweep(d[i,], 2, m)


  L <- (ds %*% e)^2 - stat
  c(stat, sum(L^2)/n^2) }
cd4.boot <- boot(cd4,eigen.fun,R=999,stype="w")
boot.ci(cd4.boot, conf=0.90)
abc.ci(cd4, eigen.fun, conf=0.9)
Discuss the differences among the various intervals.
(Sections 5.2, 5.3, 5.4.2; DiCiccio and Efron, 1996)
3

Dataframe amis contains data made available by G. Amis of Cambridgeshire
County Council on the speeds in miles per hour of cars at pairs of sites on roads
in Cambridgeshire. Speeds were measured at each site before and then again after
the erection of a warning sign at one site of each pair. The quantity of interest is
the mean relative change in the 0.85 quantile of the speeds for each pair, i.e. the
mean of the quantities (η_{a1} − η_{b1}) − (η_{a0} − η_{b0}); here η_{a0} and η_{a1} are the 0.85 quantiles
of the speed distribution at the site where the sign was placed, before and after
its erection. This quantity is chosen because the warning is particularly intended
to slow faster drivers. About 100 speeds are available for each combination of
14 pairs of sites and three periods, one before and two after the warnings were
erected, but some of the pairs overlap. We work with a slightly smaller dataset, for
which the ηs are:

amis1 <- amis[(amis$pair!=4)&(amis$pair!=6)&(amis$period!=3),]
tapply(amis1$speed, list(amis1$period, amis1$warning, amis1$pair),
       quantile, 0.85)
To attempt to set confidence intervals for θ, by stratified resampling from the
speeds at each combination of site and period:

amis.fun <- function(data, i)
{ d <- data[i, ]
  d <- tapply(d$speed, list(d$period, d$warning, d$pair), quantile, 0.85)
  mean((d[2,1,] - d[1,1,]) - (d[2,2,] - d[1,2,])) }
str <- 4*(amis1$pair-1) + 2*(amis1$warning-1) + amis1$period
amis1.boot <- boot(amis1, amis.fun, R=99, strata=str)
amis1.boot$t0
qqnorm(amis1.boot$t)
abline(mean(amis1.boot$t), sqrt(var(amis1.boot$t)), lty=2)
boot.ci(amis1.boot, type=c("basic","perc","norm"), conf=0.9)
(There are 4800 cases in amis1 so this is demanding on memory: it may be
necessary to increase the object.size.) Do the resampled averages look normal?
Can you account for the differences among the intervals?
How big is the average effect of the warnings?
(Section 5.2)
4

Dataframe capability gives data from Bissell (1990) comprising 75 successive
observations with specification limits U = 5.79 and L = 5.49; see Problem 5.6. To
check that the process is in control and that the data are close to independent
normal random variables:

par(mfrow=c(2,2))
tsplot(capability$y, ylim=c(5,6))
abline(h=5.79, lty=2); abline(h=5.49, lty=2)
qqnorm(capability$y)
acf(capability$y)


acf(capability$y,type="partial")
To find nonparametric confidence limits for η using the estimates given by (ii) in
Problem 5.6:

capability.fun <- function(data, i, U=5.79, L=5.49, dk=2.236)
{ y <- data$y[i]
  m <- mean(y)
  r5 <- apply(matrix(y,15,5), 1, function(y) diff(range(y)))
  s <- mean(r5)/dk
  2*min((U-m)/s, (m-L)/s) }
capability.boot <- boot(capability, capability.fun, R=999)
boot.ci(capability.boot, type=c("norm","basic","perc"))
Do the values of t* look normal? Why is there such a difference between the
percentile and basic bootstrap limits? Which do you think are more reliable here?
(Sections 5.2, 5.3)

5

Following on from Practical 2.3, we use a double bootstrap with M = 249 to adjust
the studentized bootstrap interval for a correlation coefficient applied to the cd4
data.
nested.corr <- function(data, w, t0, M)
{ n <- nrow(data)
  i <- rep(1:n, round(n*w))
  t <- corr.fun(data, w)
  z <- (t[1]-t0)/sqrt(t[2])
  nested.boot <- boot(data[i,], corr.fun, R=M, stype="w")
  z.nested <- (nested.boot$t[,1]-t[1])/sqrt(nested.boot$t[,2])
  c(z, sum(z.nested < z)/(M+1)) }
cd4.boot <- boot(cd4, nested.corr, R=9, stype="w",
                 t0=corr(cd4), M=249)
To get some idea how long you will have to wait if you set R = 999 you can
time the call to boot using unix.time or dos.time: beware of time and memory
problems. It may be best to run a batch job, with contents

cd4.boot <- boot(cd4, nested.corr, R=99, stype="w", t0=corr(cd4), M=249)
junk <- boot(cd4, nested.corr, R=100, stype="w", t0=corr(cd4), M=249)
cd4.boot$t <- rbind(cd4.boot$t, junk$t)
cd4.boot$R <- cd4.boot$R + junk$R

but with the last three lines repeated eight further times.
cd4.nested contains a nested simulation we did earlier. To compare the actual
and nominal coverage levels:

par(pty="s")
qqplot((1:cd4.nested$R)/(1+cd4.nested$R), cd4.nested$t[,2],
       xlab="nominal coverage", ylab="estimated coverage", pch=".")
lines(c(0,1), c(0,1))
How close to nominal is the estimated coverage? To read off the original and
corrected 95% confidence intervals:

q <- c(0.975, 0.025)
q.adj <- quantile(cd4.nested$t[,2], q)
t0 <- corr.fun(cd4)
z <- sort(cd4.nested$t[,1])


t0[1] - sqrt(t0[2])*z[floor((1+cd4.nested$R)*q)]
t0[1] - sqrt(t0[2])*z[floor((1+cd4.nested$R)*q.adj)]

Does the correction have much effect? Compare this interval with the corresponding
ABC interval.
(Section 5.6)

6
Linear Regression

6.1 Introduction
One of the most important and frequent types of statistical analysis is regression
analysis, in which we study the effects of explanatory variables or covariates on a
response variable. In this chapter we are concerned with linear regression, in which
the mean of the random response Y observed at value x = (x₁, ..., x_p)ᵀ of the
explanatory variable vector is
E(Y | x) = μ(x) = xᵀβ.
The model is completed by specifying the nature of random variation, which
for independent responses amounts to specifying the form of the variance
var(Y | x). For a full parametric analysis we would also have to specify the
distribution of Y, be it normal, Poisson or whatever. Without this, the model
is semiparametric.
For linear regression with normal random errors having constant variance,
the least squares theory of regression estimation and inference provides clean,
exact methods for analysis. But for generalizations to non-normal errors and
non-constant variance, exact methods rarely exist, and we are faced with
approximate methods based on linear approximations to estimators and central
limit theorems. So, just as in the simpler context of Chapters 2-5, resampling
methods have the potential to provide more accurate analysis.
We begin our discussion in Section 6.2 with simple least squares linear regression,
where in ideal conditions resampling essentially reproduces the exact
theoretical analysis, but also offers the potential to deal with non-ideal
circumstances such as non-constant variance. Section 6.3 covers the extension
to multiple explanatory variables. The related topics of aggregate prediction
error and of variable selection based on predictive ability are discussed in
Section 6.4. Robust methods of regression are examined briefly in Section 6.5.


Figure 6.1 Average body weight (kg) and brain weight (g) for 62 species of mammals, plotted on original scales and logarithmic scales (Weisberg, 1985, p. 144).

The further topics of generalized linear models, survival analysis, other nonlinear
regression, classification error, and nonparametric regression models are
deferred to Chapter 7.

6.2 Least Squares Linear Regression


6.2.1 Regression fit and residuals
The left panel of Figure 6.1 shows the scatter plot of response brain weight
versus explanatory variable body weight for n = 62 mammals. As the right
panel of the figure shows, the data are well described by a simple linear
regression after the two variables are transformed logarithmically, so that
y = log(brain weight), x = log(body weight).
The simple linear regression model is
$$ Y_j = \beta_0 + \beta_1 x_j + \varepsilon_j, \qquad j = 1, \ldots, n, \qquad (6.1) $$
where the ε_j are uncorrelated with zero means and equal variances σ². This
constancy of variance, or homoscedasticity, seems roughly right for the example
data. We refer to the data (x_j, y_j) as the jth case.
In general the values x_j might be controlled (by design), randomly sampled,
or merely observed as in the example. But we analyse the data as if the x_j
were fixed, because the amount of information about β = (β₀, β₁)ᵀ depends
upon their observed values.
The simplest analysis of data under (6.1) is by the ordinary least squares


method, on which we concentrate here. The least squares estimates for β are
$$ \hat\beta_1 = \frac{\sum (x_j - \bar x) y_j}{SS_x}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad (6.2) $$
where x̄ = n⁻¹Σx_j and SS_x = Σ_{j=1}^n (x_j − x̄)². The conventional estimate of
the error variance σ² is the residual mean square s² = (n − 2)⁻¹Σe_j², where
$$ e_j = y_j - \hat\mu_j \qquad (6.3) $$
are the raw residuals, with
$$ \hat\mu_j = \hat\beta_0 + \hat\beta_1 x_j \qquad (6.4) $$
the fitted values, or estimated mean values, for the response at the observed x
values.
The basic properties of the parameter estimates β̂₀, β̂₁, which are easily
obtained under model (6.1), are
$$ E(\hat\beta_0) = \beta_0, \qquad \mathrm{var}(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{SS_x}\right), \qquad (6.5) $$
and
$$ E(\hat\beta_1) = \beta_1, \qquad \mathrm{var}(\hat\beta_1) = \frac{\sigma^2}{SS_x}. \qquad (6.6) $$
The estimates are normally distributed and optimal if the errors ε_j are normally
distributed; they are often approximately normal for other error distributions,
but they are not robust to gross non-normality of errors or to outlying response
values.
The raw residuals e_j are important for various aspects of model checking,
and potentially for resampling methods since they estimate the random errors
ε_j, so it is useful to summarize their properties also. Under (6.1),
$$ e_j = \sum_{k=1}^n (\delta_{jk} - h_{jk})\varepsilon_k, \qquad (6.7) $$
where
$$ h_{jk} = \frac{1}{n} + \frac{(x_j - \bar x)(x_k - \bar x)}{SS_x}, $$
with δ_{jk} equal to 1 if j = k and zero otherwise. The quantities h_{jj} are known
as leverages, and for convenience we denote them by h_j. It follows from (6.7)
that
$$ E(e_j) = 0, \qquad \mathrm{var}(e_j) = \sigma^2(1 - h_j). \qquad (6.8) $$


One consequence of this last result is that the estimator S² that corresponds
to s² has expected value σ², because Σ(1 − h_j) = n − 2. Note that with the
intercept β₀ in the model, Σe_j = 0 automatically.
The raw residuals e_j can be modified in various ways to make them suitable
for diagnostic methods, but the most useful modification for our purposes is
to change them to have constant variance, that is
$$ r_j = \frac{e_j}{(1 - h_j)^{1/2}}. \qquad (6.9) $$
We shall refer to these as modified residuals, to distinguish them from standardized
residuals, which are in addition divided by the sample standard deviation.
(Standardized residuals are called studentized residuals by some authors.)
A normal Q-Q plot of the r_j will reveal obvious outliers, or clear non-normality
of the random errors, although the latter may be obscured somewhat because
of the averaging property of (6.7).
A simpler modification of residuals is to use 1 − h̄ = 1 − 2n⁻¹ instead of the
individual leverages 1 − h_j, where h̄ is the average leverage; this will have a
very similar effect only if the leverages h_j are fairly homogeneous. This simpler
modification implies multiplication of all raw residuals e_j by (1 − 2n⁻¹)^{−1/2};
the average will equal zero automatically because Σe_j = 0.
If (6.1) holds with homoscedastic random errors ε_j and if those random
errors are normally distributed, or if the dataset is large, then standard
distributional results will be adequate for drawing inferences with the least squares
estimates. But if the errors are very non-normal or heteroscedastic, meaning
that their variances are unequal, then those standard results may not be reliable
and a resampling method may offer genuine improvement. In Sections 6.2.3
and 6.2.4 we describe two quite different resampling methods, the second of
which is robust to failure of the model assumptions.
If strong non-normality or heteroscedasticity (which can be difficult to
distinguish) appear to be present, then robust regression estimates may be
considered in place of least squares estimates. These will be discussed in
Section 6.5.

6.2.2 Alternative models


The linear regression model (6.1) can arise in two ways, and for our purposes
it can be useful to distinguish them.
First formulation
The first possibility is that the pairs are randomly sampled from a bivariate
distribution F for (X, Y). Then linear regression refers to linearity of the
conditional mean of Y given X = x, that is
$$ E(Y \mid X = x) = \mu_Y + \gamma(x - \mu_X), \qquad \gamma = \sigma_{XY}/\sigma_X^2, \qquad (6.10) $$


with μ_X = E(X), μ_Y = E(Y), σ_X² = var(X) and σ_{XY} = cov(X, Y). This
conditional mean corresponds to the mean in (6.1), with
$$ \beta_0 = \mu_Y - \gamma\mu_X, \qquad \beta_1 = \gamma. \qquad (6.11) $$

The parameters β = (β₀, β₁)ᵀ are here seen to be statistical functions of the
kind met in earlier chapters, in this case based on the first and second moments
of F. The random errors ε_j in (6.1) will be homoscedastic with respect to x if
F is bivariate normal, for example, but not in general.
The least squares estimators (6.2) correspond to the use of sample moments
in (6.10). For future reference we note (Problem 6.1) that the influence function
for the least squares estimators t = (β̂₀, β̂₁)ᵀ is the vector
$$ L_t\{(x, y); F\} = (y - \beta_0 - \beta_1 x)\begin{pmatrix} 1 - \mu_X(x - \mu_X)/\sigma_X^2 \\ (x - \mu_X)/\sigma_X^2 \end{pmatrix}. \qquad (6.12) $$
The empirical influence values as defined in Section 2.7.2 are therefore
$$ l_j = e_j \begin{pmatrix} 1 - n\bar x(x_j - \bar x)/SS_x \\ n(x_j - \bar x)/SS_x \end{pmatrix}. \qquad (6.13) $$
The nonparametric delta method variance approximation (2.36) applied to β̂₁
gives
$$ v_L = \frac{\sum (x_j - \bar x)^2 e_j^2}{SS_x^2}. \qquad (6.14) $$

This makes no assumption of homoscedasticity. In practice we modify the
variance approximation to account for leverage, replacing e_j by r_j as defined
in (6.9).
Second formulation
The second possibility is that at any value of x, responses Y_x can be sampled
from a distribution F_x(y) whose mean and variance are μ(x) and σ²(x), such
that μ(x) = β₀ + β₁x. Evidently β₀ = μ(0), and the slope parameter β₁ is a
linear contrast of mean values μ(x₁), μ(x₂), ..., namely
$$ \beta_1 = \frac{\sum (x_j - \bar x)\mu(x_j)}{SS_x}. $$
In principle several responses could be obtained at each x_j. Simple linear
regression with homoscedastic errors, with which we are initially concerned,
corresponds to σ(x) = σ and
$$ F_x(y) = G\{y - \mu(x)\}. \qquad (6.15) $$
So G is the distribution of random error, with mean zero and variance σ².
Any particular application is characterized by the design x₁, ..., x_n and the
corresponding distributions F_x, the means of which are defined by linear
regression.


The influence function for the least squares estimator is again given by
(6.12), but with μ_X and σ_X² respectively replaced by x̄ and n⁻¹Σ(x_j − x̄)².
Empirical influence values are still given by (6.13). The analogue of linear
approximations (2.35) and (3.1) is β̂ ≐ β + n⁻¹Σ_j L_t{(x_j, y_j); F}, with variance
n⁻²Σ_{j=1}^n var[L_t{(x_j, Y_j); F}]. If the assumed homoscedasticity of errors
is used to evaluate this, with the constant variance σ² estimated by n⁻¹Σe_j²,
then the delta method variance approximation for β̂₁, for example, is
$$ \frac{\sum e_j^2}{n\,SS_x}; $$
strictly speaking this is a semiparametric approximation. This differs by a
factor of (n − 2)/n from the standard estimate, which is given by (6.6) with
residual mean square s² in place of σ².
The standard analysis for linear regression as outlined in Section 6.2.1 is the
same for both situations, provided the random errors ε_j have equal variances,
as would usually be judged from plots of the residuals.

6.2.3 Resampling errors


To extend the resampling algorithms of Chapters 2-3 to regression, we have
first to identify the underlying model F. Now if (6.1) is literally correct with
homoscedastic errors, then those errors are effectively sampled from a single
distribution. If the x_j are treated as fixed, then the second formulation of
Section 6.2.2 applies, G being the common error distribution. The model F
is the series of distributions F_x for x = x₁, ..., x_n, defined by (6.15). The
resampling model is the corresponding series of estimated distributions F̂_x in
which each μ(x_j) is replaced by the regression fit μ̂(x_j) and G is estimated from
all residuals.
For parametric resampling we would estimate G according to the assumed
form of error distribution, for example the N(0, s²) distribution if normality
were judged appropriate. (Of course resampling is not necessary for the normal
linear model, because exact theoretical results are available.) For nonparametric
resampling, on which we concentrate in this chapter, we need a generalization
of the EDF used in Chapter 2. If the random errors ε_j were known, then
their EDF would be appropriate. As it is we have the raw residuals e_j which
estimate the ε_j, and their EDF will usually be consistent for G. But for practical
use it is better to use the residuals r_j defined in (6.9), because their variances
agree with those of the ε_j. Noting that G is assumed to have mean zero in the
model, we then estimate G by the EDF of r_j − r̄, where r̄ is the average of the
r_j. These centred residuals have mean zero, and we refer to their EDF as Ĝ.
The full resampling model is taken to have the same design as the data,
that is x_j* = x_j; it then specifies the conditional distribution of Y_j* given x_j*


through the estimated version of (6.1), which is
$$ Y_j^* = \hat\mu_j + \varepsilon_j^*, \qquad j = 1, \ldots, n, \qquad (6.16) $$
with μ̂_j = β̂₀ + β̂₁x_j and ε_j* randomly sampled from Ĝ. So the algorithm
to generate simulated datasets and corresponding parameter estimates is as
follows.

Algorithm 6.1 (Model-based resampling in linear regression)
For r = 1, ..., R,
1 For j = 1, ..., n,
(a) set x_j* = x_j;
(b) randomly sample ε_j* from r₁ − r̄, ..., r_n − r̄; then
(c) set y_j* = β̂₀ + β̂₁x_j* + ε_j*.
2 Fit least squares regression to (x₁*, y₁*), ..., (x_n*, y_n*), giving estimates
β̂₀,r*, β̂₁,r*, s_r*².
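In the boot library this scheme can be carried out through the parametric simulation
interface; the following is a minimal sketch, assuming a dataframe mammals with
columns x and y holding the log-transformed data (the variable names are ours, for
illustration):

fit <- lm(y ~ x, data=mammals)
res <- resid(fit)/sqrt(1 - lm.influence(fit)$hat)  # modified residuals r_j
res <- res - mean(res)                             # centre them
mam.mle <- list(fit=fitted(fit), res=res)
mam.rg <- function(data, mle)
{ data$y <- mle$fit + sample(mle$res, nrow(data), replace=T)
  data }
mam.fun <- function(data) coef(lm(y ~ x, data=data))
mam.boot <- boot(mammals, mam.fun, R=499, sim="parametric",
                 ran.gen=mam.rg, mle=mam.mle)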

The resampling means and variances of β̂₀* and β̂₁* will agree very closely
with standard least squares theory. To see this, consider for example the slope
estimate, whose bootstrap sample value can be written
$$ \hat\beta_1^* = \hat\beta_1 + \frac{\sum (x_j - \bar x)\varepsilon_j^*}{SS_x}. $$
Because E*(ε_j*) = n⁻¹Σ(r_j − r̄) = 0, it follows that E*(β̂₁*) = β̂₁. Also, because
var*(ε_j*) = n⁻¹Σ_{j=1}^n (r_j − r̄)² for all j,
$$ \mathrm{var}^*(\hat\beta_1^*) = \frac{\sum (x_j - \bar x)^2 \,\mathrm{var}^*(\varepsilon_j^*)}{SS_x^2} = \frac{n^{-1}\sum (r_j - \bar r)^2}{SS_x}. $$
The latter will be approximately equal to the usual estimate s²/SS_x, because
n⁻¹Σ(r_j − r̄)² ≈ (n − 2)⁻¹Σe_j² = s². In fact if the individual h_j are replaced by
their average h̄, then the means and variances of β̂₀* and β̂₁* are given exactly
by (6.5) and (6.6) with the estimates β̂₀, β̂₁ and s² substituted for parameter
values. The advantage of resampling is improved quantile estimation when
normal-theory distributions of the estimators β̂₀, β̂₁, S² are not accurate.

Example 6.1 (Mammals) For the data plotted in the right panel of Figure 6.1,
the simple linear regression model seems appropriate. Standard analysis suggests
that errors are approximately normal, although there is a small suspicion
of heteroscedasticity: see Figure 6.2. The parameter estimates are β̂₀ = 2.135
and β̂₁ = 0.752.
From R = 499 bootstrap simulations according to the algorithm above, the

Figure 6.2 Normal Q-Q plot of modified residuals r_j and their plot against leverage values h_j for linear regression fit to log-transformed mammal data.

estimated standard errors of intercept and slope are respectively 0.0958 and
0.0273, compared to the theoretical values 0.0960 and 0.0285. The empirical
distributions of bootstrap estimates are almost perfectly normal, as they are
for the studentized estimates. The estimated 0.05 and 0.95 quantiles for the
studentized slope estimate
$$ z^* = \frac{\hat\beta_1^* - \hat\beta_1}{SE(\hat\beta_1^*)}, $$
where SE(β̂₁) is the standard error for β̂₁ obtained from (6.6), are z*_{(25)} = −1.640
and z*_{(475)} = 1.589, compared to the standard normal quantiles ±1.645. So, as
expected for a moderately large clean dataset, the resampling results agree
closely with those obtained from standard methods.

Zero intercept
In some applications the intercept β₀ will not be included in (6.1). This affects
the estimation of β₁ and σ² in obvious ways, but the resampling algorithm will
also differ. First, the leverage values are different, namely h_j = x_j²/Σ_k x_k²,
so the modified residuals will be different. Secondly, because now Σe_j ≠ 0, it is
essential to mean-correct the residuals before using them to simulate random
errors.
Repeated design points
If there are repeat observations at some or all values of x, this offers an
enhanced opportunity to detect heteroscedasticity: see Section 6.2.6. With

many such repeats it is in principle possible to estimate the CDFs F_x separately
(Section 6.2.2), but there is rarely enough data for this to be useful in practice.
The main advantage of repeats is the opportunity it affords to test the
adequacy of the linear regression formulation, by splitting the residual sum of
squares into a pure error component and a goodness-of-fit component. To
the extent that the comparison of these components through the usual F ratio
is quite sensitive to non-normality and heteroscedasticity, resampling methods
may be useful in interpreting that F ratio (Practical 6.3).

6.2.4 Resampling cases

A completely different approach would be to imagine the data as a sample from
some bivariate distribution F of (X, Y). This will sometimes, but not often,
mimic what actually happened. In this approach, as outlined in Section 6.2.2,
the regression coefficients are viewed as statistical functions of F, and defined
by (6.10). Model (6.1) still applies, but with no assumption on the random
errors ε_j other than independence. When (6.10) is evaluated at F̂ we obtain
the least squares estimates (6.2).
With F now the bivariate distribution of (X, Y), it is appropriate to take
F̂ to be the EDF of the data pairs, and resampling will be from this EDF,
just as in Chapter 2. The resampling simulation therefore involves sampling
pairs with replacement from (x₁, y₁), ..., (x_n, y_n). This is equivalent to taking
(x_j*, y_j*) = (x_I, y_I), where I is uniformly distributed on {1, 2, ..., n}. Simulated
values β̂₀*, β̂₁* of the coefficient estimates are computed from (x₁*, y₁*), ..., (x_n*, y_n*)
using the least squares algorithm which was applied to obtain the original
estimates β̂₀, β̂₁. So the resampling algorithm is as follows.

Algorithm 6.2 (Resampling cases in regression)
For r = 1, ..., R,
1 sample i₁*, ..., i_n* randomly with replacement from {1, 2, ..., n};
2 for j = 1, ..., n, set x_j* = x_{i_j*}, y_j* = y_{i_j*}; then
3 fit least squares regression to (x₁*, y₁*), ..., (x_n*, y_n*), giving estimates
β̂₀,r*, β̂₁,r*, s_r*².
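With the boot library, case resampling needs only a statistic that accepts an index
vector; a minimal sketch, using the same illustrative dataframe as before, is:

mam.case <- function(data, i)
  coef(lm(y ~ x, data=data[i, ]))   # refit to the resampled cases
mam.case.boot <- boot(mammals, mam.case, R=999)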
There are two important differences between this second bootstrap method
and the previous one using a parametric model and simulated errors. First,
with the second method we make no assumption about variance homogeneity;
indeed we do not even assume that the conditional mean of Y given X = x
is linear. This offers the advantage of potential robustness to heteroscedasticity,
and the disadvantage of inefficiency if the constant-variance model is correct.
(The model E(Y | X = x) = α + β₁(x − x̄), which some writers use in place
of (6.1), is not useful here because α = β₀ + β₁x̄ is a function not only of F
but also of the data, through x̄.)
Secondly, the simulated samples have different designs, because the values



Table 6.1 Mammals data. Comparison of bootstrap biases and standard errors of intercept and slope with theoretical results, standard and robust. Resampling cases with R = 999.

                      Theoretical   Resampling cases   Robust theoretical
β̂₀  bias                    0            0.0006
    standard error      0.096             0.091              0.088
β̂₁  bias                    0            0.0002
    standard error      0.0285            0.0223             0.0223

x₁*, ..., x_n* are randomly sampled. The design fixes the information content of a
sample, and in principle our inference should be specific to the information in
our data. The variation in x₁*, ..., x_n* will cause some variation in information,
but fortunately this is often unimportant in moderately large datasets; see,
however, Examples 6.4 and 6.6.
Note that in general the resampling distribution of a coefficient estimate
will not have mean equal to the data estimate, contrary to the unbiasedness
property that the estimate in fact possesses. However, the difference is usually
negligible.
Example 6.2 (Mammals) For the data of Example 6.1, a bootstrap simulation
was run by resampling cases with R = 999. Table 6.1 shows the bias and
standard error results for both intercept and slope. The estimated biases are
very small. The striking feature of the results is that the standard error for the
slope is considerably smaller than in the previous bootstrap simulation, which
agreed with standard theory. The last column of the table gives robust versions
of the standard errors, which are calculated by estimating the variance of ε_j to
be r_j². For example, the robust estimate of the variance of β̂₁ is
$$ v_{rob} = \frac{\sum (x_j - \bar x)^2 r_j^2}{SS_x^2}. \qquad (6.17) $$
This corresponds to the delta method variance approximation (6.14), except
that r_j is used in preference to e_j. As we might have expected from previous
discussion, the bootstrap gives an approximation to the robust standard error.
Figure 6.3 shows normal Q-Q plots of the bootstrap estimates β̂₀* and β̂₁*.
For the slope parameter the right panel shows lines corresponding to normal
distributions with the usual and the robust standard errors. The distribution
of β̂₁* is close to normal, with variance much closer to the robust form (6.17)
than to the usual form (6.6).

One disadvantage of the robust standard error is its inefficiency relative to
the usual standard error when the latter is correct. A fairly straightforward
calculation (Problem 6.6) gives the efficiency, which is approximately 40% for
the slope parameter in the previous example. Thus the effective degrees of
freedom for the robust standard error is approximately 0.40 times 62, or 25.


The same loss of efficiency would apply approximately to bootstrap results for
resampling cases.

6.2.5 Significance tests for slope


Suppose that we want to test whether or not the covariate x has an effect on
the response y, assuming linear regression is appropriate. In terms of model
parameters, the null hypothesis is H₀ : β₁ = 0. If we use the least squares
estimate β̂₁ as the basis for such a test, then this is equivalent to testing
the Pearson correlation coefficient. This connection immediately suggests one
nonparametric test, the permutation test of Example 4.9. However, this is not
always valid, so we need also to consider other possible bootstrap tests.
Permutation test
The permutation test of correlation applies to the null hypothesis of independence
between X and Y when these are both random. Equivalently it
applies when the null hypothesis implies that the conditional distribution of Y
given X = x does not depend upon x. In the context of linear regression this
means not only zero slope, but also constant error variance. The justification
then rests simply on the exchangeability of the response values under the null
hypothesis.
If we use X_{(·)} to denote the ordered values of X₁, ..., X_n, and so forth, then
the exact level of significance for the one-sided alternative H_A : β₁ > 0 and test
statistic T is
$$ p = \Pr\{T \ge t \mid X_{(\cdot)} = x_{(\cdot)}, Y_{(\cdot)} = y_{(\cdot)}, H_0\} = \Pr[T \ge t \mid X = x, Y = \mathrm{perm}\{y_{(\cdot)}\}], $$

Figure 6.3 Normal plots for bootstrapped estimates of intercept (left) and slope (right) for linear regression fit to logarithms of mammal data, with R = 999 samples obtained by resampling cases. The dotted lines give approximate normal distributions based on the usual formulae (6.5) and (6.6), while the dashed line shows the normal distribution for the slope using the robust variance estimate (6.17).


where perm{·} denotes a permutation. Because all permutations are equally
likely, we have
$$ p = \frac{\#\ \text{of permutations such that}\ T \ge t}{n!}, $$
as in (4.20). In the present context we can take T = β̂₁, for which p is the same
as if we used the sample Pearson correlation coefficient, but the same method
applies for any appropriate slope estimator. In practice the test is performed
by generating samples (x₁*, y₁*), ..., (x_n*, y_n*) such that x_j* = x_j and (y₁*, ..., y_n*)
is a random permutation of (y₁, ..., y_n), and fitting the least squares slope
estimate β̂₁*. If this is done R times, then the one-sided P-value for alternative
H_A : β₁ > 0 is
$$ p = \frac{\#\{\hat\beta_1^* \ge \hat\beta_1\} + 1}{R + 1}. $$
It is easy to show that studentizing the slope estimate would not affect
this test; see Problem 6.4. The test is exact in the sense that the P-value has
a uniform distribution under H₀, as explained in Section 4.1; note that this
uniform distribution holds conditional on the x values, which is the relevant
property here.
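A direct sketch of this Monte Carlo permutation test, for illustrative vectors x
and y, is:

perm.test <- function(x, y, R=999)
{ t0 <- coef(lm(y ~ x))[2]                # observed slope
  tstar <- numeric(R)
  for (r in 1:R) {
    ystar <- sample(y)                    # random permutation of responses
    tstar[r] <- coef(lm(ystar ~ x))[2]
  }
  (1 + sum(tstar >= t0))/(R + 1) }        # one-sided P-value for beta1 > 0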
First bootstrap test
A bootstrap test whose result will usually differ negligibly from that of the
permutation test is obtained by taking the null model as the pair of marginal
EDFs of x and y, so that the x_j* are randomly sampled with replacement from
the x_j, and independently the y_j* are randomly sampled from the y_j. Again
β̂₁* is the slope fitted to the simulated data, and the formula for p is the same.
As with the permutation test, the null hypothesis being tested is stronger than
just zero slope.
The permutation method and its bootstrap look-alike apply equally well to
any slope estimate, not just the least squares estimate.
Second bootstrap test
The next bootstrap test is based explicitly on the linear model structure with
homoscedastic errors, and applies the general approach of Section 4.4. The
null model is the null mean fit and the EDF of residuals from that fit. We
calculate the P-value for the slope estimate under sampling from this fitted
model. That is, data are simulated by
$$ x_j^* = x_j, \qquad y_j^* = \hat\mu_{j0} + \varepsilon_{j0}^*, $$
where μ̂_{j0} = ȳ and the ε_{j0}* are sampled with replacement from the null model
residuals e_{j0} = y_j − ȳ, j = 1, ..., n. The least squares slope β̂₁* is calculated
from the simulated data. After R repetitions of the simulation, the P-value is
calculated as before.


This second bootstrap test differs from the first bootstrap test only in that
the values of explanatory variables x are fixed at the data values for every
case. Note that if residuals were sampled without replacement, this test would
duplicate the exact permutation test, which suggests that this bootstrap test
will be nearly exact.
The test could be modified by standardizing the residuals before sampling
from them, which here would mean adjusting for the constant null model
leverage n⁻¹. This would affect the P-value slightly for the test as described,
but not if the test statistic were changed to the studentized slope estimate.
It therefore seems wise to studentize regression test statistics in general, if
model-based simulation is used; see the discussion of bootstrap pivot tests
below.
Testing non-zero slope values
All of the preceding tests can be easily modified to test a non-zero value of
β₁. If the null value is β₁,₀, say, then we apply the test to modified responses
y_j − β₁,₀x_j, as in Example 6.3 below.
Bootstrap pivot tests
Further bootstrap tests can be based on the studentized bootstrap approach
outlined in Section 4.4.1. For simplicity suppose that we can assume homoscedastic
errors. Then Z = (β̂₁ − β₁)/S₁ is a pivot, where S₁ is the usual
standard error for β̂₁. As a pivot, Z has a distribution not depending upon
parameter values, and this can be verified under the linear model (6.1). The null
hypothesis is H₀ : β₁ = 0, and as before we consider the one-sided alternative
H_A : β₁ > 0. Then the P-value is
$$ p = \Pr(Z \ge z_0 \mid \beta_1 = 0, \beta_0, \sigma) = \Pr(Z \ge z_0 \mid \beta_1, \beta_0, \sigma), $$
because Z is a pivot. The probability on the right is approximated by the
bootstrap probability
$$ \Pr{}^*(Z^* \ge \hat\beta_1/s_1), \qquad (6.18) $$
where Z* = (β̂₁* − β̂₁)/S₁* is computed from a sample simulated according to
Algorithm 6.1, which uses the fit from the full model as in (6.16). So, applying
the bootstrap as described in Section 6.2.3, we calculate the bootstrap P-value
from the results of R simulated samples as
$$ p = \frac{\#\{z_r^* \ge z_0\} + 1}{R + 1}, \qquad (6.19) $$
where z₀ = β̂₁/s₁.
The relation of this method to confidence limits is that if the lower 1 − α

Figure 6.4 Linear regression model fitted to monthly excess returns over riskless rate y for one company versus excess market returns x. The left panel shows the data and fitted line. The right panel plots the absolute values of the standardized residuals against x (Simonoff and Tsai, 1994).


confidence limit for β₁ is above zero, then p < α. Similar interpretations apply
with upper confidence limits and confidence intervals.
The same method can be used with case resampling. If this were done as
a precaution against error heteroscedasticity, then it would be appropriate to
replace s₁ with the robust standard error defined as the square root of (6.17).
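A minimal sketch of the pivot test by case resampling, for a null value beta10 and
a dataframe returns with columns x and y (names assumed for illustration; the
ordinary standard error is used here, and the robust version would substitute the
square root of (6.17)):

pivot.fun <- function(data, i)
{ f <- summary(lm(y ~ x, data=data[i, ]))$coefficients
  c(f[2, 1], f[2, 2]) }                     # slope and its standard error
pivot.boot <- boot(returns, pivot.fun, R=999)
beta10 <- 1
z0 <- (pivot.boot$t0[1] - beta10)/pivot.boot$t0[2]
zstar <- (pivot.boot$t[, 1] - pivot.boot$t0[1])/pivot.boot$t[, 2]
(1 + sum(zstar >= z0))/(pivot.boot$R + 1)   # bootstrap P-value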
If we wish to test a non-zero value β₁,₀ for the slope, then in (6.18) we
simply replace β̂₁/s₁ by z₀ = (β̂₁ − β₁,₀)/s₁, or equivalently compare the lower
confidence limit to β₁,₀.
With all of these tests there are simple modifications if a different alternative
hypothesis is appropriate. For example, if the alternative is H_A : β₁ < 0, then
the inequalities ≥ used in defining p are replaced by ≤, and the two-sided
P-value is twice the smaller of the two one-sided P-values.
On balance there seems little to choose among the various tests described.
The permutation test and its bootstrap look-alike are equally suited to statistics
other than least squares estimates. The bootstrap pivot test with case
resampling is the only one designed to test slope without assuming constant
error variance under the null hypothesis. But one would usually expect similar
results from all the tests.
The extensions to multiple linear regression are discussed in Section 6.3.2.
Example 6.3 (Returns data) The data plotted in Figure 6.4 are n = 60
consecutive cases of monthly excess returns y for a particular company and
excess market returns x, where excess is relative to riskless rate. We shall ignore
the possibility of serial correlation. A linear relationship appears to fit the data,
and the hypothesis of interest is H₀ : β₁ = 1 with alternative H_A : β₁ > 1, the
latter corresponding to the company outperforming the market.


Figure 6.5 Returns data: histogram of R = 999 bootstrap values of studentized slope z* = (β̂₁* − β̂₁)/s*_rob obtained by resampling cases. Unshaded area corresponds to values in excess of the data value z₀ = (β̂₁ − 1)/s_rob = 0.669.

Figure 6.4 and plots of regression diagnostics suggest that error variation
increases with x and is non-normal. It is therefore appropriate to apply the
bootstrap pivot test with case resampling, using the robust standard error from
(6.17), which we denote here by s_rob, to studentize the slope estimate.
Figure 6.5 shows a histogram of R = 999 values of z*. The unshaded part
corresponds to z* greater than the data value
z₀ = (β̂₁ − 1)/s_rob = (1.133 − 1)/0.198 = 0.669,
which happens 233 times. Therefore the bootstrap P-value is 0.234. In fact the
use of the robust standard error makes little difference here: using the ordinary
standard error gives P-value 0.252. Comparison of the ordinary t-statistic to
the standard normal table gives P-value 0.28.

6.2.6 Non-constant variance: weighted error resampling


In some applications the linear model (6.1) will apply, but with heteroscedastic
random errors. If the heteroscedasticity can be modelled, then bootstrap
simulation by resampling errors is still possible. We assume to begin with that
ordinary, i.e. unweighted, least squares estimates are fitted, as before.

Known variance function


Suppose th a t in (6.1) the ran d o m erro r ej a t x = Xj has variance uj, where
either c ? = k V ( x j ) or a j = K V ( f i j ) , with V ( ) a know n function. It is possible
to estim ate k , b u t we do n o t need to d o this. We only require the modified
residuals
r

y j-h
{V (X j)(l-h j)y/2

or

y j-h
{ F( ^. )

1/ 2


which will be approximately homoscedastic. The EDF of these modified residuals,
after subtracting their mean, will estimate the distribution function G of
the scaled, homoscedastic random errors δ_j in the model
$$ Y_j = \beta_0 + \beta_1 x_j + V_j^{1/2}\delta_j, \qquad (6.20) $$
where V_j = V(x_j) or V(μ_j). Algorithm 6.1 for resampling errors is now modified
as follows.
Algorithm 6.3 (Resampling errors with unequal variances)
For r = 1, ..., R,
1 For j = 1, ..., n,
(a) set x_j* = x_j;
(b) randomly sample δ_j* from r₁ − r̄, ..., r_n − r̄; then
(c) set y_j* = β̂₀ + β̂₁x_j* + V_j^{1/2}δ_j*, where V_j is V(x_j) or V(μ̂_j) as
appropriate.
2 Fit linear regression by ordinary least squares to data (x₁*, y₁*), ..., (x_n*, y_n*),
giving estimates β̂₀,r*, β̂₁,r*, s_r*².

Weighted least squares


Of course in this situation ordinary least squares is inferior to weighted least
squares, in which ideally the jth case is given weight w_j = V_j⁻¹. If V_j = V(x_j)
then weighted least squares can be done in one pass through the data, whereas
if V_j = V(μ_j) we first estimate μ_j by ordinary least squares fitted values
μ̂_j, say, and then do a weighted least squares fit with the empirical weights
w_j = 1/V(μ̂_j). In the latter case the standard theory assumes that the weights
are fixed, which is adequate for first-order approximations to distributional
properties. The practical effect of using empirical weights can be incorporated
into the resampling, and so potentially more accurate distributional properties
can be obtained; cf. Example 3.2.
For weighted least squares, the estimates of intercept and slope are
$$ \hat\beta_1 = \frac{\sum w_j(x_j - \bar x_w)y_j}{\sum w_j(x_j - \bar x_w)^2}, \qquad \hat\beta_0 = \bar y_w - \hat\beta_1 \bar x_w, $$
where x̄_w = Σw_jx_j/Σw_j and ȳ_w = Σw_jy_j/Σw_j. Fitted values and raw
residuals are defined as for ordinary least squares, but leverage values and
modified residuals differ. The leverage values are now
$$ h_j = \frac{w_j}{\sum w_i} + \frac{w_j(x_j - \bar x_w)^2}{\sum w_i(x_i - \bar x_w)^2}, $$


and the modified residuals (standardized to equal variance) are
$$ r_j = \frac{w_j^{1/2}(y_j - \hat\mu_j)}{(1 - h_j)^{1/2}}, $$
while the estimated variance of β̂₁ is
$$ \widehat{\mathrm{var}}(\hat\beta_1) = \frac{s^2}{\sum w_j(x_j - \bar x_w)^2}, $$
where κ̂ = s² = (n − 2)⁻¹Σw_j(y_j − μ̂_j)² is the weighted residual mean square.
The algorithm for resampling errors is the same as for ordinary least squares,
summarized in Algorithm 6.3, but with the full weighted least squares procedure
implemented in the final step.
The situation where error variance depends on the mean is a special case of
the generalized linear model, which is discussed more fully in Section 7.2.
Wild bootstrap
What if the variance function V(·) is unspecified? In some circumstances
there may be enough data to model it from the pattern of residual variation,
for example using a plot of modified residuals r_j (or their absolute values
or squares) versus fitted values μ̂_j. This approach can work if there is a clear
monotone relationship of variance with x or μ, or if there are clearly identifiable
strata of constant variance (cf. Figure 7.14). But where the heteroscedasticity
is unpatterned, either resampling of cases should be done with least squares
estimates, or something akin to local estimation of variance will be required.
The most local approach possible is the wild bootstrap, which estimates
variances from individual residuals. This uses the model-based resampling
Algorithm 6.1, but with the jth resampled error ε_j* taken from the two-point
distribution
$$ \Pr\left\{\varepsilon_j^* = \tfrac12(1-\sqrt 5)\,e_j\right\} = \pi, \qquad \Pr\left\{\varepsilon_j^* = \tfrac12(1+\sqrt 5)\,e_j\right\} = 1 - \pi, \qquad (6.21) $$
where π = (5 + √5)/10 and e_j = y_j − μ̂_j is the raw residual. The first three
moments of ε_j* are zero, e_j² and e_j³ (Problem 6.8). This algorithm generates at
most 2ⁿ different values of parameter estimates, and typically gives results that
are underdispersed relative to model-based resampling or resampling cases.
Note that if modified residuals r_j were used in place of raw residuals e_j, then
the variance of β̂₁* under the wild bootstrap would equal the robust variance
estimate (6.17).
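A minimal sketch of the wild bootstrap for the slope, under the same illustrative
variable names as before and using raw residuals as in (6.21), is:

wild.boot <- function(x, y, R=999)
{ fit <- lm(y ~ x)
  e <- resid(fit); mu <- fitted(fit); n <- length(y)
  a <- c((1 - sqrt(5))/2, (1 + sqrt(5))/2)   # two support points of (6.21)
  p <- (5 + sqrt(5))/10                      # probability of the first point
  tstar <- numeric(R)
  for (r in 1:R) {
    ystar <- mu + e*sample(a, n, replace=T, prob=c(p, 1 - p))
    tstar[r] <- coef(lm(ystar ~ x))[2]       # slope from resampled responses
  }
  tstar }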
Example 6.4 (Returns data) As mentioned in Example 6.3, the data in Figure
6.4 show an increase in error variance with market return, x. Table 6.2
compares the bootstrap variances of the parameter estimates from ordinary
least squares for case resampling and the wild bootstrap, with R = 999. The
estimated variance of β̂₁ from resampling cases is larger than for the wild


Table 6.2 Bootstrap variances (×10⁻³) of ordinary least squares estimates for returns data, with R = 999.

                        All cases            Without case 22
                       β̂₀       β̂₁           β̂₀       β̂₁
Cases                  0.32     44.3          0.42     73.2
Cases, subset          0.28     38.4          0.39     59.1
Wild, e_j              0.31     37.9          0.37     62.5
Wild, r_j              0.33     37.0          0.41     67.2
Robust theoretical     0.34     39.4          0.40     67.2

bootstrap, and for the full data it makes little difference when the modified
residuals are used.
Case 22 has high leverage, and its exclusion increases the variances of both
estimates. The wild bootstrap is again less variable than bootstrapping cases,
with the wild bootstrap of modified residuals intermediate between them.
We mentioned earlier that the design will vary when resampling cases. The
left panel of Figure 6.6 shows the simulated slope estimates β̂₁* plotted against
the sums of squares Σ(x_j* − x̄*)², for 200 bootstrap samples. The plotting
character distinguishes the number of times case 22 occurs in the resamples:
we return to this below. The variability of β̂₁* decreases sharply as the sum of
squares increases. Now usually we would treat the sum of squares as fixed in
the analysis, and this suggests that we should calculate the variance of β̂₁* from
those bootstrap samples for which Σ(x_j* − x̄*)² is close to the original value
Σ(x_j − x̄)², shown by the dotted vertical line. If we take the subset between
the dashed lines, the estimated variance is closer to that for the wild bootstrap,
as shown by the values in Table 6.2 and by the Q-Q plot in the right panel of
Figure 6.6. This is also true when case 22 is excluded.
The main reason for the large variability of Σ(x_j* − x̄*)² is that case 22 has
high leverage, as its position at the bottom left of Figure 6.4 shows. Figure 6.6
shows that it has a substantial effect on the precision of the slope estimate:
the most variable estimates are those where case 22 does not occur, and the
least variable those where it occurs two or more times.
6.3 Multiple Linear Regression


The extension of the simple linear regression model (6.1) to several explanatory
variables is
$$ Y_j = \beta_0 x_{j0} + \beta_1 x_{j1} + \cdots + \beta_p x_{jp} + \varepsilon_j, \qquad j = 1, \ldots, n, \qquad (6.22) $$


Figure 6.6 Comparison of wild bootstrap and bootstrapping cases for monthly returns data. The left panel shows 200 estimates of slope β̂₁* plotted against sum of squares Σ(x_j* − x̄*)² for case resampling. Resamples where case 22 occurred zero or one times are labelled accordingly. The right panel shows a Q-Q plot of the values of β̂₁* for the wild bootstrap and the subset of the cases lying within the dashed lines in the left panel.

where for models with an intercept x_{j0} = 1. In the more convenient vector
form the model is
$$ Y_j = x_j^T\beta + \varepsilon_j, $$
with x_jᵀ = (x_{j0}, x_{j1}, ..., x_{jp}). The combined matrix representation for all
responses Yᵀ = (Y₁, ..., Y_n) is
$$ Y = X\beta + \varepsilon, \qquad (6.23) $$
with Xᵀ = (x₁, ..., x_n) and εᵀ = (ε₁, ..., ε_n). As before, the responses Y_j are
supposed independent. This general linear model will encompass polynomial
and interaction models, by judicious definition of x in terms of primitive
variables; for example, we might have x_{j1} = u_{j1} and x_{j2} = u_{j1}², or x_{j3} = u_{j1}u_{j2},
and so forth. When the x_{jk} are dummy variables representing levels of factors,
we omit x_{j0} if the intercept is a redundant parameter.
In many respects the bootstrap analysis for multiple regression is an obvious
extension of the analysis for simple linear regression in Section 6.2. We again
concentrate on least squares model fitting. Particular issues which arise are: (i)
testing for the effect of a subset of the explanatory variables, (ii) assessment of
predictive accuracy of a fitted model, (iii) the effect of p large relative to n, and
(iv) selection of the best model by suitable deletion of explanatory variables.
In this section we focus on the first two of these, briefly discuss the third, and
address variable selection methods in Section 6.4. We begin by outlining the
extensions of Sections 6.2.1-6.2.4.


6.3.1 Bootstrapping the least squares fit


The ordinary least squares estimates of β for model (6.23) based on observed
response vector y are
$$ \hat\beta = (X^TX)^{-1}X^Ty, $$
and corresponding fitted values are μ̂ = Hy, where H = X(XᵀX)⁻¹Xᵀ is the
hat matrix, whose diagonal elements h_{jj}, again denoted by h_j for simplicity,
are the leverage values. The raw residuals are e = (I − H)y.
Under homoscedasticity the standard formula for the estimated variance of
β̂ is
$$ \widehat{\mathrm{var}}(\hat\beta) = s^2(X^TX)^{-1}, \qquad (6.24) $$
with s² equal to the residual mean square (n − p − 1)⁻¹eᵀe. The empirical
influence values for ordinary least squares estimates are
$$ l_j = n(X^TX)^{-1}x_je_j, \qquad (6.25) $$
which give rise to the robust estimate of var(β̂),
$$ v_L = (X^TX)^{-1}\Bigl(\sum_j x_jx_j^Te_j^2\Bigr)(X^TX)^{-1}; \qquad (6.26) $$
see Problem 6.1. These generalize equations (6.13) and (6.14). The variance
approximation is improved by using the modified residuals
$$ r_j = \frac{e_j}{(1 - h_j)^{1/2}} $$
in place of the e_j, and then v_L generalizes (6.17).
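As a sketch, the sandwich form (6.26) with modified residuals substituted can be
computed as follows, where X is an n × (p + 1) model matrix including the intercept
column and y is the response vector (names assumed for illustration):

robust.var <- function(X, y)
{ XtXinv <- solve(crossprod(X))                 # (X^T X)^{-1}
  e <- y - X %*% (XtXinv %*% crossprod(X, y))   # raw residuals
  h <- rowSums((X %*% XtXinv) * X)              # leverages h_j
  r <- c(e)/sqrt(1 - h)                         # modified residuals
  XtXinv %*% crossprod(X * r) %*% XtXinv }      # sandwich estimate (6.26)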


B ootstrap algorithm s generalize those in Sections 6.2.3-6.2.4. T h at is, modelbased resam pling generates d a ta according to
Y] = x J P + E p
where the s' are random ly sam pled from the modified residuals n , . . . , rn,
or their centred co u n terp arts
r. Case resam pling operates by random ly
resam pling cases from the data. Pros and cons o f the two m ethods are the
sam e as before, provided p is small relative to n and the design is far from
being singular. T he situation where p is large requires special attention.
Large p
Difficulty can arise w ith b o th m odel-based resam pling and case resam pling if
p is very large relative to n. The following theoretical exam ple illustrates an
extrem e version o f the problem .


Example 6.5 (One-way model) Consider the regression model that corresponds
to m independent samples each of size two. If the regression parameters
β₁, ..., β_m are the means of the populations sampled, then we omit the intercept
term from the model, and the design matrix has p = m columns and n = 2m
rows, with dummy explanatory variables x_{2i−1,i} = x_{2i,i} = 1 and x_{j,i} = 0 otherwise,
i = 1, ..., p. That is,
$$ X = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \\ 0 & 0 & \cdots & 1 \end{pmatrix}. $$
For this model
$$ \hat\beta_i = \tfrac12(y_{2i} + y_{2i-1}), \qquad i = 1, \ldots, p, $$
and
$$ e_j = (-1)^j\,\tfrac12(y_{2i} - y_{2i-1}), \qquad h_j = \tfrac12, \qquad j = 2i-1,\ 2i, \quad i = 1, \ldots, p. $$
The EDF of the residuals, modified or not, could be very unlike the true error
distribution: for example, the EDF will always be symmetric.
If the random errors are homoscedastic then the model-based bootstrap
will give consistent estimates of bias and standard error for all regression
coefficients. However, the bootstrap distributions must be symmetric, and so
may be no better than normal approximations if the true random errors are
skewed. There appears to be no remedy for this. The problem is not so serious
for contrasts among the β_i. For example, if θ = β₁ − β₂ then it is easy to
see that θ̂ has a symmetric distribution, as does θ̂*. The kurtosis is, however,
different for θ̂ and θ̂*; see Problem 6.10.
Case resampling will not work, because in those samples where both y_{2i−1}
and y_{2i} are absent β_i is inestimable: the resample design is singular. The
chance of this is 0.48 for m = 5, increasing to 0.96 for m = 20. This can be fixed
by omitting all bootstrap samples where f_{2i−1}* + f_{2i}* = 0 for any i, where f_j*
is the frequency of case j in the resample. The resulting
bootstrap variance for β̂_i* consistently overestimates by a factor of about 1.3.
Further details are given in Problem 6.9.
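The probabilities 0.48 and 0.96 quoted above are easily checked by simulation; a
small sketch under the stated design is:

pair.lost <- function(m, nsim=10000)
{ n <- 2*m
  lost <- logical(nsim)
  for (s in 1:nsim) {
    f <- tabulate(sample(n, n, replace=T), nbins=n)  # resample frequencies
    # pair i is lost if cases 2i-1 and 2i both have frequency zero
    lost[s] <- any(f[seq(1, n, by=2)] + f[seq(2, n, by=2)] == 0)
  }
  mean(lost) }
# pair.lost(5) is about 0.48; pair.lost(20) about 0.96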

The implication for more general designs is that difficulties will arise with
combinations cᵀβ̂ where c is in the subspace spanned by those eigenvectors of
XᵀX corresponding to small eigenvalues. First, model-based resampling will
give adequate results for standard error calculations, but bootstrap distributions
may not improve on normal approximations in calculating confidence
limits for the β_i, or for prediction. Secondly, unconstrained case resampling


Table 6.3 Cement data (Woods, Steinour and Starke, 1932). The response y is the heat (calories per gram of cement) evolved while samples of cement set. The explanatory variables are percentages by weight of four constituents, tricalcium aluminate x₁, tricalcium silicate x₂, tetracalcium alumino ferrite x₃ and dicalcium silicate x₄.

        x₁    x₂    x₃    x₄      y
 1       7    26     6    60    78.5
 2       1    29    15    52    74.3
 3      11    56     8    20   104.3
 4      11    31     8    47    87.6
 5       7    52     6    33    95.9
 6      11    55     9    22   109.2
 7       3    71    17     6   102.7
 8       1    31    22    44    72.5
 9       2    54    18    22    93.1
10      21    47     4    26   115.9
11       1    40    23    34    83.8
12      11    66     9    12   113.3
13      10    68     8    12   109.4

may induce near-collinearity in the design matrix X*, or equivalently near-singularity
in X*ᵀX*, and hence produce grossly inflated bootstrap estimates
of some standard errors. One solution would be to reject simulated samples
where the smallest eigenvalue ℓ₁* of X*ᵀX* is lower than a threshold just below
the smallest eigenvalue ℓ₁ of XᵀX. An alternative solution, more in line with
the general thinking that analysis should be conditioned on X, is to use only
those simulated samples corresponding to the middle half of the values of ℓ₁*.
This probably represents the best strategy for getting good confidence limits
which are also robust to error heteroscedasticity. The difficulty may be avoided
by an appropriate use of principal component regression.
Example 6.6 (Cement data) The data in Table 6.3 are classic in the regression literature as an example of near-collinearity. The four covariates are percentages of constituents which sum to nearly 100: the smallest eigenvalue of X^T X is l_1 = 0.0012, corresponding to eigenvector (1, 0.01, 0.01, 0.01, 0.01).

Theoretical and bootstrap standard errors for coefficients are given in Table 6.4. For error resampling the results agree closely with theory, as expected. The bootstrap distributions of the β̂*_j are very normal-looking: the hat matrix H is such that the modified residuals r_j would look normal even for very skewed errors ε_j.

Case resampling gives much higher standard errors for coefficients, and the bootstrap distributions are visibly skewed, with several outliers. Figure 6.7 shows scatter plots of two bootstrap coefficients versus the smallest eigenvalue l*_1 of X*^T X*; plots for the other two coefficients are very similar. The variability of the β̂*_j increases substantially for small values of l*_1, whose reciprocal ranges from about one-third to 100 times the reciprocal of l_1. Taking only those bootstrap samples which give the middle 500 values of l*_1 (which are between 0.0005 and 0.0012)

Table 6.4 Standard errors of linear regression coefficients for the cement data. Theoretical and error resampling assume homoscedasticity. Resampling results use R = 999 samples; the last two rows use only those samples with the middle 500 and the largest 800 values of l*_1.

                                   β0      β1      β2      β3      β4
  Normal-theory                  70.1    0.74    0.72    0.75    0.71
  Error resampling, R = 999      66.3    0.70    0.69    0.72    0.67
  Case resampling, all R = 999  108.5    1.13    1.12    1.18    1.11
  Case resampling, middle 500    68.4    0.76    0.71    0.78    0.69
  Case resampling, largest 800   67.3    0.77    0.69    0.78    0.68

[Figure 6.7 Bootstrap regression coefficients and fit, versus smallest eigenvalue l*_1 (×10^-5) of X*^T X* for R = 999 resamples of cases from the cement data. The vertical line is the smallest eigenvalue of X^T X, and the horizontal lines show the original coefficients plus or minus two standard errors.]

gives more reasonable standard errors, as seen in the penultimate row of Table 6.4. The last row, corresponding to dropping the smallest 200 values of l*_1, gives very similar results.
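The screening strategy is straightforward to implement. The following R sketch applies it to the cement data, assuming a data frame cement with columns y, x1, ..., x4 (our names, for illustration); resamples with a near-singular design get a very small l*_1 and are removed by the middle-half filter:

  R <- 999
  X <- model.matrix(~ x1 + x2 + x3 + x4, data = cement)
  sims <- replicate(R, {
    i <- sample(nrow(cement), replace = TRUE)
    lstar <- min(eigen(crossprod(X[i, ]), symmetric = TRUE)$values)
    c(lstar, coef(lm(y ~ x1 + x2 + x3 + x4, data = cement[i, ])))
  })
  keep <- sims[1, ] >= quantile(sims[1, ], 0.25) &
          sims[1, ] <= quantile(sims[1, ], 0.75)
  apply(sims[-1, keep], 1, sd, na.rm = TRUE)  # standard errors, middle half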

Weighted least squares

The general discussion extends in a fairly obvious way to weighted least squares estimation, just as in Section 6.2.6 for the case p = 1. Suppose that var(ε) = κ W^{-1}, where W is the diagonal matrix of known case weights w_j. Then the weighted least squares estimates are

    β̂ = (X^T W X)^{-1} X^T W y,    (6.27)

the fitted values are μ̂ = X β̂, and the residual vector is e = (I - H)y, where now the hat matrix H is defined by

    H = X (X^T W X)^{-1} X^T W,    (6.28)

Note that H is not symmetric in general. Some authors prefer to work with the symmetric matrix X'(X'^T X')^{-1} X'^T, where X' = W^{1/2} X.


whose diagonal elements are the leverage values h_j. The residual vector e has variance var(e) = κ (I - H) W^{-1}, whose jth diagonal element is κ(1 - h_j) w_j^{-1}. So the modified residual is now

    r_j = w_j^{1/2} (y_j - x_j^T β̂) / (1 - h_j)^{1/2}.    (6.29)

Model-based resampling is defined by

    y*_j = x_j^T β̂ + w_j^{-1/2} ε*_j,

where ε*_j is randomly sampled from the centred residuals r_1 - r̄, ..., r_n - r̄. It is not necessary to estimate κ to apply this algorithm, but if an estimate were required it would be κ̂ = (n - p - 1)^{-1} y^T W (I - H) y.

An important modification of case resampling is that each case must now include its weight w in addition to the response y and explanatory variables x.
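As a concrete illustration, here is a minimal R sketch of model-based resampling for weighted least squares; the function and argument names (response y, design matrix X including the intercept column, known weights w) are ours, not from the text:

  wls.boot <- function(y, X, w, R = 999) {
    Xw <- sqrt(w) * X                                 # W^{1/2} X
    beta <- qr.coef(qr(Xw), sqrt(w) * y)              # weighted LS estimates (6.27)
    e <- drop(y - X %*% beta)                         # raw residuals
    h <- rowSums((Xw %*% solve(crossprod(Xw))) * Xw)  # leverages h_j
    r <- sqrt(w) * e / sqrt(1 - h)                    # modified residuals (6.29)
    r <- r - mean(r)                                  # centre before resampling
    replicate(R, {
      ystar <- drop(X %*% beta) + sample(r, length(y), replace = TRUE) / sqrt(w)
      qr.coef(qr(Xw), sqrt(w) * ystar)                # bootstrap coefficients
    })
  }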

6.3.2 Significance tests

Significance tests for the single covariate in simple linear regression were described in Section 6.2.5. Among those tests, which should all behave similarly, are the exact permutation test and a related bootstrap test. Here we look at the more usual practical problem, testing for the effect of one or a subset of several covariates. The tests are based on least squares estimates.

Suppose that the linear regression model is partitioned as

    Y = X β + ε = X_0 α + X_1 γ + ε,

where γ is a vector and we wish to test H_0 : γ = 0. Initially we assume homoscedastic errors. It would appear that the sufficiency argument which motivates the single-variable permutation test, and makes it exact, no longer applies. But there is a natural extension of that permutation test, and its motivation is clear from the development of bootstrap tests. The basic idea is to subtract out the linear effect of X_0 from both y and X_1, and then to apply the test described in Section 6.2.5 for simple linear regression.

The first step is to fit the null model, that is

    μ̂_0 = X_0 α̂_0,    α̂_0 = (X_0^T X_0)^{-1} X_0^T y.

We shall also need the residuals from this fit, which are e_0 = (I - H_0)y with H_0 = X_0 (X_0^T X_0)^{-1} X_0^T. The test statistic T will be based on the least squares estimate γ̂ of γ in the full model, which can be expressed as

    γ̂ = (X_{1.0}^T X_{1.0})^{-1} X_{1.0}^T e_0,

with X_{1.0} = (I - H_0) X_1. The extension of the earlier permutation test is


equivalent to applying the permutation test to responses e_0 and explanatory variables X_{1.0}.

In the permutation-type test and its bootstrap analogue, we simulate data from the null model, assuming homoscedasticity; that is,

    y* = μ̂_0 + ε*_0,

where the components of the simulated error vector ε*_0 are sampled without (permutation) or with (bootstrap) replacement from the n residuals in e_0. Note that this makes use of the assumed homoscedasticity of errors. Each case keeps its original covariate values, which is to say that X* = X. With the simulated data we regress y* on X to calculate γ̂* and hence the simulated test statistic t*, as described below. When this is repeated R times, the bootstrap P-value is

    (#{t*_r >= t} + 1) / (R + 1).

The permutation version of the test is not exact when nuisance covariates X_0 are present, but empirical evidence suggests that it is close to exact.
Scalar γ

What should t be? For testing a single component, so that γ is a scalar, suppose that the alternative hypothesis is one-sided, say H_A : γ > 0. Then we could take t to be γ̂ itself, or possibly a studentized form such as z_0 = γ̂ / v_0^{1/2}, where v_0 is an appropriate estimate of the variance of γ̂. If we compute the standard error using the null model residual sum of squares, then

    v_0 = (n - q)^{-1} e_0^T e_0 (X_{1.0}^T X_{1.0})^{-1},

where q is the rank of X_0. The same formula is applied to every simulated sample to get v*_0 and hence z* = γ̂*/v*_0^{1/2}.

When there are no nuisance covariates X_0, v*_0 = v_0 in the permutation test, and studentizing has no effect: the same is true if the non-null standard error is used. Empirical evidence suggests that this is approximately true when X_0 is present; see the example below. Studentizing is necessary if modified residuals are used, with standardization based on the null model hat matrix.
An alternative bootstrap test can be developed in terms of a pivot, as described for single-variable regression in Section 6.2.5. Here the idea is to treat Z = (γ̂ - γ)/V^{1/2} as a pivot, with V^{1/2} an appropriate standard error. Bootstrap simulation under the full fitted model then produces the R replicates of z* which we use to calculate the P-value. To elaborate, we first fit the full model μ̂ = X β̂ by least squares and calculate the residuals e = y - μ̂. Still assuming homoscedasticity, the standard error for γ̂ is calculated using the residual mean square: a simple formula is

    v = (n - p - 1)^{-1} e^T e (X_{1.0}^T X_{1.0})^{-1}.


Next, datasets are simulated using the model

    y* = X β̂ + ε*,    X* = X,

where the n errors in ε* are sampled independently with replacement from the residuals e, or modified versions of these. The full regression of y* on X is then fitted, from which we obtain γ̂* and its estimated variance v*, these being used to calculate z* = (γ̂* - γ̂)/v*^{1/2}. From R repeats of this simulation we then have the one-sided P-value

    p = (#{z*_r >= z_0} + 1) / (R + 1),

where z_0 = γ̂/v^{1/2}. Although here we use p to denote a P-value as well as the number of covariates, no confusion should arise.

This test procedure is the same as calculating a (1 - α) lower confidence limit for γ by the studentized bootstrap method, and inferring p < α if the lower limit is above zero. The corresponding two-sided P-value is less than 2α if the equi-tailed (1 - 2α) studentized bootstrap confidence interval does not include zero.
One can guard against the effects of heteroscedastic errors by using case resampling to do the simulation, and by using a robust standard error for γ̂, as described in Section 6.2.5. Also the same basic procedure can be applied to estimates other than least squares.
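To fix ideas, here is a rough R sketch of the pivot test for a single coefficient, resampling raw full-model residuals (centred modified residuals could be used instead); the function and argument names are ours:

  boot.coef.test <- function(formula, data, coef.name, R = 999) {
    fit <- lm(formula, data = data)
    z0 <- coef(fit)[coef.name] / sqrt(vcov(fit)[coef.name, coef.name])
    res <- residuals(fit)
    zstar <- replicate(R, {
      data$ystar <- fitted(fit) + sample(res, replace = TRUE)
      fstar <- lm(update(formula, ystar ~ .), data = data)
      (coef(fstar)[coef.name] - coef(fit)[coef.name]) /
        sqrt(vcov(fstar)[coef.name, coef.name])
    })
    (1 + sum(zstar >= z0)) / (R + 1)   # one-sided bootstrap P-value
  }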
Example 6.7 (Rock data) The data in Table 6.5 are measurements on four cross-sections of each of 12 oil-bearing rocks, taken from two sites. The aim is to predict permeability from the other three measurements, which result from a complex image-analysis procedure. In all regression models we use the logarithm of permeability as response y. The question we focus on here is whether the coefficient of shape is significant in a multiple linear regression on all three variables.

The problem is nonstandard in that there are four replicates of the explanatory variables for each response value. If we fit a linear regression to all 48 cases treating them as independent, strong correlation among the four residuals for each core sample is evident: see Figure 6.8, in which the residuals have unit variance.

Under a plausible model which accounts for this, which we discuss in Example 6.9, the appropriate linear regression for testing purposes uses core averages of the explanatory variables. Thus if we represent the data as responses y_j and replicate vectors of the explanatory variables x_jk, k = 1, 2, 3, 4, then the model for our analysis is

    y_j = x̄_j^T β + ε_j,

where the ε_j are independent. A summary of the least squares regression


Table 6.5 Rock data (Katz, 1995; Venables and Ripley, 1994, p. 251). These are measurements on four cross-sections of 12 core samples, with permeability (milli-Darcies), area (of pore space, in pixels out of 256 × 256), perimeter (pixels), and shape (perimeter/area^{1/2}).

  case    area   perimeter   shape   permeability
    1     4990     2792      0.09        6.3
    2     7002     3893      0.15        6.3
    3     7558     3931      0.18        6.3
    4     7352     3869      0.12        6.3
    5     7943     3949      0.12       17.1
    6     7979     4010      0.17       17.1
    7     9333     4346      0.19       17.1
    8     8209     4345      0.16       17.1
    9     8393     3682      0.20      119.0
   10     6425     3099      0.16      119.0
   11     9364     4480      0.15      119.0
   12     8624     3986      0.15      119.0
   13    10651     4037      0.23       82.4
   14     8868     3518      0.23       82.4
   15     9417     3999      0.17       82.4
   16     8874     3629      0.15       82.4
   17    10962     4609      0.20       58.6
   18    10743     4788      0.26       58.6
   19    11878     4864      0.20       58.6
   20     9867     4479      0.14       58.6
   21     7838     3429      0.11      142.0
   22    11876     4353      0.29      142.0
   23    12212     4698      0.24      142.0
   24     8233     3518      0.16      142.0
   25     6360     1977      0.28      740.0
   26     4193     1379      0.18      740.0
   27     7416     1916      0.19      740.0
   28     5246     1585      0.13      740.0
   29     6509     1851      0.23      890.0
   30     4895     1240      0.34      890.0
   31     6775     1728      0.31      890.0
   32     7894     1461      0.28      890.0
   33     5980     1427      0.20      950.0
   34     5318      991      0.33      950.0
   35     7392     1351      0.15      950.0
   36     7894     1461      0.28      950.0
   37     3469     1377      0.18      100.0
   38     1468      476      0.44      100.0
   39     3524     1189      0.16      100.0
   40     5267     1645      0.25      100.0
   41     5048      942      0.33     1300.0
   42     1016      309      0.23     1300.0
   43     5605     1146      0.46     1300.0
   44     8793     2280      0.42     1300.0
   45     3475     1174      0.20      580.0
   46     1651      598      0.26      580.0
   47     5514     1456      0.18      580.0
   48     9718     1486      0.20      580.0

[Figure 6.8 Rock data: standardized residuals from linear regression of all 48 cases, plotted by core number, showing strong intra-core correlations.]

Table 6.6 Least squares results for multiple linear regression of rock data, all covariates included and core means used as response variable.

  Variable          Coefficient      SE    t-value
  intercept             3.465     1.391      2.49
  area (×10^-3)         0.864     0.211      4.09
  peri (×10^-3)        -1.990     0.400     -4.98
  shape                 3.518     4.838      0.73

is shown in Table 6.6. There is evidence of mild non-normality, but not of heteroscedasticity of errors.

Figure 6.9 shows results from both the null model resampling method and the full model pivot resampling method, in both cases using resampling of errors. The observed value of z is z_0 = 0.73, for which the one-sided P-value is 0.234 under the first method, and 0.239 under the second method. Thus shape should not be included in the linear regression, assuming that its effect would be linear. Note that R = 99 simulations would have been sufficient here.

Vector γ

For testing several components simultaneously, we take the test statistic to be the quadratic form

    T = γ̂^T (X_{1.0}^T X_{1.0}) γ̂,

[Figure 6.9 Resampling distributions of the standardized test statistic for variable shape. Left: resampling z* under the null model, R = 999. Right: resampling the pivot under the full model, R = 999.]

or equivalently the difference in residual sums of squares for the null and full model least squares fits. This can be standardized to

    (n - q)/q × (RSS_0 - RSS)/RSS_0,

where RSS_0 and RSS denote residual sums of squares under the null model and full model respectively.

We can apply the pivot method with full model simulation here also, using Z = (γ̂ - γ)^T (X_{1.0}^T X_{1.0}) (γ̂ - γ)/S^2, with S^2 the residual mean square. The test statistic value is z_0 = γ̂^T (X_{1.0}^T X_{1.0}) γ̂/s^2, for which the P-value is given by

    (#{z*_r >= z_0} + 1) / (R + 1).

This would be equivalent to rejecting H_0 at level α if the 1 - α confidence set for γ does not include the point γ = 0. Again, case resampling would provide protection against heteroscedasticity: z would then require a robust standard error.
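A rough R sketch of this quadratic-form pivot test follows; the names are ours, with X0 (including the intercept column) and X1 the two blocks of the design matrix:

  subset.pivot.test <- function(y, X0, X1, R = 999) {
    n <- length(y); X <- cbind(X0, X1)
    H0 <- X0 %*% solve(crossprod(X0), t(X0))
    X10 <- X1 - H0 %*% X1                        # (I - H0) X1
    A <- crossprod(X10)
    gam <- solve(A, crossprod(X10, y))           # gamma-hat
    e <- lsfit(X, y, intercept = FALSE)$residuals
    s2 <- sum(e^2) / (n - ncol(X))
    z0 <- drop(t(gam) %*% A %*% gam) / s2
    fits <- y - e                                # full-model fitted values
    zstar <- replicate(R, {
      ys <- fits + sample(e, n, replace = TRUE)
      gs <- solve(A, crossprod(X10, ys)) - gam
      es <- lsfit(X, ys, intercept = FALSE)$residuals
      drop(t(gs) %*% A %*% gs) / (sum(es^2) / (n - ncol(X)))
    })
    (1 + sum(zstar >= z0)) / (R + 1)             # bootstrap P-value
  }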

6.3.3 Prediction

A fitted linear regression is often used for prediction of a new individual response Y_+ when the explanatory variable vector is equal to x_+. Then we shall want to supplement our predicted value by a prediction interval. Confidence limits for the mean response x_+^T β can be found using the same resampling as is used to get confidence limits for individual coefficients, but limits for the response Y_+ itself (usually called prediction limits) require additional resampling to simulate the variation of Y_+ about x_+^T β̂.


The quantity to be predicted is Y_+ = x_+^T β + ε_+, say, and the point predictor is Ŷ_+ = x_+^T β̂. The random error ε_+ is assumed to be independent of the random errors ε_1, ..., ε_n in the observed responses, and for simplicity we assume that they all come from the same distribution: in particular the errors have equal variances.

To assess the accuracy of the point predictor, we can estimate the distribution of the prediction error

    δ = Ŷ_+ - Y_+ = x_+^T β̂ - (x_+^T β + ε_+)

by the distribution of

    δ* = x_+^T β̂* - (x_+^T β̂ + ε*_+),    (6.30)

where ε*_+ is sampled from the estimated error distribution and β̂* is a simulated vector of estimates from the model-based resampling algorithm. This assumes homoscedasticity of random error. Unconditional properties of the prediction error correspond to averaging over the distributions of both ε_+ and the estimates β̂, which we do in the simulation by repeating (6.30) for each set of values of β̂*. Having obtained the modified residuals r_1, ..., r_n from the data fit, the algorithm to generate R sets each with M predictions is as follows.
Algorithm 6.4 (Prediction in linear regression)

For r = 1, ..., R:
1  simulate responses y* according to (6.16);
2  obtain least squares estimates β̂*_r = (X^T X)^{-1} X^T y*; then
3  for m = 1, ..., M,
   (a) sample ε*_{+m} from r_1 - r̄, ..., r_n - r̄, and
   (b) compute the prediction error δ*_{rm} = x_+^T β̂*_r - (x_+^T β̂ + ε*_{+m}).

It is acceptable to use M = 1 here: the key point is that RM be large enough to estimate the required properties of δ*. Note that if predictions at several values of x_+ are required, then only the third step of the algorithm needs to be repeated for each x_+.

The mean squared prediction error is estimated by the simulation mean squared error (RM)^{-1} Σ_{r,m} (δ*_{rm} - δ̄*)^2. More useful would be a (1 - 2α) prediction interval for Y_+, for which we need the α and (1 - α) quantiles, a_α and a_{1-α} say, of the prediction error δ. Then the prediction interval would have limits

    ŷ_+ - a_{1-α},    ŷ_+ - a_α.

The exact, but unknown, quantiles are estimated by empirical quantiles of the pooled δ*s, whose ordered values we denote by δ*_{(1)} <= ... <= δ*_{(RM)}. The bootstrap prediction limits are

    ŷ_+ - δ*_{((RM+1)(1-α))},    ŷ_+ - δ*_{((RM+1)α)},    (6.31)

where ŷ_+ = x_+^T β̂. This is analogous to the basic bootstrap method for confidence intervals (Section 5.2).

A somewhat better approach, which mimics the standard normal-theory analysis, is to work with the studentized prediction error

    Z = δ/S,

where S is the square root of the residual mean square for the linear regression. The corresponding simulated values are z*_{rm} = δ*_{rm}/s*_r, with s*_r calculated in step 2 of Algorithm 6.4. The α and (1 - α) quantiles of Z are estimated by z*_{((RM+1)α)} and z*_{((RM+1)(1-α))} respectively, where z*_{(1)} <= ... <= z*_{(RM)} are the ordered values of all RM z*s. Then the studentized bootstrap prediction interval for Y_+ is

    ŷ_+ - s z*_{((RM+1)(1-α))},    ŷ_+ - s z*_{((RM+1)α)}.    (6.32)
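The following R sketch implements Algorithm 6.4 with M = 1 and the studentized interval (6.32); all names are ours, and empirical quantiles stand in for the exact order statistics:

  boot.predict <- function(X, y, xplus, R = 999, alpha = 0.025) {
    n <- nrow(X); qr.X <- qr(X)
    beta <- qr.coef(qr.X, y)
    h <- rowSums((X %*% solve(crossprod(X))) * X)       # leverages
    r <- drop(y - X %*% beta) / sqrt(1 - h)             # modified residuals
    r <- r - mean(r)
    s <- sqrt(sum((y - X %*% beta)^2) / (n - ncol(X)))
    zstar <- replicate(R, {
      ystar <- drop(X %*% beta) + sample(r, n, replace = TRUE)
      bstar <- qr.coef(qr.X, ystar)
      sstar <- sqrt(sum((ystar - X %*% bstar)^2) / (n - ncol(X)))
      epl <- sample(r, 1)                               # M = 1 new error
      (sum(xplus * bstar) - (sum(xplus * beta) + epl)) / sstar
    })
    yhat <- sum(xplus * beta)
    zq <- unname(quantile(zstar, c(1 - alpha, alpha)))
    c(lower = yhat - s * zq[1], upper = yhat - s * zq[2])
  }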

Example 6.8 (Nuclear power stations) Table 6.7 contains data on the cost of 32 light water reactors. The cost (in dollars ×10^-6, adjusted to a 1976 base) is the response of interest, and the other quantities in the table are explanatory variables; they are described in detail in the data source.

We take log(cost) as the working response y, and fit a linear model with covariates PT, CT, NE, date, log(capacity) and log(N). The dummy variable PT indicates six plants for which there were partial turnkey guarantees, and it is possible that some subsidies may be hidden in their costs.

Suppose that we wish to obtain 95% prediction intervals for the cost of a station like case 32 above, except that its value for date is 73.00. The predicted value of log(cost) from the regression is x_+^T β̂ = 6.72, and the residual standard error from the regression is s = 0.159. With α = 0.025 and a simulation with R = 999 and M = 1, (RM + 1)α = 25 and (RM + 1)(1 - α) = 975. The values of δ*_{(25)} and δ*_{(975)} are -0.539 and 0.551, so the 95% limits (6.31) are 6.18 and 7.27, which are slightly wider than the normal-theory limits of 6.25 and 7.19. For the limits (6.32) we get z*_{(25)} = -3.680 and z*_{(975)} = 3.512, so the limits for log(cost) are 6.13 and 7.28. The corresponding prediction interval for cost is [exp(6.13), exp(7.28)] = [459.4, 1451].

The usual caveats apply about extrapolating a trend outside the range of the data, and we should use these intervals with great caution.

The next example involves an unusual data structure, where there is hierarchical variation in the covariates.

It is unnecessary to standardize also by the square root of 1 + x_+^T (X^T X)^{-1} x_+, which would make the variance of Z close to 1, unless bootstrap results for different x_+ are pooled.

Table 6.7 Data on light water reactors constructed in the USA (Cox and Snell, 1981, p. 81).

  case    cost   date   T1  T2  capacity  PR  NE  CT  BW   N  PT
    1   460.05  68.58   14  46      687    0   1   0   0  14   0
    2   452.99  67.33   10  73     1065    0   0   1   0   1   0
    3   443.22  67.33   10  85     1065    1   0   1   0   1   0
    4   652.32  68.00   11  67     1065    0   1   1   0  12   0
    5   642.23  68.00   11  78     1065    1   1   1   0  12   0
    6   345.39  67.92   13  51      514    0   1   1   0   3   0
    7   272.37  68.17   12  50      822    0   0   0   0   5   0
    8   317.21  68.42   14  59      457    0   0   0   0   1   0
    9   457.12  68.42   15  55      822    1   0   0   0   5   0
   10   690.19  68.33   12  71      792    0   1   1   1   2   0
   11   350.63  68.58   12  64      560    0   0   0   0   3   0
   12   402.59  68.75   13  47      790    0   1   0   0   6   0
   13   412.18  68.42   15  62      530    0   0   1   0   2   0
   14   495.58  68.92   17  52     1050    0   0   0   0   7   0
   15   394.36  68.92   13  65      850    0   0   0   1  16   0
   16   423.32  68.42   11  67      778    0   0   0   0   3   0
   17   712.27  69.50   18  60      845    0   1   0   0  17   0
   18   289.66  68.42   15  76      530    1   0   1   0   2   0
   19   881.24  69.17   15  67     1090    0   0   0   0   1   0
   20   490.88  68.92   16  59     1050    1   0   0   0   8   0
   21   567.79  68.75   11  70      913    0   0   1   1  15   0
   22   665.99  70.92   22  57      828    1   1   0   0  20   0
   23   621.45  69.67   16  59      786    0   0   1   0  18   0
   24   608.80  70.08   19  58      821    1   0   0   0   3   0
   25   473.64  70.42   19  44      538    0   0   1   0  19   0
   26   697.14  71.08   20  57     1130    0   0   1   0  21   0
   27   207.51  67.25   13  63      745    0   0   0   0   8   1
   28   288.48  67.17    9  48      821    0   0   1   0   7   1
   29   284.88  67.83   12  63      886    0   0   0   1  11   1
   30   280.36  67.83   12  71      886    1   0   0   1  11   1
   31   217.38  67.25   13  72      745    1   0   0   0   8   1
   32   270.71  67.83    7  80      886    1   0   0   1  11   1

Example 6.9 (Rock data) For the data discussed in Example 6.7, one objective is to see how well one can predict permeability from a single replicate of the three image-based measurements, as opposed to the four replicates obtained in the study. The previous analysis suggested that variable shape did not contribute usefully to a linear regression relationship for the logarithm of permeability, and this is confirmed by cross-validation analysis of prediction errors (Section 6.4.1). So here we concentrate on predicting permeability from the linear regression of y = log(permeability) on area and peri.

In Example 6.7 we commented on the strong intra-core correlation among the explanatory variables, and that must be taken into account here if we are to correctly analyse prediction of core permeability from single measurements of area and peri. One way to do this is to think of the four replicate values of u = (area, peri)^T as unbiased estimates of an underlying core variable ξ, on which y has a linear regression. Then the data are modelled by

    y_j = α + ξ_j^T γ + η_j,    u_{jk} = ξ_j + δ_{jk},    (6.33)


Table 6.8 Rock data: fits of linear regression models with K replicate values of explanatory variables area and peri. Normal-theory analysis is via model (6.33).

                                        Intercept   area (×10^-4)   peri (×10^-4)
  K = 1  Direct regression on x_jk's       5.746        5.144          -16.16
         Normal-theory fit                 5.694        5.300          -16.39
  K = 4  Regression on x̄_j's              4.295        9.257          -21.78
         Normal-theory fit                 4.295        9.257          -21.78

for j = 1, ..., 12 and k = 1, ..., K, where the η_j and δ_jk are uncorrelated errors with zero means, and for our data K = 4.

Under normality assumptions on the errors and the ξ_j, the linear regression of y_j on u_{j1}, ..., u_{jK} depends only on the core average ū_j = K^{-1} Σ_{k=1}^K u_{jk}. The regression coefficients depend strongly on K. For prediction from a single measurement u_+ we need the model with K = 1, and for resampling analysis we shall need the model with K = 4. These two versions of the observation regression model we write as

    y_j = x_j^T β^{(K)} + ε_j^{(K)} = α^{(K)} + ū_j^T γ^{(K)} + ε_j^{(K)},    (6.34)

for K = 1 and 4; the parameters α and γ in (6.33) correspond to α^{(K)} and γ^{(K)} when K = ∞. Fortunately it turns out that both observation models can be fitted easily: for K = 4 we regress the y_j on the core averages ū_j; and for K = 1 we fit the linear regression with all 48 individual cases as tabled, ignoring the intra-core correlation among the ε^{(1)}s, i.e. pretending that each y_j occurs four times independently. Table 6.8 shows the coefficients for both fits, and compares them to corresponding estimates based on exact normal-theory analysis.
Suppose, then, that we want to predict the new response y_+ given a single set of measurements u_+. If we define x_+^T = (1, u_+^T), then the point prediction is Ŷ_+ = x_+^T β̂^{(1)}, where β̂^{(1)} are the coefficients in the fit of model (6.34) with K = 1, shown in the first row of Table 6.8. The EDF of the 48 modified residuals from this fit estimates the marginal distribution of the ε^{(1)} in (6.34), and hence of the error ε_+ in

    Y_+ = x_+^T β^{(1)} + ε_+.

Our concern is with the prediction error

    δ = Ŷ_+ - Y_+ = x_+^T β̂^{(1)} - x_+^T β^{(1)} - ε_+,    (6.35)

whose distribution is to be estimated by resampling.


The question is how to do the resampling, given the presence of intra-core correlation. A resampled dataset must consist of 12 subsets, each with 4 replicates u*_{jk} and a single response y*_j, from which we shall fit β̂^{(1)*}. The prediction error (6.35) will then be simulated by

    δ* = Ŷ*_+ - Y*_+ = x_+^T (β̂^{(1)*} - β̂^{(1)}) - ε*_+,

where ε*_+ is sampled from the EDF of the 48 modified residuals as mentioned above. It remains to decide how to simulate the data from which we calculate β̂^{(1)*}.

Usually with error resampling we would fix the covariate values, so here we fix the 12 values of ū_j, which are surrogates for the ξ_j in model (6.33). Then we simulate responses from the fitted regression on these averages, and simulate the replicated measured covariates using an appropriate hierarchical-data algorithm. Specifically we take

    u*_{jk} = ū_j + d_{Jk},

where d_{jk} = u_{jk} - ū_j and J is randomly sampled from {1, 2, ..., 12}. Our justification for this, in terms of retaining intra-core correlation, is given by the discussion in Section 3.8. It is potentially important to build the variation of u into the analysis. Since ū*_j = ū_j, the resampled responses are defined by

    y*_j = x_j^T β̂^{(4)} + ε_j^{(4)*},

where the ε_j^{(4)*} are randomly sampled from the 12 mean-adjusted, modified residuals r_1^{(4)} - r̄^{(4)}, ..., r_12^{(4)} - r̄^{(4)} from the regression of the y_j on the ū_j. The estimates β̂^{(1)*} are now obtained by fitting the regression to the 48 simulated cases (u*_{jk}, y*_j), k = 1, ..., 4 and j = 1, ..., 12.
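A sketch of one step of this scheme in R, with illustrative names (ubar is the 12 × 2 matrix of core means of (area, peri), d the 12 × 4 × 2 array of within-core deviations, and fit4 the K = 4 regression of the y_j on the ū_j):

  one.resample <- function(ubar, d, fit4) {
    r4 <- residuals(fit4) / sqrt(1 - lm.influence(fit4)$hat)
    r4 <- r4 - mean(r4)                        # mean-adjusted modified residuals
    ystar <- fitted(fit4) + sample(r4, 12, replace = TRUE)
    J <- sample(12, 12, replace = TRUE)        # resample deviation sets by core
    ustar <- do.call(rbind, lapply(1:12, function(j)
      cbind(area = ubar[j, 1] + d[J[j], , 1],
            peri = ubar[j, 2] + d[J[j], , 2])))
    lm(rep(ystar, each = 4) ~ ustar)           # gives the coefficients beta(1)*
  }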
Figure 6.10 shows typical normal plots for the prediction error y_+ - ŷ*_+, these for x_+ = (1, 4000, 1000) and x_+ = (1, 10000, 4000), which are near the edge of the observed space, from R = 999 resamples and M = 1. The skewness of the prediction error is quite noticeable. The resampling standard deviations for prediction errors are 0.91 and 0.93, somewhat larger than the theoretical standard deviations 0.88 and 0.87 obtained by treating the 48 cases as independent.

To calculate 95% intervals we set α = 0.025, so that (RM + 1)α = 25 and (RM + 1)(1 - α) = 975. The simulation values δ*_{(25)} and δ*_{(975)} are -1.63 and 1.93 at x_+ = (1, 4000, 1000), and -1.57 and 2.19 at x_+ = (1, 10000, 4000). The corresponding point predictions are 6.19 and 4.42, so 95% prediction intervals are (4.26, 7.82) at x_+ = (1, 4000, 1000) and (2.23, 5.99) at x_+ = (1, 10000, 4000). These intervals differ markedly from those based on normal theory treating all 48 cases as independent, those being (4.44, 7.94) and (2.68, 6.17). Much of the difference is due to the skewness of the resampling distribution of prediction error.


[Figure 6.10 Rock data: normal plots of resampled prediction errors for x_+ = (1, 4000, 1000) (left panel) and x_+ = (1, 10000, 4000) (right panel), based on R = 999 and M = 1. Dotted lines correspond to theoretical means and standard deviations.]

6.4 Aggregate Prediction Error and Variable Selection

In Section 6.3.3 our discussion of prediction focused on individual cases, and particularly on intervals of uncertainty around point predictions. For some applications, however, we are interested in an aggregate measure of prediction error, such as average squared error or misclassification error, which summarizes accuracy of prediction across a range of values of the covariates, using a given regression model. Such a measure may be of interest in its own right, or as the basis for comparing alternative regression models. In the first part of this section we outline the main resampling methods for estimating aggregate prediction error, and in the second part we discuss the closely related problem of variable selection for linear regression models.

6.4.1 Aggregate prediction error

The least squares fit of the linear regression model (6.22) provides the least squares prediction rule ŷ_+ = x_+^T β̂ for predicting what a single response y_+ would be at value x_+ of the vector of covariates. What we want to know is how accurate this prediction rule will be for predicting data similar to those already observed. Suppose first that we measure accuracy of prediction by squared error (y_+ - ŷ_+)^2, and that our interest is in predictions for covariate values that exactly duplicate the data values x_1, ..., x_n. Then the aggregate prediction error is

    D = n^{-1} Σ_{j=1}^n E(Y_{+j} - x_j^T β̂)^2,

(Here X is the n × q matrix with rows x_1^T, ..., x_n^T, where q = p + 1 if there are p covariate terms and an intercept in the model.)

in which β̂ is fixed and the expectation is over Y_{+j} = x_j^T β + ε_{+j}. We cannot calculate D exactly, because the model parameters are unknown, so we must settle for an estimate, which in reality is an estimate of Δ = E(D), the average over all possible samples of size n. Our objective is to estimate D or Δ as accurately as possible.

As stated the problem is quite simple, at least under the ideal conditions where the linear model is correct and the error variance is constant, for then

    D = n^{-1} Σ_j var(Y_{+j}) + n^{-1} Σ_j (x_j^T β - x_j^T β̂)^2 = σ^2 + n^{-1} (β̂ - β)^T X^T X (β̂ - β),    (6.36)

whose expectation is

    Δ = σ^2 (1 + q n^{-1}),    (6.37)

where q = p + 1 is the number of regression coefficients. Since the residual mean square s^2 is an unbiased estimate of σ^2, we have the natural estimate

    Δ̂ = s^2 (1 + q n^{-1}).    (6.38)

However, this estimate is very specialized, in two ways. First, it assumes that the linear model is correct and that the error variance is constant, both unlikely to be exactly true in practice. Secondly, the estimate applies only to least squares prediction and the squared error measure of accuracy, whereas in practice we need to be able to deal with other measures of accuracy and other prediction rules, such as robust linear regression (Section 6.5) and linear classification, where y is binary (Section 7.2). There are no simple analogues of (6.38) to cover these situations, but resampling methods can be applied to all of them.

In order that our discussion apply as broadly as possible, we shall use general notation in which prediction error is measured by c(y_+, ŷ_+), typically an increasing function of |y_+ - ŷ_+|, and the prediction rule is ŷ_+ = μ(x_+, F̂), where the EDF F̂ represents the observed data. Usually μ(x_+, F̂) is an estimate of the mean response at x_+, a function of x_+^T β̂ with β̂ an estimate of β, and the form of this prediction rule is closely tied to the form of c(y_+, ŷ_+). We suppose that the data F̂ are sampled from distribution F, from which the cases to be predicted are also sampled. This implies that we are considering x_+ values similar to the data values x_1, ..., x_n. Prediction accuracy is measured by the aggregate prediction error

    D = D(F, F̂) = E_+ [c{Y_+, μ(X_+, F̂)} | F̂],    (6.39)

where E_+ emphasizes that we are averaging only over the distribution of (X_+, Y_+), with the data fixed. Because F is unknown, D cannot be calculated, and so we look for accurate methods of estimating it, or rather its expectation

    Δ = Δ(F) = E{D(F, F̂)},    (6.40)


the average prediction accuracy over all possible datasets of size n sampled from F.

The most direct approach to estimation of Δ is to apply the bootstrap substitution principle, that is, substituting the EDF F̂ for F in (6.40). However, there are other widely used resampling methods which also merit consideration, in part because they are easy to use, and in fact the best approach involves a combination of methods.

Apparent error

The simplest way to estimate D or Δ is to take the average prediction error when the prediction rule is applied to the same data that were used to fit it. This gives the apparent error, sometimes called the resubstitution error,

    Δ̂_app = D(F̂, F̂) = n^{-1} Σ_{j=1}^n c{y_j, μ(x_j, F̂)}.    (6.41)

This is not the same as the bootstrap estimate Δ(F̂), which we discuss later.

It is intuitively clear that Δ̂_app will tend to underestimate Δ, because the latter refers to prediction of new responses. The underestimation can be easily checked for least squares prediction with squared error, when Δ̂_app = n^{-1} RSS, the average squared residual. If the model is correct with homoscedastic random errors, then Δ̂_app has expectation σ^2 (1 - q n^{-1}), whereas from (6.37) we know that Δ = σ^2 (1 + q n^{-1}).

The difference between the true error and the apparent error is the excess error, D(F, F̂) - D(F̂, F̂), whose mean is the expected excess error,

    e(F) = E{D(F, F̂) - D(F̂, F̂)} = Δ(F) - E{D(F̂, F̂)},    (6.42)

where the expectation is taken over possible datasets F̂. For squared error and least squares prediction the results in the previous paragraph show that e(F) = 2 q n^{-1} σ^2. The quantity e(F) is akin to a bias and can be estimated by resampling, so the apparent error can be modified to a reasonable estimate, as we see below.
Cross-validation

The apparent error is downwardly biased because it averages errors of predictions for cases at zero distance from the data used to fit the prediction rule. Cross-validation estimates of aggregate error avoid this bias by separating the data used to form the prediction rule from the data used to assess the rule. The general paradigm is to split the dataset into a training set {(x_j, y_j) : j ∈ S_t} and a separate assessment set {(x_j, y_j) : j ∈ S_a}, represented by F̂_t and F̂_a, say. The linear regression predictor is fitted to F̂_t, used to predict the responses y_j for


j ∈ S_a, and then Δ is estimated by

    D(F̂_a, F̂_t) = n_a^{-1} Σ_{j∈S_a} c{y_j, μ(x_j, F̂_t)},    (6.43)

with n_a the size of S_a. There are several variations on this estimate, depending on the size of the training set, the manner of splitting the dataset, and the number of such splits.
The version of cross-validation that seems to come closest to actual use of our predictor is leave-one-out cross-validation. Here training sets of size n - 1 are taken, and all such sets are used, so we measure how well the prediction rule does when the value of each response is predicted from the rest of the data. If F̂_{-j} represents the n - 1 observations {(x_k, y_k), k ≠ j}, and if μ(x_j, F̂_{-j}) denotes the value predicted for y_j by the rule based on F̂_{-j}, then the cross-validation estimate of prediction error is

    Δ̂_CV = n^{-1} Σ_{j=1}^n c{y_j, μ(x_j, F̂_{-j})},    (6.44)

which is the average error when each observation is predicted from the rest of the sample.
In general (6.44) requires n fits of the model, but for least squares linear regression only one fit is required if we use the case-deletion result (Problem 6.2)

    β̂ - β̂_{-j} = (X^T X)^{-1} x_j (y_j - x_j^T β̂)/(1 - h_j),

where as usual h_j is the leverage for the jth case. For squared error in particular we then have

    Δ̂_CV = n^{-1} Σ_{j=1}^n (y_j - x_j^T β̂)^2 / (1 - h_j)^2.    (6.45)

From the nature of Δ̂_CV one would guess that this estimate has only a small bias, and this is so: assuming an expansion of the form Δ(F) = a_0 + a_1 n^{-1} + a_2 n^{-2} + ..., one can verify from (6.44) that E(Δ̂_CV) = a_0 + a_1 (n - 1)^{-1} + ..., which differs from Δ by terms of order n^{-2}, unlike the expectation of the apparent error, which differs by terms of order n^{-1}.
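In R the one-fit formula (6.45) is immediate for any least squares fit (a sketch, with our own function name):

  cv.loo <- function(fit) {                 # fit: an lm object
    e <- residuals(fit)
    h <- lm.influence(fit)$hat              # leverages h_j
    mean((e / (1 - h))^2)                   # estimate (6.45)
  }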
K-fold cross-validation

In general there is no reason that training sets should be of size n - 1. For certain methods of estimation the number n of fits required for Δ̂_CV could itself be a difficulty, although not for least squares, as we have seen in (6.45). There is also the possibility that the small perturbations in the fitted model when single observations are left out make Δ̂_CV too variable, if fitted values μ(x, F̂) do not depend smoothly on F̂ or if c(y_+, ŷ_+) is not continuous. These potential problems can be avoided to a large extent by leaving out groups of observations, rather than single observations. There is more than one way to do this.
One obvious implementation of group cross-validation is to repeat (6.43) for a series of R different splits into training and assessment sets, keeping the size of the assessment set fixed at n_a = m, say. Then, in a fairly obvious notation, the estimate of aggregate prediction error would be

    Δ̂_CV = R^{-1} Σ_{r=1}^R m^{-1} Σ_{j∈S_{a,r}} c{y_j, μ(x_j, F̂_{t,r})}.    (6.46)

In principle there are (n choose m) possible splits, possibly an extremely large number, but it should be adequate to take R in the range 100 to 1000. It would be in the spirit of resampling to make the splits at random. However, consideration should be given to balancing the splits in some way; for example, it would seem desirable that each case should occur with equal frequency over the R assessment sets; see Section 9.2. Depending on the value of n_t = n - m and the number p of explanatory variables, one might also need some form of balance to ensure that the model can always be fitted.
There is an efficient version of group cross-validation that does involve just one prediction of each response. We begin by splitting the data into K disjoint sets of nearly equal size, with the corresponding sets of case subscripts denoted by C_1, ..., C_K, say. These K sets define R = K different splits into training and assessment sets, with S_{a,k} = C_k the kth assessment set and the remainder of the data S_{t,k} = ∪_{i≠k} C_i the kth training set. For each such split we apply (6.43), and then average these estimates. The result is the K-fold cross-validation estimate of prediction error

    Δ̂_CV,K = n^{-1} Σ_{j=1}^n c{y_j, μ(x_j, F̂_{-k(j)})},    (6.47)

where F̂_{-k(j)} represents the data from which the group containing the jth case has been deleted. Note that Δ̂_CV,K is equal to the leave-one-out estimate (6.44) when K = n. Calculation of (6.47) requires just K model fits. Practical experience suggests that a good strategy is to take K = min{n^{1/2}, 10}, on the grounds that taking K > 10 may be too computationally intensive when the prediction rule is complicated, while taking groups of size at least n^{1/2} should perturb the data sufficiently to give small variance of the estimate.
The use of groups will have the desired effect of reducing variance, but at the cost of increasing bias. For example, it can be seen from the expansion used earlier for Δ that the bias of Δ̂_CV,K is a_1 {n(K - 1)}^{-1} + ..., which could be substantial if K is small, unless n is very large. Fortunately the bias of Δ̂_CV,K can be reduced by a simple adjustment. In a harmless abuse of notation, let
F̂_{-k} denote the data with the kth group omitted, for k = 1, ..., K, and let p_k denote the proportion of the data falling in the kth group; if n/K = m is an integer, then all groups are of size m and p_k = 1/K. The adjusted cross-validation estimate of aggregate prediction error is

    Δ̂_ACV,K = Δ̂_CV,K + D(F̂, F̂) - Σ_{k=1}^K p_k D(F̂, F̂_{-k}).    (6.48)

This has smaller bias than Δ̂_CV,K and is almost as simple to calculate, because it requires no additional fits of the model. For a comparison between Δ̂_CV,K and Δ̂_ACV,K in a simple situation, see Problem 6.12.

The following algorithm summarizes the calculation of Δ̂_ACV,K when the split into groups is made at random.
Algorithm 6.5 (K-fold adjusted cross-validation)

1  Fit the regression model to all cases, calculate predictions ŷ_j from that model, and average the values of c(y_j, ŷ_j) to get D(F̂, F̂).
2  Choose group sizes m_1, ..., m_K such that m_1 + ... + m_K = n.
3  For k = 1, ..., K:
   (a) choose C_k by sampling m_k times without replacement from {1, 2, ..., n} minus the elements chosen for previous C_i's;
   (b) fit the regression model to all data except cases j ∈ C_k;
   (c) calculate new predictions ỹ_j = μ(x_j, F̂_{-k}) for j ∈ C_k;
   (d) calculate predictions ŷ_j^{(-k)} = μ(x_j, F̂_{-k}) for all j; then
   (e) average the n values c(y_j, ŷ_j^{(-k)}) to give D(F̂, F̂_{-k}).
4  Average the n values of c(y_j, ỹ_j) using the ỹ_j from step 3(c) to give Δ̂_CV,K.
5  Calculate Δ̂_ACV,K as in (6.48) with p_k = m_k/n.
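A compact R sketch of Algorithm 6.5 for least squares and squared error, with a random split into K groups (function and argument names are ours):

  acv.k <- function(formula, data, K = 10) {
    n <- nrow(data)
    group <- sample(rep(1:K, length.out = n))        # random groups, sizes m_k
    y <- model.response(model.frame(formula, data))
    D.hat <- mean((y - fitted(lm(formula, data)))^2) # apparent error D(F^, F^)
    cv <- adj <- 0
    for (k in 1:K) {
      fit.k <- lm(formula, data = data[group != k, ])
      pred <- predict(fit.k, newdata = data)         # predictions for all cases
      cv <- cv + sum((y[group == k] - pred[group == k])^2) / n
      adj <- adj + mean((y - pred)^2) * mean(group == k)  # p_k D(F^, F^_{-k})
    }
    c(cv = cv, acv = cv + D.hat - adj)               # (6.47) and (6.48)
  }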

Bootstrap estimates

A direct application of the bootstrap principle to Δ(F) gives the estimate

    Δ̂ = Δ(F̂) = E*{D(F̂, F̂*)},

where F̂* denotes a simulated sample (x*_1, y*_1), ..., (x*_n, y*_n) taken from the data by case resampling. Usually simulation is required to approximate this estimate, as follows. For r = 1, ..., R we randomly resample cases from the data to obtain the sample (x*_{r1}, y*_{r1}), ..., (x*_{rn}, y*_{rn}), which we represent by F̂*_r, and to this sample we fit the prediction rule and calculate its predictions μ(x_j, F̂*_r) of the data responses y_j for j = 1, ..., n. The aggregate prediction error estimate is then calculated as

    R^{-1} Σ_{r=1}^R n^{-1} Σ_{j=1}^n c{y_j, μ(x_j, F̂*_r)}.    (6.49)


Intuitively this bootstrap estimate is less satisfactory than cross-validation, because the simulated dataset F̂*_r used to calculate the prediction rule is part of the data F̂ used for assessment of prediction error. In this sense the estimate is a hybrid of the apparent error estimate and a cross-validation estimate, a point to which we return shortly.

As we have noted in previous chapters, care is often needed in choosing what to bootstrap. Here, an approach which works better is to use the bootstrap to estimate the expected excess error e(F) defined in (6.42), which is the bias of the apparent error Δ̂_app, and to add this estimate to Δ̂_app. In theory the bootstrap estimate of e(F) is

    ê(F̂) = E*{D(F̂, F̂*) - D(F̂*, F̂*)},

and its approximation from the simulations described in the previous paragraph defines the bootstrap estimate of expected excess error

    ê_B = R^{-1} Σ_{r=1}^R [ n^{-1} Σ_{j=1}^n c{y_j, μ(x_j, F̂*_r)} - n^{-1} Σ_{j=1}^n c{y*_{rj}, μ(x*_{rj}, F̂*_r)} ].    (6.50)

That is, for the rth bootstrap sample we construct the prediction rule μ(x, F̂*_r), then calculate the average difference between the prediction errors when this rule is applied first to the original data and secondly to the bootstrap sample itself, and finally average across bootstrap samples. We refer to the resulting estimate of aggregate prediction error, Δ̂_B = ê_B + Δ̂_app, as the bootstrap estimate of prediction error, given by

    Δ̂_B = n^{-1} Σ_{j=1}^n R^{-1} Σ_{r=1}^R c{y_j, μ(x_j, F̂*_r)} - R^{-1} Σ_{r=1}^R D(F̂*_r, F̂*_r) + D(F̂, F̂).    (6.51)

Note that the first term of (6.51), which is also the simple bootstrap estimate (6.49), is expressed as the average of the contributions R^{-1} Σ_{r=1}^R c{y_j, μ(x_j, F̂*_r)} that each original observation makes to the estimate of aggregate prediction error. These contributions are of interest in their own right, most importantly in assessing how the performance of the prediction rule changes with values of the explanatory variables. This is illustrated in Example 6.10 below.
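For least squares prediction with squared error, (6.50) and (6.51) reduce to a few lines of R (a sketch; names are ours):

  boot.pred.err <- function(formula, data, R = 200) {
    y <- model.response(model.frame(formula, data))
    app <- mean((y - fitted(lm(formula, data)))^2)         # apparent error
    excess <- replicate(R, {
      i <- sample(nrow(data), replace = TRUE)
      fit <- lm(formula, data = data[i, ])
      pred <- predict(fit, newdata = data)
      mean((y - pred)^2) - mean((y[i] - fitted(fit))^2)    # D(F^,F*) - D(F*,F*)
    })
    app + mean(excess)                                     # Delta_B of (6.51)
  }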
Hybrid bootstrap estimates

It is useful to observe that the naive estimate (6.49), which is also the first term of (6.51), can be broken into two qualitatively different parts,

    n^{-1} Σ_{j=1}^n R^{-1} Σ_{r: j out} c{y_j, μ(x_j, F̂*_r)}    (6.52)

and

    n^{-1} Σ_{j=1}^n R^{-1} Σ_{r: j in} c{y_j, μ(x_j, F̂*_r)},    (6.53)

where the first inner sum is over the R_{-j} bootstrap samples F̂*_r in which (x_j, y_j) does not appear, and the second is over those in which it does. In (6.52) y_j is always predicted using data from which (x_j, y_j) is excluded, which is analogous to cross-validation, whereas (6.53) is similar to an apparent error calculation, because y_j is always predicted using data that contain (x_j, y_j).
Now R_{-j}/R is approximately equal to the constant e^{-1} = 0.368, so (6.52) is approximately proportional to

    Δ̂_BCV = n^{-1} Σ_{j=1}^n R_{-j}^{-1} Σ_{r: j out} c{y_j, μ(x_j, F̂*_r)},    (6.54)

sometimes called the leave-one-out bootstrap estimate of prediction error. The notation refers to the fact that Δ̂_BCV can be viewed as a bootstrap smoothing of the cross-validation estimate Δ̂_CV. To see this, consider replacing the term c{y_j, μ(x_j, F̂_{-j})} in (6.44) by the expectation E*_{-j}[c{y_j, μ(x_j, F̂*)}], where E*_{-j} refers to the expectation over bootstrap samples F̂* of size n drawn from F̂_{-j}. The estimate (6.54) is a simulation approximation of this expectation, because of the result noted in Section 3.10.1 that the R_{-j} bootstrap samples in which case j does not appear are equivalent to random samples drawn from F̂_{-j}.

The smoothing in (6.54) may effect a considerable reduction in variance, compared to Δ̂_CV, especially if c(y_+, ŷ_+) is not continuous. But there will also be a tendency toward positive bias. This is because the typical bootstrap sample from which predictions are made in (6.54) includes only about (1 - e^{-1})n = 0.632n distinct data values, and the bias of cross-validation estimates increases as the size of the training set decreases.
What we have so far is that the bootstrap estimate of aggregate prediction error essentially involves a weighted combination of Δ̂_BCV and an apparent error estimate. Such a combination should have good variance properties, but may suffer from bias. However, if we change the weights in the combination it may be possible to reduce or remove this bias. This suggests that we consider the hybrid estimate

    Δ̂_w = w Δ̂_BCV + (1 - w) Δ̂_app,    (6.55)

and then select w to make the bias as small as possible, ideally E(Δ̂_w) = Δ + O(n^{-2}).
Not unexpectedly it is difficult to calculate E(Δ̂_w) in general, but for quadratic error and least squares prediction it is relatively easy. We already know that the apparent error estimate has expectation σ^2(1 - q n^{-1}), and that the true aggregate error is Δ = σ^2(1 + q n^{-1}). It remains only to calculate E(Δ̂_BCV), where here

    Δ̂_BCV = n^{-1} Σ_{j=1}^n E*_{-j}(y_j - x_j^T β̂*_{-j})^2,

with β̂*_{-j} the least squares estimate of β from a bootstrap sample with the jth case excluded. A rather lengthy calculation (Problem 6.13) shows that

    E(Δ̂_BCV) = σ^2(1 + 2 q n^{-1}) + O(n^{-2}),

from which it follows that

    E{w Δ̂_BCV + (1 - w) Δ̂_app} = σ^2{1 + (3w - 1) q n^{-1}} + O(n^{-2}),

which agrees with Δ to terms of order n^{-1} if w = 2/3.

Table 6.9 Estimates of aggregate prediction error (×10^-2) for the data on nuclear power plants. Results for adjusted cross-validation are shown in parentheses.

  Apparent   Bootstrap   0.632   K-fold (adjusted) cross-validation
  error                          K = 32   K = 16      K = 10      K = 2
  2.0        3.2         3.5     3.6      3.7 (3.7)   3.8 (3.7)   4.4 (4.2)
It seems impossible to find an optimal choice of w for general measures of prediction error and general prediction rules, but detailed calculations do suggest that w = 1 - e^{-1} = 0.632 is a good choice. Heuristically this value for w is equivalent to an adjustment for the below-average distance between cases and bootstrap samples without them, compared to what we expect in the real prediction problem. That the value 0.632 is close to the value 2/3 derived above is reassuring. The hybrid estimate (6.55) with w = 0.632 is known as the 0.632 estimator of prediction error and is denoted here by Δ̂_{0.632}. There is substantial empirical evidence favouring this estimate, so long as the number of covariates p is not close to n.
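A sketch of the leave-one-out bootstrap (6.54) and the resulting 0.632 estimator, again for least squares and squared error (our names; a case appearing in every resample would contribute an NA and is dropped):

  err.632 <- function(formula, data, R = 200) {
    n <- nrow(data)
    y <- model.response(model.frame(formula, data))
    app <- mean((y - fitted(lm(formula, data)))^2)     # apparent error
    err <- matrix(NA, R, n)                            # loss for case j when j is out
    for (r in 1:R) {
      i <- sample(n, replace = TRUE)
      pred <- predict(lm(formula, data = data[i, ]), newdata = data)
      out <- !(1:n %in% i)
      err[r, out] <- (y[out] - pred[out])^2
    }
    bcv <- mean(colMeans(err, na.rm = TRUE), na.rm = TRUE)  # (6.54)
    0.368 * app + 0.632 * bcv                               # (6.55) with w = 0.632
  }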
Example 6.10 (Nuclear power stations) Consider predicting the cost of a new power station based on the data of Example 6.8. We base our prediction on the linear regression model described there, so we have μ(x_j, F̂) = x_j^T β̂, where β̂ is the least squares estimate for a model with six covariates. The estimated error variance is s^2 = 0.6337/25 = 0.0253, with 25 degrees of freedom. The downwardly biased apparent error estimate is Δ̂_app = 0.6337/32 = 0.020, whereas the idealized estimate (6.38) is 0.0253 × (1 + 7/32) = 0.031. In this situation the prediction error for a particular station seems most useful, but before we turn to individual stations, we discuss the overall estimates, which are given in Table 6.9.
[Figure 6.11 Components of prediction error for nuclear power data based on 200 bootstrap simulations. The top panel shows the values of y_j - μ(x_j, F̂*_r), plotted by case. The lower left panel shows the average error for each case, plotted against the raw residuals. The lower right panel shows the ratio of the model-based to the bootstrap prediction standard errors.]

Those estimates show the pattern we would anticipate from the general discussion. The apparent error is considerably smaller than the other estimates.


The bootstrap estimate, with R = 200, is larger than the apparent error, but smaller than the cross-validation estimates, and the 0.632 estimate agrees well with the ordinary cross-validation estimate (6.44), for which K = n = 32. Adjustment slightly decreases the cross-validation estimates. Note that the idealized estimate appears to be quite accurate here, presumably because the model fits well and the errors are not far from homoscedastic, except for the last six cases.

Now consider the individual predictions. Prediction error arises from two components: the variability of the predictor x_+^T β̂ and that of the associated error ε_+. Figure 6.11 gives some insight into these. Its top panel shows the values of y_j - μ(x_j, F̂*_r) for r = 1, ..., R, plotted against case number j. The variability of the average error corresponds to the variation of individual observations about their predicted values, while the variance within each group reflects parameter estimation uncertainty. A striking feature is the small prediction error for the last six power plants, whose variances and means are both small. The lower left panel shows the average values of y_j - μ(x_j, F̂*_r) over the 200 simulations, plotted against the raw residuals. They agree closely, as we should expect with a well-fitting model. The lower right panel shows the ratio of the model-based prediction standard error to the bootstrap prediction standard error. It confirms that the model-based calculation described in Example 6.8 overestimates the predictive standard error for the last six plants, which have the partial turnkey guarantee. The estimated bootstrap prediction error for these plants is 0.003, while it is 0.032 for the rest. The last six cases fall into three groups determined by the values of the explanatory variables: in effect they are replicated.

It might be preferable to plot y_j - μ(x_j, F̂*_r) only for those bootstrap samples which exclude the jth case, and then mean prediction error would better be compared to the jackknifed residuals y_j - x_j^T β̂_{-j}. For these data the plots are very similar to those we have shown.

Example 6.11 (Times on delivery suite) For a more systematic comparison of prediction error estimates in linear regression, we use data provided by E. Burns on the times taken by 1187 women to give birth at the John Radcliffe Hospital in Oxford. An appropriate linear model has as response the log time spent on the delivery suite, and dummy explanatory variables indicating the type of labour, the use of electronic fetal monitoring, the use of an intravenous drip, the reported length of labour before arriving at the hospital, and whether or not the labour is the woman's first; seven parameters are estimated in all.

We took 200 samples of size n = 50 at random from the full data. For each of these samples we fitted the model described above, and then calculated cross-validation estimates of prediction error Δ̂_CV,K with K = 50, 10, 5 and 2 groups, the corresponding adjusted cross-validation estimates Δ̂_ACV,K, the bootstrap estimate Δ̂_B, and the hybrid estimate Δ̂_{0.632}. We took R = 200 for the bootstrap calculations.

The results of this experiment are summarized in terms of estimates of the expected excess error in Table 6.10. The average apparent error and excess error were 15.7 × 10^-2 and 5.2 × 10^-2, the latter taken to be e(F) as defined in (6.42). The table shows averages and standard deviations of the differences between the estimates Δ̂ and Δ̂_app. The cross-validation estimate with K = 50, the bootstrap and the 0.632 estimate have similar properties, while other choices of K give estimates that are more variable; the half-sample estimate Δ̂_CV,2 is worst. Results for cross-validation with 10 and 5 groups are almost

Table 6.10 Summary results for estimates of prediction error for 200 samples of size n = 50 from a set of data on the times 1187 women spent on delivery suite at the John Radcliffe Hospital, Oxford. The table shows the average, standard deviation, and conditional mean squared error (×10^-2) for the 200 estimates of excess error. The target average excess error is 5.2 × 10^-2.

         Bootstrap   0.632   K-fold (adjusted) cross-validation
                             K = 50   K = 10        K = 5         K = 2
  Mean   4.6         5.3     5.3      6.0 (5.7)     6.2 (5.5)     9.2 (5.7)
  SD     1.3         1.6     1.6      2.3 (2.2)     2.6 (2.3)     5.4 (3.3)
  MSE    0.23        0.24    0.24     0.28 (0.26)   0.30 (0.27)   0.71 (0.33)

the same. Adjustment significantly improves cross-validation when the group size is not small. The bootstrap estimate is least variable, but is downwardly biased.

The final row of the table gives the conditional mean squared error, defined as 200^{-1} Σ_{i=1}^{200} {Δ̂_i - D_i(F, F̂)}^2 for each error estimate Δ̂. This measures the success of Δ̂ in estimating the true aggregate prediction error D(F, F̂) for each of the 200 samples. Again the ordinary cross-validation, bootstrap, and 0.632 estimates perform best.

In this example there is little to choose between K-fold cross-validation with 10 and 5 groups, which both perform worse than the ordinary cross-validation, bootstrap, and 0.632 estimators of prediction error. K-fold cross-validation should be used with adjustment if ordinary cross-validation or the simulation-based estimates are not feasible.

6.4.2 Variable selection

In many applications of multiple linear regression, one purpose of the analysis is to decide which covariate terms to include in the final model. The supposition is that the full model y = x^T β + ε with p covariates in (6.22) is correct, but that it may include some redundant terms. Our aim is to eliminate those redundant terms, and so obtain the true model, which will form the basis for further inference. This is somewhat simplistic from a practical viewpoint, because it assumes that one subset of the proposed linear model is true: it may be more sensible to assume that a few subsets may be equally good approximations to a complicated true relationship between mean response and covariates.

Given that there are p covariate terms in the model (6.22), there are 2^p candidates for the true model, because we can include or exclude each covariate. In practice the number of candidates will be reduced if prior information necessitates inclusion of particular covariates or combinations of them.

There are several approaches to variable selection, including various stepwise methods. But the approach we focus on here is the direct one of minimizing aggregate prediction error, when each candidate model is used to predict independent, future responses at the data covariate values. For simplicity we assume that models are fitted by least squares, and that aggregate prediction error is average squared error. It would be a simple matter to use other prediction rules and other measures of prediction accuracy.
First we define some notation. We denote an arbitrary candidate model by M, which is one of the 2^p possible linear models. Whenever M is used as a subscript, it refers to elements of that model. Thus the n × p_M design matrix X_M contains those p_M columns of the full design matrix X that correspond to covariates included in M; the jth row of X_M is x_{M,j}^T; the least squares estimates for regression coefficients in M are β̂_M; and H_M is the hat matrix X_M (X_M^T X_M)^{-1} X_M^T that defines fitted values μ̂_M = H_M y under model M. The total number of regression coefficients in M is q_M = p_M + 1, assuming that an intercept term is always included.

Now consider prediction of single responses y_+ at each of the original design points x_1, ..., x_n. The average squared prediction error using model M is

    n^{-1} Σ_{j=1}^n (y_{+j} - x_{M,j}^T β̂_M)^2,

and its expectation under model (6.22), conditional on the data, is the aggregate prediction error

    D(M) = σ^2 + n^{-1} Σ_{j=1}^n (μ_j - x_{M,j}^T β̂_M)^2,

where μ^T = (μ_1, ..., μ_n) is the vector of mean responses for the true multiple regression model. Taking the expectation over the data distribution we obtain

    Δ(M) = E{D(M)} = (1 + n^{-1} q_M) σ^2 + n^{-1} μ^T (I - H_M) μ,    (6.56)

where n^{-1} μ^T (I - H_M) μ is zero only if model M is correct. The quantities D(M) and Δ(M) generalize D and Δ defined in (6.36) and (6.37).
In principle the best m odel w ould be the one th a t m inimizes D{M), but
since the m odel p aram eters are unknow n we m ust settle for m inim izing a
good estim ate o f D ( M) o r A(M). Several resam pling m ethods for estim ating
A were discussed in the previous subsection, so the n atu ral approach would
be to choose a good m ethod an d apply it to all possible models. However,
accurate estim ation o f A(M ) is n o t itself im p o rtan t: w hat is im p o rtan t is to
accurately estim ate the signs o f differences am ong the A(M), so th a t we can
identify which o f the A(M )s is smallest.
O f the m ethods considered earlier, the a p p aren t e rro r estim ate A app( M) =
h^ R S S m was poor. Its use here is im m ediately ruled out w hen we observe th a t
it always decreases w hen covariates are added to a m odel, so m inim ization
always leads to the full model.


Cross-validation
One good estimate, when used with squared error, is the leave-one-out
cross-validation estimate. In the present notation this is

Δ̂_CV(M) = n⁻¹ Σ_{j=1}^n {(y_j − ŷ_Mj)/(1 − h_Mj)}²,    (6.57)

where ŷ_Mj is the fitted value for model M based on all the data and h_Mj is the
leverage for case j in model M. The bias of Δ̂_CV(M) is small, but that is not
enough to make it a good basis for selecting M. To see why, note first that an
expansion gives

nΔ̂_CV(M) ≐ ε^T(I − H_M)ε + 2p_Mσ² + μ^T(I − H_M)μ.    (6.58)

Then if model M is true, and M′ is a larger model, it follows that for large n

Pr{Δ̂_CV(M) < Δ̂_CV(M′)} = Pr(χ²_d < 2d),

where d = p_{M′} − p_M. This probability is substantially below 1 unless d is large.
It is therefore quite likely that selecting M to minimize Δ̂_CV(M) will lead
to overfitting, even for large n. So although the term μ^T(I − H_M)μ in (6.58)
guarantees that, for large n, incorrect models will not be selected, minimization
of Δ̂_CV(M) does not provide consistent selection of the true model.
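Computationally, (6.57) needs only the residuals and leverages from the full-data fit of M. The sketch below, with a generic model formula and dataframe, also evaluates Pr(χ²_d < 2d) for small d, illustrating the overfitting probabilities just described.

# Sketch: leave-one-out cross-validation estimate (6.57) for one candidate
# model M, computed from the full-data least squares fit.
cv.M <- function(formula, data)
{ fit <- lm(formula, data=data)
  h <- lm.influence(fit)$hat              # leverages h_Mj
  mean((residuals(fit)/(1-h))^2) }        # (6.57)
# Overfitting probabilities Pr(chisq_d < 2d) for d = 1,...,5:
pchisq(2*(1:5), df=1:5)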
One explanation for this is that to estimate Δ(M) with sufficient accuracy
we need both large amounts of data to fit model M and a large number of
independent predictions. This can be accomplished using the more general
cross-validation measure (6.43), under conditions given below. In principle we
need to average (6.43) over all possible splits, but for practical purposes we
follow (6.46). That is, using R different splits into training and assessment sets
of sizes n_t = n − m and n_a = m, we generalize (6.57) to

Δ̂_CV(M) = R⁻¹ Σ_{r=1}^R m⁻¹ Σ_{j∈S_{a,r}} {y_j − ŷ_Mj(S_{t,r})}²,

where ŷ_Mj(S_{t,r}) = x_Mj^T β̂_M(S_{t,r}) and β̂_M(S_{t,r}) are the least squares estimates for
coefficients in M fitted to the rth training set, whose subscripts are in S_{t,r}. Note
that the same R splits into training and assessment sets are used for all models.
It can be shown that, provided m is chosen so that n − m → ∞ and m/n → 1 as
n → ∞, minimization of Δ̂_CV(M) will give consistent selection of the true model
as n → ∞ and R → ∞.
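A sketch of this repeated-split estimate is given below, assuming a generic dataframe and lm fits; the list of splits is generated once, so that the same R splits serve every candidate model.

# Sketch: cross-validation estimate based on R random splits into training
# sets of size n - m and assessment sets of size m.
make.splits <- function(n, m, R=50) lapply(1:R, function(r) sample(n, m))
cv.split <- function(formula, data, splits)
{ y <- model.response(model.frame(formula, data))
  err <- sapply(splits, function(assess) {
    fit <- lm(formula, data=data[-assess,])               # fit to S_{t,r}
    mean((y[assess] - predict(fit, data[assess,]))^2) })  # error on S_{a,r}
  mean(err) }
# e.g. splits <- make.splits(n=nrow(data), m=round(2*nrow(data)/3))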


Bootstrap methods
Corresponding results can be obtained for bootstrap resampling methods. The
bootstrap estimate of aggregate prediction error (6.51) becomes

Δ̂_B(M) = n⁻¹RSS_M + R⁻¹ Σ_{r=1}^R {n⁻¹ Σ_{j=1}^n (y_j − x_Mj^T β̂*_Mr)² − n⁻¹RSS*_Mr},    (6.59)

where the second term on the right-hand side is an estimate of the expected
excess error defined in (6.42). The resampling scheme can be either case
resampling or error resampling, with x*_Mjr = x_Mj for the latter.
It turns out that minimization of Δ̂_B(M) behaves much like minimization of
the leave-one-out cross-validation estimate, and does not lead to a consistent
choice of true model as n → ∞. However, there is a modification of Δ̂_B(M),
analogous to that made for the cross-validation procedure, which does produce
a consistent model selection procedure. The modification is to make simulated
datasets be of size n − m rather than n, such that m/n → 1 and n − m → ∞ as
n → ∞. Also, we replace the estimate (6.59) by the simpler bootstrap estimate

Δ̂_B(M) = R⁻¹ Σ_{r=1}^R n⁻¹ Σ_{j=1}^n (y_j − x_Mj^T β̂*_Mr)²,    (6.60)

which is a generalization of (6.49). (The previous doubts about this simple
estimate are less relevant for small n − m.) If case resampling is used, then
n − m cases are randomly selected from the full set of n. If model-based
resampling is used, the model being M with assumed homoscedasticity of
errors, then X*_M is a random selection of n − m rows from X_M and the n − m
errors ε* are randomly sampled from the n mean-corrected modified residuals
r_Mj − r̄_M for model M.
Bearing in mind the general advice that the number of simulated datasets
should be at least R = 100 for estimating second moments, we should use at
least that many here. The same R bootstrap resamples are used for each model
M, as with the cross-validation procedure.
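A minimal sketch of (6.60) with case resampling of reduced size n − m follows; resamples with singular designs, which the next paragraph warns about, are simply skipped here.

# Sketch: bootstrap estimate (6.60) for model M from R case resamples of
# size n - m; the same resamples should be reused for each candidate model.
boot.select <- function(formula, data, size, R=100)
{ y <- model.response(model.frame(formula, data))
  n <- nrow(data)
  err <- numeric(R)
  for (r in 1:R) {
    i <- sample(n, size, replace=T)
    fit <- lm(formula, data=data[i,])
    if (fit$rank < length(coef(fit))) { err[r] <- NA; next } # singular design
    err[r] <- mean((y - predict(fit, data))^2) }
  mean(err, na.rm=T) }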
One major practical difficulty that is shared by the consistent cross-validation
and bootstrap procedures is that fitting all candidate models to small subsets
of data is not always possible. What empirical evidence there is concerning
good choices for m/n suggests that this ratio should be about 2/3. If so, then in
many applications some of the R subsets will have singular designs X*_M for big
models, unless subsets are balanced by appropriate stratification on covariates
in the resampling procedure.
Example 6.12 (Nuclear power stations) In Examples 6.8 and 6.10 our analyses
focused on a linear regression model that includes six of the p = 10 covariates
available. Three of these covariates (date, log(cap) and NE) are highly
significant, all others having P-values of 0.1 or more.

Figure 6.12 Aggregate prediction error estimates for sequence of models fitted to nuclear power stations data, plotted against number of covariates; see text. Leave-one-out cross-validation (solid line), bootstrap with R = 100 resamples of size 32 (dashed line) and 16 (dotted line).

Here we consider the selection of variables to include in the model. The total number of possible
models, 2¹⁰ = 1024, is prohibitively large, and for the purposes of illustration
we consider only the particular sequence of models in which variables enter
in the order date, log(cap), NE, CT, log(N), PT, T1, T2, PR, BW: the first three
are the highly significant variables.

Figure 6.12 plots the leave-one-out cross-validation estimates and the bootstrap
estimates (6.60) with R = 100 of aggregate prediction error for the
models with 0, 1, …, 10 covariates. The two estimates are very close, and both
are minimized when six covariates are included (the six used in Examples 6.8
and 6.10). Selection of five or six covariates, rather than fewer, is quite clear-cut.
These results bear out the rough rule-of-thumb that variables are selected
by cross-validation if they are significant at roughly the 0.1 level.
As the previous discussion would suggest, use of corresponding cross-validation
and bootstrap estimates from training sets of size 20 or less is
precluded, because for training sets of such sizes the models with more than
five covariates are frequently unidentifiable. That is, the unbalanced nature of
the covariates, coupled with the binary nature of some of them, frequently
leads to singular resample designs. Figure 6.12 includes bootstrap estimates
for models with up to five covariates and training sets of size 16: these results
were obtained by omitting many singular resamples. These rather fragmentary
results confirm that the model should include at least five covariates.
A useful lesson from this is that there is a practical obstacle to what in
theory is a preferred variable selection procedure. One way to try to overcome
this difficulty is to stratify on the binary covariates, but this is difficult to
implement and does not work well here.


Example 6.13 (Simulation exercise) In order to assess the variable selection
procedures without the complication of singular resample designs, we consider
a small simulation exercise in which the procedures are applied to ten datasets
simulated from a given model. There are p = 5 independent covariates, whose
values are sampled from the uniform distribution on [0, 1], and responses y
are generated by adding N(0, 1) variates to the means μ = x^Tβ. The cases
we examine have sample size n = 50, and β₃ = β₄ = β₅ = 0, so the true
model includes an intercept and two covariate terms. To simplify calculations
only six models are fitted, by successively adding x₁, …, x₅ to an initial model
with constant intercept. All resampling calculations are done with R = 100
samples. The number of datasets is admittedly small, but sufficient to make
rough comparisons of performance.
The main results concern models with β₁ = β₂ = 2, which means that the
two non-zero coefficients are about four standard errors away from zero. Each
panel of Figure 6.13 shows, for the ten datasets, one variable selection criterion
plotted against the number of covariates included in the model. Evidently the
clearest indications of the true model occur when the training set size is 10 or
20. Larger training sets give flat profiles for the criterion, and more frequent
selection of overfitted models.
These indications match the evidence from more extensive simulations, which
suggest that if the training set size n − m is about n/3 then the probability of correct
model selection is 0.9 or higher, compared to 0.7 or less for leave-one-out
cross-validation.
Further results were obtained with β₁ = 2 and β₂ = 0.5, the latter equal to
one standard error away from zero. In this situation underfitting, that is failure to

Figure 6.13 Cross-validation and bootstrap estimates of aggregate prediction error for sequence of six models fitted to ten datasets of size n = 50 with p = 5 covariates. The true model includes only two covariates. Panels: cross-validation with resamples of size 10, 20 and 30; leave-one-out cross-validation; bootstrap with resamples of size 10, 20, 30 and 50.


include x₂ in the selected model, occurred quite frequently even when using
training sets of size 20. This degradation of variable selection procedures when
coefficients are smaller than two standard errors is reputed to be typical.

The theory used to justify the consistent cross-validation and bootstrap
procedures may depend heavily on the assumptions that the dimension of
the true model is small compared to the number of cases, and that the
non-zero regression coefficients are all large relative to their standard errors.
It is possible that leave-one-out cross-validation may work well in certain
situations where the model dimension is comparable to the number of cases. This
would be important, in light of the very clear difficulties of using small training
sets with typical applications, such as Example 6.12. Evidently further work,
both theoretical and empirical, is necessary to find broadly applicable variable
selection methods.

6.5 Robust Regression

The use of least squares regression estimates is preferred when errors are
near-normal in distribution and homoscedastic. However, the estimates are
very sensitive to outliers, that is, cases which deviate strongly from the general
relationship. Also, if errors have a long-tailed distribution (possibly due to
heteroscedasticity), then least squares estimation is not an efficient method.
Any regression analysis should therefore include appropriate inspection of
diagnostics based on residuals, to detect outliers and to determine whether a normal
assumption for errors is reasonable. If the occurrence of outliers does not
cause a change in the regression model, then they will usually be omitted from
the fitting of that model. Depending on the general pattern of residuals for the
remaining cases, we may feel confident in fitting by least squares, or we may
choose to use a more robust method to be safe. Essentially, the resampling
methods that we have discussed previously in this chapter can be adapted
quite easily for use with many robust regression methods. In this section we
briefly review some of the main points.
Perhaps the most important point is that gross outliers should be removed
before the final regression analysis, including resampling, is undertaken. There are
two reasons for this. The first is that methods of fitting that are resistant to
outliers are usually not very efficient, and may behave badly under resampling.
The second reason is that outliers can be disruptive to resampling analysis of
methods such as least squares that are not resistant to outliers. For model-based
resampling, the error distribution will be contaminated, and in the resampling
the outliers can then occur at any x values. For case resampling, outlying
cases will occur with variable frequency and make the bootstrap estimates of
coefficients too variable; see Example 6.4. The effects can be diagnosed from


Table 6.11 Survival data (Efron, 1988).

Dose (rads)    Survival %
117.5          44.000   55.000
235.0          16.000   13.000
470.0           4.000    1.960    6.120
705.0           0.500    0.320
940.0           0.110    0.015
1410            0.019    0.700    0.006

the jackknife-after-bootstrap plots of Section 3.10.1 or similarly informative
diagnostic plots, but such plots can fail to show the occurrence of multiple
outliers.
For datasets with possibly multiple outliers, diagnosis is aided by initial
use of a fitting method that is highly resistant to the effects of outliers. One
preferred resistant method is least trimmed squares, which minimizes

Σ_{j=1}^m e²_{(j)}(β),    (6.61)

the sum of the m smallest squared deviations e_j(β) = y_j − x_j^Tβ. Usually m
is taken to be [n/2] + 1, where [·] denotes integer part. Residuals from the least
trimmed squares fit should clearly identify outliers. The fit itself is not very
efficient, and is best thought of as an initial step in a more efficient analysis. (It
should be noted that in some implementations of least trimmed squares, local
minima of (6.61) may be found far away from the global minimum.)
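For example, using the ltsreg function employed in the practicals of Section 6.8 (in R it is provided by the MASS library), a resistant first pass might look like the following sketch; the cutoff 2.5 is an arbitrary illustrative choice.

# Sketch: least trimmed squares as a resistant initial fit, with its
# residuals scaled resistantly and used to flag possible outliers.
lts.fit <- ltsreg(y~x, data=d)                 # d is a generic dataframe
s <- median(abs(residuals(lts.fit)))/0.6745    # resistant scale estimate
which(abs(residuals(lts.fit))/s > 2.5)         # cases flagged for inspection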
Example 6.14 (Survival proportions) The data in Table 6.11 and the left panel
of Figure 6.14 are survival percentages for rats at a succession of doses of
radiation, with two or three replicates at each dose. The theoretical relationship
between survival rate and dose is exponential, so linear regression applies to

x = dose,    y = log(survival percentage).

The right panel of Figure 6.14 plots these variables. There is a clear outlier,
case 13, at x = 1410. The least squares estimate of slope is −59 × 10⁻⁴ using
all the data, changing to −78 × 10⁻⁴ with standard error 5.4 × 10⁻⁴ when case
13 is omitted. The least trimmed squares estimate of slope is −69 × 10⁻⁴.
From the scatter plot it appears that heteroscedasticity may be present, so we
resample cases. The effect of the outlier on the resample least squares estimates
is illustrated in Figure 6.15, which plots R = 200 bootstrap least squares slopes
β̂₁* against the corresponding values of Σ(x_j* − x̄*)², differentiated by the
frequency with which case 13 appears in the resample. There are three distinct
groups of bootstrapped slopes, with the lowest corresponding to resamples in
which case 13 does not occur and the highest to samples where it occurs twice or
more. A jackknife-after-bootstrap plot would clearly reveal the effect of case 13.
The resampling standard error of β̂₁* is 15.3 × 10⁻⁴, but only 7.6 × 10⁻⁴ for
samples without case 13. The corresponding resampling standard errors of the
least trimmed squares slope are 20.5 × 10⁻⁴ and 18.0 × 10⁻⁴, showing both the
resistance and the inefficiency of the least trimmed squares method.

Figure 6.14 Scatter plots of survival data: survival percentage (left) and log survival percentage (right) against dose (rads).

Figure 6.15 Bootstrap estimates of slope β̂₁* and design sum-of-squares Σ(x_j* − x̄*)² (× 10⁵), differentiated by frequency of case 13 (appears zero, one or more times), for case resampling with R = 200 from survival data.

Example 6.15 (Salinity data) The data in Table 6.12 are n = 28 observations
on the salinity of water in Pamlico Sound, North Carolina. The response in
the second column is the bi-weekly average of salinity. The next three columns
contain values of the covariates: respectively, a lagged value of salinity, a trend


Table 6.12 Salinity data (Ruppert and Carroll, 1980).

Case   Salinity   Lagged salinity   Trend indicator   River discharge
         sal           lag              trend              dis
  1      7.6           8.2                4               23.01
  2      7.7           7.6                5               23.87
  3      4.3           4.6                0               26.42
  4      5.9           4.3                1               24.87
  5      5.0           5.9                2               29.90
  6      6.5           5.0                3               24.20
  7      8.3           6.5                4               23.22
  8      8.2           8.3                5               21.86
  9     13.2          10.1                0               22.27
 10     12.6          13.2                1               23.83
 11     10.4          12.6                2               25.14
 12     10.8          10.4                3               22.43
 13     13.1          10.8                4               21.79
 14     12.3          13.1                5               22.38
 15     10.4          13.3                0               23.93
 16     10.5          10.4                1               33.44
 17      7.7          10.5                2               24.86
 18      9.5           7.7                3               22.69
 19     12.0          10.0                0               21.79
 20     12.6          12.0                1               22.04
 21     13.6          12.1                4               21.03
 22     14.1          13.6                5               21.01
 23     13.5          15.0                0               25.87
 24     11.5          13.5                1               26.29
 25     12.0          11.5                2               22.93
 26     13.0          12.0                3               21.31
 27     14.1          13.0                4               20.77
 28     15.1          14.1                5               21.39

indicator, and the river discharge. We consider a linear regression model with
these three covariates.
The initial least squares analysis gives coefficients 0.78, −0.03 and −0.30,
with intercept 9.70. The usual standard error for the trend coefficient is 0.16,
so this coefficient would be judged not nearly significant. However, this fit is
suspect, as can be seen not from the Q-Q plot of modified residuals but from
the plot of cross-validation residuals versus leverages, where case 16 stands
out as an outlier, due apparently to its unusual value of dis. The outlier is
much more easily detected using the least trimmed squares fit, which has the
quite different coefficient values 0.61, −0.15 and −0.86 with intercept 24.72: the
residual of case 16 from this fit has standardized value 6.9. Figure 6.16 shows
normal Q-Q plots of standardized residuals from the least squares (left panel) and
least trimmed squares fits (right panel); for the latter the scale factor is taken
to be the median absolute residual divided by 0.6745, the value appropriate
for estimating the standard deviation of normal errors. (Application of
standard algorithms for least trimmed squares with default settings can give
very different, incorrect solutions.)


Figure 6.16 Salinity data: standardized residuals from least squares (left) and least trimmed squares (right) fits using all cases, plotted against quantiles of the standard normal distribution.

There is some question as to whether the outlier is really aberrant, or simply
reflects the need for a quadratic term in dis.

Robust methods
We suppose now that outliers have been isolated by diagnostic plots and set
aside from further analysis. The problem now is whether or not that analysis
should use least squares estimation: if there is evidence of a long-tailed error
distribution, then we should downweight large deviations y_j − x_j^Tβ by using a
robust method. Two main options for this are now described.
One approach is to minimize not sums of squared deviations but sums of
absolute values of deviations, Σ|y_j − x_j^Tβ|, so giving less weight to those cases
with the largest errors. This is the L₁ method, which generalizes, and has
efficiency comparable to, the sample median estimate of a population mean.
There is no simple expression for the approximate variance of L₁ estimators.
More efficient is M-estimation, which is analogous to maximum likelihood
estimation. Here the coefficient estimates β̂ for a multiple linear regression
solve the estimating equation

Σ_{j=1}^n x_j ψ{(y_j − x_j^Tβ)/s} = 0,    (6.62)

where ψ(z) is a bounded replacement for z, and s is either the solution to a
simultaneous estimating equation, or is fixed in advance. We choose the latter,
taking s to be the median absolute deviation (divided by 0.6745) from the least
trimmed squares regression fit. The solution to (6.62) is obtained by iterative
weighted least squares, for which least trimmed squares estimates are good
starting values.
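A bare-bones sketch of this iteration for the Huber ψ defined below is as follows, with fixed scale s and a least trimmed squares starting value; the rlm function in the MASS library is a more careful implementation.

# Sketch: M-estimation by iterative weighted least squares with fixed
# scale s; psi(z) = z min(1, c/|z|) gives IRLS weights psi(z)/z.
# X includes the intercept column; beta is a least trimmed squares start.
m.fit <- function(X, y, s, beta, c=1.345, iter=20)
{ for (it in 1:iter) {
    e <- as.vector(y - X %*% beta)
    w <- pmin(1, c/abs(e/s))                 # weight 1 where |e/s| <= c
    beta <- lsfit(X, y, wt=w, intercept=F)$coef }
  beta }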


With a careful choice of ψ(·), M-estimates should have smaller standard
errors than least squares estimates for long-tailed distributions of random
errors ε, yet have comparable standard errors should those errors be
homoscedastic normal. One standard choice is ψ(z) = z min(1, c/|z|), Huber's
winsorizing function, for which the coefficient estimates have approximate
efficiency 95% relative to least squares estimates for homoscedastic normal errors
when c = 1.345.
For large sample sizes M-estimates β̂ are approximately normal in distribution,
with approximate variance

var(β̂) = σ² (E{ψ²(ε/σ)} / [E{ψ̇(ε/σ)}]²) (X^TX)⁻¹    (6.63)

under homoscedasticity; here ψ̇(u) is the derivative dψ(u)/du. A more robust,
empirical variance estimate is provided by the nonparametric delta method.
First, the empirical influence values are, analogous to (6.25),

l_j = k n (X^TX)⁻¹ x_j ψ(e_j/s),

where k = s{n⁻¹ Σ_{j=1}^n ψ̇(e_j/s)}⁻¹ and e_j = y_j − x_j^Tβ̂ is the raw residual; see
Problem 6.7. The variance approximation is then

v_L = n⁻² Σ_{j=1}^n l_j l_j^T = k² (X^TX)⁻¹ X^T D X (X^TX)⁻¹,    (6.64)

where D = diag{ψ²(e₁/s), …, ψ²(e_n/s)}; this generalizes (6.17).
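In code, (6.64) for the Huber ψ (whose derivative is ψ̇(z) = 1 for |z| ≤ c and 0 otherwise) can be sketched as follows, with X, raw residuals e and fixed scale s as above.

# Sketch: nonparametric delta-method variance (6.64) for Huber M-estimates.
vL <- function(X, e, s, c=1.345)
{ z <- e/s
  psi <- pmin(c, pmax(-c, z))              # psi(z) = z min(1, c/|z|)
  psidot <- as.numeric(abs(z) <= c)        # derivative of psi
  k <- s/mean(psidot)                      # k = s {n^{-1} sum psidot}^{-1}
  XtXi <- solve(t(X) %*% X)
  k^2 * XtXi %*% t(X) %*% diag(psi^2) %*% X %*% XtXi }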


Resampling
As with least squares estimation, so with robust estimates we have two simple
choices for resampling: case resampling or model-based resampling. Depending
on which robust method is used, the resampling algorithm may need to be
modified from the simple form that it takes for least squares estimation.
The L₁ estimates will behave like the sample median under either resampling
scheme, so that the distribution of β̂* − β̂ can be very discrete, and close to that
of β̂ − β only for very large samples. Use of the smooth bootstrap (Section 3.4)
will improve accuracy. No simple studentization is possible for L₁ estimates.
For M-estimates case resampling should be satisfactory except for small
datasets, especially those with unreplicated design points. The advantage of
case resampling is simplicity. For model-based resampling, some modifications
are required to the algorithm used to resample least squares estimation in
Section 6.3. First, the leverage correction of raw residuals is given by

r_j = e_j / (1 − d h_j)^{1/2},

where

d = 2 Σ_j (e_j/s)ψ(e_j/s) / Σ_j ψ̇(e_j/s) − n Σ_j ψ²(e_j/s) / {Σ_j ψ̇(e_j/s)}²;

for least squares, where ψ(z) = z, this gives d ≐ 1. Simulated errors are
randomly sampled from the uncentred r₁, …, r_n. Mean


correction to the r_j is replaced by a slightly more complicated correction in
the estimating equation itself. The resample version of (6.62) is

Σ_{j=1}^n x_j [ψ{(y_j* − x_j^Tβ)/s*} − n⁻¹ Σ_{k=1}^n ψ(r_k/s)] = 0.

The scale estimate s* is obtained by the same method as s, but from the
resample data.
Studentization of β̂* − β̂ is possible, using the resample analogue of the delta
method variance (6.64), or more simply just using s*.
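Putting these pieces together, one resample might be generated and fitted as in the sketch below; a Newton iteration is used for the recentred version of (6.62), X, beta, raw residuals e, leverages h, the constant d and the scale s are as above, and for simplicity s is used in place of s*.

# Sketch: one model-based resample for M-estimation. Raw residuals are
# leverage-corrected with d, errors are drawn uncentred, and the
# estimating equation is recentred by cstar.
psi <- function(z, c=1.345) pmin(c, pmax(-c, z))
psidot <- function(z, c=1.345) as.numeric(abs(z) <= c)
r <- e/sqrt(1 - d*h)                               # corrected residuals
ystar <- as.vector(X %*% beta) + sample(r, length(r), replace=T)
cstar <- mean(psi(r/s))                            # recentring constant
bstar <- beta
for (it in 1:20) {                                 # Newton steps
  u <- (ystar - as.vector(X %*% bstar))/s
  A <- t(X) %*% (X * psidot(u))                    # X^T diag(psidot) X
  bstar <- bstar + s * solve(A, t(X) %*% (psi(u) - cstar)) }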
Example 6.16 (Salinity data) In our previous look at the salinity data in
Example 6.15, we identified case 16 as a clear outlier. We now set that
case aside and re-analyse the linear regression with all three covariates. One
objective is to determine whether or not the trend variable should be included
in the model: the initial, incorrect least squares analysis suggested not.
A normal Q-Q plot of the modified residuals from the new least squares fit
suggests somewhat long tails for the error distribution, so that robust methods
may be worthwhile. We fit the model by four methods: least squares, Huber
M-estimate (with c = 1.345), L₁ and least trimmed squares. Coefficient estimates
are fairly similar under all methods, except for trend, whose coefficients are
−0.17, −0.22, −0.18 and −0.08.
For further analysis we apply case resampling with R = 99. Figure 6.17
illustrates the results for estimates of the coefficient of trend. The dotted lines
on the top two panels correspond to the theoretical normal approximations:
evidently the standard variance approximation based on (6.63) for the
Huber estimate is too low. Note also the relatively large resampling variance for
the least trimmed squares estimate, part of which may be due to unconverged
estimates: two resampling outliers have been trimmed from this plot.
To assess the significance of trend we apply the studentized pivot method
of Section 6.3.2 with both least squares and M-estimates, studentizing by the
theoretical standard error in each case. The corresponding values of z are
−1.25 and −1.80, with respectively 23 and 12 smaller values of z* out of 99.
So there appears to be little evidence of the need to include trend.
If we checked diagnostic plots for any of the four regression fits, a question
might be raised about whether or not case 5 should be included in the
analysis. An alternative view of this is provided by jackknife-after-bootstrap
plots (Section 3.10.1) of the four fits: such plots correspond to case-deletion
resampling. As an illustration, Figure 6.18 shows the jackknife-after-bootstrap
plot for the coefficient of trend in the M-estimation fit. This shows clearly that
case 5 has an appreciable effect on the resampling distribution, and that its
omission would give tighter confidence limits on the coefficient. It also raises


Figure 6.17 Salinity data: normal Q-Q plots of resampled estimates of trend coefficient, based on case resampling (R = 99) for data excluding case 16. Clockwise from top left: least squares, Huber M-estimation, least trimmed squares, L₁. Dotted lines correspond to theoretical normal approximations.

questions about two other cases. Clearly some further exploration is needed
before firm conclusions can be reached.

The previous example illustrates the point that it is often worthwhile to
incorporate robust methods into a regression analysis, both to help isolate
outliers and to assess the reliability of conclusions based on the least squares
fit to supposedly clean data. In some areas of application, for example those
involving relationships between financial series, long-tailed distributions may
be quite common, and then robust methods will be especially important. To the
extent that theoretical normal approximations are inaccurate for many robust
estimates, resampling methods are a natural companion to robust analysis.


Figure 6.18 Jackknife-after-bootstrap plot for the coefficient of trend in the M-estimation fit to the salinity data, omitting case 16 (x-axis: standardized jackknife value).

6.6 Bibliographic Notes

There are several comprehensive accounts of linear regression analysis,
including the books by Draper and Smith (1981), Seber (1977), and Weisberg
(1985). Diagnostic methods are described by Atkinson (1985) and by Cook
and Weisberg (1982). A good general reference on robust regression is the
book by Rousseeuw and Leroy (1987). Many linear regression methods and
their properties are summarized, with illustrations using S-Plus, in Venables
and Ripley (1994).
The use of bootstrap methods in regression was initiated by Efron (1979).
Important early work on the theory of resampling for linear regression was
by Freedman (1981) and Bickel and Freedman (1983). See also Efron (1988).
Freedman (1984) and Freedman and Peters (1984a,b) assessed the methods
in practical applications. Wu (1986) gives a quite comprehensive theoretical
treatment, including comparisons between various resampling and jackknife
methods; for further developments see Shao (1988) and Liu and Singh (1992b).
Hall (1989b) shows that bootstrap methods can provide unusually accurate
confidence intervals in regression problems.
Theoretical properties of bootstrap significance tests, including the use of
both studentized pivots and F statistics, were established by Mammen (1993).
Recent interest in resampling tests for econometric models is reviewed by
Jeong and Maddala (1993).
Use of the bootstrap for calculating prediction intervals was discussed by
Stine (1985). The asymptotic theory for the most elementary case was given by
Bai and Olshen (1988). For further theoretical development see Beran (1992).
Olshen et al. (1989) described an interesting application to a complicated
prediction problem.
The wild bootstrap is based on an idea suggested by Wu (1986), and has
been explored in detail by Härdle (1989, 1990) and Mammen (1992). The
effectiveness of the wild bootstrap, particularly for studentized coefficients,
was demonstrated by Mammen (1993).
Cross-validation methods for the assessment of prediction error have a long
history, but modern developments originated with Stone (1974) and Geisser
(1975). What we refer to as K-fold cross-validation was proposed by Breiman
et al. (1984), and further studied by Burman (1989). Important theoretical
results were developed by Bunke and Droge (1984), Li (1987), and Shao
(1993). The theoretical foundation of cross-validation and bootstrap estimates
of prediction error, with particular emphasis on classification problems, was
developed in Chapter 7 of Efron (1982) and by Efron (1983), the latter
introducing the 0.632 estimate. Further developments, with applications and
empirical studies, were given by Efron (1986) and Efron and Tibshirani (1997).
The discussion of hybrid estimates in Section 6.4 is based on Hall (1995). In a
simple case Davison and Hall (1992) attempt to explain the properties of the
bootstrap and cross-validation error estimates.
There is a large literature on variable selection in regression, much of which
overlaps with the cross-validation literature. Cross-validation is related to the
Cp method of linear model selection, proposed by Mallows (1973), and
to the AIC method of Akaike (1973), as was shown by Stone (1977). For
a summary discussion of various methods of model selection see Chapter 2
of Ripley (1996), for example. The consistent bootstrap methods outlined in
Section 6.4 were developed by Shao (1996).
Asymptotic properties of resampled M-estimates were derived by Shorack
(1982), who described the adjustment necessary for unbiasedness of the
resampled coefficients. Mammen (1989) provided additional asymptotic support.
Aspects of residuals from robust regression were discussed by Cook,
Hawkins and Weisberg (1992) and McKean, Sheather and Hettmansperger
(1993), the latter showing how to standardize raw residuals in M-estimation.
De Angelis, Hall and Young (1993) gave a detailed theoretical analysis of
model-based resampling in L₁ estimation, which confirmed that a smooth
bootstrap is advisable; further numerical results were provided by Stangenhaus
(1987).

6.7 Problems
1   Show that for a multivariate distribution with mean vector μ and variance matrix
Ω, the influence functions for the sample mean and variance are respectively

L_μ(z) = z − μ,    L_Ω(z) = (z − μ)(z − μ)^T − Ω.

Hence show that for the linear regression model derived as the conditional
expectation E(y | X = x) of a multivariate CDF F, the empirical influence function
values for the linear regression parameters are

l_β(x_j, y_j) = n(X^TX)⁻¹x_j e_j,

where X is the matrix of explanatory variables.
(Sections 2.7.2, 6.2.2)
2   For homogeneous data as in Chapter 2, the empirical influence values for an
estimator can be approximated using case-deletion values. Use the matrix identity

(X^TX − x_j x_j^T)⁻¹ = (X^TX)⁻¹ + (X^TX)⁻¹x_j x_j^T(X^TX)⁻¹ / {1 − x_j^T(X^TX)⁻¹x_j}

to show that in the linear regression model with least squares fitting,

β̂ − β̂_{−j} = (X^TX)⁻¹ x_j (y_j − x_j^Tβ̂) / (1 − h_j).

Compare this to the corresponding empirical influence value in Problem 6.1, and
obtain the jackknife estimates of the bias and variance of β̂.
(Sections 2.7.3, 6.2.2, 6.4)
3   For the linear regression model y_j = x_jβ + ε_j, with no intercept, show that the
least squares estimate of β is β̂ = Σx_j y_j / Σx_j². Define residuals by e_j = y_j − x_jβ̂.
If the resampling model is y_j* = x_jβ̂ + ε_j*, with ε* randomly sampled from the e_j's,
show that the resample estimate β̂* has mean and variance respectively

β̂ + n x̄ ē / Σx_j²,    n⁻¹ Σ(e_j − ē)² / Σx_j²,

where ē and x̄ are the averages of the e_j and x_j.
Thus in particular the resampling mean is incorrect. Examine the improvements
made by leverage adjustment and mean correction of the residuals.
(Section 6.2.3)
4   The usual estimated variance of the least squares slope estimate β̂₁ in simple linear
regression can be written

v = {Σ(y_j − ȳ)² − β̂₁² Σ(x_j − x̄)²} / {(n − 2) Σ(x_j − x̄)²}.

If the x*s and y*s are random permutations of the xs and ys, show that

v* = {Σ(y_j − ȳ)² − β̂₁*² Σ(x_j − x̄)²} / {(n − 2) Σ(x_j − x̄)²}.

Hence show that in the permutation test for zero slope, the R values of β̂₁* are in the
same order as those of β̂₁*/v*^{1/2}, and that β̂₁* ≥ β̂₁ is equivalent to β̂₁*/v*^{1/2} ≥ β̂₁/v^{1/2}.
This confirms that the P-value of the permutation test is unaffected by studentizing.
(Section 6.2.5)

5   For least squares regression, model-based resampling gives a bootstrap estimator
β̂* which satisfies

β̂* = β̂ + (X^TX)⁻¹ Σ_{j=1}^n x_j ε_j*,

where the ε_j* are randomly sampled modified residuals. An alternative proposal is
to bypass the resampling model for data and to define directly

β̂** = β̂ + (X^TX)⁻¹ Σ_{j=1}^n u_j*,

where the u*s are randomly sampled from the vectors

u_j = x_j(y_j − x_j^Tβ̂),    j = 1, …, n.

Show that under this proposal β̂** has mean β̂ and variance equal to the robust
variance estimate (6.26). Examine, theoretically or through numerical examples, to
what extent the skewness of β̂** matches the skewness of β̂.
(Section 6.3.1; Hu and Zidek, 1995)
6   For the linear regression model y = Xβ + ε, the improved version of the robust
estimate of variance for the least squares estimates β̂ is

v_rob = (X^TX)⁻¹ X^T diag(r₁², …, r_n²) X (X^TX)⁻¹,

where r_j is the jth modified residual. If the errors have equal variances, then the
usual variance estimate

v = s²(X^TX)⁻¹

would be appropriate, and v_rob could be quite inefficient. To quantify this, examine
the case where the random errors ε_j are independent N(0, σ²). Show first that
E(r_j²) = σ². Hence show that the efficiency of the ith diagonal element of v_rob relative to the
ith diagonal element of v, as measured by the ratio of their variances, is

b_ii² / {(n − p) g_i^T Q g_i},

where b_ii is the ith diagonal element of (X^TX)⁻¹, g_i^T = (d_i1², …, d_in²) with D =
(X^TX)⁻¹X^T, and Q has elements (1 − h_jk)²/{(1 − h_j)(1 − h_k)}, h_jk being the (j, k)th
element of the hat matrix H, with h_jj = h_j.
Calculate this relative efficiency for a numerical example.
(Sections 6.2.4, 6.2.6, 6.3.1; Hinkley and Wang, 1991)
7   The statistical function β(F) for M-estimation is defined by the estimating equation

∫ x ψ[{y − x^Tβ(F)}/σ(F)] dF(x, y) = 0,

where σ(F) is typically a robust scale parameter. Assume that the model contains
an intercept, so that the covariate vector x includes the dummy variable 1. Use the
technique of Problem 2.12 to show that the influence function for β(F) is

L_β(x, y) = {∫ x x^T ψ̇(ε) dF(x, y)}⁻¹ σ x ψ(ε),

where ε = (y − x^Tβ)/σ and ψ̇(u) is dψ(u)/du; it is assumed that ψ(ε) has mean zero.
If the distribution of the covariate vector is taken to be the EDF of x₁, …, x_n,
show that

L_β(x, y) = n k⁻¹(X^TX)⁻¹ x σψ(ε),

where X is the usual covariate matrix and k = E{ψ̇(ε)}. Use the empirical version
of this to verify the variance approximation

v_L = n s² (X^TX)⁻¹ Σ ψ²(e_j/s) / {Σ ψ̇(e_j/s)}²,

where e_j = y_j − x_j^Tβ̂ and s is the estimated scale parameter.
(Section 6.5)
8   Given raw residuals e₁, …, e_n, define independent random variables ε_j* by (6.21).
Show that the first three moments of ε_j* are 0, e_j², and e_j³.
(a) Let e₁, …, e_n be raw residuals from the fit of a linear model y = Xβ + ε, and
define bootstrap data by y* = Xβ̂ + ε*, where the elements of ε* are generated
according to the wild bootstrap. Show that the bootstrap least squares estimates
β̂* take at most 2ⁿ values, and that

E*(β̂*) = β̂,    var*(β̂*) = v_wild = (X^TX)⁻¹X^TWX(X^TX)⁻¹,

where W = diag(e₁², …, e_n²).
(b) Show that when all the errors have equal variances and the design is balanced,
so that h_j = p/n, v_wild is negatively biased as an estimate of var(β̂).
(c) Show that for the simple linear regression model (6.1) the expected value of
var*(β̂₁*) is

(σ²/m₂) n⁻²(n − 1 − m₄/m₂²),

where m_r = n⁻¹ Σ(x_j − x̄)^r. Hence show that if the x_j are uniformly spaced and
the errors have equal variances, the wild bootstrap variance estimate is too small
by a factor of about 1 − 14/(5n).
(d) Show that if the e_j are replaced by the r_j, the difficulties in (b) and (c) do not arise.
(Sections 6.2.4, 6.2.6, 6.3.2)
9   Suppose that responses y₁, …, y_n with n = 2m correspond to m independent
samples of size two, where the ith sample comes from a population with mean μ_i
and these means are of primary interest; the m population variances may differ.
Use appropriate dummy variables x_i to express the responses in the linear model
y = Xβ + ε, where β_i = μ_i. With the parameters estimated by least squares, consider
estimating the standard error of μ̂_i by case resampling.
(a) Show that the probability of getting a simulated sample in which all the
parameters are estimable is

Σ_{k=0}^m (−1)^k C(m, k) (1 − k/m)^{2m},

where C(m, k) denotes the binomial coefficient.
(b) Consider constrained case resampling in which each of the m samples must be
represented at least once. Show that the probability that there are r resample cases
from the ith sample is

C(2m, r) (1/m)^r (1 − 1/m)^{2m−r} p_{m−1}(2m − r) / p_m(2m),    r = 1, …, m + 1,

where p_m(N) denotes the probability that N cases resampled from m equal-sized
samples represent every sample at least once. Hence calculate the resampling mean
of μ̂_i* and give an expression for its variance.
(Section 6.3; Feller, 1968, p. 102)
10   For the one-way model of Problem 6.9 with two observations per group, suppose
that θ = β₂ − β₁. Note that the least squares estimator of θ satisfies

θ̂ = θ + ½(ε₃ + ε₄ − ε₁ − ε₂).

Suppose that we use model-based resampling with the assumption of error
homoscedasticity. Show that the resample estimate can be expressed as

θ̂* = θ̂ + ½(ε₃* + ε₄* − ε₁* − ε₂*),

where the ε_j* are randomly sampled from the 2m modified residuals
±2^{−1/2}(y_{2i} − y_{2i−1}), i = 1, …, m. Use this representation to calculate the first four resampling moments
of θ̂* − θ̂. Compare the results with the first four moments of θ̂ − θ, and comment.
(Section 6.3)
11   Suppose that a 2^{−r} fraction of a 2⁸ factorial experiment is run, where 1 ≤ r ≤ 4.
Under what circumstances would a bootstrap analysis based on case resampling
be reliable?
(Section 6.3)

12   The several cross-validation estimates of prediction error can be calculated
explicitly in the simple problem of least squares prediction for homogeneous data with
no covariates. Suppose that data y₁, …, y_n and future responses y₊ are all sampled
from a population with mean μ and variance σ², and consider the prediction rule
μ(F̂) = ȳ with accuracy measured by quadratic error.
(a) Verify that the overall prediction error is Δ = σ²(1 + n⁻¹), that the expectation
of the apparent error estimate is σ²(1 − n⁻¹), and that the cross-validation estimate
Δ̂_CV with training sets of size n − 1 has expectation σ²{1 + (n − 1)⁻¹}.
(b) Now consider the K-fold cross-validation estimate Δ̂_CV,K and suppose that
n = Km with m an integer. Re-label the data in the kth group as y_{k1}, …, y_{km}, and
define ȳ_k = m⁻¹ Σ_{l=1}^m y_{kl}. Verify that

Δ̂_CV,K = n⁻¹ Σ_{k=1}^K Σ_{l=1}^m {(y_{kl} − ȳ_k) + K(K − 1)⁻¹(ȳ_k − ȳ)}²,

and hence show that

E(Δ̂_CV,K) = σ²{1 + n⁻¹ + n⁻¹(K − 1)⁻¹}.

Thus the bias of Δ̂_CV,K is σ²n⁻¹(K − 1)⁻¹.
(c) Extend the calculations in (b) to show that the adjusted estimate can be written

Δ̂_ACV,K = Δ̂_CV,K − K⁻¹(K − 1)⁻² Σ_{k=1}^K (ȳ_k − ȳ)²,

and use this to show that E(Δ̂_ACV,K) = Δ.
(Section 6.4; Burman, 1989)
13   The leave-one-out bootstrap estimate of aggregate prediction error for linear
prediction and squared error is equal to

Δ̂_BCV = n⁻¹ Σ_{j=1}^n E*_{−j}(y_j − x_j^Tβ̂*_{−j})²,

where β̂*_{−j} is the least squares estimate of β from a bootstrap sample with the jth
case excluded and E*_{−j} denotes expectation over such samples. To calculate the
mean of Δ̂_BCV, use the substitution

y_j − x_j^Tβ̂*_{−j} = (y_j − x_j^Tβ̂_{−j}) + x_j^T(β̂_{−j} − β̂*_{−j}),

and then show that

E(y_j − x_j^Tβ̂_{−j})² = σ²{1 + q(n − 1)⁻¹},
E[E*_{−j}{x_j^T(β̂*_{−j} − β̂_{−j})(β̂*_{−j} − β̂_{−j})^T x_j}] = σ²q(n − 1)⁻¹ + O(n⁻²),
E[(y_j − x_j^Tβ̂_{−j}) x_j^T E*_{−j}(β̂*_{−j} − β̂_{−j})] = O(n⁻²).

These results combine to show that E(Δ̂_BCV) = σ²(1 + 2qn⁻¹) + O(n⁻²), which leads
to the choice w = 2/3 for the estimate Δ̂_w = wΔ̂_BCV + (1 − w)Δ̂_app.
(Section 6.4; Hall, 1995)

6.8 Practicals
1   Dataset catsM contains a set of data on the heart weights and body weights of 97
male cats. We investigate the dependence of heart weight (g) on body weight (kg).
To see the data, fit a straight-line regression and do diagnostic plots:

catsM
plot(catsM$Bwt, catsM$Hwt, xlim=c(0,4), ylim=c(0,24))
cats.lm <- glm(Hwt~Bwt, data=catsM)
summary(cats.lm)
cats.diag <- glm.diag.plots(cats.lm, ret=T)

The summary suggests that the line passes through the origin, but we cannot
rely on normal-theory results here, because the residuals seem skewed, and their
variance possibly increases with the mean. Let us assess the stability of the fitted
regression.
For case resampling:

cats.fit <- function(data) coef(glm(data$Hwt~data$Bwt))
cats.case <- function(data, i) cats.fit(data[i,])
cats.boot1 <- boot(catsM, cats.case, R=499)
cats.boot1
plot(cats.boot1, jack=T)
plot(cats.boot1, index=2, jack=T)

to see a summary and plots for the bootstrapped intercepts and slopes. How
normal do they seem? Is the model-based standard error from the original fit
accurate? To what extent do the results depend on any single observation? We can
calculate the estimated standard error by the nonparametric delta method by

cats.L <- empinf(cats.boot1, type="reg")
sqrt(var.linear(cats.L))

Compare it with the quoted standard error from the regression output, and from
the empirical variance of the intercepts. Are the three standard errors in the order
you would expect?
For model-based resampling:

cats.res <- cats.diag$res*cats.diag$sd
cats.res <- cats.res - mean(cats.res)
cats.df <- data.frame(catsM, res=cats.res, fit=fitted(cats.lm))
cats.model <- function(data, i)
{ d <- data
  d$Hwt <- d$fit + d$res[i]
  cats.fit(d) }
cats.boot2 <- boot(cats.df, cats.model, R=499)
cats.boot2
plot(cats.boot2)

Compare the properties of these bootstrapped coefficients with those from case
resampling.
How would you use a resampling method to test the hypothesis that the line passes
through the origin?
(Section 6.2; Fisher, 1947)
2   The data of Example 6.14 are in dataframe survival. For a jackknife-after-bootstrap
plot for the regression slope β̂₁:

survival.fun <- function(data, i)
{ d <- data[i,]
  d.reg <- glm(log(d$surv)~d$dose)
  c(coefficients(d.reg)) }
survival.boot <- boot(survival, survival.fun, R=999)
jack.after.boot(survival.boot, index=2)

Compare this with Figure 6.15. What is happening?
3   poisons contains the survival times of animals in a 3 × 4 factorial experiment.
Each combination of three poisons and four treatments is used for four animals,
the allocation to the animals being completely randomized. The data are standard
in the literature as an example where transformation can be applied. Here we
apply resampling to the data on the original scale, and use it to test whether an
interaction between the two factors is needed. To calculate the test statistic, the
standard F statistic, and to see its significance using the usual F test:

poison.fun <- function(data)
{ assign("data.junk", data, frame=1)
  data.anova <- anova(glm(time~poison*treat, data=data.junk))
  dev <- as.numeric(unlist(data.anova[2]))
  df <- as.numeric(unlist(data.anova[1]))
  res.dev <- as.numeric(unlist(data.anova[4]))
  res.df <- as.numeric(unlist(data.anova[3]))
  (dev[4]/df[4])/(res.dev[4]/res.df[4]) }
poison.fun(poisons)
anova(glm(time~poison*treat, data=poisons), test="F")

To apply resampling analysis, using as the null model that with main effects:

poison.lm <- glm(time~poison+treat, data=poisons)
poison.diag <- glm.diag(poison.lm)
poison.mle <- list(fit=fitted(poison.lm),
  res=residuals(poison.lm)/sqrt(1-poison.diag$h))
poison.gen <- function(data, mle)
{ i <- sample(48, replace=T)
  data$time <- mle$fit + mle$res[i]
  data }
poison.boot <- boot(poisons, poison.fun, R=199, sim="parametric",
  ran.gen=poison.gen, mle=poison.mle)
sum(poison.boot$t > poison.boot$t0)

At what level does this give significance? Is this in line with the theoretical value?
One assumption of the above analysis is homogeneity of variances, but the data
cast some doubt on this. To test the hypothesis without this assumption:

poison.gen1 <- function(data, mle)
{ i <- matrix(1:48, 4, 12, byrow=T)
  i <- apply(i, 2, sample, replace=T, size=4)
  data$time <- mle$fit + mle$res[i]
  data }
poison.boot <- boot(poisons, poison.fun, R=199, sim="parametric",
  ran.gen=poison.gen1, mle=poison.mle)
sum(poison.boot$t > poison.boot$t0)

What do you conclude now?
(Section 6.3; Box and Cox, 1964)
4   For an example of prediction, we consider using the nuclear power station data to
predict the cost of new stations like cases 27-32, except that their value for date
is 73. We choose to make the prediction using the model with all covariates. To fit
that model, and to make the new station:

nuclear.glm <- glm(log(cost)~date+log(t1)+log(t2)+log(cap)+pr+ne
  +ct+bw+log(cum.n)+pt, data=nuclear)
nuclear.diag <- glm.diag(nuclear.glm)
nuke <- data.frame(nuclear, fit=fitted(nuclear.glm),
  res=nuclear.diag$res*nuclear.diag$sd)
nuke.p <- nuke[32,]
nuke.p$date <- 73
nuke.p$fit <- predict(nuclear.glm, nuke.p)

The bootstrap function and the call to boot are:

nuke.pred <- function(data, i, i.p, d.p)
{ d <- data
  d$cost <- exp(d$fit+d$res[i])
  d.glm <- glm(log(cost)~date+log(t1)+log(t2)+log(cap)+pr+ne
    +ct+bw+log(cum.n)+pt, data=d)
  predict(d.glm, d.p)-(d.p$fit+d$res[i.p]) }
nuclear.boot.pred <- boot(nuke, nuke.pred, R=199, m=1, d.p=nuke.p)

Finally the 95% prediction intervals are obtained by

as.vector(exp(nuke.p$fit-quantile(nuclear.boot.pred$t,
  c(0.975,0.025))))

How do these compare to those in Example 6.8?
Modify the above analysis to use a studentized pivot. What effect has this change
on your interval?
(Section 6.3.3; Cox and Snell, 1981, pp. 81-90)
5   Consider predicting the log brain weight of a mammal from its log body weight,
using squared error cost. The data are in dataframe mammals. For an initial model,
apparent error and ordinary cross-validation estimates of aggregate prediction
error:

cost <- function(y, mu=0) mean((y-mu)^2)
mammals.glm <- glm(log(brain)~log(body), data=mammals)
muhat <- fitted(mammals.glm)
app.err <- cost(mammals.glm$y, muhat)
mammals.diag <- glm.diag(mammals.glm)
cv.err <- mean((mammals.glm$y-muhat)^2/(1-mammals.diag$h)^2)

For 6-fold unadjusted and adjusted estimates of aggregate prediction error:

cv.err.6 <- cv.glm(mammals, mammals.glm, cost, K=6)

Experiment with other values of K.
For bootstrap and 0.632 estimates, and plot of error components:

mammals.pred.fun <- function(data, i, formula)
{ d <- data[i,]
  d.glm <- glm(formula, data=d)
  D.F.hatF <- cost(log(data$brain), predict(d.glm, data))
  D.hatF.hatF <- cost(log(d$brain), fitted(d.glm))
  c(log(data$brain)-predict(d.glm, data), D.F.hatF - D.hatF.hatF) }
mam.boot <- boot(mammals, mammals.pred.fun, R=200,
  formula=formula(mammals.glm))
n <- nrow(mammals)
err.boot <- app.err + mean(mam.boot$t[,n+1])
err.632 <- 0
mam.boot$f <- boot.array(mam.boot)
for (i in 1:n)
  err.632 <- err.632 + cost(mam.boot$t[mam.boot$f[,i]==0,i])/n
err.632 <- 0.368*app.err + 0.632*err.632
ord <- order(mammals.diag$res)
mam.pred <- mam.boot$t[,ord]
mam.fac <- factor(rep(1:n, rep(200,n)), labels=ord)
plot(mam.fac, mam.pred, ylab="Prediction errors",
  xlab="Case ordered by residual")

What are cases 34, 35, and 32?
(Section 6.4.1)

6   The data of Examples 6.15 and 6.16 are in dataframe salinity. For the linear
regression model with all three covariates, consider the effect of discharge dis and
the influence of case 16 on estimating this. Resample the least squares, L₁ and least
trimmed squares estimates, and then look at the jackknife-after-bootstrap plots:

salinity.rob.fun <- function(data, i)
{ data.i <- data[i,]
  ls.fit <- lm(sal~lag+trend+dis, data=data.i)
  l1.fit <- l1fit(data.i[,-1], data.i[,1])
  lts.fit <- ltsreg(data.i[,-1], data.i[,1])
  c(ls.fit$coef, l1.fit$coef, lts.fit$coef) }
salinity.boot <- boot(salinity, salinity.rob.fun, R=1000)
jack.after.boot(salinity.boot, index=4)
jack.after.boot(salinity.boot, index=8)
jack.after.boot(salinity.boot, index=12)

What conclusions do you draw from these plots about (a) the shapes of the
distributions of the estimates, (b) comparisons between the estimation methods,
and (c) the effects of case 16?
One possible explanation for case 16 being an outlier with respect to the multiple
linear regression model used previously is that a quadratic effect in discharge
should be added to the model. We can test for this using the pivot method with
least squares estimates and case resampling:

salinity.quad.fun <- function(data, i)
{ data.i <- data[i,]
  ls.fit <- lm(sal~lag+trend+poly(dis,2), data=data.i)
  ls.sum <- summary(ls.fit)
  ls.std <- sqrt(diag(ls.sum$cov))*ls.sum$sigma
  c(ls.fit$coef, ls.std) }
salinity.boot <- boot(salinity, salinity.quad.fun, R=99)
quad.z <- salinity.boot$t0[5]/salinity.boot$t0[10]
quad.z.star <- (salinity.boot$t[,5]-salinity.boot$t0[5])/
  salinity.boot$t[,10]
(1+sum(quad.z < quad.z.star))/(1+salinity.boot$R)

Out of curiosity, look at the normal Q-Q plots of raw and studentized coefficients:

qqnorm(salinity.boot$t[,5], ylab="discharge quadratic coefficient")
qqnorm(quad.z.star, ylab="discharge quadratic z statistic")

Is it reasonable to use least squares estimates here? See whether or not the same
conclusion would be reached using other methods of estimation.
(Section 6.5; Ruppert and Carroll, 1980; Atkinson, 1985, p. 48)

7
Further Topics in Regression

7.1 Introduction
In Chapter 6 we showed how the basic bootstrap methods of earlier chapters
extend to linear regression. The broad aim of this chapter is to extend the
discussion further, to various forms of nonlinear regression models, especially
generalized linear models and survival models, and to nonparametric
regression, where the form of the mean response is not fully specified.
A particular feature of linear regression is the possibility of error-based
resampling, when responses are expressible as means plus homoscedastic errors.
This is particularly useful when our objective is prediction. For generalized
linear models, especially for discrete data, responses cannot be described in
terms of additive errors. Section 7.2 describes ways of generalizing error-based
resampling for such models. The corresponding development for survival data
is given in Section 7.3. Section 7.4 looks briefly at nonlinear regression with
additive error, mainly to illustrate the useful contribution that resampling
methods can make to the analysis of such models. There is often a need to
estimate the potential accuracy of predictions based on regression models,
and Section 6.4 contained a general discussion of resampling methods for
this. In Section 7.5 we focus on one type of application, the estimation of
misclassification rates when a binary response y corresponds to a classification.
Not all relationships between a response y and covariates x can be readily
modelled in terms of a parametric mean function of known form. At least
for exploratory purposes it is useful to have flexible nonparametric curve-fitting
methods, and there is now a wide variety of these. In Section 7.6 we
examine briefly how resampling can be used in conjunction with some of these
nonparametric regression methods.

7.2 Generalized Linear Models

7.2.1 Introduction
The generalized linear model extends the linear regression model of Section 6.3
in two ways. First, the distribution of the response Y has the property that the
variance is an explicit function of the mean μ,

var(Y) = φV(μ),

where V(·) is the known variance function and φ is the dispersion parameter,
which may be unknown. This includes the important cases of binomial, Poisson,
and gamma distributions in addition to the normal distribution. Secondly, the
linear mean structure is generalized to

g(μ) = η,    η = x^Tβ,

where g(·) is a specified monotone link function which links the mean to the
linear predictor η. As before, x is a (p + 1) × 1 vector of explanatory variables
associated with Y. The possible combinations of different variance functions
and link functions include such things as logistic and probit regression, and
log-linear models for contingency tables, without making ad hoc transformations
of responses.
The first extension was touched on briefly in Section 6.2.6 in connection
with weighted least squares, which plays a key role in fitting generalized
linear models. The second extension, to linear models for transformed means,
represents a very special type of nonlinear model.
When independent responses y_j are obtained with explanatory variables x_j,
the full model is usually taken to be

E(Y_j) = μ_j,    g(μ_j) = x_j^Tβ,    var(Y_j) = κc_jV(μ_j),    (7.1)

where κ may be unknown and the c_j are known weights. For example, for
binomial data with probability π(x_j) and denominator m_j, we take c_j =
1/m_j; see Example 7.3. The constant κ equals one for binomial, Poisson
and exponential data. Notice that (7.1) strictly only specifies the first and second
moments of the responses, and in that sense is a semiparametric model. So, for
example, we can model overdispersed count data by using the Poisson variance
function V(μ) = μ but allowing κ to be a free overdispersion parameter which
is to be estimated.
One important point about generalized linear models is the non-unique
definition of residuals, and the consequent non-uniqueness of nonparametric
resampling algorithms.
After illustrating these ideas with an example we briefly review the main
aspects of generalized linear models. We then go on to discuss resampling
methods.


Table 7.1 Survival times y (weeks) for two groups of acute leukaemia patients, together with x = log10 white blood cell count (Feigl and Zelen, 1965).

    Group 1                                   Group 2
    Case    x      y     Case    x      y    Case    x      y     Case    x      y
     1    3.36    65      10   3.85   143     18   3.64    56      26   4.43     2
     2    2.88   156      11   3.97    56     19   3.48    65      27   4.45     3
     3    3.63   100      12   4.51    26     20   3.60    17      28   4.49     8
     4    3.41   134      13   4.54    22     21   3.18     7      29   4.41     4
     5    3.78    16      14   5.00     1     22   3.95    16      30   4.32     3
     6    4.02   108      15   5.00     1     23   3.72    22      31   4.90    30
     7    4.00   121      16   4.72     5     24   4.00     3      32   5.00     4
     8    4.23     4      17   5.00    65     25   4.28     4      33   5.00    43
     9    3.73    39

Example 7.1 (Leukaemia data)  Table 7.1 contains data on the survival times in weeks of two groups of acute leukaemia victims, as a function of their white blood cell counts.

A simple model is that within each group survival time Y is exponential with mean μ = exp(β_0 + β_1 x), where x = log10(white blood cell count). Thus the link function is logarithmic. The intercept is different for each group, but the slope is assumed common, so the full model for the jth response in group i is

    E(Y_ij) = μ_ij,    log(μ_ij) = β_0i + β_1 x_ij,    var(Y_ij) = κ V(μ_ij),    V(μ) = μ².

The fitted means μ̂ and the data are shown in the left panel of Figure 7.1. The mean survival times for group 2 are shorter than those for group 1 at the same white blood cell count.

Under this model the ratios Y/μ are exponentially distributed with unit mean, and hence the Q-Q plot of the y_ij/μ̂_ij against exponential quantiles in the right panel of Figure 7.1 would ideally be a straight line. Systematic curvature might indicate that we should use a gamma density with index ν,

    f(y; μ, ν) = {y^(ν−1) ν^ν / (μ^ν Γ(ν))} exp(−νy/μ),    y > 0,    μ, ν > 0.

In this case var(Y) = μ²/ν, so the dispersion parameter is taken to be κ = 1/ν and c_j = 1. In fact the exponential model seems to fit adequately.

[Figure 7.1  Summary plots for fits of an exponential model fitted to two groups of survival times for leukaemia patients. The left panel shows the times and fitted means as a function of log10 white blood cell count (group 1, fitted line solid; group 2, fitted line dots). The right panel shows an exponential Q-Q plot of the y/μ̂.]

7.2.2 Model fitting and residuals

Estimation

Suppose that independent data (x_1, y_1), ..., (x_n, y_n) are available, with response mean and variance described by (7.1). If the response distributions are assumed to be given by the corresponding exponential family model, then the maximum likelihood estimates of the regression parameters β solve the (p + 1) × 1 system of estimating equations

    Σ_{j=1}^n x_j (y_j − μ_j) / {c_j V(μ_j) ġ(μ_j)} = 0,        (7.2)

where ġ(μ) = dη/dμ is the derivative of the link function. Because the dispersion parameters are taken to have the form κ c_j, the estimate β̂ does not depend on κ. Note that although the estimates are derived as maximum likelihood estimates, their values depend only upon the regression relationship as expressed by the assumed variance function and the link function and choice of covariates.

The usual method for solving (7.2) is iterative weighted least squares, in which at each iteration the adjusted responses z_j = η_j + (y_j − μ_j)ġ(μ_j) are regressed on the x_j with weights w_j given by

    w_j^(−1) = c_j V(μ_j) ġ²(μ_j);        (7.3)

all these quantities are evaluated at the current values of the estimates. The weighted least squares equation (6.27) applies at each iteration, with y replaced by the adjusted dependent variable z. The approximate variance matrix for β̂

is given by the analogue of (6.24), namely

    var(β̂) = κ (X^T W X)^(−1),        (7.4)

with the diagonal weight matrix W = diag(w_1, ..., w_n) evaluated at the final fitted values μ̂_j.

The corresponding hat matrix is

    H = X (X^T W X)^(−1) X^T W,        (7.5)

as in (6.28). (Some authors prefer to work with X′(X′^T X′)^(−1) X′^T, where X′ = W^(1/2) X.) The relationship of H to fitted values is η̂ = Hz, where z is the vector of adjusted responses. Note that in general W, and hence H, depends upon the fitted values. The residual vector e = y − μ̂ has approximate variance matrix (I − H) var(Y), this being exact only for linear regression with known W.

When the dispersion parameter κ is unknown, it is estimated by the analogue of the residual mean square,

    κ̂ = (n − p − 1)^(−1) Σ_{j=1}^n (y_j − μ̂_j)² / {c_j V(μ̂_j)}.        (7.6)

For a linear model with V(μ) = 1 and dispersion parameter κ = σ², this gives κ̂ = s², the residual mean square.
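To make the fitting recipe concrete, here is a minimal sketch of iterative weighted least squares for the special case of a Poisson log-linear model, combining (7.2), (7.3) and (7.6); the function name, starting values and tolerance are our own choices, not part of any standard package.

    import numpy as np

    def irls_poisson(X, y, n_iter=25, tol=1e-8):
        """Iterative weighted least squares for a Poisson GLM with log link:
        V(mu) = mu and g(mu) = log(mu), so g'(mu) = 1/mu and the weights
        (7.3) reduce to w_j = mu_j.  X should include a column of ones."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ beta
            mu = np.exp(eta)
            z = eta + (y - mu) / mu          # adjusted responses z_j
            W = mu                           # weights w_j = 1/{V(mu) g'(mu)^2}
            XtW = X.T * W
            beta_new = np.linalg.solve(XtW @ X, XtW @ z)
            if np.max(np.abs(beta_new - beta)) < tol:
                beta = beta_new
                break
            beta = beta_new
        mu = np.exp(X @ beta)
        # dispersion estimate (7.6), with c_j = 1 and p + 1 = X.shape[1]
        kappa = np.sum((y - mu) ** 2 / mu) / (len(y) - X.shape[1])
        return beta, mu, kappa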
Let ℓ_j(μ_j) denote the contribution that the jth observation makes to the overall log likelihood ℓ(μ), parametrized in terms of the means μ_j. Then the fit of a generalized linear model is measured by the deviance

    D = 2κ {ℓ(y) − ℓ(μ̂)} = 2κ Σ_{j=1}^n {ℓ_j(y_j) − ℓ_j(μ̂_j)},        (7.7)

which is the scaled difference between the maximized log likelihoods for the saturated model, which has a parameter for each observation, and the fitted model. The deviance corresponds to the residual sum of squares in the analysis of a linear regression model. For example, there are large reductions in the deviance when important explanatory variables are added to a model, and competing models may be compared via their deviances. When the fitted model is correct, the scaled deviance κ^(−1) D will sometimes have an approximate chi-squared distribution on n − p − 1 degrees of freedom, analogous to the rescaled residual sum of squares in a normal linear model.
Significance tests

Individual coefficients β_r can be tested using studentized estimates, with standard errors estimated using (7.4), with κ replaced by the estimate κ̂ if necessary. The null distributions of these studentized estimates will be approximately standard normal, but the accuracy of this approximation can be open to question. Allowance for estimation of κ can be made by using the t distribution with


n − p − 1 degrees of freedom, as is justifiable for normal-theory linear regression, but in general the accuracy is questionable.

The analogue of analysis of variance is the analysis of deviance, wherein differences of deviances are used to measure effects. To test whether or not a particular subset of covariates has no effect on mean response, we use as test statistic the scaled difference of deviances, D for the full model with p covariates and D_0 for the reduced model with p_0 covariates. If κ is known, then the test statistic is Q = (D_0 − D)/κ. Approximate properties of log likelihood ratios imply that the null distribution of Q is approximately chi-squared, with degrees of freedom equal to p − p_0, the number of covariate terms being tested. If κ is estimated for the full model by κ̂, as in (7.6), then the test statistic is

    Q = (D_0 − D)/κ̂.        (7.8)

In the special case of linear regression, (p − p_0)^(−1) Q is the F statistic, and this motivates the use of the F_{p−p_0, n−p−1} distribution as approximate null distribution for (p − p_0)^(−1) Q here, although this has little theoretical justification.
Residuals

Residuals and other regression diagnostics for linear models may be extended to generalized linear models. The general form of residuals will be a suitably standardized version of d(y, μ̂), where d(Y, μ) matches some notion of random error.

The simplest way to define residuals is to mimic the earlier definitions for linear models, and to take the set of standardized differences, the Pearson residuals, (y_j − μ̂_j)/{c_j κ̂ V(μ̂_j)}^(1/2). Leverage adjustment of these to compensate for estimation of β involves h_j, the jth diagonal element of the hat matrix H in (7.5), and yields standardized Pearson residuals

    r_Pj = (y_j − μ̂_j) / {c_j κ̂ V(μ̂_j)(1 − h_j)}^(1/2),    j = 1, ..., n.        (7.9)

The standardized Pearson residuals are essentially scaled versions of the modified residuals defined in (6.29), except that the denominators of (7.9) may depend on the parameter estimates. In large samples one would expect the r_Pj to have mean and variance approximately zero and one, as they do for linear regression models.

In general the Pearson residuals inherit the skewness of the responses themselves, which can be considerable, and it may be better to standardize a transformed response. One way to do this is to define standardized residuals on the linear predictor scale,

    r_Lj = {g(y_j) − x_j^T β̂} / {c_j κ̂ ġ²(μ̂_j) V(μ̂_j)(1 − h_j)}^(1/2).        (7.10)

For discrete data this definition must be altered if g(y_j) is infinite, as for


example when g(y) = log y and y = 0. For a non-identity link function one should not expect the mean and variance of the r_Lj to be approximately zero and one, unless κ is unusually small; see Example 7.2.

An alternative approach to defining residuals is based on the fact that in a linear model the residual sum of squares equals the sum of squared residuals. This suggests that residuals for generalized linear models can be constructed from the contributions that individual observations make to the deviance. Suppose first that κ is known. Then the scaled deviance can be written as

    κ^(−1) D = Σ_{j=1}^n d_j²,

where d_j = d(y_j, μ̂_j) is the signed square root of the scaled deviance contribution due to the jth case, the sign being that of y_j − μ̂_j. The deviance residual is d_j. Definition (7.7) implies that

    d_j = sign(y_j − μ̂_j) [2{ℓ_j(y_j) − ℓ_j(μ̂_j)}]^(1/2).

When ℓ is the normal log likelihood and κ = σ² is unknown, D is scaled by κ̂ = s² rather than κ before defining the d_j. Similarly for the gamma log likelihood; see Example 7.2. In practice standardized deviance residuals

    r_Dj = d_j / (1 − h_j)^(1/2)        (7.11)

are more commonly used than the unadjusted d_j.

For the linear regression model of Section 6.3, r_Dj is proportional to the modified residual (6.9). For other models the r_Dj can be seriously biased, but once bias-corrected they are typically closer to standard normal than are the r_Pj or r_Lj.

One general point to note about all of these residuals is that they are scaled, implicitly or explicitly, unlike the modified residuals of Chapter 6.
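For concreteness, the following sketch computes the three standardized residuals (7.9)-(7.11) for a Poisson log-linear fit with c_j = 1; the leverages come from the hat matrix (7.5), and the crude fix for g(y) = log y at y = 0 is our own illustrative choice.

    import numpy as np

    def glm_residuals(X, y, mu, kappa):
        """Standardized Pearson (7.9), linear predictor (7.10) and deviance
        (7.11) residuals for a Poisson GLM with log link."""
        W = mu                                   # weights, as in (7.3)
        XW = X * np.sqrt(W)[:, None]
        # leverages h_j: diagonal of H = X (X^T W X)^{-1} X^T W
        h = np.einsum('ij,ij->i', XW @ np.linalg.inv(XW.T @ XW), XW)
        rP = (y - mu) / np.sqrt(kappa * mu * (1 - h))
        # linear predictor scale: g(y) = log y is infinite at y = 0,
        # so replace zero counts by 1/2 for illustration
        gy = np.log(np.maximum(y, 0.5))
        rL = (gy - np.log(mu)) / np.sqrt(kappa * (1 / mu) * (1 - h))
        # signed deviance contributions: d_j^2 = 2{l_j(y_j) - l_j(mu_j)}
        dev = 2 * (np.where(y > 0, y * np.log(y / mu), 0) - (y - mu))
        rD = np.sign(y - mu) * np.sqrt(dev) / np.sqrt(1 - h)
        return rP, rL, rD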
Quasilikelihood estimation

As we have noted before, only the link and variance functions must be specified in order to find estimates β̂ and approximate standard errors. So although (7.2) and (7.6) arise from a parametric model, they are more generally applicable, just as least squares results are applicable beyond the normal-theory linear model. When no response distribution is assumed, the estimates β̂ are referred to as quasilikelihood estimates, and there is an associated theory for such estimates, although this is not of concern here. The most common application is to data with a response in the form of counts or proportions, which are often found to be overdispersed relative to the Poisson or binomial distributions. One approach to modelling such data is to use the variance function appropriate to binomial or Poisson data, but to allow the dispersion parameter κ to be a free parameter, estimated by (7.6). This estimate is then used in calculating standard errors for β̂ and residuals, as indicated above.
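In computational terms quasilikelihood fitting for overdispersed counts requires nothing beyond the sketch above: fit with the Poisson variance function, estimate κ by (7.6), and scale the variance matrix (7.4) accordingly. A fragment, reusing the irls_poisson sketch given earlier:

    import numpy as np

    # quasilikelihood for overdispersed counts: fit with V(mu) = mu,
    # then scale standard errors by sqrt(kappa), with kappa from (7.6)
    beta, mu, kappa = irls_poisson(X, y)
    cov = kappa * np.linalg.inv((X.T * mu) @ X)   # variance matrix (7.4)
    se = np.sqrt(np.diag(cov))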


7.2.3 Sampling plans

Parametric simulation for a generalized linear model involves simulating new sets of data from the fitted parametric model. It has the usual disadvantage of the parametric bootstrap, that datasets generated from a poorly fitting model may not have the statistical properties of the original data. This applies particularly when count data are overdispersed relative to a Poisson or binomial model, unless the overdispersion has been modelled successfully.

Nonparametric simulation requires generating artificial data without assuming that the original data have some particular parametric distribution. A completely nonparametric approach is to resample cases, which applies exactly as described in Section 6.2.4. However, it is important to be clear what a case is in any particular application, because count and proportion data are often aggregated from larger datasets of independent variables.

Provided that the model (7.1) is correct, as would be checked by appropriate diagnostic methods, it makes sense to use the fitted model and generalize the semiparametric approach of resampling errors, as described in Section 6.2.3. We focus now on ways to do this.
Resampling errors

The simplest approach mimics the linear model sampling scheme but allows for the different response variances, just as in Section 6.2.6. So we define simulated responses by

    y*_j = μ̂_j + {c_j κ̂ V(μ̂_j)}^(1/2) ε*_j,    j = 1, ..., n,        (7.12)

where ε*_1, ..., ε*_n is a random sample from the mean-adjusted, standardized Pearson residuals r_Pj − r̄_P, with r_Pj defined at (7.9). Note that for count data we are not assuming κ = 1. This resampling scheme duplicates the method of Section 6.2.6 for linear models, where the link function is the identity.

Because in general there is no explicit function connecting response y_j to random error ε_j, as there is for linear regression models, the resampling scheme (7.12) is not the only approach, and sometimes it is not suitable. One alternative is to use the same idea on the linear predictor scale. That is, we generate bootstrap data by setting

    y*_j = g^(−1)[x_j^T β̂ + ġ(μ̂_j){c_j κ̂ V(μ̂_j)}^(1/2) ε*_j],    j = 1, ..., n,        (7.13)

where g^(−1)(·) is the inverse link function and ε*_1, ..., ε*_n is a bootstrap sample from the residuals r_L1, ..., r_Ln defined at (7.10). Here the residuals should not be mean-adjusted unless g(·) is the identity link, in which case r_Lj = r_Pj and the two schemes (7.12) and (7.13) are the same. (In these first two resampling schemes the scale factor κ̂^(1/2) can be omitted, provided it is omitted both from the residual definition and from the definition of the simulated responses.)

A third approach is to use the deviance residuals as surrogate errors. If the deviance residual d_j is written as d(y_j, μ̂_j), then imagine that corresponding random errors ε_j are defined by ε_j = d(y_j, μ_j). The distribution of these ε_j


is estimated by the EDF of the standardized deviance residuals (7.11). This suggests that we construct a bootstrap sample as follows. Randomly sample ε*_1, ..., ε*_n from r_D1, ..., r_Dn and let y*_1, ..., y*_n be the solutions to

    ε*_j = d(y*_j, μ̂_j),    j = 1, ..., n.        (7.14)

This also gives the method of Section 6.2.3 for linear models, except for the mean adjustment of residuals.

None of these three methods is perfect. One obvious drawback is that they can all give negative or non-integer values of y* when the original data are non-negative integer counts. A simple fix for discrete responses is to round the value of y*_j from (7.12), (7.13), or (7.14) to the nearest appropriate value. For count data this is a non-negative integer, and if the response is a proportion with denominator m, it is a number in the set 0, 1/m, 2/m, ..., 1. However, rounding can appreciably increase the proportion of extreme values of y* for a case whose fitted value is near the end of its range.

A similar difficulty can occur when responses are positive with var(Y) = κμ², as in Example 7.1. The Pearson residuals are κ^(−1/2)(y_j − μ̂_j)/μ̂_j, all necessarily greater than −κ^(−1/2). But the standardized versions r_Pj are not so constrained, so the result y*_j = μ̂_j(1 + κ̂^(1/2) ε*_j) from applying (7.12) can be negative. The obvious fix is to truncate y*_j at zero, but this may distort the distribution of y*, and so is not generally recommended.
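A minimal sketch of scheme (7.12) for count data with c_j = 1 and V(μ) = μ, including the rounding fix just described (function and variable names are our own):

    import numpy as np

    def resample_pearson(mu, rP, kappa, rng):
        """One bootstrap response vector from scheme (7.12) for counts:
        y*_j = mu_j + {kappa V(mu_j)}^{1/2} eps*_j, with V(mu) = mu and
        eps* drawn from the mean-adjusted standardized Pearson residuals."""
        eps = rng.choice(rP - rP.mean(), size=len(mu), replace=True)
        ystar = mu + np.sqrt(kappa * mu) * eps
        # round to the nearest non-negative integer, as in the text
        return np.maximum(np.rint(ystar), 0)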
Example 7.2 (Leukaemia data)  For the data introduced in Example 7.1 the parametric model is gamma with log likelihood contributions

    ℓ_ij(μ_ij) = −κ^(−1){log(μ_ij) + y_ij/μ_ij},

and the regression is additive on the logarithmic scale, log(μ_ij) = β_0i + β_1 x_ij. The deviance for the fitted model is D = 40.32 with 30 degrees of freedom, and equation (7.6) gives κ̂ = 1.09. The deviance residuals are calculated with κ set equal to κ̂,

    d_ij = sign(z_ij − 1) {2κ̂^(−1)(z_ij − 1 − log z_ij)}^(1/2),

where z_ij = y_ij/μ̂_ij. The corresponding standardized values r_D,ij have sample mean and variance respectively −0.37 and 1.15. The Pearson residuals are κ̂^(−1/2)(z_ij − 1).

The z_ij would be approximately a sample from the standard exponential distribution if in fact κ = 1, and the right-hand panel of Figure 7.1 suggests that this is a reasonable assumption.

Our basic parametric model for these data sets κ = 1 and puts Y = με, where ε has an exponential distribution with unit mean. Hence the parametric bootstrap involves simulating exponential data from the fitted model, that is setting y* = μ̂ε*, where ε* is standard exponential. A slightly more cautious



Table 7.2 Lower and
upper limits of 95%
studentized bootstrap
confidence intervals for
A i and 0 i for
leukaemia data, based
on 999 replicates of
different simulation
schemes.

Poi

E xponential
L inear p redictor, r i
D eviance, rp
Cases

Pi

Lower

Upper

Lower

Upper

5.16
3.61
5.00
0.31

11.12
10.58
11.10
8.78

-1.42
-1.53
-1.46
-1.37

-0.04
0.17
0.02
0.81

approach would be to generate gamma data with mean μ̂ and index κ̂^(−1), but we shall not do this here.

For nonparametric simulation, we consider all three schemes described earlier. First, with variance function var(Y) = κμ², the Pearson residuals are κ^(−1/2)(y − μ̂)/μ̂. Resampling Pearson residuals via (7.12) would be equivalent to setting y* = μ̂z*, where z* is sampled at random from the zs (Problem 7.2). However, (7.12) cannot be used with the standardized Pearson residuals r_P, because negative values of y* will occur: the most negative standardized residuals are close to −4. Truncation at zero is not a sufficient remedy for this.

For the second resampling scheme (7.13), the logarithmic link gives y* = μ̂ exp(κ̂^(1/2) ε*), where ε* is randomly sampled from the r_Ls, which here are given by κ̂^(−1/2)(1 − h)^(−1/2) log(z). The sample mean and variance of the r_L are −0.61 and 1.63, in very close agreement with those for the logarithm of a standard exponential variate. It is important that no mean correction be made to the r_L.

To implement the bootstrap for deviance residuals, the scheme (7.14) can be simplified as follows. We solve the equations d(z̃_j, 1) = r_Dj for j = 1, ..., n to obtain z̃_1, ..., z̃_n, and then set y*_j = μ̂_j z̃*_j for j = 1, ..., n, where z̃*_1, ..., z̃*_n is a bootstrap sample from the z̃s (Problem 7.2).
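The equations d(z, 1) = r are monotone in z and so can be solved numerically; a sketch for the gamma deviance residual above, using a standard root-finder (the bracketing interval is our own choice):

    import numpy as np
    from scipy.optimize import brentq

    def solve_deviance(r, kappa):
        """Solve d(z, 1) = r for z, where
        d(z, 1) = sign(z - 1) {2 kappa^{-1} (z - 1 - log z)}^{1/2},
        which is monotone increasing in z."""
        def d(z):
            return np.sign(z - 1) * np.sqrt(2 * (z - 1 - np.log(z)) / kappa)
        return brentq(lambda z: d(z) - r, 1e-12, 1e6)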
Table 7.2 shows 95% studentized bootstrap confidence intervals for β_01 (the intercept for group 1) and β_1 using these schemes with R = 999. The variance estimates used are from (7.4) rather than the nonparametric delta method. The intervals for the three model-based schemes are very similar, while those for resampling cases are rather different, particularly for β_1, for which the bootstrap distribution of the studentized statistic is very non-normal.

Figure 7.2 compares simulated deviances with quantiles of the chi-squared distribution. Naive asymptotics would suggest that the scaled deviance κ̂^(−1) D has approximately a chi-squared distribution on 30 degrees of freedom, but these asymptotics, which apply as κ → 0, are clearly not useful here, even when data are in fact generated from the exponential distribution. The fitted deviance of 40.32 is not extreme, and the variation of the simulated estimates κ̂* is large enough that the observed value κ̂ = 1.09 could easily occur by chance if the data were indeed exponential.

[Figure 7.2  Leukaemia data: chi-squared Q-Q plots of simulated deviances for parametric sampling from the fitted exponential model (left) and case resampling (right).]

Comparison of resampling schemes

To compare the performances of the resampling schemes described above in setting confidence intervals, we conducted a series of Monte Carlo experiments, each based on 1000 sets of data of size n = 15, with linear predictor η = β_0 + β_1 x. In the first experiment, the values of x were generated from a distribution uniform on the interval (0, 1), we took β_0 = β_1 = 4, and responses were generated from the exponential distribution with mean exp(η). Each sample was then bootstrapped 199 times using case resampling and by model-based resampling from the fitted model, with variance function V(μ) = μ², by applying (7.13) and (7.14). For each of these resampling schemes, various confidence intervals were obtained for parameters β_0, β_1, ψ_1 = β_0 β_1 and ψ_2 = β_0/β_1. The confidence intervals used were: the standard interval based on the large-sample normal distribution of the estimate, using the usual rather than a robust standard error; the interval based on a normal approximation with bias and variance estimated from the resamples; the percentile and BC_a intervals; and the basic bootstrap and studentized bootstrap intervals, the latter using nonparametric delta method variance estimates. The first part of Table 7.3 shows the empirical coverages of nominal 90% confidence intervals for these combinations of resampling scheme, method of interval construction, and parameter.

The second experiment used the same design matrix, linear predictor, and model-fitting and resampling schemes as the first, but the data were generated from a lognormal model with mean exp(η) and unit variance on the log scale.

Table 7.3 Empirical coverages (%) for four parameters based on applying various resampling schemes with R = 199 to 1000 samples of size 15 generated from various models. Target coverage is 90%. The first two sets of results are for an exponential model fitted to exponential and lognormal data, and the second two are for a Poisson model fitted to Poisson and negative binomial data. See text for details.

                        Cases                  r_L or r_P                r_D
                β_0  β_1  ψ_1  ψ_2      β_0  β_1  ψ_1  ψ_2      β_0  β_1  ψ_1  ψ_2

    Exponential data
    Standard     85   86   89   85       85   86   89   86       85   86   90   86
    Normal       88   89   92   90       88   89   90   89       87   89   90   89
    Percentile   85   87   83   89       86   89   86   89       86   88   86   89
    BC_a         84   86   82   86       86   88   83   88       86   88   83   88
    Basic        86   88   87   84       86   89   86   83       85   89   87   83
    Student      89   89   86   81       92   92   89   84       92   92   89   84

    Lognormal data
    Standard     79   79   82   81       79   78   82   82       79   78   82   82
    Normal       81   81   84   85       81   80   84   84       82   80   84   84
    Percentile   80   84   73   85       80   82   77   83       80   81   76   82
    BC_a         78   83   72   81       80   80   74   79       79   81   74   80
    Basic        78   78   82   78       81   80   83   80       80   81   84   80
    Student      84   85   82   74       90   88   84   79       90   88   84   79

    Poisson data
    Standard     90   90   91   90       89   90   92   90       89   91   92   91
    Normal       88   88   88   88       87   86   88   88       87   93   97   93
    Percentile   87   87   85   86       89   88   88   88       90   94   97   91
    BC_a         86   86   82   86       88   87   85   87       88   94   96   91
    Basic        87   87   85   87       87   87   88   88       86   92   97   92
    Student      95   90   80   92       90   89   89   89       90   93   92   91

    Negative binomial data
    Standard     69   64   59   70       69   63   59   69       67   64   60   71
    Normal       87   84   86   90       88   84   84   89       87   89   92   94
    Percentile   85   86   84   86       90   86   82   88       90   91   93   91
    BC_a         85   85   80   85       88   83   77   86       87   89   88   89
    Basic        86   84   83   85       88   84   83   87       87   89   91   91
    Student      93   87   82   87       89   89   85   85       89   93   90   85

The third experiment used the same design matrix as the first two, but linear predictor η = β_0 + β_1 x, with β_0 = β_1 = 2 and Poisson responses with mean μ = exp(η). The fourth experiment used the same means as the third, but had negative binomial responses with variance function μ + μ²/10. The bootstrap schemes for these two experiments were case resampling and model-based resampling using (7.12) and (7.14).

Table 7.3 shows that while all the methods tend to undercover, the standard method can be disastrously bad when the random part of the fitted model is incorrect, as in the second and fourth experiments. The studentized method generally does better than the basic method, but the BC_a method does not improve on the percentile intervals. Thus here a more sophisticated method does not necessarily lead to better coverage, unlike in Section 5.7, and in particular there seems to be no reason to use the BC_a method. Use of the studentized interval on another scale might improve its performance for the ratio ψ_2, for which the simpler methods seem best. As far as the resampling schemes are concerned, there seems to be little to choose between the model-based schemes, which improve slightly on bootstrapping cases, even when the fitted variance function is incorrect.

We now consider an important caveat to these general comments.
Inhomogeneous residuals

For some types of data the standardized Pearson residuals may be very inhomogeneous. If y is Poisson with mean μ, for example, the distribution of (y − μ)/μ^(1/2) is strongly positively skewed when μ ≤ 1, but it becomes increasingly symmetric as μ increases. Thus when a set of data contains both large and small counts, it is unwise to treat the r_P as exchangeable. One possibility for such data is to apply (7.12) but with fitted values stratified by the estimated skewness of their residuals; a sketch of one such scheme follows.
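A sketch of a stratified version of (7.12), using quantiles of the fitted values as a rough proxy for residual skewness; the number of strata and other details are our own illustrative choices.

    import numpy as np

    def resample_stratified(mu, rP, kappa, n_strata, rng):
        """Scheme (7.12), but with residuals resampled only within strata
        defined by quantiles of the fitted values mu_j."""
        edges = np.quantile(mu, np.linspace(0, 1, n_strata + 1))
        stratum = np.clip(np.searchsorted(edges, mu, side='right') - 1,
                          0, n_strata - 1)
        ystar = np.empty_like(mu)
        for s in range(n_strata):
            idx = stratum == s
            eps = rng.choice(rP[idx], size=idx.sum(), replace=True)
            ystar[idx] = mu[idx] + np.sqrt(kappa * mu[idx]) * eps
        return np.maximum(np.rint(ystar), 0)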

Example 7.3 (Sugar cane)  Carvão da cana-de-açúcar, coal of sugar cane, is a disease of sugar cane that is common in some areas of Brazil, and its effects on production of the crop have led to a search for resistant varieties of cane. We use data kindly provided by Dr C. G. B. Demétrio of Escola Superior de Agricultura, Universidade de São Paulo, from a randomized block experiment in which the resistance to the disease of 45 varieties of cane was compared in four blocks of 45 plots. Fifty stems from a variety were put in a solution containing the disease agent, and then planted in a plot. After a fixed period, the total number of shoots appearing, m, and the number of diseased shoots, r, were recorded for each plot. Thus the data form a 4 × 45 layout of pairs (m, r). The purpose of analysis was to identify the most resistant varieties, for further investigation.

A simple model is that the number of diseased shoots r_ij for the ith block and jth variety is a binomial random variable with denominator m_ij and probability π_ij. For the generalized linear model formulation, the responses are taken to be y_ij = r_ij/m_ij, so that the mean response μ_ij is equal to the probability π_ij that a shoot is diseased. Because the variance of Y is π(1 − π)/m, the variance function is V(μ) = μ(1 − μ) and the dispersion parameters are φ = 1/m, so that in the two-way version of (7.1), c_ij = 1/m_ij and κ = 1. The probability of disease for the ith block and jth variety is related to the linear predictor η_ij = α_i + β_j through the logit link function η = log{π/(1 − π)}. So the full model for all data is

    E(Y_ij) = μ_ij,    μ_ij = exp(α_i + β_j)/{1 + exp(α_i + β_j)},
    var(Y_ij) = m_ij^(−1) V(μ_ij),    V(μ_ij) = μ_ij(1 − μ_ij).

Interest focuses on the varieties with small values of β_j, which are likely to be the most resistant to the disease.

For an adequate fit, the deviance would roughly be distributed according to a chi-squared distribution on 132 degrees of freedom; in fact it is 1142.8. This indicates severe overdispersion relative to the model.

[Figure 7.3  Model fit for the cane data. The left panel shows the estimated variety effects α̂_1 + β̂_j for block 1: varieties 1 and 3 are least resistant, and 31 is most resistant. The lines show the levels on the logit scale corresponding to π = 0.5, 0.2, 0.05 and 0.01. The right panel shows standardized Pearson residuals r_P plotted against α̂_i + β̂_j; the lines are at 0, ±3.]

The left panel of Figure 7.3 shows estimated variety effects for block 1. Varieties 1 and 3 are least resistant to the disease, while variety 31 is most resistant. The right panel shows the residuals plotted against linear predictors. The skewness of the r_P drops as η̂ increases.

Parametric simulation involves generating binomial observations from the fitted model. This greatly overstates the precision of conclusions, because this model clearly does not reflect the variability of the data. We could instead use the beta-binomial distribution. Suppose that, conditional on π, a response is binomial with denominator m and probability π, but instead of being fixed, π is taken to have a beta distribution. The resulting count has unconditional mean and variance

    mΠ,    mΠ(1 − Π){1 + (m − 1)φ},        (7.15)

where Π = E(π) and φ > 0 controls the degree of overdispersion. Parametric simulation from this model is discussed in Problem 7.5; a sketch follows.
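For reference, a hedged sketch of parametric simulation from (7.15): if the beta distribution for π has parameters (a, b), then Π = a/(a + b) and φ = 1/(a + b + 1), which can be inverted as below.

    import numpy as np

    def rbetabin(m, Pi, phi, rng):
        """Simulate beta-binomial counts with mean m*Pi and variance
        m*Pi*(1 - Pi)*{1 + (m - 1)*phi}, as in (7.15): the beta parameters
        (a, b) satisfy Pi = a/(a + b) and phi = 1/(a + b + 1)."""
        a = Pi * (1 - phi) / phi
        b = (1 - Pi) * (1 - phi) / phi
        p = rng.beta(a, b)
        return rng.binomial(m, p)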
Two variance functions for overdispersed binomial data are V_1(π) = φπ(1 − π), with φ > 1, and V_2(π) = π(1 − π){1 + (m − 1)φ}, with φ > 0. The first of these gives common overdispersion for all the observations, while the second allows proportionately greater spread when m is larger. We use the first, for which φ̂ = 8.3, and perform nonparametric simulation using (7.12). The simulated responses are rounded to the nearest integer in 0, 1, ..., m.

The left panel of Figure 7.4 shows box plots of the ratio of deviance to degrees of freedom for 200 simulations from the binomial model, the beta-binomial model, for nonparametric simulation by (7.12), and for (7.12) but with residuals stratified into groups for the fifteen varieties with the smallest values of μ̂_j, the middle fifteen values of μ̂_j, and the fifteen largest values of μ̂_j.

[Figure 7.4  Resampling results for cane data. The left panel shows (left to right) simulated deviance/degrees of freedom ratios for fitted binomial and beta-binomial models, a nonparametric bootstrap, and a nonparametric bootstrap with residuals stratified by varieties; the dotted line is at the data ratio 8.66 = 1142.8/132. The right panel shows the variety effects in 200 replicates of the stratified nonparametric resampling scheme.]
The dotted line shows the observed ratio. The binomial results are clearly quite inappropriate, those for the beta-binomial and unstratified simulation are better, and those for the stratified simulation are best.

To explain this, we return to the right panel of Figure 7.3. This shows that the residuals are not homogeneous: residuals for observations with small values of η̂ are more positively skewed than those for larger values. This reflects the varying skewness of binomial data, which must be taken into account in the resampling scheme.

The right panel of Figure 7.4 shows the estimated variety effects for the 200 simulations from the stratified simulation. Varieties 1 and 3 are much less resistant than the others, but variety 31 is not much more resistant than 11, 18, and 23; other varieties are close behind. As might be expected, results for the binomial simulation are much less variable. The unstratified resampling scheme gives large negative estimated variety effects, due to inappropriately large negative residuals being used. To explain this, consider the right panel of Figure 7.3. In effect the unstratified scheme allows residuals from the right half of the panel to be sampled and placed at its left-hand end, leading to negative simulated responses that are rounded up to zero: the varieties for which this happens seem spuriously resistant.

Finer stratification of the residuals seems unnecessary for this application.

7.2.4 Prediction

In Section 6.3.3 we showed how to use resampling methods to obtain prediction intervals based on a linear regression fit. The same idea can be applied here. Beyond having a suitable resampling algorithm to produce the appropriate variation in parameter estimates, we must also produce suitable response variation. In the linear model this is provided by the EDF of standardized residuals, which estimates the CDF of homoscedastic errors. Now we need to be able to produce the correct heteroscedasticity.

Suppose that we want to predict the response Y_+ at x_+, with a prediction interval. One possible point prediction is the regression estimate

    ŷ_+ = g^(−1)(x_+^T β̂),

although it would often be wise to make a bias correction. For the prediction interval, let us assume for the moment that some monotone function δ(Y, μ) is homoscedastic, with pth quantile a_p, and that the mean value μ_+ of Y_+ is known. Then the 1 − 2α prediction interval would be the values y_{+,α}, y_{+,1−α}, where y_{+,p} satisfies δ(y_{+,p}, μ_+) = a_p. If μ is estimated by μ̂ independently of Y_+ and if δ(Y_+, μ̂) has known quantiles, then the same method applies. So the appropriate bootstrap method is to estimate quantiles of δ(Y_+, μ̂_+), and then set δ(y, μ̂_+) equal to the estimated α and 1 − α quantiles. The function δ(Y, μ) will correspond to one of the definitions of residuals, and the bootstrap algorithm will use resampling from the corresponding standardized residuals, whose homoscedasticity is critical. The full resampling algorithm, which generalizes Algorithm 6.4, is as follows.

Algorithm 7.1 (Prediction in generalized linear models)

For r = 1, ..., R:

1  create bootstrap sample responses y*_j at x_j by solving

       δ(y*_j, μ̂_j) = ε*_j,    j = 1, ..., n,

   where the ε*_j are randomly sampled from residuals r_1, ..., r_n;

2  fit estimates β̂* and κ̂*, and compute the fitted value μ̂*_{+r} corresponding to the new observation with x = x_+; then

3  for m = 1, ..., M,
   (a) sample δ*_m from r_1, ..., r_n,
   (b) set y*_{+rm} equal to the solution of the equation δ(y, μ̂_+) = δ*_m,
   (c) compute simulated prediction errors d*_{+rm} = δ(y*_{+rm}, μ̂*_{+r}).

Finally, order the RM values d*_{+rm} to give d*_{+(1)} ≤ ··· ≤ d*_{+(RM)}. Then calculate the prediction limits y_{+,α}, y_{+,1−α} as the solutions to

    δ(y_{+,α}, μ̂_+) = d*_{+((RM+1)α)},    δ(y_{+,1−α}, μ̂_+) = d*_{+((RM+1)(1−α))}.
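A compact sketch of Algorithm 7.1 for a Poisson log-linear model with δ(y, μ) = (y − μ)/μ^(1/2), so that steps 1 and 3(b) solve to y = μ + μ^(1/2) ε; for simplicity the residuals here are not leverage-adjusted, and irls_poisson is the fitting sketch given earlier.

    import numpy as np

    def predict_interval(X, y, xplus, R=999, M=1, alpha=0.025, rng=None):
        """Bootstrap prediction limits from Algorithm 7.1 for a Poisson GLM
        with delta(y, mu) = (y - mu)/sqrt(mu)."""
        rng = rng or np.random.default_rng()
        beta, mu, kappa = irls_poisson(X, y)
        r = (y - mu) / np.sqrt(mu)          # Pearson-type residuals
        mu_plus = np.exp(xplus @ beta)
        d = []
        for _ in range(R):
            # step 1: simulated responses, rounded for counts
            ystar = np.maximum(np.rint(mu + np.sqrt(mu) * rng.choice(r, len(y))), 0)
            bstar, _, _ = irls_poisson(X, ystar)        # step 2
            mu_plus_star = np.exp(xplus @ bstar)
            for _ in range(M):                          # step 3
                yplus = mu_plus + np.sqrt(mu_plus) * rng.choice(r)
                d.append((yplus - mu_plus_star) / np.sqrt(mu_plus_star))
        d = np.sort(d)
        lo = d[int((len(d) + 1) * alpha) - 1]
        hi = d[int((len(d) + 1) * (1 - alpha)) - 1]
        # invert delta(y, mu_plus) = d to get the limits for y_plus
        return mu_plus + np.sqrt(mu_plus) * lo, mu_plus + np.sqrt(mu_plus) * hi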

In principle any of the resampling methods in Section 7.2.3 could be used. In practice the homoscedasticity is important, and should be checked.
Example 7.4 (AIDS diagnoses)  Table 7.4 contains the numbers of AIDS reports in England and Wales to the end of 1992. They are cross-classified by diagnosis period and length of reporting delay, in three-month intervals. A blank in the table corresponds to an unknown entry, and ≥ indicates where an entry is a lower bound for the actual value. We shall treat these incomplete data as unknown in our analysis below. The problem was to predict the state of the epidemic at the time from the given data. This depends heavily on the values missing towards the foot of the table.

The data support the assumption that the reporting delay does not depend on the diagnosis period. In this case a simple model is that the number of reports in row j and column k of the table, y_jk, has a Poisson distribution with mean μ_jk = exp(α_j + β_k). If all the cells of the table are regarded as independent, the total diagnoses in period j have a Poisson distribution with mean Σ_k μ_jk = exp(α_j) Σ_k exp(β_k). Hence the eventual total for an incomplete row can be predicted by adding the observed row total and the fitted values for the unobserved part of the row. How accurate is this prediction?

To assess this, we first simulate a complete table of bootstrap data, y*_jk, using the fitted values μ̂_jk = exp(α̂_j + β̂_k) from the original fit. We shall discuss below how to do this; for now simply note that this amounts to steps 1 and 3(b) of Algorithm 7.1. We then fit the two-way layout model to the simulated data, excluding the cells where the original table was incomplete, thereby obtaining parameter estimates α̂* and β̂*. We then calculate

    y*_{+j} = Σ_{k unobs} y*_jk,    μ̂*_{+j} = exp(α̂*_j) Σ_{k unobs} exp(β̂*_k),    j = 1, ..., 38,

where the summation is over the cells of row j for which y_jk was unobserved; this is step 2. Note that y*_{+j} is equivalent to the results of steps 3(a) and 3(b) with M = 1.

We take δ(y, μ) = (y − μ)/μ^(1/2), corresponding to Pearson residuals for the Poisson distribution. This means that step 3(c) involves setting

    d*_{+j} = (y*_{+j} − μ̂*_{+j}) / (μ̂*_{+j})^(1/2).

We repeat this R times, to obtain values d*_{+j(1)} ≤ ··· ≤ d*_{+j(R)} for each j. The final step is to obtain the bootstrap lower and upper limits y*_{+j,α} and y*_{+j,1−α} for y_{+j}, by solving the equations

    (y*_{+j,α} − μ̂_{+j}) / μ̂_{+j}^(1/2) = d*_{+j((R+1)α)},    (y*_{+j,1−α} − μ̂_{+j}) / μ̂_{+j}^(1/2) = d*_{+j((R+1)(1−α))}.
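In outline, one replicate of this calculation might look as follows; fit_poisson_2way stands for whatever routine returns the estimates (α̂*, β̂*) from the completed cells, and is a placeholder rather than a library function.

    import numpy as np

    # one bootstrap replicate of the row-total predictions (sketch):
    # ystar: simulated complete table, shape (38, K)
    # unobs: boolean mask of cells unobserved in the original table
    ystar_obs = np.where(unobs, np.nan, ystar)          # cells used in refit
    alpha_star, beta_star = fit_poisson_2way(ystar_obs)  # step 2 (placeholder)
    mu_star = np.exp(alpha_star)[:, None] * np.exp(beta_star)[None, :]
    yplus_star = np.where(unobs, ystar, 0).sum(axis=1)   # unobserved totals
    muplus_star = np.where(unobs, mu_star, 0).sum(axis=1)
    dplus_star = (yplus_star - muplus_star) / np.sqrt(muplus_star)  # step 3(c)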

Table 7.4 Numbers of AIDS reports in England and Wales to the end of 1992 (De Angelis and Gilks, 1994). A ≥ sign in the body of the table indicates a count incomplete at the end of 1992, and † indicates a reporting delay of less than one month.

    Diagnosis        Reporting delay interval (quarters)                                    Total reports
    period                                                                                  to end 1992
    Year  Qtr   0†    1    2    3   4   5   6   7   8   9  10  11  12  13  ≥14

    1983   3     2    6    0    1   1   0   0   1   0   0   0   0   0   0    1       12
           4     2    7    1    1   1   0   0   0   0   0   0   0   0   0    0       12
    1984   1     4    4    0    1   0   2   0   0   0   0   2   1   0   0    0       14
           2     0   10    0    1   1   0   0   0   1   1   1   0   0   0    0       15
           3     6   17    3    1   1   0   0   0   0   0   0   1   0   0    1       30
           4     5   22    1    5   2   1   0   2   1   0   0   0   0   0    0       39
    1985   1     4   23    4    5   2   1   3   0   1   2   0   0   0   0    2       47
           2    11   11    6    1   1   5   0   1   1   1   1   0   0   0    1       40
           3     9   22    6    2   4   3   3   4   7   1   2   0   0   0    0       63
           4     2   28    8    8   5   2   2   4   3   0   1   1   0   0    1       65
    1986   1     5   26   14    6   9   2   5   5   5   1   2   0   0   0    2       82
           2     7   49   17   11   4   7   5   7   3   1   2   2   0   1    4      120
           3    13   37   21    9   3   5   7   3   1   3   1   0   0   0    6      109
           4    12   53   16   21   2   7   0   7   0   0   0   0   0   1    1      120
    1987   1    21   44   29   11   6   4   2   2   1   0   2   0   2   2    8      134
           2    17   74   13   13   3   5   3   1   2   2   0   0   0   3    5      141
           3    36   58   23   14   7   4   1   2   1   3   0   0   0   3    1      153
           4    28   74   23   11   8   3   3   6   2   5   4   1   1   1    3      173
    1988   1    31   80   16    9   3   2   8   3   1   4   6   2   1   2    6      174
           2    26   99   27    9   8  11   3   4   6   3   5   5   1   1    3      211
           3    31   95   35   13  18   4   6   4   4   3   3   2   0   3    3      224
           4    36   77   20   26  11   3   8   4   8   7   1   0   0   2    2      205
    1989   1    32   92   32   10  12  19  12   4   3   2   0   2   2   0    2      224
           2    15   92   14   27  22  21  12   5   3   0   3   3   0   1    1      219
           3    34  104   29   31  18   8   6   7   3   8   0   2   1   2   ≥0      253
           4    38  101   34   18   9  15   6   1   2   2   2   3   2  ≥0          ≥233
    1990   1    31  124   47   24  11  15   8   6   5   3   3   4  ≥0              ≥281
           2    32  132   36   10   9   7   6   4   4   5   0  ≥0                  ≥245
           3    49  107   51   17  15   8   9   2   1   1  ≥0                      ≥260
           4    44  153   41   16  11   6   5   7   2  ≥0                          ≥285
    1991   1    41  137   29   33   7  11   6   4  ≥3                              ≥271
           2    56  124   39   14  12   7  10  ≥1                                  ≥263
           3    53  175   35   17  13  11  ≥2                                      ≥306
           4    63  135   24   23  12  ≥1                                          ≥258
    1992   1    71  161   48   25  ≥5                                              ≥310
           2    95  178   39   ≥6                                                  ≥318
           3    76  181  ≥16                                                       ≥273
           4    67  ≥66                                                            ≥133

This procedure takes into account two aspects of uncertainty that are important in prediction, namely the inaccuracy of parameter estimates, and the random fluctuations in the unobserved y_jk. The first enters through variation in α̂*_j and β̂*_k from replicate to replicate, and the second enters through the sampling variability of the predictand y*_{+j} over different replicates. The procedure does not allow for a third component of predictive error, due to uncertainty about the form of the model.

The model described above is a generalized linear model with Poisson errors and the log link function. It contains 52 parameters. The deviance of 716.5 on 413 degrees of freedom is strong evidence that the data are overdispersed relative to the Poisson distribution. The estimate of κ is κ̂ = 1.78, and in

[Figure 7.5  Results from the fit of a Poisson two-way layout to the AIDS data. The left panel shows predicted diagnoses (solid), together with the actual totals to the end of 1992 (+). The right panel shows standardized Pearson residuals plotted against estimated skewness, μ̂^(−1/2); the vertical lines are at skewness 0.6 and 1.]
fact a quasilikelihood model in which var(Y) = κμ appears to fit the data; this corresponds to treating the counts in Table 7.4 as independent negative binomial random variables.

The predicted value exp(α̂_j) Σ_k exp(β̂_k) is shown as the solid line in the left panel of Figure 7.5, together with the observed total reports to the end of 1992. The right panel shows the standardized Pearson residuals plotted against the estimated skewness μ̂^(−1/2). The banding of residuals at the right is characteristic of data containing small counts, with the lower band corresponding to zeroes in the original data, the next to ones, and so forth. The distributions of the r_P change markedly, and it would be inappropriate to treat them as a homogeneous group. The same conclusion holds for the standardized deviance residuals, although they are less skewed for larger fitted values. The dotted lines in the figure divide the observations into three strata, within each of which the residuals are more homogeneous. Finer stratification has little effect on the results described below.

One parametric bootstrap involves generating Poisson random variables Y*_jk with means exp(α̂_j + β̂_k). This fails to account for the overdispersion, which can be mimicked by parametric sampling from a fitted negative binomial distribution with the same means and estimated overdispersion.
Nonparametric resampling from standardized Pearson residuals will give overdispersion, but the right panel of Figure 7.5 suggests that the residuals should be stratified. Figure 7.6 shows the ratio of deviances to degrees of freedom for 999 samples taken under these four sampling schemes; the strata used in the lower right panel are shown in Figure 7.5. Parametric simulation from the Poisson model is plainly inappropriate because the data so generated

Figure 7.6 Resampling
results for AIDS data.
The left panels show
deviances/degrees of
freedom ratios for the
four resampling
schemes, with the
observed ratio given as
the vertical dotted line.
The right panel shows
predicted diagnoses
(solid line), with
pointwise 95%
predictive intervals,
based on 999 replicates
of Poisson simulation
(small dashes), of
resampling residuals
(dots), and of stratified
resampling of residuals
(large dashes).

345

Negative binomial

0.0

0.5

1.0

1.5

2.0

2.5

Davtanca/df

Nonparametric

mi
0.0

0.5

1.0

1.5

2.0

2.5

Devianca'df

Table 7.5 Bootstrap 95% prediction intervals for numbers of AIDS cases in England and Wales for the fourth quarters of 1990, 1991, and 1992.

                                   1990          1991          1992
    Poisson                     296   315     294   327     356   537
    Negative binomial           294   318     289   333     317   560
    Nonparametric               294   318     289   333     314   547
    Stratified nonparametric    292   319     288   335     310   571

are much less dispersed than the original data, for which the ratio is 716.5/413. The negative binomial simulation gives more appropriate results, which seem rather similar to those for nonparametric simulation without stratification. When stratification is used, the results mimic the overdispersion much better.

The pointwise 95% prediction intervals for the numbers of AIDS diagnoses are shown in the right panel of Figure 7.6. The intervals for simulation from the fitted Poisson model are considerably narrower than the intervals from resampling residuals, both of which are similar. The intervals for the last quarters of 1990, 1991, and 1992 are given in Table 7.5.

There is little change if intervals are based on the deviance residual formula for the Poisson distribution, δ(y, μ) = sign(y − μ)[2{y log(y/μ) + μ − y}]^(1/2).

A serious drawback with this analysis is that predictions from the two-way layout model are very sensitive to the last few rows of the table, to the extent that the estimate of α for the last row is determined entirely by the bottom left cell. Some sort of temporal smoothing is preferable, and we reconsider these data in Example 7.12.

7.3 Survival Data

Section 3.5 describes resampling methods for a single homogeneous sample of data subject to censoring. In this section we turn to problems where survival is affected by explanatory variables.

Suppose that the data (Y, D, x) on an individual consist of: a survival time Y; an indicator of censoring, D, that equals one if Y is observed and zero if Y is right-censored; and a covariate vector x. Under random censorship the observed value of Y is supposed to be min(Y⁰, C), where C is a censoring variable with distribution G, and the true failure time Y⁰ is a variable whose distribution F(y; β, x) depends on the covariates x through a vector of parameters, β. More generally we might suppose that Y⁰ and C are conditionally independent given x, and that C has distribution G(c; γ, x). In either case, the value of C is supposed to be uninformative about the parameter β.

Parametric model

In a parametric model F is fully specified once β has been chosen. So if the data consist of measurements (y_1, d_1, x_1), ..., (y_n, d_n, x_n) on independent individuals, we suppose that β is estimated, often by the maximum likelihood estimator β̂. Parametric simulation is performed by generating values Y_j⁰* from the fitted distributions F(y; β̂, x_j) and generating appropriate censoring times C*_j, setting Y*_j = min(Y_j⁰*, C*_j), and letting D*_j indicate the event Y_j⁰* ≤ C*_j. The censoring variables may be generated according to any one of the schemes outlined in Section 3.5, or otherwise if appropriate.

Example 7.5 (PET film data)  Table 7.6 contains data from an accelerated life test on PET film in gas insulated transformers; the film is used in electrical insulation. There are failure times y at each of four different voltages x. Three failure times are right-censored at voltage x = 5: according to the data source they were subject to censoring at a pre-determined time, but their values make it more likely that they were censored after a pre-determined number of failures, and we shall assume this in what follows.

The Weibull distribution is often used for such data. In this case plots suggest that both of its parameters depend on the voltage applied, and that there is an unknown threshold voltage x_0 below which failure cannot occur. Our model is that the distribution function for Y at voltage x is given by

    F(y; β, x) = 1 − exp{−(y/λ)^κ},    y > 0,        (7.16)

with

    λ = exp{β_0 − β_1 log(x − 5 + e^(β_4))},    κ = exp(β_2 − β_3 log x).

Table 7.6 Failure times (hours) from an accelerated life test on PET film in SF6 gas insulated transformers (Hirose, 1993). > indicates right-censoring.

    Voltage (kV)   Failure times
    5       7131   8482   8559   8762   9026   9034   9104   >9104.25   >9104.25   >9104.25
    7       50.25  87.75  87.76  87.77  92.90  92.91  95.96  108.30  108.30  117.90  123.90  124.30  129.70  135.60  135.60
    10      15.17  19.87  20.18  21.50  21.88  22.23  23.02  23.90   28.17   29.70
    15      2.40   2.42   3.17   3.75   4.65   4.95   6.23   6.68    7.30

This parametrization is chosen so that the range of each parameter is unbounded; note that x_0 = 5 − e^(β_4).

The upper panels of Figure 7.7 show the fit of this model when the parameters are estimated by maximizing the log likelihood ℓ. The left panel shows Q-Q plots for each of the voltages, and the right panel shows the fitted mean failure time and estimated threshold x̂_0. The fit seems broadly adequate.

We simulate replicate datasets by generating observations from the Weibull model obtained by substituting the MLEs into (7.16). In order to apply our assumed censoring mechanism, we sort the observations simulated with x = 5 to get y*_(1) ≤ ··· ≤ y*_(10), say, and then set y*_(8), y*_(9), and y*_(10) equal to y*_(7) + 0.25. We give these three observations censoring indicators d* = 0, so that they are treated as censored, treat all the other observations as uncensored, and fit the Weibull model to the resulting data.
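A sketch of the censoring step for one replicate at x = 5, with lam and kap denoting the fitted Weibull scale and shape at that voltage (ten specimens, the last three censored at the seventh simulated failure time plus 0.25 hours):

    import numpy as np

    def simulate_x5(lam, kap, rng):
        """Simulate the ten x = 5 failure times from the fitted Weibull model
        and impose censoring after the seventh failure, as in the text."""
        y = np.sort(lam * rng.weibull(kap, size=10))   # y_(1) <= ... <= y_(10)
        c = y[6] + 0.25                                # seventh failure + 0.25
        d = np.ones(10, dtype=int)
        y[7:], d[7:] = c, 0                            # three censored cases
        return y, d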
For sake of illustration, suppose that interest focuses on the mean failure time θ when x = 4.9. To facilitate this we reparametrize the model to have parameters θ and β = (β_1, ..., β_4), where θ = 10^(−3) λ Γ(1 + 1/κ), with x = 4.9; here Γ(ν) is the gamma function ∫₀^∞ u^(ν−1) e^(−u) du. The lower left panel of Figure 7.7 shows the profile log likelihood for θ, that is,

    ℓ_prof(θ) = max_β ℓ(θ, β);

in the figure we renormalize the log likelihood to have maximum zero. Under the standard large-sample likelihood asymptotics outlined in Section 5.2.1, the approximate distribution of the likelihood ratio statistic

    W(θ) = 2{ℓ_prof(θ̂) − ℓ_prof(θ)}

is chi-squared with one degree of freedom, so a 1 − α confidence set for the true θ is the set such that

    ℓ_prof(θ) ≥ ℓ_prof(θ̂) − ½ c_{1,1−α},

where c_{ν,p} is the p quantile of the chi-squared distribution with ν degrees of freedom,


and θ̂ is the overall MLE. For these data θ̂ = 24.85 and the 95% confidence interval is [19.75, 35.53]; the confidence set contains values of θ for which ℓ_prof(θ) exceeds the dotted line in the bottom left panel of Figure 7.7.

The use of the chi-squared quantile to set the confidence interval presupposes that the sample is large enough for the likelihood asymptotics to apply, and this can be checked by the parametric simulation outlined above. The lower right panel of the figure is a Q-Q plot of likelihood ratio statistics w*(θ̂) = 2{ℓ*_prof(θ̂*) − ℓ*_prof(θ̂)} based on 999 sets of data simulated from the fitted model. The distribution of the w*(θ̂) is close to chi-squared, but with

[Figure 7.7  PET reliability data analysis. Top left panel: Q-Q plot of log failure times against quantiles of log Weibull distribution, with fitted model given by dotted lines, and censored data by o. Top right panel: fitted mean failure time as a function of voltage x; the dotted line shows the estimated voltage x_0 below which failure is impossible. Lower left panel: normalized profile log likelihood for mean failure time θ at x = 4.9; the dotted line shows the 95% confidence interval for θ using the asymptotic chi-squared distribution, and the dashed line shows the 95% confidence interval using bootstrap calibration of the likelihood ratio statistic. Lower right panel: chi-squared Q-Q plot for simulated likelihood ratio statistic, with dotted line showing its large-sample distribution.]

Table 7.7 Comparison of estimated biases and standard errors of maximum likelihood estimates for the PET reliability data, using standard first-order likelihood theory, parametric bootstrap simulation, and model-based nonparametric resampling.

                           Likelihood          Parametric          Nonparametric
    Parameter    MLE      Bias     SE        Bias      SE        Bias      SE
    β_0         6.346       0     0.117      0.007    0.117      0.001    0.112
    β_1         1.958       0     0.082      0.007    0.082      0.006    0.080
    β_2         4.383       0     0.850      0.127    0.874      0.109    0.871
    β_3         1.235       0     0.388      0.022    0.393      0.022    0.393
    x_0         4.758       0     0.029     -0.004    0.030     -0.002    0.028

mean 1.12, and their 0.95 quantile is w*_(950) = 4.09, to be compared with c_{1,0.95} = 3.84. This gives as bootstrap calibrated 95% confidence interval the set of θ such that ℓ_prof(θ) ≥ ℓ_prof(θ̂) − ½ × 4.09, that is [19.62, 36.12], which is slightly wider than the standard interval.

Table 7.7 compares the bias estimates and standard errors for the model parameters using the parametric bootstrap described above and standard first-order likelihood theory, under which the estimated biases are zero and the variance estimates are obtained as the diagonal elements of the inverse observed information matrix (−ℓ̈)^(−1) evaluated at the MLEs, where ℓ̈ is the matrix of second derivatives of ℓ with respect to θ and β. The estimated biases are small but significantly different from zero. The largest differences between the standard theory and the bootstrap results are for β_2 and β_3, for which the biases are of order 2-3%. The threshold parameter x_0 is well determined; the standard 95% confidence interval based on its asymptotic normal distribution is [4.701, 4.815], whereas the normal interval with estimated bias and variance is [4.703, 4.820].

A model-based nonparametric bootstrap may be performed by using residuals e = (y/λ̂)^κ̂, three of which are censored, then resampling errors ε* from their product-limit estimate, and then making uncensored bootstrap observations λ̂ ε*^(1/κ̂). The observations with x = 5 are then modified as outlined above, and the model refitted to the resulting data. The product-limit estimate for the residuals is very close to the survivor function of the standard exponential distribution, so we expect this to give results similar to the parametric simulation, and this is what we see in Table 7.7.

For censoring at a pre-determined time c, the simulation algorithms would work as described above, except that values of y* greater than c would be replaced by c and the corresponding censoring indicators d* set equal to zero. The number of censored observations in each simulated dataset would then be random; see Practical 7.3.
Plots show that the simulated MLEs are close to normally distributed: in this case standard likelihood theory works well enough to give good confidence intervals for the parameters. The benefit of parametric simulation is that the bootstrap estimates give empirical evidence that the standard theory can be trusted, while providing alternative methods for calculating measures of uncertainty if the standard theory is unreliable. It is typical of first-order likelihood methods that the variability of likelihood quantities is underestimated, although here the effect is small enough to be unimportant.

Proportional hazards model

If it can be assumed that the explanatory variables act multiplicatively on the hazard function, an elegant and powerful approach to survival data analysis is possible. Under the usual form of proportional hazards model the hazard function for an individual with covariates x is dΛ(y) = exp(x^T β) dΛ⁰(y), where dΛ⁰(y) is the baseline hazard function that would apply to an individual with a fixed value of x, often x = 0. The corresponding cumulative hazard and survivor functions are

    Λ(y) = ∫₀^y exp(x^T β) dΛ⁰(u),    1 − F(y; β, x) = {1 − F⁰(y)}^exp(x^T β),

where 1 − F⁰(y) is the baseline survivor function for the hazard dΛ⁰(y).

The regression parameters β are usually estimated by maximizing the partial likelihood, which is the product over cases with d_j = 1 of terms

    exp(x_j^T β) / Σ_{k=1}^n H(y_k − y_j) exp(x_k^T β),        (7.17)

where H(u) equals zero if u < 0 and equals one otherwise. Since (7.17) is unaltered by recentring the x_j, we shall assume below that Σ x_j = 0; the baseline hazard then corresponds to the average covariate value x̄ = 0.

In terms of the estimated regression parameters the baseline cumulative hazard function is estimated by the Breslow estimator

    Λ̂⁰(y) = Σ_{j: y_j ≤ y} d_j / {Σ_{k=1}^n H(y_k − y_j) exp(x_k^T β̂)},        (7.18)

a non-decreasing function that jumps at y_j by

    dΛ̂⁰(y_j) = d_j / Σ_{k=1}^n H(y_k − y_j) exp(x_k^T β̂).

One standard estimator of the baseline survivor function is

    1 − F̂⁰(y) = Π_{j: y_j ≤ y} {1 − dΛ̂⁰(y_j)},        (7.19)

which generalizes the product-limit estimate (3.9), although other estimators also exist. Whichever of them is used, the proportional hazards assumption implies that

    {1 − F̂⁰(y)}^exp(x_j^T β̂)

will be the estimated survivor function for an individual with covariate values x_j.

Under the random censorship model, the survivor function of the censoring distribution G is given by (3.11).
The bootstrap methods for censored data outlined in Section 3.5 extend straightforwardly to this setting. For example, if the censoring distribution is independent of the covariates, we generate a single sample under the conditional sampling plan according to the following algorithm.
Algorithm 7.2 (Conditional resampling for censored survival data)
For j = 1, ..., n,
1  generate Y_j^{0*} from the estimated failure time survivor function {1 − F̂₀(y)}^exp(x_j^T β̂);
2  if d_j = 0, set C_j* = y_j, and if d_j = 1, generate C_j* from the conditional censoring distribution given that C_j > y_j, namely {Ĝ(y) − Ĝ(y_j)}/{1 − Ĝ(y_j)}; then
3  set Y_j* = min(Y_j^{0*}, C_j*), with D_j* = 1 if Y_j* = Y_j^{0*} and zero otherwise.

Under the more general model where the distribution G of C also depends upon the covariates and a proportional hazards assumption is appropriate for G, the estimated censoring survivor function when the covariate is x is

    1 − Ĝ(y; γ̂, x) = {1 − Ĝ₀(y)}^exp(x^T γ̂),

where Ĝ₀(y) is the estimated baseline censoring distribution given by the analogues of (7.18) and (7.19), in which 1 − d_j and γ̂ replace d_j and β̂. Under model-based resampling, a bootstrap dataset is then obtained by
Algorithm 7.3 (Resampling for censored survival data)
For j = 1, ..., n,
1  generate Y_j^{0*} from the estimated failure time survivor function {1 − F̂₀(y)}^exp(x_j^T β̂), and independently generate C_j* from the estimated censoring survivor function {1 − Ĝ₀(y)}^exp(x_j^T γ̂); then
2  set Y_j* = min(Y_j^{0*}, C_j*), with D_j* = 1 if Y_j* = Y_j^{0*} and zero otherwise.

The next example illustrates the use of these algorithms.
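A rough implementation of Algorithm 7.3 is sketched below, again assuming the survival library; the inversion routine and all names (sample.surv, sim.cens, fit.y, fit.c and so on) are illustrative rather than taken from any package, and the linear predictors eta.y, eta.c are assumed centred to match the baseline fits.

# invert an estimated step survivor function raised to the power exp(eta):
# draw U ~ U(0,1) and return the first time t with S0(t)^exp(eta) <= U
sample.surv <- function(time, surv, eta) {
  s <- surv^exp(eta)
  u <- runif(1)
  i <- which(s <= u)
  if (length(i) > 0) time[min(i)] else max(time)   # in the tail, return the largest time
}

sim.cens <- function(fit.y, fit.c, eta.y, eta.c) {
  # fit.y, fit.c: survfit objects for the failure and censoring baselines
  n <- length(eta.y)
  y0 <- sapply(1:n, function(j) sample.surv(fit.y$time, fit.y$surv, eta.y[j]))
  cc <- sapply(1:n, function(j) sample.surv(fit.c$time, fit.c$surv, eta.c[j]))
  data.frame(y = pmin(y0, cc), d = as.numeric(y0 <= cc))   # steps 1 and 2
}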


Example 7.6 (Melanoma data)  To illustrate these ideas, we consider data on the survival of patients with malignant melanoma, whose tumours were removed by operation at the Department of Plastic Surgery, University Hospital of Odense, Denmark. Operations took place from 1962 to 1977, and patients were followed to the end of 1977. Each tumour was completely removed, together with about 2.5 cm of the skin around it. The following variables were available for 205 patients: time in days since the operation, possibly censored; status at the end of the study (alive, dead from melanoma, dead from other causes); sex; age; year of operation; tumour thickness in mm; and an indicator of whether or not the tumour was ulcerated. Ulceration and tumour thickness are important prognostic variables: to have a thick or ulcerated tumour substantially increases the chance of death from melanoma, and we shall investigate how they affect survival. We assume that censoring occurs at random.
We fit a proportional hazards model under the assumption that the baseline hazards are different for the ulcerated group of 90 individuals and the non-ulcerated group, but that there is a common effect of tumour thickness. For a flexible assessment of how thickness affects the hazard function, we fit a natural spline with four degrees of freedom; its knots are placed at the empirical 0.25, 0.5 and 0.75 quantiles of the tumour thicknesses. Thus our model is that the survivor functions for the ulcerated and non-ulcerated groups are

    1 − F₁(y; β, x) = {1 − F₀₁(y)}^exp(x^T β),    1 − F₂(y; β, x) = {1 − F₀₂(y)}^exp(x^T β),

where x has dimension four and corresponds to the spline, β is common to the groups, but the baseline survivor functions 1 − F₀₁(y) and 1 − F₀₂(y) may differ. For illustration we take the fitted censoring distribution to be the product-limit estimate obtained by setting censoring indicators d′ = 1 − d and fitting a model with no covariates, so Ĝ is just the product-limit estimate of the censoring time distribution. The left panel of Figure 7.8 shows the estimated survivor functions 1 − F̂₀₁(y) and 1 − F̂₀₂(y); there is a strong effect of ulceration. The right panel shows how the linear predictor x^T β̂ depends on tumour thickness: from 0–3 mm the effect on the baseline hazard changes from about exp(−1) = 0.37 to about exp(0.6) = 1.8, followed by a slight dip and a gradual upward increase to a risk of about exp(1.2) = 3.3 for a tumour 15 mm thick. Thus the hazard increases by a factor of about 10, but most of the increase takes place from 0–3 mm. However, there are too few individuals with tumours more than 10 mm thick for reliable inferences at the right of the panel.
The top left panel of Figure 7.9 shows the original fitted linear predictor, together with 19 replicates obtained by resampling cases, stratified by ulceration. The lighter solid lines in the panel below are pointwise 95% confidence limits, based on R = 999 replicates of this sampling scheme. In effect these are percentile method confidence limits for the linear predictor at each thickness.

Figure 7.8  Fit of a proportional hazards model for ulcer histology and survival of patients with malignant melanoma (Andersen et al., 1993, pp. 709–714). Left panel: estimated baseline survivor functions for cases with ulcerated (dots) and non-ulcerated (solid) tumours, plotted against time (days). Right panel: fitted linear predictor x^T β̂ for risk as a function of tumour thickness (mm). The lower rug is for non-ulcerated patients, and the upper rug for ulcerated patients.

The sharp increase in risk for small thicknesses is clearly a genuine effect, while beyond 3 mm the confidence interval for the linear predictor is roughly [0, 1], with thickness having little or no effect.
Results from model-based resampling using the fitted model and applying Algorithm 7.3, and from conditional resampling using Algorithm 7.2, are also shown; they are very similar to the results from resampling cases. In view of the discussion in Section 3.5, we did not apply the weird bootstrap.
The right panels of Figure 7.9 show how the estimated 0.2 quantile of the survival distribution, ŷ₀.₂ = min{y : F̂₁(y; β̂, x) ≥ 0.2}, depends on tumour thickness. There is an initial sharp decrease from 3000 days to about 750 days as tumour thickness increases from 0–3 mm, but the estimate is roughly constant from then on. The individual estimates are highly variable, but the degree of uncertainty mirrors roughly that in the left panels. Once again results for the three resampling schemes are very similar.
Unlike the previous example, where resampling and standard likelihood methods led to similar conclusions, this example shows the usefulness of resampling when standard approaches would be difficult or impossible to apply.

7.4 Other Nonlinear Models

A nonlinear regression model with independent additive errors is of form

    y_j = μ(x_j, β) + ε_j,    j = 1, ..., n,        (7.20)


Figure 7.9  Bootstrap results for melanoma data analysis. Top left: fitted linear predictor (heavy solid) and 19 replicates from case resampling (solid); the rug shows observed thicknesses. Top right: estimated 0.2 quantile of survivor distribution as a function of tumour thickness, for an individual with an ulcerated tumour (heavy solid), and 19 replicates for case resampling (solid); the rug shows observed thicknesses. Bottom left: pointwise 95% percentile confidence limits for linear predictor, from case (solid), model-based (dots), and conditional (dashes) resampling. Bottom right: pointwise 95% percentile confidence limits for 0.2 quantile of survivor distribution, from case (solid), model-based (dots), and conditional (dashes) resampling, R = 999. In all panels the x axis is tumour thickness (mm).

with μ(x, β) nonlinear in the parameter β, which may be vector or scalar. The linear algebra associated with least squares estimates for linear regression no longer applies exactly. However, least squares theory can be developed by linear approximation, and the least squares estimate β̂ can often be computed accurately by iterative linear fitting.
The linear approximation to (7.20), obtained by Taylor series expansion, gives

    y_j − μ(x_j, β′) ≈ u_j^T (β − β′) + ε_j,    j = 1, ..., n,        (7.21)


where

    u_j = ∂μ(x_j, β)/∂β evaluated at β = β′.

This defines an iteration that starts at β′ using a linear regression least squares fit, and at the final iteration β′ = β̂. At that stage the left-hand side of (7.21) is simply the residual e_j = y_j − μ(x_j, β̂). Approximate leverage values and other diagnostics are obtained from the linear approximation, that is using the definitions in previous sections but with the u_j s evaluated at β′ = β̂ as the values of explanatory variable vectors. This use of the linear approximation can give misleading results, depending upon the intrinsic curvature of the regression surface. In particular, the residuals will no longer have zero expectation in general, and standardized residuals r_j will no longer have constant variance under homoscedasticity of true errors.
The usual normal approximation for the distribution of β̂ is also based on the linear approximation. For the approximate variance, (6.24) applies with X replaced by U = (u₁, ..., u_n)^T evaluated at β̂. So with s² equal to the residual mean square, we have

    β̂ − β ≈ N(0, s² (U^T U)^{-1}).        (7.22)

The accuracy of this approximation will depend upon two types of curvature effects, called parameter effects and intrinsic effects. The first of these is specific to the parametrization used in expressing μ(x, β), and can be reduced by careful choice of parametrization. Of course resampling methods will be the more useful the larger are the curvature effects, and the worse the normal approximation.
Resampling methods apply here just as with linear regression, either simulating data from the fitted model with resampled modified residuals or by resampling cases. For the first of these it will generally be necessary to make a mean adjustment to whatever residuals are being used as the error population. It would also be generally advisable to correct the raw residuals for bias due to nonlinearity: we do not show how to do this here.
Example 7.7 (Calcium uptake data)  The data plotted in Figure 7.10 show the calcium uptake of cells, y, as a function of time x after being suspended in a solution of radioactive calcium. Also shown is the fitted curve

    μ(x, β̂) = β̂₀ {1 − exp(−β̂₁ x)}.

The least squares estimates are β̂₀ = 4.31 and β̂₁ = 0.209, and the estimate of σ is 0.55 with 25 degrees of freedom. The standard errors for β̂₀ and β̂₁ based on (7.22) are 0.30 and 0.039.


Figure 7.10  Calcium uptake data and fitted curve (left panel), with raw residuals (right panel) (Rawlings, 1988, p. 403). The x axis in both panels is time (minutes).

Table 7.8  Results from R = 999 replicates of stratified case resampling for nonlinear regression model fitted to calcium data.

         Estimate   Bootstrap bias   Theoretical SE   Bootstrap SE
  β₀     4.31       0.028            0.30             0.38
  β₁     0.209      0.004            0.039            0.040

The right panel of Figure 7.10 shows that homogeneity of variance is slightly questionable here, so we resample cases by stratified sampling. Estimated biases and standard errors for β̂₀ and β̂₁ based on 999 bootstrap replicates are given in Table 7.8. The main point to notice is the appreciable difference between theoretical and bootstrap standard errors for β̂₀.
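A sketch of this stratified case resampling is given below, assuming the data are available as the calcium dataframe of the boot library (columns time and cal), with three replicates at each of nine times defining the strata; in practice refits that fail to converge would need to be trapped.

library(boot)
cal.fun <- function(data, i) {
  d <- data[i, ]
  coef(nls(cal ~ b0 * (1 - exp(-b1 * time)), data = d,
           start = c(b0 = 4.3, b1 = 0.2)))   # starting values from the original fit
}
# strata of size three keep the design balanced across the nine times
cal.boot <- boot(calcium, cal.fun, R = 999, strata = rep(1:9, each = 3))
cal.boot    # bootstrap biases and standard errors for beta0, beta1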
Figure 7.11 illustrates the results. Note the non-elliptical pattern of variation and the non-normality: the z-statistics are also quite non-normal. In this case the bootstrap should give better results for confidence intervals than normal approximations, especially for β₀. The bottom right panel suggests that the parameter estimates are closer to normal on logarithmic scales.
Results for model-based resampling assuming homoscedastic errors are fairly similar, although the standard error for β̂₀ is then 0.32. The effects of nonlinearity are negligible in this case: for example, the maximum absolute bias of residuals is about 0.012σ.
Suppose that we want confidence limits on some aspect of the curve, such as the proportion of maximum π = 1 − exp(−β₁ x).


Figure 7.11  Parameter estimates for case resampling of calcium data, with R = 999. The upper panels show normal plots of β̂₀* and β̂₁*, while the lower panels show their joint distributions on the original (left) and logarithmic (right) scales.

Ordinarily one might approach this by applying the delta method together with the bivariate normal approximation for least squares estimates, but the bootstrap can deal with this using only the simulated parameter estimates. So consider the times x = 1, 5, 15, at which the estimates π̂ = 1 − exp(−β̂₁ x) are 0.188, 0.647 and 0.956 respectively. The top panel of Figure 7.12 shows bootstrap distributions of π* = 1 − exp(−β₁* x): note the strong non-normality at x = 15.
The constraint that π must lie in the interval (0, 1) means that it is unwise to construct basic or studentized confidence intervals for π itself. For example, the basic bootstrap 95% interval for π at x = 15 is [0.922, 1.025]. The solution is to do all the calculations on the logit scale, as shown in the lower panel of Figure 7.12, and untransform the limits obtained at the end. That is, we obtain


intervals [η₁, η₂] for η = log{π/(1 − π)}, and then take

    [ exp(η₁)/{1 + exp(η₁)},  exp(η₂)/{1 + exp(η₂)} ]

as the corresponding intervals for π. The resulting 95% intervals are [0.13, 0.26] at x = 1, [0.48, 0.76] at x = 5, and [0.83, 0.98] at x = 15. The standard linear theory gives slightly different values, e.g. [0.10, 0.27] at x = 1 and [0.83, 1.03] at x = 15.
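The calculation is easily done from the simulated estimates; the sketch below assumes the cal.boot object of the earlier sketch, whose second component is β₁*.

x0 <- 15
b1 <- cal.boot$t0[2]                       # original estimate of beta1
pi.hat <- 1 - exp(-b1 * x0)
pi.star <- 1 - exp(-cal.boot$t[, 2] * x0)
eta.hat <- log(pi.hat/(1 - pi.hat))        # logit transform
eta.star <- log(pi.star/(1 - pi.star))
# basic bootstrap 95% limits for eta, then invert the logit
eta.lim <- 2 * eta.hat - quantile(eta.star, c(0.975, 0.025))
exp(eta.lim)/(1 + exp(eta.lim))            # corresponding interval for pi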

7.5 Misclassification Error

The discussion of aggregate prediction error in Section 6.4.1 was expressed in a general notation that would apply also to the regression models described in this chapter, with appropriate definitions of prediction rule ŷ+ = μ(x+, F̂) for a response y+ at covariate values x+, and measure of accuracy c(y+, ŷ+). The general conclusions of Section 6.4.1 concerning bootstrap and cross-validation estimates of aggregate prediction error should apply here also. In particular the adjusted K-fold cross-validation estimate and the 0.632 bootstrap estimate should be preferred in most situations.

Figure 7.12  Calcium uptake data: bootstrap histograms for estimated proportion of maximum π = 1 − exp(−β₁ x) at x = 1, 5 and 15, based on R = 999 resamples of cases.

One type of problem that deserves special attention, in part because it differs most from the examples of Section 6.4.1, is the estimation of prediction error for binary responses, supposing these to be modelled by a generalized linear model of the sort discussed in Section 7.2. If the binary response corresponds to a classification indicator, then prediction of response y+ for an individual with covariate vector x+ is equivalent to classification of that individual, and incorrect prediction (ŷ+ ≠ y+) is a misclassification error.
Suppose, then, that the response y is 0 or 1, and that the prediction rule μ(x+, F̂) is an estimate of Pr(Y+ = 1 | x+) for a new case (x+, y+). We imagine that this estimated probability is translated into a prediction of y+, or equivalently a classification of the individual with covariate x+. For simplicity we set ŷ+ = 1 if μ(x+, F̂) > ½ and ŷ+ = 0 otherwise; this would be modified if incidence rates for the two classes differed. If costs of both types of misclassification error are equal, as we shall assume, then it is enough to set

    c(y+, ŷ+) = 0 if y+ = ŷ+, and 1 otherwise.        (7.23)

The aggregate prediction error D is simply the overall misclassification rate, equal to the proportion of cases where y+ is wrongly predicted.
The special feature of this problem is that the prediction and the measure of error are not continuous functions of the data. According to the discussion in Section 6.4.1 we should then expect bootstrap methods for estimating D or its expected value Δ to be superior to cross-validation estimates, in terms of variability. Also leave-one-out cross-validation is no longer attractive on computational grounds, because we now have to refit the model for each resample.
resample.
Example 7.8 (Urine data)  For an example of the estimation of misclassification error, we take binary data on the presence of calcium oxalate crystals in 79 samples of urine. Explanatory variables are specific gravity, i.e. the density of urine relative to water; pH; osmolarity (mOsm); conductivity (mMho); urea concentration (millimoles per litre); and calcium concentration (millimoles per litre). After dropping two incomplete cases, 77 remain.
Consider how well the presence of crystals can be predicted from the explanatory variables. Analysis of deviance for binary logistic regression suggests the model which includes the p = 4 covariates specific gravity, conductivity, log calcium concentration, and log urine density, and we base our predictions on this model. The simplest estimate of the expected aggregate prediction error Δ is the average number of misclassifications, Δ̂_app = n⁻¹ Σ c(y_j, ŷ_j), with c(·, ·) given by (7.23); it would be equivalent to use instead c(y+, μ̂+) equal to 1 if |y+ − μ(x+, F̂)| > ½ and zero otherwise.
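For concreteness, the apparent error calculation might look as follows, assuming the data are available as the urine dataframe of the boot library; the covariates written in the formula are illustrative only.

library(boot)
ur <- na.omit(urine)                        # drop the two incomplete cases
ur.glm <- glm(r ~ gravity + cond + log(calc) + log(urea),
              family = binomial, data = ur)
y.hat <- as.numeric(fitted(ur.glm) > 0.5)   # classify by mu.hat > 1/2
app <- mean(y.hat != ur$r)                  # apparent error rate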


Table 7.9  Estimates of aggregate prediction error (×10⁻²), or misclassification rate, for urine data (Andrews and Herzberg, 1985, pp. 249–251); adjusted cross-validation estimates are in parentheses.

  K-fold (adjusted) cross-validation                              Bootstrap   0.632
  K = 77   K = 38        K = 10        K = 5         K = 2
  23.4     23.4 (23.7)   20.8 (21.0)   26.0 (25.4)   20.8 (20.8)   24.7        22.1

Figure 7.13  Components of 0.632 estimate of prediction error, y_j − μ(x_j; F̂*), for urine data based on 200 bootstrap simulations, plotted against case j ordered by residual. Values within the dotted lines make no contribution to prediction error. The components from cases 54 and 66 are the rightmost and the fourth from rightmost sets of errors shown; the components from case 27 are leftmost.

In this case Δ̂_app = 20.8 × 10⁻². Other estimates of aggregate prediction error are given in Table 7.9. For the bootstrap and 0.632 estimates, we used R = 200 bootstrap resamples.
The discontinuous nature of prediction error gives more variable results than for the examples with squared error in Section 6.4.1. In particular the results for K-fold cross-validation now depend more critically on which observations fall into the groups. For example, the average and standard deviation of Δ̂_CV,K for 40 repeats were 23.0 × 10⁻² and 2.0 × 10⁻². However, the broad pattern is similar to that in Table 6.9.
Figure 7.13 shows box plots of the quantities y_j − μ(x_j; F̂*) that contribute to the 0.632 estimate of prediction error, plotted against case j ordered by the residual; only three values of j are labelled. There are about 74 contributions at each value of j. Only values outwith the horizontal dotted lines contribute to prediction error. The pattern is broadly what we would expect: observations with residuals close to zero are generally well predicted, and make little contribution to prediction error. More extreme residuals contribute most to prediction error. Note cases 66 and 54, which are always misclassified; their standardized Pearson residuals are 2.13 and 2.54. The figure suggests that case 54 is outlying.



Table 7.10  Summary results for estimates of prediction error for 200 samples of size n = 50 from data on low birth weights (Hosmer and Lemeshow, 1989, pp. 247–252; Venables and Ripley, 1994, p. 193). The table shows the average, standard deviation, and conditional mean squared error (×10⁻²) for the 200 estimates of excess error; adjusted cross-validation values are in parentheses. The target average excess error is 8.3 × 10⁻².

          K-fold (adjusted) cross-validation                               Bootstrap   0.632
          K = 50   K = 25        K = 10        K = 5         K = 2
  Mean    11.5     11.7 (11.5)   12.2 (11.7)   12.4 (11.3)   15.3 (11.1)    9.1         8.8
  SD       4.4      4.5 (4.2)     5.0 (4.6)     4.8 (3.9)     7.1 (4.6)     1.2         1.9
  MSE      0.62     0.64 (0.63)   0.76 (0.73)   0.64 (0.54)   1.14 (0.59)   0.38        0.29

At the other end is case 27, whose residual is −1.84; this was misclassified 42 times out of 65 in our simulation.

Example 7.9 (Low birth weights)  In order to compare the properties of estimates of misclassification error under repeated sampling, we took data on 189 births at a US hospital to be our population F. The binary response equals zero for babies with birth weight less than 2.5 kg, and equals one otherwise. We took 200 samples of size n = 50 from these data, and to each sample we fitted a binary logistic model with nine regression parameters expressing dependence on maternal characteristics: weight, smoking status, number of previous premature labours, hypertension, uterine irritability and the number of visits to the physician in the first trimester. For each of the samples we calculated various cross-validation and bootstrap estimates of misclassification rate, using R = 200 bootstrap resamples.
Table 7.10 shows the results of this experiment, expressed in terms of estimates of the excess error, which is the difference between true misclassification rate D and the apparent error rate Δ̂_app found by applying the prediction rule to the data. The target value of the average excess error over the 200 samples was 8.3 × 10⁻²; the average apparent error was 20.0 × 10⁻².
The bootstrap and 0.632 excess error estimates again perform best overall in terms of mean, variability, and conditional mean squared error. Note that the standard deviations for the bootstrap and 0.632 estimates suggest that R = 50 would have given results accurate enough for most purposes.
Ordinary cross-validation is significantly better than K-fold cross-validation, unless K = 25. However, the results for K-fold adjusted cross-validation are not significantly different from those for unadjusted cross-validation, even with K = 2. Thus if cross-validation is to be used, adjusted K-fold cross-validation offers considerable computational savings over ordinary cross-validation, and is about equally accurate.
For reasons outlined in Example 3.6, the EDF of the data may be a poor estimate of the original CDF when there are binary responses y_j. One way to overcome this is to switch the response value with small probability, i.e. to replace (x_j*, y_j*) with (x_j*, 1 − y_j*) with probability (say) 0.1. This corresponds to a binomial simulation using probabilities shrunk somewhat towards 0.5 from the observed values of 0 and 1. It should produce results that are smoother than those obtained under case resampling from the original data. Our simulation experiment included this randomized bootstrap, but although typically it improves slightly on bootstrap results, the results here were very similar to those for the ordinary bootstrap.
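The switching step itself is one line; a minimal sketch, with y.star the case-resampled binary responses:

flip <- rbinom(length(y.star), 1, 0.1)           # switch each response w.p. 0.1
y.star <- ifelse(flip == 1, 1 - y.star, y.star)  # randomized bootstrap responses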

In principle resampling estimates of misclassification rates could be used to select which covariates to include in the prediction rule, along the lines given for linear regression in Section 6.4.2. It seems likely, in the light of the preceding example, that the bootstrap approach would be preferable.

7.6 Nonparametric Regression

So far we have considered regression models in which the mean response is related to covariates x through a function of known form with a small number of unknown parameters. There are, however, occasions when it is useful to assess the effects of covariates x without completely specifying the form of the relationship between mean response μ and x. This is done using nonparametric regression methods, of which there are now a large number.
The simplest nonparametric regression relationship for scalar x is

    y = μ(x) + ε,

where μ(x) has completely unknown form but would be assumed continuous in many applications, and ε is a random error with zero mean. A typical application is illustrated by the scatter plot in Figure 7.14. Here no simple parametric regression curve seems appropriate, so it makes sense to fit a smooth curve (which we do later in Example 7.10) with as few restrictions as possible.
Often nonparametric regression is used as an exploratory tool, either directly by producing a curve estimate for visual interpretation, or indirectly by providing a comparison with some tentative parametric model fit via a significance test. In some applications the rather different objective of prediction will be of interest. Whatever the application, the complicated nature of nonparametric regression methods makes it unlikely that probability distributions for statistics of interest can be evaluated theoretically, and so resampling methods will play a prominent role.
It is not possible here to describe all of the nonparametric regression methods that are now available, and in any event many of them do not yet have fully developed companion resampling methods. We shall limit ourselves to a brief discussion of some of the main methods, and to applications in generalized additive models, where nonparametric regression is used to extend the generalized linear models of Section 7.2.

Figure 7.14  Motorcycle impact data: acceleration y (g) at a time x milliseconds after impact (Silverman, 1985).

7.6.1 Nonparametric curves

Several nonparametric curve-fitting algorithms are variants on the idea of local averaging. One such method is kernel smoothing, which estimates mean response E(Y | x) = μ(x) by

    μ̂(x) = Σ_j y_j w{(x − x_j)/b} / Σ_j w{(x − x_j)/b},        (7.24)

with w(·) a symmetric density function and b an adjustable bandwidth constant that determines how widely the averaging is done. This estimate is similar in many ways to the kernel density estimate discussed in Example 5.13, and as there the choice of b depends upon a trade-off between bias and variability of the estimate: small b gives small bias and large variance, whereas large b has the opposite effects. Ideally b would vary with x, to reflect large changes in the derivative of μ(x) and heteroscedasticity, both evident in Figure 7.14.
Modifications to the estimate (7.24) are needed at the ends of the x range, to avoid the inherent bias when there is little or no data on one side of x. In many ways more satisfactory are the local regression methods, where a local linear or quadratic curve is fitted using weights w{(x − x_j)/b} as above, and then μ̂(x) is taken to be the fitted value at x. Implementations of this idea include the lowess method, which also incorporates trimming of outliers. Again the choice of b is critical.
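Both kinds of smoother are available as standard functions; a sketch on the motorcycle data of Figure 7.14 (assumed available as the mcycle dataframe of the MASS library), with bandwidth and span values chosen only for illustration:

library(MASS)
ks <- ksmooth(mcycle$times, mcycle$accel, kernel = "normal",
              bandwidth = 3)                        # kernel estimate (7.24)
lw <- lowess(mcycle$times, mcycle$accel, f = 0.2)   # local regression with trimming
plot(mcycle$times, mcycle$accel, xlab = "Time (ms)", ylab = "Acceleration (g)")
lines(ks); lines(lw, lty = 2)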
A different approach is to define a curve in terms of basis functions, such as powers of x which define polynomials. The fitted model is then a linear combination of basis functions, with coefficients determined by least squares regression. Which basis to use depends on the application, but polynomials are generally bad because fitted values become increasingly variable as x moves toward the ends of its data range: polynomial extrapolation is notoriously poor. One popular choice for basis functions is cubic splines, with which μ(x) is modelled by a series of cubic polynomials joined at knot values of x, such that the curve has continuous second derivatives everywhere. The least squares cubic spline fit minimizes the penalized least squares criterion for fitting μ(x),

    Σ_j {y_j − μ(x_j)}² + λ ∫ {μ″(x)}² dx;

weighted sums of squares can be used if necessary. In most software implementations the spline fit can be determined either by specifying the degrees of freedom of the fitted curve, or by applying cross-validation (Section 6.4.1).
A spline fit will generally be biased, unless the underlying curve is in fact a cubic. That such bias is nearly always present for nonparametric curve fits can create difficulties. The other general feature that makes interpretation difficult is the occurrence of spurious bumps and bends in the curve estimates, as we shall see in Example 7.10.
Resampling methods
Two types of applications of nonparametric curves are use in checking a parametric curve, and use in setting confidence limits for μ(x) or prediction limits for Y = μ(x) + ε at some values of x. The first type is quite straightforward, because data would be simulated from the fitted parametric model: Example 7.11 illustrates this. Here we look briefly at confidence limits and prediction limits, where the nonparametric curve is the only model.
The basic difficulty for resampling here is similar to that with density estimation, illustrated in Example 5.13, namely bias. Suppose that we want to calculate a confidence interval for μ(x) at one or more values of x. Case resampling cannot be used with standard recommendations for nonparametric regression, because the resampling bias of μ̂*(x) will be smaller than that of μ̂(x). This could probably be corrected, as with density estimation, by using a larger bandwidth or equivalent tuning constant. But simpler, at least in principle, is to apply the idea of model-based resampling discussed in Chapter 6.
The naive extension of model-based resampling would generate responses y_j* = μ̂(x_j) + ε_j*, where μ̂(x_j) is the fitted value from some nonparametric regression method, and ε_j* is sampled from appropriately modified versions of the residuals y_j − μ̂(x_j). Unfortunately the inherent bias of most nonparametric regression methods distorts both the fitted values and the residuals, and thence biases the resampling scheme. One recommended strategy is to use as simulation model a curve that is oversmoothed relative to the usual estimate. For definiteness, suppose that we are using a kernel method or a local smoothing method with tuning constant b, and that we use cross-validation to determine the best value of b. Then for the simulation model we use the corresponding curve with, say, 2b as the tuning constant. To try to eliminate bias from the simulation errors ε_j*, we use residuals from an undersmoothed curve, say with tuning constant b/2. As with linear regression, it is appropriate to use modified residuals, where leverage is taken into account as in (6.9). This is possible for most nonparametric regression methods, since they are linear. Detailed asymptotic theory shows that something along these lines is necessary to make resampling work, but there is no clear guidance as to precise relative values for the tuning constants.
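A minimal sketch of this scheme for a spline smoother, assuming the motorcycle data as MASS::mcycle: here λ plays the role of the tuning constant b, the factors 2 and 1/2 follow the discussion above rather than any sharp recommendation, and the stratification and leverage adjustments used in the example below are omitted for brevity.

library(MASS)
x <- mcycle$times; y <- mcycle$accel
fit <- smooth.spline(x, y)                            # lambda chosen by (generalized) CV
over <- smooth.spline(x, y, lambda = 2 * fit$lambda)  # oversmoothed simulation model
under <- smooth.spline(x, y, lambda = fit$lambda/2)   # undersmoothed residual source
e <- y - predict(under, x)$y
y.star <- predict(over, x)$y + sample(e, replace = TRUE)
fit.star <- smooth.spline(x, y.star)                  # refit, CV choice each time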
Example 7.10 (Motorcycle impact data)  The response y here is acceleration measured x milliseconds after impact in an accident simulation experiment. The full data were shown in Figure 7.14, but for computational reasons we eliminate replicates for the present analysis, which leaves n = 94 cases with distinct x values. The solid line in the top left panel of Figure 7.15 shows a cubic spline fit for the data of Figure 7.14, chosen by cross-validation and having approximately 12 degrees of freedom. The top right panel of the figure gives the plot of modified residuals against x for this fit. Note the heteroscedasticity, which broadly corresponds to the three strata separated by the vertical dotted lines. The estimated variances for these strata are approximately 4, 600 and 140. Reciprocals of these were used as weights for the spline fit in the left panel. Bias in these residuals is evident at times 10–15 ms, where the residuals are first mostly negative and then positive because the curve does not follow the data closely enough.
There is a rough correspondence between kernel smoothing and spline smoothing and this, together with the previous discussion, suggests that for model-based resampling we use y_j* = μ̃(x_j) + ε_j*, where μ̃ is the spline fit obtained by doubling the cross-validation choice of λ. This fit is the dotted line in the top left panel of Figure 7.15. The random errors ε_j* are sampled from the modified residuals for another spline fit in which λ is half the cross-validation value. The lower right panel of the figure displays these residuals, which show less bias than those for the original fit, though perhaps a smaller bandwidth would be better still. The sampling is stratified, to reflect the very strong heteroscedasticity.
We simulated R = 999 datasets in this way, and to each fitted the spline curve μ̂*(x), with the bandwidth chosen by cross-validation each time. We then calculated 90% confidence intervals at six values of x, using the basic bootstrap method modified to equate the distributions of μ̂*(x) − μ̃(x) and μ̂(x) − μ(x). For example, at x = 20 the estimates μ̂ and μ̃ are respectively −110.8 and −106.2, and the 950th ordered value of μ̂* is −87.2, so that the upper confidence limit is −110.8 − {−87.2 − (−106.2)} = −129.8. The resulting confidence intervals are shown in the bottom left panel of Figure 7.15, together with the original fit.


Note how the confidence limits are centred on the convex side of the fitted curve in order to account for its bias; this is most evident at x = 20.

7.6.2 Generalized additive models

The structural part of a generalized linear model, as outlined in Section 7.2.1, is the linear predictor η = x^T β, which is additive in the components x_i of x. It may not always be the case that we know whether x_i or some transformation of it should be used in the linear predictor. Then it makes sense, at least for exploratory purposes, to include in η a nonparametric curve component s_i(x_i) for each component x_i (except those corresponding to qualitative factors). This still assumes additivity of the effects of the x_i s on the linear predictor scale.

Figure 7.15  Bootstrap analysis of motorcycle data, without replicate responses. Top left: data, original cubic spline fit (solid) and oversmoothed fit (dots). Top right: residuals from original fit; note their bias at times 10–15 ms. Bottom right: residuals from undersmoothed fit. The lines in these plots show strata used in the resampling. Bottom left: original fit and 90% basic bootstrap confidence intervals at six values of x; they are not centred on the fitted curve.

The result is the generalized additive model

    g{μ(x)} = η(x) = Σ_{i=1}^p s_i(x_i),        (7.25)

where g(·) is a known link function, as before. As for a generalized linear model, the model specification is completed by a variance function, var(Y) = κ V(μ).
In practice we might force some terms s_i(x_i) in (7.25) to be linear, depending upon what is known about the application. Each nonparametric term is typically fitted as a linear term plus a nonlinear term, the latter using smoothing splines or a local smoother. This means that the corresponding generalized linear model is a sub-model, so that the effects of nonlinearity can be assessed using differences of residual deviances, suitably scaled, as in (7.8). In standard computer implementations each nonparametric curve s_i(x_i) has (approximately) three degrees of freedom for nonlinearity. Standard distributional approximations for the resulting test statistics are sometimes quite unreliable, so that resampling methods are particularly helpful in this context. For tests of this sort the null model for resampling is the generalized linear model, and the approach taken can be summarized by the following algorithm.

Algorithm 7.4 (Comparison of generalized linear and generalized additive models)
For r = 1, ..., R,
1  fix the covariate values at those observed;
2  generate bootstrap responses y₁*, ..., y_n* by resampling from the fitted generalized linear null model;
3  fit the generalized linear model to the bootstrap data and calculate the residual deviance d₀r*;
4  fit the generalized additive model to the bootstrap data, calculate the residual deviance d_r* and dispersion κ̂_r*; then
5  calculate t_r* = (d₀r* − d_r*)/κ̂_r*.
Finally, calculate the P-value as [1 + #{t_r* ≥ t}]/(R + 1), where t = (d₀ − d)/κ̂ is the scaled difference of deviances for the original data.
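In outline, with fit0 the fitted null generalized linear model, fit1 the generalized additive model, and make.y a function implementing the chosen resampling scheme for step 2 (all three placeholders), the algorithm might be coded as follows; the dispersion component is assumed to be extracted as for a glm object.

R <- 999
t.star <- numeric(R)
for (r in 1:R) {
  y.star <- make.y(fit0)                    # step 2: responses from the null fit
  b0 <- update(fit0, y.star ~ .)            # step 3: refit the GLM
  b1 <- update(fit1, y.star ~ .)            # step 4: refit the GAM
  t.star[r] <- (deviance(b0) - deviance(b1))/summary(b1)$dispersion   # step 5
}
t.obs <- (deviance(fit0) - deviance(fit1))/summary(fit1)$dispersion
(1 + sum(t.star >= t.obs))/(R + 1)          # bootstrap P-value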

The following example illustrates the use of nonparametric curve fits in model-checking.
Figure 7.16  Generalized linear model fits (solid) and generalized additive model fits (dashed) for leukaemia data of Example 7.1, plotted against log₁₀ white blood cell count.

Example 7.11 (Leukaemia data)  For the data in Example 7.1, we originally fitted a generalized linear model with gamma variance function and linear predictor group + x with logarithmic link, where group is a factor with two levels. The fitted mean function for that model is shown as two solid curves in Figure 7.16, the upper curve corresponding to Group 1. Here we consider
whether or not the effect of x is linear. To do this, we compare the original fit to that of the generalized additive model in which x is replaced by s(x), which is a smoothing spline with three degrees of freedom. The link and variance functions are unchanged. The fitted mean function for this model is shown as dashed curves in the figure.
Is the smooth curve a significantly better fit? To answer this we use the test statistic Q defined in (7.8), where here D corresponds to the residual deviance for the generalized additive model, κ̂ is the dispersion for that model, and D₀ is the residual deviance for the smaller generalized linear model. For these data D₀ = 40.32 with 30 degrees of freedom, κ̂ = 0.725, and D = 30.75 with 27 degrees of freedom, so that q = (40.32 − 30.75)/0.725 = 13.2. The standard approximation for the null distribution of Q is chi-squared with degrees of freedom equal to the difference in model dimensions, here p − p₀ = 3, so the approximate P-value is 0.004. Alternatively, to allow for estimation of the dispersion, (p − p₀)⁻¹ Q is compared to the F distribution with denominator degrees of freedom n − p − 1, here 27, and this gives approximate P-value 0.012. It looks as though there is strong evidence against the simpler, loglinear model. However, the accuracies of the approximations used here are somewhat questionable, so it makes sense to apply the resampling analysis.
To calculate a bootstrap P-value corresponding to q = 13.2, we simulate the distribution of Q under the fitted null model, that is the original generalized linear model fit, but with nonparametric resampling. The particular resampling scheme we choose here uses the linear predictor residuals r_L,j defined in (7.10), one advantage of which is that positive simulated responses are guaranteed. The residuals in this case are

    r_{Lj} = {log(y_j) − log(μ̂_{0j})} / {κ̂₀^{1/2} (1 − h_{0j})^{1/2}},

Figure 7.17  Chi-squared Q-Q plot of standardized deviance differences q* for comparing generalized linear and generalized additive model fits to the leukaemia data. The lines show the theoretical χ²₃ approximation (dashes) and the F approximation (dots). Resampling uses Pearson residuals on linear predictor scale, with R = 999.

where h_{0j}, μ̂_{0j} and κ̂₀ are the leverage, fitted value and dispersion estimate for the null (generalized linear) model. These residuals appear quite homogeneous, so no stratification is used. Thus step 2 of Algorithm 7.4 consists of sampling ε₁*, ..., ε_n* randomly with replacement from r_{L1}, ..., r_{Ln} (without mean correction), and then generating responses y_j* = μ̂_{0j} exp(κ̂₀^{1/2} ε_j*) for j = 1, ..., n.
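In code, with mu0, k0 and rL the null fitted values, dispersion estimate and residuals just defined (names illustrative), step 2 is simply:

e.star <- sample(rL, length(mu0), replace = TRUE)   # no mean correction
y.star <- mu0 * exp(sqrt(k0) * e.star)              # positive by construction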
Applying this algorithm with R = 999 for our data gives the P-value 0.035, larger than the theoretical approximations, but still suggesting that the linear term in x is not sufficient. The bootstrap null distribution of q* deviates markedly from the standard χ²₃ approximation, as the Q-Q plot in Figure 7.17 shows. The F approximation is also inaccurate.
A jackknife-after-bootstrap plot reveals that quantiles of q* are moderately sensitive to case 2, but without this case the P-value is virtually unchanged. Very similar results are obtained under parametric resampling with the exponential model, as might be expected from the original data analysis.

Our next example illustrates the use of semiparametric regression in prediction.

Example 7.12 (AIDS diagnoses)  In Example 7.4 we discussed prediction of AIDS diagnoses based on the data in Table 7.4. A smooth time trend seems preferable to fitting a separate parameter for each diagnosis period, and accordingly we consider a model where the mean number of diagnoses in period j reported with delay k, the mean for the (j, k) cell of the table, equals

    μ_jk = exp{α(j) + β_k}.

We take α(j) to be a locally quadratic lowess smooth with bandwidth 0.5.


The delay distribution is so sharply peaked here that although we could take a smooth function in the delay time, it is equally parsimonious to take 15 separate parameters β_k. We use the same variance function as in Example 7.4, which assumes that the observed counts y_jk are overdispersed Poisson with means μ_jk, and we fit the model as a generalized additive model. The residual deviance is 751.7 on 444.2 degrees of freedom, increased from 716.5 and 413 in the previous fit. The curve shown in the left panel of Figure 7.18 fits well, and is much more plausible as a model for underlying trend than the curve in Figure 7.5. The panel also shows the predicted values from this curve, which of course are heavily affected by the observed diagnoses in Table 7.4.
As mentioned above, in resampling from fitted curves it is important to take residuals from an undersmoothed curve, in order to avoid bias, and to add them to an oversmoothed curve. We take Pearson residuals (y − μ̂)/μ̂^{1/2} from a similar curve with bandwidth 0.3, and add them to a curve with bandwidth 0.7. These fits have deviances 745.3 on 439.2 degrees of freedom and 754.1 on 446.1 degrees of freedom. Both of these curves are shown in Figure 7.18. Leverage adjustment is awkward for generalized additive models, but the large number of degrees of freedom here makes such adjustments unnecessary. We modify resampling scheme (7.12), and repeat the calculations as for Algorithm 7.1 applied to Example 7.4, with R = 999.
Table 7.11 shows the resulting prediction intervals for the last quarters of 1990, 1991, and 1992. The intervals for 1992 are substantially shorter than those in Table 7.5, because of the different model. The generalized additive model is based on an underlying smooth trend in diagnoses, so predictions for the last few rows of the table depend less critically on the values observed in those rows.

Figure 7.18  Generalized additive model prediction of UK AIDS diagnoses. The left panel shows the fitted curve with bandwidth 0.5 (smooth solid line), the predicted diagnoses from this fit (jagged dashed line), and the fitted curves with bandwidths 0.7 (dots) and 0.3 (dashes), together with the observed totals (+). The right panel shows the predicted quarterly diagnoses for 1989–92 (central solid line), and pointwise 95% prediction limits from the Poisson bootstrap (solid), negative binomial bootstrap (dashes), and nonparametric bootstrap without (dots) and with (dot-dash) stratification.

Table 7.11  Bootstrap 95% prediction intervals for numbers of AIDS cases in England and Wales for the fourth quarters of 1990, 1991, and 1992, using generalized additive model.

                               1990         1991         1992
  Poisson                     (295, 314)   (302, 336)   (415, 532)
  Negative binomial           (293, 317)   (298, 339)   (407, 547)
  Nonparametric               (294, 316)   (296, 337)   (397, 545)
  Stratified nonparametric    (293, 315)   (295, 338)   (394, 542)

This contrasts with the Poisson two-way layout model, for which the predictions depend completely on single rows of the table and are much more variable. Compare the slight forecast drop in Figure 7.6 with the predicted increase in Figure 7.18.
The dotted lines in Figure 7.18 show pointwise 95% prediction bands for the AIDS diagnoses. The prediction intervals for the negative binomial and nonparametric schemes are similar, although the effect of stratification is smaller. Stratification has no effect on the deviances. The negative binomial deviances are typically about 90 larger than those generated under the nonparametric scheme.
The plausibility of the smooth underlying curve and its usefulness for prediction is of course central to the approach outlined here.

7.6.3 Other methods

Often a nonparametric regression fit will be compared to a parametric fit, but not all applications are of this kind. For example, we may want to see whether or not a regression curve is monotone without specifying its form. The following application is of this kind.

Example 7.13 (Downs syndrome)  Table 7.12 contains a set of data on incidence of Downs syndrome babies for mothers in various age ranges. Mean age is the approximate mean age of the m mothers whose babies included y babies with Downs syndrome. These data are plotted on the logistic scale in Figure 7.19, together with a generalized additive spline fit as an exploratory aid in modelling the incidence rate.
What we notice about the curve is that it decreases with age for young mothers, contrary to intuition and expert belief. A similar phenomenon occurs for other datasets. We want to see if this dip is real, as opposed to a statistical artefact. So a null model is required under which the rate of occurrence is increasing with age. Linear logistic regression is clearly inappropriate, and most other standard models give non-increasing rates. The approach taken is isotonic regression, in which the rates are fitted nonparametrically subject to their increasing with age. Further, in order to make the null model a special


Table 7.12  Number y of Downs syndrome babies in m births for mothers with age groups centred on x years (Geyer, 1991).

    x       m      y        x       m      y        x       m      y
  17.0   13555    16      27.5   19202    27      37.5    5780    17
  18.5   13675    15      28.5   17450    14      38.5    4834    15
  19.5   18752    16      29.5   15685     9      39.5    3961    30
  20.5   22005    22      30.5   13954    12      40.5    2952    31
  21.5   23896    16      31.5   11987    12      41.5    2276    33
  22.5   24667    12      32.5   10983    18      42.4    1589    20
  23.5   24807    17      33.5    9825    13      43.5    1018    16
  24.5   23986    22      34.5    8483    11      44.5     596    22
  25.5   22860    15      35.5    7448    23      45.5     327    11
  26.5   21450    14      36.5    6628    13      47.0     249     7

Figure 7.19  Logistic scale plot of Downs syndrome incidence rates against mean age x. Solid curve is generalized additive spline fit with 3 degrees of freedom.

case of the general model, the latter is taken to be an arbitrary convex curve for the logit of incidence rate.
If the incidence rate at age x_i is π(x_i) with logit{π(x_i)} = η(x_i) = η_i, say, for i = 1, ..., k, then the binomial log likelihood is

    ℓ(η_1, ..., η_k) = Σ_{i=1}^k [ y_i η_i − m_i log{1 + exp(η_i)} ].

A convex model is one in which

    η_i ≤ {(x_{i+1} − x_i)/(x_{i+1} − x_{i−1})} η_{i−1} + {(x_i − x_{i−1})/(x_{i+1} − x_{i−1})} η_{i+1},    i = 2, ..., k − 1.

The general model fit will maximize the binomial log likelihood subject to these constraints, giving estimates η̂_1, ..., η̂_k. The null model satisfies the constraints η_i ≤ η_{i+1} for i = 1, ..., k − 1, which are equivalent to the previous convexity constraints plus the single constraint η_1 ≤ η_2.

Figure 7.20  Logistic scale plot of incidence rates for Downs syndrome data against mean age x, with convex fit (solid line) and isotonic fit (dotted line).

The null fit essentially pools adjacent age groups for which the general estimates η̂_i violate the monotonicity of the null model. If the null estimates are denoted by η̂_{0,1}, ..., η̂_{0,k}, then we take as our test statistic the deviance difference

    T = 2{ℓ(η̂_1, ..., η̂_k) − ℓ(η̂_{0,1}, ..., η̂_{0,k})}.

The difficulty now is that the standard chi-squared approximation for deviance differences does not apply, essentially because there is not a fixed value for the degrees of freedom. There is a complicated large-sample approximation which may well not be reliable. So a parametric bootstrap is used to calculate the P-value. This requires simulation from the binomial model with sample sizes m_i, covariate values x_i, and logits η̂_{0,i}.
Figure 7.20 shows the convex and isotone regression fits, which clearly differ for ages below 30. The deviance difference for these fits is t = 5.873. Simulation of R = 999 binomial datasets from the isotone model gave 33 values of t* in excess of 5.873, so the P-value is 0.034 and we conclude that the dip in incidence rate may be real. (Further analysis with additional data does not support this conclusion.) Figure 7.21 is a histogram of the t* values.
It is possible that the null distribution of T is unstable with respect to parameter values, in which case the nested bootstrap procedure of Section 4.5 should be used, possibly in conjunction with the recycling method of Section 9.4.4 to accelerate the computation.
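The simulation itself is straightforward once the two constrained fits are available; in the sketch below fit.convex and fit.isotone are hypothetical stand-ins returning the fitted logits η̂ and η̂₀, with x, m, y as in Table 7.12.

loglik <- function(eta, y, m) sum(y * eta - m * log(1 + exp(eta)))
eta1 <- fit.convex(x, m, y); eta0 <- fit.isotone(x, m, y)   # hypothetical fitters
t.obs <- 2 * (loglik(eta1, y, m) - loglik(eta0, y, m))      # deviance difference T
R <- 999
t.star <- replicate(R, {
  y.star <- rbinom(length(m), m, 1/(1 + exp(-eta0)))        # binomial data from null fit
  2 * (loglik(fit.convex(x, m, y.star), y.star, m) -
       loglik(fit.isotone(x, m, y.star), y.star, m))
})
(1 + sum(t.star >= t.obs))/(R + 1)                          # bootstrap P-value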

Figure 7.21  Histogram of 999 resampled deviance test statistics t* for the Downs syndrome data. The unshaded portion corresponds to values exceeding the observed test statistic t = 5.873.

7.7 Bibliographic Notes

A full treatment of all aspects of generalized linear models is given by McCullagh and Nelder (1989). Dobson (1990) is a more elementary discussion, while Firth (1991) gives a useful shorter account. Davison and Snell (1991) describe methods of checking such models. Books by Chambers and Hastie (1992) and Venables and Ripley (1994) cover most of the basic methods discussed in this chapter, but restricted to implementations in S and S-Plus.
Published discussions of bootstrap methods for generalized linear models are usually limited to one-step iterations from the model fit, with resampling of Pearson residuals; see, for example, Moulton and Zeger (1991). There appears to be no systematic study of the various schemes described in Section 7.2.3. Nelder and Pregibon (1987) briefly discuss a more general application. Moulton and Zeger (1989) discuss bootstrap analysis of repeated measure data, while Booth (1996) describes methods for use when there is nested variation.
Books giving general accounts of survival data are mentioned in Section 3.12. Hjort (1985) describes model-based resampling methods for proportional hazards regression, and studies their theoretical properties such as confidence interval accuracy. Burr and Doss (1993) outline how the double bootstrap can be used to provide confidence bands for a median survival time, and compare its performance with simulated bands based on asymptotic results. Lo and Singh (1986) and Horváth and Yandell (1987) make theoretical contributions to bootstrapping survival data. Bootstrap and permutation tests for comparison of survivor functions are discussed by Heller and Venkatraman (1996).
Burr (1994) studies empirically various bootstrap confidence interval methods for the proportional hazards model. She finds no overall best combination, but concludes that normal-theory asymptotic confidence intervals and basic bootstrap intervals are generally good for regression parameters β, while percentile intervals are satisfactory for survival probabilities derived from the product-limit estimate. Results from the conditional bootstrap are more erratic than those for resampling cases or from model-based resampling, and the latter is generally preferred.
Altman and Andersen (1989), Chen and George (1985) and Sauerbrei and Schumacher (1992) apply case resampling to variable selection in survival data models, but there seems to be little theoretical justification of this. The use of bootstrap methods in general assessment of model uncertainty in regression is discussed by Faraway (1992).
Bootstrap methods for general nonlinear regression models are usually studied theoretically via linear approximation. See Huet, Jolivet and Messean (1990) for some simulation results. There appears to be no literature on incorporating curvature effects into model-based resampling. The behaviour of residuals, leverages and diagnostics for nonlinear regression models are developed by Cook, Tsai and Wei (1986) and St. Laurent and Cook (1993).
The large literature on prediction error as related to discrimination is surveyed by McLachlan (1992). References for bootstrap estimation of prediction error are mentioned in Section 6.6. Those dealing particularly with misclassification error include Efron (1983) and Efron and Tibshirani (1997). Gong (1983) discusses a particular case where the prediction rule is based on a logistic regression model obtained by forward selection.
References to bootstrap methods for model selection are mentioned in Section 6.6. The treatment by Shao (1996) covers both generalized linear models and nonlinear models.
There are now numerous accounts of nonparametric regression, such as Hastie and Tibshirani (1990) on generalized additive models, and Green and Silverman (1994) on penalized likelihood methods. A useful treatment of local weighted regression by Hastie and Loader (1993) is followed by a discussion of the relative merits of various kernel-type estimators. Venables and Ripley (1994) discuss implementation in S-Plus with examples; see also Chambers and Hastie (1992). Considerable theoretical work has been done on bootstrap methods for setting confidence bands on nonparametric regression curves, mostly focusing on kernel estimators. Härdle and Bowman (1988) and Härdle and Marron (1991) both emphasize the need for different levels of smoothing in the components of model-based resampling schemes. Hall (1992b) gives a detailed theoretical assessment of the properties of such confidence band methods, and emphasizes the benefits of the studentized bootstrap. There appears to be no corresponding treatment for spline smoothing methods, nor for the many complex methods now used for fitting surfaces to model the effects of multiple covariates.

A summary of much of the theory for resampling in nonlinear and nonparametric regression is given in Chapter 8 of Shao and Tu (1995).

7.8 Problems

1  The estimator β̂ in a generalized linear model may be defined as the solution to the theoretical counterpart of (7.2), namely

    ∫ x (y − μ) / {c V(μ) ∂η/∂μ} dF(x, y) = 0,

where μ is regarded as a function of β through the link function g(μ) = η = x^T β.
Use the result of Problem 2.12 to show that the empirical influence value for β̂ based on data (x₁, c₁, y₁), ..., (x_n, c_n, y_n) is

    l_j = n (X^T W X)^{-1} x_j (y_j − μ_j) / {c_j V(μ_j) ∂η_j/∂μ_j},

evaluated at the fitted model, where W is the diagonal matrix with elements given by (7.3).
Hence show that the approximate variance matrix for β̂* under case resampling in a generalized linear model is

    κ (X^T W X)^{-1} X^T W S X (X^T W X)^{-1},

where S = diag(r²_{P1}, ..., r²_{Pn}) with the r_{Pj} standardized Pearson residuals (7.9). Show that for the linear model this yields the modified version of the robust variance matrix (6.26).
(Section 7.2.2; Moulton and Zeger, 1991)
2  For the gamma model of Examples 7.1 and 7.2, verify that var(Y) = κμ² and that the log likelihood contribution from a single observation is

   ℓ(μ) = −κ^{−1}{log μ + y/μ}.

   Show that the unstandardized Pearson and deviance residuals are respectively κ^{−1/2}(z − 1) and sign(z − 1)[2κ^{−1}{z − 1 − log(z)}]^{1/2}, where z = y/μ̂. If the regression is loglinear, meaning that the log link is used, verify that the unstandardized linear predictor residuals are simply κ^{−1/2} log(z).
   What are the possible ranges of the standardized residuals r_P, r_L and r_D? Calculate these for the model fitted in Example 7.2.
   If the deviance residual is expressed as d(y, μ), check that d(y, μ) = d(z, 1). Hence show that the resampling scheme based on standardized deviance residuals can be expressed as y*_j = μ̂_j z*_j, where z*_j is defined by d(z*_j, 1) = ε*_j with ε*_j randomly sampled from r_D1, ..., r_Dn. What further simplification can be made?
   (Sections 7.2.2, 7.2.3)
3  The figure below shows the fit to data pairs (x_1, y_1), ..., (x_n, y_n) of a binary logistic model

   Pr(Y = 1) = 1 − Pr(Y = 0) = exp(β_0 + β_1 x) / {1 + exp(β_0 + β_1 x)}.

   (a) Under case resampling, show that the maximum likelihood estimate of β_1 for a bootstrap sample is infinite with probability close to e^{−2}. What effect has this on the different types of bootstrap confidence intervals for β_1?
   (b) Bias-corrected maximum likelihood estimates are obtained by modifying the response values (0, 1) to (h_j/2, 1 + h_j/2), where h_j is the jth leverage for the model fit to the original data. Do infinite parameter estimates arise when bootstrapping cases from the modified data?
   (Section 7.2.3; Firth, 1993; Moulton and Zeger, 1991)
4  Investigate whether the resampling schemes given by (7.12), (7.13), and (7.14) yield Algorithm 6.1 for bootstrapping the linear model.

5  Suppose that conditional on P = π, Y has a binomial distribution with probability π and denominator m, and that P has a beta density

   f(π; α, β) = Γ(α + β)/{Γ(α)Γ(β)} π^{α−1}(1 − π)^{β−1},   0 < π < 1,   α, β > 0.

   Show that Y has unconditional mean and variance (7.15) and express π and φ in terms of α and β.
   Express α and β in terms of π and φ, and hence explain how to generate data with mean and variance (7.15) by generating π from a beta distribution, and then, conditional on the probabilities, generating binomial variables with probabilities π and denominators m; a sketch is given below.
   How should your algorithm be amended to generate beta-binomial data with variance function φπ(1 − π)?
   (Example 7.3)
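A minimal S-style sketch of this two-stage generation scheme, assuming that values of alpha and beta have already been obtained from π and φ; the function name rbetabin is ours:

rbetabin <- function(n, m, alpha, beta)
{ p <- rbeta(n, alpha, beta)       # step 1: probabilities from the beta density
  rbinom(n, m, p) }                # step 2: conditional binomial counts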
6  For generalized linear models the analogue of the case-deletion result in Problem 6.2 is

   β̂_{−j} = β̂ − (X^T W X)^{−1} w_j^{1/2} κ^{1/2} x_j r_Pj (1 − h_j)^{−1/2}.

   (a) Use this to show that when the jth case is deleted the predicted value for y_j is g^{−1}{η̂_j − κ^{1/2} h_j w_j^{−1/2} (1 − h_j)^{−1/2} r_Pj}.
   (b) Use (a) to give an approximation for the leave-one-out cross-validation estimate of prediction error for a binary logistic regression with cost (7.23).
   (Sections 6.4.1, 7.2.2)

7.9 Practicals
1  Dataframe remission contains data from Freeman (1987) concerning a measure of cancer activity, the LI values, for 27 cancer patients, of whom 9 went into remission. Remission is indicated by the binary variable r = 1. Consider testing the hypothesis that the LI values do not affect the probability of remission. First, fit a binary logistic model to the data, plot them, and perform a permutation test:

attach(remission)
plot(LI+0.03*rnorm(27),r,pch=1,xlab="LI, jittered",xlim=c(0,2.5))
rem.glm <- glm(r~LI,binomial,data=remission)
summary(rem.glm)
x <- seq(0.4,2.0,0.02)
eta <- cbind(rep(1,81),x)%*%coefficients(rem.glm)
lines(x,inv.logit(eta),lty=2)
rem.perm <- function(data, i)
{ d <- data
  d$LI <- d$LI[i]
  d.glm <- glm(r~LI,binomial,data=d)
  coefficients(d.glm) }
rem.boot <- boot(remission, rem.perm, R=199, sim="permutation")
qqnorm(rem.boot$t[,2],ylab="Coefficient of LI",ylim=c(-3,3))
abline(h=rem.boot$t0[2],lty=2)

Compare this significance level with that from using a normal approximation for the coefficient of LI in the fitted model.
Construct bootstrap tests of the hypothesis by extending the methods outlined in Section 6.2.5.
(Freeman, 1987; Hall and Wilson, 1991)
2  Dataframe breslow contains data from Breslow (1985) on death rates from heart disease among British male doctors. A standard model is that the numbers of deaths y have a Poisson distribution with mean nλ, where n is the number of person-years and λ is the death rate. The focus of interest is how death rate depends on two explanatory variables, a factor representing the age group and an indicator of smoking status, x. Two competing models are

   λ = exp(α_age + βx),   λ = α_age + βx;

these are respectively multiplicative and additive. To fit these models we proceed as follows:

breslow.mult <- glm(y~offset(log(n))+age+smoke,poisson(log),
                    data=breslow)
breslow.add <- glm(y~n:age+ns-1,poisson(identity),data=breslow)

Here ns is a variable for the effect of smoking, constructed to allow for the difficulty in applying an offset in fitting the additive model. The deviances of the fitted models are D_add = 7.43 and D_mult = 12.13. Although it appears that the additive model is the better fit, these models are not nested, so a chi-squared approximation cannot be applied to the difference of deviances. For bootstrap assessment of fit based on the difference of deviances, we simulate in turn from each fitted model. Because fits of the additive model fail if there are no deaths in the lowest age group, and this happens with appreciable probability, we constrain the simulation so that there are deaths at each age.

breslow.fun <- function(data)
{ mult <- glm(y~offset(log(n))+age+smoke,poisson(log),data=data)
  add <- glm(y~n:age+ns-1,poisson(identity),data=data)
  deviance(mult)-deviance(add) }
breslow.sim <- function(data, mle)
{ data$y <- rpois(nrow(data), mle)
  while(min(data$y)==0) data$y <- rpois(nrow(data), mle)
  data }
add.mle <- fitted(breslow.add)
add.boot <- boot(breslow, breslow.fun, R=99, sim="parametric",
                 ran.gen=breslow.sim, mle=add.mle)
mult.mle <- fitted(breslow.mult)
mult.boot <- boot(breslow, breslow.fun, R=99, sim="parametric",
                  ran.gen=breslow.sim, mle=mult.mle)
boxplot(mult.boot$t,add.boot$t,ylab="Deviance difference",
        names=c("multiplicative","additive"))
abline(h=mult.boot$t0,lty=2)

What does this tell you about the relative fit of the models?
A different strategy would be to use parametric simulation, simulating not from the fitted models, but from the model with separate Poisson distributions for each of the original data. Discuss critically this approach.
(Section 7.2; Example 4.5; Wahrendorf, Becher and Brown, 1987; Hall and Wilson, 1991)
3  Dataframe hirose contains the PET reliability data of Table 7.6. Initially we consider estimating the bias and variance of the MLEs of the parameters β_0, ..., β_4 and x_0 discussed in Example 7.5, using parametric simulation from the fitted Weibull model, but assuming that the data were subject to censoring at the fixed time 9104.25. Functions to calculate the minus log likelihood (in a transformed parametrization) and to find the MLEs are:

hirose.lik <- function(mle, data)
{ x0 <- 5-exp(mle[5])
  lambda <- exp(mle[1]+mle[2]*(-log(data$volt-x0)))
  beta <- exp(mle[3]+mle[4]*(-log(data$volt)))
  z <- (data$time/lambda)^beta
  sum(z - data$cens*log(beta*z/data$time)) }
hirose.fun <- function(data, start)
{ d <- nlminb(start, hirose.lik, data=data)
  conv <- (d$message=="RELATIVE FUNCTION CONVERGENCE")
  c(conv, d$objective, d$parameters) }

The MLEs for the original data can be obtained by setting hirose.start <- c(6,2,4,1,1) (obtained by introspection), and then iterating the following lines a few times:

hirose.start <- hirose.fun(hirose, start=hirose.start)[3:7]
hirose.start

New data are generated by

hirose.gen <- function(data, mle)
{ x0 <- 5-exp(mle[5])
  xl <- -log(data$volt-x0)
  xb <- -log(data$volt)
  lambda <- exp(mle[1]+mle[2]*xl)
  beta <- exp(mle[3]+mle[4]*xb)
  y <- rweibull(nrow(data), shape=beta, scale=lambda)
  data$cens <- ifelse(y<=9104.25,1,0)
  data$time <- ifelse(data$cens==1,y,9104.25)
  data }

and the bootstrap results are obtained by

hirose.mle <- hirose.start
hirose.boot <- boot(hirose, hirose.fun, R=19, sim="parametric",
                    ran.gen=hirose.gen, mle=hirose.mle,
                    start=hirose.start)
hirose.boot$t[,7] <- 5-exp(hirose.boot$t[,7])
hirose.boot$t0[7] <- 5-exp(hirose.boot$t0[7])
hirose.boot

Try this with a larger value of R, but don't hold your breath.
For a full likelihood analysis for the parameter θ, the log likelihood must be maximized over β_1, ..., β_4 for a given value of θ. A little thought shows that the necessary code is

beta0 <- function(theta, mle)
{ x49 <- -log(4.9-(5-exp(mle[4])))
  x <- -log(4.9)
  log(theta*10^3) - mle[1]*x49 - lgamma(1+exp(-mle[2]-mle[3]*x)) }
hirose.lik2 <- function(mle, data, theta)
{ x0 <- 5-exp(mle[4])
  lambda <- exp(beta0(theta,mle)+mle[1]*(-log(data$volt-x0)))
  beta <- exp(mle[2]+mle[3]*(-log(data$volt)))
  z <- (data$time/lambda)^beta
  sum(z - data$cens*log(beta*z/data$time)) }
hirose.fun2 <- function(data, start, theta)
{ d <- nlminb(start, hirose.lik2, data=data, theta=theta)
  conv <- (d$message=="RELATIVE FUNCTION CONVERGENCE")
  c(conv, d$objective, d$parameters) }
hirose.f <- function(data, start, theta)
c( hirose.fun(data,start),
   hirose.fun2(data,start[-1],theta))

so that hirose.f does likelihood fits when θ is fixed and when it is not.
The quantiles of the simulated likelihood ratio statistic are then obtained by

make.theta <- function(mle, x=hirose$volt)
{ x0 <- 5-exp(mle[5])
  lambda <- exp(mle[1]-mle[2]*log(x-x0))/10^3
  beta <- exp(mle[3]-mle[4]*log(x))
  lambda*gamma(1+1/beta) }
theta <- make.theta(hirose.mle,4.9)
hirose.boot <- boot(hirose, hirose.f, R=19, sim="parametric",
                    ran.gen=hirose.gen, mle=hirose.mle,
                    start=hirose.start, theta=theta)
R <- hirose.boot$R
i <- c(1:R)[(hirose.boot$t[,1]==1)&(hirose.boot$t[,8]==1)]
w <- 2*(hirose.boot$t[i,9]-hirose.boot$t[i,2])
qqplot(qchisq(c(1:length(w))/(1+length(w)),1),w)
abline(0,1,lty=2)

Again, try this with a larger R.
Can you see how the code would be modified for nonparametric simulation?
(Section 7.3; Hirose, 1993)
4  Dataframe nodal contains data on 53 patients with prostate cancer. For each patient there are five explanatory variables, each with two levels. These are aged (< 60, ≥ 60); stage, a measure of the seriousness of the tumour; grade, a measure of the pathology of the tumour; xray, a measure of the seriousness of an x-ray; and acid, the level of serum acid phosphatase. The higher level of each of the last four variables indicates a more severe condition. The response r indicates whether the cancer has spread to the neighbouring lymph nodes. The data were collected to see whether nodal involvement can be predicted from the explanatory variables. Analysis of deviance for a binary logistic regression model suggests that the response depends only on stage, xray and acid, and we base our predictions on the model with these variables. Our measure of error is the average number of misclassifications n^{−1} Σ c(y_j, μ̂_j), where c(y, μ̂) is given by (7.23).
For an initial model, apparent error, and ordinary and K-fold cross-validation estimates of prediction error:

attach(nodal)
cost <- function(r, pi=0) mean(abs(r-pi)>0.5)
nodal.glm <- glm(r~stage+xray+acid,binomial,data=nodal)
nodal.diag <- glm.diag(nodal.glm)
app.err <- cost(r, fitted(nodal.glm))
cv.err <- cv.glm(nodal, nodal.glm, cost, K=53)$delta
cv.11.err <- cv.glm(nodal, nodal.glm, cost, K=11)$delta

For resampling-based estimates and plot for 0.632 errors:

nodal.pred.fun <- function(data, i, model)
{ d <- data[i,]
  d.glm <- update(model,data=d)
  pred <- predict(d.glm,data,type="response")
  D.F.Fhat <- cost(data$r, pred)
  D.Fhat.Fhat <- cost(d$r, fitted(d.glm))
  c(data$r-pred, D.F.Fhat - D.Fhat.Fhat) }
nodal.boot <- boot(nodal, nodal.pred.fun, R=200, model=nodal.glm)
nodal.boot$f <- boot.array(nodal.boot)
n <- nrow(nodal)
err.boot <- mean(nodal.boot$t[,n+1]) + app.err
ord <- order(nodal.diag$res)
nodal.pred <- nodal.boot$t[,ord]
err.632 <- 0
n.632 <- NULL
pred.632 <- NULL
for (i in 1:n) {
  inds <- nodal.boot$f[,i]==0
  err.632 <- err.632 + cost(nodal.pred[inds,i])/n
  n.632 <- c(n.632, sum(inds))
  pred.632 <- c(pred.632, nodal.pred[inds,i]) }
err.632 <- 0.368*app.err + 0.632*err.632
nodal.fac <- factor(rep(1:n,n.632),labels=ord)
plot(nodal.fac, pred.632,ylab="Prediction errors",
     xlab="Case ordered by residual")
abline(h=-0.5,lty=2); abline(h=0.5,lty=2)

Cases with errors entirely outside the dotted lines are always misclassified, and conversely.
Estimate the misclassification error using the model with all five explanatory variables.
(Section 7.5; Brown, 1980)

5  Dataframe cloth records the number of faults y in lengths x of cloth. Is it true that E(y) ∝ x?

plot(cloth$x,cloth$y)
cloth.glm <- glm(y~offset(log(x)),poisson,data=cloth)
lines(cloth$x,fitted(cloth.glm))
summary(cloth.glm)
cloth.diag <- glm.diag(cloth.glm)
cloth.gam <- gam(y~s(log(x)),poisson,data=cloth)
lines(cloth$x,fitted(cloth.gam),lty=2)
summary(cloth.gam)

There is some overdispersion relative to the Poisson model with identity link, and strong evidence that the generalized additive model fit cloth.gam improves on the straight-line model in which y is Poisson with mean β_0 + β_1 x. We can try parametric simulation from the model with the linear fit (the null model) to assess the significance of the decrease; cf. Algorithm 7.4:

cloth.gen <- function(data, fits)
{ y <- rpois(n=nrow(data), fits)
  data.frame(x=data$x,y=y) }
cloth.fun <- function(data)
{ d.glm <- glm(y~offset(log(x)),poisson,data=data)
  d.gam <- gam(y~s(log(x)),poisson,data=data)
  c(deviance(d.glm),deviance(d.gam)) }
cloth.boot <- boot(cloth, cloth.fun, sim="parametric", R=99,
                   ran.gen=cloth.gen, mle=fitted(cloth.glm))

Are the simulated drops in deviance roughly distributed as they would be if standard asymptotics applied? How significant is the observed drop?
In addition to the hypothesis that we want to test, namely that E(y) depends linearly on x, the parametric bootstrap imposes the constraint that the data are Poisson, which is not intended to be part of the null hypothesis. We avoid this by a nonparametric bootstrap, as follows:

cloth1 <- data.frame(cloth,fits=fitted(cloth.glm),
                     pearson=cloth.diag$rp)
cloth.fun1 <- function(data, i)
{ y <- data$fits+sqrt(data$fits)*data$pearson[i]
  y <- round(y)
  y[y<0] <- 0
  d.glm <- glm(y~offset(log(data$x)),poisson)
  d.gam <- gam(y~s(log(data$x)),poisson)
  c(deviance(d.glm),deviance(d.gam)) }
cloth.boot <- boot(cloth1, cloth.fun1, R=99)

Here we have used resampled standardized Pearson residuals for the null model, obtained by cloth.diag$rp.
How significant is the observed drop in deviance under this resampling scheme?
(Section 7.6.2; Bissell, 1972; Firth, Glosup and Hinkley, 1991)
6  The data nitrofen are taken from a test of the toxicity of the herbicide nitrofen on the zooplankton Ceriodaphnia dubia, an important species that forms the basis of freshwater food chains for the higher invertebrates and for fish and birds. The standard test measures the survival and reproductive output of 10 juvenile C. dubia in each of four concentrations of the herbicide, together with a control in which the herbicide is not present. During the 7-day period of the test each of the original individuals produces three broods of offspring, but for illustration we analyse the total offspring.
A previous model for the data is that at concentration x the total offspring y for each individual is Poisson distributed with mean exp(β_0 + β_1 x + β_2 x²). The fit of this model to the data suggests that low doses of nitrofen augment reproduction, but that higher doses inhibit it.
One thing required from analysis is an estimate of the concentration x_50 of nitrofen at which the mean brood size is halved, together with a 95% confidence interval for x_50. A second issue is posed by the surprising finding from a previous analysis that brood sizes are slightly larger at low doses of herbicide than at high or zero doses: is this true?
A wide variety of nonparametric curves could be fitted to the data, though care is needed because there are only five distinct values of x. The data do not look Poisson, but we use models with Poisson errors and the log link function to ensure that fitted values and predictions are positive. To compare the fits of the generalized linear model described above and a robustified generalized additive model with Poisson errors:

nitro <- rbind(nitrofen,nitrofen,nitrofen,nitrofen,nitrofen)
nitro <- rbind(nitro,nitro,nitro,nitro,nitro)
nitro$conc <- seq(0,310,length=nrow(nitro))
attach(nitrofen)
plot(conc,jitter(total),ylab="total")
nitro.glm <- glm(total~conc+I(conc^2),poisson,data=nitrofen)
lines(nitro$conc,predict(nitro.glm,nitro,"response"),lty=3)
nitro.gam <- gam(total~s(conc,df=3),robust(poisson),data=nitrofen)
lines(nitro$conc,predict(nitro.gam,nitro,"response"))

To compare bootstrap confidence intervals for x_50 based on these models:

nitro.fun <- function(data, i, nitro)
{ assign("d",data[i,],frame=1)
  d.fit <- gam(total~s(conc,df=3),robust(poisson),data=d)
  f <- predict(d.fit,nitro,"response")
  f.gam <- max(nitro$conc[f>0.5*f[1]])
  d.fit <- glm(total~conc+I(conc^2),poisson,data=d)
  f <- predict(d.fit,nitro,"response")
  f.glm <- max(nitro$conc[f>0.5*f[1]])
  c(f.gam, f.glm) }
nitro.boot <- boot(nitrofen, nitro.fun, R=499,
                   strata=rep(1:5,rep(10,5)), nitro=nitro)
boot.ci(nitro.boot,index=1,type=c("norm","basic","perc","bca"))
boot.ci(nitro.boot,index=2,type=c("norm","basic","perc","bca"))

Do the values of x*_50 look normal? What is the bias estimate for x_50 using the two models?
To perform a bootstrap test of whether the peak is a genuine effect, we simulate from a model satisfying the null hypothesis of no peak to see if the observed value of a suitable test statistic t, say, is unusual. This involves fitting a model with no peak, and then simulating from it. We read fitted values m̂_0(x) from the robust generalized additive model fit, but with 2.2 df (chosen by eye as the smallest for which the curve is flat through the first two levels of concentration). We then generate bootstrap responses by setting y* = m̂_0(x) + ε*, where the ε* are chosen randomly from the modified residuals at that x. We take as test statistic the difference between the highest fitted value and the fitted value at x = 0.

nitro.test <- fitted(gam(total~s(conc,df=2.2),robust(poisson),
                         data=nitrofen))
f <- predict(nitro.glm,nitro,"response")
nitro.orig <- max(f) - f[1]
res <- (nitrofen$total-nitro.test)/sqrt(1-0.1)
nitro1 <- data.frame(nitrofen,res=res,fit=nitro.test)
nitro1.fun <- function(data, i, nitro)
{ assign("d",data[i,],frame=1)
  d$total <- round(d$fit+d$res[i])
  d.fit <- glm(total~conc+I(conc^2),poisson,data=d)
  f <- predict(d.fit,nitro,"response")
  max(f)-f[1] }
nitro1.boot <- boot(nitro1, nitro1.fun, R=99,
                    strata=rep(1:5,rep(10,5)), nitro=nitro)
(1+sum(nitro1.boot$t>nitro.orig))/(1+nitro1.boot$R)

Do your conclusions change if other smooth curves are fitted?
(Section 7.6.2; Bailer and Oris, 1994)

8
Complex Dependence

8.1 Introduction
In previous chapters our models have involved variables independent at some level, and we have been able to identify independent components that can be simulated. Where a model can be fitted and residuals of some sort identified, the same ideas can be applied in the more complex problems discussed in this chapter. Where that model is parametric, parametric simulation can in principle be used to obtain resamples, though Markov chain Monte Carlo techniques may be needed in practice. But in nonparametric situations the dependence may be so complex, or our knowledge of it so limited, that neither of these approaches is feasible. Of course some assumption of repeatedness within the data is essential, or it is impossible to proceed. But the repeatability may not be at the level of individual observations, but of groups of them, and there is typically dependence between as well as within groups. This leads to the idea of constructing bootstrap data by taking blocks of some sort from the original observations. The area is in rapid development, so we avoid a detailed mathematical exposition, and merely sketch key aspects of the main ideas. In Section 8.2 we describe some of the resampling schemes proposed for time series. Section 8.3 outlines some ideas useful in resampling point processes.

8.2 Time Series


8.2.1 Introduction
A time series is a sequence of observations arising in succession, usually at times spaced equally and taken to be integers. Most models for time series assume that the data are stationary, in which case the joint distribution of any subset of them depends only on their times of occurrence relative to each other and not on their absolute position in the series. A weaker assumption used in data analysis is that the joint second moments of observations depend only on their relative positions; such a series is said to be second-order or weakly stationary.

Time domain
There are two basic types of summary quantities for stationary time series. The first, in the time domain, rests on the joint moments of the observations. Let {Y_j} be a second-order stationary time series, with zero mean and autocovariance function γ_j. That is, E(Y_j) = 0 and cov(Y_k, Y_{k+j}) = γ_j for all k and j; the variance of Y_j is γ_0. Then the autocorrelation function of the series is ρ_j = γ_j/γ_0, for j = 0, ±1, ..., which measures the correlation between observations at lag j apart; of course −1 ≤ ρ_j ≤ 1, ρ_0 = 1, and ρ_j = ρ_{−j}. An uncorrelated series would have ρ_j = 0, and if the data were normally distributed this would imply that the observations were independent.
For example, the stationary moving average process of order one, or MA(1) model, has

   Y_j = ε_j + βε_{j−1},    j = ..., −1, 0, 1, ...,    (8.1)

where {ε_j} is a white noise process of innovations, that is, a stream of independent observations with mean zero and variance σ². The autocorrelation function for the {Y_j} is ρ_1 = β/(1 + β²) and ρ_j = 0 for |j| > 1; this sharp cut-off in the autocorrelations is characteristic of a moving average process. Only if β = 0 is the series Y_j independent. On the other hand the stationary autoregressive process of order one, or AR(1) model, has

   Y_j = αY_{j−1} + ε_j,    j = ..., −1, 0, 1, ...,    |α| < 1.    (8.2)

The autocorrelation function for this process is ρ_j = α^{|j|} for j = ±1, ±2 and so forth, so large α gives high correlation between successive observations. The autocorrelation function decreases rapidly for both models (8.1) and (8.2).
A close relative of the autocorrelation function is the partial autocorrelation function, defined as ρ′_j = γ′_j/γ_0, where γ′_j is the covariance between Y_k and Y_{k+j} after adjusting for the intervening observations. The partial autocorrelations for the MA(1) model are

   ρ′_j = −(−β)^{|j|}(1 − β²){1 − β^{2(|j|+1)}}^{−1},    j = ±1, ±2, ....

The AR(1) model has ρ′_1 = α, and ρ′_j = 0 for |j| > 1; a sharp cut-off in the partial autocorrelations is characteristic of autoregressive processes.
The sample estimates of ρ_j and ρ′_j are basic summaries of the structure of a time series. Plots of them against j are called the correlogram and partial correlogram of the series.
One widely used class of linear time series models is the autoregressive-moving average or ARMA process. The general ARMA(p, q) model is defined by

   Y_j = Σ_{k=1}^p α_k Y_{j−k} + ε_j + Σ_{k=1}^q β_k ε_{j−k},    (8.3)

where {ε_j} is a white noise process. If all the α_k equal zero, {Y_j} is the moving average process MA(q), whereas if all the β_k equal zero, it is AR(p). In order for (8.3) to represent a stationary series, conditions must be placed on the coefficients. Packaged routines enable models (8.3) to be fitted readily, while series from them are easily simulated using a given innovation series ..., ε_{−1}, ε_0, ε_1, ....
1, o , j , . . . .

Frequency domain
The second approach to time series is based on the frequency domain. The spectrum of a stationary series with autocovariances γ_j is

   g(ω) = γ_0 + 2 Σ_{j=1}^∞ γ_j cos(ωj),    0 ≤ ω ≤ π.    (8.4)

This summarizes the values of all the autocorrelations of {Y_j}. A white noise process has the flat spectrum g(ω) = γ_0, while a sharp peak in g(ω) corresponds to a strong periodic component in the series. For example, the spectrum for a stationary AR(1) model is g(ω) = σ²{1 − 2α cos(ω) + α²}^{−1}.
The empirical Fourier transform plays a key role in data analysis in the frequency domain. The treatment is simplified if we relabel the series as y_0, ..., y_{n−1}, and suppose that n = 2n_F + 1 is odd. Let ζ = e^{2πi/n} be the nth complex root of unity, so ζ^n = 1. Then the empirical Fourier transform of the data is the set of n complex-valued quantities

   ỹ_k = Σ_{j=0}^{n−1} ζ^{jk} y_j,    k = 0, ..., n − 1;

note that ỹ_0 = nȳ and that the complex conjugate of ỹ_k is ỹ_{n−k}, for k = 1, ..., n − 1. For different k the vectors (1, ζ^k, ..., ζ^{(n−1)k}) are orthogonal. It is straightforward to see that

   n^{−1} Σ_{k=0}^{n−1} ζ^{−jk} ỹ_k = y_j,    j = 0, ..., n − 1,

so this inverse Fourier transform retrieves the data. Now define the Fourier frequencies ω_k = 2πk/n, for k = 1, ..., n_F. The sample analogue of the spectrum at ω_k is the periodogram,

   I(ω_k) = n^{−1}|ỹ_k|² = n^{−1}[ {Σ_{j=0}^{n−1} y_j cos(ω_k j)}² + {Σ_{j=0}^{n−1} y_j sin(ω_k j)}² ].

The orthogonality properties of the vectors involved in the Fourier transform imply that the overall sum of squares of the data may be expressed as

   Σ_{j=0}^{n−1} y_j² = nȳ² + 2 Σ_{k=1}^{n_F} I(ω_k).    (8.5)

The empirical Fourier transform and its inverse can be rapidly calculated by an algorithm known as the fast Fourier transform.
If the data arise from a stationary process {Y_j} with spectrum g(ω), where Y_j = Σ_{l=−∞}^∞ a_{j−l} ε_l with {ε_l} a normal white noise process, then as n increases and provided the terms |a_l| decrease sufficiently fast as l → ∞, the real and imaginary parts of the complex-valued random variables ỹ_1, ..., ỹ_{n_F} are asymptotically independent normal variables with means zero and variances ng(ω_1)/2, ..., ng(ω_{n_F})/2; furthermore the ỹ_k at different Fourier frequencies are asymptotically independent. This implies that as n → ∞ for such a process, the periodogram values I(ω_k) at different Fourier frequencies will be independent, and that I(ω_k) will have an exponential distribution with mean g(ω_k). (If n is even I(π) must be added to (8.5); I(π) is approximately independent of the I(ω_k) and its asymptotic distribution is g(π)χ²_1.) Thus (8.5) decomposes the total sum of squares into asymptotically independent components, each associated with the amount of variation due to a particular Fourier frequency. Weaker versions of these results hold when the process is not linear, or when the process {ε_l} is not normal, the key difference being that the joint limiting distribution of the periodogram values holds only for a finite number of fixed frequencies.
If the series is white noise, under mild conditions its periodogram ordinates I(ω_1), ..., I(ω_{n_F}) are roughly a random sample from an exponential distribution with mean γ_0. Tests of independence may be based on the cumulative periodogram ordinates,

   C_k = Σ_{j=1}^k I(ω_j) / Σ_{j=1}^{n_F} I(ω_j),    k = 1, ..., n_F − 1.

When the data are white noise these ordinates have roughly the same joint distribution as the order statistics of n_F − 1 uniform random variables.
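As an illustration of these frequency domain quantities, here is a minimal S-style sketch of the periodogram and the cumulative periodogram ordinates, computed using the fast Fourier transform; the function names are ours:

periodogram <- function(y)
{ n <- length(y)
  I <- Mod(fft(y))^2/n                        # I(omega_k) = |ytilde_k|^2/n
  nF <- floor((n-1)/2)
  list(omega=2*pi*(1:nF)/n, I=I[2:(nF+1)]) }
cum.pgram <- function(y)
{ I <- periodogram(y)$I
  cumsum(I)[-length(I)]/sum(I) }              # ordinates C_1, ..., C_{nF-1}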
Figure 8.1  Deseasonalized monthly average stage (metres) of the Rio Negro at Manaus, 1903-1992 (Sternberg, 1995).

Example 8.1 (Rio Negro data)  The data for our first time series example are monthly averages of the daily stages (heights) of the Rio Negro, 18 km upstream at Manaus, from 1903 to 1992, made available to us by Professors H. O'Reilly Sternberg and D. R. Brillinger of the University of California at Berkeley. Because of the tiny slope of the water surface and the lower courses of its flatland affluents, these data may be regarded as a reasonable approximation of the water level in the Amazon River at the confluence of the

two rivers. To remove the strong seasonal component, we subtract the average value for each month, giving the series of length n = 1080 shown in Figure 8.1.
For an initial example, we take the first ten years of observations. The top panels of Figure 8.2 show the correlogram and partial correlogram for this shorter series, with horizontal lines showing approximate 95% confidence limits for correlations from a white noise series. The shape of the correlogram and the cut-off in the partial correlogram suggest that a low-order autoregressive model will fit the data, which are quite highly correlated. The lower left panel of the figure shows the periodogram of the series, which displays the usual high variability associated with single periodogram ordinates. The lower right panel shows the cumulative periodogram, which lies well outside its overall 95% confidence band and clearly does not correspond to a white noise series.
An AR(2) model fitted to the shorter series gives α̂_1 = 1.14 and α̂_2 = −0.31, both with standard error 0.062, and estimated innovation variance 0.598. The left panel of Figure 8.3 shows a normal probability plot of the standardized residuals from this model, and the right panel shows the cumulative periodogram of the residual series. The residuals seem close to Gaussian white noise.

8.2.2 Model-based resampling

Figure 8.2  Summary plots for the Rio Negro data, 1903-1912. The top panels show the correlogram and partial correlogram for the series. The bottom panels show the periodogram and cumulative periodogram.

There are two approaches to resampling in the time domain. The first and simplest is analogous to model-based resampling in regression. The idea is to fit a suitable model to the data, to construct residuals from the fitted model, and then to generate new series by incorporating random samples from the

residuals into the fitted model. The residuals are typically recentred to have the same mean as the innovations of the model. About the simplest situation is when the AR(1) model (8.2) is fitted to an observed series y_1, ..., y_n, giving estimated autoregressive coefficient α̂ and estimated innovations

   ε̂_j = y_j − α̂ y_{j−1},    j = 2, ..., n;

ε̂_1 is unobtainable because y_0 is unknown.

Figure 8.3  Plots for residuals from AR(2) model fitted to the Rio Negro data, 1903-1912: normal Q-Q plot of the standardized residuals (left), and cumulative periodogram of the residual series (right).

Model-based resampling might then proceed by equi-probable sampling with replacement from the centred residuals ε̂_2 − ε̄, ..., ε̂_n − ε̄ to obtain simulated innovations ε*_0, ..., ε*_n, and then setting

y*_0 = ε*_0 and

   y*_j = α̂ y*_{j−1} + ε*_j,    j = 1, ..., n;    (8.6)

of course we must have |α̂| < 1. In fact the series so generated is not stationary, and it is better to start the series in equilibrium, or to generate a longer series of innovations and start (8.6) at j = −k, where the burn-in period −k, ..., 0 is chosen large enough to ensure that the observations y*_1, ..., y*_n are essentially stationary; the values y*_{−k}, ..., y*_0 are discarded.
Thus model-based resampling for time series is based on applying the defining equation(s) of the series to innovations resampled from residuals. This procedure is simple to apply, and leads to good theoretical behaviour for estimates based on such data when the model is correct. For example, studentized bootstrap confidence intervals for the autoregressive coefficients α_k in an AR(p) process enjoy the good asymptotic properties discussed in Section 5.4.1, provided that the model fitted is chosen correctly. Just as there, confidence intervals based on transformed statistics may be better in practice.
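For the AR(1) scheme just described, a minimal S-style sketch is as follows, assuming that the estimate alpha.hat is available; filter() with method "recursive" applies the defining equation (8.6), and the function name is ours:

ar1.boot <- function(y, alpha.hat, burn=100)
{ n <- length(y)
  e <- y[-1] - alpha.hat*y[-n]                      # estimated innovations
  eps <- sample(e - mean(e), n+burn, replace=TRUE)  # centred and resampled
  ystar <- filter(eps, alpha.hat, method="recursive")
  ystar[(burn+1):(burn+n)] }                        # discard the burn-in period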
Example 8.2 (Wool prices)  The Australian Wool Corporation monitors prices weekly when wool markets are held, and sets a minimum price just before each week's markets open. This reflects the overall price of wool for that week, but the prices actually paid can vary considerably relative to the minimum. The left panel of Figure 8.4 shows a plot of log(price paid/minimum price) for those weeks when markets were held from July 1976 to June 1984. The series does not seem stationary, having some of the characteristics of a random walk, as well as a possible overall trend.

Figure 8.4  Weekly log ratio of price paid to minimum price for Australian wool from July 1976 to June 1984 (Diggle, 1990, pp. 229-237). Left panel: original data. Right panel: first differences of data.

If the log ratio in week j follows a random walk, we have Y_j = Y_{j−1} + ε_j,

where the ε_j are white noise; a non-zero mean for the innovations ε_j will lead to drift in y_j. The right panel of Figure 8.4 shows the differenced series, e_j = y_j − y_{j−1}, which appears stationary apart from a change in the innovation variance at about the 100th week. In our analysis we drop the first 100 observations, leaving a differenced series of length 208.
An alternative to the random walk model is the AR(1) model

   Y_j − μ = α(Y_{j−1} − μ) + ε_j;    (8.7)

this gives the random walk when α = 1. If the innovations have mean zero and α is close to but less than one, (8.7) gives stationary data, though subject to the climbs and falls seen in the left panel of Figure 8.4. The implications for forecasting depend on the value of α, since the variance of a forecast is only asymptotically bounded when |α| < 1. We test the unit root hypothesis that the data are a random walk, or equivalently that α = 1, as follows.
Our test is based on the ordinary least squares estimate of α in the regression Y_j = μ + αY_{j−1} + ε_j for j = 2, ..., n, using test statistic T = (1 − α̂)/S, where S is the standard error for α̂ calculated using the usual formula for a straight-line regression model. Large values of T are evidence against the random walk hypothesis, with or without drift. The observed value of T is t = 1.19. The distribution of T is far from the usual standard normal, however, because of the regression of each observation on its predecessor.
Under the hypothesis that α = 1 we simulate new time series Y*_1, ..., Y*_n by generating a bootstrap sample ε*_2, ..., ε*_n from the differences e_2, ..., e_n and then setting Y*_1 = Y_1, Y*_2 = Y*_1 + ε*_2, and Y*_j = Y*_{j−1} + ε*_j for subsequent j. This is (8.6) applied with the null hypothesis value α = 1. The value of T* is then obtained from the regression of Y*_j on Y*_{j−1} for j = 2, ..., n.

Figure 8.5  Results for 199 replicates of the random walk test statistic, T*. The left panel is a normal plot of t*. The right panel shows t* plotted against the inverse sum of squares for the regressor, with the dotted line giving the observed value.

The left panel

of Figure 8.5 shows the empirical distribution of T* in 199 simulations. The distribution is close to normal with mean 1.17 and variance 0.88. The observed significance level for t is (97 + 1)/(199 + 1) = 0.49: there is no evidence against the random walk hypothesis.
The right panel of Figure 8.5 shows the values of t* plotted against the inverse sum of squares for the regressor y*_{j−1}. In a conventional regression, inference is usually conditional on this sum of squares, which determines the precision of the estimate. The dotted line shows the observed sum of squares. If the conditional distribution of t* is thought to be appropriate here, the distribution of values of t* close to the dotted line shows that the conditional significance level is even higher; there is no evidence against the random walk conditionally or unconditionally.
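The test can be sketched in S as follows, for a series y; the function names are ours, and the standard error is taken from the usual straight-line regression summary:

rw.stat <- function(y)
{ n <- length(y)
  fit <- lm(y[-1] ~ y[-n])                       # regression of Y_j on Y_{j-1}
  (1 - coef(fit)[2])/summary(fit)$coefficients[2,2] }
rw.test <- function(y, R=199)
{ e <- diff(y)
  tstar <- numeric(R)
  for (r in 1:R)
  { ystar <- y[1] + c(0, cumsum(sample(e, replace=TRUE)))
    tstar[r] <- rw.stat(ystar) }                 # T* under null resampling
  (1 + sum(tstar >= rw.stat(y)))/(R + 1) }       # bootstrap significance level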

Models are commonly fitted in order to predict future values of a time series, but as in other settings, it can be difficult to allow for the various sources of uncertainty that affect the predictions. The next example shows how bootstrap methods can give some idea of the relative contributions from innovations, estimation error, and model error.

Example 8.3 (Sunspot numbers)  Figure 8.6 shows the much-analysed annual sunspot numbers y_1, ..., y_289 from 1700-1988. The data show a strong cycle with a period of about 11 years, and some hint of non-reversibility, which shows up as a lack of symmetry in the peaks. We use values from 1930-1979 to predict the numbers of sunspots over the next few years, based on fitting
Figure 8.6  Annual sunspot numbers, 1700-1988 (Tong, 1990, p. 470).

Table 8.1  Predictions and their standard errors for 2{(y_j + 1)^{1/2} − 1} for sunspot data, 1980-1988, based on data for 1930-1979. The standard errors are nominal, and also those obtained under model-based resampling assuming the simulated series y* are AR(9), not assuming y* is AR(9), and by a conditional scheme, and the block and post-blackened bootstraps with block length l = 10. See Examples 8.3 and 8.5 for details.

                          1980    81    82    83    84    85    86    87  1988
Actual                    23.0  21.8  19.6  14.4  11.7   6.7   5.6   9.0  18.1
Predicted                 21.6  18.9  14.9  12.2   9.1   7.5   6.8   8.8  13.6

Standard error
  Nominal                  2.0   2.9   3.2   3.2   3.2   3.2   3.3   3.4   3.4
  Model, AR(9)             2.2   2.9   3.0   3.2   3.3   3.8   4.1   4.0   3.6
  Model                    2.3   3.3   3.6   3.5   3.5   3.6   3.8   3.9   3.8
  Model, condit'l          2.5   3.6   4.1   3.9   3.8   3.8   3.9   4.0   4.1
  Block, l = 10            7.8   7.0   6.9   6.9   6.7   6.6   6.7   6.8   6.5
  Post-black'd, l = 10     2.1   3.3   3.9   4.0   3.6   3.6   3.9   4.3   4.3

AR(p) models

   Y_j − μ = Σ_{k=1}^p α_k (Y_{j−k} − μ) + ε_j,    (8.8)

to the transformed observations y_j = 2{(y_j + 1)^{1/2} − 1}; this transformation is chosen to stabilize the variance. The corresponding maximized log likelihoods are denoted ℓ̂_p. A standard approach to model selection is to select the model that minimizes AIC = −2ℓ̂_p + 2p, which trades off goodness of fit (measured by the maximized log likelihood) against model complexity (measured by p). Here the resulting best model is AR(9), whose predictions ŷ_j for 1980-88 and their nominal standard errors are given at the top of Table 8.1. These standard errors allow for prediction error due to the new innovations, but not for parameter estimation or model selection, so how useful are they?
To assess this we consider model-based simulation from (8.8) using centred residuals and the estimated coefficients of the fitted AR(9) model to generate series y*_{r1}, ..., y*_{r59}, corresponding to the period 1930-1988, for r = 1, ..., R. We then fit autoregressive models up to order p = 25 to y*_{r1}, ..., y*_{r50}, select the model giving the smallest AIC, and use this model to produce predictions ŷ*_{rj} for j = 51, ..., 59. The prediction error is y*_{rj} − ŷ*_{rj}, and the estimated standard errors of this are given in Table 8.1, based on R = 999 bootstrap series. The
orders of the fitted models were

   Order   1    2    3    4    5   6   7   8   9  10  11  12
   #       3  257  126  100  273  85  22  18  83  23   7   2

so the A R (9) m odel is chosen in only 8% o f cases, and m ost o f the m odels
selected are less com plicated. The fifth and sixth rows o f Table 8.1 give the
estim ated sta n d a rd errors o f the y y* using the 83 sim ulated series for
which the selected m odel was A R(9) and using all the series, based on the
999 replications. T here is ab o u t a 10-15% increase in stan d ard erro r due to
p aram eter estim ation, an d the stan dard errors for the A R (9) m odels are m ostly
smaller.
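A minimal S-style sketch of one replicate of this select-and-predict scheme, assuming ystar is a simulated series of length 59 on the transformed scale, and that ar() and its predict() method are available as in R; the function name is ours:

one.pred <- function(ystar)
{ y1 <- ystar[1:50]
  fit <- ar(y1, order.max=25, aic=TRUE)           # order chosen by AIC
  yhat <- predict(fit, newdata=y1, n.ahead=9)$pred
  ystar[51:59] - yhat }                           # the nine prediction errors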
Prediction errors should take account of the values of y_j immediately prior to the forecast period, since presumably these are relevant to the predictions actually made. Predictions that follow on from the observed data can be obtained by using innovations sampled at random except for the period j = n − k + 1, ..., n, where we use the residuals actually observed. Taking k = n yields the original series, in which case the only variability in the ŷ*_{rj} is due to the innovations in the forecast period; the standard errors of the predictions will then be close to the nominal standard error. However, if k is small relative to n, the differences y*_{rj} − ŷ*_{rj} will largely reflect the variability due to the use of estimated parameters, although the ŷ*_{rj} will follow on from y_n. The conditional standard errors in Table 8.1, based on k = 9, are about 10% larger than the unconditional ones, and substantially larger than the nominal standard errors.
The distributions of the y*_j − ŷ*_j appear close to normal with zero means, and a summary of variation in terms of standard errors seems appropriate. There will clearly be difficulties with normal-based prediction intervals in 1985 and 1986, when the lower limits of 95% intervals for y are negative, and it might be better to give one-sided intervals for these years. It would be better to use a studentized version of y*_j − ŷ*_j if an appropriate standard error were readily available.
When bootstrap series are generated from the AR(9) model fitted to the data from 1700-1979, the orders of the fitted models are

   Order   5    9   10   11   12  13  14  15  16  17  18  19
   #       1  765   88   57   28  21  11  11   5   1   4  25

so the AR(9) model is chosen in about 75% of cases. There is a tendency for AIC to lead to overfitting: just one of the models has order less than 9. For this longer series parameter estimation and model selection inflate the nominal standard error by at most 6%.
The above analysis gives the variability of predictions based on selecting the model that minimizes AIC on the basis that an AR(9) model is correct, and does not give a true reflection of the error otherwise. Is an autoregressive or more generally a linear model appropriate? A test for linearity of a time series can be based on the non-additivity statistic T = w²(n − 2m − 2)/(RSS − w²), where RSS is the residual sum of squares for regression of (y_{m+1}, ..., y_n) on the (n − m) × (m + 1) matrix X whose jth row is (1, y_{m+j−1}, ..., y_j), with residuals q_j and fitted values g_j. Let q′_j denote the residuals from the regression of g_j² on X, and let w equal Σ q′_j q_j / (Σ q′_j²)^{1/2}. Then the approximate distribution of T is F_{1,n−2m−2}, with large values of T indicating potential nonlinearity. The observed value of T when m = 20 is 5.46, giving significance level 0.02, in good agreement with bootstrap simulations from the fitted AR(9) model. The significance level varies little for values of m from 6 to 30. There is good evidence that the series is nonlinear. We return to these data in Example 8.5.
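The statistic can be computed along the following lines; this S-style sketch assumes the definition of w given above, and the function name is ours:

nonlin.stat <- function(y, m)
{ n <- length(y)
  X <- embed(y, m+1)                     # rows (y_{m+j}, y_{m+j-1}, ..., y_j)
  fit <- lm(X[,1] ~ X[,-1])
  q <- resid(fit)
  qq <- resid(lm(fitted(fit)^2 ~ X[,-1]))
  w <- sum(qq*q)/sqrt(sum(qq^2))
  w^2*(n - 2*m - 2)/(sum(q^2) - w^2) }   # compare with F(1, n-2m-2)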

The major drawback with model-based resampling is that in practice not only the parameters of a model, but also its structure, must be identified from the data. If the chosen structure is incorrect, the resampled series will be generated from a wrong model, and hence they will not have the same statistical properties as the original data. This suggests that some allowance be made for model selection, as in Section 3.11, but it is unclear how to do this without some assumptions about the dependence structure of the process, as in the previous example. Of course this difficulty is less critical when the model selected is strongly indicated by subject-matter considerations or is well-supported by extensive data.

8.2.3 Block resampling
The second approach to resampling in the time domain treats as exchangeable not innovations, but blocks of consecutive observations. The simplest version of this idea divides the data into b non-overlapping blocks of length l, where we suppose that n = bl. We set z_1 = (y_1, ..., y_l), z_2 = (y_{l+1}, ..., y_{2l}), and so forth, giving blocks z_1, ..., z_b. The procedure is to take a bootstrap sample with equal probabilities b^{−1} from the z_i, and then to paste these end-to-end to form a new series. As a simple example, suppose that the original series is y_1, ..., y_12, and that we take l = 4 and b = 3. Then the blocks are z_1 = (y_1, y_2, y_3, y_4), z_2 = (y_5, y_6, y_7, y_8), and z_3 = (y_9, y_10, y_11, y_12). If the resampled blocks are z*_1 = z_2, z*_2 = z_1, and z*_3 = z_2, the new series of length 12 is

   {y*_j} = (z*_1, z*_2, z*_3) = (y_5, y_6, y_7, y_8, y_1, y_2, y_3, y_4, y_5, y_6, y_7, y_8).

In general, the resampled series are more like white noise than the original series, because of the joins between blocks where successive independently chosen z* meet.
The idea that underlies this block resampling scheme is that if the blocks are long enough, enough of the original dependence will be preserved in the resampled series that statistics t* calculated from {y*_j} will have approximately the same distribution as values t calculated from replicates of the original series. Clearly this approximation will be best if the dependence is weak and the blocks are as long as possible, thereby preserving the dependence more faithfully. On the other hand, the distinct values of t* must be as numerous as possible to provide a good estimate of the distribution of T, and this points towards short blocks. Theoretical work outlined below suggests that a compromise in which the block length l is of order n^γ for some γ in the interval (0, 1) balances these two conflicting needs. In this case both the block length l and the number of blocks b = n/l tend to infinity as n → ∞, though different values of γ are appropriate for different types of statistic t.
There are several variants on this resampling plan. One is to let the original blocks overlap, in our example giving the n − l + 1 = 9 blocks z_1 = (y_1, ..., y_4), z_2 = (y_2, ..., y_5), z_3 = (y_3, ..., y_6), and so forth up to z_9 = (y_9, ..., y_12). This incurs end effects, as the first and last l − 1 of the original observations appear in fewer blocks than the rest. Such effects can be removed by wrapping the data around a circle, in our example adding the blocks z_10 = (y_10, y_11, y_12, y_1), z_11 = (y_11, y_12, y_1, y_2), and z_12 = (y_12, y_1, y_2, y_3). This ensures that each of the original observations has an equal chance of appearing in a simulated series. End correction by wrapping also removes the minor problem with the non-overlapping scheme that the last block is shorter than the rest if n/l is not an integer.
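A minimal S-style sketch of the scheme with overlapping blocks and optional wrapping, for a series y; the function name is ours:

block.boot <- function(y, l, wrap=FALSE)
{ n <- length(y)
  yy <- if (wrap) c(y, y[1:(l-1)]) else y           # wrap data around a circle
  starts <- sample(length(yy)-l+1, ceiling(n/l), replace=TRUE)
  ystar <- unlist(lapply(starts, function(s) yy[s:(s+l-1)]))
  ystar[1:n] }                                      # paste end-to-end, then trim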

Post-blackening
The most important difficulty with resampling schemes based on blocks is that they generate series that are less dependent than the original data. In some circumstances this can lead to catastrophically bad resampling approximations, as we shall see in Example 8.4. It is clearly inappropriate to take blocks of length l = 1 when resampling dependent data, for the resampled series is then white noise, but the whitening can remain substantial for small and moderate values of l. This suggests a strategy intermediate between model-based and block resampling. The idea is to pre-whiten the series by fitting a model that is intended to remove much of the dependence between the original observations. A series of innovations is then generated by block resampling of residuals from the fitted model, and the innovation series is then 'post-blackened' by applying the estimated model to the resampled innovations. Thus if an AR(1) model is used to pre-whiten the original data, new series are generated by applying (8.6) but with the innovation series {ε*_j} sampled not independently but in blocks taken from the centred residual series ε̂_2 − ε̄, ..., ε̂_n − ε̄.

Blocks of blocks
A different approach to removing the whitening effect of block resampling is to resample blocks of blocks. Suppose that the focus of interest is a statistic T which estimates θ and depends only on blocks of m successive observations. An example is the lag k autocovariance (n − k)^{−1} Σ_{j=1}^{n−k} (y_j − ȳ)(y_{j+k} − ȳ), for which m = k + 1. Then unless l ≫ m the distribution of T* − t is typically a poor approximation to that of T − θ, because a substantial proportion of the pairs (Y*_j, Y*_{j+k}) in a resampled series will lie across a join between blocks, and will therefore be independent. To implement resampling blocks of blocks we define a new m-variate process {Y′_j} for which Y′_j = (Y_j, ..., Y_{j+m−1}), rewrite T so that it involves averages of the Y′_j, and resample blocks of the new data y′_1, ..., y′_{n−m+1}, each of the observations of which is a block of the original data. For the lag 1 autocovariance, for example, we set

   y′_j = (y_j, y_{j+1})^T,    j = 1, ..., n − 1,

and write t = (n − 1)^{−1} Σ (y′_{1j} − ȳ′_1)(y′_{2j} − ȳ′_2). The key point is that t should not compare observations adjacent in each row. With n = 12 and l = 4 a bootstrap replicate might be

   y_5 y_6 y_7 y_8   y_1 y_2 y_3 y_4   y_8 y_9  y_10 y_11
   y_6 y_7 y_8 y_9   y_2 y_3 y_4 y_5   y_9 y_10 y_11 y_12

Since a bootstrap version of t based on this series will only contain products of (centred) adjacent observations of the original data, the whitening due to resampling blocks will be reduced, though not entirely removed.
This approach leads to a shorter series being resampled, but this is unimportant relative to the gain from avoiding whitening.
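For the lag 1 autocovariance this blocks-of-blocks scheme can be sketched as follows, for a series y and block length l; the function name is ours:

lag1.bob <- function(y, l)
{ n <- length(y)
  yp <- embed(y, 2)[,2:1]                 # rows y'_j = (y_j, y_{j+1})
  starts <- sample(nrow(yp)-l+1, ceiling(nrow(yp)/l), replace=TRUE)
  rows <- unlist(lapply(starts, function(s) s:(s+l-1)))[1:(n-1)]
  ypstar <- yp[rows,]
  mean((ypstar[,1]-mean(ypstar[,1]))*(ypstar[,2]-mean(ypstar[,2]))) }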
Stationary bootstrap
A further but less important difficulty with these block schemes is that the artificial series generated by them are not stationary, because the joint distribution of resampled observations close to a join between blocks differs from that in the centre of a block. This can be overcome by taking blocks of random length. The stationary bootstrap takes blocks whose lengths L are geometrically distributed, with density

   Pr(L = j) = (1 − p)^{j−1} p,    j = 1, 2, ....

This yields resampled series that are stationary with mean block length l̄ = p^{−1}. Properties of this scheme are explored in Problems 8.1 and 8.2.
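A minimal S-style sketch of the stationary bootstrap, wrapping the series around a circle; the function name is ours:

stat.boot <- function(y, p)
{ n <- length(y)
  ystar <- numeric(0)
  while (length(ystar) < n)
  { s <- sample(n, 1)                       # random block start
    L <- rgeom(1, p) + 1                    # Pr(L = j) = (1-p)^(j-1) p
    ystar <- c(ystar, y[(s-1+0:(L-1))%%n + 1]) }
  ystar[1:n] }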
Figure 8.7  Resamples from the shorter Rio Negro data. The top panel shows the original series, followed by three series generated by model-based sampling from the fitted AR(2) model, then three series generated using the block bootstrap with l = 24 and no end correction, and three series made using the post-blackened method, with the same blocks as the block series and the fitted AR(2) model.

Example 8.4 (Rio Negro data)  To illustrate these resampling schemes we consider the shorter series of river stages, of length 120, with its average subtracted. Figure 8.7 shows the original series, followed by three bootstrap

series generated by model-based sampling from the AR(2) model. The next three panels show series generated using the block bootstrap with length l = 24 and no wrapping. There are some sharp jumps at the ends of contiguous blocks in the resampled series. The bottom panels show series generated using the same blocks applied to the residuals, and then post-blackened using the AR(2) model. The jumps from using the block bootstrap are largely removed by post-blackening.
For a more systematic comparison of the methods, we generated 200 bootstrap replicates under different resampling plans. For each plan we calculated the standard error SE* of the average ȳ* of the resampled series, and the average of the first three autocorrelation coefficients. The more dependent

the series, the larger we expect SE* and the autocorrelation coefficients to be. Table 8.2 gives the results. The top two rows show the correlations in the data and approximate standard errors for the resampling results below.

Table 8.2  Comparison of time-domain resampling plans applied to the average and first three autocorrelation coefficients for the Rio Negro data, 1903-1912.

   Resampling plan    Details          SE*     ρ*_1    ρ*_2    ρ*_3
   Original values                             0.85    0.62    0.45
   Sampling SE                        0.002    0.007   0.010   0.017
   Model-based        AR(1)           0.49     0.82    0.67    0.54
                      AR(2)           0.34     0.83    0.60    0.38
                      AR(3)           0.44     0.83    0.58    0.39
   Blockwise          l = 2           0.20     0.41   -0.02   -0.01
                      l = 5           0.26     0.67    0.35    0.14
                      l = 10          0.33     0.75    0.47    0.27
                      l = 20          0.33     0.79    0.54    0.35
   Stationary         l = 2           0.25     0.56    0.13    0.03
                      l = 5           0.28     0.74    0.37    0.20
                      l = 10          0.31     0.79    0.47    0.28
                      l = 20          0.28     0.79    0.54    0.36
   Blocks of blocks   l = 2           0.26     0.85    0.63    0.45
                      l = 5           0.33     0.85    0.63    0.45
                      l = 10          0.33     0.85    0.64    0.47
                      l = 20          0.40     0.85    0.64    0.48
   Post-blackened     AR(2), l = 2    0.39     0.83    0.59    0.40
                      AR(1), l = 2    0.58     0.85    0.69    0.66
                      AR(3), l = 2    0.43     0.83    0.58    0.38
The results for model-based simulation depend on the model used, although the overfitted AR(3) model gives results similar to the AR(2). The AR(1) model adds correlation not present in the original data.
The block method is applied with no end correction, but further simulations show that it makes little difference. Block length has a dramatic effect, and in particular, block length l = 2 essentially removes correlation at lags larger than one. Even blocks of length 20 give resampled data noticeably less dependent than the original series.
The whitening is overcome by resampling blocks of blocks. We took blocks of length m = 4, so that the m-variate series had length 117. The mean resampled autocorrelations are essentially unchanged even with l = 2, while SE* does depend on block length.


The stationary bootstrap is used with end correction. The results are similar to those for the block bootstrap, except that the varying block length preserves slightly more of the original correlation structure; this is noticeable at l = 2.
Results for the post-blackened method with AR(2) and AR(3) models are similar to those for the corresponding model-based schemes. The results for the post-blackened AR(1) scheme are intermediate between AR(1) and AR(2) model-based resampling, reflecting the fact that the AR(1) model underfits the data, and hence structure remains in the residuals. Longer blocks have little effect for the AR(2) and AR(3) models, but they bring results for the AR(1) model more into line with those for the others.

The previous example suggests that post-blackening generates resampled series with correlation structure similar to the original data. Correlation, however, is a measure of linear dependence. Is nonlinear dependence preserved by resampling blocks?
Example 8.5 (Sunspot numbers)  To assess the success of the block and post-blackened schemes in preserving nonlinearity, we applied them to the sunspot data, using l = 10. We saw in Example 8.3 that although the best autoregressive model for the transformed data is AR(9), the series is nonlinear. This nonlinearity must remain in the residuals, which are almost a linear transformation of the series. Figure 8.8 shows probability plots of the nonlinearity statistic T from Example 8.3, with m = 20, for the block and post-blackened bootstraps with l = 10. The results for model-based resampling of residuals are not shown but lie on the diagonal line, so it is clear that both schemes preserve some of the nonlinearity in the data, which must derive from lags up to 10. Curiously the post-blackened scheme seems to preserve more.
Table 8.1 gives the predictive standard errors for the years 1980-1988 when the simple block resampling scheme with l = 10 is applied to the data for 1930-1979. Once data for 1930-1988 have been generated, the procedure outlined in Example 8.3 is used to select, fit, and predict from an autoregressive model. Owing to the joins between blocks, the standard errors are much larger than for the other schemes, including the post-blackened one with l = 10, which gives results similar to but somewhat more variable than the model-based bootstraps. Unadorned block resampling seems inappropriate for assessing prediction error, as one would expect.

Choice of block length


Suppose that we want to use the block bootstrap to estimate some feature κ based on a series of length n. An example would be the standard error of the series average, as in the third column of Table 8.2. Different block lengths l result in different bootstrap estimates κ̂(n, l). Which should we use?

Figure 8.8 Distributions of nonlinearity statistic for block resampling schemes applied to sunspot data. The left panel shows R = 999 replicates of a test statistic for nonlinearity, based on detecting nonlinearity at up to 20 lags, for the block bootstrap with l = 10. The right panel shows the corresponding plot for the post-blackened bootstrap using the AR(9) model. Both panels plot the replicates against quantiles of the F distribution.

A key result is that under suitable assumptions and for large n and l the mean squared error of κ̂(n, l) is proportional to

    n^{−d} ( C₁ l^{−2} + C₂ n^{−1} l^c ),    (8.9)

where C₁ and C₂ depend only on κ and the dependence structure of the series. In (8.9), d = 2, c = 1 if κ is a bias or variance; d = 1, c = 2 if κ is a one-sided significance probability; and d = 2, c = 3 if κ is a two-sided significance probability. The justification for (8.9) when κ is a bias or a variance is discussed after the next example. The implication of (8.9) is that for large n, the mean squared error of κ̂(n, l) is minimized by taking l ∝ n^{1/(c+2)}, but we do not know the constant of proportionality. However, it can be estimated as follows. We guess an initial value of l, and simulate to obtain κ̂(n, l). We then take m < n and k < l, and calculate the values κ̂_j(m, k) from the n − m + 1 subseries y_j, …, y_{j+m−1}, for j = 1, …, n − m + 1. The estimated mean squared error for κ̂(m, k) from a series of length m with block size k is then

    MSE(m, k) = (n − m + 1)^{−1} Σ_{j=1}^{n−m+1} {κ̂_j(m, k) − κ̂(n, l)}².

By repeating this procedure for different values of k but the same m, we obtain the value k̂ for which MSE(m, k) is minimized. We then choose

    l̂ = k̂ × (n/m)^{1/(c+2)}    (8.10)

as the optimum block length for a series of length n, and calculate κ̂(n, l̂). This procedure eliminates the constant of proportionality. We can check on the adequacy of l̂ by repeating the procedure with initial value l = l̂, iterating if necessary.
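The recipe above is straightforward to script. The R sketch below is illustrative only: it uses the block.boot function given earlier and takes κ to be the variance of the series average, so that c = 1; all function and argument names are ours. In practice one might follow Example 8.6 in using a random subset of the subseries to reduce the computational load.

    khat <- function(y, k, R = 50)
      # block bootstrap estimate of kappa = var(Ybar) with block length k
      var(replicate(R, mean(block.boot(y, k))))

    choose.l <- function(y, l0, m, kk = 2:20, c = 1, R = 50) {
      n <- length(y)
      k.nl <- khat(y, l0, R = 999)            # kappa-hat(n, l) at the initial l
      mse <- sapply(kk, function(k) {
        kj <- sapply(1:(n - m + 1),           # kappa-hat_j(m, k), all subseries
                     function(j) khat(y[j:(j + m - 1)], k, R))
        mean((kj - k.nl)^2)                   # MSE(m, k)
      })
      kk[which.min(mse)] * (n/m)^(1/(c + 2))  # extrapolate with (8.10)
    }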


Figure 8.9 Ten-year running average of Manaus data (left), together with Abelson–Tukey coefficients (right) (Abelson and Tukey, 1963); the time axis runs in years from 1900 to 1980.

The minimum asymptotic mean squared error is n^{−d−2/(c+2)}(C₁ + C₂), so if the block length selection procedure is applicable,

    A(m) = log{min_k MSE(m, k)} + {d + 2/(c + 2)} log m

should be approximately independent of m. This suggests that values of A(m) for different m should be compared as a check on the asymptotics.
Example 8.6 (Rio Negro data)  There is concern that river heights at Manaus may be increasing due to deforestation, so we test for trend in the river series, a ten-year running average of which is shown in the left panel of Figure 8.9. There may be an upward trend, but it is hard to say whether the effect is real.

To proceed, we suppose that the data consist of a stationary time series to which has been added a monotonic trend. Our test statistic is T = Σ_{j=1}^{n} a_j Y_j, where the coefficients a_j are those of Abelson and Tukey (1963), which are optimal for detecting a monotonic trend in independent observations.

The plot of the a_j in the right panel of Figure 8.9 shows that T strongly contrasts the ends of the series. We can think of T as almost being a difference of averages for the two ends of the series, and this falls into the class of statistics for which the method of choosing the block length described above is appropriate. Resampling blocks of blocks would not be appropriate here.

The value of T for the full series is 7.908. Is this significantly large?
To simulate data under the null hypothesis of no trend, we use the stationary

Figure 8.10 Estimated variances of T* for Rio Negro data, for stationary (solid) and block (dots) bootstraps, plotted against block length. The left panel is for 1903–1912 (R = 999), the right panel is for the whole series (R = 199).

bootstrap with wrapping to generate new series Y*. We initially apply this to the shorter series of length 120, adjusted to have mean zero, for which T takes value 0.654. Under the null hypothesis the mean of T* = Σ a_j Y*_j is zero and the distribution of T* will be close to normal. We estimate its variance by taking the empirical variance of values T* generated with the stationary bootstrap. The left panel of Figure 8.10 shows these variances κ̂(n, l) based on different mean block lengths l, for both stationary and block bootstraps. The stationary bootstrap smooths the variances for different fixed block lengths, resulting in a fairly stable variance for l ≥ 6 or so. Variances of T* based on the block bootstrap are more variable and increase to a higher eventual value. The variances for the full series are larger and more variable.
In order to choose the block length l, we took 50 randomly selected subseries of m consecutive observations from the series with n = 120, and for each value of k = 2, …, 20 calculated values of κ̂(m, k) from R = 50 stationary bootstrap replicates. The left part of Table 8.3 shows the values k̂ that minimize the mean squared error for different possible values of κ̂(n, l). Note that the values of k̂ do not increase broadly with m, as the theory would predict they should. For smaller values of κ̂(n, l) the values of k̂ vary considerably, and even for κ̂(n, l) = 30 the corresponding values of l̂ as given by (8.10) with c = 1 and d = 2 vary from 12 to 20. The left panel of Figure 8.10 shows that for l in this range, the variance κ̂(n, l) takes value roughly 25. For κ̂(n, l) = 25, Table 8.3 gives l̂ in the range 8–20, so overall we take κ̂(n, l) = 25 based on the stationary bootstrap.

The right part of Table 8.3 gives the values of k̂ when the block bootstrap with wrapping is used. The series so generated are not exactly stationary, but are nearly so. Overall the values are more consistent than for the stationary



Table 8.3 Estimated values of k̂ for Rio Negro data, 1903–1912, based on stationary bootstrap with mean block length k applied to 50 subseries of length m (left figures) and block bootstrap with block length k applied to 50 subseries of length m (right figures); rows correspond to κ̂(n, l) = 15, 17.5, 20, 22.5, 25, 27.5, 30 and columns to m = 20, 30, 40, 50, 60, 70.
bootstrap, with broadly increasing values of k̂ within each row, provided κ̂(n, l) ≥ 20. For these values of κ̂(n, l), the values of k̂ suggest that l̂ lies in the range 5–8, giving κ̂(n, l) = 25 or slightly less. Thus both the stationary and the block bootstrap suggest that the variance of T is roughly 25, and since t = 0.654, there is no evidence of trend in the first ten years of data.

For the stationary bootstrap, the values of A(m) have smallest variance for κ̂(n, l) = 22.5, when they are 13.29, 13.66, 14.18, 14.01, 13.99 and 13.59 for m = 20, …, 70. For the block bootstrap the variance is smallest when κ̂(n, l) = 27.5, when the values are 13.86, 14.25, 14.63, 14.69, 14.73 and 14.44. However, the minimum mean squared error shows no obvious pattern for any value of κ̂(n, l), and it seems that the asymptotics apply adequately well here. Overall Table 8.3 suggests that a range of values of m should be used, and that results for different m are more consistent for the block than for the stationary bootstrap. For given values of m and k, the variances κ̂_j(m, k) have approximate gamma distributions, but calculation of their mean squared error on the variance-stabilizing log scale does little to improve matters.

For the stationary bootstrap applied to the full series, we take l in the range (8, 20) × (1080/120)^{1/3} = (17, 42), which gives variances 46–68, with average variance roughly 55. The corresponding range of l for the block bootstrap is 10–17, which gives variances κ̂(n, l) in the range 43–53 or so, with average value 47. In either case the lowest reasonable variance estimate is about 45. Since the value of t for the full series is 7.9, an approximate significance level for the hypothesis of no trend based on a normal approximation to T* is 1 − Φ(7.9/45^{1/2}) = 0.12. The evidence for trend based on the monthly data is thus fairly weak.

Some block theory


In order to gain some theoretical insight into block resampling and the fundamental approximation (8.9) which guides the choice of l, we examine the estimation of bias and variance for a special class of statistics.


Consider a stationary time series {Y_j} with mean μ and covariances γ_j = cov(Y_0, Y_j), and suppose that the parameter of interest is θ = h(μ). The obvious estimator of θ based on Y_1, …, Y_n is T = h(Ȳ), where Ȳ is the average of Y_1, …, Y_n; its bias and variance are

    β = E{h(Ȳ) − h(μ)} ≈ ½ h''(μ) var(Ȳ),    (8.11)
    ν = var{h(Ȳ)} ≈ h'(μ)² var(Ȳ),

by the delta method of Section 2.7.1. Note that

    var(Ȳ) = n^{−2} {n γ_0 + 2(n − 1) γ_1 + ⋯ + 2 γ_{n−1}} = n^{−2} ζ_n,

say, and that as n → ∞,

    n^{−1} ζ_n = γ_0 + 2 Σ_{j=1}^{n−1} (1 − j/n) γ_j → Σ_{j=−∞}^{∞} γ_j = ζ,

say. Therefore β ≈ ½ h''(μ) n^{−1} ζ, and ν ≈ h'(μ)² n^{−1} ζ for large n. Now suppose that we estimate β and ν by simple block resampling, with b non-overlapping blocks of length l, with n = bl, and use S_j to denote the average l^{−1} Σ_{i=(j−1)l+1}^{jl} Y_i of the jth block, for j = 1, …, b. Thus S̄ = Ȳ, and Ȳ* = b^{−1} Σ_{j=1}^{b} S*_j, where the S*_j are sampled independently from S_1, …, S_b. The bootstrap estimates of the bias and variance of T are

    β̂ = E*{h(Ȳ*) − h(Ȳ)} ≈ h'(Ȳ) E*(Ȳ* − Ȳ) + ½ h''(Ȳ) E*{(Ȳ* − Ȳ)²},    (8.12)
    ν̂ = var*{h(Ȳ*)} ≈ h'(Ȳ)² var*(Ȳ*).

What we want to know is how the accuracies of β̂ and ν̂ vary with l.

Since the blocks are non-overlapping,

    E*(Ȳ*) = S̄,    var*(Ȳ*) = b^{−2} Σ_{j=1}^{b} (S_j − S̄)².

It follows by comparing (8.11) and (8.12) that the means of β̂ and ν̂ will be asymptotically correct provided that when n is large, E{b^{−2} Σ (S_j − S̄)²} ∼ n^{−1} ζ. This will be so because Σ (S_j − S̄)² = Σ (S_j − μ)² − b(S̄ − μ)² has mean

    b var(S_1) − b var(S̄) = b (l^{−2} ζ_l − n^{−2} ζ_n) ∼ b l^{−1} ζ

if l → ∞ and l/n → 0 as n → ∞. To calculate approximations for the mean squared errors of β̂ and ν̂ requires more careful calculations and involves the variance of Σ (S_j − S̄)². This is messy in general, but the essential points remain under the simplifying assumptions that {Y_j} is an m-dependent normal process. In this case γ_{m+1} = γ_{m+2} = ⋯ = 0, and the third and higher cumulants of the


process are zero. Suppose also that m < l. Then the variance of Σ (S_j − S̄)² is approximately

    var{Σ (S_j − S̄)²} ≈ b var{(S_1 − μ)²} + 2(b − 1) cov{(S_1 − μ)², (S_2 − μ)²}.

For normal data,

    var{(S_1 − μ)²} ≈ 2 {var(S_1 − μ)}²,
    cov{(S_1 − μ)², (S_2 − μ)²} ≈ 2 {cov(S_1 − μ, S_2 − μ)}²,

so

    var{Σ (S_j − S̄)²} ≈ 2b (l^{−2} ζ_l)² + 4b (l^{−2} ζ'_l)²,

where under suitable conditions on the process,

    ζ'_l = γ_1 + 2γ_2 + ⋯ + l γ_l → Σ_{j=1}^{∞} j γ_j = τ,

say. After a delicate calculation we find that

    E(β̂) − β ∼ −½ h''(μ) × n^{−1} l^{−1} τ,    var(β̂) ∼ {½ h''(μ)}² × 2 l n^{−3} ζ²,    (8.13)
    E(ν̂) − ν ∼ −h'(μ)² × n^{−1} l^{−1} τ,    var(ν̂) ∼ h'(μ)⁴ × 2 l n^{−3} ζ²,    (8.14)

thus establishing that the mean squared errors of β̂ and ν̂ are of form (8.9).
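These quantities are simple to compute. The R sketch below (our own illustration) forms the non-overlapping block averages and evaluates the bootstrap bias and variance estimates (8.12) by simulation:

    block.bias.var <- function(y, l, h, R = 999) {
      b <- length(y) %/% l
      S <- colMeans(matrix(y[1:(b * l)], nrow = l))  # block averages S_1, ..., S_b
      tstar <- replicate(R, h(mean(sample(S, b, replace = TRUE))))  # h(Ybar*)
      c(beta = mean(tstar) - h(mean(S)),             # bias estimate
        v = var(tstar))                              # variance estimate
    }

For instance, block.bias.var(y, l = 10, h = function(u) u^2) estimates the bias and variance of the squared average of y.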


This development can clearly be extended to multivariate time series, and thence to more complicated parameters of a single series. For example, for the first-order correlation coefficient of the univariate series {X_j}, we would apply the argument to the trivariate series {Y_j} = {(X_j, X_j², X_j X_{j−1})} with mean μ = (μ_1, μ_2, μ_{12}), and set θ = h(μ_1, μ_2, μ_{12}) = (μ_{12} − μ_1²)/(μ_2 − μ_1²).

When overlapping blocks are resampled, the argument is similar but the details change. If the data are not wrapped around a circle, there are n − l + 1 blocks with averages S_j = l^{−1} Σ_{i=j}^{j+l−1} Y_i, and

    E*(Ȳ*) − Ȳ = {l(n − l + 1)}^{−1} [ l(l − 1) Ȳ − Σ_{i=1}^{l−1} (l − i)(Y_i + Y_{n−i+1}) ].    (8.15)

In this case the leading term of the expansion for β̂ is the product of h'(Ȳ) and the right-hand side of (8.15), so the bootstrap bias estimate for Ȳ as an estimator of θ = μ is non-zero, which is clearly misleading since E(T) = μ. With overlapping blocks, the properties of the bootstrap bias estimator depend on E*(Ȳ*) − Ȳ, and it turns out that its variance is an order of magnitude larger than for non-overlapping blocks. This difficulty can be removed by wrapping Y_1, …, Y_n around a circle and using n blocks, in which case E*(Ȳ*) = Ȳ, or by re-centring the bootstrap bias estimate to β̂ = E*{h(Ȳ*)} − h{E*(Ȳ*)}. In either case (8.13) and (8.14) apply. One asymptotic benefit of using overlapping


blocks when the re-centred estimator is used is that var(β̂) and var(ν̂) are reduced by a factor 2/3, though in practice the reduction may not be visible for small n.
The corresponding argument for tail probabilities involves Edgeworth expansions and is considerably more intricate than that sketched above.

Apart from smoothness conditions on h(·), the key requirement for the above argument to work is that τ and ζ be finite, and that the autocovariances decrease sharply enough for the various terms neglected to be negligible. This is the case if γ_j ∼ a^j for sufficiently large j and some a with |a| < 1, as is the case for stationary finite ARMA processes. However, if for large j we find that γ_j ∼ j^{−δ}, where ½ < δ < 1, then ζ and τ are not finite and the argument will fail. In this case g(ω) ∼ ω^{δ−1} for small ω, so long-range dependence of this sort is characterized by a pole in the spectrum at the origin, where ζ is proportional to g(0), the value of the spectrum. The data counterpart of this is a sharp increase in periodogram ordinates at small values of ω. Thus a careful examination of the periodogram near the origin and of the long-range correlation structure is essential before applying the block bootstrap to data.

8.2.4 Phase scrambling


Recall the basic stochastic properties of the empirical Fourier transform of a series y_0, …, y_{n−1} of length n = 2n_F + 1: for large n and under certain conditions on the process generating the data, the transformed values ỹ_k for k = 1, …, n_F are approximately independent, and their real and imaginary parts are approximately independent normal variables with means zero and variances n g(ω_k)/2, where ω_k = 2πk/n. The approximate independence of ỹ_1, …, ỹ_{n_F} suggests that, provided the conditions on the underlying process are met, the frequency domain is a better place to look for exchangeable components than the time domain. Expression (8.4) shows that the spectrum summarizes the covariance structure of the process {Y_j}, and correspondingly the periodogram values I(ω_k) = |ỹ_k|²/n summarize the second-order structure of the data, which as far as possible we should preserve when resampling. This suggests that we generate resamples by keeping fixed the moduli |ỹ_k|, but randomizing their phases U_k = arg ỹ_k, which anyway are asymptotically uniformly distributed on the interval [0, 2π), independent of the |ỹ_k|. This phase scrambling can be done in a variety of ways, one of which is the following.
Algorithm 8.1 (Phase scrambling)

1  Compute from the data y_0, …, y_{n−1} the empirical Fourier transform

       ỹ_k = Σ_{j=0}^{n−1} ζ^{jk} (y_j − ȳ),    k = 0, …, n − 1,

   where ζ = exp(2πi/n).

2  Set X_k = exp(iU_k) ỹ_k, k = 0, …, n − 1, where the U_k are independent variables uniform on [0, 2π).

3  Set

       ẽ*_k = 2^{−1/2} (X_k + X^c_{n−k}),    k = 0, …, n − 1,

   where superscript c denotes complex conjugate and we take X_n = X_0.

4  Apply the inverse Fourier transform to ẽ*_0, …, ẽ*_{n−1} to obtain

       Y*_j = ȳ + n^{−1} Σ_{k=0}^{n−1} ζ^{−jk} ẽ*_k,    j = 0, …, n − 1.

5  Calculate the bootstrap statistic T* from Y*_0, …, Y*_{n−1}.

Step 3 guarantees that ẽ*_k has complex conjugate ẽ*_{n−k}, and therefore that the bootstrap series Y*_0, …, Y*_{n−1} is real. An alternative to step 2 is to resample the U_k from the observed phases.

The bootstrap series always has average ȳ, which implies that phase scrambling should be applied only to statistics that are invariant to location changes of the original series; in fact it is useful only for linear contrasts of the y_j, as we shall see below. It is straightforward to see that

    Y*_j = ȳ + 2^{1/2} n^{−1} Σ_{l=0}^{n−1} (y_l − ȳ) Σ_{k=0}^{n−1} cos{2πk(l − j)/n + U_k},    j = 0, …, n − 1,    (8.16)

from which it follows that the bootstrap data are stationary, with covariances equal to the circular covariances of the original series, and that all their odd joint cumulants equal zero (Problem 8.4). This representation also makes it clear that the resampled series will be essentially linear with normal margins.
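Algorithm 8.1 transcribes directly into R; the sketch below is our own and assumes n odd, as in the text. R's fft computes Σ_j y_j exp(−2πijk/n), so its sign convention differs from that of the algorithm, but the difference cancels between the forward and inverse transforms.

    phase.scramble <- function(y) {
      n <- length(y)
      yt <- fft(y - mean(y))                    # step 1: transform of y_j - ybar
      X <- exp(1i * runif(n, 0, 2 * pi)) * yt   # step 2: randomize the phases
      Xc <- Conj(X)[c(1, n:2)]                  # X^c_{n-k}, with X_n = X_0
      e <- (X + Xc)/sqrt(2)                     # step 3: enforce conjugate symmetry
      mean(y) + Re(fft(e, inverse = TRUE))/n    # step 4: invert; result is real
    }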
The difference between phase scrambling and model-based resampling can be deduced from Algorithm 8.1. Under phase scrambling the empirical Fourier transform Ỹ*_k of the bootstrap series equals ẽ*_k, so

    |Ỹ*_k|² = |ỹ_k|² {1 + cos(U_k + U_{n−k})},    (8.17)

which gives

    E*(|Ỹ*_k|²) = |ỹ_k|²,    var*(|Ỹ*_k|²) = ½ |ỹ_k|⁴.

Under model-based resampling the approximate distribution of n^{−1} |Ỹ*_k|² is g(ω_k) X, where g(·) is the spectrum of the fitted model and X has a standard exponential distribution; this gives

    E*(n^{−1} |Ỹ*_k|²) = g(ω_k),    var*(n^{−1} |Ỹ*_k|²) = g(ω_k)².

Clearly these resampling schemes will give different results unless the quantities of interest depend only on the means of the |Ỹ*_k|², i.e. are essentially quadratic


Figure 8.11 Three time series generated by phase scrambling the shorter Rio Negro data.

in the data. Since the quantity of interest must also be location-invariant, this restricts the domain of phase scrambling to such tasks as estimating the variances of linear contrasts in the data.
Example 8.7 (Rio Negro data)  We assess empirical properties of phase scrambling using the first 120 months of the Rio Negro data, which we saw previously were well fitted by an AR(2) model with normal errors. Note that our statistic of interest, T = Σ a_j Y_j, has the necessary structure for phase scrambling not automatically to fail.

Figure 8.11 shows three phase scrambled datasets, which look similar to the AR(2) series in the second row of Figure 8.7.

The top panels of Figure 8.12 show the empirical Fourier transform for the original data and for one resample. Phase scrambling seems to have shrunk the moduli of the series towards zero, giving a resampled series with lower overall variability. The lower left panel shows smoothed periodograms for the original data and for 9 phase scrambled resamples, while the right panel shows corresponding results for simulation from the fitted AR(2) model. The results are quite different, and show that data generated by phase scrambling are less variable than those generated from the fitted model.

Resampling with 999 series generated from the fitted AR(2) model and by phase scrambling, the distribution of T* is close to normal under both schemes but it is less variable under phase scrambling; the estimated variances are 27.4 and 20.2. These are similar to the estimates of about 27.5 and 22.5 obtained using the block and stationary bootstraps.

Before applying phase scrambling to the full series, we must check that it shows no sign of nonlinearity or of long-range dependence, and that it is plausibly close to a linear series with normal errors. With m = 20 the nonlinearity statistic described in Example 8.3 takes value 0.015, and no value for m < 30 is greater than 0.84: this gives no evidence that the series is nonlinear. Moreover the periodogram shows no signs of a pole as ω → 0+, so long-range dependence seems to be absent. An AR(8) model fits the series well, but the residuals have heavier tails than the normal distribution, with kurtosis 1.2. The variance of T* under phase scrambling is about 51, which

Figure 8.12 Phase scrambling for the shorter Rio Negro data. The upper left panel shows an Argand diagram containing the empirical Fourier transform ỹ_k of the data, with phase scrambled ỹ*_k in the upper right panel. The lower panels show smoothed periodograms for the original data (heavy solid), 9 phase scrambled datasets (left) and 9 datasets generated from an AR(2) model (right); the theoretical AR(2) spectrum is the lighter solid line.

again is similar to the estimates from the block resampling schemes. Although this estimate may be untrustworthy, on the face of things it casts no doubt on the earlier conclusion that the evidence for trend is weak.

The discussion above suggests not only that phase scrambling should be confined to statistics that are linear contrasts, but also that it should be used only after careful scrutiny of the data to detect nonlinearity and long-range dependence. With non-normal data there is the further difficulty that the Fourier transform and its inverse are averaging operations, which can produce resampled data quite unlike the original series; see Problem 8.4 and Practical 8.3. In particular, when phase scrambling is used in a test of the null


hypothesis of linearity, it imposes on the distribution of the scrambled data the additional constraints of stationarity and a high degree of symmetry.

8.2.5 Periodogram resampling


Like time domain resampling methods, phase scrambling generates an entire new dataset. This is unnecessary for such problems as setting a confidence interval for the spectrum at a particular frequency, or for assessing the variability of an estimate that is based on periodogram values. There are well-established limiting results for the distributions of periodogram values, which under certain conditions are asymptotically independent exponential random variables, and this suggests that we somehow resample periodogram values.

The obvious approach is to note that if g†(ω_k) is a suitable consistent estimate of g(ω_k) based on data y_0, …, y_{n−1}, where n = 2n_F + 1, then for k = 1, …, n_F the residuals e_k = I(ω_k)/g†(ω_k) are approximately standard exponential variables. This suggests that we generate bootstrap periodogram values by setting I*(ω_k) = ğ(ω_k) e*_k, where ğ(ω_k) is also a consistent estimate of g(ω_k), and the e*_k are sampled randomly from the set (e_1/ē, …, e_{n_F}/ē), where ē is the average of e_1, …, e_{n_F}; this ensures that E*(e*) = 1. The choice of g†(ω) and ğ(ω) is discussed below.
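A minimal R version of this scheme, with the two spectrum estimates left as arguments to be supplied by the user, might be:

    periodogram.boot <- function(I, gdagger, gbreve) {
      # I: periodogram values I(w_k); gdagger, gbreve: consistent
      # estimates of g(w_k), here placeholder vectors supplied by the user
      e <- I/gdagger                 # approximately standard exponential residuals
      estar <- sample(e/mean(e), length(I), replace = TRUE)
      gbreve * estar                 # bootstrap periodogram values I*(w_k)
    }

All names here are ours; the essential points are the rescaling of the residuals to have unit mean and the use of two potentially different spectrum estimates.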

Such a resampling scheme will only work in special circumstances. To see why, we consider estimation of θ = ∫ a(ω) g(ω) dω by a statistic that can be written in the form

    T = (2π/n) Σ_{k=1}^{n_F} a_k I_k,

where I_k = I(ω_k), a_k = a(ω_k), and ω_k is the kth Fourier frequency. For a linear process

    Y_j = Σ_{i=−∞}^{∞} b_i ε_{j−i},

where {ε_i} is a stream of independent and identically distributed random variables with standardized fourth cumulant κ_4, the means and covariances of the I_k are approximately

    E(I_k) = g(ω_k),    cov(I_k, I_l) = g(ω_k) g(ω_l) (δ_{kl} + n^{−1} κ_4),

where δ_{kl} is the Kronecker delta symbol, which equals one if k = l and zero otherwise. From this it follows that under suitable conditions,

    E(T) → ∫ a(ω) g(ω) dω,
    var(T) ≈ n^{−1} [ 2π ∫ a²(ω) g²(ω) dω + κ_4 { ∫ a(ω) g(ω) dω }² ].    (8.18)


The bootstrap analogue of T is T* = (2π/n) Σ_k a_k I*_k, and under the resampling scheme described above this has mean and variance

    E*(T*) ≈ ∫ a(ω) ğ(ω) dω,    var*(T*) ≈ 2π n^{−1} ∫ a²(ω) ğ²(ω) dω.

For var*(T*) to converge to var(T) it is therefore necessary that κ_4 = 0, or that ∫ a(ω) g(ω) dω be asymptotically negligible relative to the first variance term. A process with normal innovations will have κ_4 = 0, but since this cannot be ensured in general, the structure of T must be examined carefully before this resampling scheme is applied; see Problem 8.6. One situation where it can be applied is kernel estimation of the spectral density g(·), as we now see.
applied is kernel density estim ation o f g( ), as we now see.
Example 8.8 (Spectral density estimation)  Suppose that our goal is inference for the spectral density g(η) at some η in the interval (0, π), and let our estimate of g(η) be

    T = ĝ(η) = (2π/n) Σ_k h^{−1} K{(η − ω_k)/h} I_k,

where K(·) is a symmetric PDF with mean zero and unit variance and h is a positive smoothing parameter. Then

    E(T) ≈ ∫ h^{−1} K{(η − ω)/h} g(ω) dω ≈ g(η) + ½ h² g''(η),
    var(T) ≈ (2π/nh) {g(η)}² ∫ K²(u) du + κ_4 n^{−1} { ∫ h^{−1} K{(η − ω)/h} g(ω) dω }².

Since we must have h → 0 as n → ∞ in order to remove the bias of T, the second term in the variance is asymptotically negligible relative to the first term, as is necessary for the resampling scheme outlined above to work with a time series for which κ_4 ≠ 0. Comparison of the variance and bias terms implies that the asymptotic form of the relative mean squared error for estimation of g(η) is minimized by taking h ∝ n^{−1/5}. However, there are two difficulties in using resampling to make inference about g(η) from T.

The first difficulty is analogous to that seen in Example 5.13, and appears on comparing T and its bootstrap analogue

    T* = (2π/n) Σ_k h^{−1} K{(η − ω_k)/h} I*_k.

We suppose that I*_k is generated using a kernel estimate ğ(ω_k) with smoothing parameter h̆. The standardized versions of T and T* are

    Z = (nhc)^{1/2} {T − g(η)}/g(η),    Z* = (nhc)^{1/2} {T* − ğ(η)}/ğ(η),

where c = {2π ∫ K²(u) du}^{−1}. These have means

    E(Z) ≈ (nhc)^{1/2} × ½ h² g''(η)/g(η),    E*(Z*) ≈ (nhc)^{1/2} × ½ h² ğ''(η)/ğ(η).

Considerations similar to those in Example 5.13 show that E*(Z*) ∼ E(Z) if h̆ → 0 such that h/h̆ → 0 as n → ∞.

The second difficulty concerns the variances of Z and Z*, which will both be approximately one if the rescaled residuals e_k have the same asymptotic distribution as the errors I_k/g(ω_k). For this to happen with g†(ω) a kernel estimate, it must have smoothing parameter h† ∝ n^{−1/4}. That is, asymptotically g†(ω) must be undersmoothed compared to the estimate that minimizes the asymptotic relative mean squared error of T.

Thus the application of the bootstrap outlined above involves three kernel spectral density estimates: the original, ĝ(ω), with h ∝ n^{−1/5}; a surrogate ğ(ω) for g(ω) used when generating bootstrap spectra, with smoothing parameter h̆ asymptotically larger than h; and g†(ω), from which residuals are obtained, with smoothing parameter h† ∝ n^{−1/4} asymptotically smaller than h. This raises substantial difficulties for practical application, which could be avoided by explicit correction to reduce the bias of T, or by taking h asymptotically narrower than n^{−1/5}, in which case the limiting means of Z and Z* equal zero.

For a numerical assessment of this procedure, we consider estimating the spectrum g(ω) = {1 − 2α cos(ω) + α²}^{−1} of an AR(1) process with α = 0.9 at η = π/2. The kernel K(·) is the standard normal PDF. Table 8.4 compares the means and variances of Z with the average means and variances of Z* for 1000 time series of various lengths, with normal and chi-squared innovations. The first set of results has bandwidths h = a n^{−1/5}, h† = a n^{−1/4}, and h̆ = a n^{−1/6}, with a chosen to minimize the asymptotic relative mean squared error of ĝ(η).

Even for time series of length 1025, the means and variances of Z and Z* can be quite different, with the variances more sensitive to the distribution of innovations. For the second block of numbers we took a non-optimal bandwidth h = a n^{−1/4}, and h† = h̆ = h. Although in this case the true and bootstrap moments agree better for normal innovations, the results for chi-squared innovations are almost as bad as previously, and it would be unwise to rely on the results even for fairly long series.

Mean and variance only summarize limited aspects of the distributions, and for a more detailed comparison we compare 1000 values of Z and of Z* for a particular series of length 257. The left panel of Figure 8.13 shows that the Z* are far from normally distributed, while the right panel compares the simulated Z and Z*. Although Z* captures the shape of the distribution of Z quite well, there is a clear difference in their means and variances, and confidence intervals for g(η) based on Z* can be expected to be poor.

Table 8.4 Comparison of actual and bootstrap means and variances for a standardized kernel spectral density estimate Z. For the means the upper figure is the average of Z from 1000 AR(1) time series with α = 0.9 and length n (n = 65, 129, 257, 513, 1025, ∞), and the lower figure is the average of E*(Z*) for those series; for the variances the upper and lower figures are estimates of var(Z) and E{var*(Z*)}. The upper 8 lines of results are for h ∝ n^{−1/5}, h† ∝ n^{−1/4}, and h̆ ∝ n^{−1/6}; for the lower 8 lines h = h† = h̆ ∝ n^{−1/4}.

Figure 8.13 Comparison of distributions of Z and Z* for time series of length 257. The left panel shows a normal plot of 1000 values of Z*, against quantiles of the standard normal. The right panel compares the distributions of Z and Z*.

8.3 Point Processes

8.3.1 Basic ideas

A point process is a collection of events in a continuum. Examples are times of arrivals at an intensive care unit, positions of trees in a forest, and epicentres


of earthquakes. Mathematical properties of such processes are determined by the joint distribution of the numbers of events in subsets of the continuum. Statistical analysis is based on some notion of repeatability, usually provided by assumptions of stationarity.

Let N(A) denote the number of events in a set A. A point process is stationary if Pr{N(A_1) = n_1, …, N(A_k) = n_k} is unaffected by applying the same translation to all the sets A_1, …, A_k, for any finite k. Under second-order stationarity only the first and joint second moments of the N(A_i) remain unchanged by translation.

For a stationary process E{N(A)} = λ|A|, where λ is the intensity of the process and |A| is the length, area, or volume of A. Second-order moment properties can be defined in various ways, with the most useful definition depending on the context.

The simplest stationary point process model is the homogeneous Poisson process, for which the random variables N(A_1), N(A_2) have independent Poisson distributions whenever A_1 and A_2 are disjoint. This completely random process is a natural standard with which to compare data, although it is rarely a plausible model. More realistic models of dependence can lead to estimation problems that seem analytically insuperable, and Monte Carlo methods are often used, particularly for spatial processes. In particular, simulation from fitted parametric models is often used as a baseline against which to judge data. This often involves graphical tests of the type outlined in Section 4.2.4. In practice the process is observed only in a finite region. This can give rise to edge effects, which are increasingly severe in higher dimensions.
Example 8.9 (Caveolae)  The upper left panel of Figure 8.14 shows the positions of n = 138 caveolae in a 500 unit square region, originally a 2.65 μm square of muscle fibre. The upper right panel shows a realization of a binomial process, for which n points were placed at random in the same region; this is an homogeneous Poisson process conditioned to have 138 events. The data seem to have fewer almost-coincident points than the simulation, but it is hard to be sure.

Spatial dependence is often summarized by K-functions. Suppose that the process is orderly and isotropic, i.e. multiple coincident events are precluded and joint probabilities are invariant under rotation as well as translation. Then a useful summary of spatial dependence is Ripley's K-function,

    K(t) = λ^{−1} E(#{events within distance t of an arbitrary event}),    t > 0.

The mean- and variance-stabilized function Z(t) = {K(t)/π}^{1/2} − t is sometimes used instead. For an homogeneous Poisson process, K(t) = πt². Empirical versions of K(t) must allow for edge effects, as made explicit in Example 8.12. The solid line in the lower left panel of Figure 8.14 is the empirical version


Figure 8.14 Muscle caveolae analysis. Top left: positions of 138 caveolae in a 500 unit square of muscle fibre (Appleyard et al., 1985). Top right: realization of an homogeneous binomial process with n = 138. Lower left: Ẑ(t) (solid), together with pointwise 95% confidence bands (dashes) and overall 92% confidence bands (dots) based on R = 999 simulated binomial processes. Lower right: corresponding results for R = 999 realizations of a fitted Strauss process. The horizontal axes of the lower panels show distance.

Ẑ(t) of Z(t). The dashed lines are pointwise 95% confidence bands from R = 999 realizations of the binomial process, and the dotted lines are overall bands with level about 92%, obtained by using the method outlined after (4.17) with k = 2. Relative to a Poisson process there is a significant deficiency of pairs of points lying close together, which confirms our previous impression.

The lower right panel of the figure shows the corresponding results for simulations from the Strauss process, a parametric model of interaction that can inhibit patterns in which pairs lie close together. This models the local behaviour of the data better than the stationary Poisson process.



8.3.2 Inhomogeneous Poisson processes


The sam pling plans used in the previous exam ple b o th assum e stationarity o f
the process underlying the data, an d rely on sim ulation from fitted param etric
models. Som etim es independent cases can be identified, in which case it m ay
be possible to avoid the assum ption o f stationarity.

Example 8.10 (Neurophysiological point process)  The data in Figure 8.15 were recorded by Dr S. J. Boniface of the Clinical Neurophysiology Unit at the Radcliffe Infirmary, Oxford, in a study of how a human subject responded to a stimulus. Each row of the left panel of the figure shows the times at which the firing of a motoneurone was observed, in an interval extending 250 ms either side of 100 applications of the stimulus, which is taken to be at time zero. Although little can be assumed about dependence within each interval, the stimuli were given far enough apart for firings in different intervals to be treated as independent. Firings occur at random about 100 ms apart prior to the stimulus, but on about one-third of occasions a firing is observed about 28 ms after it, and this partially synchronizes the firings immediately following.

Theoretical results imply that under mild conditions the process obtained by superposing all N = 100 intervals will be a Poisson process with time-varying intensity, Nλ(y). Here it seems plausible that the conditions are met: for example, 90 of the 100 intervals contain four or fewer events, so the overall intensity is not dominated by any single interval. The superposed data have n = 389 events whose times we denote by y_j.

Figure 8.15 Neurophysiological point process. The rows of the left panel show 100 replicates of the interval surrounding the times at which a human subject was given a stimulus; each point represents the time at which the firing of a neuron was observed. The right panels show a histogram and kernel intensity estimate (×10^{−2} ms^{−1}) from superposing the events on the left, which are shown by the rug in the lower right panel.


The right panels of Figure 8.15 show a histogram of the superposed data and a rescaled kernel estimate of the intensity λ(y) in units of 10^{−2} ms^{−1},

    λ̂(y; h) = 100 × (Nh)^{−1} Σ_{j=1}^{n} w{(y − y_j)/h},

where w(·) is a symmetric density with mean zero and unit variance; we use the standard normal density with bandwidth h = 7.5 ms. Over the observation period this estimate integrates to 100n/N. The estimated intensity is highly variable and it is unclear which of its features are spurious. We can try to construct a confidence region for λ(y) at a set 𝒴 of y values of interest, but the same problems arise as in Examples 5.13 and 8.8.

Once again the key difficulty is bias: λ̂(y; h) estimates not λ(y) but ∫ w(u) λ(y − hu) du. For large n and small h this means that

    E{λ̂(y; h)} ≈ λ(y) + ½ h² λ''(y),    var{λ̂(y; h)} ≈ c (Nh)^{−1} λ(y),

where c = ∫ w²(u) du. As in Example 5.13, the delta method (Section 2.7.1) implies that λ̂(y; h)^{1/2} has approximately constant variance ¼ c (Nh)^{−1}. We choose to work with the standardized quantities

    Z(y; h) = {λ̂^{1/2}(y; h) − λ^{1/2}(y)} / {½ (Nh)^{−1/2} c^{1/2}},    y ∈ 𝒴.

In principle an overall 1 − 2α confidence band for λ(y) over 𝒴 is determined by the quantiles z_{L,α}(h) and z_{U,α}(h) that satisfy

    1 − α = Pr{z_{L,α}(h) ≤ Z(y; h), y ∈ 𝒴} = Pr{Z(y; h) ≤ z_{U,α}(h), y ∈ 𝒴}.    (8.19)

The lower and upper limits of the band would then be

    [ {λ̂^{1/2}(y; h) − ½ (Nh)^{−1/2} c^{1/2} z_{U,α}(h)}², {λ̂^{1/2}(y; h) − ½ (Nh)^{−1/2} c^{1/2} z_{L,α}(h)}² ].    (8.20)

In practice we must use resampling analogues Z*(y; h) of Z(y; h) to estimate z_{L,α}(h) and z_{U,α}(h), and for this to be successful we must choose h and the resampling scheme to ensure that Z* and Z have approximately the same distributions.
distributions.
In this context there are a number of possible resampling schemes. The simplest is to take n events at random from the observed events. This relies on the independence assumptions for Poisson processes. A second scheme generates n* events from the observed events, where n* has a Poisson distribution with mean n. A more robust scheme is to superpose 100 resampled intervals, though this does not hold fixed the total number of events. These schemes would be inappropriate if the estimator of interest presupposed that events could not coincide, as did the K-function of Example 8.9.
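For concreteness, the kernel intensity estimate and the simplest of these schemes can be sketched in R as follows; yj stands for the vector of n superposed firing times, and the function names are ours.

    lambda.hat <- function(y, yj, h, N = 100)
      # kernel estimate of the intensity, in units of 10^-2 per ms
      100 * sapply(y, function(u) sum(dnorm((u - yj)/h)))/(N * h)

    ystar <- sample(yj, replace = TRUE)   # resample n events at random
    # lambda.hat(y, ystar, h = 7.5) is then the bootstrap intensity estimate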
For all of these resampling schemes the bootstrap estimators λ̂*(y; h) are unbiased for λ̂(y; h). The natural resampling analogue of Z is

    Z*(y; h) = [ {λ̂*(y; h)}^{1/2} − {λ̂(y; h)}^{1/2} ] / {½ (Nh)^{−1/2} c^{1/2}},

but E*(Z*) = 0 and E(Z) ≠ 0. This situation is analogous to that of Example 5.13, and the conclusion is the same: to make the first two moments of Z and Z* agree asymptotically, one must choose h ∝ N^{−γ} with γ > 1/5. Further detailed calculations for the joint distributions over 𝒴 suggest an upper limit for γ as well. The essential idea is that h should be smaller than is commonly used for point estimation of the intensity.

A quite different approach is to generate realizations of an inhomogeneous Poisson process from a smooth estimate λ̂(y; h) of the intensity. This can be achieved by using the smoothed bootstrap, as outlined in Section 3.4, and detailed in Problem 8.7. Under this scheme

    E*{λ̂*(y; h)} = ∫ λ̂(y − hu; h) w(u) du ≈ λ̂(y; h) + ½ h² λ̂''(y; h),

and the resampling analogue of Z is

    Z*(y; h) = [ {λ̂*(y; h)}^{1/2} − (E*{λ̂*(y; h)})^{1/2} ] / {½ (Nh)^{−1/2} c^{1/2}},

whose mean and variance closely match those of Z.


Whatever resampling scheme is employed, simulated values of Z* will be used to estimate the quantiles z_{L,α}(h) and z_{U,α}(h) in (8.19). If R realizations are generated, then we take z_{L,α}(h) and z_{U,α}(h) to be respectively the (R + 1)αth ordered values of

    min_{y∈𝒴} z*(y; h),    max_{y∈𝒴} z*(y; h).

The upper panel of Figure 8.16 shows overall 95% confidence bands for λ(y), based on λ̂(y; 5), using three of the sampling schemes described above. In each case R = 999, and z_{L,0.025}(5) and z_{U,0.025}(5) are estimated by the empirical 0.025 and 0.975 quantiles of the R replicates of min{z*(y; 5), y = −250, −248, …, 250} and max{z*(y; 5), y = −250, −248, …, 250}. Results from resampling intervals and events are almost indistinguishable, while generating data from a fitted intensity gives slightly smoother results. In order to avoid problems at the boundaries, the set 𝒴 is taken to be (−230, 230). The experimental setup implies that the intensity should be about 1 × 10^{−2} firings per millisecond, the only significant departure from which is in the range 0–130 ms, where there is strong evidence that the stimulus affects the firing rate.



Figure 8.16 Confidence bands for the intensity of the neurophysiological point process data. The upper panel shows the estimated intensity λ̂(y; 5) (×10^{−2} ms^{−1}) (heavy solid), with overall 95% equi-tailed confidence bands based on resampling intervals (solid), resampling events (dots), and generating events from a fitted intensity (dashes). The outer lines in the lower panel show the 2.5% and 97.5% quantiles of the standardized quantile processes z*(y; h) for resampling intervals (solid) and generating from a fitted intensity (dashes), while the lines close to zero are the bootstrap bias estimates for λ. Time axes in ms.

The lower panel of the figure shows ẑ_{0.025}(5), ẑ_{0.975}(5), and the bootstrap bias estimate for λ̂(y), for resampling intervals and for generating data from a fitted intensity function, with h = 7.5 ms. The quantile processes suggest that the variance-stabilizing transformation has worked well, but the double smoothing effect of the latter scheme shows in the bias. The behaviour of the quantile process near y = 50 ms, where there are no firings, suggests that a variable bandwidth smoother might be better.

Essentially the same ideas can be applied when the data are a single realization of an inhomogeneous Poisson process (Problem 8.8).

8.3.3 Tests of association

When a point process has events of different types, interest often centres on association between the different types of events, or between events and associated covariates. Then permutation or bootstrap tests may be appropriate, although the simulation scheme will depend on the context.
Example 8.11 (Spatial epidemiology) Suppose th a t events o f a point pattern
correspond to locations y o f cases o f a rare disease S> th a t is th ought to be
related to em issions from an industrial site at the origin, y = 0. A m odel for
the incidence o f Q) is th a t it occurs at rate /.(y) per person-year at location y,

422

8 Complex Dependence

where the suspicion is th a t X(y) decreases w ith distance from the origin. Since
the disease is rare, the n u m b er o f cases a t y will be well approxim ated by a
Poisson variable w ith m ean X{y)n(y), where fi(y) is the population density o f
susceptible persons a t y. T he null hypothesis is th a t My) = Xo, i.e. th a t y has
no effect on the intensity o f cases, o th er th an through /i(y). A crucial difficulty
is th at n{y) is unknow n an d will be h ard to estim ate from the d a ta available.
One ap p ro ach to testing for constancy o f X ( y ) is to com pare the p o int pattern
for 2> to th a t o f an o th er disease 2)'. This disease is chosen to have the same
populatio n o f susceptible individuals as 3), b u t its incidence is assum ed to be
unrelated to em issions from the site an d to incidence o f S>, and so it arises
with co n stan t b u t unknow n rate X p er person-year. If Sfi' is also rare, it will
be reasonable to suppose th a t the num b er o f cases o f
at y has a Poisson
distributio n w ith m ean X 'f i ( y ) . H ence the conditional probability o f a case o f
at y given th a t there is a case o f
o r 3 ' a t y is n { y ) = X { y ) / { X ' + A(y)}.
If the disease locations are indicated by yj, an d dj is zero o r one according as
the case a t yj has 3)' or Q>, the likelihood is
n ^ { i - ( y ^ .
j
If a suitable form for X(y) is assum ed we can o btain the likelihood ratio or
perhaps an o th er statistic T to test the hypothesis th at 7i(y) is constant. This
is a test o f pro p o rtio n al hazards for Q) and & , b u t unlike in Exam ple 4.4 the
alternative is specified, at least weakly.
W hen A(y) = Xo an ap proxim ation to the null distribution o f T can be
obtained by perm uting the labels on cases at different locations. T h at is, we
and 3l' to the yj, recom pute T
perform R ran d o m reallocations o f the labels
for each such reallocation, an d see w hether the observed value o f t is extrem e
relative to the sim ulated values t \ , . . . , t R.
m
Example 8.12 (Brambles)  The upper left panel of Figure 8.17 shows the locations of 103 newly emergent and 97 one-year-old bramble canes in a 4.5 m square plot. It seems plausible that these two types of event are related, but how should this be tested? Events of both types are clustered, so a Poisson null hypothesis is not appropriate, nor is it reasonable to permute the labels attached to events, as in the previous example.

Let us denote the locations of the two types of event by y_1, …, y_n and y′_1, …, y′_{n′}. Suppose that a statistic T = t(y_1, …, y_n, y′_1, …, y′_{n′}) is available that tests for association between the event types. If the extent of the observation region were infinite, we might construct a null distribution for T by applying random translations to events of one type. Thus we would generate values T* = t(y_1 + U*, …, y_n + U*, y′_1, …, y′_{n′}), where U* is a randomly chosen location in the plane. This sampling scheme has the desirable property of fixing the


Figure 8.17 Brambles data. Top left: positions of newly emergent (+) and one-year (·) bramble canes in a 4.5 m square plot. Top right: random toroidal shift of the newly emergent canes, with the original edges shown by dotted lines. Bottom left: original dependence function Ẑ_{12} (solid) and 20 replicates (dots) under the null hypothesis of no association between newly emergent and one-year canes. Bottom right: original dependence function and pointwise (dashes) and overall (dots) 95% null confidence sets. The data used here are the upper left quarter of those displayed on p. 113 of Diggle (1983).

relative locations of each type of event, but cannot be applied directly to the data in Figure 8.17 because the resampled patterns will not overlap by the same amount as the original.

We overcome this by random toroidal shifts, where we imagine that the pattern is wrapped on a torus, the random translation is applied, and the translated pattern is then unwrapped. Thus for points in the unit square we would generate U* = (U*_1, U*_2) at random in the unit square, and then map the event at y_j = (y_{1j}, y_{2j}) to y*_j = (y_{1j} + U*_1 − [y_{1j} + U*_1], y_{2j} + U*_2 − [y_{2j} + U*_2]), where [·] denotes integer part. The upper right panel of Figure 8.17 shows how such a shift uncouples the two types of events.
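In R a toroidal shift takes one line; in the sketch below (our own), xy is an n × 2 matrix of locations scaled to the unit square.

    toroidal.shift <- function(xy) {
      # translate by U*, then wrap both coordinates modulo 1,
      # which drops the integer parts as in the text
      sweep(xy, 2, runif(2), "+") %% 1
    }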


We can construct a test through an extension of the K-function to events of two types, that is the function

    K_{12}(t) = λ_2^{−1} E(#{type 2 events within distance t of an arbitrary type 1 event}),

where λ_2 is the overall intensity of type 2 events. Suppose that there are n_1, n_2 events of types 1 and 2 in an observation region A of area |A|, that u_{ij} is the distance from the ith type 1 event to the jth type 2 event, that w_i(u) is the proportion of the circumference of the circle that is centred at the ith type 1 event and has radius u that lies within A, and let I(·) denote the indicator of the event ·. Then the sample version of this bivariate K-function is

    K̂_{12}(t) = (n_1 n_2)^{−1} |A| Σ_{i=1}^{n_1} Σ_{j=1}^{n_2} w_i^{−1}(u_{ij}) I(u_{ij} ≤ t).

Although it is possible to base an overall statistic on K̂_{12}(t), for example taking T = ∫ Ẑ_{12}(t)² dt, where Ẑ_{12}(t) = {K̂_{12}(t)/π}^{1/2} − t, a graphical test is usually more informative.
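A bare-bones version of K̂_{12}(t) without the edge correction (that is, with w_i ≡ 1) shows the structure of the calculation; the full version would divide each indicator by w_i(u_{ij}). This R sketch and its names are ours.

    K12.hat <- function(xy1, xy2, t, area) {
      # xy1, xy2: matrices of type 1 and type 2 locations; area = |A|
      d <- sqrt(outer(xy1[, 1], xy2[, 1], "-")^2 +
                outer(xy1[, 2], xy2[, 2], "-")^2)  # all distances u_ij
      area * sapply(t, function(s) mean(d <= s))   # (n1 n2)^{-1}|A| sum I(u_ij <= s)
    }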
The lower left panel of Figure 8.17 shows results from 20 random toroidal shifts of the data. The original value of Ẑ_{12}(t) seems to show much stronger local association than do the simulations. This is confirmed by the lower right panel, which shows 95% pointwise and overall confidence bands for Ẑ_{12}(t) based on R = 999 shifts. There is clear evidence that the point patterns are not independent: as the original data suggest, new canes emerge close to those from the previous year.

8.3.4 Tiles
Little is known about resampling spatial processes when there is no parametric model. One nonparametric approach that has been investigated starts from a partition of the observation region ℛ into disjoint tiles 𝒜_1, …, 𝒜_n of equal size and shape. If we abuse notation by identifying each tile with the pattern it contains, we can write the original value of the statistic as T = t(𝒜_1, …, 𝒜_n). The idea is to create a resampled pattern by taking a random sample of tiles 𝒜*_1, …, 𝒜*_n from 𝒜_1, …, 𝒜_n, with corresponding bootstrap statistic T* = t(𝒜*_1, …, 𝒜*_n). The hope is that if dependence is relatively short-range, taking large tiles will preserve enough dependence to make the properties of T* close to those of T. If this is to work, the size of the tile must be chosen to trade off preserving dependence, which requires a few large tiles, and getting a good estimate of the distribution of T, which requires many tiles.

This idea is analogous to block resampling in time series, and is capable of similar variations. For example, rather than choosing the 𝒜*_j independently from the fixed tiles 𝒜_1, …, 𝒜_n, we may resample moving tiles by setting

Figure 8.18 Tile resampling for the caveolae data. The left panel shows the original data, with nine tiles sampled at random using toroidal wrapping. The right panel shows the resampled point pattern.

𝒜*_j = U_j + 𝒜_j, where U_j is a random vector chosen so that 𝒜*_j lies wholly within ℛ; we can avoid bias due to undersampling near the boundaries of ℛ by toroidal wrapping. As in all problems involving spatial data, edge effects are likely to play a critical role.
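A sketch of moving-tile resampling with toroidal wrapping, in R and with names of our own choosing: each resampled tile is a randomly translated square cut from the wrapped pattern, and the tiles are then laid out side by side as in Figure 8.18.

    tile.boot <- function(xy, side, ntile) {
      # xy: event locations in [0, side)^2; ntile: a perfect square
      g <- sqrt(ntile); tside <- side/g
      new <- NULL
      for (j in 1:ntile) {
        rel <- sweep(xy, 2, runif(2, 0, side), "-") %% side  # toroidal shift
        keep <- rel[, 1] < tside & rel[, 2] < tside          # events in the tile
        dest <- c((j - 1) %% g, (j - 1) %/% g) * tside       # position in the grid
        new <- rbind(new, sweep(rel[keep, , drop = FALSE], 2, dest, "+"))
      }
      new
    }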
Example 8.13 (Caveolae)  Figure 8.18 illustrates tile resampling for the data of Example 8.9. The left panel shows the original caveolae data, with the dotted lines showing nine square tiles taken using the moving scheme with toroidal wrapping. The right panel shows the resampled pattern obtained when the tiles are laid side-by-side. For example, the centre top and middle right tiles were respectively taken from the top left and bottom right of the original data. Along the tile edges, events seem to lie closer together than in the left panel; this is analogous to the whitening that occurs in blockwise resampling of time series. No analogue of the post-blackened bootstrap springs to mind, however.

For a numerical evaluation of tile resampling, we experimented with estimating the variance θ of the number of events in an observation region ℛ of side 200 units, using data generated from three random processes. In each case we generated 8800 events in a square of side 4000, then estimated θ from 2000 squares of side 200 taken at random. For each of 100 random squares of side 200 we calculated the empirical mean squared error for estimation of θ using bootstraps of size R, for both fixed and moving tiles. Data were generated from a spatial Poisson process (θ = 23.4), from the Strauss process that gave the results in the bottom right panel of Figure 8.14 (θ = 17.5), and from a sequential spatial inhibition process, which places points sequentially at random but not within 15 units of an existing event (θ = 15.6).


Table 8.5 Mean squared errors for estimation of the variance of the number of events in a square of side 200, based on bootstrapping fixed and moving tiles. Data were generated from a Poisson process, a Strauss process with parameters chosen to match the data in Figure 8.14, and from a sequential spatial inhibition (SSI) process with radius 15. In each case the mean number of events is 22. For n ≤ 64 we took R = 200, for n = 100, 144 we took R = 400, and for n ≥ 196 we took R = 800.

                n       4     16     36     64    100    144    196    256
  Poisson  theory   224.2   77.9   47.3   36.3   31.2   28.4   26.7   25.6
           fixed    255.2   66.1   40.2   31.7   27.6   27.6   25.5   27.8
           moving    92.2   39.7   35.8   31.6   33.0   30.8   27.4   27.0
  Strauss  fixed    129.1   49.1   27.9   19.2   16.4   19.3   20.8   21.9
           moving    53.2   26.4   19.0   17.4   15.9   18.9   18.7   17.9
  SSI      fixed    123.8   37.7   14.8   13.5   17.9   25.1   34.6   42.4
           moving    36.5   12.9   11.2   15.6   18.3   21.2   28.6   35.4
Poisson

Table 8.5 shows the results. For the Poisson process the fixed tile results broadly agree with theoretical calculations (Problem 8.9), and the moving tile results accord with general theory, which predicts that mean squared errors for moving tiles should be lower than for fixed tiles. Here the mean squared error decreases to 22 as n → ∞.

The fitted Strauss process inhibits pairs of points closer together than 12 units. The mean squared error is minimized when n = 100, corresponding to tiles of side 20; the average estimated variances from the 100 replicates are then 19.0 and 18.2. The mean squared errors for moving tiles are rather lower, but their pattern is similar.

The sequential spatial inhibition results are similar to those for the Strauss process, but with a sharper rise in mean squared error for larger n.

In this setting theory predicts that for a process with sufficiently short-range dependence, the optimal n ∝ |ℛ|^{1/2}. If the caveolae data were generated by a Strauss process, results from Table 8.5 would suggest that we take n = 100 × 500/200 = 250, so there would be about 16 tiles along each side of ℛ. With R = 200 and fixed and moving tiles this gives variance estimates of 101.6 and 100.4, both considerably smaller than the variance for Poisson data, which would be 138.

8.4 Bibliographic Notes


There are many books on time series. Brockwell and Davis (1991) is a recent book aimed at a fairly mathematical readership, while Brockwell and Davis (1996) and Diggle (1990) are more suitable for the less theoretically inclined. Tong (1990) discusses nonlinear time series, while Beran (1994) covers long-memory processes. Bloomfield (1976), Brillinger (1981), Priestley (1981), and Percival and Walden (1993) are introductions to spectral analysis of time series.


Model-based resampling for time series was discussed by Freedman (1984), Freedman and Peters (1984a,b), Swanepoel and van Wyk (1986) and Efron and Tibshirani (1986), among others. Li and Maddala (1996) survey much of the related time domain literature, which has a somewhat theoretical emphasis; their account stresses econometric applications. For a more applied account of parametric bootstrapping in time series, see Tsay (1992). Bootstrap prediction in time series is discussed by Kabaila (1993b), while the bootstrapping of state-space models is described by Stoffer and Wall (1991). The use of model-based resampling for order selection in autoregressive processes is discussed by Chen et al. (1993).
Block resampling for time series was introduced by Carlstein (1986). In an important paper, Künsch (1989) discussed overlapping blocks in time series, although in spatial data the proposal of block resampling in Hall (1985) predates both. Liu and Singh (1992a) also discuss the properties of block resampling schemes. Politis and Romano (1994a) introduced the stationary bootstrap, and in a series of papers (Politis and Romano, 1993, 1994b) have discussed theoretical aspects of more general block resampling schemes. See also Bühlmann and Künsch (1995) and Lahiri (1995). The method for block length choice outlined in Section 8.2.3 is due to Hall, Horowitz and Jing (1995); see also Hall and Horowitz (1993). Bootstrap tests for unit roots in autoregressive models are discussed by Ferretti and Romo (1996). Hall and Jing (1996) describe a block resampling approach in which the construction of new series is replaced by Richardson extrapolation.
Bose (1988) showed that model-based resampling for autoregressive processes has good asymptotic higher-order properties for a wide class of statistics. Lahiri (1991) and Götze and Künsch (1996) show that the same is true for block resampling, but Davison and Hall (1993) point out that unfortunately, and unlike when the data are independent, this depends crucially on the variance estimate used.
Forms of phase scrambling have been suggested independently by several authors (Nordgaard, 1990; Theiler et al., 1992), and Braun and Kulperger (1995, 1997) have studied its properties. Hartigan (1990) describes a method for variance estimation in Gaussian series that involves similar ideas but needs no randomization; see Problem 8.5.
Frequency domain resampling has been discussed by Franke and Härdle (1992), who make a strong analogy with bootstrap methods for nonparametric regression. It has been further studied by Janas (1993) and Dahlhaus and Janas (1996), on which our account is based.
Our discussion of the Rio Negro data is based on Brillinger (1988, 1989), which should be consulted for statistical details, while Sternberg (1987, 1995) gives accounts of the data and background to the problem.
Models based on point processes have a long history and varied provenance.


Daley and Vere-Jones (1988) and Karr (1991) provide careful accounts of their mathematical basis, while Cox and Isham (1980) give a more concise treatment. Cox and Lewis (1966) is a standard account of statistical methods for series of events, i.e. point processes in the line. Spatial point processes and their statistical analysis are described by Diggle (1983), Ripley (1981, 1988), and Cressie (1991). Spatial epidemiology has recently received attention from various points of view (Muirhead and Darby, 1989; Bithell and Stone, 1989; Diggle, 1993; Lawson, 1993). Example 8.11 is based on Diggle and Rowlingson (1994).
Owing to the impossibility of exact inference, a number of statistical procedures based on randomization or simulation originated in spatial data analysis. Examples include graphical tests, which were used extensively by Ripley (1977), and various approaches to parametric inference based on Markov chain Monte Carlo methods (Ripley, 1988, Chapters 4, 5). However, nonparametric bootstrap methods for spatial data have received little attention. One exception is Hall (1985), a pioneering work on the theory that underlies block resampling in coverage processes, a particular type of spatial data. Further discussion of resampling these processes is given by Hall (1988b) and Garcia-Soidan and Hall (1997). Possolo (1986) discusses subsampling methods for estimating the parameters of a random field. Other applications include Hall and Keenan (1989), who use the bootstrap to set confidence gloves for the outlines of hands, and Journel (1994), who uses parametric bootstrapping to account for estimation uncertainty in an application of kriging. Young (1986) describes bootstrap approaches to testing in some geometrical problems.
Cowling, Hall and Phillips (1996) describe the resampling methods for inhomogeneous Poisson processes that form the basis of Example 8.10, as well as outlining the related theory. Ventura, Davison and Boniface (1997) describe a different analysis of the neurophysiological data used in that example. Diggle, Lange and Benes (1991) describe an application of the bootstrap to a point process problem in neuroanatomy.

8.5 Problems

1   Suppose that y₁,...,yₙ is an observed time series, and let z_{il} denote the block of length l starting at y_i, where we set y_i = y_{1+((i−1) mod n)}, so that y₀ = yₙ. Also let I₁, I₂, ... be a stream of random numbers uniform on the integers 1,...,n, and let L₁, L₂, ... be a stream of random numbers having the geometric distribution Pr(L = l) = p(1 − p)^{l−1}, l = 1, 2, .... The algorithm to generate a single stationary bootstrap replicate is

Algorithm 8.2 (Stationary bootstrap)
• Set Y* = z_{I₁L₁}, and set i = 1.
• While length(Y*) < n, {increment i; replace Y* with (Y*, z_{I_iL_i})}.


• Set Y* = (Y₁*,...,Yₙ*), the first n elements of the concatenation.

(a) Show that the algorithm above is equivalent to

Algorithm 8.3
• Set Y₁* = y_{I₁}.
• For i = 2,...,n, let Y_i* = y_{I_i} with probability p, and let Y_i* = y_{j+1} with probability 1 − p, where Y*_{i−1} = y_j.

(b) Define the empirical circular autocovariance

c_k = n^{−1} Σ_{i=1}^n (y_i − ȳ)(y_{1+((i+k−1) mod n)} − ȳ),   k = 0,...,n.

Show that conditional on y₁,...,yₙ,

E*(Y_i*) = ȳ,   cov*(Y_i*, Y*_{i+j}) = (1 − p)^j c_j,

and deduce that Y* is second-order stationary.

(c) Show that if y₁,...,yₙ are all distinct, Y* is a first-order Markov chain. Under what circumstances is it a kth-order Markov chain?
(Section 8.2.3; Politis and Romano, 1994a)
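A minimal R sketch of Algorithm 8.3 may help to fix ideas; the function stat.boot is an illustrative assumption, with the series y and the probability p as its only inputs.

# A minimal R sketch of Algorithm 8.3: with probability p start a new
# block at a uniform random position, otherwise continue the current
# block, wrapping circularly at the end of the series.
stat.boot <- function(y, p)
{ n <- length(y)
  ind <- numeric(n)
  ind[1] <- sample(n, 1)
  for (i in 2:n)
    ind[i] <- if (runif(1) < p) sample(n, 1) else 1 + ind[i-1] %% n
  y[ind] }
# y.star <- stat.boot(log(lynx), p = 1/20)  # mean block length 20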
2   Let Y₁,...,Yₙ be a stationary time series with covariances γ_j = cov(Y₁, Y_{1+j}). Show that

var(Ȳ) = n^{−1} {γ₀ + 2 Σ_{j=1}^{n−1} (1 − j/n) γ_j},

and that n var(Ȳ) approaches C = γ₀ + 2 Σ_{j=1}^∞ γ_j if Σ_j j|γ_j| is finite.
Show that under the stationary bootstrap, conditional on the data,

var*(Ȳ*) = n^{−1} {c₀ + 2 Σ_{j=1}^{n−1} (1 − j/n)(1 − p)^j c_j},

where c₀, c₁, ... are the empirical circular autocovariances defined in Problem 8.1.
(Section 8.2.3; Politis and Romano, 1994a)
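The bootstrap variance in this problem can be evaluated directly; the following minimal R sketch (the function sb.var is illustrative) computes it from the empirical circular autocovariances.

# A minimal R sketch of var*(Ybar*) under the stationary bootstrap,
# computed from the empirical circular autocovariances c_k.
sb.var <- function(y, p)
{ n <- length(y)
  yc <- y - mean(y)
  ck <- sapply(0:(n-1),
               function(k) sum(yc * yc[1 + (seq_len(n) + k - 1) %% n])/n)
  j <- 1:(n-1)
  (ck[1] + 2*sum((1 - j/n) * (1-p)^j * ck[-1]))/n }
# compare with var(lynx.2$t[,2]) from Practical 8.1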
3   (a) Using the setup described on pages 405-408, show that Σ_j (S_j − S̄)² has mean v_{ii} − b^{−1}v_{ij} and variance

v_{iijj} + 2v_{ij}v_{ij} − 2b^{−1}(v_{iijk} + 2v_{ij}v_{ijk}) + b^{−2}(v_{ijkl} + 2v_{ij}v_{kl}),

where v_{ij} = cov(S_i, S_j), v_{ijk} = cum(S_i, S_j, S_k) and so forth are the joint cumulants of the S_j, and summation is understood over each index.
(b) For an m-dependent normal process, show that provided l > m,

v_{ij} = lζ − 2c^{(1)} if i = j,   c^{(1)} if |i − j| = 1,   0 otherwise,

and show that l^{−1}v_{ii} → ζ as l → ∞. Hence establish (8.13) and (8.14).
(Section 8.2.3; Appendix A; Hall, Horowitz and Jing, 1995)

8 Complex Dependence

430
4   Establish (8.16) and (8.17). Show that under phase scrambling,

n^{−1} Σ_j Y_j* = ȳ,   cov*(Y_j*, Y*_{j+m}) = n^{−1} Σ_i (y_i − ȳ)(y_{i+m} − ȳ),

where j + m is interpreted mod n, and that all odd joint moments of the Y_j* are zero.
This last result implies that the resampled series have a highly symmetric joint distribution. When the original data have an asymmetric marginal distribution, the following procedure has been proposed:
• let x_j = Φ^{−1}{r_j/(n + 1)}, where r_j is the rank of y_j among the original series y₀,...,y_{n−1};
• apply Algorithm 8.1 to x₀,...,x_{n−1}, giving X₀*,...,X*_{n−1}; then
• set Y_j* = y_{(r_j*)}, where r_j* is the rank of X_j* among X₀*,...,X*_{n−1}.
Discuss critically this idea (see also Practical 8.3).
(Section 8.2.4; Theiler et al., 1992; Braun and Kulperger, 1995, 1997)
5   (a) Let I₁,...,I_m be independent exponential random variables with means μ_j, and consider the statistic T = Σ_{j=1}^m a_j I_j, where the a_j are unknown. Show that V = ½ Σ_j a_j² I_j² is an unbiased estimate of var(T) = Σ_j a_j² μ_j².
Now let C = (c₀,...,c_m) be an (m + 1) × (m + 1) orthogonal matrix with columns c_j, where c₀ is a vector of ones; the ith element of c_j is c_{ji}. That is, for some constant b,

c_i^T c_j = 0,  i ≠ j,   c_j^T c_j = b,  j = 1,...,m.

Show that for a suitable choice of b, V is equal to

{2(m + 1)}^{−1} Σ_{i=1}^{m+1} (T_i* − T)²,

where for i = 1,...,m + 1, T_i* = Σ_j (1 + c_{ji}) a_j I_j.
(b) Now suppose that Y₀,...,Y_{n−1} is a time series of length n = 2m + 1, with empirical Fourier transform Ỹ₀,...,Ỹ_{n−1} and periodogram ordinates I_k = |Ỹ_k|²/n, for k = 0,...,m. For each i = 1,...,m + 1, let the perturbed periodogram ordinates be

Ỹ₀* = Ỹ₀,   Ỹ_k* = (1 + c_{ki})^{1/2} Ỹ_k,   Ỹ*_{n−k} = (1 + c_{ki})^{1/2} Ỹ_{n−k},   k = 1,...,m,

from which the ith replacement time series is obtained by the inverse Fourier transform.
Let T be the value of a statistic calculated from the original series. Explain how the corresponding resample values, T₁*,...,T*_{m+1}, may be used to obtain an approximately unbiased estimate of the variance of T, and say for what types of statistics you think this is likely to work.
(Section 8.2.4; Hartigan, 1990)
6   In the context of periodogram resampling, consider a ratio statistic

T = Σ_k a(ω_k)I(ω_k) / Σ_k I(ω_k) = ∫ a(ω)g(ω) dω (1 + n^{−1/2}X_a) / ∫ g(ω) dω (1 + n^{−1/2}X₁),

say. Use (8.18) to show that X_a and X₁ have means zero and that

var(X_a) = I_{aagg} I_{ag}^{−2} (2 + κ₄),   cov(X₁, X_a) = I_{agg} I_{ag}^{−1} I_g^{−1} (2 + κ₄),   var(X₁) = I_{gg} I_g^{−2} (2 + κ₄),

where I_{aagg} = ∫ a²(ω)g²(ω) dω, and so forth. Hence show that to first order the mean and variance of T do not involve κ₄, and deduce that periodogram resampling may be applied to ratio statistics.
Use simulation to see how well periodogram resampling performs in estimating the distribution of a suitable version of the sample estimate of the lag j autocorrelation,

ρ_j = ∫_{−π}^{π} e^{−iωj} g(ω) dω / ∫_{−π}^{π} g(ω) dω.

(Section 8.2.5; Janas, 1993; Dahlhaus and Janas, 1996)


7   Let y₁,...,yₙ denote the times of events in an inhomogeneous Poisson process of intensity λ(y), observed for 0 ≤ y ≤ 1, and let

λ̂(y; h) = h^{−1} Σ_{j=1}^n w{(y − y_j)/h}

denote a kernel estimate of λ(y), based on a kernel w(·) that is a PDF. Explain why the following two algorithms for generating bootstrap data from the estimated intensity are (almost) equivalent.

Algorithm 8.4 (Inhomogeneous Poisson process 1)
• Let N* have a Poisson distribution with mean Λ̂ = ∫₀¹ λ̂(u; h) du.
• For j = 1,...,N*, independently take U_j* from the U(0,1) distribution, and then set Y_j* = F̂^{−1}(U_j*), where F̂(y) = Λ̂^{−1} ∫₀^y λ̂(u; h) du.

Algorithm 8.5 (Inhomogeneous Poisson process 2)
• Let N* have a Poisson distribution with mean Λ̂ = ∫₀¹ λ̂(u; h) du.
• For j = 1,...,N*, independently generate I_j* at random from the integers {1,...,n} and let ε_j* be a random variable with PDF w(·). Set Y_j* = y_{I_j*} + hε_j*.

(Section 8.3.2)
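Algorithm 8.5 amounts to a smoothed bootstrap of the event times; below is a minimal R sketch with a normal kernel, assuming the times y lie in (0, 1). The function rproc.boot is illustrative.

# A minimal R sketch of Algorithm 8.5: a Poisson number of events, each
# a resampled time jittered by h times a draw from the kernel.
rproc.boot <- function(y, h)
{ N <- rpois(1, length(y))
  y[sample(length(y), N, replace = TRUE)] + h*rnorm(N) }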
8   Consider an inhomogeneous Poisson process of intensity λ(y) = Nμ(y), where μ(y) is fixed and smooth, observed for 0 ≤ y ≤ 1. A kernel intensity estimate based on events at y₁,...,yₙ is

λ̂(y; h) = h^{−1} Σ_{j=1}^n w{(y − y_j)/h},

where w(·) is the PDF of a symmetric random variable with mean zero and variance one; let K = ∫ w²(u) du.
(a) Show that as N → ∞ and h → 0 in such a way that Nh → ∞,

E{λ̂(y; h)} ≈ λ(y) + ½h²λ″(y),   var{λ̂(y; h)} ≈ Kh^{−1}λ(y);

you may need the facts that the number of events n has a Poisson distribution with mean Λ = ∫₀¹ λ(u) du, and that conditional on there being n observed events, their times are independent random variables with PDF λ(u)/Λ. Hence show that the asymptotic mean squared error of λ̂(y; h) is minimized when h ∝ N^{−1/5}. Use the delta method to show that the approximate mean and variance of λ̂^{1/2}(y; h) are

λ^{1/2}(y) + ¼λ^{−1/2}(y){h²λ″(y) − ½Kh^{−1}},   ¼Kh^{−1}.

(b) Now suppose that resamples are formed by taking n observations at random from y₁,...,yₙ. Show that the bootstrapped intensity estimate

λ̂*(y; h) = h^{−1} Σ_{j=1}^n w{(y − y_j*)/h}

has mean E*{λ̂*(y; h)} = λ̂(y; h), and that the same is true when there are n* resampled events, provided that E*(n*) = n.
For a third resampling scheme, let n* have a Poisson distribution with mean n, and generate n* events independently from density λ̂(y; h)/∫₀¹ λ̂(u; h) du. Show that under this scheme

E*{λ̂*(y; h)} = ∫ w(u) λ̂(y − hu; h) du.

(c) By comparing the asymptotic distributions of

Z(y; h) = {λ̂^{1/2}(y; h) − λ^{1/2}(y)} / {¼Kh^{−1}}^{1/2},   Z*(y; h) = {λ̂*^{1/2}(y; h) − λ̂^{1/2}(y; h)} / {¼Kh^{−1}}^{1/2},

find conditions under which the quantiles of Z* can estimate those of Z.
(Section 8.3.2; Example 5.13; Cowling, Hall and Phillips, 1996)
9   Consider resampling tiles when the observation region ℛ is a square, the data are generated by a stationary planar Poisson process of intensity λ, and the quantity of interest is θ = var(Y), where Y is the number of events in ℛ.
Suppose that ℛ is split into n fixed tiles of equal size and shape, which are then resampled according to the usual bootstrap. Show that the bootstrap estimate of θ is T = Σ_j (y_j − ȳ)², where y_j is the number of events in the jth tile. Use the fact that var(T) = (n − 1)²{κ₄/n + 2κ₂²/(n − 1)}, where κ_r is the rth cumulant of the y_j, to show that the mean squared error of T is

μ² + (n − 1)²μ/n + 2(n − 1)μ²,

where μ = λ|ℛ|/n is the mean number of events per tile. Sketch this as a function of n when μ > 1, μ = 1, and μ < 1, and explain in qualitative terms its behaviour when μ > 1.
Extend the discussion to moving tiles.
(Section 8.3)

8.6 Practicals

1   Dataframe lynx contains the Canadian lynx data, to the logarithm of which we fit the autoregressive model that minimizes AIC:

ts.plot(log(lynx))
lynx.ar <- ar(log(lynx))
lynx.ar$order


The best model is AR(11). How well determined is this, and what is the variance of the series average? We bootstrap to see, using lynx.fun (given below), which calculates the order of the fitted autoregressive model, the series average, and saves the series itself.
Here are results for fixed-block bootstraps with block length l = 20:

lynx.fun <- function(tsb)
{ ar.fit <- ar(tsb, order.max=25)
  c(ar.fit$order, mean(tsb), tsb) }
lynx.1 <- tsboot(log(lynx), lynx.fun, R=99, l=20, sim="fixed")
tsplot(ts(lynx.1$t[1,3:116],start=c(1821,1)),
    main="Block simulation, l=20")
boot.array(lynx.1)[1,]
table(lynx.1$t[,1])
var(lynx.1$t[,2])
qqnorm(lynx.1$t[,2])
abline(mean(lynx.1$t[,2]),sqrt(var(lynx.1$t[,2])),lty=2)

To obtain similar results for the stationary bootstrap with mean block length l = 20:

.Random.seed <- lynx.1$seed
lynx.2 <- tsboot(log(lynx), lynx.fun, R=99, l=20, sim="geom")

See if the results look different from those above. Do the simulated series using blocks look like the original? Compare the estimated variances under the two resampling schemes. Try different block lengths, and see how the variances of the series average change.
For model-based resampling we need to store results from the original model:

lynx.model <- list(order=c(lynx.ar$order,0,0),ar=lynx.ar$ar)
lynx.res <- lynx.ar$resid[!is.na(lynx.ar$resid)]
lynx.res <- lynx.res - mean(lynx.res)
lynx.sim <- function(res, n.sim, ran.args)
{ rg1 <- function(n, res) sample(res, n, replace=T)
  ts.orig <- ran.args$ts
  ts.mod <- ran.args$model
  mean(ts.orig)+ts(arima.sim(model=ts.mod, n=n.sim,
      rand.gen=rg1, res=as.vector(res))) }
.Random.seed <- lynx.1$seed
lynx.3 <- tsboot(lynx.res, lynx.fun, R=99, sim="model",
    n.sim=114, ran.gen=lynx.sim,
    ran.args=list(ts=log(lynx), model=lynx.model))

Check the orders of the fitted models for this scheme.
For post-blackening we need to define yet another function:

lynx.black <- function(res, n.sim, ran.args)
{ ts.orig <- ran.args$ts
  ts.mod <- ran.args$model
  mean(ts.orig) + ts(arima.sim(model=ts.mod,n=n.sim,innov=res)) }
.Random.seed <- lynx.1$seed
lynx.1b <- tsboot(lynx.res, lynx.fun, R=99, l=20, sim="fixed",
    n.sim=114, ran.gen=lynx.black,
    ran.args=list(ts=log(lynx), model=lynx.model))

Compare these results with those above, and try the post-blackened bootstrap with sim="geom".
(Sections 8.2.2, 8.2.3)
2   The data in beaver consist of a time series of n = 100 observations on the body temperature y₁,...,yₙ and an indicator x₁,...,xₙ of activity of a female beaver, Castor canadensis. We want to estimate and give an uncertainty measure for the body temperature of the beaver. The simplest model that allows for the clear autocorrelation of the series is

y_j = β₀ + β₁x_j + η_j,   η_j = αη_{j−1} + ε_j,   j = 1,...,n,   (8.21)

a linear regression model in which the errors η_j form an AR(1) process, and the ε_j are independent identically distributed errors with mean zero and variance σ². Having fitted this model, estimated the parameters α, β₀, β₁, σ² and calculated the residuals e₁,...,eₙ (e₁ cannot be calculated), we generate bootstrap series by the following recipe:

y_j* = β̂₀ + β̂₁x_j + η_j*,   η_j* = α̂η*_{j−1} + ε_j*,   j = 1,...,n,   (8.22)

where the error series {η_j*} is formed by taking a white noise series {ε_j*} at random from the set {σ̂(e₂ − ē),...,σ̂(eₙ − ē)} and then applying the second part of (8.22).
To fit the original model and to generate a new series:
fit <- function(data)
{ X <- cbind(rep(1,100), data$activ)
  para <- list(X=X, data=data)
  assign("para", para, frame=1)
  d <- arima.mle(x=para$data$temp, model=list(ar=c(0.8)),
      xreg=para$X)
  res <- arima.diag(d, plot=F, std.resid=T)$std.resid
  res <- res[!is.na(res)]
  list(paras=c(d$model$ar, d$reg.coef, sqrt(d$sigma2)),
      res=res-mean(res), fit=X %*% d$reg.coef) }
beaver.args <- fit(beaver)
white.noise <- function(n.sim, ts) sample(ts, size=n.sim, replace=T)
beaver.gen <- function(ts, n.sim, ran.args)
{ tsb <- ran.args$res
  fit <- ran.args$fit
  coeff <- ran.args$paras
  ts$temp <- fit + coeff[4]*arima.sim(model=list(ar=coeff[1]),
      n=n.sim, rand.gen=white.noise, ts=tsb)
  ts }
new.beaver <- beaver.gen(beaver, 100, beaver.args)
Now we are able to generate data, we can bootstrap and see the results of beaver.boot as follows:

beaver.fun <- function(ts) fit(ts)$paras
beaver.boot <- tsboot(beaver, beaver.fun, R=99, sim="model",
    n.sim=100, ran.gen=beaver.gen, ran.args=beaver.args)
names(beaver.boot)
beaver.boot$t0
beaver.boot$t[1:10,]

showing the original value of beaver.fun and its value for the first 10 replicate


series. Are the estimated mean temperatures for the R = 99 simulations normal? Use boot.ci to obtain normal and basic bootstrap confidence intervals for the resting and active temperatures.
In this analysis we have assumed that the linear model with AR(1) errors is appropriate. How would you proceed if it were not?
(Section 8.2; Reynolds, 1994)
3   Consider scrambling the phases of the sunspot data. To see the original data, two replicates generated using ordinary phase scrambling, and two phase-scrambled series whose marginal distribution is the same as that of the original data:

sunspot.fun <- function(ts) ts
sunspot.1 <- tsboot(sunspot, sunspot.fun, R=2, sim="scramble")
.Random.seed <- sunspot.1$seed
sunspot.2 <- tsboot(sunspot, sunspot.fun, R=2, sim="scramble", norm=F)
split.screen(c(3,2))
yl <- c(-50,200)
screen(1); ts.plot(sunspot, ylim=yl); abline(h=0,lty=2)
screen(3); tsplot(sunspot.1$t[1,], ylim=yl); abline(h=0,lty=2)
screen(4); tsplot(sunspot.1$t[2,], ylim=yl); abline(h=0,lty=2)
screen(5); tsplot(sunspot.2$t[1,], ylim=yl); abline(h=0,lty=2)
screen(6); tsplot(sunspot.2$t[2,], ylim=yl); abline(h=0,lty=2)

What features of the original data are preserved by the two algorithms? (You may find it helpful to experiment with different shapes for the figures.)
(Section 8.2.4; Problem 8.4; Theiler et al., 1992)

4   coal contains data on times of explosions in coal mines from 15 March 1851 to 22 March 1962, often modelled as an inhomogeneous Poisson process. For a kernel intensity estimate (accidents per year):

coal.est <- function(y, h=5) length(y)*ksmooth(y, bandwidth=2.7*h,
    kernel="n", x.points=seq(1851,1963,2))$y
year <- seq(1851,1963,2)
plot(year, coal.est(coal$date), type="l", ylab="intensity",
    ylim=c(0,6))
rug(coal$date)

Try other choices of bandwidth h, noting that the estimate for the period (1851 + 4h, 1962 − 4h) does not have edge effects. Do you think that the drop from about three accidents per year before 1900 to about one thereafter is spurious? What about the peaks at around 1910 and 1940?
For an equi-tailed 90% bootstrap confidence band for the intensity, we take h = 5 and R = 199 (a larger R will give more reliable results):

coal.fun <- function(data, i, h=5) coal.est(data[i], h)
coal.boot <- boot(coal$date, coal.fun, R=199)
A <- 0.5/sqrt(5*2*sqrt(pi))
Z <- sweep(sqrt(coal.boot$t), 2, sqrt(coal.boot$t0))/A
Z.max <- sort(apply(Z,1,max))[190]
Z.min <- sort(apply(Z,1,min))[10]
top <- (sqrt(coal.boot$t0)-A*Z.min)^2
bot <- (sqrt(coal.boot$t0)-A*Z.max)^2
lines(year,top,lty=2); lines(year,bot,lty=2)


To see the quantile process:

Z <- apply(Z,2,sort)
Z.05 <- Z[10,]
Z.95 <- Z[190,]
plot(year,Z.05,type="l",ylab="Z",ylim=c(-3,3))
lines(year,Z.95)

Construct symmetric bootstrap confidence bands based on z_α(h) such that

Pr*{|Z*(y; h)| ≤ z_α(h) for all y} = α

(no more simulation is required). How different are they from the equi-tailed ones?
For simulation with a random number of events, use

coal.gen <- function(data, n)
{ i <- sample(1:n, size=rpois(n=1,lambda=n), replace=T)
  data[i] }
coal.boot2 <- boot(coal$date, coal.est, R=199, sim="parametric",
    ran.gen=coal.gen, mle=nrow(coal))

Does this make any difference?
(Section 8.3.2; Cowling, Hall and Phillips 1996; Hand et al., 1994, p. 155)

9
Improved Calculation

9.1 Introduction

A few of the statistical questions in earlier chapters have been amenable to analytical calculation. However, most of our problems have been too complicated for exact solutions, and samples have been too small for theoretical large-sample approximations to be trustworthy. In such cases simulation has provided approximate answers through Monte Carlo estimates of bias, variance, quantiles, probabilities, and so forth. Throughout we have supposed that the simulation size is limited only by our impatience for reliable results.
Simulation of independent bootstrap samples and their use as described in previous chapters is usually easily programmed and implemented. If it takes up to a few hours to calculate enough values of the statistic of interest, T, ordinary simulation of this sort will be an efficient use of a researcher's time. But sometimes T is very costly to compute, or sampling is only a single component in a larger procedure (as in a double bootstrap), or the procedure will be repeated many times with different sets of data. Then it may pay to invest in methods of calculation that reduce the number of simulations needed to obtain a given precision, or equivalently increase the accuracy of an estimate based on a given simulation size. This chapter is devoted to such methods.
No lunch is free. The techniques that give the biggest potential variance reductions are usually the hardest to implement. Others yield less spectacular gains, but are more easily implemented. Thoughtless use of any of them may make matters worse, so it is essential to ensure that use of a variance reduction technique will save the investigator's time, which is much more valuable than computer time.
Most of our bootstrap estimates depend on averages. For example, in testing a null hypothesis (Chapter 4) we want to calculate the significance probability p = Pr(T* ≥ t | F̂₀), where t is the observed value of test statistic T and


the fitted model F̂₀ is an estimate of F under the null hypothesis. The simple Monte Carlo estimate of p is R^{−1} Σ_r I{T_r* ≥ t}, where I{·} is the indicator function and the T_r* are based on R independent samples generated from F̂₀. The variance of this estimate is cR^{−1}, where c = p(1 − p). Nothing can generally be done about the factor R^{−1}, but the constant c can be reduced if we use a more sophisticated Monte Carlo technique. Most of this chapter concerns such techniques. Section 9.2 describes methods for balancing the simulation in order to make it more like a full enumeration of all possible samples, and in Section 9.3 we describe methods based on the use of control variates. Section 9.4 describes methods based on importance sampling. In Section 9.5 we discuss one important method of theoretical approximation, the saddlepoint method, which eliminates the need for simulation.

9.2 Balanced Bootstraps

Suppose for simplicity that the data are a homogeneous random sample y₁,...,yₙ with EDF F̂, and that as usual we are concerned with the properties of a statistic T whose observed value is t = t(y₁,...,yₙ). Our focus is T* = t(Y₁*,...,Yₙ*), where the Y_j* are a random sample from F̂. Consider the bias estimate for T, namely B = E(T* | F̂) − t. If g denotes the joint density of Y₁*,...,Yₙ*, then

B = ∫ t(y₁*,...,yₙ*) g(y₁*,...,yₙ*) dy₁* ⋯ dyₙ* − t.

This might be computable analytically if t(·) is simple enough, particularly for some parametric models. In the nonparametric case, if the calculation cannot be done analytically, we set g equal to n^{−n} for all possible samples y₁*,...,yₙ* in the set 𝒮 = {y₁,...,yₙ}ⁿ and write

B = n^{−n} Σ t(y₁*,...,yₙ*) − t.   (9.1)

This sum over all possible samples need involve only (2n − 1 choose n) calculations of t*, since the symmetry of t(·) with respect to the sample can be used, but even so the complete enumeration of values t* that (9.1) requires will usually be impracticable unless n is very small. So it is that, especially in nonparametric problems, we usually approximate the average in (9.1) by the average of R randomly chosen elements of 𝒮, and so approximate B by B_R = R^{−1} Σ_r T_r* − t.
This calculation with a random subset of 𝒮 has a major defect: the values y₁,...,yₙ typically do not occur with equal frequency in that subset. This is illustrated in Table 9.1, which reproduces Table 2.2 but adds (penultimate row) the aggregate frequencies for the data values; the final row is explained later.



Table 9.1 R = 9 resamples for city population data, chosen by ordinary bootstrap sampling from F̂. [Columns: the data pairs (u_j, x_j), j = 1,...,10, with u = (138, 93, 61, 179, 48, 37, 29, 23, 30, 2) and x = (143, 104, 69, 260, 75, 63, 50, 48, 111, 50). Rows ("Number of times j sampled"): the frequency of each j in each of the nine resamples, their aggregate frequencies, and a final row (F̄*) explained in the text. Statistics: t = 1.520; t₁* = 1.466, t₂* = 1.761, t₃* = 1.951, t₄* = 1.542, t₅* = 1.371, t₆* = 1.686, t₇* = 1.378, t₈* = 1.420, t₉* = 1.660.]

In the even simpler case of the sample average t = ȳ we can see clearly

that the unequal frequencies completely account for the fact that B_R differs from the correct value B = 0. The corresponding phenomenon for parametric bootstrapping is that the aggregated EDF of the R samples is not as close to the CDF of the fitted parametric model as it is to the same model with different parameter values.
There are two ways to deal with this difficulty. First, we can try to change the simulation to remove the defect; and secondly we can try to adjust the results of the existing simulation.

9.2.1 Balancing the simulation

The idea of balanced resampling is to generate tables of random frequencies, but to force them to be balanced in an appropriate way. A set of R bootstrap samples is said to have first-order balance if each of the original observations appears with equal frequency, i.e. exactly R times overall.
First-order balance is easy to achieve. A simple algorithm is as follows:

Algorithm 9.1 (Balanced bootstrap)
• Concatenate R copies of y₁,...,yₙ into a single set 𝒴 of size Rn.
• Permute the elements of 𝒴 at random, giving 𝒴*, say.
• For r = 1,...,R, take successive sets of n elements of 𝒴* to be the balanced resamples, y_r*, and set t_r* = t(y_r*).
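Below is a minimal R sketch of Algorithm 9.1; the boot library used in the practicals provides the same scheme through the argument sim="balanced" to boot.

# A minimal R sketch of Algorithm 9.1: permute R concatenated copies of
# the data and cut the result into R balanced resamples of size n.
bal.boot <- function(y, R, t.fun)
{ Y <- sample(rep(y, R))                      # concatenate, then permute
  apply(matrix(Y, nrow = R, byrow = TRUE), 1, t.fun) }
# t.star <- bal.boot(rnorm(10), R = 49, mean)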


Table 9.2 First-order balanced bootstrap with R = 9 for city population data. [Columns: j = 1,...,10. Rows ("Number of times j sampled"): the frequency of each j in each of the nine resamples and their aggregate frequencies, each of which equals 9. Statistics: t = 1.520; t₁* = 1.632, t₂* = 1.823, t₃* = 1.334, t₄* = 1.317, t₅* = 1.531, t₆* = 1.344, t₇* = 1.730, t₈* = 1.424, t₉* = 1.678.]

Other algorithms (e.g. Problem 9.2) have been suggested that economize on the time and space needed to generate balanced samples, but the most time-consuming part of a bootstrap simulation is usually the calculation of the values of t*, so the details of the simulation algorithm are rarely critical. Whatever the method used to generate the balanced samples, the result will be that individual observations have equal overall frequencies, just as for complete enumeration; a simple illustration is given below. Indeed, so far as the marginal frequencies of the data values are concerned, a complete enumeration has been performed.

Example 9.1 (City population data)   Consider estimating the bias of the ratio estimate t = x̄/ū for the data in the second and third rows of Table 9.1. Table 9.2 shows the results for a balanced bootstrap with R = 9: each data value occurs exactly 9 times overall.
To see how well the balanced bootstrap works, we apply it with the more realistic number R = 49. The bias estimate is B_R = T̄* − t = R^{−1} Σ_r T_r* − t, and its variance over 100 replicates of the ordinary resampling scheme is 7.25 × 10^{−4}. The corresponding figure for the balanced bootstrap is 9.31 × 10^{−5}, so the balanced scheme is about 72.5/9.31 = 7.8 times more efficient for bias estimation.

Here and below we say that the efficiency of a bootstrap estimate such as B_R relative to the ordinary bootstrap is the variance ratio

var_ord(B_R) / var_bal(B_R),

where for this comparison the subscripts denote the sampling scheme under which B_R was calculated.

441

9.2 Balanced Bootstraps


Table 9.3 Approximate efficiency gains when balancing schemes with R = 49 are applied in estimating biases for estimates of the nonlinear regression model applied to the calcium uptake data, based on 100 repetitions of the bootstrap.

            Cases                Stratified            Residuals
       Balanced Adjusted     Balanced Adjusted     Balanced Adjusted
 β₀       8.9      6.9         141      108           1.2      0.6
 β₁      13.1      8.9          63       49           1.4      0.6
 σ       11.1      9.1        18.7     18.0          15.3     13.5

So far we have focused on the application to bias estimation, for which the balance typically gives a big improvement. The same is not generally true for estimating higher moments or quantiles. For instance, in the previous example the balanced bootstrap has efficiency less than one for calculation of the variance estimate V_R.
The balanced bootstrap extends quite easily to more complicated sampling situations. If the data consist of several independent samples, as in Section 3.2, balanced simulation can be applied separately to each. Some other extensions are straightforward.

Example 9.2 (Calcium uptake data)   To investigate the improvement in bias estimation for the parameters of the nonlinear regression model fitted to the data of Example 7.7, we calculated 100 replicates of the estimated biases based on 49 bootstrap samples. The resulting efficiencies are given in Table 9.3 for different resampling schemes; the results labelled Adjusted are discussed in Example 9.3. For stratified resampling the data are stratified by the covariate value, so there are nine strata each with three observations. The efficiency gains under stratified resampling are very large, and those under case resampling are worthwhile. The gains when resampling residuals are not worthwhile, except for σ̂.

First-order balance ensures that each observation occurs precisely R times in the R samples. In a scheme with second-order balance, each pair of observations occurs together precisely the same number of times, and so on for schemes with third- and higher-order balance. There is a close connection to certain experimental designs (Problem 9.7). Detailed investigation suggests, however, that there is usually no practical gain beyond first-order balance. An open question is whether or not there are useful nearly balanced designs.

9.2.2 Post-simulation balance

Consider again estimating the bias of T in a nonparametric context, based on an unbalanced array of frequencies such as Table 9.1. The usual bias estimate

can be written in expanded notation as

B_R = R^{−1} Σ_{r=1}^R t(F̂_r*) − t(F̂),   (9.2)

where as usual F̂_r* denotes the EDF corresponding to the rth row of the array. Let F̄* denote the average of these EDFs, that is

F̄* = R^{−1}(F̂₁* + ⋯ + F̂_R*).

For a frequency table such as Table 9.1, F̄* is the CDF of the distribution corresponding to the aggregate frequencies of data values, as shown in the final row. The resulting adjusted bias estimate is

B_{R,adj} = R^{−1} Σ_{r=1}^R t(F̂_r*) − t(F̄*).   (9.3)

This is sometimes called the re-centred bias estimate. In addition to the usual bootstrap values t(F̂_r*), its calculation requires only F̄* and t(F̄*). Note that for the adjustment to work, t(·) must be in a functional form, i.e. be defined independently of sample size n. For example, a variance must be calculated with divisor n rather than n − 1.
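As a concrete illustration, a minimal sketch of (9.3) for the city ratio using the boot library is given below; writing the ratio in weighted, functional form makes t(F̄*) a one-line calculation.

library(boot)
# A minimal sketch of the re-centred bias estimate (9.3): the ratio in
# weighted form, so that it can be evaluated at the average EDF.
city.fun <- function(d, w) sum(w*d$x)/sum(w*d$u)
city.boot <- boot(city, city.fun, R = 49, stype = "w")
f.bar <- colMeans(boot.array(city.boot))   # aggregate frequencies / R
B.adj <- mean(city.boot$t) - city.fun(city, f.bar)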
The corresponding calculation for a parametric bootstrap is similar. In effect the adjustment compares the simulated estimates T_r* to the parameter value θ̄* = t(F̄*) obtained by fitting the model to data with EDF F̄* rather than F̂.

Example 9.3 (Calcium uptake data)   Table 9.3 shows the efficiency gains from using B_{R,adj} in the nonparametric resampling experiment described in Example 9.2. The gains are broadly similar to those for balanced resampling, but smaller.
For parametric sampling the quantities F̂_r* in (9.3) represent sets of data generated by parametric simulation from the fitted model, and the average F̄* is the dataset of size Rn obtained by concatenating the simulated samples. Here the simplest parametric simulation is to generate data y_j* = μ̂_j + ε_j*, where the μ̂_j are the fitted values from Example 7.7 and the ε_j* are independent N(0, 0.55²) variables. In 100 replicates of this bootstrap with R = 49, the efficiency gains for estimating the biases of β̂₀, β̂₁, and σ̂ were 24.7, 42.5, and 20.7; the effect of the adjustment is much more marked for the parametric than for the nonparametric bootstraps.

The same adjustment does not apply to the variance approximation V_R, higher moments or quantiles. Rather the linear approximation is used as a conventional control variate, as described in Section 9.3.


9.2.3 Some theory

Some theoretical insight into both balanced simulation and post-simulation balancing can be gained by means of the nonparametric delta method (Section 2.7). As before, let F̂* denote the EDF of a bootstrap sample Y₁*,...,Yₙ*. The expansion of T* = t(F̂*) about F̂ is, to second-order terms,

t(F̂*) = t_Q(F̂*) = t(F̂) + n^{−1} Σ_{j=1}^n l_j* + ½ n^{−2} Σ_{j=1}^n Σ_{k=1}^n q_jk*,   (9.4)

where l_j* = l(Y_j*; F̂) and q_jk* = q(Y_j*, Y_k*; F̂) are values of the empirical first- and second-order derivatives of t at F̂; equation (9.4) is the same as (2.41), but with F̂ and F replaced by F̂* and F̂. We call the right-hand side of (9.4) the quadratic approximation to T*. Omission of the final term leaves the linear approximation

t_L(F̂*) = t(F̂) + n^{−1} Σ_{j=1}^n l_j*,   (9.5)

which is the basis of the variance approximation v_L; equation (9.5) is simply a recasting of (2.44).
In terms of the frequencies f_j* with which the y_j appear in the bootstrap sample and the empirical influence values l_j = l(y_j; F̂) and q_jk = q(y_j, y_k; F̂), the quadratic approximation (9.4) is

t_Q* = t + n^{−1} Σ_{j=1}^n f_j* l_j + ½ n^{−2} Σ_{j=1}^n Σ_{k=1}^n f_j* f_k* q_jk   (9.6)

in abbreviated notation. Recall that Σ_j l_j = 0 and Σ_j q_jk = Σ_k q_jk = 0.
C onsider b o o tstrap sim ulation to estim ate the bias o f T. Suppose th a t there
are R sim ulated samples, an d th a t yj appears in the rth w ith frequency f rJ,
while T takes value T ' . T hen from (9.2) and (9.6) the bias approxim ation
B r = R ~ l 22 T ~ t can be approxim ated by

a -'E *+-1E a + i'2E E ) - cr= l

7=1

7=1 k = l

{9J)

In the ord in ary resam pling scheme, the rows o f frequencies (/* 1 , . . . , f ' n) are
in dependent sam ples from the m ultinom ial distribution with denom inator n
an d probability vector (n-1 , . . . , n _1). This is the case in Table 9.1. In this
situation the first an d second jo in t m om ents o f the frequencies are
E*(/V) = 1,

c o v -(/V ,/;fc) = SrASjk - n~l ),


where δ_jk = 1 if j = k and zero otherwise, and so forth; the higher cumulants are given in Problem 2.19. Straightforward calculations show that approximation (9.7) has mean ½ n^{−2} Σ_j q_jj and variance

(Rn²)^{−1} Σ_{j=1}^n l_j² + (4Rn⁴)^{−1} { ( Σ_{j=1}^n q_jj )² + 2 Σ_{j=1}^n Σ_{k=1}^n q_jk² }.   (9.8)

For the balanced bootstrap, the joint distribution of the R × n table of frequencies f_rj* is hypergeometric, with row sums n and column sums R. Because Σ_j l_j = 0 and Σ_r f_rj* = R for all j, approximation (9.7) becomes

½ R^{−1} n^{−2} Σ_{r=1}^R Σ_{j=1}^n Σ_{k=1}^n f_rj* f_rk* q_jk.

Under balanced resampling one can show (Problem 9.1) that

E*(f_rj*) = 1,   cov*(f_rj*, f_sk*) = (nδ_jk − 1)(Rδ_rs − 1) / (nR − 1),   (9.9)

so the bias approximation (9.7) has mean

½ n^{−1} (R − 1)(nR − 1)^{−1} Σ_{j=1}^n q_jj;
more painful calculations show that its variance is approximately

(4Rn⁴)^{−1} { −2n^{−1} ( Σ_{j=1}^n q_jj )² + 2n^{−2}R^{−2} ( Σ_{j=1}^n q_jj )² + 2(n − 1)n^{−1} Σ_{j=1}^n Σ_{k=1}^n q_jk² }.   (9.10)

The mean is almost the same under both schemes, but the leading term of the variance in (9.10) is smaller than in (9.8) because the term in (9.7) involving the l_j is held equal to zero by the balance constraints Σ_r f_rj* = R. First-order balance ensures that the linear term in the expansion for B_R is held equal to its value of zero for the complete enumeration.
Post-simulation balance is closely related to the balanced bootstrap. It is straightforward to see that the quadratic nonparametric delta method approximation of B_{R,adj} in (9.3) equals

½ n^{−2} Σ_{j=1}^n Σ_{k=1}^n { R^{−1} Σ_{r=1}^R f_rj* f_rk* − ( R^{−1} Σ_{r=1}^R f_rj* )( R^{−1} Σ_{r=1}^R f_rk* ) } q_jk.   (9.11)

[Figure 9.1 Efficiency comparisons for estimating biases of normal eigenvalues. The left panel compares the efficiency gains over the ordinary bias estimate due to balancing and post-simulation adjustment (x-axis: Adjusted, log scale from 0.1 to 5.0). The right panel shows the gains for the balanced estimate, as a function of the correlation between the statistic and its linear approximation (x-axis: Correlation, from 0.0 to 1.0); the solid line shows the theoretical relation. See text for details.]

Like the balanced bootstrap estimate of bias, there are no linear terms in this expression. Re-centring has forced those terms to equal their population values of zero.
When the statistic T does not possess an expansion like (9.4), balancing may not help. In any case the correlation between the statistic and its linear approximation is important: if the correlation is low because the quadratic component of (9.4) is appreciable, then it may not be useful to reduce variation in the linear component. A rough approximation is that var*(B) is reduced by a factor equal to 1 minus the square of the correlation between T* and T_L* (Problem 9.5).
Example 9.4 (Normal eigenvalues)   For a numerical comparison of the efficiency gains in bias estimation from balanced resampling and post-simulation adjustment, we performed Monte Carlo experiments as follows. We generated n variates from the multivariate normal density with dimension 5 and identity covariance matrix, and took t to be the five eigenvalues of the sample covariance matrix. For each sample we used a large bootstrap to estimate the linear approximation t_L* for each of the eigenvalues and then calculated the correlation c between t* and t_L*. We then estimated the gains in efficiency for balanced and adjusted estimates of bias calculated using the bootstrap with R = 39, using variances estimated from 100 independent bootstrap simulations.
Figure 9.1 shows the gains in efficiency for each of the 5 eigenvalues, for 50 sets of data with n = 15 and 50 sets with n = 25; there are 500 points in each panel. The left panel compares the efficiency gains for the balanced and adjusted schemes. Balanced sampling gives better gains than post-sample adjustment, but the difference is smaller at larger gains. The right panel shows


the efficiency gains for the balanced scheme plotted against the correlation c. The solid line is the theoretical curve (1 − c²)^{−1}. Knowledge of c would enable the efficiency gain to be predicted quite accurately, at least for c > 0.8. The potential improvement from balancing is not guaranteed to be worthwhile when c < 0.7. The corresponding plot for the adjusted estimates suggests that c must be at least 0.85 for a useful efficiency gain.

This example suggests the following strategy when a good estimate of bias is required: perform a small standard unbalanced bootstrap, and use it to estimate the correlation between the statistic and its linear approximation. If that correlation exceeds about 0.7, it may be worthwhile to perform a balanced simulation, but otherwise it will not. If the correlation exceeds 0.85, post-simulation adjustment will usually be worthwhile, but otherwise it will not.

9.3 Control Methods

The basis of control methods is extra calculation during or after a series of simulations, with the aim of reducing the overall variability of the estimator. This can be applied to nonparametric simulation in several ways. The post-simulation balancing described in the preceding section is a simple control method, in which we store the simulated random samples and make a single post-simulation calculation.
Most control methods involve extra calculations at the time of the simulation, and are applicable when there is a simple statistic that is highly correlated with T*. Such a statistic is known as a control variate. The key idea is to write T* in terms of the control variate and the difference between T* and the control variate, and then to calculate the required properties for the control variate analytically, estimating only the differences by simulation.

Bias and variance
In many bootstrap contexts where T is an estimator, a natural choice for the control variate will be the linear approximation T_L* defined in (2.44). The moments of T_L* can be obtained theoretically using moments of the frequencies f_j*. In ordinary random sampling the f_j* are multinomial, so the mean and variance of T_L* are

E*(T_L*) = t,   var*(T_L*) = n^{−2} Σ_{j=1}^n l_j² = v_L.

In order to use T_L* as a control variate, we write T* = T_L* + D*, so that D* equals the difference T* − T_L*. The mean and variance of T* can then


be written

E*(T*) = E*(T_L*) + E*(D*),   var*(T*) = var*(T_L*) + 2cov*(T_L*, D*) + var*(D*),

the leading terms of which are known. Only terms involving D* need to be approximated by simulation. Given simulations T₁*,...,T_R* with corresponding linear approximations T*_{L,1},...,T*_{L,R} and differences D_r* = T_r* − T*_{L,r}, the mean and variance of T* are estimated by

t + D̄*,   V_{R,con} = v_L + (2/R) Σ_{r=1}^R (T*_{L,r} − T̄_L*)(D_r* − D̄*) + (1/R) Σ_{r=1}^R (D_r* − D̄*)²,   (9.12)

where T̄_L* = R^{−1} Σ_r T*_{L,r} and D̄* = R^{−1} Σ_r D_r*. Use of these and related approximations requires the calculation of the T*_{L,r} as well as the T_r*.
The estimated bias of T* based on (9.12) is B_{R,con} = D̄*. This is closely related to the estimate obtained under balanced simulation and to the re-centred bias estimate B_{R,adj}. Like them, it ensures that the linear component of the bias estimate equals its population value, zero. Detailed calculation shows that all three approaches achieve the same variance reduction for the bias estimate in large samples. However, the variance estimate in (9.12) based on linear approximation is less variable than the estimated variances obtained under the other approaches, because its leading term is not random.
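Once the linear approximations are available, (9.12) is a few lines of code; the sketch below is a minimal R version, assuming t.star and tL.star hold the R bootstrap values of T* and T_L*, and vL the delta method variance (the divisors R − 1 used by var and cov are immaterial for moderate R).

# A minimal sketch of the control estimates (9.12) of mean and variance.
control.est <- function(t.star, tL.star, t, vL)
{ d.star <- t.star - tL.star
  bias.con <- mean(d.star)                   # B_{R,con} = Dbar*
  v.con <- vL + 2*cov(tL.star, d.star) + var(d.star)
  c(mean = t + bias.con, variance = v.con) }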
Example 9.5 (City population data)   To see how effective control methods are in reducing the variability of a variance estimate, we consider the ratio statistic for the city population data in Table 2.1, with n = 10. For 100 bootstrap simulations with R = 50, we calculated the usual variance estimate v_R = (R − 1)^{−1} Σ (t_r* − t̄*)² and the estimate V_{R,con} from (9.12). The estimated gain in efficiency calculated from the 100 simulations is 1.92, which though worthwhile is not large. The correlation between t* and t_L* is 0.94.
For the larger set of data in Table 1.3, with n = 49, we repeated the experiment with R = 100. Here the gain in efficiency is 7.5, and the correlation is 0.99.
Figure 9.2 shows scatter plots of the estimated variances in these experiments. For both sample sizes the values of v_{R,con} are more concentrated than the values of v_R, though the main effect of control is to increase underestimates of the true variances.

[Figure 9.2 Comparison of estimated variances (×10^{−2}) for the city population ratio, using usual (x-axis: Usual) and control methods, for n = 10 with R = 50 (left) and for n = 49 with R = 100 (right). The dotted line is the line x = y, and the dashed lines show the true variances, estimated from a much larger simulation.]

Example 9.6 (Frets heads)   The data of Example 3.24 are a sample of n = 25 cases, each consisting of 4 measurements. We consider the efficiency gains from using v_{R,con} to estimate the bootstrap variances of the eigenvalues of their covariance matrix. The correlations between the eigenvalues and their linear approximations are 0.98, 0.89, 0.85 and 0.74, and the gains in efficiency estimated from 100 replicate bootstraps of size R = 39 are 2.3, 1.6, 0.95 and


1.3. The four left panels of Figure 9.3 show plots of the values of v_{R,con} against the values of v_R. No strong pattern is discernible.
To get a more systematic idea of the effectiveness of control methods in this setting, we repeated the experiment outlined in Example 9.4 and compared the usual and control estimates of the variances of the five eigenvalues. The results for the five eigenvalues and n = 15 and 25 are shown in Figure 9.3. Gains in efficiency are not guaranteed unless the correlation between the statistic and its linear approximation is 0.80 or more, and they are not large unless the correlation is close to one. The line y = (1 − x⁴)^{−1} summarizes the efficiency gain well, though we have not attempted to justify this.

Quantiles
Control methods may also be applied to quantiles. Suppose that we have the simulated values t₁*,...,t_R* of a statistic, and that the corresponding control variates and differences are available. We now sort the differences by the values of the control variates. For example, if our control variate is a linear approximation, with R = 4 and t*_{L,2} < t*_{L,1} < t*_{L,4} < t*_{L,3}, we put the differences in order d₂*, d₁*, d₄*, d₃*. The procedure now is to replace the p quantile of the linear approximation by a theoretical approximation, t_p, for p = 1/(R + 1),...,R/(R + 1), thereby replacing t*_{L,(r)} with t*_{C,r} = t_{r/(R+1)} + d*_{(r)}, where d*_{(r)} is the difference associated with the rth ordered control variate t*_{L,(r)}. In our example we would obtain t*_{C,1} = t_{0.2} + d₂*, t*_{C,2} = t_{0.4} + d₁*, t*_{C,3} = t_{0.6} + d₄*, and t*_{C,4} = t_{0.8} + d₃*. We now estimate the p quantile of the distribution of T* by t*_{C,((R+1)p)}, i.e. the (R + 1)p th ordered value of t*_{C,1},...,t*_{C,R}. If the control variate is highly correlated with T*, the bulk of the variability in the estimated quantiles will have been removed by using the theoretical approximation.
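A minimal R sketch of this construction is given below, assuming t.star and tL.star as before and a vector t.p of theoretical quantiles of the control variate at p = 1/(R + 1),...,R/(R + 1), obtained for example by saddlepoint approximation; the function control.quant is illustrative.

# A minimal sketch of control quantile estimation: theoretical quantiles
# of the control variate plus the differences, re-ordered by the ranks
# of the control variates.
control.quant <- function(t.star, tL.star, t.p, p)
{ R <- length(t.star)
  d.star <- t.star - tL.star
  t.C <- t.p + d.star[order(tL.star)]   # t*_{C,r} = t_{r/(R+1)} + d*_{(r)}
  sort(t.C)[round((R + 1)*p)] }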

[Figure 9.3 Efficiency comparisons for estimating variances of eigenvalues. The left panels (with panel titles including Third and Fourth) compare the usual and control variance estimates for the data of Example 3.24, for which n = 25, when R = 39. The right panel shows the gains made by the control estimate in 50 samples of sizes 15 and 25 from the normal distribution, as a function of the correlation between the statistic and its linear approximation (x-axis: Correlation, from 0.0 to 1.0); the solid line shows the line y = (1 − x⁴)^{−1}. See text for details.]

One desirable property of the control quantile estimates is that, unlike most other variance reduction methods, their accuracy improves with increasing n as well as R.
There are various ways to calculate the quantiles of the control variate. The preferred approach is to calculate the entire distribution of the control variate by saddlepoint approximation (Section 9.5), and to read off the required quantiles t_p. This is better than other methods, such as Cornish-Fisher expansion, because it guarantees that the quantiles of the control variate will increase with p.
Example 9.7 (Returns data)   To assess the usefulness of the control method just described, we consider setting studentized bootstrap confidence intervals for the rate of return in Example 6.3. We use case resampling to estimate quantiles of T* = (β̂₁* − β̂₁)/S*, where β̂₁* is the estimate of the regression slope, and S*² is the robust estimated variance of β̂₁* based on the linear approximation to β̂₁*.
For a single bootstrap simulation we calculated three estimates of the quantiles of T*: the usual estimates, the order statistics t*_{(1)} < ⋯ < t*_{(R)}; the control estimates taking the control variate to be the linear approximation to T* based on exact empirical influence values; and the control estimates obtained using the linear approximation with empirical influence values estimated by regression on the frequency array for the same bootstrap. In each case the quantiles of the control variate were obtained by saddlepoint approximation, as outlined in Example 9.13 below. We used R = 999 and repeated the experiment 50 times in order to estimate the variance of the quantile estimates.
[Figure 9.4 Efficiency and bias comparisons for estimating quantiles of a studentized bootstrap statistic for the returns data, based on a bootstrap of size R = 999. The left panel shows the variance of the usual quantile estimate divided by the variance of the control estimate based on an exact linear approximation, plotted against the corresponding normal quantile; the dashed lines show efficiencies of 1, 2, 3, 4 and 5. The right panel shows the estimated biases for the exact control (solid) and estimated control (dots) quantiles, against the normal quantile. See text for details.]

We estimated their bias by comparing them with quantiles of T* obtained from 100000 bootstrap resamples.
Figure 9.4 shows the efficiency gains of the exact control estimates relative to the usual estimates. The efficiency gain based on the linear approximation is not shown, but it is very similar. The right panel shows the biases of the two control estimates. The efficiency gains are largest for central quantiles, and are of order 1.5-3 for the quantiles of most interest, at about 0.025-0.05 and 0.95-0.975. There is some suggestion that the control estimates based on the linear approximation have the smaller bias, but both sets of biases are negligible at all but the most extreme quantiles.
The efficiency gains in this example are broadly in line with simulations reported in the literature; see also Example 9.10 below.

9.4 Importance Resampling

9.4.1 Basic estimators

Importance sampling
Most of our simulation calculations can be thought of as approximate integrations, with the aim of approximating

μ = ∫ m(y*) dG(y*)

for some function m(·), where y* is abbreviated notation for a simulated data set. In expression (9.1), for example, m(y*) = t(y*), and the distribution G for y* = (y₁*,...,yₙ*) puts mass n^{−n} on each element of the set 𝒮 = {y₁,...,yₙ}ⁿ.

When it is impossible to evaluate the integral directly, our usual approach is to generate R independent samples Y₁*,...,Y_R* from G, and to estimate μ by

μ̂_G = R^{−1} Σ_{r=1}^R m(Y_r*).

This estimator has mean and variance

E_G(μ̂_G) = μ,   var_G(μ̂_G) = R^{−1} { ∫ m(y*)² dG(y*) − μ² },

and so is unbiased for μ. In the situation mentioned above, this is a re-expression of ordinary bootstrap simulation. We use notation such as μ̂_G and E_G to indicate that estimates are calculated from random variables simulated from G, and that moment calculations are with respect to the distribution G.
One problem with μ̂_G is that some values of y* may contribute much more to μ than others. For example, suppose that the aim is to approximate the probability Pr*(T* ≤ t₀ | F̂), for which we would take m(y*) = I{t(y*) ≤ t₀}, where I{·} is the indicator function. If the event t(y*) ≤ t₀ is rare, then most of the simulations will contribute zero to the integral. The aim of importance sampling is to sample more frequently from those important values of y* whose contributions to the integral are greatest. This is achieved by sampling from a distribution that concentrates probability on these y*, and then weighting the values of m(y*) so as to mimic the approximation we would have used if we had sampled from G. Importance sampling in the case of the nonparametric bootstrap amounts to re-weighting samples from the empirical distribution function F̂, so in this context it is sometimes known as importance resampling.
The identity that motivates importance sampling is

$$\mu = \int m(y^*)\, dG(y^*) = \int m(y^*)\,\frac{dG(y^*)}{dH(y^*)}\, dH(y^*), \qquad (9.14)$$

where necessarily the support of H includes the support of G. Importance sampling approximates the right-hand side of (9.14) using independent samples Y_1*, …, Y_R* from H. The new approximation for μ is the raw importance sampling estimate

$$\hat\mu_{H,\mathrm{raw}} = R^{-1}\sum_{r=1}^{R} m(Y_r^*)\, w(Y_r^*), \qquad (9.15)$$

where w(y*) = dG(y*)/dH(y*) is known as the importance sampling weight. The estimate μ̂_{H,raw} has mean μ by virtue of (9.14), so is unbiased, and has variance

$$\mathrm{var}_H(\hat\mu_{H,\mathrm{raw}}) = R^{-1}\left\{\int m(y^*)^2\, w(y^*)\, dG(y^*) - \mu^2\right\}. \qquad (9.16)$$

Our aim is now to choose H so that

$$\int m(y^*)^2\, w(y^*)\, dG(y^*) < \int m(y^*)^2\, dG(y^*). \qquad (9.17)$$


Clearly the best choice is the one for which m(y*)w(y*) = μ, because then μ̂_{H,raw} has zero variance, but this is not usable because μ is unknown. In general it is hard to choose H, but sometimes the choice is straightforward, as we now outline.
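To make the raw estimator (9.15) concrete, here is a minimal Python sketch of importance resampling for a tail probability; the statistic, the tilted probabilities and the sample sizes are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def raw_importance_estimate(y, probs, t_fun, m_fun, R=999):
    """Raw importance resampling estimate (9.15).

    y     : original sample (n,)
    probs : resampling probabilities p_j defining H
    t_fun : statistic t(y*)
    m_fun : integrand m(.), e.g. an indicator for a tail probability
    """
    n = len(y)
    est = 0.0
    for _ in range(R):
        # resample indices from H, a multinomial with probabilities p_j
        idx = rng.choice(n, size=n, p=probs)
        f = np.bincount(idx, minlength=n)          # frequencies f_j*
        # w(y*) = dG/dH = prod_j (n p_j)^(-f_j*), on the log scale
        w = np.exp(-np.sum(f * np.log(n * probs)))
        est += m_fun(t_fun(y[idx])) * w
    return est / R

# toy illustration: Pr*(mean(y*) > t0) with a mild upper-tail tilt (assumed)
y = rng.exponential(size=20)
l = y - y.mean()                                   # influence values for the mean
p = np.exp(2.0 * l); p /= p.sum()                  # illustrative tilt
t0 = y.mean() + 2 * y.std() / np.sqrt(len(y))
print(raw_importance_estimate(y, p, np.mean, lambda t: float(t > t0)))
```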
Tilted distributions
A potentially important application is calculation of tail probabilities such as π = Pr*(T* ≤ t_0 | F̂), and the corresponding quantiles of T*. For probabilities m(y*) is taken to be the indicator function I{t(y*) ≤ t_0}, and if y_1, …, y_n is a single random sample from the EDF F̂ then dG(y*) = n^{-n}. Any admissible nonparametric choice for H is a multinomial distribution with probability p_j on y_j, for j = 1, …, n. Then

$$dH(y^*) = \prod_j p_j^{f_j^*},$$

where f_j* counts how many components of Y* equal y_j. We would like to choose the probabilities p_j to minimize var_H(μ̂_{H,raw}), or at least to make this much smaller than R^{-1}π(1 − π). This appears to be impossible in general, but if T* is close to normal we can get a good approximate solution.
Suppose that T* has a linear approximation T_L* which is accurate, and that the N(t, v_L) approximation for T_L* under ordinary resampling is accurate. Then the probability π we are trying to approximate is roughly Φ{(t_0 − t)/v_L^{1/2}}. If we were using simulation to approximate such a normal probability directly, then provided that t_0 < t a good (near-optimal) importance sampling method would be to generate t*'s from the N(t_0, v_L) distribution, where v_L is the nonparametric delta method variance. It turns out that we can arrange that this happens approximately for T* by setting

$$p_j \propto \exp(\lambda l_j), \qquad j = 1, \ldots, n, \qquad (9.18)$$

where the l_j are the usual empirical influence values for t. The result of Problem 9.10 shows that under this distribution T* is approximately N(t + λnv_L, v_L), so the appropriate choice for λ in (9.18) is approximately λ = (t_0 − t)/(nv_L), again provided t_0 < t; in some cases it is possible to choose λ to make T* have mean exactly t_0. The choice of probabilities given by (9.18) is called an exponential tilting of the original values n^{-1}. This idea is also used in Sections 4.4, 5.3, and 10.2.2.
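A short sketch of exponential tilting, assuming the influence values l_j are available; the root-finding step that makes the tilted mean exactly t_0 uses scipy.optimize.brentq, and the bracketing interval is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import brentq

def tilted_probs(l, lam):
    """Exponential tilting (9.18): p_j proportional to exp(lambda * l_j)."""
    x = lam * l
    p = np.exp(x - x.max())            # subtract max for numerical stability
    return p / p.sum()

def solve_tilt(l, t, t0):
    """Choose lambda so the tilted mean of T_L* = t + sum_j p_j l_j equals t0."""
    g = lambda lam: t + np.sum(tilted_probs(l, lam) * l) - t0
    return brentq(g, -50, 50)          # assumed bracket; widen if needed

# toy illustration with made-up influence values
rng = np.random.default_rng(2)
y = rng.exponential(size=15)
t = y.mean()
l = y - t                              # influence values for the mean
lam = solve_tilt(l, t, t0=t - 0.3)     # tilt towards the lower tail
print(lam, tilted_probs(l, lam))
```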
Table 9.4 Approximate efficiencies for estimating tail probability π under importance sampling with optimal tilted EDF when T* is approximately normal.

  π            0.01   0.025   0.05   0.2   0.5   0.8    0.95    0.975    0.99
  Efficiency   37     17      9.5    3.0   1.0   0.12   0.003   0.0005   0.00004

Table 9.4 shows approximate values of the efficiency R^{-1}π(1 − π)/var_H(μ̂_{H,raw}) of near-optimal importance resampling for various values of the tail probability π. The values were calculated using normal approximations for the distributions of T* under G and H; see Problem 9.8. The entries in the table suggest that for π ≤ 0.05 we could attain the same accuracy as with ordinary resampling with R reduced by a factor larger than about 10. Also shown in the table is the result of applying the exponential tilted importance resampling distribution when t > t_0, or π > 0.5: then importance resampling will be worse, possibly much worse, than ordinary resampling.
This last observation is a warning: straightforward importance sampling can be bad if misapplied. We can see how from (9.17). If dH(y*) becomes very small where m(y*) and dG(y*) are not small, then w(y*) = dG(y*)/dH(y*) will become very large and inflate the variance. For the tail probability calculation, if t_0 > t then all samples y* with t(y*) ≤ t_0 contribute R^{-1}w(y_r*) to μ̂_{H,raw}, and some of these contributions are enormous: although rare, they wreak havoc on μ̂_{H,raw}. A little thought shows that for t_0 > t one should apply importance sampling to estimate 1 − π = Pr*(T* > t_0) and subtract the result from 1, rather than estimate π directly.
Quantiles
To see how quantiles are estimated, suppose that we want to estimate the α quantile of the distribution of T*, and T* is approximately N(t, v_L) under G = F̂. Then we take a tilted distribution for H such that T* is approximately N(t + z_α v_L^{1/2}, v_L). For the situation we have been discussing, the exponential tilted distribution (9.18) will be near-optimal with λ = z_α/(nv_L^{1/2}), and in large samples this will be superior to G = F̂ for any α ≠ ½. So suppose that we have used importance resampling from this tilted distribution to obtain values t_1* ≤ ⋯ ≤ t_R* with corresponding weights w_1*, …, w_R*. Then for α < ½ the raw quantile estimate is t*_{(M)}, where M is defined by

$$\frac{1}{R+1}\sum_{r=1}^{M} w_r^* \le \alpha < \frac{1}{R+1}\sum_{r=1}^{M+1} w_r^*, \qquad (9.19)$$

while for α > ½ we define M by

$$\frac{1}{R+1}\sum_{r=M+1}^{R} w_r^* \le 1-\alpha < \frac{1}{R+1}\sum_{r=M}^{R} w_r^*;$$

see Problem 9.9. When there is no importance sampling we have w_r* = 1, and the estimate equals the usual t*_{((R+1)α)}.
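The raw quantile rule (9.19) translates directly into code; the following Python sketch implements the α < ½ case, with variable names chosen here for illustration.

```python
import numpy as np

def raw_weighted_quantile(t_star, w_star, alpha):
    """Raw importance resampling quantile estimate, following (9.19).

    Sorts the bootstrap values, accumulates their weights, and returns
    t*_(M) where the normalized cumulative weight first reaches alpha.
    Intended for alpha < 1/2; the upper-tail case mirrors this rule.
    """
    order = np.argsort(t_star)
    t_sorted = np.asarray(t_star)[order]
    w_sorted = np.asarray(w_star)[order]
    R = len(t_sorted)
    cum = np.cumsum(w_sorted) / (R + 1)
    M = np.searchsorted(cum, alpha, side="right")  # count of sums <= alpha
    return t_sorted[max(M - 1, 0)]

# with unit weights this reduces to the usual t*_((R+1)alpha)
rng = np.random.default_rng(3)
t = rng.normal(size=999)
print(raw_weighted_quantile(t, np.ones_like(t), 0.05))
print(np.sort(t)[int(1000 * 0.05) - 1])           # ordinary estimate, for comparison
```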
The variation in w(y*) and its implications are illustrated in the following example. We discuss stabilizing modifications to raw importance resampling in the next subsection.
Example 9.8 (Gravity data)  For an example of importance resampling, we follow Example 4.19 and consider testing for a difference in means for the last two series of Table 3.1. Here we use the studentized pivot test, with observed test statistic

$$z_0 = \frac{\bar y_2 - \bar y_1}{(s_2^2/n_2 + s_1^2/n_1)^{1/2}}, \qquad (9.20)$$

where ȳ_i and s_i² are the average and variance of the sample y_{i1}, …, y_{in_i} for i = 1, 2. The test compares z_0 to the distribution of the studentized pivot

$$Z = \frac{\bar Y_2 - \bar Y_1 - (\mu_2 - \mu_1)}{(S_2^2/n_2 + S_1^2/n_1)^{1/2}};$$

z_0 is the value taken by Z under the null hypothesis μ_1 = μ_2. The observed value of z_0 is 1.84, with normal one-sided significance probability Pr(Z > z_0) = 0.033.
We aim to estimate Pr(Z > z_0) by Pr*(Z* > z_0 | F̂), where F̂ stands for the EDFs of the two samples. In this case y* = (y*_{11}, …, y*_{1n_1}, y*_{21}, …, y*_{2n_2}), and G is the joint density under the two EDFs, so the probability on each simulated dataset is dG(y*) = n_1^{-n_1} × n_2^{-n_2}.
Because z_0 > 0 and the P-value is clearly below ½, raw importance sampling is appropriate and the estimated P-value is

$$\hat\mu_{H,\mathrm{raw}} = R^{-1}\sum_r I\{z_r^* > z_0\}\, w_r^*, \qquad w_r^* = \frac{dG(y_r^*)}{dH(y_r^*)}.$$
The choice of H is made by analogy with the single-sample case discussed earlier. The two EDFs are tilted so as to make Z* approximately N(z_0, v_L), which should be near-optimal. This is done by working with the linear approximation

$$Z_L^* = n_1^{-1}\sum_{j=1}^{n_1} f_{1j}^*\, l_{1j} + n_2^{-1}\sum_{j=1}^{n_2} f_{2j}^*\, l_{2j},$$

where f*_{1j} and f*_{2j} are the bootstrap sample frequencies of y_{1j} and y_{2j}, and the empirical influence values are

$$l_{1j} = -\frac{y_{1j} - \bar y_1}{(s_2^2/n_2 + s_1^2/n_1)^{1/2}}, \qquad l_{2j} = \frac{y_{2j} - \bar y_2}{(s_2^2/n_2 + s_1^2/n_1)^{1/2}}.$$
We take H to be the pair of exponential tilted distributions

$$p_{1j} = \Pr(Y_1^* = y_{1j}) \propto \exp(\lambda l_{1j}/n_1), \qquad p_{2j} = \Pr(Y_2^* = y_{2j}) \propto \exp(\lambda l_{2j}/n_2), \qquad (9.21)$$

where λ is chosen so that Z_L* has mean z_0: this should make Z* approximately N(z_0, v_L) under H. The explicit equation for λ is

$$\frac{\sum_{j=1}^{n_1} l_{1j}\exp(\lambda l_{1j}/n_1)}{\sum_{j=1}^{n_1}\exp(\lambda l_{1j}/n_1)} + \frac{\sum_{j=1}^{n_2} l_{2j}\exp(\lambda l_{2j}/n_2)}{\sum_{j=1}^{n_2}\exp(\lambda l_{2j}/n_2)} = z_0,$$

with approximate solution λ = z_0 since v_L = 1. For our data the exact solution is λ = 1.42.

[Figure 9.5 Importance resampling to test for a location difference between series 7 and 8 of the gravity data. The solid points in the left panel are the weights w* and bootstrap statistics z* for R = 99 importance resamples; the hollow points are the pairs (z*, w*) for 99 ordinary resamples. The right panel compares the survivor function Pr*(Z* > z*) estimated from 50000 ordinary bootstrap resamples (heavy solid) with estimates of it based on the 99 ordinary bootstrap samples (dashes) and the 99 importance resamples (solid). The vertical dotted lines show z_0.]

Figure 9.5 shows results for R = 99 simulations. The solid points in the left panel are the weights

$$w_r^* = \frac{dG(y_r^*)}{dH(y_r^*)} = \exp\left\{-\sum_j f_{1j}^*\log(n_1 p_{1j}) - \sum_j f_{2j}^*\log(n_2 p_{2j})\right\},$$

plotted against the bootstrap values z_r* for the importance resamples. These values of z* are shifted to the right relative to the hollow points, which show the values of z* and w* (all equal to 1) for 99 ordinary resamples. The values of w* for the importance re-weighting vary over several orders of magnitude, with the largest values when z* ≪ z_0. But only those for z* > z_0 contribute to μ̂_{H,raw}.

How well does this single importance resampling distribution work for estimating all values of the survivor function Pr*(Z* > z)? The heavy solid line in the right panel shows the true survivor function of Z* estimated from 50000 ordinary bootstrap simulations. The lighter solid line is the importance resampling estimate

$$R^{-1}\sum_{r=1}^{R} w_r^*\, I\{z_r^* > z\},$$

with R = 99, and the dotted line is the estimate based on 99 ordinary bootstrap samples from the null distribution. The importance resampling estimate follows the true survivor function accurately close to z_0 but does poorly for negative z*. The usual estimate does best near z* = 0 but poorly in the tail region of interest; the estimated significance probability is p̂ = 0. While the usual estimate decreases by R^{-1} at each z*, the weighted estimate decreases by much smaller jumps close to z_0; the raw importance sampling tail probability estimate is μ̂_{H,raw} = 0.015, which is very close to the true value. The weighted survivor function estimate has large jumps in its left tail, where the estimate is unreliable.
In 50 repetitions of this experiment the ordinary and raw importance resampling tail probability estimates had variances 2.09 × 10⁻⁴ and 2.63 × 10⁻⁵. For a tail probability of 0.015 this efficiency gain of about 8 is smaller than would be predicted from Table 9.4, the reason being that the distribution of z* is rather skewed and the normal approximation to it is poor.

In general there are several ways to obtain tilted distributions. We can use exponential tilting with exact empirical influence values, if these are readily available. Or we can estimate the influence values by regression using R_0 initial ordinary bootstrap resamples, as described in Section 2.7.4. Another way of using an initial set of bootstrap samples is to derive weighted smooth distributions as in (3.39); illustrations of this are given later in Examples 9.9 and 9.11.

9.4.2 Improved estimators

Ratio and regression estimators
One simple modification of the raw importance sampling estimate is based on the fact that the average weight R^{-1} Σ_r w(Y_r*) from any particular simulation will not equal its theoretical value of E{w(Y*)} = 1. This suggests that the weights w(Y_r*) be normalized, so that (9.15) is replaced by the importance resampling ratio estimate

$$\hat\mu_{H,\mathrm{rat}} = \frac{\sum_{r=1}^{R} m(Y_r^*)\, w(Y_r^*)}{\sum_{r=1}^{R} w(Y_r^*)}. \qquad (9.22)$$

To some extent this controls the effect of very large fluctuations in the weights.
In practice it is better to treat the weight as a control variate or covariate. Since our aim in choosing H is to concentrate sampling where m(·) is largest, the values of m(Y_r*)w(Y_r*) and w(Y_r*) should be correlated. If so, and if the average weight differs from its expected value of one under simulation from H, then the estimate μ̂_{H,raw} probably differs from its expected value μ. This motivates the covariance adjustment made in the importance resampling regression estimate

$$\hat\mu_{H,\mathrm{reg}} = \hat\mu_{H,\mathrm{raw}} - b(\bar w^* - 1), \qquad (9.23)$$

where w̄* = R^{-1} Σ_r w(Y_r*), and b is the slope of the linear regression of the m(Y_r*)w(Y_r*) on the w(Y_r*). The estimator μ̂_{H,reg} is the predicted value for m(Y*)w(Y*) at the point w(Y*) = 1.
The adjustments made to μ̂_{H,raw} in both μ̂_{H,rat} and μ̂_{H,reg} may induce bias, but such biases will be of order R^{-1} and will usually be negligible relative to simulation standard errors. Calculations outlined in Problem 9.12 indicate that for large R the regression estimator should outperform the raw and ratio estimators, but the improvement depends on the problem, and in practice the raw estimator of a tail probability or quantile is usually the best.
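The three estimators differ only in how they combine m(Y_r*) and the weights, so they can share one implementation; a minimal sketch, with np.polyfit standing in for the regression of mw on w.

```python
import numpy as np

def is_estimates(m_vals, w_vals):
    """Raw (9.15), ratio (9.22) and regression (9.23) importance
    resampling estimates from values m(Y_r*) and weights w(Y_r*)."""
    m = np.asarray(m_vals, dtype=float)
    w = np.asarray(w_vals, dtype=float)
    mw = m * w
    raw = mw.mean()
    ratio = mw.sum() / w.sum()
    b = np.polyfit(w, mw, 1)[0]          # slope of regression of mw on w
    reg = raw - b * (w.mean() - 1.0)     # predicted value of mw at w = 1
    return raw, ratio, reg

# toy illustration with mildly varying weights
rng = np.random.default_rng(4)
m = rng.normal(size=500)
w = np.exp(rng.normal(scale=0.2, size=500))
print(is_estimates(m, w))
```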
Defensive mixtures
A second improvement aims to prevent the weight w(y*) from varying wildly. Suppose that H is a mixture of distributions, πH_1 + (1 − π)H_2, where 0 < π < 1. The distributions H_1 and H_2 are chosen so that the corresponding probabilities are not both small simultaneously. Then the weights

$$w(y^*) = \frac{dG(y^*)}{\pi\, dH_1(y^*) + (1-\pi)\, dH_2(y^*)}$$

will vary less, because even if dH_1(y*) is very small, dH_2(y*) will keep the denominator away from zero, and vice versa. This choice of H is known as a defensive mixture distribution, and it should do particularly well if many estimates, with different m(y*), are to be calculated. The mixture is applied by stratified sampling, that is by generating exactly πR observations from H_1 and the rest from H_2, and using μ̂_{H,reg} as usual.
The components of the mixture H should be chosen to ensure that the relevant range of values of t* is well covered, but beyond this the detailed choice is not critical. For example, if we are interested in quantiles of T* for probabilities between α and 1 − α, then it would be sensible to target H_1 at the α quantile and H_2 at the 1 − α quantile, most simply by the exponential tilting method described earlier. As a further precaution we might add a third component to the mixture, such as G, to ensure stable performance in the middle of the distribution. In general the mixture could have many components, but careful choice of two or three will usually be adequate. The application of the mixture should always be by stratified sampling, to reduce variation.
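As a sketch of how defensive mixture weights stay bounded, the following Python fragment computes w(y*) for a three-component mixture of multinomial resampling distributions (ordinary plus two tilted); the probabilities and proportions are illustrative.

```python
import numpy as np

def log_multinomial_mass(f, probs):
    """Log of prod_j probs_j^{f_j}, ignoring the multinomial coefficient,
    which cancels in the ratio dG/dH."""
    keep = f > 0
    return np.sum(f[keep] * np.log(probs[keep]))

def defensive_weight(f, pis, prob_list):
    """w(y*) = dG / sum_k pi_k dH_k for a resample with frequencies f;
    component 0 is taken to be G itself (ordinary resampling)."""
    logs = np.array([log_multinomial_mass(f, p) for p in prob_list])
    log_dG = logs[0]
    m = logs.max()                        # stable log-sum-exp of the mixture
    log_den = m + np.log(np.sum(pis * np.exp(logs - m)))
    return np.exp(log_dG - log_den)

n = 10
p0 = np.full(n, 1 / n)                       # G: ordinary resampling
l = np.linspace(-1, 1, n)                    # illustrative influence values
p_lo = np.exp(-1.5 * l); p_lo /= p_lo.sum()  # tilted to the lower tail
p_hi = np.exp(+1.5 * l); p_hi /= p_hi.sum()  # tilted to the upper tail
pis = np.array([1 / 3, 1 / 3, 1 / 3])

rng = np.random.default_rng(5)
f = np.bincount(rng.choice(n, size=n, p=p_hi), minlength=n)
print(defensive_weight(f, pis, [p0, p_lo, p_hi]))
```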
Example 9.9 (Gravity data)  To illustrate the above ideas, we again consider the hypothesis testing problem of Example 9.8. The left panel of Figure 9.6 shows 20 replicate estimates of the null survivor function of z*, using ordinary bootstrap resampling with R = 299. The right panel shows 20 estimates of the survivor function using the regression estimate μ̂_{H,reg} after simulations with a defensive mixture distribution. This mixture has three components, which are G (the two EDFs), and two pairs of exponential tilted distributions targeted at the 0.025 and 0.975 quantiles of Z*. From our earlier discussion these distributions are given by (9.21) with λ = ±2/v_L; we shall denote the first pair of distributions by probabilities p_{1j} and p_{2j}, and the second by probabilities q_{1j} and q_{2j}. The first component G was used for R_1 = 99 samples, the second component (the ps) for R_2 = 100 and the third component (the qs) for R_3 = 100: the mixture proportions were therefore π_j = R_j/(R_1 + R_2 + R_3) for j = 1, 2, 3. The importance resampling weights were

$$w_r^* = \frac{n_1^{-n_1} n_2^{-n_2}}{\pi_1 n_1^{-n_1} n_2^{-n_2} + \pi_2 \prod_j p_{1j}^{f_{1j}^*}\prod_j p_{2j}^{f_{2j}^*} + \pi_3 \prod_j q_{1j}^{f_{1j}^*}\prod_j q_{2j}^{f_{2j}^*}},$$

where as before f*_{1j} and f*_{2j} respectively count how many times y_{1j} and y_{2j} appear in the resample.
For convenience we estimated the CDF of Z* at the sample values z_r*. The regression estimate at z* is obtained by setting m(y*) = I{z(y*) ≤ z} and calculating (9.23); this appears to involve 299 regressions for each CDF estimate, but Problem 9.13 shows how in fact just one matrix calculation is needed. The importance resampling estimate of the CDF is about as variable as the ordinary estimate over most of the distribution, but much less variable well into the tails.
[Figure 9.6 Importance resampling to test for a location difference between series 7 and 8 of the gravity data. In each panel the heavy solid line is the survivor function Pr(Z > z) estimated from 50000 ordinary bootstrap resamples and the vertical dotted lines show z_0. The left panel shows the estimates for 20 ordinary bootstraps of size 299. The right panel shows 20 importance resampling estimates using 299 samples with a regression estimate following resampling from a defensive mixture distribution with three components. See text for details.]

Table 9.5 Efficiency gains (ratios of mean squared errors) for estimating a tail probability, a bias, a variance and two quantiles for the gravity data, using importance resampling estimators together with defensive mixture distributions, compared to ordinary resampling. The mixtures have R_1 ordinary bootstrap samples mixed with R_2 samples exponentially tilted to the 0.025 quantile of z*, and with R_3 samples exponentially tilted to the 0.975 quantile of z*. See text for details.

  Mixture           Estimate     Pr*(Z* > z_0)   E*(Z*)   var*(Z*)   z_{0.05}   z_{0.025}
  R_1  R_2  R_3
   –    –   299     Raw              11.2         0.04      0.03       0.07       0.05
                    Ratio             3.5         0.06      0.05       0.06       0.04
                    Regression       12.4         0.18      0.07       0.06
   99  100  100     Raw               3.8         0.73      1.5        1.3        2.5
                    Ratio             3.4         0.79      1.5        0.93       1.3
                    Regression        4.0         0.93      1.6        0.87       1.2
   19  140  140     Raw               3.9         0.34      1.2        0.96       2.6
                    Ratio             2.3         0.43      0.82       0.48       1.1
                    Regression        4.3         0.69      1.3        0.44       1.3

For a more systematic comparison, we calculated the ratio of the mean
squared error from ordinary resampling to that when using defensive mixture distributions to estimate the tail probability Pr*(Z* > z_0) with z_0 = 1.77, two quantiles, and the bias E*(Z*) and the variance var*(Z*) for sampling from the two series. The mixture distributions have the same three components as before, but with different values for the numbers of samples R_1, R_2 and R_3 from each. Table 9.5 gives the results for three resampling mixtures with a total of R = 299 resamples in each case. The mean squared errors were estimated from 100 replicate bootstraps, with true values obtained from a single bootstrap of size 50000. The main contribution to the mean squared error is from variance rather than bias.
The first resampling distribution is not a mixture, but simply the exponential tilt to the 0.975 quantile. This gives the best estimates of the tail probability, with efficiencies for raw and regression estimates in line with Example 9.8, but it gives very poor estimates of the other quantities. For the other two mixtures the regression estimates are best for estimating the mean and variance, while the raw estimates are best for the quantiles and not really worse for the tail probability. Both mixtures are about the same for tail quantiles, while the first mixture is better for the moments.
In this case the efficiency gains for tail probabilities and quantiles predicted by Table 9.4 are unrealistic, for two reasons. First, the table compares 299 ordinary simulations with just 100 tilted to each tail of the first mixture distribution, so we would expect the variance for a tail quantity based on the mixture to be larger by a factor of about three; this is just what we see when the first distribution is compared to the second. Secondly, the distribution of Z* is quite skewed, which considerably reduces the efficiency out as far as the 0.95 quantile.
We conclude that the regression estimate is best for estimating central quantities, that the raw estimate is best for quantiles, that results for estimating quantiles are insensitive to the precise mixture used, and that theoretical gains may not be realized in practice unless a single tail quantity is to be estimated. This is in line with other studies.

9.4.3 Balanced importance resampling

Importance resampling works best for the extreme quantiles corresponding to small tail probabilities, but is less effective in the centre of a distribution. Balanced resampling, on the other hand, works best in the centre of a distribution. Balanced importance resampling aims to get the best of both worlds by combining the two, as follows.
Suppose that we wish to generate R balanced resamples in which y_j has overall probability p_j of occurring. To do this exactly in general is impossible for finite nR, but we can do so approximately by applying the following simple algorithm; a more efficient algorithm is described in Problem 9.14.

Algorithm 9.2 (Balanced importance resampling)

1. Choose R_1 ≈ nRp_1, …, R_n ≈ nRp_n, such that R_1 + ⋯ + R_n = nR.
2. Concatenate R_1 copies of y_1 with R_2 copies of y_2 with … with R_n copies of y_n, to form 𝒲.
3. Permute the nR elements of 𝒲 at random, and read off the R balanced importance resamples as sets of n successive elements of the permuted list.

A simple way to choose the R_j is to set R′_j = 1 + [n(R − 1)p_j], j = 1, …, n, where [·] denotes integer part, and to set R_j = R′_j + 1 for the d = nR − (R′_1 + ⋯ + R′_n) values of j with the largest values of nRp_j − R′_j; we set R_j = R′_j for the rest. This ensures that all the observations are represented in the bootstrap simulation.
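A direct Python rendering of Algorithm 9.2, including the simple rule for choosing the R_j; the statistic evaluated on the resamples is an assumption for the demonstration.

```python
import numpy as np

def balanced_importance_resamples(y, p, R, rng):
    """Algorithm 9.2: R approximately independent resamples of size n in
    which y_j appears about n*R*p_j times overall."""
    n = len(y)
    # choose R_j ~= n R p_j summing exactly to n R
    Rj = 1 + np.floor(n * (R - 1) * p).astype(int)
    d = n * R - Rj.sum()
    top = np.argsort(n * R * p - Rj)[::-1][:d]   # largest values of nRp_j - R_j'
    Rj[top] += 1
    # concatenate R_j copies of each y_j, permute, and cut into R samples
    pool = np.repeat(np.arange(n), Rj)
    rng.shuffle(pool)
    return y[pool].reshape(R, n)

rng = np.random.default_rng(6)
y = rng.exponential(size=10)
p = np.exp(y - y.mean()); p /= p.sum()            # illustrative tilt
samples = balanced_importance_resamples(y, p, R=999, rng=rng)
t_star = samples.var(axis=1)                      # illustrative statistic
print(t_star[:5])
```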
Provided that R is large relative to n, individual samples will be approximately independent, and hence the weight associated with a sample having frequencies (f_1*, …, f_n*) is approximately

$$w^* = \prod_{j=1}^{n} (np_j)^{-f_j^*};$$

this does not take account of the fact that sampling is without replacement.
Figure 9.7 shows the theoretical large-sample efficiencies of balanced resampling, importance resampling, and balanced importance resampling for estimating the quantiles of a normal statistic. Ordinary balance gives maximum efficiency of 2.76 at the centre of the distribution, while importance resampling works well in the lower tail but badly in the centre and upper tail of the distribution. Balanced importance resampling dominates both.

[Figure 9.7 Asymptotic efficiencies of balanced importance resampling (solid), importance resampling (large dashes), and balanced resampling (small dashes) for estimating the quantiles of a normal statistic, plotted against the normal quantile. The dotted horizontal line is at relative efficiency one.]
Example 9.10 (Returns data)  In order to assess how well these ideas might work in practice, we again consider setting studentized bootstrap confidence intervals for the slope in the returns example. We performed an experiment like that of Example 9.7, but with the R = 999 bootstrap samples generated by balanced resampling, importance resampling, and balanced importance resampling.
Table 9.6 shows the mean squared error for the ordinary bootstrap divided by the mean squared errors of the quantile estimates for these methods, using 50 replicate simulations from each scheme. This slightly different efficiency takes into account any bias from using the improved methods of simulation, though in fact the contribution to mean squared error from bias is small. The true quantiles are estimated from an ordinary bootstrap of size 100000.
The first tw o lines o f the table show the efficiency gains due to using the
control m ethod w hen the linear approxim ation is used as a control variate,
w ith em pirical influence values calculated exactly and estim ated by regression
from the sam e b o o tstrap sim ulation. The results differ little. T he next two rows
show the gains due to balanced sampling, both w ithout and w ith the control

462

M eth o d

9 Improved Calculation

D istrib u tio n

Q u an tile (% )
1

2.5

10

50

90

95

97.5

99

C o n tro l (exact)
C o n tro l (approx)

1.7
1.4

2.7
2.8

2.8
3.2

4.0
4.1

11.2
11.8

5.5
5.1

2.4
2.2

2.6
2.6

1.4
1.3

B alance
w ith co n tro l

1.0
1.4

1.2
1.8

1.5
3.0

1.4
2.8

3.1
4.4

2.9
4.7

1.7
2.5

1.4
2.2

0.6
1.5

7.8
4.6
3.6
4.3
2.6

3.7
2.9
3.7
2.6
2.1

3.6
3.5
2.0
2.5
0.7

1.8
1.1
1.7
1.8
0.3

0.4
0.1
0.5
0.9
0.4

3.5
2.6
2.4
1.6
0.5

2.3
3.1
2.2
1.6
0.6

3.1
4.3
2.6
2.2
1.6

5.5
5.2
3.6
2.3
2.1

5.0
4.2
5.2
4.3
3.2

5.7
3.4
4.2
3.3
2.8

4.1
2.4
3.8
3.4
1.0

1.9
1.8
1.8
2.2
0.4

0.5
0.2
0.9
2.1
0.9

2.6
2.0
3.0
2.7
0.9

2.2
3.6
2.4
3.7
1.4

6.3
4.2
4.0
3.3
2.1

4.5
3.9
4.0
4.3
2.1

Im p o rtan ce

Hi
Hi
Hi
H*
Hs

B alanced
im p o rtan ce

Hi
Hi
h3
h4
h 5

method, which gives a worthwhile improvement in performance, except in the tail.
The next five lines show the gains due to different versions of importance resampling, in each case using a defensive mixture distribution and the raw quantile estimate. In practice it is unusual to perform a bootstrap simulation with the aim of setting a single confidence interval, and the choice of importance sampling distribution H must balance various potentially conflicting requirements. Our choices were designed to reflect this. We first suppose that the empirical influence values l_j for t are known and can be used for exponential tilting of the linear approximation t_L* to t*. The first defensive mixture, H1, uses 499 simulations from a distribution tilted to the α quantile of t_L* and 500 simulations from a distribution tilted to the 1 − α quantile of t_L*, for α = 0.05. The second mixture is like this but with α = 0.025.
The third, fourth and fifth distributions are the sort that might be used in practice with a complicated statistic. We first performed an ordinary bootstrap of size R_0, which we used to estimate first the empirical influence values l_j by regression and then the tilt values for the 0.05 and 0.95 quantiles. We then performed a further bootstrap of size (R − R_0)/2 using each set of tilted probabilities, giving a total of R simulations from three different distributions, one centred and two tilted in opposite directions. We took R_0 = 199 and R_0 = 499, giving H3 and H4. For H5 we took R_0 = 499, but estimated the tilted distributions by frequency smoothing (Section 3.9.2) with bandwidth ε = 0.5v^{1/2} at the 0.05 and 0.95 quantiles of t*, where v^{1/2} is the standard error of t estimated from the ordinary bootstrap.

Balance generally improves importance resampling, which is not sensitive to the mixture distribution used. The effect of estimating the empirical influence values is not marked, while frequency smoothing does not perform so well as exponential tilting. Importance resampling estimates of the central quantiles are poor, even when the simulation is balanced. Overall, any of schemes H1–H4 leads to appreciably more accurate estimates of the quantiles usually of interest.

9.4.4 Bootstrap recycling

In Section 3.9 we introduced the idea of bootstrapping the bootstrap, both for making bias adjustments to bootstrap calculations and for studying the variation of properties of statistics. Further applications of the idea were described in Chapters 4 and 5. In both parametric and nonparametric applications we need to simulate samples from a series of distributions, themselves obtained from simulations in the nonparametric case. Recycling methods replace many sets of simulated samples by one set of samples and many sets of weights, and have the potential to reduce the computational effort greatly. This is particularly valuable when the statistic of interest is expensive to calculate, for example when it involves a difficult optimization, or when each bootstrap sample is costly to generate, as when using Markov chain Monte Carlo methods (Section 4.2.2).
The basic idea is repeated use of the importance sampling identity (9.14), as follows. Suppose that we are trying to calculate μ_k = E{m(Y) | G_k} for a series of distributions G_1, …, G_K. The naive Monte Carlo approach is to calculate each value μ_k = E{m(Y) | G_k} independently, simulating R samples y_1, …, y_R from G_k and calculating μ̂_k = R^{-1} Σ_r m(y_r). But for any distribution H whose support includes that of G_k we have

$$E\{m(Y)\mid G_k\} = \int m(y)\, dG_k(y) = \int m(y)\,\frac{dG_k(y)}{dH(y)}\, dH(y) = E\left\{m(Y)\,\frac{dG_k(Y)}{dH(Y)}\;\middle|\; H\right\}.$$

We can therefore estimate all K values using one set of samples y_1, …, y_N simulated from H, with estimates

$$\hat\mu_k = N^{-1}\sum_{l=1}^{N} m(y_l)\,\frac{dG_k(y_l)}{dH(y_l)}. \qquad (9.24)$$

In some contexts we may choose N to be much larger than the value R we might use for a single simulation, but less than KR. It is important to choose H carefully, and to take account of the fact that the estimates are correlated. Both N and the choice of H depend upon the use being made of the estimates and the form of m(·).
Example 9.11 (City population data)  Consider again estimating the bias and variance functions for the ratio θ = t(F) of the city population data with n = 10. In Example 3.22 we estimated b(F) = E(T | F) − t(F) and v(F) = var(T | F) for a range of values of θ = t(F), using a first-level bootstrap to calculate values of t* for 999 bootstrap samples F̂*, and then doing a second-level bootstrap to estimate b(F̂*) and v(F̂*) for each of those samples. Here the second level of resampling is avoided by using importance re-weighting. At the same time, we retain the smoothing introduced in Example 3.22.
Rather than take each G_k to be one of the bootstrap EDFs F̂*, we obtain a smooth curve by using smooth distributions F̂_θ with probabilities p_j(θ) as defined by (3.39). Recall that the parameter value of F̂_θ is t(F̂_θ) = θ*, say, which will differ slightly from θ. For H we take F̂, the EDF of the original data, on the grounds that it has the correct support and covers the range of values for y well: it is not necessarily a good choice. Then we have weights

$$\frac{dG_k(y_r^*)}{dH(y_r^*)} = \frac{d\hat F_\theta(y_r^*)}{d\hat F(y_r^*)} = \prod_{j=1}^{n}\left\{\frac{p_j(\theta)}{n^{-1}}\right\}^{f_{rj}^*} = w_r^*(\theta),$$

say, where as usual f*_{rj} is the frequency with which y_j occurs in the rth bootstrap sample. We should emphasize that the samples y_r* drawn from H here replace second-level bootstrap samples.
Consider the bias estimate. The weighted sum R^{-1} Σ_r (t_r* − θ*) w_r*(θ) is an unbiased estimate of the bias E(T* | F̂_θ) − θ*, and we can plot this estimate to see how the bias varies as a function of θ* or θ. However, the weighted sum can behave badly if a few of the w_r*(θ) are very large, and it is better to use the ratio and regression estimates (9.22) and (9.23).
The top left panel of Figure 9.8 shows raw, ratio, and regression estimates of the bias, based on a single set of R = 999 simulations, with the curve obtained from the double bootstrap calculation used in Figure 3.7. For example, the ratio estimate of bias for a particular value of θ is Σ_r (t_r* − θ*) w_r*(θ) / Σ_r w_r*(θ), and this is plotted as a function of θ*. The raw and ratio estimates are rather poor, but the regression estimate agrees fairly well with the double bootstrap curve. The panel also shows the estimated bias from a defensive mixture with 499 ordinary samples mixed with 250 samples tilted to each of the 0.025 and 0.975 quantiles; this is the best estimate of those we consider. The panels below show 20 replicates of these estimated biases. These confirm the impression from the panel above: with ordinary resampling the regression estimator is best, but it is better to use the mixture distribution.
[Figure 9.8 Estimated bias and standard error functions for the city population data ratio. In the top panels the heavy solid lines are the double bootstrap values shown in Figure 3.7, and the others are the raw estimate (large dashes), the ratio estimate (small dashes), the regression estimate (dots), and the regression estimate based on a defensive mixture distribution (light solid). The lower panels show 20 replicates of raw, ratio, and regression estimates from ordinary sampling, and the regression estimate from a defensive mixture (clockwise from upper left) for the panels above.]

The top right panel shows the corresponding estimates for the standard
error of t. For each of a range of values of θ, we calculate this by estimating the bias and mean squared error

$$b(\hat F_\theta) = E^{**}(T^{**} - \theta^* \mid \hat F_\theta), \qquad E^{**}\{(T^{**} - \theta^*)^2 \mid \hat F_\theta\}$$

by each of the raw, ratio, and regression methods, and plotting the resulting estimate

$$\hat v^{1/2}(\hat F_\theta) = \left[E^{**}\{(T^{**} - \theta^*)^2 \mid \hat F_\theta\} - \{E^{**}(T^{**} - \theta^* \mid \hat F_\theta)\}^2\right]^{1/2}.$$

The conclusions are the same as for the bias estimate.
As we saw in Example 3.22, here T − θ is not a stable quantity, because its mean and variance depend heavily on θ.

The results for the raw estimate suggest that recycling can give very variable results, and it must be used with care, as the next example vividly illustrates.
Example 9.12 (Bias adjustment)  Consider the problem of adjusting the bootstrap estimate of bias of T, discussed in Section 3.9. The adjustment C in equation (3.30) is based on $(RM)^{-1}\sum_{r=1}^{R}\sum_{m=1}^{M}(t_{rm}^{**} - t_r^*)$, which uses M samples from each of the R models F̂_r* fitted to samples from F̂. The recycling method replaces each average $M^{-1}\sum_{m=1}^{M}(t_{rm}^{**} - t_r^*)$ by a weighted average of the form (9.24),

$$N^{-1}\sum_{l=1}^{N}(t_l - t_r^*)\,\frac{d\hat F_r^*(y_l)}{dH(y_l)}, \qquad (9.25)$$

where t_l is the value of T for the lth sample y_l = (y_{l1}, …, y_{ln}) drawn from the distribution H. If we applied recycling only to the first term of C, which estimates E(T**), then a different, and as it turns out inferior, estimate would be obtained for C.
The support of H must include all R first-level bootstrap samples, so as in the previous example a natural choice is H = F̂, the model fitted to (or the EDF of) the original sample. However, this can give highly unstable results, as one might predict from the leftmost panel in the second row of Figure 9.8. This can be illustrated by considering the case of the parametric model Y ~ N(θ, 1), with estimate T = Ȳ. Here the terms being summed in (9.25) have infinite variance; see Problem 9.15. The difficulty arises from the choice H = F̂, and can be avoided by taking H to be a mixture as described in Section 9.4.2, with at least three components.
Instability due to the choice H = F̂ does not occur with all applications of recycling. Indeed applications to bootstrap likelihood (Chapter 10) work well with this choice.

9.5 Saddlepoint Approximation

9.5.1 Basic approximations

Basic ideas
Let W_1, …, W_n be a random sample from a continuous distribution F with cumulant-generating function K_W(ξ) = log E{exp(ξW)}. Suppose that we are interested in the linear combination U = Σ a_j W_j, and that this has exact PDF and CDF g(u) and G(u). Under suitable conditions, the cumulant-generating function of U, which is K(ξ) = Σ_j K_W(a_j ξ), is the basis of highly accurate approximations to the PDF and CDF of U, known as saddlepoint approximations. The saddlepoint approximation to the density of U at u is

$$g_s(u) = \{2\pi K''(\hat\xi)\}^{-1/2}\exp\{K(\hat\xi) - \hat\xi u\}, \qquad (9.26)$$

where the saddlepoint ξ̂ satisfies the saddlepoint equation

$$K'(\hat\xi) = u, \qquad (9.27)$$

and is therefore a function of u. Here K′ and K″ are respectively the first and second derivatives of K with respect to ξ. A simple approximation to the CDF of U, Pr(U ≤ u), is

$$G_s(u) = \Phi\left\{w + \frac{1}{w}\log\left(\frac{v}{w}\right)\right\}, \qquad (9.28)$$

where Φ(·) denotes the standard normal CDF, and

$$w = \mathrm{sign}(\hat\xi)\,[2\{\hat\xi u - K(\hat\xi)\}]^{1/2}, \qquad v = \hat\xi\{K''(\hat\xi)\}^{1/2}$$

are both functions of u. An alternative to (9.28) is the Lugannani–Rice formula

$$G_s(u) = \Phi(w) + \phi(w)\left(\frac{1}{w} - \frac{1}{v}\right), \qquad (9.29)$$

but in practice the difference between them is small. When ξ̂ = 0 the CDF approximation is more complicated and we do not give it here. The approximations are constructed by numerical solution of the saddlepoint equation to obtain the value of ξ̂ for each value of u of interest, from which the approximate PDF and CDF are readily calculated.
Formulae such as (9.26) and (9.28) can provide remarkably accurate approximations in many statistical problems. In fact,

$$g(u) = g_s(u)\{1 + O(n^{-1})\}, \qquad G(u) = G_s(u)\{1 + O(n^{-3/2})\},$$

for values of u such that |w| ≤ c for some positive c; the error in the CDF approximation rises to O(n^{-1}) when u is such that |w| ≤ cn^{1/2}. A key feature is that the error is relative, so that the ratio of the true density of U to its saddlepoint approximation is bounded over the likely range of u. A consequence is that unlike other analytic approximations to densities and tail probabilities, (9.26), (9.28) and (9.29) are very accurate far into the tails of the density of U. If there is doubt about the accuracy of (9.28) and (9.29), G_s may be calculated by numerical integration of g_s.
The more complex formulae that are used for conditional and marginal density and distribution functions are given in Sections 9.5.2 and 9.5.3.

Table 9.7 Comparison of saddlepoint approximation to bootstrap α quantiles (×10⁻²) of a linear statistic for samples of sizes 10 and 15, with results from R = 49999 simulations.

  n    α (%)    0.1   0.5    1      2.5    5      95     97.5   99     99.5    99.9
  10   Simn     7.8   10.9   12.8   15.4   18.1   78.5   85.1   96.0   102.1   116.4
       Spoint   7.6   10.8   12.5   15.2   17.8   78.1   85.9   95.3   101.9   115.8
  15   Simn     11.8  13.6   14.5   16.0   17.4   37.4   39.7   42.3   44.4    48.2
       Spoint   11.7  13.5   14.4   15.9   17.4   37.4   39.7   42.4   44.3    48.2

Application to resampling
In the context of resampling, suppose that we are interested in the distribution of the average of a sample from y_1, …, y_n, where y_j is sampled with probability p_j, j = 1, …, n. Often, but not always, p_j = n^{-1}. We can write the average as U* = n^{-1} Σ f_j* y_j, where as usual (f_1*, …, f_n*) has a joint multinomial distribution with denominator n. Then U* has cumulant-generating function

$$K(\xi) = n\log\left\{\sum_{j=1}^{n} p_j\exp(\xi a_j)\right\}, \qquad (9.30)$$

where a_j = y_j/n. The function (9.30) can be used in (9.26) and (9.28) to give non-random approximations to the PDF and CDF of U*. Unlike most of the methods described in this book, the error in saddlepoint approximations arises not from simulation variability, but from deterministic numerical error in using g_s and G_s rather than the exact density and distribution function.
In principle, of course, a nonparametric bootstrap statistic is discrete and so the density does not exist, but as we saw in Section 2.3.2, U* typically has so many possible values that we can think of it as continuous away from the extreme tails of its distribution. Continuity corrections can sometimes be applied, but they make little difference in bootstrap applications.
When it is necessary to approximate the entire distribution of U*, we calculate the values of G_s(u) for m values of u equally spaced between min a_j and max a_j and use a spline smoother to interpolate between the corresponding values of Φ^{-1}{G_s(u)}. Quantiles and cumulative probabilities for U* can be read off the fitted curve. Experience suggests that m = 50 is usually ample.
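The whole recipe (solve (9.27) for the CGF (9.30), then apply (9.28)) fits in a few lines of Python; scipy.optimize.brentq does the root-finding, and the example statistic and bracketing interval are assumptions.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def saddlepoint_cdf(a, p, u):
    """Saddlepoint CDF approximation (9.28) for U* = sum_j f_j* a_j,
    using the multinomial cumulant-generating function (9.30)."""
    n = len(a)
    K = lambda x: n * np.log(np.sum(p * np.exp(x * a)))
    def K1(x):  # K'(xi)
        e = p * np.exp(x * a)
        return n * np.sum(e * a) / np.sum(e)
    def K2(x):  # K''(xi)
        e = p * np.exp(x * a); s = np.sum(e)
        m1 = np.sum(e * a) / s
        return n * (np.sum(e * a**2) / s - m1**2)
    xi = brentq(lambda x: K1(x) - u, -200, 200)   # saddlepoint equation (9.27)
    if abs(xi) < 1e-8:                            # xi = 0 needs special handling
        return 0.5
    w = np.sign(xi) * np.sqrt(2 * (xi * u - K(xi)))
    v = xi * np.sqrt(K2(xi))
    return norm.cdf(w + np.log(v / w) / w)

# illustrative check against simulation for a resampled mean
rng = np.random.default_rng(8)
y = rng.exponential(size=10)
a, p = y / len(y), np.full(len(y), 0.1)
u0 = 0.9 * y.mean()
sim = np.mean([y[rng.choice(10, 10)].mean() <= u0 for _ in range(20000)])
print(saddlepoint_cdf(a, p, u0), sim)
```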
Example 9.13 (Linear approximation)  A simple application of these ideas is to the linear approximation t_L* for a bootstrap statistic t*, as was used in Example 9.7. We write T_L* = t + n^{-1} Σ f_j* l_j, where as usual f_j* is the frequency of the jth case in the bootstrap sample and l_j is the jth empirical influence value. The cumulant-generating function of T_L* − t is (9.30) with a_j = l_j/n and

o f saddlepoint
approxim ation to
bootstrap a quantiles
(x lO -2 ) o f a linear
statistic for samples of
sizes 10 and 15, with
results from R = 49 999
simulations.

469

9.5 Saddlepoint Approximation

Figure 9.9 Comparison


of the saddlepoint
approximation (solid) to
the PDF of a linear
statistic with results
from a bootstrap
simulation with
R = 49 999, for samples
of sizes 10 (left) and 15
(right).

A
O

cvi

o
Q

Q
Q_

co

CL
CM

in

Jh

0.0

0.5

1.0

1.5

lltlii**,----------

0.1

0.2

0.3

0.4

0.5

0.6

Pj = n

so the saddlepoint equation for approxim ation to the P D F and C D F

o f T[ a t t'L is
E " = i O exP ( ^ ;/ ) _ .
E J - . e x p ({ /;/ )
L

whose solution is \.
For a numerical example, we take the variance t = n^{-1} Σ (y_j − ȳ)² for exponential samples of sizes 10 and 15; the empirical influence values are l_j = (y_j − ȳ)² − t. Figure 9.9 compares the saddlepoint approximations to the PDFs of t_L* with the histogram from bootstrap calculations with R = 49999. The saddlepoint approximation accurately reflects the skewed lower tail of the bootstrap distribution, whereas a normal approximation would not do so. However, the saddlepoint approximation does not pick up the multimodality of the density for n = 10, which arises for the same reason as in the right panels of Figure 2.9: the bulk of the variability of T_L* is due to a few observations with large values of |l_j|, while those for which |l_j| is small merely add noise. The figure suggests that with so small a sample the CDF approximation will be more useful. This is borne out by Table 9.7, which compares the simulation quantiles and quantiles obtained by fitting a spline to 50 saddlepoint CDF values.
In more complex applications the empirical influence values l_j would usually be estimated by numerical differentiation or by regression, as outlined in Sections 2.7.2, 2.7.4 and 3.2.1.

Example 9.14 (Tuna density estimate)  We return to the double bootstrap used in Example 5.13 to calibrate confidence intervals based on a kernel density estimate. This involved estimating the probabilities

$$\Pr^{**}(T^{**} \le 2t^* - t \mid \hat F^*), \qquad (9.31)$$

where t is the variance-stabilized estimate of the quantity of interest. The double bootstrap version of t can be written as t** = (Σ_j f_j** a_j*)^{1/2}, where a_j* = (nh)^{-1}{φ(−y_j*/h) + φ(y_j*/h)} and f_j** is the frequency with which y_j* appears in a second-level bootstrap sample. Conditional on a first-level bootstrap sample F̂* with frequencies f_1*, …, f_n*, the (f_1**, …, f_n**) are jointly multinomial with mean vector (f_1*, …, f_n*) and denominator n.
Now if 2t* − t < 0, the probability (9.31) equals zero, because T** is positive. If 2t* − t > 0, the event T** ≤ 2t* − t is equivalent to Σ_j f_j** a_j* ≤ (2t* − t)². Thus conditional on F̂*, if 2t* − t > 0, we can obtain a saddlepoint approximation to (9.31) by applying (9.28) and (9.30) with u = (2t* − t)² and p_j = f_j*/n.
Including programming, it took about ten minutes to calculate 3000 values of (9.31) by saddlepoint approximation; direct simulation with 250 samples at the second level took about four hours on the same workstation.

Estimating functions
One simple extension of the basic approximations is to statistics determined by monotonic estimating functions. Suppose that the value of a scalar bootstrap statistic T* based on sampling from y_1, …, y_n is the solution to the estimating equation

$$U^*(t) = \sum_{j=1}^{n} a(t; y_j)\, f_j^* = 0, \qquad (9.32)$$

where for each y the function a(θ; y) is decreasing in θ. Then T* ≤ t if and only if U*(t) ≤ 0. Hence Pr*(T* ≤ t) may be estimated by G_s(0) applied with cumulant-generating function (9.30) in which a_j = a(t; y_j). A saddlepoint approximation to the density of T* is

$$g_s(t) = \left|\frac{\dot K(\hat\xi)}{\hat\xi}\right|\{2\pi K''(\hat\xi)\}^{-1/2}\exp\{K(\hat\xi)\}, \qquad (9.33)$$

where K̇(ξ) = ∂K/∂t, and ξ̂ solves the equation K′(ξ̂) = 0. The first term on the right in (9.33) corresponds to the Jacobian for transformation from the density of U* to that of T*.
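Since T* ≤ t exactly when U*(t) ≤ 0, a tail probability for an estimating-function statistic needs only the CDF routine evaluated at u = 0 with a_j = a(t; y_j); a compact, self-contained sketch, using the Huber ψ of (9.35) below purely as an illustration.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def sp_cdf_estfun(avals, u=0.0):
    """Pr*(T* <= t) ~ G_s(0) for U*(t) = sum_j a(t; y_j) f_j*, via (9.28)
    and the multinomial CGF (9.30) with p_j = 1/n and a_j = a(t; y_j)."""
    a = np.asarray(avals, dtype=float)
    n = len(a)
    K = lambda x: n * np.log(np.mean(np.exp(x * a)))
    K1 = lambda x: n * np.sum(np.exp(x * a) * a) / np.sum(np.exp(x * a))
    def K2(x):
        e = np.exp(x * a); m = np.sum(e * a) / np.sum(e)
        return n * (np.sum(e * a**2) / np.sum(e) - m**2)
    xi = brentq(lambda x: K1(x) - u, -100, 100)   # assumed bracket
    if abs(xi) < 1e-8:
        return 0.5
    w = np.sign(xi) * np.sqrt(2 * (xi * u - K(xi)))
    v = xi * np.sqrt(K2(xi))
    return norm.cdf(w + np.log(v / w) / w)

psi = lambda e, c=1.345: np.clip(e, -c, c)        # Huber psi, as in (9.35)
rng = np.random.default_rng(9)
y = rng.normal(size=15)
s = np.median(np.abs(y - np.median(y)))           # scale fixed at the MAD
t = 0.5                                           # evaluate Pr*(T* <= 0.5)
print(sp_cdf_estfun(psi((y - t) / s)))
```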

471

9.5 Saddlepoint Approximation

Example 9.15 (M aize data)


Problem 4.7 contains d a ta from a paired com
parison experim ent perform ed by D arw in on the grow th o f m aize plants. The
d a ta are reduced to 15 differences y \ , . . . , y n betw een the heights (in eighths o f
an inch) o f cross-fertilized and self-fertilized plants. W hen two large negative
values are excluded, the differences have average J> = 33 and look close to
norm al, b u t w hen those values are included the average drops to 20.9.
W hen d a ta m ay have been co ntam inated by outliers, robust M -estim ates are
useful. If we assum e th a t Y = 8 + as, where the distribution o f e is sym m etric
a b o u t zero b u t m ay have long tails, an estim ate o f location 0 can be found by
solving the equation
' = 0,

(9.34)

j=i
where tp(e) is designed to dow nw eight large values o f s. A com m on choice is
the H u b er estim ate determ ined by

y>(e) =

c,

(9.35)

W ith c = oo this gives 1p(s) = s and leads to the norm al-theory estim ate 9 = y,
b u t a sm aller choice o f c will give b etter behaviour w hen there are outliers.
W ith c = 1.345 and a fixed a t the m edian absolute deviation s o f the data,
we obtain 8 = 26.45. H ow variable is this? We can get some idea by looking
at replicates o f 9 based on b o o tstrap sam ples y j,...,y * . A b o o tstrap value 9*
solves

so the saddlepoint approxim ation to the P D F o f b o o tstrap values is obtained


starting from (9.32) w ith a(f , yj ) = y>{(yj t)/s}. The left panel o f Figure 9.10
com pares the saddlepoint approxim ation with the em pirical distribution o f 9*,
and w ith the approxim ate P D F o f the b o o tstrap p ed average. The saddlepoint
approxim ation to 9 seems quite accurate, while the P D F o f the average is
w ider and shifted to the left.
The assum ption o f sym m etry underlies the use o f the estim ator 9, because
the p aram eter 9 m ust be the same for all possible choices o f c. The discus
sion in Section 3.3 an d Exam ple 3.26 implies th at our resam pling scheme
should take this into account by enlarging the resam pling set to y i ,. . . , y ,
9 (yi 9 ) , . . . , 9 {y 9), for some very robust estim ate o f 9 ; we take 9 to be
the m edian. T he cum ulant-generating function required w hen taking sam ples

9 Improved Calculation

472

CD

o
o

Q_

C\J

o
d

-20
theta

20

40

60

theta

o f size n from this set is


n log

(2n) 1 ^

exP { a (f ;?;)} + ex p { < ^ a(t;2 - y , ) }

j=i

The right panel of Figure 9.10 compares saddlepoint and Monte Carlo approximations to the PDF of θ̂* under this symmetrized resampling scheme; the PDF of the average is shown also. All are symmetric about θ̃.
One difficulty here is that we might prefer to approximate the PDF of θ̂* when s is replaced by its bootstrap version s*, and this cannot be done in the current framework. More fundamentally, the distribution of interest will often be for a quantity such as a studentized form of θ̂* derived from θ̂*, s*, and perhaps other statistics, necessitating the more sophisticated approximations outlined in Section 9.5.3.

[Figure 9.10 Comparison of the saddlepoint approximation to the PDF of a robust M-estimate applied to the maize data (solid), with results from a bootstrap simulation with R = 50000. The heavy curve is the saddlepoint approximation to the PDF of the average. The left panel shows results from resampling the data, and the right shows results from a symmetrized bootstrap.]

9.5.2 Conditional approximation

There are numerous ways to extend the discussion above. One of the most straightforward is to situations where U is a q × 1 vector which is a linear function of independent variables W_1, …, W_n with cumulant-generating functions K_j(ξ), j = 1, …, n. That is, U = AᵀW, where A is an n × q matrix with rows a_jᵀ. The joint cumulant-generating function of U is

$$K(\xi) = \log E\{\exp(\xi^T A^T W)\} = \sum_{j=1}^{n} K_j(a_j^T\xi),$$

and the saddlepoint approximation to the density of U at u is

$$g_s(u) = (2\pi)^{-q/2}\,|K''(\hat\xi)|^{-1/2}\exp\{K(\hat\xi) - \hat\xi^T u\}, \qquad (9.36)$$

where ξ̂ satisfies the q × 1 system of equations ∂K(ξ)/∂ξ = u, and K″(ξ) = ∂²K(ξ)/∂ξ∂ξᵀ is the q × q matrix of second derivatives of K; |·| denotes determinant.
Now suppose that U is partitioned into U_1 and U_2, that is, Uᵀ = (U_1ᵀ, U_2ᵀ), where U_1 and U_2 have dimension q_1 × 1 and (q − q_1) × 1 respectively. Note that U_2 = A_2ᵀW, where A_2 consists of the last q − q_1 columns of A. The cumulant-generating function of U_2 is simply K(0, ξ_2), where ξᵀ = (ξ_1ᵀ, ξ_2ᵀ) has been partitioned conformably with U, so the saddlepoint approximation to the marginal density of U_2 is

$$g_s(u_2) = (2\pi)^{-(q-q_1)/2}\,|K_{22}''(0,\hat\xi_{20})|^{-1/2}\exp\{K(0,\hat\xi_{20}) - \hat\xi_{20}^T u_2\}, \qquad (9.37)$$

where ξ̂_{20} satisfies the (q − q_1) × 1 system of equations ∂K(0, ξ_2)/∂ξ_2 = u_2, and K″_{22} is the (q − q_1) × (q − q_1) corner of K″ corresponding to U_2.
Division of (9.36) by (9.37) gives a double saddlepoint approximation to the conditional density of U_1 at u_1 given that U_2 = u_2. When U_1 is scalar, i.e. q_1 = 1, the approximate conditional CDF is again (9.28), but with

$$w = \mathrm{sign}(\hat\xi_1)\left[2\{K(0,\hat\xi_{20}) - \hat\xi_{20}^T u_2\} - 2\{K(\hat\xi) - \hat\xi^T u\}\right]^{1/2}, \qquad v = \hat\xi_1\left\{\frac{|K''(\hat\xi)|}{|K_{22}''(0,\hat\xi_{20})|}\right\}^{1/2}.$$

Example 9.16 (City population data)  A simple bootstrap application is to obtain the distribution of the ratio T* in bootstrap sampling from the city population data pairs with n = 10. In order to avoid conflicts of notation we set y_j = (z_j, x_j), so that T* is the solution to the equation Σ (x_j − tz_j)f_j* = 0. For this we take the W_j to be independent Poisson random variables with equal means μ, so K_j(ξ) = μ(e^ξ − 1). We set

$$U^* = \begin{pmatrix} \sum_j (x_j - tz_j)W_j \\ \sum_j W_j \end{pmatrix}, \qquad u = \begin{pmatrix} 0 \\ n \end{pmatrix}, \qquad a_j = \begin{pmatrix} x_j - tz_j \\ 1 \end{pmatrix}.$$

Now T* ≤ t if and only if Σ_j (x_j − tz_j)W_j ≤ 0, where W_j is the number of times (z_j, x_j) is included in the sample. But the relation between the Poisson and multinomial distributions (Problem 9.19) implies that the joint conditional distribution of (W_1, …, W_n) given that Σ W_j = n is the same as that of the multinomial frequency vector (f_1*, …, f_n*) in ordinary bootstrap sampling from a sample of size n. Thus the probability that Σ_j (x_j − tz_j)W_j ≤ 0 given that Σ W_j = n is just the probability that T* ≤ t in ordinary bootstrap sampling from the data pairs.

Table 9.8 Comparison of saddlepoint and simulation quantile approximations for the ratio when sampling from the city population data. The statistics are the ratio Σx_j*/Σz_j* with n = 10, the ratio conditional on Σz_j* = 640 with n = 10, and the ratio in samples of size 10 taken without replacement from the full data. The simulation results are based on 100000 bootstrap samples, with logistic regression used to estimate the simulated conditional probabilities, from which the quantiles were obtained by a spline fit.

          Unconditional        Conditional          Without replacement
  α       Spoint    Simn       Spoint    Simn       Spoint    Simn
  0.001   1.150     1.149      1.216     1.215      1.070     1.070
  0.005   1.191     1.192      1.236     1.237      1.092     1.092
  0.01    1.214     1.215      1.248     1.247      1.104     1.103
  0.025   1.251     1.252      1.273     1.269      1.122     1.122
  0.05    1.286     1.286      1.301     1.291      1.139     1.138
  0.1     1.329     1.329      1.340     1.337      1.158     1.158
  0.9     1.834     1.833      1.679     1.679      1.348     1.348
  0.95    1.967     1.967      1.732     1.736      1.392     1.392
  0.975   2.107     2.104      1.777     1.777      1.436     1.435
  0.99    2.303     2.296      1.829     1.833      1.493     1.495
  0.995   2.461     2.445      1.865     1.863      1.537     1.540
  0.999   2.857     2.802      1.938     1.936      1.636     1.635

In this situation it is of course more direct to use the estimating function method with a(t; y_j) = x_j − tz_j and the simpler approximations (9.28) and (9.33). Then the Jacobian term in (9.33) is $\left|\sum_j z_j\exp\{\hat\xi(x_j - tz_j)\}\big/\sum_j\exp\{\hat\xi(x_j - tz_j)\}\right|$.
Another application is to conditional distributions for T*. Suppose that the population pairs are related by x_j = z_jθ + z_j^{1/2}ε_j, where the ε_j are a random sample from a distribution with mean zero. Then conditional on the z_j, the ratio Σx_j/Σz_j has variance proportional to (Σz_j)^{-1}. In some circumstances we might want to obtain an approximation to the conditional distribution of T* given that Σz_j* = Σz_j. In this case we can use the approach outlined in the previous paragraph, but with two conditioning variables: we take the W_j to be independent Poisson variables with equal means, and set

$$U^* = \begin{pmatrix} \sum_j (x_j - tz_j)W_j \\ \sum_j z_jW_j \\ \sum_j W_j \end{pmatrix}, \qquad u = \begin{pmatrix} 0 \\ \sum_j z_j \\ n \end{pmatrix}, \qquad a_j = \begin{pmatrix} x_j - tz_j \\ z_j \\ 1 \end{pmatrix}.$$

A third application is to approximating the distribution of the ratio when a sample of size m = 10 is taken without replacement from the n = 49 data pairs. Again T* ≤ t is equivalent to the event Σ_j (x_j − tz_j)W_j ≤ 0, but now W_j indicates that (z_j, x_j) is included in the m cities chosen; we want to impose the condition Σ W_j = m. We take the W_j to be binary variables with equal success probabilities 0 < π < 1, giving K_j(ξ) = log(1 − π + πe^ξ), with π taking any value. We then apply the double saddlepoint approximation with

$$U^* = \begin{pmatrix} \sum_j (x_j - tz_j)W_j \\ \sum_j W_j \end{pmatrix}, \qquad u = \begin{pmatrix} 0 \\ m \end{pmatrix}, \qquad a_j = \begin{pmatrix} x_j - tz_j \\ 1 \end{pmatrix}.$$

Table 9.8 compares the quantiles of these saddlepoint distributions with Monte Carlo approximations based on 100000 samples. The general agreement is excellent in each case.


A further application is to permutation distributions.


Example 9.17 (Correlation coefficient)  In Example 4.9 we applied a permutation test to the sample correlation t between variables x and z based on pairs (x_1, z_1), …, (x_n, z_n). For this statistic and test, the event T* ≥ t is equivalent to Σ_j x_j z_{(j)} ≥ Σ_j x_j z_j, where (·) is a permutation of the integers 1, …, n.
An alternative formulation is as follows. Let W_{ij}, i, j = 1, …, n, denote independent binary variables with equal success probabilities 0 < π < 1, for any π. Then consider the distribution of U_1 = Σ_{i,j} x_i z_j W_{ij} conditional on U_2 = (Σ_i W_{i1}, …, Σ_i W_{in}, Σ_j W_{1j}, …, Σ_j W_{n-1,j})ᵀ = u_2, where u_2 is a vector of ones of length 2n − 1. Notice that the condition Σ_j W_{nj} = 1 is entailed by the other conditions and so is redundant. Each value of x_i and each value of z_j appears precisely once in the sum U_1, with equal probabilities, and hence the conditional distribution of U_1 given U_2 = u_2 is equivalent to the permutation distribution of T*. Here m = n², q = 2n, and q_1 = 1.
Our limited numerical experience suggests that in this example the saddlepoint approximation can be inaccurate if the large number of constraints results in a conditional distribution on only a few values.

9.5.3 Marginal approximation

The approximate distribution and density functions described so far are useful in contexts such as testing hypotheses, but they are harder to apply to such problems as setting studentized bootstrap confidence intervals. Although (9.26) and (9.28) can be extended to some types of complicated statistics, we merely outline the results.

Approximate cumulant-generating function
The simplest approach is direct approximation to the cumulant-generating function of the bootstrap statistic of interest, T*. The key idea is to replace the cumulant-generating function K(ξ) by the first four terms of its expansion in powers of ξ,

$$K_c(\xi) = \xi\kappa_1 + \tfrac12\xi^2\kappa_2 + \tfrac16\xi^3\kappa_3 + \tfrac1{24}\xi^4\kappa_4, \qquad (9.38)$$

where κ_i is the ith cumulant of T*. The exact cumulants are usually unavailable, so we replace them with the cumulants of the cubic approximation to T* given by

$$T^* \doteq t + n^{-1}\sum_{j=1}^{n} f_j^* l_j + \tfrac12 n^{-2}\sum_{i,j=1}^{n} f_i^* f_j^* q_{ij} + \tfrac16 n^{-3}\sum_{i,j,k=1}^{n} f_i^* f_j^* f_k^* c_{ijk},$$

where t is the original value of the statistic, and l_j, q_{ij} and c_{ijk} are the empirical linear, quadratic and cubic influence values; see also (9.6). To the order required, the resulting approximate cumulants κ_{c,1}, …, κ_{c,4} are combinations of averages of the influence values, such as n^{-3} Σ_{i,j,k} l_i l_j q_{jk}, each of which is of order one.
We get an approximate cumulant-generating function K_c(ξ) by substituting the κ_{c,i} into (9.38), and then use the standard approximations (9.26) and (9.28) with K(ξ) replaced by K_c(ξ). Detailed consideration establishes that this preserves the usual asymptotic accuracy of the saddlepoint approximations. From a practical point of view it may be better to sacrifice some theoretical accuracy but reduce the computational burden by dropping from κ_{c,2} and κ_{c,4} the terms involving the c_{ijk}; with this modification both PDF and CDF approximations have error of order n^{−1}.
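To make the substitution concrete, here is a minimal sketch of how a quartic cumulant-generating function feeds (9.26) and (9.28); it assumes that the four approximate cumulants k = c(k1,k2,k3,k4) have already been computed and that K_c is convex, and it breaks down near t = κ_{c,1}, where w = 0.

Kc  <- function(xi, k) xi*k[1] + xi^2*k[2]/2 + xi^3*k[3]/6 + xi^4*k[4]/24
Kc1 <- function(xi, k) k[1] + xi*k[2] + xi^2*k[3]/2 + xi^3*k[4]/6
Kc2 <- function(xi, k) k[2] + xi*k[3] + xi^2*k[4]/2
spa.quartic <- function(t, k, lim=20) {
  xi <- uniroot(function(x) Kc1(x, k) - t, c(-lim, lim))$root  # saddlepoint equation
  pdf <- exp(Kc(xi, k) - xi*t)/sqrt(2*pi*Kc2(xi, k))           # as in (9.26)
  w <- sign(xi)*sqrt(2*(xi*t - Kc(xi, k)))                     # as in (9.28); fails at w = 0
  u <- xi*sqrt(Kc2(xi, k))
  c(pdf=pdf, cdf=pnorm(w) + dnorm(w)*(1/w - 1/u))
}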
In principle this approach is fairly simple, but in applications there is no guarantee that K_c(ξ) is close to the true cumulant-generating function of T* except for small ξ. It may be necessary to modify K_c(ξ) to avoid multiple roots to the saddlepoint equation, or if κ_{c,4} < 0, for then K_c(ξ) cannot be convex. In these circumstances we can modify K_c(ξ) to ensure that the cubic and quartic terms do not cause trouble, for example replacing it by

$$K_{c,b}(\xi) = \xi\kappa_{c,1} + \tfrac12\xi^2\kappa_{c,2} + \bigl(\tfrac16\xi^3\kappa_{c,3} + \tfrac1{24}\xi^4\kappa_{c,4}\bigr)\exp\bigl(-\tfrac12 n b^2 \xi^2 \kappa_{c,2}\bigr),$$

where b is chosen to ensure that the second derivative of K_{c,b}(ξ) with respect to ξ is positive; K_{c,b}(ξ) is then convex. A suitable value is

$$b = \max\bigl[\tfrac12,\ \inf\{a : K''_{c,a}(\xi) > 0,\ -\infty < \xi < \infty\}\bigr],$$

and this can be found by numerical search.
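The numerical search is simple to sketch; in the fragment below the convexity check uses second differences of K_{c,b}, and both the grid of b values and the grid of ξ values are our own arbitrary choices.

Kcb <- function(xi, k, b, n)
  xi*k[1] + xi^2*k[2]/2 +
    (xi^3*k[3]/6 + xi^4*k[4]/24)*exp(-0.5*n*b^2*xi^2*k[2])
convex <- function(k, b, n, xis=seq(-5, 5, by=0.01), h=1e-4)
  all(Kcb(xis+h, k, b, n) - 2*Kcb(xis, k, b, n) + Kcb(xis-h, k, b, n) > 0)
b.hat <- function(k, n, grid=seq(0.5, 5, by=0.05)) {
  for (b in grid) if (convex(k, b, n)) return(b)  # smallest b on the grid giving convexity
  max(grid)
}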


Empirical Edgeworth expansion

The approximate cumulants can also be used in series expansions for the density and distribution of T*. The Edgeworth expansion for the CDF of Z*_c = (T* − κ_{c,1})/κ_{c,2}^{1/2} is

$$\Pr{}^*(Z^*_c \le z) = \Phi(z) + \{p_1(z) + p_2(z)\}\phi(z) + O_p(n^{-3/2}), \qquad (9.39)$$

where

$$p_1(z) = -\tfrac16\kappa_{c,3}\kappa_{c,2}^{-3/2}(z^2 - 1),$$

$$p_2(z) = -z\bigl\{\tfrac1{24}\kappa_{c,4}\kappa_{c,2}^{-2}(z^2 - 3) + \tfrac1{72}\kappa_{c,3}^2\kappa_{c,2}^{-3}(z^4 - 10z^2 + 15)\bigr\}.$$

Differentiation of (9.39) gives an approximate density for Z*_c, and hence for T*. However, experience suggests that the saddlepoint approximations (9.28) and (9.29) are usually preferable if they can be obtained, primarily because (9.39) results in less accurate tail probability estimates: its error is absolute rather than relative. Further drawbacks are that (9.39) need not increase with z, and that the density approximation may become negative.
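For comparison with the saddlepoint approach, (9.39) is easily coded; the sketch below again assumes that the approximate cumulants k = c(k1,k2,k3,k4) are available.

edgeworth.cdf <- function(t, k) {
  z  <- (t - k[1])/sqrt(k[2])
  p1 <- -k[3]*k[2]^(-3/2)*(z^2 - 1)/6
  p2 <- -z*(k[4]*k[2]^(-2)*(z^2 - 3)/24 +
        k[3]^2*k[2]^(-3)*(z^4 - 10*z^2 + 15)/72)
  pnorm(z) + (p1 + p2)*dnorm(z)   # the Edgeworth CDF approximation (9.39)
}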
Derivation of the influence values that contribute to κ_{c,1}, …, κ_{c,4} can be tedious.
Example 9.18 (Studentized statistic) A statistic T = t(F̂) studentized using the nonparametric delta method variance estimate obtained from its linear influence values L_t(y; F) may be written as Z = n^{1/2}W, where

$$W = w(\hat F, F) = \frac{t(\hat F) - t(F)}{\bigl\{\int L_t(y; \hat F)^2\, d\hat F(y)\bigr\}^{1/2}},$$

with F̂ the EDF of the data. The corresponding bootstrap statistic is w(F̂*, F̂), where F̂* is the EDF corresponding to a bootstrap sample. For economy of notation below we write

$$v = v(F) = \int L_t(y; F)^2\, dF(y), \quad L_w(y_1) = L_w(y_1; F), \quad Q_w(y_1, y_2) = Q_w(y_1, y_2; F),$$

and so forth.

To obtain the linear, quadratic and cubic influence values for w(G, F) at G = F, we replace G(y) with

$$(1 - \varepsilon_1 - \varepsilon_2 - \varepsilon_3)F(y) + \varepsilon_1 H(y - y_1) + \varepsilon_2 H(y - y_2) + \varepsilon_3 H(y - y_3),$$

where H(x) is the Heaviside function, jumping from 0 to 1 at x = 0,


differentiate with respect to ε_1, ε_2 and ε_3, and set ε_1 = ε_2 = ε_3 = 0. The empirical influence values for W at F̂ are then obtained by replacing F with F̂. In terms of the influence values for t and v the result of this calculation is

$$L_w(y_1) = v^{-1/2}L_t(y_1),$$

$$Q_w(y_1, y_2) = v^{-1/2}Q_t(y_1, y_2) - \tfrac12 v^{-3/2}L_t(y_1)L_v(y_2)[2],$$

$$C_w(y_1, y_2, y_3) = v^{-1/2}C_t(y_1, y_2, y_3) - \tfrac12 v^{-3/2}\{Q_t(y_1, y_2)L_v(y_3) + Q_v(y_1, y_2)L_t(y_3)\}[3] + \tfrac34 v^{-5/2}L_t(y_1)L_v(y_2)L_v(y_3)[3],$$

where [k] after a term indicates that it should be summed over the permutations of its y's that give the k distinct quantities in the sum. Thus for example

$$L_t(y_1)L_v(y_2)L_v(y_3)[3] = L_t(y_1)L_v(y_2)L_v(y_3) + L_t(y_2)L_v(y_1)L_v(y_3) + L_t(y_3)L_v(y_1)L_v(y_2).$$

The influence values for Z involve linear, quadratic, and cubic influence values for t, and linear and quadratic influence values for v, the latter given by

$$L_v(y_1) = L_t(y_1)^2 - \int L_t(x)^2\, dF(x) + 2\int L_t(x)Q_t(x, y_1)\, dF(x),$$

$$\tfrac12 Q_v(y_1, y_2) = L_t(y_1)Q_t(y_1, y_2)[2] - L_t(y_1)L_t(y_2) - \int\{Q_t(x, y_1) + Q_t(x, y_2)\}L_t(x)\, dF(x) + \int\{Q_t(x, y_1)Q_t(x, y_2) + L_t(x)C_t(x, y_1, y_2)\}\, dF(x).$$

The simplest example is the average t(F̂) = ∫x dF̂(x) = ȳ of a sample of values y_1, …, y_n from F. Then L_t(y_i) = y_i − ȳ and Q_t(y_i, y_j) = C_t(y_i, y_j, y_k) = 0, the expressions above simplify greatly, and the required influence quantities are

$$l_i = L_w(y_i; \hat F) = v^{-1/2}(y_i - \bar y),$$

$$q_{ij} = Q_w(y_i, y_j; \hat F) = -\tfrac12 v^{-3/2}(y_i - \bar y)\{(y_j - \bar y)^2 - v\}[2],$$

$$c_{ijk} = C_w(y_i, y_j, y_k; \hat F) = 3v^{-3/2}(y_i - \bar y)(y_j - \bar y)(y_k - \bar y) + \tfrac34 v^{-5/2}(y_i - \bar y)\{(y_j - \bar y)^2 - v\}\{(y_k - \bar y)^2 - v\}[3],$$

where v = n^{−1}Σ(y_i − ȳ)². The influence quantities for Z are obtained from those for W by multiplication by n^{1/2}. A numerical illustration of the use of the corresponding approximate cumulant-generating function K_c(ξ) is given in Example 9.19.
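These formulas are straightforward to code. The sketch below computes the arrays l_i, q_{ij} and c_{ijk} for the studentized average, with the [2] and [3] sums written out explicitly; y is the data vector.

influence.w <- function(y) {
  yc <- y - mean(y); v <- mean(yc^2)
  A <- yc; B <- yc^2 - v
  l <- yc/sqrt(v)                                     # l_i
  q <- -0.5*v^(-3/2)*(outer(A, B) + outer(B, A))      # q_ij: the [2] sum
  c3 <- 3*v^(-3/2)*outer(outer(A, A), A) +            # c_ijk: the [3] sum
    0.75*v^(-5/2)*(outer(outer(A, B), B) +
                   outer(outer(B, A), B) +
                   outer(outer(B, B), A))
  list(l=l, q=q, c=c3)
}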

Integration approach

Another approach involves extending the estimating function approximation to the multivariate case, and then approximating the marginal distribution of the statistic of interest. To see how, suppose that the quantity T of interest is a scalar, and that T and S = (S_1, …, S_{q−1})ᵀ are determined by a q × 1 estimating function

$$U(t, s) = \sum_{j=1}^n a(t, s_1, \ldots, s_{q-1}; Y_j).$$

Then the bootstrap quantities T* and S* are the solutions of the equations

$$U^*(t, s) = \sum_{j=1}^n a_j(t, s)f_j^* = 0, \qquad (9.40)$$

where a_j(t, s) = a(t, s; y_j) and the frequencies (f_1*, …, f_n*) have a multinomial distribution with denominator n and mean vector n(p_1, …, p_n); typically p_j = n^{−1}. We assume that there is a unique solution (t*, s*) to (9.40) for each possible set of f_j*, and seek saddlepoint approximations to the marginal PDF and CDF of T*.

For fixed t and s, the cumulant-generating function of U* is

$$K(\xi; t, s) = n\log\Bigl[\sum_{j=1}^n p_j \exp\{\xi^T a_j(t, s)\}\Bigr], \qquad (9.41)$$

and the joint density of the U* at u is given by (9.36). The Jacobian needed to obtain the joint density of T* and S* from that of U* is hard to obtain exactly, but can be approximated by

$$J(t, s; \xi) = n^q\Bigl|\sum_{j=1}^n \tilde p_j(t, s)\Bigl(\frac{\partial a_j(t, s)}{\partial t}\;\; \frac{\partial a_j(t, s)}{\partial s^T}\Bigr)\Bigr|,$$

where

$$\tilde p_j(t, s) = \frac{p_j\exp\{\xi^T a_j(t, s)\}}{\sum_{k=1}^n p_k\exp\{\xi^T a_k(t, s)\}};$$

as usual, for r × 1 and c × 1 vectors a and s with components a_i and s_j, we write ∂a/∂sᵀ for the r × c array whose (i, j) element is ∂a_i/∂s_j. The Jacobian J(t, s; ξ) reduces to the Jacobian term in (9.33) when s is not present. Thus the saddlepoint approximation to the density of (T*, S*) at (t, s) is

$$J(t, s; \hat\xi)\,(2\pi)^{-q/2}\,|K''(\hat\xi; t, s)|^{-1/2}\exp K(\hat\xi; t, s), \qquad (9.42)$$

where ξ̂ = ξ̂(t, s) is the solution to the q × 1 system of equations ∂K/∂ξ = 0. Let us write A(t, s) = K{ξ̂(t, s); t, s}.
We require the marginal density and distribution functions of T* at t. In principle they can be obtained by integration of (9.42) numerically with respect to s, but this is time-consuming when s is a vector. An alternative approach is analytical approximation using Laplace's method, which replaces the most important part of the integrand, the rightmost term in (9.42), by a normal integral, suitably centred and scaled. Provided that the matrix −∂²A(t, s)/∂s∂sᵀ is positive definite, the resulting approximate marginal density of T* at t is

$$J(t, \tilde s; \tilde\xi)\,(2\pi)^{-1/2}\,|K''(\tilde\xi; t, \tilde s)|^{-1/2}\,\Bigl|-\frac{\partial^2 A(t, \tilde s)}{\partial s\,\partial s^T}\Bigr|^{-1/2}\exp A(t, \tilde s), \qquad (9.43)$$

where ξ̃ = ξ̃(t) and s̃ = s̃(t) are functions of t that solve simultaneously the q × 1 and (q − 1) × 1 systems of equations

$$\frac{\partial K(\xi; t, s)}{\partial\xi} = n\sum_{j=1}^n \tilde p_j(t, s)\,a_j(t, s) = 0, \qquad \frac{\partial K(\xi; t, s)}{\partial s} = n\sum_{j=1}^n \tilde p_j(t, s)\Bigl\{\frac{\partial a_j(t, s)}{\partial s^T}\Bigr\}^T\xi = 0. \qquad (9.44)$$
These can be solved using packaged routines, with starting values given by noting that when t equals its sample value t_0, say, s equals its sample value and ξ = 0.

The second derivatives of A needed to calculate (9.43) may be expressed as

$$\frac{\partial^2 A(t, s)}{\partial s\,\partial s^T} = \frac{\partial^2 K(\xi; t, s)}{\partial s\,\partial s^T} - \frac{\partial^2 K(\xi; t, s)}{\partial s\,\partial\xi^T}\Bigl\{\frac{\partial^2 K(\xi; t, s)}{\partial\xi\,\partial\xi^T}\Bigr\}^{-1}\frac{\partial^2 K(\xi; t, s)}{\partial\xi\,\partial s^T}, \qquad (9.45)$$

where at the solutions to (9.44)

$$\frac{\partial^2 K(\xi; t, s)}{\partial\xi\,\partial\xi^T} = n\sum_{j=1}^n \tilde p_j(t, s)\,a_j(t, s)\,a_j(t, s)^T, \qquad (9.46)$$

with analogous expressions (9.47) for ∂²K(ξ; t, s)/∂ξ∂sᵀ and ∂²K(ξ; t, s)/∂s∂sᵀ, obtained by differentiating (9.41), in which s_c and s_d denote the cth and dth components of s.


The marginal CDF approximation for T* at t is (9.28), with

$$w = \mathrm{sign}(t - t_0)\{-2A(t, \tilde s)\}^{1/2}, \qquad (9.48)$$

$$u = -\frac{\partial A(t, s)}{\partial t}\,J(t, s; \xi)^{-1}\,|K''(\xi; t, s)|^{1/2}\,\Bigl|-\frac{\partial^2 A(t, s)}{\partial s\,\partial s^T}\Bigr|^{1/2}, \qquad (9.49)$$

evaluated at s = s̃, ξ = ξ̃; the only additional quantity needed here is

$$\frac{\partial A(t, s)}{\partial t} = n\sum_{j=1}^n \tilde p_j(t, s)\,\xi^T\frac{\partial a_j(t, s)}{\partial t}. \qquad (9.51)$$

Approximate quantiles of T* can be obtained in the way described just before Example 9.13.
The expressions above look forbidding, but their implementation is relatively straightforward. The key point to note is that they depend only on the quantities a_j(t, s), their first derivatives with respect to t, and their first two derivatives with respect to s. Once these have been programmed, they can be input to a generic routine to perform the saddlepoint approximations. Difficulties that sometimes arise with numerical overflow due to large exponents can usually be circumvented by rescaling data to zero mean and unit variance, which has no effect on location- and scale-invariant quantities such as studentized statistics. Remember, however, our initial comments in Section 9.1: the investment of time and effort needed to program these approximations is unlikely to be worthwhile unless they are to be used repeatedly.
Example 9.19 (Maize data) To illustrate these ideas we consider the bootstrap variance and studentized average for the maize data. Both these statistics are location-invariant, so without loss of generality we replace y_j with y_j − ȳ and henceforth assume that ȳ = 0. With this simplification the statistics of interest are

$$V^* = n^{-1}\sum_{j=1}^n (Y_j^* - \bar Y^*)^2, \qquad Z^* = (n-1)^{1/2}\,\bar Y^*/V^{*1/2},$$

where Y̅* = n^{−1}ΣY_j*. A little algebra shows that

$$n^{-1}\sum Y_j^{*2} = V^*\{1 + Z^{*2}/(n-1)\}, \qquad n^{-1}\sum Y_j^* = Z^* V^{*1/2}(n-1)^{-1/2},$$

so to apply the integration approach we take p_j = n^{−1} and

$$a_j(z, v) = \begin{pmatrix} y_j - z v^{1/2}(n-1)^{-1/2}\\ y_j^2 - v\{1 + z^2/(n-1)\} \end{pmatrix},$$

from which the 2 × 1 matrices of derivatives

$$\frac{\partial a_j(z, v)}{\partial z}, \quad \frac{\partial a_j(z, v)}{\partial v}, \quad \frac{\partial^2 a_j(z, v)}{\partial z^2}, \quad \frac{\partial^2 a_j(z, v)}{\partial v^2}$$

needed to calculate (9.43)–(9.51) are readily obtained.


To find the marginal distribution of Z*, we apply (9.43)–(9.51) with t = z and s = v. For a given value of z, the three equations in (9.44) are easily solved numerically. The upper panels of Figure 9.11 compare the saddlepoint distribution and density approximations for Z* with a large simulation. The analytical quantiles are very close to the simulated ones, and although the saddlepoint density seems to have integral greater than one it captures well the skewness of the distribution.
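The following rough sketch shows the nested numerical work involved for Z*: for fixed z, the inner minimization over ξ gives A(z, v), the outer maximization over v gives ṽ, and w then feeds the CDF approximation (9.28). The search ranges are our own ad hoc choices, and the refinement quantity u of (9.49) is omitted.

Afun <- function(z, v, y) {
  n <- length(y)
  a <- cbind(y - z*sqrt(v/(n - 1)), y^2 - v*(1 + z^2/(n - 1)))  # a_j(z, v)
  K <- function(xi) n*log(mean(exp(a %*% xi)))
  optim(c(0, 0), K)$value          # inner minimization over xi gives A(z, v)
}
w.of.z <- function(z, y, v0=var(y)) {
  A <- optimize(function(v) Afun(z, v, y), lower=v0/10, upper=v0*10,
    maximum=TRUE)$objective        # outer maximization over v gives A(z, v.tilde)
  sign(z)*sqrt(-2*A)               # w of (9.48); z0 = 0 for centred data
}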
For V* we take t = v and s = z, but the lower left panel of Figure 9.11 shows that the resulting PDF approximation fails to capture the bimodality of the density. This arises because V* is deflated for resamples in which neither of the two smallest observations, which are somewhat separated from the rest, appears.
The contours of A(z, v) in the lower right panel reveal a potential problem with these methods. For z = 3.5, the Laplace approximation used to obtain (9.43) amounts to replacing the integral of exp{A(z, v)} along the dashed vertical line by a normal approximation centred at A and with precision given by the second derivative of A(z, v) at A along the line. But A(3.5, v) is bimodal for v > 0, and the Laplace approximation does not account for the second peak at B. As it turns out, this doesn't matter because the peak at B is so much
lower than at A that it adds little to the integral, but clearly (9.43) would be catastrophically bad if the peaks at A and B were comparable. This behaviour occurs because there is no guarantee that A(z, v) is a concave function of v and z. If the difficulty is thought to have arisen, numerical integration of (9.42) can be used to find the marginal density of Z*, but the problem is not easily diagnosed except by checking that (9.45) is negative definite at any solution to (9.44) and by checking that different initial values of ξ and s lead to the same solution for a given value of t. This may increase the computational burden to an extent that direct simulation is more efficient. Fortunately this difficulty is much rarer in larger samples.
[Figure 9.11: Saddlepoint approximations for the bootstrap variance V* and studentized average Z* for the maize data. Top left: approximations to quantiles of Z* by integration saddlepoint (solid) and simulation using 50 000 bootstrap samples (every 20th order statistic is shown). Top right: density approximations for Z* by integration saddlepoint (heavy solid), approximate cumulant-generating function (solid), and simulation using 50 000 bootstrap samples. Bottom left: corresponding approximations for V*. Bottom right: contours of A(z, v), with local maxima along the dashed line z = 3.5 at A and at B.]

The quantities needed for the approximate cumulant-generating function approach to obtaining the distribution of n^{1/2}(n − 1)^{−1/2}Z* were given in Example 9.18. The approximate cumulants for Z* are κ_{c,1} = 0.13, κ_{c,2} = 1.08, κ_{c,3} = 0.51 and κ_{c,4} = 0.50, with κ_{c,2} = 0.89 and κ_{c,4} = 0.28 when the terms involving the c_{ijk} are dropped. With or without these terms, the cumulants are some way from the values 0.17, 1.34, 1.05, and 1.55 estimated from 50 000 simulations. The upper right panel of Figure 9.11 shows the PDF approximation based on the modified cumulant-generating function; in this case K_{c,b}(ξ) is convex. The modified PDF matches the centre of the distribution more closely than the integration PDF, but is poor in the upper tail.
For V*, we have

$$l_i = (y_i - \bar y)^2 - t, \qquad q_{ij} = -2(y_i - \bar y)(y_j - \bar y), \qquad c_{ijk} = 0,$$

so the approximate cumulants give κ_{c,1} = 1241, with κ_{c,3}/κ_{c,1}³ = 0.013 and κ_{c,4}/κ_{c,1}⁴ = 0.0015; the corresponding simulated values are 1243, 0.18, 0.018, and 0.0010. Neither saddlepoint approximation captures the bimodality of the simulations, though the integration method is the better of the two. In this case b = ½ for the approximate cumulant-generating function method, and the resulting density is clearly too close to normal.

Example 9.20 (Robust M-estimate) For a second example of marginal approximation, we suppose that θ̂ and σ̂ are M-estimates found from a random sample y_1, …, y_n by simultaneous solution of the equations

$$\sum_{j=1}^n \psi\Bigl(\frac{y_j - \theta}{\sigma}\Bigr) = 0, \qquad \sum_{j=1}^n \Bigl\{\chi\Bigl(\frac{y_j - \theta}{\sigma}\Bigr) - \gamma\Bigr\} = 0.$$

The choice ψ(ε) = ε, χ(ε) = ε², γ = 1 gives the non-robust estimates θ̂ = ȳ and σ̂² = n^{−1}Σ(y_j − ȳ)². Below we use the more robust Huber M-estimate specified by (9.35), with χ(ε) = ψ²(ε) and with γ taken to equal E{χ(ε)}, where ε is standard normal. For purposes of illustration we take c = 1.345.
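A sketch of the fitting itself may help to fix ideas. The iteration below alternates a Newton step for θ with a rescaling of σ; it is our own simple scheme rather than an algorithm from the book, and convergence is not guaranteed in general.

psi  <- function(e, c=1.345) pmin(pmax(e, -c), c)     # Huber psi
dpsi <- function(e, c=1.345) as.numeric(abs(e) < c)   # its derivative
gam  <- integrate(function(e) psi(e)^2*dnorm(e), -Inf, Inf)$value
huber.fit <- function(y, tol=1e-8) {
  th <- median(y); sig <- mad(y)
  repeat {
    e <- (y - th)/sig
    th1  <- th + sig*mean(psi(e))/mean(dpsi(e))       # Newton step for theta
    sig1 <- sig*sqrt(mean(psi(e)^2)/gam)              # rescale so mean psi^2 = gamma
    if (abs(th1 - th) + abs(sig1 - sig) < tol) break
    th <- th1; sig <- sig1
  }
  c(theta=th, sigma=sig)
}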
Suppose we want to use θ̂ to get a confidence interval for θ. Since the nonparametric delta method estimate of var(θ̂) is σ̂²Σψ²(e_j)/{Σψ′(e_j)}² (Problem 6.7), where e_j = (y_j − θ̂)/σ̂, the studentized version of θ̂ is

$$Z = \frac{\hat\theta - \theta}{\hat\sigma(\gamma/n)^{1/2}\big/\bigl\{n^{-1}\sum_{j=1}^n \psi'(e_j)\bigr\}},$$

which is proportional to the usual Student-t statistic when ψ(ε) = ε. In order to set studentized bootstrap confidence limits for θ, we need approximations to the bootstrap quantiles of Z*. These may be obtained by applying the marginal saddlepoint approximation outlined above with T = Z, S = (S_1, S_2)ᵀ, p_j = n^{−1},

[Table 9.9: Comparison of results from 50 000 simulations with integration saddlepoint approximation to bootstrap α quantiles of a robust studentized statistic for the maize data.

α (%)         0.1     1    2.5     5     10    90    95   97.5    99   99.9
Simulation  -3.81 -2.68  -2.21 -1.86  -1.49  1.25  1.62   1.94  2.35   3.49
Saddlepoint -3.68 -2.60  -2.11 -1.72  -1.31  1.24  1.62   1.97  2.42   3.57]
and

$$a_j(z, s) = \begin{pmatrix} \psi(\hat\sigma e_j/s_1 - z d/s_2)\\ \psi^2(\hat\sigma e_j/s_1 - z d/s_2) - \gamma\\ \psi'(\hat\sigma e_j/s_1 - z d/s_2) - s_2 \end{pmatrix}, \qquad (9.52)$$

where d = (γ/n)^{1/2}. For the Huber estimate, s_2* = n^{−1}Σ_j I(|σ̂e_j*/s_1* − z*d/s_2*| < c) takes the discrete range of values j/n; here I(A) is the indicator of the event A. In the extreme case with c = ∞, s_2* always equals one, but even if c is finite it will be unwise to treat s_2* as continuous unless n is very large. We therefore fix s_2* at its sample value, and modify a_j by dropping its third component, so that q = 2 and S = S_1. With this change the quantities needed for the PDF and CDF approximations are

$$\frac{\partial a_j(t, s)}{\partial t} = -\begin{pmatrix} \psi' d/s_2\\ 2\psi\psi' d/s_2 \end{pmatrix}, \qquad \frac{\partial a_j(t, s)}{\partial s} = -\begin{pmatrix} \hat\sigma e_j\psi'/s_1^2\\ 2\hat\sigma e_j\psi\psi'/s_1^2 \end{pmatrix},$$

$$\frac{\partial^2 a_j(t, s)}{\partial s^2} = \begin{pmatrix} 2\hat\sigma e_j\psi'/s_1^3\\ 4\hat\sigma e_j\psi\psi'/s_1^3 + 2\hat\sigma^2 e_j^2\psi'^2/s_1^4 \end{pmatrix},$$

where ψ and ψ′ are evaluated at σ̂e_j/s_1 − zd/s_2.


For the m aize d a ta the rob u st fit dow nw eights the largest and tw o smallest
observations, giving 6 = 26.68 an d a = 25.20. Table 9.9 com pares saddlepoint
and sim ulated quantiles o f Z*. The agreem ent is generally poorer th an in the
previous examples.
To investigate the properties of studentized bootstrap confidence intervals based on Z, we conducted a small simulation experiment. We generated 1000 samples of size n from the normal distribution, the t₅ distribution, the slash distribution (the distribution of an N(0, 1) variate divided by an independent U(0, 1) variate), and the chi-squared distribution. For each sample confidence intervals were obtained using the saddlepoint method described above. Table 9.10 shows the actual coverages of nominal 90% and 95% confidence intervals based on the integration saddlepoint approximation. For the symmetric distributions the results are remarkably good. The assumption of symmetric errors is false for the chi-squared distribution, and its results are poorer. In the symmetric cases the saddlepoint method failed to converge for about 2% of samples, for which
[Table 9.10: Coverage (%) of nominal 90% and 95% confidence intervals based on the integration saddlepoint approximation to a studentized bootstrap statistic, based on 1000 samples of size n from underlying normal, t₅, slash, and chi-squared distributions. Two-sided 90% and 95% coverages are given for all distributions, but for the asymmetric chi-squared distribution one-sided 5, 95, 2.5 and 97.5% coverages are given also.

              Normal       t5        Slash          Chi-squared
            90    95    90    95    90    95     5    95    90   2.5  97.5    95
n = 20      91    95    91    96    91    95    14    97    83     9    97    88
n = 40      90    94    89    95    89    95     9    94    85     6    95    89]

simulation would be needed to obtain confidence intervals; we simply left these samples out. Curiously, there were no convergence problems for the chi-squared samples.
One complication arises from assuming that the error distribution is symmetric, in which case the discussion in Section 3.3 implies that our resampling scheme should be modified accordingly. We can do so by replacing (9.41) with

$$K(\xi; z, s_1) = n\log\Bigl[\tfrac12\sum_{j=1}^n p_j\exp\{\xi^T a_j(z, s_1)\} + \tfrac12\sum_{j=1}^n p_j\exp\{\xi^T a_j^-(z, s_1)\}\Bigr],$$

where a_j^-(z, s_1) is obtained from (9.52) by replacing e_j with −e_j. However, the odd cumulants of U* are then zero, and a normal approximation to the distribution of Z* will often be adequate.

Even without this modification, it seems that the method described above yields robust confidence intervals with coverages very close to the nominal level.
Relative tim ings for sim ulation and saddlepoint approxim ations to the b o o t
strap distribution o f Z* depend on how the m ethods are im plem ented. In our
im plem entation for this exam ple it takes ab o u t the same time to obtain 1000
values o f Z ' by sim ulation as to calculate 50 saddlepoint approxim ations
using the integration m ethod, but this com parison is n o t realistic because the
saddlepoint m ethod gives accurate quantile estim ates m uch further into the
tails o f the distrib u tio n o f Z*. If ju st two quantile estim ates are needed, as
would be the case for a 95% confidence interval, the saddlepoint m ethod is
ab o u t ten tim es faster. O ther studies in the literature suggest that, once p ro
gram m ed, saddlepoint m ethods are 20-50 tim es faster than sim ulation, and
th a t efficiency gains tend to be larger w ith larger samples. However, saddle
p o in t approxim ation fails on ab o u t 1-2% o f samples, for which sim ulation is
needed.

9.6 Bibliographic Notes


Variance reduction methods for parametric simulation have a long history and a scattered literature. They are discussed in books on Monte Carlo methods, such as Hammersley and Handscomb (1964), Bratley, Fox and Schrage (1987), Ripley (1987), and Niederreiter (1992).

Balanced bootstrap simulation was introduced by Davison, Hinkley and Schechtman (1986). Ogbonmwan (1985) describes a slightly different method for achieving first-order balance. Graham et al. (1990) discuss second-order balance and the connections to classical experimental design. Algorithms for balanced simulation are described by Gleason (1988). Theoretical aspects of balanced resampling have been investigated by Do and Hall (1992b). Balanced sampling methods are related to number-theoretical methods for integration (Fang and Wang, 1994), and to Latin hypercube sampling (McKay, Conover and Beckman, 1979; Stein, 1987; Owen, 1992b). Diaconis and Holmes (1994) discuss the complete enumeration of bootstrap samples by methods based on Gray codes.

Linear approximations were used as control variates in bootstrap sampling by Davison, Hinkley and Schechtman (1986). A different approach was taken by Efron (1990), who suggested the re-centred bias estimate and the use of control variates in quantile estimation. Do and Hall (1992a) discuss the properties of this method, and provide comparisons with other approaches. Further discussion of control methods is contained in theses by Therneau (1983) and Hesterberg (1988).

Importance resampling was suggested by Johns (1988) and Davison (1988), and was exploited by Hinkley and Shi (1989) in the context of iterated bootstrap confidence intervals. Gigli (1994) outlines its use in parametric simulation for regression and certain time series problems. Hesterberg (1995b) suggests the application of ratio and regression estimators and of defensive mixture distributions in importance sampling, and describes their properties. The large-sample performance of importance resampling has been investigated by Do and Hall (1991). Booth, Hall and Wood (1993) describe algorithms for balanced importance resampling.

Bootstrap recycling was suggested by Davison, Hinkley and Worton (1992) and independently by Newton and Geyer (1994), following earlier ideas by J. W. Tukey; see Morgenthaler and Tukey (1991) for application of similar ideas to robust statistics. Properties of recycling in various applications are discussed by Ventura (1997).

Saddlepoint methods have a history in statistics stretching back to Daniels (1954), and they have been studied intensively in recent years. Reid (1988) reviews their use in statistical inference, while Jensen (1995) and Field and Ronchetti (1990) give longer accounts; see also Barndorff-Nielsen and Cox (1989). Jensen (1992) gives a direct account of the distribution function approximation we use. Saddlepoint approximation for permutation tests was proposed by Daniels (1955) and further discussed by Robinson (1982). Davison and Hinkley (1988), Daniels and Young (1991), and Wang (1993b) investigate their use in a number of resampling applications, and others have investigated their use in confidence interval estimation (DiCiccio, Martin and Young, 1992a,b, 1994). The use of approximate cumulant-generating functions is suggested by Easton and Ronchetti (1986), Gatto and Ronchetti (1996), and Gatto (1994), while Wang (1992) shows how the approximation may be modified to ensure the saddlepoint equation has a single root. Wang (1990) discusses the accuracy of such methods in the bootstrap context. Booth and Butler (1990) show how relationships among exponential family distributions may be exploited to give saddlepoint approximations for a number of bootstrap and permutation inferences, while Wang (1993a) describes an alternative approach for use in finite population problems. The marginal approximation in Section 9.5.3 extends and corrects that of Davison, Hinkley and Worton (1995); see also Spady (1991). The discussion in Example 9.18 follows Hinkley and Wei (1984). Jing and Robinson (1994) give a careful discussion of the accuracy of conditional and marginal saddlepoint approximations in bootstrap applications, while Chen and Do (1994) discuss the efficiency gains from combining saddlepoint methods with importance resampling.

Other methods of variance reduction applied to bootstrap simulation include antithetic sampling (Hall, 1989a; see Problem 9.21) and Richardson extrapolation (Bickel and Yahav, 1988; see Problem 9.22). Appendix II of Hall (1992a) compares the theoretical properties of some of the methods described in this chapter.

9.7 Problems
1

Under the balanced bootstrap the descending factorial moments of the f*_{rj} are

$$E\Bigl(\prod_w f^{*(s_w)}_{r_w j_w}\Bigr) = \prod_u n^{(p_u)}\prod_v R^{(q_v)}\Big/(nR)^{(\sum_w s_w)}, \qquad (9.53)$$

where f^{(a)} = f!/(f − a)!, and

$$p_u = \sum_{w: r_w = u} s_w, \qquad q_v = \sum_{w: j_w = v} s_w,$$

with u and v ranging over the distinct values of row and column subscripts on the left-hand side of (9.53).
(a) Check the first- and second-order moments for the f*_{rj} at (9.9), and verify that the values in Problem 2.19 are recovered as R → ∞.
(b) Use the results from (a) to obtain the mean of the bias estimate under balanced resampling.
(c) Now suppose that T is a linear statistic, and let V* = (R − 1)^{−1}Σ_r(T_r* − T̄*)² be the estimated variance of T based on the bootstrap samples. Show that the mean of V* under multinomial sampling is asymptotically equivalent to the mean under hypergeometric sampling, as R increases.
(Section 9.2.1; Appendix A; Haldane, 1940; Davison, Hinkley and Schechtman, 1986)

2 Consider the following algorithm for generation of R balanced bootstrap samples from y = (y_1, …, y_n):

Algorithm 9.3 (Balanced bootstrap 2)

Concatenate y with itself R times to form a list 𝒴 of length nR.
For I = nR, …, 2:
(a) Generate a random integer U in the range 1, …, I.
(b) Swap 𝒴_I and 𝒴_U.

Show that this produces output equivalent to Algorithm 9.1. Suggest a balanced bootstrap algorithm that uses storage 2n, rather than the nR used above.
(Section 9.2.1; Gleason, 1988; Booth, Hall and Wood, 1993)
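For concreteness, a direct transcription of Algorithm 9.3 (a Fisher–Yates shuffle of the concatenated list) is sketched below; each column of the result is one balanced bootstrap sample.

balanced.boot <- function(y, R) {
  Y <- rep(y, R)                 # concatenate y with itself R times
  for (I in length(Y):2) {
    U <- sample(I, 1)            # random integer in 1, ..., I
    tmp <- Y[I]; Y[I] <- Y[U]; Y[U] <- tmp
  }
  matrix(Y, ncol=R)
}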
3 Show that the re-centred estimate of bias, B_{R,adj}, can be approximated by (9.11), and obtain its mean and variance under ordinary bootstrap sampling. Compare the results with those obtained using the balanced bootstrap.
(Section 9.2.2; Appendix A; Efron, 1990)
4 Data y_1, …, y_n are sampled from a N(μ, σ²) distribution and we estimate σ by the MLE t = {n^{−1}Σ(y_j − ȳ)²}^{1/2}. The bias of T can be estimated theoretically:

$$B = E(T) - \sigma = \sigma\Bigl\{\frac{2^{1/2}\Gamma(\tfrac n2)}{n^{1/2}\Gamma(\tfrac{n-1}2)} - 1\Bigr\}.$$

But suppose that we estimate the bias by parametric resampling; that is, we generate samples y_1*, …, y_n* from the N(ȳ, t²) distribution. Show that the raw and adjusted bootstrap estimates of B can be expressed as

$$B_R = t\Bigl(n^{-1/2}R^{-1}\sum_{r=1}^R X_r^{1/2} - 1\Bigr)$$

and

$$B_{R,\mathrm{adj}} = t\,n^{-1/2}R^{-1}\Bigl\{\sum_{r=1}^R X_r^{1/2} - R^{1/2}\Bigl(\sum_{r=1}^R X_r + X_{R+1}\Bigr)^{1/2}\Bigr\},$$

where X_1, …, X_R are independent χ²_{n−1} and X_{R+1} is independently χ²_{R−1}.
Use simulation with these representations to show that the efficiencies of B_{R,adj} are roughly 8 and 16 for n = 10 and 20, for any R.
(Section 9.2.2; Efron, 1990)
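A sketch of the suggested simulation, taking t = 1 without loss of generality:

eff.sim <- function(n, R, nsim=10000) {
  BR <- BRadj <- numeric(nsim)
  for (i in 1:nsim) {
    X  <- rchisq(R, n - 1); X1 <- rchisq(1, R - 1)
    BR[i]    <- sum(sqrt(X))/(R*sqrt(n)) - 1
    BRadj[i] <- (sum(sqrt(X)) - sqrt(R*(sum(X) + X1)))/(R*sqrt(n))
  }
  var(BR)/var(BRadj)   # efficiency of the adjusted estimate
}
eff.sim(10, 19)        # should be roughly 8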
5 (a) Show that, for large n, the variance of the bias estimate under ordinary resampling, (9.8), can be written (nA + 2B + C)/(Rn²), while the variance of the bias estimate under balanced resampling, (9.11), is C/(Rn²); here A, B, and C are quantities of order one. Show also that the correlation ρ between a quadratic statistic T* and its linear approximation T*_L can be written as (nA + B)/{nA(nA + 2B + C)}^{1/2}, and hence verify that the variance of the bias estimate is reduced by a factor of 1 − ρ² when balanced resampling is used.
(b) Give ρ in terms of sample moments when t = n^{−1}Σ(y_j − ȳ)², and evaluate the resulting expression for samples of sizes n = 10 and 20 simulated from the normal and exponential distributions.
(Section 9.2.3)

6 Consider generating bootstrap samples y*_{r1}, …, y*_{rn}, r = 1, …, R, from y_1, …, y_n. Write y*_{rj} = y_{ξ(r,j)}, where the elements of the R × n matrix ξ(r, j) take values in 1, …, n.
(a) Show that first-order balance can be arranged by ensuring that each column of ξ contains each of 1, …, n with equal frequency, and deduce that when R = n the matrix is a randomized block design with treatment labels 1, …, n and with columns as blocks. Hence explain how to generate such a design when R = kn.
(b) Use the representation

$$f^*_{rj} = \sum_{i=1}^n \delta\{j - \xi(r, i)\},$$

where δ(u) = 1 if u = 0 and equals zero otherwise, to show that the ξ-balanced design is balanced in terms of f*_{rj}. Is the converse true?
(c) Suppose that we have a regression model Y_j = βx_j + ε_j, where the independent errors ε_j have mean zero and variance σ². We estimate β by T = ΣY_jx_j/Σx_j². Let T* = Σ(tx_j + ε_j*)x_j/Σx_j² denote a resampled version of T, where ε_j* is selected randomly from the centred residuals e_j − ē, with e_j = y_j − tx_j and ē = n^{−1}Σe_j. Show that the average value of T* equals t if R values of T* are generated from a ξ-balanced design, but not necessarily if the design is balanced in terms of f*_{rj}.
(Section 9.2.1; Graham et al., 1990)

7 (a) Following on from the previous question, a design is second-order ξ-balanced if all n² values of (ξ(r, i), ξ(r, j)) occur with equal frequency for any pair of columns i and j. With R = n², show that this is achieved by setting the first column of ξ to be (1, …, 1, 2, …, 2, …, n, …, n)ᵀ, the second column to be (1, 2, …, n, 1, 2, …, n, …, 1, 2, …, n)ᵀ, and the remaining n − 2 columns to be the elements of n − 2 orthogonal Latin squares with treatment labels 1, …, n. Exhibit such a design for n = 3.
(b) Think of the design matrix as having rows as blocks, with treatment labels 1, …, n to be allocated within blocks; take R = kn. Explain why a design is said to be second-order f-balanced if

$$\sum_{r=1}^R f^{*2}_{rj} = k(2n - 1), \qquad \sum_{r=1}^R f^*_{rj}f^*_{rk} = k(n - 1), \qquad j, k = 1, \ldots, n, \quad j \ne k.$$

Such a design is derived by replacing the treatment labels by 0, …, n − 1, choosing k initial blocks with these replacement labels, adding in turn 1, 2, …, n − 1, and reducing the values mod n. With n = 5 and k = 3, construct the design with initial blocks (0, 1, 2, 3, 4), (0, 0, 0, 1, 3), and (0, 0, 0, 1, 2), and verify that it is first- and second-order balanced. Can the initial blocks be chosen at random?
(Section 9.2.1; Graham et al., 1990)

8 Suppose that you wish to estimate the normal tail probability ∫I{z ≤ a}φ(z) dz, where φ(·) is the standard normal density function and I{A} is the indicator of the event A, by importance sampling from a distribution H(·).
Let H be the normal distribution with mean μ and unit variance. Show that the maximum efficiency is

$$\frac{\Phi(a)\{1 - \Phi(a)\}}{\exp(\mu^2)\Phi(a + \mu) - \Phi(a)^2},$$

where μ is chosen to minimize exp(μ²)Φ(a + μ). Use the fact that Φ(z) ≈ −φ(z)/z for large negative z to give an approximate value for μ, and plot the corresponding approximate efficiency for −3 < a < 0. What happens when a > 0?
(Section 9.4.1)
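For the plot, one can bypass the Mills-ratio algebra and minimize numerically; a sketch:

eff <- function(a, mu)
  pnorm(a)*(1 - pnorm(a))/(exp(mu^2)*pnorm(a + mu) - pnorm(a)^2)
mu.opt <- function(a)
  optimize(function(m) exp(m^2)*pnorm(a + m), c(-10, 0))$minimum
a <- seq(-3, -0.1, by=0.01)
plot(a, sapply(a, function(ai) eff(ai, mu.opt(ai))), type="l",
  xlab="a", ylab="efficiency")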

9 Suppose that T_1*, …, T_R* is a random sample from PDF h(·) and CDF H(·), and let g(·) be a PDF with the same support as h(·) and with p quantile η_p, i.e.

$$p = \int_{-\infty}^{\eta_p} g(t)\, dt = \int_{-\infty}^{\eta_p} \frac{g(t)}{h(t)}\, h(t)\, dt.$$

Let T*_{(1)} < ⋯ < T*_{(R)} denote the order statistics of the T_r*, set

$$S_m = \frac{1}{R + 1}\sum_{r=1}^m \frac{g(T^*_{(r)})}{h(T^*_{(r)})},$$

and let M be the random index determined by S_M ≤ p < S_{M+1}. Show that T*_{(M)} → η_p as R → ∞, and hence justify the estimate t*_{(M)} given at (9.19).
(Section 9.4.1; Johns, 1988)
10 Suppose that T* has a linear approximation T*_L, and let p̃ be the distribution on y_1, …, y_n with probabilities p̃_j ∝ exp{λl_j/(nv_L^{1/2})}, where v_L = n^{−2}Σl_j². Find the moment-generating function of T*_L under sampling from p̃, and hence show that in this case T* is approximately N(t + λv_L^{1/2}, v_L). You may assume that T*_L − t is approximately N(0, v_L) when λ = 0.
(Section 9.4.1; Johns, 1988; Hinkley and Shi, 1989)

11 The linear approximation t*_L for a single-sample resample statistic is typically accurate for t* near t, but may not work well near quantiles of interest. For an approximation that is accurate near the α quantile of T*, consider expanding t* = t(p*) about p_α = (p_{1α}, …, p_{nα}) rather than about (n^{−1}, …, n^{−1}).
(a) Show that if p_{jα} ∝ exp(n^{−1}v_L^{−1/2}z_α l_j), then t(p_α) will be close to the α quantile of T* for large n.
(b) Define

$$l_{j\alpha} = \frac{\partial}{\partial\varepsilon}\, t\{(1 - \varepsilon)p_\alpha + \varepsilon 1_j\}\Big|_{\varepsilon=0},$$

where 1_j is a vector with one in the jth position and zeroes elsewhere. Show that

$$t^*_{L\alpha} = t(p_\alpha) + n^{-1}\sum_{j=1}^n f_j^* l_{j\alpha}.$$

(c) For the ratio estimates in Example 2.22, compare numerically t*_L, t*_{Lα}, and the quadratic approximation

$$t^*_Q = t + n^{-1}\sum_{j=1}^n f_j^* l_j + \tfrac12 n^{-2}\sum_{j=1}^n\sum_{k=1}^n f_j^* f_k^* q_{jk}$$

with t*.
(Sections 2.7.4, 3.10.2, 9.4.1; Hesterberg, 1995a)
12 (a) The importance sampling ratio estimator of μ can be written as

$$\hat\mu_{H,\mathrm{rat}} = \frac{R^{-1}\sum m(y_r)w(y_r)}{R^{-1}\sum w(y_r)} = \frac{\mu + R^{-1/2}\varepsilon_1}{1 + R^{-1/2}\varepsilon_0},$$

where ε_1 = R^{−1/2}Σ{m(y_r)w(y_r) − μ} and ε_0 = R^{−1/2}Σ{w(y_r) − 1}. Show that this implies that

$$\mathrm{var}(\hat\mu_{H,\mathrm{rat}}) \doteq R^{-1}\mathrm{var}\{m(Y)w(Y) - \mu w(Y)\}.$$

(b) The variance of the importance sampling regression estimator is approximately

$$\mathrm{var}(\hat\mu_{H,\mathrm{reg}}) = R^{-1}\mathrm{var}\{m(Y)w(Y) - aw(Y)\}, \qquad (9.54)$$

where a = cov{m(Y)w(Y), w(Y)}/var{w(Y)}. Show that this choice of a achieves minimum variance among estimators for which the variance has form (9.54), and deduce that when R is large the regression estimator will always have variance no larger than the raw and ratio estimators.
(c) As an artificial illustration of (b), suppose that for θ > 0 and some non-negative integer k we wish to estimate

$$\mu = \int m(y)g(y)\, dy = \int_0^\infty \frac{y^k e^{-y}}{k!}\times\theta e^{-\theta y}\, dy$$

by simulating from density h(y) = βe^{−βy}, y > 0, β > 0. Give w(y) and show that E{m(Y)w(Y)} = μ for any β and θ, but that var(μ̂_rat) is only finite when 0 < β < 2θ. Calculate var{m(Y)w(Y)}, cov{m(Y)w(Y), w(Y)}, and var{w(Y)}. Plot the asymptotic efficiencies var(μ̂_{H,raw})/var(μ̂_{H,rat}) and var(μ̂_{H,raw})/var(μ̂_{H,reg}) as functions of β for θ = 2 and k = 0, 1, 2, 3. Discuss your findings.
(Section 9.4.2; Hesterberg, 1995b)
13 Suppose that an application of importance resampling to a statistic T* has resulted in estimates t_1* < ⋯ < t_R* and associated weights w_1, …, w_R, and that the importance re-weighting regression estimate of the CDF of T* is required. Let A be the R × R matrix whose (r, s) element is w_r I(t_r* ≤ t_s*) and B be the R × 2 matrix whose rth row is (1, w_r). Show that the regression estimate of the CDF at t_1*, …, t_R* equals (1, 1)(BᵀB)^{−1}BᵀA.
(Section 9.4.2)

14 (a) Let I_k = (I_{k1}, …, I_{kn}), k = 1, …, nR, denote independent identically distributed multinomial random variables with denominator 1 and probability vector p = (p_1, …, p_n). Show that S_{nR} = Σ_{k=1}^{nR} I_k has a multinomial distribution with denominator nR and probability vector p, and that the conditional distribution of I_{nR} given that S_{nR} = q is multinomial with denominator 1 and mean vector (nR)^{−1}q, where q = (R_1, …, R_n) is a fixed vector. Show also that

$$\Pr(I_1 = i_1, \ldots, I_{nR} = i_{nR}\mid S_{nR} = q)$$

equals

$$g(i_{nR}\mid S_{nR} = q)\prod_{j=1}^{nR-1} g(i_{nR-j}\mid S_{nR-j} = q - i_{nR} - \cdots - i_{nR-j+1}),$$

where g(·) is the probability mass function of its argument.
(b) Use (a) to justify the following algorithm:

Algorithm 9.4 (Balanced importance resampling)

Initialize by setting values of R_1, …, R_n such that R_j ≈ nRp_j and Σ_j R_j = nR.
For m = nR, …, 1:
(a) Generate u from the U(0, 1) distribution.
(b) Find the j such that Σ_{i=1}^{j−1} R_i < um ≤ Σ_{i=1}^j R_i.
(c) Set I_m = j and decrease R_j to R_j − 1.
Return the sets {I_1, …, I_n}, {I_{n+1}, …, I_{2n}}, …, {I_{n(R−1)+1}, …, I_{nR}} as the indices of the R bootstrap samples of size n.

(Section 9.4.3; Booth, Hall and Wood, 1993)

15 For the bootstrap recycling estimate of bias described in Example 9.12, consider the case T = Ȳ with the parametric model Y ~ N(θ, 1). Show that if H is taken to be the N(ȳ, a) distribution then, provided a > 3/2, the simulation variance of the recycling estimate of C is approximately the sum of a term of order (RN)^{−1}, involving a²/(2a − 3)^{3/2}, and a term of order N^{−1}, involving 1/{8(a − 1)^{5/2}}. Compare this to the simulation variance when ordinary double bootstrap methods are used.
What are the implications for nonparametric double bootstrap calculations? Investigate the use of defensive mixtures for H in this problem.
(Section 9.4.4; Ventura, 1997)
16 Consider exponential tilting for a statistic whose linear approximation is

$$T_L^* = t + \sum_{s=1}^S n_s^{-1}\sum_{j=1}^{n_s} f_{sj}^* l_{sj},$$

where the (f*_{s1}, …, f*_{sn_s}), s = 1, …, S, are independent sets of multinomial frequencies.
(a) Show that the cumulant-generating function of T*_L is

$$K(\xi) = \xi t + \sum_{s=1}^S n_s\log\Bigl\{\frac{1}{n_s}\sum_{j=1}^{n_s}\exp(\xi l_{sj}/n_s)\Bigr\}.$$

Hence show that choosing ξ to give K′(ξ) = t_0 is equivalent to exponential tilting of T*_L to have mean t_0, and verify the tilting calculations in Example 9.8.
(b) Explain how to modify (9.26) and (9.28) to give the approximate PDF and CDF of T*_L.
(c) How can stratification be accommodated in the conditional approximations of Section 9.5.2?
(Section 9.5)
17 In a matched pair design, two treatments are allocated at random to each of n pairs of experimental units, with differences d_j and average difference d̄ = n^{−1}Σd_j. If there is no real effect, all 2ⁿ sequences ±d_1, …, ±d_n are equally likely, and so are the values D* = n^{−1}Σδ_jd_j, where the δ_j take values ±1 with probability ½. The one-sided significance level for testing the null hypothesis of no effect is Pr*(D* ≥ d̄).
(a) Show that the cumulant-generating function of D* is

$$K(\xi) = \sum_{j=1}^n \log\cosh(\xi d_j/n),$$

and find the saddlepoint equation and the quantities needed for saddlepoint approximation to the observed significance level. Explain how this may be fitted into the framework of a conditional saddlepoint approximation.
(b) See Practical 9.5.
(Section 9.5.1; Daniels, 1958; Davison and Hinkley, 1988)
18 For the testing problem of Problem 4.9, use saddlepoint methods to develop an approximation to the exact bootstrap P-value based on the exponential tilted EDF. Apply this to the city population data with n = 10.
(Section 9.5.1)

19 (a) If W_1, …, W_n are independent Poisson variables with means μ_1, …, μ_n, show that their joint distribution conditional on Σ_j W_j = m is multinomial with probability vector π = (μ_1, …, μ_n)/Σ_j μ_j and denominator m. Hence justify the first saddlepoint approximation in Example 9.16.
(b) Suppose that T* is the solution to an estimating equation of form (9.32), but that f_j* = 0 or 1 and Σf_j* = m < n; T* is then a delete-(n − m) jackknife value of the original statistic. Explain how to obtain a saddlepoint approximation to the PDF of T*. How can this PDF be used to estimate var*(T*)? Do you think the estimate will be good when m = n − 1?
(Section 9.5.2; Booth and Butler, 1990)

20 (a) Show that the bootstrap correlation coefficient t* based on data pairs (x_j, z_j), j = 1, …, n, may be expressed as the solution to the estimating equation (9.40) with

$$a_j(t, s) = \begin{pmatrix} x_j - s_1\\ z_j - s_2\\ (x_j - s_1)^2 - s_3\\ (z_j - s_2)^2 - s_4\\ (x_j - s_1)(z_j - s_2) - t(s_3 s_4)^{1/2} \end{pmatrix},$$

where sᵀ = (s_1, s_2, s_3, s_4), and show that the Jacobian J(t, s; ξ) = n⁵(s_3s_4)^{1/2}. Obtain the quantities needed for the marginal saddlepoint approximation (9.43) to the density of T*.
(b) What further quantities would be needed for saddlepoint approximation to the marginal density of the studentized form of T*?
(Section 9.5.3; Davison, Hinkley and Worton, 1995; DiCiccio, Martin and Young, 1994)
21 Let T_1* be a statistic calculated from a bootstrap sample in which y_j appears with frequency f_j* (j = 1, …, n), and suppose that the linear approximation to T_1* is T*_{1L} = t + n^{−1}Σf_j*l_j, where l_1 ≤ l_2 ≤ ⋯ ≤ l_n. The statistic T_2* antithetic to T_1* is calculated from the bootstrap sample in which y_j appears with frequency f*_{n+1−j}.
(a) Show that if T_1* and T_2* are antithetic,

$$\mathrm{var}\{\tfrac12(T_1^* + T_2^*)\} = \frac{1}{2n}\Bigl(n^{-1}\sum_{j=1}^n l_j^2 + n^{-1}\sum_{j=1}^n l_j l_{n+1-j}\Bigr),$$

and that this is roughly x²/(2n) as n → ∞, where

$$x^2 = \tfrac12\int_0^1 (\eta_p + \eta_{1-p})^2\, dp$$

and η_p is the pth quantile of the distribution of L_t(Y; F).
(b) Show that if T_2* is independent of T_1* the corresponding variance is (2n)^{−1}n^{−1}Σl_j², and deduce that when T is the sample average and F is the exponential distribution the large-sample performance gain of antithetic resampling is 6/(12 − π²) ≐ 2.8.
(c) What happens if F is symmetric? Explain qualitatively why.
(Hall, 1989a)

9 - Improved Calculation

494
22

Suppose that resampling from a sample o f size n is used to estimate a quantity


z(n) with expansion
z(n) = zQ+ n~az\ + n~2az2 -\----- ,

(9-55)

where zo, zi, z2 are unknown but a is known; often a = j . Suppose that we
resample from the E D F F, but with sample sizes nQ, m , where 1 < no < n t < n,
instead o f the usual n, giving simulation estimates z ' ( n0), z ' ( n t ) o f z(n0), z( n x).
(a) Show that z*(n) can be estimated by

z(n) =

z (no) +

no

n,

(z(n0) - z > i ) }

(b) N ow suppose that an estimate o f z (n; ) based on Rj simulations has variance


approximately b / R j and that the com putational effort required to obtain it is cnjRj,
for some constants b and c. Given no and ni, discuss the choice o f R q and R\ to
minimize the variance o f z"(n) for a given total com putational effort.
(c) Outline how knowledge o f the limit zo in (9.55) can be used to improve z (n).
How would you proceed if a were unknown? D o you think it wise to extrapolate
from just two values no and
?
(Bickel and Yahav, 1988)

9.8 Practicals

1 For ordinary bootstrap sampling, balanced resampling, and balanced resampling within strata:

y <- rnorm(10)
junk.fun <- function(y, i) var(y[i])
junk <- boot(y, junk.fun, R=9)
boot.array(junk)
apply(junk$t, 2, sum)
junk <- boot(y, junk.fun, R=9, sim="balanced")
boot.array(junk)
apply(junk$t, 2, sum)
junk <- boot(y, junk.fun, R=9, sim="balanced",
  strata=rep(1:2, c(5,5)))
boot.array(junk)
apply(junk$t, 2, sum)

Now use balanced resampling in earnest to estimate the bias for the gravity data weighted average:

grav.fun <- function(data, i)
{ d <- data[i,]
  m <- tapply(d$g, d$series, mean)
  v <- tapply(d$g, d$series, var)
  n <- table(d$series)
  v <- (n-1)*v/n
  c(sum(m*n/v)/sum(n/v), sum(n/v)) }
grav.bal <- boot(gravity, grav.fun, R=49,
  strata=gravity$series, sim="balanced")
mean(grav.bal$t[,1]) - grav.bal$t0[1]

For the adjusted estimate of bias:

grav.ord <- boot(gravity, grav.fun, R=49,
  strata=gravity$series)
control(grav.ord, bias.adj=T)

Now a more systematic comparison, with 40 replicates each with R = 19:

R <- 19; nreps <- 40; bias <- matrix(,nreps,3)
for (i in 1:nreps) {
  grav.ord <- boot(gravity, grav.fun, R=R, strata=gravity$series)
  grav.bal <- boot(gravity, grav.fun, R=R,
    strata=gravity$series, sim="balanced")
  bias[i,] <- c(mean(grav.ord$t[,1]) - grav.ord$t0[1],
    mean(grav.bal$t[,1]) - grav.bal$t0[1],
    control(grav.ord, bias.adj=T)) }
bias
apply(bias, 2, mean)
apply(bias, 2, var)
split.screen(c(1,2))
screen(1)
qqplot(bias[,1], bias[,2], xlab="ordinary", ylab="balanced")
abline(0, 1, lty=2)
screen(2)
qqplot(bias[,2], bias[,3], xlab="balanced", ylab="adjusted")
abline(0, 1, lty=2)

What are the efficiency gains due to using balanced simulation and post-simulation adjustment for bias estimation here? Now a calculation to see the correlation between T* and its linear approximation:

grav.ord <- boot(gravity, grav.fun, R=999, strata=gravity$series)
grav.L <- empinf(grav.ord, type="reg")
tL <- linear.approx(grav.ord, grav.L, index=1)
close.screen(all=T)
plot(tL, grav.ord$t[,1])
cor(tL, grav.ord$t[,1])

Finally, calculations for the estimates of bias, variance and quantiles using the linear approximation as control variate:

grav.cont <- control(grav.ord, L=grav.L, index=1)
grav.cont$bias
grav.cont$var
grav.cont$quantiles
To use importance resampling to estimate quantiles o f the contrast o f averages
for the tau data o f Practical 3.4, we first set up strata, a weighted version o f the
statistic t, a contrast o f averages, and calculate the empirical influence values:

tau.w <- function(data, w)


{
d <- data$rate*w
d <- tapply(d,data$decay,sum)/tapply(w,data$decay,sum)
d[l]-sum(d [-1])
}
tau.L <- empinf(data=tau, statistic=tau.w, strata=tau$decay)
We could use exponential tilting to find distributions tilted to 14 and 18 (the
original value o f t is 16.16):

496

9 Improved Calculation

e x p . t i l t ( t a u . L , t h e t a = c ( 1 4 , 1 8 ) ,t 0 = 1 6 .1 6 )
Function t i l t . b o o t does this automatically. Here we do 199 bootstraps without
tilting, then 100 each tilted to the 0.05 and 0.95 quantiles o f these 199 values o f t".
We then display the weights, without and with defensive mixture distributions:
t a u . t i l t < - t i l t . b o o t ( t a u , t a u . w, R=c( 1 9 9 ,1 0 0 ,1 0 0 ) ,s t r a ta = ta u $ d e c a y ,
s t y p e = "w", L = ta u . L, a lp h a = c ( 0 . 0 5 , 0 . 9 5 ) )
s p l i t . s c r e e n (c (1 ,2 ) )
s c r e e n ( l) ; p lo t ( t a u .t ilt $ t ,im p .w e ig h t s ( t a u .t ilt ) ,lo g = " y " )
s c r e e n ( 2 ) ; p l o t ( t a u . t i l t $ t , im p. w e ig h t s ( t a u . t i l t , d e f= F ), lo g = " y ")
The corresponding estimated quantiles are
i m p .q u a n t i l e ( t a u . t i l t , a l p h a = c ( 0 . 0 5 , 0 . 9 5 ) )
im p. q u a n t i l e ( t a u . t i l t , a lp h a = c ( 0 . 0 5 , 0 . 9 5 ) , def=F )
The same can be done with frequency sm oothing, but then the initial value o f R
must be larger:
t a u .f r e q < - t i l t . b o o t ( t a u , ta u .w , R=c( 4 9 9 ,2 5 0 ,2 5 0 ) ,
s t r a ta = ta u $ d e c a y , stype="w ", t i l t = F , a lp h a = c ( 0 . 0 5 , 0 . 9 5 ) )
im p .q u a n t il e ( t a u .f r e q ,a lp h a = c ( 0 . 0 5 , 0 . 9 5 ) )
For balanced importance resampling we simply add sim ="balanced" to the argu
ments o f t i l t . b o o t . For a small simulation study to see the potential efficiency
gains over ordinary sampling, we compare the performance o f ordinary sampling
and importance resampling with and without balance, in estimating the 0.1 and
0.9 quantiles o f the distribution o f t".
t a u . t e s t < - NULL
f o r ( i r e p in 1 :1 0 )
{ ta u .b o o t < - b o o t ( t a u , ta u .w , R=199, stype="w ",
str a ta = ta u $ d e c a y )
q .o r d < - s o r t ( t a u . b o o t $ t ) [ c ( 2 0 , 1 8 0 )]
t a u . t i l t < - t i l t . b o o t ( t a u , ta u .w , R = c ( 9 9 ,5 0 ,5 0 ) ,
s t r a ta = ta u $ d e c a y , stype="w ", L = tau .L ,
a lp h a = c ( 0 . 1 , 0 . 9 ) )
q . t i l t < - i m p . q u a n t i l e ( t a u . t i l t , a lp h a = c ( 0 . 1 , 0 . 9 ) ) $raw
t a u .b a l < - t i l t . b o o t ( t a u , ta u .w , R = c ( 9 9 ,5 0 ,5 0 ) ,
s tr a ta = ta u $ d e c a y , stype="w ", L = tau .L ,
a lp h a = c ( 0 .1 , 0 . 9 ) , sim ="balanced")
q .b a l < - i m p .q u a n t il e ( t a u .b a l , a lp h a = c ( 0 .1 , 0 .9 ))$ r a w
t a u . t e s t < - r b i n d ( t a u . t e s t , c ( q .o r d , q . t i l t , q . b a l ) ) >
s q r t ( a p p l y ( t a u . t e s t , 2 , v a r ))
W hat are the efficiency gains o f the two importance resampling methods?
3 Consider the bias and standard deviation functions for the correlation of the claridge data (Example 4.9). To estimate them, we perform a double bootstrap and plot the results, as follows:

clar.fun <- function(data, f)
{ r <- corr(data, f/sum(f))
  n <- nrow(data)
  d <- data[rep(1:n, f),]
  us <- (d[,1]-mean(d[,1]))/sqrt(var(d[,1]))
  xs <- (d[,2]-mean(d[,2]))/sqrt(var(d[,2]))
  L <- us*xs - r*(us^2+xs^2)/2
  v <- sum((L/n)^2)
  clar.t <- boot(d, corr, R=25, stype="w")$t
  i <- is.na(clar.t)
  clar.t <- clar.t[!i]
  c(r, v, mean(clar.t)-r, var(clar.t), sum(i)) }
clar.boot <- boot(claridge, clar.fun, R=999, stype="f")
split.screen(c(1,2))
screen(1)
plot(clar.boot$t[,1], clar.boot$t[,3], pch=".",
  xlab="theta*", ylab="bias")
lines(lowess(clar.boot$t[,1], clar.boot$t[,3], f=1/2), lwd=2)
screen(2)
plot(clar.boot$t[,1], sqrt(clar.boot$t[,4]), pch=".",
  xlab="theta*", ylab="SD")
l <- lowess(clar.boot$t[,1], clar.boot$t[,4], f=1/2)
lines(l$x, sqrt(l$y), lwd=2)

To obtain recycled estimates using only the results from a single bootstrap, and to compare them with those from the double bootstrap:

clar.rec <- boot(claridge, corr, R=999, stype="w")
IS.ests <- function(theta, boot.out, statistic, A=0.2)
{ f <- smooth.f(theta, boot.out, width=A)
  theta.f <- statistic(boot.out$data, f/sum(f))
  IS.w <- imp.weights(boot.out, q=f)
  moms <- imp.moments(boot.out, t=boot.out$t[,1]-theta.f, w=IS.w)
  c(theta, theta.f, moms$raw, moms$rat, moms$reg) }
IS.clar <- matrix(,41,8)
theta <- seq(0, 0.8, length=41)
for (j in 1:41) IS.clar[j,] <- IS.ests(theta[j], clar.rec, corr)
screen(1, new=F)
lines(IS.clar[,2], IS.clar[,7])
lines(IS.clar[,2], IS.clar[,5], lty=3)
lines(IS.clar[,2], IS.clar[,3], lty=2)
screen(2, new=F)
lines(IS.clar[,2], sqrt(IS.clar[,8]))
lines(IS.clar[,2], sqrt(IS.clar[,6]), lty=3)
lines(IS.clar[,2], sqrt(IS.clar[,4]), lty=2)

Do you think these results are close enough to those from the double bootstrap? Compare the values of θ in IS.clar[,1] to the values of θ* = t(F̂_θ) in IS.clar[,2].
4 Dataframe capability gives data from Bissell (1990) comprising 75 successive observations with specification limits U = 5.79 and L = 5.49; see Problem 5.6 and Practical 5.4. Suppose that we wish to use the range of blocks of 5 observations to estimate σ, in which case θ = κ/r̄₅, where κ = (U − L)d₅. Then θ̂ is the root of the estimating equation Σ_j(κ − r_{5j}θ) = 0; this is just a ratio statistic. We estimate the PDF of θ̂* by saddlepoint methods as follows:

psi <- function(tt, r, k=2.236*(5.79-5.49)) k - r*tt
psi1 <- function(tt, r, k=2.236*(5.79-5.49)) r
det.psi <- function(tt, r, xi)
{ p <- exp(xi * psi(tt, r))
  length(r) * abs(sum(p * psi1(tt, r))/sum(p)) }
r5 <- apply(matrix(capability$y, 15, 5, byrow=T), 1,
  function(x) diff(range(x)))
m <- 300; top <- 10; bot <- 4
sad <- matrix(, m, 3)
th <- seq(bot, top, length=m)
for (i in 1:m)
{ sp <- saddle(A=psi(th[i], r5), u=0)
  sad[i,] <- c(th[i], sp$spa[1]*det.psi(th[i], r5, xi=sp$zeta.hat),
    sp$spa[2]) }
sad <- sad[!is.na(sad[,2]) & !is.na(sad[,3]),]
plot(sad[,1], sad[,2], type="l", xlab="theta hat", ylab="PDF")

To obtain the quantiles of the distribution of θ̂*, we use the following code; here capab.t0 contains θ̂ and its standard error:

theta.fun <- function(d, w, k=2.236*(5.79-5.49)) k*sum(w)/sum(d*w)
capab.v <- var.linear(empinf(data=r5, statistic=theta.fun))
capab.t0 <- c(2.236*(5.79-5.49)/mean(r5), sqrt(capab.v))
Afn <- function(t, data, k=2.236*(5.79-5.49)) k - t*data
ufn <- function(t, data, k=2.236*(5.79-5.49)) 0
capab.sp <- saddle.distn(A=Afn, u=ufn, t0=capab.t0, data=r5)
capab.sp

We can use the same ideas to apply the block bootstrap. Now we take b = 15 of the n − l + 1 blocks of successive observations of length l = 5. We concatenate them to form a new series, and then take the ranges of each block of successive observations. This is equivalent to selecting b ranges from among the n − l + 1 possible ranges, with replacement. The quantiles of the saddlepoint approximation to the distribution of θ̂ under this scheme are found as follows:

r5 <- NULL
for (j in 1:71) r5 <- c(r5, diff(range(capability$y[j:(j+4)])))
Afn <- function(t, data, k=2.236*(5.79-5.49)) cbind(k - t*data, 1)
ufn <- function(t, data, k=2.236*(5.79-5.49)) c(0, 15)
capab.sp1 <- saddle.distn(A=Afn, u=ufn, wdist="p",
  type="cond", t0=capab.t0, data=r5)
capab.sp1$quantiles

Compare them with the quantiles above. How do they differ? Why?
5 To apply the saddlepoint approximation given in Problem 9.17 to the paired comparison data of Problem 4.7, and obtain a one-sided significance level Pr*(D* ≥ d):

K <- function(xi) sum(log(cosh(xi*darwin$y))) - xi*sum(darwin$y)
K2 <- function(xi) sum(darwin$y^2/cosh(xi*darwin$y)^2)
darwin.saddle <- saddle(K.adj=K, K2=K2)
darwin.saddle
1 - darwin.saddle$spa[2]

10
Semiparametric Likelihood Inference

10.1 Likelihood

The likelihood function is central to inference in parametric statistical models. Suppose that data y are believed to have come from a distribution F_ψ, where ψ is an unknown p × 1 vector parameter. Then the likelihood for ψ is the corresponding density evaluated at y, namely

$$L(\psi) = f_\psi(y),$$

regarded as a function of ψ. This measures the plausibility of the different values of ψ which might have given rise to y, and can be used in various ways.

If further information about ψ is available in the form of a prior probability density, π(ψ), Bayes' theorem can be used to form a posterior density for ψ given the data y,

$$\pi(\psi\mid y) = \frac{\pi(\psi)L(\psi)}{\int \pi(\psi')L(\psi')\, d\psi'}.$$

Inferences regarding ψ or other quantities of interest may then be based on this density, which in principle contains all the information concerning ψ.
If prior information about ψ is not available in a probabilistic form, the likelihood itself provides a basis for comparison of different values of ψ. The most plausible value is that which maximizes the likelihood, namely the maximum likelihood estimate, ψ̂. The relative plausibility of other values is measured in terms of the log likelihood ℓ(ψ) = log L(ψ) by the likelihood ratio statistic

$$W(\psi) = 2\{\ell(\hat\psi) - \ell(\psi)\}.$$

A key result is that under repeated sampling of data from a regular model, W(ψ) has approximately a chi-squared distribution with p degrees of freedom. This forms the basis for the primary method of calculating confidence regions in parametric models. One special feature is that the likelihood determines the shape of confidence regions when ψ is a vector.

Unlike many of the confidence interval methods described in Chapter 5, likelihood provides a natural basis for the combination of information from different experiments. If we have two independent sets of data, y and z, that bear on the same parameter, the overall likelihood is simply L(ψ) = f(y | ψ)f(z | ψ), and tests and confidence intervals concerning ψ may be based on this. This type of combination is particularly useful in applications where several independent experiments are linked by common parameters; see Practical 10.1.
In applications we can often write ψ = (θ, λ), where the components of θ are of primary interest, while the so-called nuisance parameters λ are of secondary concern. In such situations inference for θ is based on the profile likelihood

$$L_p(\theta) = \max_\lambda L(\theta, \lambda), \qquad (10.1)$$

which is treated as if it were a likelihood. In some cases, particularly those where λ is high dimensional, the usual properties of likelihood statistics (consistency of maximum likelihood estimate, approximate chi-squared distribution of log likelihood ratio) do not apply without making an adjustment to the profile likelihood. The adjusted likelihood is

$$L_a(\theta) = L_p(\theta)\,|j_{\lambda\lambda}(\theta, \hat\lambda_\theta)|^{-1/2}, \qquad (10.2)$$

where λ̂_θ is the MLE of λ for fixed θ and j_{λλ}(ψ) is the observed information matrix for λ, i.e. j_{λλ}(ψ) = −∂²ℓ(ψ)/∂λ∂λᵀ.

Without a parametric model the definition of a parameter is more vexed. As in Chapter 2, we suppose that a parameter θ is determined by a statistical function t(·), so that θ = t(F) is a mean, median, or other quantity determined by, but not by itself determining, the unknown distribution F. Now the nuisance parameter λ is all aspects of F other than t(F), so that in general λ is infinite dimensional. Not surprisingly, there is no unique way to construct a likelihood in this situation, and in this chapter we describe some of the different possibilities.

10.2 Multinomial-Based Likelihoods


10.2.1 Empirical likelihood
Scalar parameter
Suppose that observations y₁, ..., y_n form a random sample from an unknown distribution F, and that we wish to construct a likelihood for a scalar parameter θ = t(F), where t(·) is a statistical function. One view of the EDF F̂ is that it is the nonparametric maximum likelihood estimate of F, with corresponding nonparametric maximum likelihood estimate t = t(F̂) for θ (Problem 10.1). The EDF is a multinomial distribution with denominator one and probability vector (n⁻¹, ..., n⁻¹) attached to the y_j. We can think of this distribution as embedded in a more general multinomial distribution with arbitrary probability vector p = (p₁, ..., p_n) attached to the data values. If F is restricted to be such a multinomial distribution, then we can write t(p) rather than t(F) for the function which defines θ. The special multinomial probability vector (n⁻¹, ..., n⁻¹) corresponding to the EDF is p̂, and t = t(p̂) is the nonparametric maximum likelihood estimate of θ. This multinomial representation was used earlier in Sections 4.4 and 5.4.2.
Restricting the model to be multinomial on the data values with probability vector p, the parameter value is θ = t(p) and the likelihood for p is L(p) = ∏_{j=1}^n p_j^{f_j}, with f_j equal to the frequency of value y_j in the sample. But, assuming there are no tied observations, all f_j are equal to 1, so that L(p) = p₁ × ··· × p_n: this is the analogue of L(ψ) in the parametric case. We are interested only in θ = t(p), for which we can use the profile likelihood

L_EL(θ) = sup_{p : t(p) = θ} ∏_{j=1}^n p_j,   (10.3)

which is called the empirical likelihood for θ. Notice that the value of θ which maximizes L_EL(θ) corresponds to the value of p maximizing L(p) with only the constraint Σ p_j = 1, that is p̂. In other words, the empirical likelihood is maximized by the nonparametric maximum likelihood estimate t.
In (10.3) we maximize over the p_j subject to the constraints imposed by fixing t(p) = θ and Σ p_j = 1, which is effectively a maximization over n − 2 quantities when θ is scalar. Remarkably, although the number of parameters over which we maximize is comparable with the sample size, the approximate distributional results from the parametric situation carry over. Let θ₀ be the true value of θ, with T the maximum empirical likelihood estimator. Then under mild conditions on F and in large samples, the empirical likelihood ratio statistic

W_EL(θ₀) = 2{log L_EL(T) − log L_EL(θ₀)}

has an approximate chi-squared distribution with d degrees of freedom. Although the limiting distribution of W_EL(θ₀) is the same as that of W_p(θ₀) under a correct parametric model, such asymptotic results are typically less useful in the nonparametric setting. This suggests that the bootstrap be used to calibrate empirical likelihood, by using quantiles of bootstrap replicates of W_EL(θ₀), i.e. quantiles of W*_EL(t). This idea is outlined below.
Example 10.1 (Air-conditioning data) We consider the empirical likelihood for the mean of the larger set of air-conditioning data in Table 5.6; n = 24


Figure 10.1 Likelihood and log likelihoods for the mean of the air-conditioning data: empirical (dots), exponential (dashes), and gamma profile (solid). Values of θ whose log likelihood lies above the horizontal dotted line in the right panel are contained in an asymptotic 95% confidence set for the true mean. [Both panels: horizontal axis θ, from 40 to 120.]

and ȳ = 64.125. The mean is θ = ∫ y dF(y), which equals Σ p_j y_j for the multinomial distribution that puts masses p_j on the y_j. For a specified value of θ, finding (10.3) is equivalent to maximizing Σ log p_j with respect to p₁, ..., p_n, subject to the constraints that Σ p_j = 1 and Σ p_j y_j = θ. Use of Lagrange multipliers gives p_j ∝ {1 + η_θ(y_j − θ)}⁻¹, where the Lagrange multiplier η_θ is determined by θ and satisfies the equation

Σ_{j=1}^n (y_j − θ) / {1 + η_θ(y_j − θ)} = 0.   (10.4)

Thus the log empirical likelihood, normalized to have maximum zero, is

ℓ_EL(θ) = −Σ_{j=1}^n log{1 + η_θ(y_j − θ)}.   (10.5)

This is maximized at the sample average θ = ȳ, where η_θ = 0 and p_j = n⁻¹. It is undefined outside (min y_j, max y_j), because no multinomial distribution on the y_j can have mean outside this interval.
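The computation is straightforward: for each admissible θ the multiplier η_θ in (10.4) can be found by one-dimensional root-finding, after which (10.5) follows directly. The following S sketch is illustrative only and is not part of the bootstrap library; it assumes the data are in the vector y.

el.mean <- function(y, theta)
{ # Illustrative only: log empirical likelihood (10.5) for a mean.
  # For each theta in (min y, max y), solve (10.4) for eta over the
  # interval on which all terms 1 + eta*(y - theta) stay positive.
  score <- function(eta, y, th) sum((y - th)/(1 + eta*(y - th)))
  out <- rep(NA, length(theta))
  for (i in seq(along=theta))
  { th <- theta[i]
    if (th > min(y) && th < max(y))
    { lo <- -1/max(y - th) + 1e-6   # small offsets keep the
      hi <- -1/min(y - th) - 1e-6   # arguments admissible
      eta <- uniroot(score, c(lo, hi), y=y, th=th)$root
      out[i] <- -sum(log(1 + eta*(y - th))) } }
  out }

Applied to aircondit$hours over a grid of θ values, exp(el.mean(...)) should trace a curve like the dotted one in Figure 10.1.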
Figure 10.1 shows L_EL(θ), which is calculated by successive solution of (10.4) to yield η_θ at values of θ small steps apart. The exponential likelihood and gamma profile likelihood for the mean are also shown. As we should expect, the gamma profile likelihood is always higher than the exponential likelihood, which corresponds to the gamma likelihood but with shape parameter κ = 1. Both parametric likelihoods are wider than the empirical likelihood. Direct comparison between parametric and empirical likelihoods is misleading, however, since they are based on different models, and here and in later figures

Figure 10.2 Simulated empirical likelihood ratio statistics (left panel) and gamma profile likelihood ratio statistics (right panel) for exponential samples of size 24. The dotted line corresponds to the theoretical χ²₁ approximation. [Both panels: Q-Q plots of ordered statistics against chi-squared quantiles.]

we give the gamma likelihood purely as a visual reference. The circumstances in which empirical and parametric likelihoods are close are discussed in Problem 10.3.
The endpoints of an approximate 95% confidence interval for θ are obtained by reading off where ℓ_EL(θ) = −½c_{1,0.95}, where c_{d,α} is the α quantile of the chi-squared distribution with d degrees of freedom. The interval is (43.3, 92.3), which compares well with the nonparametric BC_a interval of (42.4, 93.2). The likelihood ratio intervals for the exponential and gamma models are (44.1, 98.4) and (44.0, 98.6).
Figure 10.2 shows the empirical likelihood and gamma profile likelihood ratio statistics for 500 exponential samples of size 24. Though good for the parametric statistic, the chi-squared approximation is poor for W_EL, whose estimated 95% quantile is 5.92 compared to the χ²₁ quantile of 3.84. This suggests strongly that the empirical likelihood-based confidence interval given above is too narrow. However, the simulations are only relevant when the data are exponential, in which case we would not be concerned with empirical likelihood.
We can use the bootstrap to estimate quantiles for W_EL(θ₀), by setting θ₀ = ȳ and then calculating W*_EL(θ₀) for bootstrap samples from the original data. The resulting Q-Q plot is less extreme than the left panel of Figure 10.2, with a 95% quantile estimate of 4.08 based on 999 bootstrap samples; the corresponding empirical likelihood ratio interval is (42.8, 93.3). With a sample of size 12, 41 of the 999 simulations gave infinite values of W*_EL(θ₀) because ȳ did not lie within the limits (min y*_j, max y*_j) of the bootstrap sample. With a sample of size 24, this problem did not arise.

Vector parameter

In principle, empirical likelihood is straightforward to construct when θ has dimension d ≤ n − 1. Suppose that θ = (θ₁, ..., θ_d)ᵀ is determined implicitly as the root of the simultaneous equations

∫ u_i(θ; y) dF(y) = 0,   i = 1, ..., d,

where u(θ; y) is a d × 1 vector whose ith element is u_i(θ; y). Then the estimate θ̂ is the solution to the d estimating equations

Σ_{j=1}^n u(θ; y_j) = 0.   (10.6)
An extension of the argument in Example 10.1, involving the vector of Lagrange multipliers η_θ = (η_{θ1}, ..., η_{θd})ᵀ, shows that the log empirical likelihood is

ℓ_EL(θ) = −Σ_{j=1}^n log{1 + η_θᵀ u_j(θ)},   (10.7)

where u_j(θ) = u(θ; y_j). The value of η_θ is determined by θ through the simultaneous equations

Σ_{j=1}^n u_j(θ) / {1 + η_θᵀ u_j(θ)} = 0.   (10.8)

The simplest approximate confidence region for the true θ is the set of values such that W_EL(θ) ≤ c_{d,1−α}, but in small samples it will again be preferable to replace the χ²_d quantile by its bootstrap estimate.
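In practice (10.8) need not be solved directly: the function −Σ log{1 + ηᵀu_j(θ)} is convex in η where the arguments of the logarithms are positive, so η_θ can be found by numerical minimization, whose stationarity condition is exactly (10.8). The sketch below is illustrative only; nlmin is the Splus optimizer also used in Practical 10.2, and the assign to frame 1 is the same device used there.

el.loglik <- function(u)
{ # Illustrative only: log empirical likelihood (10.7) at one theta,
  # where u is the n x d matrix with rows u_j(theta).
  assign("u", u, frame=1)
  neg.ell <- function(eta) -sum(log(1 + u %*% eta))
  eta <- nlmin(neg.ell, rep(0.001, ncol(u)))$x
  -sum(log(1 + u %*% eta)) }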

10.2.2 Empirical exponential family likelihoods

Another data-based multinomial likelihood can be based on an empirical exponential family construction. Suppose that θ̂₁, ..., θ̂_d are defined as the solutions to the equations (10.6). Then rather than putting probability n⁻¹{1 + η_θᵀu_j(θ)}⁻¹ on y_j, corresponding to (10.7), we can take probabilities proportional to exp{ξ_θᵀu_j(θ)}; this is the exponential tilting construction described in Example 4.16 and in Sections 5.3 and 9.4. Here ξ_θ = (ξ_{θ1}, ..., ξ_{θd})ᵀ is determined by θ through

Σ_{j=1}^n u_j(θ) exp{ξ_θᵀ u_j(θ)} = 0.   (10.9)

This is analogous to (10.8), but it may be solved using a program that fits regression models for Poisson responses (Problem 10.4), which is often more convenient to deal with than the optimization problems posed by empirical likelihood. The log likelihood obtained by integrating (10.9) is ℓ_EEF(θ) = Σ_{j=1}^n exp{ξ_θᵀ u_j(θ)}. This can be close to ℓ_EL(θ), which suggests that both the corresponding log likelihood ratio statistics share the same rather slow approach to their large-sample distributions.
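For instance, the Poisson device can be sketched with glm, as in Practical 10.2: artificial responses z_j = 0 are regressed on the columns of the matrix u with a log link and no intercept, and the fitted coefficient vector then solves (10.9), since the Poisson likelihood equation with zero responses is Σ u_j exp(ξᵀu_j) = 0.

xi.solve <- function(u)
{ # Sketch only: solve (10.9) for xi at a fixed theta; u is the
  # n x d matrix with rows u_j(theta).
  z <- rep(0, nrow(u))
  coef(glm(z ~ u - 1, family=poisson)) }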
In addition to likelihood ratio statistics from empirical exponential families and empirical likelihood, many other related statistics can be defined. For example, we can regard ξ_θ as the parameter in a Poisson regression model and construct a quadratic form

Q_EEF(θ) = {Σ_j u_j(θ)}ᵀ {Σ_j u_j(θ)u_j(θ)ᵀ}⁻¹ {Σ_j u_j(θ)},   (10.10)

based on the score statistic that tests the hypothesis ξ_θ = 0. There is a close parallel between Q_EEF(θ) and the quadratic forms used to set confidence regions in Section 5.8, but the nonlinear relationship between θ and Q_EEF(θ) means that the contours of (10.10) need not be elliptical. As discussed there, for example, theory suggests that when the true value of θ is θ₀, Q_EEF(θ₀) has a large-sample χ²_d distribution. Thus an approximate 1 − α confidence region for θ is the set of values of θ for which Q_EEF(θ) does not exceed c_{d,1−α}. And as there, it is generally better to use bootstrap estimates of the quantiles of Q_EEF(θ).
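As a sketch, the quadratic form (10.10) involves only matrix arithmetic, with u the n × d matrix whose rows are the u_j(θ):

Q.eef <- function(u)
{ # Sketch only: score quadratic form (10.10) at one value of theta.
  ubar <- apply(u, 2, sum)                # sum of the u_j(theta)
  sum(ubar * solve(t(u) %*% u, ubar)) }   # the quadratic form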

Example 10.2 (Laterite data) We consider again setting a confidence region based on the data in Example 5.15. Recall that the quantity of interest is the mean polar axis,

a(θ, φ) = (cos θ cos φ, cos θ sin φ, sin θ)ᵀ,

which is the axis given by the eigenvector corresponding to the largest eigenvalue of E(YYᵀ). The data consist of positions on the lower half-sphere, or equivalently the sample values of a(θ, φ), which we denote by y_j, j = 1, ..., n.
In order to set an empirical likelihood confidence region for the mean polar axis, or equivalently for the spherical polar coordinates (θ, φ), we let

b(θ, φ) = (sin θ cos φ, sin θ sin φ, −cos θ)ᵀ,   c(θ, φ) = (−sin φ, cos φ, 0)ᵀ

denote the unit vectors orthogonal to a(θ, φ). Then since the eigenvectors of E(YYᵀ) may be taken to be orthogonal, the population values of (θ, φ) satisfy simultaneously the equations

b(θ, φ)ᵀ E(YYᵀ) a(θ, φ) = 0,   c(θ, φ)ᵀ E(YYᵀ) a(θ, φ) = 0,

with sample equivalents in which E(YYᵀ) is replaced by n⁻¹ Σ_j y_j y_jᵀ.

Figure 10.3 Contours of W_EL (left) and Q_EEF (right) for the mean polar axis, in the square region shown in Figure 5.10. The dashed lines show the 95% confidence regions using bootstrap quantiles. The dotted ellipse is the 95% confidence region based on a studentized statistic (Fisher, Lewis and Embleton, 1987, equation 6.9).

In terms of the previous general discussion, we have d = 2 and

u(θ, φ; y_j) = ( b(θ, φ)ᵀ y_j y_jᵀ a(θ, φ),  c(θ, φ)ᵀ y_j y_jᵀ a(θ, φ) )ᵀ.

The left panel of Figure 10.3 shows the empirical likelihood contours based on (10.7) and (10.8), in the square region shown in Figure 5.10. The corresponding contours for Q_EEF(θ) are shown on the right. The dashed lines show the boundaries of the 95% confidence regions for (θ, φ) using bootstrap calibration; these differ little from those based on the asymptotic χ²₂ distribution. In each panel the dotted ellipse is a 95% confidence region based on a studentized form of the sample mean polar axis, for which the contours are ellipses. The elliptical contours are appreciably tighter than those for the likelihood-based statistics.
Table 10.1 compares theoretical and bootstrap quantiles for several likelihood-based statistics and the studentized bootstrap statistic, Q, for the full data and for a random subset of size 20. For the full data, the quantiles for Q_EEF and W_EL are close to those for the large-sample χ²₂ distribution. For the subset, Q_EEF is close to its nominal distribution, but the other statistics seem considerably more variable. Except for Q_EEF, it would be misleading to rely on the asymptotic results for the subsample.

Theoretical work suggests that W_EL should have better properties than statistics such as W_EEF or Q_EEF, but since simulations do not always confirm this, bootstrap quantiles should generally be used to set the limits of confidence regions from multinomial-based likelihoods.

Table 10.1 Bootstrap p quantiles of likelihood-based statistics for mean polar axis data.

                        Full data, n = 50              Subset, n = 20
   p     χ²₂      W_EL   W_EEF   Q_EEF    Q       W_EL   W_EEF   Q_EEF     Q
  0.80   3.22     3.23   3.40    3.15    3.37     3.67   3.70    3.15    3.61
  0.90   4.61     4.77   4.81    4.69    5.05     5.39   5.66    4.45    5.36
  0.95   5.99     6.08   6.18    6.43    6.94     7.17   7.99    7.03   10.82

10.3 Bootstrap Likelihood

Basic idea

Suppose for simplicity that our data y₁, ..., y_n form a homogeneous random sample for which statistic T takes value t. If the data were governed by a parametric model under which T had the density f_T(t; θ), then a partial likelihood for θ based on T would be f_T(t; θ) regarded as a function of θ. In the absence of a parametric model, we may estimate the density of T at t, for different values of θ, by means of a nonparametric double bootstrap.
To be specific, suppose that we generate a first-level bootstrap sample y*₁, ..., y*_n from F̂, with corresponding estimator value t*. This bootstrap sample is now considered as a population whose parameter value is t*; the empirical distribution of y*₁, ..., y*_n is the nonparametric analogue of a parametric model with θ = t*. We then generate M second-level bootstrap samples by sampling from our first-level sample, and calculate the corresponding values of T, namely t**₁, ..., t**_M. Kernel density estimation based on these second-level values provides an approximate density for T, and by analogy with parametric partial likelihood we take this density at t** = t to be the value of a nonparametric partial likelihood at θ = t*. If the density estimate uses kernel w(·) with bandwidth h, then this leads to the bootstrap likelihood value at θ = t* given by

L(t*) = f̂(t | t*) = (Mh)⁻¹ Σ_{m=1}^M w{(t − t**_m)/h}.   (10.11)

On repeating this procedure for R different first-level bootstrap samples, we obtain R approximate likelihood values L(t*_r), r = 1, ..., R, from which a smooth likelihood curve L_B(θ) can be produced by nonparametric smoothing.
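Before turning to the computational improvements below, here is the double bootstrap of (10.11) in its rawest form for the sample average, with a normal kernel; the sketch is illustrative only, and the bandwidth default is an arbitrary choice.

boot.lik <- function(y, R=200, M=500, h=0.5*sqrt(var(y)/length(y)))
{ # Illustrative only: raw bootstrap likelihood values (10.11)
  # for the sample average.
  n <- length(y); t0 <- mean(y)
  tstar <- L <- rep(0, R)
  for (r in 1:R)
  { ystar <- sample(y, n, replace=T)        # first-level sample
    tstar[r] <- mean(ystar)
    tss <- rep(0, M)
    for (m in 1:M)                          # second-level samples
      tss[m] <- mean(sample(ystar, n, replace=T))
    L[r] <- sum(dnorm((t0 - tss)/h))/(M*h) }  # equation (10.11)
  list(t=tstar, L=L) }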
Computational improvements

There are various ways to reduce the large amount of computation needed to obtain a smooth curve. One, which was used earlier in Section 3.9.2, is to generate second-level samples from smoothed versions of the first-level samples. As before, probability distributions on the values y₁, ..., y_n are denoted by vectors p = (p₁, ..., p_n), and parameter values are expressed as t(p); recall that p̂ = (n⁻¹, ..., n⁻¹) and t = t(p̂). The rth first-level bootstrap sample gives statistic value t*_r, and the data value y_j occurs in it with frequency f*_rj = n p*_rj, say. In the bootstrap likelihood calculation this bootstrap sample is considered as a population with probability distribution p*_r = (p*_{r1}, ..., p*_{rn}) on the data values, and t*_r = t(p*_r) is considered as the θ-value for this population.
In order to obtain populations which vary smoothly with θ, we apply kernel smoothing to the p*_r, as in Section 3.9.2. Thus for target parameter value θ we define the vector p*(θ) of probabilities

p*_j(θ) = Σ_{r=1}^R w{(θ − t*_r)/ε} p*_{rj} / Σ_{r=1}^R w{(θ − t*_r)/ε},   j = 1, ..., n,   (10.12)

where typically w(·) is the standard normal density and ε is a bandwidth proportional to v_L^{1/2}; as usual v_L is the nonparametric delta method variance estimate for t. The distribution p*(θ) will have parameter value not θ but θ′ = t{p*(θ)}. With the understanding that θ′ is defined in this way, we shall for simplicity write p*(θ′) rather than p*(θ). For a fixed collection of R first-level samples and bandwidth ε > 0, the probability vectors p*(θ) change gradually as θ varies over its range of interest.
Second-level bootstrap sampling now uses the vectors p*(θ) as sampling distributions on the data values, in place of the p*_r. The second-level sample values t** are then used in (10.11) to obtain L_B(θ). Repeating this calculation for, say, 100 values of θ in the range t ± 4v_L^{1/2}, followed by smooth interpolation, should give a good result.
Experience suggests that the value ε = v_L^{1/2} is safe to use in (10.12) if the t*_r are roughly equally spaced, which can be arranged by weighted first-level sampling, as outlined in Problem 10.6.
A way to reduce further the amount of calculation is to use recycling, as described in Section 9.4.4. Rather than generate second-level samples from each p*(θ) of interest, one set of M samples can be generated using a distribution p̃ on the data values, and the associated values t**₁, ..., t**_M calculated. Then, following the general re-weighting method (9.24), the likelihood values are calculated as

L_B(θ) = (Mh)⁻¹ Σ_{m=1}^M w{(t − t**_m)/h} ∏_{j=1}^n {p*_j(θ)/p̃_j}^{f**_{mj}},   (10.13)

where f**_{mj} is the frequency of the jth case in the mth second-level bootstrap sample. One simple choice for p̃ is the EDF p̂. In special cases it will be possible to replace the second level of sampling by use of the saddlepoint approximation method of Section 9.5. This would give an accurate and smooth approximation to the density of T for sampling from each p*(θ).
Example 10.3 (Air-conditioning data) We apply the ideas outlined above to

Figure 10.4 Bootstrap likelihood for mean of air-conditioning data. Left panel: bootstrap likelihood values obtained by saddlepoint approximation for 200 random samples, with smooth curve fitted to values obtained by smoothing frequencies from 1000 bootstrap samples. Right panel: gamma profile log likelihood (solid) and bootstrap log likelihood (dots). [Horizontal axes: θ.]

the data from Example 10.1. The solid points in the left panel of Figure 10.4 are bootstrap likelihood values for the mean θ for 200 resamples, obtained by saddlepoint approximation. This replaces the kernel density estimate (10.11) and avoids the second level of resampling, but does not remove the variation in estimated likelihood values for different bootstrap samples with similar values of t*_r. A locally quadratic nonparametric smoother (on the log likelihood scale) could be used to produce a smooth likelihood curve from the values of L(t*_r), but another approach is better, as we now describe.
The solid line in the left panel of Figure 10.4 interpolates values obtained by applying the saddlepoint approximation using probabilities (10.12) at a few values of θ. Here the values of t*_r are generated at random, and we have taken ε = 0.5v_L^{1/2}; the results depend little on the value of ε.
The log bootstrap likelihood is very close to the log empirical likelihood, with 95% confidence interval (43.8, 92.1).

Bootstrap likelihood is based purely on resampling and smoothing, which is a potential advantage over empirical likelihood. However, in its simplest form it is more computer-intensive. This precludes bootstrapping to estimate quantiles of bootstrap likelihood ratio statistics, which would involve three levels of nested resampling.

10.4 Likelihood Based on Confidence Sets

In certain circumstances it is possible to view confidence intervals as being approximately posterior probability sets, in the Bayesian sense. This encourages the idea of defining a confidence distribution for θ from the set of confidence limits, and then taking the PDF of this distribution as a likelihood function. That is, if we define the confidence distribution function C by C(θ_α) = α, then the associated likelihood would be the density dC(θ)/dθ. Leaving the philosophical arguments aside, we look briefly at where this idea leads in the context of nonparametric bootstrap methods.

10.4.1 Likelihood from pivots

Suppose that Z(θ) = z(θ, F̂) is a pivot, with CDF K(z) not depending on the true distribution F, and that z(θ) is a monotone function of θ. Then the confidence distribution based on confidence limits derived from z leads to the likelihood

L_z(θ) = |ż(θ)| k{z(θ)},   (10.14)

where k(z) = dK(z)/dz and ż(θ) = dz(θ)/dθ. Since k will be unknown in practice, it must be estimated.
In fact this definition of likelihood has a hidden defect. If the identification of confidence distribution with posterior distribution is accurate, as it is to a good approximation in many cases, then the effect of some prior distribution has been ignored in (10.14). But this effect can be removed by a simple device. Consider an imaginary experiment in which a random sample of size 2n is obtained, with outcome exactly two copies of the data y that we have. Then the likelihood would be the square of the likelihood L_z(θ | y) we are trying to calculate. The ratio of the corresponding posterior densities would be simply L_z(θ | y). This argument suggests that we apply the confidence density (10.14) twice, first with data y to give L^z_n(θ), say, and second with data (y, y) to give L^z_{2n}(θ). The ratio L^z_{2n}(θ)/L^z_n(θ) will then be a likelihood with the unknown prior effect removed. In an explicit notation, this definition can be written

L_z(θ) = L^z_{2n}(θ)/L^z_n(θ) = |ż_{2n}(θ)| k_{2n}{z_{2n}(θ)} / [ |ż_n(θ)| k_n{z_n(θ)} ],   (10.15)

where the subscripts indicate sample size. Note that F̂ and t are the same for both sample sizes, but quantities such as variance estimates will depend upon sample size. Note also that the implied prior is estimated by {L^z_n(θ)}²/L^z_{2n}(θ).
Example 10.4 (Exponential mean) If data y₁, ..., y_n are sampled from an exponential distribution with mean θ, then a suitable choice for z(θ, F̂) is ȳ/θ. The gamma distribution for Ȳ can be used to check that the original definition (10.14) gives L_n(θ) ∝ θ^{−n−1} exp(−nȳ/θ), whereas the true likelihood is θ^{−n} exp(−nȳ/θ). The true result is obtained exactly using (10.15). The implied prior is π(θ) ∝ θ⁻¹, for θ > 0.

In practice the distribution of Z must be estimated, in general by bootstrap sampling, so the densities k_n and k_{2n} in (10.15) must be estimated. To be specific,


consider the p articu lar case o f the studentized quantity z(9) = (td ) / v 1L/2. A part
from a co n stan t m ultiplier, the definition (1 0 .1 5 ) gives

L f (0 ) = k 2n

j k n

,(10 .1 6 )

where v^ = v i an d v2,l = \ vl , and we have used the fact th a t t is the


estim ate for b o th sam ple sizes. The densities k and k 2n are approxim ated using
b o o tstrap sam ple values as follows. First R nonparam etric sam ples o f size n are
a
^ i j'y
draw n from F an d corresponding values o f z* = (t* t ) / v n[ calculated. T hen
R sam ples o f size 2n are draw n from F and values o f Z2 = (*2 ~ 0 /( io .)1/2
calculated. N ext kernel estim ates for k and k 2n, with bandw idths h n and h 2n
respectively, are obtained and substituted in (10.16). F or example,
(10.17)

In practice these values can be com puted via spline sm oothing from a dense
set o f values o f the kernel density estim ates k{z).
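A sketch of (10.16) and (10.17) for the sample average follows, with the built-in function density supplying the kernel estimates (its default bandwidth choice stands in for h_n and h_{2n}); it is illustrative only.

lik.pivot <- function(y, theta, R=999)
{ # Sketch only: pivot-based likelihood (10.16) for the sample
  # average, with kernel estimates of k_n and k_2n as in (10.17).
  n <- length(y); t0 <- mean(y); vL <- var(y)/n
  zn <- z2n <- rep(0, R)
  for (r in 1:R)
  { y1 <- sample(y, n, replace=T)
    zn[r] <- (mean(y1) - t0)/sqrt(var(y1)/n)
    y2 <- sample(y, 2*n, replace=T)
    z2n[r] <- (mean(y2) - t0)/sqrt(var(y2)/(2*n)) }
  kn <- density(zn); k2n <- density(z2n)
  num <- approx(k2n$x, k2n$y, (t0 - theta)/sqrt(vL/2))$y
  den <- approx(kn$x, kn$y, (t0 - theta)/sqrt(vL))$y
  num/den }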
There are difficulties with this method. First, just as with bootstrap likelihood, it is necessary to use a large number of simulations R. A second difficulty is that of ascertaining whether or not the chosen Z is a pivot, or else what prior transformation of T could be used to make Z pivotal; see Section 5.2.2. This is especially true if we extend (10.16) to vector θ, which is theoretically possible. Note that if the studentized bootstrap is applied to a transformation of t rather than t itself, then the factor |ż(θ)| in (10.14) can be ignored when applying (10.16).

10.4.2 Implied likelihood

In principle any bootstrap confidence limit method can be turned into a likelihood method via the confidence distribution, but it makes sense to restrict attention to the more accurate methods such as the studentized bootstrap used above. Section 5.4 discusses the underlying theory and introduces one other method, the ABC method, which is particularly easy to use as a basis for a likelihood because no simulation is required.
First, a confidence density is obtained via the quadratic approximation (5.42), with a, b and c as defined for the nonparametric ABC method in (5.49). Then, using the argument that led to (10.15), it is possible to show that the induced likelihood function is

L_ABC(θ) = exp{−½ũ²(θ)},   (10.18)

Figure 10.5 Gamma profile likelihood (solid), implied likelihood L_ABC (dashes) and pivot-based likelihood (dots) for air-conditioning dataset of size 12 (left panel) and size 24 (right panel). The pivot-based likelihood uses R = 9999 simulations and bandwidths 1.0. [Horizontal axes: θ, 50 to 300 (left) and 40 to 120 (right).]

where

ũ(θ) = 2r(θ) / [1 + 2a r(θ) + {1 + 4a r(θ)}^{1/2}],   r(θ) = 2z(θ) / [1 + {1 − 4c z(θ)}^{1/2}],

with z(θ) = (t − θ)/v_L^{1/2} as before. This is called the implied likelihood. Based on the discussion in Section 5.4, one would expect results similar to those from applying (10.16).
A further modification is to multiply L_ABC(θ) by exp{(c v_L^{1/2} − b)θ/v_L}, with b the bias estimate defined in (5.49). The effect of this modification is to make the likelihood even more compatible with the Bayesian interpretation, somewhat akin to the adjusted profile likelihood (10.2).
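Since no simulation is needed, L_ABC is cheap to evaluate. The sketch below simply transcribes the formulas above, taking the constants a, c and variance v_L as given from (5.49); the argument is called cc to avoid masking the S function c, and the exponential modification factor can be multiplied on afterwards if desired.

lik.abc <- function(theta, t0, vL, a, cc)
{ # Sketch only: implied likelihood (10.18), given ABC constants
  # a, c (here cc) and the variance estimate vL from (5.49).
  z <- (t0 - theta)/sqrt(vL)
  r <- 2*z/(1 + sqrt(1 - 4*cc*z))
  u <- 2*r/(1 + 2*a*r + sqrt(1 + 4*a*r))
  exp(-u^2/2) }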
Example 10.5 (Air-conditioning data) Figure 10.5 shows confidence likelihoods for the two sets of air-conditioning data in Table 5.6, samples of size 12 and 24 respectively. The implied likelihoods L_ABC(θ) are similar to the empirical likelihoods for these data. The pivotal likelihood L_z(θ), calculated from R = 9999 samples with bandwidths equal to 1.0 in (10.17), is clearly quite unstable for the smaller sample size. This also occurred with bootstrap likelihood for these data and seems to be due to the discreteness of the simulations with so small a sample.

10.5 Bayesian Bootstraps

All the inferences we have described thus far have been frequentist: we have summarized uncertainty in terms of confidence regions for the unknown parameter θ of interest, based on repeated sampling from a distribution F. A quite different approach is possible if prior information is available regarding F. Suppose that the only possible values of Y are known to be u₁, ..., u_N, and that these arise with unknown probabilities p₁, ..., p_N, so that

Pr(Y = u_j | p₁, ..., p_N) = p_j,   Σ_{j=1}^N p_j = 1.

If our data consist of the random sample y₁, ..., y_n, and f_j counts how many y_i equal u_j, the probability of the observed data given the values of the p_j is proportional to ∏_{j=1}^N p_j^{f_j}. If the prior information regarding the p_j is summarized in the prior density π(p₁, ..., p_N), the joint posterior density of the p_j given the data is proportional to

π(p₁, ..., p_N) ∏_{j=1}^N p_j^{f_j},

and this induces a posterior density for θ. Its calculation is particularly straightforward when π is the Dirichlet density, in which case the prior and posterior densities are respectively proportional to

∏_{j=1}^N p_j^{a_j},   ∏_{j=1}^N p_j^{a_j + f_j};

the posterior density is Dirichlet also. Bayesian bootstrap samples and the corresponding values of θ are generated from the joint posterior density for the p_j, as follows.
the pj, as follows.
Algorithm 10.1 (Bayesian bootstrap)

For r = 1, ..., R:
1. Let G₁, ..., G_N be independent gamma variables with shape parameters a_j + f_j + 1 and unit scale parameters, and for j = 1, ..., N set P*_j = G_j/(G₁ + ··· + G_N).
2. Let θ*_r = t(F*_r), where F*_r = (P*₁, ..., P*_N).

Estimate the posterior density for θ by kernel smoothing of θ*₁, ..., θ*_R.

In practice with continuous data we have f_j = 1. The simplest version of the simulation puts a_j = −1, corresponding to an improper prior distribution with support on y₁, ..., y_n; the G_j are then exponential. Some properties of this procedure are outlined in Problem 10.10.
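A direct sketch of Algorithm 10.1 for continuous data, with f_j = 1 and a_j = −1 so that the G_j are standard exponentials, and with t(·) taken to be the mean for illustration:

bayes.boot <- function(y, R=999)
{ # Illustrative only: Bayesian bootstrap with f_j = 1, a_j = -1;
  # the Dirichlet weights come from standard exponential variables.
  theta <- rep(0, R)
  for (r in 1:R)
  { G <- rexp(length(y))
    P <- G/sum(G)
    theta[r] <- sum(P*y) }
  theta }

For example, plot(density(bayes.boot(aircondit$hours))) gives a kernel estimate of the posterior density of the mean.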
Example 10.6 (City population data) In the city population data of Example 2.8, for which n = 10, the parameter θ = t(F) and the rth simulated posterior value θ*_r are

θ = Σ_j p_j x_j / Σ_j p_j u_j,   θ*_r = Σ_j P*_{rj} x_j / Σ_j P*_{rj} u_j.

Figure 10.6 Bayesian bootstrap applied to city population data, with n = 10. The left panel shows posterior densities for the ratio θ estimated from 999 Bayesian bootstrap simulations, with a = 1, 2, 5, 10; the densities are more peaked as a increases. The right panel shows the corresponding prior densities for θ. [Horizontal axes: θ.]

The left panel of Figure 10.6 shows kernel density estimates of the posterior density of θ based on R = 999 simulations with all the a_j equal to a = 1, 2, 5, and 10. The increasingly strong prior information results in posterior densities that are more and more sharply peaked.
The right panel shows the implied priors on θ, obtained using the data doubling device from Section 10.4. The priors seem highly informative, even when a = 1.
The primary use of the Bayesian bootstrap is likely to be for imputation when data are missing, rather than in inference for θ per se. There are theoretical advantages to such weighted bootstraps, in which the probabilities P*_j vary smoothly, but as yet they have been little used in applications.

10.6 Bibliographic Notes

Likelihood inference is the core of parametric statistics. Many elementary textbooks contain some discussion of large-sample likelihood asymptotics, while adjusted likelihoods and higher-order approximations are described by Barndorff-Nielsen and Cox (1994).
Empirical likelihood was defined for single samples by Owen (1988) and extended to wider classes of models in a series of papers (Owen, 1990, 1991). Qin and Lawless (1994) make theoretical connections to estimating equations, while Hall and La Scala (1990) discuss some practical issues in using empirical likelihoods. More general models to which empirical likelihood has been applied include density estimation (Hall and Owen, 1993; Chen, 1996), length-biased data (Qin, 1993), truncated data (Li, 1995), and time series (Monti, 1997). Applications to directional data are discussed by Fisher et al. (1996). Owen (1992a) reports simulations that compare the behaviour of the empirical likelihood ratio statistic with bootstrap methods for samples of size up to 20, with overall conclusions in line with those of Section 5.7: the studentized bootstrap performs best, in particular giving more accurate confidence intervals for the mean than the empirical likelihood ratio statistic, for a variety of underlying populations.
Related theoretical developments are due to DiCiccio, Hall and Romano (1991), DiCiccio and Romano (1989), and Chen and Hall (1993). From a theoretical viewpoint it is noteworthy that the empirical likelihood ratio statistic can be Bartlett-adjusted, though Corcoran, Davison and Spady (1996) question the practical relevance of this. Hall (1990) makes theoretical comparisons between empirical likelihood and likelihood based on studentized pivots.
Empirical likelihood has roots in certain problems in survival analysis, notably using the product-limit estimator to set confidence intervals for a survival probability. Related methods are discussed by Murphy (1995). See also Mykland (1995), who introduces the idea of dual likelihood, which treats the Lagrange multiplier in (10.7) as a parameter. Except in large samples, it seems likely that our caveats about asymptotic results apply here also.
Empirical exponential families have been discussed in Section 10.10 of Efron (1982) and DiCiccio and Romano (1990), among others; see also Corcoran, Davison and Spady (1996), who make comparisons with empirical likelihood statistics. Jing and Wood (1996) show that empirical exponential family likelihood is not Bartlett adjustable. A univariate version of the statistic Q_EEF in Section 10.2.2 is discussed by Lloyd (1994) in the context of M-estimation.
Bootstrap likelihood was introduced by Davison, Hinkley and Worton (1992), who discuss its relationship to empirical likelihood, while a later paper (Davison, Hinkley and Worton, 1995) describes computational improvements.
Early work on the use of confidence distributions to define nonparametric likelihoods was done by Hall (1987), Boos and Monahan (1986), and Ogbonmwan and Wynn (1986). The use of confidence distributions in Section 10.4 rests in part on the similarity of confidence distributions to Bayesian posterior distributions. For related theory see Welch and Peers (1963), Stein (1985) and Berger and Bernardo (1992). Efron (1993) discusses the likelihood derived from ABC confidence limits, shows a strong connection with profile likelihood and related likelihoods, and gives several applications; see also Chapter 24 of Efron and Tibshirani (1993).
The Bayesian bootstrap was introduced by Rubin (1981), and subsequently used by Rubin and Schenker (1986) and Rubin (1987) for multiple imputation in missing data problems. Banks (1988) has described some variants of the Bayesian bootstrap, while Newton and Raftery (1994) describe a variant which they name the weighted likelihood bootstrap. A comprehensive theoretical discussion of weighted bootstraps is given in Barbe and Bertail (1995).

10.7 Problems

1 Consider empirical likelihood for a parameter θ = t(F) defined by an estimating equation ∫ u(t; y) dF(y) = 0, based on a random sample y₁, ..., y_n.
(a) Use Lagrange multipliers to maximize Σ log p_j subject to the conditions Σ p_j = 1 and Σ p_j u(t; y_j) = 0, and hence show that the log empirical likelihood is given by (10.7) with d = 1. Verify that the empirical likelihood is maximized at the sample EDF, when θ = t(F̂).
(b) Suppose that u(t; y) = y − t and n = 2, with y₁ < y₂. Show that η_θ can be written as (θ − ȳ)/{(θ − y₁)(y₂ − θ)}, and sketch it as a function of θ.
(Section 10.2.1)

2 Suppose that x₁, ..., x_n and y₁, ..., y_m are independent random samples from distributions with means μ and μ + δ. Obtain the empirical likelihood ratio statistic for δ.
(Section 10.2.1)

3 (a) In (10.5), suppose that θ = ȳ + n^{−1/2}σε, where σ² = var(Y_j) and ε has an asymptotic standard normal distribution. Show that η_θ ≐ −n^{−1/2}ε/σ, and deduce that near ȳ, ℓ_EL(θ) ≐ −n(ȳ − θ)²/(2σ²).
(b) Now suppose that a single observation from F has log density ℓ(θ) = log f(y; θ) and corresponding Fisher information i(θ) = E{−ℓ̈(θ)}. Use the fact that the MLE θ̂ satisfies the equation ℓ̇(θ̂) = 0 to show that near θ̂ the parametric log likelihood is roughly ℓ(θ) ≐ −½ n i(θ̂)(θ̂ − θ)².
(c) By considering the double exponential density ½ exp(−|y − θ|), −∞ < y < ∞, and an exponential family density with mean θ, a(y) exp{y b(θ) − c(θ)}, show that it may or may not be true that ℓ_EL(θ) ≐ ℓ(θ).
(Section 10.2.1; DiCiccio, Hall and Romano, 1989)

4 Let θ be a scalar parameter defined by an estimating equation ∫ u(θ; y) dF(y) = 0. Suppose that we wish to make likelihood inference for θ based on a random sample y₁, ..., y_n using the empirical exponential family

π_j(θ) = Pr(Y = y_j) = exp{ξ u(θ; y_j)} / Σ_{k=1}^n exp{ξ u(θ; y_k)},   j = 1, ..., n,

where ξ is determined by

Σ_{j=1}^n π_j(θ) u(θ; y_j) = 0.   (10.19)

(a) Let Z₁, ..., Z_n be independent Poisson variables with means exp(ξu_j), where u_j = u(θ; y_j); we treat θ as fixed. Write down the likelihood equation for ξ̂ and show that when the observed values of the Z_j all equal zero, it is equivalent to (10.19). Hence outline how software that fits generalized linear models may be used to find ξ̂.
(b) Show that the formulation in terms of Poisson variables suggests that the empirical exponential family likelihood ratio statistic is the Poisson deviance W†_EEF(θ₀), while the multinomial form gives W_EEF(θ₀), where

W†_EEF(θ₀) = 2 Σ_{j=1}^n {1 − exp(ξ̂u_j)},   W_EEF(θ₀) = 2 [ n log{n⁻¹ Σ_{j=1}^n exp(ξ̂u_j)} − ξ̂ Σ_{j=1}^n u_j ].

(c) Plot the log likelihood functions corresponding to W†_EEF and W_EEF for the data in Example 10.1; take u_j = y_j − θ. Perform a small simulation study to compare the behaviour of W†_EEF and W_EEF when the underlying data are samples of size 24 from the exponential distribution.
(Section 10.2.2)

5 Suppose that a = (sin θ, cos θ)ᵀ is the mean direction of a distribution on the unit circle, and consider setting a nonparametric confidence set for a based on a random sample of angles θ₁, ..., θ_n; set y_j = (sin θ_j, cos θ_j)ᵀ.
(a) Show that a is determined by the equation Σ_j y_jᵀ b = 0, where b = (cos θ, −sin θ)ᵀ. Hence explain how to construct confidence sets based on statistics from empirical likelihood and from empirical exponential families.
(b) Extend the argument to data taking values on the unit sphere, with mean direction a = (cos θ cos φ, cos θ sin φ, sin θ)ᵀ.
(c) See Practical 10.2.
(Section 10.2.2; Fisher et al., 1996)

6 Suppose that t has empirical influence values l_j, and set

p_j(θ) = exp(ξ l_j) / Σ_{k=1}^n exp(ξ l_k),   (10.20)

where ξ = (θ − t)/(nv) and v = n⁻² Σ l_j².
(a) Show that t(F_ξ) ≐ θ, where F_ξ denotes the CDF corresponding to (10.20). Hence describe how to space out the values t*_r in the first-level resampling for a bootstrap likelihood.
(b) Rather than use the tilted probabilities (10.12) to construct a bootstrap likelihood by simulation, suppose that we use those in (10.20). For a linear statistic, show that the cumulant-generating function of T* in sampling from (10.20) is λt + n{K(ξ + n⁻¹λ) − K(ξ)}, where K(ξ) = log(Σ_j e^{ξl_j}). Deduce that the saddlepoint approximation to f_{T*}(t | θ) is proportional to exp{−nK(ξ)}, where θ = t + K′(ξ). Hence show that for the sample average, the log likelihood at θ = Σ_j y_j e^{ξy_j} / Σ_j e^{ξy_j} is n{ξt − log(Σ_j e^{ξy_j})}.
(c) Extend (b) to the situation where t is defined as the solution to a monotonic estimating equation.
(Section 10.3; Davison, Hinkley and Worton, 1992)
7 Consider the choice of h for the raw bootstrap likelihood values (10.11), when w(·) is the standard normal density. As is often roughly true, suppose that T* ~ N(t, v), and that conditional on T* = t*, T** ~ N(t*, v).
(a) Show that the mean and variance of the product of v^{1/2} with (10.11) are I₁ and M⁻¹(I₂ − I₁²), where

I_k = (2π)^{−k/2} γ^{1−k} (k + γ²)^{−1/2} exp{−kδ²/(2(k + γ²))},

where γ = hv^{−1/2} and δ = v^{−1/2}(t* − t). Hence verify some of the values in the following table:

                             δ = 0                   δ = 1                   δ = 2
                     γ=0.2  γ=0.4  γ=0.6     γ=0.2  γ=0.4  γ=0.6     γ=0.2  γ=0.4  γ=0.6
 Density ×10⁻²        39.9   39.9   39.9      24.2   24.2   24.2       5.4    5.4    5.4
 Bias ×10⁻²           −0.8   −2.9   −5.7       0.0   −0.1   −0.5       0.3    1.2    2.5
 M × variance ×10⁻²   40.4   13.4    5.6      28.3   11.2    5.7       7.5    3.8    2.6

(b) If γ is small, show that the variance of (10.11) is roughly proportional to the square of its mean, and deduce that the variance is approximately constant on the log scale.
(c) Extend the calculations in (a) to (10.13).
(Section 10.3; Davison, Hinkley and Worton, 1992)
8 Let y represent data from a parametric model f(y; θ), and suppose that θ is estimated by t(y). Assuming that simulation error may be ignored, under what circumstances would the bootstrap likelihood generated by parametric simulation from f equal the parametric likelihood? Illustrate your answer with the N(θ, 1) distribution, taking t to be (i) the sample average, (ii) the sample median.
(Section 10.3)

9 Suppose that we wish to construct an implied likelihood for a correlation coefficient θ based on its sample value T by treating Z = ½ log{(1 + T)/(1 − T)} as normal with mean g(θ) = ½ log{(1 + θ)/(1 − θ)} and variance n⁻¹. Show that the implied likelihood and implied prior are proportional to

exp[−½n{g(t) − g(θ)}²],   (1 − θ²)⁻¹,   |θ| < 1.

Is the prior here proper?
(Section 10.4)
10 The Dirichlet density with parameters (ζ₁, ..., ζ_N) is

π(p₁, ..., p_N) = Γ(ζ₁ + ··· + ζ_N) / {Γ(ζ₁) ··· Γ(ζ_N)} × ∏_{j=1}^N p_j^{ζ_j − 1},   p_j ≥ 0,   Σ p_j = 1.

Show that the P_j have joint moments

E(P_j) = ζ_j/s,   cov(P_j, P_k) = ζ_j(δ_{jk}s − ζ_k) / {s²(s + 1)},

where δ_{jk} = 1 if j = k and zero otherwise, and s = ζ₁ + ··· + ζ_N.
(a) Let y₁, ..., y_n be a random sample, and consider bootstrapping its average. Show that under the Bayesian bootstrap with a_j = a,

E*(P_j) = n⁻¹,   cov*(P_j, P_k) = (nδ_{jk} − 1) / {n²(2n + an + 1)}.   (10.21)

Hence show that the posterior mean and variance of θ = Σ y_j P_j are ȳ and (2n + an + 1)⁻¹ m₂, where m₂ = n⁻¹ Σ (y_j − ȳ)².
(b) Now consider the average Ȳ† of bootstrap samples generated as follows. We generate a distribution F† = (P†₁, ..., P†_n) on y₁, ..., y_n under the Bayesian bootstrap, and then make Y†₁, ..., Y†_n by independent multinomial sampling from F†. Show that

E†(Ȳ†) = ȳ,   var†(Ȳ†) = (3 + a)m₂ / (2n + an + 1).

Are the properties of this as n → ∞ and a → ∞ what you would expect? How does this compare with samples generated by the usual nonparametric bootstrap?
(Section 10.5)

10.8 Practicals

1 We compare the empirical likelihoods and 95% confidence intervals for the mean of the data in Table 3.1, (a) pooling the eight series:

attach(gravity)
grav.EL <- EL.profile(g, tmin=70, tmax=85, n.t=51)
plot(grav.EL[,1], exp(grav.EL[,2]), type="l", xlab="mu",
     ylab="empirical likelihood")
lik.CI(grav.EL, lim=-0.5*qchisq(0.95,1))

and (b) treating the series as arising from separate distributions with the same mean and plotting eight individual likelihoods:

gravs.EL <- EL.profile(g[series==1], n.t=21)
plot(gravs.EL[,1], exp(gravs.EL[,2]), type="n", xlab="mu",
     ylab="empirical likelihood", xlim=range(g))
lines(gravs.EL[,1], exp(gravs.EL[,2]), lty=2)
for (s in 2:8)
{ gravs.EL <- EL.profile(g[series==s], n.t=21)
  lines(gravs.EL[,1], exp(gravs.EL[,2]), lty=2) }

Now we combine the individual likelihoods into a single likelihood by multiplying them together; we renormalize so that the product has maximum one.

lims <- matrix(NA,8,2)
for (s in 1:8) { x <- g[series==s]; lims[s,] <- range(x) }
mu.min <- max(lims[,1]); mu.max <- min(lims[,2])
gravs.EL <- EL.profile(g[series==1],
                       tmin=mu.min, tmax=mu.max, n.t=21)
gravs.EL.L <- gravs.EL[,2]
gravs.EL.mu <- gravs.EL[,1]
for (s in 2:8)
  gravs.EL.L <- gravs.EL.L + EL.profile(g[series==s],
                tmin=mu.min, tmax=mu.max, n.t=21)[,2]
gravs.EL.L <- gravs.EL.L - max(gravs.EL.L)
lines(gravs.EL.mu, exp(gravs.EL.L), lwd=2)
lik.CI(cbind(gravs.EL.mu, gravs.EL.L), lim=-0.5*qchisq(0.95,1))

Compare the intervals with those in Example 3.2. Does the result for (b) suggest a limitation of multinomial likelihoods in general?
Compare the empirical likelihoods with the profile likelihood (10.1) and the adjusted profile likelihood (10.2), obtained when the series are treated as independent normal samples with different variances but the same mean.
(Section 10.2.1)

2 Dataframe islay contains 18 measurements (in degrees east of north) of palaeocurrent azimuths from the Jura Quartzite on the Scottish island of Islay. We aim to use multinomial-based likelihoods to set 95% confidence intervals for the mean direction a(θ) = (sin θ, cos θ)ᵀ of the distribution underlying the data; the vector b(θ) = (cos θ, −sin θ)ᵀ is orthogonal to a. Let y_j = (sin θ_j, cos θ_j)ᵀ denote the vectors corresponding to the observed angles θ_j. Then the mean direction θ is the angle subtended at the origin by Σ_j y_j / ||Σ_j y_j||.
For the original estimate, plots of the data, log likelihoods and confidence intervals:

attach(islay)
th <- ifelse(theta>180, theta-360, theta)
a.t <- function(th) c(sin(th*pi/180), cos(th*pi/180))
b.t <- function(th) c(cos(th*pi/180), -sin(th*pi/180))
y <- t(apply(matrix(theta,18,1), 1, a.t))
thetahat <- function(y)
{ m <- apply(y,2,sum)
  m <- m/sqrt(m[1]^2+m[2]^2)
  180*atan(m[1]/m[2])/pi }
thetahat(y)
u.t <- function(y, th) crossprod(b.t(th), t(y))
islay.EL <- EL.profile(y, tmin=-100, tmax=120, n.t=40, u=u.t)
plot(islay.EL[,1], islay.EL[,2], type="l", xlab="theta",
     ylab="log empirical likelihood", ylim=c(-25,0))
points(th, rep(-25,18)); abline(h=-3.84/2, lty=2)
lik.CI(islay.EL, lim=-0.5*qchisq(0.95,1))
islay.EEF <- EEF.profile(y, tmin=-100, tmax=120, n.t=40, u=u.t)
lines(islay.EEF[,1], islay.EEF[,2], lty=3)
lik.CI(islay.EEF, lim=-0.5*qchisq(0.95,1))

Discuss the shapes of the log likelihoods.
To obtain 0.95 quantiles of the bootstrap distributions of W_EL and W_EEF:

islay.fun <- function(y, i, angle)
{ u <- as.vector(u.t(y[i,], angle))
  z <- rep(0, length(u))
  EEF.fit <- glm(z~u-1, poisson)
  W.EEF <- 2*sum(1-fitted(EEF.fit))
  EL.loglik <- function(lambda) -sum(log(1 + lambda * u))
  EL.score <- function(lambda) -sum(u/(1 + lambda * u))
  assign("u", u, frame=1)
  EL.out <- nlmin(EL.loglik, 0.001)
  W.EL <- -2*EL.loglik(EL.out$x)
  c(thetahat(y[i,]), W.EL, W.EEF, EL.out$converged) }
islay.boot <- boot(y, islay.fun, R=999, angle=thetahat(y))
islay.boot$R <- sum(islay.boot$t[,4])
islay.boot$t <- islay.boot$t[islay.boot$t[,4]==1,]
apply(islay.boot$t[,2:3], 2, quantile, 0.95)

How do the bootstrap-calibrated confidence intervals compare with those based on the χ²₁ distribution, and with the basic bootstrap intervals using the θ*?
(Sections 10.2.1, 10.2.2; Hand et al., 1994, p. 198)
3 We compare posterior densities for the mean of the air-conditioning data using (a) the Bayesian bootstrap with a_j = −1:

air1 <- data.frame(hours=aircondit$hours, G=1)
air.bayes.gen <- function(d, a)
{ out <- d
  out$G <- rgamma(nrow(d), shape=a+2)
  out }
air.bayes.fun <- function(d) sum(d$hours*d$G)/sum(d$G)
air.bayesian <- boot(air1, air.bayes.fun, R=999, sim="parametric",
                     ran.gen=air.bayes.gen, mle=-1)
plot(density(air.bayesian$t, n=100, width=25), type="l",
     xlab="theta", ylab="density", ylim=c(0,0.02))

and (b) an exponential model with mean θ for the data, with prior according to which θ⁻¹ has a gamma distribution with index κ and scale λ:

kappa <- 0; lambda <- 0
kappa.post <- kappa + length(air1$hours)
lambda.post <- lambda + sum(air1$hours)
theta <- 30:300
lines(theta,
      lambda.post/theta^2*dgamma(lambda.post/theta, kappa.post),
      lty=2)

Repeat this with different values of a in the Bayesian bootstrap and κ, λ in the parametric case, and discuss your results.
(Section 10.5)

11 Computer Implementation

11.1 Introduction

The key requirements for computer implementation of resampling methods are a flexible programming language with a suite of reliable quasi-random number generators, a wide range of built-in statistical procedures to bootstrap, and a reasonably fast processor. In this chapter we outline how to use one implementation, using the current (May 1997) commercial version Splus 3.3 of the statistical language S, although the methods could be realized in a number of other statistical computing environments.
The remainder of this section outlines the installation of the library, and gives a quick summary of features of Splus essential to our purpose. Each subsequent section describes aspects of the library needed for the material in the corresponding chapter: Section 11.2 corresponds to Chapter 2, Section 11.3 to Chapter 3, and so forth. These sections take the form of a tutorial on the use of the library functions. The outline given here is not intended to replace the help files distributed with the library, which can be viewed by typing help(boot, library="boot") within Splus. At various points below, you will need to consult these files for more details on functions.
The main functions in the library are summarized in Table 11.1.
The best way to learn to use software is to use it, and from Section 11.1.2 onwards, we assume that you, dear reader, know the basics of S, including how to write simple functions, that you are seated comfortably at your favourite computer with Splus launched and a graphics window open, and that you are working through this chapter. We do not show the Splus prompt >, nor the continuation prompt +.

Table 11.1 Functions in the Splus bootstrap library.

Function          Purpose
abc.ci            Nonparametric ABC confidence intervals
boot              Parametric and nonparametric bootstrap simulation
boot.array        Array of indices or frequencies from bootstrap simulation
boot.ci           Bootstrap confidence intervals
censboot          Bootstrap for censored and survival data
control           Control methods for estimation of quantiles, bias, variance, etc.
cv.glm            Cross-validation prediction error estimate for generalized linear model
empinf            Calculate empirical influence values
envelope          Calculate simulation envelope
exp.tilt          Exponential tilting to calculate probability distributions
glm.diag          Generalized linear model diagnostics
glm.diag.plots    Plot generalized linear model diagnostics
imp.moments       Importance resampling moment estimates
imp.prob          Importance resampling tail probability estimates
imp.quantile      Importance resampling quantile estimates
imp.weights       Calculate importance resampling weights
jack.after.boot   Jackknife-after-bootstrap plot
linear.approx     Calculate linear approximation to a statistic
saddle            Saddlepoint approximation
saddle.distn      Saddlepoint approximation for a distribution
simplex           Simplex method of linear programming
smooth.f          Frequency smoothing
tilt.boot         Automatic importance re-weighting bootstrap simulation
tsboot            Bootstrap for time series data

11.1.1 Installation

UNIX

The bootstrap library can be obtained from the home page for this book,

http://dmawww.epfl.ch/davison.mosaic/BMA/

in the form of a compressed shar file bootlib.sh.Z. This file should be uncompressed and moved to an appropriate directory. The file can then be unpacked by

sh bootlib.sh
rm bootlib.sh

You should then follow the instructions in the README file to complete the installation of the library.
It is best to set up an Splus library boot containing the library files; you may need to ask your system manager to do this. Once this is done, and once inside Splus in your usual working directory, the functions and data are accessed by typing

library(boot, first=T)

This will avoid cluttering your working directory with library files, and reduce the chance that you accidentally overwrite them.

Windows

The disk at the back of this book contains the library functions and documentation for use with Splus for Windows. For instructions on the installation, see the file README.TXT on the disk. The contents of the disk can also be retrieved in the form of a zip file from the home page for the book given above.

11.1.2 Some key Splus ideas

Quasi-random numbers

To put 20 quasi-random N(0,1) data into y and to see its contents, type

y <- rnorm(20)
y

Here <- is the assignment symbol. To see the contents of any S object, simply type its name, as above. This is often done below, and we do not show the output.
In general quasi-random numbers from a distribution are generated by the functions rexp, rgamma, rchisq, rt, ..., with arguments to give parameters where needed. For example,

y <- rgamma(n=10, shape=2)

generates 10 gamma observations with shape parameter 2, and

y <- rgamma(n=10, shape=c(1:10))

generates a vector of ten gamma variables with shape parameters 1, 2, ..., 10.
The function sample is used to sample from a set with or without replacement. For example, to get a random permutation of the numbers 1, ..., 10, a random sample with replacement from them, a random permutation of 11, 22, 33, 44, 55, a sample of size 10 from them, and a sample of size 10 taken with unequal probabilities:

sample(10)
sample(10, replace=T)
set <- c(11,22,33,44,55)
sample(set)
sample(set, size=10, replace=T)
sample(set, size=10, replace=T, prob=c(0.1,0.1,0.1,0.1,0.6))

Subscripts

The city population data with n = 10 are

city
city$u
city$x

where the second two commands show the individual variables of city. This Splus object is a dataframe, an array of data in which rows correspond to cases, and the named columns to variables. Elements of an object are accessed by subscripts, so

city$x[1]
city$x[c(1:4)]
city$x[c(1,5,10)]
city[c(1,5,10),2]
city$x[-1]
city[c(1:3),]

give various subsets of the elements of city. To make a nonparametric bootstrap sample of the rows of city, you could type:

i <- sample(10, replace=T)
city[i,]

The row labels result from the algorithm used to give unique labels to rows, and can be ignored for our purposes.

11.2 Basic Bootstraps

11.2.1 Nonparametric bootstrap

The main bootstrap function, boot, works on a vector, a matrix, or a dataframe. A simple use of boot to bootstrap the ratio t = x̄/ū for the city population data of Example 2.8 is

city.fun <- function(data, i)
{ d <- data[i,]
  mean(d$x)/mean(d$u) }
city.boot <- boot(data=city, statistic=city.fun, R=50)

The function city.fun takes as input the dataframe data and the vector of indices i. Its first command sets up the bootstrapped dataframe, and its second makes and returns the bootstrapped ratio. The last command instructs the function boot to bootstrap the data in city R = 50 times, apply the statistic city.fun to each bootstrap dataset and put the results in city.boot.

Bootstrap objects

The result of a call to boot is a bootstrap object. This is implemented as a list of quantities which is given the class "boot" and for which various methods are defined. For example, typing

city.boot

prints the original statistic, its estimated bias and its standard error, while

plot(city.boot)

gives suitable summary plots.
To see the names of the elements of the bootstrap object city.boot, type

names(city.boot)

You see various names, of which city.boot$t0, city.boot$t, city.boot$R, city.boot$seed contain the original value of the statistic, the bootstrap values, the value of R, and the value of the Splus random number generation seed when boot was invoked. To see their contents, type their names.
Timing
To repeat the simulation, checking how long it takes, type

unix.time(city.boot <- boot(city, city.fun, R=50))

on a UNIX system or

dos.time(city.boot <- boot(city, city.fun, R=50))

on a DOS system. The first number returned is the time the simulation took, and is useful for estimating how long a larger simulation would take.
Although code is generally clearer when dataframes are used, the computation can be speeded up by avoiding them, as here:

mat <- as.matrix(city)
mat.fun <- function(data, i)
{ d <- data[i,]
  mean(d[,2])/mean(d[,1]) }
unix.time(mat.boot <- boot(mat, mat.fun, R=50))

Compare this with the time taken using the dataframe city.
Frequency array
To obtain the R x n array of bootstrap frequencies for city.boot and to display its first 20 lines, type

f <- boot.array(city.boot)
f[1:20,]


The rows of f are the vectors of frequencies for individual bootstrap samples. The array is useful for many post hoc calculations, and is invoked by post-processing functions such as jack.after.boot and imp.weights, which are discussed below. It is calculated from city.boot$seed. The array of indices for the bootstrap samples can be obtained by boot.array(city.boot, indices=T).
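For instance, since each bootstrap sample contains n cases, every row of f should sum to n = 10; a quick check (not in the original text) is:

apply(f, 1, sum)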
Types of statistic
For a nonparametric bootstrap, the function statistic can be of one of three types. We have already seen examples of the first, index type, where the arguments are the dataframe data and the vector of indices i; this is specified by stype="i" (the default).
For the second, weighted type, the arguments are data and a vector of weights w. For example,

city.w <- function(data, w=rep(1,nrow(data))/nrow(data))
{ w <- w/sum(w)
  sum(w*data$x)/sum(w*data$u) }
city.boot <- boot(city, city.w, R=20, stype="w")

writes

$$ t^* = \frac{\sum_j w_j^* x_j}{\sum_j w_j^* u_j}, $$

where $w_j^*$ is the weight put on the jth case of the dataframe in the bootstrap sample; the first line of city.w ensures that $\sum_j w_j^* = 1$. Setting w in the initial line of the function gives the default value for w, which is a vector of $n^{-1}$s; this enables the original value of t to be obtained by city.w(city). A more complicated example is given by the library correlation function corr.
Not all statistics can be written in this form, but when they can, numerical differentiation can be used to obtain empirical influence values and ABC confidence intervals.
For the third, frequency type, the arguments are data and a vector of frequencies f. For example,

city.f <- function(data, f) mean(f*data$x)/mean(f*data$u)
city.boot <- boot(city, city.f, R=20, stype="f")

uses

$$ t^* = \frac{n^{-1}\sum_j f_j^* x_j}{n^{-1}\sum_j f_j^* u_j}, $$

where $f_j^*$ is the frequency with which the jth row of the dataframe occurs in the bootstrap sample. Not all statistics can be written in this form. It differs from the preceding type in that whereas weights can in principle take any positive


values, frequencies must be integers. Of course in this example it would be easiest to use the function city.fun given earlier.
Function statistic

The contents of statistic can be more-or-less arbitrarily complicated, provided that its output is a scalar or fixed-length vector. For example,

air.fun <- function(data, i)
{ d <- data[i,]
  c(mean(d), var(d)/nrow(data)) }
air.boot <- boot(data=aircondit, statistic=air.fun, R=200)

performs a nonparametric bootstrap for the average of the air-conditioning data, and returns the bootstrapped averages and their estimated variances. We give more complex examples below. Beware of memory and storage problems if you make the output too long.
By default the first element of statistic (and so the first column of boot.out$t) is treated as the main statistic for certain calculations, such as calculation of empirical influence values, the jackknife-after-bootstrap plot, and confidence interval calculations, which are described below. This is changed by use of the index argument, usually a single number giving the column of statistic to which the calculation is to be applied.
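For example, empirical influence values for the estimated variance, the second component of air.fun above, could be obtained by a call such as (an illustration, not in the original text):

empinf(air.boot, index=2)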
Further arguments can be passed to statistic using the ... argument to boot. For example,

city.subset <- function(data, i, n=10)
{ d <- data[i[1:n],]
  mean(d[,2])/mean(d[,1]) }
city.boot <- boot(data=city, statistic=city.subset, R=200, n=5)

gives resampled ratios for bootstrap samples of size 5. Note that the frequency array for city.boot would not be useful in this case. The indices can be obtained by

boot.array(city.boot, indices=T)[,1:5]

11.2.2 Parametric bootstrap

For a parametric bootstrap, the first argument to statistic remains a vector, matrix, or dataframe, but statistic need take no second argument. Instead three further arguments to boot must be supplied. The first, ran.gen, tells boot how to simulate bootstrap data, and is a function that takes two arguments, the original data and an object containing any other parameters, mle. The output of ran.gen should have the same form and attributes as the original dataset. The second new argument to boot is a value for mle itself. The third


new argument to boot, sim="parametric", tells boot to perform a parametric simulation: by default the simulation is nonparametric and sim="ordinary". Other possible values for sim are described below.
For example, for parametric simulation from the exponential model fitted to the air-conditioning data in Table 1.2, we set

aircondit.fun <- function(data) mean(data$hours)
aircondit.sim <- function(data, mle)
{ d <- data
  d$hours <- rexp(n=nrow(data), rate=mle)
  d }
aircondit.mle <- 1/mean(aircondit$hours)
aircondit.para <- boot(data=aircondit, statistic=aircondit.fun,
     R=20, sim="parametric", ran.gen=aircondit.sim,
     mle=aircondit.mle)

Air-conditioning data for a different aircraft are given in aircondit7. Obtain their sample average, and perform a parametric bootstrap of the average using the fitted exponential model. Give the bias and variance estimates for the average. Do the bootstrapped averages look normal for this sample size?
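One possible solution, re-using aircondit.fun and aircondit.sim from above (a sketch, not code from the book), is:

mean(aircondit7$hours)
aircondit7.mle <- 1/mean(aircondit7$hours)
aircondit7.para <- boot(data=aircondit7, statistic=aircondit.fun,
     R=999, sim="parametric", ran.gen=aircondit.sim,
     mle=aircondit7.mle)
aircondit7.para
qqnorm(aircondit7.para$t)

Printing aircondit7.para gives the estimated bias and standard error of the average, and the normal quantile plot of the bootstrapped averages addresses the last question.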
A more complicated example is parametric simulation based on a log-bivariate normal distribution fitted to the city population data:

l.city <- log(city)
city.mle <- c(apply(l.city,2,mean), sqrt(apply(l.city,2,var)),
     corr(l.city))
city.sim <- function(data, mle)
{ n <- nrow(data)
  d <- matrix(rnorm(2*n), n, 2)
  d[,2] <- mle[2] + mle[4]*(mle[5]*d[,2]+sqrt(1-mle[5]^2)*d[,1])
  d[,1] <- mle[1] + mle[3]*d[,1]
  data$x <- exp(d[,2])
  data$u <- exp(d[,1])
  data }
city.f <- function(data) mean(data[,2])/mean(data[,1])
city.para <- boot(city, city.f, R=200, sim="parametric",
     ran.gen=city.sim, mle=city.mle)

With this definition of city.f, a nonparametric bootstrap can be performed by

city.boot <- boot(data=city,
     statistic=function(data, i) city.f(data[i,]), R=200)


This is useful when comparing parametric and nonparametric bootstraps for the same problem. Compare them for the city data.
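A simple graphical comparison (an illustration, not code from the book) places normal quantile plots of the two sets of bootstrapped ratios side by side:

split.screen(c(1,2))
screen(1); qqnorm(city.para$t); title("parametric")
screen(2); qqnorm(city.boot$t); title("nonparametric")
close.screen(all=T)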

11.2.3 Empirical influence values


F or a statistic b o o t .f u n in w eighted form, function em pinf returns the em
pirical influence values lj, obtained by num erical differentiation. F o r the ratio
function c i t y . w given above, for exam ple, these an d the exact values (Prob
lem 2.9) are

L.diff <- empinf(data=city, statistic=city.w, stype="w")


cbind(L.diff,(city$x-city.w(city)*city$u)/mean(city$u))
Em pirical influence values can also be obtained from the o u tp u t o f b o o t by
regression o f the values o f f* on the frequency array. F o r example,

city.boot <- boot(city, city.fun, R=999)


L.reg <- empinf(city.boot)
L.reg
uses regression w ith the 999 sam ples in c i t y . b o o t to estim ate the lj.
Jackknife values can be obtained by

J <- empinf(data=city,statistic=city.fun,stype="i",type="jack")
The argument type controls h o w the influence values are to be calculated, but
this also depends on the quantities input to empinf: for details see the help
file.
Variance approximations
v a r . l i n e a r uses em pirical influence values to calculate the nonparam etric
delta m ethod variance ap proxim ation for a statistic:

v a r .linear(L.diff)
v a r .linear(L.reg)
Linear approximation
linear.approx uses output from a nonparametric bootstrap simulation to calculate the linear approximations to the bootstrapped quantities. The empirical influence values can be supplied, but if not, they are estimated by a call to empinf. For the city population ratio,

city.tL.reg <- linear.approx(city.boot)
city.tL.diff <- linear.approx(city.boot, L=L.diff)
split.screen(c(1,2))
screen(1); plot(city.tL.reg, city.boot$t); abline(0,1,lty=2)
screen(2); plot(city.tL.diff, city.boot$t); abline(0,1,lty=2)


calculates the linear approximation for the two sets of empirical influence values and plots the actual t* against them.

11.3 Further Ideas

11.3.1 Stratified sampling
Stratified sampling is performed by including the argument strata in the call to boot. Suppose that we wish to bootstrap the difference in the trimmed averages for the last two groups of the gravity data (Example 3.2):

gravity
grav <- gravity[as.numeric(gravity$series)>=7,]
grav
grav.fun <- function(data, i, trim=0.125)
{ d <- data[i,]
  m <- tapply(d$g, d$series, mean, trim=trim)
  m[7]-m[8] }
grav.boot <- boot(grav, grav.fun, R=200, strata=grav$series)

Check that the expected properties of boot.array(grav.boot) hold.
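For example, because the resampling is stratified, the frequencies in each row of the array should sum to the two stratum sizes. A quick check (not in the original text) is:

f <- boot.array(grav.boot)
apply(f[, as.numeric(grav$series)==7], 1, sum)
apply(f[, as.numeric(grav$series)==8], 1, sum)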
Empirical influence values, linear approximations, and nonparametric delta method variance approximations are calculated by

grav.L <- empinf(grav.boot)
grav.tL <- linear.approx(grav.boot)
var.linear(grav.L, strata=grav$series)

grav.boot$strata contains the strata used in the resampling, which are taken into account automatically if grav.boot is used, but otherwise must be supplied, as in the final line of the code above.

11.3.2 Smoothing
The neatest way to perform smooth bootstrapping is to use sim="parametric". For example, to estimate the variance of the median of the data in y, using smoothing parameter h = 0.5:

y <- rnorm(99)
h <- 0.5
y.gen <- function(data, mle)
{ n <- length(data)
  i <- sample(n, n, replace=T)
  data[i] + mle*rnorm(n) }


y.boot <- boot(y, median, R=200, sim="parametric",
     ran.gen=y.gen, mle=h)
var(y.boot$t)

This guarantees that y.boot$t0 contains the original median. For shrunk smoothing, see Practical 4.5.

11.3.3 Censored data

censboot is used to bootstrap censored data. Suppose that we wish to assess the variability of the median survival time and the probability of survival beyond 20 weeks for the first group of the AML data (Example 3.9).

aml1 <- aml[aml$group==1,]
aml1.fun <- function(data)
{ surv <- survfit(Surv(data$time, data$cens))
  p1 <- min(surv$surv[surv$time<20])
  m1 <- min(surv$time[surv$surv<0.5])
  c(p1, m1) }
aml1.ord <- censboot(data=aml1, statistic=aml1.fun, R=50)
aml1.ord

This involves ordinary bootstrap resampling, and hence could be performed with boot, although aml1.fun would then have to be rewritten to have another argument. For conditional simulation, two additional arguments must be supplied containing the estimated survivor functions for the times to failure and the censoring distribution:

aml1.fail <- survfit(Surv(time,cens), data=aml1)
aml1.cens <- survfit(Surv(time-0.01*cens, 1-cens), data=aml1)
aml1.con <- censboot(data=aml1, statistic=aml1.fun, R=50,
     F.surv=aml1.fail, G.surv=aml1.cens, sim="cond")

11.3.4 Bootstrap diagnostics

Jackknife-after-bootstrap
The function jack.after.boot produces a jackknife-after-bootstrap plot of the first column of boot.out$t based on a nonparametric simulation. For example, for the city data ratio:

city.fun <- function(data, i)
{ d <- data[i,]
  rat <- mean(d$x)/mean(d$u)
  L <- (d$x-rat*d$u)/mean(d$u)
  c(rat, sum(L^2)/nrow(d)^2, L) }


city.boot <- boot(city, city.fun, R=999)
city.L <- city.boot$t0[3:12]
split.screen(c(1,2)); screen(1); split.screen(c(2,1)); screen(4)
attach(city)
plot(u, x, type="n", xlim=c(0,300), ylim=c(0,300))
text(u, x, round(city.L,2))
screen(3)
plot(u, x, type="n", xlim=c(0,300), ylim=c(0,300))
text(u, x, c(1:10)); abline(0, city.boot$t0[1], lty=2)
screen(2)
jack.after.boot(boot.out=city.boot, useJ=F, stinf=F, L=city.L)
close.screen(all=T)

The two left panels show the data with case numbers and empirical influence values as plotting symbols. The jackknife-after-bootstrap plot on the right shows the effect of deleting cases in turn: values of t* are more variable when case 4 is deleted and less variable when cases 9 and 10 are deleted. We see from the empirical influence values that the distribution of t* shifts downwards when cases with positive empirical influence values are deleted, and conversely.
This plot is also produced by setting true the jack argument to plot when applied to a bootstrap object, as in plot(city.boot, jack=T).
Other arguments for jack.after.boot control whether the influence values are standardized (by default they are, stinf=T), and whether the empirical influence values are used (by default jackknife values are used, based on the simulation, so the default values are useJ=T and L=NULL).
Most post-processing functions allow the user to specify either an index for the component of interest, or a vector of length boot.out$R to be treated as the main statistic. Thus a jackknife-after-bootstrap plot using the second component of city.boot$t, the estimated variances for t*, would be obtained by either of

jack.after.boot(city.boot, useJ=F, stinf=F, index=2)
jack.after.boot(city.boot, useJ=F, stinf=F, t=city.boot$t[,2])
Frequency smoothing
smooth.f smooths the frequencies of a nonparametric bootstrap object to give a typical distribution with expected value roughly at θ. In order to find the smoothed frequencies for θ = 1.4 for the city ratio, and to obtain the corresponding value of t, we set

city.freq <- smooth.f(theta=1.4, boot.out=city.boot)
city.w(city, city.freq)


The smoothing bandwidth is controlled by the width argument to smooth.f, and equals width x v^{1/2}, where v is the estimated variance of t; by default width=0.5.

11.4 Tests
11.4.1 Parametric tests
Simple parametric tests can be conducted using parametric simulation. For example, to perform the conditional simulation for the data in fir (Example 4.2):

fir.mle <- c(sum(fir$count), nrow(fir))
fir.gen <- function(data, mle)
{ d <- data
  y <- sample(x=mle[2], size=mle[1], replace=T)
  d$count <- tabulate(y, mle[2])
  d }
fir.fun <- function(data)
(nrow(data)-1)*var(data$count)/mean(data$count)
fir.boot <- boot(fir, fir.fun, R=999, sim="parametric",
     ran.gen=fir.gen, mle=fir.mle)
qqplot(qchisq(c(1:fir.boot$R)/(fir.boot$R+1), df=49), fir.boot$t)
abline(0,1,lty=2); abline(h=fir.boot$t0)

The last two lines here display the results (almost) as in the right panel of Figure 4.1.

11.4.2 Permutation tests

Approximate permutation tests are performed by setting sim="permutation" when invoking boot. For example, suppose that we wish to perform a permutation test for zero correlation between the two columns of the dataframe ducks:

perm.fun <- function(data, i) cor(data[,1], data[i,2])
ducks.perm <- boot(ducks, perm.fun, R=499, sim="permutation")
(sum(ducks.perm$t>ducks.perm$t0)+1)/(ducks.perm$R+1)
qqnorm(ducks.perm$t, ylim=c(-1,1))
abline(h=ducks.perm$t0, lty=2)

If strata is included in the call to boot, permutation is performed independently within each stratum.


11.4.3 Bootstrap tests

For a bootstrap test of the hypothesis of zero correlation in the ducks data, we make a new dataframe and function:

duck <- c(ducks[,1], ducks[,2])
n <- nrow(ducks)
duck.fun <- function(data, i, n)
{ x <- data[i]
  cor(x[1:n], x[(n+1):(2*n)]) }
.Random.seed <- ducks.perm$seed
ducks.boot <- boot(duck, duck.fun, R=499,
     strata=rep(c(1,2), c(n,n)), n=n)
(sum(ducks.boot$t>ducks.boot$t0)+1)/(ducks.boot$R+1)

This uses the same seed as for the permutation test, for a more precise comparison. Is the significance level similar to that for the permutation test? Why cannot boot be applied directly to ducks to perform a bootstrap test?
Exponential tilting
The test of equality of means for two sets of data in Example 4.16 involves exponential tilting. The null distribution puts probabilities given by (4.25) on the two sets of data, and the tilt parameter λ solves the equation

$$ \frac{\sum_{ij} z_{ij}\exp(\lambda z_{ij})}{\sum_{ij}\exp(\lambda z_{ij})} = \theta, $$

where $z_{1j} = y_{1j}$, $z_{2j} = -y_{2j}$, and θ = 0. The fitted null distribution is obtained using exp.tilt, as follows:

z <- grav$g
z[grav$series==8] <- -z[grav$series==8]
z.tilt <- exp.tilt(L=z, theta=0, strata=grav$series)
z.tilt

where z.tilt contains the fitted probabilities (which sum to one for each stratum) and the values of λ and θ. Other arguments can be input to exp.tilt: see its help file.
The significance probability is then obtained by using the weights argument to boot. This argument is a vector containing the probabilities with which to select the rows of data, when bootstrap sampling is to be performed with unequal probabilities. In this case the unequal probabilities are given by the tilted distribution, under which the expected value of the test statistic is zero. The code needed to perform the simulation and get the estimated significance level is:


grav.test <- function(data, i)
{ d <- data[i,]
  diff(tapply(d$g, d$series, mean))[7] }
grav.boot <- boot(data=grav, statistic=grav.test, R=999,
     weights=z.tilt$p, strata=grav$series)
(sum(grav.boot$t>grav.boot$t0)+1)/(grav.boot$R+1)

11.5 Confidence Intervals

The main function for setting bootstrap confidence intervals is boot.ci, which takes as input a bootstrap object. For example, to get a 95% confidence interval for the ratio in the city data, using the city.boot object created in Section 11.3.4:

boot.ci(boot.out=city.boot)

By default the confidence level is 0.95, but other values can be obtained using the conf argument. Here invoking boot.ci shows the normal, basic, studentized bootstrap, percentile, and BCa intervals. Subsets of these intervals are obtained using the type argument. For example, if city.boot$t contained only the ratio and not its estimated variance, it would be impossible to obtain the studentized bootstrap interval, and an appropriate use of boot.ci would be

boot.ci(boot.out=city.boot, type=c("norm","perc","basic","bca"),
     conf=c(0.8,0.9))

By default boot.ci assumes that the first and second columns of boot.out$t contain the statistic itself and its estimated variance; otherwise the index argument can be used, as outlined in the help file.
To calculate intervals for the parameter h(θ), and then back-transform them to the original scale, we use the h, hinv, and hdot arguments. For example, to calculate intervals for the city ratio, using h(·) = log(·), we set

boot.ci(city.boot, h=log, hinv=exp, hdot=function(u) 1/u)

where hinv and hdot are the inverse and first derivative of h(·). Note how transformation improves the basic bootstrap interval.
Nonparametric ABC intervals are calculated using abc.ci. For example,

abc.ci(data=city, statistic=city.w)

calculates the 95% ABC interval for the city ratio; statistic must be in weighted form for this. As usual, strata are incorporated using the strata argument.


11.6 Linear Regression

11.6.1 Basic approaches
Resampling for linear regression models is performed using boot. It is simplest when bootstrapping cases. For example, to compare the biases and variances for parameter estimates from bootstrapping least squares and L1 estimates for the mammals data:

fit.model <- function(data)
{ fit <- glm(log(brain)~log(body), data=data)
  l1 <- l1fit(log(data$body), log(data$brain))
  c(coef(fit), coef(l1)) }
mammals.fun <- function(data, i) fit.model(data[i,])
mammals.boot <- boot(mammals, mammals.fun, R=99)
mammals.boot
For model-based resampling it is simplest to set up an augmented dataframe containing the residuals and fitted values. Although the model is a straightforward linear model, we fit it using glm rather than lm so that we can calculate residuals using the library function glm.diag, which calculates various types of residuals, approximate Cook statistics, and measures of leverage for a glm object. (The diagnostics are exact for a linear model.) A related function is glm.diag.plots, which produces standard diagnostic plots for a generalized linear model fit:

mam.lm <- glm(log(brain)~log(body), data=mammals)
mam.diag <- glm.diag(mam.lm)
glm.diag.plots(mam.lm)
res <- (mam.diag$res-mean(mam.diag$res))*mam.diag$sd
mam <- data.frame(mammals, res=res, fit=fitted(mam.lm))
mam.fun <- function(data, i)
{ d <- data
  d$brain <- exp(d$fit+d$res[i])
  fit.model(d) }
mam.boot <- boot(mam, mam.fun, R=99)
mam.boot

Empirical influence values and the nonparametric delta method standard error for the slope of the linear model could be obtained by putting the slope estimate in weighted form:

mam.w <- function(data, w)
coef(glm(log(data$brain)~log(data$body), weights=w))[2]
mam.L <- empinf(data=mammals, statistic=mam.w)
sqrt(var.linear(mam.L))


For more complicated regressions, for example with unequal response variances, more information must be added to the new dataframe.
Wild bootstrap
The wild bootstrap can be implemented using sim="parametric", as follows:

mam.mle <- c(nrow(mam), (5+sqrt(5))/10)
mam.wild <- function(data, mle)
{ d <- data
  i <- 2*rbinom(mle[1], size=1, prob=1-mle[2])-1
  d$brain <- exp(d$fit+d$res*(1-i*sqrt(5))/2)
  d }
mam.boot.wild <- boot(mam, fit.model, R=20, sim="parametric",
     ran.gen=mam.wild, mle=mam.mle)

11.6.2 Prediction
Now consider prediction of the log brain weight of new mammals with body weights equal to those for the chimpanzee and baboon. For this we introduce yet another argument to boot, m, which gives the number of prediction errors ε* to be simulated with each bootstrap sample (see Algorithm 6.4). In this case we want to predict at m = 2 new mammals, with covariates contained in d.pred. The statistic function supplied to boot must now take at least one more argument, namely the additional indices for constructing the bootstrap versions of the two new mammals. We implement this as follows:

d.pred <- mam[c(46,47),]
pred <- function(data, d.pred)
predict(glm(log(brain)~log(body), data=data), d.pred)
mam.pred <- function(data, i, i.pred, d.pred)
{ d <- data
  d$brain <- exp(d$fit+d$res[i])
  pred(d, d.pred) - (d.pred$fit + d$res[i.pred]) }
mam.boot.pred <- boot(mam, mam.pred, R=199, m=2, d.pred=d.pred)
orig <- matrix(pred(mam, d.pred), mam.boot.pred$R, 2, byrow=T)
exp(apply(orig+mam.boot.pred$t, 2, quantile, c(0.025,0.5,0.975)))

giving the 0.025, 0.5, and 0.975 prediction limits for the brain sizes of the new mammals. The actual brain sizes lie close to or above the upper limits of these intervals: primates tend to have larger brains than other mammals.

11.6.3 Aggregate prediction error and variable selection

Practical 6.5 shows how to obtain the various estimates of aggregate prediction error based on a given model.


For consistent bootstrap variable selection, a subset of size n - m is used to fit each of the possible models. Consider Example 6.13, where a fake set of data is made by

x1 <- runif(50); x2 <- runif(50); x3 <- runif(50)
x4 <- runif(50); x5 <- runif(50); y <- rnorm(50)+2*x1+2*x2
fake <- data.frame(y, x1, x2, x3, x4, x5)

As in that example, we consider the six possible models with no covariates, with just x1, with x1, x2, and so forth, finishing with x1, ..., x5. The function subset.boot fits these to a subset of n-size observations, and calculates the prediction mean squared error for all the data. It is then applied using boot:

subset.boot <- function(data, i, size=0)
{ n <- nrow(data)
  i.t <- i[1:(n-size)]
  data.t <- data[i.t,]
  res0 <- data$y - mean(data.t$y)
  lm.d <- lm(y ~ x1, data=data.t)
  res1 <- data$y - predict.lm(lm.d, data)
  lm.d <- update(lm.d, .~.+x2)
  res2 <- data$y - predict.lm(lm.d, data)
  lm.d <- update(lm.d, .~.+x3)
  res3 <- data$y - predict.lm(lm.d, data)
  lm.d <- update(lm.d, .~.+x4)
  res4 <- data$y - predict.lm(lm.d, data)
  lm.d <- update(lm.d, .~.+x5)
  res5 <- data$y - predict.lm(lm.d, data)
  meansq <- function(y) mean(y^2)
  apply(cbind(res0,res1,res2,res3,res4,res5), 2, meansq)/n }
fake.boot.40 <- boot(fake, subset.boot, R=100, size=40)
delta.hat.40 <- apply(fake.boot.40$t, 2, mean)
plot(c(0:5), delta.hat.40, xlab="Number of covariates",
     ylab="Delta hat (M)", type="l", ylim=c(0,0.1))
For results with a different value of size, but re-using fake.boot.40$seed in order to reduce simulation variability:

.Random.seed <- fake.boot.40$seed
fake.boot.30 <- boot(fake, subset.boot, R=100, size=30)
delta.hat.30 <- apply(fake.boot.30$t, 2, mean)
lines(c(0:5), delta.hat.30, lty=2)

Try this with various values of size.


Modify the code above to do variable selection using cross-validation, and compare it with the bootstrap results.
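One possible starting point, using leave-one-out cross-validation for the same six nested models (a sketch, not the book's own solution), is:

cv.res <- matrix(NA, nrow(fake), 6)
for (j in 1:nrow(fake))
{ data.t <- fake[-j,]
  res <- fake$y[j] - mean(data.t$y)
  lm.d <- lm(y ~ x1, data=data.t)
  res <- c(res, fake$y[j] - predict.lm(lm.d, fake[j,]))
  lm.d <- update(lm.d, .~.+x2)
  res <- c(res, fake$y[j] - predict.lm(lm.d, fake[j,]))
  lm.d <- update(lm.d, .~.+x3)
  res <- c(res, fake$y[j] - predict.lm(lm.d, fake[j,]))
  lm.d <- update(lm.d, .~.+x4)
  res <- c(res, fake$y[j] - predict.lm(lm.d, fake[j,]))
  lm.d <- update(lm.d, .~.+x5)
  res <- c(res, fake$y[j] - predict.lm(lm.d, fake[j,]))
  cv.res[j,] <- res }
delta.cv <- apply(cv.res^2, 2, mean)
lines(c(0:5), delta.cv, lty=3)

The vector delta.cv contains the cross-validation estimates of prediction error for the six models, plotted on the same axes as the bootstrap curves.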

11.7 Further Topics in Regression

11.7.1 Nonlinear and generalized linear models
Nonlinear and generalized linear models are bootstrapped using the ideas in the preceding section. For example, to apply case resampling to the calcium data of Example 7.7:

calcium.fun <- function(data, i)
{ d <- data[i,]
  d.nls <- nls(cal~beta0*(1-exp(-time*beta1)), data=d,
       start=list(beta0=5, beta1=0.2))
  c(coefficients(d.nls), sum(d.nls$residuals^2)/(nrow(d)-2)) }
cal.boot <- boot(calcium, calcium.fun, R=19, strata=calcium$time)
Likewise, to apply model-based simulation to the leukaemia data of Example 7.1, resampling standardized deviance residuals according to (7.14):

leuk.glm <- glm(time~log10(wbc)+ag-1, Gamma(log), data=leuk)
leuk.diag <- glm.diag(leuk.glm)
muhat <- fitted(leuk.glm)
rL <- log(leuk$time/muhat)/sqrt(1-leuk.diag$h)
eps <- 10^(-4)
u <- -log(seq(from=eps, to=1-eps, by=eps))
d <- sign(u-1)*sqrt(2*(u-1-log(u)))/leuk.diag$sd
r.dev <- smooth.spline(d, u)
z <- predict(r.dev, leuk.diag$rd)$y
leuk.mle <- data.frame(muhat, rL, z)
fit.model <- function(data)
{ data.glm <- glm(time~log10(wbc)+ag-1, Gamma(log), data=data)
  c(coefficients(data.glm), deviance(data.glm)) }
leuk.gen <- function(data, mle)
{ i <- sample(nrow(data), replace=T)
  data$time <- mle$muhat*mle$z[i]
  data }
leuk.boot <- boot(leuk, fit.model, R=19, sim="parametric",
     ran.gen=leuk.gen, mle=leuk.mle)

The other procedures for model-based resampling of generalized linear models are applied similarly. Try to modify this code to resample the linear predictor residuals according to (7.13) (they are already calculated above).


11.7.2 Survival data

Further arguments to censboot are needed to bootstrap survival data. For illustration, we consider the melanoma data of Example 7.6, and fit a model in which survival depends on log tumour thickness. The initial fits are given by

mel.cox <- coxph(Surv(time,status==1)~log(thickness)
     +strata(ulcer), data=melanoma)
mel.surv <- survfit(mel.cox)
mel.cens <- survfit(Surv(time-0.01*(status!=1), status!=1)~1,
     data=melanoma)

The bootstrap function mel.fun given below need only take one argument, a dataframe containing the data themselves. Note how the function uses a smoothing spline to interpolate fitted values for the full range of thickness; this avoids difficulties due to the variability of the covariate when resampling cases. The output of mel.fun is the vector of fitted linear predictors predicted by the spline.

mel.fun <- function(d)
{ attach(d)
  cox <- coxph(Surv(time,status==1)~log(thickness)+strata(ulcer))
  eta <- unique(cox$linear.predictors)
  u <- unique(thickness)
  sp <- smooth.spline(u, eta, df=20)
  th <- seq(from=0.25, to=10, by=0.25)
  eta <- predict(sp, th)$y
  detach("d")
  eta }

The next three commands give the syntax for case resampling, for model-based resampling, and for conditional resampling. For either of these last two schemes, the baseline survivor functions for the survival times and censoring times, and the fitted proportional hazards (Cox) model for the survival distribution must be supplied via the F.surv, G.surv, and cox arguments.

attach(melanoma)
mel.boot <- censboot(melanoma, mel.fun, R=99, strata=ulcer)
mel.boot.mod <- censboot(melanoma, mel.fun, R=99,
     F.surv=mel.surv, G.surv=mel.cens, strata=ulcer,
     cox=mel.cox, sim="model")
mel.boot.con <- censboot(melanoma, mel.fun, R=99,
     F.surv=mel.surv, G.surv=mel.cens, strata=ulcer,
     cox=mel.cox, sim="cond")


The bootstrap results are best displayed graphically. Here is the code for the analogue of the left panels of Figure 7.9:

th <- seq(from=0.25, to=10, by=0.25)
split.screen(c(2,1))
screen(1)
plot(th, mel.boot$t0, type="n", xlab="Tumour thickness (mm)",
     xlim=c(0,10), ylim=c(-2,2), ylab="Linear predictor")
lines(th, mel.boot$t0, lwd=3)
rug(jitter(thickness))
for (i in 1:19) lines(th, mel.boot$t[i,], lwd=0.5)
screen(2)
plot(th, mel.boot$t0, type="n", xlab="Tumour thickness (mm)",
     xlim=c(0,10), ylim=c(-2,2), ylab="Linear predictor")
lines(th, mel.boot$t0, lwd=3)
mel.env <- envelope(mel.boot$t, level=0.95)
lines(th, mel.env$point[1,], lty=1)
lines(th, mel.env$point[2,], lty=1)
mel.env <- envelope(mel.boot.mod$t, level=0.95)
lines(th, mel.env$point[1,], lty=2)
lines(th, mel.env$point[2,], lty=2)
mel.env <- envelope(mel.boot.con$t, level=0.95)
lines(th, mel.env$point[1,], lty=3)
lines(th, mel.env$point[2,], lty=3)
detach("melanoma")

Note how tight the confidence envelope is relative to that for the more highly parametrized model used in the example. Try again with larger values of R, if you have the patience.

11.7.3 Nonparametric regression

Nonparametric regression is bootstrapped in the same way as other regressions. Consider for example bootstrapping the smoothing spline fit to the motorcycle data of Example 7.10. The data without repeats are in motor, with components accel, times, strata, and v, the last two of which give the strata for resampling and an estimated variance within each stratum. The three fits are obtained by

attach(motor)
motor.smooth <- smooth.spline(times, accel, w=1/v)
motor.small <- smooth.spline(times, accel, w=1/v,
     spar=motor.smooth$spar/2)
motor.big <- smooth.spline(times, accel, w=1/v,
     spar=motor.smooth$spar*2)


Commands to set up and perform the resampling are as follows:

res <- (motor$accel-motor.small$y)/sqrt(1-motor.small$lev)
motor.mle <- data.frame(bigfit=motor.big$y, res=res)
xpoints <- c(10,20,25,30,35,45)
motor.fun <- function(data, x)
{ y.smooth <- smooth.spline(data$times, data$accel, w=1/data$v)
  predict(y.smooth, x)$y }
motor.gen <- function(data, mle)
{ d <- data
  i <- c(1:nrow(data))
  i1 <- sample(i[data$strata==1], replace=T)
  i2 <- sample(i[data$strata==2], replace=T)
  i3 <- sample(i[data$strata==3], replace=T)
  d$accel <- mle$bigfit + mle$res[c(i1,i2,i3)]
  d }
motor.boot <- boot(motor, motor.fun, R=999, sim="parametric",
     ran.gen=motor.gen, mle=motor.mle, x=xpoints)

Finally, the 90% basic bootstrap confidence limits are obtained by

mu.big <- predict(motor.big, xpoints)$y
mu <- predict(motor.smooth, xpoints)$y
ylims <- apply(motor.boot$t, 2, quantile, c(0.05,0.95))
ytop <- mu - (ylims[1,]-mu.big)
ybot <- mu - (ylims[2,]-mu.big)

What is the effect of using a smaller smoothing parameter when calculating the residuals?
Try altering this code to apply the wild bootstrap, and see what effect it has on the results.
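Following the pattern of mam.wild in Section 11.6.1, one possible generator for the wild bootstrap here (a sketch, not the book's own solution) is:

motor.wild <- function(data, mle)
{ d <- data
  n <- nrow(data)
  i <- 2*rbinom(n, size=1, prob=1-(5+sqrt(5))/10)-1
  d$accel <- mle$bigfit + mle$res*(1-i*sqrt(5))/2
  d }
motor.boot.wild <- boot(motor, motor.fun, R=999, sim="parametric",
     ran.gen=motor.wild, mle=motor.mle, x=xpoints)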

11.8 Time Series

Model-based resampling for time series is analogous to regression. We consider the sunspot data of Example 8.3, to which we fit the autoregressive model that minimizes AIC:

sun <- 2*(sqrt(sunspot+1)-1)
ts.plot(sun)
sun.ar <- ar(sun)
sun.ar$order

The best model is AR(9). How well determined is this, and what is the variance of the series average? We bootstrap to see, using


sun.fun <- function(tsb)
{ ar.fit <- ar(tsb, order.max=25)
  c(ar.fit$order, mean(tsb), tsb) }

which calculates the order of the fitted autoregressive model, the series average, and saves the series itself.
Our function for bootstrapping time series is tsboot. Here are results for fixed-block bootstraps with block length l = 20:

sun.1 <- tsboot(sun, sun.fun, R=99, l=20, sim="fixed")
tsplot(sun.1$t[1,3:291], main="Block simulation, l=20")
table(sun.1$t[,1])
var(sun.1$t[,2])
qqnorm(sun.1$t[,2])

The statistic for tsboot takes only one argument, the time series. The first plot here shows the results for a single replicate using block simulation: note the occasional big jumps in the resampled series. Note also the large variation in the orders of the fitted autoregressive models.
To obtain similar results for the stationary bootstrap with mean block length l = 20:

sun.2 <- tsboot(sun, sun.fun, R=99, l=20, sim="geom")

Are the results similar to having blocks of fixed length?
For model-based resampling we need to store results from the original model, and to make residuals from that fit:

sun.model <- list(order=c(sun.ar$order,0,0), ar=sun.ar$ar)
sun.res <- sun.ar$resid[!is.na(sun.ar$resid)]
sun.res <- sun.res - mean(sun.res)
sun.sim <- function(res, n.sim, ran.args)
{ rg1 <- function(n, res) sample(res, n, replace=T)
  ts.orig <- ran.args$ts
  ts.mod <- ran.args$model
  mean(ts.orig)+rts(arima.sim(model=ts.mod, n=n.sim,
       rand.gen=rg1, res=as.vector(res))) }
sun.3 <- tsboot(sun.res, sun.fun, R=99, sim="model", n.sim=114,
     ran.gen=sun.sim, ran.args=list(ts=sun, model=sun.model))

Check the orders of the fitted models for this scheme. Are they similar to those obtained using the block schemes above?
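For example (not in the original text), the orders can be tabulated and compared directly with those from the block scheme:

table(sun.3$t[,1])
table(sun.1$t[,1])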
For post-blackening we need to define yet another function:

sun.black <- function(res, n.sim, ran.args)
{ ts.orig <- ran.args$ts
  ts.mod <- ran.args$model
  mean(ts.orig)+rts(arima.sim(model=ts.mod, n=n.sim, innov=res)) }
sun.1b <- tsboot(sun.res, sun.fun, R=99, l=20, sim="fixed",
     ran.gen=sun.black, ran.args=list(ts=sun, model=sun.model),
     n.sim=length(sun))

Compare these results with those above, and try it with sim="geom".
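A possible version with geometrically distributed block lengths (a sketch, not from the book) is:

sun.2b <- tsboot(sun.res, sun.fun, R=99, l=20, sim="geom",
     ran.gen=sun.black, ran.args=list(ts=sun, model=sun.model),
     n.sim=length(sun))
table(sun.2b$t[,1])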

11.9 Improved Simulation

11.9.1 Balanced resampling
The balanced bootstrap is invoked via the sim argument to boot:

city.bal <- boot(city, city.fun, R=20, sim="balanced")

If strata is supplied, balancing takes place separately within each stratum.

11.9.2 Control methods

control applies the control methods, including post-simulation balance, to the output from an existing bootstrap simulation. For example,

control(city.boot, bias.adj=T)

produces the adjusted bias estimate, while

city.con <- control(city.boot)

gives a list consisting of the regression estimates of the empirical influence values, linear approximations to the bootstrap statistics, the control estimates of bias, variance, and the third cumulant of the t*, control estimates of selected quantiles of the distribution of t*, and a spline object that summarizes the approximate quantiles used to obtain the control quantile estimates. Saddlepoint approximation is used to obtain these approximate quantiles. Typing

city.con$L
city.con$bias
city.con$var
city.con$quantiles

gives some of the above-mentioned quantities. Arguments to control allow the user to specify the empirical influence values, the spline object, and other quantities to be used by control, if they are already available; see the help file for details.


11.9.3 Importance resampling

We have already met a use of nonparametric simulation with unequal probabilities in Section 11.4, using the weights argument to boot. The simplest form for weights, used there, is a vector containing the probabilities with which to select the rows of data, when bootstrap sampling is to be performed with unequal probabilities. If we wish to perform importance resampling using several distributions, we can set them up and then perform the sampling as follows:

city.top <- exp.tilt(L=city.L, theta=2, t0=city.w(city))
city.bot <- exp.tilt(L=city.L, theta=1.2, t0=city.w(city))
city.tilt <- boot(city, city.fun, R=c(100,99),
     weights=rbind(city.top$p, city.bot$p))

which performs 100 simulations from the probabilities in city.top$p and 99 from the probabilities in city.bot$p. In the first two lines exp.tilt is used to solve the equation

$$ t_0 + \frac{\sum_j l_j\exp(\lambda l_j)}{\sum_j\exp(\lambda l_j)} = \theta, $$

corresponding to exponential tilting of the linear approximation to t, to be centred at θ = 2 and 1.2. In the call to boot, R is a vector, and weights a matrix with length(R) rows and nrow(data) columns, corresponding to the length(R) distributions from which resampling takes place.
The importance sampling weights, moments, and selected quantiles of the resamples in city.tilt$t[,1] are calculated by

imp.weights(city.tilt)
imp.moments(city.tilt)
imp.quantile(city.tilt)

Each of these returns raw, ratio, and regression estimates of the corresponding quantities. Some other uses of importance resampling are exemplified by

imp.prob(city.tilt, t0=1.2, def=F)
z <- (city.tilt$t[,1]-city.tilt$t0[1])/sqrt(city.tilt$t[,2])
imp.quantile(boot.out=city.tilt, t=z)

The call to imp.prob calculates the importance sampling estimate of the probability that t* < 1.2, without using defensive mixture distributions (by default def=T, i.e. defensive mixture distributions are used to obtain the weights and estimates). The last two lines show how importance sampling is used to estimate quantiles of the studentized bootstrap statistic.
For more details and further arguments to the functions, see their help files.


Function tilt.boot
The description above relies on exponential tilting to obtain the resampling probabilities, and requires knowing where to tilt to. If this is difficult, tilt.boot can be used to avoid it, by performing an initial bootstrap with equal resampling probabilities, then using frequency smoothing to estimate appropriate tilted probabilities. For example,

city.tilt <- tilt.boot(city, city.fun, R=c(500,250,249))

performs 500 ordinary bootstraps, uses the results to estimate probability distributions tilted to the 0.025 and 0.975 points of the simulations, and then performs 250 bootstraps tilted to the 0.025 quantile and 249 tilted to the 0.975 quantile, before assigning the result to a bootstrap object. More complex uses of tilt.boot are possible; see its help file.
Importance re-weighting
These functions allow for importance re-weighting as well as importance sampling. For example, suppose that we wish to re-weight the simulated values so that they appear to have been simulated from a distribution with expected ratio close to 1.4. We then use the q argument to the importance sampling functions as follows:

q <- smooth.f(theta=1.4, boot.out=city.tilt)
city.w(city, q)
imp.moments(city.tilt, q=q)
imp.quantile(city.tilt, q=q)

where the first line calculates the smoothed distribution, the second obtains the corresponding ratio, and the third and fourth obtain the moment and quantile estimates corresponding to simulation from the distribution q.

11.9.4 Saddlepoint methods

The function used for single saddlepoint approximation is saddle. Its simplest use is to obtain the PDF and CDF approximations for a linear statistic, such as the linear approximation $t + n^{-1}\sum_j f_j^* l_j$ to a general bootstrap statistic t*. The same results are obtained by using the approximation $n^{-1}\sum_j f_j^* l_j$ to $t^* - t$, and this is what saddle does. To obtain the approximations at t* = 2 for the city data, we set

saddle(A=city.L/nrow(city), u=2-city.w(city))

which returns the PDF and CDF approximations, and the value of the saddlepoint $\hat\xi$.
The function saddle.distn returns the saddlepoint estimate of an entire distribution, using the terms $n^{-1}l_j$ in the random sum and an initial idea of the centre and scale for the distribution of $T^* - t$:


city.t0 <- c(0, sqrt(var.linear(city.L)))
city.sad <- saddle.distn(A=city.L/nrow(city), t0=city.t0)
city.sad

The Lugannani-Rice formula can be applied by setting LR=T in the calls to saddle and saddle.distn; by default LR=F.
For more sophisticated applications, the arguments A and u to saddle.distn can be replaced by functions. For example, the bootstrapped ratio can be defined through the estimating equation

$$ \sum_j f_j^*(x_j - t^* u_j) = 0, \qquad (11.1) $$

where the $f_j^*$ have a joint multinomial distribution with equal probabilities and denominator n = 10, the number of rows of city, as outlined in Example 9.16. Accordingly we set

city.t0 <- c(city.w(city), sqrt(var.linear(city.L)))
Afn <- function(t, data) data$x-t*data$u
ufn <- function(t, data) 0
saddle(A=Afn(2, city), u=0)
city.sad <- saddle.distn(A=Afn, u=ufn, t0=city.t0, data=city)

The penultimate line here gives the exact version of the call to saddle that started this section, while the last line calculates the saddlepoint approximation to the exact distribution of T*. For saddle.distn the quantiles of the distribution of T* are estimated by obtaining the CDF approximation at a number of values of t, and then interpolating the CDF using a spline smoother. The range of values of t used is determined by the contents of t0, whose first value contains the original value of the statistic, and whose second value contains a measure of the spread of the distribution of T*, such as its standard error.
Another use of saddle and saddle.distn is to give them directly the adjusted cumulant-generating function $K(\xi) - t\xi$ and its second derivative $K''(\xi)$. For example, the city data above can be tackled as follows:

K.adj <- function(xi)
{ L <- city$x-city.t*city$u
  nrow(city)*log(sum(exp(xi*L))/nrow(city))-city.t*xi }
K2 <- function(xi)
{ L <- city$x-city.t*city$u
  p <- exp(L*xi)
  nrow(city)*(sum(L^2*p)/sum(p) - (sum(L*p)/sum(p))^2) }
city.t <- 2
saddle(K.adj=K.adj, K2=K2)


This is most useful when K(·) is not of the standard form that follows from a multinomial distribution.
Conditional approximations
Conditional saddlepoint approximation is applied by giving Afn and ufn more columns, and setting the wdist and type arguments to saddle appropriately. For example, suppose that we want to find the distribution of T*, defined as the root of (11.1), but resampling 25 rather than 49 cases of bigcity. Then we set

bigcity.L <- (bigcity$x-city.w(bigcity)*bigcity$u)/
     mean(bigcity$u)
bigcity.t0 <- c(city.w(bigcity), sqrt(var.linear(bigcity.L)))
Afn <- function(t, data) cbind(data$x-t*data$u, 1)
ufn <- function(t, data) c(0,25)
saddle(A=Afn(1.4, bigcity), u=ufn(1.4, bigcity), wdist="p",
     type="cond")
city.sad <- saddle.distn(A=Afn, u=ufn, wdist="p", type="cond",
     data=bigcity, t0=bigcity.t0)

Here the wdist argument gives the distribution of the random variables $W_j$, which is Poisson in this case, and the type argument specifies that a conditional approximation is required. For resampling without replacement, see the help file. A further argument mu allows these variables to have differing means, in which case the conditional saddlepoint will correspond to sampling from multinomial or hypergeometric distributions with unequal probabilities.

11.10 Semiparametric Likelihoods

Basic functions only are provided for semiparametric likelihood inference.
To calculate and plot the log profile likelihood for the mean of a gamma model for the larger air-conditioning data (Example 10.1):

gam.L <- function(y, tmin=min(y)+0.1, tmax=max(y)-0.1, n.t)
{ gam.loglik <- function(l.nu, mu, y)
  { nu <- exp(l.nu)
    -sum(log(dgamma(nu*y/mu, nu)*nu/mu)) }
  out <- matrix(NA, n.t+1, 3)
  for (it in 0:n.t)
  { t <- tmin + (tmax-tmin)*it/n.t
    fit <- nlminb(0, gam.loglik, mu=t, y=y)
    out[1+it,] <- c(t, exp(fit$parameters), -fit$objective) }
  out }


air.gam <- gam.L(aircondit7$hours, 40, 120, 100)
air.gam[,3] <- air.gam[,3] - max(air.gam[,3])
plot(air.gam[,1], air.gam[,3], type="l", xlab="theta",
     ylab="Log likelihood", xlim=c(40,120))
abline(h=-0.5*qchisq(0.95,1), lty=2)

Empirical and empirical exponential family likelihoods are obtained using the functions EL.profile and EEF.profile. They are included in the library for demonstration purposes only, and are not intended for serious use, nor are they currently supported as part of the library. These functions give log likelihoods for the mean of their first argument, calculated at n.t values of θ from tmin to tmax. The output of EL.profile is an n.t x 3 matrix whose first column contains the values of θ, whose next column is the log profile likelihood, and whose final column contains the values of the Lagrange multiplier. The output of EEF.profile is an n.t x 4 matrix whose first column contains the values of θ, whose next two columns are versions of the log profile likelihood (see Example 10.4), and whose final column contains the values of the Lagrange multiplier. For example:

air.EL <- EL.profile(aircondit7$hours, tmin=40, tmax=120, n.t=100)
lines(air.EL[,1], air.EL[,2], lty=2)
air.EEF <- EEF.profile(aircondit7$hours, tmin=40, tmax=120,
     n.t=100)
lines(air.EEF[,1], air.EEF[,3], lty=3)

Note how close the two semiparametric log likelihoods are, compared to the parametric one. The practicals at the end of Chapter 10 give more examples of their use (and abuse).
More general (and more robust!) code to calculate empirical likelihoods is provided by Professor A. B. Owen at Stanford University; see the World Wide Web reference http://playfair.stanford.edu/reports/owen/el.S.

APPENDIX A
Cumulant Calculations

In this book several chapters and some of the problems involve moment calculations, which are often simplified by using cumulants.
The cumulant-generating function of a random variable Y is

$$ K(t) = \log E(e^{tY}) = \sum_{s=1}^{\infty}\frac{t^s}{s!}\,\kappa_s, $$

where $\kappa_s$ is the sth cumulant, while the moment-generating function of Y is

$$ M(t) = E(e^{tY}) = \sum_{s=0}^{\infty}\frac{t^s}{s!}\,\mu'_s, $$

where $\mu'_s = E(Y^s)$ is the sth moment. A simple example is a $N(\mu, \sigma^2)$ random variable, for which $K(t) = t\mu + \tfrac12 t^2\sigma^2$; note the appealing fact that its cumulants of order higher than two are zero. By equating powers of t in the expansions of K(t) and log M(t) we find that $\kappa_1 = \mu'_1$, and that

$$ \kappa_2 = \mu'_2 - (\mu'_1)^2, \qquad \kappa_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2(\mu'_1)^3, \qquad \kappa_4 = \mu'_4 - 4\mu'_3\mu'_1 - 3(\mu'_2)^2 + 12\mu'_2(\mu'_1)^2 - 6(\mu'_1)^4, $$

with inverse formulae

$$ \mu'_2 = \kappa_2 + (\kappa_1)^2, \qquad \mu'_3 = \kappa_3 + 3\kappa_2\kappa_1 + (\kappa_1)^3, \qquad \mu'_4 = \kappa_4 + 4\kappa_3\kappa_1 + 3(\kappa_2)^2 + 6\kappa_2(\kappa_1)^2 + (\kappa_1)^4. \qquad (A.1) $$

The cumulants $\kappa_1$, $\kappa_2$, $\kappa_3$, and $\kappa_4$ are the mean, variance, skewness and kurtosis of Y.
For vector Y it is better to drop the power notation used above and to


adopt index notation and the summation convention. In this notation Y has components $Y^1, \ldots, Y^n$ and we write $Y^iY^i$ and $Y^iY^iY^i$ for the square and cube of $Y^i$. The joint cumulant-generating function K(t) of $Y^1, \ldots, Y^n$ is the logarithm of their joint moment-generating function,

$$ \log E\{\exp(t_iY^i)\} = t_i\kappa^i + \tfrac{1}{2!}t_it_j\kappa^{i,j} + \tfrac{1}{3!}t_it_jt_k\kappa^{i,j,k} + \tfrac{1}{4!}t_it_jt_kt_l\kappa^{i,j,k,l} + \cdots, $$

where summation is implied over repeated indices, so that, for example,

$$ t_i\kappa^i = t_1\kappa^1 + \cdots + t_n\kappa^n, \qquad t_it_j\kappa^{i,j} = t_1t_1\kappa^{1,1} + t_1t_2\kappa^{1,2} + \cdots + t_nt_n\kappa^{n,n}. $$

Thus the n-dimensional normal distribution with means $\kappa^i$ and covariance matrix $\kappa^{i,j}$ has cumulant-generating function $t_i\kappa^i + \tfrac12 t_it_j\kappa^{i,j}$. We sometimes write $\kappa^{i,j} = \mathrm{cum}(Y^i, Y^j)$, $\kappa^{i,j,k} = \mathrm{cum}(Y^i, Y^j, Y^k)$ and so forth for the coefficients of $t_it_j$, $t_it_jt_k$ in K(t). The cumulant arrays $\kappa^{i,j}$, $\kappa^{i,j,k}$, etc. are invariant to index permutation, so for example $\kappa^{1,2,3} = \kappa^{2,3,1}$.
A key feature that simplifies calculations with cumulants as opposed to moments is that cumulants involving two or more independent random variables are zero: for independent variables, $\kappa^{i,j} = \kappa^{i,j,k} = \cdots = 0$ unless all the indices are equal.
The above notation extends to generalized cumulants such as

$$ \mathrm{cum}(Y^iY^jY^k) = E(Y^iY^jY^k) = \kappa^{ijk}, \qquad \mathrm{cum}(Y^i, Y^jY^k) = \kappa^{i,jk}, \qquad \mathrm{cum}(Y^iY^j, Y^k, Y^l) = \kappa^{ij,k,l}, $$

which can be obtained from the joint cumulant-generating functions of $Y^iY^jY^k$, of $Y^i$ and $Y^jY^k$, and of $Y^iY^j$, $Y^k$, and $Y^l$. Note that ordinary moments can be regarded as generalized cumulants.
Generalized cumulants can be expressed in terms of ordinary cumulants by means of complementary set partitions, the most useful of which are given in Table A.1. For example, we use its second column to see that $\kappa^{12} = \kappa^{1,2} + \kappa^1\kappa^2$, or

$$ E(Y^1Y^2) = \mathrm{cum}(Y^1Y^2) = \mathrm{cum}(Y^1, Y^2) + \mathrm{cum}(Y^1)\mathrm{cum}(Y^2), $$

more familiarly written $\mathrm{cov}(Y^1, Y^2) + E(Y^1)E(Y^2)$. The boldface 12 represents $\kappa^{12}$, while the 12 [1] and 1|2 [1] immediately below it represent $\kappa^{1,2}$ and $\kappa^1\kappa^2$. With this understanding we use the third column to see that $\kappa^{ijk} = \kappa^{i,j,k} + \kappa^{i,j}\kappa^k\,[3] + \kappa^i\kappa^j\kappa^k$, where $\kappa^{i,j}\kappa^k\,[3]$ is shorthand for $\kappa^{i,j}\kappa^k + \kappa^{i,k}\kappa^j + \kappa^{j,k}\kappa^i$; this is the multivariate version of (A.1). Likewise $\kappa^{ij,k} = \kappa^{i,j,k} + \kappa^{i,k}\kappa^j\,[2]$, where the term $\kappa^{i,k}\kappa^j\,[2]$ on the right is understood in the context of the left-hand side to equal $\kappa^{i,k}\kappa^j + \kappa^{j,k}\kappa^i$: each index in the first block of the partition ij|k appears once with the index in the second block. The expression 123|4 [4] in the fourth column of the table represents the partitions 123|4, 124|3, 134|2, 234|1.
To illustrate these ideas, we calculate $\mathrm{cov}\{\bar Y, (n-1)^{-1}\sum_j(Y_j-\bar Y)^2\}$, where


$\bar Y = n^{-1}\sum_j Y_j$ is the average of the independent and identically distributed random variables $Y_1, \ldots, Y_n$. Note first that the covariance does not depend on the mean of the $Y_j$, so we can take $\kappa^i = 0$. We then express $\bar Y$ and $(n-1)^{-1}\sum_j(Y_j-\bar Y)^2$ in index notation as $a_iY^i$ and $b_{ij}Y^iY^j$, where $a_i = 1/n$ and $b_{ij} = (\delta_{ij} - 1/n)/(n-1)$, with

$$ \delta_{ij} = \begin{cases} 1, & i = j,\\ 0, & \text{otherwise}, \end{cases} $$

the Kronecker delta symbol. The covariance is

$$ \mathrm{cum}(a_iY^i,\, b_{jk}Y^jY^k) = a_ib_{jk}\kappa^{i,jk} = a_ib_{jk}\kappa^{i,j,k} = na_1b_{11}\kappa^{1,1,1}, $$

the second equality following on use of Table A.1 because $\kappa^i = 0$, and the third equality following because the observations are independent and identically distributed. In power notation $\kappa^{1,1,1}$ is $\kappa_3$, the third cumulant of $Y_1$, so $\mathrm{cov}\{\bar Y, (n-1)^{-1}\sum_j(Y_j-\bar Y)^2\} = \kappa_3/n$. Similarly

$$ \mathrm{var}\{(n-1)^{-1}\textstyle\sum_j(Y_j-\bar Y)^2\} = \mathrm{cum}(b_{ij}Y^iY^j,\, b_{kl}Y^kY^l) = b_{ij}b_{kl}\kappa^{ij,kl}, $$

which Table A.1 shows to be equal to $b_{ij}b_{kl}(\kappa^{i,j,k,l} + \kappa^{i,k}\kappa^{j,l} + \kappa^{i,l}\kappa^{j,k})$. This reduces to

$$ nb_{11}b_{11}\kappa^{1,1,1,1} + 2nb_{11}b_{11}\kappa^{1,1}\kappa^{1,1} + 2n(n-1)b_{12}b_{12}\kappa^{1,1}\kappa^{1,1}, $$

which in turn is $\kappa_4/n + 2(\kappa_2)^2/(n-1)$ in power notation. To perform this calculation using moments and power notation will convince the reader of the elegance and relative simplicity of cumulants and index notation.
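A quick Monte Carlo check of the first result (an illustration, not part of the book's text) takes the $Y_j$ to be standard exponential, for which $\kappa_3 = 2$:

n <- 10
y <- matrix(rexp(10000*n), 10000, n)
ybar <- apply(y, 1, mean)
v <- apply(y, 1, var)
var(cbind(ybar, v))  # off-diagonal entry should be near kappa_3/n = 0.2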
McCullagh (1987) makes a cogent more-extended case for these methods. His book includes more-extensive tables of complementary set partitions.


Table A.1  Complementary set partitions. Each left-hand (boldface) partition is followed by its complementary partitions, with multiplicities in square brackets.

1:        1 [1]

12:       12 [1]   1|2 [1]
1|2:      12 [1]

123:      123 [1]   12|3 [3]   1|2|3 [1]
12|3:     123 [1]   13|2 [2]
1|2|3:    123 [1]

1234:     1234 [1]   123|4 [4]   12|34 [3]   12|3|4 [6]   1|2|3|4 [1]
123|4:    1234 [1]   124|3 [3]   12|34 [3]   14|2|3 [3]
12|34:    1234 [1]   123|4 [2]   134|2 [2]   13|24 [2]   13|2|4 [4]
12|3|4:   1234 [1]   134|2 [2]   13|24 [2]
1|2|3|4:  1234 [1]

Bibliography

Abelson, R. P. and Tukey, J. W. (1963) Efficient utilization of non-numerical information in quantitative analysis: general theory and the case of simple order. Annals of Mathematical Statistics 34, 1347-1369.
Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, eds B. N. Petrov and F. Czaki, pp. 267-281. Budapest: Akademiai Kiado. Reprinted in Breakthroughs in Statistics, volume 1, eds S. Kotz and N. L. Johnson, pp. 610-624. New York: Springer.
Akritas, M. G. (1986) Bootstrapping the Kaplan-Meier estimator. Journal of the American Statistical Association 81, 1032-1038.
Altman, D. G. and Andersen, P. K. (1989) Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine 8, 771-783.
Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993) Statistical Models Based on Counting Processes. New York: Springer.
Andrews, D. F. and Herzberg, A. M. (1985) Data: A Collection of Problems from Many Fields for the Student and Research Worker. New York: Springer.
Appleyard, S. T., Witkowski, J. A., Ripley, B. D., Shotton, D. M. and Dubowicz, V. (1985) A novel procedure for pattern analysis of features present on freeze fractured plasma membranes. Journal of Cell Science 74, 105-117.
Athreya, K. B. (1987) Bootstrap of the mean in the infinite variance case. Annals of Statistics 15, 724-731.
Atkinson, A. C. (1985) Plots, Transformations, and Regression. Oxford: Clarendon Press.
Bai, C. and Olshen, R. A. (1988) Discussion of "Theoretical comparison of bootstrap confidence intervals", by P. Hall. Annals of Statistics 16, 953-956.
Bailer, A. J. and Oris, J. T. (1994) Assessing toxicity of pollutants in aquatic systems. In Case Studies in Biometry, eds N. Lange, L. Ryan, L. Billard, D. R. Brillinger, L. Conquest and J. Greenhouse, pp. 25-40. New York: Wiley.
Banks, D. L. (1988) Histospline smoothing the Bayesian bootstrap. Biometrika 75, 673-684.
Barbe, P. and Bertail, P. (1995) The Weighted Bootstrap. Volume 98 of Lecture Notes in Statistics. New York: Springer.
Barnard, G. A. (1963) Discussion of "The spectral analysis of point processes", by M. S. Bartlett. Journal of the Royal Statistical Society series B 25, 294.
Barndorff-Nielsen, O. E. and Cox, D. R. (1989) Asymptotic Techniques for Use in Statistics. London: Chapman & Hall.
Barndorff-Nielsen, O. E. and Cox, D. R. (1994) Inference and Asymptotics. London: Chapman & Hall.
Beran, J. (1994) Statistics for Long-Memory Processes. London: Chapman & Hall.
Beran, R. J. (1986) Simulated power functions. Annals of Statistics 14, 151-173.
Beran, R. J. (1987) Prepivoting to reduce level error of confidence sets. Biometrika 74, 457-468.
Beran, R. J. (1988) Prepivoting test statistics: a bootstrap view of asymptotic refinements. Journal of the American Statistical Association 83, 687-697.
Beran, R. J. (1992) Designing bootstrap prediction regions. In Bootstrapping and Related Techniques: Proceedings, Trier, FRG, 1990, eds K.-H. Jöckel, G. Rothe and W. Sendler, volume 376 of Lecture Notes in Economics and Mathematical Systems, pp. 23-30. New York: Springer.
Beran, R. J. (1997) Diagnosing bootstrap success. Annals of the Institute of Statistical Mathematics 49, to appear.
Berger, J. O. and Bernardo, J. M. (1992) On the development of reference priors (with Discussion). In Bayesian Statistics 4, eds J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, pp. 35-60. Oxford: Clarendon Press.

Besag, J. E. and Clifford, P. (1989) Generalized Monte Carlo significance tests. Biometrika 76, 633-642.
Besag, J. E. and Clifford, P. (1991) Sequential Monte Carlo p-values. Biometrika 78, 301-304.
Besag, J. E. and Diggle, P. J. (1977) Simple Monte Carlo tests for spatial pattern. Applied Statistics 26, 327-333.
Bickel, P. J. and Freedman, D. A. (1981) Some asymptotic theory for the bootstrap. Annals of Statistics 9, 1196-1217.
Bickel, P. J. and Freedman, D. A. (1983) Bootstrapping regression models with many parameters. In A Festschrift for Erich L. Lehmann, eds P. J. Bickel, K. A. Doksum and J. L. Hodges, pp. 28-48. Pacific Grove, California: Wadsworth & Brooks/Cole.
Bickel, P. J. and Freedman, D. A. (1984) Asymptotic normality and the bootstrap in stratified sampling. Annals of Statistics 12, 470-482.
Bickel, P. J., Götze, F. and van Zwet, W. R. (1997) Resampling fewer than n observations: gains, losses, and remedies for losses. Statistica Sinica 7, 1-32.
Bickel, P. J., Klassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993) Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press.
Bickel, P. J. and Yahav, J. A. (1988) Richardson extrapolation and the bootstrap. Journal of the American Statistical Association 83, 387-393.
Bissell, A. F. (1972) A negative binomial model with varying element sizes. Biometrika 59, 435-441.
Bissell, A. F. (1990) How reliable is your capability index? Applied Statistics 39, 331-340.
Bithell, J. F. and Stone, R. A. (1989) On statistical methods for analysing the geographical distribution of cancer cases near nuclear installations. Journal of Epidemiology and Community Health 43, 79-85.
Bloomfield, P. (1976) Fourier Analysis of Time Series: An Introduction. New York: Wiley.
Boos, D. D. and Monahan, J. F. (1986) Bootstrap methods using prior information. Biometrika 73, 77-83.
Booth, J. G. (1996) Bootstrap methods for generalized linear mixed models with applications to small area estimation. In Statistical Modelling, eds G. U. H. Seeber, B. J. Francis, R. Hatzinger and G. Steckel-Berger, volume 104 of Lecture Notes in Statistics, pp. 43-51. New York: Springer.
Booth, J. G. and Butler, R. W. (1990) Randomization distributions and saddlepoint approximations in generalized linear models. Biometrika 77, 787-796.
Booth, J. G., Butler, R. W. and Hall, P. (1994) Bootstrap methods for finite populations. Journal of the American Statistical Association 89, 1282-1289.
Booth, J. G. and Hall, P. (1994) Monte Carlo approximation and the iterated bootstrap. Biometrika 81, 331-340.
Booth, J. G., Hall, P. and Wood, A. T. A. (1992) Bootstrap estimation of conditional distributions. Annals of Statistics 20, 1594-1610.
Booth, J. G., Hall, P. and Wood, A. T. A. (1993) Balanced importance resampling for the bootstrap. Annals of Statistics 21, 286-298.
Bose, A. (1988) Edgeworth correction by bootstrap in autoregressions. Annals of Statistics 16, 1709-1722.
Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with Discussion). Journal of the Royal Statistical Society series B 26, 211-246.
Bratley, P., Fox, B. L. and Schrage, L. E. (1987) A Guide to Simulation. Second edition. New York: Springer.
Braun, W. J. and Kulperger, R. J. (1995) A Fourier method for bootstrapping time series. Preprint, Department of Mathematics and Statistics, University of Winnipeg.
Braun, W. J. and Kulperger, R. J. (1997) Properties of a Fourier bootstrap method for time series. Communications in Statistics - Theory and Methods 26, to appear.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984) Classification and Regression Trees. Pacific Grove, California: Wadsworth & Brooks/Cole.
Breslow, N. (1985) Cohort analysis in epidemiology. In A Celebration of Statistics, eds A. C. Atkinson and S. E. Fienberg, pp. 109-143. New York: Springer.
Bretagnolle, J. (1983) Lois limites du bootstrap de certaines fonctionelles. Annales de l'Institut Henri Poincaré, Section B 19, 281-296.
Brillinger, D. R. (1981) Time Series: Data Analysis and Theory. Expanded edition. San Francisco: Holden-Day.
Brillinger, D. R. (1988) An elementary trend analysis of Rio Negro levels at Manaus, 1903-1985. Brazilian Journal of Probability and Statistics 2, 63-79.

Brillinger, D. R. (1989) Consistent detection of a monotonic trend superposed on a stationary time series. Biometrika 76, 23-30.
Brockwell, P. J. and Davis, R. A. (1991) Time Series: Theory and Methods. Second edition. New York: Springer.
Brockwell, P. J. and Davis, R. A. (1996) Introduction to Time Series and Forecasting. New York: Springer.
Brown, B. W. (1980) Prediction analysis for binary data. In Biostatistics Casebook, eds R. G. Miller, B. Efron, B. W. Brown and L. E. Moses, pp. 3-18. New York: Wiley.
Buckland, S. T. and Garthwaite, P. H. (1990) Algorithm AS 259: estimating confidence intervals by the Robbins-Monro search process. Applied Statistics 39, 413-424.
Bühlmann, P. and Künsch, H. R. (1995) The blockwise bootstrap for general parameters of a stationary time series. Scandinavian Journal of Statistics 22, 35-54.
Bunke, O. and Droge, B. (1984) Bootstrap and cross-validation estimates of the prediction error for linear regression models. Annals of Statistics 12, 1400-1424.
Burman, P. (1989) A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika 76, 503-514.
Burr, D. (1994) A comparison of certain bootstrap confidence intervals in the Cox model. Journal of the American Statistical Association 89, 1290-1302.
Burr, D. and Doss, H. (1993) Confidence bands for the median survival time as a function of covariates in the Cox model. Journal of the American Statistical Association 88, 1330-1340.
Canty, A. J., Davison, A. C. and Hinkley, D. V. (1996) Reliable confidence intervals. Discussion of "Bootstrap confidence intervals", by T. J. DiCiccio and B. Efron. Statistical Science 11, 214-219.
Carlstein, E. (1986) The use of subseries values for estimating the variance of a general statistic from a stationary sequence. Annals of Statistics 14, 1171-1179.
Carpenter, J. R. (1996) Simulated confidence regions for parameters in epidemiological models. Ph.D. thesis, Department of Statistics, University of Oxford.
Chambers, J. M. and Hastie, T. J. (eds) (1992) Statistical Models in S. Pacific Grove, California: Wadsworth & Brooks/Cole.
Chao, M.-T. and Lo, S.-H. (1994) Maximum likelihood summary and the bootstrap method in structured finite populations. Statistica Sinica 4, 389-406.
Chapman, P. and Hinkley, D. V. (1986) The double bootstrap, pivots and confidence limits. Technical Report 34, Center for Statistical Sciences, University of Texas at Austin.
Chen, C., Davis, R. A., Brockwell, P. J. and Bai, Z. D. (1993) Order determination for autoregressive processes using resampling methods. Statistica Sinica 3, 481-500.
Chen, C.-H. and George, S. L. (1985) The bootstrap and identification of prognostic factors via Cox's proportional hazards regression model. Statistics in Medicine 4, 39-46.
Chen, S. X. (1996) Empirical likelihood confidence intervals for nonparametric density estimation. Biometrika 83, 329-341.
Chen, S. X. and Hall, P. (1993) Smoothed empirical likelihood confidence intervals for quantiles. Annals of Statistics 21, 1166-1181.
Chen, Z. and Do, K.-A. (1994) The bootstrap methods with saddlepoint approximations and importance resampling. Statistica Sinica 4, 407-421.
Cobb, G. W. (1978) The problem of the Nile: conditional solution to a changepoint problem. Biometrika 65, 243-252.
Cochran, W. G. (1977) Sampling Techniques. Third edition. New York: Wiley.
Collings, B. J. and Hamilton, M. A. (1988) Estimating the power of the two-sample Wilcoxon test for location shift. Biometrics 44, 847-860.
Cook, R. D., Hawkins, D. M. and Weisberg, S. (1992) Comparison of model misspecification diagnostics using residuals from least mean of squares and least median of squares fits. Journal of the American Statistical Association 87, 419-424.
Cook, R. D., Tsai, C.-L. and Wei, B. C. (1986) Bias in nonlinear regression. Biometrika 73, 615-623.
Cook, R. D. and Weisberg, S. (1982) Residuals and Influence in Regression. London: Chapman & Hall.
Cook, R. D. and Weisberg, S. (1994) Transforming a response variable for linearity. Biometrika 81, 731-737.
Corcoran, S. A., Davison, A. C. and Spady, R. H. (1996) Reliable inference from empirical likelihoods. Preprint, Department of Statistics, University of Oxford.
Cowling, A., Hall, P. and Phillips, M. J. (1996) Bootstrap confidence regions for the intensity of a Poisson process. Journal of the American Statistical Association 91, 1516-1524.
Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. London: Chapman & Hall.
Cox, D. R. and Isham, V. (1980) Point Processes. London: Chapman & Hall.

Cox, D. R. and Lewis, P. A. W. (1966) The Statistical Analysis of Series of Events. London: Chapman & Hall.
Cox, D. R. and Oakes, D. (1984) Analysis of Survival Data. London: Chapman & Hall.
Cox, D. R. and Snell, E. J. (1981) Applied Statistics: Principles and Examples. London: Chapman & Hall.
Cressie, N. A. C. (1982) Playing safe with misweighted means. Journal of the American Statistical Association 77, 754-759.
Cressie, N. A. C. (1991) Statistics for Spatial Data. New York: Wiley.
Dahlhaus, R. and Janas, D. (1996) A frequency domain bootstrap for ratio statistics in time series analysis. Annals of Statistics 24, to appear.
Daley, D. J. and Vere-Jones, D. (1988) An Introduction to the Theory of Point Processes. New York: Springer.
Daniels, H. E. (1954) Saddlepoint approximations in statistics. Annals of Mathematical Statistics 25, 631-650.
Daniels, H. E. (1955) Discussion of "Permutation theory in the derivation of robust criteria and the study of departures from assumption", by G. E. P. Box and S. L. Andersen. Journal of the Royal Statistical Society series B 17, 27-28.
Daniels, H. E. (1958) Discussion of "The regression analysis of binary sequences", by D. R. Cox. Journal of the Royal Statistical Society series B 20, 236-238.
Daniels, H. E. and Young, G. A. (1991) Saddlepoint approximation for the studentized mean, with an application to the bootstrap. Biometrika 78, 169-179.
Davison, A. C. (1988) Discussion of the Royal Statistical Society meeting on the bootstrap. Journal of the Royal Statistical Society series B 50, 356-357.
Davison, A. C. and Hall, P. (1992) On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems. Biometrika 79, 279-284.
Davison, A. C. and Hall, P. (1993) On Studentizing and blocking methods for implementing the bootstrap with dependent data. Australian Journal of Statistics 35, 215-224.
Davison, A. C. and Hinkley, D. V. (1988) Saddlepoint approximations in resampling methods. Biometrika 75, 417-431.
Davison, A. C., Hinkley, D. V. and Schechtman, E. (1986) Efficient bootstrap simulation. Biometrika 73, 555-566.
Davison, A. C., Hinkley, D. V. and Worton, B. J. (1992) Bootstrap likelihoods. Biometrika 79, 113-130.
Davison, A. C., Hinkley, D. V. and Worton, B. J. (1995) Accurate and efficient construction of bootstrap likelihoods. Statistics and Computing 5, 257-264.
Davison, A. C. and Snell, E. J. (1991) Residuals and diagnostics. In Statistical Theory and Modelling: In Honour of Sir David Cox, FRS, eds D. V. Hinkley, N. Reid and E. J. Snell, pp. 83-106. London: Chapman & Hall.
De Angelis, D. and Gilks, W. R. (1994) Estimating acquired immune deficiency syndrome incidence accounting for reporting delay. Journal of the Royal Statistical Society series A 157, 31-40.
De Angelis, D., Hall, P. and Young, G. A. (1993) Analytical and bootstrap approximations to estimator distributions in L1 regression. Journal of the American Statistical Association 88, 1310-1316.
De Angelis, D. and Young, G. A. (1992) Smoothing the bootstrap. International Statistical Review 60, 45-56.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with Discussion). Journal of the Royal Statistical Society series B 39, 1-38.
Diaconis, P. and Holmes, S. (1994) Gray codes for randomization procedures. Statistics and Computing 4, 287-302.
DiCiccio, T. J. and Efron, B. (1992) More accurate confidence intervals in exponential families. Biometrika 79, 231-245.
DiCiccio, T. J. and Efron, B. (1996) Bootstrap confidence intervals (with Discussion). Statistical Science 11, 189-228.
DiCiccio, T. J., Hall, P. and Romano, J. P. (1989) Comparison of parametric and empirical likelihood functions. Biometrika 76, 465-476.
DiCiccio, T. J., Hall, P. and Romano, J. P. (1991) Empirical likelihood is Bartlett-correctable. Annals of Statistics 19, 1053-1061.
DiCiccio, T. J., Martin, M. A. and Young, G. A. (1992a) Analytic approximations for iterated bootstrap confidence intervals. Statistics and Computing 2, 161-171.
DiCiccio, T. J., Martin, M. A. and Young, G. A. (1992b) Fast and accurate approximate double bootstrap confidence intervals. Biometrika 79, 285-295.
DiCiccio, T. J., Martin, M. A. and Young, G. A. (1994) Analytical approximations to bootstrap distribution functions using saddlepoint methods. Statistica Sinica 4, 281-295.
DiCiccio, T. J. and Romano, J. P. (1988) A review of bootstrap confidence intervals (with Discussion). Journal of the Royal Statistical Society series B 50, 338-370. Correction, volume 51, p. 470.

DiCiccio, T. J. and Romano, J. P. (1989) On adjustments based on the signed root of the empirical likelihood ratio statistic. Biometrika 76, 447-456.
DiCiccio, T. J. and Romano, J. P. (1990) Nonparametric confidence limits by resampling methods and least favorable families. International Statistical Review 58, 59-76.
Diggle, P. J. (1983) Statistical Analysis of Spatial Point Patterns. London: Academic Press.
Diggle, P. J. (1990) Time Series: A Biostatistical Introduction. Oxford: Clarendon Press.
Diggle, P. J. (1993) Point process modelling in environmental epidemiology. In Statistics for the Environment, eds V. Barnett and K. F. Turkman, pp. 89-110. Chichester: Wiley.
Diggle, P. J., Lange, N. and Benes, F. M. (1991) Analysis of variance for replicated spatial point patterns in clinical neuroanatomy. Journal of the American Statistical Association 86, 618-625.
Diggle, P. J. and Rowlingson, B. S. (1994) A conditional approach to point process modelling of elevated risk. Journal of the Royal Statistical Society series A 157, 433-440.
Do, K.-A. and Hall, P. (1991) On importance resampling for the bootstrap. Biometrika 78, 161-167.
Do, K.-A. and Hall, P. (1992a) Distribution estimation using concomitants of order statistics, with application to Monte Carlo simulation for the bootstrap. Journal of the Royal Statistical Society series B 54, 595-607.
Do, K.-A. and Hall, P. (1992b) Quasi-random resampling for the bootstrap. Statistics and Computing 1, 13-22.
Dobson, A. J. (1990) An Introduction to Generalized Linear Models. London: Chapman & Hall.
Donegani, M. (1991) An adaptive and powerful randomization test. Biometrika 78, 930-933.
Doss, H. and Gill, R. D. (1992) An elementary approach to weak convergence for quantile processes, with applications to censored survival data. Journal of the American Statistical Association 87, 869-877.
Draper, N. R. and Smith, H. (1981) Applied Regression Analysis. Second edition. New York: Wiley.
Ducharme, G. R., Jhun, M., Romano, J. P. and Truong, K. N. (1985) Bootstrap confidence cones for directional data. Biometrika 72, 637-645.
Easton, G. S. and Ronchetti, E. M. (1986) General saddlepoint approximations with applications to L statistics. Journal of the American Statistical Association 81, 420-430.
Efron, B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics 7, 1-26.
Efron, B. (1981a) Nonparametric standard errors and confidence intervals (with Discussion). Canadian Journal of Statistics 9, 139-172.
Efron, B. (1981b) Censored data and the bootstrap. Journal of the American Statistical Association 76, 312-319.
Efron, B. (1982) The Jackknife, the Bootstrap, and Other Resampling Plans. Number 38 in CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.
Efron, B. (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 78, 316-331.
Efron, B. (1986) How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association 81, 461-470.
Efron, B. (1987) Better bootstrap confidence intervals (with Discussion). Journal of the American Statistical Association 82, 171-200.
Efron, B. (1988) Computer-intensive methods in statistical regression. SIAM Review 30, 421-449.
Efron, B. (1990) More efficient bootstrap computations. Journal of the American Statistical Association 85, 79-89.
Efron, B. (1992) Jackknife-after-bootstrap standard errors and influence functions (with Discussion). Journal of the Royal Statistical Society series B 54, 83-127.
Efron, B. (1993) Bayes and likelihood calculations from confidence intervals. Biometrika 80, 3-26.
Efron, B. (1994) Missing data, imputation, and the bootstrap (with Discussion). Journal of the American Statistical Association 89, 463-479.
Efron, B., Halloran, M. E. and Holmes, S. (1996) Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences, USA 93, 13429-13434.
Efron, B. and Stein, C. M. (1981) The jackknife estimate of variance. Annals of Statistics 9, 586-596.
Efron, B. and Tibshirani, R. J. (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy (with Discussion). Statistical Science 1, 54-96.
Efron, B. and Tibshirani, R. J. (1993) An Introduction to the Bootstrap. New York: Chapman & Hall.
Efron, B. and Tibshirani, R. J. (1997) Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association 92, 548-560.
Fang, K. T. and Wang, Y. (1994) Number-Theoretic Methods in Statistics. London: Chapman & Hall.

Faraway, J. J. (1992) On the cost of data analysis. Journal of Computational and Graphical Statistics 1, 213-229.
Feigl, P. and Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant information. Biometrics 21, 826-838.
Feller, W. (1968) An Introduction to Probability Theory and its Applications. Third edition, volume I. New York: Wiley.
Fernholtz, L. T. (1983) von Mises Calculus for Statistical Functionals. Volume 19 of Lecture Notes in Statistics. New York: Springer.
Ferretti, N. and Romo, J. (1996) Unit root bootstrap tests for AR(1) models. Biometrika 83, 849-860.
Field, C. and Ronchetti, E. M. (1990) Small Sample Asymptotics. Volume 13 of Lecture Notes - Monograph Series. Hayward, California: Institute of Mathematical Statistics.
Firth, D. (1991) Generalized linear models. In Statistical Theory and Modelling: In Honour of Sir David Cox, FRS, eds D. V. Hinkley, N. Reid and E. J. Snell, pp. 55-82. London: Chapman & Hall.
Firth, D. (1993) Bias reduction of maximum likelihood estimates. Biometrika 80, 27-38.
Firth, D., Glosup, J. and Hinkley, D. V. (1991) Model checking with nonparametric curves. Biometrika 78, 245-252.
Fisher, N. I., Hall, P., Jing, B.-Y. and Wood, A. T. A. (1996) Improved pivotal methods for constructing confidence regions with directional data. Journal of the American Statistical Association 91, 1062-1070.
Fisher, N. I., Lewis, T. and Embleton, B. J. J. (1987) Statistical Analysis of Spherical Data. Cambridge: Cambridge University Press.
Fisher, R. A. (1935) The Design of Experiments. Edinburgh: Oliver and Boyd.
Fisher, R. A. (1947) The analysis of covariance method for the relation between a part and the whole. Biometrics 3, 65-68.
Fleming, T. R. and Harrington, D. P. (1991) Counting Processes and Survival Analysis. New York: Wiley.
Forster, J. J., McDonald, J. W. and Smith, P. W. F. (1996) Monte Carlo exact conditional tests for log-linear and logistic models. Journal of the Royal Statistical Society series B 58, 445-453.
Franke, J. and Härdle, W. (1992) On bootstrapping kernel spectral estimates. Annals of Statistics 20, 121-145.
Freedman, D. A. (1981) Bootstrapping regression models. Annals of Statistics 9, 1218-1228.
Freedman, D. A. (1984) On bootstrapping two-stage least-squares estimates in stationary linear models. Annals of Statistics 12, 827-842.
Freedman, D. A. and Peters, S. C. (1984a) Bootstrapping a regression equation: some empirical results. Journal of the American Statistical Association 79, 97-106.
Freedman, D. A. and Peters, S. C. (1984b) Bootstrapping an econometric model: some empirical results. Journal of Business & Economic Statistics 2, 150-158.
Freeman, D. H. (1987) Applied Categorical Data Analysis. New York: Marcel Dekker.
Frets, G. P. (1921) Heredity of head form in man. Genetica 3, 193-384.
Garcia-Soidan, P. H. and Hall, P. (1997) On sample reuse methods for spatial data. Biometrics 53, 273-281.
Garthwaite, P. H. and Buckland, S. T. (1992) Generating Monte Carlo confidence intervals by the Robbins-Monro process. Applied Statistics 41, 159-171.
Gatto, R. (1994) Saddlepoint methods and nonparametric approximations for econometric models. Ph.D. thesis, Faculty of Economic and Social Sciences, University of Geneva.
Gatto, R. and Ronchetti, E. M. (1996) General saddlepoint approximations of marginal densities and tail probabilities. Journal of the American Statistical Association 91, 666-673.
Geisser, S. (1975) The predictive sample reuse method with applications. Journal of the American Statistical Association 70, 320-328.
Geisser, S. (1993) Predictive Inference: An Introduction. London: Chapman & Hall.
Geyer, C. J. (1991) Constrained maximum likelihood exemplified by isotonic convex logistic regression. Journal of the American Statistical Association 86, 717-724.
Geyer, C. J. (1995) Likelihood ratio tests and inequality constraints. Technical Report 610, School of Statistics, University of Minnesota.
Gigli, A. (1994) Contributions to importance sampling and resampling. Ph.D. thesis, Department of Mathematics, Imperial College, London.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (eds) (1996) Markov Chain Monte Carlo in Practice. London: Chapman & Hall.
Gleason, J. R. (1988) Algorithms for balanced bootstrap simulations. American Statistician 42, 263-266.
Gong, G. (1983) Cross-validation, the jackknife, and the bootstrap: excess error estimation in forward logistic regression. Journal of the American Statistical Association 78, 108-113.

Götze, F. and Künsch, H. R. (1996) Second order correctness of the blockwise bootstrap for stationary observations. Annals of Statistics 24, 1914-1933.
Graham, R. L., Hinkley, D. V., John, P. W. M. and Shi, S. (1990) Balanced design of bootstrap simulations. Journal of the Royal Statistical Society series B 52, 185-202.
Gray, H. L. and Schucany, W. R. (1972) The Generalized Jackknife Statistic. New York: Marcel Dekker.
Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London: Chapman & Hall.
Gross, S. (1980) Median estimation in sample surveys. In Proceedings of the Section on Survey Research Methods, pp. 181-184. Alexandria, Virginia: American Statistical Association.
Haldane, J. B. S. (1940) The mean and variance of χ², when used as a test of homogeneity, when expectations are small. Biometrika 31, 346-355.
Hall, P. (1985) Resampling a coverage pattern. Stochastic Processes and their Applications 20, 231-246.
Hall, P. (1986) On the bootstrap and confidence intervals. Annals of Statistics 14, 1431-1452.
Hall, P. (1987) On the bootstrap and likelihood-based confidence regions. Biometrika 74, 481-493.
Hall, P. (1988a) Theoretical comparison of bootstrap confidence intervals (with Discussion). Annals of Statistics 16, 927-985.
Hall, P. (1988b) On confidence intervals for spatial parameters estimated from nonreplicated data. Biometrics 44, 271-277.
Hall, P. (1989a) Antithetic resampling for the bootstrap. Biometrika 76, 713-724.
Hall, P. (1989b) Unusual properties of bootstrap confidence intervals in regression problems. Probability Theory and Related Fields 81, 247-273.
Hall, P. (1990) Pseudo-likelihood theory for empirical likelihood. Annals of Statistics 18, 121-140.
Hall, P. (1992a) The Bootstrap and Edgeworth Expansion. New York: Springer.
Hall, P. (1992b) On bootstrap confidence intervals in nonparametric regression. Annals of Statistics 20, 695-711.
Hall, P. (1995) On the biases of error estimators in prediction problems. Statistics and Probability Letters 24, 257-262.
Hall, P., DiCiccio, T. J. and Romano, J. P. (1989) On smoothing and the bootstrap. Annals of Statistics 17, 692-704.
Hall, P. and Horowitz, J. L. (1993) Corrections and blocking rules for the block bootstrap with dependent data. Technical Report SR11-93, Centre for Mathematics and its Applications, Australian National University.
Hall, P., Horowitz, J. L. and Jing, B.-Y. (1995) On blocking rules for the bootstrap with dependent data. Biometrika 82, 561-574.
Hall, P. and Jing, B.-Y. (1996) On sample reuse methods for dependent data. Journal of the Royal Statistical Society series B 58, 727-737.
Hall, P. and Keenan, D. M. (1989) Bootstrap methods for constructing confidence regions for hands. Communications in Statistics - Stochastic Models 5, 555-562.
Hall, P. and La Scala, B. (1990) Methodology and algorithms of empirical likelihood. International Statistical Review 58, 109-128.
Hall, P. and Martin, M. A. (1988) On bootstrap resampling and iteration. Biometrika 75, 661-671.
Hall, P. and Owen, A. B. (1993) Empirical likelihood confidence bands in density estimation. Journal of Computational and Graphical Statistics 2, 273-289.
Hall, P. and Titterington, D. M. (1989) The effect of simulation order on level accuracy and power of Monte Carlo tests. Journal of the Royal Statistical Society series B 51, 459-467.
Hall, P. and Wilson, S. R. (1991) Two guidelines for bootstrap hypothesis testing. Biometrics 47, 757-762.
Hamilton, M. A. and Collings, B. J. (1991) Determining the appropriate sample size for nonparametric tests for location shift. Technometrics 33, 327-337.
Hammersley, J. M. and Handscomb, D. C. (1964) Monte Carlo Methods. London: Methuen.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E. (eds) (1994) A Handbook of Small Data Sets. London: Chapman & Hall.
Härdle, W. (1989) Resampling for inference from curves. In Bulletin of the 47th Session of the International Statistical Institute, Paris, August 1989, volume 3, pp. 53-63.
Härdle, W. (1990) Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Härdle, W. and Bowman, A. W. (1988) Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. Journal of the American Statistical Association 83, 102-110.

Härdle, W. and Marron, J. S. (1991) Bootstrap simultaneous error bars for nonparametric regression. Annals of Statistics 19, 778-796.
Hartigan, J. A. (1969) Using subsample values as typical values. Journal of the American Statistical Association 64, 1303-1317.
Hartigan, J. A. (1971) Error analysis by replaced samples. Journal of the Royal Statistical Society series B 33, 98-110.
Hartigan, J. A. (1975) Necessary and sufficient conditions for asymptotic joint normality of a statistic and its subsample values. Annals of Statistics 3, 573-580.
Hartigan, J. A. (1990) Perturbed periodogram estimates of variance. International Statistical Review 58, 1-7.
Hastie, T. J. and Loader, C. (1993) Local regression: automatic kernel carpentry (with Discussion). Statistical Science 8, 120-143.
Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. London: Chapman & Hall.
Hayes, K. G., Perl, M. L. and Efron, B. (1989) Application of the bootstrap statistical method to the tau-decay-mode problem. Physical Review series D 39, 274-279.
Heller, G. and Venkatraman, E. S. (1996) Resampling procedures to compare two survival distributions in the presence of right-censored data. Biometrics 52, 1204-1213.
Hesterberg, T. C. (1988) Advances in importance sampling. Ph.D. thesis, Department of Statistics, Stanford University, California.
Hesterberg, T. C. (1995a) Tail-specific linear approximations for efficient bootstrap simulations. Journal of Computational and Graphical Statistics 4, 113-133.
Hesterberg, T. C. (1995b) Weighted average importance sampling and defensive mixture distributions. Technometrics 37, 185-194.
Hinkley, D. V. (1977) Jackknifing in unbalanced situations. Technometrics 19, 285-292.
Hinkley, D. V. and Schechtman, E. (1987) Conditional bootstrap methods in the mean-shift model. Biometrika 74, 85-93.
Hinkley, D. V. and Shi, S. (1989) Importance sampling and the nested bootstrap. Biometrika 76, 435-446.
Hinkley, D. V. and Wang, S. (1991) Efficiency of robust standard errors for regression coefficients. Communications in Statistics - Theory and Methods 20, 1-11.
Hinkley, D. V. and Wei, B. C. (1984) Improvements of jackknife confidence limit methods. Biometrika 71, 331-339.
Hirose, H. (1993) Estimation of threshold stress in accelerated life-testing. IEEE Transactions on Reliability 42, 650-657.
Hjort, N. L. (1985) Bootstrapping Cox's regression model. Technical Report NSF-241, Department of Statistics, Stanford University.
Hjort, N. L. (1992) On inference in parametric survival data models. International Statistical Review 60, 355-387.
Horváth, L. and Yandell, B. S. (1987) Convergence rates for the bootstrapped product-limit process. Annals of Statistics 15, 1155-1173.
Hosmer, D. W. and Lemeshow, S. (1989) Applied Logistic Regression. New York: Wiley.
Hu, F. and Zidek, J. V. (1995) A bootstrap based on the estimating equations of the linear model. Biometrika 82, 263-275.
Huet, S., Jolivet, E. and Messean, A. (1990) Some simulations results about confidence intervals and bootstrap methods in nonlinear regression. Statistics 3, 369-432.
Hyde, J. (1980) Survival analysis with incomplete observations. In Biostatistics Casebook, eds R. G. Miller, B. Efron, B. W. Brown and L. E. Moses, pp. 31-46. New York: Wiley.
Janas, D. (1993) Bootstrap Procedures for Time Series. Aachen: Verlag Shaker.
Jennison, C. (1992) Bootstrap tests and confidence intervals for a hazard ratio when the number of observed failures is small, with applications to group sequential survival studies. In Computer Science and Statistics: Proceedings of the 22nd Symposium on the Interface, eds C. Page and R. LePage, pp. 89-97. New York: Springer.
Jensen, J. L. (1992) The modified signed likelihood statistic and saddlepoint approximations. Biometrika 79, 693-703.
Jensen, J. L. (1995) Saddlepoint Approximations. Oxford: Clarendon Press.
Jeong, J. and Maddala, G. S. (1993) A perspective on application of bootstrap methods in econometrics. In Handbook of Statistics, vol. 11: Econometrics, eds G. S. Maddala, C. R. Rao and H. D. Vinod, pp. 573-610. Amsterdam: North-Holland.
Jing, B.-Y. and Robinson, J. (1994) Saddlepoint approximations for marginal and conditional probabilities of transformed variables. Annals of Statistics 22, 1115-1132.

Jing, B.-Y. and Wood, A. T. A. (1996) Exponential empirical likelihood is not Bartlett correctable. Annals of Statistics 24, 365-369.
Jöckel, K.-H. (1986) Finite sample properties and asymptotic efficiency of Monte Carlo tests. Annals of Statistics 14, 336-347.
Johns, M. V. (1988) Importance sampling for bootstrap confidence intervals. Journal of the American Statistical Association 83, 709-714.
Journel, A. G. (1994) Resampling from stochastic simulations (with Discussion). Environmental and Ecological Statistics 1, 63-91.
Kabaila, P. (1993a) Some properties of profile bootstrap confidence intervals. Australian Journal of Statistics 35, 205-214.
Kabaila, P. (1993b) On bootstrap predictive inference for autoregressive processes. Journal of Time Series Analysis 14, 473-484.
Kalbfleisch, J. D. and Prentice, R. L. (1980) The Statistical Analysis of Failure Time Data. New York: Wiley.
Kaplan, E. L. and Meier, P. (1958) Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53, 457-481.
Karr, A. F. (1991) Point Processes and their Statistical Inference. Second edition. New York: Marcel Dekker.
Katz, R. (1995) Spatial analysis of pore images. Ph.D. thesis, Department of Statistics, University of Oxford.
Kendall, D. G. and Kendall, W. S. (1980) Alignments in two-dimensional random sets of points. Advances in Applied Probability 12, 380-424.
Kim, J.-H. (1990) Conditional bootstrap methods for censored data. Ph.D. thesis, Department of Statistics, Florida State University.
Künsch, H. R. (1989) The jackknife and bootstrap for general stationary observations. Annals of Statistics 17, 1217-1241.
Lahiri, S. N. (1991) Second-order optimality of stationary bootstrap. Statistics and Probability Letters 11, 335-341.
Lahiri, S. N. (1995) On the asymptotic behaviour of the moving block bootstrap for normalized sums of heavy-tail random variables. Annals of Statistics 23, 1331-1349.
Laird, N. M. (1978) Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association 73, 805-811.
Laird, N. M. and Louis, T. A. (1987) Empirical Bayes confidence intervals based on bootstrap samples (with Discussion). Journal of the American Statistical Association 82, 739-757.
Lawson, A. B. (1993) On the analysis of mortality events associated with a prespecified fixed point. Journal of the Royal Statistical Society series A 156, 363-377.
Lee, S. M. S. and Young, G. A. (1995) Asymptotic iterated bootstrap confidence intervals. Annals of Statistics 23, 1301-1330.
Léger, C., Politis, D. N. and Romano, J. P. (1992) Bootstrap technology and applications. Technometrics 34, 378-398.
Léger, C. and Romano, J. P. (1990a) Bootstrap choice of tuning parameters. Annals of the Institute of Statistical Mathematics 42, 709-735.
Léger, C. and Romano, J. P. (1990b) Bootstrap adaptive estimation: the trimmed mean example. Canadian Journal of Statistics 18, 297-314.
Lehmann, E. L. (1986) Testing Statistical Hypotheses. Second edition. New York: Wiley.
Li, G. (1995) Nonparametric likelihood ratio estimation of probabilities for truncated data. Journal of the American Statistical Association 90, 997-1003.
Li, H. and Maddala, G. S. (1996) Bootstrapping time series models (with Discussion). Econometric Reviews 15, 115-195.
Li, K.-C. (1987) Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set. Annals of Statistics 15, 958-975.
Liu, R. Y. and Singh, K. (1992a) Moving blocks jackknife and bootstrap capture weak dependence. In Exploring the Limits of Bootstrap, eds R. LePage and L. Billard, pp. 225-248. New York: Wiley.
Liu, R. Y. and Singh, K. (1992b) Efficiency and robustness in resampling. Annals of Statistics 20, 370-384.
Lloyd, C. J. (1994) Approximate pivots from M-estimators. Statistica Sinica 4, 701-714.
Lo, S.-H. and Singh, K. (1986) The product-limit estimator and the bootstrap: some asymptotic representations. Probability Theory and Related Fields 71, 455-465.
Loh, W.-Y. (1987) Calibrating confidence coefficients. Journal of the American Statistical Association 82, 155-162.
Mallows, C. L. (1973) Some comments on Cp. Technometrics 15, 661-675.
Mammen, E. (1989) Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Annals of Statistics 17, 382-400.
Mammen, E. (1992) When Does Bootstrap Work? Asymptotic Results and Simulations. Volume 77 of Lecture Notes in Statistics. New York: Springer.

Mammen, E. (1993) Bootstrap and wild bootstrap for high dimensional linear models. Annals of Statistics 21, 255-285.
Manly, B. F. J. (1991) Randomization and Monte Carlo Methods in Biology. London: Chapman & Hall.
Marriott, F. H. C. (1979) Barnard's Monte Carlo tests: how many simulations? Applied Statistics 28, 75-77.
McCarthy, P. J. (1969) Pseudo-replication: half samples. Review of the International Statistical Institute 37, 239-264.
McCarthy, P. J. and Snowden, C. B. (1985) The Bootstrap and Finite Population Sampling. Vital and Public Health Statistics (Ser. 2, No. 95), Public Health Service Publication 85-1369. Washington, DC: United States Government Printing Office.
McCullagh, P. (1987) Tensor Methods in Statistics. London: Chapman & Hall.
McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models. Second edition. London: Chapman & Hall.
McKay, M. D., Beckman, R. J. and Conover, W. J. (1979) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239-245.
McKean, J. W., Sheather, S. J. and Hettmansperger, T. P. (1993) The use and interpretation of residuals based on robust estimation. Journal of the American Statistical Association 88, 1254-1263.
McLachlan, G. J. (1992) Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley.
Milan, L. and Whittaker, J. (1995) Application of the parametric bootstrap to models that incorporate a singular value decomposition. Applied Statistics 44, 31-49.
Miller, R. G. (1974) The jackknife - a review. Biometrika 61, 1-15.
Miller, R. G. (1981) Survival Analysis. New York: Wiley.
Monti, A. C. (1997) Empirical likelihood confidence regions in time series models. Biometrika 84, 395-405.
Morgenthaler, S. and Tukey, J. W. (eds) (1991) Configural Polysampling: A Route to Practical Robustness. New York: Wiley.
Moulton, L. H. and Zeger, S. L. (1989) Analyzing repeated measures on generalized linear models via the bootstrap. Biometrics 45, 381-394.
Moulton, L. H. and Zeger, S. L. (1991) Bootstrapping generalized linear models. Computational Statistics and Data Analysis 11, 53-63.
Muirhead, C. R. and Darby, S. C. (1989) Royal Statistical Society meeting on cancer near nuclear installations. Journal of the Royal Statistical Society series A 152, 305-384.
Murphy, S. A. (1995) Likelihood-based confidence intervals in survival analysis. Journal of the American Statistical Association 90, 1399-1405.
Mykland, P. A. (1995) Dual likelihood. Annals of Statistics 23, 396-421.
Nelder, J. A. and Pregibon, D. (1987) An extended quasi-likelihood function. Biometrika 74, 221-232.
Newton, M. A. and Geyer, C. J. (1994) Bootstrap recycling: a Monte Carlo alternative to the nested bootstrap. Journal of the American Statistical Association 89, 905-912.
Newton, M. A. and Raftery, A. E. (1994) Approximate Bayesian inference with the weighted likelihood bootstrap (with Discussion). Journal of the Royal Statistical Society series B 56, 3-48.
Niederreiter, H. (1992) Random Number Generation and Quasi-Monte Carlo Methods. Number 63 in CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.
Nordgaard, A. (1990) On the resampling of stochastic processes using a bootstrap approach. Ph.D. thesis, Department of Mathematics, Linköping University, Sweden.
Noreen, E. W. (1989) Computer Intensive Methods for Testing Hypotheses: An Introduction. New York: Wiley.
Ogbonmwan, S.-M. (1985) Accelerated resampling codes with application to likelihood. Ph.D. thesis, Department of Mathematics, Imperial College, London.
Ogbonmwan, S.-M. and Wynn, H. P. (1986) Accelerated resampling codes with low discrepancy. Preprint, Department of Statistics and Actuarial Science, The City University.
Olshen, R. A., Biden, E. N., Wyatt, M. P. and Sutherland, D. H. (1989) Gait analysis and the bootstrap. Annals of Statistics 17, 1419-1440.
Owen, A. B. (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237-249.
Owen, A. B. (1990) Empirical likelihood ratio confidence regions. Annals of Statistics 18, 90-120.
Owen, A. B. (1991) Empirical likelihood for linear models. Annals of Statistics 19, 1725-1747.
Owen, A. B. (1992a) Empirical likelihood and small samples. In Computer Science and Statistics: Proceedings of the 22nd Symposium on the Interface, eds C. Page and R. LePage, pp. 79-88. New York: Springer.
Owen, A. B. (1992b) A central limit theorem for Latin hypercube sampling. Journal of the Royal Statistical Society series B 54, 541-551.

Parzen, M. I., Wei, L. J. and Ying, Z. (1994) A resampling method based on pivotal estimating functions. Biometrika 81, 341-350.
Paulsen, O. and Heggelund, P. (1994) The quantal size at retinogeniculate synapses determined from spontaneous and evoked EPSCs in guinea-pig thalamic slices. Journal of Physiology 480, 505-511.
Percival, D. B. and Walden, A. T. (1993) Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques. Cambridge: Cambridge University Press.
Pitman, E. J. G. (1937a) Significance tests which may be applied to samples from any populations. Journal of the Royal Statistical Society, Supplement 4, 119-130.
Pitman, E. J. G. (1937b) Significance tests which may be applied to samples from any populations: II. The correlation coefficient test. Journal of the Royal Statistical Society, Supplement 4, 225-232.
Pitman, E. J. G. (1937c) Significance tests which may be applied to samples from any populations: III. The analysis of variance test. Biometrika 29, 322-335.
Plackett, R. L. and Burman, J. P. (1946) The design of optimum multifactorial experiments. Biometrika 33, 305-325.
Politis, D. N. and Romano, J. P. (1993) Nonparametric resampling for homogeneous strong mixing random fields. Journal of Multivariate Analysis 47, 301-328.
Politis, D. N. and Romano, J. P. (1994a) The stationary bootstrap. Journal of the American Statistical Association 89, 1303-1313.
Politis, D. N. and Romano, J. P. (1994b) Large sample confidence regions based on subsamples under minimal assumptions. Annals of Statistics 22, 2031-2050.
Possolo, A. (1986) Subsampling a random field. Technical Report 78, Department of Statistics, University of Washington, Seattle.
Presnell, B. and Booth, J. G. (1994) Resampling methods for sample surveys. Technical Report 470, Department of Statistics, University of Florida, Gainesville.
Priestley, M. B. (1981) Spectral Analysis and Time Series. London: Academic Press.
Proschan, F. (1963) Theoretical explanation of observed decreasing failure rate. Technometrics 5, 375-383.
Qin, J. (1993) Empirical likelihood in biased sample problems. Annals of Statistics 21, 1182-1196.
Qin, J. and Lawless, J. (1994) Empirical likelihood and general estimating equations. Annals of Statistics 22, 300-325.
Quenouille, M. H. (1949) Approximate tests of correlation in time-series. Journal of the Royal Statistical Society series B 11, 68-84.
Rao, J. N. K. and Wu, C. F. J. (1988) Resampling inference with complex survey data. Journal of the American Statistical Association 83, 231-241.
Rawlings, J. O. (1988) Applied Regression Analysis: A Research Tool. Pacific Grove, California: Wadsworth & Brooks/Cole.
Reid, N. (1981) Estimating the median survival time. Biometrika 68, 601-608.
Reid, N. (1988) Saddlepoint methods and statistical inference (with Discussion). Statistical Science 3, 213-238.
Reynolds, P. S. (1994) Time-series analyses of beaver body temperatures. In Case Studies in Biometry, eds N. Lange, L. Ryan, L. Billard, D. R. Brillinger, L. Conquest and J. Greenhouse, pp. 211-228. New York: Wiley.
Ripley, B. D. (1977) Modelling spatial patterns (with Discussion). Journal of the Royal Statistical Society series B 39, 172-212.
Ripley, B. D. (1981) Spatial Statistics. New York: Wiley.
Ripley, B. D. (1987) Stochastic Simulation. New York: Wiley.
Ripley, B. D. (1988) Statistical Inference for Spatial Processes. Cambridge: Cambridge University Press.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.
Robinson, J. (1982) Saddlepoint approximations for permutation tests and confidence intervals. Journal of the Royal Statistical Society series B 44, 91-101.
Romano, J. P. (1988) Bootstrapping the mode. Annals of the Institute of Statistical Mathematics 40, 565-586.
Romano, J. P. (1989) Bootstrap and randomization tests of some nonparametric hypotheses. Annals of Statistics 17, 141-159.
Romano, J. P. (1990) On the behaviour of randomization tests without a group invariance assumption. Journal of the American Statistical Association 85, 686-692.
Rousseeuw, P. J. and Leroy, A. M. (1987) Robust Regression and Outlier Detection. New York: Wiley.
Royall, R. M. (1986) Model robust confidence intervals using maximum likelihood estimators. International Statistical Review 54, 221-226.
Rubin, D. B. (1981) The Bayesian bootstrap. Annals of Statistics 9, 130-134.
Rubin, D. B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

Rubin, D. B. and Schenker, N. (1986) Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association 81, 366-374.
Ruppert, D. and Carroll, R. J. (1980) Trimmed least squares estimation in the linear model. Journal of the American Statistical Association 75, 828-838.
Samawi, H. M. (1994) Power estimation for two-sample tests using importance and antithetic resampling. Ph.D. thesis, Department of Statistics and Actuarial Science, University of Iowa, Ames.
Sauerbrei, W. and Schumacher, M. (1992) A bootstrap resampling procedure for model building: application to the Cox regression model. Statistics in Medicine 11, 2093-2109.
Schenker, N. (1985) Qualms about bootstrap confidence intervals. Journal of the American Statistical Association 80, 360-361.
Seber, G. A. F. (1977) Linear Regression Analysis. New York: Wiley.
Shao, J. (1988) On resampling methods for variance and bias estimation in linear models. Annals of Statistics 16, 986-1008.
Shao, J. (1993) Linear model selection by cross-validation. Journal of the American Statistical Association 88, 486-494.
Shao, J. (1996) Bootstrap model selection. Journal of the American Statistical Association 91, 655-665.
Shao, J. and Tu, D. (1995) The Jackknife and Bootstrap. New York: Springer.
Shao, J. and Wu, C. F. J. (1989) A general theory for jackknife variance estimation. Annals of Statistics 17, 1176-1197.
Shorack, G. (1982) Bootstrapping robust regression. Communications in Statistics - Theory and Methods 11, 961-972.
Silverman, B. W. (1981) Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society series B 43, 97-99.
Silverman, B. W. (1985) Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with Discussion). Journal of the Royal Statistical Society series B 47, 1-52.
Silverman, B. W. and Young, G. A. (1987) The bootstrap: to smooth or not to smooth? Biometrika 74, 469-479.
Simonoff, J. S. and Tsai, C.-L. (1994) Use of modified profile likelihood for improved tests of constancy of variance in regression. Applied Statistics 43, 357-370.
Singh, K. (1981) On the asymptotic accuracy of Efron's bootstrap. Annals of Statistics 9, 1187-1195.
Sitter, R. R. (1992) A resampling procedure for complex survey data. Journal of the American Statistical Association 87, 755-765.
Smith, P. W. F., Forster, J. J. and McDonald, J. W. (1996) Monte Carlo exact tests for square contingency tables. Journal of the Royal Statistical Society series A 159, 309-321.
Spady, R. H. (1991) Saddlepoint approximations for regression models. Biometrika 78, 879-889.
St. Laurent, R. T. and Cook, R. D. (1993) Leverage, local influence, and curvature in nonlinear regression. Biometrika 80, 99-106.
Stangenhaus, G. (1987) Bootstrap and inference procedures for L1 regression. In Statistical Data Analysis Based on the L1-Norm and Related Methods, ed. Y. Dodge, pp. 323-332. Amsterdam: North-Holland.
Stein, C. M. (1985) On the coverage probability of confidence sets based on a prior distribution. Volume 16 of Banach Centre Publications. Warsaw: PWN-Polish Scientific Publishers.
Stein, M. (1987) Large sample properties of simulations using Latin hypercube sampling. Technometrics 29, 143-151.
Sternberg, H. O'R. (1987) Aggravation of floods in the Amazon River as a consequence of deforestation? Geografiska Annaler 69A, 201-219.
Sternberg, H. O'R. (1995) Water and wetlands of Brazilian Amazonia: an uncertain future. In The Fragile Tropics of Latin America: Sustainable Management of Changing Environments, eds T. Nishizawa and J. I. Uitto, pp. 113-179. Tokyo: United Nations University Press.
Stine, R. A. (1985) Bootstrap prediction intervals for regression. Journal of the American Statistical Association 80, 1026-1031.
Stoffer, D. S. and Wall, K. D. (1991) Bootstrapping state-space models: Gaussian maximum likelihood estimation and the Kalman filter. Journal of the American Statistical Association 86, 1024-1033.
Stone, M. (1974) Cross-validatory choice and assessment of statistical predictions (with Discussion). Journal of the Royal Statistical Society series B 36, 111-147.
Stone, M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society series B 39, 44-47.
Swanepoel, J. W. H. and van Wyk, J. W. J. (1986) The bootstrap applied to power spectral density function estimation. Biometrika 73, 135-141.

Tanner, M. A. (1996) Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Third edition. New York: Springer.
Tanner, M. A. and Wong, W. H. (1987) The calculation of posterior densities by data augmentation (with Discussion). Journal of the American Statistical Association 82, 528-550.
Theiler, J., Galdrikian, B., Longtin, A., Eubank, S. and Farmer, J. D. (1992) Using surrogate data to detect nonlinearity in time series. In Nonlinear Modeling and Forecasting, eds M. Casdagli and S. Eubank, number XII in Santa Fe Institute Studies in the Sciences of Complexity, pp. 163-188. New York: Addison-Wesley.
Therneau, T. (1983) Variance reduction techniques for the bootstrap. Ph.D. thesis, Department of Statistics, Stanford University, California.
Tibshirani, R. J. (1988) Variance stabilization and the bootstrap. Biometrika 75, 433-444.
Tong, H. (1990) Non-linear Time Series: A Dynamical System Approach. Oxford: Clarendon Press.
Tsay, R. S. (1992) Model checking via parametric bootstraps in time series. Applied Statistics 41, 1-15.
Tukey, J. W. (1958) Bias and confidence in not quite large samples (Abstract). Annals of Mathematical Statistics 29, 614.
Venables, W. N. and Ripley, B. D. (1994) Modern Applied Statistics with S-Plus. New York: Springer.
Ventura, V. (1997) Likelihood inference by Monte Carlo methods and efficient nested bootstrapping. D.Phil. thesis, Department of Statistics, University of Oxford.
Ventura, V., Davison, A. C. and Boniface, S. J. (1997) Statistical inference for the effect of magnetic brain stimulation on a motoneurone. Applied Statistics 46, to appear.
Wahrendorf, J., Becher, H. and Brown, C. C. (1987) Bootstrap comparison of non-nested generalized linear models: applications in survival analysis and epidemiology. Applied Statistics 36, 72-81.
Wand, M. P. and Jones, M. C. (1995) Kernel Smoothing. London: Chapman & Hall.
Wang, S. (1990) Saddlepoint approximations in resampling analysis. Annals of the Institute of Statistical Mathematics 42, 115-131.
Wang, S. (1992) General saddlepoint approximations in the bootstrap. Statistics and Probability Letters 13, 61-66.
Wang, S. (1993a) Saddlepoint expansions in finite population problems. Biometrika 80, 583-590.
Wang, S. (1993b) Saddlepoint methods for bootstrap confidence bands in nonparametric regression. Australian Journal of Statistics 35, 93-101.
Wang, S. (1995) Optimizing the smoothed bootstrap. Annals of the Institute of Statistical Mathematics 47, 65-80.
Weisberg, S. (1985) Applied Linear Regression. Second edition. New York: Wiley.
Welch, B. L. and Peers, H. W. (1963) On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society series B 25, 318-329.
Welch, W. J. (1990) Construction of permutation tests. Journal of the American Statistical Association 85, 693-698.
Welch, W. J. and Fahey, T. J. (1994) Correcting for covariates in permutation tests. Technical Report STAT-94-12, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario.
Westfall, P. H. and Young, S. S. (1993) Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment. New York: Wiley.
Woods, H., Steinour, H. H. and Starke, H. R. (1932) Effect of composition of Portland cement on heat evolved during hardening. Industrial Engineering and Chemistry 24, 1207-1214.
Wu, C. J. F. (1986) Jackknife, bootstrap and other resampling methods in regression analysis (with Discussion). Annals of Statistics 14, 1261-1350.
Wu, C. J. F. (1990) On the asymptotic properties of the jackknife histogram. Annals of Statistics 18, 1438-1452.
Wu, C. J. F. (1991) Balanced repeated replications based on mixed orthogonal arrays. Biometrika 78, 181-188.
Young, G. A. (1986) Conditioned data-based simulations: some examples from geometrical statistics. International Statistical Review 54, 1-13.
Young, G. A. (1990) Alternative smoothed bootstraps. Journal of the Royal Statistical Society series B 52, 477-484.
Young, G. A. and Daniels, H. E. (1990) Bootstrap bias. Biometrika 77, 179-185.

Name Index

Abelson, R. P. 403
Akaike, H. 316
Akritas, M. G. 124
Altman, D. G. 375
Amis, G. 253
Andersen, P. K. 124, 128, 353, 375
Andrews, D. F. 360
Appleyard, S. T. 417
Athreya, K. B. 60
Atkinson, A. C. 183, 315, 325
Bai, C. 315
Bai, Z. D. 427
Bailer, A. J. 384
Banks, D. L. 515
Barbe, P. 60, 516
Barnard, G. A. 183
Barndorff-Nielsen, O. E. 183, 246, 486, 514
Becher, H. 379
Beckman, R. J. 486
Benes, F. M. 428
Beran, J. 426
Beran, R. J. 125, 183, 184, 187, 246,
250, 315
Berger, J. O. 515
Bernardo, J. M. 515
Bertail, P. 60, 516
Besag, J. E. 183, 184, 185
Bickel, P. J. 60, 123, 125, 129, 315,
487, 494
Biden, E. N. 316
Bissell, A. F. 253, 383, 497
Bithell, J. F. 428
Bloomfield, P. 426
Boniface, S. J. 418, 428
Boos, D. D. 515
Booth, J. G. 125, 129, 246, 247, 251,
374, 486, 487, 488, 491, 493


Borgan, Ø. 124, 128, 353
Bose, A. S. 427
Bowman, A. W. 375
Box, G. E. P. 323
Bratley, P. 486
Braun, W. J. 427, 430
Breiman, L. 316
Breslow, N. 378
Bretagnolle, J. 60
Brillinger, D. R. x, 388, 426, 427
Brockwell, P. J. 426, 427
Brown, B. W. 382
Brown, C. C. 379
Buckland, S. T. 246
Bühlmann, P. 427
Bunke, O. 316
Burman, J. P. 60
Burman, P. 316, 321
Burr, D. 124, 133, 374
Burns, E. 300
Butler, R. W. 125, 129, 487, 493
Canty, A. J. x, 135, 246
Carlstein, E. 427
Carpenter, J. R. 246, 250
Carroll, R. J. 310, 325
Chambers, J. M. 374, 375
Chao, M.-T. 125
Chapman, P. 60, 125
Chen, C. 427
Chen, C.-H. 375
Chen, S. X. 169, 514, 515
Chen, Z. 487
Claridge, G. 157
Clifford, P. 183, 184, 185
Cobb, G. W. 241
Cochran, W. G. 7, 125
Collings, B. J. 184

Conover, W. J. 486
Cook, R. D. 125, 315, 316, 375
Corcoran, S. A. 515
Cowling, A. 428, 432, 436
Cox, D. R. 124, 128, 183, 246, 287, 323, 324, 428, 486, 514
Cressie, N. A. C. 72, 428
Dahlhaus, R. 427, 431
Daley, D. J. 428
Daly, F. 68, 182, 436, 520
Daniels, H. E. 59, 486, 492
Darby, S. C. 428
Davis, R. A. 426, 427
Davison, A. C. 66, 135, 246, 316, 374,
427, 428, 486, 487, 492, 493, 515,
517, 518
De Angelis, D. 2, 60, 124, 316, 343
Demétrio, C. G. B. 338
Dempster, A. P. 124
Diaconis, P. 60, 486
DiCiccio, T. J. 68, 124, 246, 252, 253,
487, 493, 515, 516
Diggle, P. J. 183, 392, 423, 426, 428
Do, K.-A. 486, 487
Dobson, A. J. 374
Donegani, M. 184, 187
Doss, H. 124, 374
Draper, N. R. 315
Droge, B. 316
Dubowicz, V. 417
Ducharme, G. R. 126
Easton, G. S. 487
Efron, B. ix, 59, 60, 61, 66, 68, 123, 124, 125, 128, 130, 132, 133, 134, 183, 186, 246, 249, 252, 253, 308, 315, 316, 375, 427, 486, 488, 515
Embleton, B. J. J. 236, 506
Eubank, S. 427, 430, 435
Fahey, T. J. 185


Fang, K.-T. 486


Faraway, J. J. 125, 375
Farmer, J. D. 427, 430, 435
Feigl, P. 328
Feller, W. 320
Fernholtz, L. T. 60
Ferretti, N. 427
Field, C. 486
Firth, D. 374, 377, 383
Fisher, N. I. 236, 506, 515, 517
Fisher, R. A. 183, 186, 322
Fleming, T. R. 124
Forster, J. J. 183, 184
Fox, B. L. 486
Franke, J. 427
Freedman, D. A. 60, 125, 129, 315,
427
Freeman, D. H. 378
Frets, G. P. 115
Friedman, J. H. 316
Galdrikian, B. 427, 430, 435
García-Soidán, P. H. 428
Garthwaite, P. H. 246
Gatto, R. x, 487
Geisser, S. 247, 316
George, S. L. 375
Geyer, C. J. 178, 183, 372, 486
Gigli, A. 486
Gilks, W. R. 2, 183, 343
Gill, R. D. 124, 128, 353
Gleason, J. R. 486, 488
Glosup, J. 383
Gong, G. 375
Götze, F. 60, 427
Graham, R. L. 486, 489
Gray, H. L. 59
Green, P. J. 375
Gross, S. 125
Haldane, J. B. S. 487
Hall, P. ix, x, 59, 60, 62, 124, 125, 129, 183, 246, 247, 248, 251, 315, 316, 321, 375, 378, 379, 427, 428, 429, 432, 436, 486, 487, 488, 491, 493, 514, 515, 516, 517
Halloran, M. E. 246
Hamilton, M. A. 184

Hammersley, J. M. 486
Hampel, F. R. 60
Hand, D. J. 68, 182, 436, 520
Handscomb, D. C. 486
Härdle, W. 316, 375, 427
Harrington, D. P. 124
Hartigan, J. A. 59, 60, 427, 430
Hastie, T. J. 374, 375
Hawkins, D. M. 316
Hayes, K. G. 123
Heggelund, P. 189
Heller, G. 374
Herzberg, A. M. 360
Hesterberg, T. C. 60, 66, 486, 490, 491
Hettmansperger, T. P. 316
Hinkley, D. V. 60, 63, 66, 125, 135,
183, 246, 247, 250, 318, 383, 486,
487, 489, 490, 492, 493, 515, 517,
518
Hirose, H. 347, 381
Hjort, N. L. 124, 374
Holmes, S. 60, 246, 486
Horowitz, J. L. 427, 429
Horváth, L. 374
Hosmer, D. W. 361
Hu, F. 318
Huet, S. 375
Hyde, J. 131
Isham, V. 428
Janas, D. 427, 431
Jennison, C. 183, 184, 246
Jensen, J. L. 486
Jeong, J. 315
Jhun, M. 126
Jing, B.-Y. 427, 429, 487, 515, 517
Jöckel, K.-H. 183
John, P. W. M. 486, 489
Johns, M. V. 486, 490
Jolivet, E. 375
Jones, M. C. x, 128
Journel, A. G. 428
Kabaila, P. 246, 250, 427
Kalbfleisch, J. D. 124
Kaplan, E. L. 124
Karr, A. F. 428

Katz, R. 282
Keenan, D. M. 428
Keiding, N. 124, 128, 353
Kendall, D. G. 124
Kendall, W. S. 124
Kim, J.-H. 124
Klaassen, C. A. J. 123
Kulperger, R. J. 427, 430
Künsch, H. R. 427
Lahiri, S. N. 427
Laird, N. M. 124, 125
Lange, N. 428
La Scala, B. 514
Lawless, J. 514
Lawson, A. B. 428
Lee, S. M. S. 246
Léger, C. 125
Lehmann, E. L. 183
Lemeshow, S. 361
Leroy, A. M. 315
Lewis, P. A. W. 428
Lewis, T. 236, 506
Li, G. 514
Li, H. 427
Li, K.-C. 316
Liu, R. Y. 315, 427
Lloyd, C. J. 515
Lo, S.-H. 125, 374
Loader, C. 375
Loh, W.-Y. 183, 246
Longtin, A. 427, 430, 435
Louis, T. A. 125
Lunn, A. D. 68, 182, 436, 520
Maddala, G. S. 315, 427
Mallows, C. L. 316
Mammen, E. 60, 315, 316
Manly, B. F. J. 183
Marriott, F. H. C. 183
Marron, J. S. 375
Martin, M. A. 125, 183, 246, 251, 487,
493
McCarthy, P. J. 59, 60, 125
McConway, K. J. 68, 182, 436, 520
McCullagh, P. 66, 374, 553
McDonald, J. W. 183, 184

McKay, M. D. 486
McKean, J. W. 316
McLachlan, G. J. 375
Meier, P. 124
Messean, A. 375
Milan, L. 125
Miller, R. G. 59, 84
Monahan, J. F. 515
Monti, A. C. 514
Morgenthaler, S. 486
Moulton, L. H. 374, 376, 377
Muirhead, C. R. 428
Murphy, S. A. 515
Mykland, P. A. 515
Nelder, J. A. 374
Newton, M. A. 178, 183, 486, 515
Niederreiter, H. 486
Nordgaard, A. 427
Noreen, E. W. 184
Oakes, D. 124, 128
Ogbonmwan, S.-M. 486, 515
Olshen, R. A. 315, 316
Oris, J. T. 384
Ostrowski, E. 68, 182, 436, 520
Owen, A. B. 486, 514, 515, 550
Parzen, M. I. 250
Paulsen, O. 189
Peers, H. W. 515
Percival, D. B. 426
Perl, M. L. 123
Peters, S. C. 315, 427
Phillips, M. J. 428, 432, 436
Pitman, E. J. G. 183
Plackett, R. L. 60
Politis, D. N. 60, 125, 427, 429
Possolo, A. 428
Pregibon, D. 374
Prentice, R. L. 124
Presnell, B. 125, 129
Priestley, M. B. 426
Proschan, F. 4, 218
Qin, J. 514
Quenouille, M. H. 59
Raftery, A. E. 515
Rao, J. N. K. 125, 130


Rawlings, J. O. 356
Reid, N. 124, 486
Reynolds, P. S. 435
Richardson, S. 183
Ripley, B. D. x, 183, 282, 315, 316,
361, 374, 375, 417, 428, 486
Ritov, Y. 123
Robinson, J. 486, 487

Romano, J. P. 60, 124, 125, 126, 183, 246, 427, 429, 515, 516
Romo, J. 427
Ronchetti, E. M. 60, 486, 487
Rousseeuw, P. J. 60, 315
Rowlingson, B. S. 428
Royall, R. M. 63
Rubin, D. B. 124, 125, 515
Ruppert, D. 310, 325
St. Laurent, R. T. 375
Samawi, H. M. 184
Sauerbrei, W. 375
Schechtman, E. 66, 247, 486, 487
Schenker, N. 246, 515
Schrage, L. E. 486
Schucany, W. R. 59
Schumacher, M. 375
Seber, G. A. F. 315
Shao, J. 60, 125, 246, 315, 316, 375,
376
Sheather, S. J. 316
Shi, S. 183, 246, 250, 486, 489, 490
Shorack, G. 316
Shotton, D. M. 417
Silverman, B. W. 124, 128, 189, 363,
375
Simonoff, J. S. 269
Singh, K. 246, 315, 374, 427
Sitter, R. R. 125, 129
Smith, H. 315
Smith, P. W. F. 183, 184
Snell, E. J. 287, 324, 374
Snowden, C. B. 125
Spady, R. H. 487, 515
Spiegelhalter, D. J. 183
Stahel, W. A. 60
Stangenhaus, G. 316
Starke, H. R. 277
Stein, C. M. 60, 515
Stein, M. 486
Steinour, H. H. 277
Sternberg, H. O'R. x, 388, 389, 427
Stine, R. A. 315
Stoffer, D. S. 427
Stone, C. J. 316
Stone, M. 316
Stone, R. A. 428
Sutherland, D. H. 316
Swanepoel, J. W. H. 427
Tanner, M. A. 124, 125
Theiler, J. 427, 430, 435
Therneau, T. 486
Tibshirani, R. J. ix, 60, 125, 246, 316, 375, 427, 515
Titterington, D. M. 183
Tong, H. 394, 426
Truong, K. N. 126
Tsai, C.-L. 269, 375
Tsay, R. S. 427
Tu, D. 60, 125, 246, 376
Tukey, J. W. 59, 403, 486
van Wyk, J. W. J. 427
van Zwet, W. R. 60
Venables, W. N. 282, 315, 361, 374,
375
Venkatraman, E. S. 374
Ventura, V. x, 428, 486, 492
Vere-Jones, D. 428
Wahrendorf, J. 379
Walden, A. T. 426
Wall, K. D. 427
Wand, M. P. 128
Wang, S. 124, 318, 486, 487
Wang, Y. 486
Wei, B. C. 63, 375, 487
Wei, L. J. 250
Weisberg, S. 125, 257, 315, 316
Welch, B. L. 515
Welch, W. J. 183, 185
Wellner, J. A. 123
Westfall, P. H. 184
Whittaker, J. 125
Wilson, S. R. 378, 379


Witkowski, J. A. 417
Wong, W. H. 124
Wood, A. T. A. 247, 251, 486, 488, 491, 515, 517
Woods, H. 277
Worton, B. J. 486, 487, 493, 515, 517, 518
Wu, C. J. F. 60, 125, 130, 315, 316
Wyatt, M. P. 316
Wynn, H. P. 515
Yahav, J. A. 487, 494
Yandell, B. S. 374
Ying, Z. 250
Young, G. A. x, 59, 60, 124, 128, 246, 316, 428, 486, 487, 493
Young, S. S. 184
Zeger, S. L. 374, 376, 377
Zelen, M. 328
Zidek, J. V. 318

Example index

accelerated life test, 346, 379
adaptive test, 187, 188
AIDS data, 1, 342, 369
air-conditioning data, 4, 15, 17, 19, 25, 27, 30, 33, 36, 197, 199, 203, 205, 207, 209, 216, 217, 233, 501, 508, 513, 520
Amis data, 253
AML data, 83, 86, 146, 160, 187
antithetic bootstrap, 493
association, 421, 422
autoregression, 388, 391, 393, 398, 432, 434
average, 13, 15, 17, 19, 22, 25, 27, 30, 33, 36, 47, 51, 88, 92, 94, 98, 128, 501, 508, 513, 516
axial data, 234, 505
balanced bootstrap, 440, 441, 442, 445, 487, 488, 489, 494
Bayesian bootstrap, 513, 518, 520
beaver data, 434
bias estimation, 106, 440, 464, 466, 488, 492, 495
binomial data, 338, 359, 361
bivariate missing data, 90, 128
block bootstraps, 398, 401, 403, 432
bootstrap likelihood, 508, 517, 518
bootstrap recycling, 464, 466, 492, 496
brambles data, 422
Breslow data, 378
calcium uptake data, 355, 441, 442
capability index, 248, 253, 497
carbon monoxide data, 67
cats data, 321
caveolae data, 416, 425
CD4 data, 68, 134, 190, 251, 252, 254
cement data, 277
changepoint estimation, 241
Channing House data, 131
circular data, 126, 517, 520
city population data, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 108, 110, 113, 118, 201, 238, 439, 440, 447, 464, 473, 490, 492, 513
Claridge data, 157, 158, 496
cloth data, 382
coal-mining data, 435
comparison of means, 159, 162, 163, 166, 171, 172, 176, 181, 186, 454, 457, 519
comparison of variable selection methods, 306
convex regression, 371
correlation coefficient, 48, 61, 63, 68, 80, 90, 108, 115, 157, 158, 187, 247, 251, 254, 475, 493, 496, 518
correlogram, 388
Darwin data, 186, 188, 471, 481, 498
difference of means, 71, 75
dogs data, 187
double bootstrap, 176, 177, 224, 226, 254, 464, 466, 469
Down's syndrome data, 371
ducks data, 134
eigenvalue, 64, 134, 252, 277, 445, 447
empirical exponential family likelihood, 505, 516, 520
empirical likelihood, 501, 516, 519, 520
equal marginal distributions, 78
exponential mean, 15, 17, 19, 30, 61, 176, 224, 250, 510
exponential model, 188, 328, 334, 367
factorial experiment, 320, 322
fir seedlings data, 142
Frets' heads data, 115, 447
gamma model, 5, 25, 36, 148, 207, 233, 247, 376
generalized additive model, 367, 369, 371, 382, 383
generalized linear model, 328, 334, 338, 342, 367, 376, 378, 381, 383
gravity data, 72, 121, 131, 454, 457, 494, 519
handedness data, 157, 496
hazard ratio, 221
head size data, 115
heart disease data, 378
hypergeometric distribution, 487
importance sampling, 454, 457, 461, 464, 466, 489, 490, 491, 495
imputation, 88, 90
independence, 177
influence function, 48, 53
intensity estimate, 418
Islay data, 520
isotonic regression, 371
jackknife, 51, 64, 65, 317
jackknife-after-bootstrap, 115, 130, 134, 313, 325
K-function, 416, 422
kernel density estimate, 226, 413, 469
kernel intensity estimate, 418, 431
laterite data, 234, 505
leukaemia data, 328, 334, 367
likelihood ratio statistic, 62, 148, 247, 346, 501
linear approximation, 118, 468, 490
logistic regression, 141, 146, 338, 359, 361, 371, 376
log-linear model, 342, 369
lognormal model, 66, 148
low birth weights data, 361
lynx data, 432
maize data, 181
mammals data, 257, 262, 265, 324
matched pairs, 186, 187, 188, 492
MCMC, 146, 184, 185
mean, see average
mean polar axis, 234, 505
median, see sample median
median survival time, 86
melanoma data, 352
misclassification error, 359, 361, 381
missing data, 88, 90, 128
mixed continuous-discrete distributions, 78
model selection, 304, 306, 393, 432
motorcycle impact data, 363, 365
multinomial distribution, 66, 487
multiple regression, 276, 277, 281, 286, 287, 298, 300, 304, 306, 309, 313
neurophysiological point process data, 418
neurotransmission data, 189
Nile data, 241
nitrofen data, 383
nodal involvement data, 381
nonlinear regression, 355, 441, 442
nonlinear time series, 393, 401
nonparametric regression, 365
normal plot, 150, 152, 154
normal prediction limit, 244
normal variance, 208
nuclear power data, 286, 298, 304, 323
one-way model, 276, 319, 320
overdispersion, 142, 338, 342
paired comparison, 471, 481, 498
partial correlation, 115
Paulsen data, 189
periodogram, 388
periodogram resampling, 413, 430
PET film data, 346, 379
phase scrambling, 410, 430, 435
point process data, 416, 418, 421
poisons data, 322
Poisson process, 416, 418, 422, 425, 431, 435
Poisson regression, 342, 369, 378, 382, 383
prediction, 244, 286, 287, 323, 324, 342
prediction error, 298, 300, 320, 321, 359, 361, 369, 381, 393, 401
product-limit estimator, 86, 128
proportional hazards, 146, 160, 221, 352
quantile, 48, 253, 352
quartzite data, 520
ratio, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 98, 108, 110, 113, 118, 126, 127, 165, 201, 217, 238, 249, 439, 447, 464, 473, 490, 513
regression, see convex regression, generalized additive model, generalized linear model, logistic regression, log-linear model, multiple regression, nonlinear regression, nonparametric regression, robust regression, straight-line regression
regression prediction, 286, 287, 298, 300, 320, 321, 323, 324, 342, 359, 361, 369, 381
reliability data, 346, 379
remission data, 378
returns data, 269, 272, 449, 461
Richardson extrapolation, 494
Rio Negro data, 388, 398, 403, 410
robust M-estimate, 318, 471, 483
robust regression, 308, 309, 313, 318, 324
robust variance, 265, 318, 376
rock data, 281, 287
saddlepoint approximation, 468, 469, 471, 473, 475, 477, 481, 483, 492, 493, 497
salinity data, 309, 313, 324
sample maximum, 39, 56, 247
sample median, 41, 61, 65, 80
sample variance, 61, 62, 64, 104, 481
separate families test, 148
several samples, 72, 126, 131, 133, 519
simulated data, 306
smoothed bootstrap, 80, 127, 168, 169, 418, 431
spatial association, 421, 422
spatial clustering, 416
spatial epidemiology, 421
spectral density estimation, 413
spherical data, 126, 234, 505
spline model, 365
stationary bootstrap, 398, 403, 428, 429
straight-line regression, 257, 262, 265, 269, 272, 308, 317, 321, 322, 449, 461
stratified ratio, 98
Strauss process, 416, 425
studentized statistic, 477, 481, 483
sugar cane data, 338
sunspot data, 393, 401, 435
survival probability, 86, 131
survival proportion data, 308, 322
survival time data, 328, 334, 346, 352, 367
survivor functions, 83
symmetric distribution, 78, 251, 471, 483
tau particle data, 133, 495
test of correlation, 157
test for overdispersion, 142, 184
test for regression coefficient, 269, 281, 313
test of interaction, 322
tile resampling, 425, 432
times on delivery suite data, 300
traffic data, 253
transformation, 33, 108, 118, 169, 226, 322, 355, 418
trend test in time series, 403, 410
trimmed average, 64, 121, 130, 133
tuna data, 169, 228, 469
two-sample problem, see comparison of means
two-way model, 177, 184, 338
unimodality, 168, 169, 189
unit root test, 391
urine data, 359
variable selection, 304, 306
variance estimation, 208, 446, 464, 488, 495
Weibull model, 346, 379
weighted average, 72, 126, 131
weird bootstrap, 128
Wilcoxon test, 181
wild bootstrap, 272, 319
wool prices data, 391

Subject index

abc.ci, 536
ABC method, see confidence interval
Abelson-Tukey coefficients, 403


accelerated life test example, 346, 379
adaptive estimation, 120-123, 125, 133
adaptive test, 173-174, 184, 187, 188
aggregate prediction error, 290-301, 316, 320-321, 358-362
AIDS data example, 1, 342, 369
air conditioning data example, 4, 15, 17, 19, 25, 27, 30, 33, 36, 149, 188, 197, 199, 203, 205, 207, 209, 216, 217, 233, 501, 508-512
Akaike's information criterion, 316, 394, 432
algorithms
  K-fold adjusted cross-validation, 295
  balanced bootstrap, 439, 488
  balanced importance resampling, 460, 491
  Bayesian bootstrap, 513
  case resampling in regression, 264
  comparison of generalized linear and generalized additive models, 367
  conditional bootstrap for censored data, 84
  conditional resampling for censored survival data, 351
  double bootstrap for bias adjustment, 104
  inhomogeneous Poisson process, 431
  model-based resampling in linear regression, 262
  phase scrambling, 408
  prediction in generalized linear models, 341
  prediction in linear regression, 285
  resampling errors with unequal variances, 271
  resampling for censored survival data, 351
  stationary bootstrap, 428
  superpopulation bootstrap, 94
all-subsamples method, 57
AML data example, 83, 86, 146, 160, 187, 221
analysis of deviance, 330-331, 367-369
ancillary statistics, 43, 238, 241
antithetic bootstrap, 493
apparent error, 292
assessment set, 292
autocorrelation, 386, 431
autoregressive process, 386, 388, 389, 392, 395, 398, 399, 400, 401, 410, 414, 432, 433, 434
  simulation, 390-391
autoregressive-moving average process, 386, 408
average, 4, 8, 13, 15, 17, 19, 22, 25, 27, 30, 33, 36, 47, 51, 88, 90, 92, 94, 98, 129, 130, 197, 199, 203, 205, 207, 209, 216, 251, 501, 508, 512, 513, 516, 518, 520
  comparison of several, 163
  comparison of two, 159, 162, 166, 171, 172, 186, 454, 457, 519
  finite population, 94, 98, 129
balanced resampling, 440
  bias of, 103
  post-simulation balance, 488
  sensitivity analysis for, 114, 117
Bayesian bootstrap, 512-514, 515, 518, 520
BCa method, see confidence interval
beaver data example, 434
bias correction, 103-107
bias estimator, 16-18
  adjusted, 104, 106-107, 130, 442, 464, 466, 492
binary data, 78, 359-362, 376, 377, 378
binomial data, 338
binomial process, 416
bivariate distribution, 78, 90-92, 128
block resampling, 396-408, 427, 428, 432
boot, 525-528, 538, 548
  balanced, 545
  m, 538
  mle, 528, 538, 540, 543
  parametric, 534
  ran.gen, 528, 538, 540, 543
  sim, 529, 534
  statistic, 527, 528
  strata, 531
  stype, 527
  weights, 527, 536, 546
boot.array, 526
boot.ci, 536
bootstrap
  adjustment, 103-107, 125, 130, 175-180, 223-230
  antithetic, 487, 493
  asymptotic accuracy, 39-41, 211-214
  balanced, 438-446, 486, 494-499
    algorithm, 439, 488
    bias estimate, 438-440, 488
    conditions for success, 445
    efficiency, 445, 460, 461, 495
    experimental design, 441, 486, 489
    first-order, 439, 486, 487-488
    higher-order, 441, 486, 489
    theory, 443-445, 487
  balanced importance resampling, 460-463, 486, 496
  Bayesian, 512-514, 515, 518, 520
  block, 396-408, 427, 428, 433
  calibration, 246
  case resampling, 84
  conditional, 84, 124, 132, 351, 374, 474
  consistency, 37-39
  discreteness, 27, 61
  double, 103-113, 122, 125, 130, 175-180, 223-230, 254, 373, 463-466, 469, 486, 497, 507-509
    theory for, 105-107, 125
  generalized, 56
  hierarchical, 100-102, 125, 130, 288
  imputation, 89-92, 124-125
  jittered, 124
  mirror-match, 93, 125, 129
  model-based, 349, 433, 434
  nested, see double
  nonparametric, 22
  parametric, 15-21, 261, 333, 334, 339, 344, 347, 373, 378, 379, 416, 528, 534
  population, 94, 125, 129
  post-blackened, 397, 399, 433
  post-simulation balance, 441-445, 486, 488, 495
  quantile, 18-21, 36, 69, 441, 442, 448-450, 453-456, 457-463, 468, 490
  recycling, 463-466, 486, 492, 496
  robustness, 264
  shrunk smoothed, 79, 81, 127
  simulation size, 17-21, 34-37, 69, 155-156, 178-180, 183, 185, 202, 226, 246, 248
  smoothed, 79-81, 124, 127, 168, 169, 310, 418, 431, 531
  spectral, 412-415, 427, 430
  stationary, 398-408, 427, 428-429, 433
  stratified, 89, 90, 306, 340, 344, 365, 371, 457, 494, 531
  superpopulation, 94, 125, 129
  symmetrized, 78, 122, 169, 471, 485
  tile, 424-426, 428, 432
  tilted, 166-167, 452-456, 459, 462, 546-547
  weighted, 60, 514, 516
  weird, 86-87, 124, 128, 132
  wild, 272-273, 316, 319, 538
bootstrap diagnostics, 113-120, 125
  bias function, 108, 110, 464-465
  jackknife-after-bootstrap, 113-118, 532
  linearity, 118-120
  variance function, 107-111, 464-465
bootstrap frequencies, 22-23, 66, 76, 110-111, 438-445, 464, 526, 527
bootstrap likelihood, 507-509, 515, 517, 518
bootstrap recycling, 463-466, 487, 492, 497, 508
bootstrap test, see significance test
Box-Cox transformation, 118
brambles data example, 422
Breslow estimator, 350
calcium uptake data example, 355,
441, 442
capability index example, 248, 253,
497
carbon monoxide data, 67
cats data example, 321
caveolae data example, 416, 425
CD4 data example, 68, 134, 190, 251,
252, 254
cement data example, 277
censboot, 532, 541
censored data, 82-87, 124, 128, 131,
160,
346-353, 514, 532, 541
changepoint model, 241
Channing House data example, 131
choice of estimator, 120-123, 125, 134
choice of predictor, 301-305
choice of test statistic, 173, 180, 184,
187
circular data, 126, 517, 520

city population data example, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 108, 110, 113, 118, 201, 238, 249, 440, 447, 464, 473, 490, 492, 513
Claridge data example, 157, 158, 496
cloth data example, 382
coal-mining data example, 435
collinearity, 276-278
complementary set partitions, 552, 554
complete enumeration, 27, 60, 438,
440, 486
conditional inference, 43, 138, 145,
238-243, 247, 251
confidence band, 375, 417, 420, 435
confidence interval
ABC, 214-220, 231, 246, 511, 536
BCa, 203-213, 246, 249, 336-337,
383, 536
basic bootstrap, 28-29, 194-195,
199, 213-214, 337, 365, 374,
383, 435
coefficient, 191
comparison of methods, 211-214,
230-233, 246, 336-338
conditional, 238-243, 247, 251
double bootstrap, 223-230, 250,
254, 374, 469
normal approximation, 14, 194, 198,
337,
374, 383, 435
percentile method, 202-203,
213-214, 336-337, 352, 383
profile likelihood, 196, 346
studentized bootstrap, 29-31, 95,
125, 194-196, 199, 212,
227-228, 231, 233, 246, 248,
250,
336-337, 391, 449, 454,
483-485
test inversion, 220-223, 246
confidence limits, 193
confidence region, 192, 231-237,
504-506
consistency, 13
contingency table, 177, 183, 184, 342
control, 545
control methods, 446-450, 486
  bias estimation, 446-448, 496
  efficiency, 447, 448, 450, 462
  importance resampling weight, 456
  linear approximation, 446, 486, 495
  quantile estimation, 446-450, 461-463, 486, 495-496
  saddlepoint approximation, 449
  variance estimation, 446-448, 495
Cornish-Fisher expansion, 40, 211, 449
correlation estimate, 48, 61, 63, 68, 69, 80, 90-92, 108, 115-116, 134, 138, 157, 158, 247, 251, 254, 266, 475, 493
correlogram, 386, 389
  partial, 386, 389
coverage process, 428
cross-validation, 153, 292-295, 296-301, 303, 305-307, 316, 320, 321, 324, 360-361, 365, 377, 381
  K-fold, 294-295, 316, 320, 324, 360-361, 381
cumulant-generating function, 66, 466, 467, 472, 479, 551-553
  approximate, 476-478, 482, 492
  paired comparison test, 492
cumulants, 551-553
  approximate, 476
  generalized, 552
cumulative hazard function, 82, 83, 86, 350
Darwin data example, 186, 188, 471, 481, 498
defensive mixture, see defensive mixture distribution
defensive mixture distribution, 457-459, 462, 464, 486, 496
delivery suite data example, 300
delta method, 45-46, 195, 227, 233, 419, 432, see also nonparametric delta method
density estimate, see kernel density estimate
deviance, 330-331, 332, 335, 367-369, 370, 373, 378, 382
deviance residuals, see regression residuals
diagnostics, see bootstrap diagnostics
difference of means, see average, comparison of two
directional data, 126, 234, 505, 515, 517, 520
dirty data, 44
discreteness effects, 26-27, 61
dispersion parameter, 327, 328, 331, 339, see also overdispersion
distribution
  F, 331, 368
  t, 81, 331, 484
  Bernoulli, 376, 378, 381, 474, 475
  beta, 187, 248, 377
  beta-binomial, 338, 377
  binomial, 86, 128, 327, 333, 338, 377
  bivariate normal, 63, 80, 91, 108, 128
  Cauchy, 42, 81
  chi-squared, 139, 142, 163, 233, 234, 237, 303, 330, 335, 368, 373, 378, 382, 484, 500, 501, 503, 504, 505, 506
  Dirichlet, 513, 518
  double exponential, 516
  empirical, see empirical distribution function, empirical exponential family
  exponential, 4, 81, 82, 130, 132, 176, 188, 197, 203, 205, 224, 249, 328, 334, 336, 430, 491, 503, 521
  exponential family, 504-507, 516
  gamma, 5, 131, 149, 207, 216, 230, 233, 247, 328, 332, 334, 376, 503, 512, 513, 521
  geometric, 398, 428
  hypergeometric, 444, 487
  least-favourable, 206, 209
  lognormal, 66, 149, 336
  multinomial, 66, 111, 129, 443, 446, 452, 468, 473, 491, 492, 493, 501, 502, 517, 519
  multivariate normal, 445, 552
  negative binomial, 337, 344, 345, 371
  normal, 10, 150, 152, 154, 208, 244, 327, 485, 488, 489, 518, 551
  Poisson, 327, 332, 333, 337, 342, 344, 345, 370, 378, 382, 383, 416, 419, 431, 473, 474, 493, 516
  posterior, see posterior distribution
  prior, see prior distribution
  slash, 485
  tilted, see exponential tilting
  Weibull, 346, 379
dogs data example, 187
Down's syndrome data example, 371
ducks data example, 134
Edgeworth expansion, 39-41, 60, 408, 476-477
EEF.profile, 550
eigenvalue example, 64, 134, 252, 278, 445, 447
eigenvector, 505
EL.profile, 550
empinf, 530
empirical Bayes, 125
empirical distribution function, 11-12, 60-61, 128, 501, 508
  as model, 108
  marginal, 267
  missing data, 89-91
  residuals, 77, 181, 261
  several, 71, 75
  smoothed, 79-81, 127, 169, 227, 228
  symmetrized, 78, 122, 165, 169, 228, 251
  tilted, 166-167, 183, 209-210, 452-456, 459, 504
empirical exponential family likelihood, 504-506, 515, 516, 520
empirical influence values, 46-47, 49, 51-53, 54, 63, 64, 65, 75, 209, 210, 452, 461, 462, 476, 517
  generalized linear models, 376
  linear regression, 260, 275, 317
  numerical approximation of, 47, 51-53, 76
  several samples, 75, 127, 210
  see also influence values
empirical likelihood, 500-504, 509, 512, 514-515, 516, 517, 519, 520
empirical likelihood ratio statistic, 501, 503, 506, 515
envelope test, see graphical test
equal marginal distributions example,
78
error rate, 137, 153, 174, 175
estimating function, 50, 63, 105, 250,
318, 329, 470-471, 478, 483, 504,
505,
514, 516
excess error, 292, 296
exchangeability, 143, 145
expansion
Cornish-Fisher, 40, 211, 449
cubic, 475-478
Edgeworth, 39-41, 60, 411,
476-478, 487
linear, 47, 51, 69, 75, 76, 118, 443,
446,
468
notation, 39
quadratic, 50, 66, 76, 443
Taylor series, 45, 46
experimental design
relation to resampling, 58, 439, 486
exponential mean example, 15, 17, 19,
30, 61, 176, 250, 510
exponential quantile plot, 5, 188
exponential tilting, 166-167, 183,
209-210, 452-454, 456-458,
461-463, 492, 495, 504, 517, 535,
546, 547
exp.tilt, 535
factorial experiment, 320, 322
finite population sampling, 92-100,
125, 128, 129, 130, 474
fir seedlings data, 142
Fisher information, 193, 206, 349, 516
Fourier frequencies, 387
Fourier transform, 387
empirical, 388, 408, 430
fast, 388
inverse, 387
frequency array, 23, 52, 443
frequency smoothing, 110, 456, 462,
463, 464-465, 496, 508
Frets' heads data example, 115, 447
gamma model, 5, 25, 62, 131, 149,
207, 216, 233, 247, 376
generalized additive model, 366-371,
375, 382, 383

Subject index

generalized likelihood ratio, 139


generalized linear model, 327-346,
368, 369, 374, 376-377, 378,
381-384, 516
comparison of resampling schemes
for, 336-338
graphical test, 150-154, 183, 188, 416,
422
gravity data example, 72, 121, 131,
150, 152, 154, 162, 163, 166, 171,
172,
454, 457, 494, 519
Greenwood's formula, 83, 128
half-sample methods, 57-59, 125
handedness data example, 157, 158,
496
hat matrix, 258, 275, 278, 318, 330
hazard function, 82, 146-147,
221-222, 350
heads data example, see Frets' heads data example
heart disease data example, 378
heteroscedasticity, 259-260, 264, 269,
270-271, 307, 318, 319, 323, 341,
363, 365
hierarchical data, 100-102, 125, 130,
251-253, 287-289, 374
Huber M-estimate, see robust
M-estimate example
hypergeometric distribution, 487
hypothesis test, see significance test
implied likelihood, 511-512, 515, 518
imp.moments, 546
importance resampling, 450-466, 486,
491, 497
balanced, 460-463
algorithm, 491
efficiency, 461, 462
efficiency, 452, 458, 461, 462, 486
improved estimators, 456-460
iterated bootstrap confidence
intervals, 486
quantile estimation, 453-456, 457,
495
ratio estimator, 456, 459, 464, 486,
490
raw estimator, 459, 464, 486
regression, 486

regression estimator, 457, 459, 464,


486, 491
tail probability estimate, 452, 455
time series, 486
weights, 451, 455, 456-457, 458, 464
importance sampling, 450-452, 489
efficiency, 452, 456, 459, 460, 462,
489
identity, 116, 451, 463
misapplication, 453
quantile estimate, 489, 490
ratio estimator, 490
raw estimator, 451
regression estimator, 491
tail probability estimate, 453
weight, 451
imp.prob, 546
imp.quantile, 546
imputation, 88, 90
imp.weights, 546
incomplete data, 43-44, 88-92
index notation, 551-553
infinitesimal jackknife, see
nonparametric delta method
influence functions, 46-50, 60, 63-64
chain rule, 48
correlation, 48
covariance, 316, 319
eigenvalue, 64
estimating equation, 50, 63
least squares estimates, 260, 317
M-estimation, 318
mean, 47, 316
moments, 48, 63
multiple samples, 74-76, 126
quantile, 48
ratio of means, 49, 65, 126
regression, 260, 317, 319
studentized statistic, 63
trimmed mean, 64
two-sample t statistic, 454
variance, 64
weighted mean, 126
information distance, 165-166


integration
number-theoretic methods, 486
interaction example, 322
interpolation of quantiles, 195
Islay data example, 520
isotonic regression example, 371
iterative weighted least squares, 329

jackknife, 50-51, 59, 64, 65, 76, 115
  delete-m, 56, 60, 493
  for least squares estimates, 317
  for sample surveys, 125
  infinitesimal, see nonparametric delta method
  multi-sample, 76
jack.after.boot, 532
jackknife-after-bootstrap, 113-118, 125, 133, 134, 135, 308, 313, 322, 325, 369
  parametric model, 116-118, 130
Jacobian, 470, 479
K-function, 416, 424
Kaplan-Meier estimate, see product-limit estimate
kernel density estimate, 19-20, 79, 124, 127, 168-170, 189, 226-230, 251, 413, 469, 507, 511, 514
kernel intensity estimate, 419-421, 431, 435
kernel smoothing, 110, 363, 364, 375
kriging, 428
Kronecker delta symbol, 412, 443, 553
Lagrange multiplier, 165-166, 502, 504, 515, 516
Laplace's method, 479, 481
laterite data example, 234, 505
Latin hypercube sampling, 486
Latin square design, 489
least squares estimates, 258, 275, 392
  penalized, 364
  weighted, 271, 278, 329
length-biased data, 514
leukaemia data example, 328, 334, 367
leverage, 258, 271, 275, 278, 330, 370, 377
likelihood, 137
  adjusted, 500, 512, 515
  based on confidence sets, 509-512
  bootstrap, 507-509
  combination of, 500, 519
  definition, 499
  dual, 515
  empirical, 500-506
  implied, 511-512
  multinomial-based, 165-166, 186, 500-509
  parametric, 347, 499-500
  partial, 350, 507
  pivot-based, 510-511, 512, 515
  profile, 62, 206, 248, 347, 501, 515, 519
  quasi, 332, 344
likelihood ratio statistic, 62, 137, 138, 139, 148, 196, 234, 247, 330, 347, 368, 373, 380, 499-501
  signed, 196
linear approximation, see nonparametric delta method
linear.approx, 530
linear predictor, 327, 366
  residuals, 331
linearity diagnostics, 118-120, 125
link function, 327, 332, 367
location-scale model, 77, 126
logistic regression example, 141, 146, 338, 359, 361, 371, 376, 378, 381, 474
logit, 338, 372
loglinear model, 177, 184, 342, 369
lognormal model, 66, 149
log rank test, 160
long-range dependence, 408, 410, 426
low birth weights data example, 361
lowess, 363, 369
lunch
  nonexistence of free, 437
lynx data example, 432
M-estimate, 311-313, 316, 318, 471, 483, 515
maize data example, 181

mammals data example, 257, 262, 265,


324
Markov chain, 144, 429
Markov chain Monte Carlo, 143-147,
183, 184, 385, 428
matched-pair data example, 186, 187,
188,
471, 481, 492, 498
maximum likelihood estimate
bias-corrected, 377
generalized linear model, 329
nonparametric, 165-166, 186, 209,
501, 516
mean, see average
mean polar axis example, 234, 505
median, see sample median
median absolute deviation, 311
median survival time example, 86, 124
melanoma data example, 352
misclassification error, 358-362, 375,
378, 381
misclassification rate, 359
missing data, 88-92, 125, 128
mixed continuous-discrete distribution
example, 78
mode estimator, 124
model selection, 301-307, 316, 375,
393,
427, 432
model-based resampling, 389-396,
427, 433
modified sample size, 93
moment-generating function, 551, 552
Monte Carlo test, 140-147, 151-154,
183, 184
motoneurone firing, 418
motorcycle impact data example, 363,
365
moving average process, 386
multiple imputation, 89-91, 125, 128
multiplicative bias, 62
multiplicative model, 77, 126, 328, 335
negative binomial model, 337, 344
Nelson-Aalen estimator, 83, 86, 128
nested bootstrap, see double
bootstrap
nested data, see hierarchical data

neurophysiological data example, 418,
428
Nile data example, 241
nitrofen data example, 383
nodal involvement data example, 381
nonlinear regression, 353-358
nonlinear time series, 393-396, 401,
410, 426
nonparametric delta method, 46-50,
75
balanced bootstrap, 443-444
cubic approximation, 475-478
linear approximation, 47, 51, 52, 60,
69, 76, 118, 126, 127, 205, 261,
443, 454, 468, 487, 488, 490,
492
control variate, 446
importance resampling, 452
tilted, 490
quadratic approximation, 50, 79,
212, 215, 443, 487, 490
variance approximation, 47, 50, 63,
64, 75, 76, 108, 120, 199, 260,
261, 265, 275, 312, 318, 319,
376,
477, 478, 483
nonparametric maximum likelihood,
165-166, 186, 209, 501
nonparametric regression, 362-373,
375,
382, 383
normal prediction limit, 244
normal quantile plot test, 150
notation, 9-10
nuclear power data example, 286, 298,
304, 323
null distribution, 137
null hypothesis, 136
one-way model example, 208, 276,
319,
320
outliers, 27, 307-308, 363
overdispersion, 327, 332, 338-339,
343-344, 370, 382
test for, 142
paired comparison, see matched-pair data
parameter transformation, see transformation of statistic
partial autocorrelation, 386
partial correlation example, 115
periodogram, 387-389, 408, 430
  resampling, 412-415, 427, 430
permutation test, 141, 146, 156-160, 173, 183, 185-186, 266, 279, 422, 486, 492
  for regression slope, 266, 378
  saddlepoint approximation, 475, 487
PET film data example, 346, 379
phase scrambling, 408-412, 427, 430, 435
pivot, 29, 31, 33, 510-511, see also studentized statistic
point process, 415-426, 427-428
poisons data example, 322
Poisson process, 416-422, 425, 428, 431-432, 435
Poisson regression example, 342, 369, 378, 382, 383
posterior distribution, 499, 510, 513, 515, 520
power notation, 551-553
prediction error, 244, 375, 378
  K-fold cross-validation estimate, 293-295, 298-301, 316, 320, 324, 358-362, 381
  0.632 estimator, 298, 316, 324, 358-362, 381
  adjusted cross-validation estimate, 295, 298-301, 316, 324, 358-362
  aggregate, 290-301, 320, 321, 324, 358-362
  apparent, 292, 298-301, 320, 324, 381
  bootstrap estimate, 295-301, 316, 324, 358-362, 381
  comparison of estimators, 300-301
  cross-validation estimate, 292-293, 298-301, 320, 324, 358-362, 381
  generalized linear model, 340-346
  leave-one-out bootstrap estimate, 297, 321
  time series, 393-396, 401, 427
prediction limits, 243-245, 251, 284-289, 340-346, 369-371
prediction rule, 290, 358, 359
prior distribution, 499, 510, 513
product factorial moments, 487
product-limit estimator, 82-83, 87, 124, 128, 350, 351, 352, 515
profile likelihood, 62, 206, 248, 347, 501, 515, 519
proportional hazards model, 146, 160, 221, 350-353, 374
P-value, 137, 138, 141, 148, 158, 161, 437
  adjusted, 175-180, 183, 187
  importance sampling, 452, 454, 459
quadratic approximation, see nonparametric delta method
quantile estimator, 18-21, 48, 80, 86, 124, 253, 352
quartzite data example, 520
quasi-likelihood, 332, 344
random effects model, see hierarchical data
random walk model, 391
randomization test, 183, 492, 498
randomized block design, 489
ratio
  in finite population sampling, 98
  stratified sampling for, 98
ratio estimate
  in finite population sampling, 95
ratio example, 6, 13, 22, 30, 49, 52, 53, 54, 62, 66, 98, 108, 110, 113, 118, 126, 127, 165, 178, 186, 201, 217, 238, 249, 439, 447, 464, 473, 490, 513
recycling, see bootstrap recycling
regression
  L1, 124, 311, 312, 316, 325
  case deletion, 317, 377
  case resampling, 264-266, 269, 275, 277, 279, 312, 333, 355, 364
  convex, 372
  design, 260, 261, 263, 264, 276, 277, 305
  generalized additive, 366-371, 375, 382, 383
  generalized linear, 327-346, 374, 376, 377, 378, 381, 382, 383
  isotonic, 371
  least trimmed squares, 308, 311, 313, 325
  linear, 256-325, 434
  local, 363, 367, 375
  logistic, 141, 146, 338, 371, 376, 378, 381, 474
  loglinear, 342, 369, 383
  M-estimation, 311-313, 316, 318
  many covariates with, 275-277
  model-based resampling, 261-264, 267, 270-272, 275, 276, 279, 280, 312, 333-335, 346-351, 364-365
  multiple, 273-307
  no intercept, 263, 317
  nonconstant variance, 270-273
  nonlinear, 353-358, 375, 441, 442
  nonparametric, 362-373, 375, 427
  Poisson, 337, 342, 378, 382, 383, 473, 504, 516
  prediction, 284-301, 315, 316, 323, 324, 340-346, 369
  repeated design points in, 263
  resampling moments, 262
  residuals, see residuals
  resistant, 308
  robust, 307-314, 315, 316, 318, 325
  significance tests, 266-270, 279-284, 322, 325, 367, 371, 382, 383
  straight-line, 257-273, 308, 317, 322, 391, 449, 461, 489
  survival data, 346-353
  weighted least squares, 271-272, 278-279, 329
regression estimate
  in finite population sampling, 95
remission data example, 378
repeated measures, see hierarchical data
resampling, see bootstrap
residuals
  deviance, 332, 333, 334, 345, 376
  in multiple imputations, 89-91
  inhomogeneous, 338-340, 344
  linear predictor, 331
  modified, 77, 259, 270, 272, 275, 279, 312, 318, 331, 355, 365
  nonstandard, 349
  Pearson, 331, 333, 334, 342, 370, 376, 382
  raw, 258, 275, 278, 317, 319
  standardized, 259, 331, 332, 333, 376
  time series, 390, 392
returns data example, 269, 272, 449, 461
Richardson extrapolation, 487, 494
Rio Negro data example, 388, 398, 403, 410, 427
robustness, 3, 14, 264, 318
robust M-estimate example, 471, 483
robust regression example, 308, 309, 313, 318, 325
rock data example, 281, 287
rough statistics, 41-43
saddle, 547
saddle.distn, 547
saddlepoint approximation, 466-485, 486, 487, 492, 493, 498, 508, 509, 517, 547
  accuracy, 467, 477, 487
  conditional, 472-475, 487, 493
  density function, 467, 470
  distribution function, 467, 468, 470, 486-487
  double, 473-475
  equation, 467, 473, 479
  estimating function, 470-472
  integration approach, 478-485
  linear statistic for, 468-469, 517
  Lugannani-Rice formula, 467
  marginal, 473, 475-485, 487, 493
  permutation distribution, 475, 486, 487
  quantile estimate, 449, 468, 480, 483
  randomization distribution, 492, 498
salinity data example, 309, 311, 324
sample average, see average
sample average, see average


sample maximum example, 39, 56, 247


sample median, 41, 61, 65, 80, 121,
181, 518
sample variance, 61, 62, 64, 104, 208,
432,
488
sampling
stratified, see stratified sampling
without replacement, 92
sampling fraction, 92-93
sandwich variance estimate, 63, 275,
318, 376
second-order accuracy, 39-41,
211-214, 246
semiparametric model, 77-78, 123
sensitivity analysis, 113
separate families example, 148
sequential spatial inhibition process,
425
several samples, 71-76, 123, 126, 127,
130,
131, 133, 163, 208, 210-211,
217-220, 253
shrinkage estimate, 102, 130
significance probability, see P-value
significance test
adaptive, 173-174, 184, 187, 188
conditional, 138, 173-174
confidence interval, 220-223
critical region, 137
double bootstrap, 175-180, 183,
186, 187
error rate, 137, 175-176
generalized linear regression,
330-331, 367-369, 378, 382
graphical, 150-154, 188, 416, 422,
428
linear regression, 266-270, 279-284,
317,
322, 392
Monte Carlo, 140-147, 151-154
multiple, 174-175, 184
nonparametric bootstrap, 161-175,
267-270
nonparametric regression, 367, 371,
382,
383
parametric bootstrap, 148-149
permutation, 141, 146, 156-160,
173, 183, 185, 188, 266, 317,
378, 475, 486

pivot, 138-139, 268-269, 280, 284,
392,
454, 486
power, 155-156, 180-184
P-value, 137, 138, 141, 148, 158,
161,
175-176
randomization, 183, 185, 186, 492,
498
separate families, 148, 378
sequential, 182
spatial data, 416, 421, 422, 428
studentized, see pivot
time series, 392, 396, 403, 410
simulated data example, 306
simulation error, 34-37, 62
simulation outlier, 73
simulation size, 17-21, 34-37, 69,
155-156, 178-180, 183, 185, 202,
226,
246, 248
size of test, 137
smooth.f, 533
smooth estimates of F, 79-81
spatial association example, 421, 428
spatial clustering example, 416
spatial data, 124, 416, 421-426, 428
spatial epidemiology, 421, 428
species abundance example, 169, 228
spectral density estimation example,
413
spectral resampling, see periodogram
resampling
spectrum, 387, 408
spherical data example, 126, 234, 505
spline smoother, 352, 364, 365, 367, 368, 371, 468
standardized residuals, see residuals,
standardized
stationarity, 385-387, 391, 398, 416
statistical error, 31-34
statistical function, 12-14, 46, 60, 75
Stirling's approximation, 61, 155


straight-line regression, see regression,


straight-line
stratified resampling, 71, 89, 90, 306,
340,
344, 365, 371, 457, 494
stratified sampling, 97-100
Strauss process, 417, 425
studentized statistic, 29, 53, 119, 139,
171-173, 223, 249, 268, 280-281,
284, 286, 313, 315, 324, 325, 326,
330, 477, 481, 483, 513
subsampling, 55-59
balanced, 125
in model selection, 303-304
spatial, 424-426
sugar cane data example, 338
summation convention, 552
sunspot data example, 393, 401, 435
survival data
nonparametric, 82-87, 124, 128,
131,
132, 350-353, 374-375
parametric, 346-350, 379
survival probability, 86, 132, 160, 352,
515
survival proportion data example,
308, 322
survivor function, 82, 160, 350, 351,
352, 455
symmetric distribution example, 78,
169, 228, 251, 470, 471, 485
tau particle data example, 133, 495
test, see significance test
tile resampling, 424-426, 427, 428, 432
tilt.boot, 547
tilted distribution, see exponential tilting
time series, 385-415, 426-427,
428-431, 514
econometric, 427
nonlinear, 396, 410, 426
toroidal shifts, 423
traffic data, 253

training set, 292


transformation of statistic
empirical, 112, 113, 118, 125, 201
for confidence interval, 195, 200,
233
linearizing, 118-120
variance stabilizing, 32, 63, 108,
109, 111-113, 125, 195, 201,
227,
246, 252, 394, 419, 432
trend test in time series example, 403,
410
trimmed average example, 64, 121,
130, 133, 189
tsboot, 544
tuna data example, 169, 228, 469
two-way model example, 338, 342, 369
two-way table, 177, 184
unimodality test, 168, 169, 189
unit root test, 391, 427
urine data example, 359
variable selection, 301-307, 316, 375
variance approximations, see
nonparametric delta method
variance estimate, see sample variance
variance function, 327-330, 332, 336,
337, 338, 339, 344, 367
estimation of, 107-113, 465
variance stabilization, 32, 63, 108, 109,
111-113, 125, 195, 201, 227, 246,
419, 432
variation of properties of T, 107-113
var.linear, 530
weighted average example, 72, 126,
131
weighted least squares, 270-272,
278-279, 329-330, 377
white noise, 386
Wilcoxon test, 181
wool prices data example, 391
