
Introduction to the Bootstrap

Machelle D. Wilson
Outline
Why the Bootstrap?
Limitations of traditional statistics
How Does it Work?
The Empirical Distribution Function and the Plug-in Principle
Accuracy of an estimate: Bootstrap standard error and confidence intervals
Examples
How Good is the Bootstrap?
Limitations of Traditional Statistics:
Problems with distributional assumptions
Often data cannot safely be assumed to be
from an identifiable distribution.
Sometimes the distribution of the statistic
is mathematically intractable, even
assuming that distributional assumptions
can be made.
Hence, often the bootstrap provides a
superior alternative to parametric
statistics.
An example data set
[Figure: histogram of 1000 bootstrapped means, mean concentration and dose rate fixed. Horizontal axis: mean dose (80 to 180). Red lines = bootstrap (BS) CI; black lines = normal CI.]
An Example Data Set
[Figure: histogram of 1000 bootstrapped means, mean concentration and dose rate random. Horizontal axis: mean dose (50 to 350). Red lines = bootstrap (BS) CI; black lines = normal CI.]
Statistics in the Computer Age
Efron and Tibshirani, 1991 in Science:
Most of our familiar statistical methods, such as
hypothesis testing, linear regression, analysis of
variance, and maximum likelihood estimation, were
designed to be implemented on mechanical calculators.
Modern electronic computation has encouraged a host of
new statistical methods that require fewer distributional
assumptions than their predecessors and can be applied
to more complicated statistical estimators without the
usual concerns for mathematical tractability.
The Bootstrap Solution
With the advent of cheap, high power
computing, it has become relatively easy to use
resampling techniques, such as the bootstrap, to
estimate the distribution of sample statistics
empirically rather than making distributional
assumptions.
The bootstrap resamples the data with equal
probability and with replacement and calculates
the statistic of interest at each resampling. The
resulting histogram, mean, quantiles and
variance of the bootstrapped statistics provide
an estimate of its distribution.
Example
Take the data set 1,2,3. There are 10
possible resamplings, where re-orderings
are considered the same sampling.
1,2,3 1,1,2
1,1,3 2,2,1
2,2,3 3,3,1
3,3,2 1,1,1
2,2,2 3,3,3
[Figure: histogram with horizontal axis from 1 to 3 and counts from 0 to 30.]
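The enumeration above can be checked with a short sketch. It assumes Python and its standard library (neither appears in the original slides), and the variable names are illustrative only. It lists the distinct resamples of the data set 1, 2, 3 and counts how many ordered draws produce each one.

```python
from collections import Counter
from itertools import combinations_with_replacement, product

data = [1, 2, 3]
n = len(data)

# Distinct resamples: multisets of size n drawn from the data, order ignored.
distinct = list(combinations_with_replacement(data, n))

# Every ordered draw with replacement is equally likely (probability 1/n**n),
# so count how many ordered draws collapse to each distinct resample.
orderings = Counter(tuple(sorted(draw)) for draw in product(data, repeat=n))

print(len(distinct), "distinct resamples")
for sample in distinct:
    mean = sum(sample) / n
    print(sample, "orderings:", orderings[sample], "mean:", round(mean, 3))
```

The distinct resamples are not equally likely: 1,2,3 arises from six orderings, while 1,1,1 arises from only one.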
The Bootstrap Solution
In general, the number of distinct
bootstrap samples, $C_n$, is

$$C_n = \binom{2n-1}{n}.$$

Table of possible distinct bootstrap
re-samplings by sample size.

n     5     10       12          15          20           25           30
C_n   126   92,378   1.35x10^6   7.76x10^7   6.89x10^10   6.32x10^13   5.91x10^16
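The count of distinct resamples is the binomial coefficient "2n-1 choose n", which can be evaluated exactly. The following is a minimal sketch in Python (an assumption; the slides themselves use no code), and the function name is hypothetical.

```python
from math import comb


def distinct_bootstrap_samples(n):
    """Number of distinct resamples of size n when order is ignored."""
    return comb(2 * n - 1, n)


for n in (3, 5, 10, 12, 15, 20, 25, 30):
    print(n, distinct_bootstrap_samples(n))
```

For n = 3 this gives the 10 resamplings of the data set 1, 2, 3 listed earlier.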
The Empirical Distribution Function
Having observed a random sample of size
n from a probability distribution F,

$$F \to (x_1, x_2, \ldots, x_n),$$

the empirical distribution function (edf), $\hat{F}$,
assigns to a set A in the sample space
of x its empirical probability

$$\hat{F}(A) = \hat{P}\{A\} = \#\{x_i \in A\}/n.$$
Example
A random sample of 100 throws of a die
yields 13 ones, 19 twos, 10 threes, 17 fours,
14 fives, and 27 sixes. Hence the edf is

$$\hat{F}(1) = 0.13, \quad \hat{F}(2) = 0.19, \quad \hat{F}(3) = 0.10,$$
$$\hat{F}(4) = 0.17, \quad \hat{F}(5) = 0.14, \quad \hat{F}(6) = 0.27.$$
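As a small illustration of the definition #{x_i in A}/n, the sketch below computes empirical probabilities from the die counts in this example. Python and the helper name `edf` are assumptions made for illustration.

```python
# Observed counts from the 100 throws of the die in the example.
counts = {1: 13, 2: 19, 3: 10, 4: 17, 5: 14, 6: 27}
n = sum(counts.values())


def edf(event):
    """Empirical probability of a set of faces A: #{x_i in A} / n."""
    return sum(counts.get(face, 0) for face in event) / n


print(edf({1}))        # empirical probability of a one
print(edf({2, 4, 6}))  # empirical probability of an even face
```

The same function handles any set A, not only single faces, which is what the definition of the edf requires.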
The Plug-in Principle
It can be shown that $\hat{F}$ is a sufficient statistic
for F.
That is, all the information about F contained
in x is also contained in $\hat{F}$.
The plug-in principle estimates

$$\theta = T(F)$$

by

$$\hat{\theta} = T(\hat{F}).$$
The Plug-in Principle
If the only information about F comes from
the sample x, then $\hat{\theta} = T(\hat{F})$ is a minimum
variance unbiased estimator of $\theta$.
The bootstrap draws B samples from the
empirical distribution to compute B replicates
of the statistic of interest, $\hat{\theta}^{*}$.
Hence, the bootstrap is both sampling from
an edf (of the original sample) and generating
an edf (of the statistic).
Graphical Representation of the Bootstrap

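Original sample: $x = \{x_1, x_2, \ldots, x_n\}$
Bootstrap samples: $x^{*1}, x^{*2}, x^{*3}, \ldots, x^{*B}$
Bootstrap statistics: $T(x^{*1}), T(x^{*2}), T(x^{*3}), \ldots, T(x^{*B})$

$$\bar{t} = \frac{1}{B} \sum_{b=1}^{B} T(x^{*b})$$

$$\widehat{se}(T(x)) = \sqrt{\frac{\sum_{b=1}^{B} [T(x^{*b}) - \bar{t}]^2}{B - 1}}$$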

Bootstrap Standard Error and Confidence Intervals
The bootstrap estimate of the mean is just
the empirical average of the statistic over
all bootstrap samples.

The bootstrap estimate of standard error is
just the empirical standard deviation of
the bootstrap statistic over all bootstrap
samples.
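The two estimates just described can be sketched as follows. This is a minimal illustration in Python using only the standard library; the function names, the default B = 1000, the fixed seed, and the sample values are all assumptions made for the example and do not come from the slides.

```python
import random
import statistics


def bootstrap_replicates(data, stat, B=1000, seed=0):
    """Compute stat on B resamples drawn with replacement and equal probability."""
    rng = random.Random(seed)
    n = len(data)
    return [stat(rng.choices(data, k=n)) for _ in range(B)]


def bootstrap_mean_and_se(data, stat, B=1000, seed=0):
    """Return the average and the standard deviation of the bootstrap statistics."""
    reps = bootstrap_replicates(data, stat, B, seed)
    t_bar = statistics.fmean(reps)
    se = statistics.stdev(reps)  # sample standard deviation, divisor B - 1
    return t_bar, se


# Made-up sample values, for illustration only.
sample = [4.1, 5.6, 3.8, 7.2, 6.0, 5.1, 4.7, 6.5]
t_bar, se = bootstrap_mean_and_se(sample, statistics.median)
print("bootstrap mean of the median:", t_bar)
print("bootstrap standard error:", se)
```

Any function of a sample can be passed as `stat`; the median is used here only because its standard error has no simple closed form.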
Bootstrap Confidence Intervals
The percentile interval: the bootstrap
confidence interval for any statistic is
simply the $\alpha/2$ and $1-\alpha/2$ quantiles
of the bootstrapped statistics.
For example, if B=1000, then to construct
the 95% BS confidence interval we rank the
statistics and take the 25th and the 975th
values.
There are other BS CIs but this is the
easiest and makes the fewest assumptions.
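A minimal sketch of the percentile interval, following the ranking rule described above (with B = 1000 and a 95% level, the 25th and 975th ranked values). Python, the function name, the rounding of ranks, and the sample values are assumptions for illustration.

```python
import random
import statistics


def percentile_interval(data, stat, B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: ranked values at alpha/2 and 1 - alpha/2."""
    rng = random.Random(seed)
    n = len(data)
    # Rank the B bootstrapped statistics from smallest to largest.
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(B))
    lo_rank = max(1, round(B * alpha / 2))        # 25 when B=1000, alpha=0.05
    hi_rank = min(B, round(B * (1 - alpha / 2)))  # 975 when B=1000, alpha=0.05
    # Ranks are 1-based; list indices are 0-based.
    return reps[lo_rank - 1], reps[hi_rank - 1]


# Made-up sample values, for illustration only.
sample = [4.1, 5.6, 3.8, 7.2, 6.0, 5.1, 4.7, 6.5]
lower, upper = percentile_interval(sample, statistics.mean)
print("95% percentile interval for the mean:", lower, upper)
```

Other conventions for choosing the ranks exist; this sketch simply mirrors the 25th and 975th values used in the example.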
Example: Bootstrap of the Median
Go to Splus.
How Good is the Bootstrap?
The bootstrap, in most cases, is as good as the
empirical distribution function.
The bootstrap is not optimal when there is good
information about F that did not come from the
data, i.e., prior information or strong, valid
distributional assumptions.
The bootstrap does not work well for extreme
values and needs somewhat difficult
modifications for autocorrelated data such as
time series.
When all our information comes from the sample
itself, we cannot do better than the bootstrap.