
STATS 200 (Stanford University, Summer 2015)

Lecture 6: Finite-Sample Properties of Estimators

Conceptually, a good estimator should usually be close to the parameter it estimates. We now consider how to formalize this idea.

6.1 Bias and Variance

An estimator is simply a random variable. We begin by considering properties related to the expectation and variance of this random variable.
Bias
The bias of an estimator $\hat{\theta}$ of a parameter $\theta$ is $\mathrm{Bias}_\theta(\hat{\theta}) = E_\theta(\hat{\theta}) - \theta$. The estimator is unbiased if $\mathrm{Bias}_\theta(\hat{\theta}) = 0$ for all $\theta$ in the parameter space $\Theta$.


Note: Any function of the data can be an estimator of any unknown parameter (or
any function of an unknown parameter), so the bias implicitly depends on what an
estimator is estimating. We will assume that this is always clear from context, rather
than explicitly showing it in our notation.

Example 6.1.1: Let $X_1, \ldots, X_n$ be iid random variables such that $\mu = E_\mu(X_1)$ is finite, and let $\bar{X}$ be the usual sample mean. Consider $\bar{X}/2$ as an estimator of $\mu$. Then
$$\mathrm{Bias}_\mu(\bar{X}/2) = E_\mu(\bar{X}/2) - \mu = \frac{\mu}{2} - \mu = -\frac{\mu}{2}.$$
Note that the bias is zero if $\mu$ happens to be zero, but not if $\mu \neq 0$, so this estimator is biased (i.e., not unbiased).

Example 6.1.2: Let $X_1, \ldots, X_n$ be iid random variables such that both $\mu = E_{\mu,\sigma^2}(X_1)$ and $\sigma^2 = \mathrm{Var}_{\mu,\sigma^2}(X_1)$ are finite, and suppose $n \geq 2$. Let $\bar{X}$ and $S^2$ be the usual sample mean and sample variance, respectively. Then
$$E_{\mu,\sigma^2}(S^2) = \frac{1}{n-1}\, E_{\mu,\sigma^2}\!\left(\sum_{i=1}^n X_i^2 - n\bar{X}^2\right) = \frac{1}{n-1}\left[n(\mu^2 + \sigma^2) - n\!\left(\mu^2 + \frac{\sigma^2}{n}\right)\right] = \frac{n-1}{n-1}\,\sigma^2 = \sigma^2.$$
Thus, $\mathrm{Bias}_{\mu,\sigma^2}(S^2) = \sigma^2 - \sigma^2 = 0$ for all values of $\sigma^2$, so $S^2$ is an unbiased estimator of $\sigma^2$. Note that if the $(n-1)^{-1}$ is replaced with $n^{-1}$, yielding $(n-1)S^2/n$, then the resulting estimator has expectation $(n-1)\sigma^2/n$ and thus is no longer unbiased.
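To make this concrete, here is a minimal simulation sketch (not part of the original notes) that approximates $E_{\mu,\sigma^2}(S^2)$ and $E_{\mu,\sigma^2}[(n-1)S^2/n]$ by Monte Carlo; the sample size, number of replications, and the normal distribution are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 200_000          # illustrative choices
mu, sigma2 = 2.0, 4.0          # true mean and variance

# Draw reps samples of size n (normal here, but any finite-variance law works)
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

s2 = x.var(axis=1, ddof=1)     # sample variance S^2 (divides by n-1)
s2_mle = x.var(axis=1, ddof=0) # (n-1)S^2/n (divides by n)

print("E(S^2) approx:", s2.mean())            # close to sigma2 = 4 (unbiased)
print("E((n-1)S^2/n) approx:", s2_mle.mean()) # close to (n-1)*sigma2/n = 3.6 (biased)
```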

The bias tells us whether an estimator tends to overestimate or underestimate its target on average. It does not tell us whether the value the estimator takes for a particular data set (i.e., a particular estimate) is larger or smaller than the true parameter value.
Example 6.1.3: Consider $(n-1)S^2/n$ as an estimator of $\sigma^2$ in Example 6.1.2. Its bias is
$$\mathrm{Bias}_{\mu,\sigma^2}\!\left[\frac{(n-1)S^2}{n}\right] = E_{\mu,\sigma^2}\!\left[\frac{(n-1)S^2}{n}\right] - \sigma^2 = \frac{(n-1)\sigma^2}{n} - \sigma^2 = -\frac{\sigma^2}{n},$$
which is negative for all $\sigma^2 > 0$. Thus, $(n-1)S^2/n$ tends to underestimate $\sigma^2$. However, this does not necessarily mean that the value of this estimator for a particular data set (i.e., a particular estimate) is smaller than $\sigma^2$.


Unbiasedness is not, by itself, enough to ensure that an estimator is good. Similarly, an unbiased estimator is not necessarily better than a biased one.
Example 6.1.4: Return to the situation of Example 6.1.2. The estimator $(X_1 - X_2)^2/2$ has expectation
$$E_{\mu,\sigma^2}\!\left[\frac{(X_1 - X_2)^2}{2}\right] = E_{\mu,\sigma^2}\!\left(\frac{X_1^2}{2}\right) + E_{\mu,\sigma^2}\!\left(\frac{X_2^2}{2}\right) - E_{\mu,\sigma^2}(X_1)\, E_{\mu,\sigma^2}(X_2) = \frac{\mu^2 + \sigma^2}{2} + \frac{\mu^2 + \sigma^2}{2} - \mu^2 = \sigma^2$$
and is hence an unbiased estimator of $\sigma^2$. However, this estimator involves only the first two observations and ignores the remaining $n-2$ observations, so we probably would not want to use this estimator. In contrast, as shown in Example 6.1.2, the estimator $(n-1)S^2/n$ has expectation $(n-1)\sigma^2/n$ and is hence a biased estimator of $\sigma^2$. However, if $n$ is large, then the bias is small, in which case this estimator may not be bad.

It is often the case that we can trade a small amount of bias in order to improve an estimator in other ways. This idea will be discussed more later.

Variance
It can also be useful to consider the variance $\mathrm{Var}_\theta(\hat{\theta})$ of an estimator $\hat{\theta}$ of a parameter $\theta$.
Example 6.1.5: Continuing from Example 6.1.4, suppose further that the distribution of $X_1, \ldots, X_n$ is normal. Then
$$\mathrm{Var}_{\mu,\sigma^2}(S^2) = \left(\frac{\sigma^2}{n-1}\right)^2 \mathrm{Var}_{\mu,\sigma^2}\!\left[\frac{(n-1)S^2}{\sigma^2}\right] = \left(\frac{\sigma^2}{n-1}\right)^2 [2(n-1)] = \frac{2(\sigma^2)^2}{n-1},$$
noting that $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$ since $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$. It follows that
$$\mathrm{Var}_{\mu,\sigma^2}\!\left[\left(\frac{n-1}{n}\right) S^2\right] = \left(\frac{n-1}{n}\right)^2 \mathrm{Var}_{\mu,\sigma^2}(S^2),$$
which is smaller than the variance of $S^2$. The variance of the estimator $(X_1 - X_2)^2/2$ can be found by noting that it is simply the sample variance of the first two observations, and thus
$$\mathrm{Var}_{\mu,\sigma^2}\!\left[\frac{(X_1 - X_2)^2}{2}\right] = 2(\sigma^2)^2 = (n-1)\,\mathrm{Var}_{\mu,\sigma^2}(S^2).$$
Unless $n$ is very small, this estimator has much larger variance than either of the other two estimators discussed above.
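As a quick sanity check (added here, not part of the original notes), the following sketch approximates these three variances by simulation under normality; the values of $n$, $\mu$, $\sigma^2$, and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 10, 400_000
mu, sigma2 = 0.0, 1.0
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

s2 = x.var(axis=1, ddof=1)                 # S^2
s2_shrunk = (n - 1) / n * s2               # (n-1)S^2/n
pair = (x[:, 0] - x[:, 1]) ** 2 / 2        # (X1 - X2)^2 / 2

print("Var(S^2):", s2.var(), "theory:", 2 * sigma2**2 / (n - 1))
print("Var((n-1)S^2/n):", s2_shrunk.var(),
      "theory:", ((n - 1) / n) ** 2 * 2 * sigma2**2 / (n - 1))
print("Var((X1-X2)^2/2):", pair.var(), "theory:", 2 * sigma2**2)
```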

A smaller variance is usually better, but this is not always true. For example, a constant
estimator (i.e., an estimator that ignores the data altogether) has zero variance but is clearly
not a good estimator.


Bias-Variance Trade-Off
When comparing sensible estimators, an estimator with larger bias often has smaller variance,
and vice versa. Thus, it may not be immediately clear which of several sensible estimators
is to be preferred.
Example 6.1.6: Continuing from Example 6.1.5, the estimators $S^2$ and $(X_1 - X_2)^2/2$ are both unbiased, but $S^2$ has smaller variance. Thus, $S^2$ is a better estimator than $(X_1 - X_2)^2/2$. However, the comparison between $S^2$ and $(n-1)S^2/n$ is not so clear. One estimator has smaller bias, while the other estimator has smaller variance.

6.2 Mean Squared Error

The mean squared error of an estimator $\hat{\theta}$ of a parameter $\theta$ is $\mathrm{MSE}_\theta(\hat{\theta}) = E_\theta[(\hat{\theta} - \theta)^2]$. It provides one way to evaluate the overall performance of an estimator.
Note: Like the bias, the mean squared error of an estimator implicitly depends on
what an estimator is estimating. Again, we will assume that this is clear from context
rather than explicitly showing it in our notation.

Relationship Between Mean Squared Error, Bias, and Variance
The following theorem provides a useful way both to calculate and to interpret the mean squared error.

Theorem 6.2.1. Let $\hat{\theta}$ be an estimator of $\theta$. Then $\mathrm{MSE}_\theta(\hat{\theta}) = [\mathrm{Bias}_\theta(\hat{\theta})]^2 + \mathrm{Var}_\theta(\hat{\theta})$.

Proof. $\mathrm{MSE}_\theta(\hat{\theta}) = E_\theta[(\hat{\theta} - \theta)^2] = [E_\theta(\hat{\theta} - \theta)]^2 + \mathrm{Var}_\theta(\hat{\theta} - \theta) = [\mathrm{Bias}_\theta(\hat{\theta})]^2 + \mathrm{Var}_\theta(\hat{\theta})$.
Example 6.2.2: Continuing from Example 6.1.6, the mean squared errors of the estimators $S^2$ and $(n-1)S^2/n$ are
$$\mathrm{MSE}_{\mu,\sigma^2}(S^2) = [\mathrm{Bias}_{\mu,\sigma^2}(S^2)]^2 + \mathrm{Var}_{\mu,\sigma^2}(S^2) = \frac{2(\sigma^2)^2}{n-1},$$
$$\mathrm{MSE}_{\mu,\sigma^2}\!\left[\left(\frac{n-1}{n}\right)S^2\right] = \left\{\mathrm{Bias}_{\mu,\sigma^2}\!\left[\left(\frac{n-1}{n}\right)S^2\right]\right\}^2 + \mathrm{Var}_{\mu,\sigma^2}\!\left[\left(\frac{n-1}{n}\right)S^2\right] = \left[\left(\frac{n-1}{n}\right)\sigma^2 - \sigma^2\right]^2 + \frac{2(n-1)(\sigma^2)^2}{n^2} = \frac{(2n-1)(\sigma^2)^2}{n^2}$$
by Theorem 6.2.1.
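The following sketch (an illustration added here, not part of the original notes) estimates both mean squared errors by simulation and compares them with the formulas above; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 10, 400_000
sigma2 = 1.0
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

s2 = x.var(axis=1, ddof=1)          # S^2
s2_shrunk = (n - 1) / n * s2        # (n-1)S^2/n

mse = lambda est: np.mean((est - sigma2) ** 2)
print("MSE(S^2):", mse(s2), "theory:", 2 * sigma2**2 / (n - 1))
print("MSE((n-1)S^2/n):", mse(s2_shrunk), "theory:", (2 * n - 1) * sigma2**2 / n**2)
# The biased estimator (n-1)S^2/n has the smaller MSE here.
```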

Comparing Estimators with Mean Squared Error
Let $\hat{\theta}$ and $\tilde{\theta}$ be estimators of $\theta$. Suppose that $\mathrm{MSE}_\theta(\hat{\theta}) \leq \mathrm{MSE}_\theta(\tilde{\theta})$ for all $\theta$ and $\mathrm{MSE}_\theta(\hat{\theta}) < \mathrm{MSE}_\theta(\tilde{\theta})$ for some $\theta$. Then it is clear (at least in terms of MSE) that $\hat{\theta}$ is a better estimator than $\tilde{\theta}$.

Example 6.2.3: From the results of Example 6.2.2, we can see that $(n-1)S^2/n$ has smaller mean squared error than $S^2$ for all $\sigma^2 > 0$ (recalling that $n \geq 2$). Thus, $(n-1)S^2/n$ is a better estimator than $S^2$ (at least in terms of MSE).


More commonly, when comparing sensible estimators, one estimator has smaller mean squared error for some parameter values, while the other estimator has smaller mean squared error for other parameter values. In this case, it is not at all clear which estimator is better.
Example 6.2.4: Suppose $X \sim \mathrm{Bin}(n, \theta)$, where $\theta$ is unknown and $0 \leq \theta \leq 1$. Recall that the maximum likelihood estimator of $\theta$ is $\hat{\theta}_{\mathrm{MLE}} = X/n$. Its bias and variance are
$$\mathrm{Bias}_\theta(\hat{\theta}_{\mathrm{MLE}}) = E_\theta\!\left(\frac{X}{n}\right) - \theta = 0, \qquad \mathrm{Var}_\theta(\hat{\theta}_{\mathrm{MLE}}) = \mathrm{Var}_\theta\!\left(\frac{X}{n}\right) = \frac{\theta(1-\theta)}{n},$$
so its mean squared error is
$$\mathrm{MSE}_\theta(\hat{\theta}_{\mathrm{MLE}}) = [\mathrm{Bias}_\theta(\hat{\theta}_{\mathrm{MLE}})]^2 + \mathrm{Var}_\theta(\hat{\theta}_{\mathrm{MLE}}) = \frac{\theta(1-\theta)}{n}.$$
If we instead put a $\mathrm{Beta}(a, b)$ prior on $\theta$ and conduct a Bayesian analysis, we find that the posterior mean is $\hat{\theta}_B = (X + a)/(n + a + b)$. Its bias and variance are
$$\mathrm{Bias}_\theta(\hat{\theta}_B) = E_\theta\!\left(\frac{X + a}{n + a + b}\right) - \theta = \frac{n\theta + a}{n + a + b} - \theta = \frac{(1-\theta)a - \theta b}{n + a + b},$$
$$\mathrm{Var}_\theta(\hat{\theta}_B) = \mathrm{Var}_\theta\!\left(\frac{X + a}{n + a + b}\right) = \frac{n\theta(1-\theta)}{(n + a + b)^2},$$
so its mean squared error is
$$\mathrm{MSE}_\theta(\hat{\theta}_B) = [\mathrm{Bias}_\theta(\hat{\theta}_B)]^2 + \mathrm{Var}_\theta(\hat{\theta}_B) = \frac{[(1-\theta)a - \theta b]^2 + n\theta(1-\theta)}{(n + a + b)^2}.$$
A rather stupid choice would be the constant estimator that ignores the data and just estimates some constant $c$ no matter what. Then
$$\mathrm{Bias}_\theta(c) = E_\theta(c) - \theta = c - \theta, \qquad \mathrm{Var}_\theta(c) = 0, \qquad \mathrm{MSE}_\theta(c) = [\mathrm{Bias}_\theta(c)]^2 + \mathrm{Var}_\theta(c) = (c - \theta)^2.$$


The MSE of each estimator as a function of $\theta$ is plotted below in the case where $n = 25$. The Bayes estimator (posterior mean) uses $a = b = 1$. The constant estimator takes $c = 1/3$.

[Figure: MSE of the MLE, Bayes, and constant estimators as a function of $\theta$, with the MSE axis ranging from 0.000 to 0.012.]
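A short sketch (added for illustration, not part of the original notes) that reproduces these MSE curves directly from the formulas above; it assumes matplotlib is available for plotting.

```python
import numpy as np
import matplotlib.pyplot as plt

n, a, b, c = 25, 1, 1, 1/3
theta = np.linspace(0, 1, 501)

# MSE formulas from Example 6.2.4
mse_mle = theta * (1 - theta) / n
mse_bayes = (((1 - theta) * a - theta * b) ** 2 + n * theta * (1 - theta)) / (n + a + b) ** 2
mse_const = (c - theta) ** 2

plt.plot(theta, mse_mle, label="MLE")
plt.plot(theta, mse_bayes, label="Bayes")
plt.plot(theta, mse_const, label="Constant")
plt.ylim(0, 0.012)
plt.xlabel("theta")
plt.ylabel("MSE")
plt.legend()
plt.show()
```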


The plot shows the following:

• The Bayes estimator with $a = b = 1$ corresponds to a $\mathrm{Unif}(0, 1)$ prior, which has a prior mean of $1/2$. The closer the true value of $\theta$ is to this prior mean, the better the Bayes estimator does relative to the MLE.

• It is not difficult to see why the Bayes estimator is outperformed by the MLE when the true value of $\theta$ is close to 0 or 1. For $n = 25$ and $a = b = 1$ as shown in the plot, we are guaranteed to have $1/27 \leq \hat{\theta}_B \leq 26/27$ since $0 \leq X \leq 25$ no matter what.

• The constant estimator does very well if $\theta$ is actually near $1/3$, but otherwise its performance can be very poor.

We would observe similar results if we repeated this plot for other values of $a$, $b$, $c$, and $n$. The constant estimator is very bad unless the constant is close to the true value of $\theta$, while the comparison between the MLE and the Bayes estimator typically depends on how close the true value of $\theta$ is to the prior mean.

Best Estimators
It is natural to ask whether we can find an estimator of $\theta$ that has smaller mean squared error than every other estimator for all $\theta$. However, no such estimator can exist. This conclusion is actually trivial, since the constant estimator $\hat{\theta} = c$ will always have smaller mean squared error than any other estimator if $\theta$ is actually equal to $c$. Thus, we must consider the idea of a best estimator in a narrower sense. There are two ways to do this:

• Take a weighted average of the MSE over all possible $\theta$ values, so that we can measure the performance of an estimator through a single number that takes into account all values of $\theta$ (rather than a function of $\theta$). Then try to find the estimator that minimizes this average MSE. It turns out that this is surprisingly easy, as we'll see.

• Restrict our attention to only estimators that meet a certain criterion, then try to find an estimator that is best (has lowest MSE for all $\theta$) within this subset. The most common approach is to restrict our attention to unbiased estimators and try to find the best unbiased estimator.

The notion of average MSE optimality is discussed below, while the notion of best unbiased estimators will be addressed later in the course.
Average MSE Optimality
Let $w(\theta)$ be a nonnegative weighting function that describes how much we want the various values of $\theta$ to count toward our weighted average MSE. Assume without loss of generality that $\int w(\theta)\, d\theta = 1$ or $\sum_\theta w(\theta) = 1$ (whichever is appropriate). Then let
$$r_w(\hat{\theta}) = \int \mathrm{MSE}_\theta(\hat{\theta})\, w(\theta)\, d\theta \qquad \text{or} \qquad r_w(\hat{\theta}) = \sum_\theta \mathrm{MSE}_\theta(\hat{\theta})\, w(\theta)$$
denote our weighted average MSE. The following theorem tells us how to find an estimator $\hat{\theta}$ that minimizes $r_w(\hat{\theta})$.


Theorem 6.2.5. Let $\hat{\theta}_B$ denote the posterior mean of $\theta$ under the prior $\pi(\theta) = w(\theta)$. Then $r_w(\hat{\theta}_B) \leq r_w(\hat{\theta})$ for any other estimator $\hat{\theta}$ of $\theta$.

Proof. We provide the proof for the case where the data and parameter are both continuous. (The proofs of the other cases are similar.) Let $f_\theta(x)$ be the joint pdf of the data, where $x \in \mathbb{R}^n$, and let $\hat{\theta} = \hat{\theta}(x)$ be an estimator of $\theta$ other than $\hat{\theta}_B = \hat{\theta}_B(x)$. Then
$$r_w(\hat{\theta}) = \int \mathrm{MSE}_\theta(\hat{\theta})\, w(\theta)\, d\theta = \int E_\theta[(\hat{\theta} - \theta)^2]\, \pi(\theta)\, d\theta = \int \left\{ \int_{\mathbb{R}^n} [\hat{\theta}(x) - \theta]^2 f_\theta(x)\, dx \right\} \pi(\theta)\, d\theta = \int_{\mathbb{R}^n} \left\{ \int [\hat{\theta}(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta \right\} m(x)\, dx,$$
noting that $f_\theta(x)\, \pi(\theta) = \pi(\theta \mid x)\, m(x)$. Now write the inner integral as
$$\int [\hat{\theta}(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta = \int [\hat{\theta}(x) - \hat{\theta}_B(x) + \hat{\theta}_B(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta$$
$$= [\hat{\theta}(x) - \hat{\theta}_B(x)]^2 + 2[\hat{\theta}(x) - \hat{\theta}_B(x)] \int [\hat{\theta}_B(x) - \theta]\, \pi(\theta \mid x)\, d\theta + \int [\hat{\theta}_B(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta$$
$$\geq \int [\hat{\theta}_B(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta,$$
noting that $\int [\hat{\theta}_B(x) - \theta]\, \pi(\theta \mid x)\, d\theta = 0$ because $\hat{\theta}_B(x)$ is the posterior mean of $\theta$ given $x$. Then it follows that
$$r_w(\hat{\theta}) \geq \int_{\mathbb{R}^n} \left\{ \int [\hat{\theta}_B(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta \right\} m(x)\, dx = \int E_\theta[(\hat{\theta}_B - \theta)^2]\, \pi(\theta)\, d\theta = r_w(\hat{\theta}_B),$$
again noting that $f_\theta(x)\, \pi(\theta) = \pi(\theta \mid x)\, m(x)$.


Note that although Theorem 6.2.5 involves the Bayes estimator and uses Bayesian notation
in its proof, the result holds regardless of whether or not we actually believe in the Bayesian
philosophy. Thus, we can still find Bayes estimators useful even if we are not willing to
interpret them as a mean of some posterior distribution.
Example 6.2.6: In Example 6.2.4, suppose we want to find the estimator that minimizes a weighted average MSE with a weighting function of the form $w(\theta) = \theta^{c_1}(1-\theta)^{c_2}$, where $c_1 > -1$ and $c_2 > -1$ (to ensure that the integral of the weighting function is finite). Then $w(\theta)$, when multiplied by an appropriate constant, is the pdf of a $\mathrm{Beta}(c_1 + 1, c_2 + 1)$ distribution. Then by Theorem 6.2.5, the estimator that minimizes the weighted average MSE under the weighting function $w(\theta)$ is simply the posterior mean of $\theta$ under a $\mathrm{Beta}(c_1 + 1, c_2 + 1)$ prior, which is $\hat{\theta}_B = (X + c_1 + 1)/(n + c_1 + c_2 + 2)$.
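As a numerical illustration (added here, not part of the original notes), the sketch below approximates $r_w$ for this Bayes estimator and for the MLE of Example 6.2.4 by integrating their MSE formulas against the Beta weighting function; the choices of $n$, $c_1$, and $c_2$ are arbitrary, and it assumes scipy is available.

```python
import numpy as np
from scipy import integrate, stats

n, c1, c2 = 25, 1.0, 2.0            # illustrative choices
a, b = c1 + 1, c2 + 1               # Beta(c1+1, c2+1) prior / weighting

w = lambda t: stats.beta.pdf(t, a, b)            # normalized weighting function
mse_mle = lambda t: t * (1 - t) / n
mse_bayes = lambda t: (((1 - t) * a - t * b) ** 2 + n * t * (1 - t)) / (n + a + b) ** 2

r_mle, _ = integrate.quad(lambda t: mse_mle(t) * w(t), 0, 1)
r_bayes, _ = integrate.quad(lambda t: mse_bayes(t) * w(t), 0, 1)

print("r_w(MLE):  ", r_mle)
print("r_w(Bayes):", r_bayes)       # smaller, as guaranteed by Theorem 6.2.5
```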
