Assessing Estimator Performance
It should be evident that $\hat{A}$, the sample mean of $x[0], \ldots, x[N-1]$, is a better estimator than $\check{A} = x[0]$ because the values it produces are more concentrated about the true value A = 1. That is, $\hat{A}$ will usually produce a value closer to the true one than $\check{A}$.
Assessing Estimator Performance
First, the means are
$$E[\hat{A}] = E\left[\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right] = A, \qquad E[\check{A}] = E\big[x[0]\big] = A$$
so both estimators are unbiased.
Second, the variances are
$$\mathrm{var}(\hat{A}) = \mathrm{var}\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right) = \frac{1}{N^2}\sum_{n=0}^{N-1}\mathrm{var}(x[n]) = \frac{1}{N^2}\,N\sigma^2 = \frac{\sigma^2}{N}$$
since the $w[n]$'s are uncorrelated, and
$$\mathrm{var}(\check{A}) = \mathrm{var}(x[0]) = \sigma^2 > \mathrm{var}(\hat{A}).$$
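As a quick numerical check of these two formulas, the following minimal sketch (assuming NumPy; the values A = 1, sigma2 = 1, N = 20 and the trial count are illustrative, not taken from the slides) simulates both estimators:

import numpy as np

rng = np.random.default_rng(0)
A, sigma2, N, trials = 1.0, 1.0, 20, 100_000      # illustrative values

# x[n] = A + w[n], with w[n] white Gaussian noise of variance sigma2
x = A + np.sqrt(sigma2) * rng.standard_normal((trials, N))

A_hat = x.mean(axis=1)        # sample mean estimator
A_check = x[:, 0]             # single-sample estimator x[0]

print(A_hat.mean(), A_check.mean())   # both are close to A (unbiased)
print(A_hat.var(), sigma2 / N)        # var(A_hat) is close to sigma^2/N
print(A_check.var(), sigma2)          # var(A_check) is close to sigma^2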
Table of Contents
Minimum Variance Unbiased (MVU) Estimation
Unbiased Estimators
Minimum Variance Criterion
Unbiased Estimators
For an estimator to be unbiased we mean that on the average the estimator yields the true value of the unknown parameter. Mathematically, an estimator is unbiased if
$$E[\hat{\theta}] = \theta, \qquad a < \theta < b$$
where (a, b) denotes the range of possible values of $\theta$.
Unbiased estimators tend to have symmetric PDFs centered about the true value of $\theta$.
For Example 1 the PDF of $\hat{A}$ is shown in Figure 4 and is easily shown to be $\mathcal{N}(A, \sigma^2/N)$.
Unbiased Estimators
The restriction that $E[\hat{\theta}] = \theta$ for all $\theta$ is an important one. Letting $\hat{\theta} = g(\mathbf{x})$, where $\mathbf{x} = [x[0], x[1], \ldots, x[N-1]]^T$, it asserts that
$$E[\hat{\theta}] = \int g(\mathbf{x})\, p(\mathbf{x};\theta)\, d\mathbf{x} = \theta \quad \text{for all } \theta.$$
Example 1: Unbiased Estimator for DC Level in White Gaussian Noise. Consider the observations
$$x[n] = A + w[n], \qquad n = 0, 1, \ldots, N-1$$
where $-\infty < A < \infty$. Then a reasonable estimator for the average value of x[n] is the sample mean
$$\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n].$$
Due to the linearity properties of the expectation operator
$$E[\hat{A}] = E\left[\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right] = \frac{1}{N}\sum_{n=0}^{N-1} E\big[x[n]\big] = \frac{1}{N}\sum_{n=0}^{N-1} A = A$$
for all A.
Unbiased Estimators
Example 2 Biased Estimator for DC Level in White
Noise
Consider again Example 1 but with the modified sample mean estimator
$$\check{A} = \frac{1}{2N}\sum_{n=0}^{N-1} x[n].$$
Then,
$$E[\check{A}] = \frac{1}{2}A \;\begin{cases} = A & \text{if } A = 0 \\ \ne A & \text{if } A \ne 0. \end{cases}$$
That an estimator is unbiased does not necessarily mean
that it is a good estimator. It only guarantees that on the
average it will attain the true value.
A persistent bias will always result in a poor estimator.
Unbiased Estimators
Combining estimators problem
It sometimes occurs that multiple estimates of the same parameter are available, i.e., $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_n$.
A reasonable procedure is to combine these estimates into a better one by averaging them to form
$$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_i.$$
Assuming the estimators are unbiased, with the same variance, and uncorrelated with each other,
$$E[\hat{\theta}] = \theta$$
and
$$\mathrm{var}(\hat{\theta}) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{var}(\hat{\theta}_i) = \frac{\mathrm{var}(\hat{\theta}_1)}{n}.$$
Unbiased Estimators
Combining estimators problem (cont.)
So that as more estimates are averaged, the variance will
decrease.
However, if the estimators are biased, so that $E[\hat{\theta}_i] = \theta + b$, then
$$E[\hat{\theta}] = \frac{1}{n}\sum_{i=1}^{n}E[\hat{\theta}_i] = \theta + b$$
where $b = E[\hat{\theta}] - \theta$ is defined as the bias of the estimator, and no matter how many estimators are averaged, $\hat{\theta}$ will not converge to the true value.
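A small simulation of this effect is sketched below (assuming NumPy; theta, the per-estimate variance, and the bias b are made-up illustrative values):

import numpy as np

rng = np.random.default_rng(1)
theta, var_i, b = 2.0, 1.0, 0.5        # true value, per-estimate variance, bias
trials = 2000

for n in (1, 10, 100, 1000):
    # n unbiased estimates and n biased estimates, each with variance var_i
    unbiased = theta + np.sqrt(var_i) * rng.standard_normal((trials, n))
    biased = unbiased + b
    print(n,
          unbiased.mean(axis=1).var(), var_i / n,   # variance shrinks as var_i/n
          biased.mean(axis=1).mean())               # stays near theta + b, not theta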
Unbiased Estimators
Combining estimators problem (cont.)
Minimum Variance Criterion
Mean square error (MSE)
$$\mathrm{mse}(\hat{\theta}) = E\left[\left(\hat{\theta} - \theta\right)^2\right]$$
Unfortunately, adoption of this natural criterion leads to
unrealizable estimators, ones that cannot be written solely as a
function of the data.
To understand the problem, we first rewrite the MSE as
$$\mathrm{mse}(\hat{\theta}) = E\left\{\left[\left(\hat{\theta} - E[\hat{\theta}]\right) + \left(E[\hat{\theta}] - \theta\right)\right]^2\right\} = \mathrm{var}(\hat{\theta}) + \left(E[\hat{\theta}] - \theta\right)^2 = \mathrm{var}(\hat{\theta}) + b^2(\theta).$$
Minimum Variance Criterion
The equation shows that the MSE is composed of errors due to
the variance of the estimator as well as the bias.
As an example, for the problem in Example 1 consider the modified estimator
$$\check{A} = \frac{a}{N}\sum_{n=0}^{N-1} x[n]$$
for some constant a, and attempt to choose a to minimize the MSE. Since $E[\check{A}] = aA$ and $\mathrm{var}(\check{A}) = a^2\sigma^2/N$,
$$\mathrm{mse}(\check{A}) = \frac{a^2\sigma^2}{N} + (a-1)^2 A^2.$$
Minimum Variance Criterion
Differentiating the MSE with respect to a yields
$$\frac{d\,\mathrm{mse}(\check{A})}{da} = \frac{2a\sigma^2}{N} + 2(a-1)A^2,$$
which upon setting to zero and solving gives the optimum value
$$a_{\mathrm{opt}} = \frac{A^2}{A^2 + \sigma^2/N}.$$
Since $a_{\mathrm{opt}}$ depends on the unknown parameter A, the minimum-MSE estimator is not realizable.
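The dependence of the optimum a on the unknown A can be seen numerically with the short sketch below (assuming NumPy; A, sigma2, and N are illustrative values):

import numpy as np

A, sigma2, N = 1.0, 2.0, 10                        # illustrative values
a = np.linspace(0.0, 1.5, 1501)

# mse(a) = a^2 sigma^2/N + (a - 1)^2 A^2
mse = a**2 * sigma2 / N + (a - 1)**2 * A**2
a_opt_numeric = a[np.argmin(mse)]
a_opt_formula = A**2 / (A**2 + sigma2 / N)
print(a_opt_numeric, a_opt_formula)                # agree, and both depend on A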
Example: An MVU estimator may not exist. Suppose we observe
$$x[0] \sim \mathcal{N}(\theta, 1)$$
$$x[1] \sim \begin{cases}\mathcal{N}(\theta, 1) & \text{if } \theta \ge 0 \\ \mathcal{N}(\theta, 2) & \text{if } \theta < 0.\end{cases}$$
Minimum Variance Criterion
Example (cont.)
The two estimators
$$\hat{\theta}_1 = \frac{1}{2}\left(x[0] + x[1]\right), \qquad \hat{\theta}_2 = \frac{2}{3}x[0] + \frac{1}{3}x[1]$$
can easily be shown to be unbiased. To compute the variances we have that
$$\mathrm{var}(\hat{\theta}_1) = \frac{1}{4}\left(\mathrm{var}(x[0]) + \mathrm{var}(x[1])\right) = \begin{cases}\dfrac{18}{36} & \text{if } \theta \ge 0 \\[1ex] \dfrac{27}{36} & \text{if } \theta < 0\end{cases}$$
and
$$\mathrm{var}(\hat{\theta}_2) = \frac{4}{9}\mathrm{var}(x[0]) + \frac{1}{9}\mathrm{var}(x[1]) = \begin{cases}\dfrac{20}{36} & \text{if } \theta \ge 0 \\[1ex] \dfrac{24}{36} & \text{if } \theta < 0.\end{cases}$$
Clearly, between these two estimators no MVU estimator exists: $\hat{\theta}_1$ has the smaller variance for $\theta \ge 0$, while $\hat{\theta}_2$ has the smaller variance for $\theta < 0$.
Table of Contents
Cramer-Rao Lower Bound
Estimator Accuracy Considerations
Cramer-Rao Lower Bound
Transformation of Parameters
Estimator Accuracy Considerations
If a single sample is observed as
$$x[0] = A + w[0]$$
where $w[0] \sim \mathcal{N}(0, \sigma^2)$, and it is desired to estimate A, then we expect a better estimate if $\sigma^2$ is small.
A good unbiased estimator is $\hat{A} = x[0]$. Its variance is $\sigma^2$, so the estimator accuracy improves as $\sigma^2$ decreases.
The PDFs for two different variances are shown in Figure 5.
They are
$$p_i(x[0]; A) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left[-\frac{1}{2\sigma_i^2}\left(x[0]-A\right)^2\right], \qquad i = 1, 2.$$
Estimator Accuracy Considerations
The PDF has been plotted versus the unknown parameter A for
a given value of x[0].
Taking the logarithm of the PDF,
$$\ln p(x[0]; A) = -\ln\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}\left(x[0]-A\right)^2.$$
Then the first derivative is
$$\frac{\partial \ln p(x[0]; A)}{\partial A} = \frac{1}{\sigma^2}\left(x[0]-A\right)$$
and the negative of the second derivative becomes
$$-\frac{\partial^2 \ln p(x[0]; A)}{\partial A^2} = \frac{1}{\sigma^2}.$$
Estimator Accuracy Considerations
The curvature increases as $\sigma^2$ decreases. Since we already know that the estimator $\hat{A} = x[0]$ has variance $\sigma^2$, for this example
$$\mathrm{var}(\hat{A}) = \frac{1}{-\dfrac{\partial^2 \ln p(x[0]; A)}{\partial A^2}}.$$
When the CRLB is attained, the derivative of the log-likelihood factors as
$$\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta} = I(\theta)\left(g(\mathbf{x}) - \theta\right)$$
for some functions I and g; the estimator $\hat{\theta} = g(\mathbf{x})$ then attains the bound $1/I(\theta)$.

Example 4: DC level in white Gaussian noise. For $x[n] = A + w[n]$, $n = 0, 1, \ldots, N-1$, with $w[n]$ white Gaussian noise of variance $\sigma^2$, the likelihood function is
$$p(\mathbf{x}; A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-A\right)^2\right].$$
Cramer-Rao Lower Bound
Example 4(cont.)
Taking the first derivative,
$$\frac{\partial \ln p(\mathbf{x}; A)}{\partial A} = \frac{\partial}{\partial A}\left[-\ln\left[(2\pi\sigma^2)^{N/2}\right] - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-A\right)^2\right] = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-A\right) = \frac{N}{\sigma^2}\left(\bar{x} - A\right).$$
Differentiating again
$$\frac{\partial^2 \ln p(\mathbf{x}; A)}{\partial A^2} = -\frac{N}{\sigma^2}.$$
Cramer-Rao Lower Bound
Example 4(cont.)
And noting that the second derivative is a constant, we have
$$\mathrm{var}(\hat{A}) \ge \frac{1}{-E\left[\dfrac{\partial^2 \ln p(\mathbf{x}; A)}{\partial A^2}\right]} = \frac{\sigma^2}{N}$$
as the CRLB.
By comparing, we see that the sample mean estimator
attains the bound and must therefore be the MVU estimator.
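The Fisher information N/sigma^2 can also be checked numerically from the curvature of the log-likelihood, as in the sketch below (assuming NumPy; the data and constants are illustrative):

import numpy as np

rng = np.random.default_rng(2)
A_true, sigma2, N = 1.0, 1.0, 50                   # illustrative values
x = A_true + np.sqrt(sigma2) * rng.standard_normal(N)

def log_lik(A):
    # ln p(x; A) for the DC level in WGN, up to an additive constant
    return -0.5 * np.sum((x - A)**2) / sigma2

# second derivative by central differences at the true value
h = 1e-4
d2 = (log_lik(A_true + h) - 2 * log_lik(A_true) + log_lik(A_true - h)) / h**2
print(-d2, N / sigma2)        # Fisher information I(A) = N/sigma^2
print(sigma2 / N)             # the CRLB, attained by the sample mean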
Cramer-Rao Lower Bound
We now prove that when the CRLB is attained
$$\mathrm{var}(\hat{\theta}) = \frac{1}{I(\theta)}$$
where
$$I(\theta) = -E\left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right]$$
is the Fisher information.
From the CRLB theorem,
$$\mathrm{var}(\hat{\theta}) \ge \frac{1}{-E\left[\dfrac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right]}$$
and, when the bound is attained,
$$\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta} = I(\theta)\left(\hat{\theta} - \theta\right).$$
Cramer-Rao Lower Bound
Differentiating the latter produces
$$\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2} = \frac{\partial I(\theta)}{\partial \theta}\left(\hat{\theta} - \theta\right) - I(\theta)$$
and taking the negative expected value yields
$$-E\left[\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right] = -\frac{\partial I(\theta)}{\partial \theta}\,E\left[\hat{\theta} - \theta\right] + I(\theta) = I(\theta)$$
since the estimator is unbiased. Therefore
$$\mathrm{var}(\hat{\theta}) = \frac{1}{I(\theta)}.$$
In the next example we will see that the CRLB is not always satisfied.
Cramer-Rao Lower Bound
Example 5: Phase estimation of a sinusoid in white Gaussian noise,
$$x[n] = A\cos(2\pi f_0 n + \phi) + w[n], \qquad n = 0, 1, \ldots, N-1$$
where the amplitude A and frequency $f_0$ are known and the phase $\phi$ is to be estimated.
Example 5(cont.)
Differentiating the log-likelihood function produces
$$\frac{\partial \ln p(\mathbf{x};\phi)}{\partial \phi} = -\frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n] - A\cos(2\pi f_0 n + \phi)\right]A\sin(2\pi f_0 n + \phi)$$
$$= -\frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n]\sin(2\pi f_0 n + \phi) - \frac{A}{2}\sin(4\pi f_0 n + 2\phi)\right].$$
And
$$\frac{\partial^2 \ln p(\mathbf{x};\phi)}{\partial \phi^2} = -\frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n]\cos(2\pi f_0 n + \phi) - A\cos(4\pi f_0 n + 2\phi)\right].$$
Cramer-Rao Lower Bound
Example 5(cont.)
Upon taking the negative expected value we have
$$-E\left[\frac{\partial^2 \ln p(\mathbf{x};\phi)}{\partial \phi^2}\right] = \frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[A\cos^2(2\pi f_0 n + \phi) - A\cos(4\pi f_0 n + 2\phi)\right]$$
$$= \frac{A^2}{\sigma^2}\sum_{n=0}^{N-1}\left[\frac{1}{2} + \frac{1}{2}\cos(4\pi f_0 n + 2\phi) - \cos(4\pi f_0 n + 2\phi)\right] \approx \frac{NA^2}{2\sigma^2}.$$
Cramer-Rao Lower Bound
Example 5(cont.)
Since
$$\frac{1}{N}\sum_{n=0}^{N-1}\cos(4\pi f_0 n + 2\phi) \approx 0 \qquad \text{for } f_0 \text{ not near } 0 \text{ or } 1/2,$$
therefore
$$\mathrm{var}(\hat{\phi}) \ge \frac{2\sigma^2}{NA^2}.$$
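Both the approximation and the resulting bound are easy to evaluate numerically, as in the sketch below (assuming NumPy; A, f0, sigma2, N, and phi are illustrative values):

import numpy as np

A, f0, sigma2, N, phi = 1.0, 0.1, 0.5, 100, 0.3    # f0 not near 0 or 1/2
n = np.arange(N)

# Fisher information for the phase: exact sum vs. the approximation N A^2 / (2 sigma^2)
I_exact = (A**2 / sigma2) * np.sum(0.5 - 0.5 * np.cos(4 * np.pi * f0 * n + 2 * phi))
I_approx = N * A**2 / (2 * sigma2)

print(np.mean(np.cos(4 * np.pi * f0 * n + 2 * phi)))   # close to 0
print(1 / I_exact, 1 / I_approx)                        # CRLB: exact vs 2 sigma^2/(N A^2)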
Cramer-Rao Lower Bound
In this example the condition for the bound to be attained is not satisfied. Hence an efficient phase estimator does not exist.
An estimator which is unbiased and attains the CRLB, as the sample mean estimator in Example 4 does, is said to be efficient in that it efficiently uses the data.
Transformation of Parameters
In Example 4 we may not be interested in the sign of A but
instead may wish to estimate A2 or the power of the signal.
Knowing the CRLB for A, we can easily obtain it for A2.
If it is desired to estimate $\alpha = g(\theta)$, then the CRLB is
$$\mathrm{var}(\hat{\alpha}) \ge \frac{\left(\dfrac{\partial g}{\partial \theta}\right)^2}{-E\left[\dfrac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\right]}.$$
For the present example this becomes $\alpha = g(A) = A^2$ and
$$\mathrm{var}(\widehat{A^2}) \ge \frac{(2A)^2}{N/\sigma^2} = \frac{4A^2\sigma^2}{N}.$$
Transformation of Parameters
We saw in Example 4 that the sample mean estimator was efficient for A.
It might be supposed that $\bar{x}^2$ is efficient for $A^2$. To quickly dispel this notion we first show that $\bar{x}^2$ is not even an unbiased estimator. Since $\bar{x} \sim \mathcal{N}(A, \sigma^2/N)$,
$$E\left[\bar{x}^2\right] = E^2\left[\bar{x}\right] + \mathrm{var}(\bar{x}) = A^2 + \frac{\sigma^2}{N} \ne A^2.$$
Hence, we immediately conclude that the efficiency of an
estimator is destroyed by a nonlinear transformation.
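A short simulation makes the bias of the squared sample mean visible (a sketch assuming NumPy; A, sigma2, N, and the trial count are illustrative):

import numpy as np

rng = np.random.default_rng(3)
A, sigma2, N, trials = 1.0, 1.0, 10, 200_000       # illustrative values
x = A + np.sqrt(sigma2) * rng.standard_normal((trials, N))

xbar_sq = x.mean(axis=1)**2
print(xbar_sq.mean(), A**2 + sigma2 / N, A**2)     # mean is near A^2 + sigma^2/N, not A^2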
Transformation of Parameters
That it is maintained for linear transformations is easily verified.
Assume that an efficient estimator for $\theta$ exists and is given by $\hat{\theta}$. It is desired to estimate $g(\theta) = a\theta + b$. We choose
$$\widehat{g(\theta)} = a\hat{\theta} + b.$$
Then
$$E\left[\widehat{g(\theta)}\right] = aE[\hat{\theta}] + b = a\theta + b = g(\theta)$$
so the estimator is unbiased. The CRLB is
$$\mathrm{var}\left(\widehat{g(\theta)}\right) \ge \left(\frac{\partial g}{\partial \theta}\right)^2 \frac{1}{I(\theta)} = a^2\,\mathrm{var}(\hat{\theta})$$
since $\hat{\theta}$ is efficient. But $\mathrm{var}\left(\widehat{g(\theta)}\right) = \mathrm{var}(a\hat{\theta} + b) = a^2\,\mathrm{var}(\hat{\theta})$, so that the CRLB is achieved.
Transformation of Parameters
To examine the behavior of $\bar{x}^2$ for large N we also require its variance. For a Gaussian random variable $\xi \sim \mathcal{N}(\mu, \sigma_\xi^2)$,
$$\mathrm{var}(\xi^2) = E\left[\xi^4\right] - E^2\left[\xi^2\right] = \left(\mu^4 + 6\mu^2\sigma_\xi^2 + 3\sigma_\xi^4\right) - \left(\mu^2 + \sigma_\xi^2\right)^2 = 4\mu^2\sigma_\xi^2 + 2\sigma_\xi^4,$$
and here $\xi = \bar{x} \sim \mathcal{N}(A, \sigma^2/N)$.
Transformation of Parameters
For our problem we have then
$$\mathrm{var}(\bar{x}^2) = \frac{4A^2\sigma^2}{N} + \frac{2\sigma^4}{N^2}.$$
As $N \to \infty$, the variance approaches the CRLB value $4A^2\sigma^2/N$, the last term converging to zero faster than the first.
Our assertion that $\bar{x}^2$ is an asymptotically efficient estimator of $A^2$ is thereby verified.
This situation occurs due to the statistical linearity of the
transformation, as illustrated in Figure 6.
Transformation of Parameters
Linear Least Squares
For the scalar signal model $s[n] = \theta h[n]$ with known sequence h[n], the LS error is
$$J(\theta) = \sum_{n=0}^{N-1}\left(x[n] - \theta h[n]\right)^2.$$
A minimization is readily shown to produce the LSE
$$\hat{\theta} = \frac{\sum_{n=0}^{N-1} x[n]h[n]}{\sum_{n=0}^{N-1} h^2[n]}.$$
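A minimal sketch of this estimator (assuming NumPy; the signal shape h[n], the true theta, and the noise level are made up for illustration):

import numpy as np

rng = np.random.default_rng(4)
N, theta_true = 50, 2.5                            # illustrative values
h = np.cos(0.2 * np.pi * np.arange(N))             # known signal shape h[n]
x = theta_true * h + 0.5 * rng.standard_normal(N)  # noisy observations

theta_hat = np.sum(x * h) / np.sum(h**2)           # LSE: sum x[n]h[n] / sum h^2[n]
J_min = np.sum(x**2) - theta_hat * np.sum(x * h)   # minimum LS error
print(theta_hat, J_min)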
Linear Least Squares
The minimum LS error is
$$J_{\min} = J(\hat{\theta}) = \sum_{n=0}^{N-1}\left(x[n] - \hat{\theta}h[n]\right)\left(x[n] - \hat{\theta}h[n]\right)$$
$$= \sum_{n=0}^{N-1} x[n]\left(x[n] - \hat{\theta}h[n]\right) - \hat{\theta}\sum_{n=0}^{N-1} h[n]\left(x[n] - \hat{\theta}h[n]\right)$$
$$= \sum_{n=0}^{N-1} x^2[n] - \hat{\theta}\sum_{n=0}^{N-1} x[n]h[n],$$
where the second sum vanishes at the LSE.
Linear Least Squares
Alternatively, we can rewrite $J_{\min}$ as
$$J_{\min} = \sum_{n=0}^{N-1} x^2[n] - \frac{\left(\sum_{n=0}^{N-1} x[n]h[n]\right)^2}{\sum_{n=0}^{N-1} h^2[n]}.$$
For a vector parameter $\boldsymbol{\theta}$ ($p \times 1$) the signal model becomes
$$\mathbf{s} = \mathbf{H}\boldsymbol{\theta}.$$
The matrix H, which is a known $N \times p$ matrix ($N > p$) of full rank p, is referred to as the observation matrix.
Linear Least Squares
The LSE is found by minimizing
$$J(\boldsymbol{\theta}) = \sum_{n=0}^{N-1}\left(x[n] - s[n]\right)^2 = \left(\mathbf{x} - \mathbf{H}\boldsymbol{\theta}\right)^T\left(\mathbf{x} - \mathbf{H}\boldsymbol{\theta}\right).$$
Since
$$J(\boldsymbol{\theta}) = \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}\boldsymbol{\theta} - \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{x} + \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{H}\boldsymbol{\theta} = \mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{H}\boldsymbol{\theta} + \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{H}\boldsymbol{\theta},$$
the gradient is
$$\frac{\partial J}{\partial \boldsymbol{\theta}} = -2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\boldsymbol{\theta}.$$
Linear Least Squares
Setting the gradient equal to zero yields the LSE
$$\hat{\boldsymbol{\theta}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}.$$
The minimum LS error is then
$$J_{\min} = \left(\mathbf{x} - \mathbf{H}\hat{\boldsymbol{\theta}}\right)^T\left(\mathbf{x} - \mathbf{H}\hat{\boldsymbol{\theta}}\right) = \left[\mathbf{x} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}\right]^T\left[\mathbf{x} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}\right]$$
$$= \mathbf{x}^T\left[\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\right]\left[\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\right]\mathbf{x} = \mathbf{x}^T\left[\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\right]\mathbf{x}.$$
Linear Least Squares
The last step results from the fact that $\mathbf{I} - \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$ is an idempotent matrix, i.e., it has the property $\mathbf{A}^2 = \mathbf{A}$.
Other forms for $J_{\min}$ are
$$J_{\min} = \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x} = \mathbf{x}^T\left(\mathbf{x} - \mathbf{H}\hat{\boldsymbol{\theta}}\right).$$
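The matrix form of the LSE and of Jmin can be sketched as follows (assuming NumPy; the straight-line model H and the data are illustrative):

import numpy as np

rng = np.random.default_rng(5)
N = 30
n = np.arange(N)
H = np.column_stack([np.ones(N), n])               # N x 2 observation matrix (line fit)
theta_true = np.array([1.0, 0.2])
x = H @ theta_true + 0.3 * rng.standard_normal(N)

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)      # (H^T H)^{-1} H^T x
J_min = x @ (x - H @ theta_hat)                    # J_min = x^T (x - H theta_hat)
print(theta_hat, J_min)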
Nonlinear Least Squares
Before discussing general methods for determining
nonlinear LSEs we first describe two methods that can
reduce the complexity of the problem.
1. Transformation of parameters.
2. Separability of parameters.
In the first case we seek a one-to-one transformation of $\boldsymbol{\theta}$ that produces a linear signal model in the new space. To do so we let
$$\boldsymbol{\alpha} = \mathbf{g}(\boldsymbol{\theta})$$
where g is a p-dimensional function of $\boldsymbol{\theta}$ whose inverse exists.
Nonlinear Least Squares
If a g can be found so that
$$\mathbf{s}(\boldsymbol{\theta}) = \mathbf{s}\!\left(\mathbf{g}^{-1}(\boldsymbol{\alpha})\right) = \mathbf{H}\boldsymbol{\alpha},$$
then the signal model will be linear in $\boldsymbol{\alpha}$. We can then easily find the linear LSE of $\boldsymbol{\alpha}$, and thus the nonlinear LSE of $\boldsymbol{\theta}$, by
$$\hat{\boldsymbol{\theta}} = \mathbf{g}^{-1}(\hat{\boldsymbol{\alpha}}), \qquad \hat{\boldsymbol{\alpha}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}.$$
Example: Sinusoidal Parameter Estimation. Consider the signal model
$$s[n] = A\cos(2\pi f_0 n + \phi), \qquad n = 0, 1, \ldots, N-1$$
with $f_0$ known, and minimize
$$J(A, \phi) = \sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right)^2$$
over A and $\phi$.
Nonlinear Least Squares
Example Sinusoidal Parameter Estimation (cont.)
Because
$$A\cos(2\pi f_0 n + \phi) = A\cos\phi\cos(2\pi f_0 n) - A\sin\phi\sin(2\pi f_0 n),$$
if we let
$$\alpha_1 = A\cos\phi, \qquad \alpha_2 = -A\sin\phi,$$
then the signal model becomes
$$s[n] = \alpha_1\cos(2\pi f_0 n) + \alpha_2\sin(2\pi f_0 n).$$
In matrix form this is
$$\mathbf{s} = \mathbf{H}\boldsymbol{\alpha}.$$
Nonlinear Least Squares
Example Sinusoidal Parameter Estimation (cont.)
where
$$\mathbf{H} = \begin{bmatrix} 1 & 0 \\ \cos(2\pi f_0) & \sin(2\pi f_0) \\ \vdots & \vdots \\ \cos\left(2\pi f_0 (N-1)\right) & \sin\left(2\pi f_0 (N-1)\right) \end{bmatrix},$$
which is now linear in the new parameters. The LSE of $\boldsymbol{\alpha}$ is
$$\hat{\boldsymbol{\alpha}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}.$$
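A sketch of the whole procedure, including the mapping from alpha back to amplitude and phase, is given below (assuming NumPy; the true A, phi, f0 and noise level are illustrative, and the inverse mapping follows the convention alpha2 = -A sin(phi) used above):

import numpy as np

rng = np.random.default_rng(6)
N, f0 = 100, 0.08
A_true, phi_true = 1.5, 0.7
n = np.arange(N)
x = A_true * np.cos(2 * np.pi * f0 * n + phi_true) + 0.4 * rng.standard_normal(N)

H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
alpha = np.linalg.solve(H.T @ H, H.T @ x)          # linear LSE of [alpha1, alpha2]

A_hat = np.hypot(alpha[0], alpha[1])               # A = sqrt(alpha1^2 + alpha2^2)
phi_hat = np.arctan2(-alpha[1], alpha[0])          # phi = arctan(-alpha2/alpha1)
print(A_hat, phi_hat)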
In the second case, separability of parameters, the signal model can in general be written as
$$\mathbf{s} = \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}$$
where $\mathbf{H}(\boldsymbol{\alpha})$ is an $N \times q$ matrix dependent on $\boldsymbol{\alpha}$. This model is linear in $\boldsymbol{\beta}$ but nonlinear in $\boldsymbol{\alpha}$.
Nonlinear Least Squares
where
$$\boldsymbol{\theta} = \begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{bmatrix} \quad \begin{matrix} (p-q)\times 1 \\ q \times 1. \end{matrix}$$
As a result, the LS error may be minimized with respect to $\boldsymbol{\beta}$ and thus reduced to a function of $\boldsymbol{\alpha}$ only. Since
$$J(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \left(\mathbf{x} - \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}\right)^T\left(\mathbf{x} - \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}\right),$$
the $\boldsymbol{\beta}$ that minimizes $J$ for a given $\boldsymbol{\alpha}$ is
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{x}.$$
Nonlinear Least Squares
The resulting LS error is
$$J(\boldsymbol{\alpha}, \hat{\boldsymbol{\beta}}) = \mathbf{x}^T\left[\mathbf{I} - \mathbf{H}(\boldsymbol{\alpha})\left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\right]\mathbf{x}.$$
The problem now reduces to a maximization of
$$\mathbf{x}^T\mathbf{H}(\boldsymbol{\alpha})\left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{x}$$
over $\boldsymbol{\alpha}$.
If, for instance, $q = p - 1$, so that $\boldsymbol{\alpha}$ is a scalar, then a grid search can possibly be used.
Nonlinear Least Squares
Example Damped Exponentials
Assume we have the signal model
$$s[n] = A_1 r^n + A_2 r^{2n} + A_3 r^{3n}$$
where the unknown parameters are $\{A_1, A_2, A_3, r\}$ and it is known that $0 < r < 1$.
Then the model is linear in the amplitudes $\boldsymbol{\beta} = [A_1, A_2, A_3]^T$ and nonlinear in the damping factor $\alpha = r$.
The nonlinear LSE is obtained by maximizing
$$\mathbf{x}^T\mathbf{H}(r)\left(\mathbf{H}^T(r)\mathbf{H}(r)\right)^{-1}\mathbf{H}^T(r)\mathbf{x}$$
over $0 < r < 1$.
Nonlinear Least Squares
Example Damped Exponentials (cont.)
where
$$\mathbf{H}(r) = \begin{bmatrix} 1 & 1 & 1 \\ r & r^2 & r^3 \\ \vdots & \vdots & \vdots \\ r^{N-1} & r^{2(N-1)} & r^{3(N-1)} \end{bmatrix}.$$
Once $\hat{r}$ is found, we have the LSE for the amplitudes
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{H}^T(\hat{r})\mathbf{H}(\hat{r})\right)^{-1}\mathbf{H}^T(\hat{r})\mathbf{x}.$$
This maximization is easily carried out on a digital computer.
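A sketch of that computation (assuming NumPy; the true amplitudes, the true r, the noise level, and the grid spacing are illustrative choices):

import numpy as np

rng = np.random.default_rng(7)
N = 60
r_true = 0.9
beta_true = np.array([1.0, -0.5, 0.3])             # A1, A2, A3
n = np.arange(N)

def make_H(r):
    # columns r^n, r^(2n), r^(3n)
    return np.column_stack([r**n, r**(2 * n), r**(3 * n)])

x = make_H(r_true) @ beta_true + 0.05 * rng.standard_normal(N)

def fitted_energy(r):
    H = make_H(r)
    beta, *_ = np.linalg.lstsq(H, x, rcond=None)
    return x @ H @ beta                             # x^T H (H^T H)^{-1} H^T x

r_grid = np.linspace(0.01, 0.99, 99)
r_hat = max(r_grid, key=fitted_energy)              # grid search over 0 < r < 1
beta_hat, *_ = np.linalg.lstsq(make_H(r_hat), x, rcond=None)
print(r_hat, beta_hat)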
The Bayesian Philosophy
We now depart from the classical approach to statistical estimation, in which the parameter of interest is assumed to be a deterministic but unknown constant.
Consider again the DC level in WGN, where it is now known that A must lie in the interval $[-A_0, A_0]$. With this prior knowledge a natural estimator is the truncated sample mean
$$\check{A} = \begin{cases} -A_0, & \bar{x} < -A_0 \\ \bar{x}, & -A_0 \le \bar{x} \le A_0 \\ A_0, & \bar{x} > A_0. \end{cases}$$
It is seen that $\check{A}$ is a biased estimator. However, if we compare the MSE of the two estimators, we note that for any A in the interval $-A_0 \le A \le A_0$
$$\mathrm{mse}(\bar{x}) = \int_{-\infty}^{-A_0}\left(\xi - A\right)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{-A_0}^{A_0}\left(\xi - A\right)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{A_0}^{\infty}\left(\xi - A\right)^2 p_{\bar{x}}(\xi; A)\, d\xi$$
$$> \int_{-\infty}^{-A_0}\left(-A_0 - A\right)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{-A_0}^{A_0}\left(\xi - A\right)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{A_0}^{\infty}\left(A_0 - A\right)^2 p_{\bar{x}}(\xi; A)\, d\xi$$
$$= \mathrm{mse}(\check{A}).$$
Prior Knowledge and Estimation
Hence, $\check{A}$, the truncated sample mean estimator, is better than the sample mean estimator in terms of MSE.
Although $\bar{x}$ is still the MVU estimator, we have been able to reduce the mean square error by allowing the estimator to be biased.
Knowing that A must lie in a known interval, we suppose that the
true value of A has been chosen from that interval. We then model
the process of choosing a value as a random event to which a PDF
can be assigned.
With knowledge only of the interval and no inclination as to
whether A should be nearer any particular value, it makes sense to
assign a U[-A0, A0] PDF to the random variable A.
Prior Knowledge and Estimation
The overall data model then appears as in the following figure.
The Bayesian MSE is defined as
$$\mathrm{Bmse}(\hat{A}) = E\left[\left(A - \hat{A}\right)^2\right].$$
We choose to define the error as $A - \hat{A}$, in contrast to the classical estimation error of $\hat{A} - A$.
Now we emphasize that since A is a random variable, the
expectation operator is with respect to the joint PDF p(x,A).
This is a fundamentally different MSE than in the classical case.
We distinguish it by using the Bmse notation.
Prior Knowledge and Estimation
To appreciate the difference compare the classical MSE
$$\mathrm{mse}(\hat{A}) = \int \left(\hat{A} - A\right)^2 p(\mathbf{x}; A)\, d\mathbf{x}$$
to the Bayesian MSE
$$\mathrm{Bmse}(\hat{A}) = \iint \left(A - \hat{A}\right)^2 p(\mathbf{x}, A)\, d\mathbf{x}\, dA.$$
Note that whereas the classical MSE will depend on A, and hence
estimators that attempt to minimize the MSE will usually depend
on A, the Bayesian MSE will not. In effect, we have integrated
the parameter dependence away!
Prior Knowledge and Estimation
To complete our example we now derive the estimator that minimizes the Bayesian MSE. First, we use Bayes' theorem to write
$$p(\mathbf{x}, A) = p(A|\mathbf{x})\, p(\mathbf{x})$$
so that
$$\mathrm{Bmse}(\hat{A}) = \int\left[\int \left(A - \hat{A}\right)^2 p(A|\mathbf{x})\, dA\right] p(\mathbf{x})\, d\mathbf{x}.$$
Since $p(\mathbf{x}) \ge 0$, it suffices to minimize the inner integral for each $\mathbf{x}$:
$$\frac{\partial}{\partial \hat{A}}\int \left(A - \hat{A}\right)^2 p(A|\mathbf{x})\, dA = -2\int A\, p(A|\mathbf{x})\, dA + 2\hat{A}\int p(A|\mathbf{x})\, dA,$$
which when set equal to zero results in
$$\hat{A} = \int A\, p(A|\mathbf{x})\, dA$$
or finally
$$\hat{A} = E\left[A|\mathbf{x}\right].$$
Prior Knowledge and Estimation
It is seen that the optimal estimator in terms of minimizing the
Bayesian MSE is the mean of the posterior PDF p(A|x).
The posterior PDF refers to the PDF of A after the data have been
observed. In contrast, p(A), or
$$p(A) = \int p(\mathbf{x}, A)\, d\mathbf{x},$$
is the prior PDF, the PDF of A before any data are observed.
Returning to the uniform prior example, the exponent of the likelihood can be rewritten by completing the square:
$$\sum_{n=0}^{N-1}\left(x[n]-A\right)^2 = \sum_{n=0}^{N-1}x^2[n] - 2NA\bar{x} + NA^2 = N\left(A - \bar{x}\right)^2 + \sum_{n=0}^{N-1}x^2[n] - N\bar{x}^2.$$
Prior Knowledge and Estimation
So that we have
$$p(A|\mathbf{x}) = \begin{cases} \dfrac{1}{c}\,\dfrac{1}{\sqrt{2\pi\sigma^2/N}}\exp\left[-\dfrac{1}{2\sigma^2/N}\left(A-\bar{x}\right)^2\right], & |A| \le A_0 \\[2ex] 0, & |A| > A_0 \end{cases}$$
where c is the constant that normalizes the PDF. The MMSE estimator is then
$$\hat{A} = E\left[A|\mathbf{x}\right] = \frac{\displaystyle\int_{-A_0}^{A_0} \frac{A}{\sqrt{2\pi\sigma^2/N}}\exp\left[-\frac{1}{2\sigma^2/N}\left(A-\bar{x}\right)^2\right] dA}{\displaystyle\int_{-A_0}^{A_0} \frac{1}{\sqrt{2\pi\sigma^2/N}}\exp\left[-\frac{1}{2\sigma^2/N}\left(A-\bar{x}\right)^2\right] dA}.$$
Prior Knowledge and Estimation
Although this cannot be evaluated in closed form, we note that $\hat{A}$ will be a function of $\bar{x}$ as well as of $A_0$ and $\sigma^2$.
The MMSE estimator will not be $\bar{x}$, due to the truncation shown in figure (b), unless $A_0$ is so large that there is effectively no truncation. This will occur if $A_0 \gg \sqrt{\sigma^2/N}$.
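Since no closed form exists, the estimator can be evaluated by numerical integration, as in the sketch below (assuming NumPy; A0, sigma2, N, and the observed sample mean are illustrative values):

import numpy as np

A0, sigma2, N = 1.0, 1.0, 10                       # illustrative values
xbar = 0.8                                         # observed sample mean
s2 = sigma2 / N

# posterior over [-A0, A0] is a truncated Gaussian centered at xbar
A = np.linspace(-A0, A0, 20001)
w = np.exp(-(A - xbar)**2 / (2 * s2))              # unnormalized posterior on the grid
A_mmse = np.sum(A * w) / np.sum(w)                 # posterior mean (grid approximation)
print(A_mmse, xbar)                                # pulled toward the interior of [-A0, A0]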
We will also need the conditional PDF of jointly Gaussian vectors. If $\mathbf{x}$ ($k \times 1$) and $\mathbf{y}$ ($l \times 1$) are jointly Gaussian with joint PDF
$$p(\mathbf{x}, \mathbf{y}) = \frac{1}{(2\pi)^{\frac{k+l}{2}}\det^{\frac{1}{2}}(\mathbf{C})}\exp\left[-\frac{1}{2}\begin{bmatrix}\mathbf{x}-E(\mathbf{x})\\ \mathbf{y}-E(\mathbf{y})\end{bmatrix}^T\mathbf{C}^{-1}\begin{bmatrix}\mathbf{x}-E(\mathbf{x})\\ \mathbf{y}-E(\mathbf{y})\end{bmatrix}\right]$$
where
$$\mathbf{C} = \begin{bmatrix}\mathbf{C}_{xx} & \mathbf{C}_{xy} \\ \mathbf{C}_{yx} & \mathbf{C}_{yy}\end{bmatrix},$$
then the conditional PDF $p(\mathbf{y}|\mathbf{x})$ is also Gaussian and
$$E(\mathbf{y}|\mathbf{x}) = E(\mathbf{y}) + \mathbf{C}_{yx}\mathbf{C}_{xx}^{-1}\left(\mathbf{x} - E(\mathbf{x})\right)$$
$$\mathbf{C}_{y|x} = \mathbf{C}_{yy} - \mathbf{C}_{yx}\mathbf{C}_{xx}^{-1}\mathbf{C}_{xy}.$$
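A small numerical check of these two formulas for scalar x and y (a sketch assuming NumPy; the mean vector, covariance matrix, and conditioning value are made up):

import numpy as np

rng = np.random.default_rng(8)
mu = np.array([1.0, -2.0])                         # [E(x), E(y)]
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])                         # [[Cxx, Cxy], [Cyx, Cyy]]
x_obs = 1.5

# theorem: conditional mean and variance of y given x = x_obs
m = mu[1] + C[1, 0] / C[0, 0] * (x_obs - mu[0])
v = C[1, 1] - C[1, 0] * C[0, 1] / C[0, 0]

# Monte Carlo check: sample jointly, keep pairs with x near x_obs
z = rng.multivariate_normal(mu, C, size=2_000_000)
near = z[np.abs(z[:, 0] - x_obs) < 0.01, 1]
print(m, near.mean())
print(v, near.var())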
Bayesian Linear Model
Theorem Posterior PDF for the Bayesian General
Linear Model
If the observed data x can be modeled as
$$\mathbf{x} = \mathbf{H}\boldsymbol{\theta} + \mathbf{w}$$
where x is an $N \times 1$ data vector, H is a known $N \times p$ matrix, $\boldsymbol{\theta}$ is a $p \times 1$ random vector with prior PDF $\mathcal{N}(\boldsymbol{\mu}_\theta, \mathbf{C}_\theta)$, and w is an $N \times 1$ noise vector with PDF $\mathcal{N}(\mathbf{0}, \mathbf{C}_w)$, independent of $\boldsymbol{\theta}$, then the posterior PDF $p(\boldsymbol{\theta}|\mathbf{x})$ is Gaussian with mean
$$E(\boldsymbol{\theta}|\mathbf{x}) = \boldsymbol{\mu}_\theta + \mathbf{C}_\theta\mathbf{H}^T\left(\mathbf{H}\mathbf{C}_\theta\mathbf{H}^T + \mathbf{C}_w\right)^{-1}\left(\mathbf{x} - \mathbf{H}\boldsymbol{\mu}_\theta\right)$$
and covariance
$$\mathbf{C}_{\theta|x} = \mathbf{C}_\theta - \mathbf{C}_\theta\mathbf{H}^T\left(\mathbf{H}\mathbf{C}_\theta\mathbf{H}^T + \mathbf{C}_w\right)^{-1}\mathbf{H}\mathbf{C}_\theta.$$
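The posterior mean and covariance follow directly from these formulas; the sketch below applies them to a scalar DC level (assuming NumPy; the prior, noise variance, and data sizes are illustrative):

import numpy as np

rng = np.random.default_rng(9)
N, sigma2 = 20, 1.0
H = np.ones((N, 1))                                # DC level: x[n] = theta + w[n]
mu_theta = np.array([0.0])                         # prior mean
C_theta = np.array([[4.0]])                        # prior covariance
C_w = sigma2 * np.eye(N)

theta_true = 1.3
x = H @ np.array([theta_true]) + np.sqrt(sigma2) * rng.standard_normal(N)

S = H @ C_theta @ H.T + C_w
K = C_theta @ H.T @ np.linalg.inv(S)               # gain C_theta H^T (H C_theta H^T + C_w)^{-1}
post_mean = mu_theta + K @ (x - H @ mu_theta)
post_cov = C_theta - K @ H @ C_theta
print(post_mean, post_cov, x.mean())               # posterior mean shrinks x-bar toward mu_theta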
Risk Function
Previously, we derived the MMSE estimator by minimizing $E[(\theta - \hat{\theta})^2]$, where the expectation is with respect to the joint PDF $p(\mathbf{x}, \theta)$.
If we let $\epsilon = \theta - \hat{\theta}$ denote the error of the estimator for a particular realization of $\mathbf{x}$ and $\theta$, and also let $C(\epsilon) = \epsilon^2$, then the MSE criterion minimizes $E[C(\epsilon)]$.
The deterministic function $C(\epsilon)$ is termed the cost function. With this quadratic cost, large errors are particularly costly.
Also, the average cost or $E[C(\epsilon)]$ is termed the Bayes risk $\mathcal{R}$, or
$$\mathcal{R} = E\left[C(\epsilon)\right],$$
and measures the performance of a given estimator.
Risk Function
Examples of cost functions.
Risk Function
The Bayes risk $\mathcal{R}$ is
$$\mathcal{R} = E\left[C(\epsilon)\right] = \iint C(\theta - \hat{\theta})\, p(\mathbf{x}, \theta)\, d\mathbf{x}\, d\theta = \int\left[\int C(\theta - \hat{\theta})\, p(\theta|\mathbf{x})\, d\theta\right] p(\mathbf{x})\, d\mathbf{x}.$$
Maximum A Posteriori Estimators
In the MAP estimation approach we choose to maximize the posterior PDF, or
$$\hat{\theta} = \arg\max_{\theta}\, p(\theta|\mathbf{x}).$$
In finding the maximum of $p(\theta|\mathbf{x})$ we observe that
$$p(\theta|\mathbf{x}) = \frac{p(\mathbf{x}|\theta)\,p(\theta)}{p(\mathbf{x})},$$
so an equivalent maximization is that of $p(\mathbf{x}|\theta)\,p(\theta)$. Hence, the MAP estimator is
$$\hat{\theta} = \arg\max_{\theta}\, p(\mathbf{x}|\theta)\,p(\theta)$$
or, equivalently,
$$\hat{\theta} = \arg\max_{\theta}\left[\ln p(\mathbf{x}|\theta) + \ln p(\theta)\right].$$
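A minimal sketch of a MAP estimate computed on a grid, for a DC level with a Gaussian prior (assuming NumPy; the prior parameters and data model are illustrative, not the example from the slides):

import numpy as np

rng = np.random.default_rng(10)
N, sigma2 = 20, 1.0
mu_A, var_A = 0.0, 0.5                             # Gaussian prior on the DC level A
A_true = 0.9
x = A_true + np.sqrt(sigma2) * rng.standard_normal(N)

A_grid = np.linspace(-3, 3, 6001)
log_lik = -0.5 * np.sum((x[None, :] - A_grid[:, None])**2, axis=1) / sigma2
log_prior = -0.5 * (A_grid - mu_A)**2 / var_A
A_map = A_grid[np.argmax(log_lik + log_prior)]     # arg max [ln p(x|A) + ln p(A)]

# Gaussian prior + Gaussian likelihood: the MAP estimate also has a closed form
A_closed = (N / sigma2 * x.mean() + mu_A / var_A) / (N / sigma2 + 1 / var_A)
print(A_map, A_closed)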