
Wireless Information Transmission System Lab.

Institute of Communications Engineering National Sun Yat-sen University


Table of Contents
Estimation Theory
Assessing Estimator Performance
Minimum Variance Unbiased (MVU) Estimation
Cramer-Rao Lower Bound
Assessing Estimator Performance
Consider the data set shown in Figure 1, in which x[n] consists of
a DC level A in noise.

---Figure 1

We could model the data as
$x[n] = A + w[n]$
where w[n] denotes some zero-mean noise process.
Assessing Estimator Performance
Based on the data set {x[0], x[1], ..., x[N-1]}, we would like to
estimate A.
It would be reasonable to estimate A as
$\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$
that is, by the sample mean of the data.


Several questions come to mind:
How close will $\hat{A}$ be to A?
Are there better estimators than the sample mean?
Assessing Estimator Performance
For the data set in Figure 1, it turns out that $\hat{A} = 0.9$,
which is close to the true value of A = 1.
Another estimator might be
$\check{A} = x[0]$
For the data set in Figure 1, $\check{A} = 0.95$, which is closer to
the true value of A than the sample mean estimate.
Can we conclude that $\check{A}$ is a better estimator than $\hat{A}$?
Because an estimator is a function of the data, which are random
variables, it too is a random variable, subject to many
possible outcomes.
Assessing Estimator Performance
Suppose we repeat the experiment by fixing A = 1 and adding
a different noise realization each time.
We determine the values of the two estimators for each data
set.
For 100 realizations the histograms are shown in Figures 2 and
3.

---Figure 2
Assessing Estimator Performance

---Figure 3
It should be evident that $\hat{A}$ is a better estimator than $\check{A}$ because the values
obtained are more concentrated about the true value of A = 1.
$\hat{A}$ will usually produce a value closer to the true one than $\check{A}$.
Assessing Estimator Performance

To prove that $\hat{A}$ is better we could establish that its
variance is less.
The modeling assumptions that we must employ are that
the w[n]'s, in addition to being zero mean, are
uncorrelated and have equal variance $\sigma^2$.
We first show that the mean of each estimator is the true
value, or
$E(\hat{A}) = E\left[\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right] = \frac{1}{N}\sum_{n=0}^{N-1} E(x[n]) = A$
Assessing Estimator Performance



$E(\check{A}) = E(x[0]) = A$
Second, the variances are
$\mathrm{var}(\hat{A}) = \mathrm{var}\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right) = \frac{1}{N^2}\sum_{n=0}^{N-1} \mathrm{var}(x[n]) = \frac{1}{N^2}\,N\sigma^2 = \frac{\sigma^2}{N}$
since the w[n]'s are uncorrelated, and thus
$\mathrm{var}(\check{A}) = \mathrm{var}(x[0]) = \sigma^2 > \mathrm{var}(\hat{A})$
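As a quick sanity check, the histogram experiment above can be reproduced numerically. The sketch below (a minimal example with assumed values A = 1, sigma = 0.3, N = 50, and 100 realizations, none of which are specified in the slides) estimates the empirical variance of both estimators and shows roughly the factor-of-N difference.

```python
import numpy as np

rng = np.random.default_rng(0)
A, sigma, N, trials = 1.0, 0.3, 50, 100   # assumed values, not from the slides

A_hat = np.empty(trials)    # sample-mean estimator
A_check = np.empty(trials)  # single-sample estimator x[0]
for t in range(trials):
    x = A + sigma * rng.standard_normal(N)  # x[n] = A + w[n]
    A_hat[t] = x.mean()
    A_check[t] = x[0]

print("var(A_hat)  ~", A_hat.var(),   " (theory sigma^2/N =", sigma**2 / N, ")")
print("var(A_check)~", A_check.var(), " (theory sigma^2   =", sigma**2, ")")
```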
Table of Contents
Minimum Variance Unbiased (MVU) Estimation
Unbiased Estimators
Minimum Variance Criterion
Unbiased Estimators
For an estimator to be unbiased we mean that on the average
the estimator will yield the true value of the unknown parameter.
Mathematically, an estimator is unbiased if
$E(\hat{\theta}) = \theta, \quad a < \theta < b$
where (a, b) denotes the range of possible values of $\theta$.
Unbiased estimators tend to have symmetric PDFs centered
about the true value of $\theta$.
For Example 1 the PDF of $\hat{A}$ is shown in Figure 4 and is easily
shown to be $\mathcal{N}(A, \sigma^2/N)$.
Unbiased Estimators

The restriction that $E(\hat{\theta}) = \theta$ for all $\theta$ is an important one.
Letting $\hat{\theta} = g(\mathbf{x})$, where $\mathbf{x} = [x[0], x[1], \ldots, x[N-1]]^T$, it asserts that
$E(\hat{\theta}) = \int g(\mathbf{x})\, p(\mathbf{x};\theta)\, d\mathbf{x} = \theta$ for all $\theta$.

Figure 4 Probability density function for the sample mean estimator


Unbiased Estimators
Example 1 Unbiased Estimator for DC Level in WGN
Consider the observations
$x[n] = A + w[n], \quad n = 0, 1, \ldots, N-1$
where $-\infty < A < \infty$. Then a reasonable estimator for the
average value of x[n] is
$\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$
Due to the linearity properties of the expectation operator
$E(\hat{A}) = E\left[\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right] = \frac{1}{N}\sum_{n=0}^{N-1} E(x[n]) = \frac{1}{N}\sum_{n=0}^{N-1} A = A$
for all A.
Unbiased Estimators
Example 2 Biased Estimator for DC Level in White
Noise
Consider again Example 1 but with the modified sample mean
estimator
$\check{A} = \frac{1}{2N}\sum_{n=0}^{N-1} x[n]$
Then,
$E(\check{A}) = \frac{1}{2}A \neq A$ if $A \neq 0$, and $= A$ if $A = 0$
That an estimator is unbiased does not necessarily mean
that it is a good estimator. It only guarantees that on the
average it will attain the true value.
A persistent bias will always result in a poor estimator.
Unbiased Estimators
Combining estimators problem
It sometimes occurs that multiple estimates of the same
parameter are available, i.e., $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_n$.
A reasonable procedure is to combine these estimates into a
better one by averaging them to form
$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} \hat{\theta}_i$
Assuming the estimators are unbiased, with the same
variance, and uncorrelated with each other,
$E(\hat{\theta}) = \theta$
$\mathrm{var}(\hat{\theta}) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{var}(\hat{\theta}_i) = \frac{\mathrm{var}(\hat{\theta}_1)}{n}$
Unbiased Estimators
Combining estimators problem (cont.)
So that as more estimates are averaged, the variance will
decrease.
However, if the estimators are biased, or $E(\hat{\theta}_i) = \theta + b(\theta)$,
then
$E(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} E(\hat{\theta}_i) = \theta + b(\theta)$
where $b(\theta) = E(\hat{\theta}) - \theta$ is defined as the bias of the estimator,
and no matter how many estimators are averaged, $\hat{\theta}$ will
not converge to the true value.
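A brief numeric illustration (assumed setup, not from the slides: n independent unbiased estimates with variance 1, versus the same estimates shifted by a constant bias b = 0.2) shows the variance shrinking as 1/n while the bias is untouched by averaging.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, b, var1, trials = 3.0, 0.2, 1.0, 5000   # assumed values for illustration

for n in (1, 10, 100):
    # n unbiased estimates per trial, then the same estimates shifted by a bias b
    est = theta + np.sqrt(var1) * rng.standard_normal((trials, n))
    avg_unbiased = est.mean(axis=1)
    avg_biased = (est + b).mean(axis=1)
    print(f"n={n:3d}  var of average ~ {avg_unbiased.var():.4f} (theory {var1/n:.4f}),"
          f"  mean of biased average ~ {avg_biased.mean():.3f} (true theta = {theta})")
```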
Minimum Variance Criterion
Mean square error (MSE)
$\mathrm{mse}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$
Unfortunately, adoption of this natural criterion leads to
unrealizable estimators, ones that cannot be written solely as a
function of the data.
To understand the problem, we first rewrite the MSE as
$\mathrm{mse}(\hat{\theta}) = E\left\{\left[\left(\hat{\theta} - E(\hat{\theta})\right) + \left(E(\hat{\theta}) - \theta\right)\right]^2\right\}$
$= \mathrm{var}(\hat{\theta}) + \left[E(\hat{\theta}) - \theta\right]^2 = \mathrm{var}(\hat{\theta}) + b^2(\theta)$
Minimum Variance Criterion
The equation shows that the MSE is composed of errors due to
the variance of the estimator as well as the bias.
As an example, for the problem in Example 1 consider the
modified estimator
$\check{A} = a\,\frac{1}{N}\sum_{n=0}^{N-1} x[n]$
We will attempt to find the a which results in the minimum
MSE.
Since $E(\check{A}) = aA$ and $\mathrm{var}(\check{A}) = a^2\sigma^2/N$, we have
$\mathrm{mse}(\check{A}) = \mathrm{var}(\check{A}) + b^2(A) = \frac{a^2\sigma^2}{N} + (a-1)^2 A^2$
Minimum Variance Criterion
Differentiating the MSE with respect to a yields
$\frac{d\,\mathrm{mse}(\check{A})}{da} = \frac{2a\sigma^2}{N} + 2(a-1)A^2$
which upon setting to zero and solving yields the optimum
value
$a_{\mathrm{opt}} = \frac{A^2}{A^2 + \sigma^2/N}$
It is seen that the optimal value of a depends upon the
unknown parameter A. The estimator is therefore not
realizable.
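The dependence of the optimal scaling on A can be seen numerically; the sketch below (assumed values A = 1, sigma = 1, N = 10) minimizes the MSE expression over a on a grid and compares the result with the closed-form a_opt.

```python
import numpy as np

A, sigma, N = 1.0, 1.0, 10          # assumed values for illustration
a = np.linspace(0.0, 2.0, 20001)    # grid over the scaling factor
mse = a**2 * sigma**2 / N + (a - 1)**2 * A**2

a_opt_closed = A**2 / (A**2 + sigma**2 / N)
print("grid minimizer  :", a[np.argmin(mse)])
print("closed-form aopt:", a_opt_closed)   # depends on the unknown A -> not realizable
```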
Minimum Variance Criterion
It would seem that any criterion which depends on the
bias will lead to an unrealizable estimator.
Although this is generally true, on occasion a realizable
minimum MSE estimator can be found. From a practical
viewpoint, however, the minimum MSE criterion generally needs to be
abandoned.
An alternative approach is:
Constrain the bias to be zero.
Find the estimator which minimizes the variance.
Such an estimator is termed the minimum variance
unbiased (MVU) estimator.
Minimum Variance Criterion
Possible dependence of estimator variance with $\theta$.
In the former case $\hat{\theta}_3$ is sometimes referred to as the
uniformly minimum variance unbiased estimator.
In general, the MVU estimator does not always exist.
Minimum Variance Criterion
Example 3 Counterexample to Existence of MVU
Estimator
If the form of the PDF changes with $\theta$, then it would be
expected that the best estimator would also change with $\theta$.
Assume that we have two independent observations
$x[0] \sim \mathcal{N}(\theta, 1)$
$x[1] \sim \mathcal{N}(\theta, 1)$ if $\theta \geq 0$
$x[1] \sim \mathcal{N}(\theta, 2)$ if $\theta < 0$
Minimum Variance Criterion
Example (cont.)
The two estimators
$\hat{\theta}_1 = \frac{1}{2}\left(x[0] + x[1]\right)$
$\hat{\theta}_2 = \frac{2}{3}x[0] + \frac{1}{3}x[1]$
can easily be shown to be unbiased. To compute the variances
we have that
$\mathrm{var}(\hat{\theta}_1) = \frac{1}{4}\left[\mathrm{var}(x[0]) + \mathrm{var}(x[1])\right]$
$\mathrm{var}(\hat{\theta}_2) = \frac{4}{9}\mathrm{var}(x[0]) + \frac{1}{9}\mathrm{var}(x[1])$
Minimum Variance Criterion
Example (cont.)
so that
$\mathrm{var}(\hat{\theta}_1) = \frac{18}{36}$ if $\theta \geq 0$, and $\frac{27}{36}$ if $\theta < 0$
$\mathrm{var}(\hat{\theta}_2) = \frac{20}{36}$ if $\theta \geq 0$, and $\frac{24}{36}$ if $\theta < 0$
Clearly, between these two estimators no MVU estimator
exists: $\hat{\theta}_1$ has the smaller variance for $\theta \geq 0$, while $\hat{\theta}_2$ has the
smaller variance for $\theta < 0$.
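A quick simulation (a sketch under the stated model, with an arbitrary true theta of each sign) reproduces the four variance values and shows that neither estimator is uniformly best.

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 200_000

for theta, var1 in ((1.0, 1.0), (-1.0, 2.0)):   # var(x[1]) = 1 if theta >= 0, 2 if theta < 0
    x0 = theta + rng.standard_normal(trials)
    x1 = theta + np.sqrt(var1) * rng.standard_normal(trials)
    t1 = 0.5 * (x0 + x1)                        # theta_hat_1
    t2 = (2.0 / 3.0) * x0 + (1.0 / 3.0) * x1    # theta_hat_2
    print(f"theta={theta:+.0f}:  var(t1) ~ {t1.var():.3f},  var(t2) ~ {t2.var():.3f}")
    # expected: 18/36 vs 20/36 for theta >= 0, and 27/36 vs 24/36 for theta < 0
```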
Table of Contents
Cramer-Rao Lower Bound
Estimator Accuracy Considerations
Cramer-Rao Lower Bound
Transformation of Parameters
Estimator Accuracy Considerations
If a single sample is observed as
$x[0] = A + w[0]$
and it is desired to estimate A, then we expect a better estimate if
$\sigma^2$ is small.
A good unbiased estimator is $\hat{A} = x[0]$. The variance is $\sigma^2$, so
the estimator accuracy improves as $\sigma^2$ decreases.
The PDFs for two different variances are shown in Figure 5.
They are
$p_i(x[0]; A) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left[-\frac{1}{2\sigma_i^2}\left(x[0] - A\right)^2\right]$
for i = 1, 2.
Estimator Accuracy Considerations
The PDF has been plotted versus the unknown parameter A for
a given value of x[0].

Figure 5 PDF dependence on unknown parameter


If $\sigma_1^2 < \sigma_2^2$, then we should be able to estimate A more
accurately based on $p_1(x[0]; A)$.
Estimator Accuracy Considerations
When the PDF is viewed as a function of the unknown parameter
(with x fixed), it is termed the likelihood function.
If we consider the natural logarithm of the PDF
$\ln p(x[0]; A) = -\ln\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}\left(x[0] - A\right)^2$
then the first derivative is
$\frac{\partial \ln p(x[0]; A)}{\partial A} = \frac{1}{\sigma^2}\left(x[0] - A\right)$
and the negative of the second derivative becomes
$-\frac{\partial^2 \ln p(x[0]; A)}{\partial A^2} = \frac{1}{\sigma^2}$
Estimator Accuracy Considerations
The curvature increases as $\sigma^2$ decreases. Since we already know
that the estimator $\hat{A} = x[0]$ has variance $\sigma^2$, then for this example
$\mathrm{var}(\hat{A}) = \frac{1}{-\dfrac{\partial^2 \ln p(x[0]; A)}{\partial A^2}} = \sigma^2$
and the variance decreases as the curvature increases.
A more appropriate measure of curvature is
$-E\left[\frac{\partial^2 \ln p(x[0]; A)}{\partial A^2}\right]$
which measures the average curvature of the log-likelihood
function.
Cramer-Rao Lower Bound
Theorem (Cramer-Rao Lower Bound - Scalar Parameter)
It is assumed that the PDF $p(\mathbf{x}; \theta)$ satisfies the regularity
condition
$E\left[\frac{\partial \ln p(\mathbf{x}; \theta)}{\partial \theta}\right] = 0 \quad \text{for all } \theta$
Then, the variance of any unbiased estimator $\hat{\theta}$ must satisfy
$\mathrm{var}(\hat{\theta}) \geq \frac{1}{-E\left[\dfrac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right]}$
where the derivative is evaluated at the true value of $\theta$ and
the expectation is taken with respect to $p(\mathbf{x}; \theta)$.
Cramer-Rao Lower Bound
An unbiased estimator may be found that attains the bound
for all $\theta$ if and only if
$\frac{\partial \ln p(\mathbf{x}; \theta)}{\partial \theta} = I(\theta)\left(g(\mathbf{x}) - \theta\right)$
for some functions g and I. That estimator, which is the MVU estimator, is
$\hat{\theta} = g(\mathbf{x})$, and the minimum variance is $1/I(\theta)$.
Cramer-Rao Lower Bound
Example 4 DC Level in White Gaussian Noise
$x[n] = A + w[n], \quad n = 0, 1, \ldots, N-1$
where w[n] is WGN with variance $\sigma^2$.
To determine the CRLB for A:
$p(\mathbf{x}; A) = \prod_{n=0}^{N-1} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}\left(x[n] - A\right)^2\right]$
$= \frac{1}{\left(2\pi\sigma^2\right)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2\right]$
Cramer-Rao Lower Bound
Example 4 (cont.)
Taking the first derivative
$\frac{\partial \ln p(\mathbf{x}; A)}{\partial A} = \frac{\partial}{\partial A}\left[-\ln\left(2\pi\sigma^2\right)^{N/2} - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2\right]$
$= \frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\right) = \frac{N}{\sigma^2}\left(\bar{x} - A\right)$
Differentiating again
$\frac{\partial^2 \ln p(\mathbf{x}; A)}{\partial A^2} = -\frac{N}{\sigma^2}$
Cramer-Rao Lower Bound
Example 4 (cont.)
Noting that the second derivative is a constant, we have from the theorem
$\mathrm{var}(\hat{A}) \geq \frac{\sigma^2}{N}$
as the CRLB.
By comparison, we see that the sample mean estimator
attains the bound and must therefore be the MVU estimator.
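The bound can be checked empirically; the sketch below (assumed values A = 1, sigma = 0.5, N = 20) compares the Monte Carlo variance of the sample mean with the CRLB sigma^2/N.

```python
import numpy as np

rng = np.random.default_rng(3)
A, sigma, N, trials = 1.0, 0.5, 20, 100_000   # assumed values for illustration

x = A + sigma * rng.standard_normal((trials, N))  # each row is one realization of x[n]
A_hat = x.mean(axis=1)                            # sample-mean estimator

print("empirical var(A_hat):", A_hat.var())
print("CRLB sigma^2 / N    :", sigma**2 / N)      # sample mean attains the bound (MVU)
```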
Cramer-Rao Lower Bound
We now prove that when the CRLB is attained,
$\mathrm{var}(\hat{\theta}) = \frac{1}{I(\theta)}$
where
$I(\theta) = -E\left[\frac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right]$
From the Cramer-Rao Lower Bound
$\mathrm{var}(\hat{\theta}) \geq \frac{1}{-E\left[\dfrac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right]}$
and
$\frac{\partial \ln p(\mathbf{x}; \theta)}{\partial \theta} = I(\theta)\left(\hat{\theta} - \theta\right)$
Cramer-Rao Lower Bound
Differentiating the latter produces
$\frac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2} = \frac{\partial I(\theta)}{\partial \theta}\left(\hat{\theta} - \theta\right) - I(\theta)$
and taking the negative expected value yields
$-E\left[\frac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right] = -\frac{\partial I(\theta)}{\partial \theta}E\left(\hat{\theta} - \theta\right) + I(\theta) = I(\theta)$
and therefore
$\mathrm{var}(\hat{\theta}) = \frac{1}{I(\theta)}$
In the next example we will see that the CRLB is not always satisfied.
Cramer-Rao Lower Bound

Example 5 Phase Estimator
Assume that we wish to estimate the phase $\phi$ of a sinusoid
embedded in WGN, then
$x[n] = A\cos\left(2\pi f_0 n + \phi\right) + w[n], \quad n = 0, 1, \ldots, N-1$
The amplitude A and frequency $f_0$ are assumed known.
The PDF is
$p(\mathbf{x}; \phi) = \frac{1}{\left(2\pi\sigma^2\right)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\cos\left(2\pi f_0 n + \phi\right)\right)^2\right]$

Cramer-Rao Lower Bound

Example 5 (cont.)
Differentiating the log-likelihood function produces
$\frac{\partial \ln p(\mathbf{x}; \phi)}{\partial \phi} = -\frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n] - A\cos\left(2\pi f_0 n + \phi\right)\right] A\sin\left(2\pi f_0 n + \phi\right)$
$= -\frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n]\sin\left(2\pi f_0 n + \phi\right) - \frac{A}{2}\sin\left(4\pi f_0 n + 2\phi\right)\right]$
and
$\frac{\partial^2 \ln p(\mathbf{x}; \phi)}{\partial \phi^2} = -\frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n]\cos\left(2\pi f_0 n + \phi\right) - A\cos\left(4\pi f_0 n + 2\phi\right)\right]$
Cramer-Rao Lower Bound

Example 5 (cont.)
Upon taking the negative expected value we have
$-E\left[\frac{\partial^2 \ln p(\mathbf{x}; \phi)}{\partial \phi^2}\right] = \frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[A\cos^2\left(2\pi f_0 n + \phi\right) - A\cos\left(4\pi f_0 n + 2\phi\right)\right]$
$= \frac{A^2}{\sigma^2}\sum_{n=0}^{N-1}\left[\frac{1}{2} - \frac{1}{2}\cos\left(4\pi f_0 n + 2\phi\right)\right] \approx \frac{NA^2}{2\sigma^2}$
Cramer-Rao Lower Bound

Example 5 (cont.)
Since
$\frac{1}{N}\sum_{n=0}^{N-1}\cos\left(4\pi f_0 n + 2\phi\right) \approx 0 \quad \text{for } f_0 \text{ not near } 0 \text{ or } \frac{1}{2}$
therefore
$\mathrm{var}(\hat{\phi}) \geq \frac{2\sigma^2}{NA^2}$
Cramer-Rao Lower Bound
In this example the condition for the bound to be attained is not
satisfied. Hence an efficient phase estimator does not exist.
An estimator which is unbiased and attains the CRLB, as the
sample mean estimator in Example 4 does, is said to be
efficient in that it efficiently uses the data.
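For concreteness, the sketch below (assumed values A = 1, f0 = 0.12, sigma = 0.4, N = 64, phi = 0.3) evaluates the exact Fisher information sum for the phase and compares the resulting bound with the approximation 2*sigma^2/(N*A^2).

```python
import numpy as np

A, f0, sigma, N, phi = 1.0, 0.12, 0.4, 64, 0.3   # assumed values for illustration
n = np.arange(N)

# Exact Fisher information: (A^2/sigma^2) * sum of sin^2(2*pi*f0*n + phi)
I_exact = (A**2 / sigma**2) * np.sum(np.sin(2 * np.pi * f0 * n + phi) ** 2)

crlb_exact = 1.0 / I_exact
crlb_approx = 2 * sigma**2 / (N * A**2)   # valid when f0 is not near 0 or 1/2
print("exact CRLB :", crlb_exact)
print("approx CRLB:", crlb_approx)
```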
Transformation of Parameters
In Example 4 we may not be interested in the sign of A but
instead may wish to estimate $A^2$, or the power of the signal.
Knowing the CRLB for A, we can easily obtain it for $A^2$.
If it is desired to estimate $\alpha = g(\theta)$, then the CRLB is
$\mathrm{var}(\hat{\alpha}) \geq \frac{\left(\dfrac{\partial g}{\partial \theta}\right)^2}{-E\left[\dfrac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right]}$
For the present example this becomes $\alpha = g(A) = A^2$ and
$\mathrm{var}(\widehat{A^2}) \geq \frac{\left(2A\right)^2}{N/\sigma^2} = \frac{4A^2\sigma^2}{N}$
Transformation of Parameters
We saw in Example 4 that the sample mean estimator was efficient
for A.
It might be supposed that $\bar{x}^2$ is efficient for $A^2$. To quickly dispel
this notion we first show that $\bar{x}^2$ is not even an unbiased
estimator.
Since $\bar{x} \sim \mathcal{N}(A, \sigma^2/N)$,
$E(\bar{x}^2) = E^2(\bar{x}) + \mathrm{var}(\bar{x}) = A^2 + \frac{\sigma^2}{N} \neq A^2$
Hence, we immediately conclude that the efficiency of an
estimator is destroyed by a nonlinear transformation.
Transformation of Parameters
That it is maintained for linear transformations is easily verified.
Assume that an efficient estimator for $\theta$ exists and is given by $\hat{\theta}$.
It is desired to estimate $g(\theta) = a\theta + b$. We choose
$\widehat{g(\theta)} = a\hat{\theta} + b$
Since
$E\left(a\hat{\theta} + b\right) = aE(\hat{\theta}) + b = a\theta + b = g(\theta)$
the estimator is unbiased. The CRLB for $g(\theta)$ is
$\mathrm{var}\left(\widehat{g(\theta)}\right) \geq \frac{\left(\dfrac{\partial g}{\partial \theta}\right)^2}{I(\theta)} = a^2\,\mathrm{var}(\hat{\theta})$
But $\mathrm{var}\left(\widehat{g(\theta)}\right) = \mathrm{var}\left(a\hat{\theta} + b\right) = a^2\,\mathrm{var}(\hat{\theta})$, so that the CRLB is achieved.
Transformation of Parameters

Although efficiency is preserved only over linear transformations,
it is approximately maintained over nonlinear transformations if
the data record is large enough.
To see why this property holds, we return to the previous
example of estimating $A^2$ by $\bar{x}^2$.
Although $\bar{x}^2$ is biased, we note from
$E(\bar{x}^2) = E^2(\bar{x}) + \mathrm{var}(\bar{x}) = A^2 + \frac{\sigma^2}{N} \to A^2$
that $\bar{x}^2$ is asymptotically unbiased, or unbiased as $N \to \infty$.
Transformation of Parameters

Since $\bar{x} \sim \mathcal{N}(A, \sigma^2/N)$, we can evaluate the variance
$\mathrm{var}(\bar{x}^2) = E(\bar{x}^4) - E^2(\bar{x}^2)$
by using the result that if $\xi \sim \mathcal{N}(\mu, \sigma^2)$, then
$E(\xi^2) = \mu^2 + \sigma^2$
$E(\xi^4) = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4$
Therefore
$\mathrm{var}(\xi^2) = E(\xi^4) - E^2(\xi^2) = 4\mu^2\sigma^2 + 2\sigma^4$
Transformation of Parameters
For our problem we have then
$\mathrm{var}(\bar{x}^2) = \frac{4A^2\sigma^2}{N} + \frac{2\sigma^4}{N^2}$
As $N \to \infty$, the variance approaches $4A^2\sigma^2/N$ (the CRLB), the last term
converging to zero faster than the first term.
Our assertion that $\bar{x}^2$ is an asymptotically efficient estimator of
$A^2$ is thus verified.
This situation occurs due to the statistical linearity of the
transformation, as illustrated in Figure 6.
Transformation of Parameters

Figure 6 Statistical linearity of nonlinear transformations

As N increases, the PDF of $\bar{x}$ becomes more concentrated about
the mean A. Therefore, the values of $\bar{x}$ that are observed lie in a
small interval about $\bar{x} = A$.
Over this small interval the nonlinear transformation is
approximately linear.
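This asymptotic behavior is easy to verify numerically; the sketch below (assumed values A = 1, sigma = 1) compares the Monte Carlo variance of the squared sample mean with the expression 4A^2 sigma^2/N + 2 sigma^4/N^2 and with the CRLB for A^2 at two record lengths.

```python
import numpy as np

rng = np.random.default_rng(4)
A, sigma, trials = 1.0, 1.0, 200_000   # assumed values for illustration

for N in (10, 1000):
    xbar = (A + sigma * rng.standard_normal((trials, N))).mean(axis=1)
    est = xbar**2                                    # estimator of A^2
    theory = 4 * A**2 * sigma**2 / N + 2 * sigma**4 / N**2
    crlb = 4 * A**2 * sigma**2 / N
    print(f"N={N:5d}: var(xbar^2) ~ {est.var():.5f}, theory {theory:.5f}, CRLB {crlb:.5f}")
```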
Minimum Variance Unbiased Estimator for
the Linear Model
If the data observed can be modeled as
$\mathbf{x} = \mathbf{H}\boldsymbol{\theta} + \mathbf{w}$
where $\mathbf{x}$ is an N × 1 vector of observations, $\mathbf{H}$ is a known N × p
observation matrix, $\boldsymbol{\theta}$ is a p × 1 vector of parameters to be
estimated, and $\mathbf{w}$ is an N × 1 noise vector with PDF $\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$,
then the MVU estimator is
$\hat{\boldsymbol{\theta}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}$
and the covariance matrix of $\hat{\boldsymbol{\theta}}$ is
$\mathbf{C}_{\hat{\theta}} = \sigma^2\left(\mathbf{H}^T\mathbf{H}\right)^{-1}$
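As an illustrative sketch (a line-fit example with assumed values: intercept -1, slope 2, sigma = 0.5, N = 100), the MVU estimator for the linear model and its covariance can be computed directly.

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma = 100, 0.5
theta_true = np.array([-1.0, 2.0])        # assumed [intercept, slope]

n = np.arange(N)
H = np.column_stack([np.ones(N), n / N])  # known N x p observation matrix
x = H @ theta_true + sigma * rng.standard_normal(N)

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)   # (H^T H)^{-1} H^T x
C_theta = sigma**2 * np.linalg.inv(H.T @ H)     # covariance of the MVU estimator

print("theta_hat:", theta_hat)
print("std devs :", np.sqrt(np.diag(C_theta)))
```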
Table of Contents
Least Squares (LS) Estimators
Linear LSE
Nonlinear LSE
General Bayesian Estimators
The Bayesian Philosophy
Minimum Mean Square Error (MMSE) Estimators
Maximum A Posteriori (MAP) Estimators
Least Squares
A salient feature of the method is that no probabilistic
assumptions are made about the data, only a signal
model is assumed.
The advantage is its broader range of possible
applications.
On the negative side, no claims about optimality can be
made, and furthermore, the statistical performance
cannot be assessed without some specific assumptions
about the probabilistic structure of the data.
The least squares estimator is widely used in practice
due to its ease of implementation, amounting to the
minimization of a least squares error criterion.
Least Squares
In the LS approach we attempt to minimize the squared difference
between the given data x[n] and the assumed signal or noiseless
data.
Least Squares
The LS error criterion is
$J(\theta) = \sum_{n=0}^{N-1}\left(x[n] - s[n]\right)^2$
The value of $\theta$ that minimizes $J(\theta)$ is the LSE.
Note that no probabilistic assumptions have been made
about the data x[n].
LSEs are usually applied in situations where:
A precise statistical characterization of the data is unknown.
An optimal estimator cannot be found.
An optimal estimator is too complicated to apply in practice.
Linear Least Squares
In applying the linear LS approach for a scalar parameter we must
assume that
$s[n] = \theta h[n]$
where h[n] is a known sequence.
The LS error criterion becomes
$J(\theta) = \sum_{n=0}^{N-1}\left(x[n] - \theta h[n]\right)^2$
A minimization is readily shown to produce the LSE
$\hat{\theta} = \frac{\sum_{n=0}^{N-1} x[n]h[n]}{\sum_{n=0}^{N-1} h^2[n]}$
Linear Least Squares
The minimum LS error is
$J_{\min} = J(\hat{\theta}) = \sum_{n=0}^{N-1}\left(x[n] - \hat{\theta}h[n]\right)\left(x[n] - \hat{\theta}h[n]\right)$
$= \sum_{n=0}^{N-1} x[n]\left(x[n] - \hat{\theta}h[n]\right) - \hat{\theta}\sum_{n=0}^{N-1} h[n]\left(x[n] - \hat{\theta}h[n]\right)$
$= \sum_{n=0}^{N-1} x^2[n] - \hat{\theta}\sum_{n=0}^{N-1} x[n]h[n]$
Linear Least Squares
Alternatively, we can rewrite $J_{\min}$ as
$J_{\min} = \sum_{n=0}^{N-1} x^2[n] - \frac{\left(\sum_{n=0}^{N-1} x[n]h[n]\right)^2}{\sum_{n=0}^{N-1} h^2[n]}$
For the signal $\mathbf{s} = [s[0]\; s[1]\; \cdots\; s[N-1]]^T$ to be linear in the
unknown parameters, using matrix notation,
$\mathbf{s} = \mathbf{H}\boldsymbol{\theta}$
where the matrix $\mathbf{H}$, which is a known N × p matrix (N > p) of
full rank p, is referred to as the observation matrix.
Linear Least Squares
The LSE is found by minimizing
$J(\boldsymbol{\theta}) = \sum_{n=0}^{N-1}\left(x[n] - s[n]\right)^2 = \left(\mathbf{x} - \mathbf{H}\boldsymbol{\theta}\right)^T\left(\mathbf{x} - \mathbf{H}\boldsymbol{\theta}\right)$
Since
$J(\boldsymbol{\theta}) = \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}\boldsymbol{\theta} - \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{x} + \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{H}\boldsymbol{\theta} = \mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{H}\boldsymbol{\theta} + \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{H}\boldsymbol{\theta}$
the gradient is
$\frac{\partial J}{\partial \boldsymbol{\theta}} = -2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\boldsymbol{\theta}$
Linear Least Squares
Setting the gradient equal to zero yields the LSE
$\hat{\boldsymbol{\theta}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}$
The equations $\mathbf{H}^T\mathbf{H}\hat{\boldsymbol{\theta}} = \mathbf{H}^T\mathbf{x}$ to be solved for $\hat{\boldsymbol{\theta}}$ are termed
the normal equations.
The minimum LS error is
$J_{\min} = J(\hat{\boldsymbol{\theta}}) = \left(\mathbf{x} - \mathbf{H}\hat{\boldsymbol{\theta}}\right)^T\left(\mathbf{x} - \mathbf{H}\hat{\boldsymbol{\theta}}\right)$
$= \left(\mathbf{x} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}\right)^T\left(\mathbf{x} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}\right)$
$= \mathbf{x}^T\left(\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\right)\left(\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\right)\mathbf{x}$
$= \mathbf{x}^T\left(\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\right)\mathbf{x}$
Linear Least Squares
The last step results from the fact that $\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T$ is
an idempotent matrix, i.e., it has the property $\mathbf{A}^2 = \mathbf{A}$.
Other forms for $J_{\min}$ are
$J_{\min} = \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x} = \mathbf{x}^T\left(\mathbf{x} - \mathbf{H}\hat{\boldsymbol{\theta}}\right)$
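A short numeric check (a line-fit sketch with assumed data, not from the slides) confirms that the normal-equation solution matches NumPy's least-squares routine and that the alternative forms of J_min agree.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 50
H = np.column_stack([np.ones(N), np.arange(N)])        # assumed N x 2 observation matrix
x = H @ np.array([1.0, 0.1]) + rng.standard_normal(N)  # assumed data for illustration

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)          # normal equations
theta_lstsq, *_ = np.linalg.lstsq(H, x, rcond=None)    # library solution, same answer

r = x - H @ theta_hat                                  # residual
P = H @ np.linalg.inv(H.T @ H) @ H.T                   # projection onto range(H)
print(np.allclose(theta_hat, theta_lstsq))
print(r @ r, x @ (np.eye(N) - P) @ x, x @ (x - H @ theta_hat))  # equal forms of Jmin
```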
Nonlinear Least Squares
Before discussing general methods for determining
nonlinear LSEs we first describe two methods that can
reduce the complexity of the problem.
1. Transformation of parameters.
2. Separability of parameters.
In the first case we seek a one-to-one transformation of $\boldsymbol{\theta}$
that produces a linear signal model in the new space.
To do so we let
$\boldsymbol{\alpha} = g(\boldsymbol{\theta})$
where g is a p-dimensional function of $\boldsymbol{\theta}$ whose inverse exists.
Nonlinear Least Squares
If a g can be found so that
$\mathbf{s}(\boldsymbol{\theta}) = \mathbf{s}\left(g^{-1}(\boldsymbol{\alpha})\right) = \mathbf{H}\boldsymbol{\alpha}$
then the signal model will be linear in $\boldsymbol{\alpha}$.
We can then easily find the linear LSE of $\boldsymbol{\alpha}$ and thus the nonlinear
LSE of $\boldsymbol{\theta}$ by
$\hat{\boldsymbol{\theta}} = g^{-1}\left(\hat{\boldsymbol{\alpha}}\right) = g^{-1}\left(\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}\right)$
This approach relies on the property that the minimization can be
carried out in any transformed space that is obtained by a one-to-
one mapping and then converted back to the original space.
Nonlinear Least Squares
Example Sinusoidal Parameter Estimation
For a sinusoidal signal model
$s[n] = A\cos\left(2\pi f_0 n + \phi\right), \quad n = 0, 1, \ldots, N-1$
it is desired to estimate the amplitude A, where A > 0, and phase $\phi$.
The frequency $f_0$ is assumed known.
The LSE is obtained by minimizing
$J(A, \phi) = \sum_{n=0}^{N-1}\left(x[n] - A\cos\left(2\pi f_0 n + \phi\right)\right)^2$
over A and $\phi$.
Nonlinear Least Squares
Example Sinusoidal Parameter Estimation (cont.)
Because
$A\cos\left(2\pi f_0 n + \phi\right) = A\cos\phi\cos\left(2\pi f_0 n\right) - A\sin\phi\sin\left(2\pi f_0 n\right)$
if we let
$\alpha_1 = A\cos\phi$
$\alpha_2 = -A\sin\phi$
then the signal model becomes
$s[n] = \alpha_1\cos\left(2\pi f_0 n\right) + \alpha_2\sin\left(2\pi f_0 n\right)$
In matrix form this is
$\mathbf{s} = \mathbf{H}\boldsymbol{\alpha}$
Nonlinear Least Squares
Example Sinusoidal Parameter Estimation (cont.)
where
$\mathbf{H} = \begin{bmatrix} 1 & 0 \\ \cos 2\pi f_0 & \sin 2\pi f_0 \\ \vdots & \vdots \\ \cos 2\pi f_0(N-1) & \sin 2\pi f_0(N-1) \end{bmatrix}$
which is now linear in the new parameters.
The LSE of $\boldsymbol{\alpha}$ is
$\hat{\boldsymbol{\alpha}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}$
and to find $\hat{A}$ and $\hat{\phi}$ we must find the inverse transformation $g^{-1}(\boldsymbol{\alpha})$.
Nonlinear Least Squares
Example Sinusoidal Parameter Estimation (cont.)
This is
$A = \sqrt{\alpha_1^2 + \alpha_2^2}$
$\phi = \arctan\left(-\frac{\alpha_2}{\alpha_1}\right)$
so that the nonlinear LSE for this problem is given by
$\hat{A} = \sqrt{\hat{\alpha}_1^2 + \hat{\alpha}_2^2}$
$\hat{\phi} = \arctan\left(-\frac{\hat{\alpha}_2}{\hat{\alpha}_1}\right)$
where $\hat{\boldsymbol{\alpha}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}$.
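A minimal sketch of this procedure (assumed values A = 1.3, phi = 0.7, f0 = 0.08, sigma = 0.4, N = 128), using the reparameterization alpha1 = A cos(phi), alpha2 = -A sin(phi) as above:

```python
import numpy as np

rng = np.random.default_rng(7)
A_true, phi_true, f0, sigma, N = 1.3, 0.7, 0.08, 0.4, 128  # assumed values
n = np.arange(N)
x = A_true * np.cos(2 * np.pi * f0 * n + phi_true) + sigma * rng.standard_normal(N)

H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
a1, a2 = np.linalg.solve(H.T @ H, H.T @ x)      # linear LSE of (alpha1, alpha2)

A_hat = np.hypot(a1, a2)                        # A = sqrt(alpha1^2 + alpha2^2)
phi_hat = np.arctan2(-a2, a1)                   # phi = arctan(-alpha2/alpha1), quadrant-safe
print("A_hat =", A_hat, " phi_hat =", phi_hat)
```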
Nonlinear Least Squares
A second type of nonlinear LS problem that is less
complex than the general one exhibits the separability
property.
Although the signal model is nonlinear, it may be linear
in some of the parameters. For example, in
$J(A, f_0) = \sum_{n=0}^{N-1}\left(x[n] - A\cos 2\pi f_0 n\right)^2$
the model is linear in A but nonlinear in $f_0$.
In general, a separable signal model has the form
$\mathbf{s} = \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}$
where $\mathbf{H}(\boldsymbol{\alpha})$ is an N × q matrix dependent on $\boldsymbol{\alpha}$.
This model is linear in $\boldsymbol{\beta}$ but nonlinear in $\boldsymbol{\alpha}$.
Nonlinear Least Squares
Here the parameter vector is partitioned as
$\boldsymbol{\theta} = \begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{bmatrix}$
where $\boldsymbol{\alpha}$ is $(p-q) \times 1$ and $\boldsymbol{\beta}$ is $q \times 1$.
As a result, the LS error may be minimized with respect to $\boldsymbol{\beta}$ and
thus reduced to a function of $\boldsymbol{\alpha}$ only.
Since
$J(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \left(\mathbf{x} - \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}\right)^T\left(\mathbf{x} - \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}\right)$
the $\boldsymbol{\beta}$ that minimizes J for a given $\boldsymbol{\alpha}$ is
$\hat{\boldsymbol{\beta}} = \left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{x}$
Nonlinear Least Squares
The resulting LS error is
$J(\boldsymbol{\alpha}, \hat{\boldsymbol{\beta}}) = \mathbf{x}^T\left(\mathbf{I} - \mathbf{H}(\boldsymbol{\alpha})\left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\right)\mathbf{x}$
The problem now reduces to a maximization of
$\mathbf{x}^T\mathbf{H}(\boldsymbol{\alpha})\left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{x}$
over $\boldsymbol{\alpha}$.
If, for instance, q = p − 1, so that $\boldsymbol{\alpha}$ is a scalar, then a grid search
can possibly be used.
Nonlinear Least Squares
Example Damped Exponentials
Assume we have a signal model
$s[n] = A_1 r^n + A_2 r^{2n} + A_3 r^{3n}$
where the unknown parameters are {A1, A2, A3, r}. It is
known that 0 < r < 1.
Then, the model is linear in the amplitudes $\boldsymbol{\beta} = [A_1\; A_2\; A_3]^T$
and nonlinear in the damping factor $\alpha = r$.
The nonlinear LSE is obtained by maximizing
$\mathbf{x}^T\mathbf{H}(r)\left(\mathbf{H}^T(r)\mathbf{H}(r)\right)^{-1}\mathbf{H}^T(r)\mathbf{x}$
over 0 < r < 1.
Nonlinear Least Squares
Example Damped Exponentials (cont.)
where
$\mathbf{H}(r) = \begin{bmatrix} 1 & 1 & 1 \\ r & r^2 & r^3 \\ \vdots & \vdots & \vdots \\ r^{N-1} & r^{2(N-1)} & r^{3(N-1)} \end{bmatrix}$
Once $\hat{r}$ is found, we have the LSE for the amplitudes
$\hat{\boldsymbol{\beta}} = \left(\mathbf{H}^T(\hat{r})\mathbf{H}(\hat{r})\right)^{-1}\mathbf{H}^T(\hat{r})\mathbf{x}$
This maximization is easily carried out on a digital computer.
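For instance, a grid-search sketch of this separable LS procedure (assumed values A1 = 1, A2 = 0.5, A3 = -0.2, r = 0.8, sigma = 0.05, N = 50, none of which are specified in the slides):

```python
import numpy as np

rng = np.random.default_rng(8)
N, sigma = 50, 0.05
beta_true, r_true = np.array([1.0, 0.5, -0.2]), 0.8   # assumed values for illustration
n = np.arange(N)

def Hmat(r):
    # N x 3 matrix with columns r^n, r^(2n), r^(3n)
    return np.column_stack([r**n, r**(2 * n), r**(3 * n)])

x = Hmat(r_true) @ beta_true + sigma * rng.standard_normal(N)

# Grid search: maximize x^T H(r) (H(r)^T H(r))^{-1} H(r)^T x over 0 < r < 1
grid = np.linspace(0.01, 0.99, 981)
scores = []
for r in grid:
    H = Hmat(r)
    beta, *_ = np.linalg.lstsq(H, x, rcond=None)
    scores.append(x @ (H @ beta))
r_hat = grid[int(np.argmax(scores))]

H = Hmat(r_hat)
beta_hat, *_ = np.linalg.lstsq(H, x, rcond=None)      # amplitudes at the best r
print("r_hat =", r_hat, " beta_hat =", beta_hat)
```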
The Bayesian Philosophy
We now depart from the classical approach to statistical
estimation, in which the parameter of interest is assumed to be a
deterministic but unknown constant.
Instead, we assume that $\theta$ is a random variable whose
particular realization we must estimate. This is the Bayesian
approach, so named because its implementation is based
directly on Bayes' theorem.
The motivation for doing so is twofold.
First, if we have available some prior knowledge about $\theta$,
we can incorporate it into our estimator.
Second, Bayesian estimation is useful in situations where an
MVU estimator cannot be found.
Prior Knowledge and Estimation
It is a fundamental rule of estimation theory that the use of prior
knowledge will lead to a more accurate estimator.
For example, if a parameter is constrained to lie in a known
interval, then any good estimator should produce only estimates
within that interval.
In the earlier example, it was shown that the MVU estimator of A is
the sample mean $\bar{x}$. However, this assumed that A could take on
any value in the interval $-\infty < A < \infty$.
Due to physical constraints it may be more reasonable to assume
that A can take on only values in the finite interval $-A_0 \leq A \leq A_0$.
To retain $\hat{A} = \bar{x}$ as the best estimator would be undesirable, since
$\bar{x}$ may yield values outside the known interval.
Prior Knowledge and Estimation
As shown in figure (a), this is due to noise effects.
Certainly, we would expect to improve our estimation if we used
the truncated sample mean estimator
$\check{A} = \begin{cases} -A_0 & \bar{x} < -A_0 \\ \bar{x} & -A_0 \leq \bar{x} \leq A_0 \\ A_0 & \bar{x} > A_0 \end{cases}$
Prior Knowledge and Estimation
Such an estimator would have the PDF
$p_{\check{A}}(\xi) = \Pr\{\bar{x} < -A_0\}\,\delta(\xi + A_0) + p_{\bar{x}}(\xi; A)\left[u(\xi + A_0) - u(\xi - A_0)\right] + \Pr\{\bar{x} > A_0\}\,\delta(\xi - A_0)$
It is seen that $\check{A}$ is a biased estimator. However, if we compare
the MSE of the two estimators, we note that for any A in the
interval $-A_0 \leq A \leq A_0$
$\mathrm{mse}(\bar{x}) = \int_{-\infty}^{\infty}(\xi - A)^2 p_{\bar{x}}(\xi; A)\, d\xi$
$= \int_{-\infty}^{-A_0}(\xi - A)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{-A_0}^{A_0}(\xi - A)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{A_0}^{\infty}(\xi - A)^2 p_{\bar{x}}(\xi; A)\, d\xi$
$\geq \int_{-\infty}^{-A_0}(-A_0 - A)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{-A_0}^{A_0}(\xi - A)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{A_0}^{\infty}(A_0 - A)^2 p_{\bar{x}}(\xi; A)\, d\xi$
$= \mathrm{mse}(\check{A})$
Prior Knowledge and Estimation
Hence, $\check{A}$, the truncated sample mean estimator, is better than the
sample mean estimator in terms of MSE.
Although $\bar{x}$ is still the MVU estimator, we have been able to
reduce the mean square error by allowing the estimator to be
biased.
Knowing that A must lie in a known interval, we suppose that the
true value of A has been chosen from that interval. We then model
the process of choosing a value as a random event to which a PDF
can be assigned.
With knowledge only of the interval and no inclination as to
whether A should be nearer any particular value, it makes sense to
assign a U[-A0, A0] PDF to the random variable A.
Prior Knowledge and Estimation
The overall data model then appears as in the following figure.

As shown there, the act of choosing A according to the given


PDF represents the departure of the Bayesian approach from
the classical approach.
The problem, as always, is to estimate the value of A, or the
realization of the random variable, but now we can incorporate our
knowledge of how A was chosen.
Prior Knowledge and Estimation
For example, we might attempt to find an estimator that would
minimize the Bayesian MSE defined as
$\mathrm{Bmse}(\hat{A}) = E\left[(A - \hat{A})^2\right]$
We choose to define the error as $A - \hat{A}$, in contrast to the classical
estimation error of $\hat{A} - A$.
We emphasize that since A is a random variable, the
expectation operator is with respect to the joint PDF p(x, A).
This is a fundamentally different MSE than in the classical case.
We distinguish it by using the Bmse notation.
Prior Knowledge and Estimation
To appreciate the difference compare the classical MSE
$\mathrm{mse}(\hat{A}) = \int (\hat{A} - A)^2 p(\mathbf{x}; A)\, d\mathbf{x}$
to the Bayesian MSE
$\mathrm{Bmse}(\hat{A}) = \iint (A - \hat{A})^2 p(\mathbf{x}, A)\, d\mathbf{x}\, dA$
Note that whereas the classical MSE will depend on A, and hence
estimators that attempt to minimize the MSE will usually depend
on A, the Bayesian MSE will not. In effect, we have integrated
the parameter dependence away!
Prior Knowledge and Estimation
To complete our example we now derive the estimator that
minimizes the Bayesian MSE. First, we use Bayes' theorem to
write
$p(\mathbf{x}, A) = p(A\,|\,\mathbf{x})\, p(\mathbf{x})$
so that
$\mathrm{Bmse}(\hat{A}) = \int\left[\int (A - \hat{A})^2 p(A\,|\,\mathbf{x})\, dA\right] p(\mathbf{x})\, d\mathbf{x}$
Now since $p(\mathbf{x}) \geq 0$ for all $\mathbf{x}$, if the integral in brackets can be
minimized for each $\mathbf{x}$, then the Bayesian MSE will be minimized.
Prior Knowledge and Estimation
Hence, fixing $\mathbf{x}$ so that $\hat{A}$ is a scalar variable, we have
$\frac{\partial}{\partial \hat{A}}\int (A - \hat{A})^2 p(A\,|\,\mathbf{x})\, dA = \int \frac{\partial}{\partial \hat{A}}(A - \hat{A})^2 p(A\,|\,\mathbf{x})\, dA$
$= -2\int A\, p(A\,|\,\mathbf{x})\, dA + 2\hat{A}\int p(A\,|\,\mathbf{x})\, dA$
which when set equal to zero results in
$\hat{A} = \int A\, p(A\,|\,\mathbf{x})\, dA$
or finally
$\hat{A} = E(A\,|\,\mathbf{x})$
Prior Knowledge and Estimation
It is seen that the optimal estimator in terms of minimizing the
Bayesian MSE is the mean of the posterior PDF p(A|x).
The posterior PDF refers to the PDF of A after the data have been
observed. In contrast, p(A), or
$p(A) = \int p(\mathbf{x}, A)\, d\mathbf{x}$
may be thought of as the prior PDF of A, indicating the PDF
before the data are observed.
We will henceforth term the estimator that minimizes the
Bayesian MSE the minimum mean square error (MMSE)
estimator.
Prior Knowledge and Estimation
In determining the MMSE estimator we first require the posterior
PDF. We can use Bayes' rule to determine it as
$p(A\,|\,\mathbf{x}) = \frac{p(\mathbf{x}\,|\,A)\,p(A)}{p(\mathbf{x})} = \frac{p(\mathbf{x}\,|\,A)\,p(A)}{\int p(\mathbf{x}\,|\,A)\,p(A)\, dA}$
Note that the denominator is just a normalizing factor,
independent of A, needed to ensure that p(A|x) integrates to 1.
If we continue our example, we recall that the prior PDF p(A) is
$\mathcal{U}[-A_0, A_0]$. To specify the conditional PDF p(x|A) we need to
further assume that the choice of A via p(A) does not affect the
PDF of the noise samples, or that w[n] is independent of A.
Prior Knowledge and Estimation
Then, for n = 0, 1, ..., N − 1,
$p_x\left(x[n]\,|\,A\right) = p_w\left(x[n] - A\,|\,A\right) = p_w\left(x[n] - A\right)$
$= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2\sigma^2}\left(x[n] - A\right)^2\right]$
and therefore
$p(\mathbf{x}\,|\,A) = \frac{1}{\left(2\pi\sigma^2\right)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2\right]$
It is apparent that this PDF is identical in form to the usual
classical PDF p(x; A).
Prior Knowledge and Estimation
The posterior PDF becomes
$p(A\,|\,\mathbf{x}) = \dfrac{\frac{1}{2A_0}\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-A\right)^2\right]}{\int_{-A_0}^{A_0}\frac{1}{2A_0}\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-A\right)^2\right]dA}$ for $-A_0 \leq A \leq A_0$
and $p(A\,|\,\mathbf{x}) = 0$ otherwise.
But
$\sum_{n=0}^{N-1}\left(x[n] - A\right)^2 = \sum_{n=0}^{N-1}x^2[n] - 2NA\bar{x} + NA^2$
$= N\left(A - \bar{x}\right)^2 + \sum_{n=0}^{N-1}x^2[n] - N\bar{x}^2$
Prior Knowledge and Estimation
So that we have
$p(A\,|\,\mathbf{x}) = \begin{cases} \dfrac{1}{c}\,\dfrac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\dfrac{1}{2\frac{\sigma^2}{N}}\left(A - \bar{x}\right)^2\right] & -A_0 \leq A \leq A_0 \\ 0 & |A| > A_0 \end{cases}$
The factor c is determined by the requirement that p(A|x) integrate
to 1, resulting in
$c = \int_{-A_0}^{A_0}\frac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\frac{1}{2\frac{\sigma^2}{N}}\left(A - \bar{x}\right)^2\right]dA$
Prior Knowledge and Estimation
The PDF is seen to be a truncated Gaussian, as shown in the figure.
The MMSE estimator, which is the mean of p(A|x), is
$\hat{A} = E(A\,|\,\mathbf{x}) = \dfrac{\int_{-A_0}^{A_0} A\,\frac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\frac{1}{2\frac{\sigma^2}{N}}\left(A - \bar{x}\right)^2\right]dA}{\int_{-A_0}^{A_0}\frac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\frac{1}{2\frac{\sigma^2}{N}}\left(A - \bar{x}\right)^2\right]dA}$
Prior Knowledge and Estimation
Although this cannot be evaluated in closed form, we note that $\hat{A}$
will be a function of $\bar{x}$ as well as of $A_0$ and $\sigma^2$.
The MMSE estimator will not be $\bar{x}$ due to the truncation shown
in figure (b), unless $A_0$ is so large that there is effectively no
truncation. This will occur if $A_0 \gg \sqrt{\sigma^2/N}$.

The effect of the data is to position the posterior mean between
A = 0 and $A = \bar{x}$ in a compromise between the prior knowledge and
that contributed by the data.
To further appreciate this weighting consider what happens as N
becomes large so that the data knowledge becomes more
important.
Prior Knowledge and Estimation
As shown in Figure 4, as N increases, the posterior PDF
becomes more concentrated about $\bar{x}$ (since $\sigma^2/N$ decreases).
Hence, it becomes nearly Gaussian, and its mean becomes just $\bar{x}$.


The MMSE estimator relies less and less on the prior knowledge
and more on the data. It is said that the data swamps out the
prior knowledge.
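A numerical sketch of this estimator (assumed values A0 = 1, sigma = 1, and an observed sample mean of 1.2, none of which are from the slides) evaluates the posterior-mean integral by quadrature and shows how it moves toward the sample mean as N grows while always remaining inside the prior interval.

```python
import numpy as np

A0, sigma, xbar = 1.0, 1.0, 1.2       # assumed values; note xbar lies outside [-A0, A0]
A = np.linspace(-A0, A0, 20001)       # integration grid over the prior's support

for N in (1, 10, 100):
    w = np.exp(-(A - xbar) ** 2 / (2 * sigma**2 / N))   # unnormalized truncated Gaussian
    A_mmse = np.trapz(A * w, A) / np.trapz(w, A)        # posterior mean E[A | x]
    print(f"N={N:3d}: A_mmse = {A_mmse:.4f}  (xbar = {xbar}, always inside [-A0, A0])")
```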
Prior Knowledge and Estimation
Theorem Conditional PDF of Multivariate Gaussian
If x and y are jointly Gaussian, where x is k × 1 and y is l × 1, with mean
vector $[E(\mathbf{x})^T\; E(\mathbf{y})^T]^T$ and partitioned covariance matrix
$\mathbf{C} = \begin{bmatrix} \mathbf{C}_{xx} & \mathbf{C}_{xy} \\ \mathbf{C}_{yx} & \mathbf{C}_{yy} \end{bmatrix}$
where $\mathbf{C}_{xx}$ is k × k, $\mathbf{C}_{xy}$ is k × l, $\mathbf{C}_{yx}$ is l × k, and $\mathbf{C}_{yy}$ is l × l,
so that
$p(\mathbf{x}, \mathbf{y}) = \frac{1}{\left(2\pi\right)^{\frac{k+l}{2}}\det^{\frac{1}{2}}(\mathbf{C})}\exp\left\{-\frac{1}{2}\begin{bmatrix}\mathbf{x} - E(\mathbf{x}) \\ \mathbf{y} - E(\mathbf{y})\end{bmatrix}^T\mathbf{C}^{-1}\begin{bmatrix}\mathbf{x} - E(\mathbf{x}) \\ \mathbf{y} - E(\mathbf{y})\end{bmatrix}\right\}$
then the conditional PDF p(y|x) is also Gaussian and
$E(\mathbf{y}\,|\,\mathbf{x}) = E(\mathbf{y}) + \mathbf{C}_{yx}\mathbf{C}_{xx}^{-1}\left(\mathbf{x} - E(\mathbf{x})\right)$
$\mathbf{C}_{y|x} = \mathbf{C}_{yy} - \mathbf{C}_{yx}\mathbf{C}_{xx}^{-1}\mathbf{C}_{xy}$
Bayesian Linear Model
Theorem Posterior PDF for the Bayesian General
Linear Model
If the observed data x can be modeled as
$\mathbf{x} = \mathbf{H}\boldsymbol{\theta} + \mathbf{w}$
where x is an N × 1 data vector, H is a known N × p matrix, $\boldsymbol{\theta}$ is a p × 1
random vector with prior PDF $\mathcal{N}(\boldsymbol{\mu}_\theta, \mathbf{C}_\theta)$, and w is an N × 1 noise vector
with PDF $\mathcal{N}(\mathbf{0}, \mathbf{C}_w)$ and independent of $\boldsymbol{\theta}$,
then the posterior PDF $p(\boldsymbol{\theta}\,|\,\mathbf{x})$ is Gaussian with mean
$E(\boldsymbol{\theta}\,|\,\mathbf{x}) = \boldsymbol{\mu}_\theta + \mathbf{C}_\theta\mathbf{H}^T\left(\mathbf{H}\mathbf{C}_\theta\mathbf{H}^T + \mathbf{C}_w\right)^{-1}\left(\mathbf{x} - \mathbf{H}\boldsymbol{\mu}_\theta\right)$
and covariance
$\mathbf{C}_{\theta|x} = \mathbf{C}_\theta - \mathbf{C}_\theta\mathbf{H}^T\left(\mathbf{H}\mathbf{C}_\theta\mathbf{H}^T + \mathbf{C}_w\right)^{-1}\mathbf{H}\mathbf{C}_\theta$
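A compact sketch of these formulas for an assumed special case (a scalar DC level: H a column of ones, C_theta = sigma_A^2, C_w = sigma^2 I; all numeric values are assumptions), showing the posterior mean shrinking the data toward the prior mean.

```python
import numpy as np

rng = np.random.default_rng(9)
N, sigma, sigma_A, mu_A = 20, 1.0, 0.5, 0.0   # assumed values for illustration

H = np.ones((N, 1))                 # DC-level model: x = H*A + w
C_theta = np.array([[sigma_A**2]])  # prior covariance of A
C_w = sigma**2 * np.eye(N)          # noise covariance
A_true = rng.normal(mu_A, sigma_A)
x = (H * A_true).ravel() + sigma * rng.standard_normal(N)

S = H @ C_theta @ H.T + C_w
K = C_theta @ H.T @ np.linalg.inv(S)            # gain C_theta H^T (H C_theta H^T + C_w)^{-1}
post_mean = mu_A + K @ (x - H.ravel() * mu_A)
post_cov = C_theta - K @ H @ C_theta

print("posterior mean:", post_mean.item(), " sample mean:", x.mean())
print("posterior var :", post_cov.item())
```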
Risk Function
Previously, we derived the MMSE estimator by minimizing
$E[(\theta - \hat{\theta})^2]$, where the expectation is with respect to the PDF
$p(\mathbf{x}, \theta)$.
If we let $\epsilon = \theta - \hat{\theta}$ denote the error of the estimator for a
particular realization of $\mathbf{x}$ and $\theta$, and also let $C(\epsilon) = \epsilon^2$, then the
MSE criterion minimizes $E[C(\epsilon)]$.
The deterministic function $C(\epsilon)$ is termed the cost function. It is
noted that large errors are particularly costly.
Also, the average cost, or $E[C(\epsilon)]$, is termed the Bayes risk $\mathcal{R}$, or
$\mathcal{R} = E[C(\epsilon)]$
and measures the performance of a given estimator.
Risk Function
Examples of cost functions.
Risk Function
The Bayes risk $\mathcal{R}$ is
$\mathcal{R} = E[C(\epsilon)] = \iint C(\epsilon)\, p(\mathbf{x}, \theta)\, d\mathbf{x}\, d\theta = \int\left[\int C(\epsilon)\, p(\theta\,|\,\mathbf{x})\, d\theta\right] p(\mathbf{x})\, d\mathbf{x}$
Maximum A Posteriori Estimators
In the MAP estimation approach we choose $\hat{\theta}$ to maximize the
posterior PDF, or
$\hat{\theta} = \arg\max_{\theta}\, p(\theta\,|\,\mathbf{x})$
In finding the maximum of $p(\theta\,|\,\mathbf{x})$ we observe that
$p(\theta\,|\,\mathbf{x}) = \frac{p(\mathbf{x}\,|\,\theta)\,p(\theta)}{p(\mathbf{x})}$
so an equivalent maximization is that of $p(\mathbf{x}\,|\,\theta)\,p(\theta)$.
Hence, the MAP estimator is
$\hat{\theta} = \arg\max_{\theta}\, p(\mathbf{x}\,|\,\theta)\,p(\theta)$
or, equivalently,
$\hat{\theta} = \arg\max_{\theta}\,\left[\ln p(\mathbf{x}\,|\,\theta) + \ln p(\theta)\right]$
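As a closing sketch (an assumed case, not from the slides: a DC level A in WGN with a Gaussian prior N(mu_A, sigma_A^2) and assumed numeric values), the MAP estimate can be found by maximizing ln p(x|A) + ln p(A) over a grid; for this Gaussian case it agrees with the well-known closed-form weighted average of the sample mean and the prior mean.

```python
import numpy as np

rng = np.random.default_rng(10)
N, sigma, mu_A, sigma_A = 25, 1.0, 0.0, 0.3   # assumed values for illustration
A_true = 0.4
x = A_true + sigma * rng.standard_normal(N)
xbar = x.mean()

# Grid maximization of ln p(x|A) + ln p(A) (additive constants dropped)
A_grid = np.linspace(-2, 2, 40001)
log_like = -np.sum((x[None, :] - A_grid[:, None]) ** 2, axis=1) / (2 * sigma**2)
log_prior = -(A_grid - mu_A) ** 2 / (2 * sigma_A**2)
A_map = A_grid[np.argmax(log_like + log_prior)]

# Closed form for the Gaussian-prior case: weighted average of xbar and mu_A
w = sigma_A**2 / (sigma_A**2 + sigma**2 / N)
print("grid MAP   :", A_map)
print("closed form:", w * xbar + (1 - w) * mu_A)
```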
