
Wireless Information Transmission System Lab.

Institute of Communications Engineering National Sun Yat-sen University


Table of Contents
Estimation Theory
Assessing Estimator Performance
Minimum Variance Unbiased (MVU) Estimation
Cramer-Rao Lower Bound
Assessing Estimator Performance
Consider the data set shown in Figure 1, in which x[n] consists of
a DC level A in noise.

---Figure 1

We could model the data as
$x[n] = A + w[n]$
where w[n] denotes some zero-mean noise process.
Assessing Estimator Performance
Based on the data set {x[0], x[1], ..., x[N-1]}, we would like to
estimate A.
It would be reasonable to estimate A as
$\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$
that is, by the sample mean of the data.


Several questions come to mind:
How close will $\hat{A}$ be to A?
Are there better estimators than the sample mean?
Assessing Estimator Performance
For the data set in Figure 1, it turns out that $\hat{A} = 0.9$,
which is close to the true value of A = 1.
Another estimator might be
$\check{A} = x[0]$
For the data set in Figure 1, $\check{A} = 0.95$, which is closer to
the true value of A than the sample mean estimate.
Can we conclude that $\check{A}$ is a better estimator than $\hat{A}$?
Because an estimator is a function of the data, which are random
variables, it too is a random variable, subject to many
possible outcomes.
Assessing Estimator Performance
Suppose we repeat the experiment by fixing A = 1 and adding
a different noise realization each time.
We determine the values of the two estimators for each data
set.
For 100 realizations the histograms are shown in Figures 2 and
3.

---Figure 2
Assessing Estimator Performance

---Figure 3
It should be evident that $\hat{A}$ is a better estimator than $\check{A}$ because the values
obtained are more concentrated about the true value of A = 1.
$\hat{A}$ will usually produce a value closer to the true one than $\check{A}$.
Assessing Estimator Performance

To prove that $\hat{A}$ is better we could establish that its
variance is less.
The modeling assumptions that we must employ are that
the w[n]'s, in addition to being zero mean, are
uncorrelated and have equal variance $\sigma^2$.
We first show that the mean of each estimator is the true
value, or
$E(\hat{A}) = E\left[\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right] = \frac{1}{N}\sum_{n=0}^{N-1} E(x[n]) = A$
Assessing Estimator Performance



$E(\check{A}) = E(x[0]) = A$
Second, the variances are
$\mathrm{var}(\hat{A}) = \mathrm{var}\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right) = \frac{1}{N^2}\sum_{n=0}^{N-1} \mathrm{var}(x[n]) = \frac{1}{N^2}\,N\sigma^2 = \frac{\sigma^2}{N}$
since the w[n]'s are uncorrelated, and thus
$\mathrm{var}(\check{A}) = \mathrm{var}(x[0]) = \sigma^2 > \mathrm{var}(\hat{A})$
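As a quick sanity check, the histogram experiment above can be reproduced numerically. The sketch below (a minimal example with assumed values A = 1, sigma = 0.3, N = 50, and 100 realizations, none of which are specified in the slides) estimates the empirical variance of both estimators and shows roughly the factor-of-N difference.

```python
import numpy as np

rng = np.random.default_rng(0)
A, sigma, N, trials = 1.0, 0.3, 50, 100   # assumed values, not from the slides

A_hat = np.empty(trials)    # sample-mean estimator
A_check = np.empty(trials)  # single-sample estimator x[0]
for t in range(trials):
    x = A + sigma * rng.standard_normal(N)  # x[n] = A + w[n]
    A_hat[t] = x.mean()
    A_check[t] = x[0]

print("var(A_hat)  ~", A_hat.var(),   " (theory sigma^2/N =", sigma**2 / N, ")")
print("var(A_check)~", A_check.var(), " (theory sigma^2   =", sigma**2, ")")
```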
Table of Contents
Minimum Variance Unbiased (MVU) Estimation
Unbiased Estimators
Minimum Variance Criterion
Unbiased Estimators
For an estimator to be unbiased we mean that on the average
the estimator will yield the true value of the unknown parameter.
Mathematically, an estimator is unbiased if
$E(\hat{\theta}) = \theta, \quad a < \theta < b$
where (a, b) denotes the range of possible values of $\theta$.
Unbiased estimators tend to have symmetric PDFs centered
about the true value of $\theta$.
For Example 1 the PDF of $\hat{A}$ is shown in Figure 4 and is easily
shown to be $\mathcal{N}(A, \sigma^2/N)$.
Unbiased Estimators

The restriction that $E(\hat{\theta}) = \theta$ for all $\theta$ is an important one.
Letting $\hat{\theta} = g(\mathbf{x})$, where $\mathbf{x} = [x[0], x[1], \ldots, x[N-1]]^T$, it asserts that
$E(\hat{\theta}) = \int g(\mathbf{x})\, p(\mathbf{x};\theta)\, d\mathbf{x} = \theta$ for all $\theta$.

Figure 4 Probability density function for the sample mean estimator


Unbiased Estimators
Example 1 Unbiased Estimator for DC Level in WGN
Consider the observations
$x[n] = A + w[n], \quad n = 0, 1, \ldots, N-1$
where $-\infty < A < \infty$. Then a reasonable estimator for the
average value of x[n] is
$\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$
Due to the linearity properties of the expectation operator
$E(\hat{A}) = E\left[\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right] = \frac{1}{N}\sum_{n=0}^{N-1} E(x[n]) = \frac{1}{N}\sum_{n=0}^{N-1} A = A$
for all A.
Unbiased Estimators
Example 2 Biased Estimator for DC Level in White
Noise
Consider again Example 1 but with the modified sample mean
estimator
$\check{A} = \frac{1}{2N}\sum_{n=0}^{N-1} x[n]$
Then,
$E(\check{A}) = \frac{1}{2}A \neq A$ if $A \neq 0$, and $= A$ if $A = 0$
That an estimator is unbiased does not necessarily mean
that it is a good estimator. It only guarantees that on the
average it will attain the true value.
A persistent bias will always result in a poor estimator.
Unbiased Estimators
Combining estimators problem
It sometimes occurs that multiple estimates of the same
parameter are available, i.e., $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_n$.
A reasonable procedure is to combine these estimates into a
better one by averaging them to form
$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} \hat{\theta}_i$
Assuming the estimators are unbiased, with the same
variance, and uncorrelated with each other,
$E(\hat{\theta}) = \theta$
$\mathrm{var}(\hat{\theta}) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{var}(\hat{\theta}_i) = \frac{\mathrm{var}(\hat{\theta}_1)}{n}$
Unbiased Estimators
Combining estimators problem (cont.)
So that as more estimates are averaged, the variance will
decrease.
However, if the estimators are biased, or $E(\hat{\theta}_i) = \theta + b(\theta)$,
then
$E(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} E(\hat{\theta}_i) = \theta + b(\theta)$
where $b(\theta) = E(\hat{\theta}) - \theta$ is defined as the bias of the estimator,
and no matter how many estimators are averaged, $\hat{\theta}$ will
not converge to the true value.
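A brief numeric illustration (assumed setup, not from the slides: n independent unbiased estimates with variance 1, versus the same estimates shifted by a constant bias b = 0.2) shows the variance shrinking as 1/n while the bias is untouched by averaging.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, b, var1, trials = 3.0, 0.2, 1.0, 5000   # assumed values for illustration

for n in (1, 10, 100):
    # n unbiased estimates per trial, then the same estimates shifted by a bias b
    est = theta + np.sqrt(var1) * rng.standard_normal((trials, n))
    avg_unbiased = est.mean(axis=1)
    avg_biased = (est + b).mean(axis=1)
    print(f"n={n:3d}  var of average ~ {avg_unbiased.var():.4f} (theory {var1/n:.4f}),"
          f"  mean of biased average ~ {avg_biased.mean():.3f} (true theta = {theta})")
```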
Minimum Variance Criterion
Mean square error (MSE)
$\mathrm{mse}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$
Unfortunately, adoption of this natural criterion leads to
unrealizable estimators, ones that cannot be written solely as a
function of the data.
To understand the problem, we first rewrite the MSE as
$\mathrm{mse}(\hat{\theta}) = E\left\{\left[\left(\hat{\theta} - E(\hat{\theta})\right) + \left(E(\hat{\theta}) - \theta\right)\right]^2\right\}$
$= \mathrm{var}(\hat{\theta}) + \left[E(\hat{\theta}) - \theta\right]^2 = \mathrm{var}(\hat{\theta}) + b^2(\theta)$
Minimum Variance Criterion
The equation shows that the MSE is composed of errors due to
the variance of the estimator as well as the bias.
As an example, for the problem in Example 1 consider the
modified estimator
$\check{A} = a\,\frac{1}{N}\sum_{n=0}^{N-1} x[n]$
We will attempt to find the a which results in the minimum
MSE.
Since $E(\check{A}) = aA$ and $\mathrm{var}(\check{A}) = a^2\sigma^2/N$, we have
$\mathrm{mse}(\check{A}) = \mathrm{var}(\check{A}) + b^2(A) = \frac{a^2\sigma^2}{N} + (a-1)^2 A^2$
Minimum Variance Criterion
Differentiating the MSE with respect to a yields
$\frac{d\,\mathrm{mse}(\check{A})}{da} = \frac{2a\sigma^2}{N} + 2(a-1)A^2$
which upon setting to zero and solving yields the optimum
value
$a_{\mathrm{opt}} = \frac{A^2}{A^2 + \sigma^2/N}$
It is seen that the optimal value of a depends upon the
unknown parameter A. The estimator is therefore not
realizable.
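The dependence of the optimal scaling on A can be seen numerically; the sketch below (assumed values A = 1, sigma = 1, N = 10) minimizes the MSE expression over a on a grid and compares the result with the closed-form a_opt.

```python
import numpy as np

A, sigma, N = 1.0, 1.0, 10          # assumed values for illustration
a = np.linspace(0.0, 2.0, 20001)    # grid over the scaling factor
mse = a**2 * sigma**2 / N + (a - 1)**2 * A**2

a_opt_closed = A**2 / (A**2 + sigma**2 / N)
print("grid minimizer  :", a[np.argmin(mse)])
print("closed-form aopt:", a_opt_closed)   # depends on the unknown A -> not realizable
```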
Minimum Variance Criterion
It would seem that any criterion which depends on the
bias will lead to an unrealizable estimator.
Although this is generally true, on occasion a realizable
minimum MSE estimator can be found. From a practical
viewpoint, however, the minimum MSE criterion generally needs to be
abandoned.
An alternative approach is:
Constrain the bias to be zero.
Find the estimator which minimizes the variance.
Such an estimator is termed the minimum variance
unbiased (MVU) estimator.
Minimum Variance Criterion
Possible dependence of estimator variance with $\theta$.
In the former case $\hat{\theta}_3$ is sometimes referred to as the
uniformly minimum variance unbiased estimator.
In general, the MVU estimator does not always exist.
Minimum Variance Criterion
Example 3 Counterexample to Existence of MVU
Estimator
If the form of the PDF changes with $\theta$, then it would be
expected that the best estimator would also change with $\theta$.
Assume that we have two independent observations
$x[0] \sim \mathcal{N}(\theta, 1)$
$x[1] \sim \mathcal{N}(\theta, 1)$ if $\theta \geq 0$
$x[1] \sim \mathcal{N}(\theta, 2)$ if $\theta < 0$
Minimum Variance Criterion
Example (cont.)
The two estimators
$\hat{\theta}_1 = \frac{1}{2}\left(x[0] + x[1]\right)$
$\hat{\theta}_2 = \frac{2}{3}x[0] + \frac{1}{3}x[1]$
can easily be shown to be unbiased. To compute the variances
we have that
$\mathrm{var}(\hat{\theta}_1) = \frac{1}{4}\left[\mathrm{var}(x[0]) + \mathrm{var}(x[1])\right]$
$\mathrm{var}(\hat{\theta}_2) = \frac{4}{9}\mathrm{var}(x[0]) + \frac{1}{9}\mathrm{var}(x[1])$
Minimum Variance Criterion
Example (cont.)
so that
$\mathrm{var}(\hat{\theta}_1) = \frac{18}{36}$ if $\theta \geq 0$, and $\frac{27}{36}$ if $\theta < 0$
$\mathrm{var}(\hat{\theta}_2) = \frac{20}{36}$ if $\theta \geq 0$, and $\frac{24}{36}$ if $\theta < 0$
Clearly, between these two estimators no MVU estimator
exists: $\hat{\theta}_1$ has the smaller variance for $\theta \geq 0$, while $\hat{\theta}_2$ has the
smaller variance for $\theta < 0$.
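A quick simulation (a sketch under the stated model, with an arbitrary true theta of each sign) reproduces the four variance values and shows that neither estimator is uniformly best.

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 200_000

for theta, var1 in ((1.0, 1.0), (-1.0, 2.0)):   # var(x[1]) = 1 if theta >= 0, 2 if theta < 0
    x0 = theta + rng.standard_normal(trials)
    x1 = theta + np.sqrt(var1) * rng.standard_normal(trials)
    t1 = 0.5 * (x0 + x1)                        # theta_hat_1
    t2 = (2.0 / 3.0) * x0 + (1.0 / 3.0) * x1    # theta_hat_2
    print(f"theta={theta:+.0f}:  var(t1) ~ {t1.var():.3f},  var(t2) ~ {t2.var():.3f}")
    # expected: 18/36 vs 20/36 for theta >= 0, and 27/36 vs 24/36 for theta < 0
```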
Table of Contents
Cramer-Rao Lower Bound
Estimator Accuracy Considerations
Cramer-Rao Lower Bound
Transformation of Parameters
Estimator Accuracy Considerations
If a single sample is observed as
$x[0] = A + w[0]$
and it is desired to estimate A, then we expect a better estimate if
$\sigma^2$ is small.
A good unbiased estimator is $\hat{A} = x[0]$. The variance is $\sigma^2$, so
the estimator accuracy improves as $\sigma^2$ decreases.
The PDFs for two different variances are shown in Figure 5.
They are
$p_i(x[0]; A) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left[-\frac{1}{2\sigma_i^2}\left(x[0] - A\right)^2\right]$
for i = 1, 2.
Estimator Accuracy Considerations
The PDF has been plotted versus the unknown parameter A for
a given value of x[0].

Figure 5 PDF dependence on unknown parameter


If $\sigma_1^2 < \sigma_2^2$, then we should be able to estimate A more
accurately based on $p_1(x[0]; A)$.
Estimator Accuracy Considerations
When the PDF is viewed as a function of the unknown parameter
(with x fixed), it is termed the likelihood function.
If we consider the natural logarithm of the PDF
$\ln p(x[0]; A) = -\ln\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}\left(x[0] - A\right)^2$
then the first derivative is
$\frac{\partial \ln p(x[0]; A)}{\partial A} = \frac{1}{\sigma^2}\left(x[0] - A\right)$
and the negative of the second derivative becomes
$-\frac{\partial^2 \ln p(x[0]; A)}{\partial A^2} = \frac{1}{\sigma^2}$
Estimator Accuracy Considerations
The curvature increases as $\sigma^2$ decreases. Since we already know
that the estimator $\hat{A} = x[0]$ has variance $\sigma^2$, then for this example
$\mathrm{var}(\hat{A}) = \frac{1}{-\dfrac{\partial^2 \ln p(x[0]; A)}{\partial A^2}} = \sigma^2$
and the variance decreases as the curvature increases.
A more appropriate measure of curvature is
$-E\left[\frac{\partial^2 \ln p(x[0]; A)}{\partial A^2}\right]$
which measures the average curvature of the log-likelihood
function.
Cramer-Rao Lower Bound
Theorem (Cramer-Rao Lower Bound - Scalar Parameter)
It is assumed that the PDF $p(\mathbf{x}; \theta)$ satisfies the regularity
condition
$E\left[\frac{\partial \ln p(\mathbf{x}; \theta)}{\partial \theta}\right] = 0 \quad \text{for all } \theta$
Then, the variance of any unbiased estimator $\hat{\theta}$ must satisfy
$\mathrm{var}(\hat{\theta}) \geq \frac{1}{-E\left[\dfrac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right]}$
where the derivative is evaluated at the true value of $\theta$ and
the expectation is taken with respect to $p(\mathbf{x}; \theta)$.
Cramer-Rao Lower Bound
An unbiased estimator may be found that attains the bound
for all $\theta$ if and only if
$\frac{\partial \ln p(\mathbf{x}; \theta)}{\partial \theta} = I(\theta)\left(g(\mathbf{x}) - \theta\right)$
for some functions g and I. That estimator, which is the MVU estimator, is
$\hat{\theta} = g(\mathbf{x})$, and the minimum variance is $1/I(\theta)$.
Cramer-Rao Lower Bound
Example 4 DC Level in White Gaussian Noise
$x[n] = A + w[n], \quad n = 0, 1, \ldots, N-1$
where w[n] is WGN with variance $\sigma^2$.
To determine the CRLB for A:
$p(\mathbf{x}; A) = \prod_{n=0}^{N-1} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}\left(x[n] - A\right)^2\right]$
$= \frac{1}{\left(2\pi\sigma^2\right)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2\right]$
Cramer-Rao Lower Bound
Example 4 (cont.)
Taking the first derivative
$\frac{\partial \ln p(\mathbf{x}; A)}{\partial A} = \frac{\partial}{\partial A}\left[-\ln\left(2\pi\sigma^2\right)^{N/2} - \frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2\right]$
$= \frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\right) = \frac{N}{\sigma^2}\left(\bar{x} - A\right)$
Differentiating again
$\frac{\partial^2 \ln p(\mathbf{x}; A)}{\partial A^2} = -\frac{N}{\sigma^2}$
Cramer-Rao Lower Bound
Example 4 (cont.)
Noting that the second derivative is a constant, we have from the theorem
$\mathrm{var}(\hat{A}) \geq \frac{\sigma^2}{N}$
as the CRLB.
By comparison, we see that the sample mean estimator
attains the bound and must therefore be the MVU estimator.
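The bound can be checked empirically; the sketch below (assumed values A = 1, sigma = 0.5, N = 20) compares the Monte Carlo variance of the sample mean with the CRLB sigma^2/N.

```python
import numpy as np

rng = np.random.default_rng(3)
A, sigma, N, trials = 1.0, 0.5, 20, 100_000   # assumed values for illustration

x = A + sigma * rng.standard_normal((trials, N))  # each row is one realization of x[n]
A_hat = x.mean(axis=1)                            # sample-mean estimator

print("empirical var(A_hat):", A_hat.var())
print("CRLB sigma^2 / N    :", sigma**2 / N)      # sample mean attains the bound (MVU)
```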
Cramer-Rao Lower Bound
We now prove that when the CRLB is attained,
$\mathrm{var}(\hat{\theta}) = \frac{1}{I(\theta)}$
where
$I(\theta) = -E\left[\frac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right]$
From the Cramer-Rao Lower Bound
$\mathrm{var}(\hat{\theta}) \geq \frac{1}{-E\left[\dfrac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right]}$
and
$\frac{\partial \ln p(\mathbf{x}; \theta)}{\partial \theta} = I(\theta)\left(\hat{\theta} - \theta\right)$
Cramer-Rao Lower Bound
Differentiating the latter produces
$\frac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2} = \frac{\partial I(\theta)}{\partial \theta}\left(\hat{\theta} - \theta\right) - I(\theta)$
and taking the negative expected value yields
$-E\left[\frac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right] = -\frac{\partial I(\theta)}{\partial \theta}E\left(\hat{\theta} - \theta\right) + I(\theta) = I(\theta)$
and therefore
$\mathrm{var}(\hat{\theta}) = \frac{1}{I(\theta)}$
In the next example we will see that the CRLB is not always satisfied.
Cramer-Rao Lower Bound

Example 5 Phase Estimator
Assume that we wish to estimate the phase $\phi$ of a sinusoid
embedded in WGN, then
$x[n] = A\cos\left(2\pi f_0 n + \phi\right) + w[n], \quad n = 0, 1, \ldots, N-1$
The amplitude A and frequency $f_0$ are assumed known.
The PDF is
$p(\mathbf{x}; \phi) = \frac{1}{\left(2\pi\sigma^2\right)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\cos\left(2\pi f_0 n + \phi\right)\right)^2\right]$

Cramer-Rao Lower Bound

Example 5 (cont.)
Differentiating the log-likelihood function produces
$\frac{\partial \ln p(\mathbf{x}; \phi)}{\partial \phi} = -\frac{1}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n] - A\cos\left(2\pi f_0 n + \phi\right)\right] A\sin\left(2\pi f_0 n + \phi\right)$
$= -\frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n]\sin\left(2\pi f_0 n + \phi\right) - \frac{A}{2}\sin\left(4\pi f_0 n + 2\phi\right)\right]$
and
$\frac{\partial^2 \ln p(\mathbf{x}; \phi)}{\partial \phi^2} = -\frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[x[n]\cos\left(2\pi f_0 n + \phi\right) - A\cos\left(4\pi f_0 n + 2\phi\right)\right]$
Cramer-Rao Lower Bound

Example 5 (cont.)
Upon taking the negative expected value we have
$-E\left[\frac{\partial^2 \ln p(\mathbf{x}; \phi)}{\partial \phi^2}\right] = \frac{A}{\sigma^2}\sum_{n=0}^{N-1}\left[A\cos^2\left(2\pi f_0 n + \phi\right) - A\cos\left(4\pi f_0 n + 2\phi\right)\right]$
$= \frac{A^2}{\sigma^2}\sum_{n=0}^{N-1}\left[\frac{1}{2} - \frac{1}{2}\cos\left(4\pi f_0 n + 2\phi\right)\right] \approx \frac{NA^2}{2\sigma^2}$
Cramer-Rao Lower Bound

Example 5 (cont.)
Since
$\frac{1}{N}\sum_{n=0}^{N-1}\cos\left(4\pi f_0 n + 2\phi\right) \approx 0 \quad \text{for } f_0 \text{ not near } 0 \text{ or } \frac{1}{2}$
therefore
$\mathrm{var}(\hat{\phi}) \geq \frac{2\sigma^2}{NA^2}$
Cramer-Rao Lower Bound
In this example the condition for the bound to be attained is not
satisfied. Hence an efficient phase estimator does not exist.
An estimator which is unbiased and attains the CRLB, as the
sample mean estimator in Example 4 does, is said to be
efficient in that it efficiently uses the data.
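For concreteness, the sketch below (assumed values A = 1, f0 = 0.12, sigma = 0.4, N = 64, phi = 0.3) evaluates the exact Fisher information sum for the phase and compares the resulting bound with the approximation 2*sigma^2/(N*A^2).

```python
import numpy as np

A, f0, sigma, N, phi = 1.0, 0.12, 0.4, 64, 0.3   # assumed values for illustration
n = np.arange(N)

# Exact Fisher information: (A^2/sigma^2) * sum of sin^2(2*pi*f0*n + phi)
I_exact = (A**2 / sigma**2) * np.sum(np.sin(2 * np.pi * f0 * n + phi) ** 2)

crlb_exact = 1.0 / I_exact
crlb_approx = 2 * sigma**2 / (N * A**2)   # valid when f0 is not near 0 or 1/2
print("exact CRLB :", crlb_exact)
print("approx CRLB:", crlb_approx)
```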
Transformation of Parameters
In Example 4 we may not be interested in the sign of A but
instead may wish to estimate $A^2$, or the power of the signal.
Knowing the CRLB for A, we can easily obtain it for $A^2$.
If it is desired to estimate $\alpha = g(\theta)$, then the CRLB is
$\mathrm{var}(\hat{\alpha}) \geq \frac{\left(\dfrac{\partial g}{\partial \theta}\right)^2}{-E\left[\dfrac{\partial^2 \ln p(\mathbf{x}; \theta)}{\partial \theta^2}\right]}$
For the present example this becomes $\alpha = g(A) = A^2$ and
$\mathrm{var}(\widehat{A^2}) \geq \frac{\left(2A\right)^2}{N/\sigma^2} = \frac{4A^2\sigma^2}{N}$
Transformation of Parameters
We saw in Example 4 that the sample mean estimator was efficient
for A.
It might be supposed that $\bar{x}^2$ is efficient for $A^2$. To quickly dispel
this notion we first show that $\bar{x}^2$ is not even an unbiased
estimator.
Since $\bar{x} \sim \mathcal{N}(A, \sigma^2/N)$,
$E(\bar{x}^2) = E^2(\bar{x}) + \mathrm{var}(\bar{x}) = A^2 + \frac{\sigma^2}{N} \neq A^2$
Hence, we immediately conclude that the efficiency of an
estimator is destroyed by a nonlinear transformation.
Transformation of Parameters
That it is maintained for linear transformations is easily verified.
Assume that an efficient estimator for $\theta$ exists and is given by $\hat{\theta}$.
It is desired to estimate $g(\theta) = a\theta + b$. We choose
$\widehat{g(\theta)} = a\hat{\theta} + b$
Since
$E\left(a\hat{\theta} + b\right) = aE(\hat{\theta}) + b = a\theta + b = g(\theta)$
the estimator is unbiased. The CRLB for $g(\theta)$ is
$\mathrm{var}\left(\widehat{g(\theta)}\right) \geq \frac{\left(\dfrac{\partial g}{\partial \theta}\right)^2}{I(\theta)} = a^2\,\mathrm{var}(\hat{\theta})$
But $\mathrm{var}\left(\widehat{g(\theta)}\right) = \mathrm{var}\left(a\hat{\theta} + b\right) = a^2\,\mathrm{var}(\hat{\theta})$, so that the CRLB is achieved.
Transformation of Parameters

Although efficiency is preserved only over linear transformations,
it is approximately maintained over nonlinear transformations if
the data record is large enough.
To see why this property holds, we return to the previous
example of estimating $A^2$ by $\bar{x}^2$.
Although $\bar{x}^2$ is biased, we note from
$E(\bar{x}^2) = E^2(\bar{x}) + \mathrm{var}(\bar{x}) = A^2 + \frac{\sigma^2}{N} \to A^2$
that $\bar{x}^2$ is asymptotically unbiased, or unbiased as $N \to \infty$.
Transformation of Parameters

Since $\bar{x} \sim \mathcal{N}(A, \sigma^2/N)$, we can evaluate the variance
$\mathrm{var}(\bar{x}^2) = E(\bar{x}^4) - E^2(\bar{x}^2)$
by using the result that if $\xi \sim \mathcal{N}(\mu, \sigma^2)$, then
$E(\xi^2) = \mu^2 + \sigma^2$
$E(\xi^4) = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4$
Therefore
$\mathrm{var}(\xi^2) = E(\xi^4) - E^2(\xi^2) = 4\mu^2\sigma^2 + 2\sigma^4$
Transformation of Parameters
For our problem we have then
$\mathrm{var}(\bar{x}^2) = \frac{4A^2\sigma^2}{N} + \frac{2\sigma^4}{N^2}$
As $N \to \infty$, the variance approaches $4A^2\sigma^2/N$ (the CRLB), the last term
converging to zero faster than the first term.
Our assertion that $\bar{x}^2$ is an asymptotically efficient estimator of
$A^2$ is thus verified.
This situation occurs due to the statistical linearity of the
transformation, as illustrated in Figure 6.
Transformation of Parameters

Figure 6 Statistical linearity of nonlinear transformations

As N increases, the PDF of $\bar{x}$ becomes more concentrated about
the mean A. Therefore, the values of $\bar{x}$ that are observed lie in a
small interval about $\bar{x} = A$.
Over this small interval the nonlinear transformation is
approximately linear.
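This asymptotic behavior is easy to verify numerically; the sketch below (assumed values A = 1, sigma = 1) compares the Monte Carlo variance of the squared sample mean with the expression 4A^2 sigma^2/N + 2 sigma^4/N^2 and with the CRLB for A^2 at two record lengths.

```python
import numpy as np

rng = np.random.default_rng(4)
A, sigma, trials = 1.0, 1.0, 200_000   # assumed values for illustration

for N in (10, 1000):
    xbar = (A + sigma * rng.standard_normal((trials, N))).mean(axis=1)
    est = xbar**2                                    # estimator of A^2
    theory = 4 * A**2 * sigma**2 / N + 2 * sigma**4 / N**2
    crlb = 4 * A**2 * sigma**2 / N
    print(f"N={N:5d}: var(xbar^2) ~ {est.var():.5f}, theory {theory:.5f}, CRLB {crlb:.5f}")
```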
Minimum Variance Unbiased Estimator for
the Linear Model
If the data observed can be modeled as
$\mathbf{x} = \mathbf{H}\boldsymbol{\theta} + \mathbf{w}$
where $\mathbf{x}$ is an N × 1 vector of observations, $\mathbf{H}$ is a known N × p
observation matrix, $\boldsymbol{\theta}$ is a p × 1 vector of parameters to be
estimated, and $\mathbf{w}$ is an N × 1 noise vector with PDF $\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$,
then the MVU estimator is
$\hat{\boldsymbol{\theta}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}$
and the covariance matrix of $\hat{\boldsymbol{\theta}}$ is
$\mathbf{C}_{\hat{\theta}} = \sigma^2\left(\mathbf{H}^T\mathbf{H}\right)^{-1}$
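As an illustrative sketch (a line-fit example with assumed values: intercept -1, slope 2, sigma = 0.5, N = 100), the MVU estimator for the linear model and its covariance can be computed directly.

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma = 100, 0.5
theta_true = np.array([-1.0, 2.0])        # assumed [intercept, slope]

n = np.arange(N)
H = np.column_stack([np.ones(N), n / N])  # known N x p observation matrix
x = H @ theta_true + sigma * rng.standard_normal(N)

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)   # (H^T H)^{-1} H^T x
C_theta = sigma**2 * np.linalg.inv(H.T @ H)     # covariance of the MVU estimator

print("theta_hat:", theta_hat)
print("std devs :", np.sqrt(np.diag(C_theta)))
```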
Table of Contents
Least Squares (LS) Estimators
Linear LSE
Nonlinear LSE
General Bayesian Estimators
The Bayesian Philosophy
Minimum Mean Square Error (MMSE) Estimators
Maximum A Posteriori (MAP) Estimators
Least Squares
A salient feature of the method is that no probabilistic
assumptions are made about the data, only a signal
model is assumed.
The advantage is its broader range of possible
applications.
On the negative side, no claims about optimality can be
made, and furthermore, the statistical performance
cannot be assessed without some specific assumptions
about the probabilistic structure of the data.
The least squares estimator is widely used in practice
due to its ease of implementation, amounting to the
minimization of a least squares error criterion.
Least Squares
In the LS approach we attempt to minimize the squared difference
between the given data x[n] and the assumed signal or noiseless
data.
Least Squares
The LS error criterion is
$J(\theta) = \sum_{n=0}^{N-1}\left(x[n] - s[n]\right)^2$
The value of $\theta$ that minimizes $J(\theta)$ is the LSE.
Note that no probabilistic assumptions have been made
about the data x[n].
LSEs are usually applied in situations where:
A precise statistical characterization of the data is unknown.
An optimal estimator cannot be found.
An optimal estimator is too complicated to apply in practice.
Linear Least Squares
In applying the linear LS approach for a scalar parameter we must
assume that
$s[n] = \theta h[n]$
where h[n] is a known sequence.
The LS error criterion becomes
$J(\theta) = \sum_{n=0}^{N-1}\left(x[n] - \theta h[n]\right)^2$
A minimization is readily shown to produce the LSE
$\hat{\theta} = \frac{\sum_{n=0}^{N-1} x[n]h[n]}{\sum_{n=0}^{N-1} h^2[n]}$
Linear Least Squares
The minimum LS error is
$J_{\min} = J(\hat{\theta}) = \sum_{n=0}^{N-1}\left(x[n] - \hat{\theta}h[n]\right)\left(x[n] - \hat{\theta}h[n]\right)$
$= \sum_{n=0}^{N-1} x[n]\left(x[n] - \hat{\theta}h[n]\right) - \hat{\theta}\sum_{n=0}^{N-1} h[n]\left(x[n] - \hat{\theta}h[n]\right)$
$= \sum_{n=0}^{N-1} x^2[n] - \hat{\theta}\sum_{n=0}^{N-1} x[n]h[n]$
Linear Least Squares
Alternatively, we can rewrite $J_{\min}$ as
$J_{\min} = \sum_{n=0}^{N-1} x^2[n] - \frac{\left(\sum_{n=0}^{N-1} x[n]h[n]\right)^2}{\sum_{n=0}^{N-1} h^2[n]}$
For the signal $\mathbf{s} = [s[0]\; s[1]\; \cdots\; s[N-1]]^T$ to be linear in the
unknown parameters, using matrix notation,
$\mathbf{s} = \mathbf{H}\boldsymbol{\theta}$
where the matrix $\mathbf{H}$, which is a known N × p matrix (N > p) of
full rank p, is referred to as the observation matrix.
Linear Least Squares
The LSE is found by minimizing
$J(\boldsymbol{\theta}) = \sum_{n=0}^{N-1}\left(x[n] - s[n]\right)^2 = \left(\mathbf{x} - \mathbf{H}\boldsymbol{\theta}\right)^T\left(\mathbf{x} - \mathbf{H}\boldsymbol{\theta}\right)$
Since
$J(\boldsymbol{\theta}) = \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}\boldsymbol{\theta} - \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{x} + \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{H}\boldsymbol{\theta} = \mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{H}\boldsymbol{\theta} + \boldsymbol{\theta}^T\mathbf{H}^T\mathbf{H}\boldsymbol{\theta}$
the gradient is
$\frac{\partial J}{\partial \boldsymbol{\theta}} = -2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\boldsymbol{\theta}$
Linear Least Squares
Setting the gradient equal to zero yields the LSE
$\hat{\boldsymbol{\theta}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}$
The equations $\mathbf{H}^T\mathbf{H}\hat{\boldsymbol{\theta}} = \mathbf{H}^T\mathbf{x}$ to be solved for $\hat{\boldsymbol{\theta}}$ are termed
the normal equations.
The minimum LS error is
$J_{\min} = J(\hat{\boldsymbol{\theta}}) = \left(\mathbf{x} - \mathbf{H}\hat{\boldsymbol{\theta}}\right)^T\left(\mathbf{x} - \mathbf{H}\hat{\boldsymbol{\theta}}\right)$
$= \left(\mathbf{x} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}\right)^T\left(\mathbf{x} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}\right)$
$= \mathbf{x}^T\left(\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\right)\left(\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\right)\mathbf{x}$
$= \mathbf{x}^T\left(\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\right)\mathbf{x}$
Linear Least Squares
The last step results from the fact that $\mathbf{I} - \mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T$ is
an idempotent matrix, i.e., it has the property $\mathbf{A}^2 = \mathbf{A}$.
Other forms for $J_{\min}$ are
$J_{\min} = \mathbf{x}^T\mathbf{x} - \mathbf{x}^T\mathbf{H}\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x} = \mathbf{x}^T\left(\mathbf{x} - \mathbf{H}\hat{\boldsymbol{\theta}}\right)$
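A short numeric check (a line-fit sketch with assumed data, not from the slides) confirms that the normal-equation solution matches NumPy's least-squares routine and that the alternative forms of J_min agree.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 50
H = np.column_stack([np.ones(N), np.arange(N)])        # assumed N x 2 observation matrix
x = H @ np.array([1.0, 0.1]) + rng.standard_normal(N)  # assumed data for illustration

theta_hat = np.linalg.solve(H.T @ H, H.T @ x)          # normal equations
theta_lstsq, *_ = np.linalg.lstsq(H, x, rcond=None)    # library solution, same answer

r = x - H @ theta_hat                                  # residual
P = H @ np.linalg.inv(H.T @ H) @ H.T                   # projection onto range(H)
print(np.allclose(theta_hat, theta_lstsq))
print(r @ r, x @ (np.eye(N) - P) @ x, x @ (x - H @ theta_hat))  # equal forms of Jmin
```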
Nonlinear Least Squares
Before discussing general methods for determining
nonlinear LSEs we first describe two methods that can
reduce the complexity of the problem.
1. Transformation of parameters.
2. Separability of parameters.
In the first case we seek a one-to-one transformation of $\boldsymbol{\theta}$
that produces a linear signal model in the new space.
To do so we let
$\boldsymbol{\alpha} = g(\boldsymbol{\theta})$
where g is a p-dimensional function of $\boldsymbol{\theta}$ whose inverse exists.
Nonlinear Least Squares
If a g can be found so that
$\mathbf{s}(\boldsymbol{\theta}) = \mathbf{s}\left(g^{-1}(\boldsymbol{\alpha})\right) = \mathbf{H}\boldsymbol{\alpha}$
then the signal model will be linear in $\boldsymbol{\alpha}$.
We can then easily find the linear LSE of $\boldsymbol{\alpha}$ and thus the nonlinear
LSE of $\boldsymbol{\theta}$ by
$\hat{\boldsymbol{\theta}} = g^{-1}\left(\hat{\boldsymbol{\alpha}}\right) = g^{-1}\left(\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}\right)$
This approach relies on the property that the minimization can be
carried out in any transformed space that is obtained by a one-to-
one mapping and then converted back to the original space.
Nonlinear Least Squares
Example Sinusoidal Parameter Estimation
For a sinusoidal signal model
$s[n] = A\cos\left(2\pi f_0 n + \phi\right), \quad n = 0, 1, \ldots, N-1$
it is desired to estimate the amplitude A, where A > 0, and phase $\phi$.
The frequency $f_0$ is assumed known.
The LSE is obtained by minimizing
$J(A, \phi) = \sum_{n=0}^{N-1}\left(x[n] - A\cos\left(2\pi f_0 n + \phi\right)\right)^2$
over A and $\phi$.
Nonlinear Least Squares
Example Sinusoidal Parameter Estimation (cont.)
Because
$A\cos\left(2\pi f_0 n + \phi\right) = A\cos\phi\cos\left(2\pi f_0 n\right) - A\sin\phi\sin\left(2\pi f_0 n\right)$
if we let
$\alpha_1 = A\cos\phi$
$\alpha_2 = -A\sin\phi$
then the signal model becomes
$s[n] = \alpha_1\cos\left(2\pi f_0 n\right) + \alpha_2\sin\left(2\pi f_0 n\right)$
In matrix form this is
$\mathbf{s} = \mathbf{H}\boldsymbol{\alpha}$
Nonlinear Least Squares
Example Sinusoidal Parameter Estimation (cont.)
where
$\mathbf{H} = \begin{bmatrix} 1 & 0 \\ \cos 2\pi f_0 & \sin 2\pi f_0 \\ \vdots & \vdots \\ \cos 2\pi f_0(N-1) & \sin 2\pi f_0(N-1) \end{bmatrix}$
which is now linear in the new parameters.
The LSE of $\boldsymbol{\alpha}$ is
$\hat{\boldsymbol{\alpha}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}$
and to find $\hat{A}$ and $\hat{\phi}$ we must find the inverse transformation $g^{-1}(\boldsymbol{\alpha})$.
Nonlinear Least Squares
Example Sinusoidal Parameter Estimation (cont.)
This is
$A = \sqrt{\alpha_1^2 + \alpha_2^2}$
$\phi = \arctan\left(-\frac{\alpha_2}{\alpha_1}\right)$
so that the nonlinear LSE for this problem is given by
$\hat{A} = \sqrt{\hat{\alpha}_1^2 + \hat{\alpha}_2^2}$
$\hat{\phi} = \arctan\left(-\frac{\hat{\alpha}_2}{\hat{\alpha}_1}\right)$
where $\hat{\boldsymbol{\alpha}} = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{x}$.
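A minimal sketch of this procedure (assumed values A = 1.3, phi = 0.7, f0 = 0.08, sigma = 0.4, N = 128), using the reparameterization alpha1 = A cos(phi), alpha2 = -A sin(phi) as above:

```python
import numpy as np

rng = np.random.default_rng(7)
A_true, phi_true, f0, sigma, N = 1.3, 0.7, 0.08, 0.4, 128  # assumed values
n = np.arange(N)
x = A_true * np.cos(2 * np.pi * f0 * n + phi_true) + sigma * rng.standard_normal(N)

H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
a1, a2 = np.linalg.solve(H.T @ H, H.T @ x)      # linear LSE of (alpha1, alpha2)

A_hat = np.hypot(a1, a2)                        # A = sqrt(alpha1^2 + alpha2^2)
phi_hat = np.arctan2(-a2, a1)                   # phi = arctan(-alpha2/alpha1), quadrant-safe
print("A_hat =", A_hat, " phi_hat =", phi_hat)
```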
Nonlinear Least Squares
A second type of nonlinear LS problem that is less
complex than the general one exhibits the separability
property.
Although the signal model is nonlinear, it may be linear
in some of the parameters. For example, in
$J(A, f_0) = \sum_{n=0}^{N-1}\left(x[n] - A\cos 2\pi f_0 n\right)^2$
the model is linear in A but nonlinear in $f_0$.
In general, a separable signal model has the form
$\mathbf{s} = \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}$
where $\mathbf{H}(\boldsymbol{\alpha})$ is an N × q matrix dependent on $\boldsymbol{\alpha}$.
This model is linear in $\boldsymbol{\beta}$ but nonlinear in $\boldsymbol{\alpha}$.
Nonlinear Least Squares
Here the parameter vector is partitioned as
$\boldsymbol{\theta} = \begin{bmatrix} \boldsymbol{\alpha} \\ \boldsymbol{\beta} \end{bmatrix}$
where $\boldsymbol{\alpha}$ is $(p-q) \times 1$ and $\boldsymbol{\beta}$ is $q \times 1$.
As a result, the LS error may be minimized with respect to $\boldsymbol{\beta}$ and
thus reduced to a function of $\boldsymbol{\alpha}$ only.
Since
$J(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \left(\mathbf{x} - \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}\right)^T\left(\mathbf{x} - \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}\right)$
the $\boldsymbol{\beta}$ that minimizes J for a given $\boldsymbol{\alpha}$ is
$\hat{\boldsymbol{\beta}} = \left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{x}$
Nonlinear Least Squares
The resulting LS error is
$J(\boldsymbol{\alpha}, \hat{\boldsymbol{\beta}}) = \mathbf{x}^T\left(\mathbf{I} - \mathbf{H}(\boldsymbol{\alpha})\left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\right)\mathbf{x}$
The problem now reduces to a maximization of
$\mathbf{x}^T\mathbf{H}(\boldsymbol{\alpha})\left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{x}$
over $\boldsymbol{\alpha}$.
If, for instance, q = p − 1, so that $\boldsymbol{\alpha}$ is a scalar, then a grid search
can possibly be used.
Nonlinear Least Squares
Example Damped Exponentials
Assume we have a signal model
$s[n] = A_1 r^n + A_2 r^{2n} + A_3 r^{3n}$
where the unknown parameters are {A1, A2, A3, r}. It is
known that 0 < r < 1.
Then, the model is linear in the amplitudes $\boldsymbol{\beta} = [A_1\; A_2\; A_3]^T$
and nonlinear in the damping factor $\alpha = r$.
The nonlinear LSE is obtained by maximizing
$\mathbf{x}^T\mathbf{H}(r)\left(\mathbf{H}^T(r)\mathbf{H}(r)\right)^{-1}\mathbf{H}^T(r)\mathbf{x}$
over 0 < r < 1.
Nonlinear Least Squares
Example Damped Exponentials (cont.)
where
$\mathbf{H}(r) = \begin{bmatrix} 1 & 1 & 1 \\ r & r^2 & r^3 \\ \vdots & \vdots & \vdots \\ r^{N-1} & r^{2(N-1)} & r^{3(N-1)} \end{bmatrix}$
Once $\hat{r}$ is found, we have the LSE for the amplitudes
$\hat{\boldsymbol{\beta}} = \left(\mathbf{H}^T(\hat{r})\mathbf{H}(\hat{r})\right)^{-1}\mathbf{H}^T(\hat{r})\mathbf{x}$
This maximization is easily carried out on a digital computer.
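For instance, a grid-search sketch of this separable LS procedure (assumed values A1 = 1, A2 = 0.5, A3 = -0.2, r = 0.8, sigma = 0.05, N = 50, none of which are specified in the slides):

```python
import numpy as np

rng = np.random.default_rng(8)
N, sigma = 50, 0.05
beta_true, r_true = np.array([1.0, 0.5, -0.2]), 0.8   # assumed values for illustration
n = np.arange(N)

def Hmat(r):
    # N x 3 matrix with columns r^n, r^(2n), r^(3n)
    return np.column_stack([r**n, r**(2 * n), r**(3 * n)])

x = Hmat(r_true) @ beta_true + sigma * rng.standard_normal(N)

# Grid search: maximize x^T H(r) (H(r)^T H(r))^{-1} H(r)^T x over 0 < r < 1
grid = np.linspace(0.01, 0.99, 981)
scores = []
for r in grid:
    H = Hmat(r)
    beta, *_ = np.linalg.lstsq(H, x, rcond=None)
    scores.append(x @ (H @ beta))
r_hat = grid[int(np.argmax(scores))]

H = Hmat(r_hat)
beta_hat, *_ = np.linalg.lstsq(H, x, rcond=None)      # amplitudes at the best r
print("r_hat =", r_hat, " beta_hat =", beta_hat)
```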
The Bayesian Philosophy
We now depart from the classical approach to statistical
estimation, in which the parameter of interest is assumed to be a
deterministic but unknown constant.
Instead, we assume that $\theta$ is a random variable whose
particular realization we must estimate. This is the Bayesian
approach, so named because its implementation is based
directly on Bayes' theorem.
The motivation for doing so is twofold.
First, if we have available some prior knowledge about $\theta$,
we can incorporate it into our estimator.
Second, Bayesian estimation is useful in situations where an
MVU estimator cannot be found.
Prior Knowledge and Estimation
It is a fundamental rule of estimation theory that the use of prior
knowledge will lead to a more accurate estimator.
For example, if a parameter is constrained to lie in a known
interval, then any good estimator should produce only estimates
within that interval.
In the earlier example, it was shown that the MVU estimator of A is
the sample mean $\bar{x}$. However, this assumed that A could take on
any value in the interval $-\infty < A < \infty$.
Due to physical constraints it may be more reasonable to assume
that A can take on only values in the finite interval $-A_0 \leq A \leq A_0$.
To retain $\hat{A} = \bar{x}$ as the best estimator would be undesirable, since
$\bar{x}$ may yield values outside the known interval.
Prior Knowledge and Estimation
As shown in figure (a), this is due to noise effects.
Certainly, we would expect to improve our estimation if we used
the truncated sample mean estimator
$\check{A} = \begin{cases} -A_0 & \bar{x} < -A_0 \\ \bar{x} & -A_0 \leq \bar{x} \leq A_0 \\ A_0 & \bar{x} > A_0 \end{cases}$
Prior Knowledge and Estimation
Such an estimator would have the PDF
$p_{\check{A}}(\xi) = \Pr\{\bar{x} < -A_0\}\,\delta(\xi + A_0) + p_{\bar{x}}(\xi; A)\left[u(\xi + A_0) - u(\xi - A_0)\right] + \Pr\{\bar{x} > A_0\}\,\delta(\xi - A_0)$
It is seen that $\check{A}$ is a biased estimator. However, if we compare
the MSE of the two estimators, we note that for any A in the
interval $-A_0 \leq A \leq A_0$
$\mathrm{mse}(\bar{x}) = \int_{-\infty}^{\infty}(\xi - A)^2 p_{\bar{x}}(\xi; A)\, d\xi$
$= \int_{-\infty}^{-A_0}(\xi - A)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{-A_0}^{A_0}(\xi - A)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{A_0}^{\infty}(\xi - A)^2 p_{\bar{x}}(\xi; A)\, d\xi$
$\geq \int_{-\infty}^{-A_0}(-A_0 - A)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{-A_0}^{A_0}(\xi - A)^2 p_{\bar{x}}(\xi; A)\, d\xi + \int_{A_0}^{\infty}(A_0 - A)^2 p_{\bar{x}}(\xi; A)\, d\xi$
$= \mathrm{mse}(\check{A})$
Prior Knowledge and Estimation
Hence, $\check{A}$, the truncated sample mean estimator, is better than the
sample mean estimator in terms of MSE.
Although $\bar{x}$ is still the MVU estimator, we have been able to
reduce the mean square error by allowing the estimator to be
biased.
Knowing that A must lie in a known interval, we suppose that the
true value of A has been chosen from that interval. We then model
the process of choosing a value as a random event to which a PDF
can be assigned.
With knowledge only of the interval and no inclination as to
whether A should be nearer any particular value, it makes sense to
assign a U[-A0, A0] PDF to the random variable A.
Prior Knowledge and Estimation
The overall data model then appears as in the following figure.

As shown there, the act of choosing A according to the given


PDF represents the departure of the Bayesian approach from
the classical approach.
The problem, as always, is to estimate the value of A, or the
realization of the random variable, but now we can incorporate our
knowledge of how A was chosen.
Prior Knowledge and Estimation
For example, we might attempt to find an estimator that would
minimize the Bayesian MSE defined as
$\mathrm{Bmse}(\hat{A}) = E\left[(A - \hat{A})^2\right]$
We choose to define the error as $A - \hat{A}$, in contrast to the classical
estimation error of $\hat{A} - A$.
We emphasize that since A is a random variable, the
expectation operator is with respect to the joint PDF p(x, A).
This is a fundamentally different MSE than in the classical case.
We distinguish it by using the Bmse notation.
Prior Knowledge and Estimation
To appreciate the difference compare the classical MSE
$\mathrm{mse}(\hat{A}) = \int (\hat{A} - A)^2 p(\mathbf{x}; A)\, d\mathbf{x}$
to the Bayesian MSE
$\mathrm{Bmse}(\hat{A}) = \iint (A - \hat{A})^2 p(\mathbf{x}, A)\, d\mathbf{x}\, dA$
Note that whereas the classical MSE will depend on A, and hence
estimators that attempt to minimize the MSE will usually depend
on A, the Bayesian MSE will not. In effect, we have integrated
the parameter dependence away!
Prior Knowledge and Estimation
To complete our example we now derive the estimator that
minimizes the Bayesian MSE. First, we use Bayes' theorem to
write
$p(\mathbf{x}, A) = p(A\,|\,\mathbf{x})\, p(\mathbf{x})$
so that
$\mathrm{Bmse}(\hat{A}) = \int\left[\int (A - \hat{A})^2 p(A\,|\,\mathbf{x})\, dA\right] p(\mathbf{x})\, d\mathbf{x}$
Now since $p(\mathbf{x}) \geq 0$ for all $\mathbf{x}$, if the integral in brackets can be
minimized for each $\mathbf{x}$, then the Bayesian MSE will be minimized.
Prior Knowledge and Estimation
Hence, fixing $\mathbf{x}$ so that $\hat{A}$ is a scalar variable, we have
$\frac{\partial}{\partial \hat{A}}\int (A - \hat{A})^2 p(A\,|\,\mathbf{x})\, dA = \int \frac{\partial}{\partial \hat{A}}(A - \hat{A})^2 p(A\,|\,\mathbf{x})\, dA$
$= -2\int A\, p(A\,|\,\mathbf{x})\, dA + 2\hat{A}\int p(A\,|\,\mathbf{x})\, dA$
which when set equal to zero results in
$\hat{A} = \int A\, p(A\,|\,\mathbf{x})\, dA$
or finally
$\hat{A} = E(A\,|\,\mathbf{x})$
Prior Knowledge and Estimation
It is seen that the optimal estimator in terms of minimizing the
Bayesian MSE is the mean of the posterior PDF p(A|x).
The posterior PDF refers to the PDF of A after the data have been
observed. In contrast, p(A), or
$p(A) = \int p(\mathbf{x}, A)\, d\mathbf{x}$
may be thought of as the prior PDF of A, indicating the PDF
before the data are observed.
We will henceforth term the estimator that minimizes the
Bayesian MSE the minimum mean square error (MMSE)
estimator.
Prior Knowledge and Estimation
In determining the MMSE estimator we first require the posterior
PDF. We can use Bayes' rule to determine it as
$p(A\,|\,\mathbf{x}) = \frac{p(\mathbf{x}\,|\,A)\,p(A)}{p(\mathbf{x})} = \frac{p(\mathbf{x}\,|\,A)\,p(A)}{\int p(\mathbf{x}\,|\,A)\,p(A)\, dA}$
Note that the denominator is just a normalizing factor,
independent of A, needed to ensure that p(A|x) integrates to 1.
If we continue our example, we recall that the prior PDF p(A) is
$\mathcal{U}[-A_0, A_0]$. To specify the conditional PDF p(x|A) we need to
further assume that the choice of A via p(A) does not affect the
PDF of the noise samples, or that w[n] is independent of A.
Prior Knowledge and Estimation
Then, for n = 0, 1, ..., N − 1,
$p_x\left(x[n]\,|\,A\right) = p_w\left(x[n] - A\,|\,A\right) = p_w\left(x[n] - A\right)$
$= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2\sigma^2}\left(x[n] - A\right)^2\right]$
and therefore
$p(\mathbf{x}\,|\,A) = \frac{1}{\left(2\pi\sigma^2\right)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\right)^2\right]$
It is apparent that this PDF is identical in form to the usual
classical PDF p(x; A).
Prior Knowledge and Estimation
The posterior PDF becomes
$p(A\,|\,\mathbf{x}) = \dfrac{\frac{1}{2A_0}\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-A\right)^2\right]}{\int_{-A_0}^{A_0}\frac{1}{2A_0}\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n]-A\right)^2\right]dA}$ for $-A_0 \leq A \leq A_0$
and $p(A\,|\,\mathbf{x}) = 0$ otherwise.
But
$\sum_{n=0}^{N-1}\left(x[n] - A\right)^2 = \sum_{n=0}^{N-1}x^2[n] - 2NA\bar{x} + NA^2$
$= N\left(A - \bar{x}\right)^2 + \sum_{n=0}^{N-1}x^2[n] - N\bar{x}^2$
Prior Knowledge and Estimation
So that we have
$p(A\,|\,\mathbf{x}) = \begin{cases} \dfrac{1}{c}\,\dfrac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\dfrac{1}{2\frac{\sigma^2}{N}}\left(A - \bar{x}\right)^2\right] & -A_0 \leq A \leq A_0 \\ 0 & |A| > A_0 \end{cases}$
The factor c is determined by the requirement that p(A|x) integrate
to 1, resulting in
$c = \int_{-A_0}^{A_0}\frac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\frac{1}{2\frac{\sigma^2}{N}}\left(A - \bar{x}\right)^2\right]dA$
Prior Knowledge and Estimation
The PDF is seen to be a truncated Gaussian, as shown in the figure.
The MMSE estimator, which is the mean of p(A|x), is
$\hat{A} = E(A\,|\,\mathbf{x}) = \dfrac{\int_{-A_0}^{A_0} A\,\frac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\frac{1}{2\frac{\sigma^2}{N}}\left(A - \bar{x}\right)^2\right]dA}{\int_{-A_0}^{A_0}\frac{1}{\sqrt{2\pi\frac{\sigma^2}{N}}}\exp\left[-\frac{1}{2\frac{\sigma^2}{N}}\left(A - \bar{x}\right)^2\right]dA}$
Prior Knowledge and Estimation
Although this cannot be evaluated in closed form, we note that $\hat{A}$
will be a function of $\bar{x}$ as well as of $A_0$ and $\sigma^2$.
The MMSE estimator will not be $\bar{x}$ due to the truncation shown
in figure (b), unless $A_0$ is so large that there is effectively no
truncation. This will occur if $A_0 \gg \sqrt{\sigma^2/N}$.

The effect of the data is to position the posterior mean between
A = 0 and $A = \bar{x}$ in a compromise between the prior knowledge and
that contributed by the data.
To further appreciate this weighting consider what happens as N
becomes large so that the data knowledge becomes more
important.
Prior Knowledge and Estimation
As shown in Figure 4, as N increases, the posterior PDF
becomes more concentrated about $\bar{x}$ (since $\sigma^2/N$ decreases).
Hence, it becomes nearly Gaussian, and its mean becomes just $\bar{x}$.


The MMSE estimator relies less and less on the prior knowledge
and more on the data. It is said that the data swamps out the
prior knowledge.
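A numerical sketch of this estimator (assumed values A0 = 1, sigma = 1, and an observed sample mean of 1.2, none of which are from the slides) evaluates the posterior-mean integral by quadrature and shows how it moves toward the sample mean as N grows while always remaining inside the prior interval.

```python
import numpy as np

A0, sigma, xbar = 1.0, 1.0, 1.2       # assumed values; note xbar lies outside [-A0, A0]
A = np.linspace(-A0, A0, 20001)       # integration grid over the prior's support

for N in (1, 10, 100):
    w = np.exp(-(A - xbar) ** 2 / (2 * sigma**2 / N))   # unnormalized truncated Gaussian
    A_mmse = np.trapz(A * w, A) / np.trapz(w, A)        # posterior mean E[A | x]
    print(f"N={N:3d}: A_mmse = {A_mmse:.4f}  (xbar = {xbar}, always inside [-A0, A0])")
```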
Prior Knowledge and Estimation
Theorem Conditional PDF of Multivariate Gaussian
If x and y are jointly Gaussian, where x is k × 1 and y is l × 1, with mean
vector $[E(\mathbf{x})^T\; E(\mathbf{y})^T]^T$ and partitioned covariance matrix
$\mathbf{C} = \begin{bmatrix} \mathbf{C}_{xx} & \mathbf{C}_{xy} \\ \mathbf{C}_{yx} & \mathbf{C}_{yy} \end{bmatrix}$
where $\mathbf{C}_{xx}$ is k × k, $\mathbf{C}_{xy}$ is k × l, $\mathbf{C}_{yx}$ is l × k, and $\mathbf{C}_{yy}$ is l × l,
so that
$p(\mathbf{x}, \mathbf{y}) = \frac{1}{\left(2\pi\right)^{\frac{k+l}{2}}\det^{\frac{1}{2}}(\mathbf{C})}\exp\left\{-\frac{1}{2}\begin{bmatrix}\mathbf{x} - E(\mathbf{x}) \\ \mathbf{y} - E(\mathbf{y})\end{bmatrix}^T\mathbf{C}^{-1}\begin{bmatrix}\mathbf{x} - E(\mathbf{x}) \\ \mathbf{y} - E(\mathbf{y})\end{bmatrix}\right\}$
then the conditional PDF p(y|x) is also Gaussian and
$E(\mathbf{y}\,|\,\mathbf{x}) = E(\mathbf{y}) + \mathbf{C}_{yx}\mathbf{C}_{xx}^{-1}\left(\mathbf{x} - E(\mathbf{x})\right)$
$\mathbf{C}_{y|x} = \mathbf{C}_{yy} - \mathbf{C}_{yx}\mathbf{C}_{xx}^{-1}\mathbf{C}_{xy}$
Bayesian Linear Model
Theorem Posterior PDF for the Bayesian General
Linear Model
If the observed data x can be modeled as
$\mathbf{x} = \mathbf{H}\boldsymbol{\theta} + \mathbf{w}$
where x is an N × 1 data vector, H is a known N × p matrix, $\boldsymbol{\theta}$ is a p × 1
random vector with prior PDF $\mathcal{N}(\boldsymbol{\mu}_\theta, \mathbf{C}_\theta)$, and w is an N × 1 noise vector
with PDF $\mathcal{N}(\mathbf{0}, \mathbf{C}_w)$ and independent of $\boldsymbol{\theta}$,
then the posterior PDF $p(\boldsymbol{\theta}\,|\,\mathbf{x})$ is Gaussian with mean
$E(\boldsymbol{\theta}\,|\,\mathbf{x}) = \boldsymbol{\mu}_\theta + \mathbf{C}_\theta\mathbf{H}^T\left(\mathbf{H}\mathbf{C}_\theta\mathbf{H}^T + \mathbf{C}_w\right)^{-1}\left(\mathbf{x} - \mathbf{H}\boldsymbol{\mu}_\theta\right)$
and covariance
$\mathbf{C}_{\theta|x} = \mathbf{C}_\theta - \mathbf{C}_\theta\mathbf{H}^T\left(\mathbf{H}\mathbf{C}_\theta\mathbf{H}^T + \mathbf{C}_w\right)^{-1}\mathbf{H}\mathbf{C}_\theta$
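A compact sketch of these formulas for an assumed special case (a scalar DC level: H a column of ones, C_theta = sigma_A^2, C_w = sigma^2 I; all numeric values are assumptions), showing the posterior mean shrinking the data toward the prior mean.

```python
import numpy as np

rng = np.random.default_rng(9)
N, sigma, sigma_A, mu_A = 20, 1.0, 0.5, 0.0   # assumed values for illustration

H = np.ones((N, 1))                 # DC-level model: x = H*A + w
C_theta = np.array([[sigma_A**2]])  # prior covariance of A
C_w = sigma**2 * np.eye(N)          # noise covariance
A_true = rng.normal(mu_A, sigma_A)
x = (H * A_true).ravel() + sigma * rng.standard_normal(N)

S = H @ C_theta @ H.T + C_w
K = C_theta @ H.T @ np.linalg.inv(S)            # gain C_theta H^T (H C_theta H^T + C_w)^{-1}
post_mean = mu_A + K @ (x - H.ravel() * mu_A)
post_cov = C_theta - K @ H @ C_theta

print("posterior mean:", post_mean.item(), " sample mean:", x.mean())
print("posterior var :", post_cov.item())
```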
Risk Function
Previously, we derived the MMSE estimator by minimizing
$E[(\theta - \hat{\theta})^2]$, where the expectation is with respect to the PDF
$p(\mathbf{x}, \theta)$.
If we let $\epsilon = \theta - \hat{\theta}$ denote the error of the estimator for a
particular realization of $\mathbf{x}$ and $\theta$, and also let $C(\epsilon) = \epsilon^2$, then the
MSE criterion minimizes $E[C(\epsilon)]$.
The deterministic function $C(\epsilon)$ is termed the cost function. It is
noted that large errors are particularly costly.
Also, the average cost, or $E[C(\epsilon)]$, is termed the Bayes risk $\mathcal{R}$, or
$\mathcal{R} = E[C(\epsilon)]$
and measures the performance of a given estimator.
Risk Function
Examples of cost functions.
Risk Function
The Bayes risk $\mathcal{R}$ is
$\mathcal{R} = E[C(\epsilon)] = \iint C(\epsilon)\, p(\mathbf{x}, \theta)\, d\mathbf{x}\, d\theta = \int\left[\int C(\epsilon)\, p(\theta\,|\,\mathbf{x})\, d\theta\right] p(\mathbf{x})\, d\mathbf{x}$
Maximum A Posteriori Estimators
In the MAP estimation approach we choose $\hat{\theta}$ to maximize the
posterior PDF, or
$\hat{\theta} = \arg\max_{\theta}\, p(\theta\,|\,\mathbf{x})$
In finding the maximum of $p(\theta\,|\,\mathbf{x})$ we observe that
$p(\theta\,|\,\mathbf{x}) = \frac{p(\mathbf{x}\,|\,\theta)\,p(\theta)}{p(\mathbf{x})}$
so an equivalent maximization is that of $p(\mathbf{x}\,|\,\theta)\,p(\theta)$.
Hence, the MAP estimator is
$\hat{\theta} = \arg\max_{\theta}\, p(\mathbf{x}\,|\,\theta)\,p(\theta)$
or, equivalently,
$\hat{\theta} = \arg\max_{\theta}\,\left[\ln p(\mathbf{x}\,|\,\theta) + \ln p(\theta)\right]$
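As a closing sketch (an assumed case, not from the slides: a DC level A in WGN with a Gaussian prior N(mu_A, sigma_A^2) and assumed numeric values), the MAP estimate can be found by maximizing ln p(x|A) + ln p(A) over a grid; for this Gaussian case it agrees with the well-known closed-form weighted average of the sample mean and the prior mean.

```python
import numpy as np

rng = np.random.default_rng(10)
N, sigma, mu_A, sigma_A = 25, 1.0, 0.0, 0.3   # assumed values for illustration
A_true = 0.4
x = A_true + sigma * rng.standard_normal(N)
xbar = x.mean()

# Grid maximization of ln p(x|A) + ln p(A) (additive constants dropped)
A_grid = np.linspace(-2, 2, 40001)
log_like = -np.sum((x[None, :] - A_grid[:, None]) ** 2, axis=1) / (2 * sigma**2)
log_prior = -(A_grid - mu_A) ** 2 / (2 * sigma_A**2)
A_map = A_grid[np.argmax(log_like + log_prior)]

# Closed form for the Gaussian-prior case: weighted average of xbar and mu_A
w = sigma_A**2 / (sigma_A**2 + sigma**2 / N)
print("grid MAP   :", A_map)
print("closed form:", w * xbar + (1 - w) * mu_A)
```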
