
Bayesian Estimation

 Classical approach in statistical estimation:
 θ is assumed to be a deterministic but unknown constant.

 Bayesian approach:
 We assume that θ is a random variable whose particular
realization we must estimate.
 This is the Bayesian approach, so named because its
implementation is based directly on Bayes' theorem.
 Prior knowledge about θ can be incorporated into our
estimator by assuming that θ is a random variable with a
given prior PDF.

Page 3
Bayesian Estimation

 MSE in Classical Estimation.

MSE = E_x\!\left[ (\hat{\theta} - \theta)^2 \right] = \int (\hat{\theta} - \theta)^2\, p(x)\, dx

 MSE in Bayesian Estimation:

BMSE = E_{x,\theta}\!\left[ (\theta - \hat{\theta})^2 \right] = \iint (\theta - \hat{\theta})^2\, p(x, \theta)\, dx\, d\theta

 The difference: In the Bayesian approach the averaging PDF is the joint PDF of x and θ, while the averaging PDF in the classical approach is p(x).
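
The difference can be made concrete with a small Monte Carlo experiment. The following is a minimal Matlab sketch (not from the slides; the values of N, σ², θ and the uniform prior bound A0 are assumed) approximating the classical MSE of the sample mean for one fixed θ, and the Bayesian MSE when θ is drawn anew from a U[-A0, A0] prior in every run.

  N = 10; sigma2 = 1; M = 100000;      % data length, noise variance, Monte Carlo runs

  theta0 = 0.8;                        % classical view: one fixed theta
  x = theta0 + sqrt(sigma2)*randn(N, M);
  mse_classical = mean((mean(x,1) - theta0).^2);   % averages over p(x) only

  A0 = 1;                              % Bayesian view: theta ~ U[-A0, A0]
  theta = -A0 + 2*A0*rand(1, M);       % a new theta realization for every run
  x = theta + sqrt(sigma2)*randn(N, M);            % implicit expansion over the runs
  bmse = mean((theta - mean(x,1)).^2); % averages over the joint PDF p(x, theta)

  fprintf('classical MSE (theta = %.1f): %.4f,  Bayesian MSE: %.4f\n', ...
          theta0, mse_classical, bmse);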

Page 4
Bayesian Estimation

 Underlying Experiment (for our usual simple DC level in noise, where we assume U[-A0, A0] as the prior PDF).

Taken from Kay: Fundamentals of Statistical Signal Processing, Vol 1: Estimation Theory, Prentice Hall, Upper Saddle River 2009

 Classical approach: MSE for each value of θ.


 Bayesian approach: One single MSE (which is an average over the
PDF of θ).
Page 5
Bayesian Estimation

Derivation of the Bayesian MMSE estimator


 Cost function: The Bayesian MSE

J = \iint (\theta - \hat{\theta})^2\, p(x, \theta)\, dx\, d\theta

 Note: The averaging PDF is the joint PDF of x and θ!


 We apply Bayes’ theorem
p(x, \theta) = p(\theta|x)\, p(x)
to obtain

J = \int \left[ \int (\theta - \hat{\theta})^2\, p(\theta|x)\, d\theta \right] p(x)\, dx.
 Since p(x) >= 0 for all x, if the integral in brackets can be
minimized for each x, then the Bayesian MSE will be minimized.
Page 6
Bayesian Estimation

J' = \int (\theta - \hat{\theta})^2\, p(\theta|x)\, d\theta

 Hence, fixing x so that θˆ is a scalar variable (as opposed to a general function of x), we have

\frac{\partial J'}{\partial \hat{\theta}} = \frac{\partial}{\partial \hat{\theta}} \int (\theta - \hat{\theta})^2\, p(\theta|x)\, d\theta
= \int -2\,(\theta - \hat{\theta})\, p(\theta|x)\, d\theta
= -2 \int \theta\, p(\theta|x)\, d\theta + 2\hat{\theta} \int p(\theta|x)\, d\theta
= -2\, E(\theta|x) + 2\hat{\theta}

which, when set to zero, results in

\hat{\theta} = E(\theta|x).
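
As a quick numerical illustration (not from the slides), the posterior mean can be evaluated on a grid whenever no closed form is available, e.g. for the DC level with the U[-A0, A0] prior mentioned earlier. A minimal Matlab sketch, with all numbers assumed (uses implicit expansion, R2016b or later):

  N = 10; sigma2 = 1; A0 = 1;                    % assumed values
  A_true = 0.3;                                  % one realization of the DC level
  x = A_true + sqrt(sigma2)*randn(N,1);

  A = linspace(-A0, A0, 1000);                   % grid over the prior support
  loglik = -sum((x - A).^2, 1)/(2*sigma2);       % log p(x|A) up to a constant
  post = exp(loglik - max(loglik));              % unnormalized posterior (flat prior)
  post = post/trapz(A, post);                    % normalize to unit area
  A_mmse = trapz(A, A.*post);                    % MMSE estimate = posterior mean E(A|x)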
Page 7
Bayesian Estimation

Comments:
 It is seen that the optimal estimator, in terms of minimizing the Bayesian MSE, is the mean of the posterior PDF p(θ|x).

 The posterior PDF refers to the PDF of θ after the data have been
observed. It summarizes our new state of knowledge about the
parameter.

 In contrast, p(θ) may be thought of as the prior PDF of θ, indicating the PDF before the data are observed.

 We will term the estimator that minimizes the Bayesian MSE the
minimum mean square error (MMSE) estimator.
 Intuitively, the effect of observing data will be to concentrate the
PDF of θ.

 This is because knowledge of the data should reduce our uncertainty about θ.
Page 8
Bayesian Estimation

Comments:
 The MMSE estimator will in general depend on the prior knowledge as
well as the data.
 If the prior knowledge is weak relative to that of the data, then the
estimator will ignore the prior knowledge.
 Otherwise, the estimator will be "biased" towards the prior mean. As
expected, the use of prior information always improves the estimation
accuracy.

 The choice of a prior PDF is critical in Bayesian estimation. The wrong choice will result in a poor estimator, similar to the problems of a classical estimator designed with an incorrect data model.
 Remember: the classical MSE will depend on θ, hence estimators that attempt to minimize it will usually depend on θ; the Bayesian MSE will not!
 In effect, the parameter dependency has been integrated away.
Page 9
Bayesian Estimation

 We derived the MMSE estimator for the case of continuous random variables.
 We notice that the same estimator also holds for discrete random
variables.
Example 1:

Page 10
Bayesian Estimation

 It is often not possible to find a closed-form solution for the MMSE estimator. An exception is the case where x and θ are jointly Gaussian distributed.

Page 11
Bayesian Estimation

 Example: DC level in WGN with Gaussian prior


 x[n] = A + w[n]
 Gaussian prior:

p(A) = \frac{1}{\sqrt{2\pi\sigma_A^2}} \exp\!\left( -\frac{1}{2\sigma_A^2} (A - \mu_A)^2 \right)

 with µ_A = 0
 If

p(x|A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right)

 then p(A|x) can be written as:


 1 
exp − Q( A)
p ( A | x) = ∞  2 
 1 
∫−∞exp− 2 Q( A)dA Page 12
Bayesian Estimation
 with

Q(A) = \frac{N}{\sigma^2} A^2 - \frac{2 N A \bar{x}}{\sigma^2} + \frac{A^2}{\sigma_A^2} - \frac{2\mu_A A}{\sigma_A^2} + \frac{\mu_A^2}{\sigma_A^2}

 Note that the denominator of p(A|x) does not depend on A any more, being a normalizing factor (normalizing the area below p(A|x)), and that the argument of the exponential is quadratic in A.
 Hence p(A|x) must be Gaussian. It can be shown that its mean and variance are

\mu_{A|x} = \frac{\frac{N}{\sigma^2}\,\bar{x} + \frac{\mu_A}{\sigma_A^2}}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}

\sigma_{A|x}^2 = \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}
Page 13
Bayesian Estimation

 In this form, the Bayes MMSE estimator is readily found as

\hat{A} = E(A|x) = \mu_{A|x} = \frac{\frac{N}{\sigma^2}\,\bar{x} + \frac{\mu_A}{\sigma_A^2}}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}

 For better interpretation this can be written as

\hat{A} = \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}}\,\bar{x} + \frac{\frac{\sigma^2}{N}}{\sigma_A^2 + \frac{\sigma^2}{N}}\,\mu_A = \alpha \bar{x} + (1 - \alpha)\,\mu_A

 with

\alpha = \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}}
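
A minimal Matlab sketch of this estimator (the values of N, σ², µ_A and σ_A² below are assumed, not from the slides):

  N = 20; sigma2 = 1;                  % data length and noise variance (assumed)
  mu_A = 0; sigma_A2 = 0.5;            % Gaussian prior N(mu_A, sigma_A2) (assumed)

  A = mu_A + sqrt(sigma_A2)*randn;     % draw the realization of A from its prior
  x = A + sqrt(sigma2)*randn(N,1);     % x[n] = A + w[n]

  xbar  = mean(x);
  alpha = sigma_A2/(sigma_A2 + sigma2/N);        % weighting factor, 0 < alpha < 1
  A_hat = alpha*xbar + (1 - alpha)*mu_A;         % MMSE estimate = posterior mean
  var_post = 1/(N/sigma2 + 1/sigma_A2);          % posterior variance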
Page 14
Bayesian Estimation

 Note that α is a weighting factor, since 0 < α < 1.
 When there is little data available, so that σ_A² << σ²/N, then α is small and Â ≈ µ_A.
 But as more data are observed, so that σ_A² >> σ²/N, then α ≈ 1 and Â ≈ x̄.
 The weighting factor α directly depends on the confidence in the prior knowledge, σ_A², and the confidence in the sample data, σ²/N.
 If one examines the posterior PDF, its variance

\sigma_{A|x}^2 = \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}

 decreases as N increases.
Page 15
Bayesian Estimation

 As we have seen, the posterior mean changes with increasing N:
 for small N it will approximately be µ_A,
 but it will approach x̄ for increasing N.

Taken from Kay: Fundamentals of Statistical Signal Processing, Vol 1: Estimation Theory, Prentice Hall, Upper Saddle River 2009

Page 16
Bayesian Estimation
Vector Case

Theorem: If x and θ are jointly Gaussian, where x is of dimension k×1 and θ of dimension l×1, with mean vector [E(x)^T E(θ)^T]^T and partitioned covariance matrix

C = \begin{bmatrix} C_{xx} & C_{x\theta} \\ C_{\theta x} & C_{\theta\theta} \end{bmatrix}

so that

p(x, \theta) = \frac{1}{(2\pi)^{\frac{k+l}{2}} \det^{\frac{1}{2}}(C)} \exp\!\left( -\frac{1}{2} \begin{bmatrix} x - E(x) \\ \theta - E(\theta) \end{bmatrix}^{T} C^{-1} \begin{bmatrix} x - E(x) \\ \theta - E(\theta) \end{bmatrix} \right)

then the conditional PDF p(θ|x) is also Gaussian, with

E(\theta|x) = E(\theta) + C_{\theta x} C_{xx}^{-1} (x - E(x))

C_{\theta|x} = C_{\theta\theta} - C_{\theta x} C_{xx}^{-1} C_{x\theta}
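
A minimal Matlab sketch evaluating these two formulas for an assumed example with k = 2 and l = 1 (the moments and the observed x below are made up for illustration, not from the slides):

  Ex = [0; 0]; Etheta = 1;                       % assumed means (k = 2, l = 1)
  Cxx = [2 0.5; 0.5 1];                          % assumed covariance of x
  Cxtheta = [0.8; 0.3];                          % assumed cross-covariance (k x l)
  Cthetatheta = 1;                               % assumed variance of theta
  Cthetax = Cxtheta';                            % l x k cross-covariance

  x = [0.4; -1.2];                               % an observed data vector
  Etheta_x = Etheta + Cthetax*(Cxx\(x - Ex));    % conditional mean E(theta|x)
  Ctheta_x = Cthetatheta - Cthetax*(Cxx\Cxtheta);% conditional covariance Ctheta|x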
Page 17
Bayesian Estimation
Vector Case

Theorem: If the observed data x can be modeled as

x = H\theta + w

where x is an N×1 data vector, H is a known N×p matrix, θ is a p×1 random vector with prior PDF N(µ_θ, C_θ), and w is an N×1 noise vector with PDF N(0, C_w), independent of θ, then the posterior PDF p(θ|x) is Gaussian with mean

E(\theta|x) = \mu_\theta + C_\theta H^T (H C_\theta H^T + C_w)^{-1} (x - H\mu_\theta)

and covariance

C_{\theta|x} = C_\theta - C_\theta H^T (H C_\theta H^T + C_w)^{-1} H C_\theta

 In contrast to the classical general linear model, H need not be of full rank to ensure the invertibility of H C_θ H^T + C_w.
 Note that C_{θ|x} is also the covariance matrix of the estimation error ε = θ − θˆ (ε has zero mean).

Page 18
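
As an illustration (all numbers assumed, not from the slides), the following minimal Matlab sketch evaluates the posterior mean and covariance of the theorem for the scalar DC level, where H is a column of ones; under this model it reproduces the α-weighted estimator of the previous slides.

  N = 10; sigma2 = 0.5;                % data length and noise variance (assumed)
  mu_theta = 0; C_theta = 2;           % prior N(mu_theta, C_theta) for a scalar A
  H = ones(N,1); Cw = sigma2*eye(N);

  A_true = mu_theta + sqrt(C_theta)*randn;       % draw A from its prior
  x = H*A_true + sqrt(sigma2)*randn(N,1);        % x = H*theta + w

  K = C_theta*H' / (H*C_theta*H' + Cw);          % C_theta*H'*(H*C_theta*H' + Cw)^-1
  theta_hat = mu_theta + K*(x - H*mu_theta);     % posterior mean E(theta|x)
  C_post    = C_theta - K*H*C_theta;             % posterior covariance C_theta|x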
Linear Bayesian Estimation

Linear MMSE Estimation


 Except when x and θ are jointly Gaussian the MMSE estimator
may be difficult to find.

 The situation is different when we constrain the estimator to be linear in x.

 As will be seen shortly, we do not have to assume any specific form for the joint PDF p(x, θ); knowledge of the first two moments is sufficient to derive the LMMSE estimator (compare this to the BLUE).

 That θ may be estimated from x at all is due to the assumed statistical dependence of θ on x, as summarized by the joint PDF p(x, θ).
 In particular, for a linear estimator we rely on the correlation
between θ and x.
Page 19
Linear Bayesian Estimation

Introductory Example: Assume x and θ are jointly distributed. Find the linear (actually affine) estimator

\hat{\theta} = a x + b

that minimizes the Bayesian MSE

J = E_{x,\theta}\!\left[ (\theta - \hat{\theta})^2 \right]

Solution:

\hat{\theta} = E(\theta) + \frac{\mathrm{cov}(x, \theta)}{\mathrm{var}(x)}\,(x - E(x))
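
A minimal Matlab sketch of this affine estimator, with the required moments estimated from sample pairs drawn from an assumed joint model (θ Gaussian, x a noisy copy of θ; none of these numbers come from the slides):

  M = 10000;
  theta = 1 + randn(M,1);                  % assumed prior for theta
  x = theta + 0.5*randn(M,1);              % assumed observation model

  C = cov(x, theta);
  a = C(1,2)/var(x);                       % cov(x,theta)/var(x)
  b = mean(theta) - a*mean(x);             % so that theta_hat = E(theta) + a*(x - E(x))
  theta_hat = a*x + b;
  bmse_hat = mean((theta - theta_hat).^2); % sample Bayesian MSE of the affine estimator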

Example: Derive the LMMSE estimator and the MSE for Example 1.

Page 20
Linear Bayesian Estimation
Scalar Parameter

 Aim: Find the (affine linear) estimator of the form

\hat{\theta} = \sum_{n=0}^{N-1} a_n x[n] + a_N

 that minimizes the Bayesian MSE

J = E_{x,\theta}\!\left[ (\theta - \hat{\theta})^2 \right]

 a_N compensates for nonzero means of x and θ
 It is omitted when both means are zero.

Page 21
Linear Bayesian Estimation
Scalar Parameter

 Deriving the optimal weighting coefficients:
 Starting with a_N:

\frac{\partial}{\partial a_N} E\!\left[ \left( \theta - \sum_{n=0}^{N-1} a_n x[n] - a_N \right)^{2} \right] = -2\, E\!\left[ \theta - \sum_{n=0}^{N-1} a_n x[n] - a_N \right] = -2\left( E(\theta) - \sum_{n=0}^{N-1} a_n E(x[n]) - a_N \right)

 Setting this to zero results in

a_N = E(\theta) - \sum_{n=0}^{N-1} a_n E(x[n])

Page 22
Linear Bayesian Estimation
Scalar Parameter

 Continuing for the remaining coefficients a_n:

E\!\left[ \left( \theta - \sum_{n=0}^{N-1} a_n x[n] - a_N \right)^{2} \right] = E\!\left[ \left( \theta - \sum_{n=0}^{N-1} a_n x[n] - E(\theta) + \sum_{n=0}^{N-1} a_n E(x[n]) \right)^{2} \right]

= E\!\left[ \left( \sum_{n=0}^{N-1} a_n \left( x[n] - E(x[n]) \right) - \left( \theta - E(\theta) \right) \right)^{2} \right]

 Writing the sums as inner vector products with a = [a_0, a_1, …, a_{N-1}]^T leads to

E\!\left[ \left( a^T (x - E(x)) - (\theta - E(\theta)) \right)^{2} \right]
= E\!\left[ a^T (x - E(x))(x - E(x))^T a \right] - E\!\left[ a^T (x - E(x))(\theta - E(\theta)) \right] - E\!\left[ (\theta - E(\theta))(x - E(x))^T a \right] + E\!\left[ (\theta - E(\theta))^2 \right]

Page 23
Linear Bayesian Estimation
Scalar Parameter

J = E\!\left[ \left( a^T (x - E(x)) - (\theta - E(\theta)) \right)^{2} \right]
= E\!\left[ a^T (x - E(x))(x - E(x))^T a \right] - E\!\left[ a^T (x - E(x))(\theta - E(\theta)) \right] - E\!\left[ (\theta - E(\theta))(x - E(x))^T a \right] + E\!\left[ (\theta - E(\theta))^2 \right]
= a^T C_{xx} a - a^T c_{x\theta} - c_{\theta x} a + \sigma_\theta^2

 where C_{xx} is the N×N covariance matrix of x, c_{xθ} is the N×1 cross-covariance vector of x and θ (with the property c_{xθ} = c_{θx}^T), and σ_θ² is the variance of θ.
 Taking the gradient yields

\frac{\partial J}{\partial a} = 2 C_{xx} a - 2 c_{x\theta}

 Setting this to zero results in

a = C_{xx}^{-1} c_{x\theta}

Page 24
Linear Bayesian Estimation
Scalar Parameter

 Combining this with the result for a_N leads to

\hat{\theta} = E(\theta) + c_{\theta x} C_{xx}^{-1} (x - E(x))

 as the LMMSE estimator, with the corresponding Bayesian MSE

BMSE(\hat{\theta}) = \sigma_\theta^2 - c_{\theta x} C_{xx}^{-1} c_{x\theta}

 Note that this is identical in form to the MMSE estimator for jointly Gaussian x and θ. This is because in the Gaussian case the MMSE estimator happens to be linear, and hence our linearity constraint is automatically satisfied.
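
A minimal Matlab sketch applying this form to the DC level with a uniform prior, where only E(A), C_xx and c_θx are needed and no Gaussian assumption is made (the numbers are assumed, not from the slides):

  N = 10; sigma2 = 1; A0 = 1;
  sigma_A2 = A0^2/3;                         % prior variance of A (prior mean is zero)

  A = -A0 + 2*A0*rand;                       % one realization of A
  x = A + sqrt(sigma2)*randn(N,1);

  Cxx = sigma_A2*ones(N) + sigma2*eye(N);    % covariance matrix of x
  c_thetax = sigma_A2*ones(1,N);             % cross-covariance of A and x
  A_lmmse  = c_thetax*(Cxx\x);               % E(A) = 0 and E(x) = 0 here
  bmse = sigma_A2 - c_thetax*(Cxx\c_thetax');% Bayesian MSE of the LMMSE estimator

For this particular model the result coincides with α·x̄, with α = σ_A²/(σ_A² + σ²/N) as in the Gaussian-prior example, since only the first two moments enter the LMMSE estimator.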

Page 25
Linear Bayesian Estimation
Vector Parameter

 The vector LMMSE estimator is a straightforward extension of the scalar one.
 We wish to find the linear estimator that minimizes the Bayesian MSE for each element,

\hat{\theta}_i = \sum_{n=0}^{N-1} a_{in} x[n] + a_{iN}

 for i = 1, 2, …, p, and choose the weighting coefficients to minimize

J_i = E\!\left[ (\theta_i - \hat{\theta}_i)^2 \right]

 Combining the scalar LMMSE estimators leads to

\hat{\theta} = E(\theta) + C_{\theta x} C_{xx}^{-1} (x - E(x))

 and

BMSE(\hat{\theta}_i) = \sigma_{\theta_i}^2 - c_{\theta_i x} C_{xx}^{-1} c_{x\theta_i}
Page 26
Linear Bayesian Estimation
Vector Parameter

 Problem: inverse system identification
 Let the following communication scenario be given:
 Data samples y[k] (+1 or −1, uncorrelated, zero mean, σ_y² = 1) are transmitted through a discrete-time linear system given by its impulse response h = [h_0, …, h_{l−1}]. After that, additive white Gaussian noise n[k] (zero mean, variance σ_n²) is added. Your task is to
 Find the best linear system w = [w_0, …, w_{p−1}] in an LMMSE sense to estimate the data.
 Write down the estimator using the hints on the next slide.
 Write a Matlab script simulating the system with l = 4 and p = 4 (a minimal sketch is given after the hints).
 Vary σ_n² from 0.001 to 1 and observe the results.
Page 27
Linear Bayesian Estimation
Vector Parameter

 Hints (as we will see in the following lectures):
 For uncorrelated data samples and uncorrelated noise, with zero means and variances σ_y² and σ_n² respectively, we have

R_{xx} = \sigma_y^2 \left( H^H H + \frac{\sigma_n^2}{\sigma_y^2} I \right)

 as the autocorrelation matrix of the samples, and

r_{xy} = \sigma_y^2\, H^H e_i

 as the cross-correlation vector of x and y. e_i is the vector that has a one at position i and zeros at all other elements. Choose i = l+1 and the length of e_i as 7.
 H is the convolution matrix of h. Use convmtx(h,l) to obtain H in Matlab.
 Please be aware that the output sequence after the filter w is shifted by i = l+1 samples.
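
A minimal Matlab sketch of the exercise following these hints. The impulse response h is not given on the slides, so an example response is assumed, as is the overall decision delay; convmtx requires the Signal Processing Toolbox.

  l = 4; p = 4;                         % channel and equalizer lengths (from the slides)
  h = [1 0.6 0.3 0.1];                  % assumed example impulse response h0..h3
  sigma_y2 = 1;                         % data variance (given on the slides)
  sigma_n2 = 0.01;                      % noise variance; vary from 0.001 to 1

  Nsym = 10000;                         % number of transmitted symbols
  y = sign(randn(Nsym,1));              % +/-1 data, uncorrelated, zero mean
  x = filter(h,1,y) + sqrt(sigma_n2)*randn(Nsym,1);   % channel output plus WGN

  H  = convmtx(h(:), p);                % (l+p-1) x p convolution matrix (column h)
  i  = l + 1;                           % index from the hints
  ei = zeros(l+p-1,1); ei(i) = 1;

  Rxx = sigma_y2*(H'*H + (sigma_n2/sigma_y2)*eye(p)); % autocorrelation matrix
  rxy = sigma_y2*(H'*ei);                             % cross-correlation vector
  w   = Rxx \ rxy;                                    % LMMSE equalizer taps

  yhat = filter(w,1,x);                 % equalizer output (delayed w.r.t. y)
  d = i - 1;                            % assumed decision delay with this indexing
  ser = mean(sign(yhat(d+1:end)) ~= y(1:end-d));      % symbol error rate
  fprintf('sigma_n^2 = %.3f -> error rate %.4f\n', sigma_n2, ser);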
Page 28
