
Bayesian Estimation

 Classical approach in statistical estimation:
 θ is assumed to be a deterministic but unknown constant.

 Bayesian approach:
 We assume that θ is a random variable whose particular
realization we must estimate.
 This is the Bayesian approach, so named because its
implementation is based directly on Bayes' theorem.
 Prior knowledge about θ can be incorporated into our
estimator by assuming that θ is a random variable with a
given prior PDF.

Page 3
Bayesian Estimation

 MSE in Classical Estimation.

MSE = E_x\!\left[ (\hat{\theta} - \theta)^2 \right] = \int (\hat{\theta} - \theta)^2\, p(x)\, dx

 MSE in Bayesian Estimation:

BMSE = E_{x,\theta}\!\left[ (\theta - \hat{\theta})^2 \right] = \iint (\theta - \hat{\theta})^2\, p(x, \theta)\, dx\, d\theta

 The difference: In the Bayesian approach the averaging PDF is the joint PDF of x and θ, while the averaging PDF in the classical approach is p(x).
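
The difference can be made concrete with a small Monte Carlo experiment. The following is a minimal Matlab sketch (not from the slides; the values of N, σ², θ and the uniform prior bound A0 are assumed) approximating the classical MSE of the sample mean for one fixed θ, and the Bayesian MSE when θ is drawn anew from a U[-A0, A0] prior in every run.

  N = 10; sigma2 = 1; M = 100000;      % data length, noise variance, Monte Carlo runs

  theta0 = 0.8;                        % classical view: one fixed theta
  x = theta0 + sqrt(sigma2)*randn(N, M);
  mse_classical = mean((mean(x,1) - theta0).^2);   % averages over p(x) only

  A0 = 1;                              % Bayesian view: theta ~ U[-A0, A0]
  theta = -A0 + 2*A0*rand(1, M);       % a new theta realization for every run
  x = theta + sqrt(sigma2)*randn(N, M);            % implicit expansion over the runs
  bmse = mean((theta - mean(x,1)).^2); % averages over the joint PDF p(x, theta)

  fprintf('classical MSE (theta = %.1f): %.4f,  Bayesian MSE: %.4f\n', ...
          theta0, mse_classical, bmse);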

Page 4
Bayesian Estimation

 Underlying Experiment (for our usual simple DC level in noise, where we assume U[-A0, A0] as the prior PDF).

Taken from Kay: Fundamentals of Statistical Signal Processing, Vol 1: Estimation Theory, Prentice Hall, Upper Saddle River 2009

 Classical approach: MSE for each value of θ.


 Bayesian approach: One single MSE (which is an average over the
PDF of θ).
Page 5
Bayesian Estimation

Derivation of the Bayesian MMSE estimator


 Cost function: The Bayesian MSE

J = \iint (\theta - \hat{\theta})^2\, p(x, \theta)\, dx\, d\theta

 Note: The averaging PDF is the joint PDF of x and θ!


 We apply Bayes’ theorem
p(x, \theta) = p(\theta|x)\, p(x)
to obtain

J = \int \left[ \int (\theta - \hat{\theta})^2\, p(\theta|x)\, d\theta \right] p(x)\, dx.
 Since p(x) >= 0 for all x, if the integral in brackets can be
minimized for each x, then the Bayesian MSE will be minimized.
Page 6
Bayesian Estimation

J' = \int (\theta - \hat{\theta})^2\, p(\theta|x)\, d\theta

 Hence, fixing x so that θˆ is a scalar variable (as opposed to a general function of x), we have

\frac{\partial J'}{\partial \hat{\theta}} = \frac{\partial}{\partial \hat{\theta}} \int (\theta - \hat{\theta})^2\, p(\theta|x)\, d\theta
= \int -2\,(\theta - \hat{\theta})\, p(\theta|x)\, d\theta
= -2 \int \theta\, p(\theta|x)\, d\theta + 2\hat{\theta} \int p(\theta|x)\, d\theta
= -2\, E(\theta|x) + 2\hat{\theta}

which, when set to zero, results in

\hat{\theta} = E(\theta|x).
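
As a quick numerical illustration (not from the slides), the posterior mean can be evaluated on a grid whenever no closed form is available, e.g. for the DC level with the U[-A0, A0] prior mentioned earlier. A minimal Matlab sketch, with all numbers assumed (uses implicit expansion, R2016b or later):

  N = 10; sigma2 = 1; A0 = 1;                    % assumed values
  A_true = 0.3;                                  % one realization of the DC level
  x = A_true + sqrt(sigma2)*randn(N,1);

  A = linspace(-A0, A0, 1000);                   % grid over the prior support
  loglik = -sum((x - A).^2, 1)/(2*sigma2);       % log p(x|A) up to a constant
  post = exp(loglik - max(loglik));              % unnormalized posterior (flat prior)
  post = post/trapz(A, post);                    % normalize to unit area
  A_mmse = trapz(A, A.*post);                    % MMSE estimate = posterior mean E(A|x)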
Page 7
Bayesian Estimation

Comments:
 It is seen that the optimal estimator, in terms of minimizing the Bayesian MSE, is the mean of the posterior PDF p(θ|x).

 The posterior PDF refers to the PDF of θ after the data have been
observed. It summarizes our new state of knowledge about the
parameter.

 In contrast, p(θ) may be thought of as the prior PDF of θ, indicating the PDF before the data are observed.

 We will term the estimator that minimizes the Bayesian MSE the
minimum mean square error (MMSE) estimator.
 Intuitively, the effect of observing data will be to concentrate the
PDF of θ.

 This is because knowledge of the data should reduce our uncertainty about θ.
Page 8
Bayesian Estimation

Comments:
 The MMSE estimator will in general depend on the prior knowledge as
well as the data.
 If the prior knowledge is weak relative to that of the data, then the
estimator will ignore the prior knowledge.
 Otherwise, the estimator will be "biased" towards the prior mean. As
expected, the use of prior information always improves the estimation
accuracy.

 The choice of a prior PDF is critical in Bayesian estimation. The wrong choice will result in a poor estimator, similar to the problems of a classical estimator designed with an incorrect data model.
 Remember: the classical MSE will depend on θ, hence estimators that attempt to minimize it will usually depend on θ; the Bayesian MSE will not!
 In effect, the parameter dependency has been integrated away.
Page 9
Bayesian Estimation

 We derived the MMSE estimator for the case of continuous random variables.
 We notice that the same estimator also holds for discrete random
variables.
Example 1:

Page 10
Bayesian Estimation

 It is often not possible to find a closed-form solution for the MMSE estimator. An exception is the case where x and θ are jointly Gaussian distributed.

Page 11
Bayesian Estimation

 Example: DC level in WGN with Gaussian prior


 x[n] = A + w[n]
 Gaussian prior:

p(A) = \frac{1}{\sqrt{2\pi\sigma_A^2}} \exp\!\left( -\frac{1}{2\sigma_A^2} (A - \mu_A)^2 \right)

 with µ_A = 0
 If

p(x|A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right)

 then p(A|x) can be written as:


 1 
exp − Q( A)
p ( A | x) = ∞  2 
 1 
∫−∞exp− 2 Q( A)dA Page 12
Bayesian Estimation
 with

Q(A) = \frac{N}{\sigma^2} A^2 - \frac{2 N A \bar{x}}{\sigma^2} + \frac{A^2}{\sigma_A^2} - \frac{2\mu_A A}{\sigma_A^2} + \frac{\mu_A^2}{\sigma_A^2}

 Note that the denominator of p(A|x) does not depend on A any more, being a normalizing factor (normalizing the area below p(A|x)), and that the argument of the exponential is quadratic in A.
 Hence p(A|x) must be Gaussian. It can be shown that its mean and variance are

\mu_{A|x} = \frac{\frac{N}{\sigma^2}\,\bar{x} + \frac{\mu_A}{\sigma_A^2}}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}

\sigma_{A|x}^2 = \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}
Page 13
Bayesian Estimation

 In this form, the Bayes MMSE estimator is readily found as

\hat{A} = E(A|x) = \mu_{A|x} = \frac{\frac{N}{\sigma^2}\,\bar{x} + \frac{\mu_A}{\sigma_A^2}}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}

 For better interpretation this can be written as

\hat{A} = \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}}\,\bar{x} + \frac{\frac{\sigma^2}{N}}{\sigma_A^2 + \frac{\sigma^2}{N}}\,\mu_A = \alpha \bar{x} + (1 - \alpha)\,\mu_A

 with

\alpha = \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}}
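
A minimal Matlab sketch of this estimator (the values of N, σ², µ_A and σ_A² below are assumed, not from the slides):

  N = 20; sigma2 = 1;                  % data length and noise variance (assumed)
  mu_A = 0; sigma_A2 = 0.5;            % Gaussian prior N(mu_A, sigma_A2) (assumed)

  A = mu_A + sqrt(sigma_A2)*randn;     % draw the realization of A from its prior
  x = A + sqrt(sigma2)*randn(N,1);     % x[n] = A + w[n]

  xbar  = mean(x);
  alpha = sigma_A2/(sigma_A2 + sigma2/N);        % weighting factor, 0 < alpha < 1
  A_hat = alpha*xbar + (1 - alpha)*mu_A;         % MMSE estimate = posterior mean
  var_post = 1/(N/sigma2 + 1/sigma_A2);          % posterior variance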
Page 14
Bayesian Estimation

 Note that α is a weighting factor, since 0 < α < 1.
 When there is little data available, so that σ_A² << σ²/N, then α is small and Â ≈ µ_A.
 But as more data are observed, so that σ_A² >> σ²/N, then α ≈ 1 and Â ≈ x̄.
 The weighting factor α directly depends on the confidence in the prior knowledge, σ_A², and the confidence in the sample data, σ²/N.
 If one examines the posterior PDF, its variance

\sigma_{A|x}^2 = \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}

 decreases as N increases.
Page 15
Bayesian Estimation

 As we have seen, the posterior mean changes with increasing N:
 for small N it will approximately be µ_A,
 but it will approach x̄ for increasing N.

Taken from Kay: Fundamentals of Statistical Signal Processing, Vol 1: Estimation Theory, Prentice Hall, Upper Saddle River 2009

Page 16
Bayesian Estimation
Vector Case

Theorem: If x and θ are jointly Gaussian, where x is of dimension k×1 and θ of dimension l×1, with mean vector [E(x)^T E(θ)^T]^T and partitioned covariance matrix

C = \begin{bmatrix} C_{xx} & C_{x\theta} \\ C_{\theta x} & C_{\theta\theta} \end{bmatrix}

so that

p(x, \theta) = \frac{1}{(2\pi)^{\frac{k+l}{2}} \det^{\frac{1}{2}}(C)} \exp\!\left( -\frac{1}{2} \begin{bmatrix} x - E(x) \\ \theta - E(\theta) \end{bmatrix}^{T} C^{-1} \begin{bmatrix} x - E(x) \\ \theta - E(\theta) \end{bmatrix} \right)

then the conditional PDF p(θ|x) is also Gaussian, with

E(\theta|x) = E(\theta) + C_{\theta x} C_{xx}^{-1} (x - E(x))

C_{\theta|x} = C_{\theta\theta} - C_{\theta x} C_{xx}^{-1} C_{x\theta}
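
A minimal Matlab sketch evaluating these two formulas for an assumed example with k = 2 and l = 1 (the moments and the observed x below are made up for illustration, not from the slides):

  Ex = [0; 0]; Etheta = 1;                       % assumed means (k = 2, l = 1)
  Cxx = [2 0.5; 0.5 1];                          % assumed covariance of x
  Cxtheta = [0.8; 0.3];                          % assumed cross-covariance (k x l)
  Cthetatheta = 1;                               % assumed variance of theta
  Cthetax = Cxtheta';                            % l x k cross-covariance

  x = [0.4; -1.2];                               % an observed data vector
  Etheta_x = Etheta + Cthetax*(Cxx\(x - Ex));    % conditional mean E(theta|x)
  Ctheta_x = Cthetatheta - Cthetax*(Cxx\Cxtheta);% conditional covariance Ctheta|x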
Page 17
Bayesian Estimation
Vector Case

Theorem: If the observed data x can be modeled as

x = H\theta + w

where x is an N×1 data vector, H is a known N×p matrix, θ is a p×1 random vector with prior PDF N(µ_θ, C_θ), and w is an N×1 noise vector with PDF N(0, C_w), independent of θ, then the posterior PDF p(θ|x) is Gaussian with mean

E(\theta|x) = \mu_\theta + C_\theta H^T (H C_\theta H^T + C_w)^{-1} (x - H\mu_\theta)

and covariance

C_{\theta|x} = C_\theta - C_\theta H^T (H C_\theta H^T + C_w)^{-1} H C_\theta

 In contrast to the classical general linear model, H need not be of full rank to ensure the invertibility of H C_θ H^T + C_w.
 Note that C_{θ|x} is also the covariance matrix of the estimation error ε = θ − θˆ (ε has zero mean).

Page 18
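
As an illustration (all numbers assumed, not from the slides), the following minimal Matlab sketch evaluates the posterior mean and covariance of the theorem for the scalar DC level, where H is a column of ones; under this model it reproduces the α-weighted estimator of the previous slides.

  N = 10; sigma2 = 0.5;                % data length and noise variance (assumed)
  mu_theta = 0; C_theta = 2;           % prior N(mu_theta, C_theta) for a scalar A
  H = ones(N,1); Cw = sigma2*eye(N);

  A_true = mu_theta + sqrt(C_theta)*randn;       % draw A from its prior
  x = H*A_true + sqrt(sigma2)*randn(N,1);        % x = H*theta + w

  K = C_theta*H' / (H*C_theta*H' + Cw);          % C_theta*H'*(H*C_theta*H' + Cw)^-1
  theta_hat = mu_theta + K*(x - H*mu_theta);     % posterior mean E(theta|x)
  C_post    = C_theta - K*H*C_theta;             % posterior covariance C_theta|x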
Linear Bayesian Estimation

Linear MMSE Estimation


 Except when x and θ are jointly Gaussian the MMSE estimator
may be difficult to find.

 The situation is different when we constrain the estimator to be linear in x.

 As will be seen shortly, we do not have to assume any specific form for the joint PDF p(x, θ); knowledge of the first two moments is sufficient to derive the LMMSE estimator (compare this to the BLUE).

 That θ may be estimated from x at all is due to the assumed statistical dependence of θ on x, as summarized by the joint PDF p(x, θ).
 In particular, for a linear estimator we rely on the correlation
between θ and x.
Page 19
Linear Bayesian Estimation

Introductory Example: Assume x and θ are jointly distributed. Find the linear (actually affine) estimator

\hat{\theta} = a x + b

that minimizes the Bayesian MSE

J = E_{x,\theta}\!\left[ (\theta - \hat{\theta})^2 \right]

Solution:

\hat{\theta} = E(\theta) + \frac{\mathrm{cov}(x, \theta)}{\mathrm{var}(x)}\,(x - E(x))
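
A minimal Matlab sketch of this affine estimator, with the required moments estimated from sample pairs drawn from an assumed joint model (θ Gaussian, x a noisy copy of θ; none of these numbers come from the slides):

  M = 10000;
  theta = 1 + randn(M,1);                  % assumed prior for theta
  x = theta + 0.5*randn(M,1);              % assumed observation model

  C = cov(x, theta);
  a = C(1,2)/var(x);                       % cov(x,theta)/var(x)
  b = mean(theta) - a*mean(x);             % so that theta_hat = E(theta) + a*(x - E(x))
  theta_hat = a*x + b;
  bmse_hat = mean((theta - theta_hat).^2); % sample Bayesian MSE of the affine estimator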

Example: Derive the LMMSE estimator and the MSE for Example 1.

Page 20
Linear Bayesian Estimation
Scalar Parameter

 Aim: Find the (affine linear) estimator of the form

\hat{\theta} = \sum_{n=0}^{N-1} a_n x[n] + a_N

 that minimizes the Bayesian MSE

J = E_{x,\theta}\!\left[ (\theta - \hat{\theta})^2 \right]

 a_N compensates for nonzero means of x and θ
 It is omitted when both means are zero.

Page 21
Linear Bayesian Estimation
Scalar Parameter

 Deriving the optimal weighting coefficients:
 Starting with a_N:

\frac{\partial}{\partial a_N} E\!\left[ \left( \theta - \sum_{n=0}^{N-1} a_n x[n] - a_N \right)^{2} \right] = -2\, E\!\left[ \theta - \sum_{n=0}^{N-1} a_n x[n] - a_N \right] = -2\left( E(\theta) - \sum_{n=0}^{N-1} a_n E(x[n]) - a_N \right)

 Setting this to zero results in

a_N = E(\theta) - \sum_{n=0}^{N-1} a_n E(x[n])

Page 22
Linear Bayesian Estimation
Scalar Parameter

 Continuing for the remaining coefficients a_n:

E\!\left[ \left( \theta - \sum_{n=0}^{N-1} a_n x[n] - a_N \right)^{2} \right] = E\!\left[ \left( \theta - \sum_{n=0}^{N-1} a_n x[n] - E(\theta) + \sum_{n=0}^{N-1} a_n E(x[n]) \right)^{2} \right]

= E\!\left[ \left( \sum_{n=0}^{N-1} a_n \left( x[n] - E(x[n]) \right) - \left( \theta - E(\theta) \right) \right)^{2} \right]

 Writing the sums as inner vector products with a = [a_0, a_1, …, a_{N-1}]^T leads to

E\!\left[ \left( a^T (x - E(x)) - (\theta - E(\theta)) \right)^{2} \right]
= E\!\left[ a^T (x - E(x))(x - E(x))^T a \right] - E\!\left[ a^T (x - E(x))(\theta - E(\theta)) \right] - E\!\left[ (\theta - E(\theta))(x - E(x))^T a \right] + E\!\left[ (\theta - E(\theta))^2 \right]

Page 23
Linear Bayesian Estimation
Scalar Parameter

J = E\!\left[ \left( a^T (x - E(x)) - (\theta - E(\theta)) \right)^{2} \right]
= E\!\left[ a^T (x - E(x))(x - E(x))^T a \right] - E\!\left[ a^T (x - E(x))(\theta - E(\theta)) \right] - E\!\left[ (\theta - E(\theta))(x - E(x))^T a \right] + E\!\left[ (\theta - E(\theta))^2 \right]
= a^T C_{xx} a - a^T c_{x\theta} - c_{\theta x} a + \sigma_\theta^2

 where C_{xx} is the N×N covariance matrix of x, c_{xθ} is the N×1 cross-covariance vector of x and θ (with the property c_{xθ} = c_{θx}^T), and σ_θ² is the variance of θ.
 Taking the gradient yields

\frac{\partial J}{\partial a} = 2 C_{xx} a - 2 c_{x\theta}

 Setting this to zero results in

a = C_{xx}^{-1} c_{x\theta}

Page 24
Linear Bayesian Estimation
Scalar Parameter

 Combining this with the result for a_N leads to

\hat{\theta} = E(\theta) + c_{\theta x} C_{xx}^{-1} (x - E(x))

 as the LMMSE estimator, with the corresponding Bayesian MSE

BMSE(\hat{\theta}) = \sigma_\theta^2 - c_{\theta x} C_{xx}^{-1} c_{x\theta}

 Note that this is identical in form to the MMSE estimator for jointly Gaussian x and θ. This is because in the Gaussian case the MMSE estimator happens to be linear, and hence our linearity constraint is automatically satisfied.
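
A minimal Matlab sketch applying this form to the DC level with a uniform prior, where only E(A), C_xx and c_θx are needed and no Gaussian assumption is made (the numbers are assumed, not from the slides):

  N = 10; sigma2 = 1; A0 = 1;
  sigma_A2 = A0^2/3;                         % prior variance of A (prior mean is zero)

  A = -A0 + 2*A0*rand;                       % one realization of A
  x = A + sqrt(sigma2)*randn(N,1);

  Cxx = sigma_A2*ones(N) + sigma2*eye(N);    % covariance matrix of x
  c_thetax = sigma_A2*ones(1,N);             % cross-covariance of A and x
  A_lmmse  = c_thetax*(Cxx\x);               % E(A) = 0 and E(x) = 0 here
  bmse = sigma_A2 - c_thetax*(Cxx\c_thetax');% Bayesian MSE of the LMMSE estimator

For this particular model the result coincides with α·x̄, with α = σ_A²/(σ_A² + σ²/N) as in the Gaussian-prior example, since only the first two moments enter the LMMSE estimator.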

Page 25
Linear Bayesian Estimation
Vector Parameter

 The vector LMMSE estimator is a straightforward extension of the scalar one.
 We wish to find the linear estimator that minimizes the Bayesian MSE for each element,

\hat{\theta}_i = \sum_{n=0}^{N-1} a_{in} x[n] + a_{iN}

 for i = 1, 2, …, p, and choose the weighting coefficients to minimize

J_i = E\!\left[ (\theta_i - \hat{\theta}_i)^2 \right]

 Combining the scalar LMMSE estimators leads to

\hat{\theta} = E(\theta) + C_{\theta x} C_{xx}^{-1} (x - E(x))

 and

BMSE(\hat{\theta}_i) = \sigma_{\theta_i}^2 - c_{\theta_i x} C_{xx}^{-1} c_{x\theta_i}
Page 26
Linear Bayesian Estimation
Vector Parameter

 Problem: inverse system identification
 Let the following communication scenario be given:
 Data samples y[k] (+1 or −1, uncorrelated, zero mean, σ_y² = 1) are transmitted through a discrete-time linear system given by its impulse response h = [h_0, …, h_{l−1}]. After that, additive white Gaussian noise n[k] (zero mean, variance σ_n²) is added. Your task is to
 Find the best linear system w = [w_0, …, w_{p−1}] in an LMMSE sense to estimate the data.
 Write down the estimator using the hints on the next slide.
 Write a Matlab script simulating the system with l = 4 and p = 4 (a minimal sketch is given after the hints).
 Vary σ_n² from 0.001 to 1 and observe the results.
Page 27
Linear Bayesian Estimation
Vector Parameter

 Hints (as we will see in the following lectures):
 For uncorrelated data samples and uncorrelated noise, with zero means and variances σ_y² and σ_n² respectively, we have

R_{xx} = \sigma_y^2 \left( H^H H + \frac{\sigma_n^2}{\sigma_y^2} I \right)

 as the autocorrelation matrix of the samples, and

r_{xy} = \sigma_y^2\, H^H e_i

 as the cross-correlation vector of x and y. e_i is the vector that has a one at position i and zeros at all other elements. Choose i = l+1 and the length of e_i as 7.
 H is the convolution matrix of h. Use convmtx(h,l) to obtain H in Matlab.
 Please be aware that the output sequence after the filter w is shifted by i = l+1 samples.
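
A minimal Matlab sketch of the exercise following these hints. The impulse response h is not given on the slides, so an example response is assumed, as is the overall decision delay; convmtx requires the Signal Processing Toolbox.

  l = 4; p = 4;                         % channel and equalizer lengths (from the slides)
  h = [1 0.6 0.3 0.1];                  % assumed example impulse response h0..h3
  sigma_y2 = 1;                         % data variance (given on the slides)
  sigma_n2 = 0.01;                      % noise variance; vary from 0.001 to 1

  Nsym = 10000;                         % number of transmitted symbols
  y = sign(randn(Nsym,1));              % +/-1 data, uncorrelated, zero mean
  x = filter(h,1,y) + sqrt(sigma_n2)*randn(Nsym,1);   % channel output plus WGN

  H  = convmtx(h(:), p);                % (l+p-1) x p convolution matrix (column h)
  i  = l + 1;                           % index from the hints
  ei = zeros(l+p-1,1); ei(i) = 1;

  Rxx = sigma_y2*(H'*H + (sigma_n2/sigma_y2)*eye(p)); % autocorrelation matrix
  rxy = sigma_y2*(H'*ei);                             % cross-correlation vector
  w   = Rxx \ rxy;                                    % LMMSE equalizer taps

  yhat = filter(w,1,x);                 % equalizer output (delayed w.r.t. y)
  d = i - 1;                            % assumed decision delay with this indexing
  ser = mean(sign(yhat(d+1:end)) ~= y(1:end-d));      % symbol error rate
  fprintf('sigma_n^2 = %.3f -> error rate %.4f\n', sigma_n2, ser);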
Page 28
