
Technion – Israel Institute of Technology, Department of Electrical Engineering

Estimation and Identification in Dynamical Systems (048825)


Lecture Notes, Fall 2009, Prof. N. Shimkin

4 Derivations of the Discrete-Time Kalman Filter

We derive here the basic equations of the Kalman filter (KF), for discrete-time
linear systems. We consider several derivations under different assumptions and
viewpoints:

• For the Gaussian case, the KF is the optimal (MMSE) state estimator.

• In the non-Gaussian case, the KF is derived as the best linear (LMMSE) state
estimator.

• We also provide a deterministic (least-squares) interpretation.

We start by describing the basic state-space model.

4.1 The Stochastic State-Space Model

A discrete-time, linear, time-varying state space system is given by:

xk+1 = Fk xk + Gk wk (state evolution equation)


zk = Hk xk + vk (measurement equation)

for k ≥ 0 (say), and initial conditions x0 . Here:


– Fk , Gk , Hk are known matrices.
– xk ∈ IRn is the state vector.
– wk ∈ IRnw is the state noise.
– zk ∈ IRm is the observation vector.
– vk ∈ IRm is the observation noise.
– The initial conditions are given by x0 , usually a random variable.

The noise sequences (wk , vk ) and the initial conditions x0 are stochastic processes
with known statistics.

The Markovian model

Recall that a stochastic process {Xk } is a Markov process if

p(Xk+1 |Xk , Xk−1 , . . . ) = p(Xk+1 |Xk ) .

For the state xk to be Markovian, we need the following assumption.

Assumption A1: The state-noise process {wk } is white in the strict sense, namely
all wk ’s are independent of each other. Furthermore, this process is independent of
x0 .

The following is then a simple exercise:

Proposition: Under A1, the state process {xk , k ≥ 0} is Markov.

Note:

• Linearity is not essential: The Markov property follows from A1 also for the
nonlinear state equation xk+1 = f (xk , wk ).

• The measurement process zk is usually not Markov.

• The pdf of the state can (in principle) be computed recursively via the following
(Chapman-Kolmogorov) equation:
    p(xk+1) = ∫ p(xk+1 | xk) p(xk) dxk ,

where p(xk+1 | xk) is determined by p(wk).

The Gaussian model

• Assume that the noise sequences {wk }, {vk } and the initial conditions x0 are
jointly Gaussian.

• It easily follows that the processes {xk } and {zk } are (jointly) Gaussian as
well.

• If, in addition, A1 is satisfied (namely {wk} is white and independent of x0),
then xk is a Markov process.

This model is often called the Gauss-Markov Model.

Second-Order Model

We often assume that only the first- and second-order statistics of the noise are known.
Consider our linear system:

xk+1 = Fk xk + Gk wk , k≥0

zk = Hk xk + vk ,

under the following assumptions:

• wk a 0-mean white noise: E(wk ) = 0, cov(wk , wl ) = Qk δkl .

• vk a 0-mean white noise: E(vk ) = 0, cov(vk , vl ) = Rk δkl .

• cov(wk , vl ) = 0: uncorrelated noise.

• x0 is uncorrelated with the other noise sequences.


Denote x̄0 = E(x0), cov(x0) = P0.

We refer to this model as the standard second-order model.

It is sometimes useful to allow correlation between vk and wk :

cov(wk , vl ) ≡ E(wk vlT ) = Sk δkl .

This gives the second-order model with correlated noise.

A short-hand notation for the above correlations:

         ⎡ wk ⎤   ⎡ wl ⎤       ⎡ Qk δkl    Sk δkl    0  ⎤
    cov( ⎢ vk ⎥ , ⎢ vl ⎥ )  =  ⎢ SkT δkl   Rk δkl    0  ⎥
         ⎣ x0 ⎦   ⎣ x0 ⎦       ⎣ 0         0         P0 ⎦

Note that the Gauss-Markov model is a special case of this model.

Mean and covariance propagation

For the standard second-order model, we easily obtain recursive formulas for the
mean and covariance of the state.

• The mean obviously satisfies:

    x̄k+1 = Fk x̄k + Gk E(wk) = Fk x̄k

• Consider next the covariance:

    Pk ≐ E((xk − x̄k)(xk − x̄k)T) .

  Note that xk+1 − x̄k+1 = Fk (xk − x̄k) + Gk wk, and wk and xk are uncorrelated
  (why?). Therefore

    Pk+1 = Fk Pk FkT + Gk Qk GkT .

This equation is in the form of a Lyapunov difference equation.

• Since zk = Hk xk + vk, it is now easy to compute its covariance, and also the
joint covariances of (xk , zk).

• In the Gaussian case, the pdf of xk is completely specified by the mean and
covariance: xk ∼ N(x̄k, Pk).
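These propagation equations are easy to implement. A minimal NumPy sketch (the function name and the test system below are ours, for illustration only):

```python
import numpy as np

def propagate_moments(F, G, Q, xbar, P, steps):
    """Propagate mean and covariance of the standard second-order model:
    xbar_{k+1} = F xbar_k,  P_{k+1} = F P F^T + G Q G^T
    (the latter is a Lyapunov difference equation)."""
    for _ in range(steps):
        xbar = F @ xbar
        P = F @ P @ F.T + G @ Q @ G.T
    return xbar, P
```

For a stable scalar system F = 0.5, G = Q = 1, the covariance recursion P ← 0.25 P + 1 converges to the fixed point 4/3.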

4.2 The KF for the Gaussian Case

Consider the linear Gaussian (or Gauss-Markov) model

xk+1 = Fk xk + Gk wk , k≥0

zk = Hk xk + vk

where:

• {wk } and {vk } are independent, zero-mean Gaussian white processes with
covariances
E(vk vlT ) = Rk δkl , E(wk wlT ) = Qk δkl

• The initial state x0 is a Gaussian RV, independent of the noise processes, with
x0 ∼ N(x̄0, P0).

Let Zk = (z0, . . . , zk). Our goal is to compute recursively the following optimal
(MMSE) estimator of xk:

    x̂k+ ≡ x̂k|k ≐ E(xk | Zk) .

Also define the one-step predictor of xk:

    x̂k− ≡ x̂k|k−1 ≐ E(xk | Zk−1)

and the respective covariance matrices:

    Pk+ ≡ Pk|k ≐ E{(xk − x̂k+)(xk − x̂k+)T | Zk}
    Pk− ≡ Pk|k−1 ≐ E{(xk − x̂k−)(xk − x̂k−)T | Zk−1} .

Note that Pk+ (and similarly Pk−) can be viewed in two ways:

(i) It is the covariance matrix of the (posterior) estimation error, ek = xk − x̂k+.
In particular, MMSE = trace(Pk+).

(ii) It is the covariance matrix of the “conditional RV (xk |Zk)”, namely an RV
with distribution p(xk |Zk) (since x̂k+ is its mean).

Finally, denote P0− ≐ P0, x̂0− ≐ x̄0.

Recall the formulas for conditioned Gaussian vectors:

• If x and z are jointly Gaussian, then p(x|z) ∼ N(m, Σ), with

    m = mx + Σxz Σzz−1 (z − mz) ,
    Σ = Σxx − Σxz Σzz−1 Σzx .

• The same formulas hold when everything is conditioned, in addition, on an-
other random vector.

According to the terminology above, we say in this case that the conditional RV
(x|z) is Gaussian.
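The conditioning formulas translate directly into code. A small sketch (the helper name `gaussian_conditional` is ours, not from the notes):

```python
import numpy as np

def gaussian_conditional(mx, mz, Sxx, Sxz, Szz, z):
    """Mean and covariance of (x|z) for jointly Gaussian (x, z):
    m = mx + Sxz Szz^{-1} (z - mz),  Sigma = Sxx - Sxz Szz^{-1} Szx."""
    K = Sxz @ np.linalg.inv(Szz)
    m = mx + K @ (z - mz)
    Sigma = Sxx - K @ Sxz.T          # Szx = Sxz^T
    return m, Sigma
```

For example, with z ∼ N(0, 1) and x = 2z + e, e ∼ N(0, 1) independent, one has Σxx = 5, Σxz = 2, Σzz = 1, so E(x | z = 1) = 2 and cov(x|z) = 1.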

Proposition: For the model above, all random processes (noises, xk , zk ) are jointly
Gaussian.

Proof: All can be expressed as linear combinations of the noise sequences, which
are jointly Gaussian (why?).

It follows that (xk |Zm) is Gaussian (for any k, m). In particular:

    (xk |Zk) ∼ N(x̂k+, Pk+) ,   (xk |Zk−1) ∼ N(x̂k−, Pk−) .

Filter Derivation

Suppose, at time k, that (x̂k−, Pk−) is given.

We shall compute (x̂k+, Pk+) and (x̂k+1−, Pk+1−), using the following two steps.

Measurement update step: Since zk = Hk xk + vk, the conditional vector
([xk ; zk] | Zk−1) is Gaussian, with mean and covariance:

    ⎡ x̂k−    ⎤     ⎡ Pk−      Pk− HkT ⎤
    ⎣ Hk x̂k− ⎦ ,   ⎣ Hk Pk−   Mk      ⎦

where

    Mk ≜ Hk Pk− HkT + Rk .

To compute (xk |Zk) = (xk |zk , Zk−1), we apply the above formula for conditional
expectation of Gaussian RVs, with everything pre-conditioned on Zk−1. It follows
that (xk |Zk) is Gaussian, with mean and covariance:

    x̂k+ ≐ E(xk |Zk) = x̂k− + Pk− HkT (Mk)−1 (zk − Hk x̂k−)

    Pk+ ≐ cov(xk |Zk) = Pk− − Pk− HkT (Mk)−1 Hk Pk−

Time update step: Recall that xk+1 = Fk xk + Gk wk. Further, xk and wk are inde-
pendent given Zk (why?). Therefore,

    x̂k+1− ≐ E(xk+1 |Zk) = Fk x̂k+

    Pk+1− ≐ cov(xk+1 |Zk) = Fk Pk+ FkT + Gk Qk GkT
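The two steps above translate directly into code. A sketch of one full filter cycle in NumPy (the function name and the scalar test case are ours):

```python
import numpy as np

def kf_step(F, G, H, Q, R, x_pred, P_pred, z):
    """One Kalman filter cycle: measurement update of (x^-, P^-)
    into (x^+, P^+), then time update to the next (x^-, P^-)."""
    M = H @ P_pred @ H.T + R                 # innovation covariance M_k
    K = P_pred @ H.T @ np.linalg.inv(M)      # gain P^- H^T M^{-1}
    x_post = x_pred + K @ (z - H @ x_pred)   # measurement update
    P_post = P_pred - K @ H @ P_pred
    x_next = F @ x_post                      # time update
    P_next = F @ P_post @ F.T + G @ Q @ G.T
    return x_post, P_post, x_next, P_next
```

In the scalar case F = H = 1, Q = 0, R = 1, starting from x̂− = 0, P− = 1 and observing z = 1, one gets M = 2, gain 0.5, and posterior (0.5, 0.5).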

Remarks:

1. The KF computes both the estimate x̂k+ and its MSE/covariance Pk+ (and
similarly for x̂k−).

Note that the covariance computation is needed as part of the estimator com-
putation. However, it is also of independent importance, as it assigns a measure
of uncertainty (or confidence) to the estimate.

2. It is remarkable that the conditional covariance matrices Pk+ and Pk− do not de-
pend on the measurements {zk }. They can therefore be computed in advance,
given the system matrices and the noise covariances.

3. As usual in the Gaussian case, Pk+ is also the unconditional error covariance:

    Pk+ = cov(xk − x̂k+) = E[(xk − x̂k+)(xk − x̂k+)T] .

In the non-Gaussian case, the unconditional covariance will play the central
role as we compute the LMMSE estimator.

4. Suppose we need to estimate some sk ≐ C xk.
Then the optimal estimate is ŝk = E(sk |Zk) = C x̂k+.

5. The following “output prediction error”

    z̃k ≐ zk − Hk x̂k− ≡ zk − E(zk |Zk−1)

is called the innovation, and {z̃k} is the important innovations process.
Note that Mk = Hk Pk− HkT + Rk is just the covariance of z̃k.

4.3 Best Linear Estimator – Innovations Approach

a. Linear Estimators

Recall that the best linear (or LMMSE) estimator of x given y is an estimator of
the form x̂ = Ay + b which minimizes the mean square error E(‖x − x̂‖²). It is
given by:

    x̂ = mx + Σxy Σyy−1 (y − my)

where Σxy and Σyy are the covariance matrices. It easily follows that x̂ is unbiased,
E(x̂) = mx, and the corresponding (minimal) error covariance is

    cov(x − x̂) = E(x − x̂)(x − x̂)T = Σxx − Σxy Σyy−1 ΣxyT .

We shall find it convenient to denote this estimator x̂ as E^L(x|y). Note that this is
not the standard conditional expectation.

Recall further the orthogonality principle:

    E[(x − E^L(x|y)) L(y)] = 0

for any linear function L(y) of y.

The following property will be most useful. It follows simply by using y = (y1 ; y2)
in the formulas above:

• Suppose cov(y1 , y2) = 0. Then

    E^L(x|y1 , y2) = E^L(x|y1) + [E^L(x|y2) − E(x)] .

Furthermore,

    cov(x − E^L(x|y1 , y2)) = (Σxx − Σxy1 Σy1y1−1 Σxy1T) − Σxy2 Σy2y2−1 Σxy2T .
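The LMMSE formula itself is a one-liner to implement. A small sketch (the function name is ours):

```python
import numpy as np

def lmmse(mx, my, Sxy, Syy, y):
    """Best linear estimate E^L(x|y) = mx + Sxy Syy^{-1} (y - my)."""
    return mx + Sxy @ np.linalg.inv(Syy) @ (y - my)
```

As a sanity check: if x = 3y exactly (so Σxy = 3 Σyy), the best linear estimate must recover x, e.g. y = 2 gives x̂ = 6.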

b. The innovations process

Consider a discrete-time stochastic process {zk}k≥0. The (wide-sense) innovations
process is defined as

    z̃k = zk − E^L(zk |Zk−1) ,

where Zk−1 = (z0 ; · · · ; zk−1). The innovation RV z̃k may be regarded as containing
only the new statistical information which is not already in Zk−1.

The following properties follow directly from those of the best linear estimator:

(1) E(z̃k) = 0, and E(z̃k Zk−1T) = 0.

(2) z̃k is a linear function of Zk.

(3) Thus, cov(z̃k , z̃l) = E(z̃k z̃lT) = 0 for k ≠ l.

This implies that the innovations process is a zero-mean white noise process.

Denote Z̃k = (z̃0 ; · · · ; z̃k). It is easily verified that Zk and Z̃k are linear functions of
each other. This implies that E^L(x|Zk) = E^L(x|Z̃k) for any RV x.

It follows that (taking E(x) = 0 for simplicity):

    E^L(x|Zk) = E^L(x|Z̃k) = E^L(x|Z̃k−1) + E^L(x|z̃k) = Σ_{l=0}^{k} E^L(x|z̃l) .

c. Derivation of the KF equations

We proceed to derive the Kalman filter as the best linear estimator for our linear,
non-Gaussian model. We slightly generalize the model that was treated so far by
allowing correlation between the state noise and measurement noise. Thus, we
consider the model

xk+1 = Fk xk + Gk wk ,   k ≥ 0

zk = Hk xk + vk ,

with [wk ; vk] a zero-mean white noise sequence with covariance

       ⎡ wk ⎤                 ⎡ Qk    Sk ⎤
    E( ⎣ vk ⎦ [wlT , vlT] ) = ⎣ SkT   Rk ⎦ δkl .

x0 has mean x̄0, covariance P0, and is uncorrelated with the noise sequence.

We use here the following notation:

    Zk = (z0 ; · · · ; zk)
    x̂k|k−1 = E^L(xk |Zk−1) ,   x̂k|k = E^L(xk |Zk)
    x̃k|k−1 = xk − x̂k|k−1 ,   x̃k|k = xk − x̂k|k
    Pk|k−1 = cov(x̃k|k−1) ,   Pk|k = cov(x̃k|k)

and define the innovations process

    z̃k ≜ zk − E^L(zk |Zk−1) = zk − Hk x̂k|k−1 .

Note that

    z̃k = Hk x̃k|k−1 + vk .

Measurement update: From our previous discussion of linear estimation and inno-
vations,

    x̂k|k = E^L(xk |Zk) = E^L(xk |Z̃k)
         = E^L(xk |Z̃k−1) + E^L(xk |z̃k) − E(xk)

This relation is the basis for the innovations approach. The rest follows essentially
by direct computations, and some use of the orthogonality principle. First,

    E^L(xk |z̃k) − E(xk) = cov(xk , z̃k) cov(z̃k)−1 z̃k .

The two covariances are next computed:

    cov(xk , z̃k) = cov(xk , Hk x̃k|k−1 + vk) = Pk|k−1 HkT ,

where E(xk x̃k|k−1T) = Pk|k−1 follows by orthogonality, and we also used the fact that
vk and xk are uncorrelated. Similarly,

    cov(z̃k) = cov(Hk x̃k|k−1 + vk) = Hk Pk|k−1 HkT + Rk ≐ Mk .

By substituting in the estimator expression we obtain

    x̂k|k = x̂k|k−1 + Pk|k−1 HkT Mk−1 z̃k .

Time update: This step is less trivial than before, due to the correlation between vk
and wk. We have

    x̂k+1|k = E^L(xk+1 |Z̃k) = E^L(Fk xk + Gk wk |Z̃k)
            = Fk x̂k|k + Gk E^L(wk |z̃k) .

In the last equation we used E^L(wk |Z̃k−1) = 0, since wk is uncorrelated with Z̃k−1.
Thus

    x̂k+1|k = Fk x̂k|k + Gk E(wk z̃kT) cov(z̃k)−1 z̃k
            = Fk x̂k|k + Gk Sk Mk−1 z̃k ,

where E(wk z̃kT) = E(wk vkT) = Sk follows from z̃k = Hk x̃k|k−1 + vk.

Combined update: Combining the measurement and time updates, we obtain the
one-step update for x̂k|k−1:

    x̂k+1|k = Fk x̂k|k−1 + Kk z̃k

where

    Kk ≐ (Fk Pk|k−1 HkT + Gk Sk) Mk−1
    z̃k = zk − Hk x̂k|k−1
    Mk = Hk Pk|k−1 HkT + Rk .

Covariance update: The relation between Pk|k and Pk|k−1 is exactly as before.
The recursion for Pk+1|k is most conveniently obtained in terms of Pk|k−1 directly.
From the previous relations we obtain

    x̃k+1|k = (Fk − Kk Hk) x̃k|k−1 + Gk wk − Kk vk .

Since x̃k|k−1 is uncorrelated with wk and vk,

    Pk+1|k = (Fk − Kk Hk) Pk|k−1 (Fk − Kk Hk)T + Gk Qk GkT
             + Kk Rk KkT − (Gk Sk KkT + Kk SkT GkT)

This completes the filter equations for this case.
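As a numerical sanity check of the correlated-noise one-step predictor (the function name and test values are ours): with Sk = 0 it must reduce to the standard one-step form.

```python
import numpy as np

def kf_predictor_step(F, G, H, Q, R, S, x_pred, P_pred, z):
    """One-step predictor for the correlated-noise model, E(w v^T) = S:
    x^-_{k+1} = F x^-_k + K (z - H x^-_k),  K = (F P H^T + G S) M^{-1}."""
    M = H @ P_pred @ H.T + R
    K = (F @ P_pred @ H.T + G @ S) @ np.linalg.inv(M)
    x_next = F @ x_pred + K @ (z - H @ x_pred)
    A = F - K @ H
    P_next = (A @ P_pred @ A.T + G @ Q @ G.T
              + K @ R @ K.T - G @ S @ K.T - K @ S.T @ G.T)
    return x_next, P_next
```

In the scalar case F = G = H = Q = R = P = 1, S = 0, z = 1: M = 2, K = 0.5, so x̂−_{k+1} = 0.5 and P_{k+1|k} = 0.25 + 1 + 0.25 = 1.5, matching Fk(Pk− − Kk' Hk Pk−)FkT + Gk Qk GkT with Kk' = Pk− HkT Mk−1.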

Addendum: A Hilbert space interpretation

The definitions and results concerning linear estimators can be nicely interpreted in
terms of a Hilbert space formulation.

Consider for simplicity all RVs in this section to have 0 mean.

Recall that a Hilbert space is a (complete) inner-product space. That is, it is a linear
vector space V, with a real-valued inner product operation ⟨v1 , v2⟩ which is bi-linear,
symmetric, and non-degenerate (⟨v, v⟩ = 0 iff v = 0). (Completeness means that
every Cauchy sequence has a limit.) The derived norm is defined as ‖v‖² = ⟨v, v⟩.
The following facts are standard:

1. A subspace S is a linearly-closed subset of V. Alternatively, it is the linear
span of some set of vectors {vα}.

2. The orthogonal projection ΠS v of a vector v onto the subspace S is the closest
element to v in S, i.e., the vector v′ ∈ S which minimizes ‖v − v′‖. Such a
vector exists and is unique, and satisfies (v − ΠS v) ⊥ S, i.e., ⟨v − ΠS v, s⟩ = 0
for s ∈ S.
3. If S = span{s1 , . . . , sk}, then ΠS v = Σ_{i=1}^{k} αi si, where

    [α1 , . . . , αk] = [⟨v, s1⟩, . . . , ⟨v, sk⟩] ([⟨si , sj⟩]_{i,j=1...k})−1 .

4. If S = S1 ⊕ S2 (S is the direct sum of two orthogonal subspaces S1 and S2),
then

    ΠS v = ΠS1 v + ΠS2 v .

If {s1 , . . . , sk} is an orthogonal basis of S, then

    ΠS v = Σ_{i=1}^{k} ⟨v, si⟩ ⟨si , si⟩−1 si .

5. Given a set of (independent) vectors {v1 , v2 , . . . }, the following Gram-Schmidt
procedure provides an orthogonal basis:

    ṽk = vk − Π_span{v1 ...vk−1} vk
        = vk − Σ_{i=1}^{k−1} ⟨vk , ṽi⟩ ⟨ṽi , ṽi⟩−1 ṽi .
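The Gram-Schmidt view of the innovations can be checked numerically on sample data, approximating the inner product ⟨x, y⟩ = E(xy) by a sample average. A sketch (the function name `innovations` is ours):

```python
import numpy as np

def innovations(samples):
    """Gram-Schmidt on scalar RVs given as columns of `samples`
    (rows = realizations): returns the orthogonalized columns,
    i.e. an empirical innovations sequence."""
    n, k = samples.shape
    out = np.zeros_like(samples)
    for j in range(k):
        v = samples[:, j].copy()
        for i in range(j):
            ti = out[:, i]
            v -= (v @ ti) / (ti @ ti) * ti   # subtract projection on tilde-v_i
        out[:, j] = v
    return out
```

After the procedure the columns are (numerically) mutually orthogonal, i.e. the off-diagonal entries of the Gram matrix vanish.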

We can fit the previous results on linear estimation to this framework by noting the
following correspondence:

• Our Hilbert space is the space of all zero-mean random variables x (on a given
probability space) which are square-integrable: E(x²) < ∞. The inner product
is defined as ⟨x, y⟩ = E(xy).

• The optimal linear estimator E^L(xk |Zk), with Zk = (z0 , . . . , zk), is the orthog-
onal projection of the vector xk on the subspace spanned by Zk. (If xk is
vector-valued, we simply consider the projection of each element separately.)

• The innovations process {z̃k} is an orthogonalized version of {zk}.

The Hilbert space formulation provides a nice insight, and can also provide useful
technical results, especially in the continuous-time case. However, we shall not go
deeper into this topic.

4.4 The Kalman Filter as a Least-Squares Problem

Consider the following deterministic optimization problem.

Cost function (to be minimized):

    Jk = (1/2)(x0 − x̄0)T P0−1 (x0 − x̄0)
         + (1/2) Σ_{l=0}^{k} (zl − Hl xl)T Rl−1 (zl − Hl xl)
         + (1/2) Σ_{l=0}^{k−1} wlT Ql−1 wl

Constraints:
xl+1 = Fl xl + Gl wl , l = 0, 1, . . . , k − 1

Variables:
x0 , . . . xk ; w0 , . . . wk−1 .

Here x̄0 and {zl} are given vectors, and P0, Rl, Ql are symmetric positive-definite matrices.


Let (x0(k), . . . , xk(k)) denote the optimal solution of this problem. We claim that xk(k)
can be computed exactly as x̂k|k in the corresponding KF problem.

This claim can be established by writing explicitly the least-squares solution for
k − 1 and k, and manipulating the matrix expressions.
We will take here a quicker route, using the Gaussian insight.

Theorem: The minimizing solution (x0(k), . . . , xk(k)) of the above LS problem is the
maximizer, over (x0 , . . . , xk), of the conditional probability (that is, the MAP estimator):

    p(x0 , . . . , xk | Zk)

related to the Gaussian model:

    xk+1 = Fk xk + Gk wk ,   x0 ∼ N(x̄0, P0)
    zk = Hk xk + vk ,   wk ∼ N(0, Qk), vk ∼ N(0, Rk)

with wk , vk white and independent of x0.

Proof: Write down the joint distribution p(x0 , . . . , xk , Zk).

Immediate Consequence: Since for Gaussian RVs MAP = MMSE, (x0(k), . . . , xk(k))
coincide with the conditional means; in particular, xk(k) = x̂k+.

Remark: The above theorem (but not the last consequence) holds true even for the
non-linear model: xk+1 = Fk (xk ) + Gk wk .
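The claim is easy to verify numerically in the scalar case with k = 0, where the LS problem has a closed-form (information-form) minimizer; it must equal the Kalman measurement update. The numbers below are illustrative.

```python
import numpy as np

# Scalar problem, k = 0: minimize
#   J(x) = (x - x0bar)^2 / (2 P0) + (z0 - H x)^2 / (2 R).
# Setting dJ/dx = 0 gives the information-form solution below,
# which must equal the Kalman update x0bar + K (z0 - H x0bar).
x0bar, P0, H, R, z0 = 1.0, 2.0, 1.5, 0.5, 3.0

x_ls = (x0bar / P0 + H * z0 / R) / (1.0 / P0 + H * H / R)

K = P0 * H / (H * H * P0 + R)          # Kalman gain
x_kf = x0bar + K * (z0 - H * x0bar)

assert abs(x_ls - x_kf) < 1e-12
```

With these numbers both routes give x = 1.9; the agreement holds for any positive P0, R.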

4.5 KF Equations – Basic Versions

a. The basic equations

Initial Conditions:

    x̂0− ≐ x̄0 = E(x0) ,   P0− ≐ P0 = cov(x0) .

Measurement update:

    x̂k+ = x̂k− + Kk (zk − Hk x̂k−)

    Pk+ = Pk− − Kk Hk Pk−

where Kk is the Kalman Gain matrix:

    Kk = Pk− HkT (Hk Pk− HkT + Rk)−1 .

Time update:

    x̂k+1− = Fk x̂k+ [+ Bk uk]

    Pk+1− = Fk Pk+ FkT + Gk Qk GkT

b. One-step iterations

The two-step equations may obviously be combined into a one-step update which
computes x̂k+1+ from x̂k+ (or x̂k+1− from x̂k−). For example,

    x̂k+1− = Fk x̂k− + Fk Kk (zk − Hk x̂k−)

    Pk+1− = Fk (Pk− − Kk Hk Pk−) FkT + Gk Qk GkT .

Lk ≐ Fk Kk is also known as the Kalman gain.
The iterative equation for Pk− is called the (discrete-time, time-varying) Matrix
Riccati Equation.
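The Riccati recursion can be iterated to a (numerical) steady state. A sketch (function name and the scalar example are ours): for F = G = H = Q = R = 1 the fixed point satisfies P²/(P + 1) = 1, i.e. P² − P − 1 = 0, whose positive root is the golden ratio (1 + √5)/2.

```python
import numpy as np

def riccati_step(F, G, H, Q, R, P):
    """One iteration of the Riccati difference equation for P_k^-."""
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    return F @ (P - K @ H @ P) @ F.T + G @ Q @ G.T

# Iterate the scalar example to steady state.
F = G = H = Q = R = np.array([[1.0]])
P = np.array([[1.0]])
for _ in range(200):
    P = riccati_step(F, G, H, Q, R, P)
```

This also illustrates Remark 2 of Section 4.2: the covariance iteration uses no measurements.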

c. Other important quantities

The measurement prediction, the innovations process, and the innovations covari-
ance are given by

    ẑk ≐ E(zk |Zk−1) = Hk x̂k− (+ Ik uk)

    z̃k ≐ zk − ẑk = Hk x̃k− + vk

    Mk ≐ cov(z̃k) = Hk Pk− HkT + Rk

d. Alternative Forms for the covariance update

The measurement update for the (optimal) covariance Pk may be expressed in the
following equivalent formulas:

    Pk+ = Pk− − Kk Hk Pk−
        = (I − Kk Hk) Pk−
        = Pk− − Pk− HkT Mk−1 Hk Pk−
        = Pk− − Kk Mk KkT

We mention two alternative forms:

1. The Joseph form: Noting that

    xk − x̂k+ = (I − Kk Hk)(xk − x̂k−) − Kk vk ,

it follows immediately that

    Pk+ = (I − Kk Hk) Pk− (I − Kk Hk)T + Kk Rk KkT .

This form may be more computationally expensive, but has the following
advantages:

– It holds for any gain Kk (not just the optimal one) that is used in the esti-
mator equation x̂k+ = x̂k− + Kk z̃k.

– Numerically, it is guaranteed to preserve positive-definiteness (Pk+ > 0).
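For the optimal gain, the Joseph form and the short form must agree; this is easy to confirm numerically (the random test matrices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
P_pred = A @ A.T + np.eye(3)           # a generic P^- > 0
H = rng.normal(size=(2, 3))
R = 0.5 * np.eye(2)

M = H @ P_pred @ H.T + R
K = P_pred @ H.T @ np.linalg.inv(M)    # optimal gain

P_std = P_pred - K @ H @ P_pred        # short form
I_KH = np.eye(3) - K @ H
P_joseph = I_KH @ P_pred @ I_KH.T + K @ R @ K.T

assert np.allclose(P_std, P_joseph)
```

With a non-optimal K the two would differ, and only the Joseph form would remain a valid error covariance.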

2. Information form:

    (Pk+)−1 = (Pk−)−1 + HkT Rk−1 Hk

The equivalence may be obtained via the useful Matrix Inversion Lemma:

    (A + BCD)−1 = A−1 − A−1 B (D A−1 B + C−1)−1 D A−1

where A, C are square nonsingular matrices (possibly of different size).

P−1 is called the Information Matrix. It forms the basis for the “information
filter”, which only computes the inverse covariances.
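The information form can likewise be verified against the covariance form on random data (test matrices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
P_pred = A @ A.T + np.eye(3)           # P^- > 0
H = rng.normal(size=(2, 3))
R = 0.3 * np.eye(2)

# Covariance form of the measurement update.
K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
P_post = P_pred - K @ H @ P_pred

# Information form: (P^+)^{-1} = (P^-)^{-1} + H^T R^{-1} H.
info = np.linalg.inv(P_pred) + H.T @ np.linalg.inv(R) @ H

assert np.allclose(np.linalg.inv(P_post), info)
```

This is exactly the Matrix Inversion Lemma with A = (Pk−)−1, B = HkT, C = Rk−1, D = Hk.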

e. Relation to Deterministic Observers

The one-step recursion for x̂k− is similar in form to the algebraic state observer from
control theory.

Given a (deterministic) system:

xk+1 = Fk xk + Bk uk

zk = Hk xk

a state observer is defined by

x̂k+1 = Fk x̂k + Bk uk + Lk (zk − Hk x̂k )

where Lk are gain matrices to be chosen, with the goal of obtaining x̃k ≐ (xk − x̂k) → 0
as k → ∞.

Since
x̃k+1 = (Fk − Lk Hk )x̃k ,

we need to choose Lk so that the linear system defined by Ak = (Fk − Lk Hk) is
asymptotically stable.
This is possible when the original system is detectable.

The Kalman gain automatically satisfies this stability requirement (whenever the
detectability condition is satisfied).
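As a small check of the stability requirement in the time-invariant case (the system and the gain below are illustrative, not from the notes): for a stable observer, the eigenvalues of F − LH must lie inside the unit circle.

```python
import numpy as np

F = np.array([[1.0, 1.0],
              [0.0, 1.0]])             # e.g. a discrete double integrator
H = np.array([[1.0, 0.0]])             # position is measured
L = np.array([[1.0],
              [0.25]])                 # an illustrative observer gain

A = F - L @ H                          # error dynamics matrix
eigs = np.linalg.eigvals(A)            # here both eigenvalues equal 0.5
```

Since |λ(A)| < 1, the estimation error x̃k decays to zero for this gain choice.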
