
MODEL PREDICTIVE
CONTROL

Manfred Morari
Jay H. Lee
Carlos E. García

February 8, 2002

Chapter 2

Modeling and Identification


Throughout the book we will assume that the control task is carried out by
a computer as depicted in Fig. 2.1. The controlled output signal y(t) is sampled
at intervals T resulting in a sequence of measurements {y(0), y(1), . . . , y(k), . . .}
where the arguments 0, 1, . . . , k, . . . denote time measured in multiples of the
sampling interval T . On the basis of this sequence the control algorithm determines a sequence of inputs {u(0), u(1), . . . , u(k), . . .}. The input time function
u(t) is obtained by holding the input constant over a sampling interval:
  u(t) = u(k),    kT < t ≤ (k+1)T

The appropriate control action depends on the dynamic characteristics of the


plant to be controlled. Therefore, the control algorithm is usually designed
based on a model of the plant which describes its dynamic characteristics (at least
approximately). In the context of Fig. 2.1 we are looking for a plant model
which allows us to determine the sequence of outputs {y(0), y(1), . . . , y(k), . . .}
resulting from a sequence of inputs {u(0), u(1), . . . , u(k), . . .}.
In general, we are not only interested in the response of the plant to the
manipulated variable (MV) u but also in its response to other inputs, for
example, a disturbance variable (DV) d. We will use the symbol v to denote a
general input which can be manipulated (u), or a disturbance (d). Thus the
general plant model will relate a sequence of inputs {v(0), v(1), . . . , v(k), . . .}
to a sequence of outputs {y(0), y(1), . . . , y(k), . . .}.
Initially we will assume for our derivations that the system has only a single
input v and a single output y (Single-Input Single-Output (SISO) system).
Later we will generalize the modeling concepts to describe multi-input multi-output (MIMO) systems.
In this chapter we will discuss some basic modeling assumptions and introduce convolution models. We will explain how these models can be used to
predict the response of the output to an input sequence. Finally, we will give
some advice on how to obtain the models from experiments.

Figure 2.1: Sampled-data control system

2.1 Linear Time-Invariant Systems

Let us consider two specific input sequences¹

  v_1(0) = {v_1(0), v_1(1), …, v_1(k), …}
  v_2(0) = {v_2(0), v_2(1), …, v_2(k), …}

For example, v_1(0) might represent a step

  v_1(0) = {1, 1, …, 1, …}

and v_2(0) a pulse

  v_2(0) = {1, 0, …, 0, …}

¹Throughout this book, we will adopt the convention that the input sequences are zero for negative time, i.e., for the example here v(k) = 0 for k < 0.

Let us assume that for a particular system P these two input sequences give
rise to the output sequences

  y_1(0) = {y_1(0), y_1(1), …, y_1(k), …}
  y_2(0) = {y_2(0), y_2(1), …, y_2(k), …}

respectively. Symbolically, we can also write

  v_1(0) → y_1(0)   and   v_2(0) → y_2(0).

Let us consider now the two input sequences


  v_3(0) = α v_1(0) = {α v_1(0), α v_1(1), …, α v_1(k), …}

where α is a real constant, and

  v_4(0) = v_1(0) + v_2(0) = {v_1(0) + v_2(0), v_1(1) + v_2(1), …, v_1(k) + v_2(k), …}

If it is true that changing the input magnitude by a factor α changes the output magnitude by the same factor,

  α v_1(0) = v_3(0)  →  y_3(0) = α y_1(0),

and that the response to a sum of two input sequences is the sum of the two output sequences,

  v_1(0) + v_2(0) = v_4(0)  →  y_4(0) = y_1(0) + y_2(0),

for arbitrary choices of v_1(0), v_2(0) and −∞ < α < +∞, then the system P is called linear.
Linearity is a very useful property. It has the following implication: If
we know that the input sequences v1 (0), v2 (0), v3 (0), etc. yield the output
sequences y1 (0), y2 (0), y3 (0) etc., respectively, then the response to any linear
combination of input sequences is simply a linear combination of the output
sequences. We also say that the process output can be obtained by superposition.
  α_1 v_1(0) + α_2 v_2(0) + … + α_k v_k(0) + …  →  α_1 y_1(0) + α_2 y_2(0) + … + α_k y_k(0) + …

where −∞ < α_i < ∞.


In practice there are no truly linear systems. Linearity is a mathematical
idealization which allows us to utilize a wealth of very powerful theory. In
a small neighborhood around a fixed operating point the behavior of many
systems is approximately linear. Because the purpose of control is usually to
keep the system operation close to a fixed point, the setpoint, linearity is often
a very good assumption.
Let us define now the shifted input sequence

  v_1(ℓ) = {0, 0, …, 0, v_1(0), v_1(1), …, v_1(k), …}

where the first ℓ elements are zero. (Here we have simply moved the signal v_1(0) to the right by ℓ time steps.) If the resulting system output sequence is shifted in the same manner,

  y_1(ℓ) = {0, 0, …, 0, y_1(0), y_1(1), …, y_1(k), …},
and if this system property

  v(ℓ) → y(ℓ)

holds for arbitrary input sequences v(0) and arbitrary integers ℓ, then the
system is referred to as time-invariant. Time invariance implies, for example,
that if tomorrow the same input sequence is applied to the plant as today,
then the same output sequence will result. Throughout this part of the book
we will assume that linear time-invariant models are adequate for the systems
we are trying to describe and control.

2.2 System Stability

Let us perturb the input v to a system in a step-wise fashion and observe the
output y. We can distinguish the three types of system behavior depicted in
Fig. 2.2.
A stable system settles to a steady state after an input perturbation.
Asymptotically the output does not change. An unstable system does not
settle, its output continues to change. The output of an integrating system
approaches a ramp asymptotically; it changes in a linear fashion. The output
of an exponentially unstable system changes in an exponential manner.


Figure 2.2: Output responses to a step input change for stable, integrating
and unstable systems.
Most processes encountered in the process industries are stable. An important example of an integrating process is shown in Fig. 2.3B.
At the outlet of the tank a controller is installed which delivers a constant
flow. If a step change occurs in the inflow, the level will rise in a ramp-like
fashion. (Or in the case of a pulse change shown, the level changes to a different
value on a permanent basis.)
Some chemical reactors are exponentially unstable when left uncontrolled.
However, for safety reasons, reactors are usually designed to be stable so that
there is no danger of runaway when the control system fails. The control
concepts discussed in this part of the book are not applicable to exponentially
unstable systems. This system type will be dealt with in the advanced part.

2.3 Impulse Response Models

Let us inject a unit pulse

  v(0) = {1, 0, …, 0, …}

into a system at rest and observe the response

  y(0) = {h_0, h_1, …, h_n, h_{n+1}, …}.
We will assume that the system has the following characteristics:


Figure 2.3: Water tank


• h_0 = 0, i.e., the system does not react immediately to the input;

• h_k = 0 for k > n, i.e., the system settles after n time steps.

Such a system is called a Finite Impulse Response (FIR) system with the matrix of FIR coefficients

  H = [h_1, h_2, …, h_n]^T.    (2.1)

Many practical systems can be approximated well by FIR systems. Consider


the simple examples in Fig. 2.3. In one case the outflow of the tank occurs
through a fixed opening. Therefore it depends on the liquid height in the tank.
After the pulse inflow perturbation the level returns to its original value. The
system can be approximated by an FIR system. In the other case the outflow
is determined by a pump and is constant. The pulse adds liquid volume which
permanently changes the level in the tank. This integrating system cannot be
approximated by an FIR model.
For linear time-invariant systems, shifting the input pulse
  v(0) = {0, 1, 0, …, 0, …}

will simply shift the impulse response:

  y(0) = {0, h_0, h_1, …, h_n, h_{n+1}, …}.
The dynamical behavior of an FIR system is completely characterized by the
set of FIR coefficients. The system response can be computed by superposition
as follows. Any arbitrary input
  v(0) = {v(0), v(1), v(2), …}

can be represented as a sum of impulses

  v(0) =   {1, 0, 0, …} v(0)
         + {0, 1, 0, …} v(1)
         + {0, 0, 1, 0, …} v(2)
         + …

The system output is obtained by summing the impulse responses weighted by their respective impulse strengths v(i):

  y(0) =   {0, h_1, h_2, …, h_n, 0, 0, …} v(0)
         + {0, 0, h_1, h_2, …, h_n, 0, 0, …} v(1)
         + {0, 0, 0, h_1, h_2, …, h_n, 0, 0, …} v(2)
         + …
       =   {0, h_1 v(0), h_2 v(0) + h_1 v(1), h_3 v(0) + h_2 v(1) + h_1 v(2), …}


Directly by inspection we see from this derivation that at a particular time the output is given by

  y(k) = Σ_{i=1}^{n} h_i v(k−i).    (2.2)

The coefficient h_i expresses the effect of an input, which occurred i intervals in the past, on the present output y(k). In order to compute this output we need to keep the last n inputs v(k−1), v(k−2), …, v(k−n) in memory.
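To make the convolution sum (2.2) concrete, here is a minimal Python sketch (our own illustration; the function name and coefficient values are hypothetical, not from the text):

```python
import numpy as np

def fir_output(h, v):
    """Evaluate y(k) = sum_{i=1}^{n} h_i v(k-i) (eq. 2.2) for k = 0, 1, ...

    h : FIR coefficients [h_1, ..., h_n]
    v : inputs [v(0), v(1), ...]; inputs at negative times are zero.
    """
    n = len(h)
    y = np.zeros(len(v))
    for k in range(len(v)):
        for i in range(1, n + 1):
            if k - i >= 0:                 # v(k-i) = 0 for k - i < 0
                y[k] += h[i - 1] * v[k - i]
    return y

# A unit pulse input reproduces the impulse response {0, h_1, ..., h_n, 0, ...}:
h = np.array([0.5, 0.3, 0.15, 0.05])       # hypothetical coefficients
print(fir_output(h, np.array([1.0, 0, 0, 0, 0, 0])))
# [0.   0.5  0.3  0.15 0.05 0.  ]
```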

2.4 Step Response Models

For a unit input step

  v(0) = {1, 1, …, 1, …}

the system output will be

  y(0) = {0, h_1, h_1 + h_2, h_1 + h_2 + h_3, …}    (2.3)
       = {0, s_1, s_2, s_3, …}    (2.4)

Here we have defined the step response coefficients s_1, s_2, s_3, …. For linear time-invariant systems the shifted step

  v(0) = {0, 1, 1, …, 1, …}

will give rise to a shifted step response

  y(0) = {0, 0, s_1, s_2, s_3, …}.


The matrix of step response coefficients

  S = [s_1, s_2, …, s_n]^T    (2.5)

is a complete description of how a particular input affects the system output when the system is at rest. Any arbitrary input
  v(0) = {v(0), v(1), v(2), …}

can be represented as a sum of steps

  v(0) =   {1, 1, 1, 1, …} v(0)
         + {0, 1, 1, 1, …} (v(1) − v(0))
         + {0, 0, 1, 1, …} (v(2) − v(1))
         + …

where we will define Δv(i) to denote the input changes

  Δv(i) = v(i) − v(i−1)    (2.6)

from one time step to the next. The system output is obtained by summing the step responses weighted by their respective step heights Δv(i):

  y(0) =   {0, s_1, …, s_n, s_n, s_n, …} Δv(0)
         + {0, 0, s_1, …, s_n, s_n, s_n, …} Δv(1)
         + {0, 0, 0, s_1, …, s_n, s_n, s_n, …} Δv(2)
         + …
       =   {0, s_1 Δv(0), s_2 Δv(0) + s_1 Δv(1), s_3 Δv(0) + s_2 Δv(1) + s_1 Δv(2), …,
            s_n Δv(0) + s_{n−1} Δv(1) + s_{n−2} Δv(2) + … + s_1 Δv(n−1),
            s_n (Δv(0) + Δv(1)) + s_{n−1} Δv(2) + s_{n−2} Δv(3) + … + s_1 Δv(n), …}

(note that Δv(0) + Δv(1) = v(1), since v(−1) = 0).

We see that at a particular time the output is given by

  y(k) = Σ_{i=1}^{n−1} s_i Δv(k−i) + s_n v(k−n).    (2.7)

The coefficient s_i expresses the effect of an input change, which occurred i intervals in the past, on the present output y(k). In order to compute this output we need to keep the last n inputs in memory.

The step response coefficients are directly calculable from the impulse response coefficients and vice versa:
  s_k = Σ_{i=1}^{k} h_i    (2.8)

  h_k = s_k − s_{k−1}    (2.9)
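As a small numerical illustration of (2.8) and (2.9) (our own sketch, with hypothetical coefficient values), each conversion is a single array operation:

```python
import numpy as np

h = np.array([0.5, 0.3, 0.15, 0.05])    # hypothetical impulse response coefficients

s = np.cumsum(h)                        # eq. (2.8): s_k = h_1 + ... + h_k
h_back = np.diff(s, prepend=0.0)        # eq. (2.9): h_k = s_k - s_{k-1} (s_0 = 0)

print(s)          # [0.5  0.8  0.95 1.  ]
print(h_back)     # recovers h exactly
```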


2.5 Multi-Step Prediction

The objective of the controller is to determine the control action such that
a desirable output behavior results in the future. Thus we need to be able
to predict efficiently the future output behavior of the system. This future
behavior will be a function of past inputs to the process and future inputs,
i.e., inputs which we are considering to take in the future. We will separate the
effects of the past inputs and the future inputs on the output. All past input
information will be summarized in the dynamic state of the system. Thus the
future output behavior will be determined by the present system state and the
present and future inputs to the system.
Usually the representation of the state is not unique. For example, for an
FIR system we could choose the past n inputs as the state x.
  x(k) = [v(k−1), v(k−2), …, v(k−n)]^T    (2.10)

Clearly this state summarizes all relevant past input information for an FIR
system and allows us to compute the future evolution of the system when we
are given the present input v(k) and the future inputs v(k + 1), v(k + 2), . . ..
There are other possible choices. For example, instead of the n past inputs
we could choose the effect of the past inputs on the future outputs at the next
n steps as the state. In other words, we define as the state the present output
and the n−1 future outputs assuming that the present and future inputs
are zero. The two states are equivalent in that they both have dimension n
and are related uniquely by a linear map.
The latter choice of state will prove more convenient for predictive control
computations. It shows explicitly how the system will evolve when there is no
control action and therefore allows us to determine easily what control action
should be taken to achieve a specified behavior of the outputs in the future.
In the next sections we will discuss how to do multi-step prediction for FIR
and step-response models based on this state representation.

2.5.1 Multi-Step Prediction Based on FIR Model

Let us define the state at time k as

  Y(k) = [ỹ_0(k), ỹ_1(k), …, ỹ_{n−1}(k)]^T    (2.11)

where

  ỹ_i(k) = y(k+i) for v(k+j) = 0, j ≥ 0.    (2.12)

Thus we have defined the i-th system state ỹ_i(k) as the system output at time k+i under the assumption that the system inputs are zero from time k into the future (v(k+j) = 0, j ≥ 0). This state completely characterizes the evolution of the system output under the assumption that the present and future


inputs are zero. In order to determine the future output we simply add to the
state the effect of the present and future inputs using (2.2).
  y(k+1) = ỹ_1(k) + h_1 v(k)    (2.13)
  y(k+2) = ỹ_2(k) + h_1 v(k+1) + h_2 v(k)    (2.14)
  y(k+3) = ỹ_3(k) + h_1 v(k+2) + h_2 v(k+1) + h_3 v(k)    (2.15)
  y(k+4) = …    (2.16)

We can put these equations in matrix form

  [ y(k+1) ]   [ ỹ_1(k) ]   [ h_1 ]          [ 0       ]                  [ 0   ]
  [ y(k+2) ]   [ ỹ_2(k) ]   [ h_2 ]          [ h_1     ]                  [ ⋮   ]
  [   ⋮    ] = [    ⋮   ] + [  ⋮  ] v(k) +   [    ⋮    ] v(k+1) + ⋯ +   [ 0   ] v(k+p−1)
  [ y(k+p) ]   [ ỹ_p(k) ]   [ h_p ]          [ h_{p−1} ]                  [ h_1 ]

where the first term comes from the state Y(k) (the effect of past inputs) and the remaining terms give the effect of the future inputs (yet to be determined),

and note that the first term is a part of the state and reflects the effect of
the past inputs. The other terms express the effect of the hypothesized future
inputs. They are simply the responses to impulses occurring at the future time
steps.
In order to obtain the state at k+1, which according to the definition is

  Y(k+1) = [ỹ_0(k+1), ỹ_1(k+1), …, ỹ_{n−1}(k+1)]^T    (2.17)

with

  ỹ_i(k+1) = y(k+1+i) for v(k+1+j) = 0, j ≥ 0,    (2.18)

we need to add the effect of the input v(k) at time k to the state Y(k):

  ỹ_0(k+1) = ỹ_1(k) + h_1 v(k)    (2.19)
  ỹ_1(k+1) = ỹ_2(k) + h_2 v(k)    (2.20)
       ⋮                          (2.21)
  ỹ_{n−1}(k+1) = ỹ_n(k) + h_n v(k)    (2.22)

We note that ỹ_n(k) was not a part of the state at time k, but we know it to be 0 because of the FIR assumption. By defining the matrix

  M = [ 0 1 0 ⋯ 0 0 ]
      [ 0 0 1 ⋯ 0 0 ]
      [ ⋮         ⋮ ]
      [ 0 0 0 ⋯ 0 1 ]
      [ 0 0 0 ⋯ 0 0 ]    (2.23)

we can express the state update compactly as

  Y(k+1) = M Y(k) + H v(k).    (2.24)

Multiplication with the matrix M in the above represents the simple operation of shifting the vector Y(k) and setting the last element of the resulting vector to zero.
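A minimal numerical sketch of the update (2.24) (our own illustration; the helper name is hypothetical) makes the shift-and-zero structure of M explicit:

```python
import numpy as np

def fir_state_update(Y, h, v_k):
    """One step of Y(k+1) = M Y(k) + H v(k) (eq. 2.24).

    Y   : state [y~_0(k), ..., y~_{n-1}(k)]
    h   : FIR coefficients H = [h_1, ..., h_n]
    v_k : input applied at time k
    """
    # M Y(k): shift the vector up by one and set the last element to zero
    Y_shifted = np.concatenate([Y[1:], [0.0]])
    return Y_shifted + h * v_k           # add H v(k)

# Starting from rest, a unit pulse loads the impulse response into the state:
h = np.array([0.5, 0.3, 0.15, 0.05])
Y = fir_state_update(np.zeros(4), h, 1.0)
print(Y)                                 # [0.5  0.3  0.15 0.05]
```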

2.5.2 Recursive Multi-Step Prediction Based on Step-Response Model

Let us define the state at time k as

  Y(k) = [ỹ_0(k), ỹ_1(k), …, ỹ_{n−1}(k)]^T    (2.25)

where

  ỹ_i(k) = y(k+i) for Δv(k+j) = 0, j ≥ 0.    (2.26)

Thus, in this case, we have defined the state ỹ_i(k) as the system output at time k+i under the assumption that the input changes are zero from time k into the future (Δv(k+j) = 0, j ≥ 0). Note that because of the FIR assumption the step response settles after n steps, i.e., ỹ_{n−1}(k) = ỹ_n(k) = …. Hence, the choice of state Y(k) completely characterizes the evolution of the system output under the assumption that the present and future input changes are zero. In order to determine the future output we simply add to the state the effect of the present and future input changes. From (2.7) we find
  y(k+1) = Σ_{i=1}^{n−1} s_{i+1} Δv(k−i) + s_n v(k−n) + s_1 Δv(k)    (2.27)
         = ỹ_1(k) + s_1 Δv(k).    (2.28)

Continuing for k+2, k+3, … we find

  y(k+2) = ỹ_2(k) + s_1 Δv(k+1) + s_2 Δv(k)    (2.29)
  y(k+3) = ỹ_3(k) + s_1 Δv(k+2) + s_2 Δv(k+1) + s_3 Δv(k)    (2.30)
  y(k+4) = …    (2.31)

We can put these equations in matrix form

  [ y(k+1) ]   [ ỹ_1(k) ]   [ s_1 ]           [ 0       ]                   [ 0   ]
  [ y(k+2) ]   [ ỹ_2(k) ]   [ s_2 ]           [ s_1     ]                   [ ⋮   ]
  [   ⋮    ] = [    ⋮   ] + [  ⋮  ] Δv(k) +   [    ⋮    ] Δv(k+1) + ⋯ +   [ 0   ] Δv(k+p−1)    (2.32)
  [ y(k+p) ]   [ ỹ_p(k) ]   [ s_p ]           [ s_{p−1} ]                   [ s_1 ]

and note that the first term is a part of the state and reflects the effect of the past inputs. The other terms express the effect of the hypothesized future input changes. They are simply the responses to steps occurring at the future time steps.
In order to obtain the state at k+1, which according to the definition is

  Y(k+1) = [ỹ_0(k+1), ỹ_1(k+1), …, ỹ_{n−1}(k+1)]^T    (2.33)

with

  ỹ_i(k+1) = y(k+1+i) for Δv(k+1+j) = 0, j ≥ 0,    (2.34)

we need to add the effect of the input change Δv(k) at time k to the state Y(k):

  ỹ_0(k+1) = ỹ_1(k) + s_1 Δv(k)    (2.35)
  ỹ_1(k+1) = ỹ_2(k) + s_2 Δv(k)    (2.36)
       ⋮                           (2.37)
  ỹ_{n−1}(k+1) = ỹ_n(k) + s_n Δv(k)    (2.38)

We note that ỹ_n(k) = ỹ_{n−1}(k) because of the FIR assumption. By defining the matrix

  M = [ 0 1 0 ⋯ 0 0 ]
      [ 0 0 1 ⋯ 0 0 ]
      [ ⋮         ⋮ ]
      [ 0 0 0 ⋯ 0 1 ]
      [ 0 0 0 ⋯ 0 1 ]    (2.39)

we can express the state update compactly as

  Y(k+1) = M Y(k) + S Δv(k).    (2.40)


Multiplication with the matrix M denotes the operation of shifting the vector
Y (k) and repeating the last element. The recursive relation (2.40) is referred
to as the step response model of the system.
As is apparent from the derivation, the FIR and the step response models are very similar. The definition of the state is slightly different: in the FIR model the future inputs are assumed to be zero; in the step response model the future input changes are kept zero. Also the input representation is different: for the FIR model the future inputs are given in terms of pulses, for the step response model the future inputs are steps. Because the step response model expresses the future inputs in terms of changes Δv, it will be very convenient for incorporating integral action in the controller, as we will show.
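The update (2.40) differs from the FIR update (2.24) only in that the matrix M repeats the last (settled) state element instead of zeroing it, and the input enters as a change Δv. A minimal sketch (our own illustration):

```python
import numpy as np

def step_model_update(Y, s, dv_k):
    """One step of Y(k+1) = M Y(k) + S dv(k) (eq. 2.40).

    Y    : state [y~_0(k), ..., y~_{n-1}(k)]
    s    : step response coefficients S = [s_1, ..., s_n]
    dv_k : input change dv(k) = v(k) - v(k-1)
    """
    # M Y(k): shift up by one and repeat the last element
    Y_shifted = np.concatenate([Y[1:], Y[-1:]])
    return Y_shifted + s * dv_k          # add S dv(k)

# A unit step from rest (dv(0) = 1, dv(k) = 0 afterwards) loads the step
# response into the state and then holds the settled value:
s = np.array([0.5, 0.8, 0.95, 1.0])
Y = step_model_update(np.zeros(4), s, 1.0)
print(Y)                                 # [0.5  0.8  0.95 1.  ]
print(step_model_update(Y, s, 0.0))      # [0.8  0.95 1.   1.  ]
```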

2.5.3 Multivariable Generalization

The model equation (2.40) generalizes readily to the case when the system has n_y outputs y_ℓ, ℓ = 1, …, n_y and n_v inputs v_j, j = 1, …, n_v. We define the output vector

  y(k−1) = [y_1(k−1), …, y_{n_y}(k−1)]^T,

the input vector

  v(k−1) = [v_1(k−1), …, v_{n_v}(k−1)]^T

and the step response coefficient matrix

  S_i = [ s_{1,1,i}    s_{1,2,i}    ⋯  s_{1,n_v,i}   ]
        [ s_{2,1,i}                                   ]
        [ ⋮                                           ]
        [ s_{n_y,1,i}  s_{n_y,2,i}  ⋯  s_{n_y,n_v,i} ]

where s_{ℓ,m,i} is the i-th step response coefficient relating the m-th input to the ℓ-th output. The (n_y·n) × n_v step response matrix is obtained by stacking up the step response coefficient matrices:

  S = [S_1^T, S_2^T, S_3^T, …, S_n^T]^T.
The state of the multiple-output system at time k is defined as

  Y(k) = [ỹ_0(k), ỹ_1(k), …, ỹ_{n−1}(k)]^T

with

  ỹ_i(k) = y(k+i) for Δv(k+j) = 0, j ≥ 0.    (2.41)

With these definitions an equivalent update equation results as in the single-variable case (2.40):

  Y(k+1) = M Y(k) + S Δv(k).    (2.42)


Here M is defined as

  M = [ 0 I 0 ⋯ 0 0 ]
      [ 0 0 I ⋯ 0 0 ]
      [ ⋮         ⋮ ]
      [ 0 0 0 ⋯ 0 I ]
      [ 0 0 0 ⋯ 0 I ]

where the identity matrix I is of dimension n_y × n_y. The recursive relation (2.42) is referred to as the step response model of the multivariable system. It was proposed in the original formulation of Dynamic Matrix Control.

2.6 Examples

The following examples illustrate different types of step response models.


Example 1: SISO process with deadtime and disturbance. The step responses in Fig. 2.4 show the effect of side draw (manipulated variable) and intermediate reflux duty (disturbance) on side endpoint in the heavy crude fractionator example problem. The sampling time is 7 minutes and 35 step response coefficients were generated. Note the deadtimes of approximately 14 minutes in both responses. (The transfer functions used for the simulation were g_sd = 5.72 e^{−14s}/(60s+1) and g_ird = 1.52 e^{−15s}/(25s+1).)

Example 2: SISO process with inverse response. The step response in Fig. 2.5 shows the experimentally observed effect of a change in steam flowrate on product concentration in a multieffect evaporator system. The long-term effect of the steam rate increase is to increase evaporation and thus the concentration. The initial decrease of concentration is caused by an increased (diluting) flow from adjacent vaporizers. (The transfer function used for the simulation was g = 2.69(−6s+1)e^{−1.5s}/((20s+1)(5s+1)).)

The control of systems with inverse response is a special challenge: the controller must not be misled by the inverse-response effect and its action must be cautious.

Example 3: MIMO process. Figure 2.6 shows the response of overhead (y_1) and bottom (y_2) compositions of a high purity distillation column to changes in reflux (u_1) and boilup (u_2). Note that all the responses are very similar, which will cause control problems as we will see later. (The transfer matrix used for the simulation was

  G = 1/(75s+1) [ 0.878  −0.864 ]
                [ 1.082  −1.096 ].)


Figure 2.4: Responses of side endpoint for step change in side draw (left) and intermediate reflux duty (right). (n = 35, T = 7)

Figure 2.5: Response of evaporator product concentration to steam flowrate (n = 25, T = 3).



Figure 2.6: Response of bottom and top composition of a high purity distillation column to step changes in boilup and reflux (n = 30, T = 10). (Panels: u_1 step response → y_1; u_2 step response → y_1; u_1 step response → y_2; u_2 step response → y_2.)

2.7 Identification

Two approaches are available for obtaining the models needed for prediction
and control move computation. One can derive the differential equations representing the various material, energy and momentum balances and solve these
equations numerically. Thus one can simulate the response of the system to a
step input change and obtain a step response model. This fundamental modeling approach is quite complex and requires much engineering effort, because
the physicochemical phenomena have to be understood and all the process
parameters have to be known or determined from experiments. The main advantage of the procedure is that the resulting differentialequation models are
usually valid over wide ranges of operating conditions.
The second approach to modeling is to perturb the inputs of the real process and record the output responses. By relating the inputs and outputs a
process model can be derived. This approach is referred to as process identification. Especially in situations when the process is complex and the fundamental
phenomena are not well understood, experimental process identification is the
preferred modeling approach in industry.
In this section, we will discuss the direct identification of an FIR model.
The primary advantage of fitting an FIR model is that the only parameters
to be specified by the user are the sampling time and the response length n.
This is in contrast with other techniques which use more structured models
and thus require that a structure identification step be performed first.

17

February 8, 2002

Figure 2.7: Step response


One of the simplest ways to obtain the step response coefficients is to step
up or step down one of the inputs. The main drawbacks are that it is not always
easy to carry out the experiments and even if they are successful the model
is usually valid only in an operating region close to where the experiment was
performed. In addition, measurement noises and disturbances can significantly
corrupt the result, especially during initial transient periods.
For the experimental identification approaches, a number of issues need to
be addressed.

2.7.1 Settling Time

Very few physical processes are actually FIR but the step response coefficients
become essentially constant after some time. Thus, we have to determine by
inspection a time beyond which the step response coefficients do not change
appreciably and define an appropriate truncated step response model which is
FIR. If we choose this time too short the model is in error and the performance
of a controller based on this model will suffer. On the other hand, if we choose
this time too long the model will include many step response coefficients, which
makes the prediction and control computation unwieldy.

2.7.2 Sampling Time

Usually the minimum sampling interval is dictated by the control computer.


However, often this minimum sampling interval is smaller than necessary for


effective control. For example, if a process has a deadtime of ten minutes it is unnecessary to sample the output and recompute the control action every 30 sec. Usually faster sampling leads to better closed-loop performance, but the closed-loop performance of a process with a ten-minute deadtime is severely limited by the deadtime (no reaction for ten minutes) and fast sampling does not yield any performance gains. The following rule of thumb covers most practical cases:

  Sampling time = max {0.03 × settling time, 0.3 × deadtime}

For large deadtimes the sampling time is determined by the deadtime, for small ones by the settling time. Typically step response models with about 30 step response coefficients are employed in practice (Settling Time/Sampling Time ≈ 30). If the number of coefficients necessary to describe the response adequately is significantly smaller or larger, then the sampling time should be decreased or increased, respectively.
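As a worked instance of the rule, take Example 1 of Section 2.6: there n = 35 coefficients at T = 7 min imply a settling time of about 245 min, and the deadtime is about 14 min, so the rule gives max{0.03 × 245, 0.3 × 14} = max{7.35, 4.2} ≈ 7 min, consistent with the sampling time used in that example.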

2.7.3 Choice of the Input Signal for Experimental Identification

The simplest test is to step up or step down the inputs to obtain the step response coefficients. When one tries to determine the impulse response directly
from step experiments one is faced with several experimental problems.
At the beginning of the experiment the plant has to be at steady state
which is often difficult to accomplish because of disturbances over which the
experimenter has no control. A typical response starting from a nonzero
initial condition is shown in Fig. 2.8 B. Disturbances can also occur during
the experiment leading to responses like that in Fig. 2.8 C.
Finally, there might be so much noise that the step response is difficult to recognize (Fig. 2.8 D). It is then necessary to increase the magnitude of the input step and thus to increase the signal-to-noise ratio of the output signal. Large steps, however, are frowned upon by the operating personnel, who are worried about, for example, off-spec products when the operating conditions are disturbed too much.

In the presence of significant noise, disturbances and non-steady-state initial conditions, it is better to choose input signals that are more random in nature. This idea of randomness of signals will be made more precise in the advanced section of the book. Input signals that are more random should help reduce the adverse effects due to non-steady-state initial conditions, disturbances and noise. However, we recommend any engineer to try a step test first since it is the easiest experiment.
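One common choice of such a "more random" test signal is a binary sequence that switches between two levels after random hold times. The following is a minimal sketch (our own illustration; the helper and its parameters are hypothetical), which keeps the input gentle on actuators while still exciting a range of frequencies:

```python
import numpy as np

def random_binary_signal(n_samples, min_hold=3, amplitude=1.0, seed=0):
    """Generate a +/- amplitude test signal, holding each level for a
    random number of samples between min_hold and 2*min_hold."""
    rng = np.random.default_rng(seed)
    u = np.empty(n_samples)
    level, i = amplitude, 0
    while i < n_samples:
        hold = int(rng.integers(min_hold, 2 * min_hold + 1))
        u[i:i + hold] = level            # hold the current level ...
        level = -level                   # ... then switch sign
        i += hold
    return u

u = random_binary_signal(200, min_hold=5)
```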


Figure 2.8: (A) True step response. Erroneous step responses caused by (B) non-steady-state initial conditions, (C) unmeasured disturbances, and (D) measurement noise.

2.7.4 The Linear Least Squares Problem

In order to explain the basic technique to identify an FIR model for arbitrary
input signals, we need to introduce some basic concepts of linear algebra. Let
us assume that the following system of linear equations is given:
  b = Ax    (2.43)

where the matrix A has more rows than columns. That is, there exist more
equations than there are unknowns; the linear system is overspecified. Let us
also assume that A has full column rank. This means that the rank of A is
equal to the dimension of the solution vector x.
This system of equations has no exact solution because there are not
enough degrees of freedom to satisfy all equations simultaneously. However,
one can find a solution that makes the left hand and right hand sides close
to each other. One way is to find a vector x that minimizes the sum of squares
of the equation residuals. Let us denote the vector of residuals as

  ε = b − Ax    (2.44)

The optimization problem that finds this solution is formulated as follows:

  min_x ε^T ε = min_x (b − Ax)^T (b − Ax)    (2.45)

The necessary condition for optimality yields the equations²

  d(ε^T ε)/dx = −2 A^T (b − Ax) = 0    (2.47)

Note that the second-order derivative

  d²(ε^T ε)/dx² = 2 A^T A    (2.48)

is positive definite because of the rank condition on A, and thus the solution of (2.47) for x,

  x = (A^T A)^{−1} A^T b,    (2.49)

minimizes the objective function. This solution of a set of overspecified linear equations is denoted as a least squares solution. The matrix (A^T A)^{−1} A^T is denoted as the generalized inverse of the matrix A. Note that in case A is square, its generalized inverse is A^{−1}, the exact inverse.

In some cases, it may be desirable to weigh each component of the residual vector differently in finding x. The solution that minimizes the 2-norm of the weighted residual vector Γε is

  x = (A^T Γ^T Γ A)^{−1} A^T Γ^T Γ b    (2.50)

Note that formulas (2.49) and (2.50) are algebraically correct, but are
never used for numerical computations. Most available software employs the
QR algorithm for determining the least squares solution. The reader is referred
to the appropriate numerical analysis literature for more information.
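A small sketch of that point (our own illustration with made-up data): the normal-equations formula (2.49) and an orthogonalization-based library solver agree on a well-conditioned problem, but the latter is what one should call in practice:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))     # 20 equations, 3 unknowns, full column rank
b = rng.standard_normal(20)

x_normal = np.linalg.inv(A.T @ A) @ A.T @ b        # eq. (2.49), fine on paper
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]     # factorization-based solver

print(np.allclose(x_normal, x_lstsq))              # True
```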

2.7.5 Linear Least Squares Identification

The standard parameter fitting algorithm employed in identification is the least


squares algorithm or one of its many variants. In the most general case it fits
²The derivative of the quadratic expression f^T f, where f = f(x), is defined as follows:

  d(f^T f)/dx = 2 (df/dx)^T f    (2.46)

a model relating one output variable of a process to multiple inputs (the same
algorithm is used for all outputs from the same test data). When many inputs
vary simultaneously they must be uncorrelated so that the individual models
can be obtained. Directly fitting an FIR model provides many advantages to an
inexperienced user. For example, apart from the settling time of the process,
it requires essentially no other a priori information about the process. Here we
give a brief derivation of the least squares identification technique and some
modifications to overcome the drawbacks of directly fitting an FIR model.
Let us assume the process has a single output y and a set of inputs v_j, j = 1, …, n_v. Let us also assume that the process is described by an impulse response model as follows:

  y(k) = Σ_{j=1}^{n_v} Σ_{i=1}^{n} h_{j,i} v_j(k−i) + ε(k−1)    (2.51)

where ε(k−1) collects any effects on y not described by the model.


We assume that n is selected sufficiently large so that only has to account for unmeasured input effects such as noise, disturbances, and bias in the
steady-state values.
In practice, it is difficult to define the steady states for inputs and outputs
because they tend to change with time according to process disturbances.
Hence, for the purpose of identification, it is common to re-express the model
in terms of incremental changes in the inputs and outputs from sample time
to sample time:
y(k) =

nv X
n
X

hj,i vj(k i) + (k 1)

(2.52)

j=1 i=1

(This model can be obtained directly by writing (2.51) for the times k and
k 1 and differencing.) We assume that is independent of, or uncorrelated
to v.
If we collect the values of the output and inputs over N+n intervals of time from an on-line experiment we can estimate the impulse response coefficients of the model as follows. We can write (2.52) over the past N intervals of time up to the current time N+n:

  Y_N = V_N θ + E_N    (2.53)

where

  Y_N = [Δy(n+1), Δy(n+2), …, Δy(n+N)]^T
  E_N = [Δε(n+1), Δε(n+2), …, Δε(n+N)]^T
  θ = [h_{1,1}, …, h_{1,n}, h_{2,1}, …, h_{n_v,n}]^T

and

  V_N = [ Δv_1(n)      ⋯  Δv_1(1)  Δv_2(n)      ⋯  Δv_{n_v}(1) ]
        [ Δv_1(n+1)    ⋯  Δv_1(2)  Δv_2(n+1)    ⋯  Δv_{n_v}(2) ]
        [ ⋮                        ⋮                            ]
        [ Δv_1(n+N−1)  ⋯  Δv_1(N)  Δv_2(n+N−1)  ⋯  Δv_{n_v}(N) ]    (2.54)

A natural objective for selecting θ is to minimize the square norm of the residual vector E_N = Y_N − V_N θ. With this objective, the parameter estimation problem is reduced to the least-squares problem discussed in the previous section:

  Y_N ≈ V_N θ    (2.55)

where Y_N is the vector of past output measurements, V_N is the matrix containing all past input measurements, and θ is the vector of parameters to be identified. If the number of data points N is larger than the total number of parameters in θ (n_v·n), the following formula for the least squares estimate of θ for N data points, that is, the choice of θ that minimizes the quantity (Y_N − V_N θ)^T (Y_N − V_N θ), can be found from (2.49):

  θ_N = [V_N^T V_N]^{−1} V_N^T Y_N.    (2.56)

In practical identification experiments, the expected variance of the error Δε(k) for each data point may vary. For example, the engineer may know that the data points obtained during a certain time interval are corrupted by more severe noise than others. It is logical that the errors for the data points that are likely to be inaccurate are weighed less in calculating θ than others. This can be easily done by finding θ that minimizes the weighted square of the residuals. Higher weights are assigned to data points that are believed to be more accurate. Once again, this is formulated into a least squares problem whose objective is to minimize

  (Y_N − V_N θ)^T Γ^T Γ (Y_N − V_N θ)    (2.57)

The solution follows from (2.50):

  θ_N = [V_N^T Γ^T Γ V_N]^{−1} V_N^T Γ^T Γ Y_N.    (2.58)

It can be shown that if the underlying system is truly linear and n is large enough so that Δε and Δv are uncorrelated, this estimate is unbiased (i.e., it is expected to be right on average). Also, the estimate converges to the true value as the number of data points N becomes large, under some mild conditions. Reliable software exists to obtain the least squares estimates of the parameters θ.
A major drawback of this approach is that a large number of data points needs to be collected because of the many parameters to be fitted. This is required because in the presence of noise the variance of the parameters could be so large as to render the fit useless. Often, the resulting step response will be non-smooth with many sharp peaks. One simple approach to alleviate this difficulty is to add a penalty term on the magnitude of the changes of the step response coefficients, i.e., the impulse response coefficients, to be identified. In other words, θ is found such that the quantity (Y_N − V_N θ)^T Γ^T Γ (Y_N − V_N θ) + θ^T Λ^T Λ θ is minimized, where Λ is a weighting matrix penalizing the magnitudes of the impulse response coefficients. In other words, the weight Λ penalizes sharp changes in the step response coefficients. As before, this can be formulated as a least-squares problem

  [ Γ Y_N ]   [ Γ V_N ]
  [   0   ] ≈ [   Λ   ] θ    (2.59)

yielding the solution

  θ_N = [V_N^T Γ^T Γ V_N + Λ^T Λ]^{−1} V_N^T Γ^T Γ Y_N    (2.60)

This simple modification to the standard least-squares identification algorithm should result in smoother step responses. One drawback of the method is that the optimal choice of the weighting matrix Λ is often unclear. Choosing too large a Λ can lead to severely biased estimates even with large data sets. On the other hand, too small a choice of Λ may not smooth the step response sufficiently. Other more sophisticated statistical methods to reduce the error variance of the parameters will be discussed in the advanced part of this book.
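A compact numerical sketch of (2.56) and (2.60) for a single input (our own illustration; the "true" response, the noise level, and the simple choices Γ = I and Λ = λI are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 30, 200                                   # FIR length, number of data points
h_true = 0.5 * 0.8 ** np.arange(1, n + 1)        # hypothetical true impulse response

dv = rng.choice([-1.0, 1.0], size=N + n)         # random binary input changes
dy = np.convolve(dv, h_true)[n:N + n] + 0.05 * rng.standard_normal(N)

# Regression matrix V_N of (2.54): the row for time k holds [dv(k-1), ..., dv(k-n)]
V = np.column_stack([dv[n + 1 - i: n + 1 - i + N] for i in range(1, n + 1)])

theta_ls = np.linalg.lstsq(V, dy, rcond=None)[0]                     # eq. (2.56)
lam = 1.0                                                            # Lambda = lam * I
theta_reg = np.linalg.solve(V.T @ V + lam**2 * np.eye(n), V.T @ dy)  # eq. (2.60)

# Step responses via (2.8); the regularized fit is visibly smoother
print(np.cumsum(theta_ls)[-1], np.cumsum(theta_reg)[-1])  # both near sum(h_true)
```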
The procedure above can also be used to fit measured disturbance models to be used for feedforward compensation in MPC. Designing the manipulated inputs such that they are uncorrelated with the disturbance should
minimize problems when fitting disturbance models. Of course, the measured
disturbance must also have enough natural excitation, which may be hard to
guarantee.


We stress that process identification is the most time-consuming step in


the implementation of MPC. We have presented in this section a rudimentary discussion of the most basic identification technique. A more rigorous
discussion of the technique as well as the discussion of other more advanced
identification algorithms will be given in the advanced part of this book.

Chapter 3

Dynamic Matrix Control - The Basic Algorithm


Dynamic Matrix Control (DMC) was one of the first commercial implementations of Model Predictive Control (MPC). In this chapter we describe the
basic ideas of the algorithm.

3.1 The Idea of Moving Horizon Control

Consider the diagram in Fig. 3.1. At the present time k the behavior of the
process over a horizon p is considered. Using the model, the response of the
process output to changes in the manipulated variable is predicted. Current
and future moves of the manipulated variables are selected such that the predicted response has certain desirable (or optimal) characteristics. For instance,
a commonly used objective is to minimize the sum of squares of the future
errors, i.e., the deviations of the controlled variable from a desired target (setpoint). This minimization can also take into account constraints which may
be present on the manipulated variables and the outputs.
The idea is appealing but would not work very well in practice if the moves
of the manipulated variable determined at time k were applied blindly over
the future horizon. Disturbances and modeling errors may lead to deviations
between the predicted behavior and the actual observed behavior, so that
the computed manipulated variable moves may not be appropriate any more.
Therefore only the first one of the computed moves is actually implemented. At
the next time step k + 1 a measurement is taken, the horizon is shifted forward
by one step, and the optimization is done again over this shifted horizon based
on the current system information. Therefore this control strategy is also
referred to as moving horizon control.
A similar strategy is used in many other non-technical situations. One

Figure 3.1: Moving Horizon Control

example is computer chess where the computer moves after evaluating all
possible moves over a specified depth (the horizon). At the next turn the
evaluation is repeated based on the current board situation. Another example
would be investment planning. A five-year plan is established to maximize the
return. Periodically a new five-year plan is put together over a shifted horizon
to take into account changes which have occurred in the economy.
The DMC algorithm includes as one of its major components a technique
to predict the future output of the system as a function of the inputs and
disturbances. This prediction capability is necessary to determine the optimal
future control inputs and was outlined in the previous chapter. Afterwards
we will state the objective function, formulate the optimization problem and
comment on its solution. Finally we will discuss the various tuning parameters
which are available to the user to affect the performance of the controller.


Figure 3.2: Basic Problem Setup

3.2 Multi-Step Prediction

We consider the setup depicted in Fig. 3.2 where we have three different types
of external inputs: the manipulated variable (MV) u, whose effect on the
output, usually a controlled variable (CV), is described by Pu ; the measured
disturbance variable (DV) d whose effect on the output is described by Pd ;
and finally the unmeasured and unmodeled disturbances wy which add a bias
to the system output. The overall system can be described by

  y(k) = [P^u  P^d] [ u(k) ] + w_y(k)    (3.1)
                    [ d(k) ]

We assume that step response models S^u, S^d are available for the system dynamics P^u and P^d, respectively. We can define the overall multivariable step response model

  S = [S^u  S^d]    (3.2)

which is driven by the known overall input

  v(k) = [ u(k) ]    (3.3)
         [ d(k) ].

Let us adopt (2.41) as the system state

  Y(k) = [ỹ_0(k), ỹ_1(k), …, ỹ_{n−1}(k)]^T    (3.4)

By definition the state consists of the future system outputs

  Y(k) = [y(k), y(k+1), …, y(k+n−1)]^T    (3.5)


obtained under the assumption that the system inputs do not change from their previous values, i.e.,

  Δu(k) = Δu(k+1) = ⋯ = 0
  Δd(k) = Δd(k+1) = ⋯ = 0.    (3.6)

Also, the state does not include any unmeasured disturbance information and hence it is assumed in the definition that

  w_y(k) = w_y(k+1) = ⋯ = 0.    (3.7)

The state is updated according to (2.42)

  Y(k) = M Y(k−1) + S Δv(k−1).    (3.8)

The equation reflects the effect of the input change Δv(k−1) on the future evolution of the system assuming that there are no further input changes. The influence of the input change manifests itself through the step response matrix S. The effect of any future input changes is described as well by the appropriate step response matrix. Let us consider the predicted output over

the next p time steps:

  [ y(k+1|k) ]   [ ỹ_1(k) ]   [ S_1^u ]             [ 0         ]                    [ 0     ]
  [ y(k+2|k) ]   [ ỹ_2(k) ]   [ S_2^u ]             [ S_1^u     ]                    [ ⋮     ]
  [    ⋮     ] = [    ⋮   ] + [   ⋮   ] Δu(k|k) +   [    ⋮      ] Δu(k+1|k) + ⋯ +  [ 0     ] Δu(k+p−1|k)
  [ y(k+p|k) ]   [ ỹ_p(k) ]   [ S_p^u ]             [ S_{p−1}^u ]                    [ S_1^u ]

      [ S_1^d ]           [ 0         ]                    [ 0     ]                [ w_y(k+1|k) ]
      [ S_2^d ]           [ S_1^d     ]                    [ ⋮     ]                [ w_y(k+2|k) ]
    + [   ⋮   ] Δd(k) +   [    ⋮      ] Δd(k+1|k) + ⋯ +  [ 0     ] Δd(k+p−1|k) + [     ⋮      ]    (3.9)
      [ S_p^d ]           [ S_{p−1}^d ]                    [ S_1^d ]                [ w_y(k+p|k) ]

Here the first term on the right-hand side, the first p elements of the state, describes the future evolution of the system when all the future input changes are zero. The remaining terms describe the effect of the present and future changes of the manipulated inputs Δu(k+i|k), the measured disturbances Δd(k+i|k), and the unmeasured and unmodeled disturbances w_y(k+i|k). The notation y(k+i|k) represents the prediction of y(k+i) made based on the information available at time k. The same notation applies to d and w_y.

The values of most of these variables are not available at time k and have to be predicted in a rational fashion. From the measurement at time k, d(k) is known and therefore Δd(k) = d(k) − d(k−1). Unless some additional process information or upstream measurements are available to conclude about the future disturbance behavior, the disturbances are assumed not to change in


the future for the derivation of the DMC algorithm.

  Δd(k+1|k) = Δd(k+2|k) = ⋯ = Δd(k+p−1|k) = 0    (3.10)

This assumption is reasonable when the disturbances are varying only infrequently. Similarly, we will assume that the future unmodeled disturbances w_y(k+i|k) do not change:

  w_y(k|k) = w_y(k+1|k) = w_y(k+2|k) = ⋯ = w_y(k+p|k)    (3.11)

We can get an estimate of the present unmodeled disturbance from (3.1):

  w_y(k|k) = y_m(k) − ỹ_0(k)    (3.12)

where y_m(k) represents the value of the output as actually measured in the plant. Here ỹ_0(k), the first component of the state Y(k), is the model prediction of the output at time k (assuming w_y(k) = 0) based on the information up to this time. The difference between this predicted output and the measurement provides a good estimate of the unmodeled disturbance.

For generality we want to consider the case where the manipulated inputs are not varied over the whole horizon p but only over the next m steps (Δu(k|k), Δu(k+1|k), …, Δu(k+m−1|k)) and that the input changes are set to zero after that:

  Δu(k+m|k) = Δu(k+m+1|k) = ⋯ = Δu(k+p−1|k) = 0    (3.13)
With these assumptions (3.9) becomes

  Y(k+1|k) = [ ỹ_1(k) ]   [ S_1^d ]           [ I ]
             [ ỹ_2(k) ] + [ S_2^d ] Δd(k) +   [ I ] (y_m(k) − ỹ_0(k)) + S^u ΔU(k)    (3.14)
             [    ⋮   ]   [   ⋮   ]           [ ⋮ ]
             [ ỹ_p(k) ]   [ S_p^d ]           [ I ]

where the first term, M Y(k), comes from the memory (the effect of past inputs), the second is the feedforward term S^d Δd(k), the third is the feedback term I_p (y_m(k) − ỹ_0(k)), and the last term contains the dynamic matrix S^u acting on the future input moves ΔU(k), which are yet to be determined.

Here we have introduced the new symbols

  Y(k+1|k) = [y(k+1|k), y(k+2|k), …, y(k+p|k)]^T    (3.15)

  S^u = [ S_1^u  0          ⋯  0           ]
        [ S_2^u  S_1^u      ⋯  0           ]
        [   ⋮                  ⋮           ]
        [ S_m^u  S_{m−1}^u  ⋯  S_1^u       ]
        [   ⋮                  ⋮           ]
        [ S_p^u  S_{p−1}^u  ⋯  S_{p−m+1}^u ]

  S^d = [S_1^d, S_2^d, …, S_p^d]^T    (3.16)

  ΔU(k) = [Δu(k|k), Δu(k+1|k), …, Δu(k+m−1|k)]^T    (3.17)

  M = [ 0 I 0 ⋯ 0 0 ]
      [ 0 0 I ⋯ 0 0 ]
      [ ⋮         ⋮ ]
      [ 0 0 0 ⋯ 0 I ]
      [ 0 0 0 ⋯ 0 I ]
      [ ⋮         ⋮ ]
      [ 0 0 0 ⋯ 0 I ]    (3.18)

with p block rows and n block columns; for p ≥ n the last p−n+1 block rows all equal [0 ⋯ 0 I], since the state settles after n steps. Finally,

  I_p = [I, I, …, I]^T    (3.19)

is a stack of p identity matrices.
With this new notation the p-step-ahead prediction becomes

  Y(k+1|k) = M Y(k) + S^d Δd(k) + I_p (y_m(k) − ỹ_0(k)) + S^u ΔU(k)    (3.20)

where the first three terms are completely defined by past control actions (Y(k), ỹ_0(k)) and present measurements (y_m(k), Δd(k)) and the last term describes the effect of future manipulated variable moves ΔU(k).
This prediction equation can be easily adjusted if different assumptions are
made on the future behavior of the measured and unmeasured disturbances.
For instance, if the disturbances are expected to evolve in a ramp-like fashion then we would set

  Δd(k) = Δd(k+1|k) = ⋯ = Δd(k+p−1|k)    (3.21)

and

  w_y(k+ℓ|k) = w_y(k|k) + ℓ (w_y(k|k) − w_y(k−1|k−1)).    (3.22)
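For the SISO case (n_y = n_v = 1, so the blocks S_i^u are scalars), the dynamic matrix appearing in the last term of (3.14) can be assembled directly from the step response coefficients. A minimal sketch (our own illustration):

```python
import numpy as np

def dynamic_matrix(s, p, m):
    """Build the p x m SISO dynamic matrix S^u of eq. (3.14).

    s : step response coefficients [s_1, ..., s_n], held at s_n beyond n
    p : prediction horizon; m : number of future moves (m <= p)
    """
    n = len(s)
    Su = np.zeros((p, m))
    for j in range(m):                       # column j multiplies du(k+j|k)
        for i in range(j, p):                # it affects y(k+i+1|k) for i >= j
            Su[i, j] = s[min(i - j, n - 1)]  # coefficient s_{i-j+1}, held at s_n
    return Su

s = np.array([0.5, 0.8, 0.95, 1.0])
print(dynamic_matrix(s, p=6, m=2))
```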


3.3 Objective Function

Plant operation requirements determine the performance criteria of the control


system. These criteria must be expressed in mathematical terms so that a
control law can be obtained in algorithmic form. In DMC a quadratic objective
function is used, which can be stated in its simplest form as¹

  min_{Δu(k|k), …, Δu(k+m−1|k)}  Σ_{ℓ=1}^{p} ‖y(k+ℓ|k) − r(k+ℓ)‖²    (3.23)

¹‖x‖ denotes the norm (x^T x)^{1/2} of the vector x.

This criterion minimizes the sum of squared deviations of the predicted CV values from a time-varying reference trajectory or setpoint r(k+ℓ) over p future time steps. The quadratic criterion penalizes large deviations proportionally more than smaller ones, so that on the average the output remains close to its reference trajectory and large excursions are avoided.

Note that the manipulated variables are assumed to be constant after m intervals of time into the future, or equivalently,

  Δu(k+m|k) = Δu(k+m+1|k) = ⋯ = Δu(k+p−1|k) = 0,

where m ≤ p always. This means that DMC determines the next m moves only. The choices of m and p affect the closed-loop behavior. Moreover, m, the number of degrees of freedom, has a dominant influence on the computational effort. Also, it does not make sense to make the horizon longer than m+n (p ≤ m+n), because for an FIR system of order n the system reaches a steady state after m+n steps. Increasing the horizon beyond m+n would simply add identical constant terms to the objective function (3.23).
Due to inherent process interactions, it is generally not possible to keep
all outputs close to their corresponding reference trajectories simultaneously.
Therefore, in practice only a subset of the outputs is controlled well at the
expense of larger excursions in others. This can be influenced transparently
by including weights in the objective function as follows:

  min_{Δu(k|k), …, Δu(k+m−1|k)}  Σ_{ℓ=1}^{p} ‖Γ_ℓ^y [y(k+ℓ|k) − r(k+ℓ)]‖²    (3.24)

For example, for a system with two outputs y_1 and y_2, and constant diagonal weight matrices of the form

  Γ_ℓ^y = [ γ_1  0   ]   ∀ℓ    (3.25)
          [ 0    γ_2 ]

the objective becomes

  min_{Δu(k|k), …, Δu(k+m−1|k)}  { γ_1² Σ_{ℓ=1}^{p} [y_1(k+ℓ|k) − r_1(k+ℓ)]²
                                 + γ_2² Σ_{ℓ=1}^{p} [y_2(k+ℓ|k) − r_2(k+ℓ)]² }.    (3.26)

Thus, the larger the weight is for a particular output, the larger is the contribution of its sum of squared deviations to the objective. This will make the
controller bring the corresponding output closer to its reference trajectory.
Finally, the manipulated variable moves that make the output follow a given trajectory could be too severe to be acceptable in practice. This can be corrected by adding a penalty term for the manipulated variable moves to the objective as follows:

  min_{ΔU(k)}  Σ_{ℓ=1}^{p} ‖Γ_ℓ^y [y(k+ℓ|k) − r(k+ℓ)]‖² + Σ_{ℓ=1}^{m} ‖Γ_ℓ^u [Δu(k+ℓ−1)]‖²    (3.27)

Note that the larger the elements of the matrix Γ_ℓ^u are, the smaller the resulting moves will be, and consequently, the output trajectories will not be followed as closely. Thus, the relative magnitudes of Γ_ℓ^y and Γ_ℓ^u will determine the trade-off between following the trajectory closely and reducing the action of the manipulated variables.
Of course, not every practical performance criterion is faithfully represented by this quadratic objective. However, many control problems can be formulated as trajectory tracking problems and therefore this formulation is very useful. Most importantly, this formulation leads to an optimization problem for which there exist effective solution techniques.

3.4 Constraints

In many control applications the desired performance cannot be expressed


solely as a trajectory following problem. Many practical requirements are
more naturally expressed as constraints on process variables.
There are three types of process constraints:

• Manipulated Variable Constraints: these are hard limits on inputs u(k) to take care of, for example, valve saturation constraints;

• Manipulated Variable Rate Constraints: these are hard limits on the size of the manipulated variable moves Δu(k) to directly influence the rate of change of the manipulated variables;

• Output Variable Constraints: hard or soft limits on the outputs of the system are imposed to, for example, avoid overshoots and undershoots. These can be of two kinds:

  – Controlled Variables: limits for these variables are specified even though deviations from their setpoints are minimized in the objective function;

  – Associated Variables: no setpoints exist for these output variables but they must be kept within bounds (i.e., the corresponding rows of Γ_ℓ^y are zero for the projections of these variables in the objective function given in (3.27)).

The three types of constraints in DMC are enforced by formulating them as linear inequalities. In the following we explicitly formulate these inequalities.

3.4.1 Manipulated Variable Constraints

The solution vector of DMC contains not only the current moves to be implemented but also the moves for the future m intervals of time. Although
violations can be avoided by constraining only the move to be implemented,
constraints on future moves can be used to allow the algorithm to anticipate
and prevent future violations thus producing a better overall response. The
manipulated variable value at a future time k+ℓ is constrained to be

  u_low(ℓ) ≤ Σ_{j=0}^{ℓ} Δu(k+j|k) + u(k−1) ≤ u_high(ℓ),   ℓ = 0, 1, …, m−1

where u(k−1) is the implemented previous value of the manipulated variable. For generality, we allowed the limits u_low(ℓ), u_high(ℓ) to vary over the horizon. These constraints are expressed in matrix form for all projections as

  [ −I_L ]          [ u(k−1) − u_high(0)   ]
  [      ] ΔU(k) ≥  [         ⋮            ]
  [  I_L ]          [ u(k−1) − u_high(m−1) ]    (3.28)
                    [ u_low(0) − u(k−1)    ]
                    [         ⋮            ]
                    [ u_low(m−1) − u(k−1)  ]

where

  I_L = [ I 0 ⋯ 0 ]
        [ I I ⋯ 0 ]
        [ ⋮     ⋮ ]
        [ I I ⋯ I ].    (3.29)


3.4.2 Manipulated Variable Rate Constraints

Often MPC is used in a supervisory mode where there are limitations on the rate at which lower-level controller setpoints are moved. These are enforced by adding constraints on the manipulated variable move sizes:

  [ −I ]          [ −Δu_max(0)   ]
  [    ] ΔU(k) ≥  [      ⋮       ]
  [  I ]          [ −Δu_max(m−1) ]    (3.30)
                  [ −Δu_max(0)   ]
                  [      ⋮       ]
                  [ −Δu_max(m−1) ]

where Δu_max(ℓ) > 0 is the possibly time-varying bound on the magnitude of the moves.

3.4.3 Output Variable Constraints

The algorithm can make use of the output predictions (3.20) to anticipate future constraint violations:

  Y_low ≤ Y(k+1|k) ≤ Y_high    (3.31)

Substituting from (3.20) we obtain constraints on ΔU(k):

  [ −S^u ]          [ M Y(k) + S^d Δd(k) + I_p (y_m(k) − ỹ_0(k)) − Y_high   ]
  [      ] ΔU(k) ≥  [                                                       ]    (3.32)
  [  S^u ]          [ −(M Y(k) + S^d Δd(k) + I_p (y_m(k) − ỹ_0(k))) + Y_low ]

where

  Y_low = [y_low(1), y_low(2), …, y_low(p)]^T ,   Y_high = [y_high(1), y_high(2), …, y_high(p)]^T

are vectors of output constraint trajectories y_low(ℓ), y_high(ℓ) over the horizon length p.

3.4.4 Combined Constraints

The manipulated variable constraints (3.28), manipulated variable rate constraints (3.30) and output variable constraints (3.32) can be combined into one convenient expression

  C^u ΔU(k) ≥ C(k+1|k)    (3.33)


where C^u combines all the matrices on the left-hand side of the inequalities as follows:

  C^u = [ −I_L ]
        [  I_L ]
        [ −I   ]
        [  I   ]
        [ −S^u ]
        [  S^u ]    (3.34)

The vector C(k+1|k) on the right-hand side collects all the error vectors of the constraint equations as follows:

  C(k+1|k) = [ u(k−1) − u_high(0)                                    ]
             [         ⋮                                             ]
             [ u(k−1) − u_high(m−1)                                  ]
             [ u_low(0) − u(k−1)                                     ]
             [         ⋮                                             ]
             [ u_low(m−1) − u(k−1)                                   ]
             [ −Δu_max(0)                                            ]
             [         ⋮                                             ]
             [ −Δu_max(m−1)                                          ]
             [ −Δu_max(0)                                            ]
             [         ⋮                                             ]
             [ −Δu_max(m−1)                                          ]
             [ M Y(k) + S^d Δd(k) + I_p (y_m(k) − ỹ_0(k)) − Y_high   ]
             [ −(M Y(k) + S^d Δd(k) + I_p (y_m(k) − ỹ_0(k))) + Y_low ]    (3.35)

3.5 Quadratic Programming Solution of the Control Problem

3.5.1 Quadratic Programs

Before the development of the DMC optimization problem, we introduce some basic concepts of nonlinear programming. In particular, the following formulation of a Quadratic Program (QP) is considered:

  min_x  x^T H x − g^T x
  s.t.   Cx ≥ c    (3.36)

where

• H is a symmetric matrix called the Hessian matrix;


• g is the gradient vector;

• C is the inequality constraint equation matrix; and

• c is the inequality constraint equation vector.
This problem minimizes a quadratic objective in the decision variables x subject to a set of linear inequalities. In the absence of any constraints the solution of this optimization problem can be found analytically by computing the necessary conditions for optimality as follows:

  d(x^T H x − g^T x)/dx = 2Hx − g = 0.    (3.37)

The second-order derivative is

  d²(x^T H x − g^T x)/dx² = 2H

which means that for an unconstrained minimum to exist, the Hessian must be positive semi-definite.

Note that in Section 2.7.4 the general least squares problem was formulated as an unconstrained minimization of a quadratic objective. In fact, the problem of minimizing the sum of squares of the residual of a set of linear equations

  ε = Ax − b

can be put in the form of this QP very simply, since

  min_x ε^T ε = min_x (Ax − b)^T (Ax − b)
              = min_x x^T A^T A x − 2b^T A x + b^T b

Thus the QP is

  min_x x^T A^T A x − 2b^T A x

yielding

  H = A^T A
  g = 2A^T b.

In order to obtain the unique unconstrained solution

  x = ½ H^{−1} g

H must be positive definite, which is the same condition required in Section 2.7.4.


When the inequality constraints are added, strict positive definiteness of H is not required. For instance, for H = 0 the optimization problem becomes

  min_x  −g^T x
  s.t.   Cx ≥ c    (3.38)

which is a Linear Programming (LP) problem. The solution of an LP will always lie at a constraint. This is not necessarily true of QP solutions. Although not a requirement, more efficient QP algorithms are available for problems with a positive definite H. For example, parametric QP algorithms employ the pre-inverted Hessian in their computations, thus reducing the computational requirements.
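As a small illustration of the form (3.36) (our own sketch with made-up numbers; cvxpy is just one of several QP solvers one could use here):

```python
import cvxpy as cp
import numpy as np

# A tiny instance of  min x'Hx - g'x  s.t.  Cx >= c  (form 3.36)
H = np.array([[2.0, 0.0],
              [0.0, 1.0]])          # symmetric positive definite Hessian
g = np.array([2.0, 4.0])
C = np.array([[1.0, 1.0]])
c = np.array([3.0])

x = cp.Variable(2)
problem = cp.Problem(cp.Minimize(cp.quad_form(x, H) - g @ x), [C @ x >= c])
problem.solve()
print(x.value)                      # constrained minimizer, on the boundary Cx = c
```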

3.5.2 Formulation of Control Problem as a Quadratic Program

We make use of the prediction equation (3.20) to rewrite the objective

  min_{ΔU(k)}  Σ_{ℓ=1}^{p} ‖Γ_ℓ^y [y(k+ℓ|k) − r(k+ℓ)]‖² + Σ_{ℓ=1}^{m} ‖Γ_ℓ^u [Δu(k+ℓ−1)]‖²    (3.39)

and add the constraints (3.33) to obtain the optimization problem

  min_{ΔU(k)}  ‖Γ^y [Y(k+1|k) − R(k+1)]‖² + ‖Γ^u ΔU(k)‖²    (3.40)

  s.t.  Y(k+1|k) = M Y(k) + S^d Δd(k) + I_p (y_m(k) − ỹ_0(k)) + S^u ΔU(k)

        C^u ΔU(k) ≥ C(k+1|k)    (3.41)

where

  Γ^u = diag{Γ_1^u, …, Γ_m^u}    (3.42)

and

  Γ^y = diag{Γ_1^y, …, Γ_p^y}    (3.43)

are the weight matrices in block-diagonal form, and

  R(k+1) = [r(k+1), r(k+2), …, r(k+p)]^T    (3.44)

is the vector of reference trajectories.

We can substitute the prediction equation into the objective function to obtain

  ‖Γ^y [Y(k+1|k) − R(k+1)]‖² + ‖Γ^u ΔU(k)‖²    (3.45)
    = ‖Γ^y [S^u ΔU(k) − E_p(k+1|k)]‖² + ‖Γ^u ΔU(k)‖²    (3.46)
    = ΔU^T(k) (S^{uT} Γ^{yT} Γ^y S^u + Γ^{uT} Γ^u) ΔU(k)
      − 2 E_p^T(k+1|k) Γ^{yT} Γ^y S^u ΔU(k) + E_p^T(k+1|k) Γ^{yT} Γ^y E_p(k+1|k)    (3.47)
Here we have defined

$$E_p(k+1|k) = \begin{bmatrix} e(k+1|k)\\ e(k+2|k)\\ \vdots\\ e(k+p|k) \end{bmatrix} = R(k+1) - \left[ M Y(k) + S^d \Delta d(k) + I_p\,(y_m(k) - y_0(k)) \right], \qquad (3.48)$$

which is the measurement-corrected vector of future output deviations from the reference trajectory (i.e., errors), assuming that all future control moves are zero. Note that this vector includes the effect of the measurable disturbances (S^d Δd(k)) on the prediction.
The optimization problem with a quadratic objective and linear inequalities which we have defined is a Quadratic Program. By converting to the standard QP formulation, the DMC problem becomes²:

$$\min_{\Delta U(k)} \; \Delta U(k)^T H^u \Delta U(k) - G(k+1|k)^T \Delta U(k) \quad \text{s.t.} \quad C^u \Delta U(k) \geq C(k+1|k) \qquad (3.49)$$

where the Hessian of the QP is

$$H^u = S^{uT} \Gamma^{yT} \Gamma^y S^u + \Gamma^{uT} \Gamma^u \qquad (3.50)$$

and the gradient vector is

$$G(k+1|k) = 2 S^{uT} \Gamma^{yT} \Gamma^y E_p(k+1|k). \qquad (3.51)$$
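As a sketch of how (3.50) and (3.51) are assembled in practice, the following MATLAB fragment builds H^u and G(k+1|k) for an illustrative SISO example; the step response, horizons, weights, and the error vector Ep are all assumed values, not part of the DMC specification.

    p = 10; m = 3;                          % prediction and control horizons
    s = 1 - exp(-(1:p)'/3);                 % assumed step response coefficients
    Su = tril(toeplitz(s)); Su = Su(:,1:m); % dynamic matrix S^u (p x m)
    Gy = eye(p);                            % output weight Gamma^y
    Gu = 0.1*eye(m);                        % input move weight Gamma^u
    Ep = ones(p,1);                         % illustrative error vector E_p(k+1|k)
    Hu = Su'*(Gy'*Gy)*Su + Gu'*Gu;          % Hessian, eq. (3.50)
    G  = 2*Su'*(Gy'*Gy)*Ep;                 % gradient, eq. (3.51)
    dU = 0.5*(Hu\G);                        % unconstrained minimizer of (3.49)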

3.6 Implementation

As explained in the introduction of this chapter, the implementation of DMC is done in a moving horizon fashion. This implies that the Quadratic Program derived above will be solved at each controller execution time. Because of this feature, the algorithm can be configured on-line as required to take care of unexpected situations. For example, in case an actuator is lost during the implementation, the high and low constraint limits on that particular manipulated variable can be set to be equal. Then the MPC problem with the remaining manipulated variables is solved. Similarly, the weight parameters in the objective function can also be adjusted on-line, giving the user the ability to tune the control law. In this section we discuss the different implementation issues associated with DMC.

² The term E_p^T(k+1|k) E_p(k+1|k) is independent of ΔU(k) and can be removed from the objective function.

3.6.1 Moving Horizon Algorithm

The constrained MPC algorithm is implemented on-line as follows.

1. Preparation. Do not vary the manipulated variables for at least n time intervals (u(−1) = u(−2) = . . . = u(−n) = 0) and assume the measured disturbances are zero (d(−1) = d(−2) = . . . = d(−n) = 0) during that time. Then the system will be at rest at k = 0.

2. Initialization (k = 0). Measure the output y_m(0) and initialize the model prediction vector as³

$$Y(k) = \underbrace{\left[ y_m(0)^T, y_m(0)^T, \ldots, y_m(0)^T \right]^T}_{n} \qquad (3.52)$$

³ If (3.52) is used for initialization and changes in the past n inputs did actually occur, then the initial operation of the algorithm will not be smooth. The transfer from manual to automatic will introduce a disturbance; it will not be bumpless.

3. State Update: Set k = k + 1. Then, update the state according to

$$Y(k) = M Y(k-1) + S^u \Delta u(k-1) + S^d \Delta d(k-1) \qquad (3.53)$$

where the first element of Y(k), ŷ(k|k), is the model prediction of the output y_m(k) at time k.
4. Obtain Measurements: Obtain measurements (y_m(k), d(k)).

5. Compute the reference trajectory error vector

$$E_p(k+1|k) = R(k+1) - \left[ M Y(k) + S^d \Delta d(k) + I_p\,(y_m(k) - y_0(k)) \right] \qquad (3.54)$$

6. Compute the QP gradient vector

$$G(k+1|k) = S^{uT} (\Gamma^y)^T \Gamma^y E_p(k+1|k). \qquad (3.55)$$

7. Compute the constraint equations' right-hand side vector

$$C(k+1|k) = \begin{bmatrix}
u(k-1) - u_{high}(0)\\ \vdots\\ u(k-1) - u_{high}(m-1)\\
u_{low}(0) - u(k-1)\\ \vdots\\ u_{low}(m-1) - u(k-1)\\
-\Delta u_{max}(0)\\ \vdots\\ -\Delta u_{max}(m-1)\\
-\Delta u_{max}(0)\\ \vdots\\ -\Delta u_{max}(m-1)\\
-E_p(k+1|k) + R(k+1) - Y_{high}\\
E_p(k+1|k) - R(k+1) + Y_{low}
\end{bmatrix} \qquad (3.56)$$

8. Solve the QP

$$\min_{\Delta U(k)} \; \tfrac{1}{2}\,\Delta U(k)^T H^u \Delta U(k) - G(k+1|k)^T \Delta U(k) \quad \text{s.t.} \quad C^u \Delta U(k) \geq C(k+1|k) \qquad (3.57)$$

and implement Δu(k|k) as Δu(k) on the plant.

9. Go to 3.

Note that the sequence of moves produced by the moving horizon implementation of the QP will be different from the sequence of moves ΔU(k).
Example (plot first step solution)
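The following MATLAB fragment is a minimal sketch of steps 2-9 for an illustrative SISO system; for brevity the QP of step 8 is replaced by its unconstrained solution (a call to quadprog with C^u and C(k+1|k) would take its place), and no plant/model mismatch or disturbance is simulated.

    n = 30; s = 1 - exp(-(1:n)'/3);           % assumed truncated step response
    M = diag(ones(n-1,1),1); M(n,n) = 1;      % shift matrix M (stable system)
    Y = zeros(n,1);                           % step 2: system initially at rest
    r = 1; p = 10; m = 3;
    Su = tril(toeplitz(s(1:p))); Su = Su(:,1:m);
    Hu = Su'*Su + 0.1*eye(m);                 % Hessian with scalar move weight
    for k = 1:50
        ym = Y(1);                            % step 4 (no mismatch: ym = y0)
        Ep = r - (Y(2:p+1) + (ym - Y(1)));    % step 5: error over the horizon
        G  = Su'*Ep;                          % step 6: gradient as in (3.55)
        dU = Hu\G;                            % step 8, unconstrained for brevity
        Y  = M*Y + s*dU(1);                   % step 3: memory update with du(k)
    end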

3.6.2 Solving the QP

In a moving horizon framework the QP in (3.57) is solved at each controller execution time after a new prediction is obtained. The only time-varying elements in this problem are the vectors E_p(k+1|k) (or equivalently G(k+1|k)) and C(k+1|k). That is, the Hessian H^u of the QP remains constant for all executions. In that case, as explained above, a parametric QP algorithm which employs the preinverted Hessian in its computations is preferable in order to reduce the on-line computational effort. Note that in the unconstrained case this is equivalent to the off-line computation of K_MPC. Of course, in case either Γ^y or Γ^u needs to be updated, or the model's step response coefficients have changed, the Hessian must be recomputed and inverted in background mode in order not to increase the on-line computational requirements.

QP is a convex program and is therefore fundamentally tractable, meaning a global optimal solution within a specified tolerance can be assured. Though not as extensively as LPs, QPs have been well studied, and reliable algorithms have been developed and coded. General-purpose QP solvers like QPSOL are readily available, but use of tailored algorithms that take advantage of specific problem structures can offer significant computational savings.

The conventional approach for solving QPs is the so-called Active Set method. In this method, one initiates the search by assuming a set of active constraints. For an assumed active set, one can easily solve the resulting least squares problem (where the active constraints are treated as equality constraints) through the use of Lagrange multipliers. In general, the active set one starts out with will not be the correct one. Through the use of the Karush-Kuhn-Tucker (KKT) condition⁴, one can modify the active set iteratively until the correct one is found. Most active set algorithms are feasible path algorithms, in which the constraints must be met at all times. Hence, the number of constraints can have a significant effect on the computational time.
More recently, a promising new approach called the Interior Point (IP) method has been getting a lot of attention. The idea of the IP method is to trap the solution within the feasible region by including a so-called barrier function in the objective function. With the modified objective function, the Newton iteration is applied to find the solution. Though originally developed for LPs, the IP method can be readily generalized to QPs and other more general constrained optimization problems. Even though not formally proven, it has been observed empirically that the Newton iteration converges within 5-50 steps. Significant work has been carried out in using this solution approach for solving QPs that arise in MPC, but the details are beyond the scope of this book; for interested readers, we give some references at the end of the chapter.
Computational properties of QPs vary with problems. As the number of constraints increases, more iterations are generally required to find the QP solution, and therefore the solution time increases. This may have an impact on the minimum control execution time possible. Also, note that the dimension of the QP (that is, the number of degrees of freedom m · n_u) influences the execution time proportionately.

Storage requirements are also affected directly by the number of degrees of freedom and the number of projections n · n_y. For example, the Hessian size increases quadratically with the number of degrees of freedom. Also, because of the prediction algorithm, Y(k) must be stored for use in the next controller execution (both E_p(k+1|k) and C(k+1|k) can be computed from Y(k)).

⁴ The KKT condition is a necessary condition for the solution to a general constrained optimization problem. For QP, it is a necessary and sufficient condition.

3.6.3 Proper Constraint Formulation

Many engineering control objectives are stated in the form of constraints. Therefore it is very tempting to translate them into linear inequalities and to include them in the QP control problem formulation. In this section we want to demonstrate that constraints make it very difficult to predict the behavior of the control algorithm under real operating conditions. Therefore, they should be used only when necessary, and then only with great caution.
First of all, constraints tend to greatly increase the time needed to solve
the QP. Thus, we should introduce them sparingly. For example, if we wish an
output constraint to be satisfied over the whole future horizon, we may want to
state it as a linear inequality only at selected future sampling times rather than
at all future sampling times. Unless we are dealing with a highly oscillatory
system, a few output constraints at the beginning and one at the end of the
horizon should keep the output more or less inside the constraints throughout
the horizon. Note that even when constraint violations occur in the prediction
this does not imply constraint violations in the actual implementation because
of the moving horizon policy. The future constraints serve only to prevent the
present control move from being short-sighted.
Output constraints can also lead to an infeasibility. A QP is infeasible if there does not exist any value of the vector of independent variables (the future manipulated variable moves ΔU(k)) which satisfies all the constraints, regardless of the value of the objective function. Physically this situation can arise when there are output constraints to be met but the manipulated variables are not sufficiently effective, either because they are constrained or because there is dead time in the system which delays their effect. Needless to say, provisions must be built into the on-line algorithm such that an infeasibility never occurs.

Mathematically an infeasibility can only occur when the right-hand side of the output constraint equations is positive. This implies that a nonzero move must be made in order to satisfy the constraint equations. Otherwise, infeasibility is not an issue, since ΔU(k) = 0 is feasible.
A simple example of infeasibility arises in the case of dead time in the response. For illustration, assume a SISO system with τ units of dead time. The output constraint equations for this system will look like:

$$\begin{bmatrix}
0 & \cdots & 0\\
\vdots & & \vdots\\
0 & \cdots & 0\\
-S^u_{\tau+1} & 0 & \cdots\\
-S^u_{\tau+2} & -S^u_{\tau+1} & \cdots\\
\vdots & \vdots &
\end{bmatrix} \Delta U(k) \;\geq\; \begin{bmatrix}
c(k+1|k)\\
\vdots\\
c(k+\tau|k)\\
c(k+\tau+1|k)\\
c(k+\tau+2|k)\\
\vdots
\end{bmatrix}$$
Figure 3.3: Relaxing the constraints (the output constraint y_max is relaxed between k+1 and k+H_c)

Positive elements c(k+1|k), . . . , c(k+τ|k) indicate that a violation is projected unless the manipulated variables are changed (ΔU(k) ≠ 0). Since the corresponding coefficients in the left-hand side matrix are zero, the inequalities cannot be satisfied and the QP is infeasible. Of course, this problem can be removed by simply not including these initial inequalities in the QP.

Because inequalities are dealt with exactly by the QP, the corrective action against a projected violation is equivalent to that generated by a very tightly tuned controller. As a result, the moves produced by the QP to correct for violations may be undesirably severe (even when feasible). Both infeasibilities and severe moves can be dealt with in various ways.
One way is to include a constraint window on the output constraints, similar to what we suggested above for computational savings. For each output a time k + H_c in the future is chosen at which constraint violations will start to be checked (Fig. 3.3). For the illustration above, this time should be picked to be at least equal to τ + 1. This allows the algorithm to check for violations after
the effects of deadtimes and inverse responses have passed. For each situation
there is a minimal value of Hc necessary for feasibility. If this minimal value is
chosen large, constraint violations may occur over a significant period of time.
In many cases, if a larger value of Hc is chosen, smaller constraint violations
may occur over a longer time interval. Thus, there is a trade-off between
magnitude and duration of constraint violation.
In general, it is difficult to select a value of Hc for each constrained output
such that the proper compromise is achieved. Furthermore, in multivariable
cases, constraints may need to be relaxed according to the priorities of the constrained variables. The selection of constraint windows is greatly complicated
by the fact that the appropriate amount and location for relaxation are usually
time-dependent due to varying disturbances and occurrences of actuator and
sensor failures. Therefore it is usually preferred to soften the constraint by adding a slack variable ε and penalizing the violation through an additional term in the objective function:

$$\min_{\epsilon,\,\Delta U(k)} \; [\text{Usual Objective}] + \rho\,\epsilon^2 \quad \text{s.t.} \quad y_{min} - \epsilon \leq \hat{y}(k+\ell|k) \leq y_{max} + \epsilon, \;\; \text{plus other constraints}$$

The optimization seeks a compromise between minimizing the original performance objective and minimizing the constraint violations expressed by ε². The parameter ρ determines the relative importance of the two terms. The degree of constraint violation can be fine-tuned arbitrarily by introducing a separate slack variable ε for each output and time step, and associating with it a separate penalty parameter ρ.
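A minimal MATLAB sketch of this constraint softening is given below; the prediction data, weights, and the penalty ρ are illustrative stand-ins. The decision vector is augmented to z = [ΔU; ε] and the upper output constraint is relaxed by ε ≥ 0.

    nU = 3; p = 5; rho = 1e3;                 % illustrative sizes and penalty
    Hu = eye(nU); G = ones(nU,1);             % stand-ins for (3.50)-(3.51)
    Su = tril(ones(p,nU));                    % stand-in dynamic matrix
    y0 = 0.6*ones(p,1);                       % predicted output for dU = 0
    ymax = 0.5*ones(p,1);                     % output upper limit
    Hz = blkdiag(Hu, rho);                    % augmented Hessian
    fz = [-G; 0];
    Ain = [Su, -ones(p,1); zeros(1,nU), -1];  % y0+Su*dU <= ymax+eps, eps >= 0
    bin = [ymax - y0; 0];
    z  = quadprog(2*Hz, fz, Ain, bin);        % objective z'*Hz*z - [G;0]'*z
    dU = z(1:nU); eps_opt = z(end);           % moves and optimal slack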
Finally, we must realize that while unconstrained MPC is a form of linear feedback control, constrained MPC is a nonlinear control algorithm. Thus, its behavior for small deviations can be drastically different from that for large deviations. This may be surprising and undesirable, and is usually very difficult to analyze a priori.
EX Example from MPC course. INCLUDE!

3.6.4 Choice of Horizon Length

On one hand, the prediction horizon p and the control horizon m should be kept short to reduce the computational effort; on the other hand, they should be made long to prevent short-sighted control policies. Making m short is generally conservative, because we are imposing constraints (forcing the control to be constant after m steps) which do not exist in the actual implementation because of the moving horizon policy. Therefore a small m will tend to give rise to a cautious control action.

Choosing p small is short-sighted and will generally lead to an aggressive control action. If constraint violations are checked only over a small prediction horizon p, this policy may lead the system into a dead end from which it can escape only with difficulty, i.e., only with large constraint violations and/or large manipulated variable moves.

When p and m are infinite and when there are no disturbance changes and unknown inputs, the sequence of control moves determined at time k is the same sequence which is realized through the moving horizon policy. In this sense our control actions are truly optimal. When the horizon lengths are shortened, the sequence of moves determined by the optimizer and the sequence of moves actually implemented on the system will become increasingly different. Thus the short-time objective which is optimized will have less and less to do with the actual value of the objective realized when the moving horizon control is implemented. This may be undesirable.

Figure 3.4: Choosing the horizon (the m moves end at k+m−1; after n further time steps the response settles by k+m+n−1)
In general, we should try to choose a small m to keep the computational effort manageable, but large enough to give us a sufficient number of degrees of freedom. We should choose p as large as possible, possibly p = ∞, to completely capture the consequences of the control actions. This is possible in several ways. Because an FIR system will settle after m + n steps, choosing a horizon p = m + n is a sensible choice used in many commercial systems (Fig. 3.4). Instead, or in addition, we can impose a large output penalty at the end of the prediction horizon, forcing the system effectively to settle to zero at the end of the horizon. Then, with p = m + n, the error after m + n steps is essentially zero and there is little difference between the finite and the infinite horizon objective.

3.6.5 Input Blocking

As said, use of a large control horizon is generally preferred from the viewpoint of performance, but the available computational resources may limit its size. One way to relax this limit is through a procedure called Blocking, which allows the user to "block out" the input moves at selected locations from the calculation by setting them to zero a priori. The result is a reduction in the number of input moves that need to be computed through the optimization, hopefully without a significant sacrifice in the solution quality. Obviously, judicious selection of the blocking locations is critical for achieving the intended effect. The selection is done mostly on an ad hoc basis, though there are some qualitative rules, like blocking less of the immediate moves and more of the distant ones.

At a more general level, blocking can be expressed as follows:

$$\Delta U = B\, \Delta U^b \qquad (3.58)$$

where ΔU^b represents the reduced set of input parameters to be calculated through the optimization. B is the blocking matrix, which needs to be designed for good performance. Typically, the rows of B corresponding to the blocked moves would contain all zeros. In general, the columns of B can be designed to represent different bases of the input space. Note that the dimension of ΔU^b, which is less than that of ΔU, must also be determined in the design.
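A sketch of one simple choice of B in MATLAB: of m = 6 potential moves, only the moves at positions 1, 2 and 4 are computed, and the rows of B for the blocked moves are zero (the positions are illustrative).

    m = 6; free = [1 2 4];                 % moves retained in the optimization
    B = zeros(m, numel(free));
    for i = 1:numel(free)
        B(free(i), i) = 1;                 % one selector column per free move
    end
    % The QP is then solved over the reduced vector dUb:
    %   Hb = B'*Hu*B;  Gb = B'*G;  dUb = Hb\Gb;  dU = B*dUb;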

3.6.6 Filtering of the Feedback Signal

In practice, feedback measurements can contain significant noise and other fast-varying disturbances. Since in DMC the effect of unmeasured disturbances is projected as a constant bias in the prediction, the high-frequency content of a feedback signal must be filtered out in order to obtain a meaningful long-term prediction. For this, one can pass the feedback signal through a low-pass filter of some sort, perhaps a first- or second-order filter, before putting it into the prediction equation. Use of state estimation, discussed in a later chapter of this book, allows one to model the statistical characteristics of disturbances and noise and perform the filtering in an optimal manner.

3.7 Examples

INCLUDE!
Effect of constraints on response PC examples of slowdown with IMC
Stability-contrast with unconstrained case (EZ)
Example from MPC course demonstrating positive effect of constraint.

3.8 Features Found in Other Algorithms

What we just covered is the basic form of a multivariable control algorithm called Dynamic Matrix Control (DMC), which was one of the first MPC algorithms applied to industrial processes with success. The original DMC algorithm did not use QP to handle constraints; instead, it added an extra output to the prediction to drive the input back to the feasible region whenever a predicted future input came close to a constraint. This was somewhat ad hoc, and it wasn't until the 1980s that engineers at Shell Oil proposed the use of QP to handle input and output constraints explicitly and rigorously. They called this modified version QDMC. Currently, this basic form of DMC is still used in a commercial package called DMC-PLUS, which is marketed by Aspen Technology.

A figure similar to FIGURE7 in Qin and Badgwell

Figure 3.5: Output Penalties Used in Various Formulations

Besides DMC and QDMC, there are several other MPC algorithms that have seen, and are still seeing, extensive use in practice. These include Model Predictive Heuristic Control (MPHC), which led to popular commercial algorithms like IDCOM and SMC-IDCOM marketed by Setpoint (now Aspen Technology) and Hierarchical Constraint Control (HIECON) and Predictive Functional Control (PFC) marketed by Adersa; Predictive Control Technology (PCT), which was marketed by Profimatics (now Honeywell); and the more recent Robust Model Predictive Control Technology (RMPCT), which is currently being marketed by Honeywell. These algorithms share the same fundamentals but differ in the details of implementation. Rather than elaborating on the details of each algorithm, we will touch upon some popular features not seen in the basic DMC/QDMC method.

3.8.1 Reference Trajectories

In DMC, output deviation from the desired setpoint is penalized in the optimization. Other algorithms like IDCOM, HIECON, and PFC let the user specify not only where the output should go but also how. For this, a reference trajectory is introduced for each controlled variable (CV), which is typically defined as a first-order path from the current output value to the desired setpoint. The time constant of the path can be adjusted according to the speed of the desired closed-loop response. This is displayed in Figure ??.

Reference trajectories provide an intuitive way to control the aggressiveness of control, which in DMC is adjusted through the weighting matrix for the input move penalty term. One could argue that the controller's aggressiveness is more conveniently tuned by specifying the speed of the output response rather than through input weight parameters, whose effects on the speed of response are highly system-dependent.

3.8.2 Coincidence Points

Some commercial algorithms like IDCOM and PFC allow the option of penalizing the output error only at a few chosen points in the prediction horizon, called coincidence points. This is motivated primarily by the reduction in computation it brings. When the number of input moves has to be kept small (in order to keep the computational burden low), use of a large prediction horizon, which is sometimes necessary due to large inverse responses and long dynamics, results in sluggish control behavior. This problem can be obviated by penalizing the output deviation only at a few carefully selected points. At the extreme, one could ask the output to match the reference trajectory value at a single time point, which can be achieved with a single control move. Such a formulation was used, for example, in IDCOM-M, an offspring of the original IDCOM algorithm, marketed by Setpoint.

Clearly, the choice of coincidence points is critical for performance, especially when the number of points used is small. Though some guidelines exist on choosing these points, there is no systematic method for the selection. Because the response times of different outputs can vary significantly, coincidence points are usually defined separately for each output.

3.8.3 The Funnel Approach

The RMPCT algorithm differs from other MPC algorithms in that it attempts to keep each controlled output within a user-specified zone called a funnel, rather than to keep it on a specific reference trajectory. The typical shape of a funnel is displayed in Figure ??. The user sets the maximum and minimum limits and also the slope of the funnel through a parameter called the "performance ratio," which is the desired time to return to the limit zone divided by the open-loop response time. The gap between the maximum and minimum can be closed for exact setpoint control, or left open for range control.
The algorithm solves the following quadratic program at each time:

$$\min_{y^r,\,\Delta u} \; \sum_{i=1}^{p} \left\| \hat{y}(k+i|k) - y^r(k+i|k) \right\|^2_Q + \sum_{j=0}^{m-1} \left\| \Delta u(k+j|k) \right\|^2_R \qquad (3.59)$$

or

$$\min_{y^r,\,u} \; \sum_{i=1}^{p} \left\| \hat{y}(k+i|k) - y^r(k+i|k) \right\|^2_Q + \sum_{j=0}^{m-1} \left\| u(k+j|k) - u^r \right\|^2_R \qquad (3.60)$$

subject to the usual constraints plus the funnel constraint

$$y^f_{min}(k+i|k) \;\leq\; y^r(k+i|k) \;\leq\; y^f_{max}(k+i|k), \qquad 1 \leq i \leq p \qquad (3.61)$$

where y^f_min(k+i|k) and y^f_max(k+i|k) represent the lower and upper limit values of the funnel for time k+i in the prediction horizon as specified at time k, and u^r is the desired settling value for the input. Notice that the reference trajectory y^r is a free parameter, which is optimized to lie within the funnel. Typically, Q ≫ R in order to keep the outputs within the funnel as much as possible. Then one can think of the above as a multi-objective optimization, in which the primary objective is to minimize the funnel constraint violation by the output and the secondary objective is to minimize the size of the input movement (or the input deviation from the desired settling value in the case of (3.60)). In this case, as long as there exists an input trajectory that keeps the output within the funnel, the first penalty term will be made exactly zero. Typically, there will be an infinite number of solutions that achieve this, leading to a degenerate QP. The algorithm thus finds the minimum norm solution, which corresponds to the least amount of input adjustment, hence the name "Robust" MPCT. However, if there is no input that can keep the output within the funnel, the first term will be the primary factor that determines the input.

INCLUDE!
Table 3.1: Optimization Resulting from Use of Different Spatial and Temporal Norms
The use of a funnel is motivated by the fact that, in multivariable systems, the shape of desirable trajectories for the outputs is not always clear due to system interactions. Thus, it is argued that an attractive formulation is to let the user specify an acceptable dynamic zone for each output as a funnel and then find the minimum-size input moves (or inputs with minimum deviation from their desired values) that keep the outputs within the zone or, if that is not possible, minimize the extent of violation.

3.8.4 Use of Other Norms

In defining the objective function, use of norms other than the 2-norm is certainly possible. For example, the possibility of using the 1-norm (sum of absolute values) has been explored to a great extent. Use of the infinity norm has also been investigated, with the aim of minimizing the worst-case deviation over time. In both cases, one gets a Linear Program, for which a plethora of theory and software exists due to its significance in economics. However, one difficulty with these formulations is in tuning. This is because the solution of an LP lies at the intersection of binding constraints, and it can switch abruptly from one vertex to another as one varies the tuning parameters (such as the input weight parameters). The solution behavior of a QP is much smoother, and therefore it is the preferred formulation for control.

Table 3.1 summarizes the optimization that results from various combinations of spatial and temporal norms.

3.8.5 Input Parameterization

In some algorithms like PFC, the input trajectory can be parameterized using continuous basis functions like polynomials. This can be useful if the objective is to follow smooth setpoint trajectories precisely, such as in mechanical servo applications, and the sampling time cannot be made sufficiently small to allow this with piecewise constant inputs.

In other commercial algorithms like HIECON and IDCOM-M, only a single control move is calculated, which would correspond to m = 1 in DMC. With this setting, the calculation is greatly simplified. On the other hand, use of m = 1 in DMC would limit the closed-loop performance in general. These algorithms get around this problem by using a single coincidence point, at which the output is asked to match the reference value exactly.

3.8.6 Model Conditioning

In multivariable plants, two or more outputs can behave very similarly in response to all the inputs. This phenomenon is referred to as ill-conditioning and is reflected by a gain matrix that is nearly singular. An implication is that it can be very difficult to control these outputs independently with the inputs, as doing so requires an excessive amount of input movement to move the outputs in certain directions. Using an ill-conditioned process model for the control calculation is not recommended, as it can lead to numerical problems (e.g., inversion of a nearly singular matrix) as well as excessive input movements or even instability.

Even though one would check the conditioning of the model at the design stage, because the control structure can change due to constraints and failures of sensors and actuators, one must make sure at each execution time that an ill-conditioned process model is not directly inverted in the input calculation. In DMC, direct inversion of an ill-conditioned process model can be circumvented by including a substantive input move penalty term, which effectively increases the magnitudes of the diagonal elements of the dynamic matrix that is inverted during the least squares calculation.

In other algorithms that do not include an input move penalty in the objective function, ill-conditioning must be checked at each execution time. In RMPCT, this is done through a method called Singular Value Thresholding, where a procedure called singular value decomposition is performed on the gain matrix to determine those CV directions for which the gain is too low for any effective control at all. Those directions with singular values lower than a threshold value are "given up" for control, and only the remaining high-gain directions are controlled. SMC-IDCOM addresses this based on the user-defined ranking of CVs. Here, whenever ill-conditioning is detected, CVs are dropped from the control calculation in the order of their ranks, starting from the one with the least assigned priority, until the condition number improves to an acceptable level. When two CVs are seen to behave very similarly, the user can rank the less important CV with a very low priority. Even though control of the dropped CV is given up, it is hoped that it will be controlled indirectly since it behaves similarly to the other, higher-ranked CV.

3.8.7 Prioritization of CVs and MVs

In most practical control problems, it is not possible to satisfy all constraints and also drive all outputs and inputs to their desired resting values. Hence, priorities need to be assigned to express their relative importance. In DMC, these priorities are determined through weight parameters, which enter into the various quadratic penalty terms in the objective function. For large, complex problems, determining proper weights that lead to an intended behavior can be a daunting task. Even if a set of weights consistent with the control specification is found, the weights can differ vastly in magnitude from one another, causing a numerical conditioning problem.

Algorithms like HIECON and SMC-IDCOM attempt to address this difficulty by letting the user rank the various objectives in the order of their importance. For example, constraint satisfaction may be the most critical aspect, which must be taken care of before satisfying other objectives. Also, driving the CVs to their desired setpoints may be more important than driving the MVs to their most economic values. In these algorithms, an optimization would be solved with the most important objective first, and then the remaining degrees of freedom would be used to address the other objectives in the order of priority. These algorithms also allow the user to rank each CV and MV according to its priority. Hence, for constraint softening, one may specify the order in which the constraints for various CVs must be relaxed. Also, in setpoint tracking, one can prioritize the CVs so that CVs with higher ranks are driven to their setpoints before those with lower ranks are considered.

3.9 Some Possible Enhancements to DMC

3.9.1 Closed-Loop Update of Model State

In the conventional DMC formulation, the model is run in open loop (i.e., by entering known inputs only); the model prediction error is added to the prediction equation, typically as a constant bias term. Hence, the model states do not contain any effect of unmeasured inputs, or more precisely, their estimates based on the feedback measurements. Another possibility for using the feedback measurements is to correct the model state directly at each time step, so that the model state will also hold information about the relevant effects of unmeasured inputs. Since this information is provided indirectly through feedback measurements, the state can be considered in this context as a holder of relevant past input and measured output information. The difference between the two approaches is displayed graphically in Figure 3.6. Since the model state gets continually corrected by the measurements in the closed-loop update approach, it is not necessary to add another correction term in the prediction stage.

Figure 3.6: Open-Loop vs. Closed-Loop Update of the Model and Corresponding Prediction

For the step response model, we can modify the update equation to add a step for feedback-measurement-based correction of the model state:

Model Prediction:
$$Y(k|k-1) = M Y(k-1|k-1) + S \Delta v(k-1) \qquad (3.62)$$

where Y(·|k−1) denotes the estimate of Y(·) obtained at k−1, taking into account all measurement information up to k−1. Recall that this is the sole step we took to update the model state in the previous formulation. Here we postulate to correct the model prediction Y(k|k−1) based on the difference between the measurement y_m(k) at time k and the model prediction y_0(k|k−1), the first output vector appearing in Y(k|k−1), for this time step.

Correction:
$$Y(k|k) = Y(k|k-1) + K(y_m(k) - y_0(k|k-1)) \qquad (3.63)$$

The matrix K is referred to as the observer gain. To make the prediction equivalent to the earlier prediction, where we added a constant bias term of size (y_m(k) − y_0(k|k−1)) to the prediction equation, we should choose the observer gain as

$$K = \mathcal{I} = \underbrace{\begin{bmatrix} I\\ \vdots\\ I \end{bmatrix}}_{n} \qquad (3.64)$$
Element by element this equation is

$$y_0(k|k) = y_0(k|k-1) + [y_m(k) - y_0(k|k-1)] = y_m(k)$$
$$y_1(k|k) = y_1(k|k-1) + [y_m(k) - y_0(k|k-1)]$$
$$\vdots$$
$$y_{n-1}(k|k) = y_{n-1}(k|k-1) + [y_m(k) - y_0(k|k-1)]$$

Note from the first equation that the estimate of the current output is set equal to the measurement. As before, we have added the same correction term [y_m(k) − y_0(k|k−1)] to all future predicted outputs y_1(k|k−1), . . . , y_{n−1}(k|k−1), and therefore to ŷ(k+1|k−1), . . . , ŷ(k+n−1|k−1), interpreting it as a bias term which is determined based on the present measurement and remains constant.
Substituting (3.62) into (3.63) we obtain the state estimator

$$Y(k|k) = M Y(k-1|k-1) + S \Delta v(k-1) + \mathcal{I}\left[ y_m(k) - N\left( M Y(k-1|k-1) + S \Delta v(k-1) \right) \right] \qquad (3.65)$$

where

$$N = \underbrace{\left[\, I \;\; 0 \;\; 0 \;\; \ldots \;\; 0 \,\right]}_{n}$$

which allows one to compute the current state estimate Y(k|k) based on the previous estimate Y(k−1|k−1), the previous input move Δv(k−1), and the current measurement y_m(k). 𝓘 is referred to as the observer gain.
An advantage of the above formulation is that substantial theory exists that enables us to design K optimally so as to account for information we may have about the statistical characteristics of disturbances and measurement noise. Another advantage is that it can be generalized to handle systems with unstable dynamics. We note that running an unstable system model in open loop would lead to an OVERFLOW in the computer. The noise filtering issue will be discussed in a simplified way below. A more rigorous general treatment of the design of K and its accompanying properties will be given in Chapter ?? in the advanced part of the book.
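The following MATLAB fragment is a minimal sketch of the prediction/correction pair (3.62)-(3.63) with the deadbeat gain (3.64); the step response and the constant output disturbance emulating an unmeasured input are illustrative.

    n = 30; s = 1 - exp(-(1:n)'/3);           % assumed step response (SISO)
    M = diag(ones(n-1,1),1); M(n,n) = 1;      % shift matrix for a stable system
    K = ones(n,1);                            % observer gain (3.64) for ny = 1
    Y = zeros(n,1); dist = 0.3;               % unmeasured step output disturbance
    for k = 1:20
        dv = 0.1;                             % some known input move
        Y  = M*Y + s*dv;                      % model prediction (3.62)
        ym = Y(1) + dist;                     % measurement deviates from model
        Y  = Y + K*(ym - Y(1));               % correction (3.63)
    end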

3.9.2 Integrating System

Integrating dynamics are common in chemical processes. In systems with integrating dynamics, a step response never settles down to a constant value, and therefore the standard finite step response based model description cannot be used. For these systems, an assumption equivalent to a finite settling time is that, after n steps, all stable dynamics die out and the responses of all the outputs with integrating dynamics become pure ramps, as shown in Figure 3.7.

INCLUDE!
Figure 3.7: Step Response of An Integrating System

The step response of such a system can be represented by

$$\{S_1, S_2, \ldots, S_n, S_n + (S_n - S_{n-1}), S_n + 2(S_n - S_{n-1}), \ldots\} \qquad (3.66)$$

We can define the system state as we did for stable systems:

$$Y(k) = \left[ y_0(k)^T, y_1(k)^T, \ldots, y_{n-1}(k)^T \right]^T \qquad (3.67)$$

Then we can represent the state transition from one sample time to the next as

$$Y(k+1) = M^I Y(k) + S \Delta v(k) \qquad (3.68)$$

where S = [S_1; S_2; . . . ; S_n] as before and

$$M^I = \begin{bmatrix}
0 & I & 0 & \cdots & \cdots & 0\\
0 & 0 & I & & & 0\\
\vdots & & & \ddots & & \vdots\\
0 & 0 & 0 & \cdots & 0 & I\\
0 & 0 & 0 & \cdots & -I & 2I
\end{bmatrix} \qquad (3.69)$$

where M^I represents essentially the same shift operation as before, except for the way the last set of elements is constructed. Note that the assumption of a pure linear rise after n steps in the step response translates into the transition equation

$$y_{n-1}(k+1) = y_{n-1}(k) + \left( y_{n-1}(k) - y_{n-2}(k) \right) + S_n \Delta v(k) \qquad (3.70)$$
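A sketch of M^I for scalar blocks in MATLAB (n is illustrative); the last line verifies that M^I continues a pure ramp one step further, which is exactly the extrapolation (3.70).

    n = 6;
    MI = [zeros(n-1,1), eye(n-1); zeros(1,n)];  % rows 1..n-1: pure shift
    MI(n, n-1:n) = [-1 2];                      % last row: linear extrapolation
    y = (0:n-1)';                               % a unit-slope ramp
    disp([y, MI*y])                             % MI*y = (1:n)': ramp continues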

One important difference in forming the prediction equation for integrating systems is that the effect of unmeasured disturbances should be extrapolated as a linear rise rather than a constant bias. Hence, one may modify the prediction equation from before to

$$Y(k+1|k) = \underbrace{\begin{bmatrix} \hat{y}_1(k)\\ \hat{y}_2(k)\\ \vdots\\ \hat{y}_p(k) \end{bmatrix}}_{M^I_p Y(k)\text{, from the memory}} + \underbrace{\begin{bmatrix} S^d_1\\ S^d_2\\ \vdots\\ S^d_p \end{bmatrix} \Delta d(k)}_{\text{feedforward term}} + \underbrace{\begin{bmatrix} w_y(k|k) + \left( w_y(k|k) - w_y(k-1|k-1) \right)\\ w_y(k|k) + 2\left( w_y(k|k) - w_y(k-1|k-1) \right)\\ \vdots\\ w_y(k|k) + p\left( w_y(k|k) - w_y(k-1|k-1) \right) \end{bmatrix}}_{\text{feedback term}} + \underbrace{\begin{bmatrix} S^u_1 & 0 & \cdots & 0\\ S^u_2 & S^u_1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ S^u_m & S^u_{m-1} & \cdots & S^u_1\\ \vdots & \vdots & & \vdots\\ S^u_p & S^u_{p-1} & \cdots & S^u_{p-m+1} \end{bmatrix}}_{\text{dynamic matrix } S^u} \underbrace{\begin{bmatrix} \Delta u(k|k)\\ \Delta u(k+1|k)\\ \vdots\\ \Delta u(k+m-1|k) \end{bmatrix}}_{\text{future input moves } \Delta U(k)} \qquad (3.71)$$

where w_y(k|k) = y_m(k) − y_0(k) represents the model prediction error. Notice that we have used two-point linear extrapolation in projecting the error into the future. In the above,

$$M^I_p = \begin{bmatrix}
0 & I & 0 & \cdots & \cdots & 0 & 0\\
0 & 0 & I & & & 0 & 0\\
\vdots & & & \ddots & & \vdots & \vdots\\
0 & 0 & \cdots & \cdots & \cdots & -I & 2I\\
0 & 0 & \cdots & \cdots & \cdots & -2I & 3I\\
\vdots & \vdots & & & & \vdots & \vdots
\end{bmatrix} \qquad (3.72)$$

Here we assumed p > n. If not, one can simply take the first p rows.

However, there is a problem in using the above in practice: the open-loop model prediction of the output (terms like ŷ_i(k)) will increase without bound due to the integrators and eventually lead to an OVERFLOW in the control computer. This problem happens not because the physical system is actually going unstable, but because the model states are not corrected by measurements to account for various unmeasured inputs. To circumvent this, it is necessary to update the model state directly using the measurements, as in the closed-loop model state update discussed earlier. Let us postulate a closed-loop update equation of the same form (3.62 and 3.63) as for stable systems:

Model Prediction:
$$Y(k|k-1) = M^I Y(k-1|k-1) + S \Delta v(k-1) \qquad (3.73)$$

Correction:
$$Y(k|k) = Y(k|k-1) + (\mathcal{I} + \mathcal{I}_0)(y_m(k) - y_0(k|k-1)) \qquad (3.74)$$

where

$$\mathcal{I}_0 = \begin{bmatrix} 0\\ I\\ 2I\\ \vdots\\ (n-1)I \end{bmatrix} \qquad (3.75)$$

and hence

$$\mathcal{I} + \mathcal{I}_0 = \begin{bmatrix} I\\ 2I\\ 3I\\ \vdots\\ nI \end{bmatrix} \qquad (3.76)$$

Here 𝓘 + 𝓘₀ is referred to as the observer gain. The form of the correction term can be motivated by examining the special situation depicted in Fig. 3.8 (the situation would be caused, for example, by an unknown step disturbance entering the integrator). In this case, the estimates y_ℓ(k|k) will be exactly equal to the true process outputs y_m(k+ℓ) if the observer (3.73, 3.74) is used.
In general, there will be both stable and integrating outputs and the observer gain has to be chosen either according to (3.64) or (3.76).

3.9.3 Noise Filter

Let us take another look at the correction steps for both stable systems

$$Y(k|k) = Y(k|k-1) + \mathcal{I}(y_m(k) - y_0(k|k-1)) \qquad (3.63)$$

and integrating systems

$$Y(k|k) = Y(k|k-1) + (\mathcal{I} + \mathcal{I}_0)(y_m(k) - y_0(k|k-1)) \qquad (3.74)$$

Figure 3.8: Motivation for the observer correction term y_m(k) − y_0(k|k−1).
In both cases, we determine the difference between the model prediction y_0(k|k−1) and the measurement y_m(k) and add it, either as a constant bias or a ramp bias, to the model prediction Y(k|k−1) to obtain the corrected prediction Y(k|k). This is justifiable, for example, if the difference is solely due to a constant disturbance effect. It could, however, be solely due to measurement noise, in which case we would not want to correct the model prediction at all. In general, disturbances, model error, and measurement noise will all contribute to the difference, in which case a more cautious correction than implied by (3.63) and (3.74) will be appropriate. This can be achieved by filtering the correction term in (3.63):

$$Y(k|k) = Y(k|k-1) + \mathcal{I} F \left[ y_m(k) - y_0(k|k-1) \right] \qquad (3.77)$$

where F is a diagonal matrix

$$F = \text{diag}\{f_1, f_2, \ldots, f_{n_y}\} \qquad (3.78)$$

and

$$0 < f_\ell \leq 1. \qquad (3.79)$$

Thus, rather than correct the model prediction by the full error [y_m(k) − y_0(k|k−1)], one takes a more cautious approach and utilizes only a fraction f_ℓ. The larger the measurement noise associated with output y_ℓ, the smaller f_ℓ should be chosen.
To understand better the effect of this noise filter on control performance, assume that the output suddenly changes to a constant value (output disturbance) y_m(0) = ȳ, and that neither the disturbance nor the manipulated variables change (Δd(k) = 0, Δu(k) = 0). Then we find from (3.62)

Model Prediction:
$$Y(k|k-1) = M Y(k-1|k-1) \qquad (3.80)$$

and from (3.77)

$$Y(k|k) = Y(k|k-1) + \mathcal{I} F (y_m(k) - y_0(k|k-1)) \qquad (3.81)$$
$$= M Y(k-1|k-1) + \mathcal{I} F y_m(k) - \mathcal{I} F N M Y(k-1|k-1) = (I - \mathcal{I} F N) M Y(k-1|k-1) + \mathcal{I} F y_m(k) \qquad (3.82)$$

where

$$N = \underbrace{\left[\, I \;\; 0 \;\; \ldots \;\; 0 \,\right]}_{n}$$

The form suggests (and we shall rigorously prove this in the advanced part) that Y(k|k) corresponds to y_m(k) passed through a first-order filter. Indeed, the estimate Y(k|k) approaches the true value ȳ with the filter time constant

$$\text{stable system:} \quad \tau_\ell = -T / \ln(1 - f_\ell) \qquad (3.83)$$
where T is the sample time. In principle, for integrating systems, we could also detune the observer gain 𝓘 + 𝓘₀ (3.76) by post-multiplying it by a filter matrix F. However, this choice tends to lead to a highly oscillatory observer response and is therefore undesirable. As we will show in the advanced part, it is more desirable to introduce the filter in the following manner into (3.74):

$$Y(k|k) = Y(k|k-1) + (\mathcal{I} F + \mathcal{I}_0 F')(y_m(k) - y_0(k|k-1)) \qquad (3.84)$$

where F was defined as above (3.78) and

$$F' = \text{diag}\{f'_1, f'_2, \ldots, f'_{n_y}\} \qquad (3.85)$$

with

$$f'_i = \frac{f_i^2}{2 - f_i}. \qquad (3.86)$$

Thus, for both stable systems (3.77) and integrating systems (3.84), we have a single tuning parameter 0 < f_ℓ ≤ 1 for each output. The noise filtering action decreases with increasing f_ℓ. For f_ℓ = 1, measurement noise is not filtered at all and we recover (3.63) and (3.74).
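A small MATLAB sketch of the tuning relations: f' of (3.86) and the equivalent stable-system filter time constant (3.83); the f values and the sample time are illustrative.

    f  = [1.0; 0.5; 0.1];                 % one tuning factor per output
    fp = f.^2 ./ (2 - f);                 % f' of (3.86), integrating outputs
    T  = 1;                               % assumed sample time
    tau = -T ./ log(1 - f(2:3));          % time constants (3.83); f = 1 gives 0
    disp([f(2:3), tau])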

In general, there will be both stable and integrating outputs, and the filter gain is chosen for each output either as suggested in (3.77) or in (3.84). This leads to the following correction expression with the general filter gain K_F:

$$Y(k|k) = Y(k|k-1) + K_F(y_m(k) - y_0(k|k-1)) \qquad (3.87)$$

3.9.4 Bi-Level Optimization

The MPC calculation can be split into two parts for added flexibility. First, a local steady-state optimization can be performed to obtain target values for each input and output. This can be followed by a dynamic optimization to determine the most desirable dynamic trajectory to these target values. Even though the local steady-state optimization can be based on an economic index, it does not replace the more comprehensive nonlinear optimization that often runs above the MPC layer at a much slower rate in order to provide an optimal range of inputs and outputs for the plant condition experienced during a particular optimization cycle. The local optimization performed in MPC is based on a linear steady-state model, which may be obtained by linearizing a nonlinear model, or simply the steady-state version of the step response model used in the dynamic optimization.

The reason for running the local optimization may vary. For example, one may want to perform an economic optimization at a higher frequency to account for local disturbances. Even if there is no economic objective in the given control problem, the steady-state optimization can be helpful to determine the best feasible target values for the CVs and the corresponding MV settling values.
The two-stage optimization can be formulated as below:

Step 1: Steady-State Optimization

The general form of a steady-state prediction equation is

$$\bar{y}(\infty|k) = K_s \underbrace{\left( \bar{u}(\infty|k) - u(k-1) \right)}_{\delta u_s(k)} + b(k) \qquad (3.88)$$

where ȳ(∞|k) and ū(∞|k) are the steady-state values of the output and input projected at time k. With only m input moves considered,

$$\delta u_s(k) = \Delta u(k) + \Delta u(k+1) + \ldots + \Delta u(k+m-1) \qquad (3.89)$$

Note that, for the step response model,

$$\bar{y}(\infty|k) = \hat{y}(k+m+n-1|k) \qquad (3.90)$$

$$b(k) = y_{n-1}(k) + S^d_n \Delta d(k) + (y_m(k) - y_0(k)) \qquad (3.91)$$

and K_s = S_n.

This steady-state prediction model can be used to optimize a given economic objective function subject to various input and output constraints:

$$\min_{\delta u_s(k)} \; \ell\left( \bar{u}(\infty|k), \bar{y}(\infty|k) \right) \qquad (3.92)$$

Since an economic objective function is typically linear and the prediction equation is also linear, a Linear Program (LP) results. Alternatively, one can also solve

$$\min_{\delta u_s(k)} \; \| \delta u_s(k) \| \qquad (3.93)$$

or

$$\min_{\delta u_s(k)} \; \| r - \bar{y}(\infty|k) \|_Q \qquad (3.94)$$

In the first case, we would be looking for the minimum input change such that all the constraints are satisfied. In the second case, we would be seeking the minimum deviation from the setpoint values that is achievable within the given constraints. The solution sets the target settling values for the inputs and outputs.
Step 2: Dynamic Optimization

The dynamic prediction equation is the same as before. A quadratic regulation objective of the following form is minimized subject to the given constraints through QP:

$$\sum_{i=1}^{m+n-2} \left( \hat{y}(k+i|k) - y^*(\infty|k) \right)^T Q \left( \hat{y}(k+i|k) - y^*(\infty|k) \right) + \sum_{j=0}^{m-1} \Delta u^T(k+j|k)\, R\, \Delta u(k+j|k) \qquad (3.95)$$

where y^*(∞|k) is the solution from the steady-state optimization. An additional constraint may be added to match the settling values of the optimized input trajectories to those computed from the steady-state optimization:

$$\Delta u(k|k) + \Delta u(k+1|k) + \ldots + \Delta u(k+m-1|k) = \delta u_s^*(k) \qquad (3.96)$$

This also forces ŷ(k+m+n−1|k) to be at the optimal steady-state value y^*(∞|k).
Note that this steady-state optimization may be performed as often as every sample time, that is, at the same execution rate as the dynamic optimization. However, it is then critical to filter the noise and other high-frequency variations from the feedback signal. Otherwise, the solution of the steady-state optimization can fluctuate wildly from sample time to sample time, especially in the case of an LP.
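A minimal MATLAB sketch of the Step 1 target calculation in the spirit of (3.93) is given below; here a 2-norm is used so that quadprog applies (an economic objective as in (3.92) would give an LP instead), and the gain matrix, bias, and limits are illustrative.

    Ks = [2 1; 1 3]; b = [0.2; -0.1];     % steady-state gain K_s and bias b(k)
    ulast = [0; 0];                       % u(k-1)
    ylo = [-0.5; -0.5]; yhi = [0.5; 0.5]; % output limits at steady state
    Ain = [Ks; -Ks];                      % ylo <= Ks*dus + b <= yhi
    bin = [yhi - b; b - ylo];
    dus = quadprog(eye(2), zeros(2,1), Ain, bin);   % min ||dus||^2
    utarget = ulast + dus; ytarget = Ks*dus + b;    % target settling values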

3.9.5 Product Property Estimation

Many CVs, such as product compositions and other property variables, cannot be measured at the frequency, speed, accuracy, and/or reliability required for direct feedback control to be effective. Control of these variables may nevertheless be critical, and the only recourse may be to develop and use estimates based on measurements of other process variables. Since all process variables are driven by the same basic set of disturbances and inputs, their behavior should be strongly correlated. This correlation can be captured and used to build an inferential estimator for the property variables.
Typically, linear regression techniques, such as least squares and partial least squares (PLS), are used to build an estimator of the form

$$\hat{y}^p(k) = L \begin{bmatrix} y^s_1(k-\tau_1)\\ \vdots\\ y^s_\ell(k-\tau_\ell) \end{bmatrix} \qquad (3.97)$$

where ŷ^p is the estimate of the product property variable in question and the y^s_i are the secondary variables used to estimate the product property. Because different variables can have different response times to the various inputs, the regressor may need to include delayed measurement terms, as shown above. Determination of the appropriate delay amounts would require significant process knowledge or a careful data analysis. In the case that proper values for these delays cannot be determined a priori, one may have to include several delayed terms of the same variable in the regressor.
When direct measurements of the product property variables are available, one may use them in conjunction with the inferential estimates. In practice, the direct measurements are typically used to correct any bias in the inferential estimate. For example, when a measurement of y^p becomes available after a delay of τ_d sample steps, it can be included in the estimator in the following way:

$$\tilde{y}^p(k) = \hat{y}^p(k) + \left( y^p_m(k-\tau_d) - \hat{y}^p(k-\tau_d) \right) \qquad (3.98)$$

where y^p_m is the measured value of y^p and ỹ^p(k) is the measurement-corrected estimate.

In the case that the process is highly nonlinear and/or the operating range is wide, nonlinear regression techniques such as Artificial Neural Networks can be used in place of the least squares technique.
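A minimal MATLAB sketch of fitting the estimator matrix L in (3.97) by ordinary least squares on synthetic data; the secondary signals, delays, and coefficients are all illustrative.

    N = 200; t = (1:N)';
    y1s = sin(t/20); y2s = cos(t/15);         % assumed secondary measurements
    tau = [2; 5];                             % assumed delays per regressor
    lagged = @(y,d) [zeros(d,1); y(1:end-d)]; % delay operator
    X  = [lagged(y1s,tau(1)), lagged(y2s,tau(2))]; % delayed regressor matrix
    yp = 0.8*X(:,1) + 0.3*X(:,2);             % synthetic property data
    L  = (X\yp)';                             % least squares recovers [0.8 0.3]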

Figure 3.9: Heavy Oil Fractionator (crude oil is fed through a vaporizer; product streams: light oil, naphtha, light gas oil, short residue)

3.10 Application of DMC to the Case Studies

3.10.1 Control of a heavy oil fractionator using Dynamic Matrix Control

This case study is intended to show how the MATLAB MPC Toolbox can be used to design a model predictive controller for a fairly complex, practically motivated control problem and to test it through simulation. The case study is on a heavy-oil fractionation column and involves multivariable control, constraint handling, and optimization. Interested readers may attempt to solve the problem using the standard step response based DMC technique we explained in this chapter. A solution will be provided at the end of the section.

Heavy Oil Fractionator: Background

A heavy oil fractionator (Figure 3.9) splits crude oil into several streams that are further processed downstream. Vaporizing the feed stream consumes much energy, and therefore heat integration is of paramount importance. The fractionator shown in the figure has three heat exchangers, which are used to recover energy from the recirculation streams. Product quality needs to be maintained at a desired level, and certain constraints have to be met.

Figure 3.10: Control of Heavy Oil Fractionator, Level 1: Inventory Control (pressure, level, and flow loops set the top draw F1, the side draw F2 from the stripper, and the bottoms reflux duty Q; the reflux duties are the upper (URD), intermediate (IRD), and bottoms (BRD) duties)

Control Structure Description

The control structure of the fractionator consists of two levels. In the lower level (displayed in Figure 3.10), liquid levels and flow rates are controlled. This structure does not need MPC; conventional PD/PID controllers suffice.

In addition to the basic inventory and flow controls, a higher level of control is needed to assure the quality of the output streams of interest. We may use composition measurements of the product streams obtained through on-line analyzers. Composition analyzers, however, introduce significant delays. Alternatively, we may use other, more easily measurable variables, such as temperatures, as indicators for the compositions of the product streams. In addition to product quality control, this level may handle various constraints and any optimization objective that competes with the control requirements. At this level, use of MPC may bring significant benefits. The control structure for this level is shown in Figure 3.11.

The following are the control objectives at the MPC level, listed in the order of their priority: (1) y7 should remain above the minimum level of −0.5; (2) y1, y2 must be kept at their setpoints; (3) u3, the opening of the bypass valve, must be minimized to maximize the heat recovery.

Figure 3.11: Control of Heavy Oil Fractionator, Level 2: Quality Control (temperature measurements T at the top, side draw, intermediate reflux, and bottoms supplement the Level 1 loops on the top draw, side draw, and bottoms reflux duty)

Input Output Transfer Functions

The models relating the MVs (top draw TD = u1, side draw SD = u2, bottoms reflux duty BRD = u3) and the disturbances (upper reflux duty URD = d1, intermediate reflux duty IRD = d2) to the outputs (top end point TEP = y1, side end point SEP = y2, top temperature TT = y3, upper reflux temperature URT = y4, side draw temperature SDT = y5, intermediate reflux temperature IRT = y6, bottoms reflux temperature BRT = y7) are:

              u1 (TD)                 u2 (SD)                 u3 (BRD)               d1 (URD)               d2 (IRD)
y1 (TEP)   4.05e^{-27s}/(50s+1)   1.77e^{-28s}/(60s+1)   5.88e^{-27s}/(50s+1)   1.20e^{-27s}/(45s+1)   1.44e^{-27s}/(40s+1)
y2 (SEP)   5.39e^{-18s}/(50s+1)   5.72e^{-14s}/(60s+1)   6.90e^{-15s}/(40s+1)   1.52e^{-15s}/(25s+1)   1.83e^{-15s}/(20s+1)
y3 (TT)    3.66e^{-2s}/(9s+1)     1.65e^{-20s}/(30s+1)   5.53e^{-2s}/(40s+1)    1.16/(11s+1)           1.27/(6s+1)
y4 (URT)   5.92e^{-11s}/(12s+1)   2.54e^{-12s}/(27s+1)   8.10e^{-2s}/(20s+1)    1.73/(5s+1)            1.79/(19s+1)
y5 (SDT)   4.13e^{-5s}/(8s+1)     2.38e^{-7s}/(19s+1)    6.23e^{-2s}/(10s+1)    1.31/(2s+1)            1.26/(22s+1)
y6 (IRT)   4.06e^{-8s}/(13s+1)    4.18e^{-4s}/(33s+1)    6.53e^{-1s}/(9s+1)     1.19/(19s+1)           1.17/(24s+1)
y7 (BRT)   4.38e^{-20s}/(33s+1)   4.42e^{-22s}/(44s+1)   7.20/(19s+1)           1.14/(27s+1)           1.26/(32s+1)

The transfer functions for the actual plant are assumed to be the same as in the model, except that there can be gain variations. The structure of the variations is shown below. They are to be incorporated into the simulation as a plant-model mismatch.

              u1 (TD)          u2 (SD)          u3 (BRD)         d1 (URD)         d2 (IRD)
y1 (TEP)   4.05 + 2.11ε1    1.77 + 0.39ε2    5.88 + 0.59ε3    1.20 + 0.12ε4    1.44 + 0.16ε5
y2 (SEP)   5.39 + 3.29ε1    5.72 + 0.57ε2    6.90 + 0.89ε3    1.52 + 0.13ε4    1.83 + 0.13ε5
y3 (TT)    3.66 + 2.29ε1    1.65 + 0.35ε2    5.53 + 0.67ε3    1.16 + 0.08ε4    1.27 + 0.08ε5
y4 (URT)   5.92 + 2.34ε1    2.54 + 0.24ε2    8.10 + 0.32ε3    1.73 + 0.02ε4    1.79 + 0.04ε5
y5 (SDT)   4.13 + 1.71ε1    2.38 + 0.93ε2    6.23 + 0.30ε3    1.31 + 0.03ε4    1.26 + 0.02ε5
y6 (IRT)   4.06 + 2.39ε1    4.18 + 0.35ε2    6.53 + 0.72ε3    1.19 + 0.08ε4    1.17 + 0.01ε5
y7 (BRT)   4.38 + 3.11ε1    4.42 + 0.73ε2    7.20 + 1.33ε3    1.14 + 0.18ε4    1.26 + 0.18ε5
Problem Statement

The following five cases represent different disturbances and gain variations.

           d1     d2     ε1    ε2    ε3    ε4    ε5
Case I:    0.5    0.5    0     0     0     0     0
Case II:   -0.5   -0.5   -1    -1    -1    1     1
Case III:  -0.5   -0.5   1     -1    1     1     1
Case IV:   0.5    -0.5   1     1     1     1     1
Case V:    -0.5   -0.5   -1    1     0     0     0

The following constraints must be met:

−0.5 ≤ u1, u2, u3 ≤ 0.5
|Δu1|, |Δu2|, |Δu3| ≤ 0.05
−0.5 ≤ y1 ≤ 0.5, y7 ≥ −0.5
sampling time 1 min

The input constraints are hard constraints, meaning they must be satisfied. The output constraints can be viewed as soft constraints, which are to be met if possible but may be relaxed in case infeasibility problems arise. You should relax the constraint on y1 before you relax the constraint on y7.

The objective is to regulate y1 and y2, the product stream compositions, at their setpoints while satisfying the above constraints, for all five simulation cases. As a secondary objective, we wish to minimize u3 in order to maximize the heat recovery.

In addition to the standard cases, we want to test some failure scenarios. Specifically, let us simulate Case I when one of the actuators, u1 or u2, is out of service, and also when one of the composition sensors goes out of service.

Figure 3.12: Control scenario block diagram (manipulated variables u1, u2, u3; controlled variables y1, y2; associated variable y7; optimized input u3; disturbances d1 and d2)

Note that d1, d2 are unmeasured disturbances. Therefore the model used for the MPC calculation should not contain the d1, d2 models. These models appear only in the plant used to do the closed-loop simulation.

Our Solution Strategy

We will use the MPC Toolbox in MATLAB to solve this simulation problem. (Other tools available with MATLAB include SIMULINK and a GUI (Graphical User Interface) version of the toolbox.) The block diagram representation of the problem is shown in Figure 3.12. The system has certain inputs and certain outputs. Out of all the measured outputs, we need to control y1 and y2. Since these are composition variables, they are associated with significant time delays. Alternatively, we may use temperatures or other such inferential measurements (see the earlier section on Property Estimation). However, this possibility is not considered in our solution strategy.

We can identify the following as the output variables of interest:

Controlled variables: We need to control y1 and y2 at their set points.
Associated variables: We need to maintain y7 above its minimum value.
Optimized variables: We need to obtain the required control action with minimal change in u3. Thus, u3 enters as an output variable.

Note that u3 is also a manipulated (input) variable. Putting down the foregoing discussion in mathematical terms, we obtain the model

$$\begin{bmatrix} y_1\\ y_2\\ y_7\\ u_3 \end{bmatrix} = G^u \begin{bmatrix} u_1\\ u_2\\ u_3 \end{bmatrix} \qquad (3.99)$$

where

$$G^u = \begin{bmatrix} g_{11} & g_{12} & g_{13}\\ g_{21} & g_{22} & g_{23}\\ g_{71} & g_{72} & g_{73}\\ 0 & 0 & 1 \end{bmatrix} \qquad (3.100)$$

The model for the actual plant is given by

$$\begin{bmatrix} y_1\\ y_2\\ y_7\\ u_3 \end{bmatrix} = G^u \begin{bmatrix} u_1\\ u_2\\ u_3 \end{bmatrix} + G^d \begin{bmatrix} d_1\\ d_2 \end{bmatrix} \qquad (3.101)$$

We solve the control problem using the cmpc algorithm available in the
MPC toolbox5 . The cmpc algorithm solves, as the name suggests, Constrained
MPC problem. The description of cmpc could be found in the MPC toolbox
manual or by typing help cmpc at the MATLAB prompt. Resulting help text
displayed is:
CMPC Simulation of the constrained Model Predictive Controller. [y,u,ym]
= cmpc(plant, model, ywt, uwt,M,P, tend, r, ulim,ylim,... tfilter, dplant,
dmodel, dstep) REQUIRED INPUTS: . . . . . .
The first step in solving the problem is to obtain the step response matrices plant and model; both are required to be in step response form. We use the following MATLAB commands⁶ to do this.

First, we obtain the models in transfer function form using the poly2tfd function (you may refer to the MPC Toolbox manual for more details). The usage of this function to obtain the transfer function g11 is shown below; the third argument being 0 signifies continuous transfer function form, and the final argument is the time delay:

    g11 = poly2tfd(4.05, [50 1], 0, 27);

The resulting matrix corresponds to the transfer function 4.05 e^{-27s}/(50s + 1). We then use the tfd2step function to convert the transfer functions into the required step response form. Before that, as described above, we obtain all the necessary transfer functions (see Eq. 3.100). The transfer functions are passed column-wise (not row-wise), and the scalar Ny specifies the number of outputs (i.e., the number of rows):

    delt = 5; tfin = 300; Ny = 4;
    model = tfd2step(tfin,delt,Ny,g11,g21,g71,gu31,g12,g22,g72,gu32,g13,g23,g73,gu33);
5 Please refer to the MPC Toolbox manual for details. You may also use the GUI (Graphical User Interface) version or SIMULINK instead.
6 The commands described may be MATLAB commands or MPC Toolbox commands; in this example, we use the same term for both. We assume here that you have the MPC Toolbox installed. If not, some of these commands will not work.


You may want to check the validity of the tfd2step command by computing the step response matrix yourself. One way to do this is to find the inverse Laplace transform of a transfer function and obtain the step response matrix from it. Remember that, to obtain the step response, you need to multiply the transfer function by 1/s (the transform of a step input) before taking the inverse Laplace transform. The first few elements of model are listed below:
model =

    [listing of the step response coefficient matrix: the leading rows are zero because every transfer function in the first column of G^u has a time delay; nonzero entries such as 0.2113, 0.0945, and 0.5443 appear once the corresponding delays have elapsed]

We need two step response matrices because there can be a mismatch between the actual plant and our model. Moreover, the disturbance models are not to be incorporated into the model, because the disturbances are not measured. Thus, we introduce the disturbances for the plant only and call them dplant, whereas dmodel is an empty matrix.

The next step is to specify the output and input weighting matrices. This is done by specifying the vectors ywt and uwt, respectively. Constraint limits are specified through the vectors ulim and ylim. Using these values, we call the function cmpc. Below is a sample program for Case II.
Below is a sample program for Case II.
    E1=0; E2=0; E3=0;
    c11=4.05+2.11*E1; c12=1.77+0.39*E2; c13=5.88+0.59*E3;
    c21=5.39+3.29*E1; c22=5.72+0.57*E2; c23=6.90+0.89*E3;
    c71=4.38+3.11*E1; c72=4.42+0.73*E2; c73=7.20+1.33*E3;
    g11=poly2tfd(c11,[50 1],0,27); g12=poly2tfd(c12,[60 1],0,28); g13=poly2tfd(c13,[50 1],0,27);
    g21=poly2tfd(c21,[50 1],0,18); g22=poly2tfd(c22,[60 1],0,14); g23=poly2tfd(c23,[40 1],0,15);
    g71=poly2tfd(c71,[33 1],0,20); g72=poly2tfd(c72,[44 1],0,22); g73=poly2tfd(c73,[19 1]);
    gu31=poly2tfd(0,[1]); gu32=poly2tfd(0,[1]); gu33=poly2tfd(1,[1]);
    delt=5; tfin=300; Ny=4;
    model = tfd2step(tfin,delt,Ny,g11,g21,g71,gu31,g12,g22,g72,gu32,g13,g23,g73,gu33);
    E1=-1; E2=-1; E3=-1; E4=1; E5=1; d1=-0.5; d2=-0.5; y7min=-0.5;
    c11=4.05+2.11*E1; c12=1.77+0.39*E2; c13=5.88+0.59*E3;
    c21=5.39+3.29*E1; c22=5.72+0.57*E2; c23=6.90+0.89*E3;
    c71=4.38+3.11*E1; c72=4.42+0.73*E2; c73=7.20+1.33*E3;
    g11=poly2tfd(c11,[50 1],0,27); g12=poly2tfd(c12,[60 1],0,28); g13=poly2tfd(c13,[50 1],0,27);
    g21=poly2tfd(c21,[50 1],0,18); g22=poly2tfd(c22,[60 1],0,14); g23=poly2tfd(c23,[40 1],0,15);
    g71=poly2tfd(c71,[33 1],0,20); g72=poly2tfd(c72,[44 1],0,22); g73=poly2tfd(c73,[19 1]);
    plant = tfd2step(tfin,delt,Ny,g11,g21,g71,gu31,g12,g22,g72,gu32,g13,g23,g73,gu33);
    dc11=1.20+0.12*E4; dc12=1.44+0.16*E5; dc21=1.52+0.13*E4; dc22=1.83+0.13*E5;
    dc71=1.14+0.18*E4; dc72=1.26+0.18*E5;
    d11=poly2tfd(dc11,[45 1],0,27); d12=poly2tfd(dc12,[40 1],0,27);
    d21=poly2tfd(dc21,[25 1],0,15); d22=poly2tfd(dc22,[20 1],0,15);
    d71=poly2tfd(dc71,[27 1]); d72=poly2tfd(dc72,[32 1]);
    du1=poly2tfd(0,[1]); du2=poly2tfd(0,[1]);
    dplant = tfd2step(tfin,delt,Ny,d11,d21,d71,du1,d12,d22,d72,du2);
    dmodel=[];
    ywt=[1 1 0 1]; uwt=[0.1 0.1 0.1]; M=10; P=20; tend=300;
    r=[0 0 y7min 0];
    ulim=[-0.5 -0.5 -0.5 0.5 0.5 0.5 0.05 0.05 0.05];
    ylim=[-0.5 -inf y7min -0.5 0.5 inf inf 0.5];
    tfilter=[]; dstep=[d1 d2];
    [yp,u] = cmpc(plant,model,ywt,uwt,M,P,tend,r,ulim,ylim,tfilter,dplant,dmodel,dstep);
    figure(3); plotall(yp,u,delt);
    subplot(211); legend('y1','y2','y7','u3',0);
    subplot(212); legend('u1','u2','u3',0);
Modifying the above for the other cases is trivial: only the values of E1...E5, d1 and d2 need to be changed. The cases of the u1 actuator being stuck and of no feedback for output y1 are left as an exercise. Helpful hints:

For a stuck actuator, the respective input should be constrained to a very small region of operation.
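A minimal sketch of this hint, assuming the ulim ordering [lower bounds, upper bounds, rate limits] used in the sample program above (the specific numbers are illustrative only, not from the original problem statement):

    % u1 stuck: clamp its operating range and its move size to a tiny
    % interval, leaving u2 and u3 as in the original Case II program
    ulim = [-1e-3 -0.5 -0.5   1e-3 0.5 0.5   1e-6 0.05 0.05];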


Figure 3.13: Result for Case II. [Top panel: outputs y1, y2, y7 and the optimized input u3 versus time; bottom panel: manipulated variables u1, u2, u3 versus time.]

If the feedback filter time constant (tfilter) is set to ∞, we get no feedback.


Chapter 8

Linear Time Invariant System Models
In this section we introduce different descriptions of linear time-invariant systems and review key concepts in linear systems. Focus will be given to discrete-time models, even though we introduce the continuous-time counterparts in order to point out the similarities and differences.

The two forms of the model we discuss are state-space models and transfer functions. Each form has some advantages and disadvantages associated with its use. Because state-space models are compact and convenient forms for representing multivariable systems, which are the main target systems for MPC, we will use the state-space model as the basis for building and analyzing advanced MPC algorithms in the subsequent chapters of this book. The transfer function form of the model may be preferred by some for reasons such as familiarity and ease of identification. This is not a problem, as linear systems can easily be converted from one form to the other, as we will show.

In many cases, a deterministic system representation alone may be insufficient for designing a high-performance controller. For example, when feedback measurements are infrequent or lacking completely, one needs to take advantage of the disturbance information and the resulting correlation structure in order to achieve satisfactory control. We will hence give a brief introduction to stochastic modeling of disturbances and discuss how to integrate stochastic disturbance descriptions into deterministic system models.

Comprehensive treatment of the various concepts in linear systems and stochastic modeling can easily take up an entire book. Our coverage here, being limited to a single chapter, will necessarily be terse and limited in scope. There are a number of excellent textbooks giving in-depth coverage of these concepts; some of them are listed in the bibliographical information section at the end of this chapter.

8.1 State-Space Model

8.1.1 Continuous Time

A standard form for a continuous-time state-space system is

dx/dt = A^c x + B^c v
y = C x    (8.1)

x denotes the state vector, v denotes the input vector containing the independent variables, and y denotes the output vector comprising linear combinations of the state variables. The state can be composed of physical variables if the model is derived from first principles, or of artificial variables storing the information about the past inputs needed to predict the future behavior of the outputs.
The solution to the above differential equation with initial condition x(0) = x_0 is

x(t) = e^{A^c t} x_0 + ∫_0^t e^{A^c (t−σ)} B^c v(σ) dσ    (8.2)

The matrix exponential in the above is defined by the infinite power series

e^{A^c t} = Σ_{n=0}^{∞} (A^c t)^n / n!    (8.3)

and can be evaluated conveniently using the Cayley-Hamilton Theorem discussed in Appendix ??.

8.1.2 Discrete Time

A discrete-time analogue of (8.1) is

x(t_{k+1}) = A x(t_k) + B v(t_k)
y(t_k) = C x(t_k)    (8.4)

The typical assumption underlying this representation is that the system input v(t) changes only at discrete, equally spaced times (…, t_k, t_{k+1}, …) (see Figure 8.1). The model describes the behavior of the state only at the discrete sample times; h_s = t_{k+1} − t_k is the sampling time. The discrete-time description can be obtained from the continuous state-space model, and the relationship between the continuous system matrices and the discrete system matrices will be derived shortly. It can also come directly from system identification, which is the subject of Chapter ??.

Figure 8.1: Discrete Time System with A Zero Order Hold

To simplify the notation, we will write the above as

x(k+1) = A x(k) + B v(k)
y(k) = C x(k)    (8.5)

where x(k) represents the value of x at the kth sample time, and so on.
Notice that, with the above model, one gets a delay of at least one sample
interval between v and y. This reflects the fact that the output does not
respond instantaneously to an input change.
Representation of time delays within the discrete-time setting is quite straightforward. For instance, consider the case where the input has a delay of d sampling units (where d ≥ 1). The system description is

x(k+1) = A x(k) + B v(k−d)
y(k) = C x(k)    (8.6)

which can be converted into the standard form of (8.5) by augmenting the state in the following way:

[x(k+1); v(k−d+1); ⋮; v(k−1); v(k)] = [A B 0 ⋯ 0; 0 0 I ⋯ 0; ⋮ ⋱ ⋱ ⋮; 0 0 ⋯ 0 I; 0 0 ⋯ 0 0] [x(k); v(k−d); ⋮; v(k−2); v(k−1)] + [0; 0; ⋮; 0; I] v(k)

y(k) = [C 0 ⋯ 0] [x(k); v(k−d); ⋮; v(k−1)]    (8.7)
Different units of delays in the individual input channels can be handled
efficiently in the same straightforward manner. Delays in the output channels
can also be handled in a similar manner.
Example 8.1 Consider the system

y(k) = 0.5 y(k−1) + v(k−3)    (8.8)

The above system has a delay of three sampling times (3 h_s), including the one-sample delay inherent to every system, between the input v and the output y. The corresponding state-space description is

[x1(k+1); x2(k+1); x3(k+1)] = [0.5 1 0; 0 0 1; 0 0 0] [x1(k); x2(k); x3(k)] + [0; 0; 1] v(k)
y(k) = x1(k)    (8.9)
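As a quick sanity check of (8.9), one can simulate the difference equation (8.8) and the state-space model side by side; a minimal MATLAB sketch with an arbitrary input sequence (all names below are local to the sketch):

    A = [0.5 1 0; 0 0 1; 0 0 0];  B = [0; 0; 1];  C = [1 0 0];
    N = 20;  v = randn(N,1);                 % arbitrary input sequence
    y1 = zeros(N,1);  y2 = zeros(N,1);  x = zeros(3,1);
    for k = 1:N
        yprev = 0;  if k > 1, yprev = y1(k-1); end
        vdel  = 0;  if k > 3, vdel  = v(k-3);  end
        y1(k) = 0.5*yprev + vdel;            % difference equation (8.8)
        y2(k) = C*x;                         % state-space output (8.9)
        x = A*x + B*v(k);                    % state-space update
    end
    max(abs(y1 - y2))                        % should be 0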

8.1.3 Converting Continuous-Time System to Discrete-Time System

Recall that the solution to the differential equation (8.1) with initial condition x(0) = x_0 is

x(t) = e^{A^c t} x_0 + ∫_0^t e^{A^c (t−σ)} B^c v(σ) dσ    (8.10)

If the state at time t_k is x(t_k), then by integration of (8.1) from t_k to t_{k+1} with a constant input v(t) = v(k), we obtain

x(t_{k+1}) = e^{A^c (t_{k+1} − t_k)} x(t_k) + ∫_{t_k}^{t_{k+1}} e^{A^c (t_{k+1} − σ)} B^c dσ v(k)    (8.11)

With the change of variable τ = t_{k+1} − σ, we obtain the corresponding discrete-time system

x(k+1) = A x(k) + B v(k)
y(k) = C x(k)    (8.12)

where

A = e^{A^c h_s}
B = ∫_0^{h_s} e^{A^c τ} dτ B^c    (8.13)

(8.13) gives the formulas for converting continuous-time system matrices into discrete-time system matrices.
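The formulas (8.13) are easy to evaluate numerically; the sketch below does so with a single matrix exponential (the c2d call assumes the Control System Toolbox and is shown only for comparison; the example matrices are arbitrary):

    Ac = [0 1; -2 -3];  Bc = [0; 1];  Cc = [1 0];  hs = 0.1;
    % One expm call gives both A and B: the top-right block of
    % expm([Ac Bc; 0 0]*hs) equals the integral in (8.13)
    M = expm([Ac Bc; zeros(1,3)]*hs);
    A = M(1:2,1:2);  B = M(1:2,3);
    sysd = c2d(ss(Ac,Bc,Cc,0), hs, 'zoh');   % Control System Toolbox
    norm(A - sysd.A) + norm(B - sysd.B)      % should be ~0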
When the continuous-time system has a delay, the conversion procedure becomes more complicated. As an example, let us consider the continuous-time state-space model of (8.1) with an input delay of τ_d. Suppose for a moment that 0 < τ_d ≤ h_s. Then the input v(t − τ_d) takes the values

v(t − τ_d) = { v(k−1)  for t_k ≤ t < t_k + τ_d;   v(k)  for t_k + τ_d ≤ t < t_{k+1} }    (8.14)

during the time interval between t_k and t_{k+1}. Then, integration of (8.1) from t_k to t_{k+1} yields

x(k+1) = A x(k) + B_1 v(k−1) + B_0 v(k)    (8.15)

where

B_1 = e^{A^c (h_s − τ_d)} ∫_0^{τ_d} e^{A^c σ} dσ B^c
B_0 = ∫_0^{h_s − τ_d} e^{A^c σ} dσ B^c    (8.16)

The above can be put in the standard form by augmenting the state x(k) with v(k−1):

[x(k+1); v(k)] = [A B_1; 0 0] [x(k); v(k−1)] + [B_0; I] v(k)    (8.17)

Note that the state vector now includes v(k−1) since its effect is not fully stored in x(k) due to the delay.

More generally, we can have τ_d = (d−1) h_s + τ′ where d (≥ 1) is a positive integer and 0 < τ′ ≤ h_s. Then the input v(t − τ_d) takes the following values:

v(t − τ_d) = { v(k−d)  for t_k ≤ t < t_k + τ′;   v(k−d+1)  for t_k + τ′ ≤ t < t_{k+1} }    (8.18)
Then integration of (8.1) from t_k to t_{k+1} yields

x(k+1) = A x(k) + B_1 v(k−d) + B_0 v(k−d+1)    (8.19)

where B_1 and B_0 are defined as in (8.16) but with τ_d replaced by τ′. The standard state-space representation for this case becomes

[x(k+1); v(k−d+1); ⋮; v(k−1); v(k)] = [A B_1 B_0 0 ⋯ 0; 0 0 I 0 ⋯ 0; ⋮ ⋱ ⋱ ⋮; 0 0 ⋯ 0 I; 0 0 ⋯ 0 0] [x(k); v(k−d); ⋮; v(k−2); v(k−1)] + [0; 0; ⋮; 0; I] v(k)    (8.20)

For systems with different input and output delays, the procedure can become very complex. One immediate way is to discretize the model for each input-output pair separately and then pack them together into a single model afterward (as described in Exercise ??). The resulting model is likely to include superfluous states, as some of the states for different input-output pairs could be merged and shared. Later in this chapter, we will describe a procedure called model reduction, which can be used to get rid of redundant or negligible states.

8.1.4 State-Coordinate Transformation

In describing an input-output system with a state-space model, the choice of state is not unique, as there are infinitely many different ways to store the same information. More precisely, the state coordinates can be redefined without affecting the input-output relationship. We can define a new coordinate system for the state vector according to the linear transformation x̄ = T x, where T is any nonsingular matrix. State-space system (8.5) can be rewritten in terms of x̄ as follows:

T^{-1} x̄(k+1) = A T^{-1} x̄(k) + B v(k)
y(k) = C T^{-1} x̄(k)    (8.21)

Rearranging the above gives the state-space system model for the new coordinate system

x̄(k+1) = (T A T^{-1}) x̄(k) + (T B) v(k) = Ā x̄(k) + B̄ v(k)
y(k) = (C T^{-1}) x̄(k) = C̄ x̄(k)    (8.22)

Even though state coordinate transformations do not affect the relationship


between input v and output y, a particular choice of coordinates may prove
to be convenient for a certain type of analysis, as we will illustrate shortly.

8.1.5 System Poles and Characteristic Equation

In state-space system (8.5), the eigenvalues of the state transition matrix A are called the system poles. The poles are calculated by solving the equation

det(λI − A) = 0    (8.23)

which is referred to as the characteristic equation of the system. Important dynamic characteristics of the system, including stability, can be inferred from the location of the system poles. Before we discuss the implications of the pole locations, we precisely define the notion of stability in the state-space systems context.

8.1.6 Stability

Definition 1 State-space system (8.5) is said to be asymptotically stable if the state returns to the origin from an arbitrary initial condition in the absence of input excitation, that is, if

lim_{k→∞} x(k) = 0   ∀ x(0) ∈ R^n with v ≡ 0

Asymptotic stability can be tested by examining the eigenvalues of A. A


necessary and sufficient condition for asymptotic stability of system (8.5) is
that all the eigenvalues of A lie strictly inside the unit circle. This implies
the system is unstable if any eigenvalue of A lies on or outside the unit circle.
This condition can be derived straightforwardly when the state coordinates are

chosen as the eigenvectors of A. Let us assume for convenience that A has a full set of eigenvectors e_1, …, e_n. Then, performing the coordinate transformation

x̄ = [e_1 ⋯ e_n]^{-1} x    (8.24)

transforms the state equation (8.5) (with v = 0) into

x̄(k+1) = [e_1 ⋯ e_n]^{-1} A [e_1 ⋯ e_n] x̄(k) = diag(λ_1, …, λ_n) x̄(k)    (8.25)
Hence, in the transformed coordinates, the behavior of each state is independent of the others and can be analyzed on a separate basis. From the equations x̄_i(k+1) = λ_i x̄_i(k), i = 1, …, n, it is clear that all λ_i's must have magnitudes less than one for asymptotic stability.

8.1.7 Lyapunov Equation

Let us introduce a concept associated with stability of a dynamical system and a related form of matrix equation that shows up in various contexts throughout the book. They arise from the so-called Lyapunov's Second Method, which is a general method for proving the stability of a nonlinear dynamical system. Paraphrased in the present discussion's context, we are interested in finding a convex function V(x) ≥ 0 such that V(0) = 0 and V(x) decreases strictly as the system's state x evolves in time. If such a function is found, we can conclude that the system is asymptotically stable.
For linear systems, we may consider the quadratic form V(x) = x^T P x as a candidate Lyapunov function. We will leave P general and see what condition we need to impose in order for it to qualify as a Lyapunov function. Since the main requirement is that V(x) decrease in time, we need

V(x(k+1)) − V(x(k)) = x^T(k+1) P x(k+1) − x^T(k) P x(k) = x^T(k) (A^T P A − P) x(k) < 0    (8.26)

Hence, the condition for x^T P x to be a Lyapunov function is that

A^T P A − P = −Q   for some Q > 0    (8.27)

The above form of equation is referred to as a Lyapunov equation for discrete-time systems. It is true (and can be proven) that a positive-definite solution (P > 0) to the Lyapunov equation always exists for any given Q > 0 if all the eigenvalues of A are strictly inside the unit disk. The proof is given as a homework exercise.
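Numerically, (8.27) can be solved with the Control System Toolbox function dlyap; since dlyap(A,Q) solves A X A^T − X + Q = 0, the transpose of A is passed to match the form (8.27). A sketch under that assumption, with arbitrary example matrices:

    A = [0.5 0.2; -0.1 0.3];       % eigenvalues strictly inside the unit disk
    Q = eye(2);                    % any Q > 0
    P = dlyap(A', Q);              % solves A'*P*A - P + Q = 0
    norm(A'*P*A - P + Q)           % residual, ~0
    min(eig(P)) > 0                % P is positive definite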

Definition 2 State-space system (8.5) is said to be bounded input / bounded output (BIBO) stable if y remains bounded in response to any bounded v.

BIBO stability is a weaker condition than asymptotic stability, i.e., asymptotic stability implies BIBO stability but the reverse is not necessarily true. This is because systems containing simple integrators that are not reachable by the input v may still satisfy the BIBO stability requirement, as the following example shows.
Example 8.2 Consider the following system with an integrating state.

[x1(k+1); x2(k+1)] = [0.2 0.5; 0.0 1.0] [x1(k); x2(k)] + [1; 0] v(k)
y(k) = [1 1] [x1(k); x2(k)]    (8.28)

Note that the system is BIBO stable since the integrating state x2 is not affected by the input v(k) and hence does not ramp up or down under a constant input. However, the system is not asymptotically stable since x2 remains constant and does not return to zero.

8.1.8 Controllability, Reachability, and Stabilizability

A desirable property of a state-space system to be controlled is controllability or reachability. Controllability asks the question of whether it is possible to move from an arbitrary state to the origin in finite time by manipulating the inputs; reachability asks the same but for an arbitrary destination state rather than just the origin. Stabilizability relaxes controllability by removing the requirement of finite time.
Definition 3 System (8.5) is said to be controllable if it is possible to find
an input sequence that moves the system from an arbitrary initial state to the
origin in finite time.
Definition 4 System (8.5) is said to be reachable if it is possible to find
an input sequence that moves the system from an arbitrary initial state to any
desired state in finite time.
Note that a reachable system is always controllable, but a controllable system is not necessarily reachable. For example, if A^n x(0) = 0 for every x(0), the system is controllable regardless of whether it is reachable.
Even though reachability or controllability is a desirable property, it is
often too strong a requirement for practical systems. A minimum requirement
for controller design is stabilizability.
Definition 5 System (8.5) is said to be stabilizable if there exists an input
sequence that returns the state to the origin asymptotically (as k ) from
an arbitrary initial condition.


From this definition, stable systems are necessarily stabilizable.

The following are some mathematical conditions for reachability. The most well-known is the following rank condition:

rank(W_n^c) = n    (8.29)

W_n^c = [B  AB  ⋯  A^{n−1}B]    (8.30)

This condition can be easily proved using the Cayley-Hamilton Theorem in Appendix ??. First, note that

x(k) = A^k x(0) + [B  AB  ⋯  A^{k−1}B] [v(k−1); v(k−2); ⋮; v(0)]    (8.31)

Reachability requires that, for all choices of x(0) ∈ R^n and x_t ∈ R^n, there be an input sequence (v(0), …, v(k−1)) such that x(k) = x_t.

Let us denote [B  AB  ⋯  A^{k−1}B] by W_k^c. Then the reachability condition is simply that W_k^c have full row rank for some finite k. From the Cayley-Hamilton Theorem it is immediate that rank(W_k^c) ≤ rank(W_n^c) for all k, and the reachability condition reduces to (8.29). W_n^c is called the controllability matrix. From the foregoing argument, it should be clear that the range space of the controllability matrix defines the reachable subspace, i.e., a subspace of the state space wherein the state can be moved around freely by manipulating the input.
Another useful test for reachability is the so-called Hautus condition, which is

rank [A − λI   B] = n   ∀ λ ∈ Λ_A    (8.32)

where Λ_A denotes the spectrum (the set of all eigenvalues) of A. To see how the above condition arises, let us first prove that

Reachability ⟹ rank [A − λI  B] = n  ∀ λ ∈ Λ_A

For this, we can prove instead the contrapositive:

rank [A − λI  B] ≠ n for some λ ∈ Λ_A ⟹ Not Reachable

If rank [A − λI  B] ≠ n for some λ ∈ Λ_A, then there exists a row vector x^T ≠ 0^T such that x^T [A − λI  B] = 0^T. Since the above implies that x^T(A − λI) = 0^T, x is a left eigenvector of A (corresponding to the eigenvalue λ). Note that it also implies x^T B = 0^T. Hence,

x^T [B  AB  ⋯  A^{n−1}B] = [x^T B   λ x^T B   ⋯   λ^{n−1} x^T B] = 0^T
This implies that the controllability matrix does not have full row rank and therefore the system is not reachable.
We must also prove the converse:

Reachability ⟸ rank [A − λI  B] = n  ∀ λ ∈ Λ_A

Again, we can prove instead the contrapositive:

Not Reachable ⟹ rank [A − λI  B] ≠ n for some λ ∈ Λ_A

The assumption of the system being not reachable implies that there exists x such that x^T [B  AB  ⋯  A^{n−1}B] = 0^T, which means

x^T B = 0,  x^T AB = 0,  …,  x^T A^{n−1}B = 0

This implies that x can be taken to be a left eigenvector of A with x^T B = 0 (the set of such vectors is a subspace invariant under A^T and hence contains an eigenvector of A^T). In turn, this implies that

x^T [A − λI  B] = 0^T for some λ ∈ Λ_A

which means the rank is not n.

There exists a coordinate transformation, [x̄1; x̄2] = [T1^c  T2^c]^{-1} x, that transforms the state-space model into the form

[x̄1(k+1); x̄2(k+1)] = [Ā11 Ā12; 0 Ā22] [x̄1(k); x̄2(k)] + [B̄1; 0] v(k)    (8.33)

where (Ā11, B̄1) is a reachable pair. This form is intuitive and convenient, as x̄1 represents the part of the state that is reachable and x̄2 represents the part that is not (and cannot be affected by the input at all).
To put the model in the above form, we must choose T1^c and T2^c such that the columns of T1^c span the range space of W_n^c and those of T2^c its orthogonal complement. One can find such matrices, for instance, by performing a singular value decomposition on the controllability matrix,

W_n^c = [U1  U2] [Σ1 0; 0 0] [V1^T; V2^T]    (8.34)

and choosing

T1^c = U1,   T2^c = U2    (8.35)

If all the eigenvalues of A22 are zero, the system is controllable. If the
eigenvalues of A22 in (8.33) lie strictly inside the unit disk, the system is
stabilizable. This means that the states that are not reachable evolve through
stable dynamics.
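These tests take only a few lines to run; a minimal sketch (ctrb is a Control System Toolbox function; the example system is arbitrary, with one unreachable mode):

    A = [0.5 1; 0 0.8];  B = [1; 0];
    Wc = ctrb(A, B);                  % [B A*B]
    r  = rank(Wc);                    % r = 1 < 2: not reachable
    [U,S,V] = svd(Wc);
    T  = U;                           % [T1c T2c] as in (8.35)
    Abar = T\A*T;                     % staircase form (8.33)
    abs(eig(Abar(r+1:end, r+1:end)))  % 0.8 < 1, so the system is stabilizable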



Another condition that can be used for testing stabilizability is

rank [A − λI   B] = n   ∀ λ ∈ Λ_A^+    (8.36)

where Λ_A^+ is the set of all eigenvalues of A lying on or outside the unit disk. This can be proved by following essentially the same argument as the one we used for proving the analogous test for reachability. The proof is left as a homework exercise.
It should be clear to the reader by now that reachability and stabilizability are intrinsic properties of a system; their absence can be overcome only by changing the system dynamics or by choosing a different set of inputs.
Example 8.3 We will use the following simple system to illustrate the concepts of reachability and the reachable subspace:

[x1(k+1); x2(k+1)] = [1.5+0.5a  0.5−0.5a; 1.5−0.5a  0.5+0.5a] [x1(k); x2(k)] + [1; 1] v(k)    (8.37)

The controllability matrix is

W2^c = [B  AB] = [1 2; 1 2]

The rank is 1 and the system is not reachable. The SVD of W2^c looks like

W2^c = [1/√2  1/√2; 1/√2  −1/√2] [√10  0; 0  0] [1/√5  2/√5; 2/√5  −1/√5]^T

From this, we learn that the range space of W2^c, defined by the span of the first output singular vector [1/√2  1/√2]^T, is the reachable subspace. The subspace is graphically illustrated in Figure 8.2. Performing the coordinate transformation

[x̄1(k); x̄2(k)] = [1/√2  1/√2; 1/√2  −1/√2] [x1(k); x2(k)]

we obtain

[x̄1(k+1); x̄2(k+1)] = [2  1; 0  a] [x̄1(k); x̄2(k)] + [√2; 0] v(k)

Hence, x̄1 = (x1 + x2)/√2 is the portion of the state that can be controlled, and x̄2 = (x1 − x2)/√2 evolves according to the autonomous dynamics regardless of the input. The stabilizability of the system depends on a. If |a| ≥ 1, the system is not stabilizable, as shown by the possible control responses for various values of a in Figure 8.2.

Figure 8.2: Reachable Subspace and Possible Control Responses for Various Values of a in Example 8.3

Example 8.4 Let us use the following state-space system to further illustrate the concepts:

x(k+1) = [0.65 0.1 0 0.45; 0.35 0.4 0.3 0.15; 0 0.25 0.35 0.2; 0.2 0.05 0.15 0.4] x(k) + [0.5; 0.5; 0; 0] u(k)
y(k) = [2 0 0 2] x(k)    (8.38)

The controllability matrix is calculated as

W^c = [B  AB  A²B  A³B] = [0.5 0.375 0.3375 0.3263; 0.5 0.375 0.3375 0.3263; 0 0.125 0.1625 0.1738; 0 0.125 0.1625 0.1738]

The rank of the above matrix is 2 and therefore the system is not reachable. The eigenvalues of A are 1, 0, 0.5 and 0.3. Evaluating the rank of the matrix [A − λI  B], we find that the rank is 4, 3, 3, 4 for λ = 1, 0, 0.5, 0.3. Hence, we reach the same conclusion that the system is not reachable. Moreover, the two eigenvalues that resulted in loss of rank (λ = 0, 0.5) are both stable. So, the system is stabilizable.

Performing an SVD on W^c, we obtain

W^c = [u1 u2 u3 u4] diag(σ1, σ2, σ3, σ4) [v1^T; v2^T; v3^T; v4^T]

= [0.6823 −0.1857 0.7071 0; 0.6823 −0.1857 −0.7071 0; 0.1857 0.6823 0 0.7071; 0.1857 0.6823 0 −0.7071] diag(1.1438, 0.2412, 0, 0) [0.5965 0.4880 0.4554 0.4457; −0.7699 0.1296 0.3995 0.4804; 0.1184 −0.5943 0.7476 −0.2716; 0.1932 −0.6260 −0.2721 0.7048]

Note that the span of u1 and u2 represents the reachable subspace. Also, the choices of u3 and u4 are not unique (since they both correspond to the zero singular values), and any two orthonormal vectors in the span of the above-given u3 and u4 will work. For example, in the SVD, we could have used instead

u3 = [0.5; −0.5; 0.5; −0.5],   u4 = [0.5; −0.5; −0.5; 0.5]

since the span of these two vectors is the same as that of the previous choices of u3 and u4. The same can be said for the choices of v3 and v4.

With the coordinate transformation z = U^{-1} x = U^T x, we can compute the new state-space matrices as

Ā = U^{-1} A U = [0.9136 0.3510 0.2281 0.1579; 0.1510 0.3864 0.1139 0.0088; 0 0 0.3 0.3; 0 0 0.2 0.2]
B̄ = U^{-1} B = [0.6823; −0.1857; 0; 0]
C̄ = C U = [1.7360  0.9931  1.4142  1.4142]

One can clearly see that the first two states in the transformed coordinates are reachable and the next two are not reachable. Since the eigenvalues of the matrix [0.3 0.3; 0.2 0.2] are 0.5 and 0, we conclude that the system is stabilizable.

8.1.9 Observability, Reconstructability, and Detectability

Another desirable system property for output feedback control is observability or reconstructability, which asks whether it is possible to reconstruct an unknown state sequence from the corresponding output sequence. Detectability is a related but weaker concept.
Definition 6 System (8.5) is said to be observable if no two different initial states with the same input sequence produce identical outputs for all times.

If a system is observable, it is possible to reconstruct any arbitrary, unknown initial state x(0) from knowledge of the input sequence v(0), v(1), …, v(k−1) and the measurement sequence y(0), y(1), …, y(k−1) in finite time k. Hence, for time-invariant systems, observability is equivalent to reconstructability, which asks the question of whether it is possible to reconstruct any given state from past measurements.

A concept analogous to stabilizability is detectability.

Definition 7 A system is called detectable if one can construct from the measurement sequence a sequence of state estimates that converges to the true state asymptotically (k → ∞) starting from an arbitrary initial estimate.
The following are some useful conditions for observability.

The best known condition is

rank(W_n^o) = n    (8.39)

where

W_n^o = [C^T  (CA)^T  ⋯  (CA^{n−1})^T]^T    (8.40)

This condition can also be proved in a manner similar to the proof of reachability using the Cayley-Hamilton Theorem. The proof is left as an exercise. W_n^o is called the observability matrix, and the null space of W_n^o defines the unobservable subspace. A state starting in the unobservable subspace gives zero output.

From the above result, we can deduce that observability of (8.5) is equivalent to reachability of its dual system

x(k+1) = A^T x(k) + C^T v(k)
y(k) = B^T x(k)    (8.41)

Another condition for observability is

rank [A^T − λI   C^T] = n   ∀ λ ∈ Λ_A    (8.42)

This condition can easily be derived by invoking the duality argument, i.e., by noting the fact that observability of the system (8.5) is equivalent to reachability of its dual system (8.41).

As before, we can find a coordinate transformation [x̄1; x̄2] = [T1^o  T2^o]^{-1} x that transforms (8.5) into the form

[x̄1(k+1); x̄2(k+1)] = [Ā11 0; Ā21 Ā22] [x̄1(k); x̄2(k)] + [B̄1; B̄2] v(k)
y(k) = [C̄1  0] [x̄1(k); x̄2(k)]    (8.43)

where (C̄1, Ā11) is an observable pair. In the above form, x̄2 represents the part of the state that is not observable from y. In other words, the value of x̄2 does not have any effect on the output and hence cannot be observed from it.

To put the system in the above form, one must choose T2^o so that its columns span the null space of W_n^o and T1^o as its orthogonal complement. Again, an SVD of W_n^o can be used conveniently to obtain such a transformation, as we will illustrate below.

System (8.43) (and hence the original system (8.5)) is detectable if all the eigenvalues of Ā22 lie inside the unit disk. A more explicit test for detectability that can be derived from the duality argument is

rank [A^T − λI   C^T] = n   ∀ λ ∈ Λ_A^+    (8.44)

Example 8.5 Let us use the dual system of (8.37), which we used in Example 8.3, to elucidate the concepts of observability and the unobservable subspace:

[x1(k+1); x2(k+1)] = [1.5+0.5a  1.5−0.5a; 0.5−0.5a  0.5+0.5a] [x1(k); x2(k)]    (8.45)
y(k) = [1 1] [x1(k); x2(k)]

The observability matrix is

W2^o = [C^T  A^TC^T]^T = [1 1; 2 2]

The rank is 1 and the system is not observable. The SVD of W2^o looks like

W2^o = [1/√5  2/√5; 2/√5  −1/√5] [√10  0; 0  0] [1/√2  1/√2; 1/√2  −1/√2]^T
The span of the second input singular vector [1/√2  −1/√2]^T is the null space of W2^o, which is the unobservable subspace of the system. This is illustrated in Figure 8.3.

Figure 8.3: Unobservable Subspace and Its Meaning in Example 8.5

Performing the coordinate transformation

[x̄1(k); x̄2(k)] = [1/√2  1/√2; 1/√2  −1/√2] [x1(k); x2(k)]

we obtain this time

[x̄1(k+1); x̄2(k+1)] = [2  0; 1  a] [x̄1(k); x̄2(k)]
y(k) = [√2  0] [x̄1(k); x̄2(k)]
x̄1 is the portion of the state that can be observed, and x̄2 the portion that cannot be seen from the output at all, as its evolution has no effect on y. The detectability of the system depends on a. If |a| ≥ 1, the system is not detectable.
Example 8.6 Let us bring back the system (8.38), which we used to demonstrate the use of the reachability and stabilizability tests earlier in Example 8.4. This time, we will analyze the system for observability and detectability. First, we calculate the observability matrix:

W^o = [C^T  A^TC^T  (A^T)²C^T  (A^T)³C^T]^T = [2 1.7 1.55 1.475; 0 0.3 0.45 0.525; 0 0.3 0.45 0.525; 2 1.7 1.55 1.475]^T

The rank of the above matrix is 2 and therefore the system is not observable. Evaluating the rank of the matrix [A^T − λI  C^T], we find that the rank is 4, 3, 4, 3 for λ = 1, 0, 0.5, 0.3. Hence, the system is not observable. However, the two eigenvalues that resulted in the loss of rank (λ = 0, 0.3) are both stable. So, the system is detectable.

Performing an SVD on W^o, we obtain

(W^o)^T = [u1 u2 u3 u4] diag(σ1, σ2, σ3, σ4) [v1^T; v2^T; v3^T; v4^T]

= [0.6964 −0.1227 0.7018 0.0867; 0.1227 0.6964 0.0867 −0.7018; 0.1227 0.6964 −0.0867 0.7018; 0.6964 −0.1227 −0.7018 −0.0867] diag(4.8615, 0.6618, 0, 0) [0.5730 0.5022 0.4668 0.4491; −0.7416 0.0010 0.3724 0.5580; 0.3478 −0.7990 −0.0379 0.4890; −0.0273 0.3308 −0.8013 0.4978]
0 0

Note that the span of u3 and u4, which is the null space of W^o, represents the unobservable subspace. As before, the choices of u3 and u4 are not unique since they both correspond to the zero singular values, and any two orthonormal vectors in the span of the above-given u3 and u4 will work. For example, in the SVD, we could have used instead

u3 = [0.5; 0.5; −0.5; −0.5],   u4 = [0.5; −0.5; 0.5; −0.5]

since the span of these two vectors is the same as that of the former choices of u3 and u4. The same can be said for the choices of v3 and v4.

With the coordinate transformation z = U^{-1} x = U^T x, we can compute the new state-space matrices as

Ā = U^{-1} A U = [0.9294 0.1008 0 0; 0.3008 0.5706 0 0; 0.2549 0.0891 0.2350 0.0833; 0.1261 0.0344 0.1833 0.0650]
B̄ = U^{-1} B = [0.4095; 0.2868; 0.3942; −0.3075]
C̄ = C U = [2.7855  −0.4908  0  0]

We can clearly see that, with this coordinate transformation, the first two states are observable and the next two are not observable. Since the eigenvalues of the matrix [0.2350 0.0833; 0.1833 0.0650] are 0.3 and 0, we see that the system is detectable.

8.1.10 Kalman's Decomposition

Kalman (1961) showed that one can always find a coordinate transformation yielding the following decomposition:


[x̄1(k+1); x̄2(k+1); x̄3(k+1); x̄4(k+1)] = [Ā11 Ā12 0 0; 0 Ā22 0 0; Ā31 Ā32 Ā33 Ā34; 0 Ā42 0 Ā44] [x̄1(k); x̄2(k); x̄3(k); x̄4(k)] + [B̄1; 0; B̄3; 0] v(k)

y(k) = [C̄1  C̄2  0  0] [x̄1(k); x̄2(k); x̄3(k); x̄4(k)]    (8.46)

The decomposition clearly indicates the parts of the state that are reachable and/or unobservable, i.e., x̄1 through x̄4 represent the parts of the state that are both reachable and observable, observable but not reachable, reachable but not observable, and neither reachable nor observable, respectively.
The Kalman decomposition can be obtained by performing a coordinate transformation x̄ = [T1 T2 T3 T4]^{-1} x where T1, T2, T3, T4 are chosen appropriately. The columns of T1 must be linearly independent vectors belonging to the range space of W_n^c but not to the null space of W_n^o, the columns of T2 must be linearly independent vectors that constitute an orthogonal complement to the range space of W_n^c but do not belong to the null space of W_n^o, etc. This is illustrated in the example below.
Example 8.7 Again, we use the same system (8.38), which we used in Examples 8.4 and 8.6, for illustration. We see from the previous SVDs that the orthogonal complement of the reachable subspace (denoted here by R⊥) and the unobservable subspace (S) are

R⊥ = Span{ [0.5; −0.5; 0.5; −0.5], [0.5; −0.5; −0.5; 0.5] };   S = Span{ [0.5; 0.5; −0.5; −0.5], [0.5; −0.5; 0.5; −0.5] }

From the above,

R⊥ ∩ S = Span{ [0.5; −0.5; 0.5; −0.5] },   R ∩ S = Span{ [0.5; 0.5; −0.5; −0.5] }

Based on the above, we can conclude that

the span of [0.5, −0.5, 0.5, −0.5]^T is both not reachable and not observable;

the span of [0.5, 0.5, −0.5, −0.5]^T is reachable (since it is orthogonal to R⊥) but not observable;

the span of [0.5, −0.5, −0.5, 0.5]^T is observable (since it is orthogonal to S) but not reachable;

and the span of [0.5, 0.5, 0.5, 0.5]^T, the orthogonal complement of R⊥ + S, is both reachable and observable.

Hence, let us do the coordinate transformation

z = T^{-1} x,   T = [0.5 0.5 0.5 0.5; 0.5 −0.5 0.5 −0.5; 0.5 −0.5 −0.5 0.5; 0.5 0.5 −0.5 −0.5]

Then, we obtain

Ā = T^{-1} A T = [1 0.2 0 0; 0 0.5 0 0; 0.2 0.2 0.3 0.1; 0 0.1 0 0]
B̄ = T^{-1} B = [0.5; 0; 0.5; 0]
C̄ = C T = [2  2  0  0]

8.1.11 Minimal Realization

We say that a state-space system is minimal if it is both reachable and observable. A minimal realization of a system is a minimal system that is equivalent to the original system in the input-output sense but is devoid of the unreachable and/or unobservable parts. This terminology conveys the fact that, for the purpose of describing input-output system dynamics, unreachable and/or unobservable parts are superfluous and can be eliminated without loss of information. Based on these definitions, a minimal realization of (8.46), which results from Kalman's decomposition, is

x̄1(k+1) = Ā11 x̄1(k) + B̄1 v(k)
y(k) = C̄1 x̄1(k)    (8.47)

Of course, any system related to the above by a state coordinate transformation also qualifies as a minimal realization of (8.46).

Example 8.8 Based on the result of the Kalman decomposition, a minimal realization of (8.38) is

x(k+1) = x(k) + 0.5 v(k)
y(k) = 2 x(k)    (8.48)
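With the Control System Toolbox, Example 8.8 can be confirmed numerically; minreal strips the unreachable and unobservable states (a sketch, up to the state coordinates chosen by the function):

    A = [0.65 0.1 0 0.45; 0.35 0.4 0.3 0.15; 0 0.25 0.35 0.2; 0.2 0.05 0.15 0.4];
    B = [0.5; 0.5; 0; 0];  C = [2 0 0 2];
    sys  = ss(A, B, C, 0, 1);   % discrete time, unit sample time
    sysm = minreal(sys);        % one state remains
    tf(sysm)                    % same transfer function as 2*0.5/(z-1) = 1/(z-1)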

8.1.12 Model Reduction

A model of lower order is generally preferred as it implies reduced complexity and computational requirement for tcontrol system design and implementation. On the other hand, it is often the case that a particular modeling
method we use yields a high-order model that may not even be minimal. A
good example is the step response identification that we discussed earlier. For
systems with many inputs and outputs, one can easily end up with a model
with hundreds or even thousands of state variables. The same can be said
about first-principles modeling. This motivates us to develop a model reduction technique that can transform a high-order model into a low-order model
that still retains the essential system information for controller design.
Beyond removing superfluous states and constructing a minimal realization, model reduction in general involves trading off some accuracy in describing the systems input output behavior for a reduced system order. Typically,
this is carried out by sorting the states according to their importance through
some appropriate coordinate transformation and removing those judged negligible. Formally stated, for the nth order state-space system of (8.5), the
objective is to obtain an rth order model (r < n) that closely approximates
the input-output relationship of the original system.
Let us introduce the state coordinate transformation x̄(k) = T x(k) where T is a nonsingular matrix. The state-space system in the transformed state coordinates can be written as

x̄(k+1) = Ā x̄(k) + B̄ v(k)
y(k) = C̄ x̄(k)    (8.49)

where Ā = TAT^{-1}, B̄ = TB, and C̄ = CT^{-1}. If the coordinate transformation is chosen such that the transformed state variables are defined and ordered to reflect the relative importance of the various modes for describing the system's input-output relationship, one may obtain the desired rth order system simply by truncating the last n − r states as

x̄_r(k+1) = Ā_r x̄_r(k) + B̄_r v(k)
y(k) = C̄_r x̄_r(k)    (8.50)

where

x̄_r(k) = [I_r  0] x̄(k)
Ā_r = [I_r  0] Ā [I_r; 0]
B̄_r = [I_r  0] B̄,   C̄_r = C̄ [I_r; 0]    (8.51)

The notion of importance of the various modes depends on the application and needs to be defined precisely in order to develop and evaluate a systematic model reduction method (i.e., a particular choice of T). Typical methods are based on minimization of the so-called Hankel norm of the difference and on balanced realization. Details of these methods are given in Appendix ??.

8.2 Transfer Functions

Another way to represent system dynamics is through a transfer function, which algebraically relates the Laplace / z transforms of inputs and outputs. Such a model can describe only the response behavior of the output to a change in the input. This is acceptable, as the primary concern in most controller designs is the input-output behavior.

8.2.1 Continuous Time

A popular form of model for a linear time-invariant system is the transfer function, which relates the Laplace transforms of the input and the output in the following way:

G^c(s) = y(s)/v(s) = (β_m s^m + β_{m−1} s^{m−1} + ⋯ + β_1 s + β_0) / (s^n + α_{n−1} s^{n−1} + ⋯ + α_1 s + α_0),   m < n    (8.52)

In the above, y(s) and v(s) denote the Laplace transforms of the output and the input, respectively. This corresponds to an input-output relationship described by the following time-domain differential equation:

d^n y/dt^n + α_{n−1} d^{n−1}y/dt^{n−1} + ⋯ + α_1 dy/dt + α_0 y = β_m d^m v/dt^m + β_{m−1} d^{m−1}v/dt^{m−1} + ⋯ + β_1 dv/dt + β_0 v    (8.53)

Taking the Laplace transform of (8.53) with the assumption that the system is initially at rest yields the transfer function of (8.52).
n is the order of the transfer function. All physical systems are strictly proper, that is, m < n, as outputs do not respond instantaneously to inputs. Time delays can be incorporated conveniently into transfer functions. For example, the same input-output relationship but with an additional delay of τ is represented by

y(s)/v(s) = [(β_m s^m + β_{m−1} s^{m−1} + ⋯ + β_1 s + β_0) / (s^n + α_{n−1} s^{n−1} + ⋯ + α_1 s + α_0)] e^{−τs}    (8.54)

8.2.2 Discrete Time

In discrete time, nth order input-output dynamics are expressed by

y(k+n) + a_1 y(k+n−1) + ⋯ + a_{n−1} y(k+1) + a_n y(k) = b_{n−m} v(k+m) + b_{n−m+1} v(k+m−1) + ⋯ + b_{n−1} v(k+1) + b_n v(k)    (8.55)

The above can be viewed as a discrete-time analogue of (8.53), which can be obtained, for example, by approximating the derivatives using finite difference formulas. Note that, unlike in state-space models, we do not have state variables x which act as an intermediate storage between inputs and outputs. Instead, several delayed input and output terms appear explicitly in the model. With some abuse of notation, we will sometimes write (8.55) as

y(k) = [(b_{n−m} q^m + b_{n−m+1} q^{m−1} + ⋯ + b_n) / (q^n + a_1 q^{n−1} + ⋯ + a_n)] v(k)    (8.56)

with q denoting the forward-shift operator (i.e., q^ℓ{y(k)} = y(k+ℓ)). Performing the z-transform on the difference equation yields

G(z) = y(z)/v(z) = (b_{n−m} z^m + b_{n−m+1} z^{m−1} + ⋯ + b_{n−1} z + b_n) / (z^n + a_1 z^{n−1} + ⋯ + a_{n−1} z + a_n) = N(z)/D(z)    (8.57)

G(z) is the discrete-time version of the Laplace transfer function and is often referred to as the pulse transfer function in the literature, since it describes the output response behavior to a series of pulse changes in the input. n − m > 0 is the relative degree. The roots of D(z) = 0 are the poles and the roots of N(z) = 0 the zeros of the transfer function. Poles and zeros are intimately linked with the dynamic characteristics of the input-output system, as we will discuss shortly. Dividing the top and bottom by z^n gives

G(z) = (b_{n−m} z^{−(n−m)} + b_{n−m+1} z^{−(n−m+1)} + ⋯ + b_{n−1} z^{−(n−1)} + b_n z^{−n}) / (1 + a_1 z^{−1} + ⋯ + a_{n−1} z^{−(n−1)} + a_n z^{−n})    (8.58)

The corresponding time-domain expression is

y(k) = −a_1 y(k−1) − ⋯ − a_{n−1} y(k−n+1) − a_n y(k−n) + b_{n−m} v(k−n+m) + ⋯ + b_{n−1} v(k−n+1) + b_n v(k−n)
     = (−a_1 q^{−1} − ⋯ − a_{n−1} q^{−(n−1)} − a_n q^{−n}){y(k)} + (b_{n−m} q^{−n+m} + ⋯ + b_n q^{−n}){v(k)}    (8.59)


So the model can be written using the delay operator (q^{−1}) instead of the forward-shift operator. The above form of the equation provides a useful insight about the nature of the model: the current output is a linear combination of the n past inputs and outputs. The relative degree, n − m, represents the system's time delay, measured in number of sample intervals.

Comments:

For convenience, we will often replace y(z) = G(z)v(z) with the time-domain expression y(k) = G(q)v(k), which denotes the difference equation (8.56) corresponding to the transfer function (8.57).

In the special case that a_i = 0 for all i, we obtain

y(k) = b_{n−m} v(k−n+m) + b_{n−m+1} v(k−n+m−1) + ⋯ + b_{n−1} v(k−n+1) + b_n v(k−n)    (8.60)

With n chosen to correspond to the system's settling time, the above is a Finite Impulse Response (FIR) model like the one we used throughout the basic part of the book.

8.2.3 Transfer Matrix

For systems with n_v inputs and n_y outputs we need to specify n_y × n_v transfer functions. The individual transfer functions are typically arranged in the form of an array as shown below:

G(z) = [G_{1,1}(z) G_{1,2}(z) ⋯ G_{1,n_v}(z); G_{2,1}(z) ⋯ G_{2,n_v}(z); ⋮ ⋱ ⋮; G_{n_y,1}(z) ⋯ G_{n_y,n_v}(z)]    (8.61)

G_{i,j} denotes the transfer function between the jth input and the ith output. The above is referred to as the transfer matrix.

8.2.4 Converting Continuous Transfer Function To Discrete Transfer Function

A discrete-time transfer function that corresponds to a continuous-time system with a zero-order hold can be obtained using the formula

G(z) = Z{ L^{-1}[ ((1 − e^{−h_s s})/s) G^c(s) ] }    (8.62)

where L^{-1} and Z denote the inverse Laplace transformation and the z-transformation, respectively. Note that the discretization requires multiplication of the continuous transfer function by (1 − e^{−h_s s})/s, which corresponds to a zero-order hold, before taking the z-transform. Note that G^c(s)(1 − e^{−h_s s})/s is the Laplace transform of the output response to a discrete unit impulse (a unit-size pulse that starts at t = 0 and lasts for one sample interval). This is consistent with the fact that the z-transform of a discrete unit impulse is 1.

Conversion tables for common types of transfer functions are available in many standard textbooks, but they cannot be used directly when the system includes fractional delays (non-integer multiples of the sample time).
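With the Control System Toolbox, the zero-order-hold conversion (8.62), including fractional delays, is automated by c2d; a sketch using the g11 transfer function numbers from the fractionator case study (illustrative only):

    Gc = tf(4.05, [50 1], 'InputDelay', 27);   % 4.05 e^{-27s}/(50s+1)
    hs = 5;                                    % 27 = 5*hs + 2: a fractional delay
    Gd = c2d(Gc, hs, 'zoh')                    % discrete transfer function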

8.2.5 Representing State-Space System As Transfer Function

A pulse transfer matrix can be obtained from state-space model (8.5) by taking the z-transform with x(0) = 0:

z x(z) = A x(z) + B v(z)
y(z) = C x(z)    (8.63)

Substituting for x(z) in the second equation, we find the relationship

y(z) = [C (zI − A)^{-1} B] v(z) = G(z) v(z)    (8.64)

A state coordinate transformation does not change the transfer matrix since

C T^{-1} (zI − T A T^{-1})^{-1} T B = C (zI − A)^{-1} B    (8.65)

Deriving an expression for the transfer matrix of a system given in the form of (8.46), which is obtained after the Kalman decomposition, provides some insight. Straightforward calculation of (8.64) yields

G(z) = C̄1 (zI − Ā11)^{-1} B̄1    (8.66)

A noteworthy point is that only the matrices for the reachable and observable part of the state appear in the expression. The parts of the state-space system that are not reachable and/or not observable do not affect the transfer function representation.
Example 8.9 Consider the integrating system y(s)/u(s) = 1/s². The corresponding time-domain equation is d²x/dt² = u. The above equation can be put in the form

d/dt [x1; x2] = [0 1; 0 0] [x1; x2] + [0; 1] u
y = [1 0] [x1; x2]

We first discretize the state-space equation using the discretization formulas given in (8.13) to put it in the form of (8.5):

A = e^{A^c h_s} = I + A^c h_s = [1 h_s; 0 1]   (since (A^c)² = 0)

B = (∫_0^{h_s} e^{A^c s} ds) B^c = [h_s  h_s²/2; 0  h_s] [0; 1] = [h_s²/2; h_s]

x(k+1) = [1 h_s; 0 1] x(k) + [h_s²/2; h_s] u(k)
y(k) = [1 0] x(k)

Performing the z-transform on the state-space equation, we see that the corresponding transfer function is

G(z) = C(zI − A)^{-1} B = [1 0] [z−1  −h_s; 0  z−1]^{-1} [h_s²/2; h_s] = h_s²(z+1) / (2(z−1)²)
Alternatively, we can obtain the transfer function by taking the z-transform of the response to a unit pulse input, which is given by L^{-1}{((1 − e^{−h_s s})/s)(1/s²)}, as indicated by the formula (8.62):

L^{-1}{(1 − e^{−h_s s})/s³} = (t²/2) S(t) − ((t−h_s)²/2) S(t−h_s)

where S(t) and S(t−h_s) are unit step functions starting at time 0 and h_s, respectively. First,

Z{(t²/2) S(t)} = h_s²(z² + z) / (2(z−1)³)

Since ((t−h_s)²/2) S(t−h_s) is simply (t²/2) S(t) translated by a sample interval h_s,

Z{((t−h_s)²/2) S(t−h_s)} = h_s²(1 + z) / (2(z−1)³)

Z{(t²/2) S(t) − ((t−h_s)²/2) S(t−h_s)} = h_s²(z² + z)/(2(z−1)³) − h_s²(1 + z)/(2(z−1)³) = h_s²(z + 1) / (2(z−1)²)

We see that the final result is the same.

8.2.6 Realization of Transfer Function As State-Space System

Any transfer function matrix can be realized as a state-space system. While the transfer matrix for a given state-space system is unique, the reverse is not true. For a given transfer matrix, there are infinitely many possibilities for state-space realization. To see this, simply recall the earlier result that a transfer matrix is invariant under a state coordinate transformation.
As an example, let us consider a transfer function

y(z)/v(z) = (b_1 z^{−1} + ⋯ + b_n z^{−n}) / (1 + a_1 z^{−1} + ⋯ + a_n z^{−n})    (8.67)

which represents the I/O relationship of

y(k) = −a_1 y(k−1) − ⋯ − a_n y(k−n) + b_1 v(k−1) + ⋯ + b_n v(k−n)    (8.68)
Two possible state-space realizations of the above are

x(k+1) = [−a_1 −a_2 ⋯ −a_{n−1} −a_n; 1 0 ⋯ 0 0; 0 1 ⋯ 0 0; ⋮ ⋱ ⋮; 0 0 ⋯ 1 0] x(k) + [1; 0; ⋮; 0] v(k)
y(k) = [b_1  b_2  ⋯  b_n] x(k)    (8.69)

and

x(k+1) = [−a_1 1 0 ⋯ 0; −a_2 0 1 ⋯ 0; ⋮ ⋱; −a_{n−1} 0 0 ⋯ 1; −a_n 0 0 ⋯ 0] x(k) + [b_1; b_2; ⋮; b_{n−1}; b_n] v(k)
y(k) = [1  0  ⋯  0] x(k)    (8.70)

(8.69) and (8.70) are called the controllable canonical form and the observable canonical form, respectively.
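Both canonical forms reproduce the difference equation (8.68); the sketch below checks the controllable canonical form (8.69) against MATLAB's filter function, which implements (8.68) directly (the second-order coefficients are arbitrary):

    a = [1 -1.2 0.35];  b = [0 0.5 -0.3];    % a1=-1.2, a2=0.35, b1=0.5, b2=-0.3
    A = [1.2 -0.35; 1 0];  B = [1; 0];  C = [0.5 -0.3];   % form (8.69)
    N = 10;  h = zeros(N,1);  x = B;         % state after a unit impulse at k=0
    for k = 2:N, h(k) = C*x;  x = A*x;  end
    href = filter(b, a, [1; zeros(N-1,1)]);  % impulse response of (8.68)
    max(abs(h - href))                       % should be 0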
Realization of a transfer matrix can be a bit more tricky. If the transfer matrix is given in a matrix polynomial form, it can be readily realized in one of the canonical forms as before. For instance, suppose the transfer matrix is given in the form

(I + A_1 z^{−1} + ⋯ + A_n z^{−n}) y(z) = (B_1 z^{−1} + ⋯ + B_n z^{−n}) v(z)    (8.71)

It can be directly realized in the observable canonical form as

x(k+1) = [−A_1 I 0 ⋯ 0; −A_2 0 I ⋯ 0; ⋮ ⋱; −A_{n−1} 0 0 ⋯ I; −A_n 0 0 ⋯ 0] x(k) + [B_1; B_2; ⋮; B_n] v(k)
y(k) = [I  0  ⋯  0] x(k)    (8.72)

Such a realization may not be minimal, however, so one should be able to reduce the system order by removing the unreachable states in the case of (8.72).

If the transfer function is not given in a matrix polynomial form, then, rather than trying to find a matrix polynomial representation for it, one may find a realization for each transfer function separately and pack them together afterward. The packed model is most likely to be nonminimal, so it should again be followed by an order reduction. For details, see Exercise ?? at the end of the chapter.

8.2.7 Poles and Transmission Zeros of A Transfer Matrix

Poles of a transfer matrix are the roots of φ(z) = det(zI − A_m) = 0, where A_m is the state-space transition matrix resulting from a minimal realization of the transfer matrix. Disregarding multiplicities, they are just a collection of all the poles of the individual elements.
Recall the earlier discussion that only the part of a state-space system that is reachable and observable is represented by the transfer matrix. Hence, the poles of a transfer function may be just a smaller subset of the poles of the original state-space system. This is the case when the original state-space system is nonminimal. For example, the poles of the transfer matrix for (8.46) are just the eigenvalues of Ā11.

Transmission zeros of a transfer matrix are defined as those values of z that result in a loss of rank for G(z). Methods for computing the poles and zeros of a transfer matrix directly from it can be found in standard textbooks (for example, see pages 207 and 208 of Morari and Zafiriou). Poles are the roots of the polynomial φ(z), which is the least common denominator of all non-identically-zero minors of all orders of G(z). Zeros are the roots of the polynomial which is the greatest common divisor of the numerators of all order-r minors of G(z), where r is the normal rank of G(z), provided that these minors have all been adjusted in such a way as to have the pole polynomial φ(z) as their denominator.

Figure 8.4: Pole Locations and the Step Response: Constant Damping Curves

8.2.8 Implications of Poles and Zeros for Input-Output Dynamics

Definition 8 Transfer function (8.57) is said to be stable if y remains bounded in response to any bounded v.
Stability of a transfer function can be determined by checking its poles. In order for a transfer function to be stable, its poles must all lie strictly inside the unit disk. This is in agreement with the earlier result that the poles of a transfer function are the poles of the reachable and observable part of the original state-space system. Stability of a transfer matrix is equivalent to stability of all the individual transfer function elements.

Other important input-output dynamic characteristics can also be inferred from the locations of poles and zeros. For instance, the location of the poles reflects the degree of damping. Figure 8.4 shows the constant damping lines for continuous system poles and discrete system poles as well as the corresponding step responses.

The relative degree of a transfer function represents the delay between the input and the output in number of sample time units. Transfer functions derived from physical systems have a relative degree of at least one; this means there is at least one unit delay for all physical systems, for which outputs do not respond instantaneously to an input change.

Finally, the existence of a zero outside the unit disk implies an inverse step response.

Example 8.10 Show step responses of various transfer functions.



Gain, Frequency Response

For systems initially at rest, when a step input change of v(∞) induces a final output change of y(∞), K = y(∞)/v(∞) is called the gain of the system. For stable systems, the gain can be computed easily from the transfer function according to

K = lim_{z→1} G(z)    (8.73)

This follows directly from the final value theorem, which states

lim_{k→∞} y(k) = lim_{z→1} (z − 1) y(z)    (8.74)

Similarly, the system's frequency response, the amplitude ratio A.R. and phase angle φ, can be computed readily from the transfer function:

A.R. = |G(e^{jω})|    (8.75)

φ = tan^{−1}( Im[G(e^{jω})] / Re[G(e^{jω})] )    (8.76)
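In MATLAB (Control System Toolbox), dcgain evaluates (8.73) and evalfr evaluates G at any point on the unit circle, from which (8.75) and (8.76) follow; a sketch with an arbitrary first-order example:

    G  = tf([0 0.1], [1 -0.9], 1);   % G(z) = 0.1/(z-0.9), hs = 1
    K  = dcgain(G)                   % gain (8.73): 0.1/(1-0.9) = 1
    w  = 0.5;                        % frequency in rad/sample
    Gw = evalfr(G, exp(1j*w));       % G(e^{jw})
    AR  = abs(Gw);                   % amplitude ratio (8.75)
    phi = atan2(imag(Gw), real(Gw))  % phase angle (8.76)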

8.3 Relationships With Impulse Response and Step Response Models

8.3.1 Calculation of Impulse Response and Step Response of State-Space System

For the discrete state-space description {A, B, C}, let us define B = [b_1, b_2, …, b_{n_v}] and C = [c_1^T, c_2^T, …, c_{n_y}^T]^T, i.e., b_i and c_i are the ith column and row of B and C, respectively. Let us also define the following discrete unit impulse starting at the jth sample time:

δ_j(t) = { 1  for j h_s ≤ t < (j+1) h_s;   0  elsewhere }

Starting from an initial state of zero, the response of the system to an impulse in the ith input channel at time t = 0,

v_i(t) = δ_0(t),   v_m(t) = 0 for m ≠ i,

is

y_imp^i(0) = 0
y_imp^i(1) = C b_i
y_imp^i(2) = C A b_i
y_imp^i(3) = C A² b_i
⋮
y_imp^i(n) = C A^{n−1} b_i

where y_imp^i(k) is the output vector at time k in response to an impulse change in input v_i at time zero. The impulse response coefficient matrix at time k (for k > 0) is defined by

H_k = [h_{1,1,k} h_{1,2,k} ⋯ h_{1,n_v,k}; h_{2,1,k} h_{2,2,k} ⋯ h_{2,n_v,k}; ⋮ ⋮ ⋱ ⋮; h_{n_y,1,k} h_{n_y,2,k} ⋯ h_{n_y,n_v,k}]
    = [c_1 A^{k−1} b_1  c_1 A^{k−1} b_2 ⋯ c_1 A^{k−1} b_{n_v}; c_2 A^{k−1} b_1 ⋯ c_2 A^{k−1} b_{n_v}; ⋮; c_{n_y} A^{k−1} b_1 ⋯ c_{n_y} A^{k−1} b_{n_v}] = C A^{k−1} B

where h_{ℓ,m,k} denotes the response at time k of output y_ℓ to an impulse change in input v_m at time zero.
The step response coefficient matrices (containing the output responses to a unit step change in the input) can be computed from the impulse response coefficient matrices according to

S_k = Σ_{i=1}^{k} H_i    (8.77)
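In code, both coefficient sequences follow from a single recursion on A^{k−1}; a minimal sketch for any given {A, B, C}:

    n = 30;                          % number of coefficients to compute
    [ny, nv] = size(C*B);
    H = zeros(ny, nv, n);  S = zeros(ny, nv, n);
    Ak = eye(size(A,1));             % holds A^(k-1)
    for k = 1:n
        H(:,:,k) = C*Ak*B;           % impulse response coefficient H_k
        if k == 1, S(:,:,k) = H(:,:,k);
        else,      S(:,:,k) = S(:,:,k-1) + H(:,:,k);   % step response (8.77)
        end
        Ak = Ak*A;
    end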

8.3.2 Derivation of Transfer Matrix From Impulse Response

In general, the ith column of H_{k−j} denotes the response at time k to a unit impulse δ_j(t) in the ith input channel; [H_{k−j}]_i v_i(j) (where [H_{k−j}]_i is the ith column) denotes the response at time k to an impulse v_i(j) δ_j(t) (an impulse of strength v_i(j) at time j) entering the ith input channel. We can represent any zero-order-hold generated signal as a sequence of discrete impulses

v(t) = [Σ_j v_1(j) δ_j(t); Σ_j v_2(j) δ_j(t); ⋮; Σ_j v_{n_v}(j) δ_j(t)]

By superposition (i.e., by adding the responses of all individual impulses occurring in different input channels at different times), the response at time k to this sequence is

y(k) = Σ_{j=−∞}^{k−1} Σ_{i=1}^{n_v} [H_{k−j}]_i v_i(j) = Σ_{j=−∞}^{k−1} H_{k−j} v(j) = Σ_{ℓ=1}^{∞} H_ℓ v(k−ℓ)    (8.78)

By summing up to j = k−1 only, we implicitly assumed that the system is causal, that is, only those input changes that occurred before time k affect y(k). Equation (8.78) shows the system output to be a weighted sum of all the past system inputs, with the weighting parameters given by the impulse response coefficients. It is indeed the same model as we derived earlier in Chapter ??, but there based on a heuristic argument.
Taking the z-transform of (8.78) with the assumption that v(k) = 0 for k < 0, we obtain

y(z) = H(z) v(z)    (8.79)

where H(z) is the z-transform of {H_k}. The derivation is laborious (see Exercise ??) but the result is intuitive. It means that the pulse transfer matrix can be obtained by z-transforming the impulse response, which is consistent with the fact that the z-transform of a discrete-time unit impulse (a unit pulse that occurs at t = 0 and lasts for one sample interval) is 1.
For stable systems, the following Finite Impulse Response (FIR) model (approximating (8.78)) is commonly used:

y(k) = Σ_{i=1}^{n} H_i v(k−i)    (8.80)

n should be chosen sufficiently large so that the discarded terms are negligible. The z-transform of the FIR model can be written as

y(z)/v(z) = H_1 z^{−1} + H_2 z^{−2} + ⋯ + H_n z^{−n}    (8.81)

Hence, the transfer matrix for an FIR model has all its poles at the origin, representing the delay operations.

8.3.3 From Impulse Response / Step Response to State-Space Model

The state-space realization of the FIR model (8.80) in the controllable canonical form is

x(k+1) = A x(k) + B v(k)
y(k) = C x(k)    (8.82)

where
where

0
I

0
.
.
.
..
.

0
0

A =

0
0
I

0
0
0

... ...
0 ...
0 ...

0
0
0

..
.
.
.
0 ... ... ... 0
0 ... ... ... I

T
I 0 0
B =

H1 H2 H n
C =
..

..

..
.
..

n nv

(8.83)

In this realization, the state is simply the sequence of n past inputs:

x(k) = [v^T(k−1)  v^T(k−2)  ···  v^T(k−n)]^T   (8.84)

All the eigenvalues of A are located at the origin.

The realization in the observable canonical form is

x(k+1) = A x(k) + B v(k)
y(k) = C x(k)   (8.85)

where

A = [ 0  I  0  ···  0
      0  0  I  ···  0
      ⋮             ⋮
      0  0  ···  0  I
      0  0  ···  0  0 ]   (an n·n_y × n·n_y block shift matrix)

B = [H_1^T  H_2^T  ···  H_n^T]^T

C = [I  0  ···  0]   (8.86)

We note that the impulse response or step response based state-space realization is a highly over-parameterized, nonminimal realization. In general, one should be able to reduce the system order substantially by applying model reduction. Balanced truncation is a popular model reduction method and offers some numerical advantages when applied to the above realization.

For interested readers, details of balanced truncation and its application to the above FIR-based realization are given in Appendix ??.
For the derivation of a similar realization in terms of step response coefficients, let us first note that, for system (8.80), the output is

y(k) = Σ_{i=1}^{n} H_i v(k−i)
y(k+1) = Σ_{i=2}^{n} H_i v(k+1−i) + H_1 v(k)
y(k+2) = Σ_{i=3}^{n} H_i v(k+2−i) + H_1 v(k+1) + H_2 v(k)
⋮
y(k+n−1) = H_n v(k−1) + Σ_{i=1}^{n−1} H_i v(k+n−i−1)
y(k+n+l) = Σ_{i=1}^{n} H_i v(k+n+l−i),  l ≥ 0

Now let us assume that the input is kept constant after time k−1: v(k−1) = v(k) = v(k+1) = ···. Then, the output is

y(k) = Σ_{i=1}^{n} H_i v(k−i)
y(k+1) = Σ_{i=2}^{n} H_i v(k+1−i) + H_1 v(k−1)
y(k+2) = Σ_{i=3}^{n} H_i v(k+2−i) + (H_1 + H_2) v(k−1)
⋮
y(k+n−1) = H_n v(k−1) + Σ_{i=1}^{n−1} H_i v(k−1)
y(k+n+l) = y(k+n−1),  l ≥ 0
Now define the vector

Y(k) = [y(k)^T, y(k+1)^T, ···, y(k+n−1)^T]^T  for v(k+l) = v(k−1), l ≥ 0

Y (k), the future output sequence assuming no more change occurs in the input,
can be interpreted as one possible choice for the state vector, as it represents
the effect of all past input moves on the future outputs.
Based on the same definition, the state vector at time k+1 is

Y(k+1) = [y(k+1)^T, y(k+2)^T, ···, y(k+n)^T]^T  for v(k+l) = v(k), l > 0   (8.87)

With the assumption that the system input v is kept constant after time k (v(k) = v(k+1) = v(k+2) = ···), the system output is

y(k+1) = Σ_{i=2}^{n} H_i v(k+1−i) + H_1 v(k)
y(k+2) = Σ_{i=3}^{n} H_i v(k+2−i) + (H_1 + H_2) v(k)
⋮
y(k+n−1) = H_n v(k−1) + Σ_{i=1}^{n−1} H_i v(k)
y(k+n) = Σ_{i=1}^{n} H_i v(k)
y(k+n+l) = y(k+n),  l ≥ 0


Then we find the following by direct comparison of the two sets of output expressions above:

Y(k+1) = A Y(k) + B Δv(k)
y(k) = C Y(k)   (8.88)

where

A = [ 0  I  0  ···  0
      0  0  I  ···  0
      ⋮             ⋮
      0  0  ···  0  I
      0  0  ···  0  I ]

B = [S_1^T  S_2^T  ···  S_n^T]^T

C = [I  0  ···  0]   (8.89)

This is indeed the same step response based model (????) that we adopted in the earlier discussion of industrial MPC. Unlike in the realizations based on impulse responses, the input appears as Δv rather than v in the above system. Despite the fact that the original system is stable, the above model contains n_y integrators, which reflect the integrating effect of Δv on the output y. This model form will be useful in designing MPC controllers that automatically possess integral action.
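The assembly of (8.89) is mechanical; the following sketch of ours (NumPy; the function name is illustrative, not from the text) builds the model from a list of step response coefficient matrices S_1, ..., S_n:

import numpy as np

def step_model(S):
    # Assemble the step-response model matrices of (8.88)-(8.89).
    # S is a list [S_1, ..., S_n] of ny-by-nv step response coefficients.
    n = len(S)
    ny, _ = S[0].shape
    A = np.zeros((n * ny, n * ny))
    for i in range(n - 1):
        A[i*ny:(i+1)*ny, (i+1)*ny:(i+2)*ny] = np.eye(ny)   # shift blocks up
    A[(n-1)*ny:, (n-1)*ny:] = np.eye(ny)                   # last block repeats
    B = np.vstack(S)                                       # [S_1; ...; S_n]
    C = np.hstack([np.eye(ny)] + [np.zeros((ny, ny))] * (n - 1))
    return A, B, C

One prediction step then reads Y = A @ Y + B @ dv with y = C @ Y; note that the model is driven by the input move Δv(k), which is what gives a controller built on it integral action.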

8.4 Summary

Figure 8.5 shows a diagram summarizing the procedures for converting one form of model to another. Material in the rest of the book assumes a discrete-time state-space model; the diagram illustrates the fact that one can obtain such a model form starting from any model form. In addition, we can see from the diagram that several possible routes exist for most conversions.
Example 8.11 As a simple example, consider the continuous transfer function

g⁰(s) = e^{−2s} / (s + 1).

For a sampling time h_s = 1, since

Z{ L^{−1}[ (1/(s+1)) · (1 − e^{−s})/s ] } = 0.6321 / (z − 0.3679)

we obtain the pulse transfer function

g(z) = z^{−2} · 0.6321 / (z − 0.3679)


Figure 8.5: Summary of conversion among different linear time-invariant model structures


The corresponding time domain equation is

y(k) = 0.3679 y(k−1) + 0.6321 v(k−3)

A state-space realization of the above in the controllable canonical form is

A = [ 0.3679  0    0
      1.0     0    0
      0       1.0  0 ];   B = [ 1
                                0
                                0 ];   C = [0  0  0.6321]
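The discretization in Example 8.11 is easy to reproduce numerically — a sketch of ours using SciPy's cont2discrete; the two-sample delay e^{−2s} (with h_s = 1) is accounted for separately as the factor z^{−2}:

from scipy.signal import cont2discrete

# ZOH discretization of 1/(s+1) with h_s = 1
num_d, den_d, dt = cont2discrete(([1.0], [1.0, 1.0]), 1.0, method='zoh')
print(num_d, den_d)   # ~ [[0., 0.6321]], [1., -0.3679]; the delay contributes z^{-2}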

8.5 Disturbance Modeling

One of the main reasons for control is to suppress the effect of external disturbances on key process variables. It is sometimes possible to eliminate the
source of disturbances entirely through design modifications, but more often
their effects need to be offset by adjusting manipulated variables. In designing
a controller that performs this task in an efficient manner, it is helpful to have
a model that enables prediction of disturbances influence on the outputs of
interest on the basis of measurement signals. While such a model may be
constructed from first principles or system identification, it is not necessary
for the model to include exact physical sources for all disturbances. For linear
controller design, it is sufficient to include in the model their combined overall
effect on the output. This is important since in many cases it is not even
possible to determine the physical sources of disturbances. In this section, we
will discuss stochastic models of disturbances. We will also show how such
disturbance models can be integrated with deterministic system models for
estimation and control.

8.5.1 Linear Stochastic System Model for Stationary Processes

A natural framework for modeling disturbances given the uncertain nature of


their behavior is the framework of probability. Stochastic processes are the
main mathematical vehicles for modeling uncertain systems and signals within
this framework. Some basic concepts for stochastic processes are given in
Appendix ??. In general, to define a stochastic process, one must specify the
joint probability distribution function for the entire time sequence. This is an
intractable task without further simplification in almost all practical cases. One
commonly made simplification is the assumption of stationarity. Another is
the assumption of normality, which reduces the task to the specification of only
the first two moments of the distribution, i.e., the mean and the covariance.
Typically, stochastic disturbances are described by a linear system with an external white noise input. This is based on the result that any stationary stochastic process can be described, in terms of its first two moments, by the output of some stable (and stably-invertible) linear system driven by a white noise sequence. Within this limited setting, the problem of stochastic modeling reduces to finding a suitable linear system for a given mean and covariance. The linear system can be a state-space system or a transfer matrix.
State Space Model

The following stochastic difference equation may be used to characterize a stochastic process:

x_w(k+1) = A_w x_w(k) + B_w ε(k)
w(k) = C_w x_w(k) + ε(k)   (8.90)

where ε(k) is a zero-mean white noise sequence with covariance R_ε. It is assumed that all the eigenvalues of A_w lie strictly inside the unit circle. It is then easy to verify that

E{x_w(k)} = A_w E{x_w(k−1)} = A_w^k E{x_w(0)}
E{x_w(k) x_w^T(k)} = A_w E{x_w(k−1) x_w^T(k−1)} A_w^T + B_w R_ε B_w^T
E{x_w(k+τ) x_w^T(k)} = A_w^τ E{x_w(k) x_w^T(k)}   (8.91)

From the above, it is straightforward to see that

lim_{k→∞} E{x_w(k)} = 0
lim_{k→∞} E{x_w(k+τ) x_w^T(k)} = A_w^τ P_w,   (8.92)

where P_w is a positive semi-definite solution to the Lyapunov equation

P_w = A_w P_w A_w^T + B_w R_ε B_w^T   (8.93)

Thus,

w̄ = lim_{k→∞} E{w(k)} = 0
R_w(τ) = lim_{k→∞} E{w(k+τ) w^T(k)} = C_w A_w^τ P_w C_w^T + R_ε δ(τ)   (8.94)

In the limit, (8.90) becomes a stationary process, the mean and the covariance of which are independent of the initial condition, as shown by (8.94).
Choosing (A_w, B_w, C_w) to match a given R_w(τ) according to (8.94) is not straightforward, but some numerical approaches are available. These will be discussed in a later chapter that covers system identification.
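The forward direction — evaluating R_w(τ) for a given (A_w, B_w, C_w) via (8.93)–(8.94) — is a one-liner with a Lyapunov solver. A sketch of ours using SciPy (the function name is illustrative):

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def output_covariance(Aw, Bw, Cw, Re, tau):
    # Pw = Aw Pw Aw' + Bw Re Bw'   (8.93)
    Pw = solve_discrete_lyapunov(Aw, Bw @ Re @ Bw.T)
    Rw = Cw @ np.linalg.matrix_power(Aw, tau) @ Pw @ Cw.T
    return Rw + Re if tau == 0 else Rw   # white-noise term appears at lag 0, per (8.94)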
Transfer Function

Stationary stochastic processes can also be described using transfer matrices. The general form is

w(k) = H(q) ε(k)   (8.95)

Figure 8.6: Persistent or Drifting Type Disturbances


where ε is again a zero-mean white noise sequence of covariance R_ε. Without loss of generality, H(z) is assumed to be a stable and stably invertible transfer matrix with H(0) = I. The power spectrum of w can be obtained by calculating the covariance of w and taking the Fourier transform of it. The resulting expression is

Φ_w(ω) = H(e^{jω}) R_ε H^T(e^{−jω})   (8.96)

The derivation of the above is given as an exercise at the end of the chapter. The above expression suggests one way to solve the problem of finding an H(q) that matches a given spectrum (or, equivalently, a given covariance function): one can find a spectral factor Φ_w^{1/2}(ω) and then find a stable and stably-invertible transfer matrix whose frequency response matches this factor.

8.5.2 Stochastic System Models for Processes with Nonstationary Behavior

While the assumption of stationary noise is a convenient simplification, it does not befit all situations. For example, a disturbance signal may exhibit persistent behavior, as illustrated in Figure 8.6; such a signal is not appropriately modeled as a stationary stochastic process. Disturbances with such characteristics are common in many chemical and biological processes. Adoption of a general nonstationary noise model in view of such disturbances would, however, lead to intractable modeling problems.
If a disturbance is characterized completely by intermittent jumps (of random magnitudes and occurrences) or Brownian-motion-type drifts, it is suitably represented by an integrated white noise ε_int(k), which is simply a signal obtained by integrating a white noise:

ε_int(k+1) = ε_int(k) + ε(k)   (8.97)

Since

E{ε_int(k) ε_int^T(k)} = E{ε_int(k−1) ε_int^T(k−1)} + R_ε   (8.98)

the covariance of ε_int(k) grows without bound as k → ∞ and hence ε_int(k) is a non-stationary stochastic process.

Figure 8.7: Stochastic system model for a stationary process superimposed on random jumps or Brownian-motion-type drifts
More generally, a stationary signal superimposed on pure jumps or Brownian motion can be used to model persistent or drifting disturbances. Such a superimposed signal can be described by a linear system driven by an integrated white noise (see Figure 8.7):

x_w(k+1) = A_w x_w(k) + B_w ε_int(k)
w(k) = C_w x_w(k) + ε_int(k)   (8.99)

In both cases, it is convenient to adopt the differenced variable (Δw(k) = w(k) − w(k−1)) in writing the model. Differencing of (8.99) for two consecutive time steps yields

Δx_w(k+1) = A_w Δx_w(k) + B_w ε(k)
Δw(k) = C_w Δx_w(k) + ε(k)   (8.100)

Note that the external noise ε(k) is now a white noise sequence, which is the standard way to write models in optimal estimation and control. Note also that Δw(k) becomes a zero-mean stationary process. Any covariance can be matched by appropriately choosing the state-space matrices.
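The difference between the two disturbance classes is easy to see in simulation. The following sketch (ours; the scalar parameters are arbitrary illustrations) drives a stationary model per (8.90) and a drifting model per (8.99) from the same white noise:

import numpy as np

rng = np.random.default_rng(0)
N = 500
eps = rng.normal(size=N)
eps_int = np.cumsum(eps)                     # integrated white noise (8.97)

aw, bw, cw = 0.8, 1.0, 0.5                   # illustrative scalar (Aw, Bw, Cw)
w_stat, w_drift, x1, x2 = np.empty(N), np.empty(N), 0.0, 0.0
for k in range(N):
    w_stat[k] = cw * x1 + eps[k]             # stationary model (8.90)
    w_drift[k] = cw * x2 + eps_int[k]        # drifting model (8.99)
    x1 = aw * x1 + bw * eps[k]
    x2 = aw * x2 + bw * eps_int[k]
# w_stat hovers around zero; w_drift wanders without returning to a fixed mean.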


As before, a transfer function may be used in place of the state-space system:

w(k) = H(q) ε_int(k)   (8.101)

In this case too, it is convenient to express the model in terms of the differenced variables:

Δw(k) = H(q) ε(k)   (8.102)

8.5.3 Models for Estimation and Control

Models used in our subsequent discussion of estimation and control will be a combination of deterministic and stochastic models. The deterministic part is used to model the causal relationship between known inputs (like the manipulated inputs) and the outputs; the stochastic part is used to characterize the correlations in the outputs created by various unknown inputs, e.g., disturbances. The standard form is

x(k+1) = A x(k) + B v(k) + ε_1(k)
y(k) = C x(k) + ε_2(k),   (8.103)

where ε_1(k) and ε_2(k) are zero-mean white noises. The above model is best interpreted as the superposition of two models,

x_d(k+1) = A x_d(k) + B v(k)
y_d(k) = C x_d(k)   (8.104)

and

x_s(k+1) = A x_s(k) + ε_1(k)
y_s(k) = C x_s(k) + ε_2(k)   (8.105)

By superposition, we mean that the output is

y(k) = y_d(k) + y_s(k)   (8.106)

So the first system models the effect of the deterministic inputs and the second system characterizes the residual vector in a statistical sense (i.e., in terms of its covariance function). Such a model can be obtained through either fundamental modeling or system identification. However, the model should not be mistaken for one obtained by simply adding white noise inputs to a deterministic system model. In general, that would not result in a good statistical description of the residual and would eventually lead to poor prediction performance. In the identification section, we will discuss some methods to obtain such a combined model from input-output data.


Possible Exercises

1. Prove the observability condition using the Cayley-Hamilton theorem. In what sense does the null space of W_n^o represent the unobservable subspace?

Solution: To see how this condition arises, first notice that observability is equivalent to the fact that no nonzero initial state with zero input results in a zero output response. Note that

[ y(0)
  y(1)
  ⋮
  y(k−1) ] = [ C
               CA
               ⋮
               CA^{k−1} ] x(0)   (8.107)

Let us denote [C^T  (CA)^T  ···  (CA^{k−1})^T]^T as W_k^o. For the system to be unobservable, the null space of W_k^o must be nontrivial for all finite k. Conversely, observability implies the existence of a finite k such that rank(W_k^o) = n. Since rank(W_k^o) ≤ rank(W_n^o) for all k (by the Cayley-Hamilton theorem), the condition reduces to the rank condition of (8.39).
2. Prove the Hautus condition for stabilizability.

Solution: The condition for stabilizability can be proved in a similar way. First, we prove

Stabilizable ⇒ rank [A − λI  B] = n  for all λ ∈ Λ_A^+

where Λ_A^+ denotes the set of unstable eigenvalues of A. Equivalently, we can prove instead

rank [A − λI  B] ≠ n for some λ ∈ Λ_A^+  ⇒  Not stabilizable

So, let us assume that

rank [A − λI  B] ≠ n  for some λ ∈ Λ_A^+

Following the same argument as before, we can show that, in this case, there is a left eigenvector of A (denoted x, with x^T A = λ x^T) corresponding to an unstable eigenvalue such that

x^T [B  AB  ···  A^{n−1}B] = 0^T

This means that the eigenvector corresponding to the unstable eigenvalue lies outside the reachable subspace, implying non-stabilizability.

Next, we prove

rank [A − λI  B] = n for all λ ∈ Λ_A^+  ⇒  Stabilizable

Equivalently, we can prove instead

Not stabilizable  ⇒  rank [A − λI  B] ≠ n for some λ ∈ Λ_A^+

The assumption that the system is not stabilizable implies that an unstable mode lies outside the reachable subspace; that is, there exists a left eigenvector x of A, corresponding to an unstable eigenvalue λ ∈ Λ_A^+, such that

x^T [B  AB  ···  A^{n−1}B] = 0^T

This means

x^T B = 0,  x^T AB = 0,  ···,  x^T A^{n−1}B = 0

Since x^T A = λ x^T and x^T B = 0, it follows that

x^T [A − λI  B] = 0^T  for some λ ∈ Λ_A^+

which means the rank is not n.

3. Derive the final value theorem for the z-transform and use it to prove that the system gain can be obtained by setting z = 1 in the transfer function.

4. Derive the equation y(z) = H(z)v(z), where H(z) is the z-transform of the sequence of impulse response coefficient matrices {H_k}. Do this by z-transforming the convolution equation.

5. Show that the z-transforms of the observer canonical form and the controller canonical form of the state-space realization indeed lead to the same transfer function.
6. Consider the following MIMO system:

G(s) = [ (120s+2)/((100s+1)(20s+1))   80s/((100s+1)(20s+1))
         80s/((100s+1)(20s+1))        (120s+2)/((100s+1)(20s+1)) ]

- Perform a discretization and a state-space realization (or vice versa) of each transfer function.
- Pack the four realizations into a single multivariable system. This can be done by using

x(k+1) = diag(A_11, A_12, A_21, A_22) x(k) + [ B_11  0
                                               0     B_12
                                               B_21  0
                                               0     B_22 ] u(k)

y(k) = [ C_11  C_12  0     0
         0     0     C_21  C_22 ] x(k)

where (A_ij, B_ij, C_ij) represents the state-space realization of the (i,j)th element of G. What's the total order?
- Perform a balanced truncation to remove the redundant states. What's the minimum order you need?
7. Prove the formula (8.96) for calculating the spectrum of the stochastic process given by (8.95).

8. (Given a simple spectrum.) Compute a spectral factor and realize it as a stable and stably invertible transfer function. Get a realization of white noise, pass it through the filter, and obtain the output signal. Compute the covariance function, take the Fourier transform, and make sure you end up with what you started with.

References

1. Linear systems textbooks: Kwakernaak and Sivan; Astrom and Wittenmark.

2. Further information on reachability, controllability, observability, etc., especially the extension to continuous-time systems and time-varying systems, and other linear system concepts, can be found in Kwakernaak and Sivan. Some insightful but limited treatments of these concepts can also be found in Astrom and Wittenmark.

3. The z-transform and pulse transfer function table is in Astrom and Wittenmark.

4. References in stochastic modeling, especially the spectral factorization theorem: Jazwinski gives a terse treatment of the subject within the confines of optimal estimation. Astrom and Wittenmark discuss stochastic modeling of disturbances in the general setting of optimal estimation and control.

8.6 Model Reduction

MOVE TO THE APPENDIX


Model reduction refers to transformation of a high order model into a lower
order model that retains the essential information about the input-output
dynamics. Besides removing the dynamics for superfluous states (e.g., those
that are not reachable or not observable and therefore do not contribute to
the overall input-output dynamics), model reduction often involves trading off
some accuracy for reduced system order. Because the relative importance of
information about various parts of plant dynamics depends on the application,
there cannot be a single model reduction technique that is appropriate for all
cases. In this section, we will concentrate on a particular model reduction
technique called balanced truncation that has proven to be effective for control
applications and offers certain numerical advantages for reduction of FIR-based models.

8.6.1 Model Reduction Problem

Let us consider the discrete state-space model

x(k+1) = A x(k) + B v(k)
y(k) = C x(k)   (8.108)

Suppose the order of (8.108) is n. By performing a model reduction, we would like to obtain an rth-order model (r < n) that closely approximates the input-output relationship of the original system.

Let us recall the fact that a state-space representation for a specific input-output behavior is not unique, and state coordinates can be changed freely without affecting the input-output relationship. For the purpose of model reduction, it is convenient to find a realization in which the state variables are defined and ordered in terms of their relative importance in describing the input-output behavior of the system.
Let us introduce the state coordinate transformation x̃(k) = T x(k), where T is a nonsingular matrix. The state-space system in the transformed state coordinates can be written as

x̃(k+1) = Ã x̃(k) + B̃ v(k)
y(k) = C̃ x̃(k)   (8.109)

where Ã = T A T^{−1}, B̃ = T B, and C̃ = C T^{−1}. If the coordinate transformation is chosen such that the transformed state variables are ordered to reflect the relative importance of the various modes for describing the system's input-output relationship, one may obtain an rth-order system simply by truncating the last n − r states as shown below:


x̃_r(k+1) = Ã_r x̃_r(k) + B̃_r v(k)
y(k) = C̃_r x̃_r(k),   (8.110)

where

x̃_r(k) = [I_r  0] x̃(k)
Ã_r = [I_r  0] Ã [I_r; 0]
B̃_r = [I_r  0] B̃
C̃_r = C̃ [I_r; 0]   (8.111)

The rest of the section discusses how an appropriate coordinate transformation may be found for this purpose.

8.6.2 Hankel Matrix and Hankel Singular Values

Let us define the controllability and observability matrices as

W^c = [B  AB  A²B  ···]   (8.112)

W^o = [C^T  (CA)^T  (CA²)^T  ···]^T   (8.113)

Recall that the truncated versions of the above matrices (W_n^c and W_n^o) were used for checking reachability and observability, which accounts for their names. The Hankel matrix H_G is defined as

H_G = W^o W^c   (8.114)

The subscript G denotes that it is the Hankel matrix for the discrete operator G = C(zI − A)^{−1}B. The Hankel matrix H_G is a doubly infinite matrix that maps the past inputs to the current and future outputs. This can be seen by observing the fact that

x(0) = W^c [u^T(−1)  u^T(−2)  u^T(−3)  ···]^T   (8.115)

and

[y^T(0)  y^T(1)  y^T(2)  ···]^T = W^o x(0)   (8.116)


W^c maps the past inputs u(−∞, 0) to the current state x(0), which is subsequently mapped to the current/future outputs y[0, ∞) via W^o. Assuming the original system is controllable and observable, W^c and W^o both have rank n and therefore H_G has rank n. In this case, H_G has n nonzero singular values defined as

σ_i(H_G) = σ_i(W^o W^c) = λ_i^{1/2}((W^c)^T (W^o)^T W^o W^c) = λ_i^{1/2}(W^c (W^c)^T (W^o)^T W^o)   (8.117)

Here λ_i(·) denotes the ith eigenvalue. The σ_i(H_G) are called the Hankel singular values of G, and their magnitudes reflect the relative importance of the various modes of the Hankel matrix H_G in describing the input-output dynamics. The largest singular value σ̄(H_G) is called the Hankel norm of G.
Let us define

P = W^c (W^c)^T = Σ_{j=0}^{∞} A^j B B^T (A^T)^j   (8.118)

Q = (W^o)^T W^o = Σ_{j=0}^{∞} (A^T)^j C^T C A^j   (8.119)
P and Q are called the controllability and observability gramians and satisfy the following matrix equations, respectively:

A P A^T − P + B B^T = 0   (8.120)

A^T Q A − Q + C^T C = 0   (8.121)

One can verify this by direct substitution of (8.118) and (8.119) into (8.120) and (8.121). Equations of the above form are called Lyapunov equations, and their numerical properties have been studied extensively due to their common occurrence in control applications. The ith Hankel singular value is simply [λ_i(PQ)]^{1/2}, where P and Q are positive-definite solutions to (8.120) and (8.121), respectively. Hence, one must solve a pair of Lyapunov equations to calculate the Hankel singular values. We remark that the Hankel matrix, and hence its singular values, are properties intrinsic to the system's input-output dynamics and are invariant under a state coordinate transformation.
One possible objective for model reduction is to transform the state coordinates such that, when the truncation of type (8.110) is made, the Hankel matrix of the reduced-order system approximates that of the full-order system as closely as possible. It has been shown that there exists an rth-order model G_r such that

σ̄(H_G − H_{G_r}) = σ_{r+1}(H_G)   (8.122)

The procedure to obtain a reduced-order system G_r achieving the optimum is available, but quite complex. It is tempting to use the singular value decomposition of H_G to obtain a matrix (of rank r) that retains the r principal components of H_G, but the resulting matrix may not be a Hankel matrix and hence may not correspond to any set of (A_r, B_r, C_r).

8.6.3 Balanced Realization and Truncation

While it is true that

min_{G_r} σ̄(H_G − H_{G_r}) = σ_{r+1}(H_G)   (8.123)

finding such a G_r numerically can be quite difficult. In this section, we introduce a model reduction technique called balanced truncation that yields a close approximation to the optimal Hankel-norm approximation and is numerically simple and robust at the same time.

In a balanced realization, the controllability and observability gramians are identically a diagonal matrix containing the Hankel singular values (ordered in terms of their magnitude), that is,

P = Q = Σ_{H_G} = diag(σ_1(H_G), ···, σ_n(H_G))   (8.124)

This way, the last state is the least reachable and observable, the second to last is the second least reachable and observable, and so on. Before we derive a procedure to obtain a balanced realization, let us note that, under a state coordinate transformation x̃ = T x, the controllability and observability gramians are transformed as follows:

P̃ = T P T^T   (8.125)
Q̃ = (T^{−1})^T Q T^{−1}   (8.126)

The first step is to factorize the positive-definite Q as follows:

Q = R^T R   (8.127)

A factorization of the above form is called a Cholesky factorization. The state coordinate transformation x̃_1 = R x gives

P_1 = R P R^T   (8.128)
Q_1 = (R^{−1})^T Q R^{−1} = I   (8.129)

We next perform a singular value decomposition (SVD) of the symmetric positive-definite matrix P_1:

P_1 = U_1 Σ²_{H_G} U_1^T   (8.130)

Then, the second state coordinate transformation x̃_2 = Σ_{H_G}^{−1/2} U_1^T x̃_1 leads to the balanced controllability and observability gramians, since

P_2 = Σ_{H_G}^{−1/2} U_1^T (U_1 Σ²_{H_G} U_1^T) U_1 Σ_{H_G}^{−1/2} = Σ_{H_G}   (8.131)

Q_2 = Σ_{H_G}^{1/2} U_1^T U_1 Σ_{H_G}^{1/2} = Σ_{H_G}   (8.132)

In summary, the coordinate transformation x̃ = Σ_{H_G}^{−1/2} U_1^T R x leads to a balanced realization. Numerically, calculating such a transformation requires solving a pair of Lyapunov equations and performing a Cholesky factorization and an SVD on a symmetric positive-definite matrix.
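The complete procedure — the Lyapunov solves, the Cholesky factorization (8.127), and the SVD (8.130) — fits in a few lines. A sketch of ours (NumPy/SciPy; the function name is illustrative):

import numpy as np
from scipy.linalg import solve_discrete_lyapunov, cholesky

def balance(A, B, C):
    # Balanced realization by the Cholesky/SVD procedure of Section 8.6.3
    P = solve_discrete_lyapunov(A, B @ B.T)
    Q = solve_discrete_lyapunov(A.T, C.T @ C)
    R = cholesky(Q, lower=False)               # Q = R' R          (8.127)
    U, s2, _ = np.linalg.svd(R @ P @ R.T)      # P1 = U Sigma^2 U' (8.130)
    sig = np.sqrt(s2)                          # Hankel singular values
    T = np.diag(sig ** -0.5) @ U.T @ R         # x_bal = T x
    Ti = np.linalg.inv(T)
    return T @ A @ Ti, T @ B, C @ Ti, sig

After the transformation, both gramians equal diag(σ_1, ···, σ_n), so a truncation of type (8.110) simply keeps the leading r rows and columns.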
A model reduction technique that combines the balanced realization with the truncation of type (8.110) is called balanced truncation. Although it was first suggested as a technique for eliminating the part of a state-space system that is practically unreachable and/or unobservable, its properties and relationships with the Hankel-norm and other approximations have since been drawn and reported in the literature. The technique doesn't yield the optimal Hankel-norm approximation, but it comes close in most cases and guarantees the following upper bound on the Hankel norm of the truncation error:

σ̄(H_G − H_{G_r}) ≤ 2 Σ_{j=r+1}^{n} σ_j(H_G)   (8.133)

Similarly, the technique also provides an upper bound on the L∞ norm of the truncation error:

‖G(q) − G_r(q)‖_∞ = sup_ω σ̄(G(e^{jω}) − G_r(e^{jω})) ≤ 2 Σ_{j=r+1}^{n} σ_j(H_G)   (8.134)

8.6.4 Application of Balanced Truncation to FIR Models

Recall the following state-space model, which was introduced as the observable canonical realization of a FIR model:

x(k+1) = A x(k) + B v(k)
y(k) = C x(k)   (8.135)

where

A = [ 0  I  0  ···  0
      0  0  I  ···  0
      ⋮             ⋮
      0  0  ···  0  I
      0  0  ···  0  0 ]   (n·n_y × n·n_y)   (8.136)

B = [H_1^T  H_2^T  ···  H_n^T]^T;   H_i = S_i − S_{i−1}   (8.137)

C = [I  0  ···  0]   (8.138)

In order to obtain a balanced realization for such a system, we must first find the controllability and observability gramians by solving the following Lyapunov equations:

A P A^T − P + B B^T = 0   (8.120)

A^T Q A − Q + C^T C = 0   (8.121)

Because of the special structure of the A, B, C matrices for the FIR model, we can obtain explicit solutions to the above equations. The observability gramian Q is simply an identity matrix. The controllability gramian P can also be obtained analytically by using the following recursive formula (in terms of the n_y × n_y blocks of P and BB^T):

P_{j,n} = P_{n,j}^T = (BB^T)_{j,n},   1 ≤ j ≤ n   (8.139)

P_{i,j} = P_{i+1,j+1} + (BB^T)_{i,j},   (1,1) ≤ (i,j) ≤ (n−1, n−1)   (8.140)

Since P is a symmetric, positive-definite matrix, it admits an SVD of the following form:

P = U Σ²_{H_G} U^T   (8.141)

A balanced realization (Ã, B̃, C̃) can then be obtained by the state-coordinate transformation x̃ = Σ_{H_G}^{−1/2} U^T x.

In summary, by exploiting the special structure of this particular realization of the FIR model, we have eschewed the computationally intensive steps of solving the Lyapunov equations and performing the Cholesky factorization in obtaining a balanced realization. Once the balanced realization is obtained, a reduced-order model can be found by inspecting the Hankel singular values and truncating the subspace corresponding to the singular values of negligible magnitude. If σ_{r+1} ≪ σ_r, an rth-order model obtained through balanced truncation should yield input-output behavior very similar to the original system.
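A sketch of ours implementing the shortcut: the gramian P is filled in directly by the recursion (8.139)–(8.140) (block indices are 0-based here), Q = I is known, so balancing needs only one SVD:

import numpy as np

def fir_gramian_P(B, n, ny):
    # Controllability gramian of the FIR realization via (8.139)-(8.140); Q = I.
    BBt = B @ B.T
    blk = lambda M, i, j: M[i*ny:(i+1)*ny, j*ny:(j+1)*ny]
    P = np.zeros_like(BBt)
    for j in range(n):                       # last block row/column (8.139)
        blk(P, n-1, j)[:] = blk(BBt, n-1, j)
        blk(P, j, n-1)[:] = blk(BBt, n-1, j).T
    for i in range(n-2, -1, -1):             # fill inward (8.140)
        for j in range(n-2, -1, -1):
            blk(P, i, j)[:] = blk(P, i+1, j+1) + blk(BBt, i, j)
    return P

# With P = U diag(sigma^2) U' per (8.141), the balancing transformation is
# T = diag(sigma^(-1/2)) @ U.T, and truncation keeps the leading states.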


Example 8.12 Consider the following SISO system:

G(s) = (20s + 1) / ((100s + 1)(20s + 1))

- Determine the FIR coefficients of the system with a sample time of 5. How many coefficients do you need to adequately describe the system?
- Realize the FIR system and perform a balanced truncation as shown in the lecture. How many significant Hankel singular values do you see? That determines the minimum order of the system you need. Write down the reduced-order system in balanced form.

SOLUTIONS are given in YR2000 MPC Class HW Solution (Copy from the solution).

References

An optimal Hankel model reduction method can be found in Glover. The balanced realization and truncation are due to Moore?????? The balanced truncation of the FIR model is drawn from ????, which discusses a finite-dimensional approximation of time delays.


Items to Be Moved to Another Chapter


(THIS WILL BE MOVED TO CHAPTER State-Space MPC with State
Estimation.)
In this section, we discuss how a process model can be developed and merged with a disturbance model for estimation and control. Two different approaches to modeling can be envisaged. In the first approach, one starts with a set of linear differential equations derived using first principles. If the physical sources of disturbances are known and included as variables in the model, these variables can be modeled as stochastic processes and incorporated into the linear difference equations, which are obtained after discretizing the linearized form of the differential equations. Otherwise, the overall effect of the disturbances on the output can be modeled as a stochastic process and added to the output of the deterministic model.

The second approach is to fit an empirical model structure with undetermined parameters to input-output data. The empirical structure will generally have a stochastic part to account for the effect of disturbances and other unknown inputs to the system. While the first approach requires significant a priori understanding of the process phenomena, this approach requires an identification experiment designed to collect the necessary input-output data. The theoretical modeling approach has the added advantage that the same model can be used to derive linear models at different operating points. However, in many practical situations, processes are not understood sufficiently to render fundamental modeling feasible.

8.6.5 Modeling via First Principles

A general form of the ODEs derived from first-principles modeling, after linearization, is

ẋ_p = A_p^c x_p + B_{1,p}^c u + B_{2,p}^c w
y = C_p x_p + D_{2,p} w   (8.142)

where x_p is the state, u the manipulated input, w the disturbance input, and y the output. The subscript p is used throughout to distinguish the state and model matrices of the process from those of the disturbances that we introduce later. The above model can be discretized according to formula (8.13) (the formula for B can be applied to both B_{1,p}^c and B_{2,p}^c):

x_p(k+1) = A_p x_p(k) + B_{1,p} u(k) + B_{2,p} w(k)
y(k) = C_p x_p(k) + D_{2,p} w(k)   (8.143)

Note: If physical disturbance variables cannot be identified, the only recourse


is to express the overall effect of disturbances as a signal added directly to the
output. We will refer to this signal as output disturbance. In such a case,
D2,p = I and B2,p = 0.


Augmented System Model for Stationary Disturbances

For state estimation and prediction, it is convenient to recast the model so that all the driving inputs are white noises. This can be done by combining the stochastic disturbance model discussed in the previous section with equation (8.143). As explained earlier, we typically represent a stationary stochastic signal with the stochastic difference equation

x_w(k+1) = A_w x_w(k) + B_w ε(k)
w(k) = C_w x_w(k) + ε(k)   (8.144)

where ε(k) is white noise with covariance R_ε and A_w is a matrix with all its eigenvalues strictly inside the unit disk.

One can augment (8.143) with (8.144) to arrive at the following model:

[ x_p(k+1) ]   [ A_p  B_{2,p}C_w ] [ x_p(k) ]   [ B_{1,p} ]        [ B_{2,p} ]
[ x_w(k+1) ] = [ 0    A_w        ] [ x_w(k) ] + [ 0       ] u(k) + [ B_w     ] ε(k)

y(k) = [ C_p  D_{2,p}C_w ] [ x_p(k); x_w(k) ] + D_{2,p} ε(k)   (8.145)

with the stacked vector denoted x(k) and the indicated blocks defining A, B_1, B_2, C, and D.

With some appropriate re-definition of the system matrices, the above is in the standard state-space form

x(k+1) = A x(k) + B_1 u(k) + B_2 ε(k),   with ε_1(k) ≡ B_2 ε(k)
y(k) = C x(k) + D ε(k),   with ε_2(k) ≡ D ε(k)   (8.146)

Note that the state is now expanded to include both the original process model state x_p and the disturbance state x_w.
Augmented System Model for Cases with Persistent Disturbances

For nonstationary disturbances with persistent characteristics, we earlier introduced the stochastic model

Δx_w(k+1) = A_w Δx_w(k) + B_w ε(k)
Δw(k) = C_w Δx_w(k) + ε(k)   (8.147)

The system model (8.143) can also be written in the differenced form

Δx_p(k+1) = A_p Δx_p(k) + B_{1,p} Δu(k) + B_{2,p} Δw(k)
Δy(k) = C_p Δx_p(k) + D_{2,p} Δw(k)   (8.148)


As before, the two can be combined into

[ Δx_p(k+1) ]   [ A_p  B_{2,p}C_w ] [ Δx_p(k) ]   [ B_{1,p} ]         [ B_{2,p} ]
[ Δx_w(k+1) ] = [ 0    A_w        ] [ Δx_w(k) ] + [ 0       ] Δu(k) + [ B_w     ] ε(k)

Δy(k) = [ C_p  D_{2,p}C_w ] [ Δx_p(k); Δx_w(k) ] + D_{2,p} ε(k)   (8.149)

with the stacked vector denoted Δx(k) and the blocks defining A, B_1, B_2, C, and D_2.
For estimation and control, it is further desired that the model output be y rather than Δy. This requires yet another augmentation of the state with the output y, according to

[ Δx(k+1) ]   [ A   0 ] [ Δx(k) ]   [ B_1  ]         [ 0   ]           [ B_2  ]
[ y(k+1)  ] = [ CA  I ] [ y(k)  ] + [ CB_1 ] Δu(k) + [ D_2 ] ε(k+1) + [ CB_2 ] ε(k)   (8.150)

y(k) = [ 0  I ] [ Δx(k); y(k) ]

The above is in the standard state-space form

x̄(k+1) = Ā x̄(k) + B̄_1 Δu(k) + ε̄_1(k)
y(k) = C̄ x̄(k) + ε̄_2(k)   (8.151)

except that now the system input is Δu rather than u. Note that this system has n_y integrators, which reflect the integrated effect of the white noise input ε̄_1(k) and the system input Δu on the output. This particular form will be used for the development of advanced MPC techniques.
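To fix ideas, here is how the deterministic part of (8.150) is assembled — a sketch of ours in NumPy; the function name is illustrative:

import numpy as np

def augment_velocity_form(A, B1, C):
    # Deterministic part of (8.150):
    # [dx(k+1); y(k+1)] = Abar [dx(k); y(k)] + Bbar du(k);  y(k) = Cbar [dx(k); y(k)]
    n = A.shape[0]
    ny = C.shape[0]
    Abar = np.block([[A, np.zeros((n, ny))],
                     [C @ A, np.eye(ny)]])
    Bbar = np.vstack([B1, C @ B1])
    Cbar = np.hstack([np.zeros((ny, n)), np.eye(ny)])
    return Abar, Bbar, Cbar

The identity block in Ā is precisely the bank of n_y integrators mentioned above, and the output matrix simply picks out y(k) from the augmented state.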

8.6.6 Modeling via Identification

Methods to obtain an input-output model from an identification experiment have been explained in detail in Chapter ??. We show here that the resulting model can be put into the standard state-space form of (8.146) or (8.150), in which the unknown external input is a white noise.
Input-output models obtained from an identification experiment have the general structure

y(k) = G(q) u(k) + H(q) ε(k)   (8.152)

where G(q) and H(q) are stable transfer matrices. One can easily put the above model in the form of (8.146) by finding a state-space realization of [G(q)  H(q)]:

x(k+1) = A x(k) + B_1 u(k) + B_2 ε(k)
y(k) = C x(k) + D_1 u(k) + D_2 ε(k)   (8.153)


Since G(q) has relative degree of at least one, and since it is conventional to assume without loss of generality that H(0) = I, we have D_1 = 0 and D_2 = I. Hence, the above is in the same form as (8.146).

As before, in the case that the disturbance effects are non-stationary, exhibiting mean shifts, the driving noise should be an integrated white noise:

y(k) = G(q) u(k) + H(q) ε_int(k),   ε_int(k) = 1/(1 − q^{−1}) ε(k)   (8.154)

Using the fact that (1 − q^{−1}) y(k) = Δy(k), we can rewrite the above as

Δy(k) = G(q) Δu(k) + H(q) ε(k)   (8.155)

Denoting the realization of [G(q)  H(q)] as

Δx(k+1) = A Δx(k) + B_1 Δu(k) + B_2 ε(k)
Δy(k) = C Δx(k) + ε(k)   (8.156)

the state can be augmented with y as before to obtain (8.150).


For multi-variable systems, identification of transfer matrices can be a
challenge, as explained in Chapter ??. In Chapter ??, we discussed the subspace identification method that gives a multivariable system model directly
in the form of (8.156) or (8.153), depending on whether one uses data given in
terms of deviations from some fixed nominal values or in terms of incremental
changes from one sample time to next.

8.6.7 Examples

Take a chemical process and show how to put everything together both via
fundamental modelling and identification here.


Chapter 1

Random Variables and Stochastic Processes

1.1 RANDOM VARIABLES

1.1.1 INTRODUCTION

What Is a Random Variable?


We are dealing with a physical phenomenon which exhibits randomness:

- the outcome of any one occurrence (trial) cannot be predicted;
- the probability of any subset of possible outcomes is well-defined.

We ascribe the term random variable to such a phenomenon. Note that a random variable is not defined by a specific number; rather, it is defined by the probabilities of all subsets of the possible outcomes. An outcome of a particular trial is called a realization of the random variable.
An example is the outcome of rolling a die. Let x represent the outcome (not of a particular trial, but in general). Then x is not represented by a single outcome, but is defined by the set of possible outcomes ({1, 2, 3, 4, 5, 6}) and the probabilities of the possible outcomes (1/6 each). When we say x is 1 or 2 and so on, we really should say that a realization of x is such.

A random variable can be discrete or continuous. If the outcome of a random variable belongs to a discrete space, the random variable is discrete. An example is the outcome of rolling a die. On the other hand, if the outcome belongs to a continuous space, the random variable is continuous. For instance, a composition or temperature in a distillation column can be viewed as a continuous random variable.

What Is Statistics?

Statistics deals with the application of probability theory to real problems. There are two basic problems in statistics:

- Given a probabilistic model, predict the outcome of future trial(s). For instance, one may choose the prediction x̂ such that the expected value of (x − x̂)² is minimized.

- Given collected data, define / improve a probabilistic model. For instance, there may be some unknown parameters (say θ) in the probabilistic model. Then, given data X generated from the particular probabilistic model, one should construct an estimate of θ in the form of θ̂(X). For example, θ̂(X) may be constructed based on the objective of minimizing the expected value of ‖θ − θ̂‖². Another related topic is hypothesis testing, which has to do with testing whether a given hypothesis is correct (i.e., how correct, defined in terms of probability), based on available data.

In fact, one does both. That is, as data come in, one may continue to improve the probabilistic model and use the updated model for further prediction.
(Diagram: a predictor driven by the probabilistic model, which is built from a priori knowledge, runs alongside the actual system producing the data X; the prediction error is fed back to improve the model.)

1.1.2 BASIC PROBABILITY CONCEPTS

PROBABILITY DISTRIBUTION, DENSITY: SCALAR CASE


A random variable is defined by a function describing the probability of the outcome rather than by a specific value. Let d be a continuous random variable (d ∈ R). Then one of the following functions is used to define d:


Probability Distribution Function

The probability distribution function F(ξ; d) for random variable d is defined as

F(ξ; d) = Pr{d ≤ ξ}   (1.1)

where Pr denotes the probability. Note that F(ξ; d) is monotonically increasing with ξ and asymptotically reaches 1 as ξ approaches its upper limit.
Probability Density Function

The probability density function P(ξ; d) for random variable d is defined as

P(ξ; d) = dF(ξ; d)/dξ   (1.2)

Note that

∫_{−∞}^{∞} P(ξ; d) dξ = ∫_{−∞}^{∞} dF(ξ; d) = 1   (1.3)

In addition,

∫_a^b P(ξ; d) dξ = ∫_a^b dF(ξ; d) = F(b; d) − F(a; d) = Pr{a < d ≤ b}   (1.4)

Example: Gaussian or Normally Distributed Variable

P(ξ; d) = (1/(√(2π) σ)) exp{ −(1/2) ((ξ − m)/σ)² }   (1.5)

(Sketch: the normal density; about 68.3% of the probability mass lies between m − σ and m + σ.)

Note that this distribution is determined entirely by two parameters (the mean m and the standard deviation σ).
PROBABILITY DISTRIBUTION, DENSITY: VECTOR CASE

Let d = [d_1 ··· d_n]^T be a continuous random variable vector (d ∈ R^n). Now we must quantify the distribution of its individual elements as well as their correlations.

Joint Probability Distribution Function

The joint probability distribution function F(ξ_1, ···, ξ_n; d_1, ···, d_n) for random variable vector d is defined as

F(ξ_1, ···, ξ_n; d_1, ···, d_n) = Pr{d_1 ≤ ξ_1, ···, d_n ≤ ξ_n}   (1.6)

Now the domain of F is an n-dimensional space. For example, for n = 2, F is represented by a surface. Note that F(ξ_1, ···, ξ_n; d_1, ···, d_n) → 1 as ξ_1, ···, ξ_n → ∞.
Joint and Marginal Probability Density Function

The joint probability density function P(ξ_1, ···, ξ_n; d_1, ···, d_n) for random variable vector d is defined as

P(ξ_1, ···, ξ_n; d_1, ···, d_n) = ∂ⁿF(ξ; d) / (∂ξ_1 ··· ∂ξ_n)   (1.7)

(Surface plot of a two-dimensional joint density.)

For convenience, we may write P(ξ; d) to denote P(ξ_1, ···, ξ_n; d_1, ···, d_n). Again,

∫_{a_1}^{b_1} ··· ∫_{a_n}^{b_n} P(ξ_1, ···, ξ_n; d_1, ···, d_n) dξ_1 ··· dξ_n = Pr{a_1 < d_1 ≤ b_1, ···, a_n < d_n ≤ b_n}   (1.8)

Naturally,

∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} P(ξ_1, ···, ξ_n; d_1, ···, d_n) dξ_1 ··· dξ_n = 1   (1.9)

We can easily derive the probability density of an individual element from the joint probability density. For instance,

P(ξ_1; d_1) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} P(ξ_1, ···, ξ_n; d_1, ···, d_n) dξ_2 ··· dξ_n   (1.10)

This is called the marginal probability density. While the joint probability density (or distribution) tells us the likelihood of several random variables achieving certain values simultaneously, the marginal density tells us the likelihood of one element achieving a certain value when the others are not known.

Note that in general

P(ξ_1, ···, ξ_n; d_1, ···, d_n) ≠ P(ξ_1; d_1) ··· P(ξ_n; d_n)   (1.11)

If

P(ξ_1, ···, ξ_n; d_1, ···, d_n) = P(ξ_1; d_1) ··· P(ξ_n; d_n)   (1.12)

d_1, ···, d_n are called mutually independent.
Example: Gaussian or Jointly Normally Distributed Variables

Suppose that d = [d_1 d_2]^T is a Gaussian variable. The density takes the form

P(ξ_1, ξ_2; d_1, d_2) = 1/(2π σ_1 σ_2 (1−ρ²)^{1/2}) exp{ −(1/(2(1−ρ²))) [ ((ξ_1−m_1)/σ_1)² − 2ρ(ξ_1−m_1)(ξ_2−m_2)/(σ_1σ_2) + ((ξ_2−m_2)/σ_2)² ] }   (1.13)

Note that this density is determined by five parameters (the means m_1, m_2, the standard deviations σ_1, σ_2, and the correlation parameter ρ). ρ = ±1 represents complete correlation between d_1 and d_2, while ρ = 0 represents no correlation. It is fairly straightforward to verify that

P(ξ_1; d_1) = ∫_{−∞}^{∞} P(ξ_1, ξ_2; d_1, d_2) dξ_2 = (1/√(2πσ_1²)) exp{ −(1/2)((ξ_1 − m_1)/σ_1)² }   (1.14)–(1.15)

P(ξ_2; d_2) = ∫_{−∞}^{∞} P(ξ_1, ξ_2; d_1, d_2) dξ_1 = (1/√(2πσ_2²)) exp{ −(1/2)((ξ_2 − m_2)/σ_2)² }   (1.16)–(1.17)

Hence, (m_1, σ_1) and (m_2, σ_2) represent the parameters of the marginal densities of d_1 and d_2, respectively. Note also that

P(ξ_1, ξ_2; d_1, d_2) ≠ P(ξ_1; d_1) P(ξ_2; d_2)   (1.18)

except when ρ = 0.
A general n-dimensional Gaussian random variable vector d = [d_1, ···, d_n]^T has a density function of the following form:

P(ξ; d) = P(ξ_1, ···, ξ_n; d_1, ···, d_n) = 1/((2π)^{n/2} |P_d|^{1/2}) exp{ −(1/2)(ξ − d̄)^T P_d^{−1} (ξ − d̄) }   (1.19)–(1.20)

where the parameters are d̄ ∈ R^n and P_d ∈ R^{n×n}. The significance of these parameters will be discussed later.
EXPECTATION OF RANDOM VARIABLES AND RANDOM VARIABLE FUNCTIONS: SCALAR CASE

Random variables are completely characterized by their distribution functions or density functions. However, in general, these functions are nonparametric. Hence, random variables are often characterized by their moments up to a finite order; in particular, use of the first two moments is quite common.

Expectation of a Random Variable Function

Any function of d is a random variable. Its expectation is computed as follows:

E{f(d)} = ∫_{−∞}^{∞} f(ξ) P(ξ; d) dξ   (1.21)

Mean

d̄ = E{d} = ∫_{−∞}^{∞} ξ P(ξ; d) dξ   (1.22)

The above is called the mean or expectation of d.

Variance

Var{d} = E{(d − d̄)²} = ∫_{−∞}^{∞} (ξ − d̄)² P(ξ; d) dξ   (1.23)

The above is the variance of d and quantifies the extent of d's deviation from its mean.
Example: Gaussian Variable

For a Gaussian variable with density

P(ξ; d) = (1/(√(2π) σ)) exp{ −(1/2)((ξ − m)/σ)² }   (1.24)

it is easy to verify that

d̄ = E{d} = ∫_{−∞}^{∞} (ξ/(√(2π) σ)) exp{ −(1/2)((ξ − m)/σ)² } dξ = m   (1.25)

Var{d} = E{(d − d̄)²} = ∫_{−∞}^{∞} ((ξ − m)²/(√(2π) σ)) exp{ −(1/2)((ξ − m)/σ)² } dξ = σ²   (1.26)

Hence, the m and σ² that parametrize the normal density represent the mean and the variance of the Gaussian variable.
EXPECTATION OF RANDOM VARIABLES AND RANDOM VARIABLE FUNCTIONS: VECTOR CASE

We can extend the concepts of mean and variance similarly to the vector case. Let d be a random variable vector that belongs to R^n:

d̄_l = E{d_l} = ∫_{−∞}^{∞} ξ_l P(ξ_l; d_l) dξ_l = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} ξ_l P(ξ_1, ···, ξ_n; d_1, ···, d_n) dξ_1 ··· dξ_n   (1.27)

Var{d_l} = E{(d_l − d̄_l)²} = ∫_{−∞}^{∞} (ξ_l − d̄_l)² P(ξ_l; d_l) dξ_l = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} (ξ_l − d̄_l)² P(ξ_1, ···, ξ_n; d_1, ···, d_n) dξ_1 ··· dξ_n   (1.28)
In the vector case, we also need to quantify the correlations among the different elements:

Cov{d_l, d_m} = E{(d_l − d̄_l)(d_m − d̄_m)} = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} (ξ_l − d̄_l)(ξ_m − d̄_m) P(ξ_1, ···, ξ_n; d_1, ···, d_n) dξ_1 ··· dξ_n   (1.29)

Note that

Cov{d_l, d_l} = Var{d_l}   (1.30)

The ratio

ρ = Cov{d_l, d_m} / √(Var{d_l} Var{d_m})   (1.31)

is the correlation factor. ρ = ±1 indicates complete correlation (d_l is determined uniquely by d_m and vice versa); ρ = 0 indicates no correlation.

It is convenient to define the covariance matrix for d, which contains all the variances and covariances of d_1, ···, d_n:

Cov{d} = E{(d − d̄)(d − d̄)^T} = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} (ξ − d̄)(ξ − d̄)^T P(ξ_1, ···, ξ_n; d_1, ···, d_n) dξ_1 ··· dξ_n   (1.32)

The (i,j)th element of Cov{d} is Cov{d_i, d_j}. The diagonal elements of Cov{d} are the variances of the elements of d. The above matrix is symmetric since

Cov{d_i, d_j} = Cov{d_j, d_i}   (1.33)
The covariance of two different vectors x ∈ R^n and y ∈ R^m can be defined similarly:

Cov{x, y} = E{(x − x̄)(y − ȳ)^T}   (1.34)

In this case, Cov{x, y} is an n × m matrix. In addition,

Cov{x, y} = (Cov{y, x})^T   (1.35)
Example: Gaussian Variables (2-Dimensional Case)

Let d = [d_1 d_2]^T with the density of (1.13):

P(ξ; d) = 1/(2πσ_1σ_2(1−ρ²)^{1/2}) exp{ −(1/(2(1−ρ²))) [ ((ξ_1−m_1)/σ_1)² − 2ρ(ξ_1−m_1)(ξ_2−m_2)/(σ_1σ_2) + ((ξ_2−m_2)/σ_2)² ] }   (1.36)

Then,

E{d} = ∫∫ [ξ_1; ξ_2] P(ξ; d) dξ_1 dξ_2 = [m_1; m_2]   (1.37)

Similarly, one can show that

Cov{d} = ∫∫ [ξ_1 − m_1; ξ_2 − m_2] [(ξ_1 − m_1)  (ξ_2 − m_2)] P(ξ; d) dξ_1 dξ_2 = [ σ_1²     ρσ_1σ_2
                                                                                   ρσ_1σ_2  σ_2²    ]   (1.38)
Example: Gaussian Variables (n-Dimensional Case)

Let d = [d_1 ··· d_n]^T and

P(ξ; d) = 1/((2π)^{n/2} |P_d|^{1/2}) exp{ −(1/2)(ξ − d̄)^T P_d^{−1} (ξ − d̄) }   (1.39)

Then, one can show that

E{d} = ∫ ··· ∫ ξ P(ξ; d) dξ_1 ··· dξ_n = d̄   (1.40)

Cov{d} = ∫ ··· ∫ (ξ − d̄)(ξ − d̄)^T P(ξ; d) dξ_1 ··· dξ_n = P_d   (1.41)

Hence, the d̄ and P_d that parametrize the normal density function P(ξ; d) represent the mean and the covariance matrix.
Exercise: Verify that, with

d̄ = [m_1; m_2];   P_d = [ σ_1²     ρσ_1σ_2
                           ρσ_1σ_2  σ_2²    ]   (1.42)

one obtains the expression for the normal density of a 2-dimensional vector shown earlier.
NOTE: Use of the SVD for Visualization of the Normal Density

The covariance matrix P_d contains information about the spread (i.e., extent of deviation from the mean) of each element and about their correlations. For instance,

Var{d_l} = [Cov{d}]_{l,l}   (1.43)

ρ{d_l, d_m} = [Cov{d}]_{l,m} / ( [Cov{d}]_{l,l} [Cov{d}]_{m,m} )^{1/2}   (1.44)

where [·]_{i,j} represents the (i,j)th element of the matrix. However, one still has a hard time understanding the correlations among all the elements and visualizing the overall shape of the density function. Here, the SVD can be useful. Because P_d is a symmetric matrix, it has the following SVD:

P_d = E{(d − d̄)(d − d̄)^T} = V Σ V^T = [v_1 ··· v_n] diag(σ_1², ···, σ_n²) [v_1 ··· v_n]^T   (1.45)–(1.47)
Pre-multiplying by V^T and post-multiplying by V on both sides, we obtain

E{V^T (d − d̄)(d − d̄)^T V} = diag(σ_1², ···, σ_n²)   (1.48)

Let d′ = V^T d; d′ is the representation of d in the coordinate system defined by the orthonormal basis v_1, ···, v_n. Then we see that

E{(d′ − d̄′)(d′ − d̄′)^T} = diag(σ_1², ···, σ_n²)   (1.49)

The diagonal covariance matrix means that the elements of d′ are completely independent of each other. Hence, v_1, ···, v_n define the coordinate system with respect to which the random variable vector is independent, and σ_1², ···, σ_n² are the variances of d′ with respect to the axes defined by v_1, ···, v_n.
Exercise: Suppose d ∈ R² is zero-mean Gaussian and

P_d = [ 20.2  19.8
        19.8  20.2 ] = V diag(40, 0.4) V^T,   V = [ √2/2   √2/2
                                                    √2/2  −√2/2 ]   (1.50)

Then v_1 = [√2/2  √2/2]^T and v_2 = [√2/2  −√2/2]^T. Can you visualize the overall shape of the density function? What is the variance of d along the (1,1) direction? What about along the (1,−1) direction? What do you think the conditional density of d_1 given d_2 = η looks like? Plot the densities to verify.
CONDITIONAL PROBABILITY DENSITY: SCALAR CASE

When two random variables are related, the probability density of one random variable changes when the other random variable takes on a particular value.

The probability density of a random variable when one or more other random variables are fixed is called the conditional probability density.

This concept is important in stochastic estimation as it can be used to develop estimates of unknown variables based on readings of other related variables.

Let x and y be random variables. Suppose x and y have joint probability density P(ξ, η; x, y). One may then ask what the probability density of x is given a particular value of y (say y = η). Formally, this is called the conditional density function of x given y and is denoted P(ξ|η; x|y). It is computed as

P(ξ|η; x|y) = lim_{ε→0} [ ∫_η^{η+ε} P(ξ, η̃; x, y) dη̃ ] / [ ∫_{−∞}^{∞} ∫_η^{η+ε} P(ξ̃, η̃; x, y) dη̃ dξ̃ ]   (1.51)
            = P(ξ, η; x, y) / ∫_{−∞}^{∞} P(ξ̃, η; x, y) dξ̃   (1.52)
            = P(ξ, η; x, y) / P(η, y)   (1.53)

where the denominator acts as a normalization factor.

Note:

- The above means

(Conditional Density of x given y) = (Joint Density of x and y) / (Marginal Density of y)   (1.54)

This should be quite intuitive.

- Due to the normalization,

∫_{−∞}^{∞} P(ξ|η; x|y) dξ = 1   (1.55)

which is what we want for a density function.

- P(ξ|η; x|y) = P(ξ, x)   (1.56)

if and only if

P(ξ, η; x, y) = P(ξ, x) P(η, y)   (1.57)

This means that the conditional density is the same as the marginal density when and only when x and y are independent.

We are interested in the conditional density because often some of the random variables are measured while others are not. For a particular trial, if x is not measurable but y is, we are interested in knowing P(ξ|η; x|y) for estimation of x.

Finally, note the distinctions among the different density functions:



- P(ξ, η; x, y): Joint probability density of x and y. It represents the probability density of x = ξ and y = η simultaneously:

∫_{a_2}^{b_2} ∫_{a_1}^{b_1} P(ξ, η; x, y) dξ dη = Pr{a_1 < x ≤ b_1 and a_2 < y ≤ b_2}   (1.58)

- P(ξ; x): Marginal probability density of x. It represents the probability density of x = ξ NOT knowing what y is:

P(ξ, x) = ∫_{−∞}^{∞} P(ξ, η; x, y) dη   (1.59)

- P(η; y): Marginal probability density of y. It represents the probability density of y = η NOT knowing what x is:

P(η, y) = ∫_{−∞}^{∞} P(ξ, η; x, y) dξ   (1.60)

- P(ξ|η; x|y): Conditional probability density of x given y. It represents the probability density of x when y = η:

P(ξ|η; x|y) = P(ξ, η; x, y) / P(η, y)   (1.61)

- P(η|ξ; y|x): Conditional probability density of y given x. It represents the probability density of y when x = ξ:

P(η|ξ; y|x) = P(ξ, η; x, y) / P(ξ, x)   (1.62)

Bayes Rule: Note that

P(ξ|η; x|y) = P(ξ, η; x, y) / P(η, y)   (1.63)
P(η|ξ; y|x) = P(ξ, η; x, y) / P(ξ, x)   (1.64)

Hence, we arrive at

P(ξ|η; x|y) = P(η|ξ; y|x) P(ξ, x) / P(η, y)   (1.65)

The above is known as the Bayes Rule. It essentially says

(Cond. Prob. of x given y) × (Marg. Prob. of y) = (Cond. Prob. of y given x) × (Marg. Prob. of x)   (1.66)–(1.67)

Bayes Rule is useful since, in many cases, we are trying to compute P(ξ|η; x|y) and it is difficult to obtain an expression for it directly, while it may be easy to write down the expression for P(η|ξ; y|x).

We can define the concepts of conditional expectation and conditional covariance using the conditional density. For instance, the conditional expectation of x given y = η is defined as

E{x|y} = ∫_{−∞}^{∞} ξ P(ξ|η; x|y) dξ   (1.68)

The conditional variance can be defined as

Var{x|y} = E{(ξ − E{x|y})²} = ∫_{−∞}^{∞} (ξ − E{x|y})² P(ξ|η; x|y) dξ   (1.69)–(1.70)

Example: Jointly Normally Distributed or Gaussian Variables

Suppose that x and y have the following joint normal density, parametrized by x̄, ȳ, σ_x, σ_y, ρ:

P(ξ, η; x, y) = 1/(2π σ_x σ_y (1−ρ²)^{1/2}) exp{ −(1/(2(1−ρ²))) [ ((ξ−x̄)/σ_x)² − 2ρ(ξ−x̄)(η−ȳ)/(σ_x σ_y) + ((η−ȳ)/σ_y)² ] }   (1.71)
Some algebra yields

P(ξ, η; x, y) = { 1/√(2πσ_y²) exp[ −(1/2)((η−ȳ)/σ_y)² ] } × { 1/√(2πσ_x²(1−ρ²)) exp[ −(1/2)( (ξ − x̄ − ρ(σ_x/σ_y)(η−ȳ)) / (σ_x√(1−ρ²)) )² ] }   (1.72)

where the first factor is the marginal density of y and the second the conditional density of x, and

P(ξ, η; x, y) = { 1/√(2πσ_x²) exp[ −(1/2)((ξ−x̄)/σ_x)² ] } × { 1/√(2πσ_y²(1−ρ²)) exp[ −(1/2)( (η − ȳ − ρ(σ_y/σ_x)(ξ−x̄)) / (σ_y√(1−ρ²)) )² ] }   (1.73)

where the first factor is the marginal density of x and the second the conditional density of y.

Hence,

P(ξ|η; x|y) = 1/√(2πσ_x²(1−ρ²)) exp{ −(1/2)( (ξ − x̄ − ρ(σ_x/σ_y)(η−ȳ)) / (σ_x√(1−ρ²)) )² }   (1.74)

P(η|ξ; y|x) = 1/√(2πσ_y²(1−ρ²)) exp{ −(1/2)( (η − ȳ − ρ(σ_y/σ_x)(ξ−x̄)) / (σ_y√(1−ρ²)) )² }   (1.75)

Note that the above conditional densities are normal. For instance, P(ξ|η; x|y) is a normal density with mean x̄ + ρ(σ_x/σ_y)(η − ȳ) and variance σ_x²(1−ρ²). So,

E{x|y} = x̄ + ρ (σ_x/σ_y)(η − ȳ)   (1.76)
       = x̄ + (ρσ_xσ_y/σ_y²)(η − ȳ)   (1.77)
       = E{x} + Cov{x,y} Var^{−1}{y} (η − ȳ)   (1.78)
The conditional covariance of x given y = η is

E{(x − E{x|y})² | y} = σ_x²(1 − ρ²)   (1.79)
                     = σ_x² − (ρ²σ_x²σ_y²)/σ_y²   (1.80)
                     = σ_x² − (ρσ_xσ_y)(1/σ_y²)(ρσ_xσ_y)   (1.81)
                     = Var{x} − Cov{x,y} Var^{−1}{y} Cov{y,x}   (1.82)

Notice that the conditional distribution becomes a point density as ρ → ±1, which should be intuitively obvious.
CONDITIONAL PROBABILITY DENSITY: VECTOR CASE

We can extend the concept of conditional probability distribution to the vector case similarly as before. Let x and y be n- and m-dimensional random vectors, respectively. Then, the conditional density of x given y = [η_1, ···, η_m]^T is defined as

P(ξ_1, ···, ξ_n | η_1, ···, η_m; x_1, ···, x_n | y_1, ···, y_m) = P(ξ_1, ···, ξ_n, η_1, ···, η_m; x_1, ···, x_n, y_1, ···, y_m) / P(η_1, ···, η_m; y_1, ···, y_m)   (1.83)

Bayes Rule can be stated as

P(ξ_1, ···, ξ_n | η_1, ···, η_m; x|y) = P(η_1, ···, η_m | ξ_1, ···, ξ_n; y|x) P(ξ_1, ···, ξ_n; x) / P(η_1, ···, η_m; y)   (1.84)

The conditional expectation and covariance matrix can be defined similarly:

E{x|y} = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} [ξ_1; ⋮; ξ_n] P(ξ|η; x|y) dξ_1 ··· dξ_n   (1.85)

Cov{x|y} = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} [ξ_1 − E{x_1|y}; ⋮; ξ_n − E{x_n|y}] [ξ_1 − E{x_1|y}; ⋮; ξ_n − E{x_n|y}]^T P(ξ|η; x|y) dξ_1 ··· dξ_n   (1.86)
Example: Gaussian or Jointly Normally Distributed Variables

Let x and y be jointly normally distributed random variable vectors of dimensions n and m, respectively. Let

z = [x; y]   (1.87)

The joint distribution takes the form

P(ξ, η; x, y) = 1/((2π)^{(n+m)/2} |P_z|^{1/2}) exp{ −(1/2)(ζ − z̄)^T P_z^{−1} (ζ − z̄) }   (1.88)

where

z̄ = [x̄; ȳ];   ζ = [ξ; η]   (1.89)

P_z = [ Cov(x)    Cov(x,y)
        Cov(y,x)  Cov(y)   ]   (1.90)

Then, it can be proven that (see Theorem 2.13 in [Jaz70])

E{x|y} = x̄ + Cov(x,y) Cov^{−1}(y) (η − ȳ)   (1.91)
E{y|x} = ȳ + Cov(y,x) Cov^{−1}(x) (ξ − x̄)   (1.92)

and

Cov{x|y} = E{ (ξ − E{x|y})(ξ − E{x|y})^T } = Cov{x} − Cov{x,y} Cov^{−1}{y} Cov{y,x}   (1.93)–(1.94)

Cov{y|x} = E{ (η − E{y|x})(η − E{y|x})^T } = Cov{y} − Cov{y,x} Cov^{−1}{x} Cov{x,y}   (1.95)–(1.96)


1.1.3 STATISTICS

PREDICTION

The first problem of statistics is prediction of the outcome of a future trial given a probabilistic model:

Suppose P(x), the probability density for random variable x, is given. Predict the outcome of x for a new trial (which is about to occur).

Note that, unless P(x) is a point distribution, x cannot be predicted exactly.

To do optimal estimation, one must first establish a formal criterion. For example, the most likely value of x is the one that corresponds to the highest density value:

x̂ = arg max_x P(x)

A more commonly used criterion is the following minimum variance estimate:

x̂ = arg min_{x̂} E{ ‖x − x̂‖₂² }

The solution to the above is x̂ = E{x}.

Exercise: Can you prove the above?

If a related variable y (from the same trial) is given, then one should use x̂ = E{x|y} instead.
x
= E{x|y} instead.
SAMPLE MEAN AND COVARIANCE, PROBABILISTIC MODEL

The other problem of statistics is inferring a probabilistic model from collected data. The simplest such problem is the following:

We are given the data for random variable x from N trials. These data are labeled x(1), ···, x(N). Find the probability density function for x.

Oftentimes, a certain density shape (like the normal distribution) is assumed to make this a well-posed problem. If a normal density is assumed, the following sample averages can then be used as estimates for the mean and covariance:

x̂ = (1/N) Σ_{i=1}^{N} x(i)

R̂_x = (1/N) Σ_{i=1}^{N} [x(i) − x̂][x(i) − x̂]^T

Note that the above estimates are consistent estimates of the real mean and covariance x̄ and R_x (i.e., they converge to the true values as N → ∞).
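A quick numerical check of consistency — a sketch of ours; the true mean and covariance below are arbitrary choices:

import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.5], [0.5, 1.0]], size=5000)

xbar_hat = X.mean(axis=0)              # sample mean
Xc = X - xbar_hat
Rx_hat = Xc.T @ Xc / len(X)            # sample covariance
# Both estimates approach the true values [1, -1] and [[2, .5], [.5, 1]] as N grows.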
A slightly more general problem is:

A random variable vector y is produced according to

y = f(θ, u) + x

In the above, x is another random variable vector, u is a known deterministic vector (which can change from trial to trial), and θ is an unknown deterministic vector (which is invariant). Given data for y from N trials, find the probability density parameters for x (e.g., x̄, R_x) and the unknown deterministic vector θ.

This problem will be discussed later in the regression section.

1.2 STOCHASTIC PROCESSES

A stochastic process refers to a family of random variables indexed by a parameter set. This parameter set can be continuous or discrete. Since we are interested in discrete systems, we will limit our discussion to processes with a discrete parameter set. Hence, a stochastic process in our context is a time sequence of random variables.

1.2.1 BASIC PROBABILITY CONCEPTS

DISTRIBUTION FUNCTION

Let x(k) be a sequence. Then (x(k_1), ···, x(k_l)) forms an l-dimensional random variable, and one can define the finite-dimensional distribution function and density function as before. For instance, the distribution function F(ξ_1, ···, ξ_l; x(k_1), ···, x(k_l)) is defined as

F(ξ_1, ···, ξ_l; x(k_1), ···, x(k_l)) = Pr{x(k_1) ≤ ξ_1, ···, x(k_l) ≤ ξ_l}   (1.97)

The density function is also defined similarly as before. We note that the above definitions also apply to vector time sequences if the x(k_i) and ξ_i are taken as vectors and each integral is defined over the space that ξ_i occupies.
MEAN AND COVARIANCE
The mean value of the stochastic variable x(k) is

    x̄(k) = E{x(k)} = ∫ ξ dF(ξ; x(k))                                  (1.98)

Its covariance is defined as

    Rx(k₁, k₂) = E{[x(k₁) − x̄(k₁)][x(k₂) − x̄(k₂)]ᵀ}
               = ∫∫ [ξ₁ − x̄(k₁)][ξ₂ − x̄(k₂)]ᵀ dF(ξ₁, ξ₂; x(k₁), x(k₂))
                                                                      (1.99)

The cross-covariance of two stochastic processes x(k) and y(k) is defined as

    Rxy(k₁, k₂) = E{[x(k₁) − x̄(k₁)][y(k₂) − ȳ(k₂)]ᵀ}
                = ∫∫ [ξ₁ − x̄(k₁)][η₂ − ȳ(k₂)]ᵀ dF(ξ₁, η₂; x(k₁), y(k₂))
                                                                      (1.100)
Gaussian processes are processes whose finite-dimensional distribution
functions are all normal. Gaussian processes are completely characterized by
their mean and covariance.
STATIONARY STOCHASTIC PROCESSES
Throughout this book we will define stationary stochastic processes as those
with a time-invariant distribution function. Weakly stationary (or stationary
in a wide sense) processes are processes whose first two moments are
time-invariant. Hence, for a weakly stationary process x(k),

    E{x(k)} = x̄                            ∀k
    E{[x(k) − x̄][x(k − τ) − x̄]ᵀ} = Rx(τ)   ∀k                         (1.101)

In other words, if x(k) is stationary, it has a constant mean value and its
covariance depends only on the time difference τ. For Gaussian processes,
weakly stationary processes are also stationary.

For scalar x(k), R(0) can be interpreted as the variance of the signal and
R(τ) reveals its time correlation. The normalized covariance R(τ)/R(0) has
magnitude between 0 and 1 and indicates the time correlation of the signal: a
magnitude of 1 indicates complete correlation and a value of 0 indicates no
correlation.

Note that many signals have both deterministic and stochastic components. In
some applications, it is very useful to treat these signals in the same
framework. One can do this by defining

    x̄ = lim_{N→∞} (1/N) Σ_{k=1}^{N} x(k)
    Rx(τ) = lim_{N→∞} (1/N) Σ_{k=1}^{N} [x(k) − x̄][x(k − τ) − x̄]ᵀ    (1.102)

Note that in the above, both deterministic and stochastic parts are averaged.
The signals for which the above limits converge are called quasi-stationary
signals. The above definitions are consistent with the previous


definitions since, in the purely stochastic case, a particular realization of
a stationary stochastic process with given mean (x̄) and covariance (Rx(τ))
should satisfy the above relationships.

SPECTRA OF STATIONARY STOCHASTIC PROCESSES


Throughout this chapter, continuous time is rescaled so that each discrete
time interval represents one continuous time unit. If the sample interval Ts
is not one continuous time unit, the frequency ω in discrete time needs to be
scaled by the factor 1/Ts.
The spectral density of a stationary process x(k) is defined as the Fourier
transform of its covariance function:

    Φx(ω) = (1/2π) Σ_{τ=−∞}^{+∞} Rx(τ) e^{−jωτ}                       (1.103)

The area under the curve represents the power of the signal in the particular
frequency range. For example, the power of x(k) in the frequency range
(ω₁, ω₂) is calculated by the integral

    ∫_{ω=ω₁}^{ω₂} Φx(ω) dω

Peaks in the signal spectrum indicate the presence of periodic components in


the signal at the respective frequency.
The inverse Fourier transform can be used to calculate Rx(τ) from the
spectrum Φx(ω) as well:

    Rx(τ) = ∫_{−π}^{π} Φx(ω) e^{jωτ} dω                               (1.104)

With τ = 0, the above becomes

    E{x(k)xᵀ(k)} = Rx(0) = ∫_{−π}^{π} Φx(ω) dω                        (1.105)

which indicates that the total area under the spectral density is equal to
the variance of the signal. This is known as Parseval's relationship.
Example: Show plots of various covariances, spectra and realizations!
**Exercise: Plot the spectra of (1) white noise, (2) sinusoids, and (3) white
noise filtered through a low-pass filter.


DISCRETE-TIME WHITE NOISE


A particular type of stochastic process called white noise will be used
extensively throughout this book. x(k) is called a white noise (or white
sequence) if

    P(x(k)|x(ℓ)) = P(x(k))   for ℓ < k                                (1.106)

for all k. In other words, the sequence has no time correlation and hence
all the elements are mutually independent. In such a situation, knowing the
realization of x(ℓ) in no way helps in estimating x(k).
A stationary white noise sequence has the following properties:

    E{x(k)} = x̄
    E{(x(k) − x̄)(x(k − τ) − x̄)ᵀ} = { Rx  if τ = 0
                                    { 0   if τ ≠ 0                    (1.107)
Hence, the covariance of a white noise is defined by a single matrix.


The spectrum of white noise x(k) is constant over the entire frequency range
since, from (1.103),

    Φx(ω) = (1/2π) Rx                                                 (1.108)

The name white noise actually originated from its similarity to white light
in spectral properties.
COLORED NOISE
A stochastic process generated by filtering white noise through a dynamic
system is called colored noise.
Important:
A stationary stochastic process with any given mean and covariance function can be generated by passing a white noise through
an appropriate dynamical system.
To see this, consider

    d(k) = H(q) ε(k) + d̄                                             (1.109)

where ε(k) is a white noise of identity covariance and H(q) is a stable,
stably invertible transfer function (matrix). Using simple algebra (Ljung
REFERENCE), one can show that

    Φd(ω) = H(e^{jω}) Hᵀ(e^{−jω})                                     (1.110)

The spectral factorization theorem (REFERENCE - Åström and Wittenmark, 1984)
says that one can always find an H(q) that satisfies (1.110) for an arbitrary
Φd and has no pole or zero outside the unit disk. In other words, the first
and second order moments of any stationary signal can be matched by the above
model.
This result is very useful in modeling disturbances whose covariance functions are known or fixed. Note that a stationary Gaussian process is completely
specified by its mean and covariance. Such a process can be modelled by filtering a zero-mean Gaussian white sequence through appropriate dynamics
determined by its spectrum (plus adding a bias at the output if the mean is
not zero).
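A small Python sketch of this idea (ours; the first-order shaping filter is an
assumed example): pass unit-variance white noise through H(q) = 1/(1 − 0.9q⁻¹)
and check that the sample covariance of the output decays geometrically, as
the theory predicts for this filter.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(2)
N = 100_000
eps = rng.standard_normal(N)                # white noise, unit covariance

# Shaping filter H(q) = 1/(1 - 0.9 q^-1): stable and stably invertible
d = signal.lfilter([1.0], [1.0, -0.9], eps)

# Sample autocovariance vs. theory: R_d(tau) = 0.9**tau / (1 - 0.81)
for tau in range(4):
    R = np.mean(d[tau:] * d[: N - tau])
    print(tau, R, 0.9**tau / (1 - 0.81))
```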

INTEGRATED WHITE NOISE AND NONSTATIONARY PROCESSES


Some processes exhibit mean-shifts (whose magnitude and occurrence are
random). Consider the following model:

    y(k) = y(k − 1) + ε(k)

where ε(k) is a white sequence. Such a sequence is called integrated white
noise or sometimes a random walk. Particular realizations under different
distributions of ε(k) are shown below:

[Figure: sample realizations y(k) of a random walk for an asymmetric step
distribution P(ε) (e.g., a 90% / 10% split between steps of opposite sign).]
More generally, many interesting signals will exhibit stationary behavior
combined with randomly occurring mean-shifts. Such signals can be modeled as
shown below.



[Figure: block diagram. A stationary component ε₂(k) filtered through H̃(q⁻¹)
is added to a mean-shift component ε₁(k) passed through the integrator
1/(1 − q⁻¹) to produce y(k).]

As shown above, the combined effects can be expressed as an integrated white
noise colored with a filter H(q⁻¹). Note that while y(k) is nonstationary,
the differenced signal Δy(k) = y(k) − y(k − 1) is stationary.

[Figure: equivalent block diagrams. Passing ε(k) through 1/(1 − q⁻¹) and then
H(q⁻¹) yields y(k); passing ε(k) through H(q⁻¹) alone yields the stationary
differenced signal Δy(k).]
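A short Python sketch (our illustration; the filter is an assumed first-order
example) generates such a signal and confirms that differencing restores
stationarity:

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(3)
N = 5000
eps = rng.standard_normal(N)

colored = signal.lfilter([1.0], [1.0, -0.8], eps)  # H(q^-1) = 1/(1 - 0.8 q^-1)
y = np.cumsum(colored)                             # integrator 1/(1 - q^-1)
dy = np.diff(y)                                    # differencing recovers the
                                                   # stationary colored noise
print(np.var(y[: N // 2]), np.var(y[N // 2 :]))    # grows over time
print(np.var(dy[: N // 2]), np.var(dy[N // 2 :]))  # roughly equal
```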

STOCHASTIC DIFFERENCE EQUATION

Generally, a stochastic process can be modeled through the following
stochastic difference equation:

    x(k + 1) = A x(k) + B ε(k)
    y(k) = C x(k) + D ε(k)                                            (1.111)

where ε(k) is a white vector sequence of zero mean and covariance Rε.


Note that

    E{x(k)} = A E{x(k − 1)} = Aᵏ E{x(0)}
    E{x(k)xᵀ(k)} = A E{x(k − 1)xᵀ(k − 1)} Aᵀ + B Rε Bᵀ                (1.112)

If all the eigenvalues of A are strictly inside the unit disk, the above
approaches a stationary process as k → ∞ since

    lim_{k→∞} E{x(k)} = 0
    lim_{k→∞} E{x(k)xᵀ(k)} = Rx                                       (1.113)

where Rx is a solution to the Lyapunov equation

    Rx = A Rx Aᵀ + B Rε Bᵀ                                            (1.114)
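In Python, the stationary covariance can be obtained either by iterating
(1.112) or directly with SciPy's discrete Lyapunov solver; a brief sketch
with an assumed stable A, B and Rε:

```python
import numpy as np
from scipy import linalg

A = np.array([[0.9, 0.2], [0.0, 0.7]])      # eigenvalues inside the unit disk
B = np.array([[1.0], [0.5]])
R_eps = np.array([[1.0]])                   # covariance of the white noise

Q = B @ R_eps @ B.T
Rx = linalg.solve_discrete_lyapunov(A, Q)   # solves Rx = A Rx A^T + Q

# Cross-check by iterating (1.112) to convergence
P = np.zeros_like(Rx)
for _ in range(2000):
    P = A @ P @ A.T + Q
print(np.allclose(P, Rx))                   # -> True
```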



Since y(k) = C x(k) + D ε(k),

    E{y(k)} = C E{x(k)} + D E{ε(k)} = 0
    E{y(k)yᵀ(k)} = C E{x(k)xᵀ(k)} Cᵀ + D E{ε(k)εᵀ(k)} Dᵀ
                 = C Rx Cᵀ + D Rε Dᵀ                                  (1.115)

The auto-correlation function of y(k) becomes

    Ry(τ) = E{y(k + τ)yᵀ(k)} = { C Rx Cᵀ + D Rε Dᵀ                for τ = 0
                               { C A^τ Rx Cᵀ + C A^{τ−1} B Rε Dᵀ  for τ > 0
                                                                      (1.116)

The spectrum of y is obtained by taking the Fourier transform of Ry(τ) and
can be shown to be

    Φy(ω) = [C(e^{jω}I − A)⁻¹B + D] Rε [C(e^{−jω}I − A)⁻¹B + D]ᵀ      (1.117)
In the case that A contains eigenvalues on or outside the unit circle, the
process is nonstationary as its covariance keeps increasing (see Eqn.
(1.112)). However, it is common to include integrators in A to model
mean-shifting (random-walk-like) behavior. If all the outputs exhibit this
behavior, one can use

    x(k + 1) = A x(k) + B ε(k)
    Δy(k) = C x(k) + D ε(k)                                           (1.118)

Note that, with a stable A, while Δy(k) is a stationary process, y(k)
includes an integrator and therefore is nonstationary.

[Figure: driving a stable system x(k+1) = Ax(k) + Bε(k), y(k) = Cx(k) + Dε(k)
with white noise ε(k) yields a stationary process y(k); appending the
integrator 1/(1 − q⁻¹) to the output of the same stable system yields a
nonstationary (mean-shifting) process y(k).]


Chapter 6

State Estimation
In practice, it is unrealistic to assume that all the disturbances and state
variables can be measured. In general, one must estimate the state from the
measured input / output sequences. This is called state estimation.
Let us bring in the standard state-space system description we introduced
in the previous chapter:
    x(k + 1) = A x(k) + B u(k) + ε1(k)
    y(k) = C x(k) + ε2(k)                                             (6.1)

ε1(k) and ε2(k) are white noise sequences of covariance

    E{ [ε1(k); ε2(k)] [ε1(k); ε2(k)]ᵀ } = [ R1, R12; R12ᵀ, R2 ]       (6.2)

The problem of state estimation is to estimate x(k + i), i ≥ 0, given
{y(j), u(j), j ≤ k} (i.e., inputs and outputs up to the kth sample time).
Estimating x(k + i) for i > 0 is called prediction, while that for i = 0 is
called filtering. Some applications require estimation of x(k + i), i < 0,
and this is referred to as smoothing.
There are many state estimation techniques, ranging from a simple open-loop
estimator to more sophisticated statistically optimal estimators like the
Kalman filter. We examine some popular techniques in this chapter. Since
state estimation is an integral part of model predictive control, intimate
knowledge of state estimation techniques is essential for designing an
effective control system. In addition, these techniques are very useful for
parameter estimation problems, such as those arising in system
identification, discussed in a later chapter.

The main role of state estimation in the context of MPC is to realign the
model state to the process based on the measured signals so that accurate
multi-step predictions of the outputs, measured and possibly unmeasured, can
be made. Because the state estimation literature focuses mostly on
estimation techniques, the importance of disturbance modeling is often
overlooked. Simply adding white noises to the state and output equations of
a deterministic system model, as is sometimes done by those who
misunderstand the meaning of white noise in the standard state-space system
description, can lead to extremely poor results regardless of the technique.
In general, to obtain satisfactory results, disturbances (or their overall
effect on the outputs) should be modeled as appropriate stationary /
non-stationary stochastic processes and the system equations must be
augmented with their describing stochastic equations, as we have shown in
the previous chapter.

6.1 Linear Estimator Structure

We start the discussion with a simple linear state estimator structure for
system (6.1):

    x̂(k|k − 1) = A x̂(k − 1|k − 1) + B u(k − 1)
    x̂(k|k) = x̂(k|k − 1) + K̄ (y(k) − C x̂(k|k − 1))                     (6.3)

In the above, x̂(i|j) denotes an estimate of x(i) based on the measurements
available at time j. The two equations allow us to recursively compute the
filtered estimate x̂(k|k). The first (which we refer to as the model
forwarding equation) corresponds to the simple propagation of the model
state to the next time step without accounting for errors in the estimate
and the effect of new noise. The second (which we refer to as the
measurement update equation) attempts to compensate for these neglected
factors by correcting the estimate on the basis of the term called the
innovation, which is the difference between the actual measurement of the
output and its predicted value from the current state estimate.
In some applications, one may need to compute the one-step-ahead prediction
x̂(k + 1|k) rather than the filtered estimate. For instance, in situations
where control computation requires one full sample period, one needs
x̂(k + 1|k) at time k in order to begin the computation of the control input
u(k + 1). This is not a problem as (6.3) can be implemented also as a
one-step-ahead predictor simply by executing the measurement correction step
first and the model forwarding step afterwards:

    x̂(k|k) = x̂(k|k − 1) + K̄ (y(k) − C x̂(k|k − 1))
    x̂(k + 1|k) = A x̂(k|k) + B u(k)                                    (6.4)


Combining the two steps into a single recursion gives

    x̂(k + 1|k) = A x̂(k|k − 1) + B u(k) + AK̄ {y(k) − C x̂(k|k − 1)}    (6.5)

where we define the predictor gain K = AK̄.

The free parameter in the above is K̄ (or K), which is called the filter (or
predictor) gain matrix. We will use the terms filter and predictor
interchangeably; for example, when the context makes it obvious, we will
sometimes refer to K as the filter gain matrix. The choice of K is critical:
through K, errors in y are translated into errors in x. In general, K should
be chosen so that the estimation error (xe(k + 1) = x(k + 1) − x̂(k + 1|k) or
x̄e(k) = x(k) − x̂(k|k)) is minimized in some sense.

Equations for the error dynamics can be easily derived. For instance, the
equation for the one-step-ahead prediction error is

    xe(k + 1) = (A − KC) xe(k) + ε1(k) − K ε2(k)                      (6.6)

The above can be derived straightforwardly by replacing y(k) in equation
(6.4) with C x(k) + ε2(k) and subtracting the resulting equation from the
state equation of (6.1). The equation for x̄e(k) can be derived similarly as

    x̄e(k + 1) = (A − K̄CA) x̄e(k) + (I − K̄C) ε1(k) − K̄ ε2(k + 1)       (6.7)

6.2 Observer Pole Placement

The theory of state observers is based on the behavior of the estimation
error in a deterministic setting (with ε1 and ε2 uniformly set to zero).
Hence, its immediate concern is the reconstruction of the state sequence for
a deterministic system when the initial condition is unknown. From (6.7), it
is clear that in the deterministic setting the transition matrix A − KC
determines how the estimation error propagates in time. The stability of an
observer is defined in the following way.

Definition 1 Estimator (6.5) for system (6.1) is said to be observer-stable
if xe(k) → 0 as k → ∞ for any xe(0) when ε1(i) = 0 and ε2(i) = 0 for all
i ≥ 0.

Since ε1(k) and ε2(k) are uniformly zero, (6.6) becomes

    xe(k + 1) = (A − KC) xe(k)                                        (6.8)

and it is clear that, for observer stability, all the eigenvalues of
(A − KC) must lie strictly inside the unit disk. In addition, the
eigenvalues of (A − K̄CA) are the same as those of (A − AK̄C) (since the
eigenvalues of AB and BA are the same for square A and B), and hence the
stability of the one-step-ahead predictor (6.5) implies the stability of the
filter (6.3) and vice versa.
The eigenvalues of A KC are called observer poles and determining K on
the basis of pre-specified observer pole locations is called pole placement. It can


be shown that, if (C, A) is an observable pair, the observer poles can be placed
at arbitrary locations through K. The proof is left as a homework exercise.
The conventional pole placement technique requires the system model to be
transformed into certain canonical forms (such as the observer form) so that
the observer eigenvalues can be related to the parameters in K in a transparent
manner.
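In practice the canonical-form route is rarely taken by hand. A Python sketch
(our toy system) uses the duality between observer and state-feedback design:
placing the eigenvalues of A − KC is the same problem as placing those of
Aᵀ − CᵀKᵀ.

```python
import numpy as np
from scipy import signal

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed example: a double integrator
C = np.array([[1.0, 0.0]])               # only the first state is measured

desired_poles = [0.6, 0.5]               # chosen observer pole locations
res = signal.place_poles(A.T, C.T, desired_poles)
K = res.gain_matrix.T                    # observer gain, by duality

print(np.linalg.eigvals(A - K @ C))      # -> approximately [0.6, 0.5]
```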
Unfortunately, pole placement has seen very little use in process control. A
major drawback of pole placement is that, in general, it is not clear where the
observer poles should be placed for good estimation performance for a given
situation. An immediate factor for determining the observer pole locations is
the trade-off between the speed of convergence and noise filtering. This is a
problem that can be solved relatively easily and satisfactorily through some
tuning of the pole locations.
A more subtle and difficult problem is that pole location is not the sole
factor that determines the speed of recovery from errors. For instance, even
dead-beat observers, which have all their poles at the origin, can be very
slow in recovering from certain types of errors. This is particularly true
when the state dimension is high in relation to the output dimension. As a
simple example, consider the case where a deadbeat observer results in the
following error transition equation:

    xe(k + 1) = [e1 e2 ⋯ en] [ 0 1 0 ⋯ 0 ]
                             [ 0 0 1 ⋯ 0 ]
                             [ ⋮       ⋱ ⋮ ]  [e1 e2 ⋯ en]⁻¹ xe(k)    (6.9)
                             [ 0 0 0 ⋯ 1 ]
                             [ 0 0 0 ⋯ 0 ]

where the product of the three matrices is the error transition matrix
(A − KC) and e1, …, en are its generalized eigenvectors.
Even though all the observer poles are placed at the origin, it still takes
n sample steps to reject an error in the direction of the last eigenvector
en. On the other hand, this could be the mode that is most vulnerable to
disturbances, in which case the deadbeat observer would perform very poorly.
For efficient estimation, it is helpful to incorporate into the estimator design
some statistical information on how disturbances affect the state. For instance,
even though the state may be of very high dimension, the number of modes
that are affected by disturbances may be much lower and priorities can be
given to attribute errors to those modes. These considerations motivate the
development of a statistically optimal estimator.
Example 6.1 NIKET: Demonstrate the above point! Compare the performance of a deadbeat observer designed through pole placement with a Kalman
filter. Make n fairly large. Note that one can always choose the primary
disturbance direction such that the deadbeat observer would perform poorly.


6.3 Kalman Filter

The filter gain matrix can also be determined to be optimal in some statistical
sense. For example, it can be chosen to minimize the variance of the estimation
error. The resulting estimator is the celebrated Kalman filter, which has been
the most popular state estimation technique by far. We present a derivation
for the simple case (when R12 = 0) and discuss some properties.
For simplicity of discussion, let us assume for now that ε1 and ε2 are
mutually independent with R1 ≥ 0 and R2 > 0. Recall that the linear
estimator of (6.4) can be written in the following one-step-ahead predictor
form:

    x̂(k + 1|k) = A x̂(k|k − 1) + B u(k) + K(k){y(k) − C x̂(k|k − 1)}   (6.10)

In the above, we allowed the filter gain matrix to vary with time for more
generality.

6.3.1 Derivation of the Optimal Filter Gain Matrix

Recall that the error dynamics for xe(k) = x(k) − x̂(k|k − 1) are given by

    xe(k + 1) = (A − K(k)C) xe(k) + ε1(k) − K(k) ε2(k)                (6.11)

Let

    P(k) = Cov{xe(k)} = E{(xe(k) − E{xe(k)})(xe(k) − E{xe(k)})ᵀ}      (6.12)

Let us assume that the initial guess is chosen so that E{xe(0)} = 0. Then,
by applying the rules for the expectation operator given in Appendix D
to (6.11), we obtain E{xe(k)} = 0 for all k ≥ 0 and

    P(k + 1) = E{xe(k + 1) xeᵀ(k + 1)}
             = (A − K(k)C) P(k) (A − K(k)C)ᵀ + R1 + K(k) R2 Kᵀ(k)     (6.13)

In the above, we used the fact that xe(k), ε1(k) and ε2(k) in (6.11) are
mutually independent.

Now let us choose K(k) such that νᵀ P(k + 1) ν is minimized for an arbitrary
choice of ν. It is straightforward algebra to show that

    νᵀ P(k + 1) ν = νᵀ [ A P(k) Aᵀ + R1 − K(k) C P(k) Aᵀ
                    − A P(k) Cᵀ Kᵀ(k) + K(k)(R2 + C P(k) Cᵀ) Kᵀ(k) ] ν


Completing the square on the terms involving K(k), we obtain

    νᵀ P(k + 1) ν = νᵀ { [K(k) − A P(k) Cᵀ (R2 + C P(k) Cᵀ)⁻¹] (R2 + C P(k) Cᵀ)
                      × [K(k) − A P(k) Cᵀ (R2 + C P(k) Cᵀ)⁻¹]ᵀ
                    + A P(k) Aᵀ + R1
                    − A P(k) Cᵀ (R2 + C P(k) Cᵀ)⁻¹ C P(k) Aᵀ } ν      (6.14)

Hence, the K(k) minimizing the above is

    K(k) = A P(k) Cᵀ (R2 + C P(k) Cᵀ)⁻¹                               (6.15)

Note that the solution is independent of the choice of ν. In addition, under
this choice of K(k),

    P(k + 1) = A P(k) Aᵀ + R1 − A P(k) Cᵀ (R2 + C P(k) Cᵀ)⁻¹ C P(k) Aᵀ
                                                                      (6.16)
Given x̂(1|0) and P(1), equations (6.15) and (6.16) can be used along with
(6.5) to recursively compute x̂(k + 1|k). They are referred to as the
time-varying Kalman filter equations. Matrix recursion formula (6.16) is
called the Riccati Difference Equation (RDE).

Suppose P(k) in the RDE converges to a steady-state solution P̄ as k → ∞.
Then, we may consider the possibility of sacrificing the optimality during
transient periods and implementing the filter with the constant filter gain
matrix given by

    K = A P̄ Cᵀ (R2 + C P̄ Cᵀ)⁻¹                                       (6.17)

P̄ can be obtained by either iterating on the RDE or by finding a positive
semi-definite solution of the following Algebraic Riccati Equation (ARE):

    P̄ = A P̄ Aᵀ + R1 − A P̄ Cᵀ (R2 + C P̄ Cᵀ)⁻¹ C P̄ Aᵀ                  (6.18)

This we refer to as the steady-state Kalman filter as opposed to the
time-varying Kalman filter. Properties of the ARE and the steady-state
Kalman filter will be discussed later.
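A compact Python sketch (our toy system) iterates the RDE (6.16) to steady
state and checks the result against the ARE (6.18) solved with SciPy; note
that solve_discrete_are is stated for the dual control problem, hence the
transposes.

```python
import numpy as np
from scipy import linalg

A = np.array([[1.0, 0.1], [0.0, 0.95]])
C = np.array([[1.0, 0.0]])
R1 = 0.01 * np.eye(2)                  # state noise covariance
R2 = np.array([[0.1]])                 # measurement noise covariance

# Iterate the Riccati Difference Equation (6.16)
P = np.eye(2)
for _ in range(500):
    S = R2 + C @ P @ C.T
    P = A @ P @ A.T + R1 - A @ P @ C.T @ np.linalg.solve(S, C @ P @ A.T)

# Steady-state solution directly from the ARE (6.18)
P_bar = linalg.solve_discrete_are(A.T, C.T, R1, R2)
K = A @ P_bar @ C.T @ np.linalg.inv(R2 + C @ P_bar @ C.T)   # gain (6.17)

print(np.allclose(P, P_bar))           # -> True
print(np.linalg.eigvals(A - K @ C))    # inside the unit circle: stable
```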
Remarks:

Recall the relationship between the one-step-ahead predictor gain K(k) and
the filter gain K̄(k) (K(k) = A K̄(k)). Hence, for optimal filtering, we can
choose

    K̄(k) = P(k) Cᵀ (R2 + C P(k) Cᵀ)⁻¹                                 (6.19)

This gain matrix should be used to implement the optimal filter in the form
of (6.3) that recursively computes x̂(k|k) rather than x̂(k + 1|k).


The assumption of independence between ε1(k) and ε2(k) may not always be
met. The derivation we have given above can be modified in a straightforward
manner for this case (see Exercise ??? at the end of this chapter).

In the above derivation, we imposed the linear estimator structure and
optimized the filter gain matrix based on some cost index. We can also show
that the Kalman filter is the optimal estimator in the sense of minimizing
the conditional expectation of the squared error, provided that ε1 and ε2
are Gaussian noises in addition to being white noises.
Example 6.2 NIKET: Apply the Kalman filter to the same problem for which
pole placement didn't work very well.
Show the importance of disturbance modeling. Show the performance obtained
by just adding white noise to a deterministic model when integrated white
noise disturbances enter.

6.3.2 Stability of Kalman Filter

In Section 6.3.1, we showed that the time-varying optimal gain matrix of a
linear filter is given recursively by the RDE of (6.16). If P(k) converges
upon iteration, the steady-state solution is a positive semi-definite
solution of ARE (6.18). The steady-state solution may be directly
implemented if one is willing to sacrifice the optimality during the
transient period. Before implementing the steady-state solution, however, we
must make sure that the steady-state Kalman filter is stable, that is, that
it leads to a stable error transition matrix A − KC.

To begin the discussion of the stability property of the Kalman filter, it
is useful to introduce the following definitions:

Definition 2 A positive semi-definite solution to ARE (6.18) is called a
stabilizing solution if all the eigenvalues of the corresponding error
transition matrix A − KC (where K is defined according to (6.17)) lie
strictly inside the unit circle.

Definition 3 A positive semi-definite solution to ARE (6.18) is called a
strong solution if all the eigenvalues of the corresponding error transition
matrix A − KC lie inside or on the unit circle.
It is worth asking the following three questions about (6.16):

1. When does P(k) converge to a steady-state solution P̄?

2. If it does converge, when is the steady-state solution a stabilizing
solution of (6.18)?

3. If it does converge to a stabilizing solution, when is the stabilizing
solution the unique positive semi-definite solution to ARE (6.18)?

The first question, whether P(k) converges, tells us when we should expect
the ARE to have a meaningful (positive semi-definite) solution. The answer
is:

    P(k) in RDE (6.16) converges to P̄ regardless of P(0) if (C, A)
    is detectable.

When the system is detectable, the increase in the covariance caused by the
new disturbance and possibly unstable dynamics and the decrease achieved
through the new measurements eventually balance each other out. The higher
the covariance, the larger the reduction.
The second question, whether the steady-state solution is a stabilizing
solution, is important as stability is a minimum requirement and optimality
does not necessarily imply stability. The following are some useful answers:

    If (A, R1^{1/2}) is stabilizable and P(0) ≥ 0, then P(k) → P̄ as k → ∞,
    which is a stabilizing solution of the ARE.

    If (A, R1^{1/2}) has no uncontrollable mode on the unit circle and
    P(0) > 0, then P(k) → P̄ as k → ∞, which is a stabilizing solution of
    the ARE.

    If (A, R1^{1/2}) has uncontrollable mode(s) on the unit circle, P̄ is a
    strong solution but not a stabilizing solution of the ARE.
The above statements can be proven algebraically, but we will instead opt
for some qualitative arguments based on the expected asymptotic behavior of
the covariance matrix. First, it is important to note the obvious fact that,
by virtue of (C, A) being detectable, (A − KC) can indeed be made stable
with proper choices of K. The reason why the optimal filter may not achieve
stability despite the optimality is that the estimation problem can be
defined such that it is not necessary for the filter to be stable to achieve
optimality. As a trivial example, if the covariance of the state noise is
set to zero and the initial covariance P(0) also to zero, the steady-state
solution is trivially a null matrix, which would result in an open-loop
observer. Such an observer would lead to instability if the observed system
is unstable.

The stabilizability of (A, R1^{1/2}) in the first statement implies that all
the unstable modes of A are independently excited by the external noise.
This means that the covariance for these modes will grow without bound if
the filter is not stabilizing. By virtue of optimality, the optimal filter
has to contain the growth if capable, and this means the optimal filter is
necessarily stable.


The second statement says that we can have exponentially unstable modes that
are not excited by the external noise but the optimal filter still achieves
observer stability. The stronger requirement of P(0) > 0 ensures that errors
in these modes will grow exponentially fast if left alone. For optimality,
the growth and reduction in the covariance (for the unstable modes) have to
reach a steady state, and the balance point is, in fact, a nonzero
covariance. This means that the optimal filter has to be stable.

The reason why uncontrollable modes on the unit circle cause problems, as
can be seen in the third statement, is that, unlike errors in the
exponentially unstable modes, errors in these modes are eventually reduced
to zero by the time-varying Kalman filter. Hence, at steady state, the
covariance for these modes drops to zero and the optimal filter no longer
provides any correction to these modes, thereby leaving these modes
unstabilized.
The third question, whether the steady-state solution is the unique positive
semi-definite solution to the ARE, is relevant because one may wish to find
the steady-state solution directly by solving the ARE rather than iterating
on the RDE. If more than one positive semi-definite solution exists, one may
very well end up with a non-stabilizing solution even though a stabilizing
solution exists. The answer is:

    If (A, R1^{1/2}) is stabilizable, the stabilizing solution is the only
    positive semi-definite solution of ARE (6.18). If (A, R1^{1/2}) has
    uncontrollable mode(s) outside the unit circle, there exists at least
    one more positive semi-definite solution besides the stabilizing
    solution.

Again, we provide here some intuitive arguments rather than a formal
mathematical proof. If (A, R1^{1/2}) has uncontrollable modes outside the
unit circle, starting with an initial condition of zero in one or more of
these modes in the iteration of the RDE will result in a solution that is
positive semi-definite, but not stabilizing.
Example 6.3 NIKET: Demonstrate these using simple examples. Look at
the steady-state covariance matrix for a system with uncontrollable modes outside the unit circle vs. with those on the unit circle.

6.3.3 Extensions

Time-Varying System
Consider the time-varying system

    x(k + 1) = A(k) x(k) + B(k) u(k) + ε1(k)
    y(k) = C(k) x(k) + ε2(k)                                          (6.20)


where ε1(k) and ε2(k) are zero-mean white noises of covariances R1(k) and
R2(k). The system, for instance, may serve as an approximation to some
nonlinear system evolving along a trajectory. Generalizing the Kalman filter
equations to the above system can be done in a straightforward manner.
Recall that we already derived the equations for the optimal time-varying
estimator for the linear time-invariant system. The fact that the system
dynamics vary with k does not cause any further complication, and one gets
the same form of the estimator equation and update formula:

    x̂(k + 1|k) = A(k) x̂(k|k − 1) + B(k) u(k) + K(k){y(k) − C(k) x̂(k|k − 1)}
                                                                      (6.21)
where

    K(k) = A(k) P(k) Cᵀ(k) (R2(k) + C(k) P(k) Cᵀ(k))⁻¹                (6.22)

and

    P(k + 1) = A(k) P(k) Aᵀ(k) + R1(k)
             − A(k) P(k) Cᵀ(k) (R2(k) + C(k) P(k) Cᵀ(k))⁻¹ C(k) P(k) Aᵀ(k)
                                                                      (6.23)
In this case, however, the covariance matrix and the gain matrix do not converge to steady-state values in general.
Periodically Time-Varying System and Multi-Rate Kalman Filter
Let us consider the system

    xk(t + 1) = A(t) xk(t) + B(t) uk(t) + ε1,k(t)
    yk(t) = C(t) xk(t) + ε2,k(t),    t = 0, …, N − 1
    xk+1(0) = xk(N)                                                   (6.24)

The above represents a periodic linear system of period N; k is the run
index and t is the time index within a period.

Such periodic systems are common in practice. Batch systems that evolve
along some pre-specified trajectories can be modeled as such. Cyclic
operation of a continuous process also results in periodically varying
dynamics. Finally, a multi-rate sampled-data system (whose sample rates are
integer multiples of some basic sampling unit) is a periodic system, the
period of which is given by the least common multiple of all the sampling
periods. In this case, only the C matrix is periodically varying.
The Kalman filter for the above can be derived straightforwardly from the
time-varying Kalman filter equations and is represented by the equations
below:

    x̂k(t + 1|t) = A(t) x̂k(t|t − 1) + B(t) uk(t)
                  + Kk(t){yk(t) − C(t) x̂k(t|t − 1)},   t = 0, …, N − 1
    x̂k+1(0|−1) = x̂k(N|N − 1)                                          (6.25)

where

    Kk(t) = A(t) Pk(t) Cᵀ(t) (R2(t) + C(t) Pk(t) Cᵀ(t))⁻¹             (6.26)

and

    Pk(t + 1) = A(t) Pk(t) Aᵀ(t) + R1(t)
              − A(t) Pk(t) Cᵀ(t) (R2(t) + C(t) Pk(t) Cᵀ(t))⁻¹ C(t) Pk(t) Aᵀ(t),
                t = 0, …, N − 1
    Pk+1(0) = Pk(N)                                                   (6.27)
Due to the periodic nature, for detectable systems, as k → ∞,

    Kk(t) → K∞(t)
    Pk(t) → P∞(t)
Hence, at steady state, the time-varying solution converges to a periodic
solution. As in the case of linear time-invariant systems, if one is willing
to accept some performance loss during the initial transient period, one can
implement the periodic solution rather than the time-varying solution. This
offers a computational advantage as the periodic filter gains can be
computed off-line and stored.
Periodic solutions {P∞(t)} and {K∞(t)} can be computed in two different
ways.

1. One can iterate on the RDE (6.27) until it converges to a periodic
solution. With {P∞(t)}, one can obtain {K∞(t)} according to (6.26). A sketch
of this approach is given after this list.

2. The periodic system can be expressed equivalently as the following lifted
system:

    xk+1(0) = Φ xk(0) + Γ [uk(0); uk(1); …; uk(N − 1)]
    [yk(0); yk(1); …; yk(N − 1)] = Λ xk(0) + D [uk(0); uk(1); …; uk(N − 1)]
                                                                      (6.28)

Derivation of the exact forms of Φ, Γ, Λ, and D is left as an exercise to
the reader. One can then treat the lifted system as a linear time-invariant
system and find P∞(0) by solving the respective ARE. With P∞(0) found, one
can proceed to calculate P∞(1), …, P∞(N − 1) using the recursion formula of
(6.27).
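A sketch of the first approach in Python (an assumed period-2 multi-rate
example of ours, where only C varies): iterate (6.27) across runs until the
period-start covariance stops changing.

```python
import numpy as np

A = [np.array([[0.9, 0.1], [0.0, 0.8]])] * 2
C = [np.array([[1.0, 0.0]]), np.array([[0.0, 0.0]])]  # sensor active at t=0 only
R1 = [0.01 * np.eye(2)] * 2
R2 = [np.array([[0.1]])] * 2
N = 2

P = np.eye(2)
for run in range(200):                   # index k: iterate over periods
    P_start = P.copy()
    for t in range(N):                   # recursion (6.27) within one period
        S = R2[t] + C[t] @ P @ C[t].T
        K = A[t] @ P @ C[t].T @ np.linalg.inv(S)
        P = A[t] @ P @ A[t].T + R1[t] - K @ S @ K.T
    if np.allclose(P, P_start, atol=1e-12):
        break                            # converged to the periodic solution
```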


6.4 Least Squares Formulation of State Estimation

6.4.1 Batch Least Squares Formulation

For state estimation of system (6.1), one can consider solving the following
least squares problem at each time:

    Jk = min_{xe(1), ε1(i), ε2(i)} [ xeᵀ(1) Q0(1) xe(1)
         + Σ_{i=1}^{k} ( ε1ᵀ(i) Q1 ε1(i) + ε2ᵀ(i) Q2 ε2(i) ) ]

subject to

    xe(1) = x(1) − x̂(1|0)
    ε1(i) = x(i + 1) − A x(i) − B u(i)
    ε2(i) = y(i) − C x(i),    i = 1, …, k                             (6.29)

The first term xe(1) represents the error in the initial state estimate. The
second term ε1(i) represents the error in the state transition as predicted
by x(i + 1) = A x(i) + B u(i). Finally, ε2(i) represents the error in the
output prediction given by y(i) = C x(i). Since we are penalizing these
errors in the objective function, the weighting matrices should reflect our
relative confidence in the initial estimate, the state transition equation
and the output prediction equation: the more confident we are, the higher
the weighting. Use of other norms in the objective function is always
possible, but the 2-norm formulation gives some advantages in terms of
implementation and will also turn out to be the most convenient choice for
our discussion.
The notation Jk will be used to denote the least squares problem as well as
the optimal cost. With the solutions to Jk (denoted hereafter by x̂e(1|k),
ε̂1(i|k), ε̂2(i|k) for consistency with previous notations), we can construct
a smoothed estimate for the entire state sequence:

    x̂(1|k) = x̂(1|0) + x̂e(1|k)
    x̂(i + 1|k) = A x̂(i|k) + B u(i) + ε̂1(i|k),    i = 1, …, k          (6.30)

Solving Jk repeatedly as time k progresses gives a sequence of
one-step-ahead predictions of the state x̂(k + 1|k), just as does the Kalman
filter.
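For the unconstrained case, (6.29) is an ordinary linear least squares
problem in the stacked unknowns (x(1), …, x(k+1)). A minimal Python sketch
(our own construction; the function name and interface are assumptions)
builds the stacked residual system with square-root weights and solves it in
one shot:

```python
import numpy as np

def batch_smooth(A, B, C, u, y, x1_prior, Q0, Q1, Q2):
    """Solve (6.29) for the stacked states x(1), ..., x(k+1)."""
    n, k = A.shape[0], len(y)
    nz = n * (k + 1)

    def block(i):                        # selector matrix for x(i), i = 1..k+1
        S = np.zeros((n, nz)); S[:, (i - 1) * n : i * n] = np.eye(n); return S

    L0, L1, L2 = (np.linalg.cholesky(Q) for Q in (Q0, Q1, Q2))
    rows = [L0.T @ block(1)]             # weighted initial-estimate residual
    rhs = [L0.T @ x1_prior]
    for i in range(1, k + 1):
        rows.append(L1.T @ (block(i + 1) - A @ block(i)))  # transition residual
        rhs.append(L1.T @ (B @ u[i - 1]))
        rows.append(L2.T @ (C @ block(i)))                 # output residual
        rhs.append(L2.T @ y[i - 1])

    z, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return z.reshape(k + 1, n)           # smoothed sequence x̂(1|k), ..., x̂(k+1|k)
```

With Q0(1) = P⁻¹(1), Q1 = R1⁻¹ and Q2 = R2⁻¹, the last row of the result
matches the Kalman filter prediction, per the equivalence established in
Section 6.4.2.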
The advantages of the optimization formulation are as follows:
Constraints can be added to the estimation. For example, one may know
that the state must lie within a certain set:

    x(i) ∈ X                                                          (6.31)

In addition, one can also add constraints on the estimated error sequence:

    ε1(i) ∈ E1  and  ε2(i) ∈ E2                                       (6.32)

Incorporating prior knowledge about the estimated variables in the form


of constraints can improve the efficiency and robustness of the estimation.
The same conceptual framework can be applied to nonlinear systems.
For the nonlinear system

    x(k + 1) = f(x(k), u(k), ε1(k))
    y(k) = g(x(k)) + ε2(k),                                           (6.33)

one would end up with a nonlinear least squares problem, which is
computationally more difficult to solve.
Note that, even though the method is cast in a deterministic least squares
setting, the design is not made any easier as one must still choose the weighting
matrices. The problem of specifying the weighting matrices is essentially the
same as that of specifying the covariance matrices in the Kalman filter design.
In fact, we will be able to relate the weighting matrices to the noise covariance
matrices explicitly when we later establish an equivalence between the least
squares estimation method and the Kalman filter.
Of course, the above-mentioned advantages do not come for free; in order
to realize the benefits, one must be willing to accept higher computational
cost.
At first glance, the computational effort for the least squares problem seems
much higher than the Kalman filter, even in the unconstrained linear case; in
fact, it appears to grow unbounded with time. For this case, however, we
will show that we can use dynamic programming to derive a recursive formula
for x̂(k + 1|k) for the above least squares problem. This, in fact, reduces
solving the least squares problem to the Kalman filter calculation. With this
approach, however, one loses the advantage of having access to a smoothed
state sequence. To retain the advantage, one can employ a fixed-size moving
window, within which the least squares estimation is performed. One can add
an appropriate penalty term on the initial state so that the solution within the
window obtained in this manner matches that of the full problem exactly. This
does translate into a slight increase in computation and storage. The estimate
of the initial point and the associated weighting matrix can be adjusted such
that solutions coincide with those for the full batch problem.
For the constrained linear problem or the nonlinear problem, the dynamic
programming approach does not work as the optimization does not yield an
analytical solution. The only option is to employ a moving window and solve
the constrained and/or nonlinear least squares problem directly within the
window. This, of course, means a significant leap in the computational requirement since mathematical programming techniques must be employed for
solution. In addition, there is no way, in general, to choose the penalty term


on the initial state such that the estimate obtained with the moving window
coincides exactly with that of the full problem.
The use of a moving estimation window and the related issues will be
discussed in more detail later on.

6.4.2 Recursive Solution to the Unconstrained Linear Problem and Equivalence
with the Kalman Filter

In this section, we show that the sequence x̂(k + 1|k) produced by solving
(6.29) can be obtained recursively through the equations

    x̂(k + 1|k) = A x̂(k|k − 1)
                 + A Q0⁻¹(k) Cᵀ (C Q0⁻¹(k) Cᵀ + Q2⁻¹)⁻¹ (y(k) − C x̂(k|k − 1))
                                                                      (6.34)
and

    Q0⁻¹(k + 1) = A Q0⁻¹(k) Aᵀ
                − A Q0⁻¹(k) Cᵀ (C Q0⁻¹(k) Cᵀ + Q2⁻¹)⁻¹ C Q0⁻¹(k) Aᵀ + Q1⁻¹
                                                                      (6.35)

Note that the above equations are equivalent to those for the Kalman filter
if we set Q0(k) = P⁻¹(k), Q1 = R1⁻¹, and Q2 = R2⁻¹. This implies that
x̂(k + 1|k) constructed from the least squares estimation is the same as the
estimate from the Kalman filter, given that the weighting matrices Q0(1), Q1
and Q2 are chosen as the inverses of the corresponding covariance matrices
P(1), R1 and R2, respectively.
The rest of Section 6.4.2 will be devoted to deriving the above result. The
derivation is based on dynamic programming and the ensuing algebra is
somewhat involved. Readers who are not interested in it may skip the rest of
the section without loss of continuity.

For notational convenience, let us ignore the term involving the
deterministic input u. Since these are known terms, dropping them does not
affect our analysis in any way.
Dynamic Programming and Arrival Cost
We now consider the possibility of calculating the estimate x̂(k + 1|k)
recursively by solving the least squares problem of (6.29) via dynamic
programming. First, note that (6.29) can be re-formulated as

    Jk = min_{x(1), x(2), …, x(k+1)} [ (x(1) − x̂(1|0))ᵀ Q0(1) (x(1) − x̂(1|0))
         + Σ_{i=1}^{k} (x(i + 1) − A x(i))ᵀ Q1 (x(i + 1) − A x(i))
                      + (y(i) − C x(i))ᵀ Q2 (y(i) − C x(i)) ]         (6.36)

Definition 4 φk(z) = Jk with the constraint x(k + 1) = z is called the
k-step arrival cost for z. This value function of z represents the minimum
cost incurred in solving Jk with the additional arrival constraint
x(k + 1) = z.

According to the above definition,

    φk(x(k + 1)) = min_{x(1), …, x(k)} [ (x(1) − x̂(1|0))ᵀ Q0(1) (x(1) − x̂(1|0))
        + Σ_{i=1}^{k} (x(i + 1) − A x(i))ᵀ Q1 (x(i + 1) − A x(i))
                     + (y(i) − C x(i))ᵀ Q2 (y(i) − C x(i)) ]          (6.37)

The above definition is useful in constructing a recursive solution for Jk
by stage-wise calculation. Let us describe the basic approach first.
Consider the problem

    φ1(x(2)) = min_{x(1)} [ (x(1) − x̂(1|0))ᵀ Q0(1) (x(1) − x̂(1|0))
               + (x(2) − A x(1))ᵀ Q1 (x(2) − A x(1))
               + (y(1) − C x(1))ᵀ Q2 (y(1) − C x(1)) ]                (6.38)

φ1(x(2)) is the arrival cost for x(2); it represents the minimum cost
incurred to arrive at a given x(2). Note that we can now rewrite (6.36) as

    Jk = min_{x(2), …, x(k+1)} [ φ1(x(2))
         + Σ_{i=2}^{k} (x(i + 1) − A x(i))ᵀ Q1 (x(i + 1) − A x(i))
                      + (y(i) − C x(i))ᵀ Q2 (y(i) − C x(i)) ]         (6.39)

This is because, for a fixed value of x(2), the rest of the cost (i.e., the
terms other than φ1(x(2)) in the objective function) does not depend on
x(1). At the next stage, we can consider solving

    φ2(x(3)) = min_{x(2)} [ φ1(x(2)) + (x(3) − A x(2))ᵀ Q1 (x(3) − A x(2))
               + (y(2) − C x(2))ᵀ Q2 (y(2) − C x(2)) ]                (6.40)

Carrying out this idea repeatedly, we eventually arrive at

    φk(x(k + 1)) = min_{x(k)} [ φ_{k−1}(x(k))
                   + (x(k + 1) − A x(k))ᵀ Q1 (x(k + 1) − A x(k))
                   + (y(k) − C x(k))ᵀ Q2 (y(k) − C x(k)) ]            (6.41)

Then, we can solve

    Jk = min_{x(k+1)} φk(x(k + 1))                                    (6.42)

to obtain x̂(k + 1|k). The question we have at this point is whether we can
derive a recursive equation for φi(x(i + 1)) at each stage. With this, we
should also be able to generate x̂(i + 1|i) in a recursive manner. The answer
is affirmative, as we will show next.
Recursive Calculation of the Arrival Cost and One-Step-Ahead Prediction
It is useful to introduce the following lemma:


Lemma 1 Let x* be the solution to the least squares problem

    min_x [ xᵀ H x − 2gᵀ x ]                                          (6.43)

Then,

    x* = H⁻¹ g                                                        (6.44)

and

    xᵀ H x − 2gᵀ x = (x − x*)ᵀ H (x − x*) − gᵀ H⁻¹ g
                   = (x − x*)ᵀ H (x − x*) + [xᵀ H x − 2gᵀ x]_{x=x*}   (6.45)

(6.44) can be obtained by setting the derivative of the objective function
with respect to x to zero. The first equation of (6.45) can be proved by
substituting (6.44) for x* in the right-hand side of the equation and
showing that it reduces to the left-hand side. The second equation can be
proved by substituting (6.44) into the objective function and showing that
it reduces to −gᵀ H⁻¹ g.
Let us now show that a closed-form expression for the arrival cost can be
derived for the first stage. For the first stage, we have to solve

    φ1(x(2)) = min_{x(1)} [ (x(1) − x̂(1|0))ᵀ Q0(1) (x(1) − x̂(1|0))
               + (x(2) − A x(1))ᵀ Q1 (x(2) − A x(1))
               + (y(1) − C x(1))ᵀ Q2 (y(1) − C x(1)) ]                (6.46)

We first write the above in the standard least squares form:

    φ1(x(2)) = min_{x(1)} [ xᵀ(1) (Q0(1) + CᵀQ2C + AᵀQ1A) x(1)
               − 2xᵀ(1) (Q0(1) x̂(1|0) + CᵀQ2 y(1) + AᵀQ1 x(2)) + xᵀ(2) Q1 x(2)
               + terms involving only y(1) and x̂(1|0) ]               (6.48)

Using Lemma 1,

    φ1(x(2)) = −(Q0(1) x̂(1|0) + CᵀQ2 y(1) + AᵀQ1 x(2))ᵀ (Q0(1) + CᵀQ2C + AᵀQ1A)⁻¹
                (Q0(1) x̂(1|0) + CᵀQ2 y(1) + AᵀQ1 x(2)) + xᵀ(2) Q1 x(2)
               + terms involving only y(1) and x̂(1|0)                 (6.49)

It is useful to rewrite the above as a quadratic form in x(2):

    φ1(x(2)) = xᵀ(2) [ Q1 − Q1 A (Q0(1) + CᵀQ2C + AᵀQ1A)⁻¹ AᵀQ1 ] x(2)
               − 2xᵀ(2) Q1 A (Q0(1) + CᵀQ2C + AᵀQ1A)⁻¹ (Q0(1) x̂(1|0) + CᵀQ2 y(1))
               + terms independent of x(2)                            (6.50)

Let

    x̂(2|1) = arg min_{x(2)} φ1(x(2))                                  (6.51)


This is consistent with the notation we have been using previously. Then,
from Lemma 1,

    x̂(2|1) = [ Q1 − Q1 A (Q0(1) + CᵀQ2C + AᵀQ1A)⁻¹ AᵀQ1 ]⁻¹
             Q1 A (Q0(1) + CᵀQ2C + AᵀQ1A)⁻¹ (Q0(1) x̂(1|0) + CᵀQ2 y(1)) (6.52)

and

    φ1(x(2)) = (x(2) − x̂(2|1))ᵀ [ Q1 − Q1 A (Q0(1) + CᵀQ2C + AᵀQ1A)⁻¹ AᵀQ1 ]
               (x(2) − x̂(2|1)) + φ1(x̂(2|1))
               + other terms independent of x(2)                      (6.53)

where the bracketed matrix in (6.53) is denoted Q0(2).

At the next stage, the problem we have to solve is

    φ2(x(3)) = min_{x(2)} [ (x(2) − x̂(2|1))ᵀ Q0(2) (x(2) − x̂(2|1))
               + (x(3) − A x(2))ᵀ Q1 (x(3) − A x(2))
               + (y(2) − C x(2))ᵀ Q2 (y(2) − C x(2))
               + φ1(x̂(2|1)) ] + other terms independent of x(2)       (6.54)

Except for the constant term φ1(x̂(2|1)), this is of the same form as
φ1(x(2)) and one can use the same formula. Continuing on with this, one gets
the following recursion formula:

    φj(x(j + 1)) = (x(j + 1) − x̂(j + 1|j))ᵀ Q0(j + 1) (x(j + 1) − x̂(j + 1|j))
                   + φj(x̂(j + 1|j)) + other terms independent of x(j + 1),
                   for j = 1, …, k                                    (6.55)

where

    x̂(j + 1|j) = [ Q1 − Q1 A (Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹ AᵀQ1 ]⁻¹
                 Q1 A (Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹ (Q0(j) x̂(j|j − 1) + CᵀQ2 y(j))
                                                                      (6.56)
and

    Q0(j + 1) = Q1 − Q1 A (Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹ AᵀQ1              (6.57)

Remarks:

Eqns. (6.56)–(6.57) represent a way to construct x̂(j + 1|j) recursively.

The expression for the arrival cost φj(x(j + 1)) includes the constant term
φj(x̂(j + 1|j)). Even though this term is carried over to the next stage in
the expression for φ_{j+1}(x(j + 2)), it does not affect the optimal
solution x̂(j + 2|j + 1) and therefore can be ignored.


Equivalence with the Kalman Filter


Now we show that (6.56) and (6.57) can be written as (6.34)(6.35), thus
establishing the equivalence with the Kalman filter.
To show the equivalence, the following lemma, known as the Matrix Inversion
Lemma, will be useful.

Lemma 2 Let A, C, and C⁻¹ + D A⁻¹ B be nonsingular square matrices. Then

    (A + BCD)⁻¹ = A⁻¹ − A⁻¹ B (C⁻¹ + D A⁻¹ B)⁻¹ D A⁻¹                 (6.58)

The above lemma can be easily proved by multiplying both sides of the
equation by (A + BCD) and showing that the right-hand side indeed reduces to
the identity. The proof is left as an exercise.
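A quick numerical sanity check of (6.58) in Python (random, well-conditioned
matrices of our choosing):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 4, 2
A = rng.standard_normal((n, n)) + 5 * np.eye(n)  # kept well-conditioned
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, m)) + 5 * np.eye(m)
D = rng.standard_normal((m, n))

lhs = np.linalg.inv(A + B @ C @ D)
Ai = np.linalg.inv(A)
rhs = Ai - Ai @ B @ np.linalg.inv(np.linalg.inv(C) + D @ Ai @ B) @ D @ Ai
print(np.allclose(lhs, rhs))   # -> True
```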
Let us first show the equivalence between (6.57) and (6.35). For this, we
first invert both sides of (6.57) to obtain

    Q0⁻¹(j + 1) = [ Q1 − Q1 A (Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹ AᵀQ1 ]⁻¹      (6.59)

We then apply the Matrix Inversion Lemma to the right-hand side of the above
equation with

    A = Q1
    B = Q1 A
    C = −(Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹
    D = AᵀQ1                                                          (6.60)

We then obtain

    Q0⁻¹(j + 1) = A⁻¹ − A⁻¹ B (C⁻¹ + D A⁻¹ B)⁻¹ D A⁻¹
                = Q1⁻¹ − Q1⁻¹ Q1 A (−(Q0(j) + CᵀQ2C + AᵀQ1A) + AᵀQ1 Q1⁻¹ Q1 A)⁻¹ AᵀQ1 Q1⁻¹
                = Q1⁻¹ + A (Q0(j) + CᵀQ2C)⁻¹ Aᵀ                       (6.61)

Applying the Matrix Inversion Lemma once more to (Q0(j) + CᵀQ2C)⁻¹, we get

    Q0⁻¹(j + 1) = Q1⁻¹ + A [ Q0⁻¹(j) − Q0⁻¹(j) Cᵀ (Q2⁻¹ + C Q0⁻¹(j) Cᵀ)⁻¹ C Q0⁻¹(j) ] Aᵀ
                                                                      (6.62)

which is (6.35).
Let us next show the equivalence between (6.56) and (6.34).
    x̂(j + 1|j) = [ Q1 − Q1 A (Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹ AᵀQ1 ]⁻¹
                 Q1 A (Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹ (Q0(j) x̂(j|j − 1) + CᵀQ2 y(j))
                                                                      (6.63)

Using the earlier result that

    [ Q1 − Q1 A (Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹ AᵀQ1 ]⁻¹ = Q1⁻¹ + A (Q0(j) + CᵀQ2C)⁻¹ Aᵀ,
                                                                      (6.65)
we obtain

    x̂(j + 1|j) = { Q1⁻¹ + A (Q0(j) + CᵀQ2C)⁻¹ Aᵀ } Q1 A (Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹
                 (Q0(j) x̂(j|j − 1) + CᵀQ2 y(j))
               = { A + A (Q0(j) + CᵀQ2C)⁻¹ AᵀQ1A } (Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹
                 (Q0(j) x̂(j|j − 1) + CᵀQ2 y(j))
               = A (Q0(j) + CᵀQ2C)⁻¹ { (Q0(j) + CᵀQ2C) + AᵀQ1A } (Q0(j) + CᵀQ2C + AᵀQ1A)⁻¹
                 (Q0(j) x̂(j|j − 1) + CᵀQ2 y(j))
               = A (Q0(j) + CᵀQ2C)⁻¹ (Q0(j) x̂(j|j − 1) + CᵀQ2 y(j))   (6.66)

Note that

    A (Q0(j) + CᵀQ2C)⁻¹ Q0(j) x̂(j|j − 1)
      = A [ Q0⁻¹(j) − Q0⁻¹(j) Cᵀ (Q2⁻¹ + C Q0⁻¹(j) Cᵀ)⁻¹ C Q0⁻¹(j) ] Q0(j) x̂(j|j − 1)
      = A x̂(j|j − 1) − A Q0⁻¹(j) Cᵀ (Q2⁻¹ + C Q0⁻¹(j) Cᵀ)⁻¹ C x̂(j|j − 1)
                                                                      (6.67)
and

    A (Q0(j) + CᵀQ2C)⁻¹ CᵀQ2 y(j)
      = A [ Q0⁻¹(j) − Q0⁻¹(j) Cᵀ (Q2⁻¹ + C Q0⁻¹(j) Cᵀ)⁻¹ C Q0⁻¹(j) ] CᵀQ2 y(j)
      = A Q0⁻¹(j) Cᵀ [ I − (Q2⁻¹ + C Q0⁻¹(j) Cᵀ)⁻¹ C Q0⁻¹(j) Cᵀ ] Q2 y(j)
      = A Q0⁻¹(j) Cᵀ (Q2⁻¹ + C Q0⁻¹(j) Cᵀ)⁻¹ (Q2⁻¹ + C Q0⁻¹(j) Cᵀ − C Q0⁻¹(j) Cᵀ) Q2 y(j)
      = A Q0⁻¹(j) Cᵀ (Q2⁻¹ + C Q0⁻¹(j) Cᵀ)⁻¹ y(j)                     (6.68)

In both derivations, the first step uses the Matrix Inversion Lemma and the
rest is straightforward algebra. Putting the two together gives equation
(6.34).
The equivalence between the unconstrained least squares estimator and the
Kalman filter is relevant in several ways. First, it gives a physically
meaningful way to choose the weighting matrices for the least squares
estimator. Second, it points to the fact that simply minimizing the residual
of a deterministic model in the least squares estimation would amount to
using a Kalman filter designed by adding some artificial white noise terms
to the state and output equations of the deterministic model, which would
perform poorly in most situations. Even though the least squares estimation
is cast in a completely deterministic setting, disturbance modeling is just
as important here as in the Kalman filter design. Third, it provides some
insight into when the least squares estimator will perform well and when it
won't. Specifically, it shows that the Gaussian noise assumption is inherent
to the least squares estimator just as it is to the Kalman filter. One
advantage of the least squares estimator over the Kalman filter, however, is
that constraints can be used to alter the underlying noise statistics.

6.4.3 Use of Moving Estimation Window

We mentioned earlier that the full least squares problem must be solved directly when a smoothed estimate of the past state sequence is desired or when
the problem formulation involves constraints and / or a nonlinear model. We
also mentioned the use of a moving estimation window as a way to contain
the size of the least squares problem. Estimation with a fixed-size estimation
window that moves in time will be referred to as Moving Horizon Estimation
(MHE).
Formulation of MHE
Consider the full batch least squares problem of (6.29). Based on the same
forward dynamic programming argument we used earlier, this problem can be
reformulated as

    Jk = min_{x(k−m+1), ε1(i), ε2(i)} [ φ_{k−m}(x(k − m + 1))
         + Σ_{i=k−m+1}^{k} ( ε1ᵀ(i) Q1 ε1(i) + ε2ᵀ(i) Q2 ε2(i) ) ]    (6.69)

subject to

    ε1(i) = x(i + 1) − A x(i) − B u(i)
    ε2(i) = y(i) − C x(i)                                             (6.70)

where φ_{k−m}(x(k − m + 1)) is the arrival cost defined as

    φ_{k−m}(z) = J_{k−m} with the arrival constraint x(k − m + 1) = z (6.71)
(6.69) works because the quantity

    Σ_{i=k−m+1}^{k} ( ε1ᵀ(i) Q1 ε1(i) + ε2ᵀ(i) Q2 ε2(i) )

depends only on the decision variables {ε1(i), ε2(i), i = k − m + 1, …, k}
for a given x(k − m + 1).

To be able to use the MHE strategy to solve the full batch problem exactly,
we must be able to compute the arrival cost in some recursive fashion. For
unconstrained linear problems, we have already derived a recursive formula
for the arrival cost (Equations (6.55)–(6.57), which are essentially the
Kalman filter update equations). Note that the constant term φj(x̂(j + 1|j))
in (6.55) can be dropped since it doesn't affect the solution.


To the above, constraints of types (6.31) and (6.32) can be added. With
constraints, the least squares problem no longer yields an analytical
solution. If the constraints are formulated as linear inequalities and
φ_{k−m}(x(k − m + 1)) is quadratic, the resulting problem is a QP. With the
introduction of a nonlinear model of form (6.33), the resulting problem is
an NLP.

For constrained linear systems or unconstrained / constrained nonlinear
systems, there is no way to compute the exact arrival cost in a recursive
manner. In these cases, the only option is to compute the arrival cost
approximately. Here, it is better to use an approximate cost that
lower-bounds the exact cost, i.e., use φ̃_{k−m}(x(k − m + 1)) in place of
φ_{k−m}(x(k − m + 1)) in (6.69) such that

    φ̃_{k−m}(x(k − m + 1)) ≤ φ_{k−m}(x(k − m + 1))   ∀ x(k − m + 1) ∈ X
                                                                      (6.72)

This means it is better to underestimate the accuracy of the starting
estimate x̂(k − m + 1|k − m) rather than to overestimate it. Statistically,
this amounts to saying that one should not assume an information content
higher than the actual one. Even though we will not delve further into the
mathematical aspects of this approximation, it is worth noting that a moving
horizon estimator with φ̃_{k−m}(x(k − m + 1)) chosen to lower-bound the true
arrival cost is stable in the observer sense.
For the constrained linear problem, it can be shown that

    φ_{k−m}(x(k − m + 1)) ≥ (x(k − m + 1) − x̂(k − m + 1|k − m))ᵀ Q_{k−m}(k − m + 1)
        (x(k − m + 1) − x̂(k − m + 1|k − m)) + φ_{k−m}(x̂(k − m + 1|k − m))
                                                                      (6.73)

where x̂(k − m + 1|k − m) is the estimate from the constrained MHE at time
k − m and Q_{k−m}(k − m + 1) is calculated according to the recursion
formula (6.57). Hence, choosing the initial error penalty as a quadratic
term with the weighting matrix given by the Kalman filter update equation
provides a lower-bounding approximation of the arrival cost. The proof is
somewhat complicated by the fact that x̂(k − m + 1|k − m) is an estimate from
the constrained MHE rather than the unconstrained MHE (the Kalman filter).
See the discussion at the end of the chapter for references containing the
proofs and more detailed discussions of this issue.
For nonlinear problems, calculation of a tight lower bound of the arrival
cost is much more difficult. The choice of φ̃_{k−m}(x(k − m + 1)) = 0
certainly qualifies. However, notice that, with this choice, one forgets all
the information outside the estimation window; hence, unless a sufficiently
long horizon is used, the performance can be very bad. Another, more
practical option is to use linearized statistics: the quadratic initial
penalty term can be calculated in the same way as above but using a model
obtained by linearizing the nonlinear model. Due to the linearization, the
quadratic penalty term does not necessarily lower-bound the arrival cost.
For that, one could combine the two approaches and use

    φ̃_{k−m}(x(k − m + 1)) = (x(k − m + 1) − x̂(k − m + 1|k − m))ᵀ
        (Q⁻¹_{k−m}(k − m + 1) + γI)⁻¹ (x(k − m + 1) − x̂(k − m + 1|k − m))
        + φ_{k−m}(x̂(k − m + 1|k − m))                                 (6.74)

where γ ≥ 0. With a sufficiently large γ, the above is a lower-bounding
approximation of φ_{k−m}(x(k − m + 1)). Note that, as γ → ∞, we approach the
first option of setting φ̃_{k−m}(x(k − m + 1)) = 0.
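As a flavor of constrained MHE in Python (a heavily simplified sketch of
ours: box constraints on the state only, scalar weights, the arrival cost
folded into the first residual, inputs omitted), the window problem can be
posed as a bounded least squares and handed to scipy.optimize.lsq_linear:

```python
import numpy as np
from scipy.optimize import lsq_linear

def mhe_window(A, C, y_win, x_prior, lb, ub, q0=1.0, q1=10.0, q2=1.0):
    """One window: stacked states z = [x(1); ...; x(m)], with lb <= x(i) <= ub."""
    n, m = A.shape[0], len(y_win)

    def sel(i):                                  # selector for x(i), i = 1..m
        S = np.zeros((n, n * m)); S[:, (i - 1) * n : i * n] = np.eye(n); return S

    rows = [np.sqrt(q0) * sel(1)]                # arrival-cost residual
    rhs = [np.sqrt(q0) * x_prior]
    for i in range(1, m):                        # state-transition residuals
        rows.append(np.sqrt(q1) * (sel(i + 1) - A @ sel(i)))
        rhs.append(np.zeros(n))
    for i in range(1, m + 1):                    # output residuals
        rows.append(np.sqrt(q2) * (C @ sel(i)))
        rhs.append(np.sqrt(q2) * y_win[i - 1])

    sol = lsq_linear(np.vstack(rows), np.concatenate(rhs),
                     bounds=(np.tile(lb, m), np.tile(ub, m)))
    return sol.x.reshape(m, n)                   # constrained window estimates
```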

Example 6.4 Show a simple state estimation problem where the constraints
help. Rawlings, Muske, and Lee (Automatica, 2001).

Example 6.5 Show a simple parameter estimation example with different
noise distributions. Robertson and Lee (Automatica, 2002).

Examples to Include
1. Demonstrate the difficulty of pole placement design. Use the distillation
column model with a single temperature output?
2. Demonstrate the effectiveness of the Kalman filter on the same problem.
3. Show the importance of disturbance modeling in the Kalman filter design. Compare with the design where a white noise is simply added to
the deterministic model.
4. Demonstrate the various conditions for the existence and uniqueness of
a stabilizing solution of ARE through a simple example. (HOMEWORK
Problem for CHE6400)
5. Examples of Constrained Least Squares estimation and MHE. Demonstrate the use of constraints and the benefit. (Robertson and Lee (Automatica, 2002).)

Possible Exercises To Give


1. Derive the Kalman filter when the assumption of independence of the
state noise and the output noise does not hold. Do this in two different
ways. First, simply optimize the filter gain matrix as before. Second,
rewrite the system so that the assumption of independence is satisfied.

Note that (6.1) can be rewritten as

    x(k + 1) = A x(k) + B u(k) + ε1(k) − R12 R2⁻¹ (C x(k) + ε2(k) − y(k))
             = (A − R12 R2⁻¹ C) x(k) + R12 R2⁻¹ y(k) + ε1(k) − R12 R2⁻¹ ε2(k)
    y(k) = C x(k) + ε2(k)                                             (6.75)

where we denote Ā = A − R12 R2⁻¹ C and ε̄1(k) = ε1(k) − R12 R2⁻¹ ε2(k).

It can be easily checked that

    E{ [ε̄1(k); ε2(k)] [ε̄1(k); ε2(k)]ᵀ } = [ R1 − R12 R2⁻¹ R12ᵀ, 0; 0, R2 ]
                                                                      (6.76)

The presence of the extra input term R12 R2⁻¹ y(k) does not affect the
filter gain matrix calculation as it is a known term. Hence, the same
formulae can be used to design the optimal filter for the case when the
independence assumption is not satisfied.
2. Show that, for an observable system, observer poles can be placed at
arbitrary locations.
3. Derive the form of system matrices for the lifted system (6.28) in terms
of the time-varying system matrices.
4. Prove Lemma 1.
5. Prove the Matrix Inversion Lemma.
6. Show that the batch least squares problem yields the maximum a posteriori estimate of the state sequence, which corresponds to the maximum
of the conditional density of the state sequence given the measurements.

Bibliography
1. Observer theory. Luenberger, etc.
2. Pole placement design can be found in most textbooks on linear systems,
including ???.
3. The Kalman filter was first presented in Kalman's original paper (???).
DISCUSS! Extensions to continuous-time systems can be found in Kwakernaak
and Sivan (???). A good overview of the variations of the Kalman filter,
such as the extended Kalman filter (EKF) for nonlinear systems, is given in
Jazwinski (???).
4. Least squares estimation and its connection with the Kalman filter was
shown by .... Also, the connection with maximum a posteriori estimation for
Markov systems is discussed in ...... A good overview of both can be found
in Jazwinski (???).



5. Early papers on Moving Horizon Estimation include Mayne and Michalska
(???) and Robertson, Lee, and Rawlings (???). Observer stability of the
moving horizon estimator is treated in Rao, Rawlings, and Lee (???) for
linear constrained systems and in Rao and Rawlings (???) for general
nonlinear systems. Robertson and Lee showed how constraints can be used to
develop an optimal estimator for various non-Gaussian distributions,
including asymmetric distributions constructed by amalgamating the halves of
two normal distributions of different variances.


Items to Be Moved
Kalman Filter As The Optimal Bayesian Estimator For Gaussian Systems
THIS SECTION WILL BE MOVED TO THE APPENDIX!
In the previous section, we assumed a linear observer structure and posed
the problem as a parametric optimization where the expected value of the
estimation error variance is minimized with respect to the observer gain. In
fact, the Kalman filter can be derived from an entirely probabilistic argument,
i.e., by deriving a Bayesian estimator that recursively computes the conditional
density of x(k).
Assume that ε1(k) and ε2(k) are Gaussian noise sequences. Then, assuming
x(0) is also a Gaussian variable, x(k) and y(k) are jointly Gaussian
sequences. Now we can simply formulate the state estimation problem as
computing the conditional expectation E{x(k) | Y(k)} where
Y(k) = [yᵀ(1), …, yᵀ(k)]ᵀ. Let us denote E{x(i) | Y(j)} as x̂(i|j). We divide
the estimation into the following two steps.
Model Update: Compute E{x(k)|Y(k − 1)} given E{x(k − 1)|Y(k − 1)} and
u(k − 1).

Since x(k) = A x(k − 1) + B u(k − 1) + ε1(k − 1) and ε1(k − 1) is a
zero-mean variable independent of y(1), …, y(k − 1),

    x̂(k|k − 1) = E{A x(k − 1) + B u(k − 1) + ε1(k − 1) | Y(k − 1)}
               = A E{x(k − 1) | Y(k − 1)} + B u(k − 1)                (6.77)

Hence, we obtain

    x̂(k|k − 1) = A x̂(k − 1|k − 1) + B u(k − 1)                        (6.78)

In addition, note that

    x(k) − x̂(k|k − 1) = A (x(k − 1) − x̂(k − 1|k − 1)) + ε1(k − 1)     (6.79)

Therefore,

    P(k) = E{(x(k) − x̂(k|k − 1))(x(k) − x̂(k|k − 1))ᵀ}
         = A P̄(k − 1) Aᵀ + R1                                         (6.80)

Since the conditional density for x(k) given Y(k − 1) is Gaussian, it is
completely specified by x̂(k|k − 1) and P(k).
Measurement Update: Compute E{x(k)|Y(k)} given E{x(k)|Y(k − 1)} and y(k).



The conditional density P{x(k) | Y (k)} is equivalent to the conditional
density P{x(k) | y(k)} with the prior density of x(k) given instead by
P{x(k) | Y (k1)}. Note that P{x(k) | Y (k1)} is a Gaussian density of
mean x
(k|k 1) and covariance P (k). In addition, y(k) = Cx(k) + 2 (k)
and therefore is also Gaussian. In fact, x(k) and y(k) are jointly Gaussian
with the following covariance:
(

T )
x(k) x
(k|k 1)
P (k)
P (k)C T
x(k) x
(k|k 1)
E
=
y(k) y(k|k 1)
CP (k) CP (k)C T + R2
y(k) y(k|k 1)
(6.81)
Now, recall the earlier results for jointly Gaussian variables:
E{x|y} = E{x} + Rxy Ry1 (y E{y})

Cov{x|y} = Rx

Rxy Ry1 Ryx

(6.82)
(6.83)

Applying the above to x(k) and y(k),


x
(k|k) = E{x(k)|y(k)}
1

(y(k) C x
(k|k 1))
= x
(k|k 1) + P (k)C T CP (k)C T + R2

(6.84)

P (k) = Cov{x(k)|y(k)}

1
= P (k) P (k)C T CP (k)C T + R2
CP (k)

(6.85)

In short, for Gaussian systems, we can compute the conditional mean and
covariance of x(k) recursively using
x
(k|k 1) = A
x(k 1|k 1) + Bu(k 1)
(6.86)

1
(y(k) C x
(k|k(6.87)
1))
x
(k|k) = x
(k|k 1) + P (k)C T CP (k)C T + R2
{z
}
|

K(k)

and

P (k) = AP (k 1)AT + R1

1
P (k) = P (k) P (k)C T CP (k)C T + R2
CP (k)

(6.88)
(6.89)

Note that this above has a linear estimator structure with the estimator gain
given by the Kalman filter equations.
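To make the recursion (6.86)-(6.89) concrete, here is a minimal Python sketch (ours, not from the text; the system matrices A, B, C and the noise covariances R1, R2 are placeholders to be supplied by the user):

    import numpy as np

    def kalman_step(xhat, P, u, y, A, B, C, R1, R2):
        """One cycle of (6.86)-(6.89): model update, then measurement update."""
        # Model update (6.86), (6.88)
        xhat_prior = A @ xhat + B @ u
        P_prior = A @ P @ A.T + R1
        # Measurement update (6.87), (6.89)
        S = C @ P_prior @ C.T + R2               # innovation covariance
        K = P_prior @ C.T @ np.linalg.inv(S)     # Kalman gain K(k)
        xhat_post = xhat_prior + K @ (y - C @ xhat_prior)
        P_post = P_prior - K @ C @ P_prior
        return xhat_post, P_post

Starting from the prior mean and covariance of x(0) and calling this function once per sample yields exactly the conditional mean and covariance described above.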

Chapter 7

Random Variables and Stochastic Processes

7.1 RANDOM VARIABLES

7.1.1 INTRODUCTION

What Is a Random Variable?


We are dealing with a physical phenomenon which exhibits randomness:

- the outcome of any one occurrence (trial) cannot be predicted;
- the probability of any subset of possible outcomes is well-defined.

We ascribe the term random variable to such a phenomenon. Note that a
random variable is not defined by a specific number; rather, it is defined by
the probabilities of all subsets of the possible outcomes. An outcome of a
particular trial is called a realization of the random variable.

An example is the outcome of rolling a die. Let x represent the outcome (not
of a particular trial, but in general). Then, x is not represented by a single
outcome, but is defined by the set of possible outcomes ({1, 2, 3, 4, 5, 6}) and
the probabilities of the possible outcomes (1/6 each). When we say x is 1 or
2 and so on, we really should say that a realization of x is such.

A random variable can be discrete or continuous. If the outcome of a
random variable belongs to a discrete space, the random variable is discrete.
An example is the outcome of rolling a die. On the other hand, if the outcome
belongs to a continuous space, the random variable is continuous. For
instance, compositions or temperatures in a distillation column can be viewed
as continuous random variables.

What Is Statistics?

Statistics deals with the application of probability theory to real problems.
There are two basic problems in statistics.

- Given a probabilistic model, predict the outcome of future trial(s). For
instance, one may say: choose the prediction $\hat{x}$ such that the expected
value of $(x - \hat{x})^2$ is minimized.

- Given collected data, define / improve a probabilistic model.
For instance, there may be some unknown parameters (say $\theta$) in the
probabilistic model. Then, given data X generated from the particular
probabilistic model, one should construct an estimate of $\theta$ in the form
of $\hat{\theta}(X)$. For example, $\hat{\theta}(X)$ may be constructed based on the objective
of minimizing the expected value of $\|\theta - \hat{\theta}\|_2^2$.
Another related topic is hypothesis testing, which has to do with testing
whether a given hypothesis is correct (i.e., how correct, defined in terms
of probability), based on available data.

In fact, one does both. That is, as data come in, one may continue to
improve the probabilistic model and use the updated model for further
prediction.
[Figure: prediction feedback loop. A priori knowledge and the error between
the actual system's output X and the predictor's output feed back into the
probabilistic model used for prediction.]

7.1.2 BASIC PROBABILITY CONCEPTS

PROBABILITY DISTRIBUTION, DENSITY: SCALAR CASE


A random variable is defined by a function describing the probability of the
outcome rather than by a specific value. Let d be a continuous random variable
($d \in \mathbb{R}$). Then one of the following functions is used to define d:


Probability Distribution Function

The probability distribution function $F(\eta; d)$ for random variable d is
defined as
$$F(\eta; d) = \Pr\{d \le \eta\} \eqno(7.1)$$
where Pr denotes the probability. Note that $F(\eta; d)$ is monotonically
increasing with $\eta$ and asymptotically reaches 1 as $\eta$ approaches its upper
limit.

Probability Density Function

The probability density function $\mathcal{P}(\eta; d)$ for random variable d is defined
as
$$\mathcal{P}(\eta; d) = \frac{dF(\eta; d)}{d\eta} \eqno(7.2)$$
Note that
$$\int_{-\infty}^{\infty} \mathcal{P}(\eta; d)\,d\eta = \int_{-\infty}^{\infty} dF(\eta; d) = 1 \eqno(7.3)$$
In addition,
$$\int_a^b \mathcal{P}(\eta; d)\,d\eta = \int_a^b dF(\eta; d) = F(b; d) - F(a; d) = \Pr\{a < d \le b\} \eqno(7.4)$$

Example: Gaussian or Normally Distributed Variable
$$\mathcal{P}(\eta; d) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left(\frac{\eta - m}{\sigma}\right)^2\right\} \eqno(7.5)$$

[Figure: Gaussian density $\mathcal{P}(\eta; d)$; 68.3% of the probability mass lies
between $m - \sigma$ and $m + \sigma$.]

Note that this distribution is determined entirely by two parameters: the
mean m and the standard deviation $\sigma$.
PROBABILITY DISTRIBUTION, DENSITY: VECTOR CASE

Let $d = [d_1 \;\cdots\; d_n]^T$ be a continuous random variable vector ($d \in \mathbb{R}^n$).
Now we must quantify the distribution of its individual elements as well as
their correlations.

Joint Probability Distribution Function

The joint probability distribution function $F(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n)$ for
random variable vector d is defined as
$$F(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n) = \Pr\{d_1 \le \eta_1, \ldots, d_n \le \eta_n\} \eqno(7.6)$$
Now the domain of F is an n-dimensional space. For example, for n = 2,
F is represented by a surface. Note that $F(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n) \to 1$ as
$\eta_1, \ldots, \eta_n \to \infty$.

Joint and Marginal Probability Density Function

The joint probability density function $\mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n)$ for random
variable vector d is defined as
$$\mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n) = \frac{\partial^n F(\eta; d)}{\partial\eta_1 \cdots \partial\eta_n} \eqno(7.7)$$

[Figure: joint density surface of a two-dimensional random variable.]

For convenience, we may write $\mathcal{P}(\eta; d)$ to denote $\mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n)$.
Again,
$$\int_{a_1}^{b_1}\cdots\int_{a_n}^{b_n} \mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n)\,d\eta_1 \cdots d\eta_n = \Pr\{a_1 < d_1 \le b_1, \ldots, a_n < d_n \le b_n\} \eqno(7.8)$$
Naturally,
$$\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n)\,d\eta_1 \cdots d\eta_n = 1 \eqno(7.9)$$
We can easily derive the probability density of an individual element from
the joint probability density. For instance,
$$\mathcal{P}(\eta_1; d_1) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n)\,d\eta_2 \cdots d\eta_n \eqno(7.10)$$
This is called the marginal probability density.

While the joint probability density (or distribution) tells us the likelihood
of several random variables achieving certain values simultaneously, the
marginal density tells us the likelihood of one element achieving a certain
value when the others are not known.

Note that in general
$$\mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n) \ne \mathcal{P}(\eta_1; d_1)\cdots\mathcal{P}(\eta_n; d_n) \eqno(7.11)$$
If
$$\mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n) = \mathcal{P}(\eta_1; d_1)\cdots\mathcal{P}(\eta_n; d_n) \eqno(7.12)$$
$d_1, \ldots, d_n$ are called mutually independent.
Example: Gaussian or Jointly Normally Distributed Variables

Suppose that $d = [d_1\; d_2]^T$ is a Gaussian variable. The density takes the form
of
$$\mathcal{P}(\eta_1, \eta_2; d_1, d_2) = \frac{1}{2\pi\sigma_1\sigma_2(1-\rho^2)^{1/2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{\eta_1 - m_1}{\sigma_1}\right)^2 - \frac{2\rho(\eta_1 - m_1)(\eta_2 - m_2)}{\sigma_1\sigma_2} + \left(\frac{\eta_2 - m_2}{\sigma_2}\right)^2\right]\right\} \eqno(7.13)$$


Note that this density is determined by five parameters: the means $m_1, m_2$,
the standard deviations $\sigma_1, \sigma_2$ and the correlation parameter $\rho$. $\rho = \pm 1$ represents
complete correlation between $d_1$ and $d_2$, while $\rho = 0$ represents no correlation.
It is fairly straightforward to verify that
$$\mathcal{P}(\eta_1; d_1) = \int_{-\infty}^{\infty} \mathcal{P}(\eta_1, \eta_2; d_1, d_2)\,d\eta_2 = \frac{1}{\sqrt{2\pi\sigma_1^2}}\exp\left\{-\frac{1}{2}\left(\frac{\eta_1 - m_1}{\sigma_1}\right)^2\right\} \eqno(7.14, 7.15)$$
$$\mathcal{P}(\eta_2; d_2) = \int_{-\infty}^{\infty} \mathcal{P}(\eta_1, \eta_2; d_1, d_2)\,d\eta_1 = \frac{1}{\sqrt{2\pi\sigma_2^2}}\exp\left\{-\frac{1}{2}\left(\frac{\eta_2 - m_2}{\sigma_2}\right)^2\right\} \eqno(7.16, 7.17)$$
Hence, $(m_1, \sigma_1)$ and $(m_2, \sigma_2)$ represent the parameters of the marginal densities of
$d_1$ and $d_2$ respectively. Note also that
$$\mathcal{P}(\eta_1, \eta_2; d_1, d_2) \ne \mathcal{P}(\eta_1; d_1)\mathcal{P}(\eta_2; d_2) \eqno(7.18)$$
except when $\rho = 0$.

A general n-dimensional Gaussian random variable vector $d = [d_1, \ldots, d_n]^T$
has a density function of the following form:
$$\mathcal{P}(\eta; d) = \mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n) = \frac{1}{(2\pi)^{n/2}|P_d|^{1/2}}\exp\left\{-\frac{1}{2}(\eta - \bar{d})^T P_d^{-1}(\eta - \bar{d})\right\} \eqno(7.19, 7.20)$$
where the parameters are $\bar{d} \in \mathbb{R}^n$ and $P_d \in \mathbb{R}^{n\times n}$. The significance of these
parameters will be discussed later.
EXPECTATION OF RANDOM VARIABLES AND RANDOM VARIABLE FUNCTIONS: SCALAR CASE

Random variables are completely characterized by their distribution functions
or density functions. However, in general, these functions are nonparametric.
Hence, random variables are often characterized by their moments up to a
finite order; in particular, use of the first two moments is quite common.

Expectation of a Random Variable Function

Any function of d is a random variable. Its expectation is computed as
follows:
$$E\{f(d)\} = \int_{-\infty}^{\infty} f(\eta)\mathcal{P}(\eta; d)\,d\eta \eqno(7.21)$$



Mean
$$\bar{d} = E\{d\} = \int_{-\infty}^{\infty} \eta\,\mathcal{P}(\eta; d)\,d\eta \eqno(7.22)$$
The above is called the mean or expectation of d.

Variance
$$\mathrm{Var}\{d\} = E\{(d - \bar{d})^2\} = \int_{-\infty}^{\infty} (\eta - \bar{d})^2\,\mathcal{P}(\eta; d)\,d\eta \eqno(7.23)$$
The above is the variance of d and quantifies the extent of d's deviation
from its mean.

Example: Gaussian Variable

For a Gaussian variable with density
$$\mathcal{P}(\eta; d) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left(\frac{\eta - m}{\sigma}\right)^2\right\} \eqno(7.24)$$
it is easy to verify that
$$\bar{d} = E\{d\} = \int_{-\infty}^{\infty} \frac{\eta}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left(\frac{\eta - m}{\sigma}\right)^2\right\}d\eta = m \eqno(7.25)$$
$$\mathrm{Var}\{d\} = E\{(d - \bar{d})^2\} = \int_{-\infty}^{\infty} \frac{(\eta - m)^2}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left(\frac{\eta - m}{\sigma}\right)^2\right\}d\eta = \sigma^2 \eqno(7.26)$$
Hence, the m and $\sigma^2$ that parametrize the normal density represent the mean and
the variance of the Gaussian variable.
EXPECTATION OF RANDOM VARIABLES AND RANDOM VARIABLE FUNCTIONS: VECTOR CASE

We can extend the concepts of mean and variance similarly to the vector case.
Let d be a random variable vector that belongs to $\mathbb{R}^n$.
$$\bar{d}_\ell = E\{d_\ell\} = \int_{-\infty}^{\infty} \eta_\ell\,\mathcal{P}(\eta_\ell; d_\ell)\,d\eta_\ell = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \eta_\ell\,\mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n)\,d\eta_1 \cdots d\eta_n \eqno(7.27)$$
$$\mathrm{Var}\{d_\ell\} = E\{(d_\ell - \bar{d}_\ell)^2\} = \int_{-\infty}^{\infty} (\eta_\ell - \bar{d}_\ell)^2\,\mathcal{P}(\eta_\ell; d_\ell)\,d\eta_\ell = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} (\eta_\ell - \bar{d}_\ell)^2\,\mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n)\,d\eta_1 \cdots d\eta_n \eqno(7.28)$$


In the vector case, we also need to quantify the correlations among different
elements:
$$\mathrm{Cov}\{d_\ell, d_m\} = E\{(d_\ell - \bar{d}_\ell)(d_m - \bar{d}_m)\} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} (\eta_\ell - \bar{d}_\ell)(\eta_m - \bar{d}_m)\,\mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n)\,d\eta_1 \cdots d\eta_n \eqno(7.29)$$
Note that
$$\mathrm{Cov}\{d_\ell, d_\ell\} = \mathrm{Var}\{d_\ell\} \eqno(7.30)$$
The ratio
$$\rho = \frac{\mathrm{Cov}\{d_\ell, d_m\}}{\sqrt{\mathrm{Var}\{d_\ell\}\mathrm{Var}\{d_m\}}} \eqno(7.31)$$
is the correlation factor. $\rho = \pm 1$ indicates complete correlation ($d_\ell$ is determined
uniquely by $d_m$ and vice versa); $\rho = 0$ indicates no correlation.

It is convenient to define the covariance matrix for d, which contains all the
variances and covariances of $d_1, \ldots, d_n$:
$$\mathrm{Cov}\{d\} = E\{(d - \bar{d})(d - \bar{d})^T\} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} (\eta - \bar{d})(\eta - \bar{d})^T\,\mathcal{P}(\eta_1, \ldots, \eta_n; d_1, \ldots, d_n)\,d\eta_1 \cdots d\eta_n \eqno(7.32)$$
The $(i, j)$th element of $\mathrm{Cov}\{d\}$ is $\mathrm{Cov}\{d_i, d_j\}$. The diagonal elements of $\mathrm{Cov}\{d\}$
are the variances of the elements of d. The above matrix is symmetric since
$$\mathrm{Cov}\{d_i, d_j\} = \mathrm{Cov}\{d_j, d_i\} \eqno(7.33)$$
The covariance of two different vectors $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ can be defined similarly:
$$\mathrm{Cov}\{x, y\} = E\{(x - \bar{x})(y - \bar{y})^T\} \eqno(7.34)$$
In this case, $\mathrm{Cov}\{x, y\}$ is an $n \times m$ matrix. In addition,
$$\mathrm{Cov}\{x, y\} = \left(\mathrm{Cov}\{y, x\}\right)^T \eqno(7.35)$$

Example: Gaussian Variables, 2-Dimensional Case

Let $d = [d_1\; d_2]^T$ and
$$\mathcal{P}(\eta; d) = \frac{1}{2\pi\sigma_1\sigma_2(1-\rho^2)^{1/2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{\eta_1 - m_1}{\sigma_1}\right)^2 - \frac{2\rho(\eta_1 - m_1)(\eta_2 - m_2)}{\sigma_1\sigma_2} + \left(\frac{\eta_2 - m_2}{\sigma_2}\right)^2\right]\right\} \eqno(7.36)$$
Then,
$$E\{d\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix}\mathcal{P}(\eta; d)\,d\eta_1\,d\eta_2 = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \eqno(7.37)$$
Similarly, one can show that
$$\mathrm{Cov}\{d\} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \begin{bmatrix} \eta_1 - m_1 \\ \eta_2 - m_2 \end{bmatrix}\begin{bmatrix} \eta_1 - m_1 \\ \eta_2 - m_2 \end{bmatrix}^T\mathcal{P}(\eta; d)\,d\eta_1\,d\eta_2 = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} \eqno(7.38)$$
Example: Gaussian Variables, n-Dimensional Case

Let $d = [d_1 \cdots d_n]^T$ and
$$\mathcal{P}(\eta; d) = \frac{1}{(2\pi)^{n/2}|P_d|^{1/2}}\exp\left\{-\frac{1}{2}(\eta - \bar{d})^T P_d^{-1}(\eta - \bar{d})\right\} \eqno(7.39)$$
Then, one can show that
$$E\{d\} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \eta\,\mathcal{P}(\eta; d)\,d\eta_1 \cdots d\eta_n = \bar{d} \eqno(7.40)$$
$$\mathrm{Cov}\{d\} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} (\eta - \bar{d})(\eta - \bar{d})^T\,\mathcal{P}(\eta; d)\,d\eta_1 \cdots d\eta_n = P_d \eqno(7.41)$$
Hence, the $\bar{d}$ and $P_d$ that parametrize the normal density function $\mathcal{P}(\eta; d)$ represent
the mean and the covariance matrix.

Exercise: Verify that, with
$$\bar{d} = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}; \quad P_d = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} \eqno(7.42)$$
one obtains the expression for the normal density of a 2-dimensional vector shown
earlier.
NOTE: Use of the SVD for Visualization of the Normal Density

The covariance matrix $P_d$ contains information about the spread (i.e., the extent of
deviation from the mean) of each element and their correlations. For instance,
$$\mathrm{Var}\{d_\ell\} = [\mathrm{Cov}\{d\}]_{\ell,\ell} \eqno(7.43)$$
$$\rho\{d_\ell, d_m\} = \frac{[\mathrm{Cov}\{d\}]_{\ell,m}}{\sqrt{[\mathrm{Cov}\{d\}]_{\ell,\ell}[\mathrm{Cov}\{d\}]_{m,m}}} \eqno(7.44)$$
where $[\cdot]_{i,j}$ represents the $(i, j)$th element of the matrix. However, one still has a
hard time understanding the correlations among all the elements and visualizing
the overall shape of the density function. Here, the SVD can be useful.
Because $P_d$ is a symmetric matrix, it has the following SVD:
$$P_d = E\{(d - \bar{d})(d - \bar{d})^T\} = V\Sigma V^T = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix}\begin{bmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2 \end{bmatrix}\begin{bmatrix} v_1^T \\ \vdots \\ v_n^T \end{bmatrix} \eqno(7.45-7.47)$$
Pre-multiplying by $V^T$ and post-multiplying by V on both sides, we obtain
$$E\{V^T(d - \bar{d})(d - \bar{d})^T V\} = \begin{bmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2 \end{bmatrix} \eqno(7.48)$$
Let $d' = V^T d$. Hence, $d'$ is the representation of d in terms of the coordinate
system defined by the orthonormal basis $v_1, \ldots, v_n$. Then, we see that
$$E\{(d' - \bar{d}')(d' - \bar{d}')^T\} = \begin{bmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2 \end{bmatrix} \eqno(7.49)$$
The diagonal covariance matrix means that every element of $d'$ is completely
independent of the others. Hence, $v_1, \ldots, v_n$ define the coordinate system with
respect to which the random variable vector is independent. $\sigma_1^2, \ldots, \sigma_n^2$ are
the variances of $d'$ with respect to the axes defined by $v_1, \ldots, v_n$.
Exercise: Suppose $d \in \mathbb{R}^2$ is zero-mean Gaussian and
$$P_d = \begin{bmatrix} 20.2 & 19.8 \\ 19.8 & 20.2 \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \end{bmatrix}\begin{bmatrix} 40 & 0 \\ 0 & 0.4 \end{bmatrix}\begin{bmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \end{bmatrix}^T \eqno(7.50)$$
Then $v_1 = [\frac{\sqrt{2}}{2}\; \frac{\sqrt{2}}{2}]^T$ and $v_2 = [\frac{\sqrt{2}}{2}\; -\frac{\sqrt{2}}{2}]^T$. Can you visualize the overall shape
of the density function? What is the variance of d along the (1,1) direction?
What about along the (1,-1) direction? What do you think the conditional
density of $d_1$ given $d_2 = \eta_2$ looks like? Plot the densities to verify.
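As a quick numerical companion to this exercise (a sketch of ours, not from the text; it uses only the covariance matrix given above), the SVD of $P_d$ and a sample-based check of the directional variances can be computed as follows:

    import numpy as np

    # Covariance matrix from the exercise above
    Pd = np.array([[20.2, 19.8],
                   [19.8, 20.2]])

    # SVD of the symmetric covariance matrix: Pd = V diag(s) V^T
    V, s, _ = np.linalg.svd(Pd)
    print("principal axes (columns):\n", V)
    print("variances along the axes:", s)   # large spread along (1,1), small along (1,-1)

    # Draw samples and verify the spread along the two principal directions
    rng = np.random.default_rng(0)
    d = rng.multivariate_normal(mean=[0.0, 0.0], cov=Pd, size=100_000)
    for v in V.T:
        proj = d @ v
        print("sample variance along", np.round(v, 3), "=", round(proj.var(), 2))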
CONDITIONAL PROBABILITY DENSITY: SCALAR CASE

When two random variables are related, the probability density of one random
variable changes when the other random variable takes on a particular value.

The probability density of a random variable when one or more
other random variables are fixed is called the conditional probability
density.

This concept is important in stochastic estimation as it can be used to develop
estimates of unknown variables based on readings of other related variables.

Let x and y be random variables. Suppose x and y have joint probability
density $\mathcal{P}(\xi, \eta; x, y)$. One may then ask what the probability density of x is
given a particular value of y (say $y = \eta$). Formally, this is called the
conditional density function of x given y and denoted as $\mathcal{P}(\xi|\eta; x|y)$. $\mathcal{P}(\xi|\eta; x|y)$
is computed as
$$\mathcal{P}(\xi|\eta; x|y) = \lim_{\epsilon\to 0}\frac{\int_{\eta}^{\eta+\epsilon}\mathcal{P}(\xi, \eta'; x, y)\,d\eta'}{\underbrace{\int_{-\infty}^{\infty}\int_{\eta}^{\eta+\epsilon}\mathcal{P}(\xi', \eta'; x, y)\,d\eta'\,d\xi'}_{\text{normalization factor}}} \eqno(7.51)$$
$$= \frac{\mathcal{P}(\xi, \eta; x, y)}{\int_{-\infty}^{\infty}\mathcal{P}(\xi', \eta; x, y)\,d\xi'} \eqno(7.52)$$
$$= \frac{\mathcal{P}(\xi, \eta; x, y)}{\mathcal{P}(\eta; y)} \eqno(7.53)$$

Note:

- The above means
$$\text{Conditional Density of } x \text{ given } y = \frac{\text{Joint Density of } x \text{ and } y}{\text{Marginal Density of } y} \eqno(7.54)$$
This should be quite intuitive.

- Due to the normalization,
$$\int_{-\infty}^{\infty} \mathcal{P}(\xi|\eta; x|y)\,d\xi = 1 \eqno(7.55)$$
which is what we want for a density function.

- We have
$$\mathcal{P}(\xi|\eta; x|y) = \mathcal{P}(\xi; x) \eqno(7.56)$$
if and only if
$$\mathcal{P}(\xi, \eta; x, y) = \mathcal{P}(\xi; x)\mathcal{P}(\eta; y) \eqno(7.57)$$
This means that the conditional density is the same as the marginal density
when and only when x and y are independent.

We are interested in the conditional density because often some of the
random variables are measured while others are not. For a particular trial,
if x is not measurable but y is, we are interested in knowing $\mathcal{P}(\xi|\eta; x|y)$ for
estimation of x.

Finally, note the distinctions among the different density functions:



- $\mathcal{P}(\xi, \eta; x, y)$: Joint Probability Density of x and y
represents the probability density of $x = \xi$ and $y = \eta$ simultaneously.
$$\int_{a_2}^{b_2}\int_{a_1}^{b_1}\mathcal{P}(\xi, \eta; x, y)\,d\xi\,d\eta = \Pr\{a_1 < x \le b_1 \text{ and } a_2 < y \le b_2\} \eqno(7.58)$$

- $\mathcal{P}(\xi; x)$: Marginal Probability Density of x
represents the probability density of $x = \xi$ NOT knowing what y is.
$$\mathcal{P}(\xi; x) = \int_{-\infty}^{\infty}\mathcal{P}(\xi, \eta; x, y)\,d\eta \eqno(7.59)$$

- $\mathcal{P}(\eta; y)$: Marginal Probability Density of y
represents the probability density of $y = \eta$ NOT knowing what x is.
$$\mathcal{P}(\eta; y) = \int_{-\infty}^{\infty}\mathcal{P}(\xi, \eta; x, y)\,d\xi \eqno(7.60)$$

- $\mathcal{P}(\xi|\eta; x|y)$: Conditional Probability Density of x given y
represents the probability density of x when $y = \eta$.
$$\mathcal{P}(\xi|\eta; x|y) = \frac{\mathcal{P}(\xi, \eta; x, y)}{\mathcal{P}(\eta; y)} \eqno(7.61)$$

- $\mathcal{P}(\eta|\xi; y|x)$: Conditional Probability Density of y given x
represents the probability density of y when $x = \xi$.
$$\mathcal{P}(\eta|\xi; y|x) = \frac{\mathcal{P}(\xi, \eta; x, y)}{\mathcal{P}(\xi; x)} \eqno(7.62)$$

Bayes Rule:

Note that
$$\mathcal{P}(\xi|\eta; x|y) = \frac{\mathcal{P}(\xi, \eta; x, y)}{\mathcal{P}(\eta; y)} \eqno(7.63)$$
$$\mathcal{P}(\eta|\xi; y|x) = \frac{\mathcal{P}(\xi, \eta; x, y)}{\mathcal{P}(\xi; x)} \eqno(7.64)$$
Hence, we arrive at
$$\mathcal{P}(\xi|\eta; x|y) = \frac{\mathcal{P}(\eta|\xi; y|x)\,\mathcal{P}(\xi; x)}{\mathcal{P}(\eta; y)} \eqno(7.65)$$
The above is known as Bayes Rule. It essentially says
$$(\text{Cond. Prob. of } x \text{ given } y)\times(\text{Marg. Prob. of } y) = (\text{Cond. Prob. of } y \text{ given } x)\times(\text{Marg. Prob. of } x) \eqno(7.66, 7.67)$$

Bayes Rule is useful since, in many cases, we are trying to compute $\mathcal{P}(\xi|\eta; x|y)$
and it is difficult to obtain the expression for it directly, while it may be easy
to write down the expression for $\mathcal{P}(\eta|\xi; y|x)$.

We can define the concepts of conditional expectation and conditional covariance
using the conditional density. For instance, the conditional expectation of x
given $y = \eta$ is defined as
$$E\{x|y\} = \int_{-\infty}^{\infty} \xi\,\mathcal{P}(\xi|\eta; x|y)\,d\xi \eqno(7.68)$$
The conditional variance can be defined as
$$\mathrm{Var}\{x|y\} = E\{(\xi - E\{x|y\})^2\} = \int_{-\infty}^{\infty} (\xi - E\{x|y\})^2\,\mathcal{P}(\xi|\eta; x|y)\,d\xi \eqno(7.69, 7.70)$$

Example: Jointly Normally Distributed or Gaussian Variables

Suppose that x and y have the following joint normal density parametrized
by $\bar{x}$, $\bar{y}$, $\sigma_x$, $\sigma_y$, $\rho$:
$$\mathcal{P}(\xi, \eta; x, y) = \frac{1}{2\pi\sigma_x\sigma_y(1-\rho^2)^{1/2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{\xi - \bar{x}}{\sigma_x}\right)^2 - \frac{2\rho(\xi - \bar{x})(\eta - \bar{y})}{\sigma_x\sigma_y} + \left(\frac{\eta - \bar{y}}{\sigma_y}\right)^2\right]\right\} \eqno(7.71)$$
Some algebra yields
$$\mathcal{P}(\xi, \eta; x, y) = \underbrace{\frac{1}{\sqrt{2\pi\sigma_y^2}}\exp\left\{-\frac{1}{2}\left(\frac{\eta - \bar{y}}{\sigma_y}\right)^2\right\}}_{\text{marginal density of } y}\;\underbrace{\frac{1}{\sqrt{2\pi\sigma_x^2(1-\rho^2)}}\exp\left\{-\frac{1}{2}\left(\frac{\xi - \bar{x} - \rho\frac{\sigma_x}{\sigma_y}(\eta - \bar{y})}{\sigma_x\sqrt{1-\rho^2}}\right)^2\right\}}_{\text{conditional density of } x} \eqno(7.72)$$
$$= \underbrace{\frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\left\{-\frac{1}{2}\left(\frac{\xi - \bar{x}}{\sigma_x}\right)^2\right\}}_{\text{marginal density of } x}\;\underbrace{\frac{1}{\sqrt{2\pi\sigma_y^2(1-\rho^2)}}\exp\left\{-\frac{1}{2}\left(\frac{\eta - \bar{y} - \rho\frac{\sigma_y}{\sigma_x}(\xi - \bar{x})}{\sigma_y\sqrt{1-\rho^2}}\right)^2\right\}}_{\text{conditional density of } y} \eqno(7.73)$$
Hence,
$$\mathcal{P}(\xi|\eta; x|y) = \frac{1}{\sqrt{2\pi\sigma_x^2(1-\rho^2)}}\exp\left\{-\frac{1}{2}\left(\frac{\xi - \bar{x} - \rho\frac{\sigma_x}{\sigma_y}(\eta - \bar{y})}{\sigma_x\sqrt{1-\rho^2}}\right)^2\right\} \eqno(7.74)$$
$$\mathcal{P}(\eta|\xi; y|x) = \frac{1}{\sqrt{2\pi\sigma_y^2(1-\rho^2)}}\exp\left\{-\frac{1}{2}\left(\frac{\eta - \bar{y} - \rho\frac{\sigma_y}{\sigma_x}(\xi - \bar{x})}{\sigma_y\sqrt{1-\rho^2}}\right)^2\right\} \eqno(7.75)$$
Note that the above conditional densities are normal. For instance, $\mathcal{P}(\xi|\eta; x|y)$
is a normal density with mean $\bar{x} + \rho\frac{\sigma_x}{\sigma_y}(\eta - \bar{y})$ and variance $\sigma_x^2(1 - \rho^2)$.
So,
$$E\{x|y\} = \bar{x} + \rho\frac{\sigma_x}{\sigma_y}(\eta - \bar{y}) \eqno(7.76)$$
$$= \bar{x} + \frac{\rho\sigma_x\sigma_y}{\sigma_y^2}(\eta - \bar{y}) \eqno(7.77)$$
$$= E\{x\} + \mathrm{Cov}\{x, y\}\mathrm{Var}^{-1}\{y\}(\eta - E\{y\}) \eqno(7.78)$$
The conditional variance of x given $y = \eta$ is:
$$E\{(x - E\{x|y\})^2 | y\} = \sigma_x^2(1 - \rho^2) \eqno(7.79)$$
$$= \sigma_x^2 - \frac{\sigma_x^2\sigma_y^2\rho^2}{\sigma_y^2} \eqno(7.80)$$
$$= \sigma_x^2 - (\rho\sigma_x\sigma_y)\frac{1}{\sigma_y^2}(\rho\sigma_x\sigma_y) \eqno(7.81)$$
$$= \mathrm{Var}\{x\} - \mathrm{Cov}\{x, y\}\mathrm{Var}^{-1}\{y\}\mathrm{Cov}\{y, x\} \eqno(7.82)$$
Notice that the conditional distribution becomes a point density as $\rho \to \pm 1$,
which should be intuitively obvious.
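To make (7.76) and (7.79) concrete, the following sketch (ours; all numbers are arbitrary placeholders) estimates the conditional mean and variance from samples and compares them with the formulas:

    import numpy as np

    # Arbitrary example parameters (not from the text)
    xbar, ybar, sx, sy, rho = 1.0, -2.0, 2.0, 0.5, 0.8
    cov = np.array([[sx**2,     rho*sx*sy],
                    [rho*sx*sy, sy**2    ]])

    rng = np.random.default_rng(1)
    xy = rng.multivariate_normal([xbar, ybar], cov, size=200_000)

    # Condition on y close to a chosen value eta
    eta = -1.5
    mask = np.abs(xy[:, 1] - eta) < 0.02
    x_given_y = xy[mask, 0]

    # Formulas (7.76) and (7.79)
    mean_formula = xbar + rho * (sx / sy) * (eta - ybar)
    var_formula = sx**2 * (1 - rho**2)
    print("E{x|y}:   sample", round(x_given_y.mean(), 3), " formula", round(mean_formula, 3))
    print("Var{x|y}: sample", round(x_given_y.var(), 3), " formula", round(var_formula, 3))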
CONDITIONAL PROBABILITY DENSITY: VECTOR CASE

We can extend the concept of conditional probability distribution to the vector
case as before.

Let x and y be n- and m-dimensional random vectors respectively. Then
the conditional density of x given $y = [\eta_1, \ldots, \eta_m]^T$ is defined as
$$\mathcal{P}(\xi_1, \ldots, \xi_n | \eta_1, \ldots, \eta_m; x_1, \ldots, x_n | y_1, \ldots, y_m) = \frac{\mathcal{P}(\xi_1, \ldots, \xi_n, \eta_1, \ldots, \eta_m; x_1, \ldots, x_n, y_1, \ldots, y_m)}{\mathcal{P}(\eta_1, \ldots, \eta_m; y_1, \ldots, y_m)} \eqno(7.83)$$
Bayes Rule can be stated as
$$\mathcal{P}(\xi_1, \ldots, \xi_n | \eta_1, \ldots, \eta_m; x_1, \ldots, x_n | y_1, \ldots, y_m) = \frac{\mathcal{P}(\eta_1, \ldots, \eta_m | \xi_1, \ldots, \xi_n; y_1, \ldots, y_m | x_1, \ldots, x_n)\,\mathcal{P}(\xi_1, \ldots, \xi_n; x_1, \ldots, x_n)}{\mathcal{P}(\eta_1, \ldots, \eta_m; y_1, \ldots, y_m)} \eqno(7.84)$$
The conditional expectation and covariance matrix can be defined similarly:
$$E\{x|y\} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \begin{bmatrix} \xi_1 \\ \vdots \\ \xi_n \end{bmatrix}\mathcal{P}(\xi|\eta; x|y)\,d\xi_1 \cdots d\xi_n \eqno(7.85)$$
$$\mathrm{Cov}\{x|y\} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} \begin{bmatrix} \xi_1 - E\{x_1|y\} \\ \vdots \\ \xi_n - E\{x_n|y\} \end{bmatrix}\begin{bmatrix} \xi_1 - E\{x_1|y\} \\ \vdots \\ \xi_n - E\{x_n|y\} \end{bmatrix}^T\mathcal{P}(\xi|\eta; x|y)\,d\xi_1 \cdots d\xi_n \eqno(7.86)$$

Example: Gaussian or Jointly Normally Distributed Variables

Let x and y be jointly normally distributed random variable vectors of
dimension n and m respectively. Let
$$z = \begin{bmatrix} x \\ y \end{bmatrix} \eqno(7.87)$$
The joint distribution takes the form of
$$\mathcal{P}(\xi, \eta; x, y) = \frac{1}{(2\pi)^{\frac{n+m}{2}}|P_z|^{1/2}}\exp\left\{-\frac{1}{2}(\zeta - \bar{z})^T P_z^{-1}(\zeta - \bar{z})\right\} \eqno(7.88)$$
where
$$\bar{z} = \begin{bmatrix} \bar{x} \\ \bar{y} \end{bmatrix}; \quad \zeta = \begin{bmatrix} \xi \\ \eta \end{bmatrix} \eqno(7.89)$$
$$P_z = \begin{bmatrix} \mathrm{Cov}(x) & \mathrm{Cov}(x, y) \\ \mathrm{Cov}(y, x) & \mathrm{Cov}(y) \end{bmatrix} \eqno(7.90)$$
Then, it can be proven that (see Theorem 2.13 in [Jaz70])
$$E\{x|y\} = \bar{x} + \mathrm{Cov}(x, y)\mathrm{Cov}^{-1}(y)(\eta - \bar{y}) \eqno(7.91)$$
$$E\{y|x\} = \bar{y} + \mathrm{Cov}(y, x)\mathrm{Cov}^{-1}(x)(\xi - \bar{x}) \eqno(7.92)$$
and
$$\mathrm{Cov}\{x|y\} = E\left\{(\xi - E\{x|y\})(\xi - E\{x|y\})^T\right\} = \mathrm{Cov}\{x\} - \mathrm{Cov}\{x, y\}\mathrm{Cov}^{-1}\{y\}\mathrm{Cov}\{y, x\} \eqno(7.93, 7.94)$$
$$\mathrm{Cov}\{y|x\} = E\left\{(\eta - E\{y|x\})(\eta - E\{y|x\})^T\right\} = \mathrm{Cov}\{y\} - \mathrm{Cov}\{y, x\}\mathrm{Cov}^{-1}\{x\}\mathrm{Cov}\{x, y\} \eqno(7.95, 7.96)$$
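A direct implementation of (7.91) and (7.94) is straightforward. The sketch below is ours, not from the text; the partitioning convention follows (7.87)-(7.90):

    import numpy as np

    def gaussian_conditional(zbar, Pz, n, eta):
        """Conditional mean and covariance of x given y = eta, per (7.91), (7.94).
        zbar, Pz describe the joint Gaussian of z = [x; y]; x has dimension n."""
        xbar, ybar = zbar[:n], zbar[n:]
        Pxx, Pxy = Pz[:n, :n], Pz[:n, n:]
        Pyx, Pyy = Pz[n:, :n], Pz[n:, n:]
        gain = Pxy @ np.linalg.inv(Pyy)
        mean = xbar + gain @ (eta - ybar)
        cov = Pxx - gain @ Pyx
        return mean, cov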

7.1.3 STATISTICS

PREDICTION

The first problem of statistics is prediction of the outcome of a future trial
given a probabilistic model.

Suppose $\mathcal{P}(x)$, the probability density for random variable x, is
given. Predict the outcome of x for a new trial (which is about to
occur).

Note that, unless $\mathcal{P}(x)$ is a point distribution, x cannot be predicted exactly.

To do optimal estimation, one must first establish a formal criterion. For
example, the most likely value of x is the one that corresponds to the highest
density value:
$$\hat{x} = \arg\max_x \mathcal{P}(x)$$
A more commonly used criterion is the following minimum variance estimate:
$$\hat{x} = \arg\min_{\hat{x}} E\left\{\|x - \hat{x}\|_2^2\right\}$$
The solution to the above is $\hat{x} = E\{x\}$ (see the numerical check below).

Exercise: Can you prove the above?

If a related variable y (from the same trial) is given, then one should use
$\hat{x} = E\{x|y\}$ instead.
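A quick numerical check of the claim that $\hat{x} = E\{x\}$ minimizes the expected squared error (our sketch; the distribution is an arbitrary non-Gaussian example):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=3.0, size=100_000)   # a skewed, non-Gaussian example

    candidates = np.linspace(0.0, 10.0, 501)
    mse = [np.mean((x - c)**2) for c in candidates]
    best = candidates[int(np.argmin(mse))]
    print("minimizer of E{(x - c)^2}:", best, " sample mean:", round(x.mean(), 2))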
SAMPLE MEAN AND COVARIANCE, PROBABILISTIC MODEL

The other problem of statistics is inferring a probabilistic model from collected
data. The simplest of such problems is the following:

We are given the data for random variable x from N trials. These
data are labeled as $x(1), \ldots, x(N)$. Find the probability density
function for x.

Oftentimes, a certain density shape (like the normal distribution) is assumed to
make this a well-posed problem. If a normal density is assumed, the following
sample averages can then be used as estimates for the mean and covariance:
$$\hat{\bar{x}} = \frac{1}{N}\sum_{i=1}^{N} x(i)$$
$$\hat{R}_x = \frac{1}{N}\sum_{i=1}^{N} x(i)x^T(i)$$
Note that the above estimates are consistent estimates of the real mean and
covariance $\bar{x}$ and $R_x$ (i.e., they converge to the true values as $N \to \infty$).
A slightly more general problem is:

A random variable vector y is produced according to
$$y = f(\theta, u) + x$$
In the above, x is another random variable vector, u is a known
deterministic vector (which can change from trial to trial) and $\theta$ is
an unknown deterministic vector (which is invariant). Given data
for y from N trials, find the probability density parameters for x
(e.g., $\bar{x}$, $R_x$) and the unknown deterministic vector $\theta$.

This problem will be discussed later in the regression section.

7.2 STOCHASTIC PROCESSES

A stochastic process refers to a family of random variables indexed by a
parameter set. This parameter set can be continuous or discrete. Since we are
interested in discrete systems, we will limit our discussion to processes with
a discrete parameter set. Hence, a stochastic process in our context is a time
sequence of random variables.

7.2.1 BASIC PROBABILITY CONCEPTS

DISTRIBUTION FUNCTION

Let x(k) be a sequence. Then $(x(k_1), \ldots, x(k_\ell))$ forms an $\ell$-dimensional
random variable, and one can define the finite-dimensional distribution function
and density function as before. For instance, the distribution function
$F(\xi_1, \ldots, \xi_\ell; x(k_1), \ldots, x(k_\ell))$ is defined as:
$$F(\xi_1, \ldots, \xi_\ell; x(k_1), \ldots, x(k_\ell)) = \Pr\{x(k_1) \le \xi_1, \ldots, x(k_\ell) \le \xi_\ell\} \eqno(7.97)$$
The density function is also defined as before.

We note that the above definitions also apply to vector time sequences if
$x(k_i)$ and the $\xi_i$'s are taken as vectors and each integral is defined over the space
that $\xi_i$ occupies.
MEAN AND COVARIANCE

The mean value of the stochastic variable x(k) is
$$\bar{x}(k) = E\{x(k)\} = \int_{-\infty}^{\infty} \xi\,dF(\xi; x(k)) \eqno(7.98)$$
Its covariance is defined as
$$R_x(k_1, k_2) = E\{[x(k_1) - \bar{x}(k_1)][x(k_2) - \bar{x}(k_2)]^T\} = \int\!\!\int [\xi_1 - \bar{x}(k_1)][\xi_2 - \bar{x}(k_2)]^T\,dF(\xi_1, \xi_2; x(k_1), x(k_2)) \eqno(7.99)$$
The cross-covariance of two stochastic processes x(k) and y(k) is defined as
$$R_{xy}(k_1, k_2) = E\{[x(k_1) - \bar{x}(k_1)][y(k_2) - \bar{y}(k_2)]^T\} = \int\!\!\int [\xi_1 - \bar{x}(k_1)][\eta_2 - \bar{y}(k_2)]^T\,dF(\xi_1, \eta_2; x(k_1), y(k_2)) \eqno(7.100)$$
Gaussian processes refer to processes for which any finite-dimensional
distribution function is normal. Gaussian processes are completely characterized
by their mean and covariance.
STATIONARY STOCHASTIC PROCESSES

Throughout this book we will define stationary stochastic processes as those
with a time-invariant distribution function. Weakly stationary (or stationary
in a wide sense) processes are processes whose first two moments are time-invariant.
Hence, for a weakly stationary process x(k),
$$E\{x(k)\} = \bar{x} \quad \forall k$$
$$E\{[x(k) - \bar{x}][x(k - \tau) - \bar{x}]^T\} = R_x(\tau) \quad \forall k \eqno(7.101)$$
In other words, if x(k) is stationary, it has a constant mean value and its
covariance depends only on the time difference $\tau$. For Gaussian processes,
weakly stationary processes are also stationary.

For scalar x(k), R(0) can be interpreted as the variance of the signal, and
$R(\tau)$ reveals its time correlation. The normalized covariance $\frac{R(\tau)}{R(0)}$ ranges from
0 to 1 in magnitude and indicates the time correlation of the signal. The value 1
indicates complete correlation and the value 0 indicates no correlation.

Note that many signals have both deterministic and stochastic components.
In some applications, it is very useful to treat these signals in the same
framework. One can do this by defining
$$\bar{x} = \lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^{N} x(k)$$
$$R_x(\tau) = \lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^{N} [x(k) - \bar{x}][x(k - \tau) - \bar{x}]^T \eqno(7.102)$$
Note that in the above, both the deterministic and stochastic parts are averaged
out. Signals for which the above limits converge are called quasi-stationary
signals. The above definitions are consistent with the previous definitions since,
in the purely stochastic case, a particular realization of a stationary stochastic
process with given mean ($\bar{x}$) and covariance ($R_x(\tau)$) should satisfy the above
relationships.

SPECTRA OF STATIONARY STOCHASTIC PROCESSES

Throughout this chapter, continuous time is rescaled so that each discrete time
interval represents one continuous time unit. If the sample interval $T_s$ is not
one continuous time unit, the frequency in discrete time needs to be scaled
by the factor $\frac{1}{T_s}$.

The spectral density of a stationary process x(k) is defined as the Fourier
transform of its covariance function:
$$\Phi_x(\omega) = \frac{1}{2\pi}\sum_{\tau=-\infty}^{\infty} R_x(\tau)e^{-j\omega\tau} \eqno(7.103)$$
The area under the curve represents the power of the signal in the particular
frequency range. For example, the power of x(k) in the frequency range $(\omega_1, \omega_2)$
is calculated by the integral
$$2\int_{\omega=\omega_1}^{\omega_2} \Phi_x(\omega)\,d\omega$$
Peaks in the signal spectrum indicate the presence of periodic components in
the signal at the respective frequencies.

The inverse Fourier transform can be used to calculate $R_x(\tau)$ from the
spectrum $\Phi_x(\omega)$ as well:
$$R_x(\tau) = \int_{-\pi}^{\pi} \Phi_x(\omega)e^{j\omega\tau}\,d\omega \eqno(7.104)$$
With $\tau = 0$, the above becomes
$$E\{x(k)x(k)^T\} = R_x(0) = \int_{-\pi}^{\pi} \Phi_x(\omega)\,d\omega \eqno(7.105)$$
which indicates that the total area under the spectral density is equal to the
variance of the signal. This is known as Parseval's relationship.

Example: Show plots of various covariances, spectra and realizations!

**Exercise: Plot the spectra of (1) white noise, (2) sinusoids, and (3) white
noise filtered through a low-pass filter.
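As a starting point for this exercise, the following sketch (ours; it relies on scipy.signal, and the filter is an arbitrary first-order low-pass) estimates and compares the spectra:

    import numpy as np
    from scipy import signal

    rng = np.random.default_rng(3)
    N = 2**14
    e = rng.standard_normal(N)                   # white noise

    # Low-pass filter the white noise: y(k) = 0.9 y(k-1) + e(k)
    y = signal.lfilter([1.0], [1.0, -0.9], e)

    for name, x in [("white noise", e), ("filtered noise", y)]:
        f, Pxx = signal.welch(x, nperseg=1024)   # Welch spectral estimate
        print(name, "low/high frequency power ratio:",
              round(Pxx[1:20].mean() / Pxx[-20:].mean(), 2))

The white noise spectrum is roughly flat (ratio near 1), while the filtered noise concentrates its power at low frequencies.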

DISCRETE-TIME WHITE NOISE

A particular type of stochastic process called white noise will be used
extensively throughout this book. x(k) is called a white noise (or white sequence)
if
$$\mathcal{P}(x(k)|x(\ell)) = \mathcal{P}(x(k)) \quad \text{for } \ell < k \eqno(7.106)$$
for all k. In other words, the sequence has no time correlation and hence
all the elements are mutually independent. In such a situation, knowing the
realization of $x(\ell)$ in no way helps in estimating x(k).

A stationary white noise sequence has the following properties:
$$E\{x(k)\} = \bar{x}$$
$$E\{(x(k) - \bar{x})(x(k - \tau) - \bar{x})^T\} = \begin{cases} R_x & \text{if } \tau = 0 \\ 0 & \text{if } \tau \ne 0 \end{cases} \eqno(7.107)$$
Hence, the covariance of a white noise is defined by a single matrix.

The spectrum of white noise x(k) is constant over the entire frequency range
since, from (7.103),
$$\Phi_x(\omega) = \frac{1}{2\pi}R_x \eqno(7.108)$$
The name "white noise" actually originated from its similarity to white light
in spectral properties.
COLORED NOISE

A stochastic process generated by filtering white noise through a dynamic
system is called colored noise.

Important:

A stationary stochastic process with any given mean and covariance
function can be generated by passing white noise through an
appropriate dynamical system.

To see this, consider
$$d(k) = H(q)\epsilon(k) + \bar{d} \eqno(7.109)$$
where $\epsilon(k)$ is a white noise of identity covariance and H(q) is a stable /
stably invertible transfer function (matrix). Using simple algebra (Ljung
REFERENCE), one can show that
$$\Phi_d(\omega) = H(e^{j\omega})H^T(e^{-j\omega}) \eqno(7.110)$$
The spectral factorization theorem (REFERENCE - Astrom and Wittenmark,
1984) says that one can always find an H(q) that satisfies (7.110) for an
arbitrary $\Phi_d$ and has no pole or zero outside the unit disk. In other words,
the first and second order moments of any stationary signal can be matched
by the above model.

This result is very useful in modeling disturbances whose covariance functions
are known or fixed. Note that a stationary Gaussian process is completely
specified by its mean and covariance. Such a process can be modelled by
filtering a zero-mean Gaussian white sequence through appropriate dynamics
determined by its spectrum (plus adding a bias at the output if the mean is
not zero).

INTEGRATED WHITE NOISE AND NONSTATIONARY PROCESSES

Some processes exhibit mean-shifts (whose magnitude and occurrence are
random). Consider the following model:
$$y(k) = y(k-1) + \epsilon(k)$$
where $\epsilon(k)$ is a white sequence. Such a sequence is called integrated white noise
or sometimes a random walk. Particular realizations under different distributions
of $\epsilon(k)$ are shown below:

[Figure: realizations y(k) of integrated white noise for different
distributions $\mathcal{P}(\epsilon)$, e.g., a two-level distribution with 90% / 10%
probabilities.]

More generally, many interesting signals will exhibit stationary behavior
combined with randomly occurring mean-shifts. Such signals can be modeled
as shown below.

[Figure: block diagram. White sequences $\epsilon_1(k)$ and $\epsilon_2(k)$ drive
$\tilde{H}(q^{-1})$ and the integrator $\frac{1}{1-q^{-1}}$ respectively, and their outputs are
summed into y(k); the combined effect is equivalent to a single white noise
$\epsilon(k)$ passed through $H(q^{-1})\frac{1}{1-q^{-1}}$.]

As shown above, the combined effects can be expressed as integrated white
noise colored with a filter $H(q^{-1})$.

Note that while y(k) is nonstationary, the differenced signal $\Delta y(k)$ is
stationary.

[Figure: $\epsilon(k)$ passed through $\frac{1}{1-q^{-1}}$ and $H(q^{-1})$ gives y(k);
equivalently, $\epsilon(k)$ passed through $H(q^{-1})$ gives $\Delta y(k)$.]
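A short simulation makes the contrast concrete (our sketch, not from the text): the random walk y(k) is nonstationary, while its difference recovers a stationary white sequence.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 500
    eps = rng.standard_normal(N)

    # Integrated white noise (random walk): y(k) = y(k-1) + eps(k)
    y = np.cumsum(eps)

    # Differencing recovers the stationary white sequence
    dy = np.diff(y)
    print("var of y over first/last halves:", y[:N//2].var().round(1), y[N//2:].var().round(1))
    print("var of dy over first/last halves:", dy[:N//2].var().round(2), dy[N//2:].var().round(2))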

STOCHASTIC DIFFERENCE EQUATION

Generally, a stochastic process can be modeled through the following stochastic
difference equation:
$$x(k+1) = Ax(k) + B\epsilon(k)$$
$$y(k) = Cx(k) + D\epsilon(k) \eqno(7.111)$$
where $\epsilon(k)$ is a white vector sequence of zero mean and covariance $R_\epsilon$.

Note that
$$E\{x(k)\} = AE\{x(k-1)\} = A^k E\{x(0)\}$$
$$E\{x(k)x^T(k)\} = AE\{x(k-1)x^T(k-1)\}A^T + BR_\epsilon B^T \eqno(7.112)$$
If all the eigenvalues of A are strictly inside the unit disk, the above approaches
a stationary process as $k \to \infty$ since
$$\lim_{k\to\infty} E\{x(k)\} = 0; \quad \lim_{k\to\infty} E\{x(k)x^T(k)\} = R_x \eqno(7.113)$$
where $R_x$ is the solution to the Lyapunov equation
$$R_x = AR_x A^T + BR_\epsilon B^T \eqno(7.114)$$
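The stationary covariance (7.114) can be obtained numerically with a standard discrete Lyapunov solver. The sketch below is ours; the system matrices are placeholders with eigenvalues inside the unit disk:

    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    # Example system (placeholder numbers)
    A = np.array([[0.8, 0.2],
                  [0.0, 0.5]])
    B = np.array([[1.0],
                  [0.5]])
    Reps = np.eye(1)

    # Stationary state covariance: Rx = A Rx A^T + B Reps B^T, per (7.114)
    Rx = solve_discrete_lyapunov(A, B @ Reps @ B.T)
    print(Rx)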

Since $y(k) = Cx(k) + D\epsilon(k)$,
$$E\{y(k)\} = CE\{x(k)\} + DE\{\epsilon(k)\} = 0$$
$$E\{y(k)y^T(k)\} = CE\{x(k)x^T(k)\}C^T + DE\{\epsilon(k)\epsilon^T(k)\}D^T = CR_x C^T + DR_\epsilon D^T \eqno(7.115)$$
The auto-correlation function of y(k) becomes
$$R_y(\tau) = E\{y(k+\tau)y^T(k)\} = \begin{cases} CR_x C^T + DR_\epsilon D^T & \text{for } \tau = 0 \\ CA^\tau R_x C^T + CA^{\tau-1}BR_\epsilon D^T & \text{for } \tau > 0 \end{cases} \eqno(7.116)$$
The spectrum of y is obtained by taking the Fourier transform of $R_y(\tau)$ and
can be shown to be
$$\Phi_y(\omega) = \left[C(e^{j\omega}I - A)^{-1}B + D\right]R_\epsilon\left[C(e^{-j\omega}I - A)^{-1}B + D\right]^T \eqno(7.117)$$
In the case that A contains eigenvalues on or outside the unit circle, the
process is nonstationary as its covariance keeps increasing (see Eqn. (7.112)).
However, it is common to include integrators in A to model mean-shifting
(random-walk-like) behavior. If all the outputs exhibit this behavior, one can
use
$$x(k+1) = Ax(k) + B\epsilon(k)$$
$$\Delta y(k) = Cx(k) + D\epsilon(k) \eqno(7.118)$$
Note that, with a stable A, while $\Delta y(k)$ is a stationary process, y(k) includes
an integrator and therefore is nonstationary.

[Figure: block diagrams. White noise $\epsilon(k)$ driving the stable system
x(k+1) = Ax(k) + B$\epsilon$(k), y(k) = Cx(k) + D$\epsilon$(k) yields a stationary
process y(k); appending the integrator $\frac{1}{1-q^{-1}}$ to the output yields a
nonstationary (mean-shifting) process y(k).]


Chapter 6

Unconstrained Quadratic Optimal Control
In this chapter, we present the basic results in linear quadratic optimal control.
We will derive optimal and suboptimal control policies for both finite horizon
and infinite horizon problems. We will consider the unconstrained case in
this chapter and then extend the results to the constrained case in the next
chapter.
The standard problem of quadratic optimal control is that of regulating
the state at the origin (or driving the state to the origin starting from some
nonzero initial condition). For the linear system
$$x(k+1) = Ax(k) + Bu(k) \eqno(6.1)$$
with a given initial condition x(0), the objective is to find an input sequence
or a feedback policy that minimizes the quadratic cost function
$$V_p = \sum_{k=0}^{p-1}\left[x^T(k)Qx(k) + u^T(k)Ru(k)\right] + x^T(p)Q_t x(p) \eqno(6.2)$$
where Q is a positive (semi-)definite matrix and R a positive definite matrix.

With a finite p, the problem is referred to as the finite horizon problem and
with $p = \infty$ the infinite horizon problem. Many problems of practical interest,
such as output regulation in the presence of constant disturbances and
setpoint changes, can be posed as state regulation problems of the above form
(see Exercise ??).

6.1 Finite Horizon Problem

Let us consider the finite horizon quadratic optimal control problem
$$J_p(x_0) = \min_{u(0),\ldots,u(p-1)} V_p \eqno(6.3)$$
for the deterministic system
$$x(k+1) = Ax(k) + Bu(k) \eqno(6.4)$$
with initial condition
$$x(0) = x_0 \eqno(6.5)$$
The above problem will be referred to as $J_p(x_0)$, which is also used to denote
the optimal cost.

Two approaches can be taken to solve this problem. The conceptually simpler
of the two is to develop the linear equation between the future state sequence
and the future input sequence explicitly and substitute it into the objective
function to solve a least squares problem, just as in the DMC approach covered
in the previous chapters. The more sophisticated alternative is to use dynamic
programming to solve the problem in a stagewise recursive manner. We review
the two approaches next.

6.1.1 Least Squares Solution Based On Explicit Multi-Step Prediction

The future state sequence can be related to the future input sequence through
the following linear equation:
$$\begin{bmatrix} x(0) \\ x(1) \\ \vdots \\ x(p) \end{bmatrix} = \begin{bmatrix} I \\ A \\ \vdots \\ A^p \end{bmatrix}x(0) + \begin{bmatrix} 0 & \cdots & \cdots & 0 \\ B & 0 & \cdots & 0 \\ AB & B & \ddots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ A^{p-1}B & \cdots & \cdots & B \end{bmatrix}\begin{bmatrix} u(0) \\ \vdots \\ u(p-1) \end{bmatrix} \eqno(6.6)$$
Denote the above as
$$\mathcal{X} = S^x x(0) + S^u\,\mathcal{U} \eqno(6.7)$$
Then the objective function can be rewritten as
$$V_p = \mathcal{X}^T\bar{Q}\mathcal{X} + \mathcal{U}^T\bar{R}\,\mathcal{U} \eqno(6.8)$$
where $\bar{Q} = \mathrm{blockdiag}\{Q, \ldots, Q, Q_t\}$ and $\bar{R} = \mathrm{blockdiag}\{R, \ldots, R\}$.
Substituting (6.7) into the objective function (6.8) yields
$$V_p = (S^x x(0) + S^u\mathcal{U})^T\bar{Q}(S^x x(0) + S^u\mathcal{U}) + \mathcal{U}^T\bar{R}\,\mathcal{U} = \mathcal{U}^T\underbrace{\left(S^{uT}\bar{Q}S^u + \bar{R}\right)}_{\mathcal{H}}\mathcal{U} - 2\underbrace{\left(-x^T(0)S^{xT}\bar{Q}S^u\right)}_{g^T}\mathcal{U} + x^T(0)S^{xT}\bar{Q}S^x x(0) \eqno(6.9)$$
By applying Lemma ?? of Chapter ??, we find that the solution that minimizes
$V_p$ is given by
$$\mathcal{U}^* = \mathcal{H}^{-1}g = -\left(S^{uT}\bar{Q}S^u + \bar{R}\right)^{-1}S^{uT}\bar{Q}S^x x_0 \eqno(6.10)$$
Also, from the same lemma, the optimal cost is
$$J_p(x_0) = -g^T\mathcal{H}^{-1}g + x_0^T S^{xT}\bar{Q}S^x x_0 = x_0^T\left[S^{xT}\bar{Q}S^x - S^{xT}\bar{Q}S^u\left(S^{uT}\bar{Q}S^u + \bar{R}\right)^{-1}S^{uT}\bar{Q}S^x\right]x_0 \eqno(6.11)$$
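For illustration, the construction (6.6)-(6.10) can be coded directly. This is our sketch, not the book's code; it simply builds $S^x$, $S^u$, $\bar{Q}$, $\bar{R}$ and evaluates (6.10):

    import numpy as np

    def lq_least_squares(A, B, Q, R, Qt, p, x0):
        """Finite-horizon LQ via the explicit prediction (6.6)-(6.10).
        Returns the stacked optimal input sequence U* of (6.10)."""
        n, m = B.shape
        # Build S^x = [I; A; ...; A^p] and the block lower-triangular S^u
        Sx = np.vstack([np.linalg.matrix_power(A, i) for i in range(p + 1)])
        Su = np.zeros(((p + 1) * n, p * m))
        for i in range(1, p + 1):            # block row i corresponds to x(i)
            for j in range(i):               # x(i) depends on u(j), j < i
                Su[i*n:(i+1)*n, j*m:(j+1)*m] = np.linalg.matrix_power(A, i-1-j) @ B
        Qbar = np.kron(np.eye(p + 1), Q)
        Qbar[-n:, -n:] = Qt                  # terminal weight
        Rbar = np.kron(np.eye(p), R)
        H = Su.T @ Qbar @ Su + Rbar
        U = -np.linalg.solve(H, Su.T @ Qbar @ Sx @ x0)
        return U.reshape(p, m)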

6.1.2 Dynamic Programming and Feedback Solution

A more elegant way to solve the same problem is by using dynamic programming.
The idea is to recast the problem as a series of one-step optimal control
problems. The following definition is useful:

Definition 1  The optimal cost for the k-step problem
$$J_k(z) = \min_{u(p-k),\ldots,u(p-1)} \sum_{i=p-k}^{p-1}\left[x^T(i)Qx(i) + u^T(i)Ru(i)\right] + x^T(p)Q_t x(p)$$
for the system x(i+1) = Ax(i) + Bu(i) and the initial condition x(p-k) = z is
called the k-step cost-to-go for z. It is the minimum cost incurred to solve the
k-step optimal control problem starting from the state z.

Because of the particular structure of the state-space system, the optimal
decision for u(p-1) depends only on x(p-1), not on x(p-2), x(p-3),
and so on. Based on this, we start with the following one-step-ahead problem
posed at time p-1:
$$J_1(x(p-1)) = \min_{u(p-1)}\left[x^T(p)S(p)x(p) + x^T(p-1)Qx(p-1) + u^T(p-1)Ru(p-1)\right] \eqno(6.12)$$
$$x(p) = Ax(p-1) + Bu(p-1); \quad S(p) = Q_t \eqno(6.13)$$
Substituting (6.13) into the objective function (6.12),
$$J_1(x(p-1)) = \min_{u(p-1)}\Big[x^T(p-1)(A^T S(p)A + Q)x(p-1) + 2x^T(p-1)A^T S(p)Bu(p-1) + u^T(p-1)(B^T S(p)B + R)u(p-1)\Big] \eqno(6.14)$$
Applying Lemma ?? of Chapter ??, the optimal solution is
$$u(p-1) = -\underbrace{(B^T S(p)B + R)^{-1}B^T S(p)A}_{L(p-1)}\,x(p-1) \eqno(6.15)$$
and the one-step cost-to-go is
$$J_1(x(p-1)) = x^T(p-1)S(p-1)x(p-1), \eqno(6.16)$$
where
$$S(p-1) = A^T S(p)A + Q - A^T S(p)B(B^T S(p)B + R)^{-1}B^T S(p)A \eqno(6.17)$$

At the next stage, consider solving the two-step-ahead problem posed at time
p-2:
$$J_2(x(p-2)) = \min_{u(p-1),u(p-2)} x^T(p)S(p)x(p) + \sum_{i=p-2}^{p-1}\left[x^T(i)Qx(i) + u^T(i)Ru(i)\right] \eqno(6.18)$$
The above is equivalent to the following one-stage problem with the cost-to-go
inherited from the previous stage:
$$J_2(x(p-2)) = \min_{u(p-2)}\left[J_1(x(p-1)) + x^T(p-2)Qx(p-2) + u^T(p-2)Ru(p-2)\right] = \min_{u(p-2)}\left[x^T(p-1)S(p-1)x(p-1) + x^T(p-2)Qx(p-2) + u^T(p-2)Ru(p-2)\right] \eqno(6.19)$$
The above is in the same form as (6.12) and therefore the optimal solution is
$$u(p-2) = -\underbrace{(B^T S(p-1)B + R)^{-1}B^T S(p-1)A}_{L(p-2)}\,x(p-2) \eqno(6.20)$$
and the two-step cost-to-go is
$$J_2(x(p-2)) = x^T(p-2)S(p-2)x(p-2), \eqno(6.21)$$
where
$$S(p-2) = A^T S(p-1)A + Q - A^T S(p-1)B(B^T S(p-1)B + R)^{-1}B^T S(p-1)A \eqno(6.22)$$
Continuing on in this manner (i.e., successively solving $J_{p-k}(x(k))$, $k = p-1, \ldots, 0$,
and propagating the cost-to-go), we obtain the following optimal sequence for
the original problem of (6.3):
$$u(k) = -L(k)x(k), \quad \text{for } k = p-1, \ldots, 0 \eqno(6.23)$$
where
$$L(k) = (B^T S(k+1)B + R)^{-1}B^T S(k+1)A \eqno(6.24)$$
and
$$S(k) = A^T S(k+1)A + Q - A^T S(k+1)B(B^T S(k+1)B + R)^{-1}B^T S(k+1)A \eqno(6.25)$$
The above equation, called the Discrete-Time Riccati Equation or Riccati Difference
Equation (RDE), is initialized with $S(p) = Q_t$ and solved backward, i.e., starting
with S(p) and solving for S(p-1), etc. The optimal cost for the p-stage
problem is
$$J_p(x_0) = x_0^T S(0)x_0 \eqno(6.26)$$

6.1.3 Comparison Of The Two Approaches

- In the dynamic programming approach, the optimal control at each time
step is given as an explicit linear function of the state at that time, i.e.,
in the form of a state feedback law. The control, when implemented as
a feedback policy rather than an open-loop trajectory, should be more
robust against disturbances and model errors. To achieve the same with
the explicit multi-step prediction based approach, one must solve the
following least squares problem for each $k = 0, \ldots, p-1$:
$$J_{p-k}(x(k)) = \min_{u(k),\ldots,u(p-1)}\left\{\sum_{i=k}^{p-1}\left[x^T(i)Qx(i) + u^T(i)Ru(i)\right] + x^T(p)Q_t x(p)\right\} \eqno(6.27)$$
This also gives the same optimal feedback relation between u(k) and x(k).

- Dynamic programming turns a multi-step problem into multiple single-step
problems, and the feedback gain (6.24) is determined through the
recursion (6.25) involving matrices of low dimension. On the other hand,
the explicit approach requires the inversion of the Hessian matrix $\mathcal{H}$,
which is of very large dimension if the horizon is large. Special algorithms
may be needed to take advantage of the structure of the Hessian matrix.

- When constraints are imposed on the inputs and/or the state, the least
squares problem does not yield an analytical solution and numerical
optimization must be employed. In this case, the dynamic programming
approach is no longer feasible. On the other hand, the least squares
approach can be made viable by using a numerical optimization strategy
to solve the constrained least squares problem $J_{p-k}(x(k))$ on-line
for the specific value of x(k) at each sample time. From the optimal input
sequence for $J_{p-k}(x(k))$, one would implement only u(k) and discard the
rest. This strategy is referred to as receding horizon control.

6.2 Infinite Horizon Problem

For continuous processes operating over a long time period it is reasonable to
solve the following infinite horizon problem:
$$J_\infty(x_0) = \min_{u(\cdot)}\left\{V_\infty = \sum_{k=0}^{\infty}\left[x^T(k)Qx(k) + u^T(k)Ru(k)\right]\right\} \eqno(6.28)$$
We will present the optimal solution and a suboptimal solution based on a
finite moving horizon. Some readers may question the value of the suboptimal
solution when the optimal solution can be readily derived. It is presented here
in order to help us formulate a workable strategy for the constrained case in
the next chapter.

6.2.1 The Optimal Solution: Asymptotic Solution of the Finite Horizon Problem

Since the prediction must be carried out to infinity, the calculation of the
optimal input sequence through the explicit least squares method becomes
impossible. On the other hand, derivation of the optimal feedback law through
the dynamic programming approach remains viable. Consider the optimal
feedback solution with $p = \infty$:
$$u(k) = -(B^T S(k+1)B + R)^{-1}B^T S(k+1)A\,x(k), \quad k = \infty, \ldots, 0 \eqno(6.29)$$
The RDE
$$S(k) = A^T S(k+1)A + Q - A^T S(k+1)B(B^T S(k+1)B + R)^{-1}B^T S(k+1)A \eqno(6.30)$$
is to be initialized with $S(\infty) = Q$ and solved backward. Let us assume for the
moment that the iteration of (6.30) converges to a solution $S_\infty$ within some
finite number of iterations. Such an $S_\infty$ would then satisfy the Algebraic Riccati
Equation (ARE)
$$S_\infty = A^T S_\infty A + Q - A^T S_\infty B(B^T S_\infty B + R)^{-1}B^T S_\infty A \eqno(6.31)$$
The optimal feedback control law is
$$u(k) = -\underbrace{(B^T S_\infty B + R)^{-1}B^T S_\infty A}_{L_\infty}\,x(k), \quad k = 0, \ldots, \infty \eqno(6.32)$$
and the optimal infinite horizon cost is
$$J_\infty(x_0) = x_0^T S_\infty x_0. \eqno(6.33)$$
This controller is referred to as the asymptotic form of the Linear Quadratic
Regulator (LQR) or the $\infty$-horizon LQR. In Section ??, we will present
conditions for the convergence of the RDE (6.30) and state some properties of the
LQR.
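In practice one rarely iterates the RDE by hand; for example, SciPy provides a solver for the ARE (6.31). The sketch below is ours, with placeholder system and weight matrices:

    import numpy as np
    from scipy.linalg import solve_discrete_are

    # Placeholder system and weights (for illustration only)
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    Q = np.eye(2)
    R = np.array([[0.1]])

    # S_inf solves the ARE (6.31); L_inf is the gain of (6.32)
    S = solve_discrete_are(A, B, Q, R)
    L = np.linalg.solve(B.T @ S @ B + R, B.T @ S @ A)
    print("closed-loop eigenvalues:", np.linalg.eigvals(A - B @ L))  # inside unit disk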

6.2.2 Suboptimal Solution: Receding Horizon Control

Repeated Solution of the Finite Horizon Problem

Another possibility is receding horizon control, in which the finite horizon
problem with a fixed-size horizon is solved at each sample time (with the
initial state of the problem given by the current state), and the first input
move from the solution is implemented as the control move at that time. That
is, at sample time k ($k = 0, 1, \ldots$), given x(k), we solve the problem $J_p(x_0)$
with $x_0 = x(k)$:
$$J_p(x(k)) = \min_{u(k),\ldots,u(k+p-1)}\Big\{\sum_{i=0}^{p-1}\left[x^T(k+i)Qx(k+i) + u^T(k+i)Ru(k+i)\right] + x^T(k+p)Q_t x(k+p)\Big\} \eqno(6.34)$$
From the solution of the above, only u(k) is actually implemented at time k;
the whole problem $J_p(x(k+1))$ is solved again at the next sample time,
and so on.

The solution to the finite horizon problem has already been discussed in
Section 6.1. Since only u(k) is implemented from the solution to $J_p(x(k))$, the
resulting control is a feedback control given by
$$u(k) = -\underbrace{(B^T S_p B + R)^{-1}B^T S_p A}_{L_p}\,x(k) \eqno(6.35)$$
where $S_p = S(1)$ is obtained iteratively from the RDE
$$S(i) = A^T S(i+1)A + Q - A^T S(i+1)B(B^T S(i+1)B + R)^{-1}B^T S(i+1)A \eqno(6.36)$$
with the initial condition $S(p) = Q_t$. Note that one can also solve $J_p(x(k))$ by
writing down the multi-step prediction equation explicitly and applying the
least squares formula. Both approaches yield the same feedback law.
Equivalence with the Infinite Horizon Problem

The feedback law resulting from the above represents, in general, only a
suboptimal solution to the original infinite horizon quadratic optimal problem,
since a finite horizon problem posed on a moving window is solved at each time
step. Under some specific assumptions and for certain choices of $Q_t$, however,
the finite horizon solution and the infinite horizon solution become equal.

- We can choose $Q_t$ such that the terminal cost $x^T(k+p)Q_t x(k+p)$ is
exactly the optimal cost from p to $\infty$, i.e.,
$$x^T(k+p)Q_t x(k+p) = \min_{u(k+p+j),\,j\ge 0}\sum_{j=0}^{\infty}\left[x^T(k+p+j)Qx(k+p+j) + u^T(k+p+j)Ru(k+p+j)\right] \eqno(6.37)$$
From the discussion of the infinite horizon optimal control policy given
in Section 6.2.1, it is clear that we can compute such a $Q_t$ by solving the
ARE
$$Q_t = A^T Q_t A + Q - A^T Q_t B(B^T Q_t B + R)^{-1}B^T Q_t A \eqno(6.38)$$
With this choice of $Q_t$, the optimal solution of $J_p(x(k))$ is equivalent to
that of $J_\infty(x(k))$.
- We may also choose $Q_t$ such that the terminal cost equals the infinite
horizon cost (from k+p to $\infty$) under the assumption that no control
action is taken beyond the horizon k+p. Then, the autonomous system
$x(k+i+1) = Ax(k+i)$, $i = p, \ldots$ describes the evolution of the state.
(Note that this assumption is meaningful only when the system is stable;
otherwise this cost is infinite.) Under this assumption, let us choose $Q_t$
such that
$$x^T(k+p)Q_t x(k+p) = \sum_{i=p}^{\infty} x^T(k+i)Qx(k+i) \eqno(6.39)$$
Then,
$$x^T(k+p)Q_t x(k+p) = x^T(k+p)Qx(k+p) + \sum_{i=p+1}^{\infty} x^T(k+i)Qx(k+i) = x^T(k+p)Qx(k+p) + x^T(k+p+1)Q_t x(k+p+1) = x^T(k+p)Qx(k+p) + x^T(k+p)A^T Q_t A\,x(k+p) \eqno(6.40)$$
From the above, we can see that $Q_t$ can be chosen as a positive semi-definite
solution of the Lyapunov equation
$$A^T Q_t A + Q = Q_t \eqno(6.41)$$
A rigorous proof is left as an exercise.

Then solving $J_p(x(k))$ is equivalent to solving $J_\infty(x(k))$ with $u(k+i) = 0$,
$i \ge p$, which we will denote as the problem $J_{\infty,p}(x(k))$:
$$J_{\infty,p}(x(k)) = \min_{u(k),\ldots,u(k+p-1)}\left\{\sum_{i=0}^{\infty} x^T(k+i)Qx(k+i) + \sum_{j=0}^{p-1} u^T(k+j)Ru(k+j)\right\} \quad \text{with } u(k+j) = 0,\ j \ge p \eqno(6.42)$$

- Let us assume again that no control is taken beyond the horizon p but
that the system is unstable. The previous formulation cannot be used
since there is no positive semi-definite solution to the Lyapunov equation
(6.41) in this case. To ensure that the infinite horizon cost is finite under
the open-loop assumption, the unstable modes of the system must be
zeroed at k+p. This can be done either through an explicit terminal
constraint in the optimization or by assigning an infinite weight to them
at k+p.

For this purpose we must first find a coordinate transformation that
separates the unstable modes $\tilde{x}_u$ from the stable modes $\tilde{x}_s$:
$$\begin{bmatrix} \tilde{x}_s(k) \\ \tilde{x}_u(k) \end{bmatrix} = \begin{bmatrix} T_1 \\ T_2 \end{bmatrix}x \eqno(6.43)$$
The states $\tilde{x}_s$ and $\tilde{x}_u$ in the transformed coordinates are related to the
original state according to
$$x = \begin{bmatrix} G_1 & G_2 \end{bmatrix}\begin{bmatrix} \tilde{x}_s(k) \\ \tilde{x}_u(k) \end{bmatrix} \eqno(6.44)$$
where the columns of $G_1$ and $G_2$ contain all the eigenvectors and generalized
eigenvectors for the stable and unstable eigenvalues, respectively.
Hence, the coordinate transformation matrix $[T_1^T\; T_2^T]^T$ can be obtained
by inverting the matrix $[G_1\; G_2]$. After the coordinate transformation,
we obtain the decoupled system
$$\begin{bmatrix} \tilde{x}_s(k+1) \\ \tilde{x}_u(k+1) \end{bmatrix} = \begin{bmatrix} A_s & 0 \\ 0 & A_u \end{bmatrix}\begin{bmatrix} \tilde{x}_s(k) \\ \tilde{x}_u(k) \end{bmatrix} + \begin{bmatrix} B_s \\ B_u \end{bmatrix}u(k) \eqno(6.45)$$
Then one can solve the Lyapunov equation for the stable subsystem only,
$$A_s^T P_0 A_s + G_1^T QG_1 = P_0 \eqno(6.46)$$
and use the terminal weighting matrix
$$Q_t = T_1^T P_0 T_1 \eqno(6.47)$$
The constraint
$$T_2 x(k+p) = 0 \eqno(6.48)$$
forces the unstable modes to be zero at the end of the finite horizon.
- Finally, we can impose the more restrictive terminal constraint that
forces the whole state x(k+p) to be zero at the end of the horizon.
Under this constraint, the infinite horizon problem is equivalent to the
finite horizon problem because the cost from time k+p to $\infty$ is zero.

Note, however, that implementing any one of the mentioned finite receding
horizon control alternatives, except the first, yields a control law that is slightly
different from the solution of $J_\infty(x(k))$.

Of course, when x(k) is not directly measurable, one can construct $\hat{x}(k|k)$
through a state estimator and use it instead of x(k) to find the control u(k).
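The following sketch (ours; the system and weights are placeholders) implements the basic receding horizon law (6.35)-(6.36) and simulates the resulting closed loop:

    import numpy as np

    def receding_horizon_gain(A, B, Q, R, Qt, p):
        """Fixed gain L_p of (6.35): iterate the RDE (6.36) from S(p)=Qt to S(1)."""
        S = Qt
        for _ in range(p - 1):                                   # S(p-1), ..., S(1)
            L = np.linalg.solve(B.T @ S @ B + R, B.T @ S @ A)
            S = A.T @ S @ A + Q - A.T @ S @ B @ L
        return np.linalg.solve(B.T @ S @ B + R, B.T @ S @ A)     # L_p from S(1)

    # Closed-loop simulation: only u(k) = -L_p x(k) is applied at each step
    A = np.array([[1.2, 0.3], [0.0, 0.9]])    # placeholder unstable system
    B = np.array([[0.0], [1.0]])
    Lp = receding_horizon_gain(A, B, np.eye(2), np.array([[1.0]]), np.eye(2), p=20)
    x = np.array([1.0, -1.0])
    for k in range(30):
        x = A @ x - (B @ (Lp @ x)).ravel()
    print("state norm after 30 steps:", round(np.linalg.norm(x), 4))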

6.3 Analysis

6.3.1 State Feedback Case

The infinite horizon and the receding horizon policies mentioned above lead
to a linear feedback law $u(k) = -Lx(k)$. Implementation on the system
$$x(k+1) = Ax(k) + Bu(k) \eqno(6.49)$$
yields the closed-loop system
$$x(k+1) = (A - BL)x(k) \eqno(6.50)$$
For any particular choice of controller design parameters we can check stability
by computing the eigenvalues of A - BL. It is desirable to establish conditions
under which the closed-loop system is stable and which do not require the
eigenvalues to be checked. This makes the search for appropriate controller
design parameters much easier.

We will show below that the infinite horizon LQR renders the closed-loop
system stable for virtually any design parameters under rather mild assumptions.
We will show in the next chapter that, in general, the various receding
horizon control policies proposed above lead to closed-loop stable systems as
long as the control objective reflects directly or indirectly (through constraints)
the infinite horizon cost.
For an analysis of the infinite horizon control problem we consider the
optimal estimation problem for the dual system
$$x(k+1) = A^T x(k) + \epsilon_1(k)$$
$$y(k) = B^T x(k) + \epsilon_2(k) \eqno(6.51)$$
where $\epsilon_1(k)$ and $\epsilon_2(k)$ are zero-mean white noise sequences with covariances
Q and R respectively. We can verify that the estimator Riccati equation (??)
for the dual system (6.51) is identical to the controller Riccati equation (6.30)
for the original system (6.49). (Since the former runs forward in time and the
latter backward, set P(0) as $S(\infty)$.) Also, by comparison we can verify that
the asymptotic optimal estimator gain (??) for the dual system is identical to
the transpose of the asymptotic optimal controller gain (6.32) for the original
system. Moreover, the observer poles of the dual system are the poles of the
closed-loop system, because the eigenvalues of $(A^T - L_\infty^T B^T)$ are the same as
the eigenvalues of $(A - BL_\infty)$.

Based on this duality, we can derive some important properties of the LQR
and the Riccati equation from the analysis of the Kalman filter in Section ??.
By applying the result from that discussion to the above dual system, we know
that, if $(B^T, A^T)$ is a detectable pair, the observer Riccati difference equation
(??) for the dual system will converge asymptotically upon iteration. This
means that, if (A, B) is a stabilizable pair, the control Riccati difference equation
(6.30) will converge.

We also know that, if $(A^T, Q^{1/2})$ is a stabilizable pair, the converged solution
is a stabilizing solution, which is the unique positive semi-definite solution
of the observer ARE (??). This means that, if $(Q^{1/2}, A)$ is a detectable pair,
the converged solution is a stabilizing solution, which is the unique positive
semi-definite solution of the controller ARE (6.31).
In summary,

If (A, B) is a stabilizable pair and $(Q^{1/2}, A)$ is a detectable pair, the
controller Riccati difference equation (6.30) with $S(\infty) \ge 0$ converges
to the unique positive (semi-)definite solution of the ARE (6.31),
and all the eigenvalues of $(A - BL_\infty)$ lie inside the unit disk.

The first condition is clearly necessary for stabilization. The second condition
means that all the unstable modes in the state space should be weighted in the
objective function in a linearly independent manner. This condition ensures
that, if the infinite horizon objective function remains bounded, $x \to 0$ and
$u \to 0$.

6.3.2 Output Feedback Case

Consider the system
$$x(k+1) = Ax(k) + Bu(k)$$
$$y(k) = Cx(k) \eqno(6.52)$$
with the output feedback controller
$$\hat{x}(k+1|k+1) = A\hat{x}(k|k) + Bu(k) + K\left(y(k+1) - C(A\hat{x}(k|k) + Bu(k))\right)$$
$$u(k) = -L\hat{x}(k|k) \eqno(6.53)$$
Rewriting the above in terms of the estimation error $x_e(k) = x(k) - \hat{x}(k|k)$
and combining the two equations, we obtain
$$\begin{bmatrix} x(k+1) \\ x_e(k+1) \end{bmatrix} = \begin{bmatrix} A - BL & BL \\ 0 & A - KCA \end{bmatrix}\begin{bmatrix} x(k) \\ x_e(k) \end{bmatrix} \eqno(6.54)$$
Note that the above equation is only one-way coupled, and the eigenvalues of
the closed-loop system's transition matrix are composed of those of (A - BL)
and (A - KCA). An important implication is that closed-loop stability of an
observer-based output feedback controller can be guaranteed by designing the
state feedback regulator component and the state estimator component to be
stable on an individual basis. In other words, there is a kind of separation
between the regulator and the observer in terms of stability for unconstrained
linear systems.
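The block-triangular structure of (6.54) is easy to verify numerically. In the sketch below (ours; the gains are placeholder values chosen to be stabilizing), the eigenvalues of the closed-loop matrix are exactly the union of those of A - BL and A - KCA:

    import numpy as np

    # Placeholder system and gains
    A = np.array([[1.1, 0.2], [0.0, 0.8]])
    B = np.array([[0.0], [1.0]])
    C = np.array([[1.0, 0.0]])
    L = np.array([[0.45, 0.55]])      # state feedback gain (assumed stabilizing)
    K = np.array([[0.6], [0.3]])      # observer gain (assumed stabilizing)

    # Closed-loop transition matrix of (6.54) in (x, x_e) coordinates
    top = np.hstack([A - B @ L, B @ L])
    bot = np.hstack([np.zeros((2, 2)), A - K @ C @ A])
    Phi = np.vstack([top, bot])
    print(sorted(abs(np.linalg.eigvals(Phi))))  # union of eig(A-BL), eig(A-KCA)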

6.4 Extension to the Stochastic Case (*)

In practice, measurements may be corrupted with random errors. Also,
unmeasured disturbances and other random errors will always be present. Hence,
it is of interest to consider the same problem posed for the stochastic system
$$x(k+1) = Ax(k) + Bu(k) + \epsilon_1(k)$$
$$y(k) = Cx(k) + \epsilon_2(k) \eqno(6.55)$$
In this section, we revisit the problems we previously considered in the
context of the above system. Readers not interested in the extension or not
versed in the language of stochastic systems may skip this section without loss
of continuity.

6.4.1 State Feedback Problems

Let us consider the stochastic system
$$x(k+1) = Ax(k) + Bu(k) + \epsilon_1(k) \eqno(6.56)$$
where $\epsilon_1$ is a zero-mean white noise sequence with covariance $R_1$. The state is
assumed to be measured fully and exactly.

Open-Loop Optimal Solution Via Least Squares

Let us examine for the above system (with initial condition $x(0) = x_0$) the
following open-loop quadratic optimal control problem:
$$J_p^o(x_0) = \min_{u(0),\ldots,u(p-1)}\left(V_p = E\left\{\sum_{k=0}^{p-1}\left[x^T(k)Qx(k) + u^T(k)Ru(k)\right] + x^T(p)Q_t x(p)\right\}\right) \eqno(6.57)$$

As before,
$$\begin{bmatrix} x(0) \\ x(1) \\ \vdots \\ x(p) \end{bmatrix} = \begin{bmatrix} I \\ A \\ \vdots \\ A^p \end{bmatrix}x(0) + \begin{bmatrix} 0 & \cdots & \cdots & 0 \\ B & 0 & \cdots & 0 \\ AB & B & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ A^{p-1}B & \cdots & \cdots & B \end{bmatrix}\begin{bmatrix} u(0) \\ \vdots \\ u(p-1) \end{bmatrix} + \begin{bmatrix} 0 & \cdots & \cdots & 0 \\ I & 0 & \cdots & 0 \\ A & I & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ A^{p-1} & \cdots & \cdots & I \end{bmatrix}\begin{bmatrix} \epsilon_1(0) \\ \vdots \\ \epsilon_1(p-1) \end{bmatrix} \eqno(6.58)$$
which will be denoted as
$$\mathcal{X} = S^x x(0) + S^u\,\mathcal{U} + S^\epsilon\,\mathcal{E} \eqno(6.59)$$

Then,
$$V_p = E\left\{\mathcal{X}^T\bar{Q}\mathcal{X} + \mathcal{U}^T\bar{R}\,\mathcal{U}\right\} = E\left\{(S^x x(0) + S^u\mathcal{U} + S^\epsilon\mathcal{E})^T\bar{Q}(S^x x(0) + S^u\mathcal{U} + S^\epsilon\mathcal{E}) + \mathcal{U}^T\bar{R}\,\mathcal{U}\right\} = (S^x x(0) + S^u\mathcal{U})^T\bar{Q}(S^x x(0) + S^u\mathcal{U}) + \mathcal{U}^T\bar{R}\,\mathcal{U} + E\left\{\mathcal{E}^T S^{\epsilon T}\bar{Q}S^\epsilon\mathcal{E}\right\} \eqno(6.60)$$
Using the property that $E\{x^T y\} = E\{\mathrm{trace}\{yx^T\}\}$ in evaluating the last
term, we obtain
$$V_p = (S^x x(0) + S^u\mathcal{U})^T\bar{Q}(S^x x(0) + S^u\mathcal{U}) + \mathcal{U}^T\bar{R}\,\mathcal{U} + \mathrm{trace}\{S^{\epsilon T}\bar{Q}S^\epsilon\bar{R}_1\} \eqno(6.61)$$
where $\bar{R}_1 = \mathrm{blockdiag}\{R_1, \ldots, R_1\}$. We note that the first two terms are the
same as those that appeared in the deterministic case (6.8). Furthermore, the
last term does not involve $\mathcal{U}$ and therefore does not affect the solution. Hence,
the optimal solution for the stochastic problem is the same as that for the
deterministic problem:
$$\mathcal{U}^* = -\left(S^{uT}\bar{Q}S^u + \bar{R}\right)^{-1}S^{uT}\bar{Q}S^x x_0 \eqno(6.62)$$
The optimal cost is
$$J_p^o(x_0) = x_0^T\left[S^{xT}\bar{Q}S^x - S^{xT}\bar{Q}S^u\left(S^{uT}\bar{Q}S^u + \bar{R}\right)^{-1}S^{uT}\bar{Q}S^x\right]x_0 + \mathrm{trace}\{S^{\epsilon T}\bar{Q}S^\epsilon\bar{R}_1\} \eqno(6.63)$$

Note that the above represents only an open-loop optimal solution. In
the case of a stochastic system, there is a clear distinction between open-loop
optimal control and closed-loop (feedback) optimal control, the former yielding
a significantly higher cost than the latter in general. To derive the optimal
feedback policy, we must take into account the effect of future measurements
on the optimization of future input moves. This problem is naturally formulated
as a stochastic dynamic program, which is discussed next.

Derivation of the Optimal Feedback Policy Via Dynamic Programming

We start with the following one-step-ahead problem posed at time p-1, as
before:
$$J_1(x(p-1)) = \min_{u(p-1)} E\{x^T(p)S(p)x(p) + x^T(p-1)Qx(p-1) + u^T(p-1)Ru(p-1)\} \eqno(6.64)$$
$$x(p) = Ax(p-1) + Bu(p-1) + \epsilon_1(p-1); \quad S(p) = Q_t \eqno(6.65)$$
Substituting the state equation in (6.65) into the objective function (6.64), we
obtain
$$J_1(x(p-1)) = \min_{u(p-1)}\Big[(Ax(p-1) + Bu(p-1))^T S(p)(Ax(p-1) + Bu(p-1)) + x^T(p-1)Qx(p-1) + u^T(p-1)Ru(p-1) + E\{\epsilon_1^T(p-1)S(p)\epsilon_1(p-1)\}\Big] \eqno(6.66)$$


The first two terms are same as those for the deterministic system. Furthermore, the last term is trace{S(p)R1 }, which is a constant. Hence, the optimal
solution is same as the deterministic case:
u(p 1) = (B T S(p)B + R)1 B T S(p)A x(p 1)
{z
}
|

(6.67)

L(p1)

The optimal cost is

J1 (x(p 1)) = xT (p 1)S(p 1)x(p 1) + trace{S(p)R1 },

(6.68)

S(p 1) = AT S(p)A + Q AT S(p)B(B T S(p)B + R)1 B T S(p)A

(6.69)

where

At the next stage, consider solving the two-step-ahead problem posed at time
p 2:

J2 (x(p 2)) = min E xT (p 1)S(p 1)x(p 1) + xT (p 2)Qx(p 2)


u(p2)

+uT (p 2)Ru(p 2) + trace{S(p)R1 } (6.70)

The above is in the same form as (6.64) and the same argument yields
T

J2 (x(p 2)) = x (p 2)S(p 2)x(p 2) +

p
X

trace{S(j)R1 }

(6.71)

j=p1

u(p 2) = (B T S(p 1)B + R)1 B T S(p 1)A x(p 2),


|
{z
}

(6.72)

L(p2)

where

S(p 2) = AT S(p 1)A + Q AT S(p 1)B(B T S(p 1)B + R)1 B T S(p 1)A
(6.73)
By induction, we obtain the optimal input sequence for the original problem
of (6.57):
u(k) = (B T S(k + 1)B + R)1 B T S(k + 1)Ax(k),

k = p 1, , 0, (6.74)

where
S(k) = AT S(k + 1)A + Q AT S(k + 1)B(B T S(k + 1)B + R)1 B T S(k + 1)A
(6.75)
with S(p) = Qt . This is the same control law we derived for the deterministic
system. The optimal cost (i.e., the achievable cost for feedback control) is
Jp (x0 ) = xT0 S(0)x0 +

p
X

trace{S(j)R1 }

j=1

This is lower than the open-loop optimal cost of Jpo (x0 ).

(6.76)


Open-Loop Optimal Feedback Control vs. Optimal Feedback Control


In the deterministic case, the dynamic programming and the multi-step prediction based least squares calculation were merely different ways of solving the
same problem. In the general stochastic case, however, the two approaches
solve conceptually different problems. With the dynamic programming approach, you solve for the optimal feedback control policy; on the other hand,
the multi-step prediction based approach gives the optimal open-loop trajectory for a given x(0).
The strategy of solving the open-loop optimal control problem repeatedly
at every time step after a feedback update, referred to as open-loop optimal
feedback control (OLOFC), yields a feedback control policy but not the optimal one in general. To derive the optimal feedback policy, we must account
for the fact that the future inputs will be optimized on the basis of future
measurements, which are stochastic. Note that, for determining u(k) at t = k,
OLOFC solves
min

u(k),,u(p1)

( p1
X

x (i)Qx(i) + u (i)Ru(i) + x (p)Qt x(p)

i=k

(6.77)

Hence, the expected benefit of having the measurements of x(k+1), , x(k+p)


in deciding u(k+1), , u(k+p1) is ignored. On the other hand, the dynamic
programming solves the nested problem of

Jp (x(k)) = min E xT (k)Qx(k) + uT (k)Ru(k) + Jp1 (x(k + 1))


u(k)

Jp1 (x(k + 1)) = min E xT (k + 1)Qx(k + 1) + uT (k + 1)Ru(k + 1) + Jp2 (x(k + 2))


u(k+1)

..
.

J1 (x(k + p 1)) =

..
.

min

u(k+p1)

E xT (k + p 1)Qx(k + p 1) + uT (k + p 1)Ru(k + p 1) + xT (k + p)Qt

Note that the min operator and the E operator do not commute in general.
In the special case of the linear quadratic problem we consider, however,
the optimal solution for the stochastic system happens to be the same as that
for the deterministic system. In other words, for the particular case of linear
quadratic optimal control, OLOFC is equivalent to the optimal feedback control.
This is an exception rather than a rule, however.
The dynamic programming solution can be easily extended to the infinite horizon problem (where p = ∞). Using a limit argument, it can be shown that the optimal feedback law for the infinite horizon stochastic problem is the ∞-horizon LQR discussed for the deterministic case ((6.31)-(6.32)).

6.4.2 Output Feedback Problems
This time we consider the case where we have only a partial, noise-corrupted measurement of the state. Hence, rather than having the measurement of the full state x, we have the measurement of the output y, which contains partial information about x:

y(k) = Cx(k) + ε_2(k)   (6.79)
It is assumed that:

1. x(0) is a Gaussian variable of mean x̄_0 and covariance R_0.

2. ε_1(k) and ε_2(k) are zero-mean white Gaussian sequences of covariances R_1 and R_2, respectively.
Optimal Output Feedback Controller
Let us look at the problem of deriving an optimal output feedback policy based on the quadratic index

V_p = E{ Σ_{i=0}^{p-1} [x^T(i)Qx(i) + u^T(i)Ru(i)] + x^T(p)Q_t x(p) }   (6.80)

The solution turns out to be a combination of the optimal regulator for the deterministic problem (LQR) and the optimal state estimator (Kalman filter):

x̂(i+1|i+1) = Ax̂(i|i) + Bu(i) + K(i+1)[y(i+1) - C(Ax̂(i|i) + Bu(i))]
u(i) = -L(i)x̂(i|i),   i = p-1, ..., 0,   (6.81)

where

L(i) = (B^T S(i+1)B + R)^{-1} B^T S(i+1)A
S(i) = A^T S(i+1)A + Q - A^T S(i+1)B(B^T S(i+1)B + R)^{-1} B^T S(i+1)A
with the initialization S(p) = Q_t   (6.82)

and

K(i+1) = P(i+1|i)C^T (CP(i+1|i)C^T + R_2)^{-1}
P(i+1|i) = AP(i|i-1)A^T + R_1 - AP(i|i-1)C^T (CP(i|i-1)C^T + R_2)^{-1} CP(i|i-1)A^T
with the initialization P(1|0) = AR_0 A^T + R_1   (6.83)

The controller state is to be initialized with x̂(0|0) = x̄_0.
Derivation of the Optimal Controller Via Dynamic Programming

The derivation is based on stochastic dynamic programming and is somewhat involved. Readers not interested in the detailed derivation may skip this section without loss of continuity.

Let's start with the following one-step-ahead output feedback problem posed at time p-1:

J_1(I_{p-1}) = min_{u(p-1)} E{ x^T(p)S(p)x(p) + x^T(p-1)Qx(p-1) + u^T(p-1)Ru(p-1) | I_{p-1} }   (6.84)

In the above, I_{p-1} represents all the information available for the feedback control calculation at t = p-1, including the prior statistics and the collected measurements (e.g., x̄_0, R_0, y(1), ..., y(p-1)). Since the statistics of x(p-1) summarize all the relevant information about the past measurements, we can choose

I_{p-1} = { x̂(p-1|p-1), P(p-1|p-1) }   (6.85)
where x̂(p-1|p-1) and P(p-1|p-1) are the conditional mean and covariance of x(p-1). These quantities can be computed using a Kalman filter started with the initial condition x̂(0|0) = x̄_0 and P(0|0) = R_0. Note that

x̂(p|p-1) = Ax̂(p-1|p-1) + Bu(p-1)   (6.86)

Substituting this into the objective function, we obtain

J_1(I_{p-1}) = min_{u(p-1)} { (Ax̂(p-1|p-1) + Bu(p-1))^T S(p) (Ax̂(p-1|p-1) + Bu(p-1))
              + x̂^T(p-1|p-1)Qx̂(p-1|p-1) + u^T(p-1)Ru(p-1) }
              + E{ x_e^T(p)S(p)x_e(p) + x_e^T(p-1)Qx_e(p-1) | I_{p-1} }   (6.87)

where x_e(p) = x(p) - x̂(p|p-1) and x_e(p-1) = x(p-1) - x̂(p-1|p-1), which are zero-mean and uncorrelated with x̂(p|p-1) and x̂(p-1|p-1), respectively. The first three terms are the same as those that appeared for the deterministic system, but with x(p-1) replaced by the mean value x̂(p-1|p-1). The last two terms, trace{S(p)P(p|p-1)} and trace{QP(p-1|p-1)} respectively, are merely constants and do not affect the solution. Hence, the optimal solution has the same form as in the deterministic case:

u(p-1) = -(B^T S(p)B + R)^{-1} B^T S(p)A x̂(p-1|p-1) = -L(p-1) x̂(p-1|p-1)   (6.88)

The optimal cost is

J_1(I_{p-1}) = x̂^T(p-1|p-1)S(p-1)x̂(p-1|p-1) + trace{QP(p-1|p-1)} + trace{S(p)P(p|p-1)},   (6.89)

where

S(p-1) = A^T S(p)A + Q - A^T S(p)B(B^T S(p)B + R)^{-1} B^T S(p)A   (6.90)

At the next stage, consider solving the two-step-ahead problem posed at time p-2:

J_2(I_{p-2}) = min_{u(p-2)} E{ J_1(I_{p-1}) + x^T(p-2)Qx(p-2) + u^T(p-2)Ru(p-2) | I_{p-2} }
            = min_{u(p-2)} E{ x̂^T(p-1|p-1)S(p-1)x̂(p-1|p-1) + x^T(p-2)Qx(p-2)
              + u^T(p-2)Ru(p-2) + trace{QP(p-1|p-1)}
              + trace{S(p)(AP(p-1|p-1)A^T + R_1)} | I_{p-2} }   (6.91)

In writing the last term, we used the fact that P(p|p-1) = AP(p-1|p-1)A^T + R_1. Now, note that

x̂(p-1|p-1) = Ax̂(p-2|p-2) + Bu(p-2) + K(p-1)e(p-1),   e(p-1) := y(p-1) - Cx̂(p-1|p-2)   (6.92)
Note that the innovation term e(p-1) has the property that E{e(p-1) | I_{p-2}} = 0, and it is uncorrelated with x̂(p-2|p-2) and u(p-2). Substituting the above and evaluating the expectation yields

J_2(I_{p-2}) = min_{u(p-2)} { (Ax̂(p-2|p-2) + Bu(p-2))^T S(p-1) (Ax̂(p-2|p-2) + Bu(p-2))
              + x̂^T(p-2|p-2)Qx̂(p-2|p-2) + u^T(p-2)Ru(p-2) }
              + E{ e^T(p-1)K^T(p-1)S(p-1)K(p-1)e(p-1) + x_e^T(p-2)Qx_e(p-2) | I_{p-2} }
              + trace{QP(p-1|p-1)} + trace{S(p)(AP(p-1|p-1)A^T + R_1)}
Evaluating the first term in the expectation, it can be shown that

E{ e^T(p-1)K^T(p-1)S(p-1)K(p-1)e(p-1) | I_{p-2} }
  = trace{ S(p-1)K(p-1) E{e(p-1)e^T(p-1) | I_{p-2}} K^T(p-1) }
  = trace{ S(p-1)(AP(p-2|p-2)A^T + R_1 - P(p-1|p-1)) }   (6.93)
The proof of this step is left as an exercise. Substituting this into the previous expression,

J_2(I_{p-2}) = min_{u(p-2)} { (Ax̂(p-2|p-2) + Bu(p-2))^T S(p-1) (Ax̂(p-2|p-2) + Bu(p-2))
              + x̂^T(p-2|p-2)Qx̂(p-2|p-2) + u^T(p-2)Ru(p-2) }
              + trace{S(p-1)(AP(p-2|p-2)A^T + R_1)} + trace{QP(p-2|p-2)}
              - trace{S(p-1)P(p-1|p-1)}
              + trace{S(p)(AP(p-1|p-1)A^T + R_1)} + trace{QP(p-1|p-1)}

The above is in the same form as J_1(I_{p-1}) except for the constant terms, which do not affect the solution. Therefore,

u(p-2) = -(B^T S(p-1)B + R)^{-1} B^T S(p-1)A x̂(p-2|p-2) = -L(p-2) x̂(p-2|p-2)   (6.94)

In addition,

J_2(I_{p-2}) = x̂^T(p-2|p-2)S(p-2)x̂(p-2|p-2)
              + Σ_{j=p-1}^{p} [ trace{S(j)(AP(j-1|j-1)A^T + R_1)} + trace{QP(j-1|j-1)} ]
              - trace{S(p-1)P(p-1|p-1)},   (6.95)

where

S(p-2) = A^T S(p-1)A + Q - A^T S(p-1)B(B^T S(p-1)B + R)^{-1} B^T S(p-1)A   (6.96)

and

P(p-1|p-1) = P(p-1|p-2) - P(p-1|p-2)C^T (CP(p-1|p-2)C^T + R_2)^{-1} CP(p-1|p-2)
P(p-1|p-2) = AP(p-2|p-2)A^T + R_1   (6.97)
By induction, the previously shown optimal feedback policy is easily derived. The optimal cost (i.e., the lowest achievable cost through output feedback control) is

J_p(x_0) = x̂^T(0|0)S(0)x̂(0|0)
          + Σ_{j=1}^{p} [ trace{S(j)(AP(j-1|j-1)A^T + R_1)} + trace{QP(j-1|j-1)} ]
          - Σ_{ℓ=1}^{p-1} trace{S(ℓ)P(ℓ|ℓ)}   (6.98)
Extension To The Infinite Horizon Case: LQG Controller and Separation Principle

This result can be extended straightforwardly to the infinite horizon case. The appropriate objective function for this case is

V = lim_{p→∞} E{ (1/p) Σ_{k=0}^{p} [x^T(k)Qx(k) + u^T(k)Ru(k)] }   (6.99)

We divide the objective function by p to keep the infinite horizon cost bounded. Hence, by minimizing V, we minimize the steady-state value of E{x^T Qx + u^T Ru}.

Taking the limit of the solution to the finite horizon problem readily yields the following output feedback law made up of the Kalman filter (the optimal state estimator) and the ∞-horizon LQR (the optimal state-feedback regulator for the deterministic problem):

x̂(k|k) = (A - BL_∞)x̂(k-1|k-1) + K(k)[y(k) - C(A - BL_∞)x̂(k-1|k-1)]
u(k) = -L_∞ x̂(k|k)   (6.100)

Since K(k) often converges quickly to a steady-state solution K_∞, the above can be implemented as the linear time-invariant controller

x̂(k|k) = (A - BL_∞)x̂(k-1|k-1) + K_∞[y(k) - C(A - BL_∞)x̂(k-1|k-1)]
u(k) = -L_∞ x̂(k|k),   (6.101)

where

K_∞ = P_∞ C^T (CP_∞ C^T + R_2)^{-1}   (6.102)

The above is the celebrated Linear Quadratic Gaussian (LQG) controller.

In summary, for the linear quadratic Gaussian control problem, the optimal output feedback controller is simply a composition of the optimal state
estimator and the optimal state feedback regulator for the deterministic case.
This result is referred to as the separation principle.
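Computationally, the separation principle means the two gains can be obtained independently. The following sketch assembles the steady-state controller (6.101)-(6.102) using scipy; the function names are ours, and (A, B) is assumed stabilizable and (A, C) detectable so that both Riccati equations admit stabilizing solutions.

    # A sketch of the steady-state LQG controller (6.101)-(6.102).
    import numpy as np
    from scipy.linalg import solve_discrete_are

    def lqg_gains(A, B, C, Q, R, R1, R2):
        S = solve_discrete_are(A, B, Q, R)                  # regulator Riccati eq.
        L = np.linalg.solve(B.T @ S @ B + R, B.T @ S @ A)   # L_inf
        P = solve_discrete_are(A.T, C.T, R1, R2)            # dual (filter) Riccati
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R2)       # K_inf, (6.102)
        return L, K

    def lqg_step(xhat, y, A, B, C, L, K):
        """One update of (6.101): since u(k-1) = -L xhat(k-1|k-1), the one-step
        prediction A xhat + B u(k-1) equals (A - B L) xhat."""
        xpred = (A - B @ L) @ xhat
        xhat = xpred + K @ (y - C @ xpred)                  # measurement update
        return xhat, -L @ xhat                              # u(k) = -L xhat(k|k)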

6.4.3 Analysis
The system

x(k+1) = Ax(k) + Bu(k) + ε_1(k)
y(k) = Cx(k) + ε_2(k)   (6.103)

with controller

x̂(k+1|k+1) = Ax̂(k|k) + Bu(k) + K[y(k+1) - C(Ax̂(k|k) + Bu(k))]
u(k) = -Lx̂(k|k)   (6.104)
gives

[ x(k+1)   ]       [ x(k)   ]
[ x_e(k+1) ] = Φ [ x_e(k) ] + Γ_1 ε_1(k) + Γ_2 ε_2(k+1)   (6.105)

with

Φ = [ A - BL    BL      ]     Γ_1 = [ I      ]     Γ_2 = [ 0  ]
    [ 0         A - AKC ],          [ I - KC ],          [ -K ]
The above equation is only one-way coupled, so the eigenvalues of the overall closed-loop system's transition matrix are the eigenvalues of (A - BL) together with the eigenvalues of (A - AKC). Hence, if these eigenvalues are all stable (i.e., inside the unit disk), x and x_e approach zero-mean stationary sequences (with finite variances). In fact,

E{ [x(k); x_e(k)] [x(k); x_e(k)]^T } = Σ   (6.106)
where Σ is the positive (semi-)definite solution to the Lyapunov equation

Φ Σ Φ^T + Γ_1 R_1 Γ_1^T + Γ_2 R_2 Γ_2^T = Σ   (6.107)

Since the optimal feedback controller of the LQG problem consists of the
Kalman filter and the LQR, each of which is stable under the mild assumptions
stated earlier, the LQG controller is closed-loop stable under the same mild
assumptions.
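A numerical counterpart of this analysis is a sketch that builds the block matrices of our reconstruction of (6.105) and solves the Lyapunov equation (6.107) for the stationary covariance; the gains L and K could come from lqg_gains above.

    # A sketch computing the stationary covariance Sigma of (6.106)-(6.107).
    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    def closed_loop_covariance(A, B, C, L, K, R1, R2):
        n = A.shape[0]
        Phi = np.block([[A - B @ L, B @ L],
                        [np.zeros((n, n)), A - A @ K @ C]])
        G1 = np.vstack([np.eye(n), np.eye(n) - K @ C])      # Gamma_1 (eps_1)
        G2 = np.vstack([np.zeros((n, K.shape[1])), -K])     # Gamma_2 (eps_2)
        W = G1 @ R1 @ G1.T + G2 @ R2 @ G2.T
        return solve_discrete_lyapunov(Phi, W)              # Phi Sig Phi' + W = Sig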

Bibliography
1. Work on LQR. Kalman's original paper.

2. Work on LQG, separation principle. Cite the main paper. A good treatment of both topics can be found in Kwakernaak and Sivan's ...

3. A good discussion on various stochastic optimal control strategies and stochastic dynamic programming can be found in ??? by Bertsekas ...

Examples to Include
1. NIKET: Simple state regulation example - stable system, unstable system. Use a 2nd order plus deadtime SISO system and a double integrator.

2. NIKET: Output regulation in the presence of constant disturbances in the output and setpoint changes. Use the same 2nd order plus deadtime SISO system but with a constant disturbance and setpoint changes. Output regulation instead of state regulation. Use the idea below.

Suppose y(k) = Cx(k) + d where d is some constant unknown disturbance. Let us pose the problem of regulating the output at the setpoint r as the minimization of

Σ_{i=0}^{p-1} [e^T(i)Q_e e(i) + Δu^T(i)RΔu(i)] + e^T(p)Q_{et} e(p)   (6.108)

where e = y - r. Now we will formulate this as a standard LQ optimal control problem (one of state regulation). One can, for instance, write the system model in velocity form as

[ Δx(k+1) ]   [ A   0 ] [ Δx(k) ]   [ B  ]
[ e(k+1)  ] = [ CA  I ] [ e(k)  ] + [ CB ] Δu(k)   (6.109)

where e(k) = y(k) - r. Denoting the above system as

z(i+1) = Φz(i) + ΓΔu(i)   (6.110)


one can solve the state regulation problem minimizing

Σ_{i=0}^{p-1} [z^T(i)Qz(i) + Δu^T(i)RΔu(i)] + z^T(p)Q_t z(p)   (6.111)

where

Q = [ 0  0    ]        Q_t = [ 0  0      ]
    [ 0  Q_e  ],             [ 0  Q_{et} ]   (6.112)

3. NIKET: Simple stochastic output feedback problem. Use the 2nd-order-plus-deadtime system as before. Compare the ∞-horizon cost of open-loop optimal control and closed-loop optimal control. Compare the values obtained from the formula we derived with the cost computed from the solution of the Lyapunov equation (6.107). Finally, do 100 Monte-Carlo simulations, compute the average cost, and compare.

4. NIKET: LQG design for the 4-block output feedback problem. Distillation column, temperature measured, compositions controlled, feed flowrate and composition disturbances (integrated white noise), L and V manipulated inputs.

Exercises to Give
1. For the linear time-invariant deterministic system we studied, derive the quadratic optimal feedback controller for a given time-varying reference trajectory {r(k), k = 1, ..., p} within a finite horizon. Use the dynamic programming approach.

2. Prove that the solution to equation (6.41) indeed provides the infinite horizon cost as stated.
Solution:

E{ e^T(p-1)K^T(p-1)S(p-1)K(p-1)e(p-1) | I_{p-2} }
 = trace{ S(p-1)K(p-1) E{e(p-1)e^T(p-1) | I_{p-2}} K^T(p-1) }
 = trace{ S(p-1)K(p-1)(CP(p-1|p-2)C^T + R_2)K^T(p-1) }
 = trace{ S(p-1)P(p-1|p-2)C^T (CP(p-1|p-2)C^T + R_2)^{-1} CP(p-1|p-2) }
 = trace{ S(p-1)(P(p-1|p-2) - P(p-1|p-1)) }
 = trace{ S(p-1)(AP(p-2|p-2)A^T + R_1 - P(p-1|p-1)) }   (6.113)

3. NIKET: (Give a stochastic state-space system.) Compute the open-loop optimal cost and feedback optimal cost using the formulae given in the book. Compare.

4. Derive an expression for the optimal cost for the infinite horizon LQG
problem.
5. NIKET: (Give an infinite horizon output feedback problem for a stochastic system.) Derive the optimal feedback (LQG) controller. Compute the optimal cost using the formula you obtained above. Compute the same cost by writing the closed-loop equation for the obtained controller and solving the Lyapunov equation (6.107). Finally, perform the Monte-Carlo simulation with the controller and compute the average cost.
6. Show that Σ from solving (6.107)

Φ Σ Φ^T + Γ_1 R_1 Γ_1^T + Γ_2 R_2 Γ_2^T = Σ   (6.114)

indeed represents the closed-loop covariance.

Chapter 7

Constrained Quadratic Optimal Control
In this chapter, we study the linear quadratic optimal control of constrained systems. We consider the same problems as those studied in the previous chapter, but with constraints on the inputs and the state variables. Both the control computation and the closed-loop analysis become greatly complicated when constraints are added. This is because both the system and the resulting control law are nonlinear, and the closed-loop system no longer lends itself to simple linear analysis methods.

7.1 Finite Horizon Problem
Let us consider the same finite horizon optimal control problem as the one in Section 6.1,

min_{u(0),...,u(p-1)} V_p = Σ_{k=0}^{p-1} [x^T(k)Qx(k) + u^T(k)Ru(k)] + x^T(p)Q_t x(p)   (7.1)

for the deterministic system

x(k+1) = Ax(k) + Bu(k),   k = 0, ..., p-1   (7.2)

with initial condition

x(0) = x_0,   (7.3)

except that now we constrain the state and inputs to lie within some feasible sets, X_fsb and U_fsb:

x(k) ∈ X_fsb,   k = 1, ..., p   (7.4)

u(k) ∈ U_fsb,   k = 0, ..., p-1   (7.5)

In the simplest case upper and lower bounds would be imposed on x and u.
In general, we will assume that the sets are convex, compact and include the
origin as an interior point so that the origin is a feasible stationary point for
the system. We also assume that the feasible sets are time invariant. The
resulting problem is a constrained least squares problem and no longer yields
a simple analytical solution. A numerical technique must be used. For the
convenience of presentation, we will refer to the above finite horizon optimal
control problem as Jp (x0 ); the same notation will also be used to denote the
optimal cost.
Dynamic programming is not a practically feasible solution method in this case. Numerical solution through dynamic programming requires discretization of the state space and computation / storage of the optimal cost in the discretized state space at each stage, as one marches backward from the last stage to the first stage. Obviously the number of discrete points increases exponentially with the state dimension and the computational load for this approach quickly becomes intractable. This is referred to as the curse of dimensionality.
A more practical alternative is to use the explicit multi-step-prediction-based least squares formulation introduced in Section 6.1. Substituting the expression for the prediction (6.7) into the objective function in (6.8) and the constraints in (7.4)-(7.5) yields the following convex program, which can be solved numerically for a particular x_0:

min_U U^T (S_u^T Q̄ S_u + R̄) U + 2U^T S_u^T Q̄ S_x x_0   (7.6)

with

S_x x_0 + S_u U ∈ X_fsb × ... × X_fsb   (7.7)

U ∈ U_fsb × ... × U_fsb   (7.8)

If the constraints can be expressed as linear inequalities, the above is a quadratic program (QP), for which efficient off-the-shelf solvers are available.
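As a concrete illustration (a sketch only), the program (7.6)-(7.8) with simple bounds might be set up as follows; S_x and S_u are the prediction matrices of (6.7), rebuilt here, and cvxpy stands in for any off-the-shelf QP solver. The formulation below keeps the stacked state X as an intermediate expression, which is equivalent to (7.6) up to a constant term.

    # A sketch of the condensed QP (7.6)-(7.8) for box constraints
    # |x| <= x_max, |u| <= u_max, using cvxpy as one possible solver.
    import numpy as np
    import cvxpy as cp

    def prediction_matrices(A, B, p):
        """S_x stacks A, A^2, ..., A^p; S_u maps the input sequence to x(1..p)."""
        n, m = B.shape
        Sx = np.vstack([np.linalg.matrix_power(A, i) for i in range(1, p + 1)])
        Su = np.zeros((n * p, m * p))
        for i in range(p):
            for j in range(i + 1):
                Su[i*n:(i+1)*n, j*m:(j+1)*m] = np.linalg.matrix_power(A, i - j) @ B
        return Sx, Su

    def solve_mpc_qp(A, B, Q, R, Qt, p, x0, x_max, u_max):
        n, m = B.shape
        Sx, Su = prediction_matrices(A, B, p)
        Qbar = np.kron(np.eye(p), Q)
        Qbar[-n:, -n:] = Qt                          # terminal weight on x(p)
        Rbar = np.kron(np.eye(p), R)
        U = cp.Variable(m * p)
        X = Sx @ x0 + Su @ U                         # stacked predictions x(1..p)
        cost = cp.quad_form(X, Qbar) + cp.quad_form(U, Rbar)
        cons = [cp.abs(X) <= np.tile(x_max, p), cp.abs(U) <= np.tile(u_max, p)]
        cp.Problem(cp.Minimize(cost), cons).solve()
        return U.value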
For a given x_0, one can solve the optimization to obtain an open-loop optimal input trajectory. However, feedback control may be preferred to correct for disturbances and model errors. For feedback control, one can conceivably re-solve the problem on-line at each time step, i.e., at each k = 0, ..., p-1, we solve J_{p-k}(x_0) with x_0 = x(k) and then set u(k) equal to the first element of the computed optimal input sequence.
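In code, this receding-horizon strategy is just a loop around the QP sketched above (again illustrative; n_sim and the disturbance-free plant update are our assumptions):

    # Receding horizon control: re-solve the QP each step, apply the first move.
    x = x0.copy()
    for k in range(n_sim):
        U_opt = solve_mpc_qp(A, B, Q, R, Qt, p, x, x_max, u_max)
        u = U_opt[:B.shape[1]]          # first element of the optimal sequence
        x = A @ x + B @ u               # plant update (disturbance-free here)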
Another option is to solve the above optimization off-line in terms of parameter x0 via parametric programming. This gives an explicit feedback control
law, which is generally quite complex and not always conveniently characterized. However, there are results in parametric quadratic programming that
can be used to generate an explicit form of the feedback law for low-order
systems, as we shall show later.

7.2 Infinite Horizon Problem
Now consider the infinite horizon problem of

min_{u(0),...,u(∞)} Σ_{k=0}^{∞} [x^T(k)Qx(k) + u^T(k)Ru(k)]   (7.9)

for the same deterministic system with initial condition x_0 and constraints

x(k) ∈ X_fsb,   k = 1, ..., ∞   (7.10)

u(k) ∈ U_fsb,   k = 0, ..., ∞   (7.11)

The infinite-horizon optimal control problem cannot be solved directly, as the optimization problem is infinite-dimensional. Hence, we may adopt the suboptimal strategy of finite receding horizon control, in which J_p(x_0) (defined by (7.1)-(7.5)) with x_0 = x(k) is solved at each time step k to determine u(k). Note that with the initial condition set as x(k), we solve

min_{u(k),...,u(k+p-1)} Σ_{i=0}^{p-1} [x^T(k+i)Qx(k+i) + u^T(k+i)Ru(k+i)] + x^T(k+p)Q_t x(k+p)   (7.12)

subject to the model equation constraint and

x(k+i) ∈ X_fsb,   i = 1, ..., p   (7.13)

u(k+i) ∈ U_fsb,   i = 0, ..., p-1   (7.14)

The feedback law can be expressed as u(k) = u*(x(k)), where u*(·) represents the (nonlinear) operator that relates the optimal value of u(k) to the given initial condition x(k) for the above program. To implement the feedback strategy, at each sample time k, one needs to solve the convex program on-line for the given x(k), or develop an explicit solution in terms of the parameter x(k) via off-line parametric programming.
The choices of the horizon length (p) and the terminal weighting matrix (Q_t) prove to be crucial for the stability and performance of the feedback control. We have already seen that the infinite horizon formulation results in some nice properties in the unconstrained case, and we will show this also for the constrained case in Section 7.5.

These properties suggest formulating the finite horizon problem in a manner such that its solution is similar to that of the infinite horizon problem. We can then expect the finite moving horizon problem to inherit some of the stability and performance properties of the infinite horizon problem. Following closely the discussion of the unconstrained case, we will suggest various approaches.

Option 1 A  Because the origin was assumed to be in the interior of the feasible state and input set, the optimal control will become unconstrained as the system approaches the origin. Therefore, if we choose p large enough, we can solve the constrained infinite horizon problem exactly in the following manner.

- Choose the terminal weight Q_t as the solution of the Riccati equation

  A^T Q_t A + Q - A^T Q_t B(B^T Q_t B + R)^{-1} B^T Q_t A = Q_t

  This way, the terminal cost reflects the unconstrained infinite horizon cost.

- Choose p sufficiently large so that the solution yields a terminal state x(k+p) that satisfies

  u(k+i) = -L_∞ x(k+i) ∈ U_fsb,   i ≥ p   (7.15)

  and

  x(k+i) ∈ X_fsb,   i ≥ p   (7.16)

  under the system equation

  x(i+1) = (A - BL_∞)x(i)   (7.17)

Then, trivially, the finite horizon solution is equivalent to the solution to the original constrained infinite horizon problem. There are methods available for computing such a choice of p, which depends on the size of x(k). (See the Bibliography for some references.)
Option 1 B  One problem with the above option is that such a choice of p may be difficult to calculate a priori (since it depends on the current state) and may also be quite large, causing numerical difficulties. We can alleviate this problem somewhat by incorporating into the finite horizon problem the following terminal constraint (which is equivalent to (7.15)-(7.16)):

x(k+p) ∈ X_ma   (7.18)

where X_ma is defined as the largest set such that

x(0) ∈ X_ma  ⇒  x(i) ∈ X_fsb and u(i) ∈ U_fsb for i ≥ 0

for the system

x(i+1) = Ax(i) + Bu(i),   u(i) = -L_∞ x(i)   for i ≥ 0.

Such a set is called the maximal output admissible set and can be calculated a priori.

Now p has to be chosen only so that the terminal constraint (7.18) is feasible. Forcing such an artificial constraint in the optimization, however, can make the solution different from the solution to the original infinite horizon problem. In other words, the equivalence holds only when the terminal constraint is inactive. A sketch of how X_ma can be computed is given below.
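For linear inequality descriptions X_fsb = {x : F_x x ≤ g_x} and U_fsb = {u : F_u u ≤ g_u}, X_ma can be computed by an iterative constraint-propagation scheme in the spirit of Gilbert and Tan (cited in the Bibliography). The following is a sketch under the assumption that the constraint set is bounded; applying it to the closed loop of Option 1 B means A_cl = A - BL_∞ with F = [F_x; -F_u L_∞] and g = [g_x; g_u] stacked.

    # A sketch of computing the maximal output admissible set for x+ = Acl x
    # subject to F x <= g: keep adding the propagated constraints
    # F Acl^k x <= g until the next batch is redundant.
    import numpy as np
    from scipy.optimize import linprog

    def maximal_output_admissible(Acl, F, g, kmax=100, tol=1e-9):
        Fall, gall = F.copy(), g.copy()
        M = F @ Acl                     # candidate constraints one step ahead
        for _ in range(kmax):
            redundant = True
            for f, gi in zip(M, g):
                res = linprog(-f, A_ub=Fall, b_ub=gall, bounds=(None, None))
                if res.status != 0 or -res.fun > gi + tol:
                    redundant = False
                    break
            if redundant:
                return Fall, gall       # X_ma = {x : Fall x <= gall}
            Fall = np.vstack([Fall, M])
            gall = np.concatenate([gall, g])
            M = M @ Acl
        raise RuntimeError("no finite determination within kmax steps")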
Option 2 A  Rather than assuming the unconstrained control to be optimal after p, we can require, for open-loop stable systems, that p be chosen long enough so that the system state remains in the feasible region without control.

- Following the discussion given in Section 6.2.2, we choose the terminal weight Q_t as the solution of the Lyapunov equation

  A^T Q_t A + Q = Q_t   (7.19)

  so that the terminal cost reflects the unconstrained infinite horizon cost for the open-loop system

  x(i+1) = Ax(i).   (7.20)

- Choose p sufficiently large so that the solution yields a terminal state x(k+p) that satisfies x(k+p) ∈ X̃_ma, where X̃_ma is defined as the maximal set such that

  x(0) ∈ X̃_ma  ⇒  x(i) ∈ X_fsb for i ≥ 0   (7.21)

  for the open-loop system (7.20).


Option 2 B  As before, the constraint x(k+p) ∈ X̃_ma can be explicitly forced in the optimization. p now has to be chosen just large enough that this terminal constraint is feasible. Though possible, it may not always be easy to characterize X̃_ma explicitly. Equivalently, the constraint x(k+p) ∈ X̃_ma can be replaced by x(k+p+i) ∈ X_fsb, i = 0, ..., h_c, with h_c chosen sufficiently large so that

x(i) ∈ X_fsb, i = 0, ..., h_c  ⇒  x(i) ∈ X_fsb, i > h_c   (7.22)

for the open-loop stable system (7.20). Calculation of such an h_c may prove to be easier in certain cases.

Note that, under this formulation, no terminal constraint is necessary when there is no state constraint. This is an advantage over Options 1 A and 1 B, which require the terminal constraint to enforce the input constraints beyond the horizon.
Option 2 C  Options 2 A-B cannot be used when the system is unstable, since there is no positive semi-definite solution to the Lyapunov equation in this case. To ensure feasibility and a finite infinite-horizon cost under the open-loop assumption, the unstable modes of the system must be zeroed at k+p. As described for unconstrained systems in Section 6.2.2, we can perform a coordinate transformation to separate the state into stable and unstable modes, x̃_s and x̃_u respectively. One can then compute the terminal weighting Q_t for x̃_s by solving the Lyapunov equation (6.46) and adding the constraint x̃_u(k+p) = 0. Hence, for unstable systems, the terminal constraint x(k+p) ∈ X̃_ma must be modified to include the zeroing of the unstable modes. Note that the system must be stabilizable in order for such a constraint to be feasible.

Option 3  Increasing the terminal weight for the finite horizon problem (Q_t → ∞·I) has the same effect as imposing an extra end-point or terminal equality constraint x(k+p) = 0. If this terminal constraint is used in the infinite horizon formulation, all the terms for times k+p, k+p+1, etc. in the objective function will be zero, and the infinite horizon problem and the finite horizon problem become equivalent. The system must be controllable for this constraint to be feasible, however.
It is worthwhile to note that the options are listed in order of increasing restrictiveness and lead to increasing deviations from the optimal solution of the constrained infinite horizon problem. The difficulty inherent in Options 1 A and 2 A is that one needs to compute p, and that p can be very large, leading to computational problems. This problem is alleviated somewhat in Options 1 B and 2 B, but p must still be chosen sufficiently large to ensure the feasibility of the terminal constraint. In Options 2 C and 3, some or all of the state variables need to be zeroed at the end of the prediction horizon. Equality constraints are more difficult to handle from a numerical standpoint, and infeasibility can arise if the control horizon is not sufficiently large.

7.3 Constraint Softening
With state constraints, it is possible for the optimization problem, Jp (x0 ), to


become infeasible. For instance, a large disturbance may enter the system
at a particular time instant, which makes it impossible to keep the state in
the feasible region throughout the prediction horizon. Obviously, a real-time
algorithm must not fail in this trivial manner and some relaxation scheme must
be introduced. In most practical situations, input limits are hard, i.e. they
may not be exceeded, while state constraints can be violated occasionally albeit
with some undesirable consequences for the controlled system. Therefore, the
state constraints are often softened by introducing slack variables which are
kept small by adding a corresponding penalty term to the objective.

For example, state constraints of the form

F x(k) ≤ g   (7.23)

would be relaxed to

F x(k) ≤ g + ε(k),   (7.24)

where the vector ε(k) denotes the positive slack variables expressing the constraint relaxation or violation. In order to keep the violations small, the objective function is augmented by a term penalizing the violations:
min_{u(0),...,u(p-1),ε(0),...,ε(p-1)} V_p = Σ_{i=0}^{p-1} [x^T(i)Qx(i) + u^T(i)Ru(i) + ε^T(i)Q_ε(i)ε(i)] + x^T(p)Q_t x(p)   (7.25)
The positive definite weighting matrix Q_ε(i) determines how much the various violations are penalized. The suggested form allows the designer to penalize the square of the violations in each constraint at each time differently and thus offers maximum freedom. This, however, comes at the price of introducing many additional tuning variables into the optimization. If one chooses the vector ε to be constant rather than varying with time, for each constraint only the worst violation over the horizon is penalized. If the same scalar ε(k) is used for each constraint, then at each time step only the worst of the constraint violations is penalized. By choosing the form of ε and the magnitude of the elements of Q_ε appropriately, a proper trade-off among computational complexity, magnitude of constraint violation, and its duration can be achieved.
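A sketch of how (7.24)-(7.25) might be set up on top of the condensed formulation of Section 7.1 follows; it uses time-varying slacks ε(i) with a common weight Q_ε (one of the choices discussed above), hard input bounds, and cvxpy again as a stand-in solver.

    # Soft state constraints per (7.24)-(7.25): slacks on F x(k+i) <= g,
    # hard input limits. Uses prediction_matrices from the earlier sketch.
    import numpy as np
    import cvxpy as cp

    def solve_soft_mpc_qp(A, B, Q, R, Qt, Qeps, p, x0, F, g, u_max):
        n, m = B.shape
        Sx, Su = prediction_matrices(A, B, p)
        U = cp.Variable(m * p)
        Eps = cp.Variable((F.shape[0], p), nonneg=True)     # slack eps(i) >= 0
        Xvec = Sx @ x0 + Su @ U
        cost = cp.quad_form(U, np.kron(np.eye(p), R))
        cons = [cp.abs(U) <= np.tile(u_max, p)]             # inputs stay hard
        for i in range(p):
            xi = Xvec[i*n:(i+1)*n]
            Qi = Qt if i == p - 1 else Q
            cost += cp.quad_form(xi, Qi) + cp.quad_form(Eps[:, i], Qeps)
            cons += [F @ xi <= g + Eps[:, i]]               # relaxed (7.24)
        cp.Problem(cp.Minimize(cost), cons).solve()
        return U.value, Eps.value

Because the slacks are nonnegative and penalized, they remain zero whenever the hard problem is feasible and cheap enough to satisfy, which connects to the trade-off discussed next.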

Note that in the optimization, performance and constraint violation are traded off depending on the weights Q, R, Q_t and Q_ε(i). Thus it may happen that a constraint is violated (because satisfying it would be too expensive) even though a feasible solution without constraint violation exists.
As before, in the case that the initial state vector is unknown, one can
couple a state-feedback control law with a state estimator.

7.4 Derivation of An Explicit Form of Optimal Control Law Via Multi-Parametric Programming

Refer to the paper by Bemporad et al.
7.5 Analysis

In this section, we examine the stability of the finite receding horizon control law. We examine both the state feedback case and the output feedback case.

For our analysis to be meaningful, we must assume that the underlying system
is stabilizable. For a constrained system, the stabilizability can depend on both
the system model and the initial condition. If the system contains unstable
modes, the system may not be globally stabilizable. In this case, we must
assume that the initial condition lies inside the domain of attraction.

7.5.1 Stability Concepts and Lyapunov's Direct Method

For linear systems, a single definition of stability sufficed. For nonlinear systems, different definitions exist. We consider here the definitions of Lyapunov stability (or stability in the sense of Lyapunov), asymptotic stability, and exponential stability. We also review Lyapunov's direct method, which is useful for analyzing the stability of nonlinear systems.

Stability Concepts
Stability Concepts
Definition 2  The equilibrium x = 0 is said to be stable (in the sense of Lyapunov) if, for any ε > 0, there exists a corresponding δ > 0 such that, if ‖x(0)‖ < δ, then ‖x(t)‖ < ε for all t ≥ 0. Otherwise the equilibrium is said to be unstable.

Definition 3  The equilibrium x = 0 is said to be asymptotically stable if it is stable (in the sense of Lyapunov) and there exists some δ such that ‖x(0)‖ < δ implies x(t) → 0 as t → ∞.
The second condition for asymptotic stability is called attractivity, and the set of all points such that trajectories initiated from these points converge to the origin is called the domain of attraction. Note that, while attractivity and asymptotic stability are equivalent for linear systems, the former is necessary but not sufficient for the latter in nonlinear systems.
The stability concepts are pictorially illustrated in Figure 7.1. Note that, in the context of a linear system, a marginally stable system, for which some of the eigenvalues lie on the unit circle, would be considered stable according to the above definition of stability. However, it would not be asymptotically stable.
We also introduce the notion of exponential stability.

Definition 4  The equilibrium x = 0 is said to be exponentially stable if it is stable (in the sense of Lyapunov) and there exists some δ such that, if ‖x(0)‖ < δ, then

‖x(t)‖ ≤ α‖x(0)‖e^{-λt},   t > 0

for some α > 0 and λ > 0.

Exponential stability is a stronger property than asymptotic stability. For linear systems, the two stability concepts are equivalent.

Figure 7.1: Concepts of stability for nonlinear systems.

Finally, the above definitions for asymptotic stability and exponential stability are for local stability. For global stability, the conditions need to hold for
any starting state, not just a starting state within some ball.
Definition 5 The equilibrium x = 0 is said to be globally asymptotically
(or exponentially) stable if asymptotic (or exponential) stability holds for any
starting state.
Lyapunov's Direct Method

Lyapunov's direct method is based on the idea that, if an energy or energy-like function of the system is continuously dissipated, the system must settle down to an equilibrium. It is one of the most popular and useful stability analysis techniques available for general nonlinear systems. We present this method in the context of a discrete-time system.

Let x be the state of some autonomous nonlinear dynamic system x(k+1) = f(x(k)). Consider a scalar function V(x) such that, within some ball B = {x : ‖x‖ < r},

- V(0) = 0 and V(x) > 0 for x ≠ 0. Such a function is called a locally positive-definite function.

- V(x) < ∞ for ‖x‖ < r.

- For any trajectory of x generated by the system, V(x(k+1)) - V(x(k)) ≤ 0 for all k ≥ 0.

Such a V(x) is called a Lyapunov function for the system x(k+1) = f(x(k)). The existence of a Lyapunov function implies Lyapunov stability. In addition, if the decrease condition is strengthened to V(x(k+1)) - V(x(k)) < 0 for x ≠ 0, the existence of such a V(x) implies asymptotic stability.

For global asymptotic stability, the above conditions need to be satisfied for any x, not just within some ball. Additionally, V(x) has to satisfy the following condition, called radial unboundedness:

V(x) → ∞ as ‖x‖ → ∞

The main difficulty with Lyapunov's direct method is that it is not always clear how to choose a Lyapunov function. Later, we will use this method to prove the asymptotic stability of some of the suboptimal receding horizon control solutions we introduced earlier.
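For a quick feel for the direct method, consider the linear special case: for a stable x(k+1) = A_cl x(k), V(x) = x^T Px with P solving the Lyapunov equation A_cl^T P A_cl + Q = P satisfies all of the above conditions. A small numerical check (a sketch with illustrative numbers):

    # Verifying V(x(k+1)) < V(x(k)) numerically for a linear system,
    # with V(x) = x' P x from the discrete Lyapunov equation.
    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    Acl = np.array([[0.9, 0.2], [0.0, 0.8]])     # an illustrative stable matrix
    Qv = np.eye(2)
    P = solve_discrete_lyapunov(Acl.T, Qv)       # Acl' P Acl - P = -Qv
    x = np.array([1.0, -2.0])
    for _ in range(10):
        x_next = Acl @ x
        assert x_next @ P @ x_next < x @ P @ x   # V strictly decreases
        x = x_next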

7.5.2 State Feedback Case

Let us consider the state feedback control law

u(k) = u*(x(k))   (7.26)

where u*(·) represents the solution to the convex program J_p(x_0) with x_0 = x(k). The resulting closed-loop system can be expressed as

x(k+1) = Ax(k) + Bu*(x(k))   (7.27)

Since u*(·) is a nonlinear operator, we must resort to nonlinear stability analysis techniques.
Let us demonstrate how the proof of stability may be approached by examining a specific case of Option 2 B or 2 C. Hence we will first prove the asymptotic stability of the state sequence under the feedback law u(k) = u*(x(k)) when Q_t in J_p(x_0) is chosen as the solution to the Lyapunov equation A^T Q_t A + Q = Q_t and the terminal constraint x(k+p) ∈ X̃_ma is used.

Assumptions

For proving the asymptotic stability, we assume that J_p(x_0) is feasible and the optimal cost is bounded. This means (1) the initial state x(0) lies within the stabilizable set of the constrained system, (2) all the state constraints are feasible, and (3) the terminal constraint is feasible. Because (x = 0, u = 0) is assumed to be an interior point of the feasible set, assuming the state-space system is stabilizable, there is a region around the origin where this assumption is satisfied.
Proof of Convergence

First, we prove the asymptotic convergence of the state sequence to the origin. For this, we note that J_p(x_0) is equivalent to

J_{∞,p}(x(k)) = min_{u(k),...,u(∞)} Σ_{i=0}^{∞} [x^T(k+i)Qx(k+i) + u^T(k+i)Ru(k+i)]   (7.28)

subject to the model equation constraint and

x(k+i) ∈ X_fsb,   i = 1, ..., ∞   (7.29)

u(k+i) ∈ U_fsb,   i = 0, ..., p-1   (7.30)

u(k+i) = 0,   i ≥ p   (7.31)

Then,

J_{∞,p}(x(k)) ≥ J_{∞,p}(x(k+1)) + x^T(k)Qx(k) + u^T(k)Ru(k)   (7.32)

To see how this inequality arises, let us define the restriction of an input sequence U as

R(U) = [ 0  I  0  ...  0 ]
       [ 0  0  I  ...  0 ]
       [ ...             ]
       [ 0  0  0  ...  I ]
       [ 0  0  0  ...  0 ] U   (7.33)

that is, R(U) discards the first input, shifts the remaining ones forward by one step, and appends a zero input at the end. (7.32) is true because R(U*(k)), the restriction of the optimal solution for J_{∞,p}(x(k)), represents a feasible but possibly suboptimal solution for J_{∞,p}(x(k+1)). Denoting the corresponding suboptimal cost as J̃_{∞,p}(x(k+1)) ≥ J_{∞,p}(x(k+1)), it is clear that

J_{∞,p}(x(k)) = J̃_{∞,p}(x(k+1)) + x^T(k)Qx(k) + u^T(k)Ru(k)   (7.34)

and (7.32) is immediate.


Now, (7.32) can be rewritten as

J_{∞,p}(x(i)) - J_{∞,p}(x(i+1)) ≥ x^T(i)Qx(i) + u^T(i)Ru(i)   (7.35)

By summing up the above inequalities from i = 0 to i = k-1, we obtain

J_{∞,p}(x(0)) - J_{∞,p}(x(k)) ≥ Σ_{i=0}^{k-1} [x^T(i)Qx(i) + u^T(i)Ru(i)]   (7.36)

Since Σ_{i=0}^{k-1} [x^T(i)Qx(i) + u^T(i)Ru(i)] ≥ 0, the left-hand side is bounded below by zero. With J_{∞,p}(x(0)) < ∞ and J_{∞,p}(x(k)) ≥ 0, the left-hand side is finite for all k ≥ 0. So the right-hand side must also be finite for all k ≥ 0. This means, with k = ∞,

Σ_{k=0}^{∞} [x^T(k)Qx(k) + u^T(k)Ru(k)] < ∞   (7.37)

With Q > 0 and R > 0, x(k) → 0 and u(k) → 0 as k → ∞.

Proof of Lyapunov Stability

To prove asymptotic stability, in addition to the convergence (attractivity), we must prove stability (in the sense of Lyapunov). For this, we first define some ball around the origin, X_0 = {x : ‖x‖ ≤ c}, so that

J_{∞,p}(x(0)) ≤ b‖x(0)‖²   for all x(0) ∈ X_0, for some b < ∞   (7.38)

Such a ball always exists because, when the state is sufficiently close to the origin, the optimal control is linear and the optimal cost is a quadratic function of x, as was proven in the discussion of multi-parametric programming.

From the preceding argument,

J_{∞,p}(x(k)) ≤ J_{∞,p}(x(0))   for all k > 0   (7.39)

Also, given the form of the objective function (quadratic weighting of the state vector with a positive-definite weighting matrix), there exists a > 0 such that

a‖x(k)‖² ≤ J_{∞,p}(x(k))   (7.40)

Combining these, we can say, for x(0) ∈ X_0,

a‖x(k)‖² ≤ b‖x(0)‖²   (7.41)

which implies

‖x(k)‖ ≤ (b/a)^{1/2} ‖x(0)‖   (7.42)

Hence, given any ε > 0,

‖x(k)‖ ≤ ε   whenever   ‖x(0)‖ ≤ min{ (a/b)^{1/2} ε, c }   (7.43)

Comments

- The assumption of Q > 0 can be relaxed to Q ≥ 0 plus detectability of (Q^{1/2}, A). The detectability means that the modes that are not weighted in the objective function, and therefore evolve according to their autonomous dynamics, are asymptotically stable. Hence the preceding argument applied to those modes that are weighted in the objective function is sufficient for proving the asymptotic stability.

- Extending the same proof to the case where the state constraints are softened through slack variables, as in (7.25) with p = ∞, is straightforward, assuming the same weighting matrix Q_ε is used throughout the horizon. The proof is left as an exercise. One potential complication in the implementation, however, is that X̃_ma in the terminal constraint is now dependent on ε. Therefore X̃_ma cannot be computed a priori and the minimization becomes more complicated computationally.

- With the softening of the state constraints in the infinite horizon, Option 1 B applied to stable systems can be shown to be globally asymptotically stable. In fact, it can be shown to be exponentially stable. The same cannot be said for unstable systems: with bounded inputs, exponentially unstable systems cannot be globally stabilized. This is because, as ‖x(0)‖ increases, one needs an increasingly larger input to stabilize the system. Some finite domain of attraction exists for such cases. The proofs are somewhat involved and will not be given here, but at the end of this chapter we will provide some references where formal proofs can be found.


- The stability proof for Option 1 A and Option 1 B follows essentially the same argument as the above. One can show that the restriction of the infinite horizon solution at one time step provides a feasible solution for the next time step.

- The stability proof for Option 3 (the terminal equality constraint) is based on the fact that

  J_p(x(k)) = J_{p-1}(x(k+1)) + x^T(k)Qx(k) + u^T(k)Ru(k)
            ≥ J_p(x(k+1)) + x^T(k)Qx(k) + u^T(k)Ru(k)   (7.44)

  The inequality follows from the fact that the restriction of the solution for J_p(x(k)) is a feasible solution for J_p(x(k+1)), because x(k+p) = 0 implies x(k+p+1) = 0. However, the assumption J_p(x(0)) < ∞ is more restrictive in this case.
Alternatively, we can apply Lyapunov's direct method, discussed earlier. Here, we may consider the optimal cost function J_p(x), which represents J_p(x_0) with x_0 = x, as a Lyapunov function candidate for the closed-loop system x(k+1) = Ax(k) + Bu*(x(k)). The fact that J_p(x) is a positive-definite function follows immediately from the fact that x = 0 is an equilibrium point (and therefore J_p(0) = 0) and Q > 0, R > 0 (and therefore J_p(x) > 0 for x ≠ 0). Also, J_p(x(0)) < ∞ by assumption. Hence, the key condition to prove for asymptotic stability is the negative definiteness of J_p(x(k+1)) - J_p(x(k)). Note that

J_p(x(k+1)) - J_p(x(k)) = J_p(x(k+1)) - [ J_{p-1}(x(k+1)) + x^T(k)Qx(k) + u^T(k)Ru(k) ]   (7.45)

Whether this can be guaranteed depends on the choice of p, Q_t, and the terminal constraint. For instance, under the following choices of Q_t and terminal constraint, the negative definiteness can be guaranteed regardless of the choice of p:

- With the terminal constraint x(k+p) = 0, we have

  J_p(x(k+1)) ≤ J_{p-1}(x(k+1))   (7.46)

  since the optimal solution to J_{p-1}(x(k+1)) is provided by the restriction of the optimal solution to J_p(x(k)), which also is a feasible solution to J_p(x(k+1)). Since this particular feasible solution makes x(k+p) = 0 and hence x(k+p+1) = 0, J_{p-1}(x(k+1)) is also the cost for the p-step problem under the same (suboptimal) solution. Since -(x^T(k)Qx(k) + u^T(k)Ru(k)) is negative-definite, negative semi-definiteness of J_p(x(k+1)) - J_{p-1}(x(k+1)) implies J_p(x(k+1)) - J_p(x(k)) < 0.


- With Q_t chosen as the solution to the Lyapunov equation and the extra constraint x(k+p) ∈ X̃_ma added, it was shown in (7.32) that

  J_{∞,p}(x(k+1)) - J_{∞,p}(x(k)) ≤ -(x^T(k)Qx(k) + u^T(k)Ru(k))   (7.47)

  Since -(x^T(k)Qx(k) + u^T(k)Ru(k)) is a negative definite function and J_{∞,p}(0) = 0, J_{∞,p}(x(k+1)) - J_{∞,p}(x(k)) is a negative definite function.

7.5.3 Output Feedback Case
Here we will show that, if a state feedback law u(k) = u*(x(k)) yielding an asymptotically stable closed-loop system (where u*(·) represents a solution operator to a constrained quadratic problem) is coupled with a stable linear observer, then the resulting output feedback law u(k) = u*(x̂(k|k)) also yields an asymptotically stable closed-loop system. In other words, the stability of the state feedback controller and that of the observer can be checked separately to establish the stability of the combined system.

The key point in proving the above is that the stable MPC feedback laws we described earlier are nonlinear, but the nonlinearity is relatively well-behaved. Specifically, they are Lipschitz continuous, which means that

there exists a fixed constant K such that ‖u*(x+δ) - u*(x)‖ ≤ K‖δ‖ for all δ and for all x.

For example, for the case where the constraints are linear inequalities, yielding a QP for the control calculation, we have already proven that the resulting feedback law u*(x) is a piecewise affine function of x, which is clearly Lipschitz continuous.
The closed-loop system under the output feedback controller can be written as follows:

x(k+1) = Ax(k) + Bu*(x̂(k|k))
       = Ax(k) + Bu*(x(k)) + B[u*(x̂(k|k)) - u*(x(k))]   (7.48)

x(k+1) = g(x(k)) = Ax(k) + Bu*(x(k)) is a nonlinear dynamic system for which the origin is an asymptotically stable fixed point. In addition, g(x(k)) is Lipschitz continuous, since u*(x(k)) is Lipschitz continuous.
At this point, we draw upon a result from the literature that such well-behaved nonlinear systems are robust with respect to an exponentially decaying additive disturbance. That is,

if the origin is an asymptotically stable equilibrium point for x(k+1) = g(x(k)), where g(x) is a Lipschitz continuous function, then the origin remains an asymptotically stable fixed point for the perturbed system x(k+1) = g(x(k)) + e(k), where e(k) is an exponentially converging sequence.

The proof of the above is somewhat involved and is skipped here, but interested readers can find it in the references provided at the end of this chapter.
What remains to be shown for us to prove the asymptotic stability of the output feedback controller is that B[u*(x̂(k|k)) - u*(x(k))] is indeed an exponentially converging sequence. Since u*(x) is Lipschitz continuous,

‖B[u*(x̂(k|k)) - u*(x(k))]‖ ≤ K‖x(k) - x̂(k|k)‖   for some K > 0.   (7.49)

Since (x(k) - x̂(k|k)) is an exponentially converging sequence due to the assumption of a stable linear observer,

‖x(k) - x̂(k|k)‖ ≤ βe^{-λk}   for some β > 0 and λ > 0,   (7.50)

and therefore

‖u*(x(k)) - u*(x̂(k|k))‖ ≤ Kβe^{-λk}   (7.51)

7.6 Extension To The Stochastic Case (*)

Extending the optimal control formulations to stochastic systems is of obvious


interest as real processes experience disturbances and noises that are best
described by random variables. However, we will see that there are a number
of new complications, which we have not been able to resolve in a satisfactory
manner.
First, in the stochastic case, the state constraints need to be formulated in a probabilistic manner. For instance, one may enforce

E{x(k+i)} ∈ X_fsb,   i = 1, ..., p   (7.52)

However, the above does not guarantee the satisfaction of the constraint with much confidence. A better alternative is based on the chance constraint

Pr{ x(k+i) ∈ X_fsb,   i = 1, ..., p } ≥ α   (7.53)

where Pr denotes probability and α is a parameter between 0 and 1. However, this constraint is much more difficult to handle computationally.
Second, no matter how the state constraints are formulated, derivation of the linear quadratic optimal feedback law for the constrained stochastic case is not possible because, unlike in the unconstrained case, the stochastic dynamic program for the constrained problem does not yield an analytical solution. An alternative is to adopt the strategy of open-loop optimal feedback control, which results from solving the open-loop optimal control problem at each time step. For instance, in an infinite horizon problem, one can implement u(k) = u*(x(k)), where u*(x(k)) represents the optimal solution to the problem

J_p(x(k)) = min_{u(k),...,u(k+p-1)} E{ Σ_{i=0}^{p-1} [x^T(k+i)Qx(k+i) + u^T(k+i)Ru(k+i)] + x^T(k+p)Q_t x(k+p) }   (7.54)

for the stochastic system

x(k+i+1) = Ax(k+i) + Bu(k+i) + ε_1(k+i),   i = 0, ..., p-1   (7.55)

and inequality constraints

Pr{ x(k+i) ∈ X_fsb } > α,   i = 1, ..., p   (7.56)

u(k+i) ∈ U_fsb,   i = 0, ..., p-1   (7.57)

Besides the computational difficulty involved in handling the chance constraint, u(k) = u*(x(k)) (or u(k) = u*(x̂(k|k))) resulting from solving the above program at each time step does not represent the optimal feedback law, even with p = ∞. This is different from the unconstrained case in which, due to the separation principle, the optimal feedback solution reduces to that of the deterministic case coupled with the optimal estimator.
If we replace the chance constraint with the less desirable alternative of E{x(k+i)} ∈ X_fsb, the solution to the above open-loop optimal control problem turns out to be identical to the deterministic case. Hence, considering the stochastic nature of the system in the control calculation does not provide anything extra.
Despite these theoretical difficulties, the deterministic formulation has been
applied successfully in many stochastic problems. It has been observed empirically that, since constrained linear systems represent only a minor departure
from linear systems, especially when there are no state constraints, the combination of the optimal deterministic regulator (constrained LQR) with the
optimal state estimator yields a satisfactory suboptimal solution in almost all
cases.

Examples to Include
1. Performance of constrained vs. unconstrained on a simple problem.
(double integrators?)
2. Comparison of performances among different approximations. Use a
simple 1st order or 2nd order system. Use a short horizon to make the
comparison more meaningful.


3. Show the double integrator example for the explicit solution.


4. Stochastic case with input constraints only. Show the performance on a
simple problem. Show the nonwhiteness of the residual.

Possible Exercises to Give


1. Prove the stability for the case with soft constraints with quadratic penalty.

2. Prove the asymptotic stability for the RHC with Option 2 B.

3. Prove (??).

4. Show that the MPC control law defined by a quadratic program is Lipschitz continuous.
5. (Give a simple but realistic problem). Design a constrained LQR +
Kalman filter and simulate the performance.

Bibliography

- Different choices of terminal penalty to approximate the infinite horizon control. Kwon and Pearson, book by Bitmead, Keerthi and Gilbert, Mayne and Michalska, Rawlings and Muske, etc.

- Maximal Output Admissible Set theory by Gilbert and Tan.

- Parametric Quadratic Programming solution by Bemporad et al.

- Stability of constrained LQR. Keerthi and Gilbert, Mayne and Michalska, Rawlings and Muske, etc.

- Constraint relaxing schemes. Various formulations and trade-offs (Rawlings). Exact softening.

- Stability for LQR with soft constraints. Unstable systems with constrained inputs. Zheng and Morari.

- Stability proof for output feedback problem. Rawlings, etc. Halanay.


Chapter 12

Identification
Identification of process dynamics is perhaps the most time consuming step
in implementing a model predictive controller and one that requires relatively
high expertise from the user. In this section, we give an introduction to various
identification methods and touch upon some key issues. Since system identification is a very broad subject that can easily take up an entire book, we will
limit our objective to giving just an overview and providing a starting point
for further exploration of the field. Hence, our treatment of various methods
and issues will be somewhat brief and informal. References will be given at
the end for more complete, detailed treatments of the various topics presented
in this chapter.

12.1 Problem Overview
The goal of system identification is to build a mathematical relation for predicting the system behavior using input output data gathered from the process. For convenience, the mathematical relation searched for is often restricted to be linear. As we discussed in Chapter ?? (Linear Time-Invariant System Models), both known and unknown inputs affect the outputs. Since not all inputs change in a deterministic manner, it is often desirable to identify a model that has both deterministic and stochastic components.
In terms of how input output data are translated into a mathematical relation, the field of identification can be divided broadly into two branches: parametric identification and nonparametric identification. In parametric identification, the structure of the mathematical relation is parameterized (compactly) a priori and the parameters of the structure are fitted to the data. In nonparametric identification, no (or very little) assumption is made with respect to the model structure. Frequency response identification is nonparametric. Impulse response identification is also nonparametric, but it can also be viewed as parametric identification, since an impulse response of a finite length at discrete time points is often identified.


As a final note, it is important not to forget the end-use of the model, which in our case is to design a feedback control system. The accuracy of a model must ultimately be judged by how well the model predicts the output behavior with the intended feedback control system in place. This consideration must be reflected in all phases of identification, including test input design, data filtering, model structure selection, and parameter estimation.

12.2 Parametric Identification Methods
In parametric identification, the model structure is set prior to model fitting. The problem then is to identify the model parameters based on the provided input output data. Although a particular model structure is assumed for parameter estimation, one often adjusts the model structure iteratively based on the result of the fitting (for example, through a residual analysis).

12.2.1 Model Structures
A general structure for parametric identification is

y(k) = G(q, θ)u(k) + H(q, θ)ε(k)   (12.1)

where y is the output and u is the input. (Most of the time, u will be a manipulated input, but it can also be a measured disturbance variable.) G(q, θ), referred to as the process model, represents the causal relationship between the deterministic input u and the output y. ε(k) is a white noise sequence, which by itself does not represent any physical variable. Together with the noise model H(q, θ), it defines the auto- and cross-correlation functions of the residual sequence (y(k) - G(q, θ)u(k)). For a stationary process, without loss of generality, H(q, θ) is assumed to be a stable, stably invertible, and normalized (i.e., H(∞, θ) = 1) transfer function. This is in view of the spectral factorization theorem, which states that any spectrum can be factorized in terms of a stable and stably invertible factor (see Appendix ...). For processes exhibiting random mean shifts, it is necessary that the noise model include an integrator. Equivalently, we can replace y(k) and u(k) with Δy(k) and Δu(k) in the above.
Within the general structure, different parametrizations exist. Let us discuss some popular ones, first in the single input, single output context.
ARX Model

If we represent G as a rational function and express it as a linear equation with an additive error term, we obtain

y(k) + a_1 y(k-1) + ... + a_n y(k-n) = b_1 u(k-1) + ... + b_m u(k-m) + ε(k)   (12.2)

When the equation error ε(k) is taken as a white noise sequence, the resulting model is called an ARX model (AR for Auto-Regressive and X for eXtra input u). Hence, the ARX model corresponds to the following parametrization of the transfer functions:

G(q, θ) = B(q)/A(q) = (b_1 q^{-1} + ... + b_m q^{-m}) / (1 + a_1 q^{-1} + ... + a_n q^{-n})

H(q, θ) = 1/A(q) = 1 / (1 + a_1 q^{-1} + ... + a_n q^{-n})   (12.3)

Generally, an ARX model of very high order is needed to describe both the process and noise dynamics. To see this, notice that (12.1) can be written as

H^{-1}(q, θ)y(k) = H^{-1}(q, θ)G(q, θ)u(k) + ε(k)   (12.4)

Since H^{-1} is assumed stable, if G is stable,

H^{-1}(q, θ) ≈ 1 + a_1 q^{-1} + ... + a_n q^{-n}
H^{-1}(q, θ)G(q, θ) ≈ b_1 q^{-1} + ... + b_m q^{-m}   (12.5)

for sufficiently large n and m.


ARMAX Model

A natural extension to the ARX parametrization is to express the equation error term as a moving average of white noise:

y(k) + a_1 y(k-1) + ... + a_n y(k-n) = b_1 u(k-1) + ... + b_m u(k-m)
                                       + ε(k) + c_1 ε(k-1) + ... + c_ℓ ε(k-ℓ)   (12.6)

The above is called an ARMAX model. For the ARMAX structure, the parameterization of the noise transfer function changes to

H(q, θ) = C(q)/A(q) = (1 + c_1 q^{-1} + ... + c_ℓ q^{-ℓ}) / (1 + a_1 q^{-1} + ... + a_n q^{-n})   (12.7)

Because of the numerator term in the noise model, an ARMAX model can potentially represent a system with far fewer parameters than an ARX model. In fact, a state-space system of order n (with a deterministic input and a white noise input) always yields an input output representation given by an nth order ARMAX model. However, parameter estimation is more complicated, and over-parametrization can cause loss of identifiability (i.e., parameter values can become non-unique).
Output Error Model

Both the ARX model and the ARMAX model put common poles in G and H. In some cases, it may be more natural to model them separately. One such parametrization is the Output Error (OE) structure given below:

ȳ(k) + a_1 ȳ(k-1) + ... + a_n ȳ(k-n) = b_1 u(k-1) + ... + b_m u(k-m)
y(k) = ȳ(k) + ε(k)   (12.8)

In the above, ȳ(k) represents the noise-free output. Customarily, ε(k) is assumed to be white noise. This means the OE structure is equivalent to

G(q, θ) = B(q)/A(q)   and   H(q) = 1   (12.9)

From this, it may seem that the structure is not that useful (since disturbance / noise effects in most cases are auto-correlated and therefore not adequately represented by white noise). However, the use of the OE structure can be more general. For instance, the OE model can be used when H(q) is not 1 but is known a priori (i.e., the residual sequence is a colored noise of known spectrum). In this case, we can write the model as

H^{-1}(q)y(k) = G(q, θ) H^{-1}(q)u(k) + ε(k)

or, with the filtered signals y_f(k) = H^{-1}(q)y(k) and u_f(k) = H^{-1}(q)u(k),

y_f(k) = G(q, θ)u_f(k) + ε(k)   (12.10)

Note that the above is in the form of (12.8). Simple filtering of the input and output de-correlates the noise and gives the standard OE structure. Parameter estimation is complicated by the fact that the ȳ's are not given and depend on the choice of parameters.

FIR and Orthogonal Expansion Model

A special kind of output error structure is obtained when G(q, θ) is parameterized in a linear fashion. For instance, when G(q) is stable, it can be expanded as a power series of q^{-1} to yield

G(q) = Σ_{i=1}^{∞} b_i q^{-i}   (12.11)

Truncating the power series after n terms, one obtains the model

y(k) = (b_1 q^{-1} + b_2 q^{-2} + ... + b_n q^{-n}) u(k) + H(q)ε(k)   (12.12)

This is the Finite Impulse Response (FIR) model which we used in the earlier part of this book.

A general form of an orthogonal expansion model is

G(q) = Σ_{i=1}^{n} b_i B_i(q)   (12.13)

One popular choice for {B_i(q)} is the so-called Laguerre functions, defined as

B_i(q) = (√(1-α²) / (q-α)) ((1-αq) / (q-α))^{i-1}   (12.14)

An advantage of using this basis is that knowledge of the process's dominant time constant can be incorporated into the choice of α to speed up the convergence. (It therefore helps curtail the number of parameters.)
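Computationally, the Laguerre regressors can be generated by cascading one first-order lag with repeated all-pass sections, since in the shift operator (12.14) reads B_1 = √(1-α²) q^{-1}/(1-αq^{-1}) multiplied by powers of (q^{-1}-α)/(1-αq^{-1}). A sketch (the function name is ours):

    # A sketch of building Laguerre regressors (12.14) by filtering the input;
    # the coefficients b_i are then obtained by ordinary least squares.
    import numpy as np
    from scipy.signal import lfilter

    def laguerre_regressors(u, alpha, n):
        """Columns are B_i(q) u(k), i = 1..n, for |alpha| < 1."""
        x = lfilter([0.0, np.sqrt(1 - alpha**2)], [1.0, -alpha], u)  # B_1(q) u
        cols = [x]
        for _ in range(1, n):
            x = lfilter([-alpha, 1.0], [1.0, -alpha], x)   # all-pass section
            cols.append(x)
        return np.column_stack(cols)

    # e.g. b, *_ = np.linalg.lstsq(laguerre_regressors(u, 0.8, 5), y, rcond=None)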

Box-Jenkins Model

A natural generalization of the output error model is to let the disturbance transfer function be a rational function of unknown parameters. This leads to the Box-Jenkins (BJ) model, which has the structure

y(k) = (B(q)/A(q)) u(k) + (C(q)/D(q)) ε(k)   (12.15)

This model structure is quite general, but the parameter estimation is nonlinear and loss of identifiability can occur.
General Model

The most general form of the model is

A(q)y(k) = (B(q)/F(q)) u(k) + (C(q)/D(q)) ε(k)   (12.16)

This way some, but not all, poles can be shared between the process model and the noise model. Note that this model includes all the other models as special cases.
All of the above models can be generalized to include an integrator in H(q, θ). For instance, we can extend the ARMAX model to

y(k) = (B(q)/A(q)) u(k) + (1/(1-q^{-1})) (C(q)/A(q)) ε(k),   (12.17)

which is called the ARIMAX model (I for integration). In terms of parameter estimation, the resulting problem is the same, since we can transform the above to

Δy(k) = (B(q)/A(q)) Δu(k) + (C(q)/A(q)) ε(k)   (12.18)

The same holds for all the other model structures.
Extensions to the multivariable case are mostly straightforward, but can involve some complications. One may choose to fit each output independently using one of the above structures; this is referred to as multiple-input single-output (MISO) identification. In this case, the only modification to the above is that B(q) is now a row vector containing n_u polynomials, where n_u is the number of inputs. The parameter estimation problem remains the same except for the number of parameters. With MISO identification, however, cross-correlation among the outputs cannot be captured. For this, one needs to perform multiple-input, multiple-output (MIMO) identification, where all the outputs are fitted to a single multivariable model structure. In this case, A(q), B(q), etc. are matrix polynomials of appropriate dimensions. For instance, the MIMO ARX model looks like

y(k) = A^{-1}(q)B(q)u(k) + A^{-1}(q)ε(k)   (12.19)

where A(q) and B(q) are n_y × n_y and n_y × n_u matrix polynomials, respectively.


The parameterization of the coefficient matrices can be a subtle issue. With
all the matrix entries assumed unknown, for instance, one can end up with
an over-parameterized structure leading to loss of identifiability. In general,
significant prior knowledge (e.g., observability indices for outputs) is needed to
obtain a parsimonious parameterization. Starting with an identifiable structure
is important in view of the fact that many of the model structures lead
to a nonlinear parameter estimation problem.
Note on Loss of Identifiability: For certain model structures, over-parameterization can result in the parameter values becoming nonunique. This is
referred to as loss of identifiability. For example, assume that the underlying
system is the ARMAX system

$$y(k) = \frac{\beta_1 q^{-1}}{1+\alpha_1 q^{-1}}u(k) + \frac{1+\gamma_1 q^{-1}}{1+\alpha_1 q^{-1}}\epsilon(k). \tag{12.20}$$

Now suppose we try to fit an over-parameterized ARMAX structure of

$$y(k) = \frac{b_1 q^{-1}+b_2 q^{-2}}{1+a_1 q^{-1}+a_2 q^{-2}}u(k) + \frac{1+c_1 q^{-1}+c_2 q^{-2}}{1+a_1 q^{-1}+a_2 q^{-2}}\epsilon(k) \tag{12.21}$$

The above can be re-parameterized as

$$y(k) = \frac{\beta_1 q^{-1}(1+b_2' q^{-1})}{(1+\alpha_1 q^{-1})(1+a_2' q^{-1})}u(k) + \frac{(1+\gamma_1 q^{-1})(1+c_2' q^{-1})}{(1+\alpha_1 q^{-1})(1+a_2' q^{-1})}\epsilon(k), \tag{12.22}$$

and the constraining equations for the choice of b₂′, c₂′ and a₂′ are just b₂′ = a₂′
and c₂′ = a₂′. We see that there are infinitely many parameter values yielding
the input-output model of (12.20). Similar loss of identifiability can occur in
an OE or BJ structure, but not in an ARX structure, which is without the
possibility of pole-zero cancellation in the noise model.
Assuring identifiability for a MIMO model structure is much
more complicated due to the cross-feedback terms. For example,
.... □

12.2.2 Prediction Error Method

The optimal one-step-ahead predictor for system (12.1) can be written as

$$\hat y(k|k-1) = G(q,\theta)u(k) + \left[I - H^{-1}(q,\theta)\right]\left(y(k) - G(q,\theta)u(k)\right) \tag{12.23}$$

By comparing (12.1) with (12.23), we see that the prediction error y(k) −
ŷ(k|k−1) is simply the white noise ε(k), which verifies the assertion that (12.23)
is indeed the optimal predictor. Note that I − H⁻¹(q, θ) contains at least
one delay, since I − H⁻¹(∞, θ) = 0. Hence, evaluating the right-hand side
requires not y(k) but only y(k−1), y(k−2), etc.


Because the primary intended function of a model in control is to provide a
prediction of the future output behavior, it is logical to choose θ such that the
prediction error resulting from the model is minimized for the available data
record. This is especially meaningful if the data used for the model fitting
reflect the intended closed-loop situation. Let us denote the data record we
have as (y(1), ..., y(N)). Then, this objective is formulated as

$$\min_\theta \sum_{k=1}^{N}\left\|e_{\rm pred}(k,\theta)\right\|_2^2 \tag{12.24}$$

where e_pred(k, θ) = y(k) − ŷ(k|k−1) and ‖·‖₂ denotes the Euclidean norm.
Use of other norms is possible, but the 2-norm is by far the most popular.
Using (12.23), we can write

$$e_{\rm pred}(k,\theta) = H^{-1}(q,\theta)\left(y(k) - G(q,\theta)u(k)\right) \tag{12.25}$$
This method of obtaining parameter estimates is referred to as the Prediction Error Method (PEM). The numerical complexity of PEM depends on the
model structure.
For certain model structures, the 2-norm minimization of the prediction error
is formulated as a linear least-squares problem. For example, for the
ARX structure, G(q, θ) = B(q)/A(q) and H(q, θ) = 1/A(q), and

$$\begin{aligned} e_{\rm pred}(k,\theta) &= A(q)y(k) - B(q)u(k)\\ &= y(k) + a_1 y(k-1) + \cdots + a_n y(k-n) - b_1 u(k-1) - \cdots - b_m u(k-m) \end{aligned} \tag{12.26}$$

Since e_pred(k, θ) is linear with respect to the unknown parameters, the
minimization of $\sum_{k=1}^{N} e_{\rm pred}^2(k,\theta)$ is a linear least-squares problem.
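To make the mechanics concrete, the following sketch (Python/NumPy assumed; the orders and names are illustrative) forms the ARX regression and solves it:

```python
import numpy as np

def arx_fit(y, u, n, m):
    """Least-squares fit of A(q)y = B(q)u + eps, cf. (12.26)."""
    p = max(n, m)
    N = len(y)
    # phi(k) = [-y(k-1) ... -y(k-n)  u(k-1) ... u(k-m)]
    Phi = np.column_stack(
        [-y[p - i : N - i] for i in range(1, n + 1)]
        + [u[p - i : N - i] for i in range(1, m + 1)]
    )
    theta, *_ = np.linalg.lstsq(Phi, y[p:], rcond=None)
    return theta[:n], theta[n:]   # (a_1..a_n, b_1..b_m)
```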

Another such example is an FIR model with known disturbance characteristics, for which $G(q,\theta) = \sum_{i=1}^{n} h_i q^{-i}$ and H(q) contains no unknown
parameters. In this case

$$e_{\rm pred}(k,\theta) = y_f(k) - h_1 u_f(k-1) - \cdots - h_n u_f(k-n) \tag{12.27}$$

where y_f(k) = H⁻¹(q)y(k) and u_f(k) = H⁻¹(q)u(k). Again, the expression is linear in the unknowns and the prediction error minimization (PEM) is a linear least-squares problem. If the noise model were
$\frac{1}{1-q^{-1}}H(q)$, then y_f(k) and u_f(k) could be redefined as (1−q⁻¹)H⁻¹(q)y(k) and
(1−q⁻¹)H⁻¹(q)u(k), respectively. The same idea applies to Laguerre or other
orthogonal expansion models.

PEM for other model structures, such as the ARMAX and Box-Jenkins
structures, is not a linear least-squares problem, but pseudo-linear regression can be used for them. For example, the optimal predictor for the
ARMAX model can be written in the following form (see Exercise ??):

$$\hat y(k|k-1) = B(q)u(k) + [1 - A(q)]y(k) + [C(q) - 1][y(k) - \hat y(k|k-1)] \tag{12.28}$$

The above equation is called the pseudo-linear regression form of the
ARMAX model. Since the right-hand side of the equation depends only
on the past values of y(k) and ŷ(k|k−1) (as the leading coefficients of
A(q) and C(q) are both 1), by treating the past predictions as if they were
optimal (which they aren't, since they are based only on coarse estimates
of the system parameters), one can recursively update the parameter values
using the linear least-squares method.
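A sketch of this pseudo-linear regression iteration for the ARMAX structure (again with illustrative names and orders): each pass treats the current residuals as data for the C(q) regressors and re-solves a linear least-squares problem.

```python
import numpy as np

def armax_plr(y, u, n, m, nc, iters=20):
    """Pseudo-linear regression for A(q)y = B(q)u + C(q)eps, cf. (12.28)."""
    p = max(n, m, nc)
    N = len(y)
    eps = np.zeros(N)                      # residual estimates, start at zero
    for _ in range(iters):
        Phi = np.column_stack(
            [-y[p - i : N - i] for i in range(1, n + 1)]
            + [u[p - i : N - i] for i in range(1, m + 1)]
            + [eps[p - i : N - i] for i in range(1, nc + 1)]
        )
        theta, *_ = np.linalg.lstsq(Phi, y[p:], rcond=None)
        eps[p:] = y[p:] - Phi @ theta      # refresh the residual sequence
    return theta[:n], theta[n:n + m], theta[n + m:]
```

Note that the first pass, with zero residuals, is just the ARX fit; subsequent passes refine the C(q) estimate.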

Example: Pseudo-Linear Regression vs. Nonlinear Optimization. Show the result.

12.2.3 Properties of Linear Least Squares Identification

We just saw that prediction error minimization for many model structures can
be cast as a linear regression problem. The general linear problem can be
written as

$$y(k) = \phi^T(k)\theta + e(k) \tag{12.29}$$

where y is the observed output (or filtered output), φ is the regressor vector,
θ is the parameter vector to be identified, and e is the residual error (which
depends on the choice of θ). The argument (k) denotes the kth sample. In least-squares identification, θ is found such that the squared sum of the residuals is
minimized, i.e., $\hat\theta_N^{LS} = \arg\min_\theta\left\{\sum_{k=1}^N e^2(k)\right\}$. The 2-norm minimization of
prediction error for certain model structures can be cast in this form.
For a data set collected over N sample intervals, (12.29) can be written
collectively as the following set of linear equations:

$$Y_N = \Phi_N\theta + E_N \tag{12.30}$$

where

$$\Phi_N = \begin{bmatrix}\phi(1) & \cdots & \phi(N)\end{bmatrix}^T \tag{12.31}$$

$$Y_N = \begin{bmatrix}y(1) & \cdots & y(N)\end{bmatrix}^T \tag{12.32}$$

$$E_N = \begin{bmatrix}e(1) & \cdots & e(N)\end{bmatrix}^T \tag{12.33}$$

The least-squares solution is

$$\hat\theta_N^{LS} = \left(\Phi_N^T\Phi_N\right)^{-1}\Phi_N^T Y_N \tag{12.34}$$

Convergence

For the discussion of convergence, let us assume that the underlying system
(from which the data are obtained) is represented by the model

$$y(k) = \phi^T(k)\theta_o + \nu(k) \tag{12.35}$$

θ_o is the true parameter vector in this context and ν(k) is a term due to
disturbances, noise, etc.
Some insight can be drawn by rewriting the least-squares solution in the
following form:

$$\begin{aligned}\hat\theta_N^{LS} &= \left[\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)\right]^{-1}\frac{1}{N}\sum_{k=1}^N \phi(k)\left(\phi^T(k)\theta_o + \nu(k)\right)\\ &= \theta_o + \left[\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)\right]^{-1}\frac{1}{N}\sum_{k=1}^N \phi(k)\nu(k)\end{aligned} \tag{12.36}$$

A desirable property of θ̂_N^LS is that, under fairly mild assumptions, it converges to θ_o as the number of data points becomes large (N → ∞). Note that
the term

$$\left[\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)\right]^{-1}\frac{1}{N}\sum_{k=1}^N \phi(k)\nu(k)$$

represents the error in the parameter estimate. Assume that $\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N\phi(k)\phi^T(k)$
exists (which is true if φ(k) is quasi-stationary). In order that

$$\lim_{N\to\infty}\left[\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)\right]^{-1}\frac{1}{N}\sum_{k=1}^N \phi(k)\nu(k) = 0, \tag{12.37}$$

the following two conditions must be satisfied:


1. $$\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N \phi(k)\nu(k) = 0 \tag{12.38}$$

2. $$\mathrm{rank}\left\{\lim_{N\to\infty}\left[\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)\right]\right\} = \dim\{\theta\} \tag{12.39}$$


The first condition is satisfied if the regressor vector and the residual sequence are uncorrelated. There are two scenarios under which this condition
holds:

• ν(k) is a zero-mean white sequence. Since φ(k) does not contain ν(k),
E{φ(k)ν(k)} = 0 and $\frac{1}{N}\sum_{k=1}^N \phi(k)\nu(k) \to 0$ as N → ∞. In the prediction error minimization, if the assumed model structure is unbiased,
the prediction error is white.

• ν(k) is independent of φ(k) and the mean of at least one of them is zero.
For instance, in the case of an FIR model (or an orthogonal expansion
model), φ(k) contains inputs only and is therefore independent of ν(k),
whether ν(k) is white or nonwhite (assuming the data were collected in an
open-loop fashion). This means that the FIR parameters can be made
to converge to the true values even if the disturbance transfer function
H(q) is not known perfectly (resulting in nonwhite prediction errors), as
long as u_f(k) is designed to be a zero-mean signal that is independent
of ν(k). The same is true for an OE structure, but not for an ARX
structure, since there φ(k) contains past outputs that would be correlated with
ν(k) if ν is a non-white sequence.

In order for the second condition to be satisfied, $\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)$
must exist and be nonsingular. The rank condition on this matrix is called the
persistent excitation condition, as it is closely related to the notion of
persistency of excitation (of an input signal), which we shall discuss in
Section 12.2.4.
Statistical Properties

Let us again assume that the underlying system is represented by (12.35).
We further assume that ν(k) is an independent, identically distributed (i.i.d.)
random variable sequence of zero mean and variance r_ν. Then, using (12.36),
we can easily see that

$$E\{\hat\theta_N^{LS} - \theta_o\} = E\left\{\left[\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)\right]^{-1}\frac{1}{N}\sum_{k=1}^N \phi(k)\nu(k)\right\} = 0 \tag{12.40}$$

and

$$\begin{aligned} E\{(\hat\theta_N^{LS}-\theta_o)(\hat\theta_N^{LS}-\theta_o)^T\} &= \left[\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)\right]^{-1}\left[\frac{1}{N^2}\sum_{k=1}^N \phi(k)\,r_\nu\,\phi^T(k)\right]\left[\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)\right]^{-1}\\ &= \frac{r_\nu}{N}\left[\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)\right]^{-1} = r_\nu\left(\Phi_N^T\Phi_N\right)^{-1} \end{aligned} \tag{12.41}$$

(12.40) implies that the least-squares estimate is unbiased. (12.41) gives the
covariance matrix of the parameter estimate. This information can be used
to compute confidence intervals. For instance, when a normal distribution is
assumed, one can compute an ellipsoid corresponding to a specific confidence
level by using a χ² table.

12.2.4 Persistency of Excitation

In linear least-squares identification, in order for the parameters to converge
to the true values in the presence of noise, we must have

$$\mathrm{rank}\left\{\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k)\right\} = \dim\{\theta\} \tag{12.42}$$

This condition is closely related to the notion of persistency of excitation.
A signal u(k) is said to be persistently exciting of order n if the following
condition is satisfied:

$$\mathrm{rank}\{C_u^n\} = n, \tag{12.43}$$

where

$$C_u^n = \lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N \begin{bmatrix} u(k-1)u(k-1) & u(k-1)u(k-2) & \cdots & u(k-1)u(k-n)\\ u(k-2)u(k-1) & u(k-2)u(k-2) & \cdots & u(k-2)u(k-n)\\ \vdots & \vdots & \ddots & \vdots\\ u(k-n)u(k-1) & u(k-n)u(k-2) & \cdots & u(k-n)u(k-n) \end{bmatrix} \tag{12.44}$$

The above is equivalent to requiring the power spectrum of u(k) to be nonzero
at n or more distinct frequency points between −π and π.
Now, suppose φ(k) consists of past inputs and outputs. A necessary and
sufficient condition for (12.42) to hold is that
the input is persistently exciting of order dim{θ}.

This follows trivially in the case where φ(k) is made of n past inputs only (as
in FIR models). In this case,

$$\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N \phi(k)\phi^T(k) = C_u^n \tag{12.45}$$

The condition also holds when φ(k) contains filtered past inputs u_f(k−1), ..., u_f(k−n)
(where u_f(k) = H⁻¹(q)u(k)). Note that

$$\Phi_{u_f}(\omega) = \frac{\Phi_u(\omega)}{|H(e^{j\omega})|^2} \tag{12.46}$$


Hence, if u(k) is persistently exciting of order n, so is u_f(k). What is not
immediate (but can be proven) is that the above holds even when φ(k) contains
outputs (see Exercise ??).
An important conclusion is that, in order to assure the convergence of
the parameter estimates to the true values, we must design the input signal u(k) to be
persistently exciting of order dim{θ}. A pulse is not persistently exciting of
any order, since the rank of the matrix C_u^1 for a pulse is zero. A step signal is
persistently exciting of order 1; a single step test is inadequate in the presence
of significant noise, since only one parameter (the steady-state gain) may be
identified without error using such a signal. Sinusoidal signals are persistently
exciting of order 2, since their spectra are nonzero at two frequency points.
Finally, a random signal can be persistently exciting of any order, since its
spectrum is nonzero over a frequency interval. It is also noteworthy that a
periodic signal with period n can at most be persistently exciting of order n.
Violation of the persistent excitation condition does not mean that obtaining
estimates for the parameters is impossible. It does imply, however, that the
parameters do not converge to the true values no matter how many data points
are taken.
Numerical Example: Do a Monte Carlo simulation with two different input signals leading to a well-conditioned and an ill-conditioned Φ. Show how the covariance matrix is consistent with the average parameter error along different directions.
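As a placeholder for that example, the following sketch (Python/NumPy; the FIR system, noise level, and input signals are illustrative choices of ours) runs such a Monte Carlo study for a two-parameter FIR system:

```python
import numpy as np

rng = np.random.default_rng(0)
h_true = np.array([0.8, 0.5])                 # illustrative FIR(2) system
N, n_trials, r_nu = 500, 500, 0.01            # data length, MC trials, noise variance

def regressors(u, n=2):
    # phi(k) = [u(k-1), u(k-2)], k = n..N-1
    return np.column_stack([u[n - i: len(u) - i] for i in range(1, n + 1)])

def mc_error(u):
    Phi = regressors(u)
    y0 = Phi @ h_true
    err = np.empty((n_trials, 2))
    for t in range(n_trials):
        y = y0 + np.sqrt(r_nu) * rng.standard_normal(len(y0))
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        err[t] = theta - h_true
    return Phi, err

u_white = rng.standard_normal(N)              # well-conditioned Phi^T Phi
u_slow = np.sin(0.05 * np.arange(N))          # nearly one-directional excitation

for name, u in [("white", u_white), ("slow sine", u_slow)]:
    Phi, err = mc_error(u)
    print(name, "theory:", r_nu * np.linalg.inv(Phi.T @ Phi))
    print(name, "empirical:", err.T @ err / n_trials)
```

For the white input one should find small, roughly isotropic errors, while for the slow sinusoid the errors concentrate along the weakly excited direction, in agreement with (12.41).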

12.2.5 Frequency-Domain Bias Distribution Under PEM

The discussion of parameter convergence is based on the assumption that there
exists a true parameter vector. Even when the parameters converge to their
"best" values, it is still possible for the model to show significant bias from the
true system if the model structure used for the identification is not sufficiently
rich. For example, an FIR model with too few coefficients may differ from
the true system significantly even with the exact impulse response coefficients.
Understanding how the choice of input signal affects the distribution of model
bias in the frequency domain is important, especially when developing a model
for closed-loop control purposes, since the accuracy of the model fit in certain
frequency regions (e.g., the cross-over frequency region) can be more important
than in others. The consideration of bias distribution is particularly relevant
in the context of fitting a low-order system model to data.
In the prediction error method, parameters are fitted on the basis of the
criterion

$$\min_\theta \frac{1}{N}\sum_{k=1}^N e_{\rm pred}^2(k,\theta) \tag{12.47}$$

where e_pred(k, θ) = H⁻¹(q, θ){y(k) − G(q, θ)u(k)}. Suppose the true system is



represented by

$$y(k) = G_o(q)u(k) + H_o(q)\epsilon(k) \tag{12.48}$$

Then,

$$e_{\rm pred}(k,\theta) = \frac{G_o(q) - G(q,\theta)}{H(q,\theta)}u(k) + \frac{H_o(q)}{H(q,\theta)}\epsilon(k) \tag{12.49}$$

By Parseval's theorem,

$$\begin{aligned}\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N e_{\rm pred}^2(k,\theta) &= \frac{1}{2\pi}\int_{-\pi}^{\pi}\Phi_e(\omega)\,d\omega\\ &= \frac{1}{2\pi}\int_{-\pi}^{\pi}\left(\left|G_o(e^{j\omega}) - G(e^{j\omega},\theta)\right|^2\frac{\Phi_u(\omega)}{|H(e^{j\omega},\theta)|^2} + \frac{|H_o(e^{j\omega})|^2}{|H(e^{j\omega},\theta)|^2}\Phi_\epsilon(\omega)\right)d\omega\end{aligned} \tag{12.50}$$

where Φ_e(ω) is the spectrum of e_pred(k).

To obtain some insight, let us assume for the moment that the noise model
does not contain any unknown parameters. Then

$$\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N e_{\rm pred}^2(k,\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left(\left|G_o(e^{j\omega}) - G(e^{j\omega},\theta)\right|^2\frac{\Phi_u(\omega)}{|H(e^{j\omega})|^2} + \frac{|H_o(e^{j\omega})|^2}{|H(e^{j\omega})|^2}\Phi_\epsilon(\omega)\right)d\omega \tag{12.51}$$

Since the last term of the integrand is unaffected by the choice of θ in this
case, we may conclude that PEM selects θ such that the L₂-norm of the error
G_o(q) − G(q, θ), weighted by the filtered input spectrum Φ_{u_f}(ω) (where u_f(k) =
H⁻¹(q)u(k)), is minimized. An implication is that, in order to obtain a good
frequency response estimate in a certain frequency region, the filtered input u_f
must be designed so that its power is concentrated in that region. If we want
good frequency response estimates over the entire frequency range, an input signal with
a flat spectrum (e.g., a sequence of independent, zero-mean random variables)
is the best choice.
The frequency-domain bias distribution can be made more flexible by adding
the option of prefiltering the input-output data before applying the PEM. This
amounts to minimizing the filtered prediction error e_fpred (= L(q)e_pred). In
this case,

$$\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N e_{\rm fpred}^2(k,\theta) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left(\left|G_o(e^{j\omega}) - G(e^{j\omega},\theta)\right|^2\frac{|L(e^{j\omega})|^2\Phi_u(\omega)}{|H(e^{j\omega})|^2} + \frac{|L(e^{j\omega})|^2|H_o(e^{j\omega})|^2}{|H(e^{j\omega})|^2}\Phi_\epsilon(\omega)\right)d\omega \tag{12.52}$$

Hence, by pre-filtering the data before the parameter estimation, one can
affect the bias distribution. This is a useful flexibility, as it is not always easy
to shape or change the input spectrum; for example, new data may have to
be collected in order to change the input spectrum.


Finally, we have based our argument on the case where the noise model
does not contain any unknown parameters. When the noise model contains
parameters that are shared with the process model (as in ARX or ARMAX
models), the noise spectrum |H_o(e^{jω})|² does have an effect on the bias distribution. However, the qualitative effects of the input spectrum and of prefiltering
remain the same.
Example 12.1 Show the effect of prefiltering on the frequency bias.
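Pending that example, the following sketch illustrates the experiment it calls for (Python with NumPy/SciPy; the true system, FIR model order, and filter are illustrative choices of ours): a short FIR model is fitted to data from a higher-order system, with and without a low-pass prefilter L(q) applied to both input and output.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(1)
N = 4000
u = rng.standard_normal(N)                          # flat input spectrum
b_true, a_true = [0.0, 0.4, 0.3], [1.0, -1.2, 0.5]  # illustrative 2nd-order system
y = signal.lfilter(b_true, a_true, u) + 0.05 * rng.standard_normal(N)

def fir_fit(yd, ud, n=6):
    Phi = np.column_stack([ud[n - i: len(ud) - i] for i in range(1, n + 1)])
    return np.linalg.lstsq(Phi, yd[n:], rcond=None)[0]

# Low-pass prefilter L(q): emphasizes the fit at low frequencies
bL, aL = signal.butter(2, 0.2)
h_plain = fir_fit(y, u)
h_filt = fir_fit(signal.lfilter(bL, aL, y), signal.lfilter(bL, aL, u))

w, G0 = signal.freqz(b_true, a_true, worN=256)
for name, h in [("no prefilter", h_plain), ("low-pass prefilter", h_filt)]:
    _, Gh = signal.freqz(np.r_[0.0, h], [1.0], worN=256)
    lo = w < 0.2 * np.pi
    print(name, "max low-frequency error:", np.max(np.abs(G0 - Gh)[lo]))
```

The prefiltered fit should show a smaller low-frequency error at the price of a larger error near the Nyquist frequency, consistent with (12.52).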

12.2.6 Parameter Estimation Via Statistical Methods (*)

In formulating the prediction error minimization, we did not require an exact
statistical description of the underlying plant. Prediction error minimization
is a logical criterion for parameter estimation regardless of the true nature of the
underlying plant (i.e., even if the assumed model structure does not match
the real plant exactly). When a precise statistical description of the plant is
available, the parameters can be estimated in an optimal fashion based on some
well-defined criterion. Although it may be difficult to come up with an exact
probabilistic description of the plant in reality, studying these methods can
provide some useful insights into the performance of more empirical methods
like the prediction error method. We present the two most popular methods
here.

Maximum Likelihood Estimation

In system identification, one is trying to extract system information out of
measurements that are inherently unreliable. In maximum likelihood estimation, this notion is formalized by describing each observation as a realization
of a random variable with a certain probability distribution. For instance, if we
assume the model

$$y(k) = \phi^T(k)\theta + \epsilon(k) \tag{12.53}$$

where ε(k) is a Gaussian variable with zero mean and variance r_ε, then the
probability density function (PDF) of y(k) becomes

$$dF(\zeta;\,y(k)) = \frac{1}{\sqrt{2\pi r_\epsilon}}\exp\left(-\frac{(\zeta-\phi^T(k)\theta)^2}{2 r_\epsilon}\right) \tag{12.54}$$
In the above, ζ represents a particular realized value of y(k) (see Appendix ?? for notation).
In performing parametric identification with N data points, we can work
with a joint PDF for Y_N = (y(1), ..., y(N)). Let us denote the joint PDF as
dF(ζ_N; Y_N). Again, ζ_N is a variable representing a realization of Y_N. Suppose the actual observations (the data) are given as Ȳ_N = (ȳ(1), ..., ȳ(N)).
Once we insert these values into the probability density function, dF(Ȳ_N; Y_N)
is a deterministic function of θ called the likelihood function. We denote the
likelihood function for the observation Ȳ_N as ℓ(θ|Ȳ_N).
The basic idea of maximum likelihood estimation is to make the observations most likely by choosing θ such that the likelihood function is maximized.
In other words,

$$\hat\theta_N^{ML} = \arg\max_\theta\, \ell(\theta|\bar Y_N) \tag{12.55}$$

It is generally very difficult to derive the likelihood function starting from a
stochastic system model. An exception is the case when the model can be
put into a form in which the observation is linear with respect to the random
variables.
Let us apply the maximum likelihood method to the following linear identification problem:

$$Y_N = \Phi_N\theta + E_N \tag{12.56}$$

In the above, we assume that E_N is a zero-mean Gaussian vector of
covariance R_E. Then, we have

$$dF(\bar Y_N;\,Y_N) = dF(\bar Y_N - \Phi_N\theta;\,E_N) = \frac{1}{\sqrt{(2\pi)^N \det(R_E)}}\exp\left\{-\frac{1}{2}(\bar Y_N - \Phi_N\theta)^T R_E^{-1}(\bar Y_N - \Phi_N\theta)\right\} \tag{12.57}$$

The maximum likelihood estimator is defined as

$$\begin{aligned}\hat\theta^{ML} &= \arg\max_\theta\, dF(\bar Y_N;\,Y_N)\\ &= \arg\max_\theta\, \log dF(\bar Y_N;\,Y_N)\\ &= \arg\max_\theta\left\{-\tfrac{1}{2}(\bar Y_N - \Phi_N\theta)^T R_E^{-1}(\bar Y_N - \Phi_N\theta)\right\}\\ &= \arg\min_\theta\left\{\tfrac{1}{2}(\bar Y_N - \Phi_N\theta)^T R_E^{-1}(\bar Y_N - \Phi_N\theta)\right\}\end{aligned} \tag{12.58}$$

Note that the above is a weighted least-squares estimator. We see that, when
the weighting matrix is chosen as the inverse of the covariance matrix of the
output error term E_N, weighted least-squares estimation is equivalent to
maximum likelihood estimation. In addition, the unweighted least-squares
estimator is a maximum likelihood estimator for the case when the output
error is an i.i.d. Gaussian sequence, in which case the covariance matrix of
E_N is of the form r_ε I_N.
Bayesian Estimation

Bayesian estimation is a philosophically different approach to the parameter
estimation problem. In this approach, the parameters themselves are viewed as
random variables with a certain prior probability distribution. If the observations are described in terms of the parameter vector, the probability distribution of the parameter vector changes after the observations. The distribution
after the observations is called the posterior probability distribution, and is given
by the conditional distribution of the parameter vector conditioned on the
observation vector. The parameter value at which the posterior PDF attains
its maximum is called the maximum a posteriori (MAP) estimate. It is also
possible to use the mean of the posterior distribution as an estimate, which
gives the minimum variance estimate.
One useful rule for computing the posterior PDF is Bayes's rule. Let us
denote the conditional PDF of the parameter vector for given observations as
dF(ζ|Ȳ_N; θ|Y_N). Then Bayes's rule says

$$dF(\zeta|\bar Y_N;\,\theta|Y_N) = \frac{dF(\bar Y_N|\zeta;\,Y_N|\theta)\,dF(\zeta;\,\theta)}{dF(\bar Y_N;\,Y_N)} \tag{12.59}$$

dF(Ȳ_N; Y_N) is independent of θ and is therefore constant once evaluated for a
given observation Ȳ_N. Hence, the MAP estimator becomes

$$\hat\theta_N^{MAP} = \arg\max_\theta\, dF(\bar Y_N|\zeta;\,Y_N|\theta)\,dF(\zeta;\,\theta) \tag{12.60}$$

Note that we end up with the parameter value that maximizes the product of
the likelihood function and the prior density.
Let us again apply this concept to the linear parameter estimation problem

$$Y_N = \Phi_N\theta + E_N \tag{12.61}$$

where E_N is a Gaussian vector of zero mean and covariance R_E. We also treat
θ as a Gaussian vector of mean θ̂(0) and covariance P(0); hence, the prior for
θ is a normal distribution with the above mean and covariance.
Next, let us evaluate the posterior PDF using Bayes's rule:

$$dF(\zeta|\bar Y_N;\,\theta|Y_N) = [\text{constant}]\cdot dF_N(\bar Y_N;\,Y_N)_{(\Phi_N\theta,\,R_E)}\cdot dF_N(\zeta,\,\theta)_{(\hat\theta(0),\,P(0))} \tag{12.62}$$

where

$$dF_N(\bar x,\,x)_{(\hat x,\,R)} = \frac{1}{\sqrt{(2\pi)^N\det(R)}}\exp\left(-\frac{1}{2}(\bar x-\hat x)^T R^{-1}(\bar x-\hat x)\right) \tag{12.63}$$

The MAP estimate can be obtained by maximizing the logarithm of the posterior PDF:

$$\begin{aligned}\hat\theta_N^{MAP} &= \arg\max_\theta\left\{-\tfrac{1}{2}(\bar Y_N-\Phi_N\theta)^T R_E^{-1}(\bar Y_N-\Phi_N\theta) - \tfrac{1}{2}(\theta-\hat\theta(0))^T P^{-1}(0)(\theta-\hat\theta(0))\right\}\\ &= \arg\min_\theta\left\{\tfrac{1}{2}(\bar Y_N-\Phi_N\theta)^T R_E^{-1}(\bar Y_N-\Phi_N\theta) + \tfrac{1}{2}(\theta-\hat\theta(0))^T P^{-1}(0)(\theta-\hat\theta(0))\right\}\end{aligned} \tag{12.64}$$

Solving the above least-squares problem, we obtain

$$\hat\theta_N^{MAP} = \left(\Phi_N^T R_E^{-1}\Phi_N + P^{-1}(0)\right)^{-1}\left(\Phi_N^T R_E^{-1}\bar Y_N + P^{-1}(0)\hat\theta(0)\right) \tag{12.65}$$

Using the Matrix Inversion Lemma, we can rewrite the above as

$$\hat\theta_N^{MAP} = \hat\theta(0) + P(0)\Phi_N^T\left(\Phi_N P(0)\Phi_N^T + R_E\right)^{-1}\left(\bar Y_N - \Phi_N\hat\theta(0)\right) \tag{12.66}$$

We make the following observations:

• From (12.64), we see that, as long as P(0) is chosen as a nonsingular
matrix and the persistent excitation condition is satisfied, the first term
will dominate the second as N → ∞, and θ̂_N^MAP converges to θ̂_N^LS. Hence,
all the asymptotic properties of least-squares identification apply to
the above method as well.

• If P(0) is chosen as a singular matrix, the estimate of θ may be biased,
since the null space of P(0) represents the parameter subspace corresponding to zero update gain.

• From (12.64), we see that specifying the initial parameter covariance matrix P(0) to be other than ∞·I is equivalent to penalizing the deviation
from the initial parameter guess through the weighting matrix P⁻¹(0) in the
least-squares framework. The standard least-squares solution is interpreted in the Bayesian framework as the MAP solution corresponding to
the uniform prior distribution (i.e., P(0) = ∞·I).

Utilizing prior knowledge in the above framework can help us obtain a
smoother and more realistic impulse response. In Section ??, we suggested
using a diagonal weighting matrix to penalize the magnitudes of the impulse
response coefficients so that a smoother step response may be obtained. We
now see that this is equivalent to specifying the initial parameter covariance
as a diagonal matrix (i.e., the inverse of the weighting matrix) in the Bayesian
framework. The statistical interpretation provides a formal justification for
this practice and a systematic way to choose the weighting matrix (possibly
as a nondiagonal matrix).
(12.66) can be written in the following recursive form (see Exercise ??):

$$\begin{aligned}\hat\theta(k) &= \hat\theta(k-1) + K(k)\left(y(k) - \phi^T(k)\hat\theta(k-1)\right)\\ K(k) &= \frac{P(k-1)\phi(k)}{r_\epsilon + \phi^T(k)P(k-1)\phi(k)}\\ P(k) &= P(k-1) - \frac{P(k-1)\phi(k)\phi^T(k)P(k-1)}{r_\epsilon + \phi^T(k)P(k-1)\phi(k)}\end{aligned} \tag{12.67}$$

θ̂(k) represents θ̂_k^MAP or E{θ|Y_k}, and P(k) = E{(θ − θ̂(k))(θ − θ̂(k))^T|Y_k}.
Derivation of the above formula is straightforward when one considers the problem as a special case of the Kalman filter.
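A direct transcription of (12.67) into code might look as follows (Python/NumPy; the recursion is started from the prior (θ̂(0), P(0))):

```python
import numpy as np

def rls_update(theta, P, phi, y, r_eps):
    """One step of the recursive least-squares / MAP update, cf. (12.67)."""
    denom = r_eps + phi @ P @ phi
    K = P @ phi / denom                       # update gain K(k)
    theta = theta + K * (y - phi @ theta)     # parameter update
    P = P - np.outer(P @ phi, phi @ P) / denom
    return theta, P

# usage sketch: feed the data one sample at a time
# theta, P = theta0.copy(), P0.copy()
# for k in range(p, N):
#     theta, P = rls_update(theta, P, phi_of(k), y[k], r_eps)
```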


One could generalize the above to the case of time-varying parameters
based on the following system model for the parameter variation:

$$\begin{aligned}\theta(k) &= \theta(k-1) + \nu_1(k)\\ y(k) &= \phi^T(k)\theta(k) + \nu_2(k),\end{aligned} \tag{12.68}$$

where ν₁(k) and ν₂(k) are i.i.d. Gaussian sequences. Here the parameter vector
θ(k) is assumed to be time-varying in a random-walk fashion. One may also
model ν₁(k) and ν₂(k) as auto-correlated signals by further augmenting the
state vector. The recursive Bayesian estimator for the above can be derived
using the Kalman filter technique (see Exercise ??).
We will demonstrate an application of the Bayesian approach to impulse response coefficient identification through the following example.
Example 12.2 In practice, it may be more appropriate to assume (prior to
the identification) that the derivatives of the impulse response are zero-mean random
variables of Gaussian distribution, and to specify the covariance of the derivatives
of the impulse response coefficients. In other words, one may specify

$$E\left\{\left.\frac{dh}{dt}\right|_{t=iT_s}\right\} \approx E\left\{\frac{h_i - h_{i-1}}{T_s}\right\} = 0;\qquad 1 \le i \le n \tag{12.69}$$

$$E\left\{\left(\left.\frac{dh}{dt}\right|_{t=iT_s}\right)^2\right\} \approx E\left\{\left(\frac{h_i - h_{i-1}}{T_s}\right)^2\right\} = \frac{\delta_i^2}{T_s^2} \tag{12.70}$$

In this case, P(0) (the covariance for θ = [h₁ ⋯ h_n]^T) takes the following form:

$$P(0) = \left(D^T\,\mathrm{diag}\!\left(\frac{1}{\delta_1^2},\ldots,\frac{1}{\delta_n^2}\right)D\right)^{-1}, \qquad D = \begin{bmatrix}1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1\end{bmatrix} \tag{12.71}$$

Note that the above translates into penalizing the 2-norm of the difference
between two successive impulse response coefficients in the least-squares identification method. It is straightforward to use the above concepts to include
prior estimates of time constants, etc.

(Comment: ADD NUMERICAL EXAMPLE HERE!!!)
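Until that example is added, the following sketch (Python/NumPy; the system and prior settings are illustrative choices of ours) computes the MAP estimate (12.65) under the smoothness prior (12.71) and compares it with the plain least-squares FIR estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, r_eps = 30, 200, 0.05                      # FIR length, data, noise variance
h_true = 0.5 * 0.8 ** np.arange(1, n + 1)        # smooth illustrative impulse response

u = rng.standard_normal(N)
Phi = np.column_stack([u[n - i: N - i] for i in range(1, n + 1)])
y = Phi @ h_true + np.sqrt(r_eps) * rng.standard_normal(N - n)

# Smoothness prior: first differences of h have zero mean and variance delta^2
D = np.eye(n) - np.eye(n, k=-1)
delta2 = 0.01
P0_inv = D.T @ D / delta2                        # P(0)^{-1}, cf. (12.71)
theta0 = np.zeros(n)                             # prior mean

h_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]
h_map = np.linalg.solve(Phi.T @ Phi / r_eps + P0_inv,
                        Phi.T @ y / r_eps + P0_inv @ theta0)
print("LS error: ", np.linalg.norm(h_ls - h_true))
print("MAP error:", np.linalg.norm(h_map - h_true))
```

With short, noisy data records the MAP estimate should come out visibly smoother and closer to the true response than the unregularized fit.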

12.2.7 Other Methods

There are other methods for estimating parameters in the literature. A method
that stands out is the instrumental variable (IV) method. The basic idea
behind this method is that, in order for a model to be good, the prediction
error must show little or no correlation with past data. If they show significant
correlation, it implies that there is information left in the past data not utilized
by the predictor.
In the IV method, a set of variables called instruments (denoted by the
vector ζ hereafter) must be defined first. ζ contains some transformations
(linear or nonlinear) of past data (y(k−1), ..., y(0), u(k−1), ..., u(0)). Then,
θ is determined from the following relation:

$$\frac{1}{N}\sum_{k=1}^N \zeta(k)\,e_{\rm pred}(k,\theta) = 0 \tag{12.72}$$

ζ(k) is typically chosen to be of the same dimension as the parameter vector θ.
This way, one obtains the same number of equations as unknowns. Sometimes ζ is chosen to be of higher dimension; then θ can be determined by
minimizing some norm of $\frac{1}{N}\sum_{k=1}^N \zeta(k)e_{\rm pred}(k,\theta)$. Filtered e_pred can be used
as well in the above. The success of the method obviously depends critically
on the choice of instruments. If ζ(k) is chosen as φ(k), one obtains the same
estimate as the least-squares estimate. It is also possible to choose ζ so that it
contains parameters; this leads to pseudo-linear regression.
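For an ARX-type model with open-loop data, one simple (illustrative) choice of instruments is extra-delayed inputs, which are uncorrelated with the disturbance; (12.72) then becomes a square linear system (Python/NumPy assumed):

```python
import numpy as np

def iv_arx(y, u, n, m):
    """Instrumental-variable estimate with delayed-input instruments."""
    p = n + m                                  # enough delay for all instruments
    N = len(y)
    # phi(k) = [-y(k-1) ... -y(k-n)  u(k-1) ... u(k-m)]
    Phi = np.column_stack(
        [-y[p - i: N - i] for i in range(1, n + 1)]
        + [u[p - i: N - i] for i in range(1, m + 1)]
    )
    # instruments zeta(k): n+m delayed inputs in place of the output regressors
    Z = np.column_stack([u[p - i: N - i] for i in range(1, n + m + 1)])
    # (1/N) sum zeta(k) e_pred(k, theta) = 0  <=>  Z^T (y - Phi theta) = 0
    return np.linalg.solve(Z.T @ Phi, Z.T @ y[p:])
```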
Other notable variations on least-squares regression are the so-called
biased regression methods, in which the regression is restricted to a subspace
of the regressor space. The subspace is not chosen a priori, but is formed by
incrementally adding one-dimensional subspaces chosen to maximize the covariance of the data (as in Principal Component Regression) or the covariance
between φ and y (as in Partial Least Squares). These methods are designed
to reduce the variance (especially when the data do not show adequate excitation
over the whole regressor space) at the expense of bias. In the Bayesian estimation setting, this can be interpreted as choosing a singular initial covariance
matrix P(0); however, the singular directions are determined on the basis of
data rather than prior knowledge.

12.3 Nonparametric Identification Methods

When one has little prior knowledge about the system, nonparametric identification, which assumes very little about the underlying system, is an alternative. Nonparametric model structures include frequency response models,
impulse response models, etc. These model structures intrinsically have no
finite-dimensional parameter representations. In reality, however, the dividing line between parametric identification and nonparametric identification is
somewhat blurred: in nonparametric identification, some assumptions are always made about the system structure (e.g., a finite-length impulse response,
smoothness of the frequency response) to obtain a well-posed estimation problem, and in parametric identification, a proper choice of model order
is often determined by examining the residuals from fitting models of various
orders.

12.3.1 Frequency Response Identification

The dynamics of a general linear system can be represented by the system's frequency response, which is defined through the amplitude ratio and phase angle
at each frequency. The frequency response information is conveniently represented as a complex function of ω, of which the modulus and argument define
the amplitude ratio and the phase angle, respectively. Such a function can be
easily derived from the system's transfer function G(q) by replacing q with e^{jω}.
Hence, the amplitude ratio and phase angle of the system at each frequency
are related to the transfer function parameters through the following relations:

$$A.R.(\omega) = |G(e^{j\omega})| = \sqrt{\mathrm{Re}\{G(e^{j\omega})\}^2 + \mathrm{Im}\{G(e^{j\omega})\}^2} \tag{12.73}$$

$$\phi(\omega) = \arg G(e^{j\omega}) = \tan^{-1}\frac{\mathrm{Im}\{G(e^{j\omega})\}}{\mathrm{Re}\{G(e^{j\omega})\}} \tag{12.74}$$

Since G(e^{jω}), 0 ≤ ω ≤ π (for a system with a sample time of 1), defines the system dynamics completely, one approach to system identification is to identify
G(e^{jω}) directly. This belongs to the category of nonparametric identification,
as the frequency response is not parameterized by a finite-dimensional parameter
vector. (There are an infinite number of frequency points in {0 ≤ ω ≤ π}.)
Frequency Response Computation

The most immediate way to identify the frequency response is through sine-wave testing, where sinusoidal perturbations are made directly to the system input
at different frequencies. Though conceptually straightforward, this method is
of limited practical value, since (1) sinusoidal perturbations are difficult to
make in practice, and (2) each experiment gives the frequency response at only a
single frequency.
A more practical approach is to use the results from Fourier analysis.
From the z-domain input/output relationship, it is immediate that, for the system
y(k) = G(q)u(k),

$$G(e^{j\omega}) = \frac{Y(\omega)}{U(\omega)} \tag{12.75}$$

where

$$Y(\omega) = \sum_{k=1}^{\infty} y(k)e^{-j\omega k} \tag{12.76}$$

$$U(\omega) = \sum_{k=1}^{\infty} u(k)e^{-j\omega k} \tag{12.77}$$

Hence, by dividing the Fourier transform of the output data by that of
the input data, one can compute the system's frequency response. What
complicates frequency response identification in practice is that one only
has a finite-length data record. In addition, the output data are corrupted by
noise and disturbances.
Let us assume that the underlying system is represented by

$$y(k) = G(q)u(k) + e(k) \tag{12.78}$$

where e(k) is a zero-mean stationary sequence that collectively describes the
effect of noise and disturbances. We define

$$Y_N(\omega) = \frac{1}{\sqrt{N}}\sum_{k=1}^N y(k)e^{-j\omega k} \tag{12.79}$$

$$U_N(\omega) = \frac{1}{\sqrt{N}}\sum_{k=1}^N u(k)e^{-j\omega k} \tag{12.80}$$

$$E_N(\omega) = \frac{1}{\sqrt{N}}\sum_{k=1}^N e(k)e^{-j\omega k} \tag{12.81}$$

Then,

$$\hat G_N(\omega) = \frac{Y_N(\omega)}{U_N(\omega)} = G(e^{j\omega}) + \frac{R_N(\omega)}{U_N(\omega)} + \frac{E_N(\omega)}{U_N(\omega)} \tag{12.82}$$

where $|R_N(\omega)| \le \frac{c_1}{\sqrt{N}}$ for some c₁. Ĝ_N(ω), computed as above using N data
points, is an estimate of the true system frequency response G(e^{jω}) and will be
referred to as the Empirical Transfer Function Estimate (ETFE).
Statistical Properties of the ETFE

Let us take the expectation of (12.82):

$$E\{\hat G_N(\omega)\} = E\left\{G(e^{j\omega}) + \frac{R_N(\omega)}{U_N(\omega)} + \frac{E_N(\omega)}{U_N(\omega)}\right\} = G(e^{j\omega}) + \frac{R_N(\omega)}{U_N(\omega)} \tag{12.83}$$

We can also compute the variance as

$$E\left\{\left|\hat G_N(\omega) - G(e^{j\omega})\right|^2\right\} = \frac{\Phi_e(\omega) + \rho_N}{|U_N(\omega)|^2} \tag{12.84}$$

where $|\rho_N| \le \frac{c_2}{N}$. The implications of the above are as follows:

• Since the second term on the RHS of (12.83) decays as 1/√N, Ĝ_N(ω) is an
asymptotically unbiased estimate of G(e^{jω}).

• If u(k) is a periodic signal with period N, |U_N(ω)| is nonzero only at
N frequency points (at ω = 2πk/N, k = 0, ..., N−1). This means that
the ETFE is defined only at these N frequency points. |U_N(ω)| at these
frequency points keeps growing as N → ∞, and from (12.84) we
see that the variance goes to zero.

• If u(k) is a randomly generated signal, then as N increases, the number of
frequency points at which the ETFE can be computed also increases.
However, |U_N(ω)|² fluctuates around the spectrum of
u(k) and therefore does not grow with the amount of data. From (12.84), we conclude that the variance does not decay to zero. This is a characteristic of
any nonparametric identification where, roughly speaking, one is trying
to estimate an infinite number of parameters.
A practical implication of the last comment is that the estimate can be
very sensitive to noise in the data (no matter how many data points are used).
Hence, some smoothing is needed. The following are some simple smoothing
methods:

• Select a finite number of frequency points ω₁, ..., ω_m between 0 and π.
Assume that G(e^{jω}) is constant over ωᵢ − Δω ≤ ω ≤ ωᵢ + Δω. The
ETFEs Ĝ_N(ξ) obtained within this window are then averaged, for instance, according to the signal-to-noise ratio |U_N(ξ)|²/Φ_e(ξ). Since the number
of frequency response parameters becomes finite under this assumption,
the variance decays to zero as 1/N. However, the assumption leads to
bias.

• A generalization of the above is to use a weighting function W_s(ξ − ω)
for smoothing. The ETFE is smoothed according to

$$\hat G_N^s(\omega) = \frac{\displaystyle\int_{-\pi}^{\pi} W_s(\xi-\omega)\,\hat G_N(\xi)\,\frac{|U_N(\xi)|^2}{\Phi_e(\xi)}\,d\xi}{\displaystyle\int_{-\pi}^{\pi} W_s(\xi-\omega)\,\frac{|U_N(\xi)|^2}{\Phi_e(\xi)}\,d\xi} \tag{12.85}$$

W_s is a function that is centered around zero and is symmetric. It usually
includes a parameter that determines the width of the smoothing window
and therefore the trade-off between bias and variance: a larger window
reduces variance but increases bias, and vice versa. Again, the variance
can be shown to decay as 1/N for a nonzero smoothing window.
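A compact sketch of the smoothed ETFE follows (Python/NumPy; the Hamming window and sizes are illustrative, and Φ_e is taken as flat, so the weights reduce to |U_N(ξ)|²):

```python
import numpy as np

def etfe_smoothed(y, u, width=9):
    """ETFE Y_N/U_N smoothed with a frequency-domain window, cf. (12.85)."""
    N = len(y)
    Y = np.fft.rfft(y) / np.sqrt(N)
    U = np.fft.rfft(u) / np.sqrt(N)
    G_raw = Y / U
    w = np.hamming(width)                   # smoothing window W_s
    snr = np.abs(U) ** 2                    # |U_N|^2 / Phi_e with flat Phi_e
    num = np.convolve(G_raw * snr, w, mode="same")
    den = np.convolve(snr, w, mode="same")
    freqs = 2.0 * np.pi * np.fft.rfftfreq(N)
    return freqs, G_raw, num / den          # raw and smoothed estimates
```

Widening the window trades variance for bias exactly as described above.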

12.3.2 Impulse Response Identification

Impulse response identification is another form of nonparametric identification
that is commonly used in practice. Suppose the underlying system is described
by the convolution model

$$y(k) = \sum_{i=1}^{\infty} H_i u(k-i) + e(k) \tag{12.86}$$

Now post-multiply u^T(k−τ) to the above equation to obtain

$$y(k)u^T(k-\tau) = \sum_{i=1}^{\infty} H_i u(k-i)u^T(k-\tau) + e(k)u^T(k-\tau) \tag{12.87}$$

Summing up the data from k = 1 to k = N,

$$\frac{1}{N}\sum_{k=1}^N y(k)u^T(k-\tau) = \sum_{i=1}^{\infty} H_i\left(\frac{1}{N}\sum_{k=1}^N u(k-i)u^T(k-\tau)\right) + \frac{1}{N}\sum_{k=1}^N e(k)u^T(k-\tau) \tag{12.88}$$

Assuming the signals are quasi-stationary, we can take the limit of the above
as N → ∞ to obtain

$$R_{yu}(\tau) = \sum_{i=1}^{\infty} H_i R_{uu}(\tau-i) + R_{eu}(\tau) \tag{12.89}$$

where

$$R_{yu}(\tau) = \lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N y(k)u^T(k-\tau) \tag{12.90}$$

$$R_{uu}(\tau) = \lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N u(k)u^T(k-\tau) \tag{12.91}$$

$$R_{eu}(\tau) = \lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^N e(k)u^T(k-\tau) \tag{12.92}$$

The above equation can also be derived from a statistical argument. More
specifically, we can take the expectation of (12.87) to obtain

$$E\{y(k)u^T(k-\tau)\} = \sum_{i=1}^{\infty} H_i E\{u(k-i)u^T(k-\tau)\} + E\{e(k)u^T(k-\tau)\} \tag{12.93}$$

Assuming {u(k)} and {e(k)} are stationary sequences, R_uu, R_yu and R_eu are
simply the expectations in the above.
Now, let us assume that {u(k)} is a zero-mean stationary sequence that
is uncorrelated with {e(k)}; {e(k)} is assumed to be stationary as well. Then,
R_eu(τ) → 0 as N → ∞. Let us also assume that Hᵢ = 0 for i > n. Such an n can
be determined by examining R_yu(τ) under a white noise perturbation: when
the input signal is white, R_uu(i) = 0 except for i = 0, and from the above it is clear
that R_yu(τ) = 0 if H_τ = 0. Hence, one can choose n such that R_yu(τ) ≈ 0 for
τ > n.
With these assumptions, we can write (12.89) as

$$\begin{bmatrix} R_{yu}(1) & R_{yu}(2) & \cdots & R_{yu}(n) \end{bmatrix} = \begin{bmatrix} H_1 & H_2 & \cdots & H_n \end{bmatrix}\begin{bmatrix} R_{uu}(0) & R_{uu}(1) & \cdots & R_{uu}(n-1)\\ R_{uu}(-1) & R_{uu}(0) & \cdots & R_{uu}(n-2)\\ \vdots & \vdots & \ddots & \vdots\\ R_{uu}(-n+1) & R_{uu}(-n+2) & \cdots & R_{uu}(0) \end{bmatrix} \tag{12.94}$$

Taking the transpose of the above equation and rearranging gives

$$\begin{bmatrix} H_1^T\\ H_2^T\\ \vdots\\ H_n^T \end{bmatrix} = \begin{bmatrix} R_{uu}(0) & R_{uu}(1) & \cdots & R_{uu}(n-1)\\ R_{uu}(-1) & R_{uu}(0) & \cdots & R_{uu}(n-2)\\ \vdots & \vdots & \ddots & \vdots\\ R_{uu}(-n+1) & R_{uu}(-n+2) & \cdots & R_{uu}(0) \end{bmatrix}^{-1}\begin{bmatrix} R_{yu}^T(1)\\ R_{yu}^T(2)\\ \vdots\\ R_{yu}^T(n) \end{bmatrix} \tag{12.95}$$

Here we used the fact that $R_{uu}^T(i) = R_{uu}(-i)$.
In practice, one has to approximate R_uu(i) and R_yu(i) with finite-length
data. With such approximations, the parameter variance can be significant. However, because we limited the number of impulse response coefficients to n by
assuming Hᵢ = 0 for i > n, we only need estimates for {R_uu(i), i = 0, ..., n−1}
and {R_yu(i), i = 1, ..., n}. Hence, the variance decays as 1/N (assuming
the matrix remains nonsingular). However, some bias results because of
the truncation. Again, the choice of n determines the trade-off between the
variance and the bias.
It is noteworthy that (12.95) gives virtually the same estimate as least-squares identification when the number of data points is large. (See Exercise ???)
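For a SISO system, the procedure reduces to estimating the correlation functions and solving a Toeplitz system; a sketch (Python with NumPy/SciPy assumed):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def corr(x, z, tau_max):
    """Sample cross-correlation R_xz(tau) = (1/N) sum_k x(k) z(k - tau)."""
    N = len(x)
    return np.array([np.dot(x[tau:], z[: N - tau]) / N
                     for tau in range(tau_max + 1)])

def impulse_by_correlation(y, u, n):
    Ruu = corr(u, u, n - 1)          # R_uu(0..n-1); R_uu(-i) = R_uu(i) for SISO
    Ryu = corr(y, u, n)[1:]          # R_yu(1..n)
    # Solve the symmetric Toeplitz system of (12.95)
    return solve_toeplitz((Ruu, Ruu), Ryu)
```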

12.4 Subspace Identification

In identifying a multivariable system, it is advantageous to use a model structure that includes cross-correlation among the outputs. Not only can this
improve the identification of the deterministic part, but the multivariable noise model
so identified can be useful in certain applications, e.g., those that require cross-channel feedback updates for inferential control. To build a model with cross-correlation information, however, MIMO identification (rather than SISO or
MISO identification) is needed. Polynomial models like the MIMO ARMAX
model are generally difficult to work with in this context, since they give
rise to numerically ill-conditioned, nonlinear estimation problems with many
local minima. In addition, significant prior knowledge (e.g., system order, observability indices) is needed to obtain a proper model parameterization. An
alternative is to identify a state-space model using a subspace identification
method. Its advantages are that it involves only non-iterative linear algebra
operations and that it requires little prior system knowledge from the user.
Subspace identification is a relatively recent development and has its roots in
classical realization theory.

12.4.1 The Basic Method

Different subspace identification algorithms available in the literature share
the same basic concept, which we present here. In subspace identification,
a state sequence is first determined (either implicitly or explicitly) from
a projection of input-output data, and the state-space system matrices are
determined subsequently using the constructed state sequence.
The Problem Formulation

Suppose the underlying system is the standard linear stochastic state-space system

$$\begin{aligned} x(k+1) &= Ax(k) + Bu(k) + w(k)\\ y(k) &= Cx(k) + \nu(k), \end{aligned} \tag{12.96}$$

where w(k) and ν(k) are white noise sequences. The usual assumptions of
reachability and observability, as well as stationarity, will be enforced here.
The following steady-state Kalman filter equation is equivalent to the above
system in an input-output sense and is called the innovation form of (12.96):

$$\begin{aligned} \hat x(k+1) &= A\hat x(k) + Bu(k) + K\bar\epsilon(k)\\ y(k) &= C\hat x(k) + \bar\epsilon(k) \end{aligned} \tag{12.97}$$

In the above, x̂(k+1) and x̂(k) represent consecutive state estimates from
the steady-state Kalman filter, K the Kalman gain, and ε̄(k) the corresponding innovation sequence, which is white. The system equation we identify
in subspace identification is

$$\begin{aligned} \hat x_{i+1}(k+1) &= A\hat x_i(k) + Bu(k) + K_i\epsilon_i(k)\\ y(k) &= C\hat x_i(k) + \epsilon_i(k), \end{aligned} \tag{12.98}$$

where x̂_{i+1}(k+1) and x̂_i(k) represent consecutive estimates from a non-steady-state Kalman filter that was started at time k−i with the initial estimate
x̂₀(k−i) = 0 and a certain covariance matrix (specifically, the system's open-loop steady-state covariance matrix). The subscripts i+1 and i are added
to denote the number of time steps the Kalman filter has been running. i
should be chosen larger than the state dimension n (i.e., choose it larger than
an upper bound on n).

(12.98) serves as an approximation to (12.97). Note that the system matrices for the deterministic part of (12.98) (i.e., A, B, C) are the same as those
appearing in (12.97). The same is not true for the stochastic parts, but
with a reasonably large i, K_i and Cov{ε_i(k)} are good approximations of K and Cov{ε̄(k)}. The goal is to identify the system matrices
(A, B, C, K_i, Cov{ε_i(k)}), to within a state coordinate transformation.
Multi-Step Prediction Equation Based On The Kalman Estimate

As a first step toward this goal, we note that

$$\begin{bmatrix} y(k)\\ y(k+1)\\ \vdots\\ y(k+i-1) \end{bmatrix} = \begin{bmatrix} C\\ CA\\ \vdots\\ CA^{i-1} \end{bmatrix}\hat x_i(k) + \begin{bmatrix} 0 & 0 & \cdots & 0\\ CB & 0 & \cdots & 0\\ \vdots & \ddots & \ddots & \vdots\\ CA^{i-2}B & \cdots & CB & 0 \end{bmatrix}\begin{bmatrix} u(k)\\ u(k+1)\\ \vdots\\ u(k+i-1) \end{bmatrix} + \begin{bmatrix} e(k|k-1)\\ e(k+1|k-1)\\ \vdots\\ e(k+i-1|k-1) \end{bmatrix} \tag{12.99}$$

where the first two terms together represent the optimal predictions for y(k), ..., y(k+i−1)
based on the Kalman estimate x̂_i(k), and e(k|k−1), ..., e(k+i−1|k−1) are
the corresponding prediction errors. For convenience of notation, let us
denote the above as

$$Y_i^{0+}(k) = W_i^o\hat x_i(k) + L_3 U_i^{0+}(k) + E_i^{0+}(k|k-1) \tag{12.100}$$

It can be easily verified by propagating the Kalman filter equation that x̂_i(k)
is simply a linear function of y(k−i), ..., y(k−1) and u(k−i), ..., u(k−1).
Defining

$$Y_i^-(k) = \begin{bmatrix} y(k-i)\\ \vdots\\ y(k-1) \end{bmatrix}; \qquad U_i^-(k) = \begin{bmatrix} u(k-i)\\ \vdots\\ u(k-1) \end{bmatrix}, \tag{12.101}$$

we can express

$$\hat x_i(k) = \begin{bmatrix} H_1 & H_2 \end{bmatrix}\begin{bmatrix} Y_i^-(k)\\ U_i^-(k) \end{bmatrix} \tag{12.102}$$

where H₁ and H₂ are functions of the system matrices as well as the covariance
matrices (including the initial covariance matrix chosen for the non-stationary
Kalman filter) (see Exercise ??). Combining (12.100) and (12.102), we obtain

$$Y_i^{0+}(k) = \underbrace{W_i^o\begin{bmatrix} H_1 & H_2 \end{bmatrix}}_{\begin{bmatrix} L_1 & L_2 \end{bmatrix}}\begin{bmatrix} Y_i^-(k)\\ U_i^-(k) \end{bmatrix} + L_3 U_i^{0+}(k) + E_i^{0+}(k|k-1) \tag{12.103}$$

An important observation to make at this point is that, when the Kalman
filter underlying the above equation is initialized in the particular manner
described, E_i^{0+}(k|k−1) is orthogonal in a statistical sense to Y_i^−(k), U_i^−(k) and U_i^{0+}(k),
given that the input sequence u(k) is white. Therefore, least-squares
regression gives an unbiased, consistent estimate of [L₁ L₂] and L₃.
Note: The assumption of white input can be relaxed by adopting a different
initialization of the Kalman filter wherein the initial estimate x̂₀(k−i) and
the covariance matrix are allowed to depend on U_i^−(k) and U_i^{0+}(k) (instead
of being chosen as the system's open-loop covariance matrix). This way, the
consistency of the least-squares estimate can be retained despite the relaxation
of the assumption. However, the relationships between the matrices L₁, L₂ and L₃
and the system matrices change, thus complicating the subsequent development.
A detailed treatment of this generalization is beyond the scope of this chapter.
We list the relevant references at the end of the chapter. □


Obtaining Consistent Estimates of Predictor Matrices

Suppose we are given the input-output samples y(0), ..., y(N) and u(0), ..., u(N).
Then we can construct the following matrices:

$$U_{0|i-1} = \begin{bmatrix} U^-(i) & U^-(i+1) & \cdots & U^-(i+j-1) \end{bmatrix} = \begin{bmatrix} u(0) & u(1) & \cdots & u(j-1)\\ u(1) & u(2) & \cdots & u(j)\\ \vdots & & & \vdots\\ u(i-1) & u(i) & \cdots & u(i+j-2) \end{bmatrix}$$

$$U_{i|2i-1} = \begin{bmatrix} U^{0+}(i) & U^{0+}(i+1) & \cdots & U^{0+}(i+j-1) \end{bmatrix} = \begin{bmatrix} u(i) & u(i+1) & \cdots & u(i+j-1)\\ u(i+1) & u(i+2) & \cdots & u(i+j)\\ \vdots & & & \vdots\\ u(2i-1) & u(2i) & \cdots & u(2i+j-2) \end{bmatrix}$$

$$Y_{0|i-1} = \begin{bmatrix} Y^-(i) & Y^-(i+1) & \cdots & Y^-(i+j-1) \end{bmatrix} = \begin{bmatrix} y(0) & y(1) & \cdots & y(j-1)\\ y(1) & y(2) & \cdots & y(j)\\ \vdots & & & \vdots\\ y(i-1) & y(i) & \cdots & y(i+j-2) \end{bmatrix}$$

$$Y_{i|2i-1} = \begin{bmatrix} Y^{0+}(i) & Y^{0+}(i+1) & \cdots & Y^{0+}(i+j-1) \end{bmatrix} = \begin{bmatrix} y(i) & y(i+1) & \cdots & y(i+j-1)\\ y(i+1) & y(i+2) & \cdots & y(i+j)\\ \vdots & & & \vdots\\ y(2i-1) & y(2i) & \cdots & y(2i+j-2) \end{bmatrix} \tag{12.104}$$

where j = N − 2i + 1. Then the least-squares estimate of [L₁ L₂ L₃] can be
written as

$$\begin{bmatrix} \hat L_1 & \hat L_2 & \hat L_3 \end{bmatrix} = Y_{i|2i-1}\Bigg/\begin{bmatrix} Y_{0|i-1}\\ U_{0|i-1}\\ U_{i|2i-1} \end{bmatrix} \tag{12.105}$$

where the operator / denotes A/B = AB^T(BB^T)^{-1}.
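For the SISO case, the construction of these data matrices and the projection (12.105) might be sketched as follows (Python/NumPy; the helper names are ours):

```python
import numpy as np

def block_hankel(x, first, rows, cols):
    """Stack x(first), ..., x(first+rows-1) over cols shifted columns (SISO)."""
    return np.array([x[first + r : first + r + cols] for r in range(rows)])

def project(A, B):
    """The operator A / B = A B^T (B B^T)^{-1} of (12.105)."""
    return A @ B.T @ np.linalg.pinv(B @ B.T)

def predictor_matrices(y, u, i):
    N = len(y) - 1
    j = N - 2 * i + 1
    Yp = block_hankel(y, 0, i, j)       # Y_{0|i-1}
    Up = block_hankel(u, 0, i, j)       # U_{0|i-1}
    Yf = block_hankel(y, i, i, j)       # Y_{i|2i-1}
    Uf = block_hankel(u, i, i, j)       # U_{i|2i-1}
    L = project(Yf, np.vstack([Yp, Up, Uf]))
    # split into [L1 L2] (past data) and L3 (future inputs)
    return L[:, : 2 * i], L[:, 2 * i :], Yp, Up
```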

Choosing The System Order and Obtaining The Kalman State Data
Matrix for x̂_i - The Theory

Since [L₁ L₂] = W_i^o[H₁ H₂] with i > n, if the system is reachable / observable, [L₁ L₂] has rank n. Hence, one can obtain the system order
simply by examining its rank. This means we are required to store only n
linear combinations of Y_i^−(k) and U_i^−(k) (which define the Kalman state),
but we are free to choose any coordinate system within such a state space
(recall that the system matrices can only be identified to within a state coordinate
transformation). Suppose that the SVD of the matrix is given as

$$\begin{bmatrix} L_1 & L_2 \end{bmatrix} = \begin{bmatrix} W_1 & W_2 \end{bmatrix}\begin{bmatrix} \Sigma & 0\\ 0 & 0 \end{bmatrix}\begin{bmatrix} V_1^T\\ V_2^T \end{bmatrix} \tag{12.106}$$

Then the extended observability matrix W_i^o can be chosen as W₁P, where P is
any nonsingular (square) matrix. To obtain a balanced realization, we may choose
W_i^o = W₁Σ^{1/2}. This fixes the state coordinate axes to yield

$$\hat x_i(k) = (W_i^o)^{\dagger}\begin{bmatrix} L_1 & L_2 \end{bmatrix}\begin{bmatrix} Y_i^-(k)\\ U_i^-(k) \end{bmatrix} \tag{12.107}$$

$$\phantom{\hat x_i(k)} = \Sigma^{1/2}V_1^T\begin{bmatrix} Y_i^-(k)\\ U_i^-(k) \end{bmatrix} \tag{12.108}$$
Using the above equation, we can easily create a data matrix for the Kalman
estimate x̂_i(k) from input-output data:

$$\hat X_i = \begin{bmatrix} \hat x_i(i) & \cdots & \hat x_i(N-i) \end{bmatrix} = (W_i^o)^{\dagger}\begin{bmatrix} \hat L_1 & \hat L_2 \end{bmatrix}\begin{bmatrix} Y_{0|i-1}\\ U_{0|i-1} \end{bmatrix} \tag{12.109}$$
Choosing The System Order and Obtaining The Kalman State Data
Matrix x̂_i - The Practice

In practice, one is given only a finite data set, and hence the rank of
[L̂₁ L̂₂], the least-squares estimate of [L₁ L₂], can be significantly higher than
n because of various identification errors. A logical way to choose the system
order and define the state space is to use the SVD of the matrix

$$\begin{bmatrix} \hat L_1 & \hat L_2 \end{bmatrix}\begin{bmatrix} Y_{0|i-1}\\ U_{0|i-1} \end{bmatrix}. \tag{12.110}$$

Based on the above SVD, Ŵ_i^o can be defined in a similar manner as before.
The system order can be determined by examining the singular values and
finding a large gap between two successive values (a somewhat subjective
decision that involves a trade-off between variance and bias). The state data
matrix can be obtained by using the same formula (12.109).
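Continuing the sketch, the order selection by the singular-value gap and the construction of the state data matrix per (12.106)-(12.110) might look like this (illustrative):

```python
import numpy as np

def state_sequence(L12, Yp, Up, n=None):
    """Order selection by SVD gap and Kalman state data matrix."""
    Z = L12 @ np.vstack([Yp, Up])            # the matrix of (12.110)
    W, s, Vt = np.linalg.svd(Z, full_matrices=False)
    if n is None:
        # pick the order at the largest gap between successive singular values
        n = int(np.argmax(s[:-1] / np.maximum(s[1:], 1e-12))) + 1
    Wo = W[:, :n] * np.sqrt(s[:n])           # extended observability, W1 Sigma^{1/2}
    Xi = np.sqrt(s[:n])[:, None] * Vt[:n]    # state sequence, Sigma^{1/2} V1^T
    return Wo, Xi, n
```

With X̂_i and X̂_{i+1} in hand, the final step (12.117) is an ordinary least-squares solve for A, B, C.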

From the viewpoint of model reduction, choosing the reduced state this way
amounts to minimizing the squared sum of the prediction error vectors E_i^{0+}(k|k−1),
k = i+1, ..., N+1−i, for the given data. It bears a strong resemblance to
Hankel-norm model reduction with a frequency weighting given by the input
spectrum. This way of choosing the state is particular to the algorithm called
N4SID. Various subspace identification methods differ in the way of choosing
the system order and the state. For example, in performing the SVD to decide
on the system order and the state coordinate system, one may use instead just
the gain matrix

$$\begin{bmatrix} \hat L_1 & \hat L_2 \end{bmatrix}$$

or the normalized matrix

$$\mathrm{Cov}^{-1/2}\{Y_i^{0+}(k)\}\begin{bmatrix} \hat L_1 & \hat L_2 \end{bmatrix}\mathrm{Cov}^{1/2}\left\{\begin{bmatrix} Y_i^-(k)\\ U_i^-(k) \end{bmatrix}\right\}$$

to emphasize correlation rather than prediction error for the given data.


Obtaining The Kalman State Data Matrix For x̂_{i+1}

Next we create the data matrix for x̂_{i+1}(k+1) in a similar manner. An
important point is that one cannot simply time-shift x̂_i(k) to obtain x̂_{i+1}(k+1):
shifting x̂_i(k) would give x̂_i(k+1), which represents an estimate from a Kalman
filter started at a different time (i.e., at k−i+1 instead of k−i). To keep
the initialization point for the two estimates the same, one has to follow the
same procedure as before, but with Y_i^−(k) and U_i^−(k) replaced by Y_{i+1}^−(k+1)
and U_{i+1}^−(k+1), defined as

$$Y_{i+1}^-(k+1) = \begin{bmatrix} y(k-i)\\ \vdots\\ y(k-1)\\ y(k) \end{bmatrix}; \qquad U_{i+1}^-(k+1) = \begin{bmatrix} u(k-i)\\ \vdots\\ u(k-1)\\ u(k) \end{bmatrix} \tag{12.111}$$

Also define

$$Y_{i-1}^{0+}(k+1) = \begin{bmatrix} y(k+1)\\ \vdots\\ y(k+i-1) \end{bmatrix}; \qquad U_{i-1}^{0+}(k+1) = \begin{bmatrix} u(k+1)\\ \vdots\\ u(k+i-1) \end{bmatrix} \tag{12.112}$$

As before, we can express

$$\hat x_{i+1}(k+1) = \begin{bmatrix} \bar H_1 & \bar H_2 \end{bmatrix}\begin{bmatrix} Y_{i+1}^-(k+1)\\ U_{i+1}^-(k+1) \end{bmatrix} \tag{12.113}$$

This yields

$$Y_{i-1}^{0+}(k+1) = \underbrace{W_{i-1}^o\begin{bmatrix} \bar H_1 & \bar H_2 \end{bmatrix}}_{\begin{bmatrix} \bar L_1 & \bar L_2 \end{bmatrix}}\begin{bmatrix} Y_{i+1}^-(k+1)\\ U_{i+1}^-(k+1) \end{bmatrix} + \bar L_3 U_{i-1}^{0+}(k+1) + E_{i-1}^{0+}(k+1|k) \tag{12.114}$$

Again, the consistent least-squares estimate is given by

$$\begin{bmatrix} \hat{\bar L}_1 & \hat{\bar L}_2 & \hat{\bar L}_3 \end{bmatrix} = Y_{i+1|2i-1}\Bigg/\begin{bmatrix} Y_{0|i}\\ U_{0|i}\\ U_{i+1|2i-1} \end{bmatrix} \tag{12.115}$$

where the data matrices Y_{i+1|2i-1}, Y_{0|i}, etc. are defined in the same manner
as in (12.105).
The state data matrix can then be created by using

$$\hat X_{i+1} = \begin{bmatrix} \hat x_{i+1}(i+1) & \cdots & \hat x_{i+1}(N-i+1) \end{bmatrix} = (\hat W_{i-1}^o)^{\dagger}\begin{bmatrix} \hat{\bar L}_1 & \hat{\bar L}_2 \end{bmatrix}\begin{bmatrix} Y_{0|i}\\ U_{0|i} \end{bmatrix} \tag{12.116}$$

where Ŵ_{i−1}^o is the same Ŵ_i^o used to create X̂_i, but with the last n_y rows
eliminated.
Obtaining the System Matrices

Once the data matrices for x̂_i(k) and x̂_{i+1}(k+1) are created according to the
above procedure, the system matrices A, B, and C can be estimated using the
linear least-squares method. In addition, the covariance matrix of ε_i(k) and
K_i can be estimated on the basis of the residuals from the least squares.
Note that the system equation (12.98), in terms of the data matrices, can
be written as

$$\begin{bmatrix} \hat X_{i+1}\\ Y_{i|i} \end{bmatrix} = \begin{bmatrix} A & B\\ C & 0 \end{bmatrix}\begin{bmatrix} \hat X_i\\ U_{i|i} \end{bmatrix} + \begin{bmatrix} K\\ I \end{bmatrix}\bar E_i \tag{12.117}$$

where

$$\bar E_i = \begin{bmatrix} \epsilon_i(i) & \epsilon_i(i+1) & \cdots & \epsilon_i(N-i) \end{bmatrix} \tag{12.118}$$

We can obtain the system matrices A, B and C using least squares as
before. From the residual, we can also calculate estimates of the innovation
gain matrix K and the covariance matrix R (of ε_i(k)) according to

$$\hat R = \frac{1}{j}\bar E_2\bar E_2^T; \qquad \hat K = \frac{1}{j}\bar E_1\bar E_2^T\hat R^{-1} \tag{12.119}$$

where Ē₁ and Ē₂ denote the upper (state) and lower (output) block rows of the
least-squares residual in (12.117).

Note: Equation (12.117) becomes considerably more complicated when the
assumption of white input is relaxed. This is because, in order to retain
the consistency of the least-squares estimates, a different initialization of the
Kalman filter must be assumed, in which the initial estimate x̂₀(k−i) depends
on the input data. The result from the above simple algorithm, though biased
in the general case, approaches the unbiased result as i → ∞.
Computational Issues

The procedure we just described involves a series of linear algebra operations
and does not involve any iterative calculation. Hence, the computational requirement for subspace identification is mild when compared to the PEM.
The procedure as we described it, though attractive from a conceptual viewpoint, does not represent the most efficient and stable implementation from a
numerical viewpoint. The procedure can be implemented in a computationally
efficient and stable way by employing QR (or RQ) factorization. For details,
see the references given at the end of the chapter.
Properties and Other Issues

The subspace identification method we just described has the following properties:

• The resulting deterministic system model is asymptotically unbiased.

• The estimates of the covariance matrices are biased, however, due to the
fact that (12.98) is a non-steady-state Kalman filter. The approximation
error diminishes as i → ∞.

Both results follow straightforwardly from the consistency of the least-squares
estimation.
Advantages of the subspace identification method over the prediction error
method include:

• Structure identification is automatic. It avoids having to try models of
different orders, or the need for a special parameterization that requires
difficult-to-obtain prior information like the observability indices.

• It yields a stochastic system model in the form of a state estimator
without any nonlinear optimization, which is needed, for instance, to
identify a MIMO ARMAX model.

• Model reduction is already built into the identification.

However, there are some drawbacks as well. Although the method yields
an asymptotically unbiased model, little can be said about the model quality
obtained with finite data, and in practice one must work with finite-length data
sets. In addition, various non-ideal factors like nonlinearity and nonstationarity make the residual sequence e(k+ℓ|k), ℓ = 0, ..., i−1 in (12.99) become
correlated with the regression data. For these reasons, L̂₁, L̂₂ obtained
from the least-squares identification (which are critical for determining the
system order and generating data for the Kalman estimates) may contain significant errors, both variance and bias. Variance will dominate when the
upper bound on the state order (i) is set too high. Although the variances
of the predictor matrices may be quantifiable, it is difficult to say how they
affect the final model quality measured in terms of prediction error, frequency response error, etc. One implication is that, in general, one needs a
large amount of data in order to have success with these algorithms, which is
only natural since these algorithms use very little prior knowledge. Another
implication is that the above does not replace the traditional prediction error
method but complements it. For instance, it has been suggested that the subspace method be used to provide a starting estimate for the prediction error
minimization.
Another related issue is that, if the input and output data are correlated
due to feedback of some sort, the above algorithm can fail. In this case, the
prediction errors e(k+ℓ|k), ℓ = 0, ..., i−1 in (12.99) become correlated with
the input vector U_i^{0+}(k), and the consistency of the least-squares estimation
breaks down. Methods have been suggested to overcome this problem. For
example, it has been suggested to replace the data for Y^{0+}(k) with one-step-ahead predictions obtained using a high-order ARX model, in order to break
the correlation. See the Bibliography section at the end for the reference.
Finally, the requirement that the stochastic part of the system be stationary should not be overlooked. If the nonstationarity arises mainly from the
process exhibiting mean changes due to integrating-type disturbances, one
can difference the input-output data before applying the algorithm. Further
low-pass filtering may be necessary in order not to over-emphasize the high-frequency fit. (Recall the discussion on the frequency-domain bias distribution.)
Example 12.3 Stochastic system model. Identification result.
Example 12.4 Closed-loop data. Subspace ID exhibiting bias. Ljung's method.

12.5 Practice of System Identification - The User's Perspective

So far we have presented various identification methods and their theoretical
properties. Performing system identification in practice, however, involves
much more than understanding and applying these methods. One must design
the experiment and generate appropriate data, remove spikes/outliers and
pretreat the data, choose the right model structure and apply the right algorithm,
and validate the model and further condition it so that it is appropriate for
use in model-based control. In this section, we take the user's viewpoint
and attempt to provide a perspective on these steps while highlighting some
key points.

12.5.1 Experiment Design

In most cases, the data used for system identification must be generated by performing a test on an actual process. These tests are oftentimes costly and
time-consuming to run, as they involve direct interaction with the actual process. For applications in the process industry, it is estimated that up to 90%
of the cost and time involved in commissioning a model-based controller can be
attributed to the plant testing step. It is therefore important to understand
the underlying issues and optimize these tests as much as possible. Despite its
importance, the science of designing experiments for system identification has
not fully matured yet. Hence, we resort to some qualitative discussions
of the key decisions.

• Sample Time: The sample time for data collection is usually preset by the
existing hardware (e.g., data acquisition systems, actuators, and sensors). In deciding the input manipulation frequency, however, one needs
to keep in mind the theoretical limit given by the Sampling Theorem
(see Appendix ??): with a sample time of T_s, no control can be
done beyond the Nyquist frequency of π/T_s. On the other hand, too
small a sample period for a slow system can lead to excessively many
parameters (in the case of FIR identification) or numerical difficulties
(all poles collapsing around (1,0)).

• Experiment Length: This too is largely decided by practical considerations: the longer the experiment, the more the information, but the
higher the cost. On the other hand, it is convenient to have an estimate
of the minimum length needed to obtain data for a desired closed-loop
performance ahead of the actual experiment. We showed earlier that
calculation of the variance and its distribution in the context of least-squares estimation can be carried out under certain assumptions; consideration of these quantities could provide some useful clues.

• Test Signal: This probably represents the most important design parameter, as one is often given significant degrees of freedom in its choice
and it also has the biggest impact on the model quality. The test signal (i.e.,
how it distributes a given power among different frequency channels and
possibly among different directions in the input space) determines the
distribution of model error (variance and bias) and therefore the achievable
closed-loop performance. Popular signals used in plant testing are (series of) steps, pulses, random signals (e.g., random binary sequences),
and pseudo-random binary sequence (PRBS) signals. The last two are
recommended by theoreticians, as they have high power contents and
distribute the power evenly through all frequencies (within the Nyquist
band). We will discuss the PRBS signal in more detail in the next subsection. However, practical constraints often dictate what signals can
be implemented and what can't. The consideration of the operational
impact that a test perturbation is expected to have on the process tends
to override any other consideration, including the quality of information
it produces.


Single Input Testing vs. Multiple Input Testing: One also has to decide whether to vary each input one at a time or several inputs together at the same time. Generally, the one-input-at-a-time approach is safer, as its impact on the process is more predictable and controllable. However, varying the inputs together yields more information for a given amount of time. In addition, for highly interactive systems (with strong gain directionality characteristics), it is very difficult to obtain the necessary multivariable information using single input tests (see the example below). Single input testing inherently emphasizes the accuracy of the SISO models, which may not be enough. If multiple input testing is used, both the auto-correlation and the cross-correlation functions of the test signals should be optimized in the design.
Open-Loop Testing vs. Closed-Loop Testing: Finally, the experiment can be conducted in open loop or in closed loop with some feedback controller in place. This is illustrated in Figure 12.1. In principle, closed-loop testing gives more relevant information, as it more closely simulates the situation the model will be subjected to. On the other hand, it may not always be possible to implement a reasonable feedback controller before model identification. If the closed-loop testing option is chosen, one needs to be cognizant of two issues. First, in all likelihood, naturally occurring disturbances alone will not be enough to produce data with the necessary information contents. For instance, it can be seen from Figure 12.1(b) that, without an external excitation signal, one could easily end up identifying (the inverse of) the controller rather than the process. Second, the feedback loop introduces correlation between past output and future input in the data. Some provision may be necessary, as most conventional algorithms give biased results in such a case (see the discussion on the convergence of linear least squares in Section 12.2.3, for example). For example, one could identify a closed-loop transfer function between the external dither signal and the output and then extract an open-loop model from it.
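As a concrete sketch of this indirect route, suppose (hypothetically) that the loop runs with a known linear controller $C(q)$ and an external dither $r(k)$ added at the controller output, so that $u(k) = r(k) - C(q)y(k)$ and $y(k) = G(q)u(k) + v(k)$. Eliminating $u$ gives

$$ y(k) = \frac{G(q)}{1+G(q)C(q)}\, r(k) + \frac{1}{1+G(q)C(q)}\, v(k) $$

Since $r$ is external and hence uncorrelated with the noise $v$, the closed-loop transfer function $G_{cl}(q)$ from $r$ to $y$ can be identified without the bias problem mentioned above, and the open-loop model can then be recovered as $G(q) = G_{cl}(q)/(1 - C(q)G_{cl}(q))$.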
Example 12.5 ADD THE EXAMPLE OF HIGH PURITY DISTILLATION
COLUMN. Single-input testing vs. multiple-input testing. Demonstrate the
importance of correlated design.

12.5.2 PRBS Signals

The PRBS is a popular choice for the test signal because it is easy to generate, convenient to implement, and guarantees a specific power spectrum even for a relatively short-length signal (compared to a random signal). In this subsection, we examine its properties and design in more detail.


Figure 12.1: Open-Loop Testing, Closed-Loop Testing Without External Dithering, and Closed-Loop Testing With External Dithering



The Covariance and Power Spectrum of PRBS

A PRBS (denoted by $\nu$ hereafter) is a periodic binary signal taking some upper value (say $\bar{b}$) and some lower value ($\underline{b}$). Its covariance function $R_\nu(\tau)$ (calculated assuming the input is a periodic signal of period $M$) is:

$$ R_\nu(\tau) = \begin{cases} B^2 & \text{for } \tau = 0 \\ -B^2/M & \text{for } \tau \neq 0 \end{cases} \qquad (12.120) $$

where $B \; \left(= \bar{b} - \frac{\bar{b}+\underline{b}}{2} = \frac{\bar{b}-\underline{b}}{2}\right)$ represents the deviation from the mean value and $M$ is the period of the PRBS signal. So, by choosing $M$ to be a fairly large value, we can make the covariance function virtually the same as that of white noise.

The power spectrum of the PRBS looks like

$$ \Phi_\nu(\omega) = \sum_{\tau} R_\nu(\tau)\, e^{-j\omega\tau} = R(0) + R(M)e^{-j\omega M} + R(2M)e^{-j2\omega M} + \cdots + R(-M)e^{j\omega M} + R(-2M)e^{j2\omega M} + \cdots = B^2 + 2B^2\cos(\omega M) + 2B^2\cos(2\omega M) + \cdots $$

The above power spectrum is zero except at the discrete frequency points $\omega_i = i\,\frac{2\pi}{M}$. The power at these frequency points is the same for all of them, but infinite. This makes sense, since the integral of the power spectrum over $(0, 2\pi)$ must be $R(0)$ (which is the power of the signal) and the power spectrum is nonzero only at discrete points. For periodic signals, it is therefore convenient to use the following DFT of the covariance function:

$$ \Phi_\nu(\omega_i) = \sum_{\tau=0}^{M-1} R_\nu(\tau)\, e^{-j\omega_i\tau} \approx R(0) = B^2 $$

Hence, we see that a PRBS distributes its power uniformly over the discrete
frequency points within the Nyquist band.
Note: the power distribution will not be exactly uniform, since the DFT of the covariance function shows the distribution of the power when the discrete signal is implemented as an impulse train. In reality, one would implement it as a zero-order-hold signal.
An advantage of the PRBS is that, being a deterministic signal, this statistical property holds even for a relatively short-length signal. For example, with a PRBS signal of length $N$ that is an integer multiple of $M$,

$$ \frac{1}{N}\sum_{k=1}^{N} \nu(k)\nu(k-\tau) = \begin{cases} B^2 & \text{for } \tau = 0 \\ -B^2/M & \text{for } \tau \neq 0 \end{cases} \qquad (12.121) $$

For purely random signals, in order for the above property to hold approximately, N must be quite large. Since one is often limited to collecting just short-length data in practice, the above property of the PRBS can be very useful.
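As a quick numerical check of (12.121), here is a minimal Python sketch; the use of scipy.signal.max_len_seq to produce one period of a maximal-length sequence is our choice, and any generator of the kind described below would do:

```python
import numpy as np
from scipy.signal import max_len_seq

n, n_periods = 5, 8
M = 2**n - 1                       # PRBS period (31 here)
bits, _ = max_len_seq(n)           # one period of the 0/1 m-sequence
nu = 2.0 * np.tile(bits, n_periods) - 1.0   # map {0,1} -> {-1,+1}, so B = 1

# Circular sample covariance over N = n_periods * M points, cf. (12.121):
for tau in range(4):
    R = np.mean(nu * np.roll(nu, tau))
    print(f"R({tau}) = {R:+.4f}")  # expect B^2 = 1 at tau = 0, -B^2/M otherwise
```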


Figure 12.2: Generation of a PRBS signal using a shift register


Generation of a Single PRBS

To generate PRBSs, we can use so-called shift registers. Shown in Figure 12.2 is a 4-digit shift register. $\oplus$ here means binary summation (exclusive-or), that is,

$$ 1 \oplus 1 = 0, \qquad 0 \oplus 1 = 1, \qquad 0 \oplus 0 = 0 $$

Using this method, we can obtain a periodic binary sequence of length $2^{N_s} - 1$, where $N_s$ denotes the number of shift registers. From the (0, 1) binary sequence, a PRBS signal can be generated by choosing the lower value for 0 and the upper value for 1. Once we obtain the signal for one period, we can simply repeat it for however many periods are needed. The procedure for PRBS generation can be summarized as follows:

1. Initialize the register values by setting them to either 1 or 0. Different initializations lead to different signals.

2. Perform the binary summation between the first register value and the final register value.

3. Shift all register values to the right by one register.

4. Enter the summation result into the first register.

5. Repeat steps 2 through 4 until $2^{N_s} - 1$ data points are obtained for each register. One can choose the data history of any register as the PRBS.
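A minimal Python sketch of this shift-register procedure follows; the function name and the mapping of {0, 1} to the lower/upper values are our own choices:

```python
import numpy as np

def prbs_shift_register(n_s=4, b_low=-1.0, b_high=1.0, n_periods=1):
    """PRBS of period 2**n_s - 1 via the procedure above: XOR the first and
    last register values, shift right, feed the result back into the first
    register, and log the history of one register."""
    reg = [1] * n_s                      # step 1: initialize (here, all ones)
    bits = []
    for _ in range(2**n_s - 1):          # one full period
        bits.append(reg[-1])             # record the last register's value
        fb = reg[0] ^ reg[-1]            # step 2: binary summation (XOR)
        reg = [fb] + reg[:-1]            # steps 3-4: shift right, insert result
    bits = bits * n_periods              # repeat the period as needed
    return np.where(np.array(bits) == 0, b_low, b_high)

print(prbs_shift_register(4))            # one period of 2**4 - 1 = 15 samples
```

With the all-ones initialization, the four-register version returns a maximal-length sequence of period 15.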
Generation of Multiple Independent PRBS Signals

If it is desired to do a multiple-input test (by varying several inputs simultaneously), one may wish to design several PRBSs that are mutually independent. There are different ways to do this. Let us discuss the two-signal design case.


Option 1 Design a PRBS of basic period $M$ (denoted as $\nu = \{\nu(k),\ k = 1, \ldots, M\}$) and then choose

$$ u_1 = \begin{bmatrix} \nu \\ \nu \end{bmatrix}, \qquad u_2 = \begin{bmatrix} \nu \\ -\nu \end{bmatrix} $$

The above gives a two-dimensional signal of period $2M$.

Option 2 Design a PRBS of basic period $2M$ (denoted as $\nu = [\nu_1^T\ \nu_2^T]^T$, where $\nu_1$ and $\nu_2$ are the first half and second half of the PRBS) and then choose

$$ u_1 = \begin{bmatrix} \nu_1 \\ \nu_2 \end{bmatrix}, \qquad u_2 = \begin{bmatrix} \nu_2 \\ \nu_1 \end{bmatrix} $$

The above also gives a two-dimensional signal of period $2M$.
The latter option is expected to be more robust when, for some unexpected reason, only a part of the basic signal is used. In such a case, with Option 1, one may end up implementing just the portion of the test signal that excites the input channels in only the single direction (1, 1).
One can easily extend the above idea to higher-dimensional cases. For example, for an $n_u$-input channel design, we can create a PRBS of period $n_u M$, partitioned as $\nu = [\nu_1^T\ \nu_2^T\ \cdots\ \nu_{n_u}^T]^T$. Then, we can choose

$$ u_1 = \begin{bmatrix} \nu_1 \\ \nu_2 \\ \vdots \\ \nu_{n_u} \end{bmatrix}, \qquad u_2 = \begin{bmatrix} \nu_2 \\ \vdots \\ \nu_{n_u} \\ \nu_1 \end{bmatrix}, \quad \text{etc.} $$

What happens to the cross-covariance between inputs with the above design? All cross-covariance terms become $-B^2/(n_u M)$, the autocovariance of the base PRBS at a nonzero lag. So, by choosing $M$ to be a sufficiently large number, the cross-correlation between the input signals can be made negligible. As before, it is recommended that $M$ be chosen to correspond to the system's maximum settling time (with respect to all the inputs). This is because the number of identifiable parameters with the above design is $n_u M$ in total, i.e., $M$ per input channel.
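The following hedged sketch illustrates the idea for two inputs. Since an m-sequence period $2^n - 1$ is odd, the second channel is taken as a circular shift by roughly half the period, which plays the role of $[\nu_2^T\ \nu_1^T]^T$ above:

```python
import numpy as np
from scipy.signal import max_len_seq

n = 7
P = 2**n - 1                        # base period (plays the role of nu*M)
bits, _ = max_len_seq(n)
nu = 2.0 * bits - 1.0               # base PRBS with B = 1

u1 = nu
u2 = np.roll(nu, P // 2)            # shifted copy for the second channel

print(np.mean(u1 * u1))             # auto term: B^2 = 1
print(np.mean(u1 * u2))             # cross term: -B^2/P, negligible for large P
```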
Bias and Variance of PRBS
The key parameters one has to choose in the PRBS generation are the upper value $\bar{b}$, the lower value $\underline{b}$, the period $M$, and the number of periods $n_p$. Note that, by choosing $\bar{b}$ and $\underline{b}$, one determines the mean value, $(\bar{b} + \underline{b})/2$, and the variance, $(\bar{b} - \underline{b})^2/4$, of the signal. The mean value can be chosen to correspond to the steady-state value of the input. Some considerations are in order in deciding the variance. If the variance is chosen large, one gets a high signal-to-noise ratio, which should help the identification in most cases. However, as with everything, too much can be bad, since it can accentuate the nonlinear characteristics of the input-output responses and adversely affect the ongoing plant operation as well. Finding a reasonable compromise would require some intimate knowledge of the process.
Period and Number of Repetitions

Note that $M n_p$ represents the total duration of the experiment (in samples). If $M$ is chosen too small, the power does not get distributed to enough frequency points. In addition, the value $B^2/M$ may be significant, resulting in a covariance function that does not closely match that of white noise. Note that the PRBS signal is persistently exciting of order $M$. Hence, it can be used to identify only $M$ parameters. If an FIR model is to be identified, this implies that $M$ should be chosen to be equal to or larger than the number of sample steps corresponding to the system's settling time. In certain cases, it may be undesirable to distribute the power to too many frequency points. $M$ should be limited in this case, but $n_p$ can always be increased to produce a signal of a desired length.
To Vary the Bias Value or Not To Vary It?

In some cases, it may be advantageous to vary the bias value from one period to the next. This is true when the steady-state value of the input cannot be fixed because it changes according to some load disturbance. If an identification experiment is performed over a long time, the load disturbance may change even during the experiment. This case requires a test input with its power density concentrated more in the low-frequency range, which can be achieved by varying the bias value. One can still use the same basic PRBS design, but an extra bias (chosen to be different for each period) is added to the design.
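A minimal sketch of this variation, with made-up per-period bias values:

```python
import numpy as np
from scipy.signal import max_len_seq

bits, _ = max_len_seq(5)                      # one period, M = 31
nu = 2.0 * bits - 1.0
biases = [0.0, 0.5, -0.5, 1.0]                # hypothetical per-period offsets
u = np.concatenate([nu + c for c in biases])  # adds extra low-frequency power
```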

12.5.3 Data Pre-Processing

Before applying an identification algorithm, the input-output data need to be preprocessed. First, one should do the obvious things like removing spikes and outliers. Second, one should remove bias and compute deviation variables. The latter is called detrending and can be done in several ways:
Option 1 Compute the data averages and subtract them to create the deviation variables, i.e.,

$$ y(k) = y_p(k) - y_{ref}, \qquad u(k) = u_p(k) - u_{ref} $$
$$ y_{ref} = \frac{1}{N}\sum_{k=1}^{N} y_p(k), \qquad u_{ref} = \frac{1}{N}\sum_{k=1}^{N} u_p(k) \qquad (12.122) $$

where $y_p(k)$ and $u_p(k)$ represent the raw data.

Option 2 Use given steady-state values of the variables instead to compute the deviation variables, i.e.,

$$ y(k) = y_p(k) - y_{ss}, \qquad u(k) = u_p(k) - u_{ss} $$

where $y_{ss}$ and $u_{ss}$ represent a priori given steady-state values of the process output and input, respectively.
Option 3 Difference the data by subtracting from each datum its value at the previous sample time:

$$ \Delta y(k) = y(k) - y(k-1), \qquad \Delta u(k) = u(k) - u(k-1) \qquad (12.123) $$

Which detrending method is most appropriate depends on the situation. If the identification experiment is performed over a long time and the means of the variables change a lot during the course of the experiment, due to load disturbances and/or different bias values applied to the test input (e.g., as in step tests), the differencing option (Option 3) may yield the best result. Otherwise, the former two options may be better. Also note that Option 1 will not work well if the number of data points is not large enough in relation to the system's settling time.
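The three options can be summarized in a short sketch (the function name is our own; yp, up are the raw data arrays):

```python
import numpy as np

def detrend(yp, up, option=1, yss=None, uss=None):
    """Detrending Options 1-3 above. yss/uss are a priori steady-state
    values, used only by Option 2."""
    if option == 1:                      # subtract the data averages, (12.122)
        return yp - yp.mean(), up - up.mean()
    if option == 2:                      # subtract given steady-state values
        return yp - yss, up - uss
    if option == 3:                      # difference the data, (12.123)
        return np.diff(yp), np.diff(up)
    raise ValueError("option must be 1, 2 or 3")
```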
After the detrending is done, one may wish to perform some additional prefiltering. There are at least two reasons why one may want to do this.

Some or all of the noise model may be known or fixed a priori. Suppose the model structure one chooses to fit is

$$ y(k) = G(q, \theta)u(k) + H_1(q)H_2(q, \theta)\epsilon(k), \qquad (12.124) $$

where $H_1(q)$ represents the known (or given) part of the noise model and $H_2(q, \theta)$ the unknown part. In this case, one can write

$$ \frac{1}{H_1(q)}\,y(k) = G(q, \theta)\,\frac{1}{H_1(q)}\,u(k) + H_2(q, \theta)\epsilon(k) \qquad (12.125) $$

Hence, the input-output data can be prefiltered with $1/H_1(q)$ to put the model structure in the standard form of ARX, ARMAX, OE, etc.
Note that, by electing to detrend by Option 3, one implicitly assumes that $H(q)$ includes an integrator. A reasonable form of noise model in many process control problems is $H_1(q) = \frac{A_o(q)}{1-q^{-1}}$, where $A_o(q)$ is some stable polynomial. In the absence of better knowledge, one can choose $A_o(q) = 1 - \alpha q^{-1}$, where $\alpha < 1$ is chosen to correspond to the desired closed-loop time constant. In this case, the prefiltering should be done using the filter $\frac{1-q^{-1}}{A_o(q)}$. This amounts to filtering the differenced input-output data with the low-pass filter $\frac{1}{A_o(q)}$, which makes sense in view of the fact that differencing the data over-emphasizes the high-frequency fit (a sketch of this prefiltering appears after this list).

Sometimes, one would like to minimize a filtered prediction error, $L(q)e(k)$, instead. This is often done to emphasize or de-emphasize the fit in a certain frequency range over the others. In this case, one can simply prefilter the input-output data with $L(q)$ before applying the PEM to obtain the desired effect.
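As a sketch of such prefiltering, the filter $L(q) = (1-q^{-1})/A_o(q)$ discussed above can be applied with a standard IIR filtering routine; the value of alpha and the random stand-in data are made up:

```python
import numpy as np
from scipy.signal import lfilter

alpha = 0.8                                # ~ desired closed-loop time constant
b, a = [1.0, -1.0], [1.0, -alpha]          # L(q) = (1 - q^-1)/(1 - alpha q^-1)

rng = np.random.default_rng(0)
y = rng.standard_normal(500)               # stand-ins for the detrended data
u = rng.standard_normal(500)

yf = lfilter(b, a, y)                      # filtered output for the PEM
uf = lfilter(b, a, u)                      # filtered input for the PEM
```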

12.5.4 Model Fitting and Validation

Getting a Feel for the Problem

It is generally a good idea to start the identification with simple identification methods that can be performed quickly and with relative ease. An advantage is that one does not need to worry about seeing "numerical artifacts" and can also quickly develop and compare models of many different orders, which should help the order determination.

The simplest among the methods discussed are the PEM with an ARX structure and with an FIR structure, both of which lead to linear least squares problems. Other relatively simple identification methods one can run at this stage are subspace identification and frequency response identification (ETFE). One can run these to get a feel for the difficulty of the problem at hand. The following analyses can be done for this:
One can plot the step responses and the frequency responses and compare the results from applying the different structures.

One can compute the residual from the identification and analyze it to see if it is white and if there exists significant cross-correlation with the input data (a sketch of such tests follows this list). This test may identify a bias problem but not a variance problem (over-fitting), since more parameters always mean a smaller residual for the data used for the fitting.

Using the fitted model, one can compute the prediction error on a fresh set of data (called validation data) to see if one gets good enough prediction performance.
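A minimal sketch of the residual tests in the second item (the function name and the usual +/- 2/sqrt(N) significance bounds are our choices):

```python
import numpy as np

def residual_tests(e, u, max_lag=20):
    """Normalized auto-correlation of the residual e and its cross-correlation
    with the input u, compared against +/- 2/sqrt(N) bounds."""
    N = len(e)
    e0, u0 = e - e.mean(), u - u.mean()
    r_ee = [np.sum(e0[t:] * e0[:N - t]) / (N * e0.var())
            for t in range(1, max_lag + 1)]
    r_eu = [np.sum(e0[t:] * u0[:N - t]) / (N * e0.std() * u0.std())
            for t in range(max_lag + 1)]
    bound = 2.0 / np.sqrt(N)
    looks_white = all(abs(r) < bound for r in r_ee)
    uncorrelated = all(abs(r) < bound for r in r_eu)
    return looks_white, uncorrelated, bound
```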
If the results indicate that all the models are similar and provide good predictions on the validation data, one has a relatively simple identification problem. If they do not, here are some possible causes and suggestions on what to do.
Bias Due to an Insufficient Model Order If the order of the ARX model used is insufficient for the given system, the identified model will be biased. This is especially common when the noise part of the system is significant and possesses a complex type of auto-correlation. Note that, in an ARX model, the noise part must be modeled by $\frac{1}{A(q)}\epsilon(k)$. Hence, even when the deterministic part of the system can be modeled adequately by $B(q)/A(q)$ of the specified order, if the noise part cannot be explained by $1/A(q)$, both parts will show bias (which is distributed between the two according to the signal-to-noise ratio); see the sketch at the end of this list. In this case, increasing the order of $A(q)$ may improve the result, but only to a point, since this will also introduce more parameters and increase the parameter variance. If one starts seeing near pole-zero cancellations between $A(q)$ and $B(q)$ but significant auto-correlation still remains in the residual, it is a good bet that the noise part of the system is undermodelled. In this case, using an ARMAX structure can help reduce the error and lower the system order.
The bias problem is not expected for the FIR identification and the ETFE. However, complex noise patterns can still pose a problem in terms of the convergence rate. If the noise patterns can be modeled a priori, it helps to pre-filter the data before performing the identification. A similar statement holds for subspace identification: if the maximum order is chosen too small compared to the real system order, it can give a bias.
Large Parameter Variances If the amount of data used is insufficient for the given noise level and number of parameters, the parameter variance can be very large. In this case, the residuals from the model fitting may look good, but the prediction performance of the model on validation data may be poor. In addition, there may not be good agreement among the different model types. The variance problem is expected to be most severe for the FIR identification and the ETFE, but it can also show up in an ARX model identification if the order chosen is very high (in an attempt to explain the noise part). Subspace identification too can suffer from this if the selected model order (both the upper bound and the final order chosen on the basis of the SVD) is very high.
Feedback Within the Data If feedback is present within the data (e.g., due to an existing control system), it can introduce a bias into the model. The bias will manifest itself as poor prediction performance of the model on validation data. This problem is expected to be most severe for the FIR identification, subspace identification, and the ETFE. The problem should be less severe for the PEM with an ARX structure, if the chosen ARX structure is sufficiently rich.
Nonlinearity If the nonlinearity is severe, it will hamper the identification even when the objective is to obtain the most accurate linear system model. Rather than proceeding directly to complex nonlinear model structures (e.g., neural nets), it may be advantageous to consider the underlying physics and see if there are simple variable transformations (e.g., taking square roots or logarithms) that make sense.
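To illustrate the first cause, here is a hedged Monte Carlo sketch: data are generated from a first-order ARMAX system (all coefficients made up), and a first-order ARX model is fitted by least squares. Because the moving-average noise term violates the ARX noise structure, the estimates stay biased no matter how much data is used:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c, N = 0.7, 1.0, 0.9, 20000
u = rng.choice([-1.0, 1.0], size=N)       # random binary test input
e = rng.standard_normal(N)

y = np.zeros(N)
for k in range(1, N):                     # y(k) = a y(k-1) + b u(k-1) + e(k) + c e(k-1)
    y[k] = a * y[k-1] + b * u[k-1] + e[k] + c * e[k-1]

# First-order ARX least squares: y(k) ~ theta1*y(k-1) + theta2*u(k-1)
Phi = np.column_stack([y[:-1], u[:-1]])
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
print(theta)   # does not converge to (0.7, 1.0): y(k-1) correlates with e(k-1)
```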


Further Improvements and Refinements

If the results from applying the simple identification methods were not satisfactory, one can now attempt to improve them using more complex methods. Even if the results were fairly good, one may still wish to refine them further. At this point, one should ask the following two questions:

Is it important to identify an accurate noise model?

If yes, is it necessary to model the correlation among the different outputs?

If the answer to the first question is no, one can use the output error structure or the instrumental variable method. The ARMAX structure can also be employed, even though the noise model is then not of interest. The output error method is especially appropriate if the noise model structure to be used in the model-based control is fixed a priori.
If the answer is yes, then one has to address the second question before proceeding. If it is not necessary to model the correlation among the outputs, then one can go with a MISO ARMAX structure. If it is necessary (for example, for inferential control), then one can take the result from applying subspace identification and further refine it using the PEM with the corresponding MIMO ARMAX structure.
It is important that all models be validated using a fresh set of data. If the prediction performance of the identified model is satisfactory (i.e., the error is mostly white and shows little correlation with the input), the model can be accepted. If the performance does not improve to a satisfactory level, it could mean any of the possible causes mentioned previously. It may also mean that the given data were not sufficient or were badly corrupted, and one needs to perform some additional plant tests to gather new sets of data.
When is the identified noise model useful?

Some identification methods (e.g., subspace identification, ARMAX, (very) high-order ARX) yield noise models. Noise models obtained from these identifications are useful if the noise characteristics observed during the data-gathering period are similar to what they will be during actual closed-loop application periods. For example, if the system was artificially made to be quiet (or even stationary) during the identification experiment, the noise patterns captured by the model will not be representative of the real noise patterns. In this case, the noise model should not be used in real applications. Instead, some other reasonable noise models may have to be found or assumed in the feedback controller design. Even if this is the case, having an adequate noise model structure can be helpful, as the accuracy of the noise model has an effect on the identification of the deterministic part. Recall the discussion that, in the case of ARX models, a limited noise structure can bias the entire model. Some identification methods, like the OE identification, are more immune to poor noise modeling, even though the convergence rate can still be affected by it.
Example 12.6 Take the reader through a simple example?

12.5.5 Model Quality Assessment and an Integrated Framework for System Identification and Controller Design

Ultimately, the quality of a model can be judged only by designing and applying a model-based controller on an actual process. It is nevertheless desirable to have some, albeit coarse, assessment of the model quality and of what can be expected once the loop is closed. In general, one needs to be conscious of two types of errors: variance and bias. While the variance dies out as one uses more and more data, the bias stays. The variance shows up only in the prediction with fresh validation data, while the bias reveals itself in the residual from the model fitting. Usually, the user must trade off between the two: with a very high order model, there is little bias but significant variance, and vice versa.

In the absence of bias, the variance can be quantified for simple linear estimation problems or by employing some linearization technique. The bias too can be quantified if enough data points are given; one can, for instance, fit the residuals to a very high-order system. Quantifying both variance and bias together, however, is a much more difficult task in general (CHECK).
Ultimately, one is interested in the impact of these uncertainties on the achievable closed-loop performance. One approach would be to express these uncertainties in the frequency domain and use the well-developed robust control theories. This aspect of integration between identification and robust controller design has lately received much attention from the theoreticians of both fields. While some promising initial steps have been proposed, a definitive framework is yet to emerge, especially for general multivariable systems.
The picture shown in Figure 12.3 represents one possible framework for integrated identification, model quality assessment, and controller design. The integration can be seen in several parts. First, the model quality is quantified and its impact on the achievable closed-loop performance is assessed. If it is judged that more data are needed, additional tests are performed. The model quality serves as an important piece of information for designing the new experiment. One can do this in a discrete fashion, which leads to iterative identification, or even in a continuous fashion, which leads to adaptive identification. Though we are still far from having all the necessary tools, it is important to keep this picture in mind as we do further research and development work in this area.


Figure 12.3: A Framework for Integrated Identification, Model Quality Assessment, and Controller Design


Examples to Include

1. Demonstrate a loss of identifiability in a MIMO ARMAX system.
2. Example for pseudo-linear regression vs. nonlinear optimization.
3. Example of two input signals giving different information matrices (well-conditioned vs. ill-conditioned). Do a Monte Carlo simulation and compute the average of the error with respect to different linear directions.
4. Bayesian estimator vs. least squares in FIR identification.
5. Example of the instrumental variable method.
6. Subspace identification applied to some simple but realistic system. Number of data points.
7. Subspace identification applied to closed-loop data. Ljung's method.
8. Multi-input testing vs. single-input testing. High-purity distillation column.
9. Effect of detrending. Integrated white noise disturbance or unknown bias.
10. Going through some simple example (with significant noise) as discussed in the Model-Fitting section.

Possible Exercises to Give

1. Show that the pseudo-linear form (12.28) is indeed the optimal predictor for the ARMAX model.
2. (Give the structure and data from some ARMAX model.) Write an algorithm for pseudo-linear regression in MATLAB and estimate the parameters. Compare with the true parameters.
3. (Given a linear regression problem with Gaussian noise of some variance. Give data.) Compute the estimate. Compute the covariance of the estimate. Define the confidence interval (in terms of equations) for the probability level of 0.95.
4. Prove that, for an ARX model with n + m parameters, use of a signal that is persistently exciting of order n + m guarantees that the rank condition for consistency is satisfied.
5. Show that (12.64) can be solved by the given recursive formula. Show that formulating the Bayesian estimator using the Kalman filter theory also gives the same recursive formula.

6. Derive the Bayesian estimator for system (12.68) using the Kalman filter theory.
7. Show that $\hat{x}_i(k)$ in the subspace identification is a linear function of $y(k-1), \ldots, y(k-i)$ and $u(k-1), \ldots, u(k-i)$. Derive the explicit relationship (expressions for $H_1$ and $H_2$).

Bibliography

1. Detailed treatment of parametric identification methods for various model parameterizations can be found in Ljung (1987) and Soderstrom (REF).
2. PEM - developed mostly by Ljung and coworkers.
3. Frequency domain bias distribution. The result derived here is asymptotic in nature; a similar result also exists for finite data sets (Ljung, 1987). (CHECK!)
4. Instrumental variable method references. Popular choices of instruments (Table ?? in Ljung).
5. PCR, PLS references.
6. ETFE - mostly drawn from Ljung; details can be found there. Smoothing functions (Table ??).
7. The subspace identification method originated from the classical realization theory presented in Ho and Kalman (CITE) and King (CITE). The subspace identification algorithm that we described is called N4SID and was originally proposed by Van Overschee and De Moor (REF). The reference describes the modifications needed when the input is not white but correlated in time. Also, an RQ-factorization-based numerically stable and efficient algorithm..... Other subspace algorithms available in the literature are based on similar principles but differ in how the data for $\hat{x}_i(k)$ and $\hat{x}_{i+1}(k+1)$ are constructed. For instance, in Larimore's CVA method [?], the basis for the states is chosen by considering the statistical correlation between $Y_i^{0+}(k)$ and $[Y_i^-(k)^T\ U_i(k)^T]^T$. Another notable algorithm in this breed is MOESP by Verhaegen (CITE). These algorithms (and their variants) have already appeared in commercial software packages (e.g., the MATLAB System Identification Toolbox) and have also been embedded into some of the identification packages used with MPC.
8. Closed-loop subspace ID methods. Ljung. Chou and Verhaegen.....
9. PRBS and signal design. Reference the book.
10. Importance of multi-input testing and design methods for highly interactive systems. Koung and MacGregor, Cooley and Lee, etc.
