
EE263 Autumn 2015, S. Boyd and S. Lall

Least-squares data fitting

Least-squares data fitting

we are given:

- functions f1, . . . , fn : S → R, called regressors or basis functions

- data or measurements (si, gi), i = 1, . . . , m, where si ∈ S and (usually) m ≫ n

problem: find coefficients x1, . . . , xn ∈ R so that

x1 f1(si) + · · · + xn fn(si) ≈ gi,   i = 1, . . . , m

i.e., find linear combination of functions that fits data


least-squares fit: choose x to minimize total square fitting error:
    ∑_{i=1}^m (x1 f1(si) + · · · + xn fn(si) − gi)^2

Least-squares data fitting

- total square fitting error is ‖Ax − g‖^2, where Aij = fj(si)

- hence, least-squares fit is given by

      x = (A^T A)^{−1} A^T g

  (assuming A is skinny, full rank; a numerical sketch follows this list)

- corresponding function is

      f_lsfit(s) = x1 f1(s) + · · · + xn fn(s)

- applications:
  - interpolation, extrapolation, smoothing of data
  - developing simple, approximate model of data
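a minimal numpy sketch of this fit (not from the original slides); the particular basis functions and data below are illustrative, with g(t) borrowed from the example later in the deck:

```python
import numpy as np

# illustrative basis functions f_1, f_2, f_3 and data (s_i, g_i); not the lecture's data
basis = [lambda s: np.ones_like(s), lambda s: s, lambda s: s**2]
s = np.linspace(0, 1, 100)                   # sample points s_i
g = 4 * s / (1 + 10 * s**2)                  # measurements g_i

A = np.column_stack([f(s) for f in basis])   # A_ij = f_j(s_i), here 100 x 3
x, *_ = np.linalg.lstsq(A, g, rcond=None)    # least-squares coefficients x
print(np.linalg.norm(A @ x - g)**2)          # total square fitting error
```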

Least-squares polynomial fitting

problem: fit polynomial of degree < n,

p(t) = a0 + a1 t + · · · + a_{n−1} t^{n−1},

to data (ti, yi), i = 1, . . . , m

- basis functions are fj(t) = t^{j−1}, j = 1, . . . , n

- matrix A has form Aij = ti^{j−1}:

      A = [ 1   t1   t1^2   · · ·   t1^{n−1} ]
          [ 1   t2   t2^2   · · ·   t2^{n−1} ]
          [ ⋮    ⋮    ⋮                ⋮     ]
          [ 1   tm   tm^2   · · ·   tm^{n−1} ]

  (called a Vandermonde matrix)
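a small sketch of building A with numpy (an assumed setup, not from the slides); np.vander orders columns by decreasing power unless increasing=True is passed:

```python
import numpy as np

t = np.linspace(0, 1, 100)                  # sample points t_i (illustrative)
n = 4                                       # fit a polynomial of degree < n

A = np.vander(t, N=n, increasing=True)      # A_ij = t_i^(j-1), shape (100, 4)
# equivalent explicit construction:
# A = np.column_stack([t**j for j in range(n)])
```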

Vandermonde matrices

assuming tk ≠ tl for k ≠ l and m ≥ n, A is full rank:

- suppose Aa = 0
- corresponding polynomial p(t) = a0 + · · · + a_{n−1} t^{n−1} vanishes at m points t1, . . . , tm
- by fundamental theorem of algebra p can have no more than n − 1 zeros, so p is identically zero, and a = 0
- columns of A are independent, i.e., A full rank

Example

- fit g(t) = 4t/(1 + 10t^2) with polynomial
- m = 100 points between t = 0 & t = 1
- fits for degrees 1, 2, 3, 4 have RMS errors .135, .076, .025, .005, respectively

[plots: data and least-squares polynomial fits of degree 1, 2, 3, and 4, each over 0 ≤ t ≤ 1]
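a sketch reproducing this example, assuming the 100 points are equally spaced on [0, 1] (the slides do not state the spacing); the printed RMS errors should come out close to the values quoted above:

```python
import numpy as np

t = np.linspace(0, 1, 100)            # m = 100 points between t = 0 and t = 1 (spacing assumed)
g = 4 * t / (1 + 10 * t**2)           # data to fit

for deg in (1, 2, 3, 4):
    A = np.vander(t, N=deg + 1, increasing=True)   # basis 1, t, ..., t^deg
    a, *_ = np.linalg.lstsq(A, g, rcond=None)      # least-squares coefficients
    rms = np.sqrt(np.mean((A @ a - g) ** 2))       # RMS fitting error
    print(f"degree {deg}: RMS error {rms:.3f}")
```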

Growing sets of regressors

consider family of least-squares problems

    minimize  ‖ ∑_{i=1}^p xi ai − y ‖

for p = 1, . . . , n

(a1, . . . , ap are called regressors)

- approximate y by linear combination of a1, . . . , ap
- project y onto span{a1, . . . , ap}
- regress y on a1, . . . , ap
- as p increases, get better fit, so optimal residual decreases

Growing sets of regressors

solution for each p ≤ n is given by

    x_ls^(p) = (A_p^T A_p)^{−1} A_p^T y = R_p^{−1} Q_p^T y

where

- A_p = [a1 · · · ap] ∈ R^{m×p} is the first p columns of A
- A_p = Q_p R_p is the QR factorization of A_p
- R_p ∈ R^{p×p} is the leading p × p submatrix of R
- Q_p = [q1 · · · qp] is the first p columns of Q
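a numpy sketch of this computation on made-up data (A and y below are illustrative); one thin QR factorization of the full A serves all values of p:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 7
A = rng.standard_normal((m, n))      # regressors a_1, ..., a_n as columns (illustrative)
y = rng.standard_normal(m)

Q, R = np.linalg.qr(A)               # thin QR: Q is m x n, R is n x n upper triangular
for p in range(1, n + 1):
    Qp, Rp = Q[:, :p], R[:p, :p]     # first p columns of Q, leading p x p block of R
    xp = np.linalg.solve(Rp, Qp.T @ y)            # x_ls^(p) = R_p^{-1} Q_p^T y
    residual = np.linalg.norm(A[:, :p] @ xp - y)  # optimal residual with p regressors
    print(p, residual)
```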

Norm of optimal residual versus p

plot of optimal residual versus p shows how well y can be matched by linear combination of a1, . . . , ap, as function of p

[plot: ‖residual‖ versus p for p = 0, 1, . . . , 7, decreasing from ‖y‖ at p = 0; annotated values include min over x1 of ‖x1 a1 − y‖ (p = 1) and min over x1, . . . , x7 of ‖∑_{i=1}^7 xi ai − y‖ (p = 7)]

Least-squares system identification

we measure input u(t) and output y(t) for t = 0, . . . , N of unknown system

u(t) → [unknown system] → y(t)

system identification problem: find reasonable model for system based on measured I/O data u, y

example with scalar u, y (vector u, y readily handled): fit I/O data with moving-average (MA) model with n delays

ŷ(t) = h0 u(t) + h1 u(t − 1) + · · · + hn u(t − n)

where h0 , . . . , hn ∈ R

System identification

we can write model or predicted output as

    [ ŷ(n)     ]   [ u(n)       u(n − 1)   · · ·   u(0)      ] [ h0 ]
    [ ŷ(n + 1) ]   [ u(n + 1)   u(n)       · · ·   u(1)      ] [ h1 ]
    [    ⋮     ] = [    ⋮          ⋮                  ⋮      ] [ ⋮  ]
    [ ŷ(N)     ]   [ u(N)       u(N − 1)   · · ·   u(N − n)  ] [ hn ]

model prediction error is

e = (y(n) − ŷ(n), . . . , y(N) − ŷ(N))

least-squares identification: choose model (i.e., h) that minimizes norm of model prediction error ‖e‖

. . . a least-squares problem (with variables h)
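a sketch of this identification step in numpy; the helper name fit_ma_model and the synthetic I/O data are illustrative, not taken from the lecture:

```python
import numpy as np

def fit_ma_model(u, y, n):
    """Least-squares fit of an MA model with n delays to scalar I/O data u, y."""
    N = len(u) - 1
    # row for time t = n + k multiplies (h_0, ..., h_n); column j holds u(t - j)
    U = np.column_stack([u[n - j : N + 1 - j] for j in range(n + 1)])
    h, *_ = np.linalg.lstsq(U, y[n : N + 1], rcond=None)
    e = y[n : N + 1] - U @ h   # model prediction error
    # relative prediction error; norm of y taken over t = n, ..., N (an assumption)
    return h, np.linalg.norm(e) / np.linalg.norm(y[n : N + 1])

# illustrative use on synthetic data (not the data shown in the slides)
rng = np.random.default_rng(0)
u = rng.standard_normal(71)
y = np.convolve(u, [0.1, 0.3, 0.2], mode="full")[:71] + 0.1 * rng.standard_normal(71)
h, rel_err = fit_ma_model(u, y, n=7)
print(h, rel_err)
```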

Example

data used to fit model

[plots: input u(t) (left) and output y(t) (right) used to fit the model, for t = 0, . . . , 70]

Example
for n = 7 we obtain MA model with

(h0 , . . . , h7 ) = (.024, .282, .418, .354, .243, .487, .208, .441)

with relative prediction error ‖e‖/‖y‖ = 0.37


[plot: y(t) actual output and ŷ(t) predicted from the model, versus t]


Model order selection

question: how large should n be?

- obviously the larger n, the smaller the prediction error on the data used to form the model
- suggests using largest possible model order for smallest prediction error

Model order selection

[plot: relative prediction error ‖e‖/‖y‖ versus model order n, on the data used to form the model, for n from 0 to 55; the error decreases from about 1.0 toward 0.1 as n increases]

difficulty: for n too large the predictive ability of the model on other I/O data
(from the same system) becomes worse

Out of sample validation

- evaluate model predictive performance on another I/O data set not used to develop the model (the model validation data set)
- check prediction error of models (developed using modeling data) on validation data (see the sketch after the plot below)
- plot suggests n = 10 is a good choice

[plot: relative prediction error ‖e‖/‖y‖ versus n, for n from 0 to 55, evaluated on the validation data; the error is smallest near n = 10 and grows for larger n]
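a sketch of this out-of-sample check on synthetic data; the system, the data, and the helper below are all made up for illustration, but the loop mirrors the procedure described above (fit on modeling data, evaluate on validation data):

```python
import numpy as np

def ma_matrix(u, n):
    # rows correspond to t = n, ..., len(u) - 1; column j holds u(t - j)
    return np.column_stack([u[n - j : len(u) - j] for j in range(n + 1)])

rng = np.random.default_rng(1)
true_h = 0.3 * rng.standard_normal(4)            # an arbitrary "unknown" system

def simulate(u):
    return np.convolve(u, true_h, mode="full")[: len(u)] + 0.05 * rng.standard_normal(len(u))

u_fit, u_val = rng.standard_normal(200), rng.standard_normal(200)
y_fit, y_val = simulate(u_fit), simulate(u_val)  # modeling and validation records

for n in (2, 5, 10, 20, 50):
    h, *_ = np.linalg.lstsq(ma_matrix(u_fit, n), y_fit[n:], rcond=None)  # fit on modeling data
    e_val = ma_matrix(u_val, n) @ h - y_val[n:]                          # error on validation data
    print(f"n = {n:2d}: validation relative error "
          f"{np.linalg.norm(e_val) / np.linalg.norm(y_val[n:]):.3f}")
```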
Validation

for n = 50 the actual and predicted outputs on system identification and model
validation data are:

[plots: actual and predicted outputs versus t, on the system identification data (left) and the model validation data (right)]

- y(t) actual output, ŷ(t) predicted from model
- loss of predictive ability when n too large called model overfit or overmodeling

