
EE263 Autumn 2015, S. Boyd and S. Lall

Least-squares data fitting

Least-squares data fitting

we are given:

- functions f1, . . . , fn : S → R, called regressors or basis functions

- data or measurements (si, gi), i = 1, . . . , m, where si ∈ S and (usually) m ≫ n

problem: find coefficients x1, . . . , xn ∈ R so that

x1 f1(si) + · · · + xn fn(si) ≈ gi,   i = 1, . . . , m

i.e., find linear combination of functions that fits data


least-squares fit: choose x to minimize total square fitting error:
    ∑_{i=1}^m (x1 f1(si) + · · · + xn fn(si) − gi)^2

Least-squares data fitting

- total square fitting error is ‖Ax − g‖^2, where Aij = fj(si)

- hence, least-squares fit is given by

      x = (A^T A)^{−1} A^T g

  (assuming A is skinny, full rank; a numerical sketch follows this list)

- corresponding function is

      f_lsfit(s) = x1 f1(s) + · · · + xn fn(s)

- applications:
  - interpolation, extrapolation, smoothing of data
  - developing simple, approximate model of data
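a minimal numpy sketch of this fit (not from the original slides); the particular basis functions and data below are illustrative, with g(t) borrowed from the example later in the deck:

```python
import numpy as np

# illustrative basis functions f_1, f_2, f_3 and data (s_i, g_i); not the lecture's data
basis = [lambda s: np.ones_like(s), lambda s: s, lambda s: s**2]
s = np.linspace(0, 1, 100)                   # sample points s_i
g = 4 * s / (1 + 10 * s**2)                  # measurements g_i

A = np.column_stack([f(s) for f in basis])   # A_ij = f_j(s_i), here 100 x 3
x, *_ = np.linalg.lstsq(A, g, rcond=None)    # least-squares coefficients x
print(np.linalg.norm(A @ x - g)**2)          # total square fitting error
```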

Least-squares polynomial fitting

problem: fit polynomial of degree < n,

p(t) = a0 + a1 t + · · · + a_{n−1} t^{n−1},

to data (ti, yi), i = 1, . . . , m

- basis functions are fj(t) = t^{j−1}, j = 1, . . . , n

- matrix A has form Aij = ti^{j−1}:

      A = [ 1   t1   t1^2   · · ·   t1^{n−1} ]
          [ 1   t2   t2^2   · · ·   t2^{n−1} ]
          [ ⋮    ⋮    ⋮                ⋮     ]
          [ 1   tm   tm^2   · · ·   tm^{n−1} ]

  (called a Vandermonde matrix)
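a small sketch of building A with numpy (an assumed setup, not from the slides); np.vander orders columns by decreasing power unless increasing=True is passed:

```python
import numpy as np

t = np.linspace(0, 1, 100)                  # sample points t_i (illustrative)
n = 4                                       # fit a polynomial of degree < n

A = np.vander(t, N=n, increasing=True)      # A_ij = t_i^(j-1), shape (100, 4)
# equivalent explicit construction:
# A = np.column_stack([t**j for j in range(n)])
```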

Vandermonde matrices

assuming tk ≠ tl for k ≠ l and m ≥ n, A is full rank:

- suppose Aa = 0
- corresponding polynomial p(t) = a0 + · · · + a_{n−1} t^{n−1} vanishes at m points t1, . . . , tm
- by fundamental theorem of algebra p can have no more than n − 1 zeros, so p is identically zero, and a = 0
- columns of A are independent, i.e., A full rank

Example

- fit g(t) = 4t/(1 + 10t^2) with polynomial
- m = 100 points between t = 0 & t = 1
- fits for degrees 1, 2, 3, 4 have RMS errors .135, .076, .025, .005, respectively

[plots: data and least-squares polynomial fits of degree 1, 2, 3, and 4, each over 0 ≤ t ≤ 1]
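a sketch reproducing this example, assuming the 100 points are equally spaced on [0, 1] (the slides do not state the spacing); the printed RMS errors should come out close to the values quoted above:

```python
import numpy as np

t = np.linspace(0, 1, 100)            # m = 100 points between t = 0 and t = 1 (spacing assumed)
g = 4 * t / (1 + 10 * t**2)           # data to fit

for deg in (1, 2, 3, 4):
    A = np.vander(t, N=deg + 1, increasing=True)   # basis 1, t, ..., t^deg
    a, *_ = np.linalg.lstsq(A, g, rcond=None)      # least-squares coefficients
    rms = np.sqrt(np.mean((A @ a - g) ** 2))       # RMS fitting error
    print(f"degree {deg}: RMS error {rms:.3f}")
```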

Growing sets of regressors

consider family of least-squares problems

    minimize  ‖ ∑_{i=1}^p xi ai − y ‖

for p = 1, . . . , n

(a1, . . . , ap are called regressors)

- approximate y by linear combination of a1, . . . , ap
- project y onto span{a1, . . . , ap}
- regress y on a1, . . . , ap
- as p increases, get better fit, so optimal residual decreases

Growing sets of regressors

solution for each p ≤ n is given by

    x_ls^(p) = (A_p^T A_p)^{−1} A_p^T y = R_p^{−1} Q_p^T y

where

- A_p = [a1 · · · ap] ∈ R^{m×p} is the first p columns of A
- A_p = Q_p R_p is the QR factorization of A_p
- R_p ∈ R^{p×p} is the leading p × p submatrix of R
- Q_p = [q1 · · · qp] is the first p columns of Q
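a numpy sketch of this computation on made-up data (A and y below are illustrative); one thin QR factorization of the full A serves all values of p:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 7
A = rng.standard_normal((m, n))      # regressors a_1, ..., a_n as columns (illustrative)
y = rng.standard_normal(m)

Q, R = np.linalg.qr(A)               # thin QR: Q is m x n, R is n x n upper triangular
for p in range(1, n + 1):
    Qp, Rp = Q[:, :p], R[:p, :p]     # first p columns of Q, leading p x p block of R
    xp = np.linalg.solve(Rp, Qp.T @ y)            # x_ls^(p) = R_p^{-1} Q_p^T y
    residual = np.linalg.norm(A[:, :p] @ xp - y)  # optimal residual with p regressors
    print(p, residual)
```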

Norm of optimal residual versus p

plot of optimal residual versus p shows how well y can be matched by linear combination of a1, . . . , ap, as function of p

[plot: ‖residual‖ versus p for p = 0, 1, . . . , 7, decreasing from ‖y‖ at p = 0; annotated values include min over x1 of ‖x1 a1 − y‖ (p = 1) and min over x1, . . . , x7 of ‖∑_{i=1}^7 xi ai − y‖ (p = 7)]

Least-squares system identification

we measure input u(t) and output y(t) for t = 0, . . . , N of unknown system

u(t) → [unknown system] → y(t)

system identification problem: find reasonable model for system based on measured I/O data u, y

example with scalar u, y (vector u, y readily handled): fit I/O data with moving-average (MA) model with n delays

ŷ(t) = h0 u(t) + h1 u(t − 1) + · · · + hn u(t − n)

where h0 , . . . , hn ∈ R

System identification

we can write model or predicted output as

    [ ŷ(n)     ]   [ u(n)       u(n − 1)   · · ·   u(0)      ] [ h0 ]
    [ ŷ(n + 1) ]   [ u(n + 1)   u(n)       · · ·   u(1)      ] [ h1 ]
    [    ⋮     ] = [    ⋮          ⋮                  ⋮      ] [ ⋮  ]
    [ ŷ(N)     ]   [ u(N)       u(N − 1)   · · ·   u(N − n)  ] [ hn ]

model prediction error is

e = (y(n) − ŷ(n), . . . , y(N) − ŷ(N))

least-squares identification: choose model (i.e., h) that minimizes norm of model prediction error ‖e‖

. . . a least-squares problem (with variables h)
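a sketch of this identification step in numpy; the helper name fit_ma_model and the synthetic I/O data are illustrative, not taken from the lecture:

```python
import numpy as np

def fit_ma_model(u, y, n):
    """Least-squares fit of an MA model with n delays to scalar I/O data u, y."""
    N = len(u) - 1
    # row for time t = n + k multiplies (h_0, ..., h_n); column j holds u(t - j)
    U = np.column_stack([u[n - j : N + 1 - j] for j in range(n + 1)])
    h, *_ = np.linalg.lstsq(U, y[n : N + 1], rcond=None)
    e = y[n : N + 1] - U @ h   # model prediction error
    # relative prediction error; norm of y taken over t = n, ..., N (an assumption)
    return h, np.linalg.norm(e) / np.linalg.norm(y[n : N + 1])

# illustrative use on synthetic data (not the data shown in the slides)
rng = np.random.default_rng(0)
u = rng.standard_normal(71)
y = np.convolve(u, [0.1, 0.3, 0.2], mode="full")[:71] + 0.1 * rng.standard_normal(71)
h, rel_err = fit_ma_model(u, y, n=7)
print(h, rel_err)
```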

Example

data used to fit model

[plots: input u(t) (left) and output y(t) (right) used to fit the model, for t = 0, . . . , 70]

Example
for n = 7 we obtain MA model with

(h0 , . . . , h7 ) = (.024, .282, .418, .354, .243, .487, .208, .441)

with relative prediction error ‖e‖/‖y‖ = 0.37


[plot: y(t) actual output and ŷ(t) predicted from the model, versus t]


Model order selection

question: how large should n be?

- obviously the larger n, the smaller the prediction error on the data used to form the model
- suggests using largest possible model order for smallest prediction error

Model order selection

[plot: relative prediction error ‖e‖/‖y‖ versus model order n, on the data used to form the model, for n from 0 to 55; the error decreases from about 1.0 toward 0.1 as n increases]

difficulty: for n too large the predictive ability of the model on other I/O data
(from the same system) becomes worse

Out of sample validation

- evaluate model predictive performance on another I/O data set not used to develop the model (the model validation data set)
- check prediction error of models (developed using modeling data) on validation data (see the sketch after the plot below)
- plot suggests n = 10 is a good choice

[plot: relative prediction error ‖e‖/‖y‖ versus n, for n from 0 to 55, evaluated on the validation data; the error is smallest near n = 10 and grows for larger n]
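a sketch of this out-of-sample check on synthetic data; the system, the data, and the helper below are all made up for illustration, but the loop mirrors the procedure described above (fit on modeling data, evaluate on validation data):

```python
import numpy as np

def ma_matrix(u, n):
    # rows correspond to t = n, ..., len(u) - 1; column j holds u(t - j)
    return np.column_stack([u[n - j : len(u) - j] for j in range(n + 1)])

rng = np.random.default_rng(1)
true_h = 0.3 * rng.standard_normal(4)            # an arbitrary "unknown" system

def simulate(u):
    return np.convolve(u, true_h, mode="full")[: len(u)] + 0.05 * rng.standard_normal(len(u))

u_fit, u_val = rng.standard_normal(200), rng.standard_normal(200)
y_fit, y_val = simulate(u_fit), simulate(u_val)  # modeling and validation records

for n in (2, 5, 10, 20, 50):
    h, *_ = np.linalg.lstsq(ma_matrix(u_fit, n), y_fit[n:], rcond=None)  # fit on modeling data
    e_val = ma_matrix(u_val, n) @ h - y_val[n:]                          # error on validation data
    print(f"n = {n:2d}: validation relative error "
          f"{np.linalg.norm(e_val) / np.linalg.norm(y_val[n:]):.3f}")
```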
Validation

for n = 50 the actual and predicted outputs on system identification and model
validation data are:

[plots: actual and predicted outputs versus t, on the system identification data (left) and the model validation data (right)]

- y(t) actual output, ŷ(t) predicted from model
- loss of predictive ability when n too large called model overfit or overmodeling

