
Polynomial Regression

a.h.thiery@nus.edu.sg
Version: 0.1

Contents

1 A simple polynomial example
  1.1 Least square estimate
  1.2 Performance vs. complexity of the model
  1.3 Estimation of the generalization performance
  1.4 $\widehat{\beta}$ is a Maximum Likelihood Estimator

1 A simple polynomial example

Consider some data $y = (y_1, \ldots, y_n)$ that are noisy observations of a function $f$, in the sense that $y_i = f(x_i) + \text{(noise)}$. For illustration purposes, we will suppose that $f(x)$ is a polynomial and our task will be to reconstruct this polynomial. More precisely,

$$ y_i = \sum_{k=0}^{d} \beta_k^\star \, x_i^k + \epsilon_i $$

where $f(x) = \sum_{k=0}^{d} \beta_k^\star x^k$ is an unknown polynomial and $\epsilon_i$ is a random variable with mean zero. For $n$ observations $y = (y_1, \ldots, y_n)$, and assuming that we know the degree $d$ of the polynomial $f$, this can also be written in matrix form as [Exercise]

$$ Y = X \beta^\star + \epsilon $$

with $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times (d+1)}$ and $\beta^\star \in \mathbb{R}^{d+1}$. The vector $\beta^\star$ is unknown.
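To make the matrix form concrete, here is the system written out explicitly; the first column of the design matrix $X$ corresponds to the constant term:

$$
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^d \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^d
\end{pmatrix}
\begin{pmatrix} \beta_0^\star \\ \vdots \\ \beta_d^\star \end{pmatrix}
+
\begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}.
$$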
set.seed(1)  # for reproducibility

# generate some data:
# build the design matrix whose k-th column contains x^(k-1)
create_X_matrix = function(x_data, degree){
  p = degree + 1
  X = matrix(0, nrow = length(x_data), ncol = p)
  for(k in 1:p){
    X[,k] = x_data**(k-1)
  }
  return( X )
}
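As a quick sanity check (a hypothetical snippet, not part of the original notes), the columns of this matrix, apart from the intercept, should agree with R's built-in raw polynomial features:

# hypothetical check: columns 2..3 of X should equal x and x^2
x_check = c(1, 2, 3)
X_check = create_X_matrix(x_check, 2)
stopifnot(isTRUE(all.equal(X_check[, 2:3],
                           unclass(poly(x_check, 2, raw = TRUE)),
                           check.attributes = FALSE)))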
# create a polynomial: P(x) = x^2
poly_deg = 2
poly_coef = c(0, 0, 1)

# generate some noisy data
n_data = 20
x_data = rnorm(n_data)
X = create_X_matrix(x_data, poly_deg)
sd_noise = 0.5  # standard deviation of the noise
y_data = X %*% poly_coef + rnorm(n_data, mean=0, sd=sd_noise)
#plot the results
plot(x_data, y_data, pch=20, col="red",
xlab="x", ylab="y")

[Figure: scatter plot of the noisy data; x-axis: x, y-axis: y]

1.1 Least square estimate

The least square estimate $\widehat{\beta}$ for $\beta$ is defined as

$$ \widehat{\beta} = \operatorname{argmin} \left\{ \beta \mapsto \|Y - X\beta\|^2 \right\}. $$

We will see later in the course that $\widehat{\beta}$ is given by

$$ \widehat{\beta} = \left( X^\top X \right)^{-1} X^\top y. $$

# standard least squares estimate:
# solve the normal equations (X'X) beta = X'y
compute_beta = function(y, X){
  return( solve( t(X) %*% X, t(X) %*% y ) )
}
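As an aside (my addition, not discussed in the notes): forming $X^\top X$ explicitly can be numerically ill-conditioned for high-degree polynomials. A sketch of a more stable variant based on the QR decomposition, assuming $X$ has full column rank:

# alternative sketch: least squares via the QR decomposition
compute_beta_qr = function(y, X){
  return( qr.solve(X, y) )  # least squares solution of min ||y - X beta||^2
}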

Let us look at the fit to the data $\widehat{y} = X \widehat{\beta}$, which also reads

$$ \widehat{y} = H\, y \qquad \text{with} \qquad H \equiv X \left( X^\top X \right)^{-1} X^\top, $$

for several values of $d$; the matrix $H$ is usually called the hat matrix.
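Before looking at the fits, here is an illustrative check (my addition, not in the original notes) that $H$ really behaves like a projection matrix:

# illustrative check: the hat matrix is an orthogonal projection
d = 2
XX = create_X_matrix(x_data, d)
H = XX %*% solve(t(XX) %*% XX) %*% t(XX)
max(abs(H %*% H - H))  # ~0: H is idempotent
sum(diag(H))           # trace(H) = d + 1, the number of fitted parameters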

# let us start with a low degree
d = 1
XX = create_X_matrix(x_data, d)
beta = compute_beta(y_data,XX)
x_list = seq(-5,5,by=0.01)
y_fit = create_X_matrix(x_list, d) %*% beta
plot(x_data, y_data, pch=20, col="red",
xlab="x", ylab="y", main=paste("Degree = ",d))
points(x_list, y_fit, type="l", lwd=3)

[Figure: Degree = 1 — data with the fitted line]

The fit is pretty bad since d is too low: a degree-1 polynomial cannot capture the curvature of the data (underfitting).

# let us do the case d = 2
d = 2
XX = create_X_matrix(x_data, d)
beta = compute_beta(y_data,XX)
x_list = seq(-5,5,by=0.01)
y_fit = create_X_matrix(x_list, d) %*% beta
plot(x_data, y_data, pch=20, col="red",
xlab="x", ylab="y", main=paste("Degree = ",d))
points(x_list, y_fit, type="l", lwd=3)

[Figure: Degree = 2 — data with the fitted quadratic]

The fit is indeed quite good since d = 2 is the true value of d.


# let us conclude with a (too) high degree
d = 11
XX = create_X_matrix(x_data, d)
beta = compute_beta(y_data, XX)
x_list = seq(-5, 5, by=0.01)
y_fit = create_X_matrix(x_list, d) %*% beta
plot(x_data, y_data, pch=20, col="red",
     xlab="x", ylab="y", main=paste("Degree = ", d))
points(x_list, y_fit, type="l", lwd=3)

[Figure: Degree = 11 — data with the fitted degree-11 polynomial]

We are observing a phenomenon called overfitting: the high-degree polynomial is flexible enough to fit the noise in the data rather than the underlying function f.

1.2 Performance vs. complexity of the model

Let us look at the performance of the least square estimate for different values of $d$. One needs a way of measuring performance, and a common approach in this situation is to define

$$ \text{(performance)} = \sum_{i=1}^{n} \operatorname{Loss}(y_i, \widehat{y}_i) $$

where the loss function $\operatorname{Loss}(\cdot, \cdot)$ measures how well the prediction $\widehat{y}_i$ approximates the true value $y_i$. It is standard practice in this case, mainly because this leads to tractable computations, to use the squared error loss $\operatorname{Loss}(y, \widehat{y}) \equiv (y - \widehat{y})^2$. The resulting measure of performance is called the Residual Sum of Squares,

$$ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2. $$

We will now simply compute the RSS for different values of $d$; it is completely equivalent to look at the Mean Squared Error $\mathrm{MSE} = (1/n)\, \mathrm{RSS}$.
# training MSE for each degree d (fitted and evaluated on the same data)
deg_max = 10
mse_list = rep(0, deg_max)
for(d in 1:deg_max){
  XX = create_X_matrix(x_data, d)
  beta = compute_beta(y_data, XX)
  y_fit = XX %*% beta
  mse_list[d] = mean( (y_data - y_fit)**2 )
}

# display the results
plot(mse_list, col="red", type="o", pch=20,
     main = "Mean Squared Error vs. Degree",
     xlab = "degree", ylab = "MSE")

[Figure: Mean Squared Error vs. Degree — the training MSE decreases as the degree increases]

The higher the degree $d$, the lower the MSE [Exercise]: since the degree-$d$ model is nested inside the degree-$(d+1)$ model, the minimum of the RSS can only decrease as $d$ grows. This is not helpful at all if one wants to find a suitable value for $d$. In most situations of interest, we are trying to make predictions on data that have not been used to train the model. In the above situation, the coefficient $\widehat{\beta}$ has been determined using the whole dataset $\{y_i\}_{i=1}^{n}$ and the MSE has been estimated on the same dataset!
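A quick numerical check (my addition, not in the original notes) that the training MSE computed above is indeed non-increasing in the degree:

# nested models: the training MSE can only decrease as the degree grows;
# this should return TRUE, up to numerical round-off
all(diff(mse_list) <= 1e-8)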

1.3 Estimation of the generalization performance

To estimate the performance of a procedure, one needs to test it on data that have not been used to train the algorithm. In our case, it suffices to estimate $\widehat{\beta}$ on a subset of the data (i.e. the training set) and then estimate the MSE on data that have not been used to estimate $\widehat{\beta}$ (i.e. the test set). For a fixed value of $d$, one can repeat this procedure on many different splits of the dataset.

# generalization estimation: repeated random 50/50 train/test splits
n_bootstrap = 100
deg_max = 6
mse_list = rep(0, deg_max*n_bootstrap)
deg_list = rep(0, deg_max*n_bootstrap)
for(d in 1:deg_max){
  for(k in 1:n_bootstrap){
    # use half of the data for training, the other half for testing
    sampled_index = sample(1:n_data, round(length(x_data)/2),
                           replace=FALSE)
    XX_train = create_X_matrix(x_data[sampled_index], d)
    XX_test = create_X_matrix(x_data[-sampled_index], d)
    yy_train = y_data[sampled_index]
    yy_test = y_data[-sampled_index]
    beta = compute_beta(yy_train, XX_train)
    yy_fit = XX_test %*% beta
    mse = mean( (yy_test - yy_fit)**2 )
    mse_list[(d-1)*n_bootstrap + k] = mse
    deg_list[(d-1)*n_bootstrap + k] = d
  }
}
Let us now plot the estimate of the MSE as a function of d.
validation = data.frame(mse = mse_list, deg = deg_list)
boxplot(mse ~ deg, data = validation,
        log = "y", col = "bisque",
        main = "Generalization",
        xlab = "degree",
        ylab = "MSE")

[Figure: Generalization — boxplots of the test MSE (log scale) for each degree]

It is now clear that choosing too high a degree leads to suboptimal generalization performance.
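One natural way to exploit these estimates (a sketch, my addition; the notes do not prescribe a selection rule) is to pick the degree with the smallest median test MSE:

# hypothetical selection rule: degree with the smallest median test MSE
med_mse = tapply(validation$mse, validation$deg, median)
best_d = as.integer(names(which.min(med_mse)))
best_d  # expected to be close to the true degree, 2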

1.4 $\widehat{\beta}$ is a Maximum Likelihood Estimator

Recall that we postulated that the data were generated through the model $Y = X\beta + \epsilon$ for some noise $\epsilon$. Under the assumption that $\epsilon$ is Gaussian, the least square estimate $\widehat{\beta}$ is also the maximum likelihood estimate [Exercise].
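A one-line sketch of the argument, assuming $\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$ with known $\sigma^2$ (the exercise asks for the details): the log-likelihood reads

$$ \log p(y \mid \beta) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{\|Y - X\beta\|^2}{2\sigma^2}, $$

so maximizing the likelihood over $\beta$ is equivalent to minimizing $\|Y - X\beta\|^2$, whose minimizer is exactly the least square estimate $\widehat{\beta}$.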
