
Polynomial Regression

a.h.thiery@nus.edu.sg
Version: 0.1

Contents

1 A simple polynomial example
  1.1 Least square estimate
  1.2 Performance vs. complexity of the model
  1.3 Estimation of the generalization performance
  1.4 $\widehat{\beta}$ is a Maximum Likelihood Estimator

1 A simple polynomial example

Consider some data $y = (y_1, \ldots, y_n)$ that are noisy observations of a function $f$, in the sense that $y_i = f(x_i) + \text{(noise)}$. For illustration purposes, we will suppose that $f(x)$ is a polynomial and our task will be to reconstruct this polynomial. More precisely,

$$ y_i = \sum_{k=0}^{d} \beta_k^\star \, x_i^k + \epsilon_i $$

where $f(x) = \sum_{k=0}^{d} \beta_k^\star x^k$ is an unknown polynomial and $\epsilon_i$ is a random variable with mean zero. For $n$ observations $y = (y_1, \ldots, y_n)$, and assuming that we know the degree $d$ of the polynomial $f$, this can also be written in matrix form as [Exercise]

$$ Y = X \beta^\star + \epsilon $$

with $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times (d+1)}$ and $\beta^\star \in \mathbb{R}^{d+1}$. The vector $\beta^\star$ is unknown.
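To make the matrix form concrete, here is the system written out explicitly; the first column of the design matrix $X$ corresponds to the constant term:

$$
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^d \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^d
\end{pmatrix}
\begin{pmatrix} \beta_0^\star \\ \vdots \\ \beta_d^\star \end{pmatrix}
+
\begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}.
$$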
set.seed(1)  # for reproducibility

# generate some data:
# build the design matrix whose k-th column contains x^(k-1)
create_X_matrix = function(x_data, degree){
  p = degree + 1
  X = matrix(0, nrow = length(x_data), ncol = p)
  for(k in 1:p){
    X[,k] = x_data**(k-1)
  }
  return( X )
}
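As a quick sanity check (a hypothetical snippet, not part of the original notes), the columns of this matrix, apart from the intercept, should agree with R's built-in raw polynomial features:

# hypothetical check: columns 2..3 of X should equal x and x^2
x_check = c(1, 2, 3)
X_check = create_X_matrix(x_check, 2)
stopifnot(isTRUE(all.equal(X_check[, 2:3],
                           unclass(poly(x_check, 2, raw = TRUE)),
                           check.attributes = FALSE)))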
# create a polynomial: P(x) = x^2
poly_deg = 2
poly_coef = c(0, 0, 1)

# generate some noisy data
n_data = 20
x_data = rnorm(n_data)
X = create_X_matrix(x_data, poly_deg)
sd_noise = 0.5  # standard deviation of the noise
y_data = X %*% poly_coef + rnorm(n_data, mean=0, sd=sd_noise)
#plot the results
plot(x_data, y_data, pch=20, col="red",
xlab="x", ylab="y")

[Figure: scatter plot of the noisy data; x-axis: x, y-axis: y]

1.1 Least square estimate

The least square estimate $\widehat{\beta}$ for $\beta$ is defined as

$$ \widehat{\beta} = \operatorname{argmin} \left\{ \beta \mapsto \|Y - X\beta\|^2 \right\}. $$

We will see later in the course that $\widehat{\beta}$ is given by

$$ \widehat{\beta} = \left( X^\top X \right)^{-1} X^\top y. $$

# standard least squares estimate:
# solve the normal equations (X'X) beta = X'y
compute_beta = function(y, X){
  return( solve( t(X) %*% X, t(X) %*% y ) )
}
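As an aside (my addition, not discussed in the notes): forming $X^\top X$ explicitly can be numerically ill-conditioned for high-degree polynomials. A sketch of a more stable variant based on the QR decomposition, assuming $X$ has full column rank:

# alternative sketch: least squares via the QR decomposition
compute_beta_qr = function(y, X){
  return( qr.solve(X, y) )  # least squares solution of min ||y - X beta||^2
}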

Let us look at the fit to the data $\widehat{y} = X \widehat{\beta}$, which also reads

$$ \widehat{y} = H\, y \qquad \text{with} \qquad H \equiv X \left( X^\top X \right)^{-1} X^\top, $$

for several values of $d$; the matrix $H$ is usually called the hat matrix.
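Before looking at the fits, here is an illustrative check (my addition, not in the original notes) that $H$ really behaves like a projection matrix:

# illustrative check: the hat matrix is an orthogonal projection
d = 2
XX = create_X_matrix(x_data, d)
H = XX %*% solve(t(XX) %*% XX) %*% t(XX)
max(abs(H %*% H - H))  # ~0: H is idempotent
sum(diag(H))           # trace(H) = d + 1, the number of fitted parameters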

# let us start with a low degree
d = 1
XX = create_X_matrix(x_data, d)
beta = compute_beta(y_data,XX)
x_list = seq(-5,5,by=0.01)
y_fit = create_X_matrix(x_list, d) %*% beta
plot(x_data, y_data, pch=20, col="red",
xlab="x", ylab="y", main=paste("Degree = ",d))
points(x_list, y_fit, type="l", lwd=3)

[Figure: Degree = 1 — data with the fitted line]

The fit is pretty bad since d is too low: a degree-1 polynomial cannot capture the curvature of the data (underfitting).

# let us do the case d = 2
d = 2
XX = create_X_matrix(x_data, d)
beta = compute_beta(y_data,XX)
x_list = seq(-5,5,by=0.01)
y_fit = create_X_matrix(x_list, d) %*% beta
plot(x_data, y_data, pch=20, col="red",
xlab="x", ylab="y", main=paste("Degree = ",d))
points(x_list, y_fit, type="l", lwd=3)

[Figure: Degree = 2 — data with the fitted quadratic]

The fit is indeed quite good since d = 2 is the true value of d.


# let us conclude with a (too) high degree
d = 11
XX = create_X_matrix(x_data, d)
beta = compute_beta(y_data, XX)
x_list = seq(-5, 5, by=0.01)
y_fit = create_X_matrix(x_list, d) %*% beta
plot(x_data, y_data, pch=20, col="red",
     xlab="x", ylab="y", main=paste("Degree = ", d))
points(x_list, y_fit, type="l", lwd=3)

[Figure: Degree = 11 — data with the fitted degree-11 polynomial]

We are observing a phenomenon called overfitting: the high-degree polynomial is flexible enough to fit the noise in the data rather than the underlying function f.

1.2 Performance vs. complexity of the model

Let us look at the performance of the least square estimate for different values of $d$. One needs a way of measuring performance, and a common approach in this situation is to define

$$ \text{(performance)} = \sum_{i=1}^{n} \operatorname{Loss}(y_i, \widehat{y}_i) $$

where the loss function $\operatorname{Loss}(\cdot, \cdot)$ measures how well the prediction $\widehat{y}_i$ approximates the true value $y_i$. It is standard practice in this case, mainly because this leads to tractable computations, to use the squared error loss $\operatorname{Loss}(y, \widehat{y}) \equiv (y - \widehat{y})^2$. The resulting measure of performance is called the Residual Sum of Squares,

$$ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2. $$

We will now simply compute the RSS for different values of $d$; it is completely equivalent to look at the Mean Squared Error $\mathrm{MSE} = (1/n)\, \mathrm{RSS}$.
# training MSE for each degree d (fitted and evaluated on the same data)
deg_max = 10
mse_list = rep(0, deg_max)
for(d in 1:deg_max){
  XX = create_X_matrix(x_data, d)
  beta = compute_beta(y_data, XX)
  y_fit = XX %*% beta
  mse_list[d] = mean( (y_data - y_fit)**2 )
}

# display the results
plot(mse_list, col="red", type="o", pch=20,
     main = "Mean Squared Error vs. Degree",
     xlab = "degree", ylab = "MSE")

[Figure: Mean Squared Error vs. Degree — the training MSE decreases as the degree increases]

The higher the degree $d$, the lower the MSE [Exercise]: since the degree-$d$ model is nested inside the degree-$(d+1)$ model, the minimum of the RSS can only decrease as $d$ grows. This is not helpful at all if one wants to find a suitable value for $d$. In most situations of interest, we are trying to make predictions on data that have not been used to train the model. In the above situation, the coefficient $\widehat{\beta}$ has been determined using the whole dataset $\{y_i\}_{i=1}^{n}$ and the MSE has been estimated on the same dataset!
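A quick numerical check (my addition, not in the original notes) that the training MSE computed above is indeed non-increasing in the degree:

# nested models: the training MSE can only decrease as the degree grows;
# this should return TRUE, up to numerical round-off
all(diff(mse_list) <= 1e-8)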

1.3 Estimation of the generalization performance

To estimate the performance of a procedure, one needs to test it on data that have not been used to train the algorithm. In our case, it suffices to estimate $\widehat{\beta}$ on a subset of the data (i.e. the training set) and then estimate the MSE on data that have not been used to estimate $\widehat{\beta}$ (i.e. the test set). For a fixed value of $d$, one can repeat this procedure on many different splits of the dataset.

# generalization estimation: repeated random 50/50 train/test splits
n_bootstrap = 100
deg_max = 6
mse_list = rep(0, deg_max*n_bootstrap)
deg_list = rep(0, deg_max*n_bootstrap)
for(d in 1:deg_max){
  for(k in 1:n_bootstrap){
    # use half of the data for training, the other half for testing
    sampled_index = sample(1:n_data, round(length(x_data)/2),
                           replace=FALSE)
    XX_train = create_X_matrix(x_data[sampled_index], d)
    XX_test = create_X_matrix(x_data[-sampled_index], d)
    yy_train = y_data[sampled_index]
    yy_test = y_data[-sampled_index]
    beta = compute_beta(yy_train, XX_train)
    yy_fit = XX_test %*% beta
    mse = mean( (yy_test - yy_fit)**2 )
    mse_list[(d-1)*n_bootstrap + k] = mse
    deg_list[(d-1)*n_bootstrap + k] = d
  }
}
Let us now plot the estimate of the MSE as a function of d.
validation = data.frame(mse = mse_list, deg = deg_list)
boxplot(mse ~ deg, data = validation,
        log = "y", col = "bisque",
        main = "Generalization",
        xlab = "degree",
        ylab = "MSE")

[Figure: Generalization — boxplots of the test MSE (log scale) for each degree]

It is now clear that choosing too high a degree leads to suboptimal generalization performance.
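One natural way to exploit these estimates (a sketch, my addition; the notes do not prescribe a selection rule) is to pick the degree with the smallest median test MSE:

# hypothetical selection rule: degree with the smallest median test MSE
med_mse = tapply(validation$mse, validation$deg, median)
best_d = as.integer(names(which.min(med_mse)))
best_d  # expected to be close to the true degree, 2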

1.4 $\widehat{\beta}$ is a Maximum Likelihood Estimator

Recall that we postulated that the data were generated through the model $Y = X\beta + \epsilon$ for some noise $\epsilon$. Under the assumption that $\epsilon$ is Gaussian, the least square estimate $\widehat{\beta}$ is also the maximum likelihood estimate [Exercise].
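A one-line sketch of the argument, assuming $\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$ with known $\sigma^2$ (the exercise asks for the details): the log-likelihood reads

$$ \log p(y \mid \beta) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{\|Y - X\beta\|^2}{2\sigma^2}, $$

so maximizing the likelihood over $\beta$ is equivalent to minimizing $\|Y - X\beta\|^2$, whose minimizer is exactly the least square estimate $\widehat{\beta}$.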
