
Data Mining and Computational Inference Project

A Simple KNN Regression Model

[Cover figure: K Nearest Neighbours for (K=11 at 10 folds); sample pts and cv knn pts plotted over X in [0, 1]]
JIANPING SHEN

University of Waterloo

Instructor: Prof Fan Guang Zhe

April 16, 2009

Problem description:

Perform a k-nearest neighbor regression on a simulated data set and select the best k
using cross-validation. The data follow Y = X + e, where e ~ Normal(0, gamma) (gamma is the
noise standard deviation) and X is uniformly distributed on the interval [0, 1]. The
following factors are varied:
(1) Training set size: 30 or 100
(2) Gamma: 0.1 or 0.2
(3) Cross-validation: 10-fold or 2-fold
A further set of 500 observations is generated to provide a final evaluation (a minimal
data-generation sketch is given below).
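
As a minimal sketch of this data-generating process in MATLAB (the variable names here are
illustrative only; the same unifrnd/normrnd calls are used by the code in the appendix):

n     = 30;                              % training set size: 30 or 100
gamma = 0.1;                             % noise standard deviation: 0.1 or 0.2
X     = unifrnd(0,1,[n,1]);              % X ~ Uniform(0,1)
Y     = X + normrnd(0,gamma,[n,1]);      % Y = X + e, e ~ Normal(0,gamma)
Xnew  = unifrnd(0,1,[500,1]);            % 500 observations for the final evaluation
Ynew  = Xnew + normrnd(0,gamma,[500,1]);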

KNN Modeling:

Given a training data set of size n:

Choose a distance metric, such as Euclidean distance.
Choose a K, typically from 1, 3, 5, ... for classification (odd values avoid tied votes) or
from 1, 2, 3, ... for regression.
For any new observation, compute its distance to every training observation and use the K
closest training observations to vote for its class label (classification) or to average
their Y values (regression). A toy one-dimensional illustration of this prediction step is
sketched below.
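
A toy illustration of a single KNN regression prediction, reusing X and Y from the data
sketch above (this is not the project code, just the averaging step):

K    = 5;
x0   = 0.37;                      % a new observation
d    = abs(X - x0);               % Euclidean distance in one dimension
[junk, idx] = sort(d);            % order the training points by distance to x0
yhat = mean(Y(idx(1:K)));         % average the Y values of the K nearest neighbours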

Cross Validation:

Each observation is used for both training and validation, but never for both at the same
time. The learning sample is divided into V roughly equal subsets (folds), where 1 < V <= n.

For k from 1 to V:
The k-th fold is used as the test set, while the rest of the data is used as the training
set to build a sequence of models corresponding to some tuning parameter α. The prediction
error of each candidate model is computed on the test set.

The average prediction error over the V folds for each value of the tuning parameter α is
called the "cross-validated" error. The value of α that gives the smallest cross-validated
error is then used to train on the whole learning data set and provides the final model for
future use. A compact sketch of this procedure for KNN is given below.
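
A compact sketch of this selection procedure for KNN, with K playing the role of the tuning
parameter α. It reuses X and Y from the data sketch above; the fold assignment and the
candidate grid Ks are illustrative choices, written independently of the appendix functions:

V     = 10;                            % number of folds
Ks    = 1:2:15;                        % candidate values of K
n     = length(X);
fold  = repmat(1:V, 1, ceil(n/V));     % assign each observation to a fold
fold  = fold(1:n)';                    % column vector of fold labels
cvErr = zeros(size(Ks));
for j = 1:length(Ks)
    K  = Ks(j);
    se = 0;
    for v = 1:V
        test = (fold == v);                     % v-th fold held out as test set
        Xt = X(~test);  Yt = Y(~test);          % remaining folds form the training set
        for i = find(test)'
            [junk, idx] = sort(abs(Xt - X(i))); % distances to the training points
            se = se + (mean(Yt(idx(1:K))) - Y(i))^2;
        end
    end
    cvErr(j) = se/n;                            % cross-validated error for this K
end
[junk, jbest] = min(cvErr);
bestK = Ks(jbest);                              % K with the smallest cross-validated error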

KNN v-fold Algorithm:

1) Generate a training set (for example 30 or 100 points): (x1,y1), (x2,y2), ..., (xn,yn),
where X is uniformly distributed on [0,1] and Y = X + e with e ~ Normal(0, gamma).
2) Divide the training set into v folds (v = 2 or v = 10).
3) For each fold, use it as the test set and the remaining folds as the training set,
computing the Euclidean distances between test and training points.
4) Find the K with the minimum MSE in the above cross-validation. This K and the training
set (X,Y) are then used as the regression model for a new test set (Xnew,Ynew), generated
from the same distribution as in 1).
5) Generate Yknn for each Xnew in the new test set by averaging the Y values of its K
nearest training points.
6) Repeat 1)-5) ten times to obtain the mean CV_MSE and mean K.

Implementation:
The method is implemented in MATLAB. The implementation handles continuous responses, uses
Euclidean distance to find the nearest neighbours, and allows a choice of how the response
is summarized (majority class, mean, SSE or MSE, etc.) and of how ties are handled (random
selection, restricting K to odd values, etc.). A sketch of how the functions listed in the
appendix fit together is given below.
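
With the appendix functions (knndist.m, kcv.m, knn.m), one replication of the procedure
looks roughly as follows; this sketch simply follows simple_knn.m and reuses its variable
names:

vfolds = 10;  train_num = 100;  gamma = 0.1;
mx = train_num/vfolds;                        % points per fold
Xs = unifrnd(0,1,[mx,vfolds]);                % training X, one column per fold
Ys = Xs + normrnd(0,gamma,[mx,vfolds]);       % training Y = X + e
K  = kcv(mx,vfolds,Xs,Ys);                    % best K from v-fold cross-validation
Xo = unifrnd(0,1,[500,1]);                    % 500 evaluation points
Yo = Xo + normrnd(0,gamma,[500,1]);
Ybar     = knn(Xs(:),Xo,Ys(:));               % column j of Ybar is the j-NN fit at Xo
test_mse = sum((Ybar(:,K)-Yo).^2)/(length(Xo)-1);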

Summary of the KNN calculation results:

 #   n    gamma  CV folds  K mean (std.ev)  Test_MSE mean (std.ev)  CV_MSE* mean (std.ev)
 1   30   0.1    2          3.1  (0.57)     0.0156 (0.0031)         0.0118 (0.0008)
 2   30   0.1    10         6.6  (1.80)     0.0185 (0.0055)         0.0121 (0.0010)
 3   100  0.1    2          7.7  (1.13)     0.0118 (0.0010)         0.0103 (0.0005)
 4   100  0.1    10        12.8  (3.17)     0.0134 (0.0027)         0.0103 (0.0005)
 5   30   0.2    2          4.86 (0.91)     0.0536 (0.0081)         0.0432 (0.0025)
 6   30   0.2    10         8.10 (2.45)     0.0690 (0.0166)         0.0439 (0.0028)
 7   100  0.2    2         11.11 (1.69)     0.0448 (0.0032)         0.0397 (0.0017)
 8   100  0.2    10        16.72 (4.40)     0.0457 (0.0042)         0.0400 (0.0019)

Notes:

CV_MSE* is approximately equal to gamma squared (the noise variance); see the remark below
this list.

K increases as the number of sample points increases.
K increases as the number of folds increases.
K increases as gamma increases.
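
These patterns agree with the standard bias-variance decomposition of the expected
prediction error for KNN regression (a textbook result quoted here for context, not derived
in this report). At a point x0,

    EPE(x0) ≈ gamma^2 + [f(x0) - (1/K) * sum over the K nearest x_i of f(x_i)]^2 + gamma^2/K,

where f(x) = x is the true regression function. The first term is the irreducible noise
variance, which is why CV_MSE is close to gamma squared; the last (variance) term shrinks as
K grows, so larger samples, larger gamma, and the larger effective training sets of 10-fold
CV all favour a larger K before the squared-bias term takes over.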

Gamma=0.1

[Figure: K Nearest Neighbours for (K=3 at 2 folds); sample pts and cv knn pts]

[Figure: K Nearest Neighbours for (K=5 at 10 folds); sample pts and cv knn pts]

[Figure: K Nearest Neighbours for (K=7 at 2 folds); sample pts and cv knn pts]

[Figure: K Nearest Neighbours for (K=7 at 10 folds); sample pts and cv knn pts]

Gamma=0.2

[Figure: K Nearest Neighbours for (K=5 at 2 folds); sample pts and cv knn pts]

[Figure: K Nearest Neighbours for (K=7 at 10 folds); sample pts and cv knn pts]

[Figure: K Nearest Neighbours for (K=11 at 2 folds); sample pts and cv knn pts]

[Figure: K Nearest Neighbours for (K=25 at 10 folds); sample pts and cv knn pts]

As the figures show, the KNN fits follow the sample points well.

Behavior of the test (validation) sample error: test_MSE vs model complexity

[Figure: test-MSE vs K compared with diff sample pts (gamma=0.1); series: 2 folds with 30
pts, 2 folds with 100 pts, 10 folds with 30 pts, 10 folds with 100 pts]

[Figure: test-MSE vs K compared with diff sample pts at gamma=0.2; same four series]

The test_MSE values obtained from the training and test sets above are plotted against K.
From the plots we can draw the following conclusions:

With fewer sample points (dots) the selected K is low but the variance is high, whereas with
more sample points (circles) the selected K is higher and the variance is lower. Fewer folds
(blue) lead to less complexity (low K), while more folds (red) lead to more complexity
(high K); this is consistent with 10-fold CV training on a larger fraction of the data than
2-fold CV, which supports a larger K. However, once the complexity increases beyond roughly
K >= 15, the more complex models produce higher variance, particularly with more folds.

The same pattern holds when gamma=0.2, and running the KNN MATLAB code for additional
replications produces similar results, shown below.

[Figure: test-MSE vs K compared with diff sample pts (additional runs); same four series]

The test-MSE plots above were produced by the MATLAB code below (the gamma=0.1 data are kept
as comments):

close all

% T = Tset_MSE;
% T1 is the result of (30 pts, 2 folds), corresponding to K1
% T2 is the result of (30 pts, 10 folds), corresponding to K2
% T3 is the result of (100 pts, 2 folds), corresponding to K3
% T4 is the result of (100 pts, 10 folds), corresponding to K4

% gamma=0.1
% K=[1 5 1 1 5 3 1 3 5 1 1 7 3 11 1 5 5 7 1 9 7 5 7 9 5 15 9 7 5 3 7 3 21 15 11 3 3 27 9 51];
%
% Tset_MSE=[.0238 .0120 .0233 .0180 .0127 .0153 .0226 .0169 .0145 .0254...
%           .0210 .0128 .0133 .0162 .0154 .0140 .0145 .0129 .0177 .0128...
%           .0111 .0103 .0128 .0116 .0121 .0106 .0112 .0111 .0130 .0130...
%           .0117 .0148 .0120 .0116 .0103 .0133 .0124 .0135 .0117 .0264];
%
% K1=[7 7 3 1 3 3 1 3 1 3];
% K2=[5 1 3 1 1 15 13 7 1 17];
% K3=[9 7 7 9 5 5 5 7 5 19];
% K4=[5 27 7 3 39 1 5 17 13 11];
%
% T=[.0151 .0128 .0157 .0198 .0138 .0124 .0210 .0130 .0183 .0139...
%    .0109 .0251 .0141 .0150 .0195 .0236 .0154 .0137 .0204 .0271...
%    .0114 .0113 .0099 .0126 .0117 .0110 .0134 .0123 .0129 .0120...
%    .0117 .0146 .0114 .0136 .0132 .0204 .0138 .0114 .0122 .0118];

% gamma=0.2
K1=[5 5 7 1 3 9 5 5 7 3];
K2=[1 3 1 15 5 11 7 1 15 21];
K3=[7 13 15 9 13 5 11 23 13 9];
K4=[19 19 21 23 17 3 25 5 3 27];
T1=[.0633 .0499 .0605 .0864 .0556 .0476 .0463 .0480 .0530 .0485];
T2=[.0615 .0545 .0730 .0531 .0557 .0466 .0426 .0691 .0464 .1058];
T3=[.0490 .0444 .0436 .0437 .0460 .0481 .0382 .0417 .0452 .0477];
T4=[.0417 .0429 .0450 .0486 .0434 .0528 .0408 .0446 .0534 .0427];

figure(1)
plot(K1,T1,'.',K3,T3,'bo')    % blue: 2-fold CV (dots: 30 pts, circles: 100 pts)
hold on
plot(K2,T2,'r.',K4,T4,'ro')   % red: 10-fold CV (dots: 30 pts, circles: 100 pts)
legend('2 folds with 30 pts','2 folds with 100','10 folds with 30 pts','10 folds with 100 pts',2)
title('test-MSE vs K compared with diff sample pts at gamma=0.2')

Appendix: MATLAB code of the KNN regression model:

The distances used for the K-nearest-neighbour (KNN) predictions are calculated by
knndist.m, which returns the matrix of pairwise Euclidean distances between the rows of its
two input matrices (a symmetric matrix when both inputs are the same).

function distmat = knndist(mat1, mat2)
%
% Distance between two sets of vectors.
% knndist(MAT1, MAT2) returns the distance matrix between two
% sets of vectors MAT1 and MAT2. The element at row i and column j
% of the returned matrix is the Euclidean distance between row i
% of MAT1 and row j of MAT2.
% Jianping Shen, April 11, 2009.

if nargin == 1,
    mat2 = mat1;
end

[m1, n1] = size(mat1);
[m2, n2] = size(mat2);

if n1 ~= n2,
    error('Matrices mismatch!');
end

distmat = zeros(m1, m2);

if n1 == 1,
    distmat = abs(mat1*ones(1,m2) - ones(m1,1)*mat2');
elseif m2 >= m1,
    for i = 1:m1,
        distmat(i,:) = sqrt(sum(((ones(m2,1)*mat1(i,:) - mat2)').^2));
    end
else
    for i = 1:m2,
        distmat(:,i) = sqrt(sum(((mat1 - ones(m1,1)*mat2(i,:))').^2))';
    end
end

kcv.m performs v-fold cross-validation over the training set and returns the best K, i.e.
the value with the minimum SSE.

function K=kcv(mrow,folds,X,Y)
% This program uses v-fold cross-validation to determine the best K,
% which has the minimum SSE.
%
% Jianping Shen, April 11, 2009

md=mrow*folds-mrow;           % size of each training set (all folds but one)

switch(folds)
    case 2
        Xd=blkdiag(X(:,1),X(:,2));
    case 10
        Xd=blkdiag(X(:,1),X(:,2),X(:,3),X(:,4),X(:,5),...
            X(:,6),X(:,7),X(:,8),X(:,9),X(:,10));
    case 30
        Xd=diag(X(:));
end
% if(folds==10)
%     Xd=blkdiag(X(:,1),X(:,2),X(:,3),X(:,4),X(:,5),...
%         X(:,6),X(:,7),X(:,8),X(:,9),X(:,10));
% else
%     Xd=blkdiag(X(:,1),X(:,2));
% end
X=X(:);
Y=Y(:);
bX=repmat(X,1,folds);
Xm=bX-Xd;

ikv=zeros(folds,1);
sev=zeros(folds,1);
[mrl,ncl]=size(Xm);           % sample_num x folds, e.g. 30x10
for j=1:ncl
    tidx=find(Xd(:,j)>0);     % indices of the j-th fold (test set)
    ridx=find(Xm(:,j)>0);     % indices of the remaining folds (training set)
    Xt=X(tidx);
    Xr=X(ridx);
    Ybar=knn(Xr,Xt,Y(ridx));  % responses must align with the training rows Xr
    % SSE of each candidate K (the tuning parameter alpha) on the test fold
    SSE=sum((Ybar-Y(tidx)*ones(1,md)).^2);
    [se, k]=min(SSE);
    ikv(j)=k;
    sev(j)=se;
end
[minse,id]=min(sev);          % pick the smallest one
K=ikv(id);
end

knn.m calculates the Ybar matrix of fitted values: its first column is the 1-NN fit, the
second column the 2-NN fit, the third column the 3-NN fit, and so on.

function Yhat=knn(Xr,Xt,Y)
% This program creates the Ybar matrix: the 1st column is the 1-NN fit,
% the 2nd column the 2-NN fit, the 3rd column the 3-NN fit, ...
%
% Jianping Shen, April 11, 2009

eqt=isequal(Xr,Xt);

if(eqt)
    m=length(Xr);
    m2=m*m;
end

D=knndist(Xr,Xt);             % length(Xr) x length(Xt)
if(eqt)
    D(1:m+1:m2)=inf;          % exclude each point from its own neighbours
end
[mr,mt]=size(D);              % mr = number of training points, mt = number of test points
[junk,index]=sort(D);         % each column orders the training points by distance to one test point
Yc=Y(index)';                 % responses of the ordered neighbours, mt x mr
Yhat=zeros(mt,mr);
Yhat(:,1)=Yc(:,1);
for jd=2:mr
    Yhat(:,jd)=Yhat(:,jd-1)+Yc(:,jd);   % running sum of the jd nearest responses
end
for jd=1:mr
    Yhat(:,jd)=Yhat(:,jd)/jd;           % convert sums to averages
end

if(eqt)
    Yhat(:,mr)=Yc(:,mr);      % last column in Ybar is junk, replace by true Y
end
end

simple_knn.m implements the KNN v-fold algorithm described above.
%function [k,CV_MSE,MSE]=simple_knn(train_pts,vfolds)
clear all
close all
clc

% sz=10;
% Kmv=zeros(sz,1);
% cv_msev=zeros(sz,1);
% for jr=1:sz

gamma=0.2;           % 0.1 or 0.2
vfolds=10;           % 2 or 10
train_num=100;       % 30 or 100
mx=train_num/vfolds;
count=0;
Km=0;
kdm=0;
cv_mse=inf;
kmv=zeros(10,1);
t_mse=zeros(10,1);
while(count<10)
    kodd=1;
    while(kodd)
        Xs=unifrnd(0,1,[mx,vfolds]);
        Ys=Xs+normrnd(0,gamma,[mx,vfolds]);
        K=kcv(mx,vfolds,Xs,Ys);
        kodd=mod(K+1,2);          % resample until kcv returns an odd K, to avoid ties
    end
    Xs=Xs(:);
    Ys=Ys(:);
    ls=length(Xs);                % equal to train_num
    obs=500;
    % figure (1)
    % [junk1,junk2,Xo,Yo]=scv(obs,gamma);
    Xo=unifrnd(0,1,[obs,1]);
    lxo=length(Xo);               % equal to obs
    Yo=Xo+normrnd(0,gamma,[obs,1]);
    % classify the new points Xo against the training set (Xs,Ys):
    % Ybar=knn(Xo,Xs,Yo);         % would be the wrong orientation (100x500)
    Ybar=knn(Xs,Xo,Ys);           % 500x100
    SSE=sum((Ybar-Yo*ones(1,length(Xs))).^2);
    test_mse=SSE(K)/(lxo-1);
    mse=test_mse;
    Yp=Ybar(:,K);
    % Yp=Ybar(:,1:10:30);
    [se,kd]=min(SSE);
    Yd=Ybar(:,kd);
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    disp([mse,K])

    if(mse<cv_mse)
        cv_mse=mse;
        % figure (2)
        plot(Xs,Ys,'rp',Xo,Yp,'.')
        % legend('sample pts','cv knn pts','minsse pts',2)
        % title(sprintf('K Nearest Neighbours for (K=%d vs kd=%d at %d folds)',K,kd,vfolds));
        legend('sample pts','cv knn pts',2)
        title(sprintf('K Nearest Neighbours for (K=%d at %d folds)',K,vfolds));
    end

    Km=Km+K;
    kdm=kdm+kd;
    count=count+1;
    kmv(count)=K;
    t_mse(count)=test_mse;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%
Km=Km/count;
mtestmse=mean(t_mse);
testmstd=std(t_mse);
disp('------------------------------------')
disp([cv_mse,Km,mean(t_mse),testmstd])
disp('------------------------------------')

% Kmv(jr)=Km;
% cv_msev(jr)=cv_mse;
% end
% disp([mean(Kmv),std(Kmv),mean(cv_msev),std(cv_msev)])

%%%%%%%% THE END OF KNN CODE %%%%%%%%%%
