You are on page 1of 3

K nearest neighbor prediction.

1.a.
Download the file {practice_k-nn_prediction.xlsx}. Using the records in
KNNP_TrainingScore as the reference data, predict the Bus Travel values for the
records in KNNP_ValidationScore worksheet for k=4 and k=3. Calculate the RMSE for
1.b.

each of the choice of k and comment on which one is the better model to choose.
Using the best model selected from 1.a., predict the Bus Travel hours for a city with
the following values:

2.a.

Incom

Populat

e
18014

ion
368

Land
Area
136

The predicted values for a k-nn model can be derived in many ways. Out of them the
two most popular ones are the simple average of the k-nearest neighbors values of
the variable to be predicted and a weighted average of the k-nearest neighbors
values of the variable to be predicted where the weight is derived from the inverse of
the distance of each neighbor (the closer one will have more weight). The formula
for the second method is
Predicted value = sum of (wi*value of ith neighbor) for i = 1 to k.
where, wi = (1/di)/[sum(1/di)] for i = 1 to k.
For example, let us say that we are using k = 3 for the example in problem 1 above
and the input values of the record to be scored are as shown below.
Populati
Income
15409

on
387

Land Area
103

The 3-nearest neighbor for this record in the training dataset will be as follows:

Actual Value of Bus Travel


2386
3933
1065

Income
15207
15522
15092

Population
397
694
794

Land Area
55
63
263

Distance
207.8653
329.5725
540.1278

Once we add the inverse of the distances 1 and the weights, the table will look like the
following (including the input values of the record to be predicted):
sum of the inverse
distances
1540
9

387

103

0.009696455

Actual
Value of Bus

Inco

Populat

Land

Distan

Travel

me
1520

ion

Area

ce
207.86

Weigh
inverse of distance

ts
0.4961

2386

7
1552

397

55

53
329.57

0.004810807

41
0.3129

3933

2
1509

694

63

25
540.12

0.003034234

22
0.1909

1065

794

263

78

0.001851414

37

The predicted value will be (2386*0.496141 + 3933*0.312922 + 1065*0.190937) = 2617.86


hours.
Your task is to use the method of weighted average for k=4 for the record with the input
values of
Income

Population
18014

Land Area
368

136

Multiple Regression:
The file {Multiple_Regression_practice1.xlsx} contains fictitious data on 80 managers salary
based on their management ranks, years of experience and years of the service to the
company (years here). A scatterplot of the relationships between the various variables are
shown below.

1 XLMiner sometime uses squared-distance as opposed to the actual distance for its
weight and prediction calculations.

a. Comment on the relationship between the Salary and other variables based on the
scatterplot shown above.
b. Do you think that all the possible predictors (management rank, years of experience,
years here) should be used for building a multiple linear regression model for the
salary prediction? Why or why not?
c. Using XLMiner, run a stepwise Multiple Regression model using best subset selection.
Based on the output, decide and create the best model that will predict the salary of
a manager accurately. How does it compare with the model with all the predictors?
Comment on how you judged the quality of the models using specific statistics/output
parameters.
d. Write down the equation for the salary for the best model created in part c above.
e. Using the best model, predict the salary of a manager with 12 years of experience, 5
years at the company working at a management level 3. What will be the 95%
prediction intervals for the predicted value of the salary?

You might also like