1.a.
Download the file {practice_k-nn_prediction.xlsx}. Using the records in the
KNNP_TrainingScore worksheet as the reference data, predict the Bus Travel values for the
records in the KNNP_ValidationScore worksheet for k=4 and k=3. Calculate the RMSE for
each choice of k and comment on which one is the better model to choose.
1.b.
Using the best model selected from 1.a., predict the Bus Travel hours for a city with
the following values:
Income    Population    Land Area
18014     368           136
2.a.
The predicted value for a k-nn model can be derived in many ways. The two most popular
are the simple average of the k nearest neighbors' values of the variable to be
predicted, and a weighted average of those values in which each neighbor's weight is
derived from the inverse of its distance (the closer a neighbor is, the more weight it
has). The formula for the second method is
Predicted value = sum of (wi * value of ith neighbor) for i = 1 to k,
where wi = (1/di) / [sum of (1/dj) for j = 1 to k] and di is the distance of the ith
neighbor from the record to be scored.
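For reference, the two prediction rules described above can be written side by side. The symbols are assumed notation, not taken from the assignment: y_i is the value of the i-th nearest neighbor, d_i its distance from the record to be scored, and \hat{y} the predicted value.

```latex
\hat{y}_{\text{simple}} = \frac{1}{k}\sum_{i=1}^{k} y_i ,
\qquad
\hat{y}_{\text{weighted}} = \sum_{i=1}^{k} w_i\, y_i ,
\qquad
w_i = \frac{1/d_i}{\sum_{j=1}^{k} 1/d_j}
```

By construction the weights sum to 1, so the weighted prediction always lies between the smallest and largest neighbor values.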
For example, let us say that we are using k = 3 for the example in problem 1 above
and the input values of the record to be scored are as shown below.
Income    Population    Land Area
15409     387           103
The 3 nearest neighbors of this record in the training dataset are as follows:
Income    Population    Land Area    Distance
15207     397           55           207.8653
15522     694           63           329.5725
15092     794           263          540.1278
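The Distance column above is consistent with a plain Euclidean distance on the raw (unscaled) Income, Population and Land Area values. The following is a minimal Python sketch under that assumption; the helper name `euclidean` is illustrative only and is not part of XLMiner.

```python
import math

# Input record to be scored: (Income, Population, Land Area), from the example.
record = (15409, 387, 103)

# The three nearest training records listed in the table above.
neighbors = [
    (15207, 397, 55),
    (15522, 694, 63),
    (15092, 794, 263),
]

def euclidean(a, b):
    """Plain (unscaled) Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Assumption: no normalization is applied before measuring distance.
distances = [euclidean(record, n) for n in neighbors]

for n, d in zip(neighbors, distances):
    print(n, round(d, 4))
```

Because the values are not rescaled, Income and Population differences dominate the distance in this example; a normalized distance would generally rank the neighbors differently.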
Once we add the inverse of the distances [1] and the weights, the table will look like the
following (including the input values of the record to be predicted):

                Actual Value of                                             Inverse of
                Bus Travel       Income  Population  Land Area  Distance    distance     Weights
Input record                     15409   387         103
Neighbor 1      2386             15207   397         55         207.8653    0.004810807  0.496141
Neighbor 2      3933             15522   694         63         329.5725    0.003034234  0.312922
Neighbor 3      1065             15092   794         263        540.1278    0.001851414  0.190937
Sum of the inverse distances                                                0.009696455
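The weights and the resulting prediction for this example can be checked with a short Python sketch. It uses the distances and actual Bus Travel values from the table; the function names and the `power` parameter are hypothetical, and `power=2` is only an assumed reading of the squared-distance variant mentioned in the footnote.

```python
# Distances and actual Bus Travel values of the 3 nearest neighbors (from the table).
distances = [207.8653, 329.5725, 540.1278]
bus_travel = [2386, 3933, 1065]

def inverse_distance_weights(dists, power=1):
    """Return w_i = (1/d_i**power) / sum_j (1/d_j**power).

    power=1 is the formula given in the text; power=2 is an assumed form
    of the squared-distance variant mentioned in the footnote.
    """
    inverses = [1.0 / d ** power for d in dists]
    total = sum(inverses)
    return [inv / total for inv in inverses]

def weighted_prediction(dists, values, power=1):
    """Weighted average of the neighbor values using inverse-distance weights."""
    weights = inverse_distance_weights(dists, power)
    return sum(w * v for w, v in zip(weights, values))

weights = inverse_distance_weights(distances)
weighted = weighted_prediction(distances, bus_travel)
simple = sum(bus_travel) / len(bus_travel)  # simple-average alternative

print([round(w, 6) for w in weights])
print(round(weighted, 2), round(simple, 2))
```

With the table's values, the weighted prediction comes out at roughly 2617.9, compared with a simple average of about 2461.3, because the closest neighbor carries almost half of the total weight.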
Income    Population    Land Area
18014     368           136
Multiple Regression:
The file {Multiple_Regression_practice1.xlsx} contains fictitious data on the salaries of
80 managers, along with their management ranks, years of experience, and years of service
to the company (years here). A scatterplot of the relationships between the various
variables is shown below.
[1] XLMiner sometimes uses the squared distance, as opposed to the actual distance, for its
weight and prediction calculations.
a. Comment on the relationship between Salary and the other variables based on the
scatterplot shown above.
b. Do you think that all the possible predictors (management rank, years of experience,
years here) should be used for building a multiple linear regression model for the
salary prediction? Why or why not?
c. Using XLMiner, run a stepwise Multiple Regression model using best subset selection.
Based on the output, decide and create the best model that will predict the salary of
a manager accurately. How does it compare with the model with all the predictors?
Comment on how you judged the quality of the models using specific statistics/output
parameters.
d. Write down the equation for the salary for the best model created in part c above.
e. Using the best model, predict the salary of a manager with 12 years of experience and 5
years at the company, working at management level 3. What is the 95% prediction
interval for the predicted salary?