Information
You are about to start Problem 1 of 2, which analyzes gas mileage and uses the
ISLR library in R.
Question 1 (2 points)
Load the ISLR library into R and look at the first few rows of the Auto data set.
What data mining strategy would you use to investigate the following questions?
Note: It is recommended that you save your response as you complete each question.
Question 2 (3 points)
We would like to use K-nearest neighbors to predict the gas mileage (MPG) of cars
based on their weight (in pounds) and their year of manufacture. Explain why
standardizing the data is a good idea. Comment on observed features of the data
and possible consequences.
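To make the scale issue concrete, here is a minimal sketch on three invented cars (the numbers are made up for illustration and are not taken from the Auto data set). It compares Euclidean distances before and after standardizing with base R's scale().

```r
# Toy data, invented for illustration: weight in pounds, two-digit model year
cars <- data.frame(weight = c(2000, 2100, 4500),
                   year   = c(70, 82, 71))

# Raw Euclidean distances: weight spans thousands of units, year only about
# a dozen, so weight differences dominate the distance
raw_dist <- as.matrix(dist(cars))

# Standardize each column to mean 0 and standard deviation 1, then recompute
cars_std <- scale(cars)
std_dist <- as.matrix(dist(cars_std))

print(round(raw_dist, 1))
print(round(std_dist, 2))
```

On the raw scale, car 1 looks far closer to car 2 than to car 3, even though cars 1 and 2 are 12 model years apart. After standardizing, both variables contribute on comparable terms.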
Question 3 (1 point)
Question 4 (2 points)
Question 5 (3 points)
Set R’s seed to 1 (for Homework 1) and use sample() to divide the data into:
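A minimal sketch of a seeded random split follows. The set sizes are not shown in the text above, so n and n_train below are hypothetical placeholders, and the toy data frame stands in for the real data.

```r
# Hypothetical sizes; substitute the values given in the assignment
n <- 100
n_train <- 60

# Toy stand-in for the real data set
toy <- data.frame(id = seq_len(n), x = seq_len(n)^2)

set.seed(1)                        # seed value taken from the question
train_idx <- sample(n, n_train)    # row indices drawn without replacement

train <- toy[train_idx, ]          # rows selected for training
valid <- toy[-train_idx, ]         # all remaining rows form the validation set
```

Calling set.seed() immediately before sample() matters: any other random draw made in between would change which rows are selected.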
Question 6 (3 points)
Use 1-nearest neighbor regression (fit on the standardized training data) to predict
the gas mileage of the cars in the validation set. Compute the mean squared error.
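As an illustration of the mechanics, the sketch below implements nearest-neighbor regression in base R and computes a validation MSE on a tiny invented data set. The helper knn_predict is hypothetical (written for this example, not part of any package); a course may instead expect a packaged function.

```r
# Hypothetical helper: predict by averaging the responses of the k nearest
# training rows (Euclidean distance; ties follow training row order)
knn_predict <- function(train_X, train_y, test_X, k) {
  train_X <- as.matrix(train_X)
  test_X  <- as.matrix(test_X)
  apply(test_X, 1, function(x0) {
    d <- sqrt(rowSums(sweep(train_X, 2, x0)^2))  # distance to every training row
    mean(train_y[order(d)[seq_len(k)]])          # mean response of the k nearest
  })
}

# Invented, already-standardized toy data
train_X <- matrix(c(0, 0,  1, 1,  4, 4), ncol = 2, byrow = TRUE)
train_y <- c(10, 20, 30)
valid_X <- matrix(c(0.1, 0.2,  3.5, 3.9), ncol = 2, byrow = TRUE)
valid_y <- c(12, 27)

pred <- knn_predict(train_X, train_y, valid_X, k = 1)
mse  <- mean((valid_y - pred)^2)   # mean squared error on the validation set
print(pred)
print(mse)
```

With k = 1 each prediction is simply the response of the single closest training row, so the MSE measures how far those borrowed values fall from the true validation responses.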
Question 7 (1 point)
What is the MSE for the validation set? (Round your answer to 2 decimal places.)
Question 8 (4 points)
Use a for() loop to apply K-nearest neighbors regression to the same training and
validation sets, for values of k from 1 to 20. Make a plot of the MSE as a function
of k.
Enter your R code (just the code, not the plot) below.
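The loop structure can be sketched as follows on simulated data. Everything here is an assumption made for illustration: the data are simulated, the odd/even split replaces the random split, and knn_predict is a hypothetical base-R helper rather than a packaged function.

```r
# Hypothetical helper: average the responses of the k nearest training rows
knn_predict <- function(train_X, train_y, test_X, k) {
  train_X <- as.matrix(train_X)
  test_X  <- as.matrix(test_X)
  apply(test_X, 1, function(x0) {
    d <- sqrt(rowSums(sweep(train_X, 2, x0)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Simulated toy data: one predictor, noisy sine response
set.seed(1)
x <- seq(0, 10, length.out = 80)
y <- sin(x) + rnorm(80, sd = 0.3)
train <- seq(1, 80, by = 2)        # odd rows for training
valid <- seq(2, 80, by = 2)        # even rows for validation
train_X <- matrix(x[train])
valid_X <- matrix(x[valid])

# Fit and score one model per value of k
mse <- numeric(20)
for (k in 1:20) {
  pred   <- knn_predict(train_X, y[train], valid_X, k)
  mse[k] <- mean((y[valid] - pred)^2)
}

plot(1:20, mse, type = "b", xlab = "k", ylab = "Validation MSE")
```

Preallocating mse with numeric(20) avoids growing the vector inside the loop, and which.min(mse) would report the k with the lowest validation error.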
Question 9 (2 points)
Data Source: Kohavi, R. and B. Becker (1996). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
Information and Computer Science.
Information
You are about to start Problem 2 of 2, which analyzes personal income using
the Census_income.csv data file. You can find more information in Homework 1:
Instructions.
Question 10 (2 points)
Create a new variable, Sex01, which equals 0 for males and 1 for females.
Caution: For this data set, R reads in the values of Sex with an extra
space in front of them: “ Male” and “ Female”. You will need to account for this
when creating the variable Sex01.
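One way to handle the leading space is to trim whitespace before comparing. The sketch below uses a four-row invented data frame named census (a hypothetical name) in place of the real file.

```r
# Invented stand-in for the real data; note the leading space in each value
census <- data.frame(Sex = c(" Male", " Female", " Female", " Male"),
                     stringsAsFactors = FALSE)

# trimws() strips leading and trailing whitespace; as.character() keeps this
# working even if Sex was read in as a factor
census$Sex01 <- ifelse(trimws(as.character(census$Sex)) == "Female", 1, 0)

print(table(census$Sex, census$Sex01))
```

An alternative is to compare against the untrimmed string " Female" directly. Either way, cross-tabulating the old and new variables, as in the last line, is a quick check that the coding came out as intended.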
Question 11 (4 points)
Set R’s seed to 1 again and randomly sample 20,000 individuals to be in the
training set.
Use the same means and standard deviations (from the training data) to
standardize the validation data, creating two more
variables, Educ.valid.std and Age.valid.std. Combine these variables, along with
the validation-set values of variable Sex01, into a matrix or data
frame valid.X.std.
[Comment: this allows us to standardize the numeric variables EducYears and Age,
without standardizing the indicator variable Sex01.]
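The standardization step might look like the following sketch. The data values are invented, and the helper names educ.mean, educ.sd, age.mean, and age.sd are hypothetical. The variable names Educ.valid.std, Age.valid.std, valid.X.std, EducYears, Age, and Sex01 come from the question.

```r
# Invented toy training and validation sets
train <- data.frame(EducYears = c(9, 12, 16, 13),
                    Age       = c(25, 40, 52, 31),
                    Sex01     = c(0, 1, 1, 0))
valid <- data.frame(EducYears = c(10, 14),
                    Age       = c(30, 60),
                    Sex01     = c(1, 0))

# Means and standard deviations come from the TRAINING data only
educ.mean <- mean(train$EducYears); educ.sd <- sd(train$EducYears)
age.mean  <- mean(train$Age);       age.sd  <- sd(train$Age)

# Apply the training-set values to the validation data
Educ.valid.std <- (valid$EducYears - educ.mean) / educ.sd
Age.valid.std  <- (valid$Age - age.mean) / age.sd

# Sex01 is an indicator variable, so it is carried over unstandardized
valid.X.std <- data.frame(Educ.valid.std, Age.valid.std, Sex01 = valid$Sex01)
print(valid.X.std)
```

Reusing the training means and standard deviations keeps the validation points on the same scale as the training points, and it avoids letting validation data influence the preprocessing.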
Question 12 (2 points)
Use 25-nearest neighbor classification (fit on the training set) to predict whether
the income of each individual in the validation set is >50K or <=50K.
Find the confusion matrix. You should be able to produce a matrix table with two
rows and two columns, similar to the one below. Use the spaces below the table to
indicate what appears in each part of your matrix that corresponds to the
letters [A] through [D]. For example, if the matrix you create shows 5432 in the
cell that corresponds to [A] in the matrix below, you would enter "5432" in the
space next to "[A]".
[A]
[B]
[C]
[D]
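A sketch of the classification and tabulation steps, assuming the class package (normally installed alongside R) and simulated data in place of the census file. The variable names and sizes are invented. Which cell corresponds to [A] through [D] depends on the table layout given in the assignment, which this example does not reproduce.

```r
library(class)   # provides knn(); assumed to be installed

# Simulated toy data with a two-level income label
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
income <- factor(ifelse(X$x1 + X$x2 + rnorm(n, sd = 0.5) > 0.5, ">50K", "<=50K"),
                 levels = c("<=50K", ">50K"))

# Hypothetical split: 150 training rows, 50 validation rows
train_idx <- sample(n, 150)

# 25-nearest-neighbor classification of the validation rows
pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
            cl = income[train_idx], k = 25)

# Confusion matrix: predicted class in rows, actual class in columns
conf_mat <- table(Predicted = pred, Actual = income[-train_idx])
print(conf_mat)
```

Naming the dimensions in table() makes it unambiguous which margin holds predictions and which holds true labels.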
Question 13 (1 point)
What is the overall error rate on the validation set? Enter your answer as a decimal
between 0 and 1, rounded to 3 decimal places.
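For reference, the overall error rate is the share of off-diagonal counts in the confusion matrix. The counts below are invented for illustration and are not results from the census data.

```r
# Invented confusion matrix: predicted class in rows, actual class in columns
conf_mat <- matrix(c(70, 5, 10, 15), nrow = 2,
                   dimnames = list(Predicted = c("<=50K", ">50K"),
                                   Actual    = c("<=50K", ">50K")))

# Correct predictions lie on the diagonal; everything else is an error
error_rate <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
print(round(error_rate, 3))
```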
Question 14 (1 point)
What proportion of people making > $50,000 were misclassified? Enter your answer
as a decimal between 0 and 1, rounded to 3 decimal places.
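A class-specific rate uses only the individuals who truly belong to that class as the denominator. The sketch below uses invented counts and assumes actual classes are in the columns of the matrix.

```r
# Invented confusion matrix: predicted class in rows, actual class in columns
conf_mat <- matrix(c(70, 5, 10, 15), nrow = 2,
                   dimnames = list(Predicted = c("<=50K", ">50K"),
                                   Actual    = c("<=50K", ">50K")))

# Among people who actually make >50K, the share predicted as <=50K
high_income_miss <- conf_mat["<=50K", ">50K"] / sum(conf_mat[, ">50K"])
print(round(high_income_miss, 3))
```

Indexing by the dimension names rather than by position guards against reading the wrong cell if the table is transposed.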
Information
Following these steps allows some questions to be graded immediately and others
passed on to your instructors.