Problem 1: Analyzing Gas Mileage

Information
You are about to start Problem 1 of 2, which analyzes gas mileage and uses the
ISLR library in R.

Question 1 (2 points)

Load the ISLR library into R and look at the first few rows of the Auto data set.

What data mining strategy would you use to investigate the following questions?

- You are building an app for a used-car website that will take information
  about the year, engine displacement, and weight of cars, and determine
  whether they are most likely American (origin = 1), European (2), or
  Japanese (3).
- You are building an app for a used-car website that will take information
  about the year, engine displacement, and weight of cars, and estimate their
  horsepower.
- The manager of a used-car lot wants to arrange groups of similar cars on the
  lot. The manager wants to understand the relationships between the year,
  engine displacement, and weight of cars to identify informative groupings.

Question 1 options:

1. Regression
2. Classification
3. Unsupervised learning

Note: It is recommended that you save your response as you complete each question.
Question 2 (3 points)

We would like to use K-nearest neighbors to predict the gas mileage (MPG) of cars
based on their weight (in pounds) and their year of manufacture. Explain why
standardizing the data is a good idea. Comment on observed features of the data
and possible consequences.
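
As a rough illustration of the scale issue this question raises, the sketch below compares Euclidean distances before and after standardizing. The three cars are made-up values chosen for illustration, not rows of the Auto data set, and the names cars, d.raw, and d.std are hypothetical.

```r
# Made-up illustrative values (not taken from the Auto data set).
# Cars 1 and 2 have similar weights but are 12 model years apart;
# cars 1 and 3 share a model year but differ a lot in weight.
cars <- data.frame(weight = c(2000, 2050, 3500),
                   year   = c(70, 82, 70))

# Raw Euclidean distances: weight is measured in thousands of pounds,
# year in tens, so weight differences dominate the distance.
d.raw <- as.matrix(dist(cars))

# Distances after standardizing each column to mean 0 and sd 1,
# so both variables contribute on a comparable scale.
d.std <- as.matrix(dist(scale(cars)))

print(round(d.raw, 2))
print(round(d.std, 2))
```

In the raw distances, the 12-year gap between cars 1 and 2 barely registers next to the weight gap between cars 1 and 3; after standardizing, the two gaps are of similar size.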

Question 3 (1 point)

Create two new variables, weight.std and year.std, containing standardized
values of the weight and year.

Enter your R code below.
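
For reference, one common way to standardize a variable is to subtract its mean and divide by its standard deviation. The sketch below shows this on made-up toy vectors (assumed values, not the Auto data), once by hand and once with base R's scale().

```r
# Hypothetical toy values, not taken from the Auto data set.
weight <- c(2100, 2800, 3500, 4200)
year   <- c(70, 74, 78, 82)

# Manual standardization: subtract the mean, divide by the sample sd.
weight.std <- (weight - mean(weight)) / sd(weight)

# scale() performs the same centering and scaling; it returns a matrix
# with attributes, which as.numeric() drops to leave a plain vector.
year.std <- as.numeric(scale(year))

print(weight.std)
print(year.std)
```

Either form gives a variable with mean 0 and standard deviation 1.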


Question 4 (2 points)

Create a data frame or matrix containing your new variables, weight.std and
year.std. Use write.csv() to save the data frame or matrix to a file. We’ll use
these variables again in Homework 2.

Enter your R code below.
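
A minimal sketch of the write.csv() step, using made-up numbers in place of the real standardized values. The names std.data, out.file, and check are hypothetical, and tempfile() is used only so the sketch does not overwrite an existing file.

```r
# Hypothetical toy values standing in for the standardized variables.
std.data <- data.frame(weight.std = c(-1.2, 0.1, 1.1),
                       year.std   = c(-1.0, 0.0, 1.0))

# Write to a temporary file; row.names = FALSE avoids an extra
# unnamed column of row numbers in the CSV.
out.file <- tempfile(fileext = ".csv")
write.csv(std.data, out.file, row.names = FALSE)

# Read the file back to confirm that the columns survived the round trip.
check <- read.csv(out.file)
print(check)
```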

Question 5 (3 points)

Set R’s seed to 1 (for Homework 1) and use sample() to divide the data into:

- a training set of 256 observations (automobiles), and
- a validation set of 136 observations.

In addition, create two new variables, weight.train.std and year.train.std,
containing standardized values of the weight and year for the training data. Use
the same means and standard deviations (from the training data) to standardize
the validation data, creating two more variables, weight.valid.std and
year.valid.std.

Enter your R code below.
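
The split-then-standardize pattern described above can be sketched on a small made-up data frame. The 10-row toy data, the 6/4 split, and the helper names cars, train.idx, w.mean, w.sd, y.mean, and y.sd are assumptions for illustration; only the four .std variable names come from the question.

```r
# Hypothetical toy data standing in for the real data (10 rows, not 392).
cars <- data.frame(
  weight = c(1800, 2100, 2400, 2700, 3000, 3300, 3600, 3900, 4200, 4500),
  year   = c(70, 71, 72, 73, 75, 76, 78, 79, 81, 82)
)

set.seed(1)                           # make the random split reproducible
train.idx <- sample(nrow(cars), 6)    # row numbers of the training set
train <- cars[train.idx, ]
valid <- cars[-train.idx, ]           # remaining rows form the validation set

# Means and standard deviations are computed from the training rows only.
w.mean <- mean(train$weight); w.sd <- sd(train$weight)
y.mean <- mean(train$year);   y.sd <- sd(train$year)

weight.train.std <- (train$weight - w.mean) / w.sd
year.train.std   <- (train$year - y.mean) / y.sd

# The validation rows reuse the training means and sds, so their
# standardized values need not have mean 0 or sd 1.
weight.valid.std <- (valid$weight - w.mean) / w.sd
year.valid.std   <- (valid$year - y.mean) / y.sd
```

Reusing the training statistics keeps the validation set from influencing how the predictors are scaled.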

Question 6 (3 points)
Use 1-nearest neighbor regression (fit on the standardized training data) to predict
the gas mileage of the cars in the validation set. Compute the mean squared error.

Enter your R code below.

Question 7 (1 point)

What is the MSE for the validation set? (Round your answer to 2 decimal places.)

Question 8 (4 points)

Use a for() loop to apply K-nearest neighbors regression to the same training and
validation sets, for values of k from 1 to 20. Make a plot of the MSE as a function
of k.

Enter your R code (just the code, not the plot) below.
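
The loop structure can be sketched as follows. This is an illustration under stated assumptions: knn.predict is a hand-rolled, hypothetical helper written in base R so that no package is needed (the course may intend a package function instead), and the data are simulated rather than taken from the Auto data set.

```r
# Hypothetical helper (not a library function): K-nearest-neighbor regression.
# For each validation row, average the responses of the k closest training rows.
knn.predict <- function(train.x, valid.x, train.y, k) {
  train.x <- as.matrix(train.x)
  valid.x <- as.matrix(valid.x)
  sapply(seq_len(nrow(valid.x)), function(i) {
    # Squared Euclidean distance from validation row i to every training row.
    d2 <- rowSums(sweep(train.x, 2, valid.x[i, ])^2)
    mean(train.y[order(d2)[1:k]])
  })
}

# Simulated toy data, assumed for illustration only.
set.seed(1)
n <- 60
x <- matrix(rnorm(2 * n), ncol = 2)
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(n)
train.idx <- sample(n, 40)

# Validation MSE for each k from 1 to 20.
k.values <- 1:20
mse <- numeric(length(k.values))
for (k in k.values) {
  pred <- knn.predict(x[train.idx, ], x[-train.idx, ], y[train.idx], k)
  mse[k] <- mean((y[-train.idx] - pred)^2)
}

plot(k.values, mse, type = "b", xlab = "k", ylab = "Validation MSE")
```

Setting k = 1 in the same helper gives 1-nearest-neighbor regression, so the loop covers that case as its first iteration.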

Question 9 (2 points)

In your opinion, which value of k is the best choice? Why?

Problem 2: Classifying Income


In this problem, you will use K-nearest neighbors to classify people’s income as
>$50,000 or <=$50,000.

Download the data set Census_income.csv and read it into R.

Data Source: Kohavi, R. and B. Becker. (1996). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
Information and Computer Science.

Information
You are about to start Problem 2 of 2, which analyzes personal income using
the Census_income.csv data file. You can find more information in Homework 1:
Instructions.

Important: It may be helpful to clear your workspace prior to starting this
problem. This will prevent confusion when referencing the knn function.

Question 10 (2 points)

Create a new variable, Sex01, which equals 0 for males and 1 for females.

Caution: For this data set, R reads in the values of Sex with an extra
space in front of them: “ Male” and “ Female”. You will need to account for this
when creating the variable Sex01.

Enter your R code below.
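
One way to handle the leading space is to strip whitespace before comparing. The sketch below uses a made-up toy vector in place of the real Sex column; trimws() is a base R function, and comparing against " Female" directly (with the space) would be an equally valid alternative.

```r
# Hypothetical toy vector mimicking the leading space described above.
Sex <- c(" Male", " Female", " Female", " Male")

# trimws() strips leading and trailing whitespace before the comparison,
# so "Female" matches regardless of the extra space.
Sex01 <- ifelse(trimws(Sex) == "Female", 1, 0)

print(Sex01)
```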

Question 11 (4 points)

Set R’s seed to 1 again and randomly sample 20,000 individuals to be in the
training set.

Create two new variables, Educ.train.std and Age.train.std, which contain
standardized versions of the EducYears and Age variables for the training data.
Combine these variables, along with the training-set values of variable Sex01, into
a matrix or data frame train.X.std.

Use the same means and standard deviations (from the training data) to
standardize the validation data, creating two more
variables, Educ.valid.std and Age.valid.std. Combine these variables, along with
the validation-set values of variable Sex01, into a matrix or data
frame valid.X.std.

[Comment: this allows us to standardize the numeric variables EducYears and Age,
without standardizing the indicator variable Sex01.]

Enter your R code below.

Question 12 (2 points)

Use 25-nearest neighbor classification (fit on the training set) to predict whether
the income of each individual in the validation set is >50K or <=50K.

Find the confusion matrix. You should be able to produce a matrix table with two
rows and two columns, similar to the one below. Use the spaces below the table to
indicate what appears in each part of your matrix that corresponds to the
letters [A] through [D]. For example, if the matrix you create shows 5432 in the
cell that corresponds to [A] in the matrix below, you would enter "5432" in the
space next to "[A]".

Please enter the information exactly as it appears in R.


                     Actual income <= 50K    Actual income > 50K
Classified <= 50K    [A]                     [B]
Classified > 50K     [C]                     [D]

[A]

[B]

[C]

[D]

Question 13 (1 point)

What is the overall error rate on the validation set? Enter your answer as a decimal
between 0 and 1, rounded to 3 decimal places.

Question 14 (1 point)

What proportion of people making > $50,000 were misclassified? Enter your answer
as a decimal between 0 and 1, rounded to 3 decimal places.
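
To make the two error measures concrete, the sketch below builds a confusion matrix from made-up label vectors (eight hypothetical people, not the census data) and computes both rates from it. The names actual, pred, conf, error.rate, and miss.rate.high are assumptions for illustration.

```r
# Made-up labels for eight hypothetical people (not the census data).
lev <- c("<=50K", ">50K")
actual <- factor(c("<=50K", "<=50K", ">50K", ">50K",
                   "<=50K", ">50K", "<=50K", "<=50K"), levels = lev)
pred   <- factor(c("<=50K", "<=50K", "<=50K", ">50K",
                   "<=50K", ">50K", ">50K", "<=50K"), levels = lev)

# Rows are the classifications, columns are the actual incomes.
conf <- table(Classified = pred, Actual = actual)
print(conf)

# Overall error rate: share of all predictions that are wrong.
error.rate <- mean(pred != actual)

# Among people who actually make > 50K, the share classified as <= 50K.
miss.rate.high <- conf["<=50K", ">50K"] / sum(conf[, ">50K"])
```

The overall rate divides by everyone in the validation set, while the second rate divides only by the column total for the > 50K group, so the two can differ substantially.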

Information

Important: Don't forget to completely submit your work after you complete all
questions:

1. Save the answers to all your questions.
2. Click Go to Submit Quiz.
3. Click Submit Quiz.
4. When the confirmation window appears, click Yes, submit quiz.

Following these steps allows some questions to be graded immediately and others
passed on to your instructors.
