DM WK 1

Problem 1: Analyzing Gas Mileage
Information
You are about to start Problem 1 of 2, which analyzes gas mileage and uses the
ISLR library in R.
Question 1 (2 points)
Load the ISLR library into R and look at the first few rows of the Auto data set.
What data mining strategy would you use to investigate the following questions?
Question 1 options:
1. Regression
2. Classification
3. Unsupervised learning

You are building an app for a used-car website that will take information about the
year, engine displacement, and weight of cars, and determine whether they are most
likely American (origin = 1), European (2), or Japanese (3).

You are building an app for a used-car website that will take information about the
year, engine displacement, and weight of cars, and estimate their horsepower.

The manager of a used-car lot wants to arrange groups of similar cars on the lot. The
manager wants to understand the relationships between the year, engine displacement,
and weight of cars to identify informative groupings.
Note: It is recommended that you save your response as you complete each question.
Question 2
We would like to use K-nearest neighbors to predict the gas mileage (MPG) of cars
based on their weight (in pounds) and their year of manufacture. Explain why
standardizing the data is a good idea. Comment on observed features of the data
and possible consequences.
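One way to see the effect of scale on distances is a small numeric sketch. The four cars below are made-up values (hypothetical, not rows of the Auto data set). On the raw scale, a 50-pound weight gap counts as "farther" than a 12-year gap; after standardizing, the ordering reverses.

```r
# Four hypothetical cars (made-up values, not taken from Auto)
cars <- data.frame(weight = c(2000, 2050, 2000, 3500),
                   year   = c(70, 70, 82, 76))

# Euclidean distances on the raw scale: weight differences dominate
raw.dist <- as.matrix(dist(cars))

# Standardize each column to mean 0 and sd 1, then recompute the distances
std.dist <- as.matrix(dist(scale(cars)))

raw.dist[1, 2]  # car 1 vs car 2: 50 lb apart, same year
raw.dist[1, 3]  # car 1 vs car 3: same weight, 12 years apart
std.dist[1, 2]
std.dist[1, 3]
```

Because K-nearest neighbors chooses neighbors by distance, the variable with the larger numeric spread would otherwise decide almost entirely which cars count as "near".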
Question 3 (1 point)
Create two new variables, weight.std and year.std, containing standardized
values of weight and year.
Enter your R code below.
Question 4
Create a data frame or matrix containing your new variables, weight.std and
year.std. Use write.csv() to save the data frame or matrix to a file. We’ll use
these variables again in Homework 2.
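A minimal sketch of the standardize-and-save step, under stated assumptions: `auto.toy` is a made-up stand-in for the Auto data (five hypothetical cars), and the output path is a temporary file chosen only so the sketch has no side effects. With the real data, the same expressions would be applied to the weight and year columns of Auto.

```r
# Stand-in for ISLR's Auto data: five made-up cars (hypothetical values)
auto.toy <- data.frame(weight = c(3504, 2130, 2790, 4341, 1985),
                       year   = c(70, 72, 76, 79, 82))

# Standardize as (x - mean) / sd, written out so the formula is visible
weight.std <- (auto.toy$weight - mean(auto.toy$weight)) / sd(auto.toy$weight)
year.std   <- (auto.toy$year - mean(auto.toy$year)) / sd(auto.toy$year)

# Collect the new variables in a data frame
std.vars <- data.frame(weight.std, year.std)

# Hypothetical output location; replace with a real file name in practice
out.file <- tempfile(fileext = ".csv")
write.csv(std.vars, out.file, row.names = FALSE)

# Read the file back to confirm the round trip
check <- read.csv(out.file)
```

Setting `row.names = FALSE` keeps write.csv() from adding an extra column of row labels, which makes the file easier to read back later.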
Question 5
Set R’s seed to 1 (for Homework 1) and use sample() to divide the data into:
• a training set of 256 observations (automobiles), and
• a validation set of 136 observations.
In addition, create two new variables, weight.train.std and year.train.std,
containing standardized values of the weight and year for the training data. Use
the same means and standard deviations (from the training data) to standardize
the validation data, creating two more variables, weight.valid.std and
year.valid.std.
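The split and the training-based standardization might be sketched as follows. `auto.sim` is simulated data standing in for Auto (392 rows of hypothetical values), and the helper names `w.mean`, `w.sd`, `y.mean`, and `y.sd` are illustrative choices, not names from the assignment.

```r
# Simulated stand-in for Auto: 392 rows of hypothetical values
set.seed(1)
n <- 392
auto.sim <- data.frame(weight = round(runif(n, 1600, 5200)),
                       year   = sample(70:82, n, replace = TRUE))

# Set the seed immediately before sampling so the split is reproducible
set.seed(1)
train.idx <- sample(n, 256)                   # row numbers of the training set
valid.idx <- setdiff(seq_len(n), train.idx)   # the remaining 136 rows

# Means and standard deviations are computed from the training rows only
w.mean <- mean(auto.sim$weight[train.idx])
w.sd   <- sd(auto.sim$weight[train.idx])
y.mean <- mean(auto.sim$year[train.idx])
y.sd   <- sd(auto.sim$year[train.idx])

weight.train.std <- (auto.sim$weight[train.idx] - w.mean) / w.sd
year.train.std   <- (auto.sim$year[train.idx] - y.mean) / y.sd

# The validation rows reuse the training means and standard deviations
weight.valid.std <- (auto.sim$weight[valid.idx] - w.mean) / w.sd
year.valid.std   <- (auto.sim$year[valid.idx] - y.mean) / y.sd
```

Reusing the training statistics means the validation variables will generally not have mean exactly 0 or standard deviation exactly 1, which is expected.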
Question 6
Use 1-nearest neighbor regression (fit on the standardized training data) to predict
the gas mileage of the cars in the validation set. Compute the mean squared error.
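The course's own KNN function is not shown in this document, so the sketch below swaps in a hand-written base-R version to make the mechanics explicit. The function name `knn.reg.sketch` and all numbers are hypothetical; the toy data are made up and are not drawn from Auto.

```r
# Hand-rolled KNN regression: for each test row, average the responses of the
# k training rows closest in Euclidean distance (ties broken by row order)
knn.reg.sketch <- function(train.X, test.X, train.y, k = 1) {
  train.X <- as.matrix(train.X)
  test.X  <- as.matrix(test.X)
  apply(test.X, 1, function(x) {
    d <- sqrt(colSums((t(train.X) - x)^2))  # distance from x to every training row
    mean(train.y[order(d)[1:k]])
  })
}

# Toy standardized predictors and responses (made-up numbers)
train.X <- cbind(weight.std = c(-1, 0, 1), year.std = c(-1, 0, 1))
train.y <- c(30, 22, 14)
valid.X <- cbind(weight.std = c(-0.9, 0.8), year.std = c(-1.1, 0.9))
valid.y <- c(28, 15)

# 1-nearest neighbor predictions and validation mean squared error
pred <- knn.reg.sketch(train.X, valid.X, train.y, k = 1)
mse  <- mean((valid.y - pred)^2)
mse  # 2.5 for these toy numbers
```

With k = 1 each prediction is simply the response of the single closest training row, so the MSE here is ((28 - 30)^2 + (15 - 14)^2) / 2.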
Question 7
What is the MSE for the validation set? (Round your answer to 2 decimal places.)
Question 8
Use a for() loop to apply K-nearest neighbors regression to the same training and
validation sets, for values of k from 1 to 20. Make a plot of the MSE as a function
of k.
Enter your R code (just the code, not the plot) below.
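A sketch of the loop structure, assuming a base-R KNN helper rather than any particular package function. The helper name `knn.predict` and the simulated data are hypothetical; only the loop-and-plot pattern is the point.

```r
# Compact KNN regression helper in base R (hypothetical name)
knn.predict <- function(train.X, test.X, train.y, k) {
  apply(as.matrix(test.X), 1, function(x) {
    d <- colSums((t(as.matrix(train.X)) - x)^2)  # squared Euclidean distances
    mean(train.y[order(d)[1:k]])                 # average of the k nearest responses
  })
}

# Simulated standardized data (hypothetical, not the Auto data)
set.seed(1)
train.X <- matrix(rnorm(256 * 2), ncol = 2)
valid.X <- matrix(rnorm(136 * 2), ncol = 2)
train.y <- 23 - 5 * train.X[, 1] + 2 * train.X[, 2] + rnorm(256)
valid.y <- 23 - 5 * valid.X[, 1] + 2 * valid.X[, 2] + rnorm(136)

# Validation MSE for each k from 1 to 20
k.values <- 1:20
mse <- numeric(length(k.values))
for (k in k.values) {
  pred   <- knn.predict(train.X, valid.X, train.y, k)
  mse[k] <- mean((valid.y - pred)^2)
}

plot(k.values, mse, type = "b", xlab = "k", ylab = "Validation MSE")
k.values[which.min(mse)]  # the k with the smallest validation MSE
```

Pre-allocating `mse` with numeric() avoids growing the vector inside the loop, and `type = "b"` draws both points and connecting lines so the trend across k is easy to read.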
Question 9
In your opinion, which value of k is the best choice? Why?
Problem 2: Classifying Income

In this problem, you will use K-nearest neighbors to classify people’s income as
>$50,000 or <=$50,000.
Download the data set Census_income.csv and read it into R.
Data Source: Kohavi, R and B. Becker. (1996). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of
Information and Computer Science.
Information
You are about to start Problem 2 of 2, which analyzes personal income using
the Census_income.csv data file. You can find more information in Homework 1:
Instructions.
Important: It may be helpful to clear your workspace prior to starting this
problem. This will prevent confusion when referencing the knn function.
Question 10
Create a new variable, Sex01, which equals 0 for males and 1 for females.
Caution: For this data set, R reads in the values of Sex with an extra
space in front of them: “ Male” and “ Female”. You will need to account for this
when creating the variable Sex01.
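One possible way to handle the leading space is to trim it before comparing. The data frame `income.toy` below is a made-up stand-in for the census data, with four hypothetical rows.

```r
# Hypothetical rows mimicking how Sex is read in, with a leading space
income.toy <- data.frame(Sex = c(" Male", " Female", " Female", " Male"),
                         stringsAsFactors = FALSE)

# trimws() strips the surrounding whitespace before the comparison
income.toy$Sex01 <- ifelse(trimws(income.toy$Sex) == "Female", 1, 0)

income.toy
```

Comparing directly against " Female" (space included) would also work; trimming is simply more robust if the spacing varies between files.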
Question 11
Set R’s seed to 1 again and randomly sample 20,000 individuals to be in the
training set.
Create two new variables, Educ.train.std and Age.train.std, which contain
standardized versions of the EducYears and Age variables for the training data.
Combine these variables, along with the training-set values of variable Sex01, into
a matrix or data frame train.X.std.
Use the same means and standard deviations (from the training data) to
standardize the validation data, creating two more
variables, Educ.valid.std and Age.valid.std. Combine these variables, along with
the validation-set values of variable Sex01, into a matrix or data
frame valid.X.std.
[Comment: this allows us to standardize the numeric variables EducYears and Age,
without standardizing the indicator variable Sex01.]
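As an alternative to computing means and standard deviations by hand, scale() records the values it used as attributes, which can then be reused on the validation rows. The sketch below assumes a small simulated data set (`income.sim`, 50 hypothetical rows with a 30/20 split) rather than the real 20,000-row training sample.

```r
# Simulated stand-in for the census data (hypothetical values)
set.seed(1)
n <- 50
income.sim <- data.frame(EducYears = sample(8:16, n, replace = TRUE),
                         Age       = sample(18:70, n, replace = TRUE),
                         Sex01     = rbinom(n, 1, 0.5))
train.idx <- sample(n, 30)

# scale() stores the column means and sds it used as attributes
train.num <- scale(income.sim[train.idx, c("EducYears", "Age")])
ctr <- attr(train.num, "scaled:center")
scl <- attr(train.num, "scaled:scale")

# Apply the training means and sds to the validation rows
valid.num <- scale(income.sim[-train.idx, c("EducYears", "Age")],
                   center = ctr, scale = scl)

# Attach the 0/1 indicator unchanged; it is not standardized
train.X.std <- data.frame(Educ.train.std = train.num[, "EducYears"],
                          Age.train.std  = train.num[, "Age"],
                          Sex01          = income.sim$Sex01[train.idx])
valid.X.std <- data.frame(Educ.valid.std = valid.num[, "EducYears"],
                          Age.valid.std  = valid.num[, "Age"],
                          Sex01          = income.sim$Sex01[-train.idx])
```

Keeping Sex01 on its original 0/1 scale matches the comment in the question: only the numeric variables are rescaled.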
Question 12
Use 25-nearest neighbor classification (fit on the training set) to predict whether
the income of each individual in the validation set is >50K or <=50K.
Find the confusion matrix. You should be able to produce a matrix table with two
rows and two columns, similar to the one below. Use the spaces below the table to
indicate what appears in each part of your matrix that corresponds to the
letters [A] through [D]. For example, if the matrix you create shows 5432 in the
cell that corresponds to [A] in the matrix below, you would enter "5432" in the
space next to "[A]".
Please enter the information exactly as it appears in R.

                     Actual income <= 50K    Actual income > 50K
Classified <= 50K    [A]                     [B]
Classified > 50K     [C]                     [D]
[A]
[B]
[C]
[D]
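A confusion matrix in this layout can be produced with table(), placing the classified labels in the rows and the actual labels in the columns. The labels below are made up (six hypothetical individuals) and are not results from the census data; the exact label strings in the real file may differ.

```r
# Made-up labels for six hypothetical individuals
lev       <- c("<=50K", ">50K")
actual    <- factor(c("<=50K", "<=50K", ">50K", ">50K",  "<=50K", ">50K"), levels = lev)
predicted <- factor(c("<=50K", ">50K",  ">50K", "<=50K", "<=50K", ">50K"), levels = lev)

# Rows = classified label, columns = actual label
conf.mat <- table(Classified = predicted, Actual = actual)
conf.mat
```

Fixing the factor levels explicitly keeps the row and column order stable, so the cell that corresponds to [A] is always the top-left entry.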
Question 13
What is the overall error rate on the validation set? Enter your answer as a decimal
between 0 and 1, rounded to 3 decimal places.
Question 14
What proportion of people making > $50,000 were misclassified? Enter your answer
as a decimal between 0 and 1, rounded to 3 decimal places.
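Both rates can be read off a confusion matrix. The counts below are made up for illustration (they are not the answer to this problem), with rows as the classified label and columns as the actual label.

```r
# Hypothetical confusion matrix: rows = classified, columns = actual
conf.mat <- matrix(c(80,  5,
                     10, 25),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Classified = c("<=50K", ">50K"),
                                   Actual     = c("<=50K", ">50K")))

# Overall error rate: off-diagonal counts divided by the total
overall.error <- (conf.mat["<=50K", ">50K"] + conf.mat[">50K", "<=50K"]) / sum(conf.mat)

# Proportion of actual >50K individuals who were classified as <=50K
high.miss <- conf.mat["<=50K", ">50K"] / sum(conf.mat[, ">50K"])

round(overall.error, 3)  # 0.125 for these made-up counts
round(high.miss, 3)      # 0.167 for these made-up counts
```

Note that the second rate divides by the column total for actual >50K, not by the overall total, so the two rates can differ substantially when the classes are unbalanced.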
Information
Important: Don't forget to completely submit your work after you complete all
questions:
1. Save the answers to all your questions.
2. Click Go to Submit Quiz.
3. Click Submit Quiz.
4. When the confirmation window appears, click Yes, submit quiz.
Following these steps allows some questions to be graded immediately and others
passed on to your instructors.