
STATS 216 Introduction to Statistical Learning

Stanford University, Winter 2018

Practice Question solutions


Duration: 1 hour

Sample Instructions: (This is a practice midterm and will not be graded.)

• Do not look at the exam questions until instructed to do so.

• Remember the university honor code.

• Write your name and SUNet ID (ThisIsYourSUNetID@stanford.edu) on each page.

• All questions are of equal value and are meant to elicit fairly short answers. All answers
should be written in the space provided between questions.

• No one is expected to answer all of the questions, but everyone is encouraged to try
all of them.

• You may not access the internet during the exam.

• You may refer to your course textbook and notes, and you may use your laptop provided
that internet access is disabled.

• Please write neatly.


1. You have a regression problem with n = 100 observations and p = 2000 features,
and you are told by your collaborator that most of the features are likely to be
uninformative (but he doesn't know which features are informative). Which method(s)
are likely to work best on this data? Give reasons.
Since it is known that most features are uninformative for the regression problem, I
would choose a variable selection method designed to automatically exclude irrelevant
variables. Lasso regression and forward stepwise regression are two strong candidates
in this class. Such methods would be preferable to least squares (which does not even
have a uniquely defined solution when p > n) and ridge regression, which typically
includes all predictors in the model, even uninformative ones.
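
For concreteness, here is a small Python sketch (illustrative only; the simulated data, the sparsity level, and the choice of scikit-learn's LassoCV are assumptions, not part of the question) showing the lasso zeroing out most coefficients in this n = 100, p = 2000 regime:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                       # only 5 of the 2000 features carry signal
y = X @ beta + rng.normal(size=n)

fit = LassoCV(cv=5).fit(X, y)        # cross-validation chooses the penalty
print("nonzero coefficients:", np.count_nonzero(fit.coef_))
```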

2. Suppose that we have p-dimensional data X generated from two normal distributions
labelled Y = 1 and Y = 2, with means µ1 , µ2 and common covariance matrix Σ.
Assume the prior probabilities are each 0.5. Derive an expression for the posterior
probabilities P (Y = j|X = x). Relate your answer to logistic regression.
We can use an expression very similar to (4.12) in the textbook, except we use the
multivariate version of (4.11) as given in (4.18). Plugging this in, and cancelling equal
terms, we get

\[
P(Y = j \mid X = x) =
\frac{\exp\left(-\frac{1}{2}(x - \mu_j)^T \Sigma^{-1} (x - \mu_j)\right)}
{\exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right) + \exp\left(-\frac{1}{2}(x - \mu_2)^T \Sigma^{-1} (x - \mu_2)\right)}.
\]
If we take the log-odds of this expression for j = 1 vs j = 2, we get

\[
\begin{aligned}
\log \frac{P(Y = 1 \mid X = x)}{P(Y = 2 \mid X = x)}
&= -\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_2)^T \Sigma^{-1} (x - \mu_2) \\
&= x^T \Sigma^{-1} (\mu_1 - \mu_2) - \frac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) \\
&= x^T \beta + \beta_0.
\end{aligned}
\]

So the log-odds are linear in x, which is exactly the form of a logistic regression model, with β = Σ⁻¹(µ1 − µ2) and β0 = −½(µ1 + µ2)ᵀΣ⁻¹(µ1 − µ2).
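
As a quick numerical check of this identity (an illustrative sketch, assuming numpy and scipy; the simulated µ1, µ2 and Σ are arbitrary), the Bayes posterior and the logistic form agree:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
p = 3
mu1, mu2 = rng.normal(size=p), rng.normal(size=p)
A = rng.normal(size=(p, p))
Sigma = A @ A.T + np.eye(p)          # an arbitrary valid covariance matrix
x = rng.normal(size=p)

# Posterior by Bayes' rule with equal priors
f1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
f2 = multivariate_normal.pdf(x, mean=mu2, cov=Sigma)
bayes = f1 / (f1 + f2)

# Posterior via the linear log-odds x^T beta + beta0 derived above
Sinv = np.linalg.inv(Sigma)
beta = Sinv @ (mu1 - mu2)
beta0 = -0.5 * (mu1 + mu2) @ Sinv @ (mu1 - mu2)
logistic = 1.0 / (1.0 + np.exp(-(x @ beta + beta0)))

print(bayes, logistic)               # the two values agree
```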


3. You are given some data by a collaborator, and asked to build a two-class classifier
with n = 1000 observations and p = 500 features, to predict the risk of a customer
defaulting on a loan. Unfortunately about 25% of the features are missing at random
(and not the same 25% each time). The result is that nearly every observation has
some missing features. How would you deal with this?
For each predictor I would replace the missing values by the mean or median of the
non-missing values. An observation whose value for a variable is imputed at the mean
has essentially no influence on that variable's coefficient. I would then fit a
regularized method such as logistic regression with an ℓ1 or ℓ2 penalty, since
p = 500 is large relative to n = 1000.
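
A minimal sketch of this approach (illustrative; it assumes scikit-learn, and the simulated data, labels, and penalty strength C are placeholders):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 500))
X[rng.random(size=X.shape) < 0.25] = np.nan   # ~25% of entries missing at random
y = rng.integers(0, 2, size=1000)             # placeholder default labels

# Mean-impute each column, then fit an l2-penalized logistic regression.
clf = make_pipeline(SimpleImputer(strategy="mean"),
                    LogisticRegression(penalty="l2", C=0.1, max_iter=1000))
clf.fit(X, y)
```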

4. In the same setting as the previous question, you later learn that some of the features
like monthly income are not missing at random but are more likely to be missing
because the mortgage company has lost track of the customer. How would you deal
with this issue?
Here, mean imputation alone may not suffice, because the missingness itself appears
to be informative. I would create a dummy variable indicating whether each feature's
value is missing and include it as an additional predictor in the model; in addition,
I would still replace the missing values for a given feature by the mean of the
non-missing values.
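
In scikit-learn this combination can be expressed directly, since SimpleImputer has an add_indicator option that appends the missingness dummies (a sketch under the same assumptions as above):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# add_indicator=True appends a binary missing/not-missing column for every
# feature that has missing entries, so informative missingness can carry its
# own coefficient alongside the mean-imputed value.
clf = make_pipeline(SimpleImputer(strategy="mean", add_indicator=True),
                    LogisticRegression(penalty="l2", C=0.1, max_iter=1000))
# clf.fit(X, y) exactly as in the previous sketch.
```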

5. Suppose we run a forward stepwise linear regression procedure on a set of 12
predictor variables. We see that variable 3 enters first because it causes the
biggest drop in RSS (over the mean). After adding (one-by-one) the next 5 variables,
we pause to see which variable, if dropped, would increase the RSS the least. Could
this be variable 3?
Yes. Variable 3 might be redundant at this stage: even though it was initially the
best single representative of the entire team of predictors, the five variables added
later may now jointly carry the information it contributed, so dropping it could
increase the RSS the least.
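
A small simulation makes this concrete (an illustrative sketch with made-up data, assuming numpy): construct x3 as nearly the sum of x1 and x2, so that x3 wins the first step but becomes redundant once both x1 and x2 enter:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)   # x3 is almost exactly x1 + x2
y = x1 + x2 + rng.normal(size=n)

def rss(cols):
    # Residual sum of squares of the least-squares fit on the given columns
    X = np.column_stack([np.ones(n)] + cols)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid

# Alone, x3 proxies for x1 + x2, so it gives the biggest first-step RSS drop...
print(rss([x1]), rss([x2]), rss([x3]))
# ...but with x1 and x2 in the model, dropping x3 barely increases the RSS.
print(rss([x1, x2]), rss([x1, x2, x3]))
```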
