• All questions are of equal value and are meant to elicit fairly short answers. All answers
should be written in the space provided between questions.
• No one is expected to answer all of the questions, but everyone is encouraged to try
all of them.
• You may refer to your course textbook and notes, and you may use your laptop provided
that internet access is disabled.
2. Suppose that we have p-dimensional data X generated from two normal distributions
labelled Y = 1 and Y = 2, with means µ1, µ2 and common covariance matrix Σ.
Assume the prior probabilities are each 0.5. Derive an expression for the posterior
probabilities P(Y = j | X = x). Relate your answer to logistic regression.
We can use an expression very similar to (4.12) in the textbook, except we use the
multivariate version of (4.11) as given in (4.18). Plugging this in, and cancelling equal
terms, we get
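$$
P(Y = j \mid X = x) = \frac{\exp\left\{-\tfrac{1}{2}(x - \mu_j)^T \Sigma^{-1} (x - \mu_j)\right\}}{\exp\left\{-\tfrac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right\} + \exp\left\{-\tfrac{1}{2}(x - \mu_2)^T \Sigma^{-1} (x - \mu_2)\right\}}.
$$
Taking the log-odds, the terms quadratic in x cancel:
$$
\begin{aligned}
\log \frac{P(Y = 1 \mid X = x)}{P(Y = 2 \mid X = x)}
  &= -\tfrac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \tfrac{1}{2}(x - \mu_2)^T \Sigma^{-1} (x - \mu_2) \\
  &= x^T \Sigma^{-1} (\mu_1 - \mu_2) - \tfrac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) \\
  &= x^T \beta + \beta_0,
\end{aligned}
$$
where $\beta = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\beta_0 = -\tfrac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)$. The log-odds are
linear in x, which is the same form that logistic regression assumes.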
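As a numerical check of the derivation above, the following minimal sketch computes the posterior in two ways: directly from the ratio of Gaussian kernels, and through the logistic form with the linear log-odds. It is restricted to p = 2 so that it runs with only the standard library; the helper names (`solve_2x2`, `posterior_direct`, `posterior_logistic`) and the numerical values of the means and covariance are illustrative assumptions, not part of the exam.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def solve_2x2(S, v):
    """Return S^{-1} v for a 2x2 matrix S given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]

def quad_form(x, mu, S):
    """Compute (x - mu)^T S^{-1} (x - mu)."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return dot(diff, solve_2x2(S, diff))

def posterior_direct(x, mu1, mu2, S):
    """P(Y = 1 | X = x) as a ratio of Gaussian kernels (equal priors)."""
    e1 = math.exp(-0.5 * quad_form(x, mu1, S))
    e2 = math.exp(-0.5 * quad_form(x, mu2, S))
    return e1 / (e1 + e2)

def posterior_logistic(x, mu1, mu2, S):
    """P(Y = 1 | X = x) via the linear log-odds x^T beta + beta0."""
    beta = solve_2x2(S, [a - b for a, b in zip(mu1, mu2)])
    beta0 = -0.5 * dot([a + b for a, b in zip(mu1, mu2)], beta)
    return 1.0 / (1.0 + math.exp(-(dot(x, beta) + beta0)))

# Illustrative (assumed) parameters and query point.
mu1 = [1.0, 0.0]
mu2 = [-1.0, 2.0]
Sigma = [[2.0, 0.5], [0.5, 1.0]]
x = [0.3, 0.7]

print(posterior_direct(x, mu1, mu2, Sigma))
print(posterior_logistic(x, mu1, mu2, Sigma))
```

The two printed values should agree up to floating-point rounding. At the midpoint between the two means the log-odds are zero, so both functions give a posterior of 0.5.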
4. In the same setting as the previous question, you later learn that some of the features,
such as monthly income, are not missing at random but are more likely to be missing
because the mortgage company has lost track of the customer. How would you deal
with this issue?
Here, mean imputation alone may not suffice, because the missingness itself appears to be
informative. For each affected feature, I would create a dummy variable indicating whether
the value is missing and include it as an additional predictor in the model; in addition, I
would replace the missing values of that feature by the mean of its non-missing values.
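The approach described in this answer could be sketched as follows for a single feature. This is only an illustration under stated assumptions: missing values are encoded as `None`, and the function name `impute_with_indicator` and the income figures are hypothetical.

```python
def impute_with_indicator(values):
    """Mean-impute one feature and return a missingness indicator.

    `values` is a list of numbers in which None marks a missing entry
    (an assumed encoding). Returns (imputed_values, missing_flags).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags

# Hypothetical monthly incomes; None marks customers with no recorded value.
income = [3000.0, None, 5000.0, None, 4000.0]
imputed, flags = impute_with_indicator(income)
print(imputed)  # [3000.0, 4000.0, 5000.0, 4000.0, 4000.0]
print(flags)    # [0, 1, 0, 1, 0]
```

Both columns would then enter the model, so that the coefficient on the indicator can absorb any systematic difference between customers with and without a recorded value.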