
Introduction to Statistical Learning with R

James, Witten, Hastie, Tibshirani


Notes by Anuar Yeraliyev
September 2015

2.1. What is Statistical Learning?


X_1, X_2, X_3, ... - input variables, predictors, features
Y - output variable, response
\hat{Y} - prediction by the model; \hat{f} - the estimate of f
Y = f(X) + \varepsilon
where \varepsilon is the irreducible error, which cannot be inferred from the features.
E(Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\varepsilon)
The left-hand side is the expected (squared) prediction error; the first term on the right is the reducible error and \mathrm{Var}(\varepsilon) is the irreducible error.
Irreducible error always provides an upper bound on the accuracy of your model.
Questions to ask when constructing a model:
Which predictors are associated with the response?
What is the relationship between the response and each predictor?
Can the relationship between Y and each predictor be adequately summarized using a
linear equation, or is the relationship more complicated?
Let i represent a data point (observation) and j be a predictor

Parametric Methods
Parametric methods involve a two-step model-based approach.
1. Choose/assume a functional form for f(X).
2. Fit/train the model to the data.

Non-parametric methods
We make no explicit assumption about the functional form of f(X).

The Trade-Off Between Prediction Accuracy and Model Interpretability


Very flexible models (the ones that can fit pretty much anything) are very difficult to interpret. So we sometimes choose less flexible methods, such as linear regression, to learn which variables influence Y the most.
On the other hand, even when we do not need interpretation, we might still choose a less flexible method because it can provide better accuracy due to less overfitting.

Supervised Versus Unsupervised Learning


Unsupervised learning does not have any output variable Y; it involves finding patterns in the data.
Semi-supervised learning problems - problems where responses are available for only part of the dataset.

Problems with a qualitative (categorical) response are called classification problems; those with quantitative responses are called regression problems.

Measuring the Quality of Fit


Mean Squared Error, aka the loss function:
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2

We are not that interested in minimizing the training MSE; we mostly care about accurate prediction, i.e. minimizing the test MSE. Even though we minimize the training MSE, and the training MSE is correlated with the test MSE, the test MSE will still be larger.
See page 31 of ISLR for graphs.
Flexibility = how closely the model can fit the data = degrees of freedom.
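As a rough illustration, here is a minimal R sketch (on simulated data, so all names and numbers are made up) that fits polynomials of increasing flexibility and compares training and test MSE; the training MSE keeps falling as the degree grows, while the test MSE eventually rises again.

  set.seed(1)
  n <- 100
  x <- runif(n, 0, 10)
  y <- sin(x) + rnorm(n, sd = 0.3)          # true f(X) = sin(X) plus irreducible error
  dat <- data.frame(x = x, y = y)
  train <- sample(n, n / 2)
  for (d in c(1, 3, 9)) {                   # polynomial degree plays the role of flexibility
    fit <- lm(y ~ poly(x, d), data = dat[train, ])
    cat("degree", d,
        "train MSE", round(mean((dat$y[train] - predict(fit, dat[train, ]))^2), 3),
        "test MSE",  round(mean((dat$y[-train] - predict(fit, dat[-train, ]))^2), 3), "\n")
  }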

The Bias-Variance Trade-Off


Test\ MSE = E(y_0 - \hat{f}(x_0))^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\varepsilon)

We need to choose a method that provides low Variance AND Bias.


If changing one of the points in the training set changes the fit significantly, the method has high variance. More flexible methods have higher variance: variance is proportional to flexibility, and low flexibility means high bias. A linear fit has high bias and low variance. Bias is similar to a systematic error.

The Classification Setting


Training error rate = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)
I(y_i \neq \hat{y}_i) is an indicator variable: 1 when the prediction is wrong and 0 when y_i = \hat{y}_i. The equation computes the fraction of incorrect classifications.

Bayes Method
The Bayes classifier assigns each observation to the class with the largest conditional probability (for two classes, above or below 50%).
Bayes decision boundary - the curve along which the class probabilities are exactly 50%.
Bayes error rate:
1 - E(\max_j P(Y = j \mid X))
The Bayes classifier is considered the gold standard in classification, having the smallest possible error rate.

K-Nearest Neighbours
For each observation, KNN finds the K nearest training points and assigns the class with the largest fraction among those neighbours. A low K gives high flexibility (it can overfit perfectly); a high K gives low flexibility. Choose K so that the variance is not too high.
Plot the error rate against 1/K (which represents flexibility); the test error should have the usual U-shape.
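A minimal sketch of KNN classification in R, assuming the class package is installed; the data is simulated, and the error rate is just the fraction of misclassified test points.

  library(class)                            # assumed installed; provides knn()
  set.seed(1)
  n <- 200
  x <- matrix(rnorm(n * 2), n, 2)           # two simulated predictors
  y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, "A", "B"))
  train <- sample(n, n / 2)
  for (K in c(1, 10, 50)) {
    pred <- knn(train = x[train, ], test = x[-train, ], cl = y[train], k = K)
    cat("K =", K, " test error rate:", mean(pred != y[-train]), "\n")
  }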

3. Linear Regression
Residual: e_i = y_i - \hat{y}_i
Residual Sum of Squares:
RSS = e_1^2 + e_2^2 + e_3^2 + ... + e_n^2
Least squares (and gradient descent) minimizes the RSS. The population regression line is the true, best possible prediction from the features; it is not known.
If we trained many regression lines on different samples, their average would coincide with the population regression line; the estimate from a single sample is unbiased. So how far off is our regression line from the population regression line? We can answer this with a standard error (variance), by analogy with the sample mean:
\mathrm{Var}(\hat{\mu}) = SE(\hat{\mu})^2 = \sigma^2 / n
In general we do not know \sigma, which is needed for the SE (variance) of \hat{\beta}_0 and \hat{\beta}_1, but we can estimate it with the Residual Standard Error:
RSE = \sqrt{RSS / (n - 2)}
RSE measures the lack of fit in the units of Y.
We can use the SE to compute an approximate 95% confidence interval, \hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1).
Null hypothesis - there is no relationship. To decide whether the data favour the alternative hypothesis or fail to reject the null hypothesis, we compute a t-statistic, which measures how many standard errors \hat{\beta}_1 lies away from 0:
t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}
SE(\hat{\beta}_1) is estimated from the residuals of the fit (via the RSE).
We compute the t-statistic with respect to zero because a coefficient close to zero (with a relatively large SE) gives a small t-statistic, meaning that the feature is not related to the response.
From the t-statistic we obtain a p-value; if the p-value is small (below 5% or 1%, corresponding to |t| of roughly 2 and 2.75), we conclude that there is a relationship between the predictor and the response.
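A minimal sketch in base R (simulated data): summary() of an lm fit reports each coefficient's estimate, standard error, t-statistic and p-value, and confint() gives the roughly estimate +/- 2*SE intervals discussed above.

  set.seed(1)
  x <- rnorm(100)
  y <- 2 + 3 * x + rnorm(100)               # true beta_0 = 2, beta_1 = 3
  fit <- lm(y ~ x)
  summary(fit)                              # t value = Estimate / Std. Error
  confint(fit, level = 0.95)                # approx. estimate +/- 2 * SE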

R2 Statistic
R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}
Like the RSE, R^2 measures how well the predictors explain the response, but it does so in relative terms, on a scale from 0 to 1.
TSS = Total Sum of Squares = total variability of the data
RSS = Residual Sum of Squares = variability that is not explained by the trained model
TSS - RSS = amount of variability explained by the model
R^2 = proportion of the variability in Y that is explained using X
r = correlation; for simple linear regression, R^2 = r^2
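Continuing the simple-regression sketch above, R^2 can be computed by hand from TSS and RSS; it matches summary(fit)$r.squared and the squared correlation.

  rss <- sum(residuals(fit)^2)              # residual sum of squares
  tss <- sum((y - mean(y))^2)               # total sum of squares
  1 - rss / tss                             # R^2, equals summary(fit)$r.squared
  cor(x, y)^2                               # for simple regression, R^2 = r^2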

Multiple Linear Regression


Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \varepsilon

Questions to ask yourself:


Do all the predictors influence the outcome Y, or only some of them?
Given a set of predictor values, what response do we predict, and how accurate is that prediction?
1. Is there a relationship between the response and the predictors? This is assessed by computing the F-statistic:
F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}
where p is the number of predictors and n the number of observations.
A large F-statistic says that at least one of the features is related to the output. If n is large, an F-statistic only slightly above 1 can already reject the null hypothesis, while for smaller n a larger F is needed.
Computing only a t-statistic and p-value for each individual feature is flawed when there are many features, because some will correlate with the output purely by chance. Look at the F-statistic together with the individual p-values: the F-statistic guards against such chance correlations, while the p-values indicate which individual predictors are significant (see the R sketch after this list).
2. Deciding on Important Variables
There are 2^p possible models for p features.
Variable Selection:
Forward Selection - start with no variables and add, one at a time, the feature that gives the lowest RSS. This method can include variables early that later become redundant.
Backward Selection - start with all the variables and delete, one at a time, the variable with the largest p-value, until a stopping rule is reached. Cannot be used if p > n.
Mixed Selection - start with no variables, as in forward selection, and keep adding them. If the p-value of a variable already in the model rises above a certain threshold, delete that variable. Continue until all variables in the model have sufficiently low p-values.
3. Model Fit
The most common numerical measures of model fit are the RSE and R^2.
Adding variables that are only weakly associated with the response will still increase R^2, although only by a small amount.
4. Predictions
Reducible error - quantify how close \hat{f} is to the true population regression, e.g. with 95% confidence intervals.
Model bias - may require choosing a different model (learning technique).
Irreducible error - cannot be removed; prediction intervals account for it.
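A minimal sketch of multiple linear regression on simulated data (all variable names are illustrative): summary() reports the overall F-statistic alongside the individual t-statistics and p-values.

  set.seed(1)
  n <- 100
  X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
  X$y <- 1 + 2 * X$x1 - X$x2 + rnorm(n)     # x3 is deliberately unrelated to y
  fit <- lm(y ~ x1 + x2 + x3, data = X)
  summary(fit)     # F-statistic tests H0: all slopes are zero; x3 should get a large p-value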

Other Considerations in the Regression Model


Qualitative (two-level) variables can be coded as either 0/1 or -1/1; the choice does not affect the predictions of the model, but it does change the interpretation of each coefficient.
When a qualitative variable has n levels, you can construct n - 1 dummy variables, and you will be able to see the weight of each level on the response.
There are also other techniques for handling qualitative variables with many levels.

Assumptions in a Linear Model


1. Additive. The effect of a change in a predictor X_j on the response Y is independent of the other predictors.
2. Linear. A one-unit change in a predictor X_j changes the response by the same constant amount, regardless of the value of X_j.
1. Fixing the additive assumption
We can add an interaction term (for example X_1 X_2) to the model, which is simply a product of several variables. This essentially adds another feature to the model that represents the synergy between the two variables.
Even if X_1 and X_2 have high p-values compared with the interaction term itself, we must still include the original X_1 and X_2 in the model (the hierarchical principle).
2. Fixing the linear assumption: polynomial terms
Add features raised to a power (X_j^q, where q is the power).
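In R formulas both fixes are one-liners; a sketch reusing the simulated data frame X from the earlier multiple-regression example (names are illustrative):

  fit_int  <- lm(y ~ x1 * x2, data = X)       # expands to x1 + x2 + x1:x2 (interaction)
  fit_poly <- lm(y ~ x1 + I(x1^2), data = X)  # quadratic term added via I()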

Potential Problems with Linear Model


1. Non-linearity of the predictor-response relationship.
A linear model might not be the best representation of the relationship; it may be, say, logarithmic or polynomial.
A good way to notice this is to plot the residuals: the residual plot should show no pattern (see the diagnostics sketch after this list).
2. Correlation of error terms.
The error term \varepsilon_i of one observation can be correlated with (depend on) that of another. An error term may represent factors (features) that are not in our dataset but still influence the response.
For example, suppose a linear regression is used to predict the heights of individuals from their weights. Correlated error terms could then come from two people of the same family (they ate similar food, share genes, and so on).
Linear regression assumes there is no correlation between error terms! If there is, the subsequent statistical inference will be inaccurate, because the calculation of the standard errors will be wrong; for example, the confidence intervals will be narrower than they should be. We would therefore be overconfident in our model.
Correlated error terms usually occur in time-series analysis, where a given event depends on the events before it.
3. Non-constant variance of error terms.
The error terms might have different variances, which causes problems in the calculation of SEs and confidence intervals. It is often the case! You can identify non-constant variance of the error terms (heteroscedasticity) from a funnel shape in the residual plot.
4. Outliers
Outliers can be spotted using the residuals, or better the studentized residuals. Sometimes it is difficult to choose an absolute value beyond which points count as outliers.
Outliers usually do not alter the fit much, but they do change the statistical parameters to some extent. Be careful when removing outliers: an apparent outlier might just signal a problem with your model (for example, a missing important feature).
5. High-leverage predictors X_j
High-leverage points are observations that are unusual in the predictors X rather than in the response (which was the previous bullet point). Such points have a large effect on the fitted line.
It is easy to find high-leverage points in one-dimensional data, since we can just check whether they lie in a sensible range. But in multi-dimensional data the predictors can each lie within their respective ranges yet be unusual as a combination. One way to find such points is to plot two suspected predictors against each other.
We can also compute a leverage statistic for each observation; a large value indicates high leverage:
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}

6. Collinearity
Collinearity - when two or more predictors are related to each other.
When two variables are collinear, the model can put a wide range of weights on them, i.e. the RSS has nearly the same minimum value over a wide range of values of these weights. The SEs of the affected predictors therefore increase and their t-statistics decrease. If a t-statistic is small, we tend to fail to reject the null hypothesis and may wrongly conclude that the predictor does not influence the output. Multicollinearity often occurs when data is collected without an experimental design.
To detect collinearity, look at the correlation matrix! It gives correlations between pairs of variables. However, there can also be collinearity among three or more variables (multicollinearity), in which case we compute the Variance Inflation Factor (VIF): the ratio of the variance of a predictor's coefficient when the full model is fitted to the variance of the same coefficient when the predictor is fitted on its own. VIF = 1 means no collinearity at all; VIF above 5 or 10 indicates a problematic amount of collinearity.
Two solutions to the problem of collinearity:
First: delete one of the variables.
Second: combine them into one variable (e.g. the average of their standardized versions).
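A sketch of diagnostics for the problems above, reusing the fit from the earlier multiple-regression example; vif() assumes the car package is installed.

  plot(fit, which = 1)                      # residuals vs fitted: look for patterns / funnel shapes
  rstudent(fit)                             # studentized residuals; large |values| flag outliers
  hatvalues(fit)                            # leverage statistic h_i for each observation
  round(cor(X[, c("x1", "x2", "x3")]), 2)   # pairwise correlation matrix
  library(car)                              # assumed installed
  vif(fit)                                  # variance inflation factors; > 5-10 is problematic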

Questions to ask about your model and methods to solve them


1. Is there a relationship between the features and the response?
Look at the weight of each feature. If the computed t-statistic is low and the p-value is high, you fail to reject the null hypothesis, which says that the weight is essentially zero, i.e. the feature has no effect on the response.
2. How strong is the relationship? How much do the features explain the response?
The RSE gives the average deviation of the response from the true population regression line. Comparing it with the mean value of the response gives the percentage error of the model.
R^2 gives the proportion of the variability in the response explained by the model.
3. Which features contribute to the response?
Features with lower p-values usually have a larger influence on the response.

4. By how much does each feature contribute to the response?

Check for collinearity first, because some features can be over-weighted even though they do not contribute that much in reality.
To see the strength of each association, we can also fit a separate simple linear regression for each feature, since, as noted above, the full regression may suffer from collinearity.
5. How accurate are our predictions?
If we wish to predict an individual response (which includes the error term), we need a prediction interval. When predicting the average response, we use a confidence interval.
6. Is the relationship linear?
Residual Plots!
7. Do some of the variables interact?
We can add the interaction term.

Overview of parametric methods


Parametric methods: easy to fit, easy to interpret, easy to do statistical tests, HOWEVER,
they make strong assumptions about the underlying function.

Comparison to a K-NN method


K-nearest-neighbours regression first identifies the K training observations nearest to a point x_0 and then averages their responses; that average is the prediction:
\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in N_0} y_i

The value of K represents the bias-variance trade-off. With small K (e.g. K=1) we get high
flexibility, i.e. low bias and high variance. In comparison, high K would produce lower variance
and much smoother fit.
If we use KNN regression when the true relationship is a straight line, KNN will approach the line but will not be as accurate as the actual linear regression model. Therefore, non-parametric models have a higher variance (not necessarily with a corresponding reduction in bias) in comparison to parametric methods.
KNN might seem better than linear regression when the true function is unknown and possibly highly non-linear; however, that only holds with a low number of features. In high-dimensional data (p > 4), linear regression outperforms KNN.
This happens because in high dimensions an observation may have no other observation close to it: there is effectively a reduction in sample size (for non-parametric methods) as the number of dimensions increases. This is called the curse of dimensionality. Generally, parametric methods outperform non-parametric methods when there is a low number of observations per feature.
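A hand-rolled sketch of one-dimensional KNN regression (no extra packages), compared with a linear fit on simulated data whose true relationship is in fact linear:

  set.seed(1)
  x <- runif(200)
  y <- 2 * x + rnorm(200, sd = 0.2)         # the truth really is a straight line
  knn_predict <- function(x0, x, y, K) {
    sapply(x0, function(p) mean(y[order(abs(x - p))[1:K]]))   # average of the K nearest responses
  }
  grid <- seq(0, 1, length.out = 50)
  yhat_knn <- knn_predict(grid, x, y, K = 9)
  yhat_lm  <- predict(lm(y ~ x), data.frame(x = grid))        # linear regression wins here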

4. Classification
Why don't we use linear regression instead?
In general, creating a dummy variable (with values 0, 1, 2, 3, ...) and applying linear regression does not reflect a qualitative response well. The problem is the ordering of the dummy variable, which implies that 2 lies between 1 and 3, whereas in reality the categories might not be related at all!
Therefore, unless you have binary data (where you can simply predict one outcome when the fitted value is > 0.5) or a qualitative response that is already ordered (for example mild, moderate, severe), you cannot use linear regression.
The result of linear regression on a binary output is exactly the same as in Linear Discriminant Analysis (LDA), covered later.
Another problem is that a linear fit can produce values outside [0, 1], which creates a problem of interpretability. Therefore, we use the logistic function (S-shaped):
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
Rearranging gives the log-odds, or logit:
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X

If \beta_1 is positive, then an increase in X leads to an increase in p(X).

The coefficients \beta_0 and \beta_1 are found by maximizing the likelihood function:
\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'}))

In classification, to test the null hypothesis that a variable is not related to the response, you use the z-statistic instead of the t-statistic.
Confounding variable - a variable that correlates with both the dependent variable (output) and an independent variable (predictor). Be careful with these!
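A minimal sketch of logistic regression with glm() on simulated data; note that summary() reports z-statistics, and the coefficients live on the log-odds scale.

  set.seed(1)
  x <- rnorm(200)
  p <- exp(-1 + 2 * x) / (1 + exp(-1 + 2 * x))            # true P(Y = 1 | X)
  y <- rbinom(200, 1, p)
  fit <- glm(y ~ x, family = binomial)
  summary(fit)                                            # z-statistics and p-values per coefficient
  predict(fit, data.frame(x = 0.5), type = "response")    # estimated P(Y = 1 | X = 0.5)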

Multiple-Class Classification
Linear Discriminant Analysis (LDA)
LDA is popular for multi-class (more than two output classes) classification.
When the classes are well separated, logistic regression can be unstable.
When the number of observations n is small and the distributions of the predictors X are approximately normal, LDA is more stable than logistic regression.
Since the Bayes classifier has the lowest possible error rate (it is the most accurate), we can use another method to approximate it; that method is LDA. The Bayes classifier:
P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}
For LDA with p = 1 we assume that the density function f_k(x) is Gaussian, where k indexes the class. A large f_k(x) means a high probability that the observation belongs to class k.
Why is the Bayes classifier the most accurate classifier? Because it assigns each observation using the true conditional distribution P(Y = k | X = x), so no other classifier can achieve a lower expected error rate.
The ROC curve is one of the best tools for describing the quality of a classifier; it plots the true positive rate against the false positive rate. Look at the area under the curve (AUC), which summarizes the accuracy of the classifier: the closer the ROC curve hugs the top-left corner, the better.
True positive rate = sensitivity; false positive rate = 1 - specificity
FP rate = FP / (FP + TN) - the Type I error rate
TP rate = TP / (TP + FN) - power, recall, sensitivity (1 - Type II error rate)
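A sketch of LDA with the MASS package (assumed installed), reusing the simulated binary data from the logistic-regression sketch above, plus a hand-rolled ROC curve:

  library(MASS)
  d <- data.frame(x = x, y = factor(y))
  fit_lda <- lda(y ~ x, data = d)
  post <- predict(fit_lda, d)$posterior[, "1"]   # posterior probability of class 1
  thresholds <- seq(0, 1, by = 0.01)
  tpr <- sapply(thresholds, function(t) sum(post > t & y == 1) / sum(y == 1))
  fpr <- sapply(thresholds, function(t) sum(post > t & y == 0) / sum(y == 0))
  plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")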

Comparison of Classification methods


Both LDA and logistic regression are similar in nature; the difference is in the fitting procedure: LDA uses class means and variances from a normal distribution, while logistic regression uses maximum likelihood to estimate the weights. Most often both methods produce very similar results. However, when the predictors are not normally distributed, logistic regression is stronger, while when the assumption that observations come from Gaussian distributions with a common covariance matrix holds, LDA can give some improvement over logistic regression.
In comparison, when the decision boundary is highly non-linear we expect polynomial methods (Quadratic Discriminant Analysis) and non-parametric methods to outperform. KNN, being non-parametric, is a good example, since it makes no assumptions about the shape of the decision boundary. On the other hand, KNN's disadvantage is inference: it does not provide weights for the predictors and therefore gives no information about their importance.
Approaching a classification problem:
1. Check whether the relationship is linear (look at residual plots); if it is not, you can use non-parametric methods or add polynomial terms.
2. Look for correlations between the predictors (you can add interaction terms); check for multicollinearity (correlation matrix, VIF).
3. Observe the t-statistics (and p-values) of each predictor to see how strongly it is related to the output. If there are many features, use subset selection or dimensionality reduction to exclude predictors that are not related.
4. See the comparison of classification methods above to choose the appropriate method for the given data set.
5. Try different features, create new ones, add polynomial terms, exclude some terms, play with the model, add the lasso, choose an appropriate tuning parameter, etc.

5. Resampling Methods
Cross-Validation
To estimate the test error, you need to split your training set further, holding out a validation set. The problem is deciding on the split fraction: the ratio of the split strongly affects the estimate, since the training part might not contain enough observations for the model to learn from, while the validation part might be too small to test the more unusual observations.

Leave one-out Cross-Validation (LOOCV)


For n observations, we select one observation (x_1, y_1) as the CV set, and the remaining n - 1 observations (x_2, y_2), ..., (x_n, y_n) form the training set. We then iterate, letting each observation in turn be the CV set. Because almost all of the data is used for training, the bias is very low; each individual MSE_i is highly variable, since it is based on a single observation, but averaging over all n iterations reduces that variability. The CV error rate is:
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} MSE_i
Advantages: LOOCV eliminates the randomness involved in choosing which subset of the data is the CV set; it yields the same result no matter how many times you run it (or from which observation you start), while the validation approach differs every run. Plus, LOOCV lets us train the model on most of the data. Disadvantages: LOOCV can be computationally expensive, especially if n is large and the model is slow to fit.
For linear and polynomial regression there is a shortcut that computes the LOOCV error in the same time as a single fit.

k-Fold Cross-Validation
You can use k-fold CV (where k < n), in which you split the n observations into k subsets (folds), each containing about n/k observations. The CV error is:
CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i

k is usually taken to be 5 or 10, values empirically found to give good results.
Advantages: an obvious computational advantage. Disadvantages: both LOOCV and k-fold CV only estimate the true test error, and the estimate can be biased (LOOCV is approximately unbiased; k-fold has somewhat more bias because each fold trains on fewer observations).
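A minimal sketch of LOOCV and k-fold CV with cv.glm() from the boot package (assumed installed); a glm() fit with the default gaussian family is just a linear regression, so delta[1] estimates the test MSE.

  library(boot)
  set.seed(1)
  dat <- data.frame(x = rnorm(100))
  dat$y <- 1 + 2 * dat$x + rnorm(100)
  fit <- glm(y ~ x, data = dat)
  cv.glm(dat, fit)$delta[1]            # LOOCV estimate of the test MSE (K defaults to n)
  cv.glm(dat, fit, K = 10)$delta[1]    # 10-fold CV estimate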

Bias-Variance Trade-off
LOOCV produces lower bias than the k-fold method; however, k-fold yields lower variance.
In practice, use k-fold: both methods give similar results and k-fold is less computationally expensive.

CV for Classification
The same approach for CV, apart from the error rate:
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)

Bootstrap
From the original data set you produce B new data sets by randomly drawing n observations with replacement, so there can be repeated observations within each bootstrap data set. From each of these data sets you compute the parameter(s) you are interested in, obtaining B different estimates, from which you can compute the SE.
The bootstrap can be applied to a wide range of statistical learning methods for which a measure of variability is otherwise difficult to obtain.
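A sketch of the bootstrap with boot() from the boot package, reusing the simulated data frame dat from the CV sketch above to estimate the SE of a regression slope:

  library(boot)
  slope_fn <- function(data, index) coef(lm(y ~ x, data = data[index, ]))[2]   # statistic of interest
  set.seed(1)
  boot(dat, slope_fn, R = 1000)        # the "std. error" column is the bootstrap SE of the slope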


6. Linear Model Selection and Regularization


Moving beyond plain least squares can improve:
1. Prediction accuracy. If n >> p, the least squares method gives low bias and low variance (on a test set). But if n is only slightly larger than p, the variance and the risk of overfitting are high. Moreover, when p > n we cannot use least squares at all.
2. Model interpretability. Some features might not be related to the response at all, but least squares would still give them some weight. We can set those weights to zero.
There are 3 additional techniques:
1. Subset Selection
2. Shrinkage (Regularization) - some weights are shrunk towards 0
3. Dimensionality Reduction

Subset Selection
1. Make the null model, with no predictors, which simply predicts the mean of the data.
2. For each model size k = 1, ..., p, fit all models with exactly k predictors by least squares and keep the best one, meaning the one with the smallest RSS or, equivalently, the largest R^2.
3. Select a single best model among these using the cross-validation error, C_p, BIC or adjusted R^2.
The second step reduces the number of models in consideration from 2^p to p + 1. In the third step we choose the one with the smallest estimated test error. This method can be applied to classification too, where we compute the deviance instead of the RSS.
Even though best subset selection is very appealing, it is very computationally expensive when p is large: for p = 20 there are about a million models to consider. Do not do it when p is more than about 35.
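A sketch of best subset selection with regsubsets() from the leaps package (assumed installed), reusing the small simulated data frame X from earlier; with only three predictors this is cheap.

  library(leaps)
  best <- regsubsets(y ~ ., data = X, nvmax = 3)
  summary(best)$adjr2                  # adjusted R^2 of the best model of each size
  which.min(summary(best)$bic)         # model size chosen by BIC
  regsubsets(y ~ ., data = X, method = "forward")   # forward stepwise; "backward" also available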

Forward Step-Wise Selection


Starting from the same null model, at each step we add only the predictor that gives the lowest RSS. This way we consider far fewer models (even though each model search might cost more here).
It can be applied even when n < p. Forward selection might not capture the best feature set because it does not scan through all 2^p possibilities.

Backward Selection
Starts from the model with all p features included and then excludes the most useless one at each step. Cannot be used when p > n.
To estimate the test error we can either:
1. Adjust the training error (four common approaches: C_p, Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC) and adjusted R^2).
2. Estimate the test error directly with a validation set or cross-validation.
Generally, validation is the better approach and can be used in a wider range of model selection tasks; it used to be avoided because it was too computationally expensive.

Ridge Regression
Ridge regression minimizes
RSS + \lambda \sum_{j=1}^{p} \beta_j^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\lambda \ge 0 - the tuning parameter
\lambda \sum_{j=1}^{p} \beta_j^2 - the shrinkage penalty
The penalty has the effect of shrinking (reducing) the weights during training. It is critical to choose the right value of \lambda: a value near zero gives back ordinary least squares, while a very large value shrinks all the weights towards zero (towards the null model). The \ell_2 norm of the coefficient vector gives an idea of how much the weights have been shrunk for a given \lambda.
In ordinary least squares, if we multiply a predictor X_j by a constant c, its weight simply adjusts by a factor 1/c. In ridge regression, however, because of the shrinkage penalty added to the loss function, the weights do not rescale in this simple way. That is why we need to make sure the predictors are scaled (standardized); otherwise the weights that are large in value would be penalized more than those smaller in value.
\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}
After this transformation, every predictor has standard deviation 1.

As the parameter \lambda increases, the flexibility of ridge regression decreases: the variance decreases and the bias increases.
Ridge regression is much faster than best subset selection and can be applied to problems with a large number of features. In fact, the ridge solutions for all values of \lambda can be computed simultaneously in essentially the same time as a single least squares fit.
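A sketch of ridge regression with glmnet (package assumed installed), reusing the simulated X from earlier; alpha = 0 selects the ridge penalty, the predictors are standardized by default, and the whole path of \lambda values is fitted at once.

  library(glmnet)
  Xmat <- as.matrix(X[, c("x1", "x2", "x3")])
  ridge <- glmnet(Xmat, X$y, alpha = 0)          # alpha = 0: ridge penalty, full lambda path
  cv_ridge <- cv.glmnet(Xmat, X$y, alpha = 0)    # cross-validation to choose lambda
  coef(ridge, s = cv_ridge$lambda.min)           # coefficients at the chosen lambda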

The Lasso
Ridge regression has one shortcoming in comparison to best, forward and backward subset selection: the unnecessary predictors are shrunk, but never to exactly zero. This might not harm the accuracy (predictive performance) of the model, but it harms interpretability, since the model still includes all p features. The lasso is a way to overcome this: we simply change the shrinkage penalty from \ell_2 to \ell_1.
\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|
In comparison to the \ell_2 penalty, the \ell_1 penalty forces some of the weights to be exactly zero when \lambda is large enough. Hence the lasso effectively performs variable selection and the resulting models are easier to interpret: the lasso yields sparse models.
Depending on the value of \lambda, some of the variables are dropped; the larger \lambda, the fewer variables are left in the model.
Another Formulation for Ridge and Lasso
The penalty can equivalently be written as a constraint: a budget s on \sum_j |\beta_j| (lasso) or \sum_j \beta_j^2 (ridge). If s is large enough, the constraint is not binding and the above models simply yield the least squares solution.
Comparison of Ridge to Lasso
When most of the features are related to the response: both give a similar bias, but ridge regression gives slightly lower variance than the lasso and therefore a lower MSE.
When some of the features should be zero: the lasso definitely outperforms ridge regression; it gives lower bias, variance and MSE.
Use cross-validation to determine which technique is better for a given data set: lasso or ridge.
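The same sketch with the lasso penalty (alpha = 1): with a large enough \lambda, some coefficients are set exactly to zero.

  lasso <- cv.glmnet(Xmat, X$y, alpha = 1)       # alpha = 1: lasso penalty
  coef(lasso, s = "lambda.min")                  # zero entries correspond to dropped variables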


R Tutorials
Hadley Wickham's paper on tidy data: http://vita.had.co.nz/papers/tidy-data.pdf
Characteristics of tidy data:
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table

