The Lasso
Patrick Breheny
BST 764: Applied Statistical Modeling
September 6
Introduction
As we have seen, ridge regression is capable of reducing the variability and improving the accuracy of linear regression models, and these gains are largest in the presence of multicollinearity. What ridge regression doesn't do, however, is variable selection: it fails to provide a parsimonious model with few parameters.
The lasso
The lasso estimate is defined as the value of $\beta$ minimizing
$$\sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \lambda \sum_j |\beta_j|,$$
the only difference from ridge regression being that absolute values, instead of squares, are used in the penalty function. The change to the penalty function is subtle, but has a dramatic impact on the resulting estimator.
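To make this concrete, here is a minimal R sketch of the quantity being minimized (the function name lasso_objective is ours, and solvers such as glmnet scale the residual sum of squares differently, so this is illustrative only):

# Lasso objective: residual sum of squares plus an L1 penalty on the
# coefficients (illustrative; actual solvers scale the RSS term differently)
lasso_objective <- function(X, y, beta, lambda) {
  rss <- sum((y - X %*% beta)^2)
  penalty <- lambda * sum(abs(beta))
  rss + penalty
}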
Like ridge regression, penalizing the absolute values of the coefficients introduces shrinkage towards zero. However, unlike ridge regression, some of the coefficients are shrunken all the way to zero; such solutions, with multiple values that are identically zero, are said to be sparse. The penalty thereby performs a sort of continuous variable selection. The resulting estimator was thus named the lasso, for Least Absolute Shrinkage and Selection Operator.
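A quick simulated illustration of this sparsity (a sketch using the glmnet package, which is introduced later in these notes; the simulation itself is ours, not from the original slides):

# With an L1 penalty, some estimated coefficients are exactly zero
library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)   # only the first two predictors matter

ridge <- glmnet(X, y, alpha = 0, lambda = 0.5)   # alpha = 0: ridge penalty
lasso <- glmnet(X, y, alpha = 1, lambda = 0.5)   # alpha = 1: lasso penalty

sum(as.vector(coef(ridge)) == 0)  # ridge: shrunken, but none exactly zero
sum(as.vector(coef(lasso)) == 0)  # lasso: several coefficients identically zero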
A geometrical illustration of why the lasso results in sparsity, but ridge does not, is given by the constraint interpretation of their penalties:

[Figure 3.11 from The Elements of Statistical Learning (2nd ed.), Hastie, Tibshirani & Friedman (2009), Chapter 3: estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$, respectively, while the red ellipses are the contours of the least squares error function.]
Bayesian perspective
Another way of seeing how the lasso produces sparsity is to view it from a Bayesian perspective, where the lasso penalty corresponds to a double exponential (Laplace) prior:
[Figure: prior densities $p(\beta)$ for ridge (normal) and the lasso (double exponential)]
Note that the lasso prior is pointy at 0, so there is a chance that the posterior mode will be identically zero.
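A small R sketch of the two priors (base R has no double exponential density, so it is written out by hand; dlaplace is our name):

# Compare the ridge prior (normal) with the lasso prior (double exponential)
dlaplace <- function(x, rate = 1) rate / 2 * exp(-rate * abs(x))

curve(dlaplace(x), from = -3, to = 3, ylab = "p(beta)")   # pointy at 0
curve(dnorm(x), add = TRUE, lty = 2)                      # smooth at 0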
Orthonormal Solutions
Because the lasso penalty involves the absolute value operation, the objective function is not differentiable and, as a result, lacks a closed-form solution in general. However, in the special case of an orthonormal design matrix, it is possible to obtain closed-form solutions for the lasso:
$$\hat{\beta}_j^{\text{lasso}} = S(\hat{\beta}_j^{\text{OLS}}, \lambda),$$
where $S$, the soft-thresholding operator, is defined as
$$S(z, \lambda) = \begin{cases} z - \lambda & \text{if } z > \lambda \\ 0 & \text{if } |z| \le \lambda \\ z + \lambda & \text{if } z < -\lambda. \end{cases}$$
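In R, the soft-thresholding operator can be written in one line; the compact form sign(z) * max(|z| - lambda, 0) is equivalent to the three-case definition above (soft_threshold is our name for it):

# Soft-thresholding operator S(z, lambda), vectorized over z
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

soft_threshold(c(-3, -0.5, 0, 0.5, 3), lambda = 1)
# [1] -2  0  0  0  2   -- values with |z| <= 1 are set exactly to zero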
The function on the previous slide is referred to as soft thresholding to distinguish it from hard thresholding:
$$H(z, \lambda) = \begin{cases} z & \text{if } |z| > \lambda \\ 0 & \text{if } |z| \le \lambda. \end{cases}$$
In the orthonormal case, best subset selection is equivalent to hard thresholding. Note that soft thresholding is continuous, while hard thresholding is not.
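A sketch of hard thresholding alongside the soft_threshold function defined above, illustrating the continuity difference:

# Hard-thresholding operator H(z, lambda)
hard_threshold <- function(z, lambda) {
  z * (abs(z) > lambda)
}

# Near z = lambda, soft thresholding passes continuously through zero,
# while hard thresholding jumps
z <- c(0.99, 1.01)
soft_threshold(z, lambda = 1)   # 0.00 0.01
hard_threshold(z, lambda = 1)   # 0.00 1.01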
[Figure: soft- and hard-thresholding estimates plotted as a function of $\hat{\beta}^{\text{OLS}}$]
Selection of λ
Unlike ridge regression, the lasso is not a linear estimator: there is no matrix H such that $\hat{y} = Hy$. Defining the degrees of freedom of the lasso is therefore somewhat messy. However, a number of arguments can be made that the number of nonzero coefficients in the model is a reasonable quantification of the model's degrees of freedom, and this quantity can be used in AIC/BIC/GCV to select λ. Other statisticians, however, feel these approximations to be untrustworthy, and prefer to select λ via cross-validation instead.
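A rough sketch of the information-criterion approach (reusing the simulated X and y from earlier; glmnet's df component counts the nonzero coefficients, and the BIC formula below is the usual Gaussian one up to constants):

# Select lambda by BIC, using the number of nonzero coefficients as the
# degrees of freedom (one of the approximations discussed above)
library(glmnet)
fit <- glmnet(X, y)                         # fits a whole path of lambda values
yhat <- predict(fit, newx = X)              # n x nlambda matrix of fitted values
rss <- colSums((y - yhat)^2)
n <- length(y)
bic <- n * log(rss / n) + log(n) * fit$df   # fit$df = nonzero coefficients
fit$lambda[which.min(bic)]                  # the BIC-selected lambda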
In SAS, PROC GLMSELECT allows for many other selection criteria, including cross-validation. Note that despite its name, GLMSELECT only fits linear models, not GLMs.
In R, the glmnet package can fit a wide variety of models (linear models, generalized linear models, multinomial models, proportional hazards models) with lasso penalties. The syntax is fairly straightforward, though it differs from lm in that it requires you to form your own design matrix:
fit <- glmnet(X,y)
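For example, a minimal sketch assuming a hypothetical data frame dat containing the response y and its predictors:

library(glmnet)

# model.matrix builds a numeric design matrix, expanding any factors into
# dummy variables; glmnet supplies its own intercept, so drop the column of 1s
X <- model.matrix(y ~ ., data = dat)[, -1]   # dat is hypothetical
fit <- glmnet(X, dat$y)

plot(fit)            # coefficient paths across the lambda sequence
coef(fit, s = 0.1)   # coefficients at lambda = 0.1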
[Figure: coefficient paths for ridge (left) and lasso (right), plotted against the fraction of the maximum penalty size ($\sum_j \beta_j^2 / \max \sum_j \beta_j^2$ for ridge, $\sum_j |\beta_j| / \max \sum_j |\beta_j|$ for the lasso); the paths for lcavol, lweight, and svi are labeled in each panel]
Cross-validation results
[Figure: cross-validated mean squared error plotted against log(λ), with the number of nonzero coefficients shown along the top axis and two vertical lines marking the selected values of λ]
The line on the right is drawn at the minimum CV error; the other is drawn at the maximum value of λ within 1 SE of the minimum.
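In glmnet, both lines in the figure are produced by cv.glmnet; a minimal sketch, again reusing X and y from the simulated example:

library(glmnet)
set.seed(1)
cvfit <- cv.glmnet(X, y)       # 10-fold cross-validation by default

plot(cvfit)                    # the CV error curve with both vertical lines
cvfit$lambda.min               # lambda minimizing the CV error
cvfit$lambda.1se               # largest lambda within 1 SE of the minimum
coef(cvfit, s = "lambda.1se")  # the sparser of the two selected models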