[Figure: the Advertising data. Sales plotted against TV, Radio, and Newspaper advertising budgets in three scatter plots.]
Notation
• Here Sales is a response or target that we wish to predict. We generically refer to the response as Y.
• TV is a feature, or input, or predictor; we name it X1.
• Likewise, we name Radio as X2, and so on.
• We can refer to the input vector collectively as
$$X = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix}.$$
• Now we write our model as
$$Y = f(X) + \varepsilon,$$
where ε captures measurement errors and other discrepancies.
What is f(X) good for?
• With a good f we can make predictions of Y at new points X = x.
• We can understand which components of X = (X1, X2, ..., Xp) are important in explaining Y, and which are irrelevant.
• Depending on the complexity of f, we may be able to understand how each component Xj of X affects Y.
[Figure: scatter plot of y versus x for a large simulated data set, illustrating the conditional mean of Y at X = 4.]
f(4) = E(Y | X = 4)
E(Y | X = 4) means the expected value (average) of Y given X = 4.
This ideal f(x) = E(Y | X = x) is called the regression function.
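The slide calls f(x) = E(Y | X = x) "ideal" without saying why; the standard one-step justification (my addition, using only squared-error loss and the definitions above) is:

```latex
% For any candidate prediction g(x), add and subtract f(x) = E(Y | X = x):
\[
  E\left[(Y - g(X))^2 \mid X = x\right]
    = \underbrace{E\left[(Y - f(x))^2 \mid X = x\right]}_{\text{irreducible}}
    + \underbrace{\left(f(x) - g(x)\right)^2}_{\text{reducible}}
\]
% The cross term 2 (f(x) - g(x)) E[Y - f(x) | X = x] vanishes because
% E[Y | X = x] = f(x). Minimizing over g gives g(x) = f(x): the regression
% function is the optimal predictor of Y under squared-error loss.
```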
The regression function f(x)
• Is also defined for a vector X; e.g. f(x) = f(x1, x2, x3) = E(Y | X1 = x1, X2 = x2, X3 = x3).
• Is the ideal or optimal predictor of Y with regard to mean-squared prediction error.
• ε = Y − f(x) is the irreducible error: even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values.
• For any estimate f̂(x) of f(x), we have
E[(Y − f̂(x))² | X = x] = [f(x) − f̂(x)]² + Var(ε).
How to estimate f
• Typically we have few if any data points with X = 4 exactly.
• So we cannot compute E(Y | X = x)!
• Relax the definition and let
f̂(x) = Ave(Y | X ∈ N(x)),
where N(x) is some neighborhood of x (a minimal implementation sketch follows below).
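A minimal sketch of this estimator, assuming one-dimensional simulated data and a fixed-radius neighborhood N(x) = [x − r, x + r]; the data-generating function and the radius are my illustrative choices, not the slides':

```python
import numpy as np

def nn_average(x0, x, y, radius=0.25):
    """Estimate f(x0) = E(Y | X = x0) by averaging the y values of
    training points falling in the neighborhood N(x0) = [x0 - r, x0 + r]."""
    mask = np.abs(x - x0) <= radius
    if not mask.any():           # empty neighborhood: no local information
        return np.nan
    return y[mask].mean()

# Simulated training data: Y = f(X) + eps with f(x) = sin(x) + x/2.
rng = np.random.default_rng(0)
x = rng.uniform(1, 6, size=200)
y = np.sin(x) + 0.5 * x + rng.normal(scale=0.3, size=200)

print(nn_average(4.0, x, y))     # local estimate of f(4) = E(Y | X = 4)
```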
[Figure: scatter plot of y versus x; the estimate f̂(x) averages the y values of points falling in a neighborhood N(x).]
• Nearest neighbor averaging can be pretty good for small p — i.e. p ≤ 4 — and large-ish N.
• We will discuss smoother versions, such as kernel and spline smoothing, later in the course.
• Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality — nearest neighbors tend to be far away in high dimensions. To keep the variance down we need to average over a reasonable fraction of the N values of yi (e.g. 10%), but a 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating E(Y | X = x) by local averaging (see the numeric sketch below).
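A small numeric sketch of that last point, assuming the predictors are uniform on the unit hypercube and neighborhoods are sub-cubes (the cube setup is my simplification; the slide's figure uses spherical neighborhoods):

```python
# For X uniform on [0, 1]^p, a sub-cube containing 10% of the volume
# (and hence, on average, 10% of the data) needs side length 0.1^(1/p).
for p in [1, 2, 3, 5, 10]:
    side = 0.10 ** (1 / p)
    print(f"p = {p:2d}: side of a 10% neighborhood = {side:.2f}")
# p = 1 -> 0.10 (local); p = 10 -> 0.79 (spans most of each coordinate).
```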
The curse of dimensionality
[Figure: left, simulated points in (x1, x2) with a 10% neighborhood marked; right, the radius needed to capture a given fraction of the volume, for p = 1, 2, 3, 5, 10. The required radius grows rapidly with p.]
Parametric and structured models
The linear model is an important example of a parametric model:
f_L(X) = β0 + β1 X1 + β2 X2 + ... + βp Xp.
• A linear model is specified in terms of p + 1 parameters β0, β1, ..., βp.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X).
A linear model f̂_L(X) = β̂0 + β̂1 X gives a reasonable fit here
[Figure: scatter plot of y versus x with a fitted straight line.]
A quadratic model f̂_Q(X) = β̂0 + β̂1 X + β̂2 X² fits slightly better
[Figure: the same data with a fitted quadratic curve.]
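A minimal sketch of fitting both parametric models by least squares in NumPy; the simulated data below is my stand-in for the slide's:

```python
import numpy as np

# Simulated data standing in for the slide's example.
rng = np.random.default_rng(1)
x = rng.uniform(1, 6, size=60)
y = np.sin(x) + 0.5 * x + rng.normal(scale=0.3, size=60)

# np.polyfit returns coefficients with the highest degree first.
b1, b0 = np.polyfit(x, y, deg=1)       # linear:    f_L(x) = b0 + b1*x
c2, c1, c0 = np.polyfit(x, y, deg=2)   # quadratic: f_Q(x) = c0 + c1*x + c2*x^2

rss_lin = np.sum((y - (b0 + b1 * x)) ** 2)
rss_quad = np.sum((y - (c0 + c1 * x + c2 * x ** 2)) ** 2)
print(f"linear RSS = {rss_lin:.2f}, quadratic RSS = {rss_quad:.2f}")
# The quadratic RSS is never larger, since the linear model is nested in it.
```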
[Figures: four 3D plots of Income as a function of Years of Education and Seniority — the simulated Income data and fitted surfaces of varying flexibility.]
Some trade-offs
• Prediction accuracy versus interpretability: linear models are easy to interpret; thin-plate splines are not.
• Good fit versus over-fit or under-fit: how do we know when the fit is just right?
• Parsimony versus black-box: we often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
[Figure: the interpretability–flexibility trade-off. Interpretability (high to low) is plotted against flexibility (low to high): Subset Selection and the Lasso sit at high interpretability and low flexibility, Least Squares in between, and Bagging and Boosting at high flexibility and low interpretability.]
[Figure 2.9: left, simulated data (Y versus X) with fitted curves of differing flexibility; right, mean squared error as a function of flexibility.]
[Figure 2.10: details are as in Figure 2.9, using a different true f that is much closer to linear. In this setting, linear regression provides a very good fit to the data.]
Here the truth is smoother, so the smoother fit and the linear model do really well.
[Figure 2.11: details are as in Figure 2.9, using a different f that is far from linear. In this setting, linear regression provides a very poor fit to the data.]
Here the truth is wiggly and the noise is low, so the more flexible fits do the best.
Bias-Variance Trade-off
Suppose we have fit a model f̂(x) to some training data Tr, and let (x0, y0) be a test observation drawn from the population. If the true model is Y = f(X) + ε (with f(x) = E(Y | X = x)), then
$$E\big[(y_0 - \hat f(x_0))^2\big] = \mathrm{Var}\big(\hat f(x_0)\big) + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon).$$
The expectation averages over the variability of y0 as well as the variability in Tr. Typically, as the flexibility of f̂ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off (a simulation check follows below).
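A Monte-Carlo sketch that checks this decomposition numerically. Everything here (the true f, the noise level, the deliberately biased straight-line fit) is my illustrative choice, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * x)     # true regression function
sigma = 0.3                     # noise sd, so Var(eps) = 0.09
x0 = 1.0                        # fixed test point

preds, sq_errs = [], []
for _ in range(5000):
    # Fresh training set Tr and fresh test response y0 on every round.
    x = rng.uniform(0, 2, size=30)
    y = f(x) + rng.normal(scale=sigma, size=30)
    b1, b0 = np.polyfit(x, y, deg=1)      # a (biased) straight-line fit
    fhat = b0 + b1 * x0                   # fhat(x0) varies with Tr
    y0 = f(x0) + rng.normal(scale=sigma)
    preds.append(fhat)
    sq_errs.append((y0 - fhat) ** 2)

preds = np.asarray(preds)
lhs = np.mean(sq_errs)                            # E[(y0 - fhat(x0))^2]
rhs = preds.var() + (preds.mean() - f(x0)) ** 2 + sigma ** 2
print(f"{lhs:.4f}  vs  {rhs:.4f}")    # agree up to Monte-Carlo error
```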
Bias-variance trade-off for the three examples
[Figure 2.12: squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11. The vertical dashed line indicates the flexibility level corresponding to the smallest test MSE.]
Classification Problems
Here the response variable Y is qualitative — e.g., email is one of C = (spam, ham), digit class is one of C = {0, 1, ..., 9}. Our goals are to:
• Build a classifier C(X) that assigns a class label from C to a future unlabeled observation X.
• Assess the uncertainty in each classification.
• Understand the roles of the different predictors among X = (X1, X2, ..., Xp).
[Figure: simulated binary data, with y ∈ {0, 1} plotted against x; tick marks along the top and bottom show the x locations of the two classes, and the curve shows the estimated conditional probability Pr(Y = 1 | X = x). Stronger red lines indicate aggregated data used to estimate conditional probabilities at x = 5 (nearest neighbor averaging).]
Classification: some details
• Typically there are K possible classes. Let p_k(x) = Pr(Y = k | X = x), k = 1, ..., K, denote the conditional class probabilities at x. These can be estimated, as before, by nearest-neighbor averaging.
• The Bayes optimal classifier at x is C(x) = j if p_j(x) = max{p_1(x), ..., p_K(x)}.
• We measure the performance of a classifier Ĉ(x) using the misclassification error rate on test data: Err_Te = Ave_{i∈Te} I[y_i ≠ Ĉ(x_i)]. The Bayes classifier has the smallest error, in theory.
Example: K-nearest neighbors in two dimensions
[Figure 2.13: simulated two-class data plotted in (X1, X2).]
KNN: K=10
[Figure 2.15: the black curve indicates the KNN decision boundary on the data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar.]
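A minimal plain-NumPy sketch of the KNN classifier itself (in practice one would reach for scikit-learn's KNeighborsClassifier; the two-Gaussian data below is my stand-in for Figure 2.13's):

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=10):
    """Classify each row of X_new by majority vote among the labels
    of its k nearest training points (Euclidean distance)."""
    preds = []
    for x in X_new:
        d = np.linalg.norm(X_train - x, axis=1)    # distances to all points
        nearest = y_train[np.argsort(d)[:k]]       # labels of the k nearest
        preds.append(np.bincount(nearest).argmax())  # majority vote
    return np.array(preds)

# Two Gaussian classes in (X1, X2), a stand-in for the figure's data.
rng = np.random.default_rng(3)
Xa = rng.normal(loc=[-1, -1], scale=1.0, size=(100, 2))
Xb = rng.normal(loc=[+1, +1], scale=1.0, size=(100, 2))
X = np.vstack([Xa, Xb])
y = np.array([0] * 100 + [1] * 100)

print(knn_predict(X, y, np.array([[0.5, 0.5], [-2.0, -1.0]]), k=10))
```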
KNN: K=1 and K=100
[Figure: KNN decision boundaries on the same data using K = 1 (left) and K = 100 (right); K = 1 is overly flexible, while K = 100 is not flexible enough.]
[Figure: training errors and test errors for KNN, plotted against 1/K. Training error decreases steadily with increasing flexibility (larger 1/K), while test error is U-shaped.]