Linear regression

Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X1, X2, . . . , Xp is linear:

  f(x) = β0 + β1 x1 + β2 x2 + · · · + βp xp.

True regression functions are never linear! Functions in nature are rarely linear, so the linear model is almost always thought of as an approximation to the truth.

[Figure: a non-linear true regression function f(X) together with its linear approximation.]
Questions we might ask: Is there a relationship between advertising budget and sales? How strong is the relationship between advertising budget and sales?
Advertising data

[Figure: scatterplots of Sales against TV, Radio, and Newspaper advertising budgets for the Advertising data.]
Simple linear regression using a single predictor X. We assume a model

  Y = β0 + β1 X + ε,

where β0 and β1 are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and ε is the error term.

Given some estimates β̂0 and β̂1 for the model coefficients, we predict future sales using

  ŷ = β̂0 + β̂1 x,

where ŷ indicates a prediction of Y on the basis of X = x. The hat symbol denotes an estimated value.
We define the residual sum of squares (RSS) as

  RSS = (y1 − β̂0 − β̂1 x1)² + (y2 − β̂0 − β̂1 x2)² + · · · + (yn − β̂0 − β̂1 xn)².

The least squares approach chooses β̂0 and β̂1 to minimize the RSS. The minimizing values can be shown to be

  β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,    β̂0 = ȳ − β̂1 x̄,    (3.4)

with sums over i = 1, . . . , n, and where ȳ = (1/n) Σ yi and x̄ = (1/n) Σ xi are the sample means. In other words, (3.4) defines the least squares coefficient estimates for simple linear regression.
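As a concrete sketch of (3.4), the estimates can be computed directly with numpy. The data below are synthetic and purely illustrative, not the Advertising data:

```python
import numpy as np

# Synthetic data (illustrative only; not the Advertising data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of (xi - x_bar)(yi - y_bar) over sum of (xi - x_bar)^2.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: y_bar - beta1_hat * x_bar.
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)  # slope ~1.98, intercept ~0.10
```

In practice these estimates come from standard statistical software; the point of the sketch is that (3.4) is all that is being computed.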
[Figure 3.1: For the Advertising data, the least squares fit for the regression of sales onto TV. The fit is found by minimizing the sum of squared errors. Each grey line segment represents an error, and the fit makes a compromise by averaging their squares.]

The least squares fit for the regression of sales onto TV gives β̂0 = 7.03 and β̂1 = 0.0475. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.
The standard errors of these estimates are given by

  SE(β̂1)² = σ² / Σ (xi − x̄)²,    SE(β̂0)² = σ² [ 1/n + x̄² / Σ (xi − x̄)² ],

where σ² = Var(ε).
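A hedged sketch of these formulas: σ² is unknown in practice, so the code plugs in the usual estimate RSS/(n − 2). The data are synthetic, not from the text:

```python
import numpy as np

# Synthetic data (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
n = len(x)
x_bar = x.mean()

# Least squares estimates as in (3.4).
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar

# Estimate sigma^2 = Var(eps) by RSS / (n - 2).
residuals = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)

# Standard errors from the formulas above.
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / sxx))
```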
Hypothesis testing

Standard errors can also be used to perform hypothesis tests on the coefficients. The most common test involves the null hypothesis

  H0: β1 = 0   versus   HA: β1 ≠ 0.

To test it, we compute the t-statistic

  t = (β̂1 − 0) / SE(β̂1),

which has a t-distribution with n − 2 degrees of freedom, assuming β1 = 0.
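A minimal sketch of this t-test on synthetic data. To keep the example dependency-free it uses the rough rule that |t| > 2 signals significance at about the 5% level, rather than exact t-distribution tail probabilities:

```python
import numpy as np

# Synthetic data with a genuine linear relationship (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
n = len(x)
x_bar = x.mean()

sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar

residuals = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)
se_beta1 = np.sqrt(sigma2_hat / sxx)

# t-statistic for H0: beta1 = 0; compare to a t distribution with
# n - 2 degrees of freedom (|t| > 2 is a rough 5% cutoff).
t_stat = (beta1_hat - 0) / se_beta1
reject_h0 = abs(t_stat) > 2
```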
             Coefficient  Std. Error  t-statistic  p-value
Intercept    7.0325       0.4578      15.36        < 0.0001
TV           0.0475       0.0027      17.67        < 0.0001
We compute the Residual Standard Error

  RSE = sqrt( RSS / (n − 2) ) = sqrt( (1 / (n − 2)) Σ (yi − ŷi)² ),

where the residual sum-of-squares is RSS = Σ (yi − ŷi)².

R-squared, or the fraction of variance explained, is

  R² = (TSS − RSS) / TSS = 1 − RSS / TSS,

where TSS = Σ (yi − ȳ)² is the total sum of squares.

It can be shown that in this simple linear regression setting R² = r², where r is the correlation between X and Y:

  r = Σ (xi − x̄)(yi − ȳ) / sqrt( Σ (xi − x̄)² · Σ (yi − ȳ)² ).
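These quantities, and the identity R² = r², can be checked numerically; the data below are synthetic and illustrative only:

```python
import numpy as np

# Synthetic data (illustrative only).
x = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5])
y = np.array([1.0, 2.2, 2.9, 4.1, 4.8, 6.2, 6.9, 8.1])
n = len(x)

# Simple least squares fit.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
rse = np.sqrt(rss / (n - 2))          # residual standard error
r2 = 1.0 - rss / tss                  # fraction of variance explained

# Correlation between X and Y.
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
```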
Quantity                  Value
Residual Standard Error   3.26
R²                        0.612
F-statistic               312.1
In multiple linear regression, the model is

  Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε.

We interpret βj as the average effect on Y of a one-unit increase in Xj, holding all other predictors fixed.
Interpreting regression coefficients

The ideal scenario is when the predictors are uncorrelated (a balanced design): each coefficient can then be estimated and tested separately. Correlations amongst predictors cause problems, and claims of causality should be avoided for observational data.
Given estimates β̂0, β̂1, . . . , β̂p, predictions take the form

  ŷ = β̂0 + β̂1 x1 + β̂2 x2 + · · · + β̂p xp.

We estimate β0, β1, . . . , βp as the values that minimize the sum of squared residuals

  RSS = Σ (yi − ŷi)² = Σ (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − · · · − β̂p xip)².

[Figure: the least squares regression plane for two predictors X1 and X2.]
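This minimization is what statistical software does internally. As a sketch, here it is via numpy's least squares solver on synthetic data; the coefficient values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, p = 3 predictors (illustrative values).
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.5, -1.0, 3.0])   # intercept followed by slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept.
design = np.column_stack([np.ones(n), X])

# Least squares estimates: the coefficients minimizing the RSS.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
```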
              Coefficient  Std. Error  t-statistic  p-value
Intercept     2.939        0.3119       9.42        < 0.0001
TV            0.046        0.0014      32.81        < 0.0001
radio         0.189        0.0086      21.89        < 0.0001
newspaper    -0.001        0.0059      -0.18        0.8599

Correlations:
            TV      radio   newspaper  sales
TV          1.0000  0.0548  0.0567     0.7822
radio               1.0000  0.3541     0.5762
newspaper                   1.0000     0.2283
sales                                  1.0000
To test whether at least one predictor is useful, we can use the F-statistic

  F = ( (TSS − RSS) / p ) / ( RSS / (n − p − 1) )  ~  F(p, n − p − 1).
Quantity                  Value
Residual Standard Error   1.69
R²                        0.897
F-statistic               570
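A sketch of the F-statistic computation on synthetic data with a genuine signal, in which case F should come out large; the values are illustrative, not the Advertising fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with real signal (illustrative only).
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.5, -0.5, 1.0]) + rng.normal(scale=0.5, size=n)

design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ beta_hat

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)

# F = ((TSS - RSS)/p) / (RSS/(n - p - 1)); compare to F(p, n - p - 1).
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
```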
Forward selection
Backward selection

Remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
[Figure 3.6: Scatterplot matrix for the Credit data set, which contains information about Balance, Age, Cards, Education, Income, Limit, and Rating.]
Interpretation?
                Coefficient  Std. Error  t-statistic  p-value
Intercept       509.80       33.13       15.389       < 0.0001
gender[Female]   19.73       46.05        0.429       0.6690
With a three-level predictor such as ethnicity we create two dummy variables, xi1 (1 if Asian) and xi2 (1 if Caucasian), giving the model

  yi = β0 + β1 xi1 + β2 xi2 + εi = { β0 + β1 + εi   if the ith person is Asian
                                   { β0 + β2 + εi   if the ith person is Caucasian
                                   { β0 + εi        if the ith person is African American.
                       Coefficient  Std. Error  t-statistic  p-value
Intercept              531.00       46.32       11.464       < 0.0001
ethnicity[Asian]       -18.69       65.02       -0.287       0.7740
ethnicity[Caucasian]   -12.50       56.68       -0.221       0.8260
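A sketch of how the two dummy variables could be built by hand; the category labels and balance values below are made up, not the Credit data. With indicator columns for two of the three levels, the intercept estimates the baseline-level mean and the other coefficients estimate offsets from it:

```python
import numpy as np

# Hypothetical category labels and responses (not the Credit data).
ethnicity = np.array(["Asian", "Caucasian", "African American",
                      "Asian", "Caucasian", "African American"])
balance = np.array([520.0, 510.0, 530.0, 500.0, 525.0, 535.0])

# Two dummy variables; "African American" is the baseline level.
x1 = (ethnicity == "Asian").astype(float)       # 1 if Asian, else 0
x2 = (ethnicity == "Caucasian").astype(float)   # 1 if Caucasian, else 0

design = np.column_stack([np.ones(len(balance)), x1, x2])
beta_hat, *_ = np.linalg.lstsq(design, balance, rcond=None)

# beta_hat[0] is the baseline mean; beta_hat[1] and beta_hat[2] are the
# Asian and Caucasian offsets from that baseline.
```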
The linear model

  predicted sales = β̂0 + β̂1 × TV + β̂2 × radio + β̂3 × newspaper

states that the average effect on sales of a one-unit increase in TV is always β1, regardless of the amount spent on radio.
Interactions continued
[Figure 3.5: For the Advertising data, a linear regression fit to sales using TV and radio as predictors. From the pattern of the residuals, we can see that there is a pronounced non-linear relationship in the data.]

When levels of either TV or radio are low, then the true sales are lower than predicted by the linear model. But when advertising is split between the two media, then the model tends to underestimate sales.
Results:

           Coefficient  Std. Error  t-statistic  p-value
Intercept  6.7502       0.248       27.23        < 0.0001
TV         0.0191       0.002       12.70        < 0.0001
radio      0.0289       0.009        3.24        0.0014
TV×radio   0.0011       0.000       20.73        < 0.0001
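A sketch of fitting a model with an interaction term by adding a product column to the design matrix; the budgets and coefficients below are synthetic stand-ins, not the Advertising data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins for TV and radio budgets (illustrative only).
n = 300
tv = rng.uniform(0, 300, size=n)
radio = rng.uniform(0, 50, size=n)
y = (6.75 + 0.019 * tv + 0.029 * radio
     + 0.0011 * tv * radio + rng.normal(scale=0.5, size=n))

# Design matrix: intercept, main effects, and the interaction tv * radio.
design = np.column_stack([np.ones(n), tv, radio, tv * radio])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
```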
Interpretation

The results in this table suggest that interactions are important: the p-value for the interaction term TV×radio is extremely low, and the R² for the interaction model is substantially higher than the 89.7% for the model that predicts sales using TV and radio without an interaction term.
Interpretation continued
Hierarchy

Sometimes an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.
Hierarchy continued
Without an interaction term, the model for balance in the Credit data takes the form

  balancei ≈ β0 + β1 × incomei + { β2   if ith person is a student
                                 { 0    if not a student

            = β1 × incomei + { β0 + β2   if student
                             { β0        if not student.

With an interaction term between income and student, it becomes

  balancei ≈ β0 + β1 × incomei + { β2 + β3 × incomei   if student
                                 { 0                   if not student

            = { (β0 + β2) + (β1 + β3) × incomei   if student
              { β0 + β1 × incomei                 if not student.
[Figure 3.7: For the Credit data, the least squares lines for prediction of balance from income for students and non-students. Left: the model with no interaction between income and student. Right: the model with an interaction term between income and student.]
Non-linear effects of predictors

[Figure 3.8: The Auto data set. mpg versus horsepower for a number of cars, with linear, degree-2, and degree-5 polynomial fits.]
              Coefficient  Std. Error  t-statistic  p-value
Intercept     56.9001      1.8004       31.6        < 0.0001
horsepower    -0.4662      0.0311      -15.0        < 0.0001
horsepower²    0.0012      0.0001       10.1        < 0.0001
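A sketch of such a quadratic fit; note that polynomial regression is still a linear model in the coefficients, so ordinary least squares applies. The data below are synthetic, not the Auto data set:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic horsepower/mpg-style data (illustrative only).
hp = rng.uniform(50, 200, size=150)
mpg = 55.0 - 0.45 * hp + 0.001 * hp ** 2 + rng.normal(scale=1.0, size=150)

# Design matrix with horsepower and horsepower^2 columns; the model is
# linear in the coefficients even though it is quadratic in hp.
design = np.column_stack([np.ones_like(hp), hp, hp ** 2])
beta_hat, *_ = np.linalg.lstsq(design, mpg, rcond=None)
```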
Outliers
Non-constant variance of error terms
High leverage points
Collinearity
See text Section 3.3.3.