
5. Linear Regression

Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
  Linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
  Small residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
  Minimize $\sum \hat{\varepsilon}_i^2$ . . . . . . . . . . . . . . . . . . . . . 8
  Properties of residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
  Regression in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
  R output - Davis data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

How good is the fit? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
  Residual standard error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
  $R^2$ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-15
  Analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
  r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

Multiple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
  2 independent variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
  Statistical error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
  Estimates and residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
  Computing estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
  Properties of residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
  $R^2$ and $\bar{R}^2$ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

Ozone example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
  Ozone example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
  Ozone data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
  R output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Standardized coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
  Standardized coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
  Using hinge spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
  Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
  Using st.dev. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
  Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
  Ozone example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
  Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-38

Outline
- We have seen that linear regression has its limitations. However, it is worth studying linear regression because:
  - Sometimes data (nearly) satisfy the assumptions.
  - Sometimes the assumptions can be (nearly) satisfied by transforming the data.
  - There are many useful extensions of linear regression: weighted regression, robust regression, nonparametric regression, and generalized linear models.
- How does linear regression work? We start with one independent variable.

2 / 38

Simple linear regression


3 / 38

Linear model

- Linear statistical model: $Y = \alpha + \beta X + \varepsilon$.
- $\alpha$ is the intercept of the line, and $\beta$ is the slope of the line. A one-unit increase in X gives a $\beta$-unit increase in Y. (see figure on blackboard)
- $\varepsilon$ is called a statistical error. It accounts for the fact that the statistical model does not give an exact fit to the data.
- Statistical errors can have a fixed and a random component.
  - Fixed component: arises when the true relation is not linear (also called lack of fit error, bias); we assume this component is negligible.
  - Random component: due to measurement errors in Y, variables that are not included in the model, random variation.

4 / 38

Linear model
- Data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Then the model gives: $Y_i = \alpha + \beta X_i + \varepsilon_i$, where $\varepsilon_i$ is the statistical error for the $i$th case.
- Thus, the observed value $Y_i$ equals $\alpha + \beta X_i$, except that $\varepsilon_i$, an unknown random quantity, is added on.
- The statistical errors $\varepsilon_i$ cannot be observed. Why?
- We assume:
  - $E(\varepsilon_i) = 0$
  - $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $i = 1, \ldots, n$
  - $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$

5 / 38

Linear model
- The population parameters $\alpha$, $\beta$ and $\sigma$ are unknown. We use lower-case Greek letters for population parameters.
- We compute estimates of the population parameters: $\hat{\alpha}$, $\hat{\beta}$ and $\hat{\sigma}$.
- $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$ is called the fitted value. (see figure on blackboard)
- $\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - (\hat{\alpha} + \hat{\beta} X_i)$ is called the residual.
- The residuals are observable, and can be used to check assumptions on the statistical errors $\varepsilon_i$.
- Points above the line have positive residuals, and points below the line have negative residuals.
- A line that fits the data well has small residuals.

6 / 38

Small residuals
- We want the residuals to be small in magnitude, because large negative residuals are as bad as large positive residuals. So we cannot simply require $\sum \hat{\varepsilon}_i = 0$.
- In fact, any line through the means of the variables - the point $(\bar{X}, \bar{Y})$ - satisfies $\sum \hat{\varepsilon}_i = 0$ (derivation on board).
- Two immediate solutions:
  - Require $\sum |\hat{\varepsilon}_i|$ to be small.
  - Require $\sum \hat{\varepsilon}_i^2$ to be small.
- We consider the second option because working with squares is mathematically easier than working with absolute values (for example, it is easier to take derivatives). However, the first option is more resistant to outliers.
- Eyeball regression line (see overhead).

7 / 38

Minimize $\sum \hat{\varepsilon}_i^2$

- SSE stands for Sum of Squared Errors. We want to find the pair $(\hat{\alpha}, \hat{\beta})$ that minimizes $SSE(\hat{\alpha}, \hat{\beta}) := \sum_i (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2$.
- Thus, we set the partial derivatives of $SSE(\hat{\alpha}, \hat{\beta})$ with respect to $\hat{\alpha}$ and $\hat{\beta}$ equal to zero:
  - $\partial SSE(\hat{\alpha}, \hat{\beta}) / \partial \hat{\alpha} = \sum_i (-1)(2)(Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0$
  - $\partial SSE(\hat{\alpha}, \hat{\beta}) / \partial \hat{\beta} = \sum_i (-X_i)(2)(Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0$
- We now have two normal equations in two unknowns $\hat{\alpha}$ and $\hat{\beta}$. The solution is (derivation on board, p. 18 of script):
  - $\hat{\beta} = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
  - $\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$

8 / 38
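As a minimal illustration of these closed-form solutions, the estimates can be computed directly in R and compared with lm(); the x and y values below are made up for the sketch, not taken from the course data:

  x <- c(1, 2, 3, 4, 5)
  y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

  beta.hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope from the normal equations
  alpha.hat <- mean(y) - beta.hat * mean(x)                               # intercept from the normal equations

  c(alpha.hat, beta.hat)
  coef(lm(y ~ x))   # should give the same two numbers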

Properties of residuals
- $\sum \hat{\varepsilon}_i = 0$, since the regression line goes through the point $(\bar{X}, \bar{Y})$.
- $\sum X_i \hat{\varepsilon}_i = 0$ and $\sum \hat{Y}_i \hat{\varepsilon}_i = 0$. The residuals are uncorrelated with the independent variables $X_i$ and with the fitted values $\hat{Y}_i$.
- Least squares estimates are uniquely defined as long as the values of the independent variable are not all identical; if they are all identical, the denominator $\sum (X_i - \bar{X})^2 = 0$ (see figure on board).

9 / 38
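These properties can be checked numerically (a sketch, reusing the made-up x and y from the previous example):

  fit <- lm(y ~ x)
  sum(resid(fit))                  # ~ 0 (up to floating-point error): residuals sum to zero
  sum(x * resid(fit))              # ~ 0: residuals are orthogonal to X
  sum(fitted(fit) * resid(fit))    # ~ 0: residuals are orthogonal to the fitted values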

Regression in R
- model <- lm(y ~ x)
- summary(model)
- Coefficients: model$coef or coef(model) (alias: coefficients)
- Fitted mean values: model$fitted or fitted(model) (alias: fitted.values)
- Residuals: model$resid or resid(model) (alias: residuals)

10 / 38

R output - Davis data


> model <- lm(weight ~ repwt)
> summary(model)

Call:
lm(formula = weight ~ repwt)

Residuals:
    Min      1Q  Median      3Q     Max
-5.5248 -0.7526 -0.3654  0.6118  6.3841

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.77750    1.74441   1.019    0.311
repwt        0.97722    0.03053  32.009   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.057 on 99 degrees of freedom
Multiple R-Squared: 0.9119,     Adjusted R-squared: 0.911
F-statistic: 1025 on 1 and 99 DF,  p-value: < 2.2e-16

11 / 38

How good is the fit?


12 / 38

Residual standard error

- Residual standard error: $\hat{\sigma} = \sqrt{SSE/(n-2)} = \sqrt{\sum \hat{\varepsilon}_i^2/(n-2)}$.
- $n - 2$ is the degrees of freedom (we lose two degrees of freedom because we estimate the two parameters $\alpha$ and $\beta$).
- For the Davis data, $\hat{\sigma} \approx 2$. Interpretation:
  - On average, using the least squares regression line to predict weight from reported weight results in an error of about 2 kg.
  - If the residuals are approximately normal, then about 2/3 of them are in the range $\pm\hat{\sigma} \approx \pm 2$, and about 95% are in the range $\pm 2\hat{\sigma} \approx \pm 4$.

13 / 38
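A hedged sketch of this computation, assuming model is the Davis fit from the R output slide (the Davis data ship with the car/carData packages):

  n   <- length(resid(model))       # number of cases actually used in the fit
  SSE <- sum(resid(model)^2)        # sum of squared residuals
  sqrt(SSE / (n - 2))               # residual standard error "by hand" (about 2.057)
  summary(model)$sigma              # the value reported by summary(model)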

$R^2$

- We compare our fit to a null model $Y = \alpha' + \varepsilon'$, in which we don't use the independent variable X.
- We define the fitted value $\hat{Y}_i' = \hat{\alpha}'$, and the residual $\hat{\varepsilon}_i' = Y_i - \hat{Y}_i'$.
- We find $\hat{\alpha}'$ by minimizing $\sum (\hat{\varepsilon}_i')^2 = \sum (Y_i - \hat{\alpha}')^2$. This gives $\hat{\alpha}' = \bar{Y}$.
- Note that $\sum \hat{\varepsilon}_i^2 = \sum (Y_i - \hat{Y}_i)^2 \leq \sum (Y_i - \bar{Y})^2 = \sum (\hat{\varepsilon}_i')^2$ (why?).

14 / 38

$R^2$

- $TSS = \sum (\hat{\varepsilon}_i')^2 = \sum (Y_i - \bar{Y})^2$ is the total sum of squares: the sum of squared errors in the model that does not use the independent variable.
- $SSE = \sum \hat{\varepsilon}_i^2 = \sum (Y_i - \hat{Y}_i)^2$ is the sum of squared errors in the linear model.
- Regression sum of squares: $RegSS = TSS - SSE$ gives the reduction in squared error due to the linear regression.
- $R^2 = RegSS/TSS = 1 - SSE/TSS$ is the proportional reduction in squared error due to the linear regression.
- Thus, $R^2$ is the proportion of the variation in Y that is explained by the linear regression.
- $R^2$ has no units and doesn't change when the scale is changed.
- Good values of $R^2$ vary widely in different fields of application.

15 / 38
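The same quantities can be computed from any fitted simple regression; a sketch, again assuming the Davis model object from the earlier slide:

  y     <- fitted(model) + resid(model)   # reconstruct the observed responses Y_i
  TSS   <- sum((y - mean(y))^2)           # total sum of squares
  SSE   <- sum(resid(model)^2)            # sum of squared errors of the linear model
  RegSS <- TSS - SSE                      # regression sum of squares
  RegSS / TSS                             # R^2; compare with summary(model)$r.squared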

Analysis of variance
- $\sum (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) = 0$ (will be shown later geometrically)
- $RegSS = \sum (\hat{Y}_i - \bar{Y})^2$ (derivation on board)
- Hence, $TSS = SSE + RegSS$:
  $\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$
  This decomposition is called analysis of variance.

16 / 38

r
- Correlation coefficient $r = \pm\sqrt{R^2}$ (take the positive root if $\hat{\beta} > 0$ and the negative root if $\hat{\beta} < 0$).
- r gives the strength and direction of the relationship.
- Alternative formula: $r = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$.
- Using this formula, we can write $\hat{\beta} = r \dfrac{SD_Y}{SD_X}$ (derivation on board).
- In the eyeball regression, the steep line had slope $\dfrac{SD_Y}{SD_X}$, and the other line had the correct slope $r \dfrac{SD_Y}{SD_X}$.
- r is symmetric in X and Y.
- r has no units and doesn't change when the scale is changed.

17 / 38
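A quick numerical check of these relations (a sketch; x and y can be any paired numeric vectors, for example the made-up ones used after slide 8):

  r <- cor(x, y)
  r^2                           # equals R^2 of lm(y ~ x)
  r * sd(y) / sd(x)             # equals the least squares slope
  unname(coef(lm(y ~ x))[2])    # slope from lm(), for comparison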

Multiple linear regression


18 / 38

2 independent variables

- $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$. (see p. 9 of script)
- This describes a plane in the 3-dimensional space $\{X_1, X_2, Y\}$ (see figure):
  - $\alpha$ is the intercept
  - $\beta_1$ is the increase in Y associated with a one-unit increase in $X_1$ when $X_2$ is held constant
  - $\beta_2$ is the increase in Y for a one-unit increase in $X_2$ when $X_1$ is held constant.

19 / 38

Statistical error
- Data: $(X_{11}, X_{12}, Y_1), \ldots, (X_{n1}, X_{n2}, Y_n)$.
- $Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$, where $\varepsilon_i$ is the statistical error for the $i$th case.
- Thus, the observed value $Y_i$ equals $\alpha + \beta_1 X_{i1} + \beta_2 X_{i2}$, except that $\varepsilon_i$, an unknown random quantity, is added on.
- We make the same assumptions about $\varepsilon$ as before:
  - $E(\varepsilon_i) = 0$
  - $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $i = 1, \ldots, n$
  - $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$
- Compare to the assumptions on p. 14-16 of the script.

20 / 38

Estimates and residuals


- The population parameters $\alpha$, $\beta_1$, $\beta_2$, and $\sigma$ are unknown.
- We compute estimates of the population parameters: $\hat{\alpha}$, $\hat{\beta}_1$, $\hat{\beta}_2$ and $\hat{\sigma}$.
- $\hat{Y}_i = \hat{\alpha} + \hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2}$ is called the fitted value.
- $\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - (\hat{\alpha} + \hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2})$ is called the residual.
- The residuals are observable, and can be used to check assumptions on the statistical errors $\varepsilon_i$.
- Points above the plane have positive residuals, and points below the plane have negative residuals.
- A plane that fits the data well has small residuals.

21 / 38

Computing estimates
- The triple $(\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2)$ minimizes $SSE(\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2) = \sum \hat{\varepsilon}_i^2 = \sum (Y_i - \hat{\alpha} - \hat{\beta}_1 X_{i1} - \hat{\beta}_2 X_{i2})^2$.
- We can again take partial derivatives and set these equal to zero. This gives three equations in the three unknowns $\hat{\alpha}$, $\hat{\beta}_1$ and $\hat{\beta}_2$. Solving these normal equations gives the regression coefficients $\hat{\alpha}$, $\hat{\beta}_1$ and $\hat{\beta}_2$.
- Least squares estimates are unique unless one of the independent variables is invariant, or the independent variables are perfectly collinear.
- The same procedure works for p independent variables $X_1, \ldots, X_p$. However, it is then easier to use matrix notation (see board and section 1.3 of script).
- In R: model <- lm(y ~ x1 + x2)

22 / 38
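A minimal sketch of the matrix form of the normal equations, $\hat{\beta} = (X^T X)^{-1} X^T Y$, with made-up predictors (x1, x2 and y below are illustrative, not course data):

  x1 <- c(1, 2, 3, 4, 5, 6)
  x2 <- c(2, 1, 4, 3, 6, 5)
  y  <- c(3.1, 3.9, 7.2, 7.8, 11.1, 10.9)

  X <- cbind(1, x1, x2)                        # design matrix with an intercept column
  beta.hat <- solve(t(X) %*% X, t(X) %*% y)    # solve the normal equations
  beta.hat
  coef(lm(y ~ x1 + x2))                        # should agree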

Properties of residuals
- $\sum \hat{\varepsilon}_i = 0$
- The residuals $\hat{\varepsilon}_i$ are uncorrelated with the fitted values $\hat{Y}_i$ and with each of the independent variables $X_1, \ldots, X_p$.
- The standard error of the residuals $\hat{\sigma} = \sqrt{\sum \hat{\varepsilon}_i^2 / (n - p - 1)}$ gives the average size of the residuals.
- $n - p - 1$ is the degrees of freedom (we lose $p + 1$ degrees of freedom because we estimate the $p + 1$ parameters $\alpha, \beta_1, \ldots, \beta_p$).

23 / 38

$R^2$ and $\bar{R}^2$

- $TSS = \sum (Y_i - \bar{Y})^2$.
- $SSE = \sum (Y_i - \hat{Y}_i)^2 = \sum \hat{\varepsilon}_i^2$.
- $RegSS = TSS - SSE = \sum (\hat{Y}_i - \bar{Y})^2$.
- $R^2 = RegSS/TSS = 1 - SSE/TSS$ is the proportion of variation in Y that is captured by its linear regression on the X's.
- $R^2$ can never decrease when we add an extra variable to the model. Why?
- The adjusted (corrected) $\bar{R}^2 = 1 - \dfrac{SSE/(n-p-1)}{TSS/(n-1)}$ penalizes $R^2$ when there are extra variables in the model.
- $R^2$ and $\bar{R}^2$ differ very little if the sample size is large.

24 / 38
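A small demonstration of the "Why?" above (a sketch, reusing the made-up x1, x2 and y from the previous example): adding a predictor that is pure noise cannot lower $R^2$, but it can lower $\bar{R}^2$.

  set.seed(1)
  noise <- rnorm(length(y))                 # a predictor unrelated to y
  fit1  <- lm(y ~ x1 + x2)
  fit2  <- lm(y ~ x1 + x2 + noise)
  c(summary(fit1)$r.squared,     summary(fit2)$r.squared)       # R^2 does not decrease
  c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared)   # adjusted R^2 may decrease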

Ozone example
25 / 38

Ozone example

- Data from Sandberg, Basso, Okin (1978):
  - SF = Summer quarter maximum hourly average ozone reading in parts per million in San Francisco
  - SJ = Same, but for San Jose
  - YEAR = Year of ozone measurement
  - RAIN = Average winter precipitation in centimeters in the San Francisco Bay area for the preceding two winters
- Research question: How does SF depend on YEAR and RAIN?
- Think about the assumptions: Which one may be violated?

26 / 38

Ozone data
YEAR   RAIN    SF    SJ
1965   18.9   4.3   4.2
1966   23.7   4.2   4.8
1967   26.2   4.6   5.3
1968   26.6   4.7   4.8
1969   39.6   4.1   5.5
1970   45.5   4.6   5.6
1971   26.7   3.7   5.4
1972   19.0   3.1   4.6
1973   30.6   3.4   5.1
1974   34.1   3.4   3.7
1975   23.7   2.1   2.7
1976   14.6   2.2   2.1
1977    7.6   2.0   2.5

27 / 38
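For the examples that follow, the table can be entered directly in R (a sketch; the values are transcribed from the slide above):

  ozone <- data.frame(
    year = 1965:1977,
    rain = c(18.9, 23.7, 26.2, 26.6, 39.6, 45.5, 26.7, 19.0, 30.6, 34.1, 23.7, 14.6, 7.6),
    sf   = c(4.3, 4.2, 4.6, 4.7, 4.1, 4.6, 3.7, 3.1, 3.4, 3.4, 2.1, 2.2, 2.0),
    sj   = c(4.2, 4.8, 5.3, 4.8, 5.5, 5.6, 5.4, 4.6, 5.1, 3.7, 2.7, 2.1, 2.5)
  )
  model <- lm(sf ~ year + rain, data = ozone)   # the fit shown on the next slide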


R output
> model <- lm(sf ~ year + rain)
> summary(model)

Call:
lm(formula = sf ~ year + rain)

Residuals:
     Min       1Q   Median       3Q      Max
-0.61072 -0.20317  0.06129  0.16329  0.51992

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 388.412083  49.573690   7.835 1.41e-05 ***
year         -0.195703   0.025112  -7.793 1.48e-05 ***
rain          0.034288   0.009655   3.551  0.00526 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3224 on 10 degrees of freedom
Multiple R-Squared: 0.9089,     Adjusted R-squared: 0.8906
F-statistic: 49.87 on 2 and 10 DF,  p-value: 6.286e-06

28 / 38

Standardized coefficients

29 / 38

Standardized coefficients

- We often want to compare coefficients of different independent variables.
- When the independent variables are measured in the same units, this is straightforward.
- If the independent variables are not commensurable, we can perform a limited comparison by rescaling the regression coefficients in relation to a measure of variation:
  - using hinge spread
  - using standard deviations

30 / 38

Using hinge spread


- Hinge spread = interquartile range (IQR)
- Let $IQR_1, \ldots, IQR_p$ be the IQRs of $X_1, \ldots, X_p$.
- We start with $Y_i = \hat{\alpha} + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip} + \hat{\varepsilon}_i$.
- This can be rewritten as: $Y_i = \hat{\alpha} + \hat{\beta}_1 IQR_1 \dfrac{X_{i1}}{IQR_1} + \cdots + \hat{\beta}_p IQR_p \dfrac{X_{ip}}{IQR_p} + \hat{\varepsilon}_i$.
- Let $Z_{ij} = \dfrac{X_{ij}}{IQR_j}$, for $j = 1, \ldots, p$ and $i = 1, \ldots, n$.
- Let $\hat{\beta}_j^* = \hat{\beta}_j \, IQR_j$, $j = 1, \ldots, p$.
- Then we get $Y_i = \hat{\alpha} + \hat{\beta}_1^* Z_{i1} + \cdots + \hat{\beta}_p^* Z_{ip} + \hat{\varepsilon}_i$.
- $\hat{\beta}_j^* = \hat{\beta}_j \, IQR_j$ is called the standardized regression coefficient.

31 / 38
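A sketch of this rescaling for the ozone fit (assuming the ozone data frame and model defined after the data slide; R's IQR() may differ slightly from a hinge spread computed from the fourths):

  iqr <- with(ozone, c(IQR(year), IQR(rain)))   # interquartile ranges of the predictors
  coef(model)[c("year", "rain")] * iqr          # standardized coefficients beta_j * IQR_j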


Interpretation
- Interpretation: Increasing $Z_j$ by 1 and holding the other Z's ($k \neq j$) constant is associated, on average, with an increase of $\hat{\beta}_j^*$ in Y.
- Increasing $Z_j$ by 1 means that $X_j$ is increased by one IQR of $X_j$.
- So increasing $X_j$ by one IQR of $X_j$ and holding the other X's ($k \neq j$) constant is associated, on average, with an increase of $\hat{\beta}_j^*$ in Y.
- Ozone example:

  Variable   Coef.    Hinge spread   Stand. coef.
  Year       -0.196    6             -1.176
  Rain        0.034   11.6            0.394

32 / 38

Using st.dev.
- Let $S_Y$ be the standard deviation of Y, and let $S_1, \ldots, S_p$ be the standard deviations of $X_1, \ldots, X_p$.
- We start with $Y_i = \hat{\alpha} + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip} + \hat{\varepsilon}_i$.
- This can be rewritten as (derivation on board):
  $\dfrac{Y_i - \bar{Y}}{S_Y} = \hat{\beta}_1 \dfrac{S_1}{S_Y} \dfrac{X_{i1} - \bar{X}_1}{S_1} + \cdots + \hat{\beta}_p \dfrac{S_p}{S_Y} \dfrac{X_{ip} - \bar{X}_p}{S_p} + \dfrac{\hat{\varepsilon}_i}{S_Y}$.
- Let $Z_{iY} = \dfrac{Y_i - \bar{Y}}{S_Y}$ and $Z_{ij} = \dfrac{X_{ij} - \bar{X}_j}{S_j}$, for $j = 1, \ldots, p$.
- Let $\hat{\beta}_j^* = \hat{\beta}_j \dfrac{S_j}{S_Y}$ and $\hat{\varepsilon}_i^* = \dfrac{\hat{\varepsilon}_i}{S_Y}$.
- Then we get $Z_{iY} = \hat{\beta}_1^* Z_{i1} + \cdots + \hat{\beta}_p^* Z_{ip} + \hat{\varepsilon}_i^*$.
- $\hat{\beta}_j^* = \hat{\beta}_j \dfrac{S_j}{S_Y}$ is called the standardized regression coefficient.

33 / 38
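The corresponding sketch using standard deviations (same assumed ozone objects as before; the results may differ slightly from the table on the next slide, depending on how the standard deviations there were computed):

  with(ozone, coef(model)[c("year", "rain")] * c(sd(year), sd(rain)) / sd(sf))   # beta_j * S_j / S_Y
  coef(lm(scale(sf) ~ scale(year) + scale(rain), data = ozone))[-1]              # same via refitting on standardized variables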

Interpretation
- Interpretation: Increasing $Z_j$ by 1 and holding the other Z's ($k \neq j$) constant is associated, on average, with an increase of $\hat{\beta}_j^*$ in $Z_Y$.
- Increasing $Z_j$ by 1 means that $X_j$ is increased by one SD of $X_j$.
- Increasing $Z_Y$ by 1 means that Y is increased by one SD of Y.
- So increasing $X_j$ by one SD of $X_j$ and holding the other X's ($k \neq j$) constant is associated, on average, with an increase of $\hat{\beta}_j^*$ SDs of Y in Y.

34 / 38


Ozone example
- Ozone example:

  Variable   Coef.    St.dev(variable)   Stand. coef.
  Year       -0.196    3.99              -0.783
  Rain        0.034   10.39               0.353

  (Stand. coef. = Coef. x St.dev(variable) / St.dev(Y))

- Both methods (using hinge spread or standard deviations) only allow for a very limited comparison. They both assume that predictors with a large spread are more important, and that need not be the case.

35 / 38

Summary
36 / 38

Summary

- Linear statistical model: $Y = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$.
- We assume that the statistical errors have mean zero, constant standard deviation $\sigma$, and are uncorrelated.
- The population parameters $\alpha, \beta_1, \ldots, \beta_p$ and $\sigma$ cannot be observed. Also the statistical errors cannot be observed.
- We define the fitted value $\hat{Y}_i = \hat{\alpha} + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip}$ and the residual $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$. We can use the residuals to check the assumptions about the statistical errors.
- We compute estimates $\hat{\alpha}, \hat{\beta}_1, \ldots, \hat{\beta}_p$ for $\alpha, \beta_1, \ldots, \beta_p$ by minimizing the residual sum of squares $SSE = \sum \hat{\varepsilon}_i^2 = \sum (Y_i - (\hat{\alpha} + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip}))^2$.
- Interpretation of the coefficients?

37 / 38

Summary
- To measure how good the fit is, we can use:
  - the residual standard error $\hat{\sigma} = \sqrt{SSE/(n - p - 1)}$
  - the multiple correlation coefficient $R^2$
  - the adjusted multiple correlation coefficient $\bar{R}^2$
  - the correlation coefficient r
- Analysis of variance (ANOVA): $TSS = SSE + RegSS$
- Standardized regression coefficients

38 / 38

