
Regression

(Module 5)
Statistics (MAST20005) & Elements of Statistics (MAST90058)
Semester 2, 2018

Contents
1 Introduction 1

2 Regression 2

3 Simple linear regression 4


3.1 Point estimation of the mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.2 Interlude: Analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3 Point estimation of the variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.4 Standard errors of the estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.5 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.6 Prediction intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.7 R examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.8 Model checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Further regression models 15

5 Correlation 16
5.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.2 Point estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.3 Relationship to regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.4 Confidence interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.5 R example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

Aims of this module


• Introduce the concept of regression
• Show a simple model for studying the relationship between two variables
• Discuss correlation and how it relates to regression

1 Introduction
Relationships between two variables

We have studied how to do estimation for some simple scenarios:


• iid samples from a single distribution (Xi )
• comparing iid samples from two different distributions (Xi & Yj )
• differences between paired measurements (Xi − Yi )
We now consider how to analyse bivariate data more generally, i.e. two variables, X and Y , measured at the same
time, i.e. as a pair.
The data consist of pairs of data points, (xi , yi ).
These can be visualised using a scatter plot.

Example data

xi yi
1.80 9.18
1.40 7.66
2.10 6.33
0.30 4.51
3.60 14.04
0.70 4.94
1.10 4.24
2.10 8.19
0.90 4.55
3.80 11.57

n = 10

[Figure: scatter plot of the example data, y against x.]

2 Regression
Regression

Often interested in how Y depends on X. For example, we might want to use X to predict Y .
In such a setting, we will assume that the X values are known and fixed (henceforth, x instead of X), and look at
how Y varies given x.
Example: Y is a student’s final mark for Statistics, and x is their mark for the prerequisite subject Probability. Does
x help to predict Y ?
The regression of Y on x is the conditional mean, E(Y | x) = µ(x).
The regression can take any form. We consider simple linear regression, which has the form of a straight line:

E(Y | x) = α + βx and var(Y | x) = σ².

Example: simple linear regression model

E(Y | x) = α + βx
var(Y | x) = σ²

[Figure: the example data with a simple linear regression line for the mean, E(Y | x) = α + βx.]

Terminology
• Y is called a response variable. Can also be called an outcome or target variable. Please do not call it the
‘dependent’ variable.
• x is called a predictor variable. Can also be called an explanatory variable. Please do not call it an ‘independent’
variable.
• µ(x) is called the (linear) predictor function or sometimes the regression curve or the model equation.
• The parameters in the predictor function are called regression coefficients.

Why ‘regression’?

It is strange terminology, but it has stuck.


Refers to the idea of ‘regression to the mean’: if a variable is extreme on its first measurement, it will tend to be
closer to the average on its second measurement, and vice versa.
First described by Sir Francis Galton when studying the inheritance of height between fathers and sons. In doing so,
he invented the technique of simple linear regression.

Linearity

A regression model is called linear if it is linear in the coefficients.


It doesn’t have to define a straight line!
Complex and non-linear functions of x are allowed, as long as the resulting predictor function is a linear combination
(i.e. an additive function) of them, with the coefficients ‘out the front’.
For example, the following are linear models:

µ(x) = α + βx + γx²
µ(x) = α/x + β/x²
µ(x) = α sin x + β log x

The following are NOT linear models:

µ(x) = α sin(βx)
µ(x) = α / (1 + βx)
µ(x) = αx^β

. . . but the last one can be re-expressed as a linear model on a log scale (by taking logs of both sides),

µ*(x) = α* + β log x
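For instance, both kinds of linear model above can be fitted in R with lm(). The following is only a sketch (it is not part of the notes' R transcript); it assumes vectors x and y of positive values are available, and the object names are illustrative.

> # Linear in the coefficients, but not a straight line in x:
> # mu(x) = alpha + beta*x + gamma*x^2
> fit.quad <- lm(y ~ x + I(x^2))

> # The power model mu(x) = alpha * x^beta, after taking logs of both sides
> # (this assumes x and y are positive):
> fit.power <- lm(log(y) ~ log(x))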

3 Simple linear regression


Estimation goals

Back to our simple linear regression model:


E(Y | x) = α + βx and var(Y | x) = σ 2 .
• We wish to estimate the slope (β), the intercept (α), the variance of the errors (σ 2 ), their standard errors and
construct confidence intervals for these quantities.
• Often want to use the fitted model to make predictions about future observations (i.e. predict Y for a new x).
• Note: the Yi are not iid. They are independent but have different means, since they depend on xi .
• We have not (yet) assumed any specific distribution for Y , only a conditional mean and variance.

Reparameterisation

Changing our model slightly. . .


Let α0 = α + β x̄, which gives:
E(Y | x) = α + βx
= α0 + β(x − x̄)

Now our model is in terms of α0 and β.


This will make calculations and proofs simpler.

3.1 Point estimation of the mean

Least squares estimation

Choose α0 and β to minimise the sum of squared deviations:

H(α0, β) = ∑ᵢ₌₁ⁿ (yi − α0 − β(xi − x̄))²

Solve this by finding the partial derivatives and setting to zero:


0 = ∂H(α0, β)/∂α0 = 2 ∑ᵢ₌₁ⁿ [yi − α0 − β(xi − x̄)] (−1)

0 = ∂H(α0, β)/∂β = 2 ∑ᵢ₌₁ⁿ [yi − α0 − β(xi − x̄)] (−(xi − x̄))

These are called the normal equations.

Least squares estimators

Some algebra yields the least square estimators,


α̂0 = Ȳ,    β̂ = ∑ᵢ₌₁ⁿ (xi − x̄)Yi / ∑ᵢ₌₁ⁿ (xi − x̄)².

Another expression for β̂ is:

β̂ = ∑ᵢ₌₁ⁿ (xi − x̄)(Yi − Ȳ) / ∑ᵢ₌₁ⁿ (xi − x̄)².

These are equivalent, due to the following result:

∑ᵢ₌₁ⁿ (xi − x̄)(Yi − Ȳ) = ∑ᵢ₌₁ⁿ (xi − x̄)Yi.

Can also then get an estimator for α:


α̂ = α̂0 − β̂ x̄
= Ȳ − β̂ x̄.

And also an estimator for the predictor function,


µ̂(x) = α̂ + β̂x
= α̂0 + β̂(x − x̄)
= Ȳ + β̂(x − x̄).

Ordinary least squares

This method is sometimes called ordinary least squares or OLS.


Other variants of least squares estimation exist, with different names. For example, ‘weighted least squares’.

Example: least squares estimates

For our data:


x̄ = 1.78
ȳ = 7.52 = α̂0
α̂ = 2.91
β̂ = 2.59
The fitted model equation is then:
µ̂(x) = 2.91 + 2.59x

> rbind(y, x)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
y 9.18 7.66 6.33 4.51 14.04 4.94 4.24 8.19 4.55 11.57
x 1.80 1.40 2.10 0.30 3.60 0.70 1.10 2.10 0.90 3.80

> model1 <- lm(y ~ x)


> model1

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
2.911 2.590
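
The same estimates can be computed directly from the formulas above. The following is a sketch (not part of the original transcript), assuming the vectors x and y from the example data; the object names K, beta.hat, etc. are illustrative.

> K <- sum((x - mean(x))^2)                 # K = sum of (x_i - xbar)^2 = 12.34
> beta.hat <- sum((x - mean(x)) * y) / K    # slope estimate
> alpha0.hat <- mean(y)                     # estimate of alpha_0
> alpha.hat <- alpha0.hat - beta.hat * mean(x)
> c(alpha.hat, beta.hat)                    # agrees with coef(model1)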

Properties of these estimators

What do we know about these estimators?


They are all linear combinations of the Yi,

α̂0 = ∑ᵢ₌₁ⁿ (1/n) Yi

β̂ = ∑ᵢ₌₁ⁿ ((xi − x̄)/K) Yi

where K = ∑ᵢ₌₁ⁿ (xi − x̄)².
This allows us to easily calculate means and variances.
Means?

E(α̂0) = E(Ȳ) = (1/n) ∑ᵢ₌₁ⁿ E(Yi) = (1/n) ∑ᵢ₌₁ⁿ [α0 + β(xi − x̄)] = α0

E(β̂) = ∑ᵢ₌₁ⁿ ((xi − x̄)/K) E(Yi) = (1/K) ∑ᵢ₌₁ⁿ (xi − x̄)(α0 + β(xi − x̄))
     = (1/K) ∑ᵢ₌₁ⁿ (xi − x̄) α0 + (K/K) β = β

This also implies E(α̂) = α and E(µ̂(x)) = µ(x), so all of the estimators are unbiased.
Variances?

var(α̂0) = var(Ȳ) = (1/n²) ∑ᵢ₌₁ⁿ var(Yi) = σ²/n

var(β̂) = var(∑ᵢ₌₁ⁿ ((xi − x̄)/K) Yi) = ∑ᵢ₌₁ⁿ ((xi − x̄)/K)² var(Yi)
       = (1/K²) ∑ᵢ₌₁ⁿ (xi − x̄)² var(Yi)
       = (1/K²) K σ²
       = σ²/K

Similarly,

var(α̂) = (1/n + x̄²/K) σ²
cov(α̂0, β̂) = 0
var(µ̂(x)) = (1/n + (x − x̄)²/K) σ²

Can we get their standard errors?

We need an estimate of σ².

3.2 Interlude: Analysis of variance

Analysis of variance: iid model

For Xi ∼ N(µ, σ²) iid,

∑ᵢ₌₁ⁿ (Xi − µ)² = ∑ᵢ₌₁ⁿ (Xi − X̄)² + n(X̄ − µ)²

Analysis of variance: regression model

∑ᵢ₌₁ⁿ (Yi − α0 − β(xi − x̄))²
= ∑ᵢ₌₁ⁿ (Yi − α̂0 − β̂(xi − x̄) + α̂0 + β̂(xi − x̄) − α0 − β(xi − x̄))²
= ∑ᵢ₌₁ⁿ (Yi − α̂0 − β̂(xi − x̄) + (α̂0 − α0) + (β̂ − β)(xi − x̄))²
= ∑ᵢ₌₁ⁿ (Yi − α̂0 − β̂(xi − x̄))² + n(α̂0 − α0)² + K(β̂ − β)²

Note that the cross-terms disappear. Let’s see...

The cross-terms. . .

t1 = 2 ∑ᵢ₌₁ⁿ (Yi − α̂0 − β̂(xi − x̄))(α̂0 − α0)
t2 = 2 ∑ᵢ₌₁ⁿ (Yi − α̂0 − β̂(xi − x̄))(β̂ − β)(xi − x̄)
t3 = 2 ∑ᵢ₌₁ⁿ (xi − x̄)(β̂ − β)(α̂0 − α0)

Since ∑ᵢ₌₁ⁿ (xi − x̄) = 0 and ∑ᵢ₌₁ⁿ (Yi − α̂0) = ∑ᵢ₌₁ⁿ (Yi − Ȳ) = 0, the first and third cross-terms are easily shown to be zero.

For the second term,

t2 / (2(β̂ − β)) = ∑ᵢ₌₁ⁿ (Yi − Ȳ)(xi − x̄) − β̂ ∑ᵢ₌₁ⁿ (xi − x̄)²
               = ∑ᵢ₌₁ⁿ (Yi − Ȳ)(xi − x̄) − β̂K
               = ∑ᵢ₌₁ⁿ Yi(xi − x̄) − ∑ᵢ₌₁ⁿ Yi(xi − x̄)
               = 0

Therefore, all the cross-terms are zero.

Back to the analysis of variance formula. . .

∑ᵢ₌₁ⁿ (Yi − α0 − β(xi − x̄))² = ∑ᵢ₌₁ⁿ (Yi − α̂0 − β̂(xi − x̄))² + n(α̂0 − α0)² + K(β̂ − β)²

Taking expectations gives,

nσ² = E(D²) + σ² + σ²
⇒ E(D²) = (n − 2)σ²

where

D² = ∑ᵢ₌₁ⁿ (Yi − α̂0 − β̂(xi − x̄))².

3.3 Point estimation of the variance

Variance estimator

Based on these results, we have an unbiased estimator of the variance,


σ̂² = D²/(n − 2).

The inferred mean for each observation is called its fitted value, Ŷi = α̂0 + β̂(xi − x̄).
The deviation from each fitted value is called a residual, Ri = Yi − Ŷi.
The variance estimator is based on the sum of squared residuals, D² = ∑ᵢ₌₁ⁿ Ri².

Example: variance estimate

For our data:


d² = 16.12
σ̂² = 2.015
σ̂ = 1.42
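
These can be reproduced in R from the fitted model. A sketch (not part of the original transcript), assuming model1 has been fitted as above; the object names are illustrative.

> n <- length(y)
> d2 <- sum(residuals(model1)^2)    # sum of squared residuals, D^2
> sigma2.hat <- d2 / (n - 2)        # unbiased estimate of sigma^2
> sigma.hat <- sqrt(sigma2.hat)     # the 'residual standard error' in summary(model1)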

3.4 Standard errors of the estimates

Standard errors

We can substitute σ̂² into the formulae for the standard deviations of the estimators in order to calculate standard errors.

For example,

var(β̂) = σ²/K   ⇒   se(β̂) = σ̂/√K

Example: standard errors

For our data:

se(α̂0) = σ̂/√n = 0.449
se(β̂) = σ̂/√K = 0.404
se(µ̂(x)) = σ̂ √(1/n + (x − x̄)²/K) = 1.42 × √(1/10 + (x − 1.78)²/12.34)
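
In R, a sketch of the same calculations, assuming n, K, sigma.hat and x from the previous sketches (these names are illustrative, not part of the notes):

> se.alpha0 <- sigma.hat / sqrt(n)    # se of alpha_0 hat
> se.beta <- sigma.hat / sqrt(K)      # se of beta hat
> se.mu <- function(x0) sigma.hat * sqrt(1/n + (x0 - mean(x))^2 / K)
> se.mu(3)                            # se of mu hat at x = 3, about 0.667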

3.5 Confidence intervals

Maximum likelihood estimation

Want to also construct confidence intervals. This requires further assumptions about the population distribution.
Let’s assume a normal distribution:

Yi ∼ N(α + βxi, σ²).

Alternative notation (commonly used for regression/linear models):

Yi = α + βxi + εi,   where εi ∼ N(0, σ²).

Let’s maximise the likelihood. . .


Since the Yi’s are independent, the likelihood is:

L(α, β, σ²) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp(−(yi − α − βxi)²/(2σ²))
            = (1/√(2πσ²))ⁿ exp(−∑ᵢ₌₁ⁿ (yi − α0 − β(xi − x̄))²/(2σ²))

−ln L(α, β, σ²) = (n/2) ln(2πσ²) + (1/(2σ²)) ∑ᵢ₌₁ⁿ (yi − α0 − β(xi − x̄))²
               = (n/2) ln(2πσ²) + (1/(2σ²)) H(α0, β)

The α0 and β that maximise the likelihood (i.e. minimise the negative log-likelihood) are the same as those that minimise the sum of squares, H.
The OLS estimates are the same as the MLEs!
What about σ²?

Differentiate by σ, set to zero, solve. . .

σ̂²_MLE = D²/n

This is biased. Prefer to use the previous, unbiased estimator,

σ̂² = D²/(n − 2)
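
As an illustration of this equivalence (a sketch, not from the notes), the likelihood can be maximised numerically and compared with the least squares fit, assuming x and y from the example data:

> negloglik <- function(theta) {
+   # theta = (alpha, beta, log(sigma)); the log scale keeps sigma positive
+   mu <- theta[1] + theta[2] * x
+   -sum(dnorm(y, mean = mu, sd = exp(theta[3]), log = TRUE))
+ }
> mle <- optim(c(0, 0, 0), negloglik)
> mle$par[1:2]        # approximately equal to coef(model1)
> exp(mle$par[3])^2   # MLE of sigma^2, with divisor n (smaller than sigma2.hat)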

Sampling distributions

The Y1 , · · · , Yn are independent normally distributed random variables.


Except for σ̂², our estimators are linear combinations of the Yi so will also have normal distributions, with mean and variance as previously derived.

For example,

β̂ ∼ N(β, σ²/K).

Moreover, we know α̂0 and β̂ are independent, because they are normal rvs with zero covariance.

Using the analysis of variance decomposition (from earlier), we can show that,

(n − 2)σ̂²/σ² ∼ χ²ₙ₋₂.

Therefore, we can define pivots for the various mean parameters. For example,

(β̂ − β) / (σ̂/√K) ∼ tₙ₋₂

and

(µ̂(x) − µ(x)) / (σ̂ √(1/n + (x − x̄)²/K)) ∼ tₙ₋₂

This allows us to construct confidence intervals.

Example: confidence intervals

For our data, a 95% CI for β is:

β̂ ± c σ̂/√K = 2.59 ± 2.31 × 0.404 = (1.66, 3.52)

where c is the 0.975 quantile of tₙ₋₂.

A 95% CI for µ(3) is:

µ̂(3) ± c × se(µ̂(3)) = 10.68 ± 2.31 × 0.667 = (9.14, 12.22)
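
These intervals can also be computed by hand in R; a sketch, using quantities from the earlier sketches (n, K, alpha.hat, beta.hat, sigma.hat, se.beta and se.mu are illustrative names, not part of the notes):

> cc <- qt(0.975, df = n - 2)            # 0.975 quantile of t with n-2 df
> beta.hat + c(-1, 1) * cc * se.beta     # 95% CI for beta; compare confint(model1)
> mu.hat <- function(x0) alpha.hat + beta.hat * x0
> mu.hat(3) + c(-1, 1) * cc * se.mu(3)   # 95% CI for mu(3)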

3.6 Prediction intervals

Deriving prediction intervals

Use the same trick as we used for the simple model,

Y* ∼ N(µ(x*), σ²)

µ̂(x*) ∼ N(µ(x*), (1/n + (x* − x̄)²/K) σ²)

Y* − µ̂(x*) ∼ N(0, (1 + 1/n + (x* − x̄)²/K) σ²)

A 95% PI for Y* is given by:

µ̂(x*) ± c σ̂ √(1 + 1/n + (x* − x̄)²/K)

Example: prediction interval

A 95% PI for Y* corresponding to x* = 3 is:

10.68 ± 2.31 × 1.42 × √(1 + 1/10 + (3 − 1.78)²/12.34) = (7.06, 14.30)

Much wider than the corresponding CI, as we’ve seen previously.
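
Continuing the sketch from the confidence interval example (same illustrative object names), the prediction interval can be computed directly:

> # 95% PI for a new Y at x* = 3
> mu.hat(3) + c(-1, 1) * cc * sigma.hat * sqrt(1 + 1/n + (3 - mean(x))^2 / K)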

3.7 R examples
> model1 <- lm(y ~ x)
> summary(model1)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-2.01970 -1.05963 0.02808 1.04774 1.80580

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9114 0.8479 3.434 0.008908 **
x 2.5897 0.4041 6.408 0.000207 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.419 on 8 degrees of freedom


Multiple R-squared: 0.8369,Adjusted R-squared: 0.8166
F-statistic: 41.06 on 1 and 8 DF, p-value: 0.0002074
> # Confidence intervals for mean parameters
> confint(model1)
2.5 % 97.5 %
(Intercept) 0.9560629 4.866703
x 1.6577220 3.521623

> # Data to use for prediction.


> data2 <- data.frame(x = 3)

> # Confidence interval for mu(3).


> predict(model1, newdata = data2, interval = "confidence")
fit lwr upr
1 10.6804 9.142823 12.21798

> # Prediction interval for y when x = 3.


> predict(model1, newdata = data2, interval = "prediction")
fit lwr upr
1 10.6804 7.064 14.2968

R example explained
• The lm (linear model) command fits the model.
• model1 is an object that contains all the results of the regression needed for later calculations.
• summary(model1) acts on model1 and summarizes the regression.
• predict can calculate CIs and PIs.
• R provides more detail than we need at the moment. Much of the output relates to hypothesis testing that we
will get to later.

Plot data and fitted model


> plot(x, y, col = "blue")
> abline(model1, col = "blue")

The command abline(model1) adds the fitted line to a plot.
[Figure: scatter plot of the data with the fitted regression line.]

Fitted values and CIs for their means


> predict(model1, interval = "confidence")
fit lwr upr
1 7.572793 6.537531 8.608056
2 6.536924 5.442924 7.630925
3 8.349695 7.272496 9.426895
4 3.688285 1.963799 5.412771
5 12.234204 10.247160 14.221248
6 4.724154 3.280382 6.167925
7 5.760023 4.546338 6.973707
8 8.349695 7.272496 9.426895
9 5.242088 3.921478 6.562699
10 12.752138 10.603796 14.900481

Confidence band for the mean


> data3 <- data.frame(x = seq(-1, 5, 0.05))
> y.conf <- predict(model1, data3, interval = "confidence")
> head(cbind(data3, y.conf))
x fit lwr upr
1 -1.00 0.3217104 -2.468232 3.111653
2 -0.95 0.4511941 -2.295531 3.197919
3 -0.90 0.5806777 -2.122943 3.284298
4 -0.85 0.7101613 -1.950472 3.370794
5 -0.80 0.8396449 -1.778124 3.457414
6 -0.75 0.9691286 -1.605906 3.544164

> matplot(data3$x, y.conf, type = "l", lty = c(1, 2, 2),


+ lwd = 2, xlab = "x", ylab = "y")
> points(x, y, col = "blue")

[Figure: fitted line with the 95% confidence band for the mean, with the data points overlaid.]

Prediction bands for new observations


> y.pred <- predict(model1, data3, interval = "prediction")
> head(cbind(data3, y.pred))
x fit lwr upr
1 -1.00 0.3217104 -3.979218 4.622639
2 -0.95 0.4511941 -3.821827 4.724215
3 -0.90 0.5806777 -3.664763 4.826119
4 -0.85 0.7101613 -3.508034 4.928357
5 -0.80 0.8396449 -3.351646 5.030936
6 -0.75 0.9691286 -3.195606 5.133863

> matplot(data3$x, y.pred, type = "l", lty = c(1, 3, 3),


+ lwd = 2, xlab = "x", ylab = "y")
> points(x, y, col = "blue")
[Figure: fitted line with the 95% prediction bands, with the data points overlaid.]

Both bands plotted together
> matplot(data3$x, cbind(y.conf, y.pred[, -1]), type = "l",
+   lty = c(1, 2, 2, 3, 3), lwd = 2, xlab = "x", ylab = "y")
> points(x, y, col = "blue")
[Figure: fitted line with both the 95% confidence band and the 95% prediction band, data points overlaid.]

3.8 Model checking

Checking our assumptions

What modelling assumptions have we made?


• Linear model for the mean
• Equal variances for all observations (homoscedasticity)
• Normally distributed residuals
Ways to check these:
• Plot the data and fitted model together (done!)
• Plot residuals vs fitted values
• QQ plot of the residuals
In R, the last two of these are very easy to do:
> plot(model1, 1:2)

[Figure: diagnostic plots for model1. Left: Residuals vs Fitted. Right: Normal Q−Q plot of standardized residuals against theoretical quantiles.]

4 Further regression models


Multiple regression
• What if we have more than one predictor?
• Observe xi1 , . . . , xik as well as yi (for each i)
• Can fit a multiple regression model:

E(Y | x1 , . . . , xk ) = β0 + β1 x1 + β2 x2 + · · · + βk xk

• This is linear in the coefficients, so is still a linear model


• Fit by the method of least squares, minimising:

  H = ∑ᵢ₌₁ⁿ (yi − β0 − β1 xi1 − β2 xi2 − · · · − βk xik)²

• Take partial derivatives, etc., and solve for β0, . . . , βk.


• The subject Linear Statistical Models (MAST30025) looks into these types of models in much more detail.

Two-sample problem
• The two-sample problem can be expressed as a linear model!
• Sample Y1 , . . . , Yn ∼ N(µ1 , σ 2 ) and Yn+1 , . . . , Yn+m ∼ N(µ2 , σ 2 ).
• Define indicator variables (xi1 , xi2 ) where (xi1 , xi2 ) = (1, 0) for i = 1, . . . , n and (xi1 , xi2 ) = (0, 1) for i =
n + 1, . . . , n + m.
• Observed data: (yi , xi1 , xi2 )
• Then Y1 , . . . , Yn each have mean 1 × β1 + 0 × β2 = µ1 and Yn+1 , . . . , Yn+m each have mean 0 × β1 + 1 × β2 = µ2 .
• This is in the form of a multiple regression model (a short R sketch of both setups is given below).
• The general linear model unifies many different types of models together into a common framework. The subject
MAST30025 covers this in more detail.
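
A brief sketch in R of both setups described above (the data values and object names here are made up for illustration and are not from the notes):

> # Multiple regression with two predictors:
> x1 <- c(1, 2, 3, 4, 5, 6)
> x2 <- c(2, 1, 4, 3, 6, 5)
> y2 <- c(3.1, 4.0, 8.2, 8.9, 13.1, 13.8)
> lm(y2 ~ x1 + x2)

> # Two-sample problem as a linear model; 'group' becomes indicator variables:
> w <- c(5.1, 4.8, 5.3, 6.2, 6.0, 6.4)
> group <- factor(rep(c("A", "B"), each = 3))
> lm(w ~ group - 1)   # the two coefficients are the group means (beta_1, beta_2)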

5 Correlation

5.1 Definitions

Correlation coefficient

(Revision) For two rvs X and Y, the correlation coefficient, or simply the correlation, is defined as:

ρ = ρXY = cov(X, Y) / √(var(X) var(Y)) = σXY / (σX σY)

This is a quantitative measure of the strength of the relationship, or association, between X and Y.


We will now consider inference on ρ, based on an iid sample of pairs (Xi, Yi).

Note: unlike in regression, X is now considered as a random variable.

[Figure: example scatter plots of bivariate samples with correlations ρ = −1, −0.75, −0.5, −0.25, 0, +0.25, +0.5, +0.75, +1.]

5.2 Point estimation

Sample covariance

To estimate cov(X, Y ) we use the sample covariance:


SXY = (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) = (1/(n − 1)) (∑ᵢ₌₁ⁿ Xi Yi − n X̄ Ȳ)

You can check that this is unbiased, E(SXY) = σXY = cov(X, Y).

Sample correlation coefficient

To estimate ρ we use the sample correlation coefficient (also known as Pearson’s correlation coefficient):
R = RXY = SXY / (SX SY) = ∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) / √(∑ᵢ₌₁ⁿ (Xi − X̄)² ∑ᵢ₌₁ⁿ (Yi − Ȳ)²)

You can check that |R| ≤ 1, just like |ρ| ≤ 1.

This gives a point estimate of ρ.
For further results, we make some more assumptions. . .
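
A sketch of these estimates in R, treating the regression example data (x, y) as the sample of pairs:

> sxy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
> sxy                         # sample covariance; same as cov(x, y)
> r <- sxy / (sd(x) * sd(y))
> r                           # sample correlation; same as cor(x, y)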

5.3 Relationship to regression

Bivariate normal

Assume X and Y have correlation ρ and follow a bivariate normal distribution,

(X, Y) ∼ N₂, with means (µX, µY), variances (σX², σY²) and covariance cov(X, Y) = ρσXσY.

In this case, the regressions are linear,

E(X | y) = µX + (ρσX/σY)(y − µY) = α′ + β′y
E(Y | x) = µY + (ρσY/σX)(x − µX) = α + βx

Note: β′ ≠ 1/β

[Figure: the two regression lines, E(Y | x) and E(X | y), for a bivariate normal distribution.]
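
One way to see that β′ ≠ 1/β is to simulate from a bivariate normal and fit both regressions; a sketch (not in the notes), using the MASS package:

> library(MASS)
> set.seed(1)
> Sigma <- matrix(c(1, 0.6, 0.6, 1), 2, 2)      # rho = 0.6, unit variances
> xy <- mvrnorm(1000, mu = c(0, 0), Sigma = Sigma)
> coef(lm(xy[, 2] ~ xy[, 1]))[2]   # slope of Y on x: approximately rho = 0.6
> coef(lm(xy[, 1] ~ xy[, 2]))[2]   # slope of X on y: also approximately 0.6, not 1/0.6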

Variance explained

An alternative analysis of variance decomposition:


∑ᵢ (Yi − Ȳ)² = ∑ᵢ (Yi − α̂ − β̂xi)² + β̂² ∑ᵢ (xi − x̄)²
            = (1 − R²) ∑ᵢ (Yi − Ȳ)² + R² ∑ᵢ (Yi − Ȳ)²

This implies that R² is the proportion of the variation in Y ‘explained’ by x.

In this usage, R² is called the coefficient of determination.

Remarks
• For simple linear regression, the coefficient of determination is the same as the square of the sample correlation, with both being denoted by R².
• Also, the proportion of Y explained by x is the same as the proportion of X explained by y. Both are equal to R², which is symmetric in X and Y.
• For more complex models, the coefficient of determination is more complicated: it needs to be calculated using
all predictor variables together.

5.4 Confidence interval

Approximate sampling distribution

Define:

g(r) = (1/2) ln((1 + r)/(1 − r))

This function has a standard name, g(r) = artanh(r), and so does its inverse, g⁻¹(r) = tanh(r). The function g(r) is also known as the Fisher transformation.

The following is a widely used approximation:

g(R) ≈ N(g(ρ), 1/(n − 3))

Example: correlation

For our data:

r = 0.91
r² = 0.84

An approximate 95% CI for g(ρ) is:

g(r) ± c/√(n − 3) = 1.56 ± 1.96 × 0.378 = (0.819, 2.30)

where c = Φ⁻¹(1 − α/2). Transforming this to an approximate 95% CI for ρ:

(tanh(0.819), tanh(2.30)) = (0.67, 0.98)
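
The same calculation can be done directly in R; a sketch using the example data:

> r <- cor(x, y)
> n <- length(x)
> g <- atanh(r)                                    # Fisher transformation
> ci.g <- g + c(-1, 1) * qnorm(0.975) / sqrt(n - 3)
> tanh(ci.g)           # approximate 95% CI for rho; compare with cor.test(x, y)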

5.5 R example
> cor(x, y)
[1] 0.9148421

> cor(x, y)^2


[1] 0.836936

> cor.test(x, y)

Pearson’s product-moment correlation

data: x and y
t = 6.4078, df = 8, p-value = 0.0002074
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6726924 0.9799873
sample estimates:
cor
0.9148421

> model1 <- lm(y ~ x)
> summary(model1)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-2.01970 -1.05963 0.02808 1.04774 1.80580

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9114 0.8479 3.434 0.008908 **
x 2.5897 0.4041 6.408 0.000207 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.419 on 8 degrees of freedom


Multiple R-squared: 0.8369,Adjusted R-squared: 0.8166
F-statistic: 41.06 on 1 and 8 DF, p-value: 0.0002074
