
What is Modeling?

Given paired observations of two quantities m and l:

     m      l
    1.0   15.0
    1.5   17.0
    2.0   18.0
    2.5   19.5
    3.0   21.0

we can ask: what straight line m̂ = a + b·l best describes how m
depends on l? Fitting the line to the data gives a = −4 and b = 0.33,
so the model is m̂ = −4 + 0.33·l, and for a new value l = 20.7 it
predicts m̂ = −4 + 0.33(20.7) ≈ 2.9.

[Scatter plot of the data with the fitted line m̂ = a + b·l]
Simple Linear Regression

• The equation that describes how y is related to x and


an error term is called the regression model.
• The simple linear regression model is: y = β0 + β1x +ε
where:
β0 and β1 are called parameters of the model, ε is a
random variable called the error term.
Simple Linear Regression

• The simple linear regression equation is: E(y) = β0 + β1x


• Graph of the regression equation is a straight line.
• β0 is the y intercept of the regression line.
• β1 is the slope of the regression line.
• E(y) is the expected value of y for a given x value.
Simple Linear Regression

Positive Linear Relationship

[Graph: E(y) versus x; the regression line rises from intercept β0, slope β1 is positive]
Simple Linear Regression

Negative Linear Relationship

[Graph: E(y) versus x; the regression line falls from intercept β0, slope β1 is negative]
Simple Linear Regression

No Relationship

[Graph: E(y) versus x; the regression line is horizontal at intercept β0, slope β1 is zero]
Simple Linear Regression

 The estimated simple linear regression equation is:

    ŷ = b0 + b1x

 The graph is called the estimated regression line.
 b0 is the y intercept of the line.
 b1 is the slope of the line.
 ŷ is the estimated value of y for a given x value.
Simple Linear Regression

Regression Model                      Sample Data:
    y = β0 + β1x + ε                       x    y
Regression Equation                       x1   y1
    E(y) = β0 + β1x                        .    .
Unknown Parameters                         .    .
    β0, β1                                xn   yn

        Estimated Regression Equation
            ŷ = b0 + b1x
        Sample Statistics: b0, b1

b0 and b1 provide estimates of β0 and β1.
Simple Linear Regression

Least Squares Criterion

    min Σ(yi − ŷi)²

where:
    yi = observed value of the dependent variable for the ith observation
    ŷi = estimated value of the dependent variable for the ith observation
Simple Linear Regression

Slope for the Estimated Regression Equation

    b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

y-Intercept for the Estimated Regression Equation

    b0 = ȳ − b1x̄

where:
    xi = value of the independent variable for the ith observation
    yi = value of the dependent variable for the ith observation
    x̄ = mean value of the independent variable
    ȳ = mean value of the dependent variable
Simple Linear Regression

Example: Reed Auto Sales

Reed Auto periodically has a special week-long sale. As part of the
advertising campaign, Reed runs one or more television commercials
during the weekend preceding the sale. Data from a sample of 5
previous sales are:

    Number of TV Ads    Number of Cars Sold
           1                    14
           3                    24
           2                    18
           1                    17
           3                    27
Simple Linear Regression

Slope for the Estimated Regression Equation

    b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5

y-Intercept for the Estimated Regression Equation

    b0 = ȳ − b1x̄ = 20 − 5(2) = 10

Estimated Regression Equation

    ŷ = 10 + 5x
Simple Linear Regression

[Scatter plot: Cars Sold (0 to 30) versus TV Ads (0 to 4) with the fitted line y = 5x + 10]
Simple Linear Regression

Relationship Among SST, SSR, SSE

    SST = SSR + SSE

    Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

where:
    SST = Total Sum of Squares
    SSR = Sum of Squares due to Regression
    SSE = Sum of Squares due to Error
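The identity SST = SSR + SSE can be verified numerically for the Reed Auto data; a small sketch in plain Python, with fitted values taken from ŷ = 10 + 5x:

```python
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
y_bar = sum(y) / len(y)                      # 20
y_hat = [10 + 5 * xi for xi in x]            # fitted values from y-hat = 10 + 5x

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # error sum of squares

print(sst, ssr, sse)   # 114.0 100.0 14.0, and 114 = 100 + 14
```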
Simple Linear Regression

The coefficient of determination is: r² = SSR/SST

where:
    SSR = sum of squares due to regression
    SST = total sum of squares

For the Reed Auto example: r² = SSR/SST = 100/114 = 0.8772

The regression relationship is very strong; 88% of the


variability in the number of cars sold can be explained
by the linear relationship between the number of TV
ads and the number of cars sold.
Simple Linear Regression

rxy  sign of b1 coefficient of determinat ion


rxy  sign of b1 r 2

Where b1 is the the slope of the estimated regression


equation yˆ  b0  b1 x

The sign of b1 in the equation yˆ 10  5x is “+”.


rxy   0.8772

Hence, rxy = +0.9366
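The same arithmetic in a short Python sketch (values from the Reed Auto example):

```python
import math

r_squared = 100 / 114            # r^2 = SSR/SST from the Reed Auto example
b1 = 5                           # slope of y-hat = 10 + 5x, so its sign is "+"
r_xy = math.copysign(math.sqrt(r_squared), b1)
print(round(r_xy, 4))            # 0.9366
```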


Simple Linear Regression

1. The error ε is a random variable with mean of zero.


2. The variance of ε, denoted by σ2, is the same for all
values of the independent variable.
3. The values of ε are independent.
4. The error ε is a normally distributed random variable.
Simple Linear Regression

 To test for a significant regression relationship, we must conduct a
hypothesis test to determine whether the value of β1 is zero.
 Two tests, namely the t test and the F test, are commonly used.
 Both the t test and the F test require an estimate of σ², the
variance of ε in the regression model.
Simple Linear Regression

An Estimate of σ²
The mean square error (MSE) provides the estimate of σ², and the
notation s² is also used.

    s² = MSE = SSE/(n − 2)

where:  SSE = Σ(yi − ŷi)² = Σ(yi − b0 − b1xi)²
Simple Linear Regression

An Estimate of σ
 To estimate σ we take the square root of s².
 The resulting s is called the standard error of the estimate.

    s = √MSE = √(SSE/(n − 2))
Simple Linear Regression

Hypotheses
    H0: β1 = 0
    H1: β1 ≠ 0

Test Statistic

    t = b1 / sb1

where sb1 is the estimated standard deviation of b1.
Simple Linear Regression

Rejection Rule

    Reject H0 if p-value < α
    or t < −tα/2 or t > tα/2

where:
    tα/2 is based on a t distribution with n − 2 degrees of freedom
Simple Linear Regression
1. Determine the hypotheses.
       H0: β1 = 0
       H1: β1 ≠ 0

2. Specify the level of significance: α = 0.05

3. Select the test statistic.

       t = b1 / sb1

4. State the rejection rule.
       Reject H0 if p-value < 0.05 or |t| > 3.182 (with 3 degrees of freedom)
Simple Linear Regression

5. Compute the value of the test statistic.

       t = b1 / sb1 = 5 / 1.08 = 4.63

6. Determine whether to reject H0.
       t = 4.541 provides an area of 0.01 in the upper tail.
       Hence, the two-tailed p-value is less than 2(0.01) = 0.02.
       (Also, t = 4.63 > 3.182.) We can reject H0.
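The test statistic and the decision can be checked in a few lines of Python (b1, s_b1, and the critical value are the slide's numbers):

```python
b1 = 5.0
s_b1 = 1.08        # standard error of b1 from the slides
t = b1 / s_b1
t_crit = 3.182     # t value for alpha/2 = 0.025 with n - 2 = 3 degrees of freedom

print(round(t, 2))         # 4.63
print(abs(t) > t_crit)     # True -> reject H0
```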
Simple Linear Regression
 We can use a 95% confidence interval for β1 to test the hypotheses
just used in the t test.
 H0 is rejected if the hypothesized value of β1 is not included in
the confidence interval for β1.

 The form of a confidence interval for β1 is:

    b1 ± tα/2 sb1

where b1 is the point estimator, tα/2 sb1 is the margin of error, and
tα/2 is the t value providing an area of α/2 in the upper tail of a
t distribution with n − 2 degrees of freedom.
Simple Linear Regression
 Rejection Rule
    Reject H0 if 0 is not included in the confidence interval for β1.
 95% Confidence Interval for β1
    b1 ± tα/2 sb1 = 5 ± 3.182(1.08) = 5 ± 3.44, or 1.56 to 8.44
 Conclusion
    0 is not included in the confidence interval. Reject H0.
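The interval calculation in a short Python sketch (the slide's numbers):

```python
b1, s_b1 = 5.0, 1.08
t_crit = 3.182                       # t value for alpha/2 = 0.025, 3 degrees of freedom
margin = t_crit * s_b1               # margin of error, about 3.44
lo, hi = b1 - margin, b1 + margin

print(round(lo, 2), round(hi, 2))    # 1.56 8.44
print(not (lo <= 0 <= hi))           # True -> 0 is outside the interval, reject H0
```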
Simple Linear Regression

Hypotheses
    H0: β1 = 0
    H1: β1 ≠ 0

Test Statistic
    F = MSR/MSE
where MSR = SSR/1 is the mean square due to regression (simple linear
regression has one independent variable).
Simple Linear Regression

Rejection Rule
Reject H0 if p-value < α or F > Fα

where:
Fα is based on an F distribution with 1 degree of
freedom in the numerator and n-2 degrees of freedom
in the denominator
Simple Linear Regression

1. Determine the hypotheses.


H0: β1 = 0
H1: β1 ≠ 0
2. Specify the level of significance.
α = 0.05
3. Select the test statistic.
F = MSR/MSE

4. State the rejection rule.


Reject H0 if p-value < 0.05 or F > 10.13 (with 1 d.f. in
numerator and 3 d.f. in denominator)
Simple Linear Regression

5. Compute the value of the test statistic.


F = MSR/MSE = 100/4.667 = 21.43

6. Determine whether to reject H0.


F = 17.44 provides an area of 0.025 in the upper tail.
Thus, the p-value corresponding to F = 21.43 is less than
2(0.025) = 0.05. Hence, we reject H0.
The statistical evidence is sufficient to conclude that we
have a significant relationship between the number of TV
ads aired and the number of cars sold.
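The F statistic and decision can be checked numerically (sums of squares from the Reed Auto example):

```python
ssr, sse, n = 100.0, 14.0, 5     # sums of squares from the Reed Auto example
msr = ssr / 1                    # MSR: 1 degree of freedom in the numerator
mse = sse / (n - 2)              # MSE: n - 2 = 3 degrees of freedom in the denominator
f = msr / mse

print(round(f, 2))               # 21.43
print(f > 10.13)                 # True -> reject H0 at alpha = 0.05
```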
Simple Linear Regression
• If the assumptions about the error term ε appear questionable, the
hypothesis tests about the significance of the regression relationship
and the interval estimation results may not be valid.
• The residuals provide the best information about ε.
• Residual for Observation i:

    yi − ŷi

• Much of residual analysis is based on an examination of graphical plots.
Simple Linear Regression

If the assumption that the variance of ε is the same for all values of
x is valid, and the assumed regression model is an adequate
representation of the relationship between the variables, then the
residual plot should give an overall impression of a horizontal band
of points.
Simple Linear Regression

y  yˆ Good Pattern
Residual

x
Simple Linear Regression

y  yˆ Non-constant Variance
Residual

x
Simple Linear Regression

y  yˆ Model Form Not Adequate


Residual

x
Simple Linear Regression
Residuals
    Observation    Predicted Cars Sold    Residual
         1                 15                -1
         2                 25                -1
         3                 20                -2
         4                 15                 2
         5                 25                 2
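The table of predicted values and residuals can be reproduced in a couple of lines:

```python
x = [1, 3, 2, 1, 3]              # TV ads
y = [14, 24, 18, 17, 27]         # cars sold
y_hat = [10 + 5 * xi for xi in x]                     # predictions from y-hat = 10 + 5x
residuals = [yi - yh for yi, yh in zip(y, y_hat)]     # residual = observed - predicted

print(y_hat)        # [15, 25, 20, 15, 25]
print(residuals)    # [-1, -1, -2, 2, 2]
```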
Simple Linear Regression

TV Ads Residual Plot

[Residual plot: residuals (−3 to 3) versus TV Ads (0 to 4), showing a horizontal band of points]
Multiple Regression…

The simple linear regression model was used to analyze how one
variable (the dependent variable y) is related to one other variable
(the independent variable x).
Multiple regression allows for any number of independent variables.
We expect to develop models that fit the data better than would a
simple linear regression model.

Simple regression considers the relation between a single explanatory
variable and a response variable.
Multiple regression simultaneously considers the influence of multiple
explanatory variables on a response variable Y.

The intent is to look at the independent effect of each variable while
"adjusting out" the influence of potential confounders.
The Model…
We now assume we have k independent variables potentially related to
the one dependent variable. This relationship is represented in this
first-order linear equation:

    y = β0 + β1x1 + β2x2 + … + βkxk + ε

where y is the dependent variable, x1, …, xk are the independent
variables, β0, …, βk are the coefficients, and ε is the error variable.

In the one-variable, two-dimensional case we drew a regression line;
here we imagine a response surface.
Regression Modeling

 A simple regression model


(one independent variable)
fits a regression line in 2-
dimensional space

 A multiple regression model


with two explanatory
variables fits a regression
plane in 3-dimensional
space
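As an illustration of fitting such a regression plane, here is a sketch using NumPy's least squares solver. The data and the generating coefficients (3, 2, −1) are hypothetical, made up purely for demonstration:

```python
import numpy as np

# Hypothetical data: two explanatory variables x1, x2 and a response y
# generated as y = 3 + 2*x1 - 1*x2 (no noise, for illustration only)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 3 + 2 * x1 - 1 * x2

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coeffs
print(np.round(coeffs, 4))   # recovers the generating coefficients 3, 2, -1
```

With noise-free data the solver recovers the plane exactly; with real data the fitted coefficients estimate the βs.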
Required Conditions…

For these regression methods to be valid, the following four
conditions for the error variable ε must be met:
• The probability distribution of the error variable ε is normal.
• The mean of the error variable is 0.
• The standard deviation of ε is σε, which is a constant.
• The errors are independent.
Estimating the Coefficients…
The sample regression equation is expressed as:

    ŷ = b0 + b1x1 + b2x2 + … + bkxk

We will check the following:

Assess the model…
    How well it fits the data
    Is it useful?
    Are any required conditions violated?
Employ the model…
    Interpreting the coefficients
    Predictions using the prediction equation
    Estimating the expected value of the dependent variable
