17.1 Introduction
In this chapter we employ regression analysis to examine the relationship among quantitative variables.
The technique is used to predict the value of one variable (the dependent variable, y) based on the values of other variables (the independent variables x1, x2, ..., xk).
17.2 The Model
The first order linear model:
y = β0 + β1x + ε
where
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (= Rise/Run)
ε = error variable
β0 and β1 are unknown and, therefore, are estimated from the data.
[Figure: a straight line with y-intercept β0 and slope β1 = Rise/Run.]
17.3 Estimating the Coefficients
The estimates are determined by
- drawing a sample from the population of interest,
- calculating sample statistics,
- producing a straight line that cuts into the data.
[Figure: scatter of sample data points in the x-y plane.]
The question is: Which straight line fits best?
The best line is the one that minimizes
the sum of squared vertical differences
between the points and the line.
Let us compare two lines fitted to the four points (1, 2), (2, 4), (3, 1.5), and (4, 3.2). The first line predicts 1, 2, 3, 4 at x = 1, 2, 3, 4; the second line is horizontal at y = 2.5.
Sum of squared differences (line 1) = (2 - 1)^2 + (4 - 2)^2 + (1.5 - 3)^2 + (3.2 - 4)^2 = 7.89
Sum of squared differences (line 2) = (2 - 2.5)^2 + (4 - 2.5)^2 + (1.5 - 2.5)^2 + (3.2 - 2.5)^2 = 3.99
The smaller the sum of squared differences, the better the fit of the line to the data.
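As a quick check of the arithmetic, the two sums can be computed directly. Assumption on my part: the first line is taken to be y = x, which reproduces the four squared terms shown above; the second is the horizontal line y = 2.5.

```python
# Sum of squared vertical differences between four sample points and a line.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(line, pts):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - line(x)) ** 2 for x, y in pts)

line1 = lambda x: x      # assumed: the line y = x, matching the terms above
line2 = lambda x: 2.5    # the horizontal line y = 2.5

print(round(sum_sq_diff(line1, points), 2))  # 7.89
print(round(sum_sq_diff(line2, points), 2))  # 3.99
```

The second line wins here, which illustrates that "passing near the points" matters more than slope: the horizontal line happens to sit closer to this particular cloud of points.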
To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:
b1 = cov(X, Y) / s_x^2
b0 = ȳ - b1·x̄
The regression equation that estimates the equation of the first order linear model is:
ŷ = b0 + b1·x
Example 17.1: Relationship between odometer reading and a used car's selling price.
- A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
- A random sample of 100 cars is selected, and the data recorded.
- Find the regression line.
[Partial data table: Car, Odometer, Price.]
Independent variable: x = odometer reading
Dependent variable: y = selling price
Solution
- Solving by hand
To calculate b0 and b1 we need to calculate several statistics first:
x̄ = 36,009.45;  s_x^2 = Σ(x_i - x̄)^2 / (n - 1) = 43,528,688
ȳ = 5,411.41;  cov(X, Y) = Σ(x_i - x̄)(y_i - ȳ) / (n - 1) = -1,356,256
where n = 100.
b1 = cov(X, Y) / s_x^2 = -1,356,256 / 43,528,688 = -.0312
b0 = ȳ - b1·x̄ = 5,411.41 - (-.0312)(36,009.45) = 6,533
ŷ = b0 + b1·x = 6,533 - .0312x
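A minimal sketch of the hand calculation, plugging in the summary statistics of Example 17.1 (x̄ = 36,009.45, s_x^2 = 43,528,688, ȳ = 5,411.41, cov(X, Y) = -1,356,256):

```python
# Estimating the regression coefficients from Example 17.1's summary statistics.
n = 100
x_bar = 36_009.45        # mean odometer reading
s2_x = 43_528_688        # sample variance of x
y_bar = 5_411.41         # mean selling price
cov_xy = -1_356_256      # sample covariance of x and y

b1 = cov_xy / s2_x       # slope: cov(X, Y) / s_x^2
b0 = y_bar - b1 * x_bar  # intercept: y_bar - b1 * x_bar

print(round(b1, 4))      # -0.0312
print(round(b0))         # 6533
```

The negative slope says each extra mile on the odometer lowers the expected selling price by about 3.12 cents.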
- Using the computer (see file Xm17-01.xls)
Tools > Data analysis > Regression > [Shade the y range and the x range] > OK
[Excel regression output, and a scatter plot of price against odometer reading with the fitted line.]
ŷ = 6,533 - .0312x
17.4 Error Variable: Required Conditions
The error variable ε must satisfy three conditions:
- ε is normally distributed,
- the error variance is constant for all values of x,
- the errors are independent of each other.
[Figure: for each value x1, x2, x3, the distribution of y is centered on the line β0 + β1x.]
17.5 Assessing the Model
The least squares method will produce a
regression line whether or not there is a linear
relationship between x and y.
Consequently, it is important to assess how well
the linear model fits the data.
Several methods are used to assess the model:
- Testing and/or estimating the coefficients.
- Using descriptive measurements.
Sum of squares for errors (SSE)
- This is the sum of the squared differences between the points and the regression line.
- It can serve as a measure of how well the line fits the data:
SSE = Σ_{i=1}^{n} (y_i - ŷ_i)^2
A shortcut formula:
SSE = (n - 1)[s_Y^2 - cov(X, Y)^2 / s_x^2]
[Figure: scatter of points around the fitted regression line.]
The test statistic for the slope is
t = (b1 - β1) / s_b1   where   s_b1 = sε / √((n - 1)·s_x^2)
s_b1 is the standard error of b1.
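As an illustrative sketch, the statistic can be evaluated for Example 17.1 under H0: β1 = 0. Assumption on my part: sε = 151.6, the standard error of estimate quoted later in this chapter's interval calculations, is used here.

```python
import math

# t statistic for H0: beta1 = 0, using Example 17.1's statistics.
n = 100
s2_x = 43_528_688    # sample variance of x
s_eps = 151.6        # standard error of estimate (assumed, quoted in the text)
b1 = -0.0312         # estimated slope

s_b1 = s_eps / math.sqrt((n - 1) * s2_x)  # standard error of b1
t = (b1 - 0) / s_b1

print(round(t, 1))   # -13.5
```

A |t| this large is far beyond any usual critical value, so the slope is clearly significant.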
Coefficient of determination
- To measure the strength of the linear relationship we use the coefficient of determination, R^2.
- To understand the significance of this coefficient, note how the deviation of each point from ȳ splits into a part explained by the regression line and an error.
[Figure: two data points (x1, y1) and (x2, y2) of a certain sample, with each deviation y_i - ȳ split by the regression line.]
Total variation in y = Variation explained by the regression line + Unexplained variation (error)
Σ(y_i - ȳ)^2 = SSR + SSE
R^2 = 1 - SSE / Σ(y_i - ȳ)^2 = [Σ(y_i - ȳ)^2 - SSE] / Σ(y_i - ȳ)^2 = SSR / Σ(y_i - ȳ)^2
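The decomposition can be checked numerically. The data below are made up for illustration; the point is that SSR + SSE always reproduces the total variation exactly.

```python
# Verifying SSR + SSE = total variation on a small, made-up data set.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
s2_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)

b1 = cov_xy / s2_x            # least-squares slope
b0 = y_bar - b1 * x_bar       # least-squares intercept
y_hat = [b0 + b1 * x for x in xs]

sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation
ss_total = sum((y - y_bar) ** 2 for y in ys)          # total variation in y
ssr = ss_total - sse                                  # explained variation
r_squared = 1 - sse / ss_total

assert abs((ssr + sse) - ss_total) < 1e-9  # SSR + SSE equals total variation
```

For this nearly linear data set R^2 comes out close to 1, i.e. almost all the variation in y is explained by x.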
17.7 Using the Regression Equation
Before using the regression model, we need to
assess how well it fits the data.
If we are satisfied with how well the model fits
the data, we can use it to make predictions for y.
Illustration
- Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 17.1).
ŷ = 6,533 - .0312x = 6,533 - .0312(40,000) = 5,285
Prediction interval and confidence interval
- Two intervals can be used to discover how closely the predicted value will match the true value of y:
  - Prediction interval: for a particular value of y.
  - Confidence interval: for the expected value of y.
- The prediction interval:
ŷ ± t_{α/2}·sε·√(1 + 1/n + (x_g - x̄)^2 / Σ(x_i - x̄)^2)
- The confidence interval:
ŷ ± t_{α/2}·sε·√(1/n + (x_g - x̄)^2 / Σ(x_i - x̄)^2)
- The prediction interval for a car with 40,000 miles on the odometer:
[6,533 - .0312(40,000)] ± 1.984(151.6)·√(1 + 1/100 + (40,000 - 36,009)^2 / 4,309,340,160) = 5,285 ± 303
- The car dealer wants to bid on a lot of 250 Ford Tauruses, where each car has been driven for about 40,000 miles.
- Solution
The dealer needs to estimate the mean price per car.
The confidence interval (95%):
ŷ ± t_{α/2}·sε·√(1/n + (x_g - x̄)^2 / Σ(x_i - x̄)^2)
[6,533 - .0312(40,000)] ± 1.984(151.6)·√(1/100 + (40,000 - 36,009)^2 / 4,309,340,160) = 5,285 ± 35
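Both half-widths can be reproduced from the statistics of Example 17.1. This is a sketch; the rounded figures 36,009 and 4,309,340,160 are the values quoted in the text.

```python
import math

# Prediction vs. confidence interval half-widths at x_g = 40,000 (Example 17.1).
n = 100
t_crit = 1.984          # t_{.025} with 98 degrees of freedom
s_eps = 151.6           # standard error of estimate
x_bar = 36_009          # sample mean of x (rounded as in the text)
ss_x = 4_309_340_160    # sum of squared deviations of x
x_g = 40_000

y_hat = 6533 - 0.0312 * x_g
leverage_term = 1 / n + (x_g - x_bar) ** 2 / ss_x

pred_half = t_crit * s_eps * math.sqrt(1 + leverage_term)  # particular y
conf_half = t_crit * s_eps * math.sqrt(leverage_term)      # expected value of y

print(round(y_hat))      # 5285
print(round(pred_half))  # 303
print(round(conf_half))  # 35
```

The prediction interval is much wider because it must cover the variability of an individual car's price, not just the uncertainty in the estimated mean.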
The effect of the given value of x_g on the interval
- As x_g moves away from x̄ the interval becomes longer. That is, the shortest interval is found at x_g = x̄.
- The confidence interval when x_g = x̄:
ŷ ± t_{α/2}·sε·√(1/n)
- The confidence interval when x_g = x̄ ± 1:
ŷ ± t_{α/2}·sε·√(1/n + 1^2 / Σ(x_i - x̄)^2)
Spearman rank correlation coefficient
- The hypotheses are:
H0: ρ_s = 0
H1: ρ_s ≠ 0
- The test statistic is
r_s = cov(a, b) / (s_a·s_b)
where a and b are the ranks of the data.
- For a large sample (n > 30), r_s is approximately normally distributed:
z = r_s·√(n - 1)
Example 17.8
- A firm wants to test whether a relationship exists between employees' aptitude test scores and their performance ratings.
- A random sample of 20 employees is selected, and for each employee the aptitude test score and the performance rating are recorded.
Aptitude scores range from 0 to 100; performance ratings range from 1 to 5.
Solution
- The hypotheses:
H0: ρ_s = 0
H1: ρ_s ≠ 0
- The test statistic is r_s, and the rejection region is |r_s| > r_critical (taken from the Spearman rank correlation table).
Employee | Aptitude test | Rank (a) | Performance rating | Rank (b)
1 | 59 | 9 | 3 | 10.5
2 | 47 | 3 | 2 | 3.5
3 | 58 | 8 | 4 | 17
4 | 66 | 14 | 3 | 10.5
5 | 77 | 20 | 2 | 3.5
. | . | . | . | .
Ties are broken by averaging the ranks.
- Solving by hand
Rank each variable separately, breaking ties by averaging the ranks.
s_a = 5.92;  s_b = 5.5;  cov(a, b) = 12.34
Thus r_s = cov(a, b) / (s_a·s_b) = .379.
The critical value for α = .05 and n = 20 is .450.
- Conclusion:
Since .379 < .450, do not reject the null hypothesis. At the 5% significance level there is insufficient evidence to infer that the two variables are related to one another.
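A sketch of the final arithmetic, using the sample statistics given above for Example 17.8:

```python
# Spearman rank correlation from Example 17.8's summary statistics.
s_a = 5.92       # standard deviation of the ranks of the aptitude scores
s_b = 5.5        # standard deviation of the ranks of the performance ratings
cov_ab = 12.34   # covariance of the two sets of ranks

r_s = cov_ab / (s_a * s_b)
r_critical = 0.450                # from the Spearman table, alpha = .05, n = 20

print(round(r_s, 3))              # 0.379
print(abs(r_s) < r_critical)      # True -> do not reject H0
```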
17.9 Regression Diagnostics - I
The three conditions required for the validity of
the regression analysis are:
- the error variable is normally distributed,
- the error variance is constant for all values of x,
- the errors are independent of each other.
How can we diagnose violations of these
conditions?
Residual Analysis
[A partial list of standardized residuals (table omitted).]
For each residual we calculate the standard deviation as follows:
s_ri = sε·√(1 - h_i)   where   h_i = 1/n + (x_i - x̄)^2 / Σ(x_j - x̄)^2
Standardized residual i = Residual i / Standard deviation of residual i
We can also apply the Lilliefors test or the χ² test of normality.
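A sketch of the standardization step. The x values, residuals, and sε below are made up for illustration; only the formulas come from the text.

```python
import math

# Standardizing residuals: sr_i = e_i / (s_eps * sqrt(1 - h_i)).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]             # hypothetical x values
residuals = [0.2, -0.3, 0.1, 0.3, -0.3]    # hypothetical residuals e_i
s_eps = 0.3                                # hypothetical standard error of estimate

n = len(xs)
x_bar = sum(xs) / n
ss_x = sum((x - x_bar) ** 2 for x in xs)

standardized = []
for x, e in zip(xs, residuals):
    h_i = 1 / n + (x - x_bar) ** 2 / ss_x  # leverage of observation i
    s_ri = s_eps * math.sqrt(1 - h_i)      # standard deviation of residual i
    standardized.append(e / s_ri)

# Customary rule of thumb: suspect an outlier if |standardized residual| > 2.
suspects = [i for i, sr in enumerate(standardized) if abs(sr) > 2]
print(suspects)   # []
```

Note that observations far from x̄ have larger leverage h_i, so the same raw residual is standardized to a larger value there.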
Heteroscedasticity
- When the requirement of a constant variance is violated, we have heteroscedasticity.
[Figure: scatter of y against ŷ, and the corresponding plot of residuals against ŷ. In the y-ŷ scatter the spread of the data points does not change much, but the residual plot reveals the changing variance.]
When the requirement of a constant variance is not violated, we have homoscedasticity.
[Figure: scatter of y against ŷ, and the corresponding plot of residuals against ŷ, both with an even spread. As far as the even spread is concerned, this is a much better situation.]
Non-independence of the error variables
- Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Figure: two plots of residuals against time. Left: note the runs of positive residuals, replaced by runs of negative residuals. Right: note the oscillating behavior of the residuals around zero.]
Outliers
- An outlier is an observation that is unusually small or large.
- Several possibilities need to be investigated when an outlier is observed:
  - There was an error in recording the value.
  - The point does not belong in the sample.
  - The observation is valid.
- Identify outliers from the scatter diagram.
- It is customary to suspect an observation is an outlier if its |standard residual| > 2.
[Figure: left, an outlier lying far from the rest of the data; right, an influential observation that pulls the fitted line toward itself.]
... but some outliers may be very influential.
Procedure for regression diagnostics