[Figure: scatterplots of y against x illustrating strong relationships, weak relationships, and no relationship]
Correlation Coefficient
The population correlation coefficient ($\rho$) measures the strength of the linear association between the variables. Its sample counterpart is

$r_{xy} = \dfrac{\operatorname{cov}(x, y)}{s_x s_y}$

where

$\operatorname{cov}(x, y) = \dfrac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y})$

$s_x = \sqrt{\dfrac{1}{n} \sum (x_i - \bar{x})^2}, \quad s_y = \sqrt{\dfrac{1}{n} \sum (y_i - \bar{y})^2}$
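As a quick illustration, $r_{xy}$ can be computed directly from these formulas. The data below are made up for illustration and are not from the slides:

```python
import math

# Hypothetical data, for illustration only (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# cov(x, y) = (1/n) * sum of (x_i - x_bar)(y_i - y_bar)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

# s_x and s_y use the same 1/n convention, so the factor cancels in r.
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)

r_xy = cov_xy / (s_x * s_y)
print(round(r_xy, 6))  # 0.774597
```

Note that the $1/n$ factors cancel in the ratio, so using $1/(n-1)$ throughout would give the same $r_{xy}$.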
Features of correlation coefficient
Unit free
Ranges between $-1.00$ and $+1.00$
$-1 \le \rho < 0$ implies that as $X$ increases, $Y$ decreases
$0 < \rho \le 1$ implies that as $X$ increases, $Y$ increases
The closer to $-1.00$, the stronger the negative linear relationship
The closer to $+1.00$, the stronger the positive linear relationship
The closer to $0.00$, the weaker the linear relationship
$\rho = 0$ implies that $X$ and $Y$ are not linearly associated
Significance Test for Correlation
Hypotheses
$H_0: \rho = 0$ (no linear correlation)
$H_1: \rho \neq 0$ (linear correlation)
Test statistic:
$t_{\text{obs}} = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}$, under $H_0$
Critical Region:
$\{t_{\text{obs}} \le -t_{\alpha;\,n-2}\}$ (left-tailed)
$\{t_{\text{obs}} \ge t_{\alpha;\,n-2}\}$ (right-tailed)
$\{|t_{\text{obs}}| \ge t_{\alpha/2;\,n-2}\}$ (two-tailed)
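The two-tailed test can be sketched as follows, using made-up data and the standard table value $t_{0.025;\,3} = 3.182$:

```python
import math

# Hypothetical data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
r = ss_xy / math.sqrt(ss_x * ss_y)

# t_obs = r * sqrt(n - 2) / sqrt(1 - r^2), distributed t_{n-2} under H0
t_obs = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_crit = 3.182  # standard table value t_{alpha/2; n-2} for alpha = 0.05, n = 5
reject = abs(t_obs) >= t_crit
print(round(t_obs, 4), reject)  # 2.1213 False -> fail to reject H0
```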
What is Regression
Regression is a tool for finding an association between a dependent variable ($Y$) and one or more independent variables ($X_1, X_2, \ldots$) in a study.
The relationship can be linear or non-linear.
Mathematical vs Statistical
Relationship
A mathematical relationship is exact:
$y = \beta_0 + \beta_1 x$
A statistical relationship is not exact:
$y = \beta_0 + \beta_1 x + \varepsilon$
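A minimal simulation makes the distinction concrete; the parameter values and noise scale below are made up for illustration:

```python
import random

# Hypothetical parameters, chosen only to illustrate the distinction.
beta0, beta1 = 2.0, 0.5
x_vals = [1, 2, 3, 4, 5]

# Mathematical relationship: y is determined exactly by x.
y_exact = [beta0 + beta1 * xv for xv in x_vals]

# Statistical relationship: the same line plus a random error epsilon,
# so repeated observations at the same x give different y values.
random.seed(42)
y_noisy = [beta0 + beta1 * xv + random.gauss(0, 0.3) for xv in x_vals]
print(y_exact)  # [2.5, 3.0, 3.5, 4.0, 4.5]
```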
Nomenclature in Regression
A dependent variable (response variable)
measures an outcome of a study (also called
outcome variable).
An independent variable (explanatory
variable) explains changes in a response
variable.
Regression often sets values of the explanatory variable to see how they affect the response variable (to predict the response variable).
Population Linear Regression
$y = \beta_0 + \beta_1 x + \varepsilon$
[Figure: regression line with intercept $\beta_0$ and slope $\beta_1$; for each $x_i$, the observed value of $y$ differs from the predicted value on the line by the random error $\varepsilon_i$]
Estimated Regression Model
The sample regression line provides an estimate of the
population regression line
$\hat{y} = b_0 + b_1 x$, where $x$ is the independent variable
Estimation of parameters
Least squares method of estimation
Confidence interval
Prediction interval
p-value
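The least squares estimates have the standard closed forms $b_1 = SS_{xy}/SS_x$ and $b_0 = \bar{y} - b_1\bar{x}$ (not spelled out on the slides). A sketch with made-up data:

```python
# Hypothetical data; b1 = SS_xy / SS_x and b0 = y_bar - b1 * x_bar are
# the standard closed-form least-squares estimates.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = ss_xy / ss_x        # estimated slope
b0 = y_bar - b1 * x_bar  # estimated intercept
print(round(b0, 6), round(b1, 6))  # 2.2 0.6
```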
Interpretation of the Slope and the Intercept
t Test
$s^2 = \text{MSE} = \text{SSE}/(n - 2)$
where:
$\text{SSE} = \sum (y_i - \hat{y}_i)^2 = SS_y - \dfrac{(SS_{xy})^2}{SS_x} = SS_y - b_1 SS_{xy}$
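The shortcut $\text{SSE} = SS_y - b_1 SS_{xy}$ can be checked against the direct sum of squared residuals; data below are made up for illustration:

```python
# Hypothetical data; verifies SSE = SS_y - b1 * SS_xy against the direct
# sum of squared residuals, then computes MSE = SSE / (n - 2).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]
sse_direct = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
sse_shortcut = ss_y - b1 * ss_xy

mse = sse_direct / (n - 2)  # s^2
print(round(sse_direct, 6), round(sse_shortcut, 6), round(mse, 6))  # 2.4 2.4 0.8
```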
Testing for slope parameter
Hypotheses
$H_0: \beta_1 = \beta_{10}$
$H_1: \beta_1 \neq \beta_{10}$
Test statistic:
$t_{\text{obs}} = \dfrac{b_1 - \beta_{10}}{s_{b_1}}$ where $s_{b_1} = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$
Testing for intercept parameter
Hypotheses
$H_0: \beta_0 = \beta_{00}$
$H_1: \beta_0 \neq \beta_{00}$
Test statistic:
$t_{\text{obs}} = \dfrac{b_0 - \beta_{00}}{s_{b_0}}$ where $s_{b_0} = s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$
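Both t statistics can be sketched together; the data are made up, and the hypothesized values $\beta_{10} = \beta_{00} = 0$ are the usual defaults:

```python
import math

# Hypothetical data; tests H0: beta1 = 0 and H0: beta0 = 0 using the
# standard errors s_b1 and s_b0 defined above.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))                      # residual standard error

s_b1 = s / math.sqrt(ss_x)                        # standard error of the slope
s_b0 = s * math.sqrt(1 / n + x_bar ** 2 / ss_x)   # standard error of the intercept

t_b1 = (b1 - 0) / s_b1
t_b0 = (b0 - 0) / s_b0
print(round(t_b1, 4), round(t_b0, 4))  # 2.1213 2.3452
```

For simple linear regression with $H_0: \beta_1 = 0$, this slope t statistic equals the correlation t statistic from the earlier slide.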
Testing for Significance: t Test
Critical Region: $\{|t_{\text{obs}}| \ge t_{\alpha/2;\,n-2}\}$
where:
$t_{\alpha/2;\,n-2}$ is based on a $t$ distribution with $n - 2$ degrees of freedom
Testing for Significance: Example
3. Select the test statistic: $t = \dfrac{b_1}{s_{b_1}} = \dfrac{0.048687879}{0.001982108} = 24.56369137$
$s_{\hat{y}_p} = s\sqrt{\dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
where:
the confidence coefficient is $1 - \alpha$ and $t_{\alpha/2}$ is based on a $t$ distribution with $n - 2$ degrees of freedom
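A sketch of the resulting confidence interval for the mean response, using made-up data, a hypothetical point $x_p = 4$, and the table value $t_{0.025;\,3} = 3.182$:

```python
import math

# Hypothetical data; 95% confidence interval for the mean response at x_p.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

x_p = 4                       # hypothetical new x value
y_hat_p = b0 + b1 * x_p       # point estimate of the mean response
se_mean = s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_x)

t_crit = 3.182                # standard table value t_{alpha/2; n-2}
lower = y_hat_p - t_crit * se_mean
upper = y_hat_p + t_crit * se_mean
print(round(lower, 4), round(upper, 4))  # 3.0411 6.1589
```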
Assessing Model Accuracy
$R^2$
Residual Standard Error (interpretation?)
F Statistic
Coefficient of Determination
Relationship Among SST, SSR, SSE
SST = SSR + SSE
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
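The decomposition can be verified numerically on made-up data:

```python
# Hypothetical data; numerically verifies the decomposition SST = SSR + SSE.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error

print(round(sst, 6), round(ssr, 6), round(sse, 6))  # 6.0 3.6 2.4
```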
Goodness of fit of regression
Coefficient of Determination
A fitted model can be said to be good when the residuals are small. Since SSE is based on the residuals, a measure of the quality of the fitted model can be based on SSE, or equivalently on $\text{SSR} = \text{SST} - \text{SSE}$.
$R^2$ is a measure of relative fit based on a comparison of SSR and SST:
$R^2 = r^2 = \text{SSR}/\text{SST}$
where:
SSR = sum of squares due to regression
SST = total sum of squares
A value of $R^2$ closer to 1 indicates a better fit; a value closer to 0 indicates a poor fit.
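The identity $R^2 = r^2$ (which holds in simple linear regression) can be checked on made-up data:

```python
import math

# Hypothetical data; checks that R^2 = SSR/SST equals r^2 in simple
# linear regression.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = ss_y
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
r_squared = ssr / sst
r = ss_xy / math.sqrt(ss_x * ss_y)
print(round(r_squared, 6), round(r ** 2, 6))  # 0.6 0.6
```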
Coefficient of Determination (example)