
BASIC ECONOMETRICS

THE NATURE OF LINEAR REGRESSION

Hypothesis testing and Estimation
INTRODUCTION
What is Econometrics?
Econometrics consists of the application of
mathematical statistics to economic data to lend
empirical support to the models constructed by
mathematical economics and to obtain numerical
results.
Econometrics may be defined as the quantitative
analysis of actual economic phenomena based on
the concurrent development of theory and
observation, related by appropriate methods of
inference.
WHAT IS ECONOMETRICS?
[Diagram: Econometrics at the intersection of Economics, Mathematics, and Statistics]
PURPOSE OF ECONOMETRICS
Structural Analysis
Policy Evaluation
Economic Prediction
Empirical Analysis
METHODOLOGY OF ECONOMETRICS
1. Statement of theory or hypothesis.
2. Specification of the mathematical model of the theory.
3. Specification of the statistical, or econometric model.
4. Obtaining the data.
5. Estimation of the parameters of the econometric model.
6. Hypothesis testing.
7. Forecasting or prediction.
8. Using the model for control or policy purposes.
EXAMPLE: KEYNESIAN THEORY OF CONSUMPTION
1. Statement of theory or hypothesis.
Keynes stated: "The fundamental psychological law is that men/women are disposed, as a rule and on average, to increase their consumption as their income increases, but not as much as the increase in their income."
In short, Keynes postulated that the marginal propensity to consume (MPC), the rate of change of consumption for a unit change in income, is greater than zero but less than 1.
2. SPECIFICATION OF THE MATHEMATICAL MODEL OF THE THEORY
A mathematical economist might suggest the following form of the Keynesian consumption function:

    Y = \beta_0 + \beta_1 X,  \quad 0 < \beta_1 < 1,

where Y is consumption expenditure and X is income.
3. SPECIFICATION OF THE STATISTICAL, OR ECONOMETRIC MODEL
To allow for the inexact relationships between economic variables, the econometrician would modify the deterministic consumption function as follows:

    Y = \beta_0 + \beta_1 X + u,

where u is known as the disturbance, or error, term. This is called an econometric model.
4. OBTAINING THE DATA.
year Y X
1982 3081.5 4620.3
1983 3240.6 4803.7
1984 3407.6 5140.1
1985 3566.5 5323.5
1986 3708.7 5487.7
1987 3822.3 5649.5
1988 3972.7 5865.2
1989 4064.6 6062
1990 4132.2 6136.3
1991 4105.8 6079.4
1992 4219.8 6244.4
1993 4343.6 6389.6
1994 4486 6610.7
1995 4595.3 6742.1
1996 4714.1 6928.4
Source: Data on Y (Personal Consumption Expenditure) and X (Gross Domestic Product), 1982-1996, all in billions of 1992 dollars.
5. ESTIMATION OF THE PARAMETERS OF
THE ECONOMETRIC MODEL.
reg y x

Source | SS df MS Number of obs = 15
-------------+------------------------------ F( 1, 13) = 8144.59
Model | 3351406.23 1 3351406.23 Prob > F = 0.0000
Residual | 5349.35306 13 411.488697 R-squared = 0.9984
-------------+------------------------------ Adj R-squared = 0.9983
Total | 3356755.58 14 239768.256 Root MSE = 20.285

------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .706408 .0078275 90.25 0.000 .6894978 .7233182
_cons | -184.0779 46.26183 -3.98 0.002 -284.0205 -84.13525
------------------------------------------------------------------------------
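The same estimation can be reproduced outside Stata. A minimal sketch using Python's statsmodels, assuming the 15 observations from the table above are keyed into the lists y and x (the library and variable names here are my choice, not part of the slides):

    import statsmodels.api as sm

    y = [3081.5, 3240.6, 3407.6, 3566.5, 3708.7, 3822.3, 3972.7, 4064.6,
         4132.2, 4105.8, 4219.8, 4343.6, 4486.0, 4595.3, 4714.1]
    x = [4620.3, 4803.7, 5140.1, 5323.5, 5487.7, 5649.5, 5865.2, 6062.0,
         6136.3, 6079.4, 6244.4, 6389.6, 6610.7, 6742.1, 6928.4]

    X = sm.add_constant(x)    # adds the intercept column
    fit = sm.OLS(y, X).fit()  # ordinary least squares, as in "reg y x"
    print(fit.params)         # roughly _cons = -184.08, x = 0.7064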


6. HYPOTHESIS TESTING.
Such confirmation or refutation of econometric theories on the basis of sample evidence is based on a branch of statistical theory known as statistical inference (hypothesis testing).
As noted earlier, Keynes expected the MPC to be positive but less than 1. In our example we found it is about 0.70. Then, is 0.70 statistically less than 1? If it is, it may support Keynes's theory.
7. FORECASTING OR PREDICTION.
To illustrate, suppose we want to predict the mean consumption expenditure for 1997. The GDP value for 1997 was 7269.8 billion dollars. Putting this value on the right-hand side of the model, we obtain 4951.3 billion dollars.
But the actual value of the consumption expenditure reported in 1997 was 4913.5 billion dollars. The estimated model thus overpredicted.
The forecast error is about 37.82 billion dollars.
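A quick sketch of the arithmetic behind this forecast (coefficients taken from the regression output above):

    b0, b1 = -184.0779, 0.706408   # estimates from the output above
    x_1997 = 7269.8                # 1997 GDP, billions of 1992 dollars
    y_hat = b0 + b1 * x_1997       # predicted consumption, about 4951.3
    error = y_hat - 4913.5         # actual 1997 value; error about 37.8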
TYPES OF DATA SETS
Assume that we have collected data on two variables X and Y. Let

    (x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)

denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
THE STATISTICAL MODEL
Each y_i is assumed to be randomly generated from a normal distribution with mean

    \mu_i = \alpha + \beta x_i

and standard deviation σ (α, β and σ are unknown).
[Figure: the line Y = α + βX, with slope β and intercept α, and the normal distribution of y_i about the point (x_i, α + βx_i)]
THE DATA
THE LINEAR REGRESSION MODEL
The data fall roughly about a straight line Y = α + βX.
[Scatter plot of the observed (x, y) pairs about the unseen line Y = α + βX]
THE LEAST SQUARES LINE
Fitting the best straight line
to linear data
Let
Y = a + b X
denote an arbitrary equation of a straight line.
a and b are known values.
This equation can be used to predict for each
value of X, the value of Y.
For example, if X = x_i (as for the i-th case) then the predicted value of Y is:

    \hat{y}_i = a + b x_i
The residual

    r_i = y_i - \hat{y}_i = y_i - (a + b x_i)

can be computed for each case in the sample:

    r_1 = y_1 - \hat{y}_1, \quad r_2 = y_2 - \hat{y}_2, \quad \ldots, \quad r_n = y_n - \hat{y}_n.

The residual sum of squares (RSS) is

    RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2,

a measure of the goodness of fit of the line Y = a + bX to the data.
The optimal choice of a and b will result in the residual sum of squares

    RSS = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2

attaining a minimum. If this is the case, then the line Y = a + bX is called the Least Squares Line.
The equation for the least squares line

Let

    S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \quad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).

LINEAR REGRESSION
Hypothesis testing and Estimation
THE LEAST SQUARES LINE
Fitting the best straight line
to linear data




( )
n
x
x x x S
n
i
i
n
i
i
n
i
i xx
2
1
1
2
1
2
|
.
|

\
|
= =


=
= =
n
y x
y x
n
i
i
n
i
i
n
i
i i
|
.
|

\
|
|
.
|

\
|
=

= =
=
1 1
1
( )
n
y
y y y S
n
i
i
n
i
i
n
i
i yy
2
1
1
2
1
2
|
.
|

\
|
= =


=
= =
( )( )

=
=
n
i
i i xy
y y x x S
1
Computing Formulae:
Then the slope of the least squares line can be shown to be:

    b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

and the intercept of the least squares line can be shown to be:

    a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}
The residual sum of squares:

    RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left(y_i - (a + b x_i)\right)^2

Computing formula:

    RSS = S_{yy} - \frac{S_{xy}^2}{S_{xx}}
Estimating σ, the standard deviation in the regression model:

    s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - (a + b x_i)\right)^2}{n-2}}

Computing formula:

    s = \sqrt{\frac{1}{n-2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right)}

This estimate of σ is said to be based on n - 2 degrees of freedom.
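A small sketch of these computing formulae in plain Python (the function names are mine, not from the slides):

    def sums_of_squares(x, y):
        n = len(x)
        Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
        Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
        Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n
        return Sxx, Syy, Sxy

    def least_squares(x, y):
        n = len(x)
        Sxx, Syy, Sxy = sums_of_squares(x, y)
        b = Sxy / Sxx                            # slope
        a = sum(y)/n - b * sum(x)/n              # intercept: y-bar - b*x-bar
        s = ((Syy - Sxy**2/Sxx) / (n - 2))**0.5  # estimate of sigma, df = n - 2
        return a, b, s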
SAMPLING DISTRIBUTIONS OF THE ESTIMATORS
The sampling distribution of the slope of the least squares line:

    b = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

It can be shown that b has a normal distribution with mean and standard deviation

    \mu_b = \beta \quad \text{and} \quad \sigma_b = \frac{\sigma}{\sqrt{S_{xx}}} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
Thus

    z = \frac{b - \mu_b}{\sigma_b} = \frac{b - \beta}{\sigma/\sqrt{S_{xx}}}

has a standard normal distribution, and

    t = \frac{b - \mu_b}{s_b} = \frac{b - \beta}{s/\sqrt{S_{xx}}}

has a t distribution with df = n - 2.
(1 - α)100% Confidence Limits for slope β:

    b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}

t_{α/2} is the critical value for the t-distribution with n - 2 degrees of freedom.
Testing the slope

    H_0: \beta = \beta_0 \quad \text{vs} \quad H_A: \beta \ne \beta_0

The test statistic is:

    t = \frac{b - \beta_0}{s/\sqrt{S_{xx}}}

which has a t distribution with df = n - 2 if H_0 is true.

The Critical Region
Reject H_0: β = β_0 in favour of H_A: β ≠ β_0

    \text{if } t < -t_{\alpha/2} \text{ or } t > t_{\alpha/2}, \quad df = n - 2.

This is a two-tailed test. One-tailed tests are also possible.
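A sketch of the two-tailed slope test in Python, using scipy only for the critical value (names are mine, not the slides'):

    from scipy import stats

    def test_slope(b, s, Sxx, n, beta0=0.0, alpha=0.05):
        t = (b - beta0) / (s / Sxx**0.5)             # the test statistic above
        t_crit = stats.t.ppf(1 - alpha/2, df=n - 2)  # t_{alpha/2}, df = n - 2
        return t, abs(t) > t_crit                    # reject H0 if |t| > t_crit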
The sampling distribution of the intercept of the least squares line:

    a = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\,\bar{x}

It can be shown that a has a normal distribution with mean and standard deviation

    \mu_a = \alpha \quad \text{and} \quad \sigma_a = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
Thus

    z = \frac{a - \mu_a}{\sigma_a} = \frac{a - \alpha}{\sigma\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}}

has a standard normal distribution, and

    t = \frac{a - \mu_a}{s_a} = \frac{a - \alpha}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}}

has a t distribution with df = n - 2.
(1 - α)100% Confidence Limits for intercept α:

    a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}

t_{α/2} is the critical value for the t-distribution with n - 2 degrees of freedom.
Testing the intercept

    H_0: \alpha = \alpha_0 \quad \text{vs} \quad H_A: \alpha \ne \alpha_0

The test statistic is:

    t = \frac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}}

which has a t distribution with df = n - 2 if H_0 is true.

The Critical Region
Reject H_0: α = α_0 in favour of H_A: α ≠ α_0

    \text{if } t < -t_{\alpha/2} \text{ or } t > t_{\alpha/2}, \quad df = n - 2.
EXAMPLE
The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

TABLE: Per capita consumption of cigarettes per month (x_i) in n = 11 countries in 1930, and the death rates, y_i (per 100,000), from lung cancer for men in 1950.

    Country (i)      x_i    y_i
    Australia         48     18
    Canada            50     15
    Denmark           38     17
    Finland          110     35
    Great Britain    110     46
    Holland           49     24
    Iceland           23      6
    Norway            25      9
    Sweden            30     11
    Switzerland       51     25
    USA              130     20
[Scatter plot: death rates from lung cancer (1950) vs per capita consumption of cigarettes (1930), with country labels]
Fitting the Least Squares Line
First compute the following quantities:

    \sum_{i=1}^{n} x_i = 664, \quad \sum_{i=1}^{n} y_i = 226

    \sum_{i=1}^{n} x_i^2 = 54{,}404, \quad \sum_{i=1}^{n} y_i^2 = 6{,}018, \quad \sum_{i=1}^{n} x_i y_i = 16{,}914

Then:

    S_{xx} = 54404 - \frac{(664)^2}{11} = 14322.55

    S_{yy} = 6018 - \frac{(226)^2}{11} = 1374.73

    S_{xy} = 16914 - \frac{(664)(226)}{11} = 3271.82
Computing estimates of the slope (β), intercept (α) and standard deviation (σ):

    b = \frac{S_{xy}}{S_{xx}} = \frac{3271.82}{14322.55} = 0.228

    a = \bar{y} - b\bar{x} = \frac{226}{11} - 0.228\left(\frac{664}{11}\right) = 6.756

    s = \sqrt{\frac{1}{n-2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right)} = 8.35
95% Confidence Limits for slope β:

    b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}} = 0.228 \pm 2.262\,\frac{8.35}{\sqrt{14322.55}}

giving 0.0706 to 0.3862.
t_{.025} = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.
95% Confidence Limits for intercept α:

    a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = 6.756 \pm 2.262\,(8.35)\sqrt{\frac{1}{11} + \frac{(664/11)^2}{14322.55}}

giving -4.34 to 17.85.
t_{.025} = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.
[Scatter plot with fitted line: death rates from lung cancer (1950) vs per capita consumption of cigarettes, country labels, and the least squares line]

    Y = 6.756 + 0.228\,X

95% confidence limits for slope: 0.0706 to 0.3862
95% confidence limits for intercept: -4.34 to 17.85
Testing for a positive slope

    H_0: \beta \le 0 \quad \text{vs} \quad H_A: \beta > 0

The test statistic is:

    t = \frac{b}{s/\sqrt{S_{xx}}}

The Critical Region
Reject H_0: β ≤ 0 in favour of H_A: β > 0

    \text{if } t > t_{0.05} = 1.833, \quad df = 11 - 2 = 9.

This is a one-tailed test. Since

    t = \frac{0.228}{8.35/\sqrt{14322.55}} = 3.27 > 1.833,

we reject H_0: β ≤ 0 and conclude H_A: β > 0.
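The whole example can be checked numerically; a sketch reusing the helpers sketched earlier (sums_of_squares and least_squares are my names, not the slides'):

    x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]  # cigarettes, 1930
    y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]       # deaths per 100,000
    Sxx, Syy, Sxy = sums_of_squares(x, y)  # 14322.55, 1374.73, 3271.82
    a, b, s = least_squares(x, y)          # about 6.756, 0.228, 8.35
    t = b / (s / Sxx**0.5)                 # about 3.3 > 1.833, so reject H0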
CONFIDENCE LIMITS FOR POINTS ON THE REGRESSION LINE
The intercept α is a specific point on the regression line. It is the y coordinate of the point on the regression line when x = 0, i.e. the predicted value of y when x = 0.
We may also be interested in other points on the regression line, e.g. when x = x_0. In this case the y coordinate of the point on the regression line when x = x_0 is α + βx_0.
[Figure: the point (x_0, α + βx_0) on the line y = α + βx]

(1 - α)100% Confidence Limits for α + βx_0:

    (a + b x_0) \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

t_{α/2} is the α/2 critical value for the t-distribution with n - 2 degrees of freedom.
PREDICTION LIMITS FOR NEW VALUES OF THE DEPENDENT VARIABLE Y
An important application of the regression line is prediction: knowing the value of x (x_0), what is the value of y?
The predicted value of y when x = x_0 is:

    y = \alpha + \beta x_0

This in turn can be estimated by:

    \hat{y} = a + b x_0

The predictor ŷ = a + bx_0 gives only a single value for y. A more appropriate piece of information would be a range of values: a range that has a fixed probability of capturing the value for y, i.e. a (1 - α)100% prediction interval for y.

(1 - α)100% Prediction Limits for y when x = x_0:

    (a + b x_0) \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

t_{α/2} is the α/2 critical value for the t-distribution with n - 2 degrees of freedom.
EXAMPLE
In this example we are studying building fires
in a city and interested in the relationship
between:
1. X = the distance of the closest fire hall
and the building that puts out the alarm
and
2. Y = cost of the damage (1000$)
The data was collected on n = 15 fires.
THE DATA
Fire Distance Damage
1 3.4 26.2
2 1.8 17.8
3 4.6 31.3
4 2.3 23.1
5 3.1 27.5
6 5.5 36.0
7 0.7 14.1
8 3.0 22.3
9 2.6 19.6
10 4.3 31.3
11 2.1 24.0
12 1.1 17.3
13 6.1 43.2
14 4.8 36.4
15 3.8 26.1
SCATTER PLOT
[Scatter plot: Damage (1000$) vs Distance (miles)]
COMPUTATIONS
From the n = 15 cases:

    \sum_{i=1}^{n} x_i = 49.2, \quad \sum_{i=1}^{n} y_i = 396.2, \quad \sum_{i=1}^{n} x_i y_i = 1470.65

    \sum_{i=1}^{n} x_i^2 = 196.16, \quad \sum_{i=1}^{n} y_i^2 = 11376.5
COMPUTATIONS CONTINUED

    \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{49.2}{15} = 3.28

    \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{396.2}{15} = 26.4133
COMPUTATIONS CONTINUED

    S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 196.16 - \frac{(49.2)^2}{15} = 34.784

    S_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} = 11376.5 - \frac{(396.2)^2}{15} = 911.517

    S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} = 1470.65 - \frac{(49.2)(396.2)}{15} = 171.114
COMPUTATIONS CONTINUED

    \hat{\beta} = b = \frac{S_{xy}}{S_{xx}} = \frac{171.114}{34.784} = 4.92

    \hat{\alpha} = a = \bar{y} - b\bar{x} = 26.4133 - (4.92)(3.28) = 10.28

    s = \sqrt{\frac{S_{yy} - S_{xy}^2/S_{xx}}{n-2}} = \sqrt{\frac{911.517 - (171.114)^2/34.784}{13}} = 2.316
95% Confidence Limits for slope β:

    b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}

giving 4.07 to 5.77.
t_{.025} = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.

95% Confidence Limits for intercept α:

    a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}

giving 7.21 to 13.35.
t_{.025} = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.
LEAST SQUARES LINE
[Scatter plot with fitted line: Damage (1000$) vs Distance (miles)]

    y = 4.92x + 10.28
(1 - α)100% Confidence Limits for α + βx_0:

    (a + b x_0) \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

t_{α/2} is the α/2 critical value for the t-distribution with n - 2 degrees of freedom.
95% Confidence Limits for a + b x_0:

    x_0    lower    upper
    1      12.87    17.52
    2      18.43    21.80
    3      23.72    26.35
    4      28.53    31.38
    5      32.93    36.82
    6      37.15    42.44
[Plot: confidence limits for a + b x_0 shown as bands about the least squares line, Damage (1000$) vs Distance (miles)]
(1 - α)100% Prediction Limits for y when x = x_0:

    (a + b x_0) \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

t_{α/2} is the α/2 critical value for the t-distribution with n - 2 degrees of freedom.
95% Prediction Limits for y when x = x_0:

    x_0    lower    upper
    1       9.68    20.71
    2      14.84    25.40
    3      19.86    30.21
    4      24.75    35.16
    5      29.51    40.24
    6      34.13    45.45
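A sketch reproducing both tables from the summary statistics of the fire data (values taken from the computations above):

    from scipy import stats

    a, b, s, Sxx, xbar, n = 10.28, 4.92, 2.316, 34.784, 3.28, 15
    t = stats.t.ppf(0.975, df=n - 2)   # 2.160
    for x0 in range(1, 7):
        yhat = a + b * x0
        h_ci = t * s * (1/n + (x0 - xbar)**2 / Sxx) ** 0.5      # mean line
        h_pi = t * s * (1 + 1/n + (x0 - xbar)**2 / Sxx) ** 0.5  # new value of y
        print(x0, yhat - h_ci, yhat + h_ci, yhat - h_pi, yhat + h_pi)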
[Plot: prediction limits for y shown as wider bands about the least squares line, Damage (1000$) vs Distance (miles)]
LINEAR REGRESSION
SUMMARY
Hypothesis testing and Estimation
(1 - α)100% Confidence Limits for slope β:

    b \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}

t_{α/2} is the critical value for the t-distribution with n - 2 degrees of freedom.

Testing the slope: H_0: β = β_0 vs H_A: β ≠ β_0. The test statistic

    t = \frac{b - \beta_0}{s/\sqrt{S_{xx}}}

has a t distribution with df = n - 2 if H_0 is true.
(1 - α)100% Confidence Limits for intercept α:

    a \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}

t_{α/2} is the critical value for the t-distribution with n - 2 degrees of freedom.

Testing the intercept: H_0: α = α_0 vs H_A: α ≠ α_0. The test statistic

    t = \frac{a - \alpha_0}{s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}}}

has a t distribution with df = n - 2 if H_0 is true.
(1 - α)100% Confidence Limits for α + βx_0:

    (a + b x_0) \pm t_{\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

(1 - α)100% Prediction Limits for y when x = x_0:

    (a + b x_0) \pm t_{\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

t_{α/2} is the α/2 critical value for the t-distribution with n - 2 degrees of freedom.
CORRELATION
Definition. The statistic

    r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}

is called Pearson's correlation coefficient.

Properties
1. -1 ≤ r ≤ 1, i.e. |r| ≤ 1 and r² ≤ 1.
2. |r| = 1 (r = +1 or -1) if the points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) lie along a straight line (positive slope for +1, negative slope for -1).
The test for independence (zero correlation)

    H_0: X and Y are independent
    H_A: X and Y are correlated

The test statistic:

    t = r\,\frac{\sqrt{n-2}}{\sqrt{1-r^2}}

The Critical Region: reject H_0 if |t| > t_{α/2} (df = n - 2).
This is a two-tailed critical region; the critical region could also be one-tailed.
EXAMPLE (CONTINUED)
We return to the building-fire data: X = the distance of the closest fire hall and the building that puts out the alarm, Y = cost of the damage (1000$), n = 15 fires. As computed earlier:

    S_{xx} = 34.784, \quad S_{yy} = 911.517, \quad S_{xy} = 171.114
THE CORRELATION COEFFICIENT

    r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{171.114}{\sqrt{(34.784)(911.517)}} = 0.961
The test for independence (zero correlation)
The test statistic:

    t = r\,\frac{\sqrt{n-2}}{\sqrt{1-r^2}} = 0.961\,\frac{\sqrt{13}}{\sqrt{1-0.961^2}} = 12.525

We reject H_0 (independence) if |t| > t_{0.025} = 2.160. H_0: independence is rejected.
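A two-line sketch verifying these numbers:

    r = 171.114 / (34.784 * 911.517) ** 0.5   # about 0.961
    t = r * 13 ** 0.5 / (1 - r**2) ** 0.5     # about 12.5 > 2.160, reject H0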
RELATIONSHIP BETWEEN REGRESSION AND CORRELATION
Recall

    r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}

Also

    \hat{\beta} = b = \frac{S_{xy}}{S_{xx}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}\sqrt{\frac{S_{yy}}{S_{xx}}} = r\sqrt{\frac{S_{yy}}{S_{xx}}} = r\,\frac{s_y}{s_x}

since

    s_x = \sqrt{\frac{S_{xx}}{n-1}} \quad \text{and} \quad s_y = \sqrt{\frac{S_{yy}}{n-1}}

Thus the slope of the least squares line is simply the ratio of the standard deviations times the correlation coefficient.
The test for independence (zero correlation) uses the test statistic:

    t = r\,\frac{\sqrt{n-2}}{\sqrt{1-r^2}}

    H_0: X and Y are independent
    H_A: X and Y are correlated

Note:

    \hat{\beta} = r\sqrt{\frac{S_{yy}}{S_{xx}}} \quad \text{and} \quad r = \hat{\beta}\sqrt{\frac{S_{xx}}{S_{yy}}}

The two tests
1. The test for independence (zero correlation): H_0: X and Y are independent vs H_A: X and Y are correlated.
2. The test for zero slope: H_0: β = 0 vs H_A: β ≠ 0.
are equivalent.
The test statistic for independence:

    t = r\,\frac{\sqrt{n-2}}{\sqrt{1-r^2}}
      = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\cdot\frac{\sqrt{n-2}}{\sqrt{1-\dfrac{S_{xy}^2}{S_{xx}S_{yy}}}}
      = \frac{S_{xy}\sqrt{n-2}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}-S_{xy}^2/S_{xx}}}
      = \frac{b}{s/\sqrt{S_{xx}}}

since b = S_xy/S_xx and s² = (S_yy - S_xy²/S_xx)/(n - 2). Thus the test statistic for independence is the same statistic used for testing for zero slope.
REGRESSION (IN GENERAL)
In many experiments we would have collected data on a single variable Y (the dependent variable) and on p (say) other variables X_1, X_2, X_3, ..., X_p (the independent variables).
One is interested in determining a model that describes the relationship between Y (the response (dependent) variable) and X_1, X_2, ..., X_p (the predictor (independent) variables).
This model can be used for:
Prediction
Controlling Y by manipulating X_1, X_2, ..., X_p
The Model is an equation of the form

    Y = f(X_1, X_2, \ldots, X_p \mid \theta_1, \theta_2, \ldots, \theta_q) + \varepsilon

where θ_1, θ_2, ..., θ_q are unknown parameters of the function f and ε is a random disturbance (usually assumed to have a normal distribution with mean 0 and standard deviation σ).
Examples:
1. Y = blood pressure, X = age.
The model: Y = α + βX + ε, thus θ_1 = α and θ_2 = β.
This model is called the simple Linear Regression Model:

    Y = \alpha + \beta X
2. Y = average of the five best times for running the 100m, X = the year.
The model: Y = α e^{-βX} + γ + ε, thus θ_1 = α, θ_2 = β and θ_3 = γ.
This model is called the exponential Regression Model:

    Y = \alpha e^{-\beta X} + \gamma

[Plot: running times vs year, 1930-2010, decreasing toward an asymptote]
3. Y = gas mileage (mpg) of a car brand; X_1 = engine size, X_2 = horsepower, X_3 = weight.
The model: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ε.
This model is called the Multiple Linear Regression Model.
THE MULTIPLE LINEAR REGRESSION MODEL
In Multiple Linear Regression we assume the following model:

    Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon

This model is called the Multiple Linear Regression Model, where β_0, β_1, β_2, ..., β_p are unknown parameters and ε is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation σ.
THE IMPORTANCE OF THE LINEAR MODEL
1. It is the simplest form of a model in which each independent variable has some effect on the dependent variable Y.
When fitting models to data one tries to find the simplest form of a model that still adequately describes the relationship between the dependent variable and the independent variables. The linear model is sometimes the first model to be fitted and only abandoned if it turns out to be inadequate.
2. In many instances a linear model is the most appropriate model to describe the dependence relationship between the dependent variable and the independent variables. This will be true if the dependent variable increases at a constant rate as any one of the independent variables is increased while holding the other independent variables constant.
3. Many non-linear models can be linearized (put into the form of a linear model) by appropriately transforming the dependent variable and/or any or all of the independent variables, as illustrated below. This important fact ensures the wide utility of the linear model.
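As one standard illustration of point 3 (my example, not taken from the slides): a multiplicative exponential model becomes linear after taking logarithms,

    Y = \alpha e^{\beta X}\,\varepsilon
    \quad\Longrightarrow\quad
    \ln Y = \ln\alpha + \beta X + \ln\varepsilon = \beta_0 + \beta_1 X + \varepsilon',

so the linearized model can be fitted by ordinary least squares with ln Y as the dependent variable.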
AN EXAMPLE
The following data come from an experiment investigating the source from which corn plants in various soils obtain their phosphorous.
The concentration of inorganic phosphorous (X_1) and the concentration of organic phosphorous (X_2) were measured in the soil of n = 18 test plots. In addition, the phosphorous content (Y) of corn grown in the soil was also measured. The data are displayed below:

    Inorganic        Organic          Plant Available
    Phosphorous X_1  Phosphorous X_2  Phosphorous Y
     0.4             53                64
     0.4             23                60
     3.1             19                71
     0.6             34                61
     4.7             24                54
     1.7             65                77
     9.4             44                81
    10.1             31                93
    11.6             29                93
    12.6             58                51
    10.9             37                76
    23.1             46                96
    23.1             50                77
    21.6             44                93
    23.1             56                95
     1.9             36                54
    26.8             58               168
    29.9             51                99
Coefficients:

    Intercept   56.2510241   (β_0)
    X_1          1.78977412  (β_1)
    X_2          0.08664925  (β_2)

Equation:

    Y = 56.2510241 + 1.78977412\,X_1 + 0.08664925\,X_2
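A sketch of the same fit with numpy's least squares routine, with the 18 rows re-keyed from the table above (column order X_1, X_2, Y):

    import numpy as np

    data = np.array([
        [0.4, 53, 64], [0.4, 23, 60], [3.1, 19, 71], [0.6, 34, 61],
        [4.7, 24, 54], [1.7, 65, 77], [9.4, 44, 81], [10.1, 31, 93],
        [11.6, 29, 93], [12.6, 58, 51], [10.9, 37, 76], [23.1, 46, 96],
        [23.1, 50, 77], [21.6, 44, 93], [23.1, 56, 95], [1.9, 36, 54],
        [26.8, 58, 168], [29.9, 51, 99],
    ])
    X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
    beta, *_ = np.linalg.lstsq(X, data[:, 2], rcond=None)
    print(beta)   # roughly [56.25, 1.79, 0.087], as in the output above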
THE MULTIPLE LINEAR REGRESSION MODEL
In Multiple Linear Regression we assume the model

    Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon

where β_0, β_1, β_2, ..., β_p are unknown parameters and ε is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation σ.
SUMMARY OF THE STATISTICS USED IN MULTIPLE REGRESSION
The Least Squares Estimates \hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p are the values that minimize

    RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \ldots + \hat{\beta}_p x_{pi})\right)^2

The Analysis of Variance Table Entries
a) Adjusted Total Sum of Squares (SS_Total):

    SS_{Total} = \sum_{i=1}^{n}(y_i - \bar{y})^2 \quad (d.f. = n - 1)

b) Residual Sum of Squares (SS_Error):

    RSS = SS_{Error} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \quad (d.f. = n - p - 1)

c) Regression Sum of Squares (SS_Reg):

    SS_{Reg} = SS(\beta_1, \beta_2, \ldots, \beta_p) = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \quad (d.f. = p)

Note:

    \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2

i.e. SS_Total = SS_Reg + SS_Error.
THE ANALYSIS OF VARIANCE TABLE

    Source       Sum of Squares   d.f.     Mean Square                         F
    Regression   SS_Reg           p        SS_Reg/p = MS_Reg                   MS_Reg/s^2
    Error        SS_Error         n-p-1    SS_Error/(n-p-1) = MS_Error = s^2
    Total        SS_Total         n-1
USES:
1. To estimate σ² (the error variance): use s² = MS_Error to estimate σ².
2. To test the hypothesis H_0: β_1 = β_2 = ... = β_p = 0. Use the test statistic

    F = \frac{MS_{Reg}}{MS_{Error}} = \frac{MS_{Reg}}{s^2} = \frac{SS_{Reg}/p}{SS_{Error}/(n-p-1)}

Reject H_0 if F > F_α(p, n-p-1).
3. To compute other statistics that are useful in describing the relationship between Y (the dependent variable) and X_1, X_2, ..., X_p (the independent variables).
a) R² = the coefficient of determination:

    R^2 = \frac{SS_{Reg}}{SS_{Total}} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

= the proportion of variance in Y explained by X_1, X_2, ..., X_p.
1 - R² = SS_Error/SS_Total = the proportion of variance in Y that is left unexplained by X_1, X_2, ..., X_p.
b) R_a² = "R² adjusted" for degrees of freedom
= 1 - [the proportion of variance in Y that is left unexplained by X_1, X_2, ..., X_p, adjusted for d.f.]:

    R_a^2 = 1 - \frac{MS_{Error}}{MS_{Total}} = 1 - \frac{SS_{Error}/(n-p-1)}{SS_{Total}/(n-1)} = 1 - \frac{n-1}{n-p-1}\left(1 - R^2\right)
c) R = √R² = the Multiple correlation coefficient of Y with X_1, X_2, ..., X_p:

    R = \sqrt{\frac{SS_{Reg}}{SS_{Total}}}

= the maximum correlation between Y and a linear combination of X_1, X_2, ..., X_p.
Comment: the statistics F, R², R_a² and R are equivalent statistics.
USING STATISTICAL PACKAGES
To perform Multiple Regression
USING SPSS
Note: The use of another statistical package
such as Minitab is similar to using SPSS
AFTER STARTING THE SPSS PROGRAM THE FOLLOWING DIALOGUE BOX APPEARS:
IF YOU SELECT OPENING AN EXISTING FILE AND PRESS OK
THE FOLLOWING DIALOGUE BOX APPEARS
THE FOLLOWING DIALOGUE BOX APPEARS:
IF THE VARIABLE NAMES ARE IN THE FILE ASK IT TO
READ THE NAMES. IF YOU DO NOT SPECIFY THE
RANGE THE PROGRAM WILL IDENTIFY THE RANGE:
Once you click OK, two windows will appear
ONE THAT WILL CONTAIN THE OUTPUT:
THE OTHER CONTAINING THE DATA:
TO PERFORM ANY STATISTICAL ANALYSIS SELECT
THE ANALYZE MENU:
THEN SELECT REGRESSION AND LINEAR.
THE FOLLOWING REGRESSION DIALOGUE BOX
APPEARS
SELECT THE DEPENDENT VARIABLE Y.
SELECT THE INDEPENDENT VARIABLES X
1
, X
2
, ETC.
IF YOU SELECT THE METHOD - ENTER.
All variables will be put into the equation.

There are also several other methods that can be
used :
1. Forward selection
2. Backward Elimination
3. Stepwise Regression

Forward selection

1. This method starts with no variables in the
equation
2. Carries out statistical tests on variables not in
the equation to see which have a significant
effect on the dependent variable.
3. Adds the most significant.
4. Continues until all variables not in the
equation have no significant effect on the
dependent variable.
Backward Elimination

1. This method starts with all variables in the
equation
2. Carries out statistical tests on variables in the
equation to see which have no significant
effect on the dependent variable.
3. Deletes the least significant.
4. Continues until all variables in the equation
have a significant effect on the dependent
variable.
Stepwise Regression (uses both forward and
backward techniques)

1. This method starts with no variables in the
equation
2. Carries out statistical tests on variables not in
the equation to see which have a significant
effect on the dependent variable.
3. It then adds the most significant.
4. After a variable is added it checks to see if any
variables added earlier can now be deleted.
5. Continues until all variables not in the
equation have no significant effect on the
dependent variable.
All of these methods are procedures for
attempting to find the best equation

The best equation is the equation that is the
simplest (not containing variables that are not
important) yet adequate (containing variables
that are important)

ONCE THE DEPENDENT VARIABLE, THE INDEPENDENT VARIABLES
AND THE METHOD HAVE BEEN SELECTED IF YOU PRESS OK, THE
ANALYSIS WILL BE PERFORMED.
THE OUTPUT WILL CONTAIN THE FOLLOWING TABLE

Model Summary
    Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
    1       .822a   .676       .673                4.46
    a. Predictors: (Constant), WEIGHT, HORSE, ENGINE

R² and adjusted R² measure the proportion of variance in Y that is explained by X_1, X_2, X_3, etc. (67.6% and 67.3%).
R is the Multiple correlation coefficient (the maximum correlation between Y and a linear combination of X_1, X_2, X_3, etc.).
THE NEXT TABLE IS THE ANALYSIS OF VARIANCE TABLE
The F test is testing whether the regression coefficients of the predictor variables are all zero, namely that none of the independent variables X_1, X_2, X_3, etc. have any effect on Y.

ANOVA b
    Model 1      Sum of Squares   df    Mean Square   F         Sig.
    Regression   16098.158        3     5366.053      269.664   .000a
    Residual     7720.836         388   19.899
    Total        23818.993        391
    a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
    b. Dependent Variable: MPG
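The summary-table quantities can be recomputed directly from these ANOVA entries; a quick sketch:

    ss_reg, ss_err, ss_tot = 16098.158, 7720.836, 23818.993
    p, n = 3, 392                                # df: p = 3, n - p - 1 = 388
    F = (ss_reg / p) / (ss_err / (n - p - 1))    # about 269.66
    R2 = ss_reg / ss_tot                         # about 0.676
    R2_adj = 1 - (ss_err/(n - p - 1)) / (ss_tot/(n - 1))  # about 0.673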
THE FINAL TABLE IN THE OUTPUT
gives the estimates of the regression coefficients, their standard errors, and the t test for testing whether they are zero.
Note: engine size has no significant effect on mileage.

Coefficients a
    Model 1      B           Std. Error   Standardized Beta   t        Sig.
    (Constant)   44.015      1.272                            34.597   .000
    ENGINE       -5.53E-03   .007         -.074               -.786    .432
    HORSE        -5.56E-02   .013         -.273               -4.153   .000
    WEIGHT       -4.62E-03   .001         -.504               -6.186   .000
    a. Dependent Variable: MPG
THE ESTIMATED EQUATION FROM THE TABLE ABOVE IS:

    Mileage = 44.0 - \frac{5.53}{1000}\,Engine - \frac{5.56}{100}\,Horse - \frac{4.62}{1000}\,Weight + Error

Mileage decreases:
1. with increases in engine size (not significant, p = 0.432);
2. with increases in horsepower (significant, p = 0.000);
3. with increases in weight (significant, p = 0.000).

LOGISTIC REGRESSION
Recall the simple linear regression model:

    y = \beta_0 + \beta_1 x + \varepsilon

where we are trying to predict a continuous dependent variable y from a continuous independent variable x.
This model can be extended to the multiple linear regression model:

    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \varepsilon

Here we are trying to predict a continuous dependent variable y from several continuous independent variables x_1, x_2, ..., x_p.
Now suppose the dependent variable y is binary: it takes on two values, "Success" (1) or "Failure" (0). This is the situation in which Logistic Regression is used. We are interested in predicting y from a continuous independent variable x.
EXAMPLE
We are interested in how the success (y) of a new antibiotic cream in curing acne problems depends on the amount (x) that is applied daily. The values of y are 1 (Success) or 0 (Failure). The values of x range over a continuum.
THE LOGISTIC REGRESSION MODEL
Let p denote P[y = 1] = P[Success]. This quantity will increase with the value of x.
The ratio

    \frac{p}{1-p}

is called the odds ratio. This quantity will also increase with the value of x, ranging from zero to infinity.
The quantity

    \ln\left(\frac{p}{1-p}\right)

is called the log odds ratio.
EXAMPLE: ODDS RATIO, LOG ODDS RATIO
Suppose a die is rolled: Success = roll a six, p = 1/6.
The odds ratio:

    \frac{p}{1-p} = \frac{1/6}{5/6} = \frac{1}{5}

The log odds ratio:

    \ln\left(\frac{p}{1-p}\right) = \ln\left(\frac{1}{5}\right) = \ln(0.2) = -1.60944
THE LOGISTIC REGRESSION MODEL
In terms of the odds ratio:

    \frac{p}{1-p} = e^{\beta_0 + \beta_1 x}

i.e.:

    \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x

The model assumes the log odds ratio is linearly related to x.
THE LOGISTIC REGRESSION MODEL
Solving for p in terms of x:

    \frac{p}{1-p} = e^{\beta_0 + \beta_1 x}

    p = (1-p)\,e^{\beta_0 + \beta_1 x}

    p + p\,e^{\beta_0 + \beta_1 x} = e^{\beta_0 + \beta_1 x}

    p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
INTERPRETATION OF THE PARAMETER β_0 (DETERMINES THE INTERCEPT)
[Graph: p vs x; at x = 0 the curve has height]

    p = \frac{e^{\beta_0}}{1 + e^{\beta_0}}
INTERPRETATION OF THE PARAMETER β_1 (DETERMINES WHEN p IS 0.50, ALONG WITH β_0)
[Graph: p vs x; the curve crosses p = 0.5]

    p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1+1} = \frac{1}{2}

when

    \beta_0 + \beta_1 x = 0, \quad \text{i.e.} \quad x = -\frac{\beta_0}{\beta_1}
ALSO

    \frac{dp}{dx} = \frac{d}{dx}\left(\frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}\right) = \frac{\beta_1 e^{\beta_0 + \beta_1 x}\left(1 + e^{\beta_0 + \beta_1 x}\right) - e^{\beta_0 + \beta_1 x}\,\beta_1 e^{\beta_0 + \beta_1 x}}{\left(1 + e^{\beta_0 + \beta_1 x}\right)^2} = \frac{\beta_1 e^{\beta_0 + \beta_1 x}}{\left(1 + e^{\beta_0 + \beta_1 x}\right)^2}

When x = -β_0/β_1 (so that e^{β_0 + β_1 x} = 1):

    \frac{dp}{dx} = \frac{\beta_1 (1)}{(1+1)^2} = \frac{\beta_1}{4}

Thus β_1/4 is the rate of increase in p with respect to x when p = 0.50.
INTERPRETATION OF THE PARAMETER β_1 (DETERMINES SLOPE WHEN p IS 0.50)
[Graph: p vs x with tangent line at p = 0.5]

    \text{slope} = \frac{\beta_1}{4}
THE DATA
The data will for each case consist of
1. a value for x, the continuous independent
variable
2. a value for y (1 or 0) (Success or Failure)
Total of n = 250 cases
case x y
230 4.7 1
231 0.3 0
232 1.4 0
233 4.5 1
234 1.4 1
235 4.5 1
236 3.9 0
237 0.0 0
238 4.3 1
239 1.0 0
240 3.9 1
241 1.1 0
242 3.4 1
243 0.6 0
244 1.6 0
245 3.9 0
246 0.2 0
247 2.5 0
248 4.1 1
249 4.2 1
250 4.9 1
case x y
1 0.8 0
2 2.3 1
3 2.5 0
4 2.8 1
5 3.5 1
6 4.4 1
7 0.5 0
8 4.5 1
9 4.4 1
10 0.9 0
11 3.3 1
12 1.1 0
13 2.5 1
14 0.3 1
15 4.5 1
16 1.8 0
17 2.4 1
18 1.6 0
19 1.9 1
20 4.6 1
ESTIMATION OF THE PARAMETERS
The parameters are estimated by Maximum
Likelihood estimation and require a
statistical package such as SPSS
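Packages other than SPSS will do the same maximum likelihood fit. A sketch with Python's statsmodels, using only the 20 cases listed first above (so the estimates will differ somewhat from the full-sample output shown later):

    import statsmodels.api as sm

    x = [0.8, 2.3, 2.5, 2.8, 3.5, 4.4, 0.5, 4.5, 4.4, 0.9,
         3.3, 1.1, 2.5, 0.3, 4.5, 1.8, 2.4, 1.6, 1.9, 4.6]
    y = [0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1]

    fit = sm.Logit(y, sm.add_constant(x)).fit()  # maximum likelihood
    print(fit.params)                            # [beta0_hat, beta1_hat]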
USING SPSS TO PERFORM LOGISTIC REGRESSION
Open the data file:
Choose from the menu:
Analyze -> Regression -> Binary Logistic
The following dialogue box appears
Select the dependent variable (y) and the independent
variable (x) (covariate).
Press OK.
Here is the output
The Estimates and their S.E.
THE PARAMETER ESTIMATES

                β         S.E.
    X           1.0309    0.1334
    Constant   -2.0475    0.332

i.e. β_1 = 1.0309 and β_0 = -2.0475.
INTERPRETATION OF THE PARAMETER β_0 (DETERMINES THE INTERCEPT)

    \text{intercept} = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{e^{-2.0475}}{1 + e^{-2.0475}} = 0.1143

Interpretation of the parameter β_1 (determines when p is 0.50, along with β_0):

    x = -\frac{\beta_0}{\beta_1} = \frac{2.0475}{1.0309} = 1.986

Another interpretation of the parameter β_1: β_1/4 is the rate of increase in p with respect to x when p = 0.50:

    \frac{\beta_1}{4} = \frac{1.0309}{4} = 0.258
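A sketch computing the three interpretations from the estimates:

    import math
    b0, b1 = -2.0475, 1.0309
    p_at_0 = math.exp(b0) / (1 + math.exp(b0))  # intercept, about 0.1143
    x_half = -b0 / b1                           # p = 0.5 at x of about 1.986
    slope_half = b1 / 4                         # dp/dx at p = 0.5, about 0.258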
THE LOGISTIC REGRESSION MODEL
The dependent variable y is binary, taking on two values, "Success" (1) or "Failure" (0). We are interested in predicting y from a continuous independent variable x.
Let p denote P[y = 1] = P[Success]; p increases with the value of x. The ratio p/(1-p) is called the odds ratio; it also increases with x, ranging from zero to infinity. The quantity ln(p/(1-p)) is called the log odds ratio.
THE LOGISTIC REGRESSION MODEL
In terms of the odds ratio:

    \frac{p}{1-p} = e^{\beta_0 + \beta_1 x}, \quad \text{i.e.} \quad \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x

The model assumes the log odds ratio is linearly related to x.
In terms of p:

    p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
THE GRAPH OF p VS x
[Graph: the logistic curve]

    p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
THE MULTIPLE LOGISTIC REGRESSION MODEL
Here we attempt to predict the outcome of a binary response variable Y from several independent variables X_1, X_2, etc.:

    \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p

or

    p = \frac{e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p}}
MULTIPLE LOGISTIC REGRESSION: AN EXAMPLE
In this example we are interested in determining the risk of infants (who were born prematurely) of developing BPD (bronchopulmonary dysplasia).
More specifically, we are interested in developing a predictive model which will determine the probability of developing BPD from X_1 = gestational age and X_2 = birth weight.
For n = 223 infants in a prenatal ward the following measurements were determined:
1. X_1 = gestational age (weeks),
2. X_2 = birth weight (grams), and
3. Y = presence of BPD.
THE DATA
case  Gestational Age  Birthweight  presence of BPD
1 28.6 1119 1
2 31.5 1222 0
3 30.3 1311 1
4 28.9 1082 0
5 30.3 1269 0
6 30.5 1289 0
7 28.5 1147 0
8 27.9 1136 1
9 30 972 0
10 31 1252 0
11 27.4 818 0
12 29.4 1275 0
13 30.8 1231 0
14 30.4 1112 0
15 31.1 1353 1
16 26.7 1067 1
17 27.4 846 1
18 28 1013 0
19 29.3 1055 0
20 30.4 1226 0
21 30.2 1237 0
22 30.2 1287 0
23 30.1 1215 0
24 27 929 1
25 30.3 1159 0
26 27.4 1046 1
THE RESULTS

    \ln\left(\frac{p}{1-p}\right) = 16.858 - 0.003\,BW - 0.505\,GA

Variables in the Equation
    Step 1a           B        S.E.    Wald     df   Sig.    Exp(B)
    Birthweight       -.003    .001    4.885    1    .027    .998
    Gestational Age   -.505    .133    14.458   1    .000    .604
    Constant          16.858   3.642   21.422   1    .000    2.1E+07
    a. Variable(s) entered on step 1: Birthweight, Gestational Age.

Equivalently:

    \frac{p}{1-p} = e^{16.858 - 0.003\,BW - 0.505\,GA}

and

    p = \frac{e^{16.858 - 0.003\,BW - 0.505\,GA}}{1 + e^{16.858 - 0.003\,BW - 0.505\,GA}}
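A sketch of the fitted risk function (coefficients from the table above; the 1200 g / 29 week case is just an illustrative input of mine):

    import math

    def p_bpd(bw, ga):
        z = 16.858 - 0.003 * bw - 0.505 * ga   # estimated log odds
        return math.exp(z) / (1 + math.exp(z))

    print(p_bpd(1200, 29))   # about 0.20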
GRAPH: SHOWING RISK OF BPD VS GA AND BIRTHWEIGHT
[Graph: p vs birth weight (700-1700 g), one curve for each GA = 27, 28, 29, 30, 31, 32 weeks]
NON-PARAMETRIC STATISTICS