Statistics and Quantitative Analysis

Statistics and Quantitative
Analysis U4320
Lecture 13: Explaining Variation

Prof. Sharyn OHalloran
I. Explaining Variation: R2
A. Breaking Down the Distances
Let's go back to the basics of regression
analysis.
Y
Money Spent 8 x x
on Health Care 7 x Y=a+bx
x x
6 Y
x x
5
4 x
10 20 30 40 50 60 70 X
Income
How well does the predicted line explain the variation in
the independent variable money spent?
Total Variation
$ Spent on
Health Care
Y
Y-Y=deviation unexplained by regression
8 x x
(x,y)
7 x
x x Y-Y=deviation explained by regression
6
5.9 Y
x x Y-Y=total deviation around Y
5
4 x
Y=a+bx
10 20 30 40 50 60 70 X
Income
Total Deviation
Y Y (Y Y ) (Y Y ) .
Total = Explained + Unexplained
Deviation Deviation Deviation
The total distance from any point to Y is the sum of

the distance from Y to the regression line plus the
distance from the regression line to Y .
B. Sums of Squares
We can sum this equation across all the Y's and
square both sides to get:
2
(Y Y ) 2
(Y Y ) (Y Y )
(Y Y )2 2 (Y Y )(Y Y ) (Y Y )2
(Y Y )2 (Y Y )2 ,
1. Total Sum of Squares (SST).
The term on the left-hand side of this equation is the
sum of the squared distances from all points to Y .
We call this the total variation in the Y's, or the
Total Sum of Squares (SST).
2. Regression Sum of Squares
The first term on the right hand side is the sum of
the squared distances from the regression line to Y .
We call it the Regression Sum of Squares, or
SSR.
3. Error Sum of Squares
Finally, the last term is the sum of the squared
distances from the points to the regression line.
Remember, this is the quantity that least squares
minimizes. We call it the Error Sum of Squares, or
SSE.
We can rewrite the previous equation as:
SST = SSR + SSE.
C. Definition of R2
We can use these new terms to determine how
much variation is explained by the regression
line.
If the points are perfectly linear, then the Error
Sum of Squares is 0:
Y
Money Spent 8 x
on Health Care 7 x
x
6 x
x Y
5 x
4 x
10 20 30 40 50 60 70 X
Income
Here, SSR = SST. The variance in the Y's is
completely explained by the regression line.
On the other hand, if there is no relation
between X and Y:
Y
Money Spent 8 x
on Health Care 7 x
x x
6 Y
x x
5 x
4
10 20 30 40 50 60 70 X
Income
Now SSR is 0 and SSE=SST. The regression
line explains none of the variance in Y.
3. Formula
So we can construct a useful statistic.
Take the ratio of the Regression Sum of Squares
to the Total Sum of Squares:

SSR
R2
SST
We call this statistic R2
It represents the percent of the variation in Y
explained by the regression
R2 is always between 0 and 1.
For a perfectly straight line it's 1, which is
perfect correlation.
For data with little relation, it's near 0.
R2 measures the explanatory power of your model.

The more of the variance in Y you can explain,
the more powerful your model.

D. Example
I wanted to investigate why people have
confidence in what they see on TV.
1. Dependent variable
TRUSTTV = 1 if has a lot of confidence
= 2 if somewhat confidence
= 3 if the individual has no confidence.
2. Independent variables
TUBETIME = number of Hours of TV watched a week
SKOOL = years of education.
LIKEJPAN = feelings towards Japan.
YELOWSTN = attitudes whether the US should spend
more on national parks.
MYSIGN = the respondents astrological sign.
3. Calculating R2
a) Correlation matrix
TRUSTTV TUBETIME SKOOL LIKEJPAN YELOWSTN MYSIGN
TRUSTTV 1.000 -.177 .112 .043 .003 -.038

TUBETIME -.177 1.000 -.272 .080 -.137 .053
SKOOL .112 -.272 1.000 -.072 -.016 .012
LIKEJPAN .043 .080 -.072 1.000 .040 -.001
YELOWSTN .003 -.137 -.016 .040 1.000 -.020
MYSIGN -.038 .053 .012 -.001 -.020 1.000
b) The first model is:
TRUSTTV = 2.34 0.0539 (TUBETIME)
Analysis of Variance
DF Sum of Squares Mean Square
Regression 1 5.90547 5.90547
Residual 468 183.61793 0.39235
How do we calculate the Total Sum of Squares?
SST = SSR + SSE
SST = 5.90 + 183.61 = 189.51
Now we can calculate R2:
SSR
R2
SST
5.90

189.54
.031
c) Each of the 4 different models has an
associated R2.
Eq # 2: R2 = 0.035
Eq # 3: R2 = 0.039
Eq # 4 = R2 = .04
F. Using R2 in Practice
1. Useful Tool
2. Measure of Unexplained Variance
3. Not a Statistical Test
4. Don't Obsess about R2
5. You can always improve R2 by adding
variables
G. Example
You'll notice that R2 increase every time.
No matter what variables you add you can
always increase your R2.
II. Adjusted R2
II. Adjusted R2
A. Definition of Adjusted R2
So we'd like a measure like R2, but one that takes into
account the fact that adding extra variables always
increases your explanatory power.
The statistic we use for this is call the Adjusted R2, and
its formula is:
n 1
R2 1 (1 R2 );
nk
n number of observations,
k number of independent variables.
So the Adjusted R2 can actually fall if the variable you

add doesn't explain much of the variance.
II. Adjusted R2
B. Back to the Example
1. Adjusted R2
You can see that the adjusted R2 rises from equation
1 to equation 2, and from equation 2 to equation 3.
But then it falls from equation 3 to 4, when we add
in the variables for national parks and the zodiac.
II. Adjusted R2
2. Calculating Adjusted R2
Example: Equation 2
Multiple R .18856
R Square .03555
Adjusted R Square .03142
Standard Error .62562
Regression 2 6.73848 3.36924
Residual 467 182.78492 .39140
F= 8.60813 Signif F = .0002

II. Adjusted R2
Variable B SE B Beta T Sig T

SKOOL .015524 .010641 .068897 1.459 .1453
TUBETIME -.048228 .014436 -.157772 -3.341 .0009
(Constant) 2.127991 .158260 13.446 .0000
We calculate:
n 1
R2 1 (1 R2 )
nk
470 1
1 (1.03555)
470 3
.0314
II. Adjusted R2
C. Stepwise Regression
One strategy for model building is to add
variables only if they increase your adjusted R2.
This technique is called stepwise regression.
However, I don't want to emphasize this
approach to strongly. Just as people can fixate
on R2 they can fixate on adjusted R2.
****IMPORTANT****
If you have a theory that suggests that certain
variables are important for your analysis then
include them whether or not they increase the
adjusted R2.
Negative findings can be important!
III. F Tests
III. F Tests
A. When to use an F-Test?
Say you add a number of variables into a
regression model and you want to see if, as a
group, they are significant in explaining
variation in your dependent variable Y.
The F-test tells you whether a group of
variables, or even an entire model, is jointly
significant.
This is in contrast to a t-test, which tells whether an
individual coefficient is significantly different from
zero.
III. F Tests
B. Equations
To be precise, say our original equation is:
EQ 1: Y = b0 + b1X1 + b2X2,
and we add two more variables, so the new equation is:
EQ 2: Y = b0 + b1X1 + b2X2 + b3X3 + b4X4.
We want to test the hypothesis that
H0: b3 = b4 = 0.
That is, we want to test the joint hypothesis that X3 and X4
together are not significant factors in determining Y.
III. F Tests
C. Using Adjusted R2 First
There's an easy way to tell if these two
variables are not significant.
First, run the regression without X3 and X4 in it, then
run the regression with X3 and X4.
Now look at the adjusted R2's for the two
regressions. If the adjusted R2 went down, then X3
and X4 are not jointly significant.
So the adjusted R2 can serve as a quick test for
insignificance.
III. F Tests
D. Calculating an F-Test
If the adjusted R2 goes up, then you need to do a
more complicated test, F-Test.
1. Ratio
Let regression 1 be the model without X3 and X4, and
let regression 2 include X3 and X4.
The basic idea of the F statistic, then, is to compute
the ratio:
SSE1 SSE2
SSE2
III. F Tests
2. Correction
We have to correct for the number of independent
we add.
So the complete statistic is:
SSE1 SSE2
F m ;
SSE2
nk
m number of restrictions;
k number of independent variables.
Remember that k is the total number of independent

variables, including the ones that you are testing and
the constant.
III. F Tests
2. Correction (cont.)
This equation defines an F statistic with m and n-k
degrees of freedom.
We write it like this:
Fnm k
To get critical values for the F statistic, we use a set
of tables, just like for the normal and t-statistics.
III. F Tests
E. Example
1. Adding Extra Variables: Are a group of
variables jointly significant?
Are the variables YELOWSTN and MYSIGN jointly
significant?
EQ 1: TRUSTTV = b0 + b1LIKE JPAN + b2SKOOL + b3TUBETIME .
EQ 2: TRUSTTV = b0 + b1LIKE JPAN + b2SKOOL + b3TUBETIME +
b4MYSIGN + b5YELOWSTN
III. F Tests
1. Adding Extra Variables (cont.)
a) State the null hypothesis
H0: b4 = b5 = 0.
b) Calculate the F-statistic

Our formula for the F statistic is:
SSE1 SSE2
F m ,
SSE2
nk
III. F Tests
What is SSE1 --the sum of squared errors in the
first regression?
What is SSE2, the sum of squared errors in the
second regression?
m=2 N = 470 k=6
The formula is:
182.07 18182
.
F 2
18182
.
470 6
= 0.319
III. F Tests
c) Reject or fail to reject the null hypothesis?
The critical value at the 5 % level
2
F
470 6 from the table, is 3.00.
2
Is the F-statistic > F ?
470 6
If yes, then we reject the null hypothesis that the
variables are not significantly different from zero;
otherwise we fail to reject.
.319 < 3.00, so we can reject the null hypothesis.
B=0 0.319 3.0

III. F Tests
2. Testing All Variables: Is the Model
Significant?
Equation 2
Multiple R .18856
R Square .03555
Adjusted R Square .03142
Standard Error .62562
Regression 2 6.73848 3.36924
Residual 467 182.78492 .39140
F= 8.60813 Signif F = .0002

III. F Tests
Variable B SE B Beta T Sig T
SKOOL .015524 .010641 .068897 1.459 .1453

TUBETIME -.048228 .014436 -.157772 -3.341 .0009
(Constant) 2.127991 .158260 13.446 .0000
a) Hypothesis:
H0: b1 = b2 = 0.
Again, we start with our formula:
SSE1 SSE2
F m ,
SSE2
nk
III. F Tests
b) Calculate F-statistic
SSE2 = 182.78
SSE1 is the sum of squared errors when there
are no explanatory variables at all.

If there are no explanatory variables, then SSR
must be 0. In this case, SSE=SST.
So we can substitute SST for SSE1 in our formula.
SST = SSR + SSE = 6.738 + 182.78 = 189.54
189.54 182.78
F 2
182.78
470 3
= 8.61.
This is the number reported in your printout under the
F statistic.
III. F Tests
c) Reject or fail to reject the null hypothesis?
2
The critical value at the 5% level, F470 3 from
your table, is 3.00.

So this time we can reject the null hypothesis
that b1 = b2 = 0.

Statistics and Quantitative Analysis

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Statistics and Quantitative Analysis

Uploaded by

Copyright:

Available Formats

Statistics and Quantitative

Lecture 13: Explaining Variation

The total distance from any point to Y is the sum of

to the Total Sum of Squares:

R2 measures the explanatory power of your model.

the more powerful your model.

TRUSTTV TUBETIME SKOOL LIKEJPAN YELOWSTN MYSIGN

TRUSTTV 1.000 -.177 .112 .043 .003 -.038

So the Adjusted R2 can actually fall if the variable you

F= 8.60813 Signif F = .0002

Variable B SE B Beta T Sig T

Remember that k is the total number of independent

b) Calculate the F-statistic

B=0 0.319 3.0

F= 8.60813 Signif F = .0002

SKOOL .015524 .010641 .068897 1.459 .1453

SSE1 is the sum of squared errors when there

are no explanatory variables at all.

your table, is 3.00.

You might also like