
More Multiple Regression

Approaches to Regression Analysis, Types of Correlations and Advanced Regression
Regression Analysis

$$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$$

[Figure: diagram of y with predictors X1, X2 and X3]

Types of Regression Analysis
Standard Regression

Standard or Simultaneous Regression

Put all of the predictors in at one time, and the coefficients are calculated for all of them, controlling for all others.

Method = Enter in SPSS


Sequential
Forward Sequential
What does a predictor add to the prediction equation, over and above the variables already in the equation?

You think that X1 is the more important predictor, and your interest in X2 is what it adds to the X1 -> Y prediction.

Real (user-specified) forward sequential regression in SPSS is done by setting Method = Enter and using the blocks function.
Statistical Forward Sequential
Starts with Y = a. All potential predictors are assessed and compared to an entry criterion; the variable with the lowest F probability (p < .05) enters the equation.

The remaining predictors are re-evaluated given the new equation (Y = a + X_first entered), and the next variable with the lowest probability enters, etc.

This continues until either all of the variables are entered or no other variables meet the entry criterion.

Once variables enter the equation, they remain.

Method = Forward in SPSS
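
The forward procedure described above can be sketched in a few lines of plain Python. This is only an illustration, not SPSS's implementation: the names `ols_sse`, `forward_select` and `f_to_enter` are made up for this sketch, the data are the eight families from the credit-card example in these slides, and the entry criterion is an F-value threshold (F-to-enter) rather than an F probability, because the Python standard library has no F distribution. The threshold 3.84 is an assumed value chosen for illustration.

```python
def ols_sse(y, columns):
    """Residual sum of squares of y regressed on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Build the normal equations (X'X) b = X'y as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Solve by Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [u - f * v for u, v in zip(A[r], A[k])]
    coef = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(c * x for c, x in zip(coef, X[i]))) ** 2
               for i in range(n))


def forward_select(y, predictors, f_to_enter=3.84):
    """Enter predictors one at a time; entered predictors are never removed."""
    entered, remaining = [], dict(predictors)
    sse_current = ols_sse(y, [])          # intercept-only model, Y = a
    while remaining:
        best = None
        for name, col in remaining.items():
            cols = [predictors[e] for e in entered] + [col]
            sse_new = ols_sse(y, cols)
            df_resid = len(y) - len(cols) - 1
            # Partial F for adding this one predictor to the current equation.
            f_stat = (sse_current - sse_new) / (sse_new / df_resid)
            if best is None or f_stat > best[1]:
                best = (name, f_stat, sse_new)
        if best[1] < f_to_enter:          # nothing meets the entry criterion
            break
        entered.append(best[0])
        sse_current = best[2]
        del remaining[best[0]]
    return entered


y = [4, 6, 6, 7, 8, 7, 8, 10]
predictors = {
    "family_size": [2, 2, 4, 4, 5, 5, 6, 6],
    "family_income": [14, 16, 14, 17, 18, 21, 17, 25],
}
order = forward_select(y, predictors)
```

With the assumed threshold of 3.84 both predictors enter, family size first. A stricter threshold (for example 6.61) stops after family size, which shows how much the result depends on the entry criterion.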


Backward Sequential
Can predictors be removed from an equation without hurting the prediction of Y? In other words, can a prediction equation be simplified?

You know there is a set of predictors of a certain variable and you want to know if any of them can be removed without weakening the prediction.

In SPSS, put all predictors in block 1 with Method = Enter; in block 2, put any variables you want removed with Method = Remove, etc.
Statistical backward sequential
All variables are entered, and then each is tested against an exit criterion: an F probability above a set value (p > .10).

The variable with the worst (highest) probability is then removed.

The remaining variables are re-evaluated given the new equation, and the next variable with the worst probability is then removed.

This continues until all remaining variables meet the criterion for staying in the equation or all variables have been removed.

In SPSS, this is Method = Backward.


Stepwise
(Purely Statistical Regression)
At each step of the analysis, variables are tested against both entry and exit criteria.

Starts with the intercept only, then tests all of the variables to see if any match the entry criteria.

Any matches enter the equation.

The next step tests un-entered variables against the entry criteria and entered variables against the exit criteria, and so on.
Stepwise
This cycles through adding and removing variables until none meet the entry or exit criteria.

Variables can be added or removed over and over, given the new state of the equation each time.

Considered a very post-hoc type of analysis and is not recommended.
Correlations and Effect size
Ballantine

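Regular Correlation (Zero Order, Pearson)

[Figure: ballantine (Venn diagram) of Y, X1 and X2 with areas labeled a, b, c, d and e]

$$r_{y1}^2 = a + c$$
$$r_{y2}^2 = b + c$$
$$r_{12}^2 = c + d$$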
Standard Regression

Partial Correlation

Correlation between Y and X1 with the influence of X2 removed from both (Yres, X1res).

[Figure: ballantine of Y, X1 and X2 with areas labeled a, b, c, d and e]

In the ballantine, the squared partial correlation is area a/(a + e) for x1 and b/(b + e) for x2.
Semipartial or Part Correlation

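Correlation between Y and X1 with the influence of X2 removed from X1 only (Y, X1res).

[Figure: ballantine of Y, X1 and X2 with areas labeled a, b, c, d and e]

In the ballantine, the squared semipartial correlation is area a/(a + b + c + e) for x1 and b/(a + b + c + e) for x2.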
Semipartials and Bs

Bs and semipartials are very similar.

B is the amount of change in Y for every unit change in Xi, while controlling for other Xs on Xi.

Semipartials are measures of the relationship between Y and Xi, controlling for other Xs on Xi.
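
As a rough numerical check on the partial and semipartial definitions above, both can be computed from the three zero-order correlations. The sketch below uses the standard two-predictor formulas and the eight families from the credit-card example in these slides; the helper `pearson` and the variable names are illustrative only.

```python
from math import sqrt


def pearson(u, v):
    """Zero-order (Pearson) correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


y = [4, 6, 6, 7, 8, 7, 8, 10]            # number of credit cards
x1 = [2, 2, 4, 4, 5, 5, 6, 6]            # family size
x2 = [14, 16, 14, 17, 18, 21, 17, 25]    # family income ($000)

r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)

# Part of the Y-X1 relationship that does not run through X2.
unique = r_y1 - r_y2 * r_12

# Semipartial (part): X2 removed from X1 only.
semipartial_x1 = unique / sqrt(1 - r_12 ** 2)

# Partial: X2 removed from both Y and X1.
partial_x1 = unique / sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))
```

Squaring the two results gives the ballantine areas: `semipartial_x1 ** 2` is X1's unique share of all the variance in Y, while `partial_x1 ** 2` is X1's unique share of the variance in Y that X2 leaves unexplained, so the partial is never smaller in magnitude than the semipartial.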
Sequential

Assuming x1 enters first

[Figure: ballantine of Y, X1 and X2 with areas labeled a, b, c, d and e]

The partial correlations would be (a + c)/(a + c + e) for x1 and unchanged for x2.

The part correlation would be (a + c)/(a + b + c + e) for x1, and x2 is unchanged.
Simple and Multiple Regression Analysis

i         yi              x1       x2
Family    Actual # of     Family   Family Income
Number    Credit Cards    Size     ($000)
1         4               2        14
2         6               2        16
3         6               4        14
4         7               4        17
5         8               5        18
6         7               5        21
7         8               6        17
8         10              6        25

We now can attempt to estimate # of CCs from our information on family size and family income!

Our regression model will now be a linear plane, rather than a straight line!

Generic Equation for a linear plane: $\hat{y} = a + b_1 x_1 + b_2 x_2$

Let's examine the regression plane for our example graphically.
$$\hat{y} = a + b_1 x_1 + b_2 x_2$$

Formulas are available for computing values of a, b1 and b2.

MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:

$$\hat{y} = .482 + .63 x_1 + .216 x_2$$

[Figure: 3-D plot of the regression plane. Axes: Y = # of Credit Cards, X1 = Family Size, Family Income. Legend: Actual, Regression Estimate]

Let's now see how much error in estimation we are committing by using this multiple regression model.
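
For two predictors, the formulas for a, b1 and b2 can be written with sums of squares and cross-products taken around the means. The sketch below is one way to do that in plain Python; the helper name `cross` is made up for this example, and the data are the eight families from the table.

```python
from statistics import mean

y = [4, 6, 6, 7, 8, 7, 8, 10]            # number of credit cards
x1 = [2, 2, 4, 4, 5, 5, 6, 6]            # family size
x2 = [14, 16, 14, 17, 18, 21, 17, 25]    # family income ($000)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v))


s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)

# Solve the two normal equations for the slopes (Cramer's rule).
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# The plane passes through the point of means, which fixes the intercept.
a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
```

Rounded, `a`, `b1` and `b2` agree with the coefficients .482, .63 and .216 in the model for our example.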
Simple and Multiple Regression Analysis
$$\hat{y} = .482 + .63 x_1 + .216 x_2$$

i         y               x1       x2        ŷ            y - ŷ        (y - ŷ)²
Family    Actual # of     Family   Family    Regression   Error        Errors
Number    Credit Cards    Size     Income    Estimate     (Residual)   Squared
                                   ($000)
1         4               2        14        ?            ?            ?
2         6               2        16        ?            ?            ?
3         6               4        14        ?            ?            ?
4         7               4        17        ?            ?            ?
5         8               5        18        ?            ?            ?
6         7               5        21        ?            ?            ?
7         8               6        17        ?            ?            ?
8         10              6        25        ?            ?            ?

$$\sum (y - \hat{y})^2 = \; ?$$
Simple and Multiple Regression Analysis
$$\hat{y} = .482 + .63 x_1 + .216 x_2$$

For family 1: $\hat{y} = .482 + .63(2) + .216(14) = 4.77$

i         y               x1       x2        ŷ            y - ŷ        (y - ŷ)²
Family    Actual # of     Family   Family    Regression   Error        Errors
Number    Credit Cards    Size     Income    Estimate     (Residual)   Squared
                                   ($000)
1         4               2        14        4.77         -.77         .59
2         6               2        16        5.20         .80          .64
3         6               4        14        6.03         -.03         .00
4         7               4        17        6.68         .32          .10
5         8               5        18        7.53         .47          .22
6         7               5        21        8.18         -1.18        1.39
7         8               6        17        7.95         .05          .00
8         10              6        25        9.67         .33          .11

SSE = Sum of Squares Error (Residual) = $\sum (y - \hat{y})^2$ = 3.05

Unique (additional) contribution of X2 (family income): 5.5 - 3.05 = 2.45, where 5.5 is the SSE of the simple regression on family size alone.
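
The SSE figures can be reproduced directly from the rounded equations. The sketch below assumes the rounded coefficients printed in these slides (.482, .63, .216 for the two-predictor model and 2.87, .97 for the family-size-only model), so its results can differ from the table in the second decimal place. The function name `sse` is illustrative.

```python
# (credit cards, family size, family income in $000) for the eight families
families = [
    (4, 2, 14), (6, 2, 16), (6, 4, 14), (7, 4, 17),
    (8, 5, 18), (7, 5, 21), (8, 6, 17), (10, 6, 25),
]


def sse(predict):
    """Sum of squared residuals (y - y_hat)^2 for a prediction function."""
    return sum((y - predict(x1, x2)) ** 2 for y, x1, x2 in families)


# Simple regression on family size only (rounded coefficients).
sse_size_only = sse(lambda x1, x2: 2.87 + 0.97 * x1)

# Multiple regression on family size and family income (rounded coefficients).
sse_both = sse(lambda x1, x2: 0.482 + 0.63 * x1 + 0.216 * x2)

# Error removed by adding family income to the equation.
unique_x2 = sse_size_only - sse_both
```

Here `sse_size_only` comes out near 5.5 and `sse_both` near 3.05, so `unique_x2` is roughly 2.4 to 2.5, in line with the slide.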
Regression Formulas

The Total Sum of Squares (SST) is equal to SSR + SSE.

Mathematically,

$$SSR = \sum (\hat{y} - \bar{y})^2$$ (measure of explained variation)

$$SSE = \sum (y - \hat{y})^2$$ (measure of unexplained variation)

$$SST = SSR + SSE = \sum (y - \bar{y})^2$$ (measure of total variation in y)
The Coefficient of Determination

The proportion of total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, and is often referred to as R².

$$R^2 = \frac{SSR}{SST} = \frac{SSR}{SSR + SSE}$$

The value of R² can range between 0 and 1; the higher its value, the more of the variation in y the regression model explains. It is often expressed as a percentage.
Simple and Multiple Regression Analysis
The MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:

$$\hat{y} = .482 + .63 x_1 + .216 x_2$$

Y-Intercept, a = .482
(NOTE: Only when all Xs can meaningfully take on a value of zero will the intercept have a meaningful/direct/practical interpretation. Otherwise, it is simply an aid in increasing accuracy of estimation.)

b1 and b2 = Regression Coefficients

0.63: Among families of the same income, an increase in family size by one person would, on average, result in .63 more credit cards.

0.216: Among families of the same size, an income increase of $1,000 results in an average increase of about 0.2 credit cards.

bs represent the effect of each X on Y when all other Xs are controlled for/held constant/taken into account, i.e., after impacts of all other variables are accounted for (remember the high blood pressure-hearing problem connection?)
Simple and Multiple Regression Analysis
The MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:

$$\hat{y} = .482 + .63 x_1 + .216 x_2$$

SST = 22        SSE = 3.05

What is our new R²?

SS Regression = 22 - 3.05 = 18.95
R² = 18.95 / 22 = .861 or 86%
Percent of differences in households' number of CCs that is explained by differences in family size and family income.

The Remaining 14%? (3.05 / 22 = .14)
Percent of variation in number of credit cards that can be accounted for by (a) all other relevant factors not included in the model, beyond family size and income, and (b) unexplainable random/chance variations.
[Figure: Venn diagram of Y = # of CC, X1 = Family Size and X2 = Family Income, with areas labeled a, b, c and d]

Total Variation/Error in Y = SS Total = a + b + c + d = 22

$$\hat{y} = 2.87 + .97 X_1$$

r² = ? R² = (a + c) / (a + b + c + d)
SSR = a + c = 16.5
R² = 16.5 / 22 = 0.75

What do we call the square root of this? The Pearson/simple correlation of Y with X1 (not controlling for X2):

$$r_{yx_1} = \sqrt{\frac{a + c}{a + b + c + d}} = \sqrt{\frac{16.5}{22}} = \sqrt{0.75} = 0.866$$

$$\hat{y} = -0.063 + .398 X_2$$

r² = (b + c) / (a + b + c + d) = 15.12 / 22 = 0.687
SSR = c + b = 15.12

Pearson/simple correlation of Y with X2 = Family Income (not controlling for X1):

$$r_{yx_2} = \sqrt{\frac{b + c}{a + b + c + d}} = \sqrt{\frac{15.12}{22}} = 0.829$$
[Figure: Venn diagram of Y, X1 = Family Size and X2 = Family Income, with areas labeled a, b, c and d]

$$\hat{y} = .482 + .63 x_1 + .216 x_2$$

R² graphically = ?
NOTE: c is explained by both X1 and X2.

SSR = a + b + c = 18.95
SST = a + b + c + d = 22
R² = SSR / SST = (a + b + c) / (a + b + c + d) = 18.95 / 22 = 86%

SSE = ?
SSE = d = 22 - 18.95 = 3.05


Simple and Multiple Regression Analysis
$$\hat{y} = .482 + .63 x_1 + .216 x_2$$

For family 1: $\hat{y} = .482 + .63(2) + .216(14) = 4.77$

i         y               x1       x2        ŷ            y - ŷ        (y - ŷ)²
Family    Actual # of     Family   Family    Regression   Error        Errors
Number    Credit Cards    Size     Income    Estimate     (Residual)   Squared
                                   ($000)
1         4               2        14        4.77         -.77         .59
2         6               2        16        5.20         .80          .64
3         6               4        14        6.03         -.03         .00
4         7               4        17        6.68         .32          .10
5         8               5        18        7.53         .47          .22
6         7               5        21        8.18         -1.18        1.39
7         8               6        17        7.95         .05          .00
8         10              6        25        9.67         .33          .11

SSE = Sum of Squares Error (Residual) = $\sum (y - \hat{y})^2$ = 3.05

Remember: Unique (additional) contribution of X2 = 5.5 - 3.05 = 2.45


Testing hypothesis about P

We need to calculate a test statistic.

How many standard deviations have we deviated if the null hypothesis p = 0.20 was true?

$$Z = \frac{0.28 - 0.20}{\sqrt{(0.20)(1 - 0.20)/100}} = 2.0$$

What is the likelihood of observing a Z = 2.0 or more extreme if the Government's figure was correct?

P-value = P[Z > 2.0] ≈ 0.023
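
The test statistic and its one-sided P-value can be checked with Python's standard library. This is a minimal sketch using the figures from this example (sample proportion 0.28, hypothesized proportion 0.20, n = 100) and the normal approximation; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p_hat, p0, n = 0.28, 0.20, 100

# Standard error of the sample proportion if the null hypothesis is true.
se0 = sqrt(p0 * (1 - p0) / n)

# Number of standard errors between the observed and hypothesized values.
z = (p_hat - p0) / se0

# Upper-tail area of the standard normal distribution beyond z.
p_value = 1 - NormalDist().cdf(z)
```

The upper tail is used because the question is whether the true proportion is larger than the hypothesized one. For a two-sided alternative the tail area would be doubled.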

How does this p-value compare with α = 0.05?

If p-value < α, then reject the null hypothesis H0 in favor of the alternative hypothesis Ha.

In this situation we reject the Government's figure.

Elements of Testing hypothesis
Null Hypothesis
Alternative hypothesis
Level of significance
Test statistics
P-value
Conclusion
Power of the test
Is there an association between
Drinking and Lung Cancer?

What is the most appropriate and feasible study design in order to test the above research hypothesis?
Case Control Study of Drinking and Lung Cancer
Null Hypothesis: There is no association between Drinking and Lung cancer, P1 = P2.

Alternative Hypothesis: There is some kind of association between Drinking and Lung cancer, P1 ≠ P2.
In the following contingency table, estimate the proportion and odds of drinkers among those who develop Lung Cancer and those without the disease.

                Lung Cancer
Drinker     Case          Control       Total
Yes         A = 33        B = 27        60
No          C = 1667      D = 2273      3940
Total       1700          2300          4000

P1 = 33/1700        P2 = 27/2300
Odds1 = 33/1667     Odds2 = 27/2273
QUESTION: Is there a difference
between the proportion of drinkers
among cases and controls?

Group 1 (Disease): P1 = proportion of drinkers
Group 2 (No Disease): P2 = proportion of drinkers
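
The proportions and odds asked for above follow directly from the four cell counts. The sketch below also compares P1 with P2 using a pooled two-proportion z test; that test is one common choice and is an assumption here, since the slides do not say which test statistic is intended. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Cell counts from the contingency table.
a, b = 33, 27         # drinkers among cases, drinkers among controls
c, d = 1667, 2273     # non-drinkers among cases, non-drinkers among controls

n1, n2 = a + c, b + d            # number of cases, number of controls
p1, p2 = a / n1, b / n2          # proportion of drinkers in each group
odds1, odds2 = a / c, b / d      # odds of drinking in each group

# Pooled two-proportion z test of H0: P1 = P2 (assumed choice of test).
pooled = (a + b) / (n1 + n2)
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# Two-sided P-value, matching the alternative hypothesis P1 != P2.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
```

With these counts the proportions are about 1.9% and 1.2%, and the z statistic is a little under 2, so the two-sided P-value sits just below 0.05.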
Test Statistic

A statistical yardstick which is computed based on the information contained in the sample under the assumption that the null hypothesis is true.

Knowledge about the sampling distribution of the test statistic is needed in determining the likelihood of observing extreme values for the test statistic.
P-value

An indicator which measures the likelihood of observing values as extreme as the one observed based on the sample information, assuming the null hypothesis is true.

P-value is also known as the observed level of significance.
The level of significance (α) is known as the nominal level of significance.
If p-value < α, then we reject the null hypothesis in favor of the alternative hypothesis.
Most statistical packages give the P-value in their computer output.
α needs to be pre-determined. (Usually 5%)
Type I and Type II errors

Type I error is committed when a true null hypothesis is rejected.
α is the probability of committing type I error.
Type II error is committed when a false null hypothesis is not rejected.
β is the probability of committing type II error.
                          Decision made about the validity of null hypothesis
                          Rejected          Not rejected
Null Hypothesis   True    Type I Error      No Error
                  False   No Error          Type II Error
Power of a test

The power of a test is the probability that a false null hypothesis is rejected.
Power = 1 - β, where β is the probability of committing type II error.
More powerful tests are preferred. At the design stage one should identify the desired level of power in the given situation.
Factors influencing the Power
The power of a test is influenced by the magnitude of the difference between the null hypothesis and the true parameter.

The power of a test could be improved by increasing the sample size.
The power of a test could be improved by increasing α. (this is a very artificial way)
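
The three factors above can be seen numerically with a small power calculation. The sketch below covers a one-sided test of a single proportion under the normal approximation. The function name `power_one_sided` and all the scenario values (a true proportion of 0.28 or 0.35, n of 100 or 200, α of 0.05 or 0.10) are assumptions chosen for illustration, not figures from the slides.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided(p0, p_true, n, alpha=0.05):
    """Approximate power of the test H0: p = p0 vs Ha: p > p0."""
    std = NormalDist()
    # Smallest sample proportion that leads to rejecting H0 at level alpha.
    cutoff = p0 + std.inv_cdf(1 - alpha) * sqrt(p0 * (1 - p0) / n)
    # Spread of the sample proportion when the true value is p_true.
    se_true = sqrt(p_true * (1 - p_true) / n)
    # Probability that the sample proportion exceeds the cutoff.
    return 1 - std.cdf((cutoff - p_true) / se_true)


base = power_one_sided(p0=0.20, p_true=0.28, n=100)
bigger_gap = power_one_sided(p0=0.20, p_true=0.35, n=100)
bigger_sample = power_one_sided(p0=0.20, p_true=0.28, n=200)
bigger_alpha = power_one_sided(p0=0.20, p_true=0.28, n=100, alpha=0.10)
```

Each of the three variations gives a higher power than `base`. Raising α buys power only by accepting more type I errors, which is why it is described as an artificial way.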
Minimum Required Sample Size
A sample size calculation formula is usually available for most of the well-known study designs. Some software packages such as Epi-Info could also be utilized for the sample size calculation.
It is extremely important to consult a biostatistician at the design phase to ensure that an adequate sample size is considered for the study.
Testing hypothesis about one
population mean
H0: μ = 16 vs Ha: μ > 16

$$Z = \frac{\text{sample mean} - \text{hypothesized mean}}{s / \sqrt{n}}$$

where $s / \sqrt{n}$ is the SE of the Mean.

Under the null hypothesis and when n is large (n > 30), the distribution of Z is standard normal.
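
A short sketch of this test in Python follows. The hypothesized mean of 16 comes from the slide; the sample mean, standard deviation and sample size are hypothetical values made up for illustration, and `z_test_mean` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, s, n, mu0):
    """One-sided large-sample z test of H0: mu = mu0 vs Ha: mu > mu0."""
    se = s / sqrt(n)                     # SE of the mean
    z = (sample_mean - mu0) / se
    p_value = 1 - NormalDist().cdf(z)    # upper-tail area
    return z, p_value


# Hypothetical sample: n = 36 observations, mean 17.0, standard deviation 3.0.
z, p_value = z_test_mean(sample_mean=17.0, s=3.0, n=36, mu0=16.0)
```

With n > 30 the standard normal reference distribution is reasonable. For small samples a t distribution would normally be used instead, which this sketch does not cover.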
P-value
