Stat Project DBA Hewaln Radwan Mahmoud

More Multiple Regression
Approaches to Regression Analysis, Types of Correlations and Advanced Regression

Regression Analysis
$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k$
[Figure: diagram of Y with predictors X1, X2 and X3]
Types of Regression Analysis
Standard Regression
Standard or Simultaneous Regression
Put all of the predictors in at one time, and the coefficients are calculated for all of them, each controlling for all the others.
Method equals Enter in SPSS.

Sequential
Forward Sequential
What does a predictor add to the prediction equation, over and above the variables already in the equation?
Example: you think X1 is the more important predictor, and your interest in X2 is what it adds to the X1 -> Y prediction.
Real (user-specified) forward sequential regression in SPSS means setting the method to Enter and using the Blocks function.
Statistical Forward Sequential
Starts with Y = a. All potential predictors are assessed and compared to an entry criterion; the variable with the lowest F probability (provided p < .05) enters the equation.
Remaining predictors are re-evaluated given the new equation (Y = a + X_first entered), and the next variable with the lowest probability enters, and so on.
This continues until either all of the variables are entered or no remaining variable meets the entry criterion.
Once variables enter the equation they remain.
Method equals Forward in SPSS.
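The forward procedure above can be sketched in plain Python. This is a minimal illustration, not SPSS's implementation: because the standard library has no F distribution, the entry criterion here is a hypothetical fixed F-to-enter threshold (`f_to_enter`) standing in for the F-probability criterion, and the helper names `ols_sse` and `forward_select` are invented for this example. The data are the family size, family income and credit-card figures from the worked example in these notes.

```python
def ols_sse(columns, y):
    """Residual sum of squares from regressing y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(bj * xj for bj, xj in zip(b, X[r]))) ** 2
               for r in range(n))


def forward_select(predictors, y, f_to_enter=4.0):
    """Return indices of predictors in order of entry (hypothetical F-to-enter rule)."""
    entered, remaining = [], list(range(len(predictors)))
    current_sse = ols_sse([], y)  # intercept-only model, Y = a
    while remaining:
        best = None
        for j in remaining:
            new_sse = ols_sse([predictors[i] for i in entered + [j]], y)
            df_resid = len(y) - len(entered) - 2  # n - (predictors in new model) - 1
            f = (current_sse - new_sse) / (new_sse / df_resid)
            if best is None or f > best[0]:
                best = (f, j, new_sse)
        if best[0] < f_to_enter:
            break  # no remaining variable meets the entry criterion
        entered.append(best[1])
        remaining.remove(best[1])
        current_sse = best[2]
    return entered


x1 = [2, 2, 4, 4, 5, 5, 6, 6]          # family size
x2 = [14, 16, 14, 17, 18, 21, 17, 25]  # family income ($000)
y = [4, 6, 6, 7, 8, 7, 8, 10]          # number of credit cards
print(forward_select([x1, x2], y))
```

Entered variables are never re-tested for removal, which is what distinguishes this sketch from the stepwise method. The sketch does not guard against a perfect fit (zero residual sum of squares) or too few cases.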

Backward Sequential
Can predictors be removed from an equation without hurting the prediction of Y? In other words, can a prediction equation be simplified?
Example: you know there is a set of predictors of a certain variable and you want to know whether any of them can be removed without weakening the prediction.
In SPSS, put all predictors in block 1 with method equals Enter; in block 2, list any variables you want removed with method equals Remove, and so on.
Statistical Backward Sequential
All variables are entered, and then each is tested against an exit criterion: its F probability is above a set value (p > .10).
The variable with the worst probability is then removed.
The remaining variables are re-evaluated given the new equation, and the next variable with the worst probability is then removed.
This continues until no remaining variable meets the exit criterion or all variables have been removed.
In SPSS this is setting method equals Backward.

Stepwise
(Purely Statistical Regression)
At each step of the analysis, variables are tested against both entry and exit criteria.
Starts with the intercept only, then tests all of the variables to see whether any meet the entry criterion.
Any that do enter the equation.
The next step tests un-entered variables against the entry criterion and entered variables against the exit criterion, and so on.
This cycles through adding and removing variables until none meet the entry or exit criteria.
Variables can be added or removed over and over, given the new state of the equation each time.
Considered a very post-hoc type of analysis and is not recommended.
Correlations and Effect size
Ballantine
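Regular Correlation (Zero Order, Pearson)
[Figure: Ballantine (Venn diagram) of Y, X1 and X2. Area a = Y shared with X1 only, b = Y shared with X2 only, c = Y shared with both X1 and X2, d = overlap of X1 and X2 outside Y, e = Y shared with neither]
$r_{y1}^2 = a + c$
$r_{y2}^2 = b + c$
$r_{12}^2 = c + d$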
Standard Regression
Partial Correlation
The correlation between Y and X1 with the influence of X2 removed from both (Y_res, X1_res).
In the Ballantine, the corresponding area is a/(a + e) for x1 and b/(b + e) for x2.
[Figure: Ballantine of Y, X1 and X2 with areas a, b, c, d, e]
Semipartial or Part Correlation
The correlation between Y and X1 with the influence of X2 removed from X1 only (Y, X1_res).
In the Ballantine, the corresponding area is a/(a + b + c + e) for x1 and b/(a + b + c + e) for x2.
[Figure: Ballantine of Y, X1 and X2 with areas a, b, c, d, e]
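As a rough numerical companion to the Ballantine areas, the sketch below computes zero-order, semipartial (part) and partial correlations for the family size, family income and credit-card data used in the worked example in these notes. The two-predictor formulas in `semipartial` and `partial` are the standard textbook expressions in terms of zero-order correlations; they are not written out in the notes, and all function names are invented for this illustration.

```python
from math import sqrt

x1 = [2, 2, 4, 4, 5, 5, 6, 6]          # family size
x2 = [14, 16, 14, 17, 18, 21, 17, 25]  # family income ($000)
y = [4, 6, 6, 7, 8, 7, 8, 10]          # number of credit cards


def pearson(u, v):
    """Zero-order (Pearson) correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


def semipartial(r_ya, r_yb, r_ab):
    """Correlation of Y with A after B has been removed from A only."""
    return (r_ya - r_yb * r_ab) / sqrt(1 - r_ab ** 2)


def partial(r_ya, r_yb, r_ab):
    """Correlation of Y with A after B has been removed from both Y and A."""
    return (r_ya - r_yb * r_ab) / sqrt((1 - r_yb ** 2) * (1 - r_ab ** 2))


r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
sr1, sr2 = semipartial(r_y1, r_y2, r_12), semipartial(r_y2, r_y1, r_12)
pr1, pr2 = partial(r_y1, r_y2, r_12), partial(r_y2, r_y1, r_12)

print("zero-order:", round(r_y1, 3), round(r_y2, 3), round(r_12, 3))
print("semipartial:", round(sr1, 3), round(sr2, 3))
print("partial:", round(pr1, 3), round(pr2, 3))
```

The squared semipartial for X2, multiplied by the total sum of squares of Y (22), comes to about 2.44, which agrees up to rounding with the unique contribution of X2 reported in the worked example.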
Semipartials and Bs
Bs and semipartials are very similar.
B is the amount of change in Y for every unit change in Xi, while controlling for the other Xs on Xi.
Semipartials are measures of the relationship between Y and Xi, controlling for the other Xs on Xi.
Sequential
Assuming x1 enters first:
The partial correlations would be (a + c)/(a + c + e) for x1 and unchanged for x2.
The part correlations would be (a + c)/(a + b + c + e) for x1 and unchanged for x2.
[Figure: Ballantine of Y, X1 and X2 with areas a, b, c, d, e]
Simple and Multiple Regression Analysis
Family Number (i)   Actual # of Credit Cards (y_i)   Family Size (x1)   Family Income, $000 (x2)
1                   4                                2                  14
2                   6                                2                  16
3                   6                                4                  14
4                   7                                4                  17
5                   8                                5                  18
6                   7                                5                  21
7                   8                                6                  17
8                   10                               6                  25
We now can attempt to estimate # of CCs from our information on family size and family income!
Our regression model will now be a linear plane, rather than a straight line!
Generic equation for a linear plane: $\hat{y} = a + b_1 x_1 + b_2 x_2$
Let's examine the regression plane for our example graphically.
$\hat{y} = a + b_1 x_1 + b_2 x_2$
Formulas are available for computing the values of a, b1 and b2.
MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
$\hat{y} = .482 + .63 x_1 + .216 x_2$
[Figure: regression plane. Vertical axis: Y = # of Credit Cards (0 to 12); horizontal axes: X1 = Family Size (0 to 8) and Family Income; points marked Actual and Regression Estimate]
Let's now see how much error in estimation we are committing by using this multiple regression model.
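The coefficients of the plane can be reproduced from the eight families' data. The sketch below uses the standard closed-form least-squares solution for two predictors, written in terms of sums of squares and cross-products of deviations; the helper names `mean` and `cross` are illustrative only and do not come from the notes.

```python
# Credit-card example data from the notes.
x1 = [2, 2, 4, 4, 5, 5, 6, 6]          # family size
x2 = [14, 16, 14, 17, 18, 21, 17, 25]  # family income ($000)
y = [4, 6, 6, 7, 8, 7, 8, 10]          # number of credit cards


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Sums of squares and cross-products needed by the two-predictor formulas.
s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)

det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / det
b2 = (s2y * s11 - s1y * s12) / det
a = mean(y) - b1 * mean(x1) - b2 * mean(x2)

print(round(a, 3), round(b1, 3), round(b2, 3))
```

To three decimals this gives an intercept of .482 and slopes of .632 and .216, in line with the model stated for the example.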
$\hat{y} = .482 + .63 x_1 + .216 x_2$
Columns: i = Family Number, y = Actual # of Credit Cards, x1 = Family Size, x2 = Family Income ($000), ŷ = Regression Estimate, y - ŷ = Error (Residual), (y - ŷ)^2 = Errors Squared
i    y    x1   x2   ŷ    y - ŷ   (y - ŷ)^2
1    4    2    14   ?    ?       ?
2    6    2    16   ?    ?       ?
3    6    4    14   ?    ?       ?
4    7    4    17   ?    ?       ?
5    8    5    18   ?    ?       ?
6    7    5    21   ?    ?       ?
7    8    6    17   ?    ?       ?
8    10   6    25   ?    ?       ?
$\sum (y - \hat{y})^2 = ?$
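$\hat{y} = .482 + .63 x_1 + .216 x_2$; for family 1, $\hat{y} = .482 + .63(2) + .216(14) = 4.77$
Columns: i = Family Number, y = Actual # of Credit Cards, x1 = Family Size, x2 = Family Income ($000), ŷ = Regression Estimate, y - ŷ = Error (Residual), (y - ŷ)^2 = Errors Squared
i    y    x1   x2   ŷ      y - ŷ   (y - ŷ)^2
1    4    2    14   4.77   -.77    .59
2    6    2    16   5.20   .80     .64
3    6    4    14   6.03   -.03    .00
4    7    4    17   6.68   .32     .10
5    8    5    18   7.53   .47     .22
6    7    5    21   8.18   -1.18   1.39
7    8    6    17   7.95   .05     .00
8    10   6    25   9.67   .33     .11
SSE = Sum of Squares Error (Residual) = $\sum (y - \hat{y})^2 = 3.05$
Unique (additional) contribution of X2 (family income): 5.5 - 3.05 = 2.45, where 5.5 is the SSE left when family size (X1) is the only predictor (22 - 16.5).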
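The residual table can be recomputed directly from the stated model. This sketch uses the rounded coefficients .482, .63 and .216 exactly as printed, so individual estimates may differ from the table in the second decimal (the table was presumably built from unrounded coefficients, which is an assumption on my part).

```python
x1 = [2, 2, 4, 4, 5, 5, 6, 6]          # family size
x2 = [14, 16, 14, 17, 18, 21, 17, 25]  # family income ($000)
y = [4, 6, 6, 7, 8, 7, 8, 10]          # actual number of credit cards

a, b1, b2 = 0.482, 0.63, 0.216  # rounded coefficients as printed in the notes

predicted = [a + b1 * size + b2 * income for size, income in zip(x1, x2)]
residuals = [actual - estimate for actual, estimate in zip(y, predicted)]
sse = sum(r ** 2 for r in residuals)

# One row per family: number, actual, estimate, residual, squared residual.
for i, (actual, estimate, resid) in enumerate(zip(y, predicted, residuals), start=1):
    print(i, actual, round(estimate, 2), round(resid, 2), round(resid ** 2, 2))
print("SSE =", round(sse, 2))
```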
Regression Formulas
The Total Sum of Squares (SST) is equal to SSR + SSE.
Mathematically,
$SSR = \sum (\hat{y} - \bar{y})^2$ (measure of explained variation)
$SSE = \sum (y - \hat{y})^2$ (measure of unexplained variation)
$SST = SSR + SSE = \sum (y - \bar{y})^2$ (measure of total variation in y)
The Coefficient of Determination
The proportion of total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, and is often referred to as R^2.
$R^2 = \frac{SSR}{SST} = \frac{SSR}{SSR + SSE}$
The value of R^2 can range between 0 and 1, and the higher its value, the more of the variation in y the regression model explains. It is often expressed as a percentage.
The MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
$\hat{y} = .482 + .63 x_1 + .216 x_2$
.482 = Y-intercept, a. (NOTE: Only when all Xs can meaningfully take on a value of zero will the intercept have a meaningful/direct/practical interpretation. Otherwise, it is simply an aid in increasing the accuracy of estimation.)
.63 and .216 = regression coefficients, b1 and b2.
0.63: Among families of the same income, an increase in family size by one person would, on average, result in .63 more credit cards.
0.216: Among families of the same size, an income increase of $1,000 results in an average increase of about 0.2 credit cards.
The bs represent the effect of each X on Y when all other Xs are controlled for/held constant/taken into account, i.e., after the impacts of all other variables are accounted for (remember the high blood pressure-hearing problem connection?).
The MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
$\hat{y} = .482 + .63 x_1 + .216 x_2$
SST = 22, SSE = 3.05
What is our new R^2?
SS Regression = 22 - 3.05 = 18.95
$R^2 = 18.95 / 22 = .861$, or 86%: the percent of differences in households' number of CCs that is explained by differences in family size and family income.
The remaining 14% (3.05 / 22 = .14)? The percent of variation in number of credit cards that can be accounted for by (a) all other relevant factors not included in the model, beyond family size and income, and (b) unexplainable random/chance variations.
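The arithmetic behind R^2 can be checked in a few lines. The sketch computes SST from the credit-card counts in the example and takes SSE = 3.05 from the residual table as given; the variable names are illustrative.

```python
y = [4, 6, 6, 7, 8, 7, 8, 10]  # actual number of credit cards per family

y_bar = sum(y) / len(y)
sst = sum((value - y_bar) ** 2 for value in y)  # total variation in y
sse = 3.05                                      # unexplained variation, from the residual table
ssr = sst - sse                                 # explained variation

r_squared = ssr / sst
unexplained_share = sse / sst

print("SST =", sst, "SSR =", round(ssr, 2))
print("R^2 =", round(r_squared, 3), "unexplained =", round(unexplained_share, 3))
```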
[Figure: Venn diagram of Y = # of CC, X1 = Family Size and X2 = Family Income, with areas a (Y shared with X1 only), b (Y shared with X2 only), c (Y shared with both X1 and X2) and d (Y shared with neither)]
Total Variation/Error in Y = SS Total = a + b + c + d = 22

Y with X1 = Family Size (not controlling for X2):
$\hat{y} = 2.87 + .97 X_1$
SSR = a + c = 16.5
$r^2 = (a + c) / (a + b + c + d) = 16.5 / 22 = 0.75$
What do we call the square root of this? The Pearson/simple correlation of Y with X1: $r_{yx_1} = \sqrt{0.75} \approx 0.867$

Y with X2 = Family Income (not controlling for X1):
$\hat{y} = -0.063 + .398 X_2$
SSR = c + b = 15.12
$r^2 = (b + c) / (a + b + c + d) = 15.12 / 22 = 0.687$
Pearson/simple correlation of Y with X2: $r_{yx_2} = \sqrt{0.687} \approx 0.829$
$\hat{y} = .482 + .63 x_1 + .216 x_2$
[Figure: Venn diagram of Y, X1 = Family Size and X2 = Family Income, with areas a, b, c and d]
R^2 graphically = ?
NOTE: c is explained by both X1 and X2.
SSR = a + b + c = 18.95
SST = a + b + c + d = 22
R^2 = SSR / SST = (a + b + c) / (a + b + c + d) = 18.95 / 22 = 86%
SSE = ?
SSE = d = 22 - 18.95 = 3.05

Remember: Unique (additional) contribution of X2 = 5.5 - 3.05 = 2.45

Testing hypothesis about P
We need to calculate a test statistic.
How many standard deviations have we deviated if the null hypothesis p = 0.20 was true?
$Z = (0.28 - 0.20) / \sqrt{(0.20)(1 - 0.20) / 100} = 2.0$
What is the likelihood of observing a Z = 2.0 or more extreme if the Government's figure was correct?
P-value = P[Z > 2.0] ≈ 0.023
How does this p-value compare with α = 0.05?
If p-value < α, then reject the null hypothesis H0 in favor of the alternative hypothesis Ha.
In this situation we reject the Government's figure.
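The one-sample test for a proportion can be reproduced with the standard library. The sketch takes the sample proportion 0.28, the hypothesized proportion 0.20 and n = 100 from the Z formula in the notes, and uses `statistics.NormalDist` for the upper-tail area; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p_hat = 0.28   # observed sample proportion
p0 = 0.20      # proportion under the null hypothesis
n = 100        # sample size
alpha = 0.05   # nominal level of significance

standard_error = sqrt(p0 * (1 - p0) / n)   # SE computed under H0
z = (p_hat - p0) / standard_error
p_value = 1 - NormalDist().cdf(z)          # one-sided: P[Z > z]

reject_null = p_value < alpha
print("Z =", round(z, 2), "p-value =", round(p_value, 4), "reject H0:", reject_null)
```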

Elements of Testing hypothesis
Null Hypothesis
Alternative hypothesis
Level of significance
Test statistic
P-value
Conclusion
Power of the test
Is there an association between Drinking and Lung Cancer?
What is the most appropriate and feasible study design in order to test the above research hypothesis?
Case Control Study of Smoking and
Lung Cancer
Null Hypothesis: There is no association between Smoking and Lung cancer, P1 = P2.
Alternative Hypothesis: There is some kind of association between Smoking and Lung cancer, P1 ≠ P2.
In the following contingency table, estimate the proportion and odds of drinkers among those who develop Lung Cancer and those without the disease.

                Lung Cancer
Drinker     Case        Control      Total
Yes         A = 33      B = 27       60
No          C = 1667    D = 2273     3940
Total       1700        2300         4000

P1 = 33/1700      P2 = 27/2300
Odds1 = 33/1667   Odds2 = 27/2273
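The proportions and odds from the table work out as follows. The final line also forms the ratio of the two odds (the odds ratio); that quantity is not computed in the notes and is included only as a natural summary of a case-control table.

```python
# Cell counts from the contingency table.
a, b = 33, 27        # drinkers among cases, drinkers among controls
c, d = 1667, 2273    # non-drinkers among cases, non-drinkers among controls

cases, controls = a + c, b + d

p1 = a / cases       # proportion of drinkers among cases
p2 = b / controls    # proportion of drinkers among controls
odds1 = a / c        # odds of drinking among cases
odds2 = b / d        # odds of drinking among controls
odds_ratio = odds1 / odds2  # addition for illustration, not in the notes

print("P1 =", round(p1, 4), "P2 =", round(p2, 4))
print("Odds1 =", round(odds1, 4), "Odds2 =", round(odds2, 4))
print("Odds ratio =", round(odds_ratio, 2))
```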
QUESTION: Is there a difference between the proportion of drinkers among cases and controls?
Group 1 (Disease): P1 = proportion of drinkers
Group 2 (No Disease): P2 = proportion of drinkers
Test Statistic
A statistical yardstick which is computed from the information contained in the sample under the assumption that the null hypothesis is true.
Knowledge about the sampling distribution of the test statistic is needed in determining the likelihood of observing extreme values for the test statistic.
P-value
An indicator which measures the likelihood of observing values as extreme as, or more extreme than, the one observed based on the sample information, assuming the null hypothesis is true.
P-value is also known as the observed level of significance.
The level of significance (α) is known as the nominal level of significance.
If p-value < α, then we reject the null hypothesis in favor of the alternative hypothesis.
Most statistical packages give the P-value in their computer output.
α needs to be pre-determined (usually 5%).
Type I and Type II errors
Type I error is committed when a true null hypothesis is rejected.
α is the probability of committing type I error.
Type II error is committed when a false null hypothesis is not rejected.
β is the probability of committing type II error.

Decision made about the validity of the null hypothesis
Null Hypothesis    Rejected        Not rejected
True               Type I Error    No Error
False              No Error        Type II Error
Power of a test
The power of a test is the probability that a false null hypothesis is rejected.
Power = 1 - β, where β is the probability of committing type II error.
More powerful tests are preferred. At the design stage one should identify the desired level of power in the given situation.
Factors influencing the Power
The power of a test is influenced by the magnitude of the difference between the null hypothesis and the true parameter.
The power of a test could be improved by increasing the sample size.
The power of a test could be improved by increasing α (this is a very artificial way).
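The three factors can be illustrated with a small power calculation for a one-sided test of a proportion under the normal approximation. This is a sketch under stated assumptions: the function name `power_one_sided` is invented, and treating 0.28 as the true proportion is purely hypothetical (in the earlier example it was only the observed sample proportion).

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided(p0, p_true, n, alpha=0.05):
    """Approximate power of the test H0: p = p0 vs Ha: p > p0 when the truth is p_true."""
    z = NormalDist()
    se_null = sqrt(p0 * (1 - p0) / n)          # SE used to set the rejection cutoff
    se_true = sqrt(p_true * (1 - p_true) / n)  # SE of the sample proportion under the truth
    cutoff = p0 + z.inv_cdf(1 - alpha) * se_null
    return 1 - z.cdf((cutoff - p_true) / se_true)


base = power_one_sided(0.20, 0.28, 100)
print("baseline power:", round(base, 3))
print("larger n:", round(power_one_sided(0.20, 0.28, 400), 3))
print("larger alpha:", round(power_one_sided(0.20, 0.28, 100, alpha=0.10), 3))
print("larger difference:", round(power_one_sided(0.20, 0.35, 100), 3))
```

Each of the last three calls changes one factor relative to the baseline, and each yields higher power.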
Minimum Required Sample Size
Usually a sample size calculation formula is available for most of the well-known study designs. Some software packages such as Epi-Info could also be utilized for the sample size calculation.
It is extremely important to consult a biostatistician at the design phase to ensure an adequate sample is considered for the study.
Testing hypothesis about one
population mean
H0: μ = 16 vs Ha: μ > 16
$Z = \frac{\text{sample mean} - \text{hypothesized mean}}{s / \sqrt{n}}$, where $s / \sqrt{n}$ is the SE of the Mean.
Under the null hypothesis and when n is large (n > 30), the distribution of Z is standard normal.
P-value
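A minimal sketch of the large-sample test for one mean follows, with the one-sided p-value taken from the standard normal distribution. The notes give only the hypotheses (μ = 16 against μ > 16), so the sample mean, standard deviation and sample size below are hypothetical values chosen for illustration, and `z_test_mean` is an invented helper name.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, s, n, mu0):
    """Return (Z, one-sided p-value) for H0: mu = mu0 vs Ha: mu > mu0."""
    standard_error = s / sqrt(n)               # SE of the mean
    z = (sample_mean - mu0) / standard_error
    p_value = 1 - NormalDist().cdf(z)          # P[Z > z] under H0
    return z, p_value


# Hypothetical sample: mean 17, standard deviation 3, n = 36 (large enough for n > 30).
z, p_value = z_test_mean(sample_mean=17.0, s=3.0, n=36, mu0=16.0)
print("Z =", round(z, 2), "p-value =", round(p_value, 4))
```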
