
By Hui Bian

Office for Faculty Excellence

Email: bianh@ecu.edu
Phone: 328-5428
Location: 2307 Old Cafeteria Complex

When we want to predict one variable from a combination of several variables.
When we want to determine which variables are better predictors than others.
When we want to compare models.

It is a model for the relationship between a dependent variable and a collection of independent variables.
According to the IBM SPSS manual, linear regression is used to model the value of a dependent scale variable based on its linear (straight-line) relationship to one or more predictors.

Regression Equation
Ypredicted = b0 + b1x1 + b2x2 + … + bpxp + e
Ypredicted: predicted score of the dependent variable
b0: intercept
p: number of predictors
b1-bp: weights (partial regression coefficients/slopes) for the predictors
x1-xp: scores of the predictors
e: error of prediction
Positive and negative regression weights reflect the direction of the correlation between a predictor and the dependent variable.

[Figure: scatterplot with a fitted regression line, labeling the intercept and slope on the X and Y axes.]

[Figure: three scatterplots illustrating a positive relationship, a negative relationship, and no relationship.]

The model is linear because increasing the value of the pth predictor by 1 unit increases the value of the dependent variable by bp units.
b0 is the intercept, the model-predicted value of the dependent variable when the value of every predictor is equal to 0.

We use the least squares criterion to estimate the parameters.
Least squares means that the sum of the squared errors of prediction is minimized.
Residual (error) = observed score of y − predicted score of y
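In standard notation (a restatement consistent with the regression equation above, not shown on the slides), the least squares criterion picks the coefficients that minimize the sum of squared residuals:

\min_{b_0,\dots,b_p} \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_{i1} + \dots + b_p x_{ip}) \right)^2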

The line best fits the data.
The vertical distance between an observed value of y and the line is the residual.

In the scatterplot, we have an independent (X) variable and a dependent (Y) variable.
Each point in the plot represents one case (one subject).
The goal of the linear regression procedure is to fit a line through the points.
SPSS computes the line so that the squared deviations of the observed points from that line are minimized.
This general procedure is sometimes also referred to as least squares estimation.


Normality
Linearity
Equal variance


For each value of the independent variable, the distribution of the dependent variable must be normal.
The variance of the distribution of the dependent variable should be constant for all values of the independent variable.
The relationship between the dependent variable and each independent variable should be linear, and all observations should be independent.

The error term has a normal distribution with a mean of 0.
The variance of the error term is constant across cases and independent of the variables in the model.

Multicollinearity
Moderate to high inter-correlations among the independent variables
It limits the size of R.
It makes the model unstable in terms of prediction.
It makes the significance of individual predictors hard to interpret.

Checking assumptions
Histogram of the standardized or studentized residuals (normality assumption)
Scatter plots of the dependent variable, standardized predicted values, standardized residuals, deleted residuals, adjusted predicted values, Studentized residuals, or Studentized deleted residuals

Scatter plots: plot the standardized residuals (*ZRESID) against the standardized predicted values (*ZPRED) to check for linearity and equality of variances.
From SPSS: Dependent variable, Standardized predicted values (*ZPRED), Standardized residuals (*ZRESID), Deleted residuals (*DRESID), Adjusted predicted values (*ADJPRED), Studentized residuals (*SRESID), Studentized deleted residuals (*SDRESID).
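These plots can also be requested in SPSS syntax on the REGRESSION command itself; a minimal sketch, with y, x1, and x2 as placeholder variable names:

REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).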

Plots from SPSS
[Screenshots of SPSS plot output.]

Regression coefficients determine the relative importance of the significant predictors when the effects of the other predictors are controlled.
Unstandardized regression coefficients (B): reflect raw-score values (different metrics).
Standardized regression coefficients (β): all the variables are measured on the same metric.

Squared multiple correlation (R2)
The model accounts for a certain amount of the variance of the dependent variable; that amount is R2.
Residual (prediction error)
The difference between the predicted value and the observed score of the dependent variable.
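In standard notation (not from the slides), both quantities can be written as:

e_i = y_i - \hat{y}_i, \qquad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}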

Dependent variable: criterion variable
Scale (interval or ratio)/quantitative variables
Independent variables: predictors or control variables
Continuous or categorical
Inclusion of variables in the model is based on theories and on empirical studies done by other researchers.

1 means the presence of something
0 means the absence of something, or the reference
Number of dummy variables = p − 1
p = number of levels of the nominal variable
Each dummy variable is dichotomous (0, 1)
The reference level is the baseline; the other levels are compared with it.

Exercise
One variable: a03 (race)
Recode a03 into three categories (White, Black, and Others) and create a new variable named a03r (1 = White, 2 = Black, 3 = Others)
Then recode a03r into two dummy variables
White is the reference category
The two new dummy variables are: Dummy1 (Black vs. White) and Dummy2 (Others vs. White)

Recode a03 into a03r
Response options for a03r: 1 = White, 2 = Black, 3 = Others

Recode a03 into a03r
Transform > Recode into Different Variables > highlight a03 and click the arrow button > type a03r > click Change
Click the Old and New Values button

Race     Dummy1   Dummy2
White    0        0
Black    1        0
Others   0        1

Dummy1: participants who are Black are coded 1; the other categories are coded 0.
Dummy2: participants who are Others are coded 1; the other categories are coded 0.

Transform > Recode into Different Variables > highlight a03r and click the arrow button > type Dummy1
Click the Old and New Values button

Follow the same process to create Dummy2.
You should end up with this window: [screenshot of the completed Recode dialog]
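The dummy coding can also be done in SPSS syntax; a minimal sketch, assuming a03r has already been created with the codes above (1 = White, 2 = Black, 3 = Others):

* Dummy1: Black vs. White; missing stays missing, all other categories coded 0.
RECODE a03r (2=1) (MISSING=SYSMIS) (ELSE=0) INTO Dummy1.
* Dummy2: Others vs. White.
RECODE a03r (3=1) (MISSING=SYSMIS) (ELSE=0) INTO Dummy2.
EXECUTE.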

Example: we want to determine whether several predictors have an effect on problems of drug use among drug users (used any of alcohol, cigarettes, or marijuana in the last 30 days: a28, a29, and a30), while controlling for the race variable (the two dummy variables).
Dependent variable: aAlcohol_Problem (total score: 0-17)

Independent variables: the two dummy variables plus
Frequency of marijuana use (a30: During the past 30 days, on how many days did you use marijuana? 1 = 0 days, 2 = 1-3 days, ..., 11 = 28-30 days)
Self-efficacy (a80r: How sure are you that you can avoid using alcohol, if offered by friends? 0 = Very sure, 1 = Somewhat sure to not sure)
Self-control (During the past 30 days, which of the following have you used to help you avoid or limit your alcohol, cigarette, or marijuana use? A total score ranging from 0 to 18; higher score = more self-control)
Peer norms (a93a: My friends think that it's okay for me to drink too much alcohol. 1 = Agree a lot, 2 = Agree, 3 = Disagree, 4 = Disagree a lot)

Regression model for our study
[Diagram: Self-efficacy, Self-control, Marijuana use, and Peer norms, plus an error term, predicting problems related to drug use.]

Enter: enters all independent variables in a single step.
Stepwise: enters one independent variable at a time. At each step, the program performs the following calculations: for each variable currently in the model, it computes an "F-to-remove" statistic; for each variable not in the model, it computes an "F-to-enter" statistic. At the next step, the program automatically enters the variable with the highest F-to-enter statistic, or removes the variable with the lowest F-to-remove statistic. Each predictor is continually reassessed.

Forward: enters one independent variable at each step, the variable with the largest simple correlation with the dependent variable. Once a variable is entered into the model, it remains in the model.
Backward: enters all independent variables, then removes non-significant variables from the model one at a time; at each step, the variable removed is the one whose loss least decreases R2.
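In REGRESSION syntax, the entry method is selected with the /METHOD subcommand; a minimal sketch, with y and x1-x3 as placeholder variable names:

* Replace ENTER with STEPWISE, FORWARD, or BACKWARD for the other methods.
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2 x3.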

Data screening
The purpose of data screening is to check the assumptions of the regression model.
Residual plots:
Used to check the constant variance assumption.
Standardized residuals (on the Y axis) versus standardized predicted values (on the X axis).
If there is no violation of assumptions, the standardized residuals should scatter randomly around a horizontal line at 0.
Histogram and normal P-P plot of standardized or studentized residuals:
Used to check the normality assumption.

Run the multiple regression analysis
First select cases; the condition is: a28>1 | a29>1 | a30>1
Then go to Analyze > Regression > Linear > put aAlcohol_Problem into Dependent > put a80r, a30, a93a, self-control, and the two dummy variables into Independent(s)

Click Statistics
Click Plots
Click Save
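The dialog steps above can be captured in syntax; a minimal sketch (SelfControl is a placeholder for the actual self-control variable name, and the Statistics/Plots/Save dialogs may request more options than shown here):

* Keep only drug users: any use of alcohol, cigarettes, or marijuana in the last 30 days.
USE ALL.
COMPUTE filter_$ = (a28>1 | a29>1 | a30>1).
FILTER BY filter_$.

REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL ZPP
  /DEPENDENT aAlcohol_Problem
  /METHOD=ENTER a30 a80r a93a SelfControl Dummy1 Dummy2
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID)
  /SAVE ZRESID.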

From the Descriptive Statistics table, we know that a total of 202 drug users were in the study. The average drug problem score is 3.47 (SD = 3.25).
The method used is Enter. This means that we entered all independent variables into the model simultaneously.

1. Model Summary
a. R is the Pearson correlation between the predicted and actual values of the dependent variable.
b. R2 is the squared multiple correlation coefficient, representing the amount of variance of the dependent variable explained by the combination of the six predictors: 14% of the variance of drug problems is explained by the six predictors.
c. Adjusted R2 is more conservative than R2.
2. ANOVA table
The significant F value, F(6, 195) = 5.18, p < .01, indicates that there is a significant relationship between drug problems and the six predictors.
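As a consistency check, the overall F statistic relates to R2 by a standard formula (not from the slides):

F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}

With n = 202 and p = 6, F(6, 195) = 5.18 corresponds to R2 = (6)(5.18) / [(6)(5.18) + 195] ≈ .137, which matches the 14% reported above.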

1. The regression equation is: Y = 2.275 + .258 Marijuana + 1.128 Self-efficacy + .088 Self-control − .457 Peer norms + .410 Dummy1 + .035 Dummy2
2. B is the unstandardized regression coefficient and Beta is the standardized regression coefficient.
3. The t test and its significance show the outcome for each independent variable.
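For illustration, a worked prediction with hypothetical predictor values (marijuana use = 2, self-efficacy = 1, self-control = 5, peer norms = 3, a Black participant, so Dummy1 = 1 and Dummy2 = 0):

\hat{Y} = 2.275 + .258(2) + 1.128(1) + .088(5) - .457(3) + .410(1) + .035(0) \approx 3.40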

Collinearity
Tolerance is the percentage of the variance in a given predictor that cannot be explained by the other predictors. When the tolerances are close to 0, there is high multicollinearity and the standard errors of the regression coefficients will be inflated.
A Variance Inflation Factor (VIF) greater than 2 is usually considered problematic (based on the SPSS manual).
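In standard notation (not from the slides), tolerance and VIF for predictor i are:

\text{Tolerance}_i = 1 - R_i^2, \qquad \text{VIF}_i = \frac{1}{\text{Tolerance}_i}

where R_i^2 is the R2 from regressing predictor i on all of the other predictors.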

[Figure: histogram of standardized residuals.]

Normal Q-Q plot
1. We want to know whether the distribution of errors matches a normal distribution.
2. If the selected variable matches the test distribution, the points cluster around a straight line.

Residual plot
1. Our residuals scatter randomly around 0.
2. The constant variance assumption is not violated.
3. The standardized residual of ID 1090 is 3.06.

[Figure: three example residual plot patterns.]
1. The first two residual plots suggest that the error variance changes with the independent variable.
2. Neither of these distributions shows a constant-variance pattern; therefore, there is a violation of the equal error variance assumption.
3. The last, horizontal-band pattern suggests that the variance of the residuals is constant.

Zero-order correlation
The simple bivariate correlation between an independent variable and the dependent variable.
Partial correlation
The correlation between an independent variable and the dependent variable after all other independent variables are controlled.
Part correlation
The correlation between an independent variable and the dependent variable, with the other independent variables partialled out of that independent variable only.
When squared, it represents the unique contribution of the independent variable to the model.
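A standard identity (not from the slides) makes the last point concrete: the squared part correlation of predictor i equals the drop in R2 when that predictor is removed from the full model:

sr_i^2 = R^2_{\text{full}} - R^2_{\text{without } i}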

Newly created variable: standardized residual ZRE_1
Run descriptive statistics on ZRE_1, e.g., using the Explore function
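In syntax, the Explore function corresponds to the EXAMINE command; a minimal sketch (NPPLOT requests the normal plots and the accompanying normality tests):

EXAMINE VARIABLES=ZRE_1
  /PLOT HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES.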


Explore results
The Kolmogorov-Smirnov test is based on a simple way to quantify the discrepancy between the observed and expected distributions. It turns out, however, that it is too simple and does not do a good job of discriminating whether or not your data were sampled from a Gaussian distribution. An expert on normality tests, R. B. D'Agostino, makes a very strong statement: "The Kolmogorov-Smirnov test is only a historical curiosity. It should never be used." (Tests for Normal Distribution, in Goodness-of-Fit Techniques, Marcel Dekker, 1986)

Run the previous analysis again using the stepwise method.
Analyze > Regression > Linear
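Only the /METHOD subcommand changes from the earlier run; a minimal sketch (SelfControl is again a placeholder name; the PIN/POUT values shown are the SPSS default entry and removal criteria):

REGRESSION
  /CRITERIA=PIN(.05) POUT(.10)
  /DEPENDENT aAlcohol_Problem
  /METHOD=STEPWISE a30 a80r a93a SelfControl Dummy1 Dummy2.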

1. This table lists how many models were fit in the process and shows which variable was entered and which was removed at each step.
2. No variable was removed at any step.

1. The Model Summary shows R2 for each model.
2. Sig. F Change tells us what contribution an extra IV makes when it is added to the model.

1. The ANOVA table shows the F value for each model.
2. Both models are significant (p < .05).
3. The final model has two predictors.

For self-efficacy, a high score means lower self-efficacy. The results show that drug users who used more marijuana and had lower self-efficacy were more likely to have drug use problems.

