
By Hui Bian

Office for Faculty Excellence

Email: bianh@ecu.edu
Phone: 328-5428
Location: 2307 Old Cafeteria Complex

When we want to predict one variable from a combination of several variables.
When we want to determine which variables are better predictors than others.
When we want to compare models.

It is a model for the relationship between a dependent variable and a collection of independent variables.
According to the IBM SPSS manual, linear regression is used to model the value of a dependent scale variable based on its linear (straight-line) relationship to one or more predictors.

Regression Equation
Ypredicted = b0 + b1x1 + b2x2 + … + bpxp + e
Ypredicted: predicted score of the dependent variable
b0: intercept
p: number of predictors
b1-bp: weights (partial regression coefficients/slopes) for the predictors
x1-xp: scores of the predictors
e: error of prediction
Positive and negative regression weights reflect the direction of the correlation between a predictor and the dependent variable.

[Figure: scatterplot with a fitted regression line, labeling the intercept and slope on the X and Y axes.]

[Figure: three scatterplots illustrating a positive relationship, a negative relationship, and no relationship.]

The model is linear because increasing the value of the pth predictor by 1 unit increases the value of the dependent variable by bp units.
b0 is the intercept, the model-predicted value of the dependent variable when the value of every predictor is equal to 0.

We use the least squares criterion to estimate the parameters.
Least squares means that the sum of the squared errors of prediction is minimized.
Residual (error) = observed score of y − predicted score of y
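In standard notation (a restatement consistent with the regression equation above, not shown on the slides), the least squares criterion picks the coefficients that minimize the sum of squared residuals:

\min_{b_0,\dots,b_p} \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_{i1} + \dots + b_p x_{ip}) \right)^2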

The line best fits the data.
The vertical distance between an observed value of y and the line is the residual.

In the scatterplot, we have an independent (X) variable and a dependent (Y) variable.
Each point in the plot represents one case (one subject).
The goal of the linear regression procedure is to fit a line through the points.
SPSS computes the line so that the squared deviations of the observed points from that line are minimized.
This general procedure is sometimes also referred to as least squares estimation.


Normality
Linearity
Equal variance


For each value of the independent variable, the distribution of the dependent variable must be normal.
The variance of the distribution of the dependent variable should be constant for all values of the independent variable.
The relationship between the dependent variable and each independent variable should be linear, and all observations should be independent.

The error term has a normal distribution with a mean of 0.
The variance of the error term is constant across cases and independent of the variables in the model.

Multicollinearity
Moderate to high inter-correlations among the independent variables
It limits the size of R.
It makes the model unstable in terms of prediction.
It makes the significance of individual predictors hard to interpret.

Checking assumptions
Histogram of the standardized or studentized residuals (normality assumption)
Scatter plots of the dependent variable, standardized predicted values, standardized residuals, deleted residuals, adjusted predicted values, Studentized residuals, or Studentized deleted residuals

Scatter plots: plot the standardized residuals (*ZRESID) against the standardized predicted values (*ZPRED) to check for linearity and equality of variances.
From SPSS: Dependent variable, Standardized predicted values (*ZPRED), Standardized residuals (*ZRESID), Deleted residuals (*DRESID), Adjusted predicted values (*ADJPRED), Studentized residuals (*SRESID), Studentized deleted residuals (*SDRESID).
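These plots can also be requested in SPSS syntax on the REGRESSION command itself; a minimal sketch, with y, x1, and x2 as placeholder variable names:

REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).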

Plots from SPSS
[Screenshots of SPSS plot output.]

Regression coefficients determine the relative importance of the significant predictors when the effects of the other predictors are controlled.
Unstandardized regression coefficients (B): reflect raw-score values (different metrics).
Standardized regression coefficients (β): all the variables are measured on the same metric.

Squared multiple correlation (R2)
The model accounts for a certain amount of the variance of the dependent variable; that amount is R2.
Residual (prediction error)
The difference between the predicted value and the observed score of the dependent variable.
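In standard notation (not from the slides), both quantities can be written as:

e_i = y_i - \hat{y}_i, \qquad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}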

Dependent variable: criterion variable
Scale (interval or ratio)/quantitative variables
Independent variables: predictors or control variables
Continuous or categorical
Inclusion of variables in the model is based on theories and on empirical studies done by other researchers.

1 means the presence of something
0 means the absence of something, or the reference
Number of dummy variables = p − 1
p = number of levels of the nominal variable
Each dummy variable is dichotomous (0, 1)
The reference level is the baseline; the other levels are compared with it.

Exercise
One variable: a03 (race)
Recode a03 into three categories (White, Black, and Others) and create a new variable named a03r (1 = White, 2 = Black, 3 = Others)
Then recode a03r into two dummy variables
White is the reference category
The two new dummy variables are: Dummy1 (Black vs. White) and Dummy2 (Others vs. White)

Recode a03 into a03r
Response options for a03r: 1 = White, 2 = Black, 3 = Others

Recode a03 into a03r
Transform > Recode into Different Variables > highlight a03 and click the arrow button > type a03r > click Change
Click the Old and New Values button

Race     Dummy1   Dummy2
White    0        0
Black    1        0
Others   0        1

Dummy1: participants who are Black are coded 1; the other categories are coded 0.
Dummy2: participants who are Others are coded 1; the other categories are coded 0.

Transform > Recode into Different Variables > highlight a03r and click the arrow button > type Dummy1
Click the Old and New Values button

Follow the same process to create Dummy2.
You should end up with this window: [screenshot of the completed Recode dialog]
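The dummy coding can also be done in SPSS syntax; a minimal sketch, assuming a03r has already been created with the codes above (1 = White, 2 = Black, 3 = Others):

* Dummy1: Black vs. White; missing stays missing, all other categories coded 0.
RECODE a03r (2=1) (MISSING=SYSMIS) (ELSE=0) INTO Dummy1.
* Dummy2: Others vs. White.
RECODE a03r (3=1) (MISSING=SYSMIS) (ELSE=0) INTO Dummy2.
EXECUTE.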

Example: we want to determine whether several predictors have an effect on problems of drug use among drug users (used any of alcohol, cigarettes, or marijuana in the last 30 days: a28, a29, and a30), while controlling for the race variable (the two dummy variables).
Dependent variable: aAlcohol_Problem (total score: 0-17)

Independent variables: the two dummy variables plus
Frequency of marijuana use (a30: During the past 30 days, on how many days did you use marijuana? 1 = 0 days, 2 = 1-3 days, ..., 11 = 28-30 days)
Self-efficacy (a80r: How sure are you that you can avoid using alcohol, if offered by friends? 0 = Very sure, 1 = Somewhat sure to not sure)
Self-control (During the past 30 days, which of the following have you used to help you avoid or limit your alcohol, cigarette, or marijuana use? A total score ranging from 0 to 18; higher score = more self-control)
Peer norms (a93a: My friends think that it's okay for me to drink too much alcohol. 1 = Agree a lot, 2 = Agree, 3 = Disagree, 4 = Disagree a lot)

Regression model for our study
[Diagram: Self-efficacy, Self-control, Marijuana use, and Peer norms, plus an error term, predicting problems related to drug use.]

Enter: enters all independent variables in a single step.
Stepwise: enters one independent variable at a time. At each step, the program performs the following calculations: for each variable currently in the model, it computes an "F-to-remove" statistic; for each variable not in the model, it computes an "F-to-enter" statistic. At the next step, the program automatically enters the variable with the highest F-to-enter statistic, or removes the variable with the lowest F-to-remove statistic. Each predictor is continually reassessed.

Forward: enters one independent variable at each step, the variable with the largest simple correlation with the dependent variable. Once a variable is entered into the model, it remains in the model.
Backward: enters all independent variables, then removes non-significant variables from the model one at a time; at each step, the variable removed is the one whose loss least decreases R2.
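In REGRESSION syntax, the entry method is selected with the /METHOD subcommand; a minimal sketch, with y and x1-x3 as placeholder variable names:

* Replace ENTER with STEPWISE, FORWARD, or BACKWARD for the other methods.
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2 x3.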

Data screening
The purpose of data screening is to check the assumptions of the regression model.
Residual plots:
Used to check the constant variance assumption.
Standardized residuals (on the Y axis) versus standardized predicted values (on the X axis).
If there is no violation of assumptions, the standardized residuals should scatter randomly around a horizontal line at 0.
Histogram and normal P-P plot of standardized or studentized residuals:
Used to check the normality assumption.

Run the multiple regression analysis
First select cases; the condition is: a28>1 | a29>1 | a30>1
Then go to Analyze > Regression > Linear > put aAlcohol_Problem into Dependent > put a80r, a30, a93a, self-control, and the two dummy variables into Independent(s)

Click Statistics
Click Plots
Click Save
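The dialog steps above can be captured in syntax; a minimal sketch (SelfControl is a placeholder for the actual self-control variable name, and the Statistics/Plots/Save dialogs may request more options than shown here):

* Keep only drug users: any use of alcohol, cigarettes, or marijuana in the last 30 days.
USE ALL.
COMPUTE filter_$ = (a28>1 | a29>1 | a30>1).
FILTER BY filter_$.

REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL ZPP
  /DEPENDENT aAlcohol_Problem
  /METHOD=ENTER a30 a80r a93a SelfControl Dummy1 Dummy2
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID)
  /SAVE ZRESID.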

From the Descriptive Statistics table, we know that a total of 202 drug users were in the study. The average drug problem score is 3.47 (SD = 3.25).
The method used is Enter. This means that we entered all independent variables into the model simultaneously.

1. Model Summary
a. R is the Pearson correlation between the predicted and actual values of the dependent variable.
b. R2 is the squared multiple correlation coefficient, representing the amount of variance of the dependent variable explained by the combination of the six predictors: 14% of the variance of drug problems is explained by the six predictors.
c. Adjusted R2 is more conservative than R2.
2. ANOVA table
The significant F value, F(6, 195) = 5.18, p < .01, indicates that there is a significant relationship between drug problems and the six predictors.
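As a consistency check, the overall F statistic relates to R2 by a standard formula (not from the slides):

F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}

With n = 202 and p = 6, F(6, 195) = 5.18 corresponds to R2 = (6)(5.18) / [(6)(5.18) + 195] ≈ .137, which matches the 14% reported above.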

1. The regression equation is: Y = 2.275 + .258 Marijuana + 1.128 Self-efficacy + .088 Self-control − .457 Peer norms + .410 Dummy1 + .035 Dummy2
2. B is the unstandardized regression coefficient and Beta is the standardized regression coefficient.
3. The t test and its significance show the outcome for each independent variable.
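For illustration, a worked prediction with hypothetical predictor values (marijuana use = 2, self-efficacy = 1, self-control = 5, peer norms = 3, a Black participant, so Dummy1 = 1 and Dummy2 = 0):

\hat{Y} = 2.275 + .258(2) + 1.128(1) + .088(5) - .457(3) + .410(1) + .035(0) \approx 3.40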

Collinearity
Tolerance is the percentage of the variance in a given predictor that cannot be explained by the other predictors. When the tolerances are close to 0, there is high multicollinearity and the standard errors of the regression coefficients will be inflated.
A Variance Inflation Factor (VIF) greater than 2 is usually considered problematic (based on the SPSS manual).
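In standard notation (not from the slides), tolerance and VIF for predictor i are:

\text{Tolerance}_i = 1 - R_i^2, \qquad \text{VIF}_i = \frac{1}{\text{Tolerance}_i}

where R_i^2 is the R2 from regressing predictor i on all of the other predictors.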

[Figure: histogram of standardized residuals.]

Normal Q-Q plot
1. We want to know whether the distribution of errors matches a normal distribution.
2. If the selected variable matches the test distribution, the points cluster around a straight line.

Residual plot
1. Our residuals scatter randomly around 0.
2. The constant variance assumption is not violated.
3. The standardized residual of ID 1090 is 3.06.

[Figure: three example residual plot patterns.]
1. The first two residual plots suggest that the error variance changes with the independent variable.
2. Neither of these distributions shows a constant-variance pattern; therefore, there is a violation of the equal error variance assumption.
3. The last, horizontal-band pattern suggests that the variance of the residuals is constant.

Zero-order correlation
The simple bivariate correlation between an independent variable and the dependent variable.
Partial correlation
The correlation between an independent variable and the dependent variable after all other independent variables are controlled.
Part correlation
The correlation between an independent variable and the dependent variable, with the other independent variables partialled out of that independent variable only.
When squared, it represents the unique contribution of the independent variable to the model.
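A standard identity (not from the slides) makes the last point concrete: the squared part correlation of predictor i equals the drop in R2 when that predictor is removed from the full model:

sr_i^2 = R^2_{\text{full}} - R^2_{\text{without } i}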

Newly created variable: standardized residual ZRE_1
Run descriptive statistics on ZRE_1, e.g., using the Explore function
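In syntax, the Explore function corresponds to the EXAMINE command; a minimal sketch (NPPLOT requests the normal plots and the accompanying normality tests):

EXAMINE VARIABLES=ZRE_1
  /PLOT HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES.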


Explore results
The Kolmogorov-Smirnov test is based on a simple way to quantify the discrepancy between the observed and expected distributions. It turns out, however, that it is too simple and does not do a good job of discriminating whether or not your data were sampled from a Gaussian distribution. An expert on normality tests, R. B. D'Agostino, makes a very strong statement: "The Kolmogorov-Smirnov test is only a historical curiosity. It should never be used." (Tests for Normal Distribution, in Goodness-of-Fit Techniques, Marcel Dekker, 1986)

Run the previous analysis again using the stepwise method.
Analyze > Regression > Linear
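Only the /METHOD subcommand changes from the earlier run; a minimal sketch (SelfControl is again a placeholder name; the PIN/POUT values shown are the SPSS default entry and removal criteria):

REGRESSION
  /CRITERIA=PIN(.05) POUT(.10)
  /DEPENDENT aAlcohol_Problem
  /METHOD=STEPWISE a30 a80r a93a SelfControl Dummy1 Dummy2.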

1. This table lists how many models were fit in the process and shows which variable was entered and which was removed at each step.
2. No variable was removed at any step.

1. The Model Summary shows R2 for each model.
2. Sig. F Change tells us what contribution an extra IV makes when it is added to the model.

1. The ANOVA table shows the F value for each model.
2. Both models are significant (p < .05).
3. The final model has two predictors.

For self-efficacy, a high score means lower self-efficacy. The results show that drug users who used more marijuana and had lower self-efficacy were more likely to have drug use problems.

