
Regression

A goal of science is the prediction and explanation of phenomena
To do so we must find events that are related in some way, such that knowledge about one leads to knowledge about the other
In psychology we seek to understand the relationships among variables that serve as indicators of an innumerable amount of information about human nature, in order to better understand ourselves and why we do the things we do

While we could just use our N of 1 personal experience to try to understand human behavior, a scientific (and better) means of understanding the relationship between variables is to assess correlation
Two variables take on different values, but if they are related in some fashion they will covary
They may do so in a way in which their values tend to move in the same direction, or they may tend to move in opposite directions
The underlying statistic assessing this is covariance, which is at the heart of every statistical procedure you are likely to use inferentially

Covariance as a statistical construct is unbounded and thus difficult to interpret in its raw form
Correlation (Pearson's r) is a measure of the direction and degree of a linear association between two variables
Correlation is the standardized covariance between two variables

$$\text{cov}(x,y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

$$r_{xy} = \frac{\text{cov}(x,y)}{s_x s_y}, \qquad -1 \le r \le 1$$

$$r_{xy} = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n-1}$$
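As a quick check of these formulas, here is a minimal sketch in Python (numpy, with made-up data) computing the covariance, the standardized covariance, and the z-score form of r:

```python
import numpy as np

# Hypothetical example data (any two paired samples would do)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 5.0, 4.0])
n = len(x)

# Covariance: sum of cross-products of deviations, divided by n - 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Correlation: covariance divided by the product of the two standard
# deviations (ddof=1 gives the n - 1 version of the SD)
r_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Equivalent: the average cross-product of z-scores (again over n - 1)
z_x = (x - x.mean()) / x.std(ddof=1)
z_y = (y - y.mean()) / y.std(ddof=1)
r_from_z = np.sum(z_x * z_y) / (n - 1)

print(cov_xy, r_xy, r_from_z)       # r_xy and r_from_z agree
print(np.corrcoef(x, y)[0, 1])      # matches numpy's built-in result
```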

Regression allows us to use the information about covariance to make predictions
Given a particular value of X, we can predict Y with some level of accuracy
The basic model is that of a straight line (the general linear model)
Only one possible straight line can be drawn once the slope and Y-intercept are specified
The formula for a straight line is:
Y = bX + a

Y = the calculated value for the variable on the vertical axis


a = the intercept
b = the slope of the line
X = a value for the variable on the horizontal axis

Once this line is specified, we can calculate the corresponding value of Y for any value of X entered
In more general terms, Y = Xb + e, where these elements represent vectors and/or matrices (of the outcome, data, coefficients, and error respectively), is the general linear model to which most of the techniques in psychological research adhere
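A minimal sketch of that matrix form, assuming simulated data and using numpy's least-squares solver (the column of ones in X carries the intercept):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2 + 0.5*x plus random error
x = rng.normal(size=100)
e = rng.normal(scale=0.3, size=100)
y = 2 + 0.5 * x + e

# Design matrix X: a column of ones (intercept) plus the predictor
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution for b in y = Xb + e
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # approximately [2.0, 0.5] -> the intercept and the slope
```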

Real data do not conform perfectly to a straight line
The best-fit straight line is the one that minimizes the amount of variation of the data points from the line
The common, but by no means only acceptable, method derives a least-squares regression line, which minimizes the squared deviations from it

The equation for this line can be used to predict or estimate an individual's score on Y on the basis of his or her score on X

$$\hat{Y} = bX + a$$

When the relations between variables are expressed in this manner, we call the relevant equation(s) mathematical models
The intercept and weight values are called the parameters of the model
While typical regression analysis by itself does not determine causal relations, the assumption indicated by such a model is that the variable on the left-hand side of the previous equation is being caused by the variable(s) on the right side
The arrows explicitly go from the predictors to the outcome, not vice versa*

[Path diagram: Variables X, Y, and Z with arrows pointing to the Criterion]

The process of obtaining the correct parameter values (assuming we are working with the right model) is called parameter estimation
Often, theories specify the form of the relationship rather than the specific values of the parameters
The parameters themselves, assuming the basic model is correct, are typically estimated from data
We refer to this estimation process as calibrating the model

A method is required for choosing parameter values that will give us the best representation of the data possible
In estimating the parameters of our model, we are trying to find a set of parameters that minimizes the error variance
With least-squares estimation, we want

$$\frac{\sum (Y - \hat{Y})^2}{N}$$

to be as small as it possibly can be

Estimating the slope (the regression coefficient) requires first estimating the covariance

$$b = \frac{\text{cov}(X,Y)}{\text{var}(X)}$$

Estimating the Y-intercept

$$a = \bar{Y} - b\bar{X}$$

where $\bar{Y}$ and $\bar{X}$ are the means of the Y and X values respectively, and b is the estimated slope
These calculations ensure that the regression line passes through the point on the scatterplot defined by the two means

Alternatively, the slope

$$b = r\frac{s_y}{s_x}$$

so, by substituting, we get

$$\hat{Y} = r\frac{s_y}{s_x}X + \bar{Y} - r\frac{s_y}{s_x}\bar{X}$$
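A short sketch (numpy, hypothetical data) confirming that the two slope formulas agree and that the intercept makes the line pass through the two means:

```python
import numpy as np

# Hypothetical paired data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 5.0, 4.0])

# Slope from the covariance over the variance of X
cov_xy = np.cov(x, y, ddof=1)[0, 1]
b = cov_xy / np.var(x, ddof=1)

# Intercept from the two means: the line passes through (x-bar, y-bar)
a = y.mean() - b * x.mean()

# Alternative form of the slope: r times the ratio of standard deviations
r = np.corrcoef(x, y)[0, 1]
b_alt = r * y.std(ddof=1) / x.std(ddof=1)

print(b, b_alt)              # the two slope formulas agree
print(a + b * x.mean(), y.mean())   # predicted value at x-bar equals y-bar
```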

Total variability in the dependent variable (around its observed mean) comes from two sources
Variability predicted by the model, i.e. variability in the dependent variable that is due to the independent variable
How far our predicted values are from the mean of Y

Error or residual variability, i.e. variability not explained by the independent variable
The difference between the predicted values and the observed values

$$s^2_Y = s^2_{\hat{Y}} + s^2_{(Y - \hat{Y})}$$

Total variance = predicted variance + error variance

We can also show this graphically using a Venn diagram, with r² as the proportion of variability shared by the two variables (X and Y)
The larger the area of overlap, the greater the strength of the association between the two variables

The square of the correlation, r², is the fraction of the variation in the values of Y that is explained by the regression of Y on X
r² = variance of the predicted values ŷ divided by the variance of the observed values y

$$s^2_{\hat{y}} = \frac{\sum (\hat{y} - \bar{y})^2}{n-1}$$

$$s^2_{\hat{y}} = r^2 s^2_y \qquad\Longrightarrow\qquad r^2 = \frac{s^2_{\hat{y}}}{s^2_y}$$

How good a fit does our line represent?

The error associated with a prediction (of a Y value from a known X value) is a function of the deviations of Y about the predicted point
The standard error of estimate provides an assessment of the accuracy of prediction
It is the standard deviation of Y predicted from X

In terms of R², we can see that the more variance we account for, the smaller our standard error of estimate will be

$$s_{Y \cdot X} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{df_{\text{residual}}}} = \sqrt{\frac{SS_{\text{residual}}}{df_{\text{residual}}}}$$

$$s_{Y \cdot X} = s_Y \sqrt{(1 - R^2)\frac{N-1}{N-2}}$$
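A brief sketch (numpy, hypothetical data) showing that the residual-based and R²-based forms of the standard error of estimate give the same value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 5.0, 4.0])
n = len(x)

# Fit the simple regression line
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
y_hat = a + b * x

# Standard error of estimate from the residual sum of squares
ss_residual = np.sum((y - y_hat) ** 2)
see_direct = np.sqrt(ss_residual / (n - 2))

# Equivalent form based on R-squared and the SD of Y
r2 = np.corrcoef(x, y)[0, 1] ** 2
see_from_r2 = y.std(ddof=1) * np.sqrt((1 - r2) * (n - 1) / (n - 2))

print(see_direct, see_from_r2)   # the two forms agree
```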

Intercept
Value of Y if X is 0
Often not meaningful, particularly if it's practically impossible to have an X of 0 (e.g. weight)

Slope

Amount of change in Y seen with 1 unit change in X

Standardized regression coefficient

Amount of change in Y seen in standard deviation units

with 1 standard deviation unit change in X


In simple regression it is equivalent to the r for the two
variables

Standard error of estimate

Gives a measure of the accuracy of prediction

R2

Proportion of variance explained by the model
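All of the quantities just listed can be read off a simple regression fit; a minimal sketch using scipy.stats.linregress on hypothetical data (note that linregress's stderr is the standard error of the slope, not the standard error of estimate):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 5.0, 4.0])

fit = stats.linregress(x, y)

print(fit.intercept)       # value of Y when X = 0
print(fit.slope)           # change in Y per 1-unit change in X
print(fit.rvalue)          # r; in simple regression this is also the
                           # standardized regression coefficient
print(fit.rvalue ** 2)     # R-squared: proportion of variance explained
print(fit.stderr)          # standard error of the slope (not of estimate)
```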

The General Linear Model with Categorical Predictors

Regression can actually handle different types of predictors, and in the social sciences we are often interested in differences between groups
For now we will concern ourselves with the two-independent-groups case
E.g. gender, Republican vs. Democrat, etc.

There are different ways to code categorical data for regression; in general, to represent a categorical variable you need k − 1* coded variables
k = number of categories/groups

Dummy coding involves using zeros and ones to identify group membership, and since we only have two groups, one group will be coded 0 (the reference group) and the other 1
We will revisit coding with k > 2 after we've discussed multiple regression

Example

The thing to note at this point is that we have a simple bivariate correlation/simple regression setting
The correlation between group and the DV is .76
This is sometimes referred to as the point-biserial correlation (r_pb) because of the categorical variable
However, don't be fooled: it is calculated exactly the same way as before, i.e. you treat that 0/1 grouping variable like any other in calculating the correlation coefficient

Group  DV
0      3
0      5
0      7
0      2
0      3
1      6
1      7
1      7
1      8
1      9

Graphical display

The R-square is .76² = .577
The regression equation is

$$\hat{Y} = 4 + 3.4X$$
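A sketch reproducing these numbers from the group/DV data above, treating the 0/1 dummy code like any other predictor (scipy is assumed to be available):

```python
import numpy as np
from scipy import stats

# The 0/1 group codes and DV scores from the table above
group = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
dv    = np.array([3, 5, 7, 2, 3, 6, 7, 7, 8, 9], dtype=float)

# Treat the dummy code like any other predictor
fit = stats.linregress(group, dv)

print(fit.rvalue)        # about .76 -- the point-biserial correlation
print(fit.rvalue ** 2)   # about .577 -- R-squared
print(fit.intercept)     # 4.0 = mean of the reference (0) group
print(fit.slope)         # 3.4 = difference between group means (7.4 - 4.0)
```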

Look closely at the descriptive output compared to the coefficients
What do you see?
Descriptive Statistics
group   dv Mean   dv Std. Deviation
.00     4.0000    2.00000
1.00    7.4000    1.14018

Coefficients (Dependent Variable: dv)
Model 1      B       Std. Error   Beta    t       Sig.    95% CI for B (Lower, Upper)
(Constant)   4.000   .728                 5.494   .001    (2.321, 5.679)
group        3.400   1.030        .760    3.302   .011    (1.026, 5.774)

Note again our regression equation
Recall the definitions of the slope and the constant
First the constant: what does "when X = 0" mean in this setting?
It means we are in the 0 group
What is that value?

Y = 4, which is that group's mean

The constant here is thus the reference group's mean

$$\hat{Y} = 4 + 3.4X$$

[Descriptive Statistics and Coefficients tables repeated from the previous slides]

Now think about the slope
What does a 1-unit change in X mean in this setting?
It means we go from one group to the other
Based on that, what does the slope represent in this case (i.e. can you derive that coefficient from the descriptive stats in some way)?
The coefficient is the difference between the means

$$\hat{Y} = 4 + 3.4X$$

[Descriptive Statistics and Coefficients tables repeated from the previous slides]

The regression line covers the values represented, i.e. 0 and 1 for the two groups
It passes through each of their means
Using least-squares regression, the regression line always passes through the mean of X and the mean of Y

The constant (if we are using dummy coding) is the mean of the zero (reference) group
The coefficient is the difference between the means

Analysis of variance
Recall that in regression we are trying to
account for the variance in the DV
That total variance reflects the sum of the
squared deviations of values from the DV
mean

Sums of squares

That breaks down into:
Variance we account for
Sums of squares predicted (also called model or regression sums of squares)

And variance that we do not account for
Sums of squares error (observed − predicted)

What are our predicted values in this case?
We only have two values of X to plug in
We already know what Y is if X is zero, so we'd predict the group mean of 4 for all zero values: Ŷ = 4 + 3.4(0) = 4

The only other value to plug in is 1 for the rest of the cases
In other words, for those in the 1 group, we're predicting their respective mean: Ŷ = 4 + 3.4(1) = 7.4

So in order to get our model summary and F-statistic, we need:
Total variance
Predicted variance
Predicted value minus the grand mean of the DV, just as it has always been
Note again how our average predicted value is our group average for the DV

Error variance
Essentially each person's score minus their group mean

Predicted SS = 5[(4 − 5.7)² + (7.4 − 5.7)²] = 28.9

Error SS = (3 − 4)² + (5 − 4)² + … + (9 − 7.4)² = 21.2

Total variability to be accounted for = (3 − 5.7)² + (5 − 5.7)² + … + (9 − 5.7)² = 50.1
Or just Predicted SS + Error SS
Calculate R² from these values
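A quick check of these sums of squares (a sketch in numpy using the same ten scores):

```python
import numpy as np

group = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
dv    = np.array([3, 5, 7, 2, 3, 6, 7, 7, 8, 9], dtype=float)

grand_mean = dv.mean()                     # 5.7
y_hat = np.where(group == 0, 4.0, 7.4)     # predicted value = each group's mean

ss_predicted = np.sum((y_hat - grand_mean) ** 2)   # 28.9
ss_error     = np.sum((dv - y_hat) ** 2)           # 21.2
ss_total     = np.sum((dv - grand_mean) ** 2)      # 50.1

r_squared = ss_predicted / ss_total                # about .577
print(ss_predicted, ss_error, ss_total, r_squared)
```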

Here is the summary table from our regression
The mean square is derived by dividing our sums of squares by the degrees of freedom
Regression df = k − 1
Total df = N − 1
Error df = N − k

The ratio of the mean squares is the F-statistic

ANOVA (Dependent Variable: dv; Predictors: (Constant), group)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   28.900           1    28.900        10.906   .011
Residual     21.200           8    2.650
Total        50.100           9

Note the title of the summary table: ANOVA

It is an ANOVA summary table because you have in fact just conducted an analysis of variance, specifically for the two-group situation
ANOVA, the statistical procedure as it is so called, is a special case of regression
Below, the first table is the regression output and the second is the output from the ANOVA procedure.

[ANOVA table repeated from the previous slide]

Tests of Between-Subjects Effects (Dependent Variable: dv)
Source            Type III Sum of Squares   df   Mean Square   F        Sig.   Partial Eta Squared
group             28.900                    1    28.900        10.906   .011   .577
Error             21.200                    8    2.650
Total             375.000                   10
Corrected Total   50.100                    9

Note the partial eta-squared

Eta-squared has the same interpretation as R-squared and, as one can see, it is the R-squared from our regression
SPSS calls it partial because there is often more than one grouping variable, and we are interested in unique effects (i.e. partialling out the effects of other variables)
However, it is actually just eta-squared here, as there is no other variable effect to partial out

[ANOVA and Tests of Between-Subjects Effects tables repeated from the previous slides]

The t-test is a special case of ANOVA

ANOVA can handle more than two groups, while the t-test is just for two
However, F = t² in the two-group setting, and the p-value is exactly the same
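A brief sketch using scipy.stats on the two groups from this example, showing that t² equals F and that the p-values match:

```python
import numpy as np
from scipy import stats

g0 = np.array([3, 5, 7, 2, 3], dtype=float)   # reference (0) group
g1 = np.array([6, 7, 7, 8, 9], dtype=float)   # the 1 group

t_res = stats.ttest_ind(g0, g1)    # independent-samples t-test
f_res = stats.f_oneway(g0, g1)     # one-way ANOVA on the same two groups

print(t_res.statistic ** 2, f_res.statistic)   # t^2 equals F (about 10.9)
print(t_res.pvalue, f_res.pvalue)              # identical p-values (about .011)
```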

Independent Samples Test (t-test for Equality of Means)
dv (equal variances assumed):  t = −3.302,  df = 8,  Sig. (2-tailed) = .011,  Mean Difference = −3.40000,  Std. Error Difference = 1.02956,  95% CI of the Difference [−5.77418, −1.02582]

[Tests of Between-Subjects Effects table repeated from above for comparison: F = 10.906, Sig. = .011]

Compare to the regression output
The t, standard error, confidence interval, and p-value are the same, and again the coefficient is the difference between the means
[Independent Samples Test and Coefficients tables repeated from the previous slides]

Statistics is a language used for communicating research ideas and findings
We have various dialects with which to speak it, and of course we pick freely from the words available
Sometimes we prefer to do regression and talk about the amount of variance accounted for
Sometimes we prefer to talk about mean differences and how large those are

In both cases we are interested in the effect size

Which tool we use reflects how we want to talk about our results

Let's assume that we believe there is a linear relationship between X and Y
Which set of parameter values will bring us closest to representing the data accurately?

$$\hat{Y} = 2 - 2X$$

We begin by picking some values, plugging them into the equation, and seeing how well the implied values correspond to the observed values
We can quantify what we mean by "how well" by examining the difference between the model-implied Ŷ and the actual Y value
This difference between our observed value and the one predicted, y − ŷ, is often called the error in prediction, or the residual

$$\hat{Y} = 2 - 1X$$

Let's try a different value of b and see what happens
Now the implied values of Y are getting closer to the actual values of Y, but we're still off by quite a bit

$$\hat{Y} = 2 + 0X$$

Things are getting better, but certainly things could improve

$$\hat{Y} = 2 + 1X$$

Ah, much better

$$\hat{Y} = 2 + 2X$$

Now that's very nice
There is a perfect correspondence between the predicted values of Y and the actual values of Y
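A minimal sketch of this trial-and-error search, assuming hypothetical data generated to fall exactly on Y = 2 + 2X so that the last candidate slope fits perfectly:

```python
import numpy as np

# Hypothetical data that fall exactly on the line Y = 2 + 2X,
# so one of the candidate slopes below will fit perfectly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 + 2 * x

a = 2                                  # hold the intercept fixed at 2
for b in (-2, -1, 0, 1, 2):            # candidate slope values
    y_hat = a + b * x                  # model-implied values of Y
    residuals = y - y_hat              # error in prediction
    sse = np.sum(residuals ** 2)       # sum of squared residuals
    print(f"b = {b:+d}  ->  SSE = {sse:.1f}")
# SSE shrinks as b moves from -2 toward 2 and reaches 0 at b = 2
```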
